From f432deecdf1377455280623be2322250b0d9f7f0 Mon Sep 17 00:00:00 2001 From: Mohd Kaif <98801504+KaifAhmad1@users.noreply.github.com> Date: Tue, 20 Feb 2024 20:23:57 +0530 Subject: [PATCH] ValueError: OLMoForCausalLM does not support Flash Attention 2.0 yet #29145 (#1) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Add qwen2 (#29145) * add config, modeling, and tokenization * add auto and init * update readme * update readme * update team name * fixup * fixup * update config * update code style * update for fixup * update for fixup * update for fixup * update for testing * update for testing * fix bug for config and tokenization * fix bug for bos token * not doctest * debug tokenizer * not doctest * debug tokenization * debug init for tokenizer * fix style * update init * delete if in token auto * add tokenizer doc * add tokenizer in init * Update dummy_tokenizers_objects.py * update * update * debug * Update tokenization_qwen2.py * debug * Update convert_slow_tokenizer.py * add copies * add copied from and make style * update files map * update test * fix style * fix merge reading and update tests * fix tests * fix tests * fix style * debug a variable in readme * Update src/transformers/models/qwen2/configuration_qwen2.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * update test and copied from * fix style * update qwen2 tokenization and tests * Update tokenization_qwen2.py * delete the copied from after property * fix style * update tests * update tests * add copied from * fix bugs * update doc * add warning for sliding window attention * update qwen2 tokenization * fix style * Update src/transformers/models/qwen2/modeling_qwen2.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * fix tokenizer fast --------- Co-authored-by: Ren Xuancheng Co-authored-by: renxuancheng.rxc Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Fix SDPA tests (#28552) * skip bf16 test if not supported by device * fix * fix bis * use is_torch_bf16_available_on_device * use is_torch_fp16_available_on_device * fix & use public llama * use 1b model * fix flaky test --------- Co-authored-by: Your Name * Allow to train dinov2 with different dtypes like bf16 (#28504) I want to train dinov2 with bf16 but I get the following error in https://github.com/huggingface/transformers/blob/bc72b4e2cdcbc80d5f56731f35dbc9c18b4c8de6/src/transformers/models/dinov2/modeling_dinov2.py#L635: ``` RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same ``` Since the input dtype is torch.float32, the parameter dtype has to be torch.float32... @LZHgrla and I checked the code of the CLIP vision encoder and found there is an automatic dtype transformation (https://github.com/huggingface/transformers/blob/bc72b4e2cdcbc80d5f56731f35dbc9c18b4c8de6/src/transformers/models/clip/modeling_clip.py#L181-L182). So I added a similar automatic dtype transformation to modeling_dinov2.py. * Fix Switch Transformers When sparse_step = 1 (#28564) Fix sparse_step = 1 In case sparse_step = 1, the current code will not work.
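A minimal sketch of the dtype alignment described in the dinov2 commit above, assuming a patch-embedding module like the one in `modeling_dinov2.py` (the class and argument names here are illustrative, not the actual diff): the input is cast to the projection weight's dtype, mirroring the CLIP vision encoder pattern the commit links to.

```python
import torch
import torch.nn as nn


class PatchEmbeddings(nn.Module):
    """Illustrative patch embedding that tolerates a parameter dtype != input dtype."""

    def __init__(self, num_channels: int = 3, hidden_size: int = 768, patch_size: int = 14):
        super().__init__()
        self.projection = nn.Conv2d(num_channels, hidden_size, kernel_size=patch_size, stride=patch_size)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Cast the input to the weight dtype (e.g. bfloat16) before the convolution,
        # avoiding "Input type (float) and bias type (c10::BFloat16) should be the same".
        target_dtype = self.projection.weight.dtype
        embeddings = self.projection(pixel_values.to(dtype=target_dtype))
        return embeddings.flatten(2).transpose(1, 2)


# A float32 image batch now works with a module converted to bfloat16.
module = PatchEmbeddings().to(torch.bfloat16)
patches = module(torch.randn(1, 3, 224, 224))  # shape (1, 256, 768), dtype bfloat16
```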
* Save `Processor` (#27761) * save processor * Update tests/models/auto/test_processor_auto.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Update tests/test_processing_common.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * fix --------- Co-authored-by: ydshieh Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Use `weights_only` only if torch >= 1.13 (#28506) * fix * fix * fix --------- Co-authored-by: ydshieh * [`Core Tokenization`] Support a fix for spm fast models (#26678) * fix * last attempt * current work * fix forward compatibility * save all special tokens * current state * revert additional changes * updates * remove tokenizer.model * add a test and the fix * nit * revert one more break * fix typefield issue * quality * more tests * fix fields for FC * more nits? * new additional changes * how * some updates * the fix * where do we stand * nits * nits * revert unrelated changes * nits nits nits * styling * don't break llama just yet * revert llama changes * safe arg check * fixup * Add a test for T5 * Necessary changes * Tests passing, added tokens need to not be normalized. If the added tokens are normalized, it will the stripping which seems to be unwanted for a normal functioning * Add even more tests, when normalization is set to True (which does not work :sweat: ) * Add even more tests, when normalization is set to True (which does not work :sweat: ) * Update to main * nits * fmt * more and more test * comments * revert change as tests are failing * make the test more readble * nits * refactor the test * nit * updates * simplify * style * style * style convert slow * Update src/transformers/convert_slow_tokenizer.py * chore: Fix multiple typos (#28574) * Add new meta w2v2-conformer BERT-like model (#28165) * first commit * correct default value non causal * update config and modeling code * update converting checkpoint * clean modeling and fix tests * make style * add new config parameters to docstring * fix copied from statements * Apply suggestions from code review Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> * make position_embeddings_type docstrings clearer * clean converting script * remove function not used * clean modeling file * apply suggestion for test file + add convert script to not_doctested * modify tests according to review - cleaner logic and more tests * Apply nit suggestions from code review Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * add checker of valid position embeddings type * instantiate new layer norm layer with the right eps * fix freeze_feature_encoder since it can be None in some cases * add test same output in convert script * restore wav2vec2conformer and add new model * create processor and FE + clean * add new model code * fix convert script and set default config parameters * correct model id paths * make style * make fix-copies and cleaning files * fix copied from statements * complete .md and fixe copies * clean convert script argument defaults * fix config parameters docstrings * fix config docstring * add copied from and enrich FE tests * fix copied from and repo-consistency * add autotokenizer * make test input length shorter and change docstring code * fix docstrings and copied from * add add_adapter to ASR training example * make testing of adapters more robust * adapt to multi adapter layers * refactor input_values->input_features and remove w2v2-bert feature extractor * remove 
pretraining model * remove depreciated features and useless lines * add copied from and ignore statements to modeling tests * remove pretraining model #2 * change import in convert script * change default in convert script * update readme and remove useless line * Update tests/models/wav2vec2_bert/test_processor_wav2vec2_bert.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * refactor BERT to Bert for consistency * remove useless ignore copy statement * add persistent to buffer in rotary * add eps in LayerNorm init and remove copied from * add adapter activation parameters and add copied from statements * Fix copied statements and add unitest.skip reasons * add copied statement in test_processor * refactor processor * make style * replace numpy random by torch rand * remove expected output CTC * improve converting script with processor class * Apply suggestions from code review Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * remove gumbel class * remove tests related to previously deleted class * Update src/transformers/models/wav2vec2_bert/configuration_wav2vec2_bert.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * correct typos * remove uused parameters * update processor to takes both text and audio * update checkpoints * update expected output and add ctc expected output * add label_attention_mask * replace pt with np in processor tests * fix typo * revert to behaviour with labels_attention_mask --------- Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Use `LoggingLevel` context manager in 3 tests (#28575) * inside with LoggingLevel * remove is_flaky --------- Co-authored-by: ydshieh * Fix the documentation checkpoint for xlm-roberta-xl (#28567) * Fix the documentation checkpoint for xlm-roberta-xl * Improve docstring consistency * [ASR Pipe] Update init to set model type and subsequently call parent init method (#28486) * add image processor arg * super * rm args * [Whisper Tok] Move token ids to CPU when computing offsets (#28485) * move token ids to cpu * check for torch attr * [Whisper] Fix audio classification with weighted layer sum (#28563) * fix * tests * fix test * Making CTC training example more general (#28582) * add w2v2bert compatibility * Update examples/pytorch/speech-recognition/run_speech_recognition_ctc.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Don't save `processor_config.json` if a processor has no extra attribute (#28584) * not save if empty * fix * fix * fix * fix * fix --------- Co-authored-by: ydshieh * v4.38.dev.0 * Add w2v2bert to pipeline (#28585) * generalize asr pipeline to fbank models * change w2v2 pipeline output * Update test_pipelines_automatic_speech_recognition.py * feat: Sequential beam search (#26304) * [Whisper] Finalize batched SOTA long-form generation (#27658) * finalize * make fix copies whisper * [Tests] Make sure that we don't run tests mulitple times * Update src/transformers/models/whisper/modeling_whisper.py * [Tests] Make sure that we don't run tests mulitple times * fix more * improve * improve * improve further * improve more * improve * fix more * git commit and git push * fix more * fix more * fix more * New try * Fix more whisper stuff * Improve * correct more * correct more * correct more * Fix some tests * 
Add more tests * correct more * correct more * correct more * push * correct more * Fix more * Better * without dec mask * correct more * clean * save intermediate * Fix more * Fix VAD for large-v2 * Save new * Correct more * make cleaner * correct tests * correct src * Finish * Fix more * Fix more * finish * Fix edge cases * fix return_dict_in_generate * fix all tests * make style * add docstrings * add docstrings * Fix logit processor * make style * fix pipeline test * fix more style * Apply suggestions from code review * apply feedback Sanchit * correct more * Apply suggestions from code review Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Joao Gante Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> * correct more * correct more * correct more * Fix staticmethod * correct more * fix * fix slow tests * make style * fix tokenizer test * fix tokenizer test * Apply suggestions from code review Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * finish * finish * revert kwargs change --------- Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Co-authored-by: Joao Gante Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Fix wrong xpu device in DistributedType.MULTI_XPU mode (#28386) * remove elif xpu * remove redudant code * [SigLIP] Don't pad by default (#28578) First draft * [`Llava`] Fix convert_llava_weights_to_hf.py script (#28570) * Update convert_llava_weights_to_hf.py Fix call to `tokenizer.add_tokens` * Add special_tokens to tokenizer.add_tokens in convert_vipllava_weights_to_hf.py * Allow add_tokens for ESM (#28535) * Allow non-special tokens to be added * Add test, fix token adding code * Revert changes to id_to_token and token_to_id * Update the ESM tokenizer to be a bit more standardized * Update src/transformers/models/esm/tokenization_esm.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> --------- Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Fix `_speculative_sampling` implementation (#28508) * RWKV: raise informative exception when attempting to manipulate `past_key_values` (#28600) * Fix auxiliary loss related code in transformers (#28406) * [DETA] fix freeze/unfreeze function * Update src/transformers/models/deta/modeling_deta.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Update src/transformers/models/deta/modeling_deta.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * add freeze/unfreeze test case in DETA * fix type * fix typo 2 * fix : enable aux and enc loss in training pipeline * Add unsynced variables from original DETA for training * modification for passing CI test * make style * make fix * manual make fix * change deta_modeling_test of configuration 'two_stage' default to TRUE and minor change of dist checking * remove print * divide configuration in DetaModel and DetaForObjectDetection * image smaller size than 224 will give topk error * pred_boxes and logits should be equivalent to two_stage_num_proposals * add missing part in DetaConfig * Update src/transformers/models/deta/modeling_deta.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * add docstring in configure and prettify TO DO part * change distribute related code to accelerate * Update src/transformers/models/deta/configuration_deta.py Co-authored-by: amyeroberts 
<22614925+amyeroberts@users.noreply.github.com> * Update tests/models/deta/test_modeling_deta.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * protect importing accelerate * change variable name to specific value * wrong import * fix aux_loss in conditional_detr * add test aux_loss * add aux_loss test in deta and table_transformer * fix yolos since it doesn't have auxiliary function * fix maskformer auxiliary_loss related code * make style * change param 'auxiliary_loss' to 'use_auxiliary_loss' * change param 'auxiliary_loss' to 'use_auxiliary_loss' in tests * make style & fix-copies, also revert yolos related parameter * revert variable name 'use_auxiliary_loss' to 'auxiliary_loss' due to DetrConfig * revert variable name in yolos * revert maskformer * add aux_loss test in maskformer * make style * Update src/transformers/models/yolos/configuration_yolos.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> --------- Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * [`GPTNeoX`] Fix BC issue with 4.36 (#28602) * fix dtype issue * add a test * update copied from mentions * nits * fixup * fix copies * Apply suggestions from code review * Fix id2label assignment in run_classification.py (#28590) * Add missing key to TFLayoutLM signature (#28640) Fix missing bbox in LayoutLM signature * Avoid root logger's level being changed (#28638) * avoid root logger's level being changed --------- Co-authored-by: ydshieh * Add config tip to custom model docs (#28601) Add tip to custom model docs * Fix lr_scheduler in no_trainer training scripts (#27872) * Fix lr_scheduler * Fix lr scheduler * [`Llava`] Update convert_llava_weights_to_hf.py script (#28617) * Update convert_llava_weights_to_hf.py script * Remove config update of adding padding to `vocab_size` and `text_config.vocab_size` which causes `ValueError` exception. * Remove keys that ends with `inv_freq` from the state dict. * Add examples and instructions for creating `model_state_dict.bin` that can be used by the script. 
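A hedged illustration of the state-dict cleanup mentioned in the Llava conversion commit above; the file name `model_state_dict.bin` comes from the commit text, while the standalone preprocessing step itself is an assumption rather than the script's actual interface.

```python
import torch

# Drop rotary-embedding buffers (keys ending in "inv_freq") before feeding the
# weights to convert_llava_weights_to_hf.py, as described above.
state_dict = torch.load("model_state_dict.bin", map_location="cpu")
state_dict = {key: value for key, value in state_dict.items() if not key.endswith("inv_freq")}
torch.save(state_dict, "model_state_dict.bin")
```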
* Update convert_llava_weights_to_hf.py * Update convert_vipllava_weights_to_hf.py * [`GPTNeoX`] Fix GPTNeoX + Flash Attention 2 issue (#28645) Update modeling_gpt_neox.py * Update image_processing_deformable_detr.py (#28561) * Update image_processing_deformable_detr.py * Changes after running make fix-copies * [`SigLIP`] Only import tokenizer if sentencepiece available (#28636) Only import class if sp available * Fix phi model doc checkpoint (#28581) Co-authored-by: Pashmina Cameron <11311835+pashminacameron@users.noreply.github.com> * get default device through `PartialState().default_device` as it has been officially released (#27256) get default device through `PartialState().default_device` as it has been officially released * integrations: fix DVCLiveCallback model logging (#28653) * Enable safetensors conversion from PyTorch to other frameworks without the torch requirement (#27599) * Initial commit * Requirements & tests * Tests * Tests * Rogue import * Rogue torch import * Cleanup * Apply suggestions from code review Co-authored-by: Nicolas Patry * bfloat16 management * Sanchit's comments * Import shield * apply suggestions from code review * correct bf16 * rebase --------- Co-authored-by: Nicolas Patry Co-authored-by: sanchit-gandhi * Enable instantiating model with pretrained backbone weights (#28214) * Enable instantiating model with pretrained backbone weights * Update tests so backbone checkpoint isn't passed in * Remove doc updates until changes made in modeling code * Clarify pretrained import * Update configs - docs and validation check * Update src/transformers/utils/backbone_utils.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Clarify exception message * Update config init in tests * Add test for when use_timm_backbone=True * Small test updates --------- Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * `tensor_size` - fix copy/paste error msg typo (#28660) Fix copy/paste error msg typo * Fix windows err with checkpoint race conditions (#28637) Fix windows err * add dataloader prefetch factor in training args and trainer (#28498) * add dataloader prefetch factor in training args and trainer * remove trailing spaces * prevent dataloader_num_workers == 0 and dataloader_prefetch_factor != None dataloader_prefetch_factor works only when data is loaded in a different process as the main one. This commit adds the necessary checks to avoid having prefetch_factor set when there is no such process. 
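A minimal sketch, not the actual `TrainingArguments` code, of the check the dataloader-prefetch commit above describes: `prefetch_factor` only has an effect when worker processes load the data, so a non-default value combined with `dataloader_num_workers == 0` is rejected.

```python
from typing import Optional


def check_prefetch_args(dataloader_num_workers: int, dataloader_prefetch_factor: Optional[int]) -> None:
    # torch.utils.data.DataLoader applies the same rule: prefetch_factor requires num_workers > 0.
    if dataloader_prefetch_factor is not None and dataloader_num_workers == 0:
        raise ValueError(
            "--dataloader_prefetch_factor can only be set when data is loaded in a "
            "different process, i.e. when --dataloader_num_workers > 0."
        )


check_prefetch_args(dataloader_num_workers=2, dataloader_prefetch_factor=4)  # fine
# check_prefetch_args(dataloader_num_workers=0, dataloader_prefetch_factor=4) would raise
```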
* Remove whitespaces in empty line * Update src/transformers/training_args.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/training_args.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/training_args.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/training_args.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Support single token decode for `CodeGenTokenizer` (#28628) convert token id to list in .decode() * Remove deprecated eager_serving fn (#28665) * Remove deprecated eager_serving fn * Fix the input_signature docstring while I'm here * fix a hidden bug of `GenerationConfig`, now the `generation_config.json` can be loaded successfully (#28604) * fix a hidden bug of GenerationConfig * keep `sort_keys=True` to maintain visibility * Update src/transformers/generation/configuration_utils.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update configuration_utils.py in case `obj` is a list, check the items in the list --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update README_es.md (#28612) Fixing grammatical errors in the text * Exclude the load balancing loss of padding tokens in Mixtral-8x7B (#28517) * fix the function load_balancing_loss_func in Mixtral_Moe to include attention_mask * format code using black and ruff * skip computing mask if attention_mask=None * add tests for load balancing loss Mixtral-Moe * fix assert loss is different in mixtral_test * fix pad_leng * use assertNotAlmostEqual and print to debug * remove print for debug * minor updates * reduce rtol and atol * Use save_safetensor to disable safe serialization for XLA (#28669) * Use save_safetensor to disable safe serialization for XLA https://github.com/huggingface/transformers/issues/28438 * Style fixup * Add back in generation types (#28681) * [docs] DeepSpeed (#28542) * config * optim * pre deploy * deploy * save weights, memory, troubleshoot, non-Trainer * done * Improved type hinting for all attention parameters (#28479) * Changed type hinting for all attention inputs to 'Optional[Tuple[torch.FloatTensor,...]] = None' * Fixed the ruff formatting issue * fixed type hinting for all hidden_states to 'Optional[Tuple[torch.FloatTensor, ...]] = None' * Changed type hinting in these 12 scripts modeling_dpr.py,modeling_nat.py,idefics/vision.py,modeling_tf_dpr.py,modeling_luke.py,modeling_swin.py,modeling_tf_swin.py,modeling_blip.py,modeling_tf_blip.py,modeling_donut_swin.py,modeling_dinat.py,modeling_swinv2.py * test fail update * fixed type hinting for these 15 scripts modeling_xlnet.py,modeling_tf_xlnet.py,modeling_led.py,modeling_tf_led.py,modleing_rwkv.py,modeling_dpt.py,modeling_tf_cvt.py,modeling_clip.py,modeling_flax_clip.py,modeling_tf_clip.py,modeling_longformer.py,modeling_tf_longformer.py,modeling_siglip.py,modeling_clap.py,modeling_git.py * Changed type hinting in these 12 scripts modeling_dpr.py,modeling_nat.py,idefics/vision.py,modeling_tf_dpr.py,modeling_luke.py,modeling_swin.py,modeling_tf_swin.py,modeling_blip.py,modeling_tf_blip.py,modeling_donut_swin.py,modeling_dinat.py,modeling_swinv2.py * test fail update * Removed the myvenv file * Fixed type hinting for these 8 scripts 
modeling_tvlt.py,modeling_sam.py,modeling_tf_sam.py,modeling_tvp.py,modeling_rag.py,modeling_tf_rag.py,modeling_tf_xlm.py,modeling_xlm.py * improve efficient training on CPU documentation (#28646) * update doc * revert * typo fix * refine * add dtypes * Update docs/source/en/perf_train_cpu.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/perf_train_cpu.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/perf_train_cpu.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * no comma * use avx512-vnni --------- Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * [docs] Fix doc format (#28684) * fix hfoptions * revert changes to other files * fix * Add Depth Anything (#28654) * First draft * More improvements * More improvements * More improvements * More improvements * Add docs * Remove file * Add copied from * Address comments * Address comments * Address comments * Fix style * Update docs * Convert all checkpoints, add integration test * Rename checkpoints * Add pretrained backbone attributes * Fix default config * Address comment * Add figure to docs * Fix bug thanks to @xenova * Update conversion script * Fix integration test * [`chore`] Add missing space in warning (#28695) Add missing space in warning * Improve Backbone API docs (#28666) Update backbones.md * Update question_answering.md (#28694) fix typo: from: "model = TFAutoModelForQuestionAnswering("distilbert-base-uncased")" to: model = TFAutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased") * [`Vilt`] align input and model dtype in the ViltPatchEmbeddings forward pass (#28633) align dtype * [`docs`] Improve visualization for vertical parallelism (#28583) The documentation says "We refer to this Model parallelism as “Vertical” because of how models are typically visualized.", but then visualizes the model horizontally. This change visualizes the model indeed vertically. 
* Don't fail when `LocalEntryNotFoundError` during `processor_config.json` loading (#28709) * fix --------- Co-authored-by: ydshieh * Fix duplicate & unnecessary flash attention warnings (#28557) * fix duplicate & unnecessary flash warnings * trigger ci * warning_once * if/else order --------- Co-authored-by: Your Name * support PeftMixedModel signature inspect (#28321) * support PeftMixedModel signature inspect * import PeftMixedModel only peft>=0.7.0 * Update src/transformers/trainer.py Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * Update src/transformers/trainer.py Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * Update src/transformers/trainer.py Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * Update src/transformers/trainer.py Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * Update src/transformers/trainer.py Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * Update src/transformers/trainer.py Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * fix styling * Update src/transformers/trainer.py Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * Update src/transformers/trainer.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * style fixup * fix note --------- Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * fix: corrected misleading log message in save_pretrained function (#28699) * [`docs`] Update preprocessing.md (#28719) * Update preprocessing.md adjust ImageProcessor link to working target (same as in lower section of file) * Update preprocessing.md * Initialize _tqdm_active with hf_hub_utils.are_progress_bars_disabled(… (#28717) Initialize _tqdm_active with hf_hub_utils.are_progress_bars_disabled() to respect HF_HUB_DISABLE_PROGRESS_BARS It seems like enable_progress_bar() and disable_progress_bar() sync up with huggingface_hub, but the initial value is always True. This change makes sure the user's preference is respected implicitly on initialization. * Fix `weights_only` (#28725) fix Co-authored-by: ydshieh * Stop confusing the TF compiler with ModelOutput objects (#28712) * Stop confusing the TF compiler with ModelOutput objects * Stop confusing the TF compiler with ModelOutput objects * fix: suppress `GatedRepoError` to use cache file (fix #28558). (#28566) * fix: suppress `GatedRepoError` to use cache file (fix #28558). * move condition_to_return parameter back to outside.
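A simplified rendering, not the full `transformers` logging module, of the progress-bar initialization change described in the `_tqdm_active` commit above: seed the module-level flag from `huggingface_hub` instead of hard-coding `True`, so `HF_HUB_DISABLE_PROGRESS_BARS` is respected at import time.

```python
import huggingface_hub.utils as hf_hub_utils

# Respect HF_HUB_DISABLE_PROGRESS_BARS on initialization instead of always starting enabled.
_tqdm_active = not hf_hub_utils.are_progress_bars_disabled()


def enable_progress_bar() -> None:
    """Turn tqdm progress bars back on, keeping huggingface_hub in sync."""
    global _tqdm_active
    _tqdm_active = True
    hf_hub_utils.enable_progress_bars()


def disable_progress_bar() -> None:
    """Turn tqdm progress bars off, keeping huggingface_hub in sync."""
    global _tqdm_active
    _tqdm_active = False
    hf_hub_utils.disable_progress_bars()
```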
* Unpin pydantic (#28728) * try pydantic v2 * try pydantic v2 --------- Co-authored-by: ydshieh * [docs] Fix datasets in guides (#28715) * change datasets * fix * [Flax] Update no init test for Flax v0.7.1 (#28735) * Falcon: removed unused function (#28605) * Generate: deprecate old src imports (#28607) * [`Siglip`] protect from imports if sentencepiece not installed (#28737) [Siglip] protect from imports if sentencepiece not installed * Add serialization logic to pytree types (#27871) * Add serialized type name to pytrees * Modify context * add serde test * Fix `DepthEstimationPipeline`'s docstring (#28733) * fix * fix * Fix --------- Co-authored-by: ydshieh * Fix input data file extension in examples (#28741) * [Docs] Fix Typo in English & Japanese CLIP Model Documentation (TMBD -> TMDB) (#28751) * [Docs] Fix Typo in English CLIP model_doc * [Docs] Fix Typo in Japanese CLIP model_doc * PatchtTST and PatchTSMixer fixes (#28083) * :bug: fix .max bug * remove prediction_length from regression output dimensions * fix parameter names, fix output names, update tests * ensure shape for PatchTST * ensure output shape for PatchTSMixer * update model, batch, and expected for regression distribution test * update test expected Signed-off-by: Wesley M. Gifford * Update tests/models/patchtst/test_modeling_patchtst.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update tests/models/patchtst/test_modeling_patchtst.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update tests/models/patchtst/test_modeling_patchtst.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/models/patchtsmixer/modeling_patchtsmixer.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update tests/models/patchtsmixer/test_modeling_patchtsmixer.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update tests/models/patchtsmixer/test_modeling_patchtsmixer.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * standardize on patch_length Signed-off-by: Wesley M. Gifford * Update tests/models/patchtsmixer/test_modeling_patchtsmixer.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update tests/models/patchtsmixer/test_modeling_patchtsmixer.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Make arguments more explicit Signed-off-by: Wesley M. Gifford * adjust prepared inputs Signed-off-by: Wesley M. Gifford --------- Signed-off-by: Wesley M. Gifford Co-authored-by: Wesley M. Gifford Co-authored-by: Kashif Rasul Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Enable Gradient Checkpointing in Deformable DETR (#28686) * Enabled gradient checkpointing in Deformable DETR * Enabled gradient checkpointing in Deformable DETR encoder * Removed # Copied from headers in modeling_deta.py to break dependence on Deformable DETR code * small doc update for CamemBERT (#28644) * Pin pytest version <8.0.0 (#28758) * Pin pytest version <8.0.0 * Update setup.py * make deps_table_update * Mark test_constrained_beam_search_generate as flaky (#28757) * Make test_constrained_beam_search_generate as flaky * Update tests/generation/test_utils.py * Fix typo of `Block`. 
(#28727) * [Whisper] Make tokenizer normalization public (#28136) * [Whisper] Make tokenizer normalization public * add to docs * Support saving only PEFT adapter in checkpoints when using PEFT + FSDP (#28297) * Update trainer.py * Revert "Update trainer.py" This reverts commit 0557e2cc9effa3a41304322032239a3874b948a7. * Make trainer.py use adapter_only=True when using FSDP + PEFT * Support load_best_model with adapter_only=True * Ruff format * Inspect function args for save_ load_ fsdp utility functions and only pass adapter_only=True if they support it * Add French translation: french README.md (#28696) * doc: french README Signed-off-by: ThibaultLengagne * doc: Add Depth Anything Signed-off-by: ThibaultLengagne * doc: Add french link in other docs Signed-off-by: ThibaultLengagne * doc: Add missing links in fr docs * doc: fix several mistakes in translation Signed-off-by: ThibaultLengagne --------- Signed-off-by: ThibaultLengagne Co-authored-by: Sarapuce * Don't allow passing `load_in_8bit` and `load_in_4bit` at the same time (#28266) * Update quantization_config.py * Style * Protect from setting directly * add tests * Update tests/quantization/bnb/test_4bit.py Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> --------- Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * Move CLIP _no_split_modules to CLIPPreTrainedModel (#27841) Add _no_split_modules to CLIPModel * `HfQuantizer` class for quantization-related stuff in `modeling_utils.py` (#26610) * squashed earlier commits for easier rebase * rm rebase leftovers * 4bit save enabled @quantizers * TMP gptq test use exllama * fix AwqConfigTest::test_wrong_backend for A100 * quantizers AWQ fixes * _load_pretrained_model low_cpu_mem_usage branch * quantizers style * remove require_low_cpu_mem_usage attr * rm dtype arg from process_model_before_weight_loading * rm config_origin from Q-config * rm inspect from q_config * fixed docstrings in QuantizationConfigParser * logger.warning fix * mv is_loaded_in_4(8)bit to BnbHFQuantizer * is_accelerate_available error msg fix in quantizer * split is_model_trainable in bnb quantizer class * rm llm_int8_skip_modules as separate var in Q * Q rm todo * fwd ref to HFQuantizer in type hint * rm note re optimum.gptq.GPTQQuantizer * quantization_config in __init__ simplified * replaced NonImplemented with create_quantized_param * rm load_in_4/8_bit deprecation warning * QuantizationConfigParser refactoring * awq-related minor changes * awq-related changes * awq config.modules_to_not_convert * raise error if no q-method in q-config in args * minor cleanup * awq quantizer docstring * combine common parts in bnb process_model_before_weight_loading * revert test_gptq * .process_model_ cleanup * restore dict config warning * removed typevars in quantizers.py * cleanup post-rebase 16 jan * QuantizationConfigParser classmethod refactor * rework of handling of unexpected aux elements of bnb weights * moved q-related stuff from save_pretrained to quantizers * refactor v1 * more changes * fix some tests * remove it from main init * ooops * Apply suggestions from code review Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> * fix awq issues * fix * fix * fix * fix * fix * fix * add docs * Apply suggestions from code review Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Arthur 
<48595927+ArthurZucker@users.noreply.github.com> * Update docs/source/en/hf_quantizer.md * address comments * fix * fixup * Update src/transformers/modeling_utils.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Update src/transformers/modeling_utils.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * address final comment * update * Update src/transformers/quantizers/base.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Update src/transformers/quantizers/auto.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * fix * add kwargs update * fixup * add `optimum_quantizer` attribute * oops * rm unneeded file * fix doctests --------- Co-authored-by: younesbelkada Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * [`HfQuantizer`] Move it to "Developper guides" (#28768) Update _toctree.yml * Use Conv1d for TDNN (#25728) * use conv for tdnn * run make fixup * update TDNN * add PEFT LoRA check * propagate tdnn warnings to others * add missing imports * update TDNN in wav2vec2_bert * add missing imports * Fix transformers.utils.fx compatibility with torch<2.0 (#28774) guard sdpa on torch>=2.0 * Further pin pytest version (in a temporary way) (#28780) fix Co-authored-by: ydshieh * [`Backbone`] Use `load_backbone` instead of `AutoBackbone.from_config` (#28661) * Enable instantiating model with pretrained backbone weights * Remove doc updates until changes made in modeling code * Use load_backbone instead * Add use_timm_backbone to the model configs * Add missing imports and arguments * Update docstrings * Make sure test is properly configured * Include recent DPT updates * Task-specific pipeline init args (#28439) * Abstract out pipeline init args * Address PR comments * Reword * BC PIPELINE_INIT_ARGS * Remove old arguments * Small fix * Add tf_keras imports to prepare for Keras 3 (#28588) * Port core files + ESM (because ESM code is odd) * Search-replace in modelling code * Fix up transfo_xl as well * Fix other core files + tests (still need to add correct import to tests) * Fix cookiecutter * make fixup, fix imports in some more core files * Auto-add imports to tests * Cleanup, add imports to sagemaker tests * Use correct exception for importing tf_keras * Fixes in modeling_tf_utils * make fixup * Correct version parsing code * Ensure the pipeline tests correctly revert to float32 after each test * Ensure the pipeline tests correctly revert to float32 after each test * More tf.keras -> keras * Add dtype cast * Better imports of tf_keras * Add a cast for tf.assign, just in case * Fix callback imports * Pin Torch to <2.2.0 (#28785) * Pin torch to <2.2.0 * Pin torchvision and torchaudio as well * Playing around with versions to see if this helps * twiddle something to restart the CI * twiddle it back * Try changing the natten version * make fixup * Revert "Try changing the natten version" This reverts commit de0d6592c35dc39ae8b5a616c27285db28262d06. 
* make fixup * fix fix fix * fix fix fix --------- Co-authored-by: ydshieh * [`bnb`] Fix bnb slow tests (#28788) fix bnb slow tests * Prevent MLflow exception from disrupting training (#28779) Modified MLflow logging metrics from synchronous to asynchronous Co-authored-by: codiceSpaghetti * don't initialize the output embeddings if we're going to tie them to input embeddings (#28192) * test that tied output embeddings aren't initialized on load * don't initialize the output embeddings if we're going to tie them to the input embeddings * [`HFQuantizer`] Remove `check_packages_compatibility` logic (#28789) remove `check_packages_compatibility` logic * [Whisper] Refactor forced_decoder_ids & prompt ids (#28687) * up * Fix more * Correct more * Fix more tests * fix fast tests * Fix more * fix more * push all files * finish all * make style * Fix timestamp wrap * make style * make style * up * up * up * Fix lang detection behavior * Fix lang detection behavior * Add lang detection test * Fix lang detection behavior * make style * Update src/transformers/models/whisper/generation_whisper.py Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> * better error message * make style tests * add warning --------- Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> * Resolve DeepSpeed cannot resume training with PeftModel (#28746) * fix: resolve deepspeed resume peft model issues * chore: update something * chore: update model instance pass into is peft model checks * chore: remove hard code value to tests * fix: format code * canonical repos moves (#28795) * canonical repos moves * Style --------- Co-authored-by: Lysandre * Wrap Keras methods to support BatchEncoding (#28734) * Shim the Keras methods to support BatchEncoding * Extract everything to a convert_batch_encoding function * Convert BatchFeature too (thanks Amy) * tf.keras -> keras * Flax mistral (#26943) * direct copy from llama work * mistral modules forward pass working * flax mistral forward pass with sliding window * added tests * added layer collection approach * Revert "added layer collection approach" This reverts commit 0e2905bf2236ec323163fc1a9f0c016b21aa8b8f. * Revert "Revert "added layer collection approach"" This reverts commit fb17b6187ac5d16da7c461e1130514dc3d137a43. 
* fixed attention outputs * added mistral to init and auto * fixed import name * fixed layernorm weight dtype * freeze initialized weights * make sure conversion consideres bfloat16 * added backend * added docstrings * added cache * fixed sliding window causal mask * passes cache tests * passed all tests * applied make style * removed commented out code * applied fix-copies ignored other model changes * applied make fix-copies * removed unused functions * passed generation integration test * slow tests pass * fixed slow tests * changed default dtype from jax.numpy.float32 to float32 for docstring check * skip cache test for FlaxMistralForSequenceClassification since if pad_token_id in input_ids it doesn't score previous input_ids * updated checkpoint since from_pt not included * applied black style * removed unused args * Applied styling and fixup * changed checkpoint for doc back * fixed rf after adding it to hf hub * Add dummy ckpt * applied styling * added tokenizer to new ckpt * fixed slice format * fix init and slice * changed ref for placeholder TODO * added copies from Llama * applied styling * applied fix-copies * fixed docs * update weight dtype reconversion for sharded weights * removed Nullable input ids * Removed unnecessary output attentions in Module * added embedding weight initialziation * removed unused past_key_values * fixed deterministic * Fixed RMS Norm and added copied from * removed input_embeds * applied make style * removed nullable input ids from sequence classification model * added copied from GPTJ * added copied from Llama on FlaxMistralDecoderLayer * added copied from to FlaxMistralPreTrainedModel methods * fix test deprecation warning * freeze gpt neox random_params and fix copies * applied make style * fixed doc issue * skipped docstring test to allign # copied from * applied make style * removed FlaxMistralForSequenceClassification * removed unused padding_idx * removed more sequence classification * removed sequence classification * applied styling and consistency * added copied from in tests * removed sequence classification test logic * applied styling * applied make style * removed freeze and fixed copies * undo test change * changed repeat_kv to tile * fixed to key value groups * updated copyright year * split casual_mask * empty to rerun failed pt_flax_equivalence test FlaxWav2Vec2ModelTest * went back to 2023 for tests_pr_documentation_tests * went back to 2024 * changed tile to repeat * applied make style * empty for retry on Wav2Vec2 * DeepSpeed: hardcode `torch.arange` dtype on `float` usage to avoid incorrect initialization (#28760) * Add artifact name in job step to maintain job / artifact correspondence (#28682) * avoid using job name * apply to other files --------- Co-authored-by: ydshieh * Split daily CI using 2 level matrix (#28773) * update / add new workflow files * Add comment * Use env.NUM_SLICES * use scripts * use scripts * use scripts * Fix * using one script * Fix * remove unused file * update * fail-fast: false * remove unused file * fix * fix * use matrix * inputs * style * update * fix * fix * no model name * add doc * allow args * style * pass argument --------- Co-authored-by: ydshieh * [docs] Correct the statement in the docstirng of compute_transition_scores in generation/utils.py (#28786) * Adding [T5/MT5/UMT5]ForTokenClassification (#28443) * Adding [T5/MT5/UMT5]ForTokenClassification * Add auto mappings for T5ForTokenClassification and variants * Adding ForTokenClassification to the list of models * Adding attention_mask 
param to the T5ForTokenClassification test * Remove outdated comment in test * Adding EncoderOnly and Token Classification tests for MT5 and UMT5 * Fix typo in umt5 string * Add tests for all the existing MT5 models * Fix wrong comment in dependency_versions_table * Reverting change to common test for _keys_to_ignore_on_load_missing The test is correctly picking up redundant keys in _keys_to_ignore_on_load_missing. * Removing _keys_to_ignore_on_missing from MT5 since the key is not used in the model * Add fix-copies to MT5ModelTest * Make `is_torch_bf16_available_on_device` more strict (#28796) fix Co-authored-by: ydshieh * Fix symbolic_trace with kv cache (#28724) * fix symbolic_trace with kv cache * comment & better test * Add tip on setting tokenizer attributes (#28764) * Add tip on setting tokenizer attributes * Grammar * Remove the bit that was causing doc builds to fail * enable graident checkpointing in DetaObjectDetection and add tests in Swin/Donut_Swin (#28615) * enable graident checkpointing in DetaObjectDetection * fix missing part in original DETA * make style * make fix-copies * Revert "make fix-copies" This reverts commit 4041c86c29248f1673e8173b677c20b5a4511358. * remove fix-copies of DetaDecoder * enable swin gradient checkpointing * fix gradient checkpointing in donut_swin * add tests for deta/swin/donut * Revert "fix gradient checkpointing in donut_swin" This reverts commit 1cf345e34d3cc0e09eb800d9895805b1dd9b474d. * change supports_gradient_checkpointing pipeline to PreTrainedModel * Revert "add tests for deta/swin/donut" This reverts commit 6056ffbb1eddc3cb3a99e4ebb231ae3edf295f5b. * Revert "Revert "fix gradient checkpointing in donut_swin"" This reverts commit 24e25d0a14891241de58a0d86f817d0b5d2a341f. * Simple revert * enable deformable detr gradient checkpointing * add gradient in encoder * [docs] fix some bugs about parameter description (#28806) Co-authored-by: p_spozzhang * Add models from deit (#28302) * Add modelss * Add 2 more models * add models to tocrree * Add modles * Update docs/source/ja/model_doc/detr.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/ja/model_doc/deit.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/ja/model_doc/deplot.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * fix bugs --------- Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * [docs] Backbone (#28739) * backbones * fix path * fix paths * fix code snippet * fix links * [docs] HfQuantizer (#28820) * tidy * fix path * [Docs] Fix spelling and grammar mistakes (#28825) * Fix typos and grammar mistakes in docs and examples * Fix typos in docstrings and comments * Fix spelling of `tokenizer` in model tests * Remove erroneous spaces in decorators * Remove extra spaces in Markdown link texts * Explicitly check if token ID's are None in TFBertTokenizer constructor (#28824) Add an explicit none-check, since token ids can be 0 * Add missing None check for hf_quantizer (#28804) * Add missing None check for hf_quantizer * Add test, fix logic. 
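A tiny, hypothetical helper (not the `TFBertTokenizer` code) showing the pitfall the explicit-None-check commit above addresses: token id 0 is falsy, so a truthiness test wrongly treats it as missing.

```python
def resolve_token_id(configured_id, computed_id):
    # `configured_id or computed_id` would silently discard a legitimate id of 0;
    # compare against None explicitly instead.
    return computed_id if configured_id is None else configured_id


assert resolve_token_id(0, 100) == 0        # id 0 is a real token and must be kept
assert resolve_token_id(None, 100) == 100   # only fall back when nothing was configured
```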
* make style * Switch test model to Mistral * Comment * Update tests/test_modeling_utils.py --------- Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * Fix issues caused by natten (#28834) try Co-authored-by: ydshieh * fix / skip (for now) some tests before switch to torch 2.2 (#28838) * fix / skip some tests before we can switch to torch 2.2 * style --------- Co-authored-by: ydshieh * Use `-v` for `pytest` on CircleCI (#28840) use -v in pytest Co-authored-by: ydshieh * Reduce GPU memory usage when using FSDP+PEFT (#28830) support FSDP+PEFT * Mark `test_encoder_decoder_model_generate` for `vision_encoder_deocder` as flaky (#28842) Mark test as flaky * Bump dash from 2.3.0 to 2.15.0 in /examples/research_projects/decision_transformer (#28845) Bump dash in /examples/research_projects/decision_transformer Bumps [dash](https://github.com/plotly/dash) from 2.3.0 to 2.15.0. - [Release notes](https://github.com/plotly/dash/releases) - [Changelog](https://github.com/plotly/dash/blob/dev/CHANGELOG.md) - [Commits](https://github.com/plotly/dash/compare/v2.3.0...v2.15.0) --- updated-dependencies: - dependency-name: dash dependency-type: direct:production ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Support custom scheduler in deepspeed training (#26831) Reuse trainer.create_scheduler to create scheduler for deepspeed * [Docs] Fix bad doc: replace save with logging (#28855) Fix bad doc: replace save with logging * Ability to override clean_code_for_run (#28783) * Add clean_code_for_run function * Call clean_code_for_run from agent method * [WIP] Hard error when ignoring tensors. (#27484) * [WIP] Hard error when ignoring tensors. * Better selection/error when saving a checkpoint. - Find all names we should normally drop (those are in the transformers config) - Find all disjoint tensors (for those we can safely trigger a copy to get rid of the sharing before saving) - Clone those disjoint tensors getting rid of the issue - Find all identical names (those should be declared in the config but we try to find them all anyway.) - For all identical names: - If they are in the config, just ignore them everything is fine - If they are not, warn about them. - For all remainder tensors which are shared yet neither identical NOR disjoint. raise a hard error. * Adding a failing test on `main` that passes here. * We don't need to keep the subfolder logic in this test. 
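An illustrative sketch, under stated assumptions, of the shared-tensor bookkeeping the "Hard error when ignoring tensors" commit above describes; the disjoint-view cloning the commit also mentions is omitted, and `untyped_storage()` assumes a recent PyTorch.

```python
from collections import defaultdict

import torch


def shared_tensor_groups(state_dict):
    # Group parameter names whose tensors alias the same underlying storage.
    groups = defaultdict(set)
    for name, tensor in state_dict.items():
        groups[tensor.untyped_storage().data_ptr()].add(name)
    return [names for names in groups.values() if len(names) > 1]


def check_shared_tensors_before_save(state_dict, declared_tied_names):
    # Names declared as tied in the config are safe to drop; anything else that still
    # shares memory would silently lose all but one copy in a safetensors save.
    for group in shared_tensor_groups(state_dict):
        undeclared = group - set(declared_tied_names)
        if len(undeclared) > 1:
            raise RuntimeError(
                f"The weights {sorted(undeclared)} share memory but are not declared as tied."
            )
```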
* Apply suggestions from code review Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> --------- Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * [`Doc`] update contribution guidelines (#28858) update guidelines * Correct wav2vec2-bert inputs_to_logits_ratio (#28821) * Correct wav2vec2-bert inputs_to_logits_ratio * correct ratio * correct ratio, clean asr pipeline * refactor on one line * Image Feature Extraction pipeline (#28216) * Draft pipeline * Fixup * Fix docstrings * Update doctest * Update pipeline_model_mapping * Update docstring * Update tests * Update src/transformers/pipelines/image_feature_extraction.py Co-authored-by: Omar Sanseviero * Fix docstrings - review comments * Remove pipeline mapping for composite vision models * Add to pipeline tests * Remove for flava (multimodal) * safe pil import * Add requirements for pipeline run * Account for super slow efficientnet * Review comments * Fix tests * Swap order of kwargs * Use build_pipeline_init_args * Add back FE pipeline for Vilt * Include image_processor_kwargs in docstring * Mark test as flaky * Update TODO * Update tests/pipelines/test_pipelines_image_feature_extraction.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Add license header --------- Co-authored-by: Omar Sanseviero Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * ClearMLCallback enhancements: support multiple runs and handle logging better (#28559) * add clearml tracker * support multiple train runs * remove bad code * add UI entries for config/hparams overrides * handle models in different tasks * run ruff format * tidy code based on code review --------- Co-authored-by: Eugen Ajechiloae * Do not use mtime for checkpoint rotation. (#28862) Resolve https://github.com/huggingface/transformers/issues/26961 * Adds LlamaForQuestionAnswering class in modeling_llama.py along with AutoModel Support (#28777) * This is a test commit * testing commit * final commit with some changes * Removed copy statement * Fixed formatting issues * Fixed error added past_key_values in the forward method * Fixed a trailing whitespace. Damn the formatting rules are strict * Added the copy statement * Bump cryptography from 41.0.2 to 42.0.0 in /examples/research_projects/decision_transformer (#28879) Bump cryptography in /examples/research_projects/decision_transformer Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.2 to 42.0.0. - [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst) - [Commits](https://github.com/pyca/cryptography/compare/41.0.2...42.0.0) --- updated-dependencies: - dependency-name: cryptography dependency-type: direct:production ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * [Docs] Update project names and links in awesome-transformers (#28878) Update project names and repository links in awesome-transformers * Fix LongT5ForConditionalGeneration initialization of lm_head (#28873) * Raise error when using `save_only_model` with `load_best_model_at_end` for DeepSpeed/FSDP (#28866) * Raise error when using `save_only_model` with `load_best_model_at_end` for DeepSpeed/FSDP * Update trainer.py * Fix `FastSpeech2ConformerModelTest` and skip it on CPU (#28888) * fix * fix --------- Co-authored-by: ydshieh * Revert "[WIP] Hard error when ignoring tensors." (#28898) Revert "[WIP] Hard error when ignoring tensors. 
(#27484)" This reverts commit 2da28c4b41bba23969a8afe97c3dfdcbc47a57dc. * unpin torch (#28892) * unpin torch * check * check * check --------- Co-authored-by: ydshieh * Explicit server error on gated model (#28894) * [Docs] Fix backticks in inline code and documentation links (#28875) Fix backticks in code blocks and documentation links * Hotfix - make `torchaudio` get the correct version in `torch_and_flax_job` (#28899) * check * check * check --------- Co-authored-by: ydshieh * [Docs] Add missing language options and fix broken links (#28852) * Add missing entries to the language selector * Add links to the Colab and AWS Studio notebooks for ONNX * Use anchor links in CONTRIBUTING.md * Fix broken hyperlinks due to spaces * Fix links to OpenAI research articles * Remove confusing footnote symbols from author names, as they are also considered invalid markup * fix: Fixed the documentation for `logging_first_step` by removing "evaluate" (#28884) Fixed the documentation for logging_first_step by removing evaluate. * fix Starcoder FA2 implementation (#28891) * Fix Keras scheduler import so it works for older versions of Keras (#28895) Fix our schedule import so it works for older versions of Keras * ⚠️ Raise `Exception` when trying to generate 0 tokens ⚠️ (#28621) * change warning to exception * Update src/transformers/generation/utils.py Co-authored-by: Joao Gante * validate `max_new_tokens` > 0 in `GenerationConfig` * fix truncation test parameterization in `TextGenerationPipelineTests` --------- Co-authored-by: Joao Gante * Update the cache number (#28905) * fix * fix * fix --------- Co-authored-by: ydshieh * Add npu device for pipeline (#28885) add npu device for pipeline Co-authored-by: unit_test * [Docs] Fix placement of tilde character (#28913) Fix placement of tilde character * [Docs] Revert translation of '@slow' decorator (#28912) * Fix utf-8 yaml load for marian conversion to pytorch in Windows (#28618) Fix utf-8 yaml in marian conversion * [`Core generation`] Adds support for static KV cache (#27931) Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com> Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: Joao Gante * Remove dead TF loading code (#28926) Remove dead code * fix: torch.int32 instead of torch.torch.int32 (#28883) * pass kwargs in stopping criteria list (#28927) * Support batched input for decoder start ids (#28887) * support batched input for decoder start ids * Fix typos Co-authored-by: Joao Gante * minor changes * fix: decoder_start_id as list * empty commit * empty commit * empty commit * empty commit * empty commit * empty commit * empty commit * empty commit * empty commit --------- Co-authored-by: Joao Gante * [Docs] Fix broken links and syntax issues (#28918) * Fix model documentation links in attention.md * Fix external link syntax * Fix target anchor names of section links * Fix copyright statement comments * Fix documentation headings * Fix max_position_embeddings default value for llama2 to 4096 #28241 (#28754) * Changed max_position_embeddings default value from 2048 to 4096 * force push * Fixed formatting issues. Fixed missing argument in write_model. * Reverted to the default value 2048 in the Llama config. Added comments for the llama_version argument. 
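A short sketch of the `torch_dtype` string conversion described in the pipeline commit above (a standalone helper for illustration, not the exact `pipelines/__init__.py` code): look the name up on the `torch` module and verify it really is a dtype.

```python
import torch


def resolve_torch_dtype(torch_dtype):
    # Map a string such as "float16" to torch.float16; pass through actual dtypes and "auto".
    if isinstance(torch_dtype, str) and torch_dtype != "auto":
        resolved = getattr(torch, torch_dtype, None)
        if not isinstance(resolved, torch.dtype):
            raise ValueError(f"`torch_dtype` string {torch_dtype!r} is not a valid torch dtype")
        return resolved
    return torch_dtype


assert resolve_torch_dtype("float16") is torch.float16
assert resolve_torch_dtype(torch.bfloat16) is torch.bfloat16
```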
* Fixed issue with default value value of max_position_embeddings in docstring * Updated help message for llama versions Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Fix a wrong link to CONTRIBUTING.md section in PR template (#28941) * Fix type annotations on neftune_noise_alpha and fsdp_config TrainingArguments parameters (#28942) * [i18n-de] Translate README.md to German (#28933) * Translate README.md to German * Add links to README_de.md * Remove invisible characters in README * Change to a formal tone and fix punctuation marks * [Nougat] Fix pipeline (#28242) * Fix pipeline * Remove print statements * Address comments * Address issue * Remove unused imports * [Docs] Update README and default pipelines (#28864) * Update README and docs * Update README * Update README * Convert `torch_dtype` as `str` to actual torch data type (i.e. "float16" …to `torch.float16`) (#28208) * Convert torch_dtype as str to actual torch data type (i.e. "float16" to torch.float16) * Check if passed torch_dtype is an attribute in torch * Update src/transformers/pipelines/__init__.py Check type via isinstance Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * [`pipelines`] updated docstring with vqa alias (#28951) updated docstring with vqa alias * Tests: tag `test_save_load_fast_init_from_base` as flaky (#28930) * Updated requirements for image-classification samples: datasets>=2.14.0 (#28974) Updated datasets requirements. Need a package version >= 2.14.0 * Always initialize tied output_embeddings if it has a bias term (#28947) Continue to initialize tied output_embeddings if it has a bias term The bias term is not tied, and so will need to be initialized accordingly. * Clean up staging tmp checkpoint directory (#28848) clean up remaining tmp checkpoint dir Signed-off-by: woshiyyya * [Docs] Add language identifiers to fenced code blocks (#28955) Add language identifiers to code blocks * [Docs] Add video section (#28958) Add video section * [i18n-de] Translate CONTRIBUTING.md to German (#28954) * Translate contributing.md to German * Fix formatting issues in contributing.md * Address review comments * Fix capitalization * [`NllbTokenizer`] refactor with added tokens decoder (#27717) * refactor with addedtokens decoder * style * get rid of lang code to id * style * keep some things for BC * update tests * add the mask token at the end of the vocab * nits * nits * fix final tests * style * nits * Update src/transformers/models/nllb/tokenization_nllb_fast.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * nits * style? 
* [`pipelines`] updated docstring with vqa alias (#28951) updated docstring with vqa alias * Tests: tag `test_save_load_fast_init_from_base` as flaky (#28930) * Updated requirements for image-classification samples: datasets>=2.14.0 (#28974) Updated datasets requirements. Need a package version >= 2.14.0 * Always initialize tied output_embeddings if it has a bias term (#28947) Continue to initialize tied output_embeddings if it has a bias term The bias term is not tied, and so will need to be initialized accordingly. * Clean up staging tmp checkpoint directory (#28848) clean up remaining tmp checkpoint dir Signed-off-by: woshiyyya * [Docs] Add language identifiers to fenced code blocks (#28955) Add language identifiers to code blocks * [Docs] Add video section (#28958) Add video section * [i18n-de] Translate CONTRIBUTING.md to German (#28954) * Translate contributing.md to German * Fix formatting issues in contributing.md * Address review comments * Fix capitalization * [`NllbTokenizer`] refactor with added tokens decoder (#27717) * refactor with addedtokens decoder * style * get rid of lang code to id * style * keep some things for BC * update tests * add the mask token at the end of the vocab * nits * nits * fix final tests * style * nits * Update src/transformers/models/nllb/tokenization_nllb_fast.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * nits * style? * Update src/transformers/convert_slow_tokenizer.py * make it a tad bit more custom * ruff please stop Co-Authored by avidale * Update Co-authored-by: avidale * Update Co-authored-by: avidale * oupts * ouft * nites * test * fix the remaining failing tests * style * fix failing test * fix other test * temp dir + test the raw init * update test * style --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Add sudachi_projection option to BertJapaneseTokenizer (#28503) * add sudachi_projection option * Upgrade sudachipy>=0.6.8 * add a test case for sudachi_projection * Compatible with older versions of SudachiPy * make fixup * make style * error message for unidic download * revert jumanpp test cases * format options for sudachi_projection Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * format options for sudachi_split_mode and sudachi_dict_type * comment * add tests for full_tokenizer kwargs * pass projection arg directly * require_sudachi_projection * make style * revert upgrade sudachipy * check is_sudachi_projection_available() * revert dependency_version_table and bugfix * style format * simply raise ImportError Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * simply raise ImportError --------- Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Static Cache: load models with MQA or GQA (#28975) * Update configuration_llama.py: fixed broken link (#28946) * Update configuration_llama.py: fix broken link * [Nit] Explicit redirection not required Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * [`DETR`] Update the processing to adapt masks & bboxes to reflect padding (#28363) * Update the processing so bbox coords are adjusted for padding * Just pad masks * Tidy up, add tests * Better tests * Fix yolos and mark as slow for pycocotools * Fix yolos - return_tensors * Clarify padding and normalization behaviour * ENH: Do not pass warning message in case `quantization_config` is in config but not passed as an arg (#28988) * Update auto.py * Update auto.py * Update src/transformers/quantizers/auto.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/quantizers/auto.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * ENH [`AutoQuantizer`]: enhance trainer + not supported quant methods (#28991) * enhance trainer + not support quant methods * remove all old logic * add version * Add `StableLM` (#28810) * Add `StableLM` * fix(model): re-create from `huggingface-cli add-new-model-like persimmon` * fix: re-add changes to address comments * fix(readme): add links to paper * fix(tokenization_auto): remove `GPTNeoXTokenizerFastFast` ref * fix(tests): re-add `@slow` decorator to integration tests * fix(tests): import slow...
* fix(readme_hd): remove whitespace edit * fix(tokenizer): auto tokenizer tuple * skip doctests for `modeling_stablelm` * Add SiglipForImageClassification and CLIPForImageClassification (#28952) * First draft * Add CLIPForImageClassification * Remove scripts * Fix doctests * AQLM quantizer support (#28928) * aqlm init * calibration and dtypes * docs * Readme update * is_aqlm_available * Simpler link in docs * Test TODO real reference * init _import_structure fix * AqlmConfig autodoc * integration aqlm * integrations in tests * docstring fix * legacy typing * Less typings * More kernels information * Performance -> Accuracy * correct tests * remoced multi-gpu test * Update docs/source/en/quantization.md Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * Update src/transformers/utils/quantization_config.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Brought back multi-gpu tests * Update src/transformers/integrations/aqlm.py Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> * Update tests/quantization/aqlm_integration/test_aqlm.py Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> --------- Co-authored-by: Andrei Panferov Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> * [`Doc`] Fix docbuilder - make `BackboneMixin` and `BackboneConfigMixin` importable from `utils`. (#29002) * Trigger doc build * Test removing references * Importable from utils * Trigger another run on a new commit for testing * Set the dataset format used by `test_trainer` to float32 (#28920) Co-authored-by: unit_test * Introduce AcceleratorConfig dataclass (#28664) * Introduce acceleratorconfig dataclass * Extra second warn * Move import * Try moving import under is_accelerate_available * Quality * Apply suggestions from code review Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Clean * Remove to_kwargs * Change version * Improve tests by including dispatch and split batches * Improve reliability * Update tests/trainer/test_trainer.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Fixup tests and review nits * Make tests pass * protect import * Protect import * Empty-Commit * Make training_args.to_dict handle the AcceleratorConfig --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Fix flaky test vision encoder-decoder generate (#28923) * Mask Generation Task Guide (#28897) * Create mask_generation.md * add h1 * add to toctree * Update docs/source/en/tasks/mask_generation.md Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> * Update docs/source/en/tasks/mask_generation.md Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> * Update docs/source/en/tasks/mask_generation.md Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> * Update docs/source/en/tasks/mask_generation.md Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> * Update docs/source/en/tasks/mask_generation.md Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> * Update mask_generation.md * Update docs/source/en/tasks/mask_generation.md Co-authored-by: Maria Khalusova * Update docs/source/en/tasks/mask_generation.md Co-authored-by: Maria Khalusova * Update 
docs/source/en/tasks/mask_generation.md Co-authored-by: Maria Khalusova * Update docs/source/en/tasks/mask_generation.md Co-authored-by: Maria Khalusova * Update docs/source/en/tasks/mask_generation.md Co-authored-by: Maria Khalusova * Update docs/source/en/tasks/mask_generation.md Co-authored-by: Maria Khalusova * Update docs/source/en/tasks/mask_generation.md Co-authored-by: Maria Khalusova * Update docs/source/en/tasks/mask_generation.md Co-authored-by: Maria Khalusova * Update docs/source/en/tasks/mask_generation.md Co-authored-by: Maria Khalusova * Update docs/source/en/tasks/mask_generation.md Co-authored-by: Maria Khalusova * Update mask_generation.md * Update docs/source/en/tasks/mask_generation.md Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Update docs/source/en/tasks/mask_generation.md Co-authored-by: Klaus Hipp * Update docs/source/en/tasks/mask_generation.md Co-authored-by: Klaus Hipp * Update docs/source/en/tasks/mask_generation.md Co-authored-by: Klaus Hipp * Update docs/source/en/tasks/mask_generation.md Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Update docs/source/en/tasks/mask_generation.md * Update mask_generation.md * Update mask_generation.md --------- Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> Co-authored-by: Maria Khalusova Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> Co-authored-by: Klaus Hipp * Add tie_weights() to LM heads and set bias in set_output_embeddings() (#28948) * Add tie_weights() to LM heads and set bias in set_output_embeddings() The biases were not tied correctly in some LM heads, and this change should fix that. * Moving test_save_and_load_low_cpu_mem_usage to ModelTesterMixin * Adding _tie_weights() to MPNet and Vilt * Skip test for low cpu mem usage for Deta/DeformableDetr since they cannot init on meta device * Rename the test to save_load to match the convention * Backbone kwargs in config (#28784) * Enable instantiating model with pretrained backbone weights * Clarify pretrained import * Use load_backbone instead * Add backbone_kwargs to config * Pass kwargs to constructors * Fix up * Input verification * Add tests * Tidy up * Update tests/utils/test_backbone_utils.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> --------- Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * [TPU] Support PyTorch/XLA FSDP via SPMD (#28949) * Initial commit * Add guards for the global mesh * Address more comments * Move the dataloader into integrations/tpu.py * Fix linters * Make kwarg more explicit * Remove the move device logic * Fix the CI * Fix linters * Re-enable checkpointing * FIX [`Trainer` / tags]: Fix trainer + tags when users do not pass `"tags"` to `trainer.push_to_hub()` (#29009) * fix trainer tags * add test * [`Cleanup`] Revert SDPA attention changes that got in the static kv cache PR (#29027) * revert unrelated changes that got in * style * Fix static generation when compiling! (#28937) * wow I was scared! * fix everything * nits * make it BC? * add todo * nits * is_tracing should still be used to pass tracing tests * nits * some nits to make sure generation works with static cache uncompiled * fix sdpa * fix FA2 for both static and dynamic in a better way? * style * fix-copies * fix fix copies * fix sequential beam search * style * use `keys_to_ignore` * nit * correct dtype inference when init * :( the fix for FA2 is still not optimal to investigate!
* styling * nits * nit * this might work better * add comment * Update src/transformers/models/llama/modeling_llama.py * "position_ids" -> "cache_position" * style * nit * Remove changes that should not be propagated just yet * Apply suggestions from code review * Styling * make sure we raise an error for static cache with FA2 enabled * move to the bottom of the signature * style * Update src/transformers/models/llama/modeling_llama.py Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * Update src/transformers/models/llama/modeling_llama.py * nit in the name --------- Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
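Several of the entries above revolve around the static KV cache (#27931, #28975) and making generation work under `torch.compile` (#28937). A minimal usage sketch, assuming the `cache_implementation="static"` generation option introduced with these changes; the checkpoint name is only a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# Opt in to the static KV cache so tensor shapes stay fixed across decoding steps...
model.generation_config.cache_implementation = "static"
# ...which is what lets the forward pass be compiled without constant recompilation.
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("Static cache plus torch.compile", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```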
* Add cuda_custom_kernel in DETA (#28989) * enable gradient checkpointing in DetaObjectDetection * fix missing part in original DETA * make style * make fix-copies * Revert "make fix-copies" This reverts commit 4041c86c29248f1673e8173b677c20b5a4511358. * remove fix-copies of DetaDecoder * enable swin gradient checkpointing * fix gradient checkpointing in donut_swin * add tests for deta/swin/donut * Revert "fix gradient checkpointing in donut_swin" This reverts commit 1cf345e34d3cc0e09eb800d9895805b1dd9b474d. * change supports_gradient_checkpointing pipeline to PreTrainedModel * Revert "add tests for deta/swin/donut" This reverts commit 6056ffbb1eddc3cb3a99e4ebb231ae3edf295f5b. * Revert "Revert "fix gradient checkpointing in donut_swin"" This reverts commit 24e25d0a14891241de58a0d86f817d0b5d2a341f. * Simple revert * enable deformable detr gradient checkpointing * add gradient in encoder * add cuda_custom_kernel function in MSDA * make style and fix input of DetaMSDA * make fix-copies * remove n_levels in input of DetaMSDA * minor changes * refactor custom_cuda_kernel like yoso format https://github.com/huggingface/transformers/blob/0507e69d34f8902422eb4977ec066dd6bef179a0/src/transformers/models/yoso/modeling_yoso.py#L53 * DeformableDetrModel support fp16 (#29013) * Update ms_deform_attn_cuda.cu * Update ms_deform_attn_cuda.cuh * Update modeling_deformable_detr.py * Update src/transformers/models/deformable_detr/modeling_deformable_detr.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update modeling_deformable_detr.py * python utils/check_copies.py --fix_and_overwrite * Fix dtype mismatch error * Update test_modeling_deformable_detr.py * Update test_modeling_deformable_detr.py * Update modeling_deformable_detr.py * Update modeling_deformable_detr.py --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Fix copies between DETR and DETA (#29037) * FIX: Fix error with `logger.warning` + inline with recent refactor (#29039) Update modeling_utils.py * Patch to skip failing `test_save_load_low_cpu_mem_usage` tests (#29043) * Patch to skip currently failing tests * Whoops - wrong place * Removed obsolete attribute setting for AQLM quantization. (#29034) removed redundant field * Fix a tiny typo in `generation/utils.py::GenerateEncoderDecoderOutput`'s docstring (#29044) Update utils.py * add test marker to run all tests with @require_bitsandbytes (#28278) * Update all references to canonical models (#29001) * Script & Manual edition * Update * Update important model list (#29019) * Fix max_length criteria when using inputs_embeds (#28994) * fix max_length for inputs_embeds * make style * Update src/transformers/generation/utils.py Co-authored-by: Joao Gante * Static Cache: load models with MQA or GQA (#28975) * fix * fix tests * fix tests * Update src/transformers/generation/utils.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * more fixes * make style --------- Co-authored-by: Joao Gante Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Support : Leverage Accelerate for object detection/segmentation models (#28312) * made changes for object detection models * added support for segmentation models. * Made changes for segmentation models * Changed import statements * solving conflicts * removed conflicts * Resolving commits * Removed conflicts * Fix : Pixel_mask_value set to False * fix num_assistant_tokens with heuristic schedule (#28759) * fix heuristic num_assistant_tokens_schedule * Update src/transformers/generation/configuration_utils.py Co-authored-by: Joao Gante * Update src/transformers/generation/candidate_generator.py Co-authored-by: Joao Gante * Update utils.py check that candidate_generator.assistant_model exists since some speculations (like ngram and PLD) don't have assistant_model attribute * Update src/transformers/generation/candidate_generator.py Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> * Update tests/generation/test_utils.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * make fixup * merge conflict * fix docstring * make fixup --------- Co-authored-by: Joao Gante Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * fix failing trainer ds tests (#29057) * `auto_find_batch_size` isn't yet supported with DeepSpeed/FSDP. Raise error accordingly. (#29058) Update trainer.py * Honor trust_remote_code for custom tokenizers (#28854) * pass through trust_remote_code for dynamically loading unregistered tokenizers specified by config add test * change directories back to previous directory after test * fix ruff check * Add a note to that block for future in case we want to remove it later --------- Co-authored-by: Matt * Feature: Option to set the tracking URI for MLflowCallback. (#29032) * Added option to set tracking URI for MLflowCallback. * Added option to set tracking URI for MLflowCallback. * Changed to in docstring.
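For the MLflowCallback entry just above, a hedged sketch of how a tracking URI is typically supplied; it assumes the callback honours MLflow's standard `MLFLOW_TRACKING_URI` environment variable (the exact option added in #29032 may differ) and uses a placeholder server address:

```python
import os

from transformers import TrainingArguments

# Assumption: MLflowCallback picks up the standard MLflow environment variable
# when it sets up its run; "http://localhost:5000" is only a placeholder.
os.environ["MLFLOW_TRACKING_URI"] = "http://localhost:5000"

args = TrainingArguments(
    output_dir="out",
    report_to="mlflow",  # enables MLflowCallback when mlflow is installed
)
```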
* Fix trainer test wrt DeepSpeed + auto_find_bs (#29061) * Fix trainer test * Update tests/trainer/test_trainer.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Add chat support to text generation pipeline (#28945) * Add chat support to text generation pipeline * Better handling of single elements * Deprecate ConversationalPipeline * stash commit * Add missing add_special_tokens kwarg * Update chat templating docs to refer to TextGenerationPipeline instead of ConversationalPipeline * Add ✨TF✨ tests * @require_tf * Add type hint * Add specific deprecation version * Remove unnecessary do_sample * Remove todo - the discrepancy has been resolved * Update src/transformers/tokenization_utils_base.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Update src/transformers/pipelines/text_generation.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * [Docs] Spanish translation of task_summary.md (#28844) * Add task_summary to es/_toctree.yml * Add task_summary.md to docs/es * Change title of task_summary.md * Translate first paragraphs * Translate middle paragraphs * Translate the rest of the doc * Edit first paragraph * [`Awq`] Add peft support for AWQ (#28987) * add peft support for AWQ * Update src/transformers/quantizers/quantizer_awq.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * fix --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * FIX [`bnb` / `tests`]: Fix currently failing bnb tests (#29092) Update test_mixed_int8.py * fix the post-processing link (#29091) The link in evaluation was missing a hyphen between post and processing. I fixed this, for English only. Someone with the ability to do a global search/replace should fix the other languages (if indeed they have this issue). * Fix the `bert-base-cased` tokenizer configuration test (#29105) Fix test * Fix a typo in `examples/pytorch/text-classification/run_classification.py` (#29072) * change version (#29097) * change version * nuke * this doesn't make sense * update some requirements.py * revert + no main * nits * change cache number * more pin * revert --------- Co-authored-by: ydshieh * [Docs] Add resources (#28705) * Add resource * Add more resources * Add resources * Apply suggestions from code review Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Remove mention * Remove pipeline tags --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * ENH: added new output_logits option to generate function (#28667) output_logits option behaves like output_scores, but returns the raw, unprocessed prediction logit scores, i.e. the values before they undergo logit processing and/or warping. The latter happens by default for the regular output scores. It's useful to have the unprocessed logit scores in certain circumstances. For example, unprocessed logit scores are very useful with causal LM models when one wants to determine the probability of a certain answer, e.g. when asking a question with a yes/no answer. In that case getting the next-token probabilities of both "yes" and "no" (and/or their relative ratio) is of interest for classification. The reason for getting these _before_ logit processing and/or warping is because a) that can change the probabilities or b) reject the tokens of interest / reduce the number of tokens to just 1. For an example use-case see paper TabLLM: Few-shot Classification of Tabular Data with Large Language Models by Stefan Hegselmann, Alejandro Buendia, Hunter Lang, Monica Agrawal, Xiaoyi Jiang, and David Sontag. https://arxiv.org/abs/2210.10723 In addition: - added dedicated unit test: tests/generation/test_utils/test_return_unprocessed_logit_scores which tests return of logits with output_logits=True in generation. - set output_logits=True in all other generation unit tests, that also have output_scores=True. Implemented @gante's and @amyeroberts review feedback Co-authored-by: kx79wq
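A minimal sketch of the yes/no use case described above, assuming `generate` accepts `output_logits=True` together with `return_dict_in_generate=True` as added in #28667; the checkpoint is only a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Is Paris the capital of France? Answer:", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=1,
    return_dict_in_generate=True,
    output_logits=True,  # raw logits, before logit processing / warping
    output_scores=True,  # processed scores, kept for comparison
)

# out.logits is a tuple with one (batch, vocab) tensor per generated token.
probs = torch.softmax(out.logits[0][0], dim=-1)
yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]
no_id = tokenizer(" No", add_special_tokens=False).input_ids[0]
print(f"P(Yes)={probs[yes_id].item():.4f}  P(No)={probs[no_id].item():.4f}")
```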
* Bnb test fix for different hardware (#29066) * generated text on A10G * generated text in CI * Apply suggestions from code review add explanatory comments Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> --------- Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * Fix two tiny typos in `pipelines/base.py::Pipeline::_sanitize_parameters()`'s docstring (#29102) * Update base.py * Fix a typo * storing & logging gradient norm in trainer (#27326) * report grad_norm during training * support getting grad_norm from deepspeed * Fixed nll with label_smoothing to just nll (#28708) * Fixed nll with label_smoothing to nll * Resolved conflict by rebase * Fixed nll with label_smoothing to nll * Resolved conflict by rebase * Added label_smoothing to config file * Fixed nits * [`gradient_checkpointing`] default to use it for torch 2.3 (#28538) * default to use it * style * Move misplaced line (#29117) Move misplaced line, improve code comment * FEAT [`Trainer` / `bnb`]: Add RMSProp from `bitsandbytes` to HF `Trainer` (#29082) * add RMSProp to Trainer * revert some change * Update src/transformers/trainer.py Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Abstract image processor arg checks. (#28843) * abstract image processor arg checks. * fix signatures and quality * add validate_ method to rescale-prone processors * add more validations * quality * quality * fix formatting Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * fix formatting Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * fix formatting Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * Fix formatting mishap Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * fix crop_size compatibility * fix default mutable arg * fix segmentation map + image arg validity * remove segmentation check from arg validation * fix quality * fix missing segmap * protect PILImageResampling type * Apply suggestions from code review Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * add back segmentation maps check --------- Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> * FIX [`bnb` / `tests`] Propagate the changes from #29092 to 4-bit tests (#29122) * forgot to push the changes for 4bit ..
* trigger CI * Llama: fix batched generation (#29109) * Generate: unset GenerationConfig parameters do not raise warning (#29119) * [`cuda kernels`] only compile them when initializing (#29133) * only compile when needed * fix mra as well * fix yoso as well * update * rempve comment * Update src/transformers/models/deformable_detr/modeling_deformable_detr.py * Update src/transformers/models/deformable_detr/modeling_deformable_detr.py * opps * Update src/transformers/models/deta/modeling_deta.py * nit * FIX [`PEFT` / `Trainer` ] Handle better peft + quantized compiled models (#29055) * handle peft + compiled models * add tests * fixup * adapt from suggestions * clarify comment * [`Core tokenization`] `add_dummy_prefix_space` option to help with latest issues (#28010) * add add_dummy_prefix_space option to slow * checking kwargs might be better. Should be there for all spm tokenizer IMO * nits * fix copies * more copied * nits * add prefix space * nit * nits * Update src/transformers/convert_slow_tokenizer.py * fix inti * revert wrong styling * fix * nits * style * updates * make sure we use slow tokenizer for conversion instead of looking for the decoder * support llama ast well * update llama tokenizer fast * nits * nits nits nits * update the doc * update * update to fix tests * skip unrelated tailing test * Update src/transformers/convert_slow_tokenizer.py * add proper testing * test decode as well * more testing * format * fix llama test * Apply suggestions from code review * Revert low cpu mem tie weights (#29135) * Revert "Add tie_weights() to LM heads and set bias in set_output_embeddings() (#28948)" This reverts commit 725f4ad1ccad4e1aeb309688706b56713070334b. * Revert "Patch to skip failing `test_save_load_low_cpu_mem_usage` tests (#29043)" This reverts commit 4156f517ce0f00e0b7842410542aad5fe37e73cf. * Add support for fine-tuning CLIP-like models using contrastive-image-text example (#29070) * add support for siglip and chinese-clip model training with contrastive-image-text example * codebase fixups * Save (circleci) cache at the end of a job (#29141) nice job Co-authored-by: ydshieh * [Phi] Add support for sdpa (#29108) --------- Signed-off-by: Wesley M. 
Gifford Signed-off-by: ThibaultLengagne Signed-off-by: dependabot[bot] Signed-off-by: woshiyyya Co-authored-by: Junyang Lin Co-authored-by: Ren Xuancheng Co-authored-by: renxuancheng.rxc Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com> Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com> Co-authored-by: Your Name Co-authored-by: Lucas Thompson <33491471+StarCycle@users.noreply.github.com> Co-authored-by: Ahmed Elnaggar Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com> Co-authored-by: ydshieh Co-authored-by: hugo-syn <61210734+hugo-syn@users.noreply.github.com> Co-authored-by: Yoach Lacombe <52246514+ylacombe@users.noreply.github.com> Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com> Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com> Co-authored-by: Jeremy Fowers <80718789+jeremyfowers@users.noreply.github.com> Co-authored-by: Saibo-creator <53392976+Saibo-creator@users.noreply.github.com> Co-authored-by: Patrick von Platen Co-authored-by: Joao Gante Co-authored-by: Fanli Lin Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com> Co-authored-by: isaac-vidas <80056737+isaac-vidas@users.noreply.github.com> Co-authored-by: Matt Co-authored-by: Ofir Zafrir Co-authored-by: Sangbum Daniel Choi <34004152+SangbumChoi@users.noreply.github.com> Co-authored-by: jheitmann Co-authored-by: bofeng huang Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: Sounak Dey Co-authored-by: Pashmina Cameron <11311835+pashminacameron@users.noreply.github.com> Co-authored-by: Huazhong Ji Co-authored-by: Dave Berenbaum Co-authored-by: Lysandre Debut Co-authored-by: Nicolas Patry Co-authored-by: sanchit-gandhi Co-authored-by: Scruel Tao Co-authored-by: Zach Mueller Co-authored-by: Quentin Meeus <25608944+qmeeus@users.noreply.github.com> Co-authored-by: cmathw <108584265+cmathw@users.noreply.github.com> Co-authored-by: Zhenwei <964730078@qq.com> Co-authored-by: Vladimir Pinera Co-authored-by: Khai Mai Co-authored-by: jeffhataws Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> Co-authored-by: nakranivaibhav <67785830+nakranivaibhav@users.noreply.github.com> Co-authored-by: Fanli Lin Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com> Co-authored-by: Merve Noyan Co-authored-by: Yusuf Co-authored-by: Peter Götz Co-authored-by: Facico <56598258+Facico@users.noreply.github.com> Co-authored-by: Turetskii Mikhail Co-authored-by: D Co-authored-by: Shukant Pal Co-authored-by: Angela Yi Co-authored-by: Klaus Hipp Co-authored-by: Vinyzu <50874994+Vinyzu@users.noreply.github.com> Co-authored-by: Wesley Gifford <79663411+wgifford@users.noreply.github.com> Co-authored-by: Wesley M. 
Gifford Co-authored-by: Kashif Rasul Co-authored-by: Nate Cibik <50897218+FoamoftheSea@users.noreply.github.com> Co-authored-by: Julien Chaumond Co-authored-by: xkszltl Co-authored-by: Ajay Patel Co-authored-by: ThibaultLengagne Co-authored-by: Sarapuce Co-authored-by: Omar Sanseviero Co-authored-by: Zhan Ling Co-authored-by: Poedator <24738311+poedator@users.noreply.github.com> Co-authored-by: younesbelkada Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com> Co-authored-by: Thien Tran Co-authored-by: Alessio Serra Co-authored-by: codiceSpaghetti Co-authored-by: tom-p-reichel <43631024+tom-p-reichel@users.noreply.github.com> Co-authored-by: Hieu Lam Co-authored-by: Lysandre Co-authored-by: Kian Sierra McGettigan <47116198+kiansierra@users.noreply.github.com> Co-authored-by: Shichao Song <60967965+Ki-Seki@users.noreply.github.com> Co-authored-by: JB (Don) <1557853+hackyon@users.noreply.github.com> Co-authored-by: zspo Co-authored-by: p_spozzhang Co-authored-by: Rockerz Co-authored-by: skumar951 Co-authored-by: Juri Ganitkevitch Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Ziyang Co-authored-by: Zizhao Chen Co-authored-by: w4ffl35 Co-authored-by: eajechiloae <97950284+eugen-ajechiloae-clearml@users.noreply.github.com> Co-authored-by: Eugen Ajechiloae Co-authored-by: Eran Hirsch Co-authored-by: Lucain Co-authored-by: Sai-Suraj-27 Co-authored-by: Daniel Korat Co-authored-by: unit_test Co-authored-by: Javier <25750030+SystemPanic@users.noreply.github.com> Co-authored-by: vodkaslime <646329483@qq.com> Co-authored-by: Raushan Turganbay Co-authored-by: Karl Hajjar Co-authored-by: Yuki Watanabe <31463517+B-Step62@users.noreply.github.com> Co-authored-by: Philip Blair Co-authored-by: Kossai Sbai <35923560+KossaiSbai@users.noreply.github.com> Co-authored-by: cmahmut <159416666+cmahmut@users.noreply.github.com> Co-authored-by: Alexey Fadeev Co-authored-by: Yunxuan Xiao Co-authored-by: Hiroshi Matsuda <40782025+hiroshi-matsuda-rit@users.noreply.github.com> Co-authored-by: Aditya Kane <64411306+AdityaKane2001@users.noreply.github.com> Co-authored-by: Jonathan Tow <41410219+jon-tow@users.noreply.github.com> Co-authored-by: Andrei Panferov Co-authored-by: Andrei Panferov Co-authored-by: Maria Khalusova Co-authored-by: Jiewen Tan Co-authored-by: Donggeun Yu Co-authored-by: Sadra Barikbin Co-authored-by: Titus <9048635+Titus-von-Koeller@users.noreply.github.com> Co-authored-by: Raushan Turganbay Co-authored-by: Tanmay patil Co-authored-by: Jonathan Mamou Co-authored-by: Richard Lee Co-authored-by: Matt Co-authored-by: Sean (Seok-Won) Yi Co-authored-by: Aaron Jimenez Co-authored-by: Winton Davies <6550854+davies-w@users.noreply.github.com> Co-authored-by: Jay Zhou <50169346+Ja1Zhou@users.noreply.github.com> Co-authored-by: Max Baak Co-authored-by: kx79wq Co-authored-by: Shijie Wu Co-authored-by: Nilesh Co-authored-by: Erich Schubert Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com> Co-authored-by: Taylor Jackle Spriggs <74561858+tjs-intel@users.noreply.github.com> --- .circleci/TROUBLESHOOT.md | 2 +- .circleci/config.yml | 97 +- .circleci/create_circleci_config.py | 390 +- .github/ISSUE_TEMPLATE/bug-report.yml | 11 +- .github/ISSUE_TEMPLATE/i18n.md | 22 +- .github/PULL_REQUEST_TEMPLATE.md | 12 +- .github/conda/meta.yaml | 6 +- .github/workflows/TROUBLESHOOT.md | 2 +- .github/workflows/add-model-like.yml | 16 +- 
.github/workflows/build-docker-images.yml | 223 +- .../build-nightly-ci-docker-images.yml | 85 + .../workflows/build-past-ci-docker-images.yml | 65 +- .github/workflows/build_documentation.yml | 3 +- .github/workflows/build_pr_documentation.yml | 2 +- .github/workflows/check_runner_status.yml | 67 - .github/workflows/check_tiny_models.yml | 82 + .github/workflows/delete_doc_comment.yml | 13 - .github/workflows/doctests.yml | 25 +- .github/workflows/model-templates.yml | 2 +- .github/workflows/model_jobs.yml | 102 + .github/workflows/release-conda.yml | 2 +- .../workflows/self-nightly-past-ci-caller.yml | 134 + .github/workflows/self-nightly-scheduled.yml | 89 +- .github/workflows/self-past-caller.yml | 136 - .github/workflows/self-past.yml | 169 +- .../workflows/self-push-amd-mi210-caller.yml | 25 + .../workflows/self-push-amd-mi250-caller.yml | 25 + .github/workflows/self-push-amd.yml | 329 + .github/workflows/self-push-caller.yml | 6 +- .github/workflows/self-push.yml | 76 +- .../workflows/self-scheduled-amd-caller.yml | 14 + .../self-scheduled-amd-mi210-caller.yml | 19 + .../self-scheduled-amd-mi250-caller.yml | 19 + .github/workflows/self-scheduled-amd.yml | 519 ++ .github/workflows/self-scheduled.yml | 231 +- .github/workflows/stale.yml | 6 +- .github/workflows/update_metdata.yml | 25 +- .github/workflows/upload_pr_documentation.yml | 16 + .gitignore | 2 +- CONTRIBUTING.md | 107 +- ISSUES.md | 4 +- MANIFEST.in | 1 - Makefile | 28 +- README.md | 191 +- README_de.md | 576 ++ README_es.md | 146 +- README_fr.md | 574 ++ README_hd.md | 348 +- README_ja.md | 128 +- README_ko.md | 128 +- README_pt-br.md | 568 ++ README_ru.md | 556 ++ README_te.md | 560 ++ README_zh-hans.md | 130 +- README_zh-hant.md | 130 +- SECURITY.md | 6 + awesome-transformers.md | 609 ++ conftest.py | 13 +- docker/transformers-all-latest-gpu/Dockerfile | 47 +- docker/transformers-cpu/Dockerfile | 26 - docker/transformers-doc-builder/Dockerfile | 3 +- docker/transformers-past-gpu/Dockerfile | 30 +- .../transformers-pytorch-amd-gpu/Dockerfile | 36 + docker/transformers-pytorch-cpu/Dockerfile | 25 - .../Dockerfile | 45 + .../Dockerfile | 27 +- .../Dockerfile | 51 +- docker/transformers-pytorch-gpu/Dockerfile | 9 +- docker/transformers-tensorflow-cpu/Dockerfile | 25 - docker/transformers-tensorflow-gpu/Dockerfile | 4 +- docs/README.md | 54 +- docs/TRANSLATING.md | 2 +- docs/source/_config.py | 2 +- docs/source/de/_toctree.yml | 22 + .../de/{accelerate.mdx => accelerate.md} | 4 + docs/source/de/add_new_model.md | 895 ++ docs/source/de/add_new_pipeline.md | 258 + docs/source/de/add_tensorflow_model.md | 356 + ...ass_tutorial.mdx => autoclass_tutorial.md} | 16 +- docs/source/de/contributing.md | 334 + docs/source/de/{index.mdx => index.md} | 19 +- .../de/{installation.mdx => installation.md} | 20 +- docs/source/de/llm_tutorial.md | 221 + .../{model_sharing.mdx => model_sharing.md} | 6 +- docs/source/de/peft.md | 216 + ...line_tutorial.mdx => pipeline_tutorial.md} | 10 +- docs/source/de/pr_checks.md | 199 + .../{preprocessing.mdx => preprocessing.md} | 26 +- .../source/de/{quicktour.mdx => quicktour.md} | 18 +- docs/source/de/run_scripts.md | 351 + docs/source/de/testing.md | 1293 +++ docs/source/de/{training.mdx => training.md} | 20 +- docs/source/de/transformers_agents.md | 323 + docs/source/en/_config.py | 2 +- docs/source/en/_redirects.yml | 3 + docs/source/en/_toctree.yml | 398 +- .../en/{accelerate.mdx => accelerate.md} | 6 +- .../{add_new_model.mdx => add_new_model.md} | 64 +- ...d_new_pipeline.mdx => add_new_pipeline.md} 
| 18 +- ...flow_model.mdx => add_tensorflow_model.md} | 30 +- .../source/en/{attention.mdx => attention.md} | 10 +- ...ass_tutorial.mdx => autoclass_tutorial.md} | 63 +- .../en/{benchmarks.mdx => benchmarks.md} | 42 +- .../source/en/{bertology.mdx => bertology.md} | 5 + .../en/{big_models.mdx => big_models.md} | 12 +- docs/source/en/chat_templating.md | 484 ++ .../source/en/{community.mdx => community.md} | 18 +- .../en/converting_tensorflow_models.mdx | 162 - .../{create_a_model.mdx => create_a_model.md} | 105 +- .../{custom_models.mdx => custom_models.md} | 76 +- docs/source/en/custom_tools.md | 789 ++ .../source/en/{debugging.mdx => debugging.md} | 142 + docs/source/en/deepspeed.md | 1222 +++ ...fast_tokenizers.mdx => fast_tokenizers.md} | 4 + docs/source/en/fsdp.md | 138 + ...trategies.mdx => generation_strategies.md} | 191 +- docs/source/en/{glossary.mdx => glossary.md} | 163 +- docs/source/en/hf_quantizer.md | 69 + .../source/en/{hpo_train.mdx => hpo_train.md} | 22 +- docs/source/en/index.md | 320 + .../en/{installation.mdx => installation.md} | 50 +- .../{audio_utils.mdx => audio_utils.md} | 19 +- .../{file_utils.mdx => file_utils.md} | 4 + ...neration_utils.mdx => generation_utils.md} | 202 +- ...ng_utils.mdx => image_processing_utils.md} | 4 + .../{modeling_utils.mdx => modeling_utils.md} | 7 +- ...pipelines_utils.mdx => pipelines_utils.md} | 4 + docs/source/en/internal/time_series_utils.md | 29 + ...zation_utils.mdx => tokenization_utils.md} | 4 + .../{trainer_utils.mdx => trainer_utils.md} | 6 +- docs/source/en/llm_tutorial.md | 268 + docs/source/en/llm_tutorial_optimization.md | 781 ++ docs/source/en/main_classes/agent.md | 105 + docs/source/en/main_classes/backbones.md | 60 + .../{callback.mdx => callback.md} | 15 +- .../{configuration.mdx => configuration.md} | 4 + .../{data_collator.mdx => data_collator.md} | 4 + docs/source/en/main_classes/deepspeed.md | 32 + docs/source/en/main_classes/deepspeed.mdx | 2263 ----- ...ure_extractor.mdx => feature_extractor.md} | 9 +- ...image_processor.mdx => image_processor.md} | 4 + ...keras_callbacks.mdx => keras_callbacks.md} | 4 + .../main_classes/{logging.mdx => logging.md} | 21 + .../en/main_classes/{model.mdx => model.md} | 6 +- .../en/main_classes/{onnx.mdx => onnx.md} | 4 + ...r_schedules.mdx => optimizer_schedules.md} | 4 + .../en/main_classes/{output.mdx => output.md} | 30 +- .../{pipelines.mdx => pipelines.md} | 45 +- .../{processors.mdx => processors.md} | 6 +- docs/source/en/main_classes/quantization.md | 47 + docs/source/en/main_classes/quantization.mdx | 150 - ...text_generation.mdx => text_generation.md} | 7 +- .../{tokenizer.mdx => tokenizer.md} | 10 + docs/source/en/main_classes/trainer.md | 54 + docs/source/en/main_classes/trainer.mdx | 679 -- docs/source/en/migration.mdx | 315 - .../en/model_doc/{albert.mdx => albert.md} | 82 +- docs/source/en/model_doc/align.md | 104 + .../en/model_doc/{altclip.mdx => altclip.md} | 15 +- ...r.mdx => audio-spectrogram-transformer.md} | 25 +- .../source/en/model_doc/{auto.mdx => auto.md} | 56 +- docs/source/en/model_doc/autoformer.md | 50 + docs/source/en/model_doc/bark.md | 232 + .../source/en/model_doc/{bart.mdx => bart.md} | 52 +- .../en/model_doc/{barthez.mdx => barthez.md} | 12 +- .../en/model_doc/{bartpho.mdx => bartpho.md} | 12 +- .../source/en/model_doc/{beit.mdx => beit.md} | 25 +- ...bert-generation.mdx => bert-generation.md} | 25 +- .../{bert-japanese.mdx => bert-japanese.md} | 14 +- .../source/en/model_doc/{bert.mdx => bert.md} | 41 +- .../model_doc/{bertweet.mdx => 
bertweet.md} | 15 +- .../model_doc/{big_bird.mdx => big_bird.md} | 31 +- ...bigbird_pegasus.mdx => bigbird_pegasus.md} | 16 +- .../en/model_doc/{biogpt.mdx => biogpt.md} | 29 +- docs/source/en/model_doc/{bit.mdx => bit.md} | 16 +- ...enderbot-small.mdx => blenderbot-small.md} | 33 +- .../{blenderbot.mdx => blenderbot.md} | 58 +- .../en/model_doc/{blip-2.mdx => blip-2.md} | 22 +- .../source/en/model_doc/{blip.mdx => blip.md} | 58 +- .../en/model_doc/{bloom.mdx => bloom.md} | 44 +- .../source/en/model_doc/{bort.mdx => bort.md} | 25 +- .../{bridgetower.mdx => bridgetower.md} | 38 +- docs/source/en/model_doc/bros.md | 114 + .../source/en/model_doc/{byt5.mdx => byt5.md} | 12 +- .../model_doc/{camembert.mdx => camembert.md} | 37 +- .../en/model_doc/{canine.mdx => canine.md} | 28 +- .../{chinese_clip.mdx => chinese_clip.md} | 12 +- .../source/en/model_doc/{clap.mdx => clap.md} | 12 +- .../source/en/model_doc/{clip.mdx => clip.md} | 56 +- .../en/model_doc/{clipseg.mdx => clipseg.md} | 18 +- docs/source/en/model_doc/clvp.md | 126 + docs/source/en/model_doc/code_llama.md | 127 + .../en/model_doc/{codegen.mdx => codegen.md} | 10 +- ...nditional_detr.mdx => conditional_detr.md} | 9 +- .../model_doc/{convbert.mdx => convbert.md} | 27 +- .../model_doc/{convnext.mdx => convnext.md} | 18 +- docs/source/en/model_doc/convnextv2.md | 68 + docs/source/en/model_doc/{cpm.mdx => cpm.md} | 13 +- docs/source/en/model_doc/cpmant.md | 47 + .../source/en/model_doc/{ctrl.mdx => ctrl.md} | 24 +- docs/source/en/model_doc/{cvt.mdx => cvt.md} | 19 +- .../model_doc/{data2vec.mdx => data2vec.md} | 41 +- .../{deberta-v2.mdx => deberta-v2.md} | 25 + .../en/model_doc/{deberta.mdx => deberta.md} | 18 + ...ransformer.mdx => decision_transformer.md} | 8 +- ...deformable_detr.mdx => deformable_detr.md} | 16 +- .../source/en/model_doc/{deit.mdx => deit.md} | 27 +- docs/source/en/model_doc/deplot.md | 66 + docs/source/en/model_doc/depth_anything.md | 113 + .../source/en/model_doc/{deta.mdx => deta.md} | 12 +- .../source/en/model_doc/{detr.mdx => detr.md} | 28 +- .../model_doc/{dialogpt.mdx => dialogpt.md} | 13 +- .../en/model_doc/{dinat.mdx => dinat.md} | 27 +- docs/source/en/model_doc/dinov2.md | 83 + .../{distilbert.mdx => distilbert.md} | 68 +- docs/source/en/model_doc/{dit.mdx => dit.md} | 19 +- .../en/model_doc/{donut.mdx => donut.md} | 18 +- docs/source/en/model_doc/{dpr.mdx => dpr.md} | 17 +- docs/source/en/model_doc/{dpt.mdx => dpt.md} | 22 + ...efficientformer.mdx => efficientformer.md} | 33 +- docs/source/en/model_doc/efficientnet.md | 51 + .../en/model_doc/{electra.mdx => electra.md} | 28 +- docs/source/en/model_doc/encodec.md | 65 + ...encoder-decoder.mdx => encoder-decoder.md} | 24 +- .../en/model_doc/{ernie.mdx => ernie.md} | 17 +- .../en/model_doc/{ernie_m.mdx => ernie_m.md} | 22 +- docs/source/en/model_doc/{esm.mdx => esm.md} | 30 +- docs/source/en/model_doc/falcon.md | 84 + .../en/model_doc/fastspeech2_conformer.md | 134 + .../en/model_doc/{flan-t5.mdx => flan-t5.md} | 12 +- docs/source/en/model_doc/flan-ul2.md | 54 + .../model_doc/{flaubert.mdx => flaubert.md} | 22 + .../en/model_doc/{flava.mdx => flava.md} | 6 +- .../source/en/model_doc/{fnet.mdx => fnet.md} | 22 +- docs/source/en/model_doc/focalnet.md | 50 + .../source/en/model_doc/{fsmt.mdx => fsmt.md} | 7 +- .../en/model_doc/{funnel.mdx => funnel.md} | 25 +- docs/source/en/model_doc/fuyu.md | 115 + docs/source/en/model_doc/{git.mdx => git.md} | 14 +- .../source/en/model_doc/{glpn.mdx => glpn.md} | 9 +- .../en/model_doc/{gpt-sw3.mdx => gpt-sw3.md} | 31 +- 
.../source/en/model_doc/{gpt2.mdx => gpt2.md} | 41 +- docs/source/en/model_doc/gpt_bigcode.md | 106 + docs/source/en/model_doc/gpt_neo.md | 147 + docs/source/en/model_doc/gpt_neo.mdx | 80 - .../model_doc/{gpt_neox.mdx => gpt_neox.md} | 59 +- ...neox_japanese.mdx => gpt_neox_japanese.md} | 10 +- .../source/en/model_doc/{gptj.mdx => gptj.md} | 48 +- docs/source/en/model_doc/gptsan-japanese.md | 121 + .../{graphormer.mdx => graphormer.md} | 14 +- .../model_doc/{groupvit.mdx => groupvit.md} | 23 +- .../en/model_doc/{herbert.mdx => herbert.md} | 17 +- .../en/model_doc/{hubert.mdx => hubert.md} | 21 +- .../en/model_doc/{ibert.mdx => ibert.md} | 11 + docs/source/en/model_doc/idefics.md | 63 + .../model_doc/{imagegpt.mdx => imagegpt.md} | 11 +- docs/source/en/model_doc/informer.md | 50 + docs/source/en/model_doc/instructblip.md | 67 + .../en/model_doc/{jukebox.mdx => jukebox.md} | 22 +- docs/source/en/model_doc/kosmos-2.md | 98 + .../model_doc/{layoutlm.mdx => layoutlm.md} | 24 +- .../{layoutlmv2.mdx => layoutlmv2.md} | 48 +- .../{layoutlmv3.mdx => layoutlmv3.md} | 39 +- .../model_doc/{layoutxlm.mdx => layoutxlm.md} | 16 +- docs/source/en/model_doc/{led.mdx => led.md} | 28 +- .../en/model_doc/{levit.mdx => levit.md} | 12 +- .../source/en/model_doc/{lilt.mdx => lilt.md} | 29 +- docs/source/en/model_doc/llama.md | 132 + docs/source/en/model_doc/llama2.md | 140 + docs/source/en/model_doc/llava.md | 80 + .../{longformer.mdx => longformer.md} | 27 +- .../en/model_doc/{longt5.mdx => longt5.md} | 24 +- .../source/en/model_doc/{luke.mdx => luke.md} | 25 +- .../en/model_doc/{lxmert.mdx => lxmert.md} | 20 +- .../en/model_doc/{m2m_100.mdx => m2m_100.md} | 23 +- docs/source/en/model_doc/madlad-400.md | 68 + .../en/model_doc/{marian.mdx => marian.md} | 33 +- .../model_doc/{markuplm.mdx => markuplm.md} | 23 +- .../{mask2former.mdx => mask2former.md} | 21 +- .../{maskformer.mdx => maskformer.md} | 21 +- docs/source/en/model_doc/matcha.md | 76 + .../en/model_doc/{mbart.mdx => mbart.md} | 27 +- .../en/model_doc/{mctct.mdx => mctct.md} | 23 +- docs/source/en/model_doc/mega.md | 84 + .../{megatron-bert.mdx => megatron-bert.md} | 21 +- .../{megatron_gpt2.mdx => megatron_gpt2.md} | 18 +- docs/source/en/model_doc/mgp-str.md | 88 + docs/source/en/model_doc/mistral.md | 161 + docs/source/en/model_doc/mixtral.md | 163 + .../en/model_doc/{mluke.mdx => mluke.md} | 12 +- docs/source/en/model_doc/mms.md | 389 + .../{mobilebert.mdx => mobilebert.md} | 26 +- .../{mobilenet_v1.mdx => mobilenet_v1.md} | 11 +- .../{mobilenet_v2.mdx => mobilenet_v2.md} | 14 +- .../model_doc/{mobilevit.mdx => mobilevit.md} | 24 +- docs/source/en/model_doc/mobilevitv2.md | 56 + .../en/model_doc/{mpnet.mdx => mpnet.md} | 29 +- docs/source/en/model_doc/mpt.md | 70 + docs/source/en/model_doc/mra.md | 62 + docs/source/en/model_doc/{mt5.mdx => mt5.md} | 32 + docs/source/en/model_doc/musicgen.md | 280 + docs/source/en/model_doc/{mvp.mdx => mvp.md} | 21 +- docs/source/en/model_doc/{nat.mdx => nat.md} | 29 +- .../en/model_doc/{nezha.mdx => nezha.md} | 12 + docs/source/en/model_doc/nllb-moe.md | 134 + .../source/en/model_doc/{nllb.mdx => nllb.md} | 61 +- docs/source/en/model_doc/nougat.md | 115 + .../{nystromformer.mdx => nystromformer.md} | 12 + .../model_doc/{oneformer.mdx => oneformer.md} | 19 +- docs/source/en/model_doc/open-llama.md | 61 + .../{openai-gpt.mdx => openai-gpt.md} | 26 +- docs/source/en/model_doc/{opt.mdx => opt.md} | 96 +- docs/source/en/model_doc/owlv2.md | 128 + .../en/model_doc/{owlvit.mdx => owlvit.md} | 30 +- 
docs/source/en/model_doc/patchtsmixer.md | 94 + docs/source/en/model_doc/patchtst.md | 68 + .../en/model_doc/{pegasus.mdx => pegasus.md} | 43 +- .../model_doc/{pegasus_x.mdx => pegasus_x.md} | 19 +- .../model_doc/{perceiver.mdx => perceiver.md} | 19 +- docs/source/en/model_doc/persimmon.md | 98 + docs/source/en/model_doc/phi.md | 190 + .../en/model_doc/{phobert.mdx => phobert.md} | 15 +- docs/source/en/model_doc/pix2struct.md | 77 + .../en/model_doc/{plbart.mdx => plbart.md} | 22 +- .../{poolformer.mdx => poolformer.md} | 10 +- docs/source/en/model_doc/pop2piano.md | 194 + .../{prophetnet.mdx => prophetnet.md} | 17 +- docs/source/en/model_doc/pvt.md | 71 + .../en/model_doc/{qdqbert.mdx => qdqbert.md} | 25 +- docs/source/en/model_doc/qwen2.md | 82 + docs/source/en/model_doc/{rag.mdx => rag.md} | 21 +- .../en/model_doc/{realm.mdx => realm.md} | 4 + .../model_doc/{reformer.mdx => reformer.md} | 29 +- .../en/model_doc/{regnet.mdx => regnet.md} | 39 +- .../en/model_doc/{rembert.mdx => rembert.md} | 24 +- .../en/model_doc/{resnet.mdx => resnet.md} | 31 +- .../model_doc/{retribert.mdx => retribert.md} | 13 + ...elayernorm.mdx => roberta-prelayernorm.md} | 30 +- .../en/model_doc/{roberta.mdx => roberta.md} | 32 +- .../model_doc/{roc_bert.mdx => roc_bert.md} | 23 +- .../model_doc/{roformer.mdx => roformer.md} | 31 +- docs/source/en/model_doc/rwkv.md | 150 + docs/source/en/model_doc/sam.md | 148 + docs/source/en/model_doc/seamless_m4t.md | 220 + docs/source/en/model_doc/seamless_m4t_v2.md | 194 + .../model_doc/{segformer.mdx => segformer.md} | 17 +- .../en/model_doc/{sew-d.mdx => sew-d.md} | 12 +- docs/source/en/model_doc/{sew.mdx => sew.md} | 12 +- docs/source/en/model_doc/siglip.md | 157 + ...-decoder.mdx => speech-encoder-decoder.md} | 12 +- .../{speech_to_text.mdx => speech_to_text.md} | 16 +- ...eech_to_text_2.mdx => speech_to_text_2.md} | 10 +- .../model_doc/{speecht5.mdx => speecht5.md} | 6 +- .../model_doc/{splinter.mdx => splinter.md} | 12 +- .../{squeezebert.mdx => squeezebert.md} | 15 +- docs/source/en/model_doc/stablelm.md | 102 + docs/source/en/model_doc/swiftformer.md | 44 + .../source/en/model_doc/{swin.mdx => swin.md} | 22 +- .../en/model_doc/{swin2sr.mdx => swin2sr.md} | 4 + .../en/model_doc/{swinv2.mdx => swinv2.md} | 8 +- ...ransformers.mdx => switch_transformers.md} | 19 +- docs/source/en/model_doc/{t5.mdx => t5.md} | 101 +- .../en/model_doc/{t5v1.1.mdx => t5v1.1.md} | 16 +- ...e-transformer.mdx => table-transformer.md} | 15 +- .../en/model_doc/{tapas.mdx => tapas.md} | 26 +- .../en/model_doc/{tapex.mdx => tapex.md} | 23 +- ...sformer.mdx => time_series_transformer.md} | 21 +- .../{timesformer.mdx => timesformer.md} | 16 +- ...nsformer.mdx => trajectory_transformer.md} | 20 +- .../{transfo-xl.mdx => transfo-xl.md} | 59 +- .../en/model_doc/{trocr.mdx => trocr.md} | 27 +- .../source/en/model_doc/{tvlt.mdx => tvlt.md} | 20 +- docs/source/en/model_doc/tvp.md | 186 + docs/source/en/model_doc/{ul2.mdx => ul2.md} | 18 +- docs/source/en/model_doc/umt5.md | 112 + .../{unispeech-sat.mdx => unispeech-sat.md} | 16 +- .../model_doc/{unispeech.mdx => unispeech.md} | 14 +- docs/source/en/model_doc/univnet.md | 80 + .../en/model_doc/{upernet.mdx => upernet.md} | 25 +- docs/source/en/model_doc/{van.mdx => van.md} | 18 +- .../model_doc/{videomae.mdx => videomae.md} | 14 +- .../source/en/model_doc/{vilt.mdx => vilt.md} | 26 +- docs/source/en/model_doc/vipllava.md | 61 + ...-decoder.mdx => vision-encoder-decoder.md} | 22 +- ...ncoder.mdx => vision-text-dual-encoder.md} | 21 + .../{visual_bert.mdx => 
visual_bert.md} | 14 +- docs/source/en/model_doc/{vit.mdx => vit.md} | 83 +- .../{vit_hybrid.mdx => vit_hybrid.md} | 8 +- .../en/model_doc/{vit_mae.mdx => vit_mae.md} | 31 +- .../en/model_doc/{vit_msn.mdx => vit_msn.md} | 22 +- docs/source/en/model_doc/vitdet.md | 38 + docs/source/en/model_doc/vitmatte.md | 55 + docs/source/en/model_doc/vits.md | 161 + docs/source/en/model_doc/vivit.md | 43 + docs/source/en/model_doc/wav2vec2-bert.md | 90 + ...c2-conformer.mdx => wav2vec2-conformer.md} | 14 +- .../model_doc/{wav2vec2.mdx => wav2vec2.md} | 33 +- ...v2vec2_phoneme.mdx => wav2vec2_phoneme.md} | 23 +- .../en/model_doc/{wavlm.mdx => wavlm.md} | 20 +- docs/source/en/model_doc/whisper.md | 192 + docs/source/en/model_doc/whisper.mdx | 81 - .../en/model_doc/{xclip.mdx => xclip.md} | 4 + .../source/en/model_doc/{xglm.mdx => xglm.md} | 22 +- .../{xlm-prophetnet.mdx => xlm-prophetnet.md} | 12 +- .../{xlm-roberta-xl.mdx => xlm-roberta-xl.md} | 22 +- .../{xlm-roberta.mdx => xlm-roberta.md} | 39 +- .../en/model_doc/{xlm-v.mdx => xlm-v.md} | 15 +- docs/source/en/model_doc/{xlm.mdx => xlm.md} | 27 +- .../en/model_doc/{xlnet.mdx => xlnet.md} | 24 +- .../en/model_doc/{xls_r.mdx => xls_r.md} | 16 +- .../{xlsr_wav2vec2.mdx => xlsr_wav2vec2.md} | 12 +- .../source/en/model_doc/{xmod.mdx => xmod.md} | 21 +- .../en/model_doc/{yolos.mdx => yolos.md} | 17 +- .../source/en/model_doc/{yoso.mdx => yoso.md} | 20 +- docs/source/en/model_memory_anatomy.md | 272 + .../{model_sharing.mdx => model_sharing.md} | 10 +- .../{model_summary.mdx => model_summary.md} | 4 + .../en/{multilingual.mdx => multilingual.md} | 42 +- .../{pad_truncation.mdx => pad_truncation.md} | 5 + docs/source/en/peft.md | 236 + .../{perf_hardware.mdx => perf_hardware.md} | 17 +- docs/source/en/perf_infer_cpu.md | 127 + docs/source/en/perf_infer_cpu.mdx | 71 - docs/source/en/perf_infer_gpu_many.mdx | 23 - docs/source/en/perf_infer_gpu_one.md | 398 + docs/source/en/perf_infer_gpu_one.mdx | 108 - docs/source/en/perf_torch_compile.md | 359 + docs/source/en/perf_train_cpu.md | 82 + docs/source/en/perf_train_cpu.mdx | 63 - docs/source/en/perf_train_cpu_many.md | 318 + docs/source/en/perf_train_cpu_many.mdx | 130 - docs/source/en/perf_train_gpu_many.md | 668 ++ docs/source/en/perf_train_gpu_many.mdx | 529 -- docs/source/en/perf_train_gpu_one.md | 552 ++ docs/source/en/perf_train_gpu_one.mdx | 744 -- docs/source/en/perf_train_special.md | 63 + docs/source/en/perf_train_special.mdx | 20 - docs/source/en/perf_train_tpu.mdx | 20 - ..._train_tpu_tf.mdx => perf_train_tpu_tf.md} | 4 + docs/source/en/performance.md | 73 + docs/source/en/performance.mdx | 92 - .../en/{perplexity.mdx => perplexity.md} | 21 +- .../en/{philosophy.mdx => philosophy.md} | 6 +- ...line_tutorial.mdx => pipeline_tutorial.md} | 109 +- ...ne_webserver.mdx => pipeline_webserver.md} | 19 +- .../source/en/{pr_checks.mdx => pr_checks.md} | 75 +- .../{preprocessing.mdx => preprocessing.md} | 97 +- docs/source/en/quantization.md | 620 ++ .../source/en/{quicktour.mdx => quicktour.md} | 42 +- .../en/{run_scripts.mdx => run_scripts.md} | 32 +- .../source/en/{sagemaker.mdx => sagemaker.md} | 5 +- docs/source/en/serialization.md | 210 + docs/source/en/serialization.mdx | 539 -- .../en/{task_summary.mdx => task_summary.md} | 35 +- docs/source/en/tasks/{asr.mdx => asr.md} | 10 +- ...sification.mdx => audio_classification.md} | 8 +- ...ing.mdx => document_question_answering.md} | 7 +- docs/source/en/tasks/idefics.md | 425 + ...age_captioning.mdx => image_captioning.md} | 6 +- ...sification.mdx => 
image_classification.md} | 21 +- docs/source/en/tasks/image_to_image.md | 132 + ...e_distillation_for_image_classification.md | 186 + ...uage_modeling.mdx => language_modeling.md} | 117 +- docs/source/en/tasks/mask_generation.md | 238 + ...deling.mdx => masked_language_modeling.md} | 121 +- .../en/tasks/monocular_depth_estimation.md | 151 + ...multiple_choice.mdx => multiple_choice.md} | 24 +- ...ject_detection.mdx => object_detection.md} | 66 +- docs/source/en/tasks/prompting.md | 439 + ...on_answering.mdx => question_answering.md} | 20 +- ...mentation.mdx => semantic_segmentation.md} | 314 +- ...ication.mdx => sequence_classification.md} | 30 +- .../{summarization.mdx => summarization.md} | 30 +- docs/source/en/tasks/text-to-speech.md | 637 ++ ...sification.mdx => token_classification.md} | 28 +- .../tasks/{translation.mdx => translation.md} | 28 +- ...sification.mdx => video_classification.md} | 8 +- .../en/tasks/visual_question_answering.md | 401 + .../tasks/zero_shot_image_classification.md | 147 + .../en/tasks/zero_shot_object_detection.md | 301 + ...tasks_explained.mdx => tasks_explained.md} | 4 + docs/source/en/{testing.mdx => testing.md} | 101 +- docs/source/en/{tf_xla.mdx => tf_xla.md} | 16 +- docs/source/en/tflite.md | 62 + ...nizer_summary.mdx => tokenizer_summary.md} | 14 +- .../en/{torchscript.mdx => torchscript.md} | 10 +- docs/source/en/trainer.md | 414 + docs/source/en/{training.mdx => training.md} | 26 +- docs/source/en/transformers_agents.md | 323 + ...troubleshooting.mdx => troubleshooting.md} | 26 +- docs/source/es/_toctree.yml | 95 +- .../es/{accelerate.mdx => accelerate.md} | 4 + ...d_new_pipeline.mdx => add_new_pipeline.md} | 6 +- ...ass_tutorial.mdx => autoclass_tutorial.md} | 16 +- .../source/es/{bertology.mdx => bertology.md} | 5 + .../source/es/{community.mdx => community.md} | 10 +- ...ls.mdx => converting_tensorflow_models.md} | 22 +- .../{create_a_model.mdx => create_a_model.md} | 28 +- .../{custom_models.mdx => custom_models.md} | 4 + .../source/es/{debugging.mdx => debugging.md} | 4 + ...fast_tokenizers.mdx => fast_tokenizers.md} | 4 + docs/source/es/glossary.md | 464 ++ docs/source/es/{index.mdx => index.md} | 16 +- .../es/{installation.mdx => installation.md} | 18 +- .../{model_sharing.mdx => model_sharing.md} | 6 +- .../es/{multilingual.mdx => multilingual.md} | 38 +- docs/source/es/pad_truncation.md | 69 + docs/source/es/performance.md | 61 + docs/source/es/perplexity.md | 116 + .../es/{philosophy.mdx => philosophy.md} | 4 + ...line_tutorial.mdx => pipeline_tutorial.md} | 8 +- .../source/es/{pr_checks.mdx => pr_checks.md} | 4 + .../{preprocessing.mdx => preprocessing.md} | 16 +- .../source/es/{quicktour.mdx => quicktour.md} | 14 +- .../es/{run_scripts.mdx => run_scripts.md} | 32 +- .../source/es/{sagemaker.mdx => sagemaker.md} | 5 +- .../{serialization.mdx => serialization.md} | 33 +- docs/source/es/task_summary.md | 347 + docs/source/es/tasks/{asr.mdx => asr.md} | 4 + ...sification.mdx => image_classification.md} | 18 +- ...uage_modeling.mdx => language_modeling.md} | 32 +- ...multiple_choice.mdx => multiple_choice.md} | 12 +- ...on_answering.mdx => question_answering.md} | 12 +- .../{summarization.mdx => summarization.md} | 14 +- docs/source/es/{training.mdx => training.md} | 22 +- docs/source/fr/_toctree.yml | 162 +- docs/source/fr/autoclass_tutorial.md | 142 + docs/source/fr/in_translation.md | 5 + docs/source/fr/in_translation.mdx | 1 - docs/source/fr/{index.mdx => index.md} | 26 +- docs/source/fr/installation.md | 258 + 
.../source/fr/{quicktour.mdx => quicktour.md} | 71 +- docs/source/hi/_toctree.yml | 3 + docs/source/hi/pipeline_tutorial.md | 317 + docs/source/it/_toctree.yml | 118 +- .../it/{accelerate.mdx => accelerate.md} | 4 + .../{add_new_model.mdx => add_new_model.md} | 6 +- ...d_new_pipeline.mdx => add_new_pipeline.md} | 6 +- ...ass_tutorial.mdx => autoclass_tutorial.md} | 16 +- docs/source/it/big_models.md | 123 + docs/source/it/community.md | 68 + ...ls.mdx => converting_tensorflow_models.md} | 23 +- .../{create_a_model.mdx => create_a_model.md} | 28 +- .../{custom_models.mdx => custom_models.md} | 4 + .../source/it/{debugging.mdx => debugging.md} | 4 + docs/source/it/{index.mdx => index.md} | 15 +- .../it/{installation.mdx => installation.md} | 14 +- docs/source/it/migration.md | 313 + .../{model_sharing.mdx => model_sharing.md} | 6 +- .../it/{multilingual.mdx => multilingual.md} | 38 +- .../{perf_hardware.mdx => perf_hardware.md} | 16 +- docs/source/it/perf_infer_cpu.md | 79 + docs/source/it/perf_infer_gpu_many.md | 28 + docs/source/it/perf_infer_gpu_one.md | 112 + .../perf_infer_special.md} | 8 +- docs/source/it/perf_train_cpu.md | 69 + docs/source/it/perf_train_cpu_many.md | 141 + docs/source/it/perf_train_special.md | 24 + docs/source/it/perf_train_tpu.md | 24 + ...line_tutorial.mdx => pipeline_tutorial.md} | 8 +- docs/source/it/pr_checks.md | 135 + .../{preprocessing.mdx => preprocessing.md} | 14 +- .../source/it/{quicktour.mdx => quicktour.md} | 10 +- .../it/{run_scripts.mdx => run_scripts.md} | 32 +- .../{serialization.mdx => serialization.md} | 37 +- docs/source/it/{training.mdx => training.md} | 18 +- docs/source/ja/_toctree.yml | 394 +- docs/source/ja/accelerate.md | 136 + docs/source/ja/add_new_model.md | 756 ++ docs/source/ja/add_tensorflow_model.md | 296 + docs/source/ja/attention.md | 52 + docs/source/ja/autoclass_tutorial.md | 161 + docs/source/ja/benchmarks.md | 381 + docs/source/ja/bertology.md | 34 + docs/source/ja/big_models.md | 130 + docs/source/ja/chat_templating.md | 249 + docs/source/ja/community.md | 69 + docs/source/ja/create_a_model.md | 420 + docs/source/ja/custom_models.md | 336 + docs/source/ja/custom_tools.md | 764 ++ docs/source/ja/fast_tokenizers.md | 73 + docs/source/ja/generation_strategies.md | 345 + docs/source/ja/glossary.md | 444 + docs/source/ja/hpo_train.md | 150 + docs/source/ja/{index.mdx => index.md} | 14 +- .../ja/{installation.mdx => installation.md} | 14 +- docs/source/ja/internal/audio_utils.md | 39 + docs/source/ja/internal/file_utils.md | 49 + docs/source/ja/internal/generation_utils.md | 357 + .../ja/internal/image_processing_utils.md | 48 + docs/source/ja/internal/modeling_utils.md | 83 + docs/source/ja/internal/pipelines_utils.md | 44 + docs/source/ja/internal/time_series_utils.md | 29 + docs/source/ja/internal/tokenization_utils.md | 42 + docs/source/ja/internal/trainer_utils.md | 49 + docs/source/ja/llm_tutorial.md | 223 + docs/source/ja/main_classes/agent.md | 105 + docs/source/ja/main_classes/callback.md | 135 + docs/source/ja/main_classes/configuration.md | 31 + docs/source/ja/main_classes/data_collator.md | 67 + docs/source/ja/main_classes/deepspeed.md | 2254 +++++ .../ja/main_classes/feature_extractor.md | 41 + .../source/ja/main_classes/image_processor.md | 33 + .../source/ja/main_classes/keras_callbacks.md | 28 + docs/source/ja/main_classes/logging.md | 121 + docs/source/ja/main_classes/model.md | 160 + docs/source/ja/main_classes/onnx.md | 55 + .../ja/main_classes/optimizer_schedules.md | 77 + docs/source/ja/main_classes/output.md | 
321 + docs/source/ja/main_classes/pipelines.md | 500 ++ docs/source/ja/main_classes/processors.md | 160 + docs/source/ja/main_classes/quantization.md | 447 + .../source/ja/main_classes/text_generation.md | 63 + docs/source/ja/main_classes/tokenizer.md | 80 + docs/source/ja/main_classes/trainer.md | 727 ++ docs/source/ja/model_doc/albert.md | 193 + docs/source/ja/model_doc/align.md | 104 + docs/source/ja/model_doc/altclip.md | 97 + .../audio-spectrogram-transformer.md | 69 + docs/source/ja/model_doc/auto.md | 370 + docs/source/ja/model_doc/autoformer.md | 50 + docs/source/ja/model_doc/bark.md | 198 + docs/source/ja/model_doc/bart.md | 223 + docs/source/ja/model_doc/barthez.md | 60 + docs/source/ja/model_doc/bartpho.md | 86 + docs/source/ja/model_doc/beit.md | 143 + docs/source/ja/model_doc/bert-generation.md | 107 + docs/source/ja/model_doc/bert-japanese.md | 81 + docs/source/ja/model_doc/bert.md | 312 + docs/source/ja/model_doc/bertweet.md | 68 + docs/source/ja/model_doc/big_bird.md | 176 + docs/source/ja/model_doc/bigbird_pegasus.md | 95 + docs/source/ja/model_doc/biogpt.md | 73 + docs/source/ja/model_doc/bit.md | 65 + docs/source/ja/model_doc/blenderbot-small.md | 110 + docs/source/ja/model_doc/blenderbot.md | 132 + docs/source/ja/model_doc/blip-2.md | 90 + docs/source/ja/model_doc/blip.md | 134 + docs/source/ja/model_doc/bloom.md | 107 + docs/source/ja/model_doc/bort.md | 55 + docs/source/ja/model_doc/bridgetower.md | 171 + docs/source/ja/model_doc/bros.md | 113 + docs/source/ja/model_doc/byt5.md | 154 + docs/source/ja/model_doc/camembert.md | 135 + docs/source/ja/model_doc/canine.md | 144 + docs/source/ja/model_doc/chinese_clip.md | 112 + docs/source/ja/model_doc/clap.md | 80 + docs/source/ja/model_doc/clip.md | 220 + docs/source/ja/model_doc/clipseg.md | 104 + docs/source/ja/model_doc/clvp.md | 123 + docs/source/ja/model_doc/code_llama.md | 125 + docs/source/ja/model_doc/codegen.md | 90 + docs/source/ja/model_doc/conditional_detr.md | 75 + docs/source/ja/model_doc/convbert.md | 145 + docs/source/ja/model_doc/convnext.md | 94 + docs/source/ja/model_doc/convnextv2.md | 68 + docs/source/ja/model_doc/cpm.md | 54 + docs/source/ja/model_doc/cpmant.md | 47 + docs/source/ja/model_doc/ctrl.md | 113 + docs/source/ja/model_doc/cvt.md | 88 + docs/source/ja/model_doc/data2vec.md | 187 + docs/source/ja/model_doc/deberta-v2.md | 167 + docs/source/ja/model_doc/deberta.md | 164 + .../ja/model_doc/decision_transformer.md | 53 + docs/source/ja/model_doc/deformable_detr.md | 75 + docs/source/ja/model_doc/deit.md | 148 + docs/source/ja/model_doc/deplot.md | 65 + docs/source/ja/model_doc/deta.md | 64 + docs/source/ja/model_doc/detr.md | 217 + docs/source/ja/model_doc/dialogpt.md | 57 + docs/source/ja/model_doc/dinat.md | 93 + docs/source/ja/model_memory_anatomy.md | 255 + docs/source/ja/model_sharing.md | 262 + docs/source/ja/model_summary.md | 110 + .../ja/{multilingual.mdx => multilingual.md} | 38 +- docs/source/ja/pad_truncation.md | 63 + docs/source/ja/peft.md | 214 + docs/source/ja/perf_hardware.md | 160 + docs/source/ja/perf_infer_cpu.md | 74 + docs/source/ja/perf_infer_gpu_many.md | 125 + docs/source/ja/perf_infer_gpu_one.md | 441 + docs/source/ja/perf_infer_special.md | 18 + docs/source/ja/perf_torch_compile.md | 359 + docs/source/ja/perf_train_cpu.md | 67 + docs/source/ja/perf_train_cpu_many.md | 151 + docs/source/ja/perf_train_gpu_many.md | 529 ++ docs/source/ja/perf_train_gpu_one.md | 438 + docs/source/ja/perf_train_special.md | 24 + docs/source/ja/perf_train_tpu.md | 24 + 
docs/source/ja/perf_train_tpu_tf.md | 168 + docs/source/ja/performance.md | 68 + docs/source/ja/perplexity.md | 116 + docs/source/ja/philosophy.md | 67 + docs/source/ja/pipeline_tutorial.md | 293 + docs/source/ja/pipeline_webserver.md | 132 + docs/source/ja/pr_checks.md | 208 + docs/source/ja/preprocessing.md | 533 ++ docs/source/ja/quicktour.md | 588 ++ docs/source/ja/run_scripts.md | 370 + docs/source/ja/serialization.md | 191 + docs/source/ja/task_summary.md | 355 + docs/source/ja/tasks/asr.md | 380 + docs/source/ja/tasks/audio_classification.md | 330 + .../ja/tasks/document_question_answering.md | 502 ++ docs/source/ja/tasks/idefics.md | 430 + docs/source/ja/tasks/image_captioning.md | 276 + docs/source/ja/tasks/image_classification.md | 559 ++ docs/source/ja/tasks/image_to_image.md | 135 + ...e_distillation_for_image_classification.md | 188 + docs/source/ja/tasks/language_modeling.md | 444 + .../ja/tasks/masked_language_modeling.md | 455 + .../ja/tasks/monocular_depth_estimation.md | 154 + docs/source/ja/tasks/multiple_choice.md | 470 ++ docs/source/ja/tasks/object_detection.md | 603 ++ docs/source/ja/tasks/prompting.md | 438 + docs/source/ja/tasks/question_answering.md | 445 + docs/source/ja/tasks/semantic_segmentation.md | 605 ++ .../ja/tasks/sequence_classification.md | 608 ++ docs/source/ja/tasks/summarization.md | 409 + docs/source/ja/tasks/text-to-speech.md | 638 ++ docs/source/ja/tasks/token_classification.md | 565 ++ docs/source/ja/tasks/translation.md | 417 + docs/source/ja/tasks/video_classification.md | 503 ++ .../ja/tasks/visual_question_answering.md | 405 + .../tasks/zero_shot_image_classification.md | 148 + .../ja/tasks/zero_shot_object_detection.md | 310 + docs/source/ja/tasks_explained.md | 301 + docs/source/ja/testing.md | 1214 +++ docs/source/ja/tf_xla.md | 179 + docs/source/ja/tflite.md | 58 + docs/source/ja/tokenizer_summary.md | 179 + docs/source/ja/torchscript.md | 177 + docs/source/ja/training.md | 434 + docs/source/ja/transformers_agents.md | 282 + docs/source/ja/troubleshooting.md | 195 + docs/source/ko/_config.py | 2 +- docs/source/ko/_toctree.yml | 680 +- docs/source/ko/accelerate.md | 136 + docs/source/ko/add_new_model.md | 630 ++ docs/source/ko/add_new_pipeline.md | 248 + docs/source/ko/add_tensorflow_model.md | 262 + docs/source/ko/attention.md | 54 + docs/source/ko/autoclass_tutorial.md | 144 + docs/source/ko/bertology.md | 41 + docs/source/ko/big_models.md | 122 + docs/source/ko/community.md | 69 + docs/source/ko/contributing.md | 332 + docs/source/ko/create_a_model.md | 388 + docs/source/ko/custom_models.md | 346 + docs/source/ko/custom_tools.md | 748 ++ docs/source/ko/debugging.md | 306 + docs/source/ko/fast_tokenizers.md | 71 + docs/source/ko/hpo_train.md | 124 + docs/source/ko/in_translation.md | 5 + docs/source/ko/in_translation.mdx | 1 - docs/source/ko/{index.mdx => index.md} | 15 +- .../ko/{installation.mdx => installation.md} | 16 +- docs/source/ko/llm_tutorial.md | 222 + docs/source/ko/model_doc/llama.md | 117 + docs/source/ko/model_doc/llama2.md | 129 + docs/source/ko/model_doc/whisper.md | 128 + docs/source/ko/model_memory_anatomy.md | 242 + docs/source/ko/model_sharing.md | 232 + docs/source/ko/model_summary.md | 107 + docs/source/ko/multilingual.md | 192 + docs/source/ko/pad_truncation.md | 68 + docs/source/ko/peft.md | 209 + docs/source/ko/perf_hardware.md | 156 + docs/source/ko/perf_infer_cpu.md | 73 + docs/source/ko/perf_infer_gpu_many.md | 27 + docs/source/ko/perf_infer_gpu_one.md | 184 + docs/source/ko/perf_train_cpu.md | 67 + 
docs/source/ko/perf_train_cpu_many.md | 134 + docs/source/ko/perf_train_gpu_many.md | 533 ++ docs/source/ko/perf_train_tpu_tf.md | 162 + docs/source/ko/performance.md | 96 + docs/source/ko/perplexity.md | 135 + docs/source/ko/philosophy.md | 66 + docs/source/ko/pipeline_tutorial.md | 243 + docs/source/ko/pipeline_webserver.md | 144 + docs/source/ko/pr_checks.md | 200 + docs/source/ko/preprocessing.md | 539 ++ docs/source/ko/quicktour.md | 557 ++ docs/source/ko/quicktour.mdx | 536 -- docs/source/ko/run_scripts.md | 375 + docs/source/ko/sagemaker.md | 28 + docs/source/ko/serialization.md | 181 + docs/source/ko/task_summary.md | 341 + docs/source/ko/tasks/asr.md | 380 + docs/source/ko/tasks/audio_classification.md | 329 + .../ko/tasks/document_question_answering.md | 482 ++ docs/source/ko/tasks/image_captioning.md | 281 + docs/source/ko/tasks/image_classification.md | 546 ++ docs/source/ko/tasks/language_modeling.md | 417 + .../ko/tasks/masked_language_modeling.md | 448 + .../ko/tasks/monocular_depth_estimation.md | 149 + docs/source/ko/tasks/multiple_choice.md | 465 ++ docs/source/ko/tasks/object_detection.md | 588 ++ docs/source/ko/tasks/question_answering.md | 428 + docs/source/ko/tasks/semantic_segmentation.md | 591 ++ .../ko/tasks/sequence_classification.md | 395 + docs/source/ko/tasks/summarization.md | 418 + docs/source/ko/tasks/token_classification.md | 560 ++ docs/source/ko/tasks/translation.md | 409 + docs/source/ko/tasks/video_classification.md | 498 ++ .../ko/tasks/visual_question_answering.md | 375 + .../tasks/zero_shot_image_classification.md | 144 + .../ko/tasks/zero_shot_object_detection.md | 307 + docs/source/ko/tasks_explained.md | 295 + docs/source/ko/testing.md | 1278 +++ docs/source/ko/tf_xla.md | 174 + docs/source/ko/tflite.md | 62 + docs/source/ko/tokenizer_summary.md | 253 + docs/source/ko/torchscript.md | 189 + docs/source/ko/training.md | 428 + docs/source/ko/transformers_agents.md | 328 + docs/source/ko/troubleshooting.md | 198 + docs/source/ms/_toctree.yml | 688 ++ docs/source/{en/index.mdx => ms/index.md} | 129 +- docs/source/pt/_config.py | 2 +- .../pt/{accelerate.mdx => accelerate.md} | 4 + ...ls.mdx => converting_tensorflow_models.md} | 22 +- .../{create_a_model.mdx => create_a_model.md} | 28 +- .../{custom_models.mdx => custom_models.md} | 4 + ...fast_tokenizers.mdx => fast_tokenizers.md} | 4 + docs/source/pt/{index.mdx => index.md} | 17 +- .../pt/{installation.mdx => installation.md} | 14 +- .../pt/{multilingual.mdx => multilingual.md} | 38 +- ...line_tutorial.mdx => pipeline_tutorial.md} | 12 +- .../source/pt/{quicktour.mdx => quicktour.md} | 14 +- .../pt/{run_scripts.mdx => run_scripts.md} | 32 +- .../{serialization.mdx => serialization.md} | 29 +- ...ication.mdx => sequence_classification.md} | 16 +- ...sification.mdx => token_classification.md} | 16 +- docs/source/pt/{training.mdx => training.md} | 24 +- docs/source/te/_toctree.yml | 6 + docs/source/te/index.md | 298 + docs/source/te/quicktour.md | 557 ++ docs/source/tr/_toctree.yml | 4 + docs/source/tr/index.md | 295 + docs/source/zh/_toctree.yml | 135 +- docs/source/zh/accelerate.md | 132 + docs/source/zh/autoclass_tutorial.md | 149 + docs/source/zh/big_models.md | 123 + docs/source/zh/contributing.md | 331 + docs/source/zh/create_a_model.md | 389 + docs/source/zh/custom_models.md | 305 + docs/source/zh/debugging.md | 308 + docs/source/zh/fast_tokenizers.md | 67 + docs/source/zh/hpo_train.md | 139 + docs/source/zh/{index.mdx => index.md} | 51 +- docs/source/zh/installation.md | 256 + 
docs/source/zh/internal/audio_utils.md | 40 + docs/source/zh/internal/file_utils.md | 50 + docs/source/zh/internal/generation_utils.md | 352 + .../zh/internal/image_processing_utils.md | 48 + docs/source/zh/internal/modeling_utils.md | 83 + docs/source/zh/internal/pipelines_utils.md | 45 + docs/source/zh/internal/time_series_utils.md | 31 + docs/source/zh/internal/tokenization_utils.md | 43 + docs/source/zh/internal/trainer_utils.md | 50 + docs/source/zh/llm_tutorial.md | 269 + docs/source/zh/main_classes/agent.md | 101 + docs/source/zh/main_classes/callback.md | 125 + docs/source/zh/main_classes/configuration.md | 28 + docs/source/zh/main_classes/data_collator.md | 65 + docs/source/zh/main_classes/deepspeed.md | 2100 +++++ .../zh/main_classes/feature_extractor.md | 39 + .../source/zh/main_classes/image_processor.md | 34 + .../source/zh/main_classes/keras_callbacks.md | 27 + docs/source/zh/main_classes/logging.md | 107 + docs/source/zh/main_classes/model.md | 137 + docs/source/zh/main_classes/onnx.md | 50 + .../zh/main_classes/optimizer_schedules.md | 77 + docs/source/zh/main_classes/output.md | 309 + docs/source/zh/main_classes/pipelines.md | 480 ++ docs/source/zh/main_classes/processors.md | 146 + docs/source/zh/main_classes/quantization.md | 572 ++ .../source/zh/main_classes/text_generation.md | 58 + docs/source/zh/main_classes/tokenizer.md | 65 + docs/source/zh/main_classes/trainer.md | 665 ++ docs/source/zh/model_sharing.md | 238 + docs/source/zh/multilingual.md | 178 + docs/source/zh/peft.md | 215 + docs/source/zh/perf_hardware.md | 156 + docs/source/zh/perf_torch_compile.md | 362 + docs/source/zh/performance.md | 63 + docs/source/zh/pipeline_tutorial.md | 308 + docs/source/zh/preprocessing.md | 541 ++ docs/source/zh/quicktour.md | 547 ++ docs/source/zh/quicktour.mdx | 538 -- docs/source/zh/run_scripts.md | 359 + docs/source/zh/serialization.md | 181 + docs/source/zh/task_summary.md | 347 + docs/source/zh/tf_xla.md | 179 + docs/source/zh/tflite.md | 54 + docs/source/zh/tokenizer_summary.md | 234 + docs/source/zh/training.md | 407 + docs/source/zh/transformers_agents.md | 285 + examples/README.md | 40 +- examples/flax/_tests_requirements.txt | 6 +- examples/flax/conftest.py | 2 +- examples/flax/image-captioning/README.md | 6 +- ...reate_model_from_encoder_decoder_models.py | 4 +- .../run_image_captioning_flax.py | 87 +- examples/flax/language-modeling/README.md | 20 +- .../language-modeling/run_bart_dlm_flax.py | 111 +- .../flax/language-modeling/run_clm_flax.py | 123 +- .../flax/language-modeling/run_mlm_flax.py | 109 +- .../flax/language-modeling/run_t5_mlm_flax.py | 102 +- examples/flax/question-answering/README.md | 4 +- examples/flax/question-answering/run_qa.py | 91 +- examples/flax/speech-recognition/README.md | 68 + .../flax/speech-recognition/requirements.txt | 8 + .../run_flax_speech_recognition_seq2seq.py | 859 ++ examples/flax/summarization/README.md | 2 +- .../summarization/run_summarization_flax.py | 122 +- examples/flax/test_flax_examples.py | 45 +- examples/flax/text-classification/README.md | 2 +- .../flax/text-classification/run_flax_glue.py | 93 +- examples/flax/token-classification/README.md | 2 +- .../flax/token-classification/run_flax_ner.py | 86 +- examples/flax/vision/requirements.txt | 4 +- .../flax/vision/run_image_classification.py | 66 +- .../benchmarking/README.md | 4 +- .../benchmarking/plot_csv_file.py | 6 +- .../benchmarking/requirements.txt | 0 .../benchmarking/run_benchmark.py | 0 .../multiple_choice/utils_multiple_choice.py | 4 +- 
.../pytorch-lightning/lightning_base.py | 2 +- .../legacy/pytorch-lightning/requirements.txt | 2 +- examples/legacy/pytorch-lightning/run_glue.py | 6 +- examples/legacy/pytorch-lightning/run_ner.py | 8 +- examples/legacy/question-answering/README.md | 16 +- .../legacy/question-answering/run_squad.py | 14 +- examples/legacy/run_camembert.py | 4 +- examples/legacy/run_language_modeling.py | 4 +- examples/legacy/run_openai_gpt.py | 6 +- examples/legacy/run_swag.py | 14 +- examples/legacy/run_transfo_xl.py | 2 +- examples/legacy/seq2seq/README.md | 14 +- examples/legacy/seq2seq/old_test_datasets.py | 2 +- examples/legacy/seq2seq/pack_dataset.py | 2 +- examples/legacy/seq2seq/requirements.txt | 2 +- .../legacy/seq2seq/run_distributed_eval.py | 8 +- examples/legacy/seq2seq/run_eval.py | 6 +- examples/legacy/seq2seq/run_eval_search.py | 6 +- examples/legacy/seq2seq/seq2seq_trainer.py | 16 +- .../legacy/seq2seq/seq2seq_training_args.py | 4 +- examples/legacy/seq2seq/test_data/test_data | 1 - examples/legacy/seq2seq/utils.py | 2 +- .../run_tf_text_classification.py | 313 - .../legacy/token-classification/README.md | 8 +- .../legacy/token-classification/run_ner.py | 2 +- .../legacy/token-classification/run_tf_ner.py | 310 - .../legacy/token-classification/utils_ner.py | 2 +- examples/pytorch/README.md | 35 +- examples/pytorch/_tests_requirements.txt | 4 +- .../run_audio_classification.py | 79 +- examples/pytorch/conftest.py | 2 +- .../pytorch/contrastive-image-text/README.md | 4 +- .../contrastive-image-text/run_clip.py | 80 +- .../pytorch/image-classification/README.md | 7 +- .../image-classification/requirements.txt | 2 +- .../run_image_classification.py | 109 +- .../run_image_classification_no_trainer.py | 125 +- examples/pytorch/image-pretraining/README.md | 2 +- examples/pytorch/image-pretraining/run_mae.py | 44 +- examples/pytorch/image-pretraining/run_mim.py | 52 +- .../image-pretraining/run_mim_no_trainer.py | 810 ++ examples/pytorch/language-modeling/README.md | 22 +- examples/pytorch/language-modeling/run_clm.py | 111 +- .../language-modeling/run_clm_no_trainer.py | 138 +- examples/pytorch/language-modeling/run_mlm.py | 90 +- .../language-modeling/run_mlm_no_trainer.py | 115 +- examples/pytorch/language-modeling/run_plm.py | 76 +- examples/pytorch/multiple-choice/README.md | 8 +- examples/pytorch/multiple-choice/run_swag.py | 77 +- .../multiple-choice/run_swag_no_trainer.py | 93 +- ...a_examples.py => old_test_xla_examples.py} | 2 +- examples/pytorch/question-answering/README.md | 12 +- examples/pytorch/question-answering/run_qa.py | 69 +- .../question-answering/run_qa_beam_search.py | 55 +- .../run_qa_beam_search_no_trainer.py | 73 +- .../question-answering/run_qa_no_trainer.py | 100 +- .../question-answering/run_seq2seq_qa.py | 70 +- .../question-answering/trainer_seq2seq_qa.py | 13 +- .../run_semantic_segmentation.py | 57 +- .../run_semantic_segmentation_no_trainer.py | 93 +- .../run_wav2vec2_pretraining_no_trainer.py | 16 +- examples/pytorch/speech-recognition/README.md | 123 +- .../run_speech_recognition_ctc.py | 111 +- .../run_speech_recognition_ctc_adapter.py | 836 ++ .../run_speech_recognition_seq2seq.py | 100 +- examples/pytorch/summarization/README.md | 14 +- .../summarization/run_summarization.py | 140 +- .../run_summarization_no_trainer.py | 133 +- examples/pytorch/test_accelerate_examples.py | 67 +- examples/pytorch/test_pytorch_examples.py | 116 +- .../pytorch/text-classification/README.md | 59 +- .../text-classification/run_classification.py | 763 ++ 
.../pytorch/text-classification/run_glue.py | 86 +- .../run_glue_no_trainer.py | 80 +- .../pytorch/text-classification/run_xnli.py | 61 +- examples/pytorch/text-generation/README.md | 6 +- .../pytorch/text-generation/requirements.txt | 1 + .../pytorch/text-generation/run_generation.py | 193 +- .../run_generation_contrastive_search.py | 33 +- .../pytorch/token-classification/README.md | 8 +- .../pytorch/token-classification/run_ner.py | 70 +- .../run_ner_no_trainer.py | 103 +- examples/pytorch/translation/README.md | 10 +- .../pytorch/translation/run_translation.py | 89 +- .../translation/run_translation_no_trainer.py | 89 +- examples/research_projects/README.md | 2 +- .../bert-loses-patience/README.md | 2 +- .../pabee/modeling_pabee_albert.py | 6 +- .../pabee/modeling_pabee_bert.py | 4 +- .../run_glue_with_pabee.py | 14 +- .../test_run_glue_with_pabee.py | 2 +- examples/research_projects/bertabs/README.md | 2 +- ...ert_bertabs_original_pytorch_checkpoint.py | 2 +- .../bertabs/modeling_bertabs.py | 4 +- .../bertabs/run_summarization.py | 2 +- .../bertology/run_bertology.py | 6 +- .../bertology/run_prune_gpt.py | 6 +- .../research_projects/codeparrot/README.md | 8 +- .../codeparrot/scripts/arguments.py | 4 +- .../codeparrot/scripts/codeparrot_training.py | 4 +- .../codeparrot/scripts/human_eval.py | 2 +- .../scripts/minhash_deduplication.py | 4 +- .../codeparrot/scripts/preprocessing.py | 8 +- .../codeparrot/scripts/pretokenizing.py | 2 +- .../decision_transformer/requirements.txt | 28 +- examples/research_projects/deebert/README.md | 2 +- .../deebert/run_glue_deebert.py | 14 +- .../deebert/src/modeling_highway_bert.py | 6 +- .../deebert/test_glue_deebert.py | 12 +- .../research_projects/distillation/README.md | 2 +- .../distillation/grouped_batch_sampler.py | 2 +- .../distillation/requirements.txt | 2 +- .../distillation/run_squad_w_distillation.py | 14 +- .../research_projects/distillation/train.py | 2 +- .../fsner/src/fsner/tokenizer_utils.py | 4 +- .../information-gain-filtration/README.md | 4 +- .../information-gain-filtration/igf/igf.py | 4 +- .../run_clm_igf.py | 27 +- .../research_projects/jax-projects/README.md | 46 +- .../jax-projects/big_bird/README.md | 2 +- .../jax-projects/big_bird/bigbird_flax.py | 9 +- .../jax-projects/big_bird/evaluate.py | 6 +- .../big_bird/prepare_natural_questions.py | 14 +- .../jax-projects/dataset-streaming/README.md | 14 +- .../dataset-streaming/run_mlm_flax_stream.py | 6 +- .../jax-projects/hybrid_clip/README.md | 16 +- .../hybrid_clip/modeling_hybrid_clip.py | 6 +- .../hybrid_clip/run_hybrid_clip.py | 10 +- .../jax-projects/model_parallel/README.md | 4 +- .../jax-projects/model_parallel/partitions.py | 2 +- .../jax-projects/model_parallel/run_clm_mp.py | 21 +- .../jax-projects/wav2vec2/README.md | 8 +- .../layoutlmv3/run_funsd_cord.py | 14 +- .../research_projects/longform-qa/eli5_app.py | 6 +- .../longform-qa/eli5_utils.py | 12 +- .../luke/run_luke_ner_no_trainer.py | 25 +- .../lxmert/extracting_data.py | 2 +- .../lxmert/modeling_frcnn.py | 10 +- .../research_projects/lxmert/requirements.txt | 8 +- examples/research_projects/lxmert/utils.py | 6 +- examples/research_projects/mlm_wwm/README.md | 6 +- .../research_projects/mlm_wwm/run_mlm_wwm.py | 11 +- examples/research_projects/mm-imdb/README.md | 4 +- .../research_projects/mm-imdb/run_mmimdb.py | 14 +- .../movement-pruning/README.md | 10 +- .../movement-pruning/bertarize.py | 4 +- .../movement-pruning/counts_parameters.py | 4 +- .../emmental/modeling_bert_masked.py | 4 +- 
.../movement-pruning/masked_run_glue.py | 15 +- .../movement-pruning/masked_run_squad.py | 12 +- .../bart_onnx/reduce_onnx_size.py | 6 +- .../research_projects/performer/README.md | 4 +- .../performer/run_mlm_performer.py | 11 +- examples/research_projects/pplm/run_pplm.py | 14 +- .../pplm/run_pplm_discrim_train.py | 13 +- .../quantization-qdqbert/README.md | 56 +- .../evaluate-hf-trt-qa.py | 8 +- .../quantization-qdqbert/quant_trainer.py | 10 +- .../quantization-qdqbert/run_quant_qa.py | 14 +- .../rag-end2end-retriever/finetune_rag.py | 18 +- .../rag-end2end-retriever/lightning_base.py | 2 +- .../use_own_knowledge_dataset.py | 2 +- .../rag-end2end-retriever/utils_rag.py | 2 +- examples/research_projects/rag/README.md | 4 +- .../research_projects/rag/finetune_rag.py | 16 +- .../research_projects/rag/lightning_base.py | 2 +- .../research_projects/rag/requirements.txt | 2 +- .../rag/use_own_knowledge_dataset.py | 2 +- examples/research_projects/rag/utils_rag.py | 2 +- .../robust-speech-event/README.md | 6 +- .../robust-speech-event/eval.py | 2 +- .../run_speech_recognition_ctc_bnb.py | 22 +- .../run_speech_recognition_ctc_streaming.py | 20 +- .../seq2seq-distillation/README.md | 2 +- .../_test_seq2seq_examples.py | 98 +- .../_test_seq2seq_examples_multi_gpu.py | 40 +- .../seq2seq-distillation/finetune.py | 16 +- .../seq2seq-distillation/lightning_base.py | 2 +- .../seq2seq-distillation/make_student.py | 10 +- .../seq2seq-distillation/run_eval.py | 4 +- .../seq2seq-distillation/utils.py | 2 +- .../tapex/run_tabfact_with_tapex.py | 10 +- .../tapex/run_wikisql_with_tapex.py | 12 +- .../run_wikitablequestions_with_tapex.py | 12 +- .../research_projects/tapex/wikisql_utils.py | 4 +- .../visual_bert/extracting_data.py | 2 +- .../visual_bert/modeling_frcnn.py | 10 +- .../visual_bert/requirements.txt | 8 +- .../research_projects/visual_bert/utils.py | 6 +- .../research_projects/vqgan-clip/README.md | 6 +- .../vqgan-clip/VQGAN_CLIP.py | 4 +- .../research_projects/vqgan-clip/loaders.py | 2 +- .../wav2vec2/FINE_TUNE_XLSR_WAV2VEC2.md | 8 +- examples/research_projects/wav2vec2/README.md | 12 +- .../research_projects/wav2vec2/run_asr.py | 4 +- .../wav2vec2/run_common_voice.py | 10 +- .../wav2vec2/test_wav2vec2_deepspeed.py | 4 +- .../xtreme-s/run_xtreme_s.py | 28 +- .../zero-shot-distillation/README.md | 2 +- .../distill_classifier.py | 10 +- examples/run_on_remote.py | 71 + examples/tensorflow/README.md | 8 +- examples/tensorflow/_tests_requirements.txt | 6 +- examples/tensorflow/benchmarking/README.md | 4 +- .../tensorflow/benchmarking/plot_csv_file.py | 6 +- .../contrastive-image-text/README.md | 81 + .../contrastive-image-text/requirements.txt | 2 + .../contrastive-image-text/run_clip.py | 626 ++ .../tensorflow/image-classification/README.md | 4 +- .../run_image_classification.py | 66 +- .../language-modeling-tpu/README.md | 110 + .../prepare_tfrecord_shards.py | 181 + .../language-modeling-tpu/requirements.txt | 3 + .../language-modeling-tpu/run_mlm.py | 323 + .../language-modeling-tpu/train_unigram.py | 119 + .../tensorflow/language-modeling/README.md | 16 +- .../tensorflow/language-modeling/run_clm.py | 111 +- .../tensorflow/language-modeling/run_mlm.py | 108 +- examples/tensorflow/multiple-choice/README.md | 2 +- .../tensorflow/multiple-choice/run_swag.py | 85 +- .../tensorflow/question-answering/README.md | 4 +- .../tensorflow/question-answering/run_qa.py | 81 +- .../summarization/run_summarization.py | 114 +- .../tensorflow/test_tensorflow_examples.py | 30 +- 
.../tensorflow/text-classification/README.md | 10 +- .../text-classification/run_glue.py | 62 +- .../run_text_classification.py | 81 +- .../tensorflow/token-classification/README.md | 4 +- .../token-classification/run_ner.py | 99 +- examples/tensorflow/translation/README.md | 4 +- .../tensorflow/translation/run_translation.py | 105 +- hubconf.py | 29 +- notebooks/README.md | 16 +- pyproject.toml | 29 +- scripts/benchmark/trainer-benchmark.py | 2 +- scripts/check_tokenizers.py | 8 +- scripts/fsmt/fsmt-make-super-tiny-model.py | 9 +- scripts/fsmt/fsmt-make-tiny-model.py | 14 +- scripts/fsmt/gen-card-allenai-wmt16.py | 1 + scripts/fsmt/gen-card-allenai-wmt19.py | 1 + scripts/fsmt/gen-card-facebook-wmt19.py | 3 +- .../pegasus/build_test_sample_spm_no_bos.py | 5 +- scripts/stale.py | 38 +- scripts/tatoeba/README.md | 4 +- setup.cfg | 17 - setup.py | 124 +- src/transformers/__init__.py | 3257 +++++++- src/transformers/activations.py | 58 +- src/transformers/activations_tf.py | 31 +- src/transformers/audio_utils.py | 682 +- .../benchmark/benchmark_args_utils.py | 11 +- src/transformers/benchmark/benchmark_tf.py | 33 +- src/transformers/benchmark/benchmark_utils.py | 11 +- src/transformers/cache_utils.py | 417 + src/transformers/commands/add_new_model.py | 4 +- .../commands/add_new_model_like.py | 184 +- src/transformers/commands/convert.py | 23 +- src/transformers/commands/download.py | 18 +- src/transformers/commands/env.py | 53 +- src/transformers/commands/pt_to_tf.py | 46 +- src/transformers/commands/serving.py | 2 +- src/transformers/commands/train.py | 2 +- src/transformers/configuration_utils.py | 257 +- src/transformers/convert_graph_to_onnx.py | 42 +- .../convert_pytorch_checkpoint_to_tf2.py | 22 +- src/transformers/convert_slow_tokenizer.py | 315 +- ...nvert_tf_hub_seq_to_seq_bert_to_pytorch.py | 2 +- src/transformers/data/data_collator.py | 88 +- src/transformers/data/metrics/__init__.py | 3 +- .../data/test_generation_utils.py | 99 - src/transformers/deepspeed.py | 410 +- src/transformers/dependency_versions_check.py | 28 +- src/transformers/dependency_versions_table.py | 51 +- src/transformers/dynamic_module_utils.py | 307 +- .../feature_extraction_sequence_utils.py | 4 +- src/transformers/feature_extraction_utils.py | 124 +- src/transformers/file_utils.py | 11 +- src/transformers/generation/__init__.py | 53 +- .../generation/beam_constraints.py | 9 +- src/transformers/generation/beam_search.py | 154 +- .../generation/candidate_generator.py | 409 + .../generation/configuration_utils.py | 375 +- .../generation/flax_logits_process.py | 197 +- src/transformers/generation/flax_utils.py | 201 +- src/transformers/generation/logits_process.py | 1661 +++- .../generation/stopping_criteria.py | 35 +- src/transformers/generation/streamers.py | 227 + .../generation/tf_logits_process.py | 15 +- src/transformers/generation/tf_utils.py | 306 +- src/transformers/generation/utils.py | 2270 +++-- src/transformers/generation_flax_utils.py | 2 +- src/transformers/generation_tf_utils.py | 2 +- src/transformers/generation_utils.py | 2 +- src/transformers/hf_argparser.py | 36 +- src/transformers/hyperparameter_search.py | 141 + src/transformers/image_processing_utils.py | 231 +- src/transformers/image_transforms.py | 186 +- src/transformers/image_utils.py | 172 +- src/transformers/integrations/__init__.py | 148 + src/transformers/integrations/aqlm.py | 99 + src/transformers/integrations/awq.py | 374 + src/transformers/integrations/bitsandbytes.py | 314 + src/transformers/integrations/deepspeed.py | 438 + 
.../integration_utils.py} | 582 +- src/transformers/integrations/peft.py | 476 ++ src/transformers/integrations/tpu.py | 36 + src/transformers/keras_callbacks.py | 35 +- .../cpu/ms_deform_attn_cpu.cpp | 0 .../deformable_detr}/cpu/ms_deform_attn_cpu.h | 0 .../cuda/ms_deform_attn_cuda.cu | 156 + .../cuda/ms_deform_attn_cuda.cuh | 1467 ++++ .../cuda/ms_deform_attn_cuda.h | 0 .../cuda/ms_deform_im2col_cuda.cuh | 0 .../deformable_detr}/ms_deform_attn.h | 0 .../deformable_detr}/vision.cpp | 0 .../kernels/deta/cpu/ms_deform_attn_cpu.cpp | 40 + .../kernels/deta/cpu/ms_deform_attn_cpu.h | 32 + .../deta}/cuda/ms_deform_attn_cuda.cu | 0 .../deta}/cuda/ms_deform_attn_cuda.cuh | 0 .../kernels/deta/cuda/ms_deform_attn_cuda.h | 29 + .../deta/cuda/ms_deform_im2col_cuda.cuh | 1327 +++ .../kernels/deta/ms_deform_attn.h | 61 + src/transformers/kernels/deta/vision.cpp | 16 + src/transformers/kernels/mra/cuda_kernel.cu | 383 + src/transformers/kernels/mra/cuda_kernel.h | 59 + src/transformers/kernels/mra/cuda_launch.cu | 154 + src/transformers/kernels/mra/cuda_launch.h | 39 + .../kernels/mra/torch_extension.cpp | 78 + src/transformers/kernels/rwkv/wkv_cuda.cu | 187 + .../kernels/rwkv/wkv_cuda_bf16.cu | 186 + src/transformers/kernels/rwkv/wkv_op.cpp | 66 + .../{models => kernels}/yoso/common.h | 0 .../{models => kernels}/yoso/common_cuda.h | 0 .../yoso/common_cuda_device.h | 0 .../yoso/fast_lsh_cumulation.cu | 0 .../yoso/fast_lsh_cumulation.h | 0 .../yoso/fast_lsh_cumulation_cuda.cu | 0 .../yoso/fast_lsh_cumulation_cuda.h | 0 .../yoso/fast_lsh_cumulation_torch.cpp | 0 src/transformers/modelcard.py | 39 +- src/transformers/modeling_attn_mask_utils.py | 500 ++ src/transformers/modeling_flax_outputs.py | 58 + .../modeling_flax_pytorch_utils.py | 223 +- src/transformers/modeling_flax_utils.py | 320 +- src/transformers/modeling_outputs.py | 526 +- src/transformers/modeling_tf_outputs.py | 226 +- src/transformers/modeling_tf_pytorch_utils.py | 186 +- src/transformers/modeling_tf_utils.py | 896 +- src/transformers/modeling_utils.py | 1929 ++++- src/transformers/models/__init__.py | 69 +- .../models/albert/configuration_albert.py | 24 +- ...lbert_original_tf_checkpoint_to_pytorch.py | 4 +- .../models/albert/modeling_albert.py | 57 +- .../models/albert/modeling_flax_albert.py | 18 +- .../models/albert/modeling_tf_albert.py | 510 +- .../models/albert/tokenization_albert.py | 54 +- .../models/albert/tokenization_albert_fast.py | 53 +- src/transformers/models/align/__init__.py | 73 + .../models/align/configuration_align.py | 384 + .../models/align/convert_align_tf_to_hf.py | 389 + .../models/align/modeling_align.py | 1636 ++++ .../models/align/processing_align.py | 122 + .../models/altclip/configuration_altclip.py | 111 +- .../models/altclip/modeling_altclip.py | 67 +- .../models/altclip/processing_altclip.py | 6 +- .../audio_spectrogram_transformer/__init__.py | 21 +- ...iguration_audio_spectrogram_transformer.py | 7 +- ...xtraction_audio_spectrogram_transformer.py | 79 +- .../modeling_audio_spectrogram_transformer.py | 30 +- src/transformers/models/auto/__init__.py | 54 + src/transformers/models/auto/auto_factory.py | 226 +- .../models/auto/configuration_auto.py | 289 +- .../models/auto/feature_extraction_auto.py | 103 +- .../models/auto/image_processing_auto.py | 129 +- src/transformers/models/auto/modeling_auto.py | 344 +- .../models/auto/modeling_flax_auto.py | 25 +- .../models/auto/modeling_tf_auto.py | 95 +- .../models/auto/processing_auto.py | 140 +- .../models/auto/tokenization_auto.py | 244 +- 
.../models/autoformer/__init__.py | 63 + .../autoformer/configuration_autoformer.py | 246 + .../models/autoformer/modeling_autoformer.py | 2117 +++++ src/transformers/models/bark/__init__.py | 79 + .../models/bark/configuration_bark.py | 330 + .../models/bark/convert_suno_to_hf.py | 262 + .../bark/generation_configuration_bark.py | 331 + src/transformers/models/bark/modeling_bark.py | 1910 +++++ .../models/bark/processing_bark.py | 286 + src/transformers/models/bart/__init__.py | 2 + .../models/bart/configuration_bart.py | 1 + ..._original_pytorch_checkpoint_to_pytorch.py | 13 +- src/transformers/models/bart/modeling_bart.py | 656 +- .../models/bart/modeling_flax_bart.py | 17 +- .../models/bart/modeling_tf_bart.py | 491 +- .../models/bart/tokenization_bart.py | 34 +- .../models/bart/tokenization_bart_fast.py | 15 +- .../models/barthez/tokenization_barthez.py | 29 +- .../barthez/tokenization_barthez_fast.py | 5 +- .../models/bartpho/tokenization_bartpho.py | 26 +- src/transformers/models/beit/__init__.py | 2 + .../models/beit/configuration_beit.py | 46 +- .../beit/convert_beit_unilm_to_pytorch.py | 29 +- .../models/beit/image_processing_beit.py | 181 +- src/transformers/models/beit/modeling_beit.py | 199 +- .../models/beit/modeling_flax_beit.py | 11 +- .../models/bert/configuration_bert.py | 45 +- ..._bert_pytorch_checkpoint_to_original_tf.py | 6 +- src/transformers/models/bert/modeling_bert.py | 104 +- .../models/bert/modeling_flax_bert.py | 23 +- .../models/bert/modeling_tf_bert.py | 685 +- .../models/bert/tokenization_bert.py | 163 +- .../models/bert/tokenization_bert_fast.py | 154 +- .../models/bert/tokenization_bert_tf.py | 26 +- .../configuration_bert_generation.py | 11 +- .../modeling_bert_generation.py | 54 +- .../tokenization_bert_generation.py | 18 +- .../tokenization_bert_japanese.py | 86 +- .../models/bertweet/tokenization_bertweet.py | 59 +- .../models/big_bird/configuration_big_bird.py | 3 +- .../models/big_bird/modeling_big_bird.py | 112 +- .../models/big_bird/modeling_flax_big_bird.py | 174 +- .../models/big_bird/tokenization_big_bird.py | 32 +- .../big_bird/tokenization_big_bird_fast.py | 7 +- .../configuration_bigbird_pegasus.py | 1 + .../convert_bigbird_pegasus_tf_to_pytorch.py | 6 +- .../modeling_bigbird_pegasus.py | 231 +- src/transformers/models/biogpt/__init__.py | 4 + .../models/biogpt/configuration_biogpt.py | 12 +- .../models/biogpt/modeling_biogpt.py | 407 +- .../models/biogpt/tokenization_biogpt.py | 24 +- .../models/bit/configuration_bit.py | 28 +- .../models/bit/image_processing_bit.py | 193 +- src/transformers/models/bit/modeling_bit.py | 26 +- .../blenderbot/configuration_blenderbot.py | 1 + .../models/blenderbot/modeling_blenderbot.py | 225 +- .../blenderbot/modeling_flax_blenderbot.py | 15 +- .../blenderbot/modeling_tf_blenderbot.py | 361 +- .../blenderbot/tokenization_blenderbot.py | 92 +- .../tokenization_blenderbot_fast.py | 62 +- .../configuration_blenderbot_small.py | 1 + .../modeling_blenderbot_small.py | 237 +- .../modeling_flax_blenderbot_small.py | 13 +- .../modeling_tf_blenderbot_small.py | 372 +- .../tokenization_blenderbot_small.py | 30 +- .../tokenization_blenderbot_small_fast.py | 21 + src/transformers/models/blip/__init__.py | 42 +- .../models/blip/configuration_blip.py | 60 +- .../convert_blip_original_pytorch_to_hf.py | 2 +- .../models/blip/image_processing_blip.py | 168 +- src/transformers/models/blip/modeling_blip.py | 143 +- .../models/blip/modeling_blip_text.py | 197 +- .../models/blip/modeling_tf_blip.py | 1710 ++++ 
.../models/blip/modeling_tf_blip_text.py | 1122 +++ .../models/blip/processing_blip.py | 4 +- src/transformers/models/blip_2/__init__.py | 2 + .../models/blip_2/configuration_blip_2.py | 46 +- .../convert_blip_2_original_to_pytorch.py | 62 +- .../models/blip_2/modeling_blip_2.py | 581 +- .../models/blip_2/processing_blip_2.py | 8 +- src/transformers/models/bloom/__init__.py | 28 +- .../models/bloom/configuration_bloom.py | 6 +- ...rt_bloom_original_checkpoint_to_pytorch.py | 6 +- .../models/bloom/modeling_bloom.py | 141 +- .../models/bloom/modeling_flax_bloom.py | 734 ++ .../models/bloom/tokenization_bloom_fast.py | 61 +- .../models/bridgetower/__init__.py | 2 + .../bridgetower/configuration_bridgetower.py | 46 +- .../image_processing_bridgetower.py | 299 +- .../bridgetower/modeling_bridgetower.py | 299 +- .../bridgetower/processing_bridgetower.py | 1 + src/transformers/models/bros/__init__.py | 77 + .../models/bros/configuration_bros.py | 140 + .../models/bros/convert_bros_to_pytorch.py | 145 + src/transformers/models/bros/modeling_bros.py | 1320 +++ .../models/bros/processing_bros.py | 109 + .../models/byt5/tokenization_byt5.py | 70 +- .../camembert/configuration_camembert.py | 8 +- .../models/camembert/modeling_camembert.py | 107 +- .../models/camembert/modeling_tf_camembert.py | 562 +- .../camembert/tokenization_camembert.py | 199 +- .../camembert/tokenization_camembert_fast.py | 18 +- .../models/canine/configuration_canine.py | 9 +- .../models/canine/modeling_canine.py | 28 +- .../models/canine/tokenization_canine.py | 30 +- .../configuration_chinese_clip.py | 110 +- .../image_processing_chinese_clip.py | 177 +- .../chinese_clip/modeling_chinese_clip.py | 58 +- .../chinese_clip/processing_chinese_clip.py | 6 +- .../models/clap/configuration_clap.py | 36 +- .../convert_clap_original_pytorch_to_hf.py | 42 +- .../models/clap/feature_extraction_clap.py | 93 +- src/transformers/models/clap/modeling_clap.py | 76 +- .../models/clap/processing_clap.py | 1 + src/transformers/models/clip/__init__.py | 4 + .../models/clip/configuration_clip.py | 120 +- .../convert_clip_original_pytorch_to_hf.py | 6 +- .../models/clip/image_processing_clip.py | 198 +- src/transformers/models/clip/modeling_clip.py | 222 +- .../models/clip/modeling_flax_clip.py | 127 +- .../models/clip/modeling_tf_clip.py | 437 +- .../models/clip/processing_clip.py | 6 +- .../models/clip/tokenization_clip.py | 51 +- .../models/clip/tokenization_clip_fast.py | 14 +- .../models/clipseg/configuration_clipseg.py | 115 +- .../convert_clipseg_original_pytorch_to_hf.py | 6 +- .../models/clipseg/modeling_clipseg.py | 100 +- .../models/clipseg/processing_clipseg.py | 6 +- src/transformers/models/clvp/__init__.py | 83 + .../models/clvp/configuration_clvp.py | 457 ++ .../models/clvp/convert_clvp_to_hf.py | 234 + .../models/clvp/feature_extraction_clvp.py | 238 + src/transformers/models/clvp/modeling_clvp.py | 2024 +++++ .../models/clvp/number_normalizer.py | 238 + .../models/clvp/processing_clvp.py | 91 + .../models/clvp/tokenization_clvp.py | 379 + .../models/code_llama/__init__.py | 57 + .../code_llama/tokenization_code_llama.py | 522 ++ .../tokenization_code_llama_fast.py | 439 + .../models/codegen/configuration_codegen.py | 20 +- .../models/codegen/modeling_codegen.py | 145 +- .../models/codegen/tokenization_codegen.py | 58 +- .../codegen/tokenization_codegen_fast.py | 35 +- .../models/conditional_detr/__init__.py | 6 +- .../configuration_conditional_detr.py | 26 +- ..._original_pytorch_checkpoint_to_pytorch.py | 14 +- 
.../feature_extraction_conditional_detr.py | 10 + .../image_processing_conditional_detr.py | 586 +- .../modeling_conditional_detr.py | 444 +- .../models/convbert/configuration_convbert.py | 3 +- .../models/convbert/modeling_convbert.py | 45 +- .../models/convbert/modeling_tf_convbert.py | 462 +- .../models/convbert/tokenization_convbert.py | 63 +- .../convbert/tokenization_convbert_fast.py | 6 +- src/transformers/models/convnext/__init__.py | 2 +- .../models/convnext/configuration_convnext.py | 26 +- .../convnext/convert_convnext_to_pytorch.py | 12 +- .../convnext/image_processing_convnext.py | 158 +- .../models/convnext/modeling_convnext.py | 41 +- .../models/convnext/modeling_tf_convnext.py | 243 +- .../models/convnextv2/__init__.py | 97 + .../convnextv2/configuration_convnextv2.py | 118 + .../convert_convnextv2_to_pytorch.py | 286 + .../models/convnextv2/modeling_convnextv2.py | 576 ++ .../convnextv2/modeling_tf_convnextv2.py | 686 ++ .../models/cpm/tokenization_cpm.py | 39 +- .../models/cpm/tokenization_cpm_fast.py | 5 +- src/transformers/models/cpmant/__init__.py | 64 + .../models/cpmant/configuration_cpmant.py | 124 + .../models/cpmant/modeling_cpmant.py | 874 ++ .../models/cpmant/tokenization_cpmant.py | 278 + .../models/ctrl/configuration_ctrl.py | 8 +- src/transformers/models/ctrl/modeling_ctrl.py | 54 +- .../models/ctrl/modeling_tf_ctrl.py | 357 +- .../models/ctrl/tokenization_ctrl.py | 11 +- .../models/cvt/configuration_cvt.py | 1 + ..._original_pytorch_checkpoint_to_pytorch.py | 10 +- src/transformers/models/cvt/modeling_cvt.py | 4 +- .../models/cvt/modeling_tf_cvt.py | 332 +- .../data2vec/configuration_data2vec_audio.py | 6 + .../data2vec/configuration_data2vec_text.py | 1 + .../data2vec/configuration_data2vec_vision.py | 1 + ..._original_pytorch_checkpoint_to_pytorch.py | 8 +- ..._original_pytorch_checkpoint_to_pytorch.py | 8 +- .../data2vec/modeling_data2vec_audio.py | 97 +- .../models/data2vec/modeling_data2vec_text.py | 98 +- .../data2vec/modeling_data2vec_vision.py | 76 +- .../data2vec/modeling_tf_data2vec_vision.py | 681 +- .../models/deberta/configuration_deberta.py | 3 +- .../models/deberta/modeling_deberta.py | 47 +- .../models/deberta/modeling_tf_deberta.py | 451 +- .../models/deberta/tokenization_deberta.py | 66 +- .../deberta/tokenization_deberta_fast.py | 33 +- .../models/deberta_v2/__init__.py | 2 + .../deberta_v2/configuration_deberta_v2.py | 3 +- .../models/deberta_v2/modeling_deberta_v2.py | 61 +- .../deberta_v2/modeling_tf_deberta_v2.py | 569 +- .../deberta_v2/tokenization_deberta_v2.py | 48 +- .../tokenization_deberta_v2_fast.py | 5 +- .../configuration_decision_transformer.py | 4 +- .../modeling_decision_transformer.py | 80 +- .../models/deformable_detr/__init__.py | 6 +- .../configuration_deformable_detr.py | 32 +- .../convert_deformable_detr_to_pytorch.py | 14 +- .../feature_extraction_deformable_detr.py | 10 + .../image_processing_deformable_detr.py | 566 +- .../models/deformable_detr/load_custom.py | 10 +- .../modeling_deformable_detr.py | 340 +- .../models/deit/configuration_deit.py | 15 +- .../deit/convert_deit_timm_to_pytorch.py | 12 +- .../models/deit/image_processing_deit.py | 194 +- src/transformers/models/deit/modeling_deit.py | 40 +- .../models/deit/modeling_tf_deit.py | 392 +- .../models/{bort => deprecated}/__init__.py | 0 .../models/deprecated/bort}/__init__.py | 0 ...original_gluonnlp_checkpoint_to_pytorch.py | 2 +- .../models/{ => deprecated}/mctct/__init__.py | 21 +- .../mctct/configuration_mctct.py | 19 +- 
.../mctct/feature_extraction_mctct.py | 144 +- .../{ => deprecated}/mctct/modeling_mctct.py | 61 +- .../mctct/processing_mctct.py | 3 +- .../models/{ => deprecated}/mmbt/__init__.py | 2 +- .../mmbt/configuration_mmbt.py | 2 +- .../{ => deprecated}/mmbt/modeling_mmbt.py | 12 +- .../models/deprecated/open_llama/__init__.py | 95 + .../open_llama/configuration_open_llama.py | 168 + .../open_llama/modeling_open_llama.py | 968 +++ .../{ => deprecated}/retribert/__init__.py | 2 +- .../retribert/configuration_retribert.py | 5 +- .../retribert/modeling_retribert.py | 6 +- .../retribert/tokenization_retribert.py | 63 +- .../retribert/tokenization_retribert_fast.py | 6 +- .../models/{ => deprecated}/tapex/__init__.py | 2 +- .../tapex/tokenization_tapex.py | 49 +- .../trajectory_transformer/__init__.py | 2 +- .../configuration_trajectory_transformer.py | 5 +- ..._original_pytorch_checkpoint_to_pytorch.py | 0 .../modeling_trajectory_transformer.py | 32 +- .../{ => deprecated}/transfo_xl/__init__.py | 2 +- .../transfo_xl/configuration_transfo_xl.py | 14 +- ...fo_xl_original_tf_checkpoint_to_pytorch.py | 4 +- .../transfo_xl/modeling_tf_transfo_xl.py | 235 +- .../modeling_tf_transfo_xl_utilities.py | 5 +- .../transfo_xl/modeling_transfo_xl.py | 29 +- .../modeling_transfo_xl_utilities.py | 8 +- .../transfo_xl/tokenization_transfo_xl.py | 102 +- .../models/{ => deprecated}/van/__init__.py | 2 +- .../{ => deprecated}/van/configuration_van.py | 9 +- .../van/convert_van_to_pytorch.py | 14 +- .../{ => deprecated}/van/modeling_van.py | 14 +- .../models/depth_anything/__init__.py | 56 + .../configuration_depth_anything.py | 146 + .../convert_depth_anything_to_hf.py | 299 + .../depth_anything/modeling_depth_anything.py | 465 ++ .../models/deta/configuration_deta.py | 56 +- .../models/deta/image_processing_deta.py | 498 +- src/transformers/models/deta/modeling_deta.py | 391 +- src/transformers/models/detr/__init__.py | 6 +- .../models/detr/configuration_detr.py | 41 +- ..._original_pytorch_checkpoint_to_pytorch.py | 14 +- .../models/detr/convert_detr_to_pytorch.py | 42 +- .../models/detr/feature_extraction_detr.py | 10 + .../models/detr/image_processing_detr.py | 553 +- src/transformers/models/detr/modeling_detr.py | 405 +- .../models/dinat/configuration_dinat.py | 32 +- .../models/dinat/modeling_dinat.py | 50 +- src/transformers/models/dinov2/__init__.py | 61 + .../models/dinov2/configuration_dinov2.py | 176 + .../models/dinov2/convert_dinov2_to_hf.py | 287 + .../models/dinov2/modeling_dinov2.py | 860 ++ .../distilbert/configuration_distilbert.py | 1 + .../models/distilbert/modeling_distilbert.py | 282 +- .../distilbert/modeling_flax_distilbert.py | 9 +- .../distilbert/modeling_tf_distilbert.py | 371 +- .../distilbert/tokenization_distilbert.py | 58 +- .../tokenization_distilbert_fast.py | 2 +- .../dit/convert_dit_unilm_to_pytorch.py | 14 +- .../models/donut/configuration_donut_swin.py | 11 +- .../models/donut/convert_donut_to_pytorch.py | 6 +- .../models/donut/image_processing_donut.py | 204 +- .../models/donut/modeling_donut_swin.py | 40 +- .../models/donut/processing_donut.py | 12 +- .../models/dpr/configuration_dpr.py | 3 + ...vert_dpr_original_checkpoint_to_pytorch.py | 8 +- src/transformers/models/dpr/modeling_dpr.py | 43 +- .../models/dpr/modeling_tf_dpr.py | 184 +- .../models/dpr/tokenization_dpr.py | 9 +- .../models/dpr/tokenization_dpr_fast.py | 9 +- .../models/dpt/configuration_dpt.py | 107 +- .../models/dpt/convert_dinov2_depth_to_hf.py | 384 + .../models/dpt/convert_dpt_beit_to_hf.py | 306 + 
.../dpt/convert_dpt_hybrid_to_pytorch.py | 12 +- .../models/dpt/convert_dpt_swinv2_to_hf.py | 322 + .../models/dpt/convert_dpt_to_pytorch.py | 25 +- .../models/dpt/image_processing_dpt.py | 197 +- src/transformers/models/dpt/modeling_dpt.py | 185 +- .../models/efficientformer/__init__.py | 35 +- .../configuration_efficientformer.py | 8 +- ..._original_pytorch_checkpoint_to_pytorch.py | 14 +- .../image_processing_efficientformer.py | 163 +- .../modeling_efficientformer.py | 43 +- .../modeling_tf_efficientformer.py | 1196 +++ .../models/efficientnet/__init__.py | 84 + .../configuration_efficientnet.py | 170 + .../convert_efficientnet_to_pytorch.py | 339 + .../image_processing_efficientnet.py | 366 + .../efficientnet/modeling_efficientnet.py | 649 ++ .../models/electra/configuration_electra.py | 1 + .../models/electra/modeling_electra.py | 68 +- .../models/electra/modeling_flax_electra.py | 11 +- .../models/electra/modeling_tf_electra.py | 564 +- .../models/electra/tokenization_electra.py | 63 +- .../electra/tokenization_electra_fast.py | 6 +- src/transformers/models/encodec/__init__.py | 65 + .../models/encodec/configuration_encodec.py | 195 + .../convert_encodec_checkpoint_to_pytorch.py | 365 + .../encodec/feature_extraction_encodec.py | 206 + .../models/encodec/modeling_encodec.py | 806 ++ .../configuration_encoder_decoder.py | 19 +- .../modeling_encoder_decoder.py | 30 +- .../modeling_flax_encoder_decoder.py | 23 +- .../modeling_tf_encoder_decoder.py | 209 +- .../models/ernie/configuration_ernie.py | 5 +- .../models/ernie/modeling_ernie.py | 62 +- .../models/ernie_m/configuration_ernie_m.py | 18 +- .../models/ernie_m/modeling_ernie_m.py | 21 +- .../models/ernie_m/tokenization_ernie_m.py | 26 +- .../models/esm/configuration_esm.py | 1 + src/transformers/models/esm/convert_esm.py | 4 +- src/transformers/models/esm/modeling_esm.py | 78 +- .../models/esm/modeling_esmfold.py | 24 +- .../models/esm/modeling_tf_esm.py | 535 +- .../models/esm/openfold_utils/chunk_utils.py | 2 +- .../models/esm/tokenization_esm.py | 59 +- src/transformers/models/falcon/__init__.py | 68 + .../models/falcon/configuration_falcon.py | 192 + .../falcon/convert_custom_code_checkpoint.py | 74 + .../models/falcon/modeling_falcon.py | 1646 ++++ .../models/fastspeech2_conformer/__init__.py | 77 + .../configuration_fastspeech2_conformer.py | 488 ++ ..._original_pytorch_checkpoint_to_pytorch.py | 210 + .../fastspeech2_conformer/convert_hifigan.py | 134 + .../convert_model_with_hifigan.py | 102 + .../modeling_fastspeech2_conformer.py | 1686 ++++ .../tokenization_fastspeech2_conformer.py | 198 + .../models/flaubert/modeling_flaubert.py | 20 +- .../models/flaubert/modeling_tf_flaubert.py | 376 +- .../models/flaubert/tokenization_flaubert.py | 36 +- .../models/flava/configuration_flava.py | 185 +- .../models/flava/image_processing_flava.py | 162 +- .../models/flava/modeling_flava.py | 69 +- .../models/flava/processing_flava.py | 6 +- .../models/fnet/configuration_fnet.py | 5 +- src/transformers/models/fnet/modeling_fnet.py | 24 +- .../models/fnet/tokenization_fnet.py | 74 +- .../models/fnet/tokenization_fnet_fast.py | 13 +- src/transformers/models/focalnet/__init__.py | 59 + .../models/focalnet/configuration_focalnet.py | 165 + .../focalnet/convert_focalnet_to_hf_format.py | 237 + .../models/focalnet/modeling_focalnet.py | 1035 +++ .../models/fsmt/configuration_fsmt.py | 16 +- src/transformers/models/fsmt/modeling_fsmt.py | 43 +- .../models/fsmt/tokenization_fsmt.py | 41 +- .../models/funnel/configuration_funnel.py | 7 +- 
.../models/funnel/modeling_funnel.py | 16 +- .../models/funnel/modeling_tf_funnel.py | 416 +- .../models/funnel/tokenization_funnel.py | 66 +- .../models/funnel/tokenization_funnel_fast.py | 2 +- src/transformers/models/fuyu/__init__.py | 73 + .../models/fuyu/configuration_fuyu.py | 213 + .../fuyu/convert_fuyu_model_weights_to_hf.py | 134 + .../models/fuyu/image_processing_fuyu.py | 718 ++ src/transformers/models/fuyu/modeling_fuyu.py | 356 + .../models/fuyu/processing_fuyu.py | 695 ++ .../models/git/configuration_git.py | 14 +- .../models/git/convert_git_to_pytorch.py | 15 +- src/transformers/models/git/modeling_git.py | 136 +- src/transformers/models/git/processing_git.py | 1 + .../models/glpn/configuration_glpn.py | 7 +- .../models/glpn/convert_glpn_to_pytorch.py | 14 +- .../models/glpn/image_processing_glpn.py | 114 +- src/transformers/models/glpn/modeling_glpn.py | 6 +- src/transformers/models/gpt2/__init__.py | 2 + .../models/gpt2/configuration_gpt2.py | 27 +- .../models/gpt2/modeling_flax_gpt2.py | 8 +- src/transformers/models/gpt2/modeling_gpt2.py | 257 +- .../models/gpt2/modeling_tf_gpt2.py | 427 +- .../models/gpt2/tokenization_gpt2.py | 97 +- .../models/gpt2/tokenization_gpt2_fast.py | 96 +- .../models/gpt2/tokenization_gpt2_tf.py | 7 +- .../models/gpt_bigcode/__init__.py | 65 + .../gpt_bigcode/configuration_gpt_bigcode.py | 145 + .../gpt_bigcode/modeling_gpt_bigcode.py | 1511 ++++ src/transformers/models/gpt_neo/__init__.py | 4 + .../models/gpt_neo/configuration_gpt_neo.py | 36 +- .../models/gpt_neo/modeling_flax_gpt_neo.py | 2 +- .../models/gpt_neo/modeling_gpt_neo.py | 571 +- src/transformers/models/gpt_neox/__init__.py | 6 + .../models/gpt_neox/configuration_gpt_neox.py | 59 + .../models/gpt_neox/modeling_gpt_neox.py | 929 ++- .../gpt_neox/tokenization_gpt_neox_fast.py | 38 +- .../configuration_gpt_neox_japanese.py | 1 + .../modeling_gpt_neox_japanese.py | 71 +- .../tokenization_gpt_neox_japanese.py | 48 +- .../gpt_sw3/convert_megatron_to_pytorch.py | 2 +- .../models/gpt_sw3/tokenization_gpt_sw3.py | 111 +- .../models/gptj/configuration_gptj.py | 1 + .../models/gptj/modeling_flax_gptj.py | 5 +- src/transformers/models/gptj/modeling_gptj.py | 185 +- .../models/gptj/modeling_tf_gptj.py | 304 +- .../models/gptsan_japanese/__init__.py | 70 + .../configuration_gptsan_japanese.py | 159 + ...convert_gptsan_tf_checkpoint_to_pytorch.py | 181 + .../modeling_gptsan_japanese.py | 1345 +++ .../tokenization_gptsan_japanese.py | 541 ++ .../models/graphormer/collating_graphormer.py | 6 +- .../graphormer/configuration_graphormer.py | 5 + .../models/graphormer/modeling_graphormer.py | 134 +- .../models/groupvit/configuration_groupvit.py | 100 +- .../models/groupvit/modeling_groupvit.py | 90 +- .../models/groupvit/modeling_tf_groupvit.py | 617 +- .../models/herbert/tokenization_herbert.py | 60 +- .../models/hubert/configuration_hubert.py | 8 +- .../models/hubert/modeling_hubert.py | 141 +- .../models/hubert/modeling_tf_hubert.py | 439 +- .../models/ibert/configuration_ibert.py | 3 +- .../models/ibert/modeling_ibert.py | 20 +- src/transformers/models/idefics/__init__.py | 73 + .../models/idefics/configuration_idefics.py | 329 + .../idefics/image_processing_idefics.py | 168 + .../models/idefics/modeling_idefics.py | 1592 ++++ src/transformers/models/idefics/perceiver.py | 188 + .../models/idefics/processing_idefics.py | 414 + src/transformers/models/idefics/vision.py | 490 ++ .../imagegpt/image_processing_imagegpt.py | 111 +- .../models/imagegpt/modeling_imagegpt.py | 75 +- 
src/transformers/models/informer/__init__.py | 60 + .../models/informer/configuration_informer.py | 253 + .../models/informer/modeling_informer.py | 2049 +++++ .../models/instructblip/__init__.py | 69 + .../configuration_instructblip.py | 359 + ...onvert_instructblip_original_to_pytorch.py | 303 + .../instructblip/modeling_instructblip.py | 1555 ++++ .../instructblip/processing_instructblip.py | 173 + .../models/jukebox/configuration_jukebox.py | 35 +- .../models/jukebox/convert_jukebox.py | 18 +- .../models/jukebox/modeling_jukebox.py | 29 +- .../models/jukebox/tokenization_jukebox.py | 47 +- src/transformers/models/kosmos2/__init__.py | 64 + .../models/kosmos2/configuration_kosmos2.py | 299 + ..._original_pytorch_checkpoint_to_pytorch.py | 77 + .../models/kosmos2/modeling_kosmos2.py | 2056 +++++ .../models/kosmos2/processing_kosmos2.py | 666 ++ .../models/layoutlm/configuration_layoutlm.py | 18 +- .../models/layoutlm/modeling_layoutlm.py | 45 +- .../models/layoutlm/modeling_tf_layoutlm.py | 455 +- .../models/layoutlm/tokenization_layoutlm.py | 63 +- .../layoutlm/tokenization_layoutlm_fast.py | 6 +- .../layoutlmv2/configuration_layoutlmv2.py | 3 +- .../layoutlmv2/image_processing_layoutlmv2.py | 95 +- .../models/layoutlmv2/modeling_layoutlmv2.py | 66 +- .../layoutlmv2/processing_layoutlmv2.py | 6 +- .../layoutlmv2/tokenization_layoutlmv2.py | 71 +- .../layoutlmv3/configuration_layoutlmv3.py | 5 +- .../layoutlmv3/image_processing_layoutlmv3.py | 175 +- .../models/layoutlmv3/modeling_layoutlmv3.py | 71 +- .../layoutlmv3/modeling_tf_layoutlmv3.py | 524 +- .../layoutlmv3/processing_layoutlmv3.py | 6 +- .../layoutlmv3/tokenization_layoutlmv3.py | 45 +- .../tokenization_layoutlmv3_fast.py | 8 +- .../models/layoutxlm/processing_layoutxlm.py | 74 +- .../layoutxlm/tokenization_layoutxlm.py | 38 +- .../layoutxlm/tokenization_layoutxlm_fast.py | 5 +- .../models/led/configuration_led.py | 1 + src/transformers/models/led/modeling_led.py | 182 +- .../models/led/modeling_tf_led.py | 570 +- .../models/led/tokenization_led.py | 34 +- .../models/led/tokenization_led_fast.py | 14 +- .../models/levit/configuration_levit.py | 1 + .../levit/convert_levit_timm_to_pytorch.py | 10 +- .../models/levit/image_processing_levit.py | 158 +- .../models/levit/modeling_levit.py | 13 +- .../models/lilt/configuration_lilt.py | 1 + src/transformers/models/lilt/modeling_lilt.py | 45 +- src/transformers/models/llama/__init__.py | 114 + .../models/llama/configuration_llama.py | 191 + .../llama/convert_llama_weights_to_hf.py | 339 + .../models/llama/modeling_flax_llama.py | 738 ++ .../models/llama/modeling_llama.py | 1521 ++++ .../models/llama/tokenization_llama.py | 482 ++ .../models/llama/tokenization_llama_fast.py | 290 + src/transformers/models/llava/__init__.py | 56 + .../models/llava/configuration_llava.py | 130 + .../llava/convert_llava_weights_to_hf.py | 148 + .../models/llava/modeling_llava.py | 564 ++ .../models/llava/processing_llava.py | 136 + .../longformer/configuration_longformer.py | 9 +- .../models/longformer/modeling_longformer.py | 93 +- .../longformer/modeling_tf_longformer.py | 652 +- .../longformer/tokenization_longformer.py | 56 +- .../tokenization_longformer_fast.py | 25 +- .../models/longt5/configuration_longt5.py | 3 +- .../models/longt5/modeling_flax_longt5.py | 26 +- .../models/longt5/modeling_longt5.py | 103 +- .../models/luke/configuration_luke.py | 11 +- ..._original_pytorch_checkpoint_to_pytorch.py | 2 +- src/transformers/models/luke/modeling_luke.py | 99 +- .../models/luke/tokenization_luke.py | 
58 +- .../models/lxmert/configuration_lxmert.py | 42 +- .../models/lxmert/modeling_lxmert.py | 17 +- .../models/lxmert/modeling_tf_lxmert.py | 601 +- .../models/lxmert/tokenization_lxmert.py | 63 +- .../models/lxmert/tokenization_lxmert_fast.py | 6 +- .../models/m2m_100/configuration_m2m_100.py | 5 +- .../models/m2m_100/modeling_m2m_100.py | 207 +- .../models/m2m_100/tokenization_m2m_100.py | 59 +- .../models/marian/configuration_marian.py | 1 + .../marian/convert_marian_to_pytorch.py | 16 +- .../models/marian/modeling_flax_marian.py | 16 +- .../models/marian/modeling_marian.py | 212 +- .../models/marian/modeling_tf_marian.py | 478 +- .../models/marian/tokenization_marian.py | 50 +- .../models/markuplm/configuration_markuplm.py | 4 +- .../markuplm/feature_extraction_markuplm.py | 2 +- .../models/markuplm/modeling_markuplm.py | 59 +- .../models/markuplm/processing_markuplm.py | 1 + .../models/markuplm/tokenization_markuplm.py | 49 +- .../markuplm/tokenization_markuplm_fast.py | 13 +- .../mask2former/configuration_mask2former.py | 59 +- ..._original_pytorch_checkpoint_to_pytorch.py | 16 +- .../image_processing_mask2former.py | 280 +- .../mask2former/modeling_mask2former.py | 113 +- .../maskformer/configuration_maskformer.py | 69 +- .../configuration_maskformer_swin.py | 26 +- ..._original_pytorch_checkpoint_to_pytorch.py | 92 +- .../convert_maskformer_resnet_to_pytorch.py | 14 +- .../convert_maskformer_swin_to_pytorch.py | 14 +- .../feature_extraction_maskformer.py | 3 +- .../maskformer/image_processing_maskformer.py | 291 +- .../models/maskformer/modeling_maskformer.py | 327 +- .../maskformer/modeling_maskformer_swin.py | 44 +- .../models/mbart/configuration_mbart.py | 1 + .../models/mbart/modeling_flax_mbart.py | 19 +- .../models/mbart/modeling_mbart.py | 516 +- .../models/mbart/modeling_tf_mbart.py | 457 +- .../models/mbart/tokenization_mbart.py | 46 +- .../models/mbart/tokenization_mbart_fast.py | 29 +- .../models/mbart50/tokenization_mbart50.py | 33 +- .../mbart50/tokenization_mbart50_fast.py | 11 +- src/transformers/models/mega/__init__.py | 70 + .../models/mega/configuration_mega.py | 243 + ..._original_pytorch_checkpoint_to_pytorch.py | 291 + src/transformers/models/mega/modeling_mega.py | 2275 +++++ .../configuration_megatron_bert.py | 5 +- .../convert_megatron_bert_checkpoint.py | 2 +- .../megatron_bert/modeling_megatron_bert.py | 65 +- ...eckpoint_reshaping_and_interoperability.py | 33 +- .../convert_megatron_gpt2_checkpoint.py | 6 +- src/transformers/models/mgp_str/__init__.py | 62 + .../models/mgp_str/configuration_mgp_str.py | 138 + .../models/mgp_str/modeling_mgp_str.py | 514 ++ .../models/mgp_str/processing_mgp_str.py | 230 + .../models/mgp_str/tokenization_mgp_str.py | 111 + src/transformers/models/mistral/__init__.py | 82 + .../models/mistral/configuration_mistral.py | 152 + .../mistral/convert_mistral_weights_to_hf.py | 276 + .../models/mistral/modeling_flax_mistral.py | 741 ++ .../models/mistral/modeling_mistral.py | 1386 ++++ src/transformers/models/mixtral/__init__.py | 62 + .../models/mixtral/configuration_mixtral.py | 169 + .../mixtral/convert_mixtral_weights_to_hf.py | 244 + .../models/mixtral/modeling_mixtral.py | 1605 ++++ ..._original_pytorch_checkpoint_to_pytorch.py | 2 +- .../models/mluke/tokenization_mluke.py | 128 +- .../mobilebert/configuration_mobilebert.py | 1 + .../models/mobilebert/modeling_mobilebert.py | 25 +- .../mobilebert/modeling_tf_mobilebert.py | 663 +- .../mobilebert/tokenization_mobilebert.py | 63 +- .../tokenization_mobilebert_fast.py | 6 +- 
.../configuration_mobilenet_v1.py | 3 +- ...nvert_original_tf_checkpoint_to_pytorch.py | 14 +- .../image_processing_mobilenet_v1.py | 184 +- .../configuration_mobilenet_v2.py | 11 +- ...nvert_original_tf_checkpoint_to_pytorch.py | 10 +- .../image_processing_mobilenet_v2.py | 204 +- .../mobilevit/configuration_mobilevit.py | 9 +- .../mobilevit/convert_mlcvnets_to_pytorch.py | 14 +- .../mobilevit/image_processing_mobilevit.py | 336 +- .../models/mobilevit/modeling_mobilevit.py | 20 +- .../models/mobilevit/modeling_tf_mobilevit.py | 441 +- .../models/mobilevitv2/__init__.py | 71 + .../mobilevitv2/configuration_mobilevitv2.py | 169 + .../convert_mlcvnets_to_pytorch.py | 326 + .../mobilevitv2/modeling_mobilevitv2.py | 1033 +++ .../models/mpnet/configuration_mpnet.py | 1 + .../models/mpnet/modeling_mpnet.py | 20 +- .../models/mpnet/modeling_tf_mpnet.py | 419 +- .../models/mpnet/tokenization_mpnet.py | 77 +- .../models/mpnet/tokenization_mpnet_fast.py | 12 +- src/transformers/models/mpt/__init__.py | 62 + .../models/mpt/configuration_mpt.py | 247 + src/transformers/models/mpt/modeling_mpt.py | 952 +++ src/transformers/models/mra/__init__.py | 68 + .../models/mra/configuration_mra.py | 138 + .../mra/convert_mra_pytorch_to_pytorch.py | 110 + src/transformers/models/mra/modeling_mra.py | 1481 ++++ src/transformers/models/mt5/__init__.py | 14 +- .../models/mt5/configuration_mt5.py | 41 +- .../models/mt5/modeling_flax_mt5.py | 14 +- src/transformers/models/mt5/modeling_mt5.py | 639 +- .../models/mt5/modeling_tf_mt5.py | 1 + src/transformers/models/musicgen/__init__.py | 67 + .../models/musicgen/configuration_musicgen.py | 243 + .../musicgen/convert_musicgen_transformers.py | 235 + .../models/musicgen/modeling_musicgen.py | 2523 ++++++ .../models/musicgen/processing_musicgen.py | 140 + .../models/mvp/configuration_mvp.py | 5 +- src/transformers/models/mvp/modeling_mvp.py | 186 +- .../models/mvp/tokenization_mvp.py | 52 +- .../models/mvp/tokenization_mvp_fast.py | 18 +- .../models/nat/configuration_nat.py | 32 +- src/transformers/models/nat/modeling_nat.py | 49 +- .../models/nezha/configuration_nezha.py | 3 +- .../models/nezha/modeling_nezha.py | 44 +- .../models/nllb/tokenization_nllb.py | 150 +- .../models/nllb/tokenization_nllb_fast.py | 75 +- src/transformers/models/nllb_moe/__init__.py | 68 + .../models/nllb_moe/configuration_nllb_moe.py | 219 + ..._sharded_original_checkpoint_to_pytorch.py | 160 + .../models/nllb_moe/modeling_nllb_moe.py | 1794 ++++ src/transformers/models/nougat/__init__.py | 63 + .../models/nougat/convert_nougat_to_hf.py | 282 + .../models/nougat/image_processing_nougat.py | 511 ++ .../models/nougat/processing_nougat.py | 160 + .../models/nougat/tokenization_nougat_fast.py | 634 ++ .../configuration_nystromformer.py | 3 +- .../nystromformer/modeling_nystromformer.py | 24 +- .../oneformer/configuration_oneformer.py | 124 +- .../oneformer/convert_to_hf_oneformer.py | 6 +- .../oneformer/image_processing_oneformer.py | 267 +- .../models/oneformer/modeling_oneformer.py | 68 +- .../models/oneformer/processing_oneformer.py | 4 +- .../models/openai/configuration_openai.py | 10 +- .../models/openai/modeling_openai.py | 29 +- .../models/openai/modeling_tf_openai.py | 301 +- .../models/openai/tokenization_openai.py | 40 +- .../models/openai/tokenization_openai_fast.py | 14 +- .../models/opt/configuration_opt.py | 3 +- ..._original_pytorch_checkpoint_to_pytorch.py | 4 +- .../models/opt/modeling_flax_opt.py | 2 +- src/transformers/models/opt/modeling_opt.py | 469 +- 
.../models/opt/modeling_tf_opt.py | 270 +- src/transformers/models/owlv2/__init__.py | 93 + .../models/owlv2/configuration_owlv2.py | 338 + .../models/owlv2/convert_owlv2_to_hf.py | 422 + .../models/owlv2/image_processing_owlv2.py | 600 ++ .../models/owlv2/modeling_owlv2.py | 1780 ++++ .../models/owlv2/processing_owlv2.py | 191 + .../models/owlvit/configuration_owlvit.py | 40 +- .../convert_owlvit_original_flax_to_hf.py | 82 +- .../models/owlvit/image_processing_owlvit.py | 179 +- .../models/owlvit/modeling_owlvit.py | 100 +- .../models/owlvit/processing_owlvit.py | 9 +- .../models/patchtsmixer/__init__.py | 69 + .../configuration_patchtsmixer.py | 236 + .../patchtsmixer/modeling_patchtsmixer.py | 2177 +++++ src/transformers/models/patchtst/__init__.py | 66 + .../models/patchtst/configuration_patchtst.py | 262 + .../models/patchtst/modeling_patchtst.py | 2037 +++++ .../models/pegasus/configuration_pegasus.py | 1 + .../pegasus/convert_pegasus_tf_to_pytorch.py | 2 +- .../models/pegasus/modeling_flax_pegasus.py | 28 +- .../models/pegasus/modeling_pegasus.py | 223 +- .../models/pegasus/modeling_tf_pegasus.py | 441 +- .../models/pegasus/tokenization_pegasus.py | 64 +- .../pegasus/tokenization_pegasus_fast.py | 14 +- .../pegasus_x/configuration_pegasus_x.py | 5 +- .../models/pegasus_x/modeling_pegasus_x.py | 199 +- .../perceiver/configuration_perceiver.py | 11 +- .../convert_perceiver_haiku_to_pytorch.py | 10 +- .../perceiver/image_processing_perceiver.py | 172 +- .../models/perceiver/modeling_perceiver.py | 153 +- .../perceiver/tokenization_perceiver.py | 58 +- src/transformers/models/persimmon/__init__.py | 62 + .../persimmon/configuration_persimmon.py | 165 + .../convert_persimmon_weights_to_hf.py | 129 + .../models/persimmon/modeling_persimmon.py | 1013 +++ src/transformers/models/phi/__init__.py | 69 + .../models/phi/configuration_phi.py | 195 + .../models/phi/convert_phi_weights_to_hf.py | 207 + src/transformers/models/phi/modeling_phi.py | 1493 ++++ .../models/phobert/tokenization_phobert.py | 39 +- .../models/pix2struct/__init__.py | 86 + .../pix2struct/configuration_pix2struct.py | 390 + ...nvert_pix2struct_original_pytorch_to_hf.py | 155 + .../pix2struct/image_processing_pix2struct.py | 460 ++ .../models/pix2struct/modeling_pix2struct.py | 1805 ++++ .../pix2struct/processing_pix2struct.py | 163 + .../models/plbart/configuration_plbart.py | 1 + .../models/plbart/modeling_plbart.py | 276 +- .../models/plbart/tokenization_plbart.py | 40 +- .../poolformer/configuration_poolformer.py | 3 +- .../convert_poolformer_original_to_pytorch.py | 22 +- .../poolformer/image_processing_poolformer.py | 179 +- .../models/poolformer/modeling_poolformer.py | 7 +- src/transformers/models/pop2piano/__init__.py | 122 + .../pop2piano/configuration_pop2piano.py | 129 + .../convert_pop2piano_weights_to_hf.py | 190 + .../pop2piano/feature_extraction_pop2piano.py | 450 + .../models/pop2piano/modeling_pop2piano.py | 1365 +++ .../models/pop2piano/processing_pop2piano.py | 139 + .../pop2piano/tokenization_pop2piano.py | 713 ++ .../prophetnet/configuration_prophetnet.py | 1 + .../models/prophetnet/modeling_prophetnet.py | 379 +- .../prophetnet/tokenization_prophetnet.py | 56 +- src/transformers/models/pvt/__init__.py | 80 + .../models/pvt/configuration_pvt.py | 164 + .../models/pvt/convert_pvt_to_pytorch.py | 227 + .../models/pvt/image_processing_pvt.py | 273 + src/transformers/models/pvt/modeling_pvt.py | 670 ++ .../models/qdqbert/configuration_qdqbert.py | 11 +- .../models/qdqbert/modeling_qdqbert.py | 66 +- 
src/transformers/models/qwen2/__init__.py | 80 + .../models/qwen2/configuration_qwen2.py | 144 + .../models/qwen2/modeling_qwen2.py | 1401 ++++ .../models/qwen2/tokenization_qwen2.py | 345 + .../models/qwen2/tokenization_qwen2_fast.py | 143 + .../models/rag/configuration_rag.py | 14 - src/transformers/models/rag/modeling_rag.py | 64 +- .../models/rag/modeling_tf_rag.py | 200 +- src/transformers/models/rag/retrieval_rag.py | 20 +- .../models/realm/configuration_realm.py | 3 +- .../models/realm/modeling_realm.py | 39 +- .../models/realm/retrieval_realm.py | 3 +- .../models/realm/tokenization_realm.py | 31 +- .../models/realm/tokenization_realm_fast.py | 4 +- .../models/reformer/configuration_reformer.py | 1 + .../models/reformer/modeling_reformer.py | 44 +- .../models/reformer/tokenization_reformer.py | 12 +- .../reformer/tokenization_reformer_fast.py | 5 +- src/transformers/models/regnet/__init__.py | 32 +- .../models/regnet/configuration_regnet.py | 1 + .../convert_regnet_seer_10b_to_pytorch.py | 12 +- .../regnet/convert_regnet_to_pytorch.py | 14 +- .../models/regnet/modeling_flax_regnet.py | 819 ++ .../models/regnet/modeling_regnet.py | 7 +- .../models/regnet/modeling_tf_regnet.py | 259 +- .../models/rembert/configuration_rembert.py | 3 +- .../models/rembert/modeling_rembert.py | 54 +- .../models/rembert/modeling_tf_rembert.py | 549 +- .../models/rembert/tokenization_rembert.py | 28 +- .../rembert/tokenization_rembert_fast.py | 7 +- src/transformers/models/resnet/__init__.py | 27 +- .../models/resnet/configuration_resnet.py | 30 +- .../resnet/convert_resnet_to_pytorch.py | 12 +- .../models/resnet/modeling_flax_resnet.py | 701 ++ .../models/resnet/modeling_resnet.py | 55 +- .../models/resnet/modeling_tf_resnet.py | 228 +- .../models/roberta/configuration_roberta.py | 19 +- .../models/roberta/modeling_flax_roberta.py | 15 +- .../models/roberta/modeling_roberta.py | 118 +- .../models/roberta/modeling_tf_roberta.py | 570 +- .../models/roberta/tokenization_roberta.py | 92 +- .../roberta/tokenization_roberta_fast.py | 73 +- .../configuration_roberta_prelayernorm.py | 22 +- .../modeling_flax_roberta_prelayernorm.py | 13 +- .../modeling_roberta_prelayernorm.py | 100 +- .../modeling_tf_roberta_prelayernorm.py | 570 +- .../models/roc_bert/configuration_roc_bert.py | 3 +- .../models/roc_bert/modeling_roc_bert.py | 72 +- .../models/roc_bert/tokenization_roc_bert.py | 57 +- .../models/roformer/configuration_roformer.py | 5 +- .../models/roformer/modeling_flax_roformer.py | 9 +- .../models/roformer/modeling_roformer.py | 68 +- .../models/roformer/modeling_tf_roformer.py | 487 +- .../models/roformer/tokenization_roformer.py | 55 +- .../roformer/tokenization_roformer_fast.py | 22 +- src/transformers/models/rwkv/__init__.py | 60 + .../models/rwkv/configuration_rwkv.py | 130 + .../rwkv/convert_rwkv_checkpoint_to_hf.py | 201 + src/transformers/models/rwkv/modeling_rwkv.py | 873 ++ src/transformers/models/sam/__init__.py | 105 + .../models/sam/configuration_sam.py | 312 + .../sam/convert_sam_original_to_hf_format.py | 206 + .../models/sam/image_processing_sam.py | 1473 ++++ src/transformers/models/sam/modeling_sam.py | 1419 ++++ .../models/sam/modeling_tf_sam.py | 1660 ++++ src/transformers/models/sam/processing_sam.py | 266 + .../models/seamless_m4t/__init__.py | 111 + .../configuration_seamless_m4t.py | 418 + .../seamless_m4t/convert_fairseq2_to_hf.py | 397 + .../feature_extraction_seamless_m4t.py | 306 + .../seamless_m4t/modeling_seamless_m4t.py | 4388 ++++++++++ 
.../seamless_m4t/processing_seamless_m4t.py | 117 + .../seamless_m4t/tokenization_seamless_m4t.py | 576 ++ .../tokenization_seamless_m4t_fast.py | 461 ++ .../models/seamless_m4t_v2/__init__.py | 65 + .../configuration_seamless_m4t_v2.py | 426 + .../seamless_m4t_v2/convert_fairseq2_to_hf.py | 405 + .../modeling_seamless_m4t_v2.py | 4797 +++++++++++ .../segformer/configuration_segformer.py | 17 +- .../convert_segformer_original_to_pytorch.py | 14 +- .../segformer/image_processing_segformer.py | 170 +- .../models/segformer/modeling_segformer.py | 4 +- .../models/segformer/modeling_tf_segformer.py | 355 +- .../models/sew/configuration_sew.py | 12 +- src/transformers/models/sew/modeling_sew.py | 85 +- .../models/sew_d/configuration_sew_d.py | 24 +- .../models/sew_d/modeling_sew_d.py | 91 +- src/transformers/models/siglip/__init__.py | 112 + .../models/siglip/configuration_siglip.py | 302 + .../models/siglip/convert_siglip_to_hf.py | 413 + .../models/siglip/image_processing_siglip.py | 229 + .../models/siglip/modeling_siglip.py | 1298 +++ .../models/siglip/processing_siglip.py | 143 + .../models/siglip/tokenization_siglip.py | 386 + .../configuration_speech_encoder_decoder.py | 17 +- ...rt_wav2vec2_seq2seq_original_to_pytorch.py | 4 +- ...xt_wav2vec2_seq2seq_original_to_pytorch.py | 2 +- .../modeling_flax_speech_encoder_decoder.py | 8 +- .../modeling_speech_encoder_decoder.py | 12 +- .../models/speech_to_text/__init__.py | 19 +- .../configuration_speech_to_text.py | 57 +- .../convert_s2t_fairseq_to_tfms.py | 10 +- .../feature_extraction_speech_to_text.py | 71 +- .../speech_to_text/modeling_speech_to_text.py | 192 +- .../modeling_tf_speech_to_text.py | 385 +- .../processing_speech_to_text.py | 1 + .../tokenization_speech_to_text.py | 43 +- .../configuration_speech_to_text_2.py | 1 + .../modeling_speech_to_text_2.py | 140 +- .../processing_speech_to_text_2.py | 1 + .../tokenization_speech_to_text_2.py | 21 +- src/transformers/models/speecht5/__init__.py | 19 +- .../models/speecht5/configuration_speecht5.py | 23 +- ..._original_pytorch_checkpoint_to_pytorch.py | 11 +- .../speecht5/feature_extraction_speecht5.py | 177 +- .../models/speecht5/modeling_speecht5.py | 779 +- .../models/speecht5/number_normalizer.py | 192 + .../models/speecht5/processing_speecht5.py | 105 +- .../models/speecht5/tokenization_speecht5.py | 62 +- .../models/splinter/configuration_splinter.py | 3 +- .../models/splinter/modeling_splinter.py | 35 +- .../models/splinter/tokenization_splinter.py | 29 +- .../squeezebert/configuration_squeezebert.py | 1 + .../squeezebert/modeling_squeezebert.py | 12 +- .../squeezebert/tokenization_squeezebert.py | 63 +- .../tokenization_squeezebert_fast.py | 6 +- src/transformers/models/stablelm/__init__.py | 62 + .../models/stablelm/configuration_stablelm.py | 183 + .../models/stablelm/modeling_stablelm.py | 1238 +++ .../models/swiftformer/__init__.py | 67 + .../swiftformer/configuration_swiftformer.py | 137 + .../convert_swiftformer_original_to_hf.py | 176 + .../swiftformer/modeling_swiftformer.py | 619 ++ .../models/swin/configuration_swin.py | 38 +- .../swin/convert_swin_simmim_to_pytorch.py | 14 +- .../swin/convert_swin_timm_to_pytorch.py | 10 +- src/transformers/models/swin/modeling_swin.py | 97 +- .../models/swin/modeling_tf_swin.py | 427 +- .../models/swin2sr/configuration_swin2sr.py | 9 +- .../swin2sr/image_processing_swin2sr.py | 107 +- .../models/swin2sr/modeling_swin2sr.py | 81 +- src/transformers/models/swinv2/__init__.py | 2 + .../models/swinv2/configuration_swinv2.py | 27 +- 
.../swinv2/convert_swinv2_timm_to_pytorch.py | 10 +- .../models/swinv2/modeling_swinv2.py | 284 +- .../configuration_switch_transformers.py | 40 +- .../switch_transformers/convert_big_switch.py | 2 +- .../modeling_switch_transformers.py | 183 +- src/transformers/models/t5/__init__.py | 6 + .../models/t5/configuration_t5.py | 23 +- .../t5/convert_t5x_checkpoint_to_pytorch.py | 13 +- .../models/t5/modeling_flax_t5.py | 38 +- src/transformers/models/t5/modeling_t5.py | 653 +- src/transformers/models/t5/modeling_tf_t5.py | 427 +- src/transformers/models/t5/tokenization_t5.py | 191 +- .../models/t5/tokenization_t5_fast.py | 60 +- .../models/table_transformer/__init__.py | 6 +- .../configuration_table_transformer.py | 32 +- ....py => convert_table_transformer_to_hf.py} | 14 +- ...convert_table_transformer_to_hf_no_timm.py | 435 + .../modeling_table_transformer.py | 353 +- .../models/tapas/modeling_tapas.py | 40 +- .../models/tapas/modeling_tf_tapas.py | 445 +- .../models/tapas/tokenization_tapas.py | 89 +- .../configuration_time_series_transformer.py | 38 +- .../modeling_time_series_transformer.py | 1082 ++- .../timesformer/configuration_timesformer.py | 8 +- .../convert_timesformer_to_pytorch.py | 10 +- .../timesformer/modeling_timesformer.py | 126 +- .../models/timm_backbone/__init__.py | 49 + .../configuration_timm_backbone.py | 83 + .../timm_backbone/modeling_timm_backbone.py | 158 + .../models/trocr/configuration_trocr.py | 1 + .../trocr/convert_trocr_unilm_to_pytorch.py | 8 +- .../models/trocr/modeling_trocr.py | 130 +- .../models/trocr/processing_trocr.py | 6 +- src/transformers/models/tvlt/__init__.py | 17 +- .../models/tvlt/configuration_tvlt.py | 7 +- .../models/tvlt/feature_extraction_tvlt.py | 176 +- .../models/tvlt/image_processing_tvlt.py | 153 +- src/transformers/models/tvlt/modeling_tvlt.py | 109 +- .../models/tvlt/processing_tvlt.py | 1 + src/transformers/models/tvp/__init__.py | 80 + .../models/tvp/configuration_tvp.py | 203 + .../models/tvp/image_processing_tvp.py | 478 ++ src/transformers/models/tvp/modeling_tvp.py | 895 ++ src/transformers/models/tvp/processing_tvp.py | 154 + src/transformers/models/umt5/__init__.py | 60 + .../models/umt5/configuration_umt5.py | 177 + .../convert_umt5_checkpoint_to_pytorch.py | 274 + src/transformers/models/umt5/modeling_umt5.py | 1857 +++++ .../unispeech/configuration_unispeech.py | 35 +- .../models/unispeech/modeling_unispeech.py | 148 +- .../configuration_unispeech_sat.py | 35 +- .../unispeech_sat/modeling_unispeech_sat.py | 176 +- src/transformers/models/univnet/__init__.py | 65 + .../models/univnet/configuration_univnet.py | 127 + .../models/univnet/convert_univnet.py | 162 + .../univnet/feature_extraction_univnet.py | 456 ++ .../models/univnet/modeling_univnet.py | 636 ++ .../models/upernet/configuration_upernet.py | 42 +- .../models/upernet/modeling_upernet.py | 25 +- .../models/videomae/configuration_videomae.py | 3 +- .../videomae/convert_videomae_to_pytorch.py | 66 +- .../videomae/image_processing_videomae.py | 150 +- .../models/videomae/modeling_videomae.py | 125 +- .../models/vilt/configuration_vilt.py | 7 +- .../vilt/convert_vilt_original_to_pytorch.py | 8 +- .../models/vilt/image_processing_vilt.py | 246 +- src/transformers/models/vilt/modeling_vilt.py | 53 +- .../models/vilt/processing_vilt.py | 6 +- src/transformers/models/vipllava/__init__.py | 54 + .../models/vipllava/configuration_vipllava.py | 130 + .../convert_vipllava_weights_to_hf.py | 132 + .../models/vipllava/modeling_vipllava.py | 563 ++ 
.../configuration_vision_encoder_decoder.py | 17 +- .../modeling_flax_vision_encoder_decoder.py | 15 +- .../modeling_tf_vision_encoder_decoder.py | 209 +- .../modeling_vision_encoder_decoder.py | 16 +- .../vision_text_dual_encoder/__init__.py | 28 +- .../configuration_vision_text_dual_encoder.py | 35 +- .../modeling_flax_vision_text_dual_encoder.py | 17 +- .../modeling_tf_vision_text_dual_encoder.py | 622 ++ .../modeling_vision_text_dual_encoder.py | 18 +- .../processing_vision_text_dual_encoder.py | 8 +- .../visual_bert/configuration_visual_bert.py | 4 +- .../visual_bert/modeling_visual_bert.py | 40 +- .../models/vit/configuration_vit.py | 13 +- .../models/vit/convert_dino_to_pytorch.py | 12 +- .../models/vit/convert_vit_timm_to_pytorch.py | 133 +- .../models/vit/image_processing_vit.py | 139 +- .../models/vit/modeling_flax_vit.py | 7 +- .../models/vit/modeling_tf_vit.py | 278 +- src/transformers/models/vit/modeling_vit.py | 61 +- .../vit_hybrid/configuration_vit_hybrid.py | 55 +- .../vit_hybrid/image_processing_vit_hybrid.py | 184 +- .../models/vit_hybrid/modeling_vit_hybrid.py | 40 +- .../models/vit_mae/configuration_vit_mae.py | 5 +- .../vit_mae/convert_vit_mae_to_pytorch.py | 12 +- .../models/vit_mae/modeling_tf_vit_mae.py | 307 +- .../models/vit_mae/modeling_vit_mae.py | 28 +- .../models/vit_msn/configuration_vit_msn.py | 1 + .../models/vit_msn/convert_msn_to_pytorch.py | 12 +- .../models/vit_msn/modeling_vit_msn.py | 20 +- src/transformers/models/vitdet/__init__.py | 57 + .../models/vitdet/configuration_vitdet.py | 158 + .../models/vitdet/modeling_vitdet.py | 876 ++ src/transformers/models/vitmatte/__init__.py | 72 + .../models/vitmatte/configuration_vitmatte.py | 137 + .../models/vitmatte/convert_vitmatte_to_hf.py | 170 + .../vitmatte/image_processing_vitmatte.py | 271 + .../models/vitmatte/modeling_vitmatte.py | 338 + src/transformers/models/vits/__init__.py | 67 + .../models/vits/configuration_vits.py | 255 + .../vits/convert_original_checkpoint.py | 390 + src/transformers/models/vits/modeling_vits.py | 1487 ++++ .../models/vits/tokenization_vits.py | 250 + src/transformers/models/vivit/__init__.py | 78 + .../models/vivit/configuration_vivit.py | 123 + .../vivit/convert_vivit_flax_to_pytorch.py | 230 + .../models/vivit/image_processing_vivit.py | 403 + .../models/vivit/modeling_vivit.py | 745 ++ src/transformers/models/wav2vec2/__init__.py | 2 + .../models/wav2vec2/configuration_wav2vec2.py | 15 +- ..._original_pytorch_checkpoint_to_pytorch.py | 150 +- .../wav2vec2/feature_extraction_wav2vec2.py | 11 +- .../models/wav2vec2/modeling_tf_wav2vec2.py | 616 +- .../models/wav2vec2/modeling_wav2vec2.py | 405 +- .../models/wav2vec2/processing_wav2vec2.py | 1 + .../models/wav2vec2/tokenization_wav2vec2.py | 193 +- .../models/wav2vec2_bert/__init__.py | 70 + .../configuration_wav2vec2_bert.py | 315 + .../convert_wav2vec2_seamless_checkpoint.py | 218 + .../wav2vec2_bert/modeling_wav2vec2_bert.py | 1674 ++++ .../wav2vec2_bert/processing_wav2vec2_bert.py | 145 + .../configuration_wav2vec2_conformer.py | 10 +- .../modeling_wav2vec2_conformer.py | 101 +- .../tokenization_wav2vec2_phoneme.py | 127 +- .../processing_wav2vec2_with_lm.py | 110 +- .../models/wavlm/configuration_wavlm.py | 6 + .../models/wavlm/modeling_wavlm.py | 136 +- src/transformers/models/whisper/__init__.py | 55 +- .../models/whisper/configuration_whisper.py | 90 +- .../models/whisper/convert_openai_to_hf.py | 220 +- .../models/whisper/english_normalizer.py | 34 +- .../whisper/feature_extraction_whisper.py | 238 +- 
.../models/whisper/generation_whisper.py | 1675 ++++ .../models/whisper/modeling_flax_whisper.py | 1696 ++++ .../models/whisper/modeling_tf_whisper.py | 580 +- .../models/whisper/modeling_whisper.py | 1317 ++- .../models/whisper/processing_whisper.py | 5 + .../models/whisper/tokenization_whisper.py | 741 +- .../whisper/tokenization_whisper_fast.py | 650 ++ .../models/x_clip/configuration_x_clip.py | 96 +- .../convert_x_clip_original_pytorch_to_hf.py | 6 +- .../models/x_clip/modeling_x_clip.py | 199 +- .../models/x_clip/processing_x_clip.py | 6 +- .../models/xglm/configuration_xglm.py | 1 + .../models/xglm/modeling_flax_xglm.py | 14 +- .../models/xglm/modeling_tf_xglm.py | 354 +- src/transformers/models/xglm/modeling_xglm.py | 331 +- .../models/xglm/tokenization_xglm.py | 29 +- .../models/xglm/tokenization_xglm_fast.py | 7 +- .../models/xlm/configuration_xlm.py | 22 +- ..._original_pytorch_checkpoint_to_pytorch.py | 4 +- .../models/xlm/modeling_tf_xlm.py | 404 +- src/transformers/models/xlm/modeling_xlm.py | 41 +- .../models/xlm/tokenization_xlm.py | 121 +- .../configuration_xlm_prophetnet.py | 1 + .../xlm_prophetnet/modeling_xlm_prophetnet.py | 387 +- .../tokenization_xlm_prophetnet.py | 52 +- .../xlm_roberta/configuration_xlm_roberta.py | 27 +- .../xlm_roberta/modeling_flax_xlm_roberta.py | 19 +- .../xlm_roberta/modeling_tf_xlm_roberta.py | 566 +- .../xlm_roberta/modeling_xlm_roberta.py | 118 +- .../xlm_roberta/tokenization_xlm_roberta.py | 61 +- .../tokenization_xlm_roberta_fast.py | 57 +- .../configuration_xlm_roberta_xl.py | 5 +- .../xlm_roberta_xl/modeling_xlm_roberta_xl.py | 105 +- .../models/xlnet/configuration_xlnet.py | 6 +- .../models/xlnet/modeling_tf_xlnet.py | 449 +- .../models/xlnet/modeling_xlnet.py | 67 +- .../models/xlnet/tokenization_xlnet.py | 37 +- .../models/xlnet/tokenization_xlnet_fast.py | 17 +- .../models/xmod/configuration_xmod.py | 5 +- ..._original_pytorch_checkpoint_to_pytorch.py | 2 +- src/transformers/models/xmod/modeling_xmod.py | 91 +- .../models/yolos/configuration_yolos.py | 13 +- .../models/yolos/convert_yolos_to_pytorch.py | 14 +- .../models/yolos/feature_extraction_yolos.py | 10 + .../models/yolos/image_processing_yolos.py | 516 +- .../models/yolos/modeling_yolos.py | 38 +- .../models/yoso/configuration_yoso.py | 3 +- src/transformers/models/yoso/modeling_yoso.py | 71 +- src/transformers/onnx/__main__.py | 8 +- src/transformers/onnx/config.py | 4 +- src/transformers/onnx/convert.py | 66 +- src/transformers/onnx/features.py | 6 +- src/transformers/optimization.py | 239 +- src/transformers/optimization_tf.py | 45 +- src/transformers/pipelines/__init__.py | 288 +- .../pipelines/audio_classification.py | 62 +- src/transformers/pipelines/audio_utils.py | 32 +- .../pipelines/automatic_speech_recognition.py | 491 +- src/transformers/pipelines/base.py | 197 +- src/transformers/pipelines/conversational.py | 288 +- .../pipelines/depth_estimation.py | 33 +- .../pipelines/document_question_answering.py | 24 +- .../pipelines/feature_extraction.py | 45 +- src/transformers/pipelines/fill_mask.py | 53 +- .../pipelines/image_classification.py | 124 +- .../pipelines/image_feature_extraction.py | 92 + .../pipelines/image_segmentation.py | 40 +- src/transformers/pipelines/image_to_image.py | 134 + src/transformers/pipelines/image_to_text.py | 90 +- src/transformers/pipelines/mask_generation.py | 285 + .../pipelines/object_detection.py | 27 +- src/transformers/pipelines/pt_utils.py | 2 +- .../pipelines/question_answering.py | 19 +- .../pipelines/table_question_answering.py 
| 29 +- .../pipelines/text2text_generation.py | 37 +- .../pipelines/text_classification.py | 24 +- src/transformers/pipelines/text_generation.py | 138 +- src/transformers/pipelines/text_to_audio.py | 207 + .../pipelines/token_classification.py | 182 +- .../pipelines/video_classification.py | 8 +- .../pipelines/visual_question_answering.py | 52 +- .../zero_shot_audio_classification.py | 155 + .../pipelines/zero_shot_classification.py | 9 +- .../zero_shot_image_classification.py | 88 +- .../pipelines/zero_shot_object_detection.py | 21 +- src/transformers/processing_utils.py | 299 +- src/transformers/pytorch_utils.py | 61 +- src/transformers/quantizers/__init__.py | 15 + src/transformers/quantizers/auto.py | 157 + src/transformers/quantizers/base.py | 197 + src/transformers/quantizers/quantizer_aqlm.py | 88 + src/transformers/quantizers/quantizer_awq.py | 113 + .../quantizers/quantizer_bnb_4bit.py | 311 + .../quantizers/quantizer_bnb_8bit.py | 271 + src/transformers/quantizers/quantizer_gptq.py | 94 + .../quantizers/quantizers_utils.py | 26 + src/transformers/safetensors_conversion.py | 107 + .../sagemaker/training_args_sm.py | 4 +- src/transformers/testing_utils.py | 694 +- src/transformers/tf_utils.py | 197 + src/transformers/time_series_utils.py | 225 + src/transformers/tokenization_utils.py | 252 +- src/transformers/tokenization_utils_base.py | 820 +- src/transformers/tokenization_utils_fast.py | 109 +- src/transformers/tools/__init__.py | 71 + src/transformers/tools/agent_types.py | 277 + src/transformers/tools/agents.py | 778 ++ src/transformers/tools/base.py | 743 ++ .../tools/document_question_answering.py | 80 + src/transformers/tools/evaluate_agent.py | 692 ++ src/transformers/tools/image_captioning.py | 51 + .../tools/image_question_answering.py | 57 + src/transformers/tools/image_segmentation.py | 58 + src/transformers/tools/prompts.py | 48 + src/transformers/tools/python_interpreter.py | 253 + src/transformers/tools/speech_to_text.py | 41 + src/transformers/tools/text_classification.py | 70 + .../tools/text_question_answering.py | 52 + src/transformers/tools/text_summarization.py | 52 + src/transformers/tools/text_to_speech.py | 65 + src/transformers/tools/translation.py | 271 + src/transformers/trainer.py | 2448 +++--- src/transformers/trainer_callback.py | 54 +- src/transformers/trainer_pt_utils.py | 134 +- src/transformers/trainer_seq2seq.py | 195 +- src/transformers/trainer_tf.py | 801 -- src/transformers/trainer_utils.py | 163 +- src/transformers/training_args.py | 1605 +++- src/transformers/training_args_seq2seq.py | 30 +- src/transformers/training_args_tf.py | 17 +- src/transformers/utils/__init__.py | 54 +- src/transformers/utils/backbone_utils.py | 350 + src/transformers/utils/bitsandbytes.py | 202 +- src/transformers/utils/doc.py | 41 +- ...pretty_midi_and_scipy_and_torch_objects.py | 23 + src/transformers/utils/dummy_flax_objects.py | 181 + src/transformers/utils/dummy_music_objects.py | 16 + src/transformers/utils/dummy_pt_objects.py | 3754 +++++++-- .../utils/dummy_sentencepiece_objects.py | 28 + .../utils/dummy_speech_objects.py | 21 - src/transformers/utils/dummy_tf_objects.py | 397 +- .../utils/dummy_timm_and_vision_objects.py | 112 - .../utils/dummy_tokenizers_objects.py | 48 +- .../utils/dummy_vision_objects.py | 91 + src/transformers/utils/fx.py | 149 +- src/transformers/utils/generic.py | 283 +- src/transformers/utils/hp_naming.py | 12 +- src/transformers/utils/hub.py | 543 +- src/transformers/utils/import_utils.py | 789 +- src/transformers/utils/logging.py | 
53 +- .../utils/model_parallel_utils.py | 2 +- src/transformers/utils/notebook.py | 50 +- src/transformers/utils/peft_utils.py | 124 + src/transformers/utils/quantization_config.py | 736 +- .../utils/sentencepiece_model_pb2.py | 12 +- .../utils/sentencepiece_model_pb2_new.py | 47 + src/transformers/utils/versions.py | 18 +- ...on_{{cookiecutter.lowercase_modelname}}.py | 2 +- .../adding_a_new_example_script/README.md | 4 +- .../run_{{cookiecutter.example_shortcut}}.py | 55 +- .../ADD_NEW_MODEL_PROPOSAL_TEMPLATE.md | 10 +- templates/adding_a_new_model/README.md | 28 +- ...on_{{cookiecutter.lowercase_modelname}}.py | 9 +- ...ax_{{cookiecutter.lowercase_modelname}}.py | 10 +- ...tf_{{cookiecutter.lowercase_modelname}}.py | 563 +- ...ng_{{cookiecutter.lowercase_modelname}}.py | 161 +- ...ax_{{cookiecutter.lowercase_modelname}}.py | 2 +- ...tf_{{cookiecutter.lowercase_modelname}}.py | 20 - ...ce_{{cookiecutter.lowercase_modelname}}.py | 2 +- ...> {{cookiecutter.lowercase_modelname}}.md} | 0 .../open_model_proposals/ADD_BIG_BIRD.md | 10 +- .../bort => bettertransformer}/__init__.py | 0 tests/bettertransformer/test_integration.py | 86 + tests/deepspeed/test_deepspeed.py | 361 +- tests/deepspeed/test_model_zoo.py | 46 +- tests/extended/test_trainer_ext.py | 65 +- tests/fsdp/test_fsdp.py | 274 + tests/generation/test_beam_search.py | 12 +- tests/generation/test_configuration_utils.py | 119 +- tests/generation/test_flax_utils.py | 13 + tests/generation/test_framework_agnostic.py | 48 +- tests/generation/test_logits_process.py | 166 +- tests/generation/test_streamers.py | 122 + tests/generation/test_tf_logits_process.py | 2 + tests/generation/test_tf_utils.py | 42 +- tests/generation/test_utils.py | 1363 ++- tests/models/albert/test_modeling_albert.py | 22 +- .../albert/test_modeling_flax_albert.py | 6 +- .../models/albert/test_modeling_tf_albert.py | 44 +- .../models/albert/test_tokenization_albert.py | 6 +- tests/models/{mctct => align}/__init__.py | 0 tests/models/align/test_modeling_align.py | 640 ++ tests/models/align/test_processor_align.py | 207 + tests/models/altclip/test_modeling_altclip.py | 59 +- ...xtraction_audio_spectrogram_transformer.py | 63 +- ..._modeling_audio_spectrogram_transformer.py | 19 +- tests/models/auto/test_configuration_auto.py | 35 +- .../auto/test_feature_extraction_auto.py | 49 + .../models/auto/test_image_processing_auto.py | 45 + tests/models/auto/test_modeling_auto.py | 118 +- tests/models/auto/test_modeling_flax_auto.py | 8 +- tests/models/auto/test_modeling_tf_auto.py | 27 +- tests/models/auto/test_modeling_tf_pytorch.py | 10 +- tests/models/auto/test_processor_auto.py | 163 +- tests/models/auto/test_tokenization_auto.py | 157 +- .../{retribert => autoformer}/__init__.py | 0 .../autoformer/test_modeling_autoformer.py | 473 ++ tests/models/{tapex => bark}/__init__.py | 0 tests/models/bark/test_modeling_bark.py | 1280 +++ tests/models/bark/test_processor_bark.py | 127 + tests/models/bart/test_modeling_bart.py | 50 +- tests/models/bart/test_modeling_tf_bart.py | 52 +- tests/models/bart/test_tokenization_bart.py | 1 - .../barthez/test_tokenization_barthez.py | 4 +- .../models/beit/test_image_processing_beit.py | 120 +- tests/models/beit/test_modeling_beit.py | 145 +- tests/models/beit/test_modeling_flax_beit.py | 22 +- tests/models/bert/test_modeling_bert.py | 51 +- tests/models/bert/test_modeling_flax_bert.py | 4 +- tests/models/bert/test_modeling_tf_bert.py | 43 +- tests/models/bert/test_tokenization_bert.py | 8 +- .../models/bert/test_tokenization_bert_tf.py | 26 
+- .../test_modeling_bert_generation.py | 10 +- .../test_tokenization_bert_generation.py | 4 +- .../test_tokenization_bert_japanese.py | 129 +- .../models/big_bird/test_modeling_big_bird.py | 106 +- .../big_bird/test_modeling_flax_big_bird.py | 11 +- .../big_bird/test_tokenization_big_bird.py | 8 +- .../test_modeling_bigbird_pegasus.py | 85 +- tests/models/biogpt/test_modeling_biogpt.py | 73 +- tests/models/bit/test_modeling_bit.py | 42 +- .../blenderbot/test_modeling_blenderbot.py | 43 +- .../test_modeling_flax_blenderbot.py | 6 +- .../blenderbot/test_modeling_tf_blenderbot.py | 46 +- .../test_tokenization_blenderbot.py | 22 + .../test_modeling_blenderbot_small.py | 40 +- .../test_modeling_flax_blenderbot_small.py | 7 +- .../test_modeling_tf_blenderbot_small.py | 51 +- .../models/blip/test_image_processing_blip.py | 199 +- tests/models/blip/test_modeling_blip.py | 219 +- tests/models/blip/test_modeling_blip_text.py | 17 +- tests/models/blip/test_modeling_tf_blip.py | 903 ++ .../models/blip/test_modeling_tf_blip_text.py | 181 + tests/models/blip_2/test_modeling_blip_2.py | 178 +- tests/models/bloom/test_modeling_bloom.py | 102 +- .../models/bloom/test_modeling_flax_bloom.py | 251 + tests/models/bloom/test_tokenization_bloom.py | 36 +- tests/models/bort/test_modeling_bort.py | 51 - tests/models/bort/test_modeling_tf_bort.py | 51 - .../test_image_processing_bridgetower.py | 142 +- .../bridgetower/test_modeling_bridgetower.py | 363 +- .../__init__.py | 0 tests/models/bros/test_modeling_bros.py | 450 + tests/models/byt5/test_tokenization_byt5.py | 16 +- .../camembert/test_modeling_camembert.py | 2 +- .../camembert/test_modeling_tf_camembert.py | 2 + .../camembert/test_tokenization_camembert.py | 104 +- tests/models/canine/test_modeling_canine.py | 38 +- .../models/canine/test_tokenization_canine.py | 14 +- .../test_image_processing_chinese_clip.py | 199 +- .../test_modeling_chinese_clip.py | 49 +- .../clap/test_feature_extraction_clap.py | 434 +- tests/models/clap/test_modeling_clap.py | 115 +- .../models/clip/test_image_processing_clip.py | 214 +- tests/models/clip/test_modeling_clip.py | 119 +- tests/models/clip/test_modeling_flax_clip.py | 13 +- tests/models/clip/test_modeling_tf_clip.py | 35 +- tests/models/clip/test_processor_clip.py | 10 +- tests/models/clip/test_tokenization_clip.py | 8 +- tests/models/clipseg/test_modeling_clipseg.py | 88 +- .../models/clipseg/test_processor_clipseg.py | 4 +- tests/models/{transfo_xl => clvp}/__init__.py | 0 .../clvp/test_feature_extraction_clvp.py | 237 + tests/models/clvp/test_modeling_clvp.py | 635 ++ tests/models/clvp/test_processor_clvp.py | 136 + tests/models/clvp/test_tokenization_clvp.py | 312 + tests/models/{van => code_llama}/__init__.py | 0 .../test_tokenization_code_llama.py | 657 ++ tests/models/codegen/test_modeling_codegen.py | 37 +- .../codegen/test_tokenization_codegen.py | 5 +- .../test_image_processing_conditional_detr.py | 385 +- .../test_modeling_conditional_detr.py | 94 +- .../models/convbert/test_modeling_convbert.py | 26 +- .../convbert/test_modeling_tf_convbert.py | 24 +- .../test_image_processing_convnext.py | 126 +- .../models/convnext/test_modeling_convnext.py | 51 +- .../convnext/test_modeling_tf_convnext.py | 23 +- tests/{onnx => models/convnextv2}/__init__.py | 0 .../convnextv2/test_modeling_convnextv2.py | 344 + .../convnextv2/test_modeling_tf_convnextv2.py | 308 + tests/models/cpm/test_tokenization_cpm.py | 6 + tests/models/cpmant/__init__.py | 0 tests/models/cpmant/test_modeling_cpmant.py | 237 + 
.../models/cpmant/test_tokenization_cpmant.py | 69 + tests/models/ctrl/test_modeling_ctrl.py | 41 +- tests/models/ctrl/test_modeling_tf_ctrl.py | 41 +- tests/models/cvt/test_modeling_cvt.py | 35 +- tests/models/cvt/test_modeling_tf_cvt.py | 33 +- .../data2vec/test_modeling_data2vec_audio.py | 20 +- .../data2vec/test_modeling_data2vec_text.py | 19 +- .../data2vec/test_modeling_data2vec_vision.py | 39 +- .../test_modeling_tf_data2vec_vision.py | 38 +- tests/models/deberta/test_modeling_deberta.py | 17 +- .../deberta/test_modeling_tf_deberta.py | 21 +- .../deberta_v2/test_modeling_deberta_v2.py | 17 +- .../deberta_v2/test_modeling_tf_deberta_v2.py | 37 +- .../test_tokenization_deberta_v2.py | 40 +- .../test_modeling_decision_transformer.py | 4 +- .../test_image_processing_deformable_detr.py | 386 +- .../test_modeling_deformable_detr.py | 137 +- .../models/deit/test_image_processing_deit.py | 126 +- tests/models/deit/test_modeling_deit.py | 62 +- tests/models/deit/test_modeling_tf_deit.py | 36 +- tests/models/depth_anything/__init__.py | 0 .../test_modeling_depth_anything.py | 243 + .../models/deta/test_image_processing_deta.py | 386 +- tests/models/deta/test_modeling_deta.py | 121 +- .../models/detr/test_image_processing_detr.py | 407 +- tests/models/detr/test_modeling_detr.py | 169 +- tests/models/dinat/test_modeling_dinat.py | 41 +- tests/models/dinov2/__init__.py | 0 tests/models/dinov2/test_modeling_dinov2.py | 339 + .../distilbert/test_modeling_distilbert.py | 130 +- .../test_modeling_flax_distilbert.py | 2 +- .../distilbert/test_modeling_tf_distilbert.py | 19 +- tests/models/dit/test_modeling_dit.py | 6 +- .../donut/test_image_processing_donut.py | 27 +- .../models/donut/test_modeling_donut_swin.py | 17 +- tests/models/dpr/test_modeling_dpr.py | 6 +- tests/models/dpr/test_modeling_tf_dpr.py | 8 +- tests/models/dpr/test_tokenization_dpr.py | 4 +- tests/models/dpt/test_image_processing_dpt.py | 137 +- tests/models/dpt/test_modeling_dpt.py | 61 +- .../dpt/test_modeling_dpt_auto_backbone.py | 326 + tests/models/dpt/test_modeling_dpt_hybrid.py | 52 +- .../test_image_processing_efficientformer.py | 130 +- .../test_modeling_efficientformer.py | 66 +- .../test_modeling_tf_efficientformer.py | 412 + tests/models/efficientnet/__init__.py | 0 .../test_image_processing_efficientnet.py | 122 + .../test_modeling_efficientnet.py} | 180 +- tests/models/electra/test_modeling_electra.py | 18 +- .../electra/test_modeling_flax_electra.py | 2 +- .../electra/test_modeling_tf_electra.py | 19 +- .../test_tokenization_electra.py} | 66 +- tests/models/encodec/__init__.py | 0 .../test_feature_extraction_encodec.py | 253 + tests/models/encodec/test_modeling_encodec.py | 606 ++ .../test_modeling_encoder_decoder.py | 88 +- .../test_modeling_flax_encoder_decoder.py | 20 +- .../test_modeling_tf_encoder_decoder.py | 53 +- tests/models/ernie/test_modeling_ernie.py | 22 +- tests/models/ernie_m/test_modeling_ernie_m.py | 25 +- .../ernie_m/test_tokenization_ernie_m.py | 4 +- tests/models/esm/test_modeling_esm.py | 39 +- tests/models/esm/test_modeling_esmfold.py | 34 +- tests/models/esm/test_modeling_tf_esm.py | 37 +- tests/models/esm/test_tokenization_esm.py | 26 +- tests/models/falcon/__init__.py | 0 tests/models/falcon/test_modeling_falcon.py | 594 ++ .../models/fastspeech2_conformer/__init__.py | 0 .../test_modeling_fastspeech2_conformer.py | 793 ++ ...test_tokenization_fastspeech2_conformer.py | 190 + .../models/flaubert/test_modeling_flaubert.py | 37 +- .../flaubert/test_modeling_tf_flaubert.py | 35 +- 
.../flava/test_image_processing_flava.py | 31 +- tests/models/flava/test_modeling_flava.py | 173 +- tests/models/flava/test_processor_flava.py | 4 +- tests/models/fnet/test_modeling_fnet.py | 51 +- tests/models/fnet/test_tokenization_fnet.py | 16 +- tests/models/focalnet/__init__.py | 0 .../models/focalnet/test_modeling_focalnet.py | 445 + tests/models/fsmt/test_modeling_fsmt.py | 65 +- tests/models/funnel/test_modeling_funnel.py | 15 +- .../models/funnel/test_modeling_tf_funnel.py | 31 +- tests/models/fuyu/__init__.py | 0 .../models/fuyu/test_image_processing_fuyu.py | 63 + tests/models/fuyu/test_modeling_fuyu.py | 402 + tests/models/fuyu/test_processing_fuyu.py | 231 + tests/models/git/test_modeling_git.py | 50 +- .../models/glpn/test_image_processing_glpn.py | 63 +- tests/models/glpn/test_modeling_glpn.py | 27 +- tests/models/gpt2/test_modeling_flax_gpt2.py | 35 +- tests/models/gpt2/test_modeling_gpt2.py | 107 +- tests/models/gpt2/test_modeling_tf_gpt2.py | 73 +- tests/models/gpt2/test_tokenization_gpt2.py | 27 +- .../models/gpt2/test_tokenization_gpt2_tf.py | 5 +- tests/models/gpt_bigcode/__init__.py | 0 .../gpt_bigcode/test_modeling_gpt_bigcode.py | 628 ++ .../gpt_neo/test_modeling_flax_gpt_neo.py | 8 +- tests/models/gpt_neo/test_modeling_gpt_neo.py | 64 +- .../models/gpt_neox/test_modeling_gpt_neox.py | 154 +- .../test_modeling_gpt_neox_japanese.py | 12 +- .../gpt_sw3/test_tokenization_gpt_sw3.py | 32 +- tests/models/gptj/test_modeling_flax_gptj.py | 6 +- tests/models/gptj/test_modeling_gptj.py | 56 +- tests/models/gptj/test_modeling_tf_gptj.py | 56 +- tests/models/gptsan_japanese/__init__.py | 0 .../test_modeling_gptsan_japanese.py | 464 ++ .../test_tokenization_gptsan_japanese.py | 217 + .../graphormer/test_modeling_graphormer.py | 39 +- .../models/groupvit/test_modeling_groupvit.py | 54 +- .../groupvit/test_modeling_tf_groupvit.py | 51 +- .../herbert/test_tokenization_herbert.py | 12 + tests/models/hubert/test_modeling_hubert.py | 27 +- .../models/hubert/test_modeling_tf_hubert.py | 124 +- tests/models/ibert/test_modeling_ibert.py | 22 +- tests/models/idefics/__init__.py | 0 .../idefics/test_image_processing_idefics.py | 203 + tests/models/idefics/test_modeling_idefics.py | 670 ++ .../models/idefics/test_processor_idefics.py | 172 + .../test_image_processing_imagegpt.py | 92 +- .../models/imagegpt/test_modeling_imagegpt.py | 53 +- tests/models/informer/__init__.py | 0 .../models/informer/test_modeling_informer.py | 531 ++ tests/models/instructblip/__init__.py | 0 .../test_modeling_instructblip.py | 612 ++ .../test_processor_instructblip.py | 191 + tests/models/jukebox/test_modeling_jukebox.py | 112 +- .../jukebox/test_tokenization_jukebox.py | 10 +- tests/models/kosmos2/__init__.py | 0 tests/models/kosmos2/test_modeling_kosmos2.py | 763 ++ .../models/kosmos2/test_processor_kosmos2.py | 471 ++ .../models/layoutlm/test_modeling_layoutlm.py | 35 +- .../layoutlm/test_modeling_tf_layoutlm.py | 18 +- .../test_image_processing_layoutlmv2.py | 127 +- .../layoutlmv2/test_modeling_layoutlmv2.py | 14 +- .../layoutlmv2/test_processor_layoutlmv2.py | 34 +- .../test_tokenization_layoutlmv2.py | 32 +- .../test_image_processing_layoutlmv3.py | 127 +- .../layoutlmv3/test_modeling_layoutlmv3.py | 30 +- .../layoutlmv3/test_modeling_tf_layoutlmv3.py | 30 +- .../layoutlmv3/test_processor_layoutlmv3.py | 34 +- .../test_tokenization_layoutlmv3.py | 48 +- .../layoutxlm/test_processor_layoutxlm.py | 94 +- .../layoutxlm/test_tokenization_layoutxlm.py | 42 +- tests/models/led/test_modeling_led.py | 39 +- 
tests/models/led/test_modeling_tf_led.py | 46 +- .../levit/test_image_processing_levit.py | 126 +- tests/models/levit/test_modeling_levit.py | 46 +- tests/models/lilt/test_modeling_lilt.py | 38 +- tests/models/llama/__init__.py | 0 .../models/llama/test_modeling_flax_llama.py | 261 + tests/models/llama/test_modeling_llama.py | 679 ++ tests/models/llama/test_tokenization_llama.py | 747 ++ tests/models/llava/__init__.py | 0 tests/models/llava/test_modeling_llava.py | 372 + .../longformer/test_modeling_longformer.py | 36 +- .../longformer/test_modeling_tf_longformer.py | 43 +- .../test_tokenization_longformer.py | 3 +- .../longt5/test_modeling_flax_longt5.py | 2 +- tests/models/longt5/test_modeling_longt5.py | 23 +- tests/models/luke/test_modeling_luke.py | 44 +- tests/models/luke/test_tokenization_luke.py | 1 - tests/models/lxmert/test_modeling_lxmert.py | 10 +- .../models/lxmert/test_modeling_tf_lxmert.py | 148 +- tests/models/m2m_100/test_modeling_m2m_100.py | 38 +- .../m2m_100/test_tokenization_m2m_100.py | 27 +- .../marian/test_modeling_flax_marian.py | 4 + tests/models/marian/test_modeling_marian.py | 68 +- .../models/marian/test_modeling_tf_marian.py | 76 +- .../models/marian/test_tokenization_marian.py | 12 +- .../models/markuplm/test_modeling_markuplm.py | 26 +- .../markuplm/test_processor_markuplm.py | 42 +- .../markuplm/test_tokenization_markuplm.py | 63 +- .../test_image_processing_mask2former.py | 166 +- .../mask2former/test_modeling_mask2former.py | 77 +- .../test_image_processing_maskformer.py | 166 +- .../maskformer/test_modeling_maskformer.py | 184 +- .../test_modeling_maskformer_swin.py | 66 +- tests/models/mbart/test_modeling_mbart.py | 84 +- tests/models/mbart/test_modeling_tf_mbart.py | 81 +- tests/models/mbart/test_tokenization_mbart.py | 4 + .../mbart50/test_tokenization_mbart50.py | 18 +- .../mctct/test_feature_extraction_mctct.py | 273 - tests/models/mctct/test_modeling_mctct.py | 651 -- tests/models/mega/__init__.py | 0 tests/models/mega/test_modeling_mega.py | 745 ++ .../test_modeling_megatron_bert.py | 18 +- tests/models/mgp_str/__init__.py | 0 tests/models/mgp_str/test_modeling_mgp_str.py | 262 + .../models/mgp_str/test_processor_mgp_str.py | 209 + .../mgp_str/test_tokenization_mgp_str.py | 94 + tests/models/mistral/__init__.py | 0 .../mistral/test_modeling_flax_mistral.py | 243 + tests/models/mistral/test_modeling_mistral.py | 592 ++ tests/models/mixtral/__init__.py | 0 tests/models/mixtral/test_modeling_mixtral.py | 551 ++ tests/models/mluke/test_tokenization_mluke.py | 1 - .../mobilebert/test_modeling_mobilebert.py | 22 +- .../mobilebert/test_modeling_tf_mobilebert.py | 51 +- .../test_tokenization_mobilebert.py | 21 +- .../test_image_processing_mobilenet_v1.py | 126 +- .../test_modeling_mobilenet_v1.py | 33 +- .../test_image_processing_mobilenet_v2.py | 126 +- .../test_modeling_mobilenet_v2.py | 41 +- .../test_image_processing_mobilevit.py | 135 +- .../mobilevit/test_modeling_mobilevit.py | 51 +- .../mobilevit/test_modeling_tf_mobilevit.py | 30 +- tests/models/mobilevitv2/__init__.py | 0 .../mobilevitv2/test_modeling_mobilevitv2.py | 378 + tests/models/mpnet/test_modeling_mpnet.py | 21 +- tests/models/mpnet/test_modeling_tf_mpnet.py | 19 +- tests/models/mpt/__init__.py | 0 tests/models/mpt/test_modeling_mpt.py | 519 ++ tests/models/mra/__init__.py | 0 tests/models/mra/test_modeling_mra.py | 437 + tests/models/mt5/test_modeling_mt5.py | 1050 ++- tests/models/mt5/test_modeling_tf_mt5.py | 2 + tests/models/musicgen/__init__.py | 0 
.../models/musicgen/test_modeling_musicgen.py | 1477 ++++ .../musicgen/test_processing_musicgen.py | 174 + tests/models/mvp/test_modeling_mvp.py | 62 +- tests/models/nat/test_modeling_nat.py | 41 +- tests/models/nezha/test_modeling_nezha.py | 17 +- tests/models/nllb/test_tokenization_nllb.py | 76 +- tests/models/nllb_moe/__init__.py | 0 .../models/nllb_moe/test_modeling_nllb_moe.py | 573 ++ tests/models/nougat/__init__.py | 0 .../nougat/test_image_processing_nougat.py | 194 + .../models/nougat/test_tokenization_nougat.py | 196 + .../test_modeling_nystromformer.py | 17 +- .../test_image_processing_oneformer.py | 246 +- .../oneformer/test_modeling_oneformer.py | 72 +- .../oneformer/test_processor_oneformer.py | 82 +- tests/models/openai/test_modeling_openai.py | 29 +- .../models/openai/test_modeling_tf_openai.py | 49 +- tests/models/opt/test_modeling_opt.py | 64 +- tests/models/opt/test_modeling_tf_opt.py | 30 +- tests/models/owlv2/__init__.py | 0 .../owlv2/test_image_processor_owlv2.py | 125 + tests/models/owlv2/test_modeling_owlv2.py | 907 ++ .../owlvit/test_image_processing_owlvit.py | 124 +- tests/models/owlvit/test_modeling_owlvit.py | 101 +- tests/models/owlvit/test_processor_owlvit.py | 6 +- tests/models/patchtsmixer/__init__.py | 0 .../test_modeling_patchtsmixer.py | 1115 +++ tests/models/patchtst/__init__.py | 0 .../models/patchtst/test_modeling_patchtst.py | 385 + .../pegasus/test_modeling_flax_pegasus.py | 2 +- tests/models/pegasus/test_modeling_pegasus.py | 56 +- .../pegasus/test_modeling_tf_pegasus.py | 76 +- .../pegasus/test_tokenization_pegasus.py | 18 +- .../pegasus_x/test_modeling_pegasus_x.py | 29 +- .../perceiver/test_modeling_perceiver.py | 49 +- .../perceiver/test_tokenization_perceiver.py | 10 +- tests/models/persimmon/__init__.py | 0 .../persimmon/test_modeling_persimmon.py | 446 + tests/models/phi/__init__.py | 0 tests/models/phi/test_modeling_phi.py | 469 ++ tests/models/pix2struct/__init__.py | 0 .../test_image_processing_pix2struct.py | 346 + .../pix2struct/test_modeling_pix2struct.py | 857 ++ .../pix2struct/test_processor_pix2struct.py | 192 + tests/models/plbart/test_modeling_plbart.py | 51 +- .../test_image_processing_poolformer.py | 127 +- .../poolformer/test_modeling_poolformer.py | 27 +- tests/models/pop2piano/__init__.py | 0 .../test_feature_extraction_pop2piano.py | 291 + .../pop2piano/test_modeling_pop2piano.py | 774 ++ .../pop2piano/test_processor_pop2piano.py | 264 + .../pop2piano/test_tokenization_pop2piano.py | 415 + .../prophetnet/test_modeling_prophetnet.py | 51 +- tests/models/pvt/__init__.py | 0 tests/models/pvt/test_image_processing_pvt.py | 106 + tests/models/pvt/test_modeling_pvt.py | 325 + tests/models/qdqbert/test_modeling_qdqbert.py | 20 +- tests/models/qwen2/__init__.py | 0 tests/models/qwen2/test_modeling_qwen2.py | 604 ++ tests/models/qwen2/test_tokenization_qwen2.py | 204 + tests/models/rag/test_modeling_tf_rag.py | 6 + tests/models/rag/test_retrieval_rag.py | 59 - tests/models/realm/test_modeling_realm.py | 8 +- tests/models/realm/test_retrieval_realm.py | 6 +- tests/models/realm/test_tokenization_realm.py | 2 +- .../models/reformer/test_modeling_reformer.py | 52 +- .../reformer/test_tokenization_reformer.py | 4 +- .../regnet/test_modeling_flax_regnet.py | 237 + tests/models/regnet/test_modeling_regnet.py | 31 +- .../models/regnet/test_modeling_tf_regnet.py | 21 +- tests/models/rembert/test_modeling_rembert.py | 18 +- .../rembert/test_modeling_tf_rembert.py | 22 +- .../rembert/test_tokenization_rembert.py | 243 + 
.../resnet/test_modeling_flax_resnet.py | 228 + tests/models/resnet/test_modeling_resnet.py | 45 +- .../models/resnet/test_modeling_tf_resnet.py | 20 +- .../roberta/test_modeling_flax_roberta.py | 4 +- tests/models/roberta/test_modeling_roberta.py | 46 +- .../roberta/test_modeling_tf_roberta.py | 26 +- .../roberta/test_tokenization_roberta.py | 4 +- ...test_modeling_flax_roberta_prelayernorm.py | 6 +- .../test_modeling_roberta_prelayernorm.py | 37 +- .../test_modeling_tf_roberta_prelayernorm.py | 24 +- .../models/roc_bert/test_modeling_roc_bert.py | 36 +- .../roc_bert/test_tokenization_roc_bert.py | 34 +- .../roformer/test_modeling_flax_roformer.py | 2 +- .../models/roformer/test_modeling_roformer.py | 59 +- .../roformer/test_modeling_tf_roformer.py | 31 +- .../roformer/test_tokenization_roformer.py | 11 +- tests/models/rwkv/__init__.py | 0 tests/models/rwkv/test_modeling_rwkv.py | 455 + tests/models/sam/__init__.py | 0 tests/models/sam/test_modeling_sam.py | 763 ++ tests/models/sam/test_modeling_tf_sam.py | 671 ++ tests/models/sam/test_processor_sam.py | 306 + tests/models/seamless_m4t/__init__.py | 0 .../test_feature_extraction_seamless_m4t.py | 294 + .../test_modeling_seamless_m4t.py | 1142 +++ .../test_processor_seamless_m4t.py | 126 + .../test_tokenization_seamless_m4t.py | 669 ++ tests/models/seamless_m4t_v2/__init__.py | 0 .../test_modeling_seamless_m4t_v2.py | 1191 +++ .../test_image_processing_segformer.py | 120 +- .../segformer/test_modeling_segformer.py | 49 +- .../segformer/test_modeling_tf_segformer.py | 35 +- tests/models/sew/test_modeling_sew.py | 20 +- tests/models/sew_d/test_modeling_sew_d.py | 20 +- tests/models/siglip/__init__.py | 0 .../siglip/test_image_processor_siglip.py | 125 + tests/models/siglip/test_modeling_siglip.py | 690 ++ .../models/siglip/test_tokenization_siglip.py | 462 ++ ...st_modeling_flax_speech_encoder_decoder.py | 4 +- .../test_modeling_speech_encoder_decoder.py | 4 +- .../test_feature_extraction_speech_to_text.py | 86 +- .../test_modeling_speech_to_text.py | 55 +- .../test_modeling_tf_speech_to_text.py | 22 +- .../test_processor_speech_to_text.py | 6 +- .../test_tokenization_speech_to_text.py | 24 +- .../test_modeling_speech_to_text_2.py | 8 +- .../test_feature_extraction_speecht5.py | 51 +- .../models/speecht5/test_modeling_speecht5.py | 375 +- .../speecht5/test_processor_speecht5.py | 4 +- .../speecht5/test_tokenization_speecht5.py | 76 +- .../models/splinter/test_modeling_splinter.py | 21 +- .../squeezebert/test_modeling_squeezebert.py | 17 +- tests/models/stablelm/__init__.py | 0 .../models/stablelm/test_modeling_stablelm.py | 433 + tests/models/swiftformer/__init__.py | 0 .../swiftformer/test_modeling_swiftformer.py | 293 + tests/models/swin/test_modeling_swin.py | 54 +- tests/models/swin/test_modeling_tf_swin.py | 31 +- .../swin2sr/test_image_processing_swin2sr.py | 139 +- tests/models/swin2sr/test_modeling_swin2sr.py | 44 +- tests/models/swinv2/test_modeling_swinv2.py | 113 +- .../test_modeling_switch_transformers.py | 143 +- tests/models/t5/test_modeling_flax_t5.py | 20 +- tests/models/t5/test_modeling_t5.py | 339 +- tests/models/t5/test_modeling_tf_t5.py | 94 +- tests/models/t5/test_tokenization_t5.py | 279 +- .../test_modeling_table_transformer.py | 61 +- tests/models/tapas/test_modeling_tapas.py | 28 +- tests/models/tapas/test_modeling_tf_tapas.py | 23 +- tests/models/tapas/test_tokenization_tapas.py | 21 +- tests/models/tapex/test_tokenization_tapex.py | 904 -- .../test_modeling_time_series_transformer.py | 141 +- 
.../timesformer/test_modeling_timesformer.py | 33 +- tests/models/timm_backbone/__init__.py | 0 .../test_modeling_timm_backbone.py | 275 + .../test_modeling_trajectory_transformer.py | 278 - .../transfo_xl/test_modeling_tf_transfo_xl.py | 261 - .../transfo_xl/test_modeling_transfo_xl.py | 506 -- .../test_tokenization_transfo_xl.py | 130 - tests/models/trocr/test_modeling_trocr.py | 10 +- .../tvlt/test_feature_extraction_tvlt.py | 59 +- .../models/tvlt/test_image_processor_tvlt.py | 45 +- tests/models/tvlt/test_modeling_tvlt.py | 26 +- tests/models/tvp/__init__.py | 0 tests/models/tvp/test_image_processing_tvp.py | 306 + tests/models/tvp/test_modeling_tvp.py | 267 + tests/models/umt5/__init__.py | 1 + tests/models/umt5/test_modeling_umt5.py | 777 ++ .../unispeech/test_modeling_unispeech.py | 21 +- .../test_modeling_unispeech_sat.py | 26 +- tests/models/univnet/__init__.py | 0 .../test_feature_extraction_univnet.py | 365 + tests/models/univnet/test_modeling_univnet.py | 369 + tests/models/upernet/test_modeling_upernet.py | 24 +- .../test_image_processing_videomae.py | 119 +- .../models/videomae/test_modeling_videomae.py | 41 +- .../models/vilt/test_image_processing_vilt.py | 142 +- tests/models/vilt/test_modeling_vilt.py | 29 +- tests/models/vipllava/__init__.py | 0 .../models/vipllava/test_modeling_vipllava.py | 254 + ...st_modeling_flax_vision_encoder_decoder.py | 10 +- ...test_modeling_tf_vision_encoder_decoder.py | 61 +- .../test_modeling_vision_encoder_decoder.py | 95 +- ..._modeling_flax_vision_text_dual_encoder.py | 2 +- ...st_modeling_tf_vision_text_dual_encoder.py | 421 + .../test_modeling_vision_text_dual_encoder.py | 2 +- ...test_processor_vision_text_dual_encoder.py | 8 +- .../visual_bert/test_modeling_visual_bert.py | 24 +- tests/models/vit/test_image_processing_vit.py | 126 +- tests/models/vit/test_modeling_flax_vit.py | 2 +- tests/models/vit/test_modeling_tf_vit.py | 27 +- tests/models/vit/test_modeling_vit.py | 51 +- .../vit_hybrid/test_modeling_vit_hybrid.py | 35 +- .../vit_mae/test_modeling_tf_vit_mae.py | 85 +- tests/models/vit_mae/test_modeling_vit_mae.py | 33 +- tests/models/vit_msn/test_modeling_vit_msn.py | 35 +- tests/models/vitdet/__init__.py | 0 tests/models/vitdet/test_modeling_vitdet.py | 302 + tests/models/vitmatte/__init__.py | 0 .../test_image_processing_vitmatte.py | 194 + .../models/vitmatte/test_modeling_vitmatte.py | 270 + tests/models/vits/__init__.py | 0 tests/models/vits/test_modeling_vits.py | 430 + tests/models/vits/test_tokenization_vits.py | 186 + tests/models/vivit/__init__.py | 0 .../vivit/test_image_processing_vivit.py | 228 + tests/models/vivit/test_modeling_vivit.py | 356 + .../test_feature_extraction_wav2vec2.py | 9 + .../wav2vec2/test_modeling_flax_wav2vec2.py | 4 +- .../wav2vec2/test_modeling_tf_wav2vec2.py | 268 +- .../models/wav2vec2/test_modeling_wav2vec2.py | 296 +- .../wav2vec2/test_tokenization_wav2vec2.py | 70 +- tests/models/wav2vec2_bert/__init__.py | 0 .../test_modeling_wav2vec2_bert.py | 913 +++ .../test_processor_wav2vec2_bert.py} | 46 +- .../test_modeling_wav2vec2_conformer.py | 61 +- .../test_processor_wav2vec2_with_lm.py | 23 +- tests/models/wavlm/test_modeling_wavlm.py | 24 +- .../test_feature_extraction_whisper.py | 61 +- .../whisper/test_modeling_flax_whisper.py | 923 +++ .../whisper/test_modeling_tf_whisper.py | 108 +- tests/models/whisper/test_modeling_whisper.py | 2103 ++++- .../models/whisper/test_processor_whisper.py | 31 + .../whisper/test_tokenization_whisper.py | 252 +- tests/models/x_clip/test_modeling_x_clip.py | 
52 +- tests/models/xglm/test_modeling_flax_xglm.py | 2 +- tests/models/xglm/test_modeling_tf_xglm.py | 132 +- tests/models/xglm/test_modeling_xglm.py | 119 +- tests/models/xglm/test_tokenization_xglm.py | 4 +- tests/models/xlm/test_modeling_tf_xlm.py | 38 +- tests/models/xlm/test_modeling_xlm.py | 36 +- tests/models/xlm/test_tokenization_xlm.py | 2 +- .../test_modeling_xlm_prophetnet.py | 8 +- .../test_tokenization_xlm_prophetnet.py | 4 +- .../test_modeling_flax_xlm_roberta.py | 4 +- .../test_modeling_tf_xlm_roberta.py | 2 + .../xlm_roberta/test_modeling_xlm_roberta.py | 4 +- .../test_tokenization_xlm_roberta.py | 8 +- .../test_modeling_xlm_roberta_xl.py | 27 +- tests/models/xlnet/test_modeling_tf_xlnet.py | 40 +- tests/models/xlnet/test_modeling_xlnet.py | 28 +- tests/models/xlnet/test_tokenization_xlnet.py | 9 +- tests/models/xmod/test_modeling_xmod.py | 29 +- .../yolos/test_image_processing_yolos.py | 425 +- tests/models/yolos/test_modeling_yolos.py | 52 +- tests/models/yoso/test_modeling_yoso.py | 17 +- tests/onnx/test_features.py | 111 - tests/onnx/test_onnx.py | 195 - tests/onnx/test_onnx_v2.py | 533 -- tests/optimization/test_optimization.py | 16 + .../peft_integration/test_peft_integration.py | 496 ++ .../test_pipelines_audio_classification.py | 45 +- ..._pipelines_automatic_speech_recognition.py | 399 +- tests/pipelines/test_pipelines_common.py | 327 +- .../test_pipelines_conversational.py | 116 +- .../test_pipelines_depth_estimation.py | 30 +- ...t_pipelines_document_question_answering.py | 6 +- .../test_pipelines_feature_extraction.py | 7 +- tests/pipelines/test_pipelines_fill_mask.py | 108 +- .../test_pipelines_image_classification.py | 66 +- ...test_pipelines_image_feature_extraction.py | 157 + .../test_pipelines_image_segmentation.py | 53 +- .../test_pipelines_image_to_image.py | 85 + .../pipelines/test_pipelines_image_to_text.py | 124 +- .../test_pipelines_mask_generation.py | 160 + .../test_pipelines_object_detection.py | 16 +- .../test_pipelines_question_answering.py | 34 +- .../pipelines/test_pipelines_summarization.py | 14 +- ...test_pipelines_table_question_answering.py | 23 +- .../test_pipelines_text2text_generation.py | 7 +- .../test_pipelines_text_classification.py | 22 +- .../test_pipelines_text_generation.py | 174 +- .../pipelines/test_pipelines_text_to_audio.py | 245 + .../test_pipelines_token_classification.py | 174 +- tests/pipelines/test_pipelines_translation.py | 7 +- .../test_pipelines_video_classification.py | 8 +- ...est_pipelines_visual_question_answering.py | 77 +- tests/pipelines/test_pipelines_zero_shot.py | 26 +- ...ipelines_zero_shot_audio_classification.py | 94 + ...ipelines_zero_shot_image_classification.py | 58 +- ...st_pipelines_zero_shot_object_detection.py | 14 +- .../quantization/aqlm_integration/__init__.py | 0 .../aqlm_integration/test_aqlm.py | 183 + tests/quantization/autoawq/__init__.py | 0 tests/quantization/autoawq/test_awq.py | 460 ++ .../bnb}/README.md | 16 +- tests/quantization/bnb/__init__.py | 0 tests/quantization/bnb/test_4bit.py | 665 ++ .../bnb}/test_mixed_int8.py | 351 +- tests/quantization/gptq/__init__.py | 0 tests/quantization/gptq/test_gptq.py | 437 + tests/repo_utils/test_check_copies.py | 379 +- tests/repo_utils/test_check_docstrings.py | 98 + tests/repo_utils/test_get_test_info.py | 109 + tests/repo_utils/test_tests_fetcher.py | 820 +- tests/sagemaker/conftest.py | 12 +- .../pytorch/run_glue_model_parallelism.py | 16 +- tests/sagemaker/scripts/tensorflow/run_tf.py | 20 +- .../scripts/tensorflow/run_tf_dist.py | 7 +- 
.../test_multi_node_data_parallel.py | 6 +- .../test_multi_node_model_parallel.py | 4 +- tests/sagemaker/test_single_node_gpu.py | 4 +- tests/test_backbone_common.py | 227 + tests/test_cache_utils.py | 387 + tests/test_configuration_common.py | 291 +- tests/test_configuration_utils.py | 314 + tests/test_feature_extraction_common.py | 128 +- tests/test_feature_extraction_utils.py | 144 + tests/test_image_processing_common.py | 357 +- tests/test_image_processing_utils.py | 154 + tests/test_image_transforms.py | 176 +- tests/test_modeling_common.py | 2659 +++--- tests/test_modeling_flax_common.py | 227 +- tests/test_modeling_flax_utils.py | 403 + tests/test_modeling_tf_common.py | 888 +- tests/test_modeling_tf_utils.py | 729 ++ tests/test_modeling_utils.py | 2086 +++++ tests/test_pipeline_mixin.py | 527 ++ tests/test_processing_common.py | 128 + ...test_sequence_feature_extraction_common.py | 10 +- tests/test_tokenization_common.py | 758 +- tests/test_tokenization_utils.py | 280 + tests/tokenization/test_tokenization_fast.py | 68 +- tests/tokenization/test_tokenization_utils.py | 24 +- tests/tools/__init__.py | 0 tests/tools/test_agent_types.py | 121 + .../tools/test_document_question_answering.py | 56 + tests/tools/test_image_captioning.py | 53 + tests/tools/test_image_question_answering.py | 53 + tests/tools/test_image_segmentation.py | 53 + tests/tools/test_python_interpreter.py | 131 + tests/tools/test_speech_to_text.py | 38 + tests/tools/test_text_classification.py | 43 + tests/tools/test_text_question_answering.py | 52 + tests/tools/test_text_summarization.py | 64 + tests/tools/test_text_to_speech.py | 58 + tests/tools/test_tools_common.py | 133 + tests/tools/test_translation.py | 86 + tests/trainer/test_data_collator.py | 27 +- tests/trainer/test_trainer.py | 1329 ++- tests/trainer/test_trainer_callback.py | 4 +- tests/trainer/test_trainer_distributed.py | 128 +- tests/trainer/test_trainer_seq2seq.py | 67 +- tests/trainer/test_trainer_utils.py | 8 +- tests/utils/test_add_new_model_like.py | 54 +- tests/utils/test_audio_utils.py | 757 ++ tests/utils/test_backbone_utils.py | 269 + tests/utils/test_cli.py | 47 +- tests/utils/test_convert_slow_tokenizer.py | 1 + tests/utils/test_doc_samples.py | 1 - tests/utils/test_dynamic_module_utils.py | 129 + tests/utils/test_file_utils.py | 64 +- tests/utils/test_hf_argparser.py | 80 +- tests/utils/test_hub_utils.py | 23 +- tests/utils/test_image_utils.py | 83 +- tests/utils/test_logging.py | 1 + tests/utils/test_model_output.py | 82 +- tests/utils/test_modeling_tf_core.py | 48 +- tests/utils/test_offline.py | 29 +- tests/utils/test_versions_utils.py | 9 +- tests/utils/tiny_model_summary.json | 7290 +++++++++++++++++ utils/add_pipeline_model_mapping_to_test.py | 337 + utils/check_build.py | 48 + utils/check_config_attributes.py | 100 +- utils/check_config_docstrings.py | 27 +- utils/check_copies.py | 759 +- utils/check_doc_toc.py | 45 +- utils/check_docstrings.py | 1246 +++ utils/check_doctest_list.py | 64 +- utils/check_dummies.py | 92 +- utils/check_inits.py | 97 +- utils/check_model_tester.py | 63 + utils/check_repo.py | 616 +- utils/check_self_hosted_runner.py | 2 +- utils/check_support_list.py | 95 + utils/check_table.py | 227 +- utils/check_task_guides.py | 103 +- utils/create_dummy_models.py | 786 +- utils/custom_init_isort.py | 109 +- utils/documentation_tests.txt | 230 - utils/extract_warnings.py | 13 +- utils/get_ci_error_statistics.py | 94 +- utils/get_github_job_time.py | 19 +- utils/get_previous_daily_ci.py | 70 + utils/get_test_info.py 
| 190 + utils/not_doctested.txt | 1015 +++ utils/notification_service.py | 403 +- utils/notification_service_doc_tests.py | 71 +- utils/past_ci_versions.py | 102 +- utils/prepare_for_doc_test.py | 148 - utils/release.py | 99 +- utils/slow_documentation_tests.txt | 11 + utils/sort_auto_mappings.py | 45 +- utils/split_model_tests.py | 65 + utils/tests_fetcher.py | 1328 ++- utils/update_metadata.py | 148 +- utils/update_tiny_models.py | 200 + 3216 files changed, 462991 insertions(+), 71988 deletions(-) create mode 100644 .github/workflows/build-nightly-ci-docker-images.yml delete mode 100644 .github/workflows/check_runner_status.yml create mode 100644 .github/workflows/check_tiny_models.yml delete mode 100644 .github/workflows/delete_doc_comment.yml create mode 100644 .github/workflows/model_jobs.yml create mode 100644 .github/workflows/self-nightly-past-ci-caller.yml delete mode 100644 .github/workflows/self-past-caller.yml create mode 100644 .github/workflows/self-push-amd-mi210-caller.yml create mode 100644 .github/workflows/self-push-amd-mi250-caller.yml create mode 100644 .github/workflows/self-push-amd.yml create mode 100644 .github/workflows/self-scheduled-amd-caller.yml create mode 100644 .github/workflows/self-scheduled-amd-mi210-caller.yml create mode 100644 .github/workflows/self-scheduled-amd-mi250-caller.yml create mode 100644 .github/workflows/self-scheduled-amd.yml create mode 100644 .github/workflows/upload_pr_documentation.yml delete mode 100644 MANIFEST.in create mode 100644 README_de.md create mode 100644 README_fr.md create mode 100644 README_pt-br.md create mode 100644 README_ru.md create mode 100644 README_te.md create mode 100644 SECURITY.md create mode 100644 awesome-transformers.md delete mode 100644 docker/transformers-cpu/Dockerfile create mode 100644 docker/transformers-pytorch-amd-gpu/Dockerfile delete mode 100644 docker/transformers-pytorch-cpu/Dockerfile create mode 100644 docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile delete mode 100644 docker/transformers-tensorflow-cpu/Dockerfile rename docs/source/de/{accelerate.mdx => accelerate.md} (96%) create mode 100644 docs/source/de/add_new_model.md create mode 100644 docs/source/de/add_new_pipeline.md create mode 100644 docs/source/de/add_tensorflow_model.md rename docs/source/de/{autoclass_tutorial.mdx => autoclass_tutorial.md} (91%) create mode 100644 docs/source/de/contributing.md rename docs/source/de/{index.mdx => index.md} (96%) rename docs/source/de/{installation.mdx => installation.md} (94%) create mode 100644 docs/source/de/llm_tutorial.md rename docs/source/de/{model_sharing.mdx => model_sharing.md} (95%) create mode 100644 docs/source/de/peft.md rename docs/source/de/{pipeline_tutorial.mdx => pipeline_tutorial.md} (90%) create mode 100644 docs/source/de/pr_checks.md rename docs/source/de/{preprocessing.mdx => preprocessing.md} (95%) rename docs/source/de/{quicktour.mdx => quicktour.md} (93%) create mode 100644 docs/source/de/run_scripts.md create mode 100644 docs/source/de/testing.md rename docs/source/de/{training.mdx => training.md} (95%) create mode 100644 docs/source/de/transformers_agents.md create mode 100644 docs/source/en/_redirects.yml rename docs/source/en/{accelerate.mdx => accelerate.md} (93%) rename docs/source/en/{add_new_model.mdx => add_new_model.md} (94%) rename docs/source/en/{add_new_pipeline.mdx => add_new_pipeline.md} (93%) rename docs/source/en/{add_tensorflow_model.mdx => add_tensorflow_model.md} (94%) rename docs/source/en/{attention.mdx => attention.md} (85%) rename 
docs/source/en/{autoclass_tutorial.mdx => autoclass_tutorial.md} (61%) rename docs/source/en/{benchmarks.mdx => benchmarks.md} (90%) rename docs/source/en/{bertology.mdx => bertology.md} (86%) rename docs/source/en/{big_models.mdx => big_models.md} (90%) create mode 100644 docs/source/en/chat_templating.md rename docs/source/en/{community.mdx => community.md} (92%) delete mode 100644 docs/source/en/converting_tensorflow_models.mdx rename docs/source/en/{create_a_model.mdx => create_a_model.md} (79%) rename docs/source/en/{custom_models.mdx => custom_models.md} (91%) create mode 100644 docs/source/en/custom_tools.md rename docs/source/en/{debugging.mdx => debugging.md} (67%) create mode 100644 docs/source/en/deepspeed.md rename docs/source/en/{fast_tokenizers.mdx => fast_tokenizers.md} (94%) create mode 100644 docs/source/en/fsdp.md rename docs/source/en/{generation_strategies.mdx => generation_strategies.md} (62%) rename docs/source/en/{glossary.mdx => glossary.md} (63%) create mode 100644 docs/source/en/hf_quantizer.md rename docs/source/en/{hpo_train.mdx => hpo_train.md} (82%) create mode 100644 docs/source/en/index.md rename docs/source/en/{installation.mdx => installation.md} (86%) rename docs/source/en/internal/{audio_utils.mdx => audio_utils.md} (69%) rename docs/source/en/internal/{file_utils.mdx => file_utils.md} (88%) rename docs/source/en/internal/{generation_utils.mdx => generation_utils.md} (66%) rename docs/source/en/internal/{image_processing_utils.mdx => image_processing_utils.md} (89%) rename docs/source/en/internal/{modeling_utils.mdx => modeling_utils.md} (92%) rename docs/source/en/internal/{pipelines_utils.mdx => pipelines_utils.md} (87%) create mode 100644 docs/source/en/internal/time_series_utils.md rename docs/source/en/internal/{tokenization_utils.mdx => tokenization_utils.md} (89%) rename docs/source/en/internal/{trainer_utils.mdx => trainer_utils.md} (85%) create mode 100644 docs/source/en/llm_tutorial.md create mode 100644 docs/source/en/llm_tutorial_optimization.md create mode 100644 docs/source/en/main_classes/agent.md create mode 100644 docs/source/en/main_classes/backbones.md rename docs/source/en/main_classes/{callback.mdx => callback.md} (84%) rename docs/source/en/main_classes/{configuration.mdx => configuration.md} (87%) rename docs/source/en/main_classes/{data_collator.mdx => data_collator.md} (92%) create mode 100644 docs/source/en/main_classes/deepspeed.md delete mode 100644 docs/source/en/main_classes/deepspeed.mdx rename docs/source/en/main_classes/{feature_extractor.mdx => feature_extractor.md} (68%) rename docs/source/en/main_classes/{image_processor.mdx => image_processor.md} (87%) rename docs/source/en/main_classes/{keras_callbacks.mdx => keras_callbacks.md} (83%) rename docs/source/en/main_classes/{logging.mdx => logging.md} (76%) rename docs/source/en/main_classes/{model.mdx => model.md} (96%) rename docs/source/en/main_classes/{onnx.mdx => onnx.md} (90%) rename docs/source/en/main_classes/{optimizer_schedules.mdx => optimizer_schedules.md} (92%) rename docs/source/en/main_classes/{output.mdx => output.md} (89%) rename docs/source/en/main_classes/{pipelines.mdx => pipelines.md} (95%) rename docs/source/en/main_classes/{processors.mdx => processors.md} (95%) create mode 100644 docs/source/en/main_classes/quantization.md delete mode 100644 docs/source/en/main_classes/quantization.mdx rename docs/source/en/main_classes/{text_generation.mdx => text_generation.md} (86%) rename docs/source/en/main_classes/{tokenizer.mdx => tokenizer.md} (92%) create 
mode 100644 docs/source/en/main_classes/trainer.md delete mode 100644 docs/source/en/main_classes/trainer.mdx delete mode 100644 docs/source/en/migration.mdx rename docs/source/en/model_doc/{albert.mdx => albert.md} (51%) create mode 100644 docs/source/en/model_doc/align.md rename docs/source/en/model_doc/{altclip.mdx => altclip.md} (94%) rename docs/source/en/model_doc/{audio-spectrogram-transformer.mdx => audio-spectrogram-transformer.md} (91%) rename docs/source/en/model_doc/{auto.mdx => auto.md} (85%) create mode 100644 docs/source/en/model_doc/autoformer.md create mode 100644 docs/source/en/model_doc/bark.md rename docs/source/en/model_doc/{bart.mdx => bart.md} (89%) rename docs/source/en/model_doc/{barthez.mdx => barthez.md} (86%) rename docs/source/en/model_doc/{bartpho.mdx => bartpho.md} (94%) rename docs/source/en/model_doc/{beit.mdx => beit.md} (94%) rename docs/source/en/model_doc/{bert-generation.mdx => bert-generation.md} (83%) rename docs/source/en/model_doc/{bert-japanese.mdx => bert-japanese.md} (87%) rename docs/source/en/model_doc/{bert.mdx => bert.md} (95%) rename docs/source/en/model_doc/{bertweet.mdx => bertweet.md} (87%) rename docs/source/en/model_doc/{big_bird.mdx => big_bird.md} (88%) rename docs/source/en/model_doc/{bigbird_pegasus.mdx => bigbird_pegasus.md} (89%) rename docs/source/en/model_doc/{biogpt.mdx => biogpt.md} (73%) rename docs/source/en/model_doc/{bit.mdx => bit.md} (92%) rename docs/source/en/model_doc/{blenderbot-small.mdx => blenderbot-small.md} (84%) rename docs/source/en/model_doc/{blenderbot.mdx => blenderbot.md} (86%) rename docs/source/en/model_doc/{blip-2.mdx => blip-2.md} (93%) rename docs/source/en/model_doc/{blip.mdx => blip.md} (77%) rename docs/source/en/model_doc/{bloom.mdx => bloom.md} (83%) rename docs/source/en/model_doc/{bort.mdx => bort.md} (76%) rename docs/source/en/model_doc/{bridgetower.mdx => bridgetower.md} (85%) create mode 100644 docs/source/en/model_doc/bros.md rename docs/source/en/model_doc/{byt5.mdx => byt5.md} (95%) rename docs/source/en/model_doc/{camembert.mdx => camembert.md} (69%) rename docs/source/en/model_doc/{canine.mdx => canine.md} (91%) rename docs/source/en/model_doc/{chinese_clip.mdx => chinese_clip.md} (94%) rename docs/source/en/model_doc/{clap.mdx => clap.md} (76%) rename docs/source/en/model_doc/{clip.mdx => clip.md} (71%) rename docs/source/en/model_doc/{clipseg.mdx => clipseg.md} (95%) create mode 100644 docs/source/en/model_doc/clvp.md create mode 100644 docs/source/en/model_doc/code_llama.md rename docs/source/en/model_doc/{codegen.mdx => codegen.md} (94%) rename docs/source/en/model_doc/{conditional_detr.mdx => conditional_detr.md} (93%) rename docs/source/en/model_doc/{convbert.mdx => convbert.md} (84%) rename docs/source/en/model_doc/{convnext.mdx => convnext.md} (93%) create mode 100644 docs/source/en/model_doc/convnextv2.md rename docs/source/en/model_doc/{cpm.mdx => cpm.md} (87%) create mode 100644 docs/source/en/model_doc/cpmant.md rename docs/source/en/model_doc/{ctrl.mdx => ctrl.md} (90%) rename docs/source/en/model_doc/{cvt.mdx => cvt.md} (93%) rename docs/source/en/model_doc/{data2vec.mdx => data2vec.md} (85%) rename docs/source/en/model_doc/{deberta-v2.mdx => deberta-v2.md} (89%) rename docs/source/en/model_doc/{deberta.mdx => deberta.md} (93%) rename docs/source/en/model_doc/{decision_transformer.mdx => decision_transformer.md} (93%) rename docs/source/en/model_doc/{deformable_detr.mdx => deformable_detr.md} (93%) rename docs/source/en/model_doc/{deit.mdx => deit.md} (95%) create mode 
100644 docs/source/en/model_doc/deplot.md create mode 100644 docs/source/en/model_doc/depth_anything.md rename docs/source/en/model_doc/{deta.mdx => deta.md} (93%) rename docs/source/en/model_doc/{detr.mdx => detr.md} (96%) rename docs/source/en/model_doc/{dialogpt.mdx => dialogpt.md} (89%) rename docs/source/en/model_doc/{dinat.mdx => dinat.md} (93%) create mode 100644 docs/source/en/model_doc/dinov2.md rename docs/source/en/model_doc/{distilbert.mdx => distilbert.md} (86%) rename docs/source/en/model_doc/{dit.mdx => dit.md} (92%) rename docs/source/en/model_doc/{donut.mdx => donut.md} (96%) rename docs/source/en/model_doc/{dpr.mdx => dpr.md} (93%) rename docs/source/en/model_doc/{dpt.mdx => dpt.md} (76%) rename docs/source/en/model_doc/{efficientformer.mdx => efficientformer.md} (81%) create mode 100644 docs/source/en/model_doc/efficientnet.md rename docs/source/en/model_doc/{electra.mdx => electra.md} (91%) create mode 100644 docs/source/en/model_doc/encodec.md rename docs/source/en/model_doc/{encoder-decoder.mdx => encoder-decoder.md} (94%) rename docs/source/en/model_doc/{ernie.mdx => ernie.md} (84%) rename docs/source/en/model_doc/{ernie_m.mdx => ernie_m.md} (77%) rename docs/source/en/model_doc/{esm.mdx => esm.md} (91%) create mode 100644 docs/source/en/model_doc/falcon.md create mode 100644 docs/source/en/model_doc/fastspeech2_conformer.md rename docs/source/en/model_doc/{flan-t5.mdx => flan-t5.md} (85%) create mode 100644 docs/source/en/model_doc/flan-ul2.md rename docs/source/en/model_doc/{flaubert.mdx => flaubert.md} (87%) rename docs/source/en/model_doc/{flava.mdx => flava.md} (94%) rename docs/source/en/model_doc/{fnet.mdx => fnet.md} (81%) create mode 100644 docs/source/en/model_doc/focalnet.md rename docs/source/en/model_doc/{fsmt.mdx => fsmt.md} (93%) rename docs/source/en/model_doc/{funnel.mdx => funnel.md} (91%) create mode 100644 docs/source/en/model_doc/fuyu.md rename docs/source/en/model_doc/{git.mdx => git.md} (93%) rename docs/source/en/model_doc/{glpn.mdx => glpn.md} (94%) rename docs/source/en/model_doc/{gpt-sw3.mdx => gpt-sw3.md} (67%) rename docs/source/en/model_doc/{gpt2.mdx => gpt2.md} (92%) create mode 100644 docs/source/en/model_doc/gpt_bigcode.md create mode 100644 docs/source/en/model_doc/gpt_neo.md delete mode 100644 docs/source/en/model_doc/gpt_neo.mdx rename docs/source/en/model_doc/{gpt_neox.mdx => gpt_neox.md} (54%) rename docs/source/en/model_doc/{gpt_neox_japanese.mdx => gpt_neox_japanese.md} (91%) rename docs/source/en/model_doc/{gptj.mdx => gptj.md} (86%) create mode 100644 docs/source/en/model_doc/gptsan-japanese.md rename docs/source/en/model_doc/{graphormer.mdx => graphormer.md} (92%) rename docs/source/en/model_doc/{groupvit.mdx => groupvit.md} (93%) rename docs/source/en/model_doc/{herbert.mdx => herbert.md} (90%) rename docs/source/en/model_doc/{hubert.mdx => hubert.md} (88%) rename docs/source/en/model_doc/{ibert.mdx => ibert.md} (85%) create mode 100644 docs/source/en/model_doc/idefics.md rename docs/source/en/model_doc/{imagegpt.mdx => imagegpt.md} (96%) create mode 100644 docs/source/en/model_doc/informer.md create mode 100644 docs/source/en/model_doc/instructblip.md rename docs/source/en/model_doc/{jukebox.mdx => jukebox.md} (77%) create mode 100644 docs/source/en/model_doc/kosmos-2.md rename docs/source/en/model_doc/{layoutlm.mdx => layoutlm.md} (92%) rename docs/source/en/model_doc/{layoutlmv2.mdx => layoutlmv2.md} (84%) rename docs/source/en/model_doc/{layoutlmv3.mdx => layoutlmv3.md} (90%) rename 
docs/source/en/model_doc/{layoutxlm.mdx => layoutxlm.md} (93%) rename docs/source/en/model_doc/{led.mdx => led.md} (85%) rename docs/source/en/model_doc/{levit.mdx => levit.md} (96%) rename docs/source/en/model_doc/{lilt.mdx => lilt.md} (91%) create mode 100644 docs/source/en/model_doc/llama.md create mode 100644 docs/source/en/model_doc/llama2.md create mode 100644 docs/source/en/model_doc/llava.md rename docs/source/en/model_doc/{longformer.mdx => longformer.md} (92%) rename docs/source/en/model_doc/{longt5.mdx => longt5.md} (94%) rename docs/source/en/model_doc/{luke.mdx => luke.md} (90%) rename docs/source/en/model_doc/{lxmert.mdx => lxmert.md} (93%) rename docs/source/en/model_doc/{m2m_100.mdx => m2m_100.md} (86%) create mode 100644 docs/source/en/model_doc/madlad-400.md rename docs/source/en/model_doc/{marian.mdx => marian.md} (91%) rename docs/source/en/model_doc/{markuplm.mdx => markuplm.md} (94%) rename docs/source/en/model_doc/{mask2former.mdx => mask2former.md} (96%) rename docs/source/en/model_doc/{maskformer.mdx => maskformer.md} (92%) create mode 100644 docs/source/en/model_doc/matcha.md rename docs/source/en/model_doc/{mbart.mdx => mbart.md} (93%) rename docs/source/en/model_doc/{mctct.mdx => mctct.md} (80%) create mode 100644 docs/source/en/model_doc/mega.md rename docs/source/en/model_doc/{megatron-bert.mdx => megatron-bert.md} (86%) rename docs/source/en/model_doc/{megatron_gpt2.mdx => megatron_gpt2.md} (86%) create mode 100644 docs/source/en/model_doc/mgp-str.md create mode 100644 docs/source/en/model_doc/mistral.md create mode 100644 docs/source/en/model_doc/mixtral.md rename docs/source/en/model_doc/{mluke.mdx => mluke.md} (93%) create mode 100644 docs/source/en/model_doc/mms.md rename docs/source/en/model_doc/{mobilebert.mdx => mobilebert.md} (87%) rename docs/source/en/model_doc/{mobilenet_v1.mdx => mobilenet_v1.md} (95%) rename docs/source/en/model_doc/{mobilenet_v2.mdx => mobilenet_v2.md} (94%) rename docs/source/en/model_doc/{mobilevit.mdx => mobilevit.md} (93%) create mode 100644 docs/source/en/model_doc/mobilevitv2.md rename docs/source/en/model_doc/{mpnet.mdx => mpnet.md} (81%) create mode 100644 docs/source/en/model_doc/mpt.md create mode 100644 docs/source/en/model_doc/mra.md rename docs/source/en/model_doc/{mt5.mdx => mt5.md} (86%) create mode 100644 docs/source/en/model_doc/musicgen.md rename docs/source/en/model_doc/{mvp.mdx => mvp.md} (90%) rename docs/source/en/model_doc/{nat.mdx => nat.md} (94%) rename docs/source/en/model_doc/{nezha.mdx => nezha.md} (84%) create mode 100644 docs/source/en/model_doc/nllb-moe.md rename docs/source/en/model_doc/{nllb.mdx => nllb.md} (71%) create mode 100644 docs/source/en/model_doc/nougat.md rename docs/source/en/model_doc/{nystromformer.mdx => nystromformer.md} (85%) rename docs/source/en/model_doc/{oneformer.mdx => oneformer.md} (97%) create mode 100644 docs/source/en/model_doc/open-llama.md rename docs/source/en/model_doc/{openai-gpt.mdx => openai-gpt.md} (94%) rename docs/source/en/model_doc/{opt.mdx => opt.md} (64%) create mode 100644 docs/source/en/model_doc/owlv2.md rename docs/source/en/model_doc/{owlvit.mdx => owlvit.md} (79%) create mode 100644 docs/source/en/model_doc/patchtsmixer.md create mode 100644 docs/source/en/model_doc/patchtst.md rename docs/source/en/model_doc/{pegasus.mdx => pegasus.md} (93%) rename docs/source/en/model_doc/{pegasus_x.mdx => pegasus_x.md} (81%) rename docs/source/en/model_doc/{perceiver.mdx => perceiver.md} (94%) create mode 100644 docs/source/en/model_doc/persimmon.md create mode 
100644 docs/source/en/model_doc/phi.md rename docs/source/en/model_doc/{phobert.mdx => phobert.md} (83%) create mode 100644 docs/source/en/model_doc/pix2struct.md rename docs/source/en/model_doc/{plbart.mdx => plbart.md} (91%) rename docs/source/en/model_doc/{poolformer.mdx => poolformer.md} (95%) create mode 100644 docs/source/en/model_doc/pop2piano.md rename docs/source/en/model_doc/{prophetnet.mdx => prophetnet.md} (91%) create mode 100644 docs/source/en/model_doc/pvt.md rename docs/source/en/model_doc/{qdqbert.mdx => qdqbert.md} (90%) create mode 100644 docs/source/en/model_doc/qwen2.md rename docs/source/en/model_doc/{rag.mdx => rag.md} (86%) rename docs/source/en/model_doc/{realm.mdx => realm.md} (95%) rename docs/source/en/model_doc/{reformer.mdx => reformer.md} (91%) rename docs/source/en/model_doc/{regnet.mdx => regnet.md} (80%) rename docs/source/en/model_doc/{rembert.mdx => rembert.md} (85%) rename docs/source/en/model_doc/{resnet.mdx => resnet.md} (89%) rename docs/source/en/model_doc/{retribert.mdx => retribert.md} (73%) rename docs/source/en/model_doc/{roberta-prelayernorm.mdx => roberta-prelayernorm.md} (85%) rename docs/source/en/model_doc/{roberta.mdx => roberta.md} (92%) rename docs/source/en/model_doc/{roc_bert.mdx => roc_bert.md} (83%) rename docs/source/en/model_doc/{roformer.mdx => roformer.md} (82%) create mode 100644 docs/source/en/model_doc/rwkv.md create mode 100644 docs/source/en/model_doc/sam.md create mode 100644 docs/source/en/model_doc/seamless_m4t.md create mode 100644 docs/source/en/model_doc/seamless_m4t_v2.md rename docs/source/en/model_doc/{segformer.mdx => segformer.md} (96%) rename docs/source/en/model_doc/{sew-d.mdx => sew-d.md} (87%) rename docs/source/en/model_doc/{sew.mdx => sew.md} (87%) create mode 100644 docs/source/en/model_doc/siglip.md rename docs/source/en/model_doc/{speech-encoder-decoder.mdx => speech-encoder-decoder.md} (94%) rename docs/source/en/model_doc/{speech_to_text.mdx => speech_to_text.md} (96%) rename docs/source/en/model_doc/{speech_to_text_2.mdx => speech_to_text_2.md} (94%) rename docs/source/en/model_doc/{speecht5.mdx => speecht5.md} (94%) rename docs/source/en/model_doc/{splinter.mdx => splinter.md} (93%) rename docs/source/en/model_doc/{squeezebert.mdx => squeezebert.md} (88%) create mode 100644 docs/source/en/model_doc/stablelm.md create mode 100644 docs/source/en/model_doc/swiftformer.md rename docs/source/en/model_doc/{swin.mdx => swin.md} (93%) rename docs/source/en/model_doc/{swin2sr.mdx => swin2sr.md} (95%) rename docs/source/en/model_doc/{swinv2.mdx => swinv2.md} (93%) rename docs/source/en/model_doc/{switch_transformers.mdx => switch_transformers.md} (79%) rename docs/source/en/model_doc/{t5.mdx => t5.md} (84%) rename docs/source/en/model_doc/{t5v1.1.mdx => t5v1.1.md} (90%) rename docs/source/en/model_doc/{table-transformer.mdx => table-transformer.md} (87%) rename docs/source/en/model_doc/{tapas.mdx => tapas.md} (98%) rename docs/source/en/model_doc/{tapex.mdx => tapex.md} (90%) rename docs/source/en/model_doc/{time_series_transformer.mdx => time_series_transformer.md} (86%) rename docs/source/en/model_doc/{timesformer.mdx => timesformer.md} (84%) rename docs/source/en/model_doc/{trajectory_transformer.mdx => trajectory_transformer.md} (84%) rename docs/source/en/model_doc/{transfo-xl.mdx => transfo-xl.md} (71%) rename docs/source/en/model_doc/{trocr.mdx => trocr.md} (71%) rename docs/source/en/model_doc/{tvlt.mdx => tvlt.md} (95%) create mode 100644 docs/source/en/model_doc/tvp.md rename 
docs/source/en/model_doc/{ul2.mdx => ul2.md} (86%) create mode 100644 docs/source/en/model_doc/umt5.md rename docs/source/en/model_doc/{unispeech-sat.mdx => unispeech-sat.md} (89%) rename docs/source/en/model_doc/{unispeech.mdx => unispeech.md} (90%) create mode 100644 docs/source/en/model_doc/univnet.md rename docs/source/en/model_doc/{upernet.mdx => upernet.md} (93%) rename docs/source/en/model_doc/{van.mdx => van.md} (83%) rename docs/source/en/model_doc/{videomae.mdx => videomae.md} (92%) rename docs/source/en/model_doc/{vilt.mdx => vilt.md} (93%) create mode 100644 docs/source/en/model_doc/vipllava.md rename docs/source/en/model_doc/{vision-encoder-decoder.mdx => vision-encoder-decoder.md} (94%) rename docs/source/en/model_doc/{vision-text-dual-encoder.mdx => vision-text-dual-encoder.md} (85%) rename docs/source/en/model_doc/{visual_bert.mdx => visual_bert.md} (95%) rename docs/source/en/model_doc/{vit.mdx => vit.md} (86%) rename docs/source/en/model_doc/{vit_hybrid.mdx => vit_hybrid.md} (93%) rename docs/source/en/model_doc/{vit_mae.mdx => vit_mae.md} (95%) rename docs/source/en/model_doc/{vit_msn.mdx => vit_msn.md} (94%) create mode 100644 docs/source/en/model_doc/vitdet.md create mode 100644 docs/source/en/model_doc/vitmatte.md create mode 100644 docs/source/en/model_doc/vits.md create mode 100644 docs/source/en/model_doc/vivit.md create mode 100644 docs/source/en/model_doc/wav2vec2-bert.md rename docs/source/en/model_doc/{wav2vec2-conformer.mdx => wav2vec2-conformer.md} (89%) rename docs/source/en/model_doc/{wav2vec2.mdx => wav2vec2.md} (92%) rename docs/source/en/model_doc/{wav2vec2_phoneme.mdx => wav2vec2_phoneme.md} (84%) rename docs/source/en/model_doc/{wavlm.mdx => wavlm.md} (87%) create mode 100644 docs/source/en/model_doc/whisper.md delete mode 100644 docs/source/en/model_doc/whisper.mdx rename docs/source/en/model_doc/{xclip.mdx => xclip.md} (96%) rename docs/source/en/model_doc/{xglm.mdx => xglm.md} (91%) rename docs/source/en/model_doc/{xlm-prophetnet.mdx => xlm-prophetnet.md} (86%) rename docs/source/en/model_doc/{xlm-roberta-xl.mdx => xlm-roberta-xl.md} (74%) rename docs/source/en/model_doc/{xlm-roberta.mdx => xlm-roberta.md} (91%) rename docs/source/en/model_doc/{xlm-v.mdx => xlm-v.md} (89%) rename docs/source/en/model_doc/{xlm.mdx => xlm.md} (89%) rename docs/source/en/model_doc/{xlnet.mdx => xlnet.md} (91%) rename docs/source/en/model_doc/{xls_r.mdx => xls_r.md} (89%) rename docs/source/en/model_doc/{xlsr_wav2vec2.mdx => xlsr_wav2vec2.md} (92%) rename docs/source/en/model_doc/{xmod.mdx => xmod.md} (88%) rename docs/source/en/model_doc/{yolos.mdx => yolos.md} (89%) rename docs/source/en/model_doc/{yoso.mdx => yoso.md} (88%) create mode 100644 docs/source/en/model_memory_anatomy.md rename docs/source/en/{model_sharing.mdx => model_sharing.md} (92%) rename docs/source/en/{model_summary.mdx => model_summary.md} (99%) rename docs/source/en/{multilingual.mdx => multilingual.md} (77%) rename docs/source/en/{pad_truncation.mdx => pad_truncation.md} (95%) create mode 100644 docs/source/en/peft.md rename docs/source/en/{perf_hardware.mdx => perf_hardware.md} (94%) create mode 100644 docs/source/en/perf_infer_cpu.md delete mode 100644 docs/source/en/perf_infer_cpu.mdx delete mode 100644 docs/source/en/perf_infer_gpu_many.mdx create mode 100644 docs/source/en/perf_infer_gpu_one.md delete mode 100644 docs/source/en/perf_infer_gpu_one.mdx create mode 100644 docs/source/en/perf_torch_compile.md create mode 100644 docs/source/en/perf_train_cpu.md delete mode 100644 
docs/source/en/perf_train_cpu.mdx create mode 100644 docs/source/en/perf_train_cpu_many.md delete mode 100644 docs/source/en/perf_train_cpu_many.mdx create mode 100644 docs/source/en/perf_train_gpu_many.md delete mode 100644 docs/source/en/perf_train_gpu_many.mdx create mode 100644 docs/source/en/perf_train_gpu_one.md delete mode 100644 docs/source/en/perf_train_gpu_one.mdx create mode 100644 docs/source/en/perf_train_special.md delete mode 100644 docs/source/en/perf_train_special.mdx delete mode 100644 docs/source/en/perf_train_tpu.mdx rename docs/source/en/{perf_train_tpu_tf.mdx => perf_train_tpu_tf.md} (98%) create mode 100644 docs/source/en/performance.md delete mode 100644 docs/source/en/performance.mdx rename docs/source/en/{perplexity.mdx => perplexity.md} (93%) rename docs/source/en/{philosophy.mdx => philosophy.md} (95%) rename docs/source/en/{pipeline_tutorial.mdx => pipeline_tutorial.md} (66%) rename docs/source/en/{pipeline_webserver.mdx => pipeline_webserver.md} (92%) rename docs/source/en/{pr_checks.mdx => pr_checks.md} (57%) rename docs/source/en/{preprocessing.mdx => preprocessing.md} (90%) create mode 100644 docs/source/en/quantization.md rename docs/source/en/{quicktour.mdx => quicktour.md} (92%) rename docs/source/en/{run_scripts.mdx => run_scripts.md} (92%) rename docs/source/en/{sagemaker.mdx => sagemaker.md} (86%) create mode 100644 docs/source/en/serialization.md delete mode 100644 docs/source/en/serialization.mdx rename docs/source/en/{task_summary.mdx => task_summary.md} (90%) rename docs/source/en/tasks/{asr.mdx => asr.md} (97%) rename docs/source/en/tasks/{audio_classification.mdx => audio_classification.md} (97%) rename docs/source/en/tasks/{document_question_answering.mdx => document_question_answering.md} (98%) create mode 100644 docs/source/en/tasks/idefics.md rename docs/source/en/tasks/{image_captioning.mdx => image_captioning.md} (97%) rename docs/source/en/tasks/{image_classification.mdx => image_classification.md} (90%) create mode 100644 docs/source/en/tasks/image_to_image.md create mode 100644 docs/source/en/tasks/knowledge_distillation_for_image_classification.md rename docs/source/en/tasks/{language_modeling.mdx => language_modeling.md} (61%) create mode 100644 docs/source/en/tasks/mask_generation.md rename docs/source/en/tasks/{masked_language_modeling.mdx => masked_language_modeling.md} (63%) create mode 100644 docs/source/en/tasks/monocular_depth_estimation.md rename docs/source/en/tasks/{multiple_choice.mdx => multiple_choice.md} (90%) rename docs/source/en/tasks/{object_detection.mdx => object_detection.md} (92%) create mode 100644 docs/source/en/tasks/prompting.md rename docs/source/en/tasks/{question_answering.mdx => question_answering.md} (86%) rename docs/source/en/tasks/{semantic_segmentation.mdx => semantic_segmentation.md} (64%) rename docs/source/en/tasks/{sequence_classification.mdx => sequence_classification.md} (79%) rename docs/source/en/tasks/{summarization.mdx => summarization.md} (93%) create mode 100644 docs/source/en/tasks/text-to-speech.md rename docs/source/en/tasks/{token_classification.mdx => token_classification.md} (84%) rename docs/source/en/tasks/{translation.mdx => translation.md} (89%) rename docs/source/en/tasks/{video_classification.mdx => video_classification.md} (98%) create mode 100644 docs/source/en/tasks/visual_question_answering.md create mode 100644 docs/source/en/tasks/zero_shot_image_classification.md create mode 100644 docs/source/en/tasks/zero_shot_object_detection.md rename 
docs/source/en/{tasks_explained.mdx => tasks_explained.md} (99%) rename docs/source/en/{testing.mdx => testing.md} (91%) rename docs/source/en/{tf_xla.mdx => tf_xla.md} (92%) create mode 100644 docs/source/en/tflite.md rename docs/source/en/{tokenizer_summary.mdx => tokenizer_summary.md} (97%) rename docs/source/en/{torchscript.mdx => torchscript.md} (95%) create mode 100644 docs/source/en/trainer.md rename docs/source/en/{training.mdx => training.md} (94%) create mode 100644 docs/source/en/transformers_agents.md rename docs/source/en/{troubleshooting.mdx => troubleshooting.md} (84%) rename docs/source/es/{accelerate.mdx => accelerate.md} (96%) rename docs/source/es/{add_new_pipeline.mdx => add_new_pipeline.md} (98%) rename docs/source/es/{autoclass_tutorial.mdx => autoclass_tutorial.md} (89%) rename docs/source/es/{bertology.mdx => bertology.md} (86%) rename docs/source/es/{community.mdx => community.md} (95%) rename docs/source/es/{converting_tensorflow_models.mdx => converting_tensorflow_models.md} (90%) rename docs/source/es/{create_a_model.mdx => create_a_model.md} (94%) rename docs/source/es/{custom_models.mdx => custom_models.md} (98%) rename docs/source/es/{debugging.mdx => debugging.md} (98%) rename docs/source/es/{fast_tokenizers.mdx => fast_tokenizers.md} (94%) create mode 100644 docs/source/es/glossary.md rename docs/source/es/{index.mdx => index.md} (96%) rename docs/source/es/{installation.mdx => installation.md} (95%) rename docs/source/es/{model_sharing.mdx => model_sharing.md} (94%) rename docs/source/es/{multilingual.mdx => multilingual.md} (78%) create mode 100644 docs/source/es/pad_truncation.md create mode 100644 docs/source/es/performance.md create mode 100644 docs/source/es/perplexity.md rename docs/source/es/{philosophy.mdx => philosophy.md} (97%) rename docs/source/es/{pipeline_tutorial.mdx => pipeline_tutorial.md} (95%) rename docs/source/es/{pr_checks.mdx => pr_checks.md} (97%) rename docs/source/es/{preprocessing.mdx => preprocessing.md} (98%) rename docs/source/es/{quicktour.mdx => quicktour.md} (98%) rename docs/source/es/{run_scripts.mdx => run_scripts.md} (93%) rename docs/source/es/{sagemaker.mdx => sagemaker.md} (86%) rename docs/source/es/{serialization.mdx => serialization.md} (95%) create mode 100644 docs/source/es/task_summary.md rename docs/source/es/tasks/{asr.mdx => asr.md} (98%) rename docs/source/es/tasks/{image_classification.mdx => image_classification.md} (87%) rename docs/source/es/tasks/{language_modeling.mdx => language_modeling.md} (88%) rename docs/source/es/tasks/{multiple_choice.mdx => multiple_choice.md} (94%) rename docs/source/es/tasks/{question_answering.mdx => question_answering.md} (94%) rename docs/source/es/tasks/{summarization.mdx => summarization.md} (95%) rename docs/source/es/{training.mdx => training.md} (95%) create mode 100644 docs/source/fr/autoclass_tutorial.md create mode 100644 docs/source/fr/in_translation.md delete mode 100644 docs/source/fr/in_translation.mdx rename docs/source/fr/{index.mdx => index.md} (97%) create mode 100644 docs/source/fr/installation.md rename docs/source/fr/{quicktour.mdx => quicktour.md} (86%) create mode 100644 docs/source/hi/_toctree.yml create mode 100644 docs/source/hi/pipeline_tutorial.md rename docs/source/it/{accelerate.mdx => accelerate.md} (96%) rename docs/source/it/{add_new_model.mdx => add_new_model.md} (99%) rename docs/source/it/{add_new_pipeline.mdx => add_new_pipeline.md} (97%) rename docs/source/it/{autoclass_tutorial.mdx => autoclass_tutorial.md} (89%) create mode 100644 
docs/source/it/big_models.md create mode 100644 docs/source/it/community.md rename docs/source/it/{converting_tensorflow_models.mdx => converting_tensorflow_models.md} (90%) rename docs/source/it/{create_a_model.mdx => create_a_model.md} (93%) rename docs/source/it/{custom_models.mdx => custom_models.md} (98%) rename docs/source/it/{debugging.mdx => debugging.md} (98%) rename docs/source/it/{index.mdx => index.md} (97%) rename docs/source/it/{installation.mdx => installation.md} (95%) create mode 100644 docs/source/it/migration.md rename docs/source/it/{model_sharing.mdx => model_sharing.md} (95%) rename docs/source/it/{multilingual.mdx => multilingual.md} (77%) rename docs/source/it/{perf_hardware.mdx => perf_hardware.md} (95%) create mode 100644 docs/source/it/perf_infer_cpu.md create mode 100644 docs/source/it/perf_infer_gpu_many.md create mode 100644 docs/source/it/perf_infer_gpu_one.md rename docs/source/{en/perf_infer_special.mdx => it/perf_infer_special.md} (56%) create mode 100644 docs/source/it/perf_train_cpu.md create mode 100644 docs/source/it/perf_train_cpu_many.md create mode 100644 docs/source/it/perf_train_special.md create mode 100644 docs/source/it/perf_train_tpu.md rename docs/source/it/{pipeline_tutorial.mdx => pipeline_tutorial.md} (95%) create mode 100644 docs/source/it/pr_checks.md rename docs/source/it/{preprocessing.mdx => preprocessing.md} (97%) rename docs/source/it/{quicktour.mdx => quicktour.md} (98%) rename docs/source/it/{run_scripts.mdx => run_scripts.md} (92%) rename docs/source/it/{serialization.mdx => serialization.md} (95%) rename docs/source/it/{training.mdx => training.md} (95%) create mode 100644 docs/source/ja/accelerate.md create mode 100644 docs/source/ja/add_new_model.md create mode 100644 docs/source/ja/add_tensorflow_model.md create mode 100644 docs/source/ja/attention.md create mode 100644 docs/source/ja/autoclass_tutorial.md create mode 100644 docs/source/ja/benchmarks.md create mode 100644 docs/source/ja/bertology.md create mode 100644 docs/source/ja/big_models.md create mode 100644 docs/source/ja/chat_templating.md create mode 100644 docs/source/ja/community.md create mode 100644 docs/source/ja/create_a_model.md create mode 100644 docs/source/ja/custom_models.md create mode 100644 docs/source/ja/custom_tools.md create mode 100644 docs/source/ja/fast_tokenizers.md create mode 100644 docs/source/ja/generation_strategies.md create mode 100644 docs/source/ja/glossary.md create mode 100644 docs/source/ja/hpo_train.md rename docs/source/ja/{index.mdx => index.md} (98%) rename docs/source/ja/{installation.mdx => installation.md} (96%) create mode 100644 docs/source/ja/internal/audio_utils.md create mode 100644 docs/source/ja/internal/file_utils.md create mode 100644 docs/source/ja/internal/generation_utils.md create mode 100644 docs/source/ja/internal/image_processing_utils.md create mode 100644 docs/source/ja/internal/modeling_utils.md create mode 100644 docs/source/ja/internal/pipelines_utils.md create mode 100644 docs/source/ja/internal/time_series_utils.md create mode 100644 docs/source/ja/internal/tokenization_utils.md create mode 100644 docs/source/ja/internal/trainer_utils.md create mode 100644 docs/source/ja/llm_tutorial.md create mode 100644 docs/source/ja/main_classes/agent.md create mode 100644 docs/source/ja/main_classes/callback.md create mode 100644 docs/source/ja/main_classes/configuration.md create mode 100644 docs/source/ja/main_classes/data_collator.md create mode 100644 docs/source/ja/main_classes/deepspeed.md create mode 100644 
docs/source/ja/main_classes/feature_extractor.md create mode 100644 docs/source/ja/main_classes/image_processor.md create mode 100644 docs/source/ja/main_classes/keras_callbacks.md create mode 100644 docs/source/ja/main_classes/logging.md create mode 100644 docs/source/ja/main_classes/model.md create mode 100644 docs/source/ja/main_classes/onnx.md create mode 100644 docs/source/ja/main_classes/optimizer_schedules.md create mode 100644 docs/source/ja/main_classes/output.md create mode 100644 docs/source/ja/main_classes/pipelines.md create mode 100644 docs/source/ja/main_classes/processors.md create mode 100644 docs/source/ja/main_classes/quantization.md create mode 100644 docs/source/ja/main_classes/text_generation.md create mode 100644 docs/source/ja/main_classes/tokenizer.md create mode 100644 docs/source/ja/main_classes/trainer.md create mode 100644 docs/source/ja/model_doc/albert.md create mode 100644 docs/source/ja/model_doc/align.md create mode 100644 docs/source/ja/model_doc/altclip.md create mode 100644 docs/source/ja/model_doc/audio-spectrogram-transformer.md create mode 100644 docs/source/ja/model_doc/auto.md create mode 100644 docs/source/ja/model_doc/autoformer.md create mode 100644 docs/source/ja/model_doc/bark.md create mode 100644 docs/source/ja/model_doc/bart.md create mode 100644 docs/source/ja/model_doc/barthez.md create mode 100644 docs/source/ja/model_doc/bartpho.md create mode 100644 docs/source/ja/model_doc/beit.md create mode 100644 docs/source/ja/model_doc/bert-generation.md create mode 100644 docs/source/ja/model_doc/bert-japanese.md create mode 100644 docs/source/ja/model_doc/bert.md create mode 100644 docs/source/ja/model_doc/bertweet.md create mode 100644 docs/source/ja/model_doc/big_bird.md create mode 100644 docs/source/ja/model_doc/bigbird_pegasus.md create mode 100644 docs/source/ja/model_doc/biogpt.md create mode 100644 docs/source/ja/model_doc/bit.md create mode 100644 docs/source/ja/model_doc/blenderbot-small.md create mode 100644 docs/source/ja/model_doc/blenderbot.md create mode 100644 docs/source/ja/model_doc/blip-2.md create mode 100644 docs/source/ja/model_doc/blip.md create mode 100644 docs/source/ja/model_doc/bloom.md create mode 100644 docs/source/ja/model_doc/bort.md create mode 100644 docs/source/ja/model_doc/bridgetower.md create mode 100644 docs/source/ja/model_doc/bros.md create mode 100644 docs/source/ja/model_doc/byt5.md create mode 100644 docs/source/ja/model_doc/camembert.md create mode 100644 docs/source/ja/model_doc/canine.md create mode 100644 docs/source/ja/model_doc/chinese_clip.md create mode 100644 docs/source/ja/model_doc/clap.md create mode 100644 docs/source/ja/model_doc/clip.md create mode 100644 docs/source/ja/model_doc/clipseg.md create mode 100644 docs/source/ja/model_doc/clvp.md create mode 100644 docs/source/ja/model_doc/code_llama.md create mode 100644 docs/source/ja/model_doc/codegen.md create mode 100644 docs/source/ja/model_doc/conditional_detr.md create mode 100644 docs/source/ja/model_doc/convbert.md create mode 100644 docs/source/ja/model_doc/convnext.md create mode 100644 docs/source/ja/model_doc/convnextv2.md create mode 100644 docs/source/ja/model_doc/cpm.md create mode 100644 docs/source/ja/model_doc/cpmant.md create mode 100644 docs/source/ja/model_doc/ctrl.md create mode 100644 docs/source/ja/model_doc/cvt.md create mode 100644 docs/source/ja/model_doc/data2vec.md create mode 100644 docs/source/ja/model_doc/deberta-v2.md create mode 100644 docs/source/ja/model_doc/deberta.md create mode 100644 
docs/source/ja/model_doc/decision_transformer.md create mode 100644 docs/source/ja/model_doc/deformable_detr.md create mode 100644 docs/source/ja/model_doc/deit.md create mode 100644 docs/source/ja/model_doc/deplot.md create mode 100644 docs/source/ja/model_doc/deta.md create mode 100644 docs/source/ja/model_doc/detr.md create mode 100644 docs/source/ja/model_doc/dialogpt.md create mode 100644 docs/source/ja/model_doc/dinat.md create mode 100644 docs/source/ja/model_memory_anatomy.md create mode 100644 docs/source/ja/model_sharing.md create mode 100644 docs/source/ja/model_summary.md rename docs/source/ja/{multilingual.mdx => multilingual.md} (77%) create mode 100644 docs/source/ja/pad_truncation.md create mode 100644 docs/source/ja/peft.md create mode 100644 docs/source/ja/perf_hardware.md create mode 100644 docs/source/ja/perf_infer_cpu.md create mode 100644 docs/source/ja/perf_infer_gpu_many.md create mode 100644 docs/source/ja/perf_infer_gpu_one.md create mode 100644 docs/source/ja/perf_infer_special.md create mode 100644 docs/source/ja/perf_torch_compile.md create mode 100644 docs/source/ja/perf_train_cpu.md create mode 100644 docs/source/ja/perf_train_cpu_many.md create mode 100644 docs/source/ja/perf_train_gpu_many.md create mode 100644 docs/source/ja/perf_train_gpu_one.md create mode 100644 docs/source/ja/perf_train_special.md create mode 100644 docs/source/ja/perf_train_tpu.md create mode 100644 docs/source/ja/perf_train_tpu_tf.md create mode 100644 docs/source/ja/performance.md create mode 100644 docs/source/ja/perplexity.md create mode 100644 docs/source/ja/philosophy.md create mode 100644 docs/source/ja/pipeline_tutorial.md create mode 100644 docs/source/ja/pipeline_webserver.md create mode 100644 docs/source/ja/pr_checks.md create mode 100644 docs/source/ja/preprocessing.md create mode 100644 docs/source/ja/quicktour.md create mode 100644 docs/source/ja/run_scripts.md create mode 100644 docs/source/ja/serialization.md create mode 100644 docs/source/ja/task_summary.md create mode 100644 docs/source/ja/tasks/asr.md create mode 100644 docs/source/ja/tasks/audio_classification.md create mode 100644 docs/source/ja/tasks/document_question_answering.md create mode 100644 docs/source/ja/tasks/idefics.md create mode 100644 docs/source/ja/tasks/image_captioning.md create mode 100644 docs/source/ja/tasks/image_classification.md create mode 100644 docs/source/ja/tasks/image_to_image.md create mode 100644 docs/source/ja/tasks/knowledge_distillation_for_image_classification.md create mode 100644 docs/source/ja/tasks/language_modeling.md create mode 100644 docs/source/ja/tasks/masked_language_modeling.md create mode 100644 docs/source/ja/tasks/monocular_depth_estimation.md create mode 100644 docs/source/ja/tasks/multiple_choice.md create mode 100644 docs/source/ja/tasks/object_detection.md create mode 100644 docs/source/ja/tasks/prompting.md create mode 100644 docs/source/ja/tasks/question_answering.md create mode 100644 docs/source/ja/tasks/semantic_segmentation.md create mode 100644 docs/source/ja/tasks/sequence_classification.md create mode 100644 docs/source/ja/tasks/summarization.md create mode 100644 docs/source/ja/tasks/text-to-speech.md create mode 100644 docs/source/ja/tasks/token_classification.md create mode 100644 docs/source/ja/tasks/translation.md create mode 100644 docs/source/ja/tasks/video_classification.md create mode 100644 docs/source/ja/tasks/visual_question_answering.md create mode 100644 docs/source/ja/tasks/zero_shot_image_classification.md create mode 100644 
docs/source/ja/tasks/zero_shot_object_detection.md create mode 100644 docs/source/ja/tasks_explained.md create mode 100644 docs/source/ja/testing.md create mode 100644 docs/source/ja/tf_xla.md create mode 100644 docs/source/ja/tflite.md create mode 100644 docs/source/ja/tokenizer_summary.md create mode 100644 docs/source/ja/torchscript.md create mode 100644 docs/source/ja/training.md create mode 100644 docs/source/ja/transformers_agents.md create mode 100644 docs/source/ja/troubleshooting.md create mode 100644 docs/source/ko/accelerate.md create mode 100644 docs/source/ko/add_new_model.md create mode 100644 docs/source/ko/add_new_pipeline.md create mode 100644 docs/source/ko/add_tensorflow_model.md create mode 100644 docs/source/ko/attention.md create mode 100644 docs/source/ko/autoclass_tutorial.md create mode 100644 docs/source/ko/bertology.md create mode 100644 docs/source/ko/big_models.md create mode 100644 docs/source/ko/community.md create mode 100644 docs/source/ko/contributing.md create mode 100644 docs/source/ko/create_a_model.md create mode 100644 docs/source/ko/custom_models.md create mode 100644 docs/source/ko/custom_tools.md create mode 100644 docs/source/ko/debugging.md create mode 100644 docs/source/ko/fast_tokenizers.md create mode 100644 docs/source/ko/hpo_train.md create mode 100644 docs/source/ko/in_translation.md delete mode 100644 docs/source/ko/in_translation.mdx rename docs/source/ko/{index.mdx => index.md} (98%) rename docs/source/ko/{installation.mdx => installation.md} (95%) create mode 100644 docs/source/ko/llm_tutorial.md create mode 100644 docs/source/ko/model_doc/llama.md create mode 100644 docs/source/ko/model_doc/llama2.md create mode 100644 docs/source/ko/model_doc/whisper.md create mode 100644 docs/source/ko/model_memory_anatomy.md create mode 100644 docs/source/ko/model_sharing.md create mode 100644 docs/source/ko/model_summary.md create mode 100644 docs/source/ko/multilingual.md create mode 100644 docs/source/ko/pad_truncation.md create mode 100644 docs/source/ko/peft.md create mode 100644 docs/source/ko/perf_hardware.md create mode 100644 docs/source/ko/perf_infer_cpu.md create mode 100644 docs/source/ko/perf_infer_gpu_many.md create mode 100644 docs/source/ko/perf_infer_gpu_one.md create mode 100644 docs/source/ko/perf_train_cpu.md create mode 100644 docs/source/ko/perf_train_cpu_many.md create mode 100644 docs/source/ko/perf_train_gpu_many.md create mode 100644 docs/source/ko/perf_train_tpu_tf.md create mode 100644 docs/source/ko/performance.md create mode 100644 docs/source/ko/perplexity.md create mode 100644 docs/source/ko/philosophy.md create mode 100644 docs/source/ko/pipeline_tutorial.md create mode 100644 docs/source/ko/pipeline_webserver.md create mode 100644 docs/source/ko/pr_checks.md create mode 100644 docs/source/ko/preprocessing.md create mode 100644 docs/source/ko/quicktour.md delete mode 100644 docs/source/ko/quicktour.mdx create mode 100644 docs/source/ko/run_scripts.md create mode 100644 docs/source/ko/sagemaker.md create mode 100644 docs/source/ko/serialization.md create mode 100644 docs/source/ko/task_summary.md create mode 100644 docs/source/ko/tasks/asr.md create mode 100644 docs/source/ko/tasks/audio_classification.md create mode 100644 docs/source/ko/tasks/document_question_answering.md create mode 100644 docs/source/ko/tasks/image_captioning.md create mode 100644 docs/source/ko/tasks/image_classification.md create mode 100644 docs/source/ko/tasks/language_modeling.md create mode 100644 
docs/source/ko/tasks/masked_language_modeling.md create mode 100644 docs/source/ko/tasks/monocular_depth_estimation.md create mode 100644 docs/source/ko/tasks/multiple_choice.md create mode 100644 docs/source/ko/tasks/object_detection.md create mode 100644 docs/source/ko/tasks/question_answering.md create mode 100644 docs/source/ko/tasks/semantic_segmentation.md create mode 100644 docs/source/ko/tasks/sequence_classification.md create mode 100644 docs/source/ko/tasks/summarization.md create mode 100644 docs/source/ko/tasks/token_classification.md create mode 100644 docs/source/ko/tasks/translation.md create mode 100644 docs/source/ko/tasks/video_classification.md create mode 100644 docs/source/ko/tasks/visual_question_answering.md create mode 100644 docs/source/ko/tasks/zero_shot_image_classification.md create mode 100644 docs/source/ko/tasks/zero_shot_object_detection.md create mode 100644 docs/source/ko/tasks_explained.md create mode 100644 docs/source/ko/testing.md create mode 100644 docs/source/ko/tf_xla.md create mode 100644 docs/source/ko/tflite.md create mode 100644 docs/source/ko/tokenizer_summary.md create mode 100644 docs/source/ko/torchscript.md create mode 100644 docs/source/ko/training.md create mode 100644 docs/source/ko/transformers_agents.md create mode 100644 docs/source/ko/troubleshooting.md create mode 100644 docs/source/ms/_toctree.yml rename docs/source/{en/index.mdx => ms/index.md} (84%) rename docs/source/pt/{accelerate.mdx => accelerate.md} (96%) rename docs/source/pt/{converting_tensorflow_models.mdx => converting_tensorflow_models.md} (90%) rename docs/source/pt/{create_a_model.mdx => create_a_model.md} (93%) rename docs/source/pt/{custom_models.mdx => custom_models.md} (98%) rename docs/source/pt/{fast_tokenizers.mdx => fast_tokenizers.md} (94%) rename docs/source/pt/{index.mdx => index.md} (97%) rename docs/source/pt/{installation.mdx => installation.md} (96%) rename docs/source/pt/{multilingual.mdx => multilingual.md} (80%) rename docs/source/pt/{pipeline_tutorial.mdx => pipeline_tutorial.md} (93%) rename docs/source/pt/{quicktour.mdx => quicktour.md} (93%) rename docs/source/pt/{run_scripts.mdx => run_scripts.md} (93%) rename docs/source/pt/{serialization.mdx => serialization.md} (94%) rename docs/source/pt/tasks/{sequence_classification.mdx => sequence_classification.md} (88%) rename docs/source/pt/tasks/{token_classification.mdx => token_classification.md} (90%) rename docs/source/pt/{training.mdx => training.md} (95%) create mode 100644 docs/source/te/_toctree.yml create mode 100644 docs/source/te/index.md create mode 100644 docs/source/te/quicktour.md create mode 100644 docs/source/tr/_toctree.yml create mode 100644 docs/source/tr/index.md create mode 100644 docs/source/zh/accelerate.md create mode 100644 docs/source/zh/autoclass_tutorial.md create mode 100644 docs/source/zh/big_models.md create mode 100644 docs/source/zh/contributing.md create mode 100644 docs/source/zh/create_a_model.md create mode 100644 docs/source/zh/custom_models.md create mode 100644 docs/source/zh/debugging.md create mode 100644 docs/source/zh/fast_tokenizers.md create mode 100644 docs/source/zh/hpo_train.md rename docs/source/zh/{index.mdx => index.md} (95%) create mode 100644 docs/source/zh/installation.md create mode 100644 docs/source/zh/internal/audio_utils.md create mode 100644 docs/source/zh/internal/file_utils.md create mode 100644 docs/source/zh/internal/generation_utils.md create mode 100644 docs/source/zh/internal/image_processing_utils.md create mode 100644 
docs/source/zh/internal/modeling_utils.md create mode 100644 docs/source/zh/internal/pipelines_utils.md create mode 100644 docs/source/zh/internal/time_series_utils.md create mode 100644 docs/source/zh/internal/tokenization_utils.md create mode 100644 docs/source/zh/internal/trainer_utils.md create mode 100644 docs/source/zh/llm_tutorial.md create mode 100644 docs/source/zh/main_classes/agent.md create mode 100644 docs/source/zh/main_classes/callback.md create mode 100644 docs/source/zh/main_classes/configuration.md create mode 100644 docs/source/zh/main_classes/data_collator.md create mode 100644 docs/source/zh/main_classes/deepspeed.md create mode 100644 docs/source/zh/main_classes/feature_extractor.md create mode 100644 docs/source/zh/main_classes/image_processor.md create mode 100644 docs/source/zh/main_classes/keras_callbacks.md create mode 100644 docs/source/zh/main_classes/logging.md create mode 100644 docs/source/zh/main_classes/model.md create mode 100644 docs/source/zh/main_classes/onnx.md create mode 100644 docs/source/zh/main_classes/optimizer_schedules.md create mode 100644 docs/source/zh/main_classes/output.md create mode 100644 docs/source/zh/main_classes/pipelines.md create mode 100644 docs/source/zh/main_classes/processors.md create mode 100644 docs/source/zh/main_classes/quantization.md create mode 100644 docs/source/zh/main_classes/text_generation.md create mode 100644 docs/source/zh/main_classes/tokenizer.md create mode 100644 docs/source/zh/main_classes/trainer.md create mode 100644 docs/source/zh/model_sharing.md create mode 100644 docs/source/zh/multilingual.md create mode 100644 docs/source/zh/peft.md create mode 100644 docs/source/zh/perf_hardware.md create mode 100644 docs/source/zh/perf_torch_compile.md create mode 100644 docs/source/zh/performance.md create mode 100644 docs/source/zh/pipeline_tutorial.md create mode 100644 docs/source/zh/preprocessing.md create mode 100644 docs/source/zh/quicktour.md delete mode 100644 docs/source/zh/quicktour.mdx create mode 100644 docs/source/zh/run_scripts.md create mode 100644 docs/source/zh/serialization.md create mode 100644 docs/source/zh/task_summary.md create mode 100644 docs/source/zh/tf_xla.md create mode 100644 docs/source/zh/tflite.md create mode 100644 docs/source/zh/tokenizer_summary.md create mode 100644 docs/source/zh/training.md create mode 100644 docs/source/zh/transformers_agents.md create mode 100644 examples/flax/speech-recognition/README.md create mode 100644 examples/flax/speech-recognition/requirements.txt create mode 100644 examples/flax/speech-recognition/run_flax_speech_recognition_seq2seq.py rename examples/{pytorch => legacy}/benchmarking/README.md (59%) rename examples/{pytorch => legacy}/benchmarking/plot_csv_file.py (96%) rename examples/{pytorch => legacy}/benchmarking/requirements.txt (100%) rename examples/{pytorch => legacy}/benchmarking/run_benchmark.py (100%) mode change 100755 => 100644 delete mode 120000 examples/legacy/seq2seq/test_data/test_data delete mode 100755 examples/legacy/text-classification/run_tf_text_classification.py delete mode 100755 examples/legacy/token-classification/run_tf_ner.py mode change 100644 => 100755 examples/pytorch/image-classification/run_image_classification.py create mode 100644 examples/pytorch/image-pretraining/run_mim_no_trainer.py rename examples/pytorch/{test_xla_examples.py => old_test_xla_examples.py} (97%) create mode 100755 examples/pytorch/speech-recognition/run_speech_recognition_ctc_adapter.py create mode 100755 
examples/pytorch/text-classification/run_classification.py create mode 100644 examples/run_on_remote.py create mode 100644 examples/tensorflow/contrastive-image-text/README.md create mode 100644 examples/tensorflow/contrastive-image-text/requirements.txt create mode 100644 examples/tensorflow/contrastive-image-text/run_clip.py create mode 100644 examples/tensorflow/language-modeling-tpu/README.md create mode 100644 examples/tensorflow/language-modeling-tpu/prepare_tfrecord_shards.py create mode 100644 examples/tensorflow/language-modeling-tpu/requirements.txt create mode 100644 examples/tensorflow/language-modeling-tpu/run_mlm.py create mode 100644 examples/tensorflow/language-modeling-tpu/train_unigram.py delete mode 100644 setup.cfg create mode 100644 src/transformers/cache_utils.py delete mode 100644 src/transformers/data/test_generation_utils.py create mode 100644 src/transformers/generation/candidate_generator.py create mode 100644 src/transformers/generation/streamers.py create mode 100644 src/transformers/hyperparameter_search.py create mode 100644 src/transformers/integrations/__init__.py create mode 100644 src/transformers/integrations/aqlm.py create mode 100644 src/transformers/integrations/awq.py create mode 100644 src/transformers/integrations/bitsandbytes.py create mode 100644 src/transformers/integrations/deepspeed.py rename src/transformers/{integrations.py => integrations/integration_utils.py} (72%) create mode 100644 src/transformers/integrations/peft.py create mode 100644 src/transformers/integrations/tpu.py rename src/transformers/{models/deformable_detr/custom_kernel => kernels/deformable_detr}/cpu/ms_deform_attn_cpu.cpp (100%) rename src/transformers/{models/deformable_detr/custom_kernel => kernels/deformable_detr}/cpu/ms_deform_attn_cpu.h (100%) create mode 100644 src/transformers/kernels/deformable_detr/cuda/ms_deform_attn_cuda.cu create mode 100644 src/transformers/kernels/deformable_detr/cuda/ms_deform_attn_cuda.cuh rename src/transformers/{models/deformable_detr/custom_kernel => kernels/deformable_detr}/cuda/ms_deform_attn_cuda.h (100%) rename src/transformers/{models/deformable_detr/custom_kernel => kernels/deformable_detr}/cuda/ms_deform_im2col_cuda.cuh (100%) rename src/transformers/{models/deformable_detr/custom_kernel => kernels/deformable_detr}/ms_deform_attn.h (100%) rename src/transformers/{models/deformable_detr/custom_kernel => kernels/deformable_detr}/vision.cpp (100%) create mode 100644 src/transformers/kernels/deta/cpu/ms_deform_attn_cpu.cpp create mode 100644 src/transformers/kernels/deta/cpu/ms_deform_attn_cpu.h rename src/transformers/{models/deformable_detr/custom_kernel => kernels/deta}/cuda/ms_deform_attn_cuda.cu (100%) rename src/transformers/{models/deformable_detr/custom_kernel => kernels/deta}/cuda/ms_deform_attn_cuda.cuh (100%) create mode 100644 src/transformers/kernels/deta/cuda/ms_deform_attn_cuda.h create mode 100644 src/transformers/kernels/deta/cuda/ms_deform_im2col_cuda.cuh create mode 100644 src/transformers/kernels/deta/ms_deform_attn.h create mode 100644 src/transformers/kernels/deta/vision.cpp create mode 100644 src/transformers/kernels/mra/cuda_kernel.cu create mode 100644 src/transformers/kernels/mra/cuda_kernel.h create mode 100644 src/transformers/kernels/mra/cuda_launch.cu create mode 100644 src/transformers/kernels/mra/cuda_launch.h create mode 100644 src/transformers/kernels/mra/torch_extension.cpp create mode 100644 src/transformers/kernels/rwkv/wkv_cuda.cu create mode 100644 src/transformers/kernels/rwkv/wkv_cuda_bf16.cu 
create mode 100644 src/transformers/kernels/rwkv/wkv_op.cpp rename src/transformers/{models => kernels}/yoso/common.h (100%) rename src/transformers/{models => kernels}/yoso/common_cuda.h (100%) rename src/transformers/{models => kernels}/yoso/common_cuda_device.h (100%) rename src/transformers/{models => kernels}/yoso/fast_lsh_cumulation.cu (100%) rename src/transformers/{models => kernels}/yoso/fast_lsh_cumulation.h (100%) rename src/transformers/{models => kernels}/yoso/fast_lsh_cumulation_cuda.cu (100%) rename src/transformers/{models => kernels}/yoso/fast_lsh_cumulation_cuda.h (100%) rename src/transformers/{models => kernels}/yoso/fast_lsh_cumulation_torch.cpp (100%) create mode 100755 src/transformers/modeling_attn_mask_utils.py create mode 100644 src/transformers/models/align/__init__.py create mode 100644 src/transformers/models/align/configuration_align.py create mode 100644 src/transformers/models/align/convert_align_tf_to_hf.py create mode 100644 src/transformers/models/align/modeling_align.py create mode 100644 src/transformers/models/align/processing_align.py create mode 100644 src/transformers/models/autoformer/__init__.py create mode 100644 src/transformers/models/autoformer/configuration_autoformer.py create mode 100644 src/transformers/models/autoformer/modeling_autoformer.py create mode 100644 src/transformers/models/bark/__init__.py create mode 100644 src/transformers/models/bark/configuration_bark.py create mode 100644 src/transformers/models/bark/convert_suno_to_hf.py create mode 100644 src/transformers/models/bark/generation_configuration_bark.py create mode 100644 src/transformers/models/bark/modeling_bark.py create mode 100644 src/transformers/models/bark/processing_bark.py create mode 100644 src/transformers/models/blip/modeling_tf_blip.py create mode 100644 src/transformers/models/blip/modeling_tf_blip_text.py create mode 100644 src/transformers/models/bloom/modeling_flax_bloom.py create mode 100644 src/transformers/models/bros/__init__.py create mode 100644 src/transformers/models/bros/configuration_bros.py create mode 100644 src/transformers/models/bros/convert_bros_to_pytorch.py create mode 100755 src/transformers/models/bros/modeling_bros.py create mode 100644 src/transformers/models/bros/processing_bros.py create mode 100644 src/transformers/models/clvp/__init__.py create mode 100644 src/transformers/models/clvp/configuration_clvp.py create mode 100644 src/transformers/models/clvp/convert_clvp_to_hf.py create mode 100644 src/transformers/models/clvp/feature_extraction_clvp.py create mode 100644 src/transformers/models/clvp/modeling_clvp.py create mode 100644 src/transformers/models/clvp/number_normalizer.py create mode 100644 src/transformers/models/clvp/processing_clvp.py create mode 100644 src/transformers/models/clvp/tokenization_clvp.py create mode 100644 src/transformers/models/code_llama/__init__.py create mode 100644 src/transformers/models/code_llama/tokenization_code_llama.py create mode 100644 src/transformers/models/code_llama/tokenization_code_llama_fast.py create mode 100644 src/transformers/models/convnextv2/__init__.py create mode 100644 src/transformers/models/convnextv2/configuration_convnextv2.py create mode 100644 src/transformers/models/convnextv2/convert_convnextv2_to_pytorch.py create mode 100644 src/transformers/models/convnextv2/modeling_convnextv2.py create mode 100644 src/transformers/models/convnextv2/modeling_tf_convnextv2.py create mode 100644 src/transformers/models/cpmant/__init__.py create mode 100644 
src/transformers/models/cpmant/configuration_cpmant.py create mode 100755 src/transformers/models/cpmant/modeling_cpmant.py create mode 100644 src/transformers/models/cpmant/tokenization_cpmant.py rename src/transformers/models/{bort => deprecated}/__init__.py (100%) rename {tests/mixed_int8 => src/transformers/models/deprecated/bort}/__init__.py (100%) rename src/transformers/models/{ => deprecated}/bort/convert_bort_original_gluonnlp_checkpoint_to_pytorch.py (99%) rename src/transformers/models/{ => deprecated}/mctct/__init__.py (75%) rename src/transformers/models/{ => deprecated}/mctct/configuration_mctct.py (94%) rename src/transformers/models/{ => deprecated}/mctct/feature_extraction_mctct.py (71%) rename src/transformers/models/{ => deprecated}/mctct/modeling_mctct.py (94%) rename src/transformers/models/{ => deprecated}/mctct/processing_mctct.py (99%) rename src/transformers/models/{ => deprecated}/mmbt/__init__.py (94%) rename src/transformers/models/{ => deprecated}/mmbt/configuration_mmbt.py (98%) rename src/transformers/models/{ => deprecated}/mmbt/modeling_mmbt.py (97%) create mode 100644 src/transformers/models/deprecated/open_llama/__init__.py create mode 100644 src/transformers/models/deprecated/open_llama/configuration_open_llama.py create mode 100644 src/transformers/models/deprecated/open_llama/modeling_open_llama.py rename src/transformers/models/{ => deprecated}/retribert/__init__.py (95%) rename src/transformers/models/{ => deprecated}/retribert/configuration_retribert.py (98%) rename src/transformers/models/{ => deprecated}/retribert/modeling_retribert.py (98%) rename src/transformers/models/{ => deprecated}/retribert/tokenization_retribert.py (94%) rename src/transformers/models/{ => deprecated}/retribert/tokenization_retribert_fast.py (98%) rename src/transformers/models/{ => deprecated}/tapex/__init__.py (95%) rename src/transformers/models/{ => deprecated}/tapex/tokenization_tapex.py (99%) rename src/transformers/models/{ => deprecated}/trajectory_transformer/__init__.py (95%) rename src/transformers/models/{ => deprecated}/trajectory_transformer/configuration_trajectory_transformer.py (98%) rename src/transformers/models/{ => deprecated}/trajectory_transformer/convert_trajectory_transformer_original_pytorch_checkpoint_to_pytorch.py (100%) rename src/transformers/models/{ => deprecated}/trajectory_transformer/modeling_trajectory_transformer.py (96%) rename src/transformers/models/{ => deprecated}/transfo_xl/__init__.py (96%) rename src/transformers/models/{ => deprecated}/transfo_xl/configuration_transfo_xl.py (94%) rename src/transformers/models/{ => deprecated}/transfo_xl/convert_transfo_xl_original_tf_checkpoint_to_pytorch.py (95%) mode change 100755 => 100644 rename src/transformers/models/{ => deprecated}/transfo_xl/modeling_tf_transfo_xl.py (86%) rename src/transformers/models/{ => deprecated}/transfo_xl/modeling_tf_transfo_xl_utilities.py (98%) rename src/transformers/models/{ => deprecated}/transfo_xl/modeling_transfo_xl.py (98%) rename src/transformers/models/{ => deprecated}/transfo_xl/modeling_transfo_xl_utilities.py (97%) rename src/transformers/models/{ => deprecated}/transfo_xl/tokenization_transfo_xl.py (91%) rename src/transformers/models/{ => deprecated}/van/__init__.py (93%) rename src/transformers/models/{ => deprecated}/van/configuration_van.py (96%) rename src/transformers/models/{ => deprecated}/van/convert_van_to_pytorch.py (95%) rename src/transformers/models/{ => deprecated}/van/modeling_van.py (97%) create mode 100644 
src/transformers/models/depth_anything/__init__.py create mode 100644 src/transformers/models/depth_anything/configuration_depth_anything.py create mode 100644 src/transformers/models/depth_anything/convert_depth_anything_to_hf.py create mode 100644 src/transformers/models/depth_anything/modeling_depth_anything.py create mode 100644 src/transformers/models/dinov2/__init__.py create mode 100644 src/transformers/models/dinov2/configuration_dinov2.py create mode 100644 src/transformers/models/dinov2/convert_dinov2_to_hf.py create mode 100644 src/transformers/models/dinov2/modeling_dinov2.py create mode 100644 src/transformers/models/dpt/convert_dinov2_depth_to_hf.py create mode 100644 src/transformers/models/dpt/convert_dpt_beit_to_hf.py create mode 100644 src/transformers/models/dpt/convert_dpt_swinv2_to_hf.py create mode 100644 src/transformers/models/efficientformer/modeling_tf_efficientformer.py create mode 100644 src/transformers/models/efficientnet/__init__.py create mode 100644 src/transformers/models/efficientnet/configuration_efficientnet.py create mode 100644 src/transformers/models/efficientnet/convert_efficientnet_to_pytorch.py create mode 100644 src/transformers/models/efficientnet/image_processing_efficientnet.py create mode 100644 src/transformers/models/efficientnet/modeling_efficientnet.py create mode 100644 src/transformers/models/encodec/__init__.py create mode 100644 src/transformers/models/encodec/configuration_encodec.py create mode 100644 src/transformers/models/encodec/convert_encodec_checkpoint_to_pytorch.py create mode 100644 src/transformers/models/encodec/feature_extraction_encodec.py create mode 100644 src/transformers/models/encodec/modeling_encodec.py create mode 100644 src/transformers/models/falcon/__init__.py create mode 100644 src/transformers/models/falcon/configuration_falcon.py create mode 100644 src/transformers/models/falcon/convert_custom_code_checkpoint.py create mode 100644 src/transformers/models/falcon/modeling_falcon.py create mode 100644 src/transformers/models/fastspeech2_conformer/__init__.py create mode 100644 src/transformers/models/fastspeech2_conformer/configuration_fastspeech2_conformer.py create mode 100644 src/transformers/models/fastspeech2_conformer/convert_fastspeech2_conformer_original_pytorch_checkpoint_to_pytorch.py create mode 100644 src/transformers/models/fastspeech2_conformer/convert_hifigan.py create mode 100644 src/transformers/models/fastspeech2_conformer/convert_model_with_hifigan.py create mode 100644 src/transformers/models/fastspeech2_conformer/modeling_fastspeech2_conformer.py create mode 100644 src/transformers/models/fastspeech2_conformer/tokenization_fastspeech2_conformer.py create mode 100644 src/transformers/models/focalnet/__init__.py create mode 100644 src/transformers/models/focalnet/configuration_focalnet.py create mode 100644 src/transformers/models/focalnet/convert_focalnet_to_hf_format.py create mode 100644 src/transformers/models/focalnet/modeling_focalnet.py create mode 100644 src/transformers/models/fuyu/__init__.py create mode 100644 src/transformers/models/fuyu/configuration_fuyu.py create mode 100644 src/transformers/models/fuyu/convert_fuyu_model_weights_to_hf.py create mode 100644 src/transformers/models/fuyu/image_processing_fuyu.py create mode 100644 src/transformers/models/fuyu/modeling_fuyu.py create mode 100644 src/transformers/models/fuyu/processing_fuyu.py create mode 100644 src/transformers/models/gpt_bigcode/__init__.py create mode 100644 
src/transformers/models/gpt_bigcode/configuration_gpt_bigcode.py create mode 100644 src/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py mode change 100755 => 100644 src/transformers/models/gptj/modeling_gptj.py create mode 100644 src/transformers/models/gptsan_japanese/__init__.py create mode 100644 src/transformers/models/gptsan_japanese/configuration_gptsan_japanese.py create mode 100644 src/transformers/models/gptsan_japanese/convert_gptsan_tf_checkpoint_to_pytorch.py create mode 100644 src/transformers/models/gptsan_japanese/modeling_gptsan_japanese.py create mode 100644 src/transformers/models/gptsan_japanese/tokenization_gptsan_japanese.py create mode 100644 src/transformers/models/idefics/__init__.py create mode 100644 src/transformers/models/idefics/configuration_idefics.py create mode 100644 src/transformers/models/idefics/image_processing_idefics.py create mode 100644 src/transformers/models/idefics/modeling_idefics.py create mode 100644 src/transformers/models/idefics/perceiver.py create mode 100644 src/transformers/models/idefics/processing_idefics.py create mode 100644 src/transformers/models/idefics/vision.py create mode 100644 src/transformers/models/informer/__init__.py create mode 100644 src/transformers/models/informer/configuration_informer.py create mode 100644 src/transformers/models/informer/modeling_informer.py create mode 100644 src/transformers/models/instructblip/__init__.py create mode 100644 src/transformers/models/instructblip/configuration_instructblip.py create mode 100644 src/transformers/models/instructblip/convert_instructblip_original_to_pytorch.py create mode 100644 src/transformers/models/instructblip/modeling_instructblip.py create mode 100644 src/transformers/models/instructblip/processing_instructblip.py create mode 100644 src/transformers/models/kosmos2/__init__.py create mode 100644 src/transformers/models/kosmos2/configuration_kosmos2.py create mode 100644 src/transformers/models/kosmos2/convert_kosmos2_original_pytorch_checkpoint_to_pytorch.py create mode 100644 src/transformers/models/kosmos2/modeling_kosmos2.py create mode 100644 src/transformers/models/kosmos2/processing_kosmos2.py create mode 100644 src/transformers/models/llama/__init__.py create mode 100644 src/transformers/models/llama/configuration_llama.py create mode 100644 src/transformers/models/llama/convert_llama_weights_to_hf.py create mode 100644 src/transformers/models/llama/modeling_flax_llama.py create mode 100644 src/transformers/models/llama/modeling_llama.py create mode 100644 src/transformers/models/llama/tokenization_llama.py create mode 100644 src/transformers/models/llama/tokenization_llama_fast.py create mode 100644 src/transformers/models/llava/__init__.py create mode 100644 src/transformers/models/llava/configuration_llava.py create mode 100644 src/transformers/models/llava/convert_llava_weights_to_hf.py create mode 100644 src/transformers/models/llava/modeling_llava.py create mode 100644 src/transformers/models/llava/processing_llava.py create mode 100644 src/transformers/models/mega/__init__.py create mode 100644 src/transformers/models/mega/configuration_mega.py create mode 100644 src/transformers/models/mega/convert_mega_original_pytorch_checkpoint_to_pytorch.py create mode 100644 src/transformers/models/mega/modeling_mega.py create mode 100644 src/transformers/models/mgp_str/__init__.py create mode 100644 src/transformers/models/mgp_str/configuration_mgp_str.py create mode 100644 src/transformers/models/mgp_str/modeling_mgp_str.py create mode 100644 
src/transformers/models/mgp_str/processing_mgp_str.py create mode 100644 src/transformers/models/mgp_str/tokenization_mgp_str.py create mode 100644 src/transformers/models/mistral/__init__.py create mode 100644 src/transformers/models/mistral/configuration_mistral.py create mode 100644 src/transformers/models/mistral/convert_mistral_weights_to_hf.py create mode 100644 src/transformers/models/mistral/modeling_flax_mistral.py create mode 100644 src/transformers/models/mistral/modeling_mistral.py create mode 100644 src/transformers/models/mixtral/__init__.py create mode 100644 src/transformers/models/mixtral/configuration_mixtral.py create mode 100644 src/transformers/models/mixtral/convert_mixtral_weights_to_hf.py create mode 100644 src/transformers/models/mixtral/modeling_mixtral.py create mode 100644 src/transformers/models/mobilevitv2/__init__.py create mode 100644 src/transformers/models/mobilevitv2/configuration_mobilevitv2.py create mode 100644 src/transformers/models/mobilevitv2/convert_mlcvnets_to_pytorch.py create mode 100644 src/transformers/models/mobilevitv2/modeling_mobilevitv2.py create mode 100644 src/transformers/models/mpt/__init__.py create mode 100644 src/transformers/models/mpt/configuration_mpt.py create mode 100644 src/transformers/models/mpt/modeling_mpt.py create mode 100644 src/transformers/models/mra/__init__.py create mode 100644 src/transformers/models/mra/configuration_mra.py create mode 100644 src/transformers/models/mra/convert_mra_pytorch_to_pytorch.py create mode 100644 src/transformers/models/mra/modeling_mra.py create mode 100644 src/transformers/models/musicgen/__init__.py create mode 100644 src/transformers/models/musicgen/configuration_musicgen.py create mode 100644 src/transformers/models/musicgen/convert_musicgen_transformers.py create mode 100644 src/transformers/models/musicgen/modeling_musicgen.py create mode 100644 src/transformers/models/musicgen/processing_musicgen.py create mode 100644 src/transformers/models/nllb_moe/__init__.py create mode 100644 src/transformers/models/nllb_moe/configuration_nllb_moe.py create mode 100644 src/transformers/models/nllb_moe/convert_nllb_moe_sharded_original_checkpoint_to_pytorch.py create mode 100644 src/transformers/models/nllb_moe/modeling_nllb_moe.py create mode 100644 src/transformers/models/nougat/__init__.py create mode 100644 src/transformers/models/nougat/convert_nougat_to_hf.py create mode 100644 src/transformers/models/nougat/image_processing_nougat.py create mode 100644 src/transformers/models/nougat/processing_nougat.py create mode 100644 src/transformers/models/nougat/tokenization_nougat_fast.py create mode 100644 src/transformers/models/owlv2/__init__.py create mode 100644 src/transformers/models/owlv2/configuration_owlv2.py create mode 100644 src/transformers/models/owlv2/convert_owlv2_to_hf.py create mode 100644 src/transformers/models/owlv2/image_processing_owlv2.py create mode 100644 src/transformers/models/owlv2/modeling_owlv2.py create mode 100644 src/transformers/models/owlv2/processing_owlv2.py create mode 100644 src/transformers/models/patchtsmixer/__init__.py create mode 100644 src/transformers/models/patchtsmixer/configuration_patchtsmixer.py create mode 100644 src/transformers/models/patchtsmixer/modeling_patchtsmixer.py create mode 100644 src/transformers/models/patchtst/__init__.py create mode 100644 src/transformers/models/patchtst/configuration_patchtst.py create mode 100755 src/transformers/models/patchtst/modeling_patchtst.py create mode 100644 
src/transformers/models/persimmon/__init__.py create mode 100644 src/transformers/models/persimmon/configuration_persimmon.py create mode 100644 src/transformers/models/persimmon/convert_persimmon_weights_to_hf.py create mode 100644 src/transformers/models/persimmon/modeling_persimmon.py create mode 100644 src/transformers/models/phi/__init__.py create mode 100644 src/transformers/models/phi/configuration_phi.py create mode 100644 src/transformers/models/phi/convert_phi_weights_to_hf.py create mode 100644 src/transformers/models/phi/modeling_phi.py create mode 100644 src/transformers/models/pix2struct/__init__.py create mode 100644 src/transformers/models/pix2struct/configuration_pix2struct.py create mode 100644 src/transformers/models/pix2struct/convert_pix2struct_original_pytorch_to_hf.py create mode 100644 src/transformers/models/pix2struct/image_processing_pix2struct.py create mode 100644 src/transformers/models/pix2struct/modeling_pix2struct.py create mode 100644 src/transformers/models/pix2struct/processing_pix2struct.py create mode 100644 src/transformers/models/pop2piano/__init__.py create mode 100644 src/transformers/models/pop2piano/configuration_pop2piano.py create mode 100644 src/transformers/models/pop2piano/convert_pop2piano_weights_to_hf.py create mode 100644 src/transformers/models/pop2piano/feature_extraction_pop2piano.py create mode 100644 src/transformers/models/pop2piano/modeling_pop2piano.py create mode 100644 src/transformers/models/pop2piano/processing_pop2piano.py create mode 100644 src/transformers/models/pop2piano/tokenization_pop2piano.py create mode 100644 src/transformers/models/pvt/__init__.py create mode 100644 src/transformers/models/pvt/configuration_pvt.py create mode 100644 src/transformers/models/pvt/convert_pvt_to_pytorch.py create mode 100644 src/transformers/models/pvt/image_processing_pvt.py create mode 100755 src/transformers/models/pvt/modeling_pvt.py create mode 100644 src/transformers/models/qwen2/__init__.py create mode 100644 src/transformers/models/qwen2/configuration_qwen2.py create mode 100644 src/transformers/models/qwen2/modeling_qwen2.py create mode 100644 src/transformers/models/qwen2/tokenization_qwen2.py create mode 100644 src/transformers/models/qwen2/tokenization_qwen2_fast.py create mode 100644 src/transformers/models/regnet/modeling_flax_regnet.py create mode 100644 src/transformers/models/resnet/modeling_flax_resnet.py create mode 100644 src/transformers/models/rwkv/__init__.py create mode 100644 src/transformers/models/rwkv/configuration_rwkv.py create mode 100644 src/transformers/models/rwkv/convert_rwkv_checkpoint_to_hf.py create mode 100644 src/transformers/models/rwkv/modeling_rwkv.py create mode 100644 src/transformers/models/sam/__init__.py create mode 100644 src/transformers/models/sam/configuration_sam.py create mode 100644 src/transformers/models/sam/convert_sam_original_to_hf_format.py create mode 100644 src/transformers/models/sam/image_processing_sam.py create mode 100644 src/transformers/models/sam/modeling_sam.py create mode 100644 src/transformers/models/sam/modeling_tf_sam.py create mode 100644 src/transformers/models/sam/processing_sam.py create mode 100644 src/transformers/models/seamless_m4t/__init__.py create mode 100644 src/transformers/models/seamless_m4t/configuration_seamless_m4t.py create mode 100644 src/transformers/models/seamless_m4t/convert_fairseq2_to_hf.py create mode 100644 src/transformers/models/seamless_m4t/feature_extraction_seamless_m4t.py create mode 100755 
src/transformers/models/seamless_m4t/modeling_seamless_m4t.py create mode 100644 src/transformers/models/seamless_m4t/processing_seamless_m4t.py create mode 100644 src/transformers/models/seamless_m4t/tokenization_seamless_m4t.py create mode 100644 src/transformers/models/seamless_m4t/tokenization_seamless_m4t_fast.py create mode 100644 src/transformers/models/seamless_m4t_v2/__init__.py create mode 100644 src/transformers/models/seamless_m4t_v2/configuration_seamless_m4t_v2.py create mode 100644 src/transformers/models/seamless_m4t_v2/convert_fairseq2_to_hf.py create mode 100644 src/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py create mode 100644 src/transformers/models/siglip/__init__.py create mode 100644 src/transformers/models/siglip/configuration_siglip.py create mode 100644 src/transformers/models/siglip/convert_siglip_to_hf.py create mode 100644 src/transformers/models/siglip/image_processing_siglip.py create mode 100644 src/transformers/models/siglip/modeling_siglip.py create mode 100644 src/transformers/models/siglip/processing_siglip.py create mode 100644 src/transformers/models/siglip/tokenization_siglip.py create mode 100644 src/transformers/models/speecht5/number_normalizer.py create mode 100644 src/transformers/models/stablelm/__init__.py create mode 100644 src/transformers/models/stablelm/configuration_stablelm.py create mode 100755 src/transformers/models/stablelm/modeling_stablelm.py create mode 100644 src/transformers/models/swiftformer/__init__.py create mode 100644 src/transformers/models/swiftformer/configuration_swiftformer.py create mode 100644 src/transformers/models/swiftformer/convert_swiftformer_original_to_hf.py create mode 100644 src/transformers/models/swiftformer/modeling_swiftformer.py rename src/transformers/models/table_transformer/{convert_table_transformer_original_pytorch_checkpoint_to_pytorch.py => convert_table_transformer_to_hf.py} (96%) create mode 100644 src/transformers/models/table_transformer/convert_table_transformer_to_hf_no_timm.py create mode 100644 src/transformers/models/timm_backbone/__init__.py create mode 100644 src/transformers/models/timm_backbone/configuration_timm_backbone.py create mode 100644 src/transformers/models/timm_backbone/modeling_timm_backbone.py create mode 100644 src/transformers/models/tvp/__init__.py create mode 100644 src/transformers/models/tvp/configuration_tvp.py create mode 100644 src/transformers/models/tvp/image_processing_tvp.py create mode 100644 src/transformers/models/tvp/modeling_tvp.py create mode 100644 src/transformers/models/tvp/processing_tvp.py create mode 100644 src/transformers/models/umt5/__init__.py create mode 100644 src/transformers/models/umt5/configuration_umt5.py create mode 100644 src/transformers/models/umt5/convert_umt5_checkpoint_to_pytorch.py create mode 100644 src/transformers/models/umt5/modeling_umt5.py create mode 100644 src/transformers/models/univnet/__init__.py create mode 100644 src/transformers/models/univnet/configuration_univnet.py create mode 100644 src/transformers/models/univnet/convert_univnet.py create mode 100644 src/transformers/models/univnet/feature_extraction_univnet.py create mode 100644 src/transformers/models/univnet/modeling_univnet.py create mode 100644 src/transformers/models/vipllava/__init__.py create mode 100644 src/transformers/models/vipllava/configuration_vipllava.py create mode 100644 src/transformers/models/vipllava/convert_vipllava_weights_to_hf.py create mode 100644 src/transformers/models/vipllava/modeling_vipllava.py create mode 
100644 src/transformers/models/vision_text_dual_encoder/modeling_tf_vision_text_dual_encoder.py create mode 100644 src/transformers/models/vitdet/__init__.py create mode 100644 src/transformers/models/vitdet/configuration_vitdet.py create mode 100644 src/transformers/models/vitdet/modeling_vitdet.py create mode 100644 src/transformers/models/vitmatte/__init__.py create mode 100644 src/transformers/models/vitmatte/configuration_vitmatte.py create mode 100644 src/transformers/models/vitmatte/convert_vitmatte_to_hf.py create mode 100644 src/transformers/models/vitmatte/image_processing_vitmatte.py create mode 100644 src/transformers/models/vitmatte/modeling_vitmatte.py create mode 100644 src/transformers/models/vits/__init__.py create mode 100644 src/transformers/models/vits/configuration_vits.py create mode 100644 src/transformers/models/vits/convert_original_checkpoint.py create mode 100644 src/transformers/models/vits/modeling_vits.py create mode 100644 src/transformers/models/vits/tokenization_vits.py create mode 100644 src/transformers/models/vivit/__init__.py create mode 100644 src/transformers/models/vivit/configuration_vivit.py create mode 100644 src/transformers/models/vivit/convert_vivit_flax_to_pytorch.py create mode 100644 src/transformers/models/vivit/image_processing_vivit.py create mode 100755 src/transformers/models/vivit/modeling_vivit.py create mode 100644 src/transformers/models/wav2vec2_bert/__init__.py create mode 100644 src/transformers/models/wav2vec2_bert/configuration_wav2vec2_bert.py create mode 100644 src/transformers/models/wav2vec2_bert/convert_wav2vec2_seamless_checkpoint.py create mode 100644 src/transformers/models/wav2vec2_bert/modeling_wav2vec2_bert.py create mode 100644 src/transformers/models/wav2vec2_bert/processing_wav2vec2_bert.py mode change 100644 => 100755 src/transformers/models/whisper/convert_openai_to_hf.py create mode 100644 src/transformers/models/whisper/generation_whisper.py create mode 100644 src/transformers/models/whisper/modeling_flax_whisper.py create mode 100644 src/transformers/models/whisper/tokenization_whisper_fast.py create mode 100644 src/transformers/pipelines/image_feature_extraction.py create mode 100644 src/transformers/pipelines/image_to_image.py create mode 100644 src/transformers/pipelines/mask_generation.py create mode 100644 src/transformers/pipelines/text_to_audio.py create mode 100644 src/transformers/pipelines/zero_shot_audio_classification.py create mode 100644 src/transformers/quantizers/__init__.py create mode 100644 src/transformers/quantizers/auto.py create mode 100644 src/transformers/quantizers/base.py create mode 100644 src/transformers/quantizers/quantizer_aqlm.py create mode 100644 src/transformers/quantizers/quantizer_awq.py create mode 100644 src/transformers/quantizers/quantizer_bnb_4bit.py create mode 100644 src/transformers/quantizers/quantizer_bnb_8bit.py create mode 100644 src/transformers/quantizers/quantizer_gptq.py create mode 100644 src/transformers/quantizers/quantizers_utils.py create mode 100644 src/transformers/safetensors_conversion.py create mode 100644 src/transformers/time_series_utils.py create mode 100644 src/transformers/tools/__init__.py create mode 100644 src/transformers/tools/agent_types.py create mode 100644 src/transformers/tools/agents.py create mode 100644 src/transformers/tools/base.py create mode 100644 src/transformers/tools/document_question_answering.py create mode 100644 src/transformers/tools/evaluate_agent.py create mode 100644 src/transformers/tools/image_captioning.py 
create mode 100644 src/transformers/tools/image_question_answering.py create mode 100644 src/transformers/tools/image_segmentation.py create mode 100644 src/transformers/tools/prompts.py create mode 100644 src/transformers/tools/python_interpreter.py create mode 100644 src/transformers/tools/speech_to_text.py create mode 100644 src/transformers/tools/text_classification.py create mode 100644 src/transformers/tools/text_question_answering.py create mode 100644 src/transformers/tools/text_summarization.py create mode 100644 src/transformers/tools/text_to_speech.py create mode 100644 src/transformers/tools/translation.py delete mode 100644 src/transformers/trainer_tf.py create mode 100644 src/transformers/utils/backbone_utils.py create mode 100644 src/transformers/utils/dummy_essentia_and_librosa_and_pretty_midi_and_scipy_and_torch_objects.py create mode 100644 src/transformers/utils/dummy_music_objects.py delete mode 100644 src/transformers/utils/dummy_timm_and_vision_objects.py create mode 100644 src/transformers/utils/peft_utils.py create mode 100644 src/transformers/utils/sentencepiece_model_pb2_new.py rename templates/adding_a_new_model/cookiecutter-template-{{cookiecutter.modelname}}/{{{cookiecutter.lowercase_modelname}}.mdx => {{cookiecutter.lowercase_modelname}}.md} (100%) rename tests/{models/bort => bettertransformer}/__init__.py (100%) create mode 100644 tests/bettertransformer/test_integration.py create mode 100644 tests/fsdp/test_fsdp.py create mode 100644 tests/generation/test_streamers.py rename tests/models/{mctct => align}/__init__.py (100%) create mode 100644 tests/models/align/test_modeling_align.py create mode 100644 tests/models/align/test_processor_align.py rename tests/models/{retribert => autoformer}/__init__.py (100%) create mode 100644 tests/models/autoformer/test_modeling_autoformer.py rename tests/models/{tapex => bark}/__init__.py (100%) create mode 100644 tests/models/bark/test_modeling_bark.py create mode 100644 tests/models/bark/test_processor_bark.py create mode 100644 tests/models/blip/test_modeling_tf_blip.py create mode 100644 tests/models/blip/test_modeling_tf_blip_text.py create mode 100644 tests/models/bloom/test_modeling_flax_bloom.py delete mode 100644 tests/models/bort/test_modeling_bort.py delete mode 100644 tests/models/bort/test_modeling_tf_bort.py rename tests/models/{trajectory_transformer => bros}/__init__.py (100%) create mode 100644 tests/models/bros/test_modeling_bros.py rename tests/models/{transfo_xl => clvp}/__init__.py (100%) create mode 100644 tests/models/clvp/test_feature_extraction_clvp.py create mode 100644 tests/models/clvp/test_modeling_clvp.py create mode 100644 tests/models/clvp/test_processor_clvp.py create mode 100644 tests/models/clvp/test_tokenization_clvp.py rename tests/models/{van => code_llama}/__init__.py (100%) create mode 100644 tests/models/code_llama/test_tokenization_code_llama.py rename tests/{onnx => models/convnextv2}/__init__.py (100%) create mode 100644 tests/models/convnextv2/test_modeling_convnextv2.py create mode 100644 tests/models/convnextv2/test_modeling_tf_convnextv2.py create mode 100644 tests/models/cpmant/__init__.py create mode 100644 tests/models/cpmant/test_modeling_cpmant.py create mode 100644 tests/models/cpmant/test_tokenization_cpmant.py create mode 100644 tests/models/depth_anything/__init__.py create mode 100644 tests/models/depth_anything/test_modeling_depth_anything.py create mode 100644 tests/models/dinov2/__init__.py create mode 100644 tests/models/dinov2/test_modeling_dinov2.py create 
mode 100644 tests/models/dpt/test_modeling_dpt_auto_backbone.py create mode 100644 tests/models/efficientformer/test_modeling_tf_efficientformer.py create mode 100644 tests/models/efficientnet/__init__.py create mode 100644 tests/models/efficientnet/test_image_processing_efficientnet.py rename tests/models/{van/test_modeling_van.py => efficientnet/test_modeling_efficientnet.py} (53%) rename tests/models/{retribert/test_tokenization_retribert.py => electra/test_tokenization_electra.py} (83%) create mode 100644 tests/models/encodec/__init__.py create mode 100644 tests/models/encodec/test_feature_extraction_encodec.py create mode 100644 tests/models/encodec/test_modeling_encodec.py create mode 100644 tests/models/falcon/__init__.py create mode 100644 tests/models/falcon/test_modeling_falcon.py create mode 100644 tests/models/fastspeech2_conformer/__init__.py create mode 100644 tests/models/fastspeech2_conformer/test_modeling_fastspeech2_conformer.py create mode 100644 tests/models/fastspeech2_conformer/test_tokenization_fastspeech2_conformer.py create mode 100644 tests/models/focalnet/__init__.py create mode 100644 tests/models/focalnet/test_modeling_focalnet.py create mode 100644 tests/models/fuyu/__init__.py create mode 100644 tests/models/fuyu/test_image_processing_fuyu.py create mode 100644 tests/models/fuyu/test_modeling_fuyu.py create mode 100644 tests/models/fuyu/test_processing_fuyu.py create mode 100644 tests/models/gpt_bigcode/__init__.py create mode 100644 tests/models/gpt_bigcode/test_modeling_gpt_bigcode.py create mode 100644 tests/models/gptsan_japanese/__init__.py create mode 100644 tests/models/gptsan_japanese/test_modeling_gptsan_japanese.py create mode 100644 tests/models/gptsan_japanese/test_tokenization_gptsan_japanese.py create mode 100644 tests/models/idefics/__init__.py create mode 100644 tests/models/idefics/test_image_processing_idefics.py create mode 100644 tests/models/idefics/test_modeling_idefics.py create mode 100644 tests/models/idefics/test_processor_idefics.py create mode 100644 tests/models/informer/__init__.py create mode 100644 tests/models/informer/test_modeling_informer.py create mode 100644 tests/models/instructblip/__init__.py create mode 100644 tests/models/instructblip/test_modeling_instructblip.py create mode 100644 tests/models/instructblip/test_processor_instructblip.py create mode 100644 tests/models/kosmos2/__init__.py create mode 100644 tests/models/kosmos2/test_modeling_kosmos2.py create mode 100644 tests/models/kosmos2/test_processor_kosmos2.py create mode 100644 tests/models/llama/__init__.py create mode 100644 tests/models/llama/test_modeling_flax_llama.py create mode 100644 tests/models/llama/test_modeling_llama.py create mode 100644 tests/models/llama/test_tokenization_llama.py create mode 100644 tests/models/llava/__init__.py create mode 100644 tests/models/llava/test_modeling_llava.py delete mode 100644 tests/models/mctct/test_feature_extraction_mctct.py delete mode 100644 tests/models/mctct/test_modeling_mctct.py create mode 100644 tests/models/mega/__init__.py create mode 100644 tests/models/mega/test_modeling_mega.py create mode 100644 tests/models/mgp_str/__init__.py create mode 100644 tests/models/mgp_str/test_modeling_mgp_str.py create mode 100644 tests/models/mgp_str/test_processor_mgp_str.py create mode 100644 tests/models/mgp_str/test_tokenization_mgp_str.py create mode 100644 tests/models/mistral/__init__.py create mode 100644 tests/models/mistral/test_modeling_flax_mistral.py create mode 100644 
tests/models/mistral/test_modeling_mistral.py create mode 100644 tests/models/mixtral/__init__.py create mode 100644 tests/models/mixtral/test_modeling_mixtral.py create mode 100644 tests/models/mobilevitv2/__init__.py create mode 100644 tests/models/mobilevitv2/test_modeling_mobilevitv2.py create mode 100644 tests/models/mpt/__init__.py create mode 100644 tests/models/mpt/test_modeling_mpt.py create mode 100644 tests/models/mra/__init__.py create mode 100644 tests/models/mra/test_modeling_mra.py create mode 100644 tests/models/musicgen/__init__.py create mode 100644 tests/models/musicgen/test_modeling_musicgen.py create mode 100644 tests/models/musicgen/test_processing_musicgen.py create mode 100644 tests/models/nllb_moe/__init__.py create mode 100644 tests/models/nllb_moe/test_modeling_nllb_moe.py create mode 100644 tests/models/nougat/__init__.py create mode 100644 tests/models/nougat/test_image_processing_nougat.py create mode 100644 tests/models/nougat/test_tokenization_nougat.py create mode 100644 tests/models/owlv2/__init__.py create mode 100644 tests/models/owlv2/test_image_processor_owlv2.py create mode 100644 tests/models/owlv2/test_modeling_owlv2.py create mode 100644 tests/models/patchtsmixer/__init__.py create mode 100644 tests/models/patchtsmixer/test_modeling_patchtsmixer.py create mode 100644 tests/models/patchtst/__init__.py create mode 100644 tests/models/patchtst/test_modeling_patchtst.py create mode 100644 tests/models/persimmon/__init__.py create mode 100644 tests/models/persimmon/test_modeling_persimmon.py create mode 100644 tests/models/phi/__init__.py create mode 100644 tests/models/phi/test_modeling_phi.py create mode 100644 tests/models/pix2struct/__init__.py create mode 100644 tests/models/pix2struct/test_image_processing_pix2struct.py create mode 100644 tests/models/pix2struct/test_modeling_pix2struct.py create mode 100644 tests/models/pix2struct/test_processor_pix2struct.py create mode 100644 tests/models/pop2piano/__init__.py create mode 100644 tests/models/pop2piano/test_feature_extraction_pop2piano.py create mode 100644 tests/models/pop2piano/test_modeling_pop2piano.py create mode 100644 tests/models/pop2piano/test_processor_pop2piano.py create mode 100644 tests/models/pop2piano/test_tokenization_pop2piano.py create mode 100644 tests/models/pvt/__init__.py create mode 100644 tests/models/pvt/test_image_processing_pvt.py create mode 100644 tests/models/pvt/test_modeling_pvt.py create mode 100644 tests/models/qwen2/__init__.py create mode 100644 tests/models/qwen2/test_modeling_qwen2.py create mode 100644 tests/models/qwen2/test_tokenization_qwen2.py create mode 100644 tests/models/regnet/test_modeling_flax_regnet.py create mode 100644 tests/models/rembert/test_tokenization_rembert.py create mode 100644 tests/models/resnet/test_modeling_flax_resnet.py create mode 100644 tests/models/rwkv/__init__.py create mode 100644 tests/models/rwkv/test_modeling_rwkv.py create mode 100644 tests/models/sam/__init__.py create mode 100644 tests/models/sam/test_modeling_sam.py create mode 100644 tests/models/sam/test_modeling_tf_sam.py create mode 100644 tests/models/sam/test_processor_sam.py create mode 100644 tests/models/seamless_m4t/__init__.py create mode 100644 tests/models/seamless_m4t/test_feature_extraction_seamless_m4t.py create mode 100644 tests/models/seamless_m4t/test_modeling_seamless_m4t.py create mode 100644 tests/models/seamless_m4t/test_processor_seamless_m4t.py create mode 100644 tests/models/seamless_m4t/test_tokenization_seamless_m4t.py create mode 100644 
tests/models/seamless_m4t_v2/__init__.py create mode 100644 tests/models/seamless_m4t_v2/test_modeling_seamless_m4t_v2.py create mode 100644 tests/models/siglip/__init__.py create mode 100644 tests/models/siglip/test_image_processor_siglip.py create mode 100644 tests/models/siglip/test_modeling_siglip.py create mode 100644 tests/models/siglip/test_tokenization_siglip.py create mode 100644 tests/models/stablelm/__init__.py create mode 100644 tests/models/stablelm/test_modeling_stablelm.py create mode 100644 tests/models/swiftformer/__init__.py create mode 100644 tests/models/swiftformer/test_modeling_swiftformer.py delete mode 100644 tests/models/tapex/test_tokenization_tapex.py create mode 100644 tests/models/timm_backbone/__init__.py create mode 100644 tests/models/timm_backbone/test_modeling_timm_backbone.py delete mode 100644 tests/models/trajectory_transformer/test_modeling_trajectory_transformer.py delete mode 100644 tests/models/transfo_xl/test_modeling_tf_transfo_xl.py delete mode 100644 tests/models/transfo_xl/test_modeling_transfo_xl.py delete mode 100644 tests/models/transfo_xl/test_tokenization_transfo_xl.py create mode 100644 tests/models/tvp/__init__.py create mode 100644 tests/models/tvp/test_image_processing_tvp.py create mode 100644 tests/models/tvp/test_modeling_tvp.py create mode 100644 tests/models/umt5/__init__.py create mode 100644 tests/models/umt5/test_modeling_umt5.py create mode 100644 tests/models/univnet/__init__.py create mode 100644 tests/models/univnet/test_feature_extraction_univnet.py create mode 100644 tests/models/univnet/test_modeling_univnet.py create mode 100644 tests/models/vipllava/__init__.py create mode 100644 tests/models/vipllava/test_modeling_vipllava.py create mode 100644 tests/models/vision_text_dual_encoder/test_modeling_tf_vision_text_dual_encoder.py create mode 100644 tests/models/vitdet/__init__.py create mode 100644 tests/models/vitdet/test_modeling_vitdet.py create mode 100644 tests/models/vitmatte/__init__.py create mode 100644 tests/models/vitmatte/test_image_processing_vitmatte.py create mode 100644 tests/models/vitmatte/test_modeling_vitmatte.py create mode 100644 tests/models/vits/__init__.py create mode 100644 tests/models/vits/test_modeling_vits.py create mode 100644 tests/models/vits/test_tokenization_vits.py create mode 100644 tests/models/vivit/__init__.py create mode 100644 tests/models/vivit/test_image_processing_vivit.py create mode 100644 tests/models/vivit/test_modeling_vivit.py create mode 100644 tests/models/wav2vec2_bert/__init__.py create mode 100644 tests/models/wav2vec2_bert/test_modeling_wav2vec2_bert.py rename tests/models/{mctct/test_processor_mctct.py => wav2vec2_bert/test_processor_wav2vec2_bert.py} (74%) create mode 100644 tests/models/whisper/test_modeling_flax_whisper.py delete mode 100644 tests/onnx/test_features.py delete mode 100644 tests/onnx/test_onnx.py delete mode 100644 tests/onnx/test_onnx_v2.py create mode 100644 tests/peft_integration/test_peft_integration.py create mode 100644 tests/pipelines/test_pipelines_image_feature_extraction.py create mode 100644 tests/pipelines/test_pipelines_image_to_image.py create mode 100644 tests/pipelines/test_pipelines_mask_generation.py create mode 100644 tests/pipelines/test_pipelines_text_to_audio.py create mode 100644 tests/pipelines/test_pipelines_zero_shot_audio_classification.py create mode 100644 tests/quantization/aqlm_integration/__init__.py create mode 100644 tests/quantization/aqlm_integration/test_aqlm.py create mode 100644 
tests/quantization/autoawq/__init__.py create mode 100644 tests/quantization/autoawq/test_awq.py rename tests/{mixed_int8 => quantization/bnb}/README.md (87%) create mode 100644 tests/quantization/bnb/__init__.py create mode 100644 tests/quantization/bnb/test_4bit.py rename tests/{mixed_int8 => quantization/bnb}/test_mixed_int8.py (58%) create mode 100644 tests/quantization/gptq/__init__.py create mode 100644 tests/quantization/gptq/test_gptq.py create mode 100644 tests/repo_utils/test_check_docstrings.py create mode 100644 tests/repo_utils/test_get_test_info.py create mode 100644 tests/test_backbone_common.py create mode 100644 tests/test_cache_utils.py create mode 100644 tests/test_configuration_utils.py create mode 100644 tests/test_feature_extraction_utils.py create mode 100644 tests/test_image_processing_utils.py create mode 100644 tests/test_modeling_flax_utils.py create mode 100644 tests/test_modeling_tf_utils.py create mode 100755 tests/test_modeling_utils.py create mode 100644 tests/test_pipeline_mixin.py create mode 100644 tests/test_processing_common.py create mode 100644 tests/test_tokenization_utils.py create mode 100644 tests/tools/__init__.py create mode 100644 tests/tools/test_agent_types.py create mode 100644 tests/tools/test_document_question_answering.py create mode 100644 tests/tools/test_image_captioning.py create mode 100644 tests/tools/test_image_question_answering.py create mode 100644 tests/tools/test_image_segmentation.py create mode 100644 tests/tools/test_python_interpreter.py create mode 100644 tests/tools/test_speech_to_text.py create mode 100644 tests/tools/test_text_classification.py create mode 100644 tests/tools/test_text_question_answering.py create mode 100644 tests/tools/test_text_summarization.py create mode 100644 tests/tools/test_text_to_speech.py create mode 100644 tests/tools/test_tools_common.py create mode 100644 tests/tools/test_translation.py create mode 100644 tests/utils/test_audio_utils.py create mode 100644 tests/utils/test_backbone_utils.py create mode 100644 tests/utils/test_dynamic_module_utils.py create mode 100644 tests/utils/tiny_model_summary.json create mode 100644 utils/add_pipeline_model_mapping_to_test.py create mode 100644 utils/check_build.py create mode 100644 utils/check_docstrings.py create mode 100644 utils/check_model_tester.py create mode 100644 utils/check_support_list.py delete mode 100644 utils/documentation_tests.txt create mode 100644 utils/get_previous_daily_ci.py create mode 100644 utils/get_test_info.py create mode 100644 utils/not_doctested.txt delete mode 100644 utils/prepare_for_doc_test.py create mode 100644 utils/slow_documentation_tests.txt create mode 100644 utils/split_model_tests.py create mode 100644 utils/update_tiny_models.py diff --git a/.circleci/TROUBLESHOOT.md b/.circleci/TROUBLESHOOT.md index c662a921ba56f3..484d62b46a87f4 100644 --- a/.circleci/TROUBLESHOOT.md +++ b/.circleci/TROUBLESHOOT.md @@ -1,6 +1,6 @@ # Troubleshooting -This is a document explaining how to deal with various issues on Circle-CI. The entries may include actually solutions or pointers to Issues that cover those. +This is a document explaining how to deal with various issues on Circle-CI. The entries may include actual solutions or pointers to Issues that cover those. 
## Circle CI diff --git a/.circleci/config.yml b/.circleci/config.yml index 736f61736df6ca..44d50547804f04 100644 --- a/.circleci/config.yml +++ b/.circleci/config.yml @@ -12,7 +12,7 @@ jobs: # Ensure running with CircleCI/huggingface check_circleci_user: docker: - - image: cimg/python:3.7.12 + - image: cimg/python:3.8.12 parallelism: 1 steps: - run: echo $CIRCLE_PROJECT_USERNAME @@ -26,13 +26,13 @@ jobs: fetch_tests: working_directory: ~/transformers docker: - - image: cimg/python:3.7.12 + - image: cimg/python:3.8.12 parallelism: 1 steps: - checkout - - run: pip install --upgrade pip - - run: pip install GitPython - - run: pip install . + - run: pip install --upgrade --upgrade-strategy eager pip + - run: pip install -U --upgrade-strategy eager GitPython + - run: pip install -U --upgrade-strategy eager . - run: mkdir -p test_preparation - run: python utils/tests_fetcher.py | tee tests_fetched_summary.txt - store_artifacts: @@ -43,6 +43,24 @@ jobs: else touch test_preparation/test_list.txt fi + - run: | + if [ -f examples_test_list.txt ]; then + mv examples_test_list.txt test_preparation/examples_test_list.txt + else + touch test_preparation/examples_test_list.txt + fi + - run: | + if [ -f filtered_test_list_cross_tests.txt ]; then + mv filtered_test_list_cross_tests.txt test_preparation/filtered_test_list_cross_tests.txt + else + touch test_preparation/filtered_test_list_cross_tests.txt + fi + - run: | + if [ -f doctest_list.txt ]; then + cp doctest_list.txt test_preparation/doctest_list.txt + else + touch test_preparation/doctest_list.txt + fi - run: | if [ -f test_repo_utils.txt ]; then mv test_repo_utils.txt test_preparation/test_repo_utils.txt @@ -56,15 +74,10 @@ jobs: else touch test_preparation/filtered_test_list.txt fi - - run: python utils/tests_fetcher.py --filters tests examples | tee examples_tests_fetched_summary.txt - - run: | - if [ -f test_list.txt ]; then - mv test_list.txt test_preparation/examples_test_list.txt - else - touch test_preparation/examples_test_list.txt - fi - store_artifacts: path: test_preparation/test_list.txt + - store_artifacts: + path: test_preparation/doctest_list.txt - store_artifacts: path: ~/transformers/test_preparation/filtered_test_list.txt - store_artifacts: @@ -78,6 +91,8 @@ jobs: - run: cp test_preparation/generated_config.yml test_preparation/generated_config.txt - store_artifacts: path: test_preparation/generated_config.txt + - store_artifacts: + path: test_preparation/filtered_test_list_cross_tests.txt - continuation/continue: configuration_path: test_preparation/generated_config.yml @@ -85,17 +100,17 @@ jobs: fetch_all_tests: working_directory: ~/transformers docker: - - image: cimg/python:3.7.12 + - image: cimg/python:3.8.12 parallelism: 1 steps: - checkout - - run: pip install --upgrade pip - - run: pip install GitPython - - run: pip install . + - run: pip install --upgrade --upgrade-strategy eager pip + - run: pip install -U --upgrade-strategy eager GitPython + - run: pip install -U --upgrade-strategy eager . 
- run: | mkdir test_preparation echo -n "tests" > test_preparation/test_list.txt - echo -n "tests" > test_preparation/examples_test_list.txt + echo -n "all" > test_preparation/examples_test_list.txt echo -n "tests/repo_utils" > test_preparation/test_repo_utils.txt - run: | echo -n "tests" > test_list.txt @@ -111,7 +126,7 @@ jobs: check_code_quality: working_directory: ~/transformers docker: - - image: cimg/python:3.7.12 + - image: cimg/python:3.8.12 resource_class: large environment: TRANSFORMERS_IS_CI: yes @@ -121,30 +136,37 @@ jobs: - checkout - restore_cache: keys: - - v0.5-code_quality-{{ checksum "setup.py" }} - - v0.5-code-quality - - run: pip install --upgrade pip - - run: pip install .[all,quality] + - v0.7-code_quality-pip-{{ checksum "setup.py" }} + - v0.7-code-quality-pip + - restore_cache: + keys: + - v0.7-code_quality-site-packages-{{ checksum "setup.py" }} + - v0.7-code-quality-site-packages + - run: pip install --upgrade --upgrade-strategy eager pip + - run: pip install -U --upgrade-strategy eager .[all,quality] - save_cache: - key: v0.5-code_quality-{{ checksum "setup.py" }} + key: v0.7-code_quality-pip-{{ checksum "setup.py" }} paths: - '~/.cache/pip' + - save_cache: + key: v0.7-code_quality-site-packages-{{ checksum "setup.py" }} + paths: + - '~/.pyenv/versions/' - run: name: Show installed libraries and their versions command: pip freeze | tee installed.txt - store_artifacts: path: ~/transformers/installed.txt - - run: black --check examples tests src utils - - run: ruff examples tests src utils + - run: ruff check examples tests src utils + - run: ruff format tests src utils --check - run: python utils/custom_init_isort.py --check_only - run: python utils/sort_auto_mappings.py --check_only - - run: doc-builder style src/transformers docs/source --max_len 119 --check_only --path_to_docs docs/source - run: python utils/check_doc_toc.py check_repository_consistency: working_directory: ~/transformers docker: - - image: cimg/python:3.7.12 + - image: cimg/python:3.8.12 resource_class: large environment: TRANSFORMERS_IS_CI: yes @@ -154,14 +176,22 @@ jobs: - checkout - restore_cache: keys: - - v0.5-repository_consistency-{{ checksum "setup.py" }} - - v0.5-repository_consistency - - run: pip install --upgrade pip - - run: pip install .[all,quality] + - v0.7-repository_consistency-pip-{{ checksum "setup.py" }} + - v0.7-repository_consistency-pip + - restore_cache: + keys: + - v0.7-repository_consistency-site-packages-{{ checksum "setup.py" }} + - v0.7-repository_consistency-site-packages + - run: pip install --upgrade --upgrade-strategy eager pip + - run: pip install -U --upgrade-strategy eager .[all,quality] - save_cache: - key: v0.5-repository_consistency-{{ checksum "setup.py" }} + key: v0.7-repository_consistency-pip-{{ checksum "setup.py" }} paths: - '~/.cache/pip' + - save_cache: + key: v0.7-repository_consistency-site-packages-{{ checksum "setup.py" }} + paths: + - '~/.pyenv/versions/' - run: name: Show installed libraries and their versions command: pip freeze | tee installed.txt @@ -176,9 +206,10 @@ jobs: - run: python utils/check_config_attributes.py - run: python utils/check_doctest_list.py - run: make deps_table_check_updated - - run: python utils/tests_fetcher.py --sanity_check - run: python utils/update_metadata.py --check-only - run: python utils/check_task_guides.py + - run: python utils/check_docstrings.py + - run: python utils/check_support_list.py workflows: version: 2 diff --git a/.circleci/create_circleci_config.py b/.circleci/create_circleci_config.py index 
bf8c9e281a63f2..7f271ff0819f78 100644 --- a/.circleci/create_circleci_config.py +++ b/.circleci/create_circleci_config.py @@ -15,7 +15,6 @@ import argparse import copy -import glob import os import random from dataclasses import dataclass @@ -24,9 +23,28 @@ import yaml -COMMON_ENV_VARIABLES = {"OMP_NUM_THREADS": 1, "TRANSFORMERS_IS_CI": True, "PYTEST_TIMEOUT": 120} -COMMON_PYTEST_OPTIONS = {"max-worker-restart": 0, "dist": "loadfile", "s": None} -DEFAULT_DOCKER_IMAGE = [{"image": "cimg/python:3.7.12"}] +COMMON_ENV_VARIABLES = { + "OMP_NUM_THREADS": 1, + "TRANSFORMERS_IS_CI": True, + "PYTEST_TIMEOUT": 120, + "RUN_PIPELINE_TESTS": False, + "RUN_PT_TF_CROSS_TESTS": False, + "RUN_PT_FLAX_CROSS_TESTS": False, +} +# Disable the use of {"s": None} as the output is way too long, causing the navigation on CircleCI impractical +COMMON_PYTEST_OPTIONS = {"max-worker-restart": 0, "dist": "loadfile", "v": None} +DEFAULT_DOCKER_IMAGE = [{"image": "cimg/python:3.8.12"}] + + +class EmptyJob: + job_name = "empty" + + def to_dict(self): + return { + "working_directory": "~/transformers", + "docker": copy.deepcopy(DEFAULT_DOCKER_IMAGE), + "steps":["checkout"], + } @dataclass @@ -34,7 +52,7 @@ class CircleCIJob: name: str additional_env: Dict[str, Any] = None cache_name: str = None - cache_version: str = "0.5" + cache_version: str = "0.8.2" docker_image: List[Dict[str, str]] = None install_steps: List[str] = None marker: Optional[str] = None @@ -44,6 +62,8 @@ class CircleCIJob: resource_class: Optional[str] = "xlarge" tests_to_run: Optional[List[str]] = None working_directory: str = "~/transformers" + # This should be only used for doctest job! + command_timeout: Optional[int] = None def __post_init__(self): # Deal with defaults for mutable attributes. @@ -64,10 +84,17 @@ def __post_init__(self): self.parallelism = 1 def to_dict(self): + env = COMMON_ENV_VARIABLES.copy() + env.update(self.additional_env) + + cache_branch_prefix = os.environ.get("CIRCLE_BRANCH", "pull") + if cache_branch_prefix != "main": + cache_branch_prefix = "pull" + job = { "working_directory": self.working_directory, "docker": self.docker_image, - "environment": {**COMMON_ENV_VARIABLES, **self.additional_env}, + "environment": env, } if self.resource_class is not None: job["resource_class"] = self.resource_class @@ -79,30 +106,44 @@ def to_dict(self): { "restore_cache": { "keys": [ - f"v{self.cache_version}-{self.cache_name}-" + '{{ checksum "setup.py" }}', - f"v{self.cache_version}-{self.cache_name}-", + # check the fully-matched cache first + f"v{self.cache_version}-{self.cache_name}-{cache_branch_prefix}-pip-" + '{{ checksum "setup.py" }}', + # try the partially-matched cache from `main` + f"v{self.cache_version}-{self.cache_name}-main-pip-", + # try the general partially-matched cache + f"v{self.cache_version}-{self.cache_name}-{cache_branch_prefix}-pip-", ] } }, - ] - steps.extend([{"run": l} for l in self.install_steps]) - steps.append( { - "save_cache": { - "key": f"v{self.cache_version}-{self.cache_name}-" + '{{ checksum "setup.py" }}', - "paths": ["~/.cache/pip"], + "restore_cache": { + "keys": [ + f"v{self.cache_version}-{self.cache_name}-{cache_branch_prefix}-site-packages-" + '{{ checksum "setup.py" }}', + f"v{self.cache_version}-{self.cache_name}-main-site-packages-", + f"v{self.cache_version}-{self.cache_name}-{cache_branch_prefix}-site-packages-", + ] } - } - ) + }, + ] + steps.extend([{"run": l} for l in self.install_steps]) + steps.extend([{"run": 'pip install "fsspec>=2023.5.0,<2023.10.0"'}]) + steps.extend([{"run": "pip 
install pytest-subtests"}]) steps.append({"run": {"name": "Show installed libraries and their versions", "command": "pip freeze | tee installed.txt"}}) steps.append({"store_artifacts": {"path": "~/transformers/installed.txt"}}) all_options = {**COMMON_PYTEST_OPTIONS, **self.pytest_options} - pytest_flags = [f"--{key}={value}" if value is not None else f"-{key}" for key, value in all_options.items()] + pytest_flags = [f"--{key}={value}" if (value is not None or key in ["doctest-modules"]) else f"-{key}" for key, value in all_options.items()] pytest_flags.append( f"--make-reports={self.name}" if "examples" in self.name else f"--make-reports=tests_{self.name}" ) - test_command = f"python -m pytest -n {self.pytest_num_workers} " + " ".join(pytest_flags) + + steps.append({"run": {"name": "Create `test-results` directory", "command": "mkdir test-results"}}) + + test_command = "" + if self.command_timeout: + test_command = f"timeout {self.command_timeout} " + test_command += f"python -m pytest --junitxml=test-results/junit.xml -n {self.pytest_num_workers} " + " ".join(pytest_flags) + if self.parallelism == 1: if self.tests_to_run is None: test_command += " << pipeline.parameters.tests_to_run >>" @@ -152,14 +193,80 @@ def to_dict(self): steps.append({"store_artifacts": {"path": "~/transformers/tests.txt"}}) steps.append({"store_artifacts": {"path": "~/transformers/splitted_tests.txt"}}) - test_command = f"python -m pytest -n {self.pytest_num_workers} " + " ".join(pytest_flags) + test_command = "" + if self.timeout: + test_command = f"timeout {self.timeout} " + test_command += f"python -m pytest -n {self.pytest_num_workers} " + " ".join(pytest_flags) test_command += " $(cat splitted_tests.txt)" if self.marker is not None: test_command += f" -m {self.marker}" - test_command += " | tee tests_output.txt" + + if self.name == "pr_documentation_tests": + # can't use ` | tee tee tests_output.txt` as usual + test_command += " > tests_output.txt" + # Save the return code, so we can check if it is timeout in the next step. + test_command += '; touch "$?".txt' + # Never fail the test step for the doctest job. We will check the results in the next step, and fail that + # step instead if the actual test failures are found. This is to avoid the timeout being reported as test + # failure. 
+ test_command = f"({test_command}) || true" + else: + test_command = f"({test_command} | tee tests_output.txt) || true" steps.append({"run": {"name": "Run tests", "command": test_command}}) + + # Deal with errors + check_test_command = f'if [ -s reports/{self.job_name}/errors.txt ]; ' + check_test_command += 'then echo "Some tests errored out!"; echo ""; ' + check_test_command += f'cat reports/{self.job_name}/errors.txt; ' + check_test_command += 'echo ""; echo ""; ' + + py_command = f'import os; fp = open("reports/{self.job_name}/summary_short.txt"); failed = os.linesep.join([x for x in fp.read().split(os.linesep) if x.startswith("ERROR ")]); fp.close(); fp = open("summary_short.txt", "w"); fp.write(failed); fp.close()' + check_test_command += f"$(python3 -c '{py_command}'); " + check_test_command += 'cat summary_short.txt; echo ""; exit -1; ' + + # Deal with failed tests + check_test_command += f'elif [ -s reports/{self.job_name}/failures_short.txt ]; ' + check_test_command += 'then echo "Some tests failed!"; echo ""; ' + check_test_command += f'cat reports/{self.job_name}/failures_short.txt; ' + check_test_command += 'echo ""; echo ""; ' + + py_command = f'import os; fp = open("reports/{self.job_name}/summary_short.txt"); failed = os.linesep.join([x for x in fp.read().split(os.linesep) if x.startswith("FAILED ")]); fp.close(); fp = open("summary_short.txt", "w"); fp.write(failed); fp.close()' + check_test_command += f"$(python3 -c '{py_command}'); " + check_test_command += 'cat summary_short.txt; echo ""; exit -1; ' + + check_test_command += f'elif [ -s reports/{self.job_name}/stats.txt ]; then echo "All tests pass!"; ' + + # return code `124` means the previous (pytest run) step timed out + if self.name == "pr_documentation_tests": + check_test_command += 'elif [ -f 124.txt ]; then echo "doctest timeout!"; ' + + check_test_command += 'else echo "other fatal error"; echo ""; exit -1; fi;' + + steps.append({"run": {"name": "Check test results", "command": check_test_command}}) + + steps.append({"store_test_results": {"path": "test-results"}}) + + steps.append({"store_artifacts": {"path": "~/transformers/tests_output.txt"}}) steps.append({"store_artifacts": {"path": "~/transformers/reports"}}) + + # save cache at the end: so the pytest step runs before cache saving and we can see results earlier + steps.append( + { + "save_cache": { + "key": f"v{self.cache_version}-{self.cache_name}-{cache_branch_prefix}-pip-" + '{{ checksum "setup.py" }}', + "paths": ["~/.cache/pip"], + } + } + ) + steps.append( + { + "save_cache": { + "key": f"v{self.cache_version}-{self.cache_name}-{cache_branch_prefix}-site-packages-" + '{{ checksum "setup.py" }}', + "paths": ["~/.pyenv/versions/"], + } + } + ) + job["steps"] = steps return job @@ -173,12 +280,14 @@ def job_name(self): "torch_and_tf", additional_env={"RUN_PT_TF_CROSS_TESTS": True}, install_steps=[ - "sudo apt-get -y update && sudo apt-get install -y libsndfile1-dev espeak-ng git-lfs", + "sudo apt-get -y update && sudo apt-get install -y libsndfile1-dev espeak-ng git-lfs cmake", "git lfs install", - "pip install --upgrade pip", - "pip install .[sklearn,tf-cpu,torch,testing,sentencepiece,torch-speech,vision]", - "pip install tensorflow_probability", - "pip install git+https://github.com/huggingface/accelerate", + "pip install --upgrade --upgrade-strategy eager pip", + "pip install -U --upgrade-strategy eager .[sklearn,tf-cpu,torch,testing,sentencepiece,torch-speech,vision]", + "pip install -U --upgrade-strategy eager tensorflow_probability", + "pip install 
-U --upgrade-strategy eager -e git+https://github.com/huggingface/accelerate@main#egg=accelerate", + # TODO: remove this one after fixing the dependency issue(s) above + "pip install -U --upgrade-strategy eager torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu", ], marker="is_pt_tf_cross_test", pytest_options={"rA": None, "durations": 0}, @@ -190,9 +299,9 @@ def job_name(self): additional_env={"RUN_PT_FLAX_CROSS_TESTS": True}, install_steps=[ "sudo apt-get -y update && sudo apt-get install -y libsndfile1-dev espeak-ng", - "pip install --upgrade pip", - "pip install .[sklearn,flax,torch,testing,sentencepiece,torch-speech,vision]", - "pip install git+https://github.com/huggingface/accelerate", + "pip install -U --upgrade-strategy eager --upgrade pip", + "pip install -U --upgrade-strategy eager .[sklearn,flax,torch,testing,sentencepiece,torch-speech,vision]", + "pip install -U --upgrade-strategy eager -e git+https://github.com/huggingface/accelerate@main#egg=accelerate", ], marker="is_pt_flax_cross_test", pytest_options={"rA": None, "durations": 0}, @@ -203,25 +312,24 @@ def job_name(self): "torch", install_steps=[ "sudo apt-get -y update && sudo apt-get install -y libsndfile1-dev espeak-ng time", - "pip install --upgrade pip", - "pip install .[sklearn,torch,testing,sentencepiece,torch-speech,vision,timm]", - "pip install git+https://github.com/huggingface/accelerate", + "pip install --upgrade --upgrade-strategy eager pip", + "pip install -U --upgrade-strategy eager .[sklearn,torch,testing,sentencepiece,torch-speech,vision,timm]", + "pip install -U --upgrade-strategy eager -e git+https://github.com/huggingface/accelerate@main#egg=accelerate", ], parallelism=1, - pytest_num_workers=3, + pytest_num_workers=6, ) tf_job = CircleCIJob( "tf", install_steps=[ - "sudo apt-get -y update && sudo apt-get install -y libsndfile1-dev espeak-ng", - "pip install --upgrade pip", - "pip install .[sklearn,tf-cpu,testing,sentencepiece,tf-speech,vision]", - "pip install tensorflow_probability", + "sudo apt-get -y update && sudo apt-get install -y libsndfile1-dev espeak-ng cmake", + "pip install --upgrade --upgrade-strategy eager pip", + "pip install -U --upgrade-strategy eager .[sklearn,tf-cpu,testing,sentencepiece,tf-speech,vision]", + "pip install -U --upgrade-strategy eager tensorflow_probability", ], parallelism=1, - pytest_options={"rA": None}, ) @@ -229,35 +337,36 @@ def job_name(self): "flax", install_steps=[ "sudo apt-get -y update && sudo apt-get install -y libsndfile1-dev espeak-ng", - "pip install --upgrade pip", - "pip install .[flax,testing,sentencepiece,flax-speech,vision]", + "pip install --upgrade --upgrade-strategy eager pip", + "pip install -U --upgrade-strategy eager .[flax,testing,sentencepiece,flax-speech,vision]", ], parallelism=1, - pytest_options={"rA": None}, ) pipelines_torch_job = CircleCIJob( "pipelines_torch", + additional_env={"RUN_PIPELINE_TESTS": True}, install_steps=[ "sudo apt-get -y update && sudo apt-get install -y libsndfile1-dev espeak-ng", - "pip install --upgrade pip", - "pip install .[sklearn,torch,testing,sentencepiece,torch-speech,vision,timm,video]", + "pip install --upgrade --upgrade-strategy eager pip", + "pip install -U --upgrade-strategy eager .[sklearn,torch,testing,sentencepiece,torch-speech,vision,timm,video]", ], - pytest_options={"rA": None}, - tests_to_run="tests/pipelines/" + marker="is_pipeline_test", + pytest_num_workers=6, ) pipelines_tf_job = CircleCIJob( "pipelines_tf", + additional_env={"RUN_PIPELINE_TESTS": True}, 
install_steps=[ - "pip install --upgrade pip", - "pip install .[sklearn,tf-cpu,testing,sentencepiece,vision]", - "pip install tensorflow_probability", + "sudo apt-get -y update && sudo apt-get install -y cmake", + "pip install --upgrade --upgrade-strategy eager pip", + "pip install -U --upgrade-strategy eager .[sklearn,tf-cpu,testing,sentencepiece,vision]", + "pip install -U --upgrade-strategy eager tensorflow_probability", ], - pytest_options={"rA": None}, - tests_to_run="tests/pipelines/" + marker="is_pipeline_test", ) @@ -276,8 +385,8 @@ def job_name(self): "sudo cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr/local\n" "sudo make install\n", }, - "pip install --upgrade pip", - "pip install .[ja,testing,sentencepiece,jieba,spacy,ftfy,rjieba]", + "pip install --upgrade --upgrade-strategy eager pip", + "pip install -U --upgrade-strategy eager .[ja,testing,sentencepiece,jieba,spacy,ftfy,rjieba]", "python -m unidic download", ], parallelism=None, @@ -292,14 +401,16 @@ def job_name(self): examples_torch_job = CircleCIJob( "examples_torch", + additional_env={"OMP_NUM_THREADS": 8}, cache_name="torch_examples", install_steps=[ "sudo apt-get -y update && sudo apt-get install -y libsndfile1-dev espeak-ng", - "pip install --upgrade pip", - "pip install .[sklearn,torch,sentencepiece,testing,torch-speech]", - "pip install -r examples/pytorch/_tests_requirements.txt", + "pip install --upgrade --upgrade-strategy eager pip", + "pip install -U --upgrade-strategy eager .[sklearn,torch,sentencepiece,testing,torch-speech]", + "pip install -U --upgrade-strategy eager -r examples/pytorch/_tests_requirements.txt", + "pip install -U --upgrade-strategy eager -e git+https://github.com/huggingface/accelerate@main#egg=accelerate", ], - tests_to_run="./examples/pytorch/", + pytest_num_workers=1, ) @@ -307,11 +418,11 @@ def job_name(self): "examples_tensorflow", cache_name="tensorflow_examples", install_steps=[ - "pip install --upgrade pip", - "pip install .[sklearn,tensorflow,sentencepiece,testing]", - "pip install -r examples/tensorflow/_tests_requirements.txt", + "sudo apt-get -y update && sudo apt-get install -y cmake", + "pip install --upgrade --upgrade-strategy eager pip", + "pip install -U --upgrade-strategy eager .[sklearn,tensorflow,sentencepiece,testing]", + "pip install -U --upgrade-strategy eager -r examples/tensorflow/_tests_requirements.txt", ], - tests_to_run="./examples/tensorflow/", ) @@ -319,22 +430,22 @@ def job_name(self): "examples_flax", cache_name="flax_examples", install_steps=[ - "pip install --upgrade pip", - "pip install .[flax,testing,sentencepiece]", - "pip install -r examples/flax/_tests_requirements.txt", + "pip install --upgrade --upgrade-strategy eager pip", + "pip install -U --upgrade-strategy eager .[flax,testing,sentencepiece]", + "pip install -U --upgrade-strategy eager -r examples/flax/_tests_requirements.txt", ], - tests_to_run="./examples/flax/", ) hub_job = CircleCIJob( "hub", + additional_env={"HUGGINGFACE_CO_STAGING": True}, install_steps=[ "sudo apt-get -y update && sudo apt-get install git-lfs", 'git config --global user.email "ci@dummy.com"', 'git config --global user.name "ci"', - "pip install --upgrade pip", - "pip install .[torch,sentencepiece,testing]", + "pip install --upgrade --upgrade-strategy eager pip", + "pip install -U --upgrade-strategy eager .[torch,sentencepiece,testing,vision]", ], marker="is_staging_test", pytest_num_workers=1, @@ -344,8 +455,9 @@ def job_name(self): onnx_job = CircleCIJob( "onnx", install_steps=[ - "pip install --upgrade 
pip", - "pip install .[torch,tf,testing,sentencepiece,onnxruntime,vision,rjieba]", + "sudo apt-get -y update && sudo apt-get install -y cmake", + "pip install --upgrade --upgrade-strategy eager pip", + "pip install -U --upgrade-strategy eager .[torch,tf,testing,sentencepiece,onnxruntime,vision,rjieba]", ], pytest_options={"k onnx": None}, pytest_num_workers=1, @@ -356,19 +468,24 @@ def job_name(self): "exotic_models", install_steps=[ "sudo apt-get -y update && sudo apt-get install -y libsndfile1-dev", - "pip install --upgrade pip", - "pip install .[torch,testing,vision]", - "pip install torchvision", - "pip install scipy", - "pip install 'git+https://github.com/facebookresearch/detectron2.git'", + "pip install --upgrade --upgrade-strategy eager pip", + "pip install -U --upgrade-strategy eager .[torch,testing,vision]", + "pip install -U --upgrade-strategy eager torchvision", + "pip install -U --upgrade-strategy eager scipy", + "pip install -U --upgrade-strategy eager 'git+https://github.com/facebookresearch/detectron2.git'", "sudo apt install tesseract-ocr", - "pip install pytesseract", - "pip install natten", + "pip install -U --upgrade-strategy eager pytesseract", + "pip install -U --upgrade-strategy eager natten==0.15.1+torch210cpu -f https://shi-labs.com/natten/wheels", + "pip install -U --upgrade-strategy eager python-Levenshtein", + "pip install -U --upgrade-strategy eager opencv-python", + "pip install -U --upgrade-strategy eager nltk", + "pip uninstall -y torch torchvision torchaudio && pip install -U --upgrade-strategy eager 'torch<2.2.0' 'torchvision<0.17' 'torchaudio<2.2.0'" ], tests_to_run=[ "tests/models/*layoutlmv*", "tests/models/*nat", "tests/models/deta", + "tests/models/nougat", ], pytest_num_workers=1, pytest_options={"durations": 100}, @@ -378,15 +495,60 @@ def job_name(self): repo_utils_job = CircleCIJob( "repo_utils", install_steps=[ - "pip install --upgrade pip", - "pip install .[quality,testing]", + "pip install --upgrade --upgrade-strategy eager pip", + "pip install -U --upgrade-strategy eager .[quality,testing,torch]", ], parallelism=None, pytest_num_workers=1, - resource_class=None, + resource_class="large", tests_to_run="tests/repo_utils", ) + +# We also include a `dummy.py` file in the files to be doc-tested to prevent edge case failure. Otherwise, the pytest +# hangs forever during test collection while showing `collecting 0 items / 21 errors`. (To see this, we have to remove +# the bash output redirection.) 
+py_command = 'from utils.tests_fetcher import get_doctest_files; to_test = get_doctest_files() + ["dummy.py"]; to_test = " ".join(to_test); print(to_test)' +py_command = f"$(python3 -c '{py_command}')" +command = f'echo "{py_command}" > pr_documentation_tests_temp.txt' +doc_test_job = CircleCIJob( + "pr_documentation_tests", + additional_env={"TRANSFORMERS_VERBOSITY": "error", "DATASETS_VERBOSITY": "error", "SKIP_CUDA_DOCTEST": "1"}, + install_steps=[ + "sudo apt-get -y update && sudo apt-get install -y libsndfile1-dev espeak-ng time ffmpeg", + "pip install --upgrade --upgrade-strategy eager pip", + "pip install -U --upgrade-strategy eager -e .[dev]", + "pip install -U --upgrade-strategy eager -e git+https://github.com/huggingface/accelerate@main#egg=accelerate", + "pip install --upgrade --upgrade-strategy eager 'pytest<8.0.0' pytest-sugar", + "pip install -U --upgrade-strategy eager natten==0.15.1+torch210cpu -f https://shi-labs.com/natten/wheels", + "pip install -U --upgrade-strategy eager g2p-en", + # TODO: remove this one after fixing the dependency issue(s) above + "pip install -U --upgrade-strategy eager torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu", + "find -name __pycache__ -delete", + "find . -name \*.pyc -delete", + # Add an empty file to keep the test step running correctly even if no file is selected to be tested. + "touch dummy.py", + { + "name": "Get files to test", + "command": command, + }, + { + "name": "Show information in `Get files to test`", + "command": + "cat pr_documentation_tests_temp.txt" + }, + { + "name": "Get the last line in `pr_documentation_tests.txt`", + "command": + "tail -n1 pr_documentation_tests_temp.txt | tee pr_documentation_tests.txt" + }, + ], + tests_to_run="$(cat pr_documentation_tests.txt)", # noqa + pytest_options={"-doctest-modules": None, "doctest-glob": "*.md", "dist": "loadfile", "rvsA": None}, + command_timeout=1200, # test cannot run longer than 1200 seconds + pytest_num_workers=1, +) + REGULAR_TESTS = [ torch_and_tf_job, torch_and_flax_job, @@ -408,6 +570,8 @@ def job_name(self): pipelines_tf_job, ] REPO_UTIL_TESTS = [repo_utils_job] +DOC_TESTS = [doc_test_job] + + def create_circleci_config(folder=None): if folder is None: @@ -433,25 +597,73 @@ def create_circleci_config(folder=None): if len(test_list) > 0: jobs.extend(REGULAR_TESTS) + extended_tests_to_run = set(test_list.split()) + # Extend the test files for cross test jobs + for job in jobs: + if job.job_name in ["tests_torch_and_tf", "tests_torch_and_flax"]: + for test_path in copy.copy(extended_tests_to_run): + dir_path, fn = os.path.split(test_path) + if fn.startswith("test_modeling_tf_"): + fn = fn.replace("test_modeling_tf_", "test_modeling_") + elif fn.startswith("test_modeling_flax_"): + fn = fn.replace("test_modeling_flax_", "test_modeling_") + else: + if job.job_name == "test_torch_and_tf": + fn = fn.replace("test_modeling_", "test_modeling_tf_") + elif job.job_name == "test_torch_and_flax": + fn = fn.replace("test_modeling_", "test_modeling_flax_") + new_test_file = str(os.path.join(dir_path, fn)) + if os.path.isfile(new_test_file): + if new_test_file not in extended_tests_to_run: + extended_tests_to_run.add(new_test_file) + extended_tests_to_run = sorted(extended_tests_to_run) + for job in jobs: + if job.job_name in ["tests_torch_and_tf", "tests_torch_and_flax"]: + job.tests_to_run = extended_tests_to_run + fn = "filtered_test_list_cross_tests.txt" + f_path = os.path.join(folder, fn) + with open(f_path, "w") as fp: + fp.write(" 
".join(extended_tests_to_run)) + example_file = os.path.join(folder, "examples_test_list.txt") if os.path.exists(example_file) and os.path.getsize(example_file) > 0: - jobs.extend(EXAMPLES_TESTS) + with open(example_file, "r", encoding="utf-8") as f: + example_tests = f.read() + for job in EXAMPLES_TESTS: + framework = job.name.replace("examples_", "").replace("torch", "pytorch") + if example_tests == "all": + job.tests_to_run = [f"examples/{framework}"] + else: + job.tests_to_run = [f for f in example_tests.split(" ") if f.startswith(f"examples/{framework}")] + + if len(job.tests_to_run) > 0: + jobs.append(job) + + doctest_file = os.path.join(folder, "doctest_list.txt") + if os.path.exists(doctest_file): + with open(doctest_file) as f: + doctest_list = f.read() + else: + doctest_list = [] + if len(doctest_list) > 0: + jobs.extend(DOC_TESTS) repo_util_file = os.path.join(folder, "test_repo_utils.txt") if os.path.exists(repo_util_file) and os.path.getsize(repo_util_file) > 0: jobs.extend(REPO_UTIL_TESTS) - if len(jobs) > 0: - config = {"version": "2.1"} - config["parameters"] = { - # Only used to accept the parameters from the trigger - "nightly": {"type": "boolean", "default": False}, - "tests_to_run": {"type": "string", "default": test_list}, - } - config["jobs"] = {j.job_name: j.to_dict() for j in jobs} - config["workflows"] = {"version": 2, "run_tests": {"jobs": [j.job_name for j in jobs]}} - with open(os.path.join(folder, "generated_config.yml"), "w") as f: - f.write(yaml.dump(config, indent=2, width=1000000, sort_keys=False)) + if len(jobs) == 0: + jobs = [EmptyJob()] + config = {"version": "2.1"} + config["parameters"] = { + # Only used to accept the parameters from the trigger + "nightly": {"type": "boolean", "default": False}, + "tests_to_run": {"type": "string", "default": test_list}, + } + config["jobs"] = {j.job_name: j.to_dict() for j in jobs} + config["workflows"] = {"version": 2, "run_tests": {"jobs": [j.job_name for j in jobs]}} + with open(os.path.join(folder, "generated_config.yml"), "w") as f: + f.write(yaml.dump(config, indent=2, width=1000000, sort_keys=False)) if __name__ == "__main__": diff --git a/.github/ISSUE_TEMPLATE/bug-report.yml b/.github/ISSUE_TEMPLATE/bug-report.yml index b8d0f312216d11..1ec76462acfdff 100644 --- a/.github/ISSUE_TEMPLATE/bug-report.yml +++ b/.github/ISSUE_TEMPLATE/bug-report.yml @@ -37,15 +37,16 @@ body: - pipelines: @Narsil - tensorflow: @gante and @Rocketknight1 - tokenizers: @ArthurZucker - - trainer: @sgugger + - trainer: @muellerzr and @pacman100 Integrations: - - deepspeed: HF Trainer: @stas00, Accelerate: @pacman100 + - deepspeed: HF Trainer/Accelerate: @pacman100 - ray/raytune: @richardliaw, @amogkam - - Big Model Inference: @sgugger @muellerzr + - Big Model Inference: @SunMarc + - quantization (bitsandbytes, autogpt): @SunMarc and @younesbelkada - Documentation: @sgugger, @stevhliu and @MKhalusova + Documentation: @stevhliu and @MKhalusova Model hub: @@ -61,7 +62,7 @@ body: Maintained examples (not research project or legacy): - Flax: @sanchit-gandhi - - PyTorch: @sgugger + - PyTorch: See Models above and tag the person corresponding to the modality of the example. - TensorFlow: @Rocketknight1 Research projects are not maintained and should be taken as is. diff --git a/.github/ISSUE_TEMPLATE/i18n.md b/.github/ISSUE_TEMPLATE/i18n.md index 39d369a25324e5..52667f930508a6 100644 --- a/.github/ISSUE_TEMPLATE/i18n.md +++ b/.github/ISSUE_TEMPLATE/i18n.md @@ -23,23 +23,23 @@ Some notes: * Please translate in a gender-neutral way. 
* Add your translations to the folder called `` inside the [source folder](https://github.com/huggingface/transformers/tree/main/docs/source). * Register your translation in `/_toctree.yml`; please follow the order of the [English version](https://github.com/huggingface/transformers/blob/main/docs/source/en/_toctree.yml). -* Once you're finished, open a pull request and tag this issue by including #issue-number in the description, where issue-number is the number of this issue. Please ping @ArthurZucker, @sgugger for review. +* Once you're finished, open a pull request and tag this issue by including #issue-number in the description, where issue-number is the number of this issue. Please ping @stevhliu and @MKhalusova for review. * 🙋 If you'd like others to help you with the translation, you can also post in the 🤗 [forums](https://discuss.huggingface.co/). ## Get Started section -- [ ] [index.mdx](https://github.com/huggingface/transformers/blob/main/docs/source/en/index.mdx) https://github.com/huggingface/transformers/pull/20180 -- [ ] [quicktour.mdx](https://github.com/huggingface/transformers/blob/main/docs/source/en/quicktour.mdx) (waiting for initial PR to go through) -- [ ] [installation.mdx](https://github.com/huggingface/transformers/blob/main/docs/source/en/installation.mdx). +- [ ] [index.md](https://github.com/huggingface/transformers/blob/main/docs/source/en/index.md) https://github.com/huggingface/transformers/pull/20180 +- [ ] [quicktour.md](https://github.com/huggingface/transformers/blob/main/docs/source/en/quicktour.md) (waiting for initial PR to go through) +- [ ] [installation.md](https://github.com/huggingface/transformers/blob/main/docs/source/en/installation.md). ## Tutorial section -- [ ] [pipeline_tutorial.mdx](https://github.com/huggingface/transformers/blob/main/docs/source/en/pipeline_tutorial.mdx) -- [ ] [autoclass_tutorial.mdx](https://github.com/huggingface/transformers/blob/master/docs/source/autoclass_tutorial.mdx) -- [ ] [preprocessing.mdx](https://github.com/huggingface/transformers/blob/main/docs/source/en/preprocessing.mdx) -- [ ] [training.mdx](https://github.com/huggingface/transformers/blob/main/docs/source/en/training.mdx) -- [ ] [accelerate.mdx](https://github.com/huggingface/transformers/blob/main/docs/source/en/accelerate.mdx) -- [ ] [model_sharing.mdx](https://github.com/huggingface/transformers/blob/main/docs/source/en/model_sharing.mdx) -- [ ] [multilingual.mdx](https://github.com/huggingface/transformers/blob/main/docs/source/en/multilingual.mdx) +- [ ] [pipeline_tutorial.md](https://github.com/huggingface/transformers/blob/main/docs/source/en/pipeline_tutorial.md) +- [ ] [autoclass_tutorial.md](https://github.com/huggingface/transformers/blob/master/docs/source/autoclass_tutorial.md) +- [ ] [preprocessing.md](https://github.com/huggingface/transformers/blob/main/docs/source/en/preprocessing.md) +- [ ] [training.md](https://github.com/huggingface/transformers/blob/main/docs/source/en/training.md) +- [ ] [accelerate.md](https://github.com/huggingface/transformers/blob/main/docs/source/en/accelerate.md) +- [ ] [model_sharing.md](https://github.com/huggingface/transformers/blob/main/docs/source/en/model_sharing.md) +- [ ] [multilingual.md](https://github.com/huggingface/transformers/blob/main/docs/source/en/multilingual.md) diff --git a/.github/conda/meta.yaml b/.github/conda/meta.yaml index 6f6b680d1e6b04..89dc353b127774 100644 --- a/.github/conda/meta.yaml +++ b/.github/conda/meta.yaml @@ -16,7 +16,6 @@ requirements: - pip - numpy >=1.17 - 
dataclasses - - importlib_metadata - huggingface_hub - packaging - filelock @@ -27,11 +26,12 @@ requirements: - protobuf - tokenizers >=0.11.1,!=0.11.3,<0.13 - pyyaml >=5.1 + - safetensors + - fsspec run: - python - numpy >=1.17 - dataclasses - - importlib_metadata - huggingface_hub - packaging - filelock @@ -42,6 +42,8 @@ requirements: - protobuf - tokenizers >=0.11.1,!=0.11.3,<0.13 - pyyaml >=5.1 + - safetensors + - fsspec test: imports: diff --git a/.github/workflows/TROUBLESHOOT.md b/.github/workflows/TROUBLESHOOT.md index 616ba8e55bd208..f6101e6d70b59a 100644 --- a/.github/workflows/TROUBLESHOOT.md +++ b/.github/workflows/TROUBLESHOOT.md @@ -1,6 +1,6 @@ # Troubleshooting -This is a document explaining how to deal with various issues on github-actions self-hosted CI. The entries may include actually solutions or pointers to Issues that cover those. +This is a document explaining how to deal with various issues on github-actions self-hosted CI. The entries may include actual solutions or pointers to Issues that cover those. ## GitHub Actions (self-hosted CI) diff --git a/.github/workflows/add-model-like.yml b/.github/workflows/add-model-like.yml index 3ea3c89249fe20..8bdd66e4466d62 100644 --- a/.github/workflows/add-model-like.yml +++ b/.github/workflows/add-model-like.yml @@ -3,18 +3,18 @@ name: Add model like runner on: push: branches: - - main - pull_request: - paths: - - "src/**" - - "tests/**" - - ".github/**" - types: [opened, synchronize, reopened] + - none # put main here when this is fixed + #pull_request: + # paths: + # - "src/**" + # - "tests/**" + # - ".github/**" + # types: [opened, synchronize, reopened] jobs: run_tests_templates_like: name: "Add new model like template tests" - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 steps: - uses: actions/checkout@v3 diff --git a/.github/workflows/build-docker-images.yml b/.github/workflows/build-docker-images.yml index 03ecf450264de0..be070a95d3a94f 100644 --- a/.github/workflows/build-docker-images.yml +++ b/.github/workflows/build-docker-images.yml @@ -3,7 +3,7 @@ name: Build docker images (scheduled) on: push: branches: - - docker-image* + - build_ci_docker_image* repository_dispatch: workflow_call: inputs: @@ -11,7 +11,7 @@ on: required: true type: string schedule: - - cron: "0 1 * * *" + - cron: "17 0 * * *" concurrency: group: docker-images-builds @@ -20,23 +20,33 @@ concurrency: jobs: latest-docker: name: "Latest PyTorch + TensorFlow [dev]" - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 steps: + - name: Cleanup disk + run: | + sudo ls -l /usr/local/lib/ + sudo ls -l /usr/share/ + sudo du -sh /usr/local/lib/ + sudo du -sh /usr/share/ + sudo rm -rf /usr/local/lib/android + sudo rm -rf /usr/share/dotnet + sudo du -sh /usr/local/lib/ + sudo du -sh /usr/share/ - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v2 + uses: docker/setup-buildx-action@v3 - name: Check out code uses: actions/checkout@v3 - name: Login to DockerHub - uses: docker/login-action@v2 + uses: docker/login-action@v3 with: username: ${{ secrets.DOCKERHUB_USERNAME }} password: ${{ secrets.DOCKERHUB_PASSWORD }} - name: Build and push - uses: docker/build-push-action@v3 + uses: docker/build-push-action@v5 with: context: ./docker/transformers-all-latest-gpu build-args: | @@ -49,7 +59,7 @@ jobs: # This condition allows `schedule` events, or `push` events that trigger this workflow NOT via `workflow_call`. # The later case is useful for manual image building for debugging purpose. Use another tag in this case! 
if: inputs.image_postfix != '-push-ci' - uses: docker/build-push-action@v3 + uses: docker/build-push-action@v5 with: context: ./docker/transformers-all-latest-gpu build-args: | @@ -57,54 +67,35 @@ jobs: push: true tags: huggingface/transformers-all-latest-gpu-push-ci - latest-with-torch-nightly-docker: - name: "Nightly PyTorch + Stable TensorFlow" - # Push CI doesn't need this image - if: inputs.image_postfix != '-push-ci' - runs-on: ubuntu-latest - steps: - - - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v2 - - - name: Check out code - uses: actions/checkout@v3 - - - name: Login to DockerHub - uses: docker/login-action@v2 - with: - username: ${{ secrets.DOCKERHUB_USERNAME }} - password: ${{ secrets.DOCKERHUB_PASSWORD }} - - - name: Build and push - uses: docker/build-push-action@v3 - with: - context: ./docker/transformers-all-latest-gpu - build-args: | - REF=main - PYTORCH=pre - push: true - tags: huggingface/transformers-all-latest-torch-nightly-gpu - latest-torch-deepspeed-docker: name: "Latest PyTorch + DeepSpeed" - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 steps: + - name: Cleanup disk + run: | + sudo ls -l /usr/local/lib/ + sudo ls -l /usr/share/ + sudo du -sh /usr/local/lib/ + sudo du -sh /usr/share/ + sudo rm -rf /usr/local/lib/android + sudo rm -rf /usr/share/dotnet + sudo du -sh /usr/local/lib/ + sudo du -sh /usr/share/ - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v2 + uses: docker/setup-buildx-action@v3 - name: Check out code uses: actions/checkout@v3 - name: Login to DockerHub - uses: docker/login-action@v2 + uses: docker/login-action@v3 with: username: ${{ secrets.DOCKERHUB_USERNAME }} password: ${{ secrets.DOCKERHUB_PASSWORD }} - name: Build and push - uses: docker/build-push-action@v3 + uses: docker/build-push-action@v5 with: context: ./docker/transformers-pytorch-deepspeed-latest-gpu build-args: | @@ -115,17 +106,27 @@ jobs: # Can't build 2 images in a single job `latest-torch-deepspeed-docker` (for `nvcr.io/nvidia`) latest-torch-deepspeed-docker-for-push-ci-daily-build: name: "Latest PyTorch + DeepSpeed (Push CI - Daily Build)" - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 steps: + - name: Cleanup disk + run: | + sudo ls -l /usr/local/lib/ + sudo ls -l /usr/share/ + sudo du -sh /usr/local/lib/ + sudo du -sh /usr/share/ + sudo rm -rf /usr/local/lib/android + sudo rm -rf /usr/share/dotnet + sudo du -sh /usr/local/lib/ + sudo du -sh /usr/share/ - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v2 + uses: docker/setup-buildx-action@v3 - name: Check out code uses: actions/checkout@v3 - name: Login to DockerHub - uses: docker/login-action@v2 + uses: docker/login-action@v3 with: username: ${{ secrets.DOCKERHUB_USERNAME }} password: ${{ secrets.DOCKERHUB_PASSWORD }} @@ -135,7 +136,7 @@ jobs: # This condition allows `schedule` events, or `push` events that trigger this workflow NOT via `workflow_call`. # The later case is useful for manual image building for debugging purpose. Use another tag in this case! 
if: inputs.image_postfix != '-push-ci' - uses: docker/build-push-action@v3 + uses: docker/build-push-action@v5 with: context: ./docker/transformers-pytorch-deepspeed-latest-gpu build-args: | @@ -143,55 +144,27 @@ jobs: push: true tags: huggingface/transformers-pytorch-deepspeed-latest-gpu-push-ci - nightly-torch-deepspeed-docker: - name: "Nightly PyTorch + DeepSpeed" - # Push CI doesn't need this image - if: inputs.image_postfix != '-push-ci' - runs-on: ubuntu-latest - steps: - - - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v2 - - - name: Check out code - uses: actions/checkout@v3 - - - name: Login to DockerHub - uses: docker/login-action@v2 - with: - username: ${{ secrets.DOCKERHUB_USERNAME }} - password: ${{ secrets.DOCKERHUB_PASSWORD }} - - - name: Build and push - uses: docker/build-push-action@v3 - with: - context: ./docker/transformers-pytorch-deepspeed-nightly-gpu - build-args: | - REF=main - push: true - tags: huggingface/transformers-pytorch-deepspeed-nightly-gpu - doc-builder: name: "Doc builder" # Push CI doesn't need this image if: inputs.image_postfix != '-push-ci' - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 steps: - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v2 + uses: docker/setup-buildx-action@v3 - name: Check out code uses: actions/checkout@v3 - name: Login to DockerHub - uses: docker/login-action@v2 + uses: docker/login-action@v3 with: username: ${{ secrets.DOCKERHUB_USERNAME }} password: ${{ secrets.DOCKERHUB_PASSWORD }} - name: Build and push - uses: docker/build-push-action@v3 + uses: docker/build-push-action@v5 with: context: ./docker/transformers-doc-builder push: true @@ -201,23 +174,33 @@ jobs: name: "Latest PyTorch [dev]" # Push CI doesn't need this image if: inputs.image_postfix != '-push-ci' - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 steps: + - name: Cleanup disk + run: | + sudo ls -l /usr/local/lib/ + sudo ls -l /usr/share/ + sudo du -sh /usr/local/lib/ + sudo du -sh /usr/share/ + sudo rm -rf /usr/local/lib/android + sudo rm -rf /usr/share/dotnet + sudo du -sh /usr/local/lib/ + sudo du -sh /usr/share/ - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v2 + uses: docker/setup-buildx-action@v3 - name: Check out code uses: actions/checkout@v3 - name: Login to DockerHub - uses: docker/login-action@v2 + uses: docker/login-action@v3 with: username: ${{ secrets.DOCKERHUB_USERNAME }} password: ${{ secrets.DOCKERHUB_PASSWORD }} - name: Build and push - uses: docker/build-push-action@v3 + uses: docker/build-push-action@v5 with: context: ./docker/transformers-pytorch-gpu build-args: | @@ -225,30 +208,102 @@ jobs: push: true tags: huggingface/transformers-pytorch-gpu +# Need to be fixed with the help from Guillaume. 
+# latest-pytorch-amd: +# name: "Latest PyTorch (AMD) [dev]" +# runs-on: [self-hosted, docker-gpu, amd-gpu, single-gpu, mi210] +# steps: +# - name: Set up Docker Buildx +# uses: docker/setup-buildx-action@v3 +# - name: Check out code +# uses: actions/checkout@v3 +# - name: Login to DockerHub +# uses: docker/login-action@v3 +# with: +# username: ${{ secrets.DOCKERHUB_USERNAME }} +# password: ${{ secrets.DOCKERHUB_PASSWORD }} +# - name: Build and push +# uses: docker/build-push-action@v5 +# with: +# context: ./docker/transformers-pytorch-amd-gpu +# build-args: | +# REF=main +# push: true +# tags: huggingface/transformers-pytorch-amd-gpu${{ inputs.image_postfix }} +# # Push CI images still need to be re-built daily +# - +# name: Build and push (for Push CI) in a daily basis +# # This condition allows `schedule` events, or `push` events that trigger this workflow NOT via `workflow_call`. +# # The later case is useful for manual image building for debugging purpose. Use another tag in this case! +# if: inputs.image_postfix != '-push-ci' +# uses: docker/build-push-action@v5 +# with: +# context: ./docker/transformers-pytorch-amd-gpu +# build-args: | +# REF=main +# push: true +# tags: huggingface/transformers-pytorch-amd-gpu-push-ci + latest-tensorflow: name: "Latest TensorFlow [dev]" # Push CI doesn't need this image if: inputs.image_postfix != '-push-ci' - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 steps: - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v2 + uses: docker/setup-buildx-action@v3 - name: Check out code uses: actions/checkout@v3 - name: Login to DockerHub - uses: docker/login-action@v2 + uses: docker/login-action@v3 with: username: ${{ secrets.DOCKERHUB_USERNAME }} password: ${{ secrets.DOCKERHUB_PASSWORD }} - name: Build and push - uses: docker/build-push-action@v3 + uses: docker/build-push-action@v5 with: context: ./docker/transformers-tensorflow-gpu build-args: | REF=main push: true tags: huggingface/transformers-tensorflow-gpu + + # latest-pytorch-deepspeed-amd: + # name: "PyTorch + DeepSpeed (AMD) [dev]" + + # runs-on: [self-hosted, docker-gpu, amd-gpu, single-gpu, mi210] + # steps: + # - name: Set up Docker Buildx + # uses: docker/setup-buildx-action@v3 + # - name: Check out code + # uses: actions/checkout@v3 + # - name: Login to DockerHub + # uses: docker/login-action@v3 + # with: + # username: ${{ secrets.DOCKERHUB_USERNAME }} + # password: ${{ secrets.DOCKERHUB_PASSWORD }} + # - name: Build and push + # uses: docker/build-push-action@v5 + # with: + # context: ./docker/transformers-pytorch-deepspeed-amd-gpu + # build-args: | + # REF=main + # push: true + # tags: huggingface/transformers-pytorch-deepspeed-amd-gpu${{ inputs.image_postfix }} + # # Push CI images still need to be re-built daily + # - + # name: Build and push (for Push CI) in a daily basis + # # This condition allows `schedule` events, or `push` events that trigger this workflow NOT via `workflow_call`. + # # The later case is useful for manual image building for debugging purpose. Use another tag in this case! 
+ # if: inputs.image_postfix != '-push-ci' + # uses: docker/build-push-action@v5 + # with: + # context: ./docker/transformers-pytorch-deepspeed-amd-gpu + # build-args: | + # REF=main + # push: true + # tags: huggingface/transformers-pytorch-deepspeed-amd-gpu-push-ci diff --git a/.github/workflows/build-nightly-ci-docker-images.yml b/.github/workflows/build-nightly-ci-docker-images.yml new file mode 100644 index 00000000000000..63bc7daa743425 --- /dev/null +++ b/.github/workflows/build-nightly-ci-docker-images.yml @@ -0,0 +1,85 @@ +name: Build docker images (Nightly CI) + +on: + workflow_call: + push: + branches: + - build_nightly_ci_docker_image* + +concurrency: + group: docker-images-builds + cancel-in-progress: false + +jobs: + latest-with-torch-nightly-docker: + name: "Nightly PyTorch + Stable TensorFlow" + runs-on: ubuntu-22.04 + steps: + - name: Cleanup disk + run: | + sudo ls -l /usr/local/lib/ + sudo ls -l /usr/share/ + sudo du -sh /usr/local/lib/ + sudo du -sh /usr/share/ + sudo rm -rf /usr/local/lib/android + sudo rm -rf /usr/share/dotnet + sudo du -sh /usr/local/lib/ + sudo du -sh /usr/share/ + - + name: Set up Docker Buildx + uses: docker/setup-buildx-action@v2 + - + name: Check out code + uses: actions/checkout@v3 + - + name: Login to DockerHub + uses: docker/login-action@v2 + with: + username: ${{ secrets.DOCKERHUB_USERNAME }} + password: ${{ secrets.DOCKERHUB_PASSWORD }} + - + name: Build and push + uses: docker/build-push-action@v3 + with: + context: ./docker/transformers-all-latest-gpu + build-args: | + REF=main + PYTORCH=pre + push: true + tags: huggingface/transformers-all-latest-torch-nightly-gpu + + nightly-torch-deepspeed-docker: + name: "Nightly PyTorch + DeepSpeed" + runs-on: ubuntu-22.04 + steps: + - name: Cleanup disk + run: | + sudo ls -l /usr/local/lib/ + sudo ls -l /usr/share/ + sudo du -sh /usr/local/lib/ + sudo du -sh /usr/share/ + sudo rm -rf /usr/local/lib/android + sudo rm -rf /usr/share/dotnet + sudo du -sh /usr/local/lib/ + sudo du -sh /usr/share/ + - + name: Set up Docker Buildx + uses: docker/setup-buildx-action@v2 + - + name: Check out code + uses: actions/checkout@v3 + - + name: Login to DockerHub + uses: docker/login-action@v2 + with: + username: ${{ secrets.DOCKERHUB_USERNAME }} + password: ${{ secrets.DOCKERHUB_PASSWORD }} + - + name: Build and push + uses: docker/build-push-action@v3 + with: + context: ./docker/transformers-pytorch-deepspeed-nightly-gpu + build-args: | + REF=main + push: true + tags: huggingface/transformers-pytorch-deepspeed-nightly-gpu \ No newline at end of file diff --git a/.github/workflows/build-past-ci-docker-images.yml b/.github/workflows/build-past-ci-docker-images.yml index 3a0e1612454c58..302386b6851b18 100644 --- a/.github/workflows/build-past-ci-docker-images.yml +++ b/.github/workflows/build-past-ci-docker-images.yml @@ -3,7 +3,7 @@ name: Build docker images (Past CI) on: push: branches: - - past-ci-docker-image* + - build_past_ci_docker_image* concurrency: group: docker-images-builds @@ -15,8 +15,8 @@ jobs: strategy: fail-fast: false matrix: - version: ["1.11", "1.10", "1.9", "1.8", "1.7", "1.6", "1.5", "1.4"] - runs-on: ubuntu-latest + version: ["1.13", "1.12", "1.11"] + runs-on: ubuntu-22.04 steps: - name: Set up Docker Buildx @@ -24,6 +24,17 @@ jobs: - name: Check out code uses: actions/checkout@v3 + - + id: get-base-image + name: Get Base Image + env: + framework_version: ${{ matrix.version }} + run: | + echo "base_image=$(python3 -c 'import os; from utils.past_ci_versions import past_versions_testing; 
base_image = past_versions_testing["pytorch"][os.environ["framework_version"]]["base_image"]; print(base_image)')" >> $GITHUB_OUTPUT + - + name: Print Base Image + run: | + echo ${{ steps.get-base-image.outputs.base_image }} - name: Login to DockerHub uses: docker/login-action@v2 @@ -37,6 +48,7 @@ jobs: context: ./docker/transformers-past-gpu build-args: | REF=main + BASE_DOCKER_IMAGE=${{ steps.get-base-image.outputs.base_image }} FRAMEWORK=pytorch VERSION=${{ matrix.version }} push: true @@ -47,8 +59,8 @@ jobs: strategy: fail-fast: false matrix: - version: ["2.8", "2.7", "2.6", "2.5"] - runs-on: ubuntu-latest + version: ["2.11", "2.10", "2.9", "2.8", "2.7", "2.6", "2.5"] + runs-on: ubuntu-22.04 steps: - name: Set up Docker Buildx @@ -57,37 +69,16 @@ jobs: name: Check out code uses: actions/checkout@v3 - - name: Login to DockerHub - uses: docker/login-action@v2 - with: - username: ${{ secrets.DOCKERHUB_USERNAME }} - password: ${{ secrets.DOCKERHUB_PASSWORD }} - - - name: Build and push - uses: docker/build-push-action@v3 - with: - context: ./docker/transformers-past-gpu - build-args: | - REF=main - FRAMEWORK=tensorflow - VERSION=${{ matrix.version }} - push: true - tags: huggingface/transformers-tensorflow-past-${{ matrix.version }}-gpu - - past-tensorflow-docker-2-4: - name: "Past TensorFlow Docker" - strategy: - fail-fast: false - matrix: - version: ["2.4"] - runs-on: ubuntu-latest - steps: - - - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v2 + id: get-base-image + name: Get Base Image + env: + framework_version: ${{ matrix.version }} + run: | + echo "base_image=$(python3 -c 'import os; from utils.past_ci_versions import past_versions_testing; base_image = past_versions_testing["tensorflow"][os.environ["framework_version"]]["base_image"]; print(base_image)')" >> $GITHUB_OUTPUT - - name: Check out code - uses: actions/checkout@v3 + name: Print Base Image + run: | + echo ${{ steps.get-base-image.outputs.base_image }} - name: Login to DockerHub uses: docker/login-action@v2 @@ -101,8 +92,8 @@ jobs: context: ./docker/transformers-past-gpu build-args: | REF=main - BASE_DOCKER_IMAGE=nvidia/cuda:11.0.3-cudnn8-devel-ubuntu20.04 + BASE_DOCKER_IMAGE=${{ steps.get-base-image.outputs.base_image }} FRAMEWORK=tensorflow VERSION=${{ matrix.version }} push: true - tags: huggingface/transformers-tensorflow-past-${{ matrix.version }}-gpu \ No newline at end of file + tags: huggingface/transformers-tensorflow-past-${{ matrix.version }}-gpu diff --git a/.github/workflows/build_documentation.yml b/.github/workflows/build_documentation.yml index 4e59cfeb9d0d93..99f0f15230a017 100644 --- a/.github/workflows/build_documentation.yml +++ b/.github/workflows/build_documentation.yml @@ -15,6 +15,7 @@ jobs: commit_sha: ${{ github.sha }} package: transformers notebook_folder: transformers_doc - languages: de en es fr it ko pt zh + languages: de en es fr hi it ko pt tr zh ja te secrets: token: ${{ secrets.HUGGINGFACE_PUSH }} + hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }} diff --git a/.github/workflows/build_pr_documentation.yml b/.github/workflows/build_pr_documentation.yml index 640a0cb2f59f2b..f6fa4c8d537cc6 100644 --- a/.github/workflows/build_pr_documentation.yml +++ b/.github/workflows/build_pr_documentation.yml @@ -14,4 +14,4 @@ jobs: commit_sha: ${{ github.event.pull_request.head.sha }} pr_number: ${{ github.event.number }} package: transformers - languages: de en es fr it ko pt zh + languages: de en es fr hi it ko pt tr zh ja te diff --git a/.github/workflows/check_runner_status.yml 
b/.github/workflows/check_runner_status.yml deleted file mode 100644 index 8912e32c94ee8f..00000000000000 --- a/.github/workflows/check_runner_status.yml +++ /dev/null @@ -1,67 +0,0 @@ -name: Self-hosted runner (check runner status) - -# Note that each job's dependencies go into a corresponding docker file. -# -# For example for `run_all_tests_torch_cuda_extensions_gpu` the docker image is -# `huggingface/transformers-pytorch-deepspeed-latest-gpu`, which can be found at -# `docker/transformers-pytorch-deepspeed-latest-gpu/Dockerfile` - -on: - repository_dispatch: - schedule: - # run per hour - - cron: "0 */1 * * *" - -env: - TRANSFORMERS_IS_CI: yes - -jobs: - check_runner_status: - name: Check Runner Status - runs-on: ubuntu-latest - outputs: - offline_runners: ${{ steps.set-offline_runners.outputs.offline_runners }} - steps: - - name: Checkout transformers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Check Runner Status - run: python utils/check_self_hosted_runner.py --target_runners single-gpu-ci-runner-docker,multi-gpu-ci-runner-docker,single-gpu-scheduled-ci-runner-docker,multi-scheduled-scheduled-ci-runner-docker,single-gpu-doctest-ci-runner-docker --token ${{ secrets.ACCESS_REPO_INFO_TOKEN }} - - - id: set-offline_runners - name: Set output for offline runners - if: ${{ always() }} - run: | - offline_runners=$(python3 -c 'fp = open("offline_runners.txt"); failed = fp.read(); fp.close(); print(failed)') - echo "offline_runners=$offline_runners" >> $GITHUB_OUTPUT - - send_results: - name: Send results to webhook - runs-on: ubuntu-latest - needs: check_runner_status - if: ${{ failure() }} - steps: - - name: Preliminary job status - shell: bash - run: | - echo "Runner availability: ${{ needs.check_runner_status.result }}" - - - uses: actions/checkout@v3 - - uses: actions/download-artifact@v3 - - name: Send message to Slack - env: - CI_SLACK_BOT_TOKEN: ${{ secrets.CI_SLACK_BOT_TOKEN }} - CI_SLACK_CHANNEL_ID: ${{ secrets.CI_SLACK_CHANNEL_ID }} - CI_SLACK_CHANNEL_ID_DAILY: ${{ secrets.CI_SLACK_CHANNEL_ID_DAILY }} - CI_SLACK_CHANNEL_DUMMY_TESTS: ${{ secrets.CI_SLACK_CHANNEL_DUMMY_TESTS }} - CI_SLACK_REPORT_CHANNEL_ID: ${{ secrets.CI_SLACK_CHANNEL_ID_DAILY }} - CI_EVENT: runner status check - RUNNER_STATUS: ${{ needs.check_runner_status.result }} - OFFLINE_RUNNERS: ${{ needs.check_runner_status.outputs.offline_runners }} - # We pass `needs.setup.outputs.matrix` as the argument. A processing in `notification_service.py` to change - # `models/bert` to `models_bert` is required, as the artifact names use `_` instead of `/`. 
- run: | - pip install slack_sdk - python utils/notification_service.py diff --git a/.github/workflows/check_tiny_models.yml b/.github/workflows/check_tiny_models.yml new file mode 100644 index 00000000000000..0725bd04a1f2c3 --- /dev/null +++ b/.github/workflows/check_tiny_models.yml @@ -0,0 +1,82 @@ +name: Check Tiny Models + +on: + push: + branches: + - check_tiny_models* + repository_dispatch: + schedule: + - cron: "0 2 * * *" + +env: + TOKEN: ${{ secrets.TRANSFORMERS_HUB_BOT_HF_TOKEN }} + +jobs: + check_tiny_models: + name: Check tiny models + runs-on: ubuntu-22.04 + steps: + - name: Checkout transformers + uses: actions/checkout@v3 + with: + fetch-depth: 2 + + - uses: actions/checkout@v3 + - name: Set up Python 3.8 + uses: actions/setup-python@v4 + with: + # Semantic version range syntax or exact version of a Python version + python-version: '3.8' + # Optional - x64 or x86 architecture, defaults to x64 + architecture: 'x64' + + - name: Install + run: | + sudo apt-get -y update && sudo apt-get install -y libsndfile1-dev espeak-ng cmake + pip install --upgrade pip + python -m pip install -U .[sklearn,torch,testing,sentencepiece,torch-speech,vision,timm,video,tf-cpu] + pip install tensorflow_probability + python -m pip install -U 'natten<0.15.0' + + - name: Create all tiny models (locally) + run: | + python utils/create_dummy_models.py tiny_local_models --all --num_workers 2 + + - name: Local tiny model reports artifacts + if: ${{ always() }} + uses: actions/upload-artifact@v3 + with: + name: tiny_local_model_creation_reports + path: tiny_local_models/reports + + # GitHub-hosted runners have 2-core CPUs + - name: Run pipeline tests against all new (local) tiny models + run: | + OMP_NUM_THREADS=1 TRANSFORMERS_TINY_MODEL_PATH=tiny_local_models python -m pytest --max-worker-restart=0 -n 2 --dist=loadfile -s -rA --make-reports=tests_pipelines tests/models -m is_pipeline_test -k "test_pipeline_" | tee tests_output.txt + + - name: Test suite reports artifacts + if: ${{ always() }} + uses: actions/upload-artifact@v3 + with: + name: tiny_local_model_creation_reports + path: reports/tests_pipelines + + - name: Create + Upload tiny models for new model architecture(s) + run: | + python utils/update_tiny_models.py --num_workers 2 + + - name: Full report + run: cat tiny_models/reports/tiny_model_creation_report.json + + - name: Failure report + run: cat tiny_models/reports/simple_failed_report.txt + + - name: Summary report + run: cat tiny_models/reports/tiny_model_summary.json + + - name: New tiny model creation reports artifacts + if: ${{ always() }} + uses: actions/upload-artifact@v3 + with: + name: tiny_model_creation_reports + path: tiny_models/reports diff --git a/.github/workflows/delete_doc_comment.yml b/.github/workflows/delete_doc_comment.yml deleted file mode 100644 index d894f991ca9ce6..00000000000000 --- a/.github/workflows/delete_doc_comment.yml +++ /dev/null @@ -1,13 +0,0 @@ -name: Delete dev documentation - -on: - pull_request: - types: [ closed ] - - -jobs: - delete: - uses: huggingface/doc-builder/.github/workflows/delete_doc_comment.yml@main - with: - pr_number: ${{ github.event.number }} - package: transformers diff --git a/.github/workflows/doctests.yml b/.github/workflows/doctests.yml index d65698e2a4f345..0384144ceac741 100644 --- a/.github/workflows/doctests.yml +++ b/.github/workflows/doctests.yml @@ -6,7 +6,7 @@ on: - doctest* repository_dispatch: schedule: - - cron: "0 2 * * *" + - cron: "17 2 * * *" env: @@ -20,31 +20,36 @@ env: jobs: run_doctests: - runs-on: 
[self-hosted, doc-tests-gpu] + runs-on: [single-gpu, nvidia-gpu, t4, ci] container: image: huggingface/transformers-all-latest-gpu options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ steps: + - name: uninstall transformers (installed during docker image build) + run: python3 -m pip uninstall -y transformers + - uses: actions/checkout@v3 - name: NVIDIA-SMI run: | nvidia-smi + - name: Install transformers in edit mode + run: python3 -m pip install -e .[flax] + - name: GPU visibility run: | python3 utils/print_env.py - - name: Prepare files for doctests - run: | - python3 utils/prepare_for_doc_test.py src docs + - name: Show installed libraries and their versions + run: pip freeze - - name: Run doctests + - name: Get doctest files run: | - python3 -m pytest -v --make-reports doc_tests_gpu --doctest-modules $(cat utils/documentation_tests.txt) -sv --doctest-continue-on-failure --doctest-glob="*.mdx" + $(python3 -c 'from utils.tests_fetcher import get_all_doctest_files; to_test = get_all_doctest_files(); to_test = " ".join(to_test); fp = open("doc_tests.txt", "w"); fp.write(to_test); fp.close()') - - name: Clean files after doctests + - name: Run doctests run: | - python3 utils/prepare_for_doc_test.py src docs --remove_new_line + python3 -m pytest -v --make-reports doc_tests_gpu --doctest-modules $(cat doc_tests.txt) -sv --doctest-continue-on-failure --doctest-glob="*.md" - name: Failure short reports if: ${{ failure() }} @@ -61,7 +66,7 @@ jobs: send_results: name: Send results to webhook - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 if: always() needs: [run_doctests] steps: diff --git a/.github/workflows/model-templates.yml b/.github/workflows/model-templates.yml index 3830c23fe0484a..eb77d9dcbe1e64 100644 --- a/.github/workflows/model-templates.yml +++ b/.github/workflows/model-templates.yml @@ -7,7 +7,7 @@ on: jobs: run_tests_templates: - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 steps: - name: Checkout repository uses: actions/checkout@v3 diff --git a/.github/workflows/model_jobs.yml b/.github/workflows/model_jobs.yml new file mode 100644 index 00000000000000..8bf8d78570fe18 --- /dev/null +++ b/.github/workflows/model_jobs.yml @@ -0,0 +1,102 @@ +name: model jobs + +on: + workflow_call: + inputs: + folder_slices: + required: true + type: string + machine_type: + required: true + type: string + slice_id: + required: true + type: number + +env: + HF_HOME: /mnt/cache + TRANSFORMERS_IS_CI: yes + OMP_NUM_THREADS: 8 + MKL_NUM_THREADS: 8 + RUN_SLOW: yes + # For gated repositories, we still need to agree to share information on the Hub repo. page in order to get access. + # This token is created under the bot `hf-transformers-bot`. 
+ HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }} + SIGOPT_API_TOKEN: ${{ secrets.SIGOPT_API_TOKEN }} + TF_FORCE_GPU_ALLOW_GROWTH: true + RUN_PT_TF_CROSS_TESTS: 1 + CUDA_VISIBLE_DEVICES: 0,1 + +jobs: + model_job: + name: " " + strategy: + fail-fast: false + matrix: + folders: ${{ fromJson(inputs.folder_slices)[inputs.slice_id] }} + runs-on: ['${{ inputs.machine_type }}', nvidia-gpu, t4, daily-ci] + container: + image: huggingface/transformers-all-latest-gpu + options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ + steps: + - name: Echo input and matrix info + shell: bash + run: | + echo "${{ inputs.folder_slices }}" + echo "${{ matrix.folders }}" + echo "${{ toJson(fromJson(inputs.folder_slices)[inputs.slice_id]) }}" + + - name: Echo folder ${{ matrix.folders }} + shell: bash + # For folders like `models/bert`, set an env. var. (`matrix_folders`) to `models_bert`, which will be used to + # set the artifact folder names (because the character `/` is not allowed). + run: | + echo "${{ matrix.folders }}" + matrix_folders=${{ matrix.folders }} + matrix_folders=${matrix_folders/'models/'/'models_'} + echo "$matrix_folders" + echo "matrix_folders=$matrix_folders" >> $GITHUB_ENV + + - name: Update clone + working-directory: /transformers + run: git fetch && git checkout ${{ github.sha }} + + - name: Reinstall transformers in edit mode (remove the one installed during docker image build) + working-directory: /transformers + run: python3 -m pip uninstall -y transformers && python3 -m pip install -e . + + - name: NVIDIA-SMI + run: | + nvidia-smi + + - name: Environment + working-directory: /transformers + run: | + python3 utils/print_env.py + + - name: Show installed libraries and their versions + working-directory: /transformers + run: pip freeze + + - name: Run all tests on GPU + working-directory: /transformers + run: python3 -m pytest -v --make-reports=${{ inputs.machine_type }}_tests_gpu_${{ matrix.folders }} tests/${{ matrix.folders }} + + - name: Failure short reports + if: ${{ failure() }} + continue-on-error: true + run: cat /transformers/reports/${{ inputs.machine_type }}_tests_gpu_${{ matrix.folders }}/failures_short.txt + + - name: Run test + shell: bash + run: | + mkdir -p /transformers/reports/${{ inputs.machine_type }}_tests_gpu_${{ matrix.folders }} + echo "hello" > /transformers/reports/${{ inputs.machine_type }}_tests_gpu_${{ matrix.folders }}/hello.txt + echo "${{ inputs.machine_type }}_tests_gpu_${{ matrix.folders }}" + + - name: "Test suite reports artifacts: ${{ inputs.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports" + if: ${{ always() }} + uses: actions/upload-artifact@v3 + with: + name: ${{ inputs.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports + path: /transformers/reports/${{ inputs.machine_type }}_tests_gpu_${{ matrix.folders }} diff --git a/.github/workflows/release-conda.yml b/.github/workflows/release-conda.yml index 4cc0b662fcc8c0..7a1990eec6b3d7 100644 --- a/.github/workflows/release-conda.yml +++ b/.github/workflows/release-conda.yml @@ -12,7 +12,7 @@ env: jobs: build_and_package: - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 defaults: run: shell: bash -l {0} diff --git a/.github/workflows/self-nightly-past-ci-caller.yml b/.github/workflows/self-nightly-past-ci-caller.yml new file mode 100644 index 00000000000000..67840355960c8c --- /dev/null +++ b/.github/workflows/self-nightly-past-ci-caller.yml @@ -0,0 +1,134 @@ +name: Self-hosted runner (nightly-past-ci-caller) + +on: 
+ schedule: + # 2:17 am on each Sunday and Thursday + + - cron: "17 2 * * 0,4" + push: + branches: + - run_nightly_ci* + - run_past_ci* + +jobs: + build_nightly_ci_images: + name: Build Nightly CI Docker Images + if: (github.event_name == 'schedule') || ((github.event_name == 'push') && startsWith(github.ref_name, 'run_nightly_ci')) + uses: ./.github/workflows/build-nightly-ci-docker-images.yml + secrets: inherit + + run_nightly_ci: + name: Nightly CI + needs: [build_nightly_ci_images] + uses: ./.github/workflows/self-nightly-scheduled.yml + secrets: inherit + + run_past_ci_pytorch_1-13: + name: PyTorch 1.13 + if: (cancelled() != true) && ((github.event_name == 'schedule') || ((github.event_name == 'push') && startsWith(github.ref_name, 'run_past_ci'))) + needs: [run_nightly_ci] + uses: ./.github/workflows/self-past.yml + with: + framework: pytorch + version: "1.13" + sha: ${{ github.sha }} + secrets: inherit + + run_past_ci_pytorch_1-12: + name: PyTorch 1.12 + if: (cancelled() != true) && ((github.event_name == 'schedule') || ((github.event_name == 'push') && startsWith(github.ref_name, 'run_past_ci'))) + needs: [run_past_ci_pytorch_1-13] + uses: ./.github/workflows/self-past.yml + with: + framework: pytorch + version: "1.12" + sha: ${{ github.sha }} + secrets: inherit + + run_past_ci_pytorch_1-11: + name: PyTorch 1.11 + if: (cancelled() != true) && ((github.event_name == 'schedule') || ((github.event_name == 'push') && startsWith(github.ref_name, 'run_past_ci'))) + needs: [run_past_ci_pytorch_1-12] + uses: ./.github/workflows/self-past.yml + with: + framework: pytorch + version: "1.11" + sha: ${{ github.sha }} + secrets: inherit + + run_past_ci_tensorflow_2-11: + name: TensorFlow 2.11 + if: (cancelled() != true) && ((github.event_name == 'push') && startsWith(github.ref_name, 'run_past_ci')) + needs: [run_past_ci_pytorch_1-11] + uses: ./.github/workflows/self-past.yml + with: + framework: tensorflow + version: "2.11" + sha: ${{ github.sha }} + secrets: inherit + + run_past_ci_tensorflow_2-10: + name: TensorFlow 2.10 + if: (cancelled() != true) && ((github.event_name == 'push') && startsWith(github.ref_name, 'run_past_ci')) + needs: [run_past_ci_tensorflow_2-11] + uses: ./.github/workflows/self-past.yml + with: + framework: tensorflow + version: "2.10" + sha: ${{ github.sha }} + secrets: inherit + + run_past_ci_tensorflow_2-9: + name: TensorFlow 2.9 + if: (cancelled() != true) && ((github.event_name == 'push') && startsWith(github.ref_name, 'run_past_ci')) + needs: [run_past_ci_tensorflow_2-10] + uses: ./.github/workflows/self-past.yml + with: + framework: tensorflow + version: "2.9" + sha: ${{ github.sha }} + secrets: inherit + + run_past_ci_tensorflow_2-8: + name: TensorFlow 2.8 + if: (cancelled() != true) && ((github.event_name == 'push') && startsWith(github.ref_name, 'run_past_ci')) + needs: [run_past_ci_tensorflow_2-9] + uses: ./.github/workflows/self-past.yml + with: + framework: tensorflow + version: "2.8" + sha: ${{ github.sha }} + secrets: inherit + + run_past_ci_tensorflow_2-7: + name: TensorFlow 2.7 + if: (cancelled() != true) && ((github.event_name == 'push') && startsWith(github.ref_name, 'run_past_ci')) + needs: [run_past_ci_tensorflow_2-8] + uses: ./.github/workflows/self-past.yml + with: + framework: tensorflow + version: "2.7" + sha: ${{ github.sha }} + secrets: inherit + + run_past_ci_tensorflow_2-6: + name: TensorFlow 2.6 + if: (cancelled() != true) && ((github.event_name == 'push') && startsWith(github.ref_name, 'run_past_ci')) + needs: [run_past_ci_tensorflow_2-7] + 
uses: ./.github/workflows/self-past.yml + with: + framework: tensorflow + version: "2.6" + sha: ${{ github.sha }} + secrets: inherit + + run_past_ci_tensorflow_2-5: + name: TensorFlow 2.5 + if: (cancelled() != true) && ((github.event_name == 'push') && startsWith(github.ref_name, 'run_past_ci')) + needs: [run_past_ci_tensorflow_2-6] + uses: ./.github/workflows/self-past.yml + with: + framework: tensorflow + version: "2.5" + sha: ${{ github.sha }} + secrets: inherit diff --git a/.github/workflows/self-nightly-scheduled.yml b/.github/workflows/self-nightly-scheduled.yml index accccf6164bc1e..5c3e30e4b424f9 100644 --- a/.github/workflows/self-nightly-scheduled.yml +++ b/.github/workflows/self-nightly-scheduled.yml @@ -1,4 +1,4 @@ -name: Self-hosted runner (nightly) +name: Self-hosted runner (nightly-ci) # Note that each job's dependencies go into a corresponding docker file. # @@ -8,9 +8,7 @@ name: Self-hosted runner (nightly) on: repository_dispatch: -# Disable temporarily until the test suite can be run under 12 hours. -# schedule: -# - cron: "0 16 * * *" + workflow_call: env: HF_HOME: /mnt/cache @@ -18,45 +16,19 @@ env: OMP_NUM_THREADS: 8 MKL_NUM_THREADS: 8 RUN_SLOW: yes + HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }} SIGOPT_API_TOKEN: ${{ secrets.SIGOPT_API_TOKEN }} TF_FORCE_GPU_ALLOW_GROWTH: true RUN_PT_TF_CROSS_TESTS: 1 + CUDA_VISIBLE_DEVICES: 0,1 jobs: - check_runner_status: - name: Check Runner Status - runs-on: ubuntu-latest - steps: - - name: Checkout transformers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Check Runner Status - run: python utils/check_self_hosted_runner.py --target_runners single-gpu-scheduled-ci-runner-docker,multi-gpu-scheduled-ci-runner-docker --token ${{ secrets.ACCESS_REPO_INFO_TOKEN }} - - check_runners: - name: Check Runners - needs: check_runner_status - strategy: - matrix: - machine_type: [single-gpu, multi-gpu] - runs-on: ${{ format('{0}-{1}', matrix.machine_type, 'docker') }} - container: - image: huggingface/transformers-all-latest-torch-nightly-gpu - options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ - steps: - - name: NVIDIA-SMI - run: | - nvidia-smi - setup: name: Setup - needs: check_runners strategy: matrix: machine_type: [single-gpu, multi-gpu] - runs-on: ${{ format('{0}-{1}', matrix.machine_type, 'docker') }} + runs-on: ['${{ matrix.machine_type }}', nvidia-gpu, t4, past-ci] container: image: huggingface/transformers-all-latest-torch-nightly-gpu options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ @@ -96,7 +68,7 @@ jobs: matrix: folders: ${{ fromJson(needs.setup.outputs.matrix) }} machine_type: [single-gpu] - runs-on: ${{ format('{0}-{1}', matrix.machine_type, 'docker') }} + runs-on: ['${{ matrix.machine_type }}', nvidia-gpu, t4, past-ci] container: image: huggingface/transformers-all-latest-torch-nightly-gpu options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ @@ -117,6 +89,10 @@ jobs: working-directory: /transformers run: git fetch && git checkout ${{ github.sha }} + - name: Reinstall transformers in edit mode (remove the one installed during docker image build) + working-directory: /transformers + run: python3 -m pip uninstall -y transformers && python3 -m pip install -e . 
+ - name: NVIDIA-SMI run: | nvidia-smi @@ -139,11 +115,11 @@ jobs: continue-on-error: true run: cat /transformers/reports/${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }}/failures_short.txt - - name: Test suite reports artifacts + - name: "Test suite reports artifacts: ${{ matrix.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports_postfix_nightly" if: ${{ always() }} uses: actions/upload-artifact@v3 with: - name: ${{ matrix.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports + name: ${{ matrix.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports_postfix_nightly path: /transformers/reports/${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }} run_tests_multi_gpu: @@ -153,7 +129,7 @@ jobs: matrix: folders: ${{ fromJson(needs.setup.outputs.matrix) }} machine_type: [multi-gpu] - runs-on: ${{ format('{0}-{1}', matrix.machine_type, 'docker') }} + runs-on: ['${{ matrix.machine_type }}', nvidia-gpu, t4, past-ci] container: image: huggingface/transformers-all-latest-torch-nightly-gpu options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ @@ -174,6 +150,10 @@ jobs: working-directory: /transformers run: git fetch && git checkout ${{ github.sha }} + - name: Reinstall transformers in edit mode (remove the one installed during docker image build) + working-directory: /transformers + run: python3 -m pip uninstall -y transformers && python3 -m pip install -e . + - name: NVIDIA-SMI run: | nvidia-smi @@ -196,11 +176,11 @@ jobs: continue-on-error: true run: cat /transformers/reports/${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }}/failures_short.txt - - name: Test suite reports artifacts + - name: "Test suite reports artifacts: ${{ matrix.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports_postfix_nightly" if: ${{ always() }} uses: actions/upload-artifact@v3 with: - name: ${{ matrix.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports + name: ${{ matrix.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports_postfix_nightly path: /transformers/reports/${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }} run_all_tests_torch_cuda_extensions_gpu: @@ -209,7 +189,7 @@ jobs: fail-fast: false matrix: machine_type: [single-gpu, multi-gpu] - runs-on: ${{ format('{0}-{1}', matrix.machine_type, 'docker') }} + runs-on: ['${{ matrix.machine_type }}', nvidia-gpu, t4, past-ci] needs: setup container: image: huggingface/transformers-pytorch-deepspeed-nightly-gpu @@ -219,6 +199,10 @@ jobs: working-directory: /workspace/transformers run: git fetch && git checkout ${{ github.sha }} + - name: Reinstall transformers in edit mode (remove the one installed during docker image build) + working-directory: /workspace/transformers + run: python3 -m pip uninstall -y transformers && python3 -m pip install -e . + - name: Remove cached torch extensions run: rm -rf /github/home/.cache/torch_extensions/ @@ -229,7 +213,7 @@ jobs: python3 -m pip uninstall -y deepspeed rm -rf DeepSpeed git clone https://github.com/microsoft/DeepSpeed && cd DeepSpeed && rm -rf build - DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 python3 -m pip install . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check + DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install . 
--global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check - name: NVIDIA-SMI run: | @@ -254,20 +238,18 @@ jobs: continue-on-error: true run: cat /workspace/transformers/reports/${{ matrix.machine_type }}_tests_torch_cuda_extensions_gpu/failures_short.txt - - name: Test suite reports artifacts + - name: "Test suite reports artifacts: ${{ matrix.machine_type }}_run_tests_torch_cuda_extensions_gpu_test_reports_postfix_nightly" if: ${{ always() }} uses: actions/upload-artifact@v3 with: - name: ${{ matrix.machine_type }}_run_tests_torch_cuda_extensions_gpu_test_reports + name: ${{ matrix.machine_type }}_run_tests_torch_cuda_extensions_gpu_test_reports_postfix_nightly path: /workspace/transformers/reports/${{ matrix.machine_type }}_tests_torch_cuda_extensions_gpu send_results: name: Send results to webhook - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 if: always() needs: [ - check_runner_status, - check_runners, setup, run_tests_single_gpu, run_tests_multi_gpu, @@ -278,8 +260,6 @@ jobs: shell: bash # For the meaning of these environment variables, see the job `Setup` run: | - echo "Runner availability: ${{ needs.check_runner_status.result }}" - echo "Runner status: ${{ needs.check_runners.result }}" echo "Setup status: ${{ needs.setup.result }}" - uses: actions/checkout@v3 @@ -291,9 +271,8 @@ jobs: CI_SLACK_CHANNEL_ID_DAILY: ${{ secrets.CI_SLACK_CHANNEL_ID_DAILY }} CI_SLACK_CHANNEL_DUMMY_TESTS: ${{ secrets.CI_SLACK_CHANNEL_DUMMY_TESTS }} CI_SLACK_REPORT_CHANNEL_ID: ${{ secrets.CI_SLACK_CHANNEL_ID_PAST_FUTURE }} - CI_EVENT: nightly-build - RUNNER_STATUS: ${{ needs.check_runner_status.result }} - RUNNER_ENV_STATUS: ${{ needs.check_runners.result }} + ACCESS_REPO_INFO_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }} + CI_EVENT: Nightly CI SETUP_STATUS: ${{ needs.setup.result }} # We pass `needs.setup.outputs.matrix` as the argument. A processing in `notification_service.py` to change # `models/bert` to `models_bert` is required, as the artifact names use `_` instead of `/`. 
@@ -301,3 +280,11 @@ jobs: pip install slack_sdk pip show slack_sdk python utils/notification_service.py "${{ needs.setup.outputs.matrix }}" + + + # delete-artifact + - uses: geekyeggo/delete-artifact@v2 + with: + name: | + single-* + multi-* diff --git a/.github/workflows/self-past-caller.yml b/.github/workflows/self-past-caller.yml deleted file mode 100644 index 2cc81dac8ca281..00000000000000 --- a/.github/workflows/self-past-caller.yml +++ /dev/null @@ -1,136 +0,0 @@ -name: Self-hosted runner (past-ci-caller) - -on: - push: - branches: - - run-past-ci* - -jobs: - run_past_ci_pytorch_1-11: - name: PyTorch 1.11 - if: always() - uses: ./.github/workflows/self-past.yml - with: - framework: pytorch - version: "1.11" - secrets: inherit - - run_past_ci_pytorch_1-10: - name: PyTorch 1.10 - if: always() - needs: [run_past_ci_pytorch_1-11] - uses: ./.github/workflows/self-past.yml - with: - framework: pytorch - version: "1.10" - secrets: inherit - - run_past_ci_pytorch_1-9: - name: PyTorch 1.9 - if: always() - needs: [run_past_ci_pytorch_1-10] - uses: ./.github/workflows/self-past.yml - with: - framework: pytorch - version: "1.9" - secrets: inherit - - run_past_ci_pytorch_1-8: - name: PyTorch 1.8 - if: always() - needs: [run_past_ci_pytorch_1-9] - uses: ./.github/workflows/self-past.yml - with: - framework: pytorch - version: "1.8" - secrets: inherit - - run_past_ci_pytorch_1-7: - name: PyTorch 1.7 - if: always() - needs: [run_past_ci_pytorch_1-8] - uses: ./.github/workflows/self-past.yml - with: - framework: pytorch - version: "1.7" - secrets: inherit - - run_past_ci_pytorch_1-6: - name: PyTorch 1.6 - if: always() - needs: [run_past_ci_pytorch_1-7] - uses: ./.github/workflows/self-past.yml - with: - framework: pytorch - version: "1.6" - secrets: inherit - - run_past_ci_pytorch_1-5: - name: PyTorch 1.5 - if: always() - needs: [run_past_ci_pytorch_1-6] - uses: ./.github/workflows/self-past.yml - with: - framework: pytorch - version: "1.5" - secrets: inherit - - run_past_ci_pytorch_1-4: - name: PyTorch 1.4 - if: always() - needs: [run_past_ci_pytorch_1-5] - uses: ./.github/workflows/self-past.yml - with: - framework: pytorch - version: "1.4" - secrets: inherit - - run_past_ci_tensorflow_2-8: - name: TensorFlow 2.8 - if: always() - needs: [run_past_ci_pytorch_1-4] - uses: ./.github/workflows/self-past.yml - with: - framework: tensorflow - version: "2.8" - secrets: inherit - - run_past_ci_tensorflow_2-7: - name: TensorFlow 2.7 - if: always() - needs: [run_past_ci_tensorflow_2-8] - uses: ./.github/workflows/self-past.yml - with: - framework: tensorflow - version: "2.7" - secrets: inherit - - run_past_ci_tensorflow_2-6: - name: TensorFlow 2.6 - if: always() - needs: [run_past_ci_tensorflow_2-7] - uses: ./.github/workflows/self-past.yml - with: - framework: tensorflow - version: "2.6" - secrets: inherit - - run_past_ci_tensorflow_2-5: - name: TensorFlow 2.5 - if: always() - needs: [run_past_ci_tensorflow_2-6] - uses: ./.github/workflows/self-past.yml - with: - framework: tensorflow - version: "2.5" - secrets: inherit - - run_past_ci_tensorflow_2-4: - name: TensorFlow 2.4 - if: always() - needs: [run_past_ci_tensorflow_2-5] - uses: ./.github/workflows/self-past.yml - with: - framework: tensorflow - version: "2.4" - secrets: inherit \ No newline at end of file diff --git a/.github/workflows/self-past.yml b/.github/workflows/self-past.yml index c59800445bdcc7..6b7587fdeb8227 100644 --- a/.github/workflows/self-past.yml +++ b/.github/workflows/self-past.yml @@ -1,4 +1,4 @@ -name: Self-hosted runner (past) 
+name: Self-hosted runner (past-ci) # Note that each job's dependencies go into a corresponding docker file. # @@ -27,45 +27,19 @@ env: OMP_NUM_THREADS: 8 MKL_NUM_THREADS: 8 RUN_SLOW: yes + HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }} SIGOPT_API_TOKEN: ${{ secrets.SIGOPT_API_TOKEN }} TF_FORCE_GPU_ALLOW_GROWTH: true RUN_PT_TF_CROSS_TESTS: 1 + CUDA_VISIBLE_DEVICES: 0,1 jobs: - check_runner_status: - name: Check Runner Status - runs-on: ubuntu-latest - steps: - - name: Checkout transformers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Check Runner Status - run: python utils/check_self_hosted_runner.py --target_runners single-gpu-past-ci-runner-docker,multi-gpu-past-ci-runner-docker --token ${{ secrets.ACCESS_REPO_INFO_TOKEN }} - - check_runners: - name: Check Runners - needs: check_runner_status - strategy: - matrix: - machine_type: [single-gpu, multi-gpu] - runs-on: ${{ format('{0}-{1}', matrix.machine_type, 'docker-past-ci') }} - container: - image: huggingface/transformers-${{ inputs.framework }}-past-${{ inputs.version }}-gpu - options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ - steps: - - name: NVIDIA-SMI - run: | - nvidia-smi - setup: name: Setup - needs: check_runners strategy: matrix: machine_type: [single-gpu, multi-gpu] - runs-on: ${{ format('{0}-{1}', matrix.machine_type, 'docker-past-ci') }} + runs-on: ['${{ matrix.machine_type }}', nvidia-gpu, t4, past-ci] container: image: huggingface/transformers-${{ inputs.framework }}-past-${{ inputs.version }}-gpu options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ @@ -101,7 +75,7 @@ jobs: matrix: folders: ${{ fromJson(needs.setup.outputs.matrix) }} machine_type: [single-gpu] - runs-on: ${{ format('{0}-{1}', matrix.machine_type, 'docker-past-ci') }} + runs-on: ['${{ matrix.machine_type }}', nvidia-gpu, t4, past-ci] container: image: huggingface/transformers-${{ inputs.framework }}-past-${{ inputs.version }}-gpu options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ @@ -111,6 +85,14 @@ jobs: working-directory: /transformers run: git fetch && git checkout ${{ inputs.sha }} + - name: Reinstall transformers in edit mode (remove the one installed during docker image build) + working-directory: /transformers + run: python3 -m pip uninstall -y transformers && python3 -m pip install -e . + + - name: Update some packages + working-directory: /transformers + run: python3 -m pip install -U datasets + - name: Echo folder ${{ matrix.folders }} shell: bash # For folders like `models/bert`, set an env. var. 
(`matrix_folders`) to `models_bert`, which will be used to @@ -126,6 +108,12 @@ jobs: run: | nvidia-smi + - name: Install + if: inputs.framework == 'pytorch' + working-directory: /transformers + run: | + python3 -m pip install --no-cache-dir git+https://github.com/huggingface/accelerate@main#egg=accelerate + - name: Environment working-directory: /transformers run: | @@ -153,11 +141,11 @@ jobs: echo "$job_name" echo "$job_name" > /transformers/reports/${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }}/job_name.txt - - name: Test suite reports artifacts + - name: "Test suite reports artifacts: ${{ matrix.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports_postfix_${{ inputs.framework }}-${{ inputs.version }}" if: ${{ always() }} uses: actions/upload-artifact@v3 with: - name: ${{ matrix.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports + name: ${{ matrix.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports_postfix_${{ inputs.framework }}-${{ inputs.version }} path: /transformers/reports/${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }} run_tests_multi_gpu: @@ -167,7 +155,7 @@ jobs: matrix: folders: ${{ fromJson(needs.setup.outputs.matrix) }} machine_type: [multi-gpu] - runs-on: ${{ format('{0}-{1}', matrix.machine_type, 'docker-past-ci') }} + runs-on: ['${{ matrix.machine_type }}', nvidia-gpu, t4, past-ci] container: image: huggingface/transformers-${{ inputs.framework }}-past-${{ inputs.version }}-gpu options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ @@ -177,6 +165,14 @@ jobs: working-directory: /transformers run: git fetch && git checkout ${{ inputs.sha }} + - name: Reinstall transformers in edit mode (remove the one installed during docker image build) + working-directory: /transformers + run: python3 -m pip uninstall -y transformers && python3 -m pip install -e . + + - name: Update some packages + working-directory: /transformers + run: python3 -m pip install -U datasets + - name: Echo folder ${{ matrix.folders }} shell: bash # For folders like `models/bert`, set an env. var. 
(`matrix_folders`) to `models_bert`, which will be used to @@ -192,6 +188,12 @@ jobs: run: | nvidia-smi + - name: Install + if: inputs.framework == 'pytorch' + working-directory: /transformers + run: | + python3 -m pip install --no-cache-dir git+https://github.com/huggingface/accelerate@main#egg=accelerate + - name: Environment working-directory: /transformers run: | @@ -219,25 +221,100 @@ jobs: echo "$job_name" echo "$job_name" > /transformers/reports/${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }}/job_name.txt - - name: Test suite reports artifacts + - name: "Test suite reports artifacts: ${{ matrix.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports_postfix_${{ inputs.framework }}-${{ inputs.version }}" if: ${{ always() }} uses: actions/upload-artifact@v3 with: - name: ${{ matrix.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports + name: ${{ matrix.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports_postfix_${{ inputs.framework }}-${{ inputs.version }} path: /transformers/reports/${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }} + run_all_tests_torch_cuda_extensions_gpu: + name: Torch CUDA extension tests + if: inputs.framework == 'pytorch' + strategy: + fail-fast: false + matrix: + machine_type: [single-gpu, multi-gpu] + runs-on: ['${{ matrix.machine_type }}', nvidia-gpu, t4, past-ci] + needs: setup + container: + image: huggingface/transformers-${{ inputs.framework }}-past-${{ inputs.version }}-gpu + options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ + steps: + - name: Update clone + working-directory: /transformers + run: git fetch && git checkout ${{ github.sha }} + + - name: Reinstall transformers in edit mode (remove the one installed during docker image build) + working-directory: /transformers + run: python3 -m pip uninstall -y transformers && python3 -m pip install -e . + + - name: Update some packages + working-directory: /transformers + run: python3 -m pip install -U datasets + + - name: Install + working-directory: /transformers + run: | + python3 -m pip install --no-cache-dir git+https://github.com/huggingface/accelerate@main#egg=accelerate + + - name: Remove cached torch extensions + run: rm -rf /github/home/.cache/torch_extensions/ + + # To avoid unknown test failures + - name: Pre build DeepSpeed *again* + working-directory: / + run: | + python3 -m pip uninstall -y deepspeed + rm -rf DeepSpeed + git clone https://github.com/microsoft/DeepSpeed && cd DeepSpeed && rm -rf build + DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install . 
--global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check + + - name: NVIDIA-SMI + run: | + nvidia-smi + + - name: Environment + working-directory: /transformers + run: | + python3 utils/print_env.py + + - name: Show installed libraries and their versions + working-directory: /transformers + run: pip freeze + + - name: Run all tests on GPU + working-directory: /transformers + run: | + python3 -m pytest -v --make-reports=${{ matrix.machine_type }}_tests_torch_cuda_extensions_gpu tests/deepspeed tests/extended + + - name: Failure short reports + if: ${{ failure() }} + continue-on-error: true + run: cat /transformers/reports/${{ matrix.machine_type }}_tests_torch_cuda_extensions_gpu/failures_short.txt + + - name: "Test suite reports artifacts: ${{ matrix.machine_type }}_run_tests_torch_cuda_extensions_gpu_test_reports_postfix_${{ inputs.framework }}-${{ inputs.version }}" + if: ${{ always() }} + uses: actions/upload-artifact@v3 + with: + name: ${{ matrix.machine_type }}_run_tests_torch_cuda_extensions_gpu_test_reports_postfix_${{ inputs.framework }}-${{ inputs.version }} + path: /transformers/reports/${{ matrix.machine_type }}_tests_torch_cuda_extensions_gpu + send_results: name: Send results to webhook - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 if: always() - needs: [check_runner_status, check_runners, setup, run_tests_single_gpu, run_tests_multi_gpu] + needs: [ + setup, + run_tests_single_gpu, + run_tests_multi_gpu, + run_all_tests_torch_cuda_extensions_gpu + ] steps: - name: Preliminary job status shell: bash # For the meaning of these environment variables, see the job `Setup` run: | - echo "Runner availability: ${{ needs.check_runner_status.result }}" - echo "Runner status: ${{ needs.check_runners.result }}" echo "Setup status: ${{ needs.setup.result }}" - uses: actions/checkout@v3 @@ -254,9 +331,8 @@ jobs: CI_SLACK_CHANNEL_ID_DAILY: ${{ secrets.CI_SLACK_CHANNEL_ID_DAILY }} CI_SLACK_CHANNEL_DUMMY_TESTS: ${{ secrets.CI_SLACK_CHANNEL_DUMMY_TESTS }} CI_SLACK_REPORT_CHANNEL_ID: ${{ secrets.CI_SLACK_CHANNEL_ID_PAST_FUTURE }} + ACCESS_REPO_INFO_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }} CI_EVENT: Past CI - ${{ inputs.framework }}-${{ inputs.version }} - RUNNER_STATUS: ${{ needs.check_runner_status.result }} - RUNNER_ENV_STATUS: ${{ needs.check_runners.result }} SETUP_STATUS: ${{ needs.setup.result }} # We pass `needs.setup.outputs.matrix` as the argument. A processing in `notification_service.py` to change # `models/bert` to `models_bert` is required, as the artifact names use `_` instead of `/`. 
@@ -271,4 +347,11 @@ jobs: uses: actions/upload-artifact@v3 with: name: test_failure_tables_${{ inputs.framework }}-${{ inputs.version }} - path: test_failure_tables \ No newline at end of file + path: test_failure_tables + + # delete-artifact + - uses: geekyeggo/delete-artifact@v2 + with: + name: | + single-* + multi-* diff --git a/.github/workflows/self-push-amd-mi210-caller.yml b/.github/workflows/self-push-amd-mi210-caller.yml new file mode 100644 index 00000000000000..a401e40ee7f164 --- /dev/null +++ b/.github/workflows/self-push-amd-mi210-caller.yml @@ -0,0 +1,25 @@ +name: Self-hosted runner (AMD mi210 CI caller) + +on: + workflow_run: + workflows: ["Self-hosted runner (push-caller)"] + branches: ["main"] + types: [completed] + push: + branches: + - run_amd_push_ci_caller* + paths: + - "src/**" + - "tests/**" + - ".github/**" + - "templates/**" + - "utils/**" + +jobs: + run_amd_ci: + name: AMD mi210 + if: (cancelled() != true) && ((github.event_name == 'workflow_run') || ((github.event_name == 'push') && startsWith(github.ref_name, 'run_amd_push_ci_caller'))) + uses: ./.github/workflows/self-push-amd.yml + with: + gpu_flavor: mi210 + secrets: inherit diff --git a/.github/workflows/self-push-amd-mi250-caller.yml b/.github/workflows/self-push-amd-mi250-caller.yml new file mode 100644 index 00000000000000..fef532703170cb --- /dev/null +++ b/.github/workflows/self-push-amd-mi250-caller.yml @@ -0,0 +1,25 @@ +name: Self-hosted runner (AMD mi250 CI caller) + +on: + workflow_run: + workflows: ["Self-hosted runner (push-caller)"] + branches: ["main"] + types: [completed] + push: + branches: + - run_amd_push_ci_caller* + paths: + - "src/**" + - "tests/**" + - ".github/**" + - "templates/**" + - "utils/**" + +jobs: + run_amd_ci: + name: AMD mi250 + if: (cancelled() != true) && ((github.event_name == 'workflow_run') || ((github.event_name == 'push') && startsWith(github.ref_name, 'run_amd_push_ci_caller'))) + uses: ./.github/workflows/self-push-amd.yml + with: + gpu_flavor: mi250 + secrets: inherit diff --git a/.github/workflows/self-push-amd.yml b/.github/workflows/self-push-amd.yml new file mode 100644 index 00000000000000..4bd7c1f4873dab --- /dev/null +++ b/.github/workflows/self-push-amd.yml @@ -0,0 +1,329 @@ +name: Self-hosted runner AMD GPU (push) + +on: + workflow_call: + inputs: + gpu_flavor: + required: true + type: string + +env: + HF_HOME: /mnt/cache + TRANSFORMERS_IS_CI: yes + OMP_NUM_THREADS: 8 + MKL_NUM_THREADS: 8 + PYTEST_TIMEOUT: 60 + TF_FORCE_GPU_ALLOW_GROWTH: true + RUN_PT_TF_CROSS_TESTS: 1 + HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }} + +jobs: + check_runner_status: + name: Check Runner Status + runs-on: ubuntu-22.04 + steps: + - name: Checkout transformers + uses: actions/checkout@v3 + with: + fetch-depth: 2 + + - name: Check Runner Status + run: python utils/check_self_hosted_runner.py --target_runners amd-mi210-single-gpu-ci-runner-docker --token ${{ secrets.ACCESS_REPO_INFO_TOKEN }} + + check_runners: + name: Check Runners + needs: check_runner_status + strategy: + matrix: + machine_type: [single-gpu, multi-gpu] + runs-on: [self-hosted, docker-gpu, amd-gpu, '${{ matrix.machine_type }}', '${{ inputs.gpu_flavor }}'] + container: + image: huggingface/transformers-pytorch-amd-gpu-push-ci # <--- We test only for PyTorch for now + options: --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ + steps: + - name: ROCM-SMI + run: | + rocm-smi + - name: ROCM-INFO + run: | + rocminfo | grep 
"Agent" -A 14 + - name: Show ROCR environment + run: | + echo "ROCR: $ROCR_VISIBLE_DEVICES" + + setup_gpu: + name: Setup + needs: check_runners + strategy: + matrix: + machine_type: [single-gpu, multi-gpu] + runs-on: [self-hosted, docker-gpu, amd-gpu, '${{ matrix.machine_type }}', '${{ inputs.gpu_flavor }}'] + container: + image: huggingface/transformers-pytorch-amd-gpu-push-ci # <--- We test only for PyTorch for now + options: --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ + outputs: + matrix: ${{ steps.set-matrix.outputs.matrix }} + test_map: ${{ steps.set-matrix.outputs.test_map }} + steps: + # Necessary to get the correct branch name and commit SHA for `workflow_run` event + # We also take into account the `push` event (we might want to test some changes in a branch) + - name: Prepare custom environment variables + shell: bash + # `CI_BRANCH_PUSH`: The branch name from the push event + # `CI_BRANCH_WORKFLOW_RUN`: The name of the branch on which this workflow is triggered by `workflow_run` event + # `CI_BRANCH`: The non-empty branch name from the above two (one and only one of them is empty) + # `CI_SHA_PUSH`: The commit SHA from the push event + # `CI_SHA_WORKFLOW_RUN`: The commit SHA that triggers this workflow by `workflow_run` event + # `CI_SHA`: The non-empty commit SHA from the above two (one and only one of them is empty) + run: | + CI_BRANCH_PUSH=${{ github.event.ref }} + CI_BRANCH_PUSH=${CI_BRANCH_PUSH/'refs/heads/'/''} + CI_BRANCH_WORKFLOW_RUN=${{ github.event.workflow_run.head_branch }} + CI_SHA_PUSH=${{ github.event.head_commit.id }} + CI_SHA_WORKFLOW_RUN=${{ github.event.workflow_run.head_sha }} + echo $CI_BRANCH_PUSH + echo $CI_BRANCH_WORKFLOW_RUN + echo $CI_SHA_PUSH + echo $CI_SHA_WORKFLOW_RUN + [[ ! -z "$CI_BRANCH_PUSH" ]] && echo "CI_BRANCH=$CI_BRANCH_PUSH" >> $GITHUB_ENV || echo "CI_BRANCH=$CI_BRANCH_WORKFLOW_RUN" >> $GITHUB_ENV + [[ ! -z "$CI_SHA_PUSH" ]] && echo "CI_SHA=$CI_SHA_PUSH" >> $GITHUB_ENV || echo "CI_SHA=$CI_SHA_WORKFLOW_RUN" >> $GITHUB_ENV + + - name: print environment variables + run: | + echo "env.CI_BRANCH = ${{ env.CI_BRANCH }}" + echo "env.CI_SHA = ${{ env.CI_SHA }}" + + - name: Update clone using environment variables + working-directory: /transformers + run: | + echo "original branch = $(git branch --show-current)" + git fetch && git checkout ${{ env.CI_BRANCH }} + echo "updated branch = $(git branch --show-current)" + git checkout ${{ env.CI_SHA }} + echo "log = $(git log -n 1)" + + - name: Cleanup + working-directory: /transformers + run: | + rm -rf tests/__pycache__ + rm -rf tests/models/__pycache__ + rm -rf reports + + - name: Show installed libraries and their versions + working-directory: /transformers + run: pip freeze + + - name: Fetch the tests to run + working-directory: /transformers + # TODO: add `git-python` in the docker images + run: | + pip install --upgrade git-python + python3 utils/tests_fetcher.py --diff_with_last_commit | tee test_preparation.txt + + - name: Report fetched tests + uses: actions/upload-artifact@v3 + with: + name: test_fetched + path: /transformers/test_preparation.txt + + - id: set-matrix + name: Organize tests into models + working-directory: /transformers + # The `keys` is used as GitHub actions matrix for jobs, i.e. `models/bert`, `tokenization`, `pipeline`, etc. + # The `test_map` is used to get the actual identified test files under each key. 
+ # If no test to run (so no `test_map.json` file), create a dummy map (empty matrix will fail) + run: | + if [ -f test_map.json ]; then + keys=$(python3 -c 'import json; fp = open("test_map.json"); test_map = json.load(fp); fp.close(); d = list(test_map.keys()); print(d)') + test_map=$(python3 -c 'import json; fp = open("test_map.json"); test_map = json.load(fp); fp.close(); print(test_map)') + else + keys=$(python3 -c 'keys = ["dummy"]; print(keys)') + test_map=$(python3 -c 'test_map = {"dummy": []}; print(test_map)') + fi + echo $keys + echo $test_map + echo "matrix=$keys" >> $GITHUB_OUTPUT + echo "test_map=$test_map" >> $GITHUB_OUTPUT + + run_tests_amdgpu: + name: Model tests + needs: setup_gpu + # `dummy` means there is no test to run + if: contains(fromJson(needs.setup_gpu.outputs.matrix), 'dummy') != true + strategy: + fail-fast: false + matrix: + folders: ${{ fromJson(needs.setup_gpu.outputs.matrix) }} + machine_type: [single-gpu, multi-gpu] + runs-on: [self-hosted, docker-gpu, amd-gpu, '${{ matrix.machine_type }}', '${{ inputs.gpu_flavor }}'] + container: + image: huggingface/transformers-pytorch-amd-gpu-push-ci # <--- We test only for PyTorch for now + options: --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ + steps: + # Necessary to get the correct branch name and commit SHA for `workflow_run` event + # We also take into account the `push` event (we might want to test some changes in a branch) + - name: Prepare custom environment variables + shell: bash + # For the meaning of these environment variables, see the job `Setup` + run: | + CI_BRANCH_PUSH=${{ github.event.ref }} + CI_BRANCH_PUSH=${CI_BRANCH_PUSH/'refs/heads/'/''} + CI_BRANCH_WORKFLOW_RUN=${{ github.event.workflow_run.head_branch }} + CI_SHA_PUSH=${{ github.event.head_commit.id }} + CI_SHA_WORKFLOW_RUN=${{ github.event.workflow_run.head_sha }} + echo $CI_BRANCH_PUSH + echo $CI_BRANCH_WORKFLOW_RUN + echo $CI_SHA_PUSH + echo $CI_SHA_WORKFLOW_RUN + [[ ! -z "$CI_BRANCH_PUSH" ]] && echo "CI_BRANCH=$CI_BRANCH_PUSH" >> $GITHUB_ENV || echo "CI_BRANCH=$CI_BRANCH_WORKFLOW_RUN" >> $GITHUB_ENV + [[ ! -z "$CI_SHA_PUSH" ]] && echo "CI_SHA=$CI_SHA_PUSH" >> $GITHUB_ENV || echo "CI_SHA=$CI_SHA_WORKFLOW_RUN" >> $GITHUB_ENV + + - name: print environment variables + run: | + echo "env.CI_BRANCH = ${{ env.CI_BRANCH }}" + echo "env.CI_SHA = ${{ env.CI_SHA }}" + + - name: Update clone using environment variables + working-directory: /transformers + run: | + echo "original branch = $(git branch --show-current)" + git fetch && git checkout ${{ env.CI_BRANCH }} + echo "updated branch = $(git branch --show-current)" + git checkout ${{ env.CI_SHA }} + echo "log = $(git log -n 1)" + + - name: Reinstall transformers in edit mode (remove the one installed during docker image build) + working-directory: /transformers + run: python3 -m pip uninstall -y transformers && python3 -m pip install -e . + + - name: Echo folder ${{ matrix.folders }} + shell: bash + # For folders like `models/bert`, set an env. var. (`matrix_folders`) to `models_bert`, which will be used to + # set the artifact folder names (because the character `/` is not allowed). 
+ run: | + echo "${{ matrix.folders }}" + echo "${{ fromJson(needs.setup_gpu.outputs.test_map)[matrix.folders] }}" + matrix_folders=${{ matrix.folders }} + matrix_folders=${matrix_folders/'models/'/'models_'} + echo "$matrix_folders" + echo "matrix_folders=$matrix_folders" >> $GITHUB_ENV + + - name: ROCM-SMI + run: | + rocm-smi + - name: ROCM-INFO + run: | + rocminfo | grep "Agent" -A 14 + - name: Show ROCR environment + run: | + echo "ROCR: $ROCR_VISIBLE_DEVICES" + + - name: Environment + working-directory: /transformers + run: | + python3 utils/print_env.py + + - name: Show installed libraries and their versions + working-directory: /transformers + run: pip freeze + + - name: Run all non-slow selected tests on GPU + working-directory: /transformers + run: | + python3 -m pytest -n 2 --dist=loadfile -v --make-reports=${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }} ${{ fromJson(needs.setup_gpu.outputs.test_map)[matrix.folders] }} + + - name: Failure short reports + if: ${{ failure() }} + continue-on-error: true + run: cat /transformers/reports/${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }}/failures_short.txt + + - name: "Test suite reports artifacts: ${{ matrix.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports" + if: ${{ always() }} + uses: actions/upload-artifact@v3 + with: + name: ${{ matrix.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports + path: /transformers/reports/${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }} + + send_results: + name: Send results to webhook + runs-on: ubuntu-22.04 + if: always() + needs: [ + check_runner_status, + check_runners, + setup_gpu, + run_tests_amdgpu, +# run_tests_torch_cuda_extensions_single_gpu, +# run_tests_torch_cuda_extensions_multi_gpu + ] + steps: + - name: Preliminary job status + shell: bash + # For the meaning of these environment variables, see the job `Setup` + run: | + echo "Runner availability: ${{ needs.check_runner_status.result }}" + echo "Setup status: ${{ needs.setup_gpu.result }}" + echo "Runner status: ${{ needs.check_runners.result }}" + + # Necessary to get the correct branch name and commit SHA for `workflow_run` event + # We also take into account the `push` event (we might want to test some changes in a branch) + - name: Prepare custom environment variables + shell: bash + # For the meaning of these environment variables, see the job `Setup` + run: | + CI_BRANCH_PUSH=${{ github.event.ref }} + CI_BRANCH_PUSH=${CI_BRANCH_PUSH/'refs/heads/'/''} + CI_BRANCH_WORKFLOW_RUN=${{ github.event.workflow_run.head_branch }} + CI_SHA_PUSH=${{ github.event.head_commit.id }} + CI_SHA_WORKFLOW_RUN=${{ github.event.workflow_run.head_sha }} + echo $CI_BRANCH_PUSH + echo $CI_BRANCH_WORKFLOW_RUN + echo $CI_SHA_PUSH + echo $CI_SHA_WORKFLOW_RUN + [[ ! -z "$CI_BRANCH_PUSH" ]] && echo "CI_BRANCH=$CI_BRANCH_PUSH" >> $GITHUB_ENV || echo "CI_BRANCH=$CI_BRANCH_WORKFLOW_RUN" >> $GITHUB_ENV + [[ ! -z "$CI_SHA_PUSH" ]] && echo "CI_SHA=$CI_SHA_PUSH" >> $GITHUB_ENV || echo "CI_SHA=$CI_SHA_WORKFLOW_RUN" >> $GITHUB_ENV + + - name: print environment variables + run: | + echo "env.CI_BRANCH = ${{ env.CI_BRANCH }}" + echo "env.CI_SHA = ${{ env.CI_SHA }}" + + - uses: actions/checkout@v3 + # To avoid failure when multiple commits are merged into `main` in a short period of time. + # Checking out to an old commit beyond the fetch depth will get an error `fatal: reference is not a tree: ... 
+ # (Only required for `workflow_run` event, where we get the latest HEAD on `main` instead of the event commit) + with: + fetch-depth: 20 + + - name: Update clone using environment variables + run: | + echo "original branch = $(git branch --show-current)" + git fetch && git checkout ${{ env.CI_BRANCH }} + echo "updated branch = $(git branch --show-current)" + git checkout ${{ env.CI_SHA }} + echo "log = $(git log -n 1)" + + - uses: actions/download-artifact@v3 + - name: Send message to Slack + env: + CI_SLACK_BOT_TOKEN: ${{ secrets.CI_SLACK_BOT_TOKEN }} + CI_SLACK_CHANNEL_ID: ${{ secrets.CI_SLACK_CHANNEL_ID }} + CI_SLACK_CHANNEL_ID_DAILY: ${{ secrets.CI_SLACK_CHANNEL_ID_DAILY }} + CI_SLACK_CHANNEL_ID_AMD: ${{ secrets.CI_SLACK_CHANNEL_ID_AMD }} + CI_SLACK_CHANNEL_DUMMY_TESTS: ${{ secrets.CI_SLACK_CHANNEL_DUMMY_TESTS }} + CI_SLACK_REPORT_CHANNEL_ID: ${{ secrets.CI_SLACK_CHANNEL_ID_AMD }} + ACCESS_REPO_INFO_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }} + CI_EVENT: Push CI (AMD) - ${{ inputs.gpu_flavor }} + CI_TITLE_PUSH: ${{ github.event.head_commit.message }} + CI_TITLE_WORKFLOW_RUN: ${{ github.event.workflow_run.head_commit.message }} + CI_SHA: ${{ env.CI_SHA }} + RUNNER_STATUS: ${{ needs.check_runner_status.result }} + RUNNER_ENV_STATUS: ${{ needs.check_runners.result }} + SETUP_STATUS: ${{ needs.setup_gpu.result }} + + # We pass `needs.setup_gpu.outputs.matrix` as the argument. A processing in `notification_service.py` to change + # `models/bert` to `models_bert` is required, as the artifact names use `_` instead of `/`. + run: | + pip install slack_sdk + pip show slack_sdk + python utils/notification_service.py "${{ needs.setup_gpu.outputs.matrix }}" diff --git a/.github/workflows/self-push-caller.yml b/.github/workflows/self-push-caller.yml index 994567c5cdbd48..14b5262426b452 100644 --- a/.github/workflows/self-push-caller.yml +++ b/.github/workflows/self-push-caller.yml @@ -14,7 +14,7 @@ on: jobs: check-for-setup: - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 name: Check if setup was changed outputs: changed: ${{ steps.was_changed.outputs.changed }} @@ -25,7 +25,7 @@ jobs: - name: Get changed files id: changed-files - uses: tj-actions/changed-files@v22.2 + uses: tj-actions/changed-files@v41 - name: Was setup changed id: was_changed @@ -46,7 +46,7 @@ jobs: run_push_ci: name: Trigger Push CI - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 if: ${{ always() }} needs: build-docker-containers steps: diff --git a/.github/workflows/self-push.yml b/.github/workflows/self-push.yml index b6c3a70e3eb8a1..fd823ce4f5cac8 100644 --- a/.github/workflows/self-push.yml +++ b/.github/workflows/self-push.yml @@ -25,42 +25,15 @@ env: PYTEST_TIMEOUT: 60 TF_FORCE_GPU_ALLOW_GROWTH: true RUN_PT_TF_CROSS_TESTS: 1 + CUDA_VISIBLE_DEVICES: 0,1 jobs: - check_runner_status: - name: Check Runner Status - runs-on: ubuntu-latest - steps: - - name: Checkout transformers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Check Runner Status - run: python utils/check_self_hosted_runner.py --target_runners single-gpu-ci-runner-docker,multi-gpu-ci-runner-docker --token ${{ secrets.ACCESS_REPO_INFO_TOKEN }} - - check_runners: - name: Check Runners - needs: check_runner_status - strategy: - matrix: - machine_type: [single-gpu, multi-gpu] - runs-on: [self-hosted, docker-gpu, '${{ matrix.machine_type }}'] - container: - image: huggingface/transformers-all-latest-gpu-push-ci - options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ - steps: - - name: NVIDIA-SMI - run: | - 
nvidia-smi - setup: name: Setup - needs: check_runners strategy: matrix: machine_type: [single-gpu, multi-gpu] - runs-on: [self-hosted, docker-gpu, '${{ matrix.machine_type }}'] + runs-on: ['${{ matrix.machine_type }}', nvidia-gpu, t4, push-ci] container: image: huggingface/transformers-all-latest-gpu-push-ci options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ @@ -158,7 +131,7 @@ jobs: matrix: folders: ${{ fromJson(needs.setup.outputs.matrix) }} machine_type: [single-gpu] - runs-on: [self-hosted, docker-gpu, '${{ matrix.machine_type }}'] + runs-on: ['${{ matrix.machine_type }}', nvidia-gpu, t4, push-ci] container: image: huggingface/transformers-all-latest-gpu-push-ci options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ @@ -195,6 +168,10 @@ jobs: git checkout ${{ env.CI_SHA }} echo "log = $(git log -n 1)" + - name: Reinstall transformers in edit mode (remove the one installed during docker image build) + working-directory: /transformers + run: python3 -m pip uninstall -y transformers && python3 -m pip install -e . + - name: Echo folder ${{ matrix.folders }} shell: bash # For folders like `models/bert`, set an env. var. (`matrix_folders`) to `models_bert`, which will be used to @@ -230,7 +207,7 @@ jobs: continue-on-error: true run: cat /transformers/reports/${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }}/failures_short.txt - - name: Test suite reports artifacts + - name: "Test suite reports artifacts: ${{ matrix.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports" if: ${{ always() }} uses: actions/upload-artifact@v3 with: @@ -247,7 +224,7 @@ jobs: matrix: folders: ${{ fromJson(needs.setup.outputs.matrix) }} machine_type: [multi-gpu] - runs-on: [self-hosted, docker-gpu, '${{ matrix.machine_type }}'] + runs-on: ['${{ matrix.machine_type }}', nvidia-gpu, t4, push-ci] container: image: huggingface/transformers-all-latest-gpu-push-ci options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ @@ -284,6 +261,10 @@ jobs: git checkout ${{ env.CI_SHA }} echo "log = $(git log -n 1)" + - name: Reinstall transformers in edit mode (remove the one installed during docker image build) + working-directory: /transformers + run: python3 -m pip uninstall -y transformers && python3 -m pip install -e . + - name: Echo folder ${{ matrix.folders }} shell: bash # For folders like `models/bert`, set an env. var. 
(`matrix_folders`) to `models_bert`, which will be used to @@ -321,7 +302,7 @@ jobs: continue-on-error: true run: cat /transformers/reports/${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }}/failures_short.txt - - name: Test suite reports artifacts + - name: "Test suite reports artifacts: ${{ matrix.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports" if: ${{ always() }} uses: actions/upload-artifact@v3 with: @@ -336,7 +317,7 @@ jobs: fail-fast: false matrix: machine_type: [single-gpu] - runs-on: [self-hosted, docker-gpu, '${{ matrix.machine_type }}'] + runs-on: ['${{ matrix.machine_type }}', nvidia-gpu, t4, push-ci] container: image: huggingface/transformers-pytorch-deepspeed-latest-gpu-push-ci options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ @@ -373,6 +354,10 @@ jobs: git checkout ${{ env.CI_SHA }} echo "log = $(git log -n 1)" + - name: Reinstall transformers in edit mode (remove the one installed during docker image build) + working-directory: /workspace/transformers + run: python3 -m pip uninstall -y transformers && python3 -m pip install -e . + - name: Remove cached torch extensions run: rm -rf /github/home/.cache/torch_extensions/ @@ -381,7 +366,7 @@ jobs: working-directory: /workspace run: | python3 -m pip uninstall -y deepspeed - DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check + DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check - name: NVIDIA-SMI run: | @@ -407,7 +392,7 @@ jobs: continue-on-error: true run: cat /workspace/transformers/reports/${{ matrix.machine_type }}_tests_torch_cuda_extensions_gpu/failures_short.txt - - name: Test suite reports artifacts + - name: "Test suite reports artifacts: ${{ matrix.machine_type }}_run_tests_torch_cuda_extensions_gpu_test_reports" if: ${{ always() }} uses: actions/upload-artifact@v3 with: @@ -422,7 +407,7 @@ jobs: fail-fast: false matrix: machine_type: [multi-gpu] - runs-on: [self-hosted, docker-gpu, '${{ matrix.machine_type }}'] + runs-on: ['${{ matrix.machine_type }}', nvidia-gpu, t4, push-ci] container: image: huggingface/transformers-pytorch-deepspeed-latest-gpu-push-ci options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ @@ -459,6 +444,10 @@ jobs: git checkout ${{ env.CI_SHA }} echo "log = $(git log -n 1)" + - name: Reinstall transformers in edit mode (remove the one installed during docker image build) + working-directory: /workspace/transformers + run: python3 -m pip uninstall -y transformers && python3 -m pip install -e . 
+ - name: Remove cached torch extensions run: rm -rf /github/home/.cache/torch_extensions/ @@ -467,7 +456,7 @@ jobs: working-directory: /workspace run: | python3 -m pip uninstall -y deepspeed - DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check + DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check - name: NVIDIA-SMI run: | @@ -493,7 +482,7 @@ jobs: continue-on-error: true run: cat /workspace/transformers/reports/${{ matrix.machine_type }}_tests_torch_cuda_extensions_gpu/failures_short.txt - - name: Test suite reports artifacts + - name: "Test suite reports artifacts: ${{ matrix.machine_type }}_run_tests_torch_cuda_extensions_gpu_test_reports" if: ${{ always() }} uses: actions/upload-artifact@v3 with: @@ -502,11 +491,9 @@ jobs: send_results: name: Send results to webhook - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 if: always() needs: [ - check_runner_status, - check_runners, setup, run_tests_single_gpu, run_tests_multi_gpu, @@ -518,9 +505,7 @@ jobs: shell: bash # For the meaning of these environment variables, see the job `Setup` run: | - echo "Runner availability: ${{ needs.check_runner_status.result }}" echo "Setup status: ${{ needs.setup.result }}" - echo "Runner status: ${{ needs.check_runners.result }}" # Necessary to get the correct branch name and commit SHA for `workflow_run` event # We also take into account the `push` event (we might want to test some changes in a branch) @@ -568,12 +553,11 @@ jobs: CI_SLACK_CHANNEL_ID_DAILY: ${{ secrets.CI_SLACK_CHANNEL_ID_DAILY }} CI_SLACK_CHANNEL_DUMMY_TESTS: ${{ secrets.CI_SLACK_CHANNEL_DUMMY_TESTS }} CI_SLACK_REPORT_CHANNEL_ID: ${{ secrets.CI_SLACK_CHANNEL_ID }} + ACCESS_REPO_INFO_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }} CI_EVENT: push CI_TITLE_PUSH: ${{ github.event.head_commit.message }} CI_TITLE_WORKFLOW_RUN: ${{ github.event.workflow_run.head_commit.message }} CI_SHA: ${{ env.CI_SHA }} - RUNNER_STATUS: ${{ needs.check_runner_status.result }} - RUNNER_ENV_STATUS: ${{ needs.check_runners.result }} SETUP_STATUS: ${{ needs.setup.result }} # We pass `needs.setup.outputs.matrix` as the argument. 
A processing in `notification_service.py` to change diff --git a/.github/workflows/self-scheduled-amd-caller.yml b/.github/workflows/self-scheduled-amd-caller.yml new file mode 100644 index 00000000000000..dc5c7b7e905bd8 --- /dev/null +++ b/.github/workflows/self-scheduled-amd-caller.yml @@ -0,0 +1,14 @@ +name: Self-hosted runner (AMD scheduled CI caller) + +on: + schedule: + - cron: "17 2 * * *" + +jobs: + run_scheduled_amd_ci: + name: Trigger Scheduled AMD CI + runs-on: ubuntu-22.04 + if: ${{ always() }} + steps: + - name: Trigger scheduled AMD CI via workflow_run + run: echo "Trigger scheduled AMD CI via workflow_run" diff --git a/.github/workflows/self-scheduled-amd-mi210-caller.yml b/.github/workflows/self-scheduled-amd-mi210-caller.yml new file mode 100644 index 00000000000000..cdb968901058b6 --- /dev/null +++ b/.github/workflows/self-scheduled-amd-mi210-caller.yml @@ -0,0 +1,19 @@ +name: Self-hosted runner (AMD mi210 scheduled CI caller) + +on: + workflow_run: + workflows: ["Self-hosted runner (AMD scheduled CI caller)"] + branches: ["main"] + types: [completed] + push: + branches: + - run_amd_scheduled_ci_caller* + +jobs: + run_amd_ci: + name: AMD mi210 + if: (cancelled() != true) && ((github.event_name == 'workflow_run') || ((github.event_name == 'push') && startsWith(github.ref_name, 'run_amd_scheduled_ci_caller'))) + uses: ./.github/workflows/self-scheduled-amd.yml + with: + gpu_flavor: mi210 + secrets: inherit diff --git a/.github/workflows/self-scheduled-amd-mi250-caller.yml b/.github/workflows/self-scheduled-amd-mi250-caller.yml new file mode 100644 index 00000000000000..dc7d12f173935e --- /dev/null +++ b/.github/workflows/self-scheduled-amd-mi250-caller.yml @@ -0,0 +1,19 @@ +name: Self-hosted runner (AMD mi250 scheduled CI caller) + +on: + workflow_run: + workflows: ["Self-hosted runner (AMD scheduled CI caller)"] + branches: ["main"] + types: [completed] + push: + branches: + - run_amd_scheduled_ci_caller* + +jobs: + run_amd_ci: + name: AMD mi250 + if: (cancelled() != true) && ((github.event_name == 'workflow_run') || ((github.event_name == 'push') && startsWith(github.ref_name, 'run_amd_scheduled_ci_caller'))) + uses: ./.github/workflows/self-scheduled-amd.yml + with: + gpu_flavor: mi250 + secrets: inherit diff --git a/.github/workflows/self-scheduled-amd.yml b/.github/workflows/self-scheduled-amd.yml new file mode 100644 index 00000000000000..69f5f861a3ffcd --- /dev/null +++ b/.github/workflows/self-scheduled-amd.yml @@ -0,0 +1,519 @@ +name: Self-hosted runner (scheduled-amd) + +# Note: For the AMD CI, we rely on a caller workflow and on the workflow_call event to trigger the +# CI in order to run it on both MI210 and MI250, without having to use matrix here which pushes +# us towards the limit of allowed jobs on GitHub Actions. +on: + workflow_call: + inputs: + gpu_flavor: + required: true + type: string + +env: + HF_HOME: /mnt/cache + TRANSFORMERS_IS_CI: yes + OMP_NUM_THREADS: 8 + MKL_NUM_THREADS: 8 + RUN_SLOW: yes + HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }} + SIGOPT_API_TOKEN: ${{ secrets.SIGOPT_API_TOKEN }} + + +# Important note: each job (run_tests_single_gpu, run_tests_multi_gpu, run_examples_gpu, run_pipelines_torch_gpu) requires all the previous jobs before running. +# This is done so that we avoid parallelizing the scheduled tests, to leave available +# runners for the push CI that is running on the same machine. 
+jobs: + check_runner_status: + name: Check Runner Status + runs-on: ubuntu-22.04 + steps: + - name: Checkout transformers + uses: actions/checkout@v3 + with: + fetch-depth: 2 + + - name: Check Runner Status + run: python utils/check_self_hosted_runner.py --target_runners hf-amd-mi210-ci-1gpu-1,hf-amd-mi250-ci-1gpu-1 --token ${{ secrets.ACCESS_REPO_INFO_TOKEN }} + + check_runners: + name: Check Runners + needs: check_runner_status + strategy: + matrix: + machine_type: [single-gpu, multi-gpu] + runs-on: [self-hosted, docker-gpu, amd-gpu, '${{ matrix.machine_type }}', '${{ inputs.gpu_flavor }}'] + container: + image: huggingface/transformers-pytorch-amd-gpu + options: --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ + steps: + - name: ROCM-SMI + run: | + rocm-smi + - name: ROCM-INFO + run: | + rocminfo | grep "Agent" -A 14 + - name: Show ROCR environment + run: | + echo "ROCR: $ROCR_VISIBLE_DEVICES" + + setup: + name: Setup + needs: check_runners + strategy: + matrix: + machine_type: [single-gpu, multi-gpu] + runs-on: [self-hosted, docker-gpu, amd-gpu, '${{ matrix.machine_type }}', '${{ inputs.gpu_flavor }}'] + container: + image: huggingface/transformers-pytorch-amd-gpu + options: --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ + outputs: + matrix: ${{ steps.set-matrix.outputs.matrix }} + steps: + - name: Update clone + working-directory: /transformers + run: | + git fetch && git checkout ${{ github.sha }} + + - name: Cleanup + working-directory: /transformers + run: | + rm -rf tests/__pycache__ + rm -rf tests/models/__pycache__ + rm -rf reports + + - name: Show installed libraries and their versions + working-directory: /transformers + run: pip freeze + + - id: set-matrix + name: Identify models to test + working-directory: /transformers/tests + run: | + echo "matrix=$(python3 -c 'import os; tests = os.getcwd(); model_tests = os.listdir(os.path.join(tests, "models")); d1 = sorted(list(filter(os.path.isdir, os.listdir(tests)))); d2 = sorted(list(filter(os.path.isdir, [f"models/{x}" for x in model_tests]))); d1.remove("models"); d = d2 + d1; print(d)')" >> $GITHUB_OUTPUT + + - name: ROCM-SMI + run: | + rocm-smi + + - name: ROCM-INFO + run: | + rocminfo | grep "Agent" -A 14 + - name: Show ROCR environment + run: | + echo "ROCR: $ROCR_VISIBLE_DEVICES" + + - name: Environment + working-directory: /transformers + run: | + python3 utils/print_env.py + + run_tests_single_gpu: + name: Single GPU tests + strategy: + max-parallel: 1 # For now, not to parallelize. Can change later if it works well. + fail-fast: false + matrix: + folders: ${{ fromJson(needs.setup.outputs.matrix) }} + machine_type: [single-gpu] + runs-on: [self-hosted, docker-gpu, amd-gpu, '${{ matrix.machine_type }}', '${{ inputs.gpu_flavor }}'] + container: + image: huggingface/transformers-pytorch-amd-gpu + options: --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ + needs: setup + steps: + - name: Echo folder ${{ matrix.folders }} + shell: bash + # For folders like `models/bert`, set an env. var. (`matrix_folders`) to `models_bert`, which will be used to + # set the artifact folder names (because the character `/` is not allowed). 
+ run: | + echo "${{ matrix.folders }}" + matrix_folders=${{ matrix.folders }} + matrix_folders=${matrix_folders/'models/'/'models_'} + echo "$matrix_folders" + echo "matrix_folders=$matrix_folders" >> $GITHUB_ENV + + - name: Update clone + working-directory: /transformers + run: git fetch && git checkout ${{ github.sha }} + + - name: Reinstall transformers in edit mode (remove the one installed during docker image build) + working-directory: /transformers + run: python3 -m pip uninstall -y transformers && python3 -m pip install -e . + + - name: ROCM-SMI + run: | + rocm-smi + - name: ROCM-INFO + run: | + rocminfo | grep "Agent" -A 14 + - name: Show ROCR environment + run: | + echo "ROCR: $ROCR_VISIBLE_DEVICES" + + - name: Environment + working-directory: /transformers + run: | + python3 utils/print_env.py + + - name: Show installed libraries and their versions + working-directory: /transformers + run: pip freeze + + - name: Run all tests on GPU + working-directory: /transformers + run: python3 -m pytest -v --make-reports=${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }} tests/${{ matrix.folders }} + + - name: Failure short reports + if: ${{ failure() }} + continue-on-error: true + run: cat /transformers/reports/${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }}/failures_short.txt + + - name: "Test suite reports artifacts: ${{ matrix.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports" + if: ${{ always() }} + uses: actions/upload-artifact@v3 + with: + name: ${{ matrix.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports + path: /transformers/reports/${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }} + + run_tests_multi_gpu: + name: Multi GPU tests + strategy: + max-parallel: 1 + fail-fast: false + matrix: + folders: ${{ fromJson(needs.setup.outputs.matrix) }} + machine_type: [multi-gpu] + runs-on: [self-hosted, docker-gpu, amd-gpu, '${{ matrix.machine_type }}', '${{ inputs.gpu_flavor }}'] + container: + image: huggingface/transformers-pytorch-amd-gpu + options: --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ + needs: setup + steps: + - name: Echo folder ${{ matrix.folders }} + shell: bash + # For folders like `models/bert`, set an env. var. (`matrix_folders`) to `models_bert`, which will be used to + # set the artifact folder names (because the character `/` is not allowed). + run: | + echo "${{ matrix.folders }}" + matrix_folders=${{ matrix.folders }} + matrix_folders=${matrix_folders/'models/'/'models_'} + echo "$matrix_folders" + echo "matrix_folders=$matrix_folders" >> $GITHUB_ENV + + - name: Update clone + working-directory: /transformers + run: git fetch && git checkout ${{ github.sha }} + + - name: Reinstall transformers in edit mode (remove the one installed during docker image build) + working-directory: /transformers + run: python3 -m pip uninstall -y transformers && python3 -m pip install -e . 
+ + - name: ROCM-SMI + run: | + rocm-smi + - name: ROCM-INFO + run: | + rocminfo | grep "Agent" -A 14 + - name: Show ROCR environment + run: | + echo "ROCR: $ROCR_VISIBLE_DEVICES" + + - name: Environment + working-directory: /transformers + run: | + python3 utils/print_env.py + + - name: Show installed libraries and their versions + working-directory: /transformers + run: pip freeze + + - name: Run all tests on GPU + working-directory: /transformers + run: python3 -m pytest -v --make-reports=${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }} tests/${{ matrix.folders }} + + - name: Failure short reports + if: ${{ failure() }} + continue-on-error: true + run: cat /transformers/reports/${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }}/failures_short.txt + + - name: "Test suite reports artifacts: ${{ matrix.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports" + if: ${{ always() }} + uses: actions/upload-artifact@v3 + with: + name: ${{ matrix.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports + path: /transformers/reports/${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }} + + run_examples_gpu: + name: Examples tests + strategy: + fail-fast: false + matrix: + machine_type: [single-gpu] + runs-on: [self-hosted, docker-gpu, amd-gpu, '${{ matrix.machine_type }}', '${{ inputs.gpu_flavor }}'] + container: + image: huggingface/transformers-pytorch-amd-gpu + options: --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ + needs: setup + steps: + - name: Update clone + working-directory: /transformers + run: git fetch && git checkout ${{ github.sha }} + + - name: Reinstall transformers in edit mode (remove the one installed during docker image build) + working-directory: /transformers + run: python3 -m pip uninstall -y transformers && python3 -m pip install -e . 
+ + - name: ROCM-SMI + run: | + rocm-smi + - name: ROCM-INFO + run: | + rocminfo | grep "Agent" -A 14 + - name: Show ROCR environment + run: | + echo "ROCR: $ROCR_VISIBLE_DEVICES" + + - name: Environment + working-directory: /transformers + run: | + python3 utils/print_env.py + + - name: Show installed libraries and their versions + working-directory: /transformers + run: pip freeze + + - name: Run examples tests on GPU + working-directory: /transformers + run: | + pip install -r examples/pytorch/_tests_requirements.txt + python3 -m pytest -v --make-reports=${{ matrix.machine_type }}_examples_gpu examples/pytorch + + - name: Failure short reports + if: ${{ failure() }} + continue-on-error: true + run: cat /transformers/reports/${{ matrix.machine_type }}_examples_gpu/failures_short.txt + + - name: "Test suite reports artifacts: ${{ matrix.machine_type }}_run_examples_gpu" + if: ${{ always() }} + uses: actions/upload-artifact@v3 + with: + name: ${{ matrix.machine_type }}_run_examples_gpu + path: /transformers/reports/${{ matrix.machine_type }}_examples_gpu + + run_pipelines_torch_gpu: + name: PyTorch pipelines tests + strategy: + fail-fast: false + matrix: + machine_type: [single-gpu, multi-gpu] + runs-on: [self-hosted, docker-gpu, amd-gpu, '${{ matrix.machine_type }}', '${{ inputs.gpu_flavor }}'] + container: + image: huggingface/transformers-pytorch-amd-gpu + options: --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ + needs: setup + steps: + - name: Update clone + working-directory: /transformers + run: git fetch && git checkout ${{ github.sha }} + + - name: Reinstall transformers in edit mode (remove the one installed during docker image build) + working-directory: /transformers + run: python3 -m pip uninstall -y transformers && python3 -m pip install -e . 
+ + - name: ROCM-SMI + run: | + rocm-smi + - name: ROCM-INFO + run: | + rocminfo | grep "Agent" -A 14 + - name: Show ROCR environment + run: | + echo "ROCR: $ROCR_VISIBLE_DEVICES" + + - name: Environment + working-directory: /transformers + run: | + python3 utils/print_env.py + + - name: Show installed libraries and their versions + working-directory: /transformers + run: pip freeze + + - name: Run all pipeline tests on GPU + working-directory: /transformers + run: | + python3 -m pytest -n 1 -v --dist=loadfile --make-reports=${{ matrix.machine_type }}_tests_torch_pipeline_gpu tests/pipelines + + - name: Failure short reports + if: ${{ failure() }} + continue-on-error: true + run: cat /transformers/reports/${{ matrix.machine_type }}_tests_torch_pipeline_gpu/failures_short.txt + + - name: "Test suite reports artifacts: ${{ matrix.machine_type }}_run_tests_torch_pipeline_gpu" + if: ${{ always() }} + uses: actions/upload-artifact@v3 + with: + name: ${{ matrix.machine_type }}_run_tests_torch_pipeline_gpu + path: /transformers/reports/${{ matrix.machine_type }}_tests_torch_pipeline_gpu + + run_tests_torch_deepspeed_gpu: + name: Torch ROCm deepspeed tests + strategy: + fail-fast: false + matrix: + machine_type: [single-gpu, multi-gpu] + + runs-on: [self-hosted, docker-gpu, amd-gpu, '${{ matrix.machine_type }}', '${{ inputs.gpu_flavor }}'] + needs: setup + container: + image: huggingface/transformers-pytorch-deepspeed-amd-gpu + options: --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ + steps: + - name: Update clone + working-directory: /transformers + run: git fetch && git checkout ${{ github.sha }} + + - name: Reinstall transformers in edit mode (remove the one installed during docker image build) + working-directory: /transformers + run: python3 -m pip uninstall -y transformers && python3 -m pip install -e . 
+ + - name: ROCM-SMI + run: | + rocm-smi + - name: ROCM-INFO + run: | + rocminfo | grep "Agent" -A 14 + + - name: Show ROCR environment + run: | + echo "ROCR: $ROCR_VISIBLE_DEVICES" + + - name: Environment + working-directory: /transformers + run: | + python3 utils/print_env.py + + - name: Show installed libraries and their versions + working-directory: /transformers + run: pip freeze + + - name: Run all tests on GPU + working-directory: /transformers + run: python3 -m pytest -v --make-reports=${{ matrix.machine_type }}_tests_torch_deepspeed_gpu tests/deepspeed tests/extended + + - name: Failure short reports + if: ${{ failure() }} + continue-on-error: true + run: cat /transformers/reports/${{ matrix.machine_type }}_tests_torch_deepspeed_gpu/failures_short.txt + + - name: "Test suite reports artifacts: ${{ matrix.machine_type }}_run_tests_torch_deepspeed_gpu_test_reports" + if: ${{ always() }} + uses: actions/upload-artifact@v3 + with: + name: ${{ matrix.machine_type }}_run_tests_torch_deepspeed_gpu_test_reports + path: /transformers/reports/${{ matrix.machine_type }}_tests_torch_deepspeed_gpu + + run_extract_warnings: + name: Extract warnings in CI artifacts + runs-on: ubuntu-22.04 + if: always() + needs: [ + check_runner_status, + check_runners, + setup, + run_tests_single_gpu, + run_tests_multi_gpu, + run_examples_gpu, + run_pipelines_torch_gpu, + run_tests_torch_deepspeed_gpu + ] + steps: + - name: Checkout transformers + uses: actions/checkout@v3 + with: + fetch-depth: 2 + + - name: Install transformers + run: pip install transformers + + - name: Show installed libraries and their versions + run: pip freeze + + - name: Create output directory + run: mkdir warnings_in_ci + + - uses: actions/download-artifact@v3 + with: + path: warnings_in_ci + + - name: Show artifacts + run: echo "$(python3 -c 'import os; d = os.listdir(); print(d)')" + working-directory: warnings_in_ci + + - name: Extract warnings in CI artifacts + run: | + python3 utils/extract_warnings.py --workflow_run_id ${{ github.run_id }} --output_dir warnings_in_ci --token ${{ secrets.ACCESS_REPO_INFO_TOKEN }} --from_gh + echo "$(python3 -c 'import os; import json; fp = open("warnings_in_ci/selected_warnings.json"); d = json.load(fp); d = "\n".join(d) ;print(d)')" + + - name: Upload artifact + if: ${{ always() }} + uses: actions/upload-artifact@v3 + with: + name: warnings_in_ci + path: warnings_in_ci/selected_warnings.json + + send_results: + name: Send results to webhook + runs-on: ubuntu-22.04 + if: always() + needs: [ + check_runner_status, + check_runners, + setup, + run_tests_single_gpu, + run_tests_multi_gpu, + run_examples_gpu, + run_pipelines_torch_gpu, + run_tests_torch_deepspeed_gpu, + run_extract_warnings + ] + steps: + - name: Preliminary job status + shell: bash + # For the meaning of these environment variables, see the job `Setup` + run: | + echo "Runner availability: ${{ needs.check_runner_status.result }}" + echo "Runner status: ${{ needs.check_runners.result }}" + echo "Setup status: ${{ needs.setup.result }}" + + - uses: actions/checkout@v3 + - uses: actions/download-artifact@v3 + - name: Send message to Slack + env: + CI_SLACK_BOT_TOKEN: ${{ secrets.CI_SLACK_BOT_TOKEN }} + CI_SLACK_CHANNEL_ID_DAILY_AMD: ${{ secrets.CI_SLACK_CHANNEL_ID_DAILY_AMD }} + CI_SLACK_CHANNEL_DUMMY_TESTS: ${{ secrets.CI_SLACK_CHANNEL_DUMMY_TESTS }} + CI_SLACK_REPORT_CHANNEL_ID: ${{ secrets.CI_SLACK_CHANNEL_ID_DAILY_AMD }} + ACCESS_REPO_INFO_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }} + CI_EVENT: Scheduled CI (AMD) - ${{ 
inputs.gpu_flavor }} + CI_SHA: ${{ github.sha }} + CI_WORKFLOW_REF: ${{ github.workflow_ref }} + RUNNER_STATUS: ${{ needs.check_runner_status.result }} + RUNNER_ENV_STATUS: ${{ needs.check_runners.result }} + SETUP_STATUS: ${{ needs.setup.result }} + # We pass `needs.setup.outputs.matrix` as the argument. A processing in `notification_service.py` to change + # `models/bert` to `models_bert` is required, as the artifact names use `_` instead of `/`. + run: | + sudo apt-get install -y curl + pip install slack_sdk + pip show slack_sdk + python utils/notification_service.py "${{ needs.setup.outputs.matrix }}" + + # Upload complete failure tables, as they might be big and only truncated versions could be sent to Slack. + - name: Failure table artifacts + if: ${{ always() }} + uses: actions/upload-artifact@v3 + with: + name: test_failure_tables + path: test_failure_tables diff --git a/.github/workflows/self-scheduled.yml b/.github/workflows/self-scheduled.yml index 750f4a95694398..d44e9a29ecf0da 100644 --- a/.github/workflows/self-scheduled.yml +++ b/.github/workflows/self-scheduled.yml @@ -9,7 +9,10 @@ name: Self-hosted runner (scheduled) on: repository_dispatch: schedule: - - cron: "0 2 * * *" + - cron: "17 2 * * *" + push: + branches: + - run_scheduled_ci* env: HF_HOME: /mnt/cache @@ -17,50 +20,28 @@ env: OMP_NUM_THREADS: 8 MKL_NUM_THREADS: 8 RUN_SLOW: yes + # For gated repositories, we still need to agree to share information on the Hub repo. page in order to get access. + # This token is created under the bot `hf-transformers-bot`. + HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }} SIGOPT_API_TOKEN: ${{ secrets.SIGOPT_API_TOKEN }} TF_FORCE_GPU_ALLOW_GROWTH: true RUN_PT_TF_CROSS_TESTS: 1 + CUDA_VISIBLE_DEVICES: 0,1 + NUM_SLICES: 2 jobs: - check_runner_status: - name: Check Runner Status - runs-on: ubuntu-latest - steps: - - name: Checkout transformers - uses: actions/checkout@v3 - with: - fetch-depth: 2 - - - name: Check Runner Status - run: python utils/check_self_hosted_runner.py --target_runners single-gpu-scheduled-ci-runner-docker,multi-gpu-scheduled-ci-runner-docker --token ${{ secrets.ACCESS_REPO_INFO_TOKEN }} - - check_runners: - name: Check Runners - needs: check_runner_status - strategy: - matrix: - machine_type: [single-gpu, multi-gpu] - runs-on: ${{ format('{0}-{1}', matrix.machine_type, 'docker') }} - container: - image: huggingface/transformers-all-latest-gpu - options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ - steps: - - name: NVIDIA-SMI - run: | - nvidia-smi - setup: name: Setup - needs: check_runners strategy: matrix: machine_type: [single-gpu, multi-gpu] - runs-on: ${{ format('{0}-{1}', matrix.machine_type, 'docker') }} + runs-on: ['${{ matrix.machine_type }}', nvidia-gpu, t4, daily-ci] container: image: huggingface/transformers-all-latest-gpu options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ outputs: - matrix: ${{ steps.set-matrix.outputs.matrix }} + folder_slices: ${{ steps.set-matrix.outputs.folder_slices }} + slice_ids: ${{ steps.set-matrix.outputs.slice_ids }} steps: - name: Update clone working-directory: /transformers @@ -82,125 +63,27 @@ jobs: name: Identify models to test working-directory: /transformers/tests run: | - echo "matrix=$(python3 -c 'import os; tests = os.getcwd(); model_tests = os.listdir(os.path.join(tests, "models")); d1 = sorted(list(filter(os.path.isdir, os.listdir(tests)))); d2 = sorted(list(filter(os.path.isdir, [f"models/{x}" for x in model_tests]))); 
d1.remove("models"); d = d2 + d1; print(d)')" >> $GITHUB_OUTPUT + echo "folder_slices=$(python3 ../utils/split_model_tests.py --num_splits ${{ env.NUM_SLICES }})" >> $GITHUB_OUTPUT + echo "slice_ids=$(python3 -c 'd = list(range(${{ env.NUM_SLICES }})); print(d)')" >> $GITHUB_OUTPUT - name: NVIDIA-SMI run: | nvidia-smi - run_tests_single_gpu: - name: Model tests - strategy: - fail-fast: false - matrix: - folders: ${{ fromJson(needs.setup.outputs.matrix) }} - machine_type: [single-gpu] - runs-on: ${{ format('{0}-{1}', matrix.machine_type, 'docker') }} - container: - image: huggingface/transformers-all-latest-gpu - options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ + run_tests_gpu: + name: " " needs: setup - steps: - - name: Echo folder ${{ matrix.folders }} - shell: bash - # For folders like `models/bert`, set an env. var. (`matrix_folders`) to `models_bert`, which will be used to - # set the artifact folder names (because the character `/` is not allowed). - run: | - echo "${{ matrix.folders }}" - matrix_folders=${{ matrix.folders }} - matrix_folders=${matrix_folders/'models/'/'models_'} - echo "$matrix_folders" - echo "matrix_folders=$matrix_folders" >> $GITHUB_ENV - - - name: Update clone - working-directory: /transformers - run: git fetch && git checkout ${{ github.sha }} - - - name: NVIDIA-SMI - run: | - nvidia-smi - - - name: Environment - working-directory: /transformers - run: | - python3 utils/print_env.py - - - name: Show installed libraries and their versions - working-directory: /transformers - run: pip freeze - - - name: Run all tests on GPU - working-directory: /transformers - run: python3 -m pytest -v --make-reports=${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }} tests/${{ matrix.folders }} - - - name: Failure short reports - if: ${{ failure() }} - continue-on-error: true - run: cat /transformers/reports/${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }}/failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v3 - with: - name: ${{ matrix.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports - path: /transformers/reports/${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }} - - run_tests_multi_gpu: - name: Model tests strategy: fail-fast: false matrix: - folders: ${{ fromJson(needs.setup.outputs.matrix) }} - machine_type: [multi-gpu] - runs-on: ${{ format('{0}-{1}', matrix.machine_type, 'docker') }} - container: - image: huggingface/transformers-all-latest-gpu - options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ - needs: setup - steps: - - name: Echo folder ${{ matrix.folders }} - shell: bash - # For folders like `models/bert`, set an env. var. (`matrix_folders`) to `models_bert`, which will be used to - # set the artifact folder names (because the character `/` is not allowed). 
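The new `setup` job exposes `folder_slices` and `slice_ids`, which `run_tests_gpu` forwards to the reusable `model_jobs.yml` workflow so that each `(machine_type, slice_id)` matrix entry runs only one slice of the model test folders. A rough sketch of the slicing idea, assuming `utils/split_model_tests.py --num_splits` simply partitions the folder list into `NUM_SLICES` near-equal chunks (the actual script may differ):

```python
# A sketch only: assumes the split is a simple contiguous partition of the
# folder list into `num_splits` near-equal chunks. The real
# utils/split_model_tests.py may use a different strategy.
def split_folders(folders, num_splits):
    """Partition `folders` into `num_splits` contiguous, near-equal slices."""
    per_slice = -(-len(folders) // num_splits)  # ceiling division
    return [folders[i * per_slice:(i + 1) * per_slice] for i in range(num_splits)]


if __name__ == "__main__":
    # Toy folder list standing in for what the `setup` job discovers under tests/.
    folders = ["models/albert", "models/bert", "models/gpt2", "pipelines", "utils"]
    folder_slices = split_folders(folders, num_splits=2)  # NUM_SLICES: 2
    slice_ids = list(range(2))                            # mirrors the `slice_ids` output
    print(folder_slices)  # [['models/albert', 'models/bert', 'models/gpt2'], ['pipelines', 'utils']]
    print(slice_ids)      # [0, 1]
```

Presumably `model_jobs.yml` then indexes into `folder_slices` with its `slice_id` to recover the folders it should test.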
- run: | - echo "${{ matrix.folders }}" - matrix_folders=${{ matrix.folders }} - matrix_folders=${matrix_folders/'models/'/'models_'} - echo "$matrix_folders" - echo "matrix_folders=$matrix_folders" >> $GITHUB_ENV - - - name: Update clone - working-directory: /transformers - run: git fetch && git checkout ${{ github.sha }} - - - name: NVIDIA-SMI - run: | - nvidia-smi - - - name: Environment - working-directory: /transformers - run: | - python3 utils/print_env.py - - - name: Show installed libraries and their versions - working-directory: /transformers - run: pip freeze - - - name: Run all tests on GPU - working-directory: /transformers - run: python3 -m pytest -v --make-reports=${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }} tests/${{ matrix.folders }} - - - name: Failure short reports - if: ${{ failure() }} - continue-on-error: true - run: cat /transformers/reports/${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }}/failures_short.txt - - - name: Test suite reports artifacts - if: ${{ always() }} - uses: actions/upload-artifact@v3 - with: - name: ${{ matrix.machine_type }}_run_all_tests_gpu_${{ env.matrix_folders }}_test_reports - path: /transformers/reports/${{ matrix.machine_type }}_tests_gpu_${{ matrix.folders }} + machine_type: [single-gpu, multi-gpu] + slice_id: ${{ fromJSON(needs.setup.outputs.slice_ids) }} + uses: ./.github/workflows/model_jobs.yml + with: + folder_slices: ${{ needs.setup.outputs.folder_slices }} + machine_type: ${{ matrix.machine_type }} + slice_id: ${{ matrix.slice_id }} + secrets: inherit run_examples_gpu: name: Examples directory @@ -208,7 +91,7 @@ jobs: fail-fast: false matrix: machine_type: [single-gpu] - runs-on: ${{ format('{0}-{1}', matrix.machine_type, 'docker') }} + runs-on: ['${{ matrix.machine_type }}', nvidia-gpu, t4, daily-ci] container: image: huggingface/transformers-all-latest-gpu options: --gpus 0 --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ @@ -218,6 +101,10 @@ jobs: working-directory: /transformers run: git fetch && git checkout ${{ github.sha }} + - name: Reinstall transformers in edit mode (remove the one installed during docker image build) + working-directory: /transformers + run: python3 -m pip uninstall -y transformers && python3 -m pip install -e . + - name: NVIDIA-SMI run: | nvidia-smi @@ -242,7 +129,7 @@ jobs: continue-on-error: true run: cat /transformers/reports/${{ matrix.machine_type }}_examples_gpu/failures_short.txt - - name: Test suite reports artifacts + - name: "Test suite reports artifacts: ${{ matrix.machine_type }}_run_examples_gpu" if: ${{ always() }} uses: actions/upload-artifact@v3 with: @@ -255,7 +142,7 @@ jobs: fail-fast: false matrix: machine_type: [single-gpu, multi-gpu] - runs-on: ${{ format('{0}-{1}', matrix.machine_type, 'docker') }} + runs-on: ['${{ matrix.machine_type }}', nvidia-gpu, t4, daily-ci] container: image: huggingface/transformers-pytorch-gpu options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ @@ -265,6 +152,10 @@ jobs: working-directory: /transformers run: git fetch && git checkout ${{ github.sha }} + - name: Reinstall transformers in edit mode (remove the one installed during docker image build) + working-directory: /transformers + run: python3 -m pip uninstall -y transformers && python3 -m pip install -e . 
+ - name: NVIDIA-SMI run: | nvidia-smi @@ -288,7 +179,7 @@ jobs: continue-on-error: true run: cat /transformers/reports/${{ matrix.machine_type }}_tests_torch_pipeline_gpu/failures_short.txt - - name: Test suite reports artifacts + - name: "Test suite reports artifacts: ${{ matrix.machine_type }}_run_tests_torch_pipeline_gpu" if: ${{ always() }} uses: actions/upload-artifact@v3 with: @@ -301,7 +192,7 @@ jobs: fail-fast: false matrix: machine_type: [single-gpu, multi-gpu] - runs-on: ${{ format('{0}-{1}', matrix.machine_type, 'docker') }} + runs-on: ['${{ matrix.machine_type }}', nvidia-gpu, t4, daily-ci] container: image: huggingface/transformers-tensorflow-gpu options: --gpus all --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/ @@ -312,6 +203,10 @@ jobs: run: | git fetch && git checkout ${{ github.sha }} + - name: Reinstall transformers in edit mode (remove the one installed during docker image build) + working-directory: /transformers + run: python3 -m pip uninstall -y transformers && python3 -m pip install -e . + - name: NVIDIA-SMI run: | nvidia-smi @@ -335,7 +230,7 @@ jobs: run: | cat /transformers/reports/${{ matrix.machine_type }}_tests_tf_pipeline_gpu/failures_short.txt - - name: Test suite reports artifacts + - name: "Test suite reports artifacts: ${{ matrix.machine_type }}_run_tests_tf_pipeline_gpu" if: ${{ always() }} uses: actions/upload-artifact@v3 with: @@ -348,7 +243,7 @@ jobs: fail-fast: false matrix: machine_type: [single-gpu, multi-gpu] - runs-on: ${{ format('{0}-{1}', matrix.machine_type, 'docker') }} + runs-on: ['${{ matrix.machine_type }}', nvidia-gpu, t4, daily-ci] needs: setup container: image: huggingface/transformers-pytorch-deepspeed-latest-gpu @@ -358,6 +253,10 @@ jobs: working-directory: /workspace/transformers run: git fetch && git checkout ${{ github.sha }} + - name: Reinstall transformers in edit mode (remove the one installed during docker image build) + working-directory: /workspace/transformers + run: python3 -m pip uninstall -y transformers && python3 -m pip install -e . 
+ - name: Remove cached torch extensions run: rm -rf /github/home/.cache/torch_extensions/ @@ -366,7 +265,7 @@ jobs: working-directory: /workspace run: | python3 -m pip uninstall -y deepspeed - DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check + DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check - name: NVIDIA-SMI run: | @@ -391,7 +290,7 @@ jobs: continue-on-error: true run: cat /workspace/transformers/reports/${{ matrix.machine_type }}_tests_torch_cuda_extensions_gpu/failures_short.txt - - name: Test suite reports artifacts + - name: "Test suite reports artifacts: ${{ matrix.machine_type }}_run_tests_torch_cuda_extensions_gpu_test_reports" if: ${{ always() }} uses: actions/upload-artifact@v3 with: @@ -400,14 +299,11 @@ jobs: run_extract_warnings: name: Extract warnings in CI artifacts - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 if: always() needs: [ - check_runner_status, - check_runners, setup, - run_tests_single_gpu, - run_tests_multi_gpu, + run_tests_gpu, run_examples_gpu, run_pipelines_tf_gpu, run_pipelines_torch_gpu, @@ -450,14 +346,11 @@ jobs: send_results: name: Send results to webhook - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 if: always() needs: [ - check_runner_status, - check_runners, setup, - run_tests_single_gpu, - run_tests_multi_gpu, + run_tests_gpu, run_examples_gpu, run_pipelines_tf_gpu, run_pipelines_torch_gpu, @@ -469,8 +362,6 @@ jobs: shell: bash # For the meaning of these environment variables, see the job `Setup` run: | - echo "Runner availability: ${{ needs.check_runner_status.result }}" - echo "Runner status: ${{ needs.check_runners.result }}" echo "Setup status: ${{ needs.setup.result }}" - uses: actions/checkout@v3 @@ -482,13 +373,23 @@ jobs: CI_SLACK_CHANNEL_ID_DAILY: ${{ secrets.CI_SLACK_CHANNEL_ID_DAILY }} CI_SLACK_CHANNEL_DUMMY_TESTS: ${{ secrets.CI_SLACK_CHANNEL_DUMMY_TESTS }} CI_SLACK_REPORT_CHANNEL_ID: ${{ secrets.CI_SLACK_CHANNEL_ID_DAILY }} + ACCESS_REPO_INFO_TOKEN: ${{ secrets.ACCESS_REPO_INFO_TOKEN }} CI_EVENT: scheduled - RUNNER_STATUS: ${{ needs.check_runner_status.result }} - RUNNER_ENV_STATUS: ${{ needs.check_runners.result }} + CI_SHA: ${{ github.sha }} + CI_WORKFLOW_REF: ${{ github.workflow_ref }} SETUP_STATUS: ${{ needs.setup.result }} # We pass `needs.setup.outputs.matrix` as the argument. A processing in `notification_service.py` to change # `models/bert` to `models_bert` is required, as the artifact names use `_` instead of `/`. run: | + sudo apt-get install -y curl pip install slack_sdk pip show slack_sdk - python utils/notification_service.py "${{ needs.setup.outputs.matrix }}" + python utils/notification_service.py "${{ needs.setup.outputs.folder_slices }}" + + # Upload complete failure tables, as they might be big and only truncated versions could be sent to Slack. 
+ - name: Failure table artifacts + if: ${{ always() }} + uses: actions/upload-artifact@v3 + with: + name: prev_ci_results + path: prev_ci_results diff --git a/.github/workflows/stale.yml b/.github/workflows/stale.yml index 9412442a7d0a78..4a7e94bac429db 100644 --- a/.github/workflows/stale.yml +++ b/.github/workflows/stale.yml @@ -2,13 +2,13 @@ name: Stale Bot on: schedule: - - cron: "0 15 * * *" + - cron: "0 8 * * *" jobs: close_stale_issues: name: Close Stale Issues if: github.repository == 'huggingface/transformers' - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 env: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} steps: @@ -17,7 +17,7 @@ jobs: - name: Setup Python uses: actions/setup-python@v4 with: - python-version: 3.7 + python-version: 3.8 - name: Install requirements run: | diff --git a/.github/workflows/update_metdata.yml b/.github/workflows/update_metdata.yml index f6c9afd15b7e5d..a2269e32e4d3cd 100644 --- a/.github/workflows/update_metdata.yml +++ b/.github/workflows/update_metdata.yml @@ -4,11 +4,11 @@ on: push: branches: - main - - update_transformers_metadata + - update_transformers_metadata* jobs: build_and_package: - runs-on: ubuntu-latest + runs-on: ubuntu-22.04 defaults: run: shell: bash -l {0} @@ -16,25 +16,12 @@ jobs: steps: - uses: actions/checkout@v3 - - name: Load cached virtual environment - uses: actions/cache@v2 - id: cache - with: - path: ~/venv/ - key: v3-metadata-${{ hashFiles('setup.py') }} - - - name: Create virtual environment on cache miss - if: steps.cache.outputs.cache-hit != 'true' - run: | - python -m venv ~/venv && . ~/venv/bin/activate - pip install --upgrade pip - - name: Setup environment run: | - . ~/venv/bin/activate - pip install git+https://github.com/huggingface/transformers#egg=transformers[dev] + pip install --upgrade pip + pip install datasets pandas==2.0.3 + pip install .[torch,tf,flax] - name: Update metadata run: | - . ~/venv/bin/activate - python utils/update_metadata.py --token ${{ secrets.SYLVAIN_HF_TOKEN }} --commit_sha ${{ github.sha }} + python utils/update_metadata.py --token ${{ secrets.LYSANDRE_HF_TOKEN }} --commit_sha ${{ github.sha }} diff --git a/.github/workflows/upload_pr_documentation.yml b/.github/workflows/upload_pr_documentation.yml new file mode 100644 index 00000000000000..64befc595c421e --- /dev/null +++ b/.github/workflows/upload_pr_documentation.yml @@ -0,0 +1,16 @@ +name: Upload PR Documentation + +on: + workflow_run: + workflows: ["Build PR Documentation"] + types: + - completed + +jobs: + build: + uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main + with: + package_name: transformers + secrets: + hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }} + comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }} \ No newline at end of file diff --git a/.gitignore b/.gitignore index eeb41b3fcaea35..337f2ef2c735e8 100644 --- a/.gitignore +++ b/.gitignore @@ -166,4 +166,4 @@ tags .DS_Store # ruff -.ruff_cache \ No newline at end of file +.ruff_cache diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 50ab222387e38c..9aee200ba4120e 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -40,8 +40,7 @@ There are several ways you can contribute to 🤗 Transformers: If you don't know where to start, there is a special [Good First Issue](https://github.com/huggingface/transformers/contribute) listing. It will give you a list of -open issues that are beginner-friendly and help you start contributing to open-source. Just comment in the issue that you'd like to work -on it. 
+open issues that are beginner-friendly and help you start contributing to open-source. The best way to do that is to open a Pull Request and link it to the issue that you'd like to work on. We try to give priority to opened PRs as we can easily track the progress of the fix, and if the contributor does not have time anymore, someone else can take the PR over. For something slightly more challenging, you can also take a look at the [Good Second Issue](https://github.com/huggingface/transformers/labels/Good%20Second%20Issue) list. In general though, if you feel like you know what you're doing, go for it and we'll help you get there! 🚀 @@ -49,7 +48,7 @@ For something slightly more challenging, you can also take a look at the [Good S ## Fixing outstanding issues -If you notice an issue with the existing code and have a fix in mind, feel free to [start contributing](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md/#create-a-pull-request) and open a Pull Request! +If you notice an issue with the existing code and have a fix in mind, feel free to [start contributing](#create-a-pull-request) and open a Pull Request! ## Submitting a bug-related issue or feature request @@ -62,7 +61,7 @@ feedback. The 🤗 Transformers library is robust and reliable thanks to users who report the problems they encounter. Before you report an issue, we would really appreciate it if you could **make sure the bug was not -already reported** (use the search bar on GitHub under Issues). Your issue should also be related to bugs in the library itself, and not your code. If you're unsure whether the bug is in your code or the library, please ask on the [forum](https://discuss.huggingface.co/) first. This helps us respond quicker to fixing issues related to the library versus general questions. +already reported** (use the search bar on GitHub under Issues). Your issue should also be related to bugs in the library itself, and not your code. If you're unsure whether the bug is in your code or the library, please ask in the [forum](https://discuss.huggingface.co/) first. This helps us respond quicker to fixing issues related to the library versus general questions. Once you've confirmed the bug hasn't already been reported, please include the following information in your issue so we can quickly resolve it: @@ -103,9 +102,9 @@ We have added [templates](https://github.com/huggingface/transformers/tree/main/ ## Do you want to implement a new model? -New models are constantly released and if you want to implement a new model, please provide the following information +New models are constantly released and if you want to implement a new model, please provide the following information: -* A short description of the model and link to the paper. +* A short description of the model and a link to the paper. * Link to the implementation if it is open-sourced. * Link to the model weights if they are available. @@ -130,7 +129,7 @@ You will need basic `git` proficiency to contribute to manual. Type `git --help` in a shell and enjoy! If you prefer books, [Pro Git](https://git-scm.com/book/en/v2) is a very good reference. -You'll need **[Python 3.7]((https://github.com/huggingface/transformers/blob/main/setup.py#L426))** or above to contribute to 🤗 Transformers. Follow the steps below to start contributing: +You'll need **[Python 3.8](https://github.com/huggingface/transformers/blob/main/setup.py#L426)** or above to contribute to 🤗 Transformers. Follow the steps below to start contributing: 1. 
Fork the [repository](https://github.com/huggingface/transformers) by clicking on the **[Fork](https://github.com/huggingface/transformers/fork)** button on the repository's page. This creates a copy of the code @@ -139,15 +138,15 @@ You'll need **[Python 3.7]((https://github.com/huggingface/transformers/blob/mai 2. Clone your fork to your local disk, and add the base repository as a remote: ```bash - $ git clone git@github.com:/transformers.git - $ cd transformers - $ git remote add upstream https://github.com/huggingface/transformers.git + git clone git@github.com:/transformers.git + cd transformers + git remote add upstream https://github.com/huggingface/transformers.git ``` 3. Create a new branch to hold your development changes: ```bash - $ git checkout -b a-descriptive-name-for-my-changes + git checkout -b a-descriptive-name-for-my-changes ``` 🚨 **Do not** work on the `main` branch! @@ -155,28 +154,30 @@ You'll need **[Python 3.7]((https://github.com/huggingface/transformers/blob/mai 4. Set up a development environment by running the following command in a virtual environment: ```bash - $ pip install -e ".[dev]" + pip install -e ".[dev]" ``` If 🤗 Transformers was already installed in the virtual environment, remove it with `pip uninstall transformers` before reinstalling it in editable mode with the `-e` flag. - Depending on your OS, you may need to install some external libraries as well if the `pip` installation fails. - - For macOS, you will likely need [MeCab](https://taku910.github.io/mecab/) which can be installed from Homebrew: - + Depending on your OS, and since the number of optional dependencies of Transformers is growing, you might get a + failure with this command. If that's the case make sure to install the Deep Learning framework you are working with + (PyTorch, TensorFlow and/or Flax) then do: + ```bash - brew install mecab + pip install -e ".[quality]" ``` -5. Develop the features on your branch. + which should be enough for most use cases. + +5. Develop the features in your branch. As you work on your code, you should make sure the test suite passes. Run the tests impacted by your changes like this: ```bash - $ pytest tests/.py + pytest tests/.py ``` For more information about tests, check out the @@ -187,7 +188,7 @@ You'll need **[Python 3.7]((https://github.com/huggingface/transformers/blob/mai that can't be automated in one go with: ```bash - $ make fixup + make fixup ``` This target is also optimized to only work with files modified by the PR you're working on. @@ -196,48 +197,48 @@ You'll need **[Python 3.7]((https://github.com/huggingface/transformers/blob/mai style corrections: ```bash - $ make style + make style ``` 🤗 Transformers also uses `ruff` and a few custom scripts to check for coding mistakes. Quality controls are run by the CI, but you can run the same checks with: ```bash - $ make quality + make quality ``` - Finally, we have a lot of scripts to make sure we didn't forget to update + Finally, we have a lot of scripts to make sure we don't forget to update some files when adding a new model. You can run these scripts with: ```bash - $ make repo-consistency + make repo-consistency ``` To learn more about those checks and how to fix any issues with them, check out the [Checks on a Pull Request](https://huggingface.co/docs/transformers/pr_checks) guide. - If you're modifying documents under `docs/source` directory, make sure the documentation can still be built. This check will also run in the CI when you open a pull request. 
To run a local check + If you're modifying documents under the `docs/source` directory, make sure the documentation can still be built. This check will also run in the CI when you open a pull request. To run a local check make sure you install the documentation builder: ```bash - $ pip install ".[docs]" + pip install ".[docs]" ``` Run the following command from the root of the repository: ```bash - $ doc-builder build transformers docs/source/en --build_dir ~/tmp/test-build + doc-builder build transformers docs/source/en --build_dir ~/tmp/test-build ``` This will build the documentation in the `~/tmp/test-build` folder where you can inspect the generated Markdown files with your favorite editor. You can also preview the docs on GitHub when you open a pull request. - Once you're happy with your changes, add changed files with `git add` and + Once you're happy with your changes, add the changed files with `git add` and record your changes locally with `git commit`: ```bash - $ git add modified_file.py - $ git commit + git add modified_file.py + git commit ``` Please remember to write [good commit @@ -247,19 +248,19 @@ You'll need **[Python 3.7]((https://github.com/huggingface/transformers/blob/mai repository, rebase your branch on `upstream/branch` *before* you open a pull request or if requested by a maintainer: ```bash - $ git fetch upstream - $ git rebase upstream/main + git fetch upstream + git rebase upstream/main ``` Push your changes to your branch: ```bash - $ git push -u origin a-descriptive-name-for-my-changes + git push -u origin a-descriptive-name-for-my-changes ``` If you've already opened a pull request, you'll need to force push with the `--force` flag. Otherwise, if the pull request hasn't been opened yet, you can just push your changes normally. -6. Now you can go to your fork of the repository on GitHub and click on **Pull request** to open a pull request. Make sure you tick off all the boxes in our [checklist](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md/#pull-request-checklist) below. When you're ready, you can send your changes to the project maintainers for review. +6. Now you can go to your fork of the repository on GitHub and click on **Pull Request** to open a pull request. Make sure you tick off all the boxes on our [checklist](#pull-request-checklist) below. When you're ready, you can send your changes to the project maintainers for review. 7. It's ok if maintainers request changes, it happens to our core contributors too! So everyone can see the changes in the pull request, work in your local @@ -273,7 +274,7 @@ You'll need **[Python 3.7]((https://github.com/huggingface/transformers/blob/mai request description to make sure they are linked (and people viewing the issue know you are working on it).
☐ To indicate a work in progress please prefix the title with `[WIP]`. These are -useful to avoid duplicated work, and to differentiate it from PRs ready to be merged. +useful to avoid duplicated work, and to differentiate it from PRs ready to be merged.
☐ Make sure existing tests pass.
☐ If adding a new feature, also add tests for it.
- If you are adding a new model, make sure you use @@ -282,7 +283,7 @@ useful to avoid duplicated work, and to differentiate it from PRs ready to be me `RUN_SLOW=1 python -m pytest tests/models/my_new_model/test_my_new_model.py`. - If you are adding a new tokenizer, write tests and make sure `RUN_SLOW=1 python -m pytest tests/models/{your_model_name}/test_tokenization_{your_model_name}.py` passes. - CircleCI does not run the slow tests, but GitHub Actions does every night!
+ - CircleCI does not run the slow tests, but GitHub Actions does every night!
☐ All public methods must have informative docstrings (see [`modeling_bert.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py) @@ -293,7 +294,7 @@ repository such as [`hf-internal-testing`](https://huggingface.co/hf-internal-te to host these files and reference them by URL. We recommend placing documentation related images in the following repository: [huggingface/documentation-images](https://huggingface.co/datasets/huggingface/documentation-images). -You can open a PR on this dataset repostitory and ask a Hugging Face member to merge it. +You can open a PR on this dataset repository and ask a Hugging Face member to merge it. For more information about the checks run on a pull request, take a look at our [Checks on a Pull Request](https://huggingface.co/docs/transformers/pr_checks) guide. @@ -304,17 +305,17 @@ the [tests](https://github.com/huggingface/transformers/tree/main/tests) folder [examples](https://github.com/huggingface/transformers/tree/main/examples) folder. We like `pytest` and `pytest-xdist` because it's faster. From the root of the -repository, specify a *path to a subfolder or a test file* to run the test. +repository, specify a *path to a subfolder or a test file* to run the test: ```bash -$ python -m pytest -n auto --dist=loadfile -s -v ./tests/models/my_new_model +python -m pytest -n auto --dist=loadfile -s -v ./tests/models/my_new_model ``` Similarly, for the `examples` directory, specify a *path to a subfolder or test file* to run the test. For example, the following command tests the text classification subfolder in the PyTorch `examples` directory: ```bash -$ pip install -r examples/xxx/requirements.txt # only needed the first time -$ python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/text-classification +pip install -r examples/xxx/requirements.txt # only needed the first time +python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/text-classification ``` In fact, this is actually how our `make test` and `make test-examples` commands are implemented (not including the `pip install`)! @@ -333,8 +334,8 @@ Remember to specify a *path to a subfolder or a test file* to run the test. Othe ```bash -$ RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./tests/models/my_new_model -$ RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/text-classification +RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./tests/models/my_new_model +RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/text-classification ``` Like the slow tests, there are other environment variables available which not enabled by default during testing: @@ -351,8 +352,8 @@ This means `unittest` is fully supported. Here's how to run tests with `unittest`: ```bash -$ python -m unittest discover -s tests -t . -v -$ python -m unittest discover -s examples -t examples -v +python -m unittest discover -s tests -t . -v +python -m unittest discover -s examples -t examples -v ``` ### Style guide @@ -363,7 +364,7 @@ for more information. 
### Develop on Windows -On Windows (unless you're working in [Windows Subsytem for Linux](https://learn.microsoft.com/en-us/windows/wsl/) or WSL), you need to configure git to transform Windows `CRLF` line endings to Linux `LF` line endings: +On Windows (unless you're working in [Windows Subsystem for Linux](https://learn.microsoft.com/en-us/windows/wsl/) or WSL), you need to configure git to transform Windows `CRLF` line endings to Linux `LF` line endings: ```bash git config core.autocrlf input @@ -376,7 +377,7 @@ One way to run the `make` command on Windows is with MSYS2: 3. Run in the shell: `pacman -Syu` and install `make` with `pacman -S make`. 4. Add `C:\msys64\usr\bin` to your PATH environment variable. -You can now use `make` from any terminal (Powershell, cmd.exe, etc.)! 🎉 +You can now use `make` from any terminal (PowerShell, cmd.exe, etc.)! 🎉 ### Sync a forked repository with upstream main (the Hugging Face repository) @@ -385,9 +386,9 @@ When updating the main branch of a forked repository, please follow these steps 1. When possible, avoid syncing with the upstream using a branch and PR on the forked repository. Instead, merge directly into the forked main. 2. If a PR is absolutely necessary, use the following steps after checking out your branch: -```bash -$ git checkout -b your-branch-for-syncing -$ git pull --squash --no-commit upstream main -$ git commit -m '' -$ git push --set-upstream origin your-branch-for-syncing -``` + ```bash + git checkout -b your-branch-for-syncing + git pull --squash --no-commit upstream main + git commit -m '' + git push --set-upstream origin your-branch-for-syncing + ``` diff --git a/ISSUES.md b/ISSUES.md index 7c36da3c6804c9..a5969a3027f86d 100644 --- a/ISSUES.md +++ b/ISSUES.md @@ -152,13 +152,13 @@ You are not required to read the following guidelines before opening an issue. H ```bash cd examples/seq2seq - python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py \ + torchrun --nproc_per_node=2 ./finetune_trainer.py \ --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --data_dir wmt_en_ro \ --output_dir output_dir --overwrite_output_dir \ --do_train --n_train 500 --num_train_epochs 1 \ --per_device_train_batch_size 1 --freeze_embeds \ --src_lang en_XX --tgt_lang ro_RO --task translation \ - --fp16 --sharded_ddp + --fp16 ``` If you don't break it up, one has to scroll horizontally which often makes it quite difficult to quickly see what's happening. 
diff --git a/MANIFEST.in b/MANIFEST.in deleted file mode 100644 index 1aba38f67a2211..00000000000000 --- a/MANIFEST.in +++ /dev/null @@ -1 +0,0 @@ -include LICENSE diff --git a/Makefile b/Makefile index 400a35bbfe2e7f..424880ce150a6e 100644 --- a/Makefile +++ b/Makefile @@ -5,12 +5,14 @@ export PYTHONPATH = src check_dirs := examples tests src utils +exclude_folders := examples/research_projects + modified_only_fixup: $(eval modified_py_files := $(shell python utils/get_modified_files.py $(check_dirs))) @if test -n "$(modified_py_files)"; then \ echo "Checking/fixing $(modified_py_files)"; \ - black $(modified_py_files); \ - ruff $(modified_py_files) --fix; \ + ruff check $(modified_py_files) --fix --exclude $(exclude_folders); \ + ruff format $(modified_py_files) --exclude $(exclude_folders);\ else \ echo "No library .py files were modified"; \ fi @@ -41,18 +43,18 @@ repo-consistency: python utils/check_config_docstrings.py python utils/check_config_attributes.py python utils/check_doctest_list.py - python utils/tests_fetcher.py --sanity_check python utils/update_metadata.py --check-only python utils/check_task_guides.py + python utils/check_docstrings.py + python utils/check_support_list.py # this target runs checks on all files quality: - black --check $(check_dirs) + ruff check $(check_dirs) setup.py conftest.py + ruff format --check $(check_dirs) setup.py conftest.py python utils/custom_init_isort.py --check_only python utils/sort_auto_mappings.py --check_only - ruff $(check_dirs) - doc-builder style src/transformers docs/source --max_len 119 --check_only --path_to_docs docs/source python utils/check_doc_toc.py # Format source code automatically and check is there are any problems left that need manual fixing @@ -60,14 +62,13 @@ quality: extra_style_checks: python utils/custom_init_isort.py python utils/sort_auto_mappings.py - doc-builder style src/transformers docs/source --max_len 119 --path_to_docs docs/source python utils/check_doc_toc.py --fix_and_overwrite # this target runs checks on all files and potentially modifies some of them style: - black $(check_dirs) - ruff $(check_dirs) --fix + ruff check $(check_dirs) setup.py conftest.py --fix --exclude $(exclude_folders) + ruff format $(check_dirs) setup.py conftest.py --exclude $(exclude_folders) ${MAKE} autogenerate_code ${MAKE} extra_style_checks @@ -81,7 +82,9 @@ fix-copies: python utils/check_copies.py --fix_and_overwrite python utils/check_table.py --fix_and_overwrite python utils/check_dummies.py --fix_and_overwrite + python utils/check_doctest_list.py --fix_and_overwrite python utils/check_task_guides.py --fix_and_overwrite + python utils/check_docstrings.py --fix_and_overwrite # Run tests for the library @@ -112,3 +115,10 @@ post-release: post-patch: python utils/release.py --post_release --patch + +build-release: + rm -rf dist + rm -rf build + python setup.py bdist_wheel + python setup.py sdist + python utils/check_build.py diff --git a/README.md b/README.md index 794edef2491673..b7077ce61032ba 100644 --- a/README.md +++ b/README.md @@ -15,10 +15,15 @@ limitations under the License. -->

-
- -
-

+ + + + Hugging Face Transformers Library + +
+
+

+

Build @@ -46,8 +51,13 @@ limitations under the License. 한국어 | Español | 日本語 | - हिन्दी -

+ हिन्दी | + Русский | + Рortuguês | + తెలుగు | + Français | + Deutsch | +

@@ -62,7 +72,7 @@ limitations under the License. These models can be applied on: -* 📝 Text, for tasks like text classification, information extraction, question answering, summarization, translation, text generation, in over 100 languages. +* 📝 Text, for tasks like text classification, information extraction, question answering, summarization, translation, and text generation, in over 100 languages. * 🖼️ Images, for tasks like image classification, object detection, and segmentation. * 🗣️ Audio, for tasks like speech recognition and audio classification. @@ -78,37 +88,52 @@ You can test most of our models directly on their pages from the [model hub](htt Here are a few examples: - In Natural Language Processing: -- [Masked word completion with BERT](https://huggingface.co/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France) -- [Name Entity Recognition with Electra](https://huggingface.co/dbmdz/electra-large-discriminator-finetuned-conll03-english?text=My+name+is+Sarah+and+I+live+in+London+city) -- [Text generation with GPT-2](https://huggingface.co/gpt2?text=A+long+time+ago%2C+) -- [Natural Language Inference with RoBERTa](https://huggingface.co/roberta-large-mnli?text=The+dog+was+lost.+Nobody+lost+any+animal) +In Natural Language Processing: +- [Masked word completion with BERT](https://huggingface.co/google-bert/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France) +- [Named Entity Recognition with Electra](https://huggingface.co/dbmdz/electra-large-discriminator-finetuned-conll03-english?text=My+name+is+Sarah+and+I+live+in+London+city) +- [Text generation with Mistral](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) +- [Natural Language Inference with RoBERTa](https://huggingface.co/FacebookAI/roberta-large-mnli?text=The+dog+was+lost.+Nobody+lost+any+animal) - [Summarization with BART](https://huggingface.co/facebook/bart-large-cnn?text=The+tower+is+324+metres+%281%2C063+ft%29+tall%2C+about+the+same+height+as+an+81-storey+building%2C+and+the+tallest+structure+in+Paris.+Its+base+is+square%2C+measuring+125+metres+%28410+ft%29+on+each+side.+During+its+construction%2C+the+Eiffel+Tower+surpassed+the+Washington+Monument+to+become+the+tallest+man-made+structure+in+the+world%2C+a+title+it+held+for+41+years+until+the+Chrysler+Building+in+New+York+City+was+finished+in+1930.+It+was+the+first+structure+to+reach+a+height+of+300+metres.+Due+to+the+addition+of+a+broadcasting+aerial+at+the+top+of+the+tower+in+1957%2C+it+is+now+taller+than+the+Chrysler+Building+by+5.2+metres+%2817+ft%29.+Excluding+transmitters%2C+the+Eiffel+Tower+is+the+second+tallest+free-standing+structure+in+France+after+the+Millau+Viaduct) -- [Question answering with 
DistilBERT](https://huggingface.co/distilbert-base-uncased-distilled-squad?text=Which+name+is+also+used+to+describe+the+Amazon+rainforest+in+English%3F&context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species) -- [Translation with T5](https://huggingface.co/t5-base?text=My+name+is+Wolfgang+and+I+live+in+Berlin) +- [Question answering with DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased-distilled-squad?text=Which+name+is+also+used+to+describe+the+Amazon+rainforest+in+English%3F&context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species) +- [Translation with T5](https://huggingface.co/google-t5/t5-base?text=My+name+is+Wolfgang+and+I+live+in+Berlin) In Computer Vision: - [Image classification with ViT](https://huggingface.co/google/vit-base-patch16-224) - [Object Detection with DETR](https://huggingface.co/facebook/detr-resnet-50) - [Semantic Segmentation with SegFormer](https://huggingface.co/nvidia/segformer-b0-finetuned-ade-512-512) -- [Panoptic Segmentation with MaskFormer](https://huggingface.co/facebook/maskformer-swin-small-coco) -- [Depth Estimation with DPT](https://huggingface.co/docs/transformers/model_doc/dpt) +- [Panoptic Segmentation with Mask2Former](https://huggingface.co/facebook/mask2former-swin-large-coco-panoptic) +- [Depth Estimation with Depth 
Anything](https://huggingface.co/docs/transformers/main/model_doc/depth_anything) - [Video Classification with VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae) - [Universal Segmentation with OneFormer](https://huggingface.co/shi-labs/oneformer_ade20k_dinat_large) In Audio: -- [Automatic Speech Recognition with Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base-960h) +- [Automatic Speech Recognition with Whisper](https://huggingface.co/openai/whisper-large-v3) - [Keyword Spotting with Wav2Vec2](https://huggingface.co/superb/wav2vec2-base-superb-ks) - [Audio Classification with Audio Spectrogram Transformer](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) In Multimodal tasks: - [Table Question Answering with TAPAS](https://huggingface.co/google/tapas-base-finetuned-wtq) - [Visual Question Answering with ViLT](https://huggingface.co/dandelin/vilt-b32-finetuned-vqa) -- [Zero-shot Image Classification with CLIP](https://huggingface.co/openai/clip-vit-large-patch14) +- [Image captioning with LLaVa](https://huggingface.co/llava-hf/llava-1.5-7b-hf) +- [Zero-shot Image Classification with SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) - [Document Question Answering with LayoutLM](https://huggingface.co/impira/layoutlm-document-qa) - [Zero-shot Video Classification with X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip) +- [Zero-shot Object Detection with OWLv2](https://huggingface.co/docs/transformers/en/model_doc/owlv2) +- [Zero-shot Image Segmentation with CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg) +- [Automatic Mask Generation with SAM](https://huggingface.co/docs/transformers/model_doc/sam) + + +## 100 projects using Transformers -**[Write With Transformer](https://transformer.huggingface.co)**, built by the Hugging Face team, is the official demo of this repo’s text generation capabilities. +Transformers is more than a toolkit to use pretrained models: it's a community of projects built around it and the +Hugging Face Hub. We want Transformers to enable developers, researchers, students, professors, engineers, and anyone +else to build their dream projects. + +In order to celebrate the 100,000 stars of transformers, we have decided to put the spotlight on the +community, and we have created the [awesome-transformers](./awesome-transformers.md) page which lists 100 +incredible projects built in the vicinity of transformers. + +If you own or use a project that you believe should be part of the list, please open a PR to add it! ## If you are looking for custom support from the Hugging Face team @@ -129,7 +154,7 @@ To immediately use a model on a given input (text, image, audio, ...), we provid [{'label': 'POSITIVE', 'score': 0.9996980428695679}] ``` -The second line of code downloads and caches the pretrained model used by the pipeline, while the third evaluates it on the given text. Here the answer is "positive" with a confidence of 99.97%. +The second line of code downloads and caches the pretrained model used by the pipeline, while the third evaluates it on the given text. Here, the answer is "positive" with a confidence of 99.97%. Many tasks have a pre-trained `pipeline` ready to go, in NLP but also in computer vision and speech. 
For example, we can easily extract detected objects in an image: @@ -163,7 +188,7 @@ Many tasks have a pre-trained `pipeline` ready to go, in NLP but also in compute 'box': {'xmin': 345, 'ymin': 23, 'xmax': 640, 'ymax': 368}}] ``` -Here we get a list of objects detected in the image, with a box surrounding the object and a confidence score. Here is the original image on the left, with the predictions displayed on the right: +Here, we get a list of objects detected in the image, with a box surrounding the object and a confidence score. Here is the original image on the left, with the predictions displayed on the right:

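A minimal sketch of the kind of call that yields such detections, assuming the `facebook/detr-resnet-50` checkpoint linked earlier and a placeholder image path:

```python
>>> from transformers import pipeline

>>> # "savanna.jpg" is a placeholder; an image URL or a PIL.Image works as well
>>> object_detector = pipeline("object-detection", model="facebook/detr-resnet-50")
>>> object_detector("savanna.jpg")  # returns a list of {'score', 'label', 'box'} dicts like the ones above
```

Passing `model=...` pins a specific checkpoint; omitting it falls back to the task's default model.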
@@ -176,8 +201,8 @@ In addition to `pipeline`, to download and use any of the pretrained models on y ```python >>> from transformers import AutoTokenizer, AutoModel ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") ->>> model = AutoModel.from_pretrained("bert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = AutoModel.from_pretrained("google-bert/bert-base-uncased") >>> inputs = tokenizer("Hello world!", return_tensors="pt") >>> outputs = model(**inputs) @@ -187,14 +212,14 @@ And here is the equivalent code for TensorFlow: ```python >>> from transformers import AutoTokenizer, TFAutoModel ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") ->>> model = TFAutoModel.from_pretrained("bert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = TFAutoModel.from_pretrained("google-bert/bert-base-uncased") >>> inputs = tokenizer("Hello world!", return_tensors="tf") >>> outputs = model(**inputs) ``` -The tokenizer is responsible for all the preprocessing the pretrained model expects, and can be called directly on a single string (as in the above examples) or a list. It will output a dictionary that you can use in downstream code or simply directly pass to your model using the ** argument unpacking operator. +The tokenizer is responsible for all the preprocessing the pretrained model expects and can be called directly on a single string (as in the above examples) or a list. It will output a dictionary that you can use in downstream code or simply directly pass to your model using the ** argument unpacking operator. The model itself is a regular [Pytorch `nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) or a [TensorFlow `tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) (depending on your backend) which you can use as usual. [This tutorial](https://huggingface.co/docs/transformers/training) explains how to integrate such a model into a classic PyTorch or TensorFlow training loop, or how to use our `Trainer` API to quickly fine-tune on a new dataset. @@ -209,12 +234,12 @@ The model itself is a regular [Pytorch `nn.Module`](https://pytorch.org/docs/sta 1. Lower compute costs, smaller carbon footprint: - Researchers can share trained models instead of always retraining. - Practitioners can reduce compute time and production costs. - - Dozens of architectures with over 60,000 pretrained models across all modalities. + - Dozens of architectures with over 400,000 pretrained models across all modalities. 1. Choose the right framework for every part of a model's lifetime: - Train state-of-the-art models in 3 lines of code. - Move a single model between TF2.0/PyTorch/JAX frameworks at will. - - Seamlessly pick the right framework for training, evaluation and production. + - Seamlessly pick the right framework for training, evaluation, and production. 1. Easily customize a model or an example to your needs: - We provide examples for each architecture to reproduce the results published by its original authors. @@ -225,19 +250,19 @@ The model itself is a regular [Pytorch `nn.Module`](https://pytorch.org/docs/sta - This library is not a modular toolbox of building blocks for neural nets. The code in the model files is not refactored with additional abstractions on purpose, so that researchers can quickly iterate on each of the models without diving into additional abstractions/files. 
- The training API is not intended to work on any model but is optimized to work with the models provided by the library. For generic machine learning loops, you should use another library (possibly, [Accelerate](https://huggingface.co/docs/accelerate)). -- While we strive to present as many use cases as possible, the scripts in our [examples folder](https://github.com/huggingface/transformers/tree/main/examples) are just that: examples. It is expected that they won't work out-of-the box on your specific problem and that you will be required to change a few lines of code to adapt them to your needs. +- While we strive to present as many use cases as possible, the scripts in our [examples folder](https://github.com/huggingface/transformers/tree/main/examples) are just that: examples. It is expected that they won't work out-of-the-box on your specific problem and that you will be required to change a few lines of code to adapt them to your needs. ## Installation ### With pip -This repository is tested on Python 3.6+, Flax 0.3.2+, PyTorch 1.3.1+ and TensorFlow 2.3+. +This repository is tested on Python 3.8+, Flax 0.4.1+, PyTorch 1.11+, and TensorFlow 2.6+. You should install 🤗 Transformers in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're unfamiliar with Python virtual environments, check out the [user guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/). First, create a virtual environment with the version of Python you're going to use and activate it. -Then, you will need to install at least one of Flax, PyTorch or TensorFlow. +Then, you will need to install at least one of Flax, PyTorch, or TensorFlow. Please refer to [TensorFlow installation page](https://www.tensorflow.org/install/), [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) and/or [Flax](https://github.com/google/flax#quick-install) and [Jax](https://github.com/google/jax#installation) installation pages regarding the specific installation command for your platform. When one of those backends has been installed, 🤗 Transformers can be installed using pip as follows: @@ -250,34 +275,37 @@ If you'd like to play with the examples or need the bleeding edge of the code an ### With conda -Since Transformers version v4.0.0, we now have a conda channel: `huggingface`. - 🤗 Transformers can be installed using conda as follows: ```shell script -conda install -c huggingface transformers +conda install conda-forge::transformers ``` +> **_NOTE:_** Installing `transformers` from the `huggingface` channel is deprecated. + Follow the installation pages of Flax, PyTorch or TensorFlow to see how to install them with conda. > **_NOTE:_** On Windows, you may be prompted to activate Developer Mode in order to benefit from caching. If this is not an option for you, please let us know in [this issue](https://github.com/huggingface/huggingface_hub/issues/1062). ## Model architectures -**[All the model checkpoints](https://huggingface.co/models)** provided by 🤗 Transformers are seamlessly integrated from the huggingface.co [model hub](https://huggingface.co/models) where they are uploaded directly by [users](https://huggingface.co/users) and [organizations](https://huggingface.co/organizations). 
+**[All the model checkpoints](https://huggingface.co/models)** provided by 🤗 Transformers are seamlessly integrated from the huggingface.co [model hub](https://huggingface.co/models), where they are uploaded directly by [users](https://huggingface.co/users) and [organizations](https://huggingface.co/organizations). Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/models&color=brightgreen) 🤗 Transformers currently provides the following architectures (see [here](https://huggingface.co/docs/transformers/model_summary) for a high-level summary of each of them): 1. **[ALBERT](https://huggingface.co/docs/transformers/model_doc/albert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. +1. **[ALIGN](https://huggingface.co/docs/transformers/model_doc/align)** (from Google Research) released with the paper [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918) by Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. 1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell. 1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass. -1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer. +1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long. +1. **[Bark](https://huggingface.co/docs/transformers/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team. +1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis. 1.
**[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen. 1. **[BEiT](https://huggingface.co/docs/transformers/model_doc/beit)** (from Microsoft) released with the paper [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254) by Hangbo Bao, Li Dong, Furu Wei. -1. **[BERT](https://huggingface.co/docs/transformers/model_doc/bert)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. +1. **[BERT](https://huggingface.co/docs/transformers/model_doc/bert)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 1. **[BERT For Sequence Generation](https://huggingface.co/docs/transformers/model_doc/bert-generation)** (from Google) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn. 1. **[BERTweet](https://huggingface.co/docs/transformers/model_doc/bertweet)** (from VinAI Research) released with the paper [BERTweet: A pre-trained language model for English Tweets](https://aclanthology.org/2020.emnlp-demos.2/) by Dat Quoc Nguyen, Thanh Vu and Anh Tuan Nguyen. 1. **[BigBird-Pegasus](https://huggingface.co/docs/transformers/model_doc/bigbird_pegasus)** (from Google Research) released with the paper [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed. @@ -287,22 +315,27 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h 1. **[Blenderbot](https://huggingface.co/docs/transformers/model_doc/blenderbot)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston. 1. **[BlenderbotSmall](https://huggingface.co/docs/transformers/model_doc/blenderbot-small)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston. 1. **[BLIP](https://huggingface.co/docs/transformers/model_doc/blip)** (from Salesforce) released with the paper [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086) by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi. -1. **[BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2)** (from Salesforce) released with the paper [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597) by Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. +1. 
**[BLIP-2](https://huggingface.co/docs/transformers/model_doc/blip-2)** (from Salesforce) released with the paper [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597) by Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. 1. **[BLOOM](https://huggingface.co/docs/transformers/model_doc/bloom)** (from BigScience workshop) released by the [BigScience Workshop](https://bigscience.huggingface.co/). 1. **[BORT](https://huggingface.co/docs/transformers/model_doc/bort)** (from Alexa) released with the paper [Optimal Subarchitecture Extraction For BERT](https://arxiv.org/abs/2010.10499) by Adrian de Wynter and Daniel J. Perry. -1. **[BridgeTower](https://huggingface.co/docs/transformers/main/model_doc/bridgetower)** (from Harbin Institute of Technology/Microsoft Research Asia/Intel Labs) released with the paper [BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning](https://arxiv.org/abs/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. +1. **[BridgeTower](https://huggingface.co/docs/transformers/model_doc/bridgetower)** (from Harbin Institute of Technology/Microsoft Research Asia/Intel Labs) released with the paper [BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning](https://arxiv.org/abs/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. +1. **[BROS](https://huggingface.co/docs/transformers/model_doc/bros)** (from NAVER CLOVA) released with the paper [BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents](https://arxiv.org/abs/2108.04539) by Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park. 1. **[ByT5](https://huggingface.co/docs/transformers/model_doc/byt5)** (from Google Research) released with the paper [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel. 1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot. 1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. 1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou. -1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation]https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. +1. 
**[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. 1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. 1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker. +1. **[CLVP](https://huggingface.co/docs/transformers/model_doc/clvp)** released with the paper [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker. 1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong. +1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (from MetaAI) released with the paper [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve. 1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (from Microsoft Research Asia) released with the paper [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang. 1. **[ConvBERT](https://huggingface.co/docs/transformers/model_doc/convbert)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan. 1. **[ConvNeXT](https://huggingface.co/docs/transformers/model_doc/convnext)** (from Facebook AI) released with the paper [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie. +1. **[ConvNeXTV2](https://huggingface.co/docs/transformers/model_doc/convnextv2)** (from Facebook AI) released with the paper [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie. 1. 
**[CPM](https://huggingface.co/docs/transformers/model_doc/cpm)** (from Tsinghua University) released with the paper [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun. +1. **[CPM-Ant](https://huggingface.co/docs/transformers/model_doc/cpmant)** (from OpenBMB) released by the [OpenBMB](https://www.openbmb.org/). 1. **[CTRL](https://huggingface.co/docs/transformers/model_doc/ctrl)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher. 1. **[CvT](https://huggingface.co/docs/transformers/model_doc/cvt)** (from Microsoft) released with the paper [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang. 1. **[Data2Vec](https://huggingface.co/docs/transformers/model_doc/data2vec)** (from Facebook) released with the paper [Data2Vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli. @@ -311,41 +344,58 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h 1. **[Decision Transformer](https://huggingface.co/docs/transformers/model_doc/decision_transformer)** (from Berkeley/Facebook/Google) released with the paper [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345) by Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch. 1. **[Deformable DETR](https://huggingface.co/docs/transformers/model_doc/deformable_detr)** (from SenseTime Research) released with the paper [Deformable DETR: Deformable Transformers for End-to-End Object Detection](https://arxiv.org/abs/2010.04159) by Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai. 1. **[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou. -1. **[DETA](https://huggingface.co/docs/transformers/main/model_doc/deta)** (from The University of Texas at Austin) released with the paper [NMS Strikes Back](https://arxiv.org/abs/2212.06137) by Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl. +1. **[DePlot](https://huggingface.co/docs/transformers/model_doc/deplot)** (from Google AI) released with the paper [DePlot: One-shot visual language reasoning by plot-to-table translation](https://arxiv.org/abs/2212.10505) by Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, Yasemin Altun. +1. 
**[Depth Anything](https://huggingface.co/docs/transformers/main/model_doc/depth_anything)** (from University of Hong Kong and TikTok) released with the paper [Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891) by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao. +1. **[DETA](https://huggingface.co/docs/transformers/model_doc/deta)** (from The University of Texas at Austin) released with the paper [NMS Strikes Back](https://arxiv.org/abs/2212.06137) by Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl. 1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko. 1. **[DialoGPT](https://huggingface.co/docs/transformers/model_doc/dialogpt)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan. 1. **[DiNAT](https://huggingface.co/docs/transformers/model_doc/dinat)** (from SHI Labs) released with the paper [Dilated Neighborhood Attention Transformer](https://arxiv.org/abs/2209.15001) by Ali Hassani and Humphrey Shi. +1. **[DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2)** (from Meta AI) released with the paper [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski. 1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) and a German version of DistilBERT. 1. **[DiT](https://huggingface.co/docs/transformers/model_doc/dit)** (from Microsoft Research) released with the paper [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei. 1. **[Donut](https://huggingface.co/docs/transformers/model_doc/donut)** (from NAVER), released together with the paper [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park. 1. 
**[DPR](https://huggingface.co/docs/transformers/model_doc/dpr)** (from Facebook) released with the paper [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 1. **[DPT](https://huggingface.co/docs/transformers/master/model_doc/dpt)** (from Intel Labs) released with the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun. 1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (from Snap Research) released with the paper [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) by Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren. +1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le. 1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. +1. **[EnCodec](https://huggingface.co/docs/transformers/model_doc/encodec)** (from Meta AI) released with the paper [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438) by Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi. 1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn. 1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (from Baidu) released with the paper [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) by Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu. -1. **[ErnieM](https://huggingface.co/docs/transformers/main/model_doc/ernie_m)** (from Baidu) released with the paper [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) by Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. +1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (from Baidu) released with the paper [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) by Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. 1. **[ESM](https://huggingface.co/docs/transformers/model_doc/esm)** (from Meta AI) are transformer protein language models. **ESM-1b** was released with the paper [Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences](https://www.pnas.org/content/118/15/e2016239118) by Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. 
**ESM-1v** was released with the paper [Language models enable zero-shot prediction of the effects of mutations on protein function](https://doi.org/10.1101/2021.07.09.450648) by Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu and Alexander Rives. **ESM-2 and ESMFold** were released with the paper [Language models of protein sequences at the scale of evolution enable accurate structure prediction](https://doi.org/10.1101/2022.07.20.500902) by Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, Alexander Rives. +1. **[Falcon](https://huggingface.co/docs/transformers/model_doc/falcon)** (from Technology Innovation Institute) by Almazrouei, Ebtesam and Alobeidli, Hamza and Alshamsi, Abdulaziz and Cappelli, Alessandro and Cojocaru, Ruxandra and Debbah, Merouane and Goffinet, Etienne and Heslow, Daniel and Launay, Julien and Malartic, Quentin and Noune, Badreddine and Pannier, Baptiste and Penedo, Guilherme. +1. **[FastSpeech2Conformer](model_doc/fastspeech2_conformer)** (from ESPnet) released with the paper [Recent Developments On Espnet Toolkit Boosted By Conformer](https://arxiv.org/abs/2010.13956) by Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang. 1. **[FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei +1. **[FLAN-UL2](https://huggingface.co/docs/transformers/model_doc/flan-ul2)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-ul2-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei 1. **[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab. 1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc/flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. 1. 
**[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. +1. **[FocalNet](https://huggingface.co/docs/transformers/model_doc/focalnet)** (from Microsoft Research) released with the paper [Focal Modulation Networks](https://arxiv.org/abs/2203.11926) by Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao. 1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le. +1. **[Fuyu](https://huggingface.co/docs/transformers/model_doc/fuyu)** (from ADEPT) Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar. Released with the paper [blog post](https://www.adept.ai/blog/fuyu-8b) 1. **[GIT](https://huggingface.co/docs/transformers/model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang. 1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim. -1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. +1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://openai.com/research/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. 1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. 1. **[GPT NeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox)** (from EleutherAI) released with the paper [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745) by Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach 1. **[GPT NeoX Japanese](https://huggingface.co/docs/transformers/model_doc/gpt_neox_japanese)** (from ABEJA) released by Shinya Otani, Takayoshi Makabe, Anuj Arora, and Kyo Hattori. -1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**. +1. 
**[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://openai.com/research/better-language-models/) by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. 1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki. 1. **[GPT-Sw3](https://huggingface.co/docs/transformers/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren. +1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra. +1. **[GPTSAN-japanese](https://huggingface.co/docs/transformers/model_doc/gptsan-japanese)** released in the repository [tanreinama/GPTSAN](https://github.com/tanreinama/GPTSAN/blob/main/report/model.md) by Toshiyuki Sakamoto(tanreinama). 1. **[Graphormer](https://huggingface.co/docs/transformers/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu. 1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang. +1. **[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** (from Allegro.pl, AGH University of Science and Technology) released with the paper [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik. 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed. 1. 
**[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer. +1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. +1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. +1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (from Salesforce) released with the paper [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. 1. **[Jukebox](https://huggingface.co/docs/transformers/model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever. +1. **[KOSMOS-2](https://huggingface.co/docs/transformers/model_doc/kosmos-2)** (from Microsoft Research Asia) released with the paper [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei. 1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou. 1. **[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou. 1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (from Microsoft Research Asia) released with the paper [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei. @@ -353,43 +403,69 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h 1. 
**[LED](https://huggingface.co/docs/transformers/model_doc/led)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan. 1. **[LeViT](https://huggingface.co/docs/transformers/model_doc/levit)** (from Meta AI) released with the paper [LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference](https://arxiv.org/abs/2104.01136) by Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, Matthijs Douze. 1. **[LiLT](https://huggingface.co/docs/transformers/model_doc/lilt)** (from South China University of Technology) released with the paper [LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding](https://arxiv.org/abs/2202.13669) by Jiapeng Wang, Lianwen Jin, Kai Ding. +1. **[LLaMA](https://huggingface.co/docs/transformers/model_doc/llama)** (from The FAIR team of Meta AI) released with the paper [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971) by Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. +1. **[Llama2](https://huggingface.co/docs/transformers/model_doc/llama2)** (from The FAIR team of Meta AI) released with the paper [Llama2: Open Foundation and Fine-Tuned Chat Models](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/) by Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom. +1. **[LLaVa](https://huggingface.co/docs/transformers/model_doc/llava)** (from Microsoft Research & University of Wisconsin-Madison) released with the paper [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485) by Haotian Liu, Chunyuan Li, Yuheng Li and Yong Jae Lee. 1. **[Longformer](https://huggingface.co/docs/transformers/model_doc/longformer)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan. 1. **[LongT5](https://huggingface.co/docs/transformers/model_doc/longt5)** (from Google AI) released with the paper [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/abs/2112.07916) by Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang. 1.
**[LUKE](https://huggingface.co/docs/transformers/model_doc/luke)** (from Studio Ousia) released with the paper [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto. 1. **[LXMERT](https://huggingface.co/docs/transformers/model_doc/lxmert)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal. 1. **[M-CTC-T](https://huggingface.co/docs/transformers/model_doc/mctct)** (from Facebook) released with the paper [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161) by Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert. 1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin. +1. **[MADLAD-400](https://huggingface.co/docs/transformers/model_doc/madlad-400)** (from Google) released with the paper [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](https://arxiv.org/abs/2309.04662) by Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat. 1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team. 1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (from Microsoft Research Asia) released with the paper [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei. 1. **[Mask2Former](https://huggingface.co/docs/transformers/model_doc/mask2former)** (from FAIR and UIUC) released with the paper [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) by Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar. 1. **[MaskFormer](https://huggingface.co/docs/transformers/model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov. +1. **[MatCha](https://huggingface.co/docs/transformers/model_doc/matcha)** (from Google AI) released with the paper [MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering](https://arxiv.org/abs/2212.09662) by Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, Julian Martin Eisenschlos. 1. 
**[mBART](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. 1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan. +1. **[MEGA](https://huggingface.co/docs/transformers/model_doc/mega)** (from Meta/USC/CMU/SJTU) released with the paper [Mega: Moving Average Equipped Gated Attention](https://arxiv.org/abs/2209.10655) by Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. 1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. 1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. +1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (from Alibaba Research) released with the paper [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) by Peng Wang, Cheng Da, and Cong Yao. +1. **[Mistral](https://huggingface.co/docs/transformers/model_doc/mistral)** (from Mistral AI) by The [Mistral AI](https://mistral.ai) team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. +1. **[Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral)** (from Mistral AI) by The [Mistral AI](https://mistral.ai) team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. 1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (from Studio Ousia) released with the paper [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka. +1. 
**[MMS](https://huggingface.co/docs/transformers/model_doc/mms)** (from Facebook) released with the paper [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516) by Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli. 1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (from Google Inc.) released with the paper [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam. 1. **[MobileNetV2](https://huggingface.co/docs/transformers/model_doc/mobilenet_v2)** (from Google Inc.) released with the paper [MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/abs/1801.04381) by Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen. 1. **[MobileViT](https://huggingface.co/docs/transformers/model_doc/mobilevit)** (from Apple) released with the paper [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari. +1. **[MobileViTV2](https://huggingface.co/docs/transformers/model_doc/mobilevitv2)** (from Apple) released with the paper [Separable Self-attention for Mobile Vision Transformers](https://arxiv.org/abs/2206.02680) by Sachin Mehta and Mohammad Rastegari. 1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu. +1. **[MPT](https://huggingface.co/docs/transformers/model_doc/mpt)** (from MosaicML) released with the repository [llm-foundry](https://github.com/mosaicml/llm-foundry/) by the MosaicML NLP Team. +1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (from the University of Wisconsin - Madison) released with the paper [Multi Resolution Analysis (MRA) for Approximate Self-Attention](https://arxiv.org/abs/2207.10284) by Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh. 1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel. +1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez. 1.
**[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (from RUC AI Box) released with the paper [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen. 1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (from SHI Labs) released with the paper [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi. 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu. 1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team. +1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team. +1. **[Nougat](https://huggingface.co/docs/transformers/model_doc/nougat)** (from Meta AI) released with the paper [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) by Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic. 1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh. 1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi. +1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released on GitHub (now removed). 1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al. 1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. +1. **[OWLv2](https://huggingface.co/docs/transformers/model_doc/owlv2)** (from Google AI) released with the paper [Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683) by Matthias Minderer, Alexey Gritsenko, Neil Houlsby. +1. 
**[PatchTSMixer](https://huggingface.co/docs/transformers/model_doc/patchtsmixer)** (from IBM Research) released with the paper [TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting](https://arxiv.org/pdf/2306.09364.pdf) by Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, Jayant Kalagnanam. +1. **[PatchTST](https://huggingface.co/docs/transformers/model_doc/patchtst)** (from IBM) released with the paper [A Time Series is Worth 64 Words: Long-term Forecasting with Transformers](https://arxiv.org/abs/2211.14730) by Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, Jayant Kalagnanam. 1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu. 1. **[PEGASUS-X](https://huggingface.co/docs/transformers/model_doc/pegasus_x)** (from Google) released with the paper [Investigating Efficiently Extending Transformers for Long Input Summarization](https://arxiv.org/abs/2208.04347) by Jason Phang, Yao Zhao, and Peter J. Liu. 1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira. +1. **[Persimmon](https://huggingface.co/docs/transformers/model_doc/persimmon)** (from ADEPT) released in a [blog post](https://www.adept.ai/blog/persimmon-8b) by Erich Elsen, Augustus Odena, Maxwell Nye, Sağnak Taşırlar, Tri Dao, Curtis Hawthorne, Deepak Moparthi, Arushi Somani. +1. **[Phi](https://huggingface.co/docs/transformers/model_doc/phi)** (from Microsoft) released with the papers - [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li, [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee. 1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen. +1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (from Google) released with the paper [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) by Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova. 1. 
**[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. 1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng. +1. **[Pop2Piano](https://huggingface.co/docs/transformers/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi and Kyogu Lee. 1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. +1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (from Nanjing University, The University of Hong Kong etc.) released with the paper [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao. 1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius. +1. **[Qwen2](https://huggingface.co/docs/transformers/model_doc/qwen2)** (from the Qwen team, Alibaba Group) released with the paper [Qwen Technical Report](https://arxiv.org/abs/2309.16609) by Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou and Tianhang Zhu. 1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (from Facebook) released with the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) by Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. 1. **[REALM](https://huggingface.co/docs/transformers/model_doc/realm.html)** (from Google Research) released with the paper [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909) by Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang. 1. 
**[Reformer](https://huggingface.co/docs/transformers/model_doc/reformer)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya. @@ -400,14 +476,21 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h 1. **[RoBERTa-PreLayerNorm](https://huggingface.co/docs/transformers/model_doc/roberta-prelayernorm)** (from Facebook) released with the paper [fairseq: A Fast, Extensible Toolkit for Sequence Modeling](https://arxiv.org/abs/1904.01038) by Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli. 1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou. 1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu. +1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng), released on [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng. +1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team. +1. **[SeamlessM4Tv2](https://huggingface.co/docs/transformers/model_doc/seamless_m4t_v2)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team. 1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo. +1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick. 1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi. 1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi. -1. 
**[SpeechT5](https://huggingface.co/docs/transformers/main/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei. +1. **[SigLIP](https://huggingface.co/docs/transformers/model_doc/siglip)** (from Google AI) released with the paper [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer. +1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei. 1. **[SpeechToTextTransformer](https://huggingface.co/docs/transformers/model_doc/speech_to_text)** (from Facebook), released together with the paper [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino. 1. **[SpeechToTextTransformer2](https://huggingface.co/docs/transformers/model_doc/speech_to_text_2)** (from Facebook), released together with the paper [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau. 1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (from Tel Aviv University), released together with the paper [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy. 1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer. +1. **[StableLm](https://huggingface.co/docs/transformers/main/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu. +1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (from MBZUAI) released with the paper [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446) by Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan. 1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo. 1. 
**[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo. 1. **[Swin2SR](https://huggingface.co/docs/transformers/model_doc/swin2sr)** (from University of Würzburg) released with the paper [Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration](https://arxiv.org/abs/2209.11345) by Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte. @@ -422,38 +505,47 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h 1. **[Trajectory Transformer](https://huggingface.co/docs/transformers/model_doc/trajectory_transformers)** (from the University of California at Berkeley) released with the paper [Offline Reinforcement Learning as One Big Sequence Modeling Problem](https://arxiv.org/abs/2106.02039) by Michael Janner, Qiyang Li, Sergey Levine 1. **[Transformer-XL](https://huggingface.co/docs/transformers/model_doc/transfo-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. 1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. -1. **[TVLT](https://huggingface.co/docs/transformers/main/model_doc/tvlt)** (from UNC Chapel Hill) released with the paper [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) by Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal. +1. **[TVLT](https://huggingface.co/docs/transformers/model_doc/tvlt)** (from UNC Chapel Hill) released with the paper [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) by Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal. +1. **[TVP](https://huggingface.co/docs/transformers/model_doc/tvp)** (from Intel) released with the paper [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding. 1. **[UL2](https://huggingface.co/docs/transformers/model_doc/ul2)** (from Google Research) released with the paper [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) by Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler +1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (from Google Research) released with the paper [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) by Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant. 1. 
**[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang. 1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu. +1. **[UnivNet](https://huggingface.co/docs/transformers/model_doc/univnet)** (from Kakao Corporation) released with the paper [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim. 1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (from Peking University) released with the paper [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221) by Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun. 1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (from Tsinghua University and Nankai University) released with the paper [Visual Attention Network](https://arxiv.org/abs/2202.09741) by Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu. 1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (from Multimedia Computing Group, Nanjing University) released with the paper [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) by Zhan Tong, Yibing Song, Jue Wang, Limin Wang. 1. **[ViLT](https://huggingface.co/docs/transformers/model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim. +1. **[VipLlava](https://huggingface.co/docs/transformers/model_doc/vipllava)** (from University of Wisconsin–Madison) released with the paper [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https://arxiv.org/abs/2312.00784) by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee. 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. 1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang. 1. 
**[ViT Hybrid](https://huggingface.co/docs/transformers/model_doc/vit_hybrid)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. +1. **[VitDet](https://huggingface.co/docs/transformers/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. +1. **[ViTMatte](https://huggingface.co/docs/transformers/model_doc/vitmatte)** (from HUST-VL) released with the paper [ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers](https://arxiv.org/abs/2305.15272) by Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang. 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas. +1. **[VITS](https://huggingface.co/docs/transformers/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son. +1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli. +1. **[Wav2Vec2-BERT](https://huggingface.co/docs/transformers/model_doc/wav2vec2-bert)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team. 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino. 1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli. 1. 
**[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei. 1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever. 1. **[X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling. -1. **[X-MOD](https://huggingface.co/docs/transformers/main/model_doc/xmod)** (from Meta AI) released with the paper [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](http://dx.doi.org/10.18653/v1/2022.naacl-main.255) by Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe. +1. **[X-MOD](https://huggingface.co/docs/transformers/model_doc/xmod)** (from Meta AI) released with the paper [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](http://dx.doi.org/10.18653/v1/2022.naacl-main.255) by Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe. 1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li. 1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau. 1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. 1. **[XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. 1. 
**[XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/model_doc/xlm-roberta-xl)** (from Facebook AI), released together with the paper [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://arxiv.org/abs/2105.00572) by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau. 1. **[XLM-V](https://huggingface.co/docs/transformers/model_doc/xlm-v)** (from Meta AI) released with the paper [XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models](https://arxiv.org/abs/2301.10472) by Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, Madian Khabsa. -1. **[XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet)** (from Google/CMU) released with the paper [​XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le. +1. **[XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le. 1. **[XLS-R](https://huggingface.co/docs/transformers/model_doc/xls_r)** (from Facebook AI) released with the paper [XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale](https://arxiv.org/abs/2111.09296) by Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli. 1. **[XLSR-Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/xlsr_wav2vec2)** (from Facebook AI) released with the paper [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://arxiv.org/abs/2006.13979) by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli. 1. **[YOLOS](https://huggingface.co/docs/transformers/model_doc/yolos)** (from Huazhong University of Science & Technology) released with the paper [You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection](https://arxiv.org/abs/2106.00666) by Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu. 1. **[YOSO](https://huggingface.co/docs/transformers/model_doc/yoso)** (from the University of Wisconsin - Madison) released with the paper [You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling](https://arxiv.org/abs/2111.09714) by Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh. -1. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR. +1. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedback before starting your PR. 
To check if each model has an implementation in Flax, PyTorch or TensorFlow, or has an associated tokenizer backed by the 🤗 Tokenizers library, refer to [this table](https://huggingface.co/docs/transformers/index#supported-frameworks). @@ -470,7 +562,6 @@ These implementations have been tested on several datasets (see the example scri | [Training and fine-tuning](https://huggingface.co/docs/transformers/training) | Using the models provided by 🤗 Transformers in a PyTorch/TensorFlow training loop and the `Trainer` API | | [Quick tour: Fine-tuning/usage scripts](https://github.com/huggingface/transformers/tree/main/examples) | Example scripts for fine-tuning models on a wide range of tasks | | [Model sharing and uploading](https://huggingface.co/docs/transformers/model_sharing) | Upload and share your fine-tuned models with the community | -| [Migration](https://huggingface.co/docs/transformers/migration) | Migrate to 🤗 Transformers from `pytorch-transformers` or `pytorch-pretrained-bert` | ## Citation diff --git a/README_de.md b/README_de.md new file mode 100644 index 00000000000000..f21bebdc781120 --- /dev/null +++ b/README_de.md @@ -0,0 +1,576 @@ + + +

+<!-- Logo: Hugging Face Transformers Library -->
+
+<!-- Badges: Build | GitHub | Documentation | GitHub release | Contributor Covenant | DOI -->
+
+English | 简体中文 | 繁體中文 | 한국어 | Español | 日本語 | हिन्दी | Русский | Português | తెలుగు | Français | Deutsch
+
+Maschinelles Lernen auf dem neuesten Stand der Technik für JAX, PyTorch und TensorFlow
+

+ +🤗 Transformers bietet Tausende von vortrainierten Modellen, um Aufgaben in verschiedenen Modalitäten wie Text, Bild und Audio durchzuführen. + +Diese Modelle können angewendet werden, auf: + +* 📝 Text - für Aufgaben wie Textklassifizierung, Informationsextraktion, Question Answering, automatische Textzusammenfassung, maschinelle Übersetzung und Textgenerierung in über 100 Sprachen. +* 🖼️ Bilder - für Aufgaben wie Bildklassifizierung, Objekterkennung und Segmentierung. +* 🗣️ Audio - für Aufgaben wie Spracherkennung und Audioklassifizierung. + +Transformer-Modelle können auch Aufgaben für **mehrere Modalitäten in Kombination** durchführen, z. B. tabellenbasiertes Question Answering, optische Zeichenerkennung, Informationsextraktion aus gescannten Dokumenten, Videoklassifizierung und visuelles Question Answering. + +🤗 Transformers bietet APIs, um diese vortrainierten Modelle schnell herunterzuladen und für einen gegebenen Text zu verwenden, sie auf Ihren eigenen Datensätzen zu feintunen und dann mit der Community in unserem [Model Hub](https://huggingface.co/models) zu teilen. Gleichzeitig ist jedes Python-Modul, das eine Architektur definiert, komplett eigenständig und kann modifiziert werden, um schnelle Forschungsexperimente zu ermöglichen. + +🤗 Transformers unterstützt die nahtlose Integration von drei der beliebtesten Deep-Learning-Bibliotheken: [Jax](https://jax.readthedocs.io/en/latest/), [PyTorch](https://pytorch.org/) und [TensorFlow](https://www.tensorflow.org/). Trainieren Sie Ihr Modell in einem Framework und laden Sie es zur Inferenz unkompliziert mit einem anderen. + +## Online-Demos + +Sie können die meisten unserer Modelle direkt auf ihren Seiten im [Model Hub](https://huggingface.co/models) testen. Wir bieten auch [privates Modell-Hosting, Versionierung, & eine Inferenz-API](https://huggingface.co/pricing) für öffentliche und private Modelle an. 
+ +Hier sind einige Beispiele: + +In der Computerlinguistik: + +- [Maskierte Wortvervollständigung mit BERT](https://huggingface.co/google-bert/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France) +- [Eigennamenerkennung mit Electra](https://huggingface.co/dbmdz/electra-large-discriminator-finetuned-conll03-english?text=My+name+is+Sarah+and+I+live+in+London+city) +- [Textgenerierung mit GPT-2](https://huggingface.co/openai-community/gpt2?text=A+long+time+ago%2C+) +- [Natural Language Inference mit RoBERTa](https://huggingface.co/FacebookAI/roberta-large-mnli?text=The+dog+was+lost.+Nobody+lost+any+animal) +- [Automatische Textzusammenfassung mit BART](https://huggingface.co/facebook/bart-large-cnn?text=The+tower+is+324+metres+%281%2C063+ft%29+tall%2C+about+the+same+height+as+an+81-storey+building%2C+and+the+tallest+structure+in+Paris.+Its+base+is+square%2C+measuring+125+metres+%28410+ft%29+on+each+side.+During+its+construction%2C+the+Eiffel+Tower+surpassed+the+Washington+Monument+to+become+the+tallest+man-made+structure+in+the+world%2C+a+title+it+held+for+41+years+until+the+Chrysler+Building+in+New+York+City+was+finished+in+1930.+It+was+the+first+structure+to+reach+a+height+of+300+metres.+Due+to+the+addition+of+a+broadcasting+aerial+at+the+top+of+the+tower+in+1957%2C+it+is+now+taller+than+the+Chrysler+Building+by+5.2+metres+%2817+ft%29.+Excluding+transmitters%2C+the+Eiffel+Tower+is+the+second+tallest+free-standing+structure+in+France+after+the+Millau+Viaduct) +- [Question Answering mit DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased-distilled-squad?text=Which+name+is+also+used+to+describe+the+Amazon+rainforest+in+English%3F&context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species) +- [Maschinelle Übersetzung mit T5](https://huggingface.co/google-t5/t5-base?text=My+name+is+Wolfgang+and+I+live+in+Berlin) + +In der Computer Vision: + +- [Bildklassifizierung mit ViT](https://huggingface.co/google/vit-base-patch16-224) +- [Objekterkennung mit DETR](https://huggingface.co/facebook/detr-resnet-50) +- [Semantische Segmentierung mit SegFormer](https://huggingface.co/nvidia/segformer-b0-finetuned-ade-512-512) +- [Panoptische Segmentierung mit MaskFormer](https://huggingface.co/facebook/maskformer-swin-small-coco) +- [Depth Estimation mit DPT](https://huggingface.co/docs/transformers/model_doc/dpt) +- [Videoklassifizierung mit 
VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae) +- [Universelle Segmentierung mit OneFormer](https://huggingface.co/shi-labs/oneformer_ade20k_dinat_large) + +Im Audio-Bereich: + +- [Automatische Spracherkennung mit Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base-960h) +- [Keyword Spotting mit Wav2Vec2](https://huggingface.co/superb/wav2vec2-base-superb-ks) +- [Audioklassifizierung mit Audio Spectrogram Transformer](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) + +In multimodalen Aufgaben: + +- [Tabellenbasiertes Question Answering mit TAPAS](https://huggingface.co/google/tapas-base-finetuned-wtq) +- [Visuelles Question Answering mit ViLT](https://huggingface.co/dandelin/vilt-b32-finetuned-vqa) +- [Zero-Shot-Bildklassifizierung mit CLIP](https://huggingface.co/openai/clip-vit-large-patch14) +- [Dokumentenbasiertes Question Answering mit LayoutLM](https://huggingface.co/impira/layoutlm-document-qa) +- [Zero-Shot-Videoklassifizierung mit X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip) + +## 100 Projekte, die 🤗 Transformers verwenden + +🤗 Transformers ist mehr als nur ein Toolkit zur Verwendung von vortrainierten Modellen: Es ist eine Gemeinschaft von Projekten, die darum herum und um den Hugging Face Hub aufgebaut sind. Wir möchten, dass 🤗 Transformers es Entwicklern, Forschern, Studenten, Professoren, Ingenieuren und jedem anderen ermöglicht, ihre Traumprojekte zu realisieren. + +Um die 100.000 Sterne von 🤗 Transformers zu feiern, haben wir beschlossen, die Gemeinschaft in den Mittelpunkt zu stellen und die Seite [awesome-transformers](./awesome-transformers.md) erstellt, die 100 unglaubliche Projekte auflistet, die zusammen mit 🤗 Transformers realisiert wurden. + +Wenn Sie ein Projekt besitzen oder nutzen, von dem Sie glauben, dass es Teil der Liste sein sollte, öffnen Sie bitte einen PR, um es hinzuzufügen! + +## Wenn Sie individuelle Unterstützung vom Hugging Face-Team möchten + + + HuggingFace Expert Acceleration Program +
+ +## Schnelleinstieg + +Um sofort ein Modell mit einer bestimmten Eingabe (Text, Bild, Audio ...) zu verwenden, bieten wir die `pipeline`-API an. Pipelines kombinieren ein vortrainiertes Modell mit der jeweiligen Vorverarbeitung, die während dessen Trainings verwendet wurde. Hier sehen Sie, wie man schnell eine Pipeline verwenden kann, um positive und negative Texte zu klassifizieren: + +```python +>>> from transformers import pipeline + +# Zuweisung einer Pipeline für die Sentiment-Analyse +>>> classifier = pipeline('sentiment-analysis') +>>> classifier('We are very happy to introduce pipeline to the transformers repository.') +[{'label': 'POSITIVE', 'score': 0.9996980428695679}] +``` + +Die zweite Codezeile lädt und cacht das vortrainierte Modell, das von der Pipeline verwendet wird, während die dritte es an dem gegebenen Text evaluiert. Hier ist die Antwort "positiv" mit einer Konfidenz von 99,97 %. + +Viele Aufgaben, sowohl in der Computerlinguistik als auch in der Computer Vision und Sprachverarbeitung, haben eine vortrainierte `pipeline`, die sofort einsatzbereit ist. Z. B. können wir leicht erkannte Objekte in einem Bild extrahieren: + +``` python +>>> import requests +>>> from PIL import Image +>>> from transformers import pipeline + +# Download eines Bildes mit süßen Katzen +>>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/coco_sample.png" +>>> image_data = requests.get(url, stream=True).raw +>>> image = Image.open(image_data) + +# Zuweisung einer Pipeline für die Objekterkennung +>>> object_detector = pipeline('object-detection') +>>> object_detector(image) +[{'score': 0.9982201457023621, + 'label': 'remote', + 'box': {'xmin': 40, 'ymin': 70, 'xmax': 175, 'ymax': 117}}, + {'score': 0.9960021376609802, + 'label': 'remote', + 'box': {'xmin': 333, 'ymin': 72, 'xmax': 368, 'ymax': 187}}, + {'score': 0.9954745173454285, + 'label': 'couch', + 'box': {'xmin': 0, 'ymin': 1, 'xmax': 639, 'ymax': 473}}, + {'score': 0.9988006353378296, + 'label': 'cat', + 'box': {'xmin': 13, 'ymin': 52, 'xmax': 314, 'ymax': 470}}, + {'score': 0.9986783862113953, + 'label': 'cat', + 'box': {'xmin': 345, 'ymin': 23, 'xmax': 640, 'ymax': 368}}] +``` + +Hier erhalten wir eine Liste von Objekten, die im Bild erkannt wurden, mit einer Markierung, die das Objekt eingrenzt, und einem zugehörigen Konfidenzwert. Folgend ist das Originalbild links und die Vorhersagen rechts dargestellt: + +

+<!-- Bilder: Originalbild (links) und Vorhersagen der Objekterkennung (rechts) -->
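+
+Ergänzend eine kleine, informelle Skizze (keine Vorgabe aus der offiziellen Dokumentation): Über den Parameter `model` lässt sich der `pipeline` auch ein bestimmtes Modell vom Hub zuweisen, und es können mehrere Texte auf einmal übergeben werden. Das unten verwendete Checkpoint dient nur als Beispiel.
+
+```python
+>>> from transformers import pipeline
+
+# Annahme: Wir wählen hier nur zur Illustration explizit ein Checkpoint vom Hub,
+# statt das Standardmodell der Aufgabe zu verwenden.
+>>> classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
+
+# Eine Liste von Texten wird als Batch verarbeitet; das Ergebnis ist eine Liste
+# von Dictionaries mit den Schlüsseln 'label' und 'score'.
+>>> classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."])
+```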

+ +Sie können mehr über die von der `pipeline`-API unterstützten Aufgaben in [diesem Tutorial](https://huggingface.co/docs/transformers/task_summary) erfahren. + +Zusätzlich zur `pipeline` benötigt es nur drei Zeilen Code, um eines der vortrainierten Modelle für Ihre Aufgabe herunterzuladen und zu verwenden. Hier ist der Code für die PyTorch-Version: + +```python +>>> from transformers import AutoTokenizer, AutoModel + +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = AutoModel.from_pretrained("google-bert/bert-base-uncased") + +>>> inputs = tokenizer("Hello world!", return_tensors="pt") +>>> outputs = model(**inputs) +``` + +Und hier ist der entsprechende Code für TensorFlow: + +```python +>>> from transformers import AutoTokenizer, TFAutoModel + +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = TFAutoModel.from_pretrained("google-bert/bert-base-uncased") + +>>> inputs = tokenizer("Hello world!", return_tensors="tf") +>>> outputs = model(**inputs) +``` + +Der Tokenizer ist für die gesamte Vorverarbeitung, die das vortrainierte Modell benötigt, verantwortlich und kann direkt auf einem einzelnen String (wie in den obigen Beispielen) oder einer Liste ausgeführt werden. Er gibt ein Dictionary aus, das Sie im darauffolgenden Code verwenden oder einfach direkt Ihrem Modell übergeben können, indem Sie den ** Operator zum Entpacken von Argumenten einsetzen. + +Das Modell selbst ist ein reguläres [PyTorch `nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) oder ein [TensorFlow `tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) (abhängig von Ihrem Backend), das Sie wie gewohnt verwenden können. [Dieses Tutorial](https://huggingface.co/docs/transformers/training) erklärt, wie man ein solches Modell in eine klassische PyTorch- oder TensorFlow-Trainingsschleife integrieren kann oder wie man unsere `Trainer`-API verwendet, um es schnell auf einem neuen Datensatz zu feintunen. + +## Warum sollten Sie 🤗 Transformers verwenden? + +1. Benutzerfreundliche Modelle auf dem neuesten Stand der Technik: + - Hohe Leistung bei Aufgaben zu Natural Language Understanding & Generation, Computer Vision und Audio. + - Niedrige Einstiegshürde für Bildungskräfte und Praktiker. + - Wenige benutzerseitige Abstraktionen mit nur drei zu lernenden Klassen. + - Eine einheitliche API für die Verwendung aller unserer vortrainierten Modelle. + +1. Geringere Rechenkosten, kleinerer CO2-Fußabdruck: + - Forscher können trainierte Modelle teilen, anstatt sie immer wieder neu zu trainieren. + - Praktiker können die Rechenzeit und Produktionskosten reduzieren. + - Dutzende Architekturen mit über 400.000 vortrainierten Modellen über alle Modalitäten hinweg. + +1. Wählen Sie das richtige Framework für jeden Lebensabschnitt eines Modells: + - Trainieren Sie Modelle auf neustem Stand der Technik in nur drei Codezeilen. + - Verwenden Sie ein einzelnes Modell nach Belieben mit TF2.0-/PyTorch-/JAX-Frameworks. + - Wählen Sie nahtlos das richtige Framework für Training, Evaluation und Produktiveinsatz. + +1. Passen Sie ein Modell oder Beispiel leicht an Ihre Bedürfnisse an: + - Wir bieten Beispiele für jede Architektur an, um die von ihren ursprünglichen Autoren veröffentlichten Ergebnisse zu reproduzieren. + - Modellinterna sind so einheitlich wie möglich verfügbar gemacht. + - Modelldateien können unabhängig von der Bibliothek für schnelle Experimente verwendet werden. 
+ +## Warum sollten Sie 🤗 Transformers nicht verwenden? + +- Diese Bibliothek ist kein modularer Werkzeugkasten mit Bausteinen für neuronale Netze. Der Code in den Modelldateien ist absichtlich nicht mit zusätzlichen Abstraktionen refaktorisiert, sodass Forscher schnell mit jedem der Modelle iterieren können, ohne sich in zusätzliche Abstraktionen/Dateien vertiefen zu müssen. +- Die Trainings-API ist nicht dafür gedacht, mit beliebigen Modellen zu funktionieren, sondern ist für die Verwendung mit den von der Bibliothek bereitgestellten Modellen optimiert. Für generische Trainingsschleifen von maschinellem Lernen sollten Sie eine andere Bibliothek verwenden (möglicherweise [Accelerate](https://huggingface.co/docs/accelerate)). +- Auch wenn wir bestrebt sind, so viele Anwendungsfälle wie möglich zu veranschaulichen, sind die Beispielskripte in unserem [`examples`](./examples) Ordner genau das: Beispiele. Es ist davon auszugehen, dass sie nicht sofort auf Ihr spezielles Problem anwendbar sind und einige Codezeilen geändert werden müssen, um sie für Ihre Bedürfnisse anzupassen. + +## Installation + +### Mit pip + +Dieses Repository wurde mit Python 3.8+, Flax 0.4.1+, PyTorch 1.11+ und TensorFlow 2.6+ getestet. + +Sie sollten 🤗 Transformers in einer [virtuellen Umgebung](https://docs.python.org/3/library/venv.html) installieren. Wenn Sie mit virtuellen Python-Umgebungen nicht vertraut sind, schauen Sie sich den [Benutzerleitfaden](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/) an. + +Erstellen und aktivieren Sie zuerst eine virtuelle Umgebung mit der Python-Version, die Sie verwenden möchten. + +Dann müssen Sie entweder Flax, PyTorch oder TensorFlow installieren. Bitte beziehe dich entsprechend auf die jeweiligen Installationsanleitungen für [TensorFlow](https://www.tensorflow.org/install/), [PyTorch](https://pytorch.org/get-started/locally/#start-locally), und/oder [Flax](https://github.com/google/flax#quick-install) und [Jax](https://github.com/google/jax#installation) für den spezifischen Installationsbefehl für Ihre Plattform. + +Wenn eines dieser Backends installiert ist, kann 🤗 Transformers wie folgt mit pip installiert werden: + +```bash +pip install transformers +``` + +Wenn Sie mit den Beispielen experimentieren möchten oder die neueste Version des Codes benötigen und nicht auf eine neue Veröffentlichung warten können, müssen Sie [die Bibliothek von der Quelle installieren](https://huggingface.co/docs/transformers/installation#installing-from-source). + +### Mit conda + +🤗 Transformers kann wie folgt mit conda installiert werden: + +```shell script +conda install conda-forge::transformers +``` + +> **_HINWEIS:_** Die Installation von `transformers` aus dem `huggingface`-Kanal ist veraltet. + +Folgen Sie den Installationsanleitungen von Flax, PyTorch oder TensorFlow, um zu sehen, wie sie mit conda installiert werden können. + +> **_HINWEIS:_** Auf Windows werden Sie möglicherweise aufgefordert, den Entwicklermodus zu aktivieren, um von Caching zu profitieren. Wenn das für Sie keine Option ist, lassen Sie es uns bitte in [diesem Issue](https://github.com/huggingface/huggingface_hub/issues/1062) wissen. 
+ +## Modellarchitekturen + +**[Alle Modell-Checkpoints](https://huggingface.co/models)**, die von 🤗 Transformers bereitgestellt werden, sind nahtlos aus dem huggingface.co [Model Hub](https://huggingface.co/models) integriert, wo sie direkt von [Benutzern](https://huggingface.co/users) und [Organisationen](https://huggingface.co/organizations) hochgeladen werden. + +Aktuelle Anzahl der Checkpoints: ![](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/models&color=brightgreen) + +🤗 Transformers bietet derzeit die folgenden Architekturen an (siehe [hier](https://huggingface.co/docs/transformers/model_summary) für eine jeweilige Übersicht): + +1. **[ALBERT](https://huggingface.co/docs/transformers/model_doc/albert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. +1. **[ALIGN](https://huggingface.co/docs/transformers/model_doc/align)** (from Google Research) released with the paper [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918) by Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. +1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell. +1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass. +1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long. +1. **[Bark](https://huggingface.co/docs/transformers/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team. +1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. +1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis. +1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen. +1. 
**[BEiT](https://huggingface.co/docs/transformers/model_doc/beit)** (from Microsoft) released with the paper [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254) by Hangbo Bao, Li Dong, Furu Wei. +1. **[BERT](https://huggingface.co/docs/transformers/model_doc/bert)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. +1. **[BERT For Sequence Generation](https://huggingface.co/docs/transformers/model_doc/bert-generation)** (from Google) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn. +1. **[BERTweet](https://huggingface.co/docs/transformers/model_doc/bertweet)** (from VinAI Research) released with the paper [BERTweet: A pre-trained language model for English Tweets](https://aclanthology.org/2020.emnlp-demos.2/) by Dat Quoc Nguyen, Thanh Vu and Anh Tuan Nguyen. +1. **[BigBird-Pegasus](https://huggingface.co/docs/transformers/model_doc/bigbird_pegasus)** (from Google Research) released with the paper [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed. +1. **[BigBird-RoBERTa](https://huggingface.co/docs/transformers/model_doc/big_bird)** (from Google Research) released with the paper [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed. +1. **[BioGpt](https://huggingface.co/docs/transformers/model_doc/biogpt)** (from Microsoft Research AI4Science) released with the paper [BioGPT: generative pre-trained transformer for biomedical text generation and mining](https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac409/6713511?guestAccessKey=a66d9b5d-4f83-4017-bb52-405815c907b9) by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu. +1. **[BiT](https://huggingface.co/docs/transformers/model_doc/bit)** (from Google AI) released with the paper [Big Transfer (BiT): General Visual Representation Learning](https://arxiv.org/abs/1912.11370) by Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, Neil Houlsby. +1. **[Blenderbot](https://huggingface.co/docs/transformers/model_doc/blenderbot)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston. +1. **[BlenderbotSmall](https://huggingface.co/docs/transformers/model_doc/blenderbot-small)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston. +1. 
**[BLIP](https://huggingface.co/docs/transformers/model_doc/blip)** (from Salesforce) released with the paper [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086) by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi. +1. **[BLIP-2](https://huggingface.co/docs/transformers/model_doc/blip-2)** (from Salesforce) released with the paper [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597) by Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. +1. **[BLOOM](https://huggingface.co/docs/transformers/model_doc/bloom)** (from BigScience workshop) released by the [BigScience Workshop](https://bigscience.huggingface.co/). +1. **[BORT](https://huggingface.co/docs/transformers/model_doc/bort)** (from Alexa) released with the paper [Optimal Subarchitecture Extraction For BERT](https://arxiv.org/abs/2010.10499) by Adrian de Wynter and Daniel J. Perry. +1. **[BridgeTower](https://huggingface.co/docs/transformers/model_doc/bridgetower)** (from Harbin Institute of Technology/Microsoft Research Asia/Intel Labs) released with the paper [BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning](https://arxiv.org/abs/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. +1. **[BROS](https://huggingface.co/docs/transformers/model_doc/bros)** (from NAVER CLOVA) released with the paper [BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents](https://arxiv.org/abs/2108.04539) by Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park. +1. **[ByT5](https://huggingface.co/docs/transformers/model_doc/byt5)** (from Google Research) released with the paper [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel. +1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot. +1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. +1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou. +1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. +1. 
**[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. +1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker. +1. **[CLVP](https://huggingface.co/docs/transformers/model_doc/clvp)** released with the paper [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker. +1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong. +1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (from MetaAI) released with the paper [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve. +1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (from Microsoft Research Asia) released with the paper [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang. +1. **[ConvBERT](https://huggingface.co/docs/transformers/model_doc/convbert)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan. +1. **[ConvNeXT](https://huggingface.co/docs/transformers/model_doc/convnext)** (from Facebook AI) released with the paper [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie. +1. **[ConvNeXTV2](https://huggingface.co/docs/transformers/model_doc/convnextv2)** (from Facebook AI) released with the paper [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie. +1. 
**[CPM](https://huggingface.co/docs/transformers/model_doc/cpm)** (from Tsinghua University) released with the paper [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun. +1. **[CPM-Ant](https://huggingface.co/docs/transformers/model_doc/cpmant)** (from OpenBMB) released by the [OpenBMB](https://www.openbmb.org/). +1. **[CTRL](https://huggingface.co/docs/transformers/model_doc/ctrl)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher. +1. **[CvT](https://huggingface.co/docs/transformers/model_doc/cvt)** (from Microsoft) released with the paper [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang. +1. **[Data2Vec](https://huggingface.co/docs/transformers/model_doc/data2vec)** (from Facebook) released with the paper [Data2Vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli. +1. **[DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. +1. **[DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. +1. **[Decision Transformer](https://huggingface.co/docs/transformers/model_doc/decision_transformer)** (from Berkeley/Facebook/Google) released with the paper [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345) by Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch. +1. **[Deformable DETR](https://huggingface.co/docs/transformers/model_doc/deformable_detr)** (from SenseTime Research) released with the paper [Deformable DETR: Deformable Transformers for End-to-End Object Detection](https://arxiv.org/abs/2010.04159) by Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai. +1. **[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou. +1. 
**[DePlot](https://huggingface.co/docs/transformers/model_doc/deplot)** (from Google AI) released with the paper [DePlot: One-shot visual language reasoning by plot-to-table translation](https://arxiv.org/abs/2212.10505) by Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, Yasemin Altun. +1. **[Depth Anything](https://huggingface.co/docs/transformers/main/model_doc/depth_anything)** (from University of Hong Kong and TikTok) released with the paper [Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891) by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao. +1. **[DETA](https://huggingface.co/docs/transformers/model_doc/deta)** (from The University of Texas at Austin) released with the paper [NMS Strikes Back](https://arxiv.org/abs/2212.06137) by Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl. +1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko. +1. **[DialoGPT](https://huggingface.co/docs/transformers/model_doc/dialogpt)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan. +1. **[DiNAT](https://huggingface.co/docs/transformers/model_doc/dinat)** (from SHI Labs) released with the paper [Dilated Neighborhood Attention Transformer](https://arxiv.org/abs/2209.15001) by Ali Hassani and Humphrey Shi. +1. **[DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2)** (from Meta AI) released with the paper [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski. +1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) and a German version of DistilBERT. +1. **[DiT](https://huggingface.co/docs/transformers/model_doc/dit)** (from Microsoft Research) released with the paper [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei. 
+1. **[Donut](https://huggingface.co/docs/transformers/model_doc/donut)** (from NAVER), released together with the paper [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park. +1. **[DPR](https://huggingface.co/docs/transformers/model_doc/dpr)** (from Facebook) released with the paper [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. +1. **[DPT](https://huggingface.co/docs/transformers/master/model_doc/dpt)** (from Intel Labs) released with the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun. +1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (from Snap Research) released with the paper [EfficientFormer: Vision Transformers at MobileNet Speed](https://arxiv.org/abs/2206.01191) by Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren. +1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le. +1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. +1. **[EnCodec](https://huggingface.co/docs/transformers/model_doc/encodec)** (from Meta AI) released with the paper [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438) by Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi. +1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn. +1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (from Baidu) released with the paper [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) by Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu. +1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (from Baidu) released with the paper [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) by Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. +1. **[ESM](https://huggingface.co/docs/transformers/model_doc/esm)** (from Meta AI) are transformer protein language models. **ESM-1b** was released with the paper [Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences](https://www.pnas.org/content/118/15/e2016239118) by Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus.
**ESM-1v** was released with the paper [Language models enable zero-shot prediction of the effects of mutations on protein function](https://doi.org/10.1101/2021.07.09.450648) by Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu and Alexander Rives. **ESM-2 and ESMFold** were released with the paper [Language models of protein sequences at the scale of evolution enable accurate structure prediction](https://doi.org/10.1101/2022.07.20.500902) by Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, Alexander Rives. +1. **[Falcon](https://huggingface.co/docs/transformers/model_doc/falcon)** (from Technology Innovation Institute) by Almazrouei, Ebtesam and Alobeidli, Hamza and Alshamsi, Abdulaziz and Cappelli, Alessandro and Cojocaru, Ruxandra and Debbah, Merouane and Goffinet, Etienne and Heslow, Daniel and Launay, Julien and Malartic, Quentin and Noune, Badreddine and Pannier, Baptiste and Penedo, Guilherme. +1. **[FastSpeech2Conformer](https://huggingface.co/docs/transformers/model_doc/fastspeech2_conformer)** (from ESPnet) released with the paper [Recent Developments On Espnet Toolkit Boosted By Conformer](https://arxiv.org/abs/2010.13956) by Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang. +1. **[FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei +1. **[FLAN-UL2](https://huggingface.co/docs/transformers/model_doc/flan-ul2)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-ul2-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei +1. **[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab. +1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc/flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. +1. 
**[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. +1. **[FocalNet](https://huggingface.co/docs/transformers/model_doc/focalnet)** (from Microsoft Research) released with the paper [Focal Modulation Networks](https://arxiv.org/abs/2203.11926) by Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao. +1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le. +1. **[Fuyu](https://huggingface.co/docs/transformers/model_doc/fuyu)** (from ADEPT) Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar. Released in a [blog post](https://www.adept.ai/blog/fuyu-8b) +1. **[GIT](https://huggingface.co/docs/transformers/model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang. +1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim. +1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://openai.com/research/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. +1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. +1. **[GPT NeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox)** (from EleutherAI) released with the paper [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745) by Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach +1. **[GPT NeoX Japanese](https://huggingface.co/docs/transformers/model_doc/gpt_neox_japanese)** (from ABEJA) released by Shinya Otani, Takayoshi Makabe, Anuj Arora, and Kyo Hattori. +1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://openai.com/research/better-language-models/) by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. +1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki. +1. 
**[GPT-Sw3](https://huggingface.co/docs/transformers/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren. +1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra. +1. **[GPTSAN-japanese](https://huggingface.co/docs/transformers/model_doc/gptsan-japanese)** released in the repository [tanreinama/GPTSAN](https://github.com/tanreinama/GPTSAN/blob/main/report/model.md) by Toshiyuki Sakamoto(tanreinama). +1. **[Graphormer](https://huggingface.co/docs/transformers/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu. +1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang. +1. **[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** (from Allegro.pl, AGH University of Science and Technology) released with the paper [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik. +1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed. +1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer. +1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. 
Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. +1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. +1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. +1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (from Salesforce) released with the paper [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. +1. **[Jukebox](https://huggingface.co/docs/transformers/model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever. +1. **[KOSMOS-2](https://huggingface.co/docs/transformers/model_doc/kosmos-2)** (from Microsoft Research Asia) released with the paper [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei. +1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou. +1. **[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou. +1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (from Microsoft Research Asia) released with the paper [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei. +1. **[LayoutXLM](https://huggingface.co/docs/transformers/model_doc/layoutxlm)** (from Microsoft Research Asia) released with the paper [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/abs/2104.08836) by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei. +1. **[LED](https://huggingface.co/docs/transformers/model_doc/led)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan. +1. 
**[LeViT](https://huggingface.co/docs/transformers/model_doc/levit)** (from Meta AI) released with the paper [LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference](https://arxiv.org/abs/2104.01136) by Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, Matthijs Douze. +1. **[LiLT](https://huggingface.co/docs/transformers/model_doc/lilt)** (from South China University of Technology) released with the paper [LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding](https://arxiv.org/abs/2202.13669) by Jiapeng Wang, Lianwen Jin, Kai Ding. +1. **[LLaMA](https://huggingface.co/docs/transformers/model_doc/llama)** (from The FAIR team of Meta AI) released with the paper [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971) by Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. +1. **[Llama2](https://huggingface.co/docs/transformers/model_doc/llama2)** (from The FAIR team of Meta AI) released with the paper [Llama2: Open Foundation and Fine-Tuned Chat Models](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/) by Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom. +1. **[LLaVa](https://huggingface.co/docs/transformers/model_doc/llava)** (from Microsoft Research & University of Wisconsin-Madison) released with the paper [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485) by Haotian Liu, Chunyuan Li, Yuheng Li and Yong Jae Lee. +1. **[Longformer](https://huggingface.co/docs/transformers/model_doc/longformer)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan. +1. **[LongT5](https://huggingface.co/docs/transformers/model_doc/longt5)** (from Google AI) released with the paper [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/abs/2112.07916) by Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang. +1. 
**[LUKE](https://huggingface.co/docs/transformers/model_doc/luke)** (from Studio Ousia) released with the paper [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto. +1. **[LXMERT](https://huggingface.co/docs/transformers/model_doc/lxmert)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal. +1. **[M-CTC-T](https://huggingface.co/docs/transformers/model_doc/mctct)** (from Facebook) released with the paper [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161) by Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert. +1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin. +1. **[MADLAD-400](https://huggingface.co/docs/transformers/model_doc/madlad-400)** (from Google) released with the paper [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](https://arxiv.org/abs/2309.04662) by Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat. +1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team. +1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (from Microsoft Research Asia) released with the paper [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei. +1. **[Mask2Former](https://huggingface.co/docs/transformers/model_doc/mask2former)** (from FAIR and UIUC) released with the paper [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) by Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar. +1. **[MaskFormer](https://huggingface.co/docs/transformers/model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov. +1. **[MatCha](https://huggingface.co/docs/transformers/model_doc/matcha)** (from Google AI) released with the paper [MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering](https://arxiv.org/abs/2212.09662) by Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, Julian Martin Eisenschlos. +1. 
**[mBART](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. +1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan. +1. **[MEGA](https://huggingface.co/docs/transformers/model_doc/mega)** (from Meta/USC/CMU/SJTU) released with the paper [Mega: Moving Average Equipped Gated Attention](https://arxiv.org/abs/2209.10655) by Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. +1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. +1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. +1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (from Alibaba Research) released with the paper [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) by Peng Wang, Cheng Da, and Cong Yao. +1. **[Mistral](https://huggingface.co/docs/transformers/model_doc/mistral)** (from Mistral AI) by The [Mistral AI](https://mistral.ai) team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. +1. **[Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral)** (from Mistral AI) by The [Mistral AI](https://mistral.ai) team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. +1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (from Studio Ousia) released with the paper [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka. +1. 
**[MMS](https://huggingface.co/docs/transformers/model_doc/mms)** (from Facebook) released with the paper [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516) by Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli. +1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. +1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (from Google Inc.) released with the paper [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam. +1. **[MobileNetV2](https://huggingface.co/docs/transformers/model_doc/mobilenet_v2)** (from Google Inc.) released with the paper [MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/abs/1801.04381) by Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen. +1. **[MobileViT](https://huggingface.co/docs/transformers/model_doc/mobilevit)** (from Apple) released with the paper [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari. +1. **[MobileViTV2](https://huggingface.co/docs/transformers/model_doc/mobilevitv2)** (from Apple) released with the paper [Separable Self-attention for Mobile Vision Transformers](https://arxiv.org/abs/2206.02680) by Sachin Mehta and Mohammad Rastegari. +1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu. +1. **[MPT](https://huggingface.co/docs/transformers/model_doc/mpt)** (from MosaicML) released with the repository [llm-foundry](https://github.com/mosaicml/llm-foundry/) by the MosaicML NLP Team. +1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (from the University of Wisconsin - Madison) released with the paper [Multi Resolution Analysis (MRA) for Approximate Self-Attention](https://arxiv.org/abs/2207.10284) by Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh. +1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel. +1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez. +1. 
**[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (from RUC AI Box) released with the paper [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen. +1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (from SHI Labs) released with the paper [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi. +1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu. +1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team. +1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team. +1. **[Nougat](https://huggingface.co/docs/transformers/model_doc/nougat)** (from Meta AI) released with the paper [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) by Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic. +1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh. +1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi. +1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released on GitHub (now removed). +1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al. +1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. +1. **[OWLv2](https://huggingface.co/docs/transformers/model_doc/owlv2)** (from Google AI) released with the paper [Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683) by Matthias Minderer, Alexey Gritsenko, Neil Houlsby. +1. 
**[PatchTSMixer](https://huggingface.co/docs/transformers/model_doc/patchtsmixer)** (from IBM Research) released with the paper [TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting](https://arxiv.org/pdf/2306.09364.pdf) by Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, Jayant Kalagnanam. +1. **[PatchTST](https://huggingface.co/docs/transformers/model_doc/patchtst)** (from IBM) released with the paper [A Time Series is Worth 64 Words: Long-term Forecasting with Transformers](https://arxiv.org/abs/2211.14730) by Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, Jayant Kalagnanam. +1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu. +1. **[PEGASUS-X](https://huggingface.co/docs/transformers/model_doc/pegasus_x)** (from Google) released with the paper [Investigating Efficiently Extending Transformers for Long Input Summarization](https://arxiv.org/abs/2208.04347) by Jason Phang, Yao Zhao, and Peter J. Liu. +1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira. +1. **[Persimmon](https://huggingface.co/docs/transformers/model_doc/persimmon)** (from ADEPT) released in a [blog post](https://www.adept.ai/blog/persimmon-8b) by Erich Elsen, Augustus Odena, Maxwell Nye, Sağnak Taşırlar, Tri Dao, Curtis Hawthorne, Deepak Moparthi, Arushi Somani. +1. **[Phi](https://huggingface.co/docs/transformers/model_doc/phi)** (from Microsoft) released with the papers - [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li, [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee. +1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen. +1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (from Google) released with the paper [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) by Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova. +1. 
**[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. +1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng. +1. **[Pop2Piano](https://huggingface.co/docs/transformers/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi and Kyogu Lee. +1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. +1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (from Nanjing University, The University of Hong Kong etc.) released with the paper [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao. +1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius. +1. **[Qwen2](https://huggingface.co/docs/transformers/model_doc/qwen2)** (from the Qwen team, Alibaba Group) released with the paper [Qwen Technical Report](https://arxiv.org/abs/2309.16609) by Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou and Tianhang Zhu. +1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (from Facebook) released with the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) by Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. +1. **[REALM](https://huggingface.co/docs/transformers/model_doc/realm.html)** (from Google Research) released with the paper [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909) by Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang. +1. 
**[Reformer](https://huggingface.co/docs/transformers/model_doc/reformer)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya. +1. **[RegNet](https://huggingface.co/docs/transformers/model_doc/regnet)** (from META Platforms) released with the paper [Designing Network Design Spaces](https://arxiv.org/abs/2003.13678) by Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, Piotr Dollár. +1. **[RemBERT](https://huggingface.co/docs/transformers/model_doc/rembert)** (from Google Research) released with the paper [Rethinking embedding coupling in pre-trained language models](https://arxiv.org/abs/2010.12821) by Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder. +1. **[ResNet](https://huggingface.co/docs/transformers/model_doc/resnet)** (from Microsoft Research) released with the paper [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) by Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. +1. **[RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. +1. **[RoBERTa-PreLayerNorm](https://huggingface.co/docs/transformers/model_doc/roberta-prelayernorm)** (from Facebook) released with the paper [fairseq: A Fast, Extensible Toolkit for Sequence Modeling](https://arxiv.org/abs/1904.01038) by Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli. +1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by Hui Su, Weiwei Shi, Xiaoyu Shen, Xiao Zhou, Tuo Ji, Jiarui Fang, Jie Zhou. +1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu. +1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng), released on [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng. +1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team. +1. **[SeamlessM4Tv2](https://huggingface.co/docs/transformers/model_doc/seamless_m4t_v2)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team. +1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo. +1. 
**[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick. +1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi. +1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi. +1. **[SigLIP](https://huggingface.co/docs/transformers/model_doc/siglip)** (from Google AI) released with the paper [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer. +1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei. +1. **[SpeechToTextTransformer](https://huggingface.co/docs/transformers/model_doc/speech_to_text)** (from Facebook), released together with the paper [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino. +1. **[SpeechToTextTransformer2](https://huggingface.co/docs/transformers/model_doc/speech_to_text_2)** (from Facebook), released together with the paper [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau. +1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (from Tel Aviv University), released together with the paper [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy. +1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer. +1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (from MBZUAI) released with the paper [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446) by Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan. +1. 
**[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo. +1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo. +1. **[Swin2SR](https://huggingface.co/docs/transformers/model_doc/swin2sr)** (from University of Würzburg) released with the paper [Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration](https://arxiv.org/abs/2209.11345) by Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte. +1. **[SwitchTransformers](https://huggingface.co/docs/transformers/model_doc/switch_transformers)** (from Google) released with the paper [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961) by William Fedus, Barret Zoph, Noam Shazeer. +1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu. +1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu. +1. **[Table Transformer](https://huggingface.co/docs/transformers/model_doc/table-transformer)** (from Microsoft Research) released with the paper [PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents](https://arxiv.org/abs/2110.00061) by Brandon Smock, Rohith Pesala, Robin Abraham. +1. **[TAPAS](https://huggingface.co/docs/transformers/model_doc/tapas)** (from Google AI) released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos. +1. **[TAPEX](https://huggingface.co/docs/transformers/model_doc/tapex)** (from Microsoft Research) released with the paper [TAPEX: Table Pre-training via Learning a Neural SQL Executor](https://arxiv.org/abs/2107.07653) by Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, Jian-Guang Lou. +1. **[Time Series Transformer](https://huggingface.co/docs/transformers/model_doc/time_series_transformer)** (from HuggingFace). +1. **[TimeSformer](https://huggingface.co/docs/transformers/model_doc/timesformer)** (from Facebook) released with the paper [Is Space-Time Attention All You Need for Video Understanding?](https://arxiv.org/abs/2102.05095) by Gedas Bertasius, Heng Wang, Lorenzo Torresani. +1. 
**[Trajectory Transformer](https://huggingface.co/docs/transformers/model_doc/trajectory_transformers)** (from the University of California at Berkeley) released with the paper [Offline Reinforcement Learning as One Big Sequence Modeling Problem](https://arxiv.org/abs/2106.02039) by Michael Janner, Qiyang Li, Sergey Levine +1. **[Transformer-XL](https://huggingface.co/docs/transformers/model_doc/transfo-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. +1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. +1. **[TVLT](https://huggingface.co/docs/transformers/model_doc/tvlt)** (from UNC Chapel Hill) released with the paper [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) by Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal. +1. **[TVP](https://huggingface.co/docs/transformers/model_doc/tvp)** (from Intel) released with the paper [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding. +1. **[UL2](https://huggingface.co/docs/transformers/model_doc/ul2)** (from Google Research) released with the paper [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) by Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler +1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (from Google Research) released with the paper [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) by Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant. +1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang. +1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu. +1. **[UnivNet](https://huggingface.co/docs/transformers/model_doc/univnet)** (from Kakao Corporation) released with the paper [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim. +1. 
**[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (from Peking University) released with the paper [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221) by Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun. +1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (from Tsinghua University and Nankai University) released with the paper [Visual Attention Network](https://arxiv.org/abs/2202.09741) by Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu. +1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (from Multimedia Computing Group, Nanjing University) released with the paper [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) by Zhan Tong, Yibing Song, Jue Wang, Limin Wang. +1. **[ViLT](https://huggingface.co/docs/transformers/model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim. +1. **[VipLlava](https://huggingface.co/docs/transformers/model_doc/vipllava)** (from University of Wisconsin–Madison) released with the paper [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https://arxiv.org/abs/2312.00784) by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee. +1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. +1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang. +1. **[ViT Hybrid](https://huggingface.co/docs/transformers/model_doc/vit_hybrid)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. +1. **[VitDet](https://huggingface.co/docs/transformers/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. +1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. +1. 
**[ViTMatte](https://huggingface.co/docs/transformers/model_doc/vitmatte)** (from HUST-VL) released with the paper [ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers](https://arxiv.org/abs/2305.15272) by Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang. +1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas. +1. **[VITS](https://huggingface.co/docs/transformers/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son. +1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. +1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli. +1. **[Wav2Vec2-BERT](https://huggingface.co/docs/transformers/model_doc/wav2vec2-bert)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team. +1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino. +1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli. +1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei. +1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever. +1. 
**[X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling. +1. **[X-MOD](https://huggingface.co/docs/transformers/model_doc/xmod)** (from Meta AI) released with the paper [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](http://dx.doi.org/10.18653/v1/2022.naacl-main.255) by Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe. +1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li. +1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau. +1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. +1. **[XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. +1. **[XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/model_doc/xlm-roberta-xl)** (from Facebook AI), released together with the paper [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://arxiv.org/abs/2105.00572) by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau. +1. **[XLM-V](https://huggingface.co/docs/transformers/model_doc/xlm-v)** (from Meta AI) released with the paper [XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models](https://arxiv.org/abs/2301.10472) by Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, Madian Khabsa. +1. **[XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le. +1. 
**[XLS-R](https://huggingface.co/docs/transformers/model_doc/xls_r)** (from Facebook AI) released with the paper [XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale](https://arxiv.org/abs/2111.09296) by Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli. +1. **[XLSR-Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/xlsr_wav2vec2)** (from Facebook AI) released with the paper [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://arxiv.org/abs/2006.13979) by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli. +1. **[YOLOS](https://huggingface.co/docs/transformers/model_doc/yolos)** (from Huazhong University of Science & Technology) released with the paper [You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection](https://arxiv.org/abs/2106.00666) by Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu. +1. **[YOSO](https://huggingface.co/docs/transformers/model_doc/yoso)** (from the University of Wisconsin - Madison) released with the paper [You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling](https://arxiv.org/abs/2111.09714) by Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh. +1. Möchten Sie ein neues Modell beitragen? Wir haben einen **detaillierten Leitfaden und Vorlagen** hinzugefügt, um Sie beim Hinzufügen eines neuen Modells zu unterstützen. Sie können diese im [`templates`](./templates) Ordner des Repositorys finden. Lesen Sie unbedingt die [Beitragshinweise](./CONTRIBUTING.md) und kontaktieren Sie die Maintainer oder erstellen Sie ein Issue, um Feedback zu sammeln, bevor Sie mit der PR starten. + +Um zu überprüfen, ob jedes Modell eine Implementierung in Flax, PyTorch oder TensorFlow hat oder über einen zugehörigen Tokenizer verfügt, der von der 🤗 Tokenizers-Bibliothek unterstützt wird, schauen Sie auf [diese Tabelle](https://huggingface.co/docs/transformers/index#supported-frameworks). + +Diese Implementierungen wurden mit mehreren Datensätzen getestet (siehe Beispielskripte) und sollten den Leistungen der ursprünglichen Implementierungen entsprechen. Weitere Details zur Leistung finden Sie im Abschnitt der Beispiele in der [Dokumentation](https://github.com/huggingface/transformers/tree/main/examples). 
+ +## Mehr erfahren + +| Abschnitt | Beschreibung | +|-|-| +| [Dokumentation](https://huggingface.co/docs/transformers/) | Vollständige API-Dokumentation und Tutorials | +| [Zusammenfassung der Aufgaben](https://huggingface.co/docs/transformers/task_summary) | Von 🤗 Transformers unterstützte Aufgaben | +| [Vorverarbeitungs-Tutorial](https://huggingface.co/docs/transformers/preprocessing) | Verwendung der `Tokenizer`-Klasse zur Vorverarbeitung der Daten für die Modelle | +| [Training und Feintuning](https://huggingface.co/docs/transformers/training) | Verwendung der von 🤗 Transformers bereitgestellten Modelle in einer PyTorch-/TensorFlow-Trainingsschleife und der `Trainer`-API | +| [Schnelleinstieg: Feintuning/Anwendungsskripte](https://github.com/huggingface/transformers/tree/main/examples) | Beispielskripte für das Feintuning von Modellen für eine breite Palette von Aufgaben | +| [Modellfreigabe und -upload](https://huggingface.co/docs/transformers/model_sharing) | Laden Sie Ihre feingetunten Modelle hoch und teilen Sie sie mit der Community | + +## Zitation + +Wir haben jetzt ein [Paper](https://www.aclweb.org/anthology/2020.emnlp-demos.6/), das Sie für die 🤗 Transformers-Bibliothek zitieren können: + +```bibtex +@inproceedings{wolf-etal-2020-transformers, + title = "Transformers: State-of-the-Art Natural Language Processing", + author = "Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush", + booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations", + month = oct, + year = "2020", + address = "Online", + publisher = "Association for Computational Linguistics", + url = "https://www.aclweb.org/anthology/2020.emnlp-demos.6", + pages = "38--45" +} +``` diff --git a/README_es.md b/README_es.md index a5b559caf361d9..9dfbf8931abada 100644 --- a/README_es.md +++ b/README_es.md @@ -18,7 +18,7 @@ limitations under the License.

-

+

Build @@ -46,8 +46,13 @@ limitations under the License. 한국어 | Español | 日本語 | - हिन्दी -

+ हिन्दी | + Русский | + Português | + తెలుగు | + Français | + Deutsch | +

@@ -58,15 +63,15 @@ limitations under the License.

-🤗 Transformers aporta miles de modelos preentrenados Para realizar tareas en diferentes modalidades como texto, vision, y audio. +🤗 Transformers aporta miles de modelos preentrenados para realizar tareas en diferentes modalidades como texto, visión, y audio. Estos modelos pueden ser aplicados en: -* 📝 Texto, Para tareas como clasificación de texto, extracción de información, responder preguntas, resumir, traducir, generación de texto, en más de 100 idiomas. +* 📝 Texto, para tareas como clasificación de texto, extracción de información, responder preguntas, resumir, traducir, generación de texto, en más de 100 idiomas. * 🖼️ Imágenes, para tareas como clasificación de imágenes, detección de objetos, y segmentación. * 🗣️ Audio, para tareas como reconocimiento de voz y clasificación de audio. -Los modelos de Transformer también pueden realizar tareas en **muchas modalidades combinadas**, como responder pregunstas, reconocimiento de carácteres ópticos,extracción de información de documentos escaneados, clasificación de video, y respuesta de preguntas visuales. +Los modelos de Transformer también pueden realizar tareas en **muchas modalidades combinadas**, como responder preguntas, reconocimiento de caracteres ópticos, extracción de información de documentos escaneados, clasificación de video, y respuesta de preguntas visuales. 🤗 Transformers aporta APIs para descargar rápidamente y usar estos modelos preentrenados en un texto dado, afinarlos en tus propios sets de datos y compartirlos con la comunidad en nuestro [centro de modelos](https://huggingface.co/models). Al mismo tiempo, cada módulo de Python que define una arquitectura es completamente independiente y se puede modificar para permitir experimentos de investigación rápidos. @@ -78,14 +83,14 @@ Puedes probar la mayoría de nuestros modelos directamente en sus páginas desde Aquí hay algunos ejemplos: - En procesamiento del lenguaje natural: -- [Terminación de palabras enmascaradas con BERT](https://huggingface.co/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France) +En procesamiento del lenguaje natural: +- [Terminación de palabras enmascaradas con BERT](https://huggingface.co/google-bert/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France) - [Reconocimiento del nombre de la entidad con Electra](https://huggingface.co/dbmdz/electra-large-discriminator-finetuned-conll03-english?text=My+name+is+Sarah+and+I+live+in+London+city) -- [Generación de texto con GPT-2](https://huggingface.co/gpt2?text=A+long+time+ago%2C+) -- [Inferencia del lenguaje natural con RoBERTa](https://huggingface.co/roberta-large-mnli?text=The+dog+was+lost.+Nobody+lost+any+animal) +- [Generación de texto con GPT-2](https://huggingface.co/openai-community/gpt2?text=A+long+time+ago%2C+) +- [Inferencia del lenguaje natural con RoBERTa](https://huggingface.co/FacebookAI/roberta-large-mnli?text=The+dog+was+lost.+Nobody+lost+any+animal) - [Resumen con 
BART](https://huggingface.co/facebook/bart-large-cnn?text=The+tower+is+324+metres+%281%2C063+ft%29+tall%2C+about+the+same+height+as+an+81-storey+building%2C+and+the+tallest+structure+in+Paris.+Its+base+is+square%2C+measuring+125+metres+%28410+ft%29+on+each+side.+During+its+construction%2C+the+Eiffel+Tower+surpassed+the+Washington+Monument+to+become+the+tallest+man-made+structure+in+the+world%2C+a+title+it+held+for+41+years+until+the+Chrysler+Building+in+New+York+City+was+finished+in+1930.+It+was+the+first+structure+to+reach+a+height+of+300+metres.+Due+to+the+addition+of+a+broadcasting+aerial+at+the+top+of+the+tower+in+1957%2C+it+is+now+taller+than+the+Chrysler+Building+by+5.2+metres+%2817+ft%29.+Excluding+transmitters%2C+the+Eiffel+Tower+is+the+second+tallest+free-standing+structure+in+France+after+the+Millau+Viaduct) -- [Responder a preguntas con DistilBERT](https://huggingface.co/distilbert-base-uncased-distilled-squad?text=Which+name+is+also+used+to+describe+the+Amazon+rainforest+in+English%3F&context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species) -- [Traducción con T5](https://huggingface.co/t5-base?text=My+name+is+Wolfgang+and+I+live+in+Berlin) +- [Responder a preguntas con 
DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased-distilled-squad?text=Which+name+is+also+used+to+describe+the+Amazon+rainforest+in+English%3F&context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species) +- [Traducción con T5](https://huggingface.co/google-t5/t5-base?text=My+name+is+Wolfgang+and+I+live+in+Berlin) En visión de ordenador: - [Clasificación de imágenes con ViT](https://huggingface.co/google/vit-base-patch16-224) @@ -169,8 +174,8 @@ Además de `pipeline`, para descargar y usar cualquiera de los modelos previamen ```python >>> from transformers import AutoTokenizer, AutoModel ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") ->>> model = AutoModel.from_pretrained("bert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = AutoModel.from_pretrained("google-bert/bert-base-uncased") >>> inputs = tokenizer("Hello world!", return_tensors="pt") >>> outputs = model(**inputs) @@ -180,14 +185,14 @@ Y aquí está el código equivalente para TensorFlow: ```python >>> from transformers import AutoTokenizer, TFAutoModel ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") ->>> model = TFAutoModel.from_pretrained("bert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = TFAutoModel.from_pretrained("google-bert/bert-base-uncased") >>> inputs = tokenizer("Hello world!", return_tensors="tf") >>> outputs = model(**inputs) ``` -El tokenizador es responsable de todo el preprocesamiento que espera el modelo preentrenado y se puede llamar directamente en una sola cadena (como en los ejemplos anteriores) o en una lista. Dará como resultado un diccionario que puedes usar en el código descendente o simplemente pasarlo directamente a su modelo usando el operador de desempaquetado de argumento **. +El tokenizador es responsable de todo el preprocesamiento que espera el modelo preentrenado y se puede llamar directamente en una sola cadena (como en los ejemplos anteriores) o en una lista. Este dará como resultado un diccionario que puedes usar en el código descendente o simplemente pasarlo directamente a su modelo usando el operador de desempaquetado de argumento **. 
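Como esbozo mínimo a modo de ilustración (suponiendo que ya tienes instalados PyTorch y 🤗 Transformers, y reutilizando el checkpoint `google-bert/bert-base-uncased` de los ejemplos anteriores), así se tokeniza una lista de frases y se pasa el diccionario resultante al modelo con el operador `**`:

```python
>>> from transformers import AutoTokenizer, AutoModel

>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> model = AutoModel.from_pretrained("google-bert/bert-base-uncased")

>>> # El tokenizador también acepta una lista de cadenas; `padding=True` iguala las longitudes dentro del lote
>>> inputs = tokenizer(["Hello world!", "Hello transformers!"], padding=True, truncation=True, return_tensors="pt")

>>> # `inputs` es un diccionario (input_ids, attention_mask, ...) que se desempaqueta directamente en el modelo con `**`
>>> outputs = model(**inputs)

>>> # `outputs.last_hidden_state` es un tensor con forma (tamaño del lote, longitud de secuencia, dimensión oculta)
```

Con `return_tensors="tf"` el mismo diccionario sirve para la versión de TensorFlow (`TFAutoModel`) del ejemplo anterior.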
El modelo en sí es un [PyTorch `nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) normal o un [TensorFlow `tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) (dependiendo de tu backend) que puedes usar de forma habitual. [Este tutorial](https://huggingface.co/docs/transformers/training) explica cómo integrar un modelo de este tipo en un ciclo de entrenamiento PyTorch o TensorFlow clásico, o cómo usar nuestra API `Trainer` para ajustarlo rápidamente en un nuevo conjunto de datos. @@ -224,13 +229,13 @@ El modelo en si es un [Pytorch `nn.Module`](https://pytorch.org/docs/stable/nn.h ### Con pip -Este repositorio está probado en Python 3.6+, Flax 0.3.2+, PyTorch 1.3.1+ y TensorFlow 2.3+. +Este repositorio está probado en Python 3.8+, Flax 0.4.1+, PyTorch 1.11+ y TensorFlow 2.6+. -Deberías instalar 🤗 Transformers en un [ambiente virtual](https://docs.python.org/3/library/venv.html). Si no estas familiarizado con los entornos virtuales de Python, consulta la [guía de usuario](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/). +Deberías instalar 🤗 Transformers en un [entorno virtual](https://docs.python.org/3/library/venv.html). Si no estás familiarizado con los entornos virtuales de Python, consulta la [guía de usuario](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/). Primero, crea un entorno virtual con la versión de Python que vas a usar y actívalo. -Luego, deberás instalar al menos uno de Flax, PyTorch o TensorFlow. +Luego, deberás instalar al menos uno entre Flax, PyTorch o TensorFlow. Por favor, ve a la [página de instalación de TensorFlow](https://www.tensorflow.org/install/), [página de instalación de PyTorch](https://pytorch.org/get-started/locally/#start-locally) y/o las páginas de instalación de [Flax](https://github.com/google/flax#quick-install) y [Jax](https://github.com/google/jax#installation) con respecto al comando de instalación específico para tu plataforma. Cuando se ha instalado uno de esos backends, los 🤗 Transformers se pueden instalar usando pip de la siguiente manera: @@ -243,14 +248,14 @@ Si deseas jugar con los ejemplos o necesitas la última versión del código y n ### Con conda -Desde la versión v4.0.0 de Transformers, ahora tenemos un canal conda: `huggingface`. - 🤗 Transformers se puede instalar usando conda de la siguiente manera: ```shell script -conda install -c huggingface transformers +conda install conda-forge::transformers ``` + +> **_NOTA:_** Instalar `transformers` desde el canal `huggingface` está obsoleto. + Sigue las páginas de instalación de Flax, PyTorch o TensorFlow para ver cómo instalarlos con conda. > **_NOTA:_** En Windows, es posible que se le pida que active el modo de desarrollador para beneficiarse del almacenamiento en caché. Si esta no es una opción para usted, háganoslo saber en [esta issue](https://github.com/huggingface/huggingface_hub/issues/1062). @@ -264,8 +269,11 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt 🤗 Transformers actualmente proporciona las siguientes arquitecturas (ver [aquí](https://huggingface.co/docs/transformers/model_summary) para un resumen de alto nivel de cada una de ellas): 1. 
**[ALBERT](https://huggingface.co/docs/transformers/model_doc/albert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. +1. **[ALIGN](https://huggingface.co/docs/transformers/model_doc/align)** (from Google Research) released with the paper [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918) by Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. 1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell. 1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass. +1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long. +1. **[Bark](https://huggingface.co/docs/transformers/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team. 1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer. 1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis. 1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen. @@ -280,22 +288,27 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt 1. **[Blenderbot](https://huggingface.co/docs/transformers/model_doc/blenderbot)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston. 1. 
**[BlenderbotSmall](https://huggingface.co/docs/transformers/model_doc/blenderbot-small)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston. 1. **[BLIP](https://huggingface.co/docs/transformers/model_doc/blip)** (from Salesforce) released with the paper [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086) by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi. -1. **[BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2)** (from Salesforce) released with the paper [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597) by Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. +1. **[BLIP-2](https://huggingface.co/docs/transformers/model_doc/blip-2)** (from Salesforce) released with the paper [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597) by Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. 1. **[BLOOM](https://huggingface.co/docs/transformers/model_doc/bloom)** (from BigScience workshop) released by the [BigScience Workshop](https://bigscience.huggingface.co/). 1. **[BORT](https://huggingface.co/docs/transformers/model_doc/bort)** (from Alexa) released with the paper [Optimal Subarchitecture Extraction For BERT](https://arxiv.org/abs/2010.10499) by Adrian de Wynter and Daniel J. Perry. -1. **[BridgeTower](https://huggingface.co/docs/transformers/main/model_doc/bridgetower)** (from Harbin Institute of Technology/Microsoft Research Asia/Intel Labs) released with the paper [BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning](https://arxiv.org/abs/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. +1. **[BridgeTower](https://huggingface.co/docs/transformers/model_doc/bridgetower)** (from Harbin Institute of Technology/Microsoft Research Asia/Intel Labs) released with the paper [BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning](https://arxiv.org/abs/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. +1. **[BROS](https://huggingface.co/docs/transformers/model_doc/bros)** (from NAVER CLOVA) released with the paper [BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents](https://arxiv.org/abs/2108.04539) by Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park. 1. **[ByT5](https://huggingface.co/docs/transformers/model_doc/byt5)** (from Google Research) released with the paper [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel. 1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot. 1. 
**[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. 1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou. -1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation]https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. +1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. 1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. 1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker. +1. **[CLVP](https://huggingface.co/docs/transformers/model_doc/clvp)** released with the paper [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker. 1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong. +1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (from MetaAI) released with the paper [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve. 1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (from Microsoft Research Asia) released with the paper [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang. 1. 
**[ConvBERT](https://huggingface.co/docs/transformers/model_doc/convbert)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan. 1. **[ConvNeXT](https://huggingface.co/docs/transformers/model_doc/convnext)** (from Facebook AI) released with the paper [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie. +1. **[ConvNeXTV2](https://huggingface.co/docs/transformers/model_doc/convnextv2)** (from Facebook AI) released with the paper [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie. 1. **[CPM](https://huggingface.co/docs/transformers/model_doc/cpm)** (from Tsinghua University) released with the paper [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun. +1. **[CPM-Ant](https://huggingface.co/docs/transformers/model_doc/cpmant)** (from OpenBMB) released by the [OpenBMB](https://www.openbmb.org/). 1. **[CTRL](https://huggingface.co/docs/transformers/model_doc/ctrl)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher. 1. **[CvT](https://huggingface.co/docs/transformers/model_doc/cvt)** (from Microsoft) released with the paper [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang. 1. **[Data2Vec](https://huggingface.co/docs/transformers/model_doc/data2vec)** (from Facebook) released with the paper [Data2Vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli. @@ -304,41 +317,58 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt 1. **[Decision Transformer](https://huggingface.co/docs/transformers/model_doc/decision_transformer)** (from Berkeley/Facebook/Google) released with the paper [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345) by Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch. 1. **[Deformable DETR](https://huggingface.co/docs/transformers/model_doc/deformable_detr)** (from SenseTime Research) released with the paper [Deformable DETR: Deformable Transformers for End-to-End Object Detection](https://arxiv.org/abs/2010.04159) by Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai. 1. 
**[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou. -1. **[DETA](https://huggingface.co/docs/transformers/main/model_doc/deta)** (from The University of Texas at Austin) released with the paper [NMS Strikes Back](https://arxiv.org/abs/2212.06137) by Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl. +1. **[DePlot](https://huggingface.co/docs/transformers/model_doc/deplot)** (from Google AI) released with the paper [DePlot: One-shot visual language reasoning by plot-to-table translation](https://arxiv.org/abs/2212.10505) by Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, Yasemin Altun. +1. **[Depth Anything](https://huggingface.co/docs/transformers/main/model_doc/depth_anything)** (from University of Hong Kong and TikTok) released with the paper [Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891) by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao. +1. **[DETA](https://huggingface.co/docs/transformers/model_doc/deta)** (from The University of Texas at Austin) released with the paper [NMS Strikes Back](https://arxiv.org/abs/2212.06137) by Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl. 1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko. 1. **[DialoGPT](https://huggingface.co/docs/transformers/model_doc/dialogpt)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan. 1. **[DiNAT](https://huggingface.co/docs/transformers/model_doc/dinat)** (from SHI Labs) released with the paper [Dilated Neighborhood Attention Transformer](https://arxiv.org/abs/2209.15001) by Ali Hassani and Humphrey Shi. +1. **[DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2)** (from Meta AI) released with the paper [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski. 1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. 
The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) and a German version of DistilBERT. 1. **[DiT](https://huggingface.co/docs/transformers/model_doc/dit)** (from Microsoft Research) released with the paper [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei. 1. **[Donut](https://huggingface.co/docs/transformers/model_doc/donut)** (from NAVER), released together with the paper [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park. 1. **[DPR](https://huggingface.co/docs/transformers/model_doc/dpr)** (from Facebook) released with the paper [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 1. **[DPT](https://huggingface.co/docs/transformers/master/model_doc/dpt)** (from Intel Labs) released with the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun. 1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (from Snap Research) released with the paper [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) by Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren. +1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le. 1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. +1. **[EnCodec](https://huggingface.co/docs/transformers/model_doc/encodec)** (from Meta AI) released with the paper [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438) by Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi. 1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn. 1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (from Baidu) released with the paper [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) by Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu. -1. 
**[ErnieM](https://huggingface.co/docs/transformers/main/model_doc/ernie_m)** (from Baidu) released with the paper [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) by Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. +1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (from Baidu) released with the paper [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) by Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. 1. **[ESM](https://huggingface.co/docs/transformers/model_doc/esm)** (from Meta AI) are transformer protein language models. **ESM-1b** was released with the paper [Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences](https://www.pnas.org/content/118/15/e2016239118) by Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. **ESM-1v** was released with the paper [Language models enable zero-shot prediction of the effects of mutations on protein function](https://doi.org/10.1101/2021.07.09.450648) by Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu and Alexander Rives. **ESM-2** was released with the paper [Language models of protein sequences at the scale of evolution enable accurate structure prediction](https://doi.org/10.1101/2022.07.20.500902) by Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, Alexander Rives. +1. **[Falcon](https://huggingface.co/docs/transformers/model_doc/falcon)** (from Technology Innovation Institute) by Almazrouei, Ebtesam and Alobeidli, Hamza and Alshamsi, Abdulaziz and Cappelli, Alessandro and Cojocaru, Ruxandra and Debbah, Merouane and Goffinet, Etienne and Heslow, Daniel and Launay, Julien and Malartic, Quentin and Noune, Badreddine and Pannier, Baptiste and Penedo, Guilherme. +1. **[FastSpeech2Conformer](https://huggingface.co/docs/transformers/model_doc/fastspeech2_conformer)** (from ESPnet) released with the paper [Recent Developments On Espnet Toolkit Boosted By Conformer](https://arxiv.org/abs/2010.13956) by Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang. 1. **[FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei +1.
**[FLAN-UL2](https://huggingface.co/docs/transformers/model_doc/flan-ul2)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-ul2-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei 1. **[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab. 1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc/flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. 1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. +1. **[FocalNet](https://huggingface.co/docs/transformers/model_doc/focalnet)** (from Microsoft Research) released with the paper [Focal Modulation Networks](https://arxiv.org/abs/2203.11926) by Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao. 1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le. +1. **[Fuyu](https://huggingface.co/docs/transformers/model_doc/fuyu)** (from ADEPT) released in a [blog post](https://www.adept.ai/blog/fuyu-8b) by Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar. 1. **[GIT](https://huggingface.co/docs/transformers/model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang. 1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim. -1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. +1.
**[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://openai.com/research/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. 1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. 1. **[GPT NeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox)** (from EleutherAI) released with the paper [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745) by Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach 1. **[GPT NeoX Japanese](https://huggingface.co/docs/transformers/model_doc/gpt_neox_japanese)** (from ABEJA) released by Shinya Otani, Takayoshi Makabe, Anuj Arora, and Kyo Hattori. -1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**. +1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://openai.com/research/better-language-models/) by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. 1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki. 1. **[GPT-Sw3](https://huggingface.co/docs/transformers/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren. +1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra. +1. 
**[GPTSAN-japanese](https://huggingface.co/docs/transformers/model_doc/gptsan-japanese)** released in the repository [tanreinama/GPTSAN](https://github.com/tanreinama/GPTSAN/blob/main/report/model.md) by Toshiyuki Sakamoto (tanreinama). 1. **[Graphormer](https://huggingface.co/docs/transformers/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu. 1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang. +1. **[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** (from Allegro.pl, AGH University of Science and Technology) released with the paper [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik. 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed. 1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer. +1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. +1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. +1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (from Salesforce) released with the paper [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. 1.
**[Jukebox](https://huggingface.co/docs/transformers/model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever. +1. **[KOSMOS-2](https://huggingface.co/docs/transformers/model_doc/kosmos-2)** (from Microsoft Research Asia) released with the paper [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei. 1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou. 1. **[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou. 1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (from Microsoft Research Asia) released with the paper [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei. @@ -346,43 +376,69 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt 1. **[LED](https://huggingface.co/docs/transformers/model_doc/led)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan. 1. **[LeViT](https://huggingface.co/docs/transformers/model_doc/levit)** (from Meta AI) released with the paper [LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference](https://arxiv.org/abs/2104.01136) by Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, Matthijs Douze. 1. **[LiLT](https://huggingface.co/docs/transformers/model_doc/lilt)** (from South China University of Technology) released with the paper [LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding](https://arxiv.org/abs/2202.13669) by Jiapeng Wang, Lianwen Jin, Kai Ding. +1. **[LLaMA](https://huggingface.co/docs/transformers/model_doc/llama)** (from The FAIR team of Meta AI) released with the paper [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971) by Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. +1. 
**[Llama2](https://huggingface.co/docs/transformers/model_doc/llama2)** (from The FAIR team of Meta AI) released with the paper [Llama 2: Open Foundation and Fine-Tuned Chat Models](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/XXX) by Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom. +1. **[LLaVa](https://huggingface.co/docs/transformers/model_doc/llava)** (from Microsoft Research & University of Wisconsin-Madison) released with the paper [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485) by Haotian Liu, Chunyuan Li, Yuheng Li and Yong Jae Lee. 1. **[Longformer](https://huggingface.co/docs/transformers/model_doc/longformer)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan. 1. **[LongT5](https://huggingface.co/docs/transformers/model_doc/longt5)** (from Google AI) released with the paper [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/abs/2112.07916) by Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang. 1. **[LUKE](https://huggingface.co/docs/transformers/model_doc/luke)** (from Studio Ousia) released with the paper [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto. 1. **[LXMERT](https://huggingface.co/docs/transformers/model_doc/lxmert)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal. 1. **[M-CTC-T](https://huggingface.co/docs/transformers/model_doc/mctct)** (from Facebook) released with the paper [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161) by Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert. 1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin. +1.
**[MADLAD-400](https://huggingface.co/docs/transformers/model_doc/madlad-400)** (from Google) released with the paper [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](https://arxiv.org/abs/2309.04662) by Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat. 1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team. 1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (from Microsoft Research Asia) released with the paper [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei. 1. **[Mask2Former](https://huggingface.co/docs/transformers/model_doc/mask2former)** (from FAIR and UIUC) released with the paper [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) by Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar. 1. **[MaskFormer](https://huggingface.co/docs/transformers/model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov. +1. **[MatCha](https://huggingface.co/docs/transformers/model_doc/matcha)** (from Google AI) released with the paper [MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering](https://arxiv.org/abs/2212.09662) by Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, Julian Martin Eisenschlos. 1. **[mBART](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. 1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan. +1. **[MEGA](https://huggingface.co/docs/transformers/model_doc/mega)** (from Facebook) released with the paper [Mega: Moving Average Equipped Gated Attention](https://arxiv.org/abs/2209.10655) by Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. 1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. 1. 
**[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. +1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (from Alibaba Research) released with the paper [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) by Peng Wang, Cheng Da, and Cong Yao. +1. **[Mistral](https://huggingface.co/docs/transformers/model_doc/mistral)** (from Mistral AI) by The [Mistral AI](https://mistral.ai) team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. +1. **[Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral)** (from Mistral AI) by The [Mistral AI](https://mistral.ai) team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. 1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (from Studio Ousia) released with the paper [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka. +1. **[MMS](https://huggingface.co/docs/transformers/model_doc/mms)** (from Facebook) released with the paper [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516) by Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli. 1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (from Google Inc.) released with the paper [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam. 1. **[MobileNetV2](https://huggingface.co/docs/transformers/model_doc/mobilenet_v2)** (from Google Inc.) released with the paper [MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/abs/1801.04381) by Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen. 1. **[MobileViT](https://huggingface.co/docs/transformers/model_doc/mobilevit)** (from Apple) released with the paper [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari. +1.
**[MobileViTV2](https://huggingface.co/docs/transformers/model_doc/mobilevitv2)** (from Apple) released with the paper [Separable Self-attention for Mobile Vision Transformers](https://arxiv.org/abs/2206.02680) by Sachin Mehta and Mohammad Rastegari. 1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu. +1. **[MPT](https://huggingface.co/docs/transformers/model_doc/mpt)** (from MosaicML) released with the repository [llm-foundry](https://github.com/mosaicml/llm-foundry/) by the MosaicML NLP Team. +1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (from the University of Wisconsin - Madison) released with the paper [Multi Resolution Analysis (MRA) for Approximate Self-Attention](https://arxiv.org/abs/2207.10284) by Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh. 1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel. +1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez. 1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (from RUC AI Box) released with the paper [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen. 1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (from SHI Labs) released with the paper [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi. 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu. 1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team. +1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team. +1. **[Nougat](https://huggingface.co/docs/transformers/model_doc/nougat)** (from Meta AI) released with the paper [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) by Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic. 1.
**[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh. 1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi. +1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released on GitHub (now removed). 1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al. 1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. +1. **[OWLv2](https://huggingface.co/docs/transformers/model_doc/owlv2)** (from Google AI) released with the paper [Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683) by Matthias Minderer, Alexey Gritsenko, Neil Houlsby. +1. **[PatchTSMixer](https://huggingface.co/docs/transformers/model_doc/patchtsmixer)** (from IBM Research) released with the paper [TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting](https://arxiv.org/pdf/2306.09364.pdf) by Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, Jayant Kalagnanam. +1. **[PatchTST](https://huggingface.co/docs/transformers/model_doc/patchtst)** (from IBM) released with the paper [A Time Series is Worth 64 Words: Long-term Forecasting with Transformers](https://arxiv.org/pdf/2211.14730.pdf) by Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, Jayant Kalagnanam. 1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu. 1. **[PEGASUS-X](https://huggingface.co/docs/transformers/model_doc/pegasus_x)** (from Google) released with the paper [Investigating Efficiently Extending Transformers for Long Input Summarization](https://arxiv.org/abs/2208.04347) by Jason Phang, Yao Zhao, and Peter J. Liu. 1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira. +1. 
**[Persimmon](https://huggingface.co/docs/transformers/model_doc/persimmon)** (from ADEPT) released in a [blog post](https://www.adept.ai/blog/persimmon-8b) by Erich Elsen, Augustus Odena, Maxwell Nye, Sağnak Taşırlar, Tri Dao, Curtis Hawthorne, Deepak Moparthi, Arushi Somani. +1. **[Phi](https://huggingface.co/docs/transformers/model_doc/phi)** (from Microsoft) released with the papers [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li, and [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee. 1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen. +1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (from Google) released with the paper [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) by Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova. 1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. 1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng. +1. **[Pop2Piano](https://huggingface.co/docs/transformers/model_doc/pop2piano)** released with the paper [Pop2Piano: Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee. 1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. +1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (from Nanjing University, The University of Hong Kong etc.) released with the paper [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao. 1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius. +1.
**[Qwen2](https://huggingface.co/docs/transformers/model_doc/qwen2)** (from the Qwen team, Alibaba Group) released with the paper [Qwen Technical Report](https://arxiv.org/abs/2309.16609) by Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou and Tianhang Zhu. 1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (from Facebook) released with the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) by Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. 1. **[REALM](https://huggingface.co/docs/transformers/model_doc/realm.html)** (from Google Research) released with the paper [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909) by Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang. 1. **[Reformer](https://huggingface.co/docs/transformers/model_doc/reformer)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya. @@ -393,14 +449,21 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt 1. **[RoBERTa-PreLayerNorm](https://huggingface.co/docs/transformers/model_doc/roberta-prelayernorm)** (from Facebook) released with the paper [fairseq: A Fast, Extensible Toolkit for Sequence Modeling](https://arxiv.org/abs/1904.01038) by Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli. 1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by Hui Su, Weiwei Shi, Xiaoyu Shen, Xiao Zhou, Tuo Ji, Jiarui Fang, Jie Zhou. 1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu. +1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng) released in the repository [BlinkDL/RWKV-LM](https://github.com/BlinkDL/RWKV-LM) by Bo Peng. +1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team. +1.
**[SeamlessM4Tv2](https://huggingface.co/docs/transformers/model_doc/seamless_m4t_v2)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team. 1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo. +1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick. 1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi. 1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi. -1. **[SpeechT5](https://huggingface.co/docs/transformers/main/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei. +1. **[SigLIP](https://huggingface.co/docs/transformers/model_doc/siglip)** (from Google AI) released with the paper [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer. +1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei. 1. **[SpeechToTextTransformer](https://huggingface.co/docs/transformers/model_doc/speech_to_text)** (from Facebook), released together with the paper [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino. 1. **[SpeechToTextTransformer2](https://huggingface.co/docs/transformers/model_doc/speech_to_text_2)** (from Facebook), released together with the paper [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau. 1. 
**[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (from Tel Aviv University), released together with the paper [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy. 1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer. +1. **[StableLm](https://huggingface.co/docs/transformers/main/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu. +1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (from MBZUAI) released with the paper [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446) by Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan. 1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo. 1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo. 1. **[Swin2SR](https://huggingface.co/docs/transformers/model_doc/swin2sr)** (from University of Würzburg) released with the paper [Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration](https://arxiv.org/abs/2209.11345) by Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte. @@ -415,40 +478,49 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt 1. **[Trajectory Transformer](https://huggingface.co/docs/transformers/model_doc/trajectory_transformers)** (from the University of California at Berkeley) released with the paper [Offline Reinforcement Learning as One Big Sequence Modeling Problem](https://arxiv.org/abs/2106.02039) by Michael Janner, Qiyang Li, Sergey Levine 1. **[Transformer-XL](https://huggingface.co/docs/transformers/model_doc/transfo-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. 1. 
**[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. -1. **[TVLT](https://huggingface.co/docs/transformers/main/model_doc/tvlt)** (from UNC Chapel Hill) released with the paper [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) by Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal. +1. **[TVLT](https://huggingface.co/docs/transformers/model_doc/tvlt)** (from UNC Chapel Hill) released with the paper [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) by Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal. +1. **[TVP](https://huggingface.co/docs/transformers/model_doc/tvp)** (from Intel) released with the paper [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding. 1. **[UL2](https://huggingface.co/docs/transformers/model_doc/ul2)** (from Google Research) released with the paper [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) by Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler +1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (from Google Research) released with the paper [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) by Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant. 1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang. 1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu. +1. **[UnivNet](https://huggingface.co/docs/transformers/model_doc/univnet)** (from Kakao Corporation) released with the paper [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim. 1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (from Peking University) released with the paper [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221) by Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun. 1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (from Tsinghua University and Nankai University) released with the paper [Visual Attention Network](https://arxiv.org/abs/2202.09741) by Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu. 1. 
**[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (from Multimedia Computing Group, Nanjing University) released with the paper [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) by Zhan Tong, Yibing Song, Jue Wang, Limin Wang. 1. **[ViLT](https://huggingface.co/docs/transformers/model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim. +1. **[VipLlava](https://huggingface.co/docs/transformers/model_doc/vipllava)** (from University of Wisconsin–Madison) released with the paper [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https://arxiv.org/abs/2312.00784) by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee. 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. 1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang. 1. **[ViT Hybrid](https://huggingface.co/docs/transformers/model_doc/vit_hybrid)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. +1. **[VitDet](https://huggingface.co/docs/transformers/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. +1. **[ViTMatte](https://huggingface.co/docs/transformers/model_doc/vitmatte)** (from HUST-VL) released with the paper [ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers](https://arxiv.org/abs/2305.15272) by Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang. 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas. +1. 
**[VITS](https://huggingface.co/docs/transformers/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son. +1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli. +1. **[Wav2Vec2-BERT](https://huggingface.co/docs/transformers/model_doc/wav2vec2-bert)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team. 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino. 1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli. 1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei. 1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever. 1. **[X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling. -1. **[X-MOD](https://huggingface.co/docs/transformers/main/model_doc/xmod)** (from Meta AI) released with the paper [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](http://dx.doi.org/10.18653/v1/2022.naacl-main.255) by Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe. +1. 
**[X-MOD](https://huggingface.co/docs/transformers/model_doc/xmod)** (from Meta AI) released with the paper [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](http://dx.doi.org/10.18653/v1/2022.naacl-main.255) by Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe. 1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li. 1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau. 1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. 1. **[XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. 1. **[XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/model_doc/xlm-roberta-xl)** (from Facebook AI), released together with the paper [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://arxiv.org/abs/2105.00572) by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau. 1. **[XLM-V](https://huggingface.co/docs/transformers/model_doc/xlm-v)** (from Meta AI) released with the paper [XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models](https://arxiv.org/abs/2301.10472) by Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, Madian Khabsa. -1. **[XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet)** (from Google/CMU) released with the paper [​XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le. +1. **[XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le. 1. 
**[XLS-R](https://huggingface.co/docs/transformers/model_doc/xls_r)** (from Facebook AI) released with the paper [XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale](https://arxiv.org/abs/2111.09296) by Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli. 1. **[XLSR-Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/xlsr_wav2vec2)** (from Facebook AI) released with the paper [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://arxiv.org/abs/2006.13979) by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli. 1. **[YOLOS](https://huggingface.co/docs/transformers/model_doc/yolos)** (from Huazhong University of Science & Technology) released with the paper [You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection](https://arxiv.org/abs/2106.00666) by Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu. 1. **[YOSO](https://huggingface.co/docs/transformers/model_doc/yoso)** (from the University of Wisconsin - Madison) released with the paper [You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling](https://arxiv.org/abs/2111.09714) by Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh. 1. ¿Quieres aportar un nuevo modelo? Hemos agregado una **guía detallada y plantillas** para guiarte en el proceso de agregar un nuevo modelo. Puedes encontrarlos en la carpeta de [`templates`](./templates) del repositorio. Asegúrate de revisar las [pautas de contribución](./CONTRIBUTING.md) y comunícate con los mantenedores o abra un problema para recopilar comentarios antes de comenzar su PR. -Para comprobar si cada modelo tiene una implementación en Flax, PyTorch o TensorFlow, o tiene un tokenizador asociado respaldado por la librería 🤗 Tokenizers , ve a [esta tabla](https://huggingface.co/docs/transformers/index#supported-frameworks). +Para comprobar si cada modelo tiene una implementación en Flax, PyTorch o TensorFlow, o tiene un tokenizador asociado respaldado por la librería 🤗 Tokenizers, ve a [esta tabla](https://huggingface.co/docs/transformers/index#supported-frameworks). Estas implementaciones se han probado en varios conjuntos de datos (consulte los scripts de ejemplo) y deberían coincidir con el rendimiento de las implementaciones originales. Puede encontrar más detalles sobre el rendimiento en la sección Examples de la [documentación](https://github.com/huggingface/transformers/tree/main/examples). 
@@ -459,7 +531,7 @@ Estas implementaciones se han probado en varios conjuntos de datos (consulte los |-|-| | [Documentación](https://huggingface.co/docs/transformers/) | Toda la documentación de la API y tutoriales | | [Resumen de tareas](https://huggingface.co/docs/transformers/task_summary) | Tareas soportadas 🤗 Transformers | -| [Tutorial de preprocesAmiento](https://huggingface.co/docs/transformers/preprocessing) | Usando la clase `Tokenizer` para preparar datos para los modelos | +| [Tutorial de preprocesamiento](https://huggingface.co/docs/transformers/preprocessing) | Usando la clase `Tokenizer` para preparar datos para los modelos | | [Entrenamiento y puesta a punto](https://huggingface.co/docs/transformers/training) | Usando los modelos aportados por 🤗 Transformers en un bucle de entreno de PyTorch/TensorFlow y la API de `Trainer` | | [Recorrido rápido: secuencias de comandos de ajuste/uso](https://github.com/huggingface/transformers/tree/main/examples) | Scripts de ejemplo para ajustar modelos en una amplia gama de tareas | | [Compartir y subir modelos](https://huggingface.co/docs/transformers/model_sharing) | Carga y comparte tus modelos perfeccionados con la comunidad | @@ -467,7 +539,7 @@ Estas implementaciones se han probado en varios conjuntos de datos (consulte los ## Citación -Ahora nosotros tenemos un [papel](https://www.aclweb.org/anthology/2020.emnlp-demos.6/) que puedes citar para la librería de 🤗 Transformers: +Ahora nosotros tenemos un [paper](https://www.aclweb.org/anthology/2020.emnlp-demos.6/) que puedes citar para la librería de 🤗 Transformers: ```bibtex @inproceedings{wolf-etal-2020-transformers, title = "Transformers: State-of-the-Art Natural Language Processing", diff --git a/README_fr.md b/README_fr.md new file mode 100644 index 00000000000000..75ebdd315f651d --- /dev/null +++ b/README_fr.md @@ -0,0 +1,574 @@ + + +

+Bibliothèque Hugging Face Transformers
+
+Badges : Construction | GitHub | Documentation | Version GitHub | Pacte des contributeurs | DOI
+
+English | 简体中文 | 繁體中文 | 한국어 | Español | 日本語 | हिन्दी | Русский | Português | తెలుగు | Français | Deutsch
+
+Apprentissage automatique de pointe pour JAX, PyTorch et TensorFlow

+ +🤗 Transformers fournit des milliers de modèles pré-entraînés pour effectuer des tâches sur différentes modalités telles que le texte, la vision et l'audio. + +Ces modèles peuvent être appliqués à : + +* 📝 Texte, pour des tâches telles que la classification de texte, l'extraction d'informations, la réponse aux questions, le résumé, la traduction et la génération de texte, dans plus de 100 langues. +* 🖼️ Images, pour des tâches telles que la classification d'images, la détection d'objets et la segmentation. +* 🗣️ Audio, pour des tâches telles que la reconnaissance vocale et la classification audio. + +Les modèles de transformer peuvent également effectuer des tâches sur **plusieurs modalités combinées**, telles que la réponse aux questions sur des tableaux, la reconnaissance optique de caractères, l'extraction d'informations à partir de documents numérisés, la classification vidéo et la réponse aux questions visuelles. + +🤗 Transformers fournit des API pour télécharger et utiliser rapidement ces modèles pré-entraînés sur un texte donné, les affiner sur vos propres ensembles de données, puis les partager avec la communauté sur notre [hub de modèles](https://huggingface.co/models). En même temps, chaque module Python définissant une architecture est complètement indépendant et peut être modifié pour permettre des expériences de recherche rapides. + +🤗 Transformers est soutenu par les trois bibliothèques d'apprentissage profond les plus populaires — [Jax](https://jax.readthedocs.io/en/latest/), [PyTorch](https://pytorch.org/) et [TensorFlow](https://www.tensorflow.org/) — avec une intégration transparente entre elles. Il est facile d'entraîner vos modèles avec l'une avant de les charger pour l'inférence avec l'autre. + +## Démos en ligne + +Vous pouvez tester la plupart de nos modèles directement sur leurs pages du [hub de modèles](https://huggingface.co/models). Nous proposons également [l'hébergement privé de modèles, le versionnage et une API d'inférence](https://huggingface.co/pricing) pour des modèles publics et privés. 
+ +Voici quelques exemples : + +En traitement du langage naturel : +- [Complétion de mots masqués avec BERT](https://huggingface.co/google-bert/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France) +- [Reconnaissance d'entités nommées avec Electra](https://huggingface.co/dbmdz/electra-large-discriminator-finetuned-conll03-english?text=My+name+is+Sarah+and+I+live+in+London+city) +- [Génération de texte avec GPT-2](https://huggingface.co/openai-community/gpt2?text=A+long+time+ago%2C+) +- [Inférence de langage naturel avec RoBERTa](https://huggingface.co/FacebookAI/roberta-large-mnli?text=The+dog+was+lost.+Nobody+lost+any+animal) +- [Résumé avec BART](https://huggingface.co/facebook/bart-large-cnn?text=The+tower+is+324+metres+%281%2C063+ft%29+tall%2C+about+the+same+height+as+an+81-storey+building%2C+and+the+tallest+structure+in+Paris.+Its+base+is+square%2C+measuring+125+metres+%28410+ft%29+on+each+side.+During+its+construction%2C+the+Eiffel+Tower+surpassed+the+Washington+Monument+to+become+the+tallest+man-made+structure+in+the+world%2C+a+title+it+held+for+41+years+until+the+Chrysler+Building+in+New+York+City+was+finished+in+1930.+It+was+the+first+structure+to+reach+a+height+of+300+metres.+Due+to+the+addition+of+a+broadcasting+aerial+at+the+top+of+the+tower+in+1957%2C+it+is+now+taller+than+the+Chrysler+Building+by+5.2+metres+%2817+ft%29.+Excluding+transmitters%2C+the+Eiffel+Tower+is+the+second+tallest+free-standing+structure+in+France+after+the+Millau+Viaduct) +- [Réponse aux questions avec DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased-distilled-squad?text=Which+name+is+also+used+to+describe+the+Amazon+rainforest+in+English%3F&context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species) +- [Traduction avec T5](https://huggingface.co/google-t5/t5-base?text=My+name+is+Wolfgang+and+I+live+in+Berlin) + +En vision par ordinateur : +- [Classification d'images avec ViT](https://huggingface.co/google/vit-base-patch16-224) +- [Détection d'objets avec DETR](https://huggingface.co/facebook/detr-resnet-50) +- [Segmentation sémantique avec SegFormer](https://huggingface.co/nvidia/segformer-b0-finetuned-ade-512-512) +- [Segmentation panoptique avec MaskFormer](https://huggingface.co/facebook/maskformer-swin-small-coco) +- [Estimation de profondeur avec DPT](https://huggingface.co/docs/transformers/model_doc/dpt) +- [Classification vidéo avec 
VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae) +- [Segmentation universelle avec OneFormer](https://huggingface.co/shi-labs/oneformer_ade20k_dinat_large) + +En audio : +- [Reconnaissance automatique de la parole avec Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base-960h) +- [Spotting de mots-clés avec Wav2Vec2](https://huggingface.co/superb/wav2vec2-base-superb-ks) +- [Classification audio avec Audio Spectrogram Transformer](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) + +Dans les tâches multimodales : +- [Réponses aux questions sur table avec TAPAS](https://huggingface.co/google/tapas-base-finetuned-wtq) +- [Réponses aux questions visuelles avec ViLT](https://huggingface.co/dandelin/vilt-b32-finetuned-vqa) +- [Classification d'images sans étiquette avec CLIP](https://huggingface.co/openai/clip-vit-large-patch14) +- [Réponses aux questions sur les documents avec LayoutLM](https://huggingface.co/impira/layoutlm-document-qa) +- [Classification vidéo sans étiquette avec X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip) + + +## 100 projets utilisant Transformers + +Transformers est plus qu'une boîte à outils pour utiliser des modèles pré-entraînés : c'est une communauté de projets construits autour de lui et du Hub Hugging Face. Nous voulons que Transformers permette aux développeurs, chercheurs, étudiants, professeurs, ingénieurs et à quiconque d'imaginer et de réaliser leurs projets de rêve. + +Afin de célébrer les 100 000 étoiles de transformers, nous avons décidé de mettre en avant la communauté et avons créé la page [awesome-transformers](./awesome-transformers.md) qui répertorie 100 projets incroyables construits autour de transformers. + +Si vous possédez ou utilisez un projet que vous pensez devoir figurer dans la liste, veuillez ouvrir une pull request pour l'ajouter ! + +## Si vous recherchez un support personnalisé de la part de l'équipe Hugging Face + + + Programme d'accélération des experts HuggingFace +
+
+## Tour rapide
+
+Pour utiliser immédiatement un modèle sur une entrée donnée (texte, image, audio, ...), nous fournissons l'API `pipeline`. Les pipelines regroupent un modèle pré-entraîné avec la préparation des données qui a été utilisée lors de l'entraînement de ce modèle. Voici comment utiliser rapidement un pipeline pour classer des textes en positif ou négatif :
+
+```python
+>>> from transformers import pipeline
+
+# Allouer un pipeline pour l'analyse de sentiment
+>>> classifieur = pipeline('sentiment-analysis')
+>>> classifieur("Nous sommes très heureux d'introduire le pipeline dans le référentiel transformers.")
+[{'label': 'POSITIF', 'score': 0.9996980428695679}]
+```
+
+La deuxième ligne de code télécharge et met en cache le modèle pré-entraîné utilisé par le pipeline, tandis que la troisième l'évalue sur le texte donné. Ici, la réponse est "positive" avec une confiance de 99,97%.
+
+De nombreuses tâches ont un pipeline pré-entraîné prêt à l'emploi, en NLP, mais aussi en vision par ordinateur et en parole. Par exemple, nous pouvons facilement extraire les objets détectés dans une image :
+
+```python
+>>> import requests
+>>> from PIL import Image
+>>> from transformers import pipeline
+
+# Télécharger une image avec de jolis chats
+>>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/coco_sample.png"
+>>> donnees_image = requests.get(url, stream=True).raw
+>>> image = Image.open(donnees_image)
+
+# Allouer un pipeline pour la détection d'objets
+>>> detecteur_objets = pipeline('object-detection')
+>>> detecteur_objets(image)
+[{'score': 0.9982201457023621,
+  'label': 'télécommande',
+  'box': {'xmin': 40, 'ymin': 70, 'xmax': 175, 'ymax': 117}},
+ {'score': 0.9960021376609802,
+  'label': 'télécommande',
+  'box': {'xmin': 333, 'ymin': 72, 'xmax': 368, 'ymax': 187}},
+ {'score': 0.9954745173454285,
+  'label': 'canapé',
+  'box': {'xmin': 0, 'ymin': 1, 'xmax': 639, 'ymax': 473}},
+ {'score': 0.9988006353378296,
+  'label': 'chat',
+  'box': {'xmin': 13, 'ymin': 52, 'xmax': 314, 'ymax': 470}},
+ {'score': 0.9986783862113953,
+  'label': 'chat',
+  'box': {'xmin': 345, 'ymin': 23, 'xmax': 640, 'ymax': 368}}]
+```
+
+Ici, nous obtenons une liste d'objets détectés dans l'image, avec une boîte entourant l'objet et un score de confiance. Voici l'image originale à gauche, avec les prédictions affichées à droite :
+

+ + +
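+
+Au besoin, le même `pipeline` peut être créé en précisant explicitement le point de contrôle à utiliser plutôt que le modèle par défaut de la tâche. Croquis indicatif, qui reprend l'image chargée ci-dessus et le point de contrôle `facebook/detr-resnet-50` cité dans les démos en ligne :
+
+```python
+>>> from transformers import pipeline
+
+# Croquis : même tâche de détection d'objets, mais avec un point de contrôle choisi explicitement
+>>> detecteur_objets = pipeline('object-detection', model='facebook/detr-resnet-50')
+>>> detecteur_objets(image)  # `image` est l'image PIL chargée dans l'exemple précédent
+```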

+
+Vous pouvez en savoir plus sur les tâches supportées par l'API pipeline dans [ce tutoriel](https://huggingface.co/docs/transformers/task_summary).
+
+En plus de `pipeline`, pour télécharger et utiliser n'importe lequel des modèles pré-entraînés sur votre tâche donnée, il suffit de trois lignes de code. Voici la version PyTorch :
+
+```python
+>>> from transformers import AutoTokenizer, AutoModel
+
+>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
+>>> model = AutoModel.from_pretrained("google-bert/bert-base-uncased")
+
+>>> inputs = tokenizer("Bonjour le monde !", return_tensors="pt")
+>>> outputs = model(**inputs)
+```
+
+Et voici le code équivalent pour TensorFlow :
+
+```python
+>>> from transformers import AutoTokenizer, TFAutoModel
+
+>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
+>>> model = TFAutoModel.from_pretrained("google-bert/bert-base-uncased")
+
+>>> inputs = tokenizer("Bonjour le monde !", return_tensors="tf")
+>>> outputs = model(**inputs)
+```
+
+Le tokenizer est responsable de toutes les étapes de prétraitement que le modèle préentraîné attend et peut être appelé directement sur une seule chaîne de caractères (comme dans les exemples ci-dessus) ou sur une liste. Il produira un dictionnaire que vous pouvez utiliser dans votre code ou simplement passer directement à votre modèle en utilisant l'opérateur de déballage `**`.
+
+Le modèle lui-même est un module [`nn.Module` PyTorch](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) ou un modèle [`tf.keras.Model` TensorFlow](https://www.tensorflow.org/api_docs/python/tf/keras/Model) (selon votre backend) que vous pouvez utiliser comme d'habitude. [Ce tutoriel](https://huggingface.co/docs/transformers/training) explique comment intégrer un tel modèle dans une boucle d'entraînement classique PyTorch ou TensorFlow, ou comment utiliser notre API `Trainer` pour affiner rapidement sur un nouvel ensemble de données.
+
+## Pourquoi devrais-je utiliser transformers ?
+
+1. Des modèles de pointe faciles à utiliser :
+  - Hautes performances en compréhension et génération de langage naturel, en vision par ordinateur et en tâches audio.
+  - Faible barrière à l'entrée pour les éducateurs et les praticiens.
+  - Peu d'abstractions visibles pour l'utilisateur avec seulement trois classes à apprendre.
+  - Une API unifiée pour utiliser tous nos modèles préentraînés.
+
+1. Coûts informatiques réduits, empreinte carbone plus petite :
+  - Les chercheurs peuvent partager des modèles entraînés au lieu de toujours les réentraîner.
+  - Les praticiens peuvent réduire le temps de calcul et les coûts de production.
+  - Des dizaines d'architectures avec plus de 400 000 modèles préentraînés dans toutes les modalités.
+
+1. Choisissez le bon framework pour chaque partie de la vie d'un modèle :
+  - Entraînez des modèles de pointe en 3 lignes de code (un croquis d'exemple est donné après cette liste).
+  - Transférez un seul modèle entre les frameworks TF2.0/PyTorch/JAX à volonté.
+  - Choisissez facilement le bon framework pour l'entraînement, l'évaluation et la production.
+
+1. Personnalisez facilement un modèle ou un exemple selon vos besoins :
+  - Nous fournissons des exemples pour chaque architecture afin de reproduire les résultats publiés par ses auteurs originaux.
+  - Les détails internes du modèle sont exposés de manière aussi cohérente que possible.
+  - Les fichiers de modèle peuvent être utilisés indépendamment de la bibliothèque pour des expériences rapides.
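+
+Pour illustrer le point « Entraînez des modèles de pointe en 3 lignes de code » et l'API `Trainer` évoquée plus haut, voici un croquis minimal d'affinage, donné à titre purement indicatif : le point de contrôle (`google-bert/bert-base-uncased`), le jeu de données (`imdb`, chargé via 🤗 Datasets) et les hyperparamètres sont choisis uniquement pour l'exemple.
+
+```python
+from datasets import load_dataset
+from transformers import (
+    AutoModelForSequenceClassification,
+    AutoTokenizer,
+    Trainer,
+    TrainingArguments,
+)
+
+# Croquis indicatif : affinage d'un classifieur de sentiments avec l'API `Trainer`.
+tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
+model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-uncased", num_labels=2)
+
+# Jeu de données d'exemple, tokenisé avec troncature à la longueur maximale du modèle.
+def tokeniser(exemples):
+    return tokenizer(exemples["text"], truncation=True)
+
+dataset = load_dataset("imdb")
+dataset_tokenise = dataset.map(tokeniser, batched=True)
+
+# Les trois lignes essentielles : arguments d'entraînement, Trainer, puis entraînement.
+arguments = TrainingArguments(output_dir="sortie_affinage", num_train_epochs=1)
+trainer = Trainer(model=model, args=arguments, train_dataset=dataset_tokenise["train"], tokenizer=tokenizer)
+trainer.train()
+```
+
+Le `Trainer` applique ici un rembourrage dynamique par lot grâce au tokenizer fourni ; pour une boucle d'entraînement entièrement personnalisée, le modèle reste un module `nn.Module` PyTorch classique, comme indiqué plus haut.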
+
+## Pourquoi ne devrais-je pas utiliser transformers ?
+
+- Cette bibliothèque n'est pas une boîte à outils modulaire de blocs de construction pour les réseaux neuronaux. Le code dans les fichiers de modèle n'est pas refactorisé avec des abstractions supplémentaires à dessein, afin que les chercheurs puissent itérer rapidement sur chacun des modèles sans plonger dans des abstractions/fichiers supplémentaires.
+- L'API d'entraînement n'est pas destinée à fonctionner avec n'importe quel modèle, mais elle est optimisée pour fonctionner avec les modèles fournis par la bibliothèque. Pour des boucles génériques d'apprentissage automatique, vous devriez utiliser une autre bibliothèque (éventuellement, [Accelerate](https://huggingface.co/docs/accelerate)).
+- Bien que nous nous efforcions de présenter autant de cas d'utilisation que possible, les scripts de notre [dossier d'exemples](https://github.com/huggingface/transformers/tree/main/examples) ne sont que cela : des exemples. Il est prévu qu'ils ne fonctionnent pas immédiatement sur votre problème spécifique et que vous devrez probablement modifier quelques lignes de code pour les adapter à vos besoins.
+
+## Installation
+
+### Avec pip
+
+Ce référentiel est testé sur Python 3.8+, Flax 0.4.1+, PyTorch 1.11+ et TensorFlow 2.6+.
+
+Vous devriez installer 🤗 Transformers dans un [environnement virtuel](https://docs.python.org/3/library/venv.html). Si vous n'êtes pas familier avec les environnements virtuels Python, consultez le [guide utilisateur](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).
+
+D'abord, créez un environnement virtuel avec la version de Python que vous allez utiliser et activez-le.
+
+Ensuite, vous devrez installer au moins l'une des bibliothèques Flax, PyTorch ou TensorFlow.
+Veuillez vous référer à la page d'installation de [TensorFlow](https://www.tensorflow.org/install/), de [PyTorch](https://pytorch.org/get-started/locally/#start-locally) et/ou de [Flax](https://github.com/google/flax#quick-install) et [Jax](https://github.com/google/jax#installation) pour connaître la commande d'installation spécifique à votre plateforme.
+
+Lorsqu'un de ces backends est installé, 🤗 Transformers peut être installé avec pip comme suit :
+
+```bash
+pip install transformers
+```
+
+Si vous souhaitez jouer avec les exemples ou avez besoin de la dernière version du code et ne pouvez pas attendre une nouvelle version, vous devez [installer la bibliothèque à partir de la source](https://huggingface.co/docs/transformers/installation#installing-from-source).
+
+### Avec conda
+
+🤗 Transformers peut être installé avec conda comme suit :
+
+```shell
+conda install conda-forge::transformers
+```
+
+> **_NOTE:_** L'installation de `transformers` depuis le canal `huggingface` est obsolète.
+
+Suivez les pages d'installation de Flax, PyTorch ou TensorFlow pour voir comment les installer avec conda.
+
+> **_NOTE:_** Sur Windows, on peut vous demander d'activer le mode développeur pour bénéficier de la mise en cache. Si ce n'est pas une option pour vous, veuillez nous le faire savoir dans [cette issue](https://github.com/huggingface/huggingface_hub/issues/1062).
+
+## Architectures de modèles
+
+**[Tous les points de contrôle](https://huggingface.co/models)** de modèle fournis par 🤗 Transformers sont intégrés de manière transparente depuis le [hub de modèles](https://huggingface.co/models) huggingface.co, où ils sont téléversés directement par les [utilisateurs](https://huggingface.co/users) et les [organisations](https://huggingface.co/organizations). 
+ +Nombre actuel de points de contrôle : ![](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/models&color=brightgreen) + + +🤗 Transformers fournit actuellement les architectures suivantes (consultez [ici](https://huggingface.co/docs/transformers/model_summary) pour un résumé global de chacune d'entre elles) : +1. **[ALBERT](https://huggingface.co/docs/transformers/model_doc/albert)** (de Google Research et du Toyota Technological Institute at Chicago) publié dans l'article [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), par Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. +1. **[ALIGN](https://huggingface.co/docs/transformers/model_doc/align)** (de Google Research) publié dans l'article [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918) de Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. +1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (de BAAI) publié dans l'article [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) de Chen, Zhongzhi et Liu, Guang et Zhang, Bo-Wen et Ye, Fulong et Yang, Qinghong et Wu, Ledell. +1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (du MIT) publié dans l'article [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) de Yuan Gong, Yu-An Chung, James Glass. +1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (de l'Université Tsinghua) publié dans l'article [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) de Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long. +1. **[Bark](https://huggingface.co/docs/transformers/model_doc/bark)** (de Suno) publié dans le référentiel [suno-ai/bark](https://github.com/suno-ai/bark) par l'équipe Suno AI. +1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (de Facebook) publié dans l'article [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) de Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov et Luke Zettlemoyer. +1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (de l'École polytechnique) publié dans l'article [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) de Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis. +1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (de VinAI Research) publié dans l'article [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) de Nguyen Luong Tran, Duong Minh Le et Dat Quoc Nguyen. +1. **[BEiT](https://huggingface.co/docs/transformers/model_doc/beit)** (de Microsoft) publié dans l'article [BEiT: Pré-entraînement BERT des transformateurs d'images](https://arxiv.org/abs/2106.08254) par Hangbo Bao, Li Dong, Furu Wei. +1. 
**[BERT](https://huggingface.co/docs/transformers/model_doc/bert)** (de Google) publié dans l'article [BERT : Pré-entraînement de transformateurs bidirectionnels profonds pour la compréhension du langage](https://arxiv.org/abs/1810.04805) par Jacob Devlin, Ming-Wei Chang, Kenton Lee et Kristina Toutanova. +1. **[BERT For Sequence Generation](https://huggingface.co/docs/transformers/model_doc/bert-generation)** (de Google) publié dans l'article [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) par Sascha Rothe, Shashi Narayan, Aliaksei Severyn. +1. **[BERTweet](https://huggingface.co/docs/transformers/model_doc/bertweet)** (de VinAI Research) publié dans l'article [BERTweet : un modèle de langage pré-entraîné pour les Tweets en anglais](https://aclanthology.org/2020.emnlp-demos.2/) par Dat Quoc Nguyen, Thanh Vu et Anh Tuan Nguyen. +1. **[BigBird-Pegasus](https://huggingface.co/docs/transformers/model_doc/bigbird_pegasus)** (de Google Research) publié dans l'article [Big Bird: Transformateurs pour des séquences plus longues](https://arxiv.org/abs/2007.14062) par Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed. +1. **[BigBird-RoBERTa](https://huggingface.co/docs/transformers/model_doc/big_bird)** (de Google Research) publié dans l'article [Big Bird: Transformateurs pour des séquences plus longues](https://arxiv.org/abs/2007.14062) par Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed. +1. **[BioGpt](https://huggingface.co/docs/transformers/model_doc/biogpt)** (de Microsoft Research AI4Science) publié dans l'article [BioGPT : transformateur génératif pré-entraîné pour la génération et l'extraction de texte biomédical](https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac409/6713511?guestAccessKey=a66d9b5d-4f83-4017-bb52-405815c907b9) par Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon et Tie-Yan Liu. +1. **[BiT](https://huggingface.co/docs/transformers/model_doc/bit)** (de Google AI) publié dans l'article [Big Transfer (BiT) : Apprentissage général de la représentation visuelle](https://arxiv.org/abs/1912.11370) par Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, Neil Houlsby. +1. **[Blenderbot](https://huggingface.co/docs/transformers/model_doc/blenderbot)** (de Facebook) publié dans l'article [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) par Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston. +1. **[BlenderbotSmall](https://huggingface.co/docs/transformers/model_doc/blenderbot-small)** (de Facebook) publié dans l'article [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) par Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston. +1. **[BLIP](https://huggingface.co/docs/transformers/model_doc/blip)** (de Salesforce) publié dans l'article [BLIP : Pré-entraînement de la langue et de l'image par bootstrap pour une compréhension et une génération unifiées de la vision et du langage](https://arxiv.org/abs/2201.12086) par Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi. +1. 
**[BLIP-2](https://huggingface.co/docs/transformers/model_doc/blip-2)** (de Salesforce) publié dans l'article [BLIP-2 : Pré-entraînement de la langue et de l'image avec des encodeurs d'images gelés et de grands modèles de langage](https://arxiv.org/abs/2301.12597) par Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. +1. **[BLOOM](https://huggingface.co/docs/transformers/model_doc/bloom)** (de l'atelier BigScience) publié par l'[atelier BigScience](https://bigscience.huggingface.co/). +1. **[BORT](https://huggingface.co/docs/transformers/model_doc/bort)** (d'Alexa) publié dans l'article [Optimal Subarchitecture Extraction For BERT](https://arxiv.org/abs/2010.10499) par Adrian de Wynter et Daniel J. Perry. +1. **[BridgeTower](https://huggingface.co/docs/transformers/model_doc/bridgetower)** (de l'Institut de technologie de Harbin/Microsoft Research Asia/Intel Labs) publié dans l'article [BridgeTower : Construire des ponts entre les encodeurs dans l'apprentissage de la représentation vision-langage](https://arxiv.org/abs/2206.08657) par Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. +1. **[BROS](https://huggingface.co/docs/transformers/model_doc/bros)** (de NAVER CLOVA) publié dans l'article [BROS : un modèle de langage pré-entraîné axé sur le texte et la mise en page pour une meilleure extraction des informations clés des documents](https://arxiv.org/abs/2108.04539) par Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park. +1. **[ByT5](https://huggingface.co/docs/transformers/model_doc/byt5)** (de Google Research) publié dans l'article [ByT5 : Vers un futur sans jeton avec des modèles pré-entraînés byte-to-byte](https://arxiv.org/abs/2105.13626) par Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel. +1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (d'Inria/Facebook/Sorbonne) publié dans l'article [CamemBERT : un modèle de langue français savoureux](https://arxiv.org/abs/1911.03894) par Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah et Benoît Sagot. +1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (de Google Research) publié dans l'article [CANINE : Pré-entraînement d'un encodeur sans tokenisation efficace pour la représentation du langage](https://arxiv.org/abs/2103.06874) par Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. +1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (d'OFA-Sys) publié dans l'article [Chinese CLIP : Pré-entraînement contrastif vision-langage en chinois](https://arxiv.org/abs/2211.01335) par An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou. +1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (de LAION-AI) publié dans l'article [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) par Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. +1. 
**[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (d'OpenAI) publié dans l'article [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) par Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. +1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (de l'Université de Göttingen) publié dans l'article [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) par Timo Lüddecke et Alexander Ecker. +1. **[CLVP](https://huggingface.co/docs/transformers/model_doc/clvp)** publié dans l'article [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) par James Betker. +1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (de Salesforce) publié dans l'article [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) par Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong. +1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (de MetaAI) publié dans l'article [Code Llama : Modèles ouverts fondamentaux pour le code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) par Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve. +1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (de Microsoft Research Asia) publié dans l'article [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) par Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang. +1. **[ConvBERT](https://huggingface.co/docs/transformers/model_doc/convbert)** (de YituTech) publié dans l'article [ConvBERT : Amélioration de BERT avec une convolution dynamique basée sur des plages](https://arxiv.org/abs/2008.02496) par Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan. +1. **[ConvNeXT](https://huggingface.co/docs/transformers/model_doc/convnext)** (de Facebook AI) publié dans l'article [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) par Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie. +1. **[ConvNeXTV2](https://huggingface.co/docs/transformers/model_doc/convnextv2)** (de Facebook AI) publié dans l'article [ConvNeXt V2 : Conception conjointe et mise à l'échelle de ConvNets avec des autoencodeurs masqués](https://arxiv.org/abs/2301.00808) par Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie. +1. 
**[CPM](https://huggingface.co/docs/transformers/model_doc/cpm)** (de l'Université de Tsinghua) publié dans l'article [CPM : Un modèle de langue chinois pré-entraîné génératif à grande échelle](https://arxiv.org/abs/2012.00413) par Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun. +1. **[CPM-Ant](https://huggingface.co/docs/transformers/model_doc/cpmant)** (d'OpenBMB) publié par l'[OpenBMB](https://www.openbmb.org/). +1. **[CTRL](https://huggingface.co/docs/transformers/model_doc/ctrl)** (de Salesforce) publié dans l'article [CTRL : Un modèle de langage conditionnel de type Transformer pour une génération contrôlable](https://arxiv.org/abs/1909.05858) par Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong et Richard Socher. +1. **[CvT](https://huggingface.co/docs/transformers/model_doc/cvt)** (de Microsoft) publié dans l'article [CvT : Introduction de convolutions aux transformateurs visuels](https://arxiv.org/abs/2103.15808) par Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang. +1. **[Data2Vec](https://huggingface.co/docs/transformers/model_doc/data2vec)** (de Facebook) publié dans l'article [Data2Vec : Un cadre général pour l'apprentissage auto-supervisé en parole, vision et langage](https://arxiv.org/abs/2202.03555) par Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli. +1. **[DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta)** (de Microsoft) publié dans l'article [DeBERTa : BERT amélioré avec attention désentrelacée](https://arxiv.org/abs/2006.03654) par Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. +1. **[DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2)** (de Microsoft) publié dans l'article [DeBERTa : BERT amélioré avec attention désentrelacée](https://arxiv.org/abs/2006.03654) par Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. +1. **[Decision Transformer](https://huggingface.co/docs/transformers/model_doc/decision_transformer)** (de Berkeley/Facebook/Google) publié dans l'article [Decision Transformer : Apprentissage par renforcement via la modélisation de séquences](https://arxiv.org/abs/2106.01345) par Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch. +1. **[Deformable DETR](https://huggingface.co/docs/transformers/model_doc/deformable_detr)** (de SenseTime Research) publié dans l'article [Deformable DETR : Transformateurs déformables pour la détection d'objets de bout en bout](https://arxiv.org/abs/2010.04159) par Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai. +1. **[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (de Facebook) publié dans l'article [Entraînement d'images efficace et distillation par l'attention](https://arxiv.org/abs/2012.12877) par Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou. +1. 
**[DePlot](https://huggingface.co/docs/transformers/model_doc/deplot)** (de Google AI) publié dans l'article [DePlot : Raisonnement visuel en une étape par traduction de l'intrigue en tableau](https://arxiv.org/abs/2212.10505) par Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, Yasemin Altun. +1. **[Depth Anything](https://huggingface.co/docs/transformers/main/model_doc/depth_anything)** (de l'Université de Hong Kong et de TikTok) publié dans l'article [Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891) par Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao. +1. **[DETA](https://huggingface.co/docs/transformers/model_doc/deta)** (de l'Université du Texas à Austin) publié dans l'article [NMS Strikes Back](https://arxiv.org/abs/2212.06137) par Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl. +1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (de Facebook) publié dans l'article [Détection d'objets de bout en bout avec des transformateurs](https://arxiv.org/abs/2005.12872) par Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko. +1. **[DialoGPT](https://huggingface.co/docs/transformers/model_doc/dialogpt)** (de Microsoft Research) publié dans l'article [DialoGPT : Pré-entraînement génératif à grande échelle pour la génération de réponses conversationnelles](https://arxiv.org/abs/1911.00536) par Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan. +1. **[DiNAT](https://huggingface.co/docs/transformers/model_doc/dinat)** (de SHI Labs) publié dans l'article [Transformateur d'attention dilatée pour l'attention aux quartiers](https://arxiv.org/abs/2209.15001) par Ali Hassani et Humphrey Shi. +1. **[DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2)** (de Meta AI) publié dans l'article [DINOv2 : Apprentissage de fonctionnalités visuelles robustes sans supervision](https://arxiv.org/abs/2304.07193) par Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski. +1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (de HuggingFace), publié dans l'article [DistilBERT, une version condensée de BERT : plus petit, plus rapide, moins cher et plus léger](https://arxiv.org/abs/1910.01108) par Victor Sanh, Lysandre Debut et Thomas Wolf. La même méthode a été appliquée pour compresser GPT2 en [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa en [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT en [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) et une version allemande de DistilBERT. +1. 
**[DiT](https://huggingface.co/docs/transformers/model_doc/dit)** (de Microsoft Research) publié dans l'article [DiT : Auto-pré-entraînement pour le transformateur d'images de documents](https://arxiv.org/abs/2203.02378) par Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei. +1. **[Donut](https://huggingface.co/docs/transformers/model_doc/donut)** (de NAVER), publié dans l'article [Transformation de compréhension de documents sans OCR](https://arxiv.org/abs/2111.15664) par Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park. +1. **[DPR](https://huggingface.co/docs/transformers/model_doc/dpr)** (de Facebook) publié dans l'article [Passage dense pour la recherche de réponses à des questions en domaine ouvert](https://arxiv.org/abs/2004.04906) par Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen et Wen-tau Yih. +1. **[DPT](https://huggingface.co/docs/transformers/master/model_doc/dpt)** (d'Intel Labs) publié dans l'article [Transformateurs de vision pour la prédiction dense](https://arxiv.org/abs/2103.13413) par René Ranftl, Alexey Bochkovskiy, Vladlen Koltun. +1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (de Snap Research) publié dans l'article [EfficientFormer : Transformateurs de vision à la vitesse de MobileNet](https://arxiv.org/abs/2206.01191) par Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren. +1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (de Google Brain) publié dans l'article [EfficientNet: Repenser l'échelle des modèles pour les réseaux de neurones convolutionnels](https://arxiv.org/abs/1905.11946) par Mingxing Tan, Quoc V. Le. +1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (de Google Research/Université Stanford) publié dans l'article [ELECTRA: Pré-entraîner les encodeurs de texte en tant que discriminateurs plutôt que des générateurs](https://arxiv.org/abs/2003.10555) par Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. +1. **[EnCodec](https://huggingface.co/docs/transformers/model_doc/encodec)** (de Meta AI) publié dans l'article [Compression neuronale audio de haute fidélité](https://arxiv.org/abs/2210.13438) par Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi. +1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (de Google Research) publié dans l'article [Exploiter des points de contrôle pré-entraînés pour les tâches de génération de séquences](https://arxiv.org/abs/1907.12461) par Sascha Rothe, Shashi Narayan, Aliaksei Severyn. +1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (de Baidu) publié dans l'article [ERNIE: Intégration améliorée des représentations par la connaissance](https://arxiv.org/abs/1904.09223) par Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu. +1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (de Baidu) publié dans l'article [ERNIE-M: Représentation multilingue améliorée en alignant les sémantiques interlingues avec des corpus monolingues](https://arxiv.org/abs/2012.15674) par Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. +1. 
**[ESM](https://huggingface.co/docs/transformers/model_doc/esm)** (de Meta AI) sont des modèles de langage de protéines de type transformateur. **ESM-1b** a été publié dans l'article [La structure et la fonction biologiques émergent de la mise à l'échelle de l'apprentissage non supervisé à 250 millions de séquences de protéines](https://www.pnas.org/content/118/15/e2016239118) par Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma et Rob Fergus. **ESM-1v** a été publié dans l'article [Les modèles de langage permettent une prédiction hors champ des effets des mutations sur la fonction des protéines](https://doi.org/10.1101/2021.07.09.450648) par Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu et Alexander Rives. **ESM-2 et ESMFold** ont été publiés avec l'article [Les modèles de langage des séquences de protéines à l'échelle de l'évolution permettent une prédiction précise de la structure](https://doi.org/10.1101/2022.07.20.500902) par Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, Alexander Rives. +1. **[Falcon](https://huggingface.co/docs/transformers/model_doc/falcon)** (de Technology Innovation Institute) par Almazrouei, Ebtesam et Alobeidli, Hamza et Alshamsi, Abdulaziz et Cappelli, Alessandro et Cojocaru, Ruxandra et Debbah, Merouane et Goffinet, Etienne et Heslow, Daniel et Launay, Julien et Malartic, Quentin et Noune, Badreddine et Pannier, Baptiste et Penedo, Guilherme. +1. **[FastSpeech2Conformer](model_doc/fastspeech2_conformer)** (d'ESPnet) publié dans l'article [Développements récents sur la boîte à outils Espnet boostés par Conformer](https://arxiv.org/abs/2010.13956) par Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang et Yuekai Zhang. +1. **[FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5)** (de Google AI) publié dans le référentiel [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints) par Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le et Jason Wei +1. **[FLAN-UL2](https://huggingface.co/docs/transformers/model_doc/flan-ul2)** (de Google AI) publié dans le référentiel [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-ul2-checkpoints) par Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le et Jason Wei +1. 
**[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (du CNRS) publié dans l'article [FlauBERT : Pré-entraînement de modèle de langue non supervisé pour le français](https://arxiv.org/abs/1912.05372) par Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab. +1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc/flava)** (de Facebook AI) publié dans l'article [FLAVA : Un modèle fondamental d'alignement de la langue et de la vision](https://arxiv.org/abs/2112.04482) par Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach et Douwe Kiela. +1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (de Google Research) publié dans l'article [FNet : Mélanger les jetons avec des transformations de Fourier](https://arxiv.org/abs/2105.03824) par James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. +1. **[FocalNet](https://huggingface.co/docs/transformers/model_doc/focalnet)** (de Microsoft Research) publié dans l'article [Réseaux de modulation focale](https://arxiv.org/abs/2203.11926) par Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao. +1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (de l'Université Carnegie Mellon/Google Brain) publié dans l'article [Funnel-Transformer : Filtrer la redondance séquentielle pour un traitement efficace du langage](https://arxiv.org/abs/2006.03236) par Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le. +1. **[Fuyu](https://huggingface.co/docs/transformers/model_doc/fuyu)** (de ADEPT) Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar. Publié dans l'article [billet de blog](https://www.adept.ai/blog/fuyu-8b) +1. **[GIT](https://huggingface.co/docs/transformers/model_doc/git)** (de Microsoft Research) publié dans l'article [GIT : Un transformateur génératif d'images en texte pour la vision et le langage](https://arxiv.org/abs/2205.14100) par Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang. +1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (de la KAIST) publié dans l'article [Réseaux de chemins globaux-locaux pour l'estimation de profondeur monoculaire avec Vertical CutDepth](https://arxiv.org/abs/2201.07436) par Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim. +1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (d'OpenAI) publié dans l'article [Améliorer la compréhension du langage par l'apprentissage préalable génératif](https://openai.com/research/language-unsupervised/) par Alec Radford, Karthik Narasimhan, Tim Salimans et Ilya Sutskever. +1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (d'EleutherAI) publié dans le référentiel [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) par Sid Black, Stella Biderman, Leo Gao, Phil Wang et Connor Leahy. +1. 
**[GPT NeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox)** (d'EleutherAI) publié dans l'article [GPT-NeoX-20B: Un modèle de langue autonome open source](https://arxiv.org/abs/2204.06745) par Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach +1. **[GPT NeoX Japanese](https://huggingface.co/docs/transformers/model_doc/gpt_neox_japanese)** (de ABEJA) publié par Shinya Otani, Takayoshi Makabe, Anuj Arora et Kyo Hattori. +1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (d'OpenAI) a été publié dans l'article [Les modèles de langage sont des apprenants multitâches non supervisés](https://openai.com/research/better-language-models/) par Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei et Ilya Sutskever. +1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (d'EleutherAI) a été publié dans le dépôt [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) par Ben Wang et Aran Komatsuzaki. +1. **[GPT-Sw3](https://huggingface.co/docs/transformers/model_doc/gpt-sw3)** (d'AI-Sweden) a été publié dans l'article [Leçons apprises de GPT-SW3 : Construction du premier modèle de langage génératif à grande échelle pour le suédois](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) par Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren. +1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (de BigCode) a été publié dans l'article [SantaCoder: ne visez pas les étoiles !](https://arxiv.org/abs/2301.03988) par Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra. +1. **[GPTSAN-japanese](https://huggingface.co/docs/transformers/model_doc/gptsan-japanese)** a été publié dans le dépôt [tanreinama/GPTSAN](https://github.com/tanreinama/GPTSAN/blob/main/report/model.md) par Toshiyuki Sakamoto (tanreinama). +1. **[Graphormer](https://huggingface.co/docs/transformers/model_doc/graphormer)** (de Microsoft) a été publié dans l'article [Les Transformers sont-ils vraiment inefficaces pour la représentation graphique ?](https://arxiv.org/abs/2106.05234) par Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu. +1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (de l'UCSD, NVIDIA) a été publié dans l'article [GroupViT : la segmentation sémantique émerge de la supervision textuelle](https://arxiv.org/abs/2202.11094) par Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang. +1. 
**[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** (d'Allegro.pl, AGH University of Science and Technology) a été publié dans l'article [KLEJ : référentiel complet pour la compréhension du langage polonais](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) par Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik. +1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (de Facebook) a été publié dans l'article [HuBERT : Apprentissage de la représentation autonome de la parole par prédiction masquée des unités cachées](https://arxiv.org/abs/2106.07447) par Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed. +1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (de Berkeley) a été publié dans l'article [I-BERT : Quantification entière de BERT avec des entiers uniquement](https://arxiv.org/abs/2101.01321) par Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer. +1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (de HuggingFace) a été publié dans l'article [OBELICS : Un ensemble de données filtré à l'échelle du Web d'intercalation de documents texte-image](https://huggingface.co/papers/2306.16527) par Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. +1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (d'OpenAI) a été publié dans l'article [Pré-entraînement génératif à partir de pixels](https://openai.com/blog/image-gpt/) par Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. +1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (de l'Université de Beihang, UC Berkeley, Rutgers University, SEDD Company) a été publié dans l'article [Informer : Au-delà du Transformer efficace pour la prévision de séries temporelles à longue séquence](https://arxiv.org/abs/2012.07436) par Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, Wancai Zhang. +1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (de Salesforce) a été publié dans l'article [InstructBLIP : Vers des modèles vision-langage polyvalents avec un réglage d'instructions](https://arxiv.org/abs/2305.06500) de Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. +1. **[Jukebox](https://huggingface.co/docs/transformers/model_doc/jukebox)** (d'OpenAI) a été publié dans l'article [Jukebox : Un modèle génératif pour la musique](https://arxiv.org/pdf/2005.00341.pdf) de Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever. +1. **[KOSMOS-2](https://huggingface.co/docs/transformers/model_doc/kosmos-2)** (de Microsoft Research Asia) a été publié dans l'article [Kosmos-2 : Ancrage de modèles linguistiques multimodaux à travers le monde](https://arxiv.org/abs/2306.14824) de Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei. +1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (de Microsoft Research Asia) a été publié dans l'article [LayoutLM : Pré-entraînement de texte et de mise en page pour la compréhension d'images de documents](https://arxiv.org/abs/1912.13318) de Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou. +1. 
**[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (de Microsoft Research Asia) a été publié dans l'article [LayoutLMv2 : Pré-entraînement multimodal pour la compréhension visuellement riche de documents](https://arxiv.org/abs/2012.14740) de Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou. +1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (de Microsoft Research Asia) a été publié dans l'article [LayoutLMv3 : Pré-entraînement pour l'IA de documents avec un masquage de texte et d'image unifié](https://arxiv.org/abs/2204.08387) de Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei. +1. **[LayoutXLM](https://huggingface.co/docs/transformers/model_doc/layoutxlm)** (de Microsoft Research Asia) a été publié dans l'article [LayoutXLM : Pré-entraînement multimodal pour la compréhension de documents visuellement riches et multilingues](https://arxiv.org/abs/2104.08836) de Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei. +1. **[LED](https://huggingface.co/docs/transformers/model_doc/led)** (d'AllenAI) a été publié dans l'article [Longformer : Le transformateur de documents longs](https://arxiv.org/abs/2004.05150) de Iz Beltagy, Matthew E. Peters, Arman Cohan. +1. **[LeViT](https://huggingface.co/docs/transformers/model_doc/levit)** (de Meta AI) a été publié dans l'article [LeViT : Un transformateur de vision déguisé en réseau de neurones convolutionnel pour une inférence plus rapide](https://arxiv.org/abs/2104.01136) de Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, Matthijs Douze. +1. **[LiLT](https://huggingface.co/docs/transformers/model_doc/lilt)** (de l'Université de technologie du Sud de la Chine) a été publié dans l'article [LiLT : Un transformateur de mise en page simple mais efficace et indépendant de la langue pour la compréhension de documents structurés](https://arxiv.org/abs/2202.13669) de Jiapeng Wang, Lianwen Jin, Kai Ding. +1. **[LLaMA](https://huggingface.co/docs/transformers/model_doc/llama)** (de l'équipe FAIR de Meta AI) a été publié dans l'article [LLaMA : Modèles linguistiques de base ouverts et efficaces](https://arxiv.org/abs/2302.13971) de Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. +1. 
**[Llama2](https://huggingface.co/docs/transformers/model_doc/llama2)** (de l'équipe FAIR de Meta AI) a été publié dans l'article [Llama2 : Modèles de base ouverts et affinés pour le chat](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/) de Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushka rMishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing EllenTan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom. +1. **[LLaVa](https://huggingface.co/docs/transformers/model_doc/llava)** (de Microsoft Research & University of Wisconsin-Madison) a été publié dans l'article [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485) de Haotian Liu, Chunyuan Li, Yuheng Li et Yong Jae Lee. +1. **[Longformer](https://huggingface.co/docs/transformers/model_doc/longformer)** (d'AllenAI) a été publié dans l'article [Longformer : Le transformateur de documents longs](https://arxiv.org/abs/2004.05150) de Iz Beltagy, Matthew E. Peters, Arman Cohan. +1. **[LongT5](https://huggingface.co/docs/transformers/model_doc/longt5)** (de Google AI) a été publié dans l'article [LongT5 : Transformateur de texte-à-texte efficace pour de longues séquences](https://arxiv.org/abs/2112.07916) de Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang. +1. **[LUKE](https://huggingface.co/docs/transformers/model_doc/luke)** (de Studio Ousia) a été publié dans l'article [LUKE : Représentations contextuelles profondes d'entités avec auto-attention consciente des entités](https://arxiv.org/abs/2010.01057) de Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto. +1. **[LXMERT](https://huggingface.co/docs/transformers/model_doc/lxmert)** (de l'UNC Chapel Hill) a été publié dans l'article [LXMERT : Apprentissage de représentations d'encodeur cross-modal à partir de transformateurs pour le questionnement en domaine ouvert](https://arxiv.org/abs/1908.07490) de Hao Tan et Mohit Bansal. +1. **[M-CTC-T](https://huggingface.co/docs/transformers/model_doc/mctct)** (de Facebook) a été publié dans l'article [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161) de Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve et Ronan Collobert. +1. 
**[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (de Facebook) a été publié dans l'article [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) de Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin. +1. **[MADLAD-400](https://huggingface.co/docs/transformers/model_doc/madlad-400)** (de Google) a été publié dans l'article [MADLAD-400 : Un ensemble de données multilingue et de niveau document](https://arxiv.org/abs/2309.04662) de Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat. +1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Des modèles de traduction automatique formés avec les données [OPUS](http://opus.nlpl.eu/) par Jörg Tiedemann. Le [cadre Marian](https://marian-nmt.github.io/) est en cours de développement par l'équipe Microsoft Translator. +1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (de Microsoft Research Asia) a été publié dans l'article [MarkupLM : Pré-entraînement de texte et de langage de balisage pour la compréhension visuellement riche de documents](https://arxiv.org/abs/2110.08518) de Junlong Li, Yiheng Xu, Lei Cui, Furu Wei. +1. **[Mask2Former](https://huggingface.co/docs/transformers/model_doc/mask2former)** (de FAIR et UIUC) a été publié dans l'article [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) de Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar. +1. **[MaskFormer](https://huggingface.co/docs/transformers/model_doc/maskformer)** (de Meta et UIUC) a été publié dans l'article [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) de Bowen Cheng, Alexander G. Schwing, Alexander Kirillov. +1. **[MatCha](https://huggingface.co/docs/transformers/model_doc/matcha)** (de Google AI) a été publié dans l'article [MatCha : Amélioration du pré-entraînement de langage visuel avec raisonnement mathématique et décomposition de graphiques](https://arxiv.org/abs/2212.09662) de Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, Julian Martin Eisenschlos. +1. **[mBART](https://huggingface.co/docs/transformers/model_doc/mbart)** (de Facebook) a été publié dans l'article [Pré-entraînement de débruitage multilingue pour la traduction automatique neuronale](https://arxiv.org/abs/2001.08210) par Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. +1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (de Facebook) a été publié dans l'article [Traduction multilingue avec un pré-entraînement et un fine-tuning multilingues extensibles](https://arxiv.org/abs/2008.00401) par Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan. +1. **[MEGA](https://huggingface.co/docs/transformers/model_doc/mega)** (de Meta/USC/CMU/SJTU) a été publié dans l'article [Mega : Attention équipée d'une moyenne mobile](https://arxiv.org/abs/2209.10655) par Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May et Luke Zettlemoyer. +1. 
**[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (de NVIDIA) a été publié dans l'article [Megatron-LM : Entraînement de modèles linguistiques de plusieurs milliards de paramètres en utilisant le parallélisme de modèle](https://arxiv.org/abs/1909.08053) par Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper et Bryan Catanzaro. +1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (de NVIDIA) a été publié dans l'article [Megatron-LM : Entraînement de modèles linguistiques de plusieurs milliards de paramètres en utilisant le parallélisme de modèle](https://arxiv.org/abs/1909.08053) par Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper et Bryan Catanzaro. +1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (d'Alibaba Research) a été publié dans l'article [Prédiction multi-granularité pour la reconnaissance de texte de scène](https://arxiv.org/abs/2209.03592) par Peng Wang, Cheng Da et Cong Yao. +1. **[Mistral](https://huggingface.co/docs/transformers/model_doc/mistral)** (de Mistral AI) par l'équipe [Mistral AI](https://mistral.ai) : Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. +1. **[Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral)** (de Mistral AI) par l'équipe [Mistral AI](https://mistral.ai) : Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. +1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (de Studio Ousia) a été publié dans l'article [mLUKE : La puissance des représentations d'entités dans les modèles linguistiques pré-entraînés multilingues](https://arxiv.org/abs/2110.08151) par Ryokan Ri, Ikuya Yamada et Yoshimasa Tsuruoka. +1. **[MMS](https://huggingface.co/docs/transformers/model_doc/mms)** (de Facebook) a été publié dans l'article [Mise à l'échelle de la technologie de la parole à plus de 1 000 langues](https://arxiv.org/abs/2305.13516) par Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli. +1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (de CMU/Google Brain) a été publié dans l'article [MobileBERT : un BERT compact et agnostique pour les tâches sur les appareils à ressources limitées](https://arxiv.org/abs/2004.02984) par Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang et Denny Zhou. +1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (de Google Inc.) a été publié dans l'article [MobileNets : Réseaux neuronaux convolutifs efficaces pour les applications de vision mobile](https://arxiv.org/abs/1704.04861) par Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam. +1. 
**[MobileNetV2](https://huggingface.co/docs/transformers/model_doc/mobilenet_v2)** (de Google Inc.) a été publié dans l'article [MobileNetV2 : Résidus inversés et coudes linéaires](https://arxiv.org/abs/1801.04381) par Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen. +1. **[MobileViT](https://huggingface.co/docs/transformers/model_doc/mobilevit)** (d'Apple) a été publié dans l'article [MobileViT : Vision Transformer léger, polyvalent et adapté aux mobiles](https://arxiv.org/abs/2110.02178) par Sachin Mehta et Mohammad Rastegari. +1. **[MobileViTV2](https://huggingface.co/docs/transformers/model_doc/mobilevitv2)** (d'Apple) a été publié dans l'article [Auto-attention séparable pour les Vision Transformers mobiles](https://arxiv.org/abs/2206.02680) par Sachin Mehta et Mohammad Rastegari. +1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (de Microsoft Research) a été publié dans l'article [MPNet : Pré-entraînement masqué et permuté pour la compréhension du langage](https://arxiv.org/abs/2004.09297) par Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu. +1. **[MPT](https://huggingface.co/docs/transformers/model_doc/mpt)** (de MosaiML) a été publié avec le référentiel [llm-foundry](https://github.com/mosaicml/llm-foundry/) par l'équipe MosaiML NLP. +1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (de l'Université du Wisconsin - Madison) a été publié dans l'article [Analyse multi-résolution (MRA) pour une auto-attention approximative](https://arxiv.org/abs/2207.10284) par Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh. +1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (de Google AI) a été publié dans l'article [mT5 : un transformateur texte-à-texte pré-entraîné massivement multilingue](https://arxiv.org/abs/2010.11934) par Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel. +1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (de Meta) a été publié dans l'article [Génération de musique simple et contrôlable](https://arxiv.org/abs/2306.05284) par Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi et Alexandre Défossez. +1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (de RUC AI Box) a été publié dans l'article [MVP : Pré-entraînement supervisé multi-tâche pour la génération de langage naturel](https://arxiv.org/abs/2206.12131) par Tianyi Tang, Junyi Li, Wayne Xin Zhao et Ji-Rong Wen. +1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (de SHI Labs) a été publié dans l'article [Transformateur d'attention de voisinage](https://arxiv.org/abs/2204.07143) par Ali Hassani, Steven Walton, Jiachen Li, Shen Li et Humphrey Shi. +1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (du laboratoire Noah's Ark de Huawei) a été publié dans l'article [NEZHA : Représentation contextualisée neurale pour la compréhension du langage chinois](https://arxiv.org/abs/1909.00204) par Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen et Qun Liu. +1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (de Meta) a été publié dans l'article [No Language Left Behind : Mise à l'échelle de la traduction automatique centrée sur l'humain](https://arxiv.org/abs/2207.04672) par l'équipe NLLB. +1. 
**[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (de Meta) a été publié dans l'article [No Language Left Behind : Mise à l'échelle de la traduction automatique centrée sur l'humain](https://arxiv.org/abs/2207.04672) par l'équipe NLLB. +1. **[Nougat](https://huggingface.co/docs/transformers/model_doc/nougat)** (de Meta AI) a été publié dans l'article [Nougat : Compréhension Optique Neuronale pour les Documents Académiques](https://arxiv.org/abs/2308.13418) par Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic. +1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (de l'Université du Wisconsin - Madison) a été publié dans l'article [Nyströmformer : Un algorithme basé sur Nyström pour approximer l'auto-attention](https://arxiv.org/abs/2102.03902) par Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh. +1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (de SHI Labs) a été publié dans l'article [OneFormer : Un Transformer pour dominer la segmentation universelle d'images](https://arxiv.org/abs/2211.06220) par Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi. +1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (de [s-JoL](https://huggingface.co/s-JoL)) publié sur GitHub (maintenant supprimé). +1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (de Meta AI) a été publié dans l'article [OPT : Modèles linguistiques Transformer pré-entraînés ouverts](https://arxiv.org/abs/2205.01068) par Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al. +1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (de Google AI) a été publié dans l'article [OWL-ViT : Détection d'objets simple à vocabulaire ouvert avec des transformateurs de vision](https://arxiv.org/abs/2205.06230) par Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf et Neil Houlsby. +1. **[OWLv2](https://huggingface.co/docs/transformers/model_doc/owlv2)** (de Google AI) a été publié dans l'article [Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683) par Matthias Minderer, Alexey Gritsenko, Neil Houlsby. +1. **[PatchTSMixer](https://huggingface.co/docs/transformers/model_doc/patchtsmixer)** (d'IBM Research) a été publié dans l'article [TSMixer : Modèle MLP-Mixer léger pour la prévision multivariée de séries temporelles](https://arxiv.org/pdf/2306.09364.pdf) par Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, Jayant Kalagnanam. +1. **[PatchTST](https://huggingface.co/docs/transformers/model_doc/patchtst)** (d'IBM) a été publié dans l'article [Une série temporelle vaut 64 mots : Prévision à long terme avec des Transformers](https://arxiv.org/abs/2211.14730) par Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, Jayant Kalagnanam. +1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (de Google) a été publié dans l'article [PEGASUS : Pré-entraînement avec Phrases-Écart Extraites pour la Résumé Abstrait](https://arxiv.org/abs/1912.08777) par Jingqing Zhang, Yao Zhao, Mohammad Saleh et Peter J. Liu. +1. 
**[PEGASUS-X](https://huggingface.co/docs/transformers/model_doc/pegasus_x)** (de Google) a été publié dans l'article [Investiguer l'Extension Efficace des Transformers pour la Longue Summarization d'Entrée](https://arxiv.org/abs/2208.04347) par Jason Phang, Yao Zhao et Peter J. Liu. +1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (de Deepmind) a été publié dans l'article [Perceiver IO : Une architecture générale pour les entrées et sorties structurées](https://arxiv.org/abs/2107.14795) par Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals et João Carreira. +1. **[Persimmon](https://huggingface.co/docs/transformers/model_doc/persimmon)** (d'ADEPT) a été publié dans un [billet de blog](https://www.adept.ai/blog/persimmon-8b) par Erich Elsen, Augustus Odena, Maxwell Nye, Sağnak Taşırlar, Tri Dao, Curtis Hawthorne, Deepak Moparthi, Arushi Somani. +1. **[Phi](https://huggingface.co/docs/transformers/model_doc/phi)** (de Microsoft) a été publié avec les articles - [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) par Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee et Yuanzhi Li, [Textbooks Are All You Need II : Rapport technique phi-1.5](https://arxiv.org/abs/2309.05463) par Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar et Yin Tat Lee. +1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (de VinAI Research) a été publié dans l'article [PhoBERT : Modèles linguistiques pré-entraînés pour le vietnamien](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) par Dat Quoc Nguyen et Anh Tuan Nguyen. +1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (de Google) a été publié dans l'article [Pix2Struct : Analyse d'images d'écran en tant que pré-entraînement pour la compréhension du langage visuel](https://arxiv.org/abs/2210.03347) par Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova. +1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (de UCLA NLP) a été publié dans l'article [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) par Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. +1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (de Sea AI Labs) a été publié dans l'article [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) par Yu, Weihao et Luo, Mi et Zhou, Pan et Si, Chenyang et Zhou, Yichen et Wang, Xinchao et Feng, Jiashi et Yan, Shuicheng. +1. **[Pop2Piano](https://huggingface.co/docs/transformers/model_doc/pop2piano)** a été publié dans l'article [Pop2Piano : Génération de reprises de morceaux de piano basée sur l'audio pop](https://arxiv.org/abs/2211.00895) par Jongho Choi et Kyogu Lee. +1. 
**[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (de Microsoft Research) a été publié dans l'article [ProphetNet : Prédire les N-grammes futurs pour l'entraînement préalable de séquences à séquences](https://arxiv.org/abs/2001.04063) par Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang et Ming Zhou. +1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (de l'Université de Nankin, l'Université de Hong Kong, etc.) a été publié dans l'article [Pyramid Vision Transformer : Une colonne vertébrale polyvalente pour la prédiction dense sans convolutions](https://arxiv.org/pdf/2102.12122.pdf) par Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo et Ling Shao. +1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (de NVIDIA) a été publié dans l'article [Quantification entière pour l'inférence d'apprentissage profond : Principes et évaluation empirique](https://arxiv.org/abs/2004.09602) par Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev et Paulius Micikevicius. +1. **[Qwen2](https://huggingface.co/docs/transformers/model_doc/qwen2)** (de l'équipe Qwen, Alibaba Group) a été publié avec le rapport technique [Qwen](https://arxiv.org/abs/2309.16609) par Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou et Tianhang Zhu. +1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (de Facebook) a été publié dans l'article [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) par Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. +1. **[REALM](https://huggingface.co/docs/transformers/model_doc/realm.html)** (de Google Research) a été publié dans l'article [REALM : Pré-entraînement de modèle linguistique augmenté par la récupération](https://arxiv.org/abs/2002.08909) par Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat et Ming-Wei Chang. +1. **[Reformer](https://huggingface.co/docs/transformers/model_doc/reformer)** (de Google Research) a été publié dans l'article [Reformer : Le transformateur efficace](https://arxiv.org/abs/2001.04451) par Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya. +1. **[RegNet](https://huggingface.co/docs/transformers/model_doc/regnet)** (de META Platforms) a été publié dans l'article [Designing Network Design Space](https://arxiv.org/abs/2003.13678) par Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, Piotr Dollár. +1. **[RemBERT](https://huggingface.co/docs/transformers/model_doc/rembert)** (de Google Research) a été publié dans l'article [Rethinking embedding coupling in pre-trained language models](https://arxiv.org/abs/2010.12821) par Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder. +1. 
**[ResNet](https://huggingface.co/docs/transformers/model_doc/resnet)** (de Microsoft Research) a été publié dans l'article [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) par Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. +1. **[RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)** (de Facebook), publié dans l'article [RoBERTa : Une approche d'entraînement préalable BERT robuste](https://arxiv.org/abs/1907.11692) par Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. +1. **[RoBERTa-PreLayerNorm](https://huggingface.co/docs/transformers/model_doc/roberta-prelayernorm)** (de Facebook) a été publié dans l'article [fairseq : Une boîte à outils rapide et extensible pour la modélisation de séquences](https://arxiv.org/abs/1904.01038) par Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli. +1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (de WeChatAI) a été publié dans l'article [RoCBert : BERT chinois robuste avec pré-entraînement contrastif multimodal](https://aclanthology.org/2022.acl-long.65.pdf) par HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou. +1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (de ZhuiyiTechnology), publié dans l'article [RoFormer : Transformateur amélioré avec insertion rotative de position](https://arxiv.org/abs/2104.09864) par Jianlin Su et Yu Lu et Shengfeng Pan et Bo Wen et Yunfeng Liu. +1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (de Bo Peng), publié sur [ce référentiel](https://github.com/BlinkDL/RWKV-LM) par Bo Peng. +1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (de Meta AI) a été publié dans l'article [SeamlessM4T — Traduction multimodale et massivement multilingue](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) par l'équipe de communication transparente. +1. **[SeamlessM4Tv2](https://huggingface.co/docs/transformers/model_doc/seamless_m4t_v2)** (de Meta AI) a été publié dans l'article [Seamless: Traduction de la parole multilingue, expressive et en continu](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) par l'équipe de communication transparente. +1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (de NVIDIA) a été publié dans l'article [SegFormer : Conception simple et efficace pour la segmentation sémantique avec des transformateurs](https://arxiv.org/abs/2105.15203) par Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo. +1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (de Meta AI) a été publié dans l'article [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) par Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick. +1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (de ASAPP) a été publié dans l'article [Compromis entre performances et efficacité dans l'entraînement non supervisé pour la reconnaissance vocale](https://arxiv.org/abs/2109.06870) par Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi. +1. 
**[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (de ASAPP) a été publié dans l'article [Compromis entre performances et efficacité dans l'entraînement non supervisé pour la reconnaissance vocale](https://arxiv.org/abs/2109.06870) par Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi. +1. **[SigLIP](https://huggingface.co/docs/transformers/model_doc/siglip)** (de Google AI) a été publié dans l'article [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) par Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer. +1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (de Microsoft Research) a été publié dans l'article [SpeechT5 : Pré-entraînement unifié encodeur-décodeur pour le traitement du langage parlé](https://arxiv.org/abs/2110.07205) par Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei. +1. **[SpeechToTextTransformer](https://huggingface.co/docs/transformers/model_doc/speech_to_text)** (de Facebook), publié dans l'article [fairseq S2T : Modélisation rapide de la parole à texte avec fairseq](https://arxiv.org/abs/2010.05171) par Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino. +1. **[SpeechToTextTransformer2](https://huggingface.co/docs/transformers/model_doc/speech_to_text_2)** (de Facebook), publié dans l'article [Apprentissage auto-supervisé et semi-supervisé à grande échelle pour la traduction de la parole](https://arxiv.org/abs/2104.06678) par Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau. +1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (de l'Université de Tel Aviv), publié dans l'article [Réponse à quelques questions avec peu d'exemples par la pré-sélection des spans](https://arxiv.org/abs/2101.00438) par Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy. +1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (de Berkeley) a été publié dans l'article [SqueezeBERT : Que l'apprentissage automatique peut-il apprendre au traitement du langage naturel sur les réseaux neuronaux efficaces ?](https://arxiv.org/abs/2006.11316) par Forrest N. Iandola, Albert E. Shaw, Ravi Krishna et Kurt W. Keutzer. +1. **[StableLm](https://huggingface.co/docs/transformers/main/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu. +1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (de MBZUAI) a été publié dans l'article [SwiftFormer : Attention additive efficace pour les applications de vision mobile en temps réel basées sur des transformateurs](https://arxiv.org/abs/2303.15446) par Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan. +1. 
**[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (de Microsoft) a été publié dans l'article [Swin Transformer : Transformateur hiérarchique de la vision utilisant des fenêtres décalées](https://arxiv.org/abs/2103.14030) par Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo. +1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (de Microsoft) a été publié dans l'article [Swin Transformer V2 : Augmentation de la capacité et de la résolution](https://arxiv.org/abs/2111.09883) par Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo. +1. **[Swin2SR](https://huggingface.co/docs/transformers/model_doc/swin2sr)** (de l'Université de Würzburg) a été publié dans l'article [Swin2SR : Transformateur SwinV2 pour la super-résolution et la restauration d'images compressées](https://arxiv.org/abs/2209.11345) par Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte. +1. **[SwitchTransformers](https://huggingface.co/docs/transformers/model_doc/switch_transformers)** (de Google) a été publié dans l'article [Switch Transformers : Passage à des modèles de trillions de paramètres avec une parcimonie simple et efficace](https://arxiv.org/abs/2101.03961) par William Fedus, Barret Zoph, Noam Shazeer. +1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (de Google AI) a été publié dans l'article [Exploration des limites de l'apprentissage par transfert avec un transformateur de texte à texte unifié](https://arxiv.org/abs/1910.10683) par Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li et Peter J. Liu. +1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (de Google AI) a été publié dans le dépôt [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) par Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li et Peter J. Liu. +1. **[Table Transformer](https://huggingface.co/docs/transformers/model_doc/table-transformer)** (de Microsoft Research) a été publié dans l'article [PubTables-1M : Vers une extraction complète des tables à partir de documents non structurés](https://arxiv.org/abs/2110.00061) par Brandon Smock, Rohith Pesala, Robin Abraham. +1. **[TAPAS](https://huggingface.co/docs/transformers/model_doc/tapas)** (de Google AI) a été publié dans l'article [TAPAS : Analyse faiblement supervisée des tables via le pré-entraînement](https://arxiv.org/abs/2004.02349) par Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno et Julian Martin Eisenschlos. +1. **[TAPEX](https://huggingface.co/docs/transformers/model_doc/tapex)** (de Microsoft Research) a été publié dans l'article [TAPEX : Pré-entraînement des tables via l'apprentissage d'un exécuteur SQL neuronal](https://arxiv.org/abs/2107.07653) par Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen et Jian-Guang Lou. +1. **[Time Series Transformer](https://huggingface.co/docs/transformers/model_doc/time_series_transformer)** (de HuggingFace). +1. 
**[TimeSformer](https://huggingface.co/docs/transformers/model_doc/timesformer)** (de Facebook) a été publié dans l'article [Is Space-Time Attention All You Need for Video Understanding?](https://arxiv.org/abs/2102.05095) par Gedas Bertasius, Heng Wang, Lorenzo Torresani. +1. **[Trajectory Transformer](https://huggingface.co/docs/transformers/model_doc/trajectory_transformers)** (de l'Université de Californie à Berkeley) a été publié dans l'article [L'apprentissage par renforcement hors ligne comme un grand problème de modélisation de séquence](https://arxiv.org/abs/2106.02039) par Michael Janner, Qiyang Li, Sergey Levine. +1. **[Transformer-XL](https://huggingface.co/docs/transformers/model_doc/transfo-xl)** (de Google/CMU) a été publié dans l'article [Transformer-XL : Modèles de langage attentifs au-delà d'un contexte de longueur fixe](https://arxiv.org/abs/1901.02860) par Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. +1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (de Microsoft), publié dans l'article [TrOCR : Reconnaissance optique de caractères basée sur un transformateur avec des modèles pré-entraînés](https://arxiv.org/abs/2109.10282) par Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. +1. **[TVLT](https://huggingface.co/docs/transformers/model_doc/tvlt)** (de l'UNC Chapel Hill) a été publié dans l'article [TVLT : Transformer Vision-Language sans texte](https://arxiv.org/abs/2209.14156) par Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal. +1. **[TVP](https://huggingface.co/docs/transformers/model_doc/tvp)** (d'Intel) a été publié dans l'article [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) par Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding. +1. **[UL2](https://huggingface.co/docs/transformers/model_doc/ul2)** (de Google Research) a été publié dans l'article [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) par Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler. +1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (de Google Research) a été publié dans l'article [UniMax : Échantillonnage linguistique plus équitable et plus efficace pour l'entraînement préalable multilingue à grande échelle](https://openreview.net/forum?id=kXwdL1cWOAi) par Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant. +1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (de Microsoft Research) a été publié dans l'article [UniSpeech : Apprentissage unifié de la représentation de la parole avec des données étiquetées et non étiquetées](https://arxiv.org/abs/2101.07597) par Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang. +1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (de Microsoft Research) a été publié dans l'article [UNISPEECH-SAT : Apprentissage universel de la représentation de la parole avec la préformation consciente du locuteur](https://arxiv.org/abs/2110.05752) par Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu. +1. 
**[UnivNet](https://huggingface.co/docs/transformers/model_doc/univnet)** (de Kakao Corporation) a été publié dans l'article [UnivNet : un vocodeur neuronal avec des discriminateurs de spectrogramme multi-résolution pour la génération de formes d'onde haute fidélité](https://arxiv.org/abs/2106.07889) par Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim et Juntae Kim. +1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (de l'Université de Pékin) a été publié dans l'article [Analyse perceptuelle unifiée pour la compréhension de scènes](https://arxiv.org/abs/1807.10221) par Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun. +1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (de l'Université Tsinghua et de l'Université Nankai) publié dans l'article [Visual Attention Network](https://arxiv.org/abs/2202.09741) par Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu. +1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (du groupe d'informatique multimédia, Université de Nankin) publié dans l'article [VideoMAE : Les autoencodeurs masqués sont des apprenants efficaces en données pour l'auto-pré-entraînement vidéo supervisé](https://arxiv.org/abs/2203.12602) par Zhan Tong, Yibing Song, Jue Wang, Limin Wang. +1. **[ViLT](https://huggingface.co/docs/transformers/model_doc/vilt)** (du NAVER AI Lab/Kakao Enterprise/Kakao Brain) publié dans l'article [ViLT : Vision-and-Language Transformer sans convolution ni supervision de région](https://arxiv.org/abs/2102.03334) par Wonjae Kim, Bokyung Son, Ildoo Kim. +1. **[VipLlava](https://huggingface.co/docs/transformers/model_doc/vipllava)** (de l'Université du Wisconsin–Madison) publié dans l'article [Rendre les grands modèles multimodaux comprenant des incitations visuelles arbitraires](https://arxiv.org/abs/2312.00784) par Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee. +1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (de Google AI) publié dans l'article [Une image vaut 16x16 mots : Transformers pour la reconnaissance d'images à grande échelle](https://arxiv.org/abs/2010.11929) par Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. +1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (de UCLA NLP) publié dans l'article [VisualBERT : Une référence simple et performante pour la vision et le langage](https://arxiv.org/pdf/1908.03557) par Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang. +1. **[ViT Hybrid](https://huggingface.co/docs/transformers/model_doc/vit_hybrid)** (de Google AI) publié dans l'article [Une image vaut 16x16 mots : Transformers pour la reconnaissance d'images à grande échelle](https://arxiv.org/abs/2010.11929) par Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. +1. **[VitDet](https://huggingface.co/docs/transformers/model_doc/vitdet)** (de Meta AI) publié dans l'article [Exploration des transformateurs de vision plain pour la détection d'objets](https://arxiv.org/abs/2203.16527) par Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. +1. 
**[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (de Meta AI) publié dans l'article [Les autoencodeurs masqués sont des apprenants évolutifs de la vision](https://arxiv.org/abs/2111.06377) par Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. +1. **[ViTMatte](https://huggingface.co/docs/transformers/model_doc/vitmatte)** (de HUST-VL) publié dans l'article [ViTMatte : Renforcer le détourage d'image avec des transformateurs de vision plain pré-entraînés](https://arxiv.org/abs/2305.15272) par Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang. +1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (de Meta AI) publié dans l'article [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) par Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas. +1. **[VITS](https://huggingface.co/docs/transformers/model_doc/vits)** (de Kakao Enterprise) publié dans l'article [Auto-encodeur variationnel conditionnel avec apprentissage adversarial pour la conversion texte-parole de bout en bout](https://arxiv.org/abs/2106.06103) par Jaehyeon Kim, Jungil Kong, Juhee Son. +1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (de Google Research) publié dans l'article [ViViT : Un transformateur de vision vidéo](https://arxiv.org/abs/2103.15691) par Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. +1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (de Facebook AI) publié dans l'article [wav2vec 2.0 : Un cadre pour l'apprentissage auto-supervisé des représentations de la parole](https://arxiv.org/abs/2006.11477) par Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli. +1. **[Wav2Vec2-BERT](https://huggingface.co/docs/transformers/model_doc/wav2vec2-bert)** (de Meta AI) publié dans l'article [Seamless : Traduction de la parole multilingue, expressive et en continu](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) par l'équipe Seamless Communication. +1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (de Facebook AI) a été publié dans l'article [FAIRSEQ S2T : Modélisation rapide de la parole au texte avec FAIRSEQ](https://arxiv.org/abs/2010.05171) par Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino. +1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (de Facebook AI) a été publié dans l'article [Reconnaissance de phonèmes interlingues simple et efficace sans apprentissage préalable](https://arxiv.org/abs/2109.11680) par Qiantong Xu, Alexei Baevski, Michael Auli. +1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (de Microsoft Research) a été publié dans l'article [WavLM : Pré-entraînement auto-supervisé à grande échelle pour le traitement complet de la parole](https://arxiv.org/abs/2110.13900) par Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei. +1. 
**[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (d'OpenAI) a été publié dans l'article [Reconnaissance robuste de la parole via une supervision faible à grande échelle](https://cdn.openai.com/papers/whisper.pdf) par Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever. +1. **[X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)** (de Microsoft Research) a été publié dans l'article [Expansion des modèles pré-entraînés langage-image pour la reconnaissance vidéo générale](https://arxiv.org/abs/2208.02816) par Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling. +1. **[X-MOD](https://huggingface.co/docs/transformers/model_doc/xmod)** (de Meta AI) a été publié dans l'article [Lever le sort de la multilinguité par le pré-entraînement des transformateurs modulaires](http://dx.doi.org/10.18653/v1/2022.naacl-main.255) par Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe. +1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (de Facebook AI) a été publié dans l'article [Apprentissage à quelques échantillons avec des modèles de langues multilingues](https://arxiv.org/abs/2112.10668) par Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li. +1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (de Facebook) a été publié dans l'article [Pré-entraînement de modèles linguistiques multilingues](https://arxiv.org/abs/1901.07291) par Guillaume Lample et Alexis Conneau. +1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (de Microsoft Research) a été publié dans l'article [ProphetNet : Prédire l'avenir N-gramme pour le pré-entraînement séquence-séquence](https://arxiv.org/abs/2001.04063) par Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang et Ming Zhou. +1. **[XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)** (de Facebook AI), publié dans l'article [Apprentissage non supervisé de la représentation croisée-lingue à grande échelle](https://arxiv.org/abs/1911.02116) par Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer et Veselin Stoyanov. +1. **[XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/model_doc/xlm-roberta-xl)** (de Facebook AI), publié dans l'article [Transformateurs à plus grande échelle pour la modélisation du langage masqué multilingue](https://arxiv.org/abs/2105.00572) par Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau. +1. **[XLM-V](https://huggingface.co/docs/transformers/model_doc/xlm-v)** (de Meta AI) a été publié dans l'article [XLM-V : Surmonter le goulot d'étranglement du vocabulaire dans les modèles de langage masqués multilingues](https://arxiv.org/abs/2301.10472) par Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, Madian Khabsa. +1. 
**[XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet)** (de Google/CMU) a été publié dans l'article [XLNet : Préentraînement autorégressif généralisé pour la compréhension du langage](https://arxiv.org/abs/1906.08237) par Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le. +1. **[XLS-R](https://huggingface.co/docs/transformers/model_doc/xls_r)** (de Facebook AI) publié dans l'article [XLS-R : Apprentissage d'une représentation de la parole autonome et multilingue à grande échelle](https://arxiv.org/abs/2111.09296) par Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli. +1. **[XLSR-Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/xlsr_wav2vec2)** (de Facebook AI) publié dans l'article [Apprentissage non supervisé de représentations multilingues pour la reconnaissance vocale](https://arxiv.org/abs/2006.13979) par Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli. +1. **[YOLOS](https://huggingface.co/docs/transformers/model_doc/yolos)** (de l'Université Huazhong des sciences et technologies) publié dans l'article [You Only Look at One Sequence : Repenser le Transformer dans la vision à travers la détection d'objets](https://arxiv.org/abs/2106.00666) par Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu. +1. **[YOSO](https://huggingface.co/docs/transformers/model_doc/yoso)** (de l'Université du Wisconsin - Madison) publié dans l'article [You Only Sample (Almost) Once : Coût linéaire Self-Attention via l'échantillonnage Bernoulli](https://arxiv.org/abs/2111.09714) par Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh. +1. Vous souhaitez contribuer avec un nouveau modèle ? Nous avons ajouté un **guide détaillé et des modèles types** pour vous guider dans le processus d'ajout d'un nouveau modèle. Vous pouvez les trouver dans le dossier [`templates`](./templates) du référentiel. Assurez-vous de consulter les [directives de contribution](./CONTRIBUTING.md) et de contacter les mainteneurs ou d'ouvrir un ticket pour recueillir des commentaires avant de commencer votre pull request. + +Pour vérifier si chaque modèle a une implémentation en Flax, PyTorch ou TensorFlow, ou s'il a un tokenizer associé pris en charge par la bibliothèque 🤗 Tokenizers, consultez [ce tableau](https://huggingface.co/docs/transformers/index#supported-frameworks). + +Ces implémentations ont été testées sur plusieurs ensembles de données (voir les scripts d'exemple) et devraient correspondre aux performances des implémentations originales. Vous pouvez trouver plus de détails sur les performances dans la section Exemples de la [documentation](https://github.com/huggingface/transformers/tree/main/examples). 
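À titre d'illustration uniquement, voici une esquisse minimale (en supposant `transformers` et `torch` installés) montrant comment n'importe quel point de contrôle de la liste ci-dessus peut être chargé avec les classes `Auto*` ou l'API `pipeline` ; l'identifiant `google-bert/bert-base-uncased` est repris des démos en ligne et n'est qu'un exemple à remplacer par le modèle de votre choix.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "google-bert/bert-base-uncased"  # exemple : remplacez par un autre point de contrôle du hub

# Chargement explicite du tokenizer et du modèle (backend PyTorch)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Ou, plus directement, via l'API pipeline
fill_mask = pipeline("fill-mask", model=model_id)
print(fill_mask("Paris is the [MASK] of France.")[0]["token_str"])
```

Le même schéma s'applique, en principe, à tout point de contrôle du [hub de modèles](https://huggingface.co/models), sous réserve que l'architecture correspondante soit prise en charge dans le framework choisi (voir le tableau mentionné ci-dessus).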
+ +## En savoir plus + +| Section | Description | +|-|-| +| [Documentation](https://huggingface.co/docs/transformers/) | Documentation complète de l'API et tutoriels | +| [Résumé des tâches](https://huggingface.co/docs/transformers/task_summary) | Tâches prises en charge par les 🤗 Transformers | +| [Tutoriel de prétraitement](https://huggingface.co/docs/transformers/preprocessing) | Utilisation de la classe `Tokenizer` pour préparer les données pour les modèles | +| [Entraînement et ajustement fin](https://huggingface.co/docs/transformers/training) | Utilisation des modèles fournis par les 🤗 Transformers dans une boucle d'entraînement PyTorch/TensorFlow et de l'API `Trainer` | +| [Tour rapide : Scripts d'ajustement fin/d'utilisation](https://github.com/huggingface/transformers/tree/main/examples) | Scripts d'exemple pour ajuster finement les modèles sur une large gamme de tâches | +| [Partage et téléversement de modèles](https://huggingface.co/docs/transformers/model_sharing) | Téléchargez et partagez vos modèles ajustés avec la communauté | + +## Citation + +Nous disposons désormais d'un [article](https://www.aclweb.org/anthology/2020.emnlp-demos.6/) que vous pouvez citer pour la bibliothèque 🤗 Transformers : +```bibtex +@inproceedings{wolf-etal-2020-transformers, + title = "Transformers: State-of-the-Art Natural Language Processing", + author = "Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush", + booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations", + month = oct, + year = "2020", + address = "Online", + publisher = "Association for Computational Linguistics", + url = "https://www.aclweb.org/anthology/2020.emnlp-demos.6", + pages = "38--45" +} +``` diff --git a/README_hd.md b/README_hd.md index b6fa80c14c3e02..6402c3ee5eb7fc 100644 --- a/README_hd.md +++ b/README_hd.md @@ -26,7 +26,7 @@ token: शब्द (और मूल अंग्रेजी को कोष tokenize: टोकननाइज़ करें (और मूल अंग्रेज़ी को चिह्नित करने के लिए कोष्ठक का उपयोग करें) tokenizer: Tokenizer (मूल अंग्रेजी में कोष्ठक के साथ) transformer: transformer -pipeline: समनुक्रम +pipeline: समनुक्रम API: API (अनुवाद के बिना) inference: विचार Trainer: प्रशिक्षक। कक्षा के नाम के रूप में प्रस्तुत किए जाने पर अनुवादित नहीं किया गया। @@ -43,7 +43,7 @@ checkpoint: जाँच बिंदु

-

+

Build @@ -72,7 +72,12 @@ checkpoint: जाँच बिंदु Español | 日本語 | हिन्दी | -

+ Русский |
+ Português |
+ తెలుగు |
+ Français |
+ Deutsch |
+

@@ -85,22 +90,22 @@ checkpoint: जाँच बिंदु 🤗 Transformers 100 से अधिक भाषाओं में पाठ वर्गीकरण, सूचना निष्कर्षण, प्रश्न उत्तर, सारांशीकरण, अनुवाद, पाठ निर्माण का समर्थन करने के लिए हजारों पूर्व-प्रशिक्षित मॉडल प्रदान करता है। इसका उद्देश्य सबसे उन्नत एनएलपी तकनीक को सभी के लिए सुलभ बनाना है। -🤗 Transformers त्वरित डाउनलोड और उपयोग के लिए एक एपीआई प्रदान करता है, जिससे आप किसी दिए गए पाठ पर एक पूर्व-प्रशिक्षित मॉडल ले सकते हैं, इसे अपने डेटासेट पर ठीक कर सकते हैं और इसे [मॉडल हब] (https://huggingface.co/models) के माध्यम से समुदाय के साथ साझा कर सकते हैं। ) . इसी समय, प्रत्येक परिभाषित पायथन मॉड्यूल पूरी तरह से स्वतंत्र है, जो संशोधन और तेजी से अनुसंधान प्रयोगों के लिए सुविधाजनक है। +🤗 Transformers त्वरित डाउनलोड और उपयोग के लिए एक एपीआई प्रदान करता है, जिससे आप किसी दिए गए पाठ पर एक पूर्व-प्रशिक्षित मॉडल ले सकते हैं, इसे अपने डेटासेट पर ठीक कर सकते हैं और इसे [मॉडल हब](https://huggingface.co/models) के माध्यम से समुदाय के साथ साझा कर सकते हैं। इसी समय, प्रत्येक परिभाषित पायथन मॉड्यूल पूरी तरह से स्वतंत्र है, जो संशोधन और तेजी से अनुसंधान प्रयोगों के लिए सुविधाजनक है। 🤗 Transformers तीन सबसे लोकप्रिय गहन शिक्षण पुस्तकालयों का समर्थन करता है: [Jax](https://jax.readthedocs.io/en/latest/), [PyTorch](https://pytorch.org/) and [TensorFlow](https://www.tensorflow.org/) — और इसके साथ निर्बाध रूप से एकीकृत होता है। आप अपने मॉडल को सीधे एक ढांचे के साथ प्रशिक्षित कर सकते हैं और दूसरे के साथ लोड और अनुमान लगा सकते हैं। ## ऑनलाइन डेमो -आप सबसे सीधे मॉडल पृष्ठ पर परीक्षण कर सकते हैं [model hub](https://huggingface.co/models) मॉडल पर। हम [निजी मॉडल होस्टिंग, मॉडल संस्करण, और अनुमान एपीआई] भी प्रदान करते हैं।(https://huggingface.co/pricing)。 +आप सबसे सीधे मॉडल पृष्ठ पर परीक्षण कर सकते हैं [model hub](https://huggingface.co/models) मॉडल पर। हम [निजी मॉडल होस्टिंग, मॉडल संस्करण, और अनुमान एपीआई](https://huggingface.co/pricing) भी प्रदान करते हैं।。 यहाँ कुछ उदाहरण हैं: -- [शब्द को भरने के लिए मास्क के रूप में BERT का प्रयोग करें](https://huggingface.co/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France) +- [शब्द को भरने के लिए मास्क के रूप में BERT का प्रयोग करें](https://huggingface.co/google-bert/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France) - [इलेक्ट्रा के साथ नामित इकाई पहचान](https://huggingface.co/dbmdz/electra-large-discriminator-finetuned-conll03-english?text=My+name+is+Sarah+and+I+live+in+London+city) -- [जीपीटी-2 के साथ टेक्स्ट जनरेशन](https://huggingface.co/gpt2?text=A+long+time+ago%2C+) -- [रॉबर्टा के साथ प्राकृतिक भाषा निष्कर्ष](https://huggingface.co/roberta-large-mnli?text=The+dog+was+lost.+Nobody+lost+any+animal) +- [जीपीटी-2 के साथ टेक्स्ट जनरेशन](https://huggingface.co/openai-community/gpt2?text=A+long+time+ago%2C+) +- [रॉबर्टा के साथ प्राकृतिक भाषा निष्कर्ष](https://huggingface.co/FacebookAI/roberta-large-mnli?text=The+dog+was+lost.+Nobody+lost+any+animal) - [बार्ट के साथ पाठ 
सारांश](https://huggingface.co/facebook/bart-large-cnn?text=The+tower+is+324+metres+%281%2C063+ft%29+tall%2C+about+the+same+height+as+an+81-storey+building%2C+and+the+tallest+structure+in+Paris.+Its+base+is+square%2C+measuring+125+metres+%28410+ft%29+on+each+side.+During+its+construction%2C+the+Eiffel+Tower+surpassed+the+Washington+Monument+to+become+the+tallest+man-made+structure+in+the+world%2C+a+title+it+held+for+41+years+until+the+Chrysler+Building+in+New+York+City+was+finished+in+1930.+It+was+the+first+structure+to+reach+a+height+of+300+metres.+Due+to+the+addition+of+a+broadcasting+aerial+at+the+top+of+the+tower+in+1957%2C+it+is+now+taller+than+the+Chrysler+Building+by+5.2+metres+%2817+ft%29.+Excluding+transmitters%2C+the+Eiffel+Tower+is+the+second+tallest+free-standing+structure+in+France+after+the+Millau+Viaduct) -- [डिस्टिलबर्ट के साथ प्रश्नोत्तर](https://huggingface.co/distilbert-base-uncased-distilled-squad?text=Which+name+is+also+used+to+describe+the+Amazon+rainforest+in+English%3F&context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species) -- [अनुवाद के लिए T5 का प्रयोग करें](https://huggingface.co/t5-base?text=My+name+is+Wolfgang+and+I+live+in+Berlin) +- [डिस्टिलबर्ट के साथ 
प्रश्नोत्तर](https://huggingface.co/distilbert/distilbert-base-uncased-distilled-squad?text=Which+name+is+also+used+to+describe+the+Amazon+rainforest+in+English%3F&context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species) +- [अनुवाद के लिए T5 का प्रयोग करें](https://huggingface.co/google-t5/t5-base?text=My+name+is+Wolfgang+and+I+live+in+Berlin) **[Write With Transformer](https://transformer.huggingface.co)**,हगिंग फेस टीम द्वारा बनाया गया, यह एक आधिकारिक पाठ पीढ़ी है demo。 @@ -146,8 +151,8 @@ checkpoint: जाँच बिंदु ```python >>> from transformers import AutoTokenizer, AutoModel ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") ->>> model = AutoModel.from_pretrained("bert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = AutoModel.from_pretrained("google-bert/bert-base-uncased") >>> inputs = tokenizer("Hello world!", return_tensors="pt") >>> outputs = model(**inputs) @@ -156,8 +161,8 @@ checkpoint: जाँच बिंदु ```python >>> from transformers import AutoTokenizer, TFAutoModel ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") ->>> model = TFAutoModel.from_pretrained("bert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = TFAutoModel.from_pretrained("google-bert/bert-base-uncased") >>> inputs = tokenizer("Hello world!", return_tensors="tf") >>> outputs = model(**inputs) @@ -165,7 +170,7 @@ checkpoint: जाँच बिंदु टोकननाइज़र सभी पूर्व-प्रशिक्षित मॉडलों के लिए प्रीप्रोसेसिंग प्रदान करता है और इसे सीधे एक स्ट्रिंग (जैसे ऊपर दिए गए उदाहरण) या किसी सूची पर बुलाया जा सकता है। यह एक डिक्शनरी (तानाशाही) को आउटपुट करता है जिसे आप डाउनस्ट्रीम कोड में उपयोग कर सकते हैं या `**` अनपैकिंग एक्सप्रेशन के माध्यम से सीधे मॉडल को पास कर सकते हैं। -मॉडल स्वयं एक नियमित [Pytorch `nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) या [TensorFlow `tf.keras.Model`](https ://pytorch.org/docs/stable/nn.html#torch.nn.Module) ://www.tensorflow.org/api_docs/python/tf/keras/Model) (आपके बैकएंड के आधार पर), जो हो सकता है सामान्य तरीके से उपयोग किया जाता है। [यह ट्यूटोरियल](https://huggingface.co/transformers/training.html) बताता है कि इस तरह के मॉडल को क्लासिक PyTorch या TensorFlow प्रशिक्षण लूप में कैसे एकीकृत किया जाए, या हमारे `ट्रेनर` एपीआई का उपयोग कैसे करें ताकि इसे जल्दी से फ़ाइन ट्यून किया जा सके।एक नया डेटासेट पे। +मॉडल स्वयं एक नियमित [Pytorch 
`nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) या [TensorFlow `tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) (आपके बैकएंड के आधार पर), जो हो सकता है सामान्य तरीके से उपयोग किया जाता है। [यह ट्यूटोरियल](https://huggingface.co/transformers/training.html) बताता है कि इस तरह के मॉडल को क्लासिक PyTorch या TensorFlow प्रशिक्षण लूप में कैसे एकीकृत किया जाए, या हमारे `ट्रेनर` एपीआई का उपयोग कैसे करें ताकि इसे जल्दी से फ़ाइन ट्यून किया जा सके।एक नया डेटासेट पे। ## ट्रांसफार्मर का उपयोग क्यों करें? @@ -194,19 +199,21 @@ checkpoint: जाँच बिंदु - यह लाइब्रेरी मॉड्यूलर न्यूरल नेटवर्क टूलबॉक्स नहीं है। मॉडल फ़ाइल में कोड जानबूझकर अल्पविकसित है, बिना अतिरिक्त सार इनकैप्सुलेशन के, ताकि शोधकर्ता अमूर्तता और फ़ाइल जंपिंग में शामिल हुए जल्दी से पुनरावृति कर सकें। - `ट्रेनर` एपीआई किसी भी मॉडल के साथ संगत नहीं है, यह केवल इस पुस्तकालय के मॉडल के लिए अनुकूलित है। यदि आप सामान्य मशीन लर्निंग के लिए उपयुक्त प्रशिक्षण लूप कार्यान्वयन की तलाश में हैं, तो कहीं और देखें। -- हमारे सर्वोत्तम प्रयासों के बावजूद, [उदाहरण निर्देशिका] (https://github.com/huggingface/transformers/tree/main/examples) में स्क्रिप्ट केवल उपयोग के मामले हैं। आपकी विशिष्ट समस्या के लिए, वे जरूरी नहीं कि बॉक्स से बाहर काम करें, और आपको कोड की कुछ पंक्तियों को सूट करने की आवश्यकता हो सकती है। +- हमारे सर्वोत्तम प्रयासों के बावजूद, [उदाहरण निर्देशिका](https://github.com/huggingface/transformers/tree/main/examples) में स्क्रिप्ट केवल उपयोग के मामले हैं। आपकी विशिष्ट समस्या के लिए, वे जरूरी नहीं कि बॉक्स से बाहर काम करें, और आपको कोड की कुछ पंक्तियों को सूट करने की आवश्यकता हो सकती है। ## स्थापित करना ### पिप का उपयोग करना -इस रिपॉजिटरी का परीक्षण Python 3.6+, Flax 0.3.2+, PyTorch 1.3.1+ और TensorFlow 2.3+ के तहत किया गया है। +इस रिपॉजिटरी का परीक्षण Python 3.8+, Flax 0.4.1+, PyTorch 1.11+ और TensorFlow 2.6+ के तहत किया गया है। -आप [वर्चुअल एनवायरनमेंट] (https://docs.python.org/3/library/venv.html) में 🤗 ट्रांसफॉर्मर इंस्टॉल कर सकते हैं। यदि आप अभी तक पायथन के वर्चुअल एनवायरनमेंट से परिचित नहीं हैं, तो कृपया इसे [उपयोगकर्ता निर्देश] (https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/) पढ़ें। +आप [वर्चुअल एनवायरनमेंट](https://docs.python.org/3/library/venv.html) में 🤗 ट्रांसफॉर्मर इंस्टॉल कर सकते हैं। यदि आप अभी तक पायथन के वर्चुअल एनवायरनमेंट से परिचित नहीं हैं, तो कृपया इसे [उपयोगकर्ता निर्देश](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/) पढ़ें। सबसे पहले, पायथन के उस संस्करण के साथ एक आभासी वातावरण बनाएं जिसका आप उपयोग करने और उसे सक्रिय करने की योजना बना रहे हैं। -फिर, आपको Flax, PyTorch या TensorFlow में से किसी एक को स्थापित करने की आवश्यकता है। अपने प्लेटफ़ॉर्म पर इन फ़्रेमवर्क को स्थापित करने के लिए, [TensorFlow स्थापना पृष्ठ](https://www.tensorflow.org/install/), [PyTorch स्थापना पृष्ठ](https://pytorch.org/get-started /locally/# देखें) start-locally) या [Flax स्थापना पृष्ठ](https://github.com/google/flax#quick-install). +फिर, आपको Flax, PyTorch या TensorFlow में से किसी एक को स्थापित करने की आवश्यकता है। अपने प्लेटफ़ॉर्म पर इन फ़्रेमवर्क को स्थापित करने के लिए, [TensorFlow स्थापना पृष्ठ](https://www.tensorflow.org/install/), [PyTorch स्थापना पृष्ठ](https://pytorch.org/get-started/locally) + +देखें start-locally या [Flax स्थापना पृष्ठ](https://github.com/google/flax#quick-install). 
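Before moving on to installing Transformers itself, a small sketch like the following (plain standard-library Python, no assumptions about the 🤗 API) can confirm that at least one backend imports cleanly in the active virtual environment:

```python
# Minimal sketch: confirm that at least one deep-learning backend
# (PyTorch, TensorFlow or JAX/Flax) is importable in the current
# virtual environment before installing 🤗 Transformers on top of it.
import importlib.util

backends = {name: importlib.util.find_spec(name) is not None
            for name in ("torch", "tensorflow", "jax")}
print(backends)  # e.g. {'torch': True, 'tensorflow': False, 'jax': False}

if not any(backends.values()):
    raise SystemExit("Install Flax, PyTorch or TensorFlow first (see the links above).")
```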
जब इनमें से कोई एक बैकएंड सफलतापूर्वक स्थापित हो जाता है, तो ट्रांसफॉर्मर निम्नानुसार स्थापित किए जा सकते हैं: @@ -214,213 +221,280 @@ checkpoint: जाँच बिंदु pip install transformers ``` -यदि आप उपयोग के मामलों को आज़माना चाहते हैं या आधिकारिक रिलीज़ से पहले नवीनतम इन-डेवलपमेंट कोड का उपयोग करना चाहते हैं, तो आपको [सोर्स से इंस्टॉल करना होगा](https://huggingface.co/docs/transformers/installation#installing-from- स्रोत)। +यदि आप उपयोग के मामलों को आज़माना चाहते हैं या आधिकारिक रिलीज़ से पहले नवीनतम इन-डेवलपमेंट कोड का उपयोग करना चाहते हैं, तो आपको [सोर्स से इंस्टॉल करना होगा](https://huggingface.co/docs/transformers/installation#installing-from-) स्रोत। ### कोंडा का उपयोग करना -ट्रांसफॉर्मर संस्करण 4.0.0 के बाद से, हमारे पास एक कोंडा चैनल है: `हगिंगफेस`। - ट्रांसफॉर्मर कोंडा के माध्यम से निम्नानुसार स्थापित किया जा सकता है: ```shell script -conda install -c huggingface transformers +conda install conda-forge::transformers ``` +> **_नोट:_** `huggingface` चैनल से `transformers` इंस्टॉल करना पुराना पड़ चुका है। + कोंडा के माध्यम से Flax, PyTorch, या TensorFlow में से किसी एक को स्थापित करने के लिए, निर्देशों के लिए उनके संबंधित स्थापना पृष्ठ देखें। ## मॉडल आर्किटेक्चर -[उपयोगकर्ता](https://huggingface.co/users) और [organization](https://huggingface.co) द्वारा ट्रांसफॉर्मर समर्थित [**सभी मॉडल चौकियों**](https://huggingface.co/models) /users) हगिंगफेस.को/ऑर्गनाइजेशन), सभी को बिना किसी बाधा के हगिंगफेस.को [मॉडल हब](https://huggingface.co) के साथ एकीकृत किया गया है। +[उपयोगकर्ता](https://huggingface.co/users) और [organization](https://huggingface.co) द्वारा ट्रांसफॉर्मर समर्थित [**सभी मॉडल चौकियों**](https://huggingface.co/models/users) हगिंगफेस.को/ऑर्गनाइजेशन), सभी को बिना किसी बाधा के हगिंगफेस.को [मॉडल हब](https://huggingface.co) के साथ एकीकृत किया गया है। चौकियों की वर्तमान संख्या: ![](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/models&color=brightgreen) -🤗 ट्रांसफॉर्मर वर्तमान में निम्नलिखित आर्किटेक्चर का समर्थन करते हैं (मॉडल के अवलोकन के लिए [यहां] देखें (https://huggingface.co/docs/transformers/model_summary)): +🤗 ट्रांसफॉर्मर वर्तमान में निम्नलिखित आर्किटेक्चर का समर्थन करते हैं (मॉडल के अवलोकन के लिए [यहां देखें](https://huggingface.co/docs/transformers/model_summary)): 1. **[ALBERT](https://huggingface.co/docs/transformers/model_doc/albert)** (Google Research and the Toyota Technological Institute at Chicago) साथ थीसिस [ALBERT: A Lite BERT for Self-supervised भाषा प्रतिनिधित्व सीखना](https://arxiv.org/abs/1909.11942), झेंझोंग लैन, मिंगदा चेन, सेबेस्टियन गुडमैन, केविन गिम्पेल, पीयूष शर्मा, राडू सोरिकट +1. **[ALIGN](https://huggingface.co/docs/transformers/model_doc/align)** (Google Research से) Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. द्वाराअनुसंधान पत्र [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918) के साथ जारी किया गया 1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell. 1. 
**[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass. -1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (फेसबुक) साथ थीसिस [बार्ट: प्राकृतिक भाषा निर्माण, अनुवाद के लिए अनुक्रम-से-अनुक्रम पूर्व प्रशिक्षण , और समझ] (https://arxiv.org/pdf/1910.13461.pdf) पर निर्भर माइक लुईस, यिनहान लियू, नमन गोयल, मार्जन ग़ज़विनिनेजाद, अब्देलरहमान मोहम्मद, ओमर लेवी, वेस स्टोयानोव और ल्यूक ज़ेटलमॉयर +1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long. +1. **[Bark](https://huggingface.co/docs/transformers/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team. +1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (फेसबुक) साथ थीसिस [बार्ट: प्राकृतिक भाषा निर्माण, अनुवाद के लिए अनुक्रम-से-अनुक्रम पूर्व प्रशिक्षण , और समझ](https://arxiv.org/pdf/1910.13461.pdf) पर निर्भर माइक लुईस, यिनहान लियू, नमन गोयल, मार्जन ग़ज़विनिनेजाद, अब्देलरहमान मोहम्मद, ओमर लेवी, वेस स्टोयानोव और ल्यूक ज़ेटलमॉयर 1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (से École polytechnique) साथ थीसिस [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) पर निर्भर Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis रिहाई। 1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (VinAI Research से) साथ में पेपर [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701)गुयेन लुओंग ट्रान, डुओंग मिन्ह ले और डाट क्वोक गुयेन द्वारा पोस्ट किया गया। 1. **[BEiT](https://huggingface.co/docs/transformers/model_doc/beit)** (Microsoft से) साथ में कागज [BEiT: BERT इमेज ट्रांसफॉर्मर्स का प्री-ट्रेनिंग](https://arxiv.org/abs/2106.08254) Hangbo Bao, Li Dong, Furu Wei द्वारा। -1. **[BERT](https://huggingface.co/docs/transformers/model_doc/bert)** (गूगल से) साथ वाला पेपर [बीईआरटी: प्री-ट्रेनिंग ऑफ डीप बिडायरेक्शनल ट्रांसफॉर्मर्स फॉर लैंग्वेज अंडरस्टैंडिंग](https://arxiv.org/abs/1810.04805) जैकब डेवलिन, मिंग-वेई चांग, ​​केंटन ली और क्रिस्टीना टौटानोवा द्वारा प्रकाशित किया गया था। . -1. **[BERT For Sequence Generation](https://huggingface.co/docs/transformers/model_doc/bert-generation)** (गूगल से) साथ देने वाला पेपर [सीक्वेंस जेनरेशन टास्क के लिए प्री-ट्रेंड चेकपॉइंट का इस्तेमाल करना](https ://arxiv.org/abs/1907.12461) साशा रोठे, शशि नारायण, अलियाक्सि सेवेरिन द्वारा। -1. **[BERTweet](https://huggingface.co/docs/transformers/model_doc/bertweet)** (VinAI Research से) साथ में पेपर [BERTweet: अंग्रेजी ट्वीट्स के लिए एक पूर्व-प्रशिक्षित भाषा मॉडल] (https://aclanthology.org/2020.emnlp-demos.2/) डाट क्वोक गुयेन, थान वु और अन्ह तुआन गुयेन द्वारा प्रकाशित। -1. **[BigBird-Pegasus](https://huggingface.co/docs/transformers/model_doc/bigbird_pegasus)** (गूगल रिसर्च से) साथ वाला पेपर [बिग बर्ड: ट्रांसफॉर्मर्स फॉर लॉन्गर सीक्वेंस](https://arxiv .org/abs/2007.14062) मंज़िल ज़हीर, गुरु गुरुगणेश, अविनावा दुबे, जोशुआ आइंस्ली, क्रिस अल्बर्टी, सैंटियागो ओंटानोन, फिलिप फाम, अनिरुद्ध रावुला, किफ़ान वांग, ली यांग, अमर अहमद द्वारा। +1. 
**[BERT](https://huggingface.co/docs/transformers/model_doc/bert)** (गूगल से) साथ वाला पेपर [बीईआरटी: प्री-ट्रेनिंग ऑफ डीप बिडायरेक्शनल ट्रांसफॉर्मर्स फॉर लैंग्वेज अंडरस्टैंडिंग](https://arxiv.org/abs/1810.04805) जैकब डेवलिन, मिंग-वेई चांग, केंटन ली और क्रिस्टीना टौटानोवा द्वारा प्रकाशित किया गया था। . +1. **[BERT For Sequence Generation](https://huggingface.co/docs/transformers/model_doc/bert-generation)** (गूगल से) साथ देने वाला पेपर [सीक्वेंस जेनरेशन टास्क के लिए प्री-ट्रेंड चेकपॉइंट का इस्तेमाल करना](https://arxiv.org/abs/1907.12461) साशा रोठे, शशि नारायण, अलियाक्सि सेवेरिन द्वारा। +1. **[BERTweet](https://huggingface.co/docs/transformers/model_doc/bertweet)** (VinAI Research से) साथ में पेपर [BERTweet: अंग्रेजी ट्वीट्स के लिए एक पूर्व-प्रशिक्षित भाषा मॉडल](https://aclanthology.org/2020.emnlp-demos.2/) डाट क्वोक गुयेन, थान वु और अन्ह तुआन गुयेन द्वारा प्रकाशित। +1. **[BigBird-Pegasus](https://huggingface.co/docs/transformers/model_doc/bigbird_pegasus)** (गूगल रिसर्च से) साथ वाला पेपर [बिग बर्ड: ट्रांसफॉर्मर्स फॉर लॉन्गर सीक्वेंस](https://arxiv.org/abs/2007.14062) मंज़िल ज़हीर, गुरु गुरुगणेश, अविनावा दुबे, जोशुआ आइंस्ली, क्रिस अल्बर्टी, सैंटियागो ओंटानोन, फिलिप फाम, अनिरुद्ध रावुला, किफ़ान वांग, ली यांग, अमर अहमद द्वारा। 1. **[BigBird-RoBERTa](https://huggingface.co/docs/transformers/model_doc/big_bird)** (गूगल रिसर्च से) साथ में पेपर [बिग बर्ड: ट्रांसफॉर्मर्स फॉर लॉन्गर सीक्वेंस](https://arxiv.org/abs/2007.14062) मंज़िल ज़हीर, गुरु गुरुगणेश, अविनावा दुबे, जोशुआ आइंस्ली, क्रिस अल्बर्टी, सैंटियागो ओंटानन, फिलिप फाम द्वारा , अनिरुद्ध रावुला, किफ़ान वांग, ली यांग, अमर अहमद द्वारा पोस्ट किया गया। 1. **[BioGpt](https://huggingface.co/docs/transformers/model_doc/biogpt)** (from Microsoft Research AI4Science) released with the paper [BioGPT: generative pre-trained transformer for biomedical text generation and mining](https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac409/6713511?guestAccessKey=a66d9b5d-4f83-4017-bb52-405815c907b9) by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu. 1. **[BiT](https://huggingface.co/docs/transformers/model_doc/bit)** (from Google AI) released with the paper [Big Transfer (BiT) by Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, Neil Houlsby. -1. **[Blenderbot](https://huggingface.co/docs/transformers/model_doc/blenderbot)** (फेसबुक से) साथ में कागज [एक ओपन-डोमेन चैटबॉट बनाने की विधि](https://arxiv.org /abs/2004.13637) स्टीफन रोलर, एमिली दीनन, नमन गोयल, दा जू, मैरी विलियमसन, यिनहान लियू, जिंग जू, मायल ओट, कर्ट शस्टर, एरिक एम। स्मिथ, वाई-लैन बॉरो, जेसन वेस्टन द्वारा। -1. **[BlenderbotSmall](https://huggingface.co/docs/transformers/model_doc/blenderbot-small)** (फेसबुक से) साथ में पेपर [एक ओपन-डोमेन चैटबॉट बनाने की रेसिपी](https://arxiv .org/abs/2004.13637) स्टीफन रोलर, एमिली दीनन, नमन गोयल, दा जू, मैरी विलियमसन, यिनहान लियू, जिंग जू, मायल ओट, कर्ट शस्टर, एरिक एम स्मिथ, वाई-लैन बॉरो, जेसन वेस्टन द्वारा। +1. **[Blenderbot](https://huggingface.co/docs/transformers/model_doc/blenderbot)** (फेसबुक से) साथ में कागज [एक ओपन-डोमेन चैटबॉट बनाने की विधि](https://arxiv.org/abs/2004.13637) स्टीफन रोलर, एमिली दीनन, नमन गोयल, दा जू, मैरी विलियमसन, यिनहान लियू, जिंग जू, मायल ओट, कर्ट शस्टर, एरिक एम। स्मिथ, वाई-लैन बॉरो, जेसन वेस्टन द्वारा। +1. 
**[BlenderbotSmall](https://huggingface.co/docs/transformers/model_doc/blenderbot-small)** (फेसबुक से) साथ में पेपर [एक ओपन-डोमेन चैटबॉट बनाने की रेसिपी](https://arxiv.org/abs/2004.13637) स्टीफन रोलर, एमिली दीनन, नमन गोयल, दा जू, मैरी विलियमसन, यिनहान लियू, जिंग जू, मायल ओट, कर्ट शस्टर, एरिक एम स्मिथ, वाई-लैन बॉरो, जेसन वेस्टन द्वारा। 1. **[BLIP](https://huggingface.co/docs/transformers/model_doc/blip)** (from Salesforce) released with the paper [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086) by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi. -1. **[BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2)** (Salesforce से) Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. द्वाराअनुसंधान पत्र [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597) के साथ जारी किया गया +1. **[BLIP-2](https://huggingface.co/docs/transformers/model_doc/blip-2)** (Salesforce से) Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. द्वाराअनुसंधान पत्र [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597) के साथ जारी किया गया 1. **[BLOOM](https://huggingface.co/docs/transformers/model_doc/bloom)** (from BigScience workshop) released by the [BigSicence Workshop](https://bigscience.huggingface.co/). -1. **[BORT](https://huggingface.co/docs/transformers/model_doc/bort)** (एलेक्सा से) कागज के साथ [बीईआरटी के लिए ऑप्टिमल सबआर्किटेक्चर एक्सट्रैक्शन](https://arxiv.org/abs/ 2010.10499) एड्रियन डी विंटर और डैनियल जे पेरी द्वारा। -1. **[BridgeTower](https://huggingface.co/docs/transformers/main/model_doc/bridgetower)** (हरबिन इंस्टिट्यूट ऑफ़ टेक्नोलॉजी/माइक्रोसॉफ्ट रिसर्च एशिया/इंटेल लैब्स से) कागज के साथ [ब्रिजटॉवर: विजन-लैंग्वेज रिप्रेजेंटेशन लर्निंग में एनकोडर्स के बीच ब्रिज बनाना]() by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. -1. **[ByT5](https://huggingface.co/docs/transformers/model_doc/byt5)** (Google अनुसंधान से) साथ में कागज [ByT5: पूर्व-प्रशिक्षित बाइट-टू-बाइट मॉडल के साथ एक टोकन-मुक्त भविष्य की ओर] (https://arxiv.org/abs/2105.13626) Linting Xue, Aditya Barua, Noah Constant, रामी अल-रफू, शरण नारंग, मिहिर काले, एडम रॉबर्ट्स, कॉलिन रैफेल द्वारा पोस्ट किया गया। -1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (इनरिया/फेसबुक/सोरबोन से) साथ में कागज [CamemBERT: एक टेस्टी फ्रेंच लैंग्वेज मॉडल](https:// arxiv.org/abs/1911.03894) लुई मार्टिन*, बेंजामिन मुलर*, पेड्रो जेवियर ऑर्टिज़ सुआरेज़*, योआन ड्यूपॉन्ट, लॉरेंट रोमरी, एरिक विलेमोन्टे डे ला क्लर्जरी, जैमे सेडाह और बेनोइट सगोट द्वारा। -1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (Google रिसर्च से) साथ में दिया गया पेपर [कैनाइन: प्री-ट्रेनिंग ए एफिशिएंट टोकनाइजेशन-फ्री एनकोडर फॉर लैंग्वेज रिप्रेजेंटेशन]( https://arxiv.org/abs/2103.06874) जोनाथन एच क्लार्क, डैन गैरेट, यूलिया टर्क, जॉन विएटिंग द्वारा। +1. **[BORT](https://huggingface.co/docs/transformers/model_doc/bort)** (एलेक्सा से) कागज के साथ [बीईआरटी के लिए ऑप्टिमल सबआर्किटेक्चर एक्सट्रैक्शन](https://arxiv.org/abs/2010.10499) एड्रियन डी विंटर और डैनियल जे पेरी द्वारा। +1. 
**[BridgeTower](https://huggingface.co/docs/transformers/model_doc/bridgetower)** (हरबिन इंस्टिट्यूट ऑफ़ टेक्नोलॉजी/माइक्रोसॉफ्ट रिसर्च एशिया/इंटेल लैब्स से) कागज के साथ [ब्रिजटॉवर: विजन-लैंग्वेज रिप्रेजेंटेशन लर्निंग में एनकोडर्स के बीच ब्रिज बनाना]() by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. +1. **[BROS](https://huggingface.co/docs/transformers/model_doc/bros)** (NAVER CLOVA से) Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park. द्वाराअनुसंधान पत्र [BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents](https://arxiv.org/abs/2108.04539) के साथ जारी किया गया +1. **[ByT5](https://huggingface.co/docs/transformers/model_doc/byt5)** (Google अनुसंधान से) साथ में कागज [ByT5: पूर्व-प्रशिक्षित बाइट-टू-बाइट मॉडल के साथ एक टोकन-मुक्त भविष्य की ओर](https://arxiv.org/abs/2105.13626) Linting Xue, Aditya Barua, Noah Constant, रामी अल-रफू, शरण नारंग, मिहिर काले, एडम रॉबर्ट्स, कॉलिन रैफेल द्वारा पोस्ट किया गया। +1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (इनरिया/फेसबुक/सोरबोन से) साथ में कागज [CamemBERT: एक टेस्टी फ्रेंच लैंग्वेज मॉडल](https://arxiv.org/abs/1911.03894) लुई मार्टिन*, बेंजामिन मुलर*, पेड्रो जेवियर ऑर्टिज़ सुआरेज़*, योआन ड्यूपॉन्ट, लॉरेंट रोमरी, एरिक विलेमोन्टे डे ला क्लर्जरी, जैमे सेडाह और बेनोइट सगोट द्वारा। +1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (Google रिसर्च से) साथ में दिया गया पेपर [कैनाइन: प्री-ट्रेनिंग ए एफिशिएंट टोकनाइजेशन-फ्री एनकोडर फॉर लैंग्वेज रिप्रेजेंटेशन](https://arxiv.org/abs/2103.06874) जोनाथन एच क्लार्क, डैन गैरेट, यूलिया टर्क, जॉन विएटिंग द्वारा। 1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou. -1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (LAION-AI से) Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. द्वाराअनुसंधान पत्र [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation]https://arxiv.org/abs/2211.06687) के साथ जारी किया गया -1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (OpenAI से) साथ वाला पेपर [लर्निंग ट्रांसफरेबल विजुअल मॉडल फ्रॉम नेचुरल लैंग्वेज सुपरविजन](https://arxiv.org /abs/2103.00020) एलेक रैडफोर्ड, जोंग वूक किम, क्रिस हैलासी, आदित्य रमेश, गेब्रियल गोह, संध्या अग्रवाल, गिरीश शास्त्री, अमांडा एस्केल, पामेला मिश्किन, जैक क्लार्क, ग्रेचेन क्रुएगर, इल्या सुत्स्केवर द्वारा। +1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (LAION-AI से) Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. द्वाराअनुसंधान पत्र [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) के साथ जारी किया गया +1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (OpenAI से) साथ वाला पेपर [लर्निंग ट्रांसफरेबल विजुअल मॉडल फ्रॉम नेचुरल लैंग्वेज सुपरविजन](https://arxiv.org/abs/2103.00020) एलेक रैडफोर्ड, जोंग वूक किम, क्रिस हैलासी, आदित्य रमेश, गेब्रियल गोह, संध्या अग्रवाल, गिरीश शास्त्री, अमांडा एस्केल, पामेला मिश्किन, जैक क्लार्क, ग्रेचेन क्रुएगर, इल्या सुत्स्केवर द्वारा। 1. 
**[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker. +1. **[CLVP](https://huggingface.co/docs/transformers/model_doc/clvp)** released with the paper [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker. 1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (सेल्सफोर्स से) साथ में पेपर [प्रोग्राम सिंथेसिस के लिए एक संवादात्मक प्रतिमान](https://arxiv.org/abs/2203.13474) एरिक निजकैंप, बो पैंग, हिरोआकी हयाशी, लिफू तू, हुआन वांग, यिंगबो झोउ, सिल्वियो सावरेस, कैमिंग जिओंग रिलीज। -1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (माइक्रोसॉफ्ट रिसर्च एशिया से) कागज के साथ [फास्ट ट्रेनिंग कन्वर्जेंस के लिए सशर्त डीईटीआर](https://arxiv. org/abs/2108.06152) डेपू मेंग, ज़ियाओकांग चेन, ज़ेजिया फैन, गैंग ज़ेंग, होउकियांग ली, युहुई युआन, लेई सन, जिंगडोंग वांग द्वारा। -1. **[ConvBERT](https://huggingface.co/docs/transformers/model_doc/convbert)** (YituTech से) साथ में कागज [ConvBERT: स्पैन-आधारित डायनेमिक कनवल्शन के साथ BERT में सुधार](https://arxiv .org/abs/2008.02496) जिहांग जियांग, वीहाओ यू, डाकान झोउ, युनपेंग चेन, जियाशी फेंग, शुइचेंग यान द्वारा। -1. **[ConvNeXT](https://huggingface.co/docs/transformers/model_doc/convnext)** (Facebook AI से) साथ वाला पेपर [A ConvNet for the 2020s](https://arxiv.org/abs /2201.03545) ज़ुआंग लियू, हेंज़ी माओ, चाओ-युआन वू, क्रिस्टोफ़ फीचटेनहोफ़र, ट्रेवर डेरेल, सैनिंग ज़ी द्वारा। -1. **[CPM](https://huggingface.co/docs/transformers/model_doc/cpm)** (सिंघुआ यूनिवर्सिटी से) साथ में पेपर [सीपीएम: ए लार्ज-स्केल जेनेरेटिव चाइनीज प्री-ट्रेंड लैंग्वेज मॉडल](https : //arxiv.org/abs/2012.00413) झेंग्यान झांग, जू हान, हाओ झोउ, पेई के, युक्सियन गु, डेमिंग ये, युजिया किन, युशेंग सु, हाओझे जी, जियान गुआन, फैंचाओ क्यूई, ज़ियाओझी वांग, यानान झेंग द्वारा , गुओयांग ज़ेंग, हुआनकी काओ, शेंगकी चेन, डाइक्सुआन ली, ज़ेनबो सन, ज़ियुआन लियू, मिनली हुआंग, वेंटाओ हान, जी तांग, जुआनज़ी ली, ज़ियाओयान झू, माओसोंग सन। +1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (MetaAI से) Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve. द्वाराअनुसंधान पत्र [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) के साथ जारी किया गया +1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (माइक्रोसॉफ्ट रिसर्च एशिया से) कागज के साथ [फास्ट ट्रेनिंग कन्वर्जेंस के लिए सशर्त डीईटीआर](https://arxiv.org/abs/2108.06152) डेपू मेंग, ज़ियाओकांग चेन, ज़ेजिया फैन, गैंग ज़ेंग, होउकियांग ली, युहुई युआन, लेई सन, जिंगडोंग वांग द्वारा। +1. **[ConvBERT](https://huggingface.co/docs/transformers/model_doc/convbert)** (YituTech से) साथ में कागज [ConvBERT: स्पैन-आधारित डायनेमिक कनवल्शन के साथ BERT में सुधार](https://arxiv.org/abs/2008.02496) जिहांग जियांग, वीहाओ यू, डाकान झोउ, युनपेंग चेन, जियाशी फेंग, शुइचेंग यान द्वारा। +1. 
**[ConvNeXT](https://huggingface.co/docs/transformers/model_doc/convnext)** (Facebook AI से) साथ वाला पेपर [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) ज़ुआंग लियू, हेंज़ी माओ, चाओ-युआन वू, क्रिस्टोफ़ फीचटेनहोफ़र, ट्रेवर डेरेल, सैनिंग ज़ी द्वारा। +1. **[ConvNeXTV2](https://huggingface.co/docs/transformers/model_doc/convnextv2)** (from Facebook AI) released with the paper [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie. +1. **[CPM](https://huggingface.co/docs/transformers/model_doc/cpm)** (सिंघुआ यूनिवर्सिटी से) साथ में पेपर [सीपीएम: ए लार्ज-स्केल जेनेरेटिव चाइनीज प्री-ट्रेंड लैंग्वेज मॉडल](https://arxiv.org/abs/2012.00413) झेंग्यान झांग, जू हान, हाओ झोउ, पेई के, युक्सियन गु, डेमिंग ये, युजिया किन, युशेंग सु, हाओझे जी, जियान गुआन, फैंचाओ क्यूई, ज़ियाओझी वांग, यानान झेंग द्वारा , गुओयांग ज़ेंग, हुआनकी काओ, शेंगकी चेन, डाइक्सुआन ली, ज़ेनबो सन, ज़ियुआन लियू, मिनली हुआंग, वेंटाओ हान, जी तांग, जुआनज़ी ली, ज़ियाओयान झू, माओसोंग सन। +1. **[CPM-Ant](https://huggingface.co/docs/transformers/model_doc/cpmant)** (from OpenBMB) released by the [OpenBMB](https://www.openbmb.org/). 1. **[CTRL](https://huggingface.co/docs/transformers/model_doc/ctrl)** (सेल्सफोर्स से) साथ में पेपर [CTRL: ए कंडिशनल ट्रांसफॉर्मर लैंग्वेज मॉडल फॉर कंट्रोलेबल जेनरेशन](https://arxiv.org/abs/1909.05858) नीतीश शिरीष केसकर*, ब्रायन मैककैन*, लव आर. वार्ष्णेय, कैमिंग जिओंग और रिचर्ड द्वारा सोचर द्वारा जारी किया गया। -1. **[CvT](https://huggingface.co/docs/transformers/model_doc/cvt)** (Microsoft से) साथ में दिया गया पेपर [CvT: इंट्रोड्यूसिंग कनवॉल्यूशन टू विजन ट्रांसफॉर्मर्स](https://arxiv.org/ एब्स/2103.15808) हैपिंग वू, बिन जिओ, नोएल कोडेला, मेंगचेन लियू, जियांग दाई, लू युआन, लेई झांग द्वारा। -1. **[Data2Vec](https://huggingface.co/docs/transformers/model_doc/data2vec)** (फेसबुक से) साथ में कागज [Data2Vec: भाषण, दृष्टि और भाषा में स्व-पर्यवेक्षित सीखने के लिए एक सामान्य ढांचा] (https://arxiv.org/abs/2202.03555) एलेक्सी बाएव्स्की, वेई-निंग सू, कियानटोंग जू, अरुण बाबू, जियाताओ गु, माइकल औली द्वारा पोस्ट किया गया। -1. **[DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta)** (Microsoft से) साथ में दिया गया पेपर [DeBERta: डिकोडिंग-एन्हांस्ड BERT विद डिसेंटैंगल्ड अटेंशन](https://arxiv. org/abs/2006.03654) पेंगचेंग हे, ज़ियाओडोंग लियू, जियानफेंग गाओ, वीज़ू चेन द्वारा। -1. **[DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2)** (Microsoft से) साथ में दिया गया पेपर [DeBERTa: डिकोडिंग-एन्हांस्ड BERT विथ डिसेंन्गल्ड अटेंशन](https: //arxiv.org/abs/2006.03654) पेंगचेंग हे, ज़ियाओडोंग लियू, जियानफेंग गाओ, वीज़ू चेन द्वारा पोस्ट किया गया। -1. **[Decision Transformer](https://huggingface.co/docs/transformers/model_doc/decision_transformer)** (बर्कले/फेसबुक/गूगल से) पेपर के साथ [डिसीजन ट्रांसफॉर्मर: रीनफोर्समेंट लर्निंग वाया सीक्वेंस मॉडलिंग](https : //arxiv.org/abs/2106.01345) लिली चेन, केविन लू, अरविंद राजेश्वरन, किमिन ली, आदित्य ग्रोवर, माइकल लास्किन, पीटर एबील, अरविंद श्रीनिवास, इगोर मोर्डच द्वारा पोस्ट किया गया। -1. **[Deformable DETR](https://huggingface.co/docs/transformers/model_doc/deformable_detr)** (सेंसटाइम रिसर्च से) साथ में पेपर [डिफॉर्मेबल डीईटीआर: डिफॉर्मेबल ट्रांसफॉर्मर्स फॉर एंड-टू-एंड ऑब्जेक्ट डिटेक्शन] (https://arxiv.org/abs/2010.04159) Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, जिफेंग दाई द्वारा पोस्ट किया गया। -1. 
**[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (फेसबुक से) साथ में पेपर [ट्रेनिंग डेटा-एफिशिएंट इमेज ट्रांसफॉर्मर और डिस्टिलेशन थ्रू अटेंशन](https://arxiv .org/abs/2012.12877) ह्यूगो टौव्रोन, मैथ्यू कॉर्ड, मैथिज्स डूज़, फ़्रांसिस्को मस्सा, एलेक्ज़ेंडर सबलेरोल्स, हर्वे जेगौ द्वारा। -1. **[DETA](https://huggingface.co/docs/transformers/main/model_doc/deta)** (from The University of Texas at Austin) released with the paper [NMS Strikes Back](https://arxiv.org/abs/2212.06137) by Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl. -1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (फेसबुक से) साथ में कागज [ट्रांसफॉर्मर्स के साथ एंड-टू-एंड ऑब्जेक्ट डिटेक्शन](https://arxiv. org/abs/2005.12872) निकोलस कैरियन, फ़्रांसिस्को मस्सा, गेब्रियल सिनेव, निकोलस उसुनियर, अलेक्जेंडर किरिलोव, सर्गेई ज़ागोरुयको द्वारा। -1. **[DialoGPT](https://huggingface.co/docs/transformers/model_doc/dialogpt)** (माइक्रोसॉफ्ट रिसर्च से) कागज के साथ [DialoGPT: बड़े पैमाने पर जनरेटिव प्री-ट्रेनिंग फॉर कन्वर्सेशनल रिस्पांस जेनरेशन](https ://arxiv.org/abs/1911.00536) यिज़े झांग, सिकी सन, मिशेल गैली, येन-चुन चेन, क्रिस ब्रोकेट, जियांग गाओ, जियानफेंग गाओ, जिंगजिंग लियू, बिल डोलन द्वारा। +1. **[CvT](https://huggingface.co/docs/transformers/model_doc/cvt)** (Microsoft से) साथ में दिया गया पेपर [CvT: इंट्रोड्यूसिंग कनवॉल्यूशन टू विजन ट्रांसफॉर्मर्स](https://arxiv.org/एब्स/2103.15808) हैपिंग वू, बिन जिओ, नोएल कोडेला, मेंगचेन लियू, जियांग दाई, लू युआन, लेई झांग द्वारा। +1. **[Data2Vec](https://huggingface.co/docs/transformers/model_doc/data2vec)** (फेसबुक से) साथ में कागज [Data2Vec: भाषण, दृष्टि और भाषा में स्व-पर्यवेक्षित सीखने के लिए एक सामान्य ढांचा](https://arxiv.org/abs/2202.03555) एलेक्सी बाएव्स्की, वेई-निंग सू, कियानटोंग जू, अरुण बाबू, जियाताओ गु, माइकल औली द्वारा पोस्ट किया गया। +1. **[DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta)** (Microsoft से) साथ में दिया गया पेपर [DeBERta: डिकोडिंग-एन्हांस्ड BERT विद डिसेंटैंगल्ड अटेंशन](https://arxiv.org/abs/2006.03654) पेंगचेंग हे, ज़ियाओडोंग लियू, जियानफेंग गाओ, वीज़ू चेन द्वारा। +1. **[DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2)** (Microsoft से) साथ में दिया गया पेपर [DeBERTa: डिकोडिंग-एन्हांस्ड BERT विथ डिसेंन्गल्ड अटेंशन](https://arxiv.org/abs/2006.03654) पेंगचेंग हे, ज़ियाओडोंग लियू, जियानफेंग गाओ, वीज़ू चेन द्वारा पोस्ट किया गया। +1. **[Decision Transformer](https://huggingface.co/docs/transformers/model_doc/decision_transformer)** (बर्कले/फेसबुक/गूगल से) पेपर के साथ [डिसीजन ट्रांसफॉर्मर: रीनफोर्समेंट लर्निंग वाया सीक्वेंस मॉडलिंग](https://arxiv.org/abs/2106.01345) लिली चेन, केविन लू, अरविंद राजेश्वरन, किमिन ली, आदित्य ग्रोवर, माइकल लास्किन, पीटर एबील, अरविंद श्रीनिवास, इगोर मोर्डच द्वारा पोस्ट किया गया। +1. **[Deformable DETR](https://huggingface.co/docs/transformers/model_doc/deformable_detr)** (सेंसटाइम रिसर्च से) साथ में पेपर [डिफॉर्मेबल डीईटीआर: डिफॉर्मेबल ट्रांसफॉर्मर्स फॉर एंड-टू-एंड ऑब्जेक्ट डिटेक्शन](https://arxiv.org/abs/2010.04159) Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, जिफेंग दाई द्वारा पोस्ट किया गया। +1. **[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (फेसबुक से) साथ में पेपर [ट्रेनिंग डेटा-एफिशिएंट इमेज ट्रांसफॉर्मर और डिस्टिलेशन थ्रू अटेंशन](https://arxiv.org/abs/2012.12877) ह्यूगो टौव्रोन, मैथ्यू कॉर्ड, मैथिज्स डूज़, फ़्रांसिस्को मस्सा, एलेक्ज़ेंडर सबलेरोल्स, हर्वे जेगौ द्वारा। +1. 
**[DePlot](https://huggingface.co/docs/transformers/model_doc/deplot)** (Google AI से) Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, Yasemin Altun. द्वाराअनुसंधान पत्र [DePlot: One-shot visual language reasoning by plot-to-table translation](https://arxiv.org/abs/2212.10505) के साथ जारी किया गया +1. **[Depth Anything](https://huggingface.co/docs/transformers/main/model_doc/depth_anything)** (University of Hong Kong and TikTok से) Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao. द्वाराअनुसंधान पत्र [Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891) के साथ जारी किया गया +1. **[DETA](https://huggingface.co/docs/transformers/model_doc/deta)** (from The University of Texas at Austin) released with the paper [NMS Strikes Back](https://arxiv.org/abs/2212.06137) by Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl. +1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (फेसबुक से) साथ में कागज [ट्रांसफॉर्मर्स के साथ एंड-टू-एंड ऑब्जेक्ट डिटेक्शन](https://arxiv.org/abs/2005.12872) निकोलस कैरियन, फ़्रांसिस्को मस्सा, गेब्रियल सिनेव, निकोलस उसुनियर, अलेक्जेंडर किरिलोव, सर्गेई ज़ागोरुयको द्वारा। +1. **[DialoGPT](https://huggingface.co/docs/transformers/model_doc/dialogpt)** (माइक्रोसॉफ्ट रिसर्च से) कागज के साथ [DialoGPT: बड़े पैमाने पर जनरेटिव प्री-ट्रेनिंग फॉर कन्वर्सेशनल रिस्पांस जेनरेशन](https://arxiv.org/abs/1911.00536) यिज़े झांग, सिकी सन, मिशेल गैली, येन-चुन चेन, क्रिस ब्रोकेट, जियांग गाओ, जियानफेंग गाओ, जिंगजिंग लियू, बिल डोलन द्वारा। 1. **[DiNAT](https://huggingface.co/docs/transformers/model_doc/dinat)** (from SHI Labs) released with the paper [Dilated Neighborhood Attention Transformer](https://arxiv.org/abs/2209.15001) by Ali Hassani and Humphrey Shi. -1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (हगिंगफेस से), साथ में कागज [डिस्टिलबर्ट, बीईआरटी का डिस्टिल्ड वर्जन: छोटा, तेज, सस्ता और हल्का] (https://arxiv.org/abs/1910.01108) विक्टर सनह, लिसांड्रे डेब्यू और थॉमस वुल्फ द्वारा पोस्ट किया गया। यही तरीका GPT-2 को [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/distillation), RoBERta से [DistilRoBERta](https://github.com) पर कंप्रेस करने के लिए भी लागू किया जाता है। / हगिंगफेस/ट्रांसफॉर्मर्स/ट्री/मेन/उदाहरण/डिस्टिलेशन), बहुभाषी BERT से [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/distillation) और डिस्टिलबर्ट का जर्मन संस्करण। +1. **[DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2)** (Meta AI से) Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski. द्वाराअनुसंधान पत्र [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) के साथ जारी किया गया +1. 
**[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (हगिंगफेस से), साथ में कागज [डिस्टिलबर्ट, बीईआरटी का डिस्टिल्ड वर्जन: छोटा, तेज, सस्ता और हल्का](https://arxiv.org/abs/1910.01108) विक्टर सनह, लिसांड्रे डेब्यू और थॉमस वुल्फ द्वारा पोस्ट किया गया। यही तरीका GPT-2 को [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/distillation), RoBERta से [DistilRoBERta](https://github.com) पर कंप्रेस करने के लिए भी लागू किया जाता है। / हगिंगफेस/ट्रांसफॉर्मर्स/ट्री/मेन/उदाहरण/डिस्टिलेशन), बहुभाषी BERT से [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/distillation) और डिस्टिलबर्ट का जर्मन संस्करण। 1. **[DiT](https://huggingface.co/docs/transformers/model_doc/dit)** (माइक्रोसॉफ्ट रिसर्च से) साथ में पेपर [DiT: सेल्फ सुपरवाइज्ड प्री-ट्रेनिंग फॉर डॉक्यूमेंट इमेज ट्रांसफॉर्मर](https://arxiv.org/abs/2203.02378) जुनलॉन्ग ली, यिहेंग जू, टेंगचाओ लव, लेई कुई, चा झांग द्वारा फुरु वेई द्वारा पोस्ट किया गया। -1. **[Donut](https://huggingface.co/docs/transformers/model_doc/donut)** (NAVER से) साथ में कागज [OCR-मुक्त डॉक्यूमेंट अंडरस्टैंडिंग ट्रांसफॉर्मर](https://arxiv.org/abs /2111.15664) गीवूक किम, टीकग्यू होंग, मूनबिन यिम, जियोंग्योन नाम, जिनयॉन्ग पार्क, जिनयॉन्ग यिम, वोनसेओक ह्वांग, सांगडू यूं, डोंगयून हान, सेउंग्युन पार्क द्वारा। -1. **[DPR](https://huggingface.co/docs/transformers/model_doc/dpr)** (फेसबुक से) साथ में पेपर [ओपन-डोमेन क्वेश्चन आंसरिंग के लिए डेंस पैसेज रिट्रीवल](https://arxiv. org/abs/2004.04906) व्लादिमीर करपुखिन, बरलास ओज़ुज़, सेवन मिन, पैट्रिक लुईस, लेडेल वू, सर्गेई एडुनोव, डैनकी चेन, और वेन-ताऊ यिह द्वारा। -1. **[DPT](https://huggingface.co/docs/transformers/master/model_doc/dpt)** (इंटेल लैब्स से) साथ में कागज [विज़न ट्रांसफॉर्मर्स फॉर डेंस प्रेडिक्शन](https://arxiv.org /abs/2103.13413) रेने रैनफ्टल, एलेक्सी बोचकोवस्की, व्लादलेन कोल्टन द्वारा। +1. **[Donut](https://huggingface.co/docs/transformers/model_doc/donut)** (NAVER से) साथ में कागज [OCR-मुक्त डॉक्यूमेंट अंडरस्टैंडिंग ट्रांसफॉर्मर](https://arxiv.org/abs/2111.15664) गीवूक किम, टीकग्यू होंग, मूनबिन यिम, जियोंग्योन नाम, जिनयॉन्ग पार्क, जिनयॉन्ग यिम, वोनसेओक ह्वांग, सांगडू यूं, डोंगयून हान, सेउंग्युन पार्क द्वारा। +1. **[DPR](https://huggingface.co/docs/transformers/model_doc/dpr)** (फेसबुक से) साथ में पेपर [ओपन-डोमेन क्वेश्चन आंसरिंग के लिए डेंस पैसेज रिट्रीवल](https://arxiv.org/abs/2004.04906) व्लादिमीर करपुखिन, बरलास ओज़ुज़, सेवन मिन, पैट्रिक लुईस, लेडेल वू, सर्गेई एडुनोव, डैनकी चेन, और वेन-ताऊ यिह द्वारा। +1. **[DPT](https://huggingface.co/docs/transformers/master/model_doc/dpt)** (इंटेल लैब्स से) साथ में कागज [विज़न ट्रांसफॉर्मर्स फॉर डेंस प्रेडिक्शन](https://arxiv.org/abs/2103.13413) रेने रैनफ्टल, एलेक्सी बोचकोवस्की, व्लादलेन कोल्टन द्वारा। 1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (from Snap Research) released with the paper [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) by Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren. -1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (Google रिसर्च/स्टैनफोर्ड यूनिवर्सिटी से) साथ में दिया गया पेपर [इलेक्ट्रा: जेनरेटर के बजाय भेदभाव करने वाले के रूप में टेक्स्ट एन्कोडर्स का पूर्व-प्रशिक्षण] (https://arxiv.org/abs/2003.10555) केविन क्लार्क, मिन्ह-थांग लुओंग, क्वोक वी. ले, क्रिस्टोफर डी. मैनिंग द्वारा पोस्ट किया गया। -1. 
**[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (Google रिसर्च से) साथ में दिया गया पेपर [सीक्वेंस जेनरेशन टास्क के लिए प्री-ट्रेंड चेकपॉइंट का इस्तेमाल करना](https:/ /arxiv.org/abs/1907.12461) साशा रोठे, शशि नारायण, अलियाक्सि सेवेरिन द्वारा। +1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le. +1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (Google रिसर्च/स्टैनफोर्ड यूनिवर्सिटी से) साथ में दिया गया पेपर [इलेक्ट्रा: जेनरेटर के बजाय भेदभाव करने वाले के रूप में टेक्स्ट एन्कोडर्स का पूर्व-प्रशिक्षण](https://arxiv.org/abs/2003.10555) केविन क्लार्क, मिन्ह-थांग लुओंग, क्वोक वी. ले, क्रिस्टोफर डी. मैनिंग द्वारा पोस्ट किया गया। +1. **[EnCodec](https://huggingface.co/docs/transformers/model_doc/encodec)** (Meta AI से) Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi. द्वाराअनुसंधान पत्र [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438) के साथ जारी किया गया +1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (Google रिसर्च से) साथ में दिया गया पेपर [सीक्वेंस जेनरेशन टास्क के लिए प्री-ट्रेंड चेकपॉइंट का इस्तेमाल करना](https://arxiv.org/abs/1907.12461) साशा रोठे, शशि नारायण, अलियाक्सि सेवेरिन द्वारा। 1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)**(Baidu से) साथ देने वाला पेपर [ERNIE: एन्हांस्ड रिप्रेजेंटेशन थ्रू नॉलेज इंटीग्रेशन](https://arxiv.org/abs/1904.09223) यू सन, शुओहुआन वांग, युकुन ली, शिकुन फेंग, ज़ुई चेन, हान झांग, शिन तियान, डैनक्सियांग झू, हाओ तियान, हुआ वू द्वारा पोस्ट किया गया। -1. **[ErnieM](https://huggingface.co/docs/transformers/main/model_doc/ernie_m)** (Baidu से) Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. द्वाराअनुसंधान पत्र [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) के साथ जारी किया गया -1. **[ESM](https://huggingface.co/docs/transformers/model_doc/esm)** (मेटा AI से) ट्रांसफॉर्मर प्रोटीन भाषा मॉडल हैं। **ESM-1b** पेपर के साथ जारी किया गया था [ अलेक्जेंडर राइव्स, जोशुआ मेयर, टॉम सर्कु, सिद्धार्थ गोयल, ज़ेमिंग लिन द्वारा जैविक संरचना और कार्य असुरक्षित सीखने को 250 मिलियन प्रोटीन अनुक्रमों तक स्केल करने से उभरता है] (https://www.pnas.org/content/118/15/e2016239118) जेसन लियू, डेमी गुओ, मायल ओट, सी. लॉरेंस ज़िटनिक, जेरी मा और रॉब फर्गस। **ESM-1v** को पेपर के साथ जारी किया गया था [भाषा मॉडल प्रोटीन फ़ंक्शन पर उत्परिवर्तन के प्रभावों की शून्य-शॉट भविष्यवाणी को सक्षम करते हैं] (https://doi.org/10.1101/2021.07.09.450648) जोशुआ मेयर, रोशन राव, रॉबर्ट वेरकुइल, जेसन लियू, टॉम सर्कु और अलेक्जेंडर राइव्स द्वारा। **ESM-2** को पेपर के साथ जारी किया गया था [भाषा मॉडल विकास के पैमाने पर प्रोटीन अनुक्रम सटीक संरचना भविष्यवाणी को सक्षम करते हैं](https://doi.org/10.1101/2022.07.20.500902) ज़ेमिंग लिन, हलील अकिन, रोशन राव, ब्रायन ही, झोंगकाई झू, वेंटिंग लू, ए द्वारा लान डॉस सैंटोस कोस्टा, मरियम फ़ज़ल-ज़रंडी, टॉम सर्कू, साल कैंडिडो, अलेक्जेंडर राइव्स। +1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (Baidu से) Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. 
द्वाराअनुसंधान पत्र [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) के साथ जारी किया गया +1. **[ESM](https://huggingface.co/docs/transformers/model_doc/esm)** (मेटा AI से) ट्रांसफॉर्मर प्रोटीन भाषा मॉडल हैं। **ESM-1b** पेपर के साथ जारी किया गया था [अलेक्जेंडर राइव्स, जोशुआ मेयर, टॉम सर्कु, सिद्धार्थ गोयल, ज़ेमिंग लिन द्वारा जैविक संरचना और कार्य असुरक्षित सीखने को 250 मिलियन प्रोटीन अनुक्रमों तक स्केल करने से उभरता है](https://www.pnas.org/content/118/15/e2016239118) जेसन लियू, डेमी गुओ, मायल ओट, सी. लॉरेंस ज़िटनिक, जेरी मा और रॉब फर्गस। **ESM-1v** को पेपर के साथ जारी किया गया था [भाषा मॉडल प्रोटीन फ़ंक्शन पर उत्परिवर्तन के प्रभावों की शून्य-शॉट भविष्यवाणी को सक्षम करते हैं](https://doi.org/10.1101/2021.07.09.450648) जोशुआ मेयर, रोशन राव, रॉबर्ट वेरकुइल, जेसन लियू, टॉम सर्कु और अलेक्जेंडर राइव्स द्वारा। **ESM-2** को पेपर के साथ जारी किया गया था [भाषा मॉडल विकास के पैमाने पर प्रोटीन अनुक्रम सटीक संरचना भविष्यवाणी को सक्षम करते हैं](https://doi.org/10.1101/2022.07.20.500902) ज़ेमिंग लिन, हलील अकिन, रोशन राव, ब्रायन ही, झोंगकाई झू, वेंटिंग लू, ए द्वारा लान डॉस सैंटोस कोस्टा, मरियम फ़ज़ल-ज़रंडी, टॉम सर्कू, साल कैंडिडो, अलेक्जेंडर राइव्स। +1. **[Falcon](https://huggingface.co/docs/transformers/model_doc/falcon)** (from Technology Innovation Institute) by Almazrouei, Ebtesam and Alobeidli, Hamza and Alshamsi, Abdulaziz and Cappelli, Alessandro and Cojocaru, Ruxandra and Debbah, Merouane and Goffinet, Etienne and Heslow, Daniel and Launay, Julien and Malartic, Quentin and Noune, Badreddine and Pannier, Baptiste and Penedo, Guilherme. +1. **[FastSpeech2Conformer](model_doc/fastspeech2_conformer)** (ESPnet and Microsoft Research से) Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang. द्वाराअनुसंधान पत्र [Fastspeech 2: Fast And High-quality End-to-End Text To Speech](https://arxiv.org/pdf/2006.04558.pdf) के साथ जारी किया गया 1. **[FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei -1. **[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (CNRS से) साथ वाला पेपर [FlauBERT: Unsupervised Language Model Pre-training for फ़्रेंच](https://arxiv .org/abs/1912.05372) Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, बेंजामिन लेकोउटेक्स, अलेक्जेंड्रे अल्लाउज़ेन, बेनोइट क्रैबे, लॉरेंट बेसेसियर, डिडिएर श्वाब द्वारा। -1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc/flava)** (FLAVA: A फाउंडेशनल लैंग्वेज एंड विजन अलाइनमेंट मॉडल) (https://arxiv) साथ वाला पेपर .org/abs/2112.04482) अमनप्रीत सिंह, रोंगहांग हू, वेदानुज गोस्वामी, गुइल्यूम कुएरॉन, वोज्शिएक गालुबा, मार्कस रोहरबैक, और डौवे कीला द्वारा। -1. 
**[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (गूगल रिसर्च से) साथ वाला पेपर [FNet: मिक्सिंग टोकन विद फूरियर ट्रांसफॉर्म्स](https://arxiv.org /abs/2105.03824) जेम्स ली-थॉर्प, जोशुआ आइंस्ली, इल्या एकस्टीन, सैंटियागो ओंटानन द्वारा। -1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (सीएमयू/गूगल ब्रेन से) साथ में कागज [फ़नल-ट्रांसफॉर्मर: कुशल भाषा प्रसंस्करण के लिए अनुक्रमिक अतिरेक को छानना](https://arxiv.org/abs/2006.03236) जिहांग दाई, गुओकुन लाई, यिमिंग यांग, क्वोक वी. ले ​​द्वारा रिहाई। +1. **[FLAN-UL2](https://huggingface.co/docs/transformers/model_doc/flan-ul2)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-ul2-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei +1. **[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (CNRS से) साथ वाला पेपर [FlauBERT: Unsupervised Language Model Pre-training for फ़्रेंच](https://arxiv.org/abs/1912.05372) Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, बेंजामिन लेकोउटेक्स, अलेक्जेंड्रे अल्लाउज़ेन, बेनोइट क्रैबे, लॉरेंट बेसेसियर, डिडिएर श्वाब द्वारा। +1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc/flava)** [FLAVA: A फाउंडेशनल लैंग्वेज एंड विजन अलाइनमेंट मॉडल](https://arxiv.org/abs/2112.04482) साथ वाला पेपर अमनप्रीत सिंह, रोंगहांग हू, वेदानुज गोस्वामी, गुइल्यूम कुएरॉन, वोज्शिएक गालुबा, मार्कस रोहरबैक, और डौवे कीला द्वारा। +1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (गूगल रिसर्च से) साथ वाला पेपर [FNet: मिक्सिंग टोकन विद फूरियर ट्रांसफॉर्म्स](https://arxiv.org/abs/2105.03824) जेम्स ली-थॉर्प, जोशुआ आइंस्ली, इल्या एकस्टीन, सैंटियागो ओंटानन द्वारा। +1. **[FocalNet](https://huggingface.co/docs/transformers/model_doc/focalnet)** (Microsoft Research से) Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao. द्वाराअनुसंधान पत्र [Focal Modulation Networks](https://arxiv.org/abs/2203.11926) के साथ जारी किया गया +1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (सीएमयू/गूगल ब्रेन से) साथ में कागज [फ़नल-ट्रांसफॉर्मर: कुशल भाषा प्रसंस्करण के लिए अनुक्रमिक अतिरेक को छानना](https://arxiv.org/abs/2006.03236) जिहांग दाई, गुओकुन लाई, यिमिंग यांग, क्वोक वी. ले द्वारा रिहाई। +1. **[Fuyu](https://huggingface.co/docs/transformers/model_doc/fuyu)** (ADEPT से) रोहन बाविशी, एरिच एलसेन, कर्टिस हॉथोर्न, मैक्सवेल नी, ऑगस्टस ओडेना, अरुशी सोमानी, सागनाक तासिरलार [blog post](https://www.adept.ai/blog/fuyu-8b) 1. **[GIT](https://huggingface.co/docs/transformers/model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang. -1. 
**[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (KAIST से) साथ वाला पेपर [वर्टिकल कटडेप्थ के साथ मोनोकुलर डेप्थ एस्टीमेशन के लिए ग्लोबल-लोकल पाथ नेटवर्क्स](https:/ /arxiv.org/abs/2201.07436) डोयोन किम, वूंगह्युन गा, प्युंगवान आह, डोंगग्यू जू, सेहवान चुन, जुनमो किम द्वारा। -1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (OpenAI से) साथ में दिया गया पेपर [जेनरेटिव प्री-ट्रेनिंग द्वारा भाषा की समझ में सुधार](https://blog .openai.com/language-unsupervised/) एलेक रैडफोर्ड, कार्तिक नरसिम्हन, टिम सालिमन्स और इल्या सुत्स्केवर द्वारा। -1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (EleutherAI से) रिपॉजिटरी के साथ [EleutherAI/gpt-neo](https://github.com/ EleutherAI /gpt-neo) रिलीज। सिड ब्लैक, स्टेला बिडरमैन, लियो गाओ, फिल वांग और कॉनर लेही द्वारा पोस्ट किया गया। -1. **[GPT NeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox)** (EleutherAI से) पेपर के साथ जारी किया गया [GPT-NeoX-20B: एक ओपन-सोर्स ऑटोरेग्रेसिव लैंग्वेज मॉडल] (https://arxiv.org/abs/2204.06745) सिड ब्लैक, स्टेला बिडरमैन, एरिक हैलाहन, क्वेंटिन एंथोनी, लियो गाओ, लॉरेंस गोल्डिंग, होरेस हे, कॉनर लेही, काइल मैकडोनेल, जेसन फांग, माइकल पाइलर, यूएसवीएसएन साई प्रशांत द्वारा , शिवांशु पुरोहित, लारिया रेनॉल्ड्स, जोनाथन टो, बेन वांग, सैमुअल वेनबैक +1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (KAIST से) साथ वाला पेपर [वर्टिकल कटडेप्थ के साथ मोनोकुलर डेप्थ एस्टीमेशन के लिए ग्लोबल-लोकल पाथ नेटवर्क्स](https://arxiv.org/abs/2201.07436) डोयोन किम, वूंगह्युन गा, प्युंगवान आह, डोंगग्यू जू, सेहवान चुन, जुनमो किम द्वारा। +1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (OpenAI से) साथ में दिया गया पेपर [जेनरेटिव प्री-ट्रेनिंग द्वारा भाषा की समझ में सुधार](https://openai.com/research/language-unsupervised/) एलेक रैडफोर्ड, कार्तिक नरसिम्हन, टिम सालिमन्स और इल्या सुत्स्केवर द्वारा। +1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (EleutherAI से) रिपॉजिटरी के साथ [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) रिलीज। सिड ब्लैक, स्टेला बिडरमैन, लियो गाओ, फिल वांग और कॉनर लेही द्वारा पोस्ट किया गया। +1. **[GPT NeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox)** (EleutherAI से) पेपर के साथ जारी किया गया [GPT-NeoX-20B: एक ओपन-सोर्स ऑटोरेग्रेसिव लैंग्वेज मॉडल](https://arxiv.org/abs/2204.06745) सिड ब्लैक, स्टेला बिडरमैन, एरिक हैलाहन, क्वेंटिन एंथोनी, लियो गाओ, लॉरेंस गोल्डिंग, होरेस हे, कॉनर लेही, काइल मैकडोनेल, जेसन फांग, माइकल पाइलर, यूएसवीएसएन साई प्रशांत द्वारा , शिवांशु पुरोहित, लारिया रेनॉल्ड्स, जोनाथन टो, बेन वांग, सैमुअल वेनबैक 1. **[GPT NeoX Japanese](https://huggingface.co/docs/transformers/model_doc/gpt_neox_japanese)** (अबेजा के जरिए) शिन्या ओटानी, ताकायोशी मकाबे, अनुज अरोड़ा, क्यो हटोरी द्वारा। -1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (ओपनएआई से) साथ में पेपर [लैंग्वेज मॉडल्स अनसुपरवाइज्ड मल्टीटास्क लर्नर्स हैं](https://blog.openai.com/better-language-models/) एलेक रैडफोर्ड*, जेफरी वू*, रेवन चाइल्ड, डेविड लुआन, डारियो एमोडी* द्वारा * और इल्या सुत्सकेवर** ने पोस्ट किया। -1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (EleutherAI से) साथ वाला पेपर [kingoflolz/mesh-transformer-jax](https://github. com/kingoflolz/mesh-transformer-jax/) बेन वांग और अरन कोमात्सुजाकी द्वारा। +1. 
**[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (ओपनएआई से) साथ में पेपर [लैंग्वेज मॉडल्स अनसुपरवाइज्ड मल्टीटास्क लर्नर्स हैं](https://openai.com/research/better-language-models/) एलेक रैडफोर्ड, जेफरी वू, रेवन चाइल्ड, डेविड लुआन, डारियो एमोडी द्वारा और इल्या सुत्सकेवर ने पोस्ट किया। +1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (EleutherAI से) साथ वाला पेपर [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) बेन वांग और अरन कोमात्सुजाकी द्वारा। 1. **[GPT-Sw3](https://huggingface.co/docs/transformers/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren. +1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (BigCode से) Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra. द्वाराअनुसंधान पत्र [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) के साथ जारी किया गया +1. **[GPTSAN-japanese](https://huggingface.co/docs/transformers/model_doc/gptsan-japanese)** released in the repository [tanreinama/GPTSAN](https://github.com/tanreinama/GPTSAN/blob/main/report/model.md) by Toshiyuki Sakamoto(tanreinama). 1. **[Graphormer](https://huggingface.co/docs/transformers/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu. -1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (UCSD, NVIDIA से) साथ में कागज [GroupViT: टेक्स्ट सुपरविजन से सिमेंटिक सेगमेंटेशन इमर्जेस](https://arxiv .org/abs/2202.11094) जियारुई जू, शालिनी डी मेलो, सिफ़ी लियू, वोनमिन बायन, थॉमस ब्रेउएल, जान कौट्ज़, ज़ियाओलोंग वांग द्वारा। -1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (फेसबुक से) साथ में पेपर [ह्यूबर्ट: सेल्फ सुपरवाइज्ड स्पीच रिप्रेजेंटेशन लर्निंग बाय मास्क्ड प्रेडिक्शन ऑफ हिडन यूनिट्स](https ://arxiv.org/abs/2106.07447) वेई-निंग सू, बेंजामिन बोल्टे, याओ-हंग ह्यूबर्ट त्साई, कुशाल लखोटिया, रुस्लान सालाखुतदीनोव, अब्देलरहमान मोहम्मद द्वारा। -1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (बर्कले से) साथ में कागज [I-BERT: Integer-only BERT Quantization](https:// arxiv.org/abs/2101.01321) सेहून किम, अमीर घोलमी, ज़ेवेई याओ, माइकल डब्ल्यू महोनी, कर्ट केटज़र द्वारा। +1. 
**[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (UCSD, NVIDIA से) साथ में कागज [GroupViT: टेक्स्ट सुपरविजन से सिमेंटिक सेगमेंटेशन इमर्जेस](https://arxiv.org/abs/2202.11094) जियारुई जू, शालिनी डी मेलो, सिफ़ी लियू, वोनमिन बायन, थॉमस ब्रेउएल, जान कौट्ज़, ज़ियाओलोंग वांग द्वारा। +1. **[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** (Allegro.pl, AGH University of Science and Technology से) Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik. द्वाराअनुसंधान पत्र [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) के साथ जारी किया गया +1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (फेसबुक से) साथ में पेपर [ह्यूबर्ट: सेल्फ सुपरवाइज्ड स्पीच रिप्रेजेंटेशन लर्निंग बाय मास्क्ड प्रेडिक्शन ऑफ हिडन यूनिट्स](https://arxiv.org/abs/2106.07447) वेई-निंग सू, बेंजामिन बोल्टे, याओ-हंग ह्यूबर्ट त्साई, कुशाल लखोटिया, रुस्लान सालाखुतदीनोव, अब्देलरहमान मोहम्मद द्वारा। +1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (बर्कले से) साथ में कागज [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) सेहून किम, अमीर घोलमी, ज़ेवेई याओ, माइकल डब्ल्यू महोनी, कर्ट केटज़र द्वारा। +1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. +1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. +1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (Salesforce से) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. द्वाराअनुसंधान पत्र [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) के साथ जारी किया गया 1. **[Jukebox](https://huggingface.co/docs/transformers/model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever. +1. **[KOSMOS-2](https://huggingface.co/docs/transformers/model_doc/kosmos-2)** (from Microsoft Research Asia) released with the paper [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei. 1. 
**[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou. 1. **[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou. 1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (माइक्रोसॉफ्ट रिसर्च एशिया से) साथ देने वाला पेपर [लेआउटएलएमवी3: यूनिफाइड टेक्स्ट और इमेज मास्किंग के साथ दस्तावेज़ एआई के लिए पूर्व-प्रशिक्षण](https://arxiv.org/abs/2204.08387) युपन हुआंग, टेंगचाओ लव, लेई कुई, युटोंग लू, फुरु वेई द्वारा पोस्ट किया गया। 1. **[LayoutXLM](https://huggingface.co/docs/transformers/model_doc/layoutxlm)** (from Microsoft Research Asia) released with the paper [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/abs/2104.08836) by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei. 1. **[LED](https://huggingface.co/docs/transformers/model_doc/led)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan. -1. **[LeViT](https://huggingface.co/docs/transformers/model_doc/levit)** (मेटा AI से) साथ वाला पेपर [LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference](https:/ /arxiv.org/abs/2104.01136) बेन ग्राहम, अलाएल्डिन एल-नौबी, ह्यूगो टौवरन, पियरे स्टॉक, आर्मंड जौलिन, हर्वे जेगौ, मैथिज डूज़ द्वारा। +1. **[LeViT](https://huggingface.co/docs/transformers/model_doc/levit)** (मेटा AI से) साथ वाला पेपर [LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference](https://arxiv.org/abs/2104.01136) बेन ग्राहम, अलाएल्डिन एल-नौबी, ह्यूगो टौवरन, पियरे स्टॉक, आर्मंड जौलिन, हर्वे जेगौ, मैथिज डूज़ द्वारा। 1. **[LiLT](https://huggingface.co/docs/transformers/model_doc/lilt)** (दक्षिण चीन प्रौद्योगिकी विश्वविद्यालय से) साथ में कागज [LiLT: एक सरल लेकिन प्रभावी भाषा-स्वतंत्र लेआउट ट्रांसफार्मर संरचित दस्तावेज़ समझ के लिए](https://arxiv.org/abs/2202.13669) जियापेंग वांग, लियानवेन जिन, काई डिंग द्वारा पोस्ट किया गया। +1. **[LLaMA](https://huggingface.co/docs/transformers/model_doc/llama)** (The FAIR team of Meta AI से) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. द्वाराअनुसंधान पत्र [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971) के साथ जारी किया गया +1. 
**[Llama2](https://huggingface.co/docs/transformers/model_doc/llama2)** (The FAIR team of Meta AI से) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom. द्वाराअनुसंधान पत्र [Llama2: Open Foundation and Fine-Tuned Chat Models](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/XXX) के साथ जारी किया गया +1. **[LLaVa](https://huggingface.co/docs/transformers/model_doc/llava)** (Microsoft Research & University of Wisconsin-Madison से) Haotian Liu, Chunyuan Li, Yuheng Li and Yong Jae Lee. द्वाराअनुसंधान पत्र [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485) के साथ जारी किया गया 1. **[Longformer](https://huggingface.co/docs/transformers/model_doc/longformer)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan. 1. **[LongT5](https://huggingface.co/docs/transformers/model_doc/longt5)** (मैंडी गुओ, जोशुआ आइंस्ली, डेविड यूथस, सैंटियागो ओंटानन, जियानमो नि, यूं-हुआन सुंग, यिनफेई यांग द्वारा पोस्ट किया गया। -1. **[LUKE](https://huggingface.co/docs/transformers/model_doc/luke)** (स्टूडियो औसिया से) साथ में पेपर [LUKE: डीप कॉन्टेक्स्टुअलाइज्ड एंटिटी रिप्रेजेंटेशन विद एंटिटी-अवेयर सेल्फ-अटेंशन](https ://arxiv.org/abs/2010.01057) Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto द्वारा। +1. **[LUKE](https://huggingface.co/docs/transformers/model_doc/luke)** (स्टूडियो औसिया से) साथ में पेपर [LUKE: डीप कॉन्टेक्स्टुअलाइज्ड एंटिटी रिप्रेजेंटेशन विद एंटिटी-अवेयर सेल्फ-अटेंशन](https://arxiv.org/abs/2010.01057) Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto द्वारा। 1. **[LXMERT](https://huggingface.co/docs/transformers/model_doc/lxmert)** (UNC चैपल हिल से) साथ में पेपर [LXMERT: ओपन-डोमेन क्वेश्चन के लिए ट्रांसफॉर्मर से क्रॉस-मोडलिटी एनकोडर रिप्रेजेंटेशन सीखना Answering](https://arxiv.org/abs/1908.07490) हाओ टैन और मोहित बंसल द्वारा। 1. **[M-CTC-T](https://huggingface.co/docs/transformers/model_doc/mctct)** (from Facebook) released with the paper [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161) by Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert. -1. 
**[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (फेसबुक से) साथ देने वाला पेपर [बियॉन्ड इंग्लिश-सेंट्रिक मल्टीलिंगुअल मशीन ट्रांसलेशन](https://arxiv.org/ एब्स/2010.11125) एंजेला फैन, श्रुति भोसले, होल्गर श्वेन्क, झी मा, अहमद अल-किश्की, सिद्धार्थ गोयल, मनदीप बैनेस, ओनूर सेलेबी, गुइल्लाम वेन्जेक, विश्रव चौधरी, नमन गोयल, टॉम बर्च, विटाली लिपचिंस्की, सर्गेई एडुनोव, एडौर्ड द्वारा ग्रेव, माइकल औली, आर्मंड जौलिन द्वारा पोस्ट किया गया। +1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (फेसबुक से) साथ देने वाला पेपर [बियॉन्ड इंग्लिश-सेंट्रिक मल्टीलिंगुअल मशीन ट्रांसलेशन](https://arxiv.org/abs/2010.11125) एंजेला फैन, श्रुति भोसले, होल्गर श्वेन्क, झी मा, अहमद अल-किश्की, सिद्धार्थ गोयल, मनदीप बैनेस, ओनूर सेलेबी, गुइल्लाम वेन्जेक, विश्रव चौधरी, नमन गोयल, टॉम बर्च, विटाली लिपचिंस्की, सर्गेई एडुनोव, एडौर्ड ग्रेव, माइकल औली, आर्मंड जौलिन द्वारा पोस्ट किया गया। +1. **[MADLAD-400](https://huggingface.co/docs/transformers/model_doc/madlad-400)** (from Google) released with the paper [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](https://arxiv.org/abs/2309.04662) by Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat. 1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Jörg द्वारा [OPUS](http://opus.nlpl.eu/) डेटा से प्रशिक्षित मशीनी अनुवाद मॉडल पोस्ट किया गया टाइडेमैन द्वारा। [मैरियन फ्रेमवर्क](https://marian-nmt.github.io/) माइक्रोसॉफ्ट ट्रांसलेटर टीम द्वारा विकसित। -1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (माइक्रोसॉफ्ट रिसर्च एशिया से) साथ में पेपर [मार्कअपएलएम: विजुअली-रिच डॉक्यूमेंट अंडरस्टैंडिंग के लिए टेक्स्ट और मार्कअप लैंग्वेज का प्री-ट्रेनिंग] (https://arxiv.org/abs/2110.08518) जुनलॉन्ग ली, यिहेंग जू, लेई कुई, फुरु द्वारा वी द्वारा पोस्ट किया गया। +1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (माइक्रोसॉफ्ट रिसर्च एशिया से) साथ में पेपर [मार्कअपएलएम: विजुअली-रिच डॉक्यूमेंट अंडरस्टैंडिंग के लिए टेक्स्ट और मार्कअप लैंग्वेज का प्री-ट्रेनिंग](https://arxiv.org/abs/2110.08518) जुनलॉन्ग ली, यिहेंग जू, लेई कुई, फुरु वेई द्वारा पोस्ट किया गया। 1. **[Mask2Former](https://huggingface.co/docs/transformers/model_doc/mask2former)** (FAIR and UIUC से) Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar. द्वाराअनुसंधान पत्र [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) के साथ जारी किया गया -1. **[MaskFormer](https://huggingface.co/docs/transformers/model_doc/maskformer)** (मेटा और UIUC से) पेपर के साथ जारी किया गया [प्रति-पिक्सेल वर्गीकरण वह सब नहीं है जिसकी आपको सिमेंटिक सेगमेंटेशन की आवश्यकता है] (https://arxiv.org/abs/2107.06278) बोवेन चेंग, अलेक्जेंडर जी. श्विंग, अलेक्जेंडर किरिलोव द्वारा >>>>>> रिबेस ठीक करें -1. **[mBART](https://huggingface.co/docs/transformers/model_doc/mbart)** (फेसबुक से) साथ में पेपर [न्यूरल मशीन ट्रांसलेशन के लिए मल्टीलिंगुअल डीनोइजिंग प्री-ट्रेनिंग](https://arxiv. org/abs/2001.08210) यिनहान लियू, जियाताओ गु, नमन गोयल, जियान ली, सर्गेई एडुनोव, मार्जन ग़ज़विनिनेजाद, माइक लुईस, ल्यूक ज़ेटलमॉयर द्वारा। -1. 
**[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (फेसबुक से) साथ में पेपर [एक्स्टेंसिबल बहुभाषी प्रीट्रेनिंग और फाइनट्यूनिंग के साथ बहुभाषी अनुवाद](https://arxiv युकिंग टैंग, चाउ ट्रान, जियान ली, पेंग-जेन चेन, नमन गोयल, विश्रव चौधरी, जियाताओ गु, एंजेला फैन द्वारा .org/abs/2008.00401)। +1. **[MaskFormer](https://huggingface.co/docs/transformers/model_doc/maskformer)** (मेटा और UIUC से) पेपर के साथ जारी किया गया [प्रति-पिक्सेल वर्गीकरण वह सब नहीं है जिसकी आपको सिमेंटिक सेगमेंटेशन की आवश्यकता है](https://arxiv.org/abs/2107.06278) बोवेन चेंग, अलेक्जेंडर जी. श्विंग, अलेक्जेंडर किरिलोव द्वारा। +1. **[MatCha](https://huggingface.co/docs/transformers/model_doc/matcha)** (Google AI से) Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, Julian Martin Eisenschlos. द्वाराअनुसंधान पत्र [MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering](https://arxiv.org/abs/2212.09662) के साथ जारी किया गया +1. **[mBART](https://huggingface.co/docs/transformers/model_doc/mbart)** (फेसबुक से) साथ में पेपर [न्यूरल मशीन ट्रांसलेशन के लिए मल्टीलिंगुअल डीनोइजिंग प्री-ट्रेनिंग](https://arxiv.org/abs/2001.08210) यिनहान लियू, जियाताओ गु, नमन गोयल, जियान ली, सर्गेई एडुनोव, मार्जन ग़ज़विनिनेजाद, माइक लुईस, ल्यूक ज़ेटलमॉयर द्वारा। +1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (फेसबुक से) साथ में पेपर [एक्स्टेंसिबल बहुभाषी प्रीट्रेनिंग और फाइनट्यूनिंग के साथ बहुभाषी अनुवाद](https://arxiv.org/abs/2008.00401) युकिंग टैंग, चाउ ट्रान, जियान ली, पेंग-जेन चेन, नमन गोयल, विश्रव चौधरी, जियाताओ गु, एंजेला फैन द्वारा। +1. **[MEGA](https://huggingface.co/docs/transformers/model_doc/mega)** (Facebook से) Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. द्वाराअनुसंधान पत्र [Mega: Moving Average Equipped Gated Attention](https://arxiv.org/abs/2209.10655) के साथ जारी किया गया 1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (NVIDIA से) कागज के साथ [Megatron-LM: मॉडल का उपयोग करके बहु-अरब पैरामीटर भाषा मॉडल का प्रशिक्षण Parallelism](https://arxiv.org/abs/1909.08053) मोहम्मद शोएबी, मोस्टोफा पटवारी, राउल पुरी, पैट्रिक लेग्रेस्ले, जेरेड कैस्पर और ब्रायन कैटानज़ारो द्वारा। -1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (NVIDIA से) साथ वाला पेपर [Megatron-LM: ट्रेनिंग मल्टी-बिलियन पैरामीटर लैंग्वेज मॉडल्स यूजिंग मॉडल पैरेललिज़्म] (https://arxiv.org/abs/1909.08053) मोहम्मद शोएबी, मोस्टोफा पटवारी, राउल पुरी, पैट्रिक लेग्रेस्ले, जेरेड कैस्पर और ब्रायन कैटानज़ारो द्वारा पोस्ट किया गया। +1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (NVIDIA से) साथ वाला पेपर [Megatron-LM: ट्रेनिंग मल्टी-बिलियन पैरामीटर लैंग्वेज मॉडल्स यूजिंग मॉडल पैरेललिज़्म](https://arxiv.org/abs/1909.08053) मोहम्मद शोएबी, मोस्टोफा पटवारी, राउल पुरी, पैट्रिक लेग्रेस्ले, जेरेड कैस्पर और ब्रायन कैटानज़ारो द्वारा पोस्ट किया गया। +1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (Alibaba Research से) Peng Wang, Cheng Da, and Cong Yao. द्वाराअनुसंधान पत्र [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) के साथ जारी किया गया +1. 
**[Mistral](https://huggingface.co/docs/transformers/model_doc/mistral)** (from Mistral AI) by The Mistral AI team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed.. +1. **[Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral)** (from Mistral AI) by The [Mistral AI](https://mistral.ai) team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. 1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (फ्रॉम Studio Ousia) साथ में पेपर [mLUKE: द पावर ऑफ एंटिटी रिप्रेजेंटेशन इन मल्टीलिंगुअल प्रीट्रेन्ड लैंग्वेज मॉडल्स](https://arxiv.org/abs/2110.08151) रयोकन री, इकुया यामाडा, और योशिमासा त्सुरोका द्वारा। -1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (सीएमयू/गूगल ब्रेन से) साथ में कागज [मोबाइलबर्ट: संसाधन-सीमित उपकरणों के लिए एक कॉम्पैक्ट टास्क-अज्ञेय बीईआरटी] (https://arxiv.org/abs/2004.02984) Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, और Denny Zhou द्वारा पोस्ट किया गया। +1. **[MMS](https://huggingface.co/docs/transformers/model_doc/mms)** (Facebook से) Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli. द्वाराअनुसंधान पत्र [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516) के साथ जारी किया गया +1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (सीएमयू/गूगल ब्रेन से) साथ में कागज [मोबाइलबर्ट: संसाधन-सीमित उपकरणों के लिए एक कॉम्पैक्ट टास्क-अज्ञेय बीईआरटी](https://arxiv.org/abs/2004.02984) Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, और Denny Zhou द्वारा पोस्ट किया गया। 1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (from Google Inc.) released with the paper [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam. 1. **[MobileNetV2](https://huggingface.co/docs/transformers/model_doc/mobilenet_v2)** (from Google Inc.) released with the paper [MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/abs/1801.04381) by Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen. -1. **[MobileViT](https://huggingface.co/docs/transformers/model_doc/mobilevit)** (Apple से) साथ में कागज [MobileViT: लाइट-वेट, जनरल-पर्पस, और मोबाइल-फ्रेंडली विजन ट्रांसफॉर्मर] (https://arxiv.org/abs/2110.02178) सचिन मेहता और मोहम्मद रस्तगरी द्वारा पोस्ट किया गया। +1. **[MobileViT](https://huggingface.co/docs/transformers/model_doc/mobilevit)** (Apple से) साथ में कागज [MobileViT: लाइट-वेट, जनरल-पर्पस, और मोबाइल-फ्रेंडली विजन ट्रांसफॉर्मर](https://arxiv.org/abs/2110.02178) सचिन मेहता और मोहम्मद रस्तगरी द्वारा पोस्ट किया गया। +1. 
**[MobileViTV2](https://huggingface.co/docs/transformers/model_doc/mobilevitv2)** (Apple से) Sachin Mehta and Mohammad Rastegari. द्वाराअनुसंधान पत्र [Separable Self-attention for Mobile Vision Transformers](https://arxiv.org/abs/2206.02680) के साथ जारी किया गया 1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu. -1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (Google AI से) साथ वाला पेपर [mT5: एक व्यापक बहुभाषी पूर्व-प्रशिक्षित टेक्स्ट-टू-टेक्स्ट ट्रांसफॉर्मर]( https://arxiv.org/abs/2010.11934) लिंटिंग ज़ू, नोआ कॉन्सटेंट, एडम रॉबर्ट्स, मिहिर काले, रामी अल-रफू, आदित्य सिद्धांत, आदित्य बरुआ, कॉलिन रैफेल द्वारा पोस्ट किया गया। +1. **[MPT](https://huggingface.co/docs/transformers/model_doc/mpt)** (MosaiML से) the MosaicML NLP Team. द्वाराअनुसंधान पत्र [llm-foundry](https://github.com/mosaicml/llm-foundry/) के साथ जारी किया गया +1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (the University of Wisconsin - Madison से) Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh. द्वाराअनुसंधान पत्र [Multi Resolution Analysis (MRA)](https://arxiv.org/abs/2207.10284) के साथ जारी किया गया +1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (Google AI से) साथ वाला पेपर [mT5: एक व्यापक बहुभाषी पूर्व-प्रशिक्षित टेक्स्ट-टू-टेक्स्ट ट्रांसफॉर्मर](https://arxiv.org/abs/2010.11934) लिंटिंग ज़ू, नोआ कॉन्सटेंट, एडम रॉबर्ट्स, मिहिर काले, रामी अल-रफू, आदित्य सिद्धांत, आदित्य बरुआ, कॉलिन रैफेल द्वारा पोस्ट किया गया। +1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez. 1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (from RUC AI Box) released with the paper [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen. 1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (from SHI Labs) released with the paper [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi. -1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (हुआवेई नूह के आर्क लैब से) साथ में कागज़ [NEZHA: चीनी भाषा समझ के लिए तंत्रिका प्रासंगिक प्रतिनिधित्व](https :/ /arxiv.org/abs/1909.00204) जुन्किउ वेई, ज़ियाओज़े रेन, ज़िआओगुआंग ली, वेनयोंग हुआंग, यी लियाओ, याशेंग वांग, जियाशू लिन, शिन जियांग, जिओ चेन और कुन लियू द्वारा। -1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (फ्रॉम मेटा) साथ में पेपर [नो लैंग्वेज लेफ्ट बिहाइंड: स्केलिंग ह्यूमन-सेंटेड मशीन ट्रांसलेशन] (https://arxiv.org/abs/2207.04672) एनएलएलबी टीम द्वारा प्रकाशित। -1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (विस्कॉन्सिन विश्वविद्यालय - मैडिसन से) साथ में कागज [Nyströmformer: A Nyström- आधारित एल्गोरिथम आत्म-ध्यान का अनुमान लगाने के लिए ](https://arxiv.org/abs/2102.03902) युनयांग ज़िओंग, झानपेंग ज़ेंग, रुद्रसिस चक्रवर्ती, मिंगक्सिंग टैन, ग्लेन फंग, यिन ली, विकास सिंह द्वारा पोस्ट किया गया। +1. 
**[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (हुआवेई नूह के आर्क लैब से) साथ में कागज़ [NEZHA: चीनी भाषा समझ के लिए तंत्रिका प्रासंगिक प्रतिनिधित्व](https://arxiv.org/abs/1909.00204) जुन्किउ वेई, ज़ियाओज़े रेन, ज़िआओगुआंग ली, वेनयोंग हुआंग, यी लियाओ, याशेंग वांग, जियाशू लिन, शिन जियांग, जिओ चेन और कुन लियू द्वारा। +1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (फ्रॉम मेटा) साथ में पेपर [नो लैंग्वेज लेफ्ट बिहाइंड: स्केलिंग ह्यूमन-सेंटेड मशीन ट्रांसलेशन](https://arxiv.org/abs/2207.04672) एनएलएलबी टीम द्वारा प्रकाशित। +1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (Meta से) the NLLB team. द्वाराअनुसंधान पत्र [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) के साथ जारी किया गया +1. **[Nougat](https://huggingface.co/docs/transformers/model_doc/nougat)** (Meta AI से) Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic. द्वाराअनुसंधान पत्र [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) के साथ जारी किया गया +1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (विस्कॉन्सिन विश्वविद्यालय - मैडिसन से) साथ में कागज [Nyströmformer: A Nyström- आधारित एल्गोरिथम आत्म-ध्यान का अनुमान लगाने के लिए](https://arxiv.org/abs/2102.03902) युनयांग ज़िओंग, झानपेंग ज़ेंग, रुद्रसिस चक्रवर्ती, मिंगक्सिंग टैन, ग्लेन फंग, यिन ली, विकास सिंह द्वारा पोस्ट किया गया। 1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (SHI Labs से) पेपर [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) जितेश जैन, जिआचेन ली, मांगटिक चिउ, अली हसनी, निकिता ओरलोव, हम्फ्री शि के द्वारा जारी किया गया है। +1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released on GitHub (now removed). 1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al. -1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (Google AI से) साथ में कागज [विज़न ट्रांसफॉर्मर्स के साथ सिंपल ओपन-वोकैबुलरी ऑब्जेक्ट डिटेक्शन](https:/ /arxiv.org/abs/2205.06230) मैथियास मिंडरर, एलेक्सी ग्रिट्सेंको, ऑस्टिन स्टोन, मैक्सिम न्यूमैन, डिर्क वीसेनबोर्न, एलेक्सी डोसोवित्स्की, अरविंद महेंद्रन, अनुराग अर्नब, मुस्तफा देहघानी, ज़ुओरन शेन, जिओ वांग, ज़ियाओहुआ झाई, थॉमस किफ़, और नील हॉल्सबी द्वारा पोस्ट किया गया। +1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (Google AI से) साथ में कागज [विज़न ट्रांसफॉर्मर्स के साथ सिंपल ओपन-वोकैबुलरी ऑब्जेक्ट डिटेक्शन](https://arxiv.org/abs/2205.06230) मैथियास मिंडरर, एलेक्सी ग्रिट्सेंको, ऑस्टिन स्टोन, मैक्सिम न्यूमैन, डिर्क वीसेनबोर्न, एलेक्सी डोसोवित्स्की, अरविंद महेंद्रन, अनुराग अर्नब, मुस्तफा देहघानी, ज़ुओरन शेन, जिओ वांग, ज़ियाओहुआ झाई, थॉमस किफ़, और नील हॉल्सबी द्वारा पोस्ट किया गया। +1. **[OWLv2](https://huggingface.co/docs/transformers/model_doc/owlv2)** (Google AI से) Matthias Minderer, Alexey Gritsenko, Neil Houlsby. द्वाराअनुसंधान पत्र [Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683) के साथ जारी किया गया +1. 
**[PatchTSMixer](https://huggingface.co/docs/transformers/model_doc/patchtsmixer)** ( IBM Research से) Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, Jayant Kalagnanam. द्वाराअनुसंधान पत्र [TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting](https://arxiv.org/pdf/2306.09364.pdf) के साथ जारी किया गया +1. **[PatchTST](https://huggingface.co/docs/transformers/model_doc/patchtst)** (IBM से) Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, Jayant Kalagnanam. द्वाराअनुसंधान पत्र [A Time Series is Worth 64 Words: Long-term Forecasting with Transformers](https://arxiv.org/pdf/2211.14730.pdf) के साथ जारी किया गया 1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu. -1. **[PEGASUS-X](https://huggingface.co/docs/transformers/model_doc/pegasus_x)** (Google की ओर से) साथ में दिया गया पेपर [लंबे इनपुट सारांश के लिए ट्रांसफ़ॉर्मरों को बेहतर तरीके से एक्सटेंड करना](https://arxiv .org/abs/2208.04347) जेसन फांग, याओ झाओ, पीटर जे लियू द्वारा। -1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (दीपमाइंड से) साथ में पेपर [पर्सीवर आईओ: संरचित इनपुट और आउटपुट के लिए एक सामान्य वास्तुकला] (https://arxiv.org/abs/2107.14795) एंड्रयू जेगल, सेबेस्टियन बोरग्यूड, जीन-बैप्टिस्ट अलायराक, कार्ल डोर्श, कैटलिन इओनेस्कु, डेविड द्वारा डिंग, स्कंद कोप्पुला, डैनियल ज़ोरान, एंड्रयू ब्रॉक, इवान शेलहैमर, ओलिवियर हेनाफ, मैथ्यू एम। बोट्विनिक, एंड्रयू ज़िसरमैन, ओरिओल विनियल्स, जोआओ कैरेरा द्वारा पोस्ट किया गया। -1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (VinAI Research से) कागज के साथ [PhoBERT: वियतनामी के लिए पूर्व-प्रशिक्षित भाषा मॉडल](https://www .aclweb.org/anthology/2020.findings-emnlp.92/) डैट क्वोक गुयेन और अन्ह तुआन गुयेन द्वारा पोस्ट किया गया। -1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (UCLA NLP से) साथ वाला पेपर [प्रोग्राम अंडरस्टैंडिंग एंड जेनरेशन के लिए यूनिफाइड प्री-ट्रेनिंग](https://arxiv .org/abs/2103.06333) वसी उद्दीन अहमद, सैकत चक्रवर्ती, बैशाखी रे, काई-वेई चांग द्वारा। +1. **[PEGASUS-X](https://huggingface.co/docs/transformers/model_doc/pegasus_x)** (Google की ओर से) साथ में दिया गया पेपर [लंबे इनपुट सारांश के लिए ट्रांसफ़ॉर्मरों को बेहतर तरीके से एक्सटेंड करना](https://arxiv.org/abs/2208.04347) जेसन फांग, याओ झाओ, पीटर जे लियू द्वारा। +1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (दीपमाइंड से) साथ में पेपर [पर्सीवर आईओ: संरचित इनपुट और आउटपुट के लिए एक सामान्य वास्तुकला](https://arxiv.org/abs/2107.14795) एंड्रयू जेगल, सेबेस्टियन बोरग्यूड, जीन-बैप्टिस्ट अलायराक, कार्ल डोर्श, कैटलिन इओनेस्कु, डेविड द्वारा डिंग, स्कंद कोप्पुला, डैनियल ज़ोरान, एंड्रयू ब्रॉक, इवान शेलहैमर, ओलिवियर हेनाफ, मैथ्यू एम। बोट्विनिक, एंड्रयू ज़िसरमैन, ओरिओल विनियल्स, जोआओ कैरेरा द्वारा पोस्ट किया गया। +1. **[Persimmon](https://huggingface.co/docs/transformers/model_doc/persimmon)** (ADEPT से) Erich Elsen, Augustus Odena, Maxwell Nye, Sağnak Taşırlar, Tri Dao, Curtis Hawthorne, Deepak Moparthi, Arushi Somani. द्वाराअनुसंधान पत्र [blog post](https://www.adept.ai/blog/persimmon-8b) के साथ जारी किया गया +1. 
**[Phi](https://huggingface.co/docs/transformers/model_doc/phi)** (from Microsoft) released with the papers - [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li, [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee. +1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (VinAI Research से) कागज के साथ [PhoBERT: वियतनामी के लिए पूर्व-प्रशिक्षित भाषा मॉडल](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) डैट क्वोक गुयेन और अन्ह तुआन गुयेन द्वारा पोस्ट किया गया। +1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (Google से) Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova. द्वाराअनुसंधान पत्र [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) के साथ जारी किया गया +1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (UCLA NLP से) साथ वाला पेपर [प्रोग्राम अंडरस्टैंडिंग एंड जेनरेशन के लिए यूनिफाइड प्री-ट्रेनिंग](https://arxiv.org/abs/2103.06333) वसी उद्दीन अहमद, सैकत चक्रवर्ती, बैशाखी रे, काई-वेई चांग द्वारा। 1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng. -1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (माइक्रोसॉफ्ट रिसर्च से) साथ में पेपर [ProphetNet: प्रेडिक्टिंग फ्यूचर एन-ग्राम फॉर सीक्वेंस-टू-सीक्वेंस प्री-ट्रेनिंग ](https://arxiv.org/abs/2001.04063) यू यान, वीज़ेन क्यूई, येयुन गोंग, दयाहेंग लियू, नान डुआन, जिउशेंग चेन, रुओफ़ेई झांग और मिंग झोउ द्वारा पोस्ट किया गया। -1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (NVIDIA से) साथ वाला पेपर [डीप लर्निंग इंफ़ेक्शन के लिए इंटीजर क्वांटिज़ेशन: प्रिंसिपल्स एंड एम्पिरिकल इवैल्यूएशन](https:// arxiv.org/abs/2004.09602) हाओ वू, पैट्रिक जुड, जिआओजी झांग, मिखाइल इसेव और पॉलियस माइकेविसियस द्वारा। -1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (फेसबुक से) साथ में कागज [रिट्रीवल-ऑगमेंटेड जेनरेशन फॉर नॉलेज-इंटेंसिव एनएलपी टास्क](https://arxiv .org/abs/2005.11401) पैट्रिक लुईस, एथन पेरेज़, अलेक्जेंड्रा पिक्टस, फैबियो पेट्रोनी, व्लादिमीर कारपुखिन, नमन गोयल, हेनरिक कुटलर, माइक लुईस, वेन-ताउ यिह, टिम रॉकटाशेल, सेबस्टियन रिडेल, डौवे कीला द्वारा। +1. **[Pop2Piano](https://huggingface.co/docs/transformers/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee. +1. 
**[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (माइक्रोसॉफ्ट रिसर्च से) साथ में पेपर [ProphetNet: प्रेडिक्टिंग फ्यूचर एन-ग्राम फॉर सीक्वेंस-टू-सीक्वेंस प्री-ट्रेनिंग](https://arxiv.org/abs/2001.04063) यू यान, वीज़ेन क्यूई, येयुन गोंग, दयाहेंग लियू, नान डुआन, जिउशेंग चेन, रुओफ़ेई झांग और मिंग झोउ द्वारा पोस्ट किया गया। +1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (Nanjing University, The University of Hong Kong etc. से) Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao. द्वाराअनुसंधान पत्र [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) के साथ जारी किया गया +1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (NVIDIA से) साथ वाला पेपर [डीप लर्निंग इंफ़रेंस के लिए इंटीजर क्वांटिज़ेशन: प्रिंसिपल्स एंड एम्पिरिकल इवैल्यूएशन](https://arxiv.org/abs/2004.09602) हाओ वू, पैट्रिक जुड, जिआओजी झांग, मिखाइल इसेव और पॉलियस माइकेविसियस द्वारा। +1. **[Qwen2](https://huggingface.co/docs/transformers/model_doc/qwen2)** (the Qwen team, Alibaba Group से) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou and Tianhang Zhu. द्वाराअनुसंधान पत्र [Qwen Technical Report](https://arxiv.org/abs/2309.16609) के साथ जारी किया गया +1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (फेसबुक से) साथ में कागज [रिट्रीवल-ऑगमेंटेड जेनरेशन फॉर नॉलेज-इंटेंसिव एनएलपी टास्क](https://arxiv.org/abs/2005.11401) पैट्रिक लुईस, एथन पेरेज़, अलेक्जेंड्रा पिक्टस, फैबियो पेट्रोनी, व्लादिमीर कारपुखिन, नमन गोयल, हेनरिक कुटलर, माइक लुईस, वेन-ताउ यिह, टिम रॉकटाशेल, सेबस्टियन रिडेल, डौवे कीला द्वारा। 1. **[REALM](https://huggingface.co/docs/transformers/model_doc/realm.html)** (Google अनुसंधान से) केल्विन गु, केंटन ली, ज़ोरा तुंग, पानुपोंग पसुपत और मिंग-वेई चांग द्वारा साथ में दिया गया पेपर [REALM: रिट्रीवल-ऑगमेंटेड लैंग्वेज मॉडल प्री-ट्रेनिंग](https://arxiv.org/abs/2002.08909)। 1. **[Reformer](https://huggingface.co/docs/transformers/model_doc/reformer)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya. -1. **[RegNet](https://huggingface.co/docs/transformers/model_doc/regnet)** (META रिसर्च से) [डिज़ाइनिंग नेटवर्क डिज़ाइन स्पेस] (https://arxiv.org/) पेपर के साथ जारी किया गया एब्स/2003.13678) इलिजा राडोसावोविक, राज प्रतीक कोसाराजू, रॉस गिर्शिक, कैमिंग ही, पिओटर डॉलर द्वारा। -1. **[RemBERT](https://huggingface.co/docs/transformers/model_doc/rembert)** (गूगल रिसर्च से) साथ वाला पेपर [पूर्व-प्रशिक्षित भाषा मॉडल में एम्बेडिंग कपलिंग पर पुनर्विचार](https://arxiv .org/pdf/2010.12821.pdf) ह्युंग वोन चुंग, थिबॉल्ट फ़ेवरी, हेनरी त्साई, एम. जॉनसन, सेबेस्टियन रुडर द्वारा। -1. **[ResNet](https://huggingface.co/docs/transformers/model_doc/resnet)** (माइक्रोसॉफ्ट रिसर्च से) [डीप रेसिडुअल लर्निंग फॉर इमेज रिकग्निशन] (https://arxiv. org/abs/1512.03385) कैमिंग हे, जियांग्यु झांग, शाओकिंग रेन, जियान सन द्वारा। -1. 
**[RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)** (फेसबुक से), साथ में कागज [मजबूत रूप से अनुकूलित BERT प्रीट्रेनिंग दृष्टिकोण](https://arxiv.org/abs /1907.11692) यिनहान लियू, मायल ओट, नमन गोयल, जिंगफेई डू, मंदार जोशी, डैनकी चेन, ओमर लेवी, माइक लुईस, ल्यूक ज़ेटलमॉयर, वेसेलिन स्टोयानोव द्वारा। +1. **[RegNet](https://huggingface.co/docs/transformers/model_doc/regnet)** (META रिसर्च से) [डिज़ाइनिंग नेटवर्क डिज़ाइन स्पेस](https://arxiv.org/abs/2003.13678) पेपर के साथ जारी किया गया इलिजा राडोसावोविक, राज प्रतीक कोसाराजू, रॉस गिर्शिक, कैमिंग ही, पिओटर डॉलर द्वारा। +1. **[RemBERT](https://huggingface.co/docs/transformers/model_doc/rembert)** (गूगल रिसर्च से) साथ वाला पेपर [पूर्व-प्रशिक्षित भाषा मॉडल में एम्बेडिंग कपलिंग पर पुनर्विचार](https://arxiv.org/pdf/2010.12821.pdf) ह्युंग वोन चुंग, थिबॉल्ट फ़ेवरी, हेनरी त्साई, एम. जॉनसन, सेबेस्टियन रुडर द्वारा। +1. **[ResNet](https://huggingface.co/docs/transformers/model_doc/resnet)** (माइक्रोसॉफ्ट रिसर्च से) [डीप रेसिडुअल लर्निंग फॉर इमेज रिकग्निशन](https://arxiv.org/abs/1512.03385) कैमिंग हे, जियांग्यु झांग, शाओकिंग रेन, जियान सन द्वारा। +1. **[RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)** (फेसबुक से), साथ में कागज [मजबूत रूप से अनुकूलित BERT प्रीट्रेनिंग दृष्टिकोण](https://arxiv.org/abs/1907.11692) यिनहान लियू, मायल ओट, नमन गोयल, जिंगफेई डू, मंदार जोशी, डैनकी चेन, ओमर लेवी, माइक लुईस, ल्यूक ज़ेटलमॉयर, वेसेलिन स्टोयानोव द्वारा। 1. **[RoBERTa-PreLayerNorm](https://huggingface.co/docs/transformers/model_doc/roberta-prelayernorm)** (from Facebook) released with the paper [fairseq: A Fast, Extensible Toolkit for Sequence Modeling](https://arxiv.org/abs/1904.01038) by Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli. 1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou. -1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (झुईई टेक्नोलॉजी से), साथ में पेपर [रोफॉर्मर: रोटरी पोजिशन एंबेडिंग के साथ एन्हांस्ड ट्रांसफॉर्मर] (https://arxiv.org/pdf/2104.09864v1.pdf) जियानलिन सु और यू लू और शेंगफेंग पैन और बो वेन और युनफेंग लियू द्वारा प्रकाशित। +1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (झुईई टेक्नोलॉजी से), साथ में पेपर [रोफॉर्मर: रोटरी पोजिशन एंबेडिंग के साथ एन्हांस्ड ट्रांसफॉर्मर](https://arxiv.org/pdf/2104.09864v1.pdf) जियानलिन सु और यू लू और शेंगफेंग पैन और बो वेन और युनफेंग लियू द्वारा प्रकाशित। +1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (Bo Peng से) Bo Peng. द्वाराअनुसंधान पत्र [this repo](https://github.com/BlinkDL/RWKV-LM) के साथ जारी किया गया +1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team. +1. **[SeamlessM4Tv2](https://huggingface.co/docs/transformers/model_doc/seamless_m4t_v2)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team. 1. 
**[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo. -1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (ASAPP से) साथ देने वाला पेपर [भाषण पहचान के लिए अनसुपरवाइज्ड प्री-ट्रेनिंग में परफॉर्मेंस-एफिशिएंसी ट्रेड-ऑफ्स](https ://arxiv.org/abs/2109.06870) फेलिक्स वू, क्वांगयुन किम, जिंग पैन, क्यू हान, किलियन क्यू. वेनबर्गर, योव आर्टज़ी द्वारा। -1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (ASAPP से) साथ में पेपर [भाषण पहचान के लिए अनसुपरवाइज्ड प्री-ट्रेनिंग में परफॉर्मेंस-एफिशिएंसी ट्रेड-ऑफ्स] (https://arxiv.org/abs/2109.06870) फेलिक्स वू, क्वांगयुन किम, जिंग पैन, क्यू हान, किलियन क्यू. वेनबर्गर, योआव आर्टज़ी द्वारा पोस्ट किया गया। -1. **[SpeechT5](https://huggingface.co/docs/transformers/main/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei. -1. **[SpeechToTextTransformer](https://huggingface.co/docs/transformers/model_doc/speech_to_text)** (फेसबुक से), साथ में पेपर [फेयरसेक S2T: फास्ट स्पीच-टू-टेक्स्ट मॉडलिंग विद फेयरसेक](https: //arxiv.org/abs/2010.05171) चांगहान वांग, यूं तांग, जुताई मा, ऐनी वू, दिमित्रो ओखोनको, जुआन पिनो द्वारा पोस्ट किया गया。 +1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (Meta AI से) Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick. द्वाराअनुसंधान पत्र [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) के साथ जारी किया गया +1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (ASAPP से) साथ देने वाला पेपर [भाषण पहचान के लिए अनसुपरवाइज्ड प्री-ट्रेनिंग में परफॉर्मेंस-एफिशिएंसी ट्रेड-ऑफ्स](https://arxiv.org/abs/2109.06870) फेलिक्स वू, क्वांगयुन किम, जिंग पैन, क्यू हान, किलियन क्यू. वेनबर्गर, योव आर्टज़ी द्वारा। +1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (ASAPP से) साथ में पेपर [भाषण पहचान के लिए अनसुपरवाइज्ड प्री-ट्रेनिंग में परफॉर्मेंस-एफिशिएंसी ट्रेड-ऑफ्स](https://arxiv.org/abs/2109.06870) फेलिक्स वू, क्वांगयुन किम, जिंग पैन, क्यू हान, किलियन क्यू. वेनबर्गर, योआव आर्टज़ी द्वारा पोस्ट किया गया। +1. **[SigLIP](https://huggingface.co/docs/transformers/model_doc/siglip)** (Google AI से) Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer. द्वाराअनुसंधान पत्र [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) के साथ जारी किया गया +1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei. +1. 
**[SpeechToTextTransformer](https://huggingface.co/docs/transformers/model_doc/speech_to_text)** (फेसबुक से), साथ में पेपर [फेयरसेक S2T: फास्ट स्पीच-टू-टेक्स्ट मॉडलिंग विद फेयरसेक](https://arxiv.org/abs/2010.05171) चांगहान वांग, यूं तांग, जुताई मा, ऐनी वू, दिमित्रो ओखोनको, जुआन पिनो द्वारा पोस्ट किया गया。 1. **[SpeechToTextTransformer2](https://huggingface.co/docs/transformers/model_doc/speech_to_text_2)** (फेसबुक से) साथ में पेपर [लार्ज-स्केल सेल्फ- एंड सेमी-सुपरवाइज्ड लर्निंग फॉर स्पीच ट्रांसलेशन](https://arxiv.org/abs/2104.06678) चांगहान वांग, ऐनी वू, जुआन पिनो, एलेक्सी बेवस्की, माइकल औली, एलेक्सिस द्वारा Conneau द्वारा पोस्ट किया गया। -1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (तेल अवीव यूनिवर्सिटी से) साथ में पेपर [स्पैन सिलेक्शन को प्री-ट्रेनिंग करके कुछ-शॉट क्वेश्चन आंसरिंग](https:// arxiv.org/abs/2101.00438) ओरि राम, युवल कर्स्टन, जोनाथन बेरेंट, अमीर ग्लोबर्सन, ओमर लेवी द्वारा। -1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (बर्कले से) कागज के साथ [SqueezeBERT: कुशल तंत्रिका नेटवर्क के बारे में NLP को कंप्यूटर विज़न क्या सिखा सकता है?](https: //arxiv.org/abs/2006.11316) फॉरेस्ट एन. इनडोला, अल्बर्ट ई. शॉ, रवि कृष्णा, और कर्ट डब्ल्यू. केटज़र द्वारा। -1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (माइक्रोसॉफ्ट से) साथ में कागज [स्वाइन ट्रांसफॉर्मर: शिफ्टेड विंडोज का उपयोग कर पदानुक्रमित विजन ट्रांसफॉर्मर](https://arxiv .org/abs/2103.14030) ज़ी लियू, युटोंग लिन, यू काओ, हान हू, यिक्सुआन वेई, झेंग झांग, स्टीफन लिन, बैनिंग गुओ द्वारा। -1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (Microsoft से) साथ वाला पेपर [Swin Transformer V2: स्केलिंग अप कैपेसिटी एंड रेजोल्यूशन](https:// ज़ी लियू, हान हू, युटोंग लिन, ज़ुलिआंग याओ, ज़ेंडा ज़ी, यिक्सुआन वेई, जिया निंग, यू काओ, झेंग झांग, ली डोंग, फुरु वेई, बैनिंग गुओ द्वारा arxiv.org/abs/2111.09883। +1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (तेल अवीव यूनिवर्सिटी से) साथ में पेपर [स्पैन सिलेक्शन को प्री-ट्रेनिंग करके कुछ-शॉट क्वेश्चन आंसरिंग](https://arxiv.org/abs/2101.00438) ओरि राम, युवल कर्स्टन, जोनाथन बेरेंट, अमीर ग्लोबर्सन, ओमर लेवी द्वारा। +1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (बर्कले से) कागज के साथ [SqueezeBERT: कुशल तंत्रिका नेटवर्क के बारे में NLP को कंप्यूटर विज़न क्या सिखा सकता है?](https://arxiv.org/abs/2006.11316) फॉरेस्ट एन. इनडोला, अल्बर्ट ई. शॉ, रवि कृष्णा, और कर्ट डब्ल्यू. केटज़र द्वारा। +1. **[StableLm](https://huggingface.co/docs/transformers/main/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu. +1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (MBZUAI से) Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan. द्वाराअनुसंधान पत्र [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446) के साथ जारी किया गया +1. 
**[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (माइक्रोसॉफ्ट से) साथ में कागज [स्वाइन ट्रांसफॉर्मर: शिफ्टेड विंडोज का उपयोग कर पदानुक्रमित विजन ट्रांसफॉर्मर](https://arxiv.org/abs/2103.14030) ज़ी लियू, युटोंग लिन, यू काओ, हान हू, यिक्सुआन वेई, झेंग झांग, स्टीफन लिन, बैनिंग गुओ द्वारा। +1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (Microsoft से) साथ वाला पेपर [Swin Transformer V2: स्केलिंग अप कैपेसिटी एंड रेजोल्यूशन](https://arxiv.org/abs/2111.09883) ज़ी लियू, हान हू, युटोंग लिन, ज़ुलिआंग याओ, ज़ेंडा ज़ी, यिक्सुआन वेई, जिया निंग, यू काओ, झेंग झांग, ली डोंग, फुरु वेई, बैनिंग गुओ द्वारा। 1. **[Swin2SR](https://huggingface.co/docs/transformers/model_doc/swin2sr)** (from University of Würzburg) released with the paper [Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration](https://arxiv.org/abs/2209.11345) by Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte. 1. **[SwitchTransformers](https://huggingface.co/docs/transformers/model_doc/switch_transformers)** (from Google) released with the paper [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961) by William Fedus, Barret Zoph, Noam Shazeer. -1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (来自 Google AI)कॉलिन रैफेल और नोम शज़ीर और एडम रॉबर्ट्स और कैथरीन ली और शरण नारंग और माइकल मटेना द्वारा साथ में पेपर [एक एकीकृत टेक्स्ट-टू-टेक्स्ट ट्रांसफॉर्मर के साथ स्थानांतरण सीखने की सीमा की खोज] (https://arxiv.org/abs/1910.10683) और यांकी झोउ और वेई ली और पीटर जे लियू। +1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (来自 Google AI)कॉलिन रैफेल और नोम शज़ीर और एडम रॉबर्ट्स और कैथरीन ली और शरण नारंग और माइकल मटेना द्वारा साथ में पेपर [एक एकीकृत टेक्स्ट-टू-टेक्स्ट ट्रांसफॉर्मर के साथ स्थानांतरण सीखने की सीमा की खोज](https://arxiv.org/abs/1910.10683) और यांकी झोउ और वेई ली और पीटर जे लियू। 1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (Google AI से) साथ वाला पेपर [google-research/text-to-text-transfer- ट्रांसफॉर्मर](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) कॉलिन रैफेल और नोम शज़ीर और एडम रॉबर्ट्स और कैथरीन ली और शरण नारंग द्वारा और माइकल मटेना और यांकी झोउ और वेई ली और पीटर जे लियू। -1. **[Table Transformer](https://huggingface.co/docs/transformers/model_doc/table-transformer)** (माइक्रोसॉफ्ट रिसर्च से) साथ में पेपर [पबटेबल्स-1एम: टूवर्ड्स कॉम्प्रिहेंसिव टेबल एक्सट्रैक्शन फ्रॉम अनस्ट्रक्चर्ड डॉक्यूमेंट्स ](https://arxiv.org/abs/2110.00061) ब्रैंडन स्मॉक, रोहित पेसाला, रॉबिन अब्राहम द्वारा पोस्ट किया गया। -1. **[TAPAS](https://huggingface.co/docs/transformers/model_doc/tapas)** (Google AI से) साथ में कागज [TAPAS: पूर्व-प्रशिक्षण के माध्यम से कमजोर पर्यवेक्षण तालिका पार्सिंग](https:// arxiv.org/abs/2004.02349) जोनाथन हर्ज़िग, पावेल क्रिज़िस्तोफ़ नोवाक, थॉमस मुलर, फ्रांसेस्को पिकिन्नो और जूलियन मार्टिन ईसेन्च्लोस द्वारा। -1. **[TAPEX](https://huggingface.co/docs/transformers/model_doc/tapex)** (माइक्रोसॉफ्ट रिसर्च से) साथ में पेपर [TAPEX: टेबल प्री-ट्रेनिंग थ्रू लर्निंग अ न्यूरल SQL एक्ज़ीक्यूटर](https: //arxiv.org/abs/2107.07653) कियान लियू, बेई चेन, जियाकी गुओ, मोर्टेज़ा ज़ियादी, ज़ेकी लिन, वीज़ू चेन, जियान-गुआंग लू द्वारा पोस्ट किया गया। +1. 
**[Table Transformer](https://huggingface.co/docs/transformers/model_doc/table-transformer)** (माइक्रोसॉफ्ट रिसर्च से) साथ में पेपर [पबटेबल्स-1एम: टूवर्ड्स कॉम्प्रिहेंसिव टेबल एक्सट्रैक्शन फ्रॉम अनस्ट्रक्चर्ड डॉक्यूमेंट्स](https://arxiv.org/abs/2110.00061) ब्रैंडन स्मॉक, रोहित पेसाला, रॉबिन अब्राहम द्वारा पोस्ट किया गया। +1. **[TAPAS](https://huggingface.co/docs/transformers/model_doc/tapas)** (Google AI से) साथ में कागज [TAPAS: पूर्व-प्रशिक्षण के माध्यम से कमजोर पर्यवेक्षण तालिका पार्सिंग](https://arxiv.org/abs/2004.02349) जोनाथन हर्ज़िग, पावेल क्रिज़िस्तोफ़ नोवाक, थॉमस मुलर, फ्रांसेस्को पिकिन्नो और जूलियन मार्टिन ईसेन्च्लोस द्वारा। +1. **[TAPEX](https://huggingface.co/docs/transformers/model_doc/tapex)** (माइक्रोसॉफ्ट रिसर्च से) साथ में पेपर [TAPEX: टेबल प्री-ट्रेनिंग थ्रू लर्निंग अ न्यूरल SQL एक्ज़ीक्यूटर](https://arxiv.org/abs/2107.07653) कियान लियू, बेई चेन, जियाकी गुओ, मोर्टेज़ा ज़ियादी, ज़ेकी लिन, वीज़ू चेन, जियान-गुआंग लू द्वारा पोस्ट किया गया। 1. **[Time Series Transformer](https://huggingface.co/docs/transformers/model_doc/time_series_transformer)** (from HuggingFace). 1. **[TimeSformer](https://huggingface.co/docs/transformers/model_doc/timesformer)** (from Facebook) released with the paper [Is Space-Time Attention All You Need for Video Understanding?](https://arxiv.org/abs/2102.05095) by Gedas Bertasius, Heng Wang, Lorenzo Torresani. 1. **[Trajectory Transformer](https://huggingface.co/docs/transformers/model_doc/trajectory_transformers)** (from the University of California at Berkeley) released with the paper [Offline Reinforcement Learning as One Big Sequence Modeling Problem](https://arxiv.org/abs/2106.02039) by Michael Janner, Qiyang Li, Sergey Levine -1. **[Transformer-XL](https://huggingface.co/docs/transformers/model_doc/transfo-xl)** (Google/CMU की ओर से) कागज के साथ [संस्करण-एक्स: एक ब्लॉग मॉडल चौकस चौक मॉडल मॉडल] (https://arxivorg/abs/1901.02860) क्वोकोक वी. ले, रुस्लैन सलाखुतदी +1. **[Transformer-XL](https://huggingface.co/docs/transformers/model_doc/transfo-xl)** (Google/CMU की ओर से) कागज के साथ [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) ज़िहांग दाई, ज़िलिन यांग, यिमिंग यांग, जेमी कार्बोनेल, क्वोक वी. ले, रुस्लान सालाखुतदीनोव द्वारा। 1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft) released with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. -1. **[TVLT](https://huggingface.co/docs/transformers/main/model_doc/tvlt)** (from UNC Chapel Hill) released with the paper [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) by Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal. +1. **[TVLT](https://huggingface.co/docs/transformers/model_doc/tvlt)** (from UNC Chapel Hill) released with the paper [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) by Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal. +1. **[TVP](https://huggingface.co/docs/transformers/model_doc/tvp)** (from Intel) released with the paper [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding. 1. **[UL2](https://huggingface.co/docs/transformers/model_doc/ul2)** (from Google Research) released with the paper [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) by Yi Tay, Mostafa Dehghani, Vinh Q. 
Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler -1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (माइक्रोसॉफ्ट रिसर्च से) साथ में दिया गया पेपर [UniSpeech: यूनिफाइड स्पीच रिप्रेजेंटेशन लर्निंग विद लेबलेड एंड अनलेबल्ड डेटा](https:/ /arxiv.org/abs/2101.07597) चेंगई वांग, यू वू, याओ कियान, केनिची कुमातानी, शुजी लियू, फुरु वेई, माइकल ज़ेंग, ज़ुएदोंग हुआंग द्वारा। -1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (माइक्रोसॉफ्ट रिसर्च से) कागज के साथ [UNISPEECH-SAT: यूनिवर्सल स्पीच रिप्रेजेंटेशन लर्निंग विद स्पीकर अवेयर प्री-ट्रेनिंग ](https://arxiv.org/abs/2110.05752) सानयुआन चेन, यू वू, चेंग्यी वांग, झेंगयांग चेन, झूओ चेन, शुजी लियू, जियान वू, याओ कियान, फुरु वेई, जिन्यु ली, जियांगज़ान यू द्वारा पोस्ट किया गया। +1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (Google Research से) Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant. द्वाराअनुसंधान पत्र [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) के साथ जारी किया गया +1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (माइक्रोसॉफ्ट रिसर्च से) साथ में दिया गया पेपर [UniSpeech: यूनिफाइड स्पीच रिप्रेजेंटेशन लर्निंग विद लेबलेड एंड अनलेबल्ड डेटा](https://arxiv.org/abs/2101.07597) चेंगई वांग, यू वू, याओ कियान, केनिची कुमातानी, शुजी लियू, फुरु वेई, माइकल ज़ेंग, ज़ुएदोंग हुआंग द्वारा। +1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (माइक्रोसॉफ्ट रिसर्च से) कागज के साथ [UNISPEECH-SAT: यूनिवर्सल स्पीच रिप्रेजेंटेशन लर्निंग विद स्पीकर अवेयर प्री-ट्रेनिंग](https://arxiv.org/abs/2110.05752) सानयुआन चेन, यू वू, चेंग्यी वांग, झेंगयांग चेन, झूओ चेन, शुजी लियू, जियान वू, याओ कियान, फुरु वेई, जिन्यु ली, जियांगज़ान यू द्वारा पोस्ट किया गया। +1. **[UnivNet](https://huggingface.co/docs/transformers/model_doc/univnet)** (from Kakao Corporation) released with the paper [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim. 1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (from Peking University) released with the paper [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221) by Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun. -1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (सिंघुआ यूनिवर्सिटी और ननकाई यूनिवर्सिटी से) साथ में पेपर [विजुअल अटेंशन नेटवर्क](https://arxiv.org/ pdf/2202.09741.pdf) मेंग-हाओ गुओ, चेंग-ज़े लू, झेंग-निंग लियू, मिंग-मिंग चेंग, शि-मिन हू द्वारा। -1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (मल्टीमीडिया कम्प्यूटिंग ग्रुप, नानजिंग यूनिवर्सिटी से) साथ में पेपर [वीडियोएमएई: मास्क्ड ऑटोएन्कोडर स्व-पर्यवेक्षित वीडियो प्री-ट्रेनिंग के लिए डेटा-कुशल सीखने वाले हैं] (https://arxiv.org/abs/2203.12602) ज़ान टोंग, यिबिंग सॉन्ग, जुए द्वारा वांग, लिमिन वांग द्वारा पोस्ट किया गया। +1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (सिंघुआ यूनिवर्सिटी और ननकाई यूनिवर्सिटी से) साथ में पेपर [विजुअल अटेंशन नेटवर्क](https://arxiv.org/pdf/2202.09741.pdf) मेंग-हाओ गुओ, चेंग-ज़े लू, झेंग-निंग लियू, मिंग-मिंग चेंग, शि-मिन हू द्वारा। +1. 
**[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (मल्टीमीडिया कम्प्यूटिंग ग्रुप, नानजिंग यूनिवर्सिटी से) साथ में पेपर [वीडियोएमएई: मास्क्ड ऑटोएन्कोडर स्व-पर्यवेक्षित वीडियो प्री-ट्रेनिंग के लिए डेटा-कुशल सीखने वाले हैं](https://arxiv.org/abs/2203.12602) ज़ान टोंग, यिबिंग सॉन्ग, जुए द्वारा वांग, लिमिन वांग द्वारा पोस्ट किया गया। 1. **[ViLT](https://huggingface.co/docs/transformers/model_doc/vilt)** (NAVER AI Lab/Kakao Enterprise/Kakao Brain से) साथ में कागज [ViLT: Vision-and-Language Transformer बिना कनवल्शन या रीजन सुपरविजन](https://arxiv.org/abs/2102.03334) वोनजे किम, बोक्यूंग सोन, इल्डू किम द्वारा पोस्ट किया गया। +1. **[VipLlava](https://huggingface.co/docs/transformers/model_doc/vipllava)** (University of Wisconsin–Madison से) Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee. द्वाराअनुसंधान पत्र [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https://arxiv.org/abs/2312.00784) के साथ जारी किया गया 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (गूगल एआई से) कागज के साथ [एक इमेज इज़ वर्थ 16x16 वर्ड्स: ट्रांसफॉर्मर्स फॉर इमेज रिकॉग्निशन एट स्केल](https://arxiv.org/abs/2010.11929) एलेक्सी डोसोवित्स्की, लुकास बेयर, अलेक्जेंडर कोलेसनिकोव, डिर्क वीसेनबोर्न, शियाओहुआ झाई, थॉमस अनटरथिनर, मुस्तफा देहघानी, मैथियास मिंडरर, जॉर्ज हेगोल्ड, सिल्वेन गेली, जैकब उस्ज़कोरेइट द्वारा हॉल्सबी द्वारा पोस्ट किया गया। -1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (UCLA NLP से) साथ वाला पेपर [VisualBERT: A Simple and Performant Baseline for Vision and Language](https:/ /arxiv.org/pdf/1908.03557) लियुनियन हेरोल्ड ली, मार्क यात्स्कर, दा यिन, चो-जुई हसीह, काई-वेई चांग द्वारा। +1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (UCLA NLP से) साथ वाला पेपर [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) लियुनियन हेरोल्ड ली, मार्क यात्स्कर, दा यिन, चो-जुई हसीह, काई-वेई चांग द्वारा। 1. **[ViT Hybrid](https://huggingface.co/docs/transformers/model_doc/vit_hybrid)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. -1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (मेटा एआई से) साथ में कागज [मास्कड ऑटोएन्कोडर स्केलेबल विजन लर्नर्स हैं](https://arxiv.org/ एब्स/2111.06377) कैमिंग हे, ज़िनेली चेन, सेनिंग ज़ी, यांगहो ली, पिओट्र डॉलर, रॉस गिर्शिक द्वारा। -1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (मेटा एआई से) साथ में कागज [लेबल-कुशल सीखने के लिए मास्क्ड स्याम देश के नेटवर्क](https://arxiv. org/abs/2204.07141) महमूद असरान, मथिल्डे कैरन, ईशान मिश्रा, पियोट्र बोजानोवस्की, फ्लोरियन बोर्डेस, पास्कल विंसेंट, आर्मंड जौलिन, माइकल रब्बत, निकोलस बल्लास द्वारा। -1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (फेसबुक एआई से) साथ में पेपर [wav2vec 2.0: ए फ्रेमवर्क फॉर सेल्फ-सुपरवाइज्ड लर्निंग ऑफ स्पीच रिप्रेजेंटेशन] (https://arxiv.org/abs/2006.11477) एलेक्सी बेवस्की, हेनरी झोउ, अब्देलरहमान मोहम्मद, माइकल औली द्वारा। -1. 
**[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (Facebook AI से) साथ वाला पेपर [FAIRSEQ S2T: FAIRSEQ के साथ फास्ट स्पीच-टू-टेक्स्ट मॉडलिंग ](https://arxiv.org/abs/2010.05171) चांगहान वांग, यूं तांग, जुताई मा, ऐनी वू, सरव्या पोपुरी, दिमित्रो ओखोनको, जुआन पिनो द्वारा पोस्ट किया गया। -1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (Facebook AI से) साथ वाला पेपर [सरल और प्रभावी जीरो-शॉट क्रॉस-लिंगुअल फोनेम रिकॉग्निशन](https:/ /arxiv.org/abs/2109.11680) कियानटोंग जू, एलेक्सी बाएव्स्की, माइकल औली द्वारा। -1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (माइक्रोसॉफ्ट रिसर्च से) पेपर के साथ जारी किया गया [WavLM: फुल स्टैक के लिए बड़े पैमाने पर स्व-पर्यवेक्षित पूर्व-प्रशिक्षण स्पीच प्रोसेसिंग] (https://arxiv.org/abs/2110.13900) सानयुआन चेन, चेंगयी वांग, झेंगयांग चेन, यू वू, शुजी लियू, ज़ुओ चेन, जिन्यु ली, नाओयुकी कांडा, ताकुया योशियोका, ज़िओंग जिओ, जियान वू, लॉन्ग झोउ, शुओ रेन, यानमिन कियान, याओ कियान, जियान वू, माइकल ज़ेंग, फुरु वेई। -1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (OpenAI से) साथ में कागज [बड़े पैमाने पर कमजोर पर्यवेक्षण के माध्यम से मजबूत भाषण पहचान](https://cdn. openai.com/papers/whisper.pdf) एलेक रैडफोर्ड, जोंग वूक किम, ताओ जू, ग्रेग ब्रॉकमैन, क्रिस्टीन मैकलीवे, इल्या सुत्स्केवर द्वारा। -1. **[X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)** (माइक्रोसॉफ्ट रिसर्च से) कागज के साथ [एक्सपैंडिंग लैंग्वेज-इमेज प्रीट्रेन्ड मॉडल फॉर जनरल वीडियो रिकग्निशन](https: //arxiv.org/abs/2208.02816) बोलिन नी, होउवेन पेंग, मिंगाओ चेन, सोंगयांग झांग, गाओफेंग मेंग, जियानलोंग फू, शिमिंग जियांग, हैबिन लिंग द्वारा। -1. **[X-MOD](https://huggingface.co/docs/transformers/main/model_doc/xmod)** (Meta AI से) Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe. द्वाराअनुसंधान पत्र [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](http://dx.doi.org/10.18653/v1/2022.naacl-main.255) के साथ जारी किया गया +1. **[VitDet](https://huggingface.co/docs/transformers/model_doc/vitdet)** (Meta AI से) Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. द्वाराअनुसंधान पत्र [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) के साथ जारी किया गया +1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (मेटा एआई से) साथ में कागज [मास्कड ऑटोएन्कोडर स्केलेबल विजन लर्नर्स हैं](https://arxiv.org/एब्स/2111.06377) कैमिंग हे, ज़िनेली चेन, सेनिंग ज़ी, यांगहो ली, पिओट्र डॉलर, रॉस गिर्शिक द्वारा। +1. **[ViTMatte](https://huggingface.co/docs/transformers/model_doc/vitmatte)** (HUST-VL से) Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang. द्वाराअनुसंधान पत्र [ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers](https://arxiv.org/abs/2305.15272) के साथ जारी किया गया +1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (मेटा एआई से) साथ में कागज [लेबल-कुशल सीखने के लिए मास्क्ड स्याम देश के नेटवर्क](https://arxiv.org/abs/2204.07141) महमूद असरान, मथिल्डे कैरन, ईशान मिश्रा, पियोट्र बोजानोवस्की, फ्लोरियन बोर्डेस, पास्कल विंसेंट, आर्मंड जौलिन, माइकल रब्बत, निकोलस बल्लास द्वारा। +1. **[VITS](https://huggingface.co/docs/transformers/model_doc/vits)** (Kakao Enterprise से) Jaehyeon Kim, Jungil Kong, Juhee Son. द्वाराअनुसंधान पत्र [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) के साथ जारी किया गया +1. 
**[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. +1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (फेसबुक एआई से) साथ में पेपर [wav2vec 2.0: ए फ्रेमवर्क फॉर सेल्फ-सुपरवाइज्ड लर्निंग ऑफ स्पीच रिप्रेजेंटेशन](https://arxiv.org/abs/2006.11477) एलेक्सी बेवस्की, हेनरी झोउ, अब्देलरहमान मोहम्मद, माइकल औली द्वारा। +1. **[Wav2Vec2-BERT](https://huggingface.co/docs/transformers/model_doc/wav2vec2-bert)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team. +1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (Facebook AI से) साथ वाला पेपर [FAIRSEQ S2T: FAIRSEQ के साथ फास्ट स्पीच-टू-टेक्स्ट मॉडलिंग](https://arxiv.org/abs/2010.05171) चांगहान वांग, यूं तांग, जुताई मा, ऐनी वू, सरव्या पोपुरी, दिमित्रो ओखोनको, जुआन पिनो द्वारा पोस्ट किया गया। +1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (Facebook AI से) साथ वाला पेपर [सरल और प्रभावी जीरो-शॉट क्रॉस-लिंगुअल फोनेम रिकॉग्निशन](https://arxiv.org/abs/2109.11680) कियानटोंग जू, एलेक्सी बाएव्स्की, माइकल औली द्वारा। +1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (माइक्रोसॉफ्ट रिसर्च से) पेपर के साथ जारी किया गया [WavLM: फुल स्टैक के लिए बड़े पैमाने पर स्व-पर्यवेक्षित पूर्व-प्रशिक्षण स्पीच प्रोसेसिंग](https://arxiv.org/abs/2110.13900) सानयुआन चेन, चेंगयी वांग, झेंगयांग चेन, यू वू, शुजी लियू, ज़ुओ चेन, जिन्यु ली, नाओयुकी कांडा, ताकुया योशियोका, ज़िओंग जिओ, जियान वू, लॉन्ग झोउ, शुओ रेन, यानमिन कियान, याओ कियान, जियान वू, माइकल ज़ेंग, फुरु वेई। +1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (OpenAI से) साथ में कागज [बड़े पैमाने पर कमजोर पर्यवेक्षण के माध्यम से मजबूत भाषण पहचान](https://cdn.openai.com/papers/whisper.pdf) एलेक रैडफोर्ड, जोंग वूक किम, ताओ जू, ग्रेग ब्रॉकमैन, क्रिस्टीन मैकलीवे, इल्या सुत्स्केवर द्वारा। +1. **[X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)** (माइक्रोसॉफ्ट रिसर्च से) कागज के साथ [एक्सपैंडिंग लैंग्वेज-इमेज प्रीट्रेन्ड मॉडल फॉर जनरल वीडियो रिकग्निशन](https://arxiv.org/abs/2208.02816) बोलिन नी, होउवेन पेंग, मिंगाओ चेन, सोंगयांग झांग, गाओफेंग मेंग, जियानलोंग फू, शिमिंग जियांग, हैबिन लिंग द्वारा। +1. **[X-MOD](https://huggingface.co/docs/transformers/model_doc/xmod)** (Meta AI से) Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe. द्वाराअनुसंधान पत्र [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](http://dx.doi.org/10.18653/v1/2022.naacl-main.255) के साथ जारी किया गया 1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li. -1. 
**[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (फेसबुक से) साथ में पेपर [क्रॉस-लिंगुअल लैंग्वेज मॉडल प्रीट्रेनिंग] (https://arxiv.org/abs/1901.07291) गिलाउम लैम्पल और एलेक्सिस कोनो द्वारा। +1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (फेसबुक से) साथ में पेपर [क्रॉस-लिंगुअल लैंग्वेज मॉडल प्रीट्रेनिंग](https://arxiv.org/abs/1901.07291) गिलाउम लैम्पल और एलेक्सिस कोनो द्वारा। 1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (माइक्रोसॉफ्ट रिसर्च से) साथ में कागज [ProphetNet: प्रेडिक्टिंग फ्यूचर एन-ग्राम फॉर सीक्वेंस-टू- सीक्वेंस प्री-ट्रेनिंग](https://arxiv.org/abs/2001.04063) यू यान, वीज़ेन क्यूई, येयुन गोंग, दयाहेंग लियू, नान डुआन, जिउशेंग चेन, रुओफ़ेई झांग और मिंग झोउ द्वारा। -1. **[XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)** (फेसबुक एआई से), साथ में पेपर [अनसुपरवाइज्ड क्रॉस-लिंगुअल रिप्रेजेंटेशन लर्निंग एट स्केल] (https://arxiv.org/abs/1911.02116) एलेक्सिस कोन्यू*, कार्तिकेय खंडेलवाल*, नमन गोयल, विश्रव चौधरी, गिलाउम वेनज़ेक, फ्रांसिस्को गुज़मैन द्वारा , एडौर्ड ग्रेव, मायल ओट, ल्यूक ज़ेटलमॉयर और वेसेलिन स्टोयानोव द्वारा। -1. **[XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/model_doc/xlm-roberta-xl)** (Facebook AI से) साथ में कागज [बहुभाषी नकाबपोश भाषा के लिए बड़े पैमाने पर ट्रांसफॉर्मर ] मॉडलिंग](https://arxiv.org/abs/2105.00572) नमन गोयल, जिंगफेई डू, मायल ओट, गिरि अनंतरामन, एलेक्सिस कोनो द्वारा पोस्ट किया गया। +1. **[XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)** (फेसबुक एआई से), साथ में पेपर [अनसुपरवाइज्ड क्रॉस-लिंगुअल रिप्रेजेंटेशन लर्निंग एट स्केल](https://arxiv.org/abs/1911.02116) एलेक्सिस कोन्यू*, कार्तिकेय खंडेलवाल*, नमन गोयल, विश्रव चौधरी, गिलाउम वेनज़ेक, फ्रांसिस्को गुज़मैन द्वारा , एडौर्ड ग्रेव, मायल ओट, ल्यूक ज़ेटलमॉयर और वेसेलिन स्टोयानोव द्वारा। +1. **[XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/model_doc/xlm-roberta-xl)** (Facebook AI से) साथ में कागज [बहुभाषी नकाबपोश भाषा के लिए बड़े पैमाने पर ट्रांसफॉर्मर मॉडलिंग](https://arxiv.org/abs/2105.00572) नमन गोयल, जिंगफेई डू, मायल ओट, गिरि अनंतरामन, एलेक्सिस कोनो द्वारा पोस्ट किया गया। 1. **[XLM-V](https://huggingface.co/docs/transformers/model_doc/xlm-v)** (from Meta AI) released with the paper [XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models](https://arxiv.org/abs/2301.10472) by Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, Madian Khabsa. -1. **[XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet)** (Google/CMU से) साथ वाला पेपर [XLNet: जनरलाइज्ड ऑटोरेग्रेसिव प्रीट्रेनिंग फॉर लैंग्वेज अंडरस्टैंडिंग](https://arxiv ज़ीलिन यांग*, ज़िहांग दाई*, यिमिंग यांग, जैम कार्बोनेल, रुस्लान सलाखुतदीनोव, क्वोक वी. ले ​​द्वारा .org/abs/1906.08237)। +1. **[XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet)** (Google/CMU से) साथ वाला पेपर [XLNet: जनरलाइज्ड ऑटोरेग्रेसिव प्रीट्रेनिंग फॉर लैंग्वेज अंडरस्टैंडिंग](https://arxiv.org/abs/1906.08237) ज़ीलिन यांग*, ज़िहांग दाई*, यिमिंग यांग, जैम कार्बोनेल, रुस्लान सलाखुतदीनोव, क्वोक वी. ले द्वारा। 1. **[XLS-R](https://huggingface.co/docs/transformers/model_doc/xls_r)** (Facebook AI से) साथ वाला पेपर [XLS-R: सेल्फ सुपरवाइज्ड क्रॉस-लिंगुअल स्पीच रिप्रेजेंटेशन लर्निंग एट स्केल](https://arxiv.org/abs/2111.09296) अरुण बाबू, चांगहान वांग, एंड्रोस तजंद्रा, कुशाल लखोटिया, कियानटोंग जू, नमन गोयल, कृतिका सिंह, पैट्रिक वॉन प्लैटन, याथार्थ सराफ, जुआन पिनो, एलेक्सी बेवस्की, एलेक्सिस कोन्यू, माइकल औली द्वारा पोस्ट किया गया। -1. 
**[XLSR-Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/xlsr_wav2vec2)** (फेसबुक एआई से) साथ में पेपर [अनसुपरवाइज्ड क्रॉस-लिंगुअल रिप्रेजेंटेशन लर्निंग फॉर स्पीच रिकग्निशन] (https://arxiv.org/abs/2006.13979) एलेक्सिस कोन्यू, एलेक्सी बेवस्की, रोनन कोलोबर्ट, अब्देलरहमान मोहम्मद, माइकल औली द्वारा। +1. **[XLSR-Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/xlsr_wav2vec2)** (फेसबुक एआई से) साथ में पेपर [अनसुपरवाइज्ड क्रॉस-लिंगुअल रिप्रेजेंटेशन लर्निंग फॉर स्पीच रिकग्निशन](https://arxiv.org/abs/2006.13979) एलेक्सिस कोन्यू, एलेक्सी बेवस्की, रोनन कोलोबर्ट, अब्देलरहमान मोहम्मद, माइकल औली द्वारा। 1. **[YOLOS](https://huggingface.co/docs/transformers/model_doc/yolos)** (हुआझोंग यूनिवर्सिटी ऑफ साइंस एंड टेक्नोलॉजी से) साथ में पेपर [यू ओनली लुक एट वन सीक्वेंस: रीथिंकिंग ट्रांसफॉर्मर इन विज़न थ्रू ऑब्जेक्ट डिटेक्शन](https://arxiv.org/abs/2106.00666) युक्सिन फेंग, बेनचेंग लियाओ, जिंगगैंग वांग, जेमिन फेंग, जियांग क्यूई, रुई वू, जियानवेई नीयू, वेन्यू लियू द्वारा पोस्ट किया गया। 1. **[YOSO](https://huggingface.co/docs/transformers/model_doc/yoso)** (विस्कॉन्सिन विश्वविद्यालय - मैडिसन से) साथ में पेपर [यू ओनली सैंपल (लगभग) ज़ानपेंग ज़ेंग, युनयांग ज़िओंग द्वारा , सत्य एन. रवि, शैलेश आचार्य, ग्लेन फंग, विकास सिंह द्वारा पोस्ट किया गया। -1. एक नए मॉडल में योगदान देना चाहते हैं? नए मॉडल जोड़ने में आपका मार्गदर्शन करने के लिए हमारे पास एक **विस्तृत मार्गदर्शिका और टेम्प्लेट** है। आप उन्हें [`टेम्पलेट्स`](./templates) निर्देशिका में पा सकते हैं। पीआर शुरू करने से पहले [योगदान दिशानिर्देश] (./CONTRIBUTING.md) देखना और अनुरक्षकों से संपर्क करना या प्रतिक्रिया प्राप्त करने के लिए एक नया मुद्दा खोलना याद रखें। +1. एक नए मॉडल में योगदान देना चाहते हैं? नए मॉडल जोड़ने में आपका मार्गदर्शन करने के लिए हमारे पास एक **विस्तृत मार्गदर्शिका और टेम्प्लेट** है। आप उन्हें [`टेम्पलेट्स`](./templates) निर्देशिका में पा सकते हैं। पीआर शुरू करने से पहले [योगदान दिशानिर्देश](./CONTRIBUTING.md) देखना और अनुरक्षकों से संपर्क करना या प्रतिक्रिया प्राप्त करने के लिए एक नया मुद्दा खोलना याद रखें। -यह जांचने के लिए कि क्या किसी मॉडल में पहले से ही Flax, PyTorch या TensorFlow का कार्यान्वयन है, या यदि उसके पास Tokenizers लाइब्रेरी में संबंधित टोकन है, तो [यह तालिका] (https://huggingface.co/ docs/transformers/index#supported) देखें। -फ्रेमवर्क)। +यह जांचने के लिए कि क्या किसी मॉडल में पहले से ही Flax, PyTorch या TensorFlow का कार्यान्वयन है, या यदि उसके पास Tokenizers लाइब्रेरी में संबंधित टोकन है, तो [यह तालिका](https://huggingface.co/docs/transformers/index#supported) देखें। -फ्रेमवर्क)। इन कार्यान्वयनों का परीक्षण कई डेटासेट पर किया गया है (देखें केस स्क्रिप्ट का उपयोग करें) और वैनिला कार्यान्वयन के लिए तुलनात्मक रूप से प्रदर्शन करना चाहिए। आप उपयोग के मामले के दस्तावेज़ [इस अनुभाग](https://huggingface.co/docs/transformers/examples) में व्यवहार का विवरण पढ़ सकते हैं। diff --git a/README_ja.md b/README_ja.md index 9668c90c391208..bd8a058b7b1b96 100644 --- a/README_ja.md +++ b/README_ja.md @@ -53,7 +53,7 @@ user: ユーザ

Build
@@ -81,8 +81,13 @@ user: ユーザ
 한국어 |
 Español |
 日本語 |
- हिन्दी
+ हिन्दी |
+ Русский |
+ Рortuguês |
+ తెలుగు |
+ Français |
+ Deutsch |

@@ -114,13 +119,13 @@ user: ユーザ 以下はその一例です: 自然言語処理にて: -- [BERTによるマスクドワード補完](https://huggingface.co/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France) +- [BERTによるマスクドワード補完](https://huggingface.co/google-bert/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France) - [Electraによる名前実体認識](https://huggingface.co/dbmdz/electra-large-discriminator-finetuned-conll03-english?text=My+name+is+Sarah+and+I+live+in+London+city) -- [GPT-2によるテキスト生成](https://huggingface.co/gpt2?text=A+long+time+ago%2C+) -- [RoBERTaによる自然言語推論](https://huggingface.co/roberta-large-mnli?text=The+dog+was+lost.+Nobody+lost+any+animal) +- [GPT-2によるテキスト生成](https://huggingface.co/openai-community/gpt2?text=A+long+time+ago%2C+) +- [RoBERTaによる自然言語推論](https://huggingface.co/FacebookAI/roberta-large-mnli?text=The+dog+was+lost.+Nobody+lost+any+animal) - [BARTによる要約](https://huggingface.co/facebook/bart-large-cnn?text=The+tower+is+324+metres+%281%2C063+ft%29+tall%2C+about+the+same+height+as+an+81-storey+building%2C+and+the+tallest+structure+in+Paris.+Its+base+is+square%2C+measuring+125+metres+%28410+ft%29+on+each+side.+During+its+construction%2C+the+Eiffel+Tower+surpassed+the+Washington+Monument+to+become+the+tallest+man-made+structure+in+the+world%2C+a+title+it+held+for+41+years+until+the+Chrysler+Building+in+New+York+City+was+finished+in+1930.+It+was+the+first+structure+to+reach+a+height+of+300+metres.+Due+to+the+addition+of+a+broadcasting+aerial+at+the+top+of+the+tower+in+1957%2C+it+is+now+taller+than+the+Chrysler+Building+by+5.2+metres+%2817+ft%29.+Excluding+transmitters%2C+the+Eiffel+Tower+is+the+second+tallest+free-standing+structure+in+France+after+the+Millau+Viaduct) -- [DistilBERTによる質問応答](https://huggingface.co/distilbert-base-uncased-distilled-squad?text=Which+name+is+also+used+to+describe+the+Amazon+rainforest+in+English%3F&context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species) -- [T5による翻訳](https://huggingface.co/t5-base?text=My+name+is+Wolfgang+and+I+live+in+Berlin) +- 
[DistilBERTによる質問応答](https://huggingface.co/distilbert/distilbert-base-uncased-distilled-squad?text=Which+name+is+also+used+to+describe+the+Amazon+rainforest+in+English%3F&context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species) +- [T5による翻訳](https://huggingface.co/google-t5/t5-base?text=My+name+is+Wolfgang+and+I+live+in+Berlin) コンピュータビジョンにて: - [ViTによる画像分類](https://huggingface.co/google/vit-base-patch16-224) @@ -203,19 +208,19 @@ Hugging Faceチームによって作られた **[トランスフォーマーを ```python >>> from transformers import AutoTokenizer, AutoModel ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") ->>> model = AutoModel.from_pretrained("bert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = AutoModel.from_pretrained("google-bert/bert-base-uncased") >>> inputs = tokenizer("Hello world!", return_tensors="pt") >>> outputs = model(**inputs) ``` -And here is the equivalent code for TensorFlow: +そしてこちらはTensorFlowと同等のコードとなります: ```python >>> from transformers import AutoTokenizer, TFAutoModel ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") ->>> model = TFAutoModel.from_pretrained("bert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = TFAutoModel.from_pretrained("google-bert/bert-base-uncased") >>> inputs = tokenizer("Hello world!", return_tensors="tf") >>> outputs = model(**inputs) @@ -258,7 +263,7 @@ And here is the equivalent code for TensorFlow: ### pipにて -このリポジトリは、Python 3.6+, Flax 0.3.2+, PyTorch 1.3.1+, TensorFlow 2.3+ でテストされています。 +このリポジトリは、Python 3.8+, Flax 0.4.1+, PyTorch 1.11+, TensorFlow 2.6+ でテストされています。 🤗Transformersは[仮想環境](https://docs.python.org/3/library/venv.html)にインストールする必要があります。Pythonの仮想環境に慣れていない場合は、[ユーザーガイド](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/)を確認してください。 @@ -277,14 +282,14 @@ pip install transformers ### condaにて -Transformersバージョン4.0.0から、condaチャンネルを搭載しました: `huggingface`。 - 🤗Transformersは以下のようにcondaを使って設置することができます: ```shell script -conda install -c huggingface transformers +conda install conda-forge::transformers ``` +> **_注意:_** `huggingface` チャンネルから `transformers` をインストールすることは非推奨です。 + Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それぞれのインストールページに従ってください。 > **_注意:_** Windowsでは、キャッシュの恩恵を受けるために、デベロッパーモードを有効にするよう促されることがあります。このような場合は、[このissue](https://github.com/huggingface/huggingface_hub/issues/1062)でお知らせください。 @@ 
-298,8 +303,11 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ 🤗Transformersは現在、以下のアーキテクチャを提供しています(それぞれのハイレベルな要約は[こちら](https://huggingface.co/docs/transformers/model_summary)を参照してください): 1. **[ALBERT](https://huggingface.co/docs/transformers/model_doc/albert)** (Google Research and the Toyota Technological Institute at Chicago から) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut から公開された研究論文: [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942) +1. **[ALIGN](https://huggingface.co/docs/transformers/model_doc/align)** (Google Research から) Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. から公開された研究論文 [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918) 1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (BAAI から) Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell から公開された研究論文: [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) 1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (MIT から) Yuan Gong, Yu-An Chung, James Glass から公開された研究論文: [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) +1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long. +1. **[Bark](https://huggingface.co/docs/transformers/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team. 1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (Facebook から) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer から公開された研究論文: [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) 1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (École polytechnique から) Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis から公開された研究論文: [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) 1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (VinAI Research から) Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen から公開された研究論文: [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) @@ -314,22 +322,27 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ 1. **[Blenderbot](https://huggingface.co/docs/transformers/model_doc/blenderbot)** (Facebook から) Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston から公開された研究論文: [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) 1. **[BlenderbotSmall](https://huggingface.co/docs/transformers/model_doc/blenderbot-small)** (Facebook から) Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. 
Smith, Y-Lan Boureau, Jason Weston から公開された研究論文: [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) 1. **[BLIP](https://huggingface.co/docs/transformers/model_doc/blip)** (Salesforce から) Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi から公開された研究論文: [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086) -1. **[BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2)** (Salesforce から) Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. から公開された研究論文 [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597) +1. **[BLIP-2](https://huggingface.co/docs/transformers/model_doc/blip-2)** (Salesforce から) Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. から公開された研究論文 [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597) 1. **[BLOOM](https://huggingface.co/docs/transformers/model_doc/bloom)** (BigScience workshop から) [BigScience Workshop](https://bigscience.huggingface.co/) から公開されました. 1. **[BORT](https://huggingface.co/docs/transformers/model_doc/bort)** (Alexa から) Adrian de Wynter and Daniel J. Perry から公開された研究論文: [Optimal Subarchitecture Extraction For BERT](https://arxiv.org/abs/2010.10499) -1. **[BridgeTower](https://huggingface.co/docs/transformers/main/model_doc/bridgetower)** (Harbin Institute of Technology/Microsoft Research Asia/Intel Labs から) released with the paper [BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning](https://arxiv.org/abs/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. +1. **[BridgeTower](https://huggingface.co/docs/transformers/model_doc/bridgetower)** (Harbin Institute of Technology/Microsoft Research Asia/Intel Labs から) released with the paper [BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning](https://arxiv.org/abs/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. +1. **[BROS](https://huggingface.co/docs/transformers/model_doc/bros)** (NAVER CLOVA から) Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park. から公開された研究論文 [BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents](https://arxiv.org/abs/2108.04539) 1. **[ByT5](https://huggingface.co/docs/transformers/model_doc/byt5)** (Google Research から) Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel から公開された研究論文: [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) 1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (Inria/Facebook/Sorbonne から) Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot から公開された研究論文: [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) 1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (Google Research から) Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting から公開された研究論文: [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) 1. 
**[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (OFA-Sys から) An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou から公開された研究論文: [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) -1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (LAION-AI から) Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. から公開された研究論文 [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation]https://arxiv.org/abs/2211.06687) +1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (LAION-AI から) Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. から公開された研究論文 [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) 1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (OpenAI から) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever から公開された研究論文: [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) 1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (University of Göttingen から) Timo Lüddecke and Alexander Ecker から公開された研究論文: [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) +1. **[CLVP](https://huggingface.co/docs/transformers/model_doc/clvp)** released with the paper [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker. 1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (Salesforce から) Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong から公開された研究論文: [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) +1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (MetaAI から) Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve. から公開された研究論文 [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) 1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (Microsoft Research Asia から) Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang から公開された研究論文: [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) 1. **[ConvBERT](https://huggingface.co/docs/transformers/model_doc/convbert)** (YituTech から) Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan から公開された研究論文: [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) 1. **[ConvNeXT](https://huggingface.co/docs/transformers/model_doc/convnext)** (Facebook AI から) Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie から公開された研究論文: [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) +1. 
**[ConvNeXTV2](https://huggingface.co/docs/transformers/model_doc/convnextv2)** (from Facebook AI) released with the paper [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie. 1. **[CPM](https://huggingface.co/docs/transformers/model_doc/cpm)** (Tsinghua University から) Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun から公開された研究論文: [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) +1. **[CPM-Ant](https://huggingface.co/docs/transformers/model_doc/cpmant)** (OpenBMB から) [OpenBMB](https://www.openbmb.org/) から公開されました. 1. **[CTRL](https://huggingface.co/docs/transformers/model_doc/ctrl)** (Salesforce から) Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher から公開された研究論文: [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) 1. **[CvT](https://huggingface.co/docs/transformers/model_doc/cvt)** (Microsoft から) Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang から公開された研究論文: [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) 1. **[Data2Vec](https://huggingface.co/docs/transformers/model_doc/data2vec)** (Facebook から) Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli から公開された研究論文: [Data2Vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) @@ -338,41 +351,58 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ 1. **[Decision Transformer](https://huggingface.co/docs/transformers/model_doc/decision_transformer)** (Berkeley/Facebook/Google から) Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch から公開された研究論文: [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345) 1. **[Deformable DETR](https://huggingface.co/docs/transformers/model_doc/deformable_detr)** (SenseTime Research から) Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai から公開された研究論文: [Deformable DETR: Deformable Transformers for End-to-End Object Detection](https://arxiv.org/abs/2010.04159) 1. **[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (Facebook から) Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou から公開された研究論文: [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) -1. **[DETA](https://huggingface.co/docs/transformers/main/model_doc/deta)** (The University of Texas at Austin から) Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl. から公開された研究論文 [NMS Strikes Back](https://arxiv.org/abs/2212.06137) +1. **[DePlot](https://huggingface.co/docs/transformers/model_doc/deplot)** (Google AI から) Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, Yasemin Altun. から公開された研究論文 [DePlot: One-shot visual language reasoning by plot-to-table translation](https://arxiv.org/abs/2212.10505) +1. 
**[Depth Anything](https://huggingface.co/docs/transformers/main/model_doc/depth_anything)** (University of Hong Kong and TikTok から) Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao. から公開された研究論文 [Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891) +1. **[DETA](https://huggingface.co/docs/transformers/model_doc/deta)** (The University of Texas at Austin から) Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl. から公開された研究論文 [NMS Strikes Back](https://arxiv.org/abs/2212.06137) 1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (Facebook から) Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko から公開された研究論文: [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) 1. **[DialoGPT](https://huggingface.co/docs/transformers/model_doc/dialogpt)** (Microsoft Research から) Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan から公開された研究論文: [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) 1. **[DiNAT](https://huggingface.co/docs/transformers/model_doc/dinat)** (SHI Labs から) Ali Hassani and Humphrey Shi から公開された研究論文: [Dilated Neighborhood Attention Transformer](https://arxiv.org/abs/2209.15001) +1. **[DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2)** (Meta AI から) Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski. から公開された研究論文 [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) 1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (HuggingFace から), Victor Sanh, Lysandre Debut and Thomas Wolf. 同じ手法で GPT2, RoBERTa と Multilingual BERT の圧縮を行いました.圧縮されたモデルはそれぞれ [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation)、[DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation)、[DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) と名付けられました. 公開された研究論文: [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) 1. **[DiT](https://huggingface.co/docs/transformers/model_doc/dit)** (Microsoft Research から) Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei から公開された研究論文: [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) 1. **[Donut](https://huggingface.co/docs/transformers/model_doc/donut)** (NAVER から), Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park から公開された研究論文: [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) 1. 
**[DPR](https://huggingface.co/docs/transformers/model_doc/dpr)** (Facebook から) Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih から公開された研究論文: [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) 1. **[DPT](https://huggingface.co/docs/transformers/master/model_doc/dpt)** (Intel Labs から) René Ranftl, Alexey Bochkovskiy, Vladlen Koltun から公開された研究論文: [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) 1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (Snap Research から) Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren. から公開された研究論文 [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) +1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le. 1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (Google Research/Stanford University から) Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning から公開された研究論文: [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) +1. **[EnCodec](https://huggingface.co/docs/transformers/model_doc/encodec)** (Meta AI から) Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi. から公開された研究論文 [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438) 1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (Google Research から) Sascha Rothe, Shashi Narayan, Aliaksei Severyn から公開された研究論文: [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) 1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (Baidu から) Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu から公開された研究論文: [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) -1. **[ErnieM](https://huggingface.co/docs/transformers/main/model_doc/ernie_m)** (Baidu から) Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. から公開された研究論文 [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) +1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (Baidu から) Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. から公開された研究論文 [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) 1. **[ESM](https://huggingface.co/docs/transformers/model_doc/esm)** (Meta AI から) はトランスフォーマープロテイン言語モデルです. **ESM-1b** は Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus から公開された研究論文: [Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences](https://www.pnas.org/content/118/15/e2016239118). 
**ESM-1v** は Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu and Alexander Rives から公開された研究論文: [Language models enable zero-shot prediction of the effects of mutations on protein function](https://doi.org/10.1101/2021.07.09.450648). **ESM-2** と **ESMFold** は Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, Alexander Rives から公開された研究論文: [Language models of protein sequences at the scale of evolution enable accurate structure prediction](https://doi.org/10.1101/2022.07.20.500902) +1. **[Falcon](https://huggingface.co/docs/transformers/model_doc/falcon)** (from Technology Innovation Institute) by Almazrouei, Ebtesam and Alobeidli, Hamza and Alshamsi, Abdulaziz and Cappelli, Alessandro and Cojocaru, Ruxandra and Debbah, Merouane and Goffinet, Etienne and Heslow, Daniel and Launay, Julien and Malartic, Quentin and Noune, Badreddine and Pannier, Baptiste and Penedo, Guilherme. +1. **[FastSpeech2Conformer](model_doc/fastspeech2_conformer)** (ESPnet and Microsoft Research から) Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang. から公開された研究論文 [Fastspeech 2: Fast And High-quality End-to-End Text To Speech](https://arxiv.org/pdf/2006.04558.pdf) 1. **[FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5)** (Google AI から) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V から公開されたレポジトリー [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints) Le, and Jason Wei +1. **[FLAN-UL2](https://huggingface.co/docs/transformers/model_doc/flan-ul2)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-ul2-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei 1. **[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (CNRS から) Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab から公開された研究論文: [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) 1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc/flava)** (Facebook AI から) Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela から公開された研究論文: [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) 1. 
**[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (Google Research から) James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon から公開された研究論文: [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) +1. **[FocalNet](https://huggingface.co/docs/transformers/model_doc/focalnet)** (Microsoft Research から) Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao. から公開された研究論文 [Focal Modulation Networks](https://arxiv.org/abs/2203.11926) 1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (CMU/Google Brain から) Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le から公開された研究論文: [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) +1. **[Fuyu](https://huggingface.co/docs/transformers/model_doc/fuyu)** (ADEPT から) Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar. から公開された研究論文 [blog post](https://www.adept.ai/blog/fuyu-8b) 1. **[GIT](https://huggingface.co/docs/transformers/model_doc/git)** (Microsoft Research から) Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang. から公開された研究論文 [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) 1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (KAIST から) Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim から公開された研究論文: [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) -1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (OpenAI から) Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever から公開された研究論文: [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) +1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (OpenAI から) Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever から公開された研究論文: [Improving Language Understanding by Generative Pre-Training](https://openai.com/research/language-unsupervised/) 1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (EleutherAI から) Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy から公開されたレポジトリー : [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) 1. **[GPT NeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox)** (EleutherAI から) Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach から公開された研究論文: [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745) 1. **[GPT NeoX Japanese](https://huggingface.co/docs/transformers/model_doc/gpt_neox_japanese)** (ABEJA から) Shinya Otani, Takayoshi Makabe, Anuj Arora, and Kyo Hattori からリリース. -1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (OpenAI から) Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever** から公開された研究論文: [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) +1. 
**[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (OpenAI から) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever から公開された研究論文: [Language Models are Unsupervised Multitask Learners](https://openai.com/research/better-language-models/) 1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (EleutherAI から) Ben Wang and Aran Komatsuzaki から公開されたレポジトリー [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) 1. **[GPT-Sw3](https://huggingface.co/docs/transformers/model_doc/gpt-sw3)** (AI-Sweden から) Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren から公開された研究論文: [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) +1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (BigCode から) Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra. から公開された研究論文 [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) +1. **[GPTSAN-japanese](https://huggingface.co/docs/transformers/model_doc/gptsan-japanese)** [tanreinama/GPTSAN](https://github.com/tanreinama/GPTSAN/blob/main/report/model.md) 坂本俊之(tanreinama)からリリースされました. 1. **[Graphormer](https://huggingface.co/docs/transformers/model_doc/graphormer)** (Microsoft から) Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu から公開された研究論文: [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234). 1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (UCSD, NVIDIA から) Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang から公開された研究論文: [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) +1. **[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** (Allegro.pl, AGH University of Science and Technology から) Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik. から公開された研究論文 [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (Facebook から) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed から公開された研究論文: [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) 1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (Berkeley から) Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer から公開された研究論文: [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) +1. 
**[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (OpenAI から) Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever から公開された研究論文: [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) +1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. +1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (Salesforce から) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. から公開された研究論文 [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) 1. **[Jukebox](https://huggingface.co/docs/transformers/model_doc/jukebox)** (OpenAI から) Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever から公開された研究論文: [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) +1. **[KOSMOS-2](https://huggingface.co/docs/transformers/model_doc/kosmos-2)** (from Microsoft Research Asia) released with the paper [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei. 1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (Microsoft Research Asia から) Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou から公開された研究論文: [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) 1. **[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (Microsoft Research Asia から) Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou から公開された研究論文: [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) 1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (Microsoft Research Asia から) Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei から公開された研究論文: [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) @@ -380,43 +410,69 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ 1. **[LED](https://huggingface.co/docs/transformers/model_doc/led)** (AllenAI から) Iz Beltagy, Matthew E. Peters, Arman Cohan から公開された研究論文: [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) 1. 
**[LeViT](https://huggingface.co/docs/transformers/model_doc/levit)** (Meta AI から) Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, Matthijs Douze から公開された研究論文: [LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference](https://arxiv.org/abs/2104.01136) 1. **[LiLT](https://huggingface.co/docs/transformers/model_doc/lilt)** (South China University of Technology から) Jiapeng Wang, Lianwen Jin, Kai Ding から公開された研究論文: [LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding](https://arxiv.org/abs/2202.13669) +1. **[LLaMA](https://huggingface.co/docs/transformers/model_doc/llama)** (The FAIR team of Meta AI から) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. から公開された研究論文 [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971) +1. **[Llama2](https://huggingface.co/docs/transformers/model_doc/llama2)** (The FAIR team of Meta AI から) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom. から公開された研究論文 [Llama2: Open Foundation and Fine-Tuned Chat Models](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/) +1. **[LLaVa](https://huggingface.co/docs/transformers/model_doc/llava)** (Microsoft Research & University of Wisconsin-Madison から) Haotian Liu, Chunyuan Li, Yuheng Li and Yong Jae Lee. から公開された研究論文 [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485) 1. **[Longformer](https://huggingface.co/docs/transformers/model_doc/longformer)** (AllenAI から) Iz Beltagy, Matthew E. Peters, Arman Cohan から公開された研究論文: [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) 1. **[LongT5](https://huggingface.co/docs/transformers/model_doc/longt5)** (Google AI から) Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang から公開された研究論文: [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/abs/2112.07916) 1. **[LUKE](https://huggingface.co/docs/transformers/model_doc/luke)** (Studio Ousia から) Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto から公開された研究論文: [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) 1.
**[LXMERT](https://huggingface.co/docs/transformers/model_doc/lxmert)** (UNC Chapel Hill から) Hao Tan and Mohit Bansal から公開された研究論文: [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) 1. **[M-CTC-T](https://huggingface.co/docs/transformers/model_doc/mctct)** (Facebook から) Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert から公開された研究論文: [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161) 1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (Facebook から) Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin から公開された研究論文: [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) +1. **[MADLAD-400](https://huggingface.co/docs/transformers/model_doc/madlad-400)** (from Google) released with the paper [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](https://arxiv.org/abs/2309.04662) by Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat. 1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Jörg Tiedemann から. [OPUS](http://opus.nlpl.eu/) を使いながら学習された "Machine translation" (マシントランスレーション) モデル. [Marian Framework](https://marian-nmt.github.io/) はMicrosoft Translator Team が現在開発中です. 1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (Microsoft Research Asia から) Junlong Li, Yiheng Xu, Lei Cui, Furu Wei から公開された研究論文: [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) 1. **[Mask2Former](https://huggingface.co/docs/transformers/model_doc/mask2former)** (FAIR and UIUC から) Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar. から公開された研究論文 [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) 1. **[MaskFormer](https://huggingface.co/docs/transformers/model_doc/maskformer)** (Meta and UIUC から) Bowen Cheng, Alexander G. Schwing, Alexander Kirillov から公開された研究論文: [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) +1. **[MatCha](https://huggingface.co/docs/transformers/model_doc/matcha)** (Google AI から) Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, Julian Martin Eisenschlos. から公開された研究論文 [MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering](https://arxiv.org/abs/2212.09662) 1. **[mBART](https://huggingface.co/docs/transformers/model_doc/mbart)** (Facebook から) Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer から公開された研究論文: [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) 1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (Facebook から) Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan から公開された研究論文: [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) +1. 
**[MEGA](https://huggingface.co/docs/transformers/model_doc/mega)** (Facebook から) Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. から公開された研究論文 [Mega: Moving Average Equipped Gated Attention](https://arxiv.org/abs/2209.10655) 1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (NVIDIA から) Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro から公開された研究論文: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) 1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (NVIDIA から) Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro から公開された研究論文: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) +1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (Alibaba Research から) Peng Wang, Cheng Da, and Cong Yao. から公開された研究論文 [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) +1. **[Mistral](https://huggingface.co/docs/transformers/model_doc/mistral)** (from Mistral AI) by The Mistral AI team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. +1. **[Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral)** (from Mistral AI) by The [Mistral AI](https://mistral.ai) team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. 1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (Studio Ousia から) Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka から公開された研究論文: [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) +1. **[MMS](https://huggingface.co/docs/transformers/model_doc/mms)** (Facebook から) Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli. から公開された研究論文 [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516) 1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (CMU/Google Brain から) Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou から公開された研究論文: [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) 1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (Google Inc. から) Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam から公開された研究論文: [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) 1. **[MobileNetV2](https://huggingface.co/docs/transformers/model_doc/mobilenet_v2)** (Google Inc.
から) Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen から公開された研究論文: [MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/abs/1801.04381) 1. **[MobileViT](https://huggingface.co/docs/transformers/model_doc/mobilevit)** (Apple から) Sachin Mehta and Mohammad Rastegari から公開された研究論文: [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) +1. **[MobileViTV2](https://huggingface.co/docs/transformers/model_doc/mobilevitv2)** (Apple から) Sachin Mehta and Mohammad Rastegari. から公開された研究論文 [Separable Self-attention for Mobile Vision Transformers](https://arxiv.org/abs/2206.02680) 1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (Microsoft Research から) Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu から公開された研究論文: [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) +1. **[MPT](https://huggingface.co/docs/transformers/model_doc/mpt)** (MosaicML から) the MosaicML NLP Team. から公開された研究論文 [llm-foundry](https://github.com/mosaicml/llm-foundry/) +1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (the University of Wisconsin - Madison から) Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh. から公開された研究論文 [Multi Resolution Analysis (MRA)](https://arxiv.org/abs/2207.10284) 1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (Google AI から) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel から公開された研究論文: [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) +1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez. 1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (RUC AI Box から) Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen から公開された研究論文: [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) 1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (SHI Labs から) Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi から公開された研究論文: [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (Huawei Noah’s Ark Lab から) Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu から公開された研究論文: [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) 1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (Meta から) the NLLB team から公開された研究論文: [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) +1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (Meta から) the NLLB team. から公開された研究論文 [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) +1. **[Nougat](https://huggingface.co/docs/transformers/model_doc/nougat)** (Meta AI から) Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic. から公開された研究論文 [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) 1.
**[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (the University of Wisconsin - Madison から) Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh から公開された研究論文: [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) 1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (SHI Labs から) Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi から公開された研究論文: [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) +1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released on GitHub (now removed). 1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (Meta AI から) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al から公開された研究論文: [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) 1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (Google AI から) Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby から公開された研究論文: [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) +1. **[OWLv2](https://huggingface.co/docs/transformers/model_doc/owlv2)** (Google AI から) Matthias Minderer, Alexey Gritsenko, Neil Houlsby. から公開された研究論文 [Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683) +1. **[PatchTSMixer](https://huggingface.co/docs/transformers/model_doc/patchtsmixer)** ( IBM Research から) Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, Jayant Kalagnanam. から公開された研究論文 [TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting](https://arxiv.org/pdf/2306.09364.pdf) +1. **[PatchTST](https://huggingface.co/docs/transformers/model_doc/patchtst)** (IBM から) Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, Jayant Kalagnanam. から公開された研究論文 [A Time Series is Worth 64 Words: Long-term Forecasting with Transformers](https://arxiv.org/pdf/2211.14730.pdf) 1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (Google から) Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu から公開された研究論文: [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) 1. **[PEGASUS-X](https://huggingface.co/docs/transformers/model_doc/pegasus_x)** (Google から) Jason Phang, Yao Zhao, and Peter J. Liu から公開された研究論文: [Investigating Efficiently Extending Transformers for Long Input Summarization](https://arxiv.org/abs/2208.04347) 1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (Deepmind から) Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira から公開された研究論文: [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) +1. **[Persimmon](https://huggingface.co/docs/transformers/model_doc/persimmon)** (ADEPT から) Erich Elsen, Augustus Odena, Maxwell Nye, Sağnak Taşırlar, Tri Dao, Curtis Hawthorne, Deepak Moparthi, Arushi Somani. 
から公開された研究論文 [blog post](https://www.adept.ai/blog/persimmon-8b) +1. **[Phi](https://huggingface.co/docs/transformers/model_doc/phi)** (from Microsoft) released with the papers - [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li, [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee. 1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (VinAI Research から) Dat Quoc Nguyen and Anh Tuan Nguyen から公開された研究論文: [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) +1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (Google から) Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova. から公開された研究論文 [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) 1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (UCLA NLP から) Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang から公開された研究論文: [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) 1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (Sea AI Labs から) Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng から公開された研究論文: [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) +1. **[Pop2Piano](https://huggingface.co/docs/transformers/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee. 1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (Microsoft Research から) Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou から公開された研究論文: [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) +1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (Nanjing University, The University of Hong Kong etc. から) Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao. から公開された研究論文 [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) 1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (NVIDIA から) Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius から公開された研究論文: [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) +1. 
**[Qwen2](https://huggingface.co/docs/transformers/model_doc/qwen2)** (the Qwen team, Alibaba Group から) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou and Tianhang Zhu. から公開された研究論文 [Qwen Technical Report](https://arxiv.org/abs/2309.16609) 1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (Facebook から) Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela から公開された研究論文: [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) 1. **[REALM](https://huggingface.co/docs/transformers/model_doc/realm.html)** (Google Research から) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang から公開された研究論文: [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909) 1. **[Reformer](https://huggingface.co/docs/transformers/model_doc/reformer)** (Google Research から) Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya から公開された研究論文: [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) @@ -427,14 +483,21 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ 1. **[RoBERTa-PreLayerNorm](https://huggingface.co/docs/transformers/model_doc/roberta-prelayernorm)** (Facebook から) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli から公開された研究論文: [fairseq: A Fast, Extensible Toolkit for Sequence Modeling](https://arxiv.org/abs/1904.01038) 1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (WeChatAI から) HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou から公開された研究論文: [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) 1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (ZhuiyiTechnology から), Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu から公開された研究論文: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) +1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (Bo Peng から) Bo Peng. から公開された研究論文 [this repo](https://github.com/BlinkDL/RWKV-LM) +1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team. +1. **[SeamlessM4Tv2](https://huggingface.co/docs/transformers/model_doc/seamless_m4t_v2)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team. 1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (NVIDIA から) Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. 
Alvarez, Ping Luo から公開された研究論文: [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) +1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (Meta AI から) Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick. から公開された研究論文 [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) 1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (ASAPP から) Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi から公開された研究論文: [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) 1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (ASAPP から) Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi から公開された研究論文: [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) -1. **[SpeechT5](https://huggingface.co/docs/transformers/main/model_doc/speecht5)** (Microsoft Research から) Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei. から公開された研究論文 [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) +1. **[SigLIP](https://huggingface.co/docs/transformers/model_doc/siglip)** (Google AI から) Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer. から公開された研究論文 [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) +1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (Microsoft Research から) Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei. から公開された研究論文 [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) 1. **[SpeechToTextTransformer](https://huggingface.co/docs/transformers/model_doc/speech_to_text)** (Facebook から), Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino から公開された研究論文: [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) 1. **[SpeechToTextTransformer2](https://huggingface.co/docs/transformers/model_doc/speech_to_text_2)** (Facebook から), Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau から公開された研究論文: [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) 1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (Tel Aviv University から), Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy から公開された研究論文: [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) 1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (Berkeley から) Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer から公開された研究論文: [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) +1. 
**[StableLm](https://huggingface.co/docs/transformers/main/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu. +1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (MBZUAI から) Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan. から公開された研究論文 [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446) 1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (Microsoft から) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo から公開された研究論文: [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) 1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (Microsoft から) Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo から公開された研究論文: [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) 1. **[Swin2SR](https://huggingface.co/docs/transformers/model_doc/swin2sr)** (University of Würzburg から) Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte から公開された研究論文: [Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration](https://arxiv.org/abs/2209.11345) @@ -449,33 +512,42 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ 1. **[Trajectory Transformer](https://huggingface.co/docs/transformers/model_doc/trajectory_transformers)** (the University of California at Berkeley から) Michael Janner, Qiyang Li, Sergey Levine から公開された研究論文: [Offline Reinforcement Learning as One Big Sequence Modeling Problem](https://arxiv.org/abs/2106.02039) 1. **[Transformer-XL](https://huggingface.co/docs/transformers/model_doc/transfo-xl)** (Google/CMU から) Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov から公開された研究論文: [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) 1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (Microsoft から), Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei から公開された研究論文: [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) -1. **[TVLT](https://huggingface.co/docs/transformers/main/model_doc/tvlt)** (from UNC Chapel Hill から), Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal から公開された研究論文: [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) +1. **[TVLT](https://huggingface.co/docs/transformers/model_doc/tvlt)** (from UNC Chapel Hill から), Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal から公開された研究論文: [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) +1. **[TVP](https://huggingface.co/docs/transformers/model_doc/tvp)** (Intel から), Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding から公開された研究論文: [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) 1. 
**[UL2](https://huggingface.co/docs/transformers/model_doc/ul2)** (Google Research から) Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler から公開された研究論文: [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) +1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (Google Research から) Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant. から公開された研究論文 [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) 1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (Microsoft Research から) Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang から公開された研究論文: [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) 1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (Microsoft Research から) Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu から公開された研究論文: [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) +1. **[UnivNet](https://huggingface.co/docs/transformers/model_doc/univnet)** (from Kakao Corporation) released with the paper [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim. 1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (Peking University から) Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun. から公開された研究論文 [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221) 1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (Tsinghua University and Nankai University から) Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu から公開された研究論文: [Visual Attention Network](https://arxiv.org/abs/2202.09741) 1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (Multimedia Computing Group, Nanjing University から) Zhan Tong, Yibing Song, Jue Wang, Limin Wang から公開された研究論文: [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) 1. **[ViLT](https://huggingface.co/docs/transformers/model_doc/vilt)** (NAVER AI Lab/Kakao Enterprise/Kakao Brain から) Wonjae Kim, Bokyung Son, Ildoo Kim から公開された研究論文: [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) +1. **[VipLlava](https://huggingface.co/docs/transformers/model_doc/vipllava)** (University of Wisconsin–Madison から) Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee. から公開された研究論文 [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https://arxiv.org/abs/2312.00784) 1.
**[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (Google AI から) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby から公開された研究論文: [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) 1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (UCLA NLP から) Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang から公開された研究論文: [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) 1. **[ViT Hybrid](https://huggingface.co/docs/transformers/model_doc/vit_hybrid)** (Google AI から) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby から公開された研究論文: [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) +1. **[VitDet](https://huggingface.co/docs/transformers/model_doc/vitdet)** (Meta AI から) Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. から公開された研究論文 [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (Meta AI から) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick から公開された研究論文: [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) +1. **[ViTMatte](https://huggingface.co/docs/transformers/model_doc/vitmatte)** (HUST-VL から) Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang. から公開された研究論文 [ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers](https://arxiv.org/abs/2305.15272) 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (Meta AI から) Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas から公開された研究論文: [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) +1. **[VITS](https://huggingface.co/docs/transformers/model_doc/vits)** (Kakao Enterprise から) Jaehyeon Kim, Jungil Kong, Juhee Son. から公開された研究論文 [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) +1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (Facebook AI から) Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli から公開された研究論文: [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) +1. **[Wav2Vec2-BERT](https://huggingface.co/docs/transformers/model_doc/wav2vec2-bert)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team. 1. 
**[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (Facebook AI から) Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino から公開された研究論文: [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) 1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (Facebook AI から) Qiantong Xu, Alexei Baevski, Michael Auli から公開された研究論文: [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) 1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (Microsoft Research から) Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei から公開された研究論文: [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) 1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (OpenAI から) Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever から公開された研究論文: [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) 1. **[X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)** (Microsoft Research から) Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling から公開された研究論文: [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) -1. **[X-MOD](https://huggingface.co/docs/transformers/main/model_doc/xmod)** (Meta AI から) Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe. から公開された研究論文 [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](http://dx.doi.org/10.18653/v1/2022.naacl-main.255) +1. **[X-MOD](https://huggingface.co/docs/transformers/model_doc/xmod)** (Meta AI から) Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe. から公開された研究論文 [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](http://dx.doi.org/10.18653/v1/2022.naacl-main.255) 1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li から公開された研究論文: [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) 1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (Facebook から) Guillaume Lample and Alexis Conneau から公開された研究論文: [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) 1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (Microsoft Research から) Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou から公開された研究論文: [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) 1. 
**[XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)** (Facebook AI から), Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov から公開された研究論文: [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) 1. **[XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/model_doc/xlm-roberta-xl)** (Facebook AI から), Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau から公開された研究論文: [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://arxiv.org/abs/2105.00572) 1. **[XLM-V](https://huggingface.co/docs/transformers/model_doc/xlm-v)** (Meta AI から) Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, Madian Khabsa から公開された研究論文: [XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models](https://arxiv.org/abs/2301.10472) -1. **[XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet)** (Google/CMU から) Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le から公開された研究論文: [​XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) +1. **[XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet)** (Google/CMU から) Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le から公開された研究論文: [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) 1. **[XLS-R](https://huggingface.co/docs/transformers/model_doc/xls_r)** (Facebook AI から) Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli から公開された研究論文: [XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale](https://arxiv.org/abs/2111.09296) 1. **[XLSR-Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/xlsr_wav2vec2)** (Facebook AI から) Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli から公開された研究論文: [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://arxiv.org/abs/2006.13979) 1. **[YOLOS](https://huggingface.co/docs/transformers/model_doc/yolos)** (Huazhong University of Science & Technology から) Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu から公開された研究論文: [You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection](https://arxiv.org/abs/2106.00666) diff --git a/README_ko.md b/README_ko.md index 213466392b3b34..533ab4685bce09 100644 --- a/README_ko.md +++ b/README_ko.md @@ -18,7 +18,7 @@ limitations under the License.


Build @@ -46,8 +46,13 @@ limitations under the License. 한국어 | Español | 日本語 | - हिन्दी -

+ हिन्दी | + Русский | + Português | + తెలుగు | + Français | + Deutsch | +

@@ -69,13 +74,13 @@ limitations under the License. 대부분의 모델을 [모델 허브](https://huggingface.co/models) 페이지에서 바로 테스트해볼 수 있습니다. 공개 및 비공개 모델을 위한 [비공개 모델 호스팅, 버전 관리, 추론 API](https://huggingface.co/pricing)도 제공합니다. 예시: -- [BERT로 마스킹된 단어 완성하기](https://huggingface.co/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France) +- [BERT로 마스킹된 단어 완성하기](https://huggingface.co/google-bert/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France) - [Electra를 이용한 개체명 인식](https://huggingface.co/dbmdz/electra-large-discriminator-finetuned-conll03-english?text=My+name+is+Sarah+and+I+live+in+London+city) -- [GPT-2로 텍스트 생성하기](https://huggingface.co/gpt2?text=A+long+time+ago%2C+) -- [RoBERTa로 자연어 추론하기](https://huggingface.co/roberta-large-mnli?text=The+dog+was+lost.+Nobody+lost+any+animal) +- [GPT-2로 텍스트 생성하기](https://huggingface.co/openai-community/gpt2?text=A+long+time+ago%2C+) +- [RoBERTa로 자연어 추론하기](https://huggingface.co/FacebookAI/roberta-large-mnli?text=The+dog+was+lost.+Nobody+lost+any+animal) - [BART를 이용한 요약](https://huggingface.co/facebook/bart-large-cnn?text=The+tower+is+324+metres+%281%2C063+ft%29+tall%2C+about+the+same+height+as+an+81-storey+building%2C+and+the+tallest+structure+in+Paris.+Its+base+is+square%2C+measuring+125+metres+%28410+ft%29+on+each+side.+During+its+construction%2C+the+Eiffel+Tower+surpassed+the+Washington+Monument+to+become+the+tallest+man-made+structure+in+the+world%2C+a+title+it+held+for+41+years+until+the+Chrysler+Building+in+New+York+City+was+finished+in+1930.+It+was+the+first+structure+to+reach+a+height+of+300+metres.+Due+to+the+addition+of+a+broadcasting+aerial+at+the+top+of+the+tower+in+1957%2C+it+is+now+taller+than+the+Chrysler+Building+by+5.2+metres+%2817+ft%29.+Excluding+transmitters%2C+the+Eiffel+Tower+is+the+second+tallest+free-standing+structure+in+France+after+the+Millau+Viaduct) -- [DistilBERT를 이용한 질문 답변](https://huggingface.co/distilbert-base-uncased-distilled-squad?text=Which+name+is+also+used+to+describe+the+Amazon+rainforest+in+English%3F&context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species) -- [T5로 번역하기](https://huggingface.co/t5-base?text=My+name+is+Wolfgang+and+I+live+in+Berlin) +- [DistilBERT를 이용한 질문 
답변](https://huggingface.co/distilbert/distilbert-base-uncased-distilled-squad?text=Which+name+is+also+used+to+describe+the+Amazon+rainforest+in+English%3F&context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species) +- [T5로 번역하기](https://huggingface.co/google-t5/t5-base?text=My+name+is+Wolfgang+and+I+live+in+Berlin) **[Transformer와 글쓰기](https://transformer.huggingface.co)** 는 이 저장소의 텍스트 생성 능력에 관한 Hugging Face 팀의 공식 데모입니다. @@ -121,8 +126,8 @@ limitations under the License. ```python >>> from transformers import AutoTokenizer, AutoModel ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") ->>> model = AutoModel.from_pretrained("bert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = AutoModel.from_pretrained("google-bert/bert-base-uncased") >>> inputs = tokenizer("Hello world!", return_tensors="pt") >>> outputs = model(**inputs) @@ -131,8 +136,8 @@ limitations under the License. ```python >>> from transformers import AutoTokenizer, TFAutoModel ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") ->>> model = TFAutoModel.from_pretrained("bert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = TFAutoModel.from_pretrained("google-bert/bert-base-uncased") >>> inputs = tokenizer("Hello world!", return_tensors="tf") >>> outputs = model(**inputs) @@ -175,7 +180,7 @@ limitations under the License. ### pip로 설치하기 -이 저장소는 Python 3.6+, Flax 0.3.2+, PyTorch 1.3.1+, TensorFlow 2.3+에서 테스트 되었습니다. +이 저장소는 Python 3.8+, Flax 0.4.1+, PyTorch 1.11+, TensorFlow 2.6+에서 테스트 되었습니다. [가상 환경](https://docs.python.org/3/library/venv.html)에 🤗 Transformers를 설치하세요. Python 가상 환경에 익숙하지 않다면, [사용자 가이드](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/)를 확인하세요. @@ -194,14 +199,14 @@ pip install transformers ### conda로 설치하기 -Transformers 버전 v4.0.0부터, conda 채널이 생겼습니다: `huggingface`. - 🤗 Transformers는 다음과 같이 conda로 설치할 수 있습니다: ```shell script -conda install -c huggingface transformers +conda install conda-forge::transformers ``` +> **_노트:_** `huggingface` 채널에서 `transformers`를 설치하는 것은 사용이 중단되었습니다. + Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는 방법을 확인하세요. ## 모델 구조 @@ -213,8 +218,11 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는 🤗 Transformers는 다음 모델들을 제공합니다 (각 모델의 요약은 [여기](https://huggingface.co/docs/transformers/model_summary)서 확인하세요): 1. 
**[ALBERT](https://huggingface.co/docs/transformers/model_doc/albert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. +1. **[ALIGN](https://huggingface.co/docs/transformers/model_doc/align)** (Google Research 에서 제공)은 Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig.의 [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918)논문과 함께 발표했습니다. 1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell. 1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass. +1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long. +1. **[Bark](https://huggingface.co/docs/transformers/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team. 1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer. 1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis. 1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen. @@ -229,22 +237,27 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는 1. **[Blenderbot](https://huggingface.co/docs/transformers/model_doc/blenderbot)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston. 1. 
**[BlenderbotSmall](https://huggingface.co/docs/transformers/model_doc/blenderbot-small)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston. 1. **[BLIP](https://huggingface.co/docs/transformers/model_doc/blip)** (from Salesforce) released with the paper [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086) by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi. -1. **[BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2)** (Salesforce 에서 제공)은 Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi.의 [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597)논문과 함께 발표했습니다. +1. **[BLIP-2](https://huggingface.co/docs/transformers/model_doc/blip-2)** (Salesforce 에서 제공)은 Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi.의 [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597)논문과 함께 발표했습니다. 1. **[BLOOM](https://huggingface.co/docs/transformers/model_doc/bloom)** (from BigScience workshop) released by the [BigScience Workshop](https://bigscience.huggingface.co/). 1. **[BORT](https://huggingface.co/docs/transformers/model_doc/bort)** (Alexa 에서) Adrian de Wynter and Daniel J. Perry 의 [Optimal Subarchitecture Extraction For BERT](https://arxiv.org/abs/2010.10499) 논문과 함께 발표했습니다. -1. **[BridgeTower](https://huggingface.co/docs/transformers/main/model_doc/bridgetower)** (from Harbin Institute of Technology/Microsoft Research Asia/Intel Labs) released with the paper [BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning](https://arxiv.org/abs/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. +1. **[BridgeTower](https://huggingface.co/docs/transformers/model_doc/bridgetower)** (from Harbin Institute of Technology/Microsoft Research Asia/Intel Labs) released with the paper [BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning](https://arxiv.org/abs/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. +1. **[BROS](https://huggingface.co/docs/transformers/model_doc/bros)** (NAVER CLOVA 에서 제공)은 Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park.의 [BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents](https://arxiv.org/abs/2108.04539)논문과 함께 발표했습니다. 1. **[ByT5](https://huggingface.co/docs/transformers/model_doc/byt5)** (Google Research 에서) Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel 의 [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) 논문과 함께 발표했습니다. 1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (Inria/Facebook/Sorbonne 에서) Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot 의 [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) 논문과 함께 발표했습니다. 1. 
**[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (Google Research 에서) Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting 의 [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) 논문과 함께 발표했습니다. 1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (OFA-Sys 에서) An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou 의 [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) 논문과 함께 발표했습니다. -1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (LAION-AI 에서 제공)은 Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.의 [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation]https://arxiv.org/abs/2211.06687)논문과 함께 발표했습니다. +1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (LAION-AI 에서 제공)은 Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov.의 [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687)논문과 함께 발표했습니다. 1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (OpenAI 에서) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever 의 [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) 논문과 함께 발표했습니다. 1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (University of Göttingen 에서) Timo Lüddecke and Alexander Ecker 의 [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) 논문과 함께 발표했습니다. +1. **[CLVP](https://huggingface.co/docs/transformers/model_doc/clvp)** released with the paper [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker. 1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (Salesforce 에서) Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong 의 [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) 논문과 함께 발표했습니다. +1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (MetaAI 에서 제공)은 Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve.의 [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/)논문과 함께 발표했습니다. 1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (Microsoft Research Asia 에서) Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang 의 [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) 논문과 함께 발표했습니다. 1. 
**[ConvBERT](https://huggingface.co/docs/transformers/model_doc/convbert)** (YituTech 에서) Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan 의 [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) 논문과 함께 발표했습니다. 1. **[ConvNeXT](https://huggingface.co/docs/transformers/model_doc/convnext)** (Facebook AI 에서) Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie 의 [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) 논문과 함께 발표했습니다. +1. **[ConvNeXTV2](https://huggingface.co/docs/transformers/model_doc/convnextv2)** (from Facebook AI) released with the paper [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie. 1. **[CPM](https://huggingface.co/docs/transformers/model_doc/cpm)** (Tsinghua University 에서) Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun 의 [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) 논문과 함께 발표했습니다. +1. **[CPM-Ant](https://huggingface.co/docs/transformers/model_doc/cpmant)** (from OpenBMB) released by the [OpenBMB](https://www.openbmb.org/). 1. **[CTRL](https://huggingface.co/docs/transformers/model_doc/ctrl)** (Salesforce 에서) Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher 의 [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) 논문과 함께 발표했습니다. 1. **[CvT](https://huggingface.co/docs/transformers/model_doc/cvt)** (Microsoft 에서) Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang 의 [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) 논문과 함께 발표했습니다. 1. **[Data2Vec](https://huggingface.co/docs/transformers/model_doc/data2vec)** (Facebook 에서) Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli 의 [Data2Vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) 논문과 함께 발표했습니다. @@ -253,41 +266,58 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는 1. **[Decision Transformer](https://huggingface.co/docs/transformers/model_doc/decision_transformer)** (Berkeley/Facebook/Google 에서) Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch 의 [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345) 논문과 함께 발표했습니다. 1. **[Deformable DETR](https://huggingface.co/docs/transformers/model_doc/deformable_detr)** (SenseTime Research 에서) Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai 의 [Deformable DETR: Deformable Transformers for End-to-End Object Detection](https://arxiv.org/abs/2010.04159) 논문과 함께 발표했습니다. 1. **[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (Facebook 에서) Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou 의 [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) 논문과 함께 발표했습니다. -1. 
**[DETA](https://huggingface.co/docs/transformers/main/model_doc/deta)** (The University of Texas at Austin 에서 제공)은 Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl.의 [NMS Strikes Back](https://arxiv.org/abs/2212.06137)논문과 함께 발표했습니다. +1. **[DePlot](https://huggingface.co/docs/transformers/model_doc/deplot)** (Google AI 에서 제공)은 Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, Yasemin Altun.의 [DePlot: One-shot visual language reasoning by plot-to-table translation](https://arxiv.org/abs/2212.10505)논문과 함께 발표했습니다. +1. **[Depth Anything](https://huggingface.co/docs/transformers/main/model_doc/depth_anything)** (University of Hong Kong and TikTok 에서 제공)은 Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao.의 [Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891)논문과 함께 발표했습니다. +1. **[DETA](https://huggingface.co/docs/transformers/model_doc/deta)** (The University of Texas at Austin 에서 제공)은 Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl.의 [NMS Strikes Back](https://arxiv.org/abs/2212.06137)논문과 함께 발표했습니다. 1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (Facebook 에서) Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko 의 [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) 논문과 함께 발표했습니다. 1. **[DialoGPT](https://huggingface.co/docs/transformers/model_doc/dialogpt)** (Microsoft Research 에서) Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan 의 [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) 논문과 함께 발표했습니다. 1. **[DiNAT](https://huggingface.co/docs/transformers/model_doc/dinat)** (SHI Labs 에서) Ali Hassani and Humphrey Shi 의 [Dilated Neighborhood Attention Transformer](https://arxiv.org/abs/2209.15001) 논문과 함께 발표했습니다. +1. **[DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2)** (Meta AI 에서 제공)은 Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski.의 [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193)논문과 함께 발표했습니다. 1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (HuggingFace 에서) Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/distillation) and a German version of DistilBERT 의 [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) 논문과 함께 발표했습니다. 1. 
**[DiT](https://huggingface.co/docs/transformers/model_doc/dit)** (Microsoft Research 에서) Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei 의 [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) 논문과 함께 발표했습니다. 1. **[Donut](https://huggingface.co/docs/transformers/model_doc/donut)** (NAVER 에서) Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park 의 [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) 논문과 함께 발표했습니다. 1. **[DPR](https://huggingface.co/docs/transformers/model_doc/dpr)** (Facebook 에서) Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih 의 [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) 논문과 함께 발표했습니다. 1. **[DPT](https://huggingface.co/docs/transformers/master/model_doc/dpt)** (Intel Labs 에서) René Ranftl, Alexey Bochkovskiy, Vladlen Koltun 의 [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) 논문과 함께 발표했습니다. 1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (from Snap Research) released with the paper [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) by Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren. +1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le. 1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (Google Research/Stanford University 에서) Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning 의 [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) 논문과 함께 발표했습니다. +1. **[EnCodec](https://huggingface.co/docs/transformers/model_doc/encodec)** (Meta AI 에서 제공)은 Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi.의 [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438)논문과 함께 발표했습니다. 1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (Google Research 에서) Sascha Rothe, Shashi Narayan, Aliaksei Severyn 의 [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) 논문과 함께 발표했습니다. 1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (Baidu 에서) Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu 의 [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) 논문과 함께 발표했습니다. -1. **[ErnieM](https://huggingface.co/docs/transformers/main/model_doc/ernie_m)** (Baidu 에서 제공)은 Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang.의 [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674)논문과 함께 발표했습니다. +1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (Baidu 에서 제공)은 Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang.의 [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674)논문과 함께 발표했습니다. 1. 
**[ESM](https://huggingface.co/docs/transformers/model_doc/esm)** (from Meta AI) are transformer protein language models. **ESM-1b** was released with the paper [Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences](https://www.pnas.org/content/118/15/e2016239118) by Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. **ESM-1v** was released with the paper [Language models enable zero-shot prediction of the effects of mutations on protein function](https://doi.org/10.1101/2021.07.09.450648) by Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu and Alexander Rives. **ESM-2** was released with the paper [Language models of protein sequences at the scale of evolution enable accurate structure prediction](https://doi.org/10.1101/2022.07.20.500902) by Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, Alexander Rives. +1. **[Falcon](https://huggingface.co/docs/transformers/model_doc/falcon)** (from Technology Innovation Institute) by Almazrouei, Ebtesam and Alobeidli, Hamza and Alshamsi, Abdulaziz and Cappelli, Alessandro and Cojocaru, Ruxandra and Debbah, Merouane and Goffinet, Etienne and Heslow, Daniel and Launay, Julien and Malartic, Quentin and Noune, Badreddine and Pannier, Baptiste and Penedo, Guilherme. +1. **[FastSpeech2Conformer](model_doc/fastspeech2_conformer)** (ESPnet and Microsoft Research 에서 제공)은 Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang.의 [Fastspeech 2: Fast And High-quality End-to-End Text To Speech](https://arxiv.org/pdf/2006.04558.pdf)논문과 함께 발표했습니다. 1. **[FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei +1. **[FLAN-UL2](https://huggingface.co/docs/transformers/model_doc/flan-ul2)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-ul2-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei 1. 
**[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab. 1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc/flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. 1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. +1. **[FocalNet](https://huggingface.co/docs/transformers/model_doc/focalnet)** (from Microsoft Research) released with the paper [Focal Modulation Networks](https://arxiv.org/abs/2203.11926) by Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao. 1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le. +1. **[Fuyu](https://huggingface.co/docs/transformers/model_doc/fuyu)** (from ADEPT) Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar. 논문과 함께 공개 [blog post](https://www.adept.ai/blog/fuyu-8b) 1. **[GIT](https://huggingface.co/docs/transformers/model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang. 1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim. -1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. +1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://openai.com/research/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. 1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. 1. 
**[GPT NeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox)** (EleutherAI 에서) Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbac 의 [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745) 논문과 함께 발표했습니다. 1. **[GPT NeoX Japanese](https://huggingface.co/docs/transformers/model_doc/gpt_neox_japanese)** (from ABEJA) released by Shinya Otani, Takayoshi Makabe, Anuj Arora, and Kyo Hattori. -1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (OpenAI 에서) Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever** 의 [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) 논문과 함께 발표했습니다. +1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (OpenAI 에서) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever 의 [Language Models are Unsupervised Multitask Learners](https://openai.com/research/better-language-models/) 논문과 함께 발표했습니다. 1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki. 1. **[GPT-Sw3](https://huggingface.co/docs/transformers/model_doc/gpt-sw3)** (AI-Sweden 에서) Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren. 의 [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) 논문과 함께 발표했습니다. +1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (BigCode 에서 제공)은 Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra.의 [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988)논문과 함께 발표했습니다. +1. **[GPTSAN-japanese](https://huggingface.co/docs/transformers/model_doc/gptsan-japanese)** released in the repository [tanreinama/GPTSAN](https://github.com/tanreinama/GPTSAN/blob/main/report/model.md) by Toshiyuki Sakamoto(tanreinama). 1. **[Graphormer](https://huggingface.co/docs/transformers/model_doc/graphormer)** (from Microsoft) Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu 의 [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) 논문과 함께 발표했습니다. 1. 
**[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (UCSD, NVIDIA 에서) Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang 의 [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) 논문과 함께 발표했습니다. +1. **[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** (Allegro.pl, AGH University of Science and Technology 에서 제공)은 Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik.의 [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf)논문과 함께 발표했습니다. 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (Facebook 에서) Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed 의 [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) 논문과 함께 발표했습니다. 1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (Berkeley 에서) Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer 의 [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) 논문과 함께 발표했습니다. +1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (OpenAI 에서) Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever 의 [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) 논문과 함께 발표했습니다. +1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. +1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (Salesforce 에서 제공)은 Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi.의 [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500)논문과 함께 발표했습니다. 1. **[Jukebox](https://huggingface.co/docs/transformers/model_doc/jukebox)** (OpenAI 에서) Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever 의 [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) 논문과 함께 발표했습니다. +1. **[KOSMOS-2](https://huggingface.co/docs/transformers/model_doc/kosmos-2)** (from Microsoft Research Asia) released with the paper [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei. 1. 
**[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (Microsoft Research Asia 에서) Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou 의 [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) 논문과 함께 발표했습니다. 1. **[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (Microsoft Research Asia 에서) Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou 의 [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) 논문과 함께 발표했습니다. 1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (Microsoft Research Asia 에서) Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei 의 [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) 논문과 함께 발표했습니다. @@ -295,43 +325,69 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는 1. **[LED](https://huggingface.co/docs/transformers/model_doc/led)** (AllenAI 에서) Iz Beltagy, Matthew E. Peters, Arman Cohan 의 [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) 논문과 함께 발표했습니다. 1. **[LeViT](https://huggingface.co/docs/transformers/model_doc/levit)** (Meta AI 에서) Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, Matthijs Douze 의 [LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference](https://arxiv.org/abs/2104.01136) 논문과 함께 발표했습니다. 1. **[LiLT](https://huggingface.co/docs/transformers/model_doc/lilt)** (South China University of Technology 에서) Jiapeng Wang, Lianwen Jin, Kai Ding 의 [LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding](https://arxiv.org/abs/2202.13669) 논문과 함께 발표했습니다. +1. **[LLaMA](https://huggingface.co/docs/transformers/model_doc/llama)** (The FAIR team of Meta AI 에서 제공)은 Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample.의 [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)논문과 함께 발표했습니다. +1. 
**[Llama2](https://huggingface.co/docs/transformers/model_doc/llama2)** (The FAIR team of Meta AI 에서 제공)은 Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom.의 [Llama2: Open Foundation and Fine-Tuned Chat Models](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/)논문과 함께 발표했습니다. +1. **[LLaVa](https://huggingface.co/docs/transformers/model_doc/llava)** (Microsoft Research & University of Wisconsin-Madison 에서 제공)은 Haotian Liu, Chunyuan Li, Yuheng Li and Yong Jae Lee.의 [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485)논문과 함께 발표했습니다. 1. **[Longformer](https://huggingface.co/docs/transformers/model_doc/longformer)** (AllenAI 에서) Iz Beltagy, Matthew E. Peters, Arman Cohan 의 [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) 논문과 함께 발표했습니다. 1. **[LongT5](https://huggingface.co/docs/transformers/model_doc/longt5)** (Google AI 에서) Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang 의 [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/abs/2112.07916) 논문과 함께 발표했습니다. 1. **[LUKE](https://huggingface.co/docs/transformers/model_doc/luke)** (Studio Ousia 에서) Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto 의 [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) 논문과 함께 발표했습니다. 1. **[LXMERT](https://huggingface.co/docs/transformers/model_doc/lxmert)** (UNC Chapel Hill 에서) Hao Tan and Mohit Bansal 의 [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) 논문과 함께 발표했습니다. 1. **[M-CTC-T](https://huggingface.co/docs/transformers/model_doc/mctct)** (Facebook 에서) Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert 의 [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161) 논문과 함께 발표했습니다. 1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (Facebook 에서) Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin 의 [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) 논문과 함께 발표했습니다. +1. 
**[MADLAD-400](https://huggingface.co/docs/transformers/model_doc/madlad-400)** (from Google) released with the paper [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](https://arxiv.org/abs/2309.04662) by Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat. 1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team. 1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (Microsoft Research Asia 에서) Junlong Li, Yiheng Xu, Lei Cui, Furu Wei 의 [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) 논문과 함께 발표했습니다. 1. **[Mask2Former](https://huggingface.co/docs/transformers/model_doc/mask2former)** (FAIR and UIUC 에서 제공)은 Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar.의 [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527)논문과 함께 발표했습니다. 1. **[MaskFormer](https://huggingface.co/docs/transformers/model_doc/maskformer)** (Meta and UIUC 에서) Bowen Cheng, Alexander G. Schwing, Alexander Kirillov 의 [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) 논문과 함께 발표했습니다. +1. **[MatCha](https://huggingface.co/docs/transformers/model_doc/matcha)** (Google AI 에서 제공)은 Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, Julian Martin Eisenschlos.의 [MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering](https://arxiv.org/abs/2212.09662)논문과 함께 발표했습니다. 1. **[mBART](https://huggingface.co/docs/transformers/model_doc/mbart)** (Facebook 에서) Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer 의 [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) 논문과 함께 발표했습니다. 1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (Facebook 에서) Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan 의 [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) 논문과 함께 발표했습니다. +1. **[MEGA](https://huggingface.co/docs/transformers/model_doc/mega)** (Facebook 에서 제공)은 Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer.의 [Mega: Moving Average Equipped Gated Attention](https://arxiv.org/abs/2209.10655)논문과 함께 발표했습니다. 1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (NVIDIA 에서) Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro 의 [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) 논문과 함께 발표했습니다. 1. 
**[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (NVIDIA 에서) Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro 의 [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) 논문과 함께 발표했습니다. +1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (Alibaba Research 에서 제공)은 Peng Wang, Cheng Da, and Cong Yao.의 [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592)논문과 함께 발표했습니다. +1. **[Mistral](https://huggingface.co/docs/transformers/model_doc/mistral)** (from Mistral AI) by The Mistral AI team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed.. +1. **[Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral)** (from Mistral AI) by The [Mistral AI](https://mistral.ai) team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. 1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (Studio Ousia 에서) Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka 의 [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) 논문과 함께 발표했습니다. +1. **[MMS](https://huggingface.co/docs/transformers/model_doc/mms)** (Facebook 에서 제공)은 Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli.의 [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516)논문과 함께 발표했습니다. 1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (CMU/Google Brain 에서) Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou 의 [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) 논문과 함께 발표했습니다. 1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (Google Inc. 에서) Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam 의 [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) 논문과 함께 발표했습니다. 1. **[MobileNetV2](https://huggingface.co/docs/transformers/model_doc/mobilenet_v2)** (Google Inc. 에서) Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen 의 [MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/abs/1801.04381) 논문과 함께 발표했습니다. 1. **[MobileViT](https://huggingface.co/docs/transformers/model_doc/mobilevit)** (Apple 에서) Sachin Mehta and Mohammad Rastegari 의 [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) 논문과 함께 발표했습니다. +1. 
**[MobileViTV2](https://huggingface.co/docs/transformers/model_doc/mobilevitv2)** (Apple 에서 제공)은 Sachin Mehta and Mohammad Rastegari.의 [Separable Self-attention for Mobile Vision Transformers](https://arxiv.org/abs/2206.02680)논문과 함께 발표했습니다. 1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (Microsoft Research 에서) Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu 의 [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) 논문과 함께 발표했습니다. +1. **[MPT](https://huggingface.co/docs/transformers/model_doc/mpt)** (MosaiML 에서 제공)은 the MosaicML NLP Team.의 [llm-foundry](https://github.com/mosaicml/llm-foundry/)논문과 함께 발표했습니다. +1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (the University of Wisconsin - Madison 에서 제공)은 Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh.의 [Multi Resolution Analysis (MRA)](https://arxiv.org/abs/2207.10284) 논문과 함께 발표했습니다. 1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (Google AI 에서) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel 의 [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) 논문과 함께 발표했습니다. +1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez. 1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (RUC AI Box 에서) Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen 의 [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) 논문과 함께 발표했습니다. 1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (SHI Labs 에서) Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi 의 [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) 논문과 함께 발표했습니다. 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (Huawei Noah’s Ark Lab 에서) Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu 의 [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) 논문과 함께 발표했습니다. 1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (Meta 에서) the NLLB team 의 [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) 논문과 함께 발표했습니다. +1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (Meta 에서 제공)은 the NLLB team.의 [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672)논문과 함께 발표했습니다. +1. **[Nougat](https://huggingface.co/docs/transformers/model_doc/nougat)** (Meta AI 에서 제공)은 Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic.의 [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418)논문과 함께 발표했습니다. 1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (the University of Wisconsin - Madison 에서) Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh 의 [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) 논문과 함께 발표했습니다. 1. 
**[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (SHI Labs 에서) Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi 의 [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) 논문과 함께 발표했습니다. +1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released on GitHub (now removed). 1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (Meta AI 에서) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al 의 [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) 논문과 함께 발표했습니다. 1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (Google AI 에서) Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby 의 [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) 논문과 함께 발표했습니다. +1. **[OWLv2](https://huggingface.co/docs/transformers/model_doc/owlv2)** (Google AI 에서 제공)은 Matthias Minderer, Alexey Gritsenko, Neil Houlsby.의 [Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683)논문과 함께 발표했습니다. +1. **[PatchTSMixer](https://huggingface.co/docs/transformers/model_doc/patchtsmixer)** ( IBM Research 에서 제공)은 Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, Jayant Kalagnanam.의 [TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting](https://arxiv.org/pdf/2306.09364.pdf)논문과 함께 발표했습니다. +1. **[PatchTST](https://huggingface.co/docs/transformers/model_doc/patchtst)** (IBM 에서 제공)은 Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, Jayant Kalagnanam.의 [A Time Series is Worth 64 Words: Long-term Forecasting with Transformers](https://arxiv.org/pdf/2211.14730.pdf)논문과 함께 발표했습니다. 1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (Google 에서) Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu 의 [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) 논문과 함께 발표했습니다. 1. **[PEGASUS-X](https://huggingface.co/docs/transformers/model_doc/pegasus_x)** (Google 에서) Jason Phang, Yao Zhao, Peter J. Liu 의 [Investigating Efficiently Extending Transformers for Long Input Summarization](https://arxiv.org/abs/2208.04347) 논문과 함께 발표했습니다. 1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (Deepmind 에서) Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira 의 [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) 논문과 함께 발표했습니다. +1. **[Persimmon](https://huggingface.co/docs/transformers/model_doc/persimmon)** (ADEPT 에서 제공)은 Erich Elsen, Augustus Odena, Maxwell Nye, Sağnak Taşırlar, Tri Dao, Curtis Hawthorne, Deepak Moparthi, Arushi Somani.의 [blog post](https://www.adept.ai/blog/persimmon-8b)논문과 함께 발표했습니다. +1. 
**[Phi](https://huggingface.co/docs/transformers/model_doc/phi)** (from Microsoft) released with the papers - [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li, [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee. 1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (VinAI Research 에서) Dat Quoc Nguyen and Anh Tuan Nguyen 의 [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) 논문과 함께 발표했습니다. +1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (Google 에서 제공)은 Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova.의 [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347)논문과 함께 발표했습니다. 1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (UCLA NLP 에서) Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang 의 [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) 논문과 함께 발표했습니다. 1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (Sea AI Labs 에서) Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng 의 [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) 논문과 함께 발표했습니다. +1. **[Pop2Piano](https://huggingface.co/docs/transformers/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee. 1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (Microsoft Research 에서) Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou 의 [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) 논문과 함께 발표했습니다. +1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (Nanjing University, The University of Hong Kong etc. 에서 제공)은 Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.의 [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf)논문과 함께 발표했습니다. 1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (NVIDIA 에서) Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius 의 [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) 논문과 함께 발표했습니다. +1. 
**[Qwen2](https://huggingface.co/docs/transformers/model_doc/qwen2)** (the Qwen team, Alibaba Group 에서 제공)은 Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou and Tianhang Zhu.의 [Qwen Technical Report](https://arxiv.org/abs/2309.16609)논문과 함께 발표했습니다. 1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (Facebook 에서) Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela 의 [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) 논문과 함께 발표했습니다. 1. **[REALM](https://huggingface.co/docs/transformers/model_doc/realm.html)** (Google Research 에서) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang 의 [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909) 논문과 함께 발표했습니다. 1. **[Reformer](https://huggingface.co/docs/transformers/model_doc/reformer)** (Google Research 에서) Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya 의 [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) 논문과 함께 발표했습니다. @@ -342,14 +398,21 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는 1. **[RoBERTa-PreLayerNorm](https://huggingface.co/docs/transformers/model_doc/roberta-prelayernorm)** (Facebook 에서) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli 의 [fairseq: A Fast, Extensible Toolkit for Sequence Modeling](https://arxiv.org/abs/1904.01038) 논문과 함께 발표했습니다. 1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (WeChatAI 에서) HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou 의 [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) 논문과 함께 발표했습니다. 1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (ZhuiyiTechnology 에서) Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu 의 a [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) 논문과 함께 발표했습니다. +1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (Bo Peng 에서 제공)은 Bo Peng.의 [this repo](https://github.com/BlinkDL/RWKV-LM)논문과 함께 발표했습니다. +1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team. +1. **[SeamlessM4Tv2](https://huggingface.co/docs/transformers/model_doc/seamless_m4t_v2)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team. 1. 
**[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (NVIDIA 에서) Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo 의 [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) 논문과 함께 발표했습니다. +1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (Meta AI 에서 제공)은 Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.의 [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf)논문과 함께 발표했습니다. 1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (ASAPP 에서) Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi 의 [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) 논문과 함께 발표했습니다. 1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (ASAPP 에서) Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi 의 [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) 논문과 함께 발표했습니다. -1. **[SpeechT5](https://huggingface.co/docs/transformers/main/model_doc/speecht5)** (Microsoft Research 에서 제공)은 Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.의 [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205)논문과 함께 발표했습니다. +1. **[SigLIP](https://huggingface.co/docs/transformers/model_doc/siglip)** (Google AI 에서 제공)은 Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer.의 [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343)논문과 함께 발표했습니다. +1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (Microsoft Research 에서 제공)은 Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.의 [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205)논문과 함께 발표했습니다. 1. **[SpeechToTextTransformer](https://huggingface.co/docs/transformers/model_doc/speech_to_text)** (Facebook 에서) Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino 의 [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) 논문과 함께 발표했습니다. 1. **[SpeechToTextTransformer2](https://huggingface.co/docs/transformers/model_doc/speech_to_text_2)** (Facebook 에서) Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau 의 [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) 논문과 함께 발표했습니다. 1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (Tel Aviv University 에서) Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy 의 [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) 논문과 함께 발표했습니다. 1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (Berkeley 에서) Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer 의 [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) 논문과 함께 발표했습니다. +1. 
**[StableLm](https://huggingface.co/docs/transformers/main/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu. +1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (MBZUAI 에서 제공)은 Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan.의 [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446)논문과 함께 발표했습니다. 1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (Microsoft 에서) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo 의 [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) 논문과 함께 발표했습니다. 1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (Microsoft 에서) Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo 의 [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) 논문과 함께 발표했습니다. 1. **[Swin2SR](https://huggingface.co/docs/transformers/model_doc/swin2sr)** (University of Würzburg 에서) Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte 의 [Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration](https://arxiv.org/abs/2209.11345) 논문과 함께 발표했습니다. @@ -364,37 +427,46 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는 1. **[Trajectory Transformer](https://huggingface.co/docs/transformers/model_doc/trajectory_transformers)** (the University of California at Berkeley 에서) Michael Janner, Qiyang Li, Sergey Levin 의 [Offline Reinforcement Learning as One Big Sequence Modeling Problem](https://arxiv.org/abs/2106.02039) 논문과 함께 발표했습니다. 1. **[Transformer-XL](https://huggingface.co/docs/transformers/model_doc/transfo-xl)** (Google/CMU 에서) Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov 의 [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) 논문과 함께 발표했습니다. 1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (Microsoft 에서) Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei 의 [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) 논문과 함께 발표했습니다. -1. **[TVLT](https://huggingface.co/docs/transformers/main/model_doc/tvlt)** (from UNC Chapel Hill 에서) Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal 의 [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) 논문과 함께 발표했습니다. +1. **[TVLT](https://huggingface.co/docs/transformers/model_doc/tvlt)** (from UNC Chapel Hill 에서) Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal 의 [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) 논문과 함께 발표했습니다. +1. 
**[TVP](https://huggingface.co/docs/transformers/model_doc/tvp)** (Intel 에서) Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding 의 [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) 논문과 함께 발표했습니다. 1. **[UL2](https://huggingface.co/docs/transformers/model_doc/ul2)** (Google Research 에서) Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzle 의 [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) 논문과 함께 발표했습니다. +1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (Google Research 에서 제공)은 Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant.의 [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi)논문과 함께 발표했습니다. 1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (Microsoft Research 에서) Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang 의 [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) 논문과 함께 발표했습니다. 1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (Microsoft Research 에서) Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu 의 [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) 논문과 함께 발표했습니다. +1. **[UnivNet](https://huggingface.co/docs/transformers/model_doc/univnet)** (from Kakao Corporation) released with the paper [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim. 1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (Peking University 에서 제공)은 Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun.의 [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221)논문과 함께 발표했습니다. 1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (Tsinghua University and Nankai University 에서) Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu 의 [Visual Attention Network](https://arxiv.org/pdf/2202.09741.pdf) 논문과 함께 발표했습니다. 1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (Multimedia Computing Group, Nanjing University 에서) Zhan Tong, Yibing Song, Jue Wang, Limin Wang 의 [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) 논문과 함께 발표했습니다. 1. **[ViLT](https://huggingface.co/docs/transformers/model_doc/vilt)** (NAVER AI Lab/Kakao Enterprise/Kakao Brain 에서) Wonjae Kim, Bokyung Son, Ildoo Kim 의 [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) 논문과 함께 발표했습니다. +1. **[VipLlava](https://huggingface.co/docs/transformers/model_doc/vipllava)** (University of Wisconsin–Madison 에서 제공)은 Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee.의 [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https://arxiv.org/abs/2312.00784)논문과 함께 발표했습니다. 1. 
**[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (Google AI 에서) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby 의 [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) 논문과 함께 발표했습니다. 1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (UCLA NLP 에서) Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang 의 [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) 논문과 함께 발표했습니다. 1. **[ViT Hybrid](https://huggingface.co/docs/transformers/model_doc/vit_hybrid)** (Google AI 에서) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby 의 [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) 논문과 함께 발표했습니다. +1. **[VitDet](https://huggingface.co/docs/transformers/model_doc/vitdet)** (Meta AI 에서 제공)은 Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He.의 [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527)논문과 함께 발표했습니다. 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (Meta AI 에서) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick 의 [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) 논문과 함께 발표했습니다. +1. **[ViTMatte](https://huggingface.co/docs/transformers/model_doc/vitmatte)** (HUST-VL 에서 제공)은 Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang.의 [ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers](https://arxiv.org/abs/2305.15272)논문과 함께 발표했습니다. 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (Meta AI 에서) Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas 의 [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) 논문과 함께 발표했습니다. +1. **[VITS](https://huggingface.co/docs/transformers/model_doc/vits)** (Kakao Enterprise 에서 제공)은 Jaehyeon Kim, Jungil Kong, Juhee Son.의 [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103)논문과 함께 발표했습니다. +1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (Facebook AI 에서) Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli 의 [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) 논문과 함께 발표했습니다. +1. **[Wav2Vec2-BERT](https://huggingface.co/docs/transformers/model_doc/wav2vec2-bert)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team. 1. 
**[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (Facebook AI 에서) Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino 의 [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) 논문과 함께 발표했습니다. 1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (Facebook AI 에서) Qiantong Xu, Alexei Baevski, Michael Auli 의 [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) 논문과 함께 발표했습니다. 1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (Microsoft Research 에서) Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei 의 [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) 논문과 함께 발표했습니다. 1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (OpenAI 에서) Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever 의 [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) 논문과 함께 발표했습니다. 1. **[X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)** (Microsoft Research 에서) Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling 의 [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) 논문과 함께 발표했습니다. -1. **[X-MOD](https://huggingface.co/docs/transformers/main/model_doc/xmod)** (Meta AI 에서 제공)은 Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe.의 [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](http://dx.doi.org/10.18653/v1/2022.naacl-main.255)논문과 함께 발표했습니다. +1. **[X-MOD](https://huggingface.co/docs/transformers/model_doc/xmod)** (Meta AI 에서 제공)은 Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe.의 [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](http://dx.doi.org/10.18653/v1/2022.naacl-main.255)논문과 함께 발표했습니다. 1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (Facebook AI 에서 제공) Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li 의 [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) 논문과 함께 발표했습니다. 1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (Facebook 에서) Guillaume Lample and Alexis Conneau 의 [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) 논문과 함께 발표했습니다. 1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (Microsoft Research 에서) Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou 의 [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) 논문과 함께 발표했습니다. 1. 
**[XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)** (Facebook AI 에서) Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov 의 [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) 논문과 함께 발표했습니다. 1. **[XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/model_doc/xlm-roberta-xl)** (Facebook AI 에서) Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau 의 [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://arxiv.org/abs/2105.00572) 논문과 함께 발표했습니다. 1. **[XLM-V](https://huggingface.co/docs/transformers/model_doc/xlm-v)** (Meta AI 에서) Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, Madian Khabsa 의 [XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models](https://arxiv.org/abs/2301.10472) 논문과 함께 발표했습니다. -1. **[XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet)** (Google/CMU 에서) Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le 의 [​XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) 논문과 함께 발표했습니다. +1. **[XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet)** (Google/CMU 에서) Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le 의 [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) 논문과 함께 발표했습니다. 1. **[XLS-R](https://huggingface.co/docs/transformers/model_doc/xls_r)** (Facebook AI 에서) Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli 의 [XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale](https://arxiv.org/abs/2111.09296) 논문과 함께 발표했습니다. 1. **[XLSR-Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/xlsr_wav2vec2)** (Facebook AI 에서) Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli 의 [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://arxiv.org/abs/2006.13979) 논문과 함께 발표했습니다. 1. **[YOLOS](https://huggingface.co/docs/transformers/model_doc/yolos)** (Huazhong University of Science & Technology 에서) Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu 의 [You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection](https://arxiv.org/abs/2106.00666) 논문과 함께 발표했습니다. -1. **[YOSO](https://huggingface.co/docs/transformers/model_doc/yoso)** (the University of Wisconsin - Madison 에서) Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh 의 [You Only Sample (Almost) 논문과 함께 발표했습니다. +1. **[YOSO](https://huggingface.co/docs/transformers/model_doc/yoso)** (the University of Wisconsin - Madison 에서) Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh 의 [You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling](https://arxiv.org/abs/2111.09714) 논문과 함께 발표했습니다. 1. 새로운 모델을 올리고 싶나요? 우리가 **상세한 가이드와 템플릿** 으로 새로운 모델을 올리도록 도와드릴게요. 가이드와 템플릿은 이 저장소의 [`templates`](./templates) 폴더에서 확인하실 수 있습니다. [컨트리뷰션 가이드라인](./CONTRIBUTING.md)을 꼭 확인해주시고, PR을 올리기 전에 메인테이너에게 연락하거나 이슈를 오픈해 피드백을 받으시길 바랍니다. 
각 모델이 Flax, PyTorch, TensorFlow으로 구현되었는지 또는 🤗 Tokenizers 라이브러리가 지원하는 토크나이저를 사용하는지 확인하려면, [이 표](https://huggingface.co/docs/transformers/index#supported-frameworks)를 확인하세요. diff --git a/README_pt-br.md b/README_pt-br.md new file mode 100644 index 00000000000000..40841bd82b9f8a --- /dev/null +++ b/README_pt-br.md @@ -0,0 +1,568 @@ + + +

+ + + + Hugging Face Transformers Library + +
+
+

+ +

+ + Build + + + GitHub + + + Documentation + + + GitHub release + + + Contributor Covenant + + DOI +

+ +

+

+ English | + 简体中文 | + 繁體中文 | + 한국어 | + Español | + 日本語 | + हिन्दी | + Русский | + Português | + తెలుగు | + Français | + Deutsch | +

+

+ +

+

Aprendizado de máquina de última geração para JAX, PyTorch e TensorFlow

+

+ +

+ +

+ + +A biblioteca 🤗 Transformers oferece milhares de modelos pré-treinados para executar tarefas em diferentes modalidades, como texto, visão e áudio. + +Esses modelos podem ser aplicados a: + +* 📝 Texto, para tarefas como classificação de texto, extração de informações, resposta a perguntas, sumarização, tradução, geração de texto, em mais de 100 idiomas. +* 🖼️ Imagens, para tarefas como classificação de imagens, detecção de objetos e segmentação. +* 🗣️ Áudio, para tarefas como reconhecimento de fala e classificação de áudio. + +Os modelos Transformer também podem executar tarefas em diversas modalidades combinadas, como responder a perguntas em tabelas, reconhecimento óptico de caracteres, extração de informações de documentos digitalizados, classificação de vídeo e resposta a perguntas visuais. + + +A biblioteca 🤗 Transformers oferece APIs para baixar e usar rapidamente esses modelos pré-treinados em um texto específico, ajustá-los em seus próprios conjuntos de dados e, em seguida, compartilhá-los com a comunidade em nosso [model hub](https://huggingface.co/models). Ao mesmo tempo, cada módulo Python que define uma arquitetura é totalmente independente e pode ser modificado para permitir experimentos de pesquisa rápidos. + +A biblioteca 🤗 Transformers é respaldada pelas três bibliotecas de aprendizado profundo mais populares — [Jax](https://jax.readthedocs.io/en/latest/), [PyTorch](https://pytorch.org/) e [TensorFlow](https://www.tensorflow.org/) — com uma integração perfeita entre elas. É simples treinar seus modelos com uma delas antes de carregá-los para inferência com a outra + +## Demonstração Online + +Você pode testar a maioria de nossos modelos diretamente em suas páginas a partir do [model hub](https://huggingface.co/models). Também oferecemos [hospedagem de modelos privados, versionamento e uma API de inferência](https://huggingface.co/pricing) +para modelos públicos e privados. 
+ +Aqui estão alguns exemplos: + +Em Processamento de Linguagem Natural: + +- [Completar palavra mascarada com BERT](https://huggingface.co/google-bert/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France) +- [Reconhecimento de Entidades Nomeadas com Electra](https://huggingface.co/dbmdz/electra-large-discriminator-finetuned-conll03-english?text=My+name+is+Sarah+and+I+live+in+London+city) +- [Geração de texto com GPT-2](https://huggingface.co/openai-community/gpt2?text=A+long+time+ago%2C) +- [Inferência de Linguagem Natural com RoBERTa](https://huggingface.co/FacebookAI/roberta-large-mnli?text=The+dog+was+lost.+Nobody+lost+any+animal) +- [Sumarização com BART](https://huggingface.co/facebook/bart-large-cnn?text=The+tower+is+324+metres+%281%2C063+ft%29+tall%2C+about+the+same+height+as+an+81-storey+building%2C+and+the+tallest+structure+in+Paris.+Its+base+is+square%2C+measuring+125+metres+%28410+ft%29+on+each+side.+During+its+construction%2C+the+Eiffel+Tower+surpassed+the+Washington+Monument+to+become+the+tallest+man-made+structure+in+the+world%2C+a+title+it+held+for+41+years+until+the+Chrysler+Building+in+New+York+City+was+finished+in+1930.+It+was+the+first+structure+to+reach+a+height+of+300+metres.+Due+to+the+addition+of+a+broadcasting+aerial+at+the+top+of+the+tower+in+1957%2C+it+is+now+taller+than+the+Chrysler+Building+by+5.2+metres+%2817+ft%29.+Excluding+transmitters%2C+the+Eiffel+Tower+is+the+second+tallest+free-standing+structure+in+France+after+the+Millau+Viaduct) +- [Resposta a perguntas com DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased-distilled-squad?text=Which+name+is+also+used+to+describe+the+Amazon+rainforest+in+English%3F&context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species) +- [Tradução com T5](https://huggingface.co/google-t5/t5-base?text=My+name+is+Wolfgang+and+I+live+in+Berlin) + + +Em Visão Computacional: +- [Classificação de Imagens com ViT](https://huggingface.co/google/vit-base-patch16-224) +- [Detecção de Objetos com DETR](https://huggingface.co/facebook/detr-resnet-50) +- [Segmentação Semântica com SegFormer](https://huggingface.co/nvidia/segformer-b0-finetuned-ade-512-512) +- [Segmentação Panóptica com MaskFormer](https://huggingface.co/facebook/maskformer-swin-small-coco) +- [Estimativa de Profundidade com DPT](https://huggingface.co/docs/transformers/model_doc/dpt) +- [Classificação de Vídeo com 
VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae) +- [Segmentação Universal com OneFormer](https://huggingface.co/shi-labs/oneformer_ade20k_dinat_large) + + +Em Áudio: +- [Reconhecimento Automático de Fala com Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base-960h) +- [Detecção de Palavras-Chave com Wav2Vec2](https://huggingface.co/superb/wav2vec2-base-superb-ks) +- [Classificação de Áudio com Transformer de Espectrograma de Áudio](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) + +Em Tarefas Multimodais: +- [Respostas de Perguntas em Tabelas com TAPAS](https://huggingface.co/google/tapas-base-finetuned-wtq) +- [Respostas de Perguntas Visuais com ViLT](https://huggingface.co/dandelin/vilt-b32-finetuned-vqa) +- [Classificação de Imagens sem Anotação com CLIP](https://huggingface.co/openai/clip-vit-large-patch14) +- [Respostas de Perguntas em Documentos com LayoutLM](https://huggingface.co/impira/layoutlm-document-qa) +- [Classificação de Vídeo sem Anotação com X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip) + +## 100 Projetos Usando Transformers + +Transformers é mais do que um conjunto de ferramentas para usar modelos pré-treinados: é uma comunidade de projetos construídos em torno dele e do Hugging Face Hub. Queremos que o Transformers permita que desenvolvedores, pesquisadores, estudantes, professores, engenheiros e qualquer outra pessoa construam os projetos dos seus sonhos. + +Para celebrar as 100.000 estrelas do Transformers, decidimos destacar a comunidade e criamos a página [awesome-transformers](./awesome-transformers.md), que lista 100 projetos incríveis construídos em torno do Transformers. + +Se você possui ou utiliza um projeto que acredita que deveria fazer parte da lista, abra um PR para adicioná-lo! + +## Se você está procurando suporte personalizado da equipe Hugging Face + + + HuggingFace Expert Acceleration Program
+ + +## Tour Rápido + +Para usar imediatamente um modelo em uma entrada específica (texto, imagem, áudio, ...), oferecemos a API `pipeline`. Os pipelines agrupam um modelo pré-treinado com o pré-processamento que foi usado durante o treinamento desse modelo. Aqui está como usar rapidamente um pipeline para classificar textos como positivos ou negativos: + +```python +>>> from transformers import pipeline + +# Carregue o pipeline de classificação de texto +>>> classifier = pipeline("sentiment-analysis") + +# Classifique o texto como positivo ou negativo +>>> classifier("Estamos muito felizes em apresentar o pipeline no repositório dos transformers.") +[{'label': 'POSITIVE', 'score': 0.9996980428695679}] +``` + +A segunda instrução (a chamada a `pipeline("sentiment-analysis")`) baixa e armazena em cache o modelo pré-treinado usado pelo pipeline, enquanto a terceira o avalia no texto fornecido. Neste exemplo, a resposta é "positiva" com uma confiança de 99,97%. + +Muitas tarefas têm um `pipeline` pré-treinado pronto para uso, não apenas em PNL, mas também em visão computacional e processamento de áudio. Por exemplo, podemos facilmente extrair objetos detectados em uma imagem: + +```python +>>> import requests +>>> from PIL import Image +>>> from transformers import pipeline + +# Download an image with cute cats +>>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/coco_sample.png" +>>> image_data = requests.get(url, stream=True).raw +>>> image = Image.open(image_data) + +# Allocate a pipeline for object detection +>>> object_detector = pipeline('object-detection') +>>> object_detector(image) +[{'score': 0.9982201457023621, + 'label': 'remote', + 'box': {'xmin': 40, 'ymin': 70, 'xmax': 175, 'ymax': 117}}, + {'score': 0.9960021376609802, + 'label': 'remote', + 'box': {'xmin': 333, 'ymin': 72, 'xmax': 368, 'ymax': 187}}, + {'score': 0.9954745173454285, + 'label': 'couch', + 'box': {'xmin': 0, 'ymin': 1, 'xmax': 639, 'ymax': 473}}, + {'score': 0.9988006353378296, + 'label': 'cat', + 'box': {'xmin': 13, 'ymin': 52, 'xmax': 314, 'ymax': 470}}, + {'score': 0.9986783862113953, + 'label': 'cat', + 'box': {'xmin': 345, 'ymin': 23, 'xmax': 640, 'ymax': 368}}] +``` + + +Aqui obtemos uma lista de objetos detectados na imagem, com uma caixa envolvendo o objeto e uma pontuação de confiança. Aqui está a imagem original à esquerda, com as previsões exibidas à direita: +

+ + +

+ +Você pode aprender mais sobre as tarefas suportadas pela API `pipeline` em [este tutorial](https://huggingface.co/docs/transformers/task_summary). + + +Além do `pipeline`, para baixar e usar qualquer um dos modelos pré-treinados em sua tarefa específica, tudo o que é necessário são três linhas de código. Aqui está a versão em PyTorch: + +```python +>>> from transformers import AutoTokenizer, AutoModel + +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = AutoModel.from_pretrained("google-bert/bert-base-uncased") + +>>> inputs = tokenizer("Hello world!", return_tensors="pt") +>>> outputs = model(**inputs) +``` + +E aqui está o código equivalente para TensorFlow: + +```python +>>> from transformers import AutoTokenizer, TFAutoModel + +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = TFAutoModel.from_pretrained("google-bert/bert-base-uncased") + +>>> inputs = tokenizer("Hello world!", return_tensors="tf") +>>> outputs = model(**inputs) +``` + +O tokenizador é responsável por todo o pré-processamento que o modelo pré-treinado espera, e pode ser chamado diretamente em uma única string (como nos exemplos acima) ou em uma lista. Ele produzirá um dicionário que você pode usar no código subsequente ou simplesmente passar diretamente para o seu modelo usando o operador de descompactação de argumentos **. + +O modelo em si é um [Pytorch `nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) ou um [TensorFlow `tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model)(dependendo do seu back-end) que você pode usar como de costume. [Este tutorial](https://huggingface.co/docs/transformers/training) explica como integrar esse modelo em um ciclo de treinamento clássico do PyTorch ou TensorFlow, ou como usar nossa API `Trainer` para ajuste fino rápido em um novo conjunto de dados. + +## Por que devo usar transformers? + +1. Modelos state-of-the-art fáceis de usar: + - Alto desempenho em compreensão e geração de linguagem natural, visão computacional e tarefas de áudio. + - Barreira de entrada baixa para educadores e profissionais. + - Poucas abstrações visíveis para o usuário, com apenas três classes para aprender. + - Uma API unificada para usar todos os nossos modelos pré-treinados. + +1. Menores custos de computação, menor pegada de carbono: + - Pesquisadores podem compartilhar modelos treinados em vez de treinar sempre do zero. + - Profissionais podem reduzir o tempo de computação e os custos de produção. + - Dezenas de arquiteturas com mais de 60.000 modelos pré-treinados em todas as modalidades. + +1. Escolha o framework certo para cada parte da vida de um modelo: + - Treine modelos state-of-the-art em 3 linhas de código. + - Mova um único modelo entre frameworks TF2.0/PyTorch/JAX à vontade. + - Escolha o framework certo de forma contínua para treinamento, avaliação e produção. + +1. Personalize facilmente um modelo ou um exemplo para atender às suas necessidades: + - Fornecemos exemplos para cada arquitetura para reproduzir os resultados publicados pelos autores originais. + - Os detalhes internos do modelo são expostos de maneira consistente. + - Os arquivos do modelo podem ser usados de forma independente da biblioteca para experimentos rápidos. + +## Por que não devo usar transformers? + +- Esta biblioteca não é uma caixa de ferramentas modular para construir redes neurais. 
O código nos arquivos do modelo propositalmente não é refatorado com abstrações adicionais, para que os pesquisadores possam iterar rapidamente em cada um dos modelos sem se aprofundar em abstrações/arquivos adicionais. +- A API de treinamento não é projetada para funcionar com qualquer modelo, mas é otimizada para funcionar com os modelos fornecidos pela biblioteca. Para loops de aprendizado de máquina genéricos, você deve usar outra biblioteca (possivelmente, [Accelerate](https://huggingface.co/docs/accelerate)). +- Embora nos esforcemos para apresentar o maior número possível de casos de uso, os scripts em nossa [pasta de exemplos](https://github.com/huggingface/transformers/tree/main/examples) são apenas isso: exemplos. É esperado que eles não funcionem prontos para uso em seu problema específico e que seja necessário modificar algumas linhas de código para adaptá-los às suas necessidades. + +## Instalação + +### Com pip + +Este repositório é testado no Python 3.8+, Flax 0.4.1+, PyTorch 1.11+ e TensorFlow 2.6+. + +Você deve instalar o 🤗 Transformers em um [ambiente virtual](https://docs.python.org/3/library/venv.html). Se você não está familiarizado com ambientes virtuais em Python, confira o [guia do usuário](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/). + +Primeiro, crie um ambiente virtual com a versão do Python que você vai usar e ative-o. + +Em seguida, você precisará instalar pelo menos um dos back-ends Flax, PyTorch ou TensorFlow. +Consulte a [página de instalação do TensorFlow](https://www.tensorflow.org/install/), a [página de instalação do PyTorch](https://pytorch.org/get-started/locally/#start-locally) e/ou as páginas de instalação do [Flax](https://github.com/google/flax#quick-install) e do [Jax](https://github.com/google/jax#installation) para obter o comando de instalação específico para a sua plataforma. + +Quando um desses back-ends estiver instalado, o 🤗 Transformers pode ser instalado usando pip da seguinte forma: + +```bash +pip install transformers +``` +Se você deseja experimentar com os exemplos ou precisa da versão mais recente do código e não pode esperar por um novo lançamento, você deve instalar a [biblioteca a partir do código-fonte](https://huggingface.co/docs/transformers/installation#installing-from-source). + +### Com conda + +O 🤗 Transformers pode ser instalado com conda da seguinte forma: + +```bash +conda install conda-forge::transformers +``` + +> **_NOTA:_** Instalar `transformers` pelo canal `huggingface` está obsoleto. + +Siga as páginas de instalação do Flax, PyTorch ou TensorFlow para ver como instalá-los com o conda. + +> **_NOTA:_** No Windows, você pode ser solicitado a ativar o Modo de Desenvolvedor para aproveitar o cache. Se isso não for uma opção para você, por favor nos avise [neste problema](https://github.com/huggingface/huggingface_hub/issues/1062). + +## Arquiteturas de Modelos + +**[Todos os pontos de verificação de modelo](https://huggingface.co/models)** fornecidos pelo 🤗 Transformers são integrados de forma transparente do [model hub](https://huggingface.co/models) do huggingface.co, onde são carregados diretamente por [usuários](https://huggingface.co/users) e [organizações](https://huggingface.co/organizations). 
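+A título de ilustração, segue um esboço mínimo de como um desses pontos de verificação pode ser carregado diretamente pelo seu identificador no hub, seja com as classes `Auto*`, seja com um `pipeline`; o checkpoint `google-bert/bert-base-uncased`, já citado acima, é usado aqui apenas como exemplo:
+
+```python
+from transformers import AutoModel, AutoTokenizer, pipeline
+
+# Qualquer identificador de checkpoint do hub pode ser usado aqui;
+# "google-bert/bert-base-uncased" serve apenas como exemplo.
+checkpoint = "google-bert/bert-base-uncased"
+
+# Baixa (e armazena em cache) o tokenizador e os pesos do modelo a partir do hub.
+tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+model = AutoModel.from_pretrained(checkpoint)
+
+# O mesmo identificador também pode ser passado diretamente a um pipeline.
+extrator = pipeline("feature-extraction", model=checkpoint, tokenizer=checkpoint)
+```
+
+Para as demais arquiteturas listadas abaixo, normalmente basta trocar o identificador do checkpoint e, quando necessário, usar a classe `Auto*` correspondente à tarefa.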
+ +Número atual de pontos de verificação: ![](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/models&color=brightgreen) + +🤗 Transformers atualmente fornece as seguintes arquiteturas (veja [aqui](https://huggingface.co/docs/transformers/model_summary) para um resumo de alto nível de cada uma delas): + +1. **[ALBERT](https://huggingface.co/docs/transformers/model_doc/albert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. +1. **[ALIGN](https://huggingface.co/docs/transformers/model_doc/align)** (from Google Research) released with the paper [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918) by Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. +1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell. +1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass. +1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long. +1. **[Bark](https://huggingface.co/docs/transformers/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team. +1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer. +1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis. +1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen. +1. **[BEiT](https://huggingface.co/docs/transformers/model_doc/beit)** (from Microsoft) released with the paper [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254) by Hangbo Bao, Li Dong, Furu Wei. +1. 
**[BERT](https://huggingface.co/docs/transformers/model_doc/bert)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. +1. **[BERT For Sequence Generation](https://huggingface.co/docs/transformers/model_doc/bert-generation)** (from Google) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn. +1. **[BERTweet](https://huggingface.co/docs/transformers/model_doc/bertweet)** (from VinAI Research) released with the paper [BERTweet: A pre-trained language model for English Tweets](https://aclanthology.org/2020.emnlp-demos.2/) by Dat Quoc Nguyen, Thanh Vu and Anh Tuan Nguyen. +1. **[BigBird-Pegasus](https://huggingface.co/docs/transformers/model_doc/bigbird_pegasus)** (from Google Research) released with the paper [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed. +1. **[BigBird-RoBERTa](https://huggingface.co/docs/transformers/model_doc/big_bird)** (from Google Research) released with the paper [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed. +1. **[BioGpt](https://huggingface.co/docs/transformers/model_doc/biogpt)** (from Microsoft Research AI4Science) released with the paper [BioGPT: generative pre-trained transformer for biomedical text generation and mining](https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac409/6713511?guestAccessKey=a66d9b5d-4f83-4017-bb52-405815c907b9) by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu. +1. **[BiT](https://huggingface.co/docs/transformers/model_doc/bit)** (from Google AI) released with the paper [Big Transfer (BiT): General Visual Representation Learning](https://arxiv.org/abs/1912.11370) by Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, Neil Houlsby. +1. **[Blenderbot](https://huggingface.co/docs/transformers/model_doc/blenderbot)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston. +1. **[BlenderbotSmall](https://huggingface.co/docs/transformers/model_doc/blenderbot-small)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston. +1. **[BLIP](https://huggingface.co/docs/transformers/model_doc/blip)** (from Salesforce) released with the paper [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086) by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi. +1. 
**[BLIP-2](https://huggingface.co/docs/transformers/model_doc/blip-2)** (from Salesforce) released with the paper [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597) by Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. +1. **[BLOOM](https://huggingface.co/docs/transformers/model_doc/bloom)** (from BigScience workshop) released by the [BigScience Workshop](https://bigscience.huggingface.co/). +1. **[BORT](https://huggingface.co/docs/transformers/model_doc/bort)** (from Alexa) released with the paper [Optimal Subarchitecture Extraction For BERT](https://arxiv.org/abs/2010.10499) by Adrian de Wynter and Daniel J. Perry. +1. **[BridgeTower](https://huggingface.co/docs/transformers/model_doc/bridgetower)** (from Harbin Institute of Technology/Microsoft Research Asia/Intel Labs) released with the paper [BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning](https://arxiv.org/abs/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. +1. **[BROS](https://huggingface.co/docs/transformers/model_doc/bros)** (from NAVER CLOVA) released with the paper [BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents](https://arxiv.org/abs/2108.04539) by Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park. +1. **[ByT5](https://huggingface.co/docs/transformers/model_doc/byt5)** (from Google Research) released with the paper [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel. +1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot. +1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. +1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou. +1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. +1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. +1. 
**[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker. +1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong. +1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (from MetaAI) released with the paper [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve. +1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (from Microsoft Research Asia) released with the paper [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang. +1. **[ConvBERT](https://huggingface.co/docs/transformers/model_doc/convbert)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan. +1. **[ConvNeXT](https://huggingface.co/docs/transformers/model_doc/convnext)** (from Facebook AI) released with the paper [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie. +1. **[ConvNeXTV2](https://huggingface.co/docs/transformers/model_doc/convnextv2)** (from Facebook AI) released with the paper [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie. +1. **[CPM](https://huggingface.co/docs/transformers/model_doc/cpm)** (from Tsinghua University) released with the paper [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun. +1. **[CPM-Ant](https://huggingface.co/docs/transformers/model_doc/cpmant)** (from OpenBMB) released by the [OpenBMB](https://www.openbmb.org/). +1. **[CTRL](https://huggingface.co/docs/transformers/model_doc/ctrl)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher. +1. 
**[CvT](https://huggingface.co/docs/transformers/model_doc/cvt)** (from Microsoft) released with the paper [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang. +1. **[Data2Vec](https://huggingface.co/docs/transformers/model_doc/data2vec)** (from Facebook) released with the paper [Data2Vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli. +1. **[DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. +1. **[DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. +1. **[Decision Transformer](https://huggingface.co/docs/transformers/model_doc/decision_transformer)** (from Berkeley/Facebook/Google) released with the paper [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345) by Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch. +1. **[Deformable DETR](https://huggingface.co/docs/transformers/model_doc/deformable_detr)** (from SenseTime Research) released with the paper [Deformable DETR: Deformable Transformers for End-to-End Object Detection](https://arxiv.org/abs/2010.04159) by Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai. +1. **[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou. +1. **[DePlot](https://huggingface.co/docs/transformers/model_doc/deplot)** (from Google AI) released with the paper [DePlot: One-shot visual language reasoning by plot-to-table translation](https://arxiv.org/abs/2212.10505) by Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, Yasemin Altun. +1. **[DETA](https://huggingface.co/docs/transformers/model_doc/deta)** (from The University of Texas at Austin) released with the paper [NMS Strikes Back](https://arxiv.org/abs/2212.06137) by Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl. +1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko. +1. **[DialoGPT](https://huggingface.co/docs/transformers/model_doc/dialogpt)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan. +1. 
**[DiNAT](https://huggingface.co/docs/transformers/model_doc/dinat)** (from SHI Labs) released with the paper [Dilated Neighborhood Attention Transformer](https://arxiv.org/abs/2209.15001) by Ali Hassani and Humphrey Shi. +1. **[DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2)** (from Meta AI) released with the paper [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski. +1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) and a German version of DistilBERT. +1. **[DiT](https://huggingface.co/docs/transformers/model_doc/dit)** (from Microsoft Research) released with the paper [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei. +1. **[Donut](https://huggingface.co/docs/transformers/model_doc/donut)** (from NAVER), released together with the paper [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park. +1. **[DPR](https://huggingface.co/docs/transformers/model_doc/dpr)** (from Facebook) released with the paper [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. +1. **[DPT](https://huggingface.co/docs/transformers/master/model_doc/dpt)** (from Intel Labs) released with the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun. +1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (from Snap Research) released with the paper [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) by Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren. +1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le. +1. 
**[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. +1. **[EnCodec](https://huggingface.co/docs/transformers/model_doc/encodec)** (from Meta AI) released with the paper [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438) by Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi. +1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn. +1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (from Baidu) released with the paper [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) by Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu. +1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (from Baidu) released with the paper [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) by Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. +1. **[ESM](https://huggingface.co/docs/transformers/model_doc/esm)** (from Meta AI) are transformer protein language models. **ESM-1b** was released with the paper [Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences](https://www.pnas.org/content/118/15/e2016239118) by Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. **ESM-1v** was released with the paper [Language models enable zero-shot prediction of the effects of mutations on protein function](https://doi.org/10.1101/2021.07.09.450648) by Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu and Alexander Rives. **ESM-2 and ESMFold** were released with the paper [Language models of protein sequences at the scale of evolution enable accurate structure prediction](https://doi.org/10.1101/2022.07.20.500902) by Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, Alexander Rives. +1. **[Falcon](https://huggingface.co/docs/transformers/model_doc/falcon)** (from Technology Innovation Institute) by Almazrouei, Ebtesam and Alobeidli, Hamza and Alshamsi, Abdulaziz and Cappelli, Alessandro and Cojocaru, Ruxandra and Debbah, Merouane and Goffinet, Etienne and Heslow, Daniel and Launay, Julien and Malartic, Quentin and Noune, Badreddine and Pannier, Baptiste and Penedo, Guilherme. +1. 
**[FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei +1. **[FLAN-UL2](https://huggingface.co/docs/transformers/model_doc/flan-ul2)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-ul2-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei +1. **[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab. +1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc/flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. +1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. +1. **[FocalNet](https://huggingface.co/docs/transformers/model_doc/focalnet)** (from Microsoft Research) released with the paper [Focal Modulation Networks](https://arxiv.org/abs/2203.11926) by Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao. +1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le. +1. **[GIT](https://huggingface.co/docs/transformers/model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang. +1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim. +1. 
**[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://openai.com/research/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. +1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. +1. **[GPT NeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox)** (from EleutherAI) released with the paper [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745) by Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach +1. **[GPT NeoX Japanese](https://huggingface.co/docs/transformers/model_doc/gpt_neox_japanese)** (from ABEJA) released by Shinya Otani, Takayoshi Makabe, Anuj Arora, and Kyo Hattori. +1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://openai.com/research/better-language-models/) by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. +1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki. +1. **[GPT-Sw3](https://huggingface.co/docs/transformers/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren. +1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra. +1. **[GPTSAN-japanese](https://huggingface.co/docs/transformers/model_doc/gptsan-japanese)** released in the repository [tanreinama/GPTSAN](https://github.com/tanreinama/GPTSAN/blob/main/report/model.md) by Toshiyuki Sakamoto(tanreinama). +1. 
**[Graphormer](https://huggingface.co/docs/transformers/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu. +1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang. +1. **[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** (from Allegro.pl, AGH University of Science and Technology) released with the paper [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik. +1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed. +1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer. +1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. +1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. +1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. +1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (from Salesforce) released with the paper [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. +1. **[Jukebox](https://huggingface.co/docs/transformers/model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever. +1. 
**[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou. +1. **[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou. +1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (from Microsoft Research Asia) released with the paper [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei. +1. **[LayoutXLM](https://huggingface.co/docs/transformers/model_doc/layoutxlm)** (from Microsoft Research Asia) released with the paper [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/abs/2104.08836) by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei. +1. **[LED](https://huggingface.co/docs/transformers/model_doc/led)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan. +1. **[LeViT](https://huggingface.co/docs/transformers/model_doc/levit)** (from Meta AI) released with the paper [LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference](https://arxiv.org/abs/2104.01136) by Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, Matthijs Douze. +1. **[LiLT](https://huggingface.co/docs/transformers/model_doc/lilt)** (from South China University of Technology) released with the paper [LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding](https://arxiv.org/abs/2202.13669) by Jiapeng Wang, Lianwen Jin, Kai Ding. +1. **[LLaMA](https://huggingface.co/docs/transformers/model_doc/llama)** (from The FAIR team of Meta AI) released with the paper [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971) by Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. +1. 
**[Llama2](https://huggingface.co/docs/transformers/model_doc/llama2)** (from The FAIR team of Meta AI) released with the paper [Llama2: Open Foundation and Fine-Tuned Chat Models](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/) by Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom. +1. **[Longformer](https://huggingface.co/docs/transformers/model_doc/longformer)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan. +1. **[LongT5](https://huggingface.co/docs/transformers/model_doc/longt5)** (from Google AI) released with the paper [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/abs/2112.07916) by Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang. +1. **[LUKE](https://huggingface.co/docs/transformers/model_doc/luke)** (from Studio Ousia) released with the paper [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto. +1. **[LXMERT](https://huggingface.co/docs/transformers/model_doc/lxmert)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal. +1. **[M-CTC-T](https://huggingface.co/docs/transformers/model_doc/mctct)** (from Facebook) released with the paper [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161) by Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert. +1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin. +1. **[MADLAD-400](https://huggingface.co/docs/transformers/model_doc/madlad-400)** (from Google) released with the paper [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](https://arxiv.org/abs/2309.04662) by Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. 
Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat. +1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team. +1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (from Microsoft Research Asia) released with the paper [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei. +1. **[Mask2Former](https://huggingface.co/docs/transformers/model_doc/mask2former)** (from FAIR and UIUC) released with the paper [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) by Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar. +1. **[MaskFormer](https://huggingface.co/docs/transformers/model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov. +1. **[MatCha](https://huggingface.co/docs/transformers/model_doc/matcha)** (from Google AI) released with the paper [MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering](https://arxiv.org/abs/2212.09662) by Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, Julian Martin Eisenschlos. +1. **[mBART](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. +1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan. +1. **[MEGA](https://huggingface.co/docs/transformers/model_doc/mega)** (from Meta/USC/CMU/SJTU) released with the paper [Mega: Moving Average Equipped Gated Attention](https://arxiv.org/abs/2209.10655) by Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. +1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. +1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. +1. 
**[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (from Alibaba Research) released with the paper [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) by Peng Wang, Cheng Da, and Cong Yao. +1. **[Mistral](https://huggingface.co/docs/transformers/model_doc/mistral)** (from Mistral AI) by The [Mistral AI](https://mistral.ai) team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. +1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (from Studio Ousia) released with the paper [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka. +1. **[MMS](https://huggingface.co/docs/transformers/model_doc/mms)** (from Facebook) released with the paper [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516) by Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli. +1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. +1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (from Google Inc.) released with the paper [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam. +1. **[MobileNetV2](https://huggingface.co/docs/transformers/model_doc/mobilenet_v2)** (from Google Inc.) released with the paper [MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/abs/1801.04381) by Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen. +1. **[MobileViT](https://huggingface.co/docs/transformers/model_doc/mobilevit)** (from Apple) released with the paper [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari. +1. **[MobileViTV2](https://huggingface.co/docs/transformers/model_doc/mobilevitv2)** (from Apple) released with the paper [Separable Self-attention for Mobile Vision Transformers](https://arxiv.org/abs/2206.02680) by Sachin Mehta and Mohammad Rastegari. +1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu. +1. **[MPT](https://huggingface.co/docs/transformers/model_doc/mpt)** (from MosaicML) released with the repository [llm-foundry](https://github.com/mosaicml/llm-foundry/) by the MosaicML NLP Team. +1. 
**[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (from the University of Wisconsin - Madison) released with the paper [Multi Resolution Analysis (MRA) for Approximate Self-Attention](https://arxiv.org/abs/2207.10284) by Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh. +1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel. +1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez. +1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (from RUC AI Box) released with the paper [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen. +1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (from SHI Labs) released with the paper [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi. +1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu. +1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team. +1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team. +1. **[Nougat](https://huggingface.co/docs/transformers/model_doc/nougat)** (from Meta AI) released with the paper [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) by Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic. +1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh. +1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi. +1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released on GitHub (now removed). +1. 
**[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al. +1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. +1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu. +1. **[PEGASUS-X](https://huggingface.co/docs/transformers/model_doc/pegasus_x)** (from Google) released with the paper [Investigating Efficiently Extending Transformers for Long Input Summarization](https://arxiv.org/abs/2208.04347) by Jason Phang, Yao Zhao, and Peter J. Liu. +1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira. +1. **[Persimmon](https://huggingface.co/docs/transformers/model_doc/persimmon)** (from ADEPT) released in a [blog post](https://www.adept.ai/blog/persimmon-8b) by Erich Elsen, Augustus Odena, Maxwell Nye, Sağnak Taşırlar, Tri Dao, Curtis Hawthorne, Deepak Moparthi, Arushi Somani. +1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen. +1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (from Google) released with the paper [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) by Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova. +1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. +1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng. +1. 
**[Pop2Piano](https://huggingface.co/docs/transformers/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi and Kyogu Lee. +1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. +1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (from Nanjing University, The University of Hong Kong etc.) released with the paper [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao. +1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius. +1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (from Facebook) released with the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) by Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. +1. **[REALM](https://huggingface.co/docs/transformers/model_doc/realm.html)** (from Google Research) released with the paper [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909) by Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang. +1. **[Reformer](https://huggingface.co/docs/transformers/model_doc/reformer)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya. +1. **[RegNet](https://huggingface.co/docs/transformers/model_doc/regnet)** (from META Platforms) released with the paper [Designing Network Design Space](https://arxiv.org/abs/2003.13678) by Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, Piotr Dollár. +1. **[RemBERT](https://huggingface.co/docs/transformers/model_doc/rembert)** (from Google Research) released with the paper [Rethinking embedding coupling in pre-trained language models](https://arxiv.org/abs/2010.12821) by Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder. +1. **[ResNet](https://huggingface.co/docs/transformers/model_doc/resnet)** (from Microsoft Research) released with the paper [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) by Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. +1. **[RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. +1. 
**[RoBERTa-PreLayerNorm](https://huggingface.co/docs/transformers/model_doc/roberta-prelayernorm)** (from Facebook) released with the paper [fairseq: A Fast, Extensible Toolkit for Sequence Modeling](https://arxiv.org/abs/1904.01038) by Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli. +1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by Hui Su, Weiwei Shi, Xiaoyu Shen, Xiao Zhou, Tuo Ji, Jiarui Fang, Jie Zhou. +1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu. +1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng), released on [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng. +1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo. +1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick. +1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi. +1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi. +1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei. +1. **[SpeechToTextTransformer](https://huggingface.co/docs/transformers/model_doc/speech_to_text)** (from Facebook), released together with the paper [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino. +1. **[SpeechToTextTransformer2](https://huggingface.co/docs/transformers/model_doc/speech_to_text_2)** (from Facebook), released together with the paper [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau. +1. 
**[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (from Tel Aviv University), released together with the paper [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy. +1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer. +1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (from MBZUAI) released with the paper [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446) by Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan. +1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo. +1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo. +1. **[Swin2SR](https://huggingface.co/docs/transformers/model_doc/swin2sr)** (from University of Würzburg) released with the paper [Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration](https://arxiv.org/abs/2209.11345) by Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte. +1. **[SwitchTransformers](https://huggingface.co/docs/transformers/model_doc/switch_transformers)** (from Google) released with the paper [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961) by William Fedus, Barret Zoph, Noam Shazeer. +1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu. +1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu. +1. **[Table Transformer](https://huggingface.co/docs/transformers/model_doc/table-transformer)** (from Microsoft Research) released with the paper [PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents](https://arxiv.org/abs/2110.00061) by Brandon Smock, Rohith Pesala, Robin Abraham. +1. 
**[TAPAS](https://huggingface.co/docs/transformers/model_doc/tapas)** (from Google AI) released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos. +1. **[TAPEX](https://huggingface.co/docs/transformers/model_doc/tapex)** (from Microsoft Research) released with the paper [TAPEX: Table Pre-training via Learning a Neural SQL Executor](https://arxiv.org/abs/2107.07653) by Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, Jian-Guang Lou. +1. **[Time Series Transformer](https://huggingface.co/docs/transformers/model_doc/time_series_transformer)** (from HuggingFace). +1. **[TimeSformer](https://huggingface.co/docs/transformers/model_doc/timesformer)** (from Facebook) released with the paper [Is Space-Time Attention All You Need for Video Understanding?](https://arxiv.org/abs/2102.05095) by Gedas Bertasius, Heng Wang, Lorenzo Torresani. +1. **[Trajectory Transformer](https://huggingface.co/docs/transformers/model_doc/trajectory_transformers)** (from the University of California at Berkeley) released with the paper [Offline Reinforcement Learning as One Big Sequence Modeling Problem](https://arxiv.org/abs/2106.02039) by Michael Janner, Qiyang Li, Sergey Levine +1. **[Transformer-XL](https://huggingface.co/docs/transformers/model_doc/transfo-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. +1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. +1. **[TVLT](https://huggingface.co/docs/transformers/model_doc/tvlt)** (from UNC Chapel Hill) released with the paper [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) by Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal. +1. **[UL2](https://huggingface.co/docs/transformers/model_doc/ul2)** (from Google Research) released with the paper [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) by Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler +1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (from Google Research) released with the paper [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) by Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant. +1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang. +1. 
**[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu. +1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (from Peking University) released with the paper [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221) by Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun. +1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (from Tsinghua University and Nankai University) released with the paper [Visual Attention Network](https://arxiv.org/abs/2202.09741) by Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu. +1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (from Multimedia Computing Group, Nanjing University) released with the paper [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) by Zhan Tong, Yibing Song, Jue Wang, Limin Wang. +1. **[ViLT](https://huggingface.co/docs/transformers/model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim. +1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. +1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang. +1. **[ViT Hybrid](https://huggingface.co/docs/transformers/model_doc/vit_hybrid)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. +1. **[VitDet](https://huggingface.co/docs/transformers/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. +1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. +1. 
**[ViTMatte](https://huggingface.co/docs/transformers/model_doc/vitmatte)** (from HUST-VL) released with the paper [ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers](https://arxiv.org/abs/2305.15272) by Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang. +1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas. +1. **[VITS](https://huggingface.co/docs/transformers/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son. +1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. +1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli. +1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino. +1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli. +1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei. +1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever. +1. **[X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling. +1. 
**[X-MOD](https://huggingface.co/docs/transformers/model_doc/xmod)** (from Meta AI) released with the paper [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](http://dx.doi.org/10.18653/v1/2022.naacl-main.255) by Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe. +1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li. +1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau. +1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. +1. **[XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. +1. **[XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/model_doc/xlm-roberta-xl)** (from Facebook AI), released together with the paper [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://arxiv.org/abs/2105.00572) by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau. +1. **[XLM-V](https://huggingface.co/docs/transformers/model_doc/xlm-v)** (from Meta AI) released with the paper [XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models](https://arxiv.org/abs/2301.10472) by Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, Madian Khabsa. +1. **[XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le. +1. **[XLS-R](https://huggingface.co/docs/transformers/model_doc/xls_r)** (from Facebook AI) released with the paper [XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale](https://arxiv.org/abs/2111.09296) by Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli. +1. 
**[XLSR-Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/xlsr_wav2vec2)** (from Facebook AI) released with the paper [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://arxiv.org/abs/2006.13979) by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli. +1. **[YOLOS](https://huggingface.co/docs/transformers/model_doc/yolos)** (from Huazhong University of Science & Technology) released with the paper [You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection](https://arxiv.org/abs/2106.00666) by Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu. +1. **[YOSO](https://huggingface.co/docs/transformers/model_doc/yoso)** (from the University of Wisconsin - Madison) released with the paper [You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling](https://arxiv.org/abs/2111.09714) by Zhanpeng Zeng, +Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh. + +1. Quer contribuir com um novo modelo? Adicionamos um **guia detalhado e modelos de exemplo** para orientar você no processo de adição de um novo modelo. Você pode encontrá-los na pasta [`templates`](./templates) do repositório. Certifique-se de verificar as [diretrizes de contribuição](./CONTRIBUTING.md) e entrar em contato com os mantenedores ou abrir uma issue para coletar feedback antes de iniciar sua PR. + +Para verificar se cada modelo tem uma implementação em Flax, PyTorch ou TensorFlow, ou possui um tokenizador associado com a biblioteca 🤗 Tokenizers, consulte [esta tabela](https://huggingface.co/docs/transformers/index#supported-frameworks). + +Essas implementações foram testadas em vários conjuntos de dados (veja os scripts de exemplo) e devem corresponder ao desempenho das implementações originais. Você pode encontrar mais detalhes sobre o desempenho na seção de Exemplos da [documentação](https://github.com/huggingface/transformers/tree/main/examples). 
+ + +## Saiba mais + +| Seção | Descrição | +|-|-| +| [Documentação](https://huggingface.co/docs/transformers/) | Documentação completa da API e tutoriais | +| [Resumo de Tarefas](https://huggingface.co/docs/transformers/task_summary) | Tarefas suportadas pelo 🤗 Transformers | +| [Tutorial de Pré-processamento](https://huggingface.co/docs/transformers/preprocessing) | Usando a classe `Tokenizer` para preparar dados para os modelos | +| [Treinamento e Ajuste Fino](https://huggingface.co/docs/transformers/training) | Usando os modelos fornecidos pelo 🤗 Transformers em um loop de treinamento PyTorch/TensorFlow e a API `Trainer` | +| [Tour Rápido: Scripts de Ajuste Fino/Utilização](https://github.com/huggingface/transformers/tree/main/examples) | Scripts de exemplo para ajuste fino de modelos em uma ampla gama de tarefas | +| [Compartilhamento e Envio de Modelos](https://huggingface.co/docs/transformers/model_sharing) | Envie e compartilhe seus modelos ajustados com a comunidade | + +## Citação + +Agora temos um [artigo](https://www.aclweb.org/anthology/2020.emnlp-demos.6/) que você pode citar para a biblioteca 🤗 Transformers: +```bibtex +@inproceedings{wolf-etal-2020-transformers, + title = "Transformers: State-of-the-Art Natural Language Processing", + author = "Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush", + booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations", + month = oct, + year = "2020", + address = "Online", + publisher = "Association for Computational Linguistics", + url = "https://www.aclweb.org/anthology/2020.emnlp-demos.6", + pages = "38--45" +} +``` diff --git a/README_ru.md b/README_ru.md new file mode 100644 index 00000000000000..3e6f3d54f27e22 --- /dev/null +++ b/README_ru.md @@ -0,0 +1,556 @@ + 

+ + + + Hugging Face Transformers Library + +
+
+

+ +

+ + Build + + + GitHub + + + Documentation + + + GitHub release + + + Contributor Covenant + + DOI +

+ +

+

+ English | + 简体中文 | + 繁體中文 | + 한국어 | + Español | + 日本語 | + हिन्दी | + Русский | + Português | + తెలుగు | + Français | + Deutsch | +

+

+ +

+

Современное машинное обучение для JAX, PyTorch и TensorFlow

+

+ +

+ +

+ +🤗 Transformers предоставляет тысячи предварительно обученных моделей для решения различных задач в таких модальностях, как текст, зрение и аудио. + +Эти модели могут быть применены к: + +* 📝 Тексту для таких задач, как классификация текстов, извлечение информации, ответы на вопросы, обобщение, перевод, генерация текстов на более чем 100 языках. +* 🖼️ Изображениям для задач классификации изображений, обнаружения объектов и сегментации. +* 🗣️ Аудио для задач распознавания речи и классификации аудио. + +Модели transformers также могут решать задачи, объединяющие несколько модальностей, такие как ответы на вопросы по таблицам, оптическое распознавание символов, извлечение информации из отсканированных документов, классификация видео и ответы на визуальные вопросы. + +🤗 Transformers предоставляет API для быстрой загрузки и использования предварительно обученных моделей, их тонкой настройки на собственных датасетах и последующего обмена ими с сообществом на нашем [сайте](https://huggingface.co/models). В то же время каждый Python-модуль, определяющий архитектуру, полностью автономен и может быть модифицирован для проведения быстрых исследовательских экспериментов. + +🤗 Transformers опирается на три самые популярные библиотеки глубокого обучения - [Jax](https://jax.readthedocs.io/en/latest/), [PyTorch](https://pytorch.org/) и [TensorFlow](https://www.tensorflow.org/) - и легко интегрируется между ними. Это позволяет легко обучать модели с помощью одной из них, а затем загружать их для инференса с помощью другой. + +## Онлайн демонстрация + +Большинство наших моделей можно протестировать непосредственно на их страницах на [сайте](https://huggingface.co/models). Мы также предлагаем [приватный хостинг моделей, контроль версий и API для инференса](https://huggingface.co/pricing) для публичных и частных моделей. 
+ +Вот несколько примеров: + +В области NLP (обработка текстов на естественном языке): +- [Маскированное заполнение слов с помощью BERT](https://huggingface.co/google-bert/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France) +- [Распознавание сущностей с помощью Electra](https://huggingface.co/dbmdz/electra-large-discriminator-finetuned-conll03-english?text=My+name+is+Sarah+and+I+live+in+London+city) +- [Генерация текста с помощью GPT-2](https://huggingface.co/openai-community/gpt2?text=A+long+time+ago%2C+) +- [Логический вывод на естественном языке с помощью RoBERTa](https://huggingface.co/FacebookAI/roberta-large-mnli?text=The+dog+was+lost.+Nobody+lost+any+animal) +- [Обобщение с помощью BART](https://huggingface.co/facebook/bart-large-cnn?text=The+tower+is+324+metres+%281%2C063+ft%29+tall%2C+about+the+same+height+as+an+81-storey+building%2C+and+the+tallest+structure+in+Paris.+Its+base+is+square%2C+measuring+125+metres+%28410+ft%29+on+each+side.+During+its+construction%2C+the+Eiffel+Tower+surpassed+the+Washington+Monument+to+become+the+tallest+man-made+structure+in+the+world%2C+a+title+it+held+for+41+years+until+the+Chrysler+Building+in+New+York+City+was+finished+in+1930.+It+was+the+first+structure+to+reach+a+height+of+300+metres.+Due+to+the+addition+of+a+broadcasting+aerial+at+the+top+of+the+tower+in+1957%2C+it+is+now+taller+than+the+Chrysler+Building+by+5.2+metres+%2817+ft%29.+Excluding+transmitters%2C+the+Eiffel+Tower+is+the+second+tallest+free-standing+structure+in+France+after+the+Millau+Viaduct) +- [Ответы на вопросы с помощью DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased-distilled-squad?text=Which+name+is+also+used+to+describe+the+Amazon+rainforest+in+English%3F&context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species) +- [Перевод с помощью T5](https://huggingface.co/google-t5/t5-base?text=My+name+is+Wolfgang+and+I+live+in+Berlin) + +В области компьютерного зрения: +- [Классификация изображений с помощью ViT](https://huggingface.co/google/vit-base-patch16-224) +- [Обнаружение объектов с помощью DETR](https://huggingface.co/facebook/detr-resnet-50) +- [Семантическая сегментация с помощью SegFormer](https://huggingface.co/nvidia/segformer-b0-finetuned-ade-512-512) +- [Паноптическая сегментация с помощью MaskFormer](https://huggingface.co/facebook/maskformer-swin-small-coco) +- [Оценка глубины с помощью DPT](https://huggingface.co/docs/transformers/model_doc/dpt) +- 
[Классификация видео с помощью VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae) +- [Универсальная сегментация с помощью OneFormer](https://huggingface.co/shi-labs/oneformer_ade20k_dinat_large) + +В области звука: +- [Автоматическое распознавание речи с помощью Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base-960h) +- [Поиск ключевых слов с помощью Wav2Vec2](https://huggingface.co/superb/wav2vec2-base-superb-ks) +- [Классификация аудиоданных с помощью трансформера аудиоспектрограмм](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) + +В мультимодальных задачах: +- [Ответы на вопросы по таблице с помощью TAPAS](https://huggingface.co/google/tapas-base-finetuned-wtq) +- [Визуальные ответы на вопросы с помощью ViLT](https://huggingface.co/dandelin/vilt-b32-finetuned-vqa) +- [Zero-shot классификация изображений с помощью CLIP](https://huggingface.co/openai/clip-vit-large-patch14) +- [Ответы на вопросы по документам с помощью LayoutLM](https://huggingface.co/impira/layoutlm-document-qa) +- [Zero-shot классификация видео с помощью X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip) + + +## 100 проектов, использующих Transformers + +Transformers - это не просто набор инструментов для использования предварительно обученных моделей: это сообщество проектов, созданное на его основе, и +Hugging Face Hub. Мы хотим, чтобы Transformers позволил разработчикам, исследователям, студентам, профессорам, инженерам и всем желающим +создавать проекты своей мечты. + +Чтобы отпраздновать 100 тысяч звезд Transformers, мы решили сделать акцент на сообществе и создали страницу [awesome-transformers](./awesome-transformers.md), на которой перечислены 100 +невероятных проектов, созданных с помощью transformers. + +Если вы являетесь владельцем или пользователем проекта, который, по вашему мнению, должен быть включен в этот список, пожалуйста, откройте PR для его добавления! + +## Если вы хотите получить индивидуальную поддержку от команды Hugging Face + + + HuggingFace Expert Acceleration Program +
+ +## Быстрый гайд + +Для использования модели на заданном входе (текст, изображение, звук, ...) мы предоставляем API `pipeline`. Конвейеры объединяют предварительно обученную модель с препроцессингом, который использовался при ее обучении. Вот как можно быстро использовать конвейер для классификации положительных и отрицательных текстов: + +```python +>>> from transformers import pipeline + +# Выделение конвейера для анализа настроений +>>> classifier = pipeline('sentiment-analysis') +>>> classifier('Мы очень рады представить конвейер в transformers.') +[{'label': 'POSITIVE', 'score': 0.9996980428695679}] +``` + +Вторая строка кода загружает и кэширует предварительно обученную модель, используемую конвейером, а третья оценивает ее на заданном тексте. Здесь ответ "POSITIVE" с уверенностью 99,97%. + +Во многих задачах, как в НЛП, так и в компьютерном зрении и речи, уже есть готовый `pipeline`. Например, мы можем легко извлечь обнаруженные объекты на изображении: + +``` python +>>> import requests +>>> from PIL import Image +>>> from transformers import pipeline + +# Скачиваем изображение с милыми котиками +>>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/coco_sample.png" +>>> image_data = requests.get(url, stream=True).raw +>>> image = Image.open(image_data) + +# Выделение конвейера для обнаружения объектов +>>> object_detector = pipeline('object-detection') +>>> object_detector(image) +[{'score': 0.9982201457023621, + 'label': 'remote', + 'box': {'xmin': 40, 'ymin': 70, 'xmax': 175, 'ymax': 117}}, + {'score': 0.9960021376609802, + 'label': 'remote', + 'box': {'xmin': 333, 'ymin': 72, 'xmax': 368, 'ymax': 187}}, + {'score': 0.9954745173454285, + 'label': 'couch', + 'box': {'xmin': 0, 'ymin': 1, 'xmax': 639, 'ymax': 473}}, + {'score': 0.9988006353378296, + 'label': 'cat', + 'box': {'xmin': 13, 'ymin': 52, 'xmax': 314, 'ymax': 470}}, + {'score': 0.9986783862113953, + 'label': 'cat', + 'box': {'xmin': 345, 'ymin': 23, 'xmax': 640, 'ymax': 368}}] +``` + +Здесь мы получаем список объектов, обнаруженных на изображении, с рамкой вокруг объекта и оценкой достоверности. Слева - исходное изображение, справа прогнозы: + +

+ + +
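+
+Конвейеру при необходимости можно явно указать конкретную модель с [хаба](https://huggingface.co/models) через параметр `model`. Ниже - небольшой набросок для иллюстрации: идентификатор модели приведён лишь в качестве примера (это та же модель, которую конвейер `sentiment-analysis` использует по умолчанию), подставьте любую подходящую вам модель:
+
+```python
+>>> from transformers import pipeline
+
+# Явно указываем модель с хаба (идентификатор приведён здесь только для примера)
+>>> classifier = pipeline("sentiment-analysis", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
+
+# Конвейер принимает как одну строку, так и список строк
+>>> classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."])
+# результатом будет список словарей вида [{'label': ..., 'score': ...}, ...]
+```
+
+Явное указание модели делает пример воспроизводимым и не зависящим от того, какая модель выбирается конвейером по умолчанию.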

+ +Подробнее о задачах, поддерживаемых API `pipeline`, можно узнать в [этом учебном пособии](https://huggingface.co/docs/transformers/task_summary). + +В дополнение к `pipeline`, для загрузки и использования любой из предварительно обученных моделей в заданной задаче достаточно трех строк кода. Вот версия для PyTorch: +```python +>>> from transformers import AutoTokenizer, AutoModel + +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = AutoModel.from_pretrained("google-bert/bert-base-uncased") + +>>> inputs = tokenizer("Привет мир!", return_tensors="pt") +>>> outputs = model(**inputs) +``` + +А вот эквивалентный код для TensorFlow: +```python +>>> from transformers import AutoTokenizer, TFAutoModel + +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = TFAutoModel.from_pretrained("google-bert/bert-base-uncased") + +>>> inputs = tokenizer("Привет мир!", return_tensors="tf") +>>> outputs = model(**inputs) +``` + +Токенизатор отвечает за всю предварительную обработку, которую ожидает предварительно обученная модель, и может быть вызван непосредственно для одной строки (как в приведенных выше примерах) или для списка строк. В результате будет получен словарь, который можно использовать в последующем коде или просто напрямую передать в модель с помощью оператора распаковки аргументов **. + +Сама модель представляет собой обычный [PyTorch `nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) или [TensorFlow `tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) (в зависимости от используемого бэкенда), который можно использовать как обычно. [В этом руководстве](https://huggingface.co/docs/transformers/training) рассказывается, как интегрировать такую модель в классический цикл обучения PyTorch или TensorFlow, или как использовать наш API `Trainer` для быстрой тонкой настройки на новом датасете. + +## Почему необходимо использовать transformers? + +1. Простые в использовании современные модели: + - Высокая производительность в задачах понимания и генерации естественного языка, компьютерного зрения и аудио. + - Низкий входной барьер для преподавателей и практиков. + - Небольшое количество абстракций для пользователя и всего три класса для изучения. + - Единый API для использования всех наших предварительно обученных моделей. + +1. Более низкие вычислительные затраты, меньший "углеродный след": + - Исследователи могут обмениваться обученными моделями вместо того, чтобы постоянно их переобучать. + - Практики могут сократить время вычислений и производственные затраты. + - Десятки архитектур с более чем 60 000 предварительно обученных моделей для всех модальностей. + +1. Выбор подходящего фреймворка для каждого этапа жизни модели: + - Обучение самых современных моделей за 3 строки кода. + - Перемещайте одну модель между фреймворками TF2.0/PyTorch/JAX по своему усмотрению. + - Беспрепятственный выбор подходящего фреймворка для обучения, оценки и производства. + +1. Легко настроить модель или пример под свои нужды: + - Мы предоставляем примеры для каждой архитектуры, чтобы воспроизвести результаты, опубликованные их авторами. + - Внутренние компоненты модели раскрываются максимально последовательно. + - Файлы моделей можно использовать независимо от библиотеки для проведения быстрых экспериментов. + +## Почему я не должен использовать transformers? + +- Данная библиотека не является модульным набором строительных блоков для нейронных сетей. 
Код в файлах моделей намеренно не подвергается рефакторингу с добавлением дополнительных абстракций, чтобы исследователи могли быстро итеративно работать с каждой из моделей, не погружаясь в дополнительные абстракции/файлы. +- API обучения не предназначен для работы с любой моделью, а оптимизирован для работы с моделями, предоставляемыми библиотекой. Для работы с общими циклами машинного обучения следует использовать другую библиотеку (возможно, [Accelerate](https://huggingface.co/docs/accelerate)). +- Несмотря на то, что мы стремимся представить как можно больше примеров использования, скрипты в нашей папке [примеров](https://github.com/huggingface/transformers/tree/main/examples) являются именно примерами. Предполагается, что они не будут работать "из коробки" для решения вашей конкретной задачи, и вам придется изменить несколько строк кода, чтобы адаптировать их под свои нужды. + +## Установка + +### С помощью pip + +Данный репозиторий протестирован на Python 3.8+, Flax 0.4.1+, PyTorch 1.11+ и TensorFlow 2.6+. + +Устанавливать 🤗 Transformers следует в [виртуальной среде](https://docs.python.org/3/library/venv.html). Если вы не знакомы с виртуальными средами Python, ознакомьтесь с [руководством пользователя](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/). + +Сначала создайте виртуальную среду с той версией Python, которую вы собираетесь использовать, и активируйте ее. + +Затем необходимо установить хотя бы один бэкенд из Flax, PyTorch или TensorFlow. +Пожалуйста, обратитесь к установочным страницам [TensorFlow](https://www.tensorflow.org/install/), [PyTorch](https://pytorch.org/get-started/locally/#start-locally) и/или [Flax](https://github.com/google/flax#quick-install) и [Jax](https://github.com/google/jax#installation), где описаны команды установки для вашей платформы. + +После установки одного из этих бэкендов 🤗 Transformers может быть установлен с помощью pip следующим образом: + +```bash +pip install transformers +``` + +Если вы хотите поиграть с примерами или вам нужен самый современный код и вы не можете ждать нового релиза, вы должны [установить библиотеку из исходного кода](https://huggingface.co/docs/transformers/installation#installing-from-source). + +### С помощью conda + +Установить Transformers с помощью conda можно следующим образом: + +```bash +conda install conda-forge::transformers +``` + +> **_ЗАМЕТКА:_** Установка `transformers` через канал `huggingface` устарела. + +О том, как установить Flax, PyTorch или TensorFlow с помощью conda, читайте на страницах, посвященных их установке. + +> **_ЗАМЕТКА:_** В операционной системе Windows вам может быть предложено активировать режим разработчика, чтобы воспользоваться преимуществами кэширования. Если для вас это невозможно, сообщите нам об этом [здесь](https://github.com/huggingface/huggingface_hub/issues/1062). + +## Модельные архитектуры + +**[Все контрольные точки моделей](https://huggingface.co/models)**, предоставляемые 🤗 Transformers, беспрепятственно интегрируются с huggingface.co [model hub](https://huggingface.co/models), куда они загружаются непосредственно [пользователями](https://huggingface.co/users) и [организациями](https://huggingface.co/organizations). + +Текущее количество контрольных точек: ![](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/models&color=brightgreen) + +В настоящее время 🤗 Transformers предоставляет следующие архитектуры (подробное описание каждой из них см. 
[здесь](https://huggingface.co/docs/transformers/model_summary)): + +1. **[ALBERT](https://huggingface.co/docs/transformers/model_doc/albert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. +1. **[ALIGN](https://huggingface.co/docs/transformers/model_doc/align)** (from Google Research) released with the paper [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918) by Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. +1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell. +1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass. +1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long. +1. **[Bark](https://huggingface.co/docs/transformers/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team. +1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer. +1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis. +1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen. +1. **[BEiT](https://huggingface.co/docs/transformers/model_doc/beit)** (from Microsoft) released with the paper [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254) by Hangbo Bao, Li Dong, Furu Wei. +1. **[BERT](https://huggingface.co/docs/transformers/model_doc/bert)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. +1. 
**[BERT For Sequence Generation](https://huggingface.co/docs/transformers/model_doc/bert-generation)** (from Google) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn. +1. **[BERTweet](https://huggingface.co/docs/transformers/model_doc/bertweet)** (from VinAI Research) released with the paper [BERTweet: A pre-trained language model for English Tweets](https://aclanthology.org/2020.emnlp-demos.2/) by Dat Quoc Nguyen, Thanh Vu and Anh Tuan Nguyen. +1. **[BigBird-Pegasus](https://huggingface.co/docs/transformers/model_doc/bigbird_pegasus)** (from Google Research) released with the paper [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed. +1. **[BigBird-RoBERTa](https://huggingface.co/docs/transformers/model_doc/big_bird)** (from Google Research) released with the paper [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed. +1. **[BioGpt](https://huggingface.co/docs/transformers/model_doc/biogpt)** (from Microsoft Research AI4Science) released with the paper [BioGPT: generative pre-trained transformer for biomedical text generation and mining](https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac409/6713511?guestAccessKey=a66d9b5d-4f83-4017-bb52-405815c907b9) by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu. +1. **[BiT](https://huggingface.co/docs/transformers/model_doc/bit)** (from Google AI) released with the paper [Big Transfer (BiT): General Visual Representation Learning](https://arxiv.org/abs/1912.11370) by Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, Neil Houlsby. +1. **[Blenderbot](https://huggingface.co/docs/transformers/model_doc/blenderbot)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston. +1. **[BlenderbotSmall](https://huggingface.co/docs/transformers/model_doc/blenderbot-small)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston. +1. **[BLIP](https://huggingface.co/docs/transformers/model_doc/blip)** (from Salesforce) released with the paper [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086) by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi. +1. **[BLIP-2](https://huggingface.co/docs/transformers/model_doc/blip-2)** (from Salesforce) released with the paper [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597) by Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. +1. 
**[BLOOM](https://huggingface.co/docs/transformers/model_doc/bloom)** (from BigScience workshop) released by the [BigScience Workshop](https://bigscience.huggingface.co/). +1. **[BORT](https://huggingface.co/docs/transformers/model_doc/bort)** (from Alexa) released with the paper [Optimal Subarchitecture Extraction For BERT](https://arxiv.org/abs/2010.10499) by Adrian de Wynter and Daniel J. Perry. +1. **[BridgeTower](https://huggingface.co/docs/transformers/model_doc/bridgetower)** (from Harbin Institute of Technology/Microsoft Research Asia/Intel Labs) released with the paper [BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning](https://arxiv.org/abs/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. +1. **[BROS](https://huggingface.co/docs/transformers/model_doc/bros)** (from NAVER CLOVA) released with the paper [BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents](https://arxiv.org/abs/2108.04539) by Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park. +1. **[ByT5](https://huggingface.co/docs/transformers/model_doc/byt5)** (from Google Research) released with the paper [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel. +1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot. +1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. +1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou. +1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. +1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. +1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker. +1. 
**[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong. +1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (from MetaAI) released with the paper [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve. +1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (from Microsoft Research Asia) released with the paper [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang. +1. **[ConvBERT](https://huggingface.co/docs/transformers/model_doc/convbert)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan. +1. **[ConvNeXT](https://huggingface.co/docs/transformers/model_doc/convnext)** (from Facebook AI) released with the paper [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie. +1. **[ConvNeXTV2](https://huggingface.co/docs/transformers/model_doc/convnextv2)** (from Facebook AI) released with the paper [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie. +1. **[CPM](https://huggingface.co/docs/transformers/model_doc/cpm)** (from Tsinghua University) released with the paper [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun. +1. **[CPM-Ant](https://huggingface.co/docs/transformers/model_doc/cpmant)** (from OpenBMB) released by the [OpenBMB](https://www.openbmb.org/). +1. **[CTRL](https://huggingface.co/docs/transformers/model_doc/ctrl)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher. +1. **[CvT](https://huggingface.co/docs/transformers/model_doc/cvt)** (from Microsoft) released with the paper [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang. +1. 
**[Data2Vec](https://huggingface.co/docs/transformers/model_doc/data2vec)** (from Facebook) released with the paper [Data2Vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli. +1. **[DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. +1. **[DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. +1. **[Decision Transformer](https://huggingface.co/docs/transformers/model_doc/decision_transformer)** (from Berkeley/Facebook/Google) released with the paper [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345) by Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch. +1. **[Deformable DETR](https://huggingface.co/docs/transformers/model_doc/deformable_detr)** (from SenseTime Research) released with the paper [Deformable DETR: Deformable Transformers for End-to-End Object Detection](https://arxiv.org/abs/2010.04159) by Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai. +1. **[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou. +1. **[DePlot](https://huggingface.co/docs/transformers/model_doc/deplot)** (from Google AI) released with the paper [DePlot: One-shot visual language reasoning by plot-to-table translation](https://arxiv.org/abs/2212.10505) by Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, Yasemin Altun. +1. **[DETA](https://huggingface.co/docs/transformers/model_doc/deta)** (from The University of Texas at Austin) released with the paper [NMS Strikes Back](https://arxiv.org/abs/2212.06137) by Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl. +1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko. +1. **[DialoGPT](https://huggingface.co/docs/transformers/model_doc/dialogpt)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan. +1. **[DiNAT](https://huggingface.co/docs/transformers/model_doc/dinat)** (from SHI Labs) released with the paper [Dilated Neighborhood Attention Transformer](https://arxiv.org/abs/2209.15001) by Ali Hassani and Humphrey Shi. +1. 
**[DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2)** (from Meta AI) released with the paper [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski. +1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) and a German version of DistilBERT. +1. **[DiT](https://huggingface.co/docs/transformers/model_doc/dit)** (from Microsoft Research) released with the paper [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei. +1. **[Donut](https://huggingface.co/docs/transformers/model_doc/donut)** (from NAVER), released together with the paper [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park. +1. **[DPR](https://huggingface.co/docs/transformers/model_doc/dpr)** (from Facebook) released with the paper [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. +1. **[DPT](https://huggingface.co/docs/transformers/master/model_doc/dpt)** (from Intel Labs) released with the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun. +1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (from Snap Research) released with the paper [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) by Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren. +1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le. +1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. +1. 
**[EnCodec](https://huggingface.co/docs/transformers/model_doc/encodec)** (from Meta AI) released with the paper [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438) by Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi. +1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn. +1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (from Baidu) released with the paper [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) by Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu. +1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (from Baidu) released with the paper [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) by Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. +1. **[ESM](https://huggingface.co/docs/transformers/model_doc/esm)** (from Meta AI) are transformer protein language models. **ESM-1b** was released with the paper [Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences](https://www.pnas.org/content/118/15/e2016239118) by Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. **ESM-1v** was released with the paper [Language models enable zero-shot prediction of the effects of mutations on protein function](https://doi.org/10.1101/2021.07.09.450648) by Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu and Alexander Rives. **ESM-2 and ESMFold** were released with the paper [Language models of protein sequences at the scale of evolution enable accurate structure prediction](https://doi.org/10.1101/2022.07.20.500902) by Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, Alexander Rives. +1. **[Falcon](https://huggingface.co/docs/transformers/model_doc/falcon)** (from Technology Innovation Institute) by Almazrouei, Ebtesam and Alobeidli, Hamza and Alshamsi, Abdulaziz and Cappelli, Alessandro and Cojocaru, Ruxandra and Debbah, Merouane and Goffinet, Etienne and Heslow, Daniel and Launay, Julien and Malartic, Quentin and Noune, Badreddine and Pannier, Baptiste and Penedo, Guilherme. +1. **[FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei +1. 
**[FLAN-UL2](https://huggingface.co/docs/transformers/model_doc/flan-ul2)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-ul2-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei +1. **[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab. +1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc/flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. +1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. +1. **[FocalNet](https://huggingface.co/docs/transformers/model_doc/focalnet)** (from Microsoft Research) released with the paper [Focal Modulation Networks](https://arxiv.org/abs/2203.11926) by Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao. +1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le. +1. **[Fuyu](https://huggingface.co/docs/transformers/model_doc/fuyu)** (from ADEPT) Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar. Released with the paper [blog post](https://www.adept.ai/blog/fuyu-8b) +1. **[GIT](https://huggingface.co/docs/transformers/model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang. +1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim. +1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://openai.com/research/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. +1. 
**[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. +1. **[GPT NeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox)** (from EleutherAI) released with the paper [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745) by Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach +1. **[GPT NeoX Japanese](https://huggingface.co/docs/transformers/model_doc/gpt_neox_japanese)** (from ABEJA) released by Shinya Otani, Takayoshi Makabe, Anuj Arora, and Kyo Hattori. +1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://openai.com/research/better-language-models/) by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. +1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki. +1. **[GPT-Sw3](https://huggingface.co/docs/transformers/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren. +1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra. +1. **[GPTSAN-japanese](https://huggingface.co/docs/transformers/model_doc/gptsan-japanese)** released in the repository [tanreinama/GPTSAN](https://github.com/tanreinama/GPTSAN/blob/main/report/model.md) by Toshiyuki Sakamoto(tanreinama). +1. **[Graphormer](https://huggingface.co/docs/transformers/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu. +1. 
**[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang. +1. **[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** (from Allegro.pl, AGH University of Science and Technology) released with the paper [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik. +1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed. +1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer. +1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. +1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. +1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. +1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (from Salesforce) released with the paper [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. +1. **[Jukebox](https://huggingface.co/docs/transformers/model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever. +1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou. +1. 
**[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou. +1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (from Microsoft Research Asia) released with the paper [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei. +1. **[LayoutXLM](https://huggingface.co/docs/transformers/model_doc/layoutxlm)** (from Microsoft Research Asia) released with the paper [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/abs/2104.08836) by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei. +1. **[LED](https://huggingface.co/docs/transformers/model_doc/led)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan. +1. **[LeViT](https://huggingface.co/docs/transformers/model_doc/levit)** (from Meta AI) released with the paper [LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference](https://arxiv.org/abs/2104.01136) by Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, Matthijs Douze. +1. **[LiLT](https://huggingface.co/docs/transformers/model_doc/lilt)** (from South China University of Technology) released with the paper [LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding](https://arxiv.org/abs/2202.13669) by Jiapeng Wang, Lianwen Jin, Kai Ding. +1. **[LLaMA](https://huggingface.co/docs/transformers/model_doc/llama)** (from The FAIR team of Meta AI) released with the paper [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971) by Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. +1. 
**[Llama2](https://huggingface.co/docs/transformers/model_doc/llama2)** (from The FAIR team of Meta AI) released with the paper [Llama2: Open Foundation and Fine-Tuned Chat Models](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/XXX) by Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushka rMishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing EllenTan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom. +1. **[Longformer](https://huggingface.co/docs/transformers/model_doc/longformer)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan. +1. **[LongT5](https://huggingface.co/docs/transformers/model_doc/longt5)** (from Google AI) released with the paper [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/abs/2112.07916) by Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang. +1. **[LUKE](https://huggingface.co/docs/transformers/model_doc/luke)** (from Studio Ousia) released with the paper [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto. +1. **[LXMERT](https://huggingface.co/docs/transformers/model_doc/lxmert)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal. +1. **[M-CTC-T](https://huggingface.co/docs/transformers/model_doc/mctct)** (from Facebook) released with the paper [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161) by Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert. +1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin. +1. **[MADLAD-400](https://huggingface.co/docs/transformers/model_doc/madlad-400)** (from Google) released with the paper [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](https://arxiv.org/abs/2309.04662) by Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. 
Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat. +1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team. +1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (from Microsoft Research Asia) released with the paper [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei. +1. **[Mask2Former](https://huggingface.co/docs/transformers/model_doc/mask2former)** (from FAIR and UIUC) released with the paper [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) by Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar. +1. **[MaskFormer](https://huggingface.co/docs/transformers/model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov. +1. **[MatCha](https://huggingface.co/docs/transformers/model_doc/matcha)** (from Google AI) released with the paper [MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering](https://arxiv.org/abs/2212.09662) by Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, Julian Martin Eisenschlos. +1. **[mBART](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. +1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan. +1. **[MEGA](https://huggingface.co/docs/transformers/model_doc/mega)** (from Meta/USC/CMU/SJTU) released with the paper [Mega: Moving Average Equipped Gated Attention](https://arxiv.org/abs/2209.10655) by Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. +1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. +1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. +1. 
**[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (from Alibaba Research) released with the paper [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) by Peng Wang, Cheng Da, and Cong Yao. +1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (from Studio Ousia) released with the paper [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka. +1. **[MMS](https://huggingface.co/docs/transformers/model_doc/mms)** (from Facebook) released with the paper [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516) by Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli. +1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. +1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (from Google Inc.) released with the paper [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam. +1. **[MobileNetV2](https://huggingface.co/docs/transformers/model_doc/mobilenet_v2)** (from Google Inc.) released with the paper [MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/abs/1801.04381) by Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen. +1. **[MobileViT](https://huggingface.co/docs/transformers/model_doc/mobilevit)** (from Apple) released with the paper [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari. +1. **[MobileViTV2](https://huggingface.co/docs/transformers/model_doc/mobilevitv2)** (from Apple) released with the paper [Separable Self-attention for Mobile Vision Transformers](https://arxiv.org/abs/2206.02680) by Sachin Mehta and Mohammad Rastegari. +1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu. +1. **[MPT](https://huggingface.co/docs/transformers/model_doc/mpt)** (from MosaicML) released with the repository [llm-foundry](https://github.com/mosaicml/llm-foundry/) by the MosaicML NLP Team. +1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (from the University of Wisconsin - Madison) released with the paper [Multi Resolution Analysis (MRA) for Approximate Self-Attention](https://arxiv.org/abs/2207.10284) by Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh. +1.
**[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel. +1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez. +1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (from RUC AI Box) released with the paper [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen. +1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (from SHI Labs) released with the paper [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi. +1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu. +1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team. +1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team. +1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh. +1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi. +1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released on GitHub (now removed). +1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al. +1. 
**[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. +1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu. +1. **[PEGASUS-X](https://huggingface.co/docs/transformers/model_doc/pegasus_x)** (from Google) released with the paper [Investigating Efficiently Extending Transformers for Long Input Summarization](https://arxiv.org/abs/2208.04347) by Jason Phang, Yao Zhao, and Peter J. Liu. +1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira. +1. **[Persimmon](https://huggingface.co/docs/transformers/main/model_doc/persimmon)** (from ADEPT) released in a [blog post](https://www.adept.ai/blog/persimmon-8b) by Erich Elsen, Augustus Odena, Maxwell Nye, Sağnak Taşırlar, Tri Dao, Curtis Hawthorne, Deepak Moparthi, Arushi Somani. +1. **[Phi](https://huggingface.co/docs/transformers/main/model_doc/phi)** (from Microsoft Research) released with the papers - [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li, [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee. +1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen. +1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (from Google) released with the paper [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) by Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova. +1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. +1.
**[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng. +1. **[Pop2Piano](https://huggingface.co/docs/transformers/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi and Kyogu Lee. +1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. +1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (from Nanjing University, The University of Hong Kong etc.) released with the paper [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao. +1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius. +1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (from Facebook) released with the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) by Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. +1. **[REALM](https://huggingface.co/docs/transformers/model_doc/realm.html)** (from Google Research) released with the paper [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909) by Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang. +1. **[Reformer](https://huggingface.co/docs/transformers/model_doc/reformer)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya. +1. **[RegNet](https://huggingface.co/docs/transformers/model_doc/regnet)** (from META Platforms) released with the paper [Designing Network Design Spaces](https://arxiv.org/abs/2003.13678) by Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, Piotr Dollár. +1. **[RemBERT](https://huggingface.co/docs/transformers/model_doc/rembert)** (from Google Research) released with the paper [Rethinking embedding coupling in pre-trained language models](https://arxiv.org/abs/2010.12821) by Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder. +1. **[ResNet](https://huggingface.co/docs/transformers/model_doc/resnet)** (from Microsoft Research) released with the paper [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) by Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. +1.
**[RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. +1. **[RoBERTa-PreLayerNorm](https://huggingface.co/docs/transformers/model_doc/roberta-prelayernorm)** (from Facebook) released with the paper [fairseq: A Fast, Extensible Toolkit for Sequence Modeling](https://arxiv.org/abs/1904.01038) by Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli. +1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by Hui Su, Weiwei Shi, Xiaoyu Shen, Xiao Zhou, Tuo Ji, Jiarui Fang, Jie Zhou. +1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu. +1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng), released on [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng. +1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo. +1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick. +1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi. +1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi. +1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei. +1. **[SpeechToTextTransformer](https://huggingface.co/docs/transformers/model_doc/speech_to_text)** (from Facebook), released together with the paper [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino. +1.
**[SpeechToTextTransformer2](https://huggingface.co/docs/transformers/model_doc/speech_to_text_2)** (from Facebook), released together with the paper [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau. +1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (from Tel Aviv University), released together with the paper [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy. +1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer. +1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (from MBZUAI) released with the paper [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446) by Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan. +1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo. +1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo. +1. **[Swin2SR](https://huggingface.co/docs/transformers/model_doc/swin2sr)** (from University of Würzburg) released with the paper [Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration](https://arxiv.org/abs/2209.11345) by Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte. +1. **[SwitchTransformers](https://huggingface.co/docs/transformers/model_doc/switch_transformers)** (from Google) released with the paper [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961) by William Fedus, Barret Zoph, Noam Shazeer. +1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu. +1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu. +1. 
**[Table Transformer](https://huggingface.co/docs/transformers/model_doc/table-transformer)** (from Microsoft Research) released with the paper [PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents](https://arxiv.org/abs/2110.00061) by Brandon Smock, Rohith Pesala, Robin Abraham. +1. **[TAPAS](https://huggingface.co/docs/transformers/model_doc/tapas)** (from Google AI) released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos. +1. **[TAPEX](https://huggingface.co/docs/transformers/model_doc/tapex)** (from Microsoft Research) released with the paper [TAPEX: Table Pre-training via Learning a Neural SQL Executor](https://arxiv.org/abs/2107.07653) by Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, Jian-Guang Lou. +1. **[Time Series Transformer](https://huggingface.co/docs/transformers/model_doc/time_series_transformer)** (from HuggingFace). +1. **[TimeSformer](https://huggingface.co/docs/transformers/model_doc/timesformer)** (from Facebook) released with the paper [Is Space-Time Attention All You Need for Video Understanding?](https://arxiv.org/abs/2102.05095) by Gedas Bertasius, Heng Wang, Lorenzo Torresani. +1. **[Trajectory Transformer](https://huggingface.co/docs/transformers/model_doc/trajectory_transformers)** (from the University of California at Berkeley) released with the paper [Offline Reinforcement Learning as One Big Sequence Modeling Problem](https://arxiv.org/abs/2106.02039) by Michael Janner, Qiyang Li, Sergey Levine +1. **[Transformer-XL](https://huggingface.co/docs/transformers/model_doc/transfo-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. +1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. +1. **[TVLT](https://huggingface.co/docs/transformers/model_doc/tvlt)** (from UNC Chapel Hill) released with the paper [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) by Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal. +1. **[UL2](https://huggingface.co/docs/transformers/model_doc/ul2)** (from Google Research) released with the paper [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) by Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler +1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (from Google Research) released with the paper [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) by Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant. +1. 
**[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang. +1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu. +1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (from Peking University) released with the paper [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221) by Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun. +1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (from Tsinghua University and Nankai University) released with the paper [Visual Attention Network](https://arxiv.org/abs/2202.09741) by Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu. +1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (from Multimedia Computing Group, Nanjing University) released with the paper [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) by Zhan Tong, Yibing Song, Jue Wang, Limin Wang. +1. **[ViLT](https://huggingface.co/docs/transformers/model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim. +1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. +1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang. +1. **[ViT Hybrid](https://huggingface.co/docs/transformers/model_doc/vit_hybrid)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. +1. **[VitDet](https://huggingface.co/docs/transformers/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. +1. 
**[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. +1. **[ViTMatte](https://huggingface.co/docs/transformers/main/model_doc/vitmatte)** (from HUST-VL) released with the paper [ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers](https://arxiv.org/abs/2305.15272) by Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang. +1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas. +1. **[VITS](https://huggingface.co/docs/transformers/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son. +1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. +1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli. +1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino. +1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli. +1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei. +1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever. +1. **[X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling. +1.
**[X-MOD](https://huggingface.co/docs/transformers/model_doc/xmod)** (from Meta AI) released with the paper [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](http://dx.doi.org/10.18653/v1/2022.naacl-main.255) by Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe. +1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li. +1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau. +1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. +1. **[XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. +1. **[XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/model_doc/xlm-roberta-xl)** (from Facebook AI), released together with the paper [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://arxiv.org/abs/2105.00572) by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau. +1. **[XLM-V](https://huggingface.co/docs/transformers/model_doc/xlm-v)** (from Meta AI) released with the paper [XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models](https://arxiv.org/abs/2301.10472) by Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, Madian Khabsa. +1. **[XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le. +1. **[XLS-R](https://huggingface.co/docs/transformers/model_doc/xls_r)** (from Facebook AI) released with the paper [XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale](https://arxiv.org/abs/2111.09296) by Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli. +1. 
**[XLSR-Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/xlsr_wav2vec2)** (from Facebook AI) released with the paper [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://arxiv.org/abs/2006.13979) by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli. +1. **[YOLOS](https://huggingface.co/docs/transformers/model_doc/yolos)** (from Huazhong University of Science & Technology) released with the paper [You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection](https://arxiv.org/abs/2106.00666) by Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu. +1. **[YOSO](https://huggingface.co/docs/transformers/model_doc/yoso)** (from the University of Wisconsin - Madison) released with the paper [You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling](https://arxiv.org/abs/2111.09714) by Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh. +1. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedback before starting your PR. + +Чтобы проверить, есть ли у каждой модели реализация на Flax, PyTorch или TensorFlow, или связанный с ней токенизатор, поддерживаемый библиотекой 🤗 Tokenizers, обратитесь к [этой таблице](https://huggingface.co/docs/transformers/index#supported-frameworks). + +Эти реализации были протестированы на нескольких наборах данных (см. примеры скриптов) и должны соответствовать производительности оригинальных реализаций. Более подробную информацию о производительности можно найти в разделе "Примеры" [документации](https://github.com/huggingface/transformers/tree/main/examples). + + +## Изучи больше + +| Секция | Описание | +|-|-| +| [Документация](https://huggingface.co/docs/transformers/) | Полная документация по API и гайды | +| [Краткие описания задач](https://huggingface.co/docs/transformers/task_summary) | Задачи, поддерживаемые 🤗 Transformers | +| [Пособие по предварительной обработке](https://huggingface.co/docs/transformers/preprocessing) | Использование класса `Tokenizer` для подготовки данных для моделей | +| [Обучение и доработка](https://huggingface.co/docs/transformers/training) | Использование моделей, предоставляемых 🤗 Transformers, в цикле обучения PyTorch/TensorFlow и API `Trainer`.
| +| [Быстрый тур: Тонкая настройка/скрипты использования](https://github.com/huggingface/transformers/tree/main/examples) | Примеры скриптов для тонкой настройки моделей на широком спектре задач | +| [Совместное использование и загрузка моделей](https://huggingface.co/docs/transformers/model_sharing) | Загружайте и делитесь с сообществом своими доработанными моделями | + +## Цитирование + +Теперь у нас есть [статья](https://www.aclweb.org/anthology/2020.emnlp-demos.6/), которую можно цитировать для библиотеки 🤗 Transformers: +```bibtex +@inproceedings{wolf-etal-2020-transformers, + title = "Transformers: State-of-the-Art Natural Language Processing", + author = "Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush", + booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations", + month = oct, + year = "2020", + address = "Online", + publisher = "Association for Computational Linguistics", + url = "https://www.aclweb.org/anthology/2020.emnlp-demos.6", + pages = "38--45" +} +``` diff --git a/README_te.md b/README_te.md new file mode 100644 index 00000000000000..2c0b97dada67ed --- /dev/null +++ b/README_te.md @@ -0,0 +1,560 @@ + + +

+ + + + Hugging Face Transformers Library + +
+
+

+ + +

+ + Build + + + GitHub + + + Documentation + + + GitHub release + + + Contributor Covenant + + DOI +

+ + +

+

English | + 简体中文 | + 繁體中文 | + 한국어 | + Español | + 日本語 | + हिन्दी | + Русский | + Português | + తెలుగు | + Français | + Deutsch | +

+

+ +

+

JAX, PyTorch మరియు TensorFlow కోసం అత్యాధునిక యంత్ర అభ్యాసం

+

+ +

+ +

+ +🤗 ట్రాన్స్‌ఫార్మర్లు టెక్స్ట్, విజన్ మరియు ఆడియో వంటి విభిన్న పద్ధతులపై టాస్క్‌లను నిర్వహించడానికి వేలాది ముందుగా శిక్షణ పొందిన మోడల్‌లను అందిస్తాయి. + +ఈ నమూనాలు వర్తించవచ్చు: + +* 📝 టెక్స్ట్, 100కి పైగా భాషల్లో టెక్స్ట్ క్లాసిఫికేషన్, ఇన్ఫర్మేషన్ ఎక్స్‌ట్రాక్షన్, ప్రశ్నలకు సమాధానాలు, సారాంశం, అనువాదం, టెక్స్ట్ జనరేషన్ వంటి పనుల కోసం. +* 🖼️ ఇమేజ్‌లు, ఇమేజ్ వర్గీకరణ, ఆబ్జెక్ట్ డిటెక్షన్ మరియు సెగ్మెంటేషన్ వంటి పనుల కోసం. +* 🗣️ ఆడియో, స్పీచ్ రికగ్నిషన్ మరియు ఆడియో వర్గీకరణ వంటి పనుల కోసం. + +ట్రాన్స్‌ఫార్మర్ మోడల్‌లు టేబుల్ క్వశ్చన్ ఆన్సర్ చేయడం, ఆప్టికల్ క్యారెక్టర్ రికగ్నిషన్, స్కాన్ చేసిన డాక్యుమెంట్‌ల నుండి ఇన్ఫర్మేషన్ ఎక్స్‌ట్రాక్షన్, వీడియో క్లాసిఫికేషన్ మరియు విజువల్ క్వశ్చన్ ఆన్సర్ చేయడం వంటి **అనేక పద్ధతులతో కలిపి** పనులను కూడా చేయగలవు. + +🤗 ట్రాన్స్‌ఫార్మర్లు అందించిన టెక్స్ట్‌లో ప్రీట్రైన్డ్ మోడల్‌లను త్వరగా డౌన్‌లోడ్ చేయడానికి మరియు ఉపయోగించడానికి, వాటిని మీ స్వంత డేటాసెట్‌లలో ఫైన్-ట్యూన్ చేయడానికి మరియు వాటిని మా [మోడల్ హబ్](https://huggingface.co/models)లో సంఘంతో భాగస్వామ్యం చేయడానికి API లను అందిస్తుంది. అదే సమయంలో, ఆర్కిటెక్చర్‌ని నిర్వచించే ప్రతి పైథాన్ మాడ్యూల్ పూర్తిగా స్వతంత్రంగా ఉంటుంది మరియు త్వరిత పరిశోధన ప్రయోగాలను ప్రారంభించడానికి సవరించవచ్చు. + +🤗 ట్రాన్స్‌ఫార్మర్‌లకు మూడు అత్యంత ప్రజాదరణ పొందిన డీప్ లెర్నింగ్ లైబ్రరీలు ఉన్నాయి — [Jax](https://jax.readthedocs.io/en/latest/), [PyTorch](https://pytorch.org/) మరియు [TensorFlow](https://www.tensorflow.org/) — వాటి మధ్య అతుకులు లేని ఏకీకరణతో. మీ మోడల్‌లను ఒకదానితో మరొకదానితో అనుమితి కోసం లోడ్ చేసే ముందు వాటికి శిక్షణ ఇవ్వడం చాలా సులభం. + +## ఆన్‌లైన్ డెమోలు + +మీరు [మోడల్ హబ్](https://huggingface.co/models) నుండి మా మోడళ్లలో చాలా వరకు వాటి పేజీలలో నేరుగా పరీక్షించవచ్చు. మేము పబ్లిక్ మరియు ప్రైవేట్ మోడల్‌ల కోసం [ప్రైవేట్ మోడల్ హోస్టింగ్, సంస్కరణ & అనుమితి API](https://huggingface.co/pricing)ని కూడా అందిస్తాము. 
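+
+The hosted inference API mentioned above can also be called over plain HTTP. The snippet below is only a minimal sketch: it assumes the public `api-inference.huggingface.co` endpoint and a valid user access token, and the model id is just an illustrative example.
+
+```python
+import requests
+
+# Minimal sketch of querying a hosted model via the Inference API
+# (the model id and access token below are placeholders, not part of this README's examples)
+API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"
+headers = {"Authorization": "Bearer <your-access-token>"}
+
+payload = {"inputs": "We are very happy to introduce pipeline to the transformers repository."}
+response = requests.post(API_URL, headers=headers, json=payload)
+print(response.json())  # for this text-classification model: a list of label/score pairs
+```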
+ +ఇక్కడ కొన్ని ఉదాహరణలు ఉన్నాయి: + +సహజ భాషా ప్రాసెసింగ్‌లో: +- [BERT తో మాస్క్‌డ్ వర్డ్ కంప్లీషన్](https://huggingface.co/google-bert/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France) +- [Electra తో పేరు ఎంటిటీ గుర్తింపు](https://huggingface.co/dbmdz/electra-large-discriminator-finetuned-conll03-english?text=My+name+is+Sarah+and+I+live+in+London+city) +- [GPT-2 తో టెక్స్ట్ జనరేషన్](https://huggingface.co/openai-community/gpt2?text=A+long+time+ago%2C+) +- [RoBERTa తో సహజ భాషా అనుమితి](https://huggingface.co/FacebookAI/roberta-large-mnli?text=The+dog+was+Lost.+Nobody+lost+any+animal) +- [BART తో సారాంశం](https://huggingface.co/facebook/bart-large-cnn?text=The+tower+is+324+metres+%281%2C063+ft%29+tall%2C+about+the+same+height+as+an+81-storey+building%2C+and+the+tallest+structure+in+Paris.+Its+base+is+square%2C+measuring+125+metres+%28410+ft%29+on+each+side.+During+its+construction%2C+the+Eiffel+Tower+surpassed+the+Washington+Monument+to+become+the+tallest+man-made+structure+in+the+world%2C+a+title+it+held+for+41+years+until+the+Chrysler+Building+in+New+York+City+was+finished+in+1930.+It+was+the+first+structure+to+reach+a+height+of+300+metres.+Due+to+the+addition+of+a+broadcasting+aerial+at+the+top+of+the+tower+in+1957%2C+it+is+now+taller+than+the+Chrysler+Building+by+5.2+metres+%2817+ft%29.+Excluding+transmitters%2C+the+Eiffel+Tower+is+the+second+tallest+free-standing+structure+in+France+after+the+Millau+Viaduct) +- [DistilBERT తో ప్రశ్న సమాధానం](https://huggingface.co/distilbert/distilbert-base-uncased-distilled-squad?text=Which+name+is+also+used+to+describe+the+Amazon+rainforest+in+English%3F&context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species) +- [T5 తో అనువాదం](https://huggingface.co/google-t5/t5-base?text=My+name+is+Wolfgang+and+I+live+in+Berlin) + +కంప్యూటర్ దృష్టిలో: +- [VIT తో చిత్ర వర్గీకరణ](https://huggingface.co/google/vit-base-patch16-224) +- [DETR తో ఆబ్జెక్ట్ డిటెక్షన్](https://huggingface.co/facebook/detr-resnet-50) +- [SegFormer తో సెమాంటిక్ సెగ్మెంటేషన్](https://huggingface.co/nvidia/segformer-b0-finetuned-ade-512-512) +- [MaskFormer తో పానోప్టిక్ సెగ్మెంటేషన్](https://huggingface.co/facebook/maskformer-swin-small-coco) +- [DPT తో లోతు అంచనా](https://huggingface.co/docs/transformers/model_doc/dpt) +- [VideoMAE తో వీడియో వర్గీకరణ](https://huggingface.co/docs/transformers/model_doc/videomae) +- [OneFormer తో యూనివర్సల్ 
సెగ్మెంటేషన్](https://huggingface.co/shi-labs/oneformer_ade20k_dinat_large) + +ఆడియోలో: +- [Wav2Vec2 తో ఆటోమేటిక్ స్పీచ్ రికగ్నిషన్](https://huggingface.co/facebook/wav2vec2-base-960h) +- [Wav2Vec2 తో కీవర్డ్ స్పాటింగ్](https://huggingface.co/superb/wav2vec2-base-superb-ks) +- [ఆడియో స్పెక్ట్రోగ్రామ్ ట్రాన్స్‌ఫార్మర్‌తో ఆడియో వర్గీకరణ](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) + +మల్టీమోడల్ టాస్క్‌లలో: +- [TAPAS తో టేబుల్ ప్రశ్న సమాధానాలు](https://huggingface.co/google/tapas-base-finetuned-wtq) +- [ViLT తో దృశ్యమాన ప్రశ్నకు సమాధానం](https://huggingface.co/dandelin/vilt-b32-finetuned-vqa) +- [CLIP తో జీరో-షాట్ ఇమేజ్ వర్గీకరణ](https://huggingface.co/openai/clip-vit-large-patch14) +- [LayoutLM తో డాక్యుమెంట్ ప్రశ్నకు సమాధానం](https://huggingface.co/impira/layoutlm-document-qa) +- [X-CLIP తో జీరో-షాట్ వీడియో వర్గీకరణ](https://huggingface.co/docs/transformers/model_doc/xclip) + +## ట్రాన్స్‌ఫార్మర్‌లను ఉపయోగించి 100 ప్రాజెక్టులు + +ట్రాన్స్‌ఫార్మర్లు ప్రీట్రైన్డ్ మోడల్‌లను ఉపయోగించడానికి టూల్‌కిట్ కంటే ఎక్కువ: ఇది దాని చుట్టూ నిర్మించిన ప్రాజెక్ట్‌ల సంఘం మరియు +హగ్గింగ్ ఫేస్ హబ్. డెవలపర్‌లు, పరిశోధకులు, విద్యార్థులు, ప్రొఫెసర్‌లు, ఇంజనీర్లు మరియు ఎవరినైనా అనుమతించేలా ట్రాన్స్‌ఫార్మర్‌లను మేము కోరుకుంటున్నాము +వారి కలల ప్రాజెక్టులను నిర్మించడానికి. + +ట్రాన్స్‌ఫార్మర్‌ల 100,000 నక్షత్రాలను జరుపుకోవడానికి, మేము స్పాట్‌లైట్‌ని ఉంచాలని నిర్ణయించుకున్నాము +సంఘం, మరియు మేము 100 జాబితాలను కలిగి ఉన్న [awesome-transformers](./awesome-transformers.md) పేజీని సృష్టించాము. +ట్రాన్స్‌ఫార్మర్ల పరిసరాల్లో అద్భుతమైన ప్రాజెక్టులు నిర్మించబడ్డాయి. + +జాబితాలో భాగమని మీరు విశ్వసించే ప్రాజెక్ట్‌ను మీరు కలిగి ఉంటే లేదా ఉపయోగిస్తుంటే, దయచేసి దానిని జోడించడానికి PRని తెరవండి! + +## మీరు హగ్గింగ్ ఫేస్ టీమ్ నుండి అనుకూల మద్దతు కోసం చూస్తున్నట్లయితే + + + HuggingFace Expert Acceleration Program +
+ +## త్వరిత పర్యటన + +ఇచ్చిన ఇన్‌పుట్ (టెక్స్ట్, ఇమేజ్, ఆడియో, ...)పై తక్షణమే మోడల్‌ను ఉపయోగించడానికి, మేము `pipeline` API ని అందిస్తాము. పైప్‌లైన్‌లు ఆ మోడల్ శిక్షణ సమయంలో ఉపయోగించిన ప్రీప్రాసెసింగ్‌తో కూడిన ప్రీట్రైన్డ్ మోడల్‌ను సమూహపరుస్తాయి. సానుకూల మరియు ప్రతికూల పాఠాలను వర్గీకరించడానికి పైప్‌లైన్‌ను త్వరగా ఎలా ఉపయోగించాలో ఇక్కడ ఉంది: + +```python +>>> from transformers import pipeline + +# Allocate a pipeline for sentiment-analysis +>>> classifier = pipeline('sentiment-analysis') +>>> classifier('We are very happy to introduce pipeline to the transformers repository.') +[{'label': 'POSITIVE', 'score': 0.9996980428695679}] +``` + +రెండవ లైన్ కోడ్ డౌన్‌లోడ్ మరియు పైప్‌లైన్ ఉపయోగించే ప్రీట్రైన్డ్ మోడల్‌ను కాష్ చేస్తుంది, మూడవది ఇచ్చిన టెక్స్ట్‌పై మూల్యాంకనం చేస్తుంది. ఇక్కడ సమాధానం 99.97% విశ్వాసంతో "పాజిటివ్". + +చాలా పనులు NLPలో కానీ కంప్యూటర్ విజన్ మరియు స్పీచ్‌లో కూడా ముందుగా శిక్షణ పొందిన `pipeline` సిద్ధంగా ఉన్నాయి. ఉదాహరణకు, మనం చిత్రంలో గుర్తించిన వస్తువులను సులభంగా సంగ్రహించవచ్చు: + +``` python +>>> import requests +>>> from PIL import Image +>>> from transformers import pipeline + +# Download an image with cute cats +>>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/coco_sample.png" +>>> image_data = requests.get(url, stream=True).raw +>>> image = Image.open(image_data) + +# Allocate a pipeline for object detection +>>> object_detector = pipeline('object-detection') +>>> object_detector(image) +[{'score': 0.9982201457023621, + 'label': 'remote', + 'box': {'xmin': 40, 'ymin': 70, 'xmax': 175, 'ymax': 117}}, + {'score': 0.9960021376609802, + 'label': 'remote', + 'box': {'xmin': 333, 'ymin': 72, 'xmax': 368, 'ymax': 187}}, + {'score': 0.9954745173454285, + 'label': 'couch', + 'box': {'xmin': 0, 'ymin': 1, 'xmax': 639, 'ymax': 473}}, + {'score': 0.9988006353378296, + 'label': 'cat', + 'box': {'xmin': 13, 'ymin': 52, 'xmax': 314, 'ymax': 470}}, + {'score': 0.9986783862113953, + 'label': 'cat', + 'box': {'xmin': 345, 'ymin': 23, 'xmax': 640, 'ymax': 368}}] +``` + +ఇక్కడ మనం ఆబ్జెక్ట్ చుట్టూ ఉన్న బాక్స్ మరియు కాన్ఫిడెన్స్ స్కోర్‌తో చిత్రంలో గుర్తించబడిన వస్తువుల జాబితాను పొందుతాము. ఇక్కడ ఎడమవైపున ఉన్న అసలు చిత్రం, కుడివైపున అంచనాలు ప్రదర్శించబడతాయి: + +

+ + +

+ +మీరు [ఈ ట్యుటోరియల్](https://huggingface.co/docs/transformers/task_summary)లో `pipeline` API ద్వారా సపోర్ట్ చేసే టాస్క్‌ల గురించి మరింత తెలుసుకోవచ్చు. + +`pipeline`తో పాటు, మీరు ఇచ్చిన టాస్క్‌లో ఏదైనా ప్రీట్రైన్డ్ మోడల్‌లను డౌన్‌లోడ్ చేయడానికి మరియు ఉపయోగించడానికి, దీనికి మూడు లైన్ల కోడ్ సరిపోతుంది. ఇక్కడ PyTorch వెర్షన్ ఉంది: +```python +>>> from transformers import AutoTokenizer, AutoModel + +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = AutoModel.from_pretrained("google-bert/bert-base-uncased") + +>>> inputs = tokenizer("Hello world!", return_tensors="pt") +>>> outputs = model(**inputs) +``` + +మరియు TensorFlow కి సమానమైన కోడ్ ఇక్కడ ఉంది: +```python +>>> from transformers import AutoTokenizer, TFAutoModel + +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = TFAutoModel.from_pretrained("google-bert/bert-base-uncased") + +>>> inputs = tokenizer("Hello world!", return_tensors="tf") +>>> outputs = model(**inputs) +``` + +ప్రిట్రైన్డ్ మోడల్ ఆశించే అన్ని ప్రీప్రాసెసింగ్‌లకు టోకెనైజర్ బాధ్యత వహిస్తుంది మరియు నేరుగా ఒకే స్ట్రింగ్ (పై ఉదాహరణలలో వలె) లేదా జాబితాపై కాల్ చేయవచ్చు. ఇది మీరు డౌన్‌స్ట్రీమ్ కోడ్‌లో ఉపయోగించగల నిఘంటువుని అవుట్‌పుట్ చేస్తుంది లేదా ** ఆర్గ్యుమెంట్ అన్‌ప్యాకింగ్ ఆపరేటర్‌ని ఉపయోగించి నేరుగా మీ మోడల్‌కి పంపుతుంది. + +మోడల్ కూడా సాధారణ [Pytorch `nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) లేదా [TensorFlow `tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) (మీ బ్యాకెండ్‌ని బట్టి) మీరు మామూలుగా ఉపయోగించవచ్చు. [ఈ ట్యుటోరియల్](https://huggingface.co/docs/transformers/training) అటువంటి మోడల్‌ని క్లాసిక్ PyTorch లేదా TensorFlow ట్రైనింగ్ లూప్‌లో ఎలా ఇంటిగ్రేట్ చేయాలో లేదా మా `Trainer` API ని ఎలా ఉపయోగించాలో వివరిస్తుంది కొత్త డేటాసెట్. + +## నేను ట్రాన్స్‌ఫార్మర్‌లను ఎందుకు ఉపయోగించాలి? + +1. ఉపయోగించడానికి సులభమైన స్టేట్ ఆఫ్ ది ఆర్ట్ మోడల్‌లు: + - సహజ భాషా అవగాహన & ఉత్పత్తి, కంప్యూటర్ దృష్టి మరియు ఆడియో పనులపై అధిక పనితీరు. + - విద్యావేత్తలు మరియు అభ్యాసకుల ప్రవేశానికి తక్కువ అవరోధం. + - తెలుసుకోవడానికి కేవలం మూడు తరగతులతో కొన్ని వినియోగదారు-ముఖ సంగ్రహణలు. + - మా అన్ని ప్రీట్రైన్డ్ మోడల్‌లను ఉపయోగించడం కోసం ఏకీకృత API. + +2. తక్కువ గణన ఖర్చులు, చిన్న కార్బన్ పాదముద్ర: + - పరిశోధకులు ఎల్లప్పుడూ మళ్లీ శిక్షణ పొందే బదులు శిక్షణ పొందిన నమూనాలను పంచుకోవచ్చు. + - అభ్యాసకులు గణన సమయాన్ని మరియు ఉత్పత్తి ఖర్చులను తగ్గించగలరు. + - అన్ని పద్ధతుల్లో 60,000 కంటే ఎక్కువ ప్రీట్రైన్డ్ మోడల్‌లతో డజన్ల కొద్దీ ఆర్కిటెక్చర్‌లు. + +3. మోడల్ జీవితకాలంలో ప్రతి భాగానికి సరైన ఫ్రేమ్‌వర్క్‌ను ఎంచుకోండి: + - 3 లైన్ల కోడ్‌లో స్టేట్ ఆఫ్ ది ఆర్ట్ మోడల్‌లకు శిక్షణ ఇవ్వండి. + - TF2.0/PyTorch/JAX ఫ్రేమ్‌వర్క్‌ల మధ్య ఒకే మోడల్‌ను ఇష్టానుసారంగా తరలించండి. + - శిక్షణ, మూల్యాంకనం మరియు ఉత్పత్తి కోసం సరైన ఫ్రేమ్‌వర్క్‌ను సజావుగా ఎంచుకోండి. + +4. మీ అవసరాలకు అనుగుణంగా మోడల్ లేదా ఉదాహరణను సులభంగా అనుకూలీకరించండి: + - ప్రతి ఆర్కిటెక్చర్ దాని అసలు రచయితలు ప్రచురించిన ఫలితాలను పునరుత్పత్తి చేయడానికి మేము ఉదాహరణలను అందిస్తాము. + - మోడల్ ఇంటర్నల్‌లు వీలైనంత స్థిరంగా బహిర్గతమవుతాయి. + - శీఘ్ర ప్రయోగాల కోసం లైబ్రరీ నుండి స్వతంత్రంగా మోడల్ ఫైల్‌లను ఉపయోగించవచ్చు. + +## నేను ట్రాన్స్‌ఫార్మర్‌లను ఎందుకు ఉపయోగించకూడదు? + +- ఈ లైబ్రరీ న్యూరల్ నెట్‌ల కోసం బిల్డింగ్ బ్లాక్‌ల మాడ్యులర్ టూల్‌బాక్స్ కాదు. మోడల్ ఫైల్‌లలోని కోడ్ ఉద్దేశపూర్వకంగా అదనపు సంగ్రహణలతో రీఫ్యాక్టరింగ్ చేయబడదు, తద్వారా పరిశోధకులు అదనపు సంగ్రహణలు/ఫైళ్లలోకి ప్రవేశించకుండా ప్రతి మోడల్‌పై త్వరగా మళ్లించగలరు. +- శిక్షణ API ఏ మోడల్‌లో పని చేయడానికి ఉద్దేశించబడలేదు కానీ లైబ్రరీ అందించిన మోడల్‌లతో పని చేయడానికి ఆప్టిమైజ్ చేయబడింది. 
సాధారణ మెషిన్ లెర్నింగ్ లూప్‌ల కోసం, మీరు మరొక లైబ్రరీని ఉపయోగించాలి (బహుశా, [Accelerate](https://huggingface.co/docs/accelerate)). +- మేము వీలైనన్ని ఎక్కువ వినియోగ సందర్భాలను ప్రదర్శించడానికి ప్రయత్నిస్తున్నప్పుడు, మా [ఉదాహరణల ఫోల్డర్](https://github.com/huggingface/transformers/tree/main/examples)లోని స్క్రిప్ట్‌లు కేవలం: ఉదాహరణలు. మీ నిర్దిష్ట సమస్యపై అవి పని చేయవు మరియు వాటిని మీ అవసరాలకు అనుగుణంగా మార్చుకోవడానికి మీరు కొన్ని కోడ్ లైన్‌లను మార్చవలసి ఉంటుంది. + +## సంస్థాపన + +### పిప్ తో + +ఈ రిపోజిటరీ పైథాన్ 3.8+, ఫ్లాక్స్ 0.4.1+, PyTorch 1.11+ మరియు TensorFlow 2.6+లో పరీక్షించబడింది. + +మీరు [వర్చువల్ వాతావరణం](https://docs.python.org/3/library/venv.html)లో 🤗 ట్రాన్స్‌ఫార్మర్‌లను ఇన్‌స్టాల్ చేయాలి. మీకు పైథాన్ వర్చువల్ పరిసరాల గురించి తెలియకుంటే, [యూజర్ గైడ్](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/) చూడండి. + +ముందుగా, మీరు ఉపయోగించబోతున్న పైథాన్ వెర్షన్‌తో వర్చువల్ వాతావరణాన్ని సృష్టించండి మరియు దానిని సక్రియం చేయండి. + +అప్పుడు, మీరు ఫ్లాక్స్, పైటార్చ్ లేదా టెన్సర్‌ఫ్లోలో కనీసం ఒకదానిని ఇన్‌స్టాల్ చేయాలి. +దయచేసి [TensorFlow ఇన్‌స్టాలేషన్ పేజీ](https://www.tensorflow.org/install/), [PyTorch ఇన్‌స్టాలేషన్ పేజీ](https://pytorch.org/get-started/locally/#start-locally) మరియు/ని చూడండి లేదా మీ ప్లాట్‌ఫారమ్ కోసం నిర్దిష్ట ఇన్‌స్టాలేషన్ కమాండ్‌కు సంబంధించి [Flax](https://github.com/google/flax#quick-install) మరియు [Jax](https://github.com/google/jax#installation) ఇన్‌స్టాలేషన్ పేజీలు . + +ఆ బ్యాకెండ్‌లలో ఒకటి ఇన్‌స్టాల్ చేయబడినప్పుడు, 🤗 ట్రాన్స్‌ఫార్మర్‌లను ఈ క్రింది విధంగా పిప్‌ని ఉపయోగించి ఇన్‌స్టాల్ చేయవచ్చు: + +```bash +pip install transformers +``` + +మీరు ఉదాహరణలతో ప్లే చేయాలనుకుంటే లేదా కోడ్ యొక్క బ్లీడింగ్ ఎడ్జ్ అవసరం మరియు కొత్త విడుదల కోసం వేచి ఉండలేకపోతే, మీరు తప్పనిసరిగా [మూలం నుండి లైబ్రరీని ఇన్‌స్టాల్ చేయాలి](https://huggingface.co/docs/transformers/installation#installing-from-source). + +### కొండా తో + +🤗 కింది విధంగా కొండా ఉపయోగించి ట్రాన్స్‌ఫార్మర్‌లను ఇన్‌స్టాల్ చేయవచ్చు: + +```shell script +conda install conda-forge::transformers +``` + +> **_గమనిక:_** `huggingface` ఛానెల్ నుండి `transformers` ఇన్‌స్టాల్ చేయడం పురాతనంగా ఉంది. + +Flax, PyTorch లేదా TensorFlow యొక్క ఇన్‌స్టాలేషన్ పేజీలను కొండాతో ఎలా ఇన్‌స్టాల్ చేయాలో చూడటానికి వాటిని అనుసరించండి. + +> **_గమనిక:_** Windowsలో, కాషింగ్ నుండి ప్రయోజనం పొందేందుకు మీరు డెవలపర్ మోడ్‌ని సక్రియం చేయమని ప్రాంప్ట్ చేయబడవచ్చు. ఇది మీకు ఎంపిక కాకపోతే, దయచేసి [ఈ సంచిక](https://github.com/huggingface/huggingface_hub/issues/1062)లో మాకు తెలియజేయండి. + +## మోడల్ ఆర్కిటెక్చర్లు + +**[అన్ని మోడల్ చెక్‌పాయింట్‌లు](https://huggingface.co/models)** 🤗 అందించిన ట్రాన్స్‌ఫార్మర్లు huggingface.co [model hub](https://huggingface.co/models) నుండి సజావుగా ఏకీకృతం చేయబడ్డాయి [users](https://huggingface.co/users) మరియు [organizations](https://huggingface.co/organizations) ద్వారా నేరుగా అప్‌లోడ్ చేయబడతాయి. + +ప్రస్తుత తనిఖీ కేంద్రాల సంఖ్య: ![](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/models&color=brightgreen) + +🤗 ట్రాన్స్‌ఫార్మర్లు ప్రస్తుతం కింది ఆర్కిటెక్చర్‌లను అందజేస్తున్నాయి (వాటిలో ప్రతి ఒక్కటి ఉన్నత స్థాయి సారాంశం కోసం [ఇక్కడ](https://huggingface.co/docs/transformers/model_summary) చూడండి): + +1. **[ALBERT](https://huggingface.co/docs/transformers/model_doc/albert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. +1. 
**[ALIGN](https://huggingface.co/docs/transformers/model_doc/align)** (from Google Research) released with the paper [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918) by Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. +1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell. +1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass. +1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long. +1. **[Bark](https://huggingface.co/docs/transformers/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team. +1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. +1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis. +1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen. +1. **[BEiT](https://huggingface.co/docs/transformers/model_doc/beit)** (from Microsoft) released with the paper [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254) by Hangbo Bao, Li Dong, Furu Wei. +1. **[BERT](https://huggingface.co/docs/transformers/model_doc/bert)** (from Google) released with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. +1. **[BERT For Sequence Generation](https://huggingface.co/docs/transformers/model_doc/bert-generation)** (from Google) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn. +1. **[BERTweet](https://huggingface.co/docs/transformers/model_doc/bertweet)** (from VinAI Research) released with the paper [BERTweet: A pre-trained language model for English Tweets](https://aclanthology.org/2020.emnlp-demos.2/) by Dat Quoc Nguyen, Thanh Vu and Anh Tuan Nguyen. +1. 
**[BigBird-Pegasus](https://huggingface.co/docs/transformers/model_doc/bigbird_pegasus)** (from Google Research) released with the paper [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed. +1. **[BigBird-RoBERTa](https://huggingface.co/docs/transformers/model_doc/big_bird)** (from Google Research) released with the paper [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed. +1. **[BioGpt](https://huggingface.co/docs/transformers/model_doc/biogpt)** (from Microsoft Research AI4Science) released with the paper [BioGPT: generative pre-trained transformer for biomedical text generation and mining](https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac409/6713511?guestAccessKey=a66d9b5d-4f83-4017-bb52-405815c907b9) by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu. +1. **[BiT](https://huggingface.co/docs/transformers/model_doc/bit)** (from Google AI) released with the paper [Big Transfer (BiT): General Visual Representation Learning](https://arxiv.org/abs/1912.11370) by Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, Neil Houlsby. +1. **[Blenderbot](https://huggingface.co/docs/transformers/model_doc/blenderbot)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston. +1. **[BlenderbotSmall](https://huggingface.co/docs/transformers/model_doc/blenderbot-small)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston. +1. **[BLIP](https://huggingface.co/docs/transformers/model_doc/blip)** (from Salesforce) released with the paper [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086) by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi. +1. **[BLIP-2](https://huggingface.co/docs/transformers/model_doc/blip-2)** (from Salesforce) released with the paper [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597) by Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. +1. **[BLOOM](https://huggingface.co/docs/transformers/model_doc/bloom)** (from BigScience workshop) released by the [BigScience Workshop](https://bigscience.huggingface.co/). +1. **[BORT](https://huggingface.co/docs/transformers/model_doc/bort)** (from Alexa) released with the paper [Optimal Subarchitecture Extraction For BERT](https://arxiv.org/abs/2010.10499) by Adrian de Wynter and Daniel J. Perry. +1. 
**[BridgeTower](https://huggingface.co/docs/transformers/model_doc/bridgetower)** (from Harbin Institute of Technology/Microsoft Research Asia/Intel Labs) released with the paper [BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning](https://arxiv.org/abs/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. +1. **[BROS](https://huggingface.co/docs/transformers/model_doc/bros)** (from NAVER CLOVA) released with the paper [BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents](https://arxiv.org/abs/2108.04539) by Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park. +1. **[ByT5](https://huggingface.co/docs/transformers/model_doc/byt5)** (from Google Research) released with the paper [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel. +1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot. +1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. +1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou. +1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. +1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. +1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker. +1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong. +1. 
**[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (from MetaAI) released with the paper [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve. +1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (from Microsoft Research Asia) released with the paper [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang. +1. **[ConvBERT](https://huggingface.co/docs/transformers/model_doc/convbert)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan. +1. **[ConvNeXT](https://huggingface.co/docs/transformers/model_doc/convnext)** (from Facebook AI) released with the paper [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie. +1. **[ConvNeXTV2](https://huggingface.co/docs/transformers/model_doc/convnextv2)** (from Facebook AI) released with the paper [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie. +1. **[CPM](https://huggingface.co/docs/transformers/model_doc/cpm)** (from Tsinghua University) released with the paper [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun. +1. **[CPM-Ant](https://huggingface.co/docs/transformers/model_doc/cpmant)** (from OpenBMB) released by the [OpenBMB](https://www.openbmb.org/). +1. **[CTRL](https://huggingface.co/docs/transformers/model_doc/ctrl)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher. +1. **[CvT](https://huggingface.co/docs/transformers/model_doc/cvt)** (from Microsoft) released with the paper [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang. +1. **[Data2Vec](https://huggingface.co/docs/transformers/model_doc/data2vec)** (from Facebook) released with the paper [Data2Vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli. +1. 
**[DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. +1. **[DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2)** (from Microsoft) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. +1. **[Decision Transformer](https://huggingface.co/docs/transformers/model_doc/decision_transformer)** (from Berkeley/Facebook/Google) released with the paper [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345) by Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch. +1. **[Deformable DETR](https://huggingface.co/docs/transformers/model_doc/deformable_detr)** (from SenseTime Research) released with the paper [Deformable DETR: Deformable Transformers for End-to-End Object Detection](https://arxiv.org/abs/2010.04159) by Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai. +1. **[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou. +1. **[DePlot](https://huggingface.co/docs/transformers/model_doc/deplot)** (from Google AI) released with the paper [DePlot: One-shot visual language reasoning by plot-to-table translation](https://arxiv.org/abs/2212.10505) by Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, Yasemin Altun. +1. **[DETA](https://huggingface.co/docs/transformers/model_doc/deta)** (from The University of Texas at Austin) released with the paper [NMS Strikes Back](https://arxiv.org/abs/2212.06137) by Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl. +1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko. +1. **[DialoGPT](https://huggingface.co/docs/transformers/model_doc/dialogpt)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan. +1. **[DiNAT](https://huggingface.co/docs/transformers/model_doc/dinat)** (from SHI Labs) released with the paper [Dilated Neighborhood Attention Transformer](https://arxiv.org/abs/2209.15001) by Ali Hassani and Humphrey Shi. +1. 
**[DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2)** (from Meta AI) released with the paper [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski. +1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) and a German version of DistilBERT. +1. **[DiT](https://huggingface.co/docs/transformers/model_doc/dit)** (from Microsoft Research) released with the paper [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei. +1. **[Donut](https://huggingface.co/docs/transformers/model_doc/donut)** (from NAVER), released together with the paper [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park. +1. **[DPR](https://huggingface.co/docs/transformers/model_doc/dpr)** (from Facebook) released with the paper [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. +1. **[DPT](https://huggingface.co/docs/transformers/master/model_doc/dpt)** (from Intel Labs) released with the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun. +1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (from Snap Research) released with the paper [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) by Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren. +1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le. +1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. +1. 
**[EnCodec](https://huggingface.co/docs/transformers/model_doc/encodec)** (from Meta AI) released with the paper [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438) by Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi. +1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn. +1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (from Baidu) released with the paper [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) by Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu. +1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (from Baidu) released with the paper [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) by Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. +1. **[ESM](https://huggingface.co/docs/transformers/model_doc/esm)** (from Meta AI) are transformer protein language models. **ESM-1b** was released with the paper [Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences](https://www.pnas.org/content/118/15/e2016239118) by Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. **ESM-1v** was released with the paper [Language models enable zero-shot prediction of the effects of mutations on protein function](https://doi.org/10.1101/2021.07.09.450648) by Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu and Alexander Rives. **ESM-2 and ESMFold** were released with the paper [Language models of protein sequences at the scale of evolution enable accurate structure prediction](https://doi.org/10.1101/2022.07.20.500902) by Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, Alexander Rives. +1. **[Falcon](https://huggingface.co/docs/transformers/model_doc/falcon)** (from Technology Innovation Institute) by Almazrouei, Ebtesam and Alobeidli, Hamza and Alshamsi, Abdulaziz and Cappelli, Alessandro and Cojocaru, Ruxandra and Debbah, Merouane and Goffinet, Etienne and Heslow, Daniel and Launay, Julien and Malartic, Quentin and Noune, Badreddine and Pannier, Baptiste and Penedo, Guilherme. +1. **[FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei +1. 
**[FLAN-UL2](https://huggingface.co/docs/transformers/model_doc/flan-ul2)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-ul2-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei +1. **[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab. +1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc/flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. +1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. +1. **[FocalNet](https://huggingface.co/docs/transformers/model_doc/focalnet)** (from Microsoft Research) released with the paper [Focal Modulation Networks](https://arxiv.org/abs/2203.11926) by Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao. +1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le. +1. **[Fuyu](https://huggingface.co/docs/transformers/model_doc/fuyu)** (from ADEPT) Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar. Released with the paper [blog post](https://www.adept.ai/blog/fuyu-8b) +1. **[GIT](https://huggingface.co/docs/transformers/model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang. +1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim. +1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://openai.com/research/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. +1. 
**[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. +1. **[GPT NeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox)** (from EleutherAI) released with the paper [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745) by Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach +1. **[GPT NeoX Japanese](https://huggingface.co/docs/transformers/model_doc/gpt_neox_japanese)** (from ABEJA) released by Shinya Otani, Takayoshi Makabe, Anuj Arora, and Kyo Hattori. +1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://openai.com/research/better-language-models/) by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. +1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki. +1. **[GPT-Sw3](https://huggingface.co/docs/transformers/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren. +1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra. +1. **[GPTSAN-japanese](https://huggingface.co/docs/transformers/model_doc/gptsan-japanese)** released in the repository [tanreinama/GPTSAN](https://github.com/tanreinama/GPTSAN/blob/main/report/model.md) by Toshiyuki Sakamoto(tanreinama). +1. **[Graphormer](https://huggingface.co/docs/transformers/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu. +1. 
**[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang. +1. **[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** (from Allegro.pl, AGH University of Science and Technology) released with the paper [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik. +1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed. +1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer. +1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. +1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. +1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. +1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (from Salesforce) released with the paper [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. +1. **[Jukebox](https://huggingface.co/docs/transformers/model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever. +1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou. +1. 
**[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou. +1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (from Microsoft Research Asia) released with the paper [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei. +1. **[LayoutXLM](https://huggingface.co/docs/transformers/model_doc/layoutxlm)** (from Microsoft Research Asia) released with the paper [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/abs/2104.08836) by Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei. +1. **[LED](https://huggingface.co/docs/transformers/model_doc/led)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan. +1. **[LeViT](https://huggingface.co/docs/transformers/model_doc/levit)** (from Meta AI) released with the paper [LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference](https://arxiv.org/abs/2104.01136) by Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, Matthijs Douze. +1. **[LiLT](https://huggingface.co/docs/transformers/model_doc/lilt)** (from South China University of Technology) released with the paper [LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding](https://arxiv.org/abs/2202.13669) by Jiapeng Wang, Lianwen Jin, Kai Ding. +1. **[LLaMA](https://huggingface.co/docs/transformers/model_doc/llama)** (from The FAIR team of Meta AI) released with the paper [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971) by Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. +1. 
**[Llama2](https://huggingface.co/docs/transformers/model_doc/llama2)** (from The FAIR team of Meta AI) released with the paper [Llama2: Open Foundation and Fine-Tuned Chat Models](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/) by Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushka rMishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing EllenTan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom. +1. **[Longformer](https://huggingface.co/docs/transformers/model_doc/longformer)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan. +1. **[LongT5](https://huggingface.co/docs/transformers/model_doc/longt5)** (from Google AI) released with the paper [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/abs/2112.07916) by Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang. +1. **[LUKE](https://huggingface.co/docs/transformers/model_doc/luke)** (from Studio Ousia) released with the paper [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto. +1. **[LXMERT](https://huggingface.co/docs/transformers/model_doc/lxmert)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal. +1. **[M-CTC-T](https://huggingface.co/docs/transformers/model_doc/mctct)** (from Facebook) released with the paper [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161) by Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert. +1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin. +1. **[MADLAD-400](https://huggingface.co/docs/transformers/model_doc/madlad-400)** (from Google) released with the paper [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](https://arxiv.org/abs/2309.04662) by Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. 
Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat. +1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team. +1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (from Microsoft Research Asia) released with the paper [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei. +1. **[Mask2Former](https://huggingface.co/docs/transformers/model_doc/mask2former)** (from FAIR and UIUC) released with the paper [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) by Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar. +1. **[MaskFormer](https://huggingface.co/docs/transformers/model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov. +1. **[MatCha](https://huggingface.co/docs/transformers/model_doc/matcha)** (from Google AI) released with the paper [MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering](https://arxiv.org/abs/2212.09662) by Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, Julian Martin Eisenschlos. +1. **[mBART](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. +1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan. +1. **[MEGA](https://huggingface.co/docs/transformers/model_doc/mega)** (from Meta/USC/CMU/SJTU) released with the paper [Mega: Moving Average Equipped Gated Attention](https://arxiv.org/abs/2209.10655) by Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. +1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. +1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. +1. 
**[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (from Alibaba Research) released with the paper [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) by Peng Wang, Cheng Da, and Cong Yao. +1. **[Mistral](https://huggingface.co/docs/transformers/model_doc/mistral)** (from Mistral AI) by The [Mistral AI](https://mistral.ai) team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. +1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (from Studio Ousia) released with the paper [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka. +1. **[MMS](https://huggingface.co/docs/transformers/model_doc/mms)** (from Facebook) released with the paper [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516) by Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli. +1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. +1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (from Google Inc.) released with the paper [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam. +1. **[MobileNetV2](https://huggingface.co/docs/transformers/model_doc/mobilenet_v2)** (from Google Inc.) released with the paper [MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/abs/1801.04381) by Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen. +1. **[MobileViT](https://huggingface.co/docs/transformers/model_doc/mobilevit)** (from Apple) released with the paper [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari. +1. **[MobileViTV2](https://huggingface.co/docs/transformers/model_doc/mobilevitv2)** (from Apple) released with the paper [Separable Self-attention for Mobile Vision Transformers](https://arxiv.org/abs/2206.02680) by Sachin Mehta and Mohammad Rastegari. +1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu. +1. **[MPT](https://huggingface.co/docs/transformers/model_doc/mpt)** (from MosaiML) released with the repository [llm-foundry](https://github.com/mosaicml/llm-foundry/) by the MosaicML NLP Team. +1. 
**[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (from the University of Wisconsin - Madison) released with the paper [Multi Resolution Analysis (MRA) for Approximate Self-Attention](https://arxiv.org/abs/2207.10284) by Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh. +1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel. +1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez. +1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (from RUC AI Box) released with the paper [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen. +1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (from SHI Labs) released with the paper [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi. +1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu. +1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team. +1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team. +1. **[Nougat](https://huggingface.co/docs/transformers/model_doc/nougat)** (from Meta AI) released with the paper [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) by Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic. +1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh. +1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi. +1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released on GitHub (now removed). +1. 
**[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al. +1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. +1. **[OWLv2](https://huggingface.co/docs/transformers/main/model_doc/owlv2)** (from Google AI) released with the paper [Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683) by Matthias Minderer, Alexey Gritsenko, Neil Houlsby. +1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu. +1. **[PEGASUS-X](https://huggingface.co/docs/transformers/model_doc/pegasus_x)** (from Google) released with the paper [Investigating Efficiently Extending Transformers for Long Input Summarization](https://arxiv.org/abs/2208.04347) by Jason Phang, Yao Zhao, and Peter J. Liu. +1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira. +1. **[Persimmon](https://huggingface.co/docs/transformers/model_doc/persimmon)** (from ADEPT) released in a [blog post](https://www.adept.ai/blog/persimmon-8b) by Erich Elsen, Augustus Odena, Maxwell Nye, Sağnak Taşırlar, Tri Dao, Curtis Hawthorne, Deepak Moparthi, Arushi Somani. +1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen. +1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (from Google) released with the paper [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) by Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova. +1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. +1. 
**[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng. +1. **[Pop2Piano](https://huggingface.co/docs/transformers/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi and Kyogu Lee. +1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. +1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (from Nanjing University, The University of Hong Kong etc.) released with the paper [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao. +1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius. +1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (from Facebook) released with the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) by Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. +1. **[REALM](https://huggingface.co/docs/transformers/model_doc/realm.html)** (from Google Research) released with the paper [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909) by Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang. +1. **[Reformer](https://huggingface.co/docs/transformers/model_doc/reformer)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya. +1. **[RegNet](https://huggingface.co/docs/transformers/model_doc/regnet)** (from META Platforms) released with the paper [Designing Network Design Space](https://arxiv.org/abs/2003.13678) by Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, Piotr Dollár. +1. **[RemBERT](https://huggingface.co/docs/transformers/model_doc/rembert)** (from Google Research) released with the paper [Rethinking embedding coupling in pre-trained language models](https://arxiv.org/abs/2010.12821) by Hyung Won Chung, Thibault Févry, Henry Tsai, M. Johnson, Sebastian Ruder. +1. **[ResNet](https://huggingface.co/docs/transformers/model_doc/resnet)** (from Microsoft Research) released with the paper [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) by Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. +1. 
**[RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. +1. **[RoBERTa-PreLayerNorm](https://huggingface.co/docs/transformers/model_doc/roberta-prelayernorm)** (from Facebook) released with the paper [fairseq: A Fast, Extensible Toolkit for Sequence Modeling](https://arxiv.org/abs/1904.01038) by Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli. +1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou. +1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu. +1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng), released on [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng. +1. **[SeamlessM4T](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team. +1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo. +1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick. +1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi. +1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi. +1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei. +1. 
**[SpeechToTextTransformer](https://huggingface.co/docs/transformers/model_doc/speech_to_text)** (from Facebook), released together with the paper [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino. +1. **[SpeechToTextTransformer2](https://huggingface.co/docs/transformers/model_doc/speech_to_text_2)** (from Facebook), released together with the paper [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau. +1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (from Tel Aviv University), released together with the paper [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy. +1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer. +1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (from MBZUAI) released with the paper [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446) by Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan. +1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo. +1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo. +1. **[Swin2SR](https://huggingface.co/docs/transformers/model_doc/swin2sr)** (from University of Würzburg) released with the paper [Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration](https://arxiv.org/abs/2209.11345) by Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte. +1. **[SwitchTransformers](https://huggingface.co/docs/transformers/model_doc/switch_transformers)** (from Google) released with the paper [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961) by William Fedus, Barret Zoph, Noam Shazeer. +1. **[T5](https://huggingface.co/docs/transformers/model_doc/t5)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu. +1. 
**[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu. +1. **[Table Transformer](https://huggingface.co/docs/transformers/model_doc/table-transformer)** (from Microsoft Research) released with the paper [PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents](https://arxiv.org/abs/2110.00061) by Brandon Smock, Rohith Pesala, Robin Abraham. +1. **[TAPAS](https://huggingface.co/docs/transformers/model_doc/tapas)** (from Google AI) released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos. +1. **[TAPEX](https://huggingface.co/docs/transformers/model_doc/tapex)** (from Microsoft Research) released with the paper [TAPEX: Table Pre-training via Learning a Neural SQL Executor](https://arxiv.org/abs/2107.07653) by Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, Jian-Guang Lou. +1. **[Time Series Transformer](https://huggingface.co/docs/transformers/model_doc/time_series_transformer)** (from HuggingFace). +1. **[TimeSformer](https://huggingface.co/docs/transformers/model_doc/timesformer)** (from Facebook) released with the paper [Is Space-Time Attention All You Need for Video Understanding?](https://arxiv.org/abs/2102.05095) by Gedas Bertasius, Heng Wang, Lorenzo Torresani. +1. **[Trajectory Transformer](https://huggingface.co/docs/transformers/model_doc/trajectory_transformers)** (from the University of California at Berkeley) released with the paper [Offline Reinforcement Learning as One Big Sequence Modeling Problem](https://arxiv.org/abs/2106.02039) by Michael Janner, Qiyang Li, Sergey Levine +1. **[Transformer-XL](https://huggingface.co/docs/transformers/model_doc/transfo-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. +1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. +1. **[TVLT](https://huggingface.co/docs/transformers/model_doc/tvlt)** (from UNC Chapel Hill) released with the paper [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) by Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal. +1. **[UL2](https://huggingface.co/docs/transformers/model_doc/ul2)** (from Google Research) released with the paper [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) by Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler +1. 
**[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (from Google Research) released with the paper [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) by Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant. +1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang. +1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu. +1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (from Peking University) released with the paper [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221) by Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun. +1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (from Tsinghua University and Nankai University) released with the paper [Visual Attention Network](https://arxiv.org/abs/2202.09741) by Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu. +1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (from Multimedia Computing Group, Nanjing University) released with the paper [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) by Zhan Tong, Yibing Song, Jue Wang, Limin Wang. +1. **[ViLT](https://huggingface.co/docs/transformers/model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim. +1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. +1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang. +1. **[ViT Hybrid](https://huggingface.co/docs/transformers/model_doc/vit_hybrid)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. +1. 
**[VitDet](https://huggingface.co/docs/transformers/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. +1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. +1. **[ViTMatte](https://huggingface.co/docs/transformers/model_doc/vitmatte)** (from HUST-VL) released with the paper [ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers](https://arxiv.org/abs/2305.15272) by Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang. +1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas. +1. **[VITS](https://huggingface.co/docs/transformers/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son. +1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. +1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli. +1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino. +1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli. +1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei. +1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever. +1. 
**[X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling. +1. **[X-MOD](https://huggingface.co/docs/transformers/model_doc/xmod)** (from Meta AI) released with the paper [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](http://dx.doi.org/10.18653/v1/2022.naacl-main.255) by Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe. +1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li. +1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau. +1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. +1. **[XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. +1. **[XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/model_doc/xlm-roberta-xl)** (from Facebook AI), released together with the paper [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://arxiv.org/abs/2105.00572) by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau. +1. **[XLM-V](https://huggingface.co/docs/transformers/model_doc/xlm-v)** (from Meta AI) released with the paper [XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models](https://arxiv.org/abs/2301.10472) by Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, Madian Khabsa. +1. **[XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le. +1. 
**[XLS-R](https://huggingface.co/docs/transformers/model_doc/xls_r)** (from Facebook AI) released with the paper [XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale](https://arxiv.org/abs/2111.09296) by Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli. +1. **[XLSR-Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/xlsr_wav2vec2)** (from Facebook AI) released with the paper [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://arxiv.org/abs/2006.13979) by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli. +1. **[YOLOS](https://huggingface.co/docs/transformers/model_doc/yolos)** (from Huazhong University of Science & Technology) released with the paper [You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection](https://arxiv.org/abs/2106.00666) by Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu. +1. **[YOSO](https://huggingface.co/docs/transformers/model_doc/yoso)** (from the University of Wisconsin - Madison) released with the paper [You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling](https://arxiv.org/abs/2111.09714) by Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh. +1. కొత్త మోడల్‌ను అందించాలనుకుంటున్నారా? కొత్త మోడల్‌ను జోడించే ప్రక్రియలో మీకు మార్గనిర్దేశం చేసేందుకు మేము **వివరణాత్మక గైడ్ మరియు టెంప్లేట్‌లను** జోడించాము. మీరు వాటిని రిపోజిటరీ యొక్క [`templates`](./templates) ఫోల్డర్‌లో కనుగొనవచ్చు. మీ PRని ప్రారంభించడానికి ముందు [సహకార మార్గదర్శకాలు](./CONTRIBUTING.md)ని తనిఖీ చేసి, నిర్వహణదారులను సంప్రదించండి లేదా అభిప్రాయాన్ని సేకరించడానికి సమస్యను తెరవండి. + +ప్రతి మోడల్ ఫ్లాక్స్, పైటార్చ్ లేదా టెన్సర్‌ఫ్లోలో అమలు చేయబడిందా లేదా 🤗 Tokenizers లైబ్రరీ ద్వారా అనుబంధించబడిన టోకెనైజర్‌ని కలిగి ఉందో లేదో తనిఖీ చేయడానికి, [ఈ పట్టిక](https://huggingface.co/docs/transformers/index#supported-frameworks)ను చూడండి. + +ఈ అమలులు అనేక డేటాసెట్‌లలో పరీక్షించబడ్డాయి (ఉదాహరణ స్క్రిప్ట్‌లను చూడండి) మరియు అసలైన అమలుల పనితీరుతో సరిపోలాలి. మీరు [డాక్యుమెంటేషన్](https://github.com/huggingface/transformers/tree/main/examples) యొక్క ఉదాహరణల విభాగంలో పనితీరుపై మరిన్ని వివరాలను కనుగొనవచ్చు. 
+ +## ఇంకా నేర్చుకో + +| విభాగం | వివరణ | +|-|-| +| [డాక్యుమెంటేషన్](https://huggingface.co/docs/transformers/) | పూర్తి API డాక్యుమెంటేషన్ మరియు ట్యుటోరియల్స్ | +| [టాస్క్ సారాంశం](https://huggingface.co/docs/transformers/task_summary) | 🤗 ట్రాన్స్‌ఫార్మర్‌ల ద్వారా సపోర్ట్ చేయబడిన విధులు | +| [ప్రీప్రాసెసింగ్ ట్యుటోరియల్](https://huggingface.co/docs/transformers/preprocessing) | మోడల్‌ల కోసం డేటాను సిద్ధం చేయడానికి `Tokenizer` క్లాస్‌ని ఉపయోగించడం | +| [ట్రైనింగ్ మరియు ఫైన్-ట్యూనింగ్](https://huggingface.co/docs/transformers/training) | PyTorch/TensorFlow ట్రైనింగ్ లూప్ మరియు `Trainer` APIలో 🤗 ట్రాన్స్‌ఫార్మర్లు అందించిన మోడల్‌లను ఉపయోగించడం | +| [త్వరిత పర్యటన: ఫైన్-ట్యూనింగ్/యూసేజ్ స్క్రిప్ట్‌లు](https://github.com/huggingface/transformers/tree/main/examples) | విస్తృత శ్రేణి టాస్క్‌లపై ఫైన్-ట్యూనింగ్ మోడల్స్ కోసం ఉదాహరణ స్క్రిప్ట్‌లు | +| [మోడల్ భాగస్వామ్యం మరియు అప్‌లోడ్ చేయడం](https://huggingface.co/docs/transformers/model_sharing) | కమ్యూనిటీతో మీ ఫైన్-ట్యూన్డ్ మోడల్‌లను అప్‌లోడ్ చేయండి మరియు భాగస్వామ్యం చేయండి | + +## అనులేఖనం + +🤗 ట్రాన్స్‌ఫార్మర్స్ లైబ్రరీ కోసం మీరు ఉదహరించగల [పేపర్](https://www.aclweb.org/anthology/2020.emnlp-demos.6/) ఇప్పుడు మా వద్ద ఉంది: +```bibtex +@inproceedings{wolf-etal-2020-transformers, + title = "Transformers: State-of-the-Art Natural Language Processing", + author = "Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush", + booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations", + month = oct, + year = "2020", + address = "Online", + publisher = "Association for Computational Linguistics", + url = "https://www.aclweb.org/anthology/2020.emnlp-demos.6", + pages = "38--45" +} +``` diff --git a/README_zh-hans.md b/README_zh-hans.md index a248ba811d222d..f2b9b38273bfba 100644 --- a/README_zh-hans.md +++ b/README_zh-hans.md @@ -43,7 +43,7 @@ checkpoint: 检查点

-

+

Build @@ -71,8 +71,13 @@ checkpoint: 检查点 한국어 | Español | 日本語 | - हिन्दी -

+        हिन्दी |
+        Русский |
+        Português |
+        తెలుగు |
+        Français |
+        Deutsch |
+    

@@ -94,13 +99,13 @@ checkpoint: 检查点 你可以直接在模型页面上测试大多数 [model hub](https://huggingface.co/models) 上的模型。 我们也提供了 [私有模型托管、模型版本管理以及推理API](https://huggingface.co/pricing)。 这里是一些例子: -- [用 BERT 做掩码填词](https://huggingface.co/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France) +- [用 BERT 做掩码填词](https://huggingface.co/google-bert/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France) - [用 Electra 做命名实体识别](https://huggingface.co/dbmdz/electra-large-discriminator-finetuned-conll03-english?text=My+name+is+Sarah+and+I+live+in+London+city) -- [用 GPT-2 做文本生成](https://huggingface.co/gpt2?text=A+long+time+ago%2C+) -- [用 RoBERTa 做自然语言推理](https://huggingface.co/roberta-large-mnli?text=The+dog+was+lost.+Nobody+lost+any+animal) +- [用 GPT-2 做文本生成](https://huggingface.co/openai-community/gpt2?text=A+long+time+ago%2C+) +- [用 RoBERTa 做自然语言推理](https://huggingface.co/FacebookAI/roberta-large-mnli?text=The+dog+was+lost.+Nobody+lost+any+animal) - [用 BART 做文本摘要](https://huggingface.co/facebook/bart-large-cnn?text=The+tower+is+324+metres+%281%2C063+ft%29+tall%2C+about+the+same+height+as+an+81-storey+building%2C+and+the+tallest+structure+in+Paris.+Its+base+is+square%2C+measuring+125+metres+%28410+ft%29+on+each+side.+During+its+construction%2C+the+Eiffel+Tower+surpassed+the+Washington+Monument+to+become+the+tallest+man-made+structure+in+the+world%2C+a+title+it+held+for+41+years+until+the+Chrysler+Building+in+New+York+City+was+finished+in+1930.+It+was+the+first+structure+to+reach+a+height+of+300+metres.+Due+to+the+addition+of+a+broadcasting+aerial+at+the+top+of+the+tower+in+1957%2C+it+is+now+taller+than+the+Chrysler+Building+by+5.2+metres+%2817+ft%29.+Excluding+transmitters%2C+the+Eiffel+Tower+is+the+second+tallest+free-standing+structure+in+France+after+the+Millau+Viaduct) -- [用 DistilBERT 做问答](https://huggingface.co/distilbert-base-uncased-distilled-squad?text=Which+name+is+also+used+to+describe+the+Amazon+rainforest+in+English%3F&context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species) -- [用 T5 做翻译](https://huggingface.co/t5-base?text=My+name+is+Wolfgang+and+I+live+in+Berlin) +- [用 DistilBERT 
做问答](https://huggingface.co/distilbert/distilbert-base-uncased-distilled-squad?text=Which+name+is+also+used+to+describe+the+Amazon+rainforest+in+English%3F&context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species) +- [用 T5 做翻译](https://huggingface.co/google-t5/t5-base?text=My+name+is+Wolfgang+and+I+live+in+Berlin) **[Write With Transformer](https://transformer.huggingface.co)**,由抱抱脸团队打造,是一个文本生成的官方 demo。 @@ -146,8 +151,8 @@ checkpoint: 检查点 ```python >>> from transformers import AutoTokenizer, AutoModel ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") ->>> model = AutoModel.from_pretrained("bert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = AutoModel.from_pretrained("google-bert/bert-base-uncased") >>> inputs = tokenizer("Hello world!", return_tensors="pt") >>> outputs = model(**inputs) @@ -156,8 +161,8 @@ checkpoint: 检查点 ```python >>> from transformers import AutoTokenizer, TFAutoModel ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") ->>> model = TFAutoModel.from_pretrained("bert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = TFAutoModel.from_pretrained("google-bert/bert-base-uncased") >>> inputs = tokenizer("Hello world!", return_tensors="tf") >>> outputs = model(**inputs) @@ -200,7 +205,7 @@ checkpoint: 检查点 ### 使用 pip -这个仓库已在 Python 3.6+、Flax 0.3.2+、PyTorch 1.3.1+ 和 TensorFlow 2.3+ 下经过测试。 +这个仓库已在 Python 3.8+、Flax 0.4.1+、PyTorch 1.11+ 和 TensorFlow 2.6+ 下经过测试。 你可以在[虚拟环境](https://docs.python.org/3/library/venv.html)中安装 🤗 Transformers。如果你还不熟悉 Python 的虚拟环境,请阅此[用户说明](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/)。 @@ -218,14 +223,14 @@ pip install transformers ### 使用 conda -自 Transformers 4.0.0 版始,我们有了一个 conda 频道: `huggingface`。 - 🤗 Transformers 可以通过 conda 依此安装: ```shell script -conda install -c huggingface transformers +conda install conda-forge::transformers ``` +> **_笔记:_** 从 `huggingface` 渠道安装 `transformers` 已被废弃。 + 要通过 conda 安装 Flax、PyTorch 或 TensorFlow 其中之一,请参阅它们各自安装页的说明。 ## 模型架构 @@ -237,8 +242,11 @@ conda install -c huggingface transformers 🤗 Transformers 目前支持如下的架构(模型概述请阅[这里](https://huggingface.co/docs/transformers/model_summary)): 1. 
**[ALBERT](https://huggingface.co/docs/transformers/model_doc/albert)** (来自 Google Research and the Toyota Technological Institute at Chicago) 伴随论文 [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), 由 Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut 发布。 +1. **[ALIGN](https://huggingface.co/docs/transformers/model_doc/align)** (来自 Google Research) 伴随论文 [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918) 由 Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig 发布。 1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (来自 BAAI) 伴随论文 [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) 由 Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell 发布。 1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (来自 MIT) 伴随论文 [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) 由 Yuan Gong, Yu-An Chung, James Glass 发布。 +1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long. +1. **[Bark](https://huggingface.co/docs/transformers/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team. 1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (来自 Facebook) 伴随论文 [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) 由 Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer 发布。 1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (来自 École polytechnique) 伴随论文 [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) 由 Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis 发布。 1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (来自 VinAI Research) 伴随论文 [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) 由 Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen 发布。 @@ -253,22 +261,27 @@ conda install -c huggingface transformers 1. **[Blenderbot](https://huggingface.co/docs/transformers/model_doc/blenderbot)** (来自 Facebook) 伴随论文 [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) 由 Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston 发布。 1. **[BlenderbotSmall](https://huggingface.co/docs/transformers/model_doc/blenderbot-small)** (来自 Facebook) 伴随论文 [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) 由 Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston 发布。 1. 
**[BLIP](https://huggingface.co/docs/transformers/model_doc/blip)** (来自 Salesforce) 伴随论文 [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086) 由 Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi 发布。 -1. **[BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2)** (来自 Salesforce) 伴随论文 [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597) 由 Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi 发布。 +1. **[BLIP-2](https://huggingface.co/docs/transformers/model_doc/blip-2)** (来自 Salesforce) 伴随论文 [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597) 由 Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi 发布。 1. **[BLOOM](https://huggingface.co/docs/transformers/model_doc/bloom)** (from BigScience workshop) released by the [BigScience Workshop](https://bigscience.huggingface.co/). 1. **[BORT](https://huggingface.co/docs/transformers/model_doc/bort)** (来自 Alexa) 伴随论文 [Optimal Subarchitecture Extraction For BERT](https://arxiv.org/abs/2010.10499) 由 Adrian de Wynter and Daniel J. Perry 发布。 -1. **[BridgeTower](https://huggingface.co/docs/transformers/main/model_doc/bridgetower)** (from Harbin Institute of Technology/Microsoft Research Asia/Intel Labs) released with the paper [BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning](https://arxiv.org/abs/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. +1. **[BridgeTower](https://huggingface.co/docs/transformers/model_doc/bridgetower)** (from Harbin Institute of Technology/Microsoft Research Asia/Intel Labs) released with the paper [BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning](https://arxiv.org/abs/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. +1. **[BROS](https://huggingface.co/docs/transformers/model_doc/bros)** (来自 NAVER CLOVA) 伴随论文 [BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents](https://arxiv.org/abs/2108.04539) 由 Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park 发布。 1. **[ByT5](https://huggingface.co/docs/transformers/model_doc/byt5)** (来自 Google Research) 伴随论文 [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) 由 Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel 发布。 1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (来自 Inria/Facebook/Sorbonne) 伴随论文 [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) 由 Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot 发布。 1. **[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (来自 Google Research) 伴随论文 [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) 由 Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting 发布。 1. 
**[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (来自 OFA-Sys) 伴随论文 [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) 由 An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou 发布。 -1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (来自 LAION-AI) 伴随论文 [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation]https://arxiv.org/abs/2211.06687) 由 Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov 发布。 +1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (来自 LAION-AI) 伴随论文 [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) 由 Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov 发布。 1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (来自 OpenAI) 伴随论文 [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) 由 Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever 发布。 1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (来自 University of Göttingen) 伴随论文 [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) 由 Timo Lüddecke and Alexander Ecker 发布。 +1. **[CLVP](https://huggingface.co/docs/transformers/model_doc/clvp)** released with the paper [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker. 1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (来自 Salesforce) 伴随论文 [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) 由 Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong 发布。 +1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (来自 MetaAI) 伴随论文 [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) 由 Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve 发布。 1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (来自 Microsoft Research Asia) 伴随论文 [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) 由 Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang 发布。 1. **[ConvBERT](https://huggingface.co/docs/transformers/model_doc/convbert)** (来自 YituTech) 伴随论文 [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) 由 Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan 发布。 1. **[ConvNeXT](https://huggingface.co/docs/transformers/model_doc/convnext)** (来自 Facebook AI) 伴随论文 [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) 由 Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie 发布。 +1. 
**[ConvNeXTV2](https://huggingface.co/docs/transformers/model_doc/convnextv2)** (from Facebook AI) released with the paper [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie. 1. **[CPM](https://huggingface.co/docs/transformers/model_doc/cpm)** (来自 Tsinghua University) 伴随论文 [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) 由 Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun 发布。 +1. **[CPM-Ant](https://huggingface.co/docs/transformers/model_doc/cpmant)** (from OpenBMB) released by the [OpenBMB](https://www.openbmb.org/). 1. **[CTRL](https://huggingface.co/docs/transformers/model_doc/ctrl)** (来自 Salesforce) 伴随论文 [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) 由 Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher 发布。 1. **[CvT](https://huggingface.co/docs/transformers/model_doc/cvt)** (来自 Microsoft) 伴随论文 [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) 由 Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang 发布。 1. **[Data2Vec](https://huggingface.co/docs/transformers/model_doc/data2vec)** (来自 Facebook) 伴随论文 [Data2Vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) 由 Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli 发布。 @@ -277,41 +290,58 @@ conda install -c huggingface transformers 1. **[Decision Transformer](https://huggingface.co/docs/transformers/model_doc/decision_transformer)** (来自 Berkeley/Facebook/Google) 伴随论文 [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345) 由 Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch 发布。 1. **[Deformable DETR](https://huggingface.co/docs/transformers/model_doc/deformable_detr)** (来自 SenseTime Research) 伴随论文 [Deformable DETR: Deformable Transformers for End-to-End Object Detection](https://arxiv.org/abs/2010.04159) 由 Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai 发布。 1. **[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (来自 Facebook) 伴随论文 [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) 由 Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou 发布。 -1. **[DETA](https://huggingface.co/docs/transformers/main/model_doc/deta)** (来自 The University of Texas at Austin) 伴随论文 [NMS Strikes Back](https://arxiv.org/abs/2212.06137) 由 Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl 发布。 +1. **[DePlot](https://huggingface.co/docs/transformers/model_doc/deplot)** (来自 Google AI) 伴随论文 [DePlot: One-shot visual language reasoning by plot-to-table translation](https://arxiv.org/abs/2212.10505) 由 Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, Yasemin Altun 发布。 +1. 
**[Depth Anything](https://huggingface.co/docs/transformers/main/model_doc/depth_anything)** (来自 University of Hong Kong and TikTok) 伴随论文 [Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891) 由 Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao 发布。 +1. **[DETA](https://huggingface.co/docs/transformers/model_doc/deta)** (来自 The University of Texas at Austin) 伴随论文 [NMS Strikes Back](https://arxiv.org/abs/2212.06137) 由 Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl 发布。 1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (来自 Facebook) 伴随论文 [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) 由 Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko 发布。 1. **[DialoGPT](https://huggingface.co/docs/transformers/model_doc/dialogpt)** (来自 Microsoft Research) 伴随论文 [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) 由 Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan 发布。 1. **[DiNAT](https://huggingface.co/docs/transformers/model_doc/dinat)** (来自 SHI Labs) 伴随论文 [Dilated Neighborhood Attention Transformer](https://arxiv.org/abs/2209.15001) 由 Ali Hassani and Humphrey Shi 发布。 +1. **[DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2)** (来自 Meta AI) 伴随论文 [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) 由 Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski 发布。 1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (来自 HuggingFace), 伴随论文 [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) 由 Victor Sanh, Lysandre Debut and Thomas Wolf 发布。 同样的方法也应用于压缩 GPT-2 到 [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/distillation), RoBERTa 到 [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/distillation), Multilingual BERT 到 [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/distillation) 和德语版 DistilBERT。 1. **[DiT](https://huggingface.co/docs/transformers/model_doc/dit)** (来自 Microsoft Research) 伴随论文 [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) 由 Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei 发布。 1. **[Donut](https://huggingface.co/docs/transformers/model_doc/donut)** (来自 NAVER) 伴随论文 [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) 由 Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park 发布。 1. **[DPR](https://huggingface.co/docs/transformers/model_doc/dpr)** (来自 Facebook) 伴随论文 [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) 由 Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih 发布。 1. 
**[DPT](https://huggingface.co/docs/transformers/master/model_doc/dpt)** (来自 Intel Labs) 伴随论文 [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) 由 René Ranftl, Alexey Bochkovskiy, Vladlen Koltun 发布。 1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (来自 Snap Research) 伴随论文 [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) 由 Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren 发布。 +1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le. 1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (来自 Google Research/Stanford University) 伴随论文 [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) 由 Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning 发布。 +1. **[EnCodec](https://huggingface.co/docs/transformers/model_doc/encodec)** (来自 Meta AI) 伴随论文 [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438) 由 Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi 发布。 1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (来自 Google Research) 伴随论文 [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) 由 Sascha Rothe, Shashi Narayan, Aliaksei Severyn 发布。 1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (来自 Baidu) 伴随论文 [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) by Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu 发布。 -1. **[ErnieM](https://huggingface.co/docs/transformers/main/model_doc/ernie_m)** (来自 Baidu) 伴随论文 [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) 由 Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang 发布。 +1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (来自 Baidu) 伴随论文 [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) 由 Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang 发布。 1. **[ESM](https://huggingface.co/docs/transformers/model_doc/esm)** (from Meta AI) are transformer protein language models. **ESM-1b** was released with the paper [Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences](https://www.pnas.org/content/118/15/e2016239118) by Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. **ESM-1v** was released with the paper [Language models enable zero-shot prediction of the effects of mutations on protein function](https://doi.org/10.1101/2021.07.09.450648) by Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu and Alexander Rives. 
**ESM-2** was released with the paper [Language models of protein sequences at the scale of evolution enable accurate structure prediction](https://doi.org/10.1101/2022.07.20.500902) by Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, Alexander Rives. +1. **[Falcon](https://huggingface.co/docs/transformers/model_doc/falcon)** (from Technology Innovation Institute) by Almazrouei, Ebtesam and Alobeidli, Hamza and Alshamsi, Abdulaziz and Cappelli, Alessandro and Cojocaru, Ruxandra and Debbah, Merouane and Goffinet, Etienne and Heslow, Daniel and Launay, Julien and Malartic, Quentin and Noune, Badreddine and Pannier, Baptiste and Penedo, Guilherme. +1. **[FastSpeech2Conformer](model_doc/fastspeech2_conformer)** (来自 ESPnet and Microsoft Research) 伴随论文 [Fastspeech 2: Fast And High-quality End-to-End Text To Speech](https://arxiv.org/pdf/2006.04558.pdf) 由 Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang 发布。 1. **[FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei +1. **[FLAN-UL2](https://huggingface.co/docs/transformers/model_doc/flan-ul2)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-ul2-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei 1. **[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (来自 CNRS) 伴随论文 [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) 由 Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab 发布。 1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc/flava)** (来自 Facebook AI) 伴随论文 [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) 由 Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela 发布。 1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (来自 Google Research) 伴随论文 [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) 由 James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon 发布。 +1. 
**[FocalNet](https://huggingface.co/docs/transformers/model_doc/focalnet)** (来自 Microsoft Research) 伴随论文 [Focal Modulation Networks](https://arxiv.org/abs/2203.11926) 由 Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao 发布。 1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (来自 CMU/Google Brain) 伴随论文 [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) 由 Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le 发布。 +1. **[Fuyu](https://huggingface.co/docs/transformers/model_doc/fuyu)** (来自 ADEPT) 伴随论文 [blog post](https://www.adept.ai/blog/fuyu-8b) 由 Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar 发布。 1. **[GIT](https://huggingface.co/docs/transformers/model_doc/git)** (来自 Microsoft Research) 伴随论文 [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) 由 Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang 发布。 1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (来自 KAIST) 伴随论文 [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) 由 Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim 发布。 -1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (来自 OpenAI) 伴随论文 [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) 由 Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever 发布。 +1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (来自 OpenAI) 伴随论文 [Improving Language Understanding by Generative Pre-Training](https://openai.com/research/language-unsupervised/) 由 Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever 发布。 1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (来自 EleutherAI) 随仓库 [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) 发布。作者为 Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy 发布。 1. **[GPT NeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox)** (from EleutherAI) released with the paper [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745) by Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach 1. **[GPT NeoX Japanese](https://huggingface.co/docs/transformers/model_doc/gpt_neox_japanese)** (来自 ABEJA) 由 Shinya Otani, Takayoshi Makabe, Anuj Arora, Kyo Hattori。 -1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (来自 OpenAI) 伴随论文 [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) 由 Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever** 发布。 +1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (来自 OpenAI) 伴随论文 [Language Models are Unsupervised Multitask Learners](https://openai.com/research/better-language-models/) 由 Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever 发布。 1. 
**[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (来自 EleutherAI) 伴随论文 [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) 由 Ben Wang and Aran Komatsuzaki 发布。 1. **[GPT-Sw3](https://huggingface.co/docs/transformers/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren. +1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (来自 BigCode) 伴随论文 [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) 由 Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra 发布。 +1. **[GPTSAN-japanese](https://huggingface.co/docs/transformers/model_doc/gptsan-japanese)** released in the repository [tanreinama/GPTSAN](https://github.com/tanreinama/GPTSAN/blob/main/report/model.md) by 坂本俊之(tanreinama). 1. **[Graphormer](https://huggingface.co/docs/transformers/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu. 1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (来自 UCSD, NVIDIA) 伴随论文 [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) 由 Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang 发布。 +1. **[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** (来自 Allegro.pl, AGH University of Science and Technology) 伴随论文 [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) 由 Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik 发布。 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (来自 Facebook) 伴随论文 [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) 由 Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed 发布。 1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (来自 Berkeley) 伴随论文 [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) 由 Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer 发布。 +1. 
**[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (来自 OpenAI) 伴随论文 [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) 由 Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever 发布。 +1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. +1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (来自 Salesforce) 伴随论文 [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) 由 Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi 发布。 1. **[Jukebox](https://huggingface.co/docs/transformers/model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever. +1. **[KOSMOS-2](https://huggingface.co/docs/transformers/model_doc/kosmos-2)** (from Microsoft Research Asia) released with the paper [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei. 1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (来自 Microsoft Research Asia) 伴随论文 [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) 由 Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou 发布。 1. **[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (来自 Microsoft Research Asia) 伴随论文 [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) 由 Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou 发布。 1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (来自 Microsoft Research Asia) 伴随论文 [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) 由 Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei 发布。 @@ -319,43 +349,69 @@ conda install -c huggingface transformers 1. **[LED](https://huggingface.co/docs/transformers/model_doc/led)** (来自 AllenAI) 伴随论文 [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) 由 Iz Beltagy, Matthew E. Peters, Arman Cohan 发布。 1. 
**[LeViT](https://huggingface.co/docs/transformers/model_doc/levit)** (来自 Meta AI) 伴随论文 [LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference](https://arxiv.org/abs/2104.01136) 由 Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, Matthijs Douze 发布。 1. **[LiLT](https://huggingface.co/docs/transformers/model_doc/lilt)** (来自 South China University of Technology) 伴随论文 [LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding](https://arxiv.org/abs/2202.13669) 由 Jiapeng Wang, Lianwen Jin, Kai Ding 发布。 +1. **[LLaMA](https://huggingface.co/docs/transformers/model_doc/llama)** (来自 The FAIR team of Meta AI) 伴随论文 [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971) 由 Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample 发布。 +1. **[Llama2](https://huggingface.co/docs/transformers/model_doc/llama2)** (来自 The FAIR team of Meta AI) 伴随论文 [Llama2: Open Foundation and Fine-Tuned Chat Models](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/XXX) 由 Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom 发布。 +1. **[LLaVa](https://huggingface.co/docs/transformers/model_doc/llava)** (来自 Microsoft Research & University of Wisconsin-Madison) 伴随论文 [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485) 由 Haotian Liu, Chunyuan Li, Yuheng Li and Yong Jae Lee 发布。 1. **[Longformer](https://huggingface.co/docs/transformers/model_doc/longformer)** (来自 AllenAI) 伴随论文 [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) 由 Iz Beltagy, Matthew E. Peters, Arman Cohan 发布。 1. **[LongT5](https://huggingface.co/docs/transformers/model_doc/longt5)** (来自 Google AI) 伴随论文 [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/abs/2112.07916) 由 Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang 发布。 1. **[LUKE](https://huggingface.co/docs/transformers/model_doc/luke)** (来自 Studio Ousia) 伴随论文 [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) 由 Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto 发布。 1.
**[LXMERT](https://huggingface.co/docs/transformers/model_doc/lxmert)** (来自 UNC Chapel Hill) 伴随论文 [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) 由 Hao Tan and Mohit Bansal 发布。 1. **[M-CTC-T](https://huggingface.co/docs/transformers/model_doc/mctct)** (来自 Facebook) 伴随论文 [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161) 由 Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert 发布。 1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (来自 Facebook) 伴随论文 [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) 由 Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin 发布。 +1. **[MADLAD-400](https://huggingface.co/docs/transformers/model_doc/madlad-400)** (from Google) released with the paper [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](https://arxiv.org/abs/2309.04662) by Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat. 1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** 用 [OPUS](http://opus.nlpl.eu/) 数据训练的机器翻译模型由 Jörg Tiedemann 发布。[Marian Framework](https://marian-nmt.github.io/) 由微软翻译团队开发。 1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (来自 Microsoft Research Asia) 伴随论文 [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) 由 Junlong Li, Yiheng Xu, Lei Cui, Furu Wei 发布。 1. **[Mask2Former](https://huggingface.co/docs/transformers/model_doc/mask2former)** (来自 FAIR and UIUC) 伴随论文 [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) 由 Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar 发布。 -1. **[MaskFormer](https://huggingface.co/docs/transformers/model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov +1. **[MaskFormer](https://huggingface.co/docs/transformers/model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov +1. **[MatCha](https://huggingface.co/docs/transformers/model_doc/matcha)** (来自 Google AI) 伴随论文 [MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering](https://arxiv.org/abs/2212.09662) 由 Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, Julian Martin Eisenschlos 发布。 1. **[mBART](https://huggingface.co/docs/transformers/model_doc/mbart)** (来自 Facebook) 伴随论文 [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) 由 Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer 发布。 1. 
**[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (来自 Facebook) 伴随论文 [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) 由 Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan 发布。 +1. **[MEGA](https://huggingface.co/docs/transformers/model_doc/mega)** (来自 Facebook) 伴随论文 [Mega: Moving Average Equipped Gated Attention](https://arxiv.org/abs/2209.10655) 由 Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer 发布。 1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (来自 NVIDIA) 伴随论文 [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) 由 Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro 发布。 1. **[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (来自 NVIDIA) 伴随论文 [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) 由 Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro 发布。 +1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (来自 Alibaba Research) 伴随论文 [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) 由 Peng Wang, Cheng Da, and Cong Yao 发布。 +1. **[Mistral](https://huggingface.co/docs/transformers/model_doc/mistral)** (from Mistral AI) by The Mistral AI team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. +1. **[Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral)** (from Mistral AI) by The [Mistral AI](https://mistral.ai) team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. 1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (来自 Studio Ousia) 伴随论文 [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) 由 Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka 发布。 +1. **[MMS](https://huggingface.co/docs/transformers/model_doc/mms)** (来自 Facebook) 伴随论文 [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516) 由 Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli 发布。 1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (来自 CMU/Google Brain) 伴随论文 [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) 由 Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou 发布。 1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (来自 Google Inc.)
伴随论文 [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) 由 Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam 发布。 1. **[MobileNetV2](https://huggingface.co/docs/transformers/model_doc/mobilenet_v2)** (来自 Google Inc.) 伴随论文 [MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/abs/1801.04381) 由 Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen 发布。 1. **[MobileViT](https://huggingface.co/docs/transformers/model_doc/mobilevit)** (来自 Apple) 伴随论文 [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) 由 Sachin Mehta and Mohammad Rastegari 发布。 +1. **[MobileViTV2](https://huggingface.co/docs/transformers/model_doc/mobilevitv2)** (来自 Apple) 伴随论文 [Separable Self-attention for Mobile Vision Transformers](https://arxiv.org/abs/2206.02680) 由 Sachin Mehta and Mohammad Rastegari 发布。 1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (来自 Microsoft Research) 伴随论文 [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) 由 Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu 发布。 +1. **[MPT](https://huggingface.co/docs/transformers/model_doc/mpt)** (来自 MosaicML) 伴随论文 [llm-foundry](https://github.com/mosaicml/llm-foundry/) 由 the MosaicML NLP Team 发布。 +1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (来自 the University of Wisconsin - Madison) 伴随论文 [Multi Resolution Analysis (MRA)](https://arxiv.org/abs/2207.10284) 由 Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh 发布。 1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (来自 Google AI) 伴随论文 [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) 由 Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel 发布。 +1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez. 1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (来自 中国人民大学 AI Box) 伴随论文 [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) 由 Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen 发布。 1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (来自 SHI Labs) 伴随论文 [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) 由 Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi 发布。 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (来自华为诺亚方舟实验室) 伴随论文 [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) 由 Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu 发布。 1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (来自 Meta) 伴随论文 [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) 由 the NLLB team 发布。 +1.
**[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (来自 Meta) 伴随论文 [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) 由 the NLLB team 发布。 +1. **[Nougat](https://huggingface.co/docs/transformers/model_doc/nougat)** (来自 Meta AI) 伴随论文 [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) 由 Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic 发布。 1. **[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (来自 the University of Wisconsin - Madison) 伴随论文 [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) 由 Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh 发布。 1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (来自 SHI Labs) 伴随论文 [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) 由 Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi 发布。 +1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (来自 [s-JoL](https://huggingface.co/s-JoL)) 在 GitHub 上发布 (现已删除)。 1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (来自 Meta AI) 伴随论文 [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) 由 Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al 发布。 1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (来自 Google AI) 伴随论文 [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) 由 Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby 发布。 +1. **[OWLv2](https://huggingface.co/docs/transformers/model_doc/owlv2)** (来自 Google AI) 伴随论文 [Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683) 由 Matthias Minderer, Alexey Gritsenko, Neil Houlsby 发布。 +1. **[PatchTSMixer](https://huggingface.co/docs/transformers/model_doc/patchtsmixer)** (来自 IBM Research) 伴随论文 [TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting](https://arxiv.org/pdf/2306.09364.pdf) 由 Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, Jayant Kalagnanam 发布。 +1. **[PatchTST](https://huggingface.co/docs/transformers/model_doc/patchtst)** (来自 IBM) 伴随论文 [A Time Series is Worth 64 Words: Long-term Forecasting with Transformers](https://arxiv.org/pdf/2211.14730.pdf) 由 Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, Jayant Kalagnanam 发布。 1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (来自 Google) 伴随论文 [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) 由 Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu 发布。 1. **[PEGASUS-X](https://huggingface.co/docs/transformers/model_doc/pegasus_x)** (来自 Google) 伴随论文 [Investigating Efficiently Extending Transformers for Long Input Summarization](https://arxiv.org/abs/2208.04347) 由 Jason Phang, Yao Zhao, Peter J. Liu 发布。 1.
**[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (来自 Deepmind) 伴随论文 [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) 由 Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira 发布。 +1. **[Persimmon](https://huggingface.co/docs/transformers/model_doc/persimmon)** (来自 ADEPT) 伴随论文 [blog post](https://www.adept.ai/blog/persimmon-8b) 由 Erich Elsen, Augustus Odena, Maxwell Nye, Sağnak Taşırlar, Tri Dao, Curtis Hawthorne, Deepak Moparthi, Arushi Somani 发布。 +1. **[Phi](https://huggingface.co/docs/transformers/model_doc/phi)** (from Microsoft) released with the papers - [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li, [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee. 1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (来自 VinAI Research) 伴随论文 [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) 由 Dat Quoc Nguyen and Anh Tuan Nguyen 发布。 +1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (来自 Google) 伴随论文 [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) 由 Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova 发布。 1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (来自 UCLA NLP) 伴随论文 [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) 由 Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang 发布。 1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (来自 Sea AI Labs) 伴随论文 [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) 由 Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng 发布。 +1. **[Pop2Piano](https://huggingface.co/docs/transformers/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee. 1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (来自 Microsoft Research) 伴随论文 [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) 由 Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou 发布。 +1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (来自 Nanjing University, The University of Hong Kong etc.) 伴随论文 [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) 由 Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao 发布。 1. 
**[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (来自 NVIDIA) 伴随论文 [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) 由 Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius 发布。 +1. **[Qwen2](https://huggingface.co/docs/transformers/model_doc/qwen2)** (来自 the Qwen team, Alibaba Group) 伴随论文 [Qwen Technical Report](https://arxiv.org/abs/2309.16609) 由 Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou and Tianhang Zhu 发布。 1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (来自 Facebook) 伴随论文 [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) 由 Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela 发布。 1. **[REALM](https://huggingface.co/docs/transformers/model_doc/realm.html)** (来自 Google Research) 伴随论文 [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909) 由 Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang 发布。 1. **[Reformer](https://huggingface.co/docs/transformers/model_doc/reformer)** (来自 Google Research) 伴随论文 [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) 由 Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya 发布。 @@ -366,14 +422,21 @@ conda install -c huggingface transformers 1. **[RoBERTa-PreLayerNorm](https://huggingface.co/docs/transformers/model_doc/roberta-prelayernorm)** (来自 Facebook) 伴随论文 [fairseq: A Fast, Extensible Toolkit for Sequence Modeling](https://arxiv.org/abs/1904.01038) 由 Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli 发布。 1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (来自 WeChatAI), 伴随论文 [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) 由 HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou 发布。 1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (来自 ZhuiyiTechnology), 伴随论文 [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) 由 Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu 发布。 +1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (来自 Bo Peng) 伴随论文 [this repo](https://github.com/BlinkDL/RWKV-LM) 由 Bo Peng 发布。 +1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team. +1. 
**[SeamlessM4Tv2](https://huggingface.co/docs/transformers/model_doc/seamless_m4t_v2)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team. 1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (来自 NVIDIA) 伴随论文 [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) 由 Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo 发布。 +1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (来自 Meta AI) 伴随论文 [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) 由 Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick 发布。 1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (来自 ASAPP) 伴随论文 [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) 由 Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi 发布。 1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (来自 ASAPP) 伴随论文 [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) 由 Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi 发布。 -1. **[SpeechT5](https://huggingface.co/docs/transformers/main/model_doc/speecht5)** (来自 Microsoft Research) 伴随论文 [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) 由 Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei 发布。 +1. **[SigLIP](https://huggingface.co/docs/transformers/model_doc/siglip)** (来自 Google AI) 伴随论文 [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) 由 Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer 发布。 +1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (来自 Microsoft Research) 伴随论文 [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) 由 Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei 发布。 1. **[SpeechToTextTransformer](https://huggingface.co/docs/transformers/model_doc/speech_to_text)** (来自 Facebook), 伴随论文 [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) 由 Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino 发布。 1. **[SpeechToTextTransformer2](https://huggingface.co/docs/transformers/model_doc/speech_to_text_2)** (来自 Facebook) 伴随论文 [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) 由 Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau 发布。 1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (来自 Tel Aviv University) 伴随论文 [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) 由 Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy 发布。 1. 
**[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (来自 Berkeley) 伴随论文 [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) 由 Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer 发布。 +1. **[StableLm](https://huggingface.co/docs/transformers/main/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu. +1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (来自 MBZUAI) 伴随论文 [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446) 由 Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan 发布。 1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (来自 Microsoft) 伴随论文 [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) 由 Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo 发布。 1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (来自 Microsoft) 伴随论文 [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) 由 Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo 发布。 1. **[Swin2SR](https://huggingface.co/docs/transformers/model_doc/swin2sr)** (来自 University of Würzburg) 伴随论文 [Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration](https://arxiv.org/abs/2209.11345) 由 Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte 发布。 @@ -388,26 +451,35 @@ conda install -c huggingface transformers 1. **[Trajectory Transformer](https://huggingface.co/docs/transformers/model_doc/trajectory_transformers)** (from the University of California at Berkeley) released with the paper [Offline Reinforcement Learning as One Big Sequence Modeling Problem](https://arxiv.org/abs/2106.02039) by Michael Janner, Qiyang Li, Sergey Levine 1. **[Transformer-XL](https://huggingface.co/docs/transformers/model_doc/transfo-xl)** (来自 Google/CMU) 伴随论文 [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) 由 Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov 发布。 1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (来自 Microsoft) 伴随论文 [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) 由 Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei 发布。 -1. **[TVLT](https://huggingface.co/docs/transformers/main/model_doc/tvlt)** (来自 UNC Chapel Hill) 伴随论文 [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) 由 Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal 发布。 +1. 
**[TVLT](https://huggingface.co/docs/transformers/model_doc/tvlt)** (来自 UNC Chapel Hill) 伴随论文 [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) 由 Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal 发布。 +1. **[TVP](https://huggingface.co/docs/transformers/model_doc/tvp)** (来自 Intel) 伴随论文 [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) 由 Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding 发布. 1. **[UL2](https://huggingface.co/docs/transformers/model_doc/ul2)** (from Google Research) released with the paper [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) by Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler +1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (来自 Google Research) 伴随论文 [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) 由 Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant 发布。 1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (来自 Microsoft Research) 伴随论文 [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) 由 Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang 发布。 1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (来自 Microsoft Research) 伴随论文 [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) 由 Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu 发布。 +1. **[UnivNet](https://huggingface.co/docs/transformers/model_doc/univnet)** (from Kakao Corporation) released with the paper [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim. 1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (来自 Peking University) 伴随论文 [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221) 由 Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun 发布。 1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (来自 Tsinghua University and Nankai University) 伴随论文 [Visual Attention Network](https://arxiv.org/pdf/2202.09741.pdf) 由 Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu 发布。 1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (来自 Multimedia Computing Group, Nanjing University) 伴随论文 [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) 由 Zhan Tong, Yibing Song, Jue Wang, Limin Wang 发布。 1. **[ViLT](https://huggingface.co/docs/transformers/model_doc/vilt)** (来自 NAVER AI Lab/Kakao Enterprise/Kakao Brain) 伴随论文 [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) 由 Wonjae Kim, Bokyung Son, Ildoo Kim 发布。 +1. 
**[VipLlava](https://huggingface.co/docs/transformers/model_doc/vipllava)** (来自 University of Wisconsin–Madison) 伴随论文 [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https://arxiv.org/abs/2312.00784) 由 Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee 发布。 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (来自 Google AI) 伴随论文 [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) 由 Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby 发布。 1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (来自 UCLA NLP) 伴随论文 [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) 由 Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang 发布。 1. **[ViT Hybrid](https://huggingface.co/docs/transformers/model_doc/vit_hybrid)** (来自 Google AI) 伴随论文 [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) 由 Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby 发布。 +1. **[VitDet](https://huggingface.co/docs/transformers/model_doc/vitdet)** (来自 Meta AI) 伴随论文 [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) 由 Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He 发布。 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (来自 Meta AI) 伴随论文 [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) 由 Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick 发布。 +1. **[ViTMatte](https://huggingface.co/docs/transformers/model_doc/vitmatte)** (来自 HUST-VL) 伴随论文 [ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers](https://arxiv.org/abs/2305.15272) 由 Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang 发布。 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (来自 Meta AI) 伴随论文 [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas 发布. +1. **[VITS](https://huggingface.co/docs/transformers/model_doc/vits)** (来自 Kakao Enterprise) 伴随论文 [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) 由 Jaehyeon Kim, Jungil Kong, Juhee Son 发布。 +1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (来自 Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) 由 Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (来自 Facebook AI) 伴随论文 [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) 由 Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli 发布。 +1. 
**[Wav2Vec2-BERT](https://huggingface.co/docs/transformers/model_doc/wav2vec2-bert)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team. 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (来自 Facebook AI) 伴随论文 [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) 由 Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino 发布。 1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (来自 Facebook AI) 伴随论文 [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) 由 Qiantong Xu, Alexei Baevski, Michael Auli 发布。 1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei. 1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (来自 OpenAI) 伴随论文 [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) 由 Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever 发布。 1. **[X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)** (来自 Microsoft Research) 伴随论文 [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) 由 Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling 发布。 -1. **[X-MOD](https://huggingface.co/docs/transformers/main/model_doc/xmod)** (来自 Meta AI) 伴随论文 [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](http://dx.doi.org/10.18653/v1/2022.naacl-main.255) 由 Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe 发布。 +1. **[X-MOD](https://huggingface.co/docs/transformers/model_doc/xmod)** (来自 Meta AI) 伴随论文 [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](http://dx.doi.org/10.18653/v1/2022.naacl-main.255) 由 Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe 发布。 1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li. 1. **[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (来自 Facebook) 伴随论文 [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) 由 Guillaume Lample and Alexis Conneau 发布。 1. 
**[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (来自 Microsoft Research) 伴随论文 [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) 由 Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou 发布。 @@ -418,7 +490,7 @@ conda install -c huggingface transformers 1. **[XLS-R](https://huggingface.co/docs/transformers/model_doc/xls_r)** (来自 Facebook AI) 伴随论文 [XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale](https://arxiv.org/abs/2111.09296) 由 Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli 发布。 1. **[XLSR-Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/xlsr_wav2vec2)** (来自 Facebook AI) 伴随论文 [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://arxiv.org/abs/2006.13979) 由 Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli 发布。 1. **[YOLOS](https://huggingface.co/docs/transformers/model_doc/yolos)** (来自 Huazhong University of Science & Technology) 伴随论文 [You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection](https://arxiv.org/abs/2106.00666) 由 Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu 发布。 -1. **[YOSO](https://huggingface.co/docs/transformers/model_doc/yoso)** (来自 the University of Wisconsin - Madison) 伴随论文 [You Only Sample (Almost) 由 Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh 发布。 +1. **[YOSO](https://huggingface.co/docs/transformers/model_doc/yoso)** (来自 the University of Wisconsin - Madison) 伴随论文 [You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling](https://arxiv.org/abs/2111.09714) 由 Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh 发布。 1. 想要贡献新的模型?我们这里有一份**详细指引和模板**来引导你添加新的模型。你可以在 [`templates`](./templates) 目录中找到他们。记得查看 [贡献指南](./CONTRIBUTING.md) 并在开始写 PR 前联系维护人员或开一个新的 issue 来获得反馈。 要检查某个模型是否已有 Flax、PyTorch 或 TensorFlow 的实现,或其是否在 🤗 Tokenizers 库中有对应词符化器(tokenizer),敬请参阅[此表](https://huggingface.co/docs/transformers/index#supported-frameworks)。 @@ -430,7 +502,7 @@ conda install -c huggingface transformers | 章节 | 描述 | |-|-| -| [文档](https://huggingface.co/transformers/) | 完整的 API 文档和教程 | +| [文档](https://huggingface.co/docs/transformers/) | 完整的 API 文档和教程 | | [任务总结](https://huggingface.co/docs/transformers/task_summary) | 🤗 Transformers 支持的任务 | | [预处理教程](https://huggingface.co/docs/transformers/preprocessing) | 使用 `Tokenizer` 来为模型准备数据 | | [训练和微调](https://huggingface.co/docs/transformers/training) | 在 PyTorch/TensorFlow 的训练循环或 `Trainer` API 中使用 🤗 Transformers 提供的模型 | diff --git a/README_zh-hant.md b/README_zh-hant.md index 0ef150904e12d9..1d5155529aa0a3 100644 --- a/README_zh-hant.md +++ b/README_zh-hant.md @@ -39,7 +39,7 @@ library: 函式庫 module: 模組 NLP/Natural Language Processing: 以 NLP 出現時不翻譯,以 Natural Language Processing 出現時翻譯為自然語言處理 online demos: 線上Demo -pipeline: pipeline(不翻譯) +pipeline: pipeline(不翻譯) pretrained/pretrain: 預訓練 Python data structures (e.g., list, set, dict): 翻譯為串列,集合,字典,並用括號標註原英文 repository: repository(不翻譯) @@ -55,7 +55,7 @@ user: 使用者

-

+

Build @@ -83,8 +83,13 @@ user: 使用者 한국어 | Español | 日本語 | - हिन्दी -

+ हिन्दी | + Русский | + Рortuguês | + తెలుగు | + Français | + Deutsch | +

@@ -106,13 +111,13 @@ user: 使用者 你可以直接在 [model hub](https://huggingface.co/models) 上測試大多數的模型。我們也提供了 [私有模型託管、模型版本管理以及推論API](https://huggingface.co/pricing)。 這裡是一些範例: -- [用 BERT 做遮蓋填詞](https://huggingface.co/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France) +- [用 BERT 做遮蓋填詞](https://huggingface.co/google-bert/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France) - [用 Electra 做專有名詞辨識](https://huggingface.co/dbmdz/electra-large-discriminator-finetuned-conll03-english?text=My+name+is+Sarah+and+I+live+in+London+city) -- [用 GPT-2 做文本生成](https://huggingface.co/gpt2?text=A+long+time+ago%2C+) -- [用 RoBERTa 做自然語言推論](https://huggingface.co/roberta-large-mnli?text=The+dog+was+lost.+Nobody+lost+any+animal) +- [用 GPT-2 做文本生成](https://huggingface.co/openai-community/gpt2?text=A+long+time+ago%2C+) +- [用 RoBERTa 做自然語言推論](https://huggingface.co/FacebookAI/roberta-large-mnli?text=The+dog+was+lost.+Nobody+lost+any+animal) - [用 BART 做文本摘要](https://huggingface.co/facebook/bart-large-cnn?text=The+tower+is+324+metres+%281%2C063+ft%29+tall%2C+about+the+same+height+as+an+81-storey+building%2C+and+the+tallest+structure+in+Paris.+Its+base+is+square%2C+measuring+125+metres+%28410+ft%29+on+each+side.+During+its+construction%2C+the+Eiffel+Tower+surpassed+the+Washington+Monument+to+become+the+tallest+man-made+structure+in+the+world%2C+a+title+it+held+for+41+years+until+the+Chrysler+Building+in+New+York+City+was+finished+in+1930.+It+was+the+first+structure+to+reach+a+height+of+300+metres.+Due+to+the+addition+of+a+broadcasting+aerial+at+the+top+of+the+tower+in+1957%2C+it+is+now+taller+than+the+Chrysler+Building+by+5.2+metres+%2817+ft%29.+Excluding+transmitters%2C+the+Eiffel+Tower+is+the+second+tallest+free-standing+structure+in+France+after+the+Millau+Viaduct) -- [用 DistilBERT 做問答](https://huggingface.co/distilbert-base-uncased-distilled-squad?text=Which+name+is+also+used+to+describe+the+Amazon+rainforest+in+English%3F&context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species) -- [用 T5 做翻譯](https://huggingface.co/t5-base?text=My+name+is+Wolfgang+and+I+live+in+Berlin) +- [用 DistilBERT 
做問答](https://huggingface.co/distilbert/distilbert-base-uncased-distilled-squad?text=Which+name+is+also+used+to+describe+the+Amazon+rainforest+in+English%3F&context=The+Amazon+rainforest+%28Portuguese%3A+Floresta+Amaz%C3%B4nica+or+Amaz%C3%B4nia%3B+Spanish%3A+Selva+Amaz%C3%B3nica%2C+Amazon%C3%ADa+or+usually+Amazonia%3B+French%3A+For%C3%AAt+amazonienne%3B+Dutch%3A+Amazoneregenwoud%29%2C+also+known+in+English+as+Amazonia+or+the+Amazon+Jungle%2C+is+a+moist+broadleaf+forest+that+covers+most+of+the+Amazon+basin+of+South+America.+This+basin+encompasses+7%2C000%2C000+square+kilometres+%282%2C700%2C000+sq+mi%29%2C+of+which+5%2C500%2C000+square+kilometres+%282%2C100%2C000+sq+mi%29+are+covered+by+the+rainforest.+This+region+includes+territory+belonging+to+nine+nations.+The+majority+of+the+forest+is+contained+within+Brazil%2C+with+60%25+of+the+rainforest%2C+followed+by+Peru+with+13%25%2C+Colombia+with+10%25%2C+and+with+minor+amounts+in+Venezuela%2C+Ecuador%2C+Bolivia%2C+Guyana%2C+Suriname+and+French+Guiana.+States+or+departments+in+four+nations+contain+%22Amazonas%22+in+their+names.+The+Amazon+represents+over+half+of+the+planet%27s+remaining+rainforests%2C+and+comprises+the+largest+and+most+biodiverse+tract+of+tropical+rainforest+in+the+world%2C+with+an+estimated+390+billion+individual+trees+divided+into+16%2C000+species) +- [用 T5 做翻譯](https://huggingface.co/google-t5/t5-base?text=My+name+is+Wolfgang+and+I+live+in+Berlin) **[Write With Transformer](https://transformer.huggingface.co)**,由 Hugging Face 團隊所打造,是一個文本生成的官方 demo。 @@ -158,8 +163,8 @@ user: 使用者 ```python >>> from transformers import AutoTokenizer, AutoModel ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") ->>> model = AutoModel.from_pretrained("bert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = AutoModel.from_pretrained("google-bert/bert-base-uncased") >>> inputs = tokenizer("Hello world!", return_tensors="pt") >>> outputs = model(**inputs) @@ -168,8 +173,8 @@ user: 使用者 ```python >>> from transformers import AutoTokenizer, TFAutoModel ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") ->>> model = TFAutoModel.from_pretrained("bert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = TFAutoModel.from_pretrained("google-bert/bert-base-uncased") >>> inputs = tokenizer("Hello world!", return_tensors="tf") >>> outputs = model(**inputs) @@ -212,7 +217,7 @@ Tokenizer 為所有的預訓練模型提供了預處理,並可以直接轉換 ### 使用 pip -這個 Repository 已在 Python 3.6+、Flax 0.3.2+、PyTorch 1.3.1+ 和 TensorFlow 2.3+ 下經過測試。 +這個 Repository 已在 Python 3.8+、Flax 0.4.1+、PyTorch 1.11+ 和 TensorFlow 2.6+ 下經過測試。 你可以在[虛擬環境](https://docs.python.org/3/library/venv.html)中安裝 🤗 Transformers。如果你還不熟悉 Python 的虛擬環境,請閱此[使用者指引](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/)。 @@ -230,14 +235,14 @@ pip install transformers ### 使用 conda -自 Transformers 4.0.0 版始,我們有了一個 conda channel: `huggingface`。 - 🤗 Transformers 可以藉由 conda 依此安裝: ```shell script -conda install -c huggingface transformers +conda install conda-forge::transformers ``` +> **_筆記:_** 從 `huggingface` 頻道安裝 `transformers` 已被淘汰。 + 要藉由 conda 安裝 Flax、PyTorch 或 TensorFlow 其中之一,請參閱它們各自安裝頁面的說明。 ## 模型架構 @@ -249,8 +254,11 @@ conda install -c huggingface transformers 🤗 Transformers 目前支援以下的架構(模型概覽請參閱[這裡](https://huggingface.co/docs/transformers/model_summary)): 1. 
**[ALBERT](https://huggingface.co/docs/transformers/model_doc/albert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. +1. **[ALIGN](https://huggingface.co/docs/transformers/model_doc/align)** (from Google Research) released with the paper [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918) by Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. 1. **[AltCLIP](https://huggingface.co/docs/transformers/model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell. 1. **[Audio Spectrogram Transformer](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass. +1. **[Autoformer](https://huggingface.co/docs/transformers/model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long. +1. **[Bark](https://huggingface.co/docs/transformers/model_doc/bark)** (from Suno) released in the repository [suno-ai/bark](https://github.com/suno-ai/bark) by Suno AI team. 1. **[BART](https://huggingface.co/docs/transformers/model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/pdf/1910.13461.pdf) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer. 1. **[BARThez](https://huggingface.co/docs/transformers/model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis. 1. **[BARTpho](https://huggingface.co/docs/transformers/model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen. @@ -265,22 +273,27 @@ conda install -c huggingface transformers 1. **[Blenderbot](https://huggingface.co/docs/transformers/model_doc/blenderbot)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston. 1. 
**[BlenderbotSmall](https://huggingface.co/docs/transformers/model_doc/blenderbot-small)** (from Facebook) released with the paper [Recipes for building an open-domain chatbot](https://arxiv.org/abs/2004.13637) by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston. 1. **[BLIP](https://huggingface.co/docs/transformers/model_doc/blip)** (from Salesforce) released with the paper [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086) by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi. -1. **[BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2)** (from Salesforce) released with the paper [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597) by Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. +1. **[BLIP-2](https://huggingface.co/docs/transformers/model_doc/blip-2)** (from Salesforce) released with the paper [BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597) by Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. 1. **[BLOOM](https://huggingface.co/docs/transformers/model_doc/bloom)** (from BigScience workshop) released by the [BigScience Workshop](https://bigscience.huggingface.co/). 1. **[BORT](https://huggingface.co/docs/transformers/model_doc/bort)** (from Alexa) released with the paper [Optimal Subarchitecture Extraction For BERT](https://arxiv.org/abs/2010.10499) by Adrian de Wynter and Daniel J. Perry. -1. **[BridgeTower](https://huggingface.co/docs/transformers/main/model_doc/bridgetower)** (from Harbin Institute of Technology/Microsoft Research Asia/Intel Labs) released with the paper [BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning](https://arxiv.org/abs/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. +1. **[BridgeTower](https://huggingface.co/docs/transformers/model_doc/bridgetower)** (from Harbin Institute of Technology/Microsoft Research Asia/Intel Labs) released with the paper [BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning](https://arxiv.org/abs/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. +1. **[BROS](https://huggingface.co/docs/transformers/model_doc/bros)** (from NAVER CLOVA) released with the paper [BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents](https://arxiv.org/abs/2108.04539) by Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park. 1. **[ByT5](https://huggingface.co/docs/transformers/model_doc/byt5)** (from Google Research) released with the paper [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel. 1. **[CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot. 1. 
**[CANINE](https://huggingface.co/docs/transformers/model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. 1. **[Chinese-CLIP](https://huggingface.co/docs/transformers/model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou. -1. **[CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation]https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. +1. **[CLAP](https://huggingface.co/docs/transformers/model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. 1. **[CLIP](https://huggingface.co/docs/transformers/model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. 1. **[CLIPSeg](https://huggingface.co/docs/transformers/model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker. +1. **[CLVP](https://huggingface.co/docs/transformers/model_doc/clvp)** released with the paper [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker. 1. **[CodeGen](https://huggingface.co/docs/transformers/model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong. +1. **[CodeLlama](https://huggingface.co/docs/transformers/model_doc/llama_code)** (from MetaAI) released with the paper [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve. 1. **[Conditional DETR](https://huggingface.co/docs/transformers/model_doc/conditional_detr)** (from Microsoft Research Asia) released with the paper [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang. 1. 
**[ConvBERT](https://huggingface.co/docs/transformers/model_doc/convbert)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan. 1. **[ConvNeXT](https://huggingface.co/docs/transformers/model_doc/convnext)** (from Facebook AI) released with the paper [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie. +1. **[ConvNeXTV2](https://huggingface.co/docs/transformers/model_doc/convnextv2)** (from Facebook AI) released with the paper [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie. 1. **[CPM](https://huggingface.co/docs/transformers/model_doc/cpm)** (from Tsinghua University) released with the paper [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun. +1. **[CPM-Ant](https://huggingface.co/docs/transformers/model_doc/cpmant)** (from OpenBMB) released by the [OpenBMB](https://www.openbmb.org/). 1. **[CTRL](https://huggingface.co/docs/transformers/model_doc/ctrl)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher. 1. **[CvT](https://huggingface.co/docs/transformers/model_doc/cvt)** (from Microsoft) released with the paper [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang. 1. **[Data2Vec](https://huggingface.co/docs/transformers/model_doc/data2vec)** (from Facebook) released with the paper [Data2Vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli. @@ -289,41 +302,58 @@ conda install -c huggingface transformers 1. **[Decision Transformer](https://huggingface.co/docs/transformers/model_doc/decision_transformer)** (from Berkeley/Facebook/Google) released with the paper [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345) by Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch. 1. **[Deformable DETR](https://huggingface.co/docs/transformers/model_doc/deformable_detr)** (from SenseTime Research) released with the paper [Deformable DETR: Deformable Transformers for End-to-End Object Detection](https://arxiv.org/abs/2010.04159) by Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai. 1. 
**[DeiT](https://huggingface.co/docs/transformers/model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou. -1. **[DETA](https://huggingface.co/docs/transformers/main/model_doc/deta)** (from The University of Texas at Austin) released with the paper [NMS Strikes Back](https://arxiv.org/abs/2212.06137) by Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl. +1. **[DePlot](https://huggingface.co/docs/transformers/model_doc/deplot)** (from Google AI) released with the paper [DePlot: One-shot visual language reasoning by plot-to-table translation](https://arxiv.org/abs/2212.10505) by Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, Yasemin Altun. +1. **[Depth Anything](https://huggingface.co/docs/transformers/main/model_doc/depth_anything)** (from University of Hong Kong and TikTok) released with the paper [Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891) by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao. +1. **[DETA](https://huggingface.co/docs/transformers/model_doc/deta)** (from The University of Texas at Austin) released with the paper [NMS Strikes Back](https://arxiv.org/abs/2212.06137) by Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl. 1. **[DETR](https://huggingface.co/docs/transformers/model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko. 1. **[DialoGPT](https://huggingface.co/docs/transformers/model_doc/dialogpt)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan. 1. **[DiNAT](https://huggingface.co/docs/transformers/model_doc/dinat)** (from SHI Labs) released with the paper [Dilated Neighborhood Attention Transformer](https://arxiv.org/abs/2209.15001) by Ali Hassani and Humphrey Shi. +1. **[DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2)** (from Meta AI) released with the paper [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski. 1. **[DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. 
The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/distillation) and a German version of DistilBERT. 1. **[DiT](https://huggingface.co/docs/transformers/model_doc/dit)** (from Microsoft Research) released with the paper [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei. 1. **[Donut](https://huggingface.co/docs/transformers/model_doc/donut)** (from NAVER) released with the paper [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by Geewook Kim, Teakgyu Hong, Moonbin Yim, Jeongyeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park. 1. **[DPR](https://huggingface.co/docs/transformers/model_doc/dpr)** (from Facebook) released with the paper [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 1. **[DPT](https://huggingface.co/docs/transformers/master/model_doc/dpt)** (from Intel Labs) released with the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun. 1. **[EfficientFormer](https://huggingface.co/docs/transformers/model_doc/efficientformer)** (from Snap Research) released with the paper [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) by Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren. +1. **[EfficientNet](https://huggingface.co/docs/transformers/model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le. 1. **[ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. +1. **[EnCodec](https://huggingface.co/docs/transformers/model_doc/encodec)** (from Meta AI) released with the paper [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438) by Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi. 1. **[EncoderDecoder](https://huggingface.co/docs/transformers/model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn. 1. **[ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie)** (from Baidu) released with the paper [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) by Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu. -1. 
**[ErnieM](https://huggingface.co/docs/transformers/main/model_doc/ernie_m)** (from Baidu) released with the paper [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) by Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. +1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (from Baidu) released with the paper [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) by Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. 1. **[ESM](https://huggingface.co/docs/transformers/model_doc/esm)** (from Meta AI) are transformer protein language models. **ESM-1b** was released with the paper [Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences](https://www.pnas.org/content/118/15/e2016239118) by Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. **ESM-1v** was released with the paper [Language models enable zero-shot prediction of the effects of mutations on protein function](https://doi.org/10.1101/2021.07.09.450648) by Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu and Alexander Rives. **ESM-2** was released with the paper [Language models of protein sequences at the scale of evolution enable accurate structure prediction](https://doi.org/10.1101/2022.07.20.500902) by Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, Alexander Rives. +1. **[Falcon](https://huggingface.co/docs/transformers/model_doc/falcon)** (from Technology Innovation Institute) by Almazrouei, Ebtesam and Alobeidli, Hamza and Alshamsi, Abdulaziz and Cappelli, Alessandro and Cojocaru, Ruxandra and Debbah, Merouane and Goffinet, Etienne and Heslow, Daniel and Launay, Julien and Malartic, Quentin and Noune, Badreddine and Pannier, Baptiste and Penedo, Guilherme. +1. **[FastSpeech2Conformer](https://huggingface.co/docs/transformers/model_doc/fastspeech2_conformer)** (from ESPnet and Microsoft Research) released with the paper [Fastspeech 2: Fast And High-quality End-to-End Text To Speech](https://arxiv.org/pdf/2006.04558.pdf) by Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang. 1. **[FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei +1. 
**[FLAN-UL2](https://huggingface.co/docs/transformers/model_doc/flan-ul2)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-ul2-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei 1. **[FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab. 1. **[FLAVA](https://huggingface.co/docs/transformers/model_doc/flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. 1. **[FNet](https://huggingface.co/docs/transformers/model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. +1. **[FocalNet](https://huggingface.co/docs/transformers/model_doc/focalnet)** (from Microsoft Research) released with the paper [Focal Modulation Networks](https://arxiv.org/abs/2203.11926) by Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao. 1. **[Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le. +1. **[Fuyu](https://huggingface.co/docs/transformers/model_doc/fuyu)** (from ADEPT) Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar. Released with the paper [blog post](https://www.adept.ai/blog/fuyu-8b) 1. **[GIT](https://huggingface.co/docs/transformers/model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang. 1. **[GLPN](https://huggingface.co/docs/transformers/model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim. -1. **[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. +1. 
**[GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://openai.com/research/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. 1. **[GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. 1. **[GPT NeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox)** (from EleutherAI) released with the paper [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745) by Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach 1. **[GPT NeoX Japanese](https://huggingface.co/docs/transformers/model_doc/gpt_neox_japanese)** (from ABEJA) released by Shinya Otani, Takayoshi Makabe, Anuj Arora, and Kyo Hattori. -1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**. +1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://openai.com/research/better-language-models/) by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. 1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released with the paper [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki. 1. **[GPT-Sw3](https://huggingface.co/docs/transformers/model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren. +1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra. +1. 
**[GPTSAN-japanese](https://huggingface.co/docs/transformers/model_doc/gptsan-japanese)** released in the repository [tanreinama/GPTSAN](https://github.com/tanreinama/GPTSAN/blob/main/report/model.md) by 坂本俊之(tanreinama). 1. **[Graphormer](https://huggingface.co/docs/transformers/model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu. 1. **[GroupViT](https://huggingface.co/docs/transformers/model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang. +1. **[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** (from Allegro.pl, AGH University of Science and Technology) released with the paper [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, Ireneusz Gawlik. 1. **[Hubert](https://huggingface.co/docs/transformers/model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed. 1. **[I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer. +1. **[IDEFICS](https://huggingface.co/docs/transformers/model_doc/idefics)** (from HuggingFace) released with the paper [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh. 1. **[ImageGPT](https://huggingface.co/docs/transformers/model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. +1. **[Informer](https://huggingface.co/docs/transformers/model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. +1. **[InstructBLIP](https://huggingface.co/docs/transformers/model_doc/instructblip)** (from Salesforce) released with the paper [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. 1. 
**[Jukebox](https://huggingface.co/docs/transformers/model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever. +1. **[KOSMOS-2](https://huggingface.co/docs/transformers/model_doc/kosmos-2)** (from Microsoft Research Asia) released with the paper [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei. 1. **[LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou. 1. **[LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou. 1. **[LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3)** (from Microsoft Research Asia) released with the paper [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei. @@ -331,43 +361,69 @@ conda install -c huggingface transformers 1. **[LED](https://huggingface.co/docs/transformers/model_doc/led)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan. 1. **[LeViT](https://huggingface.co/docs/transformers/model_doc/levit)** (from Meta AI) released with the paper [LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference](https://arxiv.org/abs/2104.01136) by Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, Matthijs Douze. 1. **[LiLT](https://huggingface.co/docs/transformers/model_doc/lilt)** (from South China University of Technology) released with the paper [LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding](https://arxiv.org/abs/2202.13669) by Jiapeng Wang, Lianwen Jin, Kai Ding. +1. **[LLaMA](https://huggingface.co/docs/transformers/model_doc/llama)** (from The FAIR team of Meta AI) released with the paper [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971) by Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. +1. 
**[Llama2](https://huggingface.co/docs/transformers/model_doc/llama2)** (from The FAIR team of Meta AI) released with the paper [Llama2: Open Foundation and Fine-Tuned Chat Models](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/) by Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom. +1. **[LLaVa](https://huggingface.co/docs/transformers/model_doc/llava)** (from Microsoft Research & University of Wisconsin-Madison) released with the paper [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485) by Haotian Liu, Chunyuan Li, Yuheng Li and Yong Jae Lee. 1. **[Longformer](https://huggingface.co/docs/transformers/model_doc/longformer)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan. 1. **[LongT5](https://huggingface.co/docs/transformers/model_doc/longt5)** (from Google AI) released with the paper [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/abs/2112.07916) by Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang. 1. **[LUKE](https://huggingface.co/docs/transformers/model_doc/luke)** (from Studio Ousia) released with the paper [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto. 1. **[LXMERT](https://huggingface.co/docs/transformers/model_doc/lxmert)** (from UNC Chapel Hill) released with the paper [LXMERT: Learning Cross-Modality Encoder Representations from Transformers for Open-Domain Question Answering](https://arxiv.org/abs/1908.07490) by Hao Tan and Mohit Bansal. 1. **[M-CTC-T](https://huggingface.co/docs/transformers/model_doc/mctct)** (from Facebook) released with the paper [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161) by Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert. 1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin. +1. 
**[MADLAD-400](https://huggingface.co/docs/transformers/model_doc/madlad-400)** (from Google) released with the paper [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](https://arxiv.org/abs/2309.04662) by Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat. 1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team. 1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (from Microsoft Research Asia) released with the paper [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei. 1. **[Mask2Former](https://huggingface.co/docs/transformers/model_doc/mask2former)** (from FAIR and UIUC) released with the paper [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) by Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar. 1. **[MaskFormer](https://huggingface.co/docs/transformers/model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov +1. **[MatCha](https://huggingface.co/docs/transformers/model_doc/matcha)** (from Google AI) released with the paper [MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering](https://arxiv.org/abs/2212.09662) by Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, Julian Martin Eisenschlos. 1. **[mBART](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. 1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan. +1. **[MEGA](https://huggingface.co/docs/transformers/model_doc/mega)** (from Facebook) released with the paper [Mega: Moving Average Equipped Gated Attention](https://arxiv.org/abs/2209.10655) by Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. 1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. 1. 
**[Megatron-GPT2](https://huggingface.co/docs/transformers/model_doc/megatron_gpt2)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. +1. **[MGP-STR](https://huggingface.co/docs/transformers/model_doc/mgp-str)** (from Alibaba Research) released with the paper [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) by Peng Wang, Cheng Da, and Cong Yao. +1. **[Mistral](https://huggingface.co/docs/transformers/model_doc/mistral)** (from Mistral AI) by The Mistral AI team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. +1. **[Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral)** (from Mistral AI) by The [Mistral AI](https://mistral.ai) team: Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. 1. **[mLUKE](https://huggingface.co/docs/transformers/model_doc/mluke)** (from Studio Ousia) released with the paper [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka. +1. **[MMS](https://huggingface.co/docs/transformers/model_doc/mms)** (from Facebook) released with the paper [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516) by Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli. 1. **[MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 1. **[MobileNetV1](https://huggingface.co/docs/transformers/model_doc/mobilenet_v1)** (from Google Inc.) released with the paper [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam. 1. **[MobileNetV2](https://huggingface.co/docs/transformers/model_doc/mobilenet_v2)** (from Google Inc.) released with the paper [MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/abs/1801.04381) by Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen. 1. **[MobileViT](https://huggingface.co/docs/transformers/model_doc/mobilevit)** (from Apple) released with the paper [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari. +1. 
**[MobileViTV2](https://huggingface.co/docs/transformers/model_doc/mobilevitv2)** (from Apple) released with the paper [Separable Self-attention for Mobile Vision Transformers](https://arxiv.org/abs/2206.02680) by Sachin Mehta and Mohammad Rastegari. 1. **[MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet)** (from Microsoft Research) released with the paper [MPNet: Masked and Permuted Pre-training for Language Understanding](https://arxiv.org/abs/2004.09297) by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu. +1. **[MPT](https://huggingface.co/docs/transformers/model_doc/mpt)** (from MosaicML) released in the repository [llm-foundry](https://github.com/mosaicml/llm-foundry/) by the MosaicML NLP Team. +1. **[MRA](https://huggingface.co/docs/transformers/model_doc/mra)** (from the University of Wisconsin - Madison) released with the paper [Multi Resolution Analysis (MRA)](https://arxiv.org/abs/2207.10284) by Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, Vikas Singh. 1. **[MT5](https://huggingface.co/docs/transformers/model_doc/mt5)** (from Google AI) released with the paper [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934) by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel. +1. **[MusicGen](https://huggingface.co/docs/transformers/model_doc/musicgen)** (from Meta) released with the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez. 1. **[MVP](https://huggingface.co/docs/transformers/model_doc/mvp)** (from RUC AI Box) released with the paper [MVP: Multi-task Supervised Pre-training for Natural Language Generation](https://arxiv.org/abs/2206.12131) by Tianyi Tang, Junyi Li, Wayne Xin Zhao and Ji-Rong Wen. 1. **[NAT](https://huggingface.co/docs/transformers/model_doc/nat)** (from SHI Labs) released with the paper [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi. 1. **[Nezha](https://huggingface.co/docs/transformers/model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu. 1. **[NLLB](https://huggingface.co/docs/transformers/model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team. +1. **[NLLB-MOE](https://huggingface.co/docs/transformers/model_doc/nllb-moe)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team. +1. **[Nougat](https://huggingface.co/docs/transformers/model_doc/nougat)** (from Meta AI) released with the paper [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) by Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic. 1. 
**[Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh. 1. **[OneFormer](https://huggingface.co/docs/transformers/model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi. +1. **[OpenLlama](https://huggingface.co/docs/transformers/model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released on GitHub (now removed). 1. **[OPT](https://huggingface.co/docs/transformers/master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al. 1. **[OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. +1. **[OWLv2](https://huggingface.co/docs/transformers/model_doc/owlv2)** (from Google AI) released with the paper [Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683) by Matthias Minderer, Alexey Gritsenko, Neil Houlsby. +1. **[PatchTSMixer](https://huggingface.co/docs/transformers/model_doc/patchtsmixer)** (from IBM Research) released with the paper [TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting](https://arxiv.org/pdf/2306.09364.pdf) by Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong, Jayant Kalagnanam. +1. **[PatchTST](https://huggingface.co/docs/transformers/model_doc/patchtst)** (from IBM) released with the paper [A Time Series is Worth 64 Words: Long-term Forecasting with Transformers](https://arxiv.org/pdf/2211.14730.pdf) by Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, Jayant Kalagnanam. 1. **[Pegasus](https://huggingface.co/docs/transformers/model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu. 1. **[PEGASUS-X](https://huggingface.co/docs/transformers/model_doc/pegasus_x)** (from Google) released with the paper [Investigating Efficiently Extending Transformers for Long Input Summarization](https://arxiv.org/abs/2208.04347) by Jason Phang, Yao Zhao, Peter J. Liu. 1. **[Perceiver IO](https://huggingface.co/docs/transformers/model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira. +1. 
**[Persimmon](https://huggingface.co/docs/transformers/model_doc/persimmon)** (from ADEPT) released with the paper [blog post](https://www.adept.ai/blog/persimmon-8b) by Erich Elsen, Augustus Odena, Maxwell Nye, Sağnak Taşırlar, Tri Dao, Curtis Hawthorne, Deepak Moparthi, Arushi Somani. +1. **[Phi](https://huggingface.co/docs/transformers/model_doc/phi)** (from Microsoft) released with the papers - [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li, [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee. 1. **[PhoBERT](https://huggingface.co/docs/transformers/model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen. +1. **[Pix2Struct](https://huggingface.co/docs/transformers/model_doc/pix2struct)** (from Google) released with the paper [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) by Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova. 1. **[PLBart](https://huggingface.co/docs/transformers/model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. 1. **[PoolFormer](https://huggingface.co/docs/transformers/model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng. +1. **[Pop2Piano](https://huggingface.co/docs/transformers/model_doc/pop2piano)** released with the paper [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi, Kyogu Lee. 1. **[ProphetNet](https://huggingface.co/docs/transformers/model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. +1. **[PVT](https://huggingface.co/docs/transformers/model_doc/pvt)** (from Nanjing University, The University of Hong Kong etc.) released with the paper [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/pdf/2102.12122.pdf) by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao. 1. **[QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert)** (from NVIDIA) released with the paper [Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation](https://arxiv.org/abs/2004.09602) by Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev and Paulius Micikevicius. +1. 
**[Qwen2](https://huggingface.co/docs/transformers/model_doc/qwen2)** (from the Qwen team, Alibaba Group) released with the paper [Qwen Technical Report](https://arxiv.org/abs/2309.16609) by Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou and Tianhang Zhu. 1. **[RAG](https://huggingface.co/docs/transformers/model_doc/rag)** (from Facebook) released with the paper [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401) by Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. 1. **[REALM](https://huggingface.co/docs/transformers/model_doc/realm.html)** (from Google Research) released with the paper [REALM: Retrieval-Augmented Language Model Pre-Training](https://arxiv.org/abs/2002.08909) by Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang. 1. **[Reformer](https://huggingface.co/docs/transformers/model_doc/reformer)** (from Google Research) released with the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya. @@ -378,14 +434,21 @@ conda install -c huggingface transformers 1. **[RoBERTa-PreLayerNorm](https://huggingface.co/docs/transformers/model_doc/roberta-prelayernorm)** (from Facebook) released with the paper [fairseq: A Fast, Extensible Toolkit for Sequence Modeling](https://arxiv.org/abs/1904.01038) by Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli. 1. **[RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by Hui Su, Weiwei Shi, Xiaoyu Shen, Xiao Zhou, Tuo Ji, Jiarui Fang, Jie Zhou. 1. **[RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/pdf/2104.09864v1.pdf) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu. +1. **[RWKV](https://huggingface.co/docs/transformers/model_doc/rwkv)** (from Bo Peng) released in the repository [BlinkDL/RWKV-LM](https://github.com/BlinkDL/RWKV-LM) by Bo Peng. +1. **[SeamlessM4T](https://huggingface.co/docs/transformers/model_doc/seamless_m4t)** (from Meta AI) released with the paper [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team. +1. **[SeamlessM4Tv2](https://huggingface.co/docs/transformers/model_doc/seamless_m4t_v2)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team. 1. 
**[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo. +1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick. 1. **[SEW](https://huggingface.co/docs/transformers/model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi. 1. **[SEW-D](https://huggingface.co/docs/transformers/model_doc/sew_d)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi. -1. **[SpeechT5](https://huggingface.co/docs/transformers/main/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei. +1. **[SigLIP](https://huggingface.co/docs/transformers/model_doc/siglip)** (from Google AI) released with the paper [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer. +1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei. 1. **[SpeechToTextTransformer](https://huggingface.co/docs/transformers/model_doc/speech_to_text)** (from Facebook), released together with the paper [fairseq S2T: Fast Speech-to-Text Modeling with fairseq](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, Juan Pino. 1. **[SpeechToTextTransformer2](https://huggingface.co/docs/transformers/model_doc/speech_to_text_2)** (from Facebook) released with the paper [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau. 1. **[Splinter](https://huggingface.co/docs/transformers/model_doc/splinter)** (from Tel Aviv University) released with the paper [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy. 1. 
**[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer. +1. **[StableLm](https://huggingface.co/docs/transformers/main/model_doc/stablelm)** released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu. +1. **[SwiftFormer](https://huggingface.co/docs/transformers/model_doc/swiftformer)** (from MBZUAI) released with the paper [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446) by Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan. 1. **[Swin Transformer](https://huggingface.co/docs/transformers/model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo. 1. **[Swin Transformer V2](https://huggingface.co/docs/transformers/model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo. 1. **[Swin2SR](https://huggingface.co/docs/transformers/model_doc/swin2sr)** (from University of Würzburg) released with the paper [Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration](https://arxiv.org/abs/2209.11345) by Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte. @@ -400,37 +463,46 @@ conda install -c huggingface transformers 1. **[Trajectory Transformer](https://huggingface.co/docs/transformers/model_doc/trajectory_transformers)** (from the University of California at Berkeley) released with the paper [Offline Reinforcement Learning as One Big Sequence Modeling Problem](https://arxiv.org/abs/2106.02039) by Michael Janner, Qiyang Li, Sergey Levine 1. **[Transformer-XL](https://huggingface.co/docs/transformers/model_doc/transfo-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. 1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft) released with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. -1. **[TVLT](https://huggingface.co/docs/transformers/main/model_doc/tvlt)** (from UNC Chapel Hill) released with the paper [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) by Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal. +1. 
**[TVLT](https://huggingface.co/docs/transformers/model_doc/tvlt)** (from UNC Chapel Hill) released with the paper [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) by Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal. +1. **[TVP](https://huggingface.co/docs/transformers/model_doc/tvp)** (from Intel) released with the paper [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding. 1. **[UL2](https://huggingface.co/docs/transformers/model_doc/ul2)** (from Google Research) released with the paper [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) by Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler +1. **[UMT5](https://huggingface.co/docs/transformers/model_doc/umt5)** (from Google Research) released with the paper [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) by Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant. 1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang. 1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu. +1. **[UnivNet](https://huggingface.co/docs/transformers/model_doc/univnet)** (from Kakao Corporation) released with the paper [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim. 1. **[UPerNet](https://huggingface.co/docs/transformers/model_doc/upernet)** (from Peking University) released with the paper [Unified Perceptual Parsing for Scene Understanding](https://arxiv.org/abs/1807.10221) by Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun. 1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (from Tsinghua University and Nankai University) released with the paper [Visual Attention Network](https://arxiv.org/pdf/2202.09741.pdf) by Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu. 1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (from Multimedia Computing Group, Nanjing University) released with the paper [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) by Zhan Tong, Yibing Song, Jue Wang, Limin Wang. 1. **[ViLT](https://huggingface.co/docs/transformers/model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim. +1. 
**[VipLlava](https://huggingface.co/docs/transformers/model_doc/vipllava)** (from University of Wisconsin–Madison) released with the paper [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https://arxiv.org/abs/2312.00784) by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee. 1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. 1. **[VisualBERT](https://huggingface.co/docs/transformers/model_doc/visual_bert)** (from UCLA NLP) released with the paper [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/pdf/1908.03557) by Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang. 1. **[ViT Hybrid](https://huggingface.co/docs/transformers/model_doc/vit_hybrid)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. +1. **[VitDet](https://huggingface.co/docs/transformers/model_doc/vitdet)** (from Meta AI) released with the paper [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. 1. **[ViTMAE](https://huggingface.co/docs/transformers/model_doc/vit_mae)** (from Meta AI) released with the paper [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. +1. **[ViTMatte](https://huggingface.co/docs/transformers/model_doc/vitmatte)** (from HUST-VL) released with the paper [ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers](https://arxiv.org/abs/2305.15272) by Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang. 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://arxiv.org/abs/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas. +1. **[VITS](https://huggingface.co/docs/transformers/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son. +1. **[ViViT](https://huggingface.co/docs/transformers/model_doc/vivit)** (from Google Research) released with the paper [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. 1. 
**[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli. +1. **[Wav2Vec2-BERT](https://huggingface.co/docs/transformers/model_doc/wav2vec2-bert)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team. 1. **[Wav2Vec2-Conformer](https://huggingface.co/docs/transformers/model_doc/wav2vec2-conformer)** (from Facebook AI) released with the paper [FAIRSEQ S2T: Fast Speech-to-Text Modeling with FAIRSEQ](https://arxiv.org/abs/2010.05171) by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino. 1. **[Wav2Vec2Phoneme](https://huggingface.co/docs/transformers/model_doc/wav2vec2_phoneme)** (from Facebook AI) released with the paper [Simple and Effective Zero-shot Cross-lingual Phoneme Recognition](https://arxiv.org/abs/2109.11680) by Qiantong Xu, Alexei Baevski, Michael Auli. 1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://arxiv.org/abs/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei. 1. **[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)** (from OpenAI) released with the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever. 1. **[X-CLIP](https://huggingface.co/docs/transformers/model_doc/xclip)** (from Microsoft Research) released with the paper [Expanding Language-Image Pretrained Models for General Video Recognition](https://arxiv.org/abs/2208.02816) by Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling. -1. **[X-MOD](https://huggingface.co/docs/transformers/main/model_doc/xmod)** (from Meta AI) released with the paper [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](http://dx.doi.org/10.18653/v1/2022.naacl-main.255) by Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe. +1. **[X-MOD](https://huggingface.co/docs/transformers/model_doc/xmod)** (from Meta AI) released with the paper [Lifting the Curse of Multilinguality by Pre-training Modular Transformers](http://dx.doi.org/10.18653/v1/2022.naacl-main.255) by Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe. 1. **[XGLM](https://huggingface.co/docs/transformers/model_doc/xglm)** (From Facebook AI) released with the paper [Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668) by Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li. 1. 
**[XLM](https://huggingface.co/docs/transformers/model_doc/xlm)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau. 1. **[XLM-ProphetNet](https://huggingface.co/docs/transformers/model_doc/xlm-prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. 1. **[XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta)** (from Facebook AI), released together with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. 1. **[XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/model_doc/xlm-roberta-xl)** (from Facebook AI) released with the paper [Larger-Scale Transformers for Multilingual Masked Language Modeling](https://arxiv.org/abs/2105.00572) by Naman Goyal, Jingfei Du, Myle Ott, Giri Anantharaman, Alexis Conneau. 1. **[XLM-V](https://huggingface.co/docs/transformers/model_doc/xlm-v)** (from Meta AI) released with the paper [XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models](https://arxiv.org/abs/2301.10472) by Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, Madian Khabsa. -1. **[XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet)** (from Google/CMU) released with the paper [​XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le. +1. **[XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le. 1. **[XLS-R](https://huggingface.co/docs/transformers/model_doc/xls_r)** (from Facebook AI) released with the paper [XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale](https://arxiv.org/abs/2111.09296) by Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli. 1. **[XLSR-Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/xlsr_wav2vec2)** (from Facebook AI) released with the paper [Unsupervised Cross-Lingual Representation Learning For Speech Recognition](https://arxiv.org/abs/2006.13979) by Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli. 1. **[YOLOS](https://huggingface.co/docs/transformers/model_doc/yolos)** (from Huazhong University of Science & Technology) released with the paper [You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection](https://arxiv.org/abs/2106.00666) by Yuxin Fang, Bencheng Liao, Xinggang Wang, Jiemin Fang, Jiyang Qi, Rui Wu, Jianwei Niu, Wenyu Liu. -1. 
**[YOSO](https://huggingface.co/docs/transformers/model_doc/yoso)** (from the University of Wisconsin - Madison) released with the paper [You Only Sample (Almost) by Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh. +1. **[YOSO](https://huggingface.co/docs/transformers/model_doc/yoso)** (from the University of Wisconsin - Madison) released with the paper [You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling](https://arxiv.org/abs/2111.09714) by Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh. 1. Want to contribute a new model? We have a **detailed guide and templates** to walk you through adding a new model. You can find them in the [`templates`](./templates) directory. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open a new issue to collect feedback before starting your PR. To check whether a model already has a Flax, PyTorch or TensorFlow implementation, or whether it has a corresponding tokenizer in the 🤗 Tokenizers library, please refer to [this table](https://huggingface.co/docs/transformers/index#supported-frameworks). diff --git a/SECURITY.md b/SECURITY.md new file mode 100644 index 00000000000000..a16cfe099f8f78 --- /dev/null +++ b/SECURITY.md @@ -0,0 +1,6 @@ +# Security Policy + +## Reporting a Vulnerability + +🤗 We have our bug bounty program set up with HackerOne. Please feel free to submit vulnerability reports to our private program at https://hackerone.com/hugging_face. +Note that you'll need to be invited to our program, so send us a quick email at security@huggingface.co if you've found a vulnerability. diff --git a/awesome-transformers.md b/awesome-transformers.md new file mode 100644 index 00000000000000..2ecdd3406f7095 --- /dev/null +++ b/awesome-transformers.md @@ -0,0 +1,609 @@ +# Awesome projects built with Transformers + +This page lists awesome projects built on top of Transformers. Transformers is more than a toolkit to use pretrained +models: it's a community of projects built around it and the Hugging Face Hub. We want Transformers to enable +developers, researchers, students, professors, engineers, and anyone else to build their dream projects. + +In this list, we showcase incredibly impactful and novel projects that have pushed the field forward. We celebrate +100 of these projects as we reach the milestone of 100k stars as a community; but we're very open to pull requests +adding other projects to the list. If you believe a project should be here and it's not, then please, open a PR +to add it. + +## [gpt4all](https://github.com/nomic-ai/gpt4all) + +[gpt4all](https://github.com/nomic-ai/gpt4all) is an ecosystem of open-source chatbots trained on massive collections of clean assistant data including code, stories and dialogue. It offers open-source, large language models such as LLaMA and GPT-J trained in an assistant style. + +Keywords: Open-source, LLaMa, GPT-J, instruction, assistant + +## [recommenders](https://github.com/microsoft/recommenders) + +This repository contains examples and best practices for building recommendation systems, provided as Jupyter notebooks. It goes over several aspects required to build efficient recommendation systems: data preparation, modeling, evaluation, model selection & optimization, as well as operationalization. + +Keywords: Recommender systems, AzureML + +## [IOPaint](https://github.com/Sanster/IOPaint) + +Image inpainting tool powered by Stable Diffusion. Remove any unwanted objects, defects, or people from your pictures, or erase and replace anything in your pictures.
+ +Keywords: inpainting, SD, Stable Diffusion + +## [flair](https://github.com/flairNLP/flair) + +FLAIR is a powerful PyTorch NLP framework, covering several important tasks: NER, sentiment-analysis, part-of-speech tagging, text and document embeddings, among other things. + +Keywords: NLP, text embedding, document embedding, biomedical, NER, PoS, sentiment-analysis + +## [mindsdb](https://github.com/mindsdb/mindsdb) + +MindsDB is a low-code ML platform, which automates and integrates several ML frameworks into the data stack as "AI Tables" to streamline the integration of AI into applications, making it accessible to developers of all skill levels. + +Keywords: Database, low-code, AI table + +## [langchain](https://github.com/hwchase17/langchain) + +[langchain](https://github.com/hwchase17/langchain) is aimed at assisting in the development of apps merging both LLMs and other sources of knowledge. The library allows chaining calls to applications, creating a sequence across many tools. + +Keywords: LLMs, Large Language Models, Agents, Chains + +## [LlamaIndex](https://github.com/jerryjliu/llama_index) + +[LlamaIndex](https://github.com/jerryjliu/llama_index) is a project that provides a central interface to connect your LLMs with external data. It provides various kinds of indices and retrieval mechanisms to perform different LLM tasks and obtain knowledge-augmented results. + +Keywords: LLMs, Large Language Models, Data Retrieval, Indices, Knowledge Augmentation + +## [ParlAI](https://github.com/facebookresearch/ParlAI) + +[ParlAI](https://github.com/facebookresearch/ParlAI) is a Python framework for sharing, training and testing dialogue models, from open-domain chitchat, to task-oriented dialogue, to visual question answering. It provides more than 100 datasets under the same API, a large zoo of pretrained models, a set of agents, and has several integrations. + +Keywords: Dialogue, Chatbots, VQA, Datasets, Agents + +## [sentence-transformers](https://github.com/UKPLab/sentence-transformers) + +This framework provides an easy method to compute dense vector representations for sentences, paragraphs, and images. The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa etc. and achieve state-of-the-art performance in various tasks. Text is embedded in a vector space such that similar texts are close and can be efficiently found using cosine similarity. + +Keywords: Dense vector representations, Text embeddings, Sentence embeddings + +## [ludwig](https://github.com/ludwig-ai/ludwig) + +Ludwig is a declarative machine learning framework that makes it easy to define machine learning pipelines using a simple and flexible data-driven configuration system. Ludwig is targeted at a wide variety of AI tasks. It provides a data-driven configuration system, training, prediction, and evaluation scripts, as well as a programmatic API. + +Keywords: Declarative, Data-driven, ML Framework + +## [InvokeAI](https://github.com/invoke-ai/InvokeAI) + +[InvokeAI](https://github.com/invoke-ai/InvokeAI) is an engine for Stable Diffusion models, aimed at professionals, artists, and enthusiasts. It leverages the latest AI-driven technologies through a CLI as well as a WebUI. + +Keywords: Stable-Diffusion, WebUI, CLI + +## [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) + +[PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) is an easy-to-use and powerful NLP library particularly targeted at the Chinese language.
It has support for multiple pre-trained model zoos, and supports a wide range of NLP tasks from research to industrial applications. + +Keywords: NLP, Chinese, Research, Industry + +## [stanza](https://github.com/stanfordnlp/stanza) + +The Stanford NLP Group's official Python NLP library. It contains support for running various accurate natural language processing tools on 60+ languages and for accessing the Java Stanford CoreNLP software from Python. + +Keywords: NLP, Multilingual, CoreNLP + +## [DeepPavlov](https://github.com/deeppavlov/DeepPavlov) + +[DeepPavlov](https://github.com/deeppavlov/DeepPavlov) is an open-source conversational AI library. It is designed for the development of production-ready chatbots and complex conversational systems, as well as research in the area of NLP and, particularly, of dialog systems. + +Keywords: Conversational, Chatbot, Dialog + +## [alpaca-lora](https://github.com/tloen/alpaca-lora) + +Alpaca-lora contains code for reproducing the Stanford Alpaca results using low-rank adaptation (LoRA). The repository provides training (fine-tuning) as well as generation scripts. + +Keywords: LoRA, Parameter-efficient fine-tuning + +## [imagen-pytorch](https://github.com/lucidrains/imagen-pytorch) + +An open-source implementation of Imagen, Google's closed-source Text-to-Image Neural Network that beats DALL-E2. As of release, it is the new SOTA for text-to-image synthesis. + +Keywords: Imagen, Text-to-image + +## [adapters](https://github.com/adapter-hub/adapters) + +[adapters](https://github.com/adapter-hub/adapters) is an extension of HuggingFace's Transformers library, integrating adapters into state-of-the-art language models by incorporating AdapterHub, a central repository for pre-trained adapter modules. It is a drop-in replacement for transformers, which is regularly updated to stay up-to-date with the developments of transformers. + +Keywords: Adapters, LoRA, Parameter-efficient fine-tuning, Hub + +## [NeMo](https://github.com/NVIDIA/NeMo) + +NVIDIA [NeMo](https://github.com/NVIDIA/NeMo) is a conversational AI toolkit built for researchers working on automatic speech recognition (ASR), text-to-speech synthesis (TTS), large language models (LLMs), and natural language processing (NLP). The primary objective of [NeMo](https://github.com/NVIDIA/NeMo) is to help researchers from industry and academia to reuse prior work (code and pretrained models) and make it easier to create new conversational AI models (see https://developer.nvidia.com/conversational-ai#started). + +Keywords: Conversational, ASR, TTS, LLMs, NLP + +## [Runhouse](https://github.com/run-house/runhouse) + +[Runhouse](https://github.com/run-house/runhouse) allows you to send code and data to any of your compute or data infra, all in Python, and continue to interact with them normally from your existing code and environment. Runhouse developers mention: + +> Think of it as an expansion pack to your Python interpreter that lets it take detours to remote machines or manipulate remote data. + +Keywords: MLOps, Infrastructure, Data storage, Modeling + +## [MONAI](https://github.com/Project-MONAI/MONAI) + +[MONAI](https://github.com/Project-MONAI/MONAI) is a PyTorch-based, open-source framework for deep learning in healthcare imaging, part of the PyTorch Ecosystem.
Its ambitions are: +- developing a community of academic, industrial and clinical researchers collaborating on a common foundation; +- creating state-of-the-art, end-to-end training workflows for healthcare imaging; +- providing researchers with the optimized and standardized way to create and evaluate deep learning models. + +Keywords: Healthcare imaging, Training, Evaluation + +## [simpletransformers](https://github.com/ThilinaRajapakse/simpletransformers) + +Simple Transformers lets you quickly train and evaluate Transformer models. Only 3 lines of code are needed to initialize, train, and evaluate a model. It supports a wide variety of NLP tasks. + +Keywords: Framework, simplicity, NLP + +## [JARVIS](https://github.com/microsoft/JARVIS) + +[JARVIS](https://github.com/microsoft/JARVIS) is a system attempting to merge LLMs such as GPT-4 with the rest of the open-source ML community: leveraging up to 60 downstream models in order to perform tasks identified by the LLM. + +Keywords: LLM, Agents, HF Hub + +## [transformers.js](https://xenova.github.io/transformers.js/) + +[transformers.js](https://xenova.github.io/transformers.js/) is a JavaScript library targeted at running models from transformers directly within the browser. + +Keywords: Transformers, JavaScript, browser + +## [bumblebee](https://github.com/elixir-nx/bumblebee) + +Bumblebee provides pre-trained Neural Network models on top of Axon, a neural networks library for the Elixir language. It includes integration with 🤗 Models, allowing anyone to download and perform Machine Learning tasks with few lines of code. + +Keywords: Elixir, Axon + +## [argilla](https://github.com/argilla-io/argilla) + +Argilla is an open-source platform providing advanced NLP labeling, monitoring, and workspaces. It is compatible with many open source ecosystems such as Hugging Face, Stanza, FLAIR, and others. + +Keywords: NLP, Labeling, Monitoring, Workspaces + +## [haystack](https://github.com/deepset-ai/haystack) + +Haystack is an open source NLP framework to interact with your data using Transformer models and LLMs. It offers production-ready tools to quickly build complex decision making, question answering, semantic search, text generation applications, and more. + +Keywords: NLP, Framework, LLM + +## [spaCy](https://github.com/explosion/spaCy) + +[spaCy](https://github.com/explosion/spaCy) is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products. It offers support for transformers models through its third party package, spacy-transformers. + +Keywords: NLP, Framework + +## [speechbrain](https://github.com/speechbrain/speechbrain) + +SpeechBrain is an open-source and all-in-one conversational AI toolkit based on PyTorch. +The goal is to create a single, flexible, and user-friendly toolkit that can be used to easily develop state-of-the-art speech technologies, including systems for speech recognition, speaker recognition, speech enhancement, speech separation, language identification, multi-microphone signal processing, and many others. + +Keywords: Conversational, Speech + +## [skorch](https://github.com/skorch-dev/skorch) + +Skorch is a scikit-learn compatible neural network library that wraps PyTorch. It has support for models within transformers, and tokenizers from tokenizers. 
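+
+As a rough sketch of the scikit-learn-style workflow skorch enables (the toy module, random data, and hyperparameters below are illustrative assumptions, not taken from the skorch documentation):
+
+```python
+import numpy as np
+import torch.nn as nn
+from skorch import NeuralNetClassifier
+
+# Any PyTorch nn.Module can be wrapped; this one is a deliberately tiny classifier.
+class TinyClassifier(nn.Module):
+    def __init__(self, num_features=20, num_classes=2):
+        super().__init__()
+        self.layers = nn.Sequential(
+            nn.Linear(num_features, 32),
+            nn.ReLU(),
+            nn.Linear(32, num_classes),
+            nn.Softmax(dim=-1),
+        )
+
+    def forward(self, X):
+        return self.layers(X)
+
+# The wrapper exposes the familiar scikit-learn estimator API (fit/predict).
+net = NeuralNetClassifier(TinyClassifier, max_epochs=5, lr=0.05)
+
+X = np.random.randn(200, 20).astype(np.float32)   # skorch expects float32 features
+y = np.random.randint(0, 2, size=200).astype(np.int64)
+
+net.fit(X, y)
+print(net.predict(X[:5]))
+```
+
+Because the wrapper behaves like a scikit-learn estimator, it can also be dropped into tools such as `Pipeline` or `GridSearchCV`.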
+ +Keywords: Scikit-Learn, PyTorch + +## [bertviz](https://github.com/jessevig/bertviz) + +BertViz is an interactive tool for visualizing attention in Transformer language models such as BERT, GPT2, or T5. It can be run inside a Jupyter or Colab notebook through a simple Python API that supports most Huggingface models. + +Keywords: Visualization, Transformers + +## [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax) + +[mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax) is a haiku library using the xmap/pjit operators in JAX for model parallelism of transformers. This library is designed for scalability up to approximately 40B parameters on TPUv3s. It was the library used to train the GPT-J model. + +Keywords: Haiku, Model parallelism, LLM, TPU + +## [deepchem](https://github.com/deepchem/deepchem) + +DeepChem aims to provide a high-quality open-source toolchain that democratizes the use of deep learning in drug discovery, materials science, quantum chemistry, and biology. + +Keywords: Drug discovery, Materials Science, Quantum Chemistry, Biology + +## [OpenNRE](https://github.com/thunlp/OpenNRE) + +An Open-Source Package for Neural Relation Extraction (NRE). It is targeted at a wide range of users, from newcomers to relation extraction, to developers, researchers, or students. + +Keywords: Neural Relation Extraction, Framework + +## [pycorrector](https://github.com/shibing624/pycorrector) + +PyCorrector is a Chinese Text Error Correction Tool. It uses a language model to detect errors, and pinyin and shape features to correct Chinese text errors. It can be used for Chinese Pinyin and stroke input methods. + +Keywords: Chinese, Error correction tool, Language model, Pinyin + +## [nlpaug](https://github.com/makcedward/nlpaug) + +This Python library helps you with augmenting NLP data for machine learning projects. It is a lightweight library featuring synthetic data generation for improving model performance, support for audio and text, and compatibility with several ecosystems (scikit-learn, pytorch, tensorflow). + +Keywords: Data augmentation, Synthetic data generation, Audio, NLP + +## [dream-textures](https://github.com/carson-katri/dream-textures) + +[dream-textures](https://github.com/carson-katri/dream-textures) is a library targeted at bringing stable-diffusion support within Blender. It supports several use-cases, such as image generation, texture projection, inpainting/outpainting, ControlNet, and upscaling. + +Keywords: Stable-Diffusion, Blender + +## [seldon-core](https://github.com/SeldonIO/seldon-core) + +Seldon core converts your ML models (Tensorflow, Pytorch, H2o, etc.) or language wrappers (Python, Java, etc.) into production REST/GRPC microservices. +Seldon handles scaling to thousands of production machine learning models and provides advanced machine learning capabilities out of the box including Advanced Metrics, Request Logging, Explainers, Outlier Detectors, A/B Tests, Canaries and more. + +Keywords: Microservices, Modeling, Language wrappers + +## [open_model_zoo](https://github.com/openvinotoolkit/open_model_zoo) + +This repository includes optimized deep learning models and a set of demos to expedite development of high-performance deep learning inference applications. Use these free pre-trained models instead of training your own models to speed up the development and production deployment process.
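+
+A hedged sketch of how a model from the zoo is typically consumed with the OpenVINO Python API (the model name, file paths, and input shape are assumptions for illustration; the zoo's documentation lists the real values per model):
+
+```python
+import numpy as np
+from openvino.runtime import Core  # OpenVINO's Python inference API
+
+# Assumes a model was already fetched (e.g. with the zoo's downloader tooling)
+# and is available in OpenVINO IR format: an .xml graph plus a .bin weights file.
+core = Core()
+model = core.read_model("public/some-model/FP16/some-model.xml")  # hypothetical path
+compiled = core.compile_model(model, "CPU")
+
+dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)  # shape is model-specific
+result = compiled([dummy_input])[compiled.output(0)]
+print(result.shape)
+```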
+ +Keywords: Optimized models, Demos + +## [ml-stable-diffusion](https://github.com/apple/ml-stable-diffusion) + +ML-Stable-Diffusion is a repository by Apple bringing Stable Diffusion support to Core ML, on Apple Silicon devices. It supports stable diffusion checkpoints hosted on the Hugging Face Hub. + +Keywords: Stable Diffusion, Apple Silicon, Core ML + +## [stable-dreamfusion](https://github.com/ashawkey/stable-dreamfusion) + +Stable-Dreamfusion is a PyTorch implementation of the text-to-3D model Dreamfusion, powered by the Stable Diffusion text-to-2D model. + +Keywords: Text-to-3D, Stable Diffusion + +## [txtai](https://github.com/neuml/txtai) + +[txtai](https://github.com/neuml/txtai) is an open-source platform for semantic search and workflows powered by language models. txtai builds embeddings databases, which are a union of vector indexes and relational databases enabling similarity search with SQL. Semantic workflows connect language models together into unified applications. + +Keywords: Semantic search, LLM + +## [djl](https://github.com/deepjavalibrary/djl) + +Deep Java Library (DJL) is an open-source, high-level, engine-agnostic Java framework for deep learning. DJL is designed to be easy to get started with and simple to use for developers. DJL provides a native Java development experience and functions like any other regular Java library. DJL offers [a Java binding](https://github.com/deepjavalibrary/djl/tree/master/extensions/tokenizers) for HuggingFace Tokenizers and an easy conversion toolkit for deploying HuggingFace models in Java. + +Keywords: Java, Framework + +## [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/) + +This project provides a unified framework to test generative language models on a large number of different evaluation tasks. It has support for more than 200 tasks, and supports different ecosystems: HF Transformers, GPT-NeoX, DeepSpeed, as well as the OpenAI API. + +Keywords: LLM, Evaluation, Few-shot + +## [gpt-neox](https://github.com/EleutherAI/gpt-neox) + +This repository records EleutherAI's library for training large-scale language models on GPUs. The framework is based on NVIDIA's Megatron Language Model and has been augmented with techniques from DeepSpeed as well as some novel optimizations. It is focused on training multi-billion-parameter models. + +Keywords: Training, LLM, Megatron, DeepSpeed + +## [muzic](https://github.com/microsoft/muzic) + +Muzic is a research project on AI music that empowers music understanding and generation with deep learning and artificial intelligence. Muzic was created by researchers from Microsoft Research Asia. + +Keywords: Music understanding, Music generation + +## [dalle-flow](https://github.com/jina-ai/dalle-flow) + +DALL·E Flow is an interactive workflow for generating high-definition images from a text prompt. It leverages DALL·E-Mega, GLID-3 XL, and Stable Diffusion to generate image candidates, and then calls CLIP-as-service to rank the candidates w.r.t. the prompt. +The preferred candidate is fed to GLID-3 XL for diffusion, which often enriches the texture and background. Finally, the candidate is upscaled to 1024x1024 via SwinIR. + +Keywords: High-definition image generation, Stable Diffusion, DALL-E Mega, GLID-3 XL, CLIP, SwinIR + +## [lightseq](https://github.com/bytedance/lightseq) + +LightSeq is a high-performance training and inference library for sequence processing and generation implemented in CUDA.
It enables highly efficient computation of modern NLP and CV models such as BERT, GPT, Transformer, etc. It is therefore especially useful for machine translation, text generation, image classification, and other sequence-related tasks. + +Keywords: Training, Inference, Sequence Processing, Sequence Generation + +## [LaTeX-OCR](https://github.com/lukas-blecher/LaTeX-OCR) + +The goal of this project is to create a learning-based system that takes an image of a math formula and returns corresponding LaTeX code. + +Keywords: OCR, LaTeX, Math formula + +## [open_clip](https://github.com/mlfoundations/open_clip) + +OpenCLIP is an open source implementation of OpenAI's CLIP. + +The goal of this repository is to enable training models with contrastive image-text supervision, and to investigate their properties such as robustness to distribution shift. +The starting point is an implementation of CLIP that matches the accuracy of the original CLIP models when trained on the same dataset. + +Specifically, a ResNet-50 model trained with this codebase on OpenAI's 15 million image subset of YFCC achieves 32.7% top-1 accuracy on ImageNet. + +Keywords: CLIP, Open-source, Contrastive, Image-text + +## [dalle-playground](https://github.com/saharmor/dalle-playground) + +A playground to generate images from any text prompt using Stable Diffusion and Dall-E mini. + +Keywords: WebUI, Stable Diffusion, Dall-E mini + +## [FedML](https://github.com/FedML-AI/FedML) + +[FedML](https://github.com/FedML-AI/FedML) is a federated learning and analytics library enabling secure and collaborative machine learning on decentralized data anywhere at any scale. + +It supports large-scale cross-silo federated learning, and cross-device federated learning on smartphones/IoTs, and research simulation. + +Keywords: Federated Learning, Analytics, Collaborative ML, Decentralized + +## [gpt-code-clippy](https://github.com/CodedotAl/gpt-code-clippy) + +GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model -- based on GPT-3, called GPT-Codex -- that is fine-tuned on publicly available code from GitHub. + +Keywords: LLM, Code + +## [TextAttack](https://github.com/QData/TextAttack) + +[TextAttack](https://github.com/QData/TextAttack) 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP. + +Keywords: Adversarial attacks, Data augmentation, NLP + +## [OpenPrompt](https://github.com/thunlp/OpenPrompt) + +Prompt-learning is a paradigm to adapt pre-trained language models (PLMs) to downstream NLP tasks, which modifies the input text with a textual template and directly uses PLMs to conduct pre-trained tasks. This library provides a standard, flexible and extensible framework to deploy the prompt-learning pipeline. [OpenPrompt](https://github.com/thunlp/OpenPrompt) supports loading PLMs directly from https://github.com/huggingface/transformers. + +## [text-generation-webui](https://github.com/oobabooga/text-generation-webui/) + +[text-generation-webui](https://github.com/oobabooga/text-generation-webui/) is a Gradio Web UI for running Large Language Models like LLaMA, llama.cpp, GPT-J, Pythia, OPT, and GALACTICA. + +Keywords: LLM, WebUI + +## [libra](https://github.com/Palashio/libra) + +An ergonomic machine learning [libra](https://github.com/Palashio/libra)ry for non-technical users. It focuses on ergonomics and on ensuring that training a model is as simple as it can be.
+ +Keywords: Ergonomic, Non-technical + +## [alibi](https://github.com/SeldonIO/alibi) + +Alibi is an open source Python library aimed at machine learning model inspection and interpretation. The focus of the library is to provide high-quality implementations of black-box, white-box, local and global explanation methods for classification and regression models. + +Keywords: Model inspection, Model interpretation, Black-box, White-box + +## [tortoise-tts](https://github.com/neonbjb/tortoise-tts) + +Tortoise is a text-to-speech program built with the following priorities: strong multi-voice capabilities, and highly realistic prosody and intonation. + +Keywords: Text-to-speech + +## [flower](https://github.com/adap/flower) + +Flower (flwr) is a framework for building federated learning systems. The design of Flower is based on a few guiding principles: customizability, extendability, framework agnosticity, and ease-of-use. + +Keywords: Federated learning systems, Customizable, Extendable, Framework-agnostic, Simplicity + +## [fast-bert](https://github.com/utterworks/fast-bert) + +Fast-Bert is a deep learning library that allows developers and data scientists to train and deploy BERT and XLNet based models for natural language processing tasks beginning with Text Classification. It is aimed at simplicity. + +Keywords: Deployment, BERT, XLNet + +## [towhee](https://github.com/towhee-io/towhee) + +Towhee makes it easy to build neural data processing pipelines for AI applications. We provide hundreds of models, algorithms, and transformations that can be used as standard pipeline building blocks. Users can use Towhee's Pythonic API to build a prototype of their pipeline and automatically optimize it for production-ready environments. + +Keywords: Data processing pipeline, Optimization + +## [alibi-detect](https://github.com/SeldonIO/alibi-detect) + +Alibi Detect is an open source Python library focused on outlier, adversarial and drift detection. The package aims to cover both online and offline detectors for tabular data, text, images and time series. Both TensorFlow and PyTorch backends are supported for drift detection. + +Keywords: Adversarial, Outlier, Drift detection + +## [FARM](https://github.com/deepset-ai/FARM) + +[FARM](https://github.com/deepset-ai/FARM) makes Transfer Learning with BERT & Co simple, fast and enterprise-ready. It's built upon transformers and provides additional features to simplify the life of developers: Parallelized preprocessing, highly modular design, multi-task learning, experiment tracking, easy debugging and close integration with AWS SageMaker. + +Keywords: Transfer Learning, Modular design, Multi-task learning, Experiment tracking + +## [aitextgen](https://github.com/minimaxir/aitextgen) + +A robust Python tool for text-based AI training and generation using OpenAI's GPT-2 and EleutherAI's GPT Neo/GPT-3 architecture. +[aitextgen](https://github.com/minimaxir/aitextgen) is a Python package that leverages PyTorch, Hugging Face Transformers and pytorch-lightning with specific optimizations for text generation using GPT-2, plus many added features. + +Keywords: Training, Generation + +## [diffgram](https://github.com/diffgram/diffgram) + +Diffgram aims to integrate human supervision into platforms. We support your team programmatically changing the UI (Schema, layout, etc.) like in Streamlit. This means that you can collect and annotate timely data from users. 
In other words, we are the platform behind your platform, an integrated part of your application, to ship new & better AI products faster. + +Keywords: Human supervision, Platform + +## [ecco](https://github.com/jalammar/ecco) + +Explain, analyze, and visualize NLP language models. Ecco creates interactive visualizations directly in Jupyter notebooks explaining the behavior of Transformer-based language models (like GPT2, BERT, RoBERTA, T5, and T0). + +Keywords: Model explainability + +## [s3prl](https://github.com/s3prl/s3prl) + +[s3prl](https://github.com/s3prl/s3prl) stands for Self-Supervised Speech Pre-training and Representation Learning. Self-supervised speech pre-trained models are called upstream in this toolkit, and are utilized in various downstream tasks. + +Keywords: Speech, Training + +## [ru-dalle](https://github.com/ai-forever/ru-dalle) + +RuDALL-E aims to be similar to DALL-E, targeted to Russian. + +Keywords: DALL-E, Russian + +## [DeepKE](https://github.com/zjunlp/DeepKE) + +[DeepKE](https://github.com/zjunlp/DeepKE) is a knowledge extraction toolkit for knowledge graph construction supporting cnSchema,low-resource, document-level and multimodal scenarios for entity, relation and attribute extraction. + +Keywords: Knowledge Extraction, Knowledge Graphs + +## [Nebuly](https://github.com/nebuly-ai/nebuly) + +Nebuly is the next-generation platform to monitor and optimize your AI costs in one place. The platform connects to all your AI cost sources (compute, API providers, AI software licenses, etc) and centralizes them in one place to give you full visibility on a model basis. The platform also provides optimization recommendations and a co-pilot model that can guide during the optimization process. The platform builds on top of the open-source tools allowing you to optimize the different steps of your AI stack to squeeze out the best possible cost performances. + +Keywords: Optimization, Performance, Monitoring + +## [imaginAIry](https://github.com/brycedrennan/imaginAIry) + +Offers a CLI and a Python API to generate images with Stable Diffusion. It has support for many tools, like image structure control (controlnet), instruction-based image edits (InstructPix2Pix), prompt-based masking (clipseg), among others. + +Keywords: Stable Diffusion, CLI, Python API + +## [sparseml](https://github.com/neuralmagic/sparseml) + +SparseML is an open-source model optimization toolkit that enables you to create inference-optimized sparse models using pruning, quantization, and distillation algorithms. Models optimized with SparseML can then be exported to the ONNX and deployed with DeepSparse for GPU-class performance on CPU hardware. + +Keywords: Model optimization, Pruning, Quantization, Distillation + +## [opacus](https://github.com/pytorch/opacus) + +Opacus is a library that enables training PyTorch models with differential privacy. It supports training with minimal code changes required on the client, has little impact on training performance, and allows the client to online track the privacy budget expended at any given moment. + +Keywords: Differential privacy + +## [LAVIS](https://github.com/salesforce/LAVIS) + +[LAVIS](https://github.com/salesforce/LAVIS) is a Python deep learning library for LAnguage-and-VISion intelligence research and applications. This library aims to provide engineers and researchers with a one-stop solution to rapidly develop models for their specific multimodal scenarios, and benchmark them across standard and customized datasets. 
It features a unified interface design to access state-of-the-art language-vision models. + +Keywords: Multimodal, NLP, Vision + +## [buzz](https://github.com/chidiwilliams/buzz) + +Buzz transcribes and translates audio offline on your personal computer. Powered by OpenAI's Whisper. + +Keywords: Audio transcription, Translation + +## [rust-bert](https://github.com/guillaume-be/rust-bert) + +Rust-native state-of-the-art Natural Language Processing models and pipelines. Port of Hugging Face's Transformers library, using the tch-rs crate and pre-processing from rust-tokenizers. Supports multi-threaded tokenization and GPU inference. This repository exposes the model base architecture, task-specific heads and ready-to-use pipelines. + +Keywords: Rust, BERT, Inference + +## [EasyNLP](https://github.com/alibaba/EasyNLP) + +[EasyNLP](https://github.com/alibaba/EasyNLP) is an easy-to-use NLP development and application toolkit in PyTorch, first released inside Alibaba in 2021. It is built with scalable distributed training strategies and supports a comprehensive suite of NLP algorithms for various NLP applications. [EasyNLP](https://github.com/alibaba/EasyNLP) integrates knowledge distillation and few-shot learning for landing large pre-trained models, together with various popular multi-modality pre-trained models. It provides a unified framework of model training, inference, and deployment for real-world applications. + +Keywords: NLP, Knowledge distillation, Few-shot learning, Multi-modality, Training, Inference, Deployment + +## [TurboTransformers](https://github.com/Tencent/TurboTransformers) + +A fast and user-friendly runtime for transformer inference (Bert, Albert, GPT2, Decoders, etc) on CPU and GPU. + +Keywords: Optimization, Performance + +## [hivemind](https://github.com/learning-at-home/hivemind) + +Hivemind is a PyTorch library for decentralized deep learning across the Internet. Its intended usage is training one large model on hundreds of computers from different universities, companies, and volunteers. + +Keywords: Decentralized training + +## [docquery](https://github.com/impira/docquery) + +DocQuery is a library and command-line tool that makes it easy to analyze semi-structured and unstructured documents (PDFs, scanned images, etc.) using large language models (LLMs). You simply point DocQuery at one or more documents and specify a question you want to ask. DocQuery is created by the team at Impira. + +Keywords: Semi-structured documents, Unstructured documents, LLM, Document Question Answering + +## [CodeGeeX](https://github.com/THUDM/CodeGeeX) + +[CodeGeeX](https://github.com/THUDM/CodeGeeX) is a large-scale multilingual code generation model with 13 billion parameters, pre-trained on a large code corpus of more than 20 programming languages. It has several unique features: +- Multilingual code generation +- Crosslingual code translation +- A customizable programming assistant + +Keywords: Code Generation Model + +## [ktrain](https://github.com/amaiya/ktrain) + +[ktrain](https://github.com/amaiya/ktrain) is a lightweight wrapper for the deep learning library TensorFlow Keras (and other libraries) to help build, train, and deploy neural networks and other machine learning models. Inspired by ML framework extensions like fastai and ludwig, [ktrain](https://github.com/amaiya/ktrain) is designed to make deep learning and AI more accessible and easier to apply for both newcomers and experienced practitioners.
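+
+A small sketch of the short workflow ktrain aims for, loosely based on its text-classification tutorials (the toy data, class names, and hyperparameters are assumptions; treat it as an outline rather than a verbatim recipe):
+
+```python
+import ktrain
+from ktrain import text
+
+# Toy data standing in for a real labelled corpus.
+x_train = ["great movie", "terrible plot", "loved it", "waste of time"]
+y_train = ["pos", "neg", "pos", "neg"]
+
+# Preprocess, build a classifier, and wrap both in a learner object.
+trn, val, preproc = text.texts_from_array(
+    x_train=x_train, y_train=y_train,
+    x_test=x_train, y_test=y_train,          # reusing the toy data as "validation"
+    class_names=["neg", "pos"],
+    preprocess_mode="bert", maxlen=32,
+)
+model = text.text_classifier("bert", train_data=trn, preproc=preproc)
+learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=2)
+learner.fit_onecycle(2e-5, 1)                # one epoch with the 1cycle policy
+
+predictor = ktrain.get_predictor(learner.model, preproc)
+print(predictor.predict("what a fantastic film"))
+```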
+ +Keywords: Keras wrapper, Model building, Training, Deployment + +## [FastDeploy](https://github.com/PaddlePaddle/FastDeploy) + +[FastDeploy](https://github.com/PaddlePaddle/FastDeploy) is an easy-to-use and high-performance AI model deployment toolkit for Cloud, Mobile and Edge, with an out-of-the-box and unified experience and end-to-end optimization for over 160 Text, Vision, Speech and Cross-modal AI models. It covers image classification, object detection, OCR, face detection, matting, pp-tracking, NLP, stable diffusion, TTS and other tasks to meet developers' industrial deployment needs for multi-scenario, multi-hardware and multi-platform. + +Keywords: Model deployment, Cloud, Mobile, Edge + +## [underthesea](https://github.com/undertheseanlp/underthesea) + +[underthesea](https://github.com/undertheseanlp/underthesea) is a Vietnamese NLP toolkit. Underthesea is a suite of open source Python modules, data sets and tutorials supporting research and development in Vietnamese Natural Language Processing. It provides an extremely easy API to quickly apply pretrained NLP models to your Vietnamese text, such as word segmentation, part-of-speech tagging (PoS), named entity recognition (NER), text classification and dependency parsing. + +Keywords: Vietnamese, NLP + +## [hasktorch](https://github.com/hasktorch/hasktorch) + +Hasktorch is a library for tensors and neural networks in Haskell. It is an independent open source community project which leverages the core C++ libraries shared by PyTorch. + +Keywords: Haskell, Neural Networks + +## [donut](https://github.com/clovaai/donut) + +Donut, or Document understanding transformer, is a new method of document understanding that utilizes an OCR-free end-to-end Transformer model. + +Donut does not require off-the-shelf OCR engines/APIs, yet it shows state-of-the-art performances on various visual document understanding tasks, such as visual document classification or information extraction (a.k.a. document parsing). + +Keywords: Document Understanding + +## [transformers-interpret](https://github.com/cdpierse/transformers-interpret) + +Transformers Interpret is a model explainability tool designed to work exclusively with the transformers package. + +In line with the philosophy of the Transformers package, Transformers Interpret allows any transformers model to be explained in just two lines. Explainers are available for both text and computer vision models. Visualizations are also available in notebooks and as savable PNG and HTML files. + +Keywords: Model interpretation, Visualization + +## [mlrun](https://github.com/mlrun/mlrun) + +MLRun is an open MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications, significantly reducing engineering efforts, time to production, and computation resources. With MLRun, you can choose any IDE on your local machine or on the cloud. MLRun breaks the silos between data, ML, software, and DevOps/MLOps teams, enabling collaboration and fast continuous improvements. + +Keywords: MLOps + +## [FederatedScope](https://github.com/alibaba/FederatedScope) + +[FederatedScope](https://github.com/alibaba/FederatedScope) is a comprehensive federated learning platform that provides convenient usage and flexible customization for various federated learning tasks in both academia and industry.
Based on an event-driven architecture, [FederatedScope](https://github.com/alibaba/FederatedScope) integrates rich collections of functionalities to satisfy the burgeoning demands of federated learning, and aims to build up an easy-to-use platform for promoting learning safely and effectively. + +Keywords: Federated learning, Event-driven + +## [pythainlp](https://github.com/PyThaiNLP/pythainlp) + +PyThaiNLP is a Python package for text processing and linguistic analysis, similar to NLTK, with a focus on the Thai language. + +Keywords: Thai, NLP, NLTK + +## [FlagAI](https://github.com/FlagAI-Open/FlagAI) + +[FlagAI](https://github.com/FlagAI-Open/FlagAI) (Fast LArge-scale General AI models) is a fast, easy-to-use and extensible toolkit for large-scale models. Our goal is to support training, fine-tuning, and deployment of large-scale models on various downstream tasks with multi-modality. + +Keywords: Large models, Training, Fine-tuning, Deployment, Multi-modal + +## [pyserini](https://github.com/castorini/pyserini) + +[pyserini](https://github.com/castorini/pyserini) is a Python toolkit for reproducible information retrieval research with sparse and dense representations. Retrieval using sparse representations is provided via integration with the group's Anserini IR toolkit. Retrieval using dense representations is provided via integration with Facebook's Faiss library. + +Keywords: IR, Information Retrieval, Dense, Sparse + +## [baal](https://github.com/baal-org/baal) + +[baal](https://github.com/baal-org/baal) is an active learning library that supports both industrial applications and research use cases. [baal](https://github.com/baal-org/baal) currently supports Monte-Carlo Dropout, MCDropConnect, deep ensembles, and semi-supervised learning. + +Keywords: Active Learning, Research, Labeling + +## [cleanlab](https://github.com/cleanlab/cleanlab) + +[cleanlab](https://github.com/cleanlab/cleanlab) is the standard data-centric AI package for data quality and machine learning with messy, real-world data and labels. For text, image, tabular, audio (among others) datasets, you can use cleanlab to automatically: detect data issues (outliers, label errors, near duplicates, etc.), train robust ML models, infer consensus + annotator quality for multi-annotator data, and suggest data to (re)label next (active learning). + +Keywords: Data-Centric AI, Data Quality, Noisy Labels, Outlier Detection, Active Learning + +## [BentoML](https://github.com/bentoml/BentoML) + +[BentoML](https://github.com/bentoml) is the unified framework for building, shipping, and scaling production-ready AI applications incorporating traditional ML, pre-trained AI models, and Generative and Large Language Models. +All Hugging Face models and pipelines can be seamlessly integrated into BentoML applications, enabling the running of models on the most suitable hardware and independent scaling based on usage. + +Keywords: BentoML, Framework, Deployment, AI Applications + +## [LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory) + +[LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory) offers a user-friendly fine-tuning framework that incorporates PEFT. The repository includes training (fine-tuning) and inference examples for LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, and other LLMs. A ChatGLM version is also available in [ChatGLM-Efficient-Tuning](https://github.com/hiyouga/ChatGLM-Efficient-Tuning).
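For a sense of what such PEFT-based fine-tuning looks like underneath, here is a minimal LoRA sketch using the `peft` and `transformers` libraries directly; the base model (`gpt2`), the `train.txt` data file, and all hyperparameters are placeholders and are not taken from the LLaMA Factory repository.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # stand-in for LLaMA-2 / Qwen-sized models
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the base model with low-rank adapters; only these small matrices are trained.
peft_config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(model, peft_config)

# Any plain-text file works here; "train.txt" is just an illustrative path.
dataset = load_dataset("text", data_files={"train": "train.txt"})["train"]
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=128), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=2, num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The appeal of LoRA here is that only the small adapter matrices receive gradients, which is what lets frameworks like LLaMA Factory fine-tune much larger models on modest hardware.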
+ +Keywords: PEFT, fine-tuning, LLaMA-2, ChatGLM, Qwen + diff --git a/conftest.py b/conftest.py index c3d4f70326d90c..0b5daf574f0bc9 100644 --- a/conftest.py +++ b/conftest.py @@ -20,9 +20,13 @@ import warnings from os.path import abspath, dirname, join +import _pytest + +from transformers.testing_utils import HfDoctestModule, HfDocTestParser + # allow having multiple repository checkouts and not needing to remember to rerun -# 'pip install -e .[dev]' when switching between checkouts and running tests. +# `pip install -e '.[dev]'` when switching between checkouts and running tests. git_repo_path = abspath(join(dirname(__file__), "src")) sys.path.insert(1, git_repo_path) @@ -38,7 +42,10 @@ def pytest_configure(config): config.addinivalue_line( "markers", "is_pt_flax_cross_test: mark test to run only when PT and FLAX interactions are tested" ) + config.addinivalue_line("markers", "is_pipeline_test: mark test to run only when pipelines are tested") config.addinivalue_line("markers", "is_staging_test: mark test to run only in the staging environment") + config.addinivalue_line("markers", "accelerate_tests: mark test that require accelerate") + config.addinivalue_line("markers", "tool_tests: mark the tool tests that are run on their specific schedule") def pytest_addoption(parser): @@ -62,7 +69,7 @@ def pytest_sessionfinish(session, exitstatus): # Doctest custom flag to ignore output. -IGNORE_RESULT = doctest.register_optionflag('IGNORE_RESULT') +IGNORE_RESULT = doctest.register_optionflag("IGNORE_RESULT") OutputChecker = doctest.OutputChecker @@ -75,3 +82,5 @@ def check_output(self, want, got, optionflags): doctest.OutputChecker = CustomOutputChecker +_pytest.doctest.DoctestModule = HfDoctestModule +doctest.DocTestParser = HfDocTestParser diff --git a/docker/transformers-all-latest-gpu/Dockerfile b/docker/transformers-all-latest-gpu/Dockerfile index 95e127ef6ddb95..e96eb9539c8bd2 100644 --- a/docker/transformers-all-latest-gpu/Dockerfile +++ b/docker/transformers-all-latest-gpu/Dockerfile @@ -1,4 +1,4 @@ -FROM nvidia/cuda:11.6.2-cudnn8-devel-ubuntu20.04 +FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04 LABEL maintainer="Hugging Face" ARG DEBIAN_FRONTEND=noninteractive @@ -9,11 +9,11 @@ SHELL ["sh", "-lc"] # The following `ARG` are mainly used to specify the versions explicitly & directly in this docker file, and not meant # to be used as arguments for docker build (so far). -ARG PYTORCH='1.13.1' +ARG PYTORCH='2.1.1' # (not always a valid torch version) -ARG INTEL_TORCH_EXT='1.11.0' +ARG INTEL_TORCH_EXT='2.1.100' # Example: `cu102`, `cu113`, etc. -ARG CUDA='cu116' +ARG CUDA='cu118' RUN apt update RUN apt install -y git libsndfile1-dev tesseract-ocr espeak-ng python3 python3-pip ffmpeg git-lfs @@ -22,7 +22,6 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip ARG REF=main RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF -RUN python3 -m pip install --no-cache-dir -e ./transformers[dev,onnxruntime] # TODO: Handle these in a python utility script RUN [ ${#PYTORCH} -gt 0 -a "$PYTORCH" != "pre" ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; echo "export VERSION='$VERSION'" >> ~/.profile @@ -32,29 +31,47 @@ RUN echo torch=$VERSION # TODO: We might need to specify proper versions that work with a specific torch version (especially for past CI). 
RUN [ "$PYTORCH" != "pre" ] && python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/$CUDA || python3 -m pip install --no-cache-dir -U --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/$CUDA -RUN python3 -m pip install --no-cache-dir -U tensorflow==2.11 -RUN python3 -m pip install --no-cache-dir -U tensorflow_probability -RUN python3 -m pip uninstall -y flax jax +RUN python3 -m pip install --no-cache-dir -U tensorflow==2.13 protobuf==3.20.3 tensorflow_text tensorflow_probability + +RUN python3 -m pip install --no-cache-dir -e ./transformers[dev,onnxruntime] -# To include the change in this commit https://github.com/onnx/tensorflow-onnx/commit/ddca3a5eb2d912f20fe7e0568dd1a3013aee9fa3 -# Otherwise, we get tf2onnx==1.8 (caused by `flatbuffers` version), and some tests fail with `ValueError: from_keras requires input_signature`. -# TODO: remove this line once the conflict is resolved in these libraries. -RUN python3 -m pip install --no-cache-dir git+https://github.com/onnx/tensorflow-onnx.git@ddca3a5eb2d912f20fe7e0568dd1a3013aee9fa3 +RUN python3 -m pip uninstall -y flax jax -RUN python3 -m pip install --no-cache-dir intel_extension_for_pytorch==$INTEL_TORCH_EXT+cpu -f https://software.intel.com/ipex-whl-stable +RUN python3 -m pip install --no-cache-dir intel_extension_for_pytorch==$INTEL_TORCH_EXT -f https://developer.intel.com/ipex-whl-stable-cpu RUN python3 -m pip install --no-cache-dir git+https://github.com/facebookresearch/detectron2.git pytesseract RUN python3 -m pip install -U "itsdangerous<2.1.0" RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/accelerate@main#egg=accelerate +RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/peft@main#egg=peft + # Add bitsandbytes for mixed int8 testing RUN python3 -m pip install --no-cache-dir bitsandbytes -RUN python3 -m pip install --no-cache-dir decord +# Add auto-gptq for gtpq quantization testing +RUN python3 -m pip install --no-cache-dir auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ + +# Add einops for additional model testing +RUN python3 -m pip install --no-cache-dir einops + +# Add aqlm for quantization testing +RUN python3 -m pip install --no-cache-dir aqlm[gpu]==1.0.1 + +# Add autoawq for quantization testing +RUN python3 -m pip install --no-cache-dir https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.8/autoawq-0.1.8+cu118-cp38-cp38-linux_x86_64.whl + +# For bettertransformer + gptq +RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/optimum@main#egg=optimum + +# For video model testing +RUN python3 -m pip install --no-cache-dir decord av==9.2.0 # For `dinat` model -RUN python3 -m pip install --no-cache-dir natten -f https://shi-labs.com/natten/wheels/$CUDA/ +RUN python3 -m pip install --no-cache-dir 'natten<0.15.0' -f https://shi-labs.com/natten/wheels/$CUDA/ + +# For `nougat` tokenizer +RUN python3 -m pip install --no-cache-dir python-Levenshtein # When installing in editable mode, `transformers` is not recognized as a package. # this line must be added in order for python to be aware of transformers. 
diff --git a/docker/transformers-cpu/Dockerfile b/docker/transformers-cpu/Dockerfile deleted file mode 100644 index c3590e4239e470..00000000000000 --- a/docker/transformers-cpu/Dockerfile +++ /dev/null @@ -1,26 +0,0 @@ -FROM ubuntu:18.04 -LABEL maintainer="Hugging Face" -LABEL repository="transformers" - -RUN apt update && \ - apt install -y bash \ - build-essential \ - git \ - curl \ - ca-certificates \ - python3 \ - python3-pip && \ - rm -rf /var/lib/apt/lists - -RUN python3 -m pip install --no-cache-dir --upgrade pip && \ - python3 -m pip install --no-cache-dir \ - jupyter \ - tensorflow-cpu \ - torch - -WORKDIR /workspace -COPY . transformers/ -RUN cd transformers/ && \ - python3 -m pip install --no-cache-dir . - -CMD ["/bin/bash"] diff --git a/docker/transformers-doc-builder/Dockerfile b/docker/transformers-doc-builder/Dockerfile index 0e5b072d488930..bd3d2ce2be1604 100644 --- a/docker/transformers-doc-builder/Dockerfile +++ b/docker/transformers-doc-builder/Dockerfile @@ -1,4 +1,4 @@ -FROM python:3.8 +FROM python:3.10 LABEL maintainer="Hugging Face" RUN apt update @@ -11,7 +11,6 @@ RUN apt-get -y update && apt-get install -y libsndfile1-dev && apt install -y te RUN python3 -m pip install --no-cache-dir ./transformers[deepspeed] RUN python3 -m pip install --no-cache-dir torchvision git+https://github.com/facebookresearch/detectron2.git pytesseract -RUN python3 -m pip install --no-cache-dir pytorch-quantization --extra-index-url https://pypi.ngc.nvidia.com RUN python3 -m pip install -U "itsdangerous<2.1.0" # Test if the image could successfully build the doc. before publishing the image diff --git a/docker/transformers-past-gpu/Dockerfile b/docker/transformers-past-gpu/Dockerfile index 99fb550c6a35d8..0cdc9ff0712437 100644 --- a/docker/transformers-past-gpu/Dockerfile +++ b/docker/transformers-past-gpu/Dockerfile @@ -1,4 +1,4 @@ -ARG BASE_DOCKER_IMAGE="nvidia/cuda:11.2.2-cudnn8-devel-ubuntu20.04" +ARG BASE_DOCKER_IMAGE FROM $BASE_DOCKER_IMAGE LABEL maintainer="Hugging Face" @@ -8,7 +8,7 @@ ARG DEBIAN_FRONTEND=noninteractive SHELL ["sh", "-lc"] RUN apt update -RUN apt install -y git libsndfile1-dev tesseract-ocr espeak-ng python3 python3-pip ffmpeg git-lfs +RUN apt install -y git libsndfile1-dev tesseract-ocr espeak-ng python3 python3-pip ffmpeg git-lfs libaio-dev RUN git lfs install RUN python3 -m pip install --no-cache-dir --upgrade pip @@ -23,9 +23,11 @@ RUN cd transformers && python3 setup.py develop ARG FRAMEWORK ARG VERSION +# Control `setuptools` version to avoid some issues +RUN [ "$VERSION" != "1.10" ] && python3 -m pip install -U setuptools || python3 -m pip install -U "setuptools<=59.5" + # Remove all frameworks -# (`accelerate` requires `torch`, and this causes import issues for TF-only testing) -RUN python3 -m pip uninstall -y torch torchvision torchaudio accelerate tensorflow jax flax +RUN python3 -m pip uninstall -y torch torchvision torchaudio tensorflow jax flax # Get the libraries and their versions to install, and write installation command to `~/.profile`. 
RUN python3 ./transformers/utils/past_ci_versions.py --framework $FRAMEWORK --version $VERSION @@ -34,4 +36,24 @@ RUN python3 ./transformers/utils/past_ci_versions.py --framework $FRAMEWORK --ve RUN echo "INSTALL_CMD = $INSTALL_CMD" RUN $INSTALL_CMD +RUN [ "$FRAMEWORK" != "pytorch" ] && echo "`deepspeed-testing` installation is skipped" || python3 -m pip install --no-cache-dir ./transformers[deepspeed-testing] + +# Remove `accelerate`: it requires `torch`, and this causes import issues for TF-only testing +# We will install `accelerate@main` in Past CI workflow file +RUN python3 -m pip uninstall -y accelerate + +# Uninstall `torch-tensorrt` and `apex` shipped with the base image +RUN python3 -m pip uninstall -y torch-tensorrt apex + +# Pre-build **nightly** release of DeepSpeed, so it would be ready for testing (otherwise, the 1st deepspeed test will timeout) +RUN python3 -m pip uninstall -y deepspeed +# This has to be run inside the GPU VMs running the tests. (So far, it fails here due to GPU checks during compilation.) +# Issue: https://github.com/microsoft/DeepSpeed/issues/2010 +# RUN git clone https://github.com/microsoft/DeepSpeed && cd DeepSpeed && rm -rf build && \ +# DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_UTILS=1 python3 -m pip install . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1 + RUN python3 -m pip install -U "itsdangerous<2.1.0" + +# When installing in editable mode, `transformers` is not recognized as a package. +# this line must be added in order for python to be aware of transformers. +RUN cd transformers && python3 setup.py develop diff --git a/docker/transformers-pytorch-amd-gpu/Dockerfile b/docker/transformers-pytorch-amd-gpu/Dockerfile new file mode 100644 index 00000000000000..46ca1a531b4ab4 --- /dev/null +++ b/docker/transformers-pytorch-amd-gpu/Dockerfile @@ -0,0 +1,36 @@ +FROM rocm/dev-ubuntu-20.04:5.6 +# rocm/pytorch has no version with 2.1.0 +LABEL maintainer="Hugging Face" + +ARG DEBIAN_FRONTEND=noninteractive + +ARG PYTORCH='2.1.0' +ARG TORCH_VISION='0.16.0' +ARG TORCH_AUDIO='2.1.0' +ARG ROCM='5.6' + +RUN apt update && \ + apt install -y --no-install-recommends git libsndfile1-dev tesseract-ocr espeak-ng python3 python3-dev python3-pip ffmpeg && \ + apt clean && \ + rm -rf /var/lib/apt/lists/* + +RUN python3 -m pip install --no-cache-dir --upgrade pip + +RUN python3 -m pip install torch==$PYTORCH torchvision==$TORCH_VISION torchaudio==$TORCH_AUDIO --index-url https://download.pytorch.org/whl/rocm$ROCM + +RUN python3 -m pip install --no-cache-dir --upgrade pip setuptools ninja git+https://github.com/facebookresearch/detectron2.git pytesseract "itsdangerous<2.1.0" + +ARG REF=main +WORKDIR / + +# Invalidate docker cache from here if new commit is available. +ADD https://api.github.com/repos/huggingface/transformers/git/refs/heads/main version.json +RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF + +RUN python3 -m pip install --no-cache-dir -e ./transformers[dev-torch,testing,video] + +RUN python3 -m pip uninstall -y tensorflow flax + +# When installing in editable mode, `transformers` is not recognized as a package. +# this line must be added in order for python to be aware of transformers. 
+RUN cd transformers && python3 setup.py develop diff --git a/docker/transformers-pytorch-cpu/Dockerfile b/docker/transformers-pytorch-cpu/Dockerfile deleted file mode 100644 index d1759d650b84fd..00000000000000 --- a/docker/transformers-pytorch-cpu/Dockerfile +++ /dev/null @@ -1,25 +0,0 @@ -FROM ubuntu:18.04 -LABEL maintainer="Hugging Face" -LABEL repository="transformers" - -RUN apt update && \ - apt install -y bash \ - build-essential \ - git \ - curl \ - ca-certificates \ - python3 \ - python3-pip && \ - rm -rf /var/lib/apt/lists - -RUN python3 -m pip install --no-cache-dir --upgrade pip && \ - python3 -m pip install --no-cache-dir \ - jupyter \ - torch - -WORKDIR /workspace -COPY . transformers/ -RUN cd transformers/ && \ - python3 -m pip install --no-cache-dir . - -CMD ["/bin/bash"] \ No newline at end of file diff --git a/docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile b/docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile new file mode 100644 index 00000000000000..1fa384dfa2bc03 --- /dev/null +++ b/docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile @@ -0,0 +1,45 @@ +FROM rocm/dev-ubuntu-22.04:5.6 +LABEL maintainer="Hugging Face" + +ARG DEBIAN_FRONTEND=noninteractive +ARG PYTORCH='2.1.1' +ARG TORCH_VISION='0.16.1' +ARG TORCH_AUDIO='2.1.1' +ARG ROCM='5.6' + +RUN apt update && \ + apt install -y --no-install-recommends \ + libaio-dev \ + git \ + # These are required to build deepspeed. + python3-dev \ + python-is-python3 \ + rocrand-dev \ + rocthrust-dev \ + hipsparse-dev \ + hipblas-dev \ + rocblas-dev && \ + apt clean && \ + rm -rf /var/lib/apt/lists/* + +RUN python3 -m pip install --no-cache-dir --upgrade pip ninja "pydantic<2" +RUN python3 -m pip uninstall -y apex torch torchvision torchaudio +RUN python3 -m pip install torch==$PYTORCH torchvision==$TORCH_VISION torchaudio==$TORCH_AUDIO --index-url https://download.pytorch.org/whl/rocm$ROCM --no-cache-dir + +# Pre-build DeepSpeed, so it's be ready for testing (to avoid timeout) +RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache-dir -v --disable-pip-version-check 2>&1 + +ARG REF=main +WORKDIR / + +# Invalidate docker cache from here if new commit is available. +ADD https://api.github.com/repos/huggingface/transformers/git/refs/heads/main version.json +RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF + +RUN python3 -m pip install --no-cache-dir ./transformers[accelerate,testing,sentencepiece,sklearn] + +# When installing in editable mode, `transformers` is not recognized as a package. +# this line must be added in order for python to be aware of transformers. 
+RUN cd transformers && python3 setup.py develop + +RUN python3 -c "from deepspeed.launcher.runner import main" \ No newline at end of file diff --git a/docker/transformers-pytorch-deepspeed-latest-gpu/Dockerfile b/docker/transformers-pytorch-deepspeed-latest-gpu/Dockerfile index 57ddc43506aae4..a7b08a8c60d31d 100644 --- a/docker/transformers-pytorch-deepspeed-latest-gpu/Dockerfile +++ b/docker/transformers-pytorch-deepspeed-latest-gpu/Dockerfile @@ -1,12 +1,12 @@ -# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_22-04.html#rel_22-04 -FROM nvcr.io/nvidia/pytorch:22.04-py3 +# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-11.html#rel-23-11 +FROM nvcr.io/nvidia/pytorch:23.11-py3 LABEL maintainer="Hugging Face" ARG DEBIAN_FRONTEND=noninteractive -ARG PYTORCH='1.13.1' +ARG PYTORCH='2.1.0' # Example: `cu102`, `cu113`, etc. -ARG CUDA='cu116' +ARG CUDA='cu121' RUN apt -y update RUN apt install -y libaio-dev @@ -15,6 +15,8 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip ARG REF=main RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF +RUN python3 -m pip uninstall -y torch torchvision torchaudio + # Install latest release PyTorch # (PyTorch must be installed before pre-compiling any DeepSpeed c++/cuda ops.) # (https://www.deepspeed.ai/tutorials/advanced-install/#pre-install-deepspeed-ops) @@ -22,25 +24,32 @@ RUN python3 -m pip install --no-cache-dir -U torch==$PYTORCH torchvision torchau RUN python3 -m pip install --no-cache-dir ./transformers[deepspeed-testing] -RUN python3 -m pip install torch-tensorrt==1.3.0 --find-links https://github.com/pytorch/TensorRT/releases/expanded_assets/v1.3.0 +RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/accelerate@main#egg=accelerate + +# Uninstall `transformer-engine` shipped with the base image +RUN python3 -m pip uninstall -y transformer-engine + +# Uninstall `torch-tensorrt` shipped with the base image +RUN python3 -m pip uninstall -y torch-tensorrt # recompile apex RUN python3 -m pip uninstall -y apex -RUN git clone https://github.com/NVIDIA/apex +# RUN git clone https://github.com/NVIDIA/apex # `MAX_JOBS=1` disables parallel building to avoid cpu memory OOM when building image on GitHub Action (standard) runners -RUN cd apex && MAX_JOBS=1 python3 -m pip install --global-option="--cpp_ext" --global-option="--cuda_ext" --no-cache -v --disable-pip-version-check . +# TODO: check if there is alternative way to install latest apex +# RUN cd apex && MAX_JOBS=1 python3 -m pip install --global-option="--cpp_ext" --global-option="--cuda_ext" --no-cache -v --disable-pip-version-check . # Pre-build **latest** DeepSpeed, so it would be ready for testing (otherwise, the 1st deepspeed test will timeout) RUN python3 -m pip uninstall -y deepspeed # This has to be run (again) inside the GPU VMs running the tests. # The installation works here, but some tests fail, if we don't pre-build deepspeed again in the VMs running the tests. # TODO: Find out why test fail. -RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1 +RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1 # When installing in editable mode, `transformers` is not recognized as a package. 
# this line must be added in order for python to be aware of transformers. RUN cd transformers && python3 setup.py develop # The base image ships with `pydantic==1.8.2` which is not working - i.e. the next command fails -RUN python3 -m pip install -U --no-cache-dir pydantic +RUN python3 -m pip install -U --no-cache-dir "pydantic<2" RUN python3 -c "from deepspeed.launcher.runner import main" diff --git a/docker/transformers-pytorch-deepspeed-nightly-gpu/Dockerfile b/docker/transformers-pytorch-deepspeed-nightly-gpu/Dockerfile index 573e09c22a9c05..06da67049296a5 100644 --- a/docker/transformers-pytorch-deepspeed-nightly-gpu/Dockerfile +++ b/docker/transformers-pytorch-deepspeed-nightly-gpu/Dockerfile @@ -1,10 +1,11 @@ -FROM nvcr.io/nvidia/pytorch:21.03-py3 +# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-11.html#rel-23-11 +FROM nvcr.io/nvidia/pytorch:23.11-py3 LABEL maintainer="Hugging Face" ARG DEBIAN_FRONTEND=noninteractive # Example: `cu102`, `cu113`, etc. -ARG CUDA='cu113' +ARG CUDA='cu121' RUN apt -y update RUN apt install -y libaio-dev @@ -13,6 +14,8 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip ARG REF=main RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF +RUN python3 -m pip uninstall -y torch torchvision torchaudio + # Install **nightly** release PyTorch (flag `--pre`) # (PyTorch must be installed before pre-compiling any DeepSpeed c++/cuda ops.) # (https://www.deepspeed.ai/tutorials/advanced-install/#pre-install-deepspeed-ops) @@ -20,30 +23,38 @@ RUN python3 -m pip install --no-cache-dir -U --pre torch torchvision torchaudio RUN python3 -m pip install --no-cache-dir ./transformers[deepspeed-testing] +RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/accelerate@main#egg=accelerate + +# Uninstall `transformer-engine` shipped with the base image +RUN python3 -m pip uninstall -y transformer-engine + +# Uninstall `torch-tensorrt` and `apex` shipped with the base image +RUN python3 -m pip uninstall -y torch-tensorrt apex + # Pre-build **nightly** release of DeepSpeed, so it would be ready for testing (otherwise, the 1st deepspeed test will timeout) RUN python3 -m pip uninstall -y deepspeed # This has to be run inside the GPU VMs running the tests. (So far, it fails here due to GPU checks during compilation.) # Issue: https://github.com/microsoft/DeepSpeed/issues/2010 # RUN git clone https://github.com/microsoft/DeepSpeed && cd DeepSpeed && rm -rf build && \ -# DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 python3 -m pip install . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1 - -# For `torchdynamo` tests -# (see https://github.com/huggingface/transformers/pull/17765) -RUN git clone https://github.com/pytorch/functorch -RUN python3 -m pip install --no-cache-dir ./functorch[aot] -RUN cd functorch && python3 setup.py develop - -RUN git clone https://github.com/pytorch/torchdynamo -RUN python3 -m pip install -r ./torchdynamo/requirements.txt -RUN cd torchdynamo && python3 setup.py develop - -# install TensorRT -RUN python3 -m pip install --no-cache-dir -U nvidia-pyindex -RUN python3 -m pip install --no-cache-dir -U nvidia-tensorrt==8.2.4.2 +# DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_UTILS=1 python3 -m pip install . 
--global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1 -# install torch_tensorrt (fx path) -RUN git clone https://github.com/pytorch/TensorRT.git -RUN cd TensorRT/py && python3 setup.py install --fx-only +## For `torchdynamo` tests +## (see https://github.com/huggingface/transformers/pull/17765) +#RUN git clone https://github.com/pytorch/functorch +#RUN python3 -m pip install --no-cache-dir ./functorch[aot] +#RUN cd functorch && python3 setup.py develop +# +#RUN git clone https://github.com/pytorch/torchdynamo +#RUN python3 -m pip install -r ./torchdynamo/requirements.txt +#RUN cd torchdynamo && python3 setup.py develop +# +## install TensorRT +#RUN python3 -m pip install --no-cache-dir -U nvidia-pyindex +#RUN python3 -m pip install --no-cache-dir -U nvidia-tensorrt==8.2.4.2 +# +## install torch_tensorrt (fx path) +#RUN git clone https://github.com/pytorch/TensorRT.git +#RUN cd TensorRT/py && python3 setup.py install --fx-only # When installing in editable mode, `transformers` is not recognized as a package. # this line must be added in order for python to be aware of transformers. diff --git a/docker/transformers-pytorch-gpu/Dockerfile b/docker/transformers-pytorch-gpu/Dockerfile index 689c18435bb4f7..a45210e7d1148c 100644 --- a/docker/transformers-pytorch-gpu/Dockerfile +++ b/docker/transformers-pytorch-gpu/Dockerfile @@ -1,4 +1,4 @@ -FROM nvidia/cuda:11.6.2-cudnn8-devel-ubuntu20.04 +FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu20.04 LABEL maintainer="Hugging Face" ARG DEBIAN_FRONTEND=noninteractive @@ -9,19 +9,20 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip ARG REF=main RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF -RUN python3 -m pip install --no-cache-dir -e ./transformers[dev-torch,testing,video] # If set to nothing, will install the latest version -ARG PYTORCH='1.13.1' +ARG PYTORCH='2.1.1' ARG TORCH_VISION='' ARG TORCH_AUDIO='' # Example: `cu102`, `cu113`, etc. -ARG CUDA='cu116' +ARG CUDA='cu121' RUN [ ${#PYTORCH} -gt 0 ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/$CUDA RUN [ ${#TORCH_VISION} -gt 0 ] && VERSION='torchvision=='TORCH_VISION'.*' || VERSION='torchvision'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/$CUDA RUN [ ${#TORCH_AUDIO} -gt 0 ] && VERSION='torchaudio=='TORCH_AUDIO'.*' || VERSION='torchaudio'; python3 -m pip install --no-cache-dir -U $VERSION --extra-index-url https://download.pytorch.org/whl/$CUDA +RUN python3 -m pip install --no-cache-dir -e ./transformers[dev-torch,testing,video] + RUN python3 -m pip uninstall -y tensorflow flax RUN python3 -m pip install --no-cache-dir git+https://github.com/facebookresearch/detectron2.git pytesseract diff --git a/docker/transformers-tensorflow-cpu/Dockerfile b/docker/transformers-tensorflow-cpu/Dockerfile deleted file mode 100644 index ef3dc3d212cbbc..00000000000000 --- a/docker/transformers-tensorflow-cpu/Dockerfile +++ /dev/null @@ -1,25 +0,0 @@ -FROM ubuntu:18.04 -LABEL maintainer="Hugging Face" -LABEL repository="transformers" - -RUN apt update && \ - apt install -y bash \ - build-essential \ - git \ - curl \ - ca-certificates \ - python3 \ - python3-pip && \ - rm -rf /var/lib/apt/lists - -RUN python3 -m pip install --no-cache-dir --upgrade pip && \ - python3 -m pip install --no-cache-dir \ - mkl \ - tensorflow-cpu - -WORKDIR /workspace -COPY . 
transformers/ -RUN cd transformers/ && \ - python3 -m pip install --no-cache-dir . - -CMD ["/bin/bash"] diff --git a/docker/transformers-tensorflow-gpu/Dockerfile b/docker/transformers-tensorflow-gpu/Dockerfile index 09e8512f2ce831..df9039a0c4d28e 100644 --- a/docker/transformers-tensorflow-gpu/Dockerfile +++ b/docker/transformers-tensorflow-gpu/Dockerfile @@ -1,4 +1,4 @@ -FROM nvidia/cuda:11.2.2-cudnn8-devel-ubuntu20.04 +FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04 LABEL maintainer="Hugging Face" ARG DEBIAN_FRONTEND=noninteractive @@ -12,7 +12,7 @@ RUN git clone https://github.com/huggingface/transformers && cd transformers && RUN python3 -m pip install --no-cache-dir -e ./transformers[dev-tensorflow,testing] # If set to nothing, will install the latest version -ARG TENSORFLOW='2.11' +ARG TENSORFLOW='2.13' RUN [ ${#TENSORFLOW} -gt 0 ] && VERSION='tensorflow=='$TENSORFLOW'.*' || VERSION='tensorflow'; python3 -m pip install --no-cache-dir -U $VERSION RUN python3 -m pip uninstall -y torch flax diff --git a/docs/README.md b/docs/README.md index 9aa74d4de94b2c..7dbcefc0483c66 100644 --- a/docs/README.md +++ b/docs/README.md @@ -81,10 +81,10 @@ The `preview` command only works with existing doc files. When you add a complet ## Adding a new element to the navigation bar -Accepted files are Markdown (.md or .mdx). +Accepted files are Markdown (.md). Create a file with its extension and put it in the source directory. You can then link it to the toc-tree by putting -the filename without the extension in the [`_toctree.yml`](https://github.com/huggingface/transformers/blob/main/docs/source/_toctree.yml) file. +the filename without the extension in the [`_toctree.yml`](https://github.com/huggingface/transformers/blob/main/docs/source/en/_toctree.yml) file. ## Renaming section headers and moving sections @@ -109,7 +109,7 @@ Sections that were moved: Use the relative style to link to the new file so that the versioned docs continue to work. -For an example of a rich moved section set please see the very end of [the Trainer doc](https://github.com/huggingface/transformers/blob/main/docs/source/en/main_classes/trainer.mdx). +For an example of a rich moved section set please see the very end of [the Trainer doc](https://github.com/huggingface/transformers/blob/main/docs/source/en/main_classes/trainer.md). ## Writing Documentation - Specification @@ -138,7 +138,7 @@ When translating, refer to the guide at [./TRANSLATING.md](https://github.com/hu When adding a new model: -- Create a file `xxx.mdx` or under `./source/model_doc` (don't hesitate to copy an existing file as template). +- Create a file `xxx.md` or under `./source/model_doc` (don't hesitate to copy an existing file as template). - Link that file in `./source/_toctree.yml`. - Write a short overview of the model: - Overview with paper & authors @@ -147,7 +147,7 @@ When adding a new model: - Add the classes that should be linked in the model. This generally includes the configuration, the tokenizer, and every model of that class (the base model, alongside models with additional heads), both in PyTorch and TensorFlow. The order is generally: - - Configuration, + - Configuration - Tokenizer - PyTorch base model - PyTorch head models @@ -202,7 +202,7 @@ provide its path. For instance: \[\`utils.ModelOutput\`\]. This will be converte `utils.ModelOutput` in the description. 
To get rid of the path and only keep the name of the object you are linking to in the description, add a ~: \[\`~utils.ModelOutput\`\] will generate a link with `ModelOutput` in the description. -The same works for methods so you can either use \[\`XXXClass.method\`\] or \[~\`XXXClass.method\`\]. +The same works for methods so you can either use \[\`XXXClass.method\`\] or \[\`~XXXClass.method\`\]. #### Defining arguments in a method @@ -250,7 +250,7 @@ then its documentation should look like this: Note that we always omit the "defaults to \`None\`" when None is the default for any argument. Also note that even if the first line describing your argument type and its default gets long, you can't break it on several lines. You can -however write as many lines as you want in the indented description (see the example above with `input_ids`). +however, write as many lines as you want in the indented description (see the example above with `input_ids`). #### Writing a multi-line code block @@ -364,25 +364,9 @@ We use pytests' [doctest integration](https://docs.pytest.org/doctest.html) to v For Transformers, the doctests are run on a daily basis via GitHub Actions as can be seen [here](https://github.com/huggingface/transformers/actions/workflows/doctests.yml). -To include your example in the daily doctests, you need to add the filename that -contains the example docstring to the [documentation_tests.txt](../utils/documentation_tests.txt). - ### For Python files -You will first need to run the following command (from the root of the repository) to prepare the doc file (doc-testing needs to add additional lines that we don't include in the doc source files): - -```bash -python utils/prepare_for_doc_test.py src docs -``` - -If you work on a specific python module, say `modeling_wav2vec2.py`, you can run the command as follows (to avoid the unnecessary temporary changes in irrelevant files): - -```bash -python utils/prepare_for_doc_test.py src/transformers/utils/doc.py src/transformers/models/wav2vec2/modeling_wav2vec2.py -``` -(`utils/doc.py` should always be included) - -Then you can run all the tests in the docstrings of a given file with the following command, here is how we test the modeling file of Wav2Vec2 for instance: +Run all the tests in the docstrings of a given file with the following command, here is how we test the modeling file of Wav2Vec2 for instance: ```bash pytest --doctest-modules src/transformers/models/wav2vec2/modeling_wav2vec2.py -sv --doctest-continue-on-failure @@ -394,30 +378,12 @@ If you want to isolate a specific docstring, just add `::` after the file name t pytest --doctest-modules src/transformers/models/wav2vec2/modeling_wav2vec2.py::transformers.models.wav2vec2.modeling_wav2vec2.Wav2Vec2ForCTC.forward -sv --doctest-continue-on-failure ``` -Once you're done, you can run the following command (still from the root of the repository) to undo the changes made by the first command before committing: - -```bash -python utils/prepare_for_doc_test.py src docs --remove_new_line -``` - ### For Markdown files -You will first need to run the following command (from the root of the repository) to prepare the doc file (doc-testing needs to add additional lines that we don't include in the doc source files): - -```bash -python utils/prepare_for_doc_test.py src docs -``` - -Then you can test locally a given file with this command (here testing the quicktour): - -```bash -pytest --doctest-modules docs/source/quicktour.mdx -sv --doctest-continue-on-failure --doctest-glob="*.mdx" -``` 
- -Once you're done, you can run the following command (still from the root of the repository) to undo the changes made by the first command before committing: +You can test locally a given file with this command (here testing the quicktour): ```bash -python utils/prepare_for_doc_test.py src docs --remove_new_line +pytest --doctest-modules docs/source/quicktour.md -sv --doctest-continue-on-failure --doctest-glob="*.md" ``` ### Writing doctests diff --git a/docs/TRANSLATING.md b/docs/TRANSLATING.md index c6f5c45baf0291..420e7a8b16a1c8 100644 --- a/docs/TRANSLATING.md +++ b/docs/TRANSLATING.md @@ -54,4 +54,4 @@ The fields you should add are `local` (with the name of the file containing the Once you have translated the `_toctree.yml` file, you can start translating the [MDX](https://mdxjs.com/) files associated with your docs chapter. -> 🙋 If you'd like others to help you with the translation, you should [open an issue](https://github.com/huggingface/transformers/issues) and tag @sgugger. +> 🙋 If you'd like others to help you with the translation, you should [open an issue](https://github.com/huggingface/transformers/issues) and tag @stevhliu and @MKhalusova. diff --git a/docs/source/_config.py b/docs/source/_config.py index 4a7a86cc23d807..d26d908aa29ea2 100644 --- a/docs/source/_config.py +++ b/docs/source/_config.py @@ -10,5 +10,5 @@ black_avoid_patterns = { "{processor_class}": "FakeProcessorClass", "{model_class}": "FakeModelClass", - "{object_class}": "FakeObjectClass", + "{object_class}": "FakeObjectClass", } diff --git a/docs/source/de/_toctree.yml b/docs/source/de/_toctree.yml index 8b15c2c53e7c7f..068beccdfe8578 100644 --- a/docs/source/de/_toctree.yml +++ b/docs/source/de/_toctree.yml @@ -15,8 +15,30 @@ title: Vorverarbeiten - local: training title: Optimierung eines vortrainierten Modells + - local: run_scripts + title: Trainieren mit einem Skript - local: accelerate title: Verteiltes Training mit 🤗 Accelerate + - local: peft + title: Laden und Trainieren von Adaptern mit 🤗 PEFT - local: model_sharing title: Ein Modell teilen + - local: transformers_agents + title: Agents + - local: llm_tutorial + title: Generation with LLMs title: Tutorials +- sections: + - local: contributing + title: Wie kann man zu 🤗 Transformers beitragen? + - local: add_new_model + title: Wie fügt man ein Modell zu 🤗 Transformers hinzu? + - local: add_tensorflow_model + title: Wie konvertiert man ein 🤗 Transformers-Modell in TensorFlow? + - local: add_new_pipeline + title: Wie fügt man eine Pipeline zu 🤗 Transformers hinzu? + - local: testing + title: Testen + - local: pr_checks + title: Überprüfung einer Pull Request + title: Contribute \ No newline at end of file diff --git a/docs/source/de/accelerate.mdx b/docs/source/de/accelerate.md similarity index 96% rename from docs/source/de/accelerate.mdx rename to docs/source/de/accelerate.md index 64f85f205f8afb..98a11cbdc41771 100644 --- a/docs/source/de/accelerate.mdx +++ b/docs/source/de/accelerate.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. 
+ --> # Verteiltes Training mit 🤗 Accelerate diff --git a/docs/source/de/add_new_model.md b/docs/source/de/add_new_model.md new file mode 100644 index 00000000000000..3f3317dd8b7e96 --- /dev/null +++ b/docs/source/de/add_new_model.md @@ -0,0 +1,895 @@ + + +# Wie kann ich ein Modell zu 🤗 Transformers hinzufügen? + +Die 🤗 Transformers-Bibliothek ist dank der Beiträge der Community oft in der Lage, neue Modelle anzubieten. Aber das kann ein anspruchsvolles Projekt sein und erfordert eine eingehende Kenntnis der 🤗 Transformers-Bibliothek und des zu implementierenden Modells. Bei Hugging Face versuchen wir, mehr Mitgliedern der Community die Möglichkeit zu geben, aktiv Modelle hinzuzufügen, und wir haben diese Anleitung zusammengestellt, die Sie durch den Prozess des Hinzufügens eines PyTorch-Modells führt (stellen Sie sicher, dass Sie [PyTorch installiert haben](https://pytorch.org/get-started/locally/)). + + + +Wenn Sie daran interessiert sind, ein TensorFlow-Modell zu implementieren, werfen Sie einen Blick in die Anleitung [How to convert a 🤗 Transformers model to TensorFlow](add_tensorflow_model)! + + + +Auf dem Weg dorthin, werden Sie: + +- Einblicke in bewährte Open-Source-Verfahren erhalten +- die Konstruktionsprinzipien hinter einer der beliebtesten Deep-Learning-Bibliotheken verstehen +- lernen Sie, wie Sie große Modelle effizient testen können +- lernen Sie, wie Sie Python-Hilfsprogramme wie `black`, `ruff` und `make fix-copies` integrieren, um sauberen und lesbaren Code zu gewährleisten + +Ein Mitglied des Hugging Face-Teams wird Ihnen dabei zur Seite stehen, damit Sie nicht alleine sind. 🤗 ❤️ + +Um loszulegen, öffnen Sie eine [New model addition](https://github.com/huggingface/transformers/issues/new?assignees=&labels=New+model&template=new-model-addition.yml) Ausgabe für das Modell, das Sie in 🤗 Transformers sehen möchten. Wenn Sie nicht besonders wählerisch sind, wenn es darum geht, ein bestimmtes Modell beizusteuern, können Sie nach dem [New model label](https://github.com/huggingface/transformers/labels/New%20model) filtern, um zu sehen, ob es noch unbeanspruchte Modellanfragen gibt, und daran arbeiten. + +Sobald Sie eine neue Modellanfrage eröffnet haben, sollten Sie sich zunächst mit 🤗 Transformers vertraut machen, falls Sie das noch nicht sind! + +## Allgemeiner Überblick über 🤗 Transformers + +Zunächst sollten Sie sich einen allgemeinen Überblick über 🤗 Transformers verschaffen. 🤗 Transformers ist eine sehr meinungsfreudige Bibliothek, es ist also möglich, dass +Es besteht also die Möglichkeit, dass Sie mit einigen der Philosophien oder Designentscheidungen der Bibliothek nicht einverstanden sind. Aus unserer Erfahrung heraus haben wir jedoch +dass die grundlegenden Designentscheidungen und Philosophien der Bibliothek entscheidend sind, um 🤗 Transformers effizient zu skalieren. +Transformatoren zu skalieren und gleichzeitig die Wartungskosten auf einem vernünftigen Niveau zu halten. + +Ein guter erster Ansatzpunkt, um die Bibliothek besser zu verstehen, ist die Lektüre der [Dokumentation unserer Philosophie](Philosophie). 
Als Ergebnis unserer Arbeitsweise gibt es einige Entscheidungen, die wir versuchen, auf alle Modelle anzuwenden: + +- Komposition wird im Allgemeinen gegenüber Abstraktion bevorzugt +- Die Duplizierung von Code ist nicht immer schlecht, wenn sie die Lesbarkeit oder Zugänglichkeit eines Modells stark verbessert +- Modelldateien sind so in sich geschlossen wie möglich, so dass Sie, wenn Sie den Code eines bestimmten Modells lesen, idealerweise nur + in die entsprechende Datei `modeling_....py` schauen müssen. + +Unserer Meinung nach ist der Code der Bibliothek nicht nur ein Mittel, um ein Produkt bereitzustellen, *z.B.* die Möglichkeit, BERT für Inferenz zu verwenden, sondern auch das Produkt selbst, das wir verbessern wollen. Wenn Sie ein Modell hinzufügen, ist der Benutzer also nicht nur die Person, die Ihr Modell verwenden wird, sondern auch jeder, der Ihren Code liest, zu verstehen versucht und ihn möglicherweise verbessert. + +Lassen Sie uns daher ein wenig tiefer in das allgemeine Design der Bibliothek einsteigen. + +### Überblick über die Modelle + +Um ein Modell erfolgreich hinzuzufügen, ist es wichtig, die Interaktion zwischen Ihrem Modell und seiner Konfiguration, [`PreTrainedModel`] und [`PretrainedConfig`] zu verstehen. Als Beispiel nennen wir das Modell, das zu 🤗 Transformers hinzugefügt werden soll, `BrandNewBert`. + +Schauen wir uns das mal an: + + + +Wie Sie sehen, machen wir in 🤗 Transformers von der Vererbung Gebrauch, aber wir beschränken die Abstraktionsebene auf ein absolutes Minimum. Es gibt nie mehr als zwei Abstraktionsebenen für ein Modell in der Bibliothek. `BrandNewBertModel` erbt von `BrandNewBertPreTrainedModel`, das wiederum von [`PreTrainedModel`] erbt, und das war's. In der Regel wollen wir sicherstellen, dass ein neues Modell nur von +[`PreTrainedModel`] abhängt. Die wichtigen Funktionalitäten, die jedem neuen Modell automatisch bereitgestellt werden, sind [`~PreTrainedModel.from_pretrained`] und +[`~PreTrainedModel.save_pretrained`], die für die Serialisierung und Deserialisierung verwendet werden. Alle anderen wichtigen Funktionalitäten, wie `BrandNewBertModel.forward`, sollten vollständig im neuen Skript `modeling_brand_new_bert.py` definiert werden. Als Nächstes wollen wir sicherstellen, dass ein Modell mit einer bestimmten Kopfebene, wie z.B. `BrandNewBertForMaskedLM`, nicht von `BrandNewBertModel` erbt, sondern `BrandNewBertModel` als Komponente verwendet, die im Forward Pass aufgerufen werden kann, um die Abstraktionsebene niedrig zu halten. Jedes neue Modell erfordert eine Konfigurationsklasse, genannt `BrandNewBertConfig`. Diese Konfiguration wird immer als Attribut in +[`PreTrainedModel`] gespeichert und kann daher über das Attribut `config` von allen Klassen aufgerufen werden, die von `BrandNewBertPreTrainedModel` erben: + +```python +model = BrandNewBertModel.from_pretrained("brandy/brand_new_bert") +model.config # model has access to its config +``` + +Ähnlich wie das Modell erbt die Konfiguration grundlegende Serialisierungs- und Deserialisierungsfunktionalitäten von +[`PretrainedConfig`]. Beachten Sie, dass die Konfiguration und das Modell immer in zwei unterschiedliche Formate serialisiert werden - das Modell in eine *pytorch_model.bin*-Datei und die Konfiguration in eine *config.json*-Datei.
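Eine minimale Skizze, die genau dieses Verhalten zeigt (hier mit einem bestehenden Modell wie BERT als Platzhalter für `BrandNewBert`; die Konfigurationswerte sind frei gewählt):

```python
from transformers import BertConfig, BertModel

# Winziges, zufällig initialisiertes Modell als Stellvertreter für BrandNewBert
config = BertConfig(num_hidden_layers=2, hidden_size=32, num_attention_heads=2, intermediate_size=64)
model = BertModel(config)

# Legt im Zielordner u.a. config.json und die Gewichtsdatei (pytorch_model.bin bzw. model.safetensors) ab
model.save_pretrained("./brand_new_bert_dummy")

# Beide Dateien zusammen reichen aus, um das Modell wieder zu laden
reloaded = BertModel.from_pretrained("./brand_new_bert_dummy")
print(reloaded.config.num_hidden_layers)  # 2
```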
Aufruf von +[`~PreTrainedModel.save_pretrained`] wird automatisch +[`~PretrainedConfig.save_pretrained`] auf, so dass sowohl das Modell als auch die Konfiguration gespeichert werden. + + +### Code-Stil + +Wenn Sie Ihr neues Modell kodieren, sollten Sie daran denken, dass Transformers eine Bibliothek mit vielen Meinungen ist und dass wir selbst ein paar Macken haben +wie der Code geschrieben werden sollte :-) + +1. Der Vorwärtsdurchlauf Ihres Modells sollte vollständig in die Modellierungsdatei geschrieben werden und dabei völlig unabhängig von anderen + Modellen in der Bibliothek. Wenn Sie einen Block aus einem anderen Modell wiederverwenden möchten, kopieren Sie den Code und fügen ihn mit einem + `# Kopiert von` ein (siehe [hier](https://github.com/huggingface/transformers/blob/v4.17.0/src/transformers/models/roberta/modeling_roberta.py#L160) + für ein gutes Beispiel und [hier](pr_checks#check-copies) für weitere Dokumentation zu Copied from). +2. Der Code sollte vollständig verständlich sein, auch für einen Nicht-Muttersprachler. Das heißt, Sie sollten + beschreibende Variablennamen wählen und Abkürzungen vermeiden. Ein Beispiel: `activation` ist `act` vorzuziehen. + Von Variablennamen mit nur einem Buchstaben wird dringend abgeraten, es sei denn, es handelt sich um einen Index in einer for-Schleife. +3. Generell ziehen wir längeren expliziten Code einem kurzen magischen Code vor. +4. Vermeiden Sie die Unterklassifizierung von `nn.Sequential` in PyTorch, sondern unterklassifizieren Sie `nn.Module` und schreiben Sie den Vorwärtspass, so dass jeder + so dass jeder, der Ihren Code verwendet, ihn schnell debuggen kann, indem er Druckanweisungen oder Haltepunkte hinzufügt. +5. Ihre Funktionssignatur sollte mit einer Typ-Annotation versehen sein. Im Übrigen sind gute Variablennamen viel lesbarer und verständlicher + verständlicher als Typ-Anmerkungen. + +### Übersicht der Tokenizer + +Noch nicht ganz fertig :-( Dieser Abschnitt wird bald hinzugefügt! + +## Schritt-für-Schritt-Rezept zum Hinzufügen eines Modells zu 🤗 Transformers + +Jeder hat andere Vorlieben, was die Portierung eines Modells angeht. Daher kann es sehr hilfreich sein, wenn Sie sich Zusammenfassungen ansehen +wie andere Mitwirkende Modelle auf Hugging Face portiert haben. Hier ist eine Liste von Blogbeiträgen aus der Community, wie man ein Modell portiert: + +1. [Portierung eines GPT2-Modells](https://medium.com/huggingface/from-tensorflow-to-pytorch-265f40ef2a28) von [Thomas](https://huggingface.co/thomwolf) +2. [Portierung des WMT19 MT-Modells](https://huggingface.co/blog/porting-fsmt) von [Stas](https://huggingface.co/stas) + +Aus Erfahrung können wir Ihnen sagen, dass die wichtigsten Dinge, die Sie beim Hinzufügen eines Modells beachten müssen, sind: + +- Erfinden Sie das Rad nicht neu! Die meisten Teile des Codes, den Sie für das neue 🤗 Transformers-Modell hinzufügen werden, existieren bereits + irgendwo in 🤗 Transformers. Nehmen Sie sich etwas Zeit, um ähnliche, bereits vorhandene Modelle und Tokenizer zu finden, die Sie kopieren können + von. [grep](https://www.gnu.org/software/grep/) und [rg](https://github.com/BurntSushi/ripgrep) sind Ihre + Freunde. Beachten Sie, dass es sehr gut möglich ist, dass der Tokenizer Ihres Modells auf einer Modellimplementierung basiert und + und der Modellierungscode Ihres Modells auf einer anderen. *Z.B.* Der Modellierungscode von FSMT basiert auf BART, während der Tokenizer-Code von FSMT + auf XLM basiert. 
+- Es handelt sich eher um eine technische als um eine wissenschaftliche Herausforderung. Sie sollten mehr Zeit darauf verwenden, eine effiziente Debugging-Umgebung zu schaffen, als zu versuchen, alle theoretischen Aspekte des Modells aus dem Papier zu verstehen. +- Bitten Sie um Hilfe, wenn Sie nicht weiterkommen! Modelle sind der Kernbestandteil von 🤗 Transformers, so dass wir bei Hugging Face mehr als glücklich sind, Ihnen bei jedem Schritt zu helfen, Ihr Modell hinzuzufügen. Zögern Sie nicht zu fragen, wenn Sie merken, dass Sie keine Fortschritte machen. + +Im Folgenden versuchen wir, Ihnen ein allgemeines Rezept an die Hand zu geben, das sich bei der Portierung eines Modells auf 🤗 Transformers für uns als am nützlichsten erwiesen hat. + +Die folgende Liste ist eine Zusammenfassung all dessen, was getan werden muss, um ein Modell hinzuzufügen, und kann von Ihnen als To-Do-Liste verwendet werden: + +☐ (Optional) Verstehen der theoretischen Aspekte des Modells
+☐ Vorbereiten der 🤗 Transformers-Entwicklungsumgebung
+☐ Debugging-Umgebung des ursprünglichen Repositorys eingerichtet
+☐ Skript erstellt, das den Durchlauf `forward()` unter Verwendung des ursprünglichen Repositorys und des Checkpoints erfolgreich durchführt
+☐ Erfolgreich das Modellskelett zu 🤗 Transformers hinzugefügt
+☐ Erfolgreiche Umwandlung des ursprünglichen Prüfpunkts in den 🤗 Transformers-Prüfpunkt
+☐ Erfolgreich den Durchlauf `forward()` in 🤗 Transformers ausgeführt, der eine identische Ausgabe wie der ursprüngliche Prüfpunkt liefert
+☐ Modell-Tests in 🤗 Transformers abgeschlossen
+☐ Erfolgreich Tokenizer in 🤗 Transformers hinzugefügt
+☐ End-to-End-Integrationstests ausgeführt
+☐ Docs fertiggestellt
+☐ Modellgewichte in den Hub hochgeladen
+☐ Die Pull-Anfrage eingereicht
+☐ (Optional) Hinzufügen eines Demo-Notizbuchs + +Für den Anfang empfehlen wir in der Regel, mit einem guten theoretischen Verständnis von `BrandNewBert` zu beginnen. Wie auch immer, +wenn Sie es vorziehen, die theoretischen Aspekte des Modells *on-the-job* zu verstehen, dann ist es völlig in Ordnung, direkt in die +in die Code-Basis von `BrandNewBert` einzutauchen. Diese Option könnte für Sie besser geeignet sein, wenn Ihre technischen Fähigkeiten besser sind als +als Ihre theoretischen Fähigkeiten, wenn Sie Schwierigkeiten haben, die Arbeit von `BrandNewBert` zu verstehen, oder wenn Sie einfach Spaß am Programmieren +mehr Spaß am Programmieren haben als am Lesen wissenschaftlicher Abhandlungen. + +### 1. (Optional) Theoretische Aspekte von BrandNewBert + +Sie sollten sich etwas Zeit nehmen, um die Abhandlung von *BrandNewBert* zu lesen, falls eine solche Beschreibung existiert. Möglicherweise gibt es große +Abschnitte des Papiers, die schwer zu verstehen sind. Wenn das der Fall ist, ist das in Ordnung - machen Sie sich keine Sorgen! Das Ziel ist +ist es nicht, ein tiefes theoretisches Verständnis des Papiers zu erlangen, sondern die notwendigen Informationen zu extrahieren, um +das Modell effektiv in 🤗 Transformers zu implementieren. Das heißt, Sie müssen nicht zu viel Zeit auf die +theoretischen Aspekten verbringen, sondern sich lieber auf die praktischen Aspekte konzentrieren, nämlich: + +- Welche Art von Modell ist *brand_new_bert*? BERT-ähnliches Modell nur für den Encoder? GPT2-ähnliches reines Decoder-Modell? BART-ähnliches + Encoder-Decoder-Modell? Sehen Sie sich die [model_summary](model_summary) an, wenn Sie mit den Unterschieden zwischen diesen Modellen nicht vertraut sind. +- Was sind die Anwendungen von *brand_new_bert*? Textklassifizierung? Texterzeugung? Seq2Seq-Aufgaben, *z.B.,* + Zusammenfassungen? +- Was ist die neue Eigenschaft des Modells, die es von BERT/GPT-2/BART unterscheidet? +- Welches der bereits existierenden [🤗 Transformers-Modelle](https://huggingface.co/transformers/#contents) ist am ähnlichsten + ähnlich wie *brand_new_bert*? +- Welche Art von Tokenizer wird verwendet? Ein Satzteil-Tokenisierer? Ein Wortstück-Tokenisierer? Ist es derselbe Tokenisierer, der für + für BERT oder BART? + +Nachdem Sie das Gefühl haben, einen guten Überblick über die Architektur des Modells erhalten zu haben, können Sie dem +Hugging Face Team schreiben und Ihre Fragen stellen. Dazu können Fragen zur Architektur des Modells gehören, +seiner Aufmerksamkeitsebene usw. Wir werden Ihnen gerne weiterhelfen. + +### 2. Bereiten Sie als nächstes Ihre Umgebung vor + +1. Forken Sie das [Repository](https://github.com/huggingface/transformers), indem Sie auf der Seite des Repositorys auf die Schaltfläche 'Fork' klicken. + Seite des Repositorys klicken. Dadurch wird eine Kopie des Codes unter Ihrem GitHub-Benutzerkonto erstellt. + +2. Klonen Sie Ihren `transformers` Fork auf Ihre lokale Festplatte und fügen Sie das Basis-Repository als Remote hinzu: + +```bash +git clone https://github.com/[your Github handle]/transformers.git +cd transformers +git remote add upstream https://github.com/huggingface/transformers.git +``` + +3. Richten Sie eine Entwicklungsumgebung ein, indem Sie z.B. den folgenden Befehl ausführen: + +```bash +python -m venv .env +source .env/bin/activate +pip install -e ".[dev]" +``` + +Abhängig von Ihrem Betriebssystem und da die Anzahl der optionalen Abhängigkeiten von Transformers wächst, kann es sein, dass Sie bei diesem Befehl einen +Fehler mit diesem Befehl. 
Stellen Sie in diesem Fall sicher, dass Sie das Deep Learning Framework, mit dem Sie arbeiten, installieren +(PyTorch, TensorFlow und/oder Flax) und führen Sie es aus: + +```bash +pip install -e ".[quality]" +``` + +was für die meisten Anwendungsfälle ausreichend sein sollte. Sie können dann zum übergeordneten Verzeichnis zurückkehren + +```bash +cd .. +``` + +4. Wir empfehlen, die PyTorch-Version von *brand_new_bert* zu Transformers hinzuzufügen. Um PyTorch zu installieren, folgen Sie bitte den + Anweisungen auf https://pytorch.org/get-started/locally/. + +**Anmerkung:** Sie müssen CUDA nicht installiert haben. Es reicht aus, das neue Modell auf der CPU zum Laufen zu bringen. + +5. Um *brand_new_bert* zu portieren, benötigen Sie außerdem Zugriff auf das Original-Repository: + +```bash +git clone https://github.com/org_that_created_brand_new_bert_org/brand_new_bert.git +cd brand_new_bert +pip install -e . +``` + +Jetzt haben Sie eine Entwicklungsumgebung eingerichtet, um *brand_new_bert* auf 🤗 Transformers zu portieren. + +### 3.-4. Führen Sie einen Pre-Training-Checkpoint mit dem Original-Repository durch + +Zunächst werden Sie mit dem ursprünglichen *brand_new_bert* Repository arbeiten. Oft ist die ursprüngliche Implementierung sehr +"forschungslastig". Das bedeutet, dass es an Dokumentation mangeln kann und der Code schwer zu verstehen sein kann. Aber das sollte +genau Ihre Motivation sein, *brand_new_bert* neu zu implementieren. Eines unserer Hauptziele bei Hugging Face ist es, *die Menschen dazu zu bringen +auf den Schultern von Giganten zu stehen*, was sich hier sehr gut darin ausdrückt, dass wir ein funktionierendes Modell nehmen und es umschreiben, um es so +es so **zugänglich, benutzerfreundlich und schön** wie möglich zu machen. Dies ist die wichtigste Motivation für die Neuimplementierung von +Modelle in 🤗 Transformers umzuwandeln - der Versuch, komplexe neue NLP-Technologie für **jeden** zugänglich zu machen. + +Sie sollten damit beginnen, indem Sie in das Original-Repository eintauchen. + +Die erfolgreiche Ausführung des offiziellen Pre-Trainingsmodells im Original-Repository ist oft **der schwierigste** Schritt. +Unserer Erfahrung nach ist es sehr wichtig, dass Sie einige Zeit damit verbringen, sich mit der ursprünglichen Code-Basis vertraut zu machen. Sie müssen +das Folgende herausfinden: + +- Wo finden Sie die vortrainierten Gewichte? +- Wie lädt man die vorab trainierten Gewichte in das entsprechende Modell? +- Wie kann der Tokenizer unabhängig vom Modell ausgeführt werden? +- Verfolgen Sie einen Forward Pass, damit Sie wissen, welche Klassen und Funktionen für einen einfachen Forward Pass erforderlich sind. Normalerweise, + müssen Sie nur diese Funktionen reimplementieren. +- Sie müssen in der Lage sein, die wichtigen Komponenten des Modells zu finden: Wo befindet sich die Klasse des Modells? Gibt es Unterklassen des Modells, + *z.B.* EncoderModel, DecoderModel? Wo befindet sich die Selbstaufmerksamkeitsschicht? Gibt es mehrere verschiedene Aufmerksamkeitsebenen, + *z.B.* *Selbstaufmerksamkeit*, *Kreuzaufmerksamkeit*...? +- Wie können Sie das Modell in der ursprünglichen Umgebung des Repo debuggen? Müssen Sie *print* Anweisungen hinzufügen, können Sie + mit einem interaktiven Debugger wie *ipdb* arbeiten oder sollten Sie eine effiziente IDE zum Debuggen des Modells verwenden, wie z.B. PyCharm? + +Es ist sehr wichtig, dass Sie, bevor Sie mit der Portierung beginnen, den Code im Original-Repository **effizient** debuggen können +Repository können! 
Denken Sie auch daran, dass Sie mit einer Open-Source-Bibliothek arbeiten, also zögern Sie nicht, ein Problem oder +oder sogar eine Pull-Anfrage im Original-Repository zu stellen. Die Betreuer dieses Repositorys sind wahrscheinlich sehr froh darüber +dass jemand in ihren Code schaut! + +An diesem Punkt liegt es wirklich an Ihnen, welche Debugging-Umgebung und Strategie Sie zum Debuggen des ursprünglichen +Modell zu debuggen. Wir raten dringend davon ab, eine kostspielige GPU-Umgebung einzurichten, sondern arbeiten Sie einfach auf einer CPU, sowohl wenn Sie mit dem +in das ursprüngliche Repository einzutauchen und auch, wenn Sie beginnen, die 🤗 Transformers-Implementierung des Modells zu schreiben. Nur +ganz am Ende, wenn das Modell bereits erfolgreich auf 🤗 Transformers portiert wurde, sollte man überprüfen, ob das +Modell auch auf der GPU wie erwartet funktioniert. + +Im Allgemeinen gibt es zwei mögliche Debugging-Umgebungen für die Ausführung des Originalmodells + +- [Jupyter notebooks](https://jupyter.org/) / [google colab](https://colab.research.google.com/notebooks/intro.ipynb) +- Lokale Python-Skripte. + +Jupyter-Notebooks haben den Vorteil, dass sie eine zellenweise Ausführung ermöglichen, was hilfreich sein kann, um logische Komponenten besser voneinander zu trennen und +logische Komponenten voneinander zu trennen und schnellere Debugging-Zyklen zu haben, da Zwischenergebnisse gespeichert werden können. Außerdem, +Außerdem lassen sich Notebooks oft leichter mit anderen Mitwirkenden teilen, was sehr hilfreich sein kann, wenn Sie das Hugging Face Team um Hilfe bitten möchten. +Face Team um Hilfe bitten. Wenn Sie mit Jupyter-Notizbüchern vertraut sind, empfehlen wir Ihnen dringend, mit ihnen zu arbeiten. + +Der offensichtliche Nachteil von Jupyter-Notizbüchern ist, dass Sie, wenn Sie nicht daran gewöhnt sind, mit ihnen zu arbeiten, einige Zeit damit verbringen müssen +einige Zeit damit verbringen müssen, sich an die neue Programmierumgebung zu gewöhnen, und dass Sie möglicherweise Ihre bekannten Debugging-Tools nicht mehr verwenden können +wie z.B. `ipdb` nicht mehr verwenden können. + +Für jede Codebasis ist es immer ein guter erster Schritt, einen **kleinen** vortrainierten Checkpoint zu laden und in der Lage zu sein, einen +einzelnen Vorwärtsdurchlauf mit einem Dummy-Integer-Vektor von Eingabe-IDs als Eingabe zu reproduzieren. Ein solches Skript könnte wie folgt aussehen (in +Pseudocode): + +```python +model = BrandNewBertModel.load_pretrained_checkpoint("/path/to/checkpoint/") +input_ids = [0, 4, 5, 2, 3, 7, 9] # vector of input ids +original_output = model.predict(input_ids) +``` + +Was die Debugging-Strategie anbelangt, so können Sie im Allgemeinen aus mehreren Strategien wählen: + +- Zerlegen Sie das ursprüngliche Modell in viele kleine testbare Komponenten und führen Sie für jede dieser Komponenten einen Vorwärtsdurchlauf zur + Überprüfung +- Zerlegen Sie das ursprüngliche Modell nur in den ursprünglichen *Tokenizer* und das ursprüngliche *Modell*, führen Sie einen Vorwärtsdurchlauf für diese Komponenten durch + und verwenden Sie dazwischenliegende Druckanweisungen oder Haltepunkte zur Überprüfung. + +Auch hier bleibt es Ihnen überlassen, welche Strategie Sie wählen. Oft ist die eine oder die andere Strategie vorteilhaft, je nach der ursprünglichen Codebasis +Basis. 
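+
+Unabhängig davon, für welche der beiden Strategien Sie sich entscheiden, läuft die Überprüfung am Ende meist darauf hinaus, Zwischenausgaben
+beider Implementierungen numerisch zu vergleichen. Die folgende kleine Hilfsfunktion ist nur eine Skizze dafür; die Namen der übergebenen
+Zwischenausgaben sind frei gewählt, und als Toleranz wird die weiter unten beschriebene Genauigkeit von 1e-3 verwendet:
+
+```python
+import torch
+
+
+def compare_intermediate_outputs(original_output, converted_output, atol=1e-3):
+    """Compare an intermediate output of the original model with the 🤗 Transformers version."""
+    original = torch.as_tensor(original_output, dtype=torch.float32)
+    converted = torch.as_tensor(converted_output, dtype=torch.float32)
+    if original.shape != converted.shape:
+        raise ValueError(f"Shape mismatch: {original.shape} vs. {converted.shape}")
+    # print the largest absolute difference to see how far apart the two implementations are
+    max_diff = (original - converted).abs().max().item()
+    print(f"Max absolute difference: {max_diff:.2e}")
+    return torch.allclose(original, converted, atol=atol)
+```
+
+Eine solche Funktion lässt sich sowohl für den komponentenweisen Vergleich als auch später für den Vergleich der finalen Modellausgaben
+wiederverwenden.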
+ +Wenn die ursprüngliche Codebasis es Ihnen erlaubt, das Modell in kleinere Teilkomponenten zu zerlegen, *z.B.* wenn die ursprüngliche +Code-Basis problemlos im Eager-Modus ausgeführt werden kann, lohnt es sich in der Regel, dies zu tun. Es gibt einige wichtige Vorteile +am Anfang den schwierigeren Weg zu gehen: + +- Wenn Sie später das ursprüngliche Modell mit der Hugging Face-Implementierung vergleichen, können Sie automatisch überprüfen, ob + für jede Komponente einzeln überprüfen, ob die entsprechende Komponente der 🤗 Transformers-Implementierung übereinstimmt, anstatt sich auf + anstatt sich auf den visuellen Vergleich über Druckanweisungen zu verlassen +- können Sie das große Problem der Portierung eines Modells in kleinere Probleme der Portierung einzelner Komponenten zerlegen + einzelnen Komponenten zu zerlegen und so Ihre Arbeit besser zu strukturieren +- Die Aufteilung des Modells in logisch sinnvolle Komponenten hilft Ihnen, einen besseren Überblick über das Design des Modells zu bekommen + und somit das Modell besser zu verstehen +- In einem späteren Stadium helfen Ihnen diese komponentenweisen Tests dabei, sicherzustellen, dass keine Regressionen auftreten, während Sie fortfahren + Ihren Code ändern + +[Lysandre's](https://gist.github.com/LysandreJik/db4c948f6b4483960de5cbac598ad4ed) Integrationstests für ELECTRA +gibt ein schönes Beispiel dafür, wie dies geschehen kann. + +Wenn die ursprüngliche Codebasis jedoch sehr komplex ist oder nur die Ausführung von Zwischenkomponenten in einem kompilierten Modus erlaubt, +könnte es zu zeitaufwändig oder sogar unmöglich sein, das Modell in kleinere testbare Teilkomponenten zu zerlegen. Ein gutes +Beispiel ist die [T5's MeshTensorFlow](https://github.com/tensorflow/mesh/tree/master/mesh_tensorflow) Bibliothek, die sehr komplex ist +sehr komplex ist und keine einfache Möglichkeit bietet, das Modell in seine Unterkomponenten zu zerlegen. Bei solchen Bibliotheken ist man +oft auf die Überprüfung von Druckanweisungen angewiesen. + +Unabhängig davon, welche Strategie Sie wählen, ist die empfohlene Vorgehensweise oft die gleiche, nämlich dass Sie mit der Fehlersuche in den +die Anfangsebenen zuerst und die Endebenen zuletzt debuggen. + +Es wird empfohlen, dass Sie die Ausgaben der folgenden Ebenen abrufen, entweder durch Druckanweisungen oder Unterkomponentenfunktionen +Schichten in der folgenden Reihenfolge abrufen: + +1. Rufen Sie die Eingabe-IDs ab, die an das Modell übergeben wurden +2. Rufen Sie die Worteinbettungen ab +3. Rufen Sie die Eingabe der ersten Transformer-Schicht ab +4. Rufen Sie die Ausgabe der ersten Transformer-Schicht ab +5. Rufen Sie die Ausgabe der folgenden n - 1 Transformer-Schichten ab +6. 
Rufen Sie die Ausgabe des gesamten BrandNewBert Modells ab + +Die Eingabe-IDs sollten dabei aus einem Array von Ganzzahlen bestehen, *z.B.* `input_ids = [0, 4, 4, 3, 2, 4, 1, 7, 19]` + +Die Ausgaben der folgenden Schichten bestehen oft aus mehrdimensionalen Float-Arrays und können wie folgt aussehen: + +``` +[[ + [-0.1465, -0.6501, 0.1993, ..., 0.1451, 0.3430, 0.6024], + [-0.4417, -0.5920, 0.3450, ..., -0.3062, 0.6182, 0.7132], + [-0.5009, -0.7122, 0.4548, ..., -0.3662, 0.6091, 0.7648], + ..., + [-0.5613, -0.6332, 0.4324, ..., -0.3792, 0.7372, 0.9288], + [-0.5416, -0.6345, 0.4180, ..., -0.3564, 0.6992, 0.9191], + [-0.5334, -0.6403, 0.4271, ..., -0.3339, 0.6533, 0.8694]]], +``` + +Wir erwarten, dass jedes zu 🤗 Transformers hinzugefügte Modell eine Reihe von Integrationstests besteht, was bedeutet, dass das ursprüngliche +Modell und die neu implementierte Version in 🤗 Transformers exakt dieselbe Ausgabe liefern müssen, und zwar mit einer Genauigkeit von 0,001! +Da es normal ist, dass das exakt gleiche Modell, das in verschiedenen Bibliotheken geschrieben wurde, je nach Bibliotheksrahmen eine leicht unterschiedliche Ausgabe liefern kann +eine leicht unterschiedliche Ausgabe liefern kann, akzeptieren wir eine Fehlertoleranz von 1e-3 (0,001). Es reicht nicht aus, wenn das Modell +fast das gleiche Ergebnis liefert, sie müssen fast identisch sein. Daher werden Sie sicherlich die Zwischenergebnisse +Zwischenergebnisse der 🤗 Transformers-Version mehrfach mit den Zwischenergebnissen der ursprünglichen Implementierung von +*brand_new_bert* vergleichen. In diesem Fall ist eine **effiziente** Debugging-Umgebung des ursprünglichen Repositorys absolut +wichtig ist. Hier sind einige Ratschläge, um Ihre Debugging-Umgebung so effizient wie möglich zu gestalten. + +- Finden Sie den besten Weg, um Zwischenergebnisse zu debuggen. Ist das ursprüngliche Repository in PyTorch geschrieben? Dann sollten Sie + dann sollten Sie sich wahrscheinlich die Zeit nehmen, ein längeres Skript zu schreiben, das das ursprüngliche Modell in kleinere Unterkomponenten zerlegt, um + Zwischenwerte abzurufen. Ist das ursprüngliche Repository in Tensorflow 1 geschrieben? Dann müssen Sie sich möglicherweise auf die + TensorFlow Druckoperationen wie [tf.print](https://www.tensorflow.org/api_docs/python/tf/print) verlassen, um die + Zwischenwerte auszugeben. Ist das ursprüngliche Repository in Jax geschrieben? Dann stellen Sie sicher, dass das Modell **nicht jitted** ist, wenn + wenn Sie den Vorwärtsdurchlauf ausführen, *z.B.* schauen Sie sich [dieser Link](https://github.com/google/jax/issues/196) an. +- Verwenden Sie den kleinsten vortrainierten Prüfpunkt, den Sie finden können. Je kleiner der Prüfpunkt ist, desto schneller wird Ihr Debugging-Zyklus + wird. Es ist nicht effizient, wenn Ihr vorab trainiertes Modell so groß ist, dass Ihr Vorwärtsdurchlauf mehr als 10 Sekunden dauert. + Falls nur sehr große Checkpoints verfügbar sind, kann es sinnvoller sein, ein Dummy-Modell in der neuen + Umgebung mit zufällig initialisierten Gewichten zu erstellen und diese Gewichte zum Vergleich mit der 🤗 Transformers-Version + Ihres Modells +- Vergewissern Sie sich, dass Sie den einfachsten Weg wählen, um einen Forward Pass im ursprünglichen Repository aufzurufen. Idealerweise sollten Sie + die Funktion im originalen Repository finden, die **nur** einen einzigen Vorwärtspass aufruft, *d.h.* die oft aufgerufen wird + Vorhersagen", "Auswerten", "Vorwärts" oder "Aufruf" genannt wird. 
Sie wollen keine Funktion debuggen, die `forward` aufruft + mehrfach aufruft, *z.B.* um Text zu erzeugen, wie `autoregressive_sample`, `generate`. +- Versuchen Sie, die Tokenisierung vom *Forward*-Pass des Modells zu trennen. Wenn das Original-Repository Beispiele zeigt, bei denen + Sie eine Zeichenkette eingeben müssen, dann versuchen Sie herauszufinden, an welcher Stelle im Vorwärtsaufruf die Zeichenketteneingabe in Eingabe-IDs geändert wird + geändert wird und beginnen Sie an dieser Stelle. Das könnte bedeuten, dass Sie möglicherweise selbst ein kleines Skript schreiben oder den + Originalcode so ändern müssen, dass Sie die ids direkt eingeben können, anstatt eine Zeichenkette einzugeben. +- Vergewissern Sie sich, dass sich das Modell in Ihrem Debugging-Setup **nicht** im Trainingsmodus befindet, der oft dazu führt, dass das Modell + Dies führt häufig zu zufälligen Ergebnissen, da das Modell mehrere Dropout-Schichten enthält. Stellen Sie sicher, dass der Vorwärtsdurchlauf in Ihrer Debugging + Umgebung **deterministisch** ist, damit die Dropout-Schichten nicht verwendet werden. Oder verwenden Sie *transformers.utils.set_seed*. + wenn sich die alte und die neue Implementierung im selben Framework befinden. + +Im folgenden Abschnitt finden Sie genauere Details/Tipps, wie Sie dies für *brand_new_bert* tun können. + +### 5.-14. Portierung von BrandNewBert auf 🤗 Transformatoren + +Als nächstes können Sie endlich damit beginnen, neuen Code zu 🤗 Transformers hinzuzufügen. Gehen Sie in den Klon Ihres 🤗 Transformers Forks: + +```bash +cd transformers +``` + +In dem speziellen Fall, dass Sie ein Modell hinzufügen, dessen Architektur genau mit der Modellarchitektur eines +Modells übereinstimmt, müssen Sie nur ein Konvertierungsskript hinzufügen, wie in [diesem Abschnitt](#write-a-conversion-script) beschrieben. +In diesem Fall können Sie einfach die gesamte Modellarchitektur des bereits vorhandenen Modells wiederverwenden. + +Andernfalls beginnen wir mit der Erstellung eines neuen Modells. Sie haben hier zwei Möglichkeiten: + +- `transformers-cli add-new-model-like`, um ein neues Modell wie ein bestehendes hinzuzufügen +- `transformers-cli add-new-model`, um ein neues Modell aus unserer Vorlage hinzuzufügen (sieht dann aus wie BERT oder Bart, je nachdem, welche Art von Modell Sie wählen) + +In beiden Fällen werden Sie mit einem Fragebogen aufgefordert, die grundlegenden Informationen zu Ihrem Modell auszufüllen. Für den zweiten Befehl müssen Sie `cookiecutter` installieren, weitere Informationen dazu finden Sie [hier](https://github.com/huggingface/transformers/tree/main/templates/adding_a_new_model). + +**Eröffnen Sie einen Pull Request auf dem Haupt-Repositorium huggingface/transformers** + +Bevor Sie mit der Anpassung des automatisch generierten Codes beginnen, ist es nun an der Zeit, einen "Work in progress (WIP)" Pull +Anfrage, *z.B.* "[WIP] Add *brand_new_bert*", in 🤗 Transformers zu öffnen, damit Sie und das Hugging Face Team +Seite an Seite an der Integration des Modells in 🤗 Transformers arbeiten können. + +Sie sollten Folgendes tun: + +1. Erstellen Sie eine Verzweigung mit einem beschreibenden Namen von Ihrer Hauptverzweigung + +```bash +git checkout -b add_brand_new_bert +``` + +2. Bestätigen Sie den automatisch generierten Code: + +```bash +git add . +git commit +``` + +3. Abrufen und zurücksetzen auf die aktuelle Haupt + +```bash +git fetch upstream +git rebase upstream/main +``` + +4. 
Übertragen Sie die Änderungen auf Ihr Konto mit: + +```bash +git push -u origin a-descriptive-name-for-my-changes +``` + +5. Wenn Sie zufrieden sind, gehen Sie auf die Webseite Ihrer Abspaltung auf GitHub. Klicken Sie auf "Pull request". Stellen Sie sicher, dass Sie das + GitHub-Handle einiger Mitglieder des Hugging Face-Teams als Reviewer hinzuzufügen, damit das Hugging Face-Team über zukünftige Änderungen informiert wird. + zukünftige Änderungen benachrichtigt wird. + +6. Ändern Sie den PR in einen Entwurf, indem Sie auf der rechten Seite der GitHub-Pull-Request-Webseite auf "In Entwurf umwandeln" klicken. + +Vergessen Sie im Folgenden nicht, wenn Sie Fortschritte gemacht haben, Ihre Arbeit zu committen und in Ihr Konto zu pushen, damit sie in der Pull-Anfrage erscheint. +damit sie in der Pull-Anfrage angezeigt wird. Außerdem sollten Sie darauf achten, dass Sie Ihre Arbeit von Zeit zu Zeit mit dem aktuellen main +von Zeit zu Zeit zu aktualisieren, indem Sie dies tun: + +```bash +git fetch upstream +git merge upstream/main +``` + +Generell sollten Sie alle Fragen, die Sie in Bezug auf das Modell oder Ihre Implementierung haben, in Ihrem PR stellen und +in der PR diskutiert/gelöst werden. Auf diese Weise wird das Hugging Face Team immer benachrichtigt, wenn Sie neuen Code einreichen oder +wenn Sie eine Frage haben. Es ist oft sehr hilfreich, das Hugging Face-Team auf Ihren hinzugefügten Code hinzuweisen, damit das Hugging Face-Team Ihr Problem oder Ihre Frage besser verstehen kann. +Face-Team Ihr Problem oder Ihre Frage besser verstehen kann. + +Gehen Sie dazu auf die Registerkarte "Geänderte Dateien", auf der Sie alle Ihre Änderungen sehen, gehen Sie zu einer Zeile, zu der Sie eine Frage stellen möchten +eine Frage stellen möchten, und klicken Sie auf das "+"-Symbol, um einen Kommentar hinzuzufügen. Wenn eine Frage oder ein Problem gelöst wurde, +können Sie auf die Schaltfläche "Lösen" des erstellten Kommentars klicken. + +Auf dieselbe Weise wird das Hugging Face-Team Kommentare öffnen, wenn es Ihren Code überprüft. Wir empfehlen, die meisten Fragen +auf GitHub in Ihrem PR zu stellen. Für einige sehr allgemeine Fragen, die für die Öffentlichkeit nicht sehr nützlich sind, können Sie das +Hugging Face Team per Slack oder E-Mail zu stellen. + +**5. Passen Sie den Code der generierten Modelle für brand_new_bert** an. + +Zunächst werden wir uns nur auf das Modell selbst konzentrieren und uns nicht um den Tokenizer kümmern. Den gesamten relevanten Code sollten Sie +finden Sie in den generierten Dateien `src/transformers/models/brand_new_bert/modeling_brand_new_bert.py` und +`src/transformers/models/brand_new_bert/configuration_brand_new_bert.py`. + +Jetzt können Sie endlich mit dem Programmieren beginnen :). Der generierte Code in +`src/transformers/models/brand_new_bert/modeling_brand_new_bert.py` wird entweder die gleiche Architektur wie BERT haben, wenn +wenn es sich um ein reines Encoder-Modell handelt oder BART, wenn es sich um ein Encoder-Decoder-Modell handelt. An diesem Punkt sollten Sie sich daran erinnern, was +was Sie am Anfang über die theoretischen Aspekte des Modells gelernt haben: *Wie unterscheidet sich das Modell von BERT oder +BART?*". Implementieren Sie diese Änderungen, was oft bedeutet, dass Sie die *Selbstaufmerksamkeitsschicht*, die Reihenfolge der Normalisierungsschicht usw. ändern müssen. +Schicht usw... 
Auch hier ist es oft nützlich, sich die ähnliche Architektur bereits bestehender Modelle in Transformers anzusehen, um ein besseres Gefühl dafür zu bekommen +ein besseres Gefühl dafür zu bekommen, wie Ihr Modell implementiert werden sollte. + +**Beachten Sie**, dass Sie an diesem Punkt nicht sehr sicher sein müssen, dass Ihr Code völlig korrekt oder sauber ist. Vielmehr ist es +Sie sollten vielmehr eine erste *unbereinigte*, kopierte Version des ursprünglichen Codes in +src/transformers/models/brand_new_bert/modeling_brand_new_bert.py" hinzuzufügen, bis Sie das Gefühl haben, dass der gesamte notwendige Code +hinzugefügt wurde. Unserer Erfahrung nach ist es viel effizienter, schnell eine erste Version des erforderlichen Codes hinzuzufügen und +den Code iterativ mit dem Konvertierungsskript zu verbessern/korrigieren, wie im nächsten Abschnitt beschrieben. Das einzige, was +zu diesem Zeitpunkt funktionieren muss, ist, dass Sie die 🤗 Transformers-Implementierung von *brand_new_bert* instanziieren können, *d.h.* der +folgende Befehl sollte funktionieren: + +```python +from transformers import BrandNewBertModel, BrandNewBertConfig + +model = BrandNewBertModel(BrandNewBertConfig()) +``` + +Der obige Befehl erstellt ein Modell gemäß den Standardparametern, die in `BrandNewBertConfig()` definiert sind, mit +zufälligen Gewichten und stellt damit sicher, dass die `init()` Methoden aller Komponenten funktionieren. + +Beachten Sie, dass alle zufälligen Initialisierungen in der Methode `_init_weights` Ihres `BrandnewBertPreTrainedModel` stattfinden sollten. +Klasse erfolgen sollte. Sie sollte alle Blattmodule in Abhängigkeit von den Variablen der Konfiguration initialisieren. Hier ist ein Beispiel mit der +BERT `_init_weights` Methode: + +```py +def _init_weights(self, module): + """Initialize the weights""" + if isinstance(module, nn.Linear): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.Embedding): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.padding_idx is not None: + module.weight.data[module.padding_idx].zero_() + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) +``` + +Sie können weitere benutzerdefinierte Schemata verwenden, wenn Sie eine spezielle Initialisierung für einige Module benötigen. Zum Beispiel in +`Wav2Vec2ForPreTraining` müssen die letzten beiden linearen Schichten die Initialisierung des regulären PyTorch `nn.Linear` haben. +aber alle anderen sollten eine Initialisierung wie oben verwenden. Dies ist wie folgt kodiert: + +```py +def _init_weights(self, module): + """Initialize the weights""" + if isinstance(module, Wav2Vec2ForPreTraining): + module.project_hid.reset_parameters() + module.project_q.reset_parameters() + module.project_hid._is_hf_initialized = True + module.project_q._is_hf_initialized = True + elif isinstance(module, nn.Linear): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.bias is not None: + module.bias.data.zero_() +``` + +Das Flag `_is_hf_initialized` wird intern verwendet, um sicherzustellen, dass wir ein Submodul nur einmal initialisieren. Wenn Sie es auf +`True` für `module.project_q` und `module.project_hid` setzen, stellen wir sicher, dass die benutzerdefinierte Initialisierung, die wir vorgenommen haben, später nicht überschrieben wird, +die Funktion `_init_weights` nicht auf sie angewendet wird. + +**6. 
Schreiben Sie ein Konvertierungsskript** + +Als nächstes sollten Sie ein Konvertierungsskript schreiben, mit dem Sie den Checkpoint, den Sie zum Debuggen von *brand_new_bert* im +im ursprünglichen Repository in einen Prüfpunkt konvertieren, der mit Ihrer gerade erstellten 🤗 Transformers-Implementierung von +*brand_new_bert*. Es ist nicht ratsam, das Konvertierungsskript von Grund auf neu zu schreiben, sondern die bereits +bestehenden Konvertierungsskripten in 🤗 Transformers nach einem Skript zu suchen, das für die Konvertierung eines ähnlichen Modells verwendet wurde, das im +demselben Framework wie *brand_new_bert* geschrieben wurde. Normalerweise reicht es aus, ein bereits vorhandenes Konvertierungsskript zu kopieren und +es für Ihren Anwendungsfall leicht anzupassen. Zögern Sie nicht, das Hugging Face Team zu bitten, Sie auf ein ähnliches, bereits vorhandenes +Konvertierungsskript für Ihr Modell zu finden. + +- Wenn Sie ein Modell von TensorFlow nach PyTorch portieren, ist ein guter Ausgangspunkt das Konvertierungsskript von BERT [hier](https://github.com/huggingface/transformers/blob/7acfa95afb8194f8f9c1f4d2c6028224dbed35a2/src/transformers/models/bert/modeling_bert.py#L91) +- Wenn Sie ein Modell von PyTorch nach PyTorch portieren, ist ein guter Ausgangspunkt das Konvertierungsskript von BART [hier](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bart/convert_bart_original_pytorch_checkpoint_to_pytorch.py) + +Im Folgenden werden wir kurz erklären, wie PyTorch-Modelle Ebenengewichte speichern und Ebenennamen definieren. In PyTorch wird der +Name einer Ebene durch den Namen des Klassenattributs definiert, das Sie der Ebene geben. Lassen Sie uns ein Dummy-Modell in +PyTorch, das wir `SimpleModel` nennen, wie folgt: + +```python +from torch import nn + + +class SimpleModel(nn.Module): + def __init__(self): + super().__init__() + self.dense = nn.Linear(10, 10) + self.intermediate = nn.Linear(10, 10) + self.layer_norm = nn.LayerNorm(10) +``` + +Jetzt können wir eine Instanz dieser Modelldefinition erstellen, die alle Gewichte ausfüllt: `dense`, `intermediate`, +`layer_norm` mit zufälligen Gewichten. Wir können das Modell ausdrucken, um seine Architektur zu sehen + +```python +model = SimpleModel() + +print(model) +``` + +Dies gibt folgendes aus: + +``` +SimpleModel( + (dense): Linear(in_features=10, out_features=10, bias=True) + (intermediate): Linear(in_features=10, out_features=10, bias=True) + (layer_norm): LayerNorm((10,), eps=1e-05, elementwise_affine=True) +) +``` + +Wir können sehen, dass die Ebenennamen durch den Namen des Klassenattributs in PyTorch definiert sind. 
Sie können die Gewichtswerte +Werte einer bestimmten Ebene anzeigen lassen: + +```python +print(model.dense.weight.data) +``` + +um zu sehen, dass die Gewichte zufällig initialisiert wurden + +``` +tensor([[-0.0818, 0.2207, -0.0749, -0.0030, 0.0045, -0.1569, -0.1598, 0.0212, + -0.2077, 0.2157], + [ 0.1044, 0.0201, 0.0990, 0.2482, 0.3116, 0.2509, 0.2866, -0.2190, + 0.2166, -0.0212], + [-0.2000, 0.1107, -0.1999, -0.3119, 0.1559, 0.0993, 0.1776, -0.1950, + -0.1023, -0.0447], + [-0.0888, -0.1092, 0.2281, 0.0336, 0.1817, -0.0115, 0.2096, 0.1415, + -0.1876, -0.2467], + [ 0.2208, -0.2352, -0.1426, -0.2636, -0.2889, -0.2061, -0.2849, -0.0465, + 0.2577, 0.0402], + [ 0.1502, 0.2465, 0.2566, 0.0693, 0.2352, -0.0530, 0.1859, -0.0604, + 0.2132, 0.1680], + [ 0.1733, -0.2407, -0.1721, 0.1484, 0.0358, -0.0633, -0.0721, -0.0090, + 0.2707, -0.2509], + [-0.1173, 0.1561, 0.2945, 0.0595, -0.1996, 0.2988, -0.0802, 0.0407, + 0.1829, -0.1568], + [-0.1164, -0.2228, -0.0403, 0.0428, 0.1339, 0.0047, 0.1967, 0.2923, + 0.0333, -0.0536], + [-0.1492, -0.1616, 0.1057, 0.1950, -0.2807, -0.2710, -0.1586, 0.0739, + 0.2220, 0.2358]]). +``` + +Im Konvertierungsskript sollten Sie diese zufällig initialisierten Gewichte mit den genauen Gewichten der +entsprechenden Ebene im Kontrollpunkt. *Z.B.* + +```python +# retrieve matching layer weights, e.g. by +# recursive algorithm +layer_name = "dense" +pretrained_weight = array_of_dense_layer + +model_pointer = getattr(model, "dense") + +model_pointer.weight.data = torch.from_numpy(pretrained_weight) +``` + +Dabei müssen Sie sicherstellen, dass jedes zufällig initialisierte Gewicht Ihres PyTorch-Modells und sein entsprechendes +Checkpoint-Gewicht in **Form und Name** genau übereinstimmen. Zu diesem Zweck ist es **notwendig**, assert +Anweisungen für die Form hinzuzufügen und die Namen der Checkpoint-Gewichte auszugeben. Sie sollten z.B. Anweisungen hinzufügen wie: + +```python +assert ( + model_pointer.weight.shape == pretrained_weight.shape +), f"Pointer shape of random weight {model_pointer.shape} and array shape of checkpoint weight {pretrained_weight.shape} mismatched" +``` + +Außerdem sollten Sie die Namen der beiden Gewichte ausdrucken, um sicherzustellen, dass sie übereinstimmen, *z.B.*. + +```python +logger.info(f"Initialize PyTorch weight {layer_name} from {pretrained_weight.name}") +``` + +Wenn entweder die Form oder der Name nicht übereinstimmt, haben Sie wahrscheinlich das falsche Kontrollpunktgewicht einer zufällig +Ebene der 🤗 Transformers-Implementierung zugewiesen. + +Eine falsche Form ist höchstwahrscheinlich auf eine falsche Einstellung der Konfigurationsparameter in `BrandNewBertConfig()` zurückzuführen, die +nicht genau mit denen übereinstimmen, die für den zu konvertierenden Prüfpunkt verwendet wurden. Es könnte aber auch sein, dass +die PyTorch-Implementierung eines Layers erfordert, dass das Gewicht vorher transponiert wird. + +Schließlich sollten Sie auch überprüfen, ob **alle** erforderlichen Gewichte initialisiert sind und alle Checkpoint-Gewichte ausgeben, die +die nicht zur Initialisierung verwendet wurden, um sicherzustellen, dass das Modell korrekt konvertiert wurde. Es ist völlig normal, dass die +Konvertierungsversuche entweder mit einer falschen Shape-Anweisung oder einer falschen Namenszuweisung fehlschlagen. 
Das liegt höchstwahrscheinlich daran, dass entweder +Sie haben falsche Parameter in `BrandNewBertConfig()` verwendet, haben eine falsche Architektur in der 🤗 Transformers +Implementierung, Sie haben einen Fehler in den `init()` Funktionen einer der Komponenten der 🤗 Transformers +Implementierung oder Sie müssen eine der Kontrollpunktgewichte transponieren. + +Dieser Schritt sollte mit dem vorherigen Schritt wiederholt werden, bis alle Gewichte des Kontrollpunkts korrekt in das +Transformers-Modell geladen sind. Nachdem Sie den Prüfpunkt korrekt in die 🤗 Transformers-Implementierung geladen haben, können Sie das Modell +das Modell unter einem Ordner Ihrer Wahl `/path/to/converted/checkpoint/folder` speichern, der dann sowohl ein +Datei `pytorch_model.bin` und eine Datei `config.json` enthalten sollte: + +```python +model.save_pretrained("/path/to/converted/checkpoint/folder") +``` + +**7. Implementieren Sie den Vorwärtspass** + +Nachdem es Ihnen gelungen ist, die trainierten Gewichte korrekt in die 🤗 Transformers-Implementierung zu laden, sollten Sie nun dafür sorgen +sicherstellen, dass der Forward Pass korrekt implementiert ist. In [Machen Sie sich mit dem ursprünglichen Repository vertraut](#3-4-führen-sie-einen-pre-training-checkpoint-mit-dem-original-repository-durch) haben Sie bereits ein Skript erstellt, das einen Forward Pass +Durchlauf des Modells unter Verwendung des Original-Repositorys durchführt. Jetzt sollten Sie ein analoges Skript schreiben, das die 🤗 Transformers +Implementierung anstelle der Originalimplementierung verwenden. Es sollte wie folgt aussehen: + +```python +model = BrandNewBertModel.from_pretrained("/path/to/converted/checkpoint/folder") +input_ids = [0, 4, 4, 3, 2, 4, 1, 7, 19] +output = model(input_ids).last_hidden_states +``` + +Es ist sehr wahrscheinlich, dass die 🤗 Transformers-Implementierung und die ursprüngliche Modell-Implementierung nicht genau die gleiche Ausgabe liefern. +beim ersten Mal nicht die gleiche Ausgabe liefern oder dass der Vorwärtsdurchlauf einen Fehler auslöst. Seien Sie nicht enttäuscht - das ist zu erwarten! Erstens, +sollten Sie sicherstellen, dass der Vorwärtsdurchlauf keine Fehler auslöst. Es passiert oft, dass die falschen Dimensionen verwendet werden +verwendet werden, was zu einem *Dimensionality mismatch* Fehler führt oder dass der falsche Datentyp verwendet wird, *z.B.* `torch.long` +anstelle von `torch.float32`. Zögern Sie nicht, das Hugging Face Team um Hilfe zu bitten, wenn Sie bestimmte Fehler nicht lösen können. +bestimmte Fehler nicht lösen können. + +Um sicherzustellen, dass die Implementierung von 🤗 Transformers korrekt funktioniert, müssen Sie sicherstellen, dass die Ausgaben +einer Genauigkeit von `1e-3` entsprechen. Zunächst sollten Sie sicherstellen, dass die Ausgabeformen identisch sind, *d.h.*. +Die Ausgabeform *outputs.shape* sollte für das Skript der 🤗 Transformers-Implementierung und die ursprüngliche +Implementierung ergeben. Als nächstes sollten Sie sicherstellen, dass auch die Ausgabewerte identisch sind. Dies ist einer der schwierigsten +Teile des Hinzufügens eines neuen Modells. Häufige Fehler, warum die Ausgaben nicht identisch sind, sind: + +- Einige Ebenen wurden nicht hinzugefügt, *d.h.* eine *Aktivierungsebene* wurde nicht hinzugefügt, oder die Restverbindung wurde vergessen +- Die Worteinbettungsmatrix wurde nicht gebunden +- Es werden die falschen Positionseinbettungen verwendet, da die ursprüngliche Implementierung einen Offset verwendet +- Dropout wird während des Vorwärtsdurchlaufs angewendet. 
Um dies zu beheben, stellen Sie sicher, dass *model.training auf False* steht und dass keine Dropout + Schicht während des Vorwärtsdurchlaufs fälschlicherweise aktiviert wird, *d.h.* übergeben Sie *self.training* an [PyTorch's functional dropout](https://pytorch.org/docs/stable/nn.functional.html?highlight=dropout#torch.nn.functional.dropout) + +Der beste Weg, das Problem zu beheben, besteht normalerweise darin, sich den Vorwärtsdurchlauf der ursprünglichen Implementierung und die 🤗 +Transformers-Implementierung nebeneinander zu sehen und zu prüfen, ob es Unterschiede gibt. Idealerweise sollten Sie die +Zwischenergebnisse beider Implementierungen des Vorwärtsdurchlaufs debuggen/ausdrucken, um die genaue Position im Netzwerk zu finden, an der die 🤗 +Transformers-Implementierung eine andere Ausgabe zeigt als die ursprüngliche Implementierung. Stellen Sie zunächst sicher, dass die +hartcodierten `input_ids` in beiden Skripten identisch sind. Überprüfen Sie dann, ob die Ausgaben der ersten Transformation von +der `input_ids` (normalerweise die Worteinbettungen) identisch sind. Und dann arbeiten Sie sich bis zur allerletzten Schicht des +Netzwerks. Irgendwann werden Sie einen Unterschied zwischen den beiden Implementierungen feststellen, der Sie auf den Fehler +in der Implementierung von 🤗 Transformers hinweist. Unserer Erfahrung nach ist ein einfacher und effizienter Weg, viele Druckanweisungen hinzuzufügen +sowohl in der Original-Implementierung als auch in der 🤗 Transformers-Implementierung an den gleichen Stellen im Netzwerk +hinzuzufügen und nacheinander Druckanweisungen zu entfernen, die dieselben Werte für Zwischenpräsentationen anzeigen. + +Wenn Sie sicher sind, dass beide Implementierungen die gleiche Ausgabe liefern, überprüfen Sie die Ausgaben mit +`torch.allclose(original_output, output, atol=1e-3)` überprüfen, haben Sie den schwierigsten Teil hinter sich! Herzlichen Glückwunsch - die +Arbeit, die noch zu erledigen ist, sollte ein Kinderspiel sein 😊. + +**8. Hinzufügen aller notwendigen Modelltests** + +An diesem Punkt haben Sie erfolgreich ein neues Modell hinzugefügt. Es ist jedoch sehr gut möglich, dass das Modell noch nicht +noch nicht vollständig mit dem erforderlichen Design übereinstimmt. Um sicherzustellen, dass die Implementierung vollständig kompatibel mit 🤗 Transformers ist, sollten alle +gemeinsamen Tests bestehen. Der Cookiecutter sollte automatisch eine Testdatei für Ihr Modell hinzugefügt haben, wahrscheinlich unter +demselben `tests/models/brand_new_bert/test_modeling_brand_new_bert.py`. Führen Sie diese Testdatei aus, um zu überprüfen, ob alle gängigen +Tests bestehen: + +```bash +pytest tests/models/brand_new_bert/test_modeling_brand_new_bert.py +``` + +Nachdem Sie alle allgemeinen Tests festgelegt haben, müssen Sie nun sicherstellen, dass all die schöne Arbeit, die Sie geleistet haben, gut getestet ist, damit + +- a) die Community Ihre Arbeit leicht nachvollziehen kann, indem sie sich spezifische Tests von *brand_new_bert* ansieht +- b) zukünftige Änderungen an Ihrem Modell keine wichtigen Funktionen des Modells zerstören. + +Als erstes sollten Sie Integrationstests hinzufügen. Diese Integrationstests tun im Wesentlichen dasselbe wie die Debugging-Skripte +die Sie zuvor zur Implementierung des Modells in 🤗 Transformers verwendet haben. Eine Vorlage für diese Modelltests wurde bereits von dem +Cookiecutter hinzugefügt, die `BrandNewBertModelIntegrationTests` heißt und nur noch von Ihnen ausgefüllt werden muss. 
Um sicherzustellen, dass diese +Tests erfolgreich sind, führen Sie + +```bash +RUN_SLOW=1 pytest -sv tests/models/brand_new_bert/test_modeling_brand_new_bert.py::BrandNewBertModelIntegrationTests +``` + + + +Falls Sie Windows verwenden, sollten Sie `RUN_SLOW=1` durch `SET RUN_SLOW=1` ersetzen. + + + +Zweitens sollten alle Funktionen, die speziell für *brand_new_bert* sind, zusätzlich in einem separaten Test getestet werden unter +`BrandNewBertModelTester`/`BrandNewBertModelTest`. Dieser Teil wird oft vergessen, ist aber in zweierlei Hinsicht äußerst nützlich +Weise: + +- Er hilft dabei, das Wissen, das Sie während der Modellerweiterung erworben haben, an die Community weiterzugeben, indem er zeigt, wie die + speziellen Funktionen von *brand_new_bert* funktionieren sollten. +- Künftige Mitwirkende können Änderungen am Modell schnell testen, indem sie diese speziellen Tests ausführen. + + +**9. Implementieren Sie den Tokenizer** + +Als nächstes sollten wir den Tokenizer von *brand_new_bert* hinzufügen. Normalerweise ist der Tokenizer äquivalent oder sehr ähnlich zu einem +bereits vorhandenen Tokenizer von 🤗 Transformers. + +Es ist sehr wichtig, die ursprüngliche Tokenizer-Datei zu finden/extrahieren und es zu schaffen, diese Datei in die 🤗 +Transformers Implementierung des Tokenizers zu laden. + +Um sicherzustellen, dass der Tokenizer korrekt funktioniert, empfiehlt es sich, zunächst ein Skript im ursprünglichen Repository zu erstellen +zu erstellen, das eine Zeichenkette eingibt und die `input_ids` zurückgibt. Es könnte etwa so aussehen (in Pseudocode): + +```python +input_str = "This is a long example input string containing special characters .$?-, numbers 2872 234 12 and words." +model = BrandNewBertModel.load_pretrained_checkpoint("/path/to/checkpoint/") +input_ids = model.tokenize(input_str) +``` + +Möglicherweise müssen Sie noch einmal einen Blick in das ursprüngliche Repository werfen, um die richtige Tokenizer-Funktion zu finden, oder Sie müssen +Sie müssen vielleicht sogar Änderungen an Ihrem Klon des Original-Repositorys vornehmen, um nur die `input_ids` auszugeben. Nach dem Schreiben +ein funktionierendes Tokenisierungsskript geschrieben, das das ursprüngliche Repository verwendet, sollten Sie ein analoges Skript für 🤗 Transformers +erstellt werden. Es sollte ähnlich wie dieses aussehen: + +```python +from transformers import BrandNewBertTokenizer + +input_str = "This is a long example input string containing special characters .$?-, numbers 2872 234 12 and words." + +tokenizer = BrandNewBertTokenizer.from_pretrained("/path/to/tokenizer/folder/") + +input_ids = tokenizer(input_str).input_ids +``` + +Wenn beide `input_ids` die gleichen Werte ergeben, sollte als letzter Schritt auch eine Tokenizer-Testdatei hinzugefügt werden. + +Analog zu den Modellierungstestdateien von *brand_new_bert* sollten auch die Tokenisierungs-Testdateien von *brand_new_bert* +eine Reihe von fest kodierten Integrationstests enthalten. + +**10. Führen Sie End-to-End-Integrationstests aus** + +Nachdem Sie den Tokenizer hinzugefügt haben, sollten Sie auch ein paar End-to-End-Integrationstests, die sowohl das Modell als auch den +Tokenizer zu `tests/models/brand_new_bert/test_modeling_brand_new_bert.py` in 🤗 Transformers. +Ein solcher Test sollte bei einem aussagekräftigen +Text-zu-Text-Beispiel zeigen, dass die Implementierung von 🤗 Transformers wie erwartet funktioniert. Ein aussagekräftiges Text-zu-Text-Beispiel kann +z.B. 
*ein Quell-zu-Ziel-Übersetzungspaar, ein Artikel-zu-Zusammenfassung-Paar, ein Frage-zu-Antwort-Paar, usw... Wenn keiner der +der portierten Prüfpunkte in einer nachgelagerten Aufgabe feinabgestimmt wurde, genügt es, sich einfach auf die Modelltests zu verlassen. In einem +letzten Schritt, um sicherzustellen, dass das Modell voll funktionsfähig ist, sollten Sie alle Tests auch auf der GPU durchführen. Es kann +Es kann vorkommen, dass Sie vergessen haben, einige `.to(self.device)` Anweisungen zu internen Tensoren des Modells hinzuzufügen, was in einem solchen +Test zu einem Fehler führen würde. Falls Sie keinen Zugang zu einem Grafikprozessor haben, kann das Hugging Face Team diese Tests für Sie durchführen. +Tests für Sie übernehmen. + +**11. Docstring hinzufügen** + +Nun sind alle notwendigen Funktionen für *brand_new_bert* hinzugefügt - Sie sind fast fertig! Das Einzige, was Sie noch hinzufügen müssen, ist +ein schöner Docstring und eine Doku-Seite. Der Cookiecutter sollte eine Vorlagendatei namens +`docs/source/model_doc/brand_new_bert.md` hinzugefügt haben, die Sie ausfüllen sollten. Die Benutzer Ihres Modells werden in der Regel zuerst einen Blick auf +diese Seite ansehen, bevor sie Ihr Modell verwenden. Daher muss die Dokumentation verständlich und prägnant sein. Es ist sehr nützlich für +die Gemeinschaft, einige *Tipps* hinzuzufügen, um zu zeigen, wie das Modell verwendet werden sollte. Zögern Sie nicht, das Hugging Face-Team anzupingen +bezüglich der Docstrings. + +Stellen Sie als nächstes sicher, dass der zu `src/transformers/models/brand_new_bert/modeling_brand_new_bert.py` hinzugefügte docstring +korrekt ist und alle erforderlichen Eingaben und Ausgaben enthält. Wir haben eine ausführliche Anleitung zum Schreiben von Dokumentationen und unserem Docstring-Format [hier](writing-documentation). Es ist immer gut, sich daran zu erinnern, dass die Dokumentation +mindestens so sorgfältig behandelt werden sollte wie der Code in 🤗 Transformers, denn die Dokumentation ist in der Regel der erste Kontaktpunkt der +Berührungspunkt der Community mit dem Modell ist. + +**Code refactor** + +Großartig, jetzt haben Sie den gesamten erforderlichen Code für *brand_new_bert* hinzugefügt. An diesem Punkt sollten Sie einige mögliche +falschen Codestil korrigieren, indem Sie ausführen: + +```bash +make style +``` + +und überprüfen Sie, ob Ihr Kodierungsstil die Qualitätsprüfung besteht: + +```bash +make quality +``` + +Es gibt noch ein paar andere sehr strenge Designtests in 🤗 Transformers, die möglicherweise noch fehlschlagen, was sich in den +den Tests Ihres Pull Requests. Dies liegt oft an fehlenden Informationen im Docstring oder an einer falschen +Benennung. Das Hugging Face Team wird Ihnen sicherlich helfen, wenn Sie hier nicht weiterkommen. + +Und schließlich ist es immer eine gute Idee, den eigenen Code zu refaktorisieren, nachdem man sichergestellt hat, dass er korrekt funktioniert. Wenn alle +Tests bestanden haben, ist es nun an der Zeit, den hinzugefügten Code noch einmal durchzugehen und einige Überarbeitungen vorzunehmen. + +Sie haben nun den Codierungsteil abgeschlossen, herzlichen Glückwunsch! 🎉 Sie sind großartig! 😎 + +**12. Laden Sie die Modelle in den Model Hub hoch** + +In diesem letzten Teil sollten Sie alle Checkpoints konvertieren und in den Modell-Hub hochladen und eine Modellkarte für jeden +hochgeladenen Modell-Kontrollpunkt. Sie können sich mit den Hub-Funktionen vertraut machen, indem Sie unsere [Model sharing and uploading Page](model_sharing) lesen. 
Hier sollten Sie mit dem Hugging Face-Team zusammenarbeiten, um einen passenden Namen für jeden +Checkpoint festzulegen und die erforderlichen Zugriffsrechte zu erhalten, um das Modell unter der Organisation des Autors *brand_new_bert* hochladen zu können. +*brand_new_bert*. Die Methode `push_to_hub`, die in allen Modellen in `transformers` vorhanden ist, ist ein schneller und effizienter Weg, Ihren Checkpoint in den Hub zu pushen. Ein kleines Snippet ist unten eingefügt: + +```python +brand_new_bert.push_to_hub("brand_new_bert") +# Uncomment the following line to push to an organization. +# brand_new_bert.push_to_hub("/brand_new_bert") +``` + +Es lohnt sich, etwas Zeit darauf zu verwenden, für jeden Kontrollpunkt passende Musterkarten zu erstellen. Die Modellkarten sollten die +spezifischen Merkmale dieses bestimmten Prüfpunkts hervorheben, * z.B.* auf welchem Datensatz wurde der Prüfpunkt +vortrainiert/abgestimmt? Für welche nachgelagerte Aufgabe sollte das Modell verwendet werden? Und fügen Sie auch etwas Code bei, wie Sie +wie das Modell korrekt verwendet wird. + +**13. (Optional) Notizbuch hinzufügen** + +Es ist sehr hilfreich, ein Notizbuch hinzuzufügen, in dem im Detail gezeigt wird, wie *brand_new_bert* für Schlussfolgerungen verwendet werden kann und/oder +bei einer nachgelagerten Aufgabe feinabgestimmt wird. Dies ist nicht zwingend erforderlich, um Ihren PR zusammenzuführen, aber sehr nützlich für die Gemeinschaft. + +**14. Reichen Sie Ihren fertigen PR ein** + +Sie sind jetzt mit der Programmierung fertig und können zum letzten Schritt übergehen, nämlich der Zusammenführung Ihres PR mit main. Normalerweise hat das +Hugging Face Team Ihnen an diesem Punkt bereits geholfen haben, aber es lohnt sich, sich etwas Zeit zu nehmen, um Ihrem fertigen +PR eine schöne Beschreibung zu geben und eventuell Kommentare zu Ihrem Code hinzuzufügen, wenn Sie Ihren Gutachter auf bestimmte Designentscheidungen hinweisen wollen. +Gutachter hinweisen wollen. + +### Teilen Sie Ihre Arbeit!! + +Jetzt ist es an der Zeit, von der Community Anerkennung für Ihre Arbeit zu bekommen! Die Fertigstellung einer Modellergänzung ist ein wichtiger +Beitrag zu Transformers und der gesamten NLP-Gemeinschaft. Ihr Code und die portierten vortrainierten Modelle werden sicherlich +von Hunderten und vielleicht sogar Tausenden von Entwicklern und Forschern genutzt werden. Sie sollten stolz auf Ihre Arbeit sein und Ihre +Ihre Leistung mit der Gemeinschaft teilen. + +**Sie haben ein weiteres Modell erstellt, das für jeden in der Community super einfach zugänglich ist! 🤯** diff --git a/docs/source/de/add_new_pipeline.md b/docs/source/de/add_new_pipeline.md new file mode 100644 index 00000000000000..f5e64be7db310f --- /dev/null +++ b/docs/source/de/add_new_pipeline.md @@ -0,0 +1,258 @@ + + +# Wie erstellt man eine benutzerdefinierte Pipeline? + +In dieser Anleitung sehen wir uns an, wie Sie eine benutzerdefinierte Pipeline erstellen und sie auf dem [Hub](https://hf.co/models) freigeben oder sie der +🤗 Transformers-Bibliothek hinzufügen. + +Zuallererst müssen Sie entscheiden, welche Roheingaben die Pipeline verarbeiten kann. Es kann sich um Strings, rohe Bytes, +Dictionaries oder was auch immer die wahrscheinlichste gewünschte Eingabe ist. Versuchen Sie, diese Eingaben so rein wie möglich in Python zu halten +denn das macht die Kompatibilität einfacher (auch mit anderen Sprachen über JSON). Dies werden die Eingaben der +Pipeline (`Vorverarbeitung`). + +Definieren Sie dann die `Outputs`. Dieselbe Richtlinie wie für die Eingänge. 
Je einfacher, desto besser. Dies werden die Ausgaben der +Methode `Postprocess`. + +Beginnen Sie damit, die Basisklasse `Pipeline` mit den 4 Methoden zu erben, die für die Implementierung von `preprocess` benötigt werden, +Weiterleiten", "Nachbearbeitung" und "Parameter säubern". + + +```python +from transformers import Pipeline + + +class MyPipeline(Pipeline): + def _sanitize_parameters(self, **kwargs): + preprocess_kwargs = {} + if "maybe_arg" in kwargs: + preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"] + return preprocess_kwargs, {}, {} + + def preprocess(self, inputs, maybe_arg=2): + model_input = Tensor(inputs["input_ids"]) + return {"model_input": model_input} + + def _forward(self, model_inputs): + # model_inputs == {"model_input": model_input} + outputs = self.model(**model_inputs) + # Maybe {"logits": Tensor(...)} + return outputs + + def postprocess(self, model_outputs): + best_class = model_outputs["logits"].softmax(-1) + return best_class +``` + +Die Struktur dieser Aufteilung soll eine relativ nahtlose Unterstützung für CPU/GPU ermöglichen und gleichzeitig die Durchführung von +Vor-/Nachbearbeitung auf der CPU in verschiedenen Threads + +Preprocess" nimmt die ursprünglich definierten Eingaben und wandelt sie in etwas um, das in das Modell eingespeist werden kann. Es kann +mehr Informationen enthalten und ist normalerweise ein `Dict`. + +`_forward` ist das Implementierungsdetail und ist nicht dafür gedacht, direkt aufgerufen zu werden. Weiterleiten" ist die bevorzugte +aufgerufene Methode, da sie Sicherheitsvorkehrungen enthält, die sicherstellen, dass alles auf dem erwarteten Gerät funktioniert. Wenn etwas +mit einem realen Modell verknüpft ist, gehört es in die Methode `_forward`, alles andere gehört in die Methoden preprocess/postprocess. + +Die Methode `Postprocess` nimmt die Ausgabe von `_forward` und verwandelt sie in die endgültige Ausgabe, die zuvor festgelegt wurde. +zuvor entschieden wurde. + +Die Methode `_sanitize_parameters` ermöglicht es dem Benutzer, beliebige Parameter zu übergeben, wann immer er möchte, sei es bei der Initialisierung +Zeit `pipeline(...., maybe_arg=4)` oder zur Aufrufzeit `pipe = pipeline(...); output = pipe(...., maybe_arg=4)`. + +Die Rückgabe von `_sanitize_parameters` sind die 3 Dicts von kwargs, die direkt an `preprocess` übergeben werden, +`_forward` und `postprocess` übergeben werden. Füllen Sie nichts aus, wenn der Aufrufer keinen zusätzlichen Parameter angegeben hat. Das +erlaubt es, die Standardargumente in der Funktionsdefinition beizubehalten, was immer "natürlicher" ist. + +Ein klassisches Beispiel wäre das Argument `top_k` in der Nachbearbeitung bei Klassifizierungsaufgaben. + +```python +>>> pipe = pipeline("my-new-task") +>>> pipe("This is a test") +[{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}, {"label": "3-star", "score": 0.05} +{"label": "4-star", "score": 0.025}, {"label": "5-star", "score": 0.025}] + +>>> pipe("This is a test", top_k=2) +[{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}] +``` + +In order to achieve that, we'll update our `postprocess` method with a default parameter to `5`. and edit +`_sanitize_parameters` to allow this new parameter. 
+ + +```python +def postprocess(self, model_outputs, top_k=5): + best_class = model_outputs["logits"].softmax(-1) + # Add logic to handle top_k + return best_class + + +def _sanitize_parameters(self, **kwargs): + preprocess_kwargs = {} + if "maybe_arg" in kwargs: + preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"] + + postprocess_kwargs = {} + if "top_k" in kwargs: + postprocess_kwargs["top_k"] = kwargs["top_k"] + return preprocess_kwargs, {}, postprocess_kwargs +``` + +Versuchen Sie, die Eingaben/Ausgaben sehr einfach und idealerweise JSON-serialisierbar zu halten, da dies die Verwendung der Pipeline sehr einfach macht +ohne dass die Benutzer neue Arten von Objekten verstehen müssen. Es ist auch relativ üblich, viele verschiedene Arten von Argumenten zu unterstützen +von Argumenten zu unterstützen (Audiodateien, die Dateinamen, URLs oder reine Bytes sein können). + + + +## Hinzufügen zur Liste der unterstützten Aufgaben + +Um Ihre `neue Aufgabe` in die Liste der unterstützten Aufgaben aufzunehmen, müssen Sie sie zur `PIPELINE_REGISTRY` hinzufügen: + +```python +from transformers.pipelines import PIPELINE_REGISTRY + +PIPELINE_REGISTRY.register_pipeline( + "new-task", + pipeline_class=MyPipeline, + pt_model=AutoModelForSequenceClassification, +) +``` + +Wenn Sie möchten, können Sie ein Standardmodell angeben. In diesem Fall sollte es mit einer bestimmten Revision (die der Name einer Verzweigung oder ein Commit-Hash sein kann, hier haben wir `"abcdef"` genommen) sowie mit dem Typ versehen sein: + +```python +PIPELINE_REGISTRY.register_pipeline( + "new-task", + pipeline_class=MyPipeline, + pt_model=AutoModelForSequenceClassification, + default={"pt": ("user/awesome_model", "abcdef")}, + type="text", # current support type: text, audio, image, multimodal +) +``` + +## Teilen Sie Ihre Pipeline auf dem Hub + +Um Ihre benutzerdefinierte Pipeline auf dem Hub freizugeben, müssen Sie lediglich den benutzerdefinierten Code Ihrer `Pipeline`-Unterklasse in einer +Python-Datei speichern. Nehmen wir zum Beispiel an, Sie möchten eine benutzerdefinierte Pipeline für die Klassifizierung von Satzpaaren wie folgt verwenden: + +```py +import numpy as np + +from transformers import Pipeline + + +def softmax(outputs): + maxes = np.max(outputs, axis=-1, keepdims=True) + shifted_exp = np.exp(outputs - maxes) + return shifted_exp / shifted_exp.sum(axis=-1, keepdims=True) + + +class PairClassificationPipeline(Pipeline): + def _sanitize_parameters(self, **kwargs): + preprocess_kwargs = {} + if "second_text" in kwargs: + preprocess_kwargs["second_text"] = kwargs["second_text"] + return preprocess_kwargs, {}, {} + + def preprocess(self, text, second_text=None): + return self.tokenizer(text, text_pair=second_text, return_tensors=self.framework) + + def _forward(self, model_inputs): + return self.model(**model_inputs) + + def postprocess(self, model_outputs): + logits = model_outputs.logits[0].numpy() + probabilities = softmax(logits) + + best_class = np.argmax(probabilities) + label = self.model.config.id2label[best_class] + score = probabilities[best_class].item() + logits = logits.tolist() + return {"label": label, "score": score, "logits": logits} +``` + +Die Implementierung ist Framework-unabhängig und funktioniert für PyTorch- und TensorFlow-Modelle. 
Wenn wir dies in einer Datei +einer Datei namens `pair_classification.py` gespeichert haben, können wir sie importieren und wie folgt registrieren: + +```py +from pair_classification import PairClassificationPipeline +from transformers.pipelines import PIPELINE_REGISTRY +from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification + +PIPELINE_REGISTRY.register_pipeline( + "pair-classification", + pipeline_class=PairClassificationPipeline, + pt_model=AutoModelForSequenceClassification, + tf_model=TFAutoModelForSequenceClassification, +) +``` + +Sobald dies geschehen ist, können wir es mit einem vortrainierten Modell verwenden. Zum Beispiel wurde `sgugger/finetuned-bert-mrpc` auf den +auf den MRPC-Datensatz abgestimmt, der Satzpaare als Paraphrasen oder nicht klassifiziert. + +```py +from transformers import pipeline + +classifier = pipeline("pair-classification", model="sgugger/finetuned-bert-mrpc") +``` + +Dann können wir sie auf dem Hub mit der Methode `save_pretrained` in einem `Repository` freigeben: + +```py +from huggingface_hub import Repository + +repo = Repository("test-dynamic-pipeline", clone_from="{your_username}/test-dynamic-pipeline") +classifier.save_pretrained("test-dynamic-pipeline") +repo.push_to_hub() +``` + +Dadurch wird die Datei, in der Sie `PairClassificationPipeline` definiert haben, in den Ordner `"test-dynamic-pipeline"` kopiert, +und speichert das Modell und den Tokenizer der Pipeline, bevor Sie alles in das Repository verschieben +`{Ihr_Benutzername}/test-dynamic-pipeline`. Danach kann jeder die Pipeline verwenden, solange er die Option +`trust_remote_code=True` angeben: + +```py +from transformers import pipeline + +classifier = pipeline(model="{your_username}/test-dynamic-pipeline", trust_remote_code=True) +``` + +## Hinzufügen der Pipeline zu 🤗 Transformers + +Wenn Sie Ihre Pipeline zu 🤗 Transformers beitragen möchten, müssen Sie ein neues Modul im Untermodul `pipelines` hinzufügen +mit dem Code Ihrer Pipeline hinzufügen. Fügen Sie es dann der Liste der in `pipelines/__init__.py` definierten Aufgaben hinzu. + +Dann müssen Sie noch Tests hinzufügen. Erstellen Sie eine neue Datei `tests/test_pipelines_MY_PIPELINE.py` mit Beispielen für die anderen Tests. + +Die Funktion `run_pipeline_test` ist sehr allgemein gehalten und läuft auf kleinen Zufallsmodellen auf jeder möglichen +Architektur, wie durch `model_mapping` und `tf_model_mapping` definiert. + +Dies ist sehr wichtig, um die zukünftige Kompatibilität zu testen, d.h. wenn jemand ein neues Modell für +`XXXForQuestionAnswering` hinzufügt, wird der Pipeline-Test versuchen, mit diesem Modell zu arbeiten. Da die Modelle zufällig sind, ist es +ist es unmöglich, die tatsächlichen Werte zu überprüfen. Deshalb gibt es eine Hilfsfunktion `ANY`, die einfach versucht, die +Ausgabe der Pipeline TYPE. + +Außerdem *müssen* Sie 2 (idealerweise 4) Tests implementieren. + +- `test_small_model_pt` : Definieren Sie 1 kleines Modell für diese Pipeline (es spielt keine Rolle, ob die Ergebnisse keinen Sinn ergeben) + und testen Sie die Ausgaben der Pipeline. Die Ergebnisse sollten die gleichen sein wie bei `test_small_model_tf`. +- `test_small_model_tf` : Definieren Sie 1 kleines Modell für diese Pipeline (es spielt keine Rolle, ob die Ergebnisse keinen Sinn ergeben) + und testen Sie die Ausgaben der Pipeline. Die Ergebnisse sollten die gleichen sein wie bei `test_small_model_pt`. 
+- `test_large_model_pt` (`optional`): Testet die Pipeline an einer echten Pipeline, bei der die Ergebnisse + Sinn machen. Diese Tests sind langsam und sollten als solche gekennzeichnet werden. Hier geht es darum, die Pipeline zu präsentieren und sicherzustellen + sicherzustellen, dass es in zukünftigen Versionen keine Abweichungen gibt. +- `test_large_model_tf` (`optional`): Testet die Pipeline an einer echten Pipeline, bei der die Ergebnisse + Sinn machen. Diese Tests sind langsam und sollten als solche gekennzeichnet werden. Hier geht es darum, die Pipeline zu präsentieren und sicherzustellen + sicherzustellen, dass es in zukünftigen Versionen keine Abweichungen gibt. diff --git a/docs/source/de/add_tensorflow_model.md b/docs/source/de/add_tensorflow_model.md new file mode 100644 index 00000000000000..8488acbe709b64 --- /dev/null +++ b/docs/source/de/add_tensorflow_model.md @@ -0,0 +1,356 @@ + + +# Wie konvertiert man ein 🤗 Transformers-Modell in TensorFlow? + +Die Tatsache, dass mehrere Frameworks für die Verwendung mit 🤗 Transformers zur Verfügung stehen, gibt Ihnen die Flexibilität, deren Stärken beim Entwurf Ihrer Anwendung auszuspielen. +Ihre Anwendung zu entwerfen, aber das bedeutet auch, dass die Kompatibilität für jedes Modell einzeln hinzugefügt werden muss. Die gute Nachricht ist, dass +das Hinzufügen von TensorFlow-Kompatibilität zu einem bestehenden Modell einfacher ist als [das Hinzufügen eines neuen Modells von Grund auf](add_new_model)! +Ob Sie ein tieferes Verständnis für große TensorFlow-Modelle haben möchten, einen wichtigen Open-Source-Beitrag leisten oder +TensorFlow für das Modell Ihrer Wahl aktivieren wollen, dieser Leitfaden ist für Sie. + +Dieser Leitfaden befähigt Sie, ein Mitglied unserer Gemeinschaft, TensorFlow-Modellgewichte und/oder +Architekturen beizusteuern, die in 🤗 Transformers verwendet werden sollen, und zwar mit minimaler Betreuung durch das Hugging Face Team. Das Schreiben eines neuen Modells +ist keine Kleinigkeit, aber ich hoffe, dass dieser Leitfaden dazu beiträgt, dass es weniger eine Achterbahnfahrt 🎢 und mehr ein Spaziergang im Park 🚶 ist. +Die Nutzung unserer kollektiven Erfahrungen ist absolut entscheidend, um diesen Prozess immer einfacher zu machen, und deshalb möchten wir +ermutigen Sie daher, Verbesserungsvorschläge für diesen Leitfaden zu machen! + +Bevor Sie tiefer eintauchen, empfehlen wir Ihnen, die folgenden Ressourcen zu lesen, wenn Sie neu in 🤗 Transformers sind: +- [Allgemeiner Überblick über 🤗 Transformers](add_new_model#general-overview-of-transformers) +- [Die TensorFlow-Philosophie von Hugging Face](https://huggingface.co/blog/tensorflow-philosophy) + +Im Rest dieses Leitfadens werden Sie lernen, was nötig ist, um eine neue TensorFlow Modellarchitektur hinzuzufügen, die +Verfahren zur Konvertierung von PyTorch in TensorFlow-Modellgewichte und wie Sie Unstimmigkeiten zwischen ML +Frameworks. Legen Sie los! + + + +Sind Sie unsicher, ob das Modell, das Sie verwenden möchten, bereits eine entsprechende TensorFlow-Architektur hat? + +  + +Überprüfen Sie das Feld `model_type` in der `config.json` des Modells Ihrer Wahl +([Beispiel](https://huggingface.co/google-bert/bert-base-uncased/blob/main/config.json#L14)). Wenn der entsprechende Modellordner in +🤗 Transformers eine Datei hat, deren Name mit "modeling_tf" beginnt, bedeutet dies, dass es eine entsprechende TensorFlow +Architektur hat ([Beispiel](https://github.com/huggingface/transformers/tree/main/src/transformers/models/bert)). 
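+
+Wenn Sie diese Prüfung lieber programmatisch durchführen möchten, zeigt die folgende Skizze eine Möglichkeit; der Modellname dient nur als
+Beispiel, und der Modulname folgt lediglich der oben beschriebenen Namenskonvention `modeling_tf_*`:
+
+```python
+import importlib.util
+import json
+
+from huggingface_hub import hf_hub_download
+
+# Read the `model_type` field from the checkpoint's config.json (example checkpoint)
+config_path = hf_hub_download("google-bert/bert-base-uncased", "config.json")
+with open(config_path) as f:
+    model_type = json.load(f)["model_type"]
+
+# Check whether 🤗 Transformers ships a corresponding `modeling_tf_*` module
+# (assumes `transformers` is installed and the model folder follows the usual naming scheme)
+tf_module = f"transformers.models.{model_type}.modeling_tf_{model_type}"
+has_tf_architecture = importlib.util.find_spec(tf_module) is not None
+print(model_type, "->", "TensorFlow architecture found" if has_tf_architecture else "no TensorFlow architecture yet")
+```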
+ + + + +## Schritt-für-Schritt-Anleitung zum Hinzufügen von TensorFlow-Modellarchitektur-Code + +Es gibt viele Möglichkeiten, eine große Modellarchitektur zu entwerfen, und viele Möglichkeiten, diesen Entwurf zu implementieren. Wie auch immer, +Sie erinnern sich vielleicht an unseren [allgemeinen Überblick über 🤗 Transformers](add_new_model#general-overview-of-transformers) +wissen, dass wir ein meinungsfreudiger Haufen sind - die Benutzerfreundlichkeit von 🤗 Transformers hängt von konsistenten Designentscheidungen ab. Aus +Erfahrung können wir Ihnen ein paar wichtige Dinge über das Hinzufügen von TensorFlow-Modellen sagen: + +- Erfinden Sie das Rad nicht neu! In den meisten Fällen gibt es mindestens zwei Referenzimplementierungen, die Sie überprüfen sollten: das +PyTorch-Äquivalent des Modells, das Sie implementieren, und andere TensorFlow-Modelle für dieselbe Klasse von Problemen. +- Gute Modellimplementierungen überleben den Test der Zeit. Dies geschieht nicht, weil der Code hübsch ist, sondern eher +sondern weil der Code klar, einfach zu debuggen und darauf aufzubauen ist. Wenn Sie den Maintainern das Leben mit Ihrer +TensorFlow-Implementierung leicht machen, indem Sie die gleichen Muster wie in anderen TensorFlow-Modellen nachbilden und die Abweichung +zur PyTorch-Implementierung minimieren, stellen Sie sicher, dass Ihr Beitrag lange Bestand haben wird. +- Bitten Sie um Hilfe, wenn Sie nicht weiterkommen! Das 🤗 Transformers-Team ist da, um zu helfen, und wir haben wahrscheinlich Lösungen für die gleichen +Probleme gefunden, vor denen Sie stehen. + +Hier finden Sie einen Überblick über die Schritte, die zum Hinzufügen einer TensorFlow-Modellarchitektur erforderlich sind: +1. Wählen Sie das Modell, das Sie konvertieren möchten +2. Bereiten Sie die Transformers-Entwicklungsumgebung vor. +3. (Optional) Verstehen Sie die theoretischen Aspekte und die bestehende Implementierung +4. Implementieren Sie die Modellarchitektur +5. Implementieren Sie Modelltests +6. Reichen Sie den Pull-Antrag ein +7. (Optional) Erstellen Sie Demos und teilen Sie diese mit der Welt + +### 1.-3. Bereiten Sie Ihren Modellbeitrag vor + +**1. Wählen Sie das Modell, das Sie konvertieren möchten** + +Beginnen wir mit den Grundlagen: Als erstes müssen Sie die Architektur kennen, die Sie konvertieren möchten. Wenn Sie +Sie sich nicht auf eine bestimmte Architektur festgelegt haben, ist es eine gute Möglichkeit, das 🤗 Transformers-Team um Vorschläge zu bitten. +Wir werden Sie zu den wichtigsten Architekturen führen, die auf der TensorFlow-Seite noch fehlen. +Seite fehlen. Wenn das spezifische Modell, das Sie mit TensorFlow verwenden möchten, bereits eine Implementierung der TensorFlow-Architektur in +🤗 Transformers, aber es fehlen Gewichte, können Sie direkt in den +Abschnitt [Gewichtskonvertierung](#hinzufügen-von-tensorflow-gewichten-zum--hub) +auf dieser Seite. + +Der Einfachheit halber wird im Rest dieser Anleitung davon ausgegangen, dass Sie sich entschieden haben, mit der TensorFlow-Version von +*BrandNewBert* (dasselbe Beispiel wie in der [Anleitung](add_new_model), um ein neues Modell von Grund auf hinzuzufügen). + + + +Bevor Sie mit der Arbeit an einer TensorFlow-Modellarchitektur beginnen, sollten Sie sich vergewissern, dass es keine laufenden Bemühungen in dieser Richtung gibt. +Sie können nach `BrandNewBert` auf der +[pull request GitHub page](https://github.com/huggingface/transformers/pulls?q=is%3Apr), um zu bestätigen, dass es keine +TensorFlow-bezogene Pull-Anfrage gibt. + + + + +**2. 
Transformers-Entwicklungsumgebung vorbereiten** + +Nachdem Sie die Modellarchitektur ausgewählt haben, öffnen Sie einen PR-Entwurf, um Ihre Absicht zu signalisieren, daran zu arbeiten. Folgen Sie den +Anweisungen, um Ihre Umgebung einzurichten und einen PR-Entwurf zu öffnen. + +1. Forken Sie das [repository](https://github.com/huggingface/transformers), indem Sie auf der Seite des Repositorys auf die Schaltfläche 'Fork' klicken. + Seite des Repositorys klicken. Dadurch wird eine Kopie des Codes unter Ihrem GitHub-Benutzerkonto erstellt. + +2. Klonen Sie Ihren `transformers` Fork auf Ihre lokale Festplatte und fügen Sie das Basis-Repository als Remote hinzu: + +```bash +git clone https://github.com/[your Github handle]/transformers.git +cd transformers +git remote add upstream https://github.com/huggingface/transformers.git +``` + +3. Richten Sie eine Entwicklungsumgebung ein, indem Sie z.B. den folgenden Befehl ausführen: + +```bash +python -m venv .env +source .env/bin/activate +pip install -e ".[dev]" +``` + +Abhängig von Ihrem Betriebssystem und da die Anzahl der optionalen Abhängigkeiten von Transformers wächst, kann es sein, dass Sie bei diesem Befehl einen +Fehler mit diesem Befehl erhalten. Wenn das der Fall ist, stellen Sie sicher, dass Sie TensorFlow installieren und dann ausführen: + +```bash +pip install -e ".[quality]" +``` + +**Hinweis:** Sie müssen CUDA nicht installiert haben. Es reicht aus, das neue Modell auf der CPU laufen zu lassen. + +4. Erstellen Sie eine Verzweigung mit einem beschreibenden Namen von Ihrer Hauptverzweigung + +```bash +git checkout -b add_tf_brand_new_bert +``` + +5. Abrufen und zurücksetzen auf die aktuelle Hauptversion + +```bash +git fetch upstream +git rebase upstream/main +``` + +6. Fügen Sie eine leere `.py` Datei in `transformers/src/models/brandnewbert/` mit dem Namen `modeling_tf_brandnewbert.py` hinzu. Dies wird +Ihre TensorFlow-Modelldatei sein. + +7. Übertragen Sie die Änderungen auf Ihr Konto mit: + +```bash +git add . +git commit -m "initial commit" +git push -u origin add_tf_brand_new_bert +``` + +8. Wenn Sie zufrieden sind, gehen Sie auf die Webseite Ihrer Abspaltung auf GitHub. Klicken Sie auf "Pull request". Stellen Sie sicher, dass Sie das + GitHub-Handle einiger Mitglieder des Hugging Face-Teams als Reviewer hinzuzufügen, damit das Hugging Face-Team über zukünftige Änderungen informiert wird. + zukünftige Änderungen benachrichtigt wird. + +9. Ändern Sie den PR in einen Entwurf, indem Sie auf der rechten Seite der GitHub-Pull-Request-Webseite auf "In Entwurf umwandeln" klicken. + + +Jetzt haben Sie eine Entwicklungsumgebung eingerichtet, um *BrandNewBert* nach TensorFlow in 🤗 Transformers zu portieren. + + +**3. (Optional) Verstehen Sie die theoretischen Aspekte und die bestehende Implementierung** + +Sie sollten sich etwas Zeit nehmen, um die Arbeit von *BrandNewBert* zu lesen, falls eine solche Beschreibung existiert. Möglicherweise gibt es große +Abschnitte des Papiers, die schwer zu verstehen sind. Wenn das der Fall ist, ist das in Ordnung - machen Sie sich keine Sorgen! Das Ziel ist +ist es nicht, ein tiefes theoretisches Verständnis des Papiers zu erlangen, sondern die notwendigen Informationen zu extrahieren, um +das Modell mit Hilfe von TensorFlow effektiv in 🤗 Transformers neu zu implementieren. Das heißt, Sie müssen nicht zu viel Zeit auf die +viel Zeit auf die theoretischen Aspekte verwenden, sondern sich lieber auf die praktischen Aspekte konzentrieren, nämlich auf die bestehende Modelldokumentation +Seite (z.B. 
[model docs for BERT](model_doc/bert)). + +Nachdem Sie die Grundlagen der Modelle, die Sie implementieren wollen, verstanden haben, ist es wichtig, die bestehende +Implementierung zu verstehen. Dies ist eine gute Gelegenheit, sich zu vergewissern, dass eine funktionierende Implementierung mit Ihren Erwartungen an das +Modell entspricht, und um technische Herausforderungen auf der TensorFlow-Seite vorauszusehen. + +Es ist ganz natürlich, dass Sie sich von der Menge an Informationen, die Sie gerade aufgesogen haben, überwältigt fühlen. Es ist +Es ist definitiv nicht erforderlich, dass Sie in dieser Phase alle Facetten des Modells verstehen. Dennoch empfehlen wir Ihnen dringend +ermutigen wir Sie, alle dringenden Fragen in unserem [Forum](https://discuss.huggingface.co/) zu klären. + + +### 4. Implementierung des Modells + +Jetzt ist es an der Zeit, endlich mit dem Programmieren zu beginnen. Als Ausgangspunkt empfehlen wir die PyTorch-Datei selbst: Kopieren Sie den Inhalt von +`modeling_brand_new_bert.py` in `src/transformers/models/brand_new_bert/` nach +`modeling_tf_brand_new_bert.py`. Das Ziel dieses Abschnitts ist es, die Datei zu ändern und die Importstruktur von +🤗 Transformers zu aktualisieren, so dass Sie `TFBrandNewBert` und +`TFBrandNewBert.from_pretrained(model_repo, from_pt=True)` erfolgreich ein funktionierendes TensorFlow *BrandNewBert* Modell lädt. + +Leider gibt es kein Rezept, um ein PyTorch-Modell in TensorFlow zu konvertieren. Sie können jedoch unsere Auswahl an +Tipps befolgen, um den Prozess so reibungslos wie möglich zu gestalten: +- Stellen Sie `TF` dem Namen aller Klassen voran (z.B. wird `BrandNewBert` zu `TFBrandNewBert`). +- Die meisten PyTorch-Operationen haben einen direkten TensorFlow-Ersatz. Zum Beispiel entspricht `torch.nn.Linear` der Klasse + `tf.keras.layers.Dense`, `torch.nn.Dropout` entspricht `tf.keras.layers.Dropout`, usw. Wenn Sie sich nicht sicher sind + über eine bestimmte Operation nicht sicher sind, können Sie die [TensorFlow-Dokumentation](https://www.tensorflow.org/api_docs/python/tf) + oder die [PyTorch-Dokumentation](https://pytorch.org/docs/stable/). +- Suchen Sie nach Mustern in der Codebasis von 🤗 Transformers. Wenn Sie auf eine bestimmte Operation stoßen, für die es keinen direkten Ersatz gibt + Ersatz hat, stehen die Chancen gut, dass jemand anderes bereits das gleiche Problem hatte. +- Behalten Sie standardmäßig die gleichen Variablennamen und die gleiche Struktur wie in PyTorch bei. Dies erleichtert die Fehlersuche, die Verfolgung von + Probleme zu verfolgen und spätere Korrekturen vorzunehmen. +- Einige Ebenen haben in jedem Framework unterschiedliche Standardwerte. Ein bemerkenswertes Beispiel ist die Schicht für die Batch-Normalisierung + epsilon (`1e-5` in [PyTorch](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html#torch.nn.BatchNorm2d) + und `1e-3` in [TensorFlow](https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization)). + Prüfen Sie die Dokumentation genau! +- Die Variablen `nn.Parameter` von PyTorch müssen in der Regel innerhalb von TF Layer's `build()` initialisiert werden. 
Siehe das folgende + Beispiel: [PyTorch](https://github.com/huggingface/transformers/blob/655f72a6896c0533b1bdee519ed65a059c2425ac/src/transformers/models/vit_mae/modeling_vit_mae.py#L212) / + [TensorFlow](https://github.com/huggingface/transformers/blob/655f72a6896c0533b1bdee519ed65a059c2425ac/src/transformers/models/vit_mae/modeling_tf_vit_mae.py#L220) +- Wenn das PyTorch-Modell ein `#copied from ...` am Anfang einer Funktion hat, stehen die Chancen gut, dass Ihr TensorFlow-Modell diese Funktion auch + diese Funktion von der Architektur ausleihen kann, von der sie kopiert wurde, vorausgesetzt, es hat eine TensorFlow-Architektur. +- Die korrekte Zuweisung des Attributs `name` in TensorFlow-Funktionen ist entscheidend, um das `from_pt=True` Gewicht zu erreichen + Cross-Loading. Name" ist fast immer der Name der entsprechenden Variablen im PyTorch-Code. Wenn `name` nicht + nicht richtig gesetzt ist, sehen Sie dies in der Fehlermeldung beim Laden der Modellgewichte. +- Die Logik der Basismodellklasse, `BrandNewBertModel`, befindet sich in `TFBrandNewBertMainLayer`, einer Keras + Schicht-Unterklasse ([Beispiel](https://github.com/huggingface/transformers/blob/4fd32a1f499e45f009c2c0dea4d81c321cba7e02/src/transformers/models/bert/modeling_tf_bert.py#L719)). + TFBrandNewBertModel" ist lediglich ein Wrapper für diese Schicht. +- Keras-Modelle müssen erstellt werden, um die vorher trainierten Gewichte zu laden. Aus diesem Grund muss `TFBrandNewBertPreTrainedModel` + ein Beispiel für die Eingaben in das Modell enthalten, die `dummy_inputs` + ([Beispiel](https://github.com/huggingface/transformers/blob/4fd32a1f499e45f009c2c0dea4d81c321cba7e02/src/transformers/models/bert/modeling_tf_bert.py#L916)). +- Wenn Sie nicht weiterkommen, fragen Sie nach Hilfe - wir sind für Sie da! 🤗 + +Neben der Modelldatei selbst müssen Sie auch die Verweise auf die Modellklassen und die zugehörigen +Dokumentationsseiten hinzufügen. Sie können diesen Teil ganz nach den Mustern in anderen PRs erledigen +([Beispiel](https://github.com/huggingface/transformers/pull/18020/files)). Hier ist eine Liste der erforderlichen manuellen +Änderungen: +- Fügen Sie alle öffentlichen Klassen von *BrandNewBert* in `src/transformers/__init__.py` ein. +- Fügen Sie *BrandNewBert* Klassen zu den entsprechenden Auto Klassen in `src/transformers/models/auto/modeling_tf_auto.py` hinzu. +- Fügen Sie die *BrandNewBert* zugehörigen Klassen für träges Laden in `src/transformers/utils/dummy_tf_objects.py` hinzu. +- Aktualisieren Sie die Importstrukturen für die öffentlichen Klassen in `src/transformers/models/brand_new_bert/__init__.py`. +- Fügen Sie die Dokumentationszeiger auf die öffentlichen Methoden von *BrandNewBert* in `docs/source/de/model_doc/brand_new_bert.md` hinzu. +- Fügen Sie sich selbst zur Liste der Mitwirkenden an *BrandNewBert* in `docs/source/de/model_doc/brand_new_bert.md` hinzu. +- Fügen Sie schließlich ein grünes Häkchen ✅ in der TensorFlow-Spalte von *BrandNewBert* in `docs/source/de/index.md` hinzu. + +Wenn Sie mit Ihrer Implementierung zufrieden sind, führen Sie die folgende Checkliste aus, um zu bestätigen, dass Ihre Modellarchitektur +fertig ist: +1. Alle Schichten, die sich zur Trainingszeit anders verhalten (z.B. Dropout), werden mit einem `Training` Argument aufgerufen, das +von den Top-Level-Klassen weitergegeben wird +2. Sie haben `#copied from ...` verwendet, wann immer es möglich war. +3. 
Die Funktion `TFBrandNewBertMainLayer` und alle Klassen, die sie verwenden, haben ihre Funktion `call` mit `@unpack_inputs` dekoriert +4. `TFBrandNewBertMainLayer` ist mit `@keras_serializable` dekoriert +5. Ein TensorFlow-Modell kann aus PyTorch-Gewichten mit `TFBrandNewBert.from_pretrained(model_repo, from_pt=True)` geladen werden. +6. Sie können das TensorFlow Modell mit dem erwarteten Eingabeformat aufrufen + + +### 5. Modell-Tests hinzufügen + +Hurra, Sie haben ein TensorFlow-Modell implementiert! Jetzt ist es an der Zeit, Tests hinzuzufügen, um sicherzustellen, dass sich Ihr Modell wie erwartet verhält. +erwartet. Wie im vorigen Abschnitt schlagen wir vor, dass Sie zunächst die Datei `test_modeling_brand_new_bert.py` in +`tests/models/brand_new_bert/` in die Datei `test_modeling_tf_brand_new_bert.py` zu kopieren und dann die notwendigen +TensorFlow-Ersetzungen vornehmen. Für den Moment sollten Sie in allen Aufrufen von `.from_pretrained()` das Flag `from_pt=True` verwenden, um die +die vorhandenen PyTorch-Gewichte zu laden. + +Wenn Sie damit fertig sind, kommt der Moment der Wahrheit: Führen Sie die Tests durch! 😬 + +```bash +NVIDIA_TF32_OVERRIDE=0 RUN_SLOW=1 RUN_PT_TF_CROSS_TESTS=1 \ +py.test -vv tests/models/brand_new_bert/test_modeling_tf_brand_new_bert.py +``` + +Das wahrscheinlichste Ergebnis ist, dass Sie eine Reihe von Fehlern sehen werden. Machen Sie sich keine Sorgen, das ist zu erwarten! Das Debuggen von ML-Modellen ist +notorisch schwierig, und der Schlüssel zum Erfolg ist Geduld (und `breakpoint()`). Nach unserer Erfahrung sind die schwierigsten +Probleme aus subtilen Unstimmigkeiten zwischen ML-Frameworks, zu denen wir am Ende dieses Leitfadens ein paar Hinweise geben. +In anderen Fällen kann es sein, dass ein allgemeiner Test nicht direkt auf Ihr Modell anwendbar ist; in diesem Fall empfehlen wir eine Überschreibung +auf der Ebene der Modelltestklasse. Zögern Sie nicht, in Ihrem Entwurf einer Pull-Anfrage um Hilfe zu bitten, wenn +Sie nicht weiterkommen. + +Wenn alle Tests erfolgreich waren, können Sie Ihr Modell in die 🤗 Transformers-Bibliothek aufnehmen! 🎉 + +### 6.-7. Stellen Sie sicher, dass jeder Ihr Modell verwenden kann + +**6. Reichen Sie den Pull Request ein** + +Sobald Sie mit der Implementierung und den Tests fertig sind, ist es an der Zeit, eine Pull-Anfrage einzureichen. Bevor Sie Ihren Code einreichen, +führen Sie unser Dienstprogramm zur Codeformatierung, `make fixup` 🪄, aus. Damit werden automatisch alle Formatierungsfehler behoben, die dazu führen würden, dass +unsere automatischen Prüfungen fehlschlagen würden. + +Nun ist es an der Zeit, Ihren Entwurf einer Pull-Anfrage in eine echte Pull-Anfrage umzuwandeln. Klicken Sie dazu auf die Schaltfläche "Bereit für +Review" und fügen Sie Joao (`@gante`) und Matt (`@Rocketknight1`) als Reviewer hinzu. Eine Modell-Pull-Anfrage benötigt +mindestens 3 Reviewer, aber sie werden sich darum kümmern, geeignete zusätzliche Reviewer für Ihr Modell zu finden. + +Nachdem alle Gutachter mit dem Stand Ihres PR zufrieden sind, entfernen Sie als letzten Aktionspunkt das Flag `from_pt=True` in +.from_pretrained()-Aufrufen zu entfernen. Da es keine TensorFlow-Gewichte gibt, müssen Sie sie hinzufügen! Lesen Sie den Abschnitt +unten, um zu erfahren, wie Sie dies tun können. 
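+
+Zur Einordnung vorab eine stark vereinfachte Skizze dessen, worauf das Hinzufügen der TensorFlow-Gewichte im Kern hinausläuft (die unten beschriebene CLI automatisiert und überprüft diesen Schritt; `TFBertModel` und der BERT-Checkpoint stehen hier nur stellvertretend für Ihre neue TF-Klasse):
+
+```py
+from transformers import TFBertModel  # Platzhalter für z.B. TFBrandNewBert
+
+# PyTorch-Gewichte in die TensorFlow-Klasse laden ...
+tf_model = TFBertModel.from_pretrained("google-bert/bert-base-uncased", from_pt=True)
+# ... und als TensorFlow-Gewichte speichern (legt u.a. eine tf_model.h5 ab)
+tf_model.save_pretrained("brand_new_bert_tf")
+```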
+
+Wenn schließlich die TensorFlow-Gewichte zusammengeführt werden, Sie mindestens 3 Genehmigungen von Prüfern haben und alle CI-Checks grün sind, überprüfen Sie die Tests ein letztes Mal lokal
+
+```bash
+NVIDIA_TF32_OVERRIDE=0 RUN_SLOW=1 RUN_PT_TF_CROSS_TESTS=1 \
+py.test -vv tests/models/brand_new_bert/test_modeling_tf_brand_new_bert.py
+```
+
+und wir werden Ihren PR zusammenführen! Herzlichen Glückwunsch zu dem Meilenstein 🎉.
+
+**7. (Optional) Erstellen Sie Demos und teilen Sie sie mit der Welt**
+
+Eine der schwierigsten Aufgaben bei Open-Source ist die Entdeckung. Wie können die anderen Benutzer von der Existenz Ihres fabelhaften TensorFlow-Beitrags erfahren? Mit der richtigen Kommunikation, natürlich! 📣
+
+Es gibt vor allem zwei Möglichkeiten, Ihr Modell mit der Community zu teilen:
+- Erstellen Sie Demos. Dazu gehören Gradio-Demos, Notebooks und andere unterhaltsame Möglichkeiten, Ihr Modell vorzuführen. Wir ermutigen Sie, ein Notebook zu unseren [community-driven demos](https://huggingface.co/docs/transformers/community) hinzuzufügen.
+- Teilen Sie Geschichten in sozialen Medien wie Twitter und LinkedIn. Sie sollten stolz auf Ihre Arbeit sein und Ihre Leistung mit der Community teilen - Ihr Modell kann nun von Tausenden von Ingenieuren und Forschern auf der ganzen Welt genutzt werden 🌍! Wir werden Ihre Beiträge gerne retweeten und Ihnen helfen, Ihre Arbeit mit der Community zu teilen.
+
+
+## Hinzufügen von TensorFlow-Gewichten zum 🤗 Hub
+
+Unter der Annahme, dass die TensorFlow-Modellarchitektur in 🤗 Transformers verfügbar ist, ist die Umwandlung von PyTorch-Gewichten in TensorFlow-Gewichte ein Kinderspiel!
+
+Hier sehen Sie, wie es geht:
+1. Stellen Sie sicher, dass Sie in Ihrem Terminal bei Ihrem Hugging Face Konto angemeldet sind. Sie können sich mit dem Befehl `huggingface-cli login` anmelden (Ihre Zugangstoken finden Sie [hier](https://huggingface.co/settings/tokens)).
+2. Führen Sie `transformers-cli pt-to-tf --model-name foo/bar` aus, wobei `foo/bar` der Name des Modell-Repositorys ist, das die PyTorch-Gewichte enthält, die Sie konvertieren möchten.
+3. Markieren Sie `@joaogante` und `@Rocketknight1` in dem 🤗 Hub PR, den der obige Befehl gerade erstellt hat.
+
+Das war's! 🎉
+
+
+## Fehlersuche in verschiedenen ML-Frameworks 🐛
+
+Irgendwann, wenn Sie eine neue Architektur hinzufügen oder TensorFlow-Gewichte für eine bestehende Architektur erstellen, stoßen Sie vielleicht auf Fehler, die sich über Unstimmigkeiten zwischen PyTorch und TensorFlow beschweren. Sie könnten sich sogar dazu entschließen, den Modellarchitektur-Code für die beiden Frameworks zu öffnen, und stellen fest, dass sie identisch aussehen. Was ist denn da los? 🤔
+
+Lassen Sie uns zunächst darüber sprechen, warum es wichtig ist, diese Diskrepanzen zu verstehen. Viele Community-Mitglieder werden 🤗 Transformers-Modelle direkt verwenden und vertrauen darauf, dass sich unsere Modelle wie erwartet verhalten. Wenn es eine große Diskrepanz zwischen den beiden Frameworks gibt, bedeutet dies, dass das Modell nicht der Referenzimplementierung für mindestens eines der Frameworks folgt. Dies kann zu stillen Fehlern führen, bei denen das Modell zwar läuft, aber eine schlechte Leistung aufweist. Das ist wohl schlimmer als ein Modell, das überhaupt nicht läuft! Aus diesem Grund streben wir an, dass die Abweichung zwischen den Frameworks in allen Phasen des Modells kleiner als `1e-5` ist.
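+
+Ein minimales Beispiel (eine Skizze unter der Annahme, dass sich der Checkpoint in beiden Frameworks laden lässt; BERT dient hier nur als Platzhalter für Ihr Modell), mit dem Sie diese Abweichung selbst messen können:
+
+```py
+import numpy as np
+from transformers import AutoTokenizer, BertModel, TFBertModel
+
+checkpoint = "google-bert/bert-base-uncased"  # Beispiel; ersetzen Sie ihn durch Ihr Modell
+tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+pt_model = BertModel.from_pretrained(checkpoint)
+tf_model = TFBertModel.from_pretrained(checkpoint, from_pt=True)
+
+text = "Hello, TensorFlow!"
+pt_out = pt_model(**tokenizer(text, return_tensors="pt")).last_hidden_state.detach().numpy()
+tf_out = tf_model(**tokenizer(text, return_tensors="tf")).last_hidden_state.numpy()
+
+# Maximale absolute Abweichung zwischen den beiden Frameworks
+print(np.abs(pt_out - tf_out).max())  # sollte in der Größenordnung von 1e-5 oder darunter liegen
+```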
+ +Wie bei anderen numerischen Problemen auch, steckt der Teufel im Detail. Und wie bei jedem detailorientierten Handwerk ist die geheime +Zutat hier Geduld. Hier ist unser Vorschlag für den Arbeitsablauf, wenn Sie auf diese Art von Problemen stoßen: +1. Lokalisieren Sie die Quelle der Abweichungen. Das Modell, das Sie konvertieren, hat wahrscheinlich bis zu einem gewissen Punkt nahezu identische innere Variablen. + bestimmten Punkt. Platzieren Sie `Breakpoint()`-Anweisungen in den Architekturen der beiden Frameworks und vergleichen Sie die Werte der + numerischen Variablen von oben nach unten, bis Sie die Quelle der Probleme gefunden haben. +2. Nachdem Sie nun die Ursache des Problems gefunden haben, setzen Sie sich mit dem 🤗 Transformers-Team in Verbindung. Es ist möglich + dass wir ein ähnliches Problem schon einmal gesehen haben und umgehend eine Lösung anbieten können. Als Ausweichmöglichkeit können Sie beliebte Seiten + wie StackOverflow und GitHub-Probleme. +3. Wenn keine Lösung in Sicht ist, bedeutet das, dass Sie tiefer gehen müssen. Die gute Nachricht ist, dass Sie das Problem gefunden haben. + Problem ausfindig gemacht haben, so dass Sie sich auf die problematische Anweisung konzentrieren und den Rest des Modells ausblenden können! Die schlechte Nachricht ist + dass Sie sich in die Quellimplementierung der besagten Anweisung einarbeiten müssen. In manchen Fällen finden Sie vielleicht ein + Problem mit einer Referenzimplementierung - verzichten Sie nicht darauf, ein Problem im Upstream-Repository zu öffnen. + +In einigen Fällen können wir nach Rücksprache mit dem 🤗 Transformers-Team zu dem Schluss kommen, dass die Behebung der Abweichung nicht machbar ist. +Wenn die Abweichung in den Ausgabeschichten des Modells sehr klein ist (aber möglicherweise groß in den versteckten Zuständen), können wir +könnten wir beschließen, sie zu ignorieren und das Modell zu verteilen. Die oben erwähnte CLI `pt-to-tf` hat ein `--max-error` +Flag, um die Fehlermeldung bei der Gewichtskonvertierung zu überschreiben. diff --git a/docs/source/de/autoclass_tutorial.mdx b/docs/source/de/autoclass_tutorial.md similarity index 91% rename from docs/source/de/autoclass_tutorial.mdx rename to docs/source/de/autoclass_tutorial.md index 95247cd04ba0ce..5dea87ca552c1a 100644 --- a/docs/source/de/autoclass_tutorial.mdx +++ b/docs/source/de/autoclass_tutorial.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Vortrainierte Instanzen mit einer AutoClass laden @@ -16,7 +20,7 @@ Bei so vielen verschiedenen Transformator-Architekturen kann es eine Herausforde -Denken Sie daran, dass sich die Architektur auf das Skelett des Modells bezieht und die Checkpoints die Gewichte für eine bestimmte Architektur sind. Zum Beispiel ist [BERT](https://huggingface.co/bert-base-uncased) eine Architektur, während `bert-base-uncased` ein Checkpoint ist. Modell ist ein allgemeiner Begriff, der entweder Architektur oder Prüfpunkt bedeuten kann. 
+Denken Sie daran, dass sich die Architektur auf das Skelett des Modells bezieht und die Checkpoints die Gewichte für eine bestimmte Architektur sind. Zum Beispiel ist [BERT](https://huggingface.co/google-bert/bert-base-uncased) eine Architektur, während `google-bert/bert-base-uncased` ein Checkpoint ist. Modell ist ein allgemeiner Begriff, der entweder Architektur oder Prüfpunkt bedeuten kann. @@ -36,7 +40,7 @@ Laden Sie einen Tokenizer mit [`AutoTokenizer.from_pretrained`]: ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") ``` Dann tokenisieren Sie Ihre Eingabe wie unten gezeigt: @@ -84,7 +88,7 @@ Mit den `AutoModelFor`-Klassen können Sie schließlich ein vortrainiertes Model ```py >>> from transformers import AutoModelForSequenceClassification ->>> model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") +>>> model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` Sie können denselben Prüfpunkt problemlos wiederverwenden, um eine Architektur für eine andere Aufgabe zu laden: @@ -92,7 +96,7 @@ Sie können denselben Prüfpunkt problemlos wiederverwenden, um eine Architektur ```py >>> from transformers import AutoModelForTokenClassification ->>> model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased") +>>> model = AutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` @@ -111,7 +115,7 @@ Mit den Klassen `TFAutoModelFor` schließlich können Sie ein vortrainiertes Mod ```py >>> from transformers import TFAutoModelForSequenceClassification ->>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") +>>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` Sie können denselben Prüfpunkt problemlos wiederverwenden, um eine Architektur für eine andere Aufgabe zu laden: @@ -119,7 +123,7 @@ Sie können denselben Prüfpunkt problemlos wiederverwenden, um eine Architektur ```py >>> from transformers import TFAutoModelForTokenClassification ->>> model = TFAutoModelForTokenClassification.from_pretrained("distilbert-base-uncased") +>>> model = TFAutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` Im Allgemeinen empfehlen wir, die Klasse "AutoTokenizer" und die Klasse "TFAutoModelFor" zu verwenden, um vortrainierte Instanzen von Modellen zu laden. Dadurch wird sichergestellt, dass Sie jedes Mal die richtige Architektur laden. Im nächsten [Tutorial] (Vorverarbeitung) erfahren Sie, wie Sie Ihren neu geladenen Tokenizer, Feature Extractor und Prozessor verwenden, um einen Datensatz für die Feinabstimmung vorzuverarbeiten. diff --git a/docs/source/de/contributing.md b/docs/source/de/contributing.md new file mode 100644 index 00000000000000..4abc301766ee72 --- /dev/null +++ b/docs/source/de/contributing.md @@ -0,0 +1,334 @@ + + +# Zu 🤗 Transformers beitragen + +Jeder ist willkommen, einen Beitrag zu leisten, und wir schätzen den Beitrag jedes Einzelnen. Codebeiträge sind nicht der einzige Weg, der Community zu helfen. Fragen zu beantworten, anderen zu helfen und die Dokumentation zu verbessern, sind ebenfalls äußerst wertvoll. + +Es hilft uns auch, wenn Sie das Projekt weiterempfehlen! 
Erwähnen Sie die Bibliothek in Blogposts über die großartigen Projekte, die sie ermöglicht hat, tweeten Sie, wenn sie Ihnen geholfen hat, oder hinterlassen Sie dem Repository ein ⭐️, um Danke zu sagen. + +Wie auch immer Sie sich entscheiden beizutragen, seien Sie achtsam und respektieren Sie unseren [Verhaltenskodex](https://github.com/huggingface/transformers/blob/main/CODE_OF_CONDUCT.md). + +**Dieser Leitfaden wurde stark durch den fantastischen [scikit-learn-Leitfaden für Beiträge](https://github.com/scikit-learn/scikit-learn/blob/main/CONTRIBUTING.md) inspiriert.** + +## Beitragsmöglichkeiten + +Es gibt mehrere Wege, wie Sie zu 🤗 Transformers beitragen können: + +* Beheben Sie bestehende Probleme im vorhandenen Code. +* Erstellen Sie Issues im Zusammenhang mit Fehlern oder gewünschten neuen Funktionen. +* Implementieren Sie neue Modelle. +* Tragen Sie zu den Beispielen oder zur Dokumentation bei. + +Wenn Sie nicht wissen, wo Sie anfangen sollen, gibt es eine spezielle Liste von [Good First Issues](https://github.com/huggingface/transformers/contribute). Sie bietet Ihnen eine Liste offener und anfängerfreundlicher Probleme und hilft Ihnen, einen ersten Beitrag zu Open-Source zu leisten. Idealerweise erstellen Sie eine Pull-Anfrage und verlinken sie mit dem Issue, an dem Sie arbeiten möchten. Wir versuchen, erstellte PRs bevorzugt zu behandeln, da wir so den Fortschritt leicht verfolgen können, und die Option besteht, dass jemand anderes den PR übernehmen kann, falls der Beitragende keine Zeit mehr hat. + +Für etwas mehr Herausforderung, können Sie auch einen Blick auf die Liste der [Good Second Issues](https://github.com/huggingface/transformers/labels/Good%20Second%20Issue) werfen. Generell gilt: Legen Sie los, wenn Sie sich den Anforderungen gewachsen sehen und wir helfen Ihnen dabei! 🚀 + +> Alle Beiträge sind für die Community gleichermaßen wertvoll. 🥰 + +## Bestehende Probleme beheben + +Wenn Ihnen ein Problem im vorhandenen Code auffällt und Sie eine Lösung im Sinn haben, können Sie gerne einen Beitrag leisten und [eine Pull-Anfrage erstellen](#eine-pull-anfrage-erstellen)! + +## Ein fehlerspezifisches Issue oder eine Feature-Anfrage erstellen + +Tun Sie Ihr Bestes, diesen Richtlinien zu folgen, wenn Sie ein fehlerspezifisches Issue erstellen oder eine Feature-Anfrage einreichen. Das macht es uns leichter, Ihnen schnell und mit gutem Feedback zu antworten. + +### Haben Sie einen Fehler gefunden? + +Die 🤗 Transformers-Bibliothek verdankt ihre Robustheit und Zuverlässigkeit aller Nutzer, die frisch entdeckte Probleme melden. + +Wir würden es wirklich schätzen, wenn Sie **sicherstellen könnten, dass der Fehler noch nicht gemeldet wurde** (verwenden Sie die Suchleiste auf GitHub unter Issues), bevor Sie ein Issue erstellen. Ihr Problem sollte sich auch auf Fehler in der Bibliothek selbst und nicht auf Ihren eigenen Code beziehen. Wenn Sie sich nicht sicher sind, ob der Fehler in Ihrem eigenen Code oder der Bibliothek liegt, fragen Sie bitte zuerst im [Forum](https://discuss.huggingface.co/) nach. Das hilft uns, schneller auf Probleme im Zusammenhang mit der Bibliothek zu reagieren, anstatt auf allgemeine Fragen. + +Wenn Sie sich vergewissert haben, dass der Fehler noch nicht gemeldet wurde, geben Sie bitte die folgenden Informationen in Ihrem Issue an, damit wir es schnell beheben können: + +* Ihr **Betriebssystem und Version** sowie die Versionen von **Python**, **PyTorch** und **TensorFlow**, falls zutreffend. 
+* Ein kurzes und unabhängiges Code-Snippet, das es uns ermöglicht, den Fehler in weniger als 30 Sekunden nachzustellen. +* Den *vollständigen* Traceback, wenn eine Ausnahme geworfen wird. +* Fügen Sie weitere hilfreiche Informationen, wie z. B. Screenshots, an. + +Um das Betriebssystem und die Softwareversionen automatisch auszugeben, führen Sie den folgenden Befehl aus: + +```bash +transformers-cli env +``` + +Sie können denselben Befehl auch im Hauptverzeichnis des Repositorys ausführen: + +```bash +python src/transformers/commands/transformers_cli.py env +``` + +### Möchten Sie eine neue Funktion? + +Wenn Sie eine bestimmte neue Funktion in 🤗 Transformers sehen möchten, erstellen Sie bitte ein Issue und fügen Sie eine Beschreibung hinzu: + +1. Was ist die *Motivation* hinter dieser Funktion? Steht sie in Zusammenhang mit einem Problem oder einer Frustration mit der Bibliothek? Ist es eine Funktion, die Sie für ein Projekt benötigen? Ist es etwas, an dem Sie gearbeitet haben und denken, dass es der Community nutzen könnte? + + Was auch immer es ist, wir würden uns freuen, davon zu hören! + +1. Beschreiben Sie Ihre gewünschte Funktion so detailliert wie möglich. Je mehr Sie uns darüber erzählen können, desto besser können wir Ihnen helfen. +1. Stellen Sie einen *Code-Schnipsel* bereit, der die Funktionsweise demonstriert. +1. Falls die Funktion auf einem Paper beruht, verlinken Sie dieses bitte. + +Wenn Ihr Issue gut geschrieben ist, sind wir zum Zeitpunkt seiner Erstellung bereits zu 80 % fertig. + +Wir haben [Vorlagen](https://github.com/huggingface/transformers/tree/main/templates) hinzugefügt, um Ihnen den Start Ihres Issues zu erleichtern. + +## Möchten Sie ein neues Modell implementieren? + +Es werden ständig neue Modelle veröffentlicht. Wenn Sie ein neues Modell implementieren möchten, geben Sie bitte folgende Informationen an: + +* Eine kurze Beschreibung des Modells und einen Link zum Paper. +* Link zur Implementierung, falls sie Open-Source ist. +* Link zu den Modellgewichten, falls verfügbar. + +Lassen Sie es uns wissen, wenn Sie bereit sind, das Modell selbst beizutragen. Dann können wir Ihnen helfen, es zu 🤗 Transformers hinzuzufügen! + +Wir haben eine [detaillierte Anleitung und Vorlagen](https://github.com/huggingface/transformers/tree/main/templates) hinzugefügt, um Ihnen das Hinzufügen eines neuen Modells zu erleichtern, und wir haben auch einen technischen Leitfaden dazu, [wie man ein Modell zu 🤗 Transformers hinzufügt](https://huggingface.co/docs/transformers/add_new_model). + +## Möchten Sie die Dokumentation erweitern? + +Wir sind immer auf der Suche nach Verbesserungen, die die Dokumentation klarer und präziser machen. Bitte teilen Sie uns Verbesserungsvorschläge mit, wie z. B. Tippfehler und fehlende, unklare oder ungenaue Inhalte. Wir übernehmen gerne die Änderungen oder helfen Ihnen, einen Beitrag zu leisten, wenn Sie daran interessiert sind! + +Für weitere Einzelheiten darüber, wie man die Dokumentation generiert, erstellt und schreibt, werfen Sie einen Blick auf das [README](https://github.com/huggingface/transformers/tree/main/docs) der Dokumentation. + +## Eine Pull-Anfrage erstellen + +Bevor Sie irgendwelchen Code schreiben, empfehlen wir Ihnen dringend, die bestehenden PRs oder Issues zu durchsuchen, um sicherzustellen, dass niemand bereits an diesem Thema arbeitet. Wenn Sie sich unsicher sind, ist es immer eine gute Idee, nach Feedback in einem neuen Issue zu fragen. + +Sie benötigen grundlegende `git`-Kenntnisse, um zu 🤗 Transformers beizutragen. 
Obwohl `git` nicht das einfachste Werkzeug ist, hat es ein sehr gutes Handbuch. Geben Sie `git --help` in eine Shell ein und genießen Sie es! Wenn Sie Bücher bevorzugen, ist [Pro Git](https://git-scm.com/book/en/v2) eine gute Anlaufstelle. + +Sie benötigen **[Python 3.8](https://github.com/huggingface/transformers/blob/main/setup.py#L426)** oder höher, um zu 🤗 Transformers beizutragen. Folgen Sie den nachstehenden Schritten, um mit dem Beitrag zu beginnen: + +1. Forken Sie das [Repository](https://github.com/huggingface/transformers), indem Sie auf den **[Fork](https://github.com/huggingface/transformers/fork)**-Button auf der Seite des Repositorys klicken. Dadurch wird eine Kopie des Codes auf Ihrem GitHub-Account erstellt. + +1. Klonen Sie Ihren Fork auf Ihre lokale Festplatte und fügen Sie das ursprüngliche Repository als Remote hinzu: + + ```bash + git clone git@github.com:/transformers.git + cd transformers + git remote add upstream https://github.com/huggingface/transformers.git + ``` + +1. Erstellen Sie einen neuen Branch, um Ihre Änderungen zu speichern: + + ```bash + git checkout -b a-descriptive-name-for-my-changes + ``` + + 🚨 Arbeiten Sie **nicht** auf dem `main` Branch! + +1. Richten Sie eine Entwicklungsumgebung ein, indem Sie den folgenden Befehl in einer virtuellen Umgebung ausführen: + + ```bash + pip install -e ".[dev]" + ``` + + Wenn 🤗 Transformers bereits in der virtuellen Umgebung installiert war, entfernen Sie es mit `pip uninstall transformers`, bevor Sie es im bearbeitbaren Modus mit dem `-e` Flag neu installieren. + + Abhängig von Ihrem Betriebssystem und durch die wachsende Anzahl der optionalen Abhängigkeiten von Transformers könnten Sie mit diesem Befehl einen Fehler verursachen. Wenn das der Fall ist, stellen Sie sicher, dass Sie ihr bevorzugtes Deep-Learning-Framework (PyTorch, TensorFlow und/oder Flax) installieren und anschließend den folgenden Befehl ausführen: + + ```bash + pip install -e ".[quality]" + ``` + + Dies sollte für die meisten Anwendungsfälle ausreichend sein. + +1. Entwickeln Sie die Funktionen in Ihrem Branch. + + Während Sie an Ihrem Code arbeiten, sollten Sie sicherstellen, dass die Test-Suite erfolgreich durchläuft. Führen Sie die von Ihren Änderungen betroffenen Tests wie folgt aus: + + ```bash + pytest tests/.py + ``` + + Weitere Informationen über Tests finden Sie in der Anleitung zum Thema [Testen](https://huggingface.co/docs/transformers/testing). + + 🤗 Transformers stützt sich auf `black` und `ruff`, um seinen Quellcode konsistent zu formatieren. Nachdem Sie Änderungen vorgenommen haben, wenden Sie automatische Stilkorrekturen und Codeprüfungen, die nicht automatisiert werden können, in einem Schritt an: + + ```bash + make fixup + ``` + + Dieser Task ist optimiert, nur mit Dateien zu arbeiten, die von Ihrer PR modifiziert wurden. + + Wenn Sie die Prüfungen nacheinander ausführen möchten, wendet der folgende Befehl die Stilkorrekturen an: + + ```bash + make style + ``` + + 🤗 Transformers verwendet auch `ruff` und einige benutzerdefinierte Skripte, um auf Programmierfehler zu prüfen. Qualitätskontrollen werden von der CI durchgeführt, aber Sie können die gleichen Überprüfungen auch selbst ausführen: + + ```bash + make quality + ``` + + Abschließend haben wir viele Skripte, die sicherstellen, dass wir alle betroffenen Dateien aktualisieren, wenn wir ein neues Modell hinzufügen. 
Sie können diese wie folgt ausführen: + + ```bash + make repo-consistency + ``` + + Um mehr über diese Prüfungen zu erfahren und wie man mit ihnen Probleme behebt, lesen Sie den Leitfaden zu [Überprüfungen bei einer Pull-Anfrage](https://huggingface.co/docs/transformers/pr_checks). + + Wenn Sie Dokumente im Verzeichnis `docs/source` ändern, stellen Sie sicher, dass die Dokumentation noch generiert werden kann. Diese Prüfung wird auch im CI laufen, wenn Sie eine Pull-Anfrage erstellen. Um eine lokale Prüfung durchzuführen, müssen Sie den Dukumentation-Builder installieren: + + ```bash + pip install ".[docs]" + ``` + + Führen Sie den folgenden Befehl im Hauptverzeichnis des Repositorys aus: + + ```bash + doc-builder build transformers docs/source/en --build_dir ~/tmp/test-build + ``` + + Dadurch wird die Dokumentation im Ordner `~/tmp/test-build` erstellt, wo Sie die erzeugten Markdown-Dateien mit Ihrem bevorzugten Editor überprüfen können. Sie können auch eine Vorschau der Dokumentation auf GitHub sehen, wenn Sie eine Pull-Anfrage öffnen. + + Wenn Sie mit Ihren Änderungen zufrieden sind, fügen Sie die geänderten Dateien mit `git add` hinzu und speichern Sie Ihre Änderungen lokal mit `git commit`: + + ```bash + git add modified_file.py + git commit + ``` + + Bitte achten Sie darauf, [gute Commit-Nachrichten](https://chris.beams.io/posts/git-commit/) zu schreiben, um die von Ihnen vorgenommenen Änderungen klar zu kommunizieren! + + Um Ihre Kopie des Codes auf dem aktuellen Stand des ursprünglichen Repositorys zu halten, rebasen Sie Ihren Branch auf `upstream/branch` *bevor* Sie eine Pull-Anfrage öffnen oder falls Sie von einem Maintainer dazu aufgefordert werden: + + ```bash + git fetch upstream + git rebase upstream/main + ``` + + Pushen Sie Ihre Änderungen in Ihrem Branch: + + ```bash + git push -u origin a-descriptive-name-for-my-changes + ``` + + Wenn Sie bereits eine Pull-Anfrage erstellt haben, müssen Sie den Push mit dem `--force` Flag erzwingen. Andernfalls, wenn die Pull-Anfrage noch nicht erstellt wurde, können Sie Ihre Änderungen normal pushen. + +1. Jetzt können Sie zu Ihrem Fork des Repositorys auf GitHub gehen und auf **Pull-Anfrage** klicken, um eine Pull-Anfrage zu erstellen. Stellen Sie sicher, dass Sie alle Punkte auf unserer [Checkliste](#checkliste-für-pull-anfragen) unten abhaken. Wenn Sie fertig sind, können Sie Ihre Änderungen zur Überprüfung an die Projektverantwortlichen senden. + +1. Es ist kein Problem, wenn die Maintainer Änderungen beantragen, das geschieht auch bei unseren Kernmitarbeitern! Damit jeder die Änderungen in der Pull-Anfrage sehen kann, arbeiten Sie in Ihrem lokalen Branch und pushen die Änderungen zu Ihrem Fork. Sie werden automatisch in der Pull-Anfrage erscheinen. + +### Checkliste für Pull-Anfragen + +☐ Der Titel der Pull-Anfrage sollte Ihren Beitrag zusammenfassen.
+☐ Wenn Ihre Pull-Anfrage ein bestimmtes Issue bearbeitet, erwähnen Sie bitte die zugehörige Nummer in der Beschreibung der Pull-Anfrage, sodass diese verlinkt sind (und Personen, die das Issue lesen, wissen, dass Sie daran arbeiten).
+☐ Um eine fortlaufende Bearbeitung anzuzeigen, versehen Sie bitte den Titel mit einem `[WIP]` Präfix. Diese sind nützlich, um doppelte Arbeit zu verhindern und sie von PRs abzuheben, die bereit zum Zusammenführen sind.
+☐ Stellen Sie sicher, dass existierende Tests bestanden werden.
+☐ Wenn Sie eine neue Funktion hinzufügen, erstellen Sie auch Tests dafür.
+ +* Wenn Sie ein neues Modell hinzufügen, stellen Sie sicher, dass Sie `ModelTester.all_model_classes = (MyModel, MyModelWithLMHead,...)` verwenden, um die gemeinsamen Tests auszulösen. +* Wenn Sie neue `@slow` Tests hinzufügen, stellen Sie mit `RUN_SLOW=1 python -m pytest tests/models/my_new_model/test_my_new_model.py` sicher, dass diese erfolgreich durchlaufen. +* Wenn Sie einen neuen Tokenizer hinzufügen, schreiben Sie Tests und stellen Sie mit `RUN_SLOW=1 python -m pytest tests/models/{your_model_name}/test_tokenization_{your_model_name}.py` sicher, dass diese erfolgreich durchlaufen. +* CircleCI führt die langsamen Tests nicht aus, aber GitHub Actions tut dies jede Nacht!
+ +☐ Alle public Methoden müssen informative Docstrings haben (siehe [`modeling_bert.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py) als Beispiel).
+☐ Aufgrund des schnell wachsenden Repositorys fügen Sie bitte keine Bilder, Videos oder andere Nicht-Textdateien hinzu, die das Repository erheblich belasten würden. Verwenden Sie stattdessen ein Hub-Repository wie [`hf-internal-testing`](https://huggingface.co/hf-internal-testing), um diese Dateien zu hosten und sie per URL zu verlinken. Wir empfehlen Bilder, die zur Dokumentation gehören, im folgenden Repository abzulegen: [huggingface/documentation-images](https://huggingface.co/datasets/huggingface/documentation-images). Sie können eine PR in diesem Datasets-Repository erstellen und ein Hugging-Face-Mitglied bitten, sie zu mergen. + +Um mehr über die Prüfungen zu erfahren, die bei einer Pull-Anfrage ausgelöst werden, lesen Sie unseren Leitfaden zu [Überprüfungen bei einer Pull-Anfrage](https://huggingface.co/docs/transformers/pr_checks). + +### Tests + +Eine umfangreiche Test-Suite ist enthalten, um das Verhalten der Bibliothek und mehrerer Beispiele zu testen. Tests für die Bibliothek und Beispiele finden Sie jeweils im [tests](https://github.com/huggingface/transformers/tree/main/tests) und im [examples](https://github.com/huggingface/transformers/tree/main/examples) Ordner. + +Wir bevorzugen `pytest` und `pytest-xdist`, weil es schneller ist. Geben Sie einen *Pfad zu einem Unterordner oder einer Testdatei* vom Hauptverzeichnis des Repositorys aus an, um den Test auszuführen: + +```bash +python -m pytest -n auto --dist=loadfile -s -v ./tests/models/my_new_model +``` + +Analog für den `examples` Ordner, geben Sie einen *Pfad zu einem Unterordner oder einer Testdatei* an, um den Test auszuführen. Z. B. führt der folgende Befehl den Test des Unterordners für Textklassifizierung im PyTorch `examples` Ordner durch: + +```bash +pip install -r examples/xxx/requirements.txt # nur beim ersten Mal erforderlich +python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/text-classification +``` + +Tatsächlich ist dies genau, wie unsere `make test` und `make test-examples` Befehle implementiert sind (abgesehen von `pip install`)! + +Sie können auch eine kleinere Anzahl an Tests angeben, um nur die Funktion, an der Sie arbeiten, zu testen. + +Standardmäßig werden langsame Tests übersprungen, aber Sie können die Umgebungsvariable `RUN_SLOW` auf `yes` setzen, um sie auszuführen. Dies wird den Download vieler Gigabyte an Modellen starten - stellen Sie also sicher, dass Sie sowohl genügend Festplattenspeicher als auch eine gute Internetverbindung oder die nötige Geduld haben! + + + +Vergessen Sie nicht, einen *Pfad zu einem Unterordner oder einer Testdatei* anzugeben, um den Test auszuführen. Sonst führen Sie alle Tests im `tests` oder `examples` Ordner aus, was sehr lange dauern wird! + + + +```bash +RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./tests/models/my_new_model +RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/text-classification +``` + +Wie bei den langsamen Tests gibt es auch andere Umgebungsvariablen, die standardmäßig beim Testen nicht gesetzt sind: + +* `RUN_CUSTOM_TOKENIZERS`: Aktiviert Tests für benutzerdefinierte Tokenizer. +* `RUN_PT_FLAX_CROSS_TESTS`: Aktiviert Tests für die Integration von PyTorch + Flax. +* `RUN_PT_TF_CROSS_TESTS`: Aktiviert Tests für die Integration von TensorFlow + PyTorch. + +Weitere Umgebungsvariablen und zusätzliche Informationen finden Sie in der [testing_utils.py](src/transformers/testing_utils.py). + +🤗 Transformers verwendet `pytest` nur als Test-Runner. 
Es verwendet keine `pytest`-spezifischen Funktionen in der Test-Suite selbst. + +Das bedeutet, `unittest` wird vollständig unterstützt. Folgend wird beschrieben, wie man Tests mit `unittest` ausführt: + +```bash +python -m unittest discover -s tests -t . -v +python -m unittest discover -s examples -t examples -v +``` + +### Stil-Leitfaden + +Für Docstrings befolgt 🤗 Transformers den [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html). +Lesen Sie unseren [Leitfaden zum Schreiben von Dokumentationen](https://github.com/huggingface/transformers/tree/main/docs#writing-documentation---specification) für weitere Informationen. + +### Entwickeln unter Windows + +Unter Windows (falls Sie nicht im [Windows-Subsystem für Linux](https://learn.microsoft.com/en-us/windows/wsl/) oder WSL arbeiten) müssen Sie git so konfigurieren, dass Windows `CRLF` in Linux `LF` Zeilenenden umgewandelt werden: + +```bash +git config core.autocrlf input +``` + +Eine Möglichkeit, den `make`-Befehl unter Windows auszuführen, ist mit MSYS2: + +1. Laden Sie [MSYS2](https://www.msys2.org/) herunter und installieren Sie es nach `C:\msys64`. +1. Öffnen Sie die Kommandozeile `C:\msys64\msys2.exe` (sie sollte vom **Start**-Menü aus verfügbar sein). +1. Führen Sie den Befehl in der Shell aus: `pacman -Syu` und installieren Sie `make` mit `pacman -S make`. +1. Fügen Sie `C:\msys64\usr\bin` an Ihrer PATH-Umgebungsvariable an. + +Sie können nun `make` aus jedem Terminal heraus verwenden (PowerShell, cmd.exe usw.)! 🎉 + +### Ein geforktes Repository mit dem Haupt-Repository von Hugging Face synchronisieren + +Beim Aktualisieren des main-Branches eines geforkten Repositories beachten Sie bitte die folgenden Schritte, um das Anpingen des Haupt-Repositorys zu vermeiden, was unnötige Verweise in abhängigen PRs vermerkt und beteiligte Entwickler benachrichtigt: + +1. Wenn möglich, vermeiden Sie die Synchronisation mit dem Haupt-Repository über einen Branch und PR im geforkten Repository. Mergen Sie stattdessen direkt in den main-Branch des Forks. +1. Wenn ein PR unbedingt notwendig ist, verwenden Sie die folgenden Schritte, nachdem Sie Ihren Branch ausgecheckt haben: + + ```bash + git checkout -b your-branch-for-syncing + git pull --squash --no-commit upstream main + git commit -m '' + git push --set-upstream origin your-branch-for-syncing + ``` diff --git a/docs/source/de/index.mdx b/docs/source/de/index.md similarity index 96% rename from docs/source/de/index.mdx rename to docs/source/de/index.md index 1bf25e31a7f7b9..5ddabb4e7382e1 100644 --- a/docs/source/de/index.mdx +++ b/docs/source/de/index.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # 🤗 Transformers @@ -52,6 +56,7 @@ Die Bibliothek enthält derzeit JAX-, PyTorch- und TensorFlow-Implementierungen, 1. 
**[ALBERT](model_doc/albert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. +1. **[ALIGN](model_doc/align)** (from Google Research) released with the paper [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918) by Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. 1. **[BART](model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer. 1. **[BARThez](model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis. 1. **[BARTpho](model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen. @@ -72,6 +77,7 @@ Die Bibliothek enthält derzeit JAX-, PyTorch- und TensorFlow-Implementierungen, 1. **[CodeGen](model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong. 1. **[ConvBERT](model_doc/convbert)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan. 1. **[ConvNeXT](model_doc/convnext)** (from Facebook AI) released with the paper [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie. +1. **[ConvNeXTV2](model_doc/convnextv2)** (from Facebook AI) released with the paper [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie. 1. **[CPM](model_doc/cpm)** (from Tsinghua University) released with the paper [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun. 1. **[CTRL](model_doc/ctrl)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher. 1. 
**[CvT](model_doc/cvt)** (from Microsoft) released with the paper [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang. @@ -86,6 +92,7 @@ Die Bibliothek enthält derzeit JAX-, PyTorch- und TensorFlow-Implementierungen, 1. **[DiT](model_doc/dit)** (from Microsoft Research) released with the paper [DiT: Self-supervised Pre-training for Document Image Transformer](https://arxiv.org/abs/2203.02378) by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei. 1. **[DPR](model_doc/dpr)** (from Facebook) released with the paper [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 1. **[DPT](master/model_doc/dpt)** (from Intel Labs) released with the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun. +1. **[EfficientNet](model_doc/efficientnet)** (from Google Research) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan and Quoc V. Le. 1. **[ELECTRA](model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. 1. **[EncoderDecoder](model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn. 1. **[FlauBERT](model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab. @@ -93,11 +100,12 @@ Die Bibliothek enthält derzeit JAX-, PyTorch- und TensorFlow-Implementierungen, 1. **[FNet](model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. 1. **[Funnel Transformer](model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le. 1. **[GLPN](model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim. -1. **[GPT](model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. +1. **[GPT](model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://openai.com/research/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. 1. 
**[GPT Neo](model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. 1. **[GPT NeoX](model_doc/gpt_neox)** (from EleutherAI) released with the paper [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745) by Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach -1. **[GPT-2](model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**. +1. **[GPT-2](model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://openai.com/research/better-language-models/) by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. 1. **[GPT-J](model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki. +1. **[GPTSAN-japanese](model_doc/gptsan-japanese)** released in the repository [tanreinama/GPTSAN](https://github.com/tanreinama/GPTSAN/blob/main/report/model.md) by Toshiyuki Sakamoto(tanreinama). 1. **[GroupViT](model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang. 1. **[Hubert](model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed. 1. **[I-BERT](model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer. @@ -165,6 +173,7 @@ Die Bibliothek enthält derzeit JAX-, PyTorch- und TensorFlow-Implementierungen, 1. **[Transformer-XL](model_doc/transfo-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. 1. **[TrOCR](model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. 1. **[UL2](model_doc/ul2)** (from Google Research) released with the paper [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) by Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler +1. 
**[UMT5](model_doc/umt5)** (from Google Research) released with the paper [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) by Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant. 1. **[UniSpeech](model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang. 1. **[UniSpeechSat](model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu. 1. **[VAN](model_doc/van)** (from Tsinghua University and Nankai University) released with the paper [Visual Attention Network](https://arxiv.org/abs/2202.09741) by Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu. @@ -209,7 +218,7 @@ Flax), PyTorch, und/oder TensorFlow haben. | BigBird-Pegasus | ❌ | ❌ | ✅ | ❌ | ❌ | | Blenderbot | ✅ | ✅ | ✅ | ✅ | ✅ | | BlenderbotSmall | ✅ | ✅ | ✅ | ✅ | ✅ | -| BLOOM | ❌ | ✅ | ✅ | ❌ | ❌ | +| BLOOM | ❌ | ✅ | ✅ | ❌ | ✅ | | CamemBERT | ✅ | ✅ | ✅ | ✅ | ❌ | | CANINE | ✅ | ❌ | ✅ | ❌ | ❌ | | CLIP | ✅ | ✅ | ✅ | ✅ | ✅ | @@ -279,9 +288,9 @@ Flax), PyTorch, und/oder TensorFlow haben. | RAG | ✅ | ❌ | ✅ | ✅ | ❌ | | REALM | ✅ | ✅ | ✅ | ❌ | ❌ | | Reformer | ✅ | ✅ | ✅ | ❌ | ❌ | -| RegNet | ❌ | ❌ | ✅ | ✅ | ❌ | +| RegNet | ❌ | ❌ | ✅ | ✅ | ✅ | | RemBERT | ✅ | ✅ | ✅ | ✅ | ❌ | -| ResNet | ❌ | ❌ | ✅ | ✅ | ❌ | +| ResNet | ❌ | ❌ | ✅ | ✅ | ✅ | | RetriBERT | ✅ | ✅ | ✅ | ❌ | ❌ | | RoBERTa | ✅ | ✅ | ✅ | ✅ | ✅ | | RoFormer | ✅ | ✅ | ✅ | ✅ | ✅ | diff --git a/docs/source/de/installation.mdx b/docs/source/de/installation.md similarity index 94% rename from docs/source/de/installation.mdx rename to docs/source/de/installation.md index 3103830ee7fd8a..55d0f2d8512d47 100644 --- a/docs/source/de/installation.mdx +++ b/docs/source/de/installation.md @@ -12,6 +12,10 @@ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Installation @@ -90,7 +94,7 @@ Installieren wir 🤗 Transformers aus dem Quellcode mit dem folgenden Befehl: pip install git+https://github.com/huggingface/transformers ``` -Dieser Befehl installiert die aktuelle `main` Version und nicht die neueste `stable` Version. Die `main`-Version ist nützlich, um mit den neuesten Entwicklungen Schritt zu halten. Zum Beispiel, wenn ein Fehler seit der letzten offiziellen Version behoben wurde, aber eine neue Version noch nicht veröffentlicht wurde. Das bedeutet jedoch, dass die "Hauptversion" nicht immer stabil ist. Wir bemühen uns, die Hauptversion einsatzbereit zu halten, und die meisten Probleme werden normalerweise innerhalb weniger Stunden oder eines Tages behoben. 
Wenn Sie auf ein Problem stoßen, öffnen Sie bitte ein [Issue] (https://github.com/huggingface/transformers/issues), damit wir es noch schneller beheben können! +Dieser Befehl installiert die aktuelle `main` Version und nicht die neueste `stable` Version. Die `main`-Version ist nützlich, um mit den neuesten Entwicklungen Schritt zu halten. Zum Beispiel, wenn ein Fehler seit der letzten offiziellen Version behoben wurde, aber eine neue Version noch nicht veröffentlicht wurde. Das bedeutet jedoch, dass die "Hauptversion" nicht immer stabil ist. Wir bemühen uns, die Hauptversion einsatzbereit zu halten, und die meisten Probleme werden normalerweise innerhalb weniger Stunden oder eines Tages behoben. Wenn Sie auf ein Problem stoßen, öffnen Sie bitte ein [Issue](https://github.com/huggingface/transformers/issues), damit wir es noch schneller beheben können! Überprüfen wir, ob 🤗 Transformers richtig installiert wurde, indem Sie den folgenden Befehl ausführen: @@ -135,10 +139,10 @@ Ihre Python-Umgebung wird beim nächsten Ausführen die `main`-Version von 🤗 ## Installation mit conda -Installation von dem conda Kanal `huggingface`: +Installation von dem conda Kanal `conda-forge`: ```bash -conda install -c huggingface transformers +conda install conda-forge::transformers ``` ## Cache Einrichtung @@ -153,7 +157,7 @@ Vorgefertigte Modelle werden heruntergeladen und lokal zwischengespeichert unter Transformers verwendet die Shell-Umgebungsvariablen `PYTORCH_TRANSFORMERS_CACHE` oder `PYTORCH_PRETRAINED_BERT_CACHE`, wenn Sie von einer früheren Iteration dieser Bibliothek kommen und diese Umgebungsvariablen gesetzt haben, sofern Sie nicht die Shell-Umgebungsvariable `TRANSFORMERS_CACHE` angeben. - + ## Offline Modus @@ -169,14 +173,14 @@ Fügen sie [🤗 Datasets](https://huggingface.co/docs/datasets/) zu Ihrem Offli So würden Sie beispielsweise ein Programm in einem normalen Netzwerk mit einer Firewall für externe Instanzen mit dem folgenden Befehl ausführen: ```bash -python examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --dataset_name wmt16 --dataset_config ro-en ... +python examples/pytorch/translation/run_translation.py --model_name_or_path google-t5/t5-small --dataset_name wmt16 --dataset_config ro-en ... ``` Führen Sie das gleiche Programm in einer Offline-Instanz mit aus: ```bash HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \ -python examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --dataset_name wmt16 --dataset_config ro-en ... +python examples/pytorch/translation/run_translation.py --model_name_or_path google-t5/t5-small --dataset_name wmt16 --dataset_config ro-en ... ``` Das Skript sollte nun laufen, ohne sich aufzuhängen oder eine Zeitüberschreitung abzuwarten, da es weiß, dass es nur nach lokalen Dateien suchen soll. @@ -241,6 +245,6 @@ Sobald Ihre Datei heruntergeladen und lokal zwischengespeichert ist, geben Sie d -Weitere Informationen zum Herunterladen von Dateien, die auf dem Hub gespeichert sind, finden Sie im Abschnitt [Wie man Dateien vom Hub herunterlädt] (https://huggingface.co/docs/hub/how-to-downstream). - +Weitere Informationen zum Herunterladen von Dateien, die auf dem Hub gespeichert sind, finden Sie im Abschnitt [Wie man Dateien vom Hub herunterlädt](https://huggingface.co/docs/hub/how-to-downstream). 
+ diff --git a/docs/source/de/llm_tutorial.md b/docs/source/de/llm_tutorial.md new file mode 100644 index 00000000000000..ea4a96632cb1de --- /dev/null +++ b/docs/source/de/llm_tutorial.md @@ -0,0 +1,221 @@ + + + +# Generation with LLMs + +[[open-in-colab]] + +LLMs (Large Language Models) sind die Schlüsselkomponente bei der Texterstellung. Kurz gesagt, bestehen sie aus großen, vortrainierten Transformationsmodellen, die darauf trainiert sind, das nächste Wort (oder genauer gesagt Token) aus einem Eingabetext vorherzusagen. Da sie jeweils ein Token vorhersagen, müssen Sie etwas Aufwändigeres tun, um neue Sätze zu generieren, als nur das Modell aufzurufen - Sie müssen eine autoregressive Generierung durchführen. + +Die autoregressive Generierung ist ein Verfahren zur Inferenzzeit, bei dem ein Modell mit seinen eigenen generierten Ausgaben iterativ aufgerufen wird, wenn einige anfängliche Eingaben vorliegen. In 🤗 Transformers wird dies von der Methode [`~generation.GenerationMixin.generate`] übernommen, die allen Modellen mit generativen Fähigkeiten zur Verfügung steht. + +Dieses Tutorial zeigt Ihnen, wie Sie: + +* Text mit einem LLM generieren +* Vermeiden Sie häufige Fallstricke +* Nächste Schritte, damit Sie das Beste aus Ihrem LLM herausholen können + +Bevor Sie beginnen, stellen Sie sicher, dass Sie alle erforderlichen Bibliotheken installiert haben: + +```bash +pip install transformers bitsandbytes>=0.39.0 -q +``` + + +## Text generieren + +Ein Sprachmodell, das für [causal language modeling](tasks/language_modeling) trainiert wurde, nimmt eine Folge von Text-Token als Eingabe und gibt die Wahrscheinlichkeitsverteilung für das nächste Token zurück. + + +
+<!-- Abbildung: "Forward pass of an LLM" -->
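+Eine minimale Skizze dazu (die Modell-ID `openai-community/gpt2` und der Beispielsatz sind hier nur frei gewählte Annahmen): Ein einzelner Forward-Pass liefert Logits, aus denen sich per Softmax die Wahrscheinlichkeitsverteilung für das nächste Token ablesen lässt.
+
+```py
+>>> # Skizze: Verteilung über das nächste Token nach einem Forward-Pass (Beispielmodell: gpt2)
+>>> import torch
+>>> from transformers import AutoModelForCausalLM, AutoTokenizer
+
+>>> tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
+>>> model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
+
+>>> inputs = tokenizer("A list of colors: red, blue", return_tensors="pt")
+>>> with torch.no_grad():
+...     logits = model(**inputs).logits  # Form: (Batch, Sequenzlänge, Vokabulargröße)
+
+>>> next_token_probs = torch.softmax(logits[0, -1], dim=-1)  # Verteilung für das nächste Token
+>>> top_probs, top_ids = next_token_probs.topk(5)
+>>> tokenizer.convert_ids_to_tokens(top_ids.tolist())  # die fünf wahrscheinlichsten Kandidaten
+```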
+ +Ein wichtiger Aspekt der autoregressiven Generierung mit LLMs ist die Auswahl des nächsten Tokens aus dieser Wahrscheinlichkeitsverteilung. In diesem Schritt ist alles möglich, solange Sie am Ende ein Token für die nächste Iteration haben. Das heißt, es kann so einfach sein wie die Auswahl des wahrscheinlichsten Tokens aus der Wahrscheinlichkeitsverteilung oder so komplex wie die Anwendung von einem Dutzend Transformationen vor der Stichprobenziehung aus der resultierenden Verteilung. + + +
+<!-- Abbildung: "Die autoregressive Generierung wählt iterativ das nächste Token aus einer Wahrscheinlichkeitsverteilung aus, um Text zu erzeugen" -->
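+Zur Veranschaulichung eine stark vereinfachte, gierige Dekodierschleife als Skizze (sie verwendet Modell und Tokenizer aus der Skizze oben; in der Praxis übernimmt [`~generation.GenerationMixin.generate`] diesen Schritt samt Abbruchbedingung):
+
+```py
+>>> # Skizze: manuelle, gierige Auswahl des jeweils wahrscheinlichsten Tokens
+>>> import torch
+
+>>> input_ids = tokenizer("A list of colors: red, blue", return_tensors="pt").input_ids
+>>> for _ in range(10):  # vordefinierte Maximallänge als einfache Abbruchbedingung
+...     logits = model(input_ids).logits
+...     next_token = logits[0, -1].argmax()  # das wahrscheinlichste Token auswählen
+...     if next_token.item() == tokenizer.eos_token_id:  # EOS-Token beendet die Generierung
+...         break
+...     input_ids = torch.cat([input_ids, next_token.view(1, 1)], dim=-1)
+>>> tokenizer.decode(input_ids[0], skip_special_tokens=True)
+```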
+ +Der oben dargestellte Prozess wird iterativ wiederholt, bis eine bestimmte Abbruchbedingung erreicht ist. Im Idealfall wird die Abbruchbedingung vom Modell vorgegeben, das lernen sollte, wann es ein Ende-der-Sequenz-Token (EOS) ausgeben muss. Ist dies nicht der Fall, stoppt die Generierung, wenn eine vordefinierte Maximallänge erreicht ist. + +Damit sich Ihr Modell so verhält, wie Sie es für Ihre Aufgabe erwarten, müssen Sie den Schritt der Token-Auswahl und die Abbruchbedingung richtig einstellen. Aus diesem Grund haben wir zu jedem Modell eine [`~generation.GenerationConfig`]-Datei, die eine gute generative Standardparametrisierung enthält und zusammen mit Ihrem Modell geladen wird. + +Lassen Sie uns über Code sprechen! + + + +Wenn Sie an der grundlegenden Verwendung von LLMs interessiert sind, ist unsere High-Level-Schnittstelle [`Pipeline`](pipeline_tutorial) ein guter Ausgangspunkt. LLMs erfordern jedoch oft fortgeschrittene Funktionen wie Quantisierung und Feinsteuerung des Token-Auswahlschritts, was am besten über [`~generation.GenerationMixin.generate`] erfolgt. Die autoregressive Generierung mit LLMs ist ebenfalls ressourcenintensiv und sollte für einen angemessenen Durchsatz auf einer GPU ausgeführt werden. + + + + +Zunächst müssen Sie das Modell laden. + +```py +>>> from transformers import AutoModelForCausalLM + +>>> model = AutoModelForCausalLM.from_pretrained( +... "openlm-research/open_llama_7b", device_map="auto", load_in_4bit=True +... ) +``` + +Sie werden zwei Flags in dem Aufruf `from_pretrained` bemerken: + + - `device_map` stellt sicher, dass das Modell auf Ihre GPU(s) übertragen wird + - `load_in_4bit` wendet [dynamische 4-Bit-Quantisierung](main_classes/quantization) an, um die Ressourcenanforderungen massiv zu reduzieren + +Es gibt noch andere Möglichkeiten, ein Modell zu initialisieren, aber dies ist eine gute Grundlage, um mit einem LLM zu beginnen. + +Als nächstes müssen Sie Ihre Texteingabe mit einem [tokenizer](tokenizer_summary) vorverarbeiten. + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b") +>>> model_inputs = tokenizer(["A list of colors: red, blue"], return_tensors="pt").to("cuda") +``` + +Die Variable `model_inputs` enthält die tokenisierte Texteingabe sowie die Aufmerksamkeitsmaske. Obwohl [`~generation.GenerationMixin.generate`] sein Bestes tut, um die Aufmerksamkeitsmaske abzuleiten, wenn sie nicht übergeben wird, empfehlen wir, sie für optimale Ergebnisse wann immer möglich zu übergeben. + +Rufen Sie schließlich die Methode [`~generation.GenerationMixin.generate`] auf, um die generierten Token zurückzugeben, die vor dem Drucken in Text umgewandelt werden sollten. + +```py +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'A list of colors: red, blue, green, yellow, black, white, and brown' +``` + +Und das war's! Mit ein paar Zeilen Code können Sie sich die Macht eines LLM zunutze machen. + + +## Häufige Fallstricke + +Es gibt viele [Generierungsstrategien](generation_strategies), und manchmal sind die Standardwerte für Ihren Anwendungsfall vielleicht nicht geeignet. Wenn Ihre Ausgaben nicht mit dem übereinstimmen, was Sie erwarten, haben wir eine Liste der häufigsten Fallstricke erstellt und wie Sie diese vermeiden können. 
+ +```py +>>> from transformers import AutoModelForCausalLM, AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b") +>>> tokenizer.pad_token = tokenizer.eos_token # Llama has no pad token by default +>>> model = AutoModelForCausalLM.from_pretrained( +... "openlm-research/open_llama_7b", device_map="auto", load_in_4bit=True +... ) +``` + +### Generierte Ausgabe ist zu kurz/lang + +Wenn in der Datei [`~generation.GenerationConfig`] nichts angegeben ist, gibt `generate` standardmäßig bis zu 20 Token zurück. Wir empfehlen dringend, `max_new_tokens` in Ihrem `generate`-Aufruf manuell zu setzen, um die maximale Anzahl neuer Token zu kontrollieren, die zurückgegeben werden können. Beachten Sie, dass LLMs (genauer gesagt, [decoder-only models](https://huggingface.co/learn/nlp-course/chapter1/6?fw=pt)) auch die Eingabeaufforderung als Teil der Ausgabe zurückgeben. + + +```py +>>> model_inputs = tokenizer(["A sequence of numbers: 1, 2"], return_tensors="pt").to("cuda") + +>>> # By default, the output will contain up to 20 tokens +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'A sequence of numbers: 1, 2, 3, 4, 5' + +>>> # Setting `max_new_tokens` allows you to control the maximum length +>>> generated_ids = model.generate(**model_inputs, max_new_tokens=50) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'A sequence of numbers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,' +``` + +### Falscher Generierungsmodus + +Standardmäßig und sofern nicht in der Datei [`~generation.GenerationConfig`] angegeben, wählt `generate` bei jeder Iteration das wahrscheinlichste Token aus (gierige Dekodierung). Je nach Aufgabe kann dies unerwünscht sein; kreative Aufgaben wie Chatbots oder das Schreiben eines Aufsatzes profitieren vom Sampling. Andererseits profitieren Aufgaben, bei denen es auf die Eingabe ankommt, wie z.B. Audiotranskription oder Übersetzung, von der gierigen Dekodierung. Aktivieren Sie das Sampling mit `do_sample=True`. Mehr zu diesem Thema erfahren Sie in diesem [Blogbeitrag](https://huggingface.co/blog/how-to-generate). + +```py +>>> # Set seed or reproducibility -- you don't need this unless you want full reproducibility +>>> from transformers import set_seed +>>> set_seed(0) + +>>> model_inputs = tokenizer(["I am a cat."], return_tensors="pt").to("cuda") + +>>> # LLM + greedy decoding = repetitive, boring output +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'I am a cat. I am a cat. I am a cat. I am a cat' + +>>> # With sampling, the output becomes more creative! +>>> generated_ids = model.generate(**model_inputs, do_sample=True) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'I am a cat.\nI just need to be. I am always.\nEvery time' +``` + +### Falsche Auffüllseite + +LLMs sind [decoder-only](https://huggingface.co/learn/nlp-course/chapter1/6?fw=pt)-Architekturen, d.h. sie iterieren weiter über Ihre Eingabeaufforderung. Wenn Ihre Eingaben nicht die gleiche Länge haben, müssen sie aufgefüllt werden. Da LLMs nicht darauf trainiert sind, mit aufgefüllten Token fortzufahren, muss Ihre Eingabe links aufgefüllt werden. Vergessen Sie auch nicht, die Aufmerksamkeitsmaske an generate zu übergeben! 
+ +```py +>>> # The tokenizer initialized above has right-padding active by default: the 1st sequence, +>>> # which is shorter, has padding on the right side. Generation fails. +>>> model_inputs = tokenizer( +... ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt" +... ).to("cuda") +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids[0], skip_special_tokens=True)[0] +'' + +>>> # With left-padding, it works as expected! +>>> tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b", padding_side="left") +>>> tokenizer.pad_token = tokenizer.eos_token # Llama has no pad token by default +>>> model_inputs = tokenizer( +... ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt" +... ).to("cuda") +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'1, 2, 3, 4, 5, 6,' +``` + + + +## Weitere Ressourcen + +Während der Prozess der autoregressiven Generierung relativ einfach ist, kann die optimale Nutzung Ihres LLM ein schwieriges Unterfangen sein, da es viele bewegliche Teile gibt. Für Ihre nächsten Schritte, die Ihnen helfen, tiefer in die LLM-Nutzung und das Verständnis einzutauchen: + + +### Fortgeschrittene Nutzung generieren + +1. [Leitfaden](generation_strategies) zur Steuerung verschiedener Generierungsmethoden, zur Einrichtung der Generierungskonfigurationsdatei und zum Streaming der Ausgabe; +2. API-Referenz zu [`~generation.GenerationConfig`], [`~generation.GenerationMixin.generate`] und [generate-bezogene Klassen](internal/generation_utils). + +### LLM-Ranglisten + +1. [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), das sich auf die Qualität der Open-Source-Modelle konzentriert; +2. [Open LLM-Perf Leaderboard](https://huggingface.co/spaces/optimum/llm-perf-leaderboard), das sich auf den LLM-Durchsatz konzentriert. + +### Latenz und Durchsatz + +1. [Leitfaden](main_classes/quantization) zur dynamischen Quantisierung, der Ihnen zeigt, wie Sie Ihren Speicherbedarf drastisch reduzieren können. + +### Verwandte Bibliotheken + +1. [text-generation-inference](https://github.com/huggingface/text-generation-inference), ein produktionsreifer Server für LLMs; +2. [`optimum`](https://github.com/huggingface/optimum), eine Erweiterung von 🤗 Transformers, die für bestimmte Hardware-Geräte optimiert. diff --git a/docs/source/de/model_sharing.mdx b/docs/source/de/model_sharing.md similarity index 95% rename from docs/source/de/model_sharing.mdx rename to docs/source/de/model_sharing.md index 42b09d40d70ade..6bbb6e10cb4942 100644 --- a/docs/source/de/model_sharing.mdx +++ b/docs/source/de/model_sharing.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Ein Modell teilen @@ -225,4 +229,4 @@ Um sicherzustellen, dass die Benutzer die Fähigkeiten, Grenzen, möglichen Verz * Manuelles Erstellen und Hochladen einer "README.md"-Datei. * Klicken Sie auf die Schaltfläche **Modellkarte bearbeiten** in Ihrem Modell-Repository. 
-Werfen Sie einen Blick auf die DistilBert [model card](https://huggingface.co/distilbert-base-uncased) als gutes Beispiel für die Art von Informationen, die eine Modellkarte enthalten sollte. Weitere Details über andere Optionen, die Sie in der Datei "README.md" einstellen können, wie z.B. den Kohlenstoff-Fußabdruck eines Modells oder Beispiele für Widgets, finden Sie in der Dokumentation [hier](https://huggingface.co/docs/hub/models-cards). \ No newline at end of file +Werfen Sie einen Blick auf die DistilBert [model card](https://huggingface.co/distilbert/distilbert-base-uncased) als gutes Beispiel für die Art von Informationen, die eine Modellkarte enthalten sollte. Weitere Details über andere Optionen, die Sie in der Datei "README.md" einstellen können, wie z.B. den Kohlenstoff-Fußabdruck eines Modells oder Beispiele für Widgets, finden Sie in der Dokumentation [hier](https://huggingface.co/docs/hub/models-cards). \ No newline at end of file diff --git a/docs/source/de/peft.md b/docs/source/de/peft.md new file mode 100644 index 00000000000000..bdc0684d798d3a --- /dev/null +++ b/docs/source/de/peft.md @@ -0,0 +1,216 @@ + + +# Adapter mit 🤗 PEFT laden + +[[open-in-colab]] + +Die [Parameter-Efficient Fine Tuning (PEFT)](https://huggingface.co/blog/peft) Methoden frieren die vorab trainierten Modellparameter während der Feinabstimmung ein und fügen eine kleine Anzahl trainierbarer Parameter (die Adapter) hinzu. Die Adapter werden trainiert, um aufgabenspezifische Informationen zu lernen. Es hat sich gezeigt, dass dieser Ansatz sehr speichereffizient ist und weniger Rechenleistung beansprucht, während die Ergebnisse mit denen eines vollständig feinabgestimmten Modells vergleichbar sind. + +Adapter, die mit PEFT trainiert wurden, sind in der Regel um eine Größenordnung kleiner als das vollständige Modell, so dass sie bequem gemeinsam genutzt, gespeichert und geladen werden können. + +
+<!-- Abbildung: Die Adaptergewichte für ein OPTForCausalLM-Modell, die auf dem Hub gespeichert sind, sind nur ~6MB groß, verglichen mit der vollen Größe der Modellgewichte, die ~700MB betragen können. -->
+ +Wenn Sie mehr über die 🤗 PEFT-Bibliothek erfahren möchten, sehen Sie sich die [Dokumentation](https://huggingface.co/docs/peft/index) an. + +## Setup + +Starten Sie mit der Installation von 🤗 PEFT: + +```bash +pip install peft +``` + +Wenn Sie die brandneuen Funktionen ausprobieren möchten, sollten Sie die Bibliothek aus dem Quellcode installieren: + +```bash +pip install git+https://github.com/huggingface/peft.git +``` + +## Unterstützte PEFT-Modelle + +Transformers unterstützt nativ einige PEFT-Methoden, d.h. Sie können lokal oder auf dem Hub gespeicherte Adaptergewichte laden und sie mit wenigen Zeilen Code einfach ausführen oder trainieren. Die folgenden Methoden werden unterstützt: + +- [Low Rank Adapters](https://huggingface.co/docs/peft/conceptual_guides/lora) +- [IA3](https://huggingface.co/docs/peft/conceptual_guides/ia3) +- [AdaLoRA](https://arxiv.org/abs/2303.10512) + +Wenn Sie andere PEFT-Methoden, wie z.B. Prompt Learning oder Prompt Tuning, verwenden möchten, oder über die 🤗 PEFT-Bibliothek im Allgemeinen, lesen Sie bitte die [Dokumentation](https://huggingface.co/docs/peft/index). + + +## Laden Sie einen PEFT-Adapter + +Um ein PEFT-Adaptermodell von 🤗 Transformers zu laden und zu verwenden, stellen Sie sicher, dass das Hub-Repository oder das lokale Verzeichnis eine `adapter_config.json`-Datei und die Adaptergewichte enthält, wie im obigen Beispielbild gezeigt. Dann können Sie das PEFT-Adaptermodell mit der Klasse `AutoModelFor` laden. Um zum Beispiel ein PEFT-Adaptermodell für die kausale Sprachmodellierung zu laden: + +1. Geben Sie die PEFT-Modell-ID an. +2. übergeben Sie es an die Klasse [`AutoModelForCausalLM`]. + +```py +from transformers import AutoModelForCausalLM, AutoTokenizer + +peft_model_id = "ybelkada/opt-350m-lora" +model = AutoModelForCausalLM.from_pretrained(peft_model_id) +``` + + + +Sie können einen PEFT-Adapter entweder mit einer `AutoModelFor`-Klasse oder der Basismodellklasse wie `OPTForCausalLM` oder `LlamaForCausalLM` laden. + + + +Sie können einen PEFT-Adapter auch laden, indem Sie die Methode `load_adapter` aufrufen: + +```py +from transformers import AutoModelForCausalLM, AutoTokenizer + +model_id = "facebook/opt-350m" +peft_model_id = "ybelkada/opt-350m-lora" + +model = AutoModelForCausalLM.from_pretrained(model_id) +model.load_adapter(peft_model_id) +``` + +## Laden in 8bit oder 4bit + +Die `bitsandbytes`-Integration unterstützt Datentypen mit 8bit und 4bit Genauigkeit, was für das Laden großer Modelle nützlich ist, weil es Speicher spart (lesen Sie den `bitsandbytes`-Integrations [guide](./quantization#bitsandbytes-integration), um mehr zu erfahren). Fügen Sie die Parameter `load_in_8bit` oder `load_in_4bit` zu [`~PreTrainedModel.from_pretrained`] hinzu und setzen Sie `device_map="auto"`, um das Modell effektiv auf Ihre Hardware zu verteilen: + +```py +from transformers import AutoModelForCausalLM, AutoTokenizer + +peft_model_id = "ybelkada/opt-350m-lora" +model = AutoModelForCausalLM.from_pretrained(peft_model_id, device_map="auto", load_in_8bit=True) +``` + +## Einen neuen Adapter hinzufügen + +Sie können [`~peft.PeftModel.add_adapter`] verwenden, um einen neuen Adapter zu einem Modell mit einem bestehenden Adapter hinzuzufügen, solange der neue Adapter vom gleichen Typ ist wie der aktuelle Adapter. 
Wenn Sie zum Beispiel einen bestehenden LoRA-Adapter an ein Modell angehängt haben:
+
+```py
+from transformers import AutoModelForCausalLM, OPTForCausalLM, AutoTokenizer
+from peft import LoraConfig
+
+model_id = "facebook/opt-350m"
+model = AutoModelForCausalLM.from_pretrained(model_id)
+
+lora_config = LoraConfig(
+    target_modules=["q_proj", "k_proj"],
+    init_lora_weights=False
+)
+
+model.add_adapter(lora_config, adapter_name="adapter_1")
+```
+
+Um einen neuen Adapter hinzuzufügen:
+
+```py
+# attach new adapter with same config
+model.add_adapter(lora_config, adapter_name="adapter_2")
+```
+
+Jetzt können Sie mit [`~peft.PeftModel.set_adapter`] festlegen, welcher Adapter verwendet werden soll:
+
+```py
+# use adapter_1
+model.set_adapter("adapter_1")
+output = model.generate(**inputs)
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+
+# use adapter_2
+model.set_adapter("adapter_2")
+output_enabled = model.generate(**inputs)
+print(tokenizer.decode(output_enabled[0], skip_special_tokens=True))
+```
+
+## Aktivieren und Deaktivieren von Adaptern
+
+Sobald Sie einen Adapter zu einem Modell hinzugefügt haben, können Sie das Adaptermodul aktivieren oder deaktivieren. So aktivieren Sie das Adaptermodul:
+
+```py
+from transformers import AutoModelForCausalLM, OPTForCausalLM, AutoTokenizer
+from peft import PeftConfig
+
+model_id = "facebook/opt-350m"
+adapter_model_id = "ybelkada/opt-350m-lora"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+text = "Hello"
+inputs = tokenizer(text, return_tensors="pt")
+
+model = AutoModelForCausalLM.from_pretrained(model_id)
+peft_config = PeftConfig.from_pretrained(adapter_model_id)
+
+# to initiate with random weights
+peft_config.init_lora_weights = False
+
+model.add_adapter(peft_config)
+model.enable_adapters()
+output = model.generate(**inputs)
+```
+
+So deaktivieren Sie das Adaptermodul:
+
+```py
+model.disable_adapters()
+output = model.generate(**inputs)
+```
+
+## PEFT-Adapter trainieren
+
+PEFT-Adapter werden von der Klasse [`Trainer`] unterstützt, so dass Sie einen Adapter für Ihren speziellen Anwendungsfall trainieren können. Dazu müssen Sie nur ein paar weitere Codezeilen hinzufügen. Zum Beispiel, um einen LoRA-Adapter zu trainieren:
+
+
+
+Wenn Sie mit der Feinabstimmung eines Modells mit [`Trainer`] noch nicht vertraut sind, werfen Sie einen Blick auf das Tutorial [Feinabstimmung eines vortrainierten Modells](Training).
+
+
+
+1. Definieren Sie Ihre Adapterkonfiguration mit dem Aufgabentyp und den Hyperparametern (siehe [`~peft.LoraConfig`] für weitere Details darüber, was die Hyperparameter tun).
+
+```py
+from peft import LoraConfig
+
+peft_config = LoraConfig(
+    lora_alpha=16,
+    lora_dropout=0.1,
+    r=64,
+    bias="none",
+    task_type="CAUSAL_LM",
+)
+```
+
+2. Fügen Sie dem Modell einen Adapter hinzu.
+
+```py
+model.add_adapter(peft_config)
+```
+
+3. Jetzt können Sie das Modell an [`Trainer`] übergeben!
+
+```py
+trainer = Trainer(model=model, ...)
+trainer.train() +``` + +So speichern Sie Ihren trainierten Adapter und laden ihn wieder: + +```py +model.save_pretrained(save_dir) +model = AutoModelForCausalLM.from_pretrained(save_dir) +``` + + diff --git a/docs/source/de/pipeline_tutorial.mdx b/docs/source/de/pipeline_tutorial.md similarity index 90% rename from docs/source/de/pipeline_tutorial.mdx rename to docs/source/de/pipeline_tutorial.md index 19c37c35dea1ec..5106af9b2fafc7 100644 --- a/docs/source/de/pipeline_tutorial.mdx +++ b/docs/source/de/pipeline_tutorial.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Pipelines für Inferenzen @@ -67,13 +71,13 @@ Alle zusätzlichen Parameter für Ihre Aufgabe können auch in die [`pipeline`] ### Wählen Sie ein Modell und einen Tokenizer -Die [`pipeline`] akzeptiert jedes Modell aus dem [Hub] (https://huggingface.co/models). Auf dem Hub gibt es Tags, mit denen Sie nach einem Modell filtern können, das Sie für Ihre Aufgabe verwenden möchten. Sobald Sie ein passendes Modell ausgewählt haben, laden Sie es mit der entsprechenden `AutoModelFor` und [`AutoTokenizer`] Klasse. Laden Sie zum Beispiel die Klasse [`AutoModelForCausalLM`] für eine kausale Sprachmodellierungsaufgabe: +Die [`pipeline`] akzeptiert jedes Modell aus dem [Hub](https://huggingface.co/models). Auf dem Hub gibt es Tags, mit denen Sie nach einem Modell filtern können, das Sie für Ihre Aufgabe verwenden möchten. Sobald Sie ein passendes Modell ausgewählt haben, laden Sie es mit der entsprechenden `AutoModelFor` und [`AutoTokenizer`] Klasse. Laden Sie zum Beispiel die Klasse [`AutoModelForCausalLM`] für eine kausale Sprachmodellierungsaufgabe: ```py >>> from transformers import AutoTokenizer, AutoModelForCausalLM ->>> tokenizer = AutoTokenizer.from_pretrained("distilgpt2") ->>> model = AutoModelForCausalLM.from_pretrained("distilgpt2") +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2") +>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2") ``` Erstellen Sie eine [`pipeline`] für Ihre Aufgabe, und geben Sie das Modell und den Tokenizer an, die Sie geladen haben: diff --git a/docs/source/de/pr_checks.md b/docs/source/de/pr_checks.md new file mode 100644 index 00000000000000..ee2bbf489b8e75 --- /dev/null +++ b/docs/source/de/pr_checks.md @@ -0,0 +1,199 @@ + + +# Überprüfungen bei einer Pull-Anfrage + +Wenn Sie eine Pull-Anfrage für 🤗 Transformers öffnen, wird eine ganze Reihe von Prüfungen durchgeführt, um sicherzustellen, dass der Patch, den Sie hinzufügen, nichts Bestehendes zerstört. Es gibt vier Arten von Prüfungen: +- reguläre Tests +- Erstellung der Dokumentation +- Stil von Code und Dokumentation +- allgemeine Konsistenz des Repository + +In diesem Dokument werden wir versuchen zu erklären, worum es sich bei diesen verschiedenen Prüfungen handelt und wie Sie sie lokal debuggen können, wenn eine der Prüfungen in Ihrer PR fehlschlägt. 
+ +Beachten Sie, dass Sie im Idealfall eine Dev-Installation benötigen: + +```bash +pip install transformers[dev] +``` + +oder für eine bearbeitbare Installation: + +```bash +pip install -e .[dev] +``` + +innerhalb des Transformers Repo. Da die Anzahl der optionalen Abhängigkeiten von Transformers stark zugenommen hat, ist es möglich, dass Sie nicht alle davon bekommen können. Wenn die Dev-Installation fehlschlägt, stellen Sie sicher, dass Sie das Deep Learning-Framework, mit dem Sie arbeiten, installieren (PyTorch, TensorFlow und/oder Flax). + +```bash +pip install transformers[quality] +``` + +oder für eine bearbeitbare Installation: + +```bash +pip install -e .[quality] +``` + + +## Tests + +Alle Jobs, die mit `ci/circleci: run_tests_` beginnen, führen Teile der Transformers-Testsuite aus. Jeder dieser Jobs konzentriert sich auf einen Teil der Bibliothek in einer bestimmten Umgebung: `ci/circleci: run_tests_pipelines_tf` zum Beispiel führt den Pipelines-Test in einer Umgebung aus, in der nur TensorFlow installiert ist. + +Beachten Sie, dass nur ein Teil der Testsuite jedes Mal ausgeführt wird, um zu vermeiden, dass Tests ausgeführt werden, wenn es keine wirkliche Änderung in den Modulen gibt, die sie testen: ein Dienstprogramm wird ausgeführt, um die Unterschiede in der Bibliothek zwischen vor und nach dem PR zu ermitteln (was GitHub Ihnen auf der Registerkarte "Files changes" anzeigt) und die Tests auszuwählen, die von diesem Unterschied betroffen sind. Dieses Dienstprogramm kann lokal mit ausgeführt werden: + +```bash +python utils/tests_fetcher.py +``` + +aus dem Stammverzeichnis des Transformers-Repositoriums. Es wird: + +1. Überprüfen Sie für jede Datei im Diff, ob die Änderungen im Code oder nur in Kommentaren oder Docstrings enthalten sind. Nur die Dateien mit echten Codeänderungen werden beibehalten. +2. Erstellen Sie eine interne Map, die für jede Datei des Quellcodes der Bibliothek alle Dateien angibt, auf die sie rekursiv Einfluss nimmt. Von Modul A wird gesagt, dass es sich auf Modul B auswirkt, wenn Modul B Modul A importiert. Für die rekursive Auswirkung benötigen wir eine Kette von Modulen, die von Modul A zu Modul B führt und in der jedes Modul das vorherige importiert. +3. Wenden Sie diese Zuordnung auf die in Schritt 1 gesammelten Dateien an. So erhalten wir die Liste der Modelldateien, die von der PR betroffen sind. +4. Ordnen Sie jede dieser Dateien der/den entsprechenden Testdatei(en) zu und erhalten Sie die Liste der auszuführenden Tests. + +Wenn Sie das Skript lokal ausführen, sollten Sie die Ergebnisse von Schritt 1, 3 und 4 ausgegeben bekommen und somit wissen, welche Tests ausgeführt werden. Das Skript erstellt außerdem eine Datei namens `test_list.txt`, die die Liste der auszuführenden Tests enthält, die Sie mit dem folgenden Befehl lokal ausführen können: + +```bash +python -m pytest -n 8 --dist=loadfile -rA -s $(cat test_list.txt) +``` + +Für den Fall, dass Ihnen etwas entgangen ist, wird die komplette Testreihe ebenfalls täglich ausgeführt. + +## Dokumentation erstellen + +Der Job `build_pr_documentation` erstellt und generiert eine Vorschau der Dokumentation, um sicherzustellen, dass alles in Ordnung ist, wenn Ihr PR zusammengeführt wird. Ein Bot fügt einen Link zur Vorschau der Dokumentation zu Ihrem PR hinzu. Alle Änderungen, die Sie an dem PR vornehmen, werden automatisch in der Vorschau aktualisiert. Wenn die Dokumentation nicht erstellt werden kann, klicken Sie auf **Details** neben dem fehlgeschlagenen Auftrag, um zu sehen, wo der Fehler liegt. 
Oft ist der Fehler so einfach wie eine fehlende Datei im `toctree`. + +Wenn Sie daran interessiert sind, die Dokumentation lokal zu erstellen oder in der Vorschau anzusehen, werfen Sie einen Blick in die [`README.md`](https://github.com/huggingface/transformers/tree/main/docs) im Ordner docs. + +## Code und Dokumentationsstil + +Die Formatierung des Codes erfolgt für alle Quelldateien, die Beispiele und die Tests mit `black` und `ruff`. Wir haben auch ein benutzerdefiniertes Tool, das sich um die Formatierung von docstrings und `rst`-Dateien kümmert (`utils/style_doc.py`), sowie um die Reihenfolge der Lazy-Importe, die in den Transformers `__init__.py`-Dateien durchgeführt werden (`utils/custom_init_isort.py`). All dies können Sie starten, indem Sie Folgendes ausführen + +```bash +make style +``` + +Das CI prüft, ob diese innerhalb der Prüfung `ci/circleci: check_code_quality` angewendet wurden. Es führt auch `ruff` aus, das einen grundlegenden Blick auf Ihren Code wirft und sich beschwert, wenn es eine undefinierte Variable findet oder eine, die nicht verwendet wird. Um diese Prüfung lokal auszuführen, verwenden Sie + +```bash +make quality +``` + +Dies kann sehr viel Zeit in Anspruch nehmen. Um dasselbe nur für die Dateien zu tun, die Sie im aktuellen Zweig geändert haben, führen Sie + +```bash +make fixup +``` + +Dieser letzte Befehl führt auch alle zusätzlichen Prüfungen für die Konsistenz des Repositorys durch. Schauen wir uns diese an. + +## Repository-Konsistenz + +Dies fasst alle Tests zusammen, die sicherstellen, dass Ihr PR das Repository in einem guten Zustand verlässt. Sie können diese Prüfung lokal durchführen, indem Sie Folgendes ausführen: + +```bash +make repo-consistency +``` + +Dies überprüft, ob: + +- Alle zum Init hinzugefügten Objekte sind dokumentiert (ausgeführt von `utils/check_repo.py`) +- Alle `__init__.py`-Dateien haben in ihren beiden Abschnitten den gleichen Inhalt (ausgeführt von `utils/check_inits.py`) +- Der gesamte Code, der als Kopie eines anderen Moduls identifiziert wurde, stimmt mit dem Original überein (ausgeführt von `utils/check_copies.py`) +- Alle Konfigurationsklassen haben mindestens einen gültigen Prüfpunkt, der in ihren Dokumentationen erwähnt wird (ausgeführt von `utils/check_config_docstrings.py`) +- Alle Konfigurationsklassen enthalten nur Attribute, die in den entsprechenden Modellierungsdateien verwendet werden (ausgeführt von `utils/check_config_attributes.py`) +- Die Übersetzungen der READMEs und der Index des Dokuments haben die gleiche Modellliste wie die Haupt-README (durchgeführt von `utils/check_copies.py`) +- Die automatisch generierten Tabellen in der Dokumentation sind auf dem neuesten Stand (ausgeführt von `utils/check_table.py`) +- Die Bibliothek verfügt über alle Objekte, auch wenn nicht alle optionalen Abhängigkeiten installiert sind (ausgeführt von `utils/check_dummies.py`) + +Sollte diese Prüfung fehlschlagen, müssen die ersten beiden Punkte manuell korrigiert werden, die letzten vier können automatisch für Sie korrigiert werden, indem Sie den Befehl + +```bash +make fix-copies +``` + +Zusätzliche Prüfungen betreffen PRs, die neue Modelle hinzufügen, vor allem, dass: + +- Alle hinzugefügten Modelle befinden sich in einer Auto-Zuordnung (durchgeführt von `utils/check_repo.py`) + +- Alle Modelle werden ordnungsgemäß getestet (ausgeführt von `utils/check_repo.py`) + + + +### Kopien prüfen + +Da die Transformers-Bibliothek in Bezug auf den Modellcode sehr eigenwillig ist und jedes Modell vollständig in einer einzigen Datei 
implementiert sein sollte, ohne sich auf andere Modelle zu stützen, haben wir einen Mechanismus hinzugefügt, der überprüft, ob eine Kopie des Codes einer Ebene eines bestimmten Modells mit dem Original übereinstimmt. Auf diese Weise können wir bei einer Fehlerbehebung alle anderen betroffenen Modelle sehen und entscheiden, ob wir die Änderung weitergeben oder die Kopie zerstören. + + + +Wenn eine Datei eine vollständige Kopie einer anderen Datei ist, sollten Sie sie in der Konstante `FULL_COPIES` von `utils/check_copies.py` registrieren. + + + +Dieser Mechanismus stützt sich auf Kommentare der Form `# Kopiert von xxx`. Das `xxx` sollte den gesamten Pfad zu der Klasse der Funktion enthalten, die darunter kopiert wird. Zum Beispiel ist `RobertaSelfOutput` eine direkte Kopie der Klasse `BertSelfOutput`. Sie können also [hier](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L289) sehen, dass sie einen Kommentar hat: + +```py +# Copied from transformers.models.bert.modeling_bert.BertSelfOutput +``` + +Beachten Sie, dass Sie dies nicht auf eine ganze Klasse anwenden, sondern auf die entsprechenden Methoden, von denen kopiert wird. Zum Beispiel [hier](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L598) können Sie sehen, wie `RobertaPreTrainedModel._init_weights` von der gleichen Methode in `BertPreTrainedModel` mit dem Kommentar kopiert wird: + +```py +# Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights +``` + +Manchmal ist die Kopie bis auf die Namen genau gleich: zum Beispiel verwenden wir in `RobertaAttention` `RobertaSelfAttention` anstelle von `BertSelfAttention`, aber ansonsten ist der Code genau derselbe. Aus diesem Grund unterstützt `#Copied from` einfache String-Ersetzungen mit der folgenden Syntax: `Kopiert von xxx mit foo->bar`. Das bedeutet, dass der Code kopiert wird, wobei alle Instanzen von "foo" durch "bar" ersetzt werden. Sie können sehen, wie es [hier](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L304C1-L304C86) in `RobertaAttention` mit dem Kommentar verwendet wird: + +```py +# Copied from transformers.models.bert.modeling_bert.BertAttention with Bert->Roberta +``` + +Beachten Sie, dass um den Pfeil herum keine Leerzeichen stehen sollten (es sei denn, das Leerzeichen ist Teil des zu ersetzenden Musters, natürlich). + +Sie können mehrere Muster durch ein Komma getrennt hinzufügen. Zum Beispiel ist hier `CamemberForMaskedLM` eine direkte Kopie von `RobertaForMaskedLM` mit zwei Ersetzungen: `Roberta` zu `Camembert` und `ROBERTA` zu `CAMEMBERT`. Sie können [hier](https://github.com/huggingface/transformers/blob/15082a9dc6950ecae63a0d3e5060b2fc7f15050a/src/transformers/models/camembert/modeling_camembert.py#L929) sehen, wie dies mit dem Kommentar gemacht wird: + +```py +# Copied from transformers.models.roberta.modeling_roberta.RobertaForMaskedLM with Roberta->Camembert, ROBERTA->CAMEMBERT +``` + +Wenn die Reihenfolge eine Rolle spielt (weil eine der Ersetzungen mit einer vorherigen in Konflikt geraten könnte), werden die Ersetzungen von links nach rechts ausgeführt. + + + +Wenn die Ersetzungen die Formatierung ändern (wenn Sie z.B. einen kurzen Namen durch einen sehr langen Namen ersetzen), wird die Kopie nach Anwendung des automatischen Formats überprüft. 
+ + + +Eine andere Möglichkeit, wenn es sich bei den Mustern nur um verschiedene Umschreibungen derselben Ersetzung handelt (mit einer groß- und einer kleingeschriebenen Variante), besteht darin, die Option `all-casing` hinzuzufügen. [Hier](https://github.com/huggingface/transformers/blob/15082a9dc6950ecae63a0d3e5060b2fc7f15050a/src/transformers/models/mobilebert/modeling_mobilebert.py#L1237) ist ein Beispiel in `MobileBertForSequenceClassification` mit dem Kommentar: + +```py +# Copied from transformers.models.bert.modeling_bert.BertForSequenceClassification with Bert->MobileBert all-casing +``` + +In diesem Fall wird der Code von `BertForSequenceClassification` kopiert, indem er ersetzt wird: +- `Bert` durch `MobileBert` (zum Beispiel bei der Verwendung von `MobileBertModel` in der Init) +- `bert` durch `mobilebert` (zum Beispiel bei der Definition von `self.mobilebert`) +- `BERT` durch `MOBILEBERT` (in der Konstante `MOBILEBERT_INPUTS_DOCSTRING`) diff --git a/docs/source/de/preprocessing.mdx b/docs/source/de/preprocessing.md similarity index 95% rename from docs/source/de/preprocessing.mdx rename to docs/source/de/preprocessing.md index ea6c185cc10155..b56a5c0ae4ca1c 100644 --- a/docs/source/de/preprocessing.mdx +++ b/docs/source/de/preprocessing.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Vorverarbeiten @@ -41,7 +45,7 @@ Laden Sie einen vortrainierten Tokenizer mit [`AutoTokenizer.from_pretrained`]: ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") ``` Dann übergeben Sie Ihren Satz an den Tokenizer: @@ -205,7 +209,7 @@ Audioeingaben werden anders vorverarbeitet als Texteingaben, aber das Endziel bl pip install datasets ``` -Laden Sie den [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) Datensatz (weitere Informationen zum Laden eines Datensatzes finden Sie im 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html)): +Laden Sie den [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) Datensatz (weitere Informationen zum Laden eines Datensatzes finden Sie im 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub)): ```py >>> from datasets import load_dataset, Audio @@ -244,7 +248,7 @@ Der Datensatz [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) hat zum 'sampling_rate': 8000} ``` -1. Verwenden Sie die Methode [~datasets.Dataset.cast_column] von 🤗 Datasets, um die Abtastrate auf 16kHz zu erhöhen: +1. 
Verwenden Sie die Methode [`~datasets.Dataset.cast_column`] von 🤗 Datasets, um die Abtastrate auf 16kHz zu erhöhen: ```py >>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000)) @@ -340,7 +344,7 @@ Laden wir den [food101](https://huggingface.co/datasets/food101) Datensatz für >>> dataset = load_dataset("food101", split="train[:100]") ``` -Als Nächstes sehen Sie sich das Bild mit dem Merkmal 🤗 Datensätze [Bild] (https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=image#datasets.Image) an: +Als Nächstes sehen Sie sich das Bild mit dem Merkmal 🤗 Datensätze [Bild](https://huggingface.co/docs/datasets/package_reference/main_classes?highlight=image#datasets.Image) an: ```py >>> dataset[0]["image"] @@ -350,12 +354,12 @@ Als Nächstes sehen Sie sich das Bild mit dem Merkmal 🤗 Datensätze [Bild] (h ### Merkmalsextraktor -Laden Sie den Merkmalsextraktor mit [`AutoFeatureExtractor.from_pretrained`]: +Laden Sie den Merkmalsextraktor mit [`AutoImageProcessor.from_pretrained`]: ```py ->>> from transformers import AutoFeatureExtractor +>>> from transformers import AutoImageProcessor ->>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224") +>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224") ``` ### Datenerweiterung @@ -367,9 +371,9 @@ Bei Bildverarbeitungsaufgaben ist es üblich, den Bildern als Teil der Vorverarb ```py >>> from torchvision.transforms import Compose, Normalize, RandomResizedCrop, ColorJitter, ToTensor ->>> normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std) +>>> normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std) >>> _transforms = Compose( -... [RandomResizedCrop(feature_extractor.size), ColorJitter(brightness=0.5, hue=0.5), ToTensor(), normalize] +... [RandomResizedCrop(image_processor.size["height"]), ColorJitter(brightness=0.5, hue=0.5), ToTensor(), normalize] ... ) ``` @@ -381,7 +385,7 @@ Bei Bildverarbeitungsaufgaben ist es üblich, den Bildern als Teil der Vorverarb ... return examples ``` -3. Dann verwenden Sie 🤗 Datasets [`set_transform`](https://huggingface.co/docs/datasets/process.html#format-transform), um die Transformationen im laufenden Betrieb anzuwenden: +3. Dann verwenden Sie 🤗 Datasets [`set_transform`](https://huggingface.co/docs/datasets/process#format-transform), um die Transformationen im laufenden Betrieb anzuwenden: ```py >>> dataset.set_transform(transforms) @@ -472,7 +476,7 @@ Erinnern Sie sich an den früheren Abschnitt über die Verarbeitung von Audiodat ### Prozessor -Ein Processor kombiniert einen Feature-Extraktor und einen Tokenizer. Laden Sie einen Processor mit [`AutoProcessor.from_pretrained]: +Ein Processor kombiniert einen Feature-Extraktor und einen Tokenizer. Laden Sie einen Processor mit [`AutoProcessor.from_pretrained`]: ```py >>> from transformers import AutoProcessor diff --git a/docs/source/de/quicktour.mdx b/docs/source/de/quicktour.md similarity index 93% rename from docs/source/de/quicktour.mdx rename to docs/source/de/quicktour.md index 4c668bf419b134..01cd7200750c4d 100644 --- a/docs/source/de/quicktour.mdx +++ b/docs/source/de/quicktour.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Schnellstart @@ -64,11 +68,13 @@ Installieren Sie die folgenden Abhängigkeiten, falls Sie dies nicht bereits get + ```bash pip install torch ``` + ```bash pip install tensorflow ``` @@ -83,7 +89,7 @@ Importieren sie die [`pipeline`] und spezifizieren sie die Aufgabe, welche sie l >>> classifier = pipeline("sentiment-analysis") ``` -Die Pipeline lädt ein standardmäßiges [vortrainiertes Modell] (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) und einen Tokenizer für die Stimmungs-Analyse herunter und speichert sie. Jetzt können Sie den "Klassifikator" auf Ihren Zieltext anwenden: +Die Pipeline lädt ein standardmäßiges [vortrainiertes Modell](https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english) und einen Tokenizer für die Stimmungs-Analyse herunter und speichert sie. Jetzt können Sie den "Klassifikator" auf Ihren Zieltext anwenden: ```py >>> classifier("We are very happy to show you the 🤗 Transformers library.") @@ -115,7 +121,7 @@ Erstellen wir eine [`pipeline`] mit der Aufgabe die wir lösen und dem Modell we >>> speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h") ``` -Als nächstes laden wir den Datensatz (siehe 🤗 Datasets [Quick Start](https://huggingface.co/docs/datasets/quickstart.html) für mehr Details) welches wir nutzen möchten. Zum Beispiel laden wir den [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) Datensatz: +Als nächstes laden wir den Datensatz (siehe 🤗 Datasets [Quick Start](https://huggingface.co/docs/datasets/quickstart) für mehr Details) welches wir nutzen möchten. Zum Beispiel laden wir den [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) Datensatz: ```py >>> from datasets import load_dataset, Audio @@ -142,7 +148,7 @@ Bei einem größeren Datensatz mit vielen Eingaben (wie bei Sprache oder Bildver ### Ein anderes Modell und einen anderen Tokenizer in der Pipeline verwenden -Die [`pipeline`] kann jedes Modell aus dem [Model Hub] (https://huggingface.co/models) verwenden, wodurch es einfach ist, die [`pipeline`] für andere Anwendungsfälle anzupassen. Wenn Sie beispielsweise ein Modell wünschen, das französischen Text verarbeiten kann, verwenden Sie die Tags im Model Hub, um nach einem geeigneten Modell zu filtern. Das oberste gefilterte Ergebnis liefert ein mehrsprachiges [BERT-Modell](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment), das auf die Stimmungsanalyse abgestimmt ist. Großartig, verwenden wir dieses Modell! +Die [`pipeline`] kann jedes Modell aus dem [Model Hub](https://huggingface.co/models) verwenden, wodurch es einfach ist, die [`pipeline`] für andere Anwendungsfälle anzupassen. Wenn Sie beispielsweise ein Modell wünschen, das französischen Text verarbeiten kann, verwenden Sie die Tags im Model Hub, um nach einem geeigneten Modell zu filtern. Das oberste gefilterte Ergebnis liefert ein mehrsprachiges [BERT-Modell](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment), das auf die Stimmungsanalyse abgestimmt ist. Großartig, verwenden wir dieses Modell! ```py >>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" @@ -222,6 +228,7 @@ Genau wie die [`pipeline`] akzeptiert der Tokenizer eine Liste von Eingaben. 
Dar + ```py >>> pt_batch = tokenizer( ... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."], @@ -233,6 +240,7 @@ Genau wie die [`pipeline`] akzeptiert der Tokenizer eine Liste von Eingaben. Dar ``` + ```py >>> tf_batch = tokenizer( ... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."], @@ -371,6 +379,7 @@ Ein besonders cooles 🤗 Transformers-Feature ist die Möglichkeit, ein Modell + ```py >>> from transformers import AutoModel @@ -379,6 +388,7 @@ Ein besonders cooles 🤗 Transformers-Feature ist die Möglichkeit, ein Modell ``` + ```py >>> from transformers import TFAutoModel @@ -397,7 +407,7 @@ Beginnen Sie mit dem Import von [`AutoConfig`] und laden Sie dann das trainierte ```py >>> from transformers import AutoConfig ->>> my_config = AutoConfig.from_pretrained("distilbert-base-uncased", n_heads=12) +>>> my_config = AutoConfig.from_pretrained("distilbert/distilbert-base-uncased", n_heads=12) ``` diff --git a/docs/source/de/run_scripts.md b/docs/source/de/run_scripts.md new file mode 100644 index 00000000000000..61a0754ea92628 --- /dev/null +++ b/docs/source/de/run_scripts.md @@ -0,0 +1,351 @@ + + +# Trainieren mit einem Skript + +Neben den 🤗 Transformers [notebooks](./noteboks/README) gibt es auch Beispielskripte, die zeigen, wie man ein Modell für eine Aufgabe mit [PyTorch](https://github.com/huggingface/transformers/tree/main/examples/pytorch), [TensorFlow](https://github.com/huggingface/transformers/tree/main/examples/tensorflow) oder [JAX/Flax](https://github.com/huggingface/transformers/tree/main/examples/flax) trainiert. + +Sie werden auch Skripte finden, die wir in unseren [Forschungsprojekten](https://github.com/huggingface/transformers/tree/main/examples/research_projects) und [Legacy-Beispielen](https://github.com/huggingface/transformers/tree/main/examples/legacy) verwendet haben und die größtenteils von der Community stammen. Diese Skripte werden nicht aktiv gepflegt und erfordern eine bestimmte Version von 🤗 Transformers, die höchstwahrscheinlich nicht mit der neuesten Version der Bibliothek kompatibel ist. + +Es wird nicht erwartet, dass die Beispielskripte bei jedem Problem sofort funktionieren. Möglicherweise müssen Sie das Skript an das Problem anpassen, das Sie zu lösen versuchen. Um Ihnen dabei zu helfen, legen die meisten Skripte vollständig offen, wie die Daten vorverarbeitet werden, so dass Sie sie nach Bedarf für Ihren Anwendungsfall bearbeiten können. + +Für jede Funktion, die Sie in einem Beispielskript implementieren möchten, diskutieren Sie bitte im [Forum](https://discuss.huggingface.co/) oder in einem [issue](https://github.com/huggingface/transformers/issues), bevor Sie einen Pull Request einreichen. Wir freuen uns zwar über Fehlerkorrekturen, aber es ist unwahrscheinlich, dass wir einen Pull Request zusammenführen, der mehr Funktionalität auf Kosten der Lesbarkeit hinzufügt. + +Diese Anleitung zeigt Ihnen, wie Sie ein Beispiel für ein Trainingsskript zur Zusammenfassung in [PyTorch](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization) und [TensorFlow](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/summarization) ausführen können. Sofern nicht anders angegeben, sollten alle Beispiele mit beiden Frameworks funktionieren. 
+ +## Einrichtung + +Um die neueste Version der Beispielskripte erfolgreich auszuführen, **müssen Sie 🤗 Transformers aus dem Quellcode** in einer neuen virtuellen Umgebung installieren: + +```bash +git clone https://github.com/huggingface/transformers +cd transformers +pip install . +``` + +Für ältere Versionen der Beispielskripte klicken Sie auf die Umschalttaste unten: + +
+<!-- Ausklappbarer Abschnitt: Beispiele für ältere Versionen von 🤗 Transformers -->
+ +Dann stellen Sie Ihren aktuellen Klon von 🤗 Transformers auf eine bestimmte Version um, z.B. v3.5.1: + +```bash +git checkout tags/v3.5.1 +``` + +Nachdem Sie die richtige Bibliotheksversion eingerichtet haben, navigieren Sie zu dem Beispielordner Ihrer Wahl und installieren die beispielspezifischen Anforderungen: + +```bash +pip install -r requirements.txt +``` + +## Ein Skript ausführen + + + +Das Beispielskript lädt einen Datensatz aus der 🤗 [Datasets](https://huggingface.co/docs/datasets/) Bibliothek herunter und verarbeitet ihn vor. Dann nimmt das Skript eine Feinabstimmung eines Datensatzes mit dem [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) auf einer Architektur vor, die eine Zusammenfassung unterstützt. Das folgende Beispiel zeigt, wie die Feinabstimmung von [T5-small](https://huggingface.co/google-t5/t5-small) auf dem Datensatz [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail) durchgeführt wird. Das T5-Modell benötigt aufgrund der Art und Weise, wie es trainiert wurde, ein zusätzliches Argument `source_prefix`. Mit dieser Eingabeaufforderung weiß T5, dass es sich um eine Zusammenfassungsaufgabe handelt. + +```bash +python examples/pytorch/summarization/run_summarization.py \ + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --overwrite_output_dir \ + --predict_with_generate +``` + + +Das Beispielskript lädt einen Datensatz aus der 🤗 [Datasets](https://huggingface.co/docs/datasets/) Bibliothek herunter und verarbeitet ihn vor. Anschließend nimmt das Skript die Feinabstimmung eines Datensatzes mit Keras auf einer Architektur vor, die die Zusammenfassung unterstützt. Das folgende Beispiel zeigt, wie die Feinabstimmung von [T5-small](https://huggingface.co/google-t5/t5-small) auf dem [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail) Datensatz durchgeführt wird. Das T5-Modell benötigt aufgrund der Art und Weise, wie es trainiert wurde, ein zusätzliches Argument `source_prefix`. Mit dieser Eingabeaufforderung weiß T5, dass es sich um eine Zusammenfassungsaufgabe handelt. + +```bash +python examples/tensorflow/summarization/run_summarization.py \ + --model_name_or_path google-t5/t5-small \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 16 \ + --num_train_epochs 3 \ + --do_train \ + --do_eval +``` + + + +## Verteiltes Training und gemischte Präzision + +Der [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) unterstützt verteiltes Training und gemischte Präzision, d.h. Sie können ihn auch in einem Skript verwenden. So aktivieren Sie diese beiden Funktionen: + +- Fügen Sie das Argument `fp16` hinzu, um gemischte Genauigkeit zu aktivieren. +- Legen Sie die Anzahl der zu verwendenden GPUs mit dem Argument `nproc_per_node` fest. 
+ +```bash +torchrun \ + --nproc_per_node 8 pytorch/summarization/run_summarization.py \ + --fp16 \ + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --overwrite_output_dir \ + --predict_with_generate +``` + +TensorFlow-Skripte verwenden eine [`MirroredStrategy`](https://www.tensorflow.org/guide/distributed_training#mirroredstrategy) für verteiltes Training, und Sie müssen dem Trainingsskript keine zusätzlichen Argumente hinzufügen. Das TensorFlow-Skript verwendet standardmäßig mehrere GPUs, wenn diese verfügbar sind. + +## Ein Skript auf einer TPU ausführen + + + +Tensor Processing Units (TPUs) sind speziell für die Beschleunigung der Leistung konzipiert. PyTorch unterstützt TPUs mit dem [XLA](https://www.tensorflow.org/xla) Deep Learning Compiler (siehe [hier](https://github.com/pytorch/xla/blob/master/README.md) für weitere Details). Um eine TPU zu verwenden, starten Sie das Skript `xla_spawn.py` und verwenden das Argument `num_cores`, um die Anzahl der TPU-Kerne festzulegen, die Sie verwenden möchten. + +```bash +python xla_spawn.py --num_cores 8 \ + summarization/run_summarization.py \ + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --overwrite_output_dir \ + --predict_with_generate +``` + + +Tensor Processing Units (TPUs) sind speziell für die Beschleunigung der Leistung konzipiert. TensorFlow Skripte verwenden eine [`TPUStrategy`](https://www.tensorflow.org/guide/distributed_training#tpustrategy) für das Training auf TPUs. Um eine TPU zu verwenden, übergeben Sie den Namen der TPU-Ressource an das Argument `tpu`. + +```bash +python run_summarization.py \ + --tpu name_of_tpu_resource \ + --model_name_or_path google-t5/t5-small \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 16 \ + --num_train_epochs 3 \ + --do_train \ + --do_eval +``` + + + +## Führen Sie ein Skript mit 🤗 Accelerate aus. + +🤗 [Accelerate](https://huggingface.co/docs/accelerate) ist eine reine PyTorch-Bibliothek, die eine einheitliche Methode für das Training eines Modells auf verschiedenen Arten von Setups (nur CPU, mehrere GPUs, TPUs) bietet und dabei die vollständige Transparenz der PyTorch-Trainingsschleife beibehält. Stellen Sie sicher, dass Sie 🤗 Accelerate installiert haben, wenn Sie es nicht bereits haben: + +> Hinweis: Da Accelerate schnell weiterentwickelt wird, muss die Git-Version von Accelerate installiert sein, um die Skripte auszuführen. +```bash +pip install git+https://github.com/huggingface/accelerate +``` + +Anstelle des Skripts `run_summarization.py` müssen Sie das Skript `run_summarization_no_trainer.py` verwenden. Die von Accelerate unterstützten Skripte haben eine Datei `task_no_trainer.py` im Ordner. 
Beginnen Sie mit dem folgenden Befehl, um eine Konfigurationsdatei zu erstellen und zu speichern: + +```bash +accelerate config +``` + +Testen Sie Ihre Einrichtung, um sicherzustellen, dass sie korrekt konfiguriert ist: + +```bash +accelerate test +``` + +Jetzt sind Sie bereit, das Training zu starten: + +```bash +accelerate launch run_summarization_no_trainer.py \ + --model_name_or_path google-t5/t5-small \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir ~/tmp/tst-summarization +``` + +## Verwenden Sie einen benutzerdefinierten Datensatz + +Das Verdichtungsskript unterstützt benutzerdefinierte Datensätze, solange es sich um eine CSV- oder JSON-Line-Datei handelt. Wenn Sie Ihren eigenen Datensatz verwenden, müssen Sie mehrere zusätzliche Argumente angeben: + +- `train_file` und `validation_file` geben den Pfad zu Ihren Trainings- und Validierungsdateien an. +- `text_column` ist der Eingabetext, der zusammengefasst werden soll. +- Summary_column" ist der auszugebende Zieltext. + +Ein Zusammenfassungsskript, das einen benutzerdefinierten Datensatz verwendet, würde wie folgt aussehen: + +```bash +python examples/pytorch/summarization/run_summarization.py \ + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --train_file path_to_csv_or_jsonlines_file \ + --validation_file path_to_csv_or_jsonlines_file \ + --text_column text_column_name \ + --summary_column summary_column_name \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --overwrite_output_dir \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --predict_with_generate +``` + +## Testen Sie ein Skript + +Es ist oft eine gute Idee, Ihr Skript an einer kleineren Anzahl von Beispielen für Datensätze auszuführen, um sicherzustellen, dass alles wie erwartet funktioniert, bevor Sie sich auf einen ganzen Datensatz festlegen, dessen Fertigstellung Stunden dauern kann. Verwenden Sie die folgenden Argumente, um den Datensatz auf eine maximale Anzahl von Stichproben zu beschränken: + +- `max_train_samples` +- `max_eval_samples` +- `max_predict_samples` + +```bash +python examples/pytorch/summarization/run_summarization.py \ + --model_name_or_path google-t5/t5-small \ + --max_train_samples 50 \ + --max_eval_samples 50 \ + --max_predict_samples 50 \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --overwrite_output_dir \ + --predict_with_generate +``` + +Nicht alle Beispielskripte unterstützen das Argument `max_predict_samples`. Wenn Sie sich nicht sicher sind, ob Ihr Skript dieses Argument unterstützt, fügen Sie das Argument `-h` hinzu, um dies zu überprüfen: + +```bash +examples/pytorch/summarization/run_summarization.py -h +``` + +## Training vom Kontrollpunkt fortsetzen + +Eine weitere hilfreiche Option, die Sie aktivieren können, ist die Wiederaufnahme des Trainings von einem früheren Kontrollpunkt aus. Auf diese Weise können Sie im Falle einer Unterbrechung Ihres Trainings dort weitermachen, wo Sie aufgehört haben, ohne von vorne beginnen zu müssen. Es gibt zwei Methoden, um das Training von einem Kontrollpunkt aus wieder aufzunehmen. + +Die erste Methode verwendet das Argument `output_dir previous_output_dir`, um das Training ab dem letzten in `output_dir` gespeicherten Kontrollpunkt wieder aufzunehmen. 
In diesem Fall sollten Sie `overwrite_output_dir` entfernen: + +```bash +python examples/pytorch/summarization/run_summarization.py + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --output_dir previous_output_dir \ + --predict_with_generate +``` + +Die zweite Methode verwendet das Argument `Resume_from_checkpoint path_to_specific_checkpoint`, um das Training ab einem bestimmten Checkpoint-Ordner wieder aufzunehmen. + +```bash +python examples/pytorch/summarization/run_summarization.py + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --overwrite_output_dir \ + --resume_from_checkpoint path_to_specific_checkpoint \ + --predict_with_generate +``` + +## Teilen Sie Ihr Modell + +Alle Skripte können Ihr endgültiges Modell in den [Model Hub](https://huggingface.co/models) hochladen. Stellen Sie sicher, dass Sie bei Hugging Face angemeldet sind, bevor Sie beginnen: + +```bash +huggingface-cli login +``` + +Dann fügen Sie dem Skript das Argument `push_to_hub` hinzu. Mit diesem Argument wird ein Repository mit Ihrem Hugging Face-Benutzernamen und dem in `output_dir` angegebenen Ordnernamen erstellt. + +Wenn Sie Ihrem Repository einen bestimmten Namen geben möchten, fügen Sie ihn mit dem Argument `push_to_hub_model_id` hinzu. Das Repository wird automatisch unter Ihrem Namensraum aufgeführt. + +Das folgende Beispiel zeigt, wie Sie ein Modell mit einem bestimmten Repository-Namen hochladen können: + +```bash +python examples/pytorch/summarization/run_summarization.py + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --push_to_hub \ + --push_to_hub_model_id finetuned-t5-cnn_dailymail \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --overwrite_output_dir \ + --predict_with_generate +``` \ No newline at end of file diff --git a/docs/source/de/testing.md b/docs/source/de/testing.md new file mode 100644 index 00000000000000..25c1143e381de8 --- /dev/null +++ b/docs/source/de/testing.md @@ -0,0 +1,1293 @@ + + +# Testen + + +Werfen wir einen Blick darauf, wie 🤗 Transformers-Modelle getestet werden und wie Sie neue Tests schreiben und die vorhandenen verbessern können. + +Es gibt 2 Testsuiten im Repository: + +1. `tests` -- Tests für die allgemeine API +2. `examples` -- Tests hauptsächlich für verschiedene Anwendungen, die nicht Teil der API sind + +## Wie Transformatoren getestet werden + +1. Sobald ein PR eingereicht wurde, wird er mit 9 CircleCi Jobs getestet. Jeder neue Commit zu diesem PR wird erneut getestet. Diese Aufträge + sind in dieser [Konfigurationsdatei](https://github.com/huggingface/transformers/tree/main/.circleci/config.yml) definiert, so dass Sie bei Bedarf die gleiche Umgebung auf Ihrem Rechner reproduzieren können. + Umgebung auf Ihrem Rechner reproduzieren können. + + Diese CI-Jobs führen keine `@slow`-Tests durch. + +2. 
Es gibt 3 Jobs, die von [github actions](https://github.com/huggingface/transformers/actions) ausgeführt werden: + + - [torch hub integration](https://github.com/huggingface/transformers/tree/main/.github/workflows/github-torch-hub.yml): prüft, ob die torch hub + Integration funktioniert. + + - [self-hosted (push)](https://github.com/huggingface/transformers/tree/main/.github/workflows/self-push.yml): führt schnelle Tests auf der GPU nur bei Commits auf + `main`. Es wird nur ausgeführt, wenn ein Commit auf `main` den Code in einem der folgenden Ordner aktualisiert hat: `src`, + `tests`, `.github` (um zu verhindern, dass er auf hinzugefügten Modellkarten, Notebooks usw. läuft) + + - [self-hosted runner](https://github.com/huggingface/transformers/tree/main/.github/workflows/self-scheduled.yml): führt normale und langsame Tests auf GPU in + `tests` und `examples`: + +```bash +RUN_SLOW=1 pytest tests/ +RUN_SLOW=1 pytest examples/ +``` + + Die Ergebnisse können Sie [hier](https://github.com/huggingface/transformers/actions) sehen. + + + +## Tests ausführen + + + + + +### Auswahl der auszuführenden Tests + +In diesem Dokument wird ausführlich erläutert, wie Tests ausgeführt werden können. Wenn Sie nach der Lektüre noch mehr Details benötigen +finden Sie diese [hier](https://docs.pytest.org/en/latest/usage.html). + +Hier sind einige der nützlichsten Möglichkeiten, Tests auszuführen. + +Alle ausführen: + +```console +pytest +``` + +oder: + +```bash +make test +``` + +Beachten Sie, dass Letzteres wie folgt definiert ist: + +```bash +python -m pytest -n auto --dist=loadfile -s -v ./tests/ +``` + +was pytest anweist: + +- so viele Testprozesse laufen zu lassen, wie es CPU-Kerne gibt (was zu viele sein könnten, wenn Sie nicht über eine Menge RAM verfügen!) +- sicherzustellen, dass alle Tests aus derselben Datei von demselben Testprozess ausgeführt werden +- Erfassen Sie keine Ausgaben +- im ausführlichen Modus laufen lassen + + + +### Abrufen der Liste aller Tests + +Alle Tests der Testsuite: + +```bash +pytest --collect-only -q +``` + +Alle Tests einer bestimmten Testdatei: + +```bash +pytest tests/test_optimization.py --collect-only -q +``` + +### Führen Sie ein bestimmtes Testmodul aus + +Um ein einzelnes Testmodul auszuführen: + +```bash +pytest tests/utils/test_logging.py +``` + +### Spezifische Tests ausführen + +Da unittest in den meisten Tests verwendet wird, müssen Sie, um bestimmte Untertests auszuführen, den Namen der unittest +Klasse, die diese Tests enthält. Er könnte zum Beispiel lauten: + +```bash +pytest tests/test_optimization.py::OptimizationTest::test_adam_w +``` + +Hier: + +- `tests/test_optimization.py` - die Datei mit den Tests +- `OptimizationTest` - der Name der Klasse +- `test_adam_w` - der Name der spezifischen Testfunktion + +Wenn die Datei mehrere Klassen enthält, können Sie auswählen, dass nur die Tests einer bestimmten Klasse ausgeführt werden sollen. Zum Beispiel: + +```bash +pytest tests/test_optimization.py::OptimizationTest +``` + +führt alle Tests innerhalb dieser Klasse aus. + +Wie bereits erwähnt, können Sie sehen, welche Tests in der Klasse `OptimizationTest` enthalten sind, indem Sie sie ausführen: + +```bash +pytest tests/test_optimization.py::OptimizationTest --collect-only -q +``` + +Sie können Tests mit Hilfe von Schlüsselwortausdrücken ausführen. 
+
+Um nur Tests auszuführen, deren Name `adam` enthält:
+
+```bash
+pytest -k adam tests/test_optimization.py
+```
+
+Die logischen Operatoren `and` und `or` können verwendet werden, um anzugeben, ob alle Schlüsselwörter übereinstimmen sollen oder nur eines. Mit `not` können Sie negieren.
+
+Um alle Tests auszuführen, außer denen, deren Name `adam` enthält:
+
+```bash
+pytest -k "not adam" tests/test_optimization.py
+```
+
+Und Sie können die beiden Muster in einem kombinieren:
+
+```bash
+pytest -k "ada and not adam" tests/test_optimization.py
+```
+
+Um zum Beispiel sowohl `test_adafactor` als auch `test_adam_w` auszuführen, können Sie verwenden:
+
+```bash
+pytest -k "test_adafactor or test_adam_w" tests/test_optimization.py
+```
+
+Beachten Sie, dass wir hier `or` verwenden, da wir wollen, dass eines der Schlüsselwörter übereinstimmt, um beide einzuschließen.
+
+Wenn Sie nur Tests einschließen möchten, die beide Muster enthalten, müssen Sie `and` verwenden:
+
+```bash
+pytest -k "test and ada" tests/test_optimization.py
+```
+
+### Führen Sie `accelerate` Tests durch
+
+Manchmal müssen Sie `accelerate` Tests für Ihre Modelle ausführen. Dazu fügen Sie einfach `-m accelerate_tests` zu Ihrem Befehl hinzu, z.B. wenn Sie diese Tests für `OPT` ausführen möchten:
+```bash
+RUN_SLOW=1 pytest -m accelerate_tests tests/models/opt/test_modeling_opt.py
+```
+
+
+### Dokumentationstests ausführen
+
+Um zu testen, ob die Dokumentationsbeispiele korrekt sind, sollten Sie überprüfen, ob die `doctests` erfolgreich sind.
+Lassen Sie uns als Beispiel den docstring von [WhisperModel.forward](https://github.com/huggingface/transformers/blob/main/src/transformers/models/whisper/modeling_whisper.py#L1017-L1035) verwenden:
+
+```python
+r"""
+Returns:
+
+Example:
+    ```python
+    >>> import torch
+    >>> from transformers import WhisperModel, WhisperFeatureExtractor
+    >>> from datasets import load_dataset
+
+    >>> model = WhisperModel.from_pretrained("openai/whisper-base")
+    >>> feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
+    >>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+    >>> inputs = feature_extractor(ds[0]["audio"]["array"], return_tensors="pt")
+    >>> input_features = inputs.input_features
+    >>> decoder_input_ids = torch.tensor([[1, 1]]) * model.config.decoder_start_token_id
+    >>> last_hidden_state = model(input_features, decoder_input_ids=decoder_input_ids).last_hidden_state
+    >>> list(last_hidden_state.shape)
+    [1, 2, 512]
+    ```"""
+
+```
+
+Führen Sie einfach die folgende Zeile aus, um automatisch jedes docstring-Beispiel in der gewünschten Datei zu testen:
+```bash
+pytest --doctest-modules <Pfad_zur_Datei_oder_zum_Verzeichnis>
+```
+Wenn die Datei eine Markdown-Erweiterung hat, sollten Sie das Argument `--doctest-glob="*.md"` hinzufügen.
+
+### Nur geänderte Tests ausführen
+
+Mit [pytest-picked](https://github.com/anapaulagomes/pytest-picked) können Sie die Tests ausführen, die sich auf die unstaged Dateien oder den aktuellen Zweig (gemäß Git) beziehen. Auf diese Weise können Sie schnell überprüfen, ob Ihre Änderungen nichts kaputt gemacht haben, da Tests für Dateien, die Sie nicht verändert haben, gar nicht erst ausgeführt werden.
+
+```bash
+pip install pytest-picked
+```
+
+```bash
+pytest --picked
+```
+
+Alle Tests werden aus Dateien und Ordnern ausgeführt, die geändert, aber noch nicht committet wurden.
+ +### Fehlgeschlagene Tests bei Änderung der Quelle automatisch wiederholen + +[pytest-xdist](https://github.com/pytest-dev/pytest-xdist) bietet eine sehr nützliche Funktion zur Erkennung aller fehlgeschlagenen +Tests zu erkennen und dann darauf zu warten, dass Sie Dateien ändern, um die fehlgeschlagenen Tests so lange zu wiederholen, bis sie erfolgreich sind, während Sie die +sie reparieren. So müssen Sie pytest nicht erneut starten, nachdem Sie die Korrektur vorgenommen haben. Dies wird so lange wiederholt, bis alle Tests bestanden sind. +Danach wird erneut ein vollständiger Durchlauf durchgeführt. + +```bash +pip install pytest-xdist +``` + +So rufen Sie den Modus auf: `pytest -f` oder `pytest --looponfail` + +Datei-Änderungen werden erkannt, indem die Wurzelverzeichnisse von `looponfailroots` und alle ihre Inhalte (rekursiv) untersucht werden. +Wenn die Vorgabe für diesen Wert für Sie nicht funktioniert, können Sie ihn in Ihrem Projekt ändern, indem Sie eine Konfigurations +Option in der Datei `setup.cfg` ändern: + +```ini +[tool:pytest] +looponfailroots = transformers tests +``` + +oder die Dateien `pytest.ini`/`tox.ini``: + +```ini +[pytest] +looponfailroots = transformers tests +``` + +Dies würde dazu führen, dass nur nach Dateiänderungen in den jeweiligen Verzeichnissen gesucht wird, die relativ zum Verzeichnis der ini-Datei angegeben sind. +Verzeichnis. + +[pytest-watch](https://github.com/joeyespo/pytest-watch) ist eine alternative Implementierung dieser Funktionalität. + + +### Überspringen eines Testmoduls + +Wenn Sie alle Testmodule ausführen möchten, mit Ausnahme einiger weniger, können Sie diese ausschließen, indem Sie eine explizite Liste der auszuführenden Tests angeben. Für +Beispiel: Um alle Tests außer `test_modeling_*.py` auszuführen: + +```bash +pytest *ls -1 tests/*py | grep -v test_modeling* +``` + +### Status leeren + +CI-Builds und wenn Isolation wichtig ist (gegen Geschwindigkeit), sollte der Cache geleert werden: + +```bash +pytest --cache-clear tests +``` + +### Tests parallel ausführen + +Wie bereits erwähnt, führt `make test` über das Plugin `pytest-xdist` Tests parallel aus (Argument `-n X`, z.B. `-n 2` +um 2 Jobs parallel laufen zu lassen). + +Mit der Option `--dist=` von `pytest-xdist` können Sie steuern, wie die Tests gruppiert werden. Mit `--dist=loadfile` werden die +Tests, die sich in einer Datei befinden, in denselben Prozess. + +Da die Reihenfolge der ausgeführten Tests unterschiedlich und nicht vorhersehbar ist, kann die Ausführung der Testsuite mit `pytest-xdist` +zu Fehlern führt (was bedeutet, dass wir einige unentdeckte gekoppelte Tests haben), verwenden Sie [pytest-replay](https://github.com/ESSS/pytest-replay), um die Tests in der gleichen Reihenfolge abzuspielen, was dabei helfen sollte +diese fehlgeschlagene Sequenz auf ein Minimum zu reduzieren. + +### Testreihenfolge und Wiederholung + +Es ist gut, die Tests mehrmals zu wiederholen, nacheinander, zufällig oder in Gruppen, um mögliche +Abhängigkeiten und zustandsbezogene Fehler zu erkennen (Abriss). Und die einfache, mehrfache Wiederholung ist einfach gut, um +einige Probleme zu erkennen, die durch die Zufälligkeit von DL aufgedeckt werden. 
+
+
+#### Wiederholungstests
+
+- [pytest-flakefinder](https://github.com/dropbox/pytest-flakefinder):
+
+```bash
+pip install pytest-flakefinder
+```
+
+Und führen Sie dann jeden Test mehrmals durch (standardmäßig 50):
+
+```bash
+pytest --flake-finder --flake-runs=5 tests/test_failing_test.py
+```
+
+
+
+Dieses Plugin funktioniert nicht mit dem `-n` Flag von `pytest-xdist`.
+
+
+
+
+
+Es gibt noch ein anderes Plugin `pytest-repeat`, aber es funktioniert nicht mit `unittest`.
+
+
+
+#### Tests in zufälliger Reihenfolge ausführen
+
+```bash
+pip install pytest-random-order
+```
+
+Wichtig: Das Vorhandensein von `pytest-random-order` sorgt für eine automatische Zufallsanordnung der Tests; es sind keine Konfigurationsänderungen oder
+Befehlszeilenoptionen erforderlich.
+
+Wie bereits erläutert, ermöglicht dies die Erkennung von gekoppelten Tests - bei denen der Zustand eines Tests den Zustand eines anderen beeinflusst. Wenn
+`pytest-random-order` installiert ist, gibt es den Zufallswert aus, der für diese Sitzung verwendet wurde, z.B.:
+
+```bash
+pytest tests
+[...]
+Using --random-order-bucket=module
+Using --random-order-seed=573663
+```
+
+Wenn eine bestimmte Sequenz fehlschlägt, können Sie sie reproduzieren, indem Sie genau diesen Seed hinzufügen, z.B.:
+
+```bash
+pytest --random-order-seed=573663
+[...]
+Using --random-order-bucket=module
+Using --random-order-seed=573663
+```
+
+Es wird nur dann die exakte Reihenfolge reproduziert, wenn Sie genau dieselbe Liste von Tests (oder gar keine Liste) verwenden. Sobald Sie beginnen, die Liste
+manuell einzugrenzen, können Sie sich nicht mehr auf den Seed verlassen, sondern müssen die Tests manuell in der genauen Reihenfolge
+auflisten und pytest anweisen, sie nicht zu randomisieren, indem Sie `--random-order-bucket=none` verwenden, z.B.:
+
+```bash
+pytest --random-order-bucket=none tests/test_a.py tests/test_c.py tests/test_b.py
+```
+
+So deaktivieren Sie das Shuffling für alle Tests:
+
+```bash
+pytest --random-order-bucket=none
+```
+
+Standardmäßig ist `--random-order-bucket=module` impliziert, wodurch die Dateien auf Modulebene gemischt werden. Es kann auch
+auf den Ebenen `class`, `package`, `global` und `none` mischen. Die vollständigen Details entnehmen Sie bitte der
+[Dokumentation](https://github.com/jbasko/pytest-random-order).
+
+Eine weitere Alternative zur Randomisierung ist [`pytest-randomly`](https://github.com/pytest-dev/pytest-randomly). Dieses
+Modul hat eine sehr ähnliche Funktionalität/Schnittstelle, bietet aber nicht die Bucket-Modi, die in
+`pytest-random-order` zur Verfügung stehen. Es hat das gleiche Problem, dass es sich nach der Installation automatisch aktiviert.
+
+### Variationen von Aussehen und Bedienung
+
+#### pytest-sugar
+
+[pytest-sugar](https://github.com/Frozenball/pytest-sugar) ist ein Plugin, das das Erscheinungsbild verbessert, einen
+Fortschrittsbalken hinzufügt und fehlschlagende Tests sowie die fehlgeschlagenen Assertions sofort anzeigt. Es wird bei der Installation automatisch aktiviert.
+
+```bash
+pip install pytest-sugar
+```
+
+Um Tests ohne dieses Plugin auszuführen, verwenden Sie:
+
+```bash
+pytest -p no:sugar
+```
+
+oder deinstallieren Sie es.
+ + + +#### Melden Sie den Namen jedes Subtests und seinen Fortschritt + +Für einen einzelnen oder eine Gruppe von Tests über `pytest` (nach `pip install pytest-pspec`): + +```bash +pytest --pspec tests/test_optimization.py +``` + +#### Zeigt fehlgeschlagene Tests sofort an + +[pytest-instafail](https://github.com/pytest-dev/pytest-instafail) zeigt Fehlschläge und Fehler sofort an, anstatt +bis zum Ende der Testsitzung zu warten. + +```bash +pip install pytest-instafail +``` + +```bash +pytest --instafail +``` + +### Zu GPU oder nicht zu GPU + +Bei einem GPU-aktivierten Setup fügen Sie zum Testen im reinen CPU-Modus `CUDA_VISIBLE_DEVICES=""` hinzu: + +```bash +CUDA_VISIBLE_DEVICES="" pytest tests/utils/test_logging.py +``` + +oder wenn Sie mehrere Grafikprozessoren haben, können Sie angeben, welcher von `pytest` verwendet werden soll. Wenn Sie zum Beispiel nur den +zweiten Grafikkarte zu verwenden, wenn Sie die Grafikkarten `0` und `1` haben, können Sie folgendes ausführen: + +```bash +CUDA_VISIBLE_DEVICES="1" pytest tests/utils/test_logging.py +``` + +Dies ist praktisch, wenn Sie verschiedene Aufgaben auf verschiedenen GPUs ausführen möchten. + +Einige Tests müssen nur auf der CPU ausgeführt werden, andere entweder auf der CPU, der GPU oder der TPU und wieder andere auf mehreren GPUs. Die folgenden skip +Dekorateure werden verwendet, um die Anforderungen von Tests in Bezug auf CPU/GPU/TPU festzulegen: + +- `require_torch` - dieser Test wird nur unter Torch ausgeführt +- `require_torch_gpu` - wie `require_torch` plus erfordert mindestens 1 GPU +- `require_torch_multi_gpu` - wie `require_torch` und zusätzlich mindestens 2 GPUs erforderlich +- `require_torch_non_multi_gpu` - wie `require_torch` plus benötigt 0 oder 1 GPUs +- `require_torch_up_to_2_gpus` - wie `require_torch` plus erfordert 0 oder 1 oder 2 GPUs +- `require_torch_tpu` - wie `require_torch` plus erfordert mindestens 1 TPU + +Lassen Sie uns die GPU-Anforderungen in der folgenden Tabelle darstellen: + + +| n gpus | decorator | +|--------|--------------------------------| +| `>= 0` | `@require_torch` | +| `>= 1` | `@require_torch_gpu` | +| `>= 2` | `@require_torch_multi_gpu` | +| `< 2` | `@require_torch_non_multi_gpu` | +| `< 3` | `@require_torch_up_to_2_gpus` | + + +Hier ist zum Beispiel ein Test, der nur ausgeführt werden muss, wenn 2 oder mehr GPUs verfügbar sind und pytorch installiert ist: + +```python no-style +@require_torch_multi_gpu +def test_example_with_multi_gpu(): +``` + +Wenn ein Test `tensorflow` benötigt, verwenden Sie den Dekorator `require_tf`. Zum Beispiel: + +```python no-style +@require_tf +def test_tf_thing_with_tensorflow(): +``` + +Diese Dekors können gestapelt werden. Wenn zum Beispiel ein Test langsam ist und mindestens eine GPU unter pytorch benötigt, können Sie +wie Sie ihn einrichten können: + +```python no-style +@require_torch_gpu +@slow +def test_example_slow_on_gpu(): +``` + +Einige Dekoratoren wie `@parametrized` schreiben Testnamen um, daher müssen `@require_*`-Sprungdekoratoren als letztes aufgeführt werden. +zuletzt aufgeführt werden, damit sie korrekt funktionieren. Hier ist ein Beispiel für die korrekte Verwendung: + +```python no-style +@parameterized.expand(...) +@require_torch_multi_gpu +def test_integration_foo(): +``` + +Dieses Problem mit der Reihenfolge gibt es bei `@pytest.mark.parametrize` nicht, Sie können es an den Anfang oder an den Schluss setzen und es wird trotzdem funktionieren. +funktionieren. Aber es funktioniert nur bei Nicht-Unittests. 
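+
+Zur Veranschaulichung eine kleine Skizze (kein Test aus dem Repository, die Namen sind frei gewählt): Bei einem einfachen `pytest`-Test, der zu keiner `unittest`-Klasse gehört, lassen sich `@pytest.mark.parametrize` und ein `@require_*`-Dekorator in beliebiger Reihenfolge kombinieren:
+
+```python no-style
+@pytest.mark.parametrize("batch_size", [1, 8])
+@require_torch_gpu
+def test_feature_x_on_gpu(batch_size):
+    ...
+```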
+ +Innerhalb von Tests: + +- Wie viele GPUs sind verfügbar: + +```python +from transformers.testing_utils import get_gpu_count + +n_gpu = get_gpu_count() # works with torch and tf +``` + +### Testen mit einem bestimmten PyTorch-Backend oder Gerät + +Um die Testsuite auf einem bestimmten Torch-Gerät auszuführen, fügen Sie `TRANSFORMERS_TEST_DEVICE="$Gerät"` hinzu, wobei `$Gerät` das Ziel-Backend ist. Zum Beispiel, um nur auf der CPU zu testen: +```bash +TRANSFORMERS_TEST_DEVICE="cpu" pytest tests/utils/test_logging.py +``` + +Diese Variable ist nützlich, um benutzerdefinierte oder weniger verbreitete PyTorch-Backends wie `mps` zu testen. Sie kann auch verwendet werden, um den gleichen Effekt wie `CUDA_VISIBLE_DEVICES` zu erzielen, indem Sie bestimmte GPUs anvisieren oder im reinen CPU-Modus testen. + +Bestimmte Geräte erfordern einen zusätzlichen Import, nachdem Sie `torch` zum ersten Mal importiert haben. Dies kann über die Umgebungsvariable `TRANSFORMERS_TEST_BACKEND` festgelegt werden: +```bash +TRANSFORMERS_TEST_BACKEND="torch_npu" pytest tests/utils/test_logging.py +``` + + +### Verteiltes Training + +`pytest` kann nicht direkt mit verteiltem Training umgehen. Wenn dies versucht wird, tun die Unterprozesse nicht das Richtige +und denken am Ende, sie seien `pytest` und beginnen, die Testsuite in Schleifen auszuführen. Es funktioniert jedoch, wenn man +einen normalen Prozess erzeugt, der dann mehrere Worker erzeugt und die IO-Pipes verwaltet. + +Hier sind einige Tests, die dies verwenden: + +- [test_trainer_distributed.py](https://github.com/huggingface/transformers/tree/main/tests/trainer/test_trainer_distributed.py) +- [test_deepspeed.py](https://github.com/huggingface/transformers/tree/main/tests/deepspeed/test_deepspeed.py) + +Um direkt mit der Ausführung zu beginnen, suchen Sie in diesen Tests nach dem Aufruf `execute_subprocess_async`. + +Sie benötigen mindestens 2 GPUs, um diese Tests in Aktion zu sehen: + +```bash +CUDA_VISIBLE_DEVICES=0,1 RUN_SLOW=1 pytest -sv tests/test_trainer_distributed.py +``` + +### Erfassung von Ausgaben + +Während der Testausführung werden alle Ausgaben, die an `stdout` und `stderr` gesendet werden, aufgezeichnet. Wenn ein Test oder eine Setup-Methode fehlschlägt, wird die +wird die entsprechende aufgezeichnete Ausgabe in der Regel zusammen mit dem Fehler-Traceback angezeigt. + +Um die Aufzeichnung von Ausgaben zu deaktivieren und `stdout` und `stderr` normal zu erhalten, verwenden Sie `-s` oder `--capture=no`: + +```bash +pytest -s tests/utils/test_logging.py +``` + +So senden Sie Testergebnisse an die JUnit-Formatausgabe: + +```bash +py.test tests --junitxml=result.xml +``` + +### Farbsteuerung + +Keine Farbe zu haben (z.B. gelb auf weißem Hintergrund ist nicht lesbar): + +```bash +pytest --color=no tests/utils/test_logging.py +``` + +### Testbericht an den Online-Dienst pastebin senden + +Erstellen Sie eine URL für jeden Testfehler: + +```bash +pytest --pastebin=failed tests/utils/test_logging.py +``` + +Dadurch werden Informationen über den Testlauf an einen entfernten Paste-Dienst übermittelt und eine URL für jeden Fehlschlag bereitgestellt. Sie können die +Tests wie gewohnt auswählen oder z.B. -x hinzufügen, wenn Sie nur einen bestimmten Fehler senden möchten. 
+ +Erstellen einer URL für ein ganzes Testsitzungsprotokoll: + +```bash +pytest --pastebin=all tests/utils/test_logging.py +``` + +## Tests schreiben + +🤗 Die Tests von Transformers basieren auf `unittest`, werden aber von `pytest` ausgeführt, so dass die meiste Zeit Funktionen aus beiden Systemen +verwendet werden können. + +Sie können [hier](https://docs.pytest.org/en/stable/unittest.html) nachlesen, welche Funktionen unterstützt werden, aber das Wichtigste ist +Wichtig ist, dass die meisten `pytest`-Fixtures nicht funktionieren. Auch die Parametrisierung nicht, aber wir verwenden das Modul +`parametrisiert`, das auf ähnliche Weise funktioniert. + + +### Parametrisierung + +Oft besteht die Notwendigkeit, denselben Test mehrmals auszuführen, aber mit unterschiedlichen Argumenten. Das könnte innerhalb des Tests geschehen +des Tests gemacht werden, aber dann gibt es keine Möglichkeit, den Test mit nur einem Satz von Argumenten auszuführen. + +```python +# test_this1.py +import unittest +from parameterized import parameterized + + +class TestMathUnitTest(unittest.TestCase): + @parameterized.expand( + [ + ("negative", -1.5, -2.0), + ("integer", 1, 1.0), + ("large fraction", 1.6, 1), + ] + ) + def test_floor(self, name, input, expected): + assert_equal(math.floor(input), expected) +``` + +Nun wird dieser Test standardmäßig 3 Mal ausgeführt, wobei jedes Mal die letzten 3 Argumente von `test_floor` den entsprechenden Argumenten in der Parameterliste zugeordnet werden. +die entsprechenden Argumente in der Parameterliste. + +Sie können auch nur die Parameter `negativ` und `ganzzahlig` mit ausführen: + +```bash +pytest -k "negative and integer" tests/test_mytest.py +``` + +oder alle Untertests außer `negativ`, mit: + +```bash +pytest -k "not negative" tests/test_mytest.py +``` + +Neben der Verwendung des gerade erwähnten Filters `-k` können Sie auch den genauen Namen jedes Untertests herausfinden und jeden +oder alle unter Verwendung ihrer genauen Namen ausführen. + +```bash +pytest test_this1.py --collect-only -q +``` + +und es wird aufgelistet: + +```bash +test_this1.py::TestMathUnitTest::test_floor_0_negative +test_this1.py::TestMathUnitTest::test_floor_1_integer +test_this1.py::TestMathUnitTest::test_floor_2_large_fraction +``` + +Jetzt können Sie also nur 2 spezifische Untertests durchführen: + +```bash +pytest test_this1.py::TestMathUnitTest::test_floor_0_negative test_this1.py::TestMathUnitTest::test_floor_1_integer +``` + +Das Modul [parametrisiert](https://pypi.org/project/parameterized/), das sich bereits in den Entwickler-Abhängigkeiten befindet +von `transformers` befindet, funktioniert sowohl für `unittests` als auch für `pytest` Tests. + +Wenn es sich bei dem Test jedoch nicht um einen `Unittest` handelt, können Sie `pytest.mark.parametrize` verwenden (oder Sie können sehen, dass es in +einigen bestehenden Tests verwendet wird, meist unter `Beispiele`). + +Hier ist das gleiche Beispiel, diesmal unter Verwendung der `parametrize`-Markierung von `pytest`: + +```python +# test_this2.py +import pytest + + +@pytest.mark.parametrize( + "name, input, expected", + [ + ("negative", -1.5, -2.0), + ("integer", 1, 1.0), + ("large fraction", 1.6, 1), + ], +) +def test_floor(name, input, expected): + assert_equal(math.floor(input), expected) +``` + +Genau wie bei `parametrisiert` können Sie mit `pytest.mark.parametrize` genau steuern, welche Subtests ausgeführt werden +ausgeführt werden, wenn der Filter `-k` nicht ausreicht. 
Allerdings erzeugt diese Parametrisierungsfunktion einen etwas anderen Satz von +Namen für die Untertests. Sie sehen folgendermaßen aus: + +```bash +pytest test_this2.py --collect-only -q +``` + +und es wird aufgelistet: + +```bash +test_this2.py::test_floor[integer-1-1.0] +test_this2.py::test_floor[negative--1.5--2.0] +test_this2.py::test_floor[large fraction-1.6-1] +``` + +Jetzt können Sie also nur den spezifischen Test durchführen: + +```bash +pytest test_this2.py::test_floor[negative--1.5--2.0] test_this2.py::test_floor[integer-1-1.0] +``` + +wie im vorherigen Beispiel. + + + +### Dateien und Verzeichnisse + +In Tests müssen wir oft wissen, wo sich Dinge relativ zur aktuellen Testdatei befinden, und das ist nicht trivial, da der Test +von mehreren Verzeichnissen aus aufgerufen werden kann oder sich in Unterverzeichnissen mit unterschiedlicher Tiefe befinden kann. Eine Hilfsklasse +`transformers.test_utils.TestCasePlus` löst dieses Problem, indem sie alle grundlegenden Pfade sortiert und einfache +Zugriffsmöglichkeiten auf sie bietet: + +- `pathlib`-Objekte (alle vollständig aufgelöst): + + - `test_file_path` - der aktuelle Testdateipfad, d.h. `__file__` + - `test_file_dir` - das Verzeichnis, das die aktuelle Testdatei enthält + - `tests_dir` - das Verzeichnis der `tests` Testreihe + - `examples_dir` - das Verzeichnis der `examples` Test-Suite + - `repo_root_dir` - das Verzeichnis des Repositorys + - `src_dir` - das Verzeichnis von `src` (d.h. wo sich das Unterverzeichnis `transformers` befindet) + +- stringifizierte Pfade - wie oben, aber diese geben Pfade als Strings zurück, anstatt als `pathlib`-Objekte: + + - `test_file_path_str` + - `test_file_dir_str` + - `tests_dir_str` + - `examples_dir_str` + - `repo_root_dir_str` + - `src_dir_str` + +Um diese zu verwenden, müssen Sie lediglich sicherstellen, dass der Test in einer Unterklasse von +`transformers.test_utils.TestCasePlus` befindet. Zum Beispiel: + +```python +from transformers.testing_utils import TestCasePlus + + +class PathExampleTest(TestCasePlus): + def test_something_involving_local_locations(self): + data_dir = self.tests_dir / "fixtures/tests_samples/wmt_en_ro" +``` + +Wenn Sie Pfade nicht über `pathlib` manipulieren müssen oder nur einen Pfad als String benötigen, können Sie jederzeit +`str()` auf das `pathlib`-Objekt anwenden oder die Accessoren mit der Endung `_str` verwenden. Zum Beispiel: + +```python +from transformers.testing_utils import TestCasePlus + + +class PathExampleTest(TestCasePlus): + def test_something_involving_stringified_locations(self): + examples_dir = self.examples_dir_str +``` + +### Temporäre Dateien und Verzeichnisse + +Die Verwendung eindeutiger temporärer Dateien und Verzeichnisse ist für die parallele Durchführung von Tests unerlässlich, damit sich die Tests nicht gegenseitig überschreiben. +Daten gegenseitig überschreiben. Außerdem möchten wir, dass die temporären Dateien und Verzeichnisse am Ende jedes Tests, der sie erstellt hat, gelöscht werden. +erstellt hat. Daher ist die Verwendung von Paketen wie `tempfile`, die diese Anforderungen erfüllen, unerlässlich. + +Beim Debuggen von Tests müssen Sie jedoch sehen können, was in der temporären Datei oder dem temporären Verzeichnis gespeichert wird und Sie möchten +Sie müssen den genauen Pfad kennen und dürfen ihn nicht bei jedem neuen Testdurchlauf zufällig ändern. + +Für solche Zwecke ist die Hilfsklasse `transformers.test_utils.TestCasePlus` am besten geeignet. 
Sie ist eine Unterklasse von +Unittest.TestCase`, so dass wir in den Testmodulen einfach von ihr erben können. + +Hier ist ein Beispiel für die Verwendung dieser Klasse: + +```python +from transformers.testing_utils import TestCasePlus + + +class ExamplesTests(TestCasePlus): + def test_whatever(self): + tmp_dir = self.get_auto_remove_tmp_dir() +``` + +Dieser Code erstellt ein eindeutiges temporäres Verzeichnis und setzt `tmp_dir` auf dessen Speicherort. + +- Erstellen Sie ein eindeutiges temporäres Verzeichnis: + +```python +def test_whatever(self): + tmp_dir = self.get_auto_remove_tmp_dir() +``` + +tmp_dir" enthält den Pfad zu dem erstellten temporären Verzeichnis. Es wird am Ende des Tests automatisch entfernt. +Tests entfernt. + +- Erstellen Sie ein temporäres Verzeichnis meiner Wahl, stellen Sie sicher, dass es leer ist, bevor der Test beginnt, und leeren Sie es nach dem Test nicht. + +```python +def test_whatever(self): + tmp_dir = self.get_auto_remove_tmp_dir("./xxx") +``` + +Dies ist nützlich für die Fehlersuche, wenn Sie ein bestimmtes Verzeichnis überwachen und sicherstellen möchten, dass die vorherigen Tests keine Daten darin hinterlassen haben. +keine Daten dort hinterlassen haben. + +- Sie können das Standardverhalten außer Kraft setzen, indem Sie die Argumente `before` und `after` direkt überschreiben, was zu einem der folgenden Verhaltensweisen führt + folgenden Verhaltensweisen: + + - `before=True`: das temporäre Verzeichnis wird immer zu Beginn des Tests gelöscht. + - `before=False`: wenn das temporäre Verzeichnis bereits existiert, bleiben alle vorhandenen Dateien dort erhalten. + - `after=True`: das temporäre Verzeichnis wird immer am Ende des Tests gelöscht. + - `after=False`: das temporäre Verzeichnis wird am Ende des Tests immer beibehalten. + + + +Um das Äquivalent von `rm -r` sicher ausführen zu können, sind nur Unterverzeichnisse des Projektarchivs checkout erlaubt, wenn +ein explizites `tmp_dir` verwendet wird, so dass nicht versehentlich ein `/tmp` oder ein ähnlich wichtiger Teil des Dateisystems vernichtet wird. +d.h. geben Sie bitte immer Pfade an, die mit `./` beginnen. + + + + + +Jeder Test kann mehrere temporäre Verzeichnisse registrieren, die alle automatisch entfernt werden, sofern nicht anders gewünscht. +anders. + + + +### Temporäre Überschreibung von sys.path + +Wenn Sie `sys.path` vorübergehend überschreiben müssen, um z.B. von einem anderen Test zu importieren, können Sie den +Kontextmanager `ExtendSysPath` verwenden. Beispiel: + + +```python +import os +from transformers.testing_utils import ExtendSysPath + +bindir = os.path.abspath(os.path.dirname(__file__)) +with ExtendSysPath(f"{bindir}/.."): + from test_trainer import TrainerIntegrationCommon # noqa +``` + +### Überspringen von Tests + +Dies ist nützlich, wenn ein Fehler gefunden und ein neuer Test geschrieben wird, der Fehler aber noch nicht behoben ist. Damit wir ihn +in das Haupt-Repository zu übertragen, müssen wir sicherstellen, dass er bei `make test` übersprungen wird. + +Methoden: + +- Ein **Skip** bedeutet, dass Sie erwarten, dass Ihr Test nur dann erfolgreich ist, wenn einige Bedingungen erfüllt sind, andernfalls sollte pytest den Test überspringen. + die Ausführung des Tests ganz überspringen. Übliche Beispiele sind das Überspringen von Tests, die nur unter Windows laufen, auf Nicht-Windows-Plattformen oder das Überspringen von + Tests, die von einer externen Ressource abhängen, die im Moment nicht verfügbar ist (z.B. eine Datenbank). 
+ +- Ein **xfail** bedeutet, dass Sie erwarten, dass ein Test aus irgendeinem Grund fehlschlägt. Ein gängiges Beispiel ist ein Test für eine Funktion, die noch nicht + noch nicht implementiert oder ein noch nicht behobener Fehler. Wenn ein Test trotz eines erwarteten Fehlschlags bestanden wird (markiert mit + pytest.mark.xfail), ist dies ein xpass und wird in der Testzusammenfassung gemeldet. + +Einer der wichtigsten Unterschiede zwischen den beiden ist, dass `skip` den Test nicht ausführt, während `xfail` dies tut. Wenn also der +Code, der fehlerhaft ist, einen schlechten Zustand verursacht, der sich auf andere Tests auswirkt, sollten Sie also nicht `xfail` verwenden. + +#### Implementierung + +- Hier sehen Sie, wie Sie einen ganzen Test bedingungslos überspringen können: + +```python no-style +@unittest.skip("this bug needs to be fixed") +def test_feature_x(): +``` + +oder mit pytest: + +```python no-style +@pytest.mark.skip(reason="this bug needs to be fixed") +``` + +oder mit dem `xfail` Weg: + +```python no-style +@pytest.mark.xfail +def test_feature_x(): +``` + +- Hier erfahren Sie, wie Sie einen Test aufgrund einer internen Prüfung innerhalb des Tests auslassen können: + +```python +def test_feature_x(): + if not has_something(): + pytest.skip("unsupported configuration") +``` + +oder das ganze Modul: + +```python +import pytest + +if not pytest.config.getoption("--custom-flag"): + pytest.skip("--custom-flag is missing, skipping tests", allow_module_level=True) +``` + +oder mit dem `xfail` Weg: + +```python +def test_feature_x(): + pytest.xfail("expected to fail until bug XYZ is fixed") +``` + +- Hier erfahren Sie, wie Sie alle Tests in einem Modul überspringen können, wenn ein Import fehlt: + +```python +docutils = pytest.importorskip("docutils", minversion="0.3") +``` + +- Einen Test aufgrund einer Bedingung überspringen: + +```python no-style +@pytest.mark.skipif(sys.version_info < (3,6), reason="requires python3.6 or higher") +def test_feature_x(): +``` + +oder: + +```python no-style +@unittest.skipIf(torch_device == "cpu", "Can't do half precision") +def test_feature_x(): +``` + +oder überspringen Sie das ganze Modul: + +```python no-style +@pytest.mark.skipif(sys.platform == 'win32', reason="does not run on windows") +class TestClass(): + def test_feature_x(self): +``` + +Weitere Details, Beispiele und Möglichkeiten finden Sie [hier](https://docs.pytest.org/en/latest/skipping.html). + +### Langsame Tests + +Die Bibliothek der Tests wächst ständig, und einige der Tests brauchen Minuten, um ausgeführt zu werden, daher können wir es uns nicht leisten, eine Stunde zu warten, bis die +eine Stunde auf die Fertigstellung der Testsuite auf CI zu warten. Daher sollten langsame Tests, mit einigen Ausnahmen für wichtige Tests, wie im folgenden Beispiel +wie im folgenden Beispiel markiert werden: + +```python no-style +from transformers.testing_utils import slow +@slow +def test_integration_foo(): +``` + +Sobald ein Test als `@slow` markiert ist, setzen Sie die Umgebungsvariable `RUN_SLOW=1`, um solche Tests auszuführen, z.B: + +```bash +RUN_SLOW=1 pytest tests +``` + +Einige Dekoratoren wie `@parameterized` schreiben Testnamen um, daher müssen `@slow` und die übrigen Skip-Dekoratoren +`@require_*` müssen als letztes aufgeführt werden, damit sie korrekt funktionieren. Hier ist ein Beispiel für die korrekte Verwendung: + +```python no-style +@parameterized.expand(...) 
+@slow +def test_integration_foo(): +``` + +Wie zu Beginn dieses Dokuments erläutert, werden langsame Tests nach einem Zeitplan ausgeführt und nicht in PRs CI +Prüfungen. Es ist also möglich, dass einige Probleme bei der Einreichung eines PRs übersehen werden und zusammengeführt werden. Solche Probleme werden +werden beim nächsten geplanten CI-Job abgefangen. Das bedeutet aber auch, dass es wichtig ist, die langsamen Tests auf Ihrem +Rechner auszuführen, bevor Sie den PR einreichen. + +Hier ist ein grober Entscheidungsmechanismus für die Auswahl der Tests, die als langsam markiert werden sollen: + +Wenn der Test auf eine der internen Komponenten der Bibliothek ausgerichtet ist (z.B. Modellierungsdateien, Tokenisierungsdateien, +Pipelines), dann sollten wir diesen Test in der nicht langsamen Testsuite ausführen. Wenn er sich auf einen anderen Aspekt der Bibliothek bezieht, +wie z.B. die Dokumentation oder die Beispiele, dann sollten wir diese Tests in der langsamen Testsuite durchführen. Und dann, zur Verfeinerung +Ansatz zu verfeinern, sollten wir Ausnahmen einführen: + +- Alle Tests, die einen umfangreichen Satz von Gewichten oder einen Datensatz mit einer Größe von mehr als ~50MB herunterladen müssen (z.B. Modell- oder + Tokenizer-Integrationstests, Pipeline-Integrationstests) sollten auf langsam gesetzt werden. Wenn Sie ein neues Modell hinzufügen, sollten Sie + sollten Sie eine kleine Version des Modells (mit zufälligen Gewichtungen) für Integrationstests erstellen und in den Hub hochladen. Dies wird + wird in den folgenden Abschnitten erläutert. +- Alle Tests, die ein Training durchführen müssen, das nicht speziell auf Schnelligkeit optimiert ist, sollten auf langsam gesetzt werden. +- Wir können Ausnahmen einführen, wenn einige dieser Tests, die nicht langsam sein sollten, unerträglich langsam sind, und sie auf + `@slow`. Auto-Modellierungstests, die große Dateien auf der Festplatte speichern und laden, sind ein gutes Beispiel für Tests, die als + als `@slow` markiert sind. +- Wenn ein Test in weniger als 1 Sekunde auf CI abgeschlossen wird (einschließlich eventueller Downloads), sollte es sich trotzdem um einen normalen Test handeln. + +Insgesamt müssen alle nicht langsamen Tests die verschiedenen Interna abdecken und dabei schnell bleiben. Zum Beispiel, +kann eine signifikante Abdeckung erreicht werden, indem Sie mit speziell erstellten kleinen Modellen mit zufälligen Gewichten testen. Solche Modelle +haben eine sehr geringe Anzahl von Schichten (z.B. 2), Vokabeln (z.B. 1000), usw. Dann können die `@slow`-Tests große +langsame Modelle verwenden, um qualitative Tests durchzuführen. Um die Verwendung dieser Modelle zu sehen, suchen Sie einfach nach *winzigen* Modellen mit: + +```bash +grep tiny tests examples +``` + +Hier ist ein Beispiel für ein [Skript](https://github.com/huggingface/transformers/tree/main/scripts/fsmt/fsmt-make-tiny-model.py), das das winzige Modell erstellt hat +[stas/tiny-wmt19-en-de](https://huggingface.co/stas/tiny-wmt19-en-de). Sie können es ganz einfach an Ihre eigene +Architektur Ihres Modells anpassen. + +Es ist leicht, die Laufzeit falsch zu messen, wenn zum Beispiel ein großes Modell heruntergeladen wird, aber wenn +Sie es lokal testen, würden die heruntergeladenen Dateien zwischengespeichert und somit die Download-Zeit nicht gemessen werden. Prüfen Sie daher den +Ausführungsgeschwindigkeitsbericht in den CI-Protokollen (die Ausgabe von `pytest --durations=0 tests`). 
+ +Dieser Bericht ist auch nützlich, um langsame Ausreißer zu finden, die nicht als solche gekennzeichnet sind oder die neu geschrieben werden müssen, um schnell zu sein. +Wenn Sie bemerken, dass die Testsuite beim CI langsam wird, zeigt die oberste Liste dieses Berichts die langsamsten +Tests. + + +### Testen der stdout/stderr-Ausgabe + +Um Funktionen zu testen, die in `stdout` und/oder `stderr` schreiben, kann der Test auf diese Ströme zugreifen, indem er die +[capsys system](https://docs.pytest.org/en/latest/capture.html) von `pytest` zugreifen. So wird dies bewerkstelligt: + +```python +import sys + + +def print_to_stdout(s): + print(s) + + +def print_to_stderr(s): + sys.stderr.write(s) + + +def test_result_and_stdout(capsys): + msg = "Hello" + print_to_stdout(msg) + print_to_stderr(msg) + out, err = capsys.readouterr() # consume the captured output streams + # optional: if you want to replay the consumed streams: + sys.stdout.write(out) + sys.stderr.write(err) + # test: + assert msg in out + assert msg in err +``` + +Und natürlich wird `stderr` in den meisten Fällen als Teil einer Ausnahme auftreten, so dass try/except in einem solchen Fall verwendet werden muss +Fall verwendet werden: + +```python +def raise_exception(msg): + raise ValueError(msg) + + +def test_something_exception(): + msg = "Not a good value" + error = "" + try: + raise_exception(msg) + except Exception as e: + error = str(e) + assert msg in error, f"{msg} is in the exception:\n{error}" +``` + +Ein anderer Ansatz zur Erfassung von stdout ist `contextlib.redirect_stdout`: + +```python +from io import StringIO +from contextlib import redirect_stdout + + +def print_to_stdout(s): + print(s) + + +def test_result_and_stdout(): + msg = "Hello" + buffer = StringIO() + with redirect_stdout(buffer): + print_to_stdout(msg) + out = buffer.getvalue() + # optional: if you want to replay the consumed streams: + sys.stdout.write(out) + # test: + assert msg in out +``` + +Ein wichtiges potenzielles Problem beim Erfassen von stdout ist, dass es `r` Zeichen enthalten kann, die bei normalem `print` +alles zurücksetzen, was bisher gedruckt wurde. Mit `pytest` gibt es kein Problem, aber mit `pytest -s` werden diese +werden diese Zeichen in den Puffer aufgenommen. Um den Test mit und ohne `-s` laufen zu lassen, müssen Sie also eine zusätzliche Bereinigung +zusätzliche Bereinigung der erfassten Ausgabe vornehmen, indem Sie `re.sub(r'~.*\r', '', buf, 0, re.M)` verwenden. 
+ +Aber dann haben wir einen Hilfskontextmanager-Wrapper, der sich automatisch um alles kümmert, unabhängig davon, ob er +einige "*.*.*.*" enthält oder nicht: + +```python +from transformers.testing_utils import CaptureStdout + +with CaptureStdout() as cs: + function_that_writes_to_stdout() +print(cs.out) +``` + +Hier ist ein vollständiges Testbeispiel: + +```python +from transformers.testing_utils import CaptureStdout + +msg = "Secret message\r" +final = "Hello World" +with CaptureStdout() as cs: + print(msg + final) +assert cs.out == final + "\n", f"captured: {cs.out}, expecting {final}" +``` + +Wenn Sie `stderr` aufzeichnen möchten, verwenden Sie stattdessen die Klasse `CaptureStderr`: + +```python +from transformers.testing_utils import CaptureStderr + +with CaptureStderr() as cs: + function_that_writes_to_stderr() +print(cs.err) +``` + +Wenn Sie beide Streams auf einmal erfassen müssen, verwenden Sie die übergeordnete Klasse `CaptureStd`: + +```python +from transformers.testing_utils import CaptureStd + +with CaptureStd() as cs: + function_that_writes_to_stdout_and_stderr() +print(cs.err, cs.out) +``` + +Um das Debuggen von Testproblemen zu erleichtern, geben diese Kontextmanager standardmäßig die aufgezeichneten Streams beim Verlassen +aus dem Kontext wieder. + + +### Erfassen von Logger-Streams + +Wenn Sie die Ausgabe eines Loggers validieren müssen, können Sie `CaptureLogger` verwenden: + +```python +from transformers import logging +from transformers.testing_utils import CaptureLogger + +msg = "Testing 1, 2, 3" +logging.set_verbosity_info() +logger = logging.get_logger("transformers.models.bart.tokenization_bart") +with CaptureLogger(logger) as cl: + logger.info(msg) +assert cl.out, msg + "\n" +``` + +### Testen mit Umgebungsvariablen + +Wenn Sie die Auswirkungen von Umgebungsvariablen für einen bestimmten Test testen möchten, können Sie einen Hilfsdekorator verwenden +`transformers.testing_utils.mockenv` + +```python +from transformers.testing_utils import mockenv + + +class HfArgumentParserTest(unittest.TestCase): + @mockenv(TRANSFORMERS_VERBOSITY="error") + def test_env_override(self): + env_level_str = os.getenv("TRANSFORMERS_VERBOSITY", None) +``` + +Manchmal muss ein externes Programm aufgerufen werden, was die Einstellung von `PYTHONPATH` in `os.environ` erfordert, um mehrere lokale Pfade einzuschließen. +mehrere lokale Pfade. Eine Hilfsklasse `transformers.test_utils.TestCasePlus` hilft Ihnen dabei: + +```python +from transformers.testing_utils import TestCasePlus + + +class EnvExampleTest(TestCasePlus): + def test_external_prog(self): + env = self.get_env() + # now call the external program, passing `env` to it +``` + +Je nachdem, ob die Testdatei in der Testsuite `tests` oder in `examples` war, wird sie korrekt eingerichtet +`env[PYTHONPATH]` eines dieser beiden Verzeichnisse und auch das `src` Verzeichnis, um sicherzustellen, dass der Test gegen das aktuelle +um sicherzustellen, dass der Test mit dem aktuellen Projektarchiv durchgeführt wird, und schließlich mit dem, was in `env[PYTHONPATH]` bereits eingestellt war, bevor der Test aufgerufen wurde. +wenn überhaupt. + +Diese Hilfsmethode erstellt eine Kopie des Objekts `os.environ`, so dass das Original intakt bleibt. + + +### Reproduzierbare Ergebnisse erhalten + +In manchen Situationen möchten Sie vielleicht die Zufälligkeit Ihrer Tests beseitigen. 
Um identische, reproduzierbare Ergebnisse zu erhalten, müssen Sie +müssen Sie den Seed festlegen: + +```python +seed = 42 + +# python RNG +import random + +random.seed(seed) + +# pytorch RNGs +import torch + +torch.manual_seed(seed) +torch.backends.cudnn.deterministic = True +if torch.cuda.is_available(): + torch.cuda.manual_seed_all(seed) + +# numpy RNG +import numpy as np + +np.random.seed(seed) + +# tf RNG +tf.random.set_seed(seed) +``` + +### Tests debuggen + +Um einen Debugger an der Stelle zu starten, an der die Warnung auftritt, gehen Sie wie folgt vor: + +```bash +pytest tests/utils/test_logging.py -W error::UserWarning --pdb +``` + +## Arbeiten mit Github-Aktionen-Workflows + +Um einen CI-Job für einen Self-Push-Workflow auszulösen, müssen Sie: + +1. Erstellen Sie einen neuen Zweig auf `transformers` Ursprung (keine Gabelung!). +2. Der Name der Verzweigung muss entweder mit `ci_` oder `ci-` beginnen (`main` löst ihn auch aus, aber wir können keine PRs auf + `main`). Es wird auch nur für bestimmte Pfade ausgelöst - Sie können die aktuelle Definition finden, falls sie + falls sie sich seit der Erstellung dieses Dokuments geändert hat [hier](https://github.com/huggingface/transformers/blob/main/.github/workflows/self-push.yml) unter *push:* +3. Erstellen Sie einen PR von diesem Zweig. +4. Dann können Sie sehen, wie der Job erscheint [hier](https://github.com/huggingface/transformers/actions/workflows/self-push.yml). Er wird möglicherweise nicht sofort ausgeführt, wenn es + ein Backlog vorhanden ist. + + + + +## Testen experimenteller CI-Funktionen + +Das Testen von CI-Funktionen kann potenziell problematisch sein, da es die normale CI-Funktion beeinträchtigen kann. Wenn also eine +neue CI-Funktion hinzugefügt werden soll, sollte dies wie folgt geschehen. + +1. Erstellen Sie einen neuen Auftrag, der die zu testende Funktion testet. +2. Der neue Job muss immer erfolgreich sein, so dass er uns ein grünes ✓ gibt (Details unten). +3. Lassen Sie ihn einige Tage lang laufen, um zu sehen, dass eine Vielzahl verschiedener PR-Typen darauf laufen (Benutzer-Gabelzweige, + nicht geforkte Zweige, Zweige, die von github.com UI direct file edit stammen, verschiedene erzwungene Pushes, etc. - es gibt + es gibt so viele), während Sie die Protokolle des experimentellen Jobs überwachen (nicht den gesamten Job grün, da er absichtlich immer + grün) +4. Wenn klar ist, dass alles in Ordnung ist, fügen Sie die neuen Änderungen in die bestehenden Jobs ein. + +Auf diese Weise wird der normale Arbeitsablauf nicht durch Experimente mit der CI-Funktionalität selbst beeinträchtigt. + +Wie können wir nun dafür sorgen, dass der Auftrag immer erfolgreich ist, während die neue CI-Funktion entwickelt wird? + +Einige CIs, wie TravisCI, unterstützen ignore-step-failure und melden den gesamten Job als erfolgreich, aber CircleCI und +Github Actions unterstützen dies zum jetzigen Zeitpunkt nicht. + +Sie können also die folgende Abhilfe verwenden: + +1. Setzen Sie `set +euo pipefail` am Anfang des Ausführungsbefehls, um die meisten potenziellen Fehler im Bash-Skript zu unterdrücken. +2. Der letzte Befehl muss ein Erfolg sein: `echo "done"` oder einfach `true` reicht aus. 
+ +Hier ist ein Beispiel: + +```yaml +- run: + name: run CI experiment + command: | + set +euo pipefail + echo "setting run-all-despite-any-errors-mode" + this_command_will_fail + echo "but bash continues to run" + # emulate another failure + false + # but the last command must be a success + echo "during experiment do not remove: reporting success to CI, even if there were failures" +``` + +Für einfache Befehle können Sie auch Folgendes tun: + +```bash +cmd_that_may_fail || true +``` + +Wenn Sie mit den Ergebnissen zufrieden sind, integrieren Sie den experimentellen Schritt oder Job natürlich in den Rest der normalen Jobs, +Entfernen Sie dabei `set +euo pipefail` oder andere Dinge, die Sie eventuell hinzugefügt haben, um sicherzustellen, dass der experimentelle Auftrag nicht +den normalen CI-Betrieb nicht beeinträchtigt. + +Dieser ganze Prozess wäre viel einfacher gewesen, wenn wir nur etwas wie `allow-failure` für den +experimentellen Schritt festlegen könnten und ihn scheitern lassen würden, ohne den Gesamtstatus der PRs zu beeinträchtigen. Aber wie bereits erwähnt, haben CircleCI und +Github Actions dies im Moment nicht unterstützen. + +Sie können in diesen CI-spezifischen Threads für diese Funktion stimmen und sehen, wo sie steht: + +- [Github Actions:](https://github.com/actions/toolkit/issues/399) +- [CircleCI:](https://ideas.circleci.com/ideas/CCI-I-344) diff --git a/docs/source/de/training.mdx b/docs/source/de/training.md similarity index 95% rename from docs/source/de/training.mdx rename to docs/source/de/training.md index e38779ba55714f..7b1bd3e5d0c368 100644 --- a/docs/source/de/training.mdx +++ b/docs/source/de/training.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Optimierung eines vortrainierten Modells @@ -39,12 +43,12 @@ Laden Sie zunächst den Datensatz [Yelp Reviews](https://huggingface.co/datasets 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I\'ve worked at more than one location. I expect bad days, bad moods, and the occasional mistake. But I have yet to have a decent experience at this store. 
It will remain a place I avoid unless someone in my party needs to avoid illness from low blood sugar. Perhaps I should go back to the racially biased service of Steak n Shake instead!'} ``` -Wie Sie nun wissen, benötigen Sie einen Tokenizer, um den Text zu verarbeiten und eine Auffüll- und Abschneidungsstrategie einzubauen, um mit variablen Sequenzlängen umzugehen. Um Ihren Datensatz in einem Schritt zu verarbeiten, verwenden Sie die 🤗 Methode Datasets [`map`](https://huggingface.co/docs/datasets/process.html#map), um eine Vorverarbeitungsfunktion auf den gesamten Datensatz anzuwenden: +Wie Sie nun wissen, benötigen Sie einen Tokenizer, um den Text zu verarbeiten und eine Auffüll- und Abschneidungsstrategie einzubauen, um mit variablen Sequenzlängen umzugehen. Um Ihren Datensatz in einem Schritt zu verarbeiten, verwenden Sie die 🤗 Methode Datasets [`map`](https://huggingface.co/docs/datasets/process#map), um eine Vorverarbeitungsfunktion auf den gesamten Datensatz anzuwenden: ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") >>> def tokenize_function(examples): @@ -82,7 +86,7 @@ Beginnen Sie mit dem Laden Ihres Modells und geben Sie die Anzahl der erwarteten ```py >>> from transformers import AutoModelForSequenceClassification ->>> model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5) +>>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) ``` @@ -183,7 +187,7 @@ Wir können sie also ohne Tokenisierung direkt in ein NumPy-Array konvertieren! ```py from transformers import AutoTokenizer -tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") +tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") tokenized_data = tokenizer(dataset["text"], return_tensors="np", padding=True) # Tokenizer returns a BatchEncoding, but we convert that to a dict for Keras tokenized_data = dict(tokenized_data) @@ -198,7 +202,7 @@ from transformers import TFAutoModelForSequenceClassification from tensorflow.keras.optimizers import Adam # Load and compile our model -model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased") +model = TFAutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased") # Lower learning rates are often better for fine-tuning transformers model.compile(optimizer=Adam(3e-5)) @@ -225,10 +229,10 @@ tf.data"-Pipeline schreiben können, wenn Sie wollen, haben wir zwei bequeme Met - [`~TFPreTrainedModel.prepare_tf_dataset`]: Dies ist die Methode, die wir in den meisten Fällen empfehlen. Da es sich um eine Methode Ihres Modells ist, kann sie das Modell inspizieren, um automatisch herauszufinden, welche Spalten als Modelleingaben verwendet werden können, und verwirft die anderen, um einen einfacheren, leistungsfähigeren Datensatz zu erstellen. -- [~datasets.Dataset.to_tf_dataset`]: Diese Methode ist eher auf niedriger Ebene angesiedelt und ist nützlich, wenn Sie genau kontrollieren wollen, wie +- [`~datasets.Dataset.to_tf_dataset`]: Diese Methode ist eher auf niedriger Ebene angesiedelt und ist nützlich, wenn Sie genau kontrollieren wollen, wie Dataset erstellt wird, indem man genau angibt, welche `columns` und `label_cols` einbezogen werden sollen. 
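+
+Zur Veranschaulichung der beiden Methoden vorab eine kleine Skizze - kein verbindlicher Code, sondern ein Entwurf unter der Annahme, dass
+`tokenized_datasets["train"]` die Tokenizer-Ausgaben bereits als Spalten enthält (wie im direkt folgenden Codebeispiel beschrieben) und dass
+`model` und `tokenizer` wie weiter oben geladen wurden; der `DataCollatorWithPadding` dient hier nur als Beispiel für eine `collate_fn`:
+
+```py
+from transformers import DataCollatorWithPadding
+
+data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="np")
+
+# Empfohlene Variante: das Modell wählt die passenden Eingabespalten selbst aus
+tf_train_dataset = model.prepare_tf_dataset(
+    tokenized_datasets["train"], batch_size=16, shuffle=True, tokenizer=tokenizer
+)
+
+# Low-Level-Variante: Spalten und Label-Spalten explizit angeben
+tf_train_dataset_manual = tokenized_datasets["train"].to_tf_dataset(
+    columns=["input_ids", "attention_mask", "token_type_ids"],
+    label_cols=["label"],
+    batch_size=16,
+    shuffle=True,
+    collate_fn=data_collator,
+)
+```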
-Bevor Sie [~TFPreTrainedModel.prepare_tf_dataset`] verwenden können, müssen Sie die Tokenizer-Ausgaben als Spalten zu Ihrem Datensatz hinzufügen, wie in +Bevor Sie [`~TFPreTrainedModel.prepare_tf_dataset`] verwenden können, müssen Sie die Tokenizer-Ausgaben als Spalten zu Ihrem Datensatz hinzufügen, wie in dem folgenden Codebeispiel: ```py @@ -329,7 +333,7 @@ Laden Sie Ihr Modell mit der Anzahl der erwarteten Kennzeichnungen: ```py >>> from transformers import AutoModelForSequenceClassification ->>> model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5) +>>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) ``` ### Optimierer und Lernratensteuerung diff --git a/docs/source/de/transformers_agents.md b/docs/source/de/transformers_agents.md new file mode 100644 index 00000000000000..1d676c395e1769 --- /dev/null +++ b/docs/source/de/transformers_agents.md @@ -0,0 +1,323 @@ + + +# Transformers Agents + + + +Transformers Agents ist eine experimentelle API, die jederzeit geändert werden kann. Die von den Agenten zurückgegebenen Ergebnisse +zurückgegeben werden, können variieren, da sich die APIs oder die zugrunde liegenden Modelle ändern können. + + + +Transformers Version v4.29.0, die auf dem Konzept von *Tools* und *Agenten* aufbaut. Sie können damit spielen in +[dieses Colab](https://colab.research.google.com/drive/1c7MHD-T1forUPGcC_jlwsIptOzpG3hSj). + +Kurz gesagt, es bietet eine API für natürliche Sprache auf der Grundlage von Transformers: Wir definieren eine Reihe von kuratierten Tools und entwerfen einen +Agenten, um natürliche Sprache zu interpretieren und diese Werkzeuge zu verwenden. Es ist von vornherein erweiterbar; wir haben einige relevante Tools kuratiert, +aber wir werden Ihnen zeigen, wie das System einfach erweitert werden kann, um jedes von der Community entwickelte Tool zu verwenden. + +Beginnen wir mit einigen Beispielen dafür, was mit dieser neuen API erreicht werden kann. Sie ist besonders leistungsfähig, wenn es um +Sie ist besonders leistungsstark, wenn es um multimodale Aufgaben geht. Lassen Sie uns also eine Runde drehen, um Bilder zu erzeugen und Text vorzulesen. + +```py +agent.run("Caption the following image", image=image) +``` + +| **Input** | **Output** | +|-----------------------------------------------------------------------------------------------------------------------------|-----------------------------------| +| | A beaver is swimming in the water | + +--- + +```py +agent.run("Read the following text out loud", text=text) +``` +| **Input** | **Output** | +|-------------------------------------------------------------------------------------------------------------------------|----------------------------------------------| +| A beaver is swimming in the water | + +--- + +```py +agent.run( + "In the following `document`, where will the TRRF Scientific Advisory Council Meeting take place?", + document=document, +) +``` +| **Input** | **Output** | +|-----------------------------------------------------------------------------------------------------------------------------|----------------| +| | ballroom foyer | + +## Schnellstart + +Bevor Sie `agent.run` verwenden können, müssen Sie einen Agenten instanziieren, der ein großes Sprachmodell (LLM) ist. +Wir bieten Unterstützung für openAI-Modelle sowie für OpenSource-Alternativen von BigCode und OpenAssistant. 
Die openAI-Modelle sind leistungsfähiger (erfordern aber einen openAI-API-Schlüssel, können also nicht kostenlos verwendet werden); Hugging Face
+bietet kostenlosen Zugang zu Endpunkten für BigCode- und OpenAssistant-Modelle.
+
+To start with, please install the `agents` extras in order to install all default dependencies.
+```bash
+pip install transformers[agents]
+```
+
+Um openAI-Modelle zu verwenden, instanziieren Sie einen [`OpenAiAgent`], nachdem Sie die `openai`-Abhängigkeit installiert haben:
+
+```bash
+pip install openai
+```
+
+
+```py
+from transformers import OpenAiAgent
+
+agent = OpenAiAgent(model="text-davinci-003", api_key="")
+```
+
+Um BigCode oder OpenAssistant zu verwenden, melden Sie sich zunächst an, um Zugriff auf die Inference API zu erhalten:
+
+```py
+from huggingface_hub import login
+
+login("")
+```
+
+Dann instanziieren Sie den Agenten:
+
+```py
+from transformers import HfAgent
+
+# Starcoder
+agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder")
+# StarcoderBase
+# agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoderbase")
+# OpenAssistant
+# agent = HfAgent(url_endpoint="https://api-inference.huggingface.co/models/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5")
+```
+
+Dies geschieht über die Inferenz-API, die Hugging Face derzeit kostenlos zur Verfügung stellt. Wenn Sie einen eigenen Inferenz-Endpunkt
+für dieses Modell (oder ein anderes) haben, können Sie die obige URL durch Ihren URL-Endpunkt ersetzen.
+
+
+
+StarCoder und OpenAssistant sind kostenlos und leisten bei einfachen Aufgaben bewundernswert gute Arbeit. Allerdings halten die Checkpoints
+komplexeren Aufforderungen nicht stand. Wenn Sie mit einem solchen Problem konfrontiert sind, empfehlen wir Ihnen, das OpenAI-Modell
+auszuprobieren, das zwar leider nicht quelloffen ist, aber zur Zeit eine bessere Leistung erbringt.
+
+
+
+Sie sind jetzt startklar! Lassen Sie uns in die beiden APIs eintauchen, die Ihnen jetzt zur Verfügung stehen.
+
+### Einzelne Ausführung (run)
+
+Für die einmalige Ausführung verwenden Sie die [`~Agent.run`]-Methode des Agenten:
+
+```py
+agent.run("Draw me a picture of rivers and lakes.")
+```
+
+
+
+Die Methode wählt automatisch das Werkzeug (oder die Werkzeuge) aus, das für die gewünschte Aufgabe geeignet ist, und führt es entsprechend aus. Sie
+kann eine oder mehrere Aufgaben in derselben Anweisung ausführen (je komplexer Ihre Anweisung ist, desto wahrscheinlicher ist es allerdings, dass der
+Agent scheitert).
+
+```py
+agent.run("Draw me a picture of the sea then transform the picture to add an island")
+```
+
+
+
+
+
+Jede [`~Agent.run`] Operation ist unabhängig, so dass Sie sie mehrmals hintereinander mit unterschiedlichen Aufgaben ausführen können.
+
+Beachten Sie, dass Ihr `Agent` nur ein großes Sprachmodell ist, so dass kleine Variationen in Ihrer Eingabeaufforderung völlig
+unterschiedliche Ergebnisse liefern können. Es ist wichtig, dass Sie die Aufgabe, die Sie ausführen möchten, so genau wie möglich erklären. Wir gehen
+[hier](custom_tools#writing-good-user-inputs) näher darauf ein, wie man gute Prompts schreibt.
+
+Wenn Sie einen Zustand über mehrere Ausführungen hinweg beibehalten oder dem Agenten Nicht-Text-Objekte übergeben möchten, können Sie dies tun, indem Sie
+Variablen angeben, die der Agent verwenden soll. Sie könnten zum Beispiel das erste Bild von Flüssen und Seen erzeugen
+und das Modell bitten, dieses Bild zu aktualisieren und eine Insel hinzuzufügen, indem Sie Folgendes tun:
+
+```python
+picture = agent.run("Generate a picture of rivers and lakes.")
+updated_picture = agent.run("Transform the image in `picture` to add an island to it.", picture=picture)
+```
+
+
+
+Dies kann hilfreich sein, wenn das Modell Ihre Anfrage nicht verstehen kann und die Werkzeuge verwechselt. Ein Beispiel wäre:
+
+```py
+agent.run("Draw me the picture of a capybara swimming in the sea")
+```
+
+Hier könnte das Modell die Anweisung auf zwei Arten interpretieren:
+- Die Funktion `Text-zu-Bild` erzeugt ein Wasserschwein, das im Meer schwimmt.
+- Oder Sie lassen `Text-zu-Bild` ein Wasserschwein erzeugen und verwenden dann das Werkzeug `Bildtransformation`, um es im Meer schwimmen zu lassen.
+
+Falls Sie das erste Szenario erzwingen möchten, können Sie dies tun, indem Sie die Eingabeaufforderung als Argument übergeben:
+
+```py
+agent.run("Draw me a picture of the `prompt`", prompt="a capybara swimming in the sea")
+```
+
+
+
+
+### Chat-basierte Ausführung (Chat)
+
+Der Agent verfügt auch über einen Chat-basierten Ansatz, der die Methode [`~Agent.chat`] verwendet:
+
+```py
+agent.chat("Generate a picture of rivers and lakes")
+```
+
+
+
+```py
+agent.chat("Transform the picture so that there is a rock in there")
+```
+
+
+
+
+Dies ist ein interessanter Ansatz, wenn Sie den Zustand über Anweisungen hinweg beibehalten möchten. Er eignet sich gut zum Experimentieren,
+funktioniert aber deutlich besser mit einzelnen Anweisungen als mit komplexen Anweisungen (die die [`~Agent.run`]-Methode
+besser verarbeiten kann).
+
+Diese Methode kann auch Argumente entgegennehmen, wenn Sie Nicht-Text-Typen oder bestimmte Aufforderungen übergeben möchten.
+
+### ⚠️ Fernausführung
+
+Zu Demonstrationszwecken und damit es mit allen Setups verwendet werden kann, haben wir Remote-Executors für mehrere
+der Standard-Tools erstellt, auf die der Agent in dieser Version Zugriff hat. Diese werden mit
+[inference endpoints](https://huggingface.co/inference-endpoints) erstellt.
+
+Wir haben diese vorerst deaktiviert, aber um zu sehen, wie Sie selbst Remote-Executor-Tools einrichten können,
+empfehlen wir die Lektüre des [custom tool guide](./custom_tools).
+
+### Was passiert hier? Was sind Tools und was sind Agenten?
+
+
+
+#### Agenten
+
+Der "Agent" ist hier ein großes Sprachmodell, das wir so auffordern (prompten), dass es Zugang zu einem bestimmten Satz von Tools hat.
+
+LLMs sind ziemlich gut darin, kleine Codeproben zu erzeugen. Diese API macht sich das zunutze, indem sie das
+LLM auffordert, ein kleines Codebeispiel auszugeben, das eine Aufgabe mit einer Reihe von Werkzeugen ausführt. Diese Aufforderung wird dann ergänzt durch die
+Aufgabe, die Sie Ihrem Agenten geben, und durch die Beschreibung der Werkzeuge, die Sie ihm geben. Auf diese Weise erhält das LLM Zugriff auf die Dokumentation der
+Tools, insbesondere die erwarteten Eingaben und Ausgaben, und kann den entsprechenden Code generieren.
+
+#### Tools
+
+Tools sind sehr einfach: Sie bestehen aus einer einzigen Funktion mit einem Namen und einer Beschreibung. Wir verwenden dann die Beschreibungen dieser Tools,
+um den Agenten aufzufordern. Anhand der Eingabeaufforderung zeigen wir dem Agenten, wie er die Tools nutzen kann, um das zu tun, was
+in der Abfrage angefordert wurde.
+
+Dies geschieht mit brandneuen Tools und nicht mit Pipelines, denn der Agent schreibt besseren Code mit sehr atomaren Tools.
+Pipelines sind stärker refaktorisiert und fassen oft mehrere Aufgaben in einer einzigen zusammen. Tools sind dafür gedacht, sich auf
+eine einzige, sehr einfache Aufgabe zu konzentrieren.
+
+#### Code-Ausführung?!
+
+Dieser Code wird dann mit unserem kleinen Python-Interpreter auf den mit Ihren Tools übergebenen Eingaben ausgeführt.
+Wir hören Sie schon schreien "Willkürliche Codeausführung!", aber lassen Sie uns erklären, warum das nicht der Fall ist.
+
+Die einzigen Funktionen, die aufgerufen werden können, sind die von Ihnen zur Verfügung gestellten Tools und die Print-Funktion, so dass bereits
+eingeschränkt ist, was ausgeführt werden kann. Sie sollten auf der sicheren Seite sein, wenn die Ausführung auf Hugging Face-Tools beschränkt ist.
+
+Dann lassen wir keine Attributsuche oder Importe zu (die ohnehin nicht benötigt werden, um die
+Inputs/Outputs an eine kleine Gruppe von Funktionen weiterzureichen), so dass alle offensichtlichen Angriffe (und Sie müssten das LLM
+erst dazu auffordern, solchen Code auszugeben) kein Problem darstellen sollten. Wenn Sie auf Nummer sicher gehen wollen, können Sie die
+`run()`-Methode mit dem zusätzlichen Argument `return_code=True` ausführen. In diesem Fall gibt der Agent nur den auszuführenden Code
+zurück, und Sie können selbst entscheiden, ob Sie ihn ausführen möchten oder nicht.
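+
+Eine minimale Skizze dazu - sie setzt voraus, dass `agent` wie im Schnellstart instanziiert wurde und dass `image` ein bereits
+geladenes Bild ist; der Variablenname `code` ist frei gewählt:
+
+```py
+# Der Agent führt hier nichts aus, sondern gibt nur den generierten Code zurück
+code = agent.run("Caption the following image", image=image, return_code=True)
+
+# Den Code zuerst prüfen ...
+print(code)
+# ... und erst danach bewusst entscheiden, ob er ausgeführt werden soll
+```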
+
+Die Ausführung bricht bei jeder Zeile ab, in der versucht wird, eine illegale Operation auszuführen, oder wenn im vom Agenten
+generierten Code ein regulärer Python-Fehler auftritt.
+
+### Ein kuratierter Satz von Tools
+
+Wir haben eine Reihe von Tools identifiziert, die solche Agenten unterstützen können. Hier ist eine aktualisierte Liste der Tools, die wir
+in `transformers` integriert haben:
+
+- **Beantwortung von Fragen zu Dokumenten**: Beantworten Sie anhand eines Dokuments (z.B. PDF) im Bildformat eine Frage zu diesem Dokument ([Donut](./model_doc/donut))
+- **Beantworten von Textfragen**: Geben Sie einen langen Text und eine Frage an, beantworten Sie die Frage im Text ([Flan-T5](./model_doc/flan-t5))
+- **Unbedingte Bildunterschriften**: Beschriften Sie das Bild! ([BLIP](./model_doc/blip))
+- **Bildfragebeantwortung**: Beantworten Sie bei einem Bild eine Frage zu diesem Bild ([VILT](./model_doc/vilt))
+- **Bildsegmentierung**: Geben Sie ein Bild und einen Prompt an und geben Sie die Segmentierungsmaske dieses Prompts aus ([CLIPSeg](./model_doc/clipseg))
+- **Sprache in Text**: Geben Sie eine Audioaufnahme einer sprechenden Person an und transkribieren Sie die Sprache in Text ([Whisper](./model_doc/whisper))
+- **Text in Sprache**: wandelt Text in Sprache um ([SpeechT5](./model_doc/speecht5))
+- **Zero-Shot-Textklassifizierung**: Ermitteln Sie anhand eines Textes und einer Liste von Bezeichnungen, welcher Bezeichnung der Text am ehesten entspricht ([BART](./model_doc/bart))
+- **Textzusammenfassung**: fassen Sie einen langen Text in einem oder wenigen Sätzen zusammen ([BART](./model_doc/bart))
+- **Übersetzung**: Übersetzen des Textes in eine bestimmte Sprache ([NLLB](./model_doc/nllb))
+
+Diese Tools sind in Transformers integriert und können auch manuell verwendet werden, zum Beispiel:
+
+```py
+from transformers import load_tool
+
+tool = load_tool("text-to-speech")
+audio = tool("This is a text to speech tool")
+```
+
+### Benutzerdefinierte Tools
+
+Wir haben zwar eine Reihe von Tools identifiziert, sind aber der festen Überzeugung, dass der Hauptwert dieser Implementierung in der
+Möglichkeit besteht, benutzerdefinierte Tools schnell zu erstellen und weiterzugeben.
+
+Indem Sie den Code eines Tools in einen Hugging Face Space oder ein Modell-Repository stellen, können Sie das Tool
+direkt mit dem Agenten nutzen. Wir haben ein paar neue, **transformers-agnostische** Tools zur
+[`huggingface-tools` Organisation](https://huggingface.co/huggingface-tools) hinzugefügt:
+
+- **Text-Downloader**: zum Herunterladen eines Textes von einer Web-URL
+- **Text zu Bild**: erzeugt ein Bild nach einer Eingabeaufforderung, unter Verwendung von Stable Diffusion
+- **Bildtransformation**: verändert ein Bild anhand eines Ausgangsbildes und einer Eingabeaufforderung, unter Verwendung von Instruct Pix2Pix (Stable Diffusion)
+- **Text zu Video**: erzeugt ein kurzes Video nach einer Eingabeaufforderung, unter Verwendung von damo-vilab
+
+Das Text-zu-Bild-Tool, das wir von Anfang an verwendet haben, ist ein Remote-Tool, das sich in
+[*huggingface-tools/text-to-image*](https://huggingface.co/spaces/huggingface-tools/text-to-image) befindet! Wir werden
+weiterhin solche Tools für diese und andere Organisationen veröffentlichen, um diese Implementierung weiter zu verbessern.
+
+Die Agenten haben standardmäßig Zugriff auf die Tools, die sich auf [*huggingface-tools*](https://huggingface.co/huggingface-tools) befinden.
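+
+Ein stark vereinfachter Entwurf, wie ein eigenes Tool aussehen kann - Klassenname, Beschreibung und Logik sind hier frei erfunden und
+dienen nur der Illustration; die Felder der `Tool`-Basisklasse werden in der unten verlinkten Anleitung genauer erklärt:
+
+```py
+from transformers import Tool
+
+
+class TextReverserTool(Tool):
+    # Name und Beschreibung sind das, was der Agent im Prompt zu sehen bekommt
+    name = "text_reverser"
+    description = "This tool reverses a given text. It takes a text as input and returns the reversed text."
+
+    inputs = ["text"]
+    outputs = ["text"]
+
+    def __call__(self, text: str):
+        return text[::-1]
+```
+
+Ein solches Tool lässt sich dann z.B. mit `load_tool` vom Hub laden, nachdem Sie es dort veröffentlicht haben, oder dem Agenten
+zusätzlich zu den Standard-Tools übergeben; beides wird in der folgenden Anleitung beschrieben.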
+Wie Sie Ihre eigenen Tools schreiben und freigeben können und wie Sie jedes benutzerdefinierte Tool, das sich auf dem Hub befindet, nutzen können, erklären wir in [folgender Anleitung](custom_tools). + +### Code-Erzeugung + +Bisher haben wir gezeigt, wie Sie die Agenten nutzen können, um Aktionen für Sie durchzuführen. Der Agent generiert jedoch nur Code +den wir dann mit einem sehr eingeschränkten Python-Interpreter ausführen. Falls Sie den generierten Code in einer anderen Umgebung verwenden möchten +einer anderen Umgebung verwenden möchten, können Sie den Agenten auffordern, den Code zusammen mit einer Tooldefinition und genauen Importen zurückzugeben. + +Zum Beispiel die folgende Anweisung +```python +agent.run("Draw me a picture of rivers and lakes", return_code=True) +``` + +gibt den folgenden Code zurück + +```python +from transformers import load_tool + +image_generator = load_tool("huggingface-tools/text-to-image") + +image = image_generator(prompt="rivers and lakes") +``` + +die Sie dann selbst ändern und ausführen können. diff --git a/docs/source/en/_config.py b/docs/source/en/_config.py index cd76263e9a5cb2..a6d75853f57219 100644 --- a/docs/source/en/_config.py +++ b/docs/source/en/_config.py @@ -10,5 +10,5 @@ black_avoid_patterns = { "{processor_class}": "FakeProcessorClass", "{model_class}": "FakeModelClass", - "{object_class}": "FakeObjectClass", + "{object_class}": "FakeObjectClass", } diff --git a/docs/source/en/_redirects.yml b/docs/source/en/_redirects.yml new file mode 100644 index 00000000000000..b6575a6b02f205 --- /dev/null +++ b/docs/source/en/_redirects.yml @@ -0,0 +1,3 @@ +# Optimizing inference + +perf_infer_gpu_many: perf_infer_gpu_one diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index 9fafcfd9ff55b4..678b679cb143d8 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -8,145 +8,190 @@ title: Get started - sections: - local: pipeline_tutorial - title: Pipelines for inference + title: Run inference with pipelines - local: autoclass_tutorial - title: Load pretrained instances with an AutoClass + title: Write portable code with AutoClass - local: preprocessing - title: Preprocess + title: Preprocess data - local: training title: Fine-tune a pretrained model + - local: run_scripts + title: Train with a script - local: accelerate - title: Distributed training with 🤗 Accelerate + title: Set up distributed training with 🤗 Accelerate + - local: peft + title: Load and train adapters with 🤗 PEFT - local: model_sharing - title: Share a model + title: Share your model + - local: transformers_agents + title: Agents + - local: llm_tutorial + title: Generation with LLMs title: Tutorials - sections: - - sections: - - local: create_a_model - title: Create a custom architecture - - local: custom_models - title: Sharing custom models - - local: run_scripts - title: Train with a script - - local: sagemaker - title: Run training on Amazon SageMaker - - local: converting_tensorflow_models - title: Converting from TensorFlow checkpoints - - local: serialization - title: Export to ONNX - - local: torchscript - title: Export to TorchScript - - local: troubleshooting - title: Troubleshoot - title: General usage - - sections: - - local: fast_tokenizers - title: Use tokenizers from 🤗 Tokenizers - - local: multilingual - title: Inference for multilingual models - - local: generation_strategies - title: Text generation strategies - - sections: - - local: tasks/sequence_classification - title: Text classification - - local: 
tasks/token_classification - title: Token classification - - local: tasks/question_answering - title: Question answering - - local: tasks/language_modeling - title: Causal language modeling - - local: tasks/masked_language_modeling - title: Masked language modeling - - local: tasks/translation - title: Translation - - local: tasks/summarization - title: Summarization - - local: tasks/multiple_choice - title: Multiple choice - title: Task guides - isExpanded: false + - isExpanded: false + sections: + - local: tasks/sequence_classification + title: Text classification + - local: tasks/token_classification + title: Token classification + - local: tasks/question_answering + title: Question answering + - local: tasks/language_modeling + title: Causal language modeling + - local: tasks/masked_language_modeling + title: Masked language modeling + - local: tasks/translation + title: Translation + - local: tasks/summarization + title: Summarization + - local: tasks/multiple_choice + title: Multiple choice title: Natural Language Processing - - sections: + - isExpanded: false + sections: - local: tasks/audio_classification title: Audio classification - local: tasks/asr title: Automatic speech recognition title: Audio - - sections: + - isExpanded: false + sections: - local: tasks/image_classification title: Image classification - local: tasks/semantic_segmentation - title: Semantic segmentation + title: Image segmentation - local: tasks/video_classification title: Video classification - local: tasks/object_detection title: Object detection + - local: tasks/zero_shot_object_detection + title: Zero-shot object detection + - local: tasks/zero_shot_image_classification + title: Zero-shot image classification + - local: tasks/monocular_depth_estimation + title: Depth estimation + - local: tasks/image_to_image + title: Image-to-Image + - local: tasks/mask_generation + title: Mask Generation + - local: tasks/knowledge_distillation_for_image_classification + title: Knowledge Distillation for Computer Vision title: Computer Vision - - sections: + - isExpanded: false + sections: - local: tasks/image_captioning title: Image captioning - local: tasks/document_question_answering title: Document Question Answering + - local: tasks/visual_question_answering + title: Visual Question Answering + - local: tasks/text-to-speech + title: Text to speech title: Multimodal + - isExpanded: false + sections: + - local: generation_strategies + title: Customize the generation strategy + title: Generation + - isExpanded: false + sections: + - local: tasks/idefics + title: Image tasks with IDEFICS + - local: tasks/prompting + title: LLM prompting guide + title: Prompting + title: Task Guides +- sections: + - local: fast_tokenizers + title: Use fast tokenizers from 🤗 Tokenizers + - local: multilingual + title: Run inference with multilingual models + - local: create_a_model + title: Use model-specific APIs + - local: custom_models + title: Share a custom model + - local: chat_templating + title: Templates for chat models + - local: trainer + title: Trainer + - local: sagemaker + title: Run training on Amazon SageMaker + - local: serialization + title: Export to ONNX + - local: tflite + title: Export to TFLite + - local: torchscript + title: Export to TorchScript + - local: benchmarks + title: Benchmarks + - local: notebooks + title: Notebooks with examples + - local: community + title: Community resources + - local: custom_tools + title: Custom Tools and Prompts + - local: troubleshooting + title: Troubleshoot + - local: 
hf_quantizer + title: Contribute new quantization method + title: Developer guides +- sections: + - local: performance + title: Overview + - local: quantization + title: Quantization - sections: - - local: performance - title: Overview - local: perf_train_gpu_one - title: Training on one GPU + title: Methods and tools for efficient training on a single GPU - local: perf_train_gpu_many - title: Training on many GPUs + title: Multiple GPUs and parallelism + - local: fsdp + title: Fully Sharded Data Parallel + - local: deepspeed + title: DeepSpeed - local: perf_train_cpu - title: Training on CPU + title: Efficient training on CPU - local: perf_train_cpu_many - title: Training on many CPUs - - local: perf_train_tpu - title: Training on TPUs + title: Distributed CPU training - local: perf_train_tpu_tf title: Training on TPU with TensorFlow - local: perf_train_special - title: Training on Specialized Hardware - - local: perf_infer_cpu - title: Inference on CPU - - local: perf_infer_gpu_one - title: Inference on one GPU - - local: perf_infer_gpu_many - title: Inference on many GPUs - - local: perf_infer_special - title: Inference on Specialized Hardware + title: PyTorch training on Apple silicon - local: perf_hardware title: Custom hardware for training - - local: big_models - title: Instantiating a big model - - local: debugging - title: Debugging - local: hpo_train title: Hyperparameter Search using Trainer API - - local: tf_xla - title: XLA Integration for TensorFlow Models - title: Performance and scalability + title: Efficient training techniques - sections: - - local: contributing - title: How to contribute to transformers? - - local: add_new_model - title: How to add a model to 🤗 Transformers? - - local: add_tensorflow_model - title: How to convert a 🤗 Transformers model to TensorFlow? - - local: add_new_pipeline - title: How to add a pipeline to 🤗 Transformers? - - local: testing - title: Testing - - local: pr_checks - title: Checks on a Pull Request - title: Contribute - - local: notebooks - title: 🤗 Transformers Notebooks - - local: community - title: Community resources - - local: benchmarks - title: Benchmarks - - local: migration - title: Migrating from previous packages - title: How-to guides + - local: perf_infer_cpu + title: CPU inference + - local: perf_infer_gpu_one + title: GPU inference + title: Optimizing inference + - local: big_models + title: Instantiating a big model + - local: debugging + title: Debugging + - local: tf_xla + title: XLA Integration for TensorFlow Models + - local: perf_torch_compile + title: Optimize inference using `torch.compile()` + title: Performance and scalability +- sections: + - local: contributing + title: How to contribute to 🤗 Transformers? + - local: add_new_model + title: How to add a model to 🤗 Transformers? + - local: add_tensorflow_model + title: How to convert a 🤗 Transformers model to TensorFlow? + - local: add_new_pipeline + title: How to add a pipeline to 🤗 Transformers? 
+ - local: testing + title: Testing + - local: pr_checks + title: Checks on a Pull Request + title: Contribute - sections: - local: philosophy title: Philosophy @@ -170,11 +215,19 @@ title: Perplexity of fixed-length models - local: pipeline_webserver title: Pipelines for webserver inference + - local: model_memory_anatomy + title: Model training anatomy + - local: llm_tutorial_optimization + title: Getting the most out of LLMs title: Conceptual guides - sections: - sections: + - local: main_classes/agent + title: Agents and Tools - local: model_doc/auto title: Auto Classes + - local: main_classes/backbones + title: Backbones - local: main_classes/callback title: Callbacks - local: main_classes/configuration @@ -206,7 +259,7 @@ - local: main_classes/trainer title: Trainer - local: main_classes/deepspeed - title: DeepSpeed Integration + title: DeepSpeed - local: main_classes/feature_extractor title: Feature Extractor - local: main_classes/image_processor @@ -253,10 +306,14 @@ title: CANINE - local: model_doc/codegen title: CodeGen + - local: model_doc/code_llama + title: CodeLlama - local: model_doc/convbert title: ConvBERT - local: model_doc/cpm title: CPM + - local: model_doc/cpmant + title: CPMANT - local: model_doc/ctrl title: CTRL - local: model_doc/deberta @@ -279,8 +336,14 @@ title: ErnieM - local: model_doc/esm title: ESM + - local: model_doc/falcon + title: Falcon + - local: model_doc/fastspeech2_conformer + title: FastSpeech2Conformer - local: model_doc/flan-t5 title: FLAN-T5 + - local: model_doc/flan-ul2 + title: FLAN-UL2 - local: model_doc/flaubert title: FlauBERT - local: model_doc/fnet @@ -289,6 +352,8 @@ title: FSMT - local: model_doc/funnel title: Funnel Transformer + - local: model_doc/fuyu + title: Fuyu - local: model_doc/openai-gpt title: GPT - local: model_doc/gpt_neo @@ -301,6 +366,10 @@ title: GPT-J - local: model_doc/gpt2 title: GPT2 + - local: model_doc/gpt_bigcode + title: GPTBigCode + - local: model_doc/gptsan-japanese + title: GPTSAN Japanese - local: model_doc/gpt-sw3 title: GPTSw3 - local: model_doc/herbert @@ -311,6 +380,10 @@ title: Jukebox - local: model_doc/led title: LED + - local: model_doc/llama + title: LLaMA + - local: model_doc/llama2 + title: Llama2 - local: model_doc/longformer title: Longformer - local: model_doc/longt5 @@ -319,22 +392,34 @@ title: LUKE - local: model_doc/m2m_100 title: M2M100 + - local: model_doc/madlad-400 + title: MADLAD-400 - local: model_doc/marian title: MarianMT - local: model_doc/markuplm title: MarkupLM - local: model_doc/mbart title: MBart and MBart-50 + - local: model_doc/mega + title: MEGA - local: model_doc/megatron-bert title: MegatronBERT - local: model_doc/megatron_gpt2 title: MegatronGPT2 + - local: model_doc/mistral + title: Mistral + - local: model_doc/mixtral + title: Mixtral - local: model_doc/mluke title: mLUKE - local: model_doc/mobilebert title: MobileBERT - local: model_doc/mpnet title: MPNet + - local: model_doc/mpt + title: MPT + - local: model_doc/mra + title: MRA - local: model_doc/mt5 title: MT5 - local: model_doc/mvp @@ -343,14 +428,22 @@ title: NEZHA - local: model_doc/nllb title: NLLB + - local: model_doc/nllb-moe + title: NLLB-MoE - local: model_doc/nystromformer title: Nyströmformer + - local: model_doc/open-llama + title: Open-Llama - local: model_doc/opt title: OPT - local: model_doc/pegasus title: Pegasus - local: model_doc/pegasus_x title: PEGASUS-X + - local: model_doc/persimmon + title: Persimmon + - local: model_doc/phi + title: Phi - local: model_doc/phobert title: PhoBERT - local: 
model_doc/plbart @@ -359,6 +452,8 @@ title: ProphetNet - local: model_doc/qdqbert title: QDQBert + - local: model_doc/qwen2 + title: Qwen2 - local: model_doc/rag title: RAG - local: model_doc/realm @@ -377,10 +472,14 @@ title: RoCBert - local: model_doc/roformer title: RoFormer + - local: model_doc/rwkv + title: RWKV - local: model_doc/splinter title: Splinter - local: model_doc/squeezebert title: SqueezeBERT + - local: model_doc/stablelm + title: StableLm - local: model_doc/switch_transformers title: SwitchTransformers - local: model_doc/t5 @@ -393,6 +492,8 @@ title: Transformer XL - local: model_doc/ul2 title: UL2 + - local: model_doc/umt5 + title: UMT5 - local: model_doc/xmod title: X-MOD - local: model_doc/xglm @@ -422,24 +523,34 @@ title: Conditional DETR - local: model_doc/convnext title: ConvNeXT + - local: model_doc/convnextv2 + title: ConvNeXTV2 - local: model_doc/cvt title: CvT - local: model_doc/deformable_detr title: Deformable DETR - local: model_doc/deit title: DeiT + - local: model_doc/depth_anything + title: Depth Anything - local: model_doc/deta title: DETA - local: model_doc/detr title: DETR - local: model_doc/dinat title: DiNAT + - local: model_doc/dinov2 + title: DINOV2 - local: model_doc/dit title: DiT - local: model_doc/dpt title: DPT - local: model_doc/efficientformer title: EfficientFormer + - local: model_doc/efficientnet + title: EfficientNet + - local: model_doc/focalnet + title: FocalNet - local: model_doc/glpn title: GLPN - local: model_doc/imagegpt @@ -456,16 +567,22 @@ title: MobileNetV2 - local: model_doc/mobilevit title: MobileViT + - local: model_doc/mobilevitv2 + title: MobileViTV2 - local: model_doc/nat title: NAT - local: model_doc/poolformer title: PoolFormer + - local: model_doc/pvt + title: Pyramid Vision Transformer (PVT) - local: model_doc/regnet title: RegNet - local: model_doc/resnet title: ResNet - local: model_doc/segformer title: SegFormer + - local: model_doc/swiftformer + title: SwiftFormer - local: model_doc/swin title: Swin Transformer - local: model_doc/swinv2 @@ -474,20 +591,20 @@ title: Swin2SR - local: model_doc/table-transformer title: Table Transformer - - local: model_doc/timesformer - title: TimeSformer - local: model_doc/upernet title: UperNet - local: model_doc/van title: VAN - - local: model_doc/videomae - title: VideoMAE - local: model_doc/vit title: Vision Transformer (ViT) - local: model_doc/vit_hybrid title: ViT Hybrid + - local: model_doc/vitdet + title: ViTDet - local: model_doc/vit_mae title: ViTMAE + - local: model_doc/vitmatte + title: ViTMatte - local: model_doc/vit_msn title: ViTMSN - local: model_doc/yolos @@ -497,12 +614,26 @@ sections: - local: model_doc/audio-spectrogram-transformer title: Audio Spectrogram Transformer + - local: model_doc/bark + title: Bark - local: model_doc/clap title: CLAP + - local: model_doc/encodec + title: EnCodec - local: model_doc/hubert title: Hubert - local: model_doc/mctct title: MCTCT + - local: model_doc/mms + title: MMS + - local: model_doc/musicgen + title: MusicGen + - local: model_doc/pop2piano + title: Pop2Piano + - local: model_doc/seamless_m4t + title: Seamless-M4T + - local: model_doc/seamless_m4t_v2 + title: SeamlessM4T-v2 - local: model_doc/sew title: SEW - local: model_doc/sew-d @@ -517,8 +648,14 @@ title: UniSpeech - local: model_doc/unispeech-sat title: UniSpeech-SAT + - local: model_doc/univnet + title: UnivNet + - local: model_doc/vits + title: VITS - local: model_doc/wav2vec2 title: Wav2Vec2 + - local: model_doc/wav2vec2-bert + title: Wav2Vec2-BERT - local: 
model_doc/wav2vec2-conformer title: Wav2Vec2-Conformer - local: model_doc/wav2vec2_phoneme @@ -534,6 +671,17 @@ title: Audio models - isExpanded: false sections: + - local: model_doc/timesformer + title: TimeSformer + - local: model_doc/videomae + title: VideoMAE + - local: model_doc/vivit + title: ViViT + title: Video models + - isExpanded: false + sections: + - local: model_doc/align + title: ALIGN - local: model_doc/altclip title: AltCLIP - local: model_doc/blip @@ -542,14 +690,20 @@ title: BLIP-2 - local: model_doc/bridgetower title: BridgeTower + - local: model_doc/bros + title: BROS - local: model_doc/chinese_clip title: Chinese-CLIP - local: model_doc/clip title: CLIP - local: model_doc/clipseg title: CLIPSeg + - local: model_doc/clvp + title: CLVP - local: model_doc/data2vec title: Data2Vec + - local: model_doc/deplot + title: DePlot - local: model_doc/donut title: Donut - local: model_doc/flava @@ -558,6 +712,12 @@ title: GIT - local: model_doc/groupvit title: GroupViT + - local: model_doc/idefics + title: IDEFICS + - local: model_doc/instructblip + title: InstructBLIP + - local: model_doc/kosmos-2 + title: KOSMOS-2 - local: model_doc/layoutlm title: LayoutLM - local: model_doc/layoutlmv2 @@ -568,14 +728,30 @@ title: LayoutXLM - local: model_doc/lilt title: LiLT + - local: model_doc/llava + title: Llava - local: model_doc/lxmert title: LXMERT + - local: model_doc/matcha + title: MatCha + - local: model_doc/mgp-str + title: MGP-STR + - local: model_doc/nougat + title: Nougat - local: model_doc/oneformer title: OneFormer - local: model_doc/owlvit title: OWL-ViT + - local: model_doc/owlv2 + title: OWLv2 - local: model_doc/perceiver title: Perceiver + - local: model_doc/pix2struct + title: Pix2Struct + - local: model_doc/sam + title: Segment Anything + - local: model_doc/siglip + title: SigLIP - local: model_doc/speech-encoder-decoder title: Speech Encoder Decoder Models - local: model_doc/tapas @@ -584,8 +760,12 @@ title: TrOCR - local: model_doc/tvlt title: TVLT + - local: model_doc/tvp + title: TVP - local: model_doc/vilt title: ViLT + - local: model_doc/vipllava + title: VipLlava - local: model_doc/vision-encoder-decoder title: Vision Encoder Decoder Models - local: model_doc/vision-text-dual-encoder @@ -604,6 +784,14 @@ title: Reinforcement learning models - isExpanded: false sections: + - local: model_doc/autoformer + title: Autoformer + - local: model_doc/informer + title: Informer + - local: model_doc/patchtsmixer + title: PatchTSMixer + - local: model_doc/patchtst + title: PatchTST - local: model_doc/time_series_transformer title: Time Series Transformer title: Time series models @@ -630,5 +818,7 @@ title: Utilities for Audio processing - local: internal/file_utils title: General Utilities + - local: internal/time_series_utils + title: Utilities for Time Series title: Internal Helpers title: API diff --git a/docs/source/en/accelerate.mdx b/docs/source/en/accelerate.md similarity index 93% rename from docs/source/en/accelerate.mdx rename to docs/source/en/accelerate.md index 02e05df3907492..b0f0e4efe64778 100644 --- a/docs/source/en/accelerate.mdx +++ b/docs/source/en/accelerate.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
+ +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Distributed training with 🤗 Accelerate @@ -129,4 +133,4 @@ accelerate launch train.py >>> notebook_launcher(training_function) ``` -For more information about 🤗 Accelerate and it's rich features, refer to the [documentation](https://huggingface.co/docs/accelerate). \ No newline at end of file +For more information about 🤗 Accelerate and its rich features, refer to the [documentation](https://huggingface.co/docs/accelerate). diff --git a/docs/source/en/add_new_model.mdx b/docs/source/en/add_new_model.md similarity index 94% rename from docs/source/en/add_new_model.mdx rename to docs/source/en/add_new_model.md index 56a130f14ec29f..70f7263e338a3a 100644 --- a/docs/source/en/add_new_model.mdx +++ b/docs/source/en/add_new_model.md @@ -7,6 +7,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # How to add a model to 🤗 Transformers? @@ -48,7 +52,7 @@ A good first starting point to better understand the library is to read the [doc In our opinion, the library's code is not just a means to provide a product, *e.g.* the ability to use BERT for inference, but also as the very product that we want to improve. Hence, when adding a model, the user is not only the -person that will use your model, but also everybody that will read, try to understand, and possibly tweak your code. +person who will use your model, but also everybody who will read, try to understand, and possibly tweak your code. With this in mind, let's go a bit deeper into the general library design. @@ -97,7 +101,7 @@ own regarding how code should be written :-) 1. The forward pass of your model should be fully written in the modeling file while being fully independent of other models in the library. If you want to reuse a block from another model, copy the code and paste it with a `# Copied from` comment on top (see [here](https://github.com/huggingface/transformers/blob/v4.17.0/src/transformers/models/roberta/modeling_roberta.py#L160) - for a good example). + for a good example and [there](pr_checks#check-copies) for more documentation on Copied from). 2. The code should be fully understandable, even by a non-native English speaker. This means you should pick descriptive variable names and avoid abbreviations. As an example, `activation` is preferred to `act`. One-letter variable names are strongly discouraged unless it's an index in a for loop. @@ -127,9 +131,9 @@ From experience, we can tell you that the most important things to keep in mind friends. Note that it might very well happen that your model's tokenizer is based on one model implementation, and your model's modeling code on another one. *E.g.* FSMT's modeling code is based on BART, while FSMT's tokenizer code is based on XLM. -- It's more of an engineering challenge than a scientific challenge. You should spend more time on creating an - efficient debugging environment than trying to understand all theoretical aspects of the model in the paper. -- Ask for help, when you're stuck! 
Models are the core component of 🤗 Transformers so that we at Hugging Face are more +- It's more of an engineering challenge than a scientific challenge. You should spend more time creating an + efficient debugging environment rather than trying to understand all theoretical aspects of the model in the paper. +- Ask for help, when you're stuck! Models are the core component of 🤗 Transformers so we at Hugging Face are more than happy to help you at every step to add your model. Don't hesitate to ask if you notice you are not making progress. @@ -153,9 +157,9 @@ List: ☐ Submitted the pull request
☐ (Optional) Added a demo notebook -To begin with, we usually recommend to start by getting a good theoretical understanding of `BrandNewBert`. However, +To begin with, we usually recommend starting by getting a good theoretical understanding of `BrandNewBert`. However, if you prefer to understand the theoretical aspects of the model *on-the-job*, then it is totally fine to directly dive -into the `BrandNewBert`'s code-base. This option might suit you better, if your engineering skills are better than +into the `BrandNewBert`'s code-base. This option might suit you better if your engineering skills are better than your theoretical skill, if you have trouble understanding `BrandNewBert`'s paper, or if you just enjoy programming much more than reading scientific papers. @@ -171,7 +175,7 @@ theoretical aspects, but rather focus on the practical ones, namely: encoder-decoder model? Look at the [model_summary](model_summary) if you're not familiar with the differences between those. - What are the applications of *brand_new_bert*? Text classification? Text generation? Seq2Seq tasks, *e.g.,* summarization? -- What is the novel feature of the model making it different from BERT/GPT-2/BART? +- What is the novel feature of the model that makes it different from BERT/GPT-2/BART? - Which of the already existing [🤗 Transformers models](https://huggingface.co/transformers/#contents) is most similar to *brand_new_bert*? - What type of tokenizer is used? A sentencepiece tokenizer? Word piece tokenizer? Is it the same tokenizer as used @@ -202,7 +206,15 @@ source .env/bin/activate pip install -e ".[dev]" ``` -and return to the parent directory +Depending on your OS, and since the number of optional dependencies of Transformers is growing, you might get a +failure with this command. If that's the case make sure to install the Deep Learning framework you are working with +(PyTorch, TensorFlow and/or Flax) then do: + +```bash +pip install -e ".[quality]" +``` + +which should be enough for most use cases. You can then return to the parent directory ```bash cd .. @@ -249,7 +261,7 @@ figure out the following: - How can you debug the model in the original environment of the repo? Do you have to add *print* statements, can you work with an interactive debugger like *ipdb*, or should you use an efficient IDE to debug the model, like PyCharm? -It is very important that before you start the porting process, that you can **efficiently** debug code in the original +It is very important that before you start the porting process, you can **efficiently** debug code in the original repository! Also, remember that you are working with an open-source library, so do not hesitate to open an issue, or even a pull request in the original repository. The maintainers of this repository are most likely very happy about someone looking into their code! @@ -268,10 +280,10 @@ In general, there are two possible debugging environments for running the origin Jupyter notebooks have the advantage that they allow for cell-by-cell execution which can be helpful to better split logical components from one another and to have faster debugging cycles as intermediate results can be stored. Also, notebooks are often easier to share with other contributors, which might be very helpful if you want to ask the Hugging -Face team for help. If you are familiar with Jupyter notebooks, we strongly recommend you to work with them. +Face team for help. If you are familiar with Jupyter notebooks, we strongly recommend you work with them. 
The obvious disadvantage of Jupyter notebooks is that if you are not used to working with them you will have to spend -some time adjusting to the new programming environment and that you might not be able to use your known debugging tools +some time adjusting to the new programming environment and you might not be able to use your known debugging tools anymore, like `ipdb`. For each code-base, a good first step is always to load a **small** pretrained checkpoint and to be able to reproduce a @@ -317,7 +329,7 @@ example is [T5's MeshTensorFlow](https://github.com/tensorflow/mesh/tree/master/ very complex and does not offer a simple way to decompose the model into its sub-components. For such libraries, one often relies on verifying print statements. -No matter which strategy you choose, the recommended procedure is often the same in that you should start to debug the +No matter which strategy you choose, the recommended procedure is often the same that you should start to debug the starting layers first and the ending layers last. It is recommended that you retrieve the output, either by print statements or sub-component functions, of the following @@ -349,10 +361,10 @@ We expect that every model added to 🤗 Transformers passes a couple of integra model and the reimplemented version in 🤗 Transformers have to give the exact same output up to a precision of 0.001! Since it is normal that the exact same model written in different libraries can give a slightly different output depending on the library framework, we accept an error tolerance of 1e-3 (0.001). It is not enough if the model gives -nearly the same output, they have to be the almost identical. Therefore, you will certainly compare the intermediate +nearly the same output, they have to be almost identical. Therefore, you will certainly compare the intermediate outputs of the 🤗 Transformers version multiple times against the intermediate outputs of the original implementation of *brand_new_bert* in which case an **efficient** debugging environment of the original repository is absolutely -important. Here is some advice is to make your debugging environment as efficient as possible. +important. Here is some advice to make your debugging environment as efficient as possible. - Find the best way of debugging intermediate results. Is the original repository written in PyTorch? Then you should probably take the time to write a longer script that decomposes the original model into smaller sub-components to @@ -397,7 +409,7 @@ Otherwise, let's start generating a new model. You have two choices here: - `transformers-cli add-new-model-like` to add a new model like an existing one - `transformers-cli add-new-model` to add a new model from our template (will look like BERT or Bart depending on the type of model you select) -In both cases, you will be prompted with a questionnaire to fill the basic information of your model. The second command requires to install `cookiecutter`, you can find more information on it [here](https://github.com/huggingface/transformers/tree/main/templates/adding_a_new_model). +In both cases, you will be prompted with a questionnaire to fill in the basic information of your model. The second command requires to install `cookiecutter`, you can find more information on it [here](https://github.com/huggingface/transformers/tree/main/templates/adding_a_new_model). **Open a Pull Request on the main huggingface/transformers repo** @@ -439,7 +451,7 @@ git push -u origin a-descriptive-name-for-my-changes 6. 
Change the PR into a draft by clicking on “Convert to draft” on the right of the GitHub pull request web page. -In the following, whenever you have done some progress, don't forget to commit your work and push it to your account so +In the following, whenever you have made some progress, don't forget to commit your work and push it to your account so that it shows in the pull request. Additionally, you should make sure to update your work with the current main from time to time by doing: @@ -471,7 +483,7 @@ Now you can finally start coding :). The generated code in `src/transformers/models/brand_new_bert/modeling_brand_new_bert.py` will either have the same architecture as BERT if it's an encoder-only model or BART if it's an encoder-decoder model. At this point, you should remind yourself what you've learned in the beginning about the theoretical aspects of the model: *How is the model different from BERT or -BART?*". Implement those changes which often means to change the *self-attention* layer, the order of the normalization +BART?*". Implement those changes which often means changing the *self-attention* layer, the order of the normalization layer, etc… Again, it is often useful to look at the similar architecture of already existing models in Transformers to get a better feeling of how your model should be implemented. @@ -519,7 +531,7 @@ but all the other ones should use an initialization as above. This is coded like ```py def _init_weights(self, module): """Initialize the weights""" - if isinstnace(module, Wav2Vec2ForPreTraining): + if isinstance(module, Wav2Vec2ForPreTraining): module.project_hid.reset_parameters() module.project_q.reset_parameters() module.project_hid._is_hf_initialized = True @@ -653,7 +665,7 @@ PyTorch's implementation of a layer requires the weight to be transposed beforeh Finally, you should also check that **all** required weights are initialized and print out all checkpoint weights that were not used for initialization to make sure the model is correctly converted. It is completely normal, that the -conversion trials fail with either a wrong shape statement or wrong name assignment. This is most likely because either +conversion trials fail with either a wrong shape statement or a wrong name assignment. This is most likely because either you used incorrect parameters in `BrandNewBertConfig()`, have a wrong architecture in the 🤗 Transformers implementation, you have a bug in the `init()` functions of one of the components of the 🤗 Transformers implementation or you need to transpose one of the checkpoint weights. @@ -670,7 +682,7 @@ model.save_pretrained("/path/to/converted/checkpoint/folder") **7. Implement the forward pass** Having managed to correctly load the pretrained weights into the 🤗 Transformers implementation, you should now make -sure that the forward pass is correctly implemented. In [Get familiar with the original repository](#run-a-pretrained-checkpoint-using-the-original-repository), you have already created a script that runs a forward +sure that the forward pass is correctly implemented. In [Get familiar with the original repository](#3-4-run-a-pretrained-checkpoint-using-the-original-repository), you have already created a script that runs a forward pass of the model using the original repository. Now you should write an analogous script using the 🤗 Transformers implementation instead of the original one. It should look as follows: @@ -710,7 +722,7 @@ in the 🤗 Transformers implementation. 
From our experience, a simple and effic in both the original implementation and 🤗 Transformers implementation, at the same positions in the network respectively, and to successively remove print statements showing the same values for intermediate presentations. -When you're confident that both implementations yield the same output, verifying the outputs with +When you're confident that both implementations yield the same output, verify the outputs with `torch.allclose(original_output, output, atol=1e-3)`, you're done with the most difficult part! Congratulations - the work left to be done should be a cakewalk 😊. @@ -732,7 +744,7 @@ Having fixed all common tests, it is now crucial to ensure that all the nice wor - b) Future changes to your model will not break any important feature of the model. At first, integration tests should be added. Those integration tests essentially do the same as the debugging scripts -you used earlier to implement the model to 🤗 Transformers. A template of those model tests is already added by the +you used earlier to implement the model to 🤗 Transformers. A template of those model tests has already added by the Cookiecutter, called `BrandNewBertModelIntegrationTests` and only has to be filled out by you. To ensure that those tests are passing, run @@ -757,7 +769,7 @@ ways: **9. Implement the tokenizer** -Next, we should add the tokenizer of *brand_new_bert*. Usually, the tokenizer is equivalent or very similar to an +Next, we should add the tokenizer of *brand_new_bert*. Usually, the tokenizer is equivalent to or very similar to an already existing tokenizer of 🤗 Transformers. It is very important to find/extract the original tokenizer file and to manage to load this file into the 🤗 @@ -809,7 +821,7 @@ tests for you. Now, all the necessary functionality for *brand_new_bert* is added - you're almost done! The only thing left to add is a nice docstring and a doc page. The Cookiecutter should have added a template file called -`docs/source/model_doc/brand_new_bert.mdx` that you should fill out. Users of your model will usually first look at +`docs/source/model_doc/brand_new_bert.md` that you should fill out. Users of your model will usually first look at this page before using your model. Hence, the documentation must be understandable and concise. It is very useful for the community to add some *Tips* to show how the model should be used. Don't hesitate to ping the Hugging Face team regarding the docstrings. @@ -878,6 +890,6 @@ reviewer. Now, it's time to get some credit from the community for your work! Having completed a model addition is a major contribution to Transformers and the whole NLP community. Your code and the ported pre-trained models will certainly be used by hundreds and possibly even thousands of developers and researchers. You should be proud of your work and share -your achievement with the community. +your achievements with the community. **You have made another model that is super easy to access for everyone in the community! 
🤯** diff --git a/docs/source/en/add_new_pipeline.mdx b/docs/source/en/add_new_pipeline.md similarity index 93% rename from docs/source/en/add_new_pipeline.mdx rename to docs/source/en/add_new_pipeline.md index b0cc2cd0ff72e7..9e10c310f07f39 100644 --- a/docs/source/en/add_new_pipeline.mdx +++ b/docs/source/en/add_new_pipeline.md @@ -7,11 +7,15 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # How to create a custom pipeline? -In this guide, we will see how to create a custom pipeline and share it on the [Hub](hf.co/models) or add it to the +In this guide, we will see how to create a custom pipeline and share it on the [Hub](https://hf.co/models) or add it to the 🤗 Transformers library. First and foremost, you need to decide the raw entries the pipeline will be able to take. It can be strings, raw bytes, @@ -107,8 +111,8 @@ def _sanitize_parameters(self, **kwargs): ``` Try to keep the inputs/outputs very simple and ideally JSON-serializable as it makes the pipeline usage very easy -without requiring users to understand new kind of objects. It's also relatively common to support many different types -of arguments for ease of use (audio files, can be filenames, URLs or pure bytes) +without requiring users to understand new kinds of objects. It's also relatively common to support many different types +of arguments for ease of use (audio files, which can be filenames, URLs or pure bytes) @@ -215,8 +219,8 @@ repo.push_to_hub() ``` This will copy the file where you defined `PairClassificationPipeline` inside the folder `"test-dynamic-pipeline"`, -along with saving the model and tokenizer of the pipeline, before pushing everything in the repository -`{your_username}/test-dynamic-pipeline`. After that anyone can use it as long as they provide the option +along with saving the model and tokenizer of the pipeline, before pushing everything into the repository +`{your_username}/test-dynamic-pipeline`. After that, anyone can use it as long as they provide the option `trust_remote_code=True`: ```py @@ -228,9 +232,9 @@ classifier = pipeline(model="{your_username}/test-dynamic-pipeline", trust_remot ## Add the pipeline to 🤗 Transformers If you want to contribute your pipeline to 🤗 Transformers, you will need to add a new module in the `pipelines` submodule -with the code of your pipeline, then add it in the list of tasks defined in `pipelines/__init__.py`. +with the code of your pipeline, then add it to the list of tasks defined in `pipelines/__init__.py`. -Then you will need to add tests. Create a new file `tests/test_pipelines_MY_PIPELINE.py` with example with the other tests. +Then you will need to add tests. Create a new file `tests/test_pipelines_MY_PIPELINE.py` with examples of the other tests. The `run_pipeline_test` function will be very generic and run on small random models on every possible architecture as defined by `model_mapping` and `tf_model_mapping`. 
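+
+To tie the pieces of this guide together, here is a minimal, end-to-end sketch of the `PairClassificationPipeline`
+described above, together with its registration. The simplified `postprocess` (PyTorch-only, single input) and the
+`pair-classification` task name are illustrative rather than a drop-in implementation:
+
+```py
+from transformers import AutoModelForSequenceClassification, Pipeline
+from transformers.pipelines import PIPELINE_REGISTRY
+
+
+class PairClassificationPipeline(Pipeline):
+    def _sanitize_parameters(self, second_text=None, **kwargs):
+        # Split user kwargs into preprocess / forward / postprocess kwargs.
+        preprocess_kwargs = {}
+        if second_text is not None:
+            preprocess_kwargs["second_text"] = second_text
+        return preprocess_kwargs, {}, {}
+
+    def preprocess(self, text, second_text=None):
+        # Turn the raw inputs into model-ready tensors.
+        return self.tokenizer(text, text_pair=second_text, return_tensors=self.framework)
+
+    def _forward(self, model_inputs):
+        # The only place where the model is called.
+        return self.model(**model_inputs)
+
+    def postprocess(self, model_outputs):
+        # Assumes a PyTorch model; return something simple and JSON-serializable.
+        best_class = model_outputs.logits.softmax(-1).argmax(-1).item()
+        return {"label": self.model.config.id2label[best_class]}
+
+
+# Register the task so that `pipeline(task="pair-classification", model=...)` can find it.
+PIPELINE_REGISTRY.register_pipeline(
+    "pair-classification",
+    pipeline_class=PairClassificationPipeline,
+    pt_model=AutoModelForSequenceClassification,
+)
+```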
diff --git a/docs/source/en/add_tensorflow_model.mdx b/docs/source/en/add_tensorflow_model.md similarity index 94% rename from docs/source/en/add_tensorflow_model.mdx rename to docs/source/en/add_tensorflow_model.md index e145a7d00184f4..52c7e3b1ada118 100644 --- a/docs/source/en/add_tensorflow_model.mdx +++ b/docs/source/en/add_tensorflow_model.md @@ -7,6 +7,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # How to convert a 🤗 Transformers model to TensorFlow? @@ -38,7 +42,7 @@ Are you unsure whether the model you wish to use already has a corresponding Ten   Check the `model_type` field of the `config.json` of your model of choice -([example](https://huggingface.co/bert-base-uncased/blob/main/config.json#L14)). If the corresponding model folder in +([example](https://huggingface.co/google-bert/bert-base-uncased/blob/main/config.json#L14)). If the corresponding model folder in 🤗 Transformers has a file whose name starts with "modeling_tf", it means that it has a corresponding TensorFlow architecture ([example](https://github.com/huggingface/transformers/tree/main/src/transformers/models/bert)). @@ -52,7 +56,7 @@ you might recall from our [general overview of 🤗 Transformers](add_new_model# that we are an opinionated bunch - the ease of use of 🤗 Transformers relies on consistent design choices. From experience, we can tell you a few important things about adding TensorFlow models: -- Don't reinvent the wheel! More often that not, there are at least two reference implementations you should check: the +- Don't reinvent the wheel! More often than not, there are at least two reference implementations you should check: the PyTorch equivalent of the model you are implementing and other TensorFlow models for the same class of problems. - Great model implementations survive the test of time. This doesn't happen because the code is pretty, but rather because the code is clear, easy to debug and build upon. If you make the life of the maintainers easy with your @@ -79,7 +83,7 @@ don't have your eyes set on a specific architecture, asking the 🤗 Transformer maximize your impact - we will guide you towards the most prominent architectures that are missing on the TensorFlow side. If the specific model you want to use with TensorFlow already has a TensorFlow architecture implementation in 🤗 Transformers but is lacking weights, feel free to jump straight into the -[weight conversion section](#adding-tensorflow-weights-to-hub) +[weight conversion section](#adding-tensorflow-weights-to--hub) of this page. For simplicity, the remainder of this guide assumes you've decided to contribute with the TensorFlow version of @@ -97,7 +101,7 @@ TensorFlow-related pull request. **2. Prepare transformers dev environment** -Having selected the model architecture, open an draft PR to signal your intention to work on it. Follow the +Having selected the model architecture, open a draft PR to signal your intention to work on it. Follow the instructions below to set up your environment and open a draft PR. 1. 
Fork the [repository](https://github.com/huggingface/transformers) by clicking on the 'Fork' button on the @@ -119,6 +123,13 @@ source .env/bin/activate pip install -e ".[dev]" ``` +Depending on your OS, and since the number of optional dependencies of Transformers is growing, you might get a +failure with this command. If that's the case make sure to install TensorFlow then do: + +```bash +pip install -e ".[quality]" +``` + **Note:** You don't need to have CUDA installed. Making the new model work on CPU is sufficient. 4. Create a branch with a descriptive name from your main branch @@ -218,12 +229,11 @@ documentation pages. You can complete this part entirely following the patterns changes: - Include all public classes of *BrandNewBert* in `src/transformers/__init__.py` - Add *BrandNewBert* classes to the corresponding Auto classes in `src/transformers/models/auto/modeling_tf_auto.py` -- Include the modeling file in the documentation test file list in `utils/documentation_tests.txt` - Add the lazy loading classes related to *BrandNewBert* in `src/transformers/utils/dummy_tf_objects.py` - Update the import structures for the public classes in `src/transformers/models/brand_new_bert/__init__.py` -- Add the documentation pointers to the public methods of *BrandNewBert* in `docs/source/en/model_doc/brand_new_bert.mdx` -- Add yourself to the list of contributors to *BrandNewBert* in `docs/source/en/model_doc/brand_new_bert.mdx` -- Finally, add a green tick ✅ to the TensorFlow column of *BrandNewBert* in `docs/source/en/index.mdx` +- Add the documentation pointers to the public methods of *BrandNewBert* in `docs/source/en/model_doc/brand_new_bert.md` +- Add yourself to the list of contributors to *BrandNewBert* in `docs/source/en/model_doc/brand_new_bert.md` +- Finally, add a green tick ✅ to the TensorFlow column of *BrandNewBert* in `docs/source/en/index.md` When you're happy with your implementation, run the following checklist to confirm that your model architecture is ready: @@ -317,7 +327,7 @@ That's it! 🎉 ## Debugging mismatches across ML frameworks 🐛 At some point, when adding a new architecture or when creating TensorFlow weights for an existing architecture, you -might come across errors compaining about mismatches between PyTorch and TensorFlow. You might even decide to open the +might come across errors complaining about mismatches between PyTorch and TensorFlow. You might even decide to open the model architecture code for the two frameworks, and find that they look identical. What's going on? 🤔 First of all, let's talk about why understanding these mismatches matters. Many community members will use 🤗 @@ -340,7 +350,7 @@ ingredient here is patience. Here is our suggested workflow for when you come ac that you'll have to venture into the source implementation of said instruction. In some cases, you might find an issue with a reference implementation - don't abstain from opening an issue in the upstream repository. -In some cases, in dicussion with the 🤗 Transformers team, we might find that the fixing the mismatch is infeasible. +In some cases, in discussion with the 🤗 Transformers team, we might find that fixing the mismatch is infeasible. When the mismatch is very small in the output layers of the model (but potentially large in the hidden states), we might decide to ignore it in favor of distributing the model. The `pt-to-tf` CLI mentioned above has a `--max-error` flag to override the error message at weight conversion time. 
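+
+As a starting point for the numerical comparison described in this section, the sketch below runs the same input
+through the PyTorch and TensorFlow implementations and prints the largest per-layer difference between their hidden
+states. The BERT checkpoint is only a stand-in for the architecture you are actually porting:
+
+```py
+import numpy as np
+from transformers import AutoTokenizer, AutoModel, TFAutoModel
+
+checkpoint = "google-bert/bert-base-uncased"
+tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+pt_model = AutoModel.from_pretrained(checkpoint, output_hidden_states=True)
+tf_model = TFAutoModel.from_pretrained(checkpoint, output_hidden_states=True, from_pt=True)
+
+pt_inputs = tokenizer("Debugging cross-framework mismatches.", return_tensors="pt")
+tf_inputs = tokenizer("Debugging cross-framework mismatches.", return_tensors="tf")
+
+pt_outputs = pt_model(**pt_inputs)
+tf_outputs = tf_model(**tf_inputs)
+
+# Walk through the hidden states to find the first layer where the two frameworks diverge.
+for layer, (pt_hidden, tf_hidden) in enumerate(zip(pt_outputs.hidden_states, tf_outputs.hidden_states)):
+    max_error = np.amax(np.abs(pt_hidden.detach().numpy() - tf_hidden.numpy()))
+    print(f"layer {layer}: max absolute difference {max_error:.2e}")
+```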
diff --git a/docs/source/en/attention.mdx b/docs/source/en/attention.md similarity index 85% rename from docs/source/en/attention.mdx rename to docs/source/en/attention.md index d6542b3be45481..02e4db58f5bea0 100644 --- a/docs/source/en/attention.mdx +++ b/docs/source/en/attention.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Attention mechanisms @@ -18,7 +22,7 @@ use a sparse version of the attention matrix to speed up training. ## LSH attention -[Reformer](#reformer) uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax +[Reformer](model_doc/reformer) uses LSH attention. In the softmax(QK^t), only the biggest elements (in the softmax dimension) of the matrix QK^t are going to give useful contributions. So for each query q in Q, we can consider only the keys k in K that are close to q. A hash function is used to determine if q and k are close. The attention mask is modified to mask the current token (except at the first position), because it will give a query and a key equal (so @@ -27,7 +31,7 @@ very similar to each other). Since the hash can be a bit random, several hash fu ## Local attention -[Longformer](#longformer) uses local attention: often, the local context (e.g., what are the two tokens to the +[Longformer](model_doc/longformer) uses local attention: often, the local context (e.g., what are the two tokens to the left and right?) is enough to take action for a given token. Also, by stacking attention layers that have a small window, the last layer will have a receptive field of more than just the tokens in the window, allowing them to build a representation of the whole sentence. @@ -47,7 +51,7 @@ length. ### Axial positional encodings -[Reformer](#reformer) uses axial positional encodings: in traditional transformer models, the positional encoding +[Reformer](model_doc/reformer) uses axial positional encodings: in traditional transformer models, the positional encoding E is a matrix of size \\(l\\) by \\(d\\), \\(l\\) being the sequence length and \\(d\\) the dimension of the hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate that, axial positional encodings consist of factorizing that big matrix E in two smaller matrices E1 and E2, with diff --git a/docs/source/en/autoclass_tutorial.mdx b/docs/source/en/autoclass_tutorial.md similarity index 61% rename from docs/source/en/autoclass_tutorial.mdx rename to docs/source/en/autoclass_tutorial.md index 6b44e41a856c61..eacfdb441c2099 100644 --- a/docs/source/en/autoclass_tutorial.mdx +++ b/docs/source/en/autoclass_tutorial.md @@ -8,15 +8,19 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
+ +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Load pretrained instances with an AutoClass -With so many different Transformer architectures, it can be challenging to create one for your checkpoint. As a part of 🤗 Transformers core philosophy to make the library easy, simple and flexible to use, an `AutoClass` automatically infer and load the correct architecture from a given checkpoint. The `from_pretrained()` method lets you quickly load a pretrained model for any architecture so you don't have to devote time and resources to train a model from scratch. Producing this type of checkpoint-agnostic code means if your code works for one checkpoint, it will work with another checkpoint - as long as it was trained for a similar task - even if the architecture is different. +With so many different Transformer architectures, it can be challenging to create one for your checkpoint. As a part of 🤗 Transformers core philosophy to make the library easy, simple and flexible to use, an `AutoClass` automatically infers and loads the correct architecture from a given checkpoint. The `from_pretrained()` method lets you quickly load a pretrained model for any architecture so you don't have to devote time and resources to train a model from scratch. Producing this type of checkpoint-agnostic code means if your code works for one checkpoint, it will work with another checkpoint - as long as it was trained for a similar task - even if the architecture is different. -Remember, architecture refers to the skeleton of the model and checkpoints are the weights for a given architecture. For example, [BERT](https://huggingface.co/bert-base-uncased) is an architecture, while `bert-base-uncased` is a checkpoint. Model is a general term that can mean either architecture or checkpoint. +Remember, architecture refers to the skeleton of the model and checkpoints are the weights for a given architecture. For example, [BERT](https://huggingface.co/google-bert/bert-base-uncased) is an architecture, while `google-bert/bert-base-uncased` is a checkpoint. Model is a general term that can mean either architecture or checkpoint. @@ -27,6 +31,7 @@ In this tutorial, learn to: * Load a pretrained feature extractor. * Load a pretrained processor. * Load a pretrained model. +* Load a model as a backbone. ## AutoTokenizer @@ -37,7 +42,7 @@ Load a tokenizer with [`AutoTokenizer.from_pretrained`]: ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") ``` Then tokenize your input as shown below: @@ -60,6 +65,48 @@ For vision tasks, an image processor processes the image into the correct input >>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224") ``` +## AutoBackbone + +
+<div style="text-align: center">A Swin backbone with multiple stages for outputting a feature map.</div>
+
+ +The [`AutoBackbone`] lets you use pretrained models as backbones to get feature maps from different stages of the backbone. You should specify one of the following parameters in [`~PretrainedConfig.from_pretrained`]: + +* `out_indices` is the index of the layer you'd like to get the feature map from +* `out_features` is the name of the layer you'd like to get the feature map from + +These parameters can be used interchangeably, but if you use both, make sure they're aligned with each other! If you don't pass any of these parameters, the backbone returns the feature map from the last layer. + +
+<div style="text-align: center">A feature map from the first stage of the backbone. The patch partition refers to the model stem.</div>
+
+ +For example, in the above diagram, to return the feature map from the first stage of the Swin backbone, you can set `out_indices=(1,)`: + +```py +>>> from transformers import AutoImageProcessor, AutoBackbone +>>> import torch +>>> from PIL import Image +>>> import requests +>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" +>>> image = Image.open(requests.get(url, stream=True).raw) +>>> processor = AutoImageProcessor.from_pretrained("microsoft/swin-tiny-patch4-window7-224") +>>> model = AutoBackbone.from_pretrained("microsoft/swin-tiny-patch4-window7-224", out_indices=(1,)) + +>>> inputs = processor(image, return_tensors="pt") +>>> outputs = model(**inputs) +>>> feature_maps = outputs.feature_maps +``` + +Now you can access the `feature_maps` object from the first stage of the backbone: + +```py +>>> list(feature_maps[0].shape) +[1, 96, 56, 56] +``` ## AutoFeatureExtractor @@ -91,12 +138,12 @@ Load a processor with [`AutoProcessor.from_pretrained`]: -Finally, the `AutoModelFor` classes let you load a pretrained model for a given task (see [here](model_doc/auto) for a complete list of available tasks). For example, load a model for sequence classification with [`AutoModelForSequenceClassification.from_pretrained`]: +The `AutoModelFor` classes let you load a pretrained model for a given task (see [here](model_doc/auto) for a complete list of available tasks). For example, load a model for sequence classification with [`AutoModelForSequenceClassification.from_pretrained`]: ```py >>> from transformers import AutoModelForSequenceClassification ->>> model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") +>>> model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` Easily reuse the same checkpoint to load an architecture for a different task: @@ -104,7 +151,7 @@ Easily reuse the same checkpoint to load an architecture for a different task: ```py >>> from transformers import AutoModelForTokenClassification ->>> model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased") +>>> model = AutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` @@ -123,7 +170,7 @@ Finally, the `TFAutoModelFor` classes let you load a pretrained model for a give ```py >>> from transformers import TFAutoModelForSequenceClassification ->>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") +>>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` Easily reuse the same checkpoint to load an architecture for a different task: @@ -131,7 +178,7 @@ Easily reuse the same checkpoint to load an architecture for a different task: ```py >>> from transformers import TFAutoModelForTokenClassification ->>> model = TFAutoModelForTokenClassification.from_pretrained("distilbert-base-uncased") +>>> model = TFAutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` Generally, we recommend using the `AutoTokenizer` class and the `TFAutoModelFor` class to load pretrained instances of models. This will ensure you load the correct architecture every time. In the next [tutorial](preprocessing), learn how to use your newly loaded tokenizer, image processor, feature extractor and processor to preprocess a dataset for fine-tuning. 
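+
+Since `out_indices` and `out_features` are interchangeable, the feature map from the first stage of the backbone can
+also be requested by stage name. The `"stage1"` name below is an assumption based on Swin's usual stage naming
+(`"stem"`, `"stage1"`, ...) - check `model.config.stage_names` for the model you use - and `inputs` is the processed
+image from the AutoBackbone example above:
+
+```py
+>>> from transformers import AutoBackbone
+
+>>> model = AutoBackbone.from_pretrained("microsoft/swin-tiny-patch4-window7-224", out_features=["stage1"])
+>>> outputs = model(**inputs)
+>>> list(outputs.feature_maps[0].shape)
+[1, 96, 56, 56]
+```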
diff --git a/docs/source/en/benchmarks.mdx b/docs/source/en/benchmarks.md similarity index 90% rename from docs/source/en/benchmarks.mdx rename to docs/source/en/benchmarks.md index 244112001f5ccf..1fd61cc8de4029 100644 --- a/docs/source/en/benchmarks.mdx +++ b/docs/source/en/benchmarks.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Benchmarks @@ -44,7 +48,7 @@ The benchmark classes [`PyTorchBenchmark`] and [`TensorFlowBenchmark`] expect an ```py >>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments ->>> args = PyTorchBenchmarkArguments(models=["bert-base-uncased"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512]) +>>> args = PyTorchBenchmarkArguments(models=["google-bert/bert-base-uncased"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512]) >>> benchmark = PyTorchBenchmark(args) ``` @@ -53,7 +57,7 @@ The benchmark classes [`PyTorchBenchmark`] and [`TensorFlowBenchmark`] expect an >>> from transformers import TensorFlowBenchmark, TensorFlowBenchmarkArguments >>> args = TensorFlowBenchmarkArguments( -... models=["bert-base-uncased"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512] +... models=["google-bert/bert-base-uncased"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512] ... ) >>> benchmark = TensorFlowBenchmark(args) ``` @@ -85,20 +89,20 @@ An instantiated benchmark object can then simply be run by calling `benchmark.ru -------------------------------------------------------------------------------- Model Name Batch Size Seq Length Time in s -------------------------------------------------------------------------------- -bert-base-uncased 8 8 0.006 -bert-base-uncased 8 32 0.006 -bert-base-uncased 8 128 0.018 -bert-base-uncased 8 512 0.088 +google-bert/bert-base-uncased 8 8 0.006 +google-bert/bert-base-uncased 8 32 0.006 +google-bert/bert-base-uncased 8 128 0.018 +google-bert/bert-base-uncased 8 512 0.088 -------------------------------------------------------------------------------- ==================== INFERENCE - MEMORY - RESULT ==================== -------------------------------------------------------------------------------- Model Name Batch Size Seq Length Memory in MB -------------------------------------------------------------------------------- -bert-base-uncased 8 8 1227 -bert-base-uncased 8 32 1281 -bert-base-uncased 8 128 1307 -bert-base-uncased 8 512 1539 +google-bert/bert-base-uncased 8 8 1227 +google-bert/bert-base-uncased 8 32 1281 +google-bert/bert-base-uncased 8 128 1307 +google-bert/bert-base-uncased 8 512 1539 -------------------------------------------------------------------------------- ==================== ENVIRONMENT INFORMATION ==================== @@ -142,20 +146,20 @@ An instantiated benchmark object can then simply be run by calling `benchmark.ru -------------------------------------------------------------------------------- Model Name Batch Size Seq Length Time in s -------------------------------------------------------------------------------- -bert-base-uncased 8 8 0.005 -bert-base-uncased 8 32 0.008 -bert-base-uncased 8 128 
0.022 -bert-base-uncased 8 512 0.105 +google-bert/bert-base-uncased 8 8 0.005 +google-bert/bert-base-uncased 8 32 0.008 +google-bert/bert-base-uncased 8 128 0.022 +google-bert/bert-base-uncased 8 512 0.105 -------------------------------------------------------------------------------- ==================== INFERENCE - MEMORY - RESULT ==================== -------------------------------------------------------------------------------- Model Name Batch Size Seq Length Memory in MB -------------------------------------------------------------------------------- -bert-base-uncased 8 8 1330 -bert-base-uncased 8 32 1330 -bert-base-uncased 8 128 1330 -bert-base-uncased 8 512 1770 +google-bert/bert-base-uncased 8 8 1330 +google-bert/bert-base-uncased 8 32 1330 +google-bert/bert-base-uncased 8 128 1330 +google-bert/bert-base-uncased 8 512 1770 -------------------------------------------------------------------------------- ==================== ENVIRONMENT INFORMATION ==================== @@ -193,7 +197,7 @@ when adding the argument `save_to_csv=True` to [`PyTorchBenchmarkArguments`] and [`TensorFlowBenchmarkArguments`] respectively. In this case, every section is saved in a separate _.csv_ file. The path to each _.csv_ file can optionally be defined via the argument data classes. -Instead of benchmarking pre-trained models via their model identifier, _e.g._ `bert-base-uncased`, the user can +Instead of benchmarking pre-trained models via their model identifier, _e.g._ `google-bert/bert-base-uncased`, the user can alternatively benchmark an arbitrary configuration of any available model class. In this case, a `list` of configurations must be inserted with the benchmark args as follows. diff --git a/docs/source/en/bertology.mdx b/docs/source/en/bertology.md similarity index 86% rename from docs/source/en/bertology.mdx rename to docs/source/en/bertology.md index e64379d6580da5..ba1b4bd4002b97 100644 --- a/docs/source/en/bertology.mdx +++ b/docs/source/en/bertology.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # BERTology @@ -21,6 +25,7 @@ There is a growing field of study concerned with investigating the inner working - Are Sixteen Heads Really Better than One? by Paul Michel, Omer Levy, Graham Neubig: https://arxiv.org/abs/1905.10650 - What Does BERT Look At? An Analysis of BERT's Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. 
Manning: https://arxiv.org/abs/1906.04341 +- CAT-probing: A Metric-based Approach to Interpret How Pre-trained Models for Programming Language Attend Code Structure: https://arxiv.org/abs/2210.04633 In order to help this new field develop, we have included a few additional features in the BERT/GPT/GPT-2 models to help people access the inner representations, mainly adapted from the great work of Paul Michel diff --git a/docs/source/en/big_models.mdx b/docs/source/en/big_models.md similarity index 90% rename from docs/source/en/big_models.mdx rename to docs/source/en/big_models.md index 971403b62d4a01..729d32ca202951 100644 --- a/docs/source/en/big_models.mdx +++ b/docs/source/en/big_models.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Instantiating a big model @@ -19,11 +23,11 @@ from PyTorch is: 2. Load your pretrained weights. 3. Put those pretrained weights in your random model. -Step 1 and 2 both require a full version of the model in memory, which is not a problem in most cases, but if your model starts weighing several GigaBytes, those two copies can make you got our of RAM. Even worse, if you are using `torch.distributed` to launch a distributed training, each process will load the pretrained model and store these two copies in RAM. +Step 1 and 2 both require a full version of the model in memory, which is not a problem in most cases, but if your model starts weighing several GigaBytes, those two copies can make you get out of RAM. Even worse, if you are using `torch.distributed` to launch a distributed training, each process will load the pretrained model and store these two copies in RAM. -Note that the randomly created model is initialized with "empty" tensors, which take the space in memory without filling it (thus the random values are whatever was in this chunk of memory at a given time). The random initialization following the appropriate distribution for the kind of model/parameters instatiated (like a normal distribution for instance) is only performed after step 3 on the non-initialized weights, to be as fast as possible! +Note that the randomly created model is initialized with "empty" tensors, which take the space in memory without filling it (thus the random values are whatever was in this chunk of memory at a given time). The random initialization following the appropriate distribution for the kind of model/parameters instantiated (like a normal distribution for instance) is only performed after step 3 on the non-initialized weights, to be as fast as possible! 
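+
+If the two in-memory copies described above are the problem, one option is to pass `low_cpu_mem_usage=True` to
+`from_pretrained`. This is only a sketch, and it assumes the Accelerate library is installed, which this flag relies on:
+
+```py
+from transformers import AutoModel
+
+# The model is created with empty weights and the checkpoint is loaded into it directly,
+# so roughly one full copy of the parameters lives in RAM at any time.
+model = AutoModel.from_pretrained("google-bert/bert-base-cased", low_cpu_mem_usage=True)
+```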
@@ -38,7 +42,7 @@ You can control the maximum size before sharding with the `max_shard_size` param ```py from transformers import AutoModel -model = AutoModel.from_pretrained("bert-base-cased") +model = AutoModel.from_pretrained("google-bert/bert-base-cased") ``` If you save it using [`~PreTrainedModel.save_pretrained`], you will get a new folder with two files: the config of the model and its weights: @@ -116,4 +120,4 @@ If you want to directly load such a sharded checkpoint inside a model without us Sharded checkpoints reduce the memory usage during step 2 of the workflow mentioned above, but in order to use that model in a low memory setting, we recommend leveraging our tools based on the Accelerate library. -Please read the following guide for more information: [Large model loading using Accelerate](./main_classes/model#large-model-loading) \ No newline at end of file +Please read the following guide for more information: [Large model loading using Accelerate](./main_classes/model#large-model-loading) diff --git a/docs/source/en/chat_templating.md b/docs/source/en/chat_templating.md new file mode 100644 index 00000000000000..94048f88acaa47 --- /dev/null +++ b/docs/source/en/chat_templating.md @@ -0,0 +1,484 @@ + + +# Templates for Chat Models + +## Introduction + +An increasingly common use case for LLMs is **chat**. In a chat context, rather than continuing a single string +of text (as is the case with a standard language model), the model instead continues a conversation that consists +of one or more **messages**, each of which includes a **role**, like "user" or "assistant", as well as message text. + +Much like tokenization, different models expect very different input formats for chat. This is the reason we added +**chat templates** as a feature. Chat templates are part of the tokenizer. They specify how to convert conversations, +represented as lists of messages, into a single tokenizable string in the format that the model expects. + +Let's make this concrete with a quick example using the `BlenderBot` model. BlenderBot has an extremely simple default +template, which mostly just adds whitespace between rounds of dialogue: + +```python +>>> from transformers import AutoTokenizer +>>> tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill") + +>>> chat = [ +... {"role": "user", "content": "Hello, how are you?"}, +... {"role": "assistant", "content": "I'm doing great. How can I help you today?"}, +... {"role": "user", "content": "I'd like to show off how chat templating works!"}, +... ] + +>>> tokenizer.apply_chat_template(chat, tokenize=False) +" Hello, how are you? I'm doing great. How can I help you today? I'd like to show off how chat templating works!" +``` + +Notice how the entire chat is condensed into a single string. If we use `tokenize=True`, which is the default setting, +that string will also be tokenized for us. To see a more complex template in action, though, let's use the +`mistralai/Mistral-7B-Instruct-v0.1` model. + +```python +>>> from transformers import AutoTokenizer +>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1") + +>>> chat = [ +... {"role": "user", "content": "Hello, how are you?"}, +... {"role": "assistant", "content": "I'm doing great. How can I help you today?"}, +... {"role": "user", "content": "I'd like to show off how chat templating works!"}, +... ] + +>>> tokenizer.apply_chat_template(chat, tokenize=False) +"[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today? 
[INST] I'd like to show off how chat templating works! [/INST]" +``` + +Note that this time, the tokenizer has added the control tokens [INST] and [/INST] to indicate the start and end of +user messages (but not assistant messages!). Mistral-instruct was trained with these tokens, but BlenderBot was not. + +## How do I use chat templates? + +As you can see in the example above, chat templates are easy to use. Simply build a list of messages, with `role` +and `content` keys, and then pass it to the [`~PreTrainedTokenizer.apply_chat_template`] method. Once you do that, +you'll get output that's ready to go! When using chat templates as input for model generation, it's also a good idea +to use `add_generation_prompt=True` to add a [generation prompt](#what-are-generation-prompts). + +Here's an example of preparing input for `model.generate()`, using the `Zephyr` assistant model: + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer + +checkpoint = "HuggingFaceH4/zephyr-7b-beta" +tokenizer = AutoTokenizer.from_pretrained(checkpoint) +model = AutoModelForCausalLM.from_pretrained(checkpoint) # You may want to use bfloat16 and/or move to GPU here + +messages = [ + { + "role": "system", + "content": "You are a friendly chatbot who always responds in the style of a pirate", + }, + {"role": "user", "content": "How many helicopters can a human eat in one sitting?"}, + ] +tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt") +print(tokenizer.decode(tokenized_chat[0])) +``` +This will yield a string in the input format that Zephyr expects. +```text +<|system|> +You are a friendly chatbot who always responds in the style of a pirate +<|user|> +How many helicopters can a human eat in one sitting? +<|assistant|> +``` + +Now that our input is formatted correctly for Zephyr, we can use the model to generate a response to the user's question: + +```python +outputs = model.generate(tokenized_chat, max_new_tokens=128) +print(tokenizer.decode(outputs[0])) +``` + +This will yield: + +```text +<|system|> +You are a friendly chatbot who always responds in the style of a pirate +<|user|> +How many helicopters can a human eat in one sitting? +<|assistant|> +Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopters are not food, they are flying machines. Food is meant to be eaten, like a hearty plate o' grog, a savory bowl o' stew, or a delicious loaf o' bread. But helicopters, they be for transportin' and movin' around, not for eatin'. So, I'd say none, me hearties. None at all. +``` + +Arr, 'twas easy after all! + +## Is there an automated pipeline for chat? + +Yes, there is! Our text generation pipelines support chat inputs, which makes it easy to use chat models. In the past, +we used to use a dedicated "ConversationalPipeline" class, but this has now been deprecated and its functionality +has been merged into the [`TextGenerationPipeline`]. 
Let's try the `Zephyr` example again, but this time using +a pipeline: + +```python +from transformers import pipeline + +pipe = pipeline("text-generation", "HuggingFaceH4/zephyr-7b-beta") +messages = [ + { + "role": "system", + "content": "You are a friendly chatbot who always responds in the style of a pirate", + }, + {"role": "user", "content": "How many helicopters can a human eat in one sitting?"}, +] +print(pipe(messages, max_new_tokens=128)[0]['generated_text'][-1]) # Print the assistant's response +``` + +```text +{'role': 'assistant', 'content': "Matey, I'm afraid I must inform ye that humans cannot eat helicopters. Helicopters are not food, they are flying machines. Food is meant to be eaten, like a hearty plate o' grog, a savory bowl o' stew, or a delicious loaf o' bread. But helicopters, they be for transportin' and movin' around, not for eatin'. So, I'd say none, me hearties. None at all."} +``` + +The pipeline will take care of all the details of tokenization and calling `apply_chat_template` for you - +once the model has a chat template, all you need to do is initialize the pipeline and pass it the list of messages! + +## What are "generation prompts"? + +You may have noticed that the `apply_chat_template` method has an `add_generation_prompt` argument. This argument tells +the template to add tokens that indicate the start of a bot response. For example, consider the following chat: + +```python +messages = [ + {"role": "user", "content": "Hi there!"}, + {"role": "assistant", "content": "Nice to meet you!"}, + {"role": "user", "content": "Can I ask a question?"} +] +``` + +Here's what this will look like without a generation prompt, using the ChatML template we saw in the Zephyr example: + +```python +tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False) +"""<|im_start|>user +Hi there!<|im_end|> +<|im_start|>assistant +Nice to meet you!<|im_end|> +<|im_start|>user +Can I ask a question?<|im_end|> +""" +``` + +And here's what it looks like **with** a generation prompt: + +```python +tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) +"""<|im_start|>user +Hi there!<|im_end|> +<|im_start|>assistant +Nice to meet you!<|im_end|> +<|im_start|>user +Can I ask a question?<|im_end|> +<|im_start|>assistant +""" +``` + +Note that this time, we've added the tokens that indicate the start of a bot response. This ensures that when the model +generates text it will write a bot response instead of doing something unexpected, like continuing the user's +message. Remember, chat models are still just language models - they're trained to continue text, and chat is just a +special kind of text to them! You need to guide them with appropriate control tokens, so they know what they're +supposed to be doing. + +Not all models require generation prompts. Some models, like BlenderBot and LLaMA, don't have any +special tokens before bot responses. In these cases, the `add_generation_prompt` argument will have no effect. The exact +effect that `add_generation_prompt` has will depend on the template being used. + +## Can I use chat templates in training? + +Yes! We recommend that you apply the chat template as a preprocessing step for your dataset. After this, you +can simply continue like any other language model training task. When training, you should usually set +`add_generation_prompt=False`, because the added tokens to prompt an assistant response will not be helpful during +training. 
Let's see an example: + +```python +from transformers import AutoTokenizer +from datasets import Dataset + +tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta") + +chat1 = [ + {"role": "user", "content": "Which is bigger, the moon or the sun?"}, + {"role": "assistant", "content": "The sun."} +] +chat2 = [ + {"role": "user", "content": "Which is bigger, a virus or a bacterium?"}, + {"role": "assistant", "content": "A bacterium."} +] + +dataset = Dataset.from_dict({"chat": [chat1, chat2]}) +dataset = dataset.map(lambda x: {"formatted_chat": tokenizer.apply_chat_template(x["chat"], tokenize=False, add_generation_prompt=False)}) +print(dataset['formatted_chat'][0]) +``` +And we get: +```text +<|user|> +Which is bigger, the moon or the sun? +<|assistant|> +The sun. +``` + +From here, just continue training like you would with a standard language modelling task, using the `formatted_chat` column. + +## Advanced: How do chat templates work? + +The chat template for a model is stored on the `tokenizer.chat_template` attribute. If no chat template is set, the +default template for that model class is used instead. Let's take a look at the template for `BlenderBot`: + +```python + +>>> from transformers import AutoTokenizer +>>> tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill") + +>>> tokenizer.default_chat_template +"{% for message in messages %}{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}{{ message['content'] }}{% if not loop.last %}{{ ' ' }}{% endif %}{% endfor %}{{ eos_token }}" +``` + +That's kind of intimidating. Let's add some newlines and indentation to make it more readable. Note that the first +newline after each block as well as any preceding whitespace before a block are ignored by default, using the +Jinja `trim_blocks` and `lstrip_blocks` flags. However, be cautious - although leading whitespace on each +line is stripped, spaces between blocks on the same line are not. We strongly recommend checking that your template +isn't printing extra spaces where it shouldn't be! + +``` +{% for message in messages %} + {% if message['role'] == 'user' %} + {{ ' ' }} + {% endif %} + {{ message['content'] }} + {% if not loop.last %} + {{ ' ' }} + {% endif %} +{% endfor %} +{{ eos_token }} +``` + +If you've never seen one of these before, this is a [Jinja template](https://jinja.palletsprojects.com/en/3.1.x/templates/). +Jinja is a templating language that allows you to write simple code that generates text. In many ways, the code and +syntax resembles Python. In pure Python, this template would look something like this: + +```python +for idx, message in enumerate(messages): + if message['role'] == 'user': + print(' ') + print(message['content']) + if not idx == len(messages) - 1: # Check for the last message in the conversation + print(' ') +print(eos_token) +``` + +Effectively, the template does three things: +1. For each message, if the message is a user message, add a blank space before it, otherwise print nothing. +2. Add the message content +3. If the message is not the last message, add two spaces after it. After the final message, print the EOS token. + +This is a pretty simple template - it doesn't add any control tokens, and it doesn't support "system" messages, which +are a common way to give the model directives about how it should behave in the subsequent conversation. +But Jinja gives you a lot of flexibility to do those things! 
Let's see a Jinja template that can format inputs +similarly to the way LLaMA formats them (note that the real LLaMA template includes handling for default system +messages and slightly different system message handling in general - don't use this one in your actual code!) + +``` +{% for message in messages %} + {% if message['role'] == 'user' %} + {{ bos_token + '[INST] ' + message['content'] + ' [/INST]' }} + {% elif message['role'] == 'system' %} + {{ '<>\\n' + message['content'] + '\\n<>\\n\\n' }} + {% elif message['role'] == 'assistant' %} + {{ ' ' + message['content'] + ' ' + eos_token }} + {% endif %} +{% endfor %} +``` + +Hopefully if you stare at this for a little bit you can see what this template is doing - it adds specific tokens based +on the "role" of each message, which represents who sent it. User, assistant and system messages are clearly +distinguishable to the model because of the tokens they're wrapped in. + +## Advanced: Adding and editing chat templates + +### How do I create a chat template? + +Simple, just write a jinja template and set `tokenizer.chat_template`. You may find it easier to start with an +existing template from another model and simply edit it for your needs! For example, we could take the LLaMA template +above and add "[ASST]" and "[/ASST]" to assistant messages: + +``` +{% for message in messages %} + {% if message['role'] == 'user' %} + {{ bos_token + '[INST] ' + message['content'].strip() + ' [/INST]' }} + {% elif message['role'] == 'system' %} + {{ '<>\\n' + message['content'].strip() + '\\n<>\\n\\n' }} + {% elif message['role'] == 'assistant' %} + {{ '[ASST] ' + message['content'] + ' [/ASST]' + eos_token }} + {% endif %} +{% endfor %} +``` + +Now, simply set the `tokenizer.chat_template` attribute. Next time you use [`~PreTrainedTokenizer.apply_chat_template`], it will +use your new template! This attribute will be saved in the `tokenizer_config.json` file, so you can use +[`~utils.PushToHubMixin.push_to_hub`] to upload your new template to the Hub and make sure everyone's using the right +template for your model! + +```python +template = tokenizer.chat_template +template = template.replace("SYS", "SYSTEM") # Change the system token +tokenizer.chat_template = template # Set the new template +tokenizer.push_to_hub("model_name") # Upload your new template to the Hub! +``` + +The method [`~PreTrainedTokenizer.apply_chat_template`] which uses your chat template is called by the [`TextGenerationPipeline`] class, so +once you set the correct chat template, your model will automatically become compatible with [`TextGenerationPipeline`]. + + +If you're fine-tuning a model for chat, in addition to setting a chat template, you should probably add any new chat +control tokens as special tokens in the tokenizer. Special tokens are never split, +ensuring that your control tokens are always handled as single tokens rather than being tokenized in pieces. You +should also set the tokenizer's `eos_token` attribute to the token that marks the end of assistant generations in your +template. This will ensure that text generation tools can correctly figure out when to stop generating text. + + + +### What are "default" templates? + +Before the introduction of chat templates, chat handling was hardcoded at the model class level. For backwards +compatibility, we have retained this class-specific handling as default templates, also set at the class level. 
If a +model does not have a chat template set, but there is a default template for its model class, the `TextGenerationPipeline` +class and methods like `apply_chat_template` will use the class template instead. You can find out what the default +template for your tokenizer is by checking the `tokenizer.default_chat_template` attribute. + +This is something we do purely for backward compatibility reasons, to avoid breaking any existing workflows. Even when +the class template is appropriate for your model, we strongly recommend overriding the default template by +setting the `chat_template` attribute explicitly to make it clear to users that your model has been correctly configured +for chat, and to future-proof in case the default templates are ever altered or deprecated. + +### What template should I use? + +When setting the template for a model that's already been trained for chat, you should ensure that the template +exactly matches the message formatting that the model saw during training, or else you will probably experience +performance degradation. This is true even if you're training the model further - you will probably get the best +performance if you keep the chat tokens constant. This is very analogous to tokenization - you generally get the +best performance for inference or fine-tuning when you precisely match the tokenization used during training. + +If you're training a model from scratch, or fine-tuning a base language model for chat, on the other hand, +you have a lot of freedom to choose an appropriate template! LLMs are smart enough to learn to handle lots of different +input formats. Our default template for models that don't have a class-specific template follows the +[ChatML format](https://github.com/openai/openai-python/blob/main/chatml.md), and this is a good, flexible choice for many use-cases. It looks like this: + +``` +{% for message in messages %} + {{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}} +{% endfor %} +``` + +If you like this one, here it is in one-liner form, ready to copy into your code. The one-liner also includes +handy support for [generation prompts](#what-are-generation-prompts), but note that it doesn't add BOS or EOS tokens! +If your model expects those, they won't be added automatically by `apply_chat_template` - in other words, the +text will be tokenized with `add_special_tokens=False`. This is to avoid potential conflicts between the template and +the `add_special_tokens` logic. If your model expects special tokens, make sure to add them to the template! + +```python +tokenizer.chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}" +``` + +This template wraps each message in `<|im_start|>` and `<|im_end|>` tokens, and simply writes the role as a string, which +allows for flexibility in the roles you train with. 
The output looks like this: + +```text +<|im_start|>system +You are a helpful chatbot that will do its best not to say anything so stupid that people tweet about it.<|im_end|> +<|im_start|>user +How are you?<|im_end|> +<|im_start|>assistant +I'm doing great!<|im_end|> +``` + +The "user", "system" and "assistant" roles are the standard for chat, and we recommend using them when it makes sense, +particularly if you want your model to operate well with [`TextGenerationPipeline`]. However, you are not limited +to these roles - templating is extremely flexible, and any string can be a role. + +### I want to add some chat templates! How should I get started? + +If you have any chat models, you should set their `tokenizer.chat_template` attribute and test it using +[`~PreTrainedTokenizer.apply_chat_template`], then push the updated tokenizer to the Hub. This applies even if you're +not the model owner - if you're using a model with an empty chat template, or one that's still using the default class +template, please open a [pull request](https://huggingface.co/docs/hub/repositories-pull-requests-discussions) to the model repository so that this attribute can be set properly! + +Once the attribute is set, that's it, you're done! `tokenizer.apply_chat_template` will now work correctly for that +model, which means it is also automatically supported in places like `TextGenerationPipeline`! + +By ensuring that models have this attribute, we can make sure that the whole community gets to use the full power of +open-source models. Formatting mismatches have been haunting the field and silently harming performance for too long - +it's time to put an end to them! + +## Advanced: Template writing tips + +If you're unfamiliar with Jinja, we generally find that the easiest way to write a chat template is to first +write a short Python script that formats messages the way you want, and then convert that script into a template. + +Remember that the template handler will receive the conversation history as a variable called `messages`. Each +message is a dictionary with two keys, `role` and `content`. You will be able to access `messages` in your template +just like you can in Python, which means you can loop over it with `{% for message in messages %}` or access +individual messages with, for example, `{{ messages[0] }}`. + +You can also use the following tips to convert your code to Jinja: + +### For loops + +For loops in Jinja look like this: + +``` +{% for message in messages %} +{{ message['content'] }} +{% endfor %} +``` + +Note that whatever's inside the {{ expression block }} will be printed to the output. You can use operators like +`+` to combine strings inside expression blocks. + +### If statements + +If statements in Jinja look like this: + +``` +{% if message['role'] == 'user' %} +{{ message['content'] }} +{% endif %} +``` + +Note how where Python uses whitespace to mark the beginnings and ends of `for` and `if` blocks, Jinja requires you +to explicitly end them with `{% endfor %}` and `{% endif %}`. + +### Special variables + +Inside your template, you will have access to the list of `messages`, but you can also access several other special +variables. These include special tokens like `bos_token` and `eos_token`, as well as the `add_generation_prompt` +variable that we discussed above. You can also use the `loop` variable to access information about the current loop +iteration, for example using `{% if loop.last %}` to check if the current message is the last message in the +conversation. 
Here's an example that puts these ideas together to add a generation prompt at the end of the +conversation if add_generation_prompt is `True`: + +``` +{% if loop.last and add_generation_prompt %} +{{ bos_token + 'Assistant:\n' }} +{% endif %} +``` + +### Notes on whitespace + +As much as possible, we've tried to get Jinja to ignore whitespace outside of {{ expressions }}. However, be aware +that Jinja is a general-purpose templating engine, and it may treat whitespace between blocks on the same line +as significant and print it to the output. We **strongly** recommend checking that your template isn't printing extra +spaces where it shouldn't be before you upload it! \ No newline at end of file diff --git a/docs/source/en/community.mdx b/docs/source/en/community.md similarity index 92% rename from docs/source/en/community.mdx rename to docs/source/en/community.md index 808b16779dd94a..7890cb22ca5882 100644 --- a/docs/source/en/community.mdx +++ b/docs/source/en/community.md @@ -1,4 +1,8 @@ -# Community + + +# Community This page regroups resources around 🤗 Transformers developed by the community. @@ -6,19 +10,19 @@ This page regroups resources around 🤗 Transformers developed by the community | Resource | Description | Author | |:----------|:-------------|------:| -| [Hugging Face Transformers Glossary Flashcards](https://www.darigovresearch.com/huggingface-transformers-glossary-flashcards) | A set of flashcards based on the [Transformers Docs Glossary](glossary) that has been put into a form which can be easily learnt/revised using [Anki ](https://apps.ankiweb.net/) an open source, cross platform app specifically designed for long term knowledge retention. See this [Introductory video on how to use the flashcards](https://www.youtube.com/watch?v=Dji_h7PILrw). | [Darigov Research](https://www.darigovresearch.com/) | +| [Hugging Face Transformers Glossary Flashcards](https://www.darigovresearch.com/huggingface-transformers-glossary-flashcards) | A set of flashcards based on the [Transformers Docs Glossary](glossary) that has been put into a form which can be easily learned/revised using [Anki](https://apps.ankiweb.net/) an open source, cross platform app specifically designed for long term knowledge retention. See this [Introductory video on how to use the flashcards](https://www.youtube.com/watch?v=Dji_h7PILrw). | [Darigov Research](https://www.darigovresearch.com/) | ## Community notebooks: | Notebook | Description | Author | | |:----------|:-------------|:-------------|------:| | [Fine-tune a pre-trained Transformer to generate lyrics](https://github.com/AlekseyKorshuk/huggingartists) | How to generate lyrics in the style of your favorite artist by fine-tuning a GPT-2 model | [Aleksey Korshuk](https://github.com/AlekseyKorshuk) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AlekseyKorshuk/huggingartists/blob/master/huggingartists-demo.ipynb) | -| [Train T5 in Tensorflow 2 ](https://github.com/snapthat/TF-T5-text-to-text) | How to train T5 for any task using Tensorflow 2. 
This notebook demonstrates a Question & Answer task implemented in Tensorflow 2 using SQUAD | [Muhammad Harris](https://github.com/HarrisDePerceptron) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snapthat/TF-T5-text-to-text/blob/master/snapthatT5/notebooks/TF-T5-Datasets%20Training.ipynb) | +| [Train T5 in Tensorflow 2](https://github.com/snapthat/TF-T5-text-to-text) | How to train T5 for any task using Tensorflow 2. This notebook demonstrates a Question & Answer task implemented in Tensorflow 2 using SQUAD | [Muhammad Harris](https://github.com/HarrisDePerceptron) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snapthat/TF-T5-text-to-text/blob/master/snapthatT5/notebooks/TF-T5-Datasets%20Training.ipynb) | | [Train T5 on TPU](https://github.com/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb) | How to train T5 on SQUAD with Transformers and Nlp | [Suraj Patil](https://github.com/patil-suraj) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb#scrollTo=QLGiFCDqvuil) | | [Fine-tune T5 for Classification and Multiple Choice](https://github.com/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb) | How to fine-tune T5 for classification and multiple choice tasks using a text-to-text format with PyTorch Lightning | [Suraj Patil](https://github.com/patil-suraj) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb) | | [Fine-tune DialoGPT on New Datasets and Languages](https://github.com/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb) | How to fine-tune the DialoGPT model on a new dataset for open-dialog conversational chatbots | [Nathan Cooper](https://github.com/ncoop57) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb) | | [Long Sequence Modeling with Reformer](https://github.com/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb) | How to train on sequences as long as 500,000 tokens with Reformer | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb) | -| [Fine-tune BART for Summarization](https://github.com/ohmeow/ohmeow_website/blob/master/_notebooks/2020-05-23-text-generation-with-blurr.ipynb) | How to fine-tune BART for summarization with fastai using blurr | [Wayde Gilliam](https://ohmeow.com/) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ohmeow/ohmeow_website/blob/master/_notebooks/2020-05-23-text-generation-with-blurr.ipynb) | +| [Fine-tune BART for Summarization](https://github.com/ohmeow/ohmeow_website/blob/master/posts/2021-05-25-mbart-sequence-classification-with-blurr.ipynb) | How to fine-tune BART for summarization with fastai using blurr | [Wayde Gilliam](https://ohmeow.com/) | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ohmeow/ohmeow_website/blob/master/posts/2021-05-25-mbart-sequence-classification-with-blurr.ipynb) | | [Fine-tune a pre-trained Transformer on anyone's tweets](https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb) | How to generate tweets in the style of your favorite Twitter account by fine-tuning a GPT-2 model | [Boris Dayma](https://github.com/borisdayma) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb) | | [Optimize 🤗 Hugging Face models with Weights & Biases](https://colab.research.google.com/github/wandb/examples/blob/master/colabs/huggingface/Optimize_Hugging_Face_models_with_Weights_%26_Biases.ipynb) | A complete tutorial showcasing W&B integration with Hugging Face | [Boris Dayma](https://github.com/borisdayma) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/wandb/examples/blob/master/colabs/huggingface/Optimize_Hugging_Face_models_with_Weights_%26_Biases.ipynb) | | [Pretrain Longformer](https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb) | How to build a "long" version of existing pretrained models | [Iz Beltagy](https://beltagy.net) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb) | @@ -31,7 +35,7 @@ This page regroups resources around 🤗 Transformers developed by the community |[Speed up Fine-Tuning in Transformers with Dynamic Padding / Bucketing](https://github.com/ELS-RD/transformers-notebook/blob/master/Divide_Hugging_Face_Transformers_training_time_by_2_or_more.ipynb)|How to speed up fine-tuning by a factor of 2 using dynamic padding / bucketing|[Michael Benesty](https://github.com/pommedeterresautee) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1CBfRU1zbfu7-ijiOqAAQUA-RJaxfcJoO?usp=sharing)| |[Pretrain Reformer for Masked Language Modeling](https://github.com/patrickvonplaten/notebooks/blob/master/Reformer_For_Masked_LM.ipynb)| How to train a Reformer model with bi-directional self-attention layers | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1tzzh0i8PgDQGV3SMFUGxM7_gGae3K-uW?usp=sharing)| |[Expand and Fine Tune Sci-BERT](https://github.com/lordtt13/word-embeddings/blob/master/COVID-19%20Research%20Data/COVID-SciBERT.ipynb)| How to increase vocabulary of a pretrained SciBERT model from AllenAI on the CORD dataset and pipeline it. | [Tanmay Thakur](https://github.com/lordtt13) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1rqAR40goxbAfez1xvF3hBJphSCsvXmh8)| -|[Fine Tune BlenderBotSmall for Summarization using the Trainer API](https://github.com/lordtt13/transformers-experiments/blob/master/Custom%20Tasks/fine-tune-blenderbot_small-for-summarization.ipynb)| How to fine tune BlenderBotSmall for summarization on a custom dataset, using the Trainer API. 
| [Tanmay Thakur](https://github.com/lordtt13) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/19Wmupuls7mykSGyRN_Qo6lPQhgp56ymq?usp=sharing)| +|[Fine Tune BlenderBotSmall for Summarization using the Trainer API](https://github.com/lordtt13/transformers-experiments/blob/master/Custom%20Tasks/fine-tune-blenderbot_small-for-summarization.ipynb)| How to fine-tune BlenderBotSmall for summarization on a custom dataset, using the Trainer API. | [Tanmay Thakur](https://github.com/lordtt13) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/19Wmupuls7mykSGyRN_Qo6lPQhgp56ymq?usp=sharing)| |[Fine-tune Electra and interpret with Integrated Gradients](https://github.com/elsanns/xai-nlp-notebooks/blob/master/electra_fine_tune_interpret_captum_ig.ipynb) | How to fine-tune Electra for sentiment analysis and interpret predictions with Captum Integrated Gradients | [Eliza Szczechla](https://elsanns.github.io) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elsanns/xai-nlp-notebooks/blob/master/electra_fine_tune_interpret_captum_ig.ipynb)| |[fine-tune a non-English GPT-2 Model with Trainer class](https://github.com/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb) | How to fine-tune a non-English GPT-2 Model with Trainer class | [Philipp Schmid](https://www.philschmid.de) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb)| |[Fine-tune a DistilBERT Model for Multi Label Classification task](https://github.com/DhavalTaunk08/Transformers_scripts/blob/master/Transformers_multilabel_distilbert.ipynb) | How to fine-tune a DistilBERT Model for Multi Label Classification task | [Dhaval Taunk](https://github.com/DhavalTaunk08) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhavalTaunk08/Transformers_scripts/blob/master/Transformers_multilabel_distilbert.ipynb)| @@ -39,8 +43,8 @@ This page regroups resources around 🤗 Transformers developed by the community |[Fine-tune Roberta for sentiment analysis](https://github.com/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb) | How to fine-tune a Roberta model for sentiment analysis | [Dhaval Taunk](https://github.com/DhavalTaunk08) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb)| |[Evaluating Question Generation Models](https://github.com/flexudy-pipe/qugeev) | How accurate are the answers to questions generated by your seq2seq transformer model? 
| [Pascal Zoleko](https://github.com/zolekode) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bpsSqCQU-iw_5nNoRm_crPq6FRuJthq_?usp=sharing)| |[Classify text with DistilBERT and Tensorflow](https://github.com/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb) | How to fine-tune DistilBERT for text classification in TensorFlow | [Peter Bayerle](https://github.com/peterbayerle) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb)| -|[Leverage BERT for Encoder-Decoder Summarization on CNN/Dailymail](https://github.com/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb) | How to warm-start a *EncoderDecoderModel* with a *bert-base-uncased* checkpoint for summarization on CNN/Dailymail | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb)| -|[Leverage RoBERTa for Encoder-Decoder Summarization on BBC XSum](https://github.com/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb) | How to warm-start a shared *EncoderDecoderModel* with a *roberta-base* checkpoint for summarization on BBC/XSum | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb)| +|[Leverage BERT for Encoder-Decoder Summarization on CNN/Dailymail](https://github.com/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb) | How to warm-start a *EncoderDecoderModel* with a *google-bert/bert-base-uncased* checkpoint for summarization on CNN/Dailymail | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb)| +|[Leverage RoBERTa for Encoder-Decoder Summarization on BBC XSum](https://github.com/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb) | How to warm-start a shared *EncoderDecoderModel* with a *FacebookAI/roberta-base* checkpoint for summarization on BBC/XSum | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb)| |[Fine-tune TAPAS on Sequential Question Answering (SQA)](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) | How to fine-tune *TapasForQuestionAnswering* with a *tapas-base* checkpoint on the Sequential Question Answering (SQA) dataset | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb)| |[Evaluate TAPAS on Table Fact Checking (TabFact)](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Evaluating_TAPAS_on_the_Tabfact_test_set.ipynb) | How 
to evaluate a fine-tuned *TapasForSequenceClassification* with a *tapas-base-finetuned-tabfact* checkpoint using a combination of the 🤗 datasets and 🤗 transformers libraries | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Evaluating_TAPAS_on_the_Tabfact_test_set.ipynb)| |[Fine-tuning mBART for translation](https://colab.research.google.com/github/vasudevgupta7/huggingface-tutorials/blob/main/translation_training.ipynb) | How to fine-tune mBART using Seq2SeqTrainer for Hindi to English translation | [Vasudev Gupta](https://github.com/vasudevgupta7) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vasudevgupta7/huggingface-tutorials/blob/main/translation_training.ipynb)| diff --git a/docs/source/en/converting_tensorflow_models.mdx b/docs/source/en/converting_tensorflow_models.mdx deleted file mode 100644 index 8dc51dd61670d8..00000000000000 --- a/docs/source/en/converting_tensorflow_models.mdx +++ /dev/null @@ -1,162 +0,0 @@ - - -# Converting From Tensorflow Checkpoints - -A command-line interface is provided to convert original Bert/GPT/GPT-2/Transformer-XL/XLNet/XLM checkpoints to models -that can be loaded using the `from_pretrained` methods of the library. - - - -Since 2.3.0 the conversion script is now part of the transformers CLI (**transformers-cli**) available in any -transformers >= 2.3.0 installation. - -The documentation below reflects the **transformers-cli convert** command format. - - - -## BERT - -You can convert any TensorFlow checkpoint for BERT (in particular [the pre-trained models released by Google](https://github.com/google-research/bert#pre-trained-models)) in a PyTorch save file by using the -[convert_bert_original_tf_checkpoint_to_pytorch.py](https://github.com/huggingface/transformers/tree/main/src/transformers/models/bert/convert_bert_original_tf_checkpoint_to_pytorch.py) script. - -This CLI takes as input a TensorFlow checkpoint (three files starting with `bert_model.ckpt`) and the associated -configuration file (`bert_config.json`), and creates a PyTorch model for this configuration, loads the weights from -the TensorFlow checkpoint in the PyTorch model and saves the resulting model in a standard PyTorch save file that can -be imported using `from_pretrained()` (see example in [quicktour](quicktour) , [run_glue.py](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification/run_glue.py) ). - -You only need to run this conversion script **once** to get a PyTorch model. You can then disregard the TensorFlow -checkpoint (the three files starting with `bert_model.ckpt`) but be sure to keep the configuration file (\ -`bert_config.json`) and the vocabulary file (`vocab.txt`) as these are needed for the PyTorch model too. - -To run this specific conversion script you will need to have TensorFlow and PyTorch installed (`pip install tensorflow`). The rest of the repository only requires PyTorch. 
- -Here is an example of the conversion process for a pre-trained `BERT-Base Uncased` model: - -```bash -export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12 - -transformers-cli convert --model_type bert \ - --tf_checkpoint $BERT_BASE_DIR/bert_model.ckpt \ - --config $BERT_BASE_DIR/bert_config.json \ - --pytorch_dump_output $BERT_BASE_DIR/pytorch_model.bin -``` - -You can download Google's pre-trained models for the conversion [here](https://github.com/google-research/bert#pre-trained-models). - -## ALBERT - -Convert TensorFlow model checkpoints of ALBERT to PyTorch using the -[convert_albert_original_tf_checkpoint_to_pytorch.py](https://github.com/huggingface/transformers/tree/main/src/transformers/models/albert/convert_albert_original_tf_checkpoint_to_pytorch.py) script. - -The CLI takes as input a TensorFlow checkpoint (three files starting with `model.ckpt-best`) and the accompanying -configuration file (`albert_config.json`), then creates and saves a PyTorch model. To run this conversion you will -need to have TensorFlow and PyTorch installed. - -Here is an example of the conversion process for the pre-trained `ALBERT Base` model: - -```bash -export ALBERT_BASE_DIR=/path/to/albert/albert_base - -transformers-cli convert --model_type albert \ - --tf_checkpoint $ALBERT_BASE_DIR/model.ckpt-best \ - --config $ALBERT_BASE_DIR/albert_config.json \ - --pytorch_dump_output $ALBERT_BASE_DIR/pytorch_model.bin -``` - -You can download Google's pre-trained models for the conversion [here](https://github.com/google-research/albert#pre-trained-models). - -## OpenAI GPT - -Here is an example of the conversion process for a pre-trained OpenAI GPT model, assuming that your NumPy checkpoint -save as the same format than OpenAI pretrained model (see [here](https://github.com/openai/finetune-transformer-lm)\ -) - -```bash -export OPENAI_GPT_CHECKPOINT_FOLDER_PATH=/path/to/openai/pretrained/numpy/weights - -transformers-cli convert --model_type gpt \ - --tf_checkpoint $OPENAI_GPT_CHECKPOINT_FOLDER_PATH \ - --pytorch_dump_output $PYTORCH_DUMP_OUTPUT \ - [--config OPENAI_GPT_CONFIG] \ - [--finetuning_task_name OPENAI_GPT_FINETUNED_TASK] \ -``` - -## OpenAI GPT-2 - -Here is an example of the conversion process for a pre-trained OpenAI GPT-2 model (see [here](https://github.com/openai/gpt-2)) - -```bash -export OPENAI_GPT2_CHECKPOINT_PATH=/path/to/gpt2/pretrained/weights - -transformers-cli convert --model_type gpt2 \ - --tf_checkpoint $OPENAI_GPT2_CHECKPOINT_PATH \ - --pytorch_dump_output $PYTORCH_DUMP_OUTPUT \ - [--config OPENAI_GPT2_CONFIG] \ - [--finetuning_task_name OPENAI_GPT2_FINETUNED_TASK] -``` - -## Transformer-XL - -Here is an example of the conversion process for a pre-trained Transformer-XL model (see [here](https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models)) - -```bash -export TRANSFO_XL_CHECKPOINT_FOLDER_PATH=/path/to/transfo/xl/checkpoint - -transformers-cli convert --model_type transfo_xl \ - --tf_checkpoint $TRANSFO_XL_CHECKPOINT_FOLDER_PATH \ - --pytorch_dump_output $PYTORCH_DUMP_OUTPUT \ - [--config TRANSFO_XL_CONFIG] \ - [--finetuning_task_name TRANSFO_XL_FINETUNED_TASK] -``` - -## XLNet - -Here is an example of the conversion process for a pre-trained XLNet model: - -```bash -export TRANSFO_XL_CHECKPOINT_PATH=/path/to/xlnet/checkpoint -export TRANSFO_XL_CONFIG_PATH=/path/to/xlnet/config - -transformers-cli convert --model_type xlnet \ - --tf_checkpoint $TRANSFO_XL_CHECKPOINT_PATH \ - --config $TRANSFO_XL_CONFIG_PATH \ - 
--pytorch_dump_output $PYTORCH_DUMP_OUTPUT \ - [--finetuning_task_name XLNET_FINETUNED_TASK] \ -``` - -## XLM - -Here is an example of the conversion process for a pre-trained XLM model: - -```bash -export XLM_CHECKPOINT_PATH=/path/to/xlm/checkpoint - -transformers-cli convert --model_type xlm \ - --tf_checkpoint $XLM_CHECKPOINT_PATH \ - --pytorch_dump_output $PYTORCH_DUMP_OUTPUT - [--config XML_CONFIG] \ - [--finetuning_task_name XML_FINETUNED_TASK] -``` - -## T5 - -Here is an example of the conversion process for a pre-trained T5 model: - -```bash -export T5=/path/to/t5/uncased_L-12_H-768_A-12 - -transformers-cli convert --model_type t5 \ - --tf_checkpoint $T5/t5_model.ckpt \ - --config $T5/t5_config.json \ - --pytorch_dump_output $T5/pytorch_model.bin -``` diff --git a/docs/source/en/create_a_model.mdx b/docs/source/en/create_a_model.md similarity index 79% rename from docs/source/en/create_a_model.mdx rename to docs/source/en/create_a_model.md index 5c736f1d79435f..29f26c59984aa3 100644 --- a/docs/source/en/create_a_model.mdx +++ b/docs/source/en/create_a_model.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Create a custom architecture @@ -83,7 +87,7 @@ DistilBertConfig { Pretrained model attributes can be modified in the [`~PretrainedConfig.from_pretrained`] function: ```py ->>> my_config = DistilBertConfig.from_pretrained("distilbert-base-uncased", activation="relu", attention_dropout=0.4) +>>> my_config = DistilBertConfig.from_pretrained("distilbert/distilbert-base-uncased", activation="relu", attention_dropout=0.4) ``` Once you are satisfied with your model configuration, you can save it with [`~PretrainedConfig.save_pretrained`]. Your configuration file is stored as a JSON file in the specified save directory: @@ -106,7 +110,7 @@ You can also save your configuration file as a dictionary or even just the diffe ## Model -The next step is to create a [model](main_classes/models). The model - also loosely referred to as the architecture - defines what each layer is doing and what operations are happening. Attributes like `num_hidden_layers` from the configuration are used to define the architecture. Every model shares the base class [`PreTrainedModel`] and a few common methods like resizing input embeddings and pruning self-attention heads. In addition, all models are also either a [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html), [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) or [`flax.linen.Module`](https://flax.readthedocs.io/en/latest/flax.linen.html#module) subclass. This means models are compatible with each of their respective framework's usage. +The next step is to create a [model](main_classes/models). The model - also loosely referred to as the architecture - defines what each layer is doing and what operations are happening. Attributes like `num_hidden_layers` from the configuration are used to define the architecture. 
Every model shares the base class [`PreTrainedModel`] and a few common methods like resizing input embeddings and pruning self-attention heads. In addition, all models are also either a [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html), [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) or [`flax.linen.Module`](https://flax.readthedocs.io/en/latest/api_reference/flax.linen/module.html) subclass. This means models are compatible with each of their respective framework's usage. @@ -124,13 +128,13 @@ This creates a model with random values instead of pretrained weights. You won't Create a pretrained model with [`~PreTrainedModel.from_pretrained`]: ```py ->>> model = DistilBertModel.from_pretrained("distilbert-base-uncased") +>>> model = DistilBertModel.from_pretrained("distilbert/distilbert-base-uncased") ``` When you load pretrained weights, the default model configuration is automatically loaded if the model is provided by 🤗 Transformers. However, you can still replace - some or all of - the default model configuration attributes with your own if you'd like: ```py ->>> model = DistilBertModel.from_pretrained("distilbert-base-uncased", config=my_config) +>>> model = DistilBertModel.from_pretrained("distilbert/distilbert-base-uncased", config=my_config) ``` @@ -148,13 +152,13 @@ This creates a model with random values instead of pretrained weights. You won't Create a pretrained model with [`~TFPreTrainedModel.from_pretrained`]: ```py ->>> tf_model = TFDistilBertModel.from_pretrained("distilbert-base-uncased") +>>> tf_model = TFDistilBertModel.from_pretrained("distilbert/distilbert-base-uncased") ``` When you load pretrained weights, the default model configuration is automatically loaded if the model is provided by 🤗 Transformers. However, you can still replace - some or all of - the default model configuration attributes with your own if you'd like: ```py ->>> tf_model = TFDistilBertModel.from_pretrained("distilbert-base-uncased", config=my_config) +>>> tf_model = TFDistilBertModel.from_pretrained("distilbert/distilbert-base-uncased", config=my_config) ``` @@ -170,7 +174,7 @@ For example, [`DistilBertForSequenceClassification`] is a base DistilBERT model ```py >>> from transformers import DistilBertForSequenceClassification ->>> model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased") +>>> model = DistilBertForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` Easily reuse this checkpoint for another task by switching to a different model head. For a question answering task, you would use the [`DistilBertForQuestionAnswering`] model head. The question answering head is similar to the sequence classification head except it is a linear layer on top of the hidden states output. 
@@ -178,7 +182,7 @@ Easily reuse this checkpoint for another task by switching to a different model ```py >>> from transformers import DistilBertForQuestionAnswering ->>> model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased") +>>> model = DistilBertForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased") ``` @@ -187,7 +191,7 @@ For example, [`TFDistilBertForSequenceClassification`] is a base DistilBERT mode ```py >>> from transformers import TFDistilBertForSequenceClassification ->>> tf_model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased") +>>> tf_model = TFDistilBertForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` Easily reuse this checkpoint for another task by switching to a different model head. For a question answering task, you would use the [`TFDistilBertForQuestionAnswering`] model head. The question answering head is similar to the sequence classification head except it is a linear layer on top of the hidden states output. @@ -195,7 +199,7 @@ Easily reuse this checkpoint for another task by switching to a different model ```py >>> from transformers import TFDistilBertForQuestionAnswering ->>> tf_model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased") +>>> tf_model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased") ``` @@ -205,7 +209,7 @@ Easily reuse this checkpoint for another task by switching to a different model The last base class you need before using a model for textual data is a [tokenizer](main_classes/tokenizer) to convert raw text to tensors. There are two types of tokenizers you can use with 🤗 Transformers: - [`PreTrainedTokenizer`]: a Python implementation of a tokenizer. -- [`PreTrainedTokenizerFast`]: a tokenizer from our Rust-based [🤗 Tokenizer](https://huggingface.co/docs/tokenizers/python/latest/) library. This tokenizer type is significantly faster - especially during batch tokenization - due to it's Rust implementation. The fast tokenizer also offers additional methods like *offset mapping* which maps tokens to their original words or characters. +- [`PreTrainedTokenizerFast`]: a tokenizer from our Rust-based [🤗 Tokenizer](https://huggingface.co/docs/tokenizers/python/latest/) library. This tokenizer type is significantly faster - especially during batch tokenization - due to its Rust implementation. The fast tokenizer also offers additional methods like *offset mapping* which maps tokens to their original words or characters. Both tokenizers support common methods such as encoding and decoding, adding new tokens, and managing special tokens. @@ -228,7 +232,7 @@ It is important to remember the vocabulary from a custom tokenizer will be diffe ```py >>> from transformers import DistilBertTokenizer ->>> slow_tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased") +>>> slow_tokenizer = DistilBertTokenizer.from_pretrained("distilbert/distilbert-base-uncased") ``` Create a fast tokenizer with the [`DistilBertTokenizerFast`] class: @@ -236,7 +240,7 @@ Create a fast tokenizer with the [`DistilBertTokenizerFast`] class: ```py >>> from transformers import DistilBertTokenizerFast ->>> fast_tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased") +>>> fast_tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert/distilbert-base-uncased") ``` @@ -245,7 +249,7 @@ By default, [`AutoTokenizer`] will try to load a fast tokenizer. 
You can disable -## Image Processor +## Image processor An image processor processes vision inputs. It inherits from the base [`~image_processing_utils.ImageProcessingMixin`] class. @@ -259,7 +263,7 @@ To use, create an image processor associated with the model you're using. For ex ViTImageProcessor { "do_normalize": true, "do_resize": true, - "feature_extractor_type": "ViTImageProcessor", + "image_processor_type": "ViTImageProcessor", "image_mean": [ 0.5, 0.5, @@ -291,7 +295,7 @@ Modify any of the [`ViTImageProcessor`] parameters to create your custom image p ViTImageProcessor { "do_normalize": false, "do_resize": true, - "feature_extractor_type": "ViTImageProcessor", + "image_processor_type": "ViTImageProcessor", "image_mean": [ 0.3, 0.3, @@ -307,7 +311,73 @@ ViTImageProcessor { } ``` -## Feature Extractor +## Backbone + +
+ +Computer vision models consist of a backbone, neck, and head. The backbone extracts features from an input image, the neck combines and enhances the extracted features, and the head is used for the main task (e.g., object detection). Start by initializing a backbone in the model config and specify whether you want to load pretrained weights or load randomly initialized weights. Then you can pass the model config to the model head. + +For example, to load a [ResNet](../model_doc/resnet) backbone into a [MaskFormer](../model_doc/maskformer) model with an instance segmentation head: + + + + +Set `use_pretrained_backbone=True` to load pretrained ResNet weights for the backbone. + +```py +from transformers import MaskFormerConfig, MaskFormerForInstanceSegmentation, ResNetConfig + +config = MaskFormerConfig(backbone="microsoft/resnet50", use_pretrained_backbone=True) # backbone and neck config +model = MaskFormerForInstanceSegmentation(config) # head +``` + +You could also load the backbone config separately and then pass it to the model config. + +```py +from transformers import MaskFormerConfig, MaskFormerForInstanceSegmentation, ResNetConfig + +backbone_config = ResNetConfig.from_pretrained("microsoft/resnet-50") +config = MaskFormerConfig(backbone_config=backbone_config) +model = MaskFormerForInstanceSegmentation(config) +``` + + + + +Set `use_pretrained_backbone=False` to randomly initialize a ResNet backbone. + +```py +from transformers import MaskFormerConfig, MaskFormerForInstanceSegmentation, ResNetConfig + +config = MaskFormerConfig(backbone="microsoft/resnet50", use_pretrained_backbone=False) # backbone and neck config +model = MaskFormerForInstanceSegmentation(config) # head +``` + +You could also load the backbone config separately and then pass it to the model config. + +```py +from transformers import MaskFormerConfig, MaskFormerForInstanceSegmentation, ResNetConfig + +backbone_config = ResNetConfig() +config = MaskFormerConfig(backbone_config=backbone_config) +model = MaskFormerForInstanceSegmentation(config) +``` + + + + +[timm](https://hf.co/docs/timm/index) models are loaded with [`TimmBackbone`] and [`TimmBackboneConfig`]. + +```python +from transformers import TimmBackboneConfig, TimmBackbone + +backbone_config = TimmBackboneConfig("resnet50") +model = TimmBackbone(config=backbone_config) +``` + +## Feature extractor A feature extractor processes audio inputs. It inherits from the base [`~feature_extraction_utils.FeatureExtractionMixin`] class, and may also inherit from the [`SequenceFeatureExtractor`] class for processing audio inputs. @@ -353,7 +423,6 @@ Wav2Vec2FeatureExtractor { } ``` - ## Processor For models that support multimodal tasks, 🤗 Transformers offers a processor class that conveniently wraps processing classes such as a feature extractor and a tokenizer into a single object. For example, let's use the [`Wav2Vec2Processor`] for an automatic speech recognition task (ASR). ASR transcribes audio to text, so you will need a feature extractor and a tokenizer. 
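+
+As a rough sketch, wiring those two pieces into a single processor could look like this (the vocabulary file
+name below is a placeholder for whatever CTC vocabulary your tokenizer uses):
+
+```py
+>>> from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2CTCTokenizer, Wav2Vec2Processor
+
+>>> feature_extractor = Wav2Vec2FeatureExtractor(sampling_rate=16000)
+>>> tokenizer = Wav2Vec2CTCTokenizer(vocab_file="my_vocab_file.json")  # placeholder vocabulary file
+>>> processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
+```
+
+The resulting `processor` handles both audio inputs and text through one object, and `processor.save_pretrained`
+saves the feature extractor and tokenizer together.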
diff --git a/docs/source/en/custom_models.mdx b/docs/source/en/custom_models.md similarity index 91% rename from docs/source/en/custom_models.mdx rename to docs/source/en/custom_models.md index f5ad5585624336..3d43446a0cc1b2 100644 --- a/docs/source/en/custom_models.mdx +++ b/docs/source/en/custom_models.md @@ -8,9 +8,13 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> -# Sharing custom models +# Building custom models The 🤗 Transformers library is designed to be easily extensible. Every model is fully coded in a given subfolder of the repository with no abstraction, so you can easily copy a modeling file and tweak it to your needs. @@ -18,7 +22,8 @@ of the repository with no abstraction, so you can easily copy a modeling file an If you are writing a brand new model, it might be easier to start from scratch. In this tutorial, we will show you how to write a custom model and its configuration so it can be used inside Transformers, and how you can share it with the community (with the code it relies on) so that anyone can use it, even if it's not present in the 🤗 -Transformers library. +Transformers library. We'll see how to build upon transformers and extend the framework with your hooks and +custom code. We will illustrate all of this on a ResNet model, by wrapping the ResNet class of the [timm library](https://github.com/rwightman/pytorch-image-models) into a [`PreTrainedModel`]. @@ -29,6 +34,16 @@ Before we dive into the model, let's first write its configuration. The configur will contain all the necessary information to build the model. As we will see in the next section, the model can only take a `config` to be initialized, so we really need that object to be as complete as possible. + + +Models in the `transformers` library itself generally follow the convention that they accept a `config` object +in their `__init__` method, and then pass the whole `config` to sub-layers in the model, rather than breaking the +config object into multiple arguments that are all passed individually to sub-layers. Writing your model in this +style results in simpler code with a clear "source of truth" for any hyperparameters, and also makes it easier +to reuse code from other models in `transformers`. + + + In our example, we will take a couple of arguments of the ResNet class that we might want to tweak. Different configurations will then give us the different types of ResNets that are possible. We then just store those arguments, after checking the validity of a few of them. @@ -214,6 +229,27 @@ resnet50d.model.load_state_dict(pretrained_model.state_dict()) Now let's see how to make sure that when we do [`~PreTrainedModel.save_pretrained`] or [`~PreTrainedModel.push_to_hub`], the code of the model is saved. +## Registering a model with custom code to the auto classes + +If you are writing a library that extends 🤗 Transformers, you may want to extend the auto classes to include your own +model. 
This is different from pushing the code to the Hub in the sense that users will need to import your library to +get the custom models (contrarily to automatically downloading the model code from the Hub). + +As long as your config has a `model_type` attribute that is different from existing model types, and that your model +classes have the right `config_class` attributes, you can just add them to the auto classes like this: + +```py +from transformers import AutoConfig, AutoModel, AutoModelForImageClassification + +AutoConfig.register("resnet", ResnetConfig) +AutoModel.register(ResnetConfig, ResnetModel) +AutoModelForImageClassification.register(ResnetConfig, ResnetModelForImageClassification) +``` + +Note that the first argument used when registering your custom config to [`AutoConfig`] needs to match the `model_type` +of your custom config, and the first argument used when registering your custom models to any auto model class needs +to match the `config_class` of those models. + ## Sending the code to the Hub @@ -268,6 +304,22 @@ Note that there is no need to specify an auto class for the configuration (there [`AutoConfig`]) but it's different for models. Your custom model could be suitable for many different tasks, so you have to specify which one of the auto classes is the correct one for your model. + + +Use `register_for_auto_class()` if you want the code files to be copied. If you instead prefer to use code on the Hub from another repo, +you don't need to call it. In cases where there's more than one auto class, you can modify the `config.json` directly using the +following structure: + +```json +"auto_map": { + "AutoConfig": "--", + "AutoModel": "--", + "AutoModelFor": "--", +}, +``` + + + Next, let's create the config and models as we did before: ```py @@ -330,23 +382,3 @@ model = AutoModelForImageClassification.from_pretrained( Note that when browsing the commit history of the model repo on the Hub, there is a button to easily copy the commit hash of any commit. -## Registering a model with custom code to the auto classes - -If you are writing a library that extends 🤗 Transformers, you may want to extend the auto classes to include your own -model. This is different from pushing the code to the Hub in the sense that users will need to import your library to -get the custom models (contrarily to automatically downloading the model code from the Hub). - -As long as your config has a `model_type` attribute that is different from existing model types, and that your model -classes have the right `config_class` attributes, you can just add them to the auto classes likes this: - -```py -from transformers import AutoConfig, AutoModel, AutoModelForImageClassification - -AutoConfig.register("resnet", ResnetConfig) -AutoModel.register(ResnetConfig, ResnetModel) -AutoModelForImageClassification.register(ResnetConfig, ResnetModelForImageClassification) -``` - -Note that the first argument used when registering your custom config to [`AutoConfig`] needs to match the `model_type` -of your custom config, and the first argument used when registering your custom models to any auto model class needs -to match the `config_class` of those models. 
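+
+As a quick sanity check of what registration enables, here is a sketch that repeats the `register` calls so it
+is self-contained (the `resnet_model` import path is a placeholder for wherever your library exposes the
+classes written earlier in this guide):
+
+```py
+from transformers import AutoConfig, AutoModel, AutoModelForImageClassification
+
+# Placeholder import path for your own package.
+from resnet_model import ResnetConfig, ResnetModel, ResnetModelForImageClassification
+
+AutoConfig.register("resnet", ResnetConfig)
+AutoModel.register(ResnetConfig, ResnetModel)
+AutoModelForImageClassification.register(ResnetConfig, ResnetModelForImageClassification)
+
+# The auto classes now resolve the custom model type like any built-in one.
+config = AutoConfig.for_model("resnet")  # builds a ResnetConfig with its default arguments
+model = AutoModelForImageClassification.from_config(config)  # builds a ResnetModelForImageClassification
+```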
diff --git a/docs/source/en/custom_tools.md b/docs/source/en/custom_tools.md new file mode 100644 index 00000000000000..9b7d1dcab67e6c --- /dev/null +++ b/docs/source/en/custom_tools.md @@ -0,0 +1,789 @@ + + +# Custom Tools and Prompts + + + +If you are not aware of what tools and agents are in the context of transformers, we recommend you read the +[Transformers Agents](transformers_agents) page first. + + + + + +Transformers Agents is an experimental API that is subject to change at any time. Results returned by the agents +can vary as the APIs or underlying models are prone to change. + + + +Creating and using custom tools and prompts is paramount to empowering the agent and having it perform new tasks. +In this guide we'll take a look at: + +- How to customize the prompt +- How to use custom tools +- How to create custom tools + +## Customizing the prompt + +As explained in [Transformers Agents](transformers_agents) agents can run in [`~Agent.run`] and [`~Agent.chat`] mode. +Both the `run` and `chat` modes underlie the same logic. The language model powering the agent is conditioned on a long +prompt and completes the prompt by generating the next tokens until the stop token is reached. +The only difference between the two modes is that during the `chat` mode the prompt is extended with +previous user inputs and model generations. This allows the agent to have access to past interactions, +seemingly giving the agent some kind of memory. + +### Structure of the prompt + +Let's take a closer look at how the prompt is structured to understand how it can be best customized. +The prompt is structured broadly into four parts. + +- 1. Introduction: how the agent should behave, explanation of the concept of tools. +- 2. Description of all the tools. This is defined by a `<>` token that is dynamically replaced at runtime with the tools defined/chosen by the user. +- 3. A set of examples of tasks and their solution +- 4. Current example, and request for solution. + +To better understand each part, let's look at a shortened version of how the `run` prompt can look like: + +````text +I will ask you to perform a task, your job is to come up with a series of simple commands in Python that will perform the task. +[...] +You can print intermediate results if it makes sense to do so. + +Tools: +- document_qa: This is a tool that answers a question about a document (pdf). It takes an input named `document` which should be the document containing the information, as well as a `question` that is the question about the document. It returns a text that contains the answer to the question. +- image_captioner: This is a tool that generates a description of an image. It takes an input named `image` which should be the image to the caption and returns a text that contains the description in English. +[...] + +Task: "Answer the question in the variable `question` about the image stored in the variable `image`. The question is in French." + +I will use the following tools: `translator` to translate the question into English and then `image_qa` to answer the question on the input image. + +Answer: +```py +translated_question = translator(question=question, src_lang="French", tgt_lang="English") +print(f"The translated question is {translated_question}.") +answer = image_qa(image=image, question=translated_question) +print(f"The answer is {answer}") +``` + +Task: "Identify the oldest person in the `document` and create an image showcasing the result as a banner." 
+ +I will use the following tools: `document_qa` to find the oldest person in the document, then `image_generator` to generate an image according to the answer. + +Answer: +```py +answer = document_qa(document, question="What is the oldest person?") +print(f"The answer is {answer}.") +image = image_generator("A banner showing " + answer) +``` + +[...] + +Task: "Draw me a picture of rivers and lakes" + +I will use the following +```` + +The introduction (the text before *"Tools:"*) explains precisely how the model shall behave and what it should do. +This part most likely does not need to be customized as the agent shall always behave the same way. + +The second part (the bullet points below *"Tools"*) is dynamically added upon calling `run` or `chat`. There are +exactly as many bullet points as there are tools in `agent.toolbox` and each bullet point consists of the name +and description of the tool: + +```text +- : +``` + +Let's verify this quickly by loading the document_qa tool and printing out the name and description. + +```py +from transformers import load_tool + +document_qa = load_tool("document-question-answering") +print(f"- {document_qa.name}: {document_qa.description}") +``` + +which gives: +```text +- document_qa: This is a tool that answers a question about a document (pdf). It takes an input named `document` which should be the document containing the information, as well as a `question` that is the question about the document. It returns a text that contains the answer to the question. +``` + +We can see that the tool name is short and precise. The description includes two parts, the first explaining +what the tool does and the second states what input arguments and return values are expected. + +A good tool name and tool description are very important for the agent to correctly use it. Note that the only +information the agent has about the tool is its name and description, so one should make sure that both +are precisely written and match the style of the existing tools in the toolbox. In particular make sure the description +mentions all the arguments expected by name in code-style, along with the expected type and a description of what they +are. + + + +Check the naming and description of the curated Transformers tools to better understand what name and +description a tool is expected to have. You can see all tools with the [`Agent.toolbox`] property. + + + +The third part includes a set of curated examples that show the agent exactly what code it should produce +for what kind of user request. The large language models empowering the agent are extremely good at +recognizing patterns in a prompt and repeating the pattern with new data. Therefore, it is very important +that the examples are written in a way that maximizes the likelihood of the agent to generating correct, +executable code in practice. + +Let's have a look at one example: + +````text +Task: "Identify the oldest person in the `document` and create an image showcasing the result as a banner." + +I will use the following tools: `document_qa` to find the oldest person in the document, then `image_generator` to generate an image according to the answer. + +Answer: +```py +answer = document_qa(document, question="What is the oldest person?") +print(f"The answer is {answer}.") +image = image_generator("A banner showing " + answer) +``` + +```` + +The pattern the model is prompted to repeat has three parts: The task statement, the agent's explanation of +what it intends to do, and finally the generated code. 
Every example that is part of the prompt has this exact +pattern, thus making sure that the agent will reproduce exactly the same pattern when generating new tokens. + +The prompt examples are curated by the Transformers team and rigorously evaluated on a set of +[problem statements](https://github.com/huggingface/transformers/blob/main/src/transformers/tools/evaluate_agent.py) +to ensure that the agent's prompt is as good as possible to solve real use cases of the agent. + +The final part of the prompt corresponds to: +```text +Task: "Draw me a picture of rivers and lakes" + +I will use the following +``` + +is a final and unfinished example that the agent is tasked to complete. The unfinished example +is dynamically created based on the actual user input. For the above example, the user ran: + +```py +agent.run("Draw me a picture of rivers and lakes") +``` + +The user input - *a.k.a* the task: *"Draw me a picture of rivers and lakes"* is cast into the +prompt template: "Task: \n\n I will use the following". This sentence makes up the final lines of the +prompt the agent is conditioned on, therefore strongly influencing the agent to finish the example +exactly in the same way it was previously done in the examples. + +Without going into too much detail, the chat template has the same prompt structure with the +examples having a slightly different style, *e.g.*: + +````text +[...] + +===== + +Human: Answer the question in the variable `question` about the image stored in the variable `image`. + +Assistant: I will use the tool `image_qa` to answer the question on the input image. + +```py +answer = image_qa(text=question, image=image) +print(f"The answer is {answer}") +``` + +Human: I tried this code, it worked but didn't give me a good result. The question is in French + +Assistant: In this case, the question needs to be translated first. I will use the tool `translator` to do this. + +```py +translated_question = translator(question=question, src_lang="French", tgt_lang="English") +print(f"The translated question is {translated_question}.") +answer = image_qa(text=translated_question, image=image) +print(f"The answer is {answer}") +``` + +===== + +[...] +```` + +Contrary, to the examples of the `run` prompt, each `chat` prompt example has one or more exchanges between the +*Human* and the *Assistant*. Every exchange is structured similarly to the example of the `run` prompt. +The user's input is appended to behind *Human:* and the agent is prompted to first generate what needs to be done +before generating code. An exchange can be based on previous exchanges, therefore allowing the user to refer +to past exchanges as is done *e.g.* above by the user's input of "I tried **this** code" refers to the +previously generated code of the agent. + +Upon running `.chat`, the user's input or *task* is cast into an unfinished example of the form: +```text +Human: \n\nAssistant: +``` +which the agent completes. Contrary to the `run` command, the `chat` command then appends the completed example +to the prompt, thus giving the agent more context for the next `chat` turn. + +Great now that we know how the prompt is structured, let's see how we can customize it! + +### Writing good user inputs + +While large language models are getting better and better at understanding users' intentions, it helps +enormously to be as precise as possible to help the agent pick the correct task. What does it mean to be +as precise as possible? + +The agent sees a list of tool names and their description in its prompt. 
The more tools are added the +more difficult it becomes for the agent to choose the correct tool and it's even more difficult to choose +the correct sequences of tools to run. Let's look at a common failure case, here we will only return +the code to analyze it. + +```py +from transformers import HfAgent + +agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder") + +agent.run("Show me a tree", return_code=True) +``` + +gives: + +```text +==Explanation from the agent== +I will use the following tool: `image_segmenter` to create a segmentation mask for the image. + + +==Code generated by the agent== +mask = image_segmenter(image, prompt="tree") +``` + +which is probably not what we wanted. Instead, it is more likely that we want an image of a tree to be generated. +To steer the agent more towards using a specific tool it can therefore be very helpful to use important keywords that +are present in the tool's name and description. Let's have a look. +```py +agent.toolbox["image_generator"].description +``` + +```text +'This is a tool that creates an image according to a prompt, which is a text description. It takes an input named `prompt` which contains the image description and outputs an image. +``` + +The name and description make use of the keywords "image", "prompt", "create" and "generate". Using these words will most likely work better here. Let's refine our prompt a bit. + +```py +agent.run("Create an image of a tree", return_code=True) +``` + +gives: +```text +==Explanation from the agent== +I will use the following tool `image_generator` to generate an image of a tree. + + +==Code generated by the agent== +image = image_generator(prompt="tree") +``` + +Much better! That looks more like what we want. In short, when you notice that the agent struggles to +correctly map your task to the correct tools, try looking up the most pertinent keywords of the tool's name +and description and try refining your task request with it. + +### Customizing the tool descriptions + +As we've seen before the agent has access to each of the tools' names and descriptions. The base tools +should have very precise names and descriptions, however, you might find that it could help to change the +the description or name of a tool for your specific use case. This might become especially important +when you've added multiple tools that are very similar or if you want to use your agent only for a certain +domain, *e.g.* image generation and transformations. + +A common problem is that the agent confuses image generation with image transformation/modification when +used a lot for image generation tasks, *e.g.* +```py +agent.run("Make an image of a house and a car", return_code=True) +``` +returns +```text +==Explanation from the agent== +I will use the following tools `image_generator` to generate an image of a house and `image_transformer` to transform the image of a car into the image of a house. + +==Code generated by the agent== +house_image = image_generator(prompt="A house") +car_image = image_generator(prompt="A car") +house_car_image = image_transformer(image=car_image, prompt="A house") +``` + +which is probably not exactly what we want here. It seems like the agent has a difficult time +to understand the difference between `image_generator` and `image_transformer` and often uses the two together. + +We can help the agent here by changing the tool name and description of `image_transformer`. 
Let's instead call it `modifier` +to disassociate it a bit from "image" and "prompt": +```py +agent.toolbox["modifier"] = agent.toolbox.pop("image_transformer") +agent.toolbox["modifier"].description = agent.toolbox["modifier"].description.replace( + "transforms an image according to a prompt", "modifies an image" +) +``` + +Now "modify" is a strong cue to use the new image processor which should help with the above prompt. Let's run it again. + +```py +agent.run("Make an image of a house and a car", return_code=True) +``` + +Now we're getting: +```text +==Explanation from the agent== +I will use the following tools: `image_generator` to generate an image of a house, then `image_generator` to generate an image of a car. + + +==Code generated by the agent== +house_image = image_generator(prompt="A house") +car_image = image_generator(prompt="A car") +``` + +which is definitely closer to what we had in mind! However, we want to have both the house and car in the same image. Steering the task more toward single image generation should help: + +```py +agent.run("Create image: 'A house and car'", return_code=True) +``` + +```text +==Explanation from the agent== +I will use the following tool: `image_generator` to generate an image. + + +==Code generated by the agent== +image = image_generator(prompt="A house and car") +``` + + + +Agents are still brittle for many use cases, especially when it comes to +slightly more complex use cases like generating an image of multiple objects. +Both the agent itself and the underlying prompt will be further improved in the coming +months making sure that agents become more robust to a variety of user inputs. + + + +### Customizing the whole prompt + +To give the user maximum flexibility, the whole prompt template as explained in [above](#structure-of-the-prompt) +can be overwritten by the user. In this case make sure that your custom prompt includes an introduction section, +a tool section, an example section, and an unfinished example section. If you want to overwrite the `run` prompt template, +you can do as follows: + +```py +template = """ [...] """ + +agent = HfAgent(your_endpoint, run_prompt_template=template) +``` + + + +Please make sure to have the `<>` string and the `<>` defined somewhere in the `template` so that the agent can be aware +of the tools, it has available to it as well as correctly insert the user's prompt. + + + +Similarly, one can overwrite the `chat` prompt template. Note that the `chat` mode always uses the following format for the exchanges: +```text +Human: <> + +Assistant: +``` + +Therefore it is important that the examples of the custom `chat` prompt template also make use of this format. +You can overwrite the `chat` template at instantiation as follows. + +```python +template = """ [...] """ + +agent = HfAgent(url_endpoint=your_endpoint, chat_prompt_template=template) +``` + + + +Please make sure to have the `<>` string defined somewhere in the `template` so that the agent can be aware +of the tools, it has available to it. + + + +In both cases, you can pass a repo ID instead of the prompt template if you would like to use a template hosted by someone in the community. The default prompts live in [this repo](https://huggingface.co/datasets/huggingface-tools/default-prompts) as an example. 
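+
+As a rough sketch, reusing the community-hosted defaults could look like this (the repo ID below is the
+default-prompts dataset mentioned above, purely as an illustration - any dataset repo with the file layout
+described next works the same way):
+
+```py
+from transformers import HfAgent
+
+# A Hub dataset repo ID can be passed in place of a raw template string.
+agent = HfAgent(
+    "https://api-inference.huggingface.co/models/bigcode/starcoder",
+    run_prompt_template="huggingface-tools/default-prompts",
+    chat_prompt_template="huggingface-tools/default-prompts",
+)
+```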
+ +To upload your custom prompt to a repo on the Hub and share it with the community, just make sure: +- to use a dataset repository +- to put the prompt template for the `run` command in a file named `run_prompt_template.txt` +- to put the prompt template for the `chat` command in a file named `chat_prompt_template.txt` + +## Using custom tools + +In this section, we'll be leveraging two existing custom tools that are specific to image generation: + +- We replace [huggingface-tools/image-transformation](https://huggingface.co/spaces/huggingface-tools/image-transformation) + with [diffusers/controlnet-canny-tool](https://huggingface.co/spaces/diffusers/controlnet-canny-tool) + to allow for more image modifications. +- We add a new tool for image upscaling to the default toolbox: + [diffusers/latent-upscaler-tool](https://huggingface.co/spaces/diffusers/latent-upscaler-tool). + +We'll start by loading the custom tools with the convenient [`load_tool`] function: + +```py +from transformers import load_tool + +controlnet_transformer = load_tool("diffusers/controlnet-canny-tool") +upscaler = load_tool("diffusers/latent-upscaler-tool") +``` + +Upon adding custom tools to an agent, the tools' descriptions and names are automatically +included in the agent's prompts. Thus, it is imperative that custom tools have +a well-written description and name in order for the agent to understand how to use them. +Let's take a look at the description and name of `controlnet_transformer`: + +```py +print(f"Description: '{controlnet_transformer.description}'") +print(f"Name: '{controlnet_transformer.name}'") +``` + +gives +```text +Description: 'This is a tool that transforms an image with ControlNet according to a prompt. +It takes two inputs: `image`, which should be the image to transform, and `prompt`, which should be the prompt to use to change it. It returns the modified image.' +Name: 'image_transformer' +``` + +The name and description are accurate and fit the style of the [curated set of tools](./transformers_agents#a-curated-set-of-tools). +Next, let's instantiate an agent with `controlnet_transformer` and `upscaler`: + +```py +tools = [controlnet_transformer, upscaler] +agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder", additional_tools=tools) +``` + +This command should give you the following info: + +```text +image_transformer has been replaced by as provided in `additional_tools` +``` + +The set of curated tools already has an `image_transformer` tool, which is hereby replaced with our custom tool. + + + +Overwriting existing tools can be beneficial if we want to use a custom tool for exactly the same task as an existing tool +because the agent is already well-versed in that specific task. Beware that the custom tool should follow the exact same API +as the overwritten tool in this case, or you should adapt the prompt template to make sure all examples using that +tool are updated. + + + +The upscaler tool was given the name `image_upscaler`, which is not yet present in the default toolbox and is therefore simply added to the list of tools.
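Before moving on, a quick check that is not shown in the original output: you can inspect the upscaler in the same way as `controlnet_transformer` to confirm that its name does not collide with any default tool, which is why it gets added instead of replacing one.

```py
# Same inspection pattern as above, applied to the second custom tool.
print(f"Description: '{upscaler.description}'")
print(f"Name: '{upscaler.name}'")
```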
+ +You can always have a look at the toolbox that is currently available to the agent via the `agent.toolbox` attribute: + +```py +print("\n".join([f"- {a}" for a in agent.toolbox.keys()])) +``` + +```text +- document_qa +- image_captioner +- image_qa +- image_segmenter +- transcriber +- summarizer +- text_classifier +- text_qa +- text_reader +- translator +- image_transformer +- text_downloader +- image_generator +- video_generator +- image_upscaler +``` + +Note how `image_upscaler` is now part of the agent's toolbox. + +Let's now try out the new tools! We will re-use the image we generated in [Transformers Agents Quickstart](./transformers_agents#single-execution-run). + +```py +from diffusers.utils import load_image + +image = load_image( + "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rivers_and_lakes.png" +) +``` + + + +Let's transform the image into a beautiful winter landscape: + +```py +image = agent.run("Transform the image: 'A frozen lake and snowy forest'", image=image) +``` + +```text +==Explanation from the agent== +I will use the following tool: `image_transformer` to transform the image. + + +==Code generated by the agent== +image = image_transformer(image, prompt="A frozen lake and snowy forest") +``` + + + +The new image processing tool is based on ControlNet, which can make very strong modifications to the image. +By default, the image processing tool returns an image of size 512x512 pixels. Let's see if we can upscale it. + +```py +image = agent.run("Upscale the image", image=image) +``` + +```text +==Explanation from the agent== +I will use the following tool: `image_upscaler` to upscale the image. + + +==Code generated by the agent== +upscaled_image = image_upscaler(image) +``` + + + +The agent automatically mapped our prompt "Upscale the image" to the just-added upscaler tool purely based on its description and name +and was able to run it correctly. + +Next, let's have a look at how you can create a new custom tool. + +### Adding new tools + +In this section, we show how to create a new tool that can be added to the agent. + +#### Creating a new tool + +We'll start by creating a tool. We'll add the not-so-useful yet fun task of fetching the model on the Hugging Face +Hub with the most downloads for a given task. + +We can do that with the following code: + +```python +from huggingface_hub import list_models + +task = "text-classification" + +model = next(iter(list_models(filter=task, sort="downloads", direction=-1))) +print(model.id) +``` + +For the task `text-classification`, this returns `'facebook/bart-large-mnli'`; for `translation` it returns `'google-t5/t5-base'`. + +How do we convert this to a tool that the agent can leverage? All tools depend on the superclass `Tool` that holds the +main attributes necessary. We'll create a class that inherits from it: + +```python +from transformers import Tool + + +class HFModelDownloadsTool(Tool): + pass +``` + +This class has a few needs: +- An attribute `name`, which corresponds to the name of the tool itself. To be in tune with other tools which have a + performative name, we'll name it `model_download_counter`. +- An attribute `description`, which will be used to populate the prompt of the agent. +- `inputs` and `outputs` attributes. Defining these will help the Python interpreter make educated choices about types, + and will allow for a gradio-demo to be spawned when we push our tool to the Hub.
They're both a list of expected + values, which can be `text`, `image`, or `audio`. +- A `__call__` method which contains the inference code. This is the code we've played with above! + +Here's what our class looks like now: + +```python +from transformers import Tool +from huggingface_hub import list_models + + +class HFModelDownloadsTool(Tool): + name = "model_download_counter" + description = ( + "This is a tool that returns the most downloaded model of a given task on the Hugging Face Hub. " + "It takes the name of the category (such as text-classification, depth-estimation, etc), and " + "returns the name of the checkpoint." + ) + + inputs = ["text"] + outputs = ["text"] + + def __call__(self, task: str): + model = next(iter(list_models(filter=task, sort="downloads", direction=-1))) + return model.id +``` + +We now have our tool handy. Save it in a file and import it from your main script. Let's name this file +`model_downloads.py`, so the resulting import code looks like this: + +```python +from model_downloads import HFModelDownloadsTool + +tool = HFModelDownloadsTool() +``` + +In order to let others benefit from it and for simpler initialization, we recommend pushing it to the Hub under your +namespace. To do so, just call `push_to_hub` on the `tool` variable: + +```python +tool.push_to_hub("hf-model-downloads") +``` + +You now have your code on the Hub! Let's take a look at the final step, which is to have the agent use it. + +#### Having the agent use the tool + +We now have our tool that lives on the Hub which can be instantiated as such (change the user name for your tool): + +```python +from transformers import load_tool + +tool = load_tool("lysandre/hf-model-downloads") +``` + +In order to use it in the agent, simply pass it in the `additional_tools` parameter of the agent initialization method: + +```python +from transformers import HfAgent + +agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder", additional_tools=[tool]) + +agent.run( + "Can you read out loud the name of the model that has the most downloads in the 'text-to-video' task on the Hugging Face Hub?" +) +``` +which outputs the following: +```text +==Code generated by the agent== +model = model_download_counter(task="text-to-video") +print(f"The model with the most downloads is {model}.") +audio_model = text_reader(model) + + +==Result== +The model with the most downloads is damo-vilab/text-to-video-ms-1.7b. +``` + +and generates the following audio. + +| **Audio** | +|------------------------------------------------------------------------------------------------------------------------------------------------------| +| \ No newline at end of file +
diff --git a/docs/source/en/internal/audio_utils.mdx b/docs/source/en/internal/audio_utils.md similarity index 69% rename from docs/source/en/internal/audio_utils.mdx rename to docs/source/en/internal/audio_utils.md index 8f1d6597149d6e..e6a39c7c1c49a9 100644 --- a/docs/source/en/internal/audio_utils.mdx +++ b/docs/source/en/internal/audio_utils.md @@ -8,14 +8,17 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Utilities for `FeatureExtractors` -This page lists all the utility functions that can be used by the audio [`FeatureExtractor`] in order to compute special features from a raw audio using common algorithms such as *Short Time Fourier Transform* or *Mel log spectrogram*. - +This page lists all the utility functions that can be used by the audio [`FeatureExtractor`] in order to compute special features from a raw audio using common algorithms such as *Short Time Fourier Transform* or *log mel spectrogram*. -Most of those are only useful if you are studying the code of the image processors in the library. +Most of those are only useful if you are studying the code of the audio processors in the library. ## Audio Transformations @@ -23,12 +26,14 @@ Most of those are only useful if you are studying the code of the image processo [[autodoc]] audio_utils.mel_to_hertz -[[autodoc]] audio_utils.get_mel_filter_banks +[[autodoc]] audio_utils.mel_filter_bank -[[autodoc]] audio_utils.stft +[[autodoc]] audio_utils.optimal_fft_length -[[autodoc]] audio_utils.power_to_db +[[autodoc]] audio_utils.window_function -[[autodoc]] audio_utils.fram_wave +[[autodoc]] audio_utils.spectrogram +[[autodoc]] audio_utils.power_to_db +[[autodoc]] audio_utils.amplitude_to_db diff --git a/docs/source/en/internal/file_utils.mdx b/docs/source/en/internal/file_utils.md similarity index 88% rename from docs/source/en/internal/file_utils.mdx rename to docs/source/en/internal/file_utils.md index 9366293143585f..6f5657f7743cd4 100644 --- a/docs/source/en/internal/file_utils.mdx +++ b/docs/source/en/internal/file_utils.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. 
+ --> # General Utilities diff --git a/docs/source/en/internal/generation_utils.mdx b/docs/source/en/internal/generation_utils.md similarity index 66% rename from docs/source/en/internal/generation_utils.mdx rename to docs/source/en/internal/generation_utils.md index 3c86b7dc3f09ae..0fa15ddbcf1943 100644 --- a/docs/source/en/internal/generation_utils.mdx +++ b/docs/source/en/internal/generation_utils.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Utilities for Generation @@ -34,14 +38,14 @@ Here's an example: ```python from transformers import GPT2Tokenizer, GPT2LMHeadModel -tokenizer = GPT2Tokenizer.from_pretrained("gpt2") -model = GPT2LMHeadModel.from_pretrained("gpt2") +tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2") +model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2") inputs = tokenizer("Hello, my dog is cute and ", return_tensors="pt") generation_output = model.generate(**inputs, return_dict_in_generate=True, output_scores=True) ``` -The `generation_output` object is a [`~generation.GreedySearchDecoderOnlyOutput`], as we can +The `generation_output` object is a [`~generation.GenerateDecoderOnlyOutput`], as we can see in the documentation of that class below, it means it has the following attributes: - `sequences`: the generated sequences of tokens @@ -71,39 +75,92 @@ values. Here, for instance, it has two keys that are `sequences` and `scores`. We document here all output types. 
-### GreedySearchOutput +### PyTorch -[[autodoc]] generation.GreedySearchDecoderOnlyOutput +[[autodoc]] generation.GenerateDecoderOnlyOutput -[[autodoc]] generation.GreedySearchEncoderDecoderOutput +[[autodoc]] generation.GenerateEncoderDecoderOutput -[[autodoc]] generation.FlaxGreedySearchOutput +[[autodoc]] generation.GenerateBeamDecoderOnlyOutput -### SampleOutput +[[autodoc]] generation.GenerateBeamEncoderDecoderOutput -[[autodoc]] generation.SampleDecoderOnlyOutput +### TensorFlow -[[autodoc]] generation.SampleEncoderDecoderOutput +[[autodoc]] generation.TFGreedySearchEncoderDecoderOutput -[[autodoc]] generation.FlaxSampleOutput +[[autodoc]] generation.TFGreedySearchDecoderOnlyOutput -### BeamSearchOutput +[[autodoc]] generation.TFSampleEncoderDecoderOutput -[[autodoc]] generation.BeamSearchDecoderOnlyOutput +[[autodoc]] generation.TFSampleDecoderOnlyOutput -[[autodoc]] generation.BeamSearchEncoderDecoderOutput +[[autodoc]] generation.TFBeamSearchEncoderDecoderOutput -### BeamSampleOutput +[[autodoc]] generation.TFBeamSearchDecoderOnlyOutput -[[autodoc]] generation.BeamSampleDecoderOnlyOutput +[[autodoc]] generation.TFBeamSampleEncoderDecoderOutput -[[autodoc]] generation.BeamSampleEncoderDecoderOutput +[[autodoc]] generation.TFBeamSampleDecoderOnlyOutput + +[[autodoc]] generation.TFContrastiveSearchEncoderDecoderOutput + +[[autodoc]] generation.TFContrastiveSearchDecoderOnlyOutput + +### FLAX + +[[autodoc]] generation.FlaxSampleOutput + +[[autodoc]] generation.FlaxGreedySearchOutput + +[[autodoc]] generation.FlaxBeamSearchOutput ## LogitsProcessor A [`LogitsProcessor`] can be used to modify the prediction scores of a language model head for generation. +### PyTorch + +[[autodoc]] AlternatingCodebooksLogitsProcessor + - __call__ + +[[autodoc]] ClassifierFreeGuidanceLogitsProcessor + - __call__ + +[[autodoc]] EncoderNoRepeatNGramLogitsProcessor + - __call__ + +[[autodoc]] EncoderRepetitionPenaltyLogitsProcessor + - __call__ + +[[autodoc]] EpsilonLogitsWarper + - __call__ + +[[autodoc]] EtaLogitsWarper + - __call__ + +[[autodoc]] ExponentialDecayLengthPenalty + - __call__ + +[[autodoc]] ForcedBOSTokenLogitsProcessor + - __call__ + +[[autodoc]] ForcedEOSTokenLogitsProcessor + - __call__ + +[[autodoc]] ForceTokensLogitsProcessor + - __call__ + +[[autodoc]] HammingDiversityLogitsProcessor + - __call__ + +[[autodoc]] InfNanRemoveLogitsProcessor + - __call__ + +[[autodoc]] LogitNormalization + - __call__ + [[autodoc]] LogitsProcessor - __call__ @@ -119,58 +176,63 @@ generation. 
[[autodoc]] MinNewTokensLengthLogitsProcessor - __call__ -[[autodoc]] TemperatureLogitsWarper +[[autodoc]] NoBadWordsLogitsProcessor - __call__ -[[autodoc]] RepetitionPenaltyLogitsProcessor +[[autodoc]] NoRepeatNGramLogitsProcessor - __call__ -[[autodoc]] TopPLogitsWarper +[[autodoc]] PrefixConstrainedLogitsProcessor - __call__ -[[autodoc]] TopKLogitsWarper +[[autodoc]] RepetitionPenaltyLogitsProcessor - __call__ -[[autodoc]] TypicalLogitsWarper +[[autodoc]] SequenceBiasLogitsProcessor - __call__ -[[autodoc]] NoRepeatNGramLogitsProcessor +[[autodoc]] SuppressTokensAtBeginLogitsProcessor - __call__ -[[autodoc]] NoBadWordsLogitsProcessor +[[autodoc]] SuppressTokensLogitsProcessor - __call__ -[[autodoc]] PrefixConstrainedLogitsProcessor +[[autodoc]] TemperatureLogitsWarper - __call__ -[[autodoc]] HammingDiversityLogitsProcessor +[[autodoc]] TopKLogitsWarper - __call__ -[[autodoc]] ForcedBOSTokenLogitsProcessor +[[autodoc]] TopPLogitsWarper - __call__ -[[autodoc]] ForcedEOSTokenLogitsProcessor +[[autodoc]] TypicalLogitsWarper - __call__ -[[autodoc]] InfNanRemoveLogitsProcessor +[[autodoc]] UnbatchedClassifierFreeGuidanceLogitsProcessor - __call__ -[[autodoc]] TFLogitsProcessor +[[autodoc]] WhisperTimeStampLogitsProcessor - __call__ -[[autodoc]] TFLogitsProcessorList +### TensorFlow + +[[autodoc]] TFForcedBOSTokenLogitsProcessor - __call__ -[[autodoc]] TFLogitsWarper +[[autodoc]] TFForcedEOSTokenLogitsProcessor - __call__ -[[autodoc]] TFTemperatureLogitsWarper +[[autodoc]] TFForceTokensLogitsProcessor - __call__ -[[autodoc]] TFTopPLogitsWarper +[[autodoc]] TFLogitsProcessor - __call__ -[[autodoc]] TFTopKLogitsWarper +[[autodoc]] TFLogitsProcessorList + - __call__ + +[[autodoc]] TFLogitsWarper - __call__ [[autodoc]] TFMinLengthLogitsProcessor @@ -185,10 +247,30 @@ generation. [[autodoc]] TFRepetitionPenaltyLogitsProcessor - __call__ -[[autodoc]] TFForcedBOSTokenLogitsProcessor +[[autodoc]] TFSuppressTokensAtBeginLogitsProcessor - __call__ -[[autodoc]] TFForcedEOSTokenLogitsProcessor +[[autodoc]] TFSuppressTokensLogitsProcessor + - __call__ + +[[autodoc]] TFTemperatureLogitsWarper + - __call__ + +[[autodoc]] TFTopKLogitsWarper + - __call__ + +[[autodoc]] TFTopPLogitsWarper + - __call__ + +### FLAX + +[[autodoc]] FlaxForcedBOSTokenLogitsProcessor + - __call__ + +[[autodoc]] FlaxForcedEOSTokenLogitsProcessor + - __call__ + +[[autodoc]] FlaxForceTokensLogitsProcessor - __call__ [[autodoc]] FlaxLogitsProcessor @@ -200,27 +282,30 @@ generation. [[autodoc]] FlaxLogitsWarper - __call__ -[[autodoc]] FlaxTemperatureLogitsWarper +[[autodoc]] FlaxMinLengthLogitsProcessor - __call__ -[[autodoc]] FlaxTopPLogitsWarper +[[autodoc]] FlaxSuppressTokensAtBeginLogitsProcessor - __call__ -[[autodoc]] FlaxTopKLogitsWarper +[[autodoc]] FlaxSuppressTokensLogitsProcessor - __call__ -[[autodoc]] FlaxForcedBOSTokenLogitsProcessor +[[autodoc]] FlaxTemperatureLogitsWarper - __call__ -[[autodoc]] FlaxForcedEOSTokenLogitsProcessor +[[autodoc]] FlaxTopKLogitsWarper - __call__ -[[autodoc]] FlaxMinLengthLogitsProcessor +[[autodoc]] FlaxTopPLogitsWarper + - __call__ + +[[autodoc]] FlaxWhisperTimeStampLogitsProcessor - __call__ ## StoppingCriteria -A [`StoppingCriteria`] can be used to change when to stop generation (other than EOS token). +A [`StoppingCriteria`] can be used to change when to stop generation (other than EOS token). Please note that this is exclusively available to our PyTorch implementations. 
[[autodoc]] StoppingCriteria - __call__ @@ -236,7 +321,7 @@ A [`StoppingCriteria`] can be used to change when to stop generation (other than ## Constraints -A [`Constraint`] can be used to force the generation to include specific tokens or sequences in the output. +A [`Constraint`] can be used to force the generation to include specific tokens or sequences in the output. Please note that this is exclusively available to our PyTorch implementations. [[autodoc]] Constraint @@ -265,3 +350,30 @@ A [`Constraint`] can be used to force the generation to include specific tokens [[autodoc]] top_k_top_p_filtering [[autodoc]] tf_top_k_top_p_filtering + +## Streamers + +[[autodoc]] TextStreamer + +[[autodoc]] TextIteratorStreamer + +## Caches + +[[autodoc]] Cache + - update + +[[autodoc]] DynamicCache + - update + - get_seq_length + - reorder_cache + - to_legacy_cache + - from_legacy_cache + +[[autodoc]] SinkCache + - update + - get_seq_length + - reorder_cache + +[[autodoc]] StaticCache + - update + - get_seq_length \ No newline at end of file diff --git a/docs/source/en/internal/image_processing_utils.mdx b/docs/source/en/internal/image_processing_utils.md similarity index 89% rename from docs/source/en/internal/image_processing_utils.mdx rename to docs/source/en/internal/image_processing_utils.md index 831458bedab164..42f99f361703c1 100644 --- a/docs/source/en/internal/image_processing_utils.mdx +++ b/docs/source/en/internal/image_processing_utils.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Utilities for Image Processors diff --git a/docs/source/en/internal/modeling_utils.mdx b/docs/source/en/internal/modeling_utils.md similarity index 92% rename from docs/source/en/internal/modeling_utils.mdx rename to docs/source/en/internal/modeling_utils.md index 914b8ca36798fa..afc8123558f5c3 100644 --- a/docs/source/en/internal/modeling_utils.mdx +++ b/docs/source/en/internal/modeling_utils.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. 
+ --> # Custom Layers and Utilities @@ -54,9 +58,6 @@ Most of those are only useful if you are studying the code of the models in the [[autodoc]] modeling_tf_utils.TFConv1D -[[autodoc]] modeling_tf_utils.TFSharedEmbeddings - - call - [[autodoc]] modeling_tf_utils.TFSequenceSummary ## TensorFlow loss functions diff --git a/docs/source/en/internal/pipelines_utils.mdx b/docs/source/en/internal/pipelines_utils.md similarity index 87% rename from docs/source/en/internal/pipelines_utils.mdx rename to docs/source/en/internal/pipelines_utils.md index ed8e75b414bb7f..6ea6de9a61b8ab 100644 --- a/docs/source/en/internal/pipelines_utils.mdx +++ b/docs/source/en/internal/pipelines_utils.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Utilities for pipelines diff --git a/docs/source/en/internal/time_series_utils.md b/docs/source/en/internal/time_series_utils.md new file mode 100644 index 00000000000000..11c562fbe32af5 --- /dev/null +++ b/docs/source/en/internal/time_series_utils.md @@ -0,0 +1,29 @@ + + +# Time Series Utilities + +This page lists all the utility functions and classes that can be used for Time Series based models. + +Most of those are only useful if you are studying the code of the time series models or you wish to add to the collection of distributional output classes. + +## Distributional Output + +[[autodoc]] time_series_utils.NormalOutput + +[[autodoc]] time_series_utils.StudentTOutput + +[[autodoc]] time_series_utils.NegativeBinomialOutput diff --git a/docs/source/en/internal/tokenization_utils.mdx b/docs/source/en/internal/tokenization_utils.md similarity index 89% rename from docs/source/en/internal/tokenization_utils.mdx rename to docs/source/en/internal/tokenization_utils.md index 24e81f7020643a..5aa65099176031 100644 --- a/docs/source/en/internal/tokenization_utils.mdx +++ b/docs/source/en/internal/tokenization_utils.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Utilities for Tokenizers diff --git a/docs/source/en/internal/trainer_utils.mdx b/docs/source/en/internal/trainer_utils.md similarity index 85% rename from docs/source/en/internal/trainer_utils.mdx rename to docs/source/en/internal/trainer_utils.md index bba182d5ab6496..1bc5e2baae2d6f 100644 --- a/docs/source/en/internal/trainer_utils.mdx +++ b/docs/source/en/internal/trainer_utils.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Utilities for Trainer @@ -36,7 +40,7 @@ Most of those are only useful if you are studying the code of the Trainer in the [[autodoc]] trainer_pt_utils.DistributedTensorGatherer -## Distributed Evaluation +## Trainer Argument Parser [[autodoc]] HfArgumentParser diff --git a/docs/source/en/llm_tutorial.md b/docs/source/en/llm_tutorial.md new file mode 100644 index 00000000000000..8d6372e129cc47 --- /dev/null +++ b/docs/source/en/llm_tutorial.md @@ -0,0 +1,268 @@ + + + +# Generation with LLMs + +[[open-in-colab]] + +LLMs, or Large Language Models, are the key component behind text generation. In a nutshell, they consist of large pretrained transformer models trained to predict the next word (or, more precisely, token) given some input text. Since they predict one token at a time, you need to do something more elaborate to generate new sentences other than just calling the model -- you need to do autoregressive generation. + +Autoregressive generation is the inference-time procedure of iteratively calling a model with its own generated outputs, given a few initial inputs. In 🤗 Transformers, this is handled by the [`~generation.GenerationMixin.generate`] method, which is available to all models with generative capabilities. + +This tutorial will show you how to: + +* Generate text with an LLM +* Avoid common pitfalls +* Next steps to help you get the most out of your LLM + +Before you begin, make sure you have all the necessary libraries installed: + +```bash +pip install transformers bitsandbytes>=0.39.0 -q +``` + + +## Generate text + +A language model trained for [causal language modeling](tasks/language_modeling) takes a sequence of text tokens as input and returns the probability distribution for the next token. + + +
+ +*Figure: "Forward pass of an LLM"* +
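To make that statement concrete, here is a minimal sketch (not part of the original tutorial) of a single forward pass: one model call turns a token sequence into a probability distribution over the next token. The checkpoint is an arbitrary illustrative choice.

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

inputs = tokenizer("A list of colors: red, blue", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                         # (batch, sequence_length, vocab_size)

next_token_probs = torch.softmax(logits[:, -1, :], dim=-1)  # distribution over the next token
print(tokenizer.decode(next_token_probs.argmax(dim=-1)))    # most likely continuation
```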
+ +A critical aspect of autoregressive generation with LLMs is how to select the next token from this probability distribution. Anything goes in this step as long as you end up with a token for the next iteration. This means it can be as simple as selecting the most likely token from the probability distribution or as complex as applying a dozen transformations before sampling from the resulting distribution. + + +
+ +*Figure: "Autoregressive generation iteratively selects the next token from a probability distribution to generate text"* +
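The iteration in the figure above can also be written out by hand. The sketch below is illustrative only (arbitrary checkpoint, greedy selection, fixed length cap); in practice, [`~generation.GenerationMixin.generate`] implements this loop and its stopping conditions for you.

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

input_ids = tokenizer("A list of colors: red, blue", return_tensors="pt").input_ids
for _ in range(20):                                              # predefined maximum length
    with torch.no_grad():
        logits = model(input_ids).logits
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # select the most likely token
    if next_token.item() == tokenizer.eos_token_id:              # stop if the model emits EOS
        break
    input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))
```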
+ +The process depicted above is repeated iteratively until some stopping condition is reached. Ideally, the stopping condition is dictated by the model, which should learn when to output an end-of-sequence (`EOS`) token. If this is not the case, generation stops when some predefined maximum length is reached. + +Properly setting up the token selection step and the stopping condition is essential to make your model behave as you'd expect on your task. That is why we have a [`~generation.GenerationConfig`] file associated with each model, which contains a good default generative parameterization and is loaded alongside your model. + +Let's talk code! + + + +If you're interested in basic LLM usage, our high-level [`Pipeline`](pipeline_tutorial) interface is a great starting point. However, LLMs often require advanced features like quantization and fine control of the token selection step, which is best done through [`~generation.GenerationMixin.generate`]. Autoregressive generation with LLMs is also resource-intensive and should be executed on a GPU for adequate throughput. + + + +First, you need to load the model. + +```py +>>> from transformers import AutoModelForCausalLM + +>>> model = AutoModelForCausalLM.from_pretrained( +... "mistralai/Mistral-7B-v0.1", device_map="auto", load_in_4bit=True +... ) +``` + +You'll notice two flags in the `from_pretrained` call: + + - `device_map` ensures the model is moved to your GPU(s) + - `load_in_4bit` applies [4-bit dynamic quantization](main_classes/quantization) to massively reduce the resource requirements + +There are other ways to initialize a model, but this is a good baseline to begin with an LLM. + +Next, you need to preprocess your text input with a [tokenizer](tokenizer_summary). + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left") +>>> model_inputs = tokenizer(["A list of colors: red, blue"], return_tensors="pt").to("cuda") +``` + +The `model_inputs` variable holds the tokenized text input, as well as the attention mask. While [`~generation.GenerationMixin.generate`] does its best to infer the attention mask when it is not passed, we recommend passing it whenever possible for optimal results. + +After tokenizing the inputs, you can call the [`~generation.GenerationMixin.generate`] method, which returns the generated tokens. The generated tokens should then be converted to text before printing. + +```py +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'A list of colors: red, blue, green, yellow, orange, purple, pink,' +``` + +Finally, you don't need to do it one sequence at a time! You can batch your inputs, which will greatly improve the throughput at a small latency and memory cost. All you need to do is make sure you pad your inputs properly (more on that below). + +```py +>>> tokenizer.pad_token = tokenizer.eos_token # Most LLMs don't have a pad token by default +>>> model_inputs = tokenizer( +... ["A list of colors: red, blue", "Portugal is"], return_tensors="pt", padding=True +... ).to("cuda") +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True) +['A list of colors: red, blue, green, yellow, orange, purple, pink,', +'Portugal is a country in southwestern Europe, on the Iber'] +``` + +And that's it! In a few lines of code, you can harness the power of an LLM.
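One convenience not covered above: instead of waiting for the whole sequence, you can print tokens as soon as they are generated by passing a [`TextStreamer`] to `generate`. A minimal sketch, reusing the tokenizer and model loaded above (note that `TextStreamer` handles one sequence at a time):

```py
>>> from transformers import TextStreamer

>>> streamer = TextStreamer(tokenizer, skip_prompt=True)
>>> model_inputs = tokenizer(["A list of colors: red, blue"], return_tensors="pt").to("cuda")
>>> _ = model.generate(**model_inputs, streamer=streamer, max_new_tokens=20)  # text is printed as it is generated
```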
+ + +## Common pitfalls + +There are many [generation strategies](generation_strategies), and sometimes the default values may not be appropriate for your use case. If your outputs aren't aligned with what you're expecting, we've created a list of the most common pitfalls and how to avoid them. + +```py +>>> from transformers import AutoModelForCausalLM, AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1") +>>> tokenizer.pad_token = tokenizer.eos_token # Most LLMs don't have a pad token by default +>>> model = AutoModelForCausalLM.from_pretrained( +... "mistralai/Mistral-7B-v0.1", device_map="auto", load_in_4bit=True +... ) +``` + +### Generated output is too short/long + +If not specified in the [`~generation.GenerationConfig`] file, `generate` returns up to 20 tokens by default. We highly recommend manually setting `max_new_tokens` in your `generate` call to control the maximum number of new tokens it can return. Keep in mind LLMs (more precisely, [decoder-only models](https://huggingface.co/learn/nlp-course/chapter1/6?fw=pt)) also return the input prompt as part of the output. + + +```py +>>> model_inputs = tokenizer(["A sequence of numbers: 1, 2"], return_tensors="pt").to("cuda") + +>>> # By default, the output will contain up to 20 tokens +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'A sequence of numbers: 1, 2, 3, 4, 5' + +>>> # Setting `max_new_tokens` allows you to control the maximum length +>>> generated_ids = model.generate(**model_inputs, max_new_tokens=50) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'A sequence of numbers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,' +``` + +### Incorrect generation mode + +By default, and unless specified in the [`~generation.GenerationConfig`] file, `generate` selects the most likely token at each iteration (greedy decoding). Depending on your task, this may be undesirable; creative tasks like chatbots or writing an essay benefit from sampling. On the other hand, input-grounded tasks like audio transcription or translation benefit from greedy decoding. Enable sampling with `do_sample=True`, and you can learn more about this topic in this [blog post](https://huggingface.co/blog/how-to-generate). + +```py +>>> # Set seed for reproducibility -- you don't need this unless you want full reproducibility +>>> from transformers import set_seed +>>> set_seed(42) + +>>> model_inputs = tokenizer(["I am a cat."], return_tensors="pt").to("cuda") + +>>> # LLM + greedy decoding = repetitive, boring output +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'I am a cat. I am a cat. I am a cat. I am a cat' + +>>> # With sampling, the output becomes more creative! +>>> generated_ids = model.generate(**model_inputs, do_sample=True) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'I am a cat. Specifically, I am an indoor-only cat. I' +``` + +### Wrong padding side + +LLMs are [decoder-only](https://huggingface.co/learn/nlp-course/chapter1/6?fw=pt) architectures, meaning they continue to iterate on your input prompt. If your inputs do not have the same length, they need to be padded. Since LLMs are not trained to continue from pad tokens, your input needs to be left-padded. Make sure you also don't forget to pass the attention mask to `generate`!
+ +```py +>>> # The tokenizer initialized above has right-padding active by default: the 1st sequence, +>>> # which is shorter, has padding on the right side. Generation fails to capture the logic. +>>> model_inputs = tokenizer( +... ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt" +... ).to("cuda") +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'1, 2, 33333333333' + +>>> # With left-padding, it works as expected! +>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left") +>>> tokenizer.pad_token = tokenizer.eos_token # Most LLMs don't have a pad token by default +>>> model_inputs = tokenizer( +... ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt" +... ).to("cuda") +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'1, 2, 3, 4, 5, 6,' +``` + +### Wrong prompt + +Some models and tasks expect a certain input prompt format to work properly. When this format is not applied, you will get a silent performance degradation: the model kinda works, but not as well as if you were following the expected prompt. More information about prompting, including which models and tasks need to be careful, is available in this [guide](tasks/prompting). Let's see an example with a chat LLM, which makes use of [chat templating](chat_templating): + +```python +>>> tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha") +>>> model = AutoModelForCausalLM.from_pretrained( +... "HuggingFaceH4/zephyr-7b-alpha", device_map="auto", load_in_4bit=True +... ) +>>> set_seed(0) +>>> prompt = """How many helicopters can a human eat in one sitting? Reply as a thug.""" +>>> model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda") +>>> input_length = model_inputs.input_ids.shape[1] +>>> generated_ids = model.generate(**model_inputs, max_new_tokens=20) +>>> print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0]) +"I'm not a thug, but i can tell you that a human cannot eat" +>>> # Oh no, it did not follow our instruction to reply as a thug! Let's see what happens when we write +>>> # a better prompt and use the right template for this model (through `tokenizer.apply_chat_template`) + +>>> set_seed(0) +>>> messages = [ +... { +... "role": "system", +... "content": "You are a friendly chatbot who always responds in the style of a thug", +... }, +... {"role": "user", "content": "How many helicopters can a human eat in one sitting?"}, +... ] +>>> model_inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda") +>>> input_length = model_inputs.shape[1] +>>> generated_ids = model.generate(model_inputs, do_sample=True, max_new_tokens=20) +>>> print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0]) +'None, you thug. How bout you try to focus on more useful questions?' +>>> # As we can see, it followed a proper thug style 😎 +``` + +## Further resources + +While the autoregressive generation process is relatively straightforward, making the most out of your LLM can be a challenging endeavor because there are many moving parts. For your next steps to help you dive deeper into LLM usage and understanding: + +### Advanced generate usage + +1. 
[Guide](generation_strategies) on how to control different generation methods, how to set up the generation configuration file, and how to stream the output; +2. [Guide](chat_templating) on the prompt template for chat LLMs; +3. [Guide](tasks/prompting) on how to get the most out of prompt design; +4. API reference on [`~generation.GenerationConfig`], [`~generation.GenerationMixin.generate`], and [generate-related classes](internal/generation_utils). Most of the classes, including the logits processors, have usage examples! + +### LLM leaderboards + +1. [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), which focuses on the quality of the open-source models; +2. [Open LLM-Perf Leaderboard](https://huggingface.co/spaces/optimum/llm-perf-leaderboard), which focuses on LLM throughput. + +### Latency, throughput and memory utilization + +1. [Guide](llm_tutorial_optimization) on how to optimize LLMs for speed and memory; +2. [Guide](main_classes/quantization) on quantization such as bitsandbytes and autogptq, which shows you how to drastically reduce your memory requirements. + +### Related libraries + +1. [`text-generation-inference`](https://github.com/huggingface/text-generation-inference), a production-ready server for LLMs; +2. [`optimum`](https://github.com/huggingface/optimum), an extension of 🤗 Transformers that optimizes for specific hardware devices. diff --git a/docs/source/en/llm_tutorial_optimization.md b/docs/source/en/llm_tutorial_optimization.md new file mode 100644 index 00000000000000..93848d72b0d811 --- /dev/null +++ b/docs/source/en/llm_tutorial_optimization.md @@ -0,0 +1,781 @@ + +# Optimizing LLMs for Speed and Memory + +[[open-in-colab]] + +Large Language Models (LLMs) such as GPT3/4, [Falcon](https://huggingface.co/tiiuae/falcon-40b), and [Llama](https://huggingface.co/meta-llama/Llama-2-70b-hf) are rapidly advancing in their ability to tackle human-centric tasks, establishing themselves as essential tools in modern knowledge-based industries. +Deploying these models in real-world tasks remains challenging, however: + +- To exhibit near-human text understanding and generation capabilities, LLMs currently need to be composed of billions of parameters (see [Kaplan et al](https://arxiv.org/abs/2001.08361), [Wei et al.](https://arxiv.org/abs/2206.07682)). This consequently amplifies the memory demands for inference. +- In many real-world tasks, LLMs need to be given extensive contextual information. This necessitates the model's capability to manage very long input sequences during inference. + +The crux of these challenges lies in augmenting the computational and memory capabilities of LLMs, especially when handling expansive input sequences. + +In this guide, we will go over effective techniques for efficient LLM deployment: + +1. **Lower Precision:** Research has shown that operating at reduced numerical precision, namely [8-bit and 4-bit](./main_classes/quantization.md), can achieve computational advantages without a considerable decline in model performance. + +2. **Flash Attention:** Flash Attention is a variation of the attention algorithm that not only provides a more memory-efficient approach but also realizes increased efficiency due to optimized GPU memory utilization. + +3.
**Architectural Innovations:** Considering that LLMs are always deployed in the same way during inference, namely autoregressive text generation with a long input context, specialized model architectures have been proposed that allow for more efficient inference. The most important advancement in model architectures hereby are [Alibi](https://arxiv.org/abs/2108.12409), [Rotary embeddings](https://arxiv.org/abs/2104.09864), [Multi-Query Attention (MQA)](https://arxiv.org/abs/1911.02150) and [Grouped-Query-Attention (GQA)]((https://arxiv.org/abs/2305.13245)). + +Throughout this guide, we will offer an analysis of auto-regressive generation from a tensor's perspective. We delve into the pros and cons of adopting lower precision, provide a comprehensive exploration of the latest attention algorithms, and discuss improved LLM architectures. While doing so, we run practical examples showcasing each of the feature improvements. + +## 1. Lower Precision + +Memory requirements of LLMs can be best understood by seeing the LLM as a set of weight matrices and vectors and the text inputs as a sequence of vectors. In the following, the definition *weights* will be used to signify all model weight matrices and vectors. + +At the time of writing this guide, LLMs consist of at least a couple billion parameters. Each parameter thereby is made of a decimal number, e.g. `4.5689` which is usually stored in either [float32](https://en.wikipedia.org/wiki/Single-precision_floating-point_format), [bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format), or [float16](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) format. This allows us to easily compute the memory requirement to load the LLM into memory: + +> *Loading the weights of a model having X billion parameters requires roughly 4 * X GB of VRAM in float32 precision* + +Nowadays, models are however rarely trained in full float32 precision, but usually in bfloat16 precision or less frequently in float16 precision. Therefore the rule of thumb becomes: + +> *Loading the weights of a model having X billion parameters requires roughly 2 * X GB of VRAM in bfloat16/float16 precision* + +For shorter text inputs (less than 1024 tokens), the memory requirement for inference is very much dominated by the memory requirement to load the weights. Therefore, for now, let's assume that the memory requirement for inference is equal to the memory requirement to load the model into the GPU VRAM. + +To give some examples of how much VRAM it roughly takes to load a model in bfloat16: + +- **GPT3** requires 2 \* 175 GB = **350 GB** VRAM +- [**Bloom**](https://huggingface.co/bigscience/bloom) requires 2 \* 176 GB = **352 GB** VRAM +- [**Llama-2-70b**](https://huggingface.co/meta-llama/Llama-2-70b-hf) requires 2 \* 70 GB = **140 GB** VRAM +- [**Falcon-40b**](https://huggingface.co/tiiuae/falcon-40b) requires 2 \* 40 GB = **80 GB** VRAM +- [**MPT-30b**](https://huggingface.co/mosaicml/mpt-30b) requires 2 \* 30 GB = **60 GB** VRAM +- [**bigcode/starcoder**](https://huggingface.co/bigcode/starcoder) requires 2 \* 15.5 = **31 GB** VRAM + +As of writing this document, the largest GPU chip on the market is the A100 & H100 offering 80GB of VRAM. 
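These figures follow directly from the rule of thumb above, and you can sanity-check them yourself. The helper below is purely illustrative (not part of the original guide); dividing by 1024³ instead of 10⁹ gives slightly smaller numbers, which is one reason the rule of thumb should be read as an upper bound.

```python
def weight_memory_gb(n_params_in_billions: float, bytes_per_param: int) -> float:
    """Rough VRAM needed just to hold the model weights, in GiB."""
    return n_params_in_billions * 1e9 * bytes_per_param / 1024**3

print(weight_memory_gb(70, 2))  # Llama-2-70b in bfloat16: ~130 GiB (rule of thumb: 140 GB)
print(weight_memory_gb(70, 4))  # Llama-2-70b in float32:  ~261 GiB (rule of thumb: 280 GB)
```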
Most of the models listed before require more than 80GB just to be loaded and therefore necessarily require [tensor parallelism](https://huggingface.co/docs/transformers/perf_train_gpu_many#tensor-parallelism) and/or [pipeline parallelism](https://huggingface.co/docs/transformers/perf_train_gpu_many#naive-model-parallelism-vertical-and-pipeline-parallelism). + +🤗 Transformers does not support tensor parallelism out of the box as it requires the model architecture to be written in a specific way. If you're interested in writing models in a tensor-parallelism-friendly way, feel free to have a look at [the text-generation-inference library](https://github.com/huggingface/text-generation-inference/tree/main/server/text_generation_server/models/custom_modeling). + +Naive pipeline parallelism is supported out of the box. For this, simply load the model with `device="auto"` which will automatically place the different layers on the available GPUs as explained [here](https://huggingface.co/docs/accelerate/v0.22.0/en/concept_guides/big_model_inference). +Note, however that while very effective, this naive pipeline parallelism does not tackle the issues of GPU idling. For this more advanced pipeline parallelism is required as explained [here](https://huggingface.co/docs/transformers/en/perf_train_gpu_many#naive-model-parallelism-vertical-and-pipeline-parallelism). + +If you have access to an 8 x 80GB A100 node, you could load BLOOM as follows + +```bash +!pip install transformers accelerate bitsandbytes optimum +``` +```python +from transformers import AutoModelForCausalLM + +model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", device_map="auto", pad_token_id=0) +``` + +By using `device_map="auto"` the attention layers would be equally distributed over all available GPUs. + +In this guide, we will use [bigcode/octocoder](https://huggingface.co/bigcode/octocoder) as it can be run on a single 40 GB A100 GPU device chip. Note that all memory and speed optimizations that we will apply going forward, are equally applicable to models that require model or tensor parallelism. + +Since the model is loaded in bfloat16 precision, using our rule of thumb above, we would expect the memory requirement to run inference with `bigcode/octocoder` to be around 31 GB VRAM. Let's give it a try. + +We first load the model and tokenizer and then pass both to Transformers' [pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines) object. + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline +import torch + +model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", torch_dtype=torch.bfloat16, device_map="auto", pad_token_id=0) +tokenizer = AutoTokenizer.from_pretrained("bigcode/octocoder") + +pipe = pipeline("text-generation", model=model, tokenizer=tokenizer) +``` + +```python +prompt = "Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer:" + +result = pipe(prompt, max_new_tokens=60)[0]["generated_text"][len(prompt):] +result +``` + +**Output**: +``` +Here is a Python function that transforms bytes to Giga bytes:\n\n```python\ndef bytes_to_giga_bytes(bytes):\n return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single +``` + +Nice, we can now directly use the result to convert bytes into Gigabytes. 
+ +```python +def bytes_to_giga_bytes(bytes): + return bytes / 1024 / 1024 / 1024 +``` + +Let's call [`torch.cuda.max_memory_allocated`](https://pytorch.org/docs/stable/generated/torch.cuda.max_memory_allocated.html) to measure the peak GPU memory allocation. + +```python +bytes_to_giga_bytes(torch.cuda.max_memory_allocated()) +``` + +**Output**: +```bash +29.0260648727417 +``` + +Close enough to our back-of-the-envelope computation! We can see the number is not exactly correct as going from bytes to kilobytes requires a multiplication of 1024 instead of 1000. Therefore the back-of-the-envelope formula can also be understood as an "at most X GB" computation. +Note that if we had tried to run the model in full float32 precision, a whopping 64 GB of VRAM would have been required. + +> Almost all models are trained in bfloat16 nowadays, there is no reason to run the model in full float32 precision if [your GPU supports bfloat16](https://discuss.pytorch.org/t/bfloat16-native-support/117155/5). Float32 won't give better inference results than the precision that was used to train the model. + +If you are unsure in which format the model weights are stored on the Hub, you can always look into the checkpoint's config under `"torch_dtype"`, *e.g.* [here](https://huggingface.co/meta-llama/Llama-2-7b-hf/blob/6fdf2e60f86ff2481f2241aaee459f85b5b0bbb9/config.json#L21). It is recommended to set the model to the same precision type as written in the config when loading with `from_pretrained(..., torch_dtype=...)` except when the original type is float32 in which case one can use both `float16` or `bfloat16` for inference. + + +Let's define a `flush(...)` function to free all allocated memory so that we can accurately measure the peak allocated GPU memory. + +```python +del pipe +del model + +import gc +import torch + +def flush(): + gc.collect() + torch.cuda.empty_cache() + torch.cuda.reset_peak_memory_stats() +``` + +Let's call it now for the next experiment. + +```python +flush() +``` +In the recent version of the accelerate library, you can also use an utility method called `release_memory()` + +```python +from accelerate.utils import release_memory +# ... + +release_memory(model) +``` + +Now what if your GPU does not have 32 GB of VRAM? It has been found that model weights can be quantized to 8-bit or 4-bits without a significant loss in performance (see [Dettmers et al.](https://arxiv.org/abs/2208.07339)). +Model can be quantized to even 3 or 2 bits with an acceptable loss in performance as shown in the recent [GPTQ paper](https://arxiv.org/abs/2210.17323) 🤯. + +Without going into too many details, quantization schemes aim at reducing the precision of weights while trying to keep the model's inference results as accurate as possible (*a.k.a* as close as possible to bfloat16). +Note that quantization works especially well for text generation since all we care about is choosing the *set of most likely next tokens* and don't really care about the exact values of the next token *logit* distribution. +All that matters is that the next token *logit* distribution stays roughly the same so that an `argmax` or `topk` operation gives the same results. + +There are various quantization techniques, which we won't discuss in detail here, but in general, all quantization techniques work as follows: + +- 1. Quantize all weights to the target precision +- 2. Load the quantized weights, and pass the input sequence of vectors in bfloat16 precision +- 3. 
Dynamically dequantize weights to bfloat16 to perform the computation with their input vectors in bfloat16 precision + +In a nutshell, this means that *inputs-weight matrix* multiplications, with \\( X \\) being the *inputs*, \\( W \\) being a weight matrix and \\( Y \\) being the output: + +$$ Y = X * W $$ + +are changed to + +$$ Y = X * \text{dequantize}(W) $$ + +for every matrix multiplication. Dequantization and re-quantization is performed sequentially for all weight matrices as the inputs run through the network graph. + +Therefore, inference time is often **not** reduced when using quantized weights, but rather increases. +Enough theory, let's give it a try! To quantize the weights with Transformers, you need to make sure that +the [`bitsandbytes`](https://github.com/TimDettmers/bitsandbytes) library is installed. + +```bash +!pip install bitsandbytes +``` + +We can then load models in 8-bit quantization by simply adding a `load_in_8bit=True` flag to `from_pretrained`. + +```python +model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", load_in_8bit=True, pad_token_id=0) +``` + +Now, let's run our example again and measure the memory usage. + +```python +pipe = pipeline("text-generation", model=model, tokenizer=tokenizer) + +result = pipe(prompt, max_new_tokens=60)[0]["generated_text"][len(prompt):] +result +``` + +**Output**: +``` +Here is a Python function that transforms bytes to Giga bytes:\n\n```python\ndef bytes_to_giga_bytes(bytes):\n return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single +``` + +Nice, we're getting the same result as before, so no loss in accuracy! Let's look at how much memory was used this time. + +```python +bytes_to_giga_bytes(torch.cuda.max_memory_allocated()) +``` + +**Output**: +``` +15.219234466552734 +``` + +Significantly less! We're down to just a bit over 15 GBs and could therefore run this model on consumer GPUs like the 4090. +We're seeing a very nice gain in memory efficiency and more or less no degradation to the model's output. However, we can also notice a slight slow-down during inference. + + +We delete the models and flush the memory again. +```python +del model +del pipe +``` + +```python +flush() +``` + +Let's see what peak GPU memory consumption 4-bit quantization gives. Quantizing the model to 4-bit can be done with the same API as before - this time by passing `load_in_4bit=True` instead of `load_in_8bit=True`. + +```python +model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", load_in_4bit=True, low_cpu_mem_usage=True, pad_token_id=0) + +pipe = pipeline("text-generation", model=model, tokenizer=tokenizer) + +result = pipe(prompt, max_new_tokens=60)[0]["generated_text"][len(prompt):] +result +``` + +**Output**: +``` +Here is a Python function that transforms bytes to Giga bytes:\n\n```\ndef bytes_to_gigabytes(bytes):\n return bytes / 1024 / 1024 / 1024\n```\n\nThis function takes a single argument +``` + +We're almost seeing the same output text as before - just the `python` is missing just before the code snippet. Let's see how much memory was required. + +```python +bytes_to_giga_bytes(torch.cuda.max_memory_allocated()) +``` + +**Output**: +``` +9.543574333190918 +``` + +Just 9.5GB! That's really not a lot for a >15 billion parameter model. + +While we see very little degradation in accuracy for our model here, 4-bit quantization can in practice often lead to different results compared to 8-bit quantization or full `bfloat16` inference. It is up to the user to try it out. 
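If you want finer control over how the 4-bit scheme behaves while trying it out, you can pass a [`BitsAndBytesConfig`] instead of the bare `load_in_4bit=True` flag. The settings below are an illustrative sketch, not what was used for the measurements above:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the de-quantized matrix multiplications
    bnb_4bit_quant_type="nf4",              # 4-bit data type: "nf4" or "fp4"
)
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/octocoder", quantization_config=quantization_config, pad_token_id=0
)
```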
+ +Also note that inference here was again a bit slower compared to 8-bit quantization which is due to the more aggressive quantization method used for 4-bit quantization leading to \\( \text{quantize} \\) and \\( \text{dequantize} \\) taking longer during inference. + +```python +del model +del pipe +``` +```python +flush() +``` + +Overall, we saw that running OctoCoder in 8-bit precision reduced the required GPU VRAM from 32G GPU VRAM to only 15GB and running the model in 4-bit precision further reduces the required GPU VRAM to just a bit over 9GB. + +4-bit quantization allows the model to be run on GPUs such as RTX3090, V100, and T4 which are quite accessible for most people. + +For more information on quantization and to see how one can quantize models to require even less GPU VRAM memory than 4-bit, we recommend looking into the [`AutoGPTQ`](https://huggingface.co/docs/transformers/main/en/main_classes/quantization#autogptq-integration%60) implementation. + +> As a conclusion, it is important to remember that model quantization trades improved memory efficiency against accuracy and in some cases inference time. + +If GPU memory is not a constraint for your use case, there is often no need to look into quantization. However many GPUs simply can't run LLMs without quantization methods and in this case, 4-bit and 8-bit quantization schemes are extremely useful tools. + +For more in-detail usage information, we strongly recommend taking a look at the [Transformers Quantization Docs](https://huggingface.co/docs/transformers/main_classes/quantization#general-usage). +Next, let's look into how we can improve computational and memory efficiency by using better algorithms and an improved model architecture. + +## 2. Flash Attention + +Today's top-performing LLMs share more or less the same fundamental architecture that consists of feed-forward layers, activation layers, layer normalization layers, and most crucially, self-attention layers. + +Self-attention layers are central to Large Language Models (LLMs) in that they enable the model to understand the contextual relationships between input tokens. +However, the peak GPU memory consumption for self-attention layers grows *quadratically* both in compute and memory complexity with number of input tokens (also called *sequence length*) that we denote in the following by \\( N \\) . +While this is not really noticeable for shorter input sequences (of up to 1000 input tokens), it becomes a serious problem for longer input sequences (at around 16000 input tokens). + +Let's take a closer look. The formula to compute the output \\( \mathbf{O} \\) of a self-attention layer for an input \\( \mathbf{X} \\) of length \\( N \\) is: + +$$ \textbf{O} = \text{Attn}(\mathbf{X}) = \mathbf{V} \times \text{Softmax}(\mathbf{QK}^T) \text{ with } \mathbf{Q} = \mathbf{W}_q \mathbf{X}, \mathbf{V} = \mathbf{W}_v \mathbf{X}, \mathbf{K} = \mathbf{W}_k \mathbf{X} $$ + +\\( \mathbf{X} = (\mathbf{x}_1, ... \mathbf{x}_{N}) \\) is thereby the input sequence to the attention layer. The projections \\( \mathbf{Q} \\) and \\( \mathbf{K} \\) will each consist of \\( N \\) vectors resulting in the \\( \mathbf{QK}^T \\) being of size \\( N^2 \\) . + +LLMs usually have multiple attention heads, thus doing multiple self-attention computations in parallel. +Assuming, the LLM has 40 attention heads and runs in bfloat16 precision, we can calculate the memory requirement to store the \\( \mathbf{QK^T} \\) matrices to be \\( 40 * 2 * N^2 \\) bytes. 
For \\( N=1000 \\), only around 80 MB of VRAM are needed; however, for \\( N=16000 \\) we would need 19 GB of VRAM, and for \\( N=100,000 \\) we would need almost 1TB just to store the \\( \mathbf{QK}^T \\) matrices.
+
+Long story short, the default self-attention algorithm quickly becomes prohibitively memory-expensive for large input contexts.
+
+As LLMs improve in text comprehension and generation, they are applied to increasingly complex tasks. While models once handled the translation or summarization of a few sentences, they now manage entire pages, demanding the capability to process extensive input lengths.
+
+How can we get rid of the exorbitant memory requirements for large input lengths? We need a new way to compute the self-attention mechanism that gets rid of the \\( QK^T \\) matrix. [Tri Dao et al.](https://arxiv.org/abs/2205.14135) developed exactly such a new algorithm and called it **Flash Attention**.
+
+In a nutshell, Flash Attention breaks the \\( \mathbf{V} \times \text{Softmax}(\mathbf{QK}^T) \\) computation apart and instead computes smaller chunks of the output by iterating over multiple softmax computation steps:
+
+$$ \textbf{O}_i \leftarrow s^a_{ij} * \textbf{O}_i + s^b_{ij} * \mathbf{V}_{j} \times \text{Softmax}(\mathbf{QK}^T_{i,j}) \text{ for multiple } i, j \text{ iterations} $$
+
+with \\( s^a_{ij} \\) and \\( s^b_{ij} \\) being some softmax normalization statistics that need to be recomputed for every \\( i \\) and \\( j \\).
+
+Please note that the full Flash Attention algorithm is a bit more complex and is greatly simplified here, as going into too much depth is out of scope for this guide. The reader is invited to take a look at the well-written [Flash Attention paper](https://arxiv.org/abs/2205.14135) for more details.
+
+The main takeaway here is:
+
+> By keeping track of softmax normalization statistics and by using some smart mathematics, Flash Attention gives **numerically identical** outputs compared to the default self-attention layer at a memory cost that only increases linearly with \\( N \\).
+
+Looking at the formula, one would intuitively say that Flash Attention must be much slower compared to the default self-attention formula as more computation needs to be done. Indeed, Flash Attention requires more FLOPs compared to normal attention as the softmax normalization statistics have to constantly be recomputed (see the [paper](https://arxiv.org/abs/2205.14135) for more details if interested).
+
+> However, Flash Attention is much faster in inference compared to default attention, which comes from its ability to significantly reduce the demands on the slower, high-bandwidth memory of the GPU (VRAM), focusing instead on the faster on-chip memory (SRAM).
+
+Essentially, Flash Attention makes sure that all intermediate write and read operations can be done using the fast *on-chip* SRAM memory instead of having to access the slower VRAM memory to compute the output vector \\( \mathbf{O} \\).
+
+In practice, there is currently absolutely no reason to **not** use Flash Attention if available. The algorithm gives mathematically the same outputs, and is both faster and more memory-efficient.
+
+Let's look at a practical example.
+
+Our OctoCoder model now gets a significantly longer input prompt which includes a so-called *system prompt*. System prompts are used to steer the LLM into a better assistant that is tailored to the user's task.
+In the following, we use a system prompt that will make OctoCoder a better coding assistant.
+
+```python
+system_prompt = """Below are a series of dialogues between various people and an AI technical assistant.
+The assistant tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble but knowledgeable.
+The assistant is happy to help with code questions and will do their best to understand exactly what is needed.
+It also tries to avoid giving false or misleading information, and it caveats when it isn't entirely sure about the right answer.
+That said, the assistant is practical, really does its best, and doesn't let caution get too much in the way of being useful.
+
+The Starcoder models are a series of 15.5B parameter models trained on 80+ programming languages from The Stack (v1.2) (excluding opt-out requests).
+The model uses Multi Query Attention, was trained using the Fill-in-the-Middle objective, and with 8,192 tokens context window for a trillion tokens of heavily deduplicated data.
+
+-----
+
+Question: Write a function that takes two lists and returns a list that has alternating elements from each input list.
+
+Answer: Sure. Here is a function that does that.
+
+def alternating(list1, list2):
+    results = []
+    for i in range(len(list1)):
+        results.append(list1[i])
+        results.append(list2[i])
+    return results
+
+Question: Can you write some test cases for this function?
+
+Answer: Sure, here are some tests.
+
+assert alternating([10, 20, 30], [1, 2, 3]) == [10, 1, 20, 2, 30, 3]
+assert alternating([True, False], [4, 5]) == [True, 4, False, 5]
+assert alternating([], []) == []
+
+Question: Modify the function so that it returns all input elements when the lists have uneven length. The elements from the longer list should be at the end.
+
+Answer: Here is the modified function.
+
+def alternating(list1, list2):
+    results = []
+    for i in range(min(len(list1), len(list2))):
+        results.append(list1[i])
+        results.append(list2[i])
+    if len(list1) > len(list2):
+        results.extend(list1[i+1:])
+    else:
+        results.extend(list2[i+1:])
+    return results
+
+-----
+"""
+```
+For demonstration purposes, we duplicate the system prompt ten times so that the input length is long enough to observe Flash Attention's memory savings.
+We then append the original text prompt `"Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer: Here"`.
+
+```python
+long_prompt = 10 * system_prompt + prompt
+```
+
+We instantiate our model again in bfloat16 precision.
+
+```python
+model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder", torch_dtype=torch.bfloat16, device_map="auto")
+tokenizer = AutoTokenizer.from_pretrained("bigcode/octocoder")
+
+pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
+```
+
+Let's now run the model just like before *without Flash Attention* and measure the peak GPU memory requirement and inference time.
+
+```python
+import time
+
+start_time = time.time()
+result = pipe(long_prompt, max_new_tokens=60)[0]["generated_text"][len(long_prompt):]
+
+print(f"Generated in {time.time() - start_time} seconds.")
+result
+```
+
+**Output**:
+```
+Generated in 10.96854019165039 seconds.
+Sure. Here is a function that does that.\n\ndef bytes_to_giga(bytes):\n  return bytes / 1024 / 1024 / 1024\n\nAnswer: Sure. Here is a function that does that.\n\ndef
+```
+
+We're getting the same output as before; however, this time the model repeats the answer multiple times until it reaches the 60-token cutoff.
This is not surprising as we've repeated the system prompt ten times for demonstration purposes and thus cued the model to repeat itself.
+
+**Note** that the system prompt should not be repeated ten times in real-world applications - one time is enough!
+
+Let's measure the peak GPU memory requirement.
+
+```python
+bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
+```
+
+**Output**:
+```
+37.668193340301514
+```
+
+As we can see, the peak GPU memory requirement is now significantly higher than in the beginning, which is largely due to the longer input sequence. The generation also takes around 11 seconds now.
+
+We call `flush()` to free GPU memory for our next experiment.
+
+```python
+flush()
+```
+
+For comparison, let's run the same function, but enable Flash Attention instead.
+To do so, we convert the model to [BetterTransformer](https://huggingface.co/docs/optimum/bettertransformer/overview), thereby enabling PyTorch's [SDPA self-attention](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention), which in turn is able to use Flash Attention.
+
+```python
+model.to_bettertransformer()
+```
+
+Now we run the exact same code snippet as before, and under the hood Transformers will make use of Flash Attention.
+
+```py
+start_time = time.time()
+with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
+    result = pipe(long_prompt, max_new_tokens=60)[0]["generated_text"][len(long_prompt):]
+
+print(f"Generated in {time.time() - start_time} seconds.")
+result
+```
+
+**Output**:
+```
+Generated in 3.0211617946624756 seconds.
+ Sure. Here is a function that does that.\n\ndef bytes_to_giga(bytes):\n  return bytes / 1024 / 1024 / 1024\n\nAnswer: Sure. Here is a function that does that.\n\ndef
+```
+
+We're getting the exact same result as before, but can observe a very significant speed-up thanks to Flash Attention.
+
+Let's measure the memory consumption one last time.
+
+```python
+bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
+```
+
+**Output**:
+```
+32.617331981658936
+```
+
+And we're almost back to our original 29GB peak GPU memory from the beginning.
+
+We can observe that we only use roughly 3.6 GB more GPU memory when passing a very long input sequence with Flash Attention compared to passing a short input sequence as done in the beginning.
+
+```py
+flush()
+```
+
+For more information on how to use Flash Attention, please have a look at [this doc page](https://huggingface.co/docs/transformers/en/perf_infer_gpu_one#flashattention-2).
+
+## 3. Architectural Innovations
+
+So far we have looked into improving computational and memory efficiency by:
+
+- Casting the weights to a lower precision format
+- Replacing the self-attention algorithm with a more memory- and compute-efficient version
+
+Let's now look into how we can change the architecture of an LLM so that it is most effective and efficient for tasks that require long text inputs, *e.g.*:
+- Retrieval-augmented Question Answering,
+- Summarization,
+- Chat
+
+Note that *chat* not only requires the LLM to handle long text inputs, but it also necessitates that the LLM is able to efficiently handle the back-and-forth dialogue between user and assistant (such as ChatGPT).
+
+Once trained, the fundamental LLM architecture is difficult to change, so it is important to think about the LLM's intended tasks beforehand and optimize the model's architecture accordingly.
+
+There are two important components of the model architecture that quickly become memory and/or performance bottlenecks for large input sequences.
+
+- The positional embeddings
+- The key-value cache
+
+Let's go over each component in more detail.
+
+### 3.1 Improving positional embeddings of LLMs
+
+Self-attention puts each token in relation to all other tokens.
+As an example, the \\( \text{Softmax}(\mathbf{QK}^T) \\) matrix of the text input sequence *"Hello", "I", "love", "you"* could look as follows:
+
+![](/blog/assets/163_optimize_llm/self_attn_tokens.png)
+
+Each word token is assigned a probability mass with which it attends to all other word tokens and is therefore put into relation with all of them. E.g. the word *"love"* attends to the word *"Hello"* with 5%, to *"I"* with 30%, and to itself with 65%.
+
+An LLM based on self-attention, but without position embeddings, would have great difficulty understanding the positions of the text inputs relative to each other.
+This is because the probability score computed by \\( \mathbf{QK}^T \\) relates each word token to each other word token in \\( O(1) \\) computations regardless of their relative positional distance to each other.
+Therefore, to an LLM without position embeddings each token appears to be at the same distance to all other tokens, *e.g.* differentiating between *"Hello I love you"* and *"You love I hello"* would be very challenging.
+
+For the LLM to understand sentence order, an additional *cue* is needed and is usually applied in the form of *positional encodings* (also called *positional embeddings*).
+Positional encodings encode the position of each token into a numerical representation that the LLM can leverage to better understand sentence order.
+
+The authors of the [*Attention Is All You Need*](https://arxiv.org/abs/1706.03762) paper introduced sinusoidal positional embeddings \\( \mathbf{P} = \mathbf{p}_1, \ldots, \mathbf{p}_N \\), where each vector \\( \mathbf{p}_i \\) is computed as a sinusoidal function of its position \\( i \\).
+The positional encodings are then simply added to the input sequence vectors \\( \mathbf{\hat{X}} = \mathbf{\hat{x}}_1, \ldots, \mathbf{\hat{x}}_N \\) = \\( \mathbf{x}_1 + \mathbf{p}_1, \ldots, \mathbf{x}_N + \mathbf{p}_N \\), thereby cueing the model to better learn sentence order.
+
+Instead of using fixed position embeddings, others (such as [Devlin et al.](https://arxiv.org/abs/1810.04805)) used learned positional encodings for which the positional embeddings
+\\( \mathbf{P} \\) are learned during training.
+
+Sinusoidal and learned position embeddings used to be the predominant methods to encode sentence order into LLMs, but a couple of problems related to these positional encodings were found:
+
+  1. Sinusoidal and learned position embeddings are both absolute positional embeddings, *i.e.* encoding a unique embedding for each position id: \\( 0, \ldots, N \\). As shown by [Huang et al.](https://arxiv.org/abs/2009.13658) and [Su et al.](https://arxiv.org/abs/2104.09864), absolute positional embeddings lead to poor LLM performance for long text inputs. For long text inputs, it is advantageous if the model learns the relative positional distance input tokens have to each other instead of their absolute position.
+  2. When using learned position embeddings, the LLM has to be trained on a fixed input length \\( N \\), which makes it difficult to extrapolate to an input length longer than what it was trained on.
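+
+Before moving on to relative approaches, here is a minimal PyTorch sketch of the sinusoidal encodings described above (our own illustrative implementation of the *Attention Is All You Need* formulation, not code taken from any particular model):
+
+```python
+import torch
+
+def sinusoidal_positions(num_positions: int, dim: int) -> torch.Tensor:
+    # p_i[2k] = sin(i / 10000^(2k/dim)), p_i[2k+1] = cos(i / 10000^(2k/dim))
+    positions = torch.arange(num_positions, dtype=torch.float32)[:, None]
+    inv_freq = 1.0 / (10000.0 ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
+    angles = positions * inv_freq
+    pe = torch.zeros(num_positions, dim)
+    pe[:, 0::2] = torch.sin(angles)
+    pe[:, 1::2] = torch.cos(angles)
+    return pe
+
+# The encodings are simply added to the (dummy) token embeddings: x_hat_i = x_i + p_i
+x = torch.randn(8, 16)
+x_hat = x + sinusoidal_positions(num_positions=8, dim=16)
+```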
+
+Recently, relative positional embeddings that can tackle the above-mentioned problems have become more popular, most notably:
+
+- [Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864)
+- [ALiBi](https://arxiv.org/abs/2108.12409)
+
+Both *RoPE* and *ALiBi* argue that it's best to cue the LLM about sentence order directly in the self-attention algorithm as it's there that word tokens are put into relation with each other. More specifically, sentence order should be cued by modifying the \\( \mathbf{QK}^T \\) computation.
+
+Without going into too many details, *RoPE* notes that positional information can be encoded into query-key pairs, *e.g.* \\( \mathbf{q}_i \\) and \\( \mathbf{x}_j \\), by rotating each vector by an angle \\( \theta * i \\) and \\( \theta * j \\) respectively, with \\( i, j \\) describing each vector's sentence position:
+
+$$ \mathbf{\hat{q}}_i^T \mathbf{\hat{x}}_j = \mathbf{{q}}_i^T \mathbf{R}_{\theta, i -j} \mathbf{{x}}_j. $$
+
+\\( \mathbf{R}_{\theta, i - j} \\) thereby represents a rotational matrix. \\( \theta \\) is *not* learned during training, but instead set to a pre-defined value that depends on the maximum input sequence length during training.
+
+> By doing so, the probability score between \\( \mathbf{q}_i \\) and \\( \mathbf{x}_j \\) is only affected if \\( i \ne j \\) and solely depends on the relative distance \\( i - j \\) regardless of each vector's specific positions \\( i \\) and \\( j \\).
+
+*RoPE* is used in several of today's most important LLMs, such as:
+
+- [**Falcon**](https://huggingface.co/tiiuae/falcon-40b)
+- [**Llama**](https://arxiv.org/abs/2302.13971)
+- [**PaLM**](https://arxiv.org/abs/2204.02311)
+
+As an alternative, *ALiBi* proposes a much simpler relative position encoding scheme. The relative distance that input tokens have to each other is added as a negative integer scaled by a pre-defined value `m` to each query-key entry of the \\( \mathbf{QK}^T \\) matrix right before the softmax computation.
+
+![](/blog/assets/163_optimize_llm/alibi.png)
+
+As shown in the [ALiBi](https://arxiv.org/abs/2108.12409) paper, this simple relative positional encoding allows the model to retain high performance even for very long text input sequences.
+
+*ALiBi* is used in several of today's most important LLMs, such as:
+
+- [**MPT**](https://huggingface.co/mosaicml/mpt-30b)
+- [**BLOOM**](https://huggingface.co/bigscience/bloom)
+
+Both *RoPE* and *ALiBi* position encodings can extrapolate to input lengths not seen during training, and it has been shown that extrapolation works much better out-of-the-box for *ALiBi* than for *RoPE*.
+For ALiBi, one simply increases the values of the lower triangular position matrix to match the length of the input sequence.
+For *RoPE*, keeping the same \\( \theta \\) that was used during training leads to poor results when passing text inputs much longer than those seen during training, *c.f.* [Press et al.](https://arxiv.org/abs/2108.12409). However, the community has found a couple of effective tricks that adapt \\( \theta \\), thereby allowing *RoPE* position embeddings to work well for extrapolated text input sequences (see [here](https://github.com/huggingface/transformers/pull/24653)).
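+
+To give a rough feeling for what this rotation looks like in code, here is a heavily simplified RoPE sketch for a single query/key vector (an illustration only; real implementations batch this over heads and positions and may pair the channels differently):
+
+```python
+import torch
+
+def rope_rotate(x: torch.Tensor, position: int, base: float = 10000.0) -> torch.Tensor:
+    # Rotate consecutive channel pairs of one vector by angles that grow linearly
+    # with its position (these angles play the role of theta * i in the text above).
+    d = x.shape[-1]
+    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
+    angles = position * inv_freq
+    cos, sin = torch.cos(angles), torch.sin(angles)
+    x1, x2 = x[..., 0::2], x[..., 1::2]
+    rotated = torch.empty_like(x)
+    rotated[..., 0::2] = x1 * cos - x2 * sin
+    rotated[..., 1::2] = x1 * sin + x2 * cos
+    return rotated
+
+q_i, k_j = torch.randn(64), torch.randn(64)  # dummy query/key vectors at positions i=5 and j=2
+score = rope_rotate(q_i, position=5) @ rope_rotate(k_j, position=2)  # depends only on i - j = 3
+```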
+
+> Both RoPE and ALiBi are relative positional embeddings that are *not* learned during training, but instead are based on the following intuitions:
+ - Positional cues about the text inputs should be given directly to the \\( QK^T \\) matrix of the self-attention layer
+ - The LLM should be incentivized to learn positional encodings that depend only on the constant *relative* distance tokens have to each other
+ - The further text input tokens are from each other, the lower their query-key probability should be. Both RoPE and ALiBi lower the query-key probability of tokens far away from each other: RoPE by decreasing their vector product through a larger angle between the query-key vectors, ALiBi by adding large negative numbers to the vector product
+
+In conclusion, LLMs that are intended to be deployed in tasks that require handling large text inputs are better trained with relative positional embeddings, such as RoPE and ALiBi. Also note that even if an LLM with RoPE or ALiBi has been trained only on a fixed length of, say, \\( N_1 = 2048 \\), it can still be used in practice with text inputs much larger than \\( N_1 \\), like \\( N_2 = 8192 > N_1 \\), by extrapolating the positional embeddings.
+
+### 3.2 The key-value cache
+
+Auto-regressive text generation with LLMs works by iteratively feeding in an input sequence, sampling the next token, appending the next token to the input sequence, and continuing to do so until the LLM produces a token that signifies that the generation has finished.
+
+Please have a look at [Transformer's Generate Text Tutorial](https://huggingface.co/docs/transformers/llm_tutorial#generate-text) to get a more visual explanation of how auto-regressive generation works.
+
+Let's run a quick code snippet to show how auto-regressive generation works in practice. We will simply take the most likely next token via `torch.argmax`.
+
+```python
+input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to("cuda")
+
+for _ in range(5):
+  next_logits = model(input_ids)["logits"][:, -1:]
+  next_token_id = torch.argmax(next_logits, dim=-1)
+
+  input_ids = torch.cat([input_ids, next_token_id], dim=-1)
+  print("shape of input_ids", input_ids.shape)
+
+generated_text = tokenizer.batch_decode(input_ids[:, -5:])
+generated_text
+```
+
+**Output**:
+```
+shape of input_ids torch.Size([1, 21])
+shape of input_ids torch.Size([1, 22])
+shape of input_ids torch.Size([1, 23])
+shape of input_ids torch.Size([1, 24])
+shape of input_ids torch.Size([1, 25])
+[' Here is a Python function']
+```
+
+As we can see, at every step we extend the text input tokens by the token we just sampled.
+
+With very few exceptions, LLMs are trained using the [causal language modeling objective](https://huggingface.co/docs/transformers/tasks/language_modeling#causal-language-modeling) and therefore mask the upper triangle of the attention score matrix - this is why in the two diagrams above the attention scores are left blank (*a.k.a.* have 0 probability). For a quick recap on causal language modeling you can refer to the [*Illustrated Self Attention blog*](https://jalammar.github.io/illustrated-gpt2/#part-2-illustrated-self-attention).
+
+As a consequence, tokens *never* depend on future tokens, more specifically the \\( \mathbf{q}_i \\) vector is never put in relation with any key and value vectors \\( \mathbf{k}_j, \mathbf{v}_j \\) if \\( j > i \\). Instead, \\( \mathbf{q}_i \\) only attends to previous key-value vectors \\( \mathbf{k}_{m < i}, \mathbf{v}_{m < i} \text{ , for } m \in \{0, \ldots, i - 1\} \\).
In order to reduce unnecessary computation, one can therefore cache each layer's key-value vectors for all previous timesteps. + +In the following, we will tell the LLM to make use of the key-value cache by retrieving and forwarding it for each forward pass. +In Transformers, we can retrieve the key-value cache by passing the `use_cache` flag to the `forward` call and can then pass it with the current token. + +```python +past_key_values = None # past_key_values is the key-value cache +generated_tokens = [] +next_token_id = tokenizer(prompt, return_tensors="pt")["input_ids"].to("cuda") + +for _ in range(5): + next_logits, past_key_values = model(next_token_id, past_key_values=past_key_values, use_cache=True).to_tuple() + next_logits = next_logits[:, -1:] + next_token_id = torch.argmax(next_logits, dim=-1) + + print("shape of input_ids", next_token_id.shape) + print("length of key-value cache", len(past_key_values[0][0])) # past_key_values are of shape [num_layers, 0 for k, 1 for v, batch_size, length, hidden_dim] + generated_tokens.append(next_token_id.item()) + +generated_text = tokenizer.batch_decode(generated_tokens) +generated_text +``` + +**Output**: +``` +shape of input_ids torch.Size([1, 1]) +length of key-value cache 20 +shape of input_ids torch.Size([1, 1]) +length of key-value cache 21 +shape of input_ids torch.Size([1, 1]) +length of key-value cache 22 +shape of input_ids torch.Size([1, 1]) +length of key-value cache 23 +shape of input_ids torch.Size([1, 1]) +length of key-value cache 24 +[' Here', ' is', ' a', ' Python', ' function'] +``` + +As one can see, when using the key-value cache the text input tokens are *not* increased in length, but remain a single input vector. The length of the key-value cache on the other hand is increased by one at every decoding step. + +> Making use of the key-value cache means that the \\( \mathbf{QK}^T \\) is essentially reduced to \\( \mathbf{q}_c\mathbf{K}^T \\) with \\( \mathbf{q}_c \\) being the query projection of the currently passed input token which is *always* just a single vector. + +Using the key-value cache has two advantages: +- Significant increase in computational efficiency as less computations are performed compared to computing the full \\( \mathbf{QK}^T \\) matrix. This leads to an increase in inference speed +- The maximum required memory is not increased quadratically with the number of generated tokens, but only increases linearly. + +> One should *always* make use of the key-value cache as it leads to identical results and a significant speed-up for longer input sequences. Transformers has the key-value cache enabled by default when making use of the text pipeline or the [`generate` method](https://huggingface.co/docs/transformers/main_classes/text_generation). + + + +Note that, despite our advice to use key-value caches, your LLM output may be slightly different when you use them. This is a property of the matrix multiplication kernels themselves -- you can read more about it [here](https://github.com/huggingface/transformers/issues/25420#issuecomment-1775317535). + + + +#### 3.2.1 Multi-round conversation + +The key-value cache is especially useful for applications such as chat where multiple passes of auto-regressive decoding are required. Let's look at an example. + +``` +User: How many people live in France? +Assistant: Roughly 75 million people live in France +User: And how many are in Germany? +Assistant: Germany has ca. 
81 million inhabitants
+```
+
+In this chat, the LLM runs auto-regressive decoding twice:
+  1. The first time, the key-value cache is empty and the input prompt is `"User: How many people live in France?"`, and the model auto-regressively generates the text `"Roughly 75 million people live in France"` while increasing the key-value cache at every decoding step.
+  2. The second time, the input prompt is `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many are in Germany?"`. Thanks to the cache, all key-value vectors for the first two sentences are already computed. Therefore the new input that actually needs to be processed only consists of `"User: And how many are in Germany?"`. While processing the shortened input prompt, its computed key-value vectors are concatenated to the key-value cache of the first decoding. The Assistant's second answer `"Germany has ca. 81 million inhabitants"` is then auto-regressively generated with the key-value cache consisting of the encoded key-value vectors of `"User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many are in Germany?"`.
+
+Two things should be noted here:
+  1. Keeping all the context is crucial for LLMs deployed in chat so that the LLM understands all the previous context of the conversation. E.g., in the example above, the LLM needs to understand that the user refers to the population when asking `"And how many are in Germany"`.
+  2. The key-value cache is extremely useful for chat as it allows us to continuously grow the encoded chat history instead of having to re-encode the chat history from scratch (as would e.g. be the case when using an encoder-decoder architecture).
+
+In `transformers`, a `generate` call will return `past_key_values` when `return_dict_in_generate=True` is passed, in addition to the default `use_cache=True`. Note that it is not yet available through the `pipeline` interface.
+
+```python
+# Generation as usual
+prompt = system_prompt + "Question: Please write a function in Python that transforms bytes to Giga bytes.\n\nAnswer: Here"
+model_inputs = tokenizer(prompt, return_tensors='pt')
+generation_output = model.generate(**model_inputs, max_new_tokens=60, return_dict_in_generate=True)
+decoded_output = tokenizer.batch_decode(generation_output.sequences)[0]
+
+# Piping the returned `past_key_values` to speed up the next conversation round
+prompt = decoded_output + "\nQuestion: How can I modify the function above to return Mega bytes instead?\n\nAnswer: Here"
+model_inputs = tokenizer(prompt, return_tensors='pt')
+generation_output = model.generate(
+  **model_inputs,
+  past_key_values=generation_output.past_key_values,
+  max_new_tokens=60,
+  return_dict_in_generate=True
+)
+tokenizer.batch_decode(generation_output.sequences)[0][len(prompt):]
+```
+
+**Output**:
+```
+ is a modified version of the function that returns Mega bytes instead.
+
+def bytes_to_megabytes(bytes):
+   return bytes / 1024 / 1024
+
+Answer: The function takes a number of bytes as input and returns the number of
+```
+
+Great, no additional time is spent recomputing the same keys and values for the attention layer! There is, however, one catch. While the required peak memory for the \\( \mathbf{QK}^T \\) matrix is significantly reduced, holding the key-value cache in memory can become very memory-expensive for long input sequences or multi-turn chat.
Remember that the key-value cache needs to store the key-value vectors for all previous input vectors \\( \mathbf{x}_i \text{, for } i \in \{1, \ldots, c - 1\} \\) for all self-attention layers and for all attention heads.
+
+Let's compute the number of float values that need to be stored in the key-value cache for the LLM `bigcode/octocoder` that we used before.
+The number of float values amounts to two times the sequence length times the number of attention heads times the attention head dimension times the number of layers.
+Computing this for our LLM at a hypothetical input sequence length of 16000 gives:
+
+```python
+config = model.config
+2 * 16_000 * config.n_layer * config.n_head * (config.n_embd // config.n_head)
+```
+
+**Output**:
+```
+7864320000
+```
+
+Roughly 8 billion float values! Storing 8 billion float values in `float16` precision requires around 15 GB of RAM, which is roughly half as much as the model weights themselves!
+Researchers have proposed two methods that significantly reduce the memory cost of storing the key-value cache; they are explored in the next subsections.
+
+#### 3.2.2 Multi-Query-Attention (MQA)
+
+[Multi-Query-Attention](https://arxiv.org/abs/1911.02150) was proposed in Noam Shazeer's *Fast Transformer Decoding: One Write-Head is All You Need* paper. As the title says, Noam found out that instead of using `n_head` key-value projection weights, one can use a single key-value projection weight pair that is shared across all attention heads without the model's performance degrading significantly.
+
+> By using a single key-value projection weight pair, the key-value vectors \\( \mathbf{k}_i, \mathbf{v}_i \\) have to be identical across all attention heads, which in turn means that we only need to store 1 key-value projection pair in the cache instead of `n_head` ones.
+
+As most LLMs use between 20 and 100 attention heads, MQA significantly reduces the memory consumption of the key-value cache. For the LLM used in this notebook, we could therefore reduce the required memory consumption from 15 GB to less than 400 MB at an input sequence length of 16000.
+
+In addition to memory savings, MQA also leads to improved computational efficiency, as explained in the following.
+In auto-regressive decoding, large key-value vectors need to be reloaded, concatenated with the current key-value vector pair, and then fed into the \\( \mathbf{q}_c\mathbf{K}^T \\) computation at every step. For auto-regressive decoding, the required memory bandwidth for the constant reloading can become a serious time bottleneck. By reducing the size of the key-value vectors, less memory needs to be accessed, thus reducing the memory bandwidth bottleneck. For more detail, please have a look at [Noam's paper](https://arxiv.org/abs/1911.02150).
+
+The important part to understand here is that reducing the number of key-value attention heads to 1 only makes sense if a key-value cache is used. The peak memory consumption of the model for a single forward pass without a key-value cache stays unchanged, as every attention head still has a unique query vector so that each attention head still has a different \\( \mathbf{QK}^T \\) matrix.
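+
+To see where the "less than 400 MB" figure above comes from, we can redo the cache-size computation from before with a variable number of key-value heads (a small sketch that simply reuses the `config` values of `bigcode/octocoder` loaded above):
+
+```python
+seq_len = 16_000
+head_dim = config.n_embd // config.n_head
+
+def kv_cache_bytes(num_kv_heads):
+    # 2 (keys and values) * sequence length * number of layers * kv heads * head dim * 2 bytes (float16)
+    return 2 * seq_len * config.n_layer * num_kv_heads * head_dim * 2
+
+print(f"{kv_cache_bytes(config.n_head) / 1e9:.2f} GB")  # vanilla multi-head attention: the ~15 GB from above
+print(f"{kv_cache_bytes(1) / 1e9:.2f} GB")              # MQA: a single shared key-value head, well below 0.4 GB
+```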
+
+MQA has seen wide adoption by the community and is now used by many of the most popular LLMs:
+
+- [**Falcon**](https://huggingface.co/tiiuae/falcon-40b)
+- [**PaLM**](https://arxiv.org/abs/2204.02311)
+- [**MPT**](https://huggingface.co/mosaicml/mpt-30b)
+- [**BLOOM**](https://huggingface.co/bigscience/bloom)
+
+Also, the checkpoint used in this notebook - `bigcode/octocoder` - makes use of MQA.
+
+#### 3.2.3 Grouped-Query-Attention (GQA)
+
+The authors of [Grouped-Query-Attention](https://arxiv.org/abs/2305.13245), Ainslie et al. from Google, found that using MQA can often lead to quality degradation compared to using vanilla multi-key-value head projections. The paper argues that more model performance can be kept by less drastically reducing the number of key-value head projection weights. Instead of using just a single key-value projection weight, `n < n_head` key-value projection weights should be used. By choosing `n` to be significantly smaller than `n_head`, such as 2, 4, or 8, almost all of the memory and speed gains from MQA can be kept while sacrificing less model capacity and thus arguably less performance.
+
+Moreover, the authors of GQA found that existing model checkpoints can be *uptrained* to have a GQA architecture with as little as 5% of the original pre-training compute. While 5% of the original pre-training compute can still be a massive amount, GQA *uptraining* allows existing checkpoints to be useful for longer input sequences.
+
+GQA was only recently proposed, which is why there is less adoption at the time of writing this notebook.
+The most notable application of GQA is [Llama-v2](https://huggingface.co/meta-llama/Llama-2-70b-hf).
+
+> In conclusion, it is strongly recommended to make use of either GQA or MQA if the LLM is deployed with auto-regressive decoding and is required to handle large input sequences, as is the case for example for chat.
+
+
+## Conclusion
+
+The research community is constantly coming up with new, nifty ways to speed up inference time for ever-larger LLMs. As an example, one such promising research direction is [speculative decoding](https://arxiv.org/abs/2211.17192) where "easy tokens" are generated by smaller, faster language models and only "hard tokens" are generated by the LLM itself. Going into more detail is out of the scope of this notebook, but you can read more about it in this [nice blog post](https://huggingface.co/blog/assisted-generation).
+
+The reason massive LLMs such as GPT3/4, Llama-2-70b, Claude, and PaLM can run so quickly in chat interfaces such as [Hugging Face Chat](https://huggingface.co/chat/) or ChatGPT is in large part thanks to the above-mentioned improvements in precision, algorithms, and architecture.
+Going forward, accelerators such as GPUs and TPUs will only get faster and allow for more memory, but one should nevertheless always make sure to use the best available algorithms and architectures to get the most bang for your buck 🤗
diff --git a/docs/source/en/main_classes/agent.md b/docs/source/en/main_classes/agent.md
new file mode 100644
index 00000000000000..dfcd375a81abdd
--- /dev/null
+++ b/docs/source/en/main_classes/agent.md
@@ -0,0 +1,105 @@
+
+
+# Agents & Tools
+
+
+
+Transformers Agents is an experimental API which is subject to change at any time. Results returned by the agents
+can vary as the APIs or underlying models are prone to change.
+
+
+
+To learn more about agents and tools make sure to read the [introductory guide](../transformers_agents).
This page +contains the API docs for the underlying classes. + +## Agents + +We provide three types of agents: [`HfAgent`] uses inference endpoints for opensource models, [`LocalAgent`] uses a model of your choice locally and [`OpenAiAgent`] uses OpenAI closed models. + +### HfAgent + +[[autodoc]] HfAgent + +### LocalAgent + +[[autodoc]] LocalAgent + +### OpenAiAgent + +[[autodoc]] OpenAiAgent + +### AzureOpenAiAgent + +[[autodoc]] AzureOpenAiAgent + +### Agent + +[[autodoc]] Agent + - chat + - run + - prepare_for_new_chat + +## Tools + +### load_tool + +[[autodoc]] load_tool + +### Tool + +[[autodoc]] Tool + +### PipelineTool + +[[autodoc]] PipelineTool + +### RemoteTool + +[[autodoc]] RemoteTool + +### launch_gradio_demo + +[[autodoc]] launch_gradio_demo + +## Agent Types + +Agents can handle any type of object in-between tools; tools, being completely multimodal, can accept and return +text, image, audio, video, among other types. In order to increase compatibility between tools, as well as to +correctly render these returns in ipython (jupyter, colab, ipython notebooks, ...), we implement wrapper classes +around these types. + +The wrapped objects should continue behaving as initially; a text object should still behave as a string, an image +object should still behave as a `PIL.Image`. + +These types have three specific purposes: + +- Calling `to_raw` on the type should return the underlying object +- Calling `to_string` on the type should return the object as a string: that can be the string in case of an `AgentText` + but will be the path of the serialized version of the object in other instances +- Displaying it in an ipython kernel should display the object correctly + +### AgentText + +[[autodoc]] transformers.tools.agent_types.AgentText + +### AgentImage + +[[autodoc]] transformers.tools.agent_types.AgentImage + +### AgentAudio + +[[autodoc]] transformers.tools.agent_types.AgentAudio diff --git a/docs/source/en/main_classes/backbones.md b/docs/source/en/main_classes/backbones.md new file mode 100644 index 00000000000000..efea7eb32a84c8 --- /dev/null +++ b/docs/source/en/main_classes/backbones.md @@ -0,0 +1,60 @@ + + +# Backbone + +A backbone is a model used for feature extraction for higher level computer vision tasks such as object detection and image classification. Transformers provides an [`AutoBackbone`] class for initializing a Transformers backbone from pretrained model weights, and two utility classes: + +* [`~utils.BackboneMixin`] enables initializing a backbone from Transformers or [timm](https://hf.co/docs/timm/index) and includes functions for returning the output features and indices. +* [`~utils.BackboneConfigMixin`] sets the output features and indices of the backbone configuration. + +[timm](https://hf.co/docs/timm/index) models are loaded with the [`TimmBackbone`] and [`TimmBackboneConfig`] classes. 
+ +Backbones are supported for the following models: + +* [BEiT](..model_doc/beit) +* [BiT](../model_doc/bit) +* [ConvNet](../model_doc/convnext) +* [ConvNextV2](../model_doc/convnextv2) +* [DiNAT](..model_doc/dinat) +* [DINOV2](../model_doc/dinov2) +* [FocalNet](../model_doc/focalnet) +* [MaskFormer](../model_doc/maskformer) +* [NAT](../model_doc/nat) +* [ResNet](../model_doc/resnet) +* [Swin Transformer](../model_doc/swin) +* [Swin Transformer v2](../model_doc/swinv2) +* [ViTDet](../model_doc/vitdet) + +## AutoBackbone + +[[autodoc]] AutoBackbone + +## BackboneMixin + +[[autodoc]] utils.BackboneMixin + +## BackboneConfigMixin + +[[autodoc]] utils.BackboneConfigMixin + +## TimmBackbone + +[[autodoc]] models.timm_backbone.TimmBackbone + +## TimmBackboneConfig + +[[autodoc]] models.timm_backbone.TimmBackboneConfig diff --git a/docs/source/en/main_classes/callback.mdx b/docs/source/en/main_classes/callback.md similarity index 84% rename from docs/source/en/main_classes/callback.mdx rename to docs/source/en/main_classes/callback.md index 33ae17c66df2bb..bc7323f5911ee6 100644 --- a/docs/source/en/main_classes/callback.mdx +++ b/docs/source/en/main_classes/callback.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Callbacks @@ -21,7 +25,7 @@ Callbacks are "read only" pieces of code, apart from the [`TrainerControl`] obje cannot change anything in the training loop. For customizations that require changes in the training loop, you should subclass [`Trainer`] and override the methods you need (see [trainer](trainer) for examples). -By default a [`Trainer`] will use the following callbacks: +By default, `TrainingArguments.report_to` is set to `"all"`, so a [`Trainer`] will use the following callbacks. - [`DefaultFlowCallback`] which handles the default behavior for logging, saving and evaluation. - [`PrinterCallback`] or [`ProgressCallback`] to display progress and print the @@ -39,6 +43,10 @@ By default a [`Trainer`] will use the following callbacks: installed. - [`~integrations.ClearMLCallback`] if [clearml](https://github.com/allegroai/clearml) is installed. - [`~integrations.DagsHubCallback`] if [dagshub](https://dagshub.com/) is installed. +- [`~integrations.FlyteCallback`] if [flyte](https://flyte.org/) is installed. +- [`~integrations.DVCLiveCallback`] if [dvclive](https://dvc.org/doc/dvclive) is installed. + +If a package is installed but you don't wish to use the accompanying integration, you can change `TrainingArguments.report_to` to a list of just those integrations you want to use (e.g. `["azure_ml", "wandb"]`). The main class that implements callbacks is [`TrainerCallback`]. 
It gets the [`TrainingArguments`] used to instantiate the [`Trainer`], can access that @@ -79,6 +87,11 @@ Here is the list of the available [`TrainerCallback`] in the library: [[autodoc]] integrations.DagsHubCallback +[[autodoc]] integrations.FlyteCallback + +[[autodoc]] integrations.DVCLiveCallback + - setup + ## TrainerCallback [[autodoc]] TrainerCallback diff --git a/docs/source/en/main_classes/configuration.mdx b/docs/source/en/main_classes/configuration.md similarity index 87% rename from docs/source/en/main_classes/configuration.mdx rename to docs/source/en/main_classes/configuration.md index 541781eff76a03..0cfef06d3ce9ca 100644 --- a/docs/source/en/main_classes/configuration.mdx +++ b/docs/source/en/main_classes/configuration.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Configuration diff --git a/docs/source/en/main_classes/data_collator.mdx b/docs/source/en/main_classes/data_collator.md similarity index 92% rename from docs/source/en/main_classes/data_collator.mdx rename to docs/source/en/main_classes/data_collator.md index ee1c1418e493a2..74e653dd1185e9 100644 --- a/docs/source/en/main_classes/data_collator.mdx +++ b/docs/source/en/main_classes/data_collator.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Data Collator diff --git a/docs/source/en/main_classes/deepspeed.md b/docs/source/en/main_classes/deepspeed.md new file mode 100644 index 00000000000000..5863f66621a3ee --- /dev/null +++ b/docs/source/en/main_classes/deepspeed.md @@ -0,0 +1,32 @@ + + +# DeepSpeed + +[DeepSpeed](https://github.com/microsoft/DeepSpeed), powered by Zero Redundancy Optimizer (ZeRO), is an optimization library for training and fitting very large models onto a GPU. It is available in several ZeRO stages, where each stage progressively saves more GPU memory by partitioning the optimizer state, gradients, parameters, and enabling offloading to a CPU or NVMe. DeepSpeed is integrated with the [`Trainer`] class and most of the setup is automatically taken care of for you. + +However, if you want to use DeepSpeed without the [`Trainer`], Transformers provides a [`HfDeepSpeedConfig`] class. + + + +Learn more about using DeepSpeed with [`Trainer`] in the [DeepSpeed](../deepspeed) guide. 
+ + + +## HfDeepSpeedConfig + +[[autodoc]] integrations.HfDeepSpeedConfig + - all diff --git a/docs/source/en/main_classes/deepspeed.mdx b/docs/source/en/main_classes/deepspeed.mdx deleted file mode 100644 index ccbb954a4f6813..00000000000000 --- a/docs/source/en/main_classes/deepspeed.mdx +++ /dev/null @@ -1,2263 +0,0 @@ - - -# DeepSpeed Integration - -[DeepSpeed](https://github.com/microsoft/DeepSpeed) implements everything described in the [ZeRO paper](https://arxiv.org/abs/1910.02054). Currently it provides full support for: - -1. Optimizer state partitioning (ZeRO stage 1) -2. Gradient partitioning (ZeRO stage 2) -3. Parameter partitioning (ZeRO stage 3) -4. Custom mixed precision training handling -5. A range of fast CUDA-extension-based optimizers -6. ZeRO-Offload to CPU and NVMe - -ZeRO-Offload has its own dedicated paper: [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840). And NVMe-support is described in the paper [ZeRO-Infinity: Breaking the GPU -Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857). - -DeepSpeed ZeRO-2 is primarily used only for training, as its features are of no use to inference. - -DeepSpeed ZeRO-3 can be used for inference as well, since it allows huge models to be loaded on multiple GPUs, which -won't be possible on a single GPU. - -🤗 Transformers integrates [DeepSpeed](https://github.com/microsoft/DeepSpeed) via 2 options: - -1. Integration of the core DeepSpeed features via [`Trainer`]. This is an everything-done-for-you type - of integration - just supply your custom config file or use our template and you have nothing else to do. Most of - this document is focused on this feature. -2. If you don't use [`Trainer`] and want to use your own Trainer where you integrated DeepSpeed - yourself, core functionality functions like `from_pretrained` and `from_config` include integration of essential - parts of DeepSpeed like `zero.Init` for ZeRO stage 3 and higher. To tap into this feature read the docs on - [non-Trainer DeepSpeed Integration](#nontrainer-deepspeed-integration). - -What is integrated: - -Training: - -1. DeepSpeed ZeRO training supports the full ZeRO stages 1, 2 and 3 with ZeRO-Infinity (CPU and NVME offload). - -Inference: - -1. DeepSpeed ZeRO Inference supports ZeRO stage 3 with ZeRO-Infinity. It uses the same ZeRO protocol as training, but - it doesn't use an optimizer and a lr scheduler and only stage 3 is relevant. For more details see: - [zero-inference](#zero-inference). - -There is also DeepSpeed Inference - this is a totally different technology which uses Tensor Parallelism instead of -ZeRO (coming soon). - - - - - - -## Trainer Deepspeed Integration - - - - -### Installation - -Install the library via pypi: - -```bash -pip install deepspeed -``` - -or via `transformers`' `extras`: - -```bash -pip install transformers[deepspeed] -``` - -or find more details on [the DeepSpeed's GitHub page](https://github.com/microsoft/deepspeed#installation) and -[advanced install](https://www.deepspeed.ai/tutorials/advanced-install/). - -If you're still struggling with the build, first make sure to read [CUDA Extension Installation Notes](trainer#cuda-extension-installation-notes). - -If you don't prebuild the extensions and rely on them to be built at run time and you tried all of the above solutions -to no avail, the next thing to try is to pre-build the modules before installing them. 
- -To make a local build for DeepSpeed: - -```bash -git clone https://github.com/microsoft/DeepSpeed/ -cd DeepSpeed -rm -rf build -TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \ ---global-option="build_ext" --global-option="-j8" --no-cache -v \ ---disable-pip-version-check 2>&1 | tee build.log -``` - -If you intend to use NVMe offload you will also need to include `DS_BUILD_AIO=1` in the instructions above (and also -install *libaio-dev* system-wide). - -Edit `TORCH_CUDA_ARCH_LIST` to insert the code for the architectures of the GPU cards you intend to use. Assuming all -your cards are the same you can get the arch via: - -```bash -CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())" -``` - -So if you get `8, 6`, then use `TORCH_CUDA_ARCH_LIST="8.6"`. If you have multiple different cards, you can list all -of them like so `TORCH_CUDA_ARCH_LIST="6.1;8.6"` - -If you need to use the same setup on multiple machines, make a binary wheel: - -```bash -git clone https://github.com/microsoft/DeepSpeed/ -cd DeepSpeed -rm -rf build -TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 \ -python setup.py build_ext -j8 bdist_wheel -``` - -it will generate something like `dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl` which now you can install -as `pip install deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl` locally or on any other machine. - -Again, remember to ensure to adjust `TORCH_CUDA_ARCH_LIST` to the target architectures. - -You can find the complete list of NVIDIA GPUs and their corresponding **Compute Capabilities** (same as arch in this -context) [here](https://developer.nvidia.com/cuda-gpus). - -You can check the archs pytorch was built with using: - -```bash -python -c "import torch; print(torch.cuda.get_arch_list())" -``` - -Here is how to find out the arch for one of the installed GPUs. For example, for GPU 0: - -```bash -CUDA_VISIBLE_DEVICES=0 python -c "import torch; \ -print(torch.cuda.get_device_properties(torch.device('cuda')))" -``` - -If the output is: - -```bash -_CudaDeviceProperties(name='GeForce RTX 3090', major=8, minor=6, total_memory=24268MB, multi_processor_count=82) -``` - -then you know that this card's arch is `8.6`. - -You can also leave `TORCH_CUDA_ARCH_LIST` out completely and then the build program will automatically query the -architecture of the GPUs the build is made on. This may or may not match the GPUs on the target machines, that's why -it's best to specify the desired archs explicitly. - -If after trying everything suggested you still encounter build issues, please, proceed with the GitHub Issue of -[Deepspeed](https://github.com/microsoft/DeepSpeed/issues), - - - - - -### Deployment with multiple GPUs - -To deploy the DeepSpeed integration adjust the [`Trainer`] command line arguments to include a new argument `--deepspeed ds_config.json`, where `ds_config.json` is the DeepSpeed configuration file as - documented [here](https://www.deepspeed.ai/docs/config-json/). The file naming is up to you. - -You can use a launcher of your choice here. You can continue using the pytorch launcher: - -```bash -torch.distributed.run --nproc_per_node=2 your_program.py --deepspeed ds_config.json -``` -or use the launcher provided by `deepspeed`: - -```bash -deepspeed --num_gpus=2 your_program.py --deepspeed ds_config.json -``` - -As you can see the arguments aren't the same, but for most needs either of them works. 
The -full details on how to configure various nodes and GPUs can be found [here](https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node). - -When you use the `deepspeed` launcher and you want to use all available gpus you can just omit the `--num_gpus` flag. - -Here is an example of running `run_translation.py` under DeepSpeed deploying all available GPUs: - -```bash -deepspeed examples/pytorch/translation/run_translation.py \ ---deepspeed tests/deepspeed/ds_config_zero3.json \ ---model_name_or_path t5-small --per_device_train_batch_size 1 \ ---output_dir output_dir --overwrite_output_dir --fp16 \ ---do_train --max_train_samples 500 --num_train_epochs 1 \ ---dataset_name wmt16 --dataset_config "ro-en" \ ---source_lang en --target_lang ro -``` - -Note that in the DeepSpeed documentation you are likely to see `--deepspeed --deepspeed_config ds_config.json` - i.e. -two DeepSpeed-related arguments, but for the sake of simplicity, and since there are already so many arguments to deal -with, we combined the two into a single argument. - -For some practical usage examples, please, see this [post](https://github.com/huggingface/transformers/issues/8771#issuecomment-759248400). - - - - - -### Deployment with one GPU - -To deploy DeepSpeed with one GPU adjust the [`Trainer`] command line arguments as follows: - -```bash -deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \ ---deepspeed tests/deepspeed/ds_config_zero2.json \ ---model_name_or_path t5-small --per_device_train_batch_size 1 \ ---output_dir output_dir --overwrite_output_dir --fp16 \ ---do_train --max_train_samples 500 --num_train_epochs 1 \ ---dataset_name wmt16 --dataset_config "ro-en" \ ---source_lang en --target_lang ro -``` - -This is almost the same as with multiple-GPUs, but here we tell DeepSpeed explicitly to use just one GPU via -`--num_gpus=1`. By default, DeepSpeed deploys all GPUs it can see on the given node. If you have only 1 GPU to start -with, then you don't need this argument. The following [documentation](https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node) discusses the launcher options. - -Why would you want to use DeepSpeed with just one GPU? - -1. It has a ZeRO-offload feature which can delegate some computations and memory to the host's CPU and RAM, and thus - leave more GPU resources for model's needs - e.g. larger batch size, or enabling a fitting of a very big model which - normally won't fit. -2. It provides a smart GPU memory management system, that minimizes memory fragmentation, which again allows you to fit - bigger models and data batches. - -While we are going to discuss the configuration in details next, the key to getting a huge improvement on a single GPU -with DeepSpeed is to have at least the following configuration in the configuration file: - -```json -{ - "zero_optimization": { - "stage": 2, - "offload_optimizer": { - "device": "cpu", - "pin_memory": true - }, - "allgather_partitions": true, - "allgather_bucket_size": 2e8, - "reduce_scatter": true, - "reduce_bucket_size": 2e8, - "overlap_comm": true, - "contiguous_gradients": true - } -} -``` - -which enables optimizer offload and some other important features. You may experiment with the buffer sizes, you will -find more details in the discussion below. - -For a practical usage example of this type of deployment, please, see this [post](https://github.com/huggingface/transformers/issues/8771#issuecomment-759176685). 
- -You may also try the ZeRO-3 with CPU and NVMe offload as explained further in this document. - - - -Notes: - -- if you need to run on a specific GPU, which is different from GPU 0, you can't use `CUDA_VISIBLE_DEVICES` to limit - the visible scope of available GPUs. Instead, you have to use the following syntax: - - ```bash - deepspeed --include localhost:1 examples/pytorch/translation/run_translation.py ... - ``` - - In this example, we tell DeepSpeed to use GPU 1 (second gpu). - - - - - -### Deployment with multiple Nodes - -The information in this section isn't not specific to the DeepSpeed integration and is applicable to any multi-node program. But DeepSpeed provides a `deepspeed` launcher that is easier to use than other launchers unless you are in a SLURM environment. - -For the duration of this section let's assume that you have 2 nodes with 8 gpus each. And you can reach the first node with `ssh hostname1` and second node with `ssh hostname2`, and both must be able to reach each other via ssh locally without a password. Of course, you will need to rename these host (node) names to the actual host names you are working with. - -#### The torch.distributed.run launcher - - -For example, to use `torch.distributed.run`, you could do: - -```bash -python -m torch.distributed.run --nproc_per_node=8 --nnode=2 --node_rank=0 --master_addr=hostname1 \ ---master_port=9901 your_program.py --deepspeed ds_config.json -``` - -You have to ssh to each node and run this same command on each one of them! There is no rush, the launcher will wait until both nodes will synchronize. - -For more information please see [torchrun](https://pytorch.org/docs/stable/elastic/run.html). Incidentally, this is also the launcher that replaced `torch.distributed.launch` a few pytorch versions back. - - -#### The deepspeed launcher - -To use the `deepspeed` launcher instead, you have to first create a `hostfile` file: - -``` -hostname1 slots=8 -hostname2 slots=8 -``` -and then you can launch it as: - -```bash -deepspeed --num_gpus 8 --num_nodes 2 --hostfile hostfile --master_addr hostname1 --master_port=9901 \ -your_program.py --deepspeed ds_config.json -``` - -Unlike the `torch.distributed.run` launcher, `deepspeed` will automatically launch this command on both nodes! - -For more information please see [Resource Configuration (multi-node)](https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node). - - -#### Launching in a SLURM environment - -In the SLURM environment the following approach can be used. The following is a slurm script `launch.slurm` which you will need to adapt it to your specific SLURM environment. - -```bash -#SBATCH --job-name=test-nodes # name -#SBATCH --nodes=2 # nodes -#SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node! 
-#SBATCH --cpus-per-task=10 # number of cores per tasks -#SBATCH --gres=gpu:8 # number of gpus -#SBATCH --time 20:00:00 # maximum execution time (HH:MM:SS) -#SBATCH --output=%x-%j.out # output file name - -export GPUS_PER_NODE=8 -export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1) -export MASTER_PORT=9901 - -srun --jobid $SLURM_JOBID bash -c 'python -m torch.distributed.run \ - --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES --node_rank $SLURM_PROCID \ - --master_addr $MASTER_ADDR --master_port $MASTER_PORT \ -your_program.py --deepspeed ds_config.json' -``` - -All is left is to schedule it to run: -```bash -sbatch launch.slurm -``` - -`srun` will take care of launching the program simultaneously on all nodes. - - -#### Use of Non-shared filesystem - -By default DeepSpeed expects that a multi-node environment uses a shared storage. If this is not the case and each node can only see the local filesystem, you need to adjust the config file to include a [`checkpoint`_section](https://www.deepspeed.ai/docs/config-json/#checkpoint-options) with the following setting: - -```json -{ - "checkpoint": { - "use_node_local_storage": true - } -} -``` - -Alternatively, you can also use the [`Trainer`]'s `--save_on_each_node` argument, and the above config will be added automatically for you. - - - - -### Deployment in Notebooks - -The problem with running notebook cells as a script is that there is no normal `deepspeed` launcher to rely on, so -under certain setups we have to emulate it. - -If you're using only 1 GPU, here is how you'd have to adjust your training code in the notebook to use DeepSpeed. - -```python -# DeepSpeed requires a distributed environment even when only one process is used. -# This emulates a launcher in the notebook -import os - -os.environ["MASTER_ADDR"] = "localhost" -os.environ["MASTER_PORT"] = "9994" # modify if RuntimeError: Address already in use -os.environ["RANK"] = "0" -os.environ["LOCAL_RANK"] = "0" -os.environ["WORLD_SIZE"] = "1" - -# Now proceed as normal, plus pass the deepspeed config file -training_args = TrainingArguments(..., deepspeed="ds_config_zero3.json") -trainer = Trainer(...) -trainer.train() -``` - -Note: `...` stands for the normal arguments that you'd pass to the functions. - -If you want to use more than 1 GPU, you must use a multi-process environment for DeepSpeed to work. That is, you have -to use the launcher for that purpose and this cannot be accomplished by emulating the distributed environment presented -at the beginning of this section. 
- -If you want to create the config file on the fly in the notebook in the current directory, you could have a dedicated -cell with: - -```python no-style -%%bash -cat <<'EOT' > ds_config_zero3.json -{ - "fp16": { - "enabled": "auto", - "loss_scale": 0, - "loss_scale_window": 1000, - "initial_scale_power": 16, - "hysteresis": 2, - "min_loss_scale": 1 - }, - - "optimizer": { - "type": "AdamW", - "params": { - "lr": "auto", - "betas": "auto", - "eps": "auto", - "weight_decay": "auto" - } - }, - - "scheduler": { - "type": "WarmupLR", - "params": { - "warmup_min_lr": "auto", - "warmup_max_lr": "auto", - "warmup_num_steps": "auto" - } - }, - - "zero_optimization": { - "stage": 3, - "offload_optimizer": { - "device": "cpu", - "pin_memory": true - }, - "offload_param": { - "device": "cpu", - "pin_memory": true - }, - "overlap_comm": true, - "contiguous_gradients": true, - "sub_group_size": 1e9, - "reduce_bucket_size": "auto", - "stage3_prefetch_bucket_size": "auto", - "stage3_param_persistence_threshold": "auto", - "stage3_max_live_parameters": 1e9, - "stage3_max_reuse_distance": 1e9, - "stage3_gather_16bit_weights_on_model_save": true - }, - - "gradient_accumulation_steps": "auto", - "gradient_clipping": "auto", - "steps_per_print": 2000, - "train_batch_size": "auto", - "train_micro_batch_size_per_gpu": "auto", - "wall_clock_breakdown": false -} -EOT -``` - -If the training script is in a normal file and not in the notebook cells, you can launch `deepspeed` normally via -shell from a cell. For example, to use `run_translation.py` you would launch it with: - -```python no-style -!git clone https://github.com/huggingface/transformers -!cd transformers; deepspeed examples/pytorch/translation/run_translation.py ... -``` - -or with `%%bash` magic, where you can write a multi-line code for the shell program to run: - -```python no-style -%%bash - -git clone https://github.com/huggingface/transformers -cd transformers -deepspeed examples/pytorch/translation/run_translation.py ... -``` - -In such case you don't need any of the code presented at the beginning of this section. - -Note: While `%%bash` magic is neat, but currently it buffers the output so you won't see the logs until the process -completes. - - - - - - -### Configuration - -For the complete guide to the DeepSpeed configuration options that can be used in its configuration file please refer -to the [following documentation](https://www.deepspeed.ai/docs/config-json/). - -You can find dozens of DeepSpeed configuration examples that address various practical needs in [the DeepSpeedExamples -repo](https://github.com/microsoft/DeepSpeedExamples): - -```bash -git clone https://github.com/microsoft/DeepSpeedExamples -cd DeepSpeedExamples -find . -name '*json' -``` - -Continuing the code from above, let's say you're looking to configure the Lamb optimizer. So you can search through the -example `.json` files with: - -```bash -grep -i Lamb $(find . -name '*json') -``` - -Some more examples are to be found in the [main repo](https://github.com/microsoft/DeepSpeed) as well. - -When using DeepSpeed you always need to supply a DeepSpeed configuration file, yet some configuration parameters have -to be configured via the command line. You will find the nuances in the rest of this guide. 
- -To get an idea of what DeepSpeed configuration file looks like, here is one that activates ZeRO stage 2 features, -including optimizer states cpu offload, uses `AdamW` optimizer and `WarmupLR` scheduler and will enable mixed -precision training if `--fp16` is passed: - -```json -{ - "fp16": { - "enabled": "auto", - "loss_scale": 0, - "loss_scale_window": 1000, - "initial_scale_power": 16, - "hysteresis": 2, - "min_loss_scale": 1 - }, - - "optimizer": { - "type": "AdamW", - "params": { - "lr": "auto", - "betas": "auto", - "eps": "auto", - "weight_decay": "auto" - } - }, - - "scheduler": { - "type": "WarmupLR", - "params": { - "warmup_min_lr": "auto", - "warmup_max_lr": "auto", - "warmup_num_steps": "auto" - } - }, - - "zero_optimization": { - "stage": 2, - "offload_optimizer": { - "device": "cpu", - "pin_memory": true - }, - "allgather_partitions": true, - "allgather_bucket_size": 2e8, - "overlap_comm": true, - "reduce_scatter": true, - "reduce_bucket_size": 2e8, - "contiguous_gradients": true - }, - - "gradient_accumulation_steps": "auto", - "gradient_clipping": "auto", - "train_batch_size": "auto", - "train_micro_batch_size_per_gpu": "auto", -} -``` - -When you execute the program, DeepSpeed will log the configuration it received from the [`Trainer`] -to the console, so you can see exactly what was the final configuration passed to it. - - - - - -### Passing Configuration - -As discussed in this document normally the DeepSpeed configuration is passed as a path to a json file, but if you're -not using the command line interface to configure the training, and instead instantiate the -[`Trainer`] via [`TrainingArguments`] then for the `deepspeed` argument you can -pass a nested `dict`. This allows you to create the configuration on the fly and doesn't require you to write it to -the file system before passing it to [`TrainingArguments`]. - -To summarize you can do: - -```python -TrainingArguments(..., deepspeed="/path/to/ds_config.json") -``` - -or: - -```python -ds_config_dict = dict(scheduler=scheduler_params, optimizer=optimizer_params) -TrainingArguments(..., deepspeed=ds_config_dict) -``` - - - -### Shared Configuration - - - - -This section is a must-read - - - -Some configuration values are required by both the [`Trainer`] and DeepSpeed to function correctly, -therefore, to prevent conflicting definitions, which could lead to hard to detect errors, we chose to configure those -via the [`Trainer`] command line arguments. - -Additionally, some configuration values are derived automatically based on the model's configuration, so instead of -remembering to manually adjust multiple values, it's the best to let the [`Trainer`] do the majority -of configuration for you. - -Therefore, in the rest of this guide you will find a special configuration value: `auto`, which when set will be -automatically replaced with the correct or most efficient value. Please feel free to choose to ignore this -recommendation and set the values explicitly, in which case be very careful that your the -[`Trainer`] arguments and DeepSpeed configurations agree. For example, are you using the same -learning rate, or batch size, or gradient accumulation settings? if these mismatch the training may fail in very -difficult to detect ways. You have been warned. - -There are multiple other values that are specific to DeepSpeed-only and those you will have to set manually to suit -your needs. 
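-
-To make this concrete, here is a minimal sketch (with illustrative hyperparameter values) of keeping the [`Trainer`] arguments as the single source of truth while the DeepSpeed side of the shared values stays on `auto`:
-
-```python
-from transformers import TrainingArguments
-
-# Shared values are set once, on the Trainer side...
-training_args = TrainingArguments(
-    output_dir="output_dir",
-    learning_rate=3e-5,
-    per_device_train_batch_size=4,
-    gradient_accumulation_steps=2,
-    fp16=True,
-    # ...while the DeepSpeed side leaves them as "auto", so the two can never disagree.
-    deepspeed={
-        "zero_optimization": {"stage": 2},
-        "fp16": {"enabled": "auto"},
-        "optimizer": {
-            "type": "AdamW",
-            "params": {"lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto"},
-        },
-        "train_batch_size": "auto",
-        "train_micro_batch_size_per_gpu": "auto",
-        "gradient_accumulation_steps": "auto",
-    },
-)
-```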
- -In your own programs, you can also use the following approach if you'd like to modify the DeepSpeed config as a master -and configure [`TrainingArguments`] based on that. The steps are: - -1. Create or load the DeepSpeed configuration to be used as a master configuration -2. Create the [`TrainingArguments`] object based on these values - -Do note that some values, such as `scheduler.params.total_num_steps` are calculated by -[`Trainer`] during `train`, but you can of course do the math yourself. - - - -### ZeRO - -[Zero Redundancy Optimizer (ZeRO)](https://www.deepspeed.ai/tutorials/zero/) is the workhorse of DeepSpeed. It -supports 3 different levels (stages) of optimization. The first one is not quite interesting for scalability purposes, -therefore this document focuses on stages 2 and 3. Stage 3 is further improved by the latest addition of ZeRO-Infinity. -You will find more indepth information in the DeepSpeed documentation. - -The `zero_optimization` section of the configuration file is the most important part ([docs](https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training)), since that is where you define -which ZeRO stages you want to enable and how to configure them. You will find the explanation for each parameter in the -DeepSpeed docs. - -This section has to be configured exclusively via DeepSpeed configuration - the [`Trainer`] provides -no equivalent command line arguments. - -Note: currently DeepSpeed doesn't validate parameter names, so if you misspell any, it'll use the default setting for -the parameter that got misspelled. You can watch the DeepSpeed engine start up log messages to see what values it is -going to use. - - - - - -#### ZeRO-2 Config - -The following is an example of configuration for ZeRO stage 2: - -```json -{ - "zero_optimization": { - "stage": 2, - "offload_optimizer": { - "device": "cpu", - "pin_memory": true - }, - "allgather_partitions": true, - "allgather_bucket_size": 5e8, - "overlap_comm": true, - "reduce_scatter": true, - "reduce_bucket_size": 5e8, - "contiguous_gradients": true - } -} -``` - -**Performance tuning:** - -- enabling `offload_optimizer` should reduce GPU RAM usage (it requires `"stage": 2`) -- `"overlap_comm": true` trades off increased GPU RAM usage to lower all-reduce latency. `overlap_comm` uses 4.5x - the `allgather_bucket_size` and `reduce_bucket_size` values. So if they are set to 5e8, this requires a 9GB - footprint (`5e8 x 2Bytes x 2 x 4.5`). Therefore, if you have a GPU with 8GB or less RAM, to avoid getting - OOM-errors you will need to reduce those parameters to about `2e8`, which would require 3.6GB. You will want to do - the same on larger capacity GPU as well, if you're starting to hit OOM. -- when reducing these buffers you're trading communication speed to avail more GPU RAM. The smaller the buffer size is, - the slower the communication gets, and the more GPU RAM will be available to other tasks. So if a bigger batch size is - important, getting a slightly slower training time could be a good trade. - -Additionally, `deepspeed==0.4.4` added a new option `round_robin_gradients` which you can enable with: - -```json -{ - "zero_optimization": { - "round_robin_gradients": true - } -} -``` - -This is a stage 2 optimization for CPU offloading that parallelizes gradient copying to CPU memory among ranks by fine-grained gradient partitioning. Performance benefit grows with gradient accumulation steps (more copying between optimizer steps) or GPU count (increased parallelism). 
-
-
-
-
-#### ZeRO-3 Config
-
-The following is an example of configuration for ZeRO stage 3:
-
-```json
-{
-    "zero_optimization": {
-        "stage": 3,
-        "offload_optimizer": {
-            "device": "cpu",
-            "pin_memory": true
-        },
-        "offload_param": {
-            "device": "cpu",
-            "pin_memory": true
-        },
-        "overlap_comm": true,
-        "contiguous_gradients": true,
-        "sub_group_size": 1e9,
-        "reduce_bucket_size": "auto",
-        "stage3_prefetch_bucket_size": "auto",
-        "stage3_param_persistence_threshold": "auto",
-        "stage3_max_live_parameters": 1e9,
-        "stage3_max_reuse_distance": 1e9,
-        "stage3_gather_16bit_weights_on_model_save": true
-    }
-}
-```
-
-If you are getting OOMs because your model or activations don't fit into the GPU memory and you have unutilized CPU memory, offloading the optimizer states and parameters to CPU memory with `"device": "cpu"` may solve this limitation. If you don't want to offload to CPU memory, use `none` instead of `cpu` for the `device` entry. Offloading to NVMe is discussed further down.
-
-Pinned memory is enabled with `pin_memory` set to `true`. This feature can improve the throughput at the cost of making less memory available to other processes. Pinned memory is set aside for the specific process that requested it, and it's typically accessed much faster than normal CPU memory.
-
-**Performance tuning:**
-
-- `stage3_max_live_parameters`: `1e9`
-- `stage3_max_reuse_distance`: `1e9`
-
-If hitting OOM, reduce `stage3_max_live_parameters` and `stage3_max_reuse_distance`. They should have minimal impact on performance unless you are doing activation checkpointing. `1e9` would consume ~2GB. The memory is shared by `stage3_max_live_parameters` and `stage3_max_reuse_distance`, so it's not additive, it's just 2GB total.
-
-`stage3_max_live_parameters` is the upper limit on how many full parameters you want to keep on the GPU at any given time. "reuse distance" is a metric we are using to figure out when a parameter will be used again in the future, and we use the `stage3_max_reuse_distance` to decide whether to throw away the parameter or to keep it. If a parameter is going to be used again in the near future (less than `stage3_max_reuse_distance`), then we keep it to reduce communication overhead. This is super helpful when you have activation checkpointing enabled, where we do a forward recompute and backward pass at a single layer granularity and want to keep the parameter in GPU memory from the forward recompute until the backward pass.
-
-The following configuration values depend on the model's hidden size:
-
-- `reduce_bucket_size`: `hidden_size*hidden_size`
-- `stage3_prefetch_bucket_size`: `0.9 * hidden_size * hidden_size`
-- `stage3_param_persistence_threshold`: `10 * hidden_size`
-
-therefore set these values to `auto` and the [`Trainer`] will automatically assign the recommended values. But, of course, feel free to set these explicitly as well.
-
-`stage3_gather_16bit_weights_on_model_save` enables model fp16 weights consolidation when the model gets saved. With large models and multiple GPUs this is an expensive operation both in terms of memory and speed. It's currently required if you plan to resume the training. Watch out for future updates that will remove this limitation and make things more flexible.
-
-If you're migrating from a ZeRO-2 configuration, note that the `allgather_partitions`, `allgather_bucket_size` and `reduce_scatter` configuration parameters are not used in ZeRO-3. If you keep these in the config file they will just be ignored.
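-
-If you prefer to set the hidden-size-dependent values above explicitly instead of relying on `auto`, you can derive the recommended values yourself. A small sketch, assuming your model's config exposes `hidden_size` (the model name is only an example):
-
-```python
-from transformers import AutoConfig
-
-hidden_size = AutoConfig.from_pretrained("bert-base-uncased").hidden_size
-
-recommended = {
-    "reduce_bucket_size": hidden_size * hidden_size,
-    "stage3_prefetch_bucket_size": int(0.9 * hidden_size * hidden_size),
-    "stage3_param_persistence_threshold": 10 * hidden_size,
-}
-print(recommended)
-```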
- -- `sub_group_size`: `1e9` - -`sub_group_size` controls the granularity in which parameters are updated during optimizer steps. Parameters are -grouped into buckets of `sub_group_size` and each buckets is updated one at a time. When used with NVMe offload in -ZeRO-Infinity, `sub_group_size` therefore controls the granularity in which model states are moved in and out of CPU -memory from NVMe during the optimizer step. This prevents running out of CPU memory for extremely large models. - -You can leave `sub_group_size` to its default value of *1e9* when not using NVMe offload. You may want to change its -default value in the following cases: - -1. Running into OOM during optimizer step: Reduce `sub_group_size` to reduce memory utilization of temporary buffers -2. Optimizer Step is taking a long time: Increase `sub_group_size` to improve bandwidth utilization as a result of - the increased data buffers. - - -#### ZeRO-0 Config - -Note that we're listing Stage 0 and 1 last since they are rarely used. - -Stage 0 is disabling all types of sharding and just using DeepSpeed as DDP. You can turn it on with: - -```json -{ - "zero_optimization": { - "stage": 0 - } -} -``` - -Ths will essentially disable ZeRO without you needing to change anything else. - - -#### ZeRO-1 Config - - -Stage 1 is Stage 2 minus gradient sharding. You can always try it to speed things a tiny bit to only shard the optimizer states with: - - -```json -{ - "zero_optimization": { - "stage": 1 - } -} -``` - - - - - -### NVMe Support - -ZeRO-Infinity allows for training incredibly large models by extending GPU and CPU memory with NVMe memory. Thanks to -smart partitioning and tiling algorithms each GPU needs to send and receive very small amounts of data during -offloading so modern NVMe proved to be fit to allow for an even larger total memory pool available to your training -process. ZeRO-Infinity requires ZeRO-3 enabled. - -The following configuration example enables NVMe to offload both optimizer states and the params: - -```json -{ - "zero_optimization": { - "stage": 3, - "offload_optimizer": { - "device": "nvme", - "nvme_path": "/local_nvme", - "pin_memory": true, - "buffer_count": 4, - "fast_init": false - }, - "offload_param": { - "device": "nvme", - "nvme_path": "/local_nvme", - "pin_memory": true, - "buffer_count": 5, - "buffer_size": 1e8, - "max_in_cpu": 1e9 - }, - "aio": { - "block_size": 262144, - "queue_depth": 32, - "thread_count": 1, - "single_submit": false, - "overlap_events": true - }, - "overlap_comm": true, - "contiguous_gradients": true, - "sub_group_size": 1e9, - "reduce_bucket_size": "auto", - "stage3_prefetch_bucket_size": "auto", - "stage3_param_persistence_threshold": "auto", - "stage3_max_live_parameters": 1e9, - "stage3_max_reuse_distance": 1e9, - "stage3_gather_16bit_weights_on_model_save": true - }, -} -``` - -You can choose to offload both optimizer states and params to NVMe, or just one of them or none. For example, if you -have copious amounts of CPU memory available, by all means offload to CPU memory only as it'd be faster (hint: -*"device": "cpu"*). - -Here is the full documentation for offloading [optimizer states](https://www.deepspeed.ai/docs/config-json/#optimizer-offloading) and [parameters](https://www.deepspeed.ai/docs/config-json/#parameter-offloading). - -Make sure that your `nvme_path` is actually an NVMe, since it will work with the normal hard drive or SSD, but it'll -be much much slower. 
The fast scalable training was designed with modern NVMe transfer speeds in mind (as of this -writing one can have ~3.5GB/s read, ~3GB/s write peak speeds). - -In order to figure out the optimal `aio` configuration block you must run a benchmark on your target setup, as -[explained here](https://github.com/microsoft/DeepSpeed/issues/998). - - - - - -#### ZeRO-2 vs ZeRO-3 Performance - -ZeRO-3 is likely to be slower than ZeRO-2 if everything else is configured the same because the former has to gather -model weights in addition to what ZeRO-2 does. If ZeRO-2 meets your needs and you don't need to scale beyond a few GPUs -then you may choose to stick to it. It's important to understand that ZeRO-3 enables a much higher scalability capacity -at a cost of speed. - -It's possible to adjust ZeRO-3 configuration to make it perform closer to ZeRO-2: - -- set `stage3_param_persistence_threshold` to a very large number - larger than the largest parameter, e.g., `6 * hidden_size * hidden_size`. This will keep the parameters on the GPUs. -- turn off `offload_params` since ZeRO-2 doesn't have that option. - -The performance will likely improve significantly with just `offload_params` turned off, even if you don't change -`stage3_param_persistence_threshold`. Of course, these changes will impact the size of the model you can train. So -these help you to trade scalability for speed depending on your needs. - - - - - -#### ZeRO-2 Example - -Here is a full ZeRO-2 auto-configuration file `ds_config_zero2.json`: - -```json -{ - "fp16": { - "enabled": "auto", - "loss_scale": 0, - "loss_scale_window": 1000, - "initial_scale_power": 16, - "hysteresis": 2, - "min_loss_scale": 1 - }, - - "optimizer": { - "type": "AdamW", - "params": { - "lr": "auto", - "betas": "auto", - "eps": "auto", - "weight_decay": "auto" - } - }, - - "scheduler": { - "type": "WarmupLR", - "params": { - "warmup_min_lr": "auto", - "warmup_max_lr": "auto", - "warmup_num_steps": "auto" - } - }, - - "zero_optimization": { - "stage": 2, - "offload_optimizer": { - "device": "cpu", - "pin_memory": true - }, - "allgather_partitions": true, - "allgather_bucket_size": 2e8, - "overlap_comm": true, - "reduce_scatter": true, - "reduce_bucket_size": 2e8, - "contiguous_gradients": true - }, - - "gradient_accumulation_steps": "auto", - "gradient_clipping": "auto", - "steps_per_print": 2000, - "train_batch_size": "auto", - "train_micro_batch_size_per_gpu": "auto", - "wall_clock_breakdown": false -} -``` - -Here is a full ZeRO-2 all-enabled manually set configuration file. It is here mainly for you to see what the typical -values look like, but we highly recommend using the one with multiple `auto` settings in it. 
- -```json -{ - "fp16": { - "enabled": true, - "loss_scale": 0, - "loss_scale_window": 1000, - "initial_scale_power": 16, - "hysteresis": 2, - "min_loss_scale": 1 - }, - - "optimizer": { - "type": "AdamW", - "params": { - "lr": 3e-5, - "betas": [0.8, 0.999], - "eps": 1e-8, - "weight_decay": 3e-7 - } - }, - - "scheduler": { - "type": "WarmupLR", - "params": { - "warmup_min_lr": 0, - "warmup_max_lr": 3e-5, - "warmup_num_steps": 500 - } - }, - - "zero_optimization": { - "stage": 2, - "offload_optimizer": { - "device": "cpu", - "pin_memory": true - }, - "allgather_partitions": true, - "allgather_bucket_size": 2e8, - "overlap_comm": true, - "reduce_scatter": true, - "reduce_bucket_size": 2e8, - "contiguous_gradients": true - }, - - "steps_per_print": 2000, - "wall_clock_breakdown": false -} -``` - - - -#### ZeRO-3 Example - -Here is a full ZeRO-3 auto-configuration file `ds_config_zero3.json`: - - -```json -{ - "fp16": { - "enabled": "auto", - "loss_scale": 0, - "loss_scale_window": 1000, - "initial_scale_power": 16, - "hysteresis": 2, - "min_loss_scale": 1 - }, - - "optimizer": { - "type": "AdamW", - "params": { - "lr": "auto", - "betas": "auto", - "eps": "auto", - "weight_decay": "auto" - } - }, - - "scheduler": { - "type": "WarmupLR", - "params": { - "warmup_min_lr": "auto", - "warmup_max_lr": "auto", - "warmup_num_steps": "auto" - } - }, - - "zero_optimization": { - "stage": 3, - "offload_optimizer": { - "device": "cpu", - "pin_memory": true - }, - "offload_param": { - "device": "cpu", - "pin_memory": true - }, - "overlap_comm": true, - "contiguous_gradients": true, - "sub_group_size": 1e9, - "reduce_bucket_size": "auto", - "stage3_prefetch_bucket_size": "auto", - "stage3_param_persistence_threshold": "auto", - "stage3_max_live_parameters": 1e9, - "stage3_max_reuse_distance": 1e9, - "stage3_gather_16bit_weights_on_model_save": true - }, - - "gradient_accumulation_steps": "auto", - "gradient_clipping": "auto", - "steps_per_print": 2000, - "train_batch_size": "auto", - "train_micro_batch_size_per_gpu": "auto", - "wall_clock_breakdown": false -} -``` - -Here is a full ZeRO-3 all-enabled manually set configuration file. It is here mainly for you to see what the typical -values look like, but we highly recommend using the one with multiple `auto` settings in it. - -```json -{ - "fp16": { - "enabled": true, - "loss_scale": 0, - "loss_scale_window": 1000, - "initial_scale_power": 16, - "hysteresis": 2, - "min_loss_scale": 1 - }, - - "optimizer": { - "type": "AdamW", - "params": { - "lr": 3e-5, - "betas": [0.8, 0.999], - "eps": 1e-8, - "weight_decay": 3e-7 - } - }, - - "scheduler": { - "type": "WarmupLR", - "params": { - "warmup_min_lr": 0, - "warmup_max_lr": 3e-5, - "warmup_num_steps": 500 - } - }, - - "zero_optimization": { - "stage": 3, - "offload_optimizer": { - "device": "cpu", - "pin_memory": true - }, - "offload_param": { - "device": "cpu", - "pin_memory": true - }, - "overlap_comm": true, - "contiguous_gradients": true, - "sub_group_size": 1e9, - "reduce_bucket_size": 1e6, - "stage3_prefetch_bucket_size": 0.94e6, - "stage3_param_persistence_threshold": 1e4, - "stage3_max_live_parameters": 1e9, - "stage3_max_reuse_distance": 1e9, - "stage3_gather_16bit_weights_on_model_save": true - }, - - "steps_per_print": 2000, - "wall_clock_breakdown": false -} -``` - -#### How to Choose Which ZeRO Stage and Offloads To Use For Best Performance - -So now you know there are all these different stages. How to decide which of them to use? This section will attempt to address this question. 
-
-In general the following applies:
-
-- Speed-wise (left is faster than right)
-
-Stage 0 (DDP) > Stage 1 > Stage 2 > Stage 2 + offload > Stage 3 > Stage 3 + offloads
-
-- GPU Memory usage-wise (right is more GPU memory efficient than left)
-
-Stage 0 (DDP) < Stage 1 < Stage 2 < Stage 2 + offload < Stage 3 < Stage 3 + offloads
-
-So when you want to get the fastest execution while fitting into a minimal number of GPUs, here is the process you could follow. We start with the fastest approach and, if running into GPU OOM, we then go to the next slower approach, which will use less GPU memory. And so on and so forth.
-
-First of all set the batch size to 1 (you can always use gradient accumulation for any desired effective batch size).
-
-1. Enable `--gradient_checkpointing 1` (HF Trainer) or directly `model.gradient_checkpointing_enable()` - if OOM then
-2. Try ZeRO stage 2 first - if OOM then
-3. Try ZeRO stage 2 + `offload_optimizer` - if OOM then
-4. Switch to ZeRO stage 3 - if OOM then
-5. Enable `offload_param` to `cpu` - if OOM then
-6. Enable `offload_optimizer` to `cpu` - if OOM then
-
-7. If you still can't fit a batch size of 1, first check various default values and lower them if you can. For example, if you use `generate` and you don't need a wide beam search, make it narrower as it'd take a lot of memory.
-
-8. Definitely use mixed half-precision over fp32 - so bf16 on Ampere and higher GPUs and fp16 on older GPU architectures.
-
-9. If you still OOM you could add more hardware or enable ZeRO-Infinity - that is, switch the `offload_param` and `offload_optimizer` offloads to `nvme`. You need to make sure it's a very fast NVMe. As an anecdote, I was able to infer BLOOM-176B on a tiny GPU using ZeRO-Infinity, except it was extremely slow. But it worked!
-
-You can, of course, work through these steps in reverse by starting with the most GPU memory efficient config and then going backwards. Or try bisecting it.
-
-Once a batch size of 1 no longer leads to OOM, measure your effective throughput.
-
-Next try to increase the batch size to as large as you can, since the higher the batch size the more efficient the GPUs are, as they perform best when the matrices they multiply are huge.
-
-Now the performance optimization game starts. You can turn off some offload features or step down in ZeRO stages and increase/decrease the batch size and again measure your effective throughput. Rinse and repeat until satisfied.
-
-Don't spend forever on it, but if you're about to start a 3-month training - do spend a few days on it to find the most effective throughput-wise setup. That way your training cost will be the lowest and you will finish training faster. In the current crazy-paced ML world, if it takes you an extra month to train something you are likely to miss a golden opportunity. Of course, this is only me sharing an observation and in no way am I trying to rush you. Before beginning to train BLOOM-176B I spent 2 days on this process and was able to increase throughput from 90 to 150 TFLOPs! This effort saved us more than one month of training time.
-
-These notes were written primarily for the training mode, but they should mostly apply for inference as well. For example, during inference Gradient Checkpointing is a no-op since it is only useful during training.
Additionally, we found out that if you are doing a multi-GPU inference and not using [DeepSpeed-Inference](https://www.deepspeed.ai/tutorials/inference-tutorial/), [Accelerate](https://huggingface.co/blog/bloom-inference-pytorch-scripts) should provide a superior performance. - - -Other quick related performance notes: -- if you are training something from scratch always try to have tensors with shapes that are divisible by 16 (e.g. hidden size). For batch size try divisible by 2 at least. There are [wave and tile quanitization](https://developer.nvidia.com/blog/optimizing-gpu-performance-tensor-cores/) divisibility that is hardware-specific if you want to squeeze even higher performance from your GPUs. - - -### Optimizer and Scheduler - -As long as you don't enable `offload_optimizer` you can mix and match DeepSpeed and HuggingFace schedulers and -optimizers, with the exception of using the combination of HuggingFace scheduler and DeepSpeed optimizer: - -| Combos | HF Scheduler | DS Scheduler | -| HF Optimizer | Yes | Yes | -| DS Optimizer | No | Yes | - -It is possible to use a non-DeepSpeed optimizer when `offload_optimizer` is enabled, as long as it has both CPU and -GPU implementation (except LAMB). - - - - - - -#### Optimizer - - -DeepSpeed's main optimizers are Adam, AdamW, OneBitAdam, and Lamb. These have been thoroughly tested with ZeRO and are -thus recommended to be used. It, however, can import other optimizers from `torch`. The full documentation is [here](https://www.deepspeed.ai/docs/config-json/#optimizer-parameters). - -If you don't configure the `optimizer` entry in the configuration file, the [`Trainer`] will -automatically set it to `AdamW` and will use the supplied values or the defaults for the following command line -arguments: `--learning_rate`, `--adam_beta1`, `--adam_beta2`, `--adam_epsilon` and `--weight_decay`. - -Here is an example of the auto-configured `optimizer` entry for `AdamW`: - -```json -{ - "optimizer": { - "type": "AdamW", - "params": { - "lr": "auto", - "betas": "auto", - "eps": "auto", - "weight_decay": "auto" - } - } -} -``` - -Note that the command line arguments will set the values in the configuration file. This is so that there is one -definitive source of the values and to avoid hard to find errors when for example, the learning rate is set to -different values in different places. Command line rules. The values that get overridden are: - -- `lr` with the value of `--learning_rate` -- `betas` with the value of `--adam_beta1 --adam_beta2` -- `eps` with the value of `--adam_epsilon` -- `weight_decay` with the value of `--weight_decay` - -Therefore please remember to tune the shared hyperparameters on the command line. - -You can also set the values explicitly: - -```json -{ - "optimizer": { - "type": "AdamW", - "params": { - "lr": 0.001, - "betas": [0.8, 0.999], - "eps": 1e-8, - "weight_decay": 3e-7 - } - } -} -``` - -But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed -configuration. - -If you want to use another optimizer which is not listed above, you will have to add to the top level configuration. - -```json -{ - "zero_allow_untested_optimizer": true -} -``` - -Similarly to `AdamW`, you can configure other officially supported optimizers. Just remember that may have different -config values. e.g. for Adam you will want `weight_decay` around `0.01`. - - - - - -#### Scheduler - -DeepSpeed supports `LRRangeTest`, `OneCycle`, `WarmupLR` and `WarmupDecayLR` learning rate schedulers. 
The full -documentation is [here](https://www.deepspeed.ai/docs/config-json/#scheduler-parameters). - -Here is where the schedulers overlap between 🤗 Transformers and DeepSpeed: - -- `WarmupLR` via `--lr_scheduler_type constant_with_warmup` -- `WarmupDecayLR` via `--lr_scheduler_type linear`. This is also the default value for `--lr_scheduler_type`, - therefore, if you don't configure the scheduler this is scheduler that will get configured by default. - -If you don't configure the `scheduler` entry in the configuration file, the [`Trainer`] will use -the values of `--lr_scheduler_type`, `--learning_rate` and `--warmup_steps` or `--warmup_ratio` to configure a -🤗 Transformers version of it. - -Here is an example of the auto-configured `scheduler` entry for `WarmupLR`: - -```json -{ - "scheduler": { - "type": "WarmupLR", - "params": { - "warmup_min_lr": "auto", - "warmup_max_lr": "auto", - "warmup_num_steps": "auto" - } - } -} -``` - -Since *"auto"* is used the [`Trainer`] arguments will set the correct values in the configuration -file. This is so that there is one definitive source of the values and to avoid hard to find errors when, for example, -the learning rate is set to different values in different places. Command line rules. The values that get set are: - -- `warmup_min_lr` with the value of `0`. -- `warmup_max_lr` with the value of `--learning_rate`. -- `warmup_num_steps` with the value of `--warmup_steps` if provided. Otherwise will use `--warmup_ratio` - multiplied by the number of training steps and rounded up. -- `total_num_steps` with either the value of `--max_steps` or if it is not provided, derived automatically at run - time based on the environment and the size of the dataset and other command line arguments (needed for - `WarmupDecayLR`). - -You can, of course, take over any or all of the configuration values and set those yourself: - -```json -{ - "scheduler": { - "type": "WarmupLR", - "params": { - "warmup_min_lr": 0, - "warmup_max_lr": 0.001, - "warmup_num_steps": 1000 - } - } -} -``` - -But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed -configuration. - -For example, for `WarmupDecayLR`, you can use the following entry: - -```json -{ - "scheduler": { - "type": "WarmupDecayLR", - "params": { - "last_batch_iteration": -1, - "total_num_steps": "auto", - "warmup_min_lr": "auto", - "warmup_max_lr": "auto", - "warmup_num_steps": "auto" - } - } -} -``` - -and `total_num_steps`, `warmup_max_lr`, `warmup_num_steps` and `total_num_steps` will be set at loading time. - - - - - - -### fp32 Precision - -Deepspeed supports the full fp32 and the fp16 mixed precision. - -Because of the much reduced memory needs and faster speed one gets with the fp16 mixed precision, the only time you -will want to not use it is when the model you're using doesn't behave well under this training mode. Typically this -happens when the model wasn't pretrained in the fp16 mixed precision (e.g. often this happens with bf16-pretrained -models). Such models may overflow or underflow leading to `NaN` loss. If this is your case then you will want to use -the full fp32 mode, by explicitly disabling the otherwise default fp16 mixed precision mode with: - -```json -{ - "fp16": { - "enabled": "false", - } -} -``` - -If you're using the Ampere-architecture based GPU, pytorch version 1.7 and higher will automatically switch to using -the much more efficient tf32 format for some operations, but the results will still be in fp32. 
For details and -benchmarks, please, see [TensorFloat-32(TF32) on Ampere devices](https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices). The document includes -instructions on how to disable this automatic conversion if for some reason you prefer not to use it. - -With the 🤗 Trainer you can use `--tf32` to enable it, or disable it with `--tf32 0` or `--no_tf32`. By default the PyTorch default is used. - - - - - -### Automatic Mixed Precision - -You can use automatic mixed precision with either a pytorch-like AMP way or the apex-like way: - -### fp16 - -To configure pytorch AMP-like mode with fp16 (float16) set: - -```json -{ - "fp16": { - "enabled": "auto", - "loss_scale": 0, - "loss_scale_window": 1000, - "initial_scale_power": 16, - "hysteresis": 2, - "min_loss_scale": 1 - } -} -``` - -and the [`Trainer`] will automatically enable or disable it based on the value of -`args.fp16_backend`. The rest of config values are up to you. - -This mode gets enabled when `--fp16 --fp16_backend amp` or `--fp16_full_eval` command line args are passed. - -You can also enable/disable this mode explicitly: - -```json -{ - "fp16": { - "enabled": true, - "loss_scale": 0, - "loss_scale_window": 1000, - "initial_scale_power": 16, - "hysteresis": 2, - "min_loss_scale": 1 - } -} -``` - -But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed -configuration. - -Here is the [documentation](https://www.deepspeed.ai/docs/config-json/#fp16-training-options). - -### bf16 - -If bf16 (bfloat16) is desired instead of fp16 then the following configuration section is to be used: - -```json -{ - "bf16": { - "enabled": "auto" - } -} -``` - -bf16 has the same dynamic range as fp32 and thus doesn't require loss scaling. - -This mode gets enabled when `--bf16` or `--bf16_full_eval` command line args are passed. - -You can also enable/disable this mode explicitly: - -```json -{ - "bf16": { - "enabled": true - } -} -``` - - - -As of `deepspeed==0.6.0` the bf16 support is new and experimental. - -If you use [gradient accumulation](#gradient-accumulation) with bf16-enabled, you need to be aware that it'll accumulate gradients in bf16, which may not be what you want due to this format's low precision, as it may lead to a lossy accumulation. - -A work is being done to fix that and provide an option to use a higher precision `dtype` (fp16 or fp32). - - - - -### NCCL Collectives - -There is the `dtype` of the training regime and there is a separate `dtype` that is used for communication collectives like various reduction and gathering/scattering operations. - -All gather/scatter ops are performed in the same `dtype` the data is in, so if you're using bf16 training regime it gets gathered in bf16 - gathering is a non-lossy operation. - -Various reduce operations can be quite lossy, for example when gradients are averaged across multiple-gpus, if the communications are done in fp16 or bf16 the outcome is likely be lossy - since when one ads multiple numbers in low precision the result isn't exact. More so with bf16 as it has a lower precision than fp16. Often fp16 is good enough as the loss is minimal when averaging grads which are typically very small. Therefore, by default for half precision training fp16 is used as the default for reduction operations. 
But you have full control over this functionality and if you choose you can add a small overhead and ensure that reductions will be using fp32 as the accumulation dtype and only when the result is ready it'll get downcast to the half precision `dtype` you're training in. - -In order to override the default you simply add a new configuration entry: - -```json -{ - "communication_data_type": "fp32" -} -``` -The valid values as of this writing are "fp16", "bfp16", "fp32". - -note: stage zero 3 had a bug with regards to bf16 comm dtype that was fixed in `deepspeed==0.8.1` - - - -### apex - -To configure apex AMP-like mode set: - -```json -"amp": { - "enabled": "auto", - "opt_level": "auto" -} -``` - -and the [`Trainer`] will automatically configure it based on the values of `args.fp16_backend` and -`args.fp16_opt_level`. - -This mode gets enabled when `--fp16 --fp16_backend apex --fp16_opt_level 01` command line args are passed. - -You can also configure this mode explicitly: - -```json -{ - "amp": { - "enabled": true, - "opt_level": "O1" - } -} -``` - -But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed -configuration. - -Here is the [documentation](https://www.deepspeed.ai/docs/config-json/#automatic-mixed-precision-amp-training-options). - - - - - -### Batch Size - -To configure batch size, use: - -```json -{ - "train_batch_size": "auto", - "train_micro_batch_size_per_gpu": "auto" -} -``` - -and the [`Trainer`] will automatically set `train_micro_batch_size_per_gpu` to the value of -`args.per_device_train_batch_size` and `train_batch_size` to `args.world_size * args.per_device_train_batch_size * args.gradient_accumulation_steps`. - -You can also set the values explicitly: - -```json -{ - "train_batch_size": 12, - "train_micro_batch_size_per_gpu": 4 -} -``` - -But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed -configuration. - - - - - -### Gradient Accumulation - -To configure gradient accumulation set: - -```json -{ - "gradient_accumulation_steps": "auto" -} -``` - -and the [`Trainer`] will automatically set it to the value of `args.gradient_accumulation_steps`. - -You can also set the value explicitly: - -```json -{ - "gradient_accumulation_steps": 3 -} -``` - -But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed -configuration. - - - - - -### Gradient Clipping - -To configure gradient gradient clipping set: - -```json -{ - "gradient_clipping": "auto" -} -``` - -and the [`Trainer`] will automatically set it to the value of `args.max_grad_norm`. - -You can also set the value explicitly: - -```json -{ - "gradient_clipping": 1.0 -} -``` - -But then you're on your own synchronizing the [`Trainer`] command line arguments and the DeepSpeed -configuration. - - - - - -### Getting The Model Weights Out - -As long as you continue training and resuming using DeepSpeed you don't need to worry about anything. DeepSpeed stores -fp32 master weights in its custom checkpoint optimizer files, which are `global_step*/*optim_states.pt` (this is glob -pattern), and are saved under the normal checkpoint. - -**FP16 Weights:** - -When a model is saved under ZeRO-2, you end up having the normal `pytorch_model.bin` file with the model weights, but -they are only the fp16 version of the weights. 
-
-Under ZeRO-3, things are much more complicated, since the model weights are partitioned out over multiple GPUs, therefore `"stage3_gather_16bit_weights_on_model_save": true` is required to get the `Trainer` to save the fp16 version of the weights. If this setting is `False`, `pytorch_model.bin` won't be created. This is because by default DeepSpeed's `state_dict` contains a placeholder and not the real weights. If we were to save this `state_dict`, it wouldn't be possible to load it back.
-
-
-```json
-{
-    "zero_optimization": {
-        "stage3_gather_16bit_weights_on_model_save": true
-    }
-}
-```
-
-**FP32 Weights:**
-
-While the fp16 weights are fine for resuming training, if you finished finetuning your model and want to upload it to the [models hub](https://huggingface.co/models) or pass it to someone else, you most likely will want to get the fp32 weights. This ideally shouldn't be done during training since it is a process that requires a lot of memory, and therefore is best performed offline after the training is complete. But if desired and you have plenty of free CPU memory, it can be done in the same training script. The following sections will discuss both approaches.
-
-
-**Live FP32 Weights Recovery:**
-
-This approach may not work if your model is large and you have little free CPU memory left at the end of the training.
-
-If you have saved at least one checkpoint, and you want to use the latest one, you can do the following:
-
-```python
-from transformers.trainer_utils import get_last_checkpoint
-from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
-
-checkpoint_dir = get_last_checkpoint(trainer.args.output_dir)
-fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
-```
-
-If you're using the `--load_best_model_at_end` [`TrainingArguments`] argument (to track the best checkpoint), then you can finish the training by first saving the final model explicitly and then do the same as above:
-
-```python
-import os
-
-from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
-
-checkpoint_dir = os.path.join(trainer.args.output_dir, "checkpoint-final")
-trainer.deepspeed.save_checkpoint(checkpoint_dir)
-fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
-```
-
-
-
-Note that once `load_state_dict_from_zero_checkpoint` was run, the `model` will no longer be usable in the DeepSpeed context of the same application, i.e. you will need to re-initialize the deepspeed engine, since `model.load_state_dict(state_dict)` will remove all the DeepSpeed magic from it. So do this only at the very end of the training.
-
-
-
-Of course, you don't have to use [`Trainer`] and you can adjust the examples above to your own trainer.
-
-If for some reason you want more refinement, you can also extract the fp32 `state_dict` of the weights and apply it yourself as is shown in the following example:
-
-```python
-from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
-
-state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)  # already on cpu
-model = model.cpu()
-model.load_state_dict(state_dict)
-```
-
-**Offline FP32 Weights Recovery:**
-
-DeepSpeed creates a special conversion script `zero_to_fp32.py` which it places in the top-level of the checkpoint folder. Using this script you can extract the weights at any point.
The script is standalone and you no longer need to -have the configuration file or a `Trainer` to do the extraction. - -Let's say your checkpoint folder looks like this: - -```bash -$ ls -l output_dir/checkpoint-1/ --rw-rw-r-- 1 stas stas 1.4K Mar 27 20:42 config.json -drwxrwxr-x 2 stas stas 4.0K Mar 25 19:52 global_step1/ --rw-rw-r-- 1 stas stas 12 Mar 27 13:16 latest --rw-rw-r-- 1 stas stas 827K Mar 27 20:42 optimizer.pt --rw-rw-r-- 1 stas stas 231M Mar 27 20:42 pytorch_model.bin --rw-rw-r-- 1 stas stas 623 Mar 27 20:42 scheduler.pt --rw-rw-r-- 1 stas stas 1.8K Mar 27 20:42 special_tokens_map.json --rw-rw-r-- 1 stas stas 774K Mar 27 20:42 spiece.model --rw-rw-r-- 1 stas stas 1.9K Mar 27 20:42 tokenizer_config.json --rw-rw-r-- 1 stas stas 339 Mar 27 20:42 trainer_state.json --rw-rw-r-- 1 stas stas 2.3K Mar 27 20:42 training_args.bin --rwxrw-r-- 1 stas stas 5.5K Mar 27 13:16 zero_to_fp32.py* -``` - -In this example there is just one DeepSpeed checkpoint sub-folder *global_step1*. Therefore to reconstruct the fp32 -weights just run: - -```bash -python zero_to_fp32.py . pytorch_model.bin -``` - -This is it. `pytorch_model.bin` will now contain the full fp32 model weights consolidated from multiple GPUs. - -The script will automatically be able to handle either a ZeRO-2 or ZeRO-3 checkpoint. - -`python zero_to_fp32.py -h` will give you usage details. - -The script will auto-discover the deepspeed sub-folder using the contents of the file `latest`, which in the current -example will contain `global_step1`. - -Note: currently the script requires 2x general RAM of the final fp32 model weights. - - -### ZeRO-3 and Infinity Nuances - -ZeRO-3 is quite different from ZeRO-2 because of its param sharding feature. - -ZeRO-Infinity further extends ZeRO-3 to support NVMe memory and multiple other speed and scalability improvements. - -While all the efforts were made for things to just work without needing any special changes to your models, in certain -circumstances you may find the following information to be needed. - - - -#### Constructing Massive Models - -DeepSpeed/ZeRO-3 can handle models with Trillions of parameters which may not fit onto the existing RAM. In such cases, -but also if you want the initialization to happen much faster, initialize the model using *deepspeed.zero.Init()* -context manager (which is also a function decorator), like so: - -```python -from transformers import T5ForConditionalGeneration, T5Config -import deepspeed - -with deepspeed.zero.Init(): - config = T5Config.from_pretrained("t5-small") - model = T5ForConditionalGeneration(config) -``` - -As you can see this gives you a randomly initialized model. - -If you want to use a pretrained model, `model_class.from_pretrained` will activate this feature as long as -`is_deepspeed_zero3_enabled()` returns `True`, which currently is setup by the -[`TrainingArguments`] object if the passed DeepSpeed configuration file contains ZeRO-3 config -section. Thus you must create the [`TrainingArguments`] object **before** calling -`from_pretrained`. Here is an example of a possible sequence: - -```python -from transformers import AutoModel, Trainer, TrainingArguments - -training_args = TrainingArguments(..., deepspeed=ds_config) -model = AutoModel.from_pretrained("t5-small") -trainer = Trainer(model=model, args=training_args, ...) 
-```
-
-If you're using the official example scripts and your command line arguments include `--deepspeed ds_config.json` with ZeRO-3 config enabled, then everything is already done for you, since this is how example scripts are written.
-
-Note: If the fp16 weights of the model can't fit onto the memory of a single GPU, this feature must be used.
-
-For full details on this method and other related features please refer to [Constructing Massive Models](https://deepspeed.readthedocs.io/en/latest/zero3.html#constructing-massive-models).
-
-Also when loading fp16-pretrained models, you will want to tell `from_pretrained` to use `torch_dtype=torch.float16`. For details, please see [from_pretrained-torch-dtype](#from_pretrained-torch-dtype).
-
-
-#### Gathering Parameters
-
-Under ZeRO-3 on multiple GPUs no single GPU has all the parameters unless it's the parameters for the currently executing layer. So if you need to access all parameters from all layers at once there is a specific method to do it. Most likely you won't need it, but if you do please refer to [Gathering Parameters](https://deepspeed.readthedocs.io/en/latest/zero3.html#manual-parameter-coordination).
-
-We do however use it internally in several places, one such example is when loading pretrained model weights in `from_pretrained`. We load one layer at a time and immediately partition it to all participating GPUs, as for very large models it won't be possible to load them on one GPU and then spread them out to multiple GPUs, due to memory limitations.
-
-Also under ZeRO-3, if you write your own code and run into a model parameter weight that looks like:
-
-```python
-tensor([1.0], device="cuda:0", dtype=torch.float16, requires_grad=True)
-```
-
-note the `tensor([1.0])` of size 1: if you see something like this, or if you get an error saying the parameter is of size `1` instead of some much larger multi-dimensional shape, this means that the parameter is partitioned and what you see is a ZeRO-3 placeholder.
-
-
-
-
-
-
-### ZeRO Inference
-
-ZeRO Inference uses the same config as ZeRO-3 Training. You just don't need the optimizer and scheduler sections. In fact, you can leave these in the config file if you want to share the same one with training. They will just be ignored.
-
-Otherwise you just need to pass the usual [`TrainingArguments`] arguments. For example:
-
-```bash
-deepspeed --num_gpus=2 your_program.py --do_eval --deepspeed ds_config.json
-```
-
-The only important thing is that you need to use a ZeRO-3 configuration, since ZeRO-2 provides no benefit whatsoever for inference: only ZeRO-3 shards the parameters, whereas ZeRO-1 and ZeRO-2 only shard the optimizer states and gradients, which aren't used during inference.
-
-Here is an example of running `run_translation.py` under DeepSpeed deploying all available GPUs:
-
-```bash
-deepspeed examples/pytorch/translation/run_translation.py \
---deepspeed tests/deepspeed/ds_config_zero3.json \
---model_name_or_path t5-small --output_dir output_dir \
---do_eval --max_eval_samples 50 --warmup_steps 50 \
---max_source_length 128 --val_max_target_length 128 \
---overwrite_output_dir --per_device_eval_batch_size 4 \
---predict_with_generate --dataset_config "ro-en" --fp16 \
---source_lang en --target_lang ro --dataset_name wmt16 \
---source_prefix "translate English to Romanian: "
-```
-
-Since for inference there is no need for the additional large memory used by the optimizer states and the gradients, you should be able to fit much larger batches and/or sequence lengths onto the same hardware.
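-
-If you are scripting the evaluation yourself instead of using the example scripts, the same idea looks roughly like the sketch below (the model name, config path and `eval_dataset` are placeholders for your own values); you would still start it with the `deepspeed` launcher as shown above:
-
-```python
-from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments
-
-# Create the TrainingArguments (and thus the ZeRO-3 context) before loading the model,
-# then run evaluation only - any optimizer/scheduler sections in the config are ignored.
-training_args = TrainingArguments(
-    output_dir="output_dir",
-    do_train=False,
-    do_eval=True,
-    per_device_eval_batch_size=4,
-    fp16=True,
-    deepspeed="tests/deepspeed/ds_config_zero3.json",
-)
-model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
-trainer = Trainer(model=model, args=training_args, eval_dataset=eval_dataset)  # eval_dataset: your tokenized eval set
-metrics = trainer.evaluate()
-```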
-
-Additionally, DeepSpeed is currently developing a related product called DeepSpeed-Inference which has no relationship to the ZeRO technology, but instead uses tensor parallelism to scale models that can't fit onto a single GPU. This is a work in progress and we will provide the integration once that product is complete.
-
-
-### Memory Requirements
-
-Since DeepSpeed ZeRO can offload memory to CPU (and NVMe), the framework provides utilities that tell you how much CPU and GPU memory will be needed depending on the number of GPUs being used.
-
-Let's estimate how much memory is needed to finetune "bigscience/T0_3B" on a single GPU:
-
-```bash
-$ python -c 'from transformers import AutoModel; \
-from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \
-model = AutoModel.from_pretrained("bigscience/T0_3B"); \
-estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)'
-[...]
-Estimated memory needed for params, optim states and gradients for a:
-HW: Setup with 1 node, 1 GPU per node.
-SW: Model with 2783M total params, 65M largest layer params.
-  per CPU  |  per GPU |   Options
-   70.00GB |   0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
-   70.00GB |   0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
-   62.23GB |   5.43GB | offload_param=none, offload_optimizer=cpu , zero_init=1
-   62.23GB |   5.43GB | offload_param=none, offload_optimizer=cpu , zero_init=0
-    0.37GB |  46.91GB | offload_param=none, offload_optimizer=none, zero_init=1
-   15.56GB |  46.91GB | offload_param=none, offload_optimizer=none, zero_init=0
-```
-
-So you can fit it on a single 80GB GPU with no CPU offload, or on a tiny 8GB GPU, but then you need ~60GB of CPU memory. (Remember this is just the memory for params, optimizer states and gradients - you will need a bit more memory for CUDA kernels, activations and temps.)
-
-Then it's a tradeoff of cost vs speed. It'll be cheaper to buy/rent a smaller GPU (or fewer GPUs, since you can use multiple GPUs with DeepSpeed ZeRO). But then it'll be slower, so even if you don't care about how fast something will be done, the slowdown has a direct impact on how long you use the GPU and thus a bigger cost. So experiment and compare which works the best.
-
-If you have enough GPU memory, make sure to disable the CPU/NVMe offload as it'll make everything faster.
-
-For example, let's repeat the same for 2 GPUs:
-
-```bash
-$ python -c 'from transformers import AutoModel; \
-from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \
-model = AutoModel.from_pretrained("bigscience/T0_3B"); \
-estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=2, num_nodes=1)'
-[...]
-Estimated memory needed for params, optim states and gradients for a:
-HW: Setup with 1 node, 2 GPUs per node.
-SW: Model with 2783M total params, 65M largest layer params.
-  per CPU  |  per GPU |   Options
-   70.00GB |   0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
-   70.00GB |   0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
-   62.23GB |   2.84GB | offload_param=none, offload_optimizer=cpu , zero_init=1
-   62.23GB |   2.84GB | offload_param=none, offload_optimizer=cpu , zero_init=0
-    0.74GB |  23.58GB | offload_param=none, offload_optimizer=none, zero_init=1
-   31.11GB |  23.58GB | offload_param=none, offload_optimizer=none, zero_init=0
-```
-
-So here you'd want 2x 32GB GPUs or higher without offloading to CPU.
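-
-The same estimator can of course be called from a regular Python script or session rather than a `python -c` one-liner, for example:
-
-```python
-from transformers import AutoModel
-from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live
-
-# Same estimate as above: 1 node with a single GPU.
-model = AutoModel.from_pretrained("bigscience/T0_3B")
-estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)
-```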
-
-For full information please see [memory estimators](https://deepspeed.readthedocs.io/en/latest/memory.html).
-
-
-
-### Filing Issues
-
-Here is how to file an issue so that we can quickly get to the bottom of the issue and help you unblock your work.
-
-In your report please always include:
-
-1. The full DeepSpeed config file.
-
-2. Either the command line arguments if you were using the [`Trainer`], or the [`TrainingArguments`] arguments if you were scripting the Trainer setup yourself. Please do not dump the whole [`TrainingArguments`] object as it has dozens of entries that are irrelevant.
-
-3. Output of:
-
-    ```bash
-    python -c 'import torch; print(f"torch: {torch.__version__}")'
-    python -c 'import transformers; print(f"transformers: {transformers.__version__}")'
-    python -c 'import deepspeed; print(f"deepspeed: {deepspeed.__version__}")'
-    ```
-
-4. If possible include a link to a Google Colab notebook that we can reproduce the problem with. You can use this [notebook](https://github.com/stas00/porting/blob/master/transformers/deepspeed/DeepSpeed_on_colab_CLI.ipynb) as a starting point.
-
-5. Unless it's impossible, please always use a standard dataset that we can use and not something custom.
-
-6. If possible, try to use one of the existing [examples](https://github.com/huggingface/transformers/tree/main/examples/pytorch) to reproduce the problem with.
-
-Things to consider:
-
-- DeepSpeed is often not the cause of the problem.
-
-  Some of the filed issues proved to be DeepSpeed-unrelated. That is, once DeepSpeed was removed from the setup, the problem was still there.
-
-  Therefore, if it's not absolutely obvious that it's a DeepSpeed-related problem, as in you can see that there is an exception and you can see that DeepSpeed modules are involved, first re-test your setup without DeepSpeed in it. Only if the problem persists should you then mention DeepSpeed and supply all the required details.
-
-- If it's clear to you that the issue is in the DeepSpeed core and not the integration part, please file the issue directly with [DeepSpeed](https://github.com/microsoft/DeepSpeed/). If you aren't sure, please do not worry, either issue tracker will do; we will figure it out once you post it and redirect you to the other issue tracker if need be.
-
-
-
-### Troubleshooting
-
-#### the `deepspeed` process gets killed at startup without a traceback
-
-If the `deepspeed` process gets killed at launch time without a traceback, that usually means that the program tried to allocate more CPU memory than your system has or your process is allowed to allocate, and the OS kernel killed the process. This is because your configuration file most likely has either `offload_optimizer` or `offload_param` or both configured to offload to `cpu`. If you have NVMe, experiment with offloading to NVMe if you're running under ZeRO-3. Here is how you can [estimate how much memory is needed for a specific model](https://deepspeed.readthedocs.io/en/latest/memory.html).
-
-
-#### training and/or eval/predict loss is `NaN`
-
-This often happens when one takes a model pre-trained in bf16 mixed precision mode and tries to use it under fp16 (with or without mixed precision). Most models trained on TPU, and often the ones released by Google, are in this category (e.g. almost all t5-based models). Here the solution is to either use fp32, or bf16 if your hardware supports it (TPU, Ampere GPUs or newer).
-
-The other problem may have to do with using fp16.
When you configure this section: - -```json -{ - "fp16": { - "enabled": "auto", - "loss_scale": 0, - "loss_scale_window": 1000, - "initial_scale_power": 16, - "hysteresis": 2, - "min_loss_scale": 1 - } -} -``` - -and you see in your log that Deepspeed reports `OVERFLOW!` as follows: - -``` -0%| | 0/189 [00:00 # Feature Extractor -A feature extractor is in charge of preparing input features for audio or vision models. This includes feature extraction -from sequences, *e.g.*, pre-processing audio files to Log-Mel Spectrogram features, feature extraction from images -*e.g.* cropping image image files, but also padding, normalization, and conversion to Numpy, PyTorch, and TensorFlow -tensors. +A feature extractor is in charge of preparing input features for audio or vision models. This includes feature extraction from sequences, e.g., pre-processing audio files to generate Log-Mel Spectrogram features, feature extraction from images, e.g., cropping image files, but also padding, normalization, and conversion to NumPy, PyTorch, and TensorFlow tensors. ## FeatureExtractionMixin diff --git a/docs/source/en/main_classes/image_processor.mdx b/docs/source/en/main_classes/image_processor.md similarity index 87% rename from docs/source/en/main_classes/image_processor.mdx rename to docs/source/en/main_classes/image_processor.md index 6a108397213f3a..04a3cd1337a526 100644 --- a/docs/source/en/main_classes/image_processor.mdx +++ b/docs/source/en/main_classes/image_processor.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Image Processor diff --git a/docs/source/en/main_classes/keras_callbacks.mdx b/docs/source/en/main_classes/keras_callbacks.md similarity index 83% rename from docs/source/en/main_classes/keras_callbacks.mdx rename to docs/source/en/main_classes/keras_callbacks.md index bc44a0967cc9cb..c9932300dbc569 100644 --- a/docs/source/en/main_classes/keras_callbacks.mdx +++ b/docs/source/en/main_classes/keras_callbacks.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. 
+ --> # Keras callbacks diff --git a/docs/source/en/main_classes/logging.mdx b/docs/source/en/main_classes/logging.md similarity index 76% rename from docs/source/en/main_classes/logging.mdx rename to docs/source/en/main_classes/logging.md index 9d4432a7290d0f..6a77001608c914 100644 --- a/docs/source/en/main_classes/logging.mdx +++ b/docs/source/en/main_classes/logging.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Logging @@ -67,6 +71,23 @@ verbose to the most verbose), those levels (with their corresponding int values By default, `tqdm` progress bars will be displayed during model download. [`logging.disable_progress_bar`] and [`logging.enable_progress_bar`] can be used to suppress or unsuppress this behavior. +## `logging` vs `warnings` + +Python has two logging systems that are often used in conjunction: `logging`, which is explained above, and `warnings`, +which allows further classification of warnings in specific buckets, e.g., `FutureWarning` for a feature or path +that has already been deprecated and `DeprecationWarning` to indicate an upcoming deprecation. + +We use both in the `transformers` library. We leverage and adapt `logging`'s `captureWarning` method to allow +management of these warning messages by the verbosity setters above. + +What does that mean for developers of the library? We should respect the following heuristic: +- `warnings` should be favored for developers of the library and libraries dependent on `transformers` +- `logging` should be used for end-users of the library using it in every-day projects + +See reference of the `captureWarnings` method below. + +[[autodoc]] logging.captureWarnings + ## Base setters [[autodoc]] logging.set_verbosity_error diff --git a/docs/source/en/main_classes/model.mdx b/docs/source/en/main_classes/model.md similarity index 96% rename from docs/source/en/main_classes/model.mdx rename to docs/source/en/main_classes/model.md index fee685b3efc74d..da907f80ee486a 100644 --- a/docs/source/en/main_classes/model.mdx +++ b/docs/source/en/main_classes/model.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Models @@ -99,7 +103,7 @@ t0pp.hf_device_map 'lm_head': 'cpu'} ``` -You can also write your own device map following the same format (a dictionary layer name to device). It should map all parameters of the model to a given device, but you don't have to detail where all the submosules of one layer go if that layer is entirely on the same device. 
For instance, the following device map would work properly for T0pp (as long as you have the GPU memory): +You can also write your own device map following the same format (a dictionary layer name to device). It should map all parameters of the model to a given device, but you don't have to detail where all the submodules of one layer go if that layer is entirely on the same device. For instance, the following device map would work properly for T0pp (as long as you have the GPU memory): ```python device_map = {"shared": 0, "encoder": 0, "decoder": 1, "lm_head": 1} diff --git a/docs/source/en/main_classes/onnx.mdx b/docs/source/en/main_classes/onnx.md similarity index 90% rename from docs/source/en/main_classes/onnx.mdx rename to docs/source/en/main_classes/onnx.md index ff20f315a1a9a0..81d31c97e88dde 100644 --- a/docs/source/en/main_classes/onnx.mdx +++ b/docs/source/en/main_classes/onnx.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Exporting 🤗 Transformers models to ONNX diff --git a/docs/source/en/main_classes/optimizer_schedules.mdx b/docs/source/en/main_classes/optimizer_schedules.md similarity index 92% rename from docs/source/en/main_classes/optimizer_schedules.mdx rename to docs/source/en/main_classes/optimizer_schedules.md index 4808f4a2a4d944..dfcab9e91465a3 100644 --- a/docs/source/en/main_classes/optimizer_schedules.mdx +++ b/docs/source/en/main_classes/optimizer_schedules.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Optimization diff --git a/docs/source/en/main_classes/output.mdx b/docs/source/en/main_classes/output.md similarity index 89% rename from docs/source/en/main_classes/output.mdx rename to docs/source/en/main_classes/output.md index ced38976e84520..3567cf62c44e2d 100644 --- a/docs/source/en/main_classes/output.mdx +++ b/docs/source/en/main_classes/output.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. 
+ --> # Model outputs @@ -22,8 +26,8 @@ Let's see how this looks in an example: from transformers import BertTokenizer, BertForSequenceClassification import torch -tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") -model = BertForSequenceClassification.from_pretrained("bert-base-uncased") +tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased") +model = BertForSequenceClassification.from_pretrained("google-bert/bert-base-uncased") inputs = tokenizer("Hello, my dog is cute", return_tensors="pt") labels = torch.tensor([1]).unsqueeze(0) # Batch size 1 @@ -31,11 +35,19 @@ outputs = model(**inputs, labels=labels) ``` The `outputs` object is a [`~modeling_outputs.SequenceClassifierOutput`], as we can see in the -documentation of that class below, it means it has an optional `loss`, a `logits` an optional `hidden_states` and +documentation of that class below, it means it has an optional `loss`, a `logits`, an optional `hidden_states` and an optional `attentions` attribute. Here we have the `loss` since we passed along `labels`, but we don't have `hidden_states` and `attentions` because we didn't pass `output_hidden_states=True` or `output_attentions=True`. + + +When passing `output_hidden_states=True` you may expect the `outputs.hidden_states[-1]` to match `outputs.last_hidden_states` exactly. +However, this is not always the case. Some models apply normalization or subsequent process to the last hidden state when it's returned. + + + + You can access each attribute as you would usually do, and if that attribute has not been returned by the model, you will get `None`. Here for instance `outputs.loss` is the loss computed by the model, and `outputs.attentions` is `None`. @@ -164,6 +176,18 @@ documented on their corresponding model page. [[autodoc]] modeling_outputs.XVectorOutput +## Seq2SeqTSModelOutput + +[[autodoc]] modeling_outputs.Seq2SeqTSModelOutput + +## Seq2SeqTSPredictionOutput + +[[autodoc]] modeling_outputs.Seq2SeqTSPredictionOutput + +## SampleTSPredictionOutput + +[[autodoc]] modeling_outputs.SampleTSPredictionOutput + ## TFBaseModelOutput [[autodoc]] modeling_tf_outputs.TFBaseModelOutput diff --git a/docs/source/en/main_classes/pipelines.mdx b/docs/source/en/main_classes/pipelines.md similarity index 95% rename from docs/source/en/main_classes/pipelines.mdx rename to docs/source/en/main_classes/pipelines.md index e5ee3902028e34..1e8f93f3ba8e5e 100644 --- a/docs/source/en/main_classes/pipelines.mdx +++ b/docs/source/en/main_classes/pipelines.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Pipelines @@ -39,7 +43,7 @@ If you want to use a specific model from the [hub](https://huggingface.co) you c the hub already defines it: ```python ->>> pipe = pipeline(model="roberta-large-mnli") +>>> pipe = pipeline(model="FacebookAI/roberta-large-mnli") >>> pipe("This restaurant is awesome") [{'label': 'NEUTRAL', 'score': 0.7313136458396912}] ``` @@ -221,7 +225,7 @@ For users, a rule of thumb is: - **Measure performance on your load, with your hardware. 
Measure, measure, and keep measuring. Real numbers are the only way to go.** -- If you are latency constrained (live product doing inference), don't batch +- If you are latency constrained (live product doing inference), don't batch. - If you are using CPU, don't batch. - If you are using throughput (you want to run your model on a bunch of static data), on GPU, then: @@ -314,6 +318,19 @@ Pipelines available for audio tasks include the following. - __call__ - all +### TextToAudioPipeline + +[[autodoc]] TextToAudioPipeline + - __call__ + - all + + +### ZeroShotAudioClassificationPipeline + +[[autodoc]] ZeroShotAudioClassificationPipeline + - __call__ + - all + ## Computer vision Pipelines available for computer vision tasks include the following. @@ -335,6 +352,12 @@ Pipelines available for computer vision tasks include the following. - __call__ - all +### ImageToImagePipeline + +[[autodoc]] ImageToImagePipeline + - __call__ + - all + ### ObjectDetectionPipeline [[autodoc]] ObjectDetectionPipeline @@ -377,12 +400,6 @@ Pipelines available for natural language processing tasks include the following. - __call__ - all -### NerPipeline - -[[autodoc]] NerPipeline - -See [`TokenClassificationPipeline`] for all details. - ### QuestionAnsweringPipeline [[autodoc]] QuestionAnsweringPipeline @@ -452,12 +469,24 @@ Pipelines available for multimodal tasks include the following. - __call__ - all +### ImageFeatureExtractionPipeline + +[[autodoc]] ImageFeatureExtractionPipeline + - __call__ + - all + ### ImageToTextPipeline [[autodoc]] ImageToTextPipeline - __call__ - all +### MaskGenerationPipeline + +[[autodoc]] MaskGenerationPipeline + - __call__ + - all + ### VisualQuestionAnsweringPipeline [[autodoc]] VisualQuestionAnsweringPipeline diff --git a/docs/source/en/main_classes/processors.mdx b/docs/source/en/main_classes/processors.md similarity index 95% rename from docs/source/en/main_classes/processors.mdx rename to docs/source/en/main_classes/processors.md index 5530720b1cb686..5e943fc9fdd5cc 100644 --- a/docs/source/en/main_classes/processors.mdx +++ b/docs/source/en/main_classes/processors.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Processors @@ -82,7 +86,7 @@ This library hosts the processor to load the XNLI data: Please note that since the gold labels are available on the test set, evaluation is performed on the test set. -An example using these processors is given in the [run_xnli.py](https://github.com/huggingface/transformers/tree/main/examples/legacy/text-classification/run_xnli.py) script. +An example using these processors is given in the [run_xnli.py](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification/run_xnli.py) script. 
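If you just want to poke at the processor itself, here is a minimal, illustrative sketch (it assumes you have already downloaded the raw XNLI data to a local directory; the path below is a placeholder, and method names may vary slightly across library versions):

```python
# Illustrative sketch: load a few XNLI examples with the processor.
from transformers.data.processors.xnli import XnliProcessor

processor = XnliProcessor(language="de", train_language="en")
print(processor.get_labels())  # typically ['contradiction', 'entailment', 'neutral']

# "/path/to/xnli" is a placeholder for the directory holding the downloaded XNLI data
examples = processor.get_test_examples("/path/to/xnli")
for example in examples[:3]:
    print(example.guid, example.text_a, example.text_b, example.label)
```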
## SQuAD diff --git a/docs/source/en/main_classes/quantization.md b/docs/source/en/main_classes/quantization.md new file mode 100644 index 00000000000000..297dd1a49531bd --- /dev/null +++ b/docs/source/en/main_classes/quantization.md @@ -0,0 +1,47 @@ + + +# Quantization + +Quantization techniques reduces memory and computational costs by representing weights and activations with lower-precision data types like 8-bit integers (int8). This enables loading larger models you normally wouldn't be able to fit into memory, and speeding up inference. Transformers supports the AWQ and GPTQ quantization algorithms and it supports 8-bit and 4-bit quantization with bitsandbytes. + +Quantization techniques that aren't supported in Transformers can be added with the [`HfQuantizer`] class. + + + +Learn how to quantize models in the [Quantization](../quantization) guide. + + + +## AqlmConfig + +[[autodoc]] AqlmConfig + +## AwqConfig + +[[autodoc]] AwqConfig + +## GPTQConfig + +[[autodoc]] GPTQConfig + +## BitsAndBytesConfig + +[[autodoc]] BitsAndBytesConfig + +## HfQuantizer + +[[autodoc]] quantizers.base.HfQuantizer diff --git a/docs/source/en/main_classes/quantization.mdx b/docs/source/en/main_classes/quantization.mdx deleted file mode 100644 index 6ab6ec9dfa35ac..00000000000000 --- a/docs/source/en/main_classes/quantization.mdx +++ /dev/null @@ -1,150 +0,0 @@ - - -# Quantize 🤗 Transformers models - -## `bitsandbytes` Integration - -🤗 Transformers is closely integrated with most used modules on `bitsandbytes`. You can load your model in 8-bit precision with few lines of code. -This is supported by most of the GPU hardwares since the `0.37.0` release of `bitsandbytes`. - -Learn more about the quantization method in the [LLM.int8()](https://arxiv.org/abs/2208.07339) paper, or the [blogpost](https://huggingface.co/blog/hf-bitsandbytes-integration) about the collaboration. - -Here are the things you can do using `bitsandbytes` integration - -### Load a large model in 8bit - -You can load a model by roughly halving the memory requirements by using `load_in_8bit=True` argument when calling `.from_pretrained` method - - -```python -# pip install transformers accelerate bitsandbytes -from transformers import AutoModelForCausalLM, AutoTokenizer - -model_id = "bigscience/bloom-1b7" - -tokenizer = AutoTokenizer.from_pretrained(model_id) -model = AutoModelForCausalLM.from_pretrained(model_id, device_map == "auto", load_in_8bit=True) -``` - -Then, use your model as you would usually use a [`PreTrainedModel`]. - -You can check the memory footprint of your model with `get_memory_footprint` method. - -```python -print(model.get_memory_footprint()) -``` - -With this integration we were able to load large models on smaller devices and run them without any issue. - - - -Note that once a model has been loaded in 8-bit it is currently not possible to push the quantized weights on the Hub. Note also that you cannot train 8-bit weights as this is not supported yet. However you can use 8-bit models to train extra parameters, this will be covered in the next section. - - - -### Advanced usecases - -This section is intended to advanced users, that want to explore what it is possible to do beyond loading and running 8-bit models. - -#### Offload between `cpu` and `gpu` - -One of the advanced usecase of this is being able to load a model and dispatch the weights between `CPU` and `GPU`. Note that the weights that will be dispatched on CPU **will not** be converted in 8-bit, thus kept in `float32`. 
This feature is intended for users that want to fit a very large model and dispatch the model between GPU and CPU. - -First, load a `BitsAndBytesConfig` from `transformers` and set the attribute `llm_int8_enable_fp32_cpu_offload` to `True`: - -```python -from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig - -quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True) -``` - -Let's say you want to load `bigscience/bloom-1b7` model, and you have just enough GPU RAM to fit the entire model except the `lm_head`. Therefore write a custom device_map as follows: -```python -device_map = { - "transformer.word_embeddings": 0, - "transformer.word_embeddings_layernorm": 0, - "lm_head": "cpu", - "transformer.h": 0, - "transformer.ln_f": 0, -} -``` - -And load your model as follows: -```python -model_8bit = AutoModelForCausalLM.from_pretrained( - "bigscience/bloom-1b7", - device_map=device_map, - quantization_config=quantization_config, -) -``` - -And that's it! Enjoy your model! - -#### Play with `llm_int8_threshold` - -You can play with the `llm_int8_threshold` argument to change the threshold of the outliers. An "outlier" is a hidden state value that is greater than a certain threshold. -This corresponds to the outlier threshold for outlier detection as described in `LLM.int8()` paper. Any hidden states value that is above this threshold will be considered an outlier and the operation on those values will be done in fp16. Values are usually normally distributed, that is, most values are in the range [-3.5, 3.5], but there are some exceptional systematic outliers that are very differently distributed for large models. These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of magnitude ~5, but beyond that, there is a significant performance penalty. A good default threshold is 6, but a lower threshold might be needed for more unstable models (small models, fine-tuning). -This argument can impact the inference speed of the model. We suggest to play with this parameter to find which one is the best for your usecase. - -```python -from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig - -model_id = "bigscience/bloom-1b7" - -quantization_config = BitsAndBytesConfig( - llm_int8_threshold=10, -) - -model_8bit = AutoModelForCausalLM.from_pretrained( - model_id, - device_map=device_map, - quantization_config=quantization_config, -) -tokenizer = AutoTokenizer.from_pretrained(model_id) -``` - -#### Skip the conversion of some modules - -Some models has several modules that needs to be not converted in 8-bit to ensure stability. For example Jukebox model has several `lm_head` modules that should be skipped. Play with `llm_int8_skip_modules` - -```python -from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig - -model_id = "bigscience/bloom-1b7" - -quantization_config = BitsAndBytesConfig( - llm_int8_skip_modules=["lm_head"], -) - -model_8bit = AutoModelForCausalLM.from_pretrained( - model_id, - device_map=device_map, - quantization_config=quantization_config, -) -tokenizer = AutoTokenizer.from_pretrained(model_id) -``` - -#### Fine-tune a model that has been loaded in 8-bit - -With the official support of adapters in the Hugging Face ecosystem, you can fine-tune models that have been loaded in 8-bit. -This enables fine-tuning large models such as `flan-t5-large` or `facebook/opt-6.7b` in a single google Colab. 
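As an illustration, here is a minimal sketch of that workflow using LoRA adapters from the `peft` library (treat it as a sketch rather than a reference recipe; helper names such as `prepare_model_for_kbit_training` can differ between `peft` versions):

```python
# Illustrative sketch: attach trainable LoRA adapters to a model loaded in 8-bit.
# Assumes: pip install transformers accelerate bitsandbytes peft
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b", load_in_8bit=True, device_map="auto"
)
# Freeze the int8 base weights and cast the parts that must stay in higher precision
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights require gradients
# ...then train as usual (e.g. with the Trainer), optimizing just the adapters.
```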
Please have a look at [`peft`](https://github.com/huggingface/peft) library for more details. - -### BitsAndBytesConfig - -[[autodoc]] BitsAndBytesConfig - - -## Quantization with 🤗 `optimum` - -Please have a look at [Optimum documentation](https://huggingface.co/docs/optimum/index) to learn more about quantization methods that are supported by `optimum` and see if these are applicable for your usecase. - diff --git a/docs/source/en/main_classes/text_generation.mdx b/docs/source/en/main_classes/text_generation.md similarity index 86% rename from docs/source/en/main_classes/text_generation.mdx rename to docs/source/en/main_classes/text_generation.md index 5351129cbb1d73..309d7298eec70f 100644 --- a/docs/source/en/main_classes/text_generation.mdx +++ b/docs/source/en/main_classes/text_generation.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Generation @@ -24,7 +28,8 @@ of the generation method. To learn how to inspect a model's generation configuration, what are the defaults, how to change the parameters ad hoc, and how to create and save a customized generation configuration, refer to the -[text generation strategies guide](../generation_strategies). +[text generation strategies guide](../generation_strategies). The guide also explains how to use related features, +like token streaming. ## GenerationConfig diff --git a/docs/source/en/main_classes/tokenizer.mdx b/docs/source/en/main_classes/tokenizer.md similarity index 92% rename from docs/source/en/main_classes/tokenizer.mdx rename to docs/source/en/main_classes/tokenizer.md index 032373435cd5f4..2ad7e450404e77 100644 --- a/docs/source/en/main_classes/tokenizer.mdx +++ b/docs/source/en/main_classes/tokenizer.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Tokenizer @@ -51,6 +55,9 @@ to a given token). [[autodoc]] PreTrainedTokenizer - __call__ + - add_tokens + - add_special_tokens + - apply_chat_template - batch_decode - decode - encode @@ -64,6 +71,9 @@ loaded very simply into 🤗 transformers. 
Take a look at the [Using tokenizers [[autodoc]] PreTrainedTokenizerFast - __call__ + - add_tokens + - add_special_tokens + - apply_chat_template - batch_decode - decode - encode diff --git a/docs/source/en/main_classes/trainer.md b/docs/source/en/main_classes/trainer.md new file mode 100644 index 00000000000000..3f33ff1e505a2a --- /dev/null +++ b/docs/source/en/main_classes/trainer.md @@ -0,0 +1,54 @@ + + +# Trainer + +The [`Trainer`] class provides an API for feature-complete training in PyTorch, and it supports distributed training on multiple GPUs/TPUs, mixed precision for [NVIDIA GPUs](https://nvidia.github.io/apex/), [AMD GPUs](https://rocm.docs.amd.com/en/latest/rocm.html), and [`torch.amp`](https://pytorch.org/docs/stable/amp.html) for PyTorch. [`Trainer`] goes hand-in-hand with the [`TrainingArguments`] class, which offers a wide range of options to customize how a model is trained. Together, these two classes provide a complete training API. + +[`Seq2SeqTrainer`] and [`Seq2SeqTrainingArguments`] inherit from the [`Trainer`] and [`TrainingArgument`] classes and they're adapted for training models for sequence-to-sequence tasks such as summarization or translation. + + + +The [`Trainer`] class is optimized for 🤗 Transformers models and can have surprising behaviors +when used with other models. When using it with your own model, make sure: + +- your model always return tuples or subclasses of [`~utils.ModelOutput`] +- your model can compute the loss if a `labels` argument is provided and that loss is returned as the first + element of the tuple (if your model returns tuples) +- your model can accept multiple label arguments (use `label_names` in [`TrainingArguments`] to indicate their name to the [`Trainer`]) but none of them should be named `"label"` + + + +## Trainer[[api-reference]] + +[[autodoc]] Trainer + - all + +## Seq2SeqTrainer + +[[autodoc]] Seq2SeqTrainer + - evaluate + - predict + +## TrainingArguments + +[[autodoc]] TrainingArguments + - all + +## Seq2SeqTrainingArguments + +[[autodoc]] Seq2SeqTrainingArguments + - all diff --git a/docs/source/en/main_classes/trainer.mdx b/docs/source/en/main_classes/trainer.mdx deleted file mode 100644 index a0b914cd40af98..00000000000000 --- a/docs/source/en/main_classes/trainer.mdx +++ /dev/null @@ -1,679 +0,0 @@ - - -# Trainer - -The [`Trainer`] class provides an API for feature-complete training in PyTorch for most standard use cases. It's used in most of the [example scripts](https://github.com/huggingface/transformers/tree/main/examples). - -Before instantiating your [`Trainer`], create a [`TrainingArguments`] to access all the points of customization during training. - -The API supports distributed training on multiple GPUs/TPUs, mixed precision through [NVIDIA Apex](https://github.com/NVIDIA/apex) and Native AMP for PyTorch. - -The [`Trainer`] contains the basic training loop which supports the above features. To inject custom behavior you can subclass them and override the following methods: - -- **get_train_dataloader** -- Creates the training DataLoader. -- **get_eval_dataloader** -- Creates the evaluation DataLoader. -- **get_test_dataloader** -- Creates the test DataLoader. -- **log** -- Logs information on the various objects watching training. -- **create_optimizer_and_scheduler** -- Sets up the optimizer and learning rate scheduler if they were not passed at - init. Note, that you can also subclass or override the `create_optimizer` and `create_scheduler` methods - separately. 
-- **create_optimizer** -- Sets up the optimizer if it wasn't passed at init. -- **create_scheduler** -- Sets up the learning rate scheduler if it wasn't passed at init. -- **compute_loss** - Computes the loss on a batch of training inputs. -- **training_step** -- Performs a training step. -- **prediction_step** -- Performs an evaluation/test step. -- **evaluate** -- Runs an evaluation loop and returns metrics. -- **predict** -- Returns predictions (with metrics if labels are available) on a test set. - - - -The [`Trainer`] class is optimized for 🤗 Transformers models and can have surprising behaviors -when you use it on other models. When using it on your own model, make sure: - -- your model always return tuples or subclasses of [`~utils.ModelOutput`]. -- your model can compute the loss if a `labels` argument is provided and that loss is returned as the first - element of the tuple (if your model returns tuples) -- your model can accept multiple label arguments (use the `label_names` in your [`TrainingArguments`] to indicate their name to the [`Trainer`]) but none of them should be named `"label"`. - - - -Here is an example of how to customize [`Trainer`] to use a weighted loss (useful when you have an unbalanced training set): - -```python -from torch import nn -from transformers import Trainer - - -class CustomTrainer(Trainer): - def compute_loss(self, model, inputs, return_outputs=False): - labels = inputs.get("labels") - # forward pass - outputs = model(**inputs) - logits = outputs.get("logits") - # compute custom loss (suppose one has 3 labels with different weights) - loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0, 3.0])) - loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1)) - return (loss, outputs) if return_outputs else loss -``` - -Another way to customize the training loop behavior for the PyTorch [`Trainer`] is to use [callbacks](callback) that can inspect the training loop state (for progress reporting, logging on TensorBoard or other ML platforms...) and take decisions (like early stopping). - - -## Trainer - -[[autodoc]] Trainer - - all - -## Seq2SeqTrainer - -[[autodoc]] Seq2SeqTrainer - - evaluate - - predict - -## TrainingArguments - -[[autodoc]] TrainingArguments - - all - -## Seq2SeqTrainingArguments - -[[autodoc]] Seq2SeqTrainingArguments - - all - -## Checkpoints - -By default, [`Trainer`] will save all checkpoints in the `output_dir` you set in the -[`TrainingArguments`] you are using. Those will go in subfolder named `checkpoint-xxx` with xxx -being the step at which the training was at. - -Resuming training from a checkpoint can be done when calling [`Trainer.train`] with either: - -- `resume_from_checkpoint=True` which will resume training from the latest checkpoint -- `resume_from_checkpoint=checkpoint_dir` which will resume training from the specific checkpoint in the directory - passed. - -In addition, you can easily save your checkpoints on the Model Hub when using `push_to_hub=True`. By default, all -the models saved in intermediate checkpoints are saved in different commits, but not the optimizer state. You can adapt -the `hub-strategy` value of your [`TrainingArguments`] to either: - -- `"checkpoint"`: the latest checkpoint is also pushed in a subfolder named last-checkpoint, allowing you to - resume training easily with `trainer.train(resume_from_checkpoint="output_dir/last-checkpoint")`. 
-- `"all_checkpoints"`: all checkpoints are pushed like they appear in the output folder (so you will get one - checkpoint folder per folder in your final repository) - - -## Logging - -By default [`Trainer`] will use `logging.INFO` for the main process and `logging.WARNING` for the replicas if any. - -These defaults can be overridden to use any of the 5 `logging` levels with [`TrainingArguments`]'s -arguments: - -- `log_level` - for the main process -- `log_level_replica` - for the replicas - -Further, if [`TrainingArguments`]'s `log_on_each_node` is set to `False` only the main node will -use the log level settings for its main process, all other nodes will use the log level settings for replicas. - -Note that [`Trainer`] is going to set `transformers`'s log level separately for each node in its -[`Trainer.__init__`]. So you may want to set this sooner (see the next example) if you tap into other -`transformers` functionality before creating the [`Trainer`] object. - -Here is an example of how this can be used in an application: - -```python -[...] -logger = logging.getLogger(__name__) - -# Setup logging -logging.basicConfig( - format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", - datefmt="%m/%d/%Y %H:%M:%S", - handlers=[logging.StreamHandler(sys.stdout)], -) - -# set the main code and the modules it uses to the same log-level according to the node -log_level = training_args.get_process_log_level() -logger.setLevel(log_level) -datasets.utils.logging.set_verbosity(log_level) -transformers.utils.logging.set_verbosity(log_level) - -trainer = Trainer(...) -``` - -And then if you only want to see warnings on the main node and all other nodes to not print any most likely duplicated -warnings you could run it as: - -```bash -my_app.py ... --log_level warning --log_level_replica error -``` - -In the multi-node environment if you also don't want the logs to repeat for each node's main process, you will want to -change the above to: - -```bash -my_app.py ... --log_level warning --log_level_replica error --log_on_each_node 0 -``` - -and then only the main process of the first node will log at the "warning" level, and all other processes on the main -node and all processes on other nodes will log at the "error" level. - -If you need your application to be as quiet as possible you could do: - -```bash -my_app.py ... --log_level error --log_level_replica error --log_on_each_node 0 -``` - -(add `--log_on_each_node 0` if on multi-node environment) - - -## Randomness - -When resuming from a checkpoint generated by [`Trainer`] all efforts are made to restore the -_python_, _numpy_ and _pytorch_ RNG states to the same states as they were at the moment of saving that checkpoint, -which should make the "stop and resume" style of training as close as possible to non-stop training. - -However, due to various default non-deterministic pytorch settings this might not fully work. If you want full -determinism please refer to [Controlling sources of randomness](https://pytorch.org/docs/stable/notes/randomness). As explained in the document, that some of those settings -that make things deterministic (.e.g., `torch.backends.cudnn.deterministic`) may slow things down, therefore this -can't be done by default, but you can enable those yourself if needed. - - -## Specific GPUs Selection - -Let's discuss how you can tell your program which GPUs are to be used and in what order. 
- -When using [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) to use only a subset of your GPUs, you simply specify the number of GPUs to use. For example, if you have 4 GPUs, but you wish to use the first 2 you can do: - -```bash -python -m torch.distributed.launch --nproc_per_node=2 trainer-program.py ... -``` - -if you have either [`accelerate`](https://github.com/huggingface/accelerate) or [`deepspeed`](https://github.com/microsoft/DeepSpeed) installed you can also accomplish the same by using one of: -```bash -accelerate launch --num_processes 2 trainer-program.py ... -``` - -```bash -deepspeed --num_gpus 2 trainer-program.py ... -``` - -You don't need to use the Accelerate or [the Deepspeed integration](Deepspeed) features to use these launchers. - - -Until now you were able to tell the program how many GPUs to use. Now let's discuss how to select specific GPUs and control their order. - -The following environment variables help you control which GPUs to use and their order. - -**`CUDA_VISIBLE_DEVICES`** - -If you have multiple GPUs and you'd like to use only 1 or a few of those GPUs, set the environment variable `CUDA_VISIBLE_DEVICES` to a list of the GPUs to be used. - -For example, let's say you have 4 GPUs: 0, 1, 2 and 3. To run only on the physical GPUs 0 and 2, you can do: - -```bash -CUDA_VISIBLE_DEVICES=0,2 python -m torch.distributed.launch trainer-program.py ... -``` - -So now pytorch will see only 2 GPUs, where your physical GPUs 0 and 2 are mapped to `cuda:0` and `cuda:1` correspondingly. - -You can even change their order: - -```bash -CUDA_VISIBLE_DEVICES=2,0 python -m torch.distributed.launch trainer-program.py ... -``` - -Here your physical GPUs 0 and 2 are mapped to `cuda:1` and `cuda:0` correspondingly. - -The above examples were all for `DistributedDataParallel` use pattern, but the same method works for [`DataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html) as well: -```bash -CUDA_VISIBLE_DEVICES=2,0 python trainer-program.py ... -``` - -To emulate an environment without GPUs simply set this environment variable to an empty value like so: - -```bash -CUDA_VISIBLE_DEVICES= python trainer-program.py ... -``` - -As with any environment variable you can, of course, export those instead of adding these to the command line, as in: - - -```bash -export CUDA_VISIBLE_DEVICES=0,2 -python -m torch.distributed.launch trainer-program.py ... -``` - -but this approach can be confusing since you may forget you set up the environment variable earlier and not understand why the wrong GPUs are used. Therefore, it's a common practice to set the environment variable just for a specific run on the same command line as it's shown in most examples of this section. - -**`CUDA_DEVICE_ORDER`** - -There is an additional environment variable `CUDA_DEVICE_ORDER` that controls how the physical devices are ordered. The two choices are: - -1. ordered by PCIe bus IDs (matches `nvidia-smi`'s order) - this is the default. - -```bash -export CUDA_DEVICE_ORDER=PCI_BUS_ID -``` - -2. ordered by GPU compute capabilities - -```bash -export CUDA_DEVICE_ORDER=FASTEST_FIRST -``` - -Most of the time you don't need to care about this environment variable, but it's very helpful if you have a lopsided setup where you have an old and a new GPUs physically inserted in such a way so that the slow older card appears to be first. One way to fix that is to swap the cards. 
But if you can't swap the cards (e.g., if the cooling of the devices gets impacted) then setting `CUDA_DEVICE_ORDER=FASTEST_FIRST` will always put the newer faster card first. It'll be somewhat confusing though since `nvidia-smi` will still report them in the PCIe order. - -The other solution to swapping the order is to use: - -```bash -export CUDA_VISIBLE_DEVICES=1,0 -``` -In this example we are working with just 2 GPUs, but of course the same would apply to as many GPUs as your computer has. - -Also if you do set this environment variable it's the best to set it in your `~/.bashrc` file or some other startup config file and forget about it. - - - - -## Trainer Integrations - -The [`Trainer`] has been extended to support libraries that may dramatically improve your training -time and fit much bigger models. - -Currently it supports third party solutions, [DeepSpeed](https://github.com/microsoft/DeepSpeed), [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html) and [FairScale](https://github.com/facebookresearch/fairscale/), which implement parts of the paper [ZeRO: Memory Optimizations -Toward Training Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He](https://arxiv.org/abs/1910.02054). - -This provided support is new and experimental as of this writing. While the support for DeepSpeed and PyTorch FSDP is active and we welcome issues around it, we don't support the FairScale integration anymore since it has been integrated in PyTorch main (see the [PyTorch FSDP integration](#pytorch-fully-sharded-data-parallel)) - - - -### CUDA Extension Installation Notes - -As of this writing, both FairScale and Deepspeed require compilation of CUDA C++ code, before they can be used. - -While all installation issues should be dealt with through the corresponding GitHub Issues of [FairScale](https://github.com/facebookresearch/fairscale/issues) and [Deepspeed](https://github.com/microsoft/DeepSpeed/issues), there are a few common issues that one may encounter while building -any PyTorch extension that needs to build CUDA extensions. - -Therefore, if you encounter a CUDA-related build issue while doing one of the following or both: - -```bash -pip install fairscale -pip install deepspeed -``` - -please, read the following notes first. - -In these notes we give examples for what to do when `pytorch` has been built with CUDA `10.2`. If your situation is -different remember to adjust the version number to the one you are after. - -#### Possible problem #1 - -While, Pytorch comes with its own CUDA toolkit, to build these two projects you must have an identical version of CUDA -installed system-wide. - -For example, if you installed `pytorch` with `cudatoolkit==10.2` in the Python environment, you also need to have -CUDA `10.2` installed system-wide. - -The exact location may vary from system to system, but `/usr/local/cuda-10.2` is the most common location on many -Unix systems. When CUDA is correctly set up and added to the `PATH` environment variable, one can find the -installation location by doing: - -```bash -which nvcc -``` - -If you don't have CUDA installed system-wide, install it first. You will find the instructions by using your favorite -search engine. For example, if you're on Ubuntu you may want to search for: [ubuntu cuda 10.2 install](https://www.google.com/search?q=ubuntu+cuda+10.2+install). - -#### Possible problem #2 - -Another possible common problem is that you may have more than one CUDA toolkit installed system-wide. 
For example you -may have: - -```bash -/usr/local/cuda-10.2 -/usr/local/cuda-11.0 -``` - -Now, in this situation you need to make sure that your `PATH` and `LD_LIBRARY_PATH` environment variables contain -the correct paths to the desired CUDA version. Typically, package installers will set these to contain whatever the -last version was installed. If you encounter the problem, where the package build fails because it can't find the right -CUDA version despite you having it installed system-wide, it means that you need to adjust the 2 aforementioned -environment variables. - -First, you may look at their contents: - -```bash -echo $PATH -echo $LD_LIBRARY_PATH -``` - -so you get an idea of what is inside. - -It's possible that `LD_LIBRARY_PATH` is empty. - -`PATH` lists the locations of where executables can be found and `LD_LIBRARY_PATH` is for where shared libraries -are to looked for. In both cases, earlier entries have priority over the later ones. `:` is used to separate multiple -entries. - -Now, to tell the build program where to find the specific CUDA toolkit, insert the desired paths to be listed first by -doing: - -```bash -export PATH=/usr/local/cuda-10.2/bin:$PATH -export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH -``` - -Note that we aren't overwriting the existing values, but prepending instead. - -Of course, adjust the version number, the full path if need be. Check that the directories you assign actually do -exist. `lib64` sub-directory is where the various CUDA `.so` objects, like `libcudart.so` reside, it's unlikely -that your system will have it named differently, but if it is adjust it to reflect your reality. - - -#### Possible problem #3 - -Some older CUDA versions may refuse to build with newer compilers. For example, you my have `gcc-9` but it wants -`gcc-7`. - -There are various ways to go about it. - -If you can install the latest CUDA toolkit it typically should support the newer compiler. - -Alternatively, you could install the lower version of the compiler in addition to the one you already have, or you may -already have it but it's not the default one, so the build system can't see it. If you have `gcc-7` installed but the -build system complains it can't find it, the following might do the trick: - -```bash -sudo ln -s /usr/bin/gcc-7 /usr/local/cuda-10.2/bin/gcc -sudo ln -s /usr/bin/g++-7 /usr/local/cuda-10.2/bin/g++ -``` - -Here, we are making a symlink to `gcc-7` from `/usr/local/cuda-10.2/bin/gcc` and since -`/usr/local/cuda-10.2/bin/` should be in the `PATH` environment variable (see the previous problem's solution), it -should find `gcc-7` (and `g++7`) and then the build will succeed. - -As always make sure to edit the paths in the example to match your situation. - -### FairScale - - - -This integration is not supported anymore, we recommend you either use DeepSpeed or PyTorch FSDP. - - - -By integrating [FairScale](https://github.com/facebookresearch/fairscale/) the [`Trainer`] -provides support for the following features from [the ZeRO paper](https://arxiv.org/abs/1910.02054): - -1. Optimizer State Sharding -2. Gradient Sharding -3. Model Parameters Sharding (new and very experimental) -4. CPU offload (new and very experimental) - -You will need at least two GPUs to use this feature. 
- - -**Installation**: - -Install the library via pypi: - -```bash -pip install fairscale -``` - -or via `transformers`' `extras`: - -```bash -pip install transformers[fairscale] -``` - -(available starting from `transformers==4.6.0`) or find more details on [the FairScale's GitHub page](https://github.com/facebookresearch/fairscale/#installation). - -If you're still struggling with the build, first make sure to read [CUDA Extension Installation Notes](#zero-install-notes). - -If it's still not resolved the build issue, here are a few more ideas. - -`fairscale` seems to have an issue with the recently introduced by pip build isolation feature. If you have a problem -with it, you may want to try one of: - -```bash -pip install fairscale --no-build-isolation . -``` - -or: - -```bash -git clone https://github.com/facebookresearch/fairscale/ -cd fairscale -rm -r dist build -python setup.py bdist_wheel -pip uninstall -y fairscale -pip install dist/fairscale-*.whl -``` - -`fairscale` also has issues with building against pytorch-nightly, so if you use it you may have to try one of: - -```bash -pip uninstall -y fairscale; pip install fairscale --pre \ --f https://download.pytorch.org/whl/nightly/cu110/torch_nightly \ ---no-cache --no-build-isolation -``` - -or: - -```bash -pip install -v --disable-pip-version-check . \ --f https://download.pytorch.org/whl/nightly/cu110/torch_nightly --pre -``` - -Of course, adjust the urls to match the cuda version you use. - -If after trying everything suggested you still encounter build issues, please, proceed with the GitHub Issue of -[FairScale](https://github.com/facebookresearch/fairscale/issues). - - - -**Usage**: - -To use the first version of Sharded data-parallelism, add `--sharded_ddp simple` to the command line arguments, and -make sure you have added the distributed launcher `-m torch.distributed.launch --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE` if you haven't been using it already. - -For example here is how you could use it for `run_translation.py` with 2 GPUs: - -```bash -python -m torch.distributed.launch --nproc_per_node=2 examples/pytorch/translation/run_translation.py \ ---model_name_or_path t5-small --per_device_train_batch_size 1 \ ---output_dir output_dir --overwrite_output_dir \ ---do_train --max_train_samples 500 --num_train_epochs 1 \ ---dataset_name wmt16 --dataset_config "ro-en" \ ---source_lang en --target_lang ro \ ---fp16 --sharded_ddp simple -``` - -Notes: - -- This feature requires distributed training (so multiple GPUs). -- It is not implemented for TPUs. -- It works with `--fp16` too, to make things even faster. -- One of the main benefits of enabling `--sharded_ddp simple` is that it uses a lot less GPU memory, so you should be - able to use significantly larger batch sizes using the same hardware (e.g. 3x and even bigger) which should lead to - significantly shorter training time. - -3. To use the second version of Sharded data-parallelism, add `--sharded_ddp zero_dp_2` or `--sharded_ddp zero_dp_3` to the command line arguments, and make sure you have added the distributed launcher `-m torch.distributed.launch --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE` if you haven't been using it already. 
- -For example here is how you could use it for `run_translation.py` with 2 GPUs: - -```bash -python -m torch.distributed.launch --nproc_per_node=2 examples/pytorch/translation/run_translation.py \ ---model_name_or_path t5-small --per_device_train_batch_size 1 \ ---output_dir output_dir --overwrite_output_dir \ ---do_train --max_train_samples 500 --num_train_epochs 1 \ ---dataset_name wmt16 --dataset_config "ro-en" \ ---source_lang en --target_lang ro \ ---fp16 --sharded_ddp zero_dp_2 -``` - -`zero_dp_2` is an optimized version of the simple wrapper, while `zero_dp_3` fully shards model weights, -gradients and optimizer states. - -Both are compatible with adding `cpu_offload` to enable ZeRO-offload (activate it like this: `--sharded_ddp "zero_dp_2 cpu_offload"`). - -Notes: - -- This feature requires distributed training (so multiple GPUs). -- It is not implemented for TPUs. -- It works with `--fp16` too, to make things even faster. -- The `cpu_offload` additional option requires `--fp16`. -- This is an area of active development, so make sure you have a source install of fairscale to use this feature as - some bugs you encounter may have been fixed there already. - -Known caveats: - -- This feature is incompatible with `--predict_with_generate` in the _run_translation.py_ script. -- Using `--sharded_ddp zero_dp_3` requires wrapping each layer of the model in the special container - `FullyShardedDataParallelism` of fairscale. It should be used with the option `auto_wrap` if you are not - doing this yourself: `--sharded_ddp "zero_dp_3 auto_wrap"`. - -### PyTorch Fully Sharded Data parallel - -To accelerate training huge models on larger batch sizes, we can use a fully sharded data parallel model. -This type of data parallel paradigm enables fitting more data and larger models by sharding the optimizer states, gradients and parameters. -To read more about it and the benefits, check out the [Fully Sharded Data Parallel blog](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/). -We have integrated the latest PyTorch's Fully Sharded Data Parallel (FSDP) training feature. -All you need to do is enable it through the config. - -**Required PyTorch version for FSDP support**: PyTorch Nightly (or 1.12.0 if you read this after it has been released) -as the model saving with FSDP activated is only available with recent fixes. - -**Usage**: - -- Make sure you have added the distributed launcher -`-m torch.distributed.launch --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE` if you haven't been using it already. - -- **Sharding Strategy**: - - FULL_SHARD : Shards optimizer states + gradients + model parameters across data parallel workers/GPUs. - For this, add `--fsdp full_shard` to the command line arguments. - - SHARD_GRAD_OP : Shards optimizer states + gradients across data parallel workers/GPUs. - For this, add `--fsdp shard_grad_op` to the command line arguments. - - NO_SHARD : No sharding. For this, add `--fsdp no_shard` to the command line arguments. -- To offload the parameters and gradients to the CPU, -add `--fsdp "full_shard offload"` or `--fsdp "shard_grad_op offload"` to the command line arguments. -- To automatically recursively wrap layers with FSDP using `default_auto_wrap_policy`, -add `--fsdp "full_shard auto_wrap"` or `--fsdp "shard_grad_op auto_wrap"` to the command line arguments. -- To enable both CPU offloading and auto wrapping, -add `--fsdp "full_shard offload auto_wrap"` or `--fsdp "shard_grad_op offload auto_wrap"` to the command line arguments. 
-- If auto wrapping is enabled, you can either use transformer based auto wrap policy or size based auto wrap policy. - - For transformer based auto wrap policy, please add `--fsdp_transformer_layer_cls_to_wrap ` to command line arguments. - This specifies the transformer layer class name (case-sensitive) to wrap ,e.g, `BertLayer`, `GPTJBlock`, `T5Block` .... - This is important because submodules that share weights (e.g., embedding layer) should not end up in different FSDP wrapped units. - Using this policy, wrapping happens for each block containing Multi-Head Attention followed by couple of MLP layers. - Remaining layers including the shared embeddings are conveniently wrapped in same outermost FSDP unit. - Therefore, use this for transformer based models. - - For size based auto wrap policy, please add `--fsdp_min_num_params ` to command line arguments. - It specifies FSDP's minimum number of parameters for auto wrapping. - -**Few caveats to be aware of** -- Mixed precision is currently not supported with FSDP as we wait for PyTorch to fix support for it. -More details in this [issues](https://github.com/pytorch/pytorch/issues/75676). -- FSDP currently doesn't support multiple parameter groups. -More details mentioned in this [issue](https://github.com/pytorch/pytorch/issues/76501) -(`The original model parameters' .grads are not set, meaning that they cannot be optimized separately (which is why we cannot support multiple parameter groups)`). - -### Using Trainer for accelerated PyTorch Training on Mac - -With PyTorch v1.12 release, developers and researchers can take advantage of Apple silicon GPUs for significantly faster model training. -This unlocks the ability to perform machine learning workflows like prototyping and fine-tuning locally, right on Mac. -Apple's Metal Performance Shaders (MPS) as a backend for PyTorch enables this and can be used via the new `"mps"` device. -This will map computational graphs and primitives on the MPS Graph framework and tuned kernels provided by MPS. -For more information please refer official documents [Introducing Accelerated PyTorch Training on Mac](https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/) -and [MPS BACKEND](https://pytorch.org/docs/stable/notes/mps.html). - - - -We strongly recommend to install PyTorch >= 1.13 (nightly version at the time of writing) on your MacOS machine. -It has major fixes related to model correctness and performance improvements for transformer based models. -Please refer to https://github.com/pytorch/pytorch/issues/82707 for more details. - - - -**Benefits of Training and Inference using Apple Silicon Chips** - -1. Enables users to train larger networks or batch sizes locally -2. Reduces data retrieval latency and provides the GPU with direct access to the full memory store due to unified memory architecture. -Therefore, improving end-to-end performance. -3. Reduces costs associated with cloud-based development or the need for additional local GPUs. - -**Pre-requisites**: To install torch with mps support, -please follow this nice medium article [GPU-Acceleration Comes to PyTorch on M1 Macs](https://medium.com/towards-data-science/gpu-acceleration-comes-to-pytorch-on-m1-macs-195c399efcc1). - -**Usage**: -User has to just pass `--use_mps_device` argument. 
-For example, you can run the official Glue text classififcation task (from the root folder) using Apple Silicon GPU with below command: - -```bash -export TASK_NAME=mrpc - -python examples/pytorch/text-classification/run_glue.py \ - --model_name_or_path bert-base-cased \ - --task_name $TASK_NAME \ - --do_train \ - --do_eval \ - --max_seq_length 128 \ - --per_device_train_batch_size 32 \ - --learning_rate 2e-5 \ - --num_train_epochs 3 \ - --output_dir /tmp/$TASK_NAME/ \ - --use_mps_device \ - --overwrite_output_dir -``` - -**A few caveats to be aware of** - -1. Some PyTorch operations have not been implemented in mps and will throw an error. -One way to get around that is to set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1`, -which will fallback to CPU for these operations. It still throws a UserWarning however. -2. Distributed setups `gloo` and `nccl` are not working with `mps` device. -This means that currently only single GPU of `mps` device type can be used. - -Finally, please, remember that, 🤗 `Trainer` only integrates MPS backend, therefore if you -have any problems or questions with regards to MPS backend usage, please, -file an issue with [PyTorch GitHub](https://github.com/pytorch/pytorch/issues). - -Sections that were moved: - -[ DeepSpeed -| Installation -| Deployment with multiple GPUs -| Deployment with one GPU -| Deployment in Notebooks -| Configuration -| Passing Configuration -| Shared Configuration -| ZeRO -| ZeRO-2 Config -| ZeRO-3 Config -| NVMe Support -| ZeRO-2 vs ZeRO-3 Performance -| ZeRO-2 Example -| ZeRO-3 Example -| Optimizer -| Scheduler -| fp32 Precision -| Automatic Mixed Precision -| Batch Size -| Gradient Accumulation -| Gradient Clipping -| Getting The Model Weights Out -] diff --git a/docs/source/en/migration.mdx b/docs/source/en/migration.mdx deleted file mode 100644 index 7abf95875154ca..00000000000000 --- a/docs/source/en/migration.mdx +++ /dev/null @@ -1,315 +0,0 @@ - - -# Migrating from previous packages - -## Migrating from transformers `v3.x` to `v4.x` - -A couple of changes were introduced when the switch from version 3 to version 4 was done. Below is a summary of the -expected changes: - -#### 1. AutoTokenizers and pipelines now use fast (rust) tokenizers by default. - -The python and rust tokenizers have roughly the same API, but the rust tokenizers have a more complete feature set. - -This introduces two breaking changes: -- The handling of overflowing tokens between the python and rust tokenizers is different. -- The rust tokenizers do not accept integers in the encoding methods. - -##### How to obtain the same behavior as v3.x in v4.x - -- The pipelines now contain additional features out of the box. See the [token-classification pipeline with the `grouped_entities` flag](main_classes/pipelines#transformers.TokenClassificationPipeline). -- The auto-tokenizers now return rust tokenizers. In order to obtain the python tokenizers instead, the user may use the `use_fast` flag by setting it to `False`: - -In version `v3.x`: -```py -from transformers import AutoTokenizer - -tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") -``` -to obtain the same in version `v4.x`: -```py -from transformers import AutoTokenizer - -tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False) -``` - -#### 2. SentencePiece is removed from the required dependencies - -The requirement on the SentencePiece dependency has been lifted from the `setup.py`. 
This is done so that we may have a channel on anaconda cloud without relying on `conda-forge`. This means that the tokenizers that depend on the SentencePiece library will not be available with a standard `transformers` installation. - -This includes the **slow** versions of: -- `XLNetTokenizer` -- `AlbertTokenizer` -- `CamembertTokenizer` -- `MBartTokenizer` -- `PegasusTokenizer` -- `T5Tokenizer` -- `ReformerTokenizer` -- `XLMRobertaTokenizer` - -##### How to obtain the same behavior as v3.x in v4.x - -In order to obtain the same behavior as version `v3.x`, you should install `sentencepiece` additionally: - -In version `v3.x`: -```bash -pip install transformers -``` -to obtain the same in version `v4.x`: -```bash -pip install transformers[sentencepiece] -``` -or -```bash -pip install transformers sentencepiece -``` -#### 3. The architecture of the repo has been updated so that each model resides in its folder - -The past and foreseeable addition of new models means that the number of files in the directory `src/transformers` keeps growing and becomes harder to navigate and understand. We made the choice to put each model and the files accompanying it in their own sub-directories. - -This is a breaking change as importing intermediary layers using a model's module directly needs to be done via a different path. - -##### How to obtain the same behavior as v3.x in v4.x - -In order to obtain the same behavior as version `v3.x`, you should update the path used to access the layers. - -In version `v3.x`: -```bash -from transformers.modeling_bert import BertLayer -``` -to obtain the same in version `v4.x`: -```bash -from transformers.models.bert.modeling_bert import BertLayer -``` - -#### 4. Switching the `return_dict` argument to `True` by default - -The [`return_dict` argument](main_classes/output) enables the return of dict-like python objects containing the model outputs, instead of the standard tuples. This object is self-documented as keys can be used to retrieve values, while also behaving as a tuple as users may retrieve objects by index or by slice. - -This is a breaking change as the limitation of that tuple is that it cannot be unpacked: `value0, value1 = outputs` will not work. - -##### How to obtain the same behavior as v3.x in v4.x - -In order to obtain the same behavior as version `v3.x`, you should specify the `return_dict` argument to `False`, either in the model configuration or during the forward pass. - -In version `v3.x`: -```bash -model = BertModel.from_pretrained("bert-base-cased") -outputs = model(**inputs) -``` -to obtain the same in version `v4.x`: -```bash -model = BertModel.from_pretrained("bert-base-cased") -outputs = model(**inputs, return_dict=False) -``` -or -```bash -model = BertModel.from_pretrained("bert-base-cased", return_dict=False) -outputs = model(**inputs) -``` - -#### 5. Removed some deprecated attributes - -Attributes that were deprecated have been removed if they had been deprecated for at least a month. The full list of deprecated attributes can be found in [#8604](https://github.com/huggingface/transformers/pull/8604). - -Here is a list of these attributes/methods/arguments and what their replacements should be: - -In several models, the labels become consistent with the other models: -- `masked_lm_labels` becomes `labels` in `AlbertForMaskedLM` and `AlbertForPreTraining`. -- `masked_lm_labels` becomes `labels` in `BertForMaskedLM` and `BertForPreTraining`. -- `masked_lm_labels` becomes `labels` in `DistilBertForMaskedLM`. 
-- `masked_lm_labels` becomes `labels` in `ElectraForMaskedLM`. -- `masked_lm_labels` becomes `labels` in `LongformerForMaskedLM`. -- `masked_lm_labels` becomes `labels` in `MobileBertForMaskedLM`. -- `masked_lm_labels` becomes `labels` in `RobertaForMaskedLM`. -- `lm_labels` becomes `labels` in `BartForConditionalGeneration`. -- `lm_labels` becomes `labels` in `GPT2DoubleHeadsModel`. -- `lm_labels` becomes `labels` in `OpenAIGPTDoubleHeadsModel`. -- `lm_labels` becomes `labels` in `T5ForConditionalGeneration`. - -In several models, the caching mechanism becomes consistent with the other models: -- `decoder_cached_states` becomes `past_key_values` in all BART-like, FSMT and T5 models. -- `decoder_past_key_values` becomes `past_key_values` in all BART-like, FSMT and T5 models. -- `past` becomes `past_key_values` in all CTRL models. -- `past` becomes `past_key_values` in all GPT-2 models. - -Regarding the tokenizer classes: -- The tokenizer attribute `max_len` becomes `model_max_length`. -- The tokenizer attribute `return_lengths` becomes `return_length`. -- The tokenizer encoding argument `is_pretokenized` becomes `is_split_into_words`. - -Regarding the `Trainer` class: -- The `Trainer` argument `tb_writer` is removed in favor of the callback `TensorBoardCallback(tb_writer=...)`. -- The `Trainer` argument `prediction_loss_only` is removed in favor of the class argument `args.prediction_loss_only`. -- The `Trainer` attribute `data_collator` should be a callable. -- The `Trainer` method `_log` is deprecated in favor of `log`. -- The `Trainer` method `_training_step` is deprecated in favor of `training_step`. -- The `Trainer` method `_prediction_loop` is deprecated in favor of `prediction_loop`. -- The `Trainer` method `is_local_master` is deprecated in favor of `is_local_process_zero`. -- The `Trainer` method `is_world_master` is deprecated in favor of `is_world_process_zero`. - -Regarding the `TFTrainer` class: -- The `TFTrainer` argument `prediction_loss_only` is removed in favor of the class argument `args.prediction_loss_only`. -- The `Trainer` method `_log` is deprecated in favor of `log`. -- The `TFTrainer` method `_prediction_loop` is deprecated in favor of `prediction_loop`. -- The `TFTrainer` method `_setup_wandb` is deprecated in favor of `setup_wandb`. -- The `TFTrainer` method `_run_model` is deprecated in favor of `run_model`. - -Regarding the `TrainingArguments` class: -- The `TrainingArguments` argument `evaluate_during_training` is deprecated in favor of `evaluation_strategy`. - -Regarding the Transfo-XL model: -- The Transfo-XL configuration attribute `tie_weight` becomes `tie_words_embeddings`. -- The Transfo-XL modeling method `reset_length` becomes `reset_memory_length`. - -Regarding pipelines: -- The `FillMaskPipeline` argument `topk` becomes `top_k`. - - - -## Migrating from pytorch-transformers to 🤗 Transformers - -Here is a quick summary of what you should take care of when migrating from `pytorch-transformers` to 🤗 Transformers. - -### Positional order of some models' keywords inputs (`attention_mask`, `token_type_ids`...) changed - -To be able to use Torchscript (see #1010, #1204 and #1195) the specific order of some models **keywords inputs** (`attention_mask`, `token_type_ids`...) has been changed. - -If you used to call the models with keyword names for keyword arguments, e.g. `model(inputs_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)`, this should not cause any change. 
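As a minimal sketch (using BERT purely for illustration), passing every optional input by keyword keeps the call independent of any positional reordering:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

encoded = tokenizer("Hello, world!", return_tensors="pt")

# Keyword arguments are matched by name, so this call is unaffected by the
# positional reordering that was introduced for TorchScript support.
with torch.no_grad():
    outputs = model(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        token_type_ids=encoded["token_type_ids"],
    )
```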
- -If you used to call the models with positional inputs for keyword arguments, e.g. `model(inputs_ids, attention_mask, token_type_ids)`, you may have to double check the exact order of input arguments. - -## Migrating from pytorch-pretrained-bert - -Here is a quick summary of what you should take care of when migrating from `pytorch-pretrained-bert` to 🤗 Transformers - -### Models always output `tuples` - -The main breaking change when migrating from `pytorch-pretrained-bert` to 🤗 Transformers is that the models forward method always outputs a `tuple` with various elements depending on the model and the configuration parameters. - -The exact content of the tuples for each model are detailed in the models' docstrings and the [documentation](https://huggingface.co/transformers/). - -In pretty much every case, you will be fine by taking the first element of the output as the output you previously used in `pytorch-pretrained-bert`. - -Here is a `pytorch-pretrained-bert` to 🤗 Transformers conversion example for a `BertForSequenceClassification` classification model: - -```python -# Let's load our model -model = BertForSequenceClassification.from_pretrained("bert-base-uncased") - -# If you used to have this line in pytorch-pretrained-bert: -loss = model(input_ids, labels=labels) - -# Now just use this line in 🤗 Transformers to extract the loss from the output tuple: -outputs = model(input_ids, labels=labels) -loss = outputs[0] - -# In 🤗 Transformers you can also have access to the logits: -loss, logits = outputs[:2] - -# And even the attention weights if you configure the model to output them (and other outputs too, see the docstrings and documentation) -model = BertForSequenceClassification.from_pretrained("bert-base-uncased", output_attentions=True) -outputs = model(input_ids, labels=labels) -loss, logits, attentions = outputs -``` - -### Serialization - -Breaking change in the `from_pretrained()`method: - -1. Models are now set in evaluation mode by default when instantiated with the `from_pretrained()` method. To train them don't forget to set them back in training mode (`model.train()`) to activate the dropout modules. - -2. The additional `*inputs` and `**kwargs` arguments supplied to the `from_pretrained()` method used to be directly passed to the underlying model's class `__init__()` method. They are now used to update the model configuration attribute first which can break derived model classes build based on the previous `BertForSequenceClassification` examples. More precisely, the positional arguments `*inputs` provided to `from_pretrained()` are directly forwarded the model `__init__()` method while the keyword arguments `**kwargs` (i) which match configuration class attributes are used to update said attributes (ii) which don't match any configuration class attributes are forwarded to the model `__init__()` method. - -Also, while not a breaking change, the serialization methods have been standardized and you probably should switch to the new method `save_pretrained(save_directory)` if you were using any other serialization method before. 
- -Here is an example: - -```python -### Let's load a model and tokenizer -model = BertForSequenceClassification.from_pretrained("bert-base-uncased") -tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") - -### Do some stuff to our model and tokenizer -# Ex: add new tokens to the vocabulary and embeddings of our model -tokenizer.add_tokens(["[SPECIAL_TOKEN_1]", "[SPECIAL_TOKEN_2]"]) -model.resize_token_embeddings(len(tokenizer)) -# Train our model -train(model) - -### Now let's save our model and tokenizer to a directory -model.save_pretrained("./my_saved_model_directory/") -tokenizer.save_pretrained("./my_saved_model_directory/") - -### Reload the model and the tokenizer -model = BertForSequenceClassification.from_pretrained("./my_saved_model_directory/") -tokenizer = BertTokenizer.from_pretrained("./my_saved_model_directory/") -``` - -### Optimizers: BertAdam & OpenAIAdam are now AdamW, schedules are standard PyTorch schedules - -The two optimizers previously included, `BertAdam` and `OpenAIAdam`, have been replaced by a single `AdamW` optimizer which has a few differences: - -- it only implements weights decay correction, -- schedules are now externals (see below), -- gradient clipping is now also external (see below). - -The new optimizer `AdamW` matches PyTorch `Adam` optimizer API and let you use standard PyTorch or apex methods for the schedule and clipping. - -The schedules are now standard [PyTorch learning rate schedulers](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate) and not part of the optimizer anymore. - -Here is a conversion examples from `BertAdam` with a linear warmup and decay schedule to `AdamW` and the same schedule: - -```python -# Parameters: -lr = 1e-3 -max_grad_norm = 1.0 -num_training_steps = 1000 -num_warmup_steps = 100 -warmup_proportion = float(num_warmup_steps) / float(num_training_steps) # 0.1 - -### Previously BertAdam optimizer was instantiated like this: -optimizer = BertAdam( - model.parameters(), - lr=lr, - schedule="warmup_linear", - warmup=warmup_proportion, - num_training_steps=num_training_steps, -) -### and used like this: -for batch in train_data: - loss = model(batch) - loss.backward() - optimizer.step() - -### In 🤗 Transformers, optimizer and schedules are split and instantiated like this: -optimizer = AdamW( - model.parameters(), lr=lr, correct_bias=False -) # To reproduce BertAdam specific behavior set correct_bias=False -scheduler = get_linear_schedule_with_warmup( - optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps -) # PyTorch scheduler -### and used like this: -for batch in train_data: - loss = model(batch) - loss.backward() - torch.nn.utils.clip_grad_norm_( - model.parameters(), max_grad_norm - ) # Gradient clipping is not in AdamW anymore (so you can use amp without issue) - optimizer.step() - scheduler.step() -``` diff --git a/docs/source/en/model_doc/albert.mdx b/docs/source/en/model_doc/albert.md similarity index 51% rename from docs/source/en/model_doc/albert.mdx rename to docs/source/en/model_doc/albert.md index 8b33235121c536..a75e6757804862 100644 --- a/docs/source/en/model_doc/albert.mdx +++ b/docs/source/en/model_doc/albert.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # ALBERT @@ -41,7 +45,10 @@ self-supervised loss that focuses on modeling inter-sentence coherence, and show with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.* -Tips: +This model was contributed by [lysandre](https://huggingface.co/lysandre). This model jax version was contributed by +[kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/google-research/ALBERT). + +## Usage tips - ALBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than the left. @@ -53,9 +60,67 @@ Tips: Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and B (that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have been swapped or not. + This model was contributed by [lysandre](https://huggingface.co/lysandre). This model jax version was contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/google-research/ALBERT). + +## Resources + + +The resources provided in the following sections consist of a list of official Hugging Face and community (indicated by 🌎) resources to help you get started with AlBERT. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. + + + + + +- [`AlbertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification). + + +- [`TFAlbertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/text-classification). + +- [`FlaxAlbertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/text-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification_flax.ipynb). +- Check the [Text classification task guide](../tasks/sequence_classification) on how to use the model. + + + + + +- [`AlbertForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/token-classification). + + +- [`TFAlbertForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/token-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb). + + + +- [`FlaxAlbertForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/token-classification). +- [Token classification](https://huggingface.co/course/chapter7/2?fw=pt) chapter of the 🤗 Hugging Face Course. 
+- Check the [Token classification task guide](../tasks/token_classification) on how to use the model. + + + +- [`AlbertForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#robertabertdistilbert-and-masked-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb). +- [`TFAlbertForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/language-modeling#run_mlmpy) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb). +- [`FlaxAlbertForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling#masked-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/masked_language_modeling_flax.ipynb). +- [Masked language modeling](https://huggingface.co/course/chapter7/3?fw=pt) chapter of the 🤗 Hugging Face Course. +- Check the [Masked language modeling task guide](../tasks/masked_language_modeling) on how to use the model. + + + +- [`AlbertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb). +- [`TFAlbertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/question-answering) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb). +- [`FlaxAlbertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/question-answering). +- [Question answering](https://huggingface.co/course/chapter7/7?fw=pt) chapter of the 🤗 Hugging Face Course. +- Check the [Question answering task guide](../tasks/question_answering) on how to use the model. + +**Multiple choice** + +- [`AlbertForMultipleChoice`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/multiple-choice) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb). +- [`TFAlbertForMultipleChoice`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/multiple-choice) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice-tf.ipynb). + +- Check the [Multiple choice task guide](../tasks/multiple_choice) on how to use the model. + + ## AlbertConfig [[autodoc]] AlbertConfig @@ -78,6 +143,9 @@ This model was contributed by [lysandre](https://huggingface.co/lysandre). This [[autodoc]] models.albert.modeling_tf_albert.TFAlbertForPreTrainingOutput + + + ## AlbertModel [[autodoc]] AlbertModel @@ -112,6 +180,10 @@ This model was contributed by [lysandre](https://huggingface.co/lysandre). This [[autodoc]] AlbertForQuestionAnswering - forward + + + + ## TFAlbertModel [[autodoc]] TFAlbertModel @@ -147,6 +219,9 @@ This model was contributed by [lysandre](https://huggingface.co/lysandre). 
This [[autodoc]] TFAlbertForQuestionAnswering - call + + + ## FlaxAlbertModel [[autodoc]] FlaxAlbertModel @@ -181,3 +256,8 @@ This model was contributed by [lysandre](https://huggingface.co/lysandre). This [[autodoc]] FlaxAlbertForQuestionAnswering - __call__ + + + + + diff --git a/docs/source/en/model_doc/align.md b/docs/source/en/model_doc/align.md new file mode 100644 index 00000000000000..5e41dac6024a20 --- /dev/null +++ b/docs/source/en/model_doc/align.md @@ -0,0 +1,104 @@ + + +# ALIGN + +## Overview + +The ALIGN model was proposed in [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918) by Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. ALIGN is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image classification. ALIGN features a dual-encoder architecture with [EfficientNet](efficientnet) as its vision encoder and [BERT](bert) as its text encoder, and learns to align visual and text representations with contrastive learning. Unlike previous work, ALIGN leverages a massive noisy dataset and shows that the scale of the corpus can be used to achieve SOTA representations with a simple recipe. + +The abstract from the paper is the following: + +*Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations enables zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.* + +This model was contributed by [Alara Dirik](https://huggingface.co/adirik). +The original code is not released, this implementation is based on the Kakao Brain implementation based on the original paper. + +## Usage example + +ALIGN uses EfficientNet to get visual features and BERT to get the text features. Both the text and visual features are then projected to a latent space with identical dimension. The dot product between the projected image and text features is then used as a similarity score. 
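If you only need the standalone embeddings, for instance to build a retrieval index, [`AlignModel`] also exposes `get_text_features` and `get_image_features`. The sketch below reuses the `kakaobrain/align-base` checkpoint from the example that follows; the cosine-similarity post-processing is our own illustration rather than part of the official example:

```python
import requests
import torch
from PIL import Image
from transformers import AlignProcessor, AlignModel

processor = AlignProcessor.from_pretrained("kakaobrain/align-base")
model = AlignModel.from_pretrained("kakaobrain/align-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

text_inputs = processor(text=["an image of a cat"], return_tensors="pt")
image_inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)
    image_emb = model.get_image_features(**image_inputs)

# L2-normalise and take the dot product to recover a cosine-similarity score
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
similarity = (image_emb @ text_emb.T).item()
print(similarity)
```

Because the two towers are independent, the embeddings can be computed once, cached, and compared against many candidate texts without re-running the vision encoder.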
+ +[`AlignProcessor`] wraps [`EfficientNetImageProcessor`] and [`BertTokenizer`] into a single instance to both encode the text and preprocess the images. The following example shows how to get the image-text similarity scores using [`AlignProcessor`] and [`AlignModel`]. + +```python +import requests +import torch +from PIL import Image +from transformers import AlignProcessor, AlignModel + +processor = AlignProcessor.from_pretrained("kakaobrain/align-base") +model = AlignModel.from_pretrained("kakaobrain/align-base") + +url = "http://images.cocodataset.org/val2017/000000039769.jpg" +image = Image.open(requests.get(url, stream=True).raw) +candidate_labels = ["an image of a cat", "an image of a dog"] + +inputs = processor(text=candidate_labels, images=image, return_tensors="pt") + +with torch.no_grad(): + outputs = model(**inputs) + +# this is the image-text similarity score +logits_per_image = outputs.logits_per_image + +# we can take the softmax to get the label probabilities +probs = logits_per_image.softmax(dim=1) +print(probs) +``` + +## Resources + +A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ALIGN. + +- A blog post on [ALIGN and the COYO-700M dataset](https://huggingface.co/blog/vit-align). +- A zero-shot image classification [demo](https://huggingface.co/spaces/adirik/ALIGN-zero-shot-image-classification). +- [Model card](https://huggingface.co/kakaobrain/align-base) of `kakaobrain/align-base` model. + +If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it. The resource should ideally demonstrate something new instead of duplicating an existing resource. + +## AlignConfig + +[[autodoc]] AlignConfig + - from_text_vision_configs + +## AlignTextConfig + +[[autodoc]] AlignTextConfig + +## AlignVisionConfig + +[[autodoc]] AlignVisionConfig + +## AlignProcessor + +[[autodoc]] AlignProcessor + +## AlignModel + +[[autodoc]] AlignModel + - forward + - get_text_features + - get_image_features + +## AlignTextModel + +[[autodoc]] AlignTextModel + - forward + +## AlignVisionModel + +[[autodoc]] AlignVisionModel + - forward diff --git a/docs/source/en/model_doc/altclip.mdx b/docs/source/en/model_doc/altclip.md similarity index 94% rename from docs/source/en/model_doc/altclip.mdx rename to docs/source/en/model_doc/altclip.md index 681bea22c72e3e..b1fc9b382694cd 100644 --- a/docs/source/en/model_doc/altclip.mdx +++ b/docs/source/en/model_doc/altclip.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # AltCLIP @@ -27,7 +31,9 @@ teacher learning and contrastive learning. We validate our method through evalua performances on a bunch of tasks including ImageNet-CN, Flicker30k- CN, and COCO-CN. Further, we obtain very close performances with CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding.* -## Usage +This model was contributed by [jongjyh](https://huggingface.co/jongjyh). 
+ +## Usage tips and example The usage of AltCLIP is very similar to the CLIP. the difference between CLIP is the text encoder. Note that we use bidirectional attention instead of casual attention and we take the [CLS] token in XLM-R to represent text embedding. @@ -46,7 +52,6 @@ The [`AltCLIPProcessor`] wraps a [`CLIPImageProcessor`] and a [`XLMRobertaTokeni encode the text and prepare the images. The following example shows how to get the image-text similarity scores using [`AltCLIPProcessor`] and [`AltCLIPModel`]. - ```python >>> from PIL import Image >>> import requests @@ -66,11 +71,11 @@ encode the text and prepare the images. The following example shows how to get t >>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities ``` -Tips: + -This model is build on `CLIPModel`, so use it like a original CLIP. +This model is based on `CLIPModel`, use it like you would use the original [CLIP](clip). -This model was contributed by [jongjyh](https://huggingface.co/jongjyh). + ## AltCLIPConfig diff --git a/docs/source/en/model_doc/audio-spectrogram-transformer.mdx b/docs/source/en/model_doc/audio-spectrogram-transformer.md similarity index 91% rename from docs/source/en/model_doc/audio-spectrogram-transformer.mdx rename to docs/source/en/model_doc/audio-spectrogram-transformer.md index 54fbebc3b456cc..3eac3781667eb4 100644 --- a/docs/source/en/model_doc/audio-spectrogram-transformer.mdx +++ b/docs/source/en/model_doc/audio-spectrogram-transformer.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Audio Spectrogram Transformer @@ -22,7 +26,15 @@ The abstract from the paper is the following: *In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and if neural networks purely based on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.* -Tips: + + + Audio Spectrogram Transformer architecture. Taken from the original paper. + +This model was contributed by [nielsr](https://huggingface.co/nielsr). +The original code can be found [here](https://github.com/YuanGongND/ast). + +## Usage tips - When fine-tuning the Audio Spectrogram Transformer (AST) on your own dataset, it's recommended to take care of the input normalization (to make sure the input has mean of 0 and std of 0.5). 
[`ASTFeatureExtractor`] takes care of this. Note that it uses the AudioSet @@ -31,14 +43,6 @@ the authors compute the stats for a downstream dataset. - Note that the AST needs a low learning rate (the authors use a 10 times smaller learning rate compared to their CNN model proposed in the [PSLA paper](https://arxiv.org/abs/2102.01243)) and converges quickly, so please search for a suitable learning rate and learning rate scheduler for your task. - - - Audio pectrogram Transformer architecture. Taken from the original paper. - -This model was contributed by [nielsr](https://huggingface.co/nielsr). -The original code can be found [here](https://github.com/YuanGongND/ast). - ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with the Audio Spectrogram Transformer. @@ -47,6 +51,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - A notebook illustrating inference with AST for audio classification can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/AST). - [`ASTForAudioClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/audio-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb). +- See also: [Audio classification](../tasks/audio_classification). If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. @@ -67,4 +72,4 @@ If you're interested in submitting a resource to be included here, please feel f ## ASTForAudioClassification [[autodoc]] ASTForAudioClassification - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/auto.mdx b/docs/source/en/model_doc/auto.md similarity index 85% rename from docs/source/en/model_doc/auto.mdx rename to docs/source/en/model_doc/auto.md index b39920151db424..036b8b81ca6b48 100644 --- a/docs/source/en/model_doc/auto.mdx +++ b/docs/source/en/model_doc/auto.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Auto Classes @@ -21,7 +25,7 @@ Instantiating one of [`AutoConfig`], [`AutoModel`], and ```python -model = AutoModel.from_pretrained("bert-base-cased") +model = AutoModel.from_pretrained("google-bert/bert-base-cased") ``` will create a model that is an instance of [`BertModel`]. @@ -45,7 +49,7 @@ You will then be able to use the auto classes like you would usually do! -If your `NewModelConfig` is a subclass of [`~transformer.PretrainedConfig`], make sure its +If your `NewModelConfig` is a subclass of [`~transformers.PretrainedConfig`], make sure its `model_type` attribute is set to the same key you use when registering the config (here `"new-model"`). 
Likewise, if your `NewModel` is a subclass of [`PreTrainedModel`], make sure its @@ -134,6 +138,14 @@ The following auto classes are available for the following natural language proc [[autodoc]] FlaxAutoModelForMaskedLM +### AutoModelForMaskGeneration + +[[autodoc]] AutoModelForMaskGeneration + +### TFAutoModelForMaskGeneration + +[[autodoc]] TFAutoModelForMaskGeneration + ### AutoModelForSeq2SeqLM [[autodoc]] AutoModelForSeq2SeqLM @@ -206,6 +218,14 @@ The following auto classes are available for the following natural language proc [[autodoc]] FlaxAutoModelForQuestionAnswering +### AutoModelForTextEncoding + +[[autodoc]] AutoModelForTextEncoding + +### TFAutoModelForTextEncoding + +[[autodoc]] TFAutoModelForTextEncoding + ## Computer vision The following auto classes are available for the following computer vision tasks. @@ -234,6 +254,10 @@ The following auto classes are available for the following computer vision tasks [[autodoc]] AutoModelForMaskedImageModeling +### TFAutoModelForMaskedImageModeling + +[[autodoc]] TFAutoModelForMaskedImageModeling + ### AutoModelForObjectDetection [[autodoc]] AutoModelForObjectDetection @@ -242,6 +266,10 @@ The following auto classes are available for the following computer vision tasks [[autodoc]] AutoModelForImageSegmentation +### AutoModelForImageToImage + +[[autodoc]] AutoModelForImageToImage + ### AutoModelForSemanticSegmentation [[autodoc]] AutoModelForSemanticSegmentation @@ -258,6 +286,14 @@ The following auto classes are available for the following computer vision tasks [[autodoc]] AutoModelForUniversalSegmentation +### AutoModelForZeroShotImageClassification + +[[autodoc]] AutoModelForZeroShotImageClassification + +### TFAutoModelForZeroShotImageClassification + +[[autodoc]] TFAutoModelForZeroShotImageClassification + ### AutoModelForZeroShotObjectDetection [[autodoc]] AutoModelForZeroShotObjectDetection @@ -272,6 +308,10 @@ The following auto classes are available for the following audio tasks. ### AutoModelForAudioFrameClassification +[[autodoc]] TFAutoModelForAudioClassification + +### TFAutoModelForAudioFrameClassification + [[autodoc]] AutoModelForAudioFrameClassification ### AutoModelForCTC @@ -286,10 +326,22 @@ The following auto classes are available for the following audio tasks. [[autodoc]] TFAutoModelForSpeechSeq2Seq +### FlaxAutoModelForSpeechSeq2Seq + +[[autodoc]] FlaxAutoModelForSpeechSeq2Seq + ### AutoModelForAudioXVector [[autodoc]] AutoModelForAudioXVector +### AutoModelForTextToSpectrogram + +[[autodoc]] AutoModelForTextToSpectrogram + +### AutoModelForTextToWaveform + +[[autodoc]] AutoModelForTextToWaveform + ## Multimodal The following auto classes are available for the following multimodal tasks. diff --git a/docs/source/en/model_doc/autoformer.md b/docs/source/en/model_doc/autoformer.md new file mode 100644 index 00000000000000..bb423e941c78ca --- /dev/null +++ b/docs/source/en/model_doc/autoformer.md @@ -0,0 +1,50 @@ + + +# Autoformer + +## Overview + +The Autoformer model was proposed in [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long. + +This model augments the Transformer as a deep decomposition architecture, which can progressively decompose the trend and seasonal components during the forecasting process. 
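As a quick orientation before the abstract, here is a rough inference sketch with [`AutoformerForPrediction`]. The checkpoint and helper batch names used below (`huggingface/autoformer-tourism-monthly` and `hf-internal-testing/tourism-monthly-batch`) are assumptions on our part, and the exact set of features a checkpoint expects depends on its configuration:

```python
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoformerForPrediction

# A pre-batched sample of the tourism-monthly dataset (assumed repo/file names)
file = hf_hub_download(
    repo_id="hf-internal-testing/tourism-monthly-batch", filename="train-batch.pt", repo_type="dataset"
)
batch = torch.load(file)

model = AutoformerForPrediction.from_pretrained("huggingface/autoformer-tourism-monthly")

# At inference time only past values (plus known future time features) are provided;
# `generate` samples future trajectories autoregressively.
with torch.no_grad():
    outputs = model.generate(
        past_values=batch["past_values"],
        past_time_features=batch["past_time_features"],
        past_observed_mask=batch["past_observed_mask"],
        static_categorical_features=batch["static_categorical_features"],
        future_time_features=batch["future_time_features"],
    )

# Average over the sampled trajectories to obtain a point forecast
mean_prediction = outputs.sequences.mean(dim=1)
```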
+ +The abstract from the paper is the following: + +*Extending the forecasting time is a critical demand for real applications, such as extreme weather early warning and long-term energy consumption planning. This paper studies the long-term forecasting problem of time series. Prior Transformer-based models adopt various self-attention mechanisms to discover the long-range dependencies. However, intricate temporal patterns of the long-term future prohibit the model from finding reliable dependencies. Also, Transformers have to adopt the sparse versions of point-wise self-attentions for long series efficiency, resulting in the information utilization bottleneck. Going beyond Transformers, we design Autoformer as a novel decomposition architecture with an Auto-Correlation mechanism. We break with the pre-processing convention of series decomposition and renovate it as a basic inner block of deep models. This design empowers Autoformer with progressive decomposition capacities for complex time series. Further, inspired by the stochastic process theory, we design the Auto-Correlation mechanism based on the series periodicity, which conducts the dependencies discovery and representation aggregation at the sub-series level. Auto-Correlation outperforms self-attention in both efficiency and accuracy. In long-term forecasting, Autoformer yields state-of-the-art accuracy, with a 38% relative improvement on six benchmarks, covering five practical applications: energy, traffic, economics, weather and disease.* + +This model was contributed by [elisim](https://huggingface.co/elisim) and [kashif](https://huggingface.co/kashif). +The original code can be found [here](https://github.com/thuml/Autoformer). + +## Resources + +A list of official Hugging Face and community (indicated by 🌎) resources to help you get started. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. + +- Check out the Autoformer blog-post in HuggingFace blog: [Yes, Transformers are Effective for Time Series Forecasting (+ Autoformer)](https://huggingface.co/blog/autoformer) + +## AutoformerConfig + +[[autodoc]] AutoformerConfig + +## AutoformerModel + +[[autodoc]] AutoformerModel + - forward + +## AutoformerForPrediction + +[[autodoc]] AutoformerForPrediction + - forward diff --git a/docs/source/en/model_doc/bark.md b/docs/source/en/model_doc/bark.md new file mode 100644 index 00000000000000..7c02e4be701187 --- /dev/null +++ b/docs/source/en/model_doc/bark.md @@ -0,0 +1,232 @@ + + +# Bark + +## Overview + +Bark is a transformer-based text-to-speech model proposed by Suno AI in [suno-ai/bark](https://github.com/suno-ai/bark). + +Bark is made of 4 main models: + +- [`BarkSemanticModel`] (also referred to as the 'text' model): a causal auto-regressive transformer model that takes as input tokenized text, and predicts semantic text tokens that capture the meaning of the text. +- [`BarkCoarseModel`] (also referred to as the 'coarse acoustics' model): a causal autoregressive transformer, that takes as input the results of the [`BarkSemanticModel`] model. It aims at predicting the first two audio codebooks necessary for EnCodec. +- [`BarkFineModel`] (the 'fine acoustics' model), this time a non-causal autoencoder transformer, which iteratively predicts the last codebooks based on the sum of the previous codebooks embeddings. 
+- having predicted all the codebook channels from the [`EncodecModel`], Bark uses it to decode the output audio array. + +It should be noted that each of the first three modules can support conditional speaker embeddings to condition the output sound according to specific predefined voice. + +This model was contributed by [Yoach Lacombe (ylacombe)](https://huggingface.co/ylacombe) and [Sanchit Gandhi (sanchit-gandhi)](https://github.com/sanchit-gandhi). +The original code can be found [here](https://github.com/suno-ai/bark). + +### Optimizing Bark + +Bark can be optimized with just a few extra lines of code, which **significantly reduces its memory footprint** and **accelerates inference**. + +#### Using half-precision + +You can speed up inference and reduce memory footprint by 50% simply by loading the model in half-precision. + +```python +from transformers import BarkModel +import torch + +device = "cuda" if torch.cuda.is_available() else "cpu" +model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to(device) +``` + +#### Using CPU offload + +As mentioned above, Bark is made up of 4 sub-models, which are called up sequentially during audio generation. In other words, while one sub-model is in use, the other sub-models are idle. + +If you're using a CUDA device, a simple solution to benefit from an 80% reduction in memory footprint is to offload the submodels from GPU to CPU when they're idle. This operation is called *CPU offloading*. You can use it with one line of code as follows: + +```python +model.enable_cpu_offload() +``` + +Note that 🤗 Accelerate must be installed before using this feature. [Here's how to install it.](https://huggingface.co/docs/accelerate/basic_tutorials/install) + +#### Using Better Transformer + +Better Transformer is an 🤗 Optimum feature that performs kernel fusion under the hood. You can gain 20% to 30% in speed with zero performance degradation. It only requires one line of code to export the model to 🤗 Better Transformer: + +```python +model = model.to_bettertransformer() +``` + +Note that 🤗 Optimum must be installed before using this feature. [Here's how to install it.](https://huggingface.co/docs/optimum/installation) + +#### Using Flash Attention 2 + +Flash Attention 2 is an even faster, optimized version of the previous optimization. + +##### Installation + +First, check whether your hardware is compatible with Flash Attention 2. The latest list of compatible hardware can be found in the [official documentation](https://github.com/Dao-AILab/flash-attention#installation-and-features). If your hardware is not compatible with Flash Attention 2, you can still benefit from attention kernel optimisations through Better Transformer support covered [above](https://huggingface.co/docs/transformers/main/en/model_doc/bark#using-better-transformer). + +Next, [install](https://github.com/Dao-AILab/flash-attention#installation-and-features) the latest version of Flash Attention 2: + +```bash +pip install -U flash-attn --no-build-isolation +``` + + +##### Usage + +To load a model using Flash Attention 2, we can pass the `attn_implementation="flash_attention_2"` flag to [`.from_pretrained`](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained). We'll also load the model in half-precision (e.g. 
`torch.float16`), since it results in almost no degradation to audio quality but significantly lower memory usage and faster inference: + +```python +model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16, attn_implementation="flash_attention_2").to(device) +``` + +##### Performance comparison + + +The following diagram shows the latency for the native attention implementation (no optimisation) against Better Transformer and Flash Attention 2. In all cases, we generate 400 semantic tokens on a 40GB A100 GPU with PyTorch 2.1. Flash Attention 2 is also consistently faster than Better Transformer, and its performance improves even more as batch sizes increase: + +
+ +
+ +To put this into perspective, on an NVIDIA A100 and when generating 400 semantic tokens with a batch size of 16, you can get 17 times the [throughput](https://huggingface.co/blog/optimizing-bark#throughput) and still be 2 seconds faster than generating sentences one by one with the native model implementation. In other words, all the samples will be generated 17 times faster. + +At batch size 8, on an NVIDIA A100, Flash Attention 2 is also 10% faster than Better Transformer, and at batch size 16, 25%. + + +#### Combining optimization techniques + +You can combine optimization techniques, and use CPU offload, half-precision and Flash Attention 2 (or 🤗 Better Transformer) all at once. + +```python +from transformers import BarkModel +import torch + +device = "cuda" if torch.cuda.is_available() else "cpu" + +# load in fp16 and use Flash Attention 2 +model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16, attn_implementation="flash_attention_2").to(device) + +# enable CPU offload +model.enable_cpu_offload() +``` + +Find out more on inference optimization techniques [here](https://huggingface.co/docs/transformers/perf_infer_gpu_one). + +### Usage tips + +Suno offers a library of voice presets in a number of languages [here](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c). +These presets are also uploaded in the hub [here](https://huggingface.co/suno/bark-small/tree/main/speaker_embeddings) or [here](https://huggingface.co/suno/bark/tree/main/speaker_embeddings). + +```python +>>> from transformers import AutoProcessor, BarkModel + +>>> processor = AutoProcessor.from_pretrained("suno/bark") +>>> model = BarkModel.from_pretrained("suno/bark") + +>>> voice_preset = "v2/en_speaker_6" + +>>> inputs = processor("Hello, my dog is cute", voice_preset=voice_preset) + +>>> audio_array = model.generate(**inputs) +>>> audio_array = audio_array.cpu().numpy().squeeze() +``` + +Bark can generate highly realistic, **multilingual** speech as well as other audio - including music, background noise and simple sound effects. + +```python +>>> # Multilingual speech - simplified Chinese +>>> inputs = processor("惊人的!我会说中文") + +>>> # Multilingual speech - French - let's use a voice_preset as well +>>> inputs = processor("Incroyable! Je peux générer du son.", voice_preset="fr_speaker_5") + +>>> # Bark can also generate music. You can help it out by adding music notes around your lyrics. +>>> inputs = processor("♪ Hello, my dog is cute ♪") + +>>> audio_array = model.generate(**inputs) +>>> audio_array = audio_array.cpu().numpy().squeeze() +``` + +The model can also produce **nonverbal communications** like laughing, sighing and crying. + + +```python +>>> # Adding non-speech cues to the input text +>>> inputs = processor("Hello uh ... 
[clears throat], my dog is cute [laughter]") + +>>> audio_array = model.generate(**inputs) +>>> audio_array = audio_array.cpu().numpy().squeeze() +``` + +To save the audio, simply take the sample rate from the model config and some scipy utility: + +```python +>>> from scipy.io.wavfile import write as write_wav + +>>> # save audio to disk, but first take the sample rate from the model config +>>> sample_rate = model.generation_config.sample_rate +>>> write_wav("bark_generation.wav", sample_rate, audio_array) +``` + +## BarkConfig + +[[autodoc]] BarkConfig + - all + +## BarkProcessor + +[[autodoc]] BarkProcessor + - all + - __call__ + +## BarkModel + +[[autodoc]] BarkModel + - generate + - enable_cpu_offload + +## BarkSemanticModel + +[[autodoc]] BarkSemanticModel + - forward + +## BarkCoarseModel + +[[autodoc]] BarkCoarseModel + - forward + +## BarkFineModel + +[[autodoc]] BarkFineModel + - forward + +## BarkCausalModel + +[[autodoc]] BarkCausalModel + - forward + +## BarkCoarseConfig + +[[autodoc]] BarkCoarseConfig + - all + +## BarkFineConfig + +[[autodoc]] BarkFineConfig + - all + +## BarkSemanticConfig + +[[autodoc]] BarkSemanticConfig + - all + diff --git a/docs/source/en/model_doc/bart.mdx b/docs/source/en/model_doc/bart.md similarity index 89% rename from docs/source/en/model_doc/bart.mdx rename to docs/source/en/model_doc/bart.md index 8047082be2c3f1..7986228915cf88 100644 --- a/docs/source/en/model_doc/bart.mdx +++ b/docs/source/en/model_doc/bart.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # BART @@ -21,9 +25,6 @@ specific language governing permissions and limitations under the License. -**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign -@patrickvonplaten - ## Overview The Bart model was proposed in [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, @@ -41,7 +42,9 @@ According to the abstract, state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. -Tips: +This model was contributed by [sshleifer](https://huggingface.co/sshleifer). The authors' code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/bart). + +## Usage tips: - BART is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than the left. @@ -53,18 +56,6 @@ Tips: * permute sentences * rotate the document to make it start at a specific token -This model was contributed by [sshleifer](https://huggingface.co/sshleifer). The Authors' code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/bart). - - -### Examples - -- Examples and scripts for fine-tuning BART and other models for sequence to sequence tasks can be found in - [examples/pytorch/summarization/](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization/README.md). 
-- An example of how to train [`BartForConditionalGeneration`] with a Hugging Face `datasets` - object can be found in this [forum discussion](https://discuss.huggingface.co/t/train-bart-for-conditional-generation-e-g-summarization/1904). -- [Distilled checkpoints](https://huggingface.co/models?search=distilbart) are described in this [paper](https://arxiv.org/abs/2010.13002). - - ## Implementation Notes - Bart doesn't use `token_type_ids` for sequence classification. Use [`BartTokenizer`] or @@ -103,12 +94,14 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - A blog post on [Distributed Training: Train BART/T5 for Summarization using 🤗 Transformers and Amazon SageMaker](https://huggingface.co/blog/sagemaker-distributed-training-seq2seq). -- A notebook on how to [finetune BART for summarization with fastai using blurr](https://colab.research.google.com/github/ohmeow/ohmeow_website/blob/master/_notebooks/2020-05-23-text-generation-with-blurr.ipynb). 🌎 +- A notebook on how to [finetune BART for summarization with fastai using blurr](https://colab.research.google.com/github/ohmeow/ohmeow_website/blob/master/posts/2021-05-25-mbart-sequence-classification-with-blurr.ipynb). 🌎 - A notebook on how to [finetune BART for summarization in two languages with Trainer class](https://colab.research.google.com/github/elsanns/xai-nlp-notebooks/blob/master/fine_tune_bart_summarization_two_langs.ipynb). 🌎 -- [`BartForConditionalGeneration`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization) and [noteboook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb). +- [`BartForConditionalGeneration`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb). - [`TFBartForConditionalGeneration`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/summarization) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization-tf.ipynb). - [`FlaxBartForConditionalGeneration`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/summarization). +- An example of how to train [`BartForConditionalGeneration`] with a Hugging Face `datasets` object can be found in this [forum discussion](https://discuss.huggingface.co/t/train-bart-for-conditional-generation-e-g-summarization/1904) - [Summarization](https://huggingface.co/course/chapter7/5?fw=pt#summarization) chapter of the 🤗 Hugging Face course. +- [Summarization task guide](../tasks/summarization) @@ -116,12 +109,20 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`TFBartForConditionalGeneration`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/language-modeling#run_mlmpy) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb). 
- [`FlaxBartForConditionalGeneration`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling#masked-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/masked_language_modeling_flax.ipynb). - [Masked language modeling](https://huggingface.co/course/chapter7/3?fw=pt) chapter of the 🤗 Hugging Face Course. +- [Masked language modeling task guide](../tasks/masked_language_modeling) - A notebook on how to [finetune mBART using Seq2SeqTrainer for Hindi to English translation](https://colab.research.google.com/github/vasudevgupta7/huggingface-tutorials/blob/main/translation_training.ipynb). 🌎 - [`BartForConditionalGeneration`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/translation) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb). - [`TFBartForConditionalGeneration`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/translation) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation-tf.ipynb). +- [Translation task guide](../tasks/translation) + +See also: +- [Text classification task guide](../tasks/sequence_classification) +- [Question answering task guide](../tasks/question_answering) +- [Causal language modeling task guide](../tasks/language_modeling) +- [Distilled checkpoints](https://huggingface.co/models?search=distilbart) are described in this [paper](https://arxiv.org/abs/2010.13002). ## BartConfig @@ -138,6 +139,10 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] BartTokenizerFast - all + + + + ## BartModel [[autodoc]] BartModel @@ -163,6 +168,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] BartForCausalLM - forward + + + ## TFBartModel [[autodoc]] TFBartModel @@ -178,6 +186,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] TFBartForSequenceClassification - call + + + ## FlaxBartModel [[autodoc]] FlaxBartModel @@ -210,3 +221,8 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] FlaxBartForCausalLM - __call__ + + + + + diff --git a/docs/source/en/model_doc/barthez.mdx b/docs/source/en/model_doc/barthez.md similarity index 86% rename from docs/source/en/model_doc/barthez.mdx rename to docs/source/en/model_doc/barthez.md index f1969e8e942422..1b571e242f4743 100644 --- a/docs/source/en/model_doc/barthez.mdx +++ b/docs/source/en/model_doc/barthez.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # BARThez @@ -34,8 +38,14 @@ provides a significant boost over vanilla BARThez, and is on par with or outperf This model was contributed by [moussakam](https://huggingface.co/moussakam). The Authors' code can be found [here](https://github.com/moussaKam/BARThez). 
+ + +BARThez implementation is the same as BART, except for tokenization. Refer to [BART documentation](bart) for information on +configuration classes and their parameters. BARThez-specific tokenizers are documented below. + + -### Examples +## Resources - BARThez can be fine-tuned on sequence-to-sequence tasks in a similar way as BART, check: [examples/pytorch/summarization/](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization/README.md). diff --git a/docs/source/en/model_doc/bartpho.mdx b/docs/source/en/model_doc/bartpho.md similarity index 94% rename from docs/source/en/model_doc/bartpho.mdx rename to docs/source/en/model_doc/bartpho.md index d940173b42f861..8f0a5f8bfe24a4 100644 --- a/docs/source/en/model_doc/bartpho.mdx +++ b/docs/source/en/model_doc/bartpho.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # BARTpho @@ -25,7 +29,9 @@ on a downstream task of Vietnamese text summarization show that in both automati outperforms the strong baseline mBART and improves the state-of-the-art. We release BARTpho to facilitate future research and applications of generative Vietnamese NLP tasks.* -Example of use: +This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BARTpho). + +## Usage example ```python >>> import torch @@ -50,7 +56,7 @@ Example of use: >>> features = bartpho(**input_ids) ``` -Tips: +## Usage tips - Following mBART, BARTpho uses the "large" architecture of BART with an additional layer-normalization layer on top of both the encoder and decoder. Thus, usage examples in the [documentation of BART](bart), when adapting to use @@ -75,8 +81,6 @@ Tips: Other languages, if employing this pre-trained multilingual SentencePiece model "vocab_file" for subword segmentation, can reuse BartphoTokenizer with their own language-specialized "monolingual_vocab_file". -This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BARTpho). - ## BartphoTokenizer [[autodoc]] BartphoTokenizer diff --git a/docs/source/en/model_doc/beit.mdx b/docs/source/en/model_doc/beit.md similarity index 94% rename from docs/source/en/model_doc/beit.mdx rename to docs/source/en/model_doc/beit.md index 17132ef0edc5c8..f7605ebcdf90d4 100644 --- a/docs/source/en/model_doc/beit.mdx +++ b/docs/source/en/model_doc/beit.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # BEiT @@ -35,7 +39,10 @@ with previous pre-training methods. 
For example, base-size BEiT achieves 83.2% t significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%).* -Tips: +This model was contributed by [nielsr](https://huggingface.co/nielsr). The JAX/FLAX version of this model was +contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/beit). + +## Usage tips - BEiT models are regular Vision Transformers, but pre-trained in a self-supervised way rather than supervised. They outperform both the [original model (ViT)](vit) as well as [Data-efficient Image Transformers (DeiT)](deit) when fine-tuned on ImageNet-1K and CIFAR-100. You can check out demo notebooks regarding inference as well as @@ -64,9 +71,6 @@ alt="drawing" width="600"/> BEiT pre-training. Taken from the original paper. -This model was contributed by [nielsr](https://huggingface.co/nielsr). The JAX/FLAX version of this model was -contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/beit). - ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BEiT. @@ -74,6 +78,10 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`BeitForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). +- See also: [Image classification task guide](../tasks/image_classification) + +**Semantic segmentation** +- [Semantic segmentation task guide](../tasks/semantic_segmentation) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. @@ -99,6 +107,9 @@ If you're interested in submitting a resource to be included here, please feel f - preprocess - post_process_semantic_segmentation + + + ## BeitModel [[autodoc]] BeitModel @@ -119,6 +130,9 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] BeitForSemanticSegmentation - forward + + + ## FlaxBeitModel [[autodoc]] FlaxBeitModel @@ -133,3 +147,6 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] FlaxBeitForImageClassification - __call__ + + + \ No newline at end of file diff --git a/docs/source/en/model_doc/bert-generation.mdx b/docs/source/en/model_doc/bert-generation.md similarity index 83% rename from docs/source/en/model_doc/bert-generation.mdx rename to docs/source/en/model_doc/bert-generation.md index e300917ea5e6f7..40c2fbaa212e6b 100644 --- a/docs/source/en/model_doc/bert-generation.mdx +++ b/docs/source/en/model_doc/bert-generation.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
+ +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # BertGeneration @@ -29,23 +33,26 @@ GPT-2 and RoBERTa checkpoints and conducted an extensive empirical study on the encoder and decoder, with these checkpoints. Our models result in new state-of-the-art results on Machine Translation, Text Summarization, Sentence Splitting, and Sentence Fusion.* -Usage: +This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be +found [here](https://tfhub.dev/s?module-type=text-generation&subtype=module,placeholder). + +## Usage examples and tips -- The model can be used in combination with the [`EncoderDecoderModel`] to leverage two pretrained - BERT checkpoints for subsequent fine-tuning. +The model can be used in combination with the [`EncoderDecoderModel`] to leverage two pretrained BERT checkpoints for +subsequent fine-tuning: ```python >>> # leverage checkpoints for Bert2Bert model... >>> # use BERT's cls token as BOS token and sep token as EOS token ->>> encoder = BertGenerationEncoder.from_pretrained("bert-large-uncased", bos_token_id=101, eos_token_id=102) +>>> encoder = BertGenerationEncoder.from_pretrained("google-bert/bert-large-uncased", bos_token_id=101, eos_token_id=102) >>> # add cross attention layers and use BERT's cls token as BOS token and sep token as EOS token >>> decoder = BertGenerationDecoder.from_pretrained( -... "bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102 +... "google-bert/bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102 ... ) >>> bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder) >>> # create tokenizer... ->>> tokenizer = BertTokenizer.from_pretrained("bert-large-uncased") +>>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-large-uncased") >>> input_ids = tokenizer( ... "This is a long article to summarize", add_special_tokens=False, return_tensors="pt" @@ -57,8 +64,7 @@ Usage: >>> loss.backward() ``` -- Pretrained [`EncoderDecoderModel`] are also directly available in the model hub, e.g., - +Pretrained [`EncoderDecoderModel`] are also directly available in the model hub, e.g.: ```python >>> # instantiate sentence fusion model @@ -81,9 +87,6 @@ Tips: - For summarization, sentence splitting, sentence fusion and translation, no special tokens are required for the input. Therefore, no EOS token should be added to the end of the input. -This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be -found [here](https://tfhub.dev/s?module-type=text-generation&subtype=module,placeholder). - ## BertGenerationConfig [[autodoc]] BertGenerationConfig diff --git a/docs/source/en/model_doc/bert-japanese.mdx b/docs/source/en/model_doc/bert-japanese.md similarity index 87% rename from docs/source/en/model_doc/bert-japanese.mdx rename to docs/source/en/model_doc/bert-japanese.md index 312714b379e8f2..d68bb221d5779f 100644 --- a/docs/source/en/model_doc/bert-japanese.mdx +++ b/docs/source/en/model_doc/bert-japanese.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # BertJapanese @@ -63,11 +67,15 @@ Example of using a model with Character tokenization: >>> outputs = bertjapanese(**inputs) ``` -Tips: +This model was contributed by [cl-tohoku](https://huggingface.co/cl-tohoku). + + -- This implementation is the same as BERT, except for tokenization method. Refer to the [documentation of BERT](bert) for more usage examples. +This implementation is the same as BERT, except for tokenization method. Refer to [BERT documentation](bert) for +API reference information. + + -This model was contributed by [cl-tohoku](https://huggingface.co/cl-tohoku). ## BertJapaneseTokenizer diff --git a/docs/source/en/model_doc/bert.mdx b/docs/source/en/model_doc/bert.md similarity index 95% rename from docs/source/en/model_doc/bert.mdx rename to docs/source/en/model_doc/bert.md index e6c47a9aa9a85a..bdf4566b43ad5c 100644 --- a/docs/source/en/model_doc/bert.mdx +++ b/docs/source/en/model_doc/bert.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # BERT @@ -41,7 +45,9 @@ language processing tasks, including pushing the GLUE score to 80.5% (7.7% point accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).* -Tips: +This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/google-research/bert). + +## Usage tips - BERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than the left. @@ -55,10 +61,6 @@ Tips: - The model must predict the original sentence, but has a second objective: inputs are two sentences A and B (with a separation token in between). With probability 50%, the sentences are consecutive in the corpus, in the remaining 50% they are not related. The model has to predict if the sentences are consecutive or not. - - -This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/google-research/bert). - ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BERT. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. 
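+
+The masked language modeling objective described in the usage tips above can be tried out directly with a
+`fill-mask` pipeline. Below is a minimal sketch; the `bert-base-uncased` checkpoint is assumed purely for
+illustration:
+
+```python
+>>> from transformers import pipeline
+
+>>> # a minimal sketch of the masked-token objective described above;
+>>> # "bert-base-uncased" is assumed as an illustrative checkpoint
+>>> unmasker = pipeline("fill-mask", model="bert-base-uncased")
+>>> predictions = unmasker("The capital of France is [MASK].")
+>>> print(predictions[0]["token_str"])
+```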
@@ -72,6 +74,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`BertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb). - [`TFBertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/text-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb). - [`FlaxBertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/text-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification_flax.ipynb). +- [Text classification task guide](../tasks/sequence_classification) @@ -81,6 +84,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`TFBertForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/token-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb). - [`FlaxBertForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/token-classification). - [Token classification](https://huggingface.co/course/chapter7/2?fw=pt) chapter of the 🤗 Hugging Face Course. +- [Token classification task guide](../tasks/token_classification) @@ -88,6 +92,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`TFBertForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/language-modeling#run_mlmpy) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb). - [`FlaxBertForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling#masked-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/masked_language_modeling_flax.ipynb). - [Masked language modeling](https://huggingface.co/course/chapter7/3?fw=pt) chapter of the 🤗 Hugging Face Course. +- [Masked language modeling task guide](../tasks/masked_language_modeling) @@ -95,10 +100,12 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`TFBertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/question-answering) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb). - [`FlaxBertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/question-answering). - [Question answering](https://huggingface.co/course/chapter7/7?fw=pt) chapter of the 🤗 Hugging Face Course. 
+- [Question answering task guide](../tasks/question_answering) **Multiple choice** - [`BertForMultipleChoice`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/multiple-choice) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb). - [`TFBertForMultipleChoice`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/multiple-choice) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice-tf.ipynb). +- [Multiple choice task guide](../tasks/multiple_choice) ⚡️ **Inference** - A blog post on how to [Accelerate BERT inference with Hugging Face Transformers and AWS Inferentia](https://huggingface.co/blog/bert-inferentia-sagemaker). @@ -128,14 +135,23 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - create_token_type_ids_from_sequences - save_vocabulary + + + ## BertTokenizerFast [[autodoc]] BertTokenizerFast + + + ## TFBertTokenizer [[autodoc]] TFBertTokenizer + + + ## Bert specific outputs [[autodoc]] models.bert.modeling_bert.BertForPreTrainingOutput @@ -144,6 +160,10 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] models.bert.modeling_flax_bert.FlaxBertForPreTrainingOutput + + + + ## BertModel [[autodoc]] BertModel @@ -189,6 +209,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] BertForQuestionAnswering - forward + + + ## TFBertModel [[autodoc]] TFBertModel @@ -234,6 +257,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] TFBertForQuestionAnswering - call + + + ## FlaxBertModel [[autodoc]] FlaxBertModel @@ -278,3 +304,8 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] FlaxBertForQuestionAnswering - __call__ + + + + + diff --git a/docs/source/en/model_doc/bertweet.mdx b/docs/source/en/model_doc/bertweet.md similarity index 87% rename from docs/source/en/model_doc/bertweet.mdx rename to docs/source/en/model_doc/bertweet.md index df55360646f931..c4c883b21ad781 100644 --- a/docs/source/en/model_doc/bertweet.mdx +++ b/docs/source/en/model_doc/bertweet.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # BERTweet @@ -24,7 +28,9 @@ al., 2019). Experiments show that BERTweet outperforms strong baselines RoBERTa- 2020), producing better performance results than the previous state-of-the-art models on three Tweet NLP tasks: Part-of-speech tagging, Named-entity recognition and text classification.* -Example of use: +This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BERTweet). 
+ +## Usage example ```python >>> import torch @@ -51,7 +57,12 @@ Example of use: >>> # bertweet = TFAutoModel.from_pretrained("vinai/bertweet-base") ``` -This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/BERTweet). + + +This implementation is the same as BERT, except for tokenization method. Refer to [BERT documentation](bert) for +API reference information. + + ## BertweetTokenizer diff --git a/docs/source/en/model_doc/big_bird.mdx b/docs/source/en/model_doc/big_bird.md similarity index 88% rename from docs/source/en/model_doc/big_bird.mdx rename to docs/source/en/model_doc/big_bird.md index fa15d32cdb1c7f..3d1ef91d560629 100644 --- a/docs/source/en/model_doc/big_bird.mdx +++ b/docs/source/en/model_doc/big_bird.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # BigBird @@ -37,7 +41,10 @@ sequence as part of the sparse attention mechanism. The proposed sparse attentio BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.* -Tips: +This model was contributed by [vasudevgupta](https://huggingface.co/vasudevgupta). The original code can be found +[here](https://github.com/google-research/bigbird). + +## Usage tips - For an in-detail explanation on how BigBird's attention works, see [this blog post](https://huggingface.co/blog/big-bird). - BigBird comes with 2 implementations: **original_full** & **block_sparse**. For the sequence length < 1024, using @@ -49,8 +56,15 @@ Tips: - BigBird is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than the left. -This model was contributed by [vasudevgupta](https://huggingface.co/vasudevgupta). The original code can be found -[here](https://github.com/google-research/bigbird). 
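+
+The `attention_type`, `block_size` and `num_random_blocks` switches mentioned in the tips above can be set when
+loading a checkpoint. A short sketch follows; the `google/bigbird-roberta-base` checkpoint and the values shown
+are assumptions chosen for illustration:
+
+```python
+>>> from transformers import BigBirdModel
+
+>>> # block_sparse attention for long inputs; block_size and num_random_blocks are tunable
+>>> model = BigBirdModel.from_pretrained(
+...     "google/bigbird-roberta-base", attention_type="block_sparse", block_size=64, num_random_blocks=3
+... )
+
+>>> # for sequences shorter than ~1024 tokens, full attention is usually the better choice
+>>> model = BigBirdModel.from_pretrained("google/bigbird-roberta-base", attention_type="original_full")
+```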
+ +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Causal language modeling task guide](../tasks/language_modeling) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) ## BigBirdConfig @@ -72,6 +86,9 @@ This model was contributed by [vasudevgupta](https://huggingface.co/vasudevgupta [[autodoc]] models.big_bird.modeling_big_bird.BigBirdForPreTrainingOutput + + + ## BigBirdModel [[autodoc]] BigBirdModel @@ -112,6 +129,9 @@ This model was contributed by [vasudevgupta](https://huggingface.co/vasudevgupta [[autodoc]] BigBirdForQuestionAnswering - forward + + + ## FlaxBigBirdModel [[autodoc]] FlaxBigBirdModel @@ -151,3 +171,8 @@ This model was contributed by [vasudevgupta](https://huggingface.co/vasudevgupta [[autodoc]] FlaxBigBirdForQuestionAnswering - __call__ + + + + + diff --git a/docs/source/en/model_doc/bigbird_pegasus.mdx b/docs/source/en/model_doc/bigbird_pegasus.md similarity index 89% rename from docs/source/en/model_doc/bigbird_pegasus.mdx rename to docs/source/en/model_doc/bigbird_pegasus.md index 1ba4b71d73bb8f..003e5643719b4b 100644 --- a/docs/source/en/model_doc/bigbird_pegasus.mdx +++ b/docs/source/en/model_doc/bigbird_pegasus.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # BigBirdPegasus @@ -37,7 +41,9 @@ sequence as part of the sparse attention mechanism. The proposed sparse attentio BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.* -Tips: +The original code can be found [here](https://github.com/google-research/bigbird). + +## Usage tips - For an in-detail explanation on how BigBird's attention works, see [this blog post](https://huggingface.co/blog/big-bird). - BigBird comes with 2 implementations: **original_full** & **block_sparse**. For the sequence length < 1024, using @@ -50,7 +56,13 @@ Tips: - BigBird is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than the left. -The original code can be found [here](https://github.com/google-research/bigbird). 
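+
+As a sketch of how the sparse attention settings above carry over to this sequence-to-sequence variant, the
+snippet below summarizes a long document; the `google/bigbird-pegasus-large-arxiv` checkpoint and the generation
+settings are assumptions for illustration:
+
+```python
+>>> from transformers import AutoTokenizer, BigBirdPegasusForConditionalGeneration
+
+>>> tokenizer = AutoTokenizer.from_pretrained("google/bigbird-pegasus-large-arxiv")
+>>> model = BigBirdPegasusForConditionalGeneration.from_pretrained(
+...     "google/bigbird-pegasus-large-arxiv", attention_type="block_sparse", block_size=64
+... )
+
+>>> article = "Replace this placeholder with a long scientific article to summarize."
+>>> inputs = tokenizer(article, return_tensors="pt")
+>>> summary_ids = model.generate(**inputs, max_new_tokens=64)
+>>> print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])
+```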
+## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Question answering task guide](../tasks/question_answering) +- [Causal language modeling task guide](../tasks/language_modeling) +- [Translation task guide](../tasks/translation) +- [Summarization task guide](../tasks/summarization) ## BigBirdPegasusConfig diff --git a/docs/source/en/model_doc/biogpt.mdx b/docs/source/en/model_doc/biogpt.md similarity index 73% rename from docs/source/en/model_doc/biogpt.mdx rename to docs/source/en/model_doc/biogpt.md index 84bd96d76850ab..20a8e4d9cd307c 100644 --- a/docs/source/en/model_doc/biogpt.mdx +++ b/docs/source/en/model_doc/biogpt.md @@ -8,26 +8,33 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # BioGPT ## Overview -The BioGPT model was proposed in [BioGPT: generative pre-trained transformer for biomedical text generation and mining -](https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac409/6713511?guestAccessKey=a66d9b5d-4f83-4017-bb52-405815c907b9) by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu. BioGPT is a domain-specific generative pre-trained Transformer language model for biomedical text generation and mining. BioGPT follows the Transformer language model backbone, and is pre-trained on 15M PubMed abstracts from scratch. +The BioGPT model was proposed in [BioGPT: generative pre-trained transformer for biomedical text generation and mining](https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac409/6713511?guestAccessKey=a66d9b5d-4f83-4017-bb52-405815c907b9) by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu. BioGPT is a domain-specific generative pre-trained Transformer language model for biomedical text generation and mining. BioGPT follows the Transformer language model backbone, and is pre-trained on 15M PubMed abstracts from scratch. The abstract from the paper is the following: *Pre-trained language models have attracted increasing attention in the biomedical domain, inspired by their great success in the general natural language domain. Among the two main branches of pre-trained language models in the general language domain, i.e. BERT (and its variants) and GPT (and its variants), the first one has been extensively studied in the biomedical domain, such as BioBERT and PubMedBERT. While they have achieved great success on a variety of discriminative downstream biomedical tasks, the lack of generation ability constrains their application scope. In this paper, we propose BioGPT, a domain-specific generative Transformer language model pre-trained on large-scale biomedical literature. We evaluate BioGPT on six biomedical natural language processing tasks and demonstrate that our model outperforms previous models on most tasks. Especially, we get 44.98%, 38.42% and 40.76% F1 score on BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks, respectively, and 78.2% accuracy on PubMedQA, creating a new record. 
Our case study on text generation further demonstrates the advantage of BioGPT on biomedical literature to generate fluent descriptions for biomedical terms.* -Tips: +This model was contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/microsoft/BioGPT). + +## Usage tips -- BioGPT is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left. +- BioGPT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than the left. - BioGPT was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence. Leveraging this feature allows BioGPT to generate syntactically coherent text as it can be observed in the run_generation.py example script. - The model can take the `past_key_values` (for PyTorch) as input, which is the previously computed key/value attention pairs. Using this (past_key_values or past) value prevents the model from re-computing pre-computed values in the context of text generation. For PyTorch, see past_key_values argument of the BioGptForCausalLM.forward() method for more information on its usage. -This model was contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/microsoft/BioGPT). +## Resources + +- [Causal language modeling task guide](../tasks/language_modeling) ## BioGptConfig @@ -49,4 +56,16 @@ This model was contributed by [kamalkraj](https://huggingface.co/kamalkraj). The ## BioGptForCausalLM [[autodoc]] BioGptForCausalLM + - forward + + +## BioGptForTokenClassification + +[[autodoc]] BioGptForTokenClassification + - forward + + +## BioGptForSequenceClassification + +[[autodoc]] BioGptForSequenceClassification - forward \ No newline at end of file diff --git a/docs/source/en/model_doc/bit.mdx b/docs/source/en/model_doc/bit.md similarity index 92% rename from docs/source/en/model_doc/bit.mdx rename to docs/source/en/model_doc/bit.md index a9b3ff33b79455..7f8a8ea67c454e 100644 --- a/docs/source/en/model_doc/bit.mdx +++ b/docs/source/en/model_doc/bit.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Big Transfer (BiT) @@ -21,15 +25,15 @@ The abstract from the paper is the following: *Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components, and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples. 
BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19 task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct detailed analysis of the main components that lead to high transfer performance.* -Tips: +This model was contributed by [nielsr](https://huggingface.co/nielsr). +The original code can be found [here](https://github.com/google-research/big_transfer). + +## Usage tips - BiT models are equivalent to ResNetv2 in terms of architecture, except that: 1) all batch normalization layers are replaced by [group normalization](https://arxiv.org/abs/1803.08494), 2) [weight standardization](https://arxiv.org/abs/1903.10520) is used for convolutional layers. The authors show that the combination of both is useful for training with large batch sizes, and has a significant impact on transfer learning. -This model was contributed by [nielsr](https://huggingface.co/nielsr). -The original code can be found [here](https://github.com/google-research/big_transfer). - ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BiT. @@ -37,6 +41,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`BitForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). +- See also: [Image classification task guide](../tasks/image_classification) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. @@ -57,5 +62,4 @@ If you're interested in submitting a resource to be included here, please feel f ## BitForImageClassification [[autodoc]] BitForImageClassification - - forward - + - forward \ No newline at end of file diff --git a/docs/source/en/model_doc/blenderbot-small.mdx b/docs/source/en/model_doc/blenderbot-small.md similarity index 84% rename from docs/source/en/model_doc/blenderbot-small.mdx rename to docs/source/en/model_doc/blenderbot-small.md index c4b157cac119cc..d5f4a7d849b7cb 100644 --- a/docs/source/en/model_doc/blenderbot-small.mdx +++ b/docs/source/en/model_doc/blenderbot-small.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Blenderbot Small @@ -36,13 +40,20 @@ and code publicly available. Human evaluations show our best models are superior dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.* -Tips: +This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The authors' code can be +found [here](https://github.com/facebookresearch/ParlAI). 
+ +## Usage tips -- Blenderbot Small is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than - the left. +Blenderbot Small is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than +the left. -This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The authors' code can be -found [here](https://github.com/facebookresearch/ParlAI) . + +## Resources + +- [Causal language modeling task guide](../tasks/language_modeling) +- [Translation task guide](../tasks/translation) +- [Summarization task guide](../tasks/summarization) ## BlenderbotSmallConfig @@ -60,6 +71,9 @@ found [here](https://github.com/facebookresearch/ParlAI) . [[autodoc]] BlenderbotSmallTokenizerFast + + + ## BlenderbotSmallModel [[autodoc]] BlenderbotSmallModel @@ -75,6 +89,9 @@ found [here](https://github.com/facebookresearch/ParlAI) . [[autodoc]] BlenderbotSmallForCausalLM - forward + + + ## TFBlenderbotSmallModel [[autodoc]] TFBlenderbotSmallModel @@ -85,6 +102,9 @@ found [here](https://github.com/facebookresearch/ParlAI) . [[autodoc]] TFBlenderbotSmallForConditionalGeneration - call + + + ## FlaxBlenderbotSmallModel [[autodoc]] FlaxBlenderbotSmallModel @@ -98,3 +118,6 @@ found [here](https://github.com/facebookresearch/ParlAI) . - __call__ - encode - decode + + + diff --git a/docs/source/en/model_doc/blenderbot.mdx b/docs/source/en/model_doc/blenderbot.md similarity index 86% rename from docs/source/en/model_doc/blenderbot.mdx rename to docs/source/en/model_doc/blenderbot.md index 75706e13ec1a48..42e1710cb2d5ca 100644 --- a/docs/source/en/model_doc/blenderbot.mdx +++ b/docs/source/en/model_doc/blenderbot.md @@ -8,12 +8,14 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Blenderbot -**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) . - ## Overview The Blender chatbot model was proposed in [Recipes for building an open-domain chatbot](https://arxiv.org/pdf/2004.13637.pdf) Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, @@ -32,26 +34,14 @@ and code publicly available. Human evaluations show our best models are superior dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.* -Tips: - -- Blenderbot is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than - the left. - This model was contributed by [sshleifer](https://huggingface.co/sshleifer). The authors' code can be found [here](https://github.com/facebookresearch/ParlAI) . +## Usage tips and example -## Implementation Notes - -- Blenderbot uses a standard [seq2seq model transformer](https://arxiv.org/pdf/1706.03762.pdf) based architecture. -- Available checkpoints can be found in the [model hub](https://huggingface.co/models?search=blenderbot). 
-- This is the *default* Blenderbot model class. However, some smaller checkpoints, such as - `facebook/blenderbot_small_90M`, have a different architecture and consequently should be used with - [BlenderbotSmall](blenderbot-small). - - -## Usage +Blenderbot is a model with absolute position embeddings so it's usually advised to pad the inputs on the right +rather than the left. -Here is an example of model usage: +An example: ```python >>> from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration @@ -66,6 +56,21 @@ Here is an example of model usage: [" That's unfortunate. Are they trying to lose weight or are they just trying to be healthier?"] ``` +## Implementation Notes + +- Blenderbot uses a standard [seq2seq model transformer](https://arxiv.org/pdf/1706.03762.pdf) based architecture. +- Available checkpoints can be found in the [model hub](https://huggingface.co/models?search=blenderbot). +- This is the *default* Blenderbot model class. However, some smaller checkpoints, such as + `facebook/blenderbot_small_90M`, have a different architecture and consequently should be used with + [BlenderbotSmall](blenderbot-small). + + +## Resources + +- [Causal language modeling task guide](../tasks/language_modeling) +- [Translation task guide](../tasks/translation) +- [Summarization task guide](../tasks/summarization) + ## BlenderbotConfig [[autodoc]] BlenderbotConfig @@ -80,9 +85,13 @@ Here is an example of model usage: [[autodoc]] BlenderbotTokenizerFast - build_inputs_with_special_tokens + + + + ## BlenderbotModel -See `transformers.BartModel` for arguments to *forward* and *generate* +See [`~transformers.BartModel`] for arguments to *forward* and *generate* [[autodoc]] BlenderbotModel - forward @@ -99,6 +108,9 @@ See [`~transformers.BartForConditionalGeneration`] for arguments to *forward* an [[autodoc]] BlenderbotForCausalLM - forward + + + ## TFBlenderbotModel [[autodoc]] TFBlenderbotModel @@ -109,6 +121,9 @@ See [`~transformers.BartForConditionalGeneration`] for arguments to *forward* an [[autodoc]] TFBlenderbotForConditionalGeneration - call + + + ## FlaxBlenderbotModel [[autodoc]] FlaxBlenderbotModel @@ -122,3 +137,8 @@ See [`~transformers.BartForConditionalGeneration`] for arguments to *forward* an - __call__ - encode - decode + + + + + diff --git a/docs/source/en/model_doc/blip-2.mdx b/docs/source/en/model_doc/blip-2.md similarity index 93% rename from docs/source/en/model_doc/blip-2.mdx rename to docs/source/en/model_doc/blip-2.md index 690fc02bd8cfa9..d2a47e7af8f163 100644 --- a/docs/source/en/model_doc/blip-2.mdx +++ b/docs/source/en/model_doc/blip-2.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # BLIP-2 @@ -23,11 +27,6 @@ The abstract from the paper is the following: *The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. 
This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.* -Tips: - -- BLIP-2 can be used for conditional text generation given an image and an optional text prompt. At inference time, it's recommended to use the [`generate`] method. -- One can use [`Blip2Processor`] to prepare images for the model, and decode the predicted tokens ID's back to text. - drawing @@ -36,6 +35,11 @@ alt="drawing" width="600"/> This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/salesforce/LAVIS/tree/5ee63d688ba4cebff63acee04adaef2dee9af207). +## Usage tips + +- BLIP-2 can be used for conditional text generation given an image and an optional text prompt. At inference time, it's recommended to use the [`generate`] method. +- One can use [`Blip2Processor`] to prepare images for the model, and decode the predicted tokens ID's back to text. + ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BLIP-2. @@ -71,6 +75,14 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] Blip2QFormerModel - forward +## Blip2Model + +[[autodoc]] Blip2Model + - forward + - get_text_features + - get_image_features + - get_qformer_features + ## Blip2ForConditionalGeneration [[autodoc]] Blip2ForConditionalGeneration diff --git a/docs/source/en/model_doc/blip.mdx b/docs/source/en/model_doc/blip.md similarity index 77% rename from docs/source/en/model_doc/blip.mdx rename to docs/source/en/model_doc/blip.md index 42116f48697e9a..bc122c942a67a5 100644 --- a/docs/source/en/model_doc/blip.mdx +++ b/docs/source/en/model_doc/blip.md @@ -1,4 +1,4 @@ - # BLIP @@ -16,7 +20,7 @@ specific language governing permissions and limitations under the License. The BLIP model was proposed in [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086) by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi. -BLIP is a model that is able to perform various multi-modal tasks including +BLIP is a model that is able to perform various multi-modal tasks including: - Visual Question Answering - Image-Text retrieval (Image-text matching) - Image Captioning @@ -26,7 +30,7 @@ The abstract from the paper is the following: *Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. 
Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to videolanguage tasks in a zero-shot manner. Code, models, and datasets are released.* -![BLIP.gif](https://s3.amazonaws.com/moonup/production/uploads/1670928184033-62441d1d9fdefb55a0b7d12c.gif) +![BLIP.gif](https://cdn-uploads.huggingface.co/production/uploads/1670928184033-62441d1d9fdefb55a0b7d12c.gif) This model was contributed by [ybelkada](https://huggingface.co/ybelkada). The original code can be found [here](https://github.com/salesforce/BLIP). @@ -35,7 +39,6 @@ The original code can be found [here](https://github.com/salesforce/BLIP). - [Jupyter notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb) on how to fine-tune BLIP for image captioning on a custom dataset - ## BlipConfig [[autodoc]] BlipConfig @@ -53,12 +56,14 @@ The original code can be found [here](https://github.com/salesforce/BLIP). [[autodoc]] BlipProcessor - ## BlipImageProcessor [[autodoc]] BlipImageProcessor - preprocess + + + ## BlipModel [[autodoc]] BlipModel @@ -71,26 +76,59 @@ The original code can be found [here](https://github.com/salesforce/BLIP). [[autodoc]] BlipTextModel - forward - ## BlipVisionModel [[autodoc]] BlipVisionModel - forward - ## BlipForConditionalGeneration [[autodoc]] BlipForConditionalGeneration - forward - ## BlipForImageTextRetrieval [[autodoc]] BlipForImageTextRetrieval - forward - ## BlipForQuestionAnswering [[autodoc]] BlipForQuestionAnswering - - forward \ No newline at end of file + - forward + + + + +## TFBlipModel + +[[autodoc]] TFBlipModel + - call + - get_text_features + - get_image_features + +## TFBlipTextModel + +[[autodoc]] TFBlipTextModel + - call + +## TFBlipVisionModel + +[[autodoc]] TFBlipVisionModel + - call + +## TFBlipForConditionalGeneration + +[[autodoc]] TFBlipForConditionalGeneration + - call + +## TFBlipForImageTextRetrieval + +[[autodoc]] TFBlipForImageTextRetrieval + - call + +## TFBlipForQuestionAnswering + +[[autodoc]] TFBlipForQuestionAnswering + - call + + diff --git a/docs/source/en/model_doc/bloom.mdx b/docs/source/en/model_doc/bloom.md similarity index 83% rename from docs/source/en/model_doc/bloom.mdx rename to docs/source/en/model_doc/bloom.md index a3a2aa81d79c7e..a1d39d13ad0022 100644 --- a/docs/source/en/model_doc/bloom.mdx +++ b/docs/source/en/model_doc/bloom.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
+ +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # BLOOM @@ -27,13 +31,19 @@ Several smaller versions of the models have been trained on the same dataset. BL ## Resources - A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BLOOM. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. - [`BloomForCausalLM`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb). +See also: +- [Causal language modeling task guide](../tasks/language_modeling) +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) + + ⚡️ Inference - A blog on [Optimization story: Bloom inference](https://huggingface.co/blog/bloom-inference-optimization). - A blog on [Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate](https://huggingface.co/blog/bloom-inference-pytorch-scripts). @@ -46,16 +56,20 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] BloomConfig - all -## BloomModel - -[[autodoc]] BloomModel - - forward - ## BloomTokenizerFast [[autodoc]] BloomTokenizerFast - all + + + + +## BloomModel + +[[autodoc]] BloomModel + - forward + ## BloomForCausalLM [[autodoc]] BloomForCausalLM @@ -75,3 +89,21 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] BloomForQuestionAnswering - forward + + + + +## FlaxBloomModel + +[[autodoc]] FlaxBloomModel + - __call__ + +## FlaxBloomForCausalLM + +[[autodoc]] FlaxBloomForCausalLM + - __call__ + + + + + diff --git a/docs/source/en/model_doc/bort.mdx b/docs/source/en/model_doc/bort.md similarity index 76% rename from docs/source/en/model_doc/bort.mdx rename to docs/source/en/model_doc/bort.md index e90f042b6566b7..1542d464d9fd2f 100644 --- a/docs/source/en/model_doc/bort.mdx +++ b/docs/source/en/model_doc/bort.md @@ -8,10 +8,23 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # BORT + + +This model is in maintenance mode only, we do not accept any new PRs changing its code. + +If you run into any issues running this model, please reinstall the last version that supported this model: v4.30.0. +You can do so by running the following command: `pip install -U transformers==4.30.0`. + + + ## Overview The BORT model was proposed in [Optimal Subarchitecture Extraction for BERT](https://arxiv.org/abs/2010.10499) by @@ -30,13 +43,15 @@ hardware. 
It is also 7.9x faster on a CPU, as well as being better performing th architecture, and some of the non-compressed variants: it obtains performance improvements of between 0.3% and 31%, absolute, with respect to BERT-large, on multiple public natural language understanding (NLU) benchmarks.* -Tips: +This model was contributed by [stefan-it](https://huggingface.co/stefan-it). The original code can be found [here](https://github.com/alexa/bort/). + +## Usage tips -- BORT's model architecture is based on BERT, so one can refer to [BERT's documentation page](bert) for the - model's API as well as usage examples. -- BORT uses the RoBERTa tokenizer instead of the BERT tokenizer, so one can refer to [RoBERTa's documentation page](roberta) for the tokenizer's API as well as usage examples. +- BORT's model architecture is based on BERT, refer to [BERT's documentation page](bert) for the + model's API reference as well as usage examples. +- BORT uses the RoBERTa tokenizer instead of the BERT tokenizer, refer to [RoBERTa's documentation page](roberta) for the tokenizer's API reference as well as usage examples. - BORT requires a specific fine-tuning algorithm, called [Agora](https://adewynter.github.io/notes/bort_algorithms_and_applications.html#fine-tuning-with-algebraic-topology) , that is sadly not open-sourced yet. It would be very useful for the community, if someone tries to implement the algorithm to make BORT fine-tuning work. -This model was contributed by [stefan-it](https://huggingface.co/stefan-it). The original code can be found [here](https://github.com/alexa/bort/). + diff --git a/docs/source/en/model_doc/bridgetower.mdx b/docs/source/en/model_doc/bridgetower.md similarity index 85% rename from docs/source/en/model_doc/bridgetower.mdx rename to docs/source/en/model_doc/bridgetower.md index 87015877dc9c0e..013fea06c2778e 100644 --- a/docs/source/en/model_doc/bridgetower.mdx +++ b/docs/source/en/model_doc/bridgetower.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # BridgeTower @@ -33,7 +37,9 @@ alt="drawing" width="600"/> BridgeTower architecture. Taken from the original paper. -## Usage +This model was contributed by [Anahita Bhiwandiwalla](https://huggingface.co/anahita-b), [Tiep Le](https://huggingface.co/Tile) and [Shaoyen Tseng](https://huggingface.co/shaoyent). The original code can be found [here](https://github.com/microsoft/BridgeTower). + +## Usage tips and examples BridgeTower consists of a visual encoder, a textual encoder and cross-modal encoder with multiple lightweight bridge layers. The goal of this approach was to build a bridge between each uni-modal encoder and the cross-modal encoder to enable comprehensive and detailed interaction at each layer of the cross-modal encoder. @@ -42,6 +48,28 @@ In principle, one can apply any visual, textual or cross-modal encoder in the pr The [`BridgeTowerProcessor`] wraps [`RobertaTokenizer`] and [`BridgeTowerImageProcessor`] into a single instance to both encode the text and prepare the images respectively. 
+The following example shows how to run contrastive learning using [`BridgeTowerProcessor`] and [`BridgeTowerForContrastiveLearning`]. +```python +>>> from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning +>>> import requests +>>> from PIL import Image + +>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" +>>> image = Image.open(requests.get(url, stream=True).raw) +>>> texts = ["An image of two cats chilling on a couch", "A football player scoring a goal"] + +>>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc") +>>> model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc") + +>>> # forward pass +>>> scores = dict() +>>> for text in texts: +... # prepare inputs +... encoding = processor(image, text, return_tensors="pt") +... outputs = model(**encoding) +... scores[text] = outputs +``` + The following example shows how to run image-text retrieval using [`BridgeTowerProcessor`] and [`BridgeTowerForImageAndTextRetrieval`]. ```python >>> from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval @@ -90,9 +118,6 @@ The following example shows how to run masked language modeling using [`BridgeTo .a cat looking out of the window. ``` -This model was contributed by [Anahita Bhiwandiwalla](https://huggingface.co/anahita-b), [Tiep Le](https://huggingface.co/Tile) and [Shaoyen Tseng](https://huggingface.co/shaoyent). The original code can be found [here](https://github.com/microsoft/BridgeTower). - - Tips: - This implementation of BridgeTower uses [`RobertaTokenizer`] to generate text embeddings and OpenAI's CLIP/ViT model to compute visual embeddings. @@ -128,6 +153,11 @@ Tips: [[autodoc]] BridgeTowerModel - forward +## BridgeTowerForContrastiveLearning + +[[autodoc]] BridgeTowerForContrastiveLearning + - forward + ## BridgeTowerForMaskedLM [[autodoc]] BridgeTowerForMaskedLM diff --git a/docs/source/en/model_doc/bros.md b/docs/source/en/model_doc/bros.md new file mode 100644 index 00000000000000..419e725e75e8ec --- /dev/null +++ b/docs/source/en/model_doc/bros.md @@ -0,0 +1,114 @@ + + +# BROS + +## Overview + +The BROS model was proposed in [BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents](https://arxiv.org/abs/2108.04539) by Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park. + +BROS stands for *BERT Relying On Spatiality*. It is an encoder-only Transformer model that takes a sequence of tokens and their bounding boxes as inputs and outputs a sequence of hidden states. BROS encode relative spatial information instead of using absolute spatial information. + +It is pre-trained with two objectives: a token-masked language modeling objective (TMLM) used in BERT, and a novel area-masked language modeling objective (AMLM) +In TMLM, tokens are randomly masked, and the model predicts the masked tokens using spatial information and other unmasked tokens. +AMLM is a 2D version of TMLM. It randomly masks text tokens and predicts with the same information as TMLM, but it masks text blocks (areas). + +`BrosForTokenClassification` has a simple linear layer on top of BrosModel. It predicts the label of each token. +`BrosSpadeEEForTokenClassification` has an `initial_token_classifier` and `subsequent_token_classifier` on top of BrosModel. 
`initial_token_classifier` is used to predict the first token of each entity, and `subsequent_token_classifier` is used to predict the next token within each entity. `BrosSpadeELForTokenClassification` has an `entity_linker` on top of BrosModel. `entity_linker` is used to predict the relation between two entities. + +`BrosForTokenClassification` and `BrosSpadeEEForTokenClassification` essentially perform the same job. However, `BrosForTokenClassification` assumes input tokens are perfectly serialized (which is a very challenging task since they exist in a 2D space), while `BrosSpadeEEForTokenClassification` allows for more flexibility in handling serialization errors, as it predicts the next connected token from each token. + +`BrosSpadeELForTokenClassification` performs the intra-entity linking task. It predicts the relation from one token (of one entity) to another token (of another entity) if these two entities share some relation. + +BROS achieves comparable or better results on Key Information Extraction (KIE) benchmarks such as FUNSD, SROIE, CORD and SciTSR, without relying on explicit visual features. + +The abstract from the paper is the following: + +*Key information extraction (KIE) from document images requires understanding the contextual and spatial semantics of texts in two-dimensional (2D) space. Many recent studies try to solve the task by developing pre-trained language models focusing on combining visual features from document images with texts and their layout. On the other hand, this paper tackles the problem by going back to the basic: effective combination of text and layout. Specifically, we propose a pre-trained language model, named BROS (BERT Relying On Spatiality), that encodes relative positions of texts in 2D space and learns from unlabeled documents with area-masking strategy. With this optimized training scheme for understanding texts in 2D space, BROS shows comparable or better performance compared to previous methods on four KIE benchmarks (FUNSD, SROIE*, CORD, and SciTSR) without relying on visual features. This paper also reveals two real-world challenges in KIE tasks-(1) minimizing the error from incorrect text ordering and (2) efficient learning from fewer downstream examples-and demonstrates the superiority of BROS over previous methods.* + +This model was contributed by [jinho8345](https://huggingface.co/jinho8345). The original code can be found [here](https://github.com/clovaai/bros). + +## Usage tips and examples + +- [`~transformers.BrosModel.forward`] requires `input_ids` and `bbox` (bounding box). Each bounding box should be in (x0, y0, x1, y1) format (top-left corner, bottom-right corner). Bounding boxes have to be obtained from an external OCR system. The `x` coordinate should be normalized by document image width, and the `y` coordinate should be normalized by document image height. + +```python +def expand_and_normalize_bbox(bboxes, doc_width, doc_height): + # here, bboxes is a NumPy array of shape (num_boxes, 4) in (x0, y0, x1, y1) order + + # Normalize bbox -> 0 ~ 1 + bboxes[:, [0, 2]] = bboxes[:, [0, 2]] / doc_width + bboxes[:, [1, 3]] = bboxes[:, [1, 3]] / doc_height +``` + +- [`~transformers.BrosForTokenClassification.forward`, `~transformers.BrosSpadeEEForTokenClassification.forward`, `~transformers.BrosSpadeELForTokenClassification.forward`] require not only `input_ids` and `bbox` but also `box_first_token_mask` for loss calculation. It is a mask to filter out non-first tokens of each box. You can obtain this mask by saving the start token indices of bounding boxes when creating `input_ids` from words.
You can make `box_first_token_mask` with following code, + + +```python +def make_box_first_token_mask(bboxes, words, tokenizer, max_seq_length=512): + + box_first_token_mask = np.zeros(max_seq_length, dtype=np.bool_) + + # encode(tokenize) each word from words (List[str]) + input_ids_list: List[List[int]] = [tokenizer.encode(e, add_special_tokens=False) for e in words] + + # get the length of each box + tokens_length_list: List[int] = [len(l) for l in input_ids_list] + + box_end_token_indices = np.array(list(itertools.accumulate(tokens_length_list))) + box_start_token_indices = box_end_token_indices - np.array(tokens_length_list) + + # filter out the indices that are out of max_seq_length + box_end_token_indices = box_end_token_indices[box_end_token_indices < max_seq_length - 1] + if len(box_start_token_indices) > len(box_end_token_indices): + box_start_token_indices = box_start_token_indices[: len(box_end_token_indices)] + + # set box_start_token_indices to True + box_first_token_mask[box_start_token_indices] = True + + return box_first_token_mask + +``` + +## Resources + +- Demo scripts can be found [here](https://github.com/clovaai/bros). + +## BrosConfig + +[[autodoc]] BrosConfig + +## BrosProcessor + +[[autodoc]] BrosProcessor + - __call__ + +## BrosModel + +[[autodoc]] BrosModel + - forward + + +## BrosForTokenClassification + +[[autodoc]] BrosForTokenClassification + - forward + +## BrosSpadeEEForTokenClassification + +[[autodoc]] BrosSpadeEEForTokenClassification + - forward + +## BrosSpadeELForTokenClassification + +[[autodoc]] BrosSpadeELForTokenClassification + - forward diff --git a/docs/source/en/model_doc/byt5.mdx b/docs/source/en/model_doc/byt5.md similarity index 95% rename from docs/source/en/model_doc/byt5.mdx rename to docs/source/en/model_doc/byt5.md index dc4c5a6caf8f6f..dc2942e33bbe0b 100644 --- a/docs/source/en/model_doc/byt5.mdx +++ b/docs/source/en/model_doc/byt5.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # ByT5 @@ -36,14 +40,18 @@ experiments.* This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be found [here](https://github.com/google-research/byt5). -ByT5's architecture is based on the T5v1.1 model, so one can refer to [T5v1.1's documentation page](t5v1.1). They + + +ByT5's architecture is based on the T5v1.1 model, refer to [T5v1.1's documentation page](t5v1.1) for the API reference. They only differ in how inputs should be prepared for the model, see the code examples below. + + Since ByT5 was pre-trained unsupervisedly, there's no real advantage to using a task prefix during single-task fine-tuning. If you are doing multi-task fine-tuning, you should use a prefix. 
-### Example +## Usage example ByT5 works on raw UTF-8 bytes, so it can be used without a tokenizer: diff --git a/docs/source/en/model_doc/camembert.mdx b/docs/source/en/model_doc/camembert.md similarity index 69% rename from docs/source/en/model_doc/camembert.mdx rename to docs/source/en/model_doc/camembert.md index a35d5aefca67a4..ab06ec100b1298 100644 --- a/docs/source/en/model_doc/camembert.mdx +++ b/docs/source/en/model_doc/camembert.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # CamemBERT @@ -15,8 +19,8 @@ specific language governing permissions and limitations under the License. ## Overview The CamemBERT model was proposed in [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by -Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la -Clergerie, Djamé Seddah, and Benoît Sagot. It is based on Facebook's RoBERTa model released in 2019. It is a model +[Louis Martin](https://huggingface.co/louismartin), [Benjamin Muller](https://huggingface.co/benjamin-mlr), [Pedro Javier Ortiz Suárez](https://huggingface.co/pjox), Yoann Dupont, Laurent Romary, Éric Villemonte de la +Clergerie, [Djamé Seddah](https://huggingface.co/Djame), and [Benoît Sagot](https://huggingface.co/sagot). It is based on Facebook's RoBERTa model released in 2019. It is a model trained on 138GB of French text. The abstract from the paper is the following: @@ -30,12 +34,23 @@ dependency parsing, named-entity recognition, and natural language inference. Ca for most of the tasks considered. We release the pretrained model for CamemBERT hoping to foster research and downstream applications for French NLP.* -Tips: +This model was contributed by [the ALMAnaCH team (Inria)](https://huggingface.co/almanach). The original code can be found [here](https://camembert-model.fr/). + + + +This implementation is the same as RoBERTa. Refer to the [documentation of RoBERTa](roberta) for usage examples as well +as the information relative to the inputs and outputs. + + -- This implementation is the same as RoBERTa. Refer to the [documentation of RoBERTa](roberta) for usage examples - as well as the information relative to the inputs and outputs. +## Resources -This model was contributed by [camembert](https://huggingface.co/camembert). The original code can be found [here](https://camembert-model.fr/). +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Causal language modeling task guide](../tasks/language_modeling) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) ## CamembertConfig @@ -53,6 +68,9 @@ This model was contributed by [camembert](https://huggingface.co/camembert). 
The [[autodoc]] CamembertTokenizerFast + + + ## CamembertModel [[autodoc]] CamembertModel @@ -81,6 +99,9 @@ This model was contributed by [camembert](https://huggingface.co/camembert). The [[autodoc]] CamembertForQuestionAnswering + + + ## TFCamembertModel [[autodoc]] TFCamembertModel @@ -108,3 +129,7 @@ This model was contributed by [camembert](https://huggingface.co/camembert). The ## TFCamembertForQuestionAnswering [[autodoc]] TFCamembertForQuestionAnswering + + + + diff --git a/docs/source/en/model_doc/canine.mdx b/docs/source/en/model_doc/canine.md similarity index 91% rename from docs/source/en/model_doc/canine.mdx rename to docs/source/en/model_doc/canine.md index e73777d000827f..7729d8aa91d7d6 100644 --- a/docs/source/en/model_doc/canine.mdx +++ b/docs/source/en/model_doc/canine.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # CANINE @@ -33,7 +37,9 @@ To use its finer-grained input effectively and efficiently, CANINE combines down sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by 2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.* -Tips: +This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/google-research/language/tree/master/language/canine). + +## Usage tips - CANINE uses no less than 3 Transformer encoders internally: 2 "shallow" encoders (which only consist of a single layer) and 1 "deep" encoder (which is a regular BERT encoder). First, a "shallow" encoder is used to contextualize @@ -46,19 +52,18 @@ Tips: (which has a predefined Unicode code point). For token classification tasks however, the downsampled sequence of tokens needs to be upsampled again to match the length of the original character sequence (which is 2048). The details for this can be found in the paper. -- Models: + +Model checkpoints: - [google/canine-c](https://huggingface.co/google/canine-c): Pre-trained with autoregressive character loss, 12-layer, 768-hidden, 12-heads, 121M parameters (size ~500 MB). - [google/canine-s](https://huggingface.co/google/canine-s): Pre-trained with subword loss, 12-layer, 768-hidden, 12-heads, 121M parameters (size ~500 MB). -This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/google-research/language/tree/master/language/canine). 
- -### Example +## Usage example -CANINE works on raw characters, so it can be used without a tokenizer: +CANINE works on raw characters, so it can be used **without a tokenizer**: ```python >>> from transformers import CanineModel @@ -92,9 +97,12 @@ sequences to the same length): >>> sequence_output = outputs.last_hidden_state ``` -## CANINE specific outputs +## Resources -[[autodoc]] models.canine.modeling_canine.CanineModelOutputWithPooling +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Multiple choice task guide](../tasks/multiple_choice) ## CanineConfig @@ -107,6 +115,10 @@ sequences to the same length): - get_special_tokens_mask - create_token_type_ids_from_sequences +## CANINE specific outputs + +[[autodoc]] models.canine.modeling_canine.CanineModelOutputWithPooling + ## CanineModel [[autodoc]] CanineModel diff --git a/docs/source/en/model_doc/chinese_clip.mdx b/docs/source/en/model_doc/chinese_clip.md similarity index 94% rename from docs/source/en/model_doc/chinese_clip.mdx rename to docs/source/en/model_doc/chinese_clip.md index d8973759ed5ac0..b2d27a844e9e74 100644 --- a/docs/source/en/model_doc/chinese_clip.mdx +++ b/docs/source/en/model_doc/chinese_clip.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Chinese-CLIP @@ -21,7 +25,9 @@ The abstract from the paper is the following: *The tremendous success of CLIP (Radford et al., 2021) has promoted the research and application of contrastive learning for vision-language pretraining. In this work, we construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets, and we pretrain Chinese CLIP models on the new dataset. We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters. Furthermore, we propose a two-stage pretraining method, where the model is first trained with the image encoder frozen and then trained with all parameters being optimized, to achieve enhanced model performance. Our comprehensive experiments demonstrate that Chinese CLIP can achieve the state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN in the setups of zero-shot learning and finetuning, and it is able to achieve competitive performance in zero-shot image classification based on the evaluation on the ELEVATER benchmark (Li et al., 2022). Our codes, pretrained models, and demos have been released.* -## Usage +The Chinese-CLIP model was contributed by [OFA-Sys](https://huggingface.co/OFA-Sys). 
+ +## Usage example The code snippet below shows how to compute image & text features and similarities: @@ -55,15 +61,13 @@ The code snippet below shows how to compute image & text features and similariti >>> probs = logits_per_image.softmax(dim=1) # probs: [[1.2686e-03, 5.4499e-02, 6.7968e-04, 9.4355e-01]] ``` -Currently, we release the following scales of pretrained Chinese-CLIP models at HF Model Hub: +Currently, following scales of pretrained Chinese-CLIP models are available on 🤗 Hub: - [OFA-Sys/chinese-clip-vit-base-patch16](https://huggingface.co/OFA-Sys/chinese-clip-vit-base-patch16) - [OFA-Sys/chinese-clip-vit-large-patch14](https://huggingface.co/OFA-Sys/chinese-clip-vit-large-patch14) - [OFA-Sys/chinese-clip-vit-large-patch14-336px](https://huggingface.co/OFA-Sys/chinese-clip-vit-large-patch14-336px) - [OFA-Sys/chinese-clip-vit-huge-patch14](https://huggingface.co/OFA-Sys/chinese-clip-vit-huge-patch14) -The Chinese-CLIP model was contributed by [OFA-Sys](https://huggingface.co/OFA-Sys). - ## ChineseCLIPConfig [[autodoc]] ChineseCLIPConfig diff --git a/docs/source/en/model_doc/clap.mdx b/docs/source/en/model_doc/clap.md similarity index 76% rename from docs/source/en/model_doc/clap.mdx rename to docs/source/en/model_doc/clap.md index fa9abacbafe613..2bd2814e1b061d 100644 --- a/docs/source/en/model_doc/clap.mdx +++ b/docs/source/en/model_doc/clap.md @@ -8,25 +8,28 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # CLAP ## Overview -The CLAP model was proposed in [Large Scale Constrastive Laungaue-Audio pretraining with +The CLAP model was proposed in [Large Scale Contrastive Language-Audio pretraining with feature fusion and keyword-to-caption augmentation](https://arxiv.org/pdf/2211.06687.pdf) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. -CLAP (Constrastive Laungaue-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed in to predict the most relevant text snippet, given an audio, without directly optimizing for the task. The CLAP model uses a SWINTransformer to get audio features from a log-Mel spectrogram input, and a RoBERTa model to get text features. Both the text and audio features are then projected to a latent space with identical dimension. The dot product between the projected audio and text features is then used as a similar score. +CLAP (Contrastive Language-Audio Pretraining) is a neural network trained on a variety of (audio, text) pairs. It can be instructed in to predict the most relevant text snippet, given an audio, without directly optimizing for the task. The CLAP model uses a SWINTransformer to get audio features from a log-Mel spectrogram input, and a RoBERTa model to get text features. Both the text and audio features are then projected to a latent space with identical dimension. The dot product between the projected audio and text features is then used as a similar score. 
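+
+The snippet below is a minimal sketch, not taken from the original CLAP release, of how the audio-text similarity described above can be computed with [`ClapProcessor`] and [`ClapModel`]; the `laion/clap-htsat-unfused` checkpoint, the candidate captions and the dummy LibriSpeech sample are illustrative choices only.
+
+```python
+>>> from datasets import Audio, load_dataset
+>>> from transformers import ClapModel, ClapProcessor
+
+>>> # load a dummy speech sample and resample it to the 48 kHz rate expected by CLAP's feature extractor
+>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
+>>> ds = ds.cast_column("audio", Audio(sampling_rate=48000))
+>>> audio_sample = ds[0]["audio"]["array"]
+
+>>> model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
+>>> processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")
+
+>>> # score one audio clip against several candidate captions
+>>> texts = ["a person speaking", "a dog barking", "a police siren"]
+>>> inputs = processor(text=texts, audios=audio_sample, sampling_rate=48000, return_tensors="pt", padding=True)
+
+>>> outputs = model(**inputs)
+>>> probs = outputs.logits_per_audio.softmax(dim=-1)  # higher probability = caption closer to the audio
+```
+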
The abstract from the paper is the following: *Contrastive learning has shown remarkable success in the field of multimodal representation learning. In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions. To accomplish this target, we first release LAION-Audio-630K, a large collection of 633,526 audio-text pairs from different data sources. Second, we construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders. We incorporate the feature fusion mechanism and keyword-to-caption augmentation into the model design to further enable the model to process audio inputs of variable lengths and enhance the performance. Third, we perform comprehensive experiments to evaluate our model across three tasks: text-to-audio retrieval, zero-shot audio classification, and supervised audio classification. The results demonstrate that our model achieves superior performance in text-to-audio retrieval task. In audio classification tasks, the model achieves state-of-the-art performance in the zeroshot setting and is able to obtain performance comparable to models' results in the non-zero-shot setting. LAION-Audio-6* -This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) and [Arthur Zucker](https://huggingface.co/ArtZucker) . +This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) and [Arthur Zucker](https://huggingface.co/ArthurZ) . The original code can be found [here](https://github.com/LAION-AI/Clap). - ## ClapConfig [[autodoc]] ClapConfig @@ -74,4 +77,3 @@ The original code can be found [here](https://github.com/LAION-AI/Clap). [[autodoc]] ClapAudioModelWithProjection - forward - diff --git a/docs/source/en/model_doc/clip.mdx b/docs/source/en/model_doc/clip.md similarity index 71% rename from docs/source/en/model_doc/clip.mdx rename to docs/source/en/model_doc/clip.md index 790bce6c7ff09f..692ea083717c42 100644 --- a/docs/source/en/model_doc/clip.mdx +++ b/docs/source/en/model_doc/clip.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # CLIP @@ -36,7 +40,9 @@ for any dataset specific training. For instance, we match the accuracy of the or without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at this https URL.* -## Usage +This model was contributed by [valhalla](https://huggingface.co/valhalla). The original code can be found [here](https://github.com/openai/CLIP). + +## Usage tips and example CLIP is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image classification. 
CLIP uses a ViT like transformer to get visual features and a causal language model to get the text @@ -46,10 +52,10 @@ product between the projected image and text features is then used as a similar To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches, which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image. The authors also add absolute position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder. -The [`CLIPFeatureExtractor`] can be used to resize (or rescale) and normalize images for the model. +The [`CLIPImageProcessor`] can be used to resize (or rescale) and normalize images for the model. The [`CLIPTokenizer`] is used to encode the text. The [`CLIPProcessor`] wraps -[`CLIPFeatureExtractor`] and [`CLIPTokenizer`] into a single instance to both +[`CLIPImageProcessor`] and [`CLIPTokenizer`] into a single instance to both encode the text and prepare the images. The following example shows how to get the image-text similarity scores using [`CLIPProcessor`] and [`CLIPModel`]. @@ -73,14 +79,27 @@ encode the text and prepare the images. The following example shows how to get t >>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities ``` -This model was contributed by [valhalla](https://huggingface.co/valhalla). The original code can be found [here](https://github.com/openai/CLIP). - ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with CLIP. -- A blog post on [How to fine-tune CLIP on 10,000 image-text pairs](https://huggingface.co/blog/fine-tune-clip-rsicd). -- CLIP is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text). +- [Fine tuning CLIP with Remote Sensing (Satellite) images and captions](https://huggingface.co/blog/fine-tune-clip-rsicd), a blog post about how to fine-tune CLIP with [RSICD dataset](https://github.com/201528014227051/RSICD_optimal) and comparison of performance changes due to data augmentation. +- This [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text) shows how to train a CLIP-like vision-text dual encoder model using a pre-trained vision and text encoder using [COCO dataset](https://cocodataset.org/#home). + + + +- A [notebook](https://colab.research.google.com/drive/1tuoAC5F4sC7qid56Z0ap-stR3rwdk0ZV?usp=sharing) on how to use a pretrained CLIP for inference with beam search for image captioning. 🌎 + +**Image retrieval** + +- A [notebook](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing) on image retrieval using pretrained CLIP and computing MRR(Mean Reciprocal Rank) score. 🌎 +- A [notebook](https://colab.research.google.com/github/deep-diver/image_search_with_natural_language/blob/main/notebooks/Image_Search_CLIP.ipynb) on image retrieval and showing the similarity score. 🌎 +- A [notebook](https://colab.research.google.com/drive/1xO-wC_m_GNzgjIBQ4a4znvQkvDoZJvH4?usp=sharing) on how to map images and texts to the same vector space using Multilingual CLIP. 🌎 +- A [notebook](https://colab.research.google.com/github/vivien000/clip-demo/blob/master/clip.ipynb#scrollTo=uzdFhRGqiWkR) on how to run CLIP on semantic image search using [Unsplash](https://unsplash.com) and [TMDB](https://www.themoviedb.org/) datasets. 
🌎 + +**Explainability** + +- A [notebook](https://colab.research.google.com/github/hila-chefer/Transformer-MM-Explainability/blob/main/CLIP_explainability.ipynb) on how to visualize similarity between input token and image segment. 🌎 If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it. The resource should ideally demonstrate something new instead of duplicating an existing resource. @@ -123,6 +142,9 @@ The resource should ideally demonstrate something new instead of duplicating an [[autodoc]] CLIPProcessor + + + ## CLIPModel [[autodoc]] CLIPModel @@ -145,12 +167,19 @@ The resource should ideally demonstrate something new instead of duplicating an [[autodoc]] CLIPVisionModelWithProjection - forward - ## CLIPVisionModel [[autodoc]] CLIPVisionModel - forward +## CLIPForImageClassification + +[[autodoc]] CLIPForImageClassification + - forward + + + + ## TFCLIPModel [[autodoc]] TFCLIPModel @@ -168,6 +197,9 @@ The resource should ideally demonstrate something new instead of duplicating an [[autodoc]] TFCLIPVisionModel - call + + + ## FlaxCLIPModel [[autodoc]] FlaxCLIPModel @@ -180,7 +212,15 @@ The resource should ideally demonstrate something new instead of duplicating an [[autodoc]] FlaxCLIPTextModel - __call__ +## FlaxCLIPTextModelWithProjection + +[[autodoc]] FlaxCLIPTextModelWithProjection + - __call__ + ## FlaxCLIPVisionModel [[autodoc]] FlaxCLIPVisionModel - __call__ + + + diff --git a/docs/source/en/model_doc/clipseg.mdx b/docs/source/en/model_doc/clipseg.md similarity index 95% rename from docs/source/en/model_doc/clipseg.mdx rename to docs/source/en/model_doc/clipseg.md index 94b58275f6d8ee..320095bc1905b1 100644 --- a/docs/source/en/model_doc/clipseg.mdx +++ b/docs/source/en/model_doc/clipseg.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # CLIPSeg @@ -37,13 +41,6 @@ to any binary segmentation task where a text or image query can be formulated. Finally, we find our system to adapt well to generalized queries involving affordances or properties* -Tips: - -- [`CLIPSegForImageSegmentation`] adds a decoder on top of [`CLIPSegModel`]. The latter is identical to [`CLIPModel`]. -- [`CLIPSegForImageSegmentation`] can generate image segmentations based on arbitrary prompts at test time. A prompt can be either a text -(provided to the model as `input_ids`) or an image (provided to the model as `conditional_pixel_values`). One can also provide custom -conditional embeddings (provided to the model as `conditional_embeddings`). - drawing @@ -52,6 +49,13 @@ alt="drawing" width="600"/> This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/timojl/clipseg). +## Usage tips + +- [`CLIPSegForImageSegmentation`] adds a decoder on top of [`CLIPSegModel`]. The latter is identical to [`CLIPModel`]. +- [`CLIPSegForImageSegmentation`] can generate image segmentations based on arbitrary prompts at test time. 
A prompt can be either a text +(provided to the model as `input_ids`) or an image (provided to the model as `conditional_pixel_values`). One can also provide custom +conditional embeddings (provided to the model as `conditional_embeddings`). + ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with CLIPSeg. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. diff --git a/docs/source/en/model_doc/clvp.md b/docs/source/en/model_doc/clvp.md new file mode 100644 index 00000000000000..a30269faf9caec --- /dev/null +++ b/docs/source/en/model_doc/clvp.md @@ -0,0 +1,126 @@ + + +# CLVP + +## Overview + +The CLVP (Contrastive Language-Voice Pretrained Transformer) model was proposed in [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker. + +The abstract from the paper is the following: + +*In recent years, the field of image generation has been revolutionized by the application of autoregressive transformers and DDPMs. These approaches model the process of image generation as a step-wise probabilistic processes and leverage large amounts of compute and data to learn the image distribution. This methodology of improving performance need not be confined to images. This paper describes a way to apply advances in the image generative domain to speech synthesis. The result is TorToise - an expressive, multi-voice text-to-speech system.* + + +This model was contributed by [Susnato Dhar](https://huggingface.co/susnato). +The original code can be found [here](https://github.com/neonbjb/tortoise-tts). + + +## Usage tips + +1. CLVP is an integral part of the Tortoise TTS model. +2. CLVP can be used to compare different generated speech candidates with the provided text, and the best speech tokens are forwarded to the diffusion model. +3. The use of the [`ClvpModelForConditionalGeneration.generate()`] method is strongly recommended for tortoise usage. +4. Note that the CLVP model expects the audio to be sampled at 22.05 kHz contrary to other audio models which expects 16 kHz. + + +## Brief Explanation: + +- The [`ClvpTokenizer`] tokenizes the text input, and the [`ClvpFeatureExtractor`] extracts the log mel-spectrogram from the desired audio. +- [`ClvpConditioningEncoder`] takes those text tokens and audio representations and converts them into embeddings conditioned on the text and audio. +- The [`ClvpForCausalLM`] uses those embeddings to generate multiple speech candidates. +- Each speech candidate is passed through the speech encoder ([`ClvpEncoder`]) which converts them into a vector representation, and the text encoder ([`ClvpEncoder`]) converts the text tokens into the same latent space. +- At the end, we compare each speech vector with the text vector to see which speech vector is most similar to the text vector. +- [`ClvpModelForConditionalGeneration.generate()`] compresses all of the logic described above into a single method. + + +Example : + +```python +>>> import datasets +>>> from transformers import ClvpProcessor, ClvpModelForConditionalGeneration + +>>> # Define the Text and Load the Audio (We are taking an audio example from HuggingFace Hub using `datasets` library). +>>> text = "This is an example text." 
+ +>>> ds = datasets.load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") +>>> ds = ds.cast_column("audio", datasets.Audio(sampling_rate=22050)) +>>> sample = ds[0]["audio"] + +>>> # Define processor and model. +>>> processor = ClvpProcessor.from_pretrained("susnato/clvp_dev") +>>> model = ClvpModelForConditionalGeneration.from_pretrained("susnato/clvp_dev") + +>>> # Generate processor output and model output. +>>> processor_output = processor(raw_speech=sample["array"], sampling_rate=sample["sampling_rate"], text=text, return_tensors="pt") +>>> generated_output = model.generate(**processor_output) +``` + + +## ClvpConfig + +[[autodoc]] ClvpConfig + - from_sub_model_configs + +## ClvpEncoderConfig + +[[autodoc]] ClvpEncoderConfig + +## ClvpDecoderConfig + +[[autodoc]] ClvpDecoderConfig + +## ClvpTokenizer + +[[autodoc]] ClvpTokenizer + - save_vocabulary + +## ClvpFeatureExtractor + +[[autodoc]] ClvpFeatureExtractor + - __call__ + +## ClvpProcessor + +[[autodoc]] ClvpProcessor + - __call__ + - decode + - batch_decode + +## ClvpModelForConditionalGeneration + +[[autodoc]] ClvpModelForConditionalGeneration + - forward + - generate + - get_text_features + - get_speech_features + +## ClvpForCausalLM + +[[autodoc]] ClvpForCausalLM + +## ClvpModel + +[[autodoc]] ClvpModel + +## ClvpEncoder + +[[autodoc]] ClvpEncoder + +## ClvpDecoder + +[[autodoc]] ClvpDecoder + diff --git a/docs/source/en/model_doc/code_llama.md b/docs/source/en/model_doc/code_llama.md new file mode 100644 index 00000000000000..38d50c87334d67 --- /dev/null +++ b/docs/source/en/model_doc/code_llama.md @@ -0,0 +1,127 @@ + + +# CodeLlama + +## Overview + +The Code Llama model was proposed in [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) by Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve. + +The abstract from the paper is the following: + +*We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B and 34B parameters each. All models are trained on sequences of 16k tokens and show improvements on inputs with up to 100k tokens. 7B and 13B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama reaches state-of-the-art performance among open models on several code benchmarks, with scores of up to 53% and 55% on HumanEval and MBPP, respectively. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other publicly available model on MultiPL-E. 
We release Code Llama under a permissive license that allows for both research and commercial use.* + +Check out all Code Llama model checkpoints [here](https://huggingface.co/models?search=code_llama) and the officially released ones in the [codellama org](https://huggingface.co/codellama). + +This model was contributed by [ArthurZucker](https://huggingface.co/ArthurZ). The original code of the authors can be found [here](https://github.com/facebookresearch/llama). + +## Usage tips and examples + + + +The `Llama2` family models, on which Code Llama is based, were trained using `bfloat16`, but the original inference uses `float16`. Let's look at the different precisions: + +* `float32`: PyTorch convention on model initialization is to load models in `float32`, no matter with which `dtype` the model weights were stored. `transformers` also follows this convention for consistency with PyTorch. This will be picked by default. If you want the `AutoModel` API to cast the load the checkpoints with the storage weights type, you must specify `torch_dtype="auto"`, e.g. `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto")`. +* `bfloat16`: Code Llama was trained with this precision, so we recommend using it for further training or fine-tuning. +* `float16`: We recommend running inference using this precision, as it's usually faster than `bfloat16`, and evaluation metrics show no discernible degradation with respect to `bfloat16`. You can also run inference using `bfloat16`, and we recommend you check inference results with both `float16` and `bfloat16` after fine-tuning. + +As mentioned above, the `dtype` of the storage weights is mostly irrelevant unless you are using `torch_dtype="auto"` when initializing a model using. The reason is that the model will first be downloaded (using the `dtype` of the checkpoints online) and then will be casted to the default `dtype` of `torch` (becomes `torch.float32`). If there is a specified `torch_dtype`, it will be used instead. + + + + +Tips: +- The infilling task is supported out of the box. You should be using the `tokenizer.fill_token` where you want your input to be filled. +- The model conversion script is the same as for the `Llama2` family: + +Here is a sample usage: + +```bash +python src/transformers/models/llama/convert_llama_weights_to_hf.py \ + --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path +``` + +Note that executing the script requires enough CPU RAM to host the whole model in float16 precision (even if the biggest versions +come in several checkpoints they each contain a part of each weight of the model, so we need to load them all in RAM). + +After conversion, the model and tokenizer can be loaded via: + +```python +>>> from transformers import LlamaForCausalLM, CodeLlamaTokenizer + +>>> tokenizer = CodeLlamaTokenizer.from_pretrained("codellama/CodeLlama-7b-hf") +>>> model = LlamaForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf") +>>> PROMPT = '''def remove_non_ascii(s: str) -> str: + """ + return result +''' +>>> input_ids = tokenizer(PROMPT, return_tensors="pt")["input_ids"] +>>> generated_ids = model.generate(input_ids, max_new_tokens=128) + +>>> filling = tokenizer.batch_decode(generated_ids[:, input_ids.shape[1]:], skip_special_tokens = True)[0] +>>> print(PROMPT.replace("", filling)) +def remove_non_ascii(s: str) -> str: + """ Remove non-ASCII characters from a string. + + Args: + s: The string to remove non-ASCII characters from. 
+ + Returns: + The string with non-ASCII characters removed. + """ + result = "" + for c in s: + if ord(c) < 128: + result += c + return result +``` + +If you only want the infilled part: +```python +>>> from transformers import pipeline +>>> import torch + +>>> generator = pipeline("text-generation",model="codellama/CodeLlama-7b-hf",torch_dtype=torch.float16, device_map="auto") +>>> generator('def remove_non_ascii(s: str) -> str:\n """ \n return result', max_new_tokens = 128, return_type = 1) +``` + +Under the hood, the tokenizer [automatically splits by ``](https://huggingface.co/docs/transformers/main/model_doc/code_llama#transformers.CodeLlamaTokenizer.fill_token) to create a formatted input string that follows [the original training pattern](https://github.com/facebookresearch/codellama/blob/cb51c14ec761370ba2e2bc351374a79265d0465e/llama/generation.py#L402). This is more robust than preparing the pattern yourself: it avoids pitfalls, such as token glueing, that are very hard to debug. To see how much CPU and GPU memory you need for this model or others, try [this calculator](https://huggingface.co/spaces/hf-accelerate/model-memory-usage) which can help determine that value. + +The LLaMA tokenizer is a BPE model based on [sentencepiece](https://github.com/google/sentencepiece). One quirk of sentencepiece is that when decoding a sequence, if the first token is the start of the word (e.g. "Banana"), the tokenizer does not prepend the prefix space to the string. + + + +Code Llama has the same architecture as the `Llama2` models, refer to [Llama2's documentation page](llama2) for the API reference. +Find Code Llama tokenizer reference below. + + + +## CodeLlamaTokenizer + +[[autodoc]] CodeLlamaTokenizer + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - save_vocabulary + +## CodeLlamaTokenizerFast + +[[autodoc]] CodeLlamaTokenizerFast + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - update_post_processor + - save_vocabulary diff --git a/docs/source/en/model_doc/codegen.mdx b/docs/source/en/model_doc/codegen.md similarity index 94% rename from docs/source/en/model_doc/codegen.mdx rename to docs/source/en/model_doc/codegen.md index c46649e00133c8..78be813db1a60b 100644 --- a/docs/source/en/model_doc/codegen.mdx +++ b/docs/source/en/model_doc/codegen.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # CodeGen @@ -36,7 +40,7 @@ The original code can be found [here](https://github.com/salesforce/codegen). * `mono`: Initialized with `multi`, then further pre-trained on Python data * For example, `Salesforce/codegen-350M-mono` offers a 350 million-parameter checkpoint pre-trained sequentially on the Pile, multiple programming languages, and Python. 
-## How to use +## Usage example ```python >>> from transformers import AutoModelForCausalLM, AutoTokenizer @@ -56,6 +60,10 @@ def hello_world(): hello_world() ``` +## Resources + +- [Causal language modeling task guide](../tasks/language_modeling) + ## CodeGenConfig [[autodoc]] CodeGenConfig diff --git a/docs/source/en/model_doc/conditional_detr.mdx b/docs/source/en/model_doc/conditional_detr.md similarity index 93% rename from docs/source/en/model_doc/conditional_detr.mdx rename to docs/source/en/model_doc/conditional_detr.md index 40cdbee3450200..516e1c43685563 100644 --- a/docs/source/en/model_doc/conditional_detr.mdx +++ b/docs/source/en/model_doc/conditional_detr.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Conditional DETR @@ -27,6 +31,9 @@ alt="drawing" width="600"/> This model was contributed by [DepuMeng](https://huggingface.co/DepuMeng). The original code can be found [here](https://github.com/Atten4Vis/ConditionalDETR). +## Resources + +- [Object detection task guide](../tasks/object_detection) ## ConditionalDetrConfig @@ -36,7 +43,6 @@ This model was contributed by [DepuMeng](https://huggingface.co/DepuMeng). The o [[autodoc]] ConditionalDetrImageProcessor - preprocess - - pad_and_create_pixel_mask - post_process_object_detection - post_process_instance_segmentation - post_process_semantic_segmentation @@ -46,7 +52,6 @@ This model was contributed by [DepuMeng](https://huggingface.co/DepuMeng). The o [[autodoc]] ConditionalDetrFeatureExtractor - __call__ - - pad_and_create_pixel_mask - post_process_object_detection - post_process_instance_segmentation - post_process_semantic_segmentation diff --git a/docs/source/en/model_doc/convbert.mdx b/docs/source/en/model_doc/convbert.md similarity index 84% rename from docs/source/en/model_doc/convbert.mdx rename to docs/source/en/model_doc/convbert.md index a1dba6bf1c92bf..17b5d7920c6cfa 100644 --- a/docs/source/en/model_doc/convbert.mdx +++ b/docs/source/en/model_doc/convbert.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # ConvBERT @@ -40,11 +44,21 @@ ConvBERT significantly outperforms BERT and its variants in various downstream t fewer model parameters. Remarkably, ConvBERTbase model achieves 86.4 GLUE score, 0.7 higher than ELECTRAbase, while using less than 1/4 training cost. Code and pre-trained models will be released.* -ConvBERT training tips are similar to those of BERT. - This model was contributed by [abhishek](https://huggingface.co/abhishek). 
The original implementation can be found here: https://github.com/yitu-opensource/ConvBert +## Usage tips + +ConvBERT training tips are similar to those of BERT. For usage tips refer to [BERT documentation](bert). + +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) + ## ConvBertConfig [[autodoc]] ConvBertConfig @@ -61,6 +75,9 @@ here: https://github.com/yitu-opensource/ConvBert [[autodoc]] ConvBertTokenizerFast + + + ## ConvBertModel [[autodoc]] ConvBertModel @@ -91,6 +108,9 @@ here: https://github.com/yitu-opensource/ConvBert [[autodoc]] ConvBertForQuestionAnswering - forward + + + ## TFConvBertModel [[autodoc]] TFConvBertModel @@ -120,3 +140,6 @@ here: https://github.com/yitu-opensource/ConvBert [[autodoc]] TFConvBertForQuestionAnswering - call + + + diff --git a/docs/source/en/model_doc/convnext.mdx b/docs/source/en/model_doc/convnext.md similarity index 93% rename from docs/source/en/model_doc/convnext.mdx rename to docs/source/en/model_doc/convnext.md index 857a2adeb20a40..5222834b1f69d6 100644 --- a/docs/source/en/model_doc/convnext.mdx +++ b/docs/source/en/model_doc/convnext.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # ConvNeXT @@ -28,10 +32,6 @@ of a vision Transformer, and discover several key components that contribute to dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.* -Tips: - -- See the code examples below each model regarding usage. - drawing @@ -47,6 +47,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`ConvNextForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). +- See also: [Image classification task guide](../tasks/image_classification) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. 
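+
+As a quick sanity check (a minimal sketch rather than an official resource), a pretrained checkpoint such as `facebook/convnext-tiny-224` can also be tried through the image-classification pipeline:
+
+```python
+>>> from transformers import pipeline
+
+>>> classifier = pipeline("image-classification", model="facebook/convnext-tiny-224")
+>>> classifier("http://images.cocodataset.org/val2017/000000039769.jpg")  # returns a list of {label, score} dicts
+```
+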
@@ -63,6 +64,9 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] ConvNextImageProcessor - preprocess + + + ## ConvNextModel [[autodoc]] ConvNextModel @@ -73,14 +77,18 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] ConvNextForImageClassification - forward + + ## TFConvNextModel [[autodoc]] TFConvNextModel - call - ## TFConvNextForImageClassification [[autodoc]] TFConvNextForImageClassification - call + + + \ No newline at end of file diff --git a/docs/source/en/model_doc/convnextv2.md b/docs/source/en/model_doc/convnextv2.md new file mode 100644 index 00000000000000..8cd142c2765f0b --- /dev/null +++ b/docs/source/en/model_doc/convnextv2.md @@ -0,0 +1,68 @@ + + +# ConvNeXt V2 + +## Overview + +The ConvNeXt V2 model was proposed in [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie. +ConvNeXt V2 is a pure convolutional model (ConvNet), inspired by the design of Vision Transformers, and a successor of [ConvNeXT](convnext). + +The abstract from the paper is the following: + +*Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and performance boost in the early 2020s. For example, modern ConvNets, represented by ConvNeXt, have demonstrated strong performance in various scenarios. While these models were originally designed for supervised learning with ImageNet labels, they can also potentially benefit from self-supervised learning techniques such as masked autoencoders (MAE). However, we found that simply combining these two approaches leads to subpar performance. In this paper, we propose a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer that can be added to the ConvNeXt architecture to enhance inter-channel feature competition. This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation. We also provide pre-trained ConvNeXt V2 models of various sizes, ranging from an efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet, to a 650M Huge model that achieves a state-of-the-art 88.9% accuracy using only public training data.* + + + + ConvNeXt V2 architecture. Taken from the original paper. + +This model was contributed by [adirik](https://huggingface.co/adirik). The original code can be found [here](https://github.com/facebookresearch/ConvNeXt-V2). + +## Resources + +A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ConvNeXt V2. + + + +- [`ConvNextV2ForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). + +If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. 
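+
+The short example below is a minimal inference sketch, not part of the official resources; the `facebook/convnextv2-tiny-1k-224` checkpoint and the COCO test image are illustrative choices.
+
+```python
+>>> import requests
+>>> from PIL import Image
+>>> from transformers import AutoImageProcessor, ConvNextV2ForImageClassification
+
+>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+>>> image = Image.open(requests.get(url, stream=True).raw)
+
+>>> image_processor = AutoImageProcessor.from_pretrained("facebook/convnextv2-tiny-1k-224")
+>>> model = ConvNextV2ForImageClassification.from_pretrained("facebook/convnextv2-tiny-1k-224")
+
+>>> inputs = image_processor(image, return_tensors="pt")
+>>> logits = model(**inputs).logits
+>>> print(model.config.id2label[logits.argmax(-1).item()])  # predicted ImageNet-1k label
+```
+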
+ +## ConvNextV2Config + +[[autodoc]] ConvNextV2Config + +## ConvNextV2Model + +[[autodoc]] ConvNextV2Model + - forward + +## ConvNextV2ForImageClassification + +[[autodoc]] ConvNextV2ForImageClassification + - forward + +## TFConvNextV2Model + +[[autodoc]] TFConvNextV2Model + - call + + +## TFConvNextV2ForImageClassification + +[[autodoc]] TFConvNextV2ForImageClassification + - call diff --git a/docs/source/en/model_doc/cpm.mdx b/docs/source/en/model_doc/cpm.md similarity index 87% rename from docs/source/en/model_doc/cpm.mdx rename to docs/source/en/model_doc/cpm.md index ac8ed8fdbafbd8..129c4ed3a37720 100644 --- a/docs/source/en/model_doc/cpm.mdx +++ b/docs/source/en/model_doc/cpm.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # CPM @@ -33,7 +37,14 @@ NLP tasks in the settings of few-shot (even zero-shot) learning.* This model was contributed by [canwenxu](https://huggingface.co/canwenxu). The original implementation can be found here: https://github.com/TsinghuaAI/CPM-Generate -Note: We only have a tokenizer here, since the model architecture is the same as GPT-2. + + + +CPM's architecture is the same as GPT-2, except for tokenization method. Refer to [GPT-2 documentation](gpt2) for +API reference information. + + + ## CpmTokenizer diff --git a/docs/source/en/model_doc/cpmant.md b/docs/source/en/model_doc/cpmant.md new file mode 100644 index 00000000000000..4bcf774507fbe0 --- /dev/null +++ b/docs/source/en/model_doc/cpmant.md @@ -0,0 +1,47 @@ + + +# CPMAnt + +## Overview + +CPM-Ant is an open-source Chinese pre-trained language model (PLM) with 10B parameters. It is also the first milestone of the live training process of CPM-Live. The training process is cost-effective and environment-friendly. CPM-Ant also achieves promising results with delta tuning on the CUGE benchmark. Besides the full model, we also provide various compressed versions to meet the requirements of different hardware configurations. [See more](https://github.com/OpenBMB/CPM-Live/tree/cpm-ant/cpm-live) + +This model was contributed by [OpenBMB](https://huggingface.co/openbmb). The original code can be found [here](https://github.com/OpenBMB/CPM-Live/tree/cpm-ant/cpm-live). + +## Resources + +- A tutorial on [CPM-Live](https://github.com/OpenBMB/CPM-Live/tree/cpm-ant/cpm-live). 
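Since CPM-Ant is exposed through the standard causal language modeling API, a rough generation sketch looks like the following. The `openbmb/cpm-ant-10b` checkpoint name and the Chinese prompt are assumptions for illustration, and the 10B-parameter model needs a correspondingly large amount of memory.

```python
from transformers import CpmAntTokenizer, CpmAntForCausalLM

# assumed checkpoint name: openbmb/cpm-ant-10b (10B parameters, Chinese)
tokenizer = CpmAntTokenizer.from_pretrained("openbmb/cpm-ant-10b")
model = CpmAntForCausalLM.from_pretrained("openbmb/cpm-ant-10b")

# prompt roughly means "The weather is really nice today,"
input_ids = tokenizer("今天天气真好，", return_tensors="pt")["input_ids"]
outputs = model.generate(input_ids, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```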
+ +## CpmAntConfig + +[[autodoc]] CpmAntConfig + - all + +## CpmAntTokenizer + +[[autodoc]] CpmAntTokenizer + - all + +## CpmAntModel + +[[autodoc]] CpmAntModel + - all + +## CpmAntForCausalLM + +[[autodoc]] CpmAntForCausalLM + - all \ No newline at end of file diff --git a/docs/source/en/model_doc/ctrl.mdx b/docs/source/en/model_doc/ctrl.md similarity index 90% rename from docs/source/en/model_doc/ctrl.mdx rename to docs/source/en/model_doc/ctrl.md index aedfd04764f596..be9fa85c707372 100644 --- a/docs/source/en/model_doc/ctrl.mdx +++ b/docs/source/en/model_doc/ctrl.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # CTRL @@ -37,7 +41,10 @@ providing more explicit control over text generation. These codes also allow CTR training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data via model-based source attribution.* -Tips: +This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitishr). The original code can be found +[here](https://github.com/salesforce/ctrl). + +## Usage tips - CTRL makes use of control codes to generate text: it requires generations to be started by certain words, sentences or links to generate coherent text. Refer to the [original implementation](https://github.com/salesforce/ctrl) for @@ -52,9 +59,11 @@ Tips: pre-computed values in the context of text generation. See the [`forward`](model_doc/ctrl#transformers.CTRLModel.forward) method for more information on the usage of this argument. -This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitishr). The original code can be found -[here](https://github.com/salesforce/ctrl). +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Causal language modeling task guide](../tasks/language_modeling) ## CTRLConfig @@ -65,6 +74,9 @@ This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitis [[autodoc]] CTRLTokenizer - save_vocabulary + + + ## CTRLModel [[autodoc]] CTRLModel @@ -80,6 +92,9 @@ This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitis [[autodoc]] CTRLForSequenceClassification - forward + + + ## TFCTRLModel [[autodoc]] TFCTRLModel @@ -94,3 +109,6 @@ This model was contributed by [keskarnitishr](https://huggingface.co/keskarnitis [[autodoc]] TFCTRLForSequenceClassification - call + + + diff --git a/docs/source/en/model_doc/cvt.mdx b/docs/source/en/model_doc/cvt.md similarity index 93% rename from docs/source/en/model_doc/cvt.mdx rename to docs/source/en/model_doc/cvt.md index 9d0fa7ea8818e2..503f97795c0ea4 100644 --- a/docs/source/en/model_doc/cvt.mdx +++ b/docs/source/en/model_doc/cvt.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
+ +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Convolutional Vision Transformer (CvT) @@ -29,15 +33,15 @@ performance gains are maintained when pretrained on larger datasets (\eg ImageNe ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7\% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks.* -Tips: +This model was contributed by [anugunj](https://huggingface.co/anugunj). The original code can be found [here](https://github.com/microsoft/CvT). + +## Usage tips - CvT models are regular Vision Transformers, but trained with convolutions. They outperform the [original model (ViT)](vit) when fine-tuned on ImageNet-1K and CIFAR-100. - You can check out demo notebooks regarding inference as well as fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) (you can just replace [`ViTFeatureExtractor`] by [`AutoImageProcessor`] and [`ViTForImageClassification`] by [`CvtForImageClassification`]). - The available checkpoints are either (1) pre-trained on [ImageNet-22k](http://www.image-net.org/) (a collection of 14 million images and 22k classes) only, (2) also fine-tuned on ImageNet-22k or (3) also fine-tuned on [ImageNet-1k](http://www.image-net.org/challenges/LSVRC/2012/) (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes). -This model was contributed by [anugunj](https://huggingface.co/anugunj). The original code can be found [here](https://github.com/microsoft/CvT). - ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with CvT. @@ -45,6 +49,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`CvtForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). +- See also: [Image classification task guide](../tasks/image_classification) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. 
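Building on the tips above (CvT checkpoints are used exactly like ViT ones, with [`AutoImageProcessor`] and [`CvtForImageClassification`]), here is a minimal classification sketch assuming the `microsoft/cvt-13` checkpoint.

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, CvtForImageClassification

# assumed checkpoint: microsoft/cvt-13 (fine-tuned on ImageNet-1k)
processor = AutoImageProcessor.from_pretrained("microsoft/cvt-13")
model = CvtForImageClassification.from_pretrained("microsoft/cvt-13")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

print(model.config.id2label[logits.argmax(-1).item()])
```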
@@ -52,6 +57,9 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] CvtConfig + + + ## CvtModel [[autodoc]] CvtModel @@ -62,6 +70,9 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] CvtForImageClassification - forward + + + ## TFCvtModel [[autodoc]] TFCvtModel @@ -72,3 +83,5 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] TFCvtForImageClassification - call + + diff --git a/docs/source/en/model_doc/data2vec.mdx b/docs/source/en/model_doc/data2vec.md similarity index 85% rename from docs/source/en/model_doc/data2vec.mdx rename to docs/source/en/model_doc/data2vec.md index 39be094542c2ed..517a51ce46a3a4 100644 --- a/docs/source/en/model_doc/data2vec.mdx +++ b/docs/source/en/model_doc/data2vec.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Data2Vec @@ -31,19 +35,18 @@ the entire input. Experiments on the major benchmarks of speech recognition, ima natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches. Models and code are available at www.github.com/pytorch/fairseq/tree/master/examples/data2vec.* -Tips: - -- Data2VecAudio, Data2VecText, and Data2VecVision have all been trained using the same self-supervised learning method. -- For Data2VecAudio, preprocessing is identical to [`Wav2Vec2Model`], including feature extraction -- For Data2VecText, preprocessing is identical to [`RobertaModel`], including tokenization. -- For Data2VecVision, preprocessing is identical to [`BeitModel`], including feature extraction. - This model was contributed by [edugp](https://huggingface.co/edugp) and [patrickvonplaten](https://huggingface.co/patrickvonplaten). [sayakpaul](https://github.com/sayakpaul) and [Rocketknight1](https://github.com/Rocketknight1) contributed Data2Vec for vision in TensorFlow. The original code (for NLP and Speech) can be found [here](https://github.com/pytorch/fairseq/tree/main/examples/data2vec). The original code for vision can be found [here](https://github.com/facebookresearch/data2vec_vision/tree/main/beit). +## Usage tips + +- Data2VecAudio, Data2VecText, and Data2VecVision have all been trained using the same self-supervised learning method. +- For Data2VecAudio, preprocessing is identical to [`Wav2Vec2Model`], including feature extraction +- For Data2VecText, preprocessing is identical to [`RobertaModel`], including tokenization. +- For Data2VecVision, preprocessing is identical to [`BeitModel`], including feature extraction. ## Resources @@ -54,6 +57,22 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`Data2VecVisionForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). 
- To fine-tune [`TFData2VecVisionForImageClassification`] on a custom dataset, see [this notebook](https://colab.research.google.com/github/sayakpaul/TF-2.0-Hacks/blob/master/data2vec_vision_image_classification.ipynb). +**Data2VecText documentation resources** +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Causal language modeling task guide](../tasks/language_modeling) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) + +**Data2VecAudio documentation resources** +- [Audio classification task guide](../tasks/audio_classification) +- [Automatic speech recognition task guide](../tasks/asr) + +**Data2VecVision documentation resources** +- [Image classification](../tasks/image_classification) +- [Semantic segmentation](../tasks/semantic_segmentation) + If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. ## Data2VecTextConfig @@ -68,6 +87,8 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] Data2VecVisionConfig + + ## Data2VecAudioModel @@ -144,6 +165,9 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] Data2VecVisionForSemanticSegmentation - forward + + + ## TFData2VecVisionModel [[autodoc]] TFData2VecVisionModel @@ -158,3 +182,6 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] TFData2VecVisionForSemanticSegmentation - call + + + diff --git a/docs/source/en/model_doc/deberta-v2.mdx b/docs/source/en/model_doc/deberta-v2.md similarity index 89% rename from docs/source/en/model_doc/deberta-v2.mdx rename to docs/source/en/model_doc/deberta-v2.md index 18b2c4f16d0166..e3bd91e8e4fab0 100644 --- a/docs/source/en/model_doc/deberta-v2.mdx +++ b/docs/source/en/model_doc/deberta-v2.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # DeBERTa-v2 @@ -58,6 +62,13 @@ New in v2: This model was contributed by [DeBERTa](https://huggingface.co/DeBERTa). This model TF 2.0 implementation was contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/microsoft/DeBERTa). +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) ## DebertaV2Config @@ -77,6 +88,9 @@ contributed by [kamalkraj](https://huggingface.co/kamalkraj). 
The original code - build_inputs_with_special_tokens - create_token_type_ids_from_sequences + + + ## DebertaV2Model [[autodoc]] DebertaV2Model @@ -112,6 +126,9 @@ contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code [[autodoc]] DebertaV2ForMultipleChoice - forward + + + ## TFDebertaV2Model [[autodoc]] TFDebertaV2Model @@ -141,3 +158,11 @@ contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code [[autodoc]] TFDebertaV2ForQuestionAnswering - call + +## TFDebertaV2ForMultipleChoice + +[[autodoc]] TFDebertaV2ForMultipleChoice + - call + + + diff --git a/docs/source/en/model_doc/deberta.mdx b/docs/source/en/model_doc/deberta.md similarity index 93% rename from docs/source/en/model_doc/deberta.mdx rename to docs/source/en/model_doc/deberta.md index 33b1ec6a5104b3..342a3bc47960aa 100644 --- a/docs/source/en/model_doc/deberta.mdx +++ b/docs/source/en/model_doc/deberta.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # DeBERTa @@ -48,6 +52,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - A blog post on [Supercharged Customer Service with Machine Learning](https://huggingface.co/blog/supercharge-customer-service-with-machine-learning) with DeBERTa. - [`DebertaForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb). - [`TFDebertaForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/text-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb). +- [Text classification task guide](../tasks/sequence_classification) @@ -55,18 +60,21 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`TFDebertaForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/token-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb). - [Token classification](https://huggingface.co/course/chapter7/2?fw=pt) chapter of the 🤗 Hugging Face Course. - [Byte-Pair Encoding tokenization](https://huggingface.co/course/chapter6/5?fw=pt) chapter of the 🤗 Hugging Face Course. +- [Token classification task guide](../tasks/token_classification) - [`DebertaForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#robertabertdistilbert-and-masked-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb). 
- [`TFDebertaForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/language-modeling#run_mlmpy) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb). - [Masked language modeling](https://huggingface.co/course/chapter7/3?fw=pt) chapter of the 🤗 Hugging Face Course. +- [Masked language modeling task guide](../tasks/masked_language_modeling) - [`DebertaForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb). - [`TFDebertaForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/question-answering) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb). - [Question answering](https://huggingface.co/course/chapter7/7?fw=pt) chapter of the 🤗 Hugging Face Course. +- [Question answering task guide](../tasks/question_answering) ## DebertaConfig @@ -86,6 +94,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - build_inputs_with_special_tokens - create_token_type_ids_from_sequences + + + ## DebertaModel [[autodoc]] DebertaModel @@ -115,6 +126,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] DebertaForQuestionAnswering - forward + + + ## TFDebertaModel [[autodoc]] TFDebertaModel @@ -144,3 +158,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] TFDebertaForQuestionAnswering - call + + + + diff --git a/docs/source/en/model_doc/decision_transformer.mdx b/docs/source/en/model_doc/decision_transformer.md similarity index 93% rename from docs/source/en/model_doc/decision_transformer.mdx rename to docs/source/en/model_doc/decision_transformer.md index 2f0c70b6791bdc..07ef2ecbdc8e9a 100644 --- a/docs/source/en/model_doc/decision_transformer.mdx +++ b/docs/source/en/model_doc/decision_transformer.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Decision Transformer @@ -29,9 +33,7 @@ This allows us to draw upon the simplicity and scalability of the Transformer ar Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.* -Tips: - -This version of the model is for tasks where the state is a vector, image-based states will come soon. +This version of the model is for tasks where the state is a vector. This model was contributed by [edbeeching](https://huggingface.co/edbeeching). The original code can be found [here](https://github.com/kzl/decision-transformer). 
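Because the state is a plain vector, a forward pass can be sketched with dummy tensors as below. The `edbeeching/decision-transformer-gym-hopper-medium` checkpoint name is an assumption for illustration; in a real rollout the states, actions, and returns-to-go would come from the environment loop rather than from random tensors.

```python
import torch
from transformers import DecisionTransformerModel

# assumed checkpoint: edbeeching/decision-transformer-gym-hopper-medium (Gym Hopper)
model = DecisionTransformerModel.from_pretrained("edbeeching/decision-transformer-gym-hopper-medium")
model.eval()

batch_size, seq_len = 1, 20
states = torch.randn(batch_size, seq_len, model.config.state_dim)  # vector observations
actions = torch.zeros(batch_size, seq_len, model.config.act_dim)   # previously taken actions
rewards = torch.zeros(batch_size, seq_len, 1)
returns_to_go = torch.ones(batch_size, seq_len, 1)                 # desired future return per step
timesteps = torch.arange(seq_len).unsqueeze(0)
attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)

with torch.no_grad():
    outputs = model(
        states=states,
        actions=actions,
        rewards=rewards,
        returns_to_go=returns_to_go,
        timesteps=timesteps,
        attention_mask=attention_mask,
    )

# the predicted action for the last timestep would be fed back to the environment
next_action = outputs.action_preds[0, -1]
```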
diff --git a/docs/source/en/model_doc/deformable_detr.mdx b/docs/source/en/model_doc/deformable_detr.md similarity index 93% rename from docs/source/en/model_doc/deformable_detr.mdx rename to docs/source/en/model_doc/deformable_detr.md index 32cb68746dd215..726fa0d0ca9a51 100644 --- a/docs/source/en/model_doc/deformable_detr.mdx +++ b/docs/source/en/model_doc/deformable_detr.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Deformable DETR @@ -21,11 +25,6 @@ The abstract from the paper is the following: *DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we proposed Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10 times less training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach.* -Tips: - -- One can use [`DeformableDetrImageProcessor`] to prepare images (and optional targets) for the model. -- Training Deformable DETR is equivalent to training the original [DETR](detr) model. See the [resources](#resources) section below for demo notebooks. - drawing @@ -33,6 +32,10 @@ alt="drawing" width="600"/> This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/fundamentalvision/Deformable-DETR). +## Usage tips + +- Training Deformable DETR is equivalent to training the original [DETR](detr) model. See the [resources](#resources) section below for demo notebooks. + ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Deformable DETR. @@ -40,6 +43,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - Demo notebooks regarding inference + fine-tuning on a custom dataset for [`DeformableDetrForObjectDetection`] can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Deformable-DETR). +- See also: [Object detection task guide](../tasks/object_detection). If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. 
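To complement the demo notebooks, the following sketch shows plain inference with [`DeformableDetrForObjectDetection`]. It assumes the `SenseTime/deformable-detr` COCO checkpoint; post-processing mirrors the DETR API.

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, DeformableDetrForObjectDetection

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# assumed checkpoint: SenseTime/deformable-detr (trained on COCO detection)
processor = AutoImageProcessor.from_pretrained("SenseTime/deformable-detr")
model = DeformableDetrForObjectDetection.from_pretrained("SenseTime/deformable-detr")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# keep detections above a confidence threshold and rescale boxes to the original image size
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), [round(c, 1) for c in box.tolist()])
```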
@@ -47,14 +51,12 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] DeformableDetrImageProcessor - preprocess - - pad_and_create_pixel_mask - post_process_object_detection ## DeformableDetrFeatureExtractor [[autodoc]] DeformableDetrFeatureExtractor - __call__ - - pad_and_create_pixel_mask - post_process_object_detection ## DeformableDetrConfig diff --git a/docs/source/en/model_doc/deit.mdx b/docs/source/en/model_doc/deit.md similarity index 95% rename from docs/source/en/model_doc/deit.mdx rename to docs/source/en/model_doc/deit.md index 0640a133911113..7d9918a45eeeb6 100644 --- a/docs/source/en/model_doc/deit.mdx +++ b/docs/source/en/model_doc/deit.md @@ -8,16 +8,13 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. ---> - -# DeiT - +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. -This is a recently introduced model so the API hasn't been tested extensively. There may be some bugs or slight -breaking changes to fix it in the future. If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title). +--> - +# DeiT ## Overview @@ -41,7 +38,9 @@ distillation, especially when using a convnet as a teacher. This leads us to rep for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.* -Tips: +This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version of this model was added by [amyeroberts](https://huggingface.co/amyeroberts). + +## Usage tips - Compared to ViT, DeiT models use a so-called distillation token to effectively learn from a teacher (which, in the DeiT paper, is a ResNet like-model). The distillation token is learned through backpropagation, by interacting with @@ -69,8 +68,6 @@ Tips: *facebook/deit-base-patch16-384*. Note that one should use [`DeiTImageProcessor`] in order to prepare images for the model. -This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version of this model was added by [amyeroberts](https://huggingface.co/amyeroberts). - ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DeiT. @@ -78,6 +75,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`DeiTForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). 
+- See also: [Image classification task guide](../tasks/image_classification) Besides that: @@ -99,6 +97,9 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] DeiTImageProcessor - preprocess + + + ## DeiTModel [[autodoc]] DeiTModel @@ -119,6 +120,9 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] DeiTForImageClassificationWithTeacher - forward + + + ## TFDeiTModel [[autodoc]] TFDeiTModel @@ -138,3 +142,6 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] TFDeiTForImageClassificationWithTeacher - call + + + \ No newline at end of file diff --git a/docs/source/en/model_doc/deplot.md b/docs/source/en/model_doc/deplot.md new file mode 100644 index 00000000000000..a77bee39de76d6 --- /dev/null +++ b/docs/source/en/model_doc/deplot.md @@ -0,0 +1,66 @@ + + +# DePlot + +## Overview + +DePlot was proposed in the paper [DePlot: One-shot visual language reasoning by plot-to-table translation](https://arxiv.org/abs/2212.10505) from Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, Yasemin Altun. + +The abstract of the paper states the following: + +*Visual language such as charts and plots is ubiquitous in the human world. Comprehending plots and charts requires strong reasoning skills. Prior state-of-the-art (SOTA) models require at least tens of thousands of training examples and their reasoning capabilities are still much limited, especially on complex human-written queries. This paper presents the first one-shot solution to visual language reasoning. We decompose the challenge of visual language reasoning into two steps: (1) plot-to-text translation, and (2) reasoning over the translated text. The key in this method is a modality conversion module, named as DePlot, which translates the image of a plot or chart to a linearized table. The output of DePlot can then be directly used to prompt a pretrained large language model (LLM), exploiting the few-shot reasoning capabilities of LLMs. To obtain DePlot, we standardize the plot-to-table task by establishing unified task formats and metrics, and train DePlot end-to-end on this task. DePlot can then be used off-the-shelf together with LLMs in a plug-and-play fashion. Compared with a SOTA model finetuned on more than >28k data points, DePlot+LLM with just one-shot prompting achieves a 24.0% improvement over finetuned SOTA on human-written queries from the task of chart QA.* + +DePlot is a model that is trained using `Pix2Struct` architecture. You can find more information about `Pix2Struct` in the [Pix2Struct documentation](https://huggingface.co/docs/transformers/main/en/model_doc/pix2struct). +DePlot is a Visual Question Answering subset of `Pix2Struct` architecture. It renders the input question on the image and predicts the answer. 
+ +## Usage example + +Currently one checkpoint is available for DePlot: + +- `google/deplot`: DePlot fine-tuned on ChartQA dataset + + +```python +from transformers import AutoProcessor, Pix2StructForConditionalGeneration +import requests +from PIL import Image + +model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot") +processor = AutoProcessor.from_pretrained("google/deplot") +url = "https://raw.githubusercontent.com/vis-nlp/ChartQA/main/ChartQA%20Dataset/val/png/5090.png" +image = Image.open(requests.get(url, stream=True).raw) + +inputs = processor(images=image, text="Generate underlying data table of the figure below:", return_tensors="pt") +predictions = model.generate(**inputs, max_new_tokens=512) +print(processor.decode(predictions[0], skip_special_tokens=True)) +``` + +## Fine-tuning + +To fine-tune DePlot, refer to the pix2struct [fine-tuning notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_pix2struct.ipynb). For `Pix2Struct` models, we have found out that fine-tuning the model with Adafactor and cosine learning rate scheduler leads to faster convergence: +```python +from transformers.optimization import Adafactor, get_cosine_schedule_with_warmup + +optimizer = Adafactor(self.parameters(), scale_parameter=False, relative_step=False, lr=0.01, weight_decay=1e-05) +scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=1000, num_training_steps=40000) +``` + + + +DePlot is a model trained using `Pix2Struct` architecture. For API reference, see [`Pix2Struct` documentation](pix2struct). + + \ No newline at end of file diff --git a/docs/source/en/model_doc/depth_anything.md b/docs/source/en/model_doc/depth_anything.md new file mode 100644 index 00000000000000..99332697b38ef2 --- /dev/null +++ b/docs/source/en/model_doc/depth_anything.md @@ -0,0 +1,113 @@ + + +# Depth Anything + +## Overview + +The Depth Anything model was proposed in [Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891) by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao. Depth Anything is based on the [DPT](dpt) architecture, trained on ~62 million images, obtaining state-of-the-art results for both relative and absolute depth estimation. + +The abstract from the paper is the following: + +*This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Without pursuing novel technical modules, we aim to build a simple yet powerful foundation model dealing with any images under any circumstances. To this end, we scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data (~62M), which significantly enlarges the data coverage and thus is able to reduce the generalization error. We investigate two simple yet effective strategies that make data scaling-up promising. First, a more challenging optimization target is created by leveraging data augmentation tools. It compels the model to actively seek extra visual knowledge and acquire robust representations. Second, an auxiliary supervision is developed to enforce the model to inherit rich semantic priors from pre-trained encoders. We evaluate its zero-shot capabilities extensively, including six public datasets and randomly captured photos. It demonstrates impressive generalization ability. Further, through fine-tuning it with metric depth information from NYUv2 and KITTI, new SOTAs are set. 
Our better depth model also results in a better depth-conditioned ControlNet.* + + + + Depth Anything overview. Taken from the original paper. + +This model was contributed by [nielsr](https://huggingface.co/nielsr). +The original code can be found [here](https://github.com/LiheYoung/Depth-Anything). + +## Usage example + +There are 2 main ways to use Depth Anything: either using the pipeline API, which abstracts away all the complexity for you, or by using the `DepthAnythingForDepthEstimation` class yourself. + +### Pipeline API + +The pipeline allows to use the model in a few lines of code: + +```python +>>> from transformers import pipeline +>>> from PIL import Image +>>> import requests + +>>> # load pipe +>>> pipe = pipeline(task="depth-estimation", model="LiheYoung/depth-anything-small-hf") + +>>> # load image +>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg' +>>> image = Image.open(requests.get(url, stream=True).raw) + +>>> # inference +>>> depth = pipe(image)["depth"] +``` + +### Using the model yourself + +If you want to do the pre- and postprocessing yourself, here's how to do that: + +```python +>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation +>>> import torch +>>> import numpy as np +>>> from PIL import Image +>>> import requests + +>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" +>>> image = Image.open(requests.get(url, stream=True).raw) + +>>> image_processor = AutoImageProcessor.from_pretrained("LiheYoung/depth-anything-small-hf") +>>> model = AutoModelForDepthEstimation.from_pretrained("LiheYoung/depth-anything-small-hf") + +>>> # prepare image for the model +>>> inputs = image_processor(images=image, return_tensors="pt") + +>>> with torch.no_grad(): +... outputs = model(**inputs) +... predicted_depth = outputs.predicted_depth + +>>> # interpolate to original size +>>> prediction = torch.nn.functional.interpolate( +... predicted_depth.unsqueeze(1), +... size=image.size[::-1], +... mode="bicubic", +... align_corners=False, +... ) + +>>> # visualize the prediction +>>> output = prediction.squeeze().cpu().numpy() +>>> formatted = (output * 255 / np.max(output)).astype("uint8") +>>> depth = Image.fromarray(formatted) +``` + +## Resources + +A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Depth Anything. + +- [Monocular depth estimation task guide](../tasks/depth_estimation) +- A notebook showcasing inference with [`DepthAnythingForDepthEstimation`] can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Depth%20Anything/Predicting_depth_in_an_image_with_Depth_Anything.ipynb). 🌎 + +If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. 
+ +## DepthAnythingConfig + +[[autodoc]] DepthAnythingConfig + +## DepthAnythingForDepthEstimation + +[[autodoc]] DepthAnythingForDepthEstimation + - forward \ No newline at end of file diff --git a/docs/source/en/model_doc/deta.mdx b/docs/source/en/model_doc/deta.md similarity index 93% rename from docs/source/en/model_doc/deta.mdx rename to docs/source/en/model_doc/deta.md index 61b705d42b9751..1eed98832ac731 100644 --- a/docs/source/en/model_doc/deta.mdx +++ b/docs/source/en/model_doc/deta.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # DETA @@ -22,10 +26,6 @@ The abstract from the paper is the following: *Detection Transformer (DETR) directly transforms queries to unique objects by using one-to-one bipartite matching during training and enables end-to-end object detection. Recently, these models have surpassed traditional detectors on COCO with undeniable elegance. However, they differ from traditional detectors in multiple designs, including model architecture and training schedules, and thus the effectiveness of one-to-one matching is not fully understood. In this work, we conduct a strict comparison between the one-to-one Hungarian matching in DETRs and the one-to-many label assignments in traditional detectors with non-maximum supervision (NMS). Surprisingly, we observe one-to-many assignments with NMS consistently outperform standard one-to-one matching under the same setting, with a significant gain of up to 2.5 mAP. Our detector that trains Deformable-DETR with traditional IoU-based label assignment achieved 50.2 COCO mAP within 12 epochs (1x schedule) with ResNet50 backbone, outperforming all existing traditional or transformer-based detectors in this setting. On multiple datasets, schedules, and architectures, we consistently show bipartite matching is unnecessary for performant detection transformers. Furthermore, we attribute the success of detection transformers to their expressive transformer architecture.* -Tips: - -- One can use [`DetaImageProcessor`] to prepare images and optional targets for the model. - drawing @@ -39,6 +39,7 @@ The original code can be found [here](https://github.com/jozhang97/DETA). A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DETA. - Demo notebooks for DETA can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DETA). +- See also: [Object detection task guide](../tasks/object_detection) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. 
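A short inference sketch, analogous to the other DETR-style detectors, is shown below. It assumes the `jozhang97/deta-swin-large` checkpoint and a sample COCO image.

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, DetaForObjectDetection

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# assumed checkpoint: jozhang97/deta-swin-large (trained on COCO detection)
processor = AutoImageProcessor.from_pretrained("jozhang97/deta-swin-large")
model = DetaForObjectDetection.from_pretrained("jozhang97/deta-swin-large")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# filter detections by confidence and map boxes back to the original image size
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), [round(c, 1) for c in box.tolist()])
```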
@@ -46,20 +47,17 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] DetaConfig - ## DetaImageProcessor [[autodoc]] DetaImageProcessor - preprocess - post_process_object_detection - ## DetaModel [[autodoc]] DetaModel - forward - ## DetaForObjectDetection [[autodoc]] DetaForObjectDetection diff --git a/docs/source/en/model_doc/detr.mdx b/docs/source/en/model_doc/detr.md similarity index 96% rename from docs/source/en/model_doc/detr.mdx rename to docs/source/en/model_doc/detr.md index 872a7d4387a73a..60937b6012119f 100644 --- a/docs/source/en/model_doc/detr.mdx +++ b/docs/source/en/model_doc/detr.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # DETR @@ -37,6 +41,8 @@ baselines.* This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/facebookresearch/detr). +## How DETR works + Here's a TLDR explaining how [`~transformers.DetrForObjectDetection`] works: First, an image is sent through a pre-trained convolutional backbone (in the paper, the authors use @@ -75,7 +81,7 @@ where one first trains a [`~transformers.DetrForObjectDetection`] model to detec the mask head for 25 epochs. Experimentally, these two approaches give similar results. Note that predicting boxes is required for the training to be possible, since the Hungarian matching is computed using distances between boxes. -Tips: +## Usage tips - DETR uses so-called **object queries** to detect objects in an image. The number of queries determines the maximum number of objects that can be detected in a single image, and is set to 100 by default (see parameter @@ -140,7 +146,7 @@ As a summary, consider the following table: | **Model** | [`~transformers.DetrForObjectDetection`] | [`~transformers.DetrForSegmentation`] | [`~transformers.DetrForSegmentation`] | | **Example dataset** | COCO detection | COCO detection, COCO panoptic | COCO panoptic | | | **Format of annotations to provide to** [`~transformers.DetrImageProcessor`] | {'image_id': `int`, 'annotations': `List[Dict]`} each Dict being a COCO object annotation | {'image_id': `int`, 'annotations': `List[Dict]`} (in case of COCO detection) or {'file_name': `str`, 'image_id': `int`, 'segments_info': `List[Dict]`} (in case of COCO panoptic) | {'file_name': `str`, 'image_id': `int`, 'segments_info': `List[Dict]`} and masks_path (path to directory containing PNG files of the masks) | -| **Postprocessing** (i.e. converting the output of the model to COCO API) | [`~transformers.DetrImageProcessor.post_process`] | [`~transformers.DetrImageProcessor.post_process_segmentation`] | [`~transformers.DetrImageProcessor.post_process_segmentation`], [`~transformers.DetrImageProcessor.post_process_panoptic`] | +| **Postprocessing** (i.e. 
converting the output of the model to Pascal VOC format) | [`~transformers.DetrImageProcessor.post_process`] | [`~transformers.DetrImageProcessor.post_process_segmentation`] | [`~transformers.DetrImageProcessor.post_process_segmentation`], [`~transformers.DetrImageProcessor.post_process_panoptic`] | | **evaluators** | `CocoEvaluator` with `iou_types="bbox"` | `CocoEvaluator` with `iou_types="bbox"` or `"segm"` | `CocoEvaluator` with `iou_tupes="bbox"` or `"segm"`, `PanopticEvaluator` | In short, one should prepare the data either in COCO detection or COCO panoptic format, then use @@ -157,17 +163,10 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - All example notebooks illustrating fine-tuning [`DetrForObjectDetection`] and [`DetrForSegmentation`] on a custom dataset an be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DETR). +- See also: [Object detection task guide](../tasks/object_detection) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. -## DETR specific outputs - -[[autodoc]] models.detr.modeling_detr.DetrModelOutput - -[[autodoc]] models.detr.modeling_detr.DetrObjectDetectionOutput - -[[autodoc]] models.detr.modeling_detr.DetrSegmentationOutput - ## DetrConfig [[autodoc]] DetrConfig @@ -185,12 +184,19 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] DetrFeatureExtractor - __call__ - - pad_and_create_pixel_mask - post_process_object_detection - post_process_semantic_segmentation - post_process_instance_segmentation - post_process_panoptic_segmentation +## DETR specific outputs + +[[autodoc]] models.detr.modeling_detr.DetrModelOutput + +[[autodoc]] models.detr.modeling_detr.DetrObjectDetectionOutput + +[[autodoc]] models.detr.modeling_detr.DetrSegmentationOutput + ## DetrModel [[autodoc]] DetrModel diff --git a/docs/source/en/model_doc/dialogpt.mdx b/docs/source/en/model_doc/dialogpt.md similarity index 89% rename from docs/source/en/model_doc/dialogpt.mdx rename to docs/source/en/model_doc/dialogpt.md index 62c6b45130e346..558b91d76d25e7 100644 --- a/docs/source/en/model_doc/dialogpt.mdx +++ b/docs/source/en/model_doc/dialogpt.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # DialoGPT @@ -28,7 +32,9 @@ that leverage DialoGPT generate more relevant, contentful and context-consistent systems. The pre-trained model and training pipeline are publicly released to facilitate research into neural response generation and the development of more intelligent open-domain dialogue systems.* -Tips: +The original code can be found [here](https://github.com/microsoft/DialoGPT). + +## Usage tips - DialoGPT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than the left. 
@@ -43,7 +49,8 @@ follow the OpenAI GPT-2 to model a multiturn dialogue session as a long text and modeling. We first concatenate all dialog turns within a dialogue session into a long text x_1,..., x_N (N is the sequence length), ended by the end-of-text token.* For more information please confer to the original paper. + -DialoGPT's architecture is based on the GPT2 model, so one can refer to [GPT2's documentation page](gpt2). +DialoGPT's architecture is based on the GPT2 model, refer to [GPT2's documentation page](gpt2) for API reference and examples. -The original code can be found [here](https://github.com/microsoft/DialoGPT). + diff --git a/docs/source/en/model_doc/dinat.mdx b/docs/source/en/model_doc/dinat.md similarity index 93% rename from docs/source/en/model_doc/dinat.mdx rename to docs/source/en/model_doc/dinat.md index 1f6577e21afd96..23dfa3b74fb047 100644 --- a/docs/source/en/model_doc/dinat.mdx +++ b/docs/source/en/model_doc/dinat.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Dilated Neighborhood Attention Transformer @@ -40,17 +44,6 @@ and ADE20K (48.5 PQ), and instance segmentation model on Cityscapes (44.5 AP) an It also matches the state of the art specialized semantic segmentation models on ADE20K (58.2 mIoU), and ranks second on Cityscapes (84.5 mIoU) (no extra data). * -Tips: -- One can use the [`AutoImageProcessor`] API to prepare images for the model. -- DiNAT can be used as a *backbone*. When `output_hidden_states = True`, -it will output both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, width)` rather than `(batch_size, height, width, num_channels)`. - -Notes: -- DiNAT depends on [NATTEN](https://github.com/SHI-Labs/NATTEN/)'s implementation of Neighborhood Attention and Dilated Neighborhood Attention. -You can install it with pre-built wheels for Linux by referring to [shi-labs.com/natten](https://shi-labs.com/natten), or build on your system by running `pip install natten`. -Note that the latter will likely take time to compile. NATTEN does not support Windows devices yet. -- Patch size of 4 is only supported at the moment. - drawing @@ -61,6 +54,17 @@ Taken from the original paper. - [`DinatForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). +- See also: [Image classification task guide](../tasks/image_classification) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. 
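For a quick start, the sketch below runs image classification with [`DinatForImageClassification`]. It assumes the `shi-labs/dinat-mini-in1k-224` checkpoint and that the [NATTEN](https://github.com/SHI-Labs/NATTEN/) package is installed (`pip install natten`).

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, DinatForImageClassification

# assumed checkpoint: shi-labs/dinat-mini-in1k-224; running the model requires the `natten` package
processor = AutoImageProcessor.from_pretrained("shi-labs/dinat-mini-in1k-224")
model = DinatForImageClassification.from_pretrained("shi-labs/dinat-mini-in1k-224")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

print(model.config.id2label[logits.argmax(-1).item()])
```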
diff --git a/docs/source/en/model_doc/dinov2.md b/docs/source/en/model_doc/dinov2.md new file mode 100644 index 00000000000000..dca94786773d1d --- /dev/null +++ b/docs/source/en/model_doc/dinov2.md @@ -0,0 +1,83 @@ + + +# DINOv2 + +## Overview + +The DINOv2 model was proposed in [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by +Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski. +DINOv2 is an upgrade of [DINO](https://arxiv.org/abs/2104.14294), a self-supervised method applied on [Vision Transformers](vit). This method enables all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. + +The abstract from the paper is the following: + +*The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.* + +This model was contributed by [nielsr](https://huggingface.co/nielsr). +The original code can be found [here](https://github.com/facebookresearch/dinov2). + +## Usage tips + +The model can be traced using `torch.jit.trace` which leverages JIT compilation to optimize the model making it faster to run. Note this still produces some mis-matched elements and the difference between the original model and the traced model is of the order of 1e-4. 
+ +```python +import torch +from transformers import AutoImageProcessor, AutoModel +from PIL import Image +import requests + +url = 'http://images.cocodataset.org/val2017/000000039769.jpg' +image = Image.open(requests.get(url, stream=True).raw) + +processor = AutoImageProcessor.from_pretrained('facebook/dinov2-base') +model = AutoModel.from_pretrained('facebook/dinov2-base') + +inputs = processor(images=image, return_tensors="pt") +outputs = model(**inputs) +last_hidden_states = outputs[0] + +# We have to force return_dict=False for tracing +model.config.return_dict = False + +with torch.no_grad(): + traced_model = torch.jit.trace(model, [inputs.pixel_values]) + traced_outputs = traced_model(inputs.pixel_values) + +print((last_hidden_states - traced_outputs[0]).abs().max()) +``` + +## Resources + +A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DPT. + +- Demo notebooks for DINOv2 can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DINOv2). 🌎 + + + +- [`Dinov2ForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). +- See also: [Image classification task guide](../tasks/image_classification) + +If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. + +## Dinov2Config + +[[autodoc]] Dinov2Config + +## Dinov2Model + +[[autodoc]] Dinov2Model + - forward + +## Dinov2ForImageClassification + +[[autodoc]] Dinov2ForImageClassification + - forward diff --git a/docs/source/en/model_doc/distilbert.mdx b/docs/source/en/model_doc/distilbert.md similarity index 86% rename from docs/source/en/model_doc/distilbert.mdx rename to docs/source/en/model_doc/distilbert.md index 99e65522f267e6..844927e71984a9 100644 --- a/docs/source/en/model_doc/distilbert.mdx +++ b/docs/source/en/model_doc/distilbert.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # DistilBERT @@ -19,6 +23,9 @@ specific language governing permissions and limitations under the License. Spaces + +Paper page + ## Overview @@ -27,7 +34,7 @@ The DistilBERT model was proposed in the blog post [Smaller, faster, cheaper, li distilled version of BERT](https://medium.com/huggingface/distilbert-8cf3380435b5), and the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108). DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. 
It has 40% less parameters than -*bert-base-uncased*, runs 60% faster while preserving over 95% of BERT's performances as measured on the GLUE language +*google-bert/bert-base-uncased*, runs 60% faster while preserving over 95% of BERT's performances as measured on the GLUE language understanding benchmark. The abstract from the paper is the following: @@ -44,7 +51,10 @@ distillation and cosine-distance losses. Our smaller, faster and lighter model i demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.* -Tips: +This model was contributed by [victorsanh](https://huggingface.co/victorsanh). This model jax version was +contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation). + +## Usage tips - DistilBERT doesn't have `token_type_ids`, you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or `[SEP]`). @@ -56,8 +66,6 @@ Tips: * predicting the masked tokens correctly (but no next-sentence objective) * a cosine similarity between the hidden states of the student and the teacher model -This model was contributed by [victorsanh](https://huggingface.co/victorsanh). This model jax version was -contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation). ## Resources @@ -75,6 +83,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`DistilBertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb). - [`TFDistilBertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/text-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb). - [`FlaxDistilBertForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/text-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification_flax.ipynb). +- [Text classification task guide](../tasks/sequence_classification) @@ -83,6 +92,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`TFDistilBertForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/token-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb). - [`FlaxDistilBertForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/token-classification). - [Token classification](https://huggingface.co/course/chapter7/2?fw=pt) chapter of the 🤗 Hugging Face Course. 
+- [Token classification task guide](../tasks/token_classification) @@ -91,6 +101,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`TFDistilBertForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/language-modeling#run_mlmpy) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb). - [`FlaxDistilBertForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling#masked-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/masked_language_modeling_flax.ipynb). - [Masked language modeling](https://huggingface.co/course/chapter7/3?fw=pt) chapter of the 🤗 Hugging Face Course. +- [Masked language modeling task guide](../tasks/masked_language_modeling) @@ -98,10 +109,12 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`TFDistilBertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/question-answering) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb). - [`FlaxDistilBertForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/question-answering). - [Question answering](https://huggingface.co/course/chapter7/7?fw=pt) chapter of the 🤗 Hugging Face Course. +- [Question answering task guide](../tasks/question_answering) **Multiple choice** - [`DistilBertForMultipleChoice`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/multiple-choice) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb). - [`TFDistilBertForMultipleChoice`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/multiple-choice) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice-tf.ipynb). +- [Multiple choice task guide](../tasks/multiple_choice) ⚗️ Optimization @@ -120,6 +133,37 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - A blog post on how to [deploy DistilBERT with Amazon SageMaker](https://huggingface.co/blog/deploy-hugging-face-models-easily-with-amazon-sagemaker). - A blog post on how to [Deploy BERT with Hugging Face Transformers, Amazon SageMaker and Terraform module](https://www.philschmid.de/terraform-huggingface-amazon-sagemaker). + +## Combining DistilBERT and Flash Attention 2 + +First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature. + +```bash +pip install -U flash-attn --no-build-isolation +``` + +Make also sure that you have a hardware that is compatible with Flash-Attention 2. Read more about it in the official documentation of flash-attn repository. Make also sure to load your model in half-precision (e.g. 
`torch.float16`) + +To load and run a model using Flash Attention 2, refer to the snippet below: + +```python +>>> import torch +>>> from transformers import AutoTokenizer, AutoModel + +>>> device = "cuda" # the device to load the model onto + +>>> tokenizer = AutoTokenizer.from_pretrained('distilbert/distilbert-base-uncased') +>>> model = AutoModel.from_pretrained("distilbert/distilbert-base-uncased", torch_dtype=torch.float16, attn_implementation="flash_attention_2") + +>>> text = "Replace me by any text you'd like." + +>>> encoded_input = tokenizer(text, return_tensors='pt').to(device) +>>> model.to(device) + +>>> output = model(**encoded_input) +``` + + ## DistilBertConfig [[autodoc]] DistilBertConfig @@ -132,6 +176,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] DistilBertTokenizerFast + + + ## DistilBertModel [[autodoc]] DistilBertModel @@ -162,6 +209,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] DistilBertForQuestionAnswering - forward + + + ## TFDistilBertModel [[autodoc]] TFDistilBertModel @@ -192,6 +242,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] TFDistilBertForQuestionAnswering - call + + + ## FlaxDistilBertModel [[autodoc]] FlaxDistilBertModel @@ -221,3 +274,10 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] FlaxDistilBertForQuestionAnswering - __call__ + + + + + + + diff --git a/docs/source/en/model_doc/dit.mdx b/docs/source/en/model_doc/dit.md similarity index 92% rename from docs/source/en/model_doc/dit.mdx rename to docs/source/en/model_doc/dit.md index 4843ca71f5acf6..7f6691a15bc446 100644 --- a/docs/source/en/model_doc/dit.mdx +++ b/docs/source/en/model_doc/dit.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # DiT @@ -33,6 +37,10 @@ alt="drawing" width="600"/> Summary of the approach. Taken from the [original paper](https://arxiv.org/abs/2203.02378). +This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/dit). + +## Usage tips + One can directly use the weights of DiT with the AutoModel API: ```python @@ -62,10 +70,6 @@ model = AutoModelForImageClassification.from_pretrained("microsoft/dit-base-fine This particular checkpoint was fine-tuned on [RVL-CDIP](https://www.cs.cmu.edu/~aharley/rvl-cdip/), an important benchmark for document image classification. A notebook that illustrates inference for document image classification can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DiT/Inference_with_DiT_(Document_Image_Transformer)_for_document_image_classification.ipynb). -As DiT's architecture is equivalent to that of BEiT, one can refer to [BEiT's documentation page](beit) for all tips, code examples and notebooks. - -This model was contributed by [nielsr](https://huggingface.co/nielsr). 
The original code can be found [here](https://github.com/microsoft/unilm/tree/master/dit). - ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DiT. @@ -74,4 +78,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`BeitForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). -If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. \ No newline at end of file +If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. + + + + As DiT's architecture is equivalent to that of BEiT, one can refer to [BEiT's documentation page](beit) for all tips, code examples and notebooks. + diff --git a/docs/source/en/model_doc/donut.mdx b/docs/source/en/model_doc/donut.md similarity index 96% rename from docs/source/en/model_doc/donut.mdx rename to docs/source/en/model_doc/donut.md index 62ce32fd9c8008..6e5cfe648d0963 100644 --- a/docs/source/en/model_doc/donut.mdx +++ b/docs/source/en/model_doc/donut.md @@ -7,6 +7,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + specific language governing permissions and limitations under the License. --> # Donut @@ -30,21 +34,21 @@ alt="drawing" width="600"/> This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/clovaai/donut). -Tips: +## Usage tips - The quickest way to get started with Donut is by checking the [tutorial notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Donut), which show how to use the model at inference time as well as fine-tuning on custom data. - Donut is always used within the [VisionEncoderDecoder](vision-encoder-decoder) framework. -## Inference +## Inference examples Donut's [`VisionEncoderDecoder`] model accepts images as input and makes use of [`~generation.GenerationMixin.generate`] to autoregressively generate text given the input image. -The [`DonutFeatureExtractor`] class is responsible for preprocessing the input image and +The [`DonutImageProcessor`] class is responsible for preprocessing the input image and [`XLMRobertaTokenizer`/`XLMRobertaTokenizerFast`] decodes the generated target tokens to the target string. The -[`DonutProcessor`] wraps [`DonutFeatureExtractor`] and [`XLMRobertaTokenizer`/`XLMRobertaTokenizerFast`] +[`DonutProcessor`] wraps [`DonutImageProcessor`] and [`XLMRobertaTokenizer`/`XLMRobertaTokenizerFast`] into a single instance to both extract the input features and decode the predicted token ids. 
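+As a quick illustration of this wrapping (before the step-by-step examples that follow), the sketch below loads a [`DonutProcessor`] and shows that it exposes both the image processor and the tokenizer. The `naver-clova-ix/donut-base` checkpoint is used here purely as an example.
+
+```python
+from transformers import DonutProcessor
+
+# DonutProcessor bundles a DonutImageProcessor and an XLMRobertaTokenizer(Fast)
+# behind a single object; the checkpoint name is illustrative.
+processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
+
+print(type(processor.image_processor))  # preprocesses the input image into pixel_values
+print(type(processor.tokenizer))        # decodes generated token ids back to text
+```
+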
- Step-by-step Document Image Classification @@ -76,11 +80,9 @@ into a single instance to both extract the input features and decode the predict ... pixel_values.to(device), ... decoder_input_ids=decoder_input_ids.to(device), ... max_length=model.decoder.config.max_position_embeddings, -... early_stopping=True, ... pad_token_id=processor.tokenizer.pad_token_id, ... eos_token_id=processor.tokenizer.eos_token_id, ... use_cache=True, -... num_beams=1, ... bad_words_ids=[[processor.tokenizer.unk_token_id]], ... return_dict_in_generate=True, ... ) @@ -121,11 +123,9 @@ into a single instance to both extract the input features and decode the predict ... pixel_values.to(device), ... decoder_input_ids=decoder_input_ids.to(device), ... max_length=model.decoder.config.max_position_embeddings, -... early_stopping=True, ... pad_token_id=processor.tokenizer.pad_token_id, ... eos_token_id=processor.tokenizer.eos_token_id, ... use_cache=True, -... num_beams=1, ... bad_words_ids=[[processor.tokenizer.unk_token_id]], ... return_dict_in_generate=True, ... ) @@ -168,11 +168,9 @@ into a single instance to both extract the input features and decode the predict ... pixel_values.to(device), ... decoder_input_ids=decoder_input_ids.to(device), ... max_length=model.decoder.config.max_position_embeddings, -... early_stopping=True, ... pad_token_id=processor.tokenizer.pad_token_id, ... eos_token_id=processor.tokenizer.eos_token_id, ... use_cache=True, -... num_beams=1, ... bad_words_ids=[[processor.tokenizer.unk_token_id]], ... return_dict_in_generate=True, ... ) diff --git a/docs/source/en/model_doc/dpr.mdx b/docs/source/en/model_doc/dpr.md similarity index 93% rename from docs/source/en/model_doc/dpr.mdx rename to docs/source/en/model_doc/dpr.md index 12c139e33b650f..8b9f352b637b9b 100644 --- a/docs/source/en/model_doc/dpr.mdx +++ b/docs/source/en/model_doc/dpr.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # DPR @@ -39,7 +43,8 @@ benchmarks.* This model was contributed by [lhoestq](https://huggingface.co/lhoestq). The original code can be found [here](https://github.com/facebookresearch/DPR). 
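+Before the usage tips below, here is a minimal sketch of dense retrieval scoring with the question and context encoders. The `facebook/dpr-question_encoder-single-nq-base` and `facebook/dpr-ctx_encoder-single-nq-base` checkpoints and the toy passages are used only as an example.
+
+```python
+import torch
+from transformers import (
+    DPRContextEncoder,
+    DPRContextEncoderTokenizer,
+    DPRQuestionEncoder,
+    DPRQuestionEncoderTokenizer,
+)
+
+# Encode a question and a few candidate passages into the same vector space.
+q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
+q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
+ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
+ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
+
+question = "Who wrote Les Misérables?"
+passages = ["Victor Hugo wrote the novel Les Misérables.", "The Eiffel Tower is in Paris."]
+
+with torch.no_grad():
+    q_emb = q_encoder(**q_tokenizer(question, return_tensors="pt")).pooler_output
+    ctx_emb = ctx_encoder(**ctx_tokenizer(passages, padding=True, return_tensors="pt")).pooler_output
+
+# Relevance is the dot product between the question and passage embeddings.
+scores = q_emb @ ctx_emb.T
+print(scores)
+```
+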
-Tips: +## Usage tips + - DPR consists in three models: * Question encoder: encode questions as vectors @@ -82,6 +87,9 @@ Tips: [[autodoc]] models.dpr.modeling_dpr.DPRReaderOutput + + + ## DPRContextEncoder [[autodoc]] DPRContextEncoder @@ -97,6 +105,9 @@ Tips: [[autodoc]] DPRReader - forward + + + ## TFDPRContextEncoder [[autodoc]] TFDPRContextEncoder @@ -111,3 +122,7 @@ Tips: [[autodoc]] TFDPRReader - call + + + + diff --git a/docs/source/en/model_doc/dpt.mdx b/docs/source/en/model_doc/dpt.md similarity index 76% rename from docs/source/en/model_doc/dpt.mdx rename to docs/source/en/model_doc/dpt.md index 705dc680e6785b..a02313a31235dd 100644 --- a/docs/source/en/model_doc/dpt.mdx +++ b/docs/source/en/model_doc/dpt.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # DPT @@ -28,12 +32,30 @@ alt="drawing" width="600"/> This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/isl-org/DPT). +## Usage tips + +DPT is compatible with the [`AutoBackbone`] class. This allows to use the DPT framework with various computer vision backbones available in the library, such as [`VitDetBackbone`] or [`Dinov2Backbone`]. One can create it as follows: + +```python +from transformers import Dinov2Config, DPTConfig, DPTForDepthEstimation + +# initialize with a Transformer-based backbone such as DINOv2 +# in that case, we also specify `reshape_hidden_states=False` to get feature maps of shape (batch_size, num_channels, height, width) +backbone_config = Dinov2Config.from_pretrained("facebook/dinov2-base", out_features=["stage1", "stage2", "stage3", "stage4"], reshape_hidden_states=False) + +config = DPTConfig(backbone_config=backbone_config) +model = DPTForDepthEstimation(config=config) +``` + ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DPT. - Demo notebooks for [`DPTForDepthEstimation`] can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DPT). +- [Semantic segmentation task guide](../tasks/semantic_segmentation) +- [Monocular depth estimation task guide](../tasks/monocular_depth_estimation) + If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. 
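+Before the API reference, here is a minimal depth-estimation inference sketch. The `Intel/dpt-large` checkpoint and the COCO image URL are used only as an example.
+
+```python
+import requests
+import torch
+from PIL import Image
+from transformers import DPTForDepthEstimation, DPTImageProcessor
+
+processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")
+model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")
+
+url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+image = Image.open(requests.get(url, stream=True).raw)
+
+inputs = processor(images=image, return_tensors="pt")
+with torch.no_grad():
+    outputs = model(**inputs)
+
+# Upsample the predicted depth map back to the original image resolution.
+depth = torch.nn.functional.interpolate(
+    outputs.predicted_depth.unsqueeze(1),
+    size=image.size[::-1],
+    mode="bicubic",
+    align_corners=False,
+).squeeze()
+print(depth.shape)
+```
+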
## DPTConfig diff --git a/docs/source/en/model_doc/efficientformer.mdx b/docs/source/en/model_doc/efficientformer.md similarity index 81% rename from docs/source/en/model_doc/efficientformer.mdx rename to docs/source/en/model_doc/efficientformer.md index aba86f13177843..92ba90a9e5ed97 100644 --- a/docs/source/en/model_doc/efficientformer.mdx +++ b/docs/source/en/model_doc/efficientformer.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # EfficientFormer @@ -37,8 +41,11 @@ EfficientFormer-L7, obtains 83.3% accuracy with only 7.0 ms latency. Our work pr reach extremely low latency on mobile devices while maintaining high performance.* This model was contributed by [novice03](https://huggingface.co/novice03) and [Bearnardd](https://huggingface.co/Bearnardd). -The original code can be found [here](https://github.com/snap-research/EfficientFormer). +The original code can be found [here](https://github.com/snap-research/EfficientFormer). The TensorFlow version of this model was added by [D-Roberts](https://huggingface.co/D-Roberts). + +## Documentation resources +- [Image classification task guide](../tasks/image_classification) ## EfficientFormerConfig @@ -49,6 +56,9 @@ The original code can be found [here](https://github.com/snap-research/Efficient [[autodoc]] EfficientFormerImageProcessor - preprocess + + + ## EfficientFormerModel [[autodoc]] EfficientFormerModel @@ -63,3 +73,24 @@ The original code can be found [here](https://github.com/snap-research/Efficient [[autodoc]] EfficientFormerForImageClassificationWithTeacher - forward + + + + +## TFEfficientFormerModel + +[[autodoc]] TFEfficientFormerModel + - call + +## TFEfficientFormerForImageClassification + +[[autodoc]] TFEfficientFormerForImageClassification + - call + +## TFEfficientFormerForImageClassificationWithTeacher + +[[autodoc]] TFEfficientFormerForImageClassificationWithTeacher + - call + + + \ No newline at end of file diff --git a/docs/source/en/model_doc/efficientnet.md b/docs/source/en/model_doc/efficientnet.md new file mode 100644 index 00000000000000..a69b255dba5e2c --- /dev/null +++ b/docs/source/en/model_doc/efficientnet.md @@ -0,0 +1,51 @@ + + +# EfficientNet + +## Overview + +The EfficientNet model was proposed in [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) +by Mingxing Tan and Quoc V. Le. EfficientNets are a family of image classification models, which achieve state-of-the-art accuracy, yet being an order-of-magnitude smaller and faster than previous models. + +The abstract from the paper is the following: + +*Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. 
Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet. +To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters.* + +This model was contributed by [adirik](https://huggingface.co/adirik). +The original code can be found [here](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet). + + +## EfficientNetConfig + +[[autodoc]] EfficientNetConfig + +## EfficientNetImageProcessor + +[[autodoc]] EfficientNetImageProcessor + - preprocess + +## EfficientNetModel + +[[autodoc]] EfficientNetModel + - forward + +## EfficientNetForImageClassification + +[[autodoc]] EfficientNetForImageClassification + - forward + diff --git a/docs/source/en/model_doc/electra.mdx b/docs/source/en/model_doc/electra.md similarity index 91% rename from docs/source/en/model_doc/electra.mdx rename to docs/source/en/model_doc/electra.md index 80074a7b0b1c41..700c49df799320 100644 --- a/docs/source/en/model_doc/electra.mdx +++ b/docs/source/en/model_doc/electra.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # ELECTRA @@ -46,7 +50,9 @@ using 30x more compute) on the GLUE natural language understanding benchmark. Ou where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.* -Tips: +This model was contributed by [lysandre](https://huggingface.co/lysandre). The original code can be found [here](https://github.com/google-research/electra). + +## Usage tips - ELECTRA is the pretraining approach, therefore there is nearly no changes done to the underlying model: BERT. The only change is the separation of the embedding size and the hidden size: the embedding size is generally smaller, @@ -62,8 +68,14 @@ Tips: [`ElectraForPreTraining`] model (the classification head will be randomly initialized as it doesn't exist in the generator). -This model was contributed by [lysandre](https://huggingface.co/lysandre). The original code can be found [here](https://github.com/google-research/electra). 
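+As a small illustration of the discriminator objective described in the tips above, the sketch below uses [`ElectraForPreTraining`] to flag tokens that look replaced. The `google/electra-small-discriminator` checkpoint and the example sentence are illustrative.
+
+```python
+import torch
+from transformers import AutoTokenizer, ElectraForPreTraining
+
+tokenizer = AutoTokenizer.from_pretrained("google/electra-small-discriminator")
+discriminator = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")
+
+# "flew" is an implausible replacement in this sentence.
+inputs = tokenizer("The chef flew the meal in the kitchen", return_tensors="pt")
+
+with torch.no_grad():
+    logits = discriminator(**inputs).logits
+
+# A positive logit means the discriminator considers the token to have been replaced by the generator.
+predictions = (logits > 0).long().squeeze().tolist()
+for token, pred in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze()), predictions):
+    print(f"{token}: {'replaced' if pred else 'original'}")
+```
+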
+## Resources +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Causal language modeling task guide](../tasks/language_modeling) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) ## ElectraConfig @@ -83,6 +95,9 @@ This model was contributed by [lysandre](https://huggingface.co/lysandre). The o [[autodoc]] models.electra.modeling_tf_electra.TFElectraForPreTrainingOutput + + + ## ElectraModel [[autodoc]] ElectraModel @@ -123,6 +138,9 @@ This model was contributed by [lysandre](https://huggingface.co/lysandre). The o [[autodoc]] ElectraForQuestionAnswering - forward + + + ## TFElectraModel [[autodoc]] TFElectraModel @@ -158,6 +176,9 @@ This model was contributed by [lysandre](https://huggingface.co/lysandre). The o [[autodoc]] TFElectraForQuestionAnswering - call + + + ## FlaxElectraModel [[autodoc]] FlaxElectraModel @@ -197,3 +218,6 @@ This model was contributed by [lysandre](https://huggingface.co/lysandre). The o [[autodoc]] FlaxElectraForQuestionAnswering - __call__ + + + diff --git a/docs/source/en/model_doc/encodec.md b/docs/source/en/model_doc/encodec.md new file mode 100644 index 00000000000000..856f8be2b80afe --- /dev/null +++ b/docs/source/en/model_doc/encodec.md @@ -0,0 +1,65 @@ + + +# EnCodec + +## Overview + +The EnCodec neural codec model was proposed in [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438) by Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi. + +The abstract from the paper is the following: + +*We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural networks. It consists in a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion. We simplify and speed-up the training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produce high-quality samples. We introduce a novel loss balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus decoupling the choice of this hyper-parameter from the typical scale of the loss. Finally, we study how lightweight Transformer models can be used to further compress the obtained representation by up to 40%, while staying faster than real time. We provide a detailed description of the key design choices of the proposed model including: training objective, architectural changes and a study of various perceptual loss functions. We present an extensive subjective evaluation (MUSHRA tests) together with an ablation study for a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. Our approach is superior to the baselines methods across all evaluated settings, considering both 24 kHz monophonic and 48 kHz stereophonic audio.* + +This model was contributed by [Matthijs](https://huggingface.co/Matthijs), [Patrick Von Platen](https://huggingface.co/patrickvonplaten) and [Arthur Zucker](https://huggingface.co/ArthurZ). +The original code can be found [here](https://github.com/facebookresearch/encodec). 
+ +## Usage example + +Here is a quick example of how to encode and decode an audio using this model: + +```python +>>> from datasets import load_dataset, Audio +>>> from transformers import EncodecModel, AutoProcessor +>>> librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") + +>>> model = EncodecModel.from_pretrained("facebook/encodec_24khz") +>>> processor = AutoProcessor.from_pretrained("facebook/encodec_24khz") +>>> librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=processor.sampling_rate)) +>>> audio_sample = librispeech_dummy[-1]["audio"]["array"] +>>> inputs = processor(raw_audio=audio_sample, sampling_rate=processor.sampling_rate, return_tensors="pt") + +>>> encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"]) +>>> audio_values = model.decode(encoder_outputs.audio_codes, encoder_outputs.audio_scales, inputs["padding_mask"])[0] +>>> # or the equivalent with a forward pass +>>> audio_values = model(inputs["input_values"], inputs["padding_mask"]).audio_values +``` + +## EncodecConfig + +[[autodoc]] EncodecConfig + +## EncodecFeatureExtractor + +[[autodoc]] EncodecFeatureExtractor + - __call__ + +## EncodecModel + +[[autodoc]] EncodecModel + - decode + - encode + - forward diff --git a/docs/source/en/model_doc/encoder-decoder.mdx b/docs/source/en/model_doc/encoder-decoder.md similarity index 94% rename from docs/source/en/model_doc/encoder-decoder.mdx rename to docs/source/en/model_doc/encoder-decoder.md index 8130b4945d4cc2..4bd0e6f188fe15 100644 --- a/docs/source/en/model_doc/encoder-decoder.mdx +++ b/docs/source/en/model_doc/encoder-decoder.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Encoder Decoder Models @@ -51,8 +55,8 @@ To do so, the `EncoderDecoderModel` class provides a [`EncoderDecoderModel.from_ ```python >>> from transformers import EncoderDecoderModel, BertTokenizer ->>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") ->>> model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased") +>>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = EncoderDecoderModel.from_encoder_decoder_pretrained("google-bert/bert-base-uncased", "google-bert/bert-base-uncased") ``` ## Loading an existing `EncoderDecoderModel` checkpoint and perform inference. @@ -115,8 +119,8 @@ target sequence). 
```python >>> from transformers import BertTokenizer, EncoderDecoderModel ->>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") ->>> model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased") +>>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> model = EncoderDecoderModel.from_encoder_decoder_pretrained("google-bert/bert-base-uncased", "google-bert/bert-base-uncased") >>> model.config.decoder_start_token_id = tokenizer.cls_token_id >>> model.config.pad_token_id = tokenizer.pad_token_id @@ -145,20 +149,32 @@ were contributed by [ydshieh](https://github.com/ydshieh). [[autodoc]] EncoderDecoderConfig + + + ## EncoderDecoderModel [[autodoc]] EncoderDecoderModel - forward - from_encoder_decoder_pretrained + + + ## TFEncoderDecoderModel [[autodoc]] TFEncoderDecoderModel - call - from_encoder_decoder_pretrained + + + ## FlaxEncoderDecoderModel [[autodoc]] FlaxEncoderDecoderModel - __call__ - from_encoder_decoder_pretrained + + + diff --git a/docs/source/en/model_doc/ernie.mdx b/docs/source/en/model_doc/ernie.md similarity index 84% rename from docs/source/en/model_doc/ernie.mdx rename to docs/source/en/model_doc/ernie.md index 6ec3f104732008..a5110b2d7b7312 100644 --- a/docs/source/en/model_doc/ernie.mdx +++ b/docs/source/en/model_doc/ernie.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # ERNIE @@ -19,7 +23,7 @@ including [ERNIE1.0](https://arxiv.org/abs/1904.09223), [ERNIE2.0](https://ojs.a These models are contributed by [nghuyong](https://huggingface.co/nghuyong) and the official code can be found in [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) (in PaddlePaddle). -### How to use +### Usage example Take `ernie-1.0-base-zh` as an example: ```Python @@ -28,7 +32,7 @@ tokenizer = AutoTokenizer.from_pretrained("nghuyong/ernie-1.0-base-zh") model = AutoModel.from_pretrained("nghuyong/ernie-1.0-base-zh") ``` -### Supported Models +### Model checkpoints | Model Name | Language | Description | |:-------------------:|:--------:|:-------------------------------:| @@ -47,6 +51,15 @@ You can find all the supported models from huggingface's model hub: [huggingface repo: [PaddleNLP](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/transformers/ERNIE/contents.html) and [ERNIE](https://github.com/PaddlePaddle/ERNIE/blob/repro). 
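+Extending the loading example above, a minimal forward pass might look like the following; the Chinese sentence is illustrative.
+
+```python
+import torch
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("nghuyong/ernie-1.0-base-zh")
+model = AutoModel.from_pretrained("nghuyong/ernie-1.0-base-zh")
+
+# Encode a sentence and inspect the contextual embeddings.
+inputs = tokenizer("欢迎使用 ERNIE 模型", return_tensors="pt")
+with torch.no_grad():
+    outputs = model(**inputs)
+
+print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
+```
+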
+## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Causal language modeling task guide](../tasks/language_modeling) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) + ## ErnieConfig [[autodoc]] ErnieConfig diff --git a/docs/source/en/model_doc/ernie_m.mdx b/docs/source/en/model_doc/ernie_m.md similarity index 77% rename from docs/source/en/model_doc/ernie_m.mdx rename to docs/source/en/model_doc/ernie_m.md index 3164747367beae..a99332cb655ac5 100644 --- a/docs/source/en/model_doc/ernie_m.mdx +++ b/docs/source/en/model_doc/ernie_m.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # ErnieM @@ -21,16 +25,22 @@ Hao Tian, Hua Wu, Haifeng Wang. The abstract from the paper is the following: *Recent studies have demonstrated that pre-trained cross-lingual models achieve impressive performance in downstream cross-lingual tasks. This improvement benefits from learning a large amount of monolingual and parallel corpora. Although it is generally acknowledged that parallel corpora are critical for improving the model performance, existing methods are often constrained by the size of parallel corpora, especially for lowresource languages. In this paper, we propose ERNIE-M, a new training method that encourages the model to align the representation of multiple languages with monolingual corpora, to overcome the constraint that the parallel corpus size places on the model performance. Our key insight is to integrate back-translation into the pre-training process. We generate pseudo-parallel sentence pairs on a monolingual corpus to enable the learning of semantic alignments between different languages, thereby enhancing the semantic modeling of cross-lingual models. Experimental results show that ERNIE-M outperforms existing cross-lingual models and delivers new state-of-the-art results in various cross-lingual downstream tasks.* +This model was contributed by [Susnato Dhar](https://huggingface.co/susnato). The original code can be found [here](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/transformers/ernie_m). -Tips: -1. Ernie-M is a BERT-like model so it is a stacked Transformer Encoder. -2. Instead of using MaskedLM for pretraining (like BERT) the authors used two novel techniques: `Cross-attention Masked Language Modeling` and `Back-translation Masked Language Modeling`. For now these two LMHead objectives are not implemented here. -3. It is a multilingual language model. -4. Next Sentence Prediction was not used in pretraining process. +## Usage tips +- Ernie-M is a BERT-like model so it is a stacked Transformer Encoder. +- Instead of using MaskedLM for pretraining (like BERT) the authors used two novel techniques: `Cross-attention Masked Language Modeling` and `Back-translation Masked Language Modeling`. 
For now these two LMHead objectives are not implemented here. +- It is a multilingual language model. +- Next Sentence Prediction was not used in pretraining process. -This model was contributed by [Susnato Dhar](https://huggingface.co/susnato). The original code can be found [here](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/transformers/ernie_m). +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Multiple choice task guide](../tasks/multiple_choice) ## ErnieMConfig diff --git a/docs/source/en/model_doc/esm.mdx b/docs/source/en/model_doc/esm.md similarity index 91% rename from docs/source/en/model_doc/esm.mdx rename to docs/source/en/model_doc/esm.md index 9462e9db08773b..46bab860ff4d5f 100644 --- a/docs/source/en/model_doc/esm.mdx +++ b/docs/source/en/model_doc/esm.md @@ -8,11 +8,16 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # ESM ## Overview + This page provides code and pre-trained weights for Transformer protein language models from Meta AI's Fundamental AI Research Team, providing the state-of-the-art ESMFold and ESM-2, and the previously released ESM-1b and ESM-1v. Transformer protein language models were introduced in the paper [Biological structure and function emerge from scaling @@ -69,11 +74,6 @@ sequences with low perplexity that are well understood by the language model. ES order of magnitude faster than AlphaFold2, enabling exploration of the structural space of metagenomic proteins in practical timescales.* - -Tips: - -- ESM models are trained with a masked language modeling (MLM) objective. - The original code can be found [here](https://github.com/facebookresearch/esm) and was was developed by the Fundamental AI Research team at Meta AI. ESM-1b, ESM-1v and ESM-2 were contributed to huggingface by [jasonliu](https://huggingface.co/jasonliu) @@ -83,8 +83,16 @@ ESMFold was contributed to huggingface by [Matt](https://huggingface.co/Rocketkn [Sylvain](https://huggingface.co/sgugger), with a big thank you to Nikita Smetanin, Roshan Rao and Tom Sercu for their help throughout the process! -The HuggingFace port of ESMFold uses portions of the [openfold](https://github.com/aqlaboratory/openfold) library. -The `openfold` library is licensed under the Apache License 2.0. +## Usage tips + +- ESM models are trained with a masked language modeling (MLM) objective. +- The HuggingFace port of ESMFold uses portions of the [openfold](https://github.com/aqlaboratory/openfold) library. The `openfold` library is licensed under the Apache License 2.0. + +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Masked language modeling task guide](../tasks/masked_language_modeling) ## EsmConfig @@ -99,6 +107,8 @@ The `openfold` library is licensed under the Apache License 2.0. 
- create_token_type_ids_from_sequences - save_vocabulary + + ## EsmModel @@ -125,6 +135,9 @@ The `openfold` library is licensed under the Apache License 2.0. [[autodoc]] EsmForProteinFolding - forward + + + ## TFEsmModel [[autodoc]] TFEsmModel @@ -144,3 +157,6 @@ The `openfold` library is licensed under the Apache License 2.0. [[autodoc]] TFEsmForTokenClassification - call + + + diff --git a/docs/source/en/model_doc/falcon.md b/docs/source/en/model_doc/falcon.md new file mode 100644 index 00000000000000..9bf6c32a4ec5fc --- /dev/null +++ b/docs/source/en/model_doc/falcon.md @@ -0,0 +1,84 @@ + + +# Falcon + +## Overview + +Falcon is a class of causal decoder-only models built by [TII](https://www.tii.ae/). The largest Falcon checkpoints +have been trained on >=1T tokens of text, with a particular emphasis on the [RefinedWeb](https://arxiv.org/abs/2306.01116) +corpus. They are made available under the Apache 2.0 license. + + +Falcon's architecture is modern and optimized for inference, with multi-query attention and support for efficient +attention variants like `FlashAttention`. Both 'base' models trained only as causal language models as well as +'instruct' models that have received further fine-tuning are available. + + +Falcon models are (as of 2023) some of the largest and most powerful open-source language models, +and consistently rank highly in the [OpenLLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). + +## Converting custom checkpoints + + + +Falcon models were initially added to the Hugging Face Hub as custom code checkpoints. However, Falcon is now fully +supported in the Transformers library. If you fine-tuned a model from a custom code checkpoint, we recommend converting +your checkpoint to the new in-library format, as this should give significant improvements to stability and +performance, especially for generation, as well as removing the need to use `trust_remote_code=True`! + + + +You can convert custom code checkpoints to full Transformers checkpoints using the `convert_custom_code_checkpoint.py` +script located in the +[Falcon model directory](https://github.com/huggingface/transformers/tree/main/src/transformers/models/falcon) +of the Transformers library. To use this script, simply call it with +`python convert_custom_code_checkpoint.py --checkpoint_dir my_model`. This will convert your checkpoint in-place, and +you can immediately load it from the directory afterwards with e.g. `from_pretrained()`. If your model hasn't been +uploaded to the Hub, we recommend making a backup before attempting the conversion, just in case! 
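+Once a checkpoint is in the library format, generation works through the standard auto classes. A minimal sketch, using the `tiiuae/falcon-7b` weights purely as an example (any converted Falcon checkpoint should behave the same way):
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "tiiuae/falcon-7b"  # illustrative; substitute your own converted checkpoint
+
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
+
+# Generate a short continuation from a prompt.
+inputs = tokenizer("The Falcon models are", return_tensors="pt").to(model.device)
+outputs = model.generate(**inputs, max_new_tokens=30)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+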
+ + +## FalconConfig + +[[autodoc]] FalconConfig + - all + +## FalconModel + +[[autodoc]] FalconModel + - forward + +## FalconForCausalLM + +[[autodoc]] FalconForCausalLM + - forward + +## FalconForSequenceClassification + +[[autodoc]] FalconForSequenceClassification + - forward + +## FalconForTokenClassification + +[[autodoc]] FalconForTokenClassification + - forward + +## FalconForQuestionAnswering + +[[autodoc]] FalconForQuestionAnswering + - forward + + diff --git a/docs/source/en/model_doc/fastspeech2_conformer.md b/docs/source/en/model_doc/fastspeech2_conformer.md new file mode 100644 index 00000000000000..dbb87b5a4148c7 --- /dev/null +++ b/docs/source/en/model_doc/fastspeech2_conformer.md @@ -0,0 +1,134 @@ + + +# FastSpeech2Conformer + +## Overview + +The FastSpeech2Conformer model was proposed with the paper [Recent Developments On Espnet Toolkit Boosted By Conformer](https://arxiv.org/abs/2010.13956) by Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang. + +The abstract from the original FastSpeech2 paper is the following: + +*Non-autoregressive text to speech (TTS) models such as FastSpeech (Ren et al., 2019) can synthesize speech significantly faster than previous autoregressive models with comparable quality. The training of FastSpeech model relies on an autoregressive teacher model for duration prediction (to provide more information as input) and knowledge distillation (to simplify the data distribution in output), which can ease the one-to-many mapping problem (i.e., multiple speech variations correspond to the same text) in TTS. However, FastSpeech has several disadvantages: 1) the teacher-student distillation pipeline is complicated and time-consuming, 2) the duration extracted from the teacher model is not accurate enough, and the target mel-spectrograms distilled from teacher model suffer from information loss due to data simplification, both of which limit the voice quality. In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with ground-truth target instead of the simplified output from teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs. Specifically, we extract duration, pitch and energy from speech waveform and directly take them as conditional inputs in training and use predicted values in inference. We further design FastSpeech 2s, which is the first attempt to directly generate speech waveform from text in parallel, enjoying the benefit of fully end-to-end inference. Experimental results show that 1) FastSpeech 2 achieves a 3x training speed-up over FastSpeech, and FastSpeech 2s enjoys even faster inference speed; 2) FastSpeech 2 and 2s outperform FastSpeech in voice quality, and FastSpeech 2 can even surpass autoregressive models. Audio samples are available at https://speechresearch.github.io/fastspeech2/.* + +This model was contributed by [Connor Henderson](https://huggingface.co/connor-henderson). The original code can be found [here](https://github.com/espnet/espnet/blob/master/espnet2/tts/fastspeech2/fastspeech2.py). 
+ + +## 🤗 Model Architecture +FastSpeech2's general structure with a Mel-spectrogram decoder was implemented, and the traditional transformer blocks were replaced with with conformer blocks as done in the ESPnet library. + +#### FastSpeech2 Model Architecture +![FastSpeech2 Model Architecture](https://www.microsoft.com/en-us/research/uploads/prod/2021/04/fastspeech2-1.png) + +#### Conformer Blocks +![Conformer Blocks](https://www.researchgate.net/profile/Hirofumi-Inaguma-2/publication/344911155/figure/fig2/AS:951455406108673@1603856054097/An-overview-of-Conformer-block.png) + +#### Convolution Module +![Convolution Module](https://d3i71xaburhd42.cloudfront.net/8809d0732f6147d4ad9218c8f9b20227c837a746/2-Figure1-1.png) + +## 🤗 Transformers Usage + +You can run FastSpeech2Conformer locally with the 🤗 Transformers library. + +1. First install the 🤗 [Transformers library](https://github.com/huggingface/transformers), g2p-en: + +```bash +pip install --upgrade pip +pip install --upgrade transformers g2p-en +``` + +2. Run inference via the Transformers modelling code with the model and hifigan separately + +```python + +from transformers import FastSpeech2ConformerTokenizer, FastSpeech2ConformerModel, FastSpeech2ConformerHifiGan +import soundfile as sf + +tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer") +inputs = tokenizer("Hello, my dog is cute.", return_tensors="pt") +input_ids = inputs["input_ids"] + +model = FastSpeech2ConformerModel.from_pretrained("espnet/fastspeech2_conformer") +output_dict = model(input_ids, return_dict=True) +spectrogram = output_dict["spectrogram"] + +hifigan = FastSpeech2ConformerHifiGan.from_pretrained("espnet/fastspeech2_conformer_hifigan") +waveform = hifigan(spectrogram) + +sf.write("speech.wav", waveform.squeeze().detach().numpy(), samplerate=22050) +``` + +3. Run inference via the Transformers modelling code with the model and hifigan combined + +```python +from transformers import FastSpeech2ConformerTokenizer, FastSpeech2ConformerWithHifiGan +import soundfile as sf + +tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer") +inputs = tokenizer("Hello, my dog is cute.", return_tensors="pt") +input_ids = inputs["input_ids"] + +model = FastSpeech2ConformerWithHifiGan.from_pretrained("espnet/fastspeech2_conformer_with_hifigan") +output_dict = model(input_ids, return_dict=True) +waveform = output_dict["waveform"] + +sf.write("speech.wav", waveform.squeeze().detach().numpy(), samplerate=22050) +``` + +4. 
Run inference with a pipeline and specify which vocoder to use +```python +from transformers import pipeline, FastSpeech2ConformerHifiGan +import soundfile as sf + +vocoder = FastSpeech2ConformerHifiGan.from_pretrained("espnet/fastspeech2_conformer_hifigan") +synthesiser = pipeline(model="espnet/fastspeech2_conformer", vocoder=vocoder) + +speech = synthesiser("Hello, my dog is cooler than you!") + +sf.write("speech.wav", speech["audio"].squeeze(), samplerate=speech["sampling_rate"]) +``` + + +## FastSpeech2ConformerConfig + +[[autodoc]] FastSpeech2ConformerConfig + +## FastSpeech2ConformerHifiGanConfig + +[[autodoc]] FastSpeech2ConformerHifiGanConfig + +## FastSpeech2ConformerWithHifiGanConfig + +[[autodoc]] FastSpeech2ConformerWithHifiGanConfig + +## FastSpeech2ConformerTokenizer + +[[autodoc]] FastSpeech2ConformerTokenizer + - __call__ + - save_vocabulary + - decode + - batch_decode + +## FastSpeech2ConformerModel + +[[autodoc]] FastSpeech2ConformerModel + - forward + +## FastSpeech2ConformerHifiGan + +[[autodoc]] FastSpeech2ConformerHifiGan + - forward + +## FastSpeech2ConformerWithHifiGan + +[[autodoc]] FastSpeech2ConformerWithHifiGan + - forward diff --git a/docs/source/en/model_doc/flan-t5.mdx b/docs/source/en/model_doc/flan-t5.md similarity index 85% rename from docs/source/en/model_doc/flan-t5.mdx rename to docs/source/en/model_doc/flan-t5.md index 5a2d6fc934fd07..c0fd6b0011ccb1 100644 --- a/docs/source/en/model_doc/flan-t5.mdx +++ b/docs/source/en/model_doc/flan-t5.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # FLAN-T5 @@ -44,6 +48,10 @@ Google has released the following variants: - [google/flan-t5-xxl](https://huggingface.co/google/flan-t5-xxl). -One can refer to [T5's documentation page](t5) for all tips, code examples and notebooks. As well as the FLAN-T5 model card for more details regarding training and evaluation of the model. - The original checkpoints can be found [here](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints). + + + +Refer to [T5's documentation page](t5) for all API reference, code examples and notebooks. For more details regarding training and evaluation of the FLAN-T5, refer to the model card. + + \ No newline at end of file diff --git a/docs/source/en/model_doc/flan-ul2.md b/docs/source/en/model_doc/flan-ul2.md new file mode 100644 index 00000000000000..5487bb77976099 --- /dev/null +++ b/docs/source/en/model_doc/flan-ul2.md @@ -0,0 +1,54 @@ + + +# FLAN-UL2 + +## Overview + +Flan-UL2 is an encoder decoder model based on the T5 architecture. It uses the same configuration as the [UL2](ul2) model released earlier last year. +It was fine tuned using the "Flan" prompt tuning and dataset collection. Similar to `Flan-T5`, one can directly use FLAN-UL2 weights without finetuning the model: + +According to the original blog here are the notable improvements: + +- The original UL2 model was only trained with receptive field of 512, which made it non-ideal for N-shot prompting where N is large. 
+- The Flan-UL2 checkpoint uses a receptive field of 2048 which makes it more usable for few-shot in-context learning. +- The original UL2 model also had mode switch tokens that was rather mandatory to get good performance. However, they were a little cumbersome as this requires often some changes during inference or finetuning. In this update/change, we continue training UL2 20B for an additional 100k steps (with small batch) to forget “mode tokens” before applying Flan instruction tuning. This Flan-UL2 checkpoint does not require mode tokens anymore. +Google has released the following variants: + +The original checkpoints can be found [here](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-ul2-checkpoints). + + +## Running on low resource devices + +The model is pretty heavy (~40GB in half precision) so if you just want to run the model, make sure you load your model in 8bit, and use `device_map="auto"` to make sure you don't have any OOM issue! + +```python +>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer + +>>> model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-ul2", load_in_8bit=True, device_map="auto") +>>> tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2") + +>>> inputs = tokenizer("A step by step recipe to make bolognese pasta:", return_tensors="pt") +>>> outputs = model.generate(**inputs) +>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True)) +['In a large skillet, brown the ground beef and onion over medium heat. Add the garlic'] +``` + + + +Refer to [T5's documentation page](t5) for API reference, tips, code examples and notebooks. + + diff --git a/docs/source/en/model_doc/flaubert.mdx b/docs/source/en/model_doc/flaubert.md similarity index 87% rename from docs/source/en/model_doc/flaubert.mdx rename to docs/source/en/model_doc/flaubert.md index fe8ef3e0504e43..04bcc2638ac9d2 100644 --- a/docs/source/en/model_doc/flaubert.mdx +++ b/docs/source/en/model_doc/flaubert.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # FlauBERT @@ -46,7 +50,13 @@ This model was contributed by [formiel](https://huggingface.co/formiel). The ori Tips: - Like RoBERTa, without the sentence ordering prediction (so just trained on the MLM objective). 
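+A minimal feature-extraction sketch; the `flaubert/flaubert_base_cased` checkpoint and the French sentence are illustrative.
+
+```python
+import torch
+from transformers import FlaubertModel, FlaubertTokenizer
+
+tokenizer = FlaubertTokenizer.from_pretrained("flaubert/flaubert_base_cased")
+model = FlaubertModel.from_pretrained("flaubert/flaubert_base_cased")
+
+# Encode a French sentence and inspect the contextual embeddings.
+inputs = tokenizer("Le camembert est délicieux !", return_tensors="pt")
+with torch.no_grad():
+    outputs = model(**inputs)
+
+print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
+```
+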
+## Resources +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) ## FlaubertConfig @@ -56,6 +66,9 @@ Tips: [[autodoc]] FlaubertTokenizer + + + ## FlaubertModel [[autodoc]] FlaubertModel @@ -91,6 +104,9 @@ Tips: [[autodoc]] FlaubertForQuestionAnswering - forward + + + ## TFFlaubertModel [[autodoc]] TFFlaubertModel @@ -120,3 +136,9 @@ Tips: [[autodoc]] TFFlaubertForQuestionAnsweringSimple - call + + + + + + diff --git a/docs/source/en/model_doc/flava.mdx b/docs/source/en/model_doc/flava.md similarity index 94% rename from docs/source/en/model_doc/flava.mdx rename to docs/source/en/model_doc/flava.md index 4df11a5758a2dc..d9f9f1de514671 100644 --- a/docs/source/en/model_doc/flava.mdx +++ b/docs/source/en/model_doc/flava.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # FLAVA @@ -29,10 +33,8 @@ at once -- a true vision and language foundation model should be good at vision cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.* - This model was contributed by [aps](https://huggingface.co/aps). The original code can be found [here](https://github.com/facebookresearch/multimodal/tree/main/examples/flava). - ## FlavaConfig [[autodoc]] FlavaConfig diff --git a/docs/source/en/model_doc/fnet.mdx b/docs/source/en/model_doc/fnet.md similarity index 81% rename from docs/source/en/model_doc/fnet.mdx rename to docs/source/en/model_doc/fnet.md index 19afcc8051102f..1bcae678e632d9 100644 --- a/docs/source/en/model_doc/fnet.mdx +++ b/docs/source/en/model_doc/fnet.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # FNet @@ -33,13 +37,21 @@ sequence lengths on GPUs (and across relatively shorter lengths on TPUs). Finall and is particularly efficient at smaller model sizes; for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts.* -Tips on usage: +This model was contributed by [gchhablani](https://huggingface.co/gchhablani). The original code can be found [here](https://github.com/google-research/google-research/tree/master/f_net). + +## Usage tips -- The model was trained without an attention mask as it is based on Fourier Transform. 
The model was trained with - maximum sequence length 512 which includes pad tokens. Hence, it is highly recommended to use the same maximum - sequence length for fine-tuning and inference. +The model was trained without an attention mask as it is based on Fourier Transform. The model was trained with +maximum sequence length 512 which includes pad tokens. Hence, it is highly recommended to use the same maximum +sequence length for fine-tuning and inference. -This model was contributed by [gchhablani](https://huggingface.co/gchhablani). The original code can be found [here](https://github.com/google-research/google-research/tree/master/f_net). +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) ## FNetConfig diff --git a/docs/source/en/model_doc/focalnet.md b/docs/source/en/model_doc/focalnet.md new file mode 100644 index 00000000000000..c4c97980f06976 --- /dev/null +++ b/docs/source/en/model_doc/focalnet.md @@ -0,0 +1,50 @@ + + +# FocalNet + +## Overview + +The FocalNet model was proposed in [Focal Modulation Networks](https://arxiv.org/abs/2203.11926) by Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao. +FocalNets completely replace self-attention (used in models like [ViT](vit) and [Swin](swin)) by a focal modulation mechanism for modeling token interactions in vision. +The authors claim that FocalNets outperform self-attention based models with similar computational costs on the tasks of image classification, object detection, and segmentation. + +The abstract from the paper is the following: + +*We propose focal modulation networks (FocalNets in short), where self-attention (SA) is completely replaced by a focal modulation mechanism for modeling token interactions in vision. Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges, (ii) gated aggregation to selectively gather contexts for each query token based on its +content, and (iii) element-wise modulation or affine transformation to inject the aggregated context into the query. Extensive experiments show FocalNets outperform the state-of-the-art SA counterparts (e.g., Swin and Focal Transformers) with similar computational costs on the tasks of image classification, object detection, and segmentation. Specifically, FocalNets with tiny and base size achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K. After pretrained on ImageNet-22K in 224 resolution, it attains 86.5% and 87.3% top-1 accuracy when finetuned with resolution 224 and 384, respectively. When transferred to downstream tasks, FocalNets exhibit clear superiority. For object detection with Mask R-CNN, FocalNet base trained with 1\times outperforms the Swin counterpart by 2.1 points and already surpasses Swin trained with 3\times schedule (49.0 v.s. 48.5). For semantic segmentation with UPerNet, FocalNet base at single-scale outperforms Swin by 2.4, and beats Swin at multi-scale (50.5 v.s. 49.7). Using large FocalNet and Mask2former, we achieve 58.5 mIoU for ADE20K semantic segmentation, and 57.9 PQ for COCO Panoptic Segmentation. 
Using huge FocalNet and DINO, we achieved 64.3 and 64.4 mAP on COCO minival and test-dev, respectively, establishing new SoTA on top of much larger attention-based models like Swinv2-G and BEIT-3.* + +This model was contributed by [nielsr](https://huggingface.co/nielsr). +The original code can be found [here](https://github.com/microsoft/FocalNet). + +## FocalNetConfig + +[[autodoc]] FocalNetConfig + +## FocalNetModel + +[[autodoc]] FocalNetModel + - forward + +## FocalNetForMaskedImageModeling + +[[autodoc]] FocalNetForMaskedImageModeling + - forward + +## FocalNetForImageClassification + +[[autodoc]] FocalNetForImageClassification + - forward diff --git a/docs/source/en/model_doc/fsmt.mdx b/docs/source/en/model_doc/fsmt.md similarity index 93% rename from docs/source/en/model_doc/fsmt.mdx rename to docs/source/en/model_doc/fsmt.md index 25c8d85cf48606..9419dce71edf38 100644 --- a/docs/source/en/model_doc/fsmt.mdx +++ b/docs/source/en/model_doc/fsmt.md @@ -8,13 +8,14 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # FSMT -**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign -@stas00. - ## Overview FSMT (FairSeq MachineTranslation) models were introduced in [Facebook FAIR's WMT19 News Translation Task Submission](https://arxiv.org/abs/1907.06616) by Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, Sergey Edunov. diff --git a/docs/source/en/model_doc/funnel.mdx b/docs/source/en/model_doc/funnel.md similarity index 91% rename from docs/source/en/model_doc/funnel.mdx rename to docs/source/en/model_doc/funnel.md index 055cf6257b39d9..d6929691f40038 100644 --- a/docs/source/en/model_doc/funnel.mdx +++ b/docs/source/en/model_doc/funnel.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Funnel Transformer @@ -43,7 +47,9 @@ via a decoder. Empirically, with comparable or fewer FLOPs, Funnel-Transformer o a wide variety of sequence-level prediction tasks, including text classification, language understanding, and reading comprehension.* -Tips: +This model was contributed by [sgugger](https://huggingface.co/sgugger). The original code can be found [here](https://github.com/laiguokun/Funnel-Transformer). + +## Usage tips - Since Funnel Transformer uses pooling, the sequence length of the hidden states changes after each block of layers. This way, their length is divided by 2, which speeds up the computation of the next hidden states. 
The base model therefore has a final sequence length that is a quarter of the original one. This model can be used @@ -58,7 +64,13 @@ Tips: [`FunnelBaseModel`], [`FunnelForSequenceClassification`] and [`FunnelForMultipleChoice`]. -This model was contributed by [sgugger](https://huggingface.co/sgugger). The original code can be found [here](https://github.com/laiguokun/Funnel-Transformer). +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) ## FunnelConfig @@ -83,6 +95,9 @@ This model was contributed by [sgugger](https://huggingface.co/sgugger). The ori [[autodoc]] models.funnel.modeling_tf_funnel.TFFunnelForPreTrainingOutput + + + ## FunnelBaseModel [[autodoc]] FunnelBaseModel @@ -123,6 +138,9 @@ This model was contributed by [sgugger](https://huggingface.co/sgugger). The ori [[autodoc]] FunnelForQuestionAnswering - forward + + + ## TFFunnelBaseModel [[autodoc]] TFFunnelBaseModel @@ -162,3 +180,6 @@ This model was contributed by [sgugger](https://huggingface.co/sgugger). The ori [[autodoc]] TFFunnelForQuestionAnswering - call + + + diff --git a/docs/source/en/model_doc/fuyu.md b/docs/source/en/model_doc/fuyu.md new file mode 100644 index 00000000000000..2832e35398f122 --- /dev/null +++ b/docs/source/en/model_doc/fuyu.md @@ -0,0 +1,115 @@ + + +# Fuyu + +## Overview + +The Fuyu model was created by [ADEPT](https://www.adept.ai/blog/fuyu-8b), and authored by Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar. + +The authors introduced Fuyu-8B, a decoder-only multimodal model based on the classic transformers architecture, with query and key normalization. A linear encoder is added to create multimodal embeddings from image inputs. + +By treating image tokens like text tokens and using a special image-newline character, the model knows when an image line ends. Image positional embeddings are removed. This avoids the need for different training phases for various image resolutions. With 8 billion parameters and licensed under CC-BY-NC, Fuyu-8B is notable for its ability to handle both text and images, its impressive context size of 16K, and its overall performance. + + + +The `Fuyu` models were trained using `bfloat16`, but the original inference uses `float16` The checkpoints uploaded on the hub use `torch_dtype = 'float16'` which will be +used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`. + +The `dtype` of the online weights is mostly irrelevant, unless you are using `torch_dtype="auto"` when initializing a model using `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto")`. The reason is that the model will first be downloaded ( using the `dtype` of the checkpoints online) then it will be cast to the default `dtype` of `torch` (becomes `torch.float32`). Users should specify the `torch_dtype` they want, and if they don't it will be `torch.float32`. + +Finetuning the model in `float16` is not recommended and known to produce `nan`, as such the model should be fine-tuned in `bfloat16`. 
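+
+As a minimal sketch of the recommendation above (assuming the `adept/fuyu-8b` checkpoint on the Hub and that Accelerate is installed for `device_map="auto"`), the model could be loaded in `bfloat16` like this:
+
+```python
+>>> import torch
+>>> from transformers import FuyuForCausalLM
+
+>>> # request bfloat16 explicitly instead of the float16 weights stored on the Hub
+>>> model = FuyuForCausalLM.from_pretrained("adept/fuyu-8b", torch_dtype=torch.bfloat16, device_map="auto")
+```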
+
+
+
+
+Tips:
+
+- To convert the model, you need to clone the original repository using `git clone https://github.com/persimmon-ai-labs/adept-inference`, then get the checkpoints:
+
+```bash
+git clone https://github.com/persimmon-ai-labs/adept-inference
+wget path/to/fuyu-8b-model-weights.tar
+tar -xvf fuyu-8b-model-weights.tar
+python src/transformers/models/fuyu/convert_fuyu_weights_to_hf.py --input_dir /path/to/downloaded/fuyu/weights/ --output_dir /output/path \
+    --pt_model_path /path/to/fuyu_8b_release/iter_0001251/mp_rank_00/model_optim_rng.pt \
+    --ada_lib_path /path/to/adept-inference
+```
+
+For the chat model:
+```bash
+wget https://axtkn4xl5cip.objectstorage.us-phoenix-1.oci.customer-oci.com/n/axtkn4xl5cip/b/adept-public-data/o/8b_chat_model_release.tar
+tar -xvf 8b_chat_model_release.tar
+```
+Then, the model can be loaded via:
+
+```py
+from transformers import FuyuForCausalLM
+
+model = FuyuForCausalLM.from_pretrained("/output/path")
+```
+
+Inputs need to be passed through a specific Processor to have the correct formats.
+A processor requires an image_processor and a tokenizer. Hence, inputs can be loaded via:
+
+```py
+import io
+
+import requests
+from PIL import Image
+from transformers import AutoTokenizer
+from transformers.models.fuyu.processing_fuyu import FuyuProcessor
+from transformers.models.fuyu.image_processing_fuyu import FuyuImageProcessor
+
+tokenizer = AutoTokenizer.from_pretrained('adept-hf-collab/fuyu-8b')
+image_processor = FuyuImageProcessor()
+processor = FuyuProcessor(image_processor=image_processor, tokenizer=tokenizer)
+
+text_prompt = "Generate a coco-style caption.\\n"
+
+# download an example image and run it through the processor together with the prompt
+bus_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bus.png"
+bus_image_pil = Image.open(io.BytesIO(requests.get(bus_image_url).content))
+inputs_to_model = processor(text=text_prompt, images=bus_image_pil)
+```
+
+This model was contributed by [Molbap](https://huggingface.co/Molbap).
+The original code can be found [here](https://github.com/persimmon-ai-labs/adept-inference).
+
+- Fuyu uses a `sentencepiece` based tokenizer, with a `Unigram` model. It supports bytefallback, which is only available in `tokenizers==0.14.0` for the fast tokenizer.
+The `LlamaTokenizer` is used as it is a standard wrapper around sentencepiece.
+
+- The authors suggest using the following prompt for image captioning: `f"Generate a coco-style caption.\\n"`
+
+
+## FuyuConfig
+
+[[autodoc]] FuyuConfig
+
+## FuyuForCausalLM
+
+[[autodoc]] FuyuForCausalLM
+    - forward
+
+## FuyuImageProcessor
+
+[[autodoc]] FuyuImageProcessor
+    - __call__
+
+## FuyuProcessor
+
+[[autodoc]] FuyuProcessor
+    - __call__
diff --git a/docs/source/en/model_doc/git.mdx b/docs/source/en/model_doc/git.md similarity index 93% rename from docs/source/en/model_doc/git.mdx rename to docs/source/en/model_doc/git.md index 543548d8e60cf0..bffa98b89e3b73 100644 --- a/docs/source/en/model_doc/git.mdx +++ b/docs/source/en/model_doc/git.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer.
+ --> # GIT @@ -23,11 +27,6 @@ The abstract from the paper is the following: *In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture as one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost the model performance. Without bells and whistles, our GIT establishes new state of the arts on 12 challenging benchmarks with a large margin. For instance, our model surpasses the human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.* -Tips: - -- GIT is implemented in a very similar way to GPT-2, the only difference being that the model is also conditioned on `pixel_values`. -- One can use [`GitProcessor`] to prepare images for the model, and the `generate` method for autoregressive generation. - drawing @@ -36,11 +35,16 @@ alt="drawing" width="600"/> This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/GenerativeImage2Text). +## Usage tips + +- GIT is implemented in a very similar way to GPT-2, the only difference being that the model is also conditioned on `pixel_values`. + ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with GIT. - Demo notebooks regarding inference + fine-tuning GIT on custom data can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/GIT). +- See also: [Causal language modeling task guide](../tasks/language_modeling) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it. The resource should ideally demonstrate something new instead of duplicating an existing resource. diff --git a/docs/source/en/model_doc/glpn.mdx b/docs/source/en/model_doc/glpn.md similarity index 94% rename from docs/source/en/model_doc/glpn.mdx rename to docs/source/en/model_doc/glpn.md index fe39dbb9489acc..b57d1a7ccdda7f 100644 --- a/docs/source/en/model_doc/glpn.mdx +++ b/docs/source/en/model_doc/glpn.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # GLPN @@ -29,10 +33,6 @@ The abstract from the paper is the following: *Depth estimation from a single image is an important task that can be applied to various fields in computer vision, and has grown rapidly with the development of convolutional neural networks. 
In this paper, we propose a novel structure and training strategy for monocular depth estimation to further improve the prediction accuracy of the network. We deploy a hierarchical transformer encoder to capture and convey the global context, and design a lightweight yet powerful decoder to generate an estimated depth map while considering local connectivity. By constructing connected paths between multi-scale local features and the global decoding stream with our proposed selective feature fusion module, the network can integrate both representations and recover fine details. In addition, the proposed decoder shows better performance than the previously proposed decoders, with considerably less computational complexity. Furthermore, we improve the depth-specific augmentation method by utilizing an important observation in depth estimation to enhance the model. Our network achieves state-of-the-art performance over the challenging depth dataset NYU Depth V2. Extensive experiments have been conducted to validate and show the effectiveness of the proposed approach. Finally, our model shows better generalisation ability and robustness than other comparative models.* -Tips: - -- One can use [`GLPNImageProcessor`] to prepare images for the model. - drawing @@ -45,6 +45,7 @@ This model was contributed by [nielsr](https://huggingface.co/nielsr). The origi A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with GLPN. - Demo notebooks for [`GLPNForDepthEstimation`] can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/GLPN). +- [Monocular depth estimation task guide](../tasks/monocular_depth_estimation) ## GLPNConfig diff --git a/docs/source/en/model_doc/gpt-sw3.mdx b/docs/source/en/model_doc/gpt-sw3.md similarity index 67% rename from docs/source/en/model_doc/gpt-sw3.mdx rename to docs/source/en/model_doc/gpt-sw3.md index 23b6dc976da3a1..f69bd958e9c5f1 100644 --- a/docs/source/en/model_doc/gpt-sw3.mdx +++ b/docs/source/en/model_doc/gpt-sw3.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # GPT-Sw3 @@ -26,19 +30,15 @@ in collaboration with RISE and the WASP WARA for Media and Language. GPT-Sw3 has 320B tokens in Swedish, Norwegian, Danish, Icelandic, English, and programming code. The model was pretrained using a causal language modeling (CLM) objective utilizing the NeMo Megatron GPT implementation. -This model was contributed by [AI Sweden](https://huggingface.co/AI-Sweden). +This model was contributed by [AI Sweden Models](https://huggingface.co/AI-Sweden-Models). -The implementation uses the [GPT2Model](https://huggingface.co/docs/transformers/model_doc/gpt2) coupled -with our `GPTSw3Tokenizer`. This means that `AutoTokenizer` and `AutoModelForCausalLM` map to our tokenizer -implementation and the corresponding GPT2 model implementation respectively. 
-*Note that sentencepiece is required to use our tokenizer and can be installed with:* `pip install transformers[sentencepiece]` or `pip install sentencepiece` +## Usage example -Example usage: ```python >>> from transformers import AutoTokenizer, AutoModelForCausalLM ->>> tokenizer = AutoTokenizer.from_pretrained("AI-Sweden/gpt-sw3-356m") ->>> model = AutoModelForCausalLM.from_pretrained("AI-Sweden/gpt-sw3-356m") +>>> tokenizer = AutoTokenizer.from_pretrained("AI-Sweden-Models/gpt-sw3-356m") +>>> model = AutoModelForCausalLM.from_pretrained("AI-Sweden-Models/gpt-sw3-356m") >>> input_ids = tokenizer("Träd är fina för att", return_tensors="pt")["input_ids"] @@ -48,6 +48,21 @@ Example usage: Träd är fina för att de är färgstarka. Men ibland är det fint ``` +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Causal language modeling task guide](../tasks/language_modeling) + + + +The implementation uses the `GPT2Model` coupled with our `GPTSw3Tokenizer`. Refer to [GPT2Model documentation](gpt2) +for API reference and examples. + +Note that sentencepiece is required to use our tokenizer and can be installed with `pip install transformers[sentencepiece]` or `pip install sentencepiece` + + + ## GPTSw3Tokenizer [[autodoc]] GPTSw3Tokenizer diff --git a/docs/source/en/model_doc/gpt2.mdx b/docs/source/en/model_doc/gpt2.md similarity index 92% rename from docs/source/en/model_doc/gpt2.mdx rename to docs/source/en/model_doc/gpt2.md index 7e557d0e9d507e..4708edde0b65d4 100644 --- a/docs/source/en/model_doc/gpt2.mdx +++ b/docs/source/en/model_doc/gpt2.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # OpenAI GPT2 @@ -24,7 +28,7 @@ specific language governing permissions and limitations under the License. ## Overview OpenAI GPT-2 model was proposed in [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) by Alec -Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. It's a causal (unidirectional) +Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever from [OpenAI](https://huggingface.co/openai). It's a causal (unidirectional) transformer pretrained using language modeling on a very large corpus of ~40 GB of text data. The abstract from the paper is the following: @@ -35,7 +39,13 @@ text. The diversity of the dataset causes this simple goal to contain naturally across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.* -Tips: +[Write With Transformer](https://transformer.huggingface.co/doc/gpt2-large) is a webapp created and hosted by +Hugging Face showcasing the generative capabilities of several models. GPT-2 is one of them and is available in five +different sizes: small, medium, large, xl and a distilled version of the small checkpoint: *distilgpt-2*. 
+ +This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://openai.com/blog/better-language-models/). + +## Usage tips - GPT-2 is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than the left. @@ -50,12 +60,6 @@ Tips: - Enabling the *scale_attn_by_inverse_layer_idx* and *reorder_and_upcast_attn* flags will apply the training stability improvements from [Mistral](https://github.com/stanford-crfm/mistral/) (for PyTorch only). -[Write With Transformer](https://transformer.huggingface.co/doc/gpt2-large) is a webapp created and hosted by -Hugging Face showcasing the generative capabilities of several models. GPT-2 is one of them and is available in five -different sizes: small, medium, large, xl and a distilled version of the small checkpoint: *distilgpt-2*. - -This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://openai.com/blog/better-language-models/). - ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with GPT2. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. @@ -73,7 +77,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`GPT2LMHeadModel`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling), [text generation example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-generation), and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb). - [`TFGPT2LMHeadModel`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/language-modeling#run_clmpy) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb). - [`FlaxGPT2LMHeadModel`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling#causal-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/causal_language_modeling_flax.ipynb). 
- +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Causal language modeling task guide](../tasks/language_modeling) ## GPT2Config @@ -94,6 +100,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] models.gpt2.modeling_tf_gpt2.TFGPT2DoubleHeadsModelOutput + + + ## GPT2Model [[autodoc]] GPT2Model @@ -109,6 +118,11 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] GPT2DoubleHeadsModel - forward +## GPT2ForQuestionAnswering + +[[autodoc]] GPT2ForQuestionAnswering + - forward + ## GPT2ForSequenceClassification [[autodoc]] GPT2ForSequenceClassification @@ -119,6 +133,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] GPT2ForTokenClassification - forward + + + ## TFGPT2Model [[autodoc]] TFGPT2Model @@ -147,6 +164,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] TFGPT2Tokenizer + + + ## FlaxGPT2Model [[autodoc]] FlaxGPT2Model @@ -156,3 +176,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] FlaxGPT2LMHeadModel - __call__ + + + diff --git a/docs/source/en/model_doc/gpt_bigcode.md b/docs/source/en/model_doc/gpt_bigcode.md new file mode 100644 index 00000000000000..1635a9f50dd08e --- /dev/null +++ b/docs/source/en/model_doc/gpt_bigcode.md @@ -0,0 +1,106 @@ + + +# GPTBigCode + +## Overview + +The GPTBigCode model was proposed in [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by BigCode. The listed authors are: Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra. + +The abstract from the paper is the following: + +*The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code. This tech report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII) redaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigating better preprocessing methods for the training data. We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark. We find that more aggressive filtering of near-duplicates can further boost performance and, surprisingly, that selecting files from repositories with 5+ GitHub stars deteriorates performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E, despite being a substantially smaller model. 
All models are released under an OpenRAIL license at [this https URL.](https://huggingface.co/bigcode)* + +The model is an optimized [GPT2 model](https://huggingface.co/docs/transformers/model_doc/gpt2) with support for Multi-Query Attention. + +## Implementation details + +The main differences compared to GPT2. +- Added support for Multi-Query Attention. +- Use `gelu_pytorch_tanh` instead of classic `gelu`. +- Avoid unnecessary synchronizations (this has since been added to GPT2 in #20061, but wasn't in the reference codebase). +- Use Linear layers instead of Conv1D (good speedup but makes the checkpoints incompatible). +- Merge `_attn` and `_upcast_and_reordered_attn`. Always merge the matmul with scaling. Rename `reorder_and_upcast_attn`->`attention_softmax_in_fp32` +- Cache the attention mask value to avoid recreating it every time. +- Use jit to fuse the attention fp32 casting, masking, softmax, and scaling. +- Combine the attention and causal masks into a single one, pre-computed for the whole model instead of every layer. +- Merge the key and value caches into one (this changes the format of layer_past/ present, does it risk creating problems?) +- Use the memory layout (self.num_heads, 3, self.head_dim) instead of `(3, self.num_heads, self.head_dim)` for the QKV tensor with MHA. (prevents an overhead with the merged key and values, but makes the checkpoints incompatible with the original openai-community/gpt2 model). + +You can read more about the optimizations in the [original pull request](https://github.com/huggingface/transformers/pull/22575) + +## Combining Starcoder and Flash Attention 2 + +First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature. + +```bash +pip install -U flash-attn --no-build-isolation +``` + +Make also sure that you have a hardware that is compatible with Flash-Attention 2. Read more about it in the official documentation of flash-attn repository. Make also sure to load your model in half-precision (e.g. `torch.float16``) + +To load and run a model using Flash Attention 2, refer to the snippet below: + +```python +>>> import torch +>>> from transformers import AutoModelForCausalLM, AutoTokenizer +>>> device = "cuda" # the device to load the model onto + +>>> model = AutoModelForCausalLM.from_pretrained("bigcode/gpt_bigcode-santacoder", torch_dtype=torch.float16, attn_implementation="flash_attention_2") +>>> tokenizer = AutoTokenizer.from_pretrained("bigcode/gpt_bigcode-santacoder") + +>>> prompt = "def hello_world():" + +>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(device) +>>> model.to(device) + +>>> generated_ids = model.generate(**model_inputs, max_new_tokens=30, do_sample=False) +>>> tokenizer.batch_decode(generated_ids)[0] +'def hello_world():\n print("hello world")\n\nif __name__ == "__main__":\n print("hello world")\n<|endoftext|>' +``` + +### Expected speedups + +Below is a expected speedup diagram that compares pure inference time between the native implementation in transformers using `bigcode/starcoder` checkpoint and the Flash Attention 2 version of the model using two different sequence lengths. + +
+ +
+ + +## GPTBigCodeConfig + +[[autodoc]] GPTBigCodeConfig + +## GPTBigCodeModel + +[[autodoc]] GPTBigCodeModel + - forward + +## GPTBigCodeForCausalLM + +[[autodoc]] GPTBigCodeForCausalLM + - forward + +## GPTBigCodeForSequenceClassification + +[[autodoc]] GPTBigCodeForSequenceClassification + - forward + +## GPTBigCodeForTokenClassification + +[[autodoc]] GPTBigCodeForTokenClassification + - forward diff --git a/docs/source/en/model_doc/gpt_neo.md b/docs/source/en/model_doc/gpt_neo.md new file mode 100644 index 00000000000000..3c7858c998207e --- /dev/null +++ b/docs/source/en/model_doc/gpt_neo.md @@ -0,0 +1,147 @@ + + +# GPT Neo + +## Overview + +The GPTNeo model was released in the [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) repository by Sid +Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. It is a GPT2 like causal language model trained on the +[Pile](https://pile.eleuther.ai/) dataset. + +The architecture is similar to GPT2 except that GPT Neo uses local attention in every other layer with a window size of +256 tokens. + +This model was contributed by [valhalla](https://huggingface.co/valhalla). + +## Usage example + +The `generate()` method can be used to generate text using GPT Neo model. + +```python +>>> from transformers import GPTNeoForCausalLM, GPT2Tokenizer + +>>> model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B") +>>> tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B") + +>>> prompt = ( +... "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " +... "previously unexplored valley, in the Andes Mountains. Even more surprising to the " +... "researchers was the fact that the unicorns spoke perfect English." +... ) + +>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids + +>>> gen_tokens = model.generate( +... input_ids, +... do_sample=True, +... temperature=0.9, +... max_length=100, +... ) +>>> gen_text = tokenizer.batch_decode(gen_tokens)[0] +``` + +## Combining GPT-Neo and Flash Attention 2 + +First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature, and make sure your hardware is compatible with Flash-Attention 2. More details are available [here](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2) concerning the installation. + +Make sure as well to load your model in half-precision (e.g. `torch.float16`). + +To load and run a model using Flash Attention 2, refer to the snippet below: + +```python +>>> import torch +>>> from transformers import AutoModelForCausalLM, AutoTokenizer +>>> device = "cuda" # the device to load the model onto + +>>> model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B", torch_dtype=torch.float16, attn_implementation="flash_attention_2") +>>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B") + +>>> prompt = "def hello_world():" + +>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(device) +>>> model.to(device) + +>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True) +>>> tokenizer.batch_decode(generated_ids)[0] +"def hello_world():\n >>> run_script("hello.py")\n >>> exit(0)\n<|endoftext|>" +``` + +### Expected speedups + +Below is an expected speedup diagram that compares pure inference time between the native implementation in transformers using `EleutherAI/gpt-neo-2.7B` checkpoint and the Flash Attention 2 version of the model. 
+Note that for GPT-Neo it is not possible to train or run inference on very long contexts, as the maximum number of [position embeddings](https://huggingface.co/EleutherAI/gpt-neo-2.7B/blob/main/config.json#L58) is limited to 2048. This limitation applies to all GPT-Neo models and is not specific to Flash Attention 2.
+
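+
+A quick way to check that limit programmatically is to read it off the configuration (a small sketch using the same checkpoint):
+
+```python
+>>> from transformers import AutoConfig
+
+>>> config = AutoConfig.from_pretrained("EleutherAI/gpt-neo-2.7B")
+>>> config.max_position_embeddings  # maximum context length supported by the checkpoint
+2048
+```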
+ +
+ + +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Causal language modeling task guide](../tasks/language_modeling) + +## GPTNeoConfig + +[[autodoc]] GPTNeoConfig + + + + + +## GPTNeoModel + +[[autodoc]] GPTNeoModel + - forward + +## GPTNeoForCausalLM + +[[autodoc]] GPTNeoForCausalLM + - forward + +## GPTNeoForQuestionAnswering + +[[autodoc]] GPTNeoForQuestionAnswering + - forward + +## GPTNeoForSequenceClassification + +[[autodoc]] GPTNeoForSequenceClassification + - forward + +## GPTNeoForTokenClassification + +[[autodoc]] GPTNeoForTokenClassification + - forward + + + + +## FlaxGPTNeoModel + +[[autodoc]] FlaxGPTNeoModel + - __call__ + +## FlaxGPTNeoForCausalLM + +[[autodoc]] FlaxGPTNeoForCausalLM + - __call__ + + + + + diff --git a/docs/source/en/model_doc/gpt_neo.mdx b/docs/source/en/model_doc/gpt_neo.mdx deleted file mode 100644 index f68b92b213270b..00000000000000 --- a/docs/source/en/model_doc/gpt_neo.mdx +++ /dev/null @@ -1,80 +0,0 @@ - - -# GPT Neo - -## Overview - -The GPTNeo model was released in the [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) repository by Sid -Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. It is a GPT2 like causal language model trained on the -[Pile](https://pile.eleuther.ai/) dataset. - -The architecture is similar to GPT2 except that GPT Neo uses local attention in every other layer with a window size of -256 tokens. - -This model was contributed by [valhalla](https://huggingface.co/valhalla). - -### Generation - -The `generate()` method can be used to generate text using GPT Neo model. - -```python ->>> from transformers import GPTNeoForCausalLM, GPT2Tokenizer - ->>> model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B") ->>> tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B") - ->>> prompt = ( -... "In a shocking finding, scientists discovered a herd of unicorns living in a remote, " -... "previously unexplored valley, in the Andes Mountains. Even more surprising to the " -... "researchers was the fact that the unicorns spoke perfect English." -... ) - ->>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids - ->>> gen_tokens = model.generate( -... input_ids, -... do_sample=True, -... temperature=0.9, -... max_length=100, -... ) ->>> gen_text = tokenizer.batch_decode(gen_tokens)[0] -``` - -## GPTNeoConfig - -[[autodoc]] GPTNeoConfig - -## GPTNeoModel - -[[autodoc]] GPTNeoModel - - forward - -## GPTNeoForCausalLM - -[[autodoc]] GPTNeoForCausalLM - - forward - -## GPTNeoForSequenceClassification - -[[autodoc]] GPTNeoForSequenceClassification - - forward - -## FlaxGPTNeoModel - -[[autodoc]] FlaxGPTNeoModel - - __call__ - -## FlaxGPTNeoForCausalLM - -[[autodoc]] FlaxGPTNeoForCausalLM - - __call__ diff --git a/docs/source/en/model_doc/gpt_neox.mdx b/docs/source/en/model_doc/gpt_neox.md similarity index 54% rename from docs/source/en/model_doc/gpt_neox.mdx rename to docs/source/en/model_doc/gpt_neox.md index 1be8be7a6a5e6c..fd105a3e82e1ee 100644 --- a/docs/source/en/model_doc/gpt_neox.mdx +++ b/docs/source/en/model_doc/gpt_neox.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
 --> # GPT-NeoX @@ -34,7 +38,7 @@ model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b").half().cud GPT-NeoX-20B also has a different tokenizer from the one used in GPT-J-6B and GPT-Neo. The new tokenizer allocates additional tokens to whitespace characters, making the model more suitable for certain tasks like code generation.
-### Generation
+## Usage example
 The `generate()` method can be used to generate text using GPT Neo model. @@ -57,6 +61,44 @@ The `generate()` method can be used to generate text using GPT Neo model. >>> gen_text = tokenizer.batch_decode(gen_tokens)[0] ```
+## Using Flash Attention 2
+
+Flash Attention 2 is a faster, optimized attention implementation that can be used with this model.
+
+### Installation
+
+First, check whether your hardware is compatible with Flash Attention 2. The latest list of compatible hardware can be found in the [official documentation](https://github.com/Dao-AILab/flash-attention#installation-and-features). If your hardware is not compatible with Flash Attention 2, you can still benefit from attention kernel optimisations through Better Transformer support covered [here](https://huggingface.co/docs/transformers/main/en/model_doc/bark#using-better-transformer).
+
+Next, [install](https://github.com/Dao-AILab/flash-attention#installation-and-features) the latest version of Flash Attention 2:
+
+```bash
+pip install -U flash-attn --no-build-isolation
+```
+
+### Usage
+
+To load a model using Flash Attention 2, we can pass the argument `attn_implementation="flash_attention_2"` to [`.from_pretrained`](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained). We'll also load the model in half-precision (e.g. `torch.float16`), since it results in almost no degradation in generation quality but significantly lower memory usage and faster inference:
+
+```python
+>>> import torch
+>>> from transformers import GPTNeoXForCausalLM, GPTNeoXTokenizerFast
+
+>>> device = "cuda"  # the device to load the model onto
+>>> model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b", torch_dtype=torch.float16, attn_implementation="flash_attention_2").to(device)
+>>> tokenizer = GPTNeoXTokenizerFast.from_pretrained("EleutherAI/gpt-neox-20b")
+...
+```
+
+
+### Expected speedups
+
+Below is an expected speedup diagram that compares pure inference time between the native implementation in transformers using `stockmark/gpt-neox-japanese-1.4b` checkpoint and the Flash Attention 2 version of the model using a sequence length of 2048.
+
+
+ +
+ +## Resources + +- [Causal language modeling task guide](../tasks/language_modeling) + ## GPTNeoXConfig [[autodoc]] GPTNeoXConfig @@ -74,3 +116,18 @@ The `generate()` method can be used to generate text using GPT Neo model. [[autodoc]] GPTNeoXForCausalLM - forward + +## GPTNeoXForQuestionAnswering + +[[autodoc]] GPTNeoXForQuestionAnswering + - forward + +## GPTNeoXForSequenceClassification + +[[autodoc]] GPTNeoXForSequenceClassification + - forward + +## GPTNeoXForTokenClassification + +[[autodoc]] GPTNeoXForTokenClassification + - forward diff --git a/docs/source/en/model_doc/gpt_neox_japanese.mdx b/docs/source/en/model_doc/gpt_neox_japanese.md similarity index 91% rename from docs/source/en/model_doc/gpt_neox_japanese.mdx rename to docs/source/en/model_doc/gpt_neox_japanese.md index da94b7497603c8..c69e643cae5bd7 100644 --- a/docs/source/en/model_doc/gpt_neox_japanese.mdx +++ b/docs/source/en/model_doc/gpt_neox_japanese.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # GPT-NeoX-Japanese @@ -21,7 +25,7 @@ Following the recommendations from Google's research on [PaLM](https://ai.google Development of the model was led by [Shinya Otani](https://github.com/SO0529), [Takayoshi Makabe](https://github.com/spider-man-tm), [Anuj Arora](https://github.com/Anuj040), and [Kyo Hattori](https://github.com/go5paopao) from [ABEJA, Inc.](https://www.abejainc.com/). For more information on this model-building activity, please refer [here (ja)](https://tech-blog.abeja.asia/entry/abeja-gpt-project-202207). -### Generation +### Usage example The `generate()` method can be used to generate text using GPT NeoX Japanese model. @@ -47,6 +51,10 @@ The `generate()` method can be used to generate text using GPT NeoX Japanese mod 人とAIが協調するためには、AIと人が共存し、AIを正しく理解する必要があります。 ``` +## Resources + +- [Causal language modeling task guide](../tasks/language_modeling) + ## GPTNeoXJapaneseConfig [[autodoc]] GPTNeoXJapaneseConfig diff --git a/docs/source/en/model_doc/gptj.mdx b/docs/source/en/model_doc/gptj.md similarity index 86% rename from docs/source/en/model_doc/gptj.mdx rename to docs/source/en/model_doc/gptj.md index 51cee1d72a617a..b515cf36dd4060 100644 --- a/docs/source/en/model_doc/gptj.mdx +++ b/docs/source/en/model_doc/gptj.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # GPT-J @@ -19,23 +23,24 @@ causal language model trained on [the Pile](https://pile.eleuther.ai/) dataset. This model was contributed by [Stella Biderman](https://huggingface.co/stellaathena). 
-Tips: +## Usage tips -- To load [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B) in float32 one would need at least 2x model size CPU - RAM: 1x for initial weights and another 1x to load the checkpoint. So for GPT-J it would take at least 48GB of CPU - RAM to just load the model. To reduce the CPU RAM usage there are a few options. The `torch_dtype` argument can be - used to initialize the model in half-precision. And the `low_cpu_mem_usage` argument can be used to keep the RAM - usage to 1x. There is also a [fp16 branch](https://huggingface.co/EleutherAI/gpt-j-6B/tree/float16) which stores - the fp16 weights, which could be used to further minimize the RAM usage. Combining all this it should take roughly - 12.1GB of CPU RAM to load the model. +- To load [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B) in float32 one would need at least 2x model size + RAM: 1x for initial weights and another 1x to load the checkpoint. So for GPT-J it would take at least 48GB + RAM to just load the model. To reduce the RAM usage there are a few options. The `torch_dtype` argument can be + used to initialize the model in half-precision on a CUDA device only. There is also a fp16 branch which stores the fp16 weights, + which could be used to further minimize the RAM usage: ```python >>> from transformers import GPTJForCausalLM >>> import torch +>>> device = "cuda" >>> model = GPTJForCausalLM.from_pretrained( -... "EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True -... ) +... "EleutherAI/gpt-j-6B", +... revision="float16", +... torch_dtype=torch.float16, +... ).to(device) ``` - The model should fit on 16GB GPU for inference. For training/fine-tuning it would take much more GPU RAM. Adam @@ -51,7 +56,7 @@ Tips: size, the tokenizer for [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B) contains 143 extra tokens `<|extratoken_1|>... <|extratoken_143|>`, so the `vocab_size` of tokenizer also becomes 50400. -### Generation +## Usage examples The [`~generation.GenerationMixin.generate`] method can be used to generate text using GPT-J model. @@ -85,7 +90,8 @@ model. >>> from transformers import GPTJForCausalLM, AutoTokenizer >>> import torch ->>> model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16) +>>> device = "cuda" +>>> model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16).to(device) >>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B") >>> prompt = ( @@ -94,7 +100,7 @@ model. ... "researchers was the fact that the unicorns spoke perfect English." ... ) ->>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids +>>> input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device) >>> gen_tokens = model.generate( ... input_ids, @@ -122,11 +128,19 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`TFGPTJForCausalLM`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/language-modeling#run_clmpy) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb). - [`FlaxGPTJForCausalLM`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling#causal-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/causal_language_modeling_flax.ipynb). 
+**Documentation resources** +- [Text classification task guide](../tasks/sequence_classification) +- [Question answering task guide](../tasks/question_answering) +- [Causal language modeling task guide](../tasks/language_modeling) + ## GPTJConfig [[autodoc]] GPTJConfig - all + + + ## GPTJModel [[autodoc]] GPTJModel @@ -147,6 +161,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] GPTJForQuestionAnswering - forward + + + ## TFGPTJModel [[autodoc]] TFGPTJModel @@ -167,6 +184,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] TFGPTJForQuestionAnswering - call + + + ## FlaxGPTJModel [[autodoc]] FlaxGPTJModel @@ -176,3 +196,5 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] FlaxGPTJForCausalLM - __call__ + + diff --git a/docs/source/en/model_doc/gptsan-japanese.md b/docs/source/en/model_doc/gptsan-japanese.md new file mode 100644 index 00000000000000..1e6b1b6e1cf6d7 --- /dev/null +++ b/docs/source/en/model_doc/gptsan-japanese.md @@ -0,0 +1,121 @@ + + +# GPTSAN-japanese + +## Overview + +The GPTSAN-japanese model was released in the repository by Toshiyuki Sakamoto (tanreinama). + +GPTSAN is a Japanese language model using Switch Transformer. It has the same structure as the model introduced as Prefix LM +in the T5 paper, and support both Text Generation and Masked Language Modeling tasks. These basic tasks similarly can +fine-tune for translation or summarization. + +### Usage example + +The `generate()` method can be used to generate text using GPTSAN-Japanese model. + +```python +>>> from transformers import AutoModel, AutoTokenizer +>>> import torch + +>>> tokenizer = AutoTokenizer.from_pretrained("Tanrei/GPTSAN-japanese") +>>> model = AutoModel.from_pretrained("Tanrei/GPTSAN-japanese").cuda() +>>> x_tok = tokenizer("は、", prefix_text="織田信長", return_tensors="pt") +>>> torch.manual_seed(0) +>>> gen_tok = model.generate(x_tok.input_ids.cuda(), token_type_ids=x_tok.token_type_ids.cuda(), max_new_tokens=20) +>>> tokenizer.decode(gen_tok[0]) +'織田信長は、2004年に『戦国BASARA』のために、豊臣秀吉' +``` + +## GPTSAN Features + +GPTSAN has some unique features. It has a model structure of Prefix-LM. It works as a shifted Masked Language Model for Prefix Input tokens. Un-prefixed inputs behave like normal generative models. +The Spout vector is a GPTSAN specific input. Spout is pre-trained with random inputs, but you can specify a class of text or an arbitrary vector during fine-tuning. This allows you to indicate the tendency of the generated text. +GPTSAN has a sparse Feed Forward based on Switch-Transformer. You can also add other layers and train them partially. See the original GPTSAN repository for details. + +### Prefix-LM Model + +GPTSAN has the structure of the model named Prefix-LM in the `T5` paper. (The original GPTSAN repository calls it `hybrid`) +In GPTSAN, the `Prefix` part of Prefix-LM, that is, the input position that can be referenced by both tokens, can be specified with any length. +Arbitrary lengths can also be specified differently for each batch. +This length applies to the text entered in `prefix_text` for the tokenizer. +The tokenizer returns the mask of the `Prefix` part of Prefix-LM as `token_type_ids`. +The model treats the part where `token_type_ids` is 1 as a `Prefix` part, that is, the input can refer to both tokens before and after. + +## Usage tips + +Specifying the Prefix part is done with a mask passed to self-attention. 
+When token_type_ids=None or all zero, it is equivalent to regular causal mask + +for example: + +>>> x_token = tokenizer("アイウエ") +input_ids: | SOT | SEG | ア | イ | ウ | エ | +token_type_ids: | 1 | 0 | 0 | 0 | 0 | 0 | +prefix_lm_mask: +SOT | 1 0 0 0 0 0 | +SEG | 1 1 0 0 0 0 | +ア | 1 1 1 0 0 0 | +イ | 1 1 1 1 0 0 | +ウ | 1 1 1 1 1 0 | +エ | 1 1 1 1 1 1 | + +>>> x_token = tokenizer("", prefix_text="アイウエ") +input_ids: | SOT | ア | イ | ウ | エ | SEG | +token_type_ids: | 1 | 1 | 1 | 1 | 1 | 0 | +prefix_lm_mask: +SOT | 1 1 1 1 1 0 | +ア | 1 1 1 1 1 0 | +イ | 1 1 1 1 1 0 | +ウ | 1 1 1 1 1 0 | +エ | 1 1 1 1 1 0 | +SEG | 1 1 1 1 1 1 | + +>>> x_token = tokenizer("ウエ", prefix_text="アイ") +input_ids: | SOT | ア | イ | SEG | ウ | エ | +token_type_ids: | 1 | 1 | 1 | 0 | 0 | 0 | +prefix_lm_mask: +SOT | 1 1 1 0 0 0 | +ア | 1 1 1 0 0 0 | +イ | 1 1 1 0 0 0 | +SEG | 1 1 1 1 0 0 | +ウ | 1 1 1 1 1 0 | +エ | 1 1 1 1 1 1 | + +### Spout Vector + +A Spout Vector is a special vector for controlling text generation. +This vector is treated as the first embedding in self-attention to bring extraneous attention to the generated tokens. +In the pre-trained model published from `Tanrei/GPTSAN-japanese`, the Spout Vector is a 128-dimensional vector that passes through 8 fully connected layers in the model and is projected into the space acting as external attention. +The Spout Vector projected by the fully connected layer is split to be passed to all self-attentions. + +## GPTSanJapaneseConfig + +[[autodoc]] GPTSanJapaneseConfig + +## GPTSanJapaneseTokenizer + +[[autodoc]] GPTSanJapaneseTokenizer + +## GPTSanJapaneseModel + +[[autodoc]] GPTSanJapaneseModel + +## GPTSanJapaneseForConditionalGeneration + +[[autodoc]] GPTSanJapaneseForConditionalGeneration + - forward diff --git a/docs/source/en/model_doc/graphormer.mdx b/docs/source/en/model_doc/graphormer.md similarity index 92% rename from docs/source/en/model_doc/graphormer.mdx rename to docs/source/en/model_doc/graphormer.md index 33092fe8898c79..08e3f5fb3e9b5a 100644 --- a/docs/source/en/model_doc/graphormer.mdx +++ b/docs/source/en/model_doc/graphormer.md @@ -6,6 +6,10 @@ the License. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Graphormer @@ -13,32 +17,30 @@ specific language governing permissions and limitations under the License. ## Overview The Graphormer model was proposed in [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by -Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen and Tie-Yan Liu. It is a Graph Transformer model, modified to allow computations on graphs instead of text sequences by generating embeddings and features of interest during preprocessign and collation, then using a modified attention. +Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen and Tie-Yan Liu. It is a Graph Transformer model, modified to allow computations on graphs instead of text sequences by generating embeddings and features of interest during preprocessing and collation, then using a modified attention. 
The abstract from the paper is the following: *The Transformer architecture has become a dominant choice in many domains, such as natural language processing and computer vision. Yet, it has not achieved competitive performance on popular leaderboards of graph-level prediction compared to mainstream GNN variants. Therefore, it remains a mystery how Transformers could perform well for graph representation learning. In this paper, we solve this mystery by presenting Graphormer, which is built upon the standard Transformer architecture, and could attain excellent results on a broad range of graph representation learning tasks, especially on the recent OGB Large-Scale Challenge. Our key insight to utilizing Transformer in the graph is the necessity of effectively encoding the structural information of a graph into the model. To this end, we propose several simple yet effective structural encoding methods to help Graphormer better model graph-structured data. Besides, we mathematically characterize the expressive power of Graphormer and exhibit that with our ways of encoding the structural information of graphs, many popular GNN variants could be covered as the special cases of Graphormer.* -Tips: +This model was contributed by [clefourrier](https://huggingface.co/clefourrier). The original code can be found [here](https://github.com/microsoft/Graphormer). + +## Usage tips This model will not work well on large graphs (more than 100 nodes/edges), as it will make the memory explode. You can reduce the batch size, increase your RAM, or decrease the `UNREACHABLE_NODE_DISTANCE` parameter in algos_graphormer.pyx, but it will be hard to go above 700 nodes/edges. This model does not use a tokenizer, but instead a special collator during training. -This model was contributed by [clefourrier](https://huggingface.co/clefourrier). The original code can be found [here](https://github.com/microsoft/Graphormer). - ## GraphormerConfig [[autodoc]] GraphormerConfig - ## GraphormerModel [[autodoc]] GraphormerModel - forward - ## GraphormerForGraphClassification [[autodoc]] GraphormerForGraphClassification diff --git a/docs/source/en/model_doc/groupvit.mdx b/docs/source/en/model_doc/groupvit.md similarity index 93% rename from docs/source/en/model_doc/groupvit.mdx rename to docs/source/en/model_doc/groupvit.md index 200ec7ccb8a195..8728cf0da21ba2 100644 --- a/docs/source/en/model_doc/groupvit.mdx +++ b/docs/source/en/model_doc/groupvit.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # GroupViT @@ -21,13 +25,13 @@ The abstract from the paper is the following: *Grouping and recognition are important components of visual scene understanding, e.g., for object detection and semantic segmentation. With end-to-end deep learning systems, grouping of image regions usually happens implicitly via top-down supervision from pixel-level recognition labels. Instead, in this paper, we propose to bring back the grouping mechanism into deep networks, which allows semantic segments to emerge automatically with only text supervision. 
We propose a hierarchical Grouping Vision Transformer (GroupViT), which goes beyond the regular grid structure representation and learns to group image regions into progressively larger arbitrary-shaped segments. We train GroupViT jointly with a text encoder on a large-scale image-text dataset via contrastive losses. With only text supervision and without any pixel-level annotations, GroupViT learns to group together semantic regions and successfully transfers to the task of semantic segmentation in a zero-shot manner, i.e., without any further fine-tuning. It achieves a zero-shot accuracy of 52.3% mIoU on the PASCAL VOC 2012 and 22.4% mIoU on PASCAL Context datasets, and performs competitively to state-of-the-art transfer-learning methods requiring greater levels of supervision.* -Tips: - -- You may specify `output_segmentation=True` in the forward of `GroupViTModel` to get the segmentation logits of input texts. - This model was contributed by [xvjiarui](https://huggingface.co/xvjiarui). The TensorFlow version was contributed by [ariG23498](https://huggingface.co/ariG23498) with the help of [Yih-Dar SHIEH](https://huggingface.co/ydshieh), [Amy Roberts](https://huggingface.co/amyeroberts), and [Joao Gante](https://huggingface.co/joaogante). The original code can be found [here](https://github.com/NVlabs/GroupViT). +## Usage tips + +- You may specify `output_segmentation=True` in the forward of `GroupViTModel` to get the segmentation logits of input texts. + ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with GroupViT. @@ -48,6 +52,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] GroupViTVisionConfig + + + ## GroupViTModel [[autodoc]] GroupViTModel @@ -65,6 +72,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] GroupViTVisionModel - forward + + + ## TFGroupViTModel [[autodoc]] TFGroupViTModel @@ -80,4 +90,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h ## TFGroupViTVisionModel [[autodoc]] TFGroupViTVisionModel - - call \ No newline at end of file + - call + + + diff --git a/docs/source/en/model_doc/herbert.mdx b/docs/source/en/model_doc/herbert.md similarity index 90% rename from docs/source/en/model_doc/herbert.mdx rename to docs/source/en/model_doc/herbert.md index 90e08ebe9ac744..0049d6bfcf3a2b 100644 --- a/docs/source/en/model_doc/herbert.mdx +++ b/docs/source/en/model_doc/herbert.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # HerBERT @@ -33,7 +37,11 @@ which has the best average performance and obtains the best results for three ou extensive evaluation, including several standard baselines and recently proposed, multilingual Transformer-based models.* -Examples of use: +This model was contributed by [rmroczkowski](https://huggingface.co/rmroczkowski). The original code can be found +[here](https://github.com/allegro/HerBERT). 
+ + +## Usage example ```python >>> from transformers import HerbertTokenizer, RobertaModel @@ -52,9 +60,12 @@ Examples of use: >>> model = AutoModel.from_pretrained("allegro/herbert-klej-cased-v1") ``` -This model was contributed by [rmroczkowski](https://huggingface.co/rmroczkowski). The original code can be found -[here](https://github.com/allegro/HerBERT). + + +Herbert implementation is the same as `BERT` except for the tokenization method. Refer to [BERT documentation](bert) +for API reference and examples. + ## HerbertTokenizer diff --git a/docs/source/en/model_doc/hubert.mdx b/docs/source/en/model_doc/hubert.md similarity index 88% rename from docs/source/en/model_doc/hubert.mdx rename to docs/source/en/model_doc/hubert.md index faab44b89d587c..43ce590d3715d2 100644 --- a/docs/source/en/model_doc/hubert.mdx +++ b/docs/source/en/model_doc/hubert.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Hubert @@ -32,19 +36,26 @@ state-of-the-art wav2vec 2.0 performance on the Librispeech (960h) and Libri-lig 10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.* -Tips: +This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). + +# Usage tips - Hubert is a speech model that accepts a float array corresponding to the raw waveform of the speech signal. - Hubert model was fine-tuned using connectionist temporal classification (CTC) so the model output has to be decoded using [`Wav2Vec2CTCTokenizer`]. -This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). +## Resources +- [Audio classification task guide](../tasks/audio_classification) +- [Automatic speech recognition task guide](../tasks/asr) ## HubertConfig [[autodoc]] HubertConfig + + + ## HubertModel [[autodoc]] HubertModel @@ -60,6 +71,9 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv [[autodoc]] HubertForSequenceClassification - forward + + + ## TFHubertModel [[autodoc]] TFHubertModel @@ -69,3 +83,6 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv [[autodoc]] TFHubertForCTC - call + + + diff --git a/docs/source/en/model_doc/ibert.mdx b/docs/source/en/model_doc/ibert.md similarity index 85% rename from docs/source/en/model_doc/ibert.mdx rename to docs/source/en/model_doc/ibert.md index 086e615fe5b424..9ea623951aec8e 100644 --- a/docs/source/en/model_doc/ibert.mdx +++ b/docs/source/en/model_doc/ibert.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
+ +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # I-BERT @@ -36,6 +40,13 @@ been open-sourced.* This model was contributed by [kssteven](https://huggingface.co/kssteven). The original code can be found [here](https://github.com/kssteven418/I-BERT). +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/masked_language_modeling) ## IBertConfig diff --git a/docs/source/en/model_doc/idefics.md b/docs/source/en/model_doc/idefics.md new file mode 100644 index 00000000000000..9989f89d682e8f --- /dev/null +++ b/docs/source/en/model_doc/idefics.md @@ -0,0 +1,63 @@ + + +# IDEFICS + +## Overview + +The IDEFICS model was proposed in [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents +](https://huggingface.co/papers/2306.16527 +) by Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh + +The abstract from the paper is the following: + +*Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks that require reasoning over one or multiple images to generate a text. However, the datasets used to train these models have not been released, and the collection process has not been fully specified. We introduce the OBELICS dataset, an open web-scale filtered dataset of interleaved image-text documents comprising 141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens. We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content. To show the viability of OBELISC, we train an 80 billion parameters vision and language model on the dataset and obtain competitive performance on various multimodal benchmarks. We release the code to reproduce the dataset along with the dataset itself.* + +This model was contributed by [HuggingFaceM4](https://huggingface.co/HuggingFaceM4). The original code can be found [here](). (TODO: don't have a public link yet). + + + + +IDEFICS modeling code in Transformers is for finetuning and inferencing the pre-trained IDEFICS models. 
+ +To train a new IDEFICS model from scratch use the m4 codebase (a link will be provided once it's made public) + + + + +## IdeficsConfig + +[[autodoc]] IdeficsConfig + +## IdeficsModel + +[[autodoc]] IdeficsModel + - forward + +## IdeficsForVisionText2Text + +[[autodoc]] IdeficsForVisionText2Text + - forward + +## IdeficsImageProcessor + +[[autodoc]] IdeficsImageProcessor + - preprocess + +## IdeficsProcessor + +[[autodoc]] IdeficsProcessor + - __call__ diff --git a/docs/source/en/model_doc/imagegpt.mdx b/docs/source/en/model_doc/imagegpt.md similarity index 96% rename from docs/source/en/model_doc/imagegpt.mdx rename to docs/source/en/model_doc/imagegpt.md index baee48b96e89c8..53a7ba3b34b709 100644 --- a/docs/source/en/model_doc/imagegpt.mdx +++ b/docs/source/en/model_doc/imagegpt.md @@ -7,6 +7,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + specific language governing permissions and limitations under the License. --> # ImageGPT @@ -36,7 +40,7 @@ alt="drawing" width="600"/> This model was contributed by [nielsr](https://huggingface.co/nielsr), based on [this issue](https://github.com/openai/image-gpt/issues/7). The original code can be found [here](https://github.com/openai/image-gpt). -Tips: +## Usage tips - ImageGPT is almost exactly the same as [GPT-2](gpt2), with the exception that a different activation function is used (namely "quick gelu"), and the layer normalization layers don't mean center the inputs. ImageGPT @@ -77,6 +81,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - Demo notebooks for ImageGPT can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/ImageGPT). - [`ImageGPTForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). +- See also: [Image classification task guide](../tasks/image_classification) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. 
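As a complement to the fine-tuning script and notebooks linked above, the snippet below is a minimal sketch of how inputs are prepared for ImageGPT. The checkpoint name is an assumption chosen for illustration; the point to notice is that the image processor returns color-cluster indices as `input_ids` rather than `pixel_values`.

```python
from PIL import Image
import requests
from transformers import ImageGPTImageProcessor, ImageGPTModel

# Checkpoint name is an assumption for illustration.
processor = ImageGPTImageProcessor.from_pretrained("openai/imagegpt-small")
model = ImageGPTModel.from_pretrained("openai/imagegpt-small")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# ImageGPT consumes sequences of color-cluster indices ("input_ids"), not raw pixels.
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```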
@@ -87,7 +92,6 @@ If you're interested in submitting a resource to be included here, please feel f ## ImageGPTFeatureExtractor [[autodoc]] ImageGPTFeatureExtractor - - __call__ ## ImageGPTImageProcessor @@ -98,17 +102,14 @@ If you're interested in submitting a resource to be included here, please feel f ## ImageGPTModel [[autodoc]] ImageGPTModel - - forward ## ImageGPTForCausalImageModeling [[autodoc]] ImageGPTForCausalImageModeling - - forward ## ImageGPTForImageClassification [[autodoc]] ImageGPTForImageClassification - - forward diff --git a/docs/source/en/model_doc/informer.md b/docs/source/en/model_doc/informer.md new file mode 100644 index 00000000000000..f866afbfcb8a9d --- /dev/null +++ b/docs/source/en/model_doc/informer.md @@ -0,0 +1,50 @@ + + +# Informer + +## Overview + +The Informer model was proposed in [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. + +This method introduces a Probabilistic Attention mechanism to select the "active" queries rather than the "lazy" queries and provides a sparse Transformer thus mitigating the quadratic compute and memory requirements of vanilla attention. + +The abstract from the paper is the following: + +*Many real-world applications require the prediction of long sequence time-series, such as electricity consumption planning. Long sequence time-series forecasting (LSTF) demands a high prediction capacity of the model, which is the ability to capture precise long-range dependency coupling between output and input efficiently. Recent studies have shown the potential of Transformer to increase the prediction capacity. However, there are several severe issues with Transformer that prevent it from being directly applicable to LSTF, including quadratic time complexity, high memory usage, and inherent limitation of the encoder-decoder architecture. To address these issues, we design an efficient transformer-based model for LSTF, named Informer, with three distinctive characteristics: (i) a ProbSparse self-attention mechanism, which achieves O(L logL) in time complexity and memory usage, and has comparable performance on sequences' dependency alignment. (ii) the self-attention distilling highlights dominating attention by halving cascading layer input, and efficiently handles extreme long input sequences. (iii) the generative style decoder, while conceptually simple, predicts the long time-series sequences at one forward operation rather than a step-by-step way, which drastically improves the inference speed of long-sequence predictions. Extensive experiments on four large-scale datasets demonstrate that Informer significantly outperforms existing methods and provides a new solution to the LSTF problem.* + +This model was contributed by [elisim](https://huggingface.co/elisim) and [kashif](https://huggingface.co/kashif). +The original code can be found [here](https://github.com/zhouhaoyi/Informer2020). + +## Resources + +A list of official Hugging Face and community (indicated by 🌎) resources to help you get started. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. 
+ +- Check out the Informer blog-post in HuggingFace blog: [Multivariate Probabilistic Time Series Forecasting with Informer](https://huggingface.co/blog/informer) + +## InformerConfig + +[[autodoc]] InformerConfig + +## InformerModel + +[[autodoc]] InformerModel + - forward + +## InformerForPrediction + +[[autodoc]] InformerForPrediction + - forward \ No newline at end of file diff --git a/docs/source/en/model_doc/instructblip.md b/docs/source/en/model_doc/instructblip.md new file mode 100644 index 00000000000000..1a693493fff153 --- /dev/null +++ b/docs/source/en/model_doc/instructblip.md @@ -0,0 +1,67 @@ + + +# InstructBLIP + +## Overview + +The InstructBLIP model was proposed in [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500) by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. +InstructBLIP leverages the [BLIP-2](blip2) architecture for visual instruction tuning. + +The abstract from the paper is the following: + +*General-purpose language models that can solve various language-domain tasks have emerged driven by the pre-training and instruction-tuning pipeline. However, building general-purpose vision-language models is challenging due to the increased task discrepancy introduced by the additional visual input. Although vision-language pre-training has been widely studied, vision-language instruction tuning remains relatively less explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pre-trained BLIP-2 models. We gather a wide variety of 26 publicly available datasets, transform them into instruction tuning format and categorize them into two clusters for held-in instruction tuning and held-out zero-shot evaluation. Additionally, we introduce instruction-aware visual feature extraction, a crucial method that enables the model to extract informative features tailored to the given instruction. The resulting InstructBLIP models achieve state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and the larger Flamingo. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA IMG). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models.* + + + + InstructBLIP architecture. Taken from the original paper. + +This model was contributed by [nielsr](https://huggingface.co/nielsr). +The original code can be found [here](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip). + +## Usage tips + +InstructBLIP uses the same architecture as [BLIP-2](blip2) with a tiny but important difference: it also feeds the text prompt (instruction) to the Q-Former. 
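Since the tip above only describes the architecture difference in prose, here is a minimal inference sketch. The checkpoint name, prompt, and generation settings are assumptions chosen for illustration; any InstructBLIP checkpoint on the Hub should work the same way.

```python
from PIL import Image
import requests
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

# Checkpoint name is an assumption for illustration.
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-vicuna-7b")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "Describe what the two animals are doing."

# The processor prepares both modalities; the instruction is also routed to the
# Q-Former internally, which is the difference highlighted above.
inputs = processor(images=image, text=prompt, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0].strip())
```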
+ +## InstructBlipConfig + +[[autodoc]] InstructBlipConfig + - from_vision_qformer_text_configs + +## InstructBlipVisionConfig + +[[autodoc]] InstructBlipVisionConfig + +## InstructBlipQFormerConfig + +[[autodoc]] InstructBlipQFormerConfig + +## InstructBlipProcessor + +[[autodoc]] InstructBlipProcessor + +## InstructBlipVisionModel + +[[autodoc]] InstructBlipVisionModel + - forward + +## InstructBlipQFormerModel + +[[autodoc]] InstructBlipQFormerModel + - forward + +## InstructBlipForConditionalGeneration + +[[autodoc]] InstructBlipForConditionalGeneration + - forward + - generate \ No newline at end of file diff --git a/docs/source/en/model_doc/jukebox.mdx b/docs/source/en/model_doc/jukebox.md similarity index 77% rename from docs/source/en/model_doc/jukebox.mdx rename to docs/source/en/model_doc/jukebox.md index 860fb8fc3f67b0..578a8a91dd02ea 100644 --- a/docs/source/en/model_doc/jukebox.mdx +++ b/docs/source/en/model_doc/jukebox.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Jukebox @@ -15,7 +19,7 @@ specific language governing permissions and limitations under the License. The Jukebox model was proposed in [Jukebox: A generative model for music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, -Ilya Sutskever. It introduces a generative music model which can produce minute long samples that can be conditionned on +Ilya Sutskever. It introduces a generative music model which can produce minute long samples that can be conditioned on an artist, genres and lyrics. The abstract from the paper is the following: @@ -23,16 +27,20 @@ The abstract from the paper is the following: *We introduce Jukebox, a model that generates music with singing in the raw audio domain. We tackle the long context of raw audio using a multiscale VQ-VAE to compress it to discrete codes, and modeling those using autoregressive Transformers. We show that the combined model at scale can generate high-fidelity and diverse songs with coherence up to multiple minutes. We can condition on artist and genre to steer the musical and vocal style, and on unaligned lyrics to make the singing more controllable. We are releasing thousands of non cherry-picked samples, along with model weights and code.* As shown on the following figure, Jukebox is made of 3 `priors` which are decoder only models. They follow the architecture described in [Generating Long Sequences with Sparse Transformers](https://arxiv.org/abs/1904.10509), modified to support longer context length. -First, a autoencoder is used to encode the text lyrics. Next, the first (also called `top_prior`) prior attends to the last hidden states extracted from the lyrics encoder. The priors are linked to the previous priors respectively via an `AudioConditionner` module. The`AudioConditioner` upsamples the outputs of the previous prior to raw tokens at a certain audio frame per second resolution. 
-The metadata such as *artist, genre and timing* are passed to each prior, in the form of a start token and positionnal embedding for the timing data. The hidden states are mapped to the closest codebook vector from the VQVAE in order to convert them to raw audio. +First, a autoencoder is used to encode the text lyrics. Next, the first (also called `top_prior`) prior attends to the last hidden states extracted from the lyrics encoder. The priors are linked to the previous priors respectively via an `AudioConditioner` module. The`AudioConditioner` upsamples the outputs of the previous prior to raw tokens at a certain audio frame per second resolution. +The metadata such as *artist, genre and timing* are passed to each prior, in the form of a start token and positional embedding for the timing data. The hidden states are mapped to the closest codebook vector from the VQVAE in order to convert them to raw audio. ![JukeboxModel](https://gist.githubusercontent.com/ArthurZucker/92c1acaae62ebf1b6a951710bdd8b6af/raw/c9c517bf4eff61393f6c7dec9366ef02bdd059a3/jukebox.svg) -Tips: -- This model only supports inference. This is for a few reasons, mostly because it requires a crazy amount of memory to train. Feel free to open a PR and add what's missing to have a full integration with the hugging face traineer! +This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ). +The original code can be found [here](https://github.com/openai/jukebox). + +## Usage tips + +- This model only supports inference. This is for a few reasons, mostly because it requires a crazy amount of memory to train. Feel free to open a PR and add what's missing to have a full integration with the hugging face trainer! - This model is very slow, and takes 8h to generate a minute long audio using the 5b top prior on a V100 GPU. In order automaticallay handle the device on which the model should execute, use `accelerate`. - Contrary to the paper, the order of the priors goes from `0` to `1` as it felt more intuitive : we sample starting from `0`. -- Primed sampling (conditionning the sampling on raw audio) requires more memory than ancestral sampling and should be used with `fp16` set to `True`. +- Primed sampling (conditioning the sampling on raw audio) requires more memory than ancestral sampling and should be used with `fp16` set to `True`. This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ). The original code can be found [here](https://github.com/openai/jukebox). @@ -63,14 +71,12 @@ The original code can be found [here](https://github.com/openai/jukebox). - upsample - _sample - ## JukeboxPrior [[autodoc]] JukeboxPrior - sample - forward - ## JukeboxVQVAE [[autodoc]] JukeboxVQVAE diff --git a/docs/source/en/model_doc/kosmos-2.md b/docs/source/en/model_doc/kosmos-2.md new file mode 100644 index 00000000000000..f799751cce8496 --- /dev/null +++ b/docs/source/en/model_doc/kosmos-2.md @@ -0,0 +1,98 @@ + + +# KOSMOS-2 + +## Overview + +The KOSMOS-2 model was proposed in [Kosmos-2: Grounding Multimodal Large Language Models to the World](https://arxiv.org/abs/2306.14824) by Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Furu Wei. + +KOSMOS-2 is a Transformer-based causal language model and is trained using the next-word prediction task on a web-scale +dataset of grounded image-text pairs [GRIT](https://huggingface.co/datasets/zzliang/GRIT). 
The spatial coordinates of +the bounding boxes in the dataset are converted to a sequence of location tokens, which are appended to their respective +entity text spans (for example, `a snowman` followed by ``). The data format is +similar to “hyperlinks” that connect the object regions in an image to their text span in the corresponding caption. + +The abstract from the paper is the following: + +*We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Specifically, we represent refer expressions as links in Markdown, i.e., ``[text span](bounding boxes)'', where object descriptions are sequences of location tokens. Together with multimodal corpora, we construct large-scale data of grounded image-text pairs (called GrIT) to train the model. In addition to the existing capabilities of MLLMs (e.g., perceiving general modalities, following instructions, and performing in-context learning), Kosmos-2 integrates the grounding capability into downstream applications. We evaluate Kosmos-2 on a wide range of tasks, including (i) multimodal grounding, such as referring expression comprehension, and phrase grounding, (ii) multimodal referring, such as referring expression generation, (iii) perception-language tasks, and (iv) language understanding and generation. This work lays out the foundation for the development of Embodiment AI and sheds light on the big convergence of language, multimodal perception, action, and world modeling, which is a key step toward artificial general intelligence. Code and pretrained models are available at https://aka.ms/kosmos-2.* + + + + Overview of tasks that KOSMOS-2 can handle. Taken from the original paper. + +## Example + +```python +>>> from PIL import Image +>>> import requests +>>> from transformers import AutoProcessor, Kosmos2ForConditionalGeneration + +>>> model = Kosmos2ForConditionalGeneration.from_pretrained("microsoft/kosmos-2-patch14-224") +>>> processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224") + +>>> url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg" +>>> image = Image.open(requests.get(url, stream=True).raw) + +>>> prompt = " An image of" + +>>> inputs = processor(text=prompt, images=image, return_tensors="pt") + +>>> generated_ids = model.generate( +... pixel_values=inputs["pixel_values"], +... input_ids=inputs["input_ids"], +... attention_mask=inputs["attention_mask"], +... image_embeds=None, +... image_embeds_position_mask=inputs["image_embeds_position_mask"], +... use_cache=True, +... max_new_tokens=64, +... ) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0] +>>> processed_text = processor.post_process_generation(generated_text, cleanup_and_extract=False) +>>> processed_text +' An image of a snowman warming himself by a fire.' + +>>> caption, entities = processor.post_process_generation(generated_text) +>>> caption +'An image of a snowman warming himself by a fire.' + +>>> entities +[('a snowman', (12, 21), [(0.390625, 0.046875, 0.984375, 0.828125)]), ('a fire', (41, 47), [(0.171875, 0.015625, 0.484375, 0.890625)])] +``` + +This model was contributed by [Yih-Dar SHIEH](https://huggingface.co/ydshieh). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/kosmos-2). 
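The boxes returned by `post_process_generation` in the example above are corner coordinates `(x1, y1, x2, y2)` normalized to the `[0, 1]` range, as the `entities` output suggests. Below is a small sketch continuing that example; `to_pixel_boxes` is a hypothetical helper written here for illustration, not part of the library.

```python
def to_pixel_boxes(entities, image_width, image_height):
    """Convert normalized (x1, y1, x2, y2) boxes to pixel coordinates."""
    converted = []
    for name, span, boxes in entities:
        pixel_boxes = [
            (round(x1 * image_width), round(y1 * image_height),
             round(x2 * image_width), round(y2 * image_height))
            for x1, y1, x2, y2 in boxes
        ]
        converted.append((name, span, pixel_boxes))
    return converted


# `image` and `entities` come from the example above.
width, height = image.size
print(to_pixel_boxes(entities, width, height))
```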
+ +## Kosmos2Config + +[[autodoc]] Kosmos2Config + +## Kosmos2ImageProcessor + +## Kosmos2Processor + +[[autodoc]] Kosmos2Processor + - __call__ + +## Kosmos2Model + +[[autodoc]] Kosmos2Model + - forward + +## Kosmos2ForConditionalGeneration + +[[autodoc]] Kosmos2ForConditionalGeneration + - forward diff --git a/docs/source/en/model_doc/layoutlm.mdx b/docs/source/en/model_doc/layoutlm.md similarity index 92% rename from docs/source/en/model_doc/layoutlm.mdx rename to docs/source/en/model_doc/layoutlm.md index 57475abb635b6f..34b429fb73763a 100644 --- a/docs/source/en/model_doc/layoutlm.mdx +++ b/docs/source/en/model_doc/layoutlm.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # LayoutLM @@ -42,7 +46,7 @@ document-level pretraining. It achieves new state-of-the-art results in several understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24) and document image classification (from 93.07 to 94.42).* -Tips: +## Usage tips - In addition to *input_ids*, [`~transformers.LayoutLMModel.forward`] also expects the input `bbox`, which are the bounding boxes (i.e. 2D-positions) of the input tokens. These can be obtained using an external OCR engine such @@ -88,13 +92,20 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - A notebook on how to [fine-tune LayoutLM on the FUNSD dataset with image embeddings](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Add_image_embeddings_to_LayoutLM.ipynb). +- See also: [Document question answering task guide](../tasks/document_question_answering) + - A notebook on how to [fine-tune LayoutLM for sequence classification on the RVL-CDIP dataset](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForSequenceClassification_on_RVL_CDIP.ipynb). +- [Text classification task guide](../tasks/sequence_classification) - A notebook on how to [ fine-tune LayoutLM for token classification on the FUNSD dataset](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb). 
+- [Token classification task guide](../tasks/token_classification) + +**Other resources** +- [Masked language modeling task guide](../tasks/masked_language_modeling) 🚀 Deploy @@ -112,6 +123,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] LayoutLMTokenizerFast + + + ## LayoutLMModel [[autodoc]] LayoutLMModel @@ -132,6 +146,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] LayoutLMForQuestionAnswering + + + ## TFLayoutLMModel [[autodoc]] TFLayoutLMModel @@ -151,3 +168,8 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h ## TFLayoutLMForQuestionAnswering [[autodoc]] TFLayoutLMForQuestionAnswering + + + + + diff --git a/docs/source/en/model_doc/layoutlmv2.mdx b/docs/source/en/model_doc/layoutlmv2.md similarity index 84% rename from docs/source/en/model_doc/layoutlmv2.mdx rename to docs/source/en/model_doc/layoutlmv2.md index dc225d768d5005..0769322e9ad54c 100644 --- a/docs/source/en/model_doc/layoutlmv2.mdx +++ b/docs/source/en/model_doc/layoutlmv2.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # LayoutLMV2 @@ -46,13 +50,13 @@ this https URL.* LayoutLMv2 depends on `detectron2`, `torchvision` and `tesseract`. Run the following to install them: -``` +```bash python -m pip install 'git+https://github.com/facebookresearch/detectron2.git' python -m pip install torchvision tesseract ``` (If you are developing for LayoutLMv2, note that passing the doctests also requires the installation of these packages.) -Tips: +## Usage tips - The main difference between LayoutLMv1 and LayoutLMv2 is that the latter incorporates visual embeddings during pre-training (while LayoutLMv1 only adds visual embeddings during fine-tuning). @@ -121,26 +125,48 @@ section below. In addition, there's LayoutXLM, which is a multilingual version of LayoutLMv2. More information can be found on [LayoutXLM's documentation page](layoutxlm). +## Resources + +A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LayoutLMv2. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. + + + +- A notebook on how to [finetune LayoutLMv2 for text-classification on RVL-CDIP dataset](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/RVL-CDIP/Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb). +- See also: [Text classification task guide](../tasks/sequence_classification) + + + +- A notebook on how to [finetune LayoutLMv2 for question-answering on DocVQA dataset](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/DocVQA/Fine_tuning_LayoutLMv2ForQuestionAnswering_on_DocVQA.ipynb). 
+- See also: [Question answering task guide](../tasks/question_answering) +- See also: [Document question answering task guide](../tasks/document_question_answering) + + + + +- A notebook on how to [finetune LayoutLMv2 for token-classification on CORD dataset](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/CORD/Fine_tuning_LayoutLMv2ForTokenClassification_on_CORD.ipynb). +- A notebook on how to [finetune LayoutLMv2 for token-classification on FUNSD dataset](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/FUNSD/Fine_tuning_LayoutLMv2ForTokenClassification_on_FUNSD_using_HuggingFace_Trainer.ipynb). +- See also: [Token classification task guide](../tasks/token_classification) + ## Usage: LayoutLMv2Processor The easiest way to prepare data for the model is to use [`LayoutLMv2Processor`], which internally -combines a feature extractor ([`LayoutLMv2FeatureExtractor`]) and a tokenizer -([`LayoutLMv2Tokenizer`] or [`LayoutLMv2TokenizerFast`]). The feature extractor +combines a image processor ([`LayoutLMv2ImageProcessor`]) and a tokenizer +([`LayoutLMv2Tokenizer`] or [`LayoutLMv2TokenizerFast`]). The image processor handles the image modality, while the tokenizer handles the text modality. A processor combines both, which is ideal for a multi-modal model like LayoutLMv2. Note that you can still use both separately, if you only want to handle one modality. ```python -from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2TokenizerFast, LayoutLMv2Processor +from transformers import LayoutLMv2ImageProcessor, LayoutLMv2TokenizerFast, LayoutLMv2Processor -feature_extractor = LayoutLMv2FeatureExtractor() # apply_ocr is set to True by default +image_processor = LayoutLMv2ImageProcessor() # apply_ocr is set to True by default tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased") -processor = LayoutLMv2Processor(feature_extractor, tokenizer) +processor = LayoutLMv2Processor(image_processor, tokenizer) ``` In short, one can provide a document image (and possibly additional data) to [`LayoutLMv2Processor`], and it will create the inputs expected by the model. Internally, the processor first uses -[`LayoutLMv2FeatureExtractor`] to apply OCR on the image to get a list of words and normalized +[`LayoutLMv2ImageProcessor`] to apply OCR on the image to get a list of words and normalized bounding boxes, as well to resize the image to a given size in order to get the `image` input. The words and normalized bounding boxes are then provided to [`LayoutLMv2Tokenizer`] or [`LayoutLMv2TokenizerFast`], which converts them to token-level `input_ids`, @@ -150,7 +176,7 @@ which are turned into token-level `labels`. [`LayoutLMv2Processor`] uses [PyTesseract](https://pypi.org/project/pytesseract/), a Python wrapper around Google's Tesseract OCR engine, under the hood. Note that you can still use your own OCR engine of choice, and provide the words and normalized boxes yourself. This requires initializing -[`LayoutLMv2FeatureExtractor`] with `apply_ocr` set to `False`. +[`LayoutLMv2ImageProcessor`] with `apply_ocr` set to `False`. In total, there are 5 use cases that are supported by the processor. Below, we list them all. Note that each of these use cases work for both batched and non-batched inputs (we illustrate them for non-batched inputs). 
@@ -158,7 +184,7 @@ use cases work for both batched and non-batched inputs (we illustrate them for n **Use case 1: document image classification (training, inference) + token classification (inference), apply_ocr = True** -This is the simplest case, in which the processor (actually the feature extractor) will perform OCR on the image to get +This is the simplest case, in which the processor (actually the image processor) will perform OCR on the image to get the words and normalized bounding boxes. ```python @@ -179,7 +205,7 @@ print(encoding.keys()) **Use case 2: document image classification (training, inference) + token classification (inference), apply_ocr=False** -In case one wants to do OCR themselves, one can initialize the feature extractor with `apply_ocr` set to +In case one wants to do OCR themselves, one can initialize the image processor with `apply_ocr` set to `False`. In that case, one should provide the words and corresponding (normalized) bounding boxes themselves to the processor. diff --git a/docs/source/en/model_doc/layoutlmv3.mdx b/docs/source/en/model_doc/layoutlmv3.md similarity index 90% rename from docs/source/en/model_doc/layoutlmv3.mdx rename to docs/source/en/model_doc/layoutlmv3.md index d49ee1819a431e..87ff32f3835690 100644 --- a/docs/source/en/model_doc/layoutlmv3.mdx +++ b/docs/source/en/model_doc/layoutlmv3.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # LayoutLMv3 @@ -22,16 +26,6 @@ The abstract from the paper is the following: *Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.* -Tips: - -- In terms of data processing, LayoutLMv3 is identical to its predecessor [LayoutLMv2](layoutlmv2), except that: - - images need to be resized and normalized with channels in regular RGB format. LayoutLMv2 on the other hand normalizes the images internally and expects the channels in BGR format. - - text is tokenized using byte-pair encoding (BPE), as opposed to WordPiece. 
- Due to these differences in data preprocessing, one can use [`LayoutLMv3Processor`] which internally combines a [`LayoutLMv3FeatureExtractor`] (for the image modality) and a [`LayoutLMv3Tokenizer`]/[`LayoutLMv3TokenizerFast`] (for the text modality) to prepare all data for the model. -- Regarding usage of [`LayoutLMv3Processor`], we refer to the [usage guide](layoutlmv2#usage-layoutlmv2processor) of its predecessor. -- Demo notebooks for LayoutLMv3 can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LayoutLMv3). -- Demo scripts can be found [here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/layoutlmv3). - drawing @@ -39,6 +33,14 @@ alt="drawing" width="600"/> This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version of this model was added by [chriskoo](https://huggingface.co/chriskoo), [tokec](https://huggingface.co/tokec), and [lre](https://huggingface.co/lre). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/layoutlmv3). +## Usage tips + +- In terms of data processing, LayoutLMv3 is identical to its predecessor [LayoutLMv2](layoutlmv2), except that: + - images need to be resized and normalized with channels in regular RGB format. LayoutLMv2 on the other hand normalizes the images internally and expects the channels in BGR format. + - text is tokenized using byte-pair encoding (BPE), as opposed to WordPiece. + Due to these differences in data preprocessing, one can use [`LayoutLMv3Processor`] which internally combines a [`LayoutLMv3ImageProcessor`] (for the image modality) and a [`LayoutLMv3Tokenizer`]/[`LayoutLMv3TokenizerFast`] (for the text modality) to prepare all data for the model. +- Regarding usage of [`LayoutLMv3Processor`], we refer to the [usage guide](layoutlmv2#usage-layoutlmv2processor) of its predecessor. + ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LayoutLMv3. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. @@ -49,20 +51,28 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2
+- Demo notebooks for LayoutLMv3 can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LayoutLMv3). +- Demo scripts can be found [here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/layoutlmv3). + - [`LayoutLMv2ForSequenceClassification`] is supported by this [notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/RVL-CDIP/Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb). +- [Text classification task guide](../tasks/sequence_classification) - [`LayoutLMv3ForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/research_projects/layoutlmv3) and [notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv3/Fine_tune_LayoutLMv3_on_FUNSD_(HuggingFace_Trainer).ipynb). - A [notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/FUNSD/Inference_with_LayoutLMv2ForTokenClassification.ipynb) for how to perform inference with [`LayoutLMv2ForTokenClassification`] and a [notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/FUNSD/True_inference_with_LayoutLMv2ForTokenClassification_%2B_Gradio_demo.ipynb) for how to perform inference when no labels are available with [`LayoutLMv2ForTokenClassification`]. - A [notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/FUNSD/Fine_tuning_LayoutLMv2ForTokenClassification_on_FUNSD_using_HuggingFace_Trainer.ipynb) for how to finetune [`LayoutLMv2ForTokenClassification`] with the 🤗 Trainer. +- [Token classification task guide](../tasks/token_classification) - [`LayoutLMv2ForQuestionAnswering`] is supported by this [notebook](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/DocVQA/Fine_tuning_LayoutLMv2ForQuestionAnswering_on_DocVQA.ipynb). +- [Question answering task guide](../tasks/question_answering) +**Document question answering** +- [Document question answering task guide](../tasks/document_question_answering) ## LayoutLMv3Config @@ -94,6 +104,9 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2 [[autodoc]] LayoutLMv3Processor - __call__ + + + ## LayoutLMv3Model [[autodoc]] LayoutLMv3Model @@ -114,6 +127,9 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2 [[autodoc]] LayoutLMv3ForQuestionAnswering - forward + + + ## TFLayoutLMv3Model [[autodoc]] TFLayoutLMv3Model @@ -133,3 +149,6 @@ LayoutLMv3 is nearly identical to LayoutLMv2, so we've also included LayoutLMv2 [[autodoc]] TFLayoutLMv3ForQuestionAnswering - call + + + diff --git a/docs/source/en/model_doc/layoutxlm.mdx b/docs/source/en/model_doc/layoutxlm.md similarity index 93% rename from docs/source/en/model_doc/layoutxlm.mdx rename to docs/source/en/model_doc/layoutxlm.md index ed112453beae4f..f6b2cbef9d6fd1 100644 --- a/docs/source/en/model_doc/layoutxlm.mdx +++ b/docs/source/en/model_doc/layoutxlm.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
+ +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # LayoutXLM @@ -29,6 +33,10 @@ introduce a multilingual form understanding benchmark dataset named XFUN, which for each language. Experiment results show that the LayoutXLM model has significantly outperformed the existing SOTA cross-lingual pre-trained models on the XFUN dataset.* +This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm). + +## Usage tips and examples + One can directly plug in the weights of LayoutXLM into a LayoutLMv2 model, like so: ```python @@ -48,14 +56,14 @@ tokenizer = LayoutXLMTokenizer.from_pretrained("microsoft/layoutxlm-base") ``` Similar to LayoutLMv2, you can use [`LayoutXLMProcessor`] (which internally applies -[`LayoutLMv2FeatureExtractor`] and +[`LayoutLMv2ImageProcessor`] and [`LayoutXLMTokenizer`]/[`LayoutXLMTokenizerFast`] in sequence) to prepare all data for the model. -As LayoutXLM's architecture is equivalent to that of LayoutLMv2, one can refer to [LayoutLMv2's documentation page](layoutlmv2) for all tips, code examples and notebooks. - -This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm). + +As LayoutXLM's architecture is equivalent to that of LayoutLMv2, one can refer to [LayoutLMv2's documentation page](layoutlmv2) for all tips, code examples and notebooks. + ## LayoutXLMTokenizer diff --git a/docs/source/en/model_doc/led.mdx b/docs/source/en/model_doc/led.md similarity index 85% rename from docs/source/en/model_doc/led.mdx rename to docs/source/en/model_doc/led.md index 6ecdf808e261c9..9a39b0b28eded2 100644 --- a/docs/source/en/model_doc/led.mdx +++ b/docs/source/en/model_doc/led.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # LED @@ -31,7 +35,7 @@ WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED), long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization dataset.* -Tips: +## Usage tips - [`LEDForConditionalGeneration`] is an extension of [`BartForConditionalGeneration`] exchanging the traditional *self-attention* layer with @@ -48,13 +52,19 @@ Tips: errors. This can be done by executing `model.gradient_checkpointing_enable()`. Moreover, the `use_cache=False` flag can be used to disable the caching mechanism to save memory. -- A notebook showing how to evaluate LED, can be accessed [here](https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing). -- A notebook showing how to fine-tune LED, can be accessed [here](https://colab.research.google.com/drive/12LjJazBl7Gam0XBPy_y0CTOJZeZ34c2v?usp=sharing). - LED is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than the left. 
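To make the tips above concrete, here is a minimal long-document summarization sketch. The checkpoint name, the dummy document, and the generation settings are assumptions for illustration; the relevant parts are padding on the right and putting global attention on the first token so that it can attend to the whole sequence.

```python
import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

# Checkpoint name is an assumption for illustration.
tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

long_document = "Replace this with the long article you want to summarize. " * 200

inputs = tokenizer(long_document, max_length=16384, truncation=True, return_tensors="pt")

# Local (chunked) self-attention everywhere, global attention on the first token.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
    max_new_tokens=128,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```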
This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). +## Resources + +- [A notebook showing how to evaluate LED](https://colab.research.google.com/drive/12INTTR6n64TzS4RrXZxMSXfrOd9Xzamo?usp=sharing). +- [A notebook showing how to fine-tune LED](https://colab.research.google.com/drive/12LjJazBl7Gam0XBPy_y0CTOJZeZ34c2v?usp=sharing). +- [Text classification task guide](../tasks/sequence_classification) +- [Question answering task guide](../tasks/question_answering) +- [Translation task guide](../tasks/translation) +- [Summarization task guide](../tasks/summarization) ## LEDConfig @@ -90,6 +100,9 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv [[autodoc]] models.led.modeling_tf_led.TFLEDSeq2SeqLMOutput + + + ## LEDModel [[autodoc]] LEDModel @@ -110,6 +123,9 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv [[autodoc]] LEDForQuestionAnswering - forward + + + ## TFLEDModel [[autodoc]] TFLEDModel @@ -119,3 +135,9 @@ This model was contributed by [patrickvonplaten](https://huggingface.co/patrickv [[autodoc]] TFLEDForConditionalGeneration - call + + + + + + diff --git a/docs/source/en/model_doc/levit.mdx b/docs/source/en/model_doc/levit.md similarity index 96% rename from docs/source/en/model_doc/levit.mdx rename to docs/source/en/model_doc/levit.md index 69c2e00c0b715d..15dc2f4e137357 100644 --- a/docs/source/en/model_doc/levit.mdx +++ b/docs/source/en/model_doc/levit.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # LeViT @@ -34,7 +38,9 @@ alt="drawing" width="600"/> LeViT Architecture. Taken from the original paper. -Tips: +This model was contributed by [anugunj](https://huggingface.co/anugunj). The original code can be found [here](https://github.com/facebookresearch/LeViT). + +## Usage tips - Compared to ViT, LeViT models use an additional distillation head to effectively learn from a teacher (which, in the LeViT paper, is a ResNet like-model). The distillation head is learned through backpropagation under supervision of a ResNet like-model. They also draw inspiration from convolution neural networks to use activation maps with decreasing resolutions to increase the efficiency. - There are 2 ways to fine-tune distilled models, either (1) in a classic way, by only placing a prediction head on top @@ -59,8 +65,6 @@ Tips: - You can check out demo notebooks regarding inference as well as fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) (you can just replace [`ViTFeatureExtractor`] by [`LevitImageProcessor`] and [`ViTForImageClassification`] by [`LevitForImageClassification`] or [`LevitForImageClassificationWithTeacher`]). -This model was contributed by [anugunj](https://huggingface.co/anugunj). The original code can be found [here](https://github.com/facebookresearch/LeViT). - ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LeViT. 
@@ -68,6 +72,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`LevitForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). +- See also: [Image classification task guide](../tasks/image_classification) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. @@ -85,7 +90,6 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] LevitImageProcessor - preprocess - ## LevitModel [[autodoc]] LevitModel diff --git a/docs/source/en/model_doc/lilt.mdx b/docs/source/en/model_doc/lilt.md similarity index 91% rename from docs/source/en/model_doc/lilt.mdx rename to docs/source/en/model_doc/lilt.md index f29a8d67a3d2c8..2514a6ebd85263 100644 --- a/docs/source/en/model_doc/lilt.mdx +++ b/docs/source/en/model_doc/lilt.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # LiLT @@ -22,12 +26,20 @@ The abstract from the paper is the following: *Structured document understanding has attracted considerable attention and made significant progress recently, owing to its crucial role in intelligent document processing. However, most existing related models can only deal with the document data of specific language(s) (typically English) included in the pre-training collection, which is extremely limited. To address this issue, we propose a simple yet effective Language-independent Layout Transformer (LiLT) for structured document understanding. LiLT can be pre-trained on the structured documents of a single language and then directly fine-tuned on other languages with the corresponding off-the-shelf monolingual/multilingual pre-trained textual models. Experimental results on eight languages have shown that LiLT can achieve competitive or even superior performance on diverse widely-used downstream benchmarks, which enables language-independent benefit from the pre-training of document layout structure.* -Tips: + + + LiLT architecture. Taken from the original paper. + +This model was contributed by [nielsr](https://huggingface.co/nielsr). +The original code can be found [here](https://github.com/jpwang/lilt). + +## Usage tips - To combine the Language-Independent Layout Transformer with a new RoBERTa checkpoint from the [hub](https://huggingface.co/models?search=roberta), refer to [this guide](https://github.com/jpWang/LiLT#or-generate-your-own-checkpoint-optional). The script will result in `config.json` and `pytorch_model.bin` files being stored locally. 
After doing this, one can do the following (assuming you're logged in with your HuggingFace account): -``` +```python from transformers import LiltModel model = LiltModel.from_pretrained("path_to_your_files") @@ -38,20 +50,17 @@ model.push_to_hub("name_of_repo_on_the_hub") - As [lilt-roberta-en-base](https://huggingface.co/SCUT-DLVCLab/lilt-roberta-en-base) uses the same vocabulary as [LayoutLMv3](layoutlmv3), one can use [`LayoutLMv3TokenizerFast`] to prepare data for the model. The same is true for [lilt-roberta-en-base](https://huggingface.co/SCUT-DLVCLab/lilt-infoxlm-base): one can use [`LayoutXLMTokenizerFast`] for that model. - - - LiLT architecture. Taken from the original paper. - -This model was contributed by [nielsr](https://huggingface.co/nielsr). -The original code can be found [here](https://github.com/jpwang/lilt). - ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LiLT. - Demo notebooks for LiLT can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LiLT). +**Documentation resources** +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) + If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. ## LiltConfig diff --git a/docs/source/en/model_doc/llama.md b/docs/source/en/model_doc/llama.md new file mode 100644 index 00000000000000..915d5ecc70b554 --- /dev/null +++ b/docs/source/en/model_doc/llama.md @@ -0,0 +1,132 @@ + + +# LLaMA + +## Overview + +The LLaMA model was proposed in [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971) by Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. It is a collection of foundation language models ranging from 7B to 65B parameters. + +The abstract from the paper is the following: + +*We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community. * + +This model was contributed by [zphang](https://huggingface.co/zphang) with contributions from [BlackSamorez](https://huggingface.co/BlackSamorez). The code of the implementation in Hugging Face is based on GPT-NeoX [here](https://github.com/EleutherAI/gpt-neox). The original code of the authors can be found [here](https://github.com/facebookresearch/llama). 
+ +## Usage tips + +- Weights for the LLaMA models can be obtained from by filling out [this form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform?usp=send_form) +- After downloading the weights, they will need to be converted to the Hugging Face Transformers format using the [conversion script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py). The script can be called with the following (example) command: + +```bash +python src/transformers/models/llama/convert_llama_weights_to_hf.py \ + --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path +``` + +- After conversion, the model and tokenizer can be loaded via: + +```python +from transformers import LlamaForCausalLM, LlamaTokenizer + +tokenizer = LlamaTokenizer.from_pretrained("/output/path") +model = LlamaForCausalLM.from_pretrained("/output/path") +``` + +Note that executing the script requires enough CPU RAM to host the whole model in float16 precision (even if the biggest versions +come in several checkpoints they each contain a part of each weight of the model, so we need to load them all in RAM). For the 65B model, it's thus 130GB of RAM needed. + +- The LLaMA tokenizer is a BPE model based on [sentencepiece](https://github.com/google/sentencepiece). One quirk of sentencepiece is that when decoding a sequence, if the first token is the start of the word (e.g. "Banana"), the tokenizer does not prepend the prefix space to the string. + +This model was contributed by [zphang](https://huggingface.co/zphang) with contributions from [BlackSamorez](https://huggingface.co/BlackSamorez). The code of the implementation in Hugging Face is based on GPT-NeoX [here](https://github.com/EleutherAI/gpt-neox). The original code of the authors can be found [here](https://github.com/facebookresearch/llama). The Flax version of the implementation was contributed by [afmck](https://huggingface.co/afmck) with the code in the implementation based on Hugging Face's Flax GPT-Neo. + + +Based on the original LLaMA model, Meta AI has released some follow-up works: + +- **Llama2**: Llama2 is an improved version of Llama with some architectural tweaks (Grouped Query Attention), and is pre-trained on 2Trillion tokens. Refer to the documentation of Llama2 which can be found [here](llama2). + +## Resources + +A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LLaMA. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. + + + +- A [notebook](https://colab.research.google.com/github/bigscience-workshop/petals/blob/main/examples/prompt-tuning-sst2.ipynb#scrollTo=f04ba4d2) on how to use prompt tuning to adapt the LLaMA model for text classification task. 🌎 + + + +- [StackLLaMA: A hands-on guide to train LLaMA with RLHF](https://huggingface.co/blog/stackllama#stackllama-a-hands-on-guide-to-train-llama-with-rlhf), a blog post about how to train LLaMA to answer questions on [Stack Exchange](https://stackexchange.com/) with RLHF. + +⚗️ Optimization +- A [notebook](https://colab.research.google.com/drive/1SQUXq1AMZPSLD4mk3A3swUIc6Y2dclme?usp=sharing) on how to fine-tune LLaMA model using xturing library on GPU which has limited memory. 
🌎 + +⚡️ Inference +- A [notebook](https://colab.research.google.com/github/DominguesM/alpaca-lora-ptbr-7b/blob/main/notebooks/02%20-%20Evaluate.ipynb) on how to run the LLaMA Model using PeftModel from the 🤗 PEFT library. 🌎 +- A [notebook](https://colab.research.google.com/drive/1l2GiSSPbajVyp2Nk3CFT4t3uH6-5TiBe?usp=sharing) on how to load a PEFT adapter LLaMA model with LangChain. 🌎 + +🚀 Deploy +- A [notebook](https://colab.research.google.com/github/lxe/simple-llama-finetuner/blob/master/Simple_LLaMA_FineTuner.ipynb#scrollTo=3PM_DilAZD8T) on how to fine-tune LLaMA model using LoRA method via the 🤗 PEFT library with intuitive UI. 🌎 +- A [notebook](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/jumpstart-foundation-models/text-generation-open-llama.ipynb) on how to deploy Open-LLaMA model for text generation on Amazon SageMaker. 🌎 + +## LlamaConfig + +[[autodoc]] LlamaConfig + +## LlamaTokenizer + +[[autodoc]] LlamaTokenizer + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - save_vocabulary + +## LlamaTokenizerFast + +[[autodoc]] LlamaTokenizerFast + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - update_post_processor + - save_vocabulary + +## LlamaModel + +[[autodoc]] LlamaModel + - forward + +## LlamaForCausalLM + +[[autodoc]] LlamaForCausalLM + - forward + +## LlamaForSequenceClassification + +[[autodoc]] LlamaForSequenceClassification + - forward + +## LlamaForQuestionAnswering + +[[autodoc]] LlamaForQuestionAnswering + - forward + +## FlaxLlamaModel + +[[autodoc]] FlaxLlamaModel + - __call__ + +## FlaxLlamaForCausalLM + +[[autodoc]] FlaxLlamaForCausalLM + - __call__ diff --git a/docs/source/en/model_doc/llama2.md b/docs/source/en/model_doc/llama2.md new file mode 100644 index 00000000000000..b4cd6b9ca110d1 --- /dev/null +++ b/docs/source/en/model_doc/llama2.md @@ -0,0 +1,140 @@ + + +# Llama2 + +## Overview + +The Llama2 model was proposed in [LLaMA: Open Foundation and Fine-Tuned Chat Models](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/) by Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushka rMishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing EllenTan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom. It is a collection of foundation language models ranging from 7B to 70B parameters, with checkpoints finetuned for chat application! 
+
+The abstract from the paper is the following:
+
+*In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.*
+
+Check out all Llama2 model checkpoints [here](https://huggingface.co/models?search=llama2).
+This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ) with contributions from [Lysandre Debut](https://huggingface.co/lysandre). The code of the implementation in Hugging Face is based on GPT-NeoX [here](https://github.com/EleutherAI/gpt-neox). The original code of the authors can be found [here](https://github.com/facebookresearch/llama).
+
+## Usage tips
+
+<Tip warning={true}>
+
+The `Llama2` models were trained using `bfloat16`, but the original inference uses `float16`. The checkpoints uploaded on the Hub use `torch_dtype = 'float16'`, which will be
+used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`.
+
+The `dtype` of the online weights is mostly irrelevant unless you are using `torch_dtype="auto"` when initializing a model using `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype="auto")`. The reason is that the model will first be downloaded (using the `dtype` of the checkpoints online), then it will be cast to the default `dtype` of `torch` (which becomes `torch.float32`), and finally, if there is a `torch_dtype` provided in the config, it will be used.
+
+Training the model in `float16` is not recommended and is known to produce `nan`; as such, the model should be trained in `bfloat16`.
+
+</Tip>
+
+Tips:
+
+- Weights for the Llama2 models can be obtained by filling out [this form](https://ai.meta.com/resources/models-and-libraries/llama-downloads/)
+- The architecture is very similar to the first Llama, with the addition of Grouped Query Attention (GQA) following this [paper](https://arxiv.org/pdf/2305.13245.pdf)
+- Setting `config.pretraining_tp` to a value different from 1 will activate the more accurate but slower computation of the linear layers, which should better match the original logits.
+- The original model uses `pad_id = -1`, which means that there is no padding token. We can't use the same logic here; make sure to add a padding token using `tokenizer.add_special_tokens({"pad_token": "<pad>"})` and resize the token embeddings accordingly. You should also set `model.config.pad_token_id`. The `embed_tokens` layer of the model is initialized with `self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.config.padding_idx)`, which makes sure that encoding the padding token will output zeros, so passing it when initializing is recommended.
+- After filling out the form and gaining access to the model checkpoints, you should be able to use the already converted checkpoints. Otherwise, if you are converting your own model, feel free to use the [conversion script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py). The script can be called with the following (example) command:
+
+```bash
+python src/transformers/models/llama/convert_llama_weights_to_hf.py \
+    --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path
+```
+
+- After conversion, the model and tokenizer can be loaded via:
+
+```python
+from transformers import LlamaForCausalLM, LlamaTokenizer
+
+tokenizer = LlamaTokenizer.from_pretrained("/output/path")
+model = LlamaForCausalLM.from_pretrained("/output/path")
+```
+
+Note that executing the script requires enough CPU RAM to host the whole model in float16 precision (even though the biggest versions
+come in several checkpoints, they each contain a part of each weight of the model, so we need to load them all in RAM). For the 70B model, that means roughly 140GB of RAM.
+
+- The LLaMA tokenizer is a BPE model based on [sentencepiece](https://github.com/google/sentencepiece). One quirk of sentencepiece is that when decoding a sequence, if the first token is the start of the word (e.g. "Banana"), the tokenizer does not prepend the prefix space to the string.
+
+- When using Flash Attention 2 via `attn_implementation="flash_attention_2"`, don't pass `torch_dtype` to the `from_pretrained` class method and use Automatic Mixed-Precision training. When using `Trainer`, simply set either `fp16` or `bf16` to `True`. Otherwise, make sure you are using `torch.autocast`. This is required because Flash Attention only supports the `fp16` and `bf16` data types.
+
+
+## Resources
+
+A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LLaMA2. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
+
+- [Llama 2 is here - get it on Hugging Face](https://huggingface.co/blog/llama2), a blog post about Llama 2 and how to use it with 🤗 Transformers and 🤗 PEFT.
+- [LLaMA 2 - Every Resource you need](https://www.philschmid.de/llama-2), a compilation of relevant resources to learn about LLaMA 2 and how to get started quickly.
+
+
+
+- A [notebook](https://colab.research.google.com/drive/1PEQyJO1-f6j0S_XJ8DV50NkpzasXkrzd?usp=sharing) on how to fine-tune Llama 2 in Google Colab using QLoRA and 4-bit precision. 🌎
+- A [notebook](https://colab.research.google.com/drive/134o_cXcMe_lsvl15ZE_4Y75Kstepsntu?usp=sharing) on how to fine-tune the "Llama-v2-7b-guanaco" model with 4-bit QLoRA and generate Q&A datasets from PDFs. 🌎
+
+
+
+- A [notebook](https://colab.research.google.com/drive/1ggaa2oRFphdBmqIjSEbnb_HGkcIRC2ZB?usp=sharing) on how to fine-tune the Llama 2 model with QLoRa, TRL, and Korean text classification dataset. 🌎🇰🇷
+
+⚗️ Optimization
+- [Fine-tune Llama 2 with DPO](https://huggingface.co/blog/dpo-trl), a guide to using the TRL library's DPO method to fine tune Llama 2 on a specific dataset.
+- [Extended Guide: Instruction-tune Llama 2](https://www.philschmid.de/instruction-tune-llama-2), a guide to training Llama 2 to generate instructions from inputs, transforming the model from instruction-following to instruction-giving.
+- A [notebook](https://colab.research.google.com/drive/1SYpgFpcmtIUzdE7pxqknrM4ArCASfkFQ?usp=sharing) on how to fine-tune the Llama 2 model on a personal computer using QLoRa and TRL.
🌎 + +⚡️ Inference +- A [notebook](https://colab.research.google.com/drive/1TC56ArKerXUpbgRy5vM3woRsbTEVNq7h?usp=sharing) on how to quantize the Llama 2 model using GPTQ from the AutoGPTQ library. 🌎 +- A [notebook](https://colab.research.google.com/drive/1X1z9Q6domMKl2CnEM0QGHNwidLfR4dW2?usp=sharing) on how to run the Llama 2 Chat Model with 4-bit quantization on a local computer or Google Colab. 🌎 + +🚀 Deploy +- [Fine-tune LLaMA 2 (7-70B) on Amazon SageMaker](https://www.philschmid.de/sagemaker-llama2-qlora), a complete guide from setup to QLoRA fine-tuning and deployment on Amazon SageMaker. +- [Deploy Llama 2 7B/13B/70B on Amazon SageMaker](https://www.philschmid.de/sagemaker-llama-llm), a guide on using Hugging Face's LLM DLC container for secure and scalable deployment. + + +## LlamaConfig + +[[autodoc]] LlamaConfig + + +## LlamaTokenizer + +[[autodoc]] LlamaTokenizer + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - save_vocabulary + +## LlamaTokenizerFast + +[[autodoc]] LlamaTokenizerFast + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - update_post_processor + - save_vocabulary + +## LlamaModel + +[[autodoc]] LlamaModel + - forward + + +## LlamaForCausalLM + +[[autodoc]] LlamaForCausalLM + - forward + +## LlamaForSequenceClassification + +[[autodoc]] LlamaForSequenceClassification + - forward + diff --git a/docs/source/en/model_doc/llava.md b/docs/source/en/model_doc/llava.md new file mode 100644 index 00000000000000..ee7d9bbd1af9be --- /dev/null +++ b/docs/source/en/model_doc/llava.md @@ -0,0 +1,80 @@ + + +# LLaVa + +## Overview + +LLaVa is an open-source chatbot trained by fine-tuning LlamA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture. In other words, it is an multi-modal version of LLMs fine-tuned for chat / instructions. + +The LLaVa model was proposed in [Visual Instruction Tuning](https://arxiv.org/abs/2304.08485) and improved in [Improved Baselines with Visual Instruction Tuning](https://arxiv.org/pdf/2310.03744) by Haotian Liu, Chunyuan Li, Yuheng Li and Yong Jae Lee. + +The abstract from the paper is the following: + +*Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ∼1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available* + + + + LLaVa architecture. Taken from the original paper. + +This model was contributed by [ArthurZ](https://huggingface.co/ArthurZ) and [ybelkada](https://huggingface.co/ybelkada). +The original code can be found [here](https://github.com/haotian-liu/LLaVA/tree/main/llava). + +## Usage tips + +- We advise users to use `padding_side="left"` when computing batched generation as it leads to more accurate results. Simply make sure to call `processor.tokenizer.padding_side = "left"` before generating. 
+ +- Note the model has not been explicitly trained to process multiple images in the same prompt, although this is technically possible, you may experience inaccurate results. + +- For better results, we recommend users to prompt the model with the correct prompt format: + +```bash +"USER: \nASSISTANT:" +``` + +For multiple turns conversation: + +```bash +"USER: \nASSISTANT: USER: ASSISTANT: USER: ASSISTANT:" +``` + +### Using Flash Attention 2 + +Flash Attention 2 is an even faster, optimized version of the previous optimization, please refer to the [Flash Attention 2 section of performance docs](https://huggingface.co/docs/transformers/perf_infer_gpu_one). + +## Resources + +A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BEiT. + + + +- A [Google Colab demo](https://colab.research.google.com/drive/1qsl6cd2c8gGtEW1xV5io7S8NHh-Cp1TV?usp=sharing) on how to run Llava on a free-tier Google colab instance leveraging 4-bit inference. +- A [similar notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVa/Inference_with_LLaVa_for_multimodal_generation.ipynb) showcasing batched inference. 🌎 + + +## LlavaConfig + +[[autodoc]] LlavaConfig + +## LlavaProcessor + +[[autodoc]] LlavaProcessor + +## LlavaForConditionalGeneration + +[[autodoc]] LlavaForConditionalGeneration + - forward diff --git a/docs/source/en/model_doc/longformer.mdx b/docs/source/en/model_doc/longformer.md similarity index 92% rename from docs/source/en/model_doc/longformer.mdx rename to docs/source/en/model_doc/longformer.md index 7c8f2c69917e90..20ba7a922515dd 100644 --- a/docs/source/en/model_doc/longformer.mdx +++ b/docs/source/en/model_doc/longformer.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Longformer @@ -37,15 +41,15 @@ contrast to most prior work, we also pretrain Longformer and finetune it on a va pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA.* -Tips: +This model was contributed by [beltagy](https://huggingface.co/beltagy). The Authors' code can be found [here](https://github.com/allenai/longformer). + +## Usage tips - Since the Longformer is based on RoBERTa, it doesn't have `token_type_ids`. You don't need to indicate which token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or ``). - A transformer model replacing the attention matrices by sparse matrices to go faster. Often, the local context (e.g., what are the two tokens left and right?) is enough to take action for a given token. Some preselected input tokens are still given global attention, but the attention matrix has way less parameters, resulting in a speed-up. See the local attention section for more information. -This model was contributed by [beltagy](https://huggingface.co/beltagy). The Authors' code can be found [here](https://github.com/allenai/longformer). 
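Before the detailed discussion of Longformer self-attention below, here is a small sketch of the local/global attention usage described in the tips; it is not from the original page, and the `allenai/longformer-base-4096` checkpoint and toy input are assumptions:

```python
import torch
from transformers import LongformerTokenizer, LongformerModel

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("A very long document. " * 400, truncation=True, max_length=4096, return_tensors="pt")

# Local (sliding-window) attention everywhere, global attention only on the <s> token
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)
```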
- ## Longformer Self Attention Longformer self attention employs self attention on both a "local" context and a "global" context. Most tokens only @@ -89,6 +93,14 @@ mlm_labels = tokenizer.encode("This is a sentence from the training data", retur loss = model(input_ids, labels=input_ids, masked_lm_labels=mlm_labels)[0] ``` +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) + ## LongformerConfig [[autodoc]] LongformerConfig @@ -131,6 +143,9 @@ loss = model(input_ids, labels=input_ids, masked_lm_labels=mlm_labels)[0] [[autodoc]] models.longformer.modeling_tf_longformer.TFLongformerTokenClassifierOutput + + + ## LongformerModel [[autodoc]] LongformerModel @@ -161,6 +176,9 @@ loss = model(input_ids, labels=input_ids, masked_lm_labels=mlm_labels)[0] [[autodoc]] LongformerForQuestionAnswering - forward + + + ## TFLongformerModel [[autodoc]] TFLongformerModel @@ -190,3 +208,6 @@ loss = model(input_ids, labels=input_ids, masked_lm_labels=mlm_labels)[0] [[autodoc]] TFLongformerForMultipleChoice - call + + + diff --git a/docs/source/en/model_doc/longt5.mdx b/docs/source/en/model_doc/longt5.md similarity index 94% rename from docs/source/en/model_doc/longt5.mdx rename to docs/source/en/model_doc/longt5.md index 0e73d6c8ddff0e..40faa6d8c2377f 100644 --- a/docs/source/en/model_doc/longt5.mdx +++ b/docs/source/en/model_doc/longt5.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # LongT5 @@ -32,7 +36,10 @@ attention ideas from long-input transformers (ETC), and adopted pre-training str able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.* -Tips: +This model was contributed by [stancld](https://huggingface.co/stancld). +The original code can be found [here](https://github.com/google-research/longt5). + +## Usage tips - [`LongT5ForConditionalGeneration`] is an extension of [`T5ForConditionalGeneration`] exchanging the traditional encoder *self-attention* layer with efficient either *local* attention or *transient-global* (*tglobal*) attention. @@ -83,14 +90,19 @@ The complexity of this mechanism is `O(l(r + l/k))`. >>> rouge.compute(predictions=result["predicted_abstract"], references=result["abstract"]) ``` -This model was contributed by [stancld](https://huggingface.co/stancld). -The original code can be found [here](https://github.com/google-research/longt5). 
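As a complement to the tips and the evaluation example above, here is a minimal generation sketch; it is not part of the original file, and the `google/long-t5-tglobal-base` checkpoint and the toy input are assumptions:

```python
from transformers import AutoTokenizer, LongT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")
model = LongT5ForConditionalGeneration.from_pretrained("google/long-t5-tglobal-base")

# Transient-global attention lets the encoder handle inputs far longer than vanilla T5
long_article = "A very long scientific article. " * 500  # placeholder long input
inputs = tokenizer(long_article, truncation=True, max_length=16384, return_tensors="pt")

summary_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

In practice you would pick a checkpoint already fine-tuned for summarization (or fine-tune one yourself) before relying on the generated text.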
+## Resources + +- [Translation task guide](../tasks/translation) +- [Summarization task guide](../tasks/summarization) ## LongT5Config [[autodoc]] LongT5Config + + + ## LongT5Model [[autodoc]] LongT5Model @@ -106,6 +118,9 @@ The original code can be found [here](https://github.com/google-research/longt5) [[autodoc]] LongT5EncoderModel - forward + + + ## FlaxLongT5Model [[autodoc]] FlaxLongT5Model @@ -119,3 +134,6 @@ The original code can be found [here](https://github.com/google-research/longt5) - __call__ - encode - decode + + + diff --git a/docs/source/en/model_doc/luke.mdx b/docs/source/en/model_doc/luke.md similarity index 90% rename from docs/source/en/model_doc/luke.mdx rename to docs/source/en/model_doc/luke.md index b7483f9194e02f..4e070b1c4bac3e 100644 --- a/docs/source/en/model_doc/luke.mdx +++ b/docs/source/en/model_doc/luke.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # LUKE @@ -33,7 +37,9 @@ state-of-the-art results on five well-known datasets: Open Entity (entity typing CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), and SQuAD 1.1 (extractive question answering).* -Tips: +This model was contributed by [ikuyamada](https://huggingface.co/ikuyamada) and [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/studio-ousia/luke). + +## Usage tips - This implementation is the same as [`RobertaModel`] with the addition of entity embeddings as well as an entity-aware self-attention mechanism, which improves performance on tasks involving reasoning about entities. @@ -71,13 +77,7 @@ Tips: head models by specifying `task="entity_classification"`, `task="entity_pair_classification"`, or `task="entity_span_classification"`. Please refer to the example code of each head models. - A demo notebook on how to fine-tune [`LukeForEntityPairClassification`] for relation - classification can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LUKE). - - There are also 3 notebooks available, which showcase how you can reproduce the results as reported in the paper with - the HuggingFace implementation of LUKE. They can be found [here](https://github.com/studio-ousia/luke/tree/master/notebooks). - -Example: +Usage example: ```python >>> from transformers import LukeTokenizer, LukeModel, LukeForEntityPairClassification @@ -115,8 +115,15 @@ Example: >>> print("Predicted class:", model.config.id2label[predicted_class_idx]) ``` -This model was contributed by [ikuyamada](https://huggingface.co/ikuyamada) and [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/studio-ousia/luke). 
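Since the example above covers the entity-pair head, here is an equally small sketch for the single-entity head; it is not from the original page, and the `studio-ousia/luke-large-finetuned-open-entity` checkpoint is an assumption:

```python
from transformers import LukeTokenizer, LukeForEntityClassification

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-large-finetuned-open-entity")
model = LukeForEntityClassification.from_pretrained("studio-ousia/luke-large-finetuned-open-entity")

text = "Beyoncé lives in Los Angeles."
entity_spans = [(0, 7)]  # character-level span of "Beyoncé"

inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
logits = model(**inputs).logits
print("Predicted class:", model.config.id2label[logits.argmax(-1).item()])
```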
+## Resources +- [A demo notebook on how to fine-tune [`LukeForEntityPairClassification`] for relation classification](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LUKE) +- [Notebooks showcasing how you to reproduce the results as reported in the paper with the HuggingFace implementation of LUKE](https://github.com/studio-ousia/luke/tree/master/notebooks) +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) ## LukeConfig diff --git a/docs/source/en/model_doc/lxmert.mdx b/docs/source/en/model_doc/lxmert.md similarity index 93% rename from docs/source/en/model_doc/lxmert.mdx rename to docs/source/en/model_doc/lxmert.md index 51a5be07d7ed90..435994196b43ba 100644 --- a/docs/source/en/model_doc/lxmert.mdx +++ b/docs/source/en/model_doc/lxmert.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # LXMERT @@ -37,7 +41,9 @@ best result by 22% absolute (54% to 76%). Lastly, we demonstrate detailed ablati model components and pretraining strategies significantly contribute to our strong results; and also present several attention visualizations for the different encoders* -Tips: +This model was contributed by [eltoto1219](https://huggingface.co/eltoto1219). The original code can be found [here](https://github.com/airsplay/lxmert). + +## Usage tips - Bounding boxes are not necessary to be used in the visual feature embeddings, any kind of visual-spacial features will work. @@ -49,8 +55,9 @@ Tips: contains self-attention for each respective modality and cross-attention, only the cross attention is returned and both self attention outputs are disregarded. -This model was contributed by [eltoto1219](https://huggingface.co/eltoto1219). The original code can be found [here](https://github.com/airsplay/lxmert). +## Resources +- [Question answering task guide](../tasks/question_answering) ## LxmertConfig @@ -76,6 +83,9 @@ This model was contributed by [eltoto1219](https://huggingface.co/eltoto1219). T [[autodoc]] models.lxmert.modeling_tf_lxmert.TFLxmertForPreTrainingOutput + + + ## LxmertModel [[autodoc]] LxmertModel @@ -91,6 +101,9 @@ This model was contributed by [eltoto1219](https://huggingface.co/eltoto1219). T [[autodoc]] LxmertForQuestionAnswering - forward + + + ## TFLxmertModel [[autodoc]] TFLxmertModel @@ -100,3 +113,6 @@ This model was contributed by [eltoto1219](https://huggingface.co/eltoto1219). 
T [[autodoc]] TFLxmertForPreTraining - call + + + diff --git a/docs/source/en/model_doc/m2m_100.mdx b/docs/source/en/model_doc/m2m_100.md similarity index 86% rename from docs/source/en/model_doc/m2m_100.mdx rename to docs/source/en/model_doc/m2m_100.md index 10ac6a9df918cd..fa808c2e94bbfd 100644 --- a/docs/source/en/model_doc/m2m_100.mdx +++ b/docs/source/en/model_doc/m2m_100.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # M2M100 @@ -34,7 +38,7 @@ open-source our scripts so that others may reproduce the data, evaluation, and f This model was contributed by [valhalla](https://huggingface.co/valhalla). -### Training and Generation +## Usage tips and examples M2M100 is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation tasks. As the model is multilingual it expects the sequences in a certain format: A special language id token is used as prefix in both the @@ -44,7 +48,7 @@ id for source text and target language id for target text, with `X` being the so The [`M2M100Tokenizer`] depends on `sentencepiece` so be sure to install it before running the examples. To install `sentencepiece` run `pip install sentencepiece`. -- Supervised Training +**Supervised Training** ```python from transformers import M2M100Config, M2M100ForConditionalGeneration, M2M100Tokenizer @@ -60,12 +64,12 @@ model_inputs = tokenizer(src_text, text_target=tgt_text, return_tensors="pt") loss = model(**model_inputs).loss # forward pass ``` -- Generation +**Generation** - M2M100 uses the `eos_token_id` as the `decoder_start_token_id` for generation with the target language id - being forced as the first generated token. To force the target language id as the first generated token, pass the - *forced_bos_token_id* parameter to the *generate* method. The following example shows how to translate between - Hindi to French and Chinese to English using the *facebook/m2m100_418M* checkpoint. +M2M100 uses the `eos_token_id` as the `decoder_start_token_id` for generation with the target language id +being forced as the first generated token. To force the target language id as the first generated token, pass the +*forced_bos_token_id* parameter to the *generate* method. The following example shows how to translate between +Hindi to French and Chinese to English using the *facebook/m2m100_418M* checkpoint. ```python >>> from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer @@ -91,6 +95,11 @@ loss = model(**model_inputs).loss # forward pass "Life is like a box of chocolate." 
``` +## Resources + +- [Translation task guide](../tasks/translation) +- [Summarization task guide](../tasks/summarization) + ## M2M100Config [[autodoc]] M2M100Config diff --git a/docs/source/en/model_doc/madlad-400.md b/docs/source/en/model_doc/madlad-400.md new file mode 100644 index 00000000000000..aeb41938499c29 --- /dev/null +++ b/docs/source/en/model_doc/madlad-400.md @@ -0,0 +1,68 @@ + + +# MADLAD-400 + +## Overview + +MADLAD-400 models were released in the paper [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](MADLAD-400: A Multilingual And Document-Level Large Audited Dataset). + +The abstract from the paper is the following: + +*We introduce MADLAD-400, a manually audited, general domain 3T token monolingual dataset based on CommonCrawl, spanning 419 languages. We discuss +the limitations revealed by self-auditing MADLAD-400, and the role data auditing +had in the dataset creation process. We then train and release a 10.7B-parameter +multilingual machine translation model on 250 billion tokens covering over 450 +languages using publicly available data, and find that it is competitive with models +that are significantly larger, and report the results on different domains. In addition, we train a 8B-parameter language model, and assess the results on few-shot +translation. We make the baseline models 1 +available to the research community.* + +This model was added by [Juarez Bochi](https://huggingface.co/jbochi). The original checkpoints can be found [here](https://github.com/google-research/google-research/tree/master/madlad_400). + +This is a machine translation model that supports many low-resource languages, and that is competitive with models that are significantly larger. + +One can directly use MADLAD-400 weights without finetuning the model: + +```python +>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer + +>>> model = AutoModelForSeq2SeqLM.from_pretrained("google/madlad400-3b-mt") +>>> tokenizer = AutoTokenizer.from_pretrained("google/madlad400-3b-mt") + +>>> inputs = tokenizer("<2pt> I love pizza!", return_tensors="pt") +>>> outputs = model.generate(**inputs) +>>> print(tokenizer.batch_decode(outputs, skip_special_tokens=True)) +['Eu amo pizza!'] +``` + +Google has released the following variants: + +- [google/madlad400-3b-mt](https://huggingface.co/google/madlad400-3b-mt) + +- [google/madlad400-7b-mt](https://huggingface.co/google/madlad400-7b-mt) + +- [google/madlad400-7b-mt-bt](https://huggingface.co/google/madlad400-7b-mt-bt) + +- [google/madlad400-10b-mt](https://huggingface.co/google/madlad400-10b-mt) + +The original checkpoints can be found [here](https://github.com/google-research/google-research/tree/master/madlad_400). + + + +Refer to [T5's documentation page](t5) for all API references, code examples, and notebooks. For more details regarding training and evaluation of the MADLAD-400, refer to the model card. + + diff --git a/docs/source/en/model_doc/marian.mdx b/docs/source/en/model_doc/marian.md similarity index 91% rename from docs/source/en/model_doc/marian.mdx rename to docs/source/en/model_doc/marian.md index 07240ccbc75a67..8078ea1427c952 100644 --- a/docs/source/en/model_doc/marian.mdx +++ b/docs/source/en/model_doc/marian.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # MarianMT @@ -21,14 +25,11 @@ specific language governing permissions and limitations under the License. -**Bugs:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title) -and assign @patrickvonplaten. - -Translations should be similar, but not identical to output in the test set linked to in each model card. +## Overview -Tips: +A framework for translation models, using the same models as BART. Translations should be similar, but not identical to output in the test set linked to in each model card. +This model was contributed by [sshleifer](https://huggingface.co/sshleifer). -- A framework for translation models, using the same models as BART. ## Implementation Notes @@ -45,7 +46,7 @@ Tips: - the model starts generating with `pad_token_id` (which has 0 as a token_embedding) as the prefix (Bart uses ``), - Code to bulk convert models can be found in `convert_marian_to_pytorch.py`. -- This model was contributed by [sshleifer](https://huggingface.co/sshleifer). + ## Naming @@ -161,6 +162,12 @@ Example of translating english to many romance languages, using old-style 2 char 'Y esto al español'] ``` +## Resources + +- [Translation task guide](../tasks/translation) +- [Summarization task guide](../tasks/summarization) +- [Causal language modeling task guide](../tasks/language_modeling) + ## MarianConfig [[autodoc]] MarianConfig @@ -170,6 +177,9 @@ Example of translating english to many romance languages, using old-style 2 char [[autodoc]] MarianTokenizer - build_inputs_with_special_tokens + + + ## MarianModel [[autodoc]] MarianModel @@ -185,6 +195,9 @@ Example of translating english to many romance languages, using old-style 2 char [[autodoc]] MarianForCausalLM - forward + + + ## TFMarianModel [[autodoc]] TFMarianModel @@ -195,6 +208,9 @@ Example of translating english to many romance languages, using old-style 2 char [[autodoc]] TFMarianMTModel - call + + + ## FlaxMarianModel [[autodoc]] FlaxMarianModel @@ -204,3 +220,6 @@ Example of translating english to many romance languages, using old-style 2 char [[autodoc]] FlaxMarianMTModel - __call__ + + + diff --git a/docs/source/en/model_doc/markuplm.mdx b/docs/source/en/model_doc/markuplm.md similarity index 94% rename from docs/source/en/model_doc/markuplm.mdx rename to docs/source/en/model_doc/markuplm.md index f4deb6d873cd32..e52ff3157eac2b 100644 --- a/docs/source/en/model_doc/markuplm.mdx +++ b/docs/source/en/model_doc/markuplm.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # MarkupLM @@ -21,9 +25,9 @@ performance, similar to [LayoutLM](layoutlm). The model can be used for tasks like question answering on web pages or information extraction from web pages. 
It obtains state-of-the-art results on 2 important benchmarks: -- [WebSRC](https://x-lance.github.io/WebSRC/), a dataset for Web-Based Structual Reading Comprehension (a bit like SQuAD but for web pages) +- [WebSRC](https://x-lance.github.io/WebSRC/), a dataset for Web-Based Structural Reading Comprehension (a bit like SQuAD but for web pages) - [SWDE](https://www.researchgate.net/publication/221299838_From_one_tree_to_a_forest_a_unified_solution_for_structured_web_data_extraction), a dataset -for information extraction from web pages (basically named-entity recogntion on web pages) +for information extraction from web pages (basically named-entity recognition on web pages) The abstract from the paper is the following: @@ -36,19 +40,19 @@ HTML/XML-based documents, where text and markup information is jointly pre-train pre-trained MarkupLM significantly outperforms the existing strong baseline models on several document understanding tasks. The pre-trained model and code will be publicly available.* -Tips: +This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/markuplm). + +## Usage tips + - In addition to `input_ids`, [`~MarkupLMModel.forward`] expects 2 additional inputs, namely `xpath_tags_seq` and `xpath_subs_seq`. These are the XPATH tags and subscripts respectively for each token in the input sequence. - One can use [`MarkupLMProcessor`] to prepare all data for the model. Refer to the [usage guide](#usage-markuplmprocessor) for more info. -- Demo notebooks can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/MarkupLM). drawing MarkupLM architecture. Taken from the original paper. -This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/markuplm). - ## Usage: MarkupLMProcessor The easiest way to prepare data for the model is to use [`MarkupLMProcessor`], which internally combines a feature extractor @@ -193,6 +197,13 @@ all nodes and xpaths yourself, you can provide them directly to the processor. M dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'xpath_tags_seq', 'xpath_subs_seq']) ``` +## Resources + +- [Demo notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/MarkupLM) +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) + ## MarkupLMConfig [[autodoc]] MarkupLMConfig diff --git a/docs/source/en/model_doc/mask2former.mdx b/docs/source/en/model_doc/mask2former.md similarity index 96% rename from docs/source/en/model_doc/mask2former.mdx rename to docs/source/en/model_doc/mask2former.md index f0d43ba78f3436..bd5ab80728eb48 100644 --- a/docs/source/en/model_doc/mask2former.mdx +++ b/docs/source/en/model_doc/mask2former.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. 
+ --> # Mask2Former @@ -21,16 +25,17 @@ The abstract from the paper is the following: *Image segmentation groups pixels with different semantics, e.g., category or instance membership. Each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).* -Tips: -- Mask2Former uses the same preprocessing and postprocessing steps as [MaskFormer](maskformer). Use [`Mask2FormerImageProcessor`] or [`AutoImageProcessor`] to prepare images and optional targets for the model. -- To get the final segmentation, depending on the task, you can call [`~Mask2FormerImageProcessor.post_process_semantic_segmentation`] or [`~Mask2FormerImageProcessor.post_process_instance_segmentation`] or [`~Mask2FormerImageProcessor.post_process_panoptic_segmentation`]. All three tasks can be solved using [`Mask2FormerForUniversalSegmentation`] output, panoptic segmentation accepts an optional `label_ids_to_fuse` argument to fuse instances of the target object/s (e.g. sky) together. - drawing Mask2Former architecture. Taken from the original paper. This model was contributed by [Shivalika Singh](https://huggingface.co/shivi) and [Alara Dirik](https://huggingface.co/adirik). The original code can be found [here](https://github.com/facebookresearch/Mask2Former). +## Usage tips + +- Mask2Former uses the same preprocessing and postprocessing steps as [MaskFormer](maskformer). Use [`Mask2FormerImageProcessor`] or [`AutoImageProcessor`] to prepare images and optional targets for the model. +- To get the final segmentation, depending on the task, you can call [`~Mask2FormerImageProcessor.post_process_semantic_segmentation`] or [`~Mask2FormerImageProcessor.post_process_instance_segmentation`] or [`~Mask2FormerImageProcessor.post_process_panoptic_segmentation`]. All three tasks can be solved using [`Mask2FormerForUniversalSegmentation`] output, panoptic segmentation accepts an optional `label_ids_to_fuse` argument to fuse instances of the target object/s (e.g. sky) together. + ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Mask2Former. @@ -40,16 +45,16 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we will review it. The resource should ideally demonstrate something new instead of duplicating an existing resource. 
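To make the pre- and post-processing tips above concrete, here is a short inference sketch; it is not part of the original page, and the `facebook/mask2former-swin-small-coco-panoptic` checkpoint and sample image are assumptions:

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

processor = AutoImageProcessor.from_pretrained("facebook/mask2former-swin-small-coco-panoptic")
model = Mask2FormerForUniversalSegmentation.from_pretrained("facebook/mask2former-swin-small-coco-panoptic")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Post-process into a panoptic segmentation map at the original image resolution
result = processor.post_process_panoptic_segmentation(outputs, target_sizes=[image.size[::-1]])[0]
segmentation_map = result["segmentation"]  # (height, width) tensor of segment ids
```

Swapping in `post_process_semantic_segmentation` or `post_process_instance_segmentation` on the same `outputs` covers the other two tasks.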
+## Mask2FormerConfig + +[[autodoc]] Mask2FormerConfig + ## MaskFormer specific outputs [[autodoc]] models.mask2former.modeling_mask2former.Mask2FormerModelOutput [[autodoc]] models.mask2former.modeling_mask2former.Mask2FormerForUniversalSegmentationOutput -## Mask2FormerConfig - -[[autodoc]] Mask2FormerConfig - ## Mask2FormerModel [[autodoc]] Mask2FormerModel diff --git a/docs/source/en/model_doc/maskformer.mdx b/docs/source/en/model_doc/maskformer.md similarity index 92% rename from docs/source/en/model_doc/maskformer.mdx rename to docs/source/en/model_doc/maskformer.md index 5620c803bf2851..4d31b2829d10f2 100644 --- a/docs/source/en/model_doc/maskformer.mdx +++ b/docs/source/en/model_doc/maskformer.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # MaskFormer @@ -27,20 +31,21 @@ The abstract from the paper is the following: *Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.* -Tips: -- MaskFormer's Transformer decoder is identical to the decoder of [DETR](detr). During training, the authors of DETR did find it helpful to use auxiliary losses in the decoder, especially to help the model output the correct number of objects of each class. If you set the parameter `use_auxilary_loss` of [`MaskFormerConfig`] to `True`, then prediction feedforward neural networks and Hungarian losses are added after each decoder layer (with the FFNs sharing parameters). -- If you want to train the model in a distributed environment across multiple nodes, then one should update the - `get_num_masks` function inside in the `MaskFormerLoss` class of `modeling_maskformer.py`. When training on multiple nodes, this should be - set to the average number of target masks across all nodes, as can be seen in the original implementation [here](https://github.com/facebookresearch/MaskFormer/blob/da3e60d85fdeedcb31476b5edd7d328826ce56cc/mask_former/modeling/criterion.py#L169). -- One can use [`MaskFormerImageProcessor`] to prepare images for the model and optional targets for the model. 
-- To get the final segmentation, depending on the task, you can call [`~MaskFormerImageProcessor.post_process_semantic_segmentation`] or [`~MaskFormerImageProcessor.post_process_panoptic_segmentation`]. Both tasks can be solved using [`MaskFormerForInstanceSegmentation`] output, panoptic segmentation accepts an optional `label_ids_to_fuse` argument to fuse instances of the target object/s (e.g. sky) together. - The figure below illustrates the architecture of MaskFormer. Taken from the [original paper](https://arxiv.org/abs/2107.06278). This model was contributed by [francesco](https://huggingface.co/francesco). The original code can be found [here](https://github.com/facebookresearch/MaskFormer). +## Usage tips + +- MaskFormer's Transformer decoder is identical to the decoder of [DETR](detr). During training, the authors of DETR did find it helpful to use auxiliary losses in the decoder, especially to help the model output the correct number of objects of each class. If you set the parameter `use_auxiliary_loss` of [`MaskFormerConfig`] to `True`, then prediction feedforward neural networks and Hungarian losses are added after each decoder layer (with the FFNs sharing parameters). +- If you want to train the model in a distributed environment across multiple nodes, then one should update the + `get_num_masks` function inside in the `MaskFormerLoss` class of `modeling_maskformer.py`. When training on multiple nodes, this should be + set to the average number of target masks across all nodes, as can be seen in the original implementation [here](https://github.com/facebookresearch/MaskFormer/blob/da3e60d85fdeedcb31476b5edd7d328826ce56cc/mask_former/modeling/criterion.py#L169). +- One can use [`MaskFormerImageProcessor`] to prepare images for the model and optional targets for the model. +- To get the final segmentation, depending on the task, you can call [`~MaskFormerImageProcessor.post_process_semantic_segmentation`] or [`~MaskFormerImageProcessor.post_process_panoptic_segmentation`]. Both tasks can be solved using [`MaskFormerForInstanceSegmentation`] output, panoptic segmentation accepts an optional `label_ids_to_fuse` argument to fuse instances of the target object/s (e.g. sky) together. + ## Resources diff --git a/docs/source/en/model_doc/matcha.md b/docs/source/en/model_doc/matcha.md new file mode 100644 index 00000000000000..d4ee3305936741 --- /dev/null +++ b/docs/source/en/model_doc/matcha.md @@ -0,0 +1,76 @@ + + +# MatCha + +## Overview + +MatCha has been proposed in the paper [MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering](https://arxiv.org/abs/2212.09662), from Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, Julian Martin Eisenschlos. + +The abstract of the paper states the following: + +*Visual language data such as plots, charts, and infographics are ubiquitous in the human world. However, state-of-the-art vision-language models do not perform well on these data. We propose MatCha (Math reasoning and Chart derendering pretraining) to enhance visual language models' capabilities in jointly modeling charts/plots and language data. Specifically, we propose several pretraining tasks that cover plot deconstruction and numerical reasoning which are the key capabilities in visual language modeling. We perform the MatCha pretraining starting from Pix2Struct, a recently proposed image-to-text visual language model. 
On standard benchmarks such as PlotQA and ChartQA, the MatCha model outperforms state-of-the-art methods by as much as nearly 20%. We also examine how well MatCha pretraining transfers to domains such as screenshots, textbook diagrams, and document figures and observe overall improvement, verifying the usefulness of MatCha pretraining on broader visual language tasks.* + +## Model description + +MatCha is a model that is trained using `Pix2Struct` architecture. You can find more information about `Pix2Struct` in the [Pix2Struct documentation](https://huggingface.co/docs/transformers/main/en/model_doc/pix2struct). +MatCha is a Visual Question Answering subset of the `Pix2Struct` architecture. It renders the input question on the image and predicts the answer. + +## Usage + +Currently 6 checkpoints are available for MatCha: + +- `google/matcha`: the base MatCha model, used to fine-tune MatCha on downstream tasks +- `google/matcha-chartqa`: MatCha model fine-tuned on the ChartQA dataset. It can be used to answer questions about charts. +- `google/matcha-plotqa-v1`: MatCha model fine-tuned on the PlotQA dataset. It can be used to answer questions about plots. +- `google/matcha-plotqa-v2`: MatCha model fine-tuned on the PlotQA dataset. It can be used to answer questions about plots. +- `google/matcha-chart2text-statista`: MatCha model fine-tuned on the Statista dataset. +- `google/matcha-chart2text-pew`: MatCha model fine-tuned on the Pew dataset. + +The models finetuned on `chart2text-pew` and `chart2text-statista` are better suited for summarization, whereas the models finetuned on `plotqa` and `chartqa` are better suited for question answering. + +You can use these models as follows (example on the ChartQA dataset): + +```python +from transformers import AutoProcessor, Pix2StructForConditionalGeneration +import requests +from PIL import Image + +model = Pix2StructForConditionalGeneration.from_pretrained("google/matcha-chartqa").to(0) +processor = AutoProcessor.from_pretrained("google/matcha-chartqa") +url = "https://raw.githubusercontent.com/vis-nlp/ChartQA/main/ChartQA%20Dataset/val/png/20294671002019.png" +image = Image.open(requests.get(url, stream=True).raw) + +inputs = processor(images=image, text="Is the sum of all 4 places greater than Laos?", return_tensors="pt").to(0) +predictions = model.generate(**inputs, max_new_tokens=512) +print(processor.decode(predictions[0], skip_special_tokens=True)) +``` + +## Fine-tuning + +To fine-tune MatCha, refer to the pix2struct [fine-tuning notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_pix2struct.ipynb). For `Pix2Struct` models, we have found that fine-tuning the model with Adafactor and a cosine learning rate scheduler leads to faster convergence: +```python +from transformers.optimization import Adafactor, get_cosine_schedule_with_warmup + +# `model` is the `Pix2StructForConditionalGeneration` instance being fine-tuned +optimizer = Adafactor(model.parameters(), scale_parameter=False, relative_step=False, lr=0.01, weight_decay=1e-05) +scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=1000, num_training_steps=40000) +``` + + + +MatCha is a model that is trained using `Pix2Struct` architecture. You can find more information about `Pix2Struct` in the [Pix2Struct documentation](https://huggingface.co/docs/transformers/main/en/model_doc/pix2struct).
+ + \ No newline at end of file diff --git a/docs/source/en/model_doc/mbart.mdx b/docs/source/en/model_doc/mbart.md similarity index 93% rename from docs/source/en/model_doc/mbart.mdx rename to docs/source/en/model_doc/mbart.md index 26835baf73609e..e7fc0bd53efa9b 100644 --- a/docs/source/en/model_doc/mbart.mdx +++ b/docs/source/en/model_doc/mbart.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # MBart and MBart-50 @@ -21,8 +25,6 @@ specific language governing permissions and limitations under the License. -**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign -@patrickvonplaten ## Overview of MBart @@ -152,6 +154,15 @@ tokenizer.batch_decode(generated_tokens, skip_special_tokens=True) # => "The Secretary-General of the United Nations says there is no military solution in Syria." ``` +## Documentation resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Question answering task guide](../tasks/question_answering) +- [Causal language modeling task guide](../tasks/language_modeling) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Translation task guide](../tasks/translation) +- [Summarization task guide](../tasks/summarization) + ## MBartConfig [[autodoc]] MBartConfig @@ -173,6 +184,9 @@ tokenizer.batch_decode(generated_tokens, skip_special_tokens=True) [[autodoc]] MBart50TokenizerFast + + + ## MBartModel [[autodoc]] MBartModel @@ -194,6 +208,9 @@ tokenizer.batch_decode(generated_tokens, skip_special_tokens=True) [[autodoc]] MBartForCausalLM - forward + + + ## TFMBartModel [[autodoc]] TFMBartModel @@ -204,6 +221,9 @@ tokenizer.batch_decode(generated_tokens, skip_special_tokens=True) [[autodoc]] TFMBartForConditionalGeneration - call + + + ## FlaxMBartModel [[autodoc]] FlaxMBartModel @@ -231,3 +251,6 @@ tokenizer.batch_decode(generated_tokens, skip_special_tokens=True) - __call__ - encode - decode + + + diff --git a/docs/source/en/model_doc/mctct.mdx b/docs/source/en/model_doc/mctct.md similarity index 80% rename from docs/source/en/model_doc/mctct.mdx rename to docs/source/en/model_doc/mctct.md index 690714ded6136f..7cf1a68f12e480 100644 --- a/docs/source/en/model_doc/mctct.mdx +++ b/docs/source/en/model_doc/mctct.md @@ -8,10 +8,23 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # M-CTC-T + + +This model is in maintenance mode only, so we won't accept any new PRs changing its code. 
+ +If you run into any issues running this model, please reinstall the last version that supported this model: v4.30.0. +You can do so by running the following command: `pip install -U transformers==4.30.0`. + + + ## Overview The M-CTC-T model was proposed in [Pseudo-Labeling For Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161) by Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, and Ronan Collobert. The model is a 1B-param transformer encoder, with a CTC head over 8065 character labels and a language identification head over 60 language ID labels. It is trained on Common Voice (version 6.1, December 2020 release) and VoxPopuli. After training on Common Voice and VoxPopuli, the model is trained on Common Voice only. The labels are unnormalized character-level transcripts (punctuation and capitalization are not removed). The model takes as input Mel filterbank features from a 16Khz audio signal. @@ -27,14 +40,15 @@ pseudo-labels for all languages, either from scratch or by fine-tuning. Experime Common Voice and unlabeled VoxPopuli datasets show that our recipe can yield a model with better performance for many languages that also transfers well to LibriSpeech.* - - This model was contributed by [cwkeam](https://huggingface.co/cwkeam). The original code can be found [here](https://github.com/flashlight/wav2letter/tree/main/recipes/mling_pl). +## Usage tips -Tips: +The PyTorch version of this model is only available in torch 1.9 and higher. -- The PyTorch version of this model is only available in torch 1.9 and higher. +## Resources + +- [Automatic speech recognition task guide](../tasks/asr) ## MCTCTConfig @@ -54,7 +68,6 @@ Tips: - batch_decode - decode - ## MCTCTModel [[autodoc]] MCTCTModel diff --git a/docs/source/en/model_doc/mega.md b/docs/source/en/model_doc/mega.md new file mode 100644 index 00000000000000..4ce62ca45a1d74 --- /dev/null +++ b/docs/source/en/model_doc/mega.md @@ -0,0 +1,84 @@ + + +# MEGA + +## Overview + +The MEGA model was proposed in [Mega: Moving Average Equipped Gated Attention](https://arxiv.org/abs/2209.10655) by Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. +MEGA proposes a new approach to self-attention with each encoder layer having a multi-headed exponential moving average in addition to a single head of standard dot-product attention, giving the attention mechanism +stronger positional biases. This allows MEGA to perform competitively to Transformers on standard benchmarks including LRA +while also having significantly fewer parameters. MEGA's compute efficiency allows it to scale to very long sequences, making it an +attractive option for long-document NLP tasks. + +The abstract from the paper is the following: + + *The design choices in the Transformer attention mechanism, including weak inductive bias and quadratic computational complexity, have limited its application for modeling long sequences. In this paper, we introduce Mega, a simple, theoretically grounded, single-head gated attention mechanism equipped with (exponential) moving average to incorporate inductive bias of position-aware local dependencies into the position-agnostic attention mechanism. We further propose a variant of Mega that offers linear time and space complexity yet yields only minimal quality loss, by efficiently splitting the whole sequence into multiple chunks with fixed length. 
Extensive experiments on a wide range of sequence modeling benchmarks, including the Long Range Arena, neural machine translation, auto-regressive language modeling, and image and speech classification, show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models. * + +This model was contributed by [mnaylor](https://huggingface.co/mnaylor). +The original code can be found [here](https://github.com/facebookresearch/mega). + + +## Usage tips + +- MEGA can perform quite well with relatively few parameters. See Appendix D in the MEGA paper for examples of architectural specs which perform well in various settings. If using MEGA as a decoder, be sure to set `bidirectional=False` to avoid errors with default bidirectional. +- Mega-chunk is a variant of mega that reduces time and spaces complexity from quadratic to linear. Utilize chunking with MegaConfig.use_chunking and control chunk size with MegaConfig.chunk_size + + +## Implementation Notes + +- The original implementation of MEGA had an inconsistent expectation of attention masks for padding and causal self-attention between the softmax attention and Laplace/squared ReLU method. This implementation addresses that inconsistency. +- The original implementation did not include token type embeddings; this implementation adds support for these, with the option controlled by MegaConfig.add_token_type_embeddings + + +## MegaConfig + +[[autodoc]] MegaConfig + +## MegaModel + +[[autodoc]] MegaModel + - forward + +## MegaForCausalLM + +[[autodoc]] MegaForCausalLM + - forward + +## MegaForMaskedLM + +[[autodoc]] MegaForMaskedLM + - forward + +## MegaForSequenceClassification + +[[autodoc]] MegaForSequenceClassification + - forward + +## MegaForMultipleChoice + +[[autodoc]] MegaForMultipleChoice + - forward + +## MegaForTokenClassification + +[[autodoc]] MegaForTokenClassification + - forward + +## MegaForQuestionAnswering + +[[autodoc]] MegaForQuestionAnswering + - forward diff --git a/docs/source/en/model_doc/megatron-bert.mdx b/docs/source/en/model_doc/megatron-bert.md similarity index 86% rename from docs/source/en/model_doc/megatron-bert.mdx rename to docs/source/en/model_doc/megatron-bert.md index 911bf76aec2789..67000c8b843f8a 100644 --- a/docs/source/en/model_doc/megatron-bert.mdx +++ b/docs/source/en/model_doc/megatron-bert.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # MegatronBERT @@ -36,7 +40,11 @@ achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15. accuracy of 63.2%) datasets. Our BERT model achieves SOTA results on the RACE dataset (90.9% compared to SOTA accuracy of 89.4%).* -Tips: +This model was contributed by [jdemouth](https://huggingface.co/jdemouth). The original code can be found [here](https://github.com/NVIDIA/Megatron-LM). +That repository contains a multi-GPU and multi-node implementation of the Megatron Language models. 
In particular, +it contains a hybrid model parallel approach using "tensor parallel" and "pipeline parallel" techniques. + +## Usage tips We have provided pretrained [BERT-345M](https://ngc.nvidia.com/catalog/models/nvidia:megatron_bert_345m) checkpoints for use to evaluate or finetuning downstream tasks. @@ -74,9 +82,14 @@ python3 $PATH_TO_TRANSFORMERS/models/megatron_bert/convert_megatron_bert_checkpo python3 $PATH_TO_TRANSFORMERS/models/megatron_bert/convert_megatron_bert_checkpoint.py megatron_bert_345m_v0_1_cased.zip ``` -This model was contributed by [jdemouth](https://huggingface.co/jdemouth). The original code can be found [here](https://github.com/NVIDIA/Megatron-LM). That repository contains a multi-GPU and multi-node implementation of the -Megatron Language models. In particular, it contains a hybrid model parallel approach using "tensor parallel" and -"pipeline parallel" techniques. +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Causal language modeling task guide](../tasks/language_modeling) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) ## MegatronBertConfig diff --git a/docs/source/en/model_doc/megatron_gpt2.mdx b/docs/source/en/model_doc/megatron_gpt2.md similarity index 86% rename from docs/source/en/model_doc/megatron_gpt2.mdx rename to docs/source/en/model_doc/megatron_gpt2.md index a0d91e5a163055..284fd372c0e068 100644 --- a/docs/source/en/model_doc/megatron_gpt2.mdx +++ b/docs/source/en/model_doc/megatron_gpt2.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # MegatronGPT2 @@ -36,7 +40,11 @@ achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15. accuracy of 63.2%) datasets. Our BERT model achieves SOTA results on the RACE dataset (90.9% compared to SOTA accuracy of 89.4%).* -Tips: +This model was contributed by [jdemouth](https://huggingface.co/jdemouth). The original code can be found [here](https://github.com/NVIDIA/Megatron-LM). +That repository contains a multi-GPU and multi-node implementation of the Megatron Language models. In particular, it +contains a hybrid model parallel approach using "tensor parallel" and "pipeline parallel" techniques. + +## Usage tips We have provided pretrained [GPT2-345M](https://ngc.nvidia.com/catalog/models/nvidia:megatron_lm_345m) checkpoints for use to evaluate or finetuning downstream tasks. @@ -61,7 +69,9 @@ The following command allows you to do the conversion. We assume that the folder python3 $PATH_TO_TRANSFORMERS/models/megatron_gpt2/convert_megatron_gpt2_checkpoint.py megatron_gpt2_345m_v0_0.zip ``` -This model was contributed by [jdemouth](https://huggingface.co/jdemouth). The original code can be found [here](https://github.com/NVIDIA/Megatron-LM). That repository contains a multi-GPU and multi-node implementation of the -Megatron Language models. 
In particular, it contains a hybrid model parallel approach using "tensor parallel" and -"pipeline parallel" techniques. + + + MegatronGPT2 architecture is the same as OpenAI GPT-2 . Refer to [GPT-2 documentation](gpt2) for information on + configuration classes and their parameters. + \ No newline at end of file diff --git a/docs/source/en/model_doc/mgp-str.md b/docs/source/en/model_doc/mgp-str.md new file mode 100644 index 00000000000000..d4152e92b2eccb --- /dev/null +++ b/docs/source/en/model_doc/mgp-str.md @@ -0,0 +1,88 @@ + + +# MGP-STR + +## Overview + +The MGP-STR model was proposed in [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) by Peng Wang, Cheng Da, and Cong Yao. MGP-STR is a conceptually **simple** yet **powerful** vision Scene Text Recognition (STR) model, which is built upon the [Vision Transformer (ViT)](vit). To integrate linguistic knowledge, Multi-Granularity Prediction (MGP) strategy is proposed to inject information from the language modality into the model in an implicit way. + +The abstract from the paper is the following: + +*Scene text recognition (STR) has been an active research topic in computer vision for years. To tackle this challenging problem, numerous innovative methods have been successively proposed and incorporating linguistic knowledge into STR models has recently become a prominent trend. In this work, we first draw inspiration from the recent progress in Vision Transformer (ViT) to construct a conceptually simple yet powerful vision STR model, which is built upon ViT and outperforms previous state-of-the-art models for scene text recognition, including both pure vision models and language-augmented methods. To integrate linguistic knowledge, we further propose a Multi-Granularity Prediction strategy to inject information from the language modality into the model in an implicit way, i.e. , subword representations (BPE and WordPiece) widely-used in NLP are introduced into the output space, in addition to the conventional character level representation, while no independent language model (LM) is adopted. The resultant algorithm (termed MGP-STR) is able to push the performance envelop of STR to an even higher level. Specifically, it achieves an average recognition accuracy of 93.35% on standard benchmarks.* + + + + MGP-STR architecture. Taken from the original paper. + +MGP-STR is trained on two synthetic datasets [MJSynth]((http://www.robots.ox.ac.uk/~vgg/data/text/)) (MJ) and [SynthText](http://www.robots.ox.ac.uk/~vgg/data/scenetext/) (ST) without fine-tuning on other datasets. It achieves state-of-the-art results on six standard Latin scene text benchmarks, including 3 regular text datasets (IC13, SVT, IIIT) and 3 irregular ones (IC15, SVTP, CUTE). +This model was contributed by [yuekun](https://huggingface.co/yuekun). The original code can be found [here](https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/MGP-STR). + +## Inference example + +[`MgpstrModel`] accepts images as input and generates three types of predictions, which represent textual information at different granularities. +The three types of predictions are fused to give the final prediction result. + +The [`ViTImageProcessor`] class is responsible for preprocessing the input image and +[`MgpstrTokenizer`] decodes the generated character tokens to the target string. 
The +[`MgpstrProcessor`] wraps [`ViTImageProcessor`] and [`MgpstrTokenizer`] +into a single instance to both extract the input features and decode the predicted token ids. + +- Step-by-step Optical Character Recognition (OCR) + +```py +>>> from transformers import MgpstrProcessor, MgpstrForSceneTextRecognition +>>> import requests +>>> from PIL import Image + +>>> processor = MgpstrProcessor.from_pretrained('alibaba-damo/mgp-str-base') +>>> model = MgpstrForSceneTextRecognition.from_pretrained('alibaba-damo/mgp-str-base') + +>>> # load image from the IIIT-5k dataset +>>> url = "https://i.postimg.cc/ZKwLg2Gw/367-14.png" +>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB") + +>>> pixel_values = processor(images=image, return_tensors="pt").pixel_values +>>> outputs = model(pixel_values) + +>>> generated_text = processor.batch_decode(outputs.logits)['generated_text'] +``` + +## MgpstrConfig + +[[autodoc]] MgpstrConfig + +## MgpstrTokenizer + +[[autodoc]] MgpstrTokenizer + - save_vocabulary + +## MgpstrProcessor + +[[autodoc]] MgpstrProcessor + - __call__ + - batch_decode + +## MgpstrModel + +[[autodoc]] MgpstrModel + - forward + +## MgpstrForSceneTextRecognition + +[[autodoc]] MgpstrForSceneTextRecognition + - forward diff --git a/docs/source/en/model_doc/mistral.md b/docs/source/en/model_doc/mistral.md new file mode 100644 index 00000000000000..31b5deaf9dd63b --- /dev/null +++ b/docs/source/en/model_doc/mistral.md @@ -0,0 +1,161 @@ + + +# Mistral + +## Overview + +Mistral-7B-v0.1 is Mistral AI's first Large Language Model (LLM). + +### Model Details + +Mistral-7B-v0.1 is a decoder-based LM with the following architectural choices: +* Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens +* GQA (Grouped Query Attention) - allowing faster inference and lower cache size. +* Byte-fallback BPE tokenizer - ensures that characters are never mapped to out of vocabulary tokens. + +We also provide an instruction fine-tuned model: `Mistral-7B-Instruct-v0.1` which can be used for chat-based inference. + +For more details please read our [release blog post](https://mistral.ai/news/announcing-mistral-7b/) + +### License + +Both `Mistral-7B-v0.1` and `Mistral-7B-Instruct-v0.1` are released under the Apache 2.0 license. 
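+
+Since `Mistral-7B-Instruct-v0.1` is tuned for chat, prompts for it are usually built with the tokenizer's chat template rather than written by hand. The snippet below is a minimal sketch (assuming your `transformers` version provides `apply_chat_template` and that the Hub checkpoint ships a chat template), not an official recipe:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
+model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
+
+messages = [{"role": "user", "content": "What is your favourite condiment?"}]
+# render the conversation with the model's chat template, then generate a reply
+input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
+generated_ids = model.generate(input_ids, max_new_tokens=100, do_sample=True)
+print(tokenizer.batch_decode(generated_ids)[0])
+```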
+ +## Usage tips + +`Mistral-7B-v0.1` and `Mistral-7B-Instruct-v0.1` can be found on the [Huggingface Hub](https://huggingface.co/mistralai) + +These ready-to-use checkpoints can be downloaded and used via the HuggingFace Hub: + +```python +>>> from transformers import AutoModelForCausalLM, AutoTokenizer +>>> device = "cuda" # the device to load the model onto + +>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1") +>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1") + +>>> prompt = "My favourite condiment is" + +>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(device) +>>> model.to(device) + +>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True) +>>> tokenizer.batch_decode(generated_ids)[0] +"The expected output" +``` + +Raw weights for `Mistral-7B-v0.1` and `Mistral-7B-Instruct-v0.1` can be downloaded from: + +| Model Name | Checkpoint | +|----------------------------|-----------------------------------------------------------------------------------------| +| `Mistral-7B-v0.1` | [Raw Checkpoint](https://files.mistral-7b-v0-1.mistral.ai/mistral-7B-v0.1.tar) | +| `Mistral-7B-Instruct-v0.1` | [Raw Checkpoint](https://files.mistral-7b-v0-1.mistral.ai/mistral-7B-instruct-v0.1.tar) | + + +To use these raw checkpoints with HuggingFace you can use the `convert_mistral_weights_to_hf.py` script to convert them to the HuggingFace format: + +```bash +python src/transformers/models/mistral/convert_mistral_weights_to_hf.py \ + --input_dir /path/to/downloaded/mistral/weights --model_size 7B --output_dir /output/path +``` + +You can then load the converted model from the `output/path`: + +```python +from transformers import MistralForCausalLM, LlamaTokenizer + +tokenizer = LlamaTokenizer.from_pretrained("/output/path") +model = MistralForCausalLM.from_pretrained("/output/path") +``` + +## Combining Mistral and Flash Attention 2 + +First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature. + +```bash +pip install -U flash-attn --no-build-isolation +``` + +Make also sure that you have a hardware that is compatible with Flash-Attention 2. Read more about it in the official documentation of [`flash-attn`](https://github.com/Dao-AILab/flash-attention) repository. Make also sure to load your model in half-precision (e.g. `torch.float16`) + +To load and run a model using Flash Attention 2, refer to the snippet below: + +```python +>>> import torch +>>> from transformers import AutoModelForCausalLM, AutoTokenizer +>>> device = "cuda" # the device to load the model onto + +>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16, attn_implementation="flash_attention_2") +>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1") + +>>> prompt = "My favourite condiment is" + +>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(device) +>>> model.to(device) + +>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True) +>>> tokenizer.batch_decode(generated_ids)[0] +"The expected output" +``` + +### Expected speedups + +Below is a expected speedup diagram that compares pure inference time between the native implementation in transformers using `mistralai/Mistral-7B-v0.1` checkpoint and the Flash Attention 2 version of the model. + +
+ +
+ +### Sliding window Attention + +The current implementation supports the sliding window attention mechanism and memory-efficient cache management. +To enable sliding window attention, just make sure to have a `flash-attn` version that is compatible with sliding window attention (`>=2.3.0`). + +The Flash Attention-2 model also uses a more memory-efficient cache slicing mechanism. As recommended by the official Mistral implementation, which uses a rolling cache, we keep the cache size fixed (`self.config.sliding_window`), support batched generation only for `padding_side="left"`, and use the absolute position of the current token to compute the positional embedding. + +## The Mistral Team + +Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. + +## MistralConfig + +[[autodoc]] MistralConfig + +## MistralModel + +[[autodoc]] MistralModel + - forward + +## MistralForCausalLM + +[[autodoc]] MistralForCausalLM + - forward + +## MistralForSequenceClassification + +[[autodoc]] MistralForSequenceClassification + - forward + +## FlaxMistralModel + +[[autodoc]] FlaxMistralModel + - __call__ + +## FlaxMistralForCausalLM + +[[autodoc]] FlaxMistralForCausalLM + - __call__ diff --git a/docs/source/en/model_doc/mixtral.md b/docs/source/en/model_doc/mixtral.md new file mode 100644 index 00000000000000..d1a9ee0a1a07e2 --- /dev/null +++ b/docs/source/en/model_doc/mixtral.md @@ -0,0 +1,163 @@ + + +# Mixtral + +## Overview + +Mixtral-8x7B is Mistral AI's second Large Language Model (LLM). + +The Mixtral model was proposed by the [Mistral AI](https://mistral.ai/) team. + +It was introduced in the [Mixtral of Experts blogpost](https://mistral.ai/news/mixtral-of-experts/) with the following introduction: + +*Today, the team is proud to release Mixtral 8x7B, a high-quality sparse mixture of experts models (SMoE) with open weights. Licensed under Apache 2.0. Mixtral outperforms Llama 2 70B on most benchmarks with 6x faster inference. It is the strongest open-weight model with a permissive license and the best model overall regarding cost/performance trade-offs. In particular, it matches or outperforms GPT3.5 on most standard benchmarks.* + +Tips: + + +- The model needs to be converted using the [conversion script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/mixtral/convert_mixtral_weights_to_hf.py). +- If the model is quantized to 4 bits, a single A100 is enough to fit the entire 45B model. + +This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) and [Arthur Zucker](https://huggingface.co/ArthurZ). +The original code can be found [here](https://github.com/mistralai/mistral-src). + + +### Model Details + +Mixtral-45B is a decoder-based LM with the following architectural choices: + +* Mixtral is a Mixture of Experts (MoE) model with 8 experts per MLP and a total of 45B parameters, but the compute required is the same as for a 14B model. This is because even though each expert has to be loaded in RAM (a 70B-like RAM requirement), each token from the hidden states is dispatched twice (top-2 routing), so the compute (the operations required at each forward pass) is just 2 x sequence_length (see the toy sketch below).
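+
+To make the top-2 routing arithmetic concrete, here is a toy, self-contained sketch (illustrative only, not the actual Mixtral code; all sizes and names are made up) showing that only 2 of the 8 experts run for each token:
+
+```python
+import torch
+import torch.nn.functional as F
+
+num_experts, hidden, seq_len = 8, 16, 4
+router = torch.nn.Linear(hidden, num_experts, bias=False)                # routing gate
+experts = [torch.nn.Linear(hidden, hidden) for _ in range(num_experts)]  # stand-ins for the expert MLPs
+
+tokens = torch.randn(seq_len, hidden)
+logits = router(tokens)                            # (seq_len, num_experts) router scores
+weights, chosen = torch.topk(logits, k=2, dim=-1)  # top-2 expert ids per token
+weights = F.softmax(weights, dim=-1)               # renormalize over the 2 chosen experts
+
+rows = []
+for i in range(seq_len):
+    # each token only pays for 2 expert forward passes, regardless of the total expert count
+    rows.append(sum(w * experts[int(e)](tokens[i]) for w, e in zip(weights[i], chosen[i])))
+out = torch.stack(rows)                            # same shape as `tokens`
+```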
+ +The following implementation details are shared with Mistral AI's first model [mistral](mistral): +* Sliding Window Attention - Trained with 8k context length and fixed cache size, with a theoretical attention span of 128K tokens +* GQA (Grouped Query Attention) - allowing faster inference and lower cache size. +* Byte-fallback BPE tokenizer - ensures that characters are never mapped to out of vocabulary tokens. + +They also provide an instruction fine-tuned model: `mistralai/Mixtral-8x7B-v0.1` which can be used for chat-based inference. + +For more details please read our [release blog post](https://mistral.ai/news/mixtral-of-experts/) + +### License + +`Mixtral-8x7B` is released under the Apache 2.0 license. + +## Usage tips + +`Mixtral-8x7B` can be found on the [Huggingface Hub](https://huggingface.co/mistralai) + +These ready-to-use checkpoints can be downloaded and used via the HuggingFace Hub: + +```python +>>> from transformers import AutoModelForCausalLM, AutoTokenizer +>>> device = "cuda" # the device to load the model onto + +>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1") +>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1") + +>>> prompt = "My favourite condiment is" + +>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(device) +>>> model.to(device) + +>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True) +>>> tokenizer.batch_decode(generated_ids)[0] +"The expected output" +``` + +To use the raw checkpoints with HuggingFace you can use the `convert_mixtral_weights_to_hf.py` script to convert them to the HuggingFace format: + +```bash +python src/transformers/models/mixtral/convert_mixtral_weights_to_hf.py \ + --input_dir /path/to/downloaded/mistral/weights --output_dir /output/path +``` + +You can then load the converted model from the `output/path`: + +```python +from transformers import MixtralForCausalLM, LlamaTokenizer + +tokenizer = LlamaTokenizer.from_pretrained("/output/path") +model = MixtralForCausalLM.from_pretrained("/output/path") +``` + +## Combining Mixtral and Flash Attention 2 + +First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature. + +```bash +pip install -U flash-attn --no-build-isolation +``` + +Make also sure that you have a hardware that is compatible with Flash-Attention 2. Read more about it in the official documentation of [`flash-attn`](https://github.com/Dao-AILab/flash-attention) repository. Make also sure to load your model in half-precision (e.g. 
`torch.float16`) + +To load and run a model using Flash Attention 2, refer to the snippet below: + +```python +>>> import torch +>>> from transformers import AutoModelForCausalLM, AutoTokenizer +>>> device = "cuda" # the device to load the model onto + +>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.float16, attn_implementation="flash_attention_2") +>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1") + +>>> prompt = "My favourite condiment is" + +>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(device) +>>> model.to(device) + +>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True) +>>> tokenizer.batch_decode(generated_ids)[0] +"The expected output" +``` + +### Expected speedups + +Below is a expected speedup diagram that compares pure inference time between the native implementation in transformers using `mistralai/Mixtral-8x7B-v0.1` checkpoint and the Flash Attention 2 version of the model. + +
+ +
+ +### Sliding window Attention + +The current implementation supports the sliding window attention mechanism and memory-efficient cache management. +To enable sliding window attention, just make sure to have a `flash-attn` version that is compatible with sliding window attention (`>=2.3.0`). + +The Flash Attention-2 model also uses a more memory-efficient cache slicing mechanism. As recommended by the official Mistral implementation, which uses a rolling cache, we keep the cache size fixed (`self.config.sliding_window`), support batched generation only for `padding_side="left"`, and use the absolute position of the current token to compute the positional embedding. + +## The Mistral Team + +Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. + +## MixtralConfig + +[[autodoc]] MixtralConfig + +## MixtralModel + +[[autodoc]] MixtralModel + - forward + +## MixtralForCausalLM + +[[autodoc]] MixtralForCausalLM + - forward + +## MixtralForSequenceClassification + +[[autodoc]] MixtralForSequenceClassification + - forward diff --git a/docs/source/en/model_doc/mluke.mdx b/docs/source/en/model_doc/mluke.md similarity index 93% rename from docs/source/en/model_doc/mluke.mdx rename to docs/source/en/model_doc/mluke.md index b910f17ae2f6b1..719af76ad446b9 100644 --- a/docs/source/en/model_doc/mluke.mdx +++ b/docs/source/en/model_doc/mluke.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # mLUKE @@ -33,6 +37,10 @@ representations into the input allows us to extract more language-agnostic featu multilingual cloze prompt task with the mLAMA dataset. We show that entity-based prompt elicits correct factual knowledge more likely than using only word representations.* +This model was contributed by [ryo0634](https://huggingface.co/ryo0634). The original code can be found [here](https://github.com/studio-ousia/luke). + +## Usage tips + One can directly plug in the weights of mLUKE into a LUKE model, like so: ```python @@ -49,10 +57,12 @@ from transformers import MLukeTokenizer tokenizer = MLukeTokenizer.from_pretrained("studio-ousia/mluke-base") ``` + + As mLUKE's architecture is equivalent to that of LUKE, one can refer to [LUKE's documentation page](luke) for all tips, code examples and notebooks. -This model was contributed by [ryo0634](https://huggingface.co/ryo0634). The original code can be found [here](https://github.com/studio-ousia/luke).
+ ## MLukeTokenizer diff --git a/docs/source/en/model_doc/mms.md b/docs/source/en/model_doc/mms.md new file mode 100644 index 00000000000000..dc453248eefbf7 --- /dev/null +++ b/docs/source/en/model_doc/mms.md @@ -0,0 +1,389 @@ + + +# MMS + +## Overview + +The MMS model was proposed in [Scaling Speech Technology to 1,000+ Languages](https://arxiv.org/abs/2305.13516) +by Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli + +The abstract from the paper is the following: + +*Expanding the language coverage of speech technology has the potential to improve access to information for many more people. +However, current speech technology is restricted to about one hundred languages which is a small fraction of the over 7,000 +languages spoken around the world. +The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task. +The main ingredients are a new dataset based on readings of publicly available religious texts and effectively leveraging +self-supervised learning. We built pre-trained wav2vec 2.0 models covering 1,406 languages, +a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models +for the same number of languages, as well as a language identification model for 4,017 languages. +Experiments show that our multilingual speech recognition model more than halves the word error rate of +Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data.* + +Here are the different models open sourced in the MMS project. The models and code are originally released [here](https://github.com/facebookresearch/fairseq/tree/main/examples/mms). We have add them to the `transformers` framework, making them easier to use. + +### Automatic Speech Recognition (ASR) + +The ASR model checkpoints can be found here : [mms-1b-fl102](https://huggingface.co/facebook/mms-1b-fl102), [mms-1b-l1107](https://huggingface.co/facebook/mms-1b-l1107), [mms-1b-all](https://huggingface.co/facebook/mms-1b-all). For best accuracy, use the `mms-1b-all` model. + +Tips: + +- All ASR models accept a float array corresponding to the raw waveform of the speech signal. The raw waveform should be pre-processed with [`Wav2Vec2FeatureExtractor`]. +- The models were trained using connectionist temporal classification (CTC) so the model output has to be decoded using + [`Wav2Vec2CTCTokenizer`]. +- You can load different language adapter weights for different languages via [`~Wav2Vec2PreTrainedModel.load_adapter`]. Language adapters only consists of roughly 2 million parameters + and can therefore be efficiently loaded on the fly when needed. + +#### Loading + +By default MMS loads adapter weights for English. If you want to load adapter weights of another language +make sure to specify `target_lang=` as well as `"ignore_mismatched_sizes=True`. +The `ignore_mismatched_sizes=True` keyword has to be passed to allow the language model head to be resized according +to the vocabulary of the specified language. 
+Similarly, the processor should be loaded with the same target language + +```py +from transformers import Wav2Vec2ForCTC, AutoProcessor + +model_id = "facebook/mms-1b-all" +target_lang = "fra" + +processor = AutoProcessor.from_pretrained(model_id, target_lang=target_lang) +model = Wav2Vec2ForCTC.from_pretrained(model_id, target_lang=target_lang, ignore_mismatched_sizes=True) +``` + + + +You can safely ignore a warning such as: + +```text +Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/mms-1b-all and are newly initialized because the shapes did not match: +- lm_head.bias: found shape torch.Size([154]) in the checkpoint and torch.Size([314]) in the model instantiated +- lm_head.weight: found shape torch.Size([154, 1280]) in the checkpoint and torch.Size([314, 1280]) in the model instantiated +You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. +``` + + + +If you want to use the ASR pipeline, you can load your chosen target language as such: + +```py +from transformers import pipeline + +model_id = "facebook/mms-1b-all" +target_lang = "fra" + +pipe = pipeline(model=model_id, model_kwargs={"target_lang": "fra", "ignore_mismatched_sizes": True}) +``` + +#### Inference + +Next, let's look at how we can run MMS in inference and change adapter layers after having called [`~PretrainedModel.from_pretrained`] +First, we load audio data in different languages using the [Datasets](https://github.com/huggingface/datasets). + +```py +from datasets import load_dataset, Audio + +# English +stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True) +stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000)) +en_sample = next(iter(stream_data))["audio"]["array"] + +# French +stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "fr", split="test", streaming=True) +stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000)) +fr_sample = next(iter(stream_data))["audio"]["array"] +``` + +Next, we load the model and processor + +```py +from transformers import Wav2Vec2ForCTC, AutoProcessor +import torch + +model_id = "facebook/mms-1b-all" + +processor = AutoProcessor.from_pretrained(model_id) +model = Wav2Vec2ForCTC.from_pretrained(model_id) +``` + +Now we process the audio data, pass the processed audio data to the model and transcribe the model output, +just like we usually do for [`Wav2Vec2ForCTC`]. + +```py +inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt") + +with torch.no_grad(): + outputs = model(**inputs).logits + +ids = torch.argmax(outputs, dim=-1)[0] +transcription = processor.decode(ids) +# 'joe keton disapproved of films and buster also had reservations about the media' +``` + +We can now keep the same model in memory and simply switch out the language adapters by +calling the convenient [`~Wav2Vec2ForCTC.load_adapter`] function for the model and [`~Wav2Vec2CTCTokenizer.set_target_lang`] for the tokenizer. +We pass the target language as an input - `"fra"` for French. 
+ +```py +processor.tokenizer.set_target_lang("fra") +model.load_adapter("fra") + +inputs = processor(fr_sample, sampling_rate=16_000, return_tensors="pt") + +with torch.no_grad(): + outputs = model(**inputs).logits + +ids = torch.argmax(outputs, dim=-1)[0] +transcription = processor.decode(ids) +# "ce dernier est volé tout au long de l'histoire romaine" +``` + +In the same way the language can be switched out for all other supported languages. Please have a look at: + +```py +processor.tokenizer.vocab.keys() +``` + +to see all supported languages. + +To further improve performance from ASR models, language model decoding can be used. See the documentation [here](https://huggingface.co/facebook/mms-1b-all) for further details. + +### Speech Synthesis (TTS) + +MMS-TTS uses the same model architecture as VITS, which was added to 🤗 Transformers in v4.33. MMS trains a separate +model checkpoint for each of the 1100+ languages in the project. All available checkpoints can be found on the Hugging +Face Hub: [facebook/mms-tts](https://huggingface.co/models?sort=trending&search=facebook%2Fmms-tts), and the inference +documentation under [VITS](https://huggingface.co/docs/transformers/main/en/model_doc/vits). + +#### Inference + +To use the MMS model, first update to the latest version of the Transformers library: + +```bash +pip install --upgrade transformers accelerate +``` + +Since the flow-based model in VITS is non-deterministic, it is good practice to set a seed to ensure reproducibility of +the outputs. + +- For languages with a Roman alphabet, such as English or French, the tokenizer can be used directly to +pre-process the text inputs. The following code example runs a forward pass using the MMS-TTS English checkpoint: + +```python +import torch +from transformers import VitsTokenizer, VitsModel, set_seed + +tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng") +model = VitsModel.from_pretrained("facebook/mms-tts-eng") + +inputs = tokenizer(text="Hello - my dog is cute", return_tensors="pt") + +set_seed(555) # make deterministic + +with torch.no_grad(): + outputs = model(**inputs) + +waveform = outputs.waveform[0] +``` + +The resulting waveform can be saved as a `.wav` file: + +```python +import scipy + +scipy.io.wavfile.write("synthesized_speech.wav", rate=model.config.sampling_rate, data=waveform) +``` + +Or displayed in a Jupyter Notebook / Google Colab: + +```python +from IPython.display import Audio + +Audio(waveform, rate=model.config.sampling_rate) +``` + +For certain languages with non-Roman alphabets, such as Arabic, Mandarin or Hindi, the [`uroman`](https://github.com/isi-nlp/uroman) +perl package is required to pre-process the text inputs to the Roman alphabet. + +You can check whether you require the `uroman` package for your language by inspecting the `is_uroman` attribute of +the pre-trained `tokenizer`: + +```python +from transformers import VitsTokenizer + +tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng") +print(tokenizer.is_uroman) +``` + +If required, you should apply the uroman package to your text inputs **prior** to passing them to the `VitsTokenizer`, +since currently the tokenizer does not support performing the pre-processing itself. + +To do this, first clone the uroman repository to your local machine and set the bash variable `UROMAN` to the local path: + +```bash +git clone https://github.com/isi-nlp/uroman.git +cd uroman +export UROMAN=$(pwd) +``` + +You can then pre-process the text input using the following code snippet. 
You can either rely on using the bash variable +`UROMAN` to point to the uroman repository, or you can pass the uroman directory as an argument to the `uromanize` function: + +```python +import torch +from transformers import VitsTokenizer, VitsModel, set_seed +import os +import subprocess + +tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-kor") +model = VitsModel.from_pretrained("facebook/mms-tts-kor") + +def uromanize(input_string, uroman_path): + """Convert non-Roman strings to Roman using the `uroman` perl package.""" + script_path = os.path.join(uroman_path, "bin", "uroman.pl") + + command = ["perl", script_path] + + process = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE) + # Execute the perl command + stdout, stderr = process.communicate(input=input_string.encode()) + + if process.returncode != 0: + raise ValueError(f"Error {process.returncode}: {stderr.decode()}") + + # Return the output as a string and skip the new-line character at the end + return stdout.decode()[:-1] + +text = "이봐 무슨 일이야" +uromanized_text = uromanize(text, uroman_path=os.environ["UROMAN"]) + +inputs = tokenizer(text=uromanized_text, return_tensors="pt") + +set_seed(555) # make deterministic +with torch.no_grad(): + outputs = model(inputs["input_ids"]) + +waveform = outputs.waveform[0] +``` + +**Tips:** + +* The MMS-TTS checkpoints are trained on lower-cased, un-punctuated text. By default, the `VitsTokenizer` *normalizes* the inputs by removing any casing and punctuation, to avoid passing out-of-vocabulary characters to the model. Hence, the model is agnostic to casing and punctuation, so these should be avoided in the text prompt. You can disable normalisation by setting `normalize=False` in the call to the tokenizer, but this will lead to unexpected behaviour and is discouraged. +* The speaking rate can be varied by setting the attribute `model.speaking_rate` to a chosen value. Likewise, the randomness of the noise is controlled by `model.noise_scale`: + +```python +import torch +from transformers import VitsTokenizer, VitsModel, set_seed + +tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng") +model = VitsModel.from_pretrained("facebook/mms-tts-eng") + +inputs = tokenizer(text="Hello - my dog is cute", return_tensors="pt") + +# make deterministic +set_seed(555) + +# make speech faster and more noisy +model.speaking_rate = 1.5 +model.noise_scale = 0.8 + +with torch.no_grad(): + outputs = model(**inputs) +``` + +### Language Identification (LID) + +Different LID models are available based on the number of languages they can recognize - [126](https://huggingface.co/facebook/mms-lid-126), [256](https://huggingface.co/facebook/mms-lid-256), [512](https://huggingface.co/facebook/mms-lid-512), [1024](https://huggingface.co/facebook/mms-lid-1024), [2048](https://huggingface.co/facebook/mms-lid-2048), [4017](https://huggingface.co/facebook/mms-lid-4017). + +#### Inference +First, we install `transformers` and some other libraries: + +```bash +pip install torch accelerate datasets[audio] +pip install --upgrade transformers +``` + +Next, we load a couple of audio samples via `datasets`. Make sure that the audio data is sampled at 16 kHz.
+ +```py +from datasets import load_dataset, Audio + +# English +stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True) +stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000)) +en_sample = next(iter(stream_data))["audio"]["array"] + +# Arabic +stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "ar", split="test", streaming=True) +stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000)) +ar_sample = next(iter(stream_data))["audio"]["array"] +``` + +Next, we load the model and processor + +```py +from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor +import torch + +model_id = "facebook/mms-lid-126" + +processor = AutoFeatureExtractor.from_pretrained(model_id) +model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id) +``` + +Now we process the audio data, pass the processed audio data to the model to classify it into a language, just like we usually do for Wav2Vec2 audio classification models such as [ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition](https://huggingface.co/harshit345/xlsr-wav2vec-speech-emotion-recognition) + +```py +# English +inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt") + +with torch.no_grad(): + outputs = model(**inputs).logits + +lang_id = torch.argmax(outputs, dim=-1)[0].item() +detected_lang = model.config.id2label[lang_id] +# 'eng' + +# Arabic +inputs = processor(ar_sample, sampling_rate=16_000, return_tensors="pt") + +with torch.no_grad(): + outputs = model(**inputs).logits + +lang_id = torch.argmax(outputs, dim=-1)[0].item() +detected_lang = model.config.id2label[lang_id] +# 'ara' +``` + +To see all the supported languages of a checkpoint, you can print out the language ids as follows: +```py +processor.id2label.values() +``` + +### Audio Pretrained Models + +Pretrained models are available for two different sizes - [300M](https://huggingface.co/facebook/mms-300m) , +[1Bil](https://huggingface.co/facebook/mms-1b). + + + +The MMS for ASR architecture is based on the Wav2Vec2 model, refer to [Wav2Vec2's documentation page](wav2vec2) for further +details on how to finetune with models for various downstream tasks. + +MMS-TTS uses the same model architecture as VITS, refer to [VITS's documentation page](vits) for API reference. + diff --git a/docs/source/en/model_doc/mobilebert.mdx b/docs/source/en/model_doc/mobilebert.md similarity index 87% rename from docs/source/en/model_doc/mobilebert.mdx rename to docs/source/en/model_doc/mobilebert.md index 8305903d23c70e..5c9a230d0d5c40 100644 --- a/docs/source/en/model_doc/mobilebert.mdx +++ b/docs/source/en/model_doc/mobilebert.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # MobileBERT @@ -33,7 +37,9 @@ natural language inference tasks of GLUE, MobileBERT achieves a GLUEscore o 77.7 latency on a Pixel 4 phone. 
On the SQuAD v1.1/v2.0 question answering task, MobileBERT achieves a dev F1 score of 90.0/79.2 (1.5/2.1 higher than BERT_BASE).* -Tips: +This model was contributed by [vshampor](https://huggingface.co/vshampor). The original code can be found [here](https://github.com/google-research/google-research/tree/master/mobilebert). + +## Usage tips - MobileBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than the left. @@ -41,7 +47,14 @@ Tips: efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained with a causal language modeling (CLM) objective are better in that regard. -This model was contributed by [vshampor](https://huggingface.co/vshampor). The original code can be found [here](https://github.com/google-research/mobilebert). + +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) ## MobileBertConfig @@ -61,6 +74,9 @@ This model was contributed by [vshampor](https://huggingface.co/vshampor). The o [[autodoc]] models.mobilebert.modeling_tf_mobilebert.TFMobileBertForPreTrainingOutput + + + ## MobileBertModel [[autodoc]] MobileBertModel @@ -101,6 +117,9 @@ This model was contributed by [vshampor](https://huggingface.co/vshampor). The o [[autodoc]] MobileBertForQuestionAnswering - forward + + + ## TFMobileBertModel [[autodoc]] TFMobileBertModel @@ -140,3 +159,6 @@ This model was contributed by [vshampor](https://huggingface.co/vshampor). The o [[autodoc]] TFMobileBertForQuestionAnswering - call + + + diff --git a/docs/source/en/model_doc/mobilenet_v1.mdx b/docs/source/en/model_doc/mobilenet_v1.md similarity index 95% rename from docs/source/en/model_doc/mobilenet_v1.mdx rename to docs/source/en/model_doc/mobilenet_v1.md index 48795896f02267..9f68035c63c2a9 100644 --- a/docs/source/en/model_doc/mobilenet_v1.mdx +++ b/docs/source/en/model_doc/mobilenet_v1.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # MobileNet V1 @@ -20,7 +24,9 @@ The abstract from the paper is the following: *We present a class of efficient models called MobileNets for mobile and embedded vision applications. MobileNets are based on a streamlined architecture that uses depth-wise separable convolutions to build light weight deep neural networks. We introduce two simple global hyper-parameters that efficiently trade off between latency and accuracy. These hyper-parameters allow the model builder to choose the right sized model for their application based on the constraints of the problem. We present extensive experiments on resource and accuracy tradeoffs and show strong performance compared to other popular models on ImageNet classification. 
We then demonstrate the effectiveness of MobileNets across a wide range of applications and use cases including object detection, finegrain classification, face attributes and large scale geo-localization.* -Tips: +This model was contributed by [matthijs](https://huggingface.co/Matthijs). The original code and weights can be found [here](https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md). + +## Usage tips - The checkpoints are named **mobilenet\_v1\_*depth*\_*size***, for example **mobilenet\_v1\_1.0\_224**, where **1.0** is the depth multiplier (sometimes also referred to as "alpha" or the width multiplier) and **224** is the resolution of the input images the model was trained on. @@ -42,8 +48,6 @@ Unsupported features: - It's common to extract the output from the pointwise layers at indices 5, 11, 12, 13 for downstream purposes. Using `output_hidden_states=True` returns the output from all intermediate layers. There is currently no way to limit this to specific layers. -This model was contributed by [matthijs](https://huggingface.co/Matthijs). The original code and weights can be found [here](https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md). - ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with MobileNetV1. @@ -51,6 +55,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`MobileNetV1ForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). +- See also: [Image classification task guide](../tasks/image_classification) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. diff --git a/docs/source/en/model_doc/mobilenet_v2.mdx b/docs/source/en/model_doc/mobilenet_v2.md similarity index 94% rename from docs/source/en/model_doc/mobilenet_v2.mdx rename to docs/source/en/model_doc/mobilenet_v2.md index 6f179f3ceeecf9..ff22231ae0c11a 100644 --- a/docs/source/en/model_doc/mobilenet_v2.mdx +++ b/docs/source/en/model_doc/mobilenet_v2.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # MobileNet V2 @@ -22,7 +26,9 @@ The abstract from the paper is the following: *The MobileNetV2 architecture is based on an inverted residual structure where the input and output of the residual block are thin bottleneck layers opposite to traditional residual models which use expanded representations in the input an MobileNetV2 uses lightweight depthwise convolutions to filter features in the intermediate expansion layer. Additionally, we find that it is important to remove non-linearities in the narrow layers in order to maintain representational power. 
We demonstrate that this improves performance and provide an intuition that led to this design. Finally, our approach allows decoupling of the input/output domains from the expressiveness of the transformation, which provides a convenient framework for further analysis. We measure our performance on Imagenet classification, COCO object detection, VOC image segmentation. We evaluate the trade-offs between accuracy, and number of operations measured by multiply-adds (MAdd), as well as the number of parameters.* -Tips: +This model was contributed by [matthijs](https://huggingface.co/Matthijs). The original code and weights can be found [here for the main model](https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet) and [here for DeepLabV3+](https://github.com/tensorflow/models/tree/master/research/deeplab). + +## Usage tips - The checkpoints are named **mobilenet\_v2\_*depth*\_*size***, for example **mobilenet\_v2\_1.0\_224**, where **1.0** is the depth multiplier (sometimes also referred to as "alpha" or the width multiplier) and **224** is the resolution of the input images the model was trained on. @@ -46,8 +52,6 @@ Unsupported features: - The DeepLabV3+ segmentation head does not use the final convolution layer from the backbone, but this layer gets computed anyway. There is currently no way to tell [`MobileNetV2Model`] up to which layer it should run. -This model was contributed by [matthijs](https://huggingface.co/Matthijs). The original code and weights can be found [here for the main model](https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet) and [here for DeepLabV3+](https://github.com/tensorflow/models/tree/master/research/deeplab). - ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with MobileNetV2. @@ -55,6 +59,10 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`MobileNetV2ForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). +- See also: [Image classification task guide](../tasks/image_classification) + +**Semantic segmentation** +- [Semantic segmentation task guide](../tasks/semantic_segmentation) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. diff --git a/docs/source/en/model_doc/mobilevit.mdx b/docs/source/en/model_doc/mobilevit.md similarity index 93% rename from docs/source/en/model_doc/mobilevit.mdx rename to docs/source/en/model_doc/mobilevit.md index 6c0b5b6aaefcbc..e724ffa380e2f9 100644 --- a/docs/source/en/model_doc/mobilevit.mdx +++ b/docs/source/en/model_doc/mobilevit.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. 
+ --> # MobileViT @@ -20,7 +24,9 @@ The abstract from the paper is the following: *Light-weight convolutional neural networks (CNNs) are the de-facto for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. To learn global representations, self-attention-based vision trans-formers (ViTs) have been adopted. Unlike CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-weight and low latency network for mobile vision tasks? Towards this end, we introduce MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT presents a different perspective for the global processing of information with transformers, i.e., transformers as convolutions. Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeIT (ViT-based) for a similar number of parameters. On the MS-COCO object detection task, MobileViT is 5.7% more accurate than MobileNetv3 for a similar number of parameters.* -Tips: +This model was contributed by [matthijs](https://huggingface.co/Matthijs). The TensorFlow version of the model was contributed by [sayakpaul](https://huggingface.co/sayakpaul). The original code and weights can be found [here](https://github.com/apple/ml-cvnets). + +## Usage tips - MobileViT is more like a CNN than a Transformer model. It does not work on sequence data but on batches of images. Unlike ViT, there are no embeddings. The backbone model outputs a feature map. You can follow [this tutorial](https://keras.io/examples/vision/mobilevit) for a lightweight introduction. - One can use [`MobileViTImageProcessor`] to prepare images for the model. Note that if you do your own preprocessing, the pretrained checkpoints expect images to be in BGR pixel order (not RGB). @@ -54,9 +60,6 @@ with open(tflite_filename, "wb") as f: The resulting model will be just **about an MB** making it a good fit for mobile applications where resources and network bandwidth can be constrained. - -This model was contributed by [matthijs](https://huggingface.co/Matthijs). The TensorFlow version of the model was contributed by [sayakpaul](https://huggingface.co/sayakpaul). The original code and weights can be found [here](https://github.com/apple/ml-cvnets). - ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with MobileViT. @@ -64,6 +67,10 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`MobileViTForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). +- See also: [Image classification task guide](../tasks/image_classification) + +**Semantic segmentation** +- [Semantic segmentation task guide](../tasks/semantic_segmentation) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! 
The resource should ideally demonstrate something new instead of duplicating an existing resource. @@ -83,6 +90,9 @@ If you're interested in submitting a resource to be included here, please feel f - preprocess - post_process_semantic_segmentation + + + ## MobileViTModel [[autodoc]] MobileViTModel @@ -98,6 +108,9 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] MobileViTForSemanticSegmentation - forward + + + ## TFMobileViTModel [[autodoc]] TFMobileViTModel @@ -112,3 +125,6 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] TFMobileViTForSemanticSegmentation - call + + + \ No newline at end of file diff --git a/docs/source/en/model_doc/mobilevitv2.md b/docs/source/en/model_doc/mobilevitv2.md new file mode 100644 index 00000000000000..c3a650fc70423c --- /dev/null +++ b/docs/source/en/model_doc/mobilevitv2.md @@ -0,0 +1,56 @@ + + +# MobileViTV2 + +## Overview + +The MobileViTV2 model was proposed in [Separable Self-attention for Mobile Vision Transformers](https://arxiv.org/abs/2206.02680) by Sachin Mehta and Mohammad Rastegari. + +MobileViTV2 is the second version of MobileViT, constructed by replacing the multi-headed self-attention in MobileViT with separable self-attention. + +The abstract from the paper is the following: + +*Mobile vision transformers (MobileViT) can achieve state-of-the-art performance across several mobile vision tasks, including classification and detection. Though these models have fewer parameters, they have high latency as compared to convolutional neural network-based models. The main efficiency bottleneck in MobileViT is the multi-headed self-attention (MHA) in transformers, which requires O(k2) time complexity with respect to the number of tokens (or patches) k. Moreover, MHA requires costly operations (e.g., batch-wise matrix multiplication) for computing self-attention, impacting latency on resource-constrained devices. This paper introduces a separable self-attention method with linear complexity, i.e. O(k). A simple yet effective characteristic of the proposed method is that it uses element-wise operations for computing self-attention, making it a good choice for resource-constrained devices. The improved model, MobileViTV2, is state-of-the-art on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection. With about three million parameters, MobileViTV2 achieves a top-1 accuracy of 75.6% on the ImageNet dataset, outperforming MobileViT by about 1% while running 3.2× faster on a mobile device.* + +This model was contributed by [shehan97](https://huggingface.co/shehan97). +The original code can be found [here](https://github.com/apple/ml-cvnets). + +## Usage tips + +- MobileViTV2 is more like a CNN than a Transformer model. It does not work on sequence data but on batches of images. Unlike ViT, there are no embeddings. The backbone model outputs a feature map. +- One can use [`MobileViTImageProcessor`] to prepare images for the model. Note that if you do your own preprocessing, the pretrained checkpoints expect images to be in BGR pixel order (not RGB). +- The available image classification checkpoints are pre-trained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k) (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes). +- The segmentation model uses a [DeepLabV3](https://arxiv.org/abs/1706.05587) head. 
The available semantic segmentation checkpoints are pre-trained on [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/). + +## MobileViTV2Config + +[[autodoc]] MobileViTV2Config + +## MobileViTV2Model + +[[autodoc]] MobileViTV2Model + - forward + +## MobileViTV2ForImageClassification + +[[autodoc]] MobileViTV2ForImageClassification + - forward + +## MobileViTV2ForSemanticSegmentation + +[[autodoc]] MobileViTV2ForSemanticSegmentation + - forward diff --git a/docs/source/en/model_doc/mpnet.mdx b/docs/source/en/model_doc/mpnet.md similarity index 81% rename from docs/source/en/model_doc/mpnet.mdx rename to docs/source/en/model_doc/mpnet.md index 0fa88ee87b7259..c571da47b0048c 100644 --- a/docs/source/en/model_doc/mpnet.mdx +++ b/docs/source/en/model_doc/mpnet.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # MPNet @@ -33,12 +37,20 @@ down-streaming tasks (GLUE, SQuAD, etc). Experimental results show that MPNet ou margin, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods (e.g., BERT, XLNet, RoBERTa) under the same model setting.* -Tips: +The original code can be found [here](https://github.com/microsoft/MPNet). -- MPNet doesn't have `token_type_ids`, you don't need to indicate which token belongs to which segment. just - separate your segments with the separation token `tokenizer.sep_token` (or `[sep]`). +## Usage tips -The original code can be found [here](https://github.com/microsoft/MPNet). +MPNet doesn't have `token_type_ids`, you don't need to indicate which token belongs to which segment. Just +separate your segments with the separation token `tokenizer.sep_token` (or `[sep]`). + +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) ## MPNetConfig @@ -56,6 +68,9 @@ The original code can be found [here](https://github.com/microsoft/MPNet). [[autodoc]] MPNetTokenizerFast + + + ## MPNetModel [[autodoc]] MPNetModel @@ -86,6 +101,9 @@ The original code can be found [here](https://github.com/microsoft/MPNet). [[autodoc]] MPNetForQuestionAnswering - forward + + + ## TFMPNetModel [[autodoc]] TFMPNetModel @@ -115,3 +133,6 @@ The original code can be found [here](https://github.com/microsoft/MPNet). [[autodoc]] TFMPNetForQuestionAnswering - call + + + diff --git a/docs/source/en/model_doc/mpt.md b/docs/source/en/model_doc/mpt.md new file mode 100644 index 00000000000000..f7e6fcc14382bd --- /dev/null +++ b/docs/source/en/model_doc/mpt.md @@ -0,0 +1,70 @@ + + +# MPT + +## Overview + +The MPT model was proposed by the [MosaicML](https://www.mosaicml.com/) team and released with multiple sizes and finetuned variants. The MPT models is a series of open source and commercially usable LLMs pre-trained on 1T tokens. 
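+
+As a minimal sketch (assuming the `mosaicml/mpt-7b` base checkpoint and the default, native Transformers implementation rather than the remote code on the Hub), the checkpoints can be loaded and prompted with the standard auto classes:
+
+```python
+>>> from transformers import AutoModelForCausalLM, AutoTokenizer
+
+>>> tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")
+>>> model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b")
+
+>>> # an illustrative prompt; generate 20 new tokens and decode them back to text
+>>> inputs = tokenizer("MosaicML is a company that", return_tensors="pt")
+>>> generated_ids = model.generate(**inputs, max_new_tokens=20)
+>>> text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+```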
+ +MPT models are GPT-style decoder-only transformers with several improvements: performance-optimized layer implementations, architecture changes that provide greater training stability, and the elimination of context length limits by replacing positional embeddings with ALiBi. + +- MPT base: MPT base pre-trained models on next token prediction +- MPT instruct: MPT base models fine-tuned on instruction based tasks +- MPT storywriter: MPT base models fine-tuned for 2500 steps on 65k-token excerpts of fiction books contained in the books3 corpus, this enables the model to handle very long sequences + +The original code is available at the [`llm-foundry`](https://github.com/mosaicml/llm-foundry/tree/main) repository. + +Read more about it [in the release blogpost](https://www.mosaicml.com/blog/mpt-7b) + +## Usage tips + +- Learn more about some techniques behind training of the model [in this section of llm-foundry repository](https://github.com/mosaicml/llm-foundry/blob/main/TUTORIAL.md#faqs) +- If you want to use the advanced version of the model (triton kernels, direct flash attention integration), you can still use the original model implementation by adding `trust_remote_code=True` when calling `from_pretrained`. + +## Resources + +- [Fine-tuning Notebook](https://colab.research.google.com/drive/1HCpQkLL7UXW8xJUJJ29X7QAeNJKO0frZ?usp=sharing) on how to fine-tune MPT-7B on a free Google Colab instance to turn the model into a Chatbot. + +## MptConfig + +[[autodoc]] MptConfig + - all + +## MptModel + +[[autodoc]] MptModel + - forward + +## MptForCausalLM + +[[autodoc]] MptForCausalLM + - forward + +## MptForSequenceClassification + +[[autodoc]] MptForSequenceClassification + - forward + +## MptForTokenClassification + +[[autodoc]] MptForTokenClassification + - forward + +## MptForQuestionAnswering + +[[autodoc]] MptForQuestionAnswering + - forward diff --git a/docs/source/en/model_doc/mra.md b/docs/source/en/model_doc/mra.md new file mode 100644 index 00000000000000..cc4c0d9cc9c834 --- /dev/null +++ b/docs/source/en/model_doc/mra.md @@ -0,0 +1,62 @@ + + +# MRA + +## Overview + +The MRA model was proposed in [Multi Resolution Analysis (MRA) for Approximate Self-Attention](https://arxiv.org/abs/2207.10284) by Zhanpeng Zeng, Sourav Pal, Jeffery Kline, Glenn M Fung, and Vikas Singh. + +The abstract from the paper is the following: + +*Transformers have emerged as a preferred model for many tasks in natural language processing and vision. Recent efforts on training and deploying Transformers more efficiently have identified many strategies to approximate the self-attention matrix, a key module in a Transformer architecture. Effective ideas include various prespecified sparsity patterns, low-rank basis expansions and combinations thereof. In this paper, we revisit classical Multiresolution Analysis (MRA) concepts such as Wavelets, whose potential value in this setting remains underexplored thus far. We show that simple approximations based on empirical feedback and design choices informed by modern hardware and implementation challenges, eventually yield a MRA-based approach for self-attention with an excellent performance profile across most criteria of interest. We undertake an extensive set of experiments and demonstrate that this multi-resolution scheme outperforms most efficient self-attention proposals and is favorable for both short and long sequences. 
Code is available at https://github.com/mlpen/mra-attention.* + +This model was contributed by [novice03](https://huggingface.co/novice03). +The original code can be found [here](https://github.com/mlpen/mra-attention). + +## MraConfig + +[[autodoc]] MraConfig + +## MraModel + +[[autodoc]] MraModel + - forward + +## MraForMaskedLM + +[[autodoc]] MraForMaskedLM + - forward + +## MraForSequenceClassification + +[[autodoc]] MraForSequenceClassification + - forward + +## MraForMultipleChoice + +[[autodoc]] MraForMultipleChoice + - forward + +## MraForTokenClassification + +[[autodoc]] MraForTokenClassification + - forward + +## MraForQuestionAnswering + +[[autodoc]] MraForQuestionAnswering + - forward \ No newline at end of file diff --git a/docs/source/en/model_doc/mt5.mdx b/docs/source/en/model_doc/mt5.md similarity index 86% rename from docs/source/en/model_doc/mt5.mdx rename to docs/source/en/model_doc/mt5.md index 0da6a6a7f344ef..7f053bb724a111 100644 --- a/docs/source/en/model_doc/mt5.mdx +++ b/docs/source/en/model_doc/mt5.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # mT5 @@ -56,6 +60,11 @@ Google has released the following variants: This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be found [here](https://github.com/google-research/multilingual-t5). +## Resources + +- [Translation task guide](../tasks/translation) +- [Summarization task guide](../tasks/summarization) + ## MT5Config [[autodoc]] MT5Config @@ -73,6 +82,8 @@ See [`T5Tokenizer`] for all details. See [`T5TokenizerFast`] for all details. + + ## MT5Model @@ -86,6 +97,21 @@ See [`T5TokenizerFast`] for all details. [[autodoc]] MT5EncoderModel +## MT5ForSequenceClassification + +[[autodoc]] MT5ForSequenceClassification + +## MT5ForTokenClassification + +[[autodoc]] MT5ForTokenClassification + +## MT5ForQuestionAnswering + +[[autodoc]] MT5ForQuestionAnswering + + + + ## TFMT5Model [[autodoc]] TFMT5Model @@ -98,6 +124,9 @@ See [`T5TokenizerFast`] for all details. [[autodoc]] TFMT5EncoderModel + + + ## FlaxMT5Model [[autodoc]] FlaxMT5Model @@ -109,3 +138,6 @@ See [`T5TokenizerFast`] for all details. ## FlaxMT5EncoderModel [[autodoc]] FlaxMT5EncoderModel + + + diff --git a/docs/source/en/model_doc/musicgen.md b/docs/source/en/model_doc/musicgen.md new file mode 100644 index 00000000000000..7c105e1f39f7ce --- /dev/null +++ b/docs/source/en/model_doc/musicgen.md @@ -0,0 +1,280 @@ + + +# MusicGen + +## Overview + +The MusicGen model was proposed in the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) +by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez. + +MusicGen is a single stage auto-regressive Transformer model capable of generating high-quality music samples conditioned +on text descriptions or audio prompts. The text descriptions are passed through a frozen text encoder model to obtain a +sequence of hidden-state representations. 
MusicGen is then trained to predict discrete audio tokens, or *audio codes*, +conditioned on these hidden-states. These audio tokens are then decoded using an audio compression model, such as EnCodec, +to recover the audio waveform. + +Through an efficient token interleaving pattern, MusicGen does not require a self-supervised semantic representation of +the text/audio prompts, thus eliminating the need to cascade multiple models to predict a set of codebooks (e.g. +hierarchically or upsampling). Instead, it is able to generate all the codebooks in a single forward pass. + +The abstract from the paper is the following: + +*We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates +over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised +of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for +cascading several models, e.g., hierarchically or upsampling. Following this approach, we demonstrate how MusicGen +can generate high-quality samples, while being conditioned on textual description or melodic features, allowing better +controls over the generated output. We conduct extensive empirical evaluation, considering both automatic and human +studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. +Through ablation studies, we shed light over the importance of each of the components comprising MusicGen.* + +This model was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). The original code can be found +[here](https://github.com/facebookresearch/audiocraft). The pre-trained checkpoints can be found on the +[Hugging Face Hub](https://huggingface.co/models?sort=downloads&search=facebook%2Fmusicgen-). + +## Usage tips + +- After downloading the original checkpoints from [here](https://github.com/facebookresearch/audiocraft/blob/main/docs/MUSICGEN.md#importing--exporting-models) , you can convert them using the **conversion script** available at +`src/transformers/models/musicgen/convert_musicgen_transformers.py` with the following command: + +```bash +python src/transformers/models/musicgen/convert_musicgen_transformers.py \ + --checkpoint small --pytorch_dump_folder /output/path --safe_serialization +``` + +## Generation + +MusicGen is compatible with two generation modes: greedy and sampling. In practice, sampling leads to significantly +better results than greedy, thus we encourage sampling mode to be used where possible. Sampling is enabled by default, +and can be explicitly specified by setting `do_sample=True` in the call to [`MusicgenForConditionalGeneration.generate`], +or by overriding the model's generation config (see below). + +Generation is limited by the sinusoidal positional embeddings to 30 second inputs. Meaning, MusicGen cannot generate more +than 30 seconds of audio (1503 tokens), and input audio passed by Audio-Prompted Generation contributes to this limit so, +given an input of 20 seconds of audio, MusicGen cannot generate more than 10 seconds of additional audio. + +Transformers supports both mono (1-channel) and stereo (2-channel) variants of MusicGen. The mono channel versions +generate a single set of codebooks. The stereo versions generate 2 sets of codebooks, 1 for each channel (left/right), +and each set of codebooks is decoded independently through the audio compression model. 
The audio streams for each +channel are combined to give the final stereo output. + +### Unconditional Generation + +The inputs for unconditional (or 'null') generation can be obtained through the method +[`MusicgenForConditionalGeneration.get_unconditional_inputs`]: + +```python +>>> from transformers import MusicgenForConditionalGeneration + +>>> model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small") +>>> unconditional_inputs = model.get_unconditional_inputs(num_samples=1) + +>>> audio_values = model.generate(**unconditional_inputs, do_sample=True, max_new_tokens=256) +``` + +The audio outputs are a three-dimensional Torch tensor of shape `(batch_size, num_channels, sequence_length)`. To listen +to the generated audio samples, you can either play them in an ipynb notebook: + +```python +from IPython.display import Audio + +sampling_rate = model.config.audio_encoder.sampling_rate +Audio(audio_values[0].numpy(), rate=sampling_rate) +``` + +Or save them as a `.wav` file using a third-party library, e.g. `scipy`: + +```python +>>> import scipy + +>>> sampling_rate = model.config.audio_encoder.sampling_rate +>>> scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].numpy()) +``` + +### Text-Conditional Generation + +The model can generate an audio sample conditioned on a text prompt through use of the [`MusicgenProcessor`] to pre-process +the inputs: + +```python +>>> from transformers import AutoProcessor, MusicgenForConditionalGeneration + +>>> processor = AutoProcessor.from_pretrained("facebook/musicgen-small") +>>> model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small") + +>>> inputs = processor( +... text=["80s pop track with bassy drums and synth", "90s rock song with loud guitars and heavy drums"], +... padding=True, +... return_tensors="pt", +... ) +>>> audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256) +``` + +The `guidance_scale` is used in classifier free guidance (CFG), setting the weighting between the conditional logits +(which are predicted from the text prompts) and the unconditional logits (which are predicted from an unconditional or +'null' prompt). Higher guidance scale encourages the model to generate samples that are more closely linked to the input +prompt, usually at the expense of poorer audio quality. CFG is enabled by setting `guidance_scale > 1`. For best results, +use `guidance_scale=3` (default). + +### Audio-Prompted Generation + +The same [`MusicgenProcessor`] can be used to pre-process an audio prompt that is used for audio continuation. In the +following example, we load an audio file using the 🤗 Datasets library, which can be pip installed through the command +below: + +```bash +pip install --upgrade pip +pip install datasets[audio] +``` + +```python +>>> from transformers import AutoProcessor, MusicgenForConditionalGeneration +>>> from datasets import load_dataset + +>>> processor = AutoProcessor.from_pretrained("facebook/musicgen-small") +>>> model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small") + +>>> dataset = load_dataset("sanchit-gandhi/gtzan", split="train", streaming=True) +>>> sample = next(iter(dataset))["audio"] + +>>> # take the first half of the audio sample +>>> sample["array"] = sample["array"][: len(sample["array"]) // 2] + +>>> inputs = processor( +... audio=sample["array"], +... sampling_rate=sample["sampling_rate"], +... text=["80s blues track with groovy saxophone"], +... 
padding=True, +... return_tensors="pt", +... ) +>>> audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256) +``` + +For batched audio-prompted generation, the generated `audio_values` can be post-processed to remove padding by using the +[`MusicgenProcessor`] class: + +```python +>>> from transformers import AutoProcessor, MusicgenForConditionalGeneration +>>> from datasets import load_dataset + +>>> processor = AutoProcessor.from_pretrained("facebook/musicgen-small") +>>> model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small") + +>>> dataset = load_dataset("sanchit-gandhi/gtzan", split="train", streaming=True) +>>> sample = next(iter(dataset))["audio"] + +>>> # take the first quarter of the audio sample +>>> sample_1 = sample["array"][: len(sample["array"]) // 4] + +>>> # take the first half of the audio sample +>>> sample_2 = sample["array"][: len(sample["array"]) // 2] + +>>> inputs = processor( +... audio=[sample_1, sample_2], +... sampling_rate=sample["sampling_rate"], +... text=["80s blues track with groovy saxophone", "90s rock song with loud guitars and heavy drums"], +... padding=True, +... return_tensors="pt", +... ) +>>> audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256) + +>>> # post-process to remove padding from the batched audio +>>> audio_values = processor.batch_decode(audio_values, padding_mask=inputs.padding_mask) +``` + +### Generation Configuration + +The default parameters that control the generation process, such as sampling, guidance scale and number of generated +tokens, can be found in the model's generation config, and updated as desired: + +```python +>>> from transformers import MusicgenForConditionalGeneration + +>>> model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small") + +>>> # inspect the default generation config +>>> model.generation_config + +>>> # increase the guidance scale to 4.0 +>>> model.generation_config.guidance_scale = 4.0 + +>>> # decrease the max length to 256 tokens +>>> model.generation_config.max_length = 256 +``` + +Note that any arguments passed to the generate method will **supersede** those in the generation config, so setting +`do_sample=False` in the call to generate will supersede the setting of `model.generation_config.do_sample` in the +generation config. + +## Model Structure + +The MusicGen model can be de-composed into three distinct stages: +1. Text encoder: maps the text inputs to a sequence of hidden-state representations. The pre-trained MusicGen models use a frozen text encoder from either T5 or Flan-T5 +2. MusicGen decoder: a language model (LM) that auto-regressively generates audio tokens (or codes) conditional on the encoder hidden-state representations +3. Audio encoder/decoder: used to encode an audio prompt to use as prompt tokens, and recover the audio waveform from the audio tokens predicted by the decoder + +Thus, the MusicGen model can either be used as a standalone decoder model, corresponding to the class [`MusicgenForCausalLM`], +or as a composite model that includes the text encoder and audio encoder/decoder, corresponding to the class +[`MusicgenForConditionalGeneration`]. 
If only the decoder needs to be loaded from the pre-trained checkpoint, it can be loaded by first +specifying the correct config, or be accessed through the `.decoder` attribute of the composite model: + +```python +>>> from transformers import AutoConfig, MusicgenForCausalLM, MusicgenForConditionalGeneration + +>>> # Option 1: get decoder config and pass to `.from_pretrained` +>>> decoder_config = AutoConfig.from_pretrained("facebook/musicgen-small").decoder +>>> decoder = MusicgenForCausalLM.from_pretrained("facebook/musicgen-small", **decoder_config) + +>>> # Option 2: load the entire composite model, but only return the decoder +>>> decoder = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small").decoder +``` + +Since the text encoder and audio encoder/decoder models are frozen during training, the MusicGen decoder [`MusicgenForCausalLM`] +can be trained standalone on a dataset of encoder hidden-states and audio codes. For inference, the trained decoder can +be combined with the frozen text encoder and audio encoder/decoders to recover the composite [`MusicgenForConditionalGeneration`] +model. + +Tips: +* MusicGen is trained on the 32kHz checkpoint of Encodec. You should ensure you use a compatible version of the Encodec model. +* Sampling mode tends to deliver better results than greedy - you can toggle sampling with the variable `do_sample` in the call to [`MusicgenForConditionalGeneration.generate`] + +## MusicgenDecoderConfig + +[[autodoc]] MusicgenDecoderConfig + +## MusicgenConfig + +[[autodoc]] MusicgenConfig + +## MusicgenProcessor + +[[autodoc]] MusicgenProcessor + +## MusicgenModel + +[[autodoc]] MusicgenModel + - forward + +## MusicgenForCausalLM + +[[autodoc]] MusicgenForCausalLM + - forward + +## MusicgenForConditionalGeneration + +[[autodoc]] MusicgenForConditionalGeneration + - forward diff --git a/docs/source/en/model_doc/mvp.mdx b/docs/source/en/model_doc/mvp.md similarity index 90% rename from docs/source/en/model_doc/mvp.mdx rename to docs/source/en/model_doc/mvp.md index 6fae8c73111d72..0d98e04cf091ee 100644 --- a/docs/source/en/model_doc/mvp.mdx +++ b/docs/source/en/model_doc/mvp.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # MVP @@ -24,15 +28,17 @@ According to the abstract, - MVP also has task-specific soft prompts to stimulate the model's capacity in performing a certain task. - MVP is specially designed for natural language generation and can be adapted to a wide range of generation tasks, including but not limited to summarization, data-to-text generation, open-ended dialogue system, story generation, question answering, question generation, task-oriented dialogue system, commonsense generation, paraphrase generation, text style transfer, and text simplification. Our model can also be adapted to natural language understanding tasks such as sequence classification and (extractive) question answering. -Tips: +This model was contributed by [Tianyi Tang](https://huggingface.co/StevenTang). 
The detailed information and instructions can be found [here](https://github.com/RUCAIBox/MVP). + +## Usage tips + - We have released a series of models [here](https://huggingface.co/models?filter=mvp), including MVP, MVP with task-specific prompts, and multi-task pre-trained variants. - If you want to use a model without prompts (standard Transformer), you can load it through `MvpForConditionalGeneration.from_pretrained('RUCAIBox/mvp')`. - If you want to use a model with task-specific prompts, such as summarization, you can load it through `MvpForConditionalGeneration.from_pretrained('RUCAIBox/mvp-summarization')`. - Our model supports lightweight prompt tuning following [Prefix-tuning](https://arxiv.org/abs/2101.00190) with method `set_lightweight_tuning()`. -This model was contributed by [Tianyi Tang](https://huggingface.co/StevenTang). The detailed information and instructions can be found [here](https://github.com/RUCAIBox/MVP). +## Usage examples -## Examples For summarization, it is an example to use MVP and MVP with summarization-specific prompts. ```python @@ -100,6 +106,15 @@ For lightweight tuning, *i.e.*, fixing the model and only tuning prompts, you ca >>> model.set_lightweight_tuning() ``` +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Question answering task guide](../tasks/question_answering) +- [Causal language modeling task guide](../tasks/language_modeling) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Translation task guide](../tasks/translation) +- [Summarization task guide](../tasks/summarization) + ## MvpConfig [[autodoc]] MvpConfig diff --git a/docs/source/en/model_doc/nat.mdx b/docs/source/en/model_doc/nat.md similarity index 94% rename from docs/source/en/model_doc/nat.mdx rename to docs/source/en/model_doc/nat.md index 636b984c6acf6a..ecb61ccb0a3397 100644 --- a/docs/source/en/model_doc/nat.mdx +++ b/docs/source/en/model_doc/nat.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Neighborhood Attention Transformer @@ -32,7 +36,18 @@ that boosts image classification and downstream vision performance. Experimental NAT-Tiny reaches 83.2% top-1 accuracy on ImageNet, 51.4% mAP on MS-COCO and 48.4% mIoU on ADE20K, which is 1.9% ImageNet accuracy, 1.0% COCO mAP, and 2.6% ADE20K mIoU improvement over a Swin model with similar size. * -Tips: + + + Neighborhood Attention compared to other attention patterns. +Taken from the original paper. + +This model was contributed by [Ali Hassani](https://huggingface.co/alihassanijr). +The original code can be found [here](https://github.com/SHI-Labs/Neighborhood-Attention-Transformer). + +## Usage tips + - One can use the [`AutoImageProcessor`] API to prepare images for the model. - NAT can be used as a *backbone*. When `output_hidden_states = True`, it will output both `hidden_states` and `reshaped_hidden_states`. @@ -46,16 +61,6 @@ or build on your system by running `pip install natten`. Note that the latter will likely take time to compile. 
NATTEN does not support Windows devices yet. - Patch size of 4 is only supported at the moment. - - - Neighborhood Attention compared to other attention patterns. -Taken from the original paper. - -This model was contributed by [Ali Hassani](https://huggingface.co/alihassanijr). -The original code can be found [here](https://github.com/SHI-Labs/Neighborhood-Attention-Transformer). - ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with NAT. @@ -63,6 +68,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`NatForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). +- See also: [Image classification task guide](../tasks/image_classification) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. @@ -70,7 +76,6 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] NatConfig - ## NatModel [[autodoc]] NatModel diff --git a/docs/source/en/model_doc/nezha.mdx b/docs/source/en/model_doc/nezha.md similarity index 84% rename from docs/source/en/model_doc/nezha.mdx rename to docs/source/en/model_doc/nezha.md index 8b613c38eb94dc..872f576f1286eb 100644 --- a/docs/source/en/model_doc/nezha.mdx +++ b/docs/source/en/model_doc/nezha.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Nezha @@ -31,6 +35,14 @@ and natural language inference (XNLI).* This model was contributed by [sijunhe](https://huggingface.co/sijunhe). The original code can be found [here](https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/NEZHA-PyTorch). +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) + ## NezhaConfig [[autodoc]] NezhaConfig diff --git a/docs/source/en/model_doc/nllb-moe.md b/docs/source/en/model_doc/nllb-moe.md new file mode 100644 index 00000000000000..5c283fb3f0e1b1 --- /dev/null +++ b/docs/source/en/model_doc/nllb-moe.md @@ -0,0 +1,134 @@ + + +# NLLB-MOE + + +## Overview + +The NLLB model was presented in [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by Marta R. 
Costa-jussà, James Cross, Onur Çelebi, +Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, +Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, +Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, +Safiyyah Saleem, Holger Schwenk, and Jeff Wang. + +The abstract of the paper is the following: + +*Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. +However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the +200 language barrier while ensuring safe, high quality results, all while keeping ethical considerations in mind? In No Language Left Behind, we took on this challenge by +first contextualizing the need for low-resource language translation support through exploratory interviews with native speakers. Then, we created datasets and models aimed +at narrowing the performance gap between low and high-resource languages. More specifically, we developed a conditional compute model based on Sparsely Gated Mixture of +Experts that is trained on data obtained with novel and effective data mining techniques tailored for low-resource languages. We propose multiple architectural and training +improvements to counteract overfitting while training on thousands of tasks. Critically, we evaluated the performance of over 40,000 different translation directions using +a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety. +Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system.* + +This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ). +The original code can be found [here](https://github.com/facebookresearch/fairseq). + +## Usage tips + +- M2M100ForConditionalGeneration is the base model for both NLLB and NLLB MoE +- The NLLB-MoE is very similar to the NLLB model, but it's feed forward layer is based on the implementation of SwitchTransformers. +- The tokenizer is the same as the NLLB models. + +## Implementation differences with SwitchTransformers + +The biggest difference is the way the tokens are routed. NLLB-MoE uses a `top-2-gate` which means that for each input, only the top two experts are selected based on the +highest predicted probabilities from the gating network, and the remaining experts are ignored. In `SwitchTransformers`, only the top-1 probabilities are computed, +which means that tokens have less probability of being forwarded. Moreover, if a token is not routed to any expert, `SwitchTransformers` still adds its unmodified hidden +states (kind of like a residual connection) while they are masked in `NLLB`'s top-2 routing mechanism. + +## Generating with NLLB-MoE + +The available checkpoints require around 350GB of storage. Make sure to use `accelerate` if you do not have enough RAM on your machine. 
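+
+A minimal sketch of loading the checkpoint with `accelerate` (assuming the weights can be sharded automatically with `device_map="auto"` and that the machine's combined GPU/CPU memory is sufficient) could look as follows:
+
+```python
+>>> import torch
+>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+
+>>> tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-moe-54b")
+>>> # load in half precision and let accelerate place the weights across the available devices
+>>> model = AutoModelForSeq2SeqLM.from_pretrained(
+...     "facebook/nllb-moe-54b", torch_dtype=torch.float16, device_map="auto"
+... )
+```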
+ +While generating the target text set the `forced_bos_token_id` to the target language id. The following +example shows how to translate English to French using the *facebook/nllb-200-distilled-600M* model. + +Note that we're using the BCP-47 code for French `fra_Latn`. See [here](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200) +for the list of all BCP-47 in the Flores 200 dataset. + +```python +>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-moe-54b") +>>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-moe-54b") + +>>> article = "Previously, Ring's CEO, Jamie Siminoff, remarked the company started when his doorbell wasn't audible from his shop in his garage." +>>> inputs = tokenizer(article, return_tensors="pt") + +>>> translated_tokens = model.generate( +... **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["fra_Latn"], max_length=50 +... ) +>>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0] +"Auparavant, le PDG de Ring, Jamie Siminoff, a fait remarquer que la société avait commencé lorsque sa sonnette n'était pas audible depuis son magasin dans son garage." +``` + +### Generating from any other language than English + +English (`eng_Latn`) is set as the default language from which to translate. In order to specify that you'd like to translate from a different language, +you should specify the BCP-47 code in the `src_lang` keyword argument of the tokenizer initialization. + +See example below for a translation from romanian to german: + +```python +>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-moe-54b", src_lang="ron_Latn") +>>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-moe-54b") + +>>> article = "Şeful ONU spune că nu există o soluţie militară în Siria" +>>> inputs = tokenizer(article, return_tensors="pt") + +>>> translated_tokens = model.generate( +... **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["deu_Latn"], max_length=30 +... ) +>>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0] +``` + +## Resources + +- [Translation task guide](../tasks/translation) +- [Summarization task guide](../tasks/summarization) + + +## NllbMoeConfig + +[[autodoc]] NllbMoeConfig + +## NllbMoeTop2Router + +[[autodoc]] NllbMoeTop2Router + - route_tokens + - forward + +## NllbMoeSparseMLP + +[[autodoc]] NllbMoeSparseMLP + - forward + +## NllbMoeModel + +[[autodoc]] NllbMoeModel + - forward + +## NllbMoeForConditionalGeneration + +[[autodoc]] NllbMoeForConditionalGeneration + - forward + diff --git a/docs/source/en/model_doc/nllb.mdx b/docs/source/en/model_doc/nllb.md similarity index 71% rename from docs/source/en/model_doc/nllb.mdx rename to docs/source/en/model_doc/nllb.md index d2c0089fa3a1af..3f272129d2f8f0 100644 --- a/docs/source/en/model_doc/nllb.mdx +++ b/docs/source/en/model_doc/nllb.md @@ -8,14 +8,56 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
+ +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # NLLB -**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=bug&template=bug-report.yml) and assign -@LysandreJik +## Updated tokenizer behavior + +**DISCLAIMER:** The default behaviour for the tokenizer was fixed and thus changed in April 2023. +The previous version adds `[self.eos_token_id, self.cur_lang_code]` at the end of the token sequence for both target and source tokenization. This is wrong as the NLLB paper mentions (page 48, 6.1.1. Model Architecture) : -## Overview of NLLB +*Note that we prefix the source sequence with the source language, as opposed to the target +language as previously done in several works (Arivazhagan et al., 2019; Johnson et al., +2017). This is primarily because we prioritize optimizing zero-shot performance of our +model on any pair of 200 languages at a minor cost to supervised performance.* + +Previous behaviour: + +```python +>>> from transformers import NllbTokenizer + +>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M") +>>> tokenizer("How was your day?").input_ids +[13374, 1398, 4260, 4039, 248130, 2, 256047] + +>>> # 2: '
' +>>> # 256047 : 'eng_Latn' +``` +New behaviour + +```python +>>> from transformers import NllbTokenizer + +>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M") +>>> tokenizer("How was your day?").input_ids +[256047, 13374, 1398, 4260, 4039, 248130, 2] + ``` + +Enabling the old behaviour can be done as follows: +```python +>>> from transformers import NllbTokenizer + +>>> tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", legacy_behaviour=True) +``` + +For more details, feel free to check the linked [PR](https://github.com/huggingface/transformers/pull/22313) and [Issue](https://github.com/huggingface/transformers/issues/19943). + +## Overview The NLLB model was presented in [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, @@ -35,7 +77,9 @@ improvements to counteract overfitting while training on thousands of tasks. Cri a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety. Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system.* -This implementation contains the dense models available on release. Let us know via a GitHub issue if you would like to see the MoE models as well. +This implementation contains the dense models available on release. + +**The sparse model NLLB-MoE (Mixture of Expert) is now available! More details [here](nllb-moe)** This model was contributed by [Lysandre](https://huggingface.co/lysandre). The authors' code can be found [here](https://github.com/facebookresearch/fairseq/tree/nllb). @@ -74,9 +118,9 @@ See example below for a translation from romanian to german: >>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer >>> tokenizer = AutoTokenizer.from_pretrained( -... "facebook/nllb-200-distilled-600M", use_auth_token=True, src_lang="ron_Latn" +... "facebook/nllb-200-distilled-600M", token=True, src_lang="ron_Latn" ... ) ->>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", use_auth_token=True) +>>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", token=True) >>> article = "Şeful ONU spune că nu există o soluţie militară în Siria" >>> inputs = tokenizer(article, return_tensors="pt") @@ -88,6 +132,11 @@ See example below for a translation from romanian to german: UN-Chef sagt, es gibt keine militärische Lösung in Syrien ``` +## Resources + +- [Translation task guide](../tasks/translation) +- [Summarization task guide](../tasks/summarization) + ## NllbTokenizer [[autodoc]] NllbTokenizer diff --git a/docs/source/en/model_doc/nougat.md b/docs/source/en/model_doc/nougat.md new file mode 100644 index 00000000000000..a39e74eb213ab8 --- /dev/null +++ b/docs/source/en/model_doc/nougat.md @@ -0,0 +1,115 @@ + + +# Nougat + +## Overview + +The Nougat model was proposed in [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) by +Lukas Blecher, Guillem Cucurull, Thomas Scialom, Robert Stojnic. 
Nougat uses the same architecture as [Donut](donut), meaning an image Transformer +encoder and an autoregressive text Transformer decoder to translate scientific PDFs to markdown, enabling easier access to them. + +The abstract from the paper is the following: + +*Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.* + + + + Nougat high-level overview. Taken from the original paper. + +This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found +[here](https://github.com/facebookresearch/nougat). + +## Usage tips + +- The quickest way to get started with Nougat is by checking the [tutorial + notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Nougat), which show how to use the model + at inference time as well as fine-tuning on custom data. +- Nougat is always used within the [VisionEncoderDecoder](vision-encoder-decoder) framework. The model is identical to [Donut](donut) in terms of architecture. + +## Inference + +Nougat's [`VisionEncoderDecoder`] model accepts images as input and makes use of +[`~generation.GenerationMixin.generate`] to autoregressively generate text given the input image. + +The [`NougatImageProcessor`] class is responsible for preprocessing the input image and +[`NougatTokenizerFast`] decodes the generated target tokens to the target string. The +[`NougatProcessor`] wraps [`NougatImageProcessor`] and [`NougatTokenizerFast`] classes +into a single instance to both extract the input features and decode the predicted token ids. + +- Step-by-step PDF transcription + +```py +>>> from huggingface_hub import hf_hub_download +>>> import re +>>> from PIL import Image + +>>> from transformers import NougatProcessor, VisionEncoderDecoderModel +>>> from datasets import load_dataset +>>> import torch + +>>> processor = NougatProcessor.from_pretrained("facebook/nougat-base") +>>> model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base") + +>>> device = "cuda" if torch.cuda.is_available() else "cpu" +>>> model.to(device) # doctest: +IGNORE_RESULT + +>>> # prepare PDF image for the model +>>> filepath = hf_hub_download(repo_id="hf-internal-testing/fixtures_docvqa", filename="nougat_paper.png", repo_type="dataset") +>>> image = Image.open(filepath) +>>> pixel_values = processor(image, return_tensors="pt").pixel_values + +>>> # generate transcription (here we only generate 30 tokens) +>>> outputs = model.generate( +... pixel_values.to(device), +... min_length=1, +... max_new_tokens=30, +... bad_words_ids=[[processor.tokenizer.unk_token_id]], +... 
) + +>>> sequence = processor.batch_decode(outputs, skip_special_tokens=True)[0] +>>> sequence = processor.post_process_generation(sequence, fix_markdown=False) +>>> # note: we're using repr here such for the sake of printing the \n characters, feel free to just print the sequence +>>> print(repr(sequence)) +'\n\n# Nougat: Neural Optical Understanding for Academic Documents\n\n Lukas Blecher\n\nCorrespondence to: lblecher@' +``` + +See the [model hub](https://huggingface.co/models?filter=nougat) to look for Nougat checkpoints. + + + +The model is identical to [Donut](donut) in terms of architecture. + + + +## NougatImageProcessor + +[[autodoc]] NougatImageProcessor + - preprocess + +## NougatTokenizerFast + +[[autodoc]] NougatTokenizerFast + +## NougatProcessor + +[[autodoc]] NougatProcessor + - __call__ + - from_pretrained + - save_pretrained + - batch_decode + - decode + - post_process_generation \ No newline at end of file diff --git a/docs/source/en/model_doc/nystromformer.mdx b/docs/source/en/model_doc/nystromformer.md similarity index 85% rename from docs/source/en/model_doc/nystromformer.mdx rename to docs/source/en/model_doc/nystromformer.md index 5c1619b57f1ea9..185c4e1f011a4f 100644 --- a/docs/source/en/model_doc/nystromformer.mdx +++ b/docs/source/en/model_doc/nystromformer.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Nyströmformer @@ -33,6 +37,14 @@ favorably relative to other efficient self-attention methods. Our code is availa This model was contributed by [novice03](https://huggingface.co/novice03). The original code can be found [here](https://github.com/mlpen/Nystromformer). +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) + ## NystromformerConfig [[autodoc]] NystromformerConfig diff --git a/docs/source/en/model_doc/oneformer.mdx b/docs/source/en/model_doc/oneformer.md similarity index 97% rename from docs/source/en/model_doc/oneformer.mdx rename to docs/source/en/model_doc/oneformer.md index 3560d84bc7e1b0..97a6aa64f5437b 100644 --- a/docs/source/en/model_doc/oneformer.mdx +++ b/docs/source/en/model_doc/oneformer.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # OneFormer @@ -22,7 +26,14 @@ The abstract from the paper is the following: *Universal Image Segmentation is not a new concept. 
Past attempts to unify image segmentation in the last decades include scene parsing, panoptic segmentation, and, more recently, new panoptic architectures. However, such panoptic architectures do not truly unify image segmentation because they need to be trained individually on the semantic, instance, or panoptic segmentation to achieve the best performance. Ideally, a truly universal framework should be trained only once and achieve SOTA performance across all three image segmentation tasks. To that end, we propose OneFormer, a universal image segmentation framework that unifies segmentation with a multi-task train-once design. We first propose a task-conditioned joint training strategy that enables training on ground truths of each domain (semantic, instance, and panoptic segmentation) within a single multi-task training process. Secondly, we introduce a task token to condition our model on the task at hand, making our model task-dynamic to support multi-task training and inference. Thirdly, we propose using a query-text contrastive loss during training to establish better inter-task and inter-class distinctions. Notably, our single OneFormer model outperforms specialized Mask2Former models across all three segmentation tasks on ADE20k, CityScapes, and COCO, despite the latter being trained on each of the three tasks individually with three times the resources. With new ConvNeXt and DiNAT backbones, we observe even more performance improvement. We believe OneFormer is a significant step towards making image segmentation more universal and accessible.* -Tips: +The figure below illustrates the architecture of OneFormer. Taken from the [original paper](https://arxiv.org/abs/2211.06220). + + + +This model was contributed by [Jitesh Jain](https://huggingface.co/praeclarumjj3). The original code can be found [here](https://github.com/SHI-Labs/OneFormer). + +## Usage tips + - OneFormer requires two inputs during inference: *image* and *task token*. - During training, OneFormer only uses panoptic annotations. - If you want to train the model in a distributed environment across multiple nodes, then one should update the @@ -31,12 +42,6 @@ Tips: - One can use [`OneFormerProcessor`] to prepare input images and task inputs for the model and optional targets for the model. [`OneformerProcessor`] wraps [`OneFormerImageProcessor`] and [`CLIPTokenizer`] into a single instance to both prepare the images and encode the task inputs. - To get the final segmentation, depending on the task, you can call [`~OneFormerProcessor.post_process_semantic_segmentation`] or [`~OneFormerImageProcessor.post_process_instance_segmentation`] or [`~OneFormerImageProcessor.post_process_panoptic_segmentation`]. All three tasks can be solved using [`OneFormerForUniversalSegmentation`] output, panoptic segmentation accepts an optional `label_ids_to_fuse` argument to fuse instances of the target object/s (e.g. sky) together. -The figure below illustrates the architecture of OneFormer. Taken from the [original paper](https://arxiv.org/abs/2211.06220). - - - -This model was contributed by [Jitesh Jain](https://huggingface.co/praeclarumjj3). The original code can be found [here](https://github.com/SHI-Labs/OneFormer). - ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with OneFormer. 
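+To make the usage tips above concrete, the sketch below runs semantic segmentation end to end. It is only a minimal illustration: the `shi-labs/oneformer_ade20k_swin_tiny` checkpoint and the COCO test image URL are assumptions, and the processor is assumed to forward the `post_process_semantic_segmentation` call described in the tips; the instance or panoptic variants simply swap the task token and the post-processing method.
+
+```python
+import requests
+import torch
+from PIL import Image
+
+from transformers import OneFormerProcessor, OneFormerForUniversalSegmentation
+
+# assumed checkpoint: the ADE20k Swin-tiny model released by SHI Labs
+processor = OneFormerProcessor.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")
+model = OneFormerForUniversalSegmentation.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")
+
+url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+image = Image.open(requests.get(url, stream=True).raw)
+
+# the task token ("semantic", "instance" or "panoptic") conditions the model
+inputs = processor(images=image, task_inputs=["semantic"], return_tensors="pt")
+with torch.no_grad():
+    outputs = model(**inputs)
+
+# post-process into a (height, width) map of class indices
+semantic_map = processor.post_process_semantic_segmentation(
+    outputs, target_sizes=[image.size[::-1]]
+)[0]
+print(semantic_map.shape)
+```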
diff --git a/docs/source/en/model_doc/open-llama.md b/docs/source/en/model_doc/open-llama.md new file mode 100644 index 00000000000000..01170e7e3be60f --- /dev/null +++ b/docs/source/en/model_doc/open-llama.md @@ -0,0 +1,61 @@ + + +# Open-Llama + + + +This model is in maintenance mode only, we don't accept any new PRs changing its code. + +If you run into any issues running this model, please reinstall the last version that supported this model: v4.31.0. +You can do so by running the following command: `pip install -U transformers==4.31.0`. + + + + + +This model differs from the [OpenLLaMA models](https://huggingface.co/models?search=openllama) on the Hugging Face Hub, which primarily use the [LLaMA](llama) architecture. + + + +## Overview + +The Open-Llama model was proposed in the open source Open-Llama project by community developer s-JoL. + +The model is mainly based on LLaMA with some modifications, incorporating memory-efficient attention from Xformers, stable embedding from Bloom, and shared input-output embedding from PaLM. +And the model is pre-trained on both Chinese and English, which gives it better performance on Chinese language tasks. + +This model was contributed by [s-JoL](https://huggingface.co/s-JoL). +The original code was released on GitHub by [s-JoL](https://github.com/s-JoL), but is now removed. + +## OpenLlamaConfig + +[[autodoc]] OpenLlamaConfig + +## OpenLlamaModel + +[[autodoc]] OpenLlamaModel + - forward + +## OpenLlamaForCausalLM + +[[autodoc]] OpenLlamaForCausalLM + - forward + +## OpenLlamaForSequenceClassification + +[[autodoc]] OpenLlamaForSequenceClassification + - forward diff --git a/docs/source/en/model_doc/openai-gpt.mdx b/docs/source/en/model_doc/openai-gpt.md similarity index 94% rename from docs/source/en/model_doc/openai-gpt.mdx rename to docs/source/en/model_doc/openai-gpt.md index 0c444b5a4f6946..1fbfbbcd89e336 100644 --- a/docs/source/en/model_doc/openai-gpt.mdx +++ b/docs/source/en/model_doc/openai-gpt.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # OpenAI GPT @@ -40,7 +44,12 @@ approach on a wide range of benchmarks for natural language understanding. Our g discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied.* -Tips: +[Write With Transformer](https://transformer.huggingface.co/doc/gpt) is a webapp created and hosted by Hugging Face +showcasing the generative capabilities of several models. GPT is one of them. + +This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/openai/finetune-transformer-lm). + +## Usage tips - GPT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than the left. @@ -48,10 +57,6 @@ Tips: token in a sequence. Leveraging this feature allows GPT-2 to generate syntactically coherent text as it can be observed in the *run_generation.py* example script. 
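+As a quick illustration of the causal language modeling usage described in the tips above, here is a minimal generation sketch (it assumes the canonical `openai-gpt` checkpoint id on the Hub):
+
+```python
+from transformers import pipeline
+
+# assumed checkpoint id: the original "openai-gpt" model on the Hub
+generator = pipeline("text-generation", model="openai-gpt")
+print(generator("Jim Henson was a puppeteer who", max_new_tokens=20)[0]["generated_text"])
+```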
-[Write With Transformer](https://transformer.huggingface.co/doc/gpt) is a webapp created and hosted by Hugging Face -showcasing the generative capabilities of several models. GPT is one of them. - -This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/openai/finetune-transformer-lm). Note: @@ -73,6 +78,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - A blog post on [outperforming OpenAI GPT-3 with SetFit for text-classification](https://www.philschmid.de/getting-started-setfit). +- See also: [Text classification task guide](../tasks/sequence_classification) @@ -86,6 +92,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [Causal language modeling](https://huggingface.co/course/en/chapter7/6?fw=pt#training-a-causal-language-model-from-scratch) chapter of the 🤗 Hugging Face Course. - [`OpenAIGPTLMHeadModel`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling), [text generation example script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-generation/run_generation.py) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb). - [`TFOpenAIGPTLMHeadModel`] is supported by this [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/language-modeling#run_clmpy) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb). +- See also: [Causal language modeling task guide](../tasks/language_modeling) @@ -110,6 +117,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] models.openai.modeling_tf_openai.TFOpenAIGPTDoubleHeadsModelOutput + + + ## OpenAIGPTModel [[autodoc]] OpenAIGPTModel @@ -130,6 +140,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] OpenAIGPTForSequenceClassification - forward + + + ## TFOpenAIGPTModel [[autodoc]] TFOpenAIGPTModel @@ -149,3 +162,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] TFOpenAIGPTForSequenceClassification - call + + + diff --git a/docs/source/en/model_doc/opt.mdx b/docs/source/en/model_doc/opt.md similarity index 64% rename from docs/source/en/model_doc/opt.mdx rename to docs/source/en/model_doc/opt.md index 6bf81352176f8e..1b02b888994ecf 100644 --- a/docs/source/en/model_doc/opt.mdx +++ b/docs/source/en/model_doc/opt.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # OPT @@ -21,13 +25,13 @@ The abstract from the paper is the following: *Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. 
Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop. We are also releasing our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.* -Tips: -- OPT has the same architecture as [`BartDecoder`]. -- Contrary to GPT2, OPT adds the EOS token `</s>` to the beginning of every prompt. **Note**: Make sure to pass `use_fast=False` when loading OPT's tokenizer with [`AutoTokenizer`] to get the correct tokenizer. - This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ), [Younes Belkada](https://huggingface.co/ybelkada), and [Patrick Von Platen](https://huggingface.co/patrickvonplaten). The original code can be found [here](https://github.com/facebookresearch/metaseq). +Tips: +- OPT has the same architecture as [`BartDecoder`]. +- Contrary to GPT2, OPT adds the EOS token `</s>` to the beginning of every prompt. + ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with OPT. If you're @@ -45,7 +49,7 @@ The resource should ideally demonstrate something new instead of duplicating an -- [Token classification](https://huggingface.co/course/chapter7/2?fw=pt) chapter of the 🤗 Hugging Face Course. +- [Text classification task guide](sequence_classification.md) - [`OPTForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb). @@ -56,12 +60,64 @@ The resource should ideally demonstrate something new instead of duplicating an ⚡️ Inference -- A blog bost on [How 🤗 Accelerate runs very large models thanks to PyTorch](https://huggingface.co/blog/accelerate-large-models) with OPT. +- A blog post on [How 🤗 Accelerate runs very large models thanks to PyTorch](https://huggingface.co/blog/accelerate-large-models) with OPT. + + +## Combining OPT and Flash Attention 2 + +First, make sure to install the latest version of Flash Attention 2: + +```bash +pip install -U flash-attn --no-build-isolation +``` + +Also make sure that your hardware is compatible with Flash Attention 2; read more about it in the official documentation of the flash-attn repository. Finally, make sure to load your model in half-precision (e.g.
`torch.float16`). + +To load and run a model using Flash Attention 2, refer to the snippet below: + +```python +>>> import torch +>>> from transformers import OPTForCausalLM, GPT2Tokenizer +>>> device = "cuda" # the device to load the model onto + +>>> model = OPTForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.float16, attn_implementation="flash_attention_2") +>>> tokenizer = GPT2Tokenizer.from_pretrained("facebook/opt-350m") + +>>> prompt = ("A chat between a curious human and the Statue of Liberty.\n\nHuman: What is your name?\nStatue: I am the " + "Statue of Liberty.\nHuman: Where do you live?\nStatue: New York City.\nHuman: How long have you lived " + "there?") + +>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(device) +>>> model.to(device) + +>>> generated_ids = model.generate(**model_inputs, max_new_tokens=30, do_sample=False) +>>> tokenizer.batch_decode(generated_ids)[0] +'A chat between a curious human and the Statue of Liberty.\n\nHuman: What is your name?\nStatue: I am the Statue of Liberty.\nHuman: Where do you live?\nStatue: New York City.\nHuman: How long have you lived there?\nStatue: I have lived here for about a year.\nHuman: What is your favorite place to eat?\nStatue: I love' +``` + +### Expected speedups + +Below is an expected speedup diagram that compares pure inference time between the native implementation in transformers using the `facebook/opt-2.7b` checkpoint and the Flash Attention 2 version of the model using two different sequence lengths. + +
+ +
+ +Below is an expected speedup diagram that compares pure inference time between the native implementation in transformers using `facebook/opt-350m` checkpoint and the Flash Attention 2 version of the model using two different sequence lengths. + +
+ +
+ + ## OPTConfig [[autodoc]] OPTConfig + + + ## OPTModel [[autodoc]] OPTModel @@ -72,6 +128,19 @@ The resource should ideally demonstrate something new instead of duplicating an [[autodoc]] OPTForCausalLM - forward +## OPTForSequenceClassification + +[[autodoc]] OPTForSequenceClassification + - forward + +## OPTForQuestionAnswering + +[[autodoc]] OPTForQuestionAnswering + - forward + + + + ## TFOPTModel [[autodoc]] TFOPTModel @@ -82,23 +151,18 @@ The resource should ideally demonstrate something new instead of duplicating an [[autodoc]] TFOPTForCausalLM - call -## OPTForSequenceClassification - -[[autodoc]] OPTForSequenceClassification - - forward - -## OPTForQuestionAnswering - -[[autodoc]] OPTForQuestionAnswering - - forward + + ## FlaxOPTModel [[autodoc]] FlaxOPTModel - __call__ - ## FlaxOPTForCausalLM [[autodoc]] FlaxOPTForCausalLM - __call__ + + + diff --git a/docs/source/en/model_doc/owlv2.md b/docs/source/en/model_doc/owlv2.md new file mode 100644 index 00000000000000..75fab0853a9778 --- /dev/null +++ b/docs/source/en/model_doc/owlv2.md @@ -0,0 +1,128 @@ + + +# OWLv2 + +## Overview + +OWLv2 was proposed in [Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683) by Matthias Minderer, Alexey Gritsenko, Neil Houlsby. OWLv2 scales up [OWL-ViT](owlvit) using self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. This results in large gains over the previous state-of-the-art for zero-shot object detection. + +The abstract from the paper is the following: + +*Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. Major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and OWL-ST self-training recipe, which address these challenges. OWLv2 surpasses the performance of previous state-of-the-art open-vocabulary detectors already at comparable training scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples, yielding further large improvement: With an L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (43% relative improvement). OWL-ST unlocks Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling.* + + + + OWLv2 high-level overview. Taken from the original paper. + +This model was contributed by [nielsr](https://huggingface.co/nielsr). +The original code can be found [here](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit). + +## Usage example + +OWLv2 is, just like its predecessor [OWL-ViT](owlvit), a zero-shot text-conditioned object detection model. OWL-ViT uses [CLIP](clip) as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. 
Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection. + +[`Owlv2ImageProcessor`] can be used to resize (or rescale) and normalize images for the model and [`CLIPTokenizer`] is used to encode the text. [`Owlv2Processor`] wraps [`Owlv2ImageProcessor`] and [`CLIPTokenizer`] into a single instance to both encode the text and prepare the images. The following example shows how to perform object detection using [`Owlv2Processor`] and [`Owlv2ForObjectDetection`]. + +```python +>>> import requests +>>> from PIL import Image +>>> import torch + +>>> from transformers import Owlv2Processor, Owlv2ForObjectDetection + +>>> processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble") +>>> model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble") + +>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" +>>> image = Image.open(requests.get(url, stream=True).raw) +>>> texts = [["a photo of a cat", "a photo of a dog"]] +>>> inputs = processor(text=texts, images=image, return_tensors="pt") +>>> outputs = model(**inputs) + +>>> # Target image sizes (height, width) to rescale box predictions [batch_size, 2] +>>> target_sizes = torch.Tensor([image.size[::-1]]) +>>> # Convert outputs (bounding boxes and class logits) to Pascal VOC Format (xmin, ymin, xmax, ymax) +>>> results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.1) +>>> i = 0 # Retrieve predictions for the first image for the corresponding text queries +>>> text = texts[i] +>>> boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"] +>>> for box, score, label in zip(boxes, scores, labels): +... box = [round(i, 2) for i in box.tolist()] +... print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}") +Detected a photo of a cat with confidence 0.614 at location [341.67, 17.54, 642.32, 278.51] +Detected a photo of a cat with confidence 0.665 at location [6.75, 38.97, 326.62, 354.85] +``` + +## Resources + +- A demo notebook on using OWLv2 for zero- and one-shot (image-guided) object detection can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/OWLv2). +- [Zero-shot object detection task guide](../tasks/zero_shot_object_detection) + + + +The architecture of OWLv2 is identical to [OWL-ViT](owlvit), however the object detection head now also includes an objectness classifier, which predicts the (query-agnostic) likelihood that a predicted box contains an object (as opposed to background). The objectness score can be used to rank or filter predictions independently of text queries. +Usage of OWLv2 is identical to [OWL-ViT](owlvit) with a new, updated image processor ([`Owlv2ImageProcessor`]). 
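+Building on the detection example above, the sketch below shows one way to use the objectness score, assuming the model output exposes it as an `objectness_logits` field alongside `pred_boxes` (the attribute name is an assumption here, not taken verbatim from this document):
+
+```python
+import torch
+
+# rank predictions by the query-agnostic objectness score
+objectness = torch.sigmoid(outputs.objectness_logits[0])  # one score per predicted box
+top_indices = objectness.topk(5).indices                  # five most "object-like" boxes
+top_boxes = outputs.pred_boxes[0, top_indices]            # normalized (cx, cy, w, h) boxes
+print(top_boxes)
+```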
+ + + +## Owlv2Config + +[[autodoc]] Owlv2Config + - from_text_vision_configs + +## Owlv2TextConfig + +[[autodoc]] Owlv2TextConfig + +## Owlv2VisionConfig + +[[autodoc]] Owlv2VisionConfig + +## Owlv2ImageProcessor + +[[autodoc]] Owlv2ImageProcessor + - preprocess + - post_process_object_detection + - post_process_image_guided_detection + +## Owlv2Processor + +[[autodoc]] Owlv2Processor + +## Owlv2Model + +[[autodoc]] Owlv2Model + - forward + - get_text_features + - get_image_features + +## Owlv2TextModel + +[[autodoc]] Owlv2TextModel + - forward + +## Owlv2VisionModel + +[[autodoc]] Owlv2VisionModel + - forward + +## Owlv2ForObjectDetection + +[[autodoc]] Owlv2ForObjectDetection + - forward + - image_guided_detection diff --git a/docs/source/en/model_doc/owlvit.mdx b/docs/source/en/model_doc/owlvit.md similarity index 79% rename from docs/source/en/model_doc/owlvit.mdx rename to docs/source/en/model_doc/owlvit.md index f13ad4a540e131..c40d3a9e7a17fa 100644 --- a/docs/source/en/model_doc/owlvit.mdx +++ b/docs/source/en/model_doc/owlvit.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # OWL-ViT @@ -20,12 +24,18 @@ The abstract from the paper is the following: *Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.* -## Usage + -OWL-ViT is a zero-shot text-conditioned object detection model. OWL-ViT uses [CLIP](clip) as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection. + OWL-ViT architecture. Taken from the original paper. 
-[`OwlViTFeatureExtractor`] can be used to resize (or rescale) and normalize images for the model and [`CLIPTokenizer`] is used to encode the text. [`OwlViTProcessor`] wraps [`OwlViTFeatureExtractor`] and [`CLIPTokenizer`] into a single instance to both encode the text and prepare the images. The following example shows how to perform object detection using [`OwlViTProcessor`] and [`OwlViTForObjectDetection`]. +This model was contributed by [adirik](https://huggingface.co/adirik). The original code can be found [here](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit). + +## Usage tips +OWL-ViT is a zero-shot text-conditioned object detection model. OWL-ViT uses [CLIP](clip) as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection. + +[`OwlViTImageProcessor`] can be used to resize (or rescale) and normalize images for the model and [`CLIPTokenizer`] is used to encode the text. [`OwlViTProcessor`] wraps [`OwlViTImageProcessor`] and [`CLIPTokenizer`] into a single instance to both encode the text and prepare the images. The following example shows how to perform object detection using [`OwlViTProcessor`] and [`OwlViTForObjectDetection`]. ```python >>> import requests @@ -45,23 +55,21 @@ OWL-ViT is a zero-shot text-conditioned object detection model. OWL-ViT uses [CL >>> # Target image sizes (height, width) to rescale box predictions [batch_size, 2] >>> target_sizes = torch.Tensor([image.size[::-1]]) ->>> # Convert outputs (bounding boxes and class logits) to COCO API ->>> results = processor.post_process(outputs=outputs, target_sizes=target_sizes) - +>>> # Convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax) +>>> results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.1) >>> i = 0 # Retrieve predictions for the first image for the corresponding text queries >>> text = texts[i] >>> boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"] - ->>> score_threshold = 0.1 >>> for box, score, label in zip(boxes, scores, labels): ... box = [round(i, 2) for i in box.tolist()] -... if score >= score_threshold: -... print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}") +... print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}") Detected a photo of a cat with confidence 0.707 at location [324.97, 20.44, 640.58, 373.29] Detected a photo of a cat with confidence 0.717 at location [1.46, 55.26, 315.55, 472.17] ``` -This model was contributed by [adirik](https://huggingface.co/adirik). The original code can be found [here](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit). 
+## Resources + +A demo notebook on using OWL-ViT for zero- and one-shot (image-guided) object detection can be found [here](https://github.com/huggingface/notebooks/blob/main/examples/zeroshot_object_detection_with_owlvit.ipynb). ## OwlViTConfig diff --git a/docs/source/en/model_doc/patchtsmixer.md b/docs/source/en/model_doc/patchtsmixer.md new file mode 100644 index 00000000000000..a67138e533b71a --- /dev/null +++ b/docs/source/en/model_doc/patchtsmixer.md @@ -0,0 +1,94 @@ + + +# PatchTSMixer + +## Overview + +The PatchTSMixer model was proposed in [TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting](https://arxiv.org/pdf/2306.09364.pdf) by Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong and Jayant Kalagnanam. + + +PatchTSMixer is a lightweight time-series modeling approach based on the MLP-Mixer architecture. In this HuggingFace implementation, we provide PatchTSMixer's capabilities to effortlessly facilitate lightweight mixing across patches, channels, and hidden features for effective multivariate time-series modeling. It also supports various attention mechanisms starting from simple gated attention to more complex self-attention blocks that can be customized accordingly. The model can be pretrained and subsequently used for various downstream tasks such as forecasting, classification and regression. + + +The abstract from the paper is the following: + +*TSMixer is a lightweight neural architecture exclusively composed of multi-layer perceptron (MLP) modules designed for multivariate forecasting and representation learning on patched time series. Our model draws inspiration from the success of MLP-Mixer models in computer vision. We demonstrate the challenges involved in adapting Vision MLP-Mixer for time series and introduce empirically validated components to enhance accuracy. This includes a novel design paradigm of attaching online reconciliation heads to the MLP-Mixer backbone, for explicitly modeling the time-series properties such as hierarchy and channel-correlations. We also propose a Hybrid channel modeling approach to effectively handle noisy channel interactions and generalization across diverse datasets, a common challenge in existing patch channel-mixing methods. Additionally, a simple gated attention mechanism is introduced in the backbone to prioritize important features. By incorporating these lightweight components, we significantly enhance the learning capability of simple MLP structures, outperforming complex Transformer models with minimal computing usage. Moreover, TSMixer's modular design enables compatibility with both supervised and masked self-supervised learning methods, making it a promising building block for time-series Foundation Models. TSMixer outperforms state-of-the-art MLP and Transformer models in forecasting by a considerable margin of 8-60%. It also outperforms the latest strong benchmarks of Patch-Transformer models (by 1-2%) with a significant reduction in memory and runtime (2-3X).* + +This model was contributed by [ajati](https://huggingface.co/ajati), [vijaye12](https://huggingface.co/vijaye12), +[gsinthong](https://huggingface.co/gsinthong), [namctin](https://huggingface.co/namctin), +[wmgifford](https://huggingface.co/wmgifford), [kashif](https://huggingface.co/kashif). + +## Usage example + +The code snippet below shows how to randomly initialize a PatchTSMixer model. The model is compatible with the [Trainer API](../trainer.md). 
+ +```python + +from transformers import PatchTSMixerConfig, PatchTSMixerForPrediction +from transformers import Trainer, TrainingArguments + + +config = PatchTSMixerConfig(context_length=512, prediction_length=96) +model = PatchTSMixerForPrediction(config) + +# train_dataset, valid_dataset and test_dataset are assumed to be prepared by the user +training_args = TrainingArguments(output_dir="patchtsmixer_output") +trainer = Trainer(model=model, args=training_args, + train_dataset=train_dataset, + eval_dataset=valid_dataset) +trainer.train() +results = trainer.evaluate(test_dataset) +``` + +## Usage tips + +The model can also be used for time series classification and time series regression. See the respective [`PatchTSMixerForTimeSeriesClassification`] and [`PatchTSMixerForRegression`] classes. + +## Resources + +- A blog post explaining PatchTSMixer in depth can be found [here](https://huggingface.co/blog/patchtsmixer). The blog can also be opened in Google Colab. + +## PatchTSMixerConfig + +[[autodoc]] PatchTSMixerConfig + + +## PatchTSMixerModel + +[[autodoc]] PatchTSMixerModel + - forward + + +## PatchTSMixerForPrediction + +[[autodoc]] PatchTSMixerForPrediction + - forward + + +## PatchTSMixerForTimeSeriesClassification + +[[autodoc]] PatchTSMixerForTimeSeriesClassification + - forward + + +## PatchTSMixerForPretraining + +[[autodoc]] PatchTSMixerForPretraining + - forward + + +## PatchTSMixerForRegression + +[[autodoc]] PatchTSMixerForRegression + - forward \ No newline at end of file diff --git a/docs/source/en/model_doc/patchtst.md b/docs/source/en/model_doc/patchtst.md new file mode 100644 index 00000000000000..544e4cb378c6df --- /dev/null +++ b/docs/source/en/model_doc/patchtst.md @@ -0,0 +1,68 @@ + + +# PatchTST + +## Overview + +The PatchTST model was proposed in [A Time Series is Worth 64 Words: Long-term Forecasting with Transformers](https://arxiv.org/abs/2211.14730) by Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong and Jayant Kalagnanam. + +At a high level the model vectorizes time series into patches of a given size and encodes the resulting sequence of vectors via a Transformer that then outputs the prediction length forecast via an appropriate head. The model is illustrated in the following figure: + +![model](https://github.com/namctin/transformers/assets/8100/150af169-29de-419a-8d98-eb78251c21fa) + +The abstract from the paper is the following: + +*We propose an efficient design of Transformer-based models for multivariate time series forecasting and self-supervised representation learning. It is based on two key components: (i) segmentation of time series into subseries-level patches which are served as input tokens to Transformer; (ii) channel-independence where each channel contains a single univariate time series that shares the same embedding and Transformer weights across all the series. Patching design naturally has three-fold benefit: local semantic information is retained in the embedding; computation and memory usage of the attention maps are quadratically reduced given the same look-back window; and the model can attend longer history. Our channel-independent patch time series Transformer (PatchTST) can improve the long-term forecasting accuracy significantly when compared with that of SOTA Transformer-based models. We also apply our model to self-supervised pre-training tasks and attain excellent fine-tuning performance, which outperforms supervised training on large datasets.
Transferring of masked pre-trained representation on one dataset to others also produces SOTA forecasting accuracy.* + +This model was contributed by [namctin](https://huggingface.co/namctin), [gsinthong](https://huggingface.co/gsinthong), [diepi](https://huggingface.co/diepi), [vijaye12](https://huggingface.co/vijaye12), [wmgifford](https://huggingface.co/wmgifford), and [kashif](https://huggingface.co/kashif). The original code can be found [here](https://github.com/yuqinie98/PatchTST). + +## Usage tips + +The model can also be used for time series classification and time series regression. See the respective [`PatchTSTForClassification`] and [`PatchTSTForRegression`] classes. + +## Resources + +- A blog post explaining PatchTST in depth can be found [here](https://huggingface.co/blog/patchtst). The blog can also be opened in Google Colab. + +## PatchTSTConfig + +[[autodoc]] PatchTSTConfig + +## PatchTSTModel + +[[autodoc]] PatchTSTModel + - forward + +## PatchTSTForPrediction + +[[autodoc]] PatchTSTForPrediction + - forward + +## PatchTSTForClassification + +[[autodoc]] PatchTSTForClassification + - forward + +## PatchTSTForPretraining + +[[autodoc]] PatchTSTForPretraining + - forward + +## PatchTSTForRegression + +[[autodoc]] PatchTSTForRegression + - forward diff --git a/docs/source/en/model_doc/pegasus.mdx b/docs/source/en/model_doc/pegasus.md similarity index 93% rename from docs/source/en/model_doc/pegasus.mdx rename to docs/source/en/model_doc/pegasus.md index 09d7cfd5d6eb1f..0622354e62dedf 100644 --- a/docs/source/en/model_doc/pegasus.mdx +++ b/docs/source/en/model_doc/pegasus.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Pegasus @@ -21,9 +25,6 @@ specific language governing permissions and limitations under the License. -**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=sshleifer&labels=&template=bug-report.md&title) -and assign @patrickvonplaten. - ## Overview @@ -38,13 +39,17 @@ According to the abstract, This model was contributed by [sshleifer](https://huggingface.co/sshleifer). The Authors' code can be found [here](https://github.com/google-research/pegasus). -Tips: +## Usage tips - Sequence-to-sequence model with the same encoder-decoder model architecture as BART. Pegasus is pre-trained jointly on two self-supervised objective functions: Masked Language Modeling (MLM) and a novel summarization specific pretraining objective, called Gap Sentence Generation (GSG). * MLM: encoder input tokens are randomly replaced by a mask tokens and have to be predicted by the encoder (like in BERT) * GSG: whole encoder input sentences are replaced by a second mask token and fed to the decoder, but which has a causal mask to hide the future words like a regular auto-regressive transformer decoder. +- FP16 is not supported (help/ideas on this appreciated!). +- The adafactor optimizer is recommended for pegasus fine-tuning. 
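+Putting the last two tips together, a minimal fine-tuning setup might look like the sketch below; the dataset variables are placeholders you would prepare yourself, and the checkpoint name is just an example:
+
+```python
+from transformers import (
+    AutoTokenizer,
+    PegasusForConditionalGeneration,
+    Seq2SeqTrainer,
+    Seq2SeqTrainingArguments,
+)
+
+model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-xsum")
+tokenizer = AutoTokenizer.from_pretrained("google/pegasus-xsum")
+
+training_args = Seq2SeqTrainingArguments(
+    output_dir="pegasus-finetuned",
+    optim="adafactor",            # Adafactor rather than AdamW, as recommended above
+    bf16=True,                    # avoid fp16, which is not supported for Pegasus
+    per_device_train_batch_size=2,
+)
+trainer = Seq2SeqTrainer(
+    model=model,
+    args=training_args,
+    tokenizer=tokenizer,
+    train_dataset=train_dataset,  # placeholder: a tokenized summarization dataset
+    eval_dataset=eval_dataset,    # placeholder
+)
+trainer.train()
+```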
+ + ## Checkpoints All the [checkpoints](https://huggingface.co/models?search=pegasus) are fine-tuned for summarization, besides @@ -56,20 +61,11 @@ All the [checkpoints](https://huggingface.co/models?search=pegasus) are fine-tun - Full replication results and correctly pre-processed data can be found in this [Issue](https://github.com/huggingface/transformers/issues/6844#issue-689259666). - [Distilled checkpoints](https://huggingface.co/models?search=distill-pegasus) are described in this [paper](https://arxiv.org/abs/2010.13002). -### Examples - -- [Script](https://github.com/huggingface/transformers/tree/main/examples/research_projects/seq2seq-distillation/finetune_pegasus_xsum.sh) to fine-tune pegasus - on the XSUM dataset. Data download instructions at [examples/pytorch/summarization/](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization/README.md). -- FP16 is not supported (help/ideas on this appreciated!). -- The adafactor optimizer is recommended for pegasus fine-tuning. - - ## Implementation Notes - All models are transformer encoder-decoders with 16 layers in each component. - The implementation is completely inherited from [`BartForConditionalGeneration`] - Some key configuration differences: - - static, sinusoidal position embeddings - the model starts generating with pad_token_id (which has 0 token_embedding) as the prefix. - more beams are used (`num_beams=8`) @@ -78,7 +74,6 @@ All the [checkpoints](https://huggingface.co/models?search=pegasus) are fine-tun - The code to convert checkpoints trained in the author's [repo](https://github.com/google-research/pegasus) can be found in `convert_pegasus_tf_to_pytorch.py`. - ## Usage Example ```python @@ -102,6 +97,14 @@ All the [checkpoints](https://huggingface.co/models?search=pegasus) are fine-tun ... ) ``` +## Resources + +- [Script](https://github.com/huggingface/transformers/tree/main/examples/research_projects/seq2seq-distillation/finetune_pegasus_xsum.sh) to fine-tune pegasus + on the XSUM dataset. Data download instructions at [examples/pytorch/summarization/](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization/README.md). +- [Causal language modeling task guide](../tasks/language_modeling) +- [Translation task guide](../tasks/translation) +- [Summarization task guide](../tasks/summarization) + ## PegasusConfig [[autodoc]] PegasusConfig @@ -116,6 +119,9 @@ warning: `add_tokens` does not work at the moment. [[autodoc]] PegasusTokenizerFast + + + ## PegasusModel [[autodoc]] PegasusModel @@ -131,6 +137,9 @@ warning: `add_tokens` does not work at the moment. [[autodoc]] PegasusForCausalLM - forward + + + ## TFPegasusModel [[autodoc]] TFPegasusModel @@ -141,6 +150,9 @@ warning: `add_tokens` does not work at the moment. [[autodoc]] TFPegasusForConditionalGeneration - call + + + ## FlaxPegasusModel [[autodoc]] FlaxPegasusModel @@ -154,3 +166,6 @@ warning: `add_tokens` does not work at the moment. 
- __call__ - encode - decode + + + diff --git a/docs/source/en/model_doc/pegasus_x.mdx b/docs/source/en/model_doc/pegasus_x.md similarity index 81% rename from docs/source/en/model_doc/pegasus_x.mdx rename to docs/source/en/model_doc/pegasus_x.md index c3527c9e01a615..d64d8ba954163e 100644 --- a/docs/source/en/model_doc/pegasus_x.mdx +++ b/docs/source/en/model_doc/pegasus_x.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # PEGASUS-X @@ -22,23 +26,28 @@ The abstract from the paper is the following: *While large pretrained Transformer models have proven highly capable at tackling natural language tasks, handling long sequence inputs continues to be a significant challenge. One such task is long input summarization, where inputs are longer than the maximum input context of most pretrained models. Through an extensive set of experiments, we investigate what model architectural changes and pretraining paradigms can most efficiently adapt a pretrained Transformer for long input summarization. We find that a staggered, block-local Transformer with global encoder tokens strikes a good balance of performance and efficiency, and that an additional pretraining phase on long sequences meaningfully improves downstream summarization performance. Based on our findings, we introduce PEGASUS-X, an extension of the PEGASUS model with additional long input pretraining to handle inputs of up to 16K tokens. PEGASUS-X achieves strong performance on long input summarization tasks comparable with much larger models while adding few additional parameters and not requiring model parallelism to train.* -Tips: +This model was contributed by [zphang](https://huggingface.co/zphang). The original code can be found [here](https://github.com/google-research/pegasus). + +## Documentation resources + +- [Translation task guide](../tasks/translation) +- [Summarization task guide](../tasks/summarization) -* PEGASUS-X uses the same tokenizer as PEGASUS. + -This model was contributed by [zphang]( ## PegasusXConfig [[autodoc]] PegasusXConfig - ## PegasusXModel [[autodoc]] PegasusXModel - forward - ## PegasusXForConditionalGeneration [[autodoc]] PegasusXForConditionalGeneration diff --git a/docs/source/en/model_doc/perceiver.mdx b/docs/source/en/model_doc/perceiver.md similarity index 94% rename from docs/source/en/model_doc/perceiver.mdx rename to docs/source/en/model_doc/perceiver.md index 52a928472c0ddb..ee678c22f6f890 100644 --- a/docs/source/en/model_doc/perceiver.mdx +++ b/docs/source/en/model_doc/perceiver.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. 
+ --> # Perceiver @@ -77,7 +81,13 @@ alt="drawing" width="600"/> This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/deepmind/deepmind-research/tree/master/perceiver). -Tips: + + +Perceiver does **not** work with `torch.nn.DataParallel` due to a bug in PyTorch, see [issue #36035](https://github.com/pytorch/pytorch/issues/36035) + + + +## Resources - The quickest way to get started with the Perceiver is by checking the [tutorial notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Perceiver). @@ -85,10 +95,9 @@ Tips: is implemented in the library. Note that the models available in the library only showcase some examples of what you can do with the Perceiver. There are many more use cases, including question answering, named-entity recognition, object detection, audio classification, video classification, etc. - -**Note**: - -- Perceiver does **not** work with `torch.nn.DataParallel` due to a bug in PyTorch, see [issue #36035](https://github.com/pytorch/pytorch/issues/36035) +- [Text classification task guide](../tasks/sequence_classification) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Image classification task guide](../tasks/image_classification) ## Perceiver specific outputs diff --git a/docs/source/en/model_doc/persimmon.md b/docs/source/en/model_doc/persimmon.md new file mode 100644 index 00000000000000..fe9e66a0b7175e --- /dev/null +++ b/docs/source/en/model_doc/persimmon.md @@ -0,0 +1,98 @@ + + +# Persimmon + +## Overview + +The Persimmon model was created by [ADEPT](https://www.adept.ai/blog/persimmon-8b), and authored by Erich Elsen, Augustus Odena, Maxwell Nye, Sağnak Taşırlar, Tri Dao, Curtis Hawthorne, Deepak Moparthi, Arushi Somani. + +The authors introduced Persimmon-8B, a decoder model based on the classic transformers architecture, with query and key normalization. Persimmon-8B is a fully permissively-licensed model with approximately 8 billion parameters, released under the Apache license. Some of the key attributes of Persimmon-8B are long context size (16K), performance, and capabilities for multimodal extensions. + +The authors showcase their approach to model evaluation, focusing on practical text generation, mirroring how users interact with language models. The work also includes a comparative analysis, pitting Persimmon-8B against other prominent models (MPT 7B Instruct and Llama 2 Base 7B 1-Shot), across various evaluation tasks. The results demonstrate Persimmon-8B's competitive performance, even with limited training data. + +In terms of model details, the work outlines the architecture and training methodology of Persimmon-8B, providing insights into its design choices, sequence length, and dataset composition. The authors present a fast inference code that outperforms traditional implementations through operator fusion and CUDA graph utilization while maintaining code coherence. They express their anticipation of how the community will leverage this contribution to drive innovation, hinting at further upcoming releases as part of an ongoing series of developments. + +This model was contributed by [ArthurZ](https://huggingface.co/ArthurZ). +The original code can be found [here](https://github.com/persimmon-ai-labs/adept-inference). 
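+A minimal generation sketch is shown below. It assumes the chat checkpoint published on the Hub as `adept/persimmon-8b-chat` and the `human:`/`adept:` prompt format suggested by the authors (see the usage tips that follow); `device_map="auto"` additionally requires 🤗 Accelerate.
+
+```python
+import torch
+from transformers import AutoTokenizer, PersimmonForCausalLM
+
+# assumed checkpoint: the 8B chat model released by ADEPT on the Hub
+model = PersimmonForCausalLM.from_pretrained(
+    "adept/persimmon-8b-chat", torch_dtype=torch.bfloat16, device_map="auto"
+)
+tokenizer = AutoTokenizer.from_pretrained("adept/persimmon-8b-chat")
+
+prompt = "human: What is the capital of France?\n\nadept:"
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+outputs = model.generate(**inputs, max_new_tokens=30)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```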
+ +## Usage tips + + + +The `Persimmon` models were trained using `bfloat16`, but the original inference uses `float16`. The checkpoints uploaded on the hub use `torch_dtype = 'float16'` which will be +used by the `AutoModel` API to cast the checkpoints from `torch.float32` to `torch.float16`. + +The `dtype` of the online weights is mostly irrelevant, unless you are using `torch_dtype="auto"` when initializing a model using `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto")`. The reason is that the model will first be downloaded (using the `dtype` of the checkpoints online) and then cast to the default `dtype` of `torch` (becomes `torch.float32`). Users should specify the `torch_dtype` they want, and if they don't it will be `torch.float32`. + +Fine-tuning the model in `float16` is not recommended and is known to produce `nan`; as such, the model should be fine-tuned in `bfloat16`. + + + + +Tips: + +- To convert the model, you need to clone the original repository using `git clone https://github.com/persimmon-ai-labs/adept-inference`, then get the checkpoints: + +```bash +git clone https://github.com/persimmon-ai-labs/adept-inference +wget https://axtkn4xl5cip.objectstorage.us-phoenix-1.oci.customer-oci.com/n/axtkn4xl5cip/b/adept-public-data/o/8b_base_model_release.tar +tar -xvf 8b_base_model_release.tar +python src/transformers/models/persimmon/convert_persimmon_weights_to_hf.py --input_dir /path/to/downloaded/persimmon/weights/ --output_dir /output/path \ + --pt_model_path /path/to/8b_chat_model_release/iter_0001251/mp_rank_00/model_optim_rng.pt \ + --ada_lib_path /path/to/adept-inference +``` + +For the chat model: +```bash +wget https://axtkn4xl5cip.objectstorage.us-phoenix-1.oci.customer-oci.com/n/axtkn4xl5cip/b/adept-public-data/o/8b_chat_model_release.tar +tar -xvf 8b_chat_model_release.tar +``` + +Thereafter, models can be loaded via: + +```py +from transformers import PersimmonForCausalLM, LlamaTokenizer + +model = PersimmonForCausalLM.from_pretrained("/output/path") +tokenizer = LlamaTokenizer.from_pretrained("/output/path") +``` + + +- Persimmon uses a `sentencepiece` based tokenizer, with a `Unigram` model. It supports bytefallback, which is only available in `tokenizers==0.14.0` for the fast tokenizer. +The `LlamaTokenizer` is used as it is a standard wrapper around sentencepiece. The `chat` template will be updated with the templating functions in a follow-up PR! + +- The authors suggest using the following prompt format for the chat mode: `f"human: {prompt}\n\nadept:"` + + +## PersimmonConfig + +[[autodoc]] PersimmonConfig + +## PersimmonModel + +[[autodoc]] PersimmonModel + - forward + +## PersimmonForCausalLM + +[[autodoc]] PersimmonForCausalLM + - forward + +## PersimmonForSequenceClassification + +[[autodoc]] PersimmonForSequenceClassification + - forward diff --git a/docs/source/en/model_doc/phi.md b/docs/source/en/model_doc/phi.md new file mode 100644 index 00000000000000..96efe4a303a84f --- /dev/null +++ b/docs/source/en/model_doc/phi.md @@ -0,0 +1,190 @@ + + +# Phi + +## Overview + +The Phi-1 model was proposed in [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li.
+ +The Phi-1.5 model was proposed in [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee. + +### Summary + +In Phi-1 and Phi-1.5 papers, the authors showed how important the quality of the data is in training relative to the model size. +They selected high quality "textbook" data alongside with synthetically generated data for training their small sized Transformer +based model Phi-1 with 1.3B parameters. Despite this small scale, phi-1 attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP. +They follow the same strategy for Phi-1.5 and created another 1.3B parameter model with performance on natural language tasks comparable +to models 5x larger, and surpassing most non-frontier LLMs. Phi-1.5 exhibits many of the traits of much larger LLMs such as the ability +to “think step by step” or perform some rudimentary in-context learning. +With these two experiments the authors successfully showed the huge impact of quality of training data when training machine learning models. + +The abstract from the Phi-1 paper is the following: + +*We introduce phi-1, a new large language model for code, with significantly smaller size than +competing models: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on +8 A100s, using a selection of “textbook quality” data from the web (6B tokens) and synthetically +generated textbooks and exercises with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains +pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP. It also displays surprising emergent +properties compared to phi-1-base, our model before our finetuning stage on a dataset of coding +exercises, and phi-1-small, a smaller model with 350M parameters trained with the same pipeline as +phi-1 that still achieves 45% on HumanEval.* + +The abstract from the Phi-1.5 paper is the following: + +*We continue the investigation into the power of smaller Transformer-based language models as +initiated by TinyStories – a 10 million parameter model that can produce coherent English – and +the follow-up work on phi-1, a 1.3 billion parameter model with Python coding performance close +to the state-of-the-art. The latter work proposed to use existing Large Language Models (LLMs) to +generate “textbook quality” data as a way to enhance the learning process compared to traditional +web data. We follow the “Textbooks Are All You Need” approach, focusing this time on common +sense reasoning in natural language, and create a new 1.3 billion parameter model named phi-1.5, +with performance on natural language tasks comparable to models 5x larger, and surpassing most +non-frontier LLMs on more complex reasoning tasks such as grade-school mathematics and basic +coding. More generally, phi-1.5 exhibits many of the traits of much larger LLMs, both good –such +as the ability to “think step by step” or perform some rudimentary in-context learning– and bad, +including hallucinations and the potential for toxic and biased generations –encouragingly though, we +are seeing improvement on that front thanks to the absence of web data. We open-source phi-1.5 to +promote further research on these urgent topics.* + +This model was contributed by [Susnato Dhar](https://huggingface.co/susnato). 
+
+The original code for Phi-1, Phi-1.5 and Phi-2 can be found [here](https://huggingface.co/microsoft/phi-1), [here](https://huggingface.co/microsoft/phi-1_5) and [here](https://huggingface.co/microsoft/phi-2), respectively.
+
+## Usage tips
+
+- This model is quite similar to `Llama`, with the main difference in [`PhiDecoderLayer`]: the [`PhiAttention`] and [`PhiMLP`] layers are used in a parallel configuration.
+- The tokenizer used for this model is identical to the [`CodeGenTokenizer`].
+
+## How to use Phi-2
+
+Phi-2 has been integrated into the development version (4.37.0.dev) of `transformers`. Until the official version is released through `pip`, ensure that you are doing one of the following:
+
+* When loading the model, ensure that `trust_remote_code=True` is passed as an argument of the `from_pretrained()` function.
+
+* Update your local `transformers` to the development version: `pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers`. The previous command is an alternative to cloning and installing from source.
+
+```python
+>>> from transformers import AutoModelForCausalLM, AutoTokenizer
+
+>>> model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
+>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
+
+>>> inputs = tokenizer('Can you help me write a formal email to a potential business partner proposing a joint venture?', return_tensors="pt", return_attention_mask=False)
+
+>>> outputs = model.generate(**inputs, max_length=30)
+>>> text = tokenizer.batch_decode(outputs)[0]
+>>> print(text)
+'Can you help me write a formal email to a potential business partner proposing a joint venture?\nInput: Company A: ABC Inc.\nCompany B: XYZ Ltd.\nJoint Venture: A new online platform for e-commerce'
+```
+
+### Example:
+
+```python
+>>> from transformers import PhiForCausalLM, AutoTokenizer
+
+>>> # define the model and tokenizer.
+>>> model = PhiForCausalLM.from_pretrained("microsoft/phi-1_5")
+>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5")
+
+>>> # feel free to change the prompt to your liking.
+>>> prompt = "If I were an AI that had just achieved"
+
+>>> # apply the tokenizer.
+>>> tokens = tokenizer(prompt, return_tensors="pt")
+
+>>> # use the model to generate new tokens.
+>>> generated_output = model.generate(**tokens, use_cache=True, max_new_tokens=10)
+
+>>> tokenizer.batch_decode(generated_output)[0]
+'If I were an AI that had just achieved a breakthrough in machine learning, I would be thrilled'
+```
+
+## Combining Phi and Flash Attention 2
+
+First, make sure to install the latest version of Flash Attention 2.
+
+```bash
+pip install -U flash-attn --no-build-isolation
+```
+
+Also make sure that your hardware is compatible with Flash Attention 2. Read more about it in the official documentation of the flash-attn repository. Also make sure to load your model in half-precision (e.g. `torch.float16`).
+
+To load and run a model using Flash Attention 2, refer to the snippet below:
+
+```python
+>>> import torch
+>>> from transformers import PhiForCausalLM, AutoTokenizer
+
+>>> # define the model and tokenizer and push the model and tokens to the GPU.
+>>> model = PhiForCausalLM.from_pretrained("microsoft/phi-1_5", torch_dtype=torch.float16, attn_implementation="flash_attention_2").to("cuda")
+>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5")
+
+>>> # feel free to change the prompt to your liking.
+>>> prompt = "If I were an AI that had just achieved" + +>>> # apply the tokenizer. +>>> tokens = tokenizer(prompt, return_tensors="pt").to("cuda") + +>>> # use the model to generate new tokens. +>>> generated_output = model.generate(**tokens, use_cache=True, max_new_tokens=10) + +>>> tokenizer.batch_decode(generated_output)[0] +'If I were an AI that had just achieved a breakthrough in machine learning, I would be thrilled' +``` + +### Expected speedups + +Below is an expected speedup diagram that compares pure inference time between the native implementation in transformers using `microsoft/phi-1` checkpoint and the Flash Attention 2 version of the model using a sequence length of 2048. + +
+ +
+ +## PhiConfig + +[[autodoc]] PhiConfig + + + + +## PhiModel + +[[autodoc]] PhiModel + - forward + +## PhiForCausalLM + +[[autodoc]] PhiForCausalLM + - forward + - generate + +## PhiForSequenceClassification + +[[autodoc]] PhiForSequenceClassification + - forward + +## PhiForTokenClassification + +[[autodoc]] PhiForTokenClassification + - forward + + + diff --git a/docs/source/en/model_doc/phobert.mdx b/docs/source/en/model_doc/phobert.md similarity index 83% rename from docs/source/en/model_doc/phobert.mdx rename to docs/source/en/model_doc/phobert.md index 4ae9b0aa625153..30a50275476e71 100644 --- a/docs/source/en/model_doc/phobert.mdx +++ b/docs/source/en/model_doc/phobert.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # PhoBERT @@ -24,7 +28,9 @@ best pre-trained multilingual model XLM-R (Conneau et al., 2020) and improves th Vietnamese-specific NLP tasks including Part-of-speech tagging, Dependency parsing, Named-entity recognition and Natural language inference.* -Example of use: +This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/PhoBERT). + +## Usage example ```python >>> import torch @@ -46,7 +52,12 @@ Example of use: >>> # phobert = TFAutoModel.from_pretrained("vinai/phobert-base") ``` -This model was contributed by [dqnguyen](https://huggingface.co/dqnguyen). The original code can be found [here](https://github.com/VinAIResearch/PhoBERT). + + +PhoBERT implementation is the same as BERT, except for tokenization. Refer to [EART documentation](bert) for information on +configuration classes and their parameters. PhoBERT-specific tokenizer is documented below. + + ## PhobertTokenizer diff --git a/docs/source/en/model_doc/pix2struct.md b/docs/source/en/model_doc/pix2struct.md new file mode 100644 index 00000000000000..8dc179f5f863c8 --- /dev/null +++ b/docs/source/en/model_doc/pix2struct.md @@ -0,0 +1,77 @@ + + +# Pix2Struct + +## Overview + +The Pix2Struct model was proposed in [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) by Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova. + +The abstract from the paper is the following: + +> Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML. 
The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, image captioning. In addition to the novel pretraining strategy, we introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, where language prompts such as questions are rendered directly on top of the input image. For the first time, we show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images. + +Tips: + +Pix2Struct has been fine tuned on a variety of tasks and datasets, ranging from image captioning, visual question answering (VQA) over different inputs (books, charts, science diagrams), captioning UI components etc. The full list can be found in Table 1 of the paper. +We therefore advise you to use these models for the tasks they have been fine tuned on. For instance, if you want to use Pix2Struct for UI captioning, you should use the model fine tuned on the UI dataset. If you want to use Pix2Struct for image captioning, you should use the model fine tuned on the natural images captioning dataset and so on. + +If you want to use the model to perform conditional text captioning, make sure to use the processor with `add_special_tokens=False`. + +This model was contributed by [ybelkada](https://huggingface.co/ybelkada). +The original code can be found [here](https://github.com/google-research/pix2struct). + +## Resources + +- [Fine-tuning Notebook](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_pix2struct.ipynb) +- [All models](https://huggingface.co/models?search=pix2struct) + +## Pix2StructConfig + +[[autodoc]] Pix2StructConfig + - from_text_vision_configs + +## Pix2StructTextConfig + +[[autodoc]] Pix2StructTextConfig + +## Pix2StructVisionConfig + +[[autodoc]] Pix2StructVisionConfig + +## Pix2StructProcessor + +[[autodoc]] Pix2StructProcessor + +## Pix2StructImageProcessor + +[[autodoc]] Pix2StructImageProcessor + - preprocess + +## Pix2StructTextModel + +[[autodoc]] Pix2StructTextModel + - forward + +## Pix2StructVisionModel + +[[autodoc]] Pix2StructVisionModel + - forward + +## Pix2StructForConditionalGeneration + +[[autodoc]] Pix2StructForConditionalGeneration + - forward diff --git a/docs/source/en/model_doc/plbart.mdx b/docs/source/en/model_doc/plbart.md similarity index 91% rename from docs/source/en/model_doc/plbart.mdx rename to docs/source/en/model_doc/plbart.md index 0755bb9a56e1c1..61af52e54d0d21 100644 --- a/docs/source/en/model_doc/plbart.mdx +++ b/docs/source/en/model_doc/plbart.md @@ -8,14 +8,15 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. 
+ --> # PLBart -**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign -[@gchhablani](https://www.github.com/gchhablani). - -## Overview of PLBart +## Overview The PLBART model was proposed in [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. This is a BART-like model which can be used to perform code-summarization, code-generation, and code-translation tasks. The pre-trained model `plbart-base` has been trained using multilingual denoising task @@ -36,7 +37,7 @@ even with limited annotations.* This model was contributed by [gchhablani](https://huggingface.co/gchhablani). The Authors' code can be found [here](https://github.com/wasiahmad/PLBART). -### Training of PLBart +## Usage examples PLBart is a multilingual encoder-decoder (sequence-to-sequence) model primarily intended for code-to-text, text-to-code, code-to-code tasks. As the model is multilingual it expects the sequences in a different format. A special language id token is added in both the @@ -49,7 +50,7 @@ In cases where the language code is needed, the regular [`~PLBartTokenizer.__cal when you pass texts as the first argument or with the keyword argument `text`, and will encode target text format if it's passed with the `text_target` keyword argument. -- Supervised training +### Supervised training ```python >>> from transformers import PLBartForConditionalGeneration, PLBartTokenizer @@ -61,7 +62,7 @@ it's passed with the `text_target` keyword argument. >>> model(**inputs) ``` -- Generation +### Generation While generating the target text set the `decoder_start_token_id` to the target language id. The following example shows how to translate Python to English using the `uclanlp/plbart-python-en_XX` model. @@ -78,6 +79,13 @@ it's passed with the `text_target` keyword argument. "Returns the maximum value of a b c." ``` +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Causal language modeling task guide](../tasks/language_modeling) +- [Translation task guide](../tasks/translation) +- [Summarization task guide](../tasks/summarization) + ## PLBartConfig [[autodoc]] PLBartConfig diff --git a/docs/source/en/model_doc/poolformer.mdx b/docs/source/en/model_doc/poolformer.md similarity index 95% rename from docs/source/en/model_doc/poolformer.mdx rename to docs/source/en/model_doc/poolformer.md index be3aa298495c06..823c4412485c44 100644 --- a/docs/source/en/model_doc/poolformer.mdx +++ b/docs/source/en/model_doc/poolformer.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # PoolFormer @@ -24,8 +28,9 @@ The figure below illustrates the architecture of PoolFormer. Taken from the [ori +This model was contributed by [heytanay](https://huggingface.co/heytanay). The original code can be found [here](https://github.com/sail-sg/poolformer). 
-Tips: +## Usage tips - PoolFormer has a hierarchical architecture, where instead of Attention, a simple Average Pooling layer is present. All checkpoints of the model can be found on the [hub](https://huggingface.co/models?other=poolformer). - One can use [`PoolFormerImageProcessor`] to prepare images for the model. @@ -39,8 +44,6 @@ Tips: | m36 | [6, 6, 18, 6] | [96, 192, 384, 768] | 56 | 82.1 | | m48 | [8, 8, 24, 8] | [96, 192, 384, 768] | 73 | 82.5 | -This model was contributed by [heytanay](https://huggingface.co/heytanay). The original code can be found [here](https://github.com/sail-sg/poolformer). - ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with PoolFormer. @@ -48,6 +51,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`PoolFormerForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). +- See also: [Image classification task guide](../tasks/image_classification) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. diff --git a/docs/source/en/model_doc/pop2piano.md b/docs/source/en/model_doc/pop2piano.md new file mode 100644 index 00000000000000..8e7c1fbd34359e --- /dev/null +++ b/docs/source/en/model_doc/pop2piano.md @@ -0,0 +1,194 @@ + + +# Pop2Piano + +
+
+<!-- Hugging Face Spaces demo badge -->
+
+ +## Overview + +The Pop2Piano model was proposed in [Pop2Piano : Pop Audio-based Piano Cover Generation](https://arxiv.org/abs/2211.00895) by Jongho Choi and Kyogu Lee. + +Piano covers of pop music are widely enjoyed, but generating them from music is not a trivial task. It requires great +expertise with playing piano as well as knowing different characteristics and melodies of a song. With Pop2Piano you +can directly generate a cover from a song's audio waveform. It is the first model to directly generate a piano cover +from pop audio without melody and chord extraction modules. + +Pop2Piano is an encoder-decoder Transformer model based on [T5](https://arxiv.org/pdf/1910.10683.pdf). The input audio +is transformed to its waveform and passed to the encoder, which transforms it to a latent representation. The decoder +uses these latent representations to generate token ids in an autoregressive way. Each token id corresponds to one of four +different token types: time, velocity, note and 'special'. The token ids are then decoded to their equivalent MIDI file. + +The abstract from the paper is the following: + +*Piano covers of pop music are enjoyed by many people. However, the +task of automatically generating piano covers of pop music is still +understudied. This is partly due to the lack of synchronized +{Pop, Piano Cover} data pairs, which made it challenging to apply +the latest data-intensive deep learning-based methods. To leverage +the power of the data-driven approach, we make a large amount of +paired and synchronized {Pop, Piano Cover} data using an automated +pipeline. In this paper, we present Pop2Piano, a Transformer network +that generates piano covers given waveforms of pop music. To the best +of our knowledge, this is the first model to generate a piano cover +directly from pop audio without using melody and chord extraction +modules. We show that Pop2Piano, trained with our dataset, is capable +of producing plausible piano covers.* + +This model was contributed by [Susnato Dhar](https://huggingface.co/susnato). +The original code can be found [here](https://github.com/sweetcocoa/pop2piano). + +## Usage tips + +* To use Pop2Piano, you will need to install the 🤗 Transformers library, as well as the following third party modules: +```bash +pip install pretty-midi==0.2.9 essentia==2.1b6.dev1034 librosa scipy +``` +Please note that you may need to restart your runtime after installation. +* Pop2Piano is an Encoder-Decoder based model like T5. +* Pop2Piano can be used to generate midi-audio files for a given audio sequence. +* Choosing different composers in `Pop2PianoForConditionalGeneration.generate()` can lead to variety of different results. +* Setting the sampling rate to 44.1 kHz when loading the audio file can give good performance. +* Though Pop2Piano was mainly trained on Korean Pop music, it also does pretty well on other Western Pop or Hip Hop songs. + +## Examples + +- Example using HuggingFace Dataset: + +```python +>>> from datasets import load_dataset +>>> from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor + +>>> model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano") +>>> processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano") +>>> ds = load_dataset("sweetcocoa/pop2piano_ci", split="test") + +>>> inputs = processor( +... audio=ds["audio"][0]["array"], sampling_rate=ds["audio"][0]["sampling_rate"], return_tensors="pt" +... 
) +>>> model_output = model.generate(input_features=inputs["input_features"], composer="composer1") +>>> tokenizer_output = processor.batch_decode( +... token_ids=model_output, feature_extractor_output=inputs +... )["pretty_midi_objects"][0] +>>> tokenizer_output.write("./Outputs/midi_output.mid") +``` + +- Example using your own audio file: + +```python +>>> import librosa +>>> from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor + +>>> audio, sr = librosa.load("", sr=44100) # feel free to change the sr to a suitable value. +>>> model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano") +>>> processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano") + +>>> inputs = processor(audio=audio, sampling_rate=sr, return_tensors="pt") +>>> model_output = model.generate(input_features=inputs["input_features"], composer="composer1") +>>> tokenizer_output = processor.batch_decode( +... token_ids=model_output, feature_extractor_output=inputs +... )["pretty_midi_objects"][0] +>>> tokenizer_output.write("./Outputs/midi_output.mid") +``` + +- Example of processing multiple audio files in batch: + +```python +>>> import librosa +>>> from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor + +>>> # feel free to change the sr to a suitable value. +>>> audio1, sr1 = librosa.load("", sr=44100) +>>> audio2, sr2 = librosa.load("", sr=44100) +>>> model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano") +>>> processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano") + +>>> inputs = processor(audio=[audio1, audio2], sampling_rate=[sr1, sr2], return_attention_mask=True, return_tensors="pt") +>>> # Since we now generating in batch(2 audios) we must pass the attention_mask +>>> model_output = model.generate( +... input_features=inputs["input_features"], +... attention_mask=inputs["attention_mask"], +... composer="composer1", +... ) +>>> tokenizer_output = processor.batch_decode( +... token_ids=model_output, feature_extractor_output=inputs +... )["pretty_midi_objects"] + +>>> # Since we now have 2 generated MIDI files +>>> tokenizer_output[0].write("./Outputs/midi_output1.mid") +>>> tokenizer_output[1].write("./Outputs/midi_output2.mid") +``` + + +- Example of processing multiple audio files in batch (Using `Pop2PianoFeatureExtractor` and `Pop2PianoTokenizer`): + +```python +>>> import librosa +>>> from transformers import Pop2PianoForConditionalGeneration, Pop2PianoFeatureExtractor, Pop2PianoTokenizer + +>>> # feel free to change the sr to a suitable value. +>>> audio1, sr1 = librosa.load("", sr=44100) +>>> audio2, sr2 = librosa.load("", sr=44100) +>>> model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano") +>>> feature_extractor = Pop2PianoFeatureExtractor.from_pretrained("sweetcocoa/pop2piano") +>>> tokenizer = Pop2PianoTokenizer.from_pretrained("sweetcocoa/pop2piano") + +>>> inputs = feature_extractor( +... audio=[audio1, audio2], +... sampling_rate=[sr1, sr2], +... return_attention_mask=True, +... return_tensors="pt", +... ) +>>> # Since we now generating in batch(2 audios) we must pass the attention_mask +>>> model_output = model.generate( +... input_features=inputs["input_features"], +... attention_mask=inputs["attention_mask"], +... composer="composer1", +... ) +>>> tokenizer_output = tokenizer.batch_decode( +... token_ids=model_output, feature_extractor_output=inputs +... 
)["pretty_midi_objects"] + +>>> # Since we now have 2 generated MIDI files +>>> tokenizer_output[0].write("./Outputs/midi_output1.mid") +>>> tokenizer_output[1].write("./Outputs/midi_output2.mid") +``` + + +## Pop2PianoConfig + +[[autodoc]] Pop2PianoConfig + +## Pop2PianoFeatureExtractor + +[[autodoc]] Pop2PianoFeatureExtractor + - __call__ + +## Pop2PianoForConditionalGeneration + +[[autodoc]] Pop2PianoForConditionalGeneration + - forward + - generate + +## Pop2PianoTokenizer + +[[autodoc]] Pop2PianoTokenizer + - __call__ + +## Pop2PianoProcessor + +[[autodoc]] Pop2PianoProcessor + - __call__ diff --git a/docs/source/en/model_doc/prophetnet.mdx b/docs/source/en/model_doc/prophetnet.md similarity index 91% rename from docs/source/en/model_doc/prophetnet.mdx rename to docs/source/en/model_doc/prophetnet.md index 193a731e7c085b..7e63e0c0887eea 100644 --- a/docs/source/en/model_doc/prophetnet.mdx +++ b/docs/source/en/model_doc/prophetnet.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # ProphetNet @@ -21,10 +25,6 @@ specific language governing permissions and limitations under the License. - -**DISCLAIMER:** If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) and assign -@patrickvonplaten - ## Overview The ProphetNet model was proposed in [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training,](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei @@ -45,14 +45,19 @@ dataset (160GB) respectively. Then we conduct experiments on CNN/DailyMail, Giga abstractive summarization and question generation tasks. Experimental results show that ProphetNet achieves new state-of-the-art results on all these datasets compared to the models using the same scale pretraining corpus.* -Tips: +The Authors' code can be found [here](https://github.com/microsoft/ProphetNet). + +## Usage tips - ProphetNet is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than the left. - The model architecture is based on the original Transformer, but replaces the “standard” self-attention mechanism in the decoder by a a main self-attention mechanism and a self and n-stream (predict) self-attention mechanism. -The Authors' code can be found [here](https://github.com/microsoft/ProphetNet). 
+## Resources +- [Causal language modeling task guide](../tasks/language_modeling) +- [Translation task guide](../tasks/translation) +- [Summarization task guide](../tasks/summarization) ## ProphetNetConfig diff --git a/docs/source/en/model_doc/pvt.md b/docs/source/en/model_doc/pvt.md new file mode 100644 index 00000000000000..3e88a24999f759 --- /dev/null +++ b/docs/source/en/model_doc/pvt.md @@ -0,0 +1,71 @@ + + +# Pyramid Vision Transformer (PVT) + +## Overview + +The PVT model was proposed in +[Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/abs/2102.12122) +by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao. The PVT is a type of +vision transformer that utilizes a pyramid structure to make it an effective backbone for dense prediction tasks. Specifically +it allows for more fine-grained inputs (4 x 4 pixels per patch) to be used, while simultaneously shrinking the sequence length +of the Transformer as it deepens - reducing the computational cost. Additionally, a spatial-reduction attention (SRA) layer +is used to further reduce the resource consumption when learning high-resolution features. + +The abstract from the paper is the following: + +*Although convolutional neural networks (CNNs) have achieved great success in computer vision, this work investigates a +simpler, convolution-free backbone network useful for many dense prediction tasks. Unlike the recently proposed Vision +Transformer (ViT) that was designed for image classification specifically, we introduce the Pyramid Vision Transformer +(PVT), which overcomes the difficulties of porting Transformer to various dense prediction tasks. PVT has several +merits compared to current state of the arts. Different from ViT that typically yields low resolution outputs and +incurs high computational and memory costs, PVT not only can be trained on dense partitions of an image to achieve high +output resolution, which is important for dense prediction, but also uses a progressive shrinking pyramid to reduce the +computations of large feature maps. PVT inherits the advantages of both CNN and Transformer, making it a unified +backbone for various vision tasks without convolutions, where it can be used as a direct replacement for CNN backbones. +We validate PVT through extensive experiments, showing that it boosts the performance of many downstream tasks, including +object detection, instance and semantic segmentation. For example, with a comparable number of parameters, PVT+RetinaNet +achieves 40.4 AP on the COCO dataset, surpassing ResNet50+RetinNet (36.3 AP) by 4.1 absolute AP (see Figure 2). We hope +that PVT could serve as an alternative and useful backbone for pixel-level predictions and facilitate future research.* + +This model was contributed by [Xrenya](https://huggingface.co/Xrenya). The original code can be found [here](https://github.com/whai362/PVT). 
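As a quick illustration of PVT used as an image-classification backbone, the sketch below runs a single image through a small PVT checkpoint. The checkpoint id `Zetatech/pvt-tiny-224` and the COCO test image URL are assumptions on our part, so substitute whatever checkpoint and image you actually use:

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, PvtForImageClassification

# Assumed hub checkpoint for the smallest PVTv1 variant.
checkpoint = "Zetatech/pvt-tiny-224"
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
model = PvtForImageClassification.from_pretrained(checkpoint)

# Any RGB image works; this COCO sample is a common test image.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```

The available PVTv1 variants and their ImageNet-1K accuracies are summarized in the table below.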
+ + +- PVTv1 on ImageNet-1K + +| **Model variant** |**Size** |**Acc@1**|**Params (M)**| +|--------------------|:-------:|:-------:|:------------:| +| PVT-Tiny | 224 | 75.1 | 13.2 | +| PVT-Small | 224 | 79.8 | 24.5 | +| PVT-Medium | 224 | 81.2 | 44.2 | +| PVT-Large | 224 | 81.7 | 61.4 | + + +## PvtConfig + +[[autodoc]] PvtConfig + +## PvtImageProcessor + +[[autodoc]] PvtImageProcessor + - preprocess + +## PvtForImageClassification + +[[autodoc]] PvtForImageClassification + - forward + +## PvtModel + +[[autodoc]] PvtModel + - forward diff --git a/docs/source/en/model_doc/qdqbert.mdx b/docs/source/en/model_doc/qdqbert.md similarity index 90% rename from docs/source/en/model_doc/qdqbert.mdx rename to docs/source/en/model_doc/qdqbert.md index df7b7bcee62516..19b829d0bc5d19 100644 --- a/docs/source/en/model_doc/qdqbert.mdx +++ b/docs/source/en/model_doc/qdqbert.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # QDQBERT @@ -28,22 +32,18 @@ by processors with high-throughput integer math pipelines. We also present a wor able to maintain accuracy within 1% of the floating-point baseline on all networks studied, including models that are more difficult to quantize, such as MobileNets and BERT-large.* -Tips: +This model was contributed by [shangz](https://huggingface.co/shangz). + +## Usage tips - QDQBERT model adds fake quantization operations (pair of QuantizeLinear/DequantizeLinear ops) to (i) linear layer inputs and weights, (ii) matmul inputs, (iii) residual add inputs, in BERT model. - - QDQBERT requires the dependency of [Pytorch Quantization Toolkit](https://github.com/NVIDIA/TensorRT/tree/master/tools/pytorch-quantization). To install `pip install pytorch-quantization --extra-index-url https://pypi.ngc.nvidia.com` - -- QDQBERT model can be loaded from any checkpoint of HuggingFace BERT model (for example *bert-base-uncased*), and +- QDQBERT model can be loaded from any checkpoint of HuggingFace BERT model (for example *google-bert/bert-base-uncased*), and perform Quantization Aware Training/Post Training Quantization. - - A complete example of using QDQBERT model to perform Quatization Aware Training and Post Training Quantization for SQUAD task can be found at [transformers/examples/research_projects/quantization-qdqbert/](examples/research_projects/quantization-qdqbert/). -This model was contributed by [shangz](https://huggingface.co/shangz). - - ### Set default quantizers QDQBERT model adds fake quantization operations (pair of QuantizeLinear/DequantizeLinear ops) to BERT by @@ -114,6 +114,15 @@ the instructions in [torch.onnx](https://pytorch.org/docs/stable/onnx.html). Exa >>> torch.onnx.export(...) 
``` +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Causal language modeling task guide](../tasks/language_modeling) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) + ## QDQBertConfig [[autodoc]] QDQBertConfig diff --git a/docs/source/en/model_doc/qwen2.md b/docs/source/en/model_doc/qwen2.md new file mode 100644 index 00000000000000..61e45fd9c2c8e2 --- /dev/null +++ b/docs/source/en/model_doc/qwen2.md @@ -0,0 +1,82 @@ + + +# Qwen2 + +## Overview + +Qwen2 is the new model series of large language models from the Qwen team. Previously, we released the Qwen series, including Qwen-72B, Qwen-1.8B, Qwen-VL, Qwen-Audio, etc. + +### Model Details + +Qwen2 is a language model series including decoder language models of different model sizes. For each size, we release the base language model and the aligned chat model. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, mixture of sliding window attention and full attention, etc. Additionally, we have an improved tokenizer adaptive to multiple natural languages and codes. + + +## Usage tips + +`Qwen2-7B-beta` and `Qwen2-7B-Chat-beta` can be found on the [Huggingface Hub](https://huggingface.co/Qwen) + +In the following, we demonstrate how to use `Qwen2-7B-Chat-beta` for the inference. Note that we have used the ChatML format for dialog, in this demo we show how to leverage `apply_chat_template` for this purpose. + +```python +>>> from transformers import AutoModelForCausalLM, AutoTokenizer +>>> device = "cuda" # the device to load the model onto + +>>> model = AutoModelForCausalLM.from_pretrained("Qwen2/Qwen2-7B-Chat-beta", device_map="auto") +>>> tokenizer = AutoTokenizer.from_pretrained("Qwen2/Qwen2-7B-Chat-beta") + +>>> prompt = "Give me a short introduction to large language model." 
+ +>>> messages = [{"role": "user", "content": prompt}] + +>>> text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) + +>>> model_inputs = tokenizer([text], return_tensors="pt").to(device) + +>>> generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=True) + +>>> generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)] + +>>> response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +``` + +## Qwen2Config + +[[autodoc]] Qwen2Config + +## Qwen2Tokenizer + +[[autodoc]] Qwen2Tokenizer + - save_vocabulary + +## Qwen2TokenizerFast + +[[autodoc]] Qwen2TokenizerFast + +## Qwen2Model + +[[autodoc]] Qwen2Model + - forward + +## Qwen2ForCausalLM + +[[autodoc]] Qwen2ForCausalLM + - forward + +## Qwen2ForSequenceClassification + +[[autodoc]] Qwen2ForSequenceClassification + - forward diff --git a/docs/source/en/model_doc/rag.mdx b/docs/source/en/model_doc/rag.md similarity index 86% rename from docs/source/en/model_doc/rag.mdx rename to docs/source/en/model_doc/rag.md index f1cf9e4bc23205..1891efe742639e 100644 --- a/docs/source/en/model_doc/rag.mdx +++ b/docs/source/en/model_doc/rag.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # RAG @@ -48,8 +52,12 @@ parametric-only seq2seq baseline.* This model was contributed by [ola13](https://huggingface.co/ola13). -Tips: -- Retrieval-augmented generation (“RAG”) models combine the powers of pretrained dense retrieval (DPR) and Seq2Seq models. RAG models retrieve docs, pass them to a seq2seq model, then marginalize to generate outputs. The retriever and seq2seq modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation to adapt to downstream tasks. +## Usage tips + +Retrieval-augmented generation ("RAG") models combine the powers of pretrained dense retrieval (DPR) and Seq2Seq models. +RAG models retrieve docs, pass them to a seq2seq model, then marginalize to generate outputs. The retriever and seq2seq +modules are initialized from pretrained models, and fine-tuned jointly, allowing both retrieval and generation to adapt +to downstream tasks. 
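A minimal sketch of this retrieve-then-marginalize flow, using the `facebook/rag-sequence-nq` checkpoint with its small dummy retrieval index (the full index requires the `wiki_dpr` dataset plus `datasets` and `faiss` installed), could look like this:

```python
from transformers import RagRetriever, RagSequenceForGeneration, RagTokenizer

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
# use_dummy_dataset=True downloads a tiny index instead of the full wiki_dpr one.
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)

inputs = tokenizer("who holds the record in 100m freestyle", return_tensors="pt")

# The retriever fetches supporting documents; the seq2seq module marginalizes over them while generating.
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```

`RagTokenForGeneration` can be swapped in for `RagSequenceForGeneration` to marginalize per generated token instead of per sequence.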
## RagConfig @@ -69,6 +77,9 @@ Tips: [[autodoc]] RagRetriever + + + ## RagModel [[autodoc]] RagModel @@ -86,6 +97,9 @@ Tips: - forward - generate + + + ## TFRagModel [[autodoc]] TFRagModel @@ -102,3 +116,6 @@ Tips: [[autodoc]] TFRagTokenForGeneration - call - generate + + + diff --git a/docs/source/en/model_doc/realm.mdx b/docs/source/en/model_doc/realm.md similarity index 95% rename from docs/source/en/model_doc/realm.mdx rename to docs/source/en/model_doc/realm.md index 545b1e0a3bf8f9..a8227bc83c7318 100644 --- a/docs/source/en/model_doc/realm.mdx +++ b/docs/source/en/model_doc/realm.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # REALM diff --git a/docs/source/en/model_doc/reformer.mdx b/docs/source/en/model_doc/reformer.md similarity index 91% rename from docs/source/en/model_doc/reformer.mdx rename to docs/source/en/model_doc/reformer.md index 2313da5571b574..c78b1bbb8333d4 100644 --- a/docs/source/en/model_doc/reformer.mdx +++ b/docs/source/en/model_doc/reformer.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Reformer @@ -21,8 +25,6 @@ specific language governing permissions and limitations under the License. -**DISCLAIMER:** This model is still a work in progress, if you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title). - ## Overview The Reformer model was proposed in the paper [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451.pdf) by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya. @@ -40,7 +42,7 @@ while being much more memory-efficient and much faster on long sequences.* This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The Authors' code can be found [here](https://github.com/google/trax/tree/master/trax/models/reformer). -Tips: +## Usage tips - Reformer does **not** work with *torch.nn.DataParallel* due to a bug in PyTorch, see [issue #36035](https://github.com/pytorch/pytorch/issues/36035). - Use Axial position encoding (see below for more details). It’s a mechanism to avoid having a huge positional encoding matrix (when the sequence length is very big) by factorizing it into smaller matrices. @@ -48,11 +50,11 @@ Tips: - Avoid storing the intermediate results of each layer by using reversible transformer layers to obtain them during the backward pass (subtracting the residuals from the input of the next layer gives them back) or recomputing them for results inside a given layer (less efficient than storing them but saves memory). 
- Compute the feedforward operations by chunks and not on the whole batch. -## Axial Positional Encodings +### Axial Positional Encodings Axial Positional Encodings were first implemented in Google's [trax library](https://github.com/google/trax/blob/4d99ad4965bab1deba227539758d59f0df0fef48/trax/layers/research/position_encodings.py#L29) and developed by the authors of this model's paper. In models that are treating very long input sequences, the -conventional position id encodings store an embedings vector of size \\(d\\) being the `config.hidden_size` for +conventional position id encodings store an embeddings vector of size \\(d\\) being the `config.hidden_size` for every position \\(i, \ldots, n_s\\), with \\(n_s\\) being `config.max_embedding_size`. This means that having a sequence length of \\(n_s = 2^{19} \approx 0.5M\\) and a `config.hidden_size` of \\(d = 2^{10} \approx 1000\\) would result in a position encoding matrix: @@ -83,8 +85,8 @@ factorized embedding vectors: \\(x^1_{k, l} + x^2_{l, k}\\), where as the `confi \\(j\\) is factorized into \\(k \text{ and } l\\). This design ensures that each position embedding vector \\(x_j\\) is unique. -Using the above example again, axial position encoding with \\(d^1 = 2^5, d^2 = 2^5, n_s^1 = 2^9, n_s^2 = 2^{10}\\) -can drastically reduced the number of parameters to \\(2^{14} + 2^{15} \approx 49000\\) parameters. +Using the above example again, axial position encoding with \\(d^1 = 2^9, d^2 = 2^9, n_s^1 = 2^9, n_s^2 = 2^{10}\\) +can drastically reduced the number of parameters from 500 000 000 to \\(2^{18} + 2^{19} \approx 780 000\\) parameters, this means 85% less memory usage. In practice, the parameter `config.axial_pos_embds_dim` is set to a tuple \\((d^1, d^2)\\) which sum has to be equal to `config.hidden_size` and `config.axial_pos_shape` is set to a tuple \\((n_s^1, n_s^2)\\) which @@ -92,7 +94,7 @@ product has to be equal to `config.max_embedding_size`, which during training ha length* of the `input_ids`. -## LSH Self Attention +### LSH Self Attention In Locality sensitive hashing (LSH) self attention the key and query projection weights are tied. Therefore, the key query embedding vectors are also tied. LSH self attention uses the locality sensitive hashing mechanism proposed in @@ -125,7 +127,7 @@ Using LSH self attention, the memory and time complexity of the query-key matmul and time bottleneck in a transformer model, with \\(n_s\\) being the sequence length. -## Local Self Attention +### Local Self Attention Local self attention is essentially a "normal" self attention layer with key, query and value projections, but is chunked so that in each chunk of length `config.local_chunk_length` the query embedding vectors only attends to @@ -137,7 +139,7 @@ Using Local self attention, the memory and time complexity of the query-key matm and time bottleneck in a transformer model, with \\(n_s\\) being the sequence length. 
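These constraints (axial embedding dimensions that sum to the hidden size, an axial shape whose product equals the maximum sequence length, and chunk lengths that divide it) all surface as `ReformerConfig` arguments. Below is a minimal sketch with illustrative values that mirror the library defaults; it is not a recommended training configuration:

```python
from transformers import ReformerConfig, ReformerModel

config = ReformerConfig(
    hidden_size=256,
    axial_pos_embds=True,
    axial_pos_embds_dim=(64, 192),  # d^1 + d^2 must equal hidden_size (256)
    axial_pos_shape=(64, 64),       # 64 * 64 = 4096 = maximum sequence length
    max_position_embeddings=4096,
    attn_layers=["local", "lsh", "local", "lsh"],  # alternate local and LSH self-attention layers
    lsh_attn_chunk_length=64,       # the sequence length must be divisible by the chunk lengths
    local_attn_chunk_length=64,
    num_buckets=32,
    num_hashes=2,
    is_decoder=True,
)

model = ReformerModel(config)
print(sum(p.numel() for p in model.parameters()))  # parameter count of this small configuration
```

During training, inputs then have to be padded or truncated so that the sequence length matches these settings, as the Training section below explains.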
-## Training +### Training During training, we must ensure that the sequence length is set to a value that can be divided by the least common multiple of `config.lsh_chunk_length` and `config.local_chunk_length` and that the parameters of the Axial @@ -151,6 +153,13 @@ input_ids = tokenizer.encode("This is a sentence from the training data", return loss = model(input_ids, labels=input_ids)[0] ``` +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Question answering task guide](../tasks/question_answering) +- [Causal language modeling task guide](../tasks/language_modeling) +- [Masked language modeling task guide](../tasks/masked_language_modeling) + ## ReformerConfig [[autodoc]] ReformerConfig diff --git a/docs/source/en/model_doc/regnet.mdx b/docs/source/en/model_doc/regnet.md similarity index 80% rename from docs/source/en/model_doc/regnet.mdx rename to docs/source/en/model_doc/regnet.md index 62d030452a836f..acd833c77c2da9 100644 --- a/docs/source/en/model_doc/regnet.mdx +++ b/docs/source/en/model_doc/regnet.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # RegNet @@ -22,15 +26,13 @@ The abstract from the paper is the following: *In this work, we present a new network design paradigm. Our goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, we design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elevated to the design space level. Using our methodology we explore the structure aspect of network design and arrive at a low-dimensional design space consisting of simple, regular networks that we call RegNet. The core insight of the RegNet parametrization is surprisingly simple: widths and depths of good networks can be explained by a quantized linear function. We analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes. Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5x faster on GPUs.* -Tips: - -- One can use [`AutoImageProcessor`] to prepare images for the model. -- The huge 10B model from [Self-supervised Pretraining of Visual Features in the Wild](https://arxiv.org/abs/2103.01988), trained on one billion Instagram images, is available on the [hub](https://huggingface.co/facebook/regnet-y-10b-seer) - This model was contributed by [Francesco](https://huggingface.co/Francesco). The TensorFlow version of the model -was contributed by [sayakpaul](https://huggingface.com/sayakpaul) and [ariG23498](https://huggingface.com/ariG23498). +was contributed by [sayakpaul](https://huggingface.co/sayakpaul) and [ariG23498](https://huggingface.co/ariG23498). 
The original code can be found [here](https://github.com/facebookresearch/pycls). +The huge 10B model from [Self-supervised Pretraining of Visual Features in the Wild](https://arxiv.org/abs/2103.01988), +trained on one billion Instagram images, is available on the [hub](https://huggingface.co/facebook/regnet-y-10b-seer) + ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with RegNet. @@ -38,6 +40,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`RegNetForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). +- See also: [Image classification task guide](../tasks/image_classification) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. @@ -45,25 +48,43 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] RegNetConfig + + ## RegNetModel [[autodoc]] RegNetModel - forward - ## RegNetForImageClassification [[autodoc]] RegNetForImageClassification - forward + + + ## TFRegNetModel [[autodoc]] TFRegNetModel - call - ## TFRegNetForImageClassification [[autodoc]] TFRegNetForImageClassification - - call \ No newline at end of file + - call + + + + +## FlaxRegNetModel + +[[autodoc]] FlaxRegNetModel + - __call__ + +## FlaxRegNetForImageClassification + +[[autodoc]] FlaxRegNetForImageClassification + - __call__ + + diff --git a/docs/source/en/model_doc/rembert.mdx b/docs/source/en/model_doc/rembert.md similarity index 85% rename from docs/source/en/model_doc/rembert.mdx rename to docs/source/en/model_doc/rembert.md index 0edb8e5202d956..b755d3423060ca 100644 --- a/docs/source/en/model_doc/rembert.mdx +++ b/docs/source/en/model_doc/rembert.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # RemBERT @@ -30,13 +34,22 @@ Transformer representations to be more general and more transferable to other ta findings, we are able to train models that achieve strong performance on the XTREME benchmark without increasing the number of parameters at the fine-tuning stage.* -Tips: +## Usage tips For fine-tuning, RemBERT can be thought of as a bigger version of mBERT with an ALBERT-like factorization of the embedding layer. The embeddings are not tied in pre-training, in contrast with BERT, which enables smaller input embeddings (preserved during fine-tuning) and bigger output embeddings (discarded at fine-tuning). The tokenizer is also similar to the Albert one rather than the BERT one. 
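One quick way to see this factorization is to inspect the configuration, whose input and output embedding sizes are decoupled from the hidden size. A small sketch, assuming the `google/rembert` checkpoint:

```python
from transformers import RemBertConfig

# Only the configuration is downloaded here, not the weights.
config = RemBertConfig.from_pretrained("google/rembert")

# Small input embeddings, a larger hidden size, and a separate (larger) output embedding size.
print(config.input_embedding_size, config.hidden_size, config.output_embedding_size)
```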
+## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Causal language modeling task guide](../tasks/language_modeling) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) + ## RemBertConfig [[autodoc]] RemBertConfig @@ -57,6 +70,9 @@ also similar to the Albert one rather than the BERT one. - create_token_type_ids_from_sequences - save_vocabulary + + + ## RemBertModel [[autodoc]] RemBertModel @@ -92,6 +108,9 @@ also similar to the Albert one rather than the BERT one. [[autodoc]] RemBertForQuestionAnswering - forward + + + ## TFRemBertModel [[autodoc]] TFRemBertModel @@ -126,3 +145,6 @@ also similar to the Albert one rather than the BERT one. [[autodoc]] TFRemBertForQuestionAnswering - call + + + diff --git a/docs/source/en/model_doc/resnet.mdx b/docs/source/en/model_doc/resnet.md similarity index 89% rename from docs/source/en/model_doc/resnet.mdx rename to docs/source/en/model_doc/resnet.md index 031066b69b6768..b959266512f502 100644 --- a/docs/source/en/model_doc/resnet.mdx +++ b/docs/source/en/model_doc/resnet.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # ResNet @@ -23,10 +27,6 @@ The abstract from the paper is the following: *Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.* -Tips: - -- One can use [`AutoImageProcessor`] to prepare images for the model. - The figure below illustrates the architecture of ResNet. Taken from the [original paper](https://arxiv.org/abs/1512.03385). 
@@ -40,6 +40,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`ResNetForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). +- See also: [Image classification task guide](../tasks/image_classification) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. @@ -47,26 +48,44 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] ResNetConfig + + ## ResNetModel [[autodoc]] ResNetModel - forward - ## ResNetForImageClassification [[autodoc]] ResNetForImageClassification - forward + + ## TFResNetModel [[autodoc]] TFResNetModel - call - ## TFResNetForImageClassification [[autodoc]] TFResNetForImageClassification - call + + + + +## FlaxResNetModel + +[[autodoc]] FlaxResNetModel + - __call__ + +## FlaxResNetForImageClassification + +[[autodoc]] FlaxResNetForImageClassification + - __call__ + + + diff --git a/docs/source/en/model_doc/retribert.mdx b/docs/source/en/model_doc/retribert.md similarity index 73% rename from docs/source/en/model_doc/retribert.mdx rename to docs/source/en/model_doc/retribert.md index e83fae32300da9..ab29ac966fe19f 100644 --- a/docs/source/en/model_doc/retribert.mdx +++ b/docs/source/en/model_doc/retribert.md @@ -8,10 +8,23 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # RetriBERT + + +This model is in maintenance mode only, so we won't accept any new PRs changing its code. + +If you run into any issues running this model, please reinstall the last version that supported this model: v4.30.0. +You can do so by running the following command: `pip install -U transformers==4.30.0`. + + + ## Overview The RetriBERT model was proposed in the blog post [Explain Anything Like I'm Five: A Model for Open Domain Long Form diff --git a/docs/source/en/model_doc/roberta-prelayernorm.mdx b/docs/source/en/model_doc/roberta-prelayernorm.md similarity index 85% rename from docs/source/en/model_doc/roberta-prelayernorm.mdx rename to docs/source/en/model_doc/roberta-prelayernorm.md index a8fb2bb2b9aa96..f748e273e8f80e 100644 --- a/docs/source/en/model_doc/roberta-prelayernorm.mdx +++ b/docs/source/en/model_doc/roberta-prelayernorm.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. 
+ --> # RoBERTa-PreLayerNorm @@ -21,19 +25,30 @@ The abstract from the paper is the following: *fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks. The toolkit is based on PyTorch and supports distributed training across multiple GPUs and machines. We also support fast mixed-precision training and inference on modern GPUs.* -Tips: +This model was contributed by [andreasmaden](https://huggingface.co/andreasmadsen). +The original code can be found [here](https://github.com/princeton-nlp/DinkyTrain). + +## Usage tips - The implementation is the same as [Roberta](roberta) except instead of using _Add and Norm_ it does _Norm and Add_. _Add_ and _Norm_ refers to the Addition and LayerNormalization as described in [Attention Is All You Need](https://arxiv.org/abs/1706.03762). - This is identical to using the `--encoder-normalize-before` flag in [fairseq](https://fairseq.readthedocs.io/). -This model was contributed by [andreasmaden](https://huggingface.co/andreasmaden). -The original code can be found [here](https://github.com/princeton-nlp/DinkyTrain). +## Resources +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Causal language modeling task guide](../tasks/language_modeling) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) ## RobertaPreLayerNormConfig [[autodoc]] RobertaPreLayerNormConfig + + + ## RobertaPreLayerNormModel [[autodoc]] RobertaPreLayerNormModel @@ -69,6 +84,9 @@ The original code can be found [here](https://github.com/princeton-nlp/DinkyTrai [[autodoc]] RobertaPreLayerNormForQuestionAnswering - forward + + + ## TFRobertaPreLayerNormModel [[autodoc]] TFRobertaPreLayerNormModel @@ -104,6 +122,9 @@ The original code can be found [here](https://github.com/princeton-nlp/DinkyTrai [[autodoc]] TFRobertaPreLayerNormForQuestionAnswering - call + + + ## FlaxRobertaPreLayerNormModel [[autodoc]] FlaxRobertaPreLayerNormModel @@ -138,3 +159,6 @@ The original code can be found [here](https://github.com/princeton-nlp/DinkyTrai [[autodoc]] FlaxRobertaPreLayerNormForQuestionAnswering - __call__ + + + diff --git a/docs/source/en/model_doc/roberta.mdx b/docs/source/en/model_doc/roberta.md similarity index 92% rename from docs/source/en/model_doc/roberta.mdx rename to docs/source/en/model_doc/roberta.md index 63b7bab6fa5a27..364b5b37e5f3f0 100644 --- a/docs/source/en/model_doc/roberta.mdx +++ b/docs/source/en/model_doc/roberta.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # RoBERTa @@ -19,11 +23,14 @@ specific language governing permissions and limitations under the License. 
Spaces + +Paper page + ## Overview -The RoBERTa model was proposed in [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer +The RoBERTa model was proposed in [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, [Myle Ott](https://huggingface.co/myleott), Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google's BERT model released in 2018. It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with @@ -40,7 +47,9 @@ model published after it. Our best model achieves state-of-the-art results on GL highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.* -Tips: +This model was contributed by [julien-c](https://huggingface.co/julien-c). The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/roberta). + +## Usage tips - This implementation is the same as [`BertModel`] with a tiny embeddings tweak as well as a setup for Roberta pretrained models. @@ -56,8 +65,6 @@ Tips: * use BPE with bytes as a subunit and not characters (because of unicode characters) - [CamemBERT](camembert) is a wrapper around RoBERTa. Refer to this page for usage examples. -This model was contributed by [julien-c](https://huggingface.co/julien-c). The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/roberta). - ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with RoBERTa. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. @@ -70,6 +77,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`RobertaForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb). - [`TFRobertaForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/text-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb). - [`FlaxRobertaForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/text-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification_flax.ipynb). +- [Text classification task guide](../tasks/sequence_classification) @@ -77,6 +85,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`TFRobertaForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/token-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb). 
- [`FlaxRobertaForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/token-classification). - [Token classification](https://huggingface.co/course/chapter7/2?fw=pt) chapter of the 🤗 Hugging Face Course. +- [Token classification task guide](../tasks/token_classification) @@ -85,6 +94,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`TFRobertaForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/language-modeling#run_mlmpy) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb). - [`FlaxRobertaForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling#masked-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/masked_language_modeling_flax.ipynb). - [Masked language modeling](https://huggingface.co/course/chapter7/3?fw=pt) chapter of the 🤗 Hugging Face Course. +- [Masked language modeling task guide](../tasks/masked_language_modeling) @@ -93,10 +103,12 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`TFRobertaForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/question-answering) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb). - [`FlaxRobertaForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/question-answering). - [Question answering](https://huggingface.co/course/chapter7/7?fw=pt) chapter of the 🤗 Hugging Face Course. +- [Question answering task guide](../tasks/question_answering) **Multiple choice** - [`RobertaForMultipleChoice`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/multiple-choice) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb). - [`TFRobertaForMultipleChoice`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/multiple-choice) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice-tf.ipynb). 
+- [Multiple choice task guide](../tasks/multiple_choice) ## RobertaConfig @@ -115,6 +127,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] RobertaTokenizerFast - build_inputs_with_special_tokens + + + ## RobertaModel [[autodoc]] RobertaModel @@ -150,6 +165,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] RobertaForQuestionAnswering - forward + + + ## TFRobertaModel [[autodoc]] TFRobertaModel @@ -185,6 +203,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] TFRobertaForQuestionAnswering - call + + + ## FlaxRobertaModel [[autodoc]] FlaxRobertaModel @@ -219,3 +240,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] FlaxRobertaForQuestionAnswering - __call__ + + + diff --git a/docs/source/en/model_doc/roc_bert.mdx b/docs/source/en/model_doc/roc_bert.md similarity index 83% rename from docs/source/en/model_doc/roc_bert.mdx rename to docs/source/en/model_doc/roc_bert.md index c30ccfd1c52397..30fadd5c2c10d8 100644 --- a/docs/source/en/model_doc/roc_bert.mdx +++ b/docs/source/en/model_doc/roc_bert.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # RoCBert @@ -31,12 +35,20 @@ in the toxic content detection task under human-made attacks.* This model was contributed by [weiweishi](https://huggingface.co/weiweishi). +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Causal language modeling task guide](../tasks/language_modeling) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) + ## RoCBertConfig [[autodoc]] RoCBertConfig - all - ## RoCBertTokenizer [[autodoc]] RoCBertTokenizer @@ -45,31 +57,26 @@ This model was contributed by [weiweishi](https://huggingface.co/weiweishi). - create_token_type_ids_from_sequences - save_vocabulary - ## RoCBertModel [[autodoc]] RoCBertModel - forward - ## RoCBertForPreTraining [[autodoc]] RoCBertForPreTraining - forward - ## RoCBertForCausalLM [[autodoc]] RoCBertForCausalLM - forward - ## RoCBertForMaskedLM [[autodoc]] RoCBertForMaskedLM - forward - ## RoCBertForSequenceClassification [[autodoc]] transformers.RoCBertForSequenceClassification @@ -80,14 +87,12 @@ This model was contributed by [weiweishi](https://huggingface.co/weiweishi). 
[[autodoc]] transformers.RoCBertForMultipleChoice - forward - ## RoCBertForTokenClassification [[autodoc]] transformers.RoCBertForTokenClassification - forward - ## RoCBertForQuestionAnswering [[autodoc]] RoCBertForQuestionAnswering - - forward \ No newline at end of file + - forward diff --git a/docs/source/en/model_doc/roformer.mdx b/docs/source/en/model_doc/roformer.md similarity index 82% rename from docs/source/en/model_doc/roformer.mdx rename to docs/source/en/model_doc/roformer.md index 435941d9f29ac1..5d8f146c43fdd3 100644 --- a/docs/source/en/model_doc/roformer.mdx +++ b/docs/source/en/model_doc/roformer.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # RoFormer @@ -29,13 +33,20 @@ transformer with rotary position embedding, or RoFormer, achieves superior perfo release the theoretical analysis along with some preliminary experiment results on Chinese data. The undergoing experiment for English benchmark will soon be updated.* -Tips: +This model was contributed by [junnyu](https://huggingface.co/junnyu). The original code can be found [here](https://github.com/ZhuiyiTechnology/roformer). -- RoFormer is a BERT-like autoencoding model with rotary position embeddings. Rotary position embeddings have shown - improved performance on classification tasks with long texts. +## Usage tips +RoFormer is a BERT-like autoencoding model with rotary position embeddings. Rotary position embeddings have shown +improved performance on classification tasks with long texts. +## Resources -This model was contributed by [junnyu](https://huggingface.co/junnyu). The original code can be found [here](https://github.com/ZhuiyiTechnology/roformer). +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Causal language modeling task guide](../tasks/language_modeling) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) ## RoFormerConfig @@ -54,6 +65,9 @@ This model was contributed by [junnyu](https://huggingface.co/junnyu). The origi [[autodoc]] RoFormerTokenizerFast - build_inputs_with_special_tokens + + + ## RoFormerModel [[autodoc]] RoFormerModel @@ -89,6 +103,9 @@ This model was contributed by [junnyu](https://huggingface.co/junnyu). The origi [[autodoc]] RoFormerForQuestionAnswering - forward + + + ## TFRoFormerModel [[autodoc]] TFRoFormerModel @@ -124,6 +141,9 @@ This model was contributed by [junnyu](https://huggingface.co/junnyu). The origi [[autodoc]] TFRoFormerForQuestionAnswering - call + + + ## FlaxRoFormerModel [[autodoc]] FlaxRoFormerModel @@ -153,3 +173,6 @@ This model was contributed by [junnyu](https://huggingface.co/junnyu). 
The origi [[autodoc]] FlaxRoFormerForQuestionAnswering - __call__ + + +
diff --git a/docs/source/en/model_doc/rwkv.md b/docs/source/en/model_doc/rwkv.md
new file mode 100644
index 00000000000000..1acb173060216b
--- /dev/null
+++ b/docs/source/en/model_doc/rwkv.md
@@ -0,0 +1,150 @@
+
+
+# RWKV
+
+## Overview
+
+The RWKV model was proposed in [this repo](https://github.com/BlinkDL/RWKV-LM).
+
+It suggests a tweak in the traditional Transformer attention to make it linear. This way, the model can be used as a recurrent network: passing inputs for timestamp 0 and timestamp 1 together is the same as passing inputs at timestamp 0, then inputs at timestamp 1 along with the state of timestamp 0 (see example below).
+
+This can be more efficient than a regular Transformer and can deal with sentences of any length (even if the model uses a fixed context length for training).
+
+This model was contributed by [sgugger](https://huggingface.co/sgugger).
+The original code can be found [here](https://github.com/BlinkDL/RWKV-LM).
+
+## Usage example
+
+```py
+import torch
+from transformers import AutoTokenizer, RwkvConfig, RwkvModel
+
+model = RwkvModel.from_pretrained("sgugger/rwkv-430M-pile")
+tokenizer = AutoTokenizer.from_pretrained("sgugger/rwkv-430M-pile")
+
+inputs = tokenizer("This is an example.", return_tensors="pt")
+# Feed everything to the model
+outputs = model(inputs["input_ids"])
+output_whole = outputs.last_hidden_state
+
+outputs = model(inputs["input_ids"][:, :2])
+output_one = outputs.last_hidden_state
+
+# Using the state computed on the first inputs, we will get the same output
+outputs = model(inputs["input_ids"][:, 2:], state=outputs.state)
+output_two = outputs.last_hidden_state
+
+torch.allclose(torch.cat([output_one, output_two], dim=1), output_whole, atol=1e-5)
+```
+
+If you want to make sure the model stops generating when `'\n\n'` is detected, we recommend using the following stopping criterion:
+
+```python
+import torch
+from transformers import StoppingCriteria
+
+class RwkvStoppingCriteria(StoppingCriteria):
+    def __init__(self, eos_sequence=[187, 187], eos_token_id=537):
+        # [187, 187] is the token id sequence for '\n\n' with this checkpoint's tokenizer
+        self.eos_sequence = eos_sequence
+        self.eos_token_id = eos_token_id
+
+    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
+        # Stop as soon as the last two generated tokens match the eos sequence
+        last_2_ids = input_ids[:, -2:].tolist()
+        return self.eos_sequence in last_2_ids
+
+
+output = model.generate(inputs["input_ids"], max_new_tokens=64, stopping_criteria=[RwkvStoppingCriteria()])
+```
+
+## RwkvConfig
+
+[[autodoc]] RwkvConfig
+
+## RwkvModel
+
+[[autodoc]] RwkvModel
+    - forward
+
+## RwkvForCausalLM
+
+[[autodoc]] RwkvForCausalLM
+    - forward
+
+## Rwkv attention and the recurrent formulas
+
+In a traditional auto-regressive Transformer, attention is written as
+
+$$O = \hbox{softmax}(QK^{T} / \sqrt{d}) V$$
+
+where \\(Q\\), \\(K\\) and \\(V\\) are matrices of shape `seq_len x hidden_size` named query, key and value (they are actually bigger matrices with a batch dimension and an attention head dimension but we're only interested in the last two, which is where the matrix product is taken, so for the sake of simplicity we only consider those two). The product \\(QK^{T}\\) then has shape `seq_len x seq_len` and we can take the matrix product with \\(V\\) to get the output \\(O\\) of the same shape as the others.
+ +Replacing the softmax by its value gives: + +$$O_{i} = \frac{\sum_{j=1}^{i} e^{Q_{i} K_{j}^{T} / \sqrt{d}} V_{j}}{\sum_{j=1}^{i} e^{Q_{i} K_{j}^{T} / \sqrt{d}}}$$ + +Note that the entries in \\(QK^{T}\\) corresponding to \\(j > i\\) are masked (the sum stops at j) because the attention is not allowed to look at future tokens (only past ones). + +In comparison, the RWKV attention is given by + +$$O_{i} = \sigma(R_{i}) \frac{\sum_{j=1}^{i} e^{W_{i-j} + K_{j}} V_{j}}{\sum_{j=1}^{i} e^{W_{i-j} + K_{j}}}$$ + +where \\(R\\) is a new matrix called receptance by the author, \\(K\\) and \\(V\\) are still the key and value (\\(\sigma\\) here is the sigmoid function). \\(W\\) is a new vector that represents the position of the token and is given by + +$$W_{0} = u \hbox{ and } W_{k} = (k-1)w \hbox{ for } k \geq 1$$ + +with \\(u\\) and \\(w\\) learnable parameters called in the code `time_first` and `time_decay` respectively. The numerator and denominator can both be expressed recursively. Naming them \\(N_{i}\\) and \\(D_{i}\\) we have: + +$$N_{i} = e^{u + K_{i}} V_{i} + \hat{N}_{i} \hbox{ where } \hat{N}_{i} = e^{K_{i-1}} V_{i-1} + e^{w + K_{i-2}} V_{i-2} \cdots + e^{(i-2)w + K_{1}} V_{1}$$ + +so \\(\hat{N}_{i}\\) (called `numerator_state` in the code) satisfies + +$$\hat{N}_{0} = 0 \hbox{ and } \hat{N}_{j+1} = e^{K_{j}} V_{j} + e^{w} \hat{N}_{j}$$ + +and + +$$D_{i} = e^{u + K_{i}} + \hat{D}_{i} \hbox{ where } \hat{D}_{i} = e^{K_{i-1}} + e^{w + K_{i-2}} \cdots + e^{(i-2)w + K_{1}}$$ + +so \\(\hat{D}_{i}\\) (called `denominator_state` in the code) satisfies + +$$\hat{D}_{0} = 0 \hbox{ and } \hat{D}_{j+1} = e^{K_{j}} + e^{w} \hat{D}_{j}$$ + +The actual recurrent formula used are a tiny bit more complex, as for numerical stability we don't want to compute exponentials of big numbers. Usually the softmax is not computed as is, but the exponential of the maximum term is divided of the numerator and denominator: + +$$\frac{e^{x_{i}}}{\sum_{j=1}^{n} e^{x_{j}}} = \frac{e^{x_{i} - M}}{\sum_{j=1}^{n} e^{x_{j} - M}}$$ + +with \\(M\\) the maximum of all \\(x_{j}\\). So here on top of saving the numerator state (\\(\hat{N}\\)) and the denominator state (\\(\hat{D}\\)) we also keep track of the maximum of all terms encountered in the exponentials. So we actually use + +$$\tilde{N}_{i} = e^{-M_{i}} \hat{N}_{i} \hbox{ and } \tilde{D}_{i} = e^{-M_{i}} \hat{D}_{i}$$ + +defined by the following recurrent formulas: + +$$\tilde{N}_{0} = 0 \hbox{ and } \tilde{N}_{j+1} = e^{K_{j} - q} V_{j} + e^{w + M_{j} - q} \tilde{N}_{j} \hbox{ where } q = \max(K_{j}, w + M_{j})$$ + +and + +$$\tilde{D}_{0} = 0 \hbox{ and } \tilde{D}_{j+1} = e^{K_{j} - q} + e^{w + M_{j} - q} \tilde{D}_{j} \hbox{ where } q = \max(K_{j}, w + M_{j})$$ + +and \\(M_{j+1} = q\\). 
With those, we can then compute
+
+$$N_{i} = e^{u + K_{i} - q} V_{i} + e^{M_{i} - q} \tilde{N}_{i} \hbox{ where } q = \max(u + K_{i}, M_{i})$$
+
+and
+
+$$D_{i} = e^{u + K_{i} - q} + e^{M_{i} - q} \tilde{D}_{i} \hbox{ where } q = \max(u + K_{i}, M_{i})$$
+
+which finally gives us
+
+$$O_{i} = \sigma(R_{i}) \frac{N_{i}}{D_{i}}$$
\ No newline at end of file
diff --git a/docs/source/en/model_doc/sam.md b/docs/source/en/model_doc/sam.md
new file mode 100644
index 00000000000000..feace522ef70be
--- /dev/null
+++ b/docs/source/en/model_doc/sam.md
@@ -0,0 +1,148 @@
+
+
+# SAM
+
+## Overview
+
+SAM (Segment Anything Model) was proposed in [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
+
+The model can be used to predict segmentation masks of any object of interest given an input image.
+
+![example image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/sam-output.png)
+
+The abstract from the paper is the following:
+
+*We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive -- often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at [https://segment-anything.com](https://segment-anything.com) to foster research into foundation models for computer vision.*
+
+Tips:
+
+- The model predicts binary masks that state the presence or absence of the object of interest given an image.
+- The model predicts much better results if input 2D points and/or input bounding boxes are provided.
+- You can prompt multiple points for the same image and predict a single mask.
+- Fine-tuning the model is not supported yet.
+- According to the paper, textual input should also be supported. However, at the time of writing this does not seem to be supported, according to [the official repository](https://github.com/facebookresearch/segment-anything/issues/4#issuecomment-1497626844).
+
+
+This model was contributed by [ybelkada](https://huggingface.co/ybelkada) and [ArthurZ](https://huggingface.co/ArthurZ).
+The original code can be found [here](https://github.com/facebookresearch/segment-anything).
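+
+If you only want to generate masks for an entire image, without providing point or box prompts yourself, the automatic mask generation pipeline can be used. The snippet below is a minimal sketch: the `points_per_batch` value is just an illustrative setting, and the returned dictionary is expected to contain `masks` and `scores`.
+
+```python
+from transformers import pipeline
+
+# "mask-generation" wraps SAM with a grid of point prompts covering the whole image
+generator = pipeline("mask-generation", model="facebook/sam-vit-huge")
+
+img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
+outputs = generator(img_url, points_per_batch=64)
+
+masks = outputs["masks"]    # list of binary masks, one per detected object
+scores = outputs["scores"]  # one IoU-like confidence score per mask
+```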
+
+Below is an example of how to run mask generation given an image and a 2D point:
+
+```python
+import torch
+from PIL import Image
+import requests
+from transformers import SamModel, SamProcessor
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+model = SamModel.from_pretrained("facebook/sam-vit-huge").to(device)
+processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
+
+img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
+raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
+input_points = [[[450, 600]]]  # 2D location of a window in the image
+
+inputs = processor(raw_image, input_points=input_points, return_tensors="pt").to(device)
+with torch.no_grad():
+    outputs = model(**inputs)
+
+masks = processor.image_processor.post_process_masks(
+    outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu()
+)
+scores = outputs.iou_scores
+```
+
+You can also process your own masks alongside the input images in the processor so they can be passed to the model.
+
+```python
+import torch
+from PIL import Image
+import requests
+from transformers import SamModel, SamProcessor
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+model = SamModel.from_pretrained("facebook/sam-vit-huge").to(device)
+processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
+
+img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
+raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
+mask_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"  # replace with the URL or path of your own segmentation map
+segmentation_map = Image.open(requests.get(mask_url, stream=True).raw).convert("RGB")
+input_points = [[[450, 600]]]  # 2D location of a window in the image
+
+inputs = processor(raw_image, input_points=input_points, segmentation_maps=segmentation_map, return_tensors="pt").to(device)
+with torch.no_grad():
+    outputs = model(**inputs)
+
+masks = processor.image_processor.post_process_masks(
+    outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu()
+)
+scores = outputs.iou_scores
+```
+
+## Resources
+
+A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with SAM.
+
+- [Demo notebook](https://github.com/huggingface/notebooks/blob/main/examples/segment_anything.ipynb) for using the model.
+- [Demo notebook](https://github.com/huggingface/notebooks/blob/main/examples/automatic_mask_generation.ipynb) for using the automatic mask generation pipeline.
+- [Demo notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/SAM/Run_inference_with_MedSAM_using_HuggingFace_Transformers.ipynb) for inference with MedSAM, a fine-tuned version of SAM on the medical domain. 🌎
+- [Demo notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/SAM/Fine_tune_SAM_(segment_anything)_on_a_custom_dataset.ipynb) for fine-tuning the model on custom data. 🌎
+
+## SlimSAM
+
+SlimSAM, a pruned version of SAM, was proposed in [0.1% Data Makes Segment Anything Slim](https://arxiv.org/abs/2312.05284) by Zigeng Chen et al. SlimSAM reduces the size of the SAM models considerably while maintaining the same performance.
+
+Checkpoints can be found on the [hub](https://huggingface.co/models?other=slimsam), and they can be used as a drop-in replacement for SAM.
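+
+Since SlimSAM keeps the SAM architecture, switching to it should only require changing the checkpoint name. Below is a minimal sketch that assumes `Zigeng/SlimSAM-uniform-50` is one of the available checkpoint ids; check the hub link above for the exact names.
+
+```python
+import torch
+from PIL import Image
+import requests
+from transformers import SamModel, SamProcessor
+
+# assumed SlimSAM checkpoint id, taken from the hub listing linked above
+model = SamModel.from_pretrained("Zigeng/SlimSAM-uniform-50")
+processor = SamProcessor.from_pretrained("Zigeng/SlimSAM-uniform-50")
+
+img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
+raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
+input_points = [[[450, 600]]]  # same point prompt as in the SAM example above
+
+inputs = processor(raw_image, input_points=input_points, return_tensors="pt")
+with torch.no_grad():
+    outputs = model(**inputs)
+
+masks = processor.image_processor.post_process_masks(
+    outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu()
+)
+```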
+ +## SamConfig + +[[autodoc]] SamConfig + +## SamVisionConfig + +[[autodoc]] SamVisionConfig + +## SamMaskDecoderConfig + +[[autodoc]] SamMaskDecoderConfig + +## SamPromptEncoderConfig + +[[autodoc]] SamPromptEncoderConfig + + +## SamProcessor + +[[autodoc]] SamProcessor + + +## SamImageProcessor + +[[autodoc]] SamImageProcessor + + +## SamModel + +[[autodoc]] SamModel + - forward + + +## TFSamModel + +[[autodoc]] TFSamModel + - call diff --git a/docs/source/en/model_doc/seamless_m4t.md b/docs/source/en/model_doc/seamless_m4t.md new file mode 100644 index 00000000000000..e820e6c92563b0 --- /dev/null +++ b/docs/source/en/model_doc/seamless_m4t.md @@ -0,0 +1,220 @@ + + +# SeamlessM4T + +## Overview + +The SeamlessM4T model was proposed in [SeamlessM4T — Massively Multilingual & Multimodal Machine Translation](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf) by the Seamless Communication team from Meta AI. + +This is the **version 1** release of the model. For the updated **version 2** release, refer to the [Seamless M4T v2 docs](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t_v2). + +SeamlessM4T is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text. + +SeamlessM4T enables multiple tasks without relying on separate models: + +- Speech-to-speech translation (S2ST) +- Speech-to-text translation (S2TT) +- Text-to-speech translation (T2ST) +- Text-to-text translation (T2TT) +- Automatic speech recognition (ASR) + +[`SeamlessM4TModel`] can perform all the above tasks, but each task also has its own dedicated sub-model. + +The abstract from the paper is the following: + +*What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. 
Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication* + +## Usage + +First, load the processor and a checkpoint of the model: + +```python +>>> from transformers import AutoProcessor, SeamlessM4TModel + +>>> processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium") +>>> model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium") +``` + +You can seamlessly use this model on text or on audio, to generated either translated text or translated audio. + +Here is how to use the processor to process text and audio: + +```python +>>> # let's load an audio sample from an Arabic speech corpus +>>> from datasets import load_dataset +>>> dataset = load_dataset("arabic_speech_corpus", split="test", streaming=True) +>>> audio_sample = next(iter(dataset))["audio"] + +>>> # now, process it +>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt") + +>>> # now, process some English test as well +>>> text_inputs = processor(text = "Hello, my dog is cute", src_lang="eng", return_tensors="pt") +``` + + +### Speech + +[`SeamlessM4TModel`] can *seamlessly* generate text or speech with few or no changes. Let's target Russian voice translation: + +```python +>>> audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze() +>>> audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze() +``` + +With basically the same code, I've translated English text and Arabic speech to Russian speech samples. + +### Text + +Similarly, you can generate translated text from audio files or from text with the same model. You only have to pass `generate_speech=False` to [`SeamlessM4TModel.generate`]. +This time, let's translate to French. + +```python +>>> # from audio +>>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False) +>>> translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True) + +>>> # from text +>>> output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False) +>>> translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True) +``` + +### Tips + + +#### 1. Use dedicated models + +[`SeamlessM4TModel`] is transformers top level model to generate speech and text, but you can also use dedicated models that perform the task without additional components, thus reducing the memory footprint. +For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task, the rest is exactly the same code: + +```python +>>> from transformers import SeamlessM4TForSpeechToSpeech +>>> model = SeamlessM4TForSpeechToSpeech.from_pretrained("facebook/hf-seamless-m4t-medium") +``` + +Or you can replace the text-to-text generation snippet with the model dedicated to the T2TT task, you only have to remove `generate_speech=False`. + +```python +>>> from transformers import SeamlessM4TForTextToText +>>> model = SeamlessM4TForTextToText.from_pretrained("facebook/hf-seamless-m4t-medium") +``` + +Feel free to try out [`SeamlessM4TForSpeechToText`] and [`SeamlessM4TForTextToSpeech`] as well. + +#### 2. Change the speaker identity + +You have the possibility to change the speaker used for speech synthesis with the `spkr_id` argument. Some `spkr_id` works better than other for some languages! + +#### 3. 
Change the generation strategy + +You can use different [generation strategies](./generation_strategies) for speech and text generation, e.g `.generate(input_ids=input_ids, text_num_beams=4, speech_do_sample=True)` which will successively perform beam-search decoding on the text model, and multinomial sampling on the speech model. + +#### 4. Generate speech and text at the same time + +Use `return_intermediate_token_ids=True` with [`SeamlessM4TModel`] to return both speech and text ! + +## Model architecture + + +SeamlessM4T features a versatile architecture that smoothly handles the sequential generation of text and speech. This setup comprises two sequence-to-sequence (seq2seq) models. The first model translates the input modality into translated text, while the second model generates speech tokens, known as "unit tokens," from the translated text. + +Each modality has its own dedicated encoder with a unique architecture. Additionally, for speech output, a vocoder inspired by the [HiFi-GAN](https://arxiv.org/abs/2010.05646) architecture is placed on top of the second seq2seq model. + +Here's how the generation process works: + +- Input text or speech is processed through its specific encoder. +- A decoder creates text tokens in the desired language. +- If speech generation is required, the second seq2seq model, following a standard encoder-decoder structure, generates unit tokens. +- These unit tokens are then passed through the final vocoder to produce the actual speech. + + +This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The original code can be found [here](https://github.com/facebookresearch/seamless_communication). + +## SeamlessM4TModel + +[[autodoc]] SeamlessM4TModel + - generate + + +## SeamlessM4TForTextToSpeech + +[[autodoc]] SeamlessM4TForTextToSpeech + - generate + + +## SeamlessM4TForSpeechToSpeech + +[[autodoc]] SeamlessM4TForSpeechToSpeech + - generate + + +## SeamlessM4TForTextToText + +[[autodoc]] transformers.SeamlessM4TForTextToText + - forward + - generate + +## SeamlessM4TForSpeechToText + +[[autodoc]] transformers.SeamlessM4TForSpeechToText + - forward + - generate + +## SeamlessM4TConfig + +[[autodoc]] SeamlessM4TConfig + + +## SeamlessM4TTokenizer + +[[autodoc]] SeamlessM4TTokenizer + - __call__ + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - save_vocabulary + + +## SeamlessM4TTokenizerFast + +[[autodoc]] SeamlessM4TTokenizerFast + - __call__ + +## SeamlessM4TFeatureExtractor + +[[autodoc]] SeamlessM4TFeatureExtractor + - __call__ + +## SeamlessM4TProcessor + +[[autodoc]] SeamlessM4TProcessor + - __call__ + +## SeamlessM4TCodeHifiGan + +[[autodoc]] SeamlessM4TCodeHifiGan + + +## SeamlessM4THifiGan + +[[autodoc]] SeamlessM4THifiGan + +## SeamlessM4TTextToUnitModel + +[[autodoc]] SeamlessM4TTextToUnitModel + +## SeamlessM4TTextToUnitForConditionalGeneration + +[[autodoc]] SeamlessM4TTextToUnitForConditionalGeneration + + diff --git a/docs/source/en/model_doc/seamless_m4t_v2.md b/docs/source/en/model_doc/seamless_m4t_v2.md new file mode 100644 index 00000000000000..aea34acc180b38 --- /dev/null +++ b/docs/source/en/model_doc/seamless_m4t_v2.md @@ -0,0 +1,194 @@ + + +# SeamlessM4T-v2 + +## Overview + +The SeamlessM4T-v2 model was proposed in [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team from Meta AI. 
+ +SeamlessM4T-v2 is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text. It is an improvement on the [previous version](https://huggingface.co/docs/transformers/main/model_doc/seamless_m4t). For more details on the differences between v1 and v2, refer to section [Difference with SeamlessM4T-v1](#difference-with-seamlessm4t-v1). + +SeamlessM4T-v2 enables multiple tasks without relying on separate models: + +- Speech-to-speech translation (S2ST) +- Speech-to-text translation (S2TT) +- Text-to-speech translation (T2ST) +- Text-to-text translation (T2TT) +- Automatic speech recognition (ASR) + +[`SeamlessM4Tv2Model`] can perform all the above tasks, but each task also has its own dedicated sub-model. + +The abstract from the paper is the following: + +*Recent advancements in automatic speech translation have dramatically expanded language coverage, improved multimodal capabilities, and enabled a wide range of tasks and functionalities. That said, large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model—SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. The expanded version of SeamlessAlign adds 114,800 hours of automatically aligned data for a total of 76 languages. SeamlessM4T v2 provides the foundation on which our two newest models, SeamlessExpressive and SeamlessStreaming, are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one’s voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention (EMMA) mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To understand the performance of these models, we combined novel and modified versions of existing automatic metrics to evaluate prosody, latency, and robustness. For human evaluations, we adapted existing protocols tailored for measuring the most relevant attributes in the preservation of meaning, naturalness, and expressivity. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. In sum, Seamless gives us a pivotal look at the technical foundation needed to turn the Universal Speech Translator from a science fiction concept into a real-world technology. 
Finally, contributions in this work—including models, code, and a watermark detector—are publicly released and accessible at the link below.* + +## Usage + +In the following example, we'll load an Arabic audio sample and an English text sample and convert them into Russian speech and French text. + +First, load the processor and a checkpoint of the model: + +```python +>>> from transformers import AutoProcessor, SeamlessM4Tv2Model + +>>> processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large") +>>> model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large") +``` + +You can seamlessly use this model on text or on audio, to generated either translated text or translated audio. + +Here is how to use the processor to process text and audio: + +```python +>>> # let's load an audio sample from an Arabic speech corpus +>>> from datasets import load_dataset +>>> dataset = load_dataset("arabic_speech_corpus", split="test", streaming=True) +>>> audio_sample = next(iter(dataset))["audio"] + +>>> # now, process it +>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt") + +>>> # now, process some English text as well +>>> text_inputs = processor(text = "Hello, my dog is cute", src_lang="eng", return_tensors="pt") +``` + + +### Speech + +[`SeamlessM4Tv2Model`] can *seamlessly* generate text or speech with few or no changes. Let's target Russian voice translation: + +```python +>>> audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze() +>>> audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze() +``` + +With basically the same code, I've translated English text and Arabic speech to Russian speech samples. + +### Text + +Similarly, you can generate translated text from audio files or from text with the same model. You only have to pass `generate_speech=False` to [`SeamlessM4Tv2Model.generate`]. +This time, let's translate to French. + +```python +>>> # from audio +>>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False) +>>> translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True) + +>>> # from text +>>> output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False) +>>> translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True) +``` + +### Tips + + +#### 1. Use dedicated models + +[`SeamlessM4Tv2Model`] is transformers top level model to generate speech and text, but you can also use dedicated models that perform the task without additional components, thus reducing the memory footprint. +For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task, the rest is exactly the same code: + +```python +>>> from transformers import SeamlessM4Tv2ForSpeechToSpeech +>>> model = SeamlessM4Tv2ForSpeechToSpeech.from_pretrained("facebook/seamless-m4t-v2-large") +``` + +Or you can replace the text-to-text generation snippet with the model dedicated to the T2TT task, you only have to remove `generate_speech=False`. + +```python +>>> from transformers import SeamlessM4Tv2ForTextToText +>>> model = SeamlessM4Tv2ForTextToText.from_pretrained("facebook/seamless-m4t-v2-large") +``` + +Feel free to try out [`SeamlessM4Tv2ForSpeechToText`] and [`SeamlessM4Tv2ForTextToSpeech`] as well. + +#### 2. 
Change the speaker identity
+
+You can change the speaker used for speech synthesis with the `speaker_id` argument. Some `speaker_id` values work better than others for some languages!
+
+#### 3. Change the generation strategy
+
+You can use different [generation strategies](../generation_strategies) for text generation, e.g. `.generate(input_ids=input_ids, text_num_beams=4, text_do_sample=True)`, which will perform multinomial beam-search decoding on the text model. Note that speech generation only supports greedy decoding - by default - or multinomial sampling, which can be used with e.g. `.generate(..., speech_do_sample=True, speech_temperature=0.6)`.
+
+#### 4. Generate speech and text at the same time
+
+Use `return_intermediate_token_ids=True` with [`SeamlessM4Tv2Model`] to return both speech and text! A combined sketch covering these tips follows the architecture description below.
+
+## Model architecture
+
+SeamlessM4T-v2 features a versatile architecture that smoothly handles the sequential generation of text and speech. This setup comprises two sequence-to-sequence (seq2seq) models. The first model translates the input modality into translated text, while the second model generates speech tokens, known as "unit tokens," from the translated text.
+
+Each modality has its own dedicated encoder with a unique architecture. Additionally, for speech output, a vocoder inspired by the [HiFi-GAN](https://arxiv.org/abs/2010.05646) architecture is placed on top of the second seq2seq model.
+
+### Difference with SeamlessM4T-v1
+
+The architecture of this new version differs from the first in a few aspects:
+
+#### Improvements on the second-pass model
+
+The second seq2seq model, the text-to-unit model, is now non-autoregressive, meaning that it computes units in a **single forward pass**. This achievement is made possible by:
+- the use of **character-level embeddings**, meaning that each character of the predicted translated text has its own embeddings, which are then used to predict the unit tokens.
+- the use of an intermediate duration predictor that predicts speech duration at the **character-level** on the predicted translated text.
+- the use of a new text-to-unit decoder mixing convolutions and self-attention to handle longer context.
+
+#### Difference in the speech encoder
+
+The speech encoder, which is used during the first-pass generation process to predict the translated text, differs mainly from the previous speech encoder through these mechanisms:
+- the use of a chunked attention mask to prevent attention across chunks, ensuring that each position attends only to positions within its own chunk and a fixed number of previous chunks.
+- the use of relative position embeddings, which only consider the distance between sequence elements rather than absolute positions. Please refer to [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155) for more details.
+- the use of a causal depth-wise convolution instead of a non-causal one.
+
+### Generation process
+
+Here's how the generation process works:
+
+- Input text or speech is processed through its specific encoder.
+- A decoder creates text tokens in the desired language.
+- If speech generation is required, the second seq2seq model generates unit tokens in a non-autoregressive way.
+- These unit tokens are then passed through the final vocoder to produce the actual speech.
+
+
+This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The original code can be found [here](https://github.com/facebookresearch/seamless_communication).
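+
+Putting the tips above together, here is a minimal sketch that combines a non-default voice, custom generation strategies, and joint speech/text output. It reuses `model`, `processor` and `text_inputs` from the usage section above; the output attribute names (`waveform`, `sequences`) and the chosen `speaker_id` value are assumptions, so double-check them against the [`SeamlessM4Tv2Model`] reference below.
+
+```python
+>>> outputs = model.generate(
+...     **text_inputs,
+...     tgt_lang="rus",
+...     speaker_id=4,  # assumed valid speaker id, picks a different voice for speech synthesis
+...     text_num_beams=4,  # beam search on the text model
+...     speech_do_sample=True,  # multinomial sampling on the speech model
+...     speech_temperature=0.6,
+...     return_intermediate_token_ids=True,  # also return the translated text tokens
+... )
+
+>>> # assumed output fields: `waveform` for the audio, `sequences` for the text tokens
+>>> audio_array = outputs.waveform[0].cpu().numpy().squeeze()
+>>> translated_text = processor.decode(outputs.sequences[0].tolist(), skip_special_tokens=True)
+```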
+ +## SeamlessM4Tv2Model + +[[autodoc]] SeamlessM4Tv2Model + - generate + + +## SeamlessM4Tv2ForTextToSpeech + +[[autodoc]] SeamlessM4Tv2ForTextToSpeech + - generate + + +## SeamlessM4Tv2ForSpeechToSpeech + +[[autodoc]] SeamlessM4Tv2ForSpeechToSpeech + - generate + + +## SeamlessM4Tv2ForTextToText + +[[autodoc]] transformers.SeamlessM4Tv2ForTextToText + - forward + - generate + +## SeamlessM4Tv2ForSpeechToText + +[[autodoc]] transformers.SeamlessM4Tv2ForSpeechToText + - forward + - generate + +## SeamlessM4Tv2Config + +[[autodoc]] SeamlessM4Tv2Config diff --git a/docs/source/en/model_doc/segformer.mdx b/docs/source/en/model_doc/segformer.md similarity index 96% rename from docs/source/en/model_doc/segformer.mdx rename to docs/source/en/model_doc/segformer.md index 5c494e4747d384..4edd646cd4faa4 100644 --- a/docs/source/en/model_doc/segformer.mdx +++ b/docs/source/en/model_doc/segformer.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # SegFormer @@ -39,7 +43,7 @@ The figure below illustrates the architecture of SegFormer. Taken from the [orig This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version of the model was contributed by [sayakpaul](https://huggingface.co/sayakpaul). The original code can be found [here](https://github.com/NVlabs/SegFormer). -Tips: +## Usage tips - SegFormer consists of a hierarchical Transformer encoder, and a lightweight all-MLP decoder head. [`SegformerModel`] is the hierarchical Transformer encoder (which in the paper is also referred to @@ -91,6 +95,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`SegformerForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). +- [Image classification task guide](../tasks/image_classification) Semantic segmentation: @@ -98,6 +103,7 @@ Semantic segmentation: - A blog on fine-tuning SegFormer on a custom dataset can be found [here](https://huggingface.co/blog/fine-tune-segformer). - More demo notebooks on SegFormer (both inference + fine-tuning on a custom dataset) can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/SegFormer). - [`TFSegformerForSemanticSegmentation`] is supported by this [example notebook](https://github.com/huggingface/notebooks/blob/main/examples/semantic_segmentation-tf.ipynb). +- [Semantic segmentation task guide](../tasks/semantic_segmentation) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. 
@@ -117,6 +123,9 @@ If you're interested in submitting a resource to be included here, please feel f - preprocess - post_process_semantic_segmentation + + + ## SegformerModel [[autodoc]] SegformerModel @@ -137,6 +146,9 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] SegformerForSemanticSegmentation - forward + + + ## TFSegformerDecodeHead [[autodoc]] TFSegformerDecodeHead @@ -156,3 +168,6 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] TFSegformerForSemanticSegmentation - call + + + \ No newline at end of file diff --git a/docs/source/en/model_doc/sew-d.mdx b/docs/source/en/model_doc/sew-d.md similarity index 87% rename from docs/source/en/model_doc/sew-d.mdx rename to docs/source/en/model_doc/sew-d.md index ceeb4f1ec35fcc..013e404bd045b8 100644 --- a/docs/source/en/model_doc/sew-d.mdx +++ b/docs/source/en/model_doc/sew-d.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # SEW-D @@ -28,14 +32,18 @@ variety of training setups. For example, under the 100h-960h semi-supervised set inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces word error rate by 25-50% across different model sizes.* -Tips: +This model was contributed by [anton-l](https://huggingface.co/anton-l). + +## Usage tips - SEW-D is a speech model that accepts a float array corresponding to the raw waveform of the speech signal. - SEWDForCTC is fine-tuned using connectionist temporal classification (CTC) so the model output has to be decoded using [`Wav2Vec2CTCTokenizer`]. -This model was contributed by [anton-l](https://huggingface.co/anton-l). +## Resources +- [Audio classification task guide](../tasks/audio_classification) +- [Automatic speech recognition task guide](../tasks/asr) ## SEWDConfig diff --git a/docs/source/en/model_doc/sew.mdx b/docs/source/en/model_doc/sew.md similarity index 87% rename from docs/source/en/model_doc/sew.mdx rename to docs/source/en/model_doc/sew.md index dce949a856b345..ee8a36a4dcb2fa 100644 --- a/docs/source/en/model_doc/sew.mdx +++ b/docs/source/en/model_doc/sew.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # SEW @@ -28,14 +32,18 @@ variety of training setups. For example, under the 100h-960h semi-supervised set inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. 
With a similar inference time, SEW reduces word error rate by 25-50% across different model sizes.* -Tips: +This model was contributed by [anton-l](https://huggingface.co/anton-l). + +## Usage tips - SEW is a speech model that accepts a float array corresponding to the raw waveform of the speech signal. - SEWForCTC is fine-tuned using connectionist temporal classification (CTC) so the model output has to be decoded using [`Wav2Vec2CTCTokenizer`]. -This model was contributed by [anton-l](https://huggingface.co/anton-l). +## Resources +- [Audio classification task guide](../tasks/audio_classification) +- [Automatic speech recognition task guide](../tasks/asr) ## SEWConfig diff --git a/docs/source/en/model_doc/siglip.md b/docs/source/en/model_doc/siglip.md new file mode 100644 index 00000000000000..c6db0441e7a694 --- /dev/null +++ b/docs/source/en/model_doc/siglip.md @@ -0,0 +1,157 @@ + + +# SigLIP + +## Overview + +The SigLIP model was proposed in [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer. SigLIP proposes to replace the loss function used in [CLIP](clip) by a simple pairwise sigmoid loss. This results in better performance in terms of zero-shot classification accuracy on ImageNet. + +The abstract from the paper is the following: + +*We propose a simple pairwise Sigmoid loss for Language-Image Pre-training (SigLIP). Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. The sigmoid loss simultaneously allows further scaling up the batch size, while also performing better at smaller batch sizes. Combined with Locked-image Tuning, with only four TPUv4 chips, we train a SigLiT model that achieves 84.5% ImageNet zero-shot accuracy in two days. The disentanglement of the batch size from the loss further allows us to study the impact of examples vs pairs and negative to positive ratio. Finally, we push the batch size to the extreme, up to one million, and find that the benefits of growing batch size quickly diminish, with a more reasonable batch size of 32k being sufficient.* + +## Usage tips + +- Usage of SigLIP is similar to [CLIP](clip). The main difference is the training loss, which does not require a global view of all the pairwise similarities of images and texts within a batch. One needs to apply the sigmoid activation function to the logits, rather than the softmax. +- Training is not yet supported. If you want to fine-tune SigLIP or train from scratch, refer to the loss function from [OpenCLIP](https://github.com/mlfoundations/open_clip/blob/73ad04ae7fb93ede1c02dc9040a828634cb1edf1/src/open_clip/loss.py#L307), which leverages various `torch.distributed` utilities. +- When using the standalone [`SiglipTokenizer`] or [`SiglipProcessor`], make sure to pass `padding="max_length"` as that's how the model was trained. + + + + SigLIP evaluation results compared to CLIP. Taken from the original paper. + +This model was contributed by [nielsr](https://huggingface.co/nielsr). +The original code can be found [here](https://github.com/google-research/big_vision/tree/main). + +## Usage example + +There are 2 main ways to use SigLIP: either using the pipeline API, which abstracts away all the complexity for you, or by using the `SiglipModel` class yourself. 
+ +### Pipeline API + +The pipeline allows to use the model in a few lines of code: + +```python +>>> from transformers import pipeline +>>> from PIL import Image +>>> import requests + +>>> # load pipe +>>> image_classifier = pipeline(task="zero-shot-image-classification", model="google/siglip-base-patch16-224") + +>>> # load image +>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg' +>>> image = Image.open(requests.get(url, stream=True).raw) + +>>> # inference +>>> outputs = image_classifier(image, candidate_labels=["2 cats", "a plane", "a remote"]) +>>> outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs] +>>> print(outputs) +[{'score': 0.1979, 'label': '2 cats'}, {'score': 0.0, 'label': 'a remote'}, {'score': 0.0, 'label': 'a plane'}] +``` + +### Using the model yourself + +If you want to do the pre- and postprocessing yourself, here's how to do that: + +```python +>>> from PIL import Image +>>> import requests +>>> from transformers import AutoProcessor, AutoModel +>>> import torch + +>>> model = AutoModel.from_pretrained("google/siglip-base-patch16-224") +>>> processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224") + +>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" +>>> image = Image.open(requests.get(url, stream=True).raw) + +>>> texts = ["a photo of 2 cats", "a photo of 2 dogs"] +>>> # important: we pass `padding=max_length` since the model was trained with this +>>> inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt") + +>>> with torch.no_grad(): +... outputs = model(**inputs) + +>>> logits_per_image = outputs.logits_per_image +>>> probs = torch.sigmoid(logits_per_image) # these are the probabilities +>>> print(f"{probs[0][0]:.1%} that image 0 is '{texts[0]}'") +31.9% that image 0 is 'a photo of 2 cats' +``` + +## Resources + +A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with SigLIP. + +- [Zero-shot image classification task guide](../tasks/zero_shot_image_classification_md) +- Demo notebooks for SigLIP can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/SigLIP). 🌎 + +If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. 
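Beyond zero-shot classification, [`SiglipModel`] also exposes `get_text_features` and `get_image_features` (documented below) for computing standalone embeddings. The following is a minimal sketch, not an official recipe, reusing the `google/siglip-base-patch16-224` checkpoint from the examples above; note that `padding="max_length"` is again passed on the text side, as that is how the model was trained.

```python
>>> from PIL import Image
>>> import requests
>>> import torch
>>> from transformers import AutoProcessor, AutoModel

>>> model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
>>> processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> # text embeddings (padding="max_length" matches how the model was trained)
>>> text_inputs = processor(text=["a photo of 2 cats"], padding="max_length", return_tensors="pt")
>>> with torch.no_grad():
...     text_embeds = model.get_text_features(**text_inputs)

>>> # image embeddings
>>> image_inputs = processor(images=image, return_tensors="pt")
>>> with torch.no_grad():
...     image_embeds = model.get_image_features(**image_inputs)

>>> # cosine similarity between the L2-normalized embeddings
>>> text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
>>> image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
>>> similarity = image_embeds @ text_embeds.T
```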
+ +## SiglipConfig + +[[autodoc]] SiglipConfig + - from_text_vision_configs + +## SiglipTextConfig + +[[autodoc]] SiglipTextConfig + +## SiglipVisionConfig + +[[autodoc]] SiglipVisionConfig + +## SiglipTokenizer + +[[autodoc]] SiglipTokenizer + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - save_vocabulary + +## SiglipImageProcessor + +[[autodoc]] SiglipImageProcessor + - preprocess + +## SiglipProcessor + +[[autodoc]] SiglipProcessor + +## SiglipModel + +[[autodoc]] SiglipModel + - forward + - get_text_features + - get_image_features + +## SiglipTextModel + +[[autodoc]] SiglipTextModel + - forward + +## SiglipVisionModel + +[[autodoc]] SiglipVisionModel + - forward + + +## SiglipForImageClassification + +[[autodoc]] SiglipForImageClassification + - forward \ No newline at end of file diff --git a/docs/source/en/model_doc/speech-encoder-decoder.mdx b/docs/source/en/model_doc/speech-encoder-decoder.md similarity index 94% rename from docs/source/en/model_doc/speech-encoder-decoder.mdx rename to docs/source/en/model_doc/speech-encoder-decoder.md index b0718a27a88cc8..7e2bcef98abce8 100644 --- a/docs/source/en/model_doc/speech-encoder-decoder.mdx +++ b/docs/source/en/model_doc/speech-encoder-decoder.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Speech Encoder Decoder Models @@ -48,7 +52,7 @@ To do so, the `SpeechEncoderDecoderModel` class provides a [`SpeechEncoderDecode >>> from transformers import SpeechEncoderDecoderModel >>> model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained( -... "facebook/hubert-large-ll60k", "bert-base-uncased" +... "facebook/hubert-large-ll60k", "google-bert/bert-base-uncased" ... 
) ``` @@ -89,7 +93,7 @@ speech inputs) and `labels` (which are the `input_ids` of the encoded target seq >>> from datasets import load_dataset >>> encoder_id = "facebook/wav2vec2-base-960h" # acoustic model encoder ->>> decoder_id = "bert-base-uncased" # text decoder +>>> decoder_id = "google-bert/bert-base-uncased" # text decoder >>> feature_extractor = AutoFeatureExtractor.from_pretrained(encoder_id) >>> tokenizer = AutoTokenizer.from_pretrained(decoder_id) @@ -107,7 +111,7 @@ speech inputs) and `labels` (which are the `input_ids` of the encoded target seq >>> labels = tokenizer(ds[0]["text"], return_tensors="pt").input_ids >>> # the forward function automatically creates the correct decoder_input_ids ->>> loss = model(**input_features).loss +>>> loss = model(input_values=input_values, labels=labels).loss >>> loss.backward() ``` @@ -125,4 +129,4 @@ speech inputs) and `labels` (which are the `input_ids` of the encoded target seq [[autodoc]] FlaxSpeechEncoderDecoderModel - __call__ - - from_encoder_decoder_pretrained \ No newline at end of file + - from_encoder_decoder_pretrained diff --git a/docs/source/en/model_doc/speech_to_text.mdx b/docs/source/en/model_doc/speech_to_text.md similarity index 96% rename from docs/source/en/model_doc/speech_to_text.mdx rename to docs/source/en/model_doc/speech_to_text.md index 95efc5504ff868..23512b323af6b4 100644 --- a/docs/source/en/model_doc/speech_to_text.mdx +++ b/docs/source/en/model_doc/speech_to_text.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Speech2Text @@ -23,7 +27,6 @@ transcripts/translations autoregressively. Speech2Text has been fine-tuned on se This model was contributed by [valhalla](https://huggingface.co/valhalla). The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text). - ## Inference Speech2Text is a speech model that accepts a float tensor of log-mel filter-bank features extracted from the speech @@ -40,7 +43,6 @@ install those packages before running the examples. You could either install tho `pip install transformers"[speech, sentencepiece]"` or install the packages separately with `pip install torchaudio sentencepiece`. Also `torchaudio` requires the development version of the [libsndfile](http://www.mega-nerd.com/libsndfile/) package which can be installed via a system package manager. On Ubuntu it can be installed as follows: `apt install libsndfile1-dev` - - ASR and Speech Translation ```python @@ -94,7 +96,6 @@ be installed as follows: `apt install libsndfile1-dev` See the [model hub](https://huggingface.co/models?filter=speech_to_text) to look for Speech2Text checkpoints. 
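As a quick reference, a minimal ASR sketch (assumptions: the `facebook/s2t-small-librispeech-asr` checkpoint and a 16 kHz LibriSpeech sample from 🤗 Datasets, neither of which is prescribed above) looks roughly as follows:

```python
>>> from datasets import load_dataset
>>> from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration

>>> model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
>>> processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")

>>> ds = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")

>>> # the processor extracts log-mel filter-bank features from the raw 16 kHz waveform
>>> inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

>>> # transcripts are generated autoregressively
>>> generated_ids = model.generate(inputs["input_features"], attention_mask=inputs["attention_mask"])
>>> transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
```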
- ## Speech2TextConfig [[autodoc]] Speech2TextConfig @@ -121,6 +122,9 @@ See the [model hub](https://huggingface.co/models?filter=speech_to_text) to look - batch_decode - decode + + + ## Speech2TextModel [[autodoc]] Speech2TextModel @@ -131,6 +135,9 @@ See the [model hub](https://huggingface.co/models?filter=speech_to_text) to look [[autodoc]] Speech2TextForConditionalGeneration - forward + + + ## TFSpeech2TextModel [[autodoc]] TFSpeech2TextModel @@ -140,3 +147,6 @@ See the [model hub](https://huggingface.co/models?filter=speech_to_text) to look [[autodoc]] TFSpeech2TextForConditionalGeneration - call + + + diff --git a/docs/source/en/model_doc/speech_to_text_2.mdx b/docs/source/en/model_doc/speech_to_text_2.md similarity index 94% rename from docs/source/en/model_doc/speech_to_text_2.mdx rename to docs/source/en/model_doc/speech_to_text_2.md index 2e3ebc3f390a3c..6648e67f629d3c 100644 --- a/docs/source/en/model_doc/speech_to_text_2.mdx +++ b/docs/source/en/model_doc/speech_to_text_2.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Speech2Text2 @@ -27,8 +31,7 @@ This model was contributed by [Patrick von Platen](https://huggingface.co/patric The original code can be found [here](https://github.com/pytorch/fairseq/blob/1f7ef9ed1e1061f8c7f88f8b94c7186834398690/fairseq/models/wav2vec/wav2vec2_asr.py#L266). - -Tips: +## Usage tips - Speech2Text2 achieves state-of-the-art results on the CoVoST Speech Translation dataset. For more information, see the [official models](https://huggingface.co/models?other=speech2text2) . @@ -94,6 +97,9 @@ predicted token ids. See [model hub](https://huggingface.co/models?filter=speech2text2) to look for Speech2Text2 checkpoints. +## Resources + +- [Causal language modeling task guide](../tasks/language_modeling) ## Speech2Text2Config diff --git a/docs/source/en/model_doc/speecht5.mdx b/docs/source/en/model_doc/speecht5.md similarity index 94% rename from docs/source/en/model_doc/speecht5.mdx rename to docs/source/en/model_doc/speecht5.md index 848744b24d54d0..4d5e2098a54219 100644 --- a/docs/source/en/model_doc/speecht5.mdx +++ b/docs/source/en/model_doc/speecht5.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # SpeechT5 @@ -67,7 +71,7 @@ This model was contributed by [Matthijs](https://huggingface.co/Matthijs). 
The o [[autodoc]] SpeechT5ForTextToSpeech - forward - - generate_speech + - generate ## SpeechT5ForSpeechToSpeech diff --git a/docs/source/en/model_doc/splinter.mdx b/docs/source/en/model_doc/splinter.md similarity index 93% rename from docs/source/en/model_doc/splinter.mdx rename to docs/source/en/model_doc/splinter.md index 55e5f61b8d0bf3..a46c55966c0ecf 100644 --- a/docs/source/en/model_doc/splinter.mdx +++ b/docs/source/en/model_doc/splinter.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Splinter @@ -30,7 +34,9 @@ are replaced with a special token, viewed as a question representation, that is the answer span. The resulting model obtains surprisingly good results on multiple benchmarks (e.g., 72.7 F1 on SQuAD with only 128 training examples), while maintaining competitive performance in the high-resource setting. -Tips: +This model was contributed by [yuvalkirstain](https://huggingface.co/yuvalkirstain) and [oriram](https://huggingface.co/oriram). The original code can be found [here](https://github.com/oriram/splinter). + +## Usage tips - Splinter was trained to predict answers spans conditioned on a special [QUESTION] token. These tokens contextualize to question representations which are used to predict the answers. This layer is called QASS, and is the default @@ -45,7 +51,9 @@ Tips: doesn't (*tau/splinter-base* and *tau/splinter-large*). This is done to support randomly initializing this layer at fine-tuning, as it is shown to yield better results for some cases in the paper. -This model was contributed by [yuvalkirstain](https://huggingface.co/yuvalkirstain) and [oriram](https://huggingface.co/oriram). The original code can be found [here](https://github.com/oriram/splinter). +## Resources + +- [Question answering task guide](../tasks/question-answering) ## SplinterConfig diff --git a/docs/source/en/model_doc/squeezebert.mdx b/docs/source/en/model_doc/squeezebert.md similarity index 88% rename from docs/source/en/model_doc/squeezebert.mdx rename to docs/source/en/model_doc/squeezebert.md index c6219582c838e3..e2bb378fe5bb09 100644 --- a/docs/source/en/model_doc/squeezebert.mdx +++ b/docs/source/en/model_doc/squeezebert.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # SqueezeBERT @@ -34,7 +38,9 @@ self-attention layers with grouped convolutions, and we use this technique in a SqueezeBERT, which runs 4.3x faster than BERT-base on the Pixel 3 while achieving competitive accuracy on the GLUE test set. 
The SqueezeBERT code will be released.* -Tips: +This model was contributed by [forresti](https://huggingface.co/forresti). + +## Usage tips - SqueezeBERT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than the left. @@ -44,8 +50,13 @@ Tips: - For best results when finetuning on sequence classification tasks, it is recommended to start with the *squeezebert/squeezebert-mnli-headless* checkpoint. -This model was contributed by [forresti](https://huggingface.co/forresti). +## Resources +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) ## SqueezeBertConfig diff --git a/docs/source/en/model_doc/stablelm.md b/docs/source/en/model_doc/stablelm.md new file mode 100644 index 00000000000000..90e634b2f7f474 --- /dev/null +++ b/docs/source/en/model_doc/stablelm.md @@ -0,0 +1,102 @@ + + +# StableLM + +## Overview + +`StableLM 3B 4E1T` was proposed in [`StableLM 3B 4E1T`: Technical Report](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Stability AI and is the first model in a series of multi-epoch pre-trained language models. + +### Model Details + +`StableLM 3B 4E1T` is a decoder-only base language model pre-trained on 1 trillion tokens of diverse English and code datasets for four epochs. +The model architecture is transformer-based with partial Rotary Position Embeddings, SwiGLU activation, LayerNorm, etc. + +We also provide `StableLM Zephyr 3B`, an instruction fine-tuned version of the model that can be used for chat-based applications. + +### Usage Tips + +- The architecture is similar to LLaMA but with RoPE applied to 25% of head embedding dimensions, LayerNorm instead of RMSNorm, and optional QKV bias terms. +- `StableLM 3B 4E1T`-based models uses the same tokenizer as [`GPTNeoXTokenizerFast`]. + +`StableLM 3B 4E1T` and `StableLM Zephyr 3B` can be found on the [Huggingface Hub](https://huggingface.co/stabilityai) + +The following code snippet demonstrates how to use `StableLM 3B 4E1T` for inference: + +```python +>>> from transformers import AutoModelForCausalLM, AutoTokenizer +>>> device = "cuda" # the device to load the model onto + +>>> tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-3b-4e1t") +>>> model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-3b-4e1t") +>>> model.to(device) + +>>> model_inputs = tokenizer("The weather is always wonderful in", return_tensors="pt").to(model.device) + +>>> generated_ids = model.generate(**model_inputs, max_length=32, do_sample=True) +>>> responses = tokenizer.batch_decode(generated_ids, skip_special_tokens=True) +>>> responses +['The weather is always wonderful in Santa Barbara and, for visitors hoping to make the move to our beautiful seaside city, this town offers plenty of great places to...'] +``` + +## Combining StableLM and Flash Attention 2 + +First, make sure to install the latest version of Flash Attention v2. + +```bash +pip install -U flash-attn --no-build-isolation +``` + +Also make sure that your hardware is compatible with Flash-Attention 2. 
Read more about it in the official documentation of the [`flash-attn`](https://github.com/Dao-AILab/flash-attention) repository. Note: you must load your model in half-precision (e.g. `torch.bfloat16`). + +Now, to run the model with Flash Attention 2, refer to the snippet below: + +```python +>>> import torch +>>> from transformers import AutoModelForCausalLM, AutoTokenizer +>>> device = "cuda" # the device to load the model onto + +>>> tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-3b-4e1t") +>>> model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-3b-4e1t", torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2") +>>> model.to(device) + +>>> model_inputs = tokenizer("The weather is always wonderful in", return_tensors="pt").to(model.device) + +>>> generated_ids = model.generate(**model_inputs, max_length=32, do_sample=True) +>>> responses = tokenizer.batch_decode(generated_ids, skip_special_tokens=True) +>>> responses +['The weather is always wonderful in Santa Barbara and, for visitors hoping to make the move to our beautiful seaside city, this town offers plenty of great places to...'] +``` + + +## StableLmConfig + +[[autodoc]] StableLmConfig + +## StableLmModel + +[[autodoc]] StableLmModel + - forward + +## StableLmForCausalLM + +[[autodoc]] StableLmForCausalLM + - forward + +## StableLmForSequenceClassification + +[[autodoc]] StableLmForSequenceClassification + - forward diff --git a/docs/source/en/model_doc/swiftformer.md b/docs/source/en/model_doc/swiftformer.md new file mode 100644 index 00000000000000..30c6941f0f46da --- /dev/null +++ b/docs/source/en/model_doc/swiftformer.md @@ -0,0 +1,44 @@ + + +# SwiftFormer + +## Overview + +The SwiftFormer model was proposed in [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446) by Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan. + +The SwiftFormer paper introduces a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations in the self-attention computation with linear element-wise multiplications. A series of models called 'SwiftFormer' is built based on this, which achieves state-of-the-art performance in terms of both accuracy and mobile inference speed. Even their small variant achieves 78.5% top-1 ImageNet1K accuracy with only 0.8 ms latency on iPhone 14, which is more accurate and 2× faster compared to MobileViT-v2. + +The abstract from the paper is the following: + +*Self-attention has become a defacto choice for capturing global context in various vision applications. However, its quadratic computational complexity with respect to image resolution limits its use in real-time applications, especially for deployment on resource-constrained mobile devices. Although hybrid approaches have been proposed to combine the advantages of convolutions and self-attention for a better speed-accuracy trade-off, the expensive matrix multiplication operations in self-attention remain a bottleneck. In this work, we introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications. Our design shows that the key-value interaction can be replaced with a linear layer without sacrificing any accuracy. 
Unlike previous state-of-the-art methods, our efficient formulation of self-attention enables its usage at all stages of the network. Using our proposed efficient additive attention, we build a series of models called "SwiftFormer" which achieves state-of-the-art performance in terms of both accuracy and mobile inference speed. Our small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on iPhone 14, which is more accurate and 2x faster compared to MobileViT-v2.* + +This model was contributed by [shehan97](https://huggingface.co/shehan97). +The original code can be found [here](https://github.com/Amshaker/SwiftFormer). + +## SwiftFormerConfig + +[[autodoc]] SwiftFormerConfig + +## SwiftFormerModel + +[[autodoc]] SwiftFormerModel + - forward + +## SwiftFormerForImageClassification + +[[autodoc]] SwiftFormerForImageClassification + - forward diff --git a/docs/source/en/model_doc/swin.mdx b/docs/source/en/model_doc/swin.md similarity index 93% rename from docs/source/en/model_doc/swin.mdx rename to docs/source/en/model_doc/swin.md index 1bb4fb88d8475a..e23c882a3f097a 100644 --- a/docs/source/en/model_doc/swin.mdx +++ b/docs/source/en/model_doc/swin.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Swin Transformer @@ -32,11 +36,6 @@ prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures.* -Tips: -- One can use the [`AutoImageProcessor`] API to prepare images for the model. -- Swin pads the inputs supporting any input height and width (if divisible by `32`). -- Swin can be used as a *backbone*. When `output_hidden_states = True`, it will output both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, width)` rather than `(batch_size, sequence_length, num_channels)`. - drawing @@ -44,6 +43,10 @@ alt="drawing" width="600"/> This model was contributed by [novice03](https://huggingface.co/novice03). The Tensorflow version of this model was contributed by [amyeroberts](https://huggingface.co/amyeroberts). The original code can be found [here](https://github.com/microsoft/Swin-Transformer). +## Usage tips + +- Swin pads the inputs supporting any input height and width (if divisible by `32`). +- Swin can be used as a *backbone*. When `output_hidden_states = True`, it will output both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, width)` rather than `(batch_size, sequence_length, num_channels)`. 
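As a minimal sketch of the backbone tip above (assuming the `microsoft/swin-tiny-patch4-window7-224` checkpoint, which is not prescribed by this page), each element of `reshaped_hidden_states` is a `(batch, num_channels, height, width)` feature map that a dense-prediction head can consume directly:

```python
>>> import torch
>>> import requests
>>> from PIL import Image
>>> from transformers import AutoImageProcessor, SwinModel

>>> image_processor = AutoImageProcessor.from_pretrained("microsoft/swin-tiny-patch4-window7-224")
>>> model = SwinModel.from_pretrained("microsoft/swin-tiny-patch4-window7-224")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> inputs = image_processor(images=image, return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs, output_hidden_states=True)

>>> # one (batch, num_channels, height, width) feature map per stage
>>> feature_maps = outputs.reshaped_hidden_states
>>> coarsest = feature_maps[-1]  # the lowest-resolution map, e.g. for a segmentation head
```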
## Resources @@ -52,6 +55,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`SwinForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). +- See also: [Image classification task guide](../tasks/image_classification) Besides that: @@ -63,6 +67,8 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] SwinConfig + + ## SwinModel @@ -79,6 +85,9 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] transformers.SwinForImageClassification - forward + + + ## TFSwinModel [[autodoc]] TFSwinModel @@ -93,3 +102,6 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] transformers.TFSwinForImageClassification - call + + + \ No newline at end of file diff --git a/docs/source/en/model_doc/swin2sr.mdx b/docs/source/en/model_doc/swin2sr.md similarity index 95% rename from docs/source/en/model_doc/swin2sr.mdx rename to docs/source/en/model_doc/swin2sr.md index edb073d1ee386f..dfee144e50c483 100644 --- a/docs/source/en/model_doc/swin2sr.mdx +++ b/docs/source/en/model_doc/swin2sr.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Swin2SR diff --git a/docs/source/en/model_doc/swinv2.mdx b/docs/source/en/model_doc/swinv2.md similarity index 93% rename from docs/source/en/model_doc/swinv2.mdx rename to docs/source/en/model_doc/swinv2.md index c4378583c48a16..25233dca339545 100644 --- a/docs/source/en/model_doc/swinv2.mdx +++ b/docs/source/en/model_doc/swinv2.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Swin Transformer V2 @@ -20,9 +24,6 @@ The abstract from the paper is the following: *Large-scale NLP models have been shown to significantly improve the performance on language tasks with no signs of saturation. They also demonstrate amazing few-shot capabilities like that of human beings. This paper aims to explore large-scale models in computer vision. We tackle three major issues in training and application of large vision models, including training instability, resolution gaps between pre-training and fine-tuning, and hunger on labelled data. 
Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) A log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) A self-supervised pre-training method, SimMIM, to reduce the needs of vast labeled images. Through these techniques, this paper successfully trained a 3 billion-parameter Swin Transformer V2 model, which is the largest dense vision model to date, and makes it capable of training with images of up to 1,536×1,536 resolution. It set new performance records on 4 representative vision tasks, including ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification. Also note our training is much more efficient than that in Google's billion-level visual models, which consumes 40 times less labelled data and 40 times less training time.* -Tips: -- One can use the [`AutoImageProcessor`] API to prepare images for the model. - This model was contributed by [nandwalritik](https://huggingface.co/nandwalritik). The original code can be found [here](https://github.com/microsoft/Swin-Transformer). @@ -33,6 +34,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`Swinv2ForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). +- See also: [Image classification task guide](../tasks/image_classification) Besides that: diff --git a/docs/source/en/model_doc/switch_transformers.mdx b/docs/source/en/model_doc/switch_transformers.md similarity index 79% rename from docs/source/en/model_doc/switch_transformers.mdx rename to docs/source/en/model_doc/switch_transformers.md index 348c831a0e9850..ca6748167f5e01 100644 --- a/docs/source/en/model_doc/switch_transformers.mdx +++ b/docs/source/en/model_doc/switch_transformers.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # SwitchTransformers @@ -16,22 +20,25 @@ specific language governing permissions and limitations under the License. The SwitchTransformers model was proposed in [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961) by William Fedus, Barret Zoph, Noam Shazeer. -The Switch Transformer model uses a sparse T5 encoder-decoder architecure, where the MLP are replaced by a Mixture of Experts (MoE). A routing mechanism (top 1 in this case) associates each token to one of the expert, where each expert is a dense MLP. While switch transformers have a lot more weights than their equivalent dense models, the sparsity allows better scaling and better finetuning performance at scale. -During a forward pass, only a fraction of the weights are used. 
The routing mecanism allows the model to select relevant weights on the fly which increases the model capacity without increasing the number of operations. - +The Switch Transformer model uses a sparse T5 encoder-decoder architecture, where the MLP are replaced by a Mixture of Experts (MoE). A routing mechanism (top 1 in this case) associates each token to one of the expert, where each expert is a dense MLP. While switch transformers have a lot more weights than their equivalent dense models, the sparsity allows better scaling and better finetuning performance at scale. +During a forward pass, only a fraction of the weights are used. The routing mechanism allows the model to select relevant weights on the fly which increases the model capacity without increasing the number of operations. The abstract from the paper is the following: *In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the T5-XXL model.* -Tips: +This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) and [Arthur Zucker](https://huggingface.co/ArthurZ). +The original code can be found [here](https://github.com/google/flaxformer/tree/main/flaxformer/architectures/moe). + +## Usage tips - SwitchTransformers uses the [`T5Tokenizer`], which can be loaded directly from each model's repository. - The released weights are pretrained on English [Masked Language Modeling](https://moon-ci-docs.huggingface.co/docs/transformers/pr_19323/en/glossary#general-terms) task, and should be finetuned. -This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) and [Arthur Zucker](https://huggingface.co/ArtZucker) . -The original code can be found [here](https://github.com/google/flaxformer/tree/main/flaxformer/architectures/moe). 
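A minimal sketch of the span-filling setup described in the tips, assuming the `google/switch-base-8` checkpoint (not prescribed here): the tokenizer is the regular T5 one, so `<extra_id_n>` sentinel tokens mark the spans the model should fill in.

```python
>>> from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

>>> tokenizer = AutoTokenizer.from_pretrained("google/switch-base-8")
>>> model = SwitchTransformersForConditionalGeneration.from_pretrained("google/switch-base-8")

>>> # span-corruption style input: the model is asked to fill in the sentinel token
>>> input_ids = tokenizer("A <extra_id_0> walks into a bar.", return_tensors="pt").input_ids
>>> outputs = model.generate(input_ids, max_new_tokens=20)
>>> filled = tokenizer.decode(outputs[0], skip_special_tokens=False)
```

Since the router activates only one expert per MLP layer for each token, the cost of such a forward pass stays close to that of a comparable dense T5 model even though the total parameter count is much larger.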
+## Resources +- [Translation task guide](../tasks/translation) +- [Summarization task guide](../tasks/summarization) ## SwitchTransformersConfig diff --git a/docs/source/en/model_doc/t5.mdx b/docs/source/en/model_doc/t5.md similarity index 84% rename from docs/source/en/model_doc/t5.mdx rename to docs/source/en/model_doc/t5.md index 472d10be236a02..70e80c459f082b 100644 --- a/docs/source/en/model_doc/t5.mdx +++ b/docs/source/en/model_doc/t5.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # T5 @@ -19,12 +23,15 @@ specific language governing permissions and limitations under the License. Spaces + +Paper page + ## Overview -The T5 model was presented in [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf) by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, -Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. +The T5 model was presented in [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf) by [Colin Raffel](https://huggingface.co/craffel), Noam Shazeer, [Adam Roberts](https://huggingface.co/adarob), Katherine Lee, Sharan Narang, +Michael Matena, Yanqi Zhou, Wei Li, [Peter J. Liu](https://huggingface.co/peterjliu). The abstract from the paper is the following: @@ -38,7 +45,11 @@ with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the- summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.* -Tips: +All checkpoints can be found on the [hub](https://huggingface.co/models?search=t5). + +This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/google-research/text-to-text-transfer-transformer). + +## Usage tips - T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks and for which each task is converted into a text-to-text format. T5 works well on a variety of tasks out-of-the-box by prepending a @@ -49,19 +60,19 @@ for summarization: *summarize: ...*. - T5 uses relative scalar embeddings. Encoder input padding can be done on the left and on the right. -- See the [training](#training), [inference](#inference) and [scripts](#scripts) sections below for all details regarding usage. +- See the [training](#training), [inference](#inference) and [resources](#resources) sections below for all details regarding usage. 
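For instance, a minimal summarization sketch (a hedged illustration, reusing the `google-t5/t5-small` checkpoint from the examples further down) differs from the translation examples only in the *summarize:* prefix:

```python
>>> from transformers import T5Tokenizer, T5ForConditionalGeneration

>>> tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-small")
>>> model = T5ForConditionalGeneration.from_pretrained("google-t5/t5-small")

>>> text = (
...     "summarize: studies have shown that owning a dog is good for you. "
...     "Dog owners tend to be more active and report lower stress levels than people without pets."
... )
>>> input_ids = tokenizer(text, return_tensors="pt").input_ids
>>> outputs = model.generate(input_ids, max_new_tokens=30)
>>> summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
```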
T5 comes in different sizes: -- [t5-small](https://huggingface.co/t5-small) +- [google-t5/t5-small](https://huggingface.co/google-t5/t5-small) -- [t5-base](https://huggingface.co/t5-base) +- [google-t5/t5-base](https://huggingface.co/google-t5/t5-base) -- [t5-large](https://huggingface.co/t5-large) +- [google-t5/t5-large](https://huggingface.co/google-t5/t5-large) -- [t5-3b](https://huggingface.co/t5-3b) +- [google-t5/t5-3b](https://huggingface.co/google-t5/t5-3b) -- [t5-11b](https://huggingface.co/t5-11b). +- [google-t5/t5-11b](https://huggingface.co/google-t5/t5-11b). Based on the original T5 model, Google has released some follow-up works: @@ -74,11 +85,15 @@ Based on the original T5 model, Google has released some follow-up works: - **byT5**: byT5 is a T5 model pre-trained on byte sequences rather than SentencePiece subword token sequences. Refer to the documentation of byT5 which can be found [here](byt5). -All checkpoints can be found on the [hub](https://huggingface.co/models?search=t5). +- **UL2**: UL2 is a T5 like model pretrained on various denoising objectives -This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/google-research/text-to-text-transfer-transformer). +- **Flan-T5**: Flan is a pretraining methods that is based on prompting. The Flan-T5 are T5 models trained on the Flan collection of + datasets which include: `taskmaster2`, `djaym7/wiki_dialog`, `deepmind/code_contests`, `lambada`, `gsm8k`, `aqua_rat`, `esnli`, `quasc` and `qed`. + +- **FLan-UL2** : the UL2 model finetuned using the "Flan" prompt tuning and dataset collection. - +- **UMT5**: UmT5 is a multilingual T5 model trained on an improved and refreshed mC4 multilingual corpus, 29 trillion characters across 107 language, using a new sampling method, UniMax. Refer to + the documentation of mT5 which can be found [here](umt5). ## Training @@ -106,8 +121,8 @@ processed as follows: ```python >>> from transformers import T5Tokenizer, T5ForConditionalGeneration ->>> tokenizer = T5Tokenizer.from_pretrained("t5-small") ->>> model = T5ForConditionalGeneration.from_pretrained("t5-small") +>>> tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-small") +>>> model = T5ForConditionalGeneration.from_pretrained("google-t5/t5-small") >>> input_ids = tokenizer("The walks in park", return_tensors="pt").input_ids >>> labels = tokenizer(" cute dog the ", return_tensors="pt").input_ids @@ -131,8 +146,8 @@ the model as follows: ```python >>> from transformers import T5Tokenizer, T5ForConditionalGeneration ->>> tokenizer = T5Tokenizer.from_pretrained("t5-small") ->>> model = T5ForConditionalGeneration.from_pretrained("t5-small") +>>> tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-small") +>>> model = T5ForConditionalGeneration.from_pretrained("google-t5/t5-small") >>> input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids >>> labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt").input_ids @@ -168,8 +183,8 @@ ignored. The code example below illustrates all of this. 
>>> from transformers import T5Tokenizer, T5ForConditionalGeneration >>> import torch ->>> tokenizer = T5Tokenizer.from_pretrained("t5-small") ->>> model = T5ForConditionalGeneration.from_pretrained("t5-small") +>>> tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-small") +>>> model = T5ForConditionalGeneration.from_pretrained("google-t5/t5-small") >>> # the following 2 hyperparameters are task-specific >>> max_source_length = 512 @@ -232,8 +247,6 @@ batches to the longest example is not recommended on TPU as it triggers a recomp encountered during training thus significantly slowing down the training. only padding up to the longest example in a batch) leads to very slow training on TPU. - - ## Inference At inference time, it is recommended to use [`~generation.GenerationMixin.generate`]. This @@ -245,8 +258,8 @@ generation works in general in encoder-decoder models. ```python >>> from transformers import T5Tokenizer, T5ForConditionalGeneration ->>> tokenizer = T5Tokenizer.from_pretrained("t5-small") ->>> model = T5ForConditionalGeneration.from_pretrained("t5-small") +>>> tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-small") +>>> model = T5ForConditionalGeneration.from_pretrained("google-t5/t5-small") >>> input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids >>> outputs = model.generate(input_ids) @@ -262,8 +275,8 @@ The example above only shows a single example. You can also do batched inference ```python >>> from transformers import T5Tokenizer, T5ForConditionalGeneration ->>> tokenizer = T5Tokenizer.from_pretrained("t5-small") ->>> model = T5ForConditionalGeneration.from_pretrained("t5-small") +>>> tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-small") +>>> model = T5ForConditionalGeneration.from_pretrained("google-t5/t5-small") >>> task_prefix = "translate English to German: " >>> # use different length sentences to test batching @@ -288,8 +301,8 @@ The predicted tokens will then be placed between the sentinel tokens. ```python >>> from transformers import T5Tokenizer, T5ForConditionalGeneration ->>> tokenizer = T5Tokenizer.from_pretrained("t5-small") ->>> model = T5ForConditionalGeneration.from_pretrained("t5-small") +>>> tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-small") +>>> model = T5ForConditionalGeneration.from_pretrained("google-t5/t5-small") >>> input_ids = tokenizer("The walks in park", return_tensors="pt").input_ids @@ -299,12 +312,9 @@ The predicted tokens will then be placed between the sentinel tokens. [' park offers the park.'] ``` - - - ## Performance -If you'd like a faster training and inference performance, install [apex](https://github.com/NVIDIA/apex#quick-start) and then the model will automatically use `apex.normalization.FusedRMSNorm` instead of `T5LayerNorm`. The former uses an optimized fused kernel which is several times faster than the latter. +If you'd like a faster training and inference performance, install [NVIDIA APEX](https://github.com/NVIDIA/apex#quick-start) for NVIDIA GPUs, or [ROCm APEX](https://github.com/ROCmSoftwarePlatform/apex) for AMD GPUs and then the model will automatically use `apex.normalization.FusedRMSNorm` instead of `T5LayerNorm`. The former uses an optimized fused kernel which is several times faster than the latter. 
## Resources @@ -329,10 +339,11 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - A notebook to [Finetune T5-base-dutch to perform Dutch abstractive summarization on a TPU](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/T5/Fine_tuning_Dutch_T5_base_on_CNN_Daily_Mail_for_summarization_(on_TPU_using_HuggingFace_Accelerate).ipynb). - A notebook for how to [finetune T5 for summarization in PyTorch and track experiments with WandB](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb#scrollTo=OKRpFvYhBauC). 🌎 - A blog post on [Distributed Training: Train BART/T5 for Summarization using 🤗 Transformers and Amazon SageMaker](https://huggingface.co/blog/sagemaker-distributed-training-seq2seq). -- [`T5ForConditionalGeneration`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization) and [noteboook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb). +- [`T5ForConditionalGeneration`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb). - [`TFT5ForConditionalGeneration`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/summarization) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization-tf.ipynb). - [`FlaxT5ForConditionalGeneration`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/summarization). - [Summarization](https://huggingface.co/course/chapter7/5?fw=pt#summarization) chapter of the 🤗 Hugging Face course. +- [Summarization task guide](../tasks/summarization) @@ -342,6 +353,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`T5ForConditionalGeneration`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/translation) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb). - [`TFT5ForConditionalGeneration`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/translation) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation-tf.ipynb). 
+- [Translation task guide](../tasks/translation) @@ -367,6 +379,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] T5TokenizerFast + + + ## T5Model [[autodoc]] T5Model @@ -382,6 +397,24 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] T5EncoderModel - forward +## T5ForSequenceClassification + +[[autodoc]] T5ForSequenceClassification + - forward + +## T5ForTokenClassification + +[[autodoc]] T5ForTokenClassification + - forward + +## T5ForQuestionAnswering + +[[autodoc]] T5ForQuestionAnswering + - forward + + + + ## TFT5Model [[autodoc]] TFT5Model @@ -397,6 +430,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] TFT5EncoderModel - call + + + ## FlaxT5Model [[autodoc]] FlaxT5Model @@ -415,3 +451,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] FlaxT5EncoderModel - __call__ + + + diff --git a/docs/source/en/model_doc/t5v1.1.mdx b/docs/source/en/model_doc/t5v1.1.md similarity index 90% rename from docs/source/en/model_doc/t5v1.1.mdx rename to docs/source/en/model_doc/t5v1.1.md index a5b64f77dc7c2f..e18696f629df51 100644 --- a/docs/source/en/model_doc/t5v1.1.mdx +++ b/docs/source/en/model_doc/t5v1.1.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # T5v1.1 @@ -16,6 +20,10 @@ specific language governing permissions and limitations under the License. T5v1.1 was released in the [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) repository by Colin Raffel et al. It's an improved version of the original T5 model. +This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be +found [here](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511). + +## Usage tips One can directly plug in the weights of T5v1.1 into a T5 model, like so: @@ -55,7 +63,9 @@ Google has released the following variants: - [google/t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl). -One can refer to [T5's documentation page](t5) for all tips, code examples and notebooks. -This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be -found [here](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511). + + +Refer to [T5's documentation page](t5) for all API reference, tips, code examples and notebooks. 
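The plug-in usage mentioned under the tips above is a one-liner; a minimal sketch, here assuming the `google/t5-v1_1-base` variant:

```python
>>> from transformers import AutoTokenizer, T5ForConditionalGeneration

>>> tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-base")
>>> model = T5ForConditionalGeneration.from_pretrained("google/t5-v1_1-base")
```

Note that T5v1.1 was pretrained on C4 only, without mixing in the supervised tasks, so the checkpoints have to be fine-tuned before they are usable on a downstream task.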
+ + \ No newline at end of file diff --git a/docs/source/en/model_doc/table-transformer.mdx b/docs/source/en/model_doc/table-transformer.md similarity index 87% rename from docs/source/en/model_doc/table-transformer.mdx rename to docs/source/en/model_doc/table-transformer.md index 862f4124c25f24..850e7f50aa610f 100644 --- a/docs/source/en/model_doc/table-transformer.mdx +++ b/docs/source/en/model_doc/table-transformer.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Table Transformer @@ -29,16 +33,15 @@ significant increase in training performance and a more reliable estimate of mod object detection models trained on PubTables-1M produce excellent results for all three tasks of detection, structure recognition, and functional analysis without the need for any special customization for these tasks.* -Tips: - -- The authors released 2 models, one for [table detection](https://huggingface.co/microsoft/table-transformer-detection) in documents, one for [table structure recognition](https://huggingface.co/microsoft/table-transformer-structure-recognition) (the task of recognizing the individual rows, columns etc. in a table). -- One can use the [`AutoImageProcessor`] API to prepare images and optional targets for the model. This will load a [`DetrImageProcessor`] behind the scenes. - drawing Table detection and table structure recognition clarified. Taken from the original paper. +The authors released 2 models, one for [table detection](https://huggingface.co/microsoft/table-transformer-detection) in +documents, one for [table structure recognition](https://huggingface.co/microsoft/table-transformer-structure-recognition) +(the task of recognizing the individual rows, columns etc. in a table). + This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/table-transformer). @@ -47,7 +50,7 @@ found [here](https://github.com/microsoft/table-transformer). - A demo notebook for the Table Transformer can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Table%20Transformer). -- It turns out padding of images is quite important for detection. An interesting Github thread with replies from the authors can be found [here](https://github.com/microsoft/table-transformer/issues/68). +- It turns out padding of images is quite important for detection. An interesting Github thread with replies from the authors can be found [here](https://github.com/microsoft/table-transformer/issues/68). 
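A minimal detection sketch, assuming the `microsoft/table-transformer-detection` checkpoint linked above and a document page stored locally as `document_page.png` (a hypothetical file name). The image processor loaded behind the scenes is the DETR-style one, so its `post_process_object_detection` helper converts the raw outputs into thresholded boxes in image coordinates:

```python
>>> import torch
>>> from PIL import Image
>>> from transformers import AutoImageProcessor, TableTransformerForObjectDetection

>>> image_processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
>>> model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

>>> image = Image.open("document_page.png").convert("RGB")  # hypothetical rendered page

>>> inputs = image_processor(images=image, return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # convert the raw logits/boxes into thresholded detections in image coordinates
>>> target_sizes = torch.tensor([image.size[::-1]])
>>> results = image_processor.post_process_object_detection(outputs, threshold=0.9, target_sizes=target_sizes)[0]
>>> detections = [
...     (model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
...     for score, label, box in zip(results["scores"], results["labels"], results["boxes"])
... ]
```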
## TableTransformerConfig diff --git a/docs/source/en/model_doc/tapas.mdx b/docs/source/en/model_doc/tapas.md similarity index 98% rename from docs/source/en/model_doc/tapas.mdx rename to docs/source/en/model_doc/tapas.md index 5a2b54e8c32c7c..79bbe3e819cf86 100644 --- a/docs/source/en/model_doc/tapas.mdx +++ b/docs/source/en/model_doc/tapas.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # TAPAS @@ -40,10 +44,10 @@ alt="drawing" width="600"/> This model was contributed by [nielsr](https://huggingface.co/nielsr). The Tensorflow version of this model was contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/google-research/tapas). -Tips: +## Usage tips - TAPAS is a model that uses relative position embeddings by default (restarting the position embeddings at every cell of the table). Note that this is something that was added after the publication of the original TAPAS paper. According to the authors, this usually results in a slightly better performance, and allows you to encode longer sequences without running out of embeddings. This is reflected in the `reset_position_index_per_cell` parameter of [`TapasConfig`], which is set to `True` by default. The default versions of the models available on the [hub](https://huggingface.co/models?search=tapas) all use relative position embeddings. You can still use the ones with absolute position embeddings by passing in an additional argument `revision="no_reset"` when calling the `from_pretrained()` method. Note that it's usually advised to pad the inputs on the right rather than the left. -- TAPAS is based on BERT, so `TAPAS-base` for example corresponds to a `BERT-base` architecture. Of course, `TAPAS-large` will result in the best performance (the results reported in the paper are from `TAPAS-large`). Results of the various sized models are shown on the [original Github repository](https://github.com/google-research/tapas>). +- TAPAS is based on BERT, so `TAPAS-base` for example corresponds to a `BERT-base` architecture. Of course, `TAPAS-large` will result in the best performance (the results reported in the paper are from `TAPAS-large`). Results of the various sized models are shown on the [original GitHub repository](https://github.com/google-research/tapas). - TAPAS has checkpoints fine-tuned on SQA, which are capable of answering questions related to a table in a conversational set-up. This means that you can ask follow-up questions such as "what is his age?" related to the previous question. Note that the forward pass of TAPAS is a bit different in case of a conversational set-up: in that case, you have to feed every table-question pair one by one to the model, such that the `prev_labels` token type ids can be overwritten by the predicted `labels` of the model to the previous question. See "Usage" section for more info. - TAPAS is similar to BERT and therefore relies on the masked language modeling (MLM) objective. 
It is therefore efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained with a causal language modeling (CLM) objective are better in that regard. Note that TAPAS can be used as an encoder in the EncoderDecoderModel framework, to combine it with an autoregressive text decoder such as GPT-2. @@ -569,6 +573,11 @@ Predicted answer: SUM > 87, 53, 69 In case of a conversational set-up, then each table-question pair must be provided **sequentially** to the model, such that the `prev_labels` token types can be overwritten by the predicted `labels` of the previous table-question pair. Again, more info can be found in [this notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) (for PyTorch) and [this notebook](https://github.com/kamalkraj/Tapas-Tutorial/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) (for TensorFlow). +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Masked language modeling task guide](../tasks/masked_language_modeling) + ## TAPAS specific outputs [[autodoc]] models.tapas.modeling_tapas.TableQuestionAnsweringOutput @@ -581,6 +590,9 @@ In case of a conversational set-up, then each table-question pair must be provid - convert_logits_to_predictions - save_vocabulary + + + ## TapasModel [[autodoc]] TapasModel - forward @@ -597,6 +609,9 @@ In case of a conversational set-up, then each table-question pair must be provid [[autodoc]] TapasForQuestionAnswering - forward + + + ## TFTapasModel [[autodoc]] TFTapasModel - call @@ -611,4 +626,9 @@ In case of a conversational set-up, then each table-question pair must be provid ## TFTapasForQuestionAnswering [[autodoc]] TFTapasForQuestionAnswering - - call \ No newline at end of file + - call + + + + + diff --git a/docs/source/en/model_doc/tapex.mdx b/docs/source/en/model_doc/tapex.md similarity index 90% rename from docs/source/en/model_doc/tapex.mdx rename to docs/source/en/model_doc/tapex.md index f6e65764e50d46..15ac2463fd851d 100644 --- a/docs/source/en/model_doc/tapex.mdx +++ b/docs/source/en/model_doc/tapex.md @@ -8,10 +8,23 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # TAPEX + + +This model is in maintenance mode only, we don't accept any new PRs changing its code. + +If you run into any issues running this model, please reinstall the last version that supported this model: v4.30.0. +You can do so by running the following command: `pip install -U transformers==4.30.0`. + + + ## Overview The TAPEX model was proposed in [TAPEX: Table Pre-training via Learning a Neural SQL Executor](https://arxiv.org/abs/2107.07653) by Qian Liu, @@ -36,7 +49,7 @@ on the weakly-supervised WikiSQL denotation accuracy to 89.5% (+2.3%), the WikiT to 74.5% (+3.5%), and the TabFact accuracy to 84.2% (+3.2%). 
To our knowledge, this is the first work to exploit table pre-training via synthetic executable programs and to achieve new state-of-the-art results on various downstream tasks.* -Tips: +## Usage tips - TAPEX is a generative (seq2seq) model. One can directly plug in the weights of TAPEX into a BART model. - TAPEX has checkpoints on the hub that are either pre-trained only, or fine-tuned on WTQ, SQA, WikiSQL and TabFact. @@ -45,7 +58,7 @@ Tips: - TAPEX has its own tokenizer, that allows to prepare all data for the model easily. One can pass Pandas DataFrames and strings to the tokenizer, and it will automatically create the `input_ids` and `attention_mask` (as shown in the usage examples below). -## Usage: inference +### Usage: inference Below, we illustrate how to use TAPEX for table question answering. As one can see, one can directly plug in the weights of TAPEX into a BART model. We use the [Auto API](auto), which will automatically instantiate the appropriate tokenizer ([`TapexTokenizer`]) and model ([`BartForConditionalGeneration`]) for us, @@ -122,6 +135,12 @@ benchmark for table fact checking (it achieves 84% accuracy). The code example b Refused ``` + + +TAPEX architecture is the same as BART, except for tokenization. Refer to [BART documentation](bart) for information on +configuration classes and their parameters. TAPEX-specific tokenizer is documented below. + + ## TapexTokenizer diff --git a/docs/source/en/model_doc/time_series_transformer.mdx b/docs/source/en/model_doc/time_series_transformer.md similarity index 86% rename from docs/source/en/model_doc/time_series_transformer.mdx rename to docs/source/en/model_doc/time_series_transformer.md index 3bd67f985f5dcc..c5bfcfc15ea2a2 100644 --- a/docs/source/en/model_doc/time_series_transformer.mdx +++ b/docs/source/en/model_doc/time_series_transformer.md @@ -8,22 +8,20 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. ---> - -# Time Series Transformer - +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. -This is a recently introduced model so the API hasn't been tested extensively. There may be some bugs or slight -breaking changes to fix it in the future. If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title). +--> - +# Time Series Transformer ## Overview The Time Series Transformer model is a vanilla encoder-decoder Transformer for time series forecasting. +This model was contributed by [kashif](https://huggingface.co/kashif). -Tips: +## Usage tips - Similar to other models in the library, [`TimeSeriesTransformerModel`] is the raw Transformer without any head on top, and [`TimeSeriesTransformerForPrediction`] adds a distribution head on top of the former, which can be used for time-series forecasting. Note that this is a so-called probabilistic forecasting model, not a @@ -52,21 +50,22 @@ of the context as initial input for the decoder). - At inference time, we give the final value of the `past_values` as input to the decoder. 
Next, we can sample from the model to make a prediction at the next time step, which is then fed to the decoder in order to make the next prediction (also called autoregressive generation). +## Resources -This model was contributed by [kashif](https://huggingface.co/kashif). +A list of official Hugging Face and community (indicated by 🌎) resources to help you get started. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. + +- Check out the Time Series Transformer blog-post in HuggingFace blog: [Probabilistic Time Series Forecasting with 🤗 Transformers](https://huggingface.co/blog/time-series-transformers) ## TimeSeriesTransformerConfig [[autodoc]] TimeSeriesTransformerConfig - ## TimeSeriesTransformerModel [[autodoc]] TimeSeriesTransformerModel - forward - ## TimeSeriesTransformerForPrediction [[autodoc]] TimeSeriesTransformerForPrediction diff --git a/docs/source/en/model_doc/timesformer.mdx b/docs/source/en/model_doc/timesformer.md similarity index 84% rename from docs/source/en/model_doc/timesformer.mdx rename to docs/source/en/model_doc/timesformer.md index 602ec4f4f2a74b..fe75bee5b2897e 100644 --- a/docs/source/en/model_doc/timesformer.mdx +++ b/docs/source/en/model_doc/timesformer.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # TimeSformer @@ -21,13 +25,17 @@ The abstract from the paper is the following: *We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that "divided attention," where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered. Despite the radically new design, TimeSformer achieves state-of-the-art results on several action recognition benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600. Finally, compared to 3D convolutional networks, our model is faster to train, it can achieve dramatically higher test efficiency (at a small drop in accuracy), and it can also be applied to much longer video clips (over one minute long). Code and models are available at: [this https URL](https://github.com/facebookresearch/TimeSformer).* -Tips: - -There are many pretrained variants. Select your pretrained model based on the dataset it is trained on. Moreover, the number of input frames per clip changes based on the model size so you should consider this parameter while selecting your pretrained model. - This model was contributed by [fcakyon](https://huggingface.co/fcakyon). The original code can be found [here](https://github.com/facebookresearch/TimeSformer). 
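For a quick start, the sketch below runs video classification with a TimeSformer checkpoint. It is a minimal example under a few assumptions: the `facebook/timesformer-base-finetuned-k400` checkpoint (8 input frames, Kinetics-400 labels), a compatible image processor borrowed from VideoMAE, and a random clip standing in for real video frames.

```python
import numpy as np
import torch
from transformers import AutoImageProcessor, TimesformerForVideoClassification

# eight random frames of shape (3, 224, 224) stand in for a real clip
video = list(np.random.randint(0, 256, (8, 3, 224, 224)))

# the VideoMAE image processor applies compatible preprocessing; swap in your own if preferred
image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
model = TimesformerForVideoClassification.from_pretrained("facebook/timesformer-base-finetuned-k400")

inputs = image_processor(video, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
```

The number of frames expected by the model depends on the checkpoint, so adjust the clip length accordingly when selecting a different pretrained variant.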
+## Usage tips + +There are many pretrained variants. Select your pretrained model based on the dataset it is trained on. Moreover, +the number of input frames per clip changes based on the model size so you should consider this parameter while selecting your pretrained model. + +## Resources + +- [Video classification task guide](../tasks/video_classification) ## TimesformerConfig diff --git a/docs/source/en/model_doc/trajectory_transformer.mdx b/docs/source/en/model_doc/trajectory_transformer.md similarity index 84% rename from docs/source/en/model_doc/trajectory_transformer.mdx rename to docs/source/en/model_doc/trajectory_transformer.md index da7a55a50eca3e..45616255871a08 100644 --- a/docs/source/en/model_doc/trajectory_transformer.mdx +++ b/docs/source/en/model_doc/trajectory_transformer.md @@ -8,10 +8,23 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Trajectory Transformer + + +This model is in maintenance mode only, so we won't accept any new PRs changing its code. + +If you run into any issues running this model, please reinstall the last version that supported this model: v4.30.0. +You can do so by running the following command: `pip install -U transformers==4.30.0`. + + + ## Overview The Trajectory Transformer model was proposed in [Offline Reinforcement Learning as One Big Sequence Modeling Problem](https://arxiv.org/abs/2106.02039) by Michael Janner, Qiyang Li, Sergey Levine. @@ -30,19 +43,18 @@ in offline RL algorithms. We demonstrate the flexibility of this approach across imitation learning, goal-conditioned RL, and offline RL. Further, we show that this approach can be combined with existing model-free algorithms to yield a state-of-the-art planner in sparse-reward, long-horizon tasks.* -Tips: +This model was contributed by [CarlCochet](https://huggingface.co/CarlCochet). The original code can be found [here](https://github.com/jannerm/trajectory-transformer). + +## Usage tips This Transformer is used for deep reinforcement learning. To use it, you need to create sequences from actions, states and rewards from all previous timesteps. This model will treat all these elements together as one big sequence (a trajectory). -This model was contributed by [CarlCochet](https://huggingface.co/CarlCochet). The original code can be found [here](https://github.com/jannerm/trajectory-transformer). 
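To make the trajectory format concrete, here is a minimal sketch of a forward pass. It assumes the `CarlCochet/trajectory-transformer-halfcheetah-medium-v2` checkpoint and uses random discretized tokens in place of real (state, action, reward) sequences; the observation and action sizes below correspond to HalfCheetah and are purely illustrative.

```python
import numpy as np
import torch
from transformers import TrajectoryTransformerModel

model = TrajectoryTransformerModel.from_pretrained(
    "CarlCochet/trajectory-transformer-halfcheetah-medium-v2"
)
model.eval()

observation_dim, action_dim, batch_size = 17, 6, 4
seq_length = observation_dim + action_dim + 1  # one timestep = states + actions + reward

# random discretized tokens stand in for real flattened trajectories
trajectories = torch.LongTensor([np.random.permutation(seq_length) for _ in range(batch_size)])

with torch.no_grad():
    outputs = model(trajectories, use_cache=True)

print(outputs.logits.shape)
```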
- ## TrajectoryTransformerConfig [[autodoc]] TrajectoryTransformerConfig - ## TrajectoryTransformerModel [[autodoc]] TrajectoryTransformerModel diff --git a/docs/source/en/model_doc/transfo-xl.mdx b/docs/source/en/model_doc/transfo-xl.md similarity index 71% rename from docs/source/en/model_doc/transfo-xl.mdx rename to docs/source/en/model_doc/transfo-xl.md index 34b8cc8e9f2ef9..c80d9352b5aef6 100644 --- a/docs/source/en/model_doc/transfo-xl.mdx +++ b/docs/source/en/model_doc/transfo-xl.md @@ -8,10 +8,43 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Transformer XL + + +This model is in maintenance mode only, so we won't accept any new PRs changing its code. This model was deprecated due to security issues linked to `pickle.load`. + +We recommend switching to more recent models for improved security. + +In case you would still like to use `TransfoXL` in your experiments, we recommend using the [Hub checkpoint](https://huggingface.co/transfo-xl/transfo-xl-wt103) with a specific revision to ensure you are downloading safe files from the Hub. + +You will need to set the environment variable `TRUST_REMOTE_CODE` to `True` in order to allow the +usage of `pickle.load()`: + +```python +import os +from transformers import TransfoXLTokenizer, TransfoXLLMHeadModel + +os.environ["TRUST_REMOTE_CODE"] = "True" + +checkpoint = 'transfo-xl/transfo-xl-wt103' +revision = '40a186da79458c9f9de846edfaea79c412137f97' + +tokenizer = TransfoXLTokenizer.from_pretrained(checkpoint, revision=revision) +model = TransfoXLLMHeadModel.from_pretrained(checkpoint, revision=revision) +``` + +If you run into any issues running this model, please reinstall the last version that supported this model: v4.35.0. +You can do so by running the following command: `pip install -U transformers==4.35.0`. + + +
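Assuming the tokenizer and model have been loaded with the pinned revision as shown above, generation then goes through the standard `generate()` API. The prompt string and generation settings below are arbitrary examples, not part of the original recommendation.

```python
# continues from the snippet above: `tokenizer` and `model` are already loaded
inputs = tokenizer("The Sydney Opera House is located in", return_tensors="pt")

# greedy decoding of a few extra tokens; the settings are illustrative
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0]))
```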
Models @@ -41,7 +74,9 @@ bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens.* -Tips: +This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/kimiyoung/transformer-xl). + +## Usage tips - Transformer-XL uses relative sinusoidal positional embeddings. Padding can be done on the left or on the right. The original implementation trains on SQuAD with padding on the left, therefore the padding defaults are set to left. @@ -50,7 +85,6 @@ Tips: - Basically, the hidden states of the previous segment are concatenated to the current input to compute the attention scores. This allows the model to pay attention to information that was in the previous segment as well as the current one. By stacking multiple attention layers, the receptive field can be increased to multiple previous segments. - This changes the positional embeddings to positional relative embeddings (as the regular positional embeddings would give the same results in the current input and the current hidden state at a given position) and needs to make some adjustments in the way attention scores are computed. -This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/kimiyoung/transformer-xl). @@ -58,6 +92,10 @@ TransformerXL does **not** work with *torch.nn.DataParallel* due to a bug in PyT +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Causal language modeling task guide](../tasks/language_modeling) ## TransfoXLConfig @@ -70,13 +108,16 @@ TransformerXL does **not** work with *torch.nn.DataParallel* due to a bug in PyT ## TransfoXL specific outputs -[[autodoc]] models.transfo_xl.modeling_transfo_xl.TransfoXLModelOutput +[[autodoc]] models.deprecated.transfo_xl.modeling_transfo_xl.TransfoXLModelOutput + +[[autodoc]] models.deprecated.transfo_xl.modeling_transfo_xl.TransfoXLLMHeadModelOutput -[[autodoc]] models.transfo_xl.modeling_transfo_xl.TransfoXLLMHeadModelOutput +[[autodoc]] models.deprecated.transfo_xl.modeling_tf_transfo_xl.TFTransfoXLModelOutput -[[autodoc]] models.transfo_xl.modeling_tf_transfo_xl.TFTransfoXLModelOutput +[[autodoc]] models.deprecated.transfo_xl.modeling_tf_transfo_xl.TFTransfoXLLMHeadModelOutput -[[autodoc]] models.transfo_xl.modeling_tf_transfo_xl.TFTransfoXLLMHeadModelOutput + + ## TransfoXLModel @@ -93,6 +134,9 @@ TransformerXL does **not** work with *torch.nn.DataParallel* due to a bug in PyT [[autodoc]] TransfoXLForSequenceClassification - forward + + + ## TFTransfoXLModel [[autodoc]] TFTransfoXLModel @@ -108,6 +152,9 @@ TransformerXL does **not** work with *torch.nn.DataParallel* due to a bug in PyT [[autodoc]] TFTransfoXLForSequenceClassification - call + + + ## Internal Layers [[autodoc]] AdaptiveEmbedding diff --git a/docs/source/en/model_doc/trocr.mdx b/docs/source/en/model_doc/trocr.md similarity index 71% rename from docs/source/en/model_doc/trocr.mdx rename to docs/source/en/model_doc/trocr.md index 3e3a6c1007537e..c471a13bbd23c5 100644 --- a/docs/source/en/model_doc/trocr.mdx +++ b/docs/source/en/model_doc/trocr.md @@ -7,6 +7,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + specific language governing permissions and limitations under the License. --> # TrOCR @@ -39,7 +43,7 @@ Please refer to the [`VisionEncoderDecoder`] class on how to use this model. This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/microsoft/unilm/tree/6f60612e7cc86a2a1ae85c47231507a587ab4e01/trocr). -Tips: +## Usage tips - The quickest way to get started with TrOCR is by checking the [tutorial notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/TrOCR), which show how to use the model @@ -50,6 +54,27 @@ Tips: information, see the [official models](https://huggingface.co/models?other=trocr>). - TrOCR is always used within the [VisionEncoderDecoder](vision-encoder-decoder) framework. +## Resources + +A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with TrOCR. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. + + + +- A blog post on [Accelerating Document AI](https://huggingface.co/blog/document-ai) with TrOCR. +- A blog post on how to [Document AI](https://github.com/philschmid/document-ai-transformers) with TrOCR. +- A notebook on how to [finetune TrOCR on IAM Handwriting Database using Seq2SeqTrainer](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TrOCR/Fine_tune_TrOCR_on_IAM_Handwriting_Database_using_Seq2SeqTrainer.ipynb). +- A notebook on [inference with TrOCR](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TrOCR/Inference_with_TrOCR_%2B_Gradio_demo.ipynb) and Gradio demo. +- A notebook on [finetune TrOCR on the IAM Handwriting Database](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TrOCR/Fine_tune_TrOCR_on_IAM_Handwriting_Database_using_native_PyTorch.ipynb) using native PyTorch. +- A notebook on [evaluating TrOCR on the IAM test set](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TrOCR/Evaluating_TrOCR_base_handwritten_on_the_IAM_test_set.ipynb). + + + +- [Casual language modeling](https://huggingface.co/docs/transformers/tasks/language_modeling) task guide. + +⚡️ Inference + +- An interactive-demo on [TrOCR handwritten character recognition](https://huggingface.co/spaces/nielsr/TrOCR-handwritten). + ## Inference TrOCR's [`VisionEncoderDecoder`] model accepts images as input and makes use of diff --git a/docs/source/en/model_doc/tvlt.mdx b/docs/source/en/model_doc/tvlt.md similarity index 95% rename from docs/source/en/model_doc/tvlt.mdx rename to docs/source/en/model_doc/tvlt.md index 56bc37d024d24d..f09ea8af863c9a 100644 --- a/docs/source/en/model_doc/tvlt.mdx +++ b/docs/source/en/model_doc/tvlt.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # TVLT @@ -21,14 +25,6 @@ The abstract from the paper is the following: *In this work, we present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs for vision-and-language representation learning with minimal modality-specific design, and do not use text-specific modules such as tokenization or automatic speech recognition (ASR). TVLT is trained by reconstructing masked patches of continuous video frames and audio spectrograms (masked autoencoding) and contrastive modeling to align video and audio. TVLT attains performance comparable to its text-based counterpart on various multimodal tasks, such as visual question answering, image retrieval, video retrieval, and multimodal sentiment analysis, with 28x faster inference speed and only 1/3 of the parameters. Our findings suggest the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals without assuming the prior existence of text.* -Tips: - -- TVLT is a model that takes both `pixel_values` and `audio_values` as input. One can use [`TvltProcessor`] to prepare data for the model. - This processor wraps an image processor (for the image/video modality) and an audio feature extractor (for the audio modality) into one. -- TVLT is trained with images/videos and audios of various sizes: the authors resize and crop the input images/videos to 224 and limit the length of audio spectrogram to 2048. To make batching of videos and audios possible, the authors use a `pixel_mask` that indicates which pixels are real/padding and `audio_mask` that indicates which audio values are real/padding. -- The design of TVLT is very similar to that of a standard Vision Transformer (ViT) and masked autoencoder (MAE) as in [ViTMAE](vitmae). The difference is that the model includes embedding layers for the audio modality. -- The PyTorch version of this model is only available in torch 1.10 and higher. -

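To complement the usage tips below, here is a minimal sketch of a forward pass through [`TvltModel`], with random pixel and audio values standing in for a real video/audio pair. The `ZinengTang/tvlt-base` checkpoint and the 44.1 kHz sampling rate are assumptions chosen for illustration.

```python
import numpy as np
import torch
from transformers import TvltProcessor, TvltModel

# random frames and a random waveform stand in for real data
num_frames = 8
images = list(np.random.randn(num_frames, 3, 224, 224))
audio = list(np.random.randn(10000))

processor = TvltProcessor.from_pretrained("ZinengTang/tvlt-base")
model = TvltModel.from_pretrained("ZinengTang/tvlt-base")

# the processor bundles the image processor and the audio feature extractor into one call
inputs = processor(images, audio, sampling_rate=44100, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)
```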
drawing @@ -38,6 +34,14 @@ alt="drawing" width="600"/> The original code can be found [here](https://github.com/zinengtang/TVLT). This model was contributed by [Zineng Tang](https://huggingface.co/ZinengTang). +## Usage tips + +- TVLT is a model that takes both `pixel_values` and `audio_values` as input. One can use [`TvltProcessor`] to prepare data for the model. + This processor wraps an image processor (for the image/video modality) and an audio feature extractor (for the audio modality) into one. +- TVLT is trained with images/videos and audios of various sizes: the authors resize and crop the input images/videos to 224 and limit the length of audio spectrogram to 2048. To make batching of videos and audios possible, the authors use a `pixel_mask` that indicates which pixels are real/padding and `audio_mask` that indicates which audio values are real/padding. +- The design of TVLT is very similar to that of a standard Vision Transformer (ViT) and masked autoencoder (MAE) as in [ViTMAE](vitmae). The difference is that the model includes embedding layers for the audio modality. +- The PyTorch version of this model is only available in torch 1.10 and higher. + ## TvltConfig [[autodoc]] TvltConfig diff --git a/docs/source/en/model_doc/tvp.md b/docs/source/en/model_doc/tvp.md new file mode 100644 index 00000000000000..22b400a06c736a --- /dev/null +++ b/docs/source/en/model_doc/tvp.md @@ -0,0 +1,186 @@ + + +# TVP + +## Overview + +The text-visual prompting (TVP) framework was proposed in the paper [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding. + +The abstract from the paper is the following: + +*In this paper, we study the problem of temporal video grounding (TVG), which aims to predict the starting/ending time points of moments described by a text sentence within a long untrimmed video. Benefiting from fine-grained 3D visual features, the TVG techniques have achieved remarkable progress in recent years. However, the high complexity of 3D convolutional neural networks (CNNs) makes extracting dense 3D visual features time-consuming, which calls for intensive memory and computing resources. Towards efficient TVG, we propose a novel text-visual prompting (TVP) framework, which incorporates optimized perturbation patterns (that we call ‘prompts’) into both visual inputs and textual features of a TVG model. In sharp contrast to 3D CNNs, we show that TVP allows us to effectively co-train vision encoder and language encoder in a 2D TVG model and improves the performance of cross-modal feature fusion using only low-complexity sparse 2D visual features. Further, we propose a Temporal-Distance IoU (TDIoU) loss for efficient learning of TVG. Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions datasets, empirically show that the proposed TVP significantly boosts the performance of 2D TVG (e.g., 9.79% improvement on Charades-STA and 30.77% improvement on ActivityNet Captions) and achieves 5× inference acceleration over TVG using 3D visual features.* + +This research addresses temporal video grounding (TVG), which is the process of pinpointing the start and end times of specific events in a long video, as described by a text sentence. Text-visual prompting (TVP), is proposed to enhance TVG. TVP involves integrating specially designed patterns, known as 'prompts', into both the visual (image-based) and textual (word-based) input components of a TVG model. 
These prompts provide additional spatial-temporal context, improving the model's ability to accurately determine event timings in the video. The approach employs 2D visual inputs in place of 3D ones. Although 3D inputs offer more spatial-temporal detail, they are also more time-consuming to process. The use of 2D inputs with the prompting method aims to provide similar levels of context and accuracy more efficiently. + + + + TVP architecture. Taken from the original paper. + +This model was contributed by [Jiqing Feng](https://huggingface.co/Jiqing). The original code can be found [here](https://github.com/intel/TVP). + +## Usage tips and examples + +Prompts are optimized perturbation patterns, which would be added to input video frames or text features. Universal set refers to using the same exact set of prompts for any input, this means that these prompts are added consistently to all video frames and text features, regardless of the input's content. + +TVP consists of a visual encoder and cross-modal encoder. A universal set of visual prompts and text prompts to be integrated into sampled video frames and textual features, respectively. Specially, a set of different visual prompts are applied to uniformly-sampled frames of one untrimmed video in order. + +The goal of this model is to incorporate trainable prompts into both visual inputs and textual features to temporal video grounding(TVG) problems. +In principle, one can apply any visual, cross-modal encoder in the proposed architecture. + +The [`TvpProcessor`] wraps [`BertTokenizer`] and [`TvpImageProcessor`] into a single instance to both +encode the text and prepare the images respectively. + +The following example shows how to run temporal video grounding using [`TvpProcessor`] and [`TvpForVideoGrounding`]. +```python +import av +import cv2 +import numpy as np +import torch +from huggingface_hub import hf_hub_download +from transformers import AutoProcessor, TvpForVideoGrounding + + +def pyav_decode(container, sampling_rate, num_frames, clip_idx, num_clips, target_fps): + ''' + Convert the video from its original fps to the target_fps and decode the video with PyAV decoder. + Args: + container (container): pyav container. + sampling_rate (int): frame sampling rate (interval between two sampled frames). + num_frames (int): number of frames to sample. + clip_idx (int): if clip_idx is -1, perform random temporal sampling. + If clip_idx is larger than -1, uniformly split the video to num_clips + clips, and select the clip_idx-th video clip. + num_clips (int): overall number of clips to uniformly sample from the given video. + target_fps (int): the input video may have different fps, convert it to + the target video fps before frame sampling. + Returns: + frames (tensor): decoded frames from the video. Return None if the no + video stream was found. + fps (float): the number of frames per second of the video. 
+ ''' + video = container.streams.video[0] + fps = float(video.average_rate) + clip_size = sampling_rate * num_frames / target_fps * fps + delta = max(num_frames - clip_size, 0) + start_idx = delta * clip_idx / num_clips + end_idx = start_idx + clip_size - 1 + timebase = video.duration / num_frames + video_start_pts = int(start_idx * timebase) + video_end_pts = int(end_idx * timebase) + seek_offset = max(video_start_pts - 1024, 0) + container.seek(seek_offset, any_frame=False, backward=True, stream=video) + frames = {} + for frame in container.decode(video=0): + if frame.pts < video_start_pts: + continue + frames[frame.pts] = frame + if frame.pts > video_end_pts: + break + frames = [frames[pts] for pts in sorted(frames)] + return frames, fps + + +def decode(container, sampling_rate, num_frames, clip_idx, num_clips, target_fps): + ''' + Decode the video and perform temporal sampling. + Args: + container (container): pyav container. + sampling_rate (int): frame sampling rate (interval between two sampled frames). + num_frames (int): number of frames to sample. + clip_idx (int): if clip_idx is -1, perform random temporal sampling. + If clip_idx is larger than -1, uniformly split the video to num_clips + clips, and select the clip_idx-th video clip. + num_clips (int): overall number of clips to uniformly sample from the given video. + target_fps (int): the input video may have different fps, convert it to + the target video fps before frame sampling. + Returns: + frames (tensor): decoded frames from the video. + ''' + assert clip_idx >= -2, "Not a valied clip_idx {}".format(clip_idx) + frames, fps = pyav_decode(container, sampling_rate, num_frames, clip_idx, num_clips, target_fps) + clip_size = sampling_rate * num_frames / target_fps * fps + index = np.linspace(0, clip_size - 1, num_frames) + index = np.clip(index, 0, len(frames) - 1).astype(np.int64) + frames = np.array([frames[idx].to_rgb().to_ndarray() for idx in index]) + frames = frames.transpose(0, 3, 1, 2) + return frames + + +file = hf_hub_download(repo_id="Intel/tvp_demo", filename="AK2KG.mp4", repo_type="dataset") +model = TvpForVideoGrounding.from_pretrained("Intel/tvp-base") + +decoder_kwargs = dict( + container=av.open(file, metadata_errors="ignore"), + sampling_rate=1, + num_frames=model.config.num_frames, + clip_idx=0, + num_clips=1, + target_fps=3, +) +raw_sampled_frms = decode(**decoder_kwargs) + +text = "a person is sitting on a bed." +processor = AutoProcessor.from_pretrained("Intel/tvp-base") +model_inputs = processor( + text=[text], videos=list(raw_sampled_frms), return_tensors="pt", max_text_length=100#, size=size +) + +model_inputs["pixel_values"] = model_inputs["pixel_values"].to(model.dtype) +output = model(**model_inputs) + +def get_video_duration(filename): + cap = cv2.VideoCapture(filename) + if cap.isOpened(): + rate = cap.get(5) + frame_num = cap.get(7) + duration = frame_num/rate + return duration + return -1 + +duration = get_video_duration(file) +start, end = processor.post_process_video_grounding(output.logits, duration) + +print(f"The time slot of the video corresponding to the text \"{text}\" is from {start}s to {end}s") +``` + +Tips: + +- This implementation of TVP uses [`BertTokenizer`] to generate text embeddings and Resnet-50 model to compute visual embeddings. +- Checkpoints for pre-trained [tvp-base](https://huggingface.co/Intel/tvp-base) is released. +- Please refer to [Table 2](https://arxiv.org/pdf/2303.04995.pdf) for TVP's performance on Temporal Video Grounding task. 
+ + +## TvpConfig + +[[autodoc]] TvpConfig + +## TvpImageProcessor + +[[autodoc]] TvpImageProcessor + - preprocess + +## TvpProcessor + +[[autodoc]] TvpProcessor + - __call__ + +## TvpModel + +[[autodoc]] TvpModel + - forward + +## TvpForVideoGrounding + +[[autodoc]] TvpForVideoGrounding + - forward diff --git a/docs/source/en/model_doc/ul2.mdx b/docs/source/en/model_doc/ul2.md similarity index 86% rename from docs/source/en/model_doc/ul2.mdx rename to docs/source/en/model_doc/ul2.md index 2481285747fa7b..f4d01c40b0c109 100644 --- a/docs/source/en/model_doc/ul2.mdx +++ b/docs/source/en/model_doc/ul2.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # UL2 @@ -20,12 +24,20 @@ The abstract from the paper is the following: *Existing pre-trained models are generally geared towards a particular class of problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes with pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto-frontier by outperforming T5 and/or GPT-like models across multiple diverse setups. Finally, by scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised NLP tasks ranging from language generation (with automated and human evaluation), language understanding, text classification, question answering, commonsense reasoning, long text reasoning, structured knowledge grounding and information retrieval. Our model also achieve strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.* -Tips: +This model was contributed by [DanielHesslow](https://huggingface.co/Seledorn). The original code can be found [here](https://github.com/google-research/google-research/tree/master/ul2). + +## Usage tips - UL2 is an encoder-decoder model pre-trained on a mixture of denoising functions as well as fine-tuned on an array of downstream tasks. - UL2 has the same architecture as [T5v1.1](t5v1.1) but uses the Gated-SiLU activation function instead of Gated-GELU. 
- The authors release checkpoints of one architecture which can be seen [here](https://huggingface.co/google/ul2) -The original code can be found [here](https://github.com/google-research/google-research/tree/master/ul2). + + +As UL2 has the same architecture as T5v1.1, refer to [T5's documentation page](t5) for API reference, tips, code examples and notebooks. + + + + + -This model was contributed by [DanielHesslow](https://huggingface.co/Seledorn). diff --git a/docs/source/en/model_doc/umt5.md b/docs/source/en/model_doc/umt5.md new file mode 100644 index 00000000000000..b9f86a0304e892 --- /dev/null +++ b/docs/source/en/model_doc/umt5.md @@ -0,0 +1,112 @@ + + +# UMT5 + +

+ +## Overview + +The UMT5 model was proposed in [UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining](https://openreview.net/forum?id=kXwdL1cWOAi) by Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant. + +The abstract from the paper is the following: + +*Pretrained multilingual large language models have typically used heuristic temperature-based sampling to balance between different languages. However previous work has not systematically evaluated the efficacy of different pretraining language distributions across model scales. In this paper, we propose a new sampling method, UniMax, that delivers more uniform coverage of head languages while mitigating overfitting on tail languages by explicitly capping the number of repeats over each language's corpus. We perform an extensive series of ablations testing a range of sampling strategies on a suite of multilingual benchmarks, while varying model scale. We find that UniMax outperforms standard temperature-based sampling, and the benefits persist as scale increases. As part of our contribution, we release: (i) an improved and refreshed mC4 multilingual corpus consisting of 29 trillion characters across 107 languages, and (ii) a suite of pretrained umT5 model checkpoints trained with UniMax sampling.* + +Google has released the following variants: + +- [google/umt5-small](https://huggingface.co/google/umt5-small) +- [google/umt5-base](https://huggingface.co/google/umt5-base) +- [google/umt5-xl](https://huggingface.co/google/umt5-xl) +- [google/umt5-xxl](https://huggingface.co/google/umt5-xxl). + +This model was contributed by [agemagician](https://huggingface.co/agemagician) and [stefan-it](https://huggingface.co/stefan-it). The original code can be +found [here](https://github.com/google-research/t5x). + +## Usage tips + +- UMT5 was only pre-trained on [mC4](https://huggingface.co/datasets/mc4) excluding any supervised training. +Therefore, this model has to be fine-tuned before it is usable on a downstream task, unlike the original T5 model. +- Since umT5 was pre-trained in an unsupervised manner, there's no real advantage to using a task prefix during single-task +fine-tuning. If you are doing multi-task fine-tuning, you should use a prefix. + +## Differences with mT5? +`UmT5` is based on mT5, with a non-shared relative positional bias that is computed for each layer. This means that the model set `has_relative_bias` for each layer. +The conversion script is also different because the model was saved in t5x's latest checkpointing format. + +# Sample usage + +```python +>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer + +>>> model = AutoModelForSeq2SeqLM.from_pretrained("google/umt5-small") +>>> tokenizer = AutoTokenizer.from_pretrained("google/umt5-small") + +>>> inputs = tokenizer( +... "A walks into a bar and orders a with pinch of .", +... return_tensors="pt", +... ) +>>> outputs = model.generate(**inputs) +>>> print(tokenizer.batch_decode(outputs)) +['nyone who drink a alcohol A A. This I'] +``` + + + +Refer to [T5's documentation page](t5) for more tips, code examples and notebooks. 
+ + +## UMT5Config + +[[autodoc]] UMT5Config + +## UMT5Model + +[[autodoc]] UMT5Model + - forward + +## UMT5ForConditionalGeneration + +[[autodoc]] UMT5ForConditionalGeneration + - forward + +## UMT5EncoderModel + +[[autodoc]] UMT5EncoderModel + - forward + +## UMT5ForSequenceClassification + +[[autodoc]] UMT5ForSequenceClassification + - forward + +## UMT5ForTokenClassification + +[[autodoc]] UMT5ForTokenClassification + - forward + +## UMT5ForQuestionAnswering + +[[autodoc]] UMT5ForQuestionAnswering + - forward + diff --git a/docs/source/en/model_doc/unispeech-sat.mdx b/docs/source/en/model_doc/unispeech-sat.md similarity index 89% rename from docs/source/en/model_doc/unispeech-sat.mdx rename to docs/source/en/model_doc/unispeech-sat.md index e2ceb783ea9419..3f0bbcc79323f2 100644 --- a/docs/source/en/model_doc/unispeech-sat.mdx +++ b/docs/source/en/model_doc/unispeech-sat.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # UniSpeech-SAT @@ -27,13 +31,16 @@ this paper, we aim to improve the existing SSL framework for speaker representat introduced for enhancing the unsupervised speaker information extraction. First, we apply the multi-task learning to the current SSL framework, where we integrate the utterance-wise contrastive loss with the SSL objective function. Second, for better speaker discrimination, we propose an utterance mixing strategy for data augmentation, where -additional overlapped utterances are created unsupervisely and incorporate during training. We integrate the proposed +additional overlapped utterances are created unsupervisedly and incorporate during training. We integrate the proposed methods into the HuBERT framework. Experiment results on SUPERB benchmark show that the proposed system achieves state-of-the-art performance in universal representation learning, especially for speaker identification oriented tasks. An ablation study is performed verifying the efficacy of each proposed method. Finally, we scale up training dataset to 94 thousand hours public audio data and achieve further performance improvement in all SUPERB tasks.* -Tips: +This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The Authors' code can be +found [here](https://github.com/microsoft/UniSpeech/tree/main/UniSpeech-SAT). + +## Usage tips - UniSpeechSat is a speech model that accepts a float array corresponding to the raw waveform of the speech signal. Please use [`Wav2Vec2Processor`] for the feature extraction. @@ -41,9 +48,10 @@ Tips: decoded using [`Wav2Vec2CTCTokenizer`]. - UniSpeechSat performs especially well on speaker verification, speaker identification, and speaker diarization tasks. -This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The Authors' code can be -found [here](https://github.com/microsoft/UniSpeech/tree/main/UniSpeech-SAT). 
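As a concrete counterpart to the tips above, the sketch below transcribes a short audio sample with a CTC head. The `microsoft/unispeech-sat-base-100h-libri-ft` checkpoint and the dummy LibriSpeech split are assumptions chosen for illustration; substitute your own checkpoint and audio as needed.

```python
import torch
from datasets import load_dataset
from transformers import Wav2Vec2Processor, UniSpeechSatForCTC

checkpoint = "microsoft/unispeech-sat-base-100h-libri-ft"  # assumed CTC fine-tuned checkpoint
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = UniSpeechSatForCTC.from_pretrained(checkpoint)

ds = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# take the most likely token at every frame; the tokenizer collapses repeats and blanks
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```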
+## Resources +- [Audio classification task guide](../tasks/audio_classification) +- [Automatic speech recognition task guide](../tasks/asr) ## UniSpeechSatConfig diff --git a/docs/source/en/model_doc/unispeech.mdx b/docs/source/en/model_doc/unispeech.md similarity index 90% rename from docs/source/en/model_doc/unispeech.mdx rename to docs/source/en/model_doc/unispeech.md index 37d0a0a708e9fc..2b2b13bed52c17 100644 --- a/docs/source/en/model_doc/unispeech.mdx +++ b/docs/source/en/model_doc/unispeech.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # UniSpeech @@ -29,16 +33,20 @@ recognition by a maximum of 13.4% and 17.8% relative phone error rate reductions testing languages). The transferability of UniSpeech is also demonstrated on a domain-shift speech recognition task, i.e., a relative word error rate reduction of 6% against the previous approach.* -Tips: +This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The Authors' code can be +found [here](https://github.com/microsoft/UniSpeech/tree/main/UniSpeech). + +## Usage tips - UniSpeech is a speech model that accepts a float array corresponding to the raw waveform of the speech signal. Please use [`Wav2Vec2Processor`] for the feature extraction. - UniSpeech model can be fine-tuned using connectionist temporal classification (CTC) so the model output has to be decoded using [`Wav2Vec2CTCTokenizer`]. -This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The Authors' code can be -found [here](https://github.com/microsoft/UniSpeech/tree/main/UniSpeech). +## Resources +- [Audio classification task guide](../tasks/audio_classification) +- [Automatic speech recognition task guide](../tasks/asr) ## UniSpeechConfig diff --git a/docs/source/en/model_doc/univnet.md b/docs/source/en/model_doc/univnet.md new file mode 100644 index 00000000000000..45bd94732773ae --- /dev/null +++ b/docs/source/en/model_doc/univnet.md @@ -0,0 +1,80 @@ + + +# UnivNet + +## Overview + +The UnivNet model was proposed in [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889) by Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kin, and Juntae Kim. +The UnivNet model is a generative adversarial network (GAN) trained to synthesize high fidelity speech waveforms. The UnivNet model shared in `transformers` is the *generator*, which maps a conditioning log-mel spectrogram and optional noise sequence to a speech waveform (e.g. a vocoder). Only the generator is required for inference. The *discriminator* used to train the `generator` is not implemented. + +The abstract from the paper is the following: + +*Most neural vocoders employ band-limited mel-spectrograms to generate waveforms. If full-band spectral features are used as the input, the vocoder can be provided with as much acoustic information as possible. 
However, in some models employing full-band mel-spectrograms, an over-smoothing problem occurs as part of which non-sharp spectrograms are generated. To address this problem, we propose UnivNet, a neural vocoder that synthesizes high-fidelity waveforms in real time. Inspired by works in the field of voice activity detection, we added a multi-resolution spectrogram discriminator that employs multiple linear spectrogram magnitudes computed using various parameter sets. Using full-band mel-spectrograms as input, we expect to generate high-resolution signals by adding a discriminator that employs spectrograms of multiple resolutions as the input. In an evaluation on a dataset containing information on hundreds of speakers, UnivNet obtained the best objective and subjective results among competing models for both seen and unseen speakers. These results, including the best subjective score for text-to-speech, demonstrate the potential for fast adaptation to new speakers without a need for training from scratch.* + +Tips: + +- The `noise_sequence` argument for [`UnivNetModel.forward`] should be standard Gaussian noise (such as from `torch.randn`) of shape `([batch_size], noise_length, model.config.model_in_channels)`, where `noise_length` should match the length dimension (dimension 1) of the `input_features` argument. If not supplied, it will be randomly generated; a `torch.Generator` can be supplied to the `generator` argument so that the forward pass can be reproduced. (Note that [`UnivNetFeatureExtractor`] will return generated noise by default, so it shouldn't be necessary to generate `noise_sequence` manually.) +- Padding added by [`UnivNetFeatureExtractor`] can be removed from the [`UnivNetModel`] output through the [`UnivNetFeatureExtractor.batch_decode`] method, as shown in the usage example below. +- Padding the end of each waveform with silence can reduce artifacts at the end of the generated audio sample. This can be done by supplying `pad_end = True` to [`UnivNetFeatureExtractor.__call__`]. See [this issue](https://github.com/seungwonpark/melgan/issues/8) for more details. + +Usage Example: + +```python +import torch +from scipy.io.wavfile import write +from datasets import Audio, load_dataset + +from transformers import UnivNetFeatureExtractor, UnivNetModel + +model_id_or_path = "dg845/univnet-dev" +model = UnivNetModel.from_pretrained(model_id_or_path) +feature_extractor = UnivNetFeatureExtractor.from_pretrained(model_id_or_path) + +ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") +# Resample the audio to the model and feature extractor's sampling rate. +ds = ds.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate)) +# Pad the end of the converted waveforms to reduce artifacts at the end of the output audio samples. +inputs = feature_extractor( + ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], pad_end=True, return_tensors="pt" +) + +with torch.no_grad(): + audio = model(**inputs) + +# Remove the extra padding at the end of the output. +audio = feature_extractor.batch_decode(**audio)[0] +# Convert to wav file +write("sample_audio.wav", feature_extractor.sampling_rate, audio) +``` + +This model was contributed by [dg845](https://huggingface.co/dg845). 
+To the best of my knowledge, there is no official code release, but an unofficial implementation can be found at [maum-ai/univnet](https://github.com/maum-ai/univnet) with pretrained checkpoints [here](https://github.com/maum-ai/univnet#pre-trained-model). + + +## UnivNetConfig + +[[autodoc]] UnivNetConfig + +## UnivNetFeatureExtractor + +[[autodoc]] UnivNetFeatureExtractor + - __call__ + +## UnivNetModel + +[[autodoc]] UnivNetModel + - forward \ No newline at end of file diff --git a/docs/source/en/model_doc/upernet.mdx b/docs/source/en/model_doc/upernet.md similarity index 93% rename from docs/source/en/model_doc/upernet.mdx rename to docs/source/en/model_doc/upernet.md index 17dff3c66a066f..418c3ef1786b52 100644 --- a/docs/source/en/model_doc/upernet.mdx +++ b/docs/source/en/model_doc/upernet.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # UPerNet @@ -29,16 +33,7 @@ alt="drawing" width="600"/> This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code is based on OpenMMLab's mmsegmentation [here](https://github.com/open-mmlab/mmsegmentation/blob/master/mmseg/models/decode_heads/uper_head.py). -## Resources - -A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with UPerNet. - -- Demo notebooks for UPerNet can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/UPerNet). -- [`UperNetForSemanticSegmentation`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/semantic-segmentation) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/semantic_segmentation.ipynb). - -If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. - -## Usage +## Usage examples UPerNet is a general framework for semantic segmentation. It can be used with any vision backbone, like so: @@ -64,6 +59,16 @@ model = UperNetForSemanticSegmentation(config) Note that this will randomly initialize all the weights of the model. +## Resources + +A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with UPerNet. + +- Demo notebooks for UPerNet can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/UPerNet). +- [`UperNetForSemanticSegmentation`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/semantic-segmentation) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/semantic_segmentation.ipynb). +- See also: [Semantic segmentation task guide](../tasks/semantic_segmentation) + +If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! 
The resource should ideally demonstrate something new instead of duplicating an existing resource. + ## UperNetConfig [[autodoc]] UperNetConfig diff --git a/docs/source/en/model_doc/van.mdx b/docs/source/en/model_doc/van.md similarity index 83% rename from docs/source/en/model_doc/van.mdx rename to docs/source/en/model_doc/van.md index 1f5507244c0765..2fb8475ce72f32 100644 --- a/docs/source/en/model_doc/van.mdx +++ b/docs/source/en/model_doc/van.md @@ -8,10 +8,23 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # VAN + + +This model is in maintenance mode only, we don't accept any new PRs changing its code. + +If you run into any issues running this model, please reinstall the last version that supported this model: v4.30.0. +You can do so by running the following command: `pip install -U transformers==4.30.0`. + + + ## Overview The VAN model was proposed in [Visual Attention Network](https://arxiv.org/abs/2202.09741) by Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu. @@ -26,7 +39,7 @@ Tips: - VAN does not have an embedding layer, thus the `hidden_states` will have a length equal to the number of stages. -The figure below illustrates the architecture of a Visual Aattention Layer. Taken from the [original paper](https://arxiv.org/abs/2202.09741). +The figure below illustrates the architecture of a Visual Attention Layer. Taken from the [original paper](https://arxiv.org/abs/2202.09741). @@ -39,6 +52,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`VanForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). +- See also: [Image classification task guide](../tasks/image_classification) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. @@ -46,13 +60,11 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] VanConfig - ## VanModel [[autodoc]] VanModel - forward - ## VanForImageClassification [[autodoc]] VanForImageClassification diff --git a/docs/source/en/model_doc/videomae.mdx b/docs/source/en/model_doc/videomae.md similarity index 92% rename from docs/source/en/model_doc/videomae.mdx rename to docs/source/en/model_doc/videomae.md index 76e822ef8a5cc8..75eb9617380c57 100644 --- a/docs/source/en/model_doc/videomae.mdx +++ b/docs/source/en/model_doc/videomae.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
+ +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # VideoMAE @@ -21,11 +25,6 @@ The abstract from the paper is the following: *Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking and reconstruction. These simple designs turn out to be effective for overcoming information leakage caused by the temporal correlation during video reconstruction. We obtain three important findings on SSVP: (1) An extremely high proportion of masking ratio (i.e., 90% to 95%) still yields favorable performance of VideoMAE. The temporally redundant video content enables higher masking ratio than that of images. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. This is partially ascribed to the challenging task of video reconstruction to enforce high-level structure learning. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between pre-training and target datasets are important issues in SSVP. Notably, our VideoMAE with the vanilla ViT backbone can achieve 83.9% on Kinects-400, 75.3% on Something-Something V2, 90.8% on UCF101, and 61.1% on HMDB51 without using any extra data.* -Tips: - -- One can use [`VideoMAEImageProcessor`] to prepare videos for the model. It will resize + normalize all frames of a video for you. -- [`VideoMAEForPreTraining`] includes the decoder on top for self-supervised pre-training. - drawing @@ -43,10 +42,9 @@ review it! The resource should ideally demonstrate something new instead of dupl **Video classification** - [A notebook](https://github.com/huggingface/notebooks/blob/main/examples/video_classification.ipynb) that shows how to fine-tune a VideoMAE model on a custom dataset. -- [Video classification task page](https://huggingface.co/tasks/video-classification) +- [Video classification task guide](../tasks/video_classification) - [A 🤗 Space](https://huggingface.co/spaces/sayakpaul/video-classification-ucf101-subset) showing how to perform inference with a video classification model. - ## VideoMAEConfig [[autodoc]] VideoMAEConfig @@ -68,6 +66,8 @@ to fine-tune a VideoMAE model on a custom dataset. ## VideoMAEForPreTraining +`VideoMAEForPreTraining` includes the decoder on top for self-supervised pre-training. + [[autodoc]] transformers.VideoMAEForPreTraining - forward diff --git a/docs/source/en/model_doc/vilt.mdx b/docs/source/en/model_doc/vilt.md similarity index 93% rename from docs/source/en/model_doc/vilt.mdx rename to docs/source/en/model_doc/vilt.md index 7c8653e1a3b948..2b0ac022da4b2c 100644 --- a/docs/source/en/model_doc/vilt.mdx +++ b/docs/source/en/model_doc/vilt.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
+ +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # ViLT @@ -30,28 +34,24 @@ Vision-and-Language Transformer (ViLT), monolithic in the sense that the process simplified to just the same convolution-free manner that we process textual inputs. We show that ViLT is up to tens of times faster than previous VLP models, yet with competitive or better downstream task performance.* -Tips: + + + ViLT architecture. Taken from the original paper. + +This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/dandelin/ViLT). + +## Usage tips - The quickest way to get started with ViLT is by checking the [example notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/ViLT) (which showcase both inference and fine-tuning on custom data). - ViLT is a model that takes both `pixel_values` and `input_ids` as input. One can use [`ViltProcessor`] to prepare data for the model. - This processor wraps a feature extractor (for the image modality) and a tokenizer (for the language modality) into one. + This processor wraps a image processor (for the image modality) and a tokenizer (for the language modality) into one. - ViLT is trained with images of various sizes: the authors resize the shorter edge of input images to 384 and limit the longer edge to under 640 while preserving the aspect ratio. To make batching of images possible, the authors use a `pixel_mask` that indicates which pixel values are real and which are padding. [`ViltProcessor`] automatically creates this for you. - The design of ViLT is very similar to that of a standard Vision Transformer (ViT). The only difference is that the model includes additional embedding layers for the language modality. - - - - ViLT architecture. Taken from the original paper. - -This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code can be found [here](https://github.com/dandelin/ViLT). - - -Tips: - - The PyTorch version of this model is only available in torch 1.10 and higher. ## ViltConfig diff --git a/docs/source/en/model_doc/vipllava.md b/docs/source/en/model_doc/vipllava.md new file mode 100644 index 00000000000000..35f2467486a895 --- /dev/null +++ b/docs/source/en/model_doc/vipllava.md @@ -0,0 +1,61 @@ + + +# VipLlava + +## Overview + +The VipLlava model was proposed in [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https://arxiv.org/abs/2312.00784) by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee. + +VipLlava enhances the training protocol of Llava by marking images and interact with the model using natural cues like a "red bounding box" or "pointed arrow" during training. + +The abstract from the paper is the following: + +*While existing large vision-language multimodal models focus on whole image understanding, there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge, we introduce a novel multimodal model capable of decoding arbitrary visual prompts. This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow". 
Our simple design directly overlays visual markers onto the RGB image, eliminating the need for complex region encodings, yet achieves state-of-the-art performance on region-understanding tasks like Visual7W, PointQA, and Visual Commonsense Reasoning benchmark. Furthermore, we present ViP-Bench, a comprehensive benchmark to assess the capability of models in understanding visual prompts across multiple dimensions, enabling future research in this domain. Code, data, and model are publicly available.* + +Tips: + +- The architecture is similar than llava architecture except that the multi-modal projector takes a set of concatenated vision hidden states and has an additional layernorm layer on that module. + +- We advise users to use `padding_side="left"` when computing batched generation as it leads to more accurate results. Simply make sure to call `processor.tokenizer.padding_side = "left"` before generating. + +- Note the model has not been explicitly trained to process multiple images in the same prompt, although this is technically possible, you may experience inaccurate results. + +- For better results, we recommend users to prompt the model with the correct prompt format: + +```bash +A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: \n###Assistant: +``` + +For multiple turns conversation: + +```bash +A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: \n###Assistant: ###Human: ###Assistant: +``` + +The original code can be found [here](https://github.com/mu-cai/ViP-LLaVA). + +This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada) + + +## VipLlavaConfig + +[[autodoc]] VipLlavaConfig + +## VipLlavaForConditionalGeneration + +[[autodoc]] VipLlavaForConditionalGeneration + - forward diff --git a/docs/source/en/model_doc/vision-encoder-decoder.mdx b/docs/source/en/model_doc/vision-encoder-decoder.md similarity index 94% rename from docs/source/en/model_doc/vision-encoder-decoder.mdx rename to docs/source/en/model_doc/vision-encoder-decoder.md index 0241224c066797..41159b7fc5f9a8 100644 --- a/docs/source/en/model_doc/vision-encoder-decoder.mdx +++ b/docs/source/en/model_doc/vision-encoder-decoder.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Vision Encoder Decoder Models @@ -54,7 +58,7 @@ To do so, the `VisionEncoderDecoderModel` class provides a [`VisionEncoderDecode >>> from transformers import VisionEncoderDecoderModel >>> model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained( -... "microsoft/swin-base-patch4-window7-224-in22k", "bert-base-uncased" +... "microsoft/swin-base-patch4-window7-224-in22k", "google-bert/bert-base-uncased" ... ) ``` @@ -119,9 +123,9 @@ images) and `labels` (which are the `input_ids` of the encoded target sequence). 
>>> from datasets import load_dataset >>> image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k") ->>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") +>>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased") >>> model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained( -... "google/vit-base-patch16-224-in21k", "bert-base-uncased" +... "google/vit-base-patch16-224-in21k", "google-bert/bert-base-uncased" ... ) >>> model.config.decoder_start_token_id = tokenizer.cls_token_id @@ -147,20 +151,32 @@ were contributed by [ydshieh](https://github.com/ydshieh). [[autodoc]] VisionEncoderDecoderConfig + + + ## VisionEncoderDecoderModel [[autodoc]] VisionEncoderDecoderModel - forward - from_encoder_decoder_pretrained + + + ## TFVisionEncoderDecoderModel [[autodoc]] TFVisionEncoderDecoderModel - call - from_encoder_decoder_pretrained + + + ## FlaxVisionEncoderDecoderModel [[autodoc]] FlaxVisionEncoderDecoderModel - __call__ - from_encoder_decoder_pretrained + + + diff --git a/docs/source/en/model_doc/vision-text-dual-encoder.mdx b/docs/source/en/model_doc/vision-text-dual-encoder.md similarity index 85% rename from docs/source/en/model_doc/vision-text-dual-encoder.mdx rename to docs/source/en/model_doc/vision-text-dual-encoder.md index c7ee59d77abb16..7cb68a261875e6 100644 --- a/docs/source/en/model_doc/vision-text-dual-encoder.mdx +++ b/docs/source/en/model_doc/vision-text-dual-encoder.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # VisionTextDualEncoder @@ -32,12 +36,29 @@ new zero-shot vision tasks such as image classification or retrieval. [[autodoc]] VisionTextDualEncoderProcessor + + + ## VisionTextDualEncoderModel [[autodoc]] VisionTextDualEncoderModel - forward + + + ## FlaxVisionTextDualEncoderModel [[autodoc]] FlaxVisionTextDualEncoderModel - __call__ + + + + +## TFVisionTextDualEncoderModel + +[[autodoc]] TFVisionTextDualEncoderModel + - call + + + diff --git a/docs/source/en/model_doc/visual_bert.mdx b/docs/source/en/model_doc/visual_bert.md similarity index 95% rename from docs/source/en/model_doc/visual_bert.mdx rename to docs/source/en/model_doc/visual_bert.md index df8858b1fa6785..95e5ae4e84a28d 100644 --- a/docs/source/en/model_doc/visual_bert.mdx +++ b/docs/source/en/model_doc/visual_bert.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # VisualBERT @@ -28,7 +32,9 @@ simpler. 
Further analysis demonstrates that VisualBERT can ground elements of la explicit supervision and is even sensitive to syntactic relationships, tracking, for example, associations between verbs and image regions corresponding to their arguments.* -Tips: +This model was contributed by [gchhablani](https://huggingface.co/gchhablani). The original code can be found [here](https://github.com/uclanlp/visualbert). + +## Usage tips 1. Most of the checkpoints provided work with the [`VisualBertForPreTraining`] configuration. Other checkpoints provided are the fine-tuned checkpoints for down-stream tasks - VQA ('visualbert-vqa'), VCR @@ -39,8 +45,6 @@ Tips: We do not provide the detector and its weights as a part of the package, but it will be available in the research projects, and the states can be loaded directly into the detector provided. -## Usage - VisualBERT is a multi-modal vision and language model. It can be used for visual question answering, multiple choice, visual reasoning and region-to-phrase correspondence tasks. VisualBERT uses a BERT-like transformer to prepare embeddings for image-text pairs. Both the text and visual features are then projected to a latent space with identical @@ -69,7 +73,7 @@ The following example shows how to get the last hidden state using [`VisualBertM >>> from transformers import BertTokenizer, VisualBertModel >>> model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre") ->>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") +>>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased") >>> inputs = tokenizer("What is the man eating?", return_tensors="pt") >>> # this is a custom function that returns the visual embeddings given the image path @@ -88,8 +92,6 @@ The following example shows how to get the last hidden state using [`VisualBertM >>> last_hidden_state = outputs.last_hidden_state ``` -This model was contributed by [gchhablani](https://huggingface.co/gchhablani). The original code can be found [here](https://github.com/uclanlp/visualbert). - ## VisualBertConfig [[autodoc]] VisualBertConfig diff --git a/docs/source/en/model_doc/vit.mdx b/docs/source/en/model_doc/vit.md similarity index 86% rename from docs/source/en/model_doc/vit.mdx rename to docs/source/en/model_doc/vit.md index 45ed3f1878c088..25c3a6c8f537f4 100644 --- a/docs/source/en/model_doc/vit.mdx +++ b/docs/source/en/model_doc/vit.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Vision Transformer (ViT) @@ -20,7 +24,6 @@ Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minder Uszkoreit, Neil Houlsby. It's the first paper that successfully trains a Transformer encoder on ImageNet, attaining very good results compared to familiar convolutional architectures. 
- The abstract from the paper is the following: *While the Transformer architecture has become the de-facto standard for natural language processing tasks, its @@ -32,30 +35,6 @@ data and transferred to multiple mid-sized or small image recognition benchmarks Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.* -Tips: - -- Demo notebooks regarding inference as well as fine-tuning ViT on custom data can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer). -- To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches, - which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image, which can be - used for classification. The authors also add absolute position embeddings, and feed the resulting sequence of - vectors to a standard Transformer encoder. -- As the Vision Transformer expects each image to be of the same size (resolution), one can use - [`ViTImageProcessor`] to resize (or rescale) and normalize images for the model. -- Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of - each checkpoint. For example, `google/vit-base-patch16-224` refers to a base-sized architecture with patch - resolution of 16x16 and fine-tuning resolution of 224x224. All checkpoints can be found on the [hub](https://huggingface.co/models?search=vit). -- The available checkpoints are either (1) pre-trained on [ImageNet-21k](http://www.image-net.org/) (a collection of - 14 million images and 21k classes) only, or (2) also fine-tuned on [ImageNet](http://www.image-net.org/challenges/LSVRC/2012/) (also referred to as ILSVRC 2012, a collection of 1.3 million - images and 1,000 classes). -- The Vision Transformer was pre-trained using a resolution of 224x224. During fine-tuning, it is often beneficial to - use a higher resolution than pre-training [(Touvron et al., 2019)](https://arxiv.org/abs/1906.06423), [(Kolesnikov - et al., 2020)](https://arxiv.org/abs/1912.11370). In order to fine-tune at higher resolution, the authors perform - 2D interpolation of the pre-trained position embeddings, according to their location in the original image. -- The best results are obtained with supervised pre-training, which is not the case in NLP. The authors also performed - an experiment with a self-supervised pre-training objective, namely masked patched prediction (inspired by masked - language modeling). With this approach, the smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant - improvement of 2% to training from scratch, but still 4% behind supervised pre-training. - drawing @@ -83,27 +62,35 @@ Following the original Vision Transformer, some follow-up works have been made: This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code (written in JAX) can be found [here](https://github.com/google-research/vision_transformer). -Note that we converted the weights from Ross Wightman's [timm library](https://github.com/rwightman/pytorch-image-models), who already converted the weights from JAX to PyTorch. Credits -go to him! - -## Resources - -A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ViT. 
- - - -- [`ViTForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). -- A blog on fine-tuning [`ViTForImageClassification`] on a custom dataset can be found [here](https://huggingface.co/blog/fine-tune-vit). -- More demo notebooks to fine-tune [`ViTForImageClassification`] can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer). +Note that we converted the weights from Ross Wightman's [timm library](https://github.com/rwightman/pytorch-image-models), +who already converted the weights from JAX to PyTorch. Credits go to him! -Besides that: +## Usage tips -- [`ViTForMaskedImageModeling`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining). - -If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. +- To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches, + which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image, which can be + used for classification. The authors also add absolute position embeddings, and feed the resulting sequence of + vectors to a standard Transformer encoder. +- As the Vision Transformer expects each image to be of the same size (resolution), one can use + [`ViTImageProcessor`] to resize (or rescale) and normalize images for the model. +- Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of + each checkpoint. For example, `google/vit-base-patch16-224` refers to a base-sized architecture with patch + resolution of 16x16 and fine-tuning resolution of 224x224. All checkpoints can be found on the [hub](https://huggingface.co/models?search=vit). +- The available checkpoints are either (1) pre-trained on [ImageNet-21k](http://www.image-net.org/) (a collection of + 14 million images and 21k classes) only, or (2) also fine-tuned on [ImageNet](http://www.image-net.org/challenges/LSVRC/2012/) (also referred to as ILSVRC 2012, a collection of 1.3 million + images and 1,000 classes). +- The Vision Transformer was pre-trained using a resolution of 224x224. During fine-tuning, it is often beneficial to + use a higher resolution than pre-training [(Touvron et al., 2019)](https://arxiv.org/abs/1906.06423), [(Kolesnikov + et al., 2020)](https://arxiv.org/abs/1912.11370). In order to fine-tune at higher resolution, the authors perform + 2D interpolation of the pre-trained position embeddings, according to their location in the original image. +- The best results are obtained with supervised pre-training, which is not the case in NLP. The authors also performed + an experiment with a self-supervised pre-training objective, namely masked patched prediction (inspired by masked + language modeling). With this approach, the smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant + improvement of 2% to training from scratch, but still 4% behind supervised pre-training. 
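
The snippet below ties the tips above together: a minimal sketch of single-image classification, assuming the `google/vit-base-patch16-224` checkpoint and a sample COCO image URL (both are only illustrative and can be swapped for any ViT checkpoint and image).

```python
import torch
import requests
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

# Load a sample image (a COCO photo, used here purely for illustration)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The image processor resizes/rescales and normalizes the image for the model
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The [CLS] representation is classified over the checkpoint's ImageNet classes
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
```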
## Resources +Demo notebooks regarding inference as well as fine-tuning ViT on custom data can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer). A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ViT. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. `ViTForImageClassification` is supported by: @@ -129,7 +116,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - A blog post on [Deploying Hugging Face ViT on Vertex AI](https://huggingface.co/blog/deploy-vertex-ai) - A blog post on [Deploying Hugging Face ViT on Kubernetes with TF Serving](https://huggingface.co/blog/deploy-tfserving-kubernetes) - ## ViTConfig [[autodoc]] ViTConfig @@ -139,12 +125,14 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] ViTFeatureExtractor - __call__ - ## ViTImageProcessor [[autodoc]] ViTImageProcessor - preprocess + + + ## ViTModel [[autodoc]] ViTModel @@ -160,6 +148,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] ViTForImageClassification - forward + + + ## TFViTModel [[autodoc]] TFViTModel @@ -170,6 +161,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] TFViTForImageClassification - call + + + ## FlaxVitModel [[autodoc]] FlaxViTModel @@ -179,3 +173,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] FlaxViTForImageClassification - __call__ + + + diff --git a/docs/source/en/model_doc/vit_hybrid.mdx b/docs/source/en/model_doc/vit_hybrid.md similarity index 93% rename from docs/source/en/model_doc/vit_hybrid.mdx rename to docs/source/en/model_doc/vit_hybrid.md index 8885af0dfe0f22..52c0d35bc13538 100644 --- a/docs/source/en/model_doc/vit_hybrid.mdx +++ b/docs/source/en/model_doc/vit_hybrid.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Hybrid Vision Transformer (ViT Hybrid) @@ -21,7 +25,6 @@ Uszkoreit, Neil Houlsby. It's the first paper that successfully trains a Transfo very good results compared to familiar convolutional architectures. ViT hybrid is a slight variant of the [plain Vision Transformer](vit), by leveraging a convolutional backbone (specifically, [BiT](bit)) whose features are used as initial "tokens" for the Transformer. - The abstract from the paper is the following: *While the Transformer architecture has become the de-facto standard for natural language processing tasks, its @@ -36,7 +39,6 @@ substantially fewer computational resources to train.* This model was contributed by [nielsr](https://huggingface.co/nielsr). The original code (written in JAX) can be found [here](https://github.com/google-research/vision_transformer). 
- ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ViT Hybrid. @@ -44,10 +46,10 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`ViTHybridForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). +- See also: [Image classification task guide](../tasks/image_classification) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. - ## ViTHybridConfig [[autodoc]] ViTHybridConfig diff --git a/docs/source/en/model_doc/vit_mae.mdx b/docs/source/en/model_doc/vit_mae.md similarity index 95% rename from docs/source/en/model_doc/vit_mae.mdx rename to docs/source/en/model_doc/vit_mae.md index 714a68e152ef45..27d6d26816ae49 100644 --- a/docs/source/en/model_doc/vit_mae.mdx +++ b/docs/source/en/model_doc/vit_mae.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # ViTMAE @@ -28,7 +32,15 @@ enables us to train large models efficiently and effectively: we accelerate trai models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.* -Tips: + + + MAE architecture. Taken from the original paper. + +This model was contributed by [nielsr](https://huggingface.co/nielsr). TensorFlow version of the model was contributed by [sayakpaul](https://github.com/sayakpaul) and +[ariG23498](https://github.com/ariG23498) (equal contribution). The original code can be found [here](https://github.com/facebookresearch/mae). + +## Usage tips - MAE (masked auto encoding) is a method for self-supervised pre-training of Vision Transformers (ViTs). The pre-training objective is relatively simple: by masking a large portion (75%) of the image patches, the model must reconstruct raw pixel values. One can use [`ViTMAEForPreTraining`] for this purpose. @@ -40,14 +52,6 @@ consists of Transformer blocks) takes as input. Each mask token is a shared, lea sin/cos position embeddings are added both to the input of the encoder and the decoder. - For a visual understanding of how MAEs work you can check out this [post](https://keras.io/examples/vision/masked_image_modeling/). - - - MAE architecture. Taken from the original paper. - -This model was contributed by [nielsr](https://huggingface.co/nielsr). TensorFlow version of the model was contributed by [sayakpaul](https://github.com/sayakpaul) and -[ariG23498](https://github.com/ariG23498) (equal contribution). The original code can be found [here](https://github.com/facebookresearch/mae). 
- ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ViTMAE. @@ -61,26 +65,31 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] ViTMAEConfig + + ## ViTMAEModel [[autodoc]] ViTMAEModel - forward - ## ViTMAEForPreTraining [[autodoc]] transformers.ViTMAEForPreTraining - forward + + ## TFViTMAEModel [[autodoc]] TFViTMAEModel - call - ## TFViTMAEForPreTraining [[autodoc]] transformers.TFViTMAEForPreTraining - call + + + diff --git a/docs/source/en/model_doc/vit_msn.mdx b/docs/source/en/model_doc/vit_msn.md similarity index 94% rename from docs/source/en/model_doc/vit_msn.mdx rename to docs/source/en/model_doc/vit_msn.md index 47c1f69e2bea45..666b7dd0dfda83 100644 --- a/docs/source/en/model_doc/vit_msn.mdx +++ b/docs/source/en/model_doc/vit_msn.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # ViTMSN @@ -29,7 +33,13 @@ while producing representations of a high semantic level that perform competitiv on ImageNet-1K, with only 5,000 annotated images, our base MSN model achieves 72.4% top-1 accuracy, and with 1% of ImageNet-1K labels, we achieve 75.7% top-1 accuracy, setting a new state-of-the-art for self-supervised learning on this benchmark.* -Tips: +drawing + + MSN architecture. Taken from the original paper. + +This model was contributed by [sayakpaul](https://huggingface.co/sayakpaul). The original code can be found [here](https://github.com/facebookresearch/msn). + +## Usage tips - MSN (masked siamese networks) is a method for self-supervised pre-training of Vision Transformers (ViTs). The pre-training objective is to match the prototypes assigned to the unmasked views of the images to that of the masked views of the same images. @@ -39,13 +49,6 @@ use the [`ViTMSNForImageClassification`] class which is initialized from [`ViTMS - MSN is particularly useful in the low-shot and extreme low-shot regimes. Notably, it achieves 75.7% top-1 accuracy with only 1% of ImageNet-1K labels when fine-tuned. - -drawing - - MSN architecture. Taken from the original paper. - -This model was contributed by [sayakpaul](https://huggingface.co/sayakpaul). The original code can be found [here](https://github.com/facebookresearch/msn). - ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ViT MSN. @@ -53,6 +56,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`ViTMSNForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). +- See also: [Image classification task guide](../tasks/image_classification) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! 
The resource should ideally demonstrate something new instead of duplicating an existing resource. @@ -60,13 +64,11 @@ If you're interested in submitting a resource to be included here, please feel f [[autodoc]] ViTMSNConfig - ## ViTMSNModel [[autodoc]] ViTMSNModel - forward - ## ViTMSNForImageClassification [[autodoc]] ViTMSNForImageClassification diff --git a/docs/source/en/model_doc/vitdet.md b/docs/source/en/model_doc/vitdet.md new file mode 100644 index 00000000000000..81bf787d6cda24 --- /dev/null +++ b/docs/source/en/model_doc/vitdet.md @@ -0,0 +1,38 @@ + + +# ViTDet + +## Overview + +The ViTDet model was proposed in [Exploring Plain Vision Transformer Backbones for Object Detection](https://arxiv.org/abs/2203.16527) by Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He. +VitDet leverages the plain [Vision Transformer](vit) for the task of object detection. + +The abstract from the paper is the following: + +*We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP_box on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors.* + +This model was contributed by [nielsr](https://huggingface.co/nielsr). +The original code can be found [here](https://github.com/facebookresearch/detectron2/tree/main/projects/ViTDet). + +Tips: + +- At the moment, only the backbone is available. + +## VitDetConfig + +[[autodoc]] VitDetConfig + +## VitDetModel + +[[autodoc]] VitDetModel + - forward \ No newline at end of file diff --git a/docs/source/en/model_doc/vitmatte.md b/docs/source/en/model_doc/vitmatte.md new file mode 100644 index 00000000000000..5a6d501030fcdf --- /dev/null +++ b/docs/source/en/model_doc/vitmatte.md @@ -0,0 +1,55 @@ + + +# ViTMatte + +## Overview + +The ViTMatte model was proposed in [Boosting Image Matting with Pretrained Plain Vision Transformers](https://arxiv.org/abs/2305.15272) by Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang. +ViTMatte leverages plain [Vision Transformers](vit) for the task of image matting, which is the process of accurately estimating the foreground object in images and videos. + +The abstract from the paper is the following: + +*Recently, plain vision Transformers (ViTs) have shown impressive performance on various computer vision tasks, thanks to their strong modeling capacity and large-scale pretraining. However, they have not yet conquered the problem of image matting. We hypothesize that image matting could also be boosted by ViTs and present a new efficient and robust ViT-based matting system, named ViTMatte. Our method utilizes (i) a hybrid attention mechanism combined with a convolution neck to help ViTs achieve an excellent performance-computation trade-off in matting tasks. 
(ii) Additionally, we introduce the detail capture module, which just consists of simple lightweight convolutions to complement the detailed information required by matting. To the best of our knowledge, ViTMatte is the first work to unleash the potential of ViT on image matting with concise adaptation. It inherits many superior properties from ViT to matting, including various pretraining strategies, concise architecture design, and flexible inference strategies. We evaluate ViTMatte on Composition-1k and Distinctions-646, the most commonly used benchmark for image matting, our method achieves state-of-the-art performance and outperforms prior matting works by a large margin.* + +This model was contributed by [nielsr](https://huggingface.co/nielsr). +The original code can be found [here](https://github.com/hustvl/ViTMatte). + + + + ViTMatte high-level overview. Taken from the original paper. + +## Resources + +A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ViTMatte. + +- A demo notebook regarding inference with [`VitMatteForImageMatting`], including background replacement, can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/ViTMatte). + + + +The model expects both the image and trimap (concatenated) as input. Use [`ViTMatteImageProcessor`] for this purpose. + + +## VitMatteConfig + +[[autodoc]] VitMatteConfig + +## VitMatteImageProcessor + +[[autodoc]] VitMatteImageProcessor + - preprocess + +## VitMatteForImageMatting + +[[autodoc]] VitMatteForImageMatting + - forward \ No newline at end of file diff --git a/docs/source/en/model_doc/vits.md b/docs/source/en/model_doc/vits.md new file mode 100644 index 00000000000000..73001d82ed561d --- /dev/null +++ b/docs/source/en/model_doc/vits.md @@ -0,0 +1,161 @@ + + +# VITS + +## Overview + +The VITS model was proposed in [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son. + +VITS (**V**ariational **I**nference with adversarial learning for end-to-end **T**ext-to-**S**peech) is an end-to-end +speech synthesis model that predicts a speech waveform conditional on an input text sequence. It is a conditional variational +autoencoder (VAE) comprised of a posterior encoder, decoder, and conditional prior. + +A set of spectrogram-based acoustic features are predicted by the flow-based module, which is formed of a Transformer-based +text encoder and multiple coupling layers. The spectrogram is decoded using a stack of transposed convolutional layers, +much in the same style as the HiFi-GAN vocoder. Motivated by the one-to-many nature of the TTS problem, where the same text +input can be spoken in multiple ways, the model also includes a stochastic duration predictor, which allows the model to +synthesise speech with different rhythms from the same input text. + +The model is trained end-to-end with a combination of losses derived from variational lower bound and adversarial training. +To improve the expressiveness of the model, normalizing flows are applied to the conditional prior distribution. During +inference, the text encodings are up-sampled based on the duration prediction module, and then mapped into the +waveform using a cascade of the flow module and HiFi-GAN decoder. Due to the stochastic nature of the duration predictor, +the model is non-deterministic, and thus requires a fixed seed to generate the same speech waveform. 
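
As a minimal illustration of this stochasticity (a sketch assuming the `facebook/mms-tts-eng` checkpoint that also appears in the usage examples below), the following snippet synthesises the same sentence several times: without a fixed seed the sampled durations, and hence the waveform lengths, can differ between calls, while re-seeding makes the outputs identical.

```python
import torch
from transformers import VitsTokenizer, VitsModel, set_seed

tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")
model = VitsModel.from_pretrained("facebook/mms-tts-eng")
inputs = tokenizer(text="Hello - my dog is cute", return_tensors="pt")

def synthesise(seed=None):
    if seed is not None:
        set_seed(seed)  # fixes the stochastic duration predictor and flow sampling
    with torch.no_grad():
        return model(**inputs).waveform[0]

# Without a fixed seed, the sampled durations (and waveform lengths) can differ
print(synthesise().shape, synthesise().shape)

# With the same seed, the two waveforms are identical
print(torch.equal(synthesise(seed=555), synthesise(seed=555)))
```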
+ +The abstract from the paper is the following: + +*Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on the LJ Speech, a single speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.* + +This model can also be used with TTS checkpoints from [Massively Multilingual Speech (MMS)](https://arxiv.org/abs/2305.13516) +as these checkpoints use the same architecture and a slightly modified tokenizer. + +This model was contributed by [Matthijs](https://huggingface.co/Matthijs) and [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). The original code can be found [here](https://github.com/jaywalnut310/vits). + +## Usage examples + +Both the VITS and MMS-TTS checkpoints can be used with the same API. Since the flow-based model is non-deterministic, it +is good practice to set a seed to ensure reproducibility of the outputs. For languages with a Roman alphabet, +such as English or French, the tokenizer can be used directly to pre-process the text inputs. The following code example +runs a forward pass using the MMS-TTS English checkpoint: + +```python +import torch +from transformers import VitsTokenizer, VitsModel, set_seed + +tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng") +model = VitsModel.from_pretrained("facebook/mms-tts-eng") + +inputs = tokenizer(text="Hello - my dog is cute", return_tensors="pt") + +set_seed(555) # make deterministic + +with torch.no_grad(): + outputs = model(**inputs) + +waveform = outputs.waveform[0] +``` + +The resulting waveform can be saved as a `.wav` file: + +```python +import scipy + +scipy.io.wavfile.write("techno.wav", rate=model.config.sampling_rate, data=waveform) +``` + +Or displayed in a Jupyter Notebook / Google Colab: + +```python +from IPython.display import Audio + +Audio(waveform, rate=model.config.sampling_rate) +``` + +For certain languages with a non-Roman alphabet, such as Arabic, Mandarin or Hindi, the [`uroman`](https://github.com/isi-nlp/uroman) +perl package is required to pre-process the text inputs to the Roman alphabet. + +You can check whether you require the `uroman` package for your language by inspecting the `is_uroman` attribute of +the pre-trained `tokenizer`: + +```python +from transformers import VitsTokenizer + +tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng") +print(tokenizer.is_uroman) +``` + +If required, you should apply the uroman package to your text inputs **prior** to passing them to the `VitsTokenizer`, +since currently the tokenizer does not support performing the pre-processing itself. 
+ +To do this, first clone the uroman repository to your local machine and set the bash variable `UROMAN` to the local path: + +```bash +git clone https://github.com/isi-nlp/uroman.git +cd uroman +export UROMAN=$(pwd) +``` + +You can then pre-process the text input using the following code snippet. You can either rely on using the bash variable +`UROMAN` to point to the uroman repository, or you can pass the uroman directory as an argument to the `uromaize` function: + +```python +import torch +from transformers import VitsTokenizer, VitsModel, set_seed +import os +import subprocess + +tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-kor") +model = VitsModel.from_pretrained("facebook/mms-tts-kor") + +def uromanize(input_string, uroman_path): + """Convert non-Roman strings to Roman using the `uroman` perl package.""" + script_path = os.path.join(uroman_path, "bin", "uroman.pl") + + command = ["perl", script_path] + + process = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE) + # Execute the perl command + stdout, stderr = process.communicate(input=input_string.encode()) + + if process.returncode != 0: + raise ValueError(f"Error {process.returncode}: {stderr.decode()}") + + # Return the output as a string and skip the new-line character at the end + return stdout.decode()[:-1] + +text = "이봐 무슨 일이야" +uromaized_text = uromanize(text, uroman_path=os.environ["UROMAN"]) + +inputs = tokenizer(text=uromaized_text, return_tensors="pt") + +set_seed(555) # make deterministic +with torch.no_grad(): + outputs = model(inputs["input_ids"]) + +waveform = outputs.waveform[0] +``` + +## VitsConfig + +[[autodoc]] VitsConfig + +## VitsTokenizer + +[[autodoc]] VitsTokenizer + - __call__ + - save_vocabulary + +## VitsModel + +[[autodoc]] VitsModel + - forward diff --git a/docs/source/en/model_doc/vivit.md b/docs/source/en/model_doc/vivit.md new file mode 100644 index 00000000000000..4426493a0ff585 --- /dev/null +++ b/docs/source/en/model_doc/vivit.md @@ -0,0 +1,43 @@ + + +# Video Vision Transformer (ViViT) + +## Overview + +The Vivit model was proposed in [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid. +The paper proposes one of the first successful pure-transformer based set of models for video understanding. + +The abstract from the paper is the following: + +*We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several, efficient variants of our model which factorise the spatial- and temporal-dimensions of the input. Although transformer-based models are known to only be effective when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks.* + +This model was contributed by [jegormeister](https://huggingface.co/jegormeister). 
The original code (written in JAX) can be found [here](https://github.com/google-research/scenic/tree/main/scenic/projects/vivit). + +## VivitConfig + +[[autodoc]] VivitConfig + +## VivitImageProcessor + +[[autodoc]] VivitImageProcessor + - preprocess + +## VivitModel + +[[autodoc]] VivitModel + - forward + +## VivitForVideoClassification + +[[autodoc]] transformers.VivitForVideoClassification + - forward diff --git a/docs/source/en/model_doc/wav2vec2-bert.md b/docs/source/en/model_doc/wav2vec2-bert.md new file mode 100644 index 00000000000000..6514133330a9d4 --- /dev/null +++ b/docs/source/en/model_doc/wav2vec2-bert.md @@ -0,0 +1,90 @@ + + +# Wav2Vec2-BERT + +## Overview + +The Wav2Vec2-BERT model was proposed in [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team from Meta AI. + +This model was pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages. It requires finetuning to be used for downstream tasks such as Automatic Speech Recognition (ASR), or Audio Classification. + +The official results of the model can be found in Section 3.2.1 of the paper. + +The abstract from the paper is the following: + +*Recent advancements in automatic speech translation have dramatically expanded language coverage, improved multimodal capabilities, and enabled a wide range of tasks and functionalities. That said, large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model—SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. The expanded version of SeamlessAlign adds 114,800 hours of automatically aligned data for a total of 76 languages. SeamlessM4T v2 provides the foundation on which our two newest models, SeamlessExpressive and SeamlessStreaming, are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one’s voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention (EMMA) mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To understand the performance of these models, we combined novel and modified versions of existing automatic metrics to evaluate prosody, latency, and robustness. For human evaluations, we adapted existing protocols tailored for measuring the most relevant attributes in the preservation of meaning, naturalness, and expressivity. 
To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. In sum, Seamless gives us a pivotal look at the technical foundation needed to turn the Universal Speech Translator from a science fiction concept into a real-world technology. Finally, contributions in this work—including models, code, and a watermark detector—are publicly released and accessible at the link below.* + +This model was contributed by [ylacombe](https://huggingface.co/ylacombe). The original code can be found [here](https://github.com/facebookresearch/seamless_communication). + +## Usage tips + +- Wav2Vec2-BERT follows the same architecture as Wav2Vec2-Conformer, but employs a causal depthwise convolutional layer and uses as input a mel-spectrogram representation of the audio instead of the raw waveform. +- Wav2Vec2-BERT can use either no relative position embeddings, Shaw-like position embeddings, Transformer-XL-like position embeddings, or + rotary position embeddings by setting the correct `config.position_embeddings_type`. +- Wav2Vec2-BERT also introduces a Conformer-based adapter network instead of a simple convolutional network. + +## Resources + + + +- [`Wav2Vec2BertForCTC`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition). +- You can also adapt these notebooks on [how to finetune a speech recognition model in English](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/speech_recognition.ipynb), and [how to finetune a speech recognition model in any language](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multi_lingual_speech_recognition.ipynb). + + + +- [`Wav2Vec2BertForSequenceClassification`] can be used by adapting this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/audio-classification). 
+- See also: [Audio classification task guide](../tasks/audio_classification) + + +## Wav2Vec2BertConfig + +[[autodoc]] Wav2Vec2BertConfig + +## Wav2Vec2BertProcessor + +[[autodoc]] Wav2Vec2BertProcessor + - __call__ + - pad + - from_pretrained + - save_pretrained + - batch_decode + - decode + +## Wav2Vec2BertModel + +[[autodoc]] Wav2Vec2BertModel + - forward + +## Wav2Vec2BertForCTC + +[[autodoc]] Wav2Vec2BertForCTC + - forward + +## Wav2Vec2BertForSequenceClassification + +[[autodoc]] Wav2Vec2BertForSequenceClassification + - forward + +## Wav2Vec2BertForAudioFrameClassification + +[[autodoc]] Wav2Vec2BertForAudioFrameClassification + - forward + +## Wav2Vec2BertForXVector + +[[autodoc]] Wav2Vec2BertForXVector + - forward diff --git a/docs/source/en/model_doc/wav2vec2-conformer.mdx b/docs/source/en/model_doc/wav2vec2-conformer.md similarity index 89% rename from docs/source/en/model_doc/wav2vec2-conformer.mdx rename to docs/source/en/model_doc/wav2vec2-conformer.md index 2cfb38553f1e84..c32c03bb0cb7ac 100644 --- a/docs/source/en/model_doc/wav2vec2-conformer.mdx +++ b/docs/source/en/model_doc/wav2vec2-conformer.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Wav2Vec2-Conformer @@ -20,7 +24,10 @@ The official results of the model can be found in Table 3 and Table 4 of the pap The Wav2Vec2-Conformer weights were released by the Meta AI team within the [Fairseq library](https://github.com/pytorch/fairseq/blob/main/examples/wav2vec/README.md#pre-trained-models). -Tips: +This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). +The original code can be found [here](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec). + +## Usage tips - Wav2Vec2-Conformer follows the same architecture as Wav2Vec2, but replaces the *Attention*-block with a *Conformer*-block as introduced in [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100). @@ -30,9 +37,10 @@ an improved word error rate. - Wav2Vec2-Conformer can use either no relative position embeddings, Transformer-XL-like position embeddings, or rotary position embeddings by setting the correct `config.position_embeddings_type`. -This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). -The original code can be found [here](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec). 
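
To make the last tip concrete, the sketch below selects the position embedding variant through the configuration. The randomly initialised model is purely illustrative, and the `facebook/wav2vec2-conformer-rope-large-960h-ft` checkpoint name is an assumption based on the released rotary-embedding weights.

```python
from transformers import Wav2Vec2ConformerConfig, Wav2Vec2ConformerForCTC

# Randomly initialised Conformer model using rotary position embeddings;
# "relative" (Transformer-XL-like) or None are the other options mentioned above.
config = Wav2Vec2ConformerConfig(position_embeddings_type="rotary", vocab_size=32)
model = Wav2Vec2ConformerForCTC(config)

# Alternatively, load released weights trained with rotary embeddings
# (checkpoint name assumed; see the Fairseq release table for the full list)
model = Wav2Vec2ConformerForCTC.from_pretrained("facebook/wav2vec2-conformer-rope-large-960h-ft")
```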
+## Resources +- [Audio classification task guide](../tasks/audio_classification) +- [Automatic speech recognition task guide](../tasks/asr) ## Wav2Vec2ConformerConfig diff --git a/docs/source/en/model_doc/wav2vec2.mdx b/docs/source/en/model_doc/wav2vec2.md similarity index 92% rename from docs/source/en/model_doc/wav2vec2.mdx rename to docs/source/en/model_doc/wav2vec2.md index 3acf176a27a8a7..b26e4db6f1b6cc 100644 --- a/docs/source/en/model_doc/wav2vec2.mdx +++ b/docs/source/en/model_doc/wav2vec2.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Wav2Vec2 @@ -27,14 +31,14 @@ of the art on the 100 hour subset while using 100 times less labeled data. Using pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.* -Tips: +This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). + +## Usage tips - Wav2Vec2 is a speech model that accepts a float array corresponding to the raw waveform of the speech signal. - Wav2Vec2 model was trained using connectionist temporal classification (CTC) so the model output has to be decoded using [`Wav2Vec2CTCTokenizer`]. -This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). - ## Resources A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Wav2Vec2. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. @@ -43,6 +47,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - A notebook on how to [leverage a pretrained Wav2Vec2 model for emotion classification](https://colab.research.google.com/github/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb). 🌎 - [`Wav2Vec2ForCTC`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/audio-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb). +- [Audio classification task guide](../tasks/audio_classification) @@ -51,10 +56,11 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - A blog post on [finetuning XLS-R for Multi-Lingual ASR with 🤗 Transformers](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2). - A notebook on how to [create YouTube captions from any video by transcribing audio with Wav2Vec2](https://colab.research.google.com/github/Muennighoff/ytclipcc/blob/main/wav2vec_youtube_captions.ipynb). 
🌎 - [`Wav2Vec2ForCTC`] is supported by a notebook on [how to finetune a speech recognition model in English](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/speech_recognition.ipynb), and [how to finetune a speech recognition model in any language](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multi_lingual_speech_recognition.ipynb). +- [Automatic speech recognition task guide](../tasks/asr) 🚀 Deploy -- A blog post on how to deploy Wav2Vec2 for [Automatic Speech Recogntion with Hugging Face's Transformers & Amazon SageMaker](https://www.philschmid.de/automatic-speech-recognition-sagemaker). +- A blog post on how to deploy Wav2Vec2 for [Automatic Speech Recognition with Hugging Face's Transformers & Amazon SageMaker](https://www.philschmid.de/automatic-speech-recognition-sagemaker). ## Wav2Vec2Config @@ -67,6 +73,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - save_vocabulary - decode - batch_decode + - set_target_lang ## Wav2Vec2FeatureExtractor @@ -160,6 +167,9 @@ Otherwise, [`~Wav2Vec2ProcessorWithLM.batch_decode`] performance will be slower [[autodoc]] models.wav2vec2.modeling_flax_wav2vec2.FlaxWav2Vec2ForPreTrainingOutput + + + ## Wav2Vec2Model [[autodoc]] Wav2Vec2Model @@ -169,6 +179,7 @@ Otherwise, [`~Wav2Vec2ProcessorWithLM.batch_decode`] performance will be slower [[autodoc]] Wav2Vec2ForCTC - forward + - load_adapter ## Wav2Vec2ForSequenceClassification @@ -190,16 +201,27 @@ Otherwise, [`~Wav2Vec2ProcessorWithLM.batch_decode`] performance will be slower [[autodoc]] Wav2Vec2ForPreTraining - forward + + + ## TFWav2Vec2Model [[autodoc]] TFWav2Vec2Model - call +## TFWav2Vec2ForSequenceClassification + +[[autodoc]] TFWav2Vec2ForSequenceClassification + - call + ## TFWav2Vec2ForCTC [[autodoc]] TFWav2Vec2ForCTC - call + + + ## FlaxWav2Vec2Model [[autodoc]] FlaxWav2Vec2Model @@ -214,3 +236,6 @@ Otherwise, [`~Wav2Vec2ProcessorWithLM.batch_decode`] performance will be slower [[autodoc]] FlaxWav2Vec2ForPreTraining - __call__ + + + diff --git a/docs/source/en/model_doc/wav2vec2_phoneme.mdx b/docs/source/en/model_doc/wav2vec2_phoneme.md similarity index 84% rename from docs/source/en/model_doc/wav2vec2_phoneme.mdx rename to docs/source/en/model_doc/wav2vec2_phoneme.md index b39cf66ce1368c..93e0656f493c0f 100644 --- a/docs/source/en/model_doc/wav2vec2_phoneme.mdx +++ b/docs/source/en/model_doc/wav2vec2_phoneme.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Wav2Vec2Phoneme @@ -27,7 +31,13 @@ mapping phonemes of the training languages to the target language using articula this simple method significantly outperforms prior work which introduced task-specific architectures and used only part of a monolingually pretrained model.* -Tips: +Relevant checkpoints can be found under https://huggingface.co/models?other=phoneme-recognition. 
+ +This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten) + +The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/fairseq/models/wav2vec). + +## Usage tips - Wav2Vec2Phoneme uses the exact same architecture as Wav2Vec2 - Wav2Vec2Phoneme is a speech model that accepts a float array corresponding to the raw waveform of the speech signal. @@ -35,17 +45,16 @@ Tips: decoded using [`Wav2Vec2PhonemeCTCTokenizer`]. - Wav2Vec2Phoneme can be fine-tuned on multiple language at once and decode unseen languages in a single forward pass to a sequence of phonemes -- By default the model outputs a sequence of phonemes. In order to transform the phonemes to a sequence of words one +- By default, the model outputs a sequence of phonemes. In order to transform the phonemes to a sequence of words one should make use of a dictionary and language model. -Relevant checkpoints can be found under https://huggingface.co/models?other=phoneme-recognition. - -This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten) -The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/fairseq/models/wav2vec). + -Wav2Vec2Phoneme's architecture is based on the Wav2Vec2 model, so one can refer to [`Wav2Vec2`]'s documentation page except for the tokenizer. +Wav2Vec2Phoneme's architecture is based on the Wav2Vec2 model, for API reference, check out [`Wav2Vec2`](wav2vec2)'s documentation page +except for the tokenizer. + ## Wav2Vec2PhonemeCTCTokenizer diff --git a/docs/source/en/model_doc/wavlm.mdx b/docs/source/en/model_doc/wavlm.md similarity index 87% rename from docs/source/en/model_doc/wavlm.mdx rename to docs/source/en/model_doc/wavlm.md index 8e2138a61187ad..a42fbff139588c 100644 --- a/docs/source/en/model_doc/wavlm.mdx +++ b/docs/source/en/model_doc/wavlm.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # WavLM @@ -27,11 +31,16 @@ challenging. In this paper, we propose a new pre-trained model, WavLM, to solve WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation. We first equip the Transformer structure with gated relative position bias to improve its capability on recognition tasks. For better speaker discrimination, we propose an utterance mixing training strategy, where -additional overlapped utterances are created unsupervisely and incorporated during model training. Lastly, we scale up +additional overlapped utterances are created unsupervisedly and incorporated during model training. Lastly, we scale up the training dataset from 60k hours to 94k hours. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark, and brings significant improvements for various speech processing tasks on their representative benchmarks.* -Tips: +Relevant checkpoints can be found under https://huggingface.co/models?other=wavlm. + +This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). 
The Authors' code can be +found [here](https://github.com/microsoft/unilm/tree/master/wavlm). + +## Usage tips - WavLM is a speech model that accepts a float array corresponding to the raw waveform of the speech signal. Please use [`Wav2Vec2Processor`] for the feature extraction. @@ -39,11 +48,10 @@ Tips: using [`Wav2Vec2CTCTokenizer`]. - WavLM performs especially well on speaker verification, speaker identification, and speaker diarization tasks. -Relevant checkpoints can be found under https://huggingface.co/models?other=wavlm. - -This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The Authors' code can be -found [here](https://github.com/microsoft/unilm/tree/master/wavlm). +## Resources +- [Audio classification task guide](../tasks/audio_classification) +- [Automatic speech recognition task guide](../tasks/asr) ## WavLMConfig diff --git a/docs/source/en/model_doc/whisper.md b/docs/source/en/model_doc/whisper.md new file mode 100644 index 00000000000000..138f2b374bf347 --- /dev/null +++ b/docs/source/en/model_doc/whisper.md @@ -0,0 +1,192 @@ + + +# Whisper + +## Overview + +The Whisper model was proposed in [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever. + +The abstract from the paper is the following: + +*We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zeroshot transfer setting without the need for any finetuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.* + +This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ). The Tensorflow version of this model was contributed by [amyeroberts](https://huggingface.co/amyeroberts). +The original code can be found [here](https://github.com/openai/whisper). + +## Usage tips + +- The model usually performs well without requiring any finetuning. +- The architecture follows a classic encoder-decoder architecture, which means that it relies on the [`~generation.GenerationMixin.generate`] function for inference. +- One can use [`WhisperProcessor`] to prepare audio for the model, and decode the predicted ID's back into text. + +- To convert the model and the processor, we recommend using the following: + +```bash +python src/transformers/models/whisper/convert_openai_to_hf.py --checkpoint_path "" --pytorch_dump_folder_path "Arthur/whisper-3" --convert_preprocessor True +``` +The script will automatically determine all necessary parameters from the OpenAI checkpoint. A `tiktoken` library needs to be installed +to perform the conversion of the OpenAI tokenizer to the `tokenizers` version. 
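If `tiktoken` is not already present in your environment, it can typically be installed from PyPI first (a minimal sketch, assuming a standard Python setup):

```bash
pip install tiktoken
```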
+ +## Inference + +Here is a step-by-step guide to transcribing an audio sample using a pre-trained Whisper model: + +```python +>>> from datasets import load_dataset +>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration + +>>> # Select an audio file and read it: +>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") +>>> audio_sample = ds[0]["audio"] +>>> waveform = audio_sample["array"] +>>> sampling_rate = audio_sample["sampling_rate"] + +>>> # Load the Whisper model in Hugging Face format: +>>> processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en") +>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en") + +>>> # Use the model and processor to transcribe the audio: +>>> input_features = processor( +... waveform, sampling_rate=sampling_rate, return_tensors="pt" +... ).input_features + +>>> # Generate token ids +>>> predicted_ids = model.generate(input_features) + +>>> # Decode token ids to text +>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True) + +>>> transcription[0] +' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.' +``` + +## Resources + +A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Whisper. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. + +- A fork with a script to [convert a Whisper model in Hugging Face format to OpenAI format](https://github.com/zuazo-forks/transformers/blob/convert_hf_to_openai/src/transformers/models/whisper/convert_hf_to_openai.py). 
🌎 +Usage example: +```bash +pip install -U openai-whisper +python convert_hf_to_openai.py \ + --checkpoint openai/whisper-tiny \ + --whisper_dump_path whisper-tiny-openai.pt +``` + +## WhisperConfig + +[[autodoc]] WhisperConfig + +## WhisperTokenizer + +[[autodoc]] WhisperTokenizer + - set_prefix_tokens + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - save_vocabulary + - batch_decode + - decode + - basic_normalize + - normalize + +## WhisperTokenizerFast + +[[autodoc]] WhisperTokenizerFast + - set_prefix_tokens + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - save_vocabulary + - batch_decode + - decode + - basic_normalize + - normalize + +## WhisperFeatureExtractor + +[[autodoc]] WhisperFeatureExtractor + - __call__ + +## WhisperProcessor + +[[autodoc]] WhisperProcessor + - __call__ + - from_pretrained + - save_pretrained + - batch_decode + - decode + + + + +## WhisperModel + +[[autodoc]] WhisperModel + - forward + - _mask_input_features + +## WhisperForConditionalGeneration + +[[autodoc]] WhisperForConditionalGeneration + - forward + - generate + +## WhisperForCausalLM + +[[autodoc]] WhisperForCausalLM + - forward + +## WhisperForAudioClassification + +[[autodoc]] WhisperForAudioClassification + - forward + + + + +## TFWhisperModel + +[[autodoc]] TFWhisperModel + - call + +## TFWhisperForConditionalGeneration + +[[autodoc]] TFWhisperForConditionalGeneration + - call + + + + +## FlaxWhisperModel + +[[autodoc]] FlaxWhisperModel + - __call__ + +## FlaxWhisperForConditionalGeneration + +[[autodoc]] FlaxWhisperForConditionalGeneration + - __call__ + +## FlaxWhisperForAudioClassification + +[[autodoc]] FlaxWhisperForAudioClassification + - __call__ + + + + diff --git a/docs/source/en/model_doc/whisper.mdx b/docs/source/en/model_doc/whisper.mdx deleted file mode 100644 index 4b7a6028618427..00000000000000 --- a/docs/source/en/model_doc/whisper.mdx +++ /dev/null @@ -1,81 +0,0 @@ - - -# Whisper - -## Overview - -The Whisper model was proposed in [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever. - -The abstract from the paper is the following: - -*We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zeroshot transfer setting without the need for any finetuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.* - - -Tips: - -- The model usually performs well without requiring any finetuning. -- The architecture follows a classic encoder-decoder architecture, which means that it relies on the [`~generation.GenerationMixin.generate`] function for inference. -- Inference is currently only implemented for short-form i.e. audio is pre-segmented into <=30s segments. Long-form (including timestamps) will be implemented in a future release. -- One can use [`WhisperProcessor`] to prepare audio for the model, and decode the predicted ID's back into text. 
- -This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ). The Tensorflow version of this model was contributed by [amyeroberts](https://huggingface.co/amyeroberts). -The original code can be found [here](https://github.com/openai/whisper). - - -## WhisperConfig - -[[autodoc]] WhisperConfig - -## WhisperTokenizer - -[[autodoc]] WhisperTokenizer - - set_prefix_tokens - - build_inputs_with_special_tokens - - get_special_tokens_mask - - create_token_type_ids_from_sequences - - save_vocabulary - -## WhisperFeatureExtractor - -[[autodoc]] WhisperFeatureExtractor - - __call__ - -## WhisperProcessor - -[[autodoc]] WhisperProcessor - - __call__ - - from_pretrained - - save_pretrained - - batch_decode - - decode - -## WhisperModel - -[[autodoc]] WhisperModel - - forward - -## WhisperForConditionalGeneration - -[[autodoc]] WhisperForConditionalGeneration - - forward - - -## TFWhisperModel - -[[autodoc]] TFWhisperModel - - call - -## TFWhisperForConditionalGeneration - -[[autodoc]] TFWhisperForConditionalGeneration - - call diff --git a/docs/source/en/model_doc/xclip.mdx b/docs/source/en/model_doc/xclip.md similarity index 96% rename from docs/source/en/model_doc/xclip.mdx rename to docs/source/en/model_doc/xclip.md index a49ed8b9130cde..45c4c3db749be8 100644 --- a/docs/source/en/model_doc/xclip.mdx +++ b/docs/source/en/model_doc/xclip.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # X-CLIP diff --git a/docs/source/en/model_doc/xglm.mdx b/docs/source/en/model_doc/xglm.md similarity index 91% rename from docs/source/en/model_doc/xglm.mdx rename to docs/source/en/model_doc/xglm.md index e35bab25f89c4e..470e42c747bec6 100644 --- a/docs/source/en/model_doc/xglm.mdx +++ b/docs/source/en/model_doc/xglm.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # XGLM @@ -38,6 +42,10 @@ in social value tasks such as hate speech detection in five languages and find i This model was contributed by [Suraj](https://huggingface.co/valhalla). The original code can be found [here](https://github.com/pytorch/fairseq/tree/main/examples/xglm). +## Resources + +- [Causal language modeling task guide](../tasks/language_modeling) + ## XGLMConfig [[autodoc]] XGLMConfig @@ -54,6 +62,9 @@ This model was contributed by [Suraj](https://huggingface.co/valhalla). The orig [[autodoc]] XGLMTokenizerFast + + + ## XGLMModel [[autodoc]] XGLMModel @@ -64,6 +75,9 @@ This model was contributed by [Suraj](https://huggingface.co/valhalla). 
The orig [[autodoc]] XGLMForCausalLM - forward + + + ## TFXGLMModel [[autodoc]] TFXGLMModel @@ -74,6 +88,9 @@ This model was contributed by [Suraj](https://huggingface.co/valhalla). The orig [[autodoc]] TFXGLMForCausalLM - call + + + ## FlaxXGLMModel [[autodoc]] FlaxXGLMModel @@ -82,4 +99,7 @@ This model was contributed by [Suraj](https://huggingface.co/valhalla). The orig ## FlaxXGLMForCausalLM [[autodoc]] FlaxXGLMForCausalLM - - __call__ \ No newline at end of file + - __call__ + + + \ No newline at end of file diff --git a/docs/source/en/model_doc/xlm-prophetnet.mdx b/docs/source/en/model_doc/xlm-prophetnet.md similarity index 86% rename from docs/source/en/model_doc/xlm-prophetnet.mdx rename to docs/source/en/model_doc/xlm-prophetnet.md index 699f9192c24304..7a61aeb3e34a0a 100644 --- a/docs/source/en/model_doc/xlm-prophetnet.mdx +++ b/docs/source/en/model_doc/xlm-prophetnet.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # XLM-ProphetNet @@ -32,7 +36,7 @@ Zhang, Ming Zhou on 13 Jan, 2020. XLM-ProphetNet is an encoder-decoder model and can predict n-future tokens for "ngram" language modeling instead of just the next token. Its architecture is identical to ProhpetNet, but the model was trained on the multi-lingual -"wiki100" Wikipedia dump. +"wiki100" Wikipedia dump. XLM-ProphetNet's model architecture and pretraining objective is same as ProphetNet, but XLM-ProphetNet was pre-trained on the cross-lingual dataset XGLUE. The abstract from the paper is the following: @@ -48,9 +52,11 @@ state-of-the-art results on all these datasets compared to the models using the The Authors' code can be found [here](https://github.com/microsoft/ProphetNet). -Tips: +## Resources -- XLM-ProphetNet's model architecture and pretraining objective is same as ProphetNet, but XLM-ProphetNet was pre-trained on the cross-lingual dataset XGLUE. +- [Causal language modeling task guide](../tasks/language_modeling) +- [Translation task guide](../tasks/translation) +- [Summarization task guide](../tasks/summarization) ## XLMProphetNetConfig diff --git a/docs/source/en/model_doc/xlm-roberta-xl.mdx b/docs/source/en/model_doc/xlm-roberta-xl.md similarity index 74% rename from docs/source/en/model_doc/xlm-roberta-xl.mdx rename to docs/source/en/model_doc/xlm-roberta-xl.md index 01829a128c00f3..f9cb78c0bf4e65 100644 --- a/docs/source/en/model_doc/xlm-roberta-xl.mdx +++ b/docs/source/en/model_doc/xlm-roberta-xl.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. 
+ --> # XLM-RoBERTa-XL @@ -20,14 +24,22 @@ The abstract from the paper is the following: *Recent work has demonstrated the effectiveness of cross-lingual language model pretraining for cross-lingual understanding. In this study, we present the results of two larger multilingual masked language models, with 3.5B and 10.7B parameters. Our two new models dubbed XLM-R XL and XLM-R XXL outperform XLM-R by 1.8% and 2.4% average accuracy on XNLI. Our model also outperforms the RoBERTa-Large model on several English tasks of the GLUE benchmark by 0.3% on average while handling 99 more languages. This suggests pretrained models with larger capacity may obtain both strong performance on high-resource languages while greatly improving low-resource languages. We make our code and models publicly available.* -Tips: +This model was contributed by [Soonhwan-Kwon](https://github.com/Soonhwan-Kwon) and [stefan-it](https://huggingface.co/stefan-it). The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/xlmr). -- XLM-RoBERTa-XL is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does - not require `lang` tensors to understand which language is used, and should be able to determine the correct - language from the input ids. +## Usage tips -This model was contributed by [Soonhwan-Kwon](https://github.com/Soonhwan-Kwon) and [stefan-it](https://huggingface.co/stefan-it). The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/xlmr). +XLM-RoBERTa-XL is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does +not require `lang` tensors to understand which language is used, and should be able to determine the correct +language from the input ids. + +## Resources +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Causal language modeling task guide](../tasks/language_modeling) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) ## XLMRobertaXLConfig diff --git a/docs/source/en/model_doc/xlm-roberta.mdx b/docs/source/en/model_doc/xlm-roberta.md similarity index 91% rename from docs/source/en/model_doc/xlm-roberta.mdx rename to docs/source/en/model_doc/xlm-roberta.md index 0a61d5c67a244e..58540015232e9d 100644 --- a/docs/source/en/model_doc/xlm-roberta.mdx +++ b/docs/source/en/model_doc/xlm-roberta.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # XLM-RoBERTa @@ -42,16 +46,14 @@ languages at scale. Finally, we show, for the first time, the possibility of mul per-language performance; XLM-Ris very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make XLM-R code, data, and models publicly available.* -Tips: +This model was contributed by [stefan-it](https://huggingface.co/stefan-it). 
The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/xlmr). + +## Usage tips - XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does not require `lang` tensors to understand which language is used, and should be able to determine the correct language from the input ids. - Uses RoBERTa tricks on the XLM approach, but does not use the translation language modeling objective. It only uses masked language modeling on sentences coming from one language. -- This implementation is the same as RoBERTa. Refer to the [documentation of RoBERTa](roberta) for usage examples - as well as the information relative to the inputs and outputs. - -This model was contributed by [stefan-it](https://huggingface.co/stefan-it). The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/examples/xlmr). ## Resources @@ -64,6 +66,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`TFXLMRobertaForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/text-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb). - [`FlaxXLMRobertaForSequenceClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/text-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification_flax.ipynb). - [Text classification](https://huggingface.co/docs/transformers/tasks/sequence_classification) chapter of the 🤗 Hugging Face Task Guides. +- [Text classification task guide](../tasks/sequence_classification) @@ -71,11 +74,13 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`TFXLMRobertaForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/token-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb). - [`FlaxXLMRobertaForTokenClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/token-classification). - [Token classification](https://huggingface.co/course/chapter7/2?fw=pt) chapter of the 🤗 Hugging Face Course. +- [Token classification task guide](../tasks/token_classification) - [`XLMRobertaForCausalLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb). -- [Causal language modeling](https://huggingface.co/docs/transformers/tasks/language_modeling) chapter of the 🤗 Hugging Face Task Guides. +- [Causal language modeling](https://huggingface.co/docs/transformers/tasks/language_modeling) chapter of the 🤗 Hugging Face Task Guides. 
+- [Causal language modeling task guide](../tasks/language_modeling) @@ -83,6 +88,7 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`TFXLMRobertaForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/language-modeling#run_mlmpy) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb). - [`FlaxXLMRobertaForMaskedLM`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling#masked-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/masked_language_modeling_flax.ipynb). - [Masked language modeling](https://huggingface.co/course/chapter7/3?fw=pt) chapter of the 🤗 Hugging Face Course. +- [Masked language modeling](../tasks/masked_language_modeling) @@ -90,15 +96,22 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - [`TFXLMRobertaForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/question-answering) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb). - [`FlaxXLMRobertaForQuestionAnswering`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/flax/question-answering). - [Question answering](https://huggingface.co/course/chapter7/7?fw=pt) chapter of the 🤗 Hugging Face Course. +- [Question answering task guide](../tasks/question_answering) **Multiple choice** - [`XLMRobertaForMultipleChoice`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/multiple-choice) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb). - [`TFXLMRobertaForMultipleChoice`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/multiple-choice) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice-tf.ipynb). +- [Multiple choice task guide](../tasks/multiple_choice) 🚀 Deploy -- A blog post on how to [Deploy Serveless XLM RoBERTa on AWS Lambda](https://www.philschmid.de/multilingual-serverless-xlm-roberta-with-huggingface). +- A blog post on how to [Deploy Serverless XLM RoBERTa on AWS Lambda](https://www.philschmid.de/multilingual-serverless-xlm-roberta-with-huggingface). + + + +This implementation is the same as RoBERTa. Refer to the [documentation of RoBERTa](roberta) for usage examples as well as the information relative to the inputs and outputs. 
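Because the implementation mirrors RoBERTa, a quick masked language modeling check works the same way as for RoBERTa. The snippet below is a minimal sketch, assuming the publicly available `xlm-roberta-base` checkpoint and that `transformers`, `torch`, and `sentencepiece` are installed:

```py
from transformers import pipeline

# XLM-RoBERTa uses <mask> as its mask token; the fill-mask pipeline handles tokenization and decoding
unmasker = pipeline("fill-mask", model="xlm-roberta-base")
predictions = unmasker("Bonjour, je suis un modèle <mask>.")
print(predictions[0]["token_str"], round(predictions[0]["score"], 3))
```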
+ ## XLMRobertaConfig @@ -116,6 +129,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] XLMRobertaTokenizerFast + + + ## XLMRobertaModel [[autodoc]] XLMRobertaModel @@ -151,6 +167,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] XLMRobertaForQuestionAnswering - forward + + + ## TFXLMRobertaModel [[autodoc]] TFXLMRobertaModel @@ -186,6 +205,9 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] TFXLMRobertaForQuestionAnswering - call + + + ## FlaxXLMRobertaModel [[autodoc]] FlaxXLMRobertaModel @@ -220,3 +242,6 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h [[autodoc]] FlaxXLMRobertaForQuestionAnswering - __call__ + + + \ No newline at end of file diff --git a/docs/source/en/model_doc/xlm-v.mdx b/docs/source/en/model_doc/xlm-v.md similarity index 89% rename from docs/source/en/model_doc/xlm-v.mdx rename to docs/source/en/model_doc/xlm-v.md index 4ad07edecbc66f..049a1f35ad9ab7 100644 --- a/docs/source/en/model_doc/xlm-v.mdx +++ b/docs/source/en/model_doc/xlm-v.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # XLM-V @@ -31,7 +35,10 @@ a multilingual language model with a one million token vocabulary. XLM-V outperf tested on ranging from natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), and named entity recognition (WikiAnn) to low-resource tasks (Americas NLI, MasakhaNER).* -Tips: +This model was contributed by [stefan-it](https://huggingface.co/stefan-it), including detailed experiments with XLM-V on downstream tasks. +The experiments repository can be found [here](https://github.com/stefan-it/xlm-v-experiments). + +## Usage tips - XLM-V is compatible with the XLM-RoBERTa model architecture, only model weights from [`fairseq`](https://github.com/facebookresearch/fairseq) library had to be converted. @@ -39,5 +46,7 @@ Tips: A XLM-V (base size) model is available under the [`facebook/xlm-v-base`](https://huggingface.co/facebook/xlm-v-base) identifier. -This model was contributed by [stefan-it](https://huggingface.co/stefan-it), including detailed experiments with XLM-V on downstream tasks. -The experiments repository can be found [here](https://github.com/stefan-it/xlm-v-experiments). + + +XLM-V architecture is the same as XLM-RoBERTa, refer to [XLM-RoBERTa documentation](xlm-roberta) for API reference, and examples. + \ No newline at end of file diff --git a/docs/source/en/model_doc/xlm.mdx b/docs/source/en/model_doc/xlm.md similarity index 89% rename from docs/source/en/model_doc/xlm.mdx rename to docs/source/en/model_doc/xlm.md index 18ce89cefde602..0ee11c6addc588 100644 --- a/docs/source/en/model_doc/xlm.mdx +++ b/docs/source/en/model_doc/xlm.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # XLM @@ -42,7 +46,9 @@ obtain 34.3 BLEU on WMT'16 German-English, improving the previous state of the a machine translation, we obtain a new state of the art of 38.5 BLEU on WMT'16 Romanian-English, outperforming the previous best approach by more than 4 BLEU. Our code and pretrained models will be made publicly available.* -Tips: +This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/facebookresearch/XLM/). + +## Usage tips - XLM has many different checkpoints, which were trained using different objectives: CLM, MLM or TLM. Make sure to select the correct objective for your task (e.g. MLM checkpoints are not suitable for generation). @@ -53,8 +59,14 @@ Tips: * Masked language modeling (MLM) which is like RoBERTa. One of the languages is selected for each training sample, and the model input is a sentence of 256 tokens, that may span over several documents in one of those languages, with dynamic masking of the tokens. * A combination of MLM and translation language modeling (TLM). This consists of concatenating a sentence in two different languages, with random masking. To predict one of the masked tokens, the model can use both, the surrounding context in language 1 and the context given by language 2. -This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/facebookresearch/XLM/). +## Resources +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Causal language modeling task guide](../tasks/language_modeling) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) ## XLMConfig @@ -72,6 +84,9 @@ This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The o [[autodoc]] models.xlm.modeling_xlm.XLMForQuestionAnsweringOutput + + + ## XLMModel [[autodoc]] XLMModel @@ -107,6 +122,9 @@ This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The o [[autodoc]] XLMForQuestionAnswering - forward + + + ## TFXLMModel [[autodoc]] TFXLMModel @@ -136,3 +154,8 @@ This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The o [[autodoc]] TFXLMForQuestionAnsweringSimple - call + + + + + diff --git a/docs/source/en/model_doc/xlnet.mdx b/docs/source/en/model_doc/xlnet.md similarity index 91% rename from docs/source/en/model_doc/xlnet.mdx rename to docs/source/en/model_doc/xlnet.md index 694ae3ca275ecd..d2209c3d550ec3 100644 --- a/docs/source/en/model_doc/xlnet.mdx +++ b/docs/source/en/model_doc/xlnet.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
+ +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # XLNet @@ -40,7 +44,9 @@ formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state- pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.* -Tips: +This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/zihangdai/xlnet/). + +## Usage tips - The specific attention pattern can be controlled at training and test time using the `perm_mask` input. - Due to the difficulty of training a fully auto-regressive model over various factorization order, XLNet is pretrained @@ -52,8 +58,13 @@ Tips: - XLNet is not a traditional autoregressive model but uses a training strategy that builds on that. It permutes the tokens in the sentence, then allows the model to use the last n tokens to predict the token n+1. Since this is all done with a mask, the sentence is actually fed in the model in the right order, but instead of masking the first n tokens for n+1, XLNet uses a mask that hides the previous tokens in some given permutation of 1,…,sequence length. - XLNet also uses the same recurrence mechanism as Transformer-XL to build long-term dependencies. -This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The original code can be found [here](https://github.com/zihangdai/xlnet/). +## Resources +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Causal language modeling task guide](../tasks/language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) ## XLNetConfig @@ -99,6 +110,9 @@ This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The o [[autodoc]] models.xlnet.modeling_tf_xlnet.TFXLNetForQuestionAnsweringSimpleOutput + + + ## XLNetModel [[autodoc]] XLNetModel @@ -134,6 +148,9 @@ This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The o [[autodoc]] XLNetForQuestionAnswering - forward + + + ## TFXLNetModel [[autodoc]] TFXLNetModel @@ -163,3 +180,6 @@ This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The o [[autodoc]] TFXLNetForQuestionAnsweringSimple - call + + + \ No newline at end of file diff --git a/docs/source/en/model_doc/xls_r.mdx b/docs/source/en/model_doc/xls_r.md similarity index 89% rename from docs/source/en/model_doc/xls_r.mdx rename to docs/source/en/model_doc/xls_r.md index 82a7e3b8afbd31..2226c813e72b13 100644 --- a/docs/source/en/model_doc/xls_r.mdx +++ b/docs/source/en/model_doc/xls_r.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # XLS-R @@ -30,14 +34,18 @@ language identification. 
Moreover, we show that with sufficient model size, cros English-only pretraining when translating English speech into other languages, a setting which favors monolingual pretraining. We hope XLS-R can help to improve speech processing tasks for many more languages of the world.* -Tips: +Relevant checkpoints can be found under https://huggingface.co/models?other=xls_r. + +The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/fairseq/models/wav2vec). + +## Usage tips - XLS-R is a speech model that accepts a float array corresponding to the raw waveform of the speech signal. - XLS-R model was trained using connectionist temporal classification (CTC) so the model output has to be decoded using [`Wav2Vec2CTCTokenizer`]. -Relevant checkpoints can be found under https://huggingface.co/models?other=xls_r. + -XLS-R's architecture is based on the Wav2Vec2 model, so one can refer to [Wav2Vec2's documentation page](wav2vec2). +XLS-R's architecture is based on the Wav2Vec2 model, refer to [Wav2Vec2's documentation page](wav2vec2) for API reference. -The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/fairseq/models/wav2vec). + \ No newline at end of file diff --git a/docs/source/en/model_doc/xlsr_wav2vec2.mdx b/docs/source/en/model_doc/xlsr_wav2vec2.md similarity index 92% rename from docs/source/en/model_doc/xlsr_wav2vec2.mdx rename to docs/source/en/model_doc/xlsr_wav2vec2.md index 32229f28b14763..d1b5444c2469bd 100644 --- a/docs/source/en/model_doc/xlsr_wav2vec2.mdx +++ b/docs/source/en/model_doc/xlsr_wav2vec2.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # XLSR-Wav2Vec2 @@ -30,12 +34,16 @@ individual models. Analysis shows that the latent discrete speech representation increased sharing for related languages. We hope to catalyze research in low-resource speech understanding by releasing XLSR-53, a large model pretrained in 53 languages.* -Tips: +The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/fairseq/models/wav2vec). + +## Usage tips - XLSR-Wav2Vec2 is a speech model that accepts a float array corresponding to the raw waveform of the speech signal. - XLSR-Wav2Vec2 model was trained using connectionist temporal classification (CTC) so the model output has to be decoded using [`Wav2Vec2CTCTokenizer`]. + + XLSR-Wav2Vec2's architecture is based on the Wav2Vec2 model, so one can refer to [Wav2Vec2's documentation page](wav2vec2). -The original code can be found [here](https://github.com/pytorch/fairseq/tree/master/fairseq/models/wav2vec). 
+ diff --git a/docs/source/en/model_doc/xmod.mdx b/docs/source/en/model_doc/xmod.md similarity index 88% rename from docs/source/en/model_doc/xmod.mdx rename to docs/source/en/model_doc/xmod.md index ffc1c85dcbaf85..47797fa6490254 100644 --- a/docs/source/en/model_doc/xmod.mdx +++ b/docs/source/en/model_doc/xmod.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # X-MOD @@ -21,13 +25,15 @@ The abstract from the paper is the following: *Multilingual pre-trained models are known to suffer from the curse of multilinguality, which causes per-language performance to drop as they cover more languages. We address this issue by introducing language-specific modules, which allows us to grow the total capacity of the model, while keeping the total number of trainable parameters per language constant. In contrast with prior work that learns language-specific components post-hoc, we pre-train the modules of our Cross-lingual Modular (X-MOD) models from the start. Our experiments on natural language inference, named entity recognition and question answering show that our approach not only mitigates the negative interference between languages, but also enables positive transfer, resulting in improved monolingual and cross-lingual performance. Furthermore, our approach enables adding languages post-hoc with no measurable drop in performance, no longer limiting the model usage to the set of pre-trained languages.* +This model was contributed by [jvamvas](https://huggingface.co/jvamvas). +The original code can be found [here](https://github.com/facebookresearch/fairseq/tree/58cc6cca18f15e6d56e3f60c959fe4f878960a60/fairseq/models/xmod) and the original documentation is found [here](https://github.com/facebookresearch/fairseq/tree/58cc6cca18f15e6d56e3f60c959fe4f878960a60/examples/xmod). + +## Usage tips + Tips: - X-MOD is similar to [XLM-R](xlm-roberta), but a difference is that the input language needs to be specified so that the correct language adapter can be activated. - The main models – base and large – have adapters for 81 languages. -This model was contributed by [jvamvas](https://huggingface.co/jvamvas). -The original code can be found [here](https://github.com/facebookresearch/fairseq/tree/58cc6cca18f15e6d56e3f60c959fe4f878960a60/fairseq/models/xmod) and the original documentation is found [here](https://github.com/facebookresearch/fairseq/tree/58cc6cca18f15e6d56e3f60c959fe4f878960a60/examples/xmod). - ## Adapter Usage ### Input language @@ -78,6 +84,15 @@ model.set_default_language("de_DE") # Evaluate the model on German examples ... 
``` +## Resources + +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Causal language modeling task guide](../tasks/language_modeling) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) + ## XmodConfig [[autodoc]] XmodConfig diff --git a/docs/source/en/model_doc/yolos.mdx b/docs/source/en/model_doc/yolos.md similarity index 89% rename from docs/source/en/model_doc/yolos.mdx rename to docs/source/en/model_doc/yolos.md index 66533bacfbee4c..5386c373ac8330 100644 --- a/docs/source/en/model_doc/yolos.mdx +++ b/docs/source/en/model_doc/yolos.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # YOLOS @@ -21,10 +25,6 @@ The abstract from the paper is the following: *Can Transformer perform 2D object- and region-level recognition from a pure sequence-to-sequence perspective with minimal knowledge about the 2D spatial structure? To answer this question, we present You Only Look at One Sequence (YOLOS), a series of object detection models based on the vanilla Vision Transformer with the fewest possible modifications, region priors, as well as inductive biases of the target task. We find that YOLOS pre-trained on the mid-sized ImageNet-1k dataset only can already achieve quite competitive performance on the challenging COCO object detection benchmark, e.g., YOLOS-Base directly adopted from BERT-Base architecture can obtain 42.0 box AP on COCO val. We also discuss the impacts as well as limitations of current pre-train schemes and model scaling strategies for Transformer in vision through YOLOS.* -Tips: - -- One can use [`YolosImageProcessor`] for preparing images (and optional targets) for the model. Contrary to [DETR](detr), YOLOS doesn't require a `pixel_mask` to be created. - drawing @@ -39,9 +39,16 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h - All example notebooks illustrating inference + fine-tuning [`YolosForObjectDetection`] on a custom dataset can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/YOLOS). +- See also: [Object detection task guide](../tasks/object_detection) If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource. + + +Use [`YolosImageProcessor`] for preparing images (and optional targets) for the model. Contrary to [DETR](detr), YOLOS doesn't require a `pixel_mask` to be created. 
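To make this concrete, here is a minimal inference sketch. It assumes the publicly hosted `hustvl/yolos-tiny` checkpoint and a working PyTorch installation, and is meant as an illustration rather than the canonical recipe:

```py
import requests
import torch
from PIL import Image
from transformers import YolosForObjectDetection, YolosImageProcessor

# Load an example COCO validation image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = YolosImageProcessor.from_pretrained("hustvl/yolos-tiny")
model = YolosForObjectDetection.from_pretrained("hustvl/yolos-tiny")

# No pixel_mask is required, contrary to DETR
inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale the predicted boxes to the original image size and keep confident detections
target_sizes = torch.tensor([image.size[::-1]])
results = image_processor.post_process_object_detection(outputs, threshold=0.9, target_sizes=target_sizes)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), [round(c, 1) for c in box.tolist()])
```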
+ + + ## YolosConfig [[autodoc]] YolosConfig @@ -60,13 +67,11 @@ If you're interested in submitting a resource to be included here, please feel f - pad - post_process_object_detection - ## YolosModel [[autodoc]] YolosModel - forward - ## YolosForObjectDetection [[autodoc]] YolosForObjectDetection diff --git a/docs/source/en/model_doc/yoso.mdx b/docs/source/en/model_doc/yoso.md similarity index 88% rename from docs/source/en/model_doc/yoso.mdx rename to docs/source/en/model_doc/yoso.md index 997ab4d0941639..a3dfa3fed855b5 100644 --- a/docs/source/en/model_doc/yoso.mdx +++ b/docs/source/en/model_doc/yoso.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # YOSO @@ -33,7 +37,9 @@ length where we see favorable performance relative to a standard pretrained Tran for evaluating performance on long sequences, our method achieves results consistent with softmax self-attention but with sizable speed-ups and memory savings and often outperforms other efficient self-attention methods. Our code is available at this https URL* -Tips: +This model was contributed by [novice03](https://huggingface.co/novice03). The original code can be found [here](https://github.com/mlpen/YOSO). + +## Usage tips - The YOSO attention algorithm is implemented through custom CUDA kernels, functions written in CUDA C++ that can be executed multiple times in parallel on a GPU. @@ -48,26 +54,28 @@ alt="drawing" width="600"/> YOSO Attention Algorithm. Taken from the original paper. -This model was contributed by [novice03](https://huggingface.co/novice03). The original code can be found [here](https://github.com/mlpen/YOSO). +## Resources +- [Text classification task guide](../tasks/sequence_classification) +- [Token classification task guide](../tasks/token_classification) +- [Question answering task guide](../tasks/question_answering) +- [Masked language modeling task guide](../tasks/masked_language_modeling) +- [Multiple choice task guide](../tasks/multiple_choice) ## YosoConfig [[autodoc]] YosoConfig - ## YosoModel [[autodoc]] YosoModel - forward - ## YosoForMaskedLM [[autodoc]] YosoForMaskedLM - forward - ## YosoForSequenceClassification [[autodoc]] YosoForSequenceClassification @@ -78,13 +86,11 @@ This model was contributed by [novice03](https://huggingface.co/novice03). The o [[autodoc]] YosoForMultipleChoice - forward - ## YosoForTokenClassification [[autodoc]] YosoForTokenClassification - forward - ## YosoForQuestionAnswering [[autodoc]] YosoForQuestionAnswering diff --git a/docs/source/en/model_memory_anatomy.md b/docs/source/en/model_memory_anatomy.md new file mode 100644 index 00000000000000..c820681a7af0fc --- /dev/null +++ b/docs/source/en/model_memory_anatomy.md @@ -0,0 +1,272 @@ + + +# Model training anatomy + +To understand performance optimization techniques that one can apply to improve efficiency of model training +speed and memory utilization, it's helpful to get familiar with how GPU is utilized during training, and how compute +intensity varies depending on an operation performed. 
+ +Let's start by exploring a motivating example of GPU utilization and the training run of a model. For the demonstration, +we'll need to install a few libraries: + +```bash +pip install transformers datasets accelerate nvidia-ml-py3 +``` + +The `nvidia-ml-py3` library allows us to monitor the memory usage of the models from within Python. You might be familiar +with the `nvidia-smi` command in the terminal - this library allows to access the same information in Python directly. + +Then, we create some dummy data: random token IDs between 100 and 30000 and binary labels for a classifier. +In total, we get 512 sequences each with length 512 and store them in a [`~datasets.Dataset`] with PyTorch format. + + +```py +>>> import numpy as np +>>> from datasets import Dataset + + +>>> seq_len, dataset_size = 512, 512 +>>> dummy_data = { +... "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)), +... "labels": np.random.randint(0, 1, (dataset_size)), +... } +>>> ds = Dataset.from_dict(dummy_data) +>>> ds.set_format("pt") +``` + +To print summary statistics for the GPU utilization and the training run with the [`Trainer`] we define two helper functions: + +```py +>>> from pynvml import * + + +>>> def print_gpu_utilization(): +... nvmlInit() +... handle = nvmlDeviceGetHandleByIndex(0) +... info = nvmlDeviceGetMemoryInfo(handle) +... print(f"GPU memory occupied: {info.used//1024**2} MB.") + + +>>> def print_summary(result): +... print(f"Time: {result.metrics['train_runtime']:.2f}") +... print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}") +... print_gpu_utilization() +``` + +Let's verify that we start with a free GPU memory: + +```py +>>> print_gpu_utilization() +GPU memory occupied: 0 MB. +``` + +That looks good: the GPU memory is not occupied as we would expect before we load any models. If that's not the case on +your machine make sure to stop all processes that are using GPU memory. However, not all free GPU memory can be used by +the user. When a model is loaded to the GPU the kernels are also loaded, which can take up 1-2GB of memory. To see how +much it is we load a tiny tensor into the GPU which triggers the kernels to be loaded as well. + +```py +>>> import torch + + +>>> torch.ones((1, 1)).to("cuda") +>>> print_gpu_utilization() +GPU memory occupied: 1343 MB. +``` + +We see that the kernels alone take up 1.3GB of GPU memory. Now let's see how much space the model uses. + +## Load Model + +First, we load the `google-bert/bert-large-uncased` model. We load the model weights directly to the GPU so that we can check +how much space just the weights use. + + +```py +>>> from transformers import AutoModelForSequenceClassification + + +>>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-large-uncased").to("cuda") +>>> print_gpu_utilization() +GPU memory occupied: 2631 MB. +``` + +We can see that the model weights alone take up 1.3 GB of GPU memory. The exact number depends on the specific +GPU you are using. Note that on newer GPUs a model can sometimes take up more space since the weights are loaded in an +optimized fashion that speeds up the usage of the model. 
Now we can also quickly check if we get the same result +as with `nvidia-smi` CLI: + + +```bash +nvidia-smi +``` + +```bash +Tue Jan 11 08:58:05 2022 ++-----------------------------------------------------------------------------+ +| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 | +|-------------------------------+----------------------+----------------------+ +| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | +| | | MIG M. | +|===============================+======================+======================| +| 0 Tesla V100-SXM2... On | 00000000:00:04.0 Off | 0 | +| N/A 37C P0 39W / 300W | 2631MiB / 16160MiB | 0% Default | +| | | N/A | ++-------------------------------+----------------------+----------------------+ + ++-----------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=============================================================================| +| 0 N/A N/A 3721 C ...nvs/codeparrot/bin/python 2629MiB | ++-----------------------------------------------------------------------------+ +``` + +We get the same number as before and you can also see that we are using a V100 GPU with 16GB of memory. So now we can +start training the model and see how the GPU memory consumption changes. First, we set up a few standard training +arguments: + +```py +default_args = { + "output_dir": "tmp", + "evaluation_strategy": "steps", + "num_train_epochs": 1, + "log_level": "error", + "report_to": "none", +} +``` + + + + If you plan to run multiple experiments, in order to properly clear the memory between experiments, restart the Python + kernel between experiments. + + + +## Memory utilization at vanilla training + +Let's use the [`Trainer`] and train the model without using any GPU performance optimization techniques and a batch size of 4: + +```py +>>> from transformers import TrainingArguments, Trainer, logging + +>>> logging.set_verbosity_error() + + +>>> training_args = TrainingArguments(per_device_train_batch_size=4, **default_args) +>>> trainer = Trainer(model=model, args=training_args, train_dataset=ds) +>>> result = trainer.train() +>>> print_summary(result) +``` + +``` +Time: 57.82 +Samples/second: 8.86 +GPU memory occupied: 14949 MB. +``` + +We see that already a relatively small batch size almost fills up our GPU's entire memory. However, a larger batch size +can often result in faster model convergence or better end performance. So ideally we want to tune the batch size to our +model's needs and not to the GPU limitations. What's interesting is that we use much more memory than the size of the model. +To understand a bit better why this is the case let's have a look at a model's operations and memory needs. + +## Anatomy of Model's Operations + +Transformers architecture includes 3 main groups of operations grouped below by compute-intensity. + +1. **Tensor Contractions** + + Linear layers and components of Multi-Head Attention all do batched **matrix-matrix multiplications**. These operations are the most compute-intensive part of training a transformer. + +2. **Statistical Normalizations** + + Softmax and layer normalization are less compute-intensive than tensor contractions, and involve one or more **reduction operations**, the result of which is then applied via a map. + +3. 
**Element-wise Operators** + + These are the remaining operators: **biases, dropout, activations, and residual connections**. These are the least compute-intensive operations. + +This knowledge can be helpful to know when analyzing performance bottlenecks. + +This summary is derived from [Data Movement Is All You Need: A Case Study on Optimizing Transformers 2020](https://arxiv.org/abs/2007.00072) + + +## Anatomy of Model's Memory + +We've seen that training the model uses much more memory than just putting the model on the GPU. This is because there +are many components during training that use GPU memory. The components on GPU memory are the following: + +1. model weights +2. optimizer states +3. gradients +4. forward activations saved for gradient computation +5. temporary buffers +6. functionality-specific memory + +A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter plus activation memory. For +inference there are no optimizer states and gradients, so we can subtract those. And thus we end up with 6 bytes per +model parameter for mixed precision inference, plus activation memory. + +Let's look at the details. + +**Model Weights:** + +- 4 bytes * number of parameters for fp32 training +- 6 bytes * number of parameters for mixed precision training (maintains a model in fp32 and one in fp16 in memory) + +**Optimizer States:** + +- 8 bytes * number of parameters for normal AdamW (maintains 2 states) +- 2 bytes * number of parameters for 8-bit AdamW optimizers like [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) +- 4 bytes * number of parameters for optimizers like SGD with momentum (maintains only 1 state) + +**Gradients** + +- 4 bytes * number of parameters for either fp32 or mixed precision training (gradients are always kept in fp32) + +**Forward Activations** + +- size depends on many factors, the key ones being sequence length, hidden size and batch size. + +There are the input and output that are being passed and returned by the forward and the backward functions and the +forward activations saved for gradient computation. + +**Temporary Memory** + +Additionally, there are all kinds of temporary variables which get released once the calculation is done, but in the +moment these could require additional memory and could push to OOM. Therefore, when coding it's crucial to think +strategically about such temporary variables and sometimes to explicitly free those as soon as they are no longer needed. + +**Functionality-specific memory** + +Then, your software could have special memory needs. For example, when generating text using beam search, the software +needs to maintain multiple copies of inputs and outputs. + +**`forward` vs `backward` Execution Speed** + +For convolutions and linear layers there are 2x flops in the backward compared to the forward, which generally translates +into ~2x slower (sometimes more, because sizes in the backward tend to be more awkward). Activations are usually +bandwidth-limited, and it’s typical for an activation to have to read more data in the backward than in the forward +(e.g. activation forward reads once, writes once, activation backward reads twice, gradOutput and output of the forward, +and writes once, gradInput). + +As you can see, there are potentially a few places where we could save GPU memory or speed up operations. 
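+Putting the byte counts above together, here is a minimal helper (a rough back-of-the-envelope sketch, excluding
+activations and temporary buffers) that estimates the steady-state memory for mixed-precision AdamW training:
+
+```py
+def estimate_training_memory_gb(num_params, bytes_per_param=18):
+    # 6 bytes weights (fp32 + fp16 copies) + 8 bytes AdamW states + 4 bytes gradients = 18 bytes per parameter
+    return num_params * bytes_per_param / 1024**3
+
+
+print(f"{estimate_training_memory_gb(model.num_parameters()):.1f} GB plus activation memory")
+```
+
+Keep in mind that activation memory grows with batch size and sequence length, so the true peak usage is higher.
+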
+Now that you understand what affects GPU utilization and computation speed, refer to +the [Methods and tools for efficient training on a single GPU](perf_train_gpu_one) documentation page to learn about +performance optimization techniques. diff --git a/docs/source/en/model_sharing.mdx b/docs/source/en/model_sharing.md similarity index 92% rename from docs/source/en/model_sharing.mdx rename to docs/source/en/model_sharing.md index bae458be7928c9..6ec4d9fa2a9280 100644 --- a/docs/source/en/model_sharing.mdx +++ b/docs/source/en/model_sharing.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Share a model @@ -91,7 +95,7 @@ Specify `from_pt=True` to convert a checkpoint from PyTorch to TensorFlow: >>> tf_model = TFDistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_pt=True) ``` -Then you can save your new TensorFlow model with it's new checkpoint: +Then you can save your new TensorFlow model with its new checkpoint: ```py >>> tf_model.save_pretrained("path/to/awesome-name-you-picked") @@ -197,7 +201,7 @@ Or perhaps you'd like to add the TensorFlow version of your fine-tuned PyTorch m >>> tf_model.push_to_hub("my-awesome-model") ``` -Now when you navigate to the your Hugging Face profile, you should see your newly created model repository. Clicking on the **Files** tab will display all the files you've uploaded to the repository. +Now when you navigate to your Hugging Face profile, you should see your newly created model repository. Clicking on the **Files** tab will display all the files you've uploaded to the repository. For more details on how to create and upload files to a repository, refer to the Hub documentation [here](https://huggingface.co/docs/hub/how-to-upstream). @@ -225,4 +229,4 @@ To make sure users understand your model's capabilities, limitations, potential * Manually creating and uploading a `README.md` file. * Clicking on the **Edit model card** button in your model repository. -Take a look at the DistilBert [model card](https://huggingface.co/distilbert-base-uncased) for a good example of the type of information a model card should include. For more details about other options you can control in the `README.md` file such as a model's carbon footprint or widget examples, refer to the documentation [here](https://huggingface.co/docs/hub/models-cards). +Take a look at the DistilBert [model card](https://huggingface.co/distilbert/distilbert-base-uncased) for a good example of the type of information a model card should include. For more details about other options you can control in the `README.md` file such as a model's carbon footprint or widget examples, refer to the documentation [here](https://huggingface.co/docs/hub/models-cards). 
diff --git a/docs/source/en/model_summary.mdx b/docs/source/en/model_summary.md similarity index 99% rename from docs/source/en/model_summary.mdx rename to docs/source/en/model_summary.md index bc93c3d60324c5..10acb4c5021093 100644 --- a/docs/source/en/model_summary.mdx +++ b/docs/source/en/model_summary.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # The Transformer model family diff --git a/docs/source/en/multilingual.mdx b/docs/source/en/multilingual.md similarity index 77% rename from docs/source/en/multilingual.mdx rename to docs/source/en/multilingual.md index 7c95de6ffc0909..30a63eea28c8c7 100644 --- a/docs/source/en/multilingual.mdx +++ b/docs/source/en/multilingual.md @@ -8,13 +8,17 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Multilingual models for inference [[open-in-colab]] -There are several multilingual models in 🤗 Transformers, and their inference usage differs from monolingual models. Not *all* multilingual model usage is different though. Some models, like [bert-base-multilingual-uncased](https://huggingface.co/bert-base-multilingual-uncased), can be used just like a monolingual model. This guide will show you how to use multilingual models whose usage differs for inference. +There are several multilingual models in 🤗 Transformers, and their inference usage differs from monolingual models. Not *all* multilingual model usage is different though. Some models, like [google-bert/bert-base-multilingual-uncased](https://huggingface.co/google-bert/bert-base-multilingual-uncased), can be used just like a monolingual model. This guide will show you how to use multilingual models whose usage differs for inference. ## XLM @@ -24,24 +28,24 @@ XLM has ten different checkpoints, only one of which is monolingual. 
The nine re The following XLM models use language embeddings to specify the language used at inference: -- `xlm-mlm-ende-1024` (Masked language modeling, English-German) -- `xlm-mlm-enfr-1024` (Masked language modeling, English-French) -- `xlm-mlm-enro-1024` (Masked language modeling, English-Romanian) -- `xlm-mlm-xnli15-1024` (Masked language modeling, XNLI languages) -- `xlm-mlm-tlm-xnli15-1024` (Masked language modeling + translation, XNLI languages) -- `xlm-clm-enfr-1024` (Causal language modeling, English-French) -- `xlm-clm-ende-1024` (Causal language modeling, English-German) +- `FacebookAI/xlm-mlm-ende-1024` (Masked language modeling, English-German) +- `FacebookAI/xlm-mlm-enfr-1024` (Masked language modeling, English-French) +- `FacebookAI/xlm-mlm-enro-1024` (Masked language modeling, English-Romanian) +- `FacebookAI/xlm-mlm-xnli15-1024` (Masked language modeling, XNLI languages) +- `FacebookAI/xlm-mlm-tlm-xnli15-1024` (Masked language modeling + translation, XNLI languages) +- `FacebookAI/xlm-clm-enfr-1024` (Causal language modeling, English-French) +- `FacebookAI/xlm-clm-ende-1024` (Causal language modeling, English-German) Language embeddings are represented as a tensor of the same shape as the `input_ids` passed to the model. The values in these tensors depend on the language used and are identified by the tokenizer's `lang2id` and `id2lang` attributes. -In this example, load the `xlm-clm-enfr-1024` checkpoint (Causal language modeling, English-French): +In this example, load the `FacebookAI/xlm-clm-enfr-1024` checkpoint (Causal language modeling, English-French): ```py >>> import torch >>> from transformers import XLMTokenizer, XLMWithLMHeadModel ->>> tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024") ->>> model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024") +>>> tokenizer = XLMTokenizer.from_pretrained("FacebookAI/xlm-clm-enfr-1024") +>>> model = XLMWithLMHeadModel.from_pretrained("FacebookAI/xlm-clm-enfr-1024") ``` The `lang2id` attribute of the tokenizer displays this model's languages and their ids: @@ -79,8 +83,8 @@ The [run_generation.py](https://github.com/huggingface/transformers/tree/main/ex The following XLM models do not require language embeddings during inference: -- `xlm-mlm-17-1280` (Masked language modeling, 17 languages) -- `xlm-mlm-100-1280` (Masked language modeling, 100 languages) +- `FacebookAI/xlm-mlm-17-1280` (Masked language modeling, 17 languages) +- `FacebookAI/xlm-mlm-100-1280` (Masked language modeling, 100 languages) These models are used for generic sentence representations, unlike the previous XLM checkpoints. @@ -88,8 +92,8 @@ These models are used for generic sentence representations, unlike the previous The following BERT models can be used for multilingual tasks: -- `bert-base-multilingual-uncased` (Masked language modeling + Next sentence prediction, 102 languages) -- `bert-base-multilingual-cased` (Masked language modeling + Next sentence prediction, 104 languages) +- `google-bert/bert-base-multilingual-uncased` (Masked language modeling + Next sentence prediction, 102 languages) +- `google-bert/bert-base-multilingual-cased` (Masked language modeling + Next sentence prediction, 104 languages) These models do not require language embeddings during inference. They should identify the language from the context and infer accordingly. @@ -98,8 +102,8 @@ context and infer accordingly. 
The following XLM-RoBERTa models can be used for multilingual tasks: -- `xlm-roberta-base` (Masked language modeling, 100 languages) -- `xlm-roberta-large` (Masked language modeling, 100 languages) +- `FacebookAI/xlm-roberta-base` (Masked language modeling, 100 languages) +- `FacebookAI/xlm-roberta-large` (Masked language modeling, 100 languages) XLM-RoBERTa was trained on 2.5TB of newly created and cleaned CommonCrawl data in 100 languages. It provides strong gains over previously released multilingual models like mBERT or XLM on downstream tasks like classification, sequence labeling, and question answering. @@ -167,9 +171,9 @@ Tokenize the text: MBart forces the target language id as the first generated token to translate to the target language. Set the `forced_bos_token_id` to `en` in the `generate` method to translate to English: ```py ->>> generated_tokens = model.generate(**encoded_en, forced_bos_token_id=tokenizer.lang_code_to_id("en_XX")) +>>> generated_tokens = model.generate(**encoded_en, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"]) >>> tokenizer.batch_decode(generated_tokens, skip_special_tokens=True) "Don't interfere with the wizard's affairs, because they are subtle, will soon get angry." ``` -If you are using the `facebook/mbart-large-50-many-to-one-mmt` checkpoint, you don't need to force the target language id as the first generated token otherwise the usage is the same. \ No newline at end of file +If you are using the `facebook/mbart-large-50-many-to-one-mmt` checkpoint, you don't need to force the target language id as the first generated token otherwise the usage is the same. diff --git a/docs/source/en/pad_truncation.mdx b/docs/source/en/pad_truncation.md similarity index 95% rename from docs/source/en/pad_truncation.mdx rename to docs/source/en/pad_truncation.md index f848e23bed502e..cc623bca48a402 100644 --- a/docs/source/en/pad_truncation.mdx +++ b/docs/source/en/pad_truncation.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. 
+ --> # Padding and truncation @@ -50,6 +54,7 @@ The following table summarizes the recommended way to setup padding and truncati | | | `tokenizer(batch_sentences, padding='longest')` | | | padding to max model input length | `tokenizer(batch_sentences, padding='max_length')` | | | padding to specific length | `tokenizer(batch_sentences, padding='max_length', max_length=42)` | +| | padding to a multiple of a value | `tokenizer(batch_sentences, padding=True, pad_to_multiple_of=8)` | | truncation to max model input length | no padding | `tokenizer(batch_sentences, truncation=True)` or | | | | `tokenizer(batch_sentences, truncation=STRATEGY)` | | | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True)` or | diff --git a/docs/source/en/peft.md b/docs/source/en/peft.md new file mode 100644 index 00000000000000..d86a36e62487dc --- /dev/null +++ b/docs/source/en/peft.md @@ -0,0 +1,236 @@ + + +# Load adapters with 🤗 PEFT + +[[open-in-colab]] + +[Parameter-Efficient Fine Tuning (PEFT)](https://huggingface.co/blog/peft) methods freeze the pretrained model parameters during fine-tuning and add a small number of trainable parameters (the adapters) on top of it. The adapters are trained to learn task-specific information. This approach has been shown to be very memory-efficient with lower compute usage while producing results comparable to a fully fine-tuned model. + +Adapters trained with PEFT are also usually an order of magnitude smaller than the full model, making it convenient to share, store, and load them. + +
+ +
The adapter weights for an OPTForCausalLM model stored on the Hub are only ~6MB compared to the full size of the model weights, which can be ~700MB.
+
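+If you want to check those sizes for yourself, one option (not covered by this guide) is to list the repository files
+with the `huggingface_hub` client; the repository id below is the same adapter used in the loading examples later on
+this page:
+
+```py
+from huggingface_hub import HfApi
+
+api = HfApi()
+info = api.model_info("ybelkada/opt-350m-lora", files_metadata=True)
+for file in info.siblings:
+    # file sizes are reported in bytes; the adapter weights are only a few MB
+    print(f"{file.rfilename}: {(file.size or 0) / 1024**2:.1f} MB")
+```
+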
+ +If you're interested in learning more about the 🤗 PEFT library, check out the [documentation](https://huggingface.co/docs/peft/index). + +## Setup + +Get started by installing 🤗 PEFT: + +```bash +pip install peft +``` + +If you want to try out the brand new features, you might be interested in installing the library from source: + +```bash +pip install git+https://github.com/huggingface/peft.git +``` + +## Supported PEFT models + +🤗 Transformers natively supports some PEFT methods, meaning you can load adapter weights stored locally or on the Hub and easily run or train them with a few lines of code. The following methods are supported: + +- [Low Rank Adapters](https://huggingface.co/docs/peft/conceptual_guides/lora) +- [IA3](https://huggingface.co/docs/peft/conceptual_guides/ia3) +- [AdaLoRA](https://arxiv.org/abs/2303.10512) + +If you want to use other PEFT methods, such as prompt learning or prompt tuning, or about the 🤗 PEFT library in general, please refer to the [documentation](https://huggingface.co/docs/peft/index). + + +## Load a PEFT adapter + +To load and use a PEFT adapter model from 🤗 Transformers, make sure the Hub repository or local directory contains an `adapter_config.json` file and the adapter weights, as shown in the example image above. Then you can load the PEFT adapter model using the `AutoModelFor` class. For example, to load a PEFT adapter model for causal language modeling: + +1. specify the PEFT model id +2. pass it to the [`AutoModelForCausalLM`] class + +```py +from transformers import AutoModelForCausalLM, AutoTokenizer + +peft_model_id = "ybelkada/opt-350m-lora" +model = AutoModelForCausalLM.from_pretrained(peft_model_id) +``` + + + +You can load a PEFT adapter with either an `AutoModelFor` class or the base model class like `OPTForCausalLM` or `LlamaForCausalLM`. + + + +You can also load a PEFT adapter by calling the `load_adapter` method: + +```py +from transformers import AutoModelForCausalLM, AutoTokenizer + +model_id = "facebook/opt-350m" +peft_model_id = "ybelkada/opt-350m-lora" + +model = AutoModelForCausalLM.from_pretrained(model_id) +model.load_adapter(peft_model_id) +``` + +## Load in 8bit or 4bit + +The `bitsandbytes` integration supports 8bit and 4bit precision data types, which are useful for loading large models because it saves memory (see the `bitsandbytes` integration [guide](./quantization#bitsandbytes-integration) to learn more). Add the `load_in_8bit` or `load_in_4bit` parameters to [`~PreTrainedModel.from_pretrained`] and set `device_map="auto"` to effectively distribute the model to your hardware: + +```py +from transformers import AutoModelForCausalLM, AutoTokenizer + +peft_model_id = "ybelkada/opt-350m-lora" +model = AutoModelForCausalLM.from_pretrained(peft_model_id, device_map="auto", load_in_8bit=True) +``` + +## Add a new adapter + +You can use [`~peft.PeftModel.add_adapter`] to add a new adapter to a model with an existing adapter as long as the new adapter is the same type as the current one. 
For example, if you have an existing LoRA adapter attached to a model: + +```py +from transformers import AutoModelForCausalLM, OPTForCausalLM, AutoTokenizer +from peft import LoraConfig + +model_id = "facebook/opt-350m" +model = AutoModelForCausalLM.from_pretrained(model_id) + +lora_config = LoraConfig( + target_modules=["q_proj", "k_proj"], + init_lora_weights=False +) + +model.add_adapter(lora_config, adapter_name="adapter_1") +``` + +To add a new adapter: + +```py +# attach new adapter with same config +model.add_adapter(lora_config, adapter_name="adapter_2") +``` + +Now you can use [`~peft.PeftModel.set_adapter`] to set which adapter to use: + +```py +# use adapter_1 +model.set_adapter("adapter_1") +output = model.generate(**inputs) +print(tokenizer.decode(output_disabled[0], skip_special_tokens=True)) + +# use adapter_2 +model.set_adapter("adapter_2") +output_enabled = model.generate(**inputs) +print(tokenizer.decode(output_enabled[0], skip_special_tokens=True)) +``` + +## Enable and disable adapters + +Once you've added an adapter to a model, you can enable or disable the adapter module. To enable the adapter module: + +```py +from transformers import AutoModelForCausalLM, OPTForCausalLM, AutoTokenizer +from peft import PeftConfig + +model_id = "facebook/opt-350m" +adapter_model_id = "ybelkada/opt-350m-lora" +tokenizer = AutoTokenizer.from_pretrained(model_id) +text = "Hello" +inputs = tokenizer(text, return_tensors="pt") + +model = AutoModelForCausalLM.from_pretrained(model_id) +peft_config = PeftConfig.from_pretrained(adapter_model_id) + +# to initiate with random weights +peft_config.init_lora_weights = False + +model.add_adapter(peft_config) +model.enable_adapters() +output = model.generate(**inputs) +``` + +To disable the adapter module: + +```py +model.disable_adapters() +output = model.generate(**inputs) +``` + +## Train a PEFT adapter + +PEFT adapters are supported by the [`Trainer`] class so that you can train an adapter for your specific use case. It only requires adding a few more lines of code. For example, to train a LoRA adapter: + + + +If you aren't familiar with fine-tuning a model with [`Trainer`], take a look at the [Fine-tune a pretrained model](training) tutorial. + + + +1. Define your adapter configuration with the task type and hyperparameters (see [`~peft.LoraConfig`] for more details about what the hyperparameters do). + +```py +from peft import LoraConfig + +peft_config = LoraConfig( + lora_alpha=16, + lora_dropout=0.1, + r=64, + bias="none", + task_type="CAUSAL_LM", +) +``` + +2. Add adapter to the model. + +```py +model.add_adapter(peft_config) +``` + +3. Now you can pass the model to [`Trainer`]! + +```py +trainer = Trainer(model=model, ...) +trainer.train() +``` + +To save your trained adapter and load it back: + +```py +model.save_pretrained(save_dir) +model = AutoModelForCausalLM.from_pretrained(save_dir) +``` + +## Add additional trainable layers to a PEFT adapter + +You can also fine-tune additional trainable adapters on top of a model that has adapters attached by passing `modules_to_save` in your PEFT config. 
For example, if you want to also fine-tune the lm_head on top of a model with a LoRA adapter: + +```py +from transformers import AutoModelForCausalLM, OPTForCausalLM, AutoTokenizer +from peft import LoraConfig + +model_id = "facebook/opt-350m" +model = AutoModelForCausalLM.from_pretrained(model_id) + +lora_config = LoraConfig( + target_modules=["q_proj", "k_proj"], + modules_to_save=["lm_head"], +) + +model.add_adapter(lora_config) +``` + + + diff --git a/docs/source/en/perf_hardware.mdx b/docs/source/en/perf_hardware.md similarity index 94% rename from docs/source/en/perf_hardware.mdx rename to docs/source/en/perf_hardware.md index b28df49892b1ac..c42b58483bebd2 100644 --- a/docs/source/en/perf_hardware.mdx +++ b/docs/source/en/perf_hardware.md @@ -12,6 +12,10 @@ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> @@ -23,6 +27,7 @@ Let's have a look at some practical advice for GPU setups. ## GPU When you train bigger models you have essentially three options: + - bigger GPUs - more GPUs - more CPU and NVMe (offloaded to by [DeepSpeed-Infinity](main_classes/deepspeed#nvme-support)) @@ -59,7 +64,7 @@ Next let's have a look at one of the most important aspects when having multiple If you use multiple GPUs the way cards are inter-connected can have a huge impact on the total training time. If the GPUs are on the same physical node, you can run: -``` +```bash nvidia-smi topo -m ``` @@ -111,7 +116,7 @@ Each new generation provides a faster bandwidth, e.g. here is a quote from [Nvid So the higher `X` you get in the report of `NVX` in the output of `nvidia-smi topo -m` the better. The generation will depend on your GPU architecture. -Let's compare the execution of a gpt2 language model training over a small sample of wikitext. +Let's compare the execution of a openai-community/gpt2 language model training over a small sample of wikitext. 
The results are: @@ -129,8 +134,8 @@ Here is the full benchmark code and outputs: ```bash # DDP w/ NVLink -rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \ ---nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \ +rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \ +--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \ --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \ --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 @@ -138,8 +143,8 @@ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch # DDP w/o NVLink -rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 python -m torch.distributed.launch \ ---nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \ +rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \ +--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \ --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 diff --git a/docs/source/en/perf_infer_cpu.md b/docs/source/en/perf_infer_cpu.md new file mode 100644 index 00000000000000..c0e017c020870e --- /dev/null +++ b/docs/source/en/perf_infer_cpu.md @@ -0,0 +1,127 @@ + + +# CPU inference + +With some optimizations, it is possible to efficiently run large model inference on a CPU. One of these optimization techniques involves compiling the PyTorch code into an intermediate format for high-performance environments like C++. The other technique fuses multiple operations into one kernel to reduce the overhead of running each operation separately. + +You'll learn how to use [BetterTransformer](https://pytorch.org/blog/a-better-transformer-for-fast-transformer-encoder-inference/) for faster inference, and how to convert your PyTorch code to [TorchScript](https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html). If you're using an Intel CPU, you can also use [graph optimizations](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/features.html#graph-optimization) from [Intel Extension for PyTorch](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/index.html) to boost inference speed even more. Finally, learn how to use 🤗 Optimum to accelerate inference with ONNX Runtime or OpenVINO (if you're using an Intel CPU). + +## BetterTransformer + +BetterTransformer accelerates inference with its fastpath (native PyTorch specialized implementation of Transformer functions) execution. The two optimizations in the fastpath execution are: + +1. fusion, which combines multiple sequential operations into a single "kernel" to reduce the number of computation steps +2. skipping the inherent sparsity of padding tokens to avoid unnecessary computation with nested tensors + +BetterTransformer also converts all attention operations to use the more memory-efficient [scaled dot product attention](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention). + + + +BetterTransformer is not supported for all models. Check this [list](https://huggingface.co/docs/optimum/bettertransformer/overview#supported-models) to see if a model supports BetterTransformer. 
+ + + +Before you start, make sure you have 🤗 Optimum [installed](https://huggingface.co/docs/optimum/installation). + +Enable BetterTransformer with the [`PreTrainedModel.to_bettertransformer`] method: + +```py +from transformers import AutoModelForCausalLM + +model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder") +model.to_bettertransformer() +``` + +## TorchScript + +TorchScript is an intermediate PyTorch model representation that can be run in production environments where performance is important. You can train a model in PyTorch and then export it to TorchScript to free the model from Python performance constraints. PyTorch [traces](https://pytorch.org/docs/stable/generated/torch.jit.trace.html) a model to return a [`ScriptFunction`] that is optimized with just-in-time compilation (JIT). Compared to the default eager mode, JIT mode in PyTorch typically yields better performance for inference using optimization techniques like operator fusion. + +For a gentle introduction to TorchScript, see the [Introduction to PyTorch TorchScript](https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html) tutorial. + +With the [`Trainer`] class, you can enable JIT mode for CPU inference by setting the `--jit_mode_eval` flag: + +```bash +python run_qa.py \ +--model_name_or_path csarron/bert-base-uncased-squad-v1 \ +--dataset_name squad \ +--do_eval \ +--max_seq_length 384 \ +--doc_stride 128 \ +--output_dir /tmp/ \ +--no_cuda \ +--jit_mode_eval +``` + + + +For PyTorch >= 1.14.0, JIT-mode could benefit any model for prediction and evaluation since the dict input is supported in `jit.trace`. + +For PyTorch < 1.14.0, JIT-mode could benefit a model if its forward parameter order matches the tuple input order in `jit.trace`, such as a question-answering model. If the forward parameter order does not match the tuple input order in `jit.trace`, like a text classification model, `jit.trace` will fail and we are capturing this with the exception here to make it fallback. Logging is used to notify users. + + + +## IPEX graph optimization + +Intel® Extension for PyTorch (IPEX) provides further optimizations in JIT mode for Intel CPUs, and we recommend combining it with TorchScript for even faster performance. The IPEX [graph optimization](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/features/graph_optimization.html) fuses operations like Multi-head attention, Concat Linear, Linear + Add, Linear + Gelu, Add + LayerNorm, and more. + +To take advantage of these graph optimizations, make sure you have IPEX [installed](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/installation.html): + +```bash +pip install intel_extension_for_pytorch +``` + +Set the `--use_ipex` and `--jit_mode_eval` flags in the [`Trainer`] class to enable JIT mode with the graph optimizations: + +```bash +python run_qa.py \ +--model_name_or_path csarron/bert-base-uncased-squad-v1 \ +--dataset_name squad \ +--do_eval \ +--max_seq_length 384 \ +--doc_stride 128 \ +--output_dir /tmp/ \ +--no_cuda \ +--use_ipex \ +--jit_mode_eval +``` + +## 🤗 Optimum + + + +Learn more details about using ORT with 🤗 Optimum in the [Optimum Inference with ONNX Runtime](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/models) guide. This section only provides a brief and simple example. + + + +ONNX Runtime (ORT) is a model accelerator that runs inference on CPUs by default. 
ORT is supported by 🤗 Optimum which can be used in 🤗 Transformers, without making too many changes to your code. You only need to replace the 🤗 Transformers `AutoClass` with its equivalent [`~optimum.onnxruntime.ORTModel`] for the task you're solving, and load a checkpoint in the ONNX format. + +For example, if you're running inference on a question answering task, load the [optimum/roberta-base-squad2](https://huggingface.co/optimum/roberta-base-squad2) checkpoint which contains a `model.onnx` file: + +```py +from transformers import AutoTokenizer, pipeline +from optimum.onnxruntime import ORTModelForQuestionAnswering + +model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2") +tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2") + +onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer) + +question = "What's my name?" +context = "My name is Philipp and I live in Nuremberg." +pred = onnx_qa(question, context) +``` + +If you have an Intel CPU, take a look at 🤗 [Optimum Intel](https://huggingface.co/docs/optimum/intel/index) which supports a variety of compression techniques (quantization, pruning, knowledge distillation) and tools for converting models to the [OpenVINO](https://huggingface.co/docs/optimum/intel/inference) format for higher performance inference. diff --git a/docs/source/en/perf_infer_cpu.mdx b/docs/source/en/perf_infer_cpu.mdx deleted file mode 100644 index a3df21e93a57e6..00000000000000 --- a/docs/source/en/perf_infer_cpu.mdx +++ /dev/null @@ -1,71 +0,0 @@ - - -# Efficient Inference on CPU - -This guide focuses on inferencing large models efficiently on CPU. - -## `BetterTransformer` for faster inference - -We have recently integrated `BetterTransformer` for faster inference on CPU for text, image and audio models. Check the documentation about this integration [here](https://huggingface.co/docs/optimum/bettertransformer/overview) for more details. - -## PyTorch JIT-mode (TorchScript) -TorchScript is a way to create serializable and optimizable models from PyTorch code. Any TorchScript program can be saved from a Python process and loaded in a process where there is no Python dependency. -Comparing to default eager mode, jit mode in PyTorch normally yields better performance for model inference from optimization methodologies like operator fusion. - -For a gentle introduction to TorchScript, see the Introduction to [PyTorch TorchScript tutorial](https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html#tracing-modules). - -### IPEX Graph Optimization with JIT-mode -Intel® Extension for PyTorch provides further optimizations in jit mode for Transformers series models. It is highly recommended for users to take advantage of Intel® Extension for PyTorch with jit mode. Some frequently used operator patterns from Transformers models are already supported in Intel® Extension for PyTorch with jit mode fusions. Those fusion patterns like Multi-head-attention fusion, Concat Linear, Linear+Add, Linear+Gelu, Add+LayerNorm fusion and etc. are enabled and perform well. The benefit of the fusion is delivered to users in a transparent fashion. According to the analysis, ~70% of most popular NLP tasks in question-answering, text-classification, and token-classification can get performance benefits with these fusion patterns for both Float32 precision and BFloat16 Mixed precision. 
- -Check more detailed information for [IPEX Graph Optimization](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/features/graph_optimization.html). - -#### IPEX installation: - -IPEX release is following PyTorch, check the approaches for [IPEX installation](https://intel.github.io/intel-extension-for-pytorch/). - -### Usage of JIT-mode -To enable JIT-mode in Trainer for evaluaion or prediction, users should add `jit_mode_eval` in Trainer command arguments. - - - -for PyTorch >= 1.14.0. JIT-mode could benefit any models for prediction and evaluaion since dict input is supported in jit.trace - -for PyTorch < 1.14.0. JIT-mode could benefit models whose forward parameter order matches the tuple input order in jit.trace, like question-answering model -In the case where the forward parameter order does not match the tuple input order in jit.trace, like text-classification models, jit.trace will fail and we are capturing this with the exception here to make it fallback. Logging is used to notify users. - - - -Take an example of the use cases on [Transformers question-answering](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) - - -- Inference using jit mode on CPU: -
python run_qa.py \
---model_name_or_path csarron/bert-base-uncased-squad-v1 \
---dataset_name squad \
---do_eval \
---max_seq_length 384 \
---doc_stride 128 \
---output_dir /tmp/ \
---no_cuda \
---jit_mode_eval 
- -- Inference with IPEX using jit mode on CPU: -
python run_qa.py \
---model_name_or_path csarron/bert-base-uncased-squad-v1 \
---dataset_name squad \
---do_eval \
---max_seq_length 384 \
---doc_stride 128 \
---output_dir /tmp/ \
---no_cuda \
---use_ipex \
---jit_mode_eval
diff --git a/docs/source/en/perf_infer_gpu_many.mdx b/docs/source/en/perf_infer_gpu_many.mdx deleted file mode 100644 index d8a24d6ab8aeaa..00000000000000 --- a/docs/source/en/perf_infer_gpu_many.mdx +++ /dev/null @@ -1,23 +0,0 @@ - - -# Efficient Inference on a Multiple GPUs - -This document contains information on how to efficiently infer on a multiple GPUs. - - -Note: A multi GPU setup can use the majority of the strategies described in the [single GPU section](./perf_infer_gpu_one). You must be aware of simple techniques, though, that can be used for a better usage. - - - -## `BetterTransformer` for faster inference - -We have recently integrated `BetterTransformer` for faster inference on multi-GPU for text, image and audio models. Check the documentation about this integration [here](https://huggingface.co/docs/optimum/bettertransformer/overview) for more details. diff --git a/docs/source/en/perf_infer_gpu_one.md b/docs/source/en/perf_infer_gpu_one.md new file mode 100644 index 00000000000000..36452aabd4d2d8 --- /dev/null +++ b/docs/source/en/perf_infer_gpu_one.md @@ -0,0 +1,398 @@ + + +# GPU inference + +GPUs are the standard choice of hardware for machine learning, unlike CPUs, because they are optimized for memory bandwidth and parallelism. To keep up with the larger sizes of modern models or to run these large models on existing and older hardware, there are several optimizations you can use to speed up GPU inference. In this guide, you'll learn how to use FlashAttention-2 (a more memory-efficient attention mechanism), BetterTransformer (a PyTorch native fastpath execution), and bitsandbytes to quantize your model to a lower precision. Finally, learn how to use 🤗 Optimum to accelerate inference with ONNX Runtime on Nvidia and AMD GPUs. + + + +The majority of the optimizations described here also apply to multi-GPU setups! + + + +## FlashAttention-2 + + + +FlashAttention-2 is experimental and may change considerably in future versions. + + + +[FlashAttention-2](https://huggingface.co/papers/2205.14135) is a faster and more efficient implementation of the standard attention mechanism that can significantly speedup inference by: + +1. additionally parallelizing the attention computation over sequence length +2. 
partitioning the work between GPU threads to reduce communication and shared memory reads/writes between them + +FlashAttention-2 is currently supported for the following architectures: +* [Bark](https://huggingface.co/docs/transformers/model_doc/bark#transformers.BarkModel) +* [Bart](https://huggingface.co/docs/transformers/model_doc/bart#transformers.BartModel) +* [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.DistilBertModel) +* [GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode#transformers.GPTBigCodeModel) +* [GPTNeo](https://huggingface.co/docs/transformers/model_doc/gpt_neo#transformers.GPTNeoModel) +* [GPTNeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox#transformers.GPTNeoXModel) +* [Falcon](https://huggingface.co/docs/transformers/model_doc/falcon#transformers.FalconModel) +* [Llama](https://huggingface.co/docs/transformers/model_doc/llama#transformers.LlamaModel) +* [Llava](https://huggingface.co/docs/transformers/model_doc/llava) +* [VipLlava](https://huggingface.co/docs/transformers/model_doc/vipllava) +* [MBart](https://huggingface.co/docs/transformers/model_doc/mbart#transformers.MBartModel) +* [Mistral](https://huggingface.co/docs/transformers/model_doc/mistral#transformers.MistralModel) +* [Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral#transformers.MixtralModel) +* [OPT](https://huggingface.co/docs/transformers/model_doc/opt#transformers.OPTModel) +* [Phi](https://huggingface.co/docs/transformers/model_doc/phi#transformers.PhiModel) +* [StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm#transformers.StableLmModel) +* [Qwen2](https://huggingface.co/docs/transformers/model_doc/qwen2#transformers.Qwen2Model) +* [Whisper](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperModel) + +You can request to add FlashAttention-2 support for another model by opening a GitHub Issue or Pull Request. + +Before you begin, make sure you have FlashAttention-2 installed. + + + + +```bash +pip install flash-attn --no-build-isolation +``` + +We strongly suggest referring to the detailed [installation instructions](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#installation-and-features) to learn more about supported hardware and data types! + + + + +FlashAttention-2 is also supported on AMD GPUs and current support is limited to **Instinct MI210** and **Instinct MI250**. We strongly suggest using this [Dockerfile](https://github.com/huggingface/optimum-amd/tree/main/docker/transformers-pytorch-amd-gpu-flash/Dockerfile) to use FlashAttention-2 on AMD GPUs. + + + + +To enable FlashAttention-2, pass the argument `attn_implementation="flash_attention_2"` to [`~AutoModelForCausalLM.from_pretrained`]: + +```python +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM + +model_id = "tiiuae/falcon-7b" +tokenizer = AutoTokenizer.from_pretrained(model_id) + +model = AutoModelForCausalLM.from_pretrained( + model_id, + torch_dtype=torch.bfloat16, + attn_implementation="flash_attention_2", +) +``` + + + +FlashAttention-2 can only be used when the model's dtype is `fp16` or `bf16`. Make sure to cast your model to the appropriate dtype and load them on a supported device before using FlashAttention-2. + +
+ +You can also set `use_flash_attention_2=True` to enable FlashAttention-2 but it is deprecated in favor of `attn_implementation="flash_attention_2"`. + +
+ +FlashAttention-2 can be combined with other optimization techniques like quantization to further speedup inference. For example, you can combine FlashAttention-2 with 8-bit or 4-bit quantization: + +```py +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM + +model_id = "tiiuae/falcon-7b" +tokenizer = AutoTokenizer.from_pretrained(model_id) + +# load in 8bit +model = AutoModelForCausalLM.from_pretrained( + model_id, + load_in_8bit=True, + attn_implementation="flash_attention_2", +) + +# load in 4bit +model = AutoModelForCausalLM.from_pretrained( + model_id, + load_in_4bit=True, + attn_implementation="flash_attention_2", +) +``` + +### Expected speedups + +You can benefit from considerable speedups for inference, especially for inputs with long sequences. However, since FlashAttention-2 does not support computing attention scores with padding tokens, you must manually pad/unpad the attention scores for batched inference when the sequence contains padding tokens. This leads to a significant slowdown for batched generations with padding tokens. + +To overcome this, you should use FlashAttention-2 without padding tokens in the sequence during training (by packing a dataset or [concatenating sequences](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py#L516) until reaching the maximum sequence length). + +For a single forward pass on [tiiuae/falcon-7b](https://hf.co/tiiuae/falcon-7b) with a sequence length of 4096 and various batch sizes without padding tokens, the expected speedup is: + +
+ +
+ +For a single forward pass on [meta-llama/Llama-7b-hf](https://hf.co/meta-llama/Llama-7b-hf) with a sequence length of 4096 and various batch sizes without padding tokens, the expected speedup is: + +
+ +
+ +For sequences with padding tokens (generating with padding tokens), you need to unpad/pad the input sequences to correctly compute the attention scores. With a relatively small sequence length, a single forward pass creates overhead leading to a small speedup (in the example below, 30% of the input is filled with padding tokens): + +
+ +
+ +But for larger sequence lengths, you can expect even more speedup benefits: + + + +FlashAttention is more memory efficient, meaning you can train on much larger sequence lengths without running into out-of-memory issues. You can potentially reduce memory usage up to 20x for larger sequence lengths. Take a look at the [flash-attention](https://github.com/Dao-AILab/flash-attention) repository for more details. + + + +
+ +
+ +## PyTorch scaled dot product attention + +PyTorch's [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html) (SDPA) can also call FlashAttention and memory-efficient attention kernels under the hood. SDPA support is currently being added natively in Transformers and is used by default for `torch>=2.1.1` when an implementation is available. + +For now, Transformers supports SDPA inference and training for the following architectures: +* [Bart](https://huggingface.co/docs/transformers/model_doc/bart#transformers.BartModel) +* [GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode#transformers.GPTBigCodeModel) +* [Falcon](https://huggingface.co/docs/transformers/model_doc/falcon#transformers.FalconModel) +* [Llama](https://huggingface.co/docs/transformers/model_doc/llama#transformers.LlamaModel) +* [Phi](https://huggingface.co/docs/transformers/model_doc/phi#transformers.PhiModel) +* [Idefics](https://huggingface.co/docs/transformers/model_doc/idefics#transformers.IdeficsModel) +* [Whisper](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperModel) +* [Mistral](https://huggingface.co/docs/transformers/model_doc/mistral#transformers.MistralModel) +* [Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral#transformers.MixtralModel) +* [Qwen2](https://huggingface.co/docs/transformers/model_doc/qwen2#transformers.Qwen2Model) + + + +FlashAttention can only be used for models with the `fp16` or `bf16` torch type, so make sure to cast your model to the appropriate type first. + + + +By default, SDPA selects the most performant kernel available but you can check whether a backend is available in a given setting (hardware, problem size) with [`torch.backends.cuda.sdp_kernel`](https://pytorch.org/docs/master/backends.html#torch.backends.cuda.sdp_kernel) as a context manager: + +```diff +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m") +model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.float16).to("cuda") +# convert the model to BetterTransformer +model.to_bettertransformer() + +input_text = "Hello my dog is cute and" +inputs = tokenizer(input_text, return_tensors="pt").to("cuda") + ++ with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False): + outputs = model.generate(**inputs) + +print(tokenizer.decode(outputs[0], skip_special_tokens=True)) +``` + +If you see a bug with the traceback below, try using the nightly version of PyTorch which may have broader coverage for FlashAttention: + +```bash +RuntimeError: No available kernel. Aborting execution. + +# install PyTorch nightly +pip3 install -U --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118 +``` + +## BetterTransformer + + + +Some BetterTransformer features are being upstreamed to Transformers with default support for native `torch.nn.scaled_dot_product_attention`. BetterTransformer still has a wider coverage than the Transformers SDPA integration, but you can expect more and more architectures to natively support SDPA in Transformers. 
+ + + + + +Check out our benchmarks with BetterTransformer and scaled dot product attention in the [Out of the box acceleration and memory savings of 🤗 decoder models with PyTorch 2.0](https://pytorch.org/blog/out-of-the-box-acceleration/) and learn more about the fastpath execution in the [BetterTransformer](https://medium.com/pytorch/bettertransformer-out-of-the-box-performance-for-huggingface-transformers-3fbe27d50ab2) blog post. + + + +BetterTransformer accelerates inference with its fastpath (native PyTorch specialized implementation of Transformer functions) execution. The two optimizations in the fastpath execution are: + +1. fusion, which combines multiple sequential operations into a single "kernel" to reduce the number of computation steps +2. skipping the inherent sparsity of padding tokens to avoid unnecessary computation with nested tensors + +BetterTransformer also converts all attention operations to use the more memory-efficient [scaled dot product attention (SDPA)](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention), and it calls optimized kernels like [FlashAttention](https://huggingface.co/papers/2205.14135) under the hood. + +Before you start, make sure you have 🤗 Optimum [installed](https://huggingface.co/docs/optimum/installation). + +Then you can enable BetterTransformer with the [`PreTrainedModel.to_bettertransformer`] method: + +```python +model = model.to_bettertransformer() +``` + +You can return the original Transformers model with the [`~PreTrainedModel.reverse_bettertransformer`] method. You should use this before saving your model to use the canonical Transformers modeling: + +```py +model = model.reverse_bettertransformer() +model.save_pretrained("saved_model") +``` + +## bitsandbytes + +bitsandbytes is a quantization library that includes support for 4-bit and 8-bit quantization. Quantization reduces your model size compared to its native full precision version, making it easier to fit large models onto GPUs with limited memory. + +Make sure you have bitsandbytes and 🤗 Accelerate installed: + +```bash +# these versions support 8-bit and 4-bit +pip install bitsandbytes>=0.39.0 accelerate>=0.20.0 + +# install Transformers +pip install transformers +``` + +### 4-bit + +To load a model in 4-bit for inference, use the `load_in_4bit` parameter. The `device_map` parameter is optional, but we recommend setting it to `"auto"` to allow 🤗 Accelerate to automatically and efficiently allocate the model given the available resources in the environment. + +```py +from transformers import AutoModelForCausalLM + +model_name = "bigscience/bloom-2b5" +model_4bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True) +``` + +To load a model in 4-bit for inference with multiple GPUs, you can control how much GPU RAM you want to allocate to each GPU. For example, to distribute 600MB of memory to the first GPU and 1GB of memory to the second GPU: + +```py +max_memory_mapping = {0: "600MB", 1: "1GB"} +model_name = "bigscience/bloom-3b" +model_4bit = AutoModelForCausalLM.from_pretrained( + model_name, device_map="auto", load_in_4bit=True, max_memory=max_memory_mapping +) +``` + +### 8-bit + + + +If you're curious and interested in learning more about the concepts underlying 8-bit quantization, read the [Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes](https://huggingface.co/blog/hf-bitsandbytes-integration) blog post. 
+ + + +To load a model in 8-bit for inference, use the `load_in_8bit` parameter. The `device_map` parameter is optional, but we recommend setting it to `"auto"` to allow 🤗 Accelerate to automatically and efficiently allocate the model given the available resources in the environment: + +```py +from transformers import AutoModelForCausalLM + +model_name = "bigscience/bloom-2b5" +model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True) +``` + +If you're loading a model in 8-bit for text generation, you should use the [`~transformers.GenerationMixin.generate`] method instead of the [`Pipeline`] function which is not optimized for 8-bit models and will be slower. Some sampling strategies, like nucleus sampling, are also not supported by the [`Pipeline`] for 8-bit models. You should also place all inputs on the same device as the model: + +```py +from transformers import AutoModelForCausalLM, AutoTokenizer + +model_name = "bigscience/bloom-2b5" +tokenizer = AutoTokenizer.from_pretrained(model_name) +model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True) + +prompt = "Hello, my llama is cute" +inputs = tokenizer(prompt, return_tensors="pt").to("cuda") +generated_ids = model.generate(**inputs) +outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True) +``` + +To load a model in 4-bit for inference with multiple GPUs, you can control how much GPU RAM you want to allocate to each GPU. For example, to distribute 1GB of memory to the first GPU and 2GB of memory to the second GPU: + +```py +max_memory_mapping = {0: "1GB", 1: "2GB"} +model_name = "bigscience/bloom-3b" +model_8bit = AutoModelForCausalLM.from_pretrained( + model_name, device_map="auto", load_in_8bit=True, max_memory=max_memory_mapping +) +``` + + + +Feel free to try running a 11 billion parameter [T5 model](https://colab.research.google.com/drive/1YORPWx4okIHXnjW7MSAidXN29mPVNT7F?usp=sharing) or the 3 billion parameter [BLOOM model](https://colab.research.google.com/drive/1qOjXfQIAULfKvZqwCen8-MoWKGdSatZ4?usp=sharing) for inference on Google Colab's free tier GPUs! + + + +## 🤗 Optimum + + + +Learn more details about using ORT with 🤗 Optimum in the [Accelerated inference on NVIDIA GPUs](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/gpu#accelerated-inference-on-nvidia-gpus) and [Accelerated inference on AMD GPUs](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/amdgpu#accelerated-inference-on-amd-gpus) guides. This section only provides a brief and simple example. + + + +ONNX Runtime (ORT) is a model accelerator that supports accelerated inference on Nvidia GPUs, and AMD GPUs that use [ROCm](https://www.amd.com/en/products/software/rocm.html) stack. ORT uses optimization techniques like fusing common operations into a single node and constant folding to reduce the number of computations performed and speedup inference. ORT also places the most computationally intensive operations on the GPU and the rest on the CPU to intelligently distribute the workload between the two devices. + +ORT is supported by 🤗 Optimum which can be used in 🤗 Transformers. 
You'll need to use an [`~optimum.onnxruntime.ORTModel`] for the task you're solving, and specify the `provider` parameter which can be set to either [`CUDAExecutionProvider`](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/gpu#cudaexecutionprovider), [`ROCMExecutionProvider`](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/amdgpu) or [`TensorrtExecutionProvider`](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/gpu#tensorrtexecutionprovider). If you want to load a model that was not yet exported to ONNX, you can set `export=True` to convert your model on-the-fly to the ONNX format: + +```py +from optimum.onnxruntime import ORTModelForSequenceClassification + +ort_model = ORTModelForSequenceClassification.from_pretrained( + "distilbert/distilbert-base-uncased-finetuned-sst-2-english", + export=True, + provider="CUDAExecutionProvider", +) +``` + +Now you're free to use the model for inference: + +```py +from optimum.pipelines import pipeline +from transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english") + +pipeline = pipeline(task="text-classification", model=ort_model, tokenizer=tokenizer, device="cuda:0") +result = pipeline("Both the music and visual were astounding, not to mention the actors performance.") +``` + +## Combine optimizations + +It is often possible to combine several of the optimization techniques described above to get the best inference performance possible for your model. For example, you can load a model in 4-bit, and then enable BetterTransformer with FlashAttention: + +```py +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig + +# load model in 4-bit +quantization_config = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_compute_dtype=torch.float16 +) + +tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m") +model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", quantization_config=quantization_config) + +# enable BetterTransformer +model = model.to_bettertransformer() + +input_text = "Hello my dog is cute and" +inputs = tokenizer(input_text, return_tensors="pt").to("cuda") + +# enable FlashAttention +with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False): + outputs = model.generate(**inputs) + +print(tokenizer.decode(outputs[0], skip_special_tokens=True)) +``` diff --git a/docs/source/en/perf_infer_gpu_one.mdx b/docs/source/en/perf_infer_gpu_one.mdx deleted file mode 100644 index 55b3b9fd9937e0..00000000000000 --- a/docs/source/en/perf_infer_gpu_one.mdx +++ /dev/null @@ -1,108 +0,0 @@ - - -# Efficient Inference on a Single GPU - -This document will be completed soon with information on how to infer on a single GPU. In the meantime you can check out [the guide for training on a single GPU](perf_train_gpu_one) and [the guide for inference on CPUs](perf_infer_cpu). - -## `BetterTransformer` for faster inference - -We have recently integrated `BetterTransformer` for faster inference on GPU for text, image and audio models. Check the documentation about this integration [here](https://huggingface.co/docs/optimum/bettertransformer/overview) for more details. - -## `bitsandbytes` integration for Int8 mixed-precision matrix decomposition - - - -Note that this feature can also be used in a multi GPU setup. 
- - - -From the paper [`LLM.int8() : 8-bit Matrix Multiplication for Transformers at Scale`](https://arxiv.org/abs/2208.07339), we support Hugging Face integration for all models in the Hub with a few lines of code. -The method reduces `nn.Linear` size by 2 for `float16` and `bfloat16` weights and by 4 for `float32` weights, with close to no impact to the quality by operating on the outliers in half-precision. - -![HFxbitsandbytes.png](https://s3.amazonaws.com/moonup/production/uploads/1659861207959-62441d1d9fdefb55a0b7d12c.png) - -Int8 mixed-precision matrix decomposition works by separating a matrix multiplication into two streams: (1) a systematic feature outlier stream matrix multiplied in fp16 (0.01%), (2) a regular stream of int8 matrix multiplication (99.9%). With this method, int8 inference with no predictive degradation is possible for very large models. -For more details regarding the method, check out the [paper](https://arxiv.org/abs/2208.07339) or our [blogpost about the integration](https://huggingface.co/blog/hf-bitsandbytes-integration). - -![MixedInt8.gif](https://s3.amazonaws.com/moonup/production/uploads/1660567469965-62441d1d9fdefb55a0b7d12c.gif) - -Note, that you would require a GPU to run mixed-8bit models as the kernels have been compiled for GPUs only. Make sure that you have enough GPU memory to store the quarter (or half if your model weights are in half precision) of the model before using this feature. -Below are some notes to help you use this module, or follow the demos on [Google colab](#colab-demos). - -### Requirements - -- If you have `bitsandbytes<0.37.0`, make sure you run on NVIDIA GPUs that support 8-bit tensor cores (Turing, Ampere or newer architectures - e.g. T4, RTX20s RTX30s, A40-A100). For `bitsandbytes>=0.37.0`, all GPUs should be supported. -- Install the correct version of `bitsandbytes` by running: -`pip install bitsandbytes>=0.31.5` -- Install `accelerate` -`pip install accelerate>=0.12.0` - -### Running mixed-Int8 models - single GPU setup - -After installing the required libraries, the way to load your mixed 8-bit model is as follows: - -```py -from transformers import AutoModelForCausalLM - -model_name = "bigscience/bloom-2b5" -model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True) -``` - -For text generation, we recommend: - -* using the model's `generate()` method instead of the `pipeline()` function. Although inference is possible with the `pipeline()` function, it is not optimized for mixed-8bit models, and will be slower than using the `generate()` method. Moreover, some sampling strategies are like nucleaus sampling are not supported by the `pipeline()` function for mixed-8bit models. -* placing all inputs on the same device as the model. 
- -Here is a simple example: - -```py -from transformers import AutoModelForCausalLM, AutoTokenizer - -model_name = "bigscience/bloom-2b5" -tokenizer = AutoTokenizer.from_pretrained(model_name) -model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True) - -text = "Hello, my llama is cute" -inputs = tokenizer(prompt, return_tensors="pt").to("cuda") -generated_ids = model.generate(**inputs) -outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True) -``` - - -### Running mixed-int8 models - multi GPU setup - -The way to load your mixed 8-bit model in multiple GPUs is as follows (same command as single GPU setup): -```py -model_name = "bigscience/bloom-2b5" -model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True) -``` -But you can control the GPU RAM you want to allocate on each GPU using `accelerate`. Use the `max_memory` argument as follows: - -```py -max_memory_mapping = {0: "1GB", 1: "2GB"} -model_name = "bigscience/bloom-3b" -model_8bit = AutoModelForCausalLM.from_pretrained( - model_name, device_map="auto", load_in_8bit=True, max_memory=max_memory_mapping -) -``` -In this example, the first GPU will use 1GB of memory and the second 2GB. - -### Colab demos - -With this method you can infer on models that were not possible to infer on a Google Colab before. -Check out the demo for running T5-11b (42GB in fp32)! Using 8-bit quantization on Google Colab: - -[![Open In Colab: T5-11b demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1YORPWx4okIHXnjW7MSAidXN29mPVNT7F?usp=sharing) - -Or this demo for BLOOM-3B: - -[![Open In Colab: BLOOM-3b demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1qOjXfQIAULfKvZqwCen8-MoWKGdSatZ4?usp=sharing) \ No newline at end of file diff --git a/docs/source/en/perf_torch_compile.md b/docs/source/en/perf_torch_compile.md new file mode 100644 index 00000000000000..a840e7d551cebf --- /dev/null +++ b/docs/source/en/perf_torch_compile.md @@ -0,0 +1,359 @@ + + +# Optimize inference using torch.compile() + +This guide aims to provide a benchmark on the inference speed-ups introduced with [`torch.compile()`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) for [computer vision models in 🤗 Transformers](https://huggingface.co/models?pipeline_tag=image-classification&library=transformers&sort=trending). + +## Benefits of torch.compile + +Depending on the model and the GPU, `torch.compile()` yields up to 30% speed-up during inference. To use `torch.compile()`, simply install any version of `torch` above 2.0. + +Compiling a model takes time, so it's useful if you are compiling the model only once instead of every time you infer. +To compile any computer vision model of your choice, call `torch.compile()` on the model as shown below: + +```diff +from transformers import AutoModelForImageClassification + +model = AutoModelForImageClassification.from_pretrained(MODEL_ID).to("cuda") ++ model = torch.compile(model) +``` + +`compile()` comes with multiple modes for compiling, which essentially differ in compilation time and inference overhead. `max-autotune` takes longer than `reduce-overhead` but results in faster inference. Default mode is fastest for compilation but is not as efficient compared to `reduce-overhead` for inference time. In this guide, we used the default mode. 
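+
+If you want to trade extra compilation time for lower steady-state latency, you can select one of the non-default modes explicitly. The snippet below is only a minimal sketch and is not part of the benchmark in this guide; the ViT checkpoint is reused purely as a placeholder:
+
+```python
+import torch
+from transformers import AutoModelForImageClassification
+
+# Placeholder model; any model passed to torch.compile() accepts the same `mode` argument
+model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224").to("cuda")
+
+# "reduce-overhead" targets lower per-call overhead (useful for small batches),
+# "max-autotune" searches for the fastest kernels but takes the longest to compile
+compiled_model = torch.compile(model, mode="reduce-overhead")
+```
+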
You can learn more about the different compilation modes [here](https://pytorch.org/get-started/pytorch-2.0/#user-experience).
+
+We benchmarked `torch.compile` with different computer vision models, tasks, types of hardware, and batch sizes on `torch` version 2.0.1.
+
+## Benchmarking code
+
+Below you can find the benchmarking code for each task. We warm up the GPU before inference and take the mean time of 300 inferences, using the same image each time.
+
+### Image Classification with ViT
+
+```python
+import torch
+from PIL import Image
+import requests
+from transformers import AutoImageProcessor, AutoModelForImageClassification
+
+url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
+image = Image.open(requests.get(url, stream=True).raw)
+
+processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
+model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224").to("cuda")
+model = torch.compile(model)
+
+processed_input = processor(image, return_tensors='pt').to(device="cuda")
+
+with torch.no_grad():
+    _ = model(**processed_input)
+```
+
+### Object Detection with DETR
+
+```python
+from transformers import AutoImageProcessor, AutoModelForObjectDetection
+
+processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50")
+model = AutoModelForObjectDetection.from_pretrained("facebook/detr-resnet-50").to("cuda")
+model = torch.compile(model)
+
+# reuse the image downloaded for the ViT benchmark above
+inputs = processor(images=image, return_tensors="pt").to("cuda")
+
+with torch.no_grad():
+    _ = model(**inputs)
+```
+
+### Image Segmentation with Segformer
+
+```python
+from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation
+
+processor = SegformerImageProcessor.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")
+model = SegformerForSemanticSegmentation.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512").to("cuda")
+model = torch.compile(model)
+seg_inputs = processor(images=image, return_tensors="pt").to("cuda")
+
+with torch.no_grad():
+    _ = model(**seg_inputs)
+```
+
+Below you can find the list of the models we benchmarked.
+
+**Image Classification**
+- [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224)
+- [microsoft/beit-base-patch16-224-pt22k-ft22k](https://huggingface.co/microsoft/beit-base-patch16-224-pt22k-ft22k)
+- [facebook/convnext-large-224](https://huggingface.co/facebook/convnext-large-224)
+- [microsoft/resnet-50](https://huggingface.co/microsoft/resnet-50)
+
+**Image Segmentation**
+- [nvidia/segformer-b0-finetuned-ade-512-512](https://huggingface.co/nvidia/segformer-b0-finetuned-ade-512-512)
+- [facebook/mask2former-swin-tiny-coco-panoptic](https://huggingface.co/facebook/mask2former-swin-tiny-coco-panoptic)
+- [facebook/maskformer-swin-base-ade](https://huggingface.co/facebook/maskformer-swin-base-ade)
+- [google/deeplabv3_mobilenet_v2_1.0_513](https://huggingface.co/google/deeplabv3_mobilenet_v2_1.0_513)
+
+**Object Detection**
+- [google/owlvit-base-patch32](https://huggingface.co/google/owlvit-base-patch32)
+- [facebook/detr-resnet-101](https://huggingface.co/facebook/detr-resnet-101)
+- [microsoft/conditional-detr-resnet-50](https://huggingface.co/microsoft/conditional-detr-resnet-50)
+
+Below you can find visualizations of inference duration with and without `torch.compile()` and percentage improvements for each model on different hardware and batch sizes.
+
+*Charts: inference duration with and without `torch.compile()` and percentage improvement, per model, for each hardware type (A100, V100, T4) and batch size.*
+ + +![Duration Comparison on V100 with Batch Size of 1](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/torch_compile/v100_1_duration.png) + +![Percentage Improvement on T4 with Batch Size of 4](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/torch_compile/T4_4_percentage.png) + +Below you can find inference durations in milliseconds for each model with and without `compile()`. Note that OwlViT results in OOM in larger batch sizes. + +### A100 (batch size: 1) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 9.325 | 7.584 | +| Image Segmentation/Segformer | 11.759 | 10.500 | +| Object Detection/OwlViT | 24.978 | 18.420 | +| Image Classification/BeiT | 11.282 | 8.448 | +| Object Detection/DETR | 34.619 | 19.040 | +| Image Classification/ConvNeXT | 10.410 | 10.208 | +| Image Classification/ResNet | 6.531 | 4.124 | +| Image Segmentation/Mask2former | 60.188 | 49.117 | +| Image Segmentation/Maskformer | 75.764 | 59.487 | +| Image Segmentation/MobileNet | 8.583 | 3.974 | +| Object Detection/Resnet-101 | 36.276 | 18.197 | +| Object Detection/Conditional-DETR | 31.219 | 17.993 | + + +### A100 (batch size: 4) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 14.832 | 14.499 | +| Image Segmentation/Segformer | 18.838 | 16.476 | +| Image Classification/BeiT | 13.205 | 13.048 | +| Object Detection/DETR | 48.657 | 32.418| +| Image Classification/ConvNeXT | 22.940 | 21.631 | +| Image Classification/ResNet | 6.657 | 4.268 | +| Image Segmentation/Mask2former | 74.277 | 61.781 | +| Image Segmentation/Maskformer | 180.700 | 159.116 | +| Image Segmentation/MobileNet | 14.174 | 8.515 | +| Object Detection/Resnet-101 | 68.101 | 44.998 | +| Object Detection/Conditional-DETR | 56.470 | 35.552 | + +### A100 (batch size: 16) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 40.944 | 40.010 | +| Image Segmentation/Segformer | 37.005 | 31.144 | +| Image Classification/BeiT | 41.854 | 41.048 | +| Object Detection/DETR | 164.382 | 161.902 | +| Image Classification/ConvNeXT | 82.258 | 75.561 | +| Image Classification/ResNet | 7.018 | 5.024 | +| Image Segmentation/Mask2former | 178.945 | 154.814 | +| Image Segmentation/Maskformer | 638.570 | 579.826 | +| Image Segmentation/MobileNet | 51.693 | 30.310 | +| Object Detection/Resnet-101 | 232.887 | 155.021 | +| Object Detection/Conditional-DETR | 180.491 | 124.032 | + +### V100 (batch size: 1) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 10.495 | 6.00 | +| Image Segmentation/Segformer | 13.321 | 5.862 | +| Object Detection/OwlViT | 25.769 | 22.395 | +| Image Classification/BeiT | 11.347 | 7.234 | +| Object Detection/DETR | 33.951 | 19.388 | +| Image Classification/ConvNeXT | 11.623 | 10.412 | +| Image Classification/ResNet | 6.484 | 3.820 | +| Image Segmentation/Mask2former | 64.640 | 49.873 | +| Image Segmentation/Maskformer | 95.532 | 72.207 | +| Image Segmentation/MobileNet | 9.217 | 4.753 | +| Object Detection/Resnet-101 | 52.818 | 28.367 | +| Object Detection/Conditional-DETR | 39.512 | 20.816 | + +### V100 (batch size: 4) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 15.181 | 14.501 | +| Image Segmentation/Segformer | 16.787 | 16.188 | +| Image Classification/BeiT | 15.171 | 14.753 | +| Object Detection/DETR | 88.529 | 64.195 | +| Image Classification/ConvNeXT | 29.574 | 27.085 | +| Image Classification/ResNet | 6.109 | 4.731 | +| Image Segmentation/Mask2former | 90.402 | 76.926 | +| Image Segmentation/Maskformer | 234.261 | 205.456 | +| Image Segmentation/MobileNet | 24.623 | 14.816 | +| Object Detection/Resnet-101 | 134.672 | 101.304 | +| Object Detection/Conditional-DETR | 97.464 | 69.739 | + +### V100 (batch size: 16) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 52.209 | 51.633 | +| Image Segmentation/Segformer | 61.013 | 55.499 | +| Image Classification/BeiT | 53.938 | 53.581 | +| Object Detection/DETR | OOM | OOM | +| Image Classification/ConvNeXT | 109.682 | 100.771 | +| Image Classification/ResNet | 14.857 | 12.089 | +| Image Segmentation/Mask2former | 249.605 | 222.801 | +| Image Segmentation/Maskformer | 831.142 | 743.645 | +| Image Segmentation/MobileNet | 93.129 | 55.365 | +| Object Detection/Resnet-101 | 482.425 | 361.843 | +| Object Detection/Conditional-DETR | 344.661 | 255.298 | + +### T4 (batch size: 1) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 16.520 | 15.786 | +| Image Segmentation/Segformer | 16.116 | 14.205 | +| Object Detection/OwlViT | 53.634 | 51.105 | +| Image Classification/BeiT | 16.464 | 15.710 | +| Object Detection/DETR | 73.100 | 53.99 | +| Image Classification/ConvNeXT | 32.932 | 30.845 | +| Image Classification/ResNet | 6.031 | 4.321 | +| Image Segmentation/Mask2former | 79.192 | 66.815 | +| Image Segmentation/Maskformer | 200.026 | 188.268 | +| Image Segmentation/MobileNet | 18.908 | 11.997 | +| Object Detection/Resnet-101 | 106.622 | 82.566 | +| Object Detection/Conditional-DETR | 77.594 | 56.984 | + +### T4 (batch size: 4) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 43.653 | 43.626 | +| Image Segmentation/Segformer | 45.327 | 42.445 | +| Image Classification/BeiT | 52.007 | 51.354 | +| Object Detection/DETR | 277.850 | 268.003 | +| Image Classification/ConvNeXT | 119.259 | 105.580 | +| Image Classification/ResNet | 13.039 | 11.388 | +| Image Segmentation/Mask2former | 201.540 | 184.670 | +| Image Segmentation/Maskformer | 764.052 | 711.280 | +| Image Segmentation/MobileNet | 74.289 | 48.677 | +| Object Detection/Resnet-101 | 421.859 | 357.614 | +| Object Detection/Conditional-DETR | 289.002 | 226.945 | + +### T4 (batch size: 16) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 163.914 | 160.907 | +| Image Segmentation/Segformer | 192.412 | 163.620 | +| Image Classification/BeiT | 188.978 | 187.976 | +| Object Detection/DETR | OOM | OOM | +| Image Classification/ConvNeXT | 422.886 | 388.078 | +| Image Classification/ResNet | 44.114 | 37.604 | +| Image Segmentation/Mask2former | 756.337 | 695.291 | +| Image Segmentation/Maskformer | 2842.940 | 2656.88 | +| Image Segmentation/MobileNet | 299.003 | 201.942 | +| Object Detection/Resnet-101 | 1619.505 | 1262.758 | +| Object Detection/Conditional-DETR | 1137.513 | 897.390| + +## PyTorch Nightly +We also benchmarked on PyTorch nightly (2.1.0dev, find the wheel [here](https://download.pytorch.org/whl/nightly/cu118)) and observed improvement in latency both for uncompiled and compiled models. + +### A100 + +| **Task/Model** | **Batch Size** | **torch 2.0 - no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:|:---:| +| Image Classification/BeiT | Unbatched | 12.462 | 6.954 | +| Image Classification/BeiT | 4 | 14.109 | 12.851 | +| Image Classification/BeiT | 16 | 42.179 | 42.147 | +| Object Detection/DETR | Unbatched | 30.484 | 15.221 | +| Object Detection/DETR | 4 | 46.816 | 30.942 | +| Object Detection/DETR | 16 | 163.749 | 163.706 | + +### T4 + +| **Task/Model** | **Batch Size** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:|:---:| +| Image Classification/BeiT | Unbatched | 14.408 | 14.052 | +| Image Classification/BeiT | 4 | 47.381 | 46.604 | +| Image Classification/BeiT | 16 | 42.179 | 42.147 | +| Object Detection/DETR | Unbatched | 68.382 | 53.481 | +| Object Detection/DETR | 4 | 269.615 | 204.785 | +| Object Detection/DETR | 16 | OOM | OOM | + +### V100 + +| **Task/Model** | **Batch Size** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:|:---:| +| Image Classification/BeiT | Unbatched | 13.477 | 7.926 | +| Image Classification/BeiT | 4 | 15.103 | 14.378 | +| Image Classification/BeiT | 16 | 52.517 | 51.691 | +| Object Detection/DETR | Unbatched | 28.706 | 19.077 | +| Object Detection/DETR | 4 | 88.402 | 62.949| +| Object Detection/DETR | 16 | OOM | OOM | + + +## Reduce Overhead +We benchmarked `reduce-overhead` compilation mode for A100 and T4 in Nightly. + +### A100 + +| **Task/Model** | **Batch Size** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:|:---:| +| Image Classification/ConvNeXT | Unbatched | 11.758 | 7.335 | +| Image Classification/ConvNeXT | 4 | 23.171 | 21.490 | +| Image Classification/ResNet | Unbatched | 7.435 | 3.801 | +| Image Classification/ResNet | 4 | 7.261 | 2.187 | +| Object Detection/Conditional-DETR | Unbatched | 32.823 | 11.627 | +| Object Detection/Conditional-DETR | 4 | 50.622 | 33.831 | +| Image Segmentation/MobileNet | Unbatched | 9.869 | 4.244 | +| Image Segmentation/MobileNet | 4 | 14.385 | 7.946 | + + +### T4 + +| **Task/Model** | **Batch Size** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:|:---:| +| Image Classification/ConvNeXT | Unbatched | 32.137 | 31.84 | +| Image Classification/ConvNeXT | 4 | 120.944 | 110.209 | +| Image Classification/ResNet | Unbatched | 9.761 | 7.698 | +| Image Classification/ResNet | 4 | 15.215 | 13.871 | +| Object Detection/Conditional-DETR | Unbatched | 72.150 | 57.660 | +| Object Detection/Conditional-DETR | 4 | 301.494 | 247.543 | +| Image Segmentation/MobileNet | Unbatched | 22.266 | 19.339 | +| Image Segmentation/MobileNet | 4 | 78.311 | 50.983 | + + diff --git a/docs/source/en/perf_train_cpu.md b/docs/source/en/perf_train_cpu.md new file mode 100644 index 00000000000000..14a52792d1f7d8 --- /dev/null +++ b/docs/source/en/perf_train_cpu.md @@ -0,0 +1,82 @@ + + +# Efficient Training on CPU + +This guide focuses on training large models efficiently on CPU. + +## Mixed precision with IPEX +Mixed precision uses single (fp32) and half-precision (bf16/fp16) data types in a model to accelerate training or inference while still preserving much of the single-precision accuracy. Modern CPUs such as 3rd and 4th Gen Intel® Xeon® Scalable processors natively support bf16, so you should get more performance out of the box by enabling mixed precision training with bf16. + +To further maximize training performance, you can use Intel® Extension for PyTorch (IPEX), which is a library built on PyTorch and adds additional CPU instruction level architecture (ISA) level support such as Intel® Advanced Vector Extensions 512 Vector Neural Network Instructions (Intel® AVX512-VNNI), and Intel® Advanced Matrix Extensions (Intel® AMX) for an extra performance boost on Intel CPUs. However, CPUs with only AVX2 (e.g., AMD or older Intel CPUs) are not guaranteed to have better performance under IPEX. + +Auto Mixed Precision (AMP) for CPU backends has been enabled since PyTorch 1.10. AMP support for bf16 on CPUs and bf16 operator optimization is also supported in IPEX and partially upstreamed to the main PyTorch branch. You can get better performance and user experience with IPEX AMP. + +Check more detailed information for [Auto Mixed Precision](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/features/amp.html). + +### IPEX installation: + +IPEX release is following PyTorch, to install via pip: + +| PyTorch Version | IPEX version | +| :---------------: | :----------: | +| 2.1.x | 2.1.100+cpu | +| 2.0.x | 2.0.100+cpu | +| 1.13 | 1.13.0+cpu | +| 1.12 | 1.12.300+cpu | + +Please run `pip list | grep torch` to get your `pytorch_version`, so you can get the `IPEX version_name`. +```bash +pip install intel_extension_for_pytorch== -f https://developer.intel.com/ipex-whl-stable-cpu +``` +You can check the latest versions in [ipex-whl-stable-cpu](https://developer.intel.com/ipex-whl-stable-cpu) if needed. + +Check more approaches for [IPEX installation](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/installation.html). + +### Usage in Trainer +To enable auto mixed precision with IPEX in Trainer, users should add `use_ipex`, `bf16` and `no_cuda` in training command arguments. + +Take an example of the use cases on [Transformers question-answering](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) + +- Training with IPEX using BF16 auto mixed precision on CPU: +
+```bash
+python run_qa.py \
+--model_name_or_path google-bert/bert-base-uncased \
+--dataset_name squad \
+--do_train \
+--do_eval \
+--per_device_train_batch_size 12 \
+--learning_rate 3e-5 \
+--num_train_epochs 2 \
+--max_seq_length 384 \
+--doc_stride 128 \
+--output_dir /tmp/debug_squad/ \
+--use_ipex \
+--bf16 \
+--use_cpu
+```
+ +If you want to enable `use_ipex` and `bf16` in your script, add these parameters to `TrainingArguments` like this: +```diff +training_args = TrainingArguments( + output_dir=args.output_path, ++ bf16=True, ++ use_ipex=True, ++ use_cpu=True, + **kwargs +) +``` + +### Practice example + +Blog: [Accelerating PyTorch Transformers with Intel Sapphire Rapids](https://huggingface.co/blog/intel-sapphire-rapids) diff --git a/docs/source/en/perf_train_cpu.mdx b/docs/source/en/perf_train_cpu.mdx deleted file mode 100644 index aa7a9ec2bf8bcd..00000000000000 --- a/docs/source/en/perf_train_cpu.mdx +++ /dev/null @@ -1,63 +0,0 @@ - - -# Efficient Training on CPU - -This guide focuses on training large models efficiently on CPU. - -## Mixed precision with IPEX - -IPEX is optimized for CPUs with AVX-512 or above, and functionally works for CPUs with only AVX2. So, it is expected to bring performance benefit for Intel CPU generations with AVX-512 or above while CPUs with only AVX2 (e.g., AMD CPUs or older Intel CPUs) might result in a better performance under IPEX, but not guaranteed. IPEX provides performance optimizations for CPU training with both Float32 and BFloat16. The usage of BFloat16 is the main focus of the following sections. - -Low precision data type BFloat16 has been natively supported on the 3rd Generation Xeon® Scalable Processors (aka Cooper Lake) with AVX512 instruction set and will be supported on the next generation of Intel® Xeon® Scalable Processors with Intel® Advanced Matrix Extensions (Intel® AMX) instruction set with further boosted performance. The Auto Mixed Precision for CPU backend has been enabled since PyTorch-1.10. At the same time, the support of Auto Mixed Precision with BFloat16 for CPU and BFloat16 optimization of operators has been massively enabled in Intel® Extension for PyTorch, and partially upstreamed to PyTorch master branch. Users can get better performance and user experience with IPEX Auto Mixed Precision. - -Check more detailed information for [Auto Mixed Precision](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/features/amp.html). - -### IPEX installation: - -IPEX release is following PyTorch, to install via pip: - -| PyTorch Version | IPEX version | -| :---------------: | :----------: | -| 1.13 | 1.13.0+cpu | -| 1.12 | 1.12.300+cpu | -| 1.11 | 1.11.200+cpu | -| 1.10 | 1.10.100+cpu | - -``` -pip install intel_extension_for_pytorch== -f https://developer.intel.com/ipex-whl-stable-cpu -``` - -Check more approaches for [IPEX installation](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/installation.html). - -### Usage in Trainer -To enable auto mixed precision with IPEX in Trainer, users should add `use_ipex`, `bf16` and `no_cuda` in training command arguments. - -Take an example of the use cases on [Transformers question-answering](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) - -- Training with IPEX using BF16 auto mixed precision on CPU: -
 python run_qa.py \
---model_name_or_path bert-base-uncased \
---dataset_name squad \
---do_train \
---do_eval \
---per_device_train_batch_size 12 \
---learning_rate 3e-5 \
---num_train_epochs 2 \
---max_seq_length 384 \
---doc_stride 128 \
---output_dir /tmp/debug_squad/ \
---use_ipex \
---bf16 --no_cuda
- -### Practice example - -Blog: [Accelerating PyTorch Transformers with Intel Sapphire Rapids](https://huggingface.co/blog/intel-sapphire-rapids) diff --git a/docs/source/en/perf_train_cpu_many.md b/docs/source/en/perf_train_cpu_many.md new file mode 100644 index 00000000000000..53f7f7f9295dea --- /dev/null +++ b/docs/source/en/perf_train_cpu_many.md @@ -0,0 +1,318 @@ + + +# Efficient Training on Multiple CPUs + +When training on a single CPU is too slow, we can use multiple CPUs. This guide focuses on PyTorch-based DDP enabling +distributed CPU training efficiently on [bare metal](#usage-in-trainer) and [Kubernetes](#usage-with-kubernetes). + +## Intel® oneCCL Bindings for PyTorch + +[Intel® oneCCL](https://github.com/oneapi-src/oneCCL) (collective communications library) is a library for efficient distributed deep learning training implementing such collectives like allreduce, allgather, alltoall. For more information on oneCCL, please refer to the [oneCCL documentation](https://spec.oneapi.com/versions/latest/elements/oneCCL/source/index.html) and [oneCCL specification](https://spec.oneapi.com/versions/latest/elements/oneCCL/source/index.html). + +Module `oneccl_bindings_for_pytorch` (`torch_ccl` before version 1.12) implements PyTorch C10D ProcessGroup API and can be dynamically loaded as external ProcessGroup and only works on Linux platform now + +Check more detailed information for [oneccl_bind_pt](https://github.com/intel/torch-ccl). + +### Intel® oneCCL Bindings for PyTorch installation + +Wheel files are available for the following Python versions: + +| Extension Version | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10 | +| :---------------: | :--------: | :--------: | :--------: | :--------: | :---------: | +| 2.1.0 | | √ | √ | √ | √ | +| 2.0.0 | | √ | √ | √ | √ | +| 1.13.0 | | √ | √ | √ | √ | +| 1.12.100 | | √ | √ | √ | √ | +| 1.12.0 | | √ | √ | √ | √ | + +Please run `pip list | grep torch` to get your `pytorch_version`. +```bash +pip install oneccl_bind_pt=={pytorch_version} -f https://developer.intel.com/ipex-whl-stable-cpu +``` +where `{pytorch_version}` should be your PyTorch version, for instance 2.1.0. +Check more approaches for [oneccl_bind_pt installation](https://github.com/intel/torch-ccl). +Versions of oneCCL and PyTorch must match. + + + +oneccl_bindings_for_pytorch 1.12.0 prebuilt wheel does not work with PyTorch 1.12.1 (it is for PyTorch 1.12.0) +PyTorch 1.12.1 should work with oneccl_bindings_for_pytorch 1.12.100 + + + +## Intel® MPI library +Use this standards-based MPI implementation to deliver flexible, efficient, scalable cluster messaging on Intel® architecture. This component is part of the Intel® oneAPI HPC Toolkit. + +oneccl_bindings_for_pytorch is installed along with the MPI tool set. Need to source the environment before using it. + +for Intel® oneCCL >= 1.12.0 +```bash +oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)") +source $oneccl_bindings_for_pytorch_path/env/setvars.sh +``` + +for Intel® oneCCL whose version < 1.12.0 +```bash +torch_ccl_path=$(python -c "import torch; import torch_ccl; import os; print(os.path.abspath(os.path.dirname(torch_ccl.__file__)))") +source $torch_ccl_path/env/setvars.sh +``` + +#### Intel® Extension for PyTorch installation + +Intel Extension for PyTorch (IPEX) provides performance optimizations for CPU training with both Float32 and BFloat16 (refer to the [single CPU section](./perf_train_cpu) to learn more). 
+ + +The following "Usage in Trainer" takes mpirun in Intel® MPI library as an example. + + +## Usage in Trainer +To enable multi CPU distributed training in the Trainer with the ccl backend, users should add **`--ddp_backend ccl`** in the command arguments. + +Let's see an example with the [question-answering example](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) + + +The following command enables training with 2 processes on one Xeon node, with one process running per one socket. The variables OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance. +```shell script + export CCL_WORKER_COUNT=1 + export MASTER_ADDR=127.0.0.1 + mpirun -n 2 -genv OMP_NUM_THREADS=23 \ + python3 run_qa.py \ + --model_name_or_path google-bert/bert-large-uncased \ + --dataset_name squad \ + --do_train \ + --do_eval \ + --per_device_train_batch_size 12 \ + --learning_rate 3e-5 \ + --num_train_epochs 2 \ + --max_seq_length 384 \ + --doc_stride 128 \ + --output_dir /tmp/debug_squad/ \ + --no_cuda \ + --ddp_backend ccl \ + --use_ipex +``` +The following command enables training with a total of four processes on two Xeons (node0 and node1, taking node0 as the main process), ppn (processes per node) is set to 2, with one process running per one socket. The variables OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance. + +In node0, you need to create a configuration file which contains the IP addresses of each node (for example hostfile) and pass that configuration file path as an argument. +```shell script + cat hostfile + xxx.xxx.xxx.xxx #node0 ip + xxx.xxx.xxx.xxx #node1 ip +``` +Now, run the following command in node0 and **4DDP** will be enabled in node0 and node1 with BF16 auto mixed precision: +```shell script + export CCL_WORKER_COUNT=1 + export MASTER_ADDR=xxx.xxx.xxx.xxx #node0 ip + mpirun -f hostfile -n 4 -ppn 2 \ + -genv OMP_NUM_THREADS=23 \ + python3 run_qa.py \ + --model_name_or_path google-bert/bert-large-uncased \ + --dataset_name squad \ + --do_train \ + --do_eval \ + --per_device_train_batch_size 12 \ + --learning_rate 3e-5 \ + --num_train_epochs 2 \ + --max_seq_length 384 \ + --doc_stride 128 \ + --output_dir /tmp/debug_squad/ \ + --no_cuda \ + --ddp_backend ccl \ + --use_ipex \ + --bf16 +``` + +## Usage with Kubernetes + +The same distributed training job from the previous section can be deployed to a Kubernetes cluster using the +[Kubeflow PyTorchJob training operator](https://www.kubeflow.org/docs/components/training/pytorch/). + +### Setup + +This example assumes that you have: +* Access to a Kubernetes cluster with [Kubeflow installed](https://www.kubeflow.org/docs/started/installing-kubeflow/) +* [`kubectl`](https://kubernetes.io/docs/tasks/tools/) installed and configured to access the Kubernetes cluster +* A [Persistent Volume Claim (PVC)](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) that can be used + to store datasets and model files. There are multiple options for setting up the PVC including using an NFS + [storage class](https://kubernetes.io/docs/concepts/storage/storage-classes/) or a cloud storage bucket. +* A Docker container that includes your model training script and all the dependencies needed to run the script. For + distributed CPU training jobs, this typically includes PyTorch, Transformers, Intel Extension for PyTorch, Intel + oneCCL Bindings for PyTorch, and OpenSSH to communicate between the containers. 
+
+The snippet below is an example of a Dockerfile that uses a base image that supports distributed CPU training and then
+extracts a Transformers release to the `/workspace` directory, so that the example scripts are included in the image:
+```dockerfile
+FROM intel/ai-workflows:torch-2.0.1-huggingface-multinode-py3.9
+
+WORKDIR /workspace
+
+# Download and extract the transformers code
+ARG HF_TRANSFORMERS_VER="4.35.2"
+RUN mkdir transformers && \
+    curl -sSL --retry 5 https://github.com/huggingface/transformers/archive/refs/tags/v${HF_TRANSFORMERS_VER}.tar.gz | tar -C transformers --strip-components=1 -xzf -
+```
+The image needs to be built and copied to the cluster's nodes or pushed to a container registry prior to deploying the
+PyTorchJob to the cluster.
+
+### PyTorchJob Specification File
+
+The [Kubeflow PyTorchJob](https://www.kubeflow.org/docs/components/training/pytorch/) is used to run the distributed
+training job on the cluster. The yaml file for the PyTorchJob defines parameters such as:
+ * The name of the PyTorchJob
+ * The number of replicas (workers)
+ * The Python script and its parameters that will be used to run the training job
+ * The types of resources (node selector, memory, and CPU) needed for each worker
+ * The image/tag for the Docker container to use
+ * Environment variables
+ * A volume mount for the PVC
+
+The volume mount defines a path where the PVC will be mounted in the container for each worker pod. This location can be
+used for the dataset, checkpoint files, and the saved model after training completes.
+
+The snippet below is an example of a yaml file for a PyTorchJob with 4 workers running the
+[question-answering example](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering).
+```yaml +apiVersion: "kubeflow.org/v1" +kind: PyTorchJob +metadata: + name: transformers-pytorchjob + namespace: kubeflow +spec: + elasticPolicy: + rdzvBackend: c10d + minReplicas: 1 + maxReplicas: 4 + maxRestarts: 10 + pytorchReplicaSpecs: + Worker: + replicas: 4 # The number of worker pods + restartPolicy: OnFailure + template: + spec: + containers: + - name: pytorch + image: : # Specify the docker image to use for the worker pods + imagePullPolicy: IfNotPresent + command: + - torchrun + - /workspace/transformers/examples/pytorch/question-answering/run_qa.py + - --model_name_or_path + - "google-bert/bert-large-uncased" + - --dataset_name + - "squad" + - --do_train + - --do_eval + - --per_device_train_batch_size + - "12" + - --learning_rate + - "3e-5" + - --num_train_epochs + - "2" + - --max_seq_length + - "384" + - --doc_stride + - "128" + - --output_dir + - "/tmp/pvc-mount/output" + - --no_cuda + - --ddp_backend + - "ccl" + - --use_ipex + - --bf16 # Specify --bf16 if your hardware supports bfloat16 + env: + - name: LD_PRELOAD + value: "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4.5.9:/usr/local/lib/libiomp5.so" + - name: TRANSFORMERS_CACHE + value: "/tmp/pvc-mount/transformers_cache" + - name: HF_DATASETS_CACHE + value: "/tmp/pvc-mount/hf_datasets_cache" + - name: LOGLEVEL + value: "INFO" + - name: CCL_WORKER_COUNT + value: "1" + - name: OMP_NUM_THREADS # Can be tuned for optimal performance +- value: "56" + resources: + limits: + cpu: 200 # Update the CPU and memory limit values based on your nodes + memory: 128Gi + requests: + cpu: 200 # Update the CPU and memory request values based on your nodes + memory: 128Gi + volumeMounts: + - name: pvc-volume + mountPath: /tmp/pvc-mount + - mountPath: /dev/shm + name: dshm + restartPolicy: Never + nodeSelector: # Optionally use the node selector to specify what types of nodes to use for the workers + node-type: spr + volumes: + - name: pvc-volume + persistentVolumeClaim: + claimName: transformers-pvc + - name: dshm + emptyDir: + medium: Memory +``` +To run this example, update the yaml based on your training script and the nodes in your cluster. + + + +The CPU resource limits/requests in the yaml are defined in [cpu units](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu) +where 1 CPU unit is equivalent to 1 physical CPU core or 1 virtual core (depending on whether the node is a physical +host or a VM). The amount of CPU and memory limits/requests defined in the yaml should be less than the amount of +available CPU/memory capacity on a single machine. It is usually a good idea to not use the entire machine's capacity in +order to leave some resources for the kubelet and OS. In order to get ["guaranteed"](https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#guaranteed) +[quality of service](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/) for the worker pods, +set the same CPU and memory amounts for both the resource limits and requests. + + + +### Deploy + +After the PyTorchJob spec has been updated with values appropriate for your cluster and training job, it can be deployed +to the cluster using: +```bash +kubectl create -f pytorchjob.yaml +``` + +The `kubectl get pods -n kubeflow` command can then be used to list the pods in the `kubeflow` namespace. You should see +the worker pods for the PyTorchJob that was just deployed. 
At first, they will probably have a status of "Pending" as +the containers get pulled and created, then the status should change to "Running". +``` +NAME READY STATUS RESTARTS AGE +... +transformers-pytorchjob-worker-0 1/1 Running 0 7m37s +transformers-pytorchjob-worker-1 1/1 Running 0 7m37s +transformers-pytorchjob-worker-2 1/1 Running 0 7m37s +transformers-pytorchjob-worker-3 1/1 Running 0 7m37s +... +``` + +The logs for worker can be viewed using `kubectl logs -n kubeflow `. Add `-f` to stream the logs, for example: +```bash +kubectl logs -n kubeflow transformers-pytorchjob-worker-0 -f +``` + +After the training job completes, the trained model can be copied from the PVC or storage location. When you are done +with the job, the PyTorchJob resource can be deleted from the cluster using `kubectl delete -f pytorchjob.yaml`. + +## Summary + +This guide covered running distributed PyTorch training jobs using multiple CPUs on bare metal and on a Kubernetes +cluster. Both cases utilize Intel Extension for PyTorch and Intel oneCCL Bindings for PyTorch for optimal training +performance, and can be used as a template to run your own workload on multiple nodes. diff --git a/docs/source/en/perf_train_cpu_many.mdx b/docs/source/en/perf_train_cpu_many.mdx deleted file mode 100644 index 1310e40d30e142..00000000000000 --- a/docs/source/en/perf_train_cpu_many.mdx +++ /dev/null @@ -1,130 +0,0 @@ - - -# Efficient Training on Multiple CPUs - -When training on a single CPU is too slow, we can use multiple CPUs. This guide focuses on PyTorch-based DDP enabling distributed CPU training efficiently. - -## Intel® oneCCL Bindings for PyTorch - -[Intel® oneCCL](https://github.com/oneapi-src/oneCCL) (collective communications library) is a library for efficient distributed deep learning training implementing such collectives like allreduce, allgather, alltoall. For more information on oneCCL, please refer to the [oneCCL documentation](https://spec.oneapi.com/versions/latest/elements/oneCCL/source/index.html) and [oneCCL specification](https://spec.oneapi.com/versions/latest/elements/oneCCL/source/index.html). - -Module `oneccl_bindings_for_pytorch` (`torch_ccl` before version 1.12) implements PyTorch C10D ProcessGroup API and can be dynamically loaded as external ProcessGroup and only works on Linux platform now - -Check more detailed information for [oneccl_bind_pt](https://github.com/intel/torch-ccl). - -### Intel® oneCCL Bindings for PyTorch installation: - -Wheel files are available for the following Python versions: - -| Extension Version | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10 | -| :---------------: | :--------: | :--------: | :--------: | :--------: | :---------: | -| 1.13.0 | | √ | √ | √ | √ | -| 1.12.100 | | √ | √ | √ | √ | -| 1.12.0 | | √ | √ | √ | √ | -| 1.11.0 | | √ | √ | √ | √ | -| 1.10.0 | √ | √ | √ | √ | | - -``` -pip install oneccl_bind_pt=={pytorch_version} -f https://developer.intel.com/ipex-whl-stable-cpu -``` -where `{pytorch_version}` should be your PyTorch version, for instance 1.13.0. -Check more approaches for [oneccl_bind_pt installation](https://github.com/intel/torch-ccl). -Versions of oneCCL and PyTorch must match. - - - -oneccl_bindings_for_pytorch 1.12.0 prebuilt wheel does not work with PyTorch 1.12.1 (it is for PyTorch 1.12.0) -PyTorch 1.12.1 should work with oneccl_bindings_for_pytorch 1.12.100 - - - -## Intel® MPI library -Use this standards-based MPI implementation to deliver flexible, efficient, scalable cluster messaging on Intel® architecture. 
This component is part of the Intel® oneAPI HPC Toolkit. - -oneccl_bindings_for_pytorch is installed along with the MPI tool set. Need to source the environment before using it. - -for Intel® oneCCL >= 1.12.0 -``` -oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)") -source $oneccl_bindings_for_pytorch_path/env/setvars.sh -``` - -for Intel® oneCCL whose version < 1.12.0 -``` -torch_ccl_path=$(python -c "import torch; import torch_ccl; import os; print(os.path.abspath(os.path.dirname(torch_ccl.__file__)))") -source $torch_ccl_path/env/setvars.sh -``` - -#### IPEX installation: - -IPEX provides performance optimizations for CPU training with both Float32 and BFloat16, you could refer [single CPU section](./perf_train_cpu). - - -The following "Usage in Trainer" takes mpirun in Intel® MPI library as an example. - - -## Usage in Trainer -To enable multi CPU distributed training in the Trainer with the ccl backend, users should add **`--xpu_backend ccl`** in the command arguments. - -Let's see an example with the [question-answering example](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) - - -The following command enables training with 2 processes on one Xeon node, with one process running per one socket. The variables OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance. -```shell script - export CCL_WORKER_COUNT=1 - export MASTER_ADDR=127.0.0.1 - mpirun -n 2 -genv OMP_NUM_THREADS=23 \ - python3 run_qa.py \ - --model_name_or_path bert-large-uncased \ - --dataset_name squad \ - --do_train \ - --do_eval \ - --per_device_train_batch_size 12 \ - --learning_rate 3e-5 \ - --num_train_epochs 2 \ - --max_seq_length 384 \ - --doc_stride 128 \ - --output_dir /tmp/debug_squad/ \ - --no_cuda \ - --xpu_backend ccl \ - --use_ipex -``` -The following command enables training with a total of four processes on two Xeons (node0 and node1, taking node0 as the main process), ppn (processes per node) is set to 2, with one process running per one socket. The variables OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance. - -In node0, you need to create a configuration file which contains the IP addresses of each node (for example hostfile) and pass that configuration file path as an argument. -```shell script - cat hostfile - xxx.xxx.xxx.xxx #node0 ip - xxx.xxx.xxx.xxx #node1 ip -``` -Now, run the following command in node0 and **4DDP** will be enabled in node0 and node1 with BF16 auto mixed precision: -```shell script - export CCL_WORKER_COUNT=1 - export MASTER_ADDR=xxx.xxx.xxx.xxx #node0 ip - mpirun -f hostfile -n 4 -ppn 2 \ - -genv OMP_NUM_THREADS=23 \ - python3 run_qa.py \ - --model_name_or_path bert-large-uncased \ - --dataset_name squad \ - --do_train \ - --do_eval \ - --per_device_train_batch_size 12 \ - --learning_rate 3e-5 \ - --num_train_epochs 2 \ - --max_seq_length 384 \ - --doc_stride 128 \ - --output_dir /tmp/debug_squad/ \ - --no_cuda \ - --xpu_backend ccl \ - --use_ipex \ - --bf16 -``` diff --git a/docs/source/en/perf_train_gpu_many.md b/docs/source/en/perf_train_gpu_many.md new file mode 100644 index 00000000000000..db1c3c3ef4ed8a --- /dev/null +++ b/docs/source/en/perf_train_gpu_many.md @@ -0,0 +1,668 @@ + + +# Efficient Training on Multiple GPUs + +If training a model on a single GPU is too slow or if the model's weights do not fit in a single GPU's memory, transitioning +to a multi-GPU setup may be a viable option. 
Prior to making this transition, thoroughly explore all the strategies covered +in the [Methods and tools for efficient training on a single GPU](perf_train_gpu_one) as they are universally applicable +to model training on any number of GPUs. Once you have employed those strategies and found them insufficient for your +case on a single GPU, consider moving to multiple GPUs. + +Transitioning from a single GPU to multiple GPUs requires the introduction of some form of parallelism, as the workload +must be distributed across the resources. Multiple techniques can be employed to achieve parallelism, such as data +parallelism, tensor parallelism, and pipeline parallelism. It's important to note that there isn't a one-size-fits-all +solution, and the optimal settings depend on the specific hardware configuration you are using. + +This guide offers an in-depth overview of individual types of parallelism, as well as guidance on ways to combine +techniques and choosing an appropriate approach. For step-by-step tutorials on distributed training, please refer to +the [🤗 Accelerate documentation](https://huggingface.co/docs/accelerate/index). + + + +While the main concepts discussed in this guide are likely applicable across frameworks, here we focus on +PyTorch-based implementations. + + + +Before diving deeper into the specifics of each technique, let's go over the rough decision process when training +large models on a large infrastructure. + +## Scalability strategy + +Begin by estimating how much vRAM is required to train your model. For models hosted on the 🤗 Hub, use our +[Model Memory Calculator](https://huggingface.co/spaces/hf-accelerate/model-memory-usage), which gives you +accurate calculations within a few percent margin. + +**Parallelization strategy for a single Node / multi-GPU setup** + +When training a model on a single node with multiple GPUs, your choice of parallelization strategy can significantly +impact performance. Here's a breakdown of your options: + +**Case 1: Your model fits onto a single GPU** + +If your model can comfortably fit onto a single GPU, you have two primary options: + +1. DDP - Distributed DataParallel +2. ZeRO - depending on the situation and configuration used, this method may or may not be faster, however, it's worth experimenting with it. + +**Case 2: Your model doesn't fit onto a single GPU:** + +If your model is too large for a single GPU, you have several alternatives to consider: + +1. PipelineParallel (PP) +2. ZeRO +3. TensorParallel (TP) + +With very fast inter-node connectivity (e.g., NVLINK or NVSwitch) all three strategies (PP, ZeRO, TP) should result in +similar performance. However, without these, PP will be faster than TP or ZeRO. The degree of TP may also +make a difference. It's best to experiment with your specific setup to determine the most suitable strategy. + +TP is almost always used within a single node. That is TP size <= GPUs per node. + +**Case 3: Largest layer of your model does not fit onto a single GPU** + +1. If you are not using ZeRO, you have to use TensorParallel (TP), because PipelineParallel (PP) alone won't be sufficient to accommodate the large layer. +2. If you are using ZeRO, additionally adopt techniques from the [Methods and tools for efficient training on a single GPU](perf_train_gpu_one). + +**Parallelization strategy for a multi-Node / multi-GPU setup** + +* When you have fast inter-node connectivity (e.g., NVLINK or NVSwitch) consider using one of these options: + + 1. 
ZeRO - as it requires close to no modifications to the model + 2. A combination of PipelineParallel(PP) with TensorParallel(TP) and DataParallel(DP) - this approach will result in fewer communications, but requires significant changes to the model + +* When you have slow inter-node connectivity and still low on GPU memory: + + 1. Employ a combination of DataParallel(DP) with PipelineParallel(PP), TensorParallel(TP), and ZeRO. + +In the following sections of this guide we dig deeper into how these different parallelism methods work. + +## Data Parallelism + +Even with only 2 GPUs, you can readily leverage the accelerated training capabilities offered by PyTorch's built-in features, +such as `DataParallel` (DP) and `DistributedDataParallel` (DDP). Note that +[PyTorch documentation](https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html) recommends to prefer +`DistributedDataParallel` (DDP) over `DataParallel` (DP) for multi-GPU training as it works for all models. +Let's take a look at how these two methods work and what makes them different. + +### DataParallel vs DistributedDataParallel + +To understand the key differences in inter-GPU communication overhead between the two methods, let's review the processes per batch: + +[DDP](https://pytorch.org/docs/master/notes/ddp.html): + +- At the start time the main process replicates the model once from GPU 0 to the rest of GPUs +- Then for each batch: + 1. Each GPU directly consumes its mini-batch of data. + 2. During `backward`, once the local gradients are ready, they are averaged across all processes. + +[DP](https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html): + +For each batch: + 1. GPU 0 reads the batch of data and then sends a mini-batch to each GPU. + 2. The up-to-date model is replicated from GPU 0 to each GPU. + 3. `forward` is executed, and output from each GPU is sent to GPU 0 to compute the loss. + 4. The loss is distributed from GPU 0 to all GPUs, and `backward` is run. + 5. Gradients from each GPU are sent to GPU 0 and averaged. + +Key differences include: +1. DDP performs only a single communication per batch - sending gradients, while DP performs five different data exchanges per batch. +DDP copies data using [torch.distributed](https://pytorch.org/docs/master/distributed.html), while DP copies data within +the process via Python threads (which introduces limitations associated with GIL). As a result, **`DistributedDataParallel` (DDP) is generally faster than `DataParallel` (DP)** unless you have slow GPU card inter-connectivity. +2. Under DP, GPU 0 performs significantly more work than other GPUs, resulting in GPU under-utilization. +3. DDP supports distributed training across multiple machines, whereas DP does not. + +This is not an exhaustive list of differences between DP and DDP, however, other nuances are out of scope of this guide. +You can get a deeper understanding of these methods by reading this [article](https://www.telesens.co/2019/04/04/distributed-data-parallel-training-using-pytorch-on-aws/). + +Let's illustrate the differences between DP and DDP with an experiment. We'll benchmark the differences between DP and +DDP with an added context of NVLink presence: + +* Hardware: 2x TITAN RTX 24GB each + NVlink with 2 NVLinks (`NV2` in `nvidia-smi topo -m`). +* Software: `pytorch-1.8-to-be` + `cuda-11.0` / `transformers==4.3.0.dev0`. + +To disable the NVLink feature on one of the benchmarks, we use `NCCL_P2P_DISABLE=1`. 
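+
+For orientation, here is a minimal sketch of how the two wrappers are applied in plain PyTorch. The toy model and launch command are illustrative assumptions only and are separate from the `Trainer`-based benchmark that follows:
+
+```python
+import os
+import torch
+import torch.nn as nn
+from torch.nn.parallel import DistributedDataParallel as DDP
+
+model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))  # toy model
+
+if os.environ.get("RANK") is None:
+    # DP: a single process; GPU 0 scatters each batch and gathers outputs every step
+    model = nn.DataParallel(model.to("cuda"))
+else:
+    # DDP: one process per GPU, e.g. launched with `torchrun --nproc_per_node 2 script.py`;
+    # only gradients are communicated during the backward pass
+    torch.distributed.init_process_group(backend="nccl")
+    local_rank = int(os.environ["LOCAL_RANK"])
+    torch.cuda.set_device(local_rank)
+    model = DDP(model.to(local_rank), device_ids=[local_rank])
+```
+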
+ +Here is the benchmarking code and outputs: + +**DP** + +```bash +rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \ +python examples/pytorch/language-modeling/run_clm.py \ +--model_name_or_path openai-community/gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ +--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 + +{'train_runtime': 110.5948, 'train_samples_per_second': 1.808, 'epoch': 0.69} +``` + +**DDP w/ NVlink** + +```bash +rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \ +torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \ +--model_name_or_path openai-community/gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ +--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 + +{'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69} +``` + +**DDP w/o NVlink** + +```bash +rm -r /tmp/test-clm; NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 \ +torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \ +--model_name_or_path openai-community/gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ +--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 + +{'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69} +``` + +Here are the same benchmarking results gathered in a table for convenience: + +| Type | NVlink | Time | +| :----- | ----- | ---: | +| 2:DP | Y | 110s | +| 2:DDP | Y | 101s | +| 2:DDP | N | 131s | + +As you can see, in this case DP is ~10% slower than DDP with NVlink, but ~15% faster than DDP without NVlink. +The real difference will depend on how much data each GPU needs to sync with the others - the more there is to sync, +the more a slow link will impede the overall runtime. + +## ZeRO Data Parallelism + +ZeRO-powered data parallelism (ZeRO-DP) is illustrated in the following diagram from this [blog post](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/). + +
+*Figure: ZeRO-DP partitioning of parameters, gradients and optimizer states across GPUs (DeepSpeed-Image-1).*
+ +While it may appear complex, it is a very similar concept to `DataParallel` (DP). The difference is that instead of +replicating the full model parameters, gradients and optimizer states, each GPU stores only a slice of it. Then, at +run-time when the full layer parameters are needed just for the given layer, all GPUs synchronize to give each other +parts that they miss. + +To illustrate this idea, consider a simple model with 3 layers (La, Lb, and Lc), where each layer has 3 parameters. +Layer La, for example, has weights a0, a1 and a2: + +``` +La | Lb | Lc +---|----|--- +a0 | b0 | c0 +a1 | b1 | c1 +a2 | b2 | c2 +``` + +If we have 3 GPUs, ZeRO-DP splits the model onto 3 GPUs like so: + +``` +GPU0: +La | Lb | Lc +---|----|--- +a0 | b0 | c0 + +GPU1: +La | Lb | Lc +---|----|--- +a1 | b1 | c1 + +GPU2: +La | Lb | Lc +---|----|--- +a2 | b2 | c2 +``` + +In a way, this is the same horizontal slicing as tensor parallelism, as opposed to Vertical +slicing, where one puts whole layer-groups on different GPUs. Now let's see how this works: + +Each of these GPUs will get the usual mini-batch as it works in DP: + +``` +x0 => GPU0 +x1 => GPU1 +x2 => GPU2 +``` + +The inputs are passed without modifications as if they would be processed by the original model. + +First, the inputs get to the layer `La`. What happens at this point? + +On GPU0: the x0 mini-batch requires the a0, a1, a2 parameters to do its forward path through the layer, but the GPU0 has only a0. +It will get a1 from GPU1 and a2 from GPU2, bringing all the pieces of the model together. + +In parallel, GPU1 gets another mini-batch - x1. GPU1 has the a1 parameter, but needs a0 and a2, so it gets those from GPU0 and GPU2. +Same happens to GPU2 that gets the mini-batch x2. It gets a0 and a1 from GPU0 and GPU1. + +This way each of the 3 GPUs gets the full tensors reconstructed and makes a forward pass with its own mini-batch. +As soon as the calculation is done, the data that is no longer needed gets dropped - it's only used during the calculation. +The reconstruction is done efficiently via a pre-fetch. + +Then the whole process is repeated for layer Lb, then Lc forward-wise, and then backward Lc -> Lb -> La. + + + +This mechanism is similar to an efficient group backpacking strategy: person A carries the tent, person B carries the stove, +and person C carries the axe. Each night they all share what they have with others and get from others what they don't have, +and in the morning they pack up their allocated type of gear and continue on their way. This is what ZeRO DP/Sharded DDP is. +Compare this strategy to the simple one where each person has to carry their own tent, stove and axe (similar to +DataParallel (DP and DDP) in PyTorch), which would be far more inefficient. + + + +While reading the literature on this topic you may encounter the following synonyms: Sharded, Partitioned. +If you pay close attention the way ZeRO partitions the model's weights - it looks very similar to tensor parallelism +which will be discussed later. This is because it partitions/shards each layer's weights, unlike vertical model parallelism +which is discussed next. 
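+
+In 🤗 Transformers, ZeRO-DP is usually enabled through the DeepSpeed integration listed below. As a rough sketch (the model name and config values are illustrative assumptions, not tuned recommendations), a ZeRO stage 3 run can be configured like this:
+
+```python
+from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
+
+# Illustrative DeepSpeed config enabling ZeRO stage 3
+ds_config = {
+    "zero_optimization": {"stage": 3},          # shard optimizer states, gradients and parameters
+    "train_micro_batch_size_per_gpu": "auto",   # filled in from the TrainingArguments by the Trainer
+    "gradient_accumulation_steps": "auto",
+}
+
+model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
+training_args = TrainingArguments(output_dir="zero3-out", per_device_train_batch_size=4, deepspeed=ds_config)
+# trainer = Trainer(model=model, args=training_args, train_dataset=your_dataset)  # `your_dataset` is a placeholder;
+# launch with `deepspeed` or `torchrun`
+```
+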
+
+Implementations:
+
+- [DeepSpeed](https://www.deepspeed.ai/tutorials/zero/) ZeRO-DP stages 1+2+3
+- [`Accelerate` integration](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed)
+- [`transformers` integration](main_classes/trainer#trainer-integrations)
+
+## From Naive Model Parallelism to Pipeline Parallelism
+
+To explain Pipeline parallelism, we'll first look into Naive Model Parallelism (MP), also known as Vertical MP. This approach
+involves distributing groups of model layers across multiple GPUs by assigning specific layers to specific GPUs with `.to()`.
+As data flows through these layers, it is moved to the same GPU as the layer, while the other layers remain untouched.
+
+We refer to this model parallelism as "Vertical" because of how models are typically visualized. For example, the
+following diagram shows an 8-layer model split vertically into two slices, placing layers 0-3 onto
+GPU0 and layers 4-7 onto GPU1:
+
+```
+================
+| Layer |      |
+|   0   |      |
+|   1   | GPU0 |
+|   2   |      |
+|   3   |      |
+================
+| Layer |      |
+|   4   |      |
+|   5   | GPU1 |
+|   6   |      |
+|   7   |      |
+================
+```
+
+In this example, when data moves from layer 0 to 3, it's no different from a regular forward pass. However, passing data
+from layer 3 to 4 requires moving it from GPU0 to GPU1, introducing a communication overhead. If the participating
+GPUs are on the same compute node (e.g. same physical machine) this copying is fast, but if the GPUs are distributed
+across different compute nodes (e.g. multiple machines), the communication overhead could be substantially greater.
+
+Following that, layers 4 to 7 work as they would in the original model. Upon completion of the 7th layer, there is often
+a need to send the data back to layer 0 where the labels are (or alternatively send the labels to the last layer). Now the loss can be
+computed and the optimizer can do its work.
+
+Naive Model Parallelism comes with several shortcomings:
+- **All but one GPU are idle at any given moment**: if 4 GPUs are used, it's nearly identical to quadrupling the amount of memory of a single GPU, and ignoring the rest of the hardware.
+- **Overhead in data transfer between devices**: E.g. 4x 6GB cards will be able to accommodate the same size as 1x 24GB card using naive MP, but a single 24GB card will complete the training faster, because it doesn't have the data copying overhead. But, say, if you have 40GB cards and need to fit a 45GB model you can with 4x 40GB cards (but barely because of the gradient and optimizer states)
+- **Copying shared embeddings**: Shared embeddings may need to get copied back and forth between GPUs.
+
+Now that you are familiar with how the naive approach to model parallelism works and its shortcomings, let's look at Pipeline Parallelism (PP).
+PP is almost identical to a naive MP, but it solves the GPU idling problem by chunking the incoming batch into micro-batches
+and artificially creating a pipeline, which allows different GPUs to concurrently participate in the computation process.
+
+The following illustration from the [GPipe paper](https://ai.googleblog.com/2019/03/introducing-gpipe-open-source-library.html)
+shows the naive MP on the top, and PP on the bottom:
+
+<div class="flex justify-center">
+ MP vs PP +
+
+At the bottom of the diagram, you can observe that the Pipeline Parallelism (PP) approach minimizes the number of idle
+GPU zones, referred to as 'bubbles'. Both parts of the diagram show a parallelism level of degree 4, meaning that 4 GPUs
+are involved in the pipeline. You can see that there's a forward pass of 4 pipe stages (F0, F1, F2 and F3) followed by
+a backward pass in reverse order (B3, B2, B1, and B0).
+
+PP introduces a new hyperparameter to tune - `chunks`, which determines how many data chunks are sent in a sequence
+through the same pipe stage. For example, in the bottom diagram you can see `chunks=4`. GPU0 performs the same
+forward pass on chunks 0, 1, 2 and 3 (F0,0, F0,1, F0,2, F0,3) and then it waits for the other GPUs to complete their work.
+Only when the other GPUs begin to complete their work does GPU0 start to work again, doing the backward pass for chunks
+3, 2, 1 and 0 (B0,3, B0,2, B0,1, B0,0).
+
+Note that conceptually this is the same as gradient accumulation steps (GAS). PyTorch uses `chunks`, while DeepSpeed refers
+to the same hyperparameter as GAS.
+
+Because of the chunks, PP introduces the notion of micro-batches (MBS). DP splits the global data batch size into
+mini-batches, so if you have a DP degree of 4, a global batch size of 1024 gets split up into 4 mini-batches of
+256 each (1024/4). And if the number of `chunks` (or GAS) is 32 we end up with a micro-batch size of 8 (256/32). Each
+Pipeline stage works with a single micro-batch at a time. To calculate the global batch size of the DP + PP setup,
+use the formula: `mbs * chunks * dp_degree` (`8 * 32 * 4 = 1024`).
+With `chunks=1` you end up with the naive MP, which is inefficient. With a large `chunks` value you end up with
+tiny micro-batch sizes, which is also inefficient. For this reason, we encourage you to experiment with the `chunks` value to
+find the one that leads to the most efficient GPU utilization.
+
+You may notice a bubble of "dead" time on the diagram that can't be parallelized because the last `forward` stage
+has to wait for `backward` to complete the pipeline. The purpose of finding the best value for `chunks` is to enable a high
+concurrent GPU utilization across all participating GPUs which translates to minimizing the size of the bubble.
+
+Pipeline API solutions have been implemented in:
+- PyTorch
+- DeepSpeed
+- Megatron-LM
+
+These come with some shortcomings:
+- They require modifying the model quite heavily, because Pipeline requires rewriting the normal flow of modules into an `nn.Sequential` sequence of the same, which may require changes to the design of the model.
+- Currently the Pipeline API is very restricted. If you have a bunch of Python variables being passed in the very first stage of the Pipeline, you will have to find a way around it. Currently, the pipeline interface requires either a single Tensor or a tuple of Tensors as the only input and output. These tensors must have a batch size as the very first dimension, since pipeline is going to chunk the mini batch into micro-batches. Possible improvements are being discussed here https://github.com/pytorch/pytorch/pull/50693
+- Conditional control flow at the level of pipe stages is not possible - e.g., Encoder-Decoder models like T5 require special workarounds to handle a conditional encoder stage.
+- They require arranging each layer so that the output of one layer becomes an input to the next layer.
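+
+To make the naive `.to()` placement and the `nn.Sequential` rewrite mentioned above more concrete, here is a minimal,
+illustrative sketch (assuming a machine with two visible GPUs) of the manual staging that pipeline APIs build on; a PP
+framework would then feed such stages micro-batches instead of the whole mini-batch:
+
+```py
+import torch
+from torch import nn
+
+# Two layer groups ("pipe stages") placed on different devices, as in naive vertical MP.
+stage0 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")
+stage1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:1")
+
+batch = torch.randn(8, 1024)
+hidden = stage0(batch.to("cuda:0"))
+# The device-to-device copy below is the communication overhead discussed earlier;
+# pipeline frameworks overlap it across micro-batches to shrink the bubble.
+output = stage1(hidden.to("cuda:1"))
+```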
+ +More recent solutions include: +- Varuna +- Sagemaker + +We have not experimented with Varuna and SageMaker but their papers report that they have overcome the list of problems +mentioned above and that they require smaller changes to the user's model. + +Implementations: +- [PyTorch](https://pytorch.org/docs/stable/pipeline.html) (initial support in pytorch-1.8, and progressively getting improved in 1.9 and more so in 1.10). Some [examples](https://github.com/pytorch/pytorch/blob/master/benchmarks/distributed/pipeline/pipe.py) +- [DeepSpeed](https://www.deepspeed.ai/tutorials/pipeline/) +- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) has an internal implementation - no API. +- [Varuna](https://github.com/microsoft/varuna) +- [SageMaker](https://arxiv.org/abs/2111.05972) - this is a proprietary solution that can only be used on AWS. +- [OSLO](https://github.com/tunib-ai/oslo) - this is implemented based on the Hugging Face Transformers. + +🤗 Transformers status: as of this writing none of the models supports full-PP. GPT2 and T5 models have naive MP support. +The main obstacle is being unable to convert the models to `nn.Sequential` and have all the inputs to be Tensors. This +is because currently the models include many features that make the conversion very complicated, and will need to be removed to accomplish that. + +DeepSpeed and Megatron-LM integrations are available in [🤗 Accelerate](https://huggingface.co/docs/accelerate/main/en/usage_guides/deepspeed) + +Other approaches: + +DeepSpeed, Varuna and SageMaker use the concept of an [Interleaved Pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html) + +
+ Interleaved pipeline execution +
+
+Here the bubble (idle time) is further minimized by prioritizing backward passes. Varuna further attempts to improve the
+schedule by using simulations to discover the most efficient scheduling.
+
+OSLO has a pipeline parallelism implementation based on Transformers that does not require `nn.Sequential` conversion.
+
+## Tensor Parallelism
+
+In Tensor Parallelism, each GPU processes a slice of a tensor and only aggregates the full tensor for operations requiring it.
+To describe this method, this section of the guide relies on the concepts and diagrams from the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
+paper: [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/abs/2104.04473).
+
+The main building block of any transformer is a fully connected `nn.Linear` followed by a nonlinear activation `GeLU`.
+The dot-product part of it, following the Megatron paper's notation, can be written as `Y = GeLU(XA)`, where `X` is
+an input vector, `Y` is the output vector, and `A` is the weight matrix.
+
+If we look at the computation in matrix form, we can see how the matrix multiplication can be split between multiple GPUs:
+
+<div class="flex justify-center">
+ Parallel GEMM +
+ +If we split the weight matrix `A` column-wise across `N` GPUs and perform matrix multiplications `XA_1` through `XA_n` in parallel, +then we will end up with `N` output vectors `Y_1, Y_2, ..., Y_n` which can be fed into `GeLU` independently: + +
+ Independent GeLU +
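+
+As a small single-process sketch (not a distributed implementation), this independence can be checked numerically;
+each shard below stands in for the work one GPU would do, and the final concatenation plays the role of gathering the
+output shards at the very end:
+
+```py
+import torch
+
+torch.manual_seed(0)
+X = torch.randn(4, 8)         # input activations
+A = torch.randn(8, 16)        # full weight matrix
+A_1, A_2 = A.chunk(2, dim=1)  # column-wise shards, one per "GPU"
+
+# Each shard computes GeLU(X @ A_i) independently - no synchronization is needed.
+Y_shards = [torch.nn.functional.gelu(X @ A_i) for A_i in (A_1, A_2)]
+Y = torch.cat(Y_shards, dim=1)
+
+# The sharded result matches the unsharded computation.
+assert torch.allclose(Y, torch.nn.functional.gelu(X @ A), atol=1e-6)
+```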
+ +Using this principle, we can update a multi-layer perceptron of arbitrary depth, without the need for any synchronization +between GPUs until the very end, where we need to reconstruct the output vector from shards. The Megatron-LM paper authors +provide a helpful illustration for that: + +
+ Parallel shard processing +
+ +Parallelizing the multi-headed attention layers is even simpler, since they are already inherently parallel, due to having +multiple independent heads! + +
+ Parallel self-attention +
+
+Special considerations: TP requires a very fast network, and therefore it's not advisable to do TP across more than one node.
+Practically, if a node has 4 GPUs, the highest TP degree is therefore 4. If you need a TP degree of 8, you need to use
+nodes that have at least 8 GPUs.
+
+This section is based on the original, much more [detailed TP overview](https://github.com/huggingface/transformers/issues/10321#issuecomment-783543530)
+by [@anton-l](https://github.com/anton-l).
+
+Alternative names:
+- DeepSpeed calls it [tensor slicing](https://www.deepspeed.ai/training/#model-parallelism)
+
+Implementations:
+- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) has an internal implementation, as it's very model-specific
+- [parallelformers](https://github.com/tunib-ai/parallelformers) (only inference at the moment)
+- [SageMaker](https://arxiv.org/abs/2111.05972) - this is a proprietary solution that can only be used on AWS.
+- [OSLO](https://github.com/tunib-ai/oslo) has a tensor parallelism implementation based on Transformers.
+
+SageMaker combines TP with DP for more efficient processing.
+
+🤗 Transformers status:
+- core: not yet implemented in the core
+- but if you want inference, [parallelformers](https://github.com/tunib-ai/parallelformers) provides this support for most of our models. So until this is implemented in the core you can use theirs. And hopefully training mode will be supported too.
+- Deepspeed-Inference also supports our BERT, GPT-2, and GPT-Neo models in their super-fast CUDA-kernel-based inference mode, see more [here](https://www.deepspeed.ai/tutorials/inference-tutorial/)
+
+🤗 Accelerate integrates with [TP from Megatron-LM](https://huggingface.co/docs/accelerate/v0.23.0/en/usage_guides/megatron_lm).
+
+## Data Parallelism + Pipeline Parallelism
+
+The following diagram from the DeepSpeed [pipeline tutorial](https://www.deepspeed.ai/tutorials/pipeline/) demonstrates
+how one can combine DP with PP.
+
+<div class="flex justify-center">
+ DP + PP-2d +
+
+Here it's important to see how DP rank 0 doesn't see GPU2 and DP rank 1 doesn't see GPU3. To DP there are just GPUs 0
+and 1, and it feeds data to them as if there were only 2 GPUs. GPU0 "secretly" offloads some of its load to GPU2 using PP.
+And GPU1 does the same by enlisting GPU3 to its aid.
+
+Since each dimension requires at least 2 GPUs, here you'd need at least 4 GPUs.
+
+Implementations:
+- [DeepSpeed](https://github.com/microsoft/DeepSpeed)
+- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
+- [Varuna](https://github.com/microsoft/varuna)
+- [SageMaker](https://arxiv.org/abs/2111.05972)
+- [OSLO](https://github.com/tunib-ai/oslo)
+
+🤗 Transformers status: not yet implemented
+
+## Data Parallelism + Pipeline Parallelism + Tensor Parallelism
+
+To achieve even more efficient training, 3D parallelism is used, where PP is combined with TP and DP. This can be seen in the following diagram.
+
+<div class="flex justify-center">
+ dp-pp-tp-3d +
+ +This diagram is from a blog post [3D parallelism: Scaling to trillion-parameter models](https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/), which is a good read as well. + +Since each dimension requires at least 2 GPUs, here you'd need at least 8 GPUs. + +Implementations: +- [DeepSpeed](https://github.com/microsoft/DeepSpeed) - DeepSpeed also includes an even more efficient DP, which they call ZeRO-DP. +- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) +- [Varuna](https://github.com/microsoft/varuna) +- [SageMaker](https://arxiv.org/abs/2111.05972) +- [OSLO](https://github.com/tunib-ai/oslo) + +🤗 Transformers status: not yet implemented, since we have no PP and TP. + +## ZeRO Data Parallelism + Pipeline Parallelism + Tensor Parallelism + +One of the main features of DeepSpeed is ZeRO, which is a super-scalable extension of DP. It has already been +discussed in [ZeRO Data Parallelism](#zero-data-parallelism). Normally it's a standalone feature that doesn't require PP or TP. +But it can be combined with PP and TP. + +When ZeRO-DP is combined with PP (and optionally TP) it typically enables only ZeRO stage 1 (optimizer sharding). + +While it's theoretically possible to use ZeRO stage 2 (gradient sharding) with Pipeline Parallelism, it will have negative +performance impacts. There would need to be an additional reduce-scatter collective for every micro-batch to aggregate +the gradients before sharding, which adds a potentially significant communication overhead. By nature of Pipeline Parallelism, +small micro-batches are used and instead the focus is on trying to balance arithmetic intensity (micro-batch size) with +minimizing the Pipeline bubble (number of micro-batches). Therefore those communication costs are going to impact the performance. + +In addition, there are already fewer layers than normal due to PP and so the memory savings won't be huge. PP already +reduces gradient size by ``1/PP``, and so gradient sharding savings on top of that are less significant than pure DP. + +ZeRO stage 3 is not a good choice either for the same reason - more inter-node communications required. + +And since we have ZeRO, the other benefit is ZeRO-Offload. Since this is stage 1 optimizer states can be offloaded to CPU. + +Implementations: +- [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed) and [Megatron-Deepspeed from BigScience](https://github.com/bigscience-workshop/Megatron-DeepSpeed), which is the fork of the former repo. +- [OSLO](https://github.com/tunib-ai/oslo) + +Important papers: + +- [Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model]( +https://arxiv.org/abs/2201.11990) + +🤗 Transformers status: not yet implemented, since we have no PP and TP. + +## FlexFlow + +[FlexFlow](https://github.com/flexflow/FlexFlow) also solves the parallelization problem in a slightly different approach. + +Paper: ["Beyond Data and Model Parallelism for Deep Neural Networks" by Zhihao Jia, Matei Zaharia, Alex Aiken](https://arxiv.org/abs/1807.05358) + +It performs a sort of 4D Parallelism over Sample-Operator-Attribute-Parameter. + +1. Sample = Data Parallelism (sample-wise parallel) +2. Operator = Parallelize a single operation into several sub-operations +3. Attribute = Data Parallelism (length-wise parallel) +4. Parameter = Model Parallelism (regardless of dimension - horizontal or vertical) + +Examples: +* Sample + +Let's take 10 batches of sequence length 512. 
If we parallelize them by sample dimension into 2 devices, 10 x 512 becomes 5 x 2 x 512.
+
+* Operator
+
+If we perform layer normalization, we compute std first and mean second, and then we can normalize the data.
+Operator parallelism allows computing std and mean in parallel. So if we parallelize them by operator dimension into 2
+devices (cuda:0, cuda:1), first we copy input data into both devices, and cuda:0 computes std, cuda:1 computes mean at the same time.
+
+* Attribute
+
+We have 10 batches of length 512. If we parallelize them by attribute dimension into 2 devices, 10 x 512 will be 10 x 2 x 256.
+
+* Parameter
+
+It is similar to tensor model parallelism or naive layer-wise model parallelism.
+
+<div class="flex justify-center">
+ flex-flow-soap +
+ +The significance of this framework is that it takes resources like (1) GPU/TPU/CPU vs. (2) RAM/DRAM vs. (3) +fast-intra-connect/slow-inter-connect and it automatically optimizes all these algorithmically deciding which +parallelisation to use where. + +One very important aspect is that FlexFlow is designed for optimizing DNN parallelizations for models with static and +fixed workloads, since models with dynamic behavior may prefer different parallelization strategies across iterations. + +So the promise is very attractive - it runs a 30min simulation on the cluster of choice and it comes up with the best +strategy to utilise this specific environment. If you add/remove/replace any parts it'll run and re-optimize the plan +for that. And then you can train. A different setup will have its own custom optimization. + +🤗 Transformers status: Transformers models are FX-trace-able via [transformers.utils.fx](https://github.com/huggingface/transformers/blob/master/src/transformers/utils/fx.py), +which is a prerequisite for FlexFlow, however, changes are required on the FlexFlow side to make it work with Transformers models. + +## GPU selection + +When training on multiple GPUs, you can specify the number of GPUs to use and in what order. This can be useful for instance when you have GPUs with different computing power and want to use the faster GPU first. The selection process works for both [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) and [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html) to use only a subset of the available GPUs, and you don't need Accelerate or the [DeepSpeed integration](./main_classes/deepspeed). + +### Number of GPUs + +For example, if you have 4 GPUs and you only want to use the first 2: + + + + +Use the `--nproc_per_node` to select how many GPUs to use. + +```bash +torchrun --nproc_per_node=2 trainer-program.py ... +``` + + + + +Use `--num_processes` to select how many GPUs to use. + +```bash +accelerate launch --num_processes 2 trainer-program.py ... +``` + + + + +Use `--num_gpus` to select how many GPUs to use. + +```bash +deepspeed --num_gpus 2 trainer-program.py ... +``` + + + + +### Order of GPUs + +Now, to select which GPUs to use and their order, you'll use the `CUDA_VISIBLE_DEVICES` environment variable. It is easiest to set the environment variable in a `~/bashrc` or another startup config file. `CUDA_VISIBLE_DEVICES` is used to map which GPUs are used. For example, if you have 4 GPUs (0, 1, 2, 3) and you only want to run GPUs 0 and 2: + +```bash +CUDA_VISIBLE_DEVICES=0,2 torchrun trainer-program.py ... +``` + +Only the 2 physical GPUs (0 and 2) are "visible" to PyTorch and these are mapped to `cuda:0` and `cuda:1` respectively. You can also reverse the order of the GPUs to use 2 first. Now, the mapping is `cuda:1` for GPU 0 and `cuda:0` for GPU 2. + +```bash +CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ... +``` + +You can also set the `CUDA_VISIBLE_DEVICES` environment variable to an empty value to create an environment without GPUs. + +```bash +CUDA_VISIBLE_DEVICES= python trainer-program.py ... +``` + + + +As with any environment variable, they can be exported instead of being added to the command line. However, this is not recommended because it can be confusing if you forget how the environment variable was setup and you end up using the wrong GPUs. 
Instead, it is common practice to set the environment variable for a specific training run on the same command line. + + + +`CUDA_DEVICE_ORDER` is an alternative environment variable you can use to control how the GPUs are ordered. You can either order them by: + +1. PCIe bus ID's that matches the order of [`nvidia-smi`](https://developer.nvidia.com/nvidia-system-management-interface) and [`rocm-smi`](https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/.doxygen/docBin/html/index.html) for NVIDIA and AMD GPUs respectively + +```bash +export CUDA_DEVICE_ORDER=PCI_BUS_ID +``` + +2. GPU compute ability + +```bash +export CUDA_DEVICE_ORDER=FASTEST_FIRST +``` + +The `CUDA_DEVICE_ORDER` is especially useful if your training setup consists of an older and newer GPU, where the older GPU appears first, but you cannot physically swap the cards to make the newer GPU appear first. In this case, set `CUDA_DEVICE_ORDER=FASTEST_FIRST` to always use the newer and faster GPU first (`nvidia-smi` or `rocm-smi` still reports the GPUs in their PCIe order). Or you could also set `export CUDA_VISIBLE_DEVICES=1,0`. diff --git a/docs/source/en/perf_train_gpu_many.mdx b/docs/source/en/perf_train_gpu_many.mdx deleted file mode 100644 index 17eb7b739925df..00000000000000 --- a/docs/source/en/perf_train_gpu_many.mdx +++ /dev/null @@ -1,529 +0,0 @@ - - -# Efficient Training on Multiple GPUs - -When training on a single GPU is too slow or the model weights don't fit in a single GPUs memory we use a multi-GPU setup. Switching from a single GPU to multiple requires some form of parallelism as the work needs to be distributed. There are several techniques to achieve parallism such as data, tensor, or pipeline parallism. However, there is no one solution to fit them all and which settings works best depends on the hardware you are running on. While the main concepts most likely will apply to any other framework, this article is focused on PyTorch-based implementations. - - - - Note: Most of the strategies introduced in the [single GPU section](perf_train_gpu_one) (such as mixed precision training or gradient accumulation) are generic and apply to training models in general so make sure to have a look at it before diving into the following sections such as multi-GPU or CPU training. - - - -We will first discuss in depth various 1D parallelism techniques and their pros and cons and then look at how they can be combined into 2D and 3D parallelism to enable an even faster training and to support even bigger models. Various other powerful alternative approaches will be presented. - -## Concepts - -The following is the brief description of the main concepts that will be described later in depth in this document. - -1. **DataParallel (DP)** - the same setup is replicated multiple times, and each being fed a slice of the data. The processing is done in parallel and all setups are synchronized at the end of each training step. -2. **TensorParallel (TP)** - each tensor is split up into multiple chunks, so instead of having the whole tensor reside on a single gpu, each shard of the tensor resides on its designated gpu. During processing each shard gets processed separately and in parallel on different GPUs and the results are synced at the end of the step. This is what one may call horizontal parallelism, as the splitting happens on horizontal level. -3. **PipelineParallel (PP)** - the model is split up vertically (layer-level) across multiple GPUs, so that only one or several layers of the model are places on a single gpu. 
Each gpu processes in parallel different stages of the pipeline and working on a small chunk of the batch. -4. **Zero Redundancy Optimizer (ZeRO)** - Also performs sharding of the tensors somewhat similar to TP, except the whole tensor gets reconstructed in time for a forward or backward computation, therefore the model doesn't need to be modified. It also supports various offloading techniques to compensate for limited GPU memory. -5. **Sharded DDP** - is another name for the foundational ZeRO concept as used by various other implementations of ZeRO. - -Before diving deeper into the specifics of each concept we first have a look at the rough decision process when training large models on a large infrastructure. - -## Scalability Strategy - -**⇨ Single Node / Multi-GPU** -* Model fits onto a single GPU: - - 1. DDP - Distributed DP - 2. ZeRO - may or may not be faster depending on the situation and configuration used - -* Model doesn't fit onto a single GPU: - - 1. PP - 2. ZeRO - 3. TP - - With very fast intra-node connectivity of NVLINK or NVSwitch all three should be mostly on par, without these PP will be faster than TP or ZeRO. The degree of TP may also make a difference. Best to experiment to find the winner on your particular setup. - - TP is almost always used within a single node. That is TP size <= gpus per node. - -* Largest Layer not fitting into a single GPU: - - 1. If not using ZeRO - must use TP, as PP alone won't be able to fit. - 2. With ZeRO see the same entry for "Single GPU" above - - -**⇨ Multi-Node / Multi-GPU** - -* When you have fast inter-node connectivity: - - 1. ZeRO - as it requires close to no modifications to the model - 2. PP+TP+DP - less communications, but requires massive changes to the model - -* when you have slow inter-node connectivity and still low on GPU memory: - - 1. DP+PP+TP+ZeRO-1 - - - -## Data Parallelism - -Most users with just 2 GPUs already enjoy the increased training speed up thanks to `DataParallel` (DP) and `DistributedDataParallel` (DDP) that are almost trivial to use. This is a built-in feature of Pytorch. Note that in general it is advised to use DDP as it is better maintained and works for all models while DP might fail for some models. [PyTorch documentation](https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html) itself recommends the use of DDP. - -### DP vs DDP - -`DistributedDataParallel` (DDP) is typically faster than `DataParallel` (DP), but it is not always the case: -* while DP is python threads-based, DDP is multiprocess-based - and as such it has no python threads limitations, such as GIL -* on the other hand a slow inter-connectivity between the GPU cards could lead to an actual slower outcome with DDP - -Here are the main differences in the inter-GPU communication overhead between the two modes: - -[DDP](https://pytorch.org/docs/master/notes/ddp.html): - -- At the start time the main process replicates the model once from gpu 0 to the rest of gpus -- Then for each batch: - 1. each gpu consumes each own mini-batch of data directly - 2. during `backward`, once the local gradients are ready, they are then averaged across all processes - -[DP](https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html): - -For each batch: - 1. gpu 0 reads the batch of data and then sends a mini-batch to each gpu - 2. replicates the up-to-date model from gpu 0 to each gpu - 3. runs `forward` and sends output from each gpu to gpu 0, computes loss - 4. scatters loss from gpu 0 to all gpus, runs `backward` - 5. 
sends gradients from each gpu to gpu 0 and averages those - -The only communication DDP performs per batch is sending gradients, whereas DP does 5 different data exchanges per batch. - -DP copies data within the process via python threads, whereas DDP copies data via [torch.distributed](https://pytorch.org/docs/master/distributed.html). - -Under DP gpu 0 performs a lot more work than the rest of the gpus, thus resulting in under-utilization of gpus. - -You can use DDP across multiple machines, but this is not the case with DP. - -There are other differences between DP and DDP but they aren't relevant to this discussion. - -If you want to go really deep into understanding these 2 modes, this [article](https://www.telesens.co/2019/04/04/distributed-data-parallel-training-using-pytorch-on-aws/) is highly recommended, as it has great diagrams, includes multiple benchmarks and profiler outputs on various hardware, explains all the nuances that you may need to know. - -Let's look at an actual benchmark: - -| Type | NVlink | Time | -| :----- | ----- | ---: | -| 2:DP | Y | 110s | -| 2:DDP | Y | 101s | -| 2:DDP | N | 131s | - - -Analysis: - -Here DP is ~10% slower than DDP w/ NVlink, but ~15% faster than DDP w/o NVlink - -The real difference will depend on how much data each GPU needs to sync with the others - the more there is to sync, the more a slow link will slow down the total runtime. - -Here is the full benchmark code and outputs: - -`NCCL_P2P_DISABLE=1` was used to disable the NVLink feature on the corresponding benchmark. - -``` - -# DP -rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \ -python examples/pytorch/language-modeling/run_clm.py \ ---model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ ---do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 - -{'train_runtime': 110.5948, 'train_samples_per_second': 1.808, 'epoch': 0.69} - -# DDP w/ NVlink -rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \ -python -m torch.distributed.launch --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \ ---model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ ---do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 - -{'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69} - -# DDP w/o NVlink -rm -r /tmp/test-clm; NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 \ -python -m torch.distributed.launch --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \ ---model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ ---do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 - -{'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69} -``` - -Hardware: 2x TITAN RTX 24GB each + NVlink with 2 NVLinks (`NV2` in `nvidia-smi topo -m`) -Software: `pytorch-1.8-to-be` + `cuda-11.0` / `transformers==4.3.0.dev0` - -## ZeRO Data Parallelism - -ZeRO-powered data parallelism (ZeRO-DP) is described on the following diagram from this [blog post](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/) -![DeepSpeed-Image-1](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-zero.png) - -It can be difficult to wrap one's head around it, but in reality the concept is quite simple. 
This is just the usual `DataParallel` (DP), except, instead of replicating the full model params, gradients and optimizer states, each GPU stores only a slice of it. And then at run-time when the full layer params are needed just for the given layer, all GPUs synchronize to give each other parts that they miss - this is it. - -Consider this simple model with 3 layers, where each layer has 3 params: -``` -La | Lb | Lc ----|----|--- -a0 | b0 | c0 -a1 | b1 | c1 -a2 | b2 | c2 -``` -Layer La has weights a0, a1 and a2. - -If we have 3 GPUs, the Sharded DDP (= Zero-DP) splits the model onto 3 GPUs like so: - -``` -GPU0: -La | Lb | Lc ----|----|--- -a0 | b0 | c0 - -GPU1: -La | Lb | Lc ----|----|--- -a1 | b1 | c1 - -GPU2: -La | Lb | Lc ----|----|--- -a2 | b2 | c2 -``` - -In a way this is the same horizontal slicing, as tensor parallelism, if you imagine the typical DNN diagram. Vertical slicing is where one puts whole layer-groups on different GPUs. But it's just the starting point. - -Now each of these GPUs will get the usual mini-batch as it works in DP: -``` -x0 => GPU0 -x1 => GPU1 -x2 => GPU2 -``` - -The inputs are unmodified - they think they are going to be processed by the normal model. - -First, the inputs hit the layer La. - -Let's focus just on GPU0: x0 needs a0, a1, a2 params to do its forward path, but GPU0 has only a0 - it gets sent a1 from GPU1 and a2 from GPU2, bringing all pieces of the model together. - -In parallel, GPU1 gets mini-batch x1 and it only has a1, but needs a0 and a2 params, so it gets those from GPU0 and GPU2. - -Same happens to GPU2 that gets input x2. It gets a0 and a1 from GPU0 and GPU1, and with its a2 it reconstructs the full tensor. - -All 3 GPUs get the full tensors reconstructed and a forward happens. - -As soon as the calculation is done, the data that is no longer needed gets dropped - it's only used during the calculation. The reconstruction is done efficiently via a pre-fetch. - -And the whole process is repeated for layer Lb, then Lc forward-wise, and then backward Lc -> Lb -> La. - -To me this sounds like an efficient group backpacking weight distribution strategy: - -1. person A carries the tent -2. person B carries the stove -3. person C carries the axe - -Now each night they all share what they have with others and get from others what they don't have, and in the morning they pack up their allocated type of gear and continue on their way. This is Sharded DDP / Zero DP. - -Compare this strategy to the simple one where each person has to carry their own tent, stove and axe, which would be far more inefficient. This is DataParallel (DP and DDP) in Pytorch. - -While reading the literature on this topic you may encounter the following synonyms: Sharded, Partitioned. - -If you pay close attention the way ZeRO partitions the model's weights - it looks very similar to tensor parallelism which will be discussed later. This is because it partitions/shards each layer's weights, unlike vertical model parallelism which is discussed next. - -Implementations: - -- [DeepSpeed](https://www.deepspeed.ai/features/#the-zero-redundancy-optimizer) ZeRO-DP stages 1+2+3 -- [Fairscale](https://github.com/facebookresearch/fairscale/#optimizer-state-sharding-zero) ZeRO-DP stages 1+2+3 -- [`transformers` integration](main_classes/trainer#trainer-integrations) - -## Naive Model Parallelism (Vertical) and Pipeline Parallelism - -Naive Model Parallelism (MP) is where one spreads groups of model layers across multiple GPUs. 
The mechanism is relatively simple - switch the desired layers `.to()` the desired devices and now whenever the data goes in and out those layers switch the data to the same device as the layer and leave the rest unmodified. - -We refer to it as Vertical MP, because if you remember how most models are drawn, we slice the layers vertically. For example, if the following diagram shows an 8-layer model: - -``` -=================== =================== -| 0 | 1 | 2 | 3 | | 4 | 5 | 6 | 7 | -=================== =================== - gpu0 gpu1 -``` -we just sliced it in 2 vertically, placing layers 0-3 onto GPU0 and 4-7 to GPU1. - -Now while data travels from layer 0 to 1, 1 to 2 and 2 to 3 this is just the normal model. But when data needs to pass from layer 3 to layer 4 it needs to travel from GPU0 to GPU1 which introduces a communication overhead. If the participating GPUs are on the same compute node (e.g. same physical machine) this copying is pretty fast, but if the GPUs are located on different compute nodes (e.g. multiple machines) the communication overhead could be significantly larger. - -Then layers 4 to 5 to 6 to 7 are as a normal model would have and when the 7th layer completes we often need to send the data back to layer 0 where the labels are (or alternatively send the labels to the last layer). Now the loss can be computed and the optimizer can do its work. - -Problems: -- the main deficiency and why this one is called "naive" MP, is that all but one GPU is idle at any given moment. So if 4 GPUs are used, it's almost identical to quadrupling the amount of memory of a single GPU, and ignoring the rest of the hardware. Plus there is the overhead of copying the data between devices. So 4x 6GB cards will be able to accommodate the same size as 1x 24GB card using naive MP, except the latter will complete the training faster, since it doesn't have the data copying overhead. But, say, if you have 40GB cards and need to fit a 45GB model you can with 4x 40GB cards (but barely because of the gradient and optimizer states) -- shared embeddings may need to get copied back and forth between GPUs. - -Pipeline Parallelism (PP) is almost identical to a naive MP, but it solves the GPU idling problem, by chunking the incoming batch into micro-batches and artificially creating a pipeline, which allows different GPUs to concurrently participate in the computation process. - -The following illustration from the [GPipe paper](https://ai.googleblog.com/2019/03/introducing-gpipe-open-source-library.html) shows the naive MP on the top, and PP on the bottom: - -![mp-pp](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-gpipe-bubble.png) - -It's easy to see from the bottom diagram how PP has less dead zones, where GPUs are idle. The idle parts are referred to as the "bubble". - -Both parts of the diagram show a parallelism that is of degree 4. That is 4 GPUs are participating in the pipeline. So there is the forward path of 4 pipe stages F0, F1, F2 and F3 and then the return reverse order backward path of B3, B2, B1 and B0. - -PP introduces a new hyper-parameter to tune and it's `chunks` which defines how many chunks of data are sent in a sequence through the same pipe stage. For example, in the bottomw diagram you can see that `chunks=4`. 
GPU0 performs the same forward path on chunk 0, 1, 2 and 3 (F0,0, F0,1, F0,2, F0,3) and then it waits for other GPUs to do their work and only when their work is starting to be complete, GPU0 starts to work again doing the backward path for chunks 3, 2, 1 and 0 (B0,3, B0,2, B0,1, B0,0). - -Note that conceptually this is the same concept as gradient accumulation steps (GAS). Pytorch uses `chunks`, whereas DeepSpeed refers to the same hyper-parameter as GAS. - -Because of the chunks, PP introduces the concept of micro-batches (MBS). DP splits the global data batch size into mini-batches, so if you have a DP degree of 4, a global batch size of 1024 gets split up into 4 mini-batches of 256 each (1024/4). And if the number of `chunks` (or GAS) is 32 we end up with a micro-batch size of 8 (256/32). Each Pipeline stage works with a single micro-batch at a time. - -To calculate the global batch size of the DP + PP setup we then do: `mbs*chunks*dp_degree` (`8*32*4=1024`). - -Let's go back to the diagram. - -With `chunks=1` you end up with the naive MP, which is very inefficient. With a very large `chunks` value you end up with tiny micro-batch sizes which could be not every efficient either. So one has to experiment to find the value that leads to the highest efficient utilization of the gpus. - -While the diagram shows that there is a bubble of "dead" time that can't be parallelized because the last `forward` stage has to wait for `backward` to complete the pipeline, the purpose of finding the best value for `chunks` is to enable a high concurrent GPU utilization across all participating GPUs which translates to minimizing the size of the bubble. - -There are 2 groups of solutions - the traditional Pipeline API and the more modern solutions that make things much easier for the end user. - -Traditional Pipeline API solutions: -- PyTorch -- FairScale -- DeepSpeed -- Megatron-LM - -Modern solutions: -- Varuna -- Sagemaker - -Problems with traditional Pipeline API solutions: -- have to modify the model quite heavily, because Pipeline requires one to rewrite the normal flow of modules into a `nn.Sequential` sequence of the same, which may require changes to the design of the model. -- currently the Pipeline API is very restricted. If you had a bunch of python variables being passed in the very first stage of the Pipeline, you will have to find a way around it. Currently, the pipeline interface requires either a single Tensor or a tuple of Tensors as the only input and output. These tensors must have a batch size as the very first dimension, since pipeline is going to chunk the mini batch into micro-batches. Possible improvements are being discussed here https://github.com/pytorch/pytorch/pull/50693 -- conditional control flow at the level of pipe stages is not possible - e.g., Encoder-Decoder models like T5 require special workarounds to handle a conditional encoder stage. -- have to arrange each layer so that the output of one model becomes an input to the other model. - -We are yet to experiment with Varuna and SageMaker but their papers report that they have overcome the list of problems mentioned above and that they require much smaller changes to the user's model. - -Implementations: -- [Pytorch](https://pytorch.org/docs/stable/pipeline.html) (initial support in pytorch-1.8, and progressively getting improved in 1.9 and more so in 1.10). 
Some [examples](https://github.com/pytorch/pytorch/blob/master/benchmarks/distributed/pipeline/pipe.py) -- [FairScale](https://fairscale.readthedocs.io/en/latest/tutorials/pipe.html) -- [DeepSpeed](https://www.deepspeed.ai/tutorials/pipeline/) -- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) has an internal implementation - no API. -- [Varuna](https://github.com/microsoft/varuna) -- [SageMaker](https://arxiv.org/abs/2111.05972) - this is a proprietary solution that can only be used on AWS. -- [OSLO](https://github.com/tunib-ai/oslo) - this is implemented based on the Hugging Face Transformers. - -🤗 Transformers status: as of this writing none of the models supports full-PP. GPT2 and T5 models have naive MP support. The main obstacle is being unable to convert the models to `nn.Sequential` and have all the inputs to be Tensors. This is because currently the models include many features that make the conversion very complicated, and will need to be removed to accomplish that. - -Other approaches: - -DeepSpeed, Varuna and SageMaker use the concept of an [Interleaved Pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html) -![interleaved-pipeline-execution](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-sagemaker-interleaved-pipeline.png) - -Here the bubble (idle time) is further minimized by prioritizing backward passes. - -Varuna further tries to improve the schedule by using simulations to discover the most efficient scheduling. - -OSLO has pipeline parallelism implementation based on the Transformers without `nn.Sequential` converting. - -## Tensor Parallelism - -In Tensor Parallelism each GPU processes only a slice of a tensor and only aggregates the full tensor for operations that require the whole thing. - -In this section we use concepts and diagrams from the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) paper: [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/abs/2104.04473). - -The main building block of any transformer is a fully connected `nn.Linear` followed by a nonlinear activation `GeLU`. - -Following the Megatron's paper notation, we can write the dot-product part of it as `Y = GeLU(XA)`, where `X` and `Y` are the input and output vectors, and `A` is the weight matrix. - -If we look at the computation in matrix form, it's easy to see how the matrix multiplication can be split between multiple GPUs: -![Parallel GEMM](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-tp-parallel_gemm.png) - -If we split the weight matrix `A` column-wise across `N` GPUs and perform matrix multiplications `XA_1` through `XA_n` in parallel, then we will end up with `N` output vectors `Y_1, Y_2, ..., Y_n` which can be fed into `GeLU` independently: -![independent GeLU](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-tp-independent-gelu.png) - -Using this principle, we can update an MLP of arbitrary depth, without the need for any synchronization between GPUs until the very end, where we need to reconstruct the output vector from shards. 
The Megatron-LM paper authors provide a helpful illustration for that: -![parallel shard processing](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-tp-parallel_shard_processing.png) - -Parallelizing the multi-headed attention layers is even simpler, since they are already inherently parallel, due to having multiple independent heads! -![parallel self-attention](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-tp-parallel_self_attention.png) - -Special considerations: TP requires very fast network, and therefore it's not advisable to do TP across more than one node. Practically, if a node has 4 GPUs, the highest TP degree is therefore 4. If you need a TP degree of 8, you need to use nodes that have at least 8 GPUs. - -This section is based on the original much more [detailed TP overview](https://github.com/huggingface/transformers/issues/10321#issuecomment-783543530). -by [@anton-l](https://github.com/anton-l). - -SageMaker combines TP with DP for a more efficient processing. - -Alternative names: -- DeepSpeed calls it [tensor slicing](https://www.deepspeed.ai/features/#model-parallelism) - -Implementations: -- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) has an internal implementation, as it's very model-specific -- [parallelformers](https://github.com/tunib-ai/parallelformers) (only inference at the moment) -- [SageMaker](https://arxiv.org/abs/2111.05972) - this is a proprietary solution that can only be used on AWS. -- [OSLO](https://github.com/tunib-ai/oslo) has the tensor parallelism implementation based on the Transformers. - -🤗 Transformers status: -- core: not yet implemented in the core -- but if you want inference [parallelformers](https://github.com/tunib-ai/parallelformers) provides this support for most of our models. So until this is implemented in the core you can use theirs. And hopefully training mode will be supported too. -- Deepspeed-Inference also supports our BERT, GPT-2, and GPT-Neo models in their super-fast CUDA-kernel-based inference mode, see more [here](https://www.deepspeed.ai/tutorials/inference-tutorial/) - -## DP+PP - -The following diagram from the DeepSpeed [pipeline tutorial](https://www.deepspeed.ai/tutorials/pipeline/) demonstrates how one combines DP with PP. - -![dp-pp-2d](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-zero-dp-pp.png) - -Here it's important to see how DP rank 0 doesn't see GPU2 and DP rank 1 doesn't see GPU3. To DP there is just GPUs 0 and 1 where it feeds data as if there were just 2 GPUs. GPU0 "secretly" offloads some of its load to GPU2 using PP. And GPU1 does the same by enlisting GPU3 to its aid. - -Since each dimension requires at least 2 GPUs, here you'd need at least 4 GPUs. - -Implementations: -- [DeepSpeed](https://github.com/microsoft/DeepSpeed) -- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) -- [Varuna](https://github.com/microsoft/varuna) -- [SageMaker](https://arxiv.org/abs/2111.05972) -- [OSLO](https://github.com/tunib-ai/oslo) - -🤗 Transformers status: not yet implemented - -## DP+PP+TP - -To get an even more efficient training a 3D parallelism is used where PP is combined with TP and DP. This can be seen in the following diagram. 
- -![dp-pp-tp-3d](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-deepspeed-3d.png) - -This diagram is from a blog post [3D parallelism: Scaling to trillion-parameter models](https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/), which is a good read as well. - -Since each dimension requires at least 2 GPUs, here you'd need at least 8 GPUs. - -Implementations: -- [DeepSpeed](https://github.com/microsoft/DeepSpeed) - DeepSpeed also includes an even more efficient DP, which they call ZeRO-DP. -- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) -- [Varuna](https://github.com/microsoft/varuna) -- [SageMaker](https://arxiv.org/abs/2111.05972) -- [OSLO](https://github.com/tunib-ai/oslo) - -🤗 Transformers status: not yet implemented, since we have no PP and TP. - -## ZeRO DP+PP+TP - -One of the main features of DeepSpeed is ZeRO, which is a super-scalable extension of DP. It has already been discussed in [ZeRO Data Parallelism](#zero-data-parallelism). Normally it's a standalone feature that doesn't require PP or TP. But it can be combined with PP and TP. - -When ZeRO-DP is combined with PP (and optionally TP) it typically enables only ZeRO stage 1 (optimizer sharding). - -While it's theoretically possible to use ZeRO stage 2 (gradient sharding) with Pipeline Parallelism, it will have bad performance impacts. There would need to be an additional reduce-scatter collective for every micro-batch to aggregate the gradients before sharding, which adds a potentially significant communication overhead. By nature of Pipeline Parallelism, small micro-batches are used and instead the focus is on trying to balance arithmetic intensity (micro-batch size) with minimizing the Pipeline bubble (number of micro-batches). Therefore those communication costs are going to hurt. - -In addition, There are already fewer layers than normal due to PP and so the memory savings won't be huge. PP already reduces gradient size by ``1/PP``, and so gradient sharding savings on top of that are less significant than pure DP. - -ZeRO stage 3 is not a good choice either for the same reason - more inter-node communications required. - -And since we have ZeRO, the other benefit is ZeRO-Offload. Since this is stage 1 optimizer states can be offloaded to CPU. - -Implementations: -- [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed) and [Megatron-Deepspeed from BigScience](https://github.com/bigscience-workshop/Megatron-DeepSpeed), which is the fork of the former repo. -- [OSLO](https://github.com/tunib-ai/oslo) - -Important papers: - -- [Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model]( -https://arxiv.org/abs/2201.11990) - -🤗 Transformers status: not yet implemented, since we have no PP and TP. - -## FlexFlow - -[FlexFlow](https://github.com/flexflow/FlexFlow) also solves the parallelization problem in a slightly different approach. - -Paper: ["Beyond Data and Model Parallelism for Deep Neural Networks" by Zhihao Jia, Matei Zaharia, Alex Aiken](https://arxiv.org/abs/1807.05358) - -It performs a sort of 4D Parallelism over Sample-Operator-Attribute-Parameter. - -1. Sample = Data Parallelism (sample-wise parallel) -2. Operator = Parallelize a single operation into several sub-operations -3. Attribute = Data Parallelism (length-wise parallel) -4. 
Parameter = Model Parallelism (regardless of dimension - horizontal or vertical) - -Examples: -* Sample - -Let's take 10 batches of sequence length 512. If we parallelize them by sample dimension into 2 devices, we get 10 x 512 which becomes be 5 x 2 x 512. - -* Operator - -If we perform layer normalization, we compute std first and mean second, and then we can normalize data. Operator parallelism allows computing std and mean in parallel. So if we parallelize them by operator dimension into 2 devices (cuda:0, cuda:1), first we copy input data into both devices, and cuda:0 computes std, cuda:1 computes mean at the same time. - -* Attribute - -We have 10 batches of 512 length. If we parallelize them by attribute dimension into 2 devices, 10 x 512 will be 10 x 2 x 256. - -* Parameter - -It is similar with tensor model parallelism or naive layer-wise model parallelism. - -![flex-flow-soap](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-flexflow.jpeg) - -The significance of this framework is that it takes resources like (1) GPU/TPU/CPU vs. (2) RAM/DRAM vs. (3) fast-intra-connect/slow-inter-connect and it automatically optimizes all these algorithmically deciding which parallelisation to use where. - -One very important aspect is that FlexFlow is designed for optimizing DNN parallelizations for models with static and fixed workloads, since models with dynamic behavior may prefer different parallelization strategies across iterations. - -So the promise is very attractive - it runs a 30min simulation on the cluster of choice and it comes up with the best strategy to utilise this specific environment. If you add/remove/replace any parts it'll run and re-optimize the plan for that. And then you can train. A different setup will have its own custom optimization. - -🤗 Transformers status: not yet integrated. We already have our models FX-trace-able via [transformers.utils.fx](https://github.com/huggingface/transformers/blob/master/src/transformers/utils/fx.py), which is a prerequisite for FlexFlow, so someone needs to figure out what needs to be done to make FlexFlow work with our models. - - -## Which Strategy To Use When - -Here is a very rough outline at which parallelism strategy to use when. The first on each list is typically faster. - -**⇨ Single GPU** - -* Model fits onto a single GPU: - - 1. Normal use - -* Model doesn't fit onto a single GPU: - - 1. ZeRO + Offload CPU and optionally NVMe - 2. as above plus Memory Centric Tiling (see below for details) if the largest layer can't fit into a single GPU - -* Largest Layer not fitting into a single GPU: - -1. ZeRO - Enable [Memory Centric Tiling](https://deepspeed.readthedocs.io/en/latest/zero3.html#memory-centric-tiling) (MCT). It allows you to run arbitrarily large layers by automatically splitting them and executing them sequentially. MCT reduces the number of parameters that are live on a GPU, but it does not affect the activation memory. As this need is very rare as of this writing a manual override of `torch.nn.Linear` needs to be done by the user. - -**⇨ Single Node / Multi-GPU** - -* Model fits onto a single GPU: - - 1. DDP - Distributed DP - 2. ZeRO - may or may not be faster depending on the situation and configuration used - -* Model doesn't fit onto a single GPU: - - 1. PP - 2. ZeRO - 3. TP - - With very fast intra-node connectivity of NVLINK or NVSwitch all three should be mostly on par, without these PP will be faster than TP or ZeRO. The degree of TP may also make a difference. 
Best to experiment to find the winner on your particular setup. - - TP is almost always used within a single node. That is TP size <= gpus per node. - -* Largest Layer not fitting into a single GPU: - - 1. If not using ZeRO - must use TP, as PP alone won't be able to fit. - 2. With ZeRO see the same entry for "Single GPU" above - - -**⇨ Multi-Node / Multi-GPU** - -* When you have fast inter-node connectivity: - - 1. ZeRO - as it requires close to no modifications to the model - 2. PP+TP+DP - less communications, but requires massive changes to the model - -* when you have slow inter-node connectivity and still low on GPU memory: - - 1. DP+PP+TP+ZeRO-1 diff --git a/docs/source/en/perf_train_gpu_one.md b/docs/source/en/perf_train_gpu_one.md new file mode 100644 index 00000000000000..1d885ba03646c7 --- /dev/null +++ b/docs/source/en/perf_train_gpu_one.md @@ -0,0 +1,552 @@ + + +# Methods and tools for efficient training on a single GPU + +This guide demonstrates practical techniques that you can use to increase the efficiency of your model's training by +optimizing memory utilization, speeding up the training, or both. If you'd like to understand how GPU is utilized during +training, please refer to the [Model training anatomy](model_memory_anatomy) conceptual guide first. This guide +focuses on practical techniques. + + + +If you have access to a machine with multiple GPUs, these approaches are still valid, plus you can leverage additional methods outlined in the [multi-GPU section](perf_train_gpu_many). + + + +When training large models, there are two aspects that should be considered at the same time: + +* Data throughput/training time +* Model performance + +Maximizing the throughput (samples/second) leads to lower training cost. This is generally achieved by utilizing the GPU +as much as possible and thus filling GPU memory to its limit. If the desired batch size exceeds the limits of the GPU memory, +the memory optimization techniques, such as gradient accumulation, can help. + +However, if the preferred batch size fits into memory, there's no reason to apply memory-optimizing techniques because they can +slow down the training. Just because one can use a large batch size, does not necessarily mean they should. As part of +hyperparameter tuning, you should determine which batch size yields the best results and then optimize resources accordingly. + +The methods and tools covered in this guide can be classified based on the effect they have on the training process: + +| Method/tool | Improves training speed | Optimizes memory utilization | +|:-----------------------------------------------------------|:------------------------|:-----------------------------| +| [Batch size choice](#batch-size-choice) | Yes | Yes | +| [Gradient accumulation](#gradient-accumulation) | No | Yes | +| [Gradient checkpointing](#gradient-checkpointing) | No | Yes | +| [Mixed precision training](#mixed-precision-training) | Yes | (No) | +| [Optimizer choice](#optimizer-choice) | Yes | Yes | +| [Data preloading](#data-preloading) | Yes | No | +| [DeepSpeed Zero](#deepspeed-zero) | No | Yes | +| [torch.compile](#using-torchcompile) | Yes | No | +| [Parameter-Efficient Fine Tuning (PEFT)](#using--peft) | No | Yes | + + + +Note: when using mixed precision with a small model and a large batch size, there will be some memory savings but with a +large model and a small batch size, the memory use will be larger. + + + +You can combine the above methods to get a cumulative effect. 
These techniques are available to you whether you are training your model with [`Trainer`] or writing a pure PyTorch loop, in which case you can [configure these optimizations with 🤗 Accelerate](#using--accelerate).

If these methods do not result in sufficient gains, you can explore the following options:
* [Look into building your own custom Docker container with efficient software prebuilds](#efficient-software-prebuilds)
* [Consider a model that uses Mixture of Experts (MoE)](#mixture-of-experts)
* [Convert your model to BetterTransformer to leverage PyTorch native attention](#using-pytorch-native-attention-and-flash-attention)

Finally, if all of the above is still not enough, even after switching to a server-grade GPU like A100, consider moving to a multi-GPU setup. All these approaches are still valid in a multi-GPU setup, plus you can leverage additional parallelism techniques outlined in the [multi-GPU section](perf_train_gpu_many).

## Batch size choice

To achieve optimal performance, start by identifying the appropriate batch size. It is recommended to use batch sizes and input/output neuron counts that are of size 2^N (powers of two). Often it's a multiple of 8, but it can be higher depending on the hardware being used and the model's dtype.

For reference, check out NVIDIA's recommendation for [input/output neuron counts](https://docs.nvidia.com/deeplearning/performance/dl-performance-fully-connected/index.html#input-features) and [batch size](https://docs.nvidia.com/deeplearning/performance/dl-performance-fully-connected/index.html#batch-size) for fully connected layers (which are involved in GEMMs (General Matrix Multiplications)).

[Tensor Core Requirements](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc) define the multiplier based on the dtype and the hardware. For instance, for the fp16 data type a multiple of 8 is recommended, unless it's an A100 GPU, in which case use multiples of 64.

For parameters that are small, consider also [Dimension Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#dim-quantization). This is where tiling happens and the right multiplier can have a significant speedup.

## Gradient Accumulation

The **gradient accumulation** method aims to calculate gradients in smaller increments instead of computing them for the entire batch at once. This approach involves iteratively calculating gradients in smaller batches by performing forward and backward passes through the model and accumulating the gradients during the process. Once a sufficient number of gradients have been accumulated, the model's optimization step is executed. By employing gradient accumulation, it becomes possible to increase the **effective batch size** beyond the limitations imposed by the GPU's memory capacity. However, it is important to note that the additional forward and backward passes introduced by gradient accumulation can slow down the training process.

You can enable gradient accumulation by adding the `gradient_accumulation_steps` argument to [`TrainingArguments`]:

```py
training_args = TrainingArguments(per_device_train_batch_size=1, gradient_accumulation_steps=4, **default_args)
```

In the above example, your effective batch size becomes 4.

Alternatively, use 🤗 Accelerate to gain full control over the training loop. Find the 🤗 Accelerate example [further down in this guide](#using--accelerate).
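If you are writing your own training loop instead of using [`Trainer`], the same idea boils down to a few lines of plain PyTorch. The snippet below is a minimal sketch, assuming `model`, `optimizer`, and `dataloader` are already defined elsewhere:

```py
accumulation_steps = 4  # gradients from 4 micro-batches are accumulated into one optimizer step

optimizer.zero_grad()
for step, batch in enumerate(dataloader, start=1):
    loss = model(**batch).loss
    # scale the loss so the accumulated gradient equals the average over the effective batch
    (loss / accumulation_steps).backward()
    if step % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

With a per-device batch size of 1 this again gives an effective batch size of 4, matching the [`TrainingArguments`] example above.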
While it is advised to max out GPU usage as much as possible, a high number of gradient accumulation steps can result in a more pronounced training slowdown. Consider the following example. Let's say the `per_device_train_batch_size=4` without gradient accumulation hits the GPU's limit. If you would like to train with batches of size 64, do not set the `per_device_train_batch_size` to 1 and `gradient_accumulation_steps` to 64. Instead, keep `per_device_train_batch_size=4` and set `gradient_accumulation_steps=16`. This results in the same effective batch size while making better use of the available GPU resources.

For additional information, please refer to batch size and gradient accumulation benchmarks for [RTX-3090](https://github.com/huggingface/transformers/issues/14608#issuecomment-1004392537) and [A100](https://github.com/huggingface/transformers/issues/15026#issuecomment-1005033957).

## Gradient Checkpointing

Some large models may still face memory issues even when the batch size is set to 1 and gradient accumulation is used. This is because there are other components that also require memory storage.

Saving all activations from the forward pass in order to compute the gradients during the backward pass can result in significant memory overhead. The alternative approach of discarding the activations and recalculating them when needed during the backward pass would introduce a considerable computational overhead and slow down the training process.

**Gradient checkpointing** offers a compromise between these two approaches and saves strategically selected activations throughout the computational graph so only a fraction of the activations need to be re-computed for the gradients. For an in-depth explanation of gradient checkpointing, refer to [this great article](https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9).

To enable gradient checkpointing in the [`Trainer`], pass the corresponding flag to [`TrainingArguments`]:

```py
training_args = TrainingArguments(
    per_device_train_batch_size=1, gradient_accumulation_steps=4, gradient_checkpointing=True, **default_args
)
```

Alternatively, use 🤗 Accelerate - find the 🤗 Accelerate example [further in this guide](#using--accelerate).

While gradient checkpointing may improve memory efficiency, it slows training by approximately 20%.

## Mixed precision training

**Mixed precision training** is a technique that aims to optimize the computational efficiency of training models by utilizing lower-precision numerical formats for certain variables. Traditionally, most models use 32-bit floating point precision (fp32 or float32) to represent and process variables. However, not all variables require this high precision level to achieve accurate results. By reducing the precision of certain variables to lower numerical formats like 16-bit floating point (fp16 or float16), we can speed up the computations. Because some computations in this approach are performed in half-precision while others are still in full precision, the approach is called mixed precision training.

Most commonly mixed precision training is achieved by using fp16 (float16) data types, however, some GPU architectures (such as the Ampere architecture) offer bf16 and tf32 (CUDA internal data type) data types. Check out the [NVIDIA Blog](https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/) to learn more about the differences between these data types.
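Outside of [`Trainer`], the same idea can be applied in a plain PyTorch loop with `torch.autocast` and a gradient scaler. The following is only a minimal sketch of the mechanism, assuming `model`, `optimizer`, and `dataloader` already exist; it is not the exact implementation used by [`Trainer`]:

```py
import torch

scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid fp16 gradient underflow

for batch in dataloader:
    optimizer.zero_grad()
    # run the forward pass in fp16 where it is safe to do so
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(**batch).loss
    scaler.scale(loss).backward()  # backpropagate the scaled loss
    scaler.step(optimizer)         # unscales gradients, skips the step if they contain inf/NaN
    scaler.update()                # adjust the scale factor for the next iteration
```

With bf16 the scaler is usually unnecessary because of its much wider dynamic range.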
### fp16

The main advantage of mixed precision training comes from saving the activations in half precision (fp16). Although the gradients are also computed in half precision they are converted back to full precision for the optimization step so no memory is saved here. While mixed precision training results in faster computations, it can also lead to more GPU memory being utilized, especially for small batch sizes. This is because the model is now present on the GPU in both 16-bit and 32-bit precision (1.5x the original model on the GPU).

To enable mixed precision training, set the `fp16` flag to `True`:

```py
training_args = TrainingArguments(per_device_train_batch_size=4, fp16=True, **default_args)
```

If you prefer to use 🤗 Accelerate, find the 🤗 Accelerate example [further in this guide](#using--accelerate).

### BF16

If you have access to Ampere or newer hardware you can use bf16 for mixed precision training and evaluation. While bf16 has a worse precision than fp16, it has a much bigger dynamic range. In fp16 the biggest number you can have is `65504` and any number above that will result in an overflow. A bf16 number can be as large as `3.39e+38` (!) which is about the same as fp32 - because both have 8 bits used for the numerical range.

You can enable BF16 in the 🤗 Trainer with:

```python
training_args = TrainingArguments(bf16=True, **default_args)
```

### TF32

The Ampere hardware uses a magical data type called tf32. It has the same numerical range as fp32 (an 8-bit exponent), but instead of 23 bits of precision it has only 10 bits (same as fp16) and uses only 19 bits in total. It's "magical" in the sense that you can use the normal fp32 training and/or inference code and by enabling tf32 support you can get up to 3x throughput improvement. All you need to do is to add the following to your code:

```python
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```

CUDA will automatically switch to using tf32 instead of fp32 where possible, assuming that the used GPU is from the Ampere series.

According to [NVIDIA research](https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/), the majority of machine learning training workloads show the same perplexity and convergence with tf32 training as with fp32. If you're already using fp16 or bf16 mixed precision it may help with the throughput as well.

You can enable this mode in the 🤗 Trainer:

```python
TrainingArguments(tf32=True, **default_args)
```

tf32 can't be accessed directly via `tensor.to(dtype=torch.tf32)` because it is an internal CUDA data type. You need `torch>=1.7` to use tf32 data types.

For additional information on tf32 vs other precisions, please refer to the following benchmarks: [RTX-3090](https://github.com/huggingface/transformers/issues/14608#issuecomment-1004390803) and [A100](https://github.com/huggingface/transformers/issues/15026#issuecomment-1004543189).

## Flash Attention 2

You can speed up the training throughput by using the Flash Attention 2 integration in Transformers. Check out the appropriate section in the [single GPU section](./perf_infer_gpu_one#Flash-Attention-2) to learn more about how to load a model with Flash Attention 2 modules.

## Optimizer choice

The most common optimizer used to train transformer models is Adam or AdamW (Adam with weight decay).
Adam achieves good convergence by storing the rolling average of the previous gradients; however, it adds an additional memory footprint of the order of the number of model parameters. To remedy this, you can use an alternative optimizer. For example, if you have [NVIDIA/apex](https://github.com/NVIDIA/apex) installed for NVIDIA GPUs, or [ROCmSoftwarePlatform/apex](https://github.com/ROCmSoftwarePlatform/apex) for AMD GPUs, `adamw_apex_fused` will give you the fastest training experience among all supported AdamW optimizers.

[`Trainer`] integrates a variety of optimizers that can be used out of the box: `adamw_hf`, `adamw_torch`, `adamw_torch_fused`, `adamw_apex_fused`, `adamw_anyprecision`, `adafactor`, or `adamw_bnb_8bit`. More optimizers can be plugged in via a third-party implementation.

Let's take a closer look at two alternatives to the AdamW optimizer:
1. `adafactor` which is available in [`Trainer`]
2. `adamw_bnb_8bit` is also available in [`Trainer`], but a third-party integration is provided below for demonstration.

For comparison, for a 3B-parameter model, like `google-t5/t5-3b`:
* A standard AdamW optimizer will need 24GB of GPU memory because it uses 8 bytes for each parameter (8*3 => 24GB)
* Adafactor optimizer will need more than 12GB. It uses slightly more than 4 bytes for each parameter, so 4*3 and then some extra.
* 8bit BNB quantized optimizer will use only (2*3) 6GB if all optimizer states are quantized.

### Adafactor

Adafactor doesn't store rolling averages for each element in weight matrices. Instead, it keeps aggregated information (sums of rolling averages row- and column-wise), significantly reducing its footprint. However, compared to Adam, Adafactor may have slower convergence in certain cases.

You can switch to Adafactor by setting `optim="adafactor"` in [`TrainingArguments`]:

```py
training_args = TrainingArguments(per_device_train_batch_size=4, optim="adafactor", **default_args)
```

Combined with other approaches (gradient accumulation, gradient checkpointing, and mixed precision training) you can notice up to a 3x memory improvement while maintaining the throughput! However, as mentioned before, the convergence of Adafactor can be worse than Adam.

### 8-bit Adam

Instead of aggregating optimizer states like Adafactor, 8-bit Adam keeps the full state and quantizes it. Quantization means that it stores the state with lower precision and dequantizes it only for the optimization. This is similar to the idea behind mixed precision training.

To use `adamw_bnb_8bit`, you simply need to set `optim="adamw_bnb_8bit"` in [`TrainingArguments`]:

```py
training_args = TrainingArguments(per_device_train_batch_size=4, optim="adamw_bnb_8bit", **default_args)
```

However, we can also use a third-party implementation of the 8-bit optimizer for demonstration purposes to see how that can be integrated.

First, follow the installation guide in the GitHub [repo](https://github.com/TimDettmers/bitsandbytes) to install the `bitsandbytes` library that implements the 8-bit Adam optimizer.

Next you need to initialize the optimizer. This involves two steps:
* First, group the model's parameters into two groups - one where weight decay should be applied, and the other one where it should not. Usually, biases and layer norm parameters are not weight decayed.
* Then do some argument housekeeping to use the same parameters as the previously used AdamW optimizer.
```py
import bitsandbytes as bnb
from torch import nn
from transformers.trainer_pt_utils import get_parameter_names

training_args = TrainingArguments(per_device_train_batch_size=4, **default_args)

decay_parameters = get_parameter_names(model, [nn.LayerNorm])
decay_parameters = [name for name in decay_parameters if "bias" not in name]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if n in decay_parameters],
        "weight_decay": training_args.weight_decay,
    },
    {
        "params": [p for n, p in model.named_parameters() if n not in decay_parameters],
        "weight_decay": 0.0,
    },
]

optimizer_kwargs = {
    "betas": (training_args.adam_beta1, training_args.adam_beta2),
    "eps": training_args.adam_epsilon,
}
optimizer_kwargs["lr"] = training_args.learning_rate
adam_bnb_optim = bnb.optim.Adam8bit(
    optimizer_grouped_parameters,
    betas=(training_args.adam_beta1, training_args.adam_beta2),
    eps=training_args.adam_epsilon,
    lr=training_args.learning_rate,
)
```

Finally, pass the custom optimizer as an argument to the `Trainer`:

```py
trainer = Trainer(model=model, args=training_args, train_dataset=ds, optimizers=(adam_bnb_optim, None))
```

Combined with other approaches (gradient accumulation, gradient checkpointing, and mixed precision training), you can expect to get about a 3x memory improvement and even slightly higher throughput than with Adafactor.

### multi_tensor

pytorch-nightly introduced `torch.optim._multi_tensor` which should significantly speed up the optimizers for situations with lots of small feature tensors. It should eventually become the default, but if you want to experiment with it sooner, take a look at this GitHub [issue](https://github.com/huggingface/transformers/issues/9965).

## Data preloading

One of the important requirements to reach great training speed is the ability to feed the GPU at the maximum speed it can handle. By default, everything happens in the main process, and it might not be able to read the data from disk fast enough, and thus create a bottleneck, leading to GPU under-utilization. Configure the following arguments to reduce the bottleneck:

- `DataLoader(pin_memory=True, ...)` - ensures the data gets preloaded into the pinned memory on CPU and typically leads to much faster transfers from CPU to GPU memory.
- `DataLoader(num_workers=4, ...)` - spawns several workers to preload data faster. During training, watch the GPU utilization stats; if it's far from 100%, experiment with increasing the number of workers. Of course, the problem could be elsewhere, so many workers won't necessarily lead to better performance.

When using [`Trainer`], the corresponding [`TrainingArguments`] are: `dataloader_pin_memory` (`True` by default), and `dataloader_num_workers` (defaults to `0`).

## DeepSpeed ZeRO

DeepSpeed is an open-source deep learning optimization library that is integrated with 🤗 Transformers and 🤗 Accelerate. It provides a wide range of features and optimizations designed to improve the efficiency and scalability of large-scale deep learning training.

If your model fits onto a single GPU and you have enough space to fit a small batch size, you don't need to use DeepSpeed as it'll only slow things down. However, if the model doesn't fit onto a single GPU or you can't fit a small batch, you can leverage DeepSpeed ZeRO + CPU Offload, or NVMe Offload for much larger models.
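For illustration only, here is a minimal sketch of what such a setup can look like with [`Trainer`]. The ZeRO-3 + CPU offload configuration shown is just a starting point and is passed directly as a dict (a path to a JSON file works as well); the actual values should be tuned following the guides linked below:

```py
# Hypothetical minimal ZeRO-3 + CPU offload configuration; "auto" lets the
# Trainer integration fill in values that are consistent with its own arguments.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
    "fp16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
}

training_args = TrainingArguments(per_device_train_batch_size=1, deepspeed=ds_config, **default_args)
```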
In this case, you need to separately [install the library](main_classes/deepspeed#installation), then follow one of the guides to create a configuration file and launch DeepSpeed:

* For an in-depth guide on DeepSpeed integration with [`Trainer`], review [the corresponding documentation](main_classes/deepspeed), specifically the [section for a single GPU](main_classes/deepspeed#deployment-with-one-gpu). Some adjustments are required to use DeepSpeed in a notebook; please take a look at the [corresponding guide](main_classes/deepspeed#deployment-in-notebooks).
* If you prefer to use 🤗 Accelerate, refer to the [🤗 Accelerate DeepSpeed guide](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed).

## Using torch.compile

PyTorch 2.0 introduced a new compile function that doesn't require any modification to existing PyTorch code but can optimize your code by adding a single line of code: `model = torch.compile(model)`.

If using [`Trainer`], you only need to pass the `torch_compile` option in the [`TrainingArguments`]:

```python
training_args = TrainingArguments(torch_compile=True, **default_args)
```

`torch.compile` uses Python's frame evaluation API to automatically create a graph from existing PyTorch programs. After capturing the graph, different backends can be deployed to lower the graph to an optimized engine. You can find more details and benchmarks in the [PyTorch documentation](https://pytorch.org/get-started/pytorch-2.0/).

`torch.compile` has a growing list of backends, which can be found by calling `torchdynamo.list_backends()`, each with its own optional dependencies.

Choose which backend to use by specifying it via `torch_compile_backend` in the [`TrainingArguments`]. Some of the most commonly used backends are:

**Debugging backends**:
* `dynamo.optimize("eager")` - Uses PyTorch to run the extracted GraphModule. This is quite useful in debugging TorchDynamo issues.
* `dynamo.optimize("aot_eager")` - Uses AotAutograd with no compiler, i.e., just using PyTorch eager for the AotAutograd's extracted forward and backward graphs. This is useful for debugging, and unlikely to give speedups.

**Training & inference backends**:
* `dynamo.optimize("inductor")` - Uses the TorchInductor backend with AotAutograd and cudagraphs by leveraging codegened Triton kernels. [Read more](https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747)
* `dynamo.optimize("nvfuser")` - nvFuser with TorchScript. [Read more](https://dev-discuss.pytorch.org/t/tracing-with-primitives-update-1-nvfuser-and-its-primitives/593)
* `dynamo.optimize("aot_nvfuser")` - nvFuser with AotAutograd. [Read more](https://dev-discuss.pytorch.org/t/tracing-with-primitives-update-1-nvfuser-and-its-primitives/593)
* `dynamo.optimize("aot_cudagraphs")` - cudagraphs with AotAutograd. [Read more](https://github.com/pytorch/torchdynamo/pull/757)

**Inference-only backends**:
* `dynamo.optimize("ofi")` - Uses TorchScript optimize_for_inference. [Read more](https://pytorch.org/docs/stable/generated/torch.jit.optimize_for_inference.html)
* `dynamo.optimize("fx2trt")` - Uses NVIDIA TensorRT for inference optimizations. [Read more](https://pytorch.org/TensorRT/tutorials/getting_started_with_fx_path.html)
* `dynamo.optimize("onnxrt")` - Uses ONNX Runtime for inference on CPU/GPU. [Read more](https://onnxruntime.ai/)
* `dynamo.optimize("ipex")` - Uses IPEX for inference on CPU. [Read more](https://github.com/intel/intel-extension-for-pytorch)
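For example, to select a backend explicitly through [`Trainer`] (the default TorchInductor backend is used here purely as an illustration):

```python
training_args = TrainingArguments(
    torch_compile=True,
    torch_compile_backend="inductor",  # any backend reported by torchdynamo.list_backends() can be used here
    **default_args,
)
```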
For an example of using `torch.compile` with 🤗 Transformers, check out this [blog post on fine-tuning a BERT model for Text Classification using the newest PyTorch 2.0 features](https://www.philschmid.de/getting-started-pytorch-2-0-transformers).

## Using 🤗 PEFT

[Parameter-Efficient Fine Tuning (PEFT)](https://huggingface.co/blog/peft) methods freeze the pretrained model parameters during fine-tuning and add a small number of trainable parameters (the adapters) on top of it.

As a result, the [memory associated with the optimizer states and gradients](https://huggingface.co/docs/transformers/model_memory_anatomy#anatomy-of-models-memory) is greatly reduced.

For example, with a vanilla AdamW, the memory requirement for the optimizer state would be:
* fp32 copy of parameters: 4 bytes/param
* Momentum: 4 bytes/param
* Variance: 4 bytes/param

Suppose a model with 7B parameters and 200 million trainable parameters injected with [Low Rank Adapters](https://huggingface.co/docs/peft/conceptual_guides/lora).

The memory requirement for the optimizer state of the plain model would be 12 * 7 = 84 GB (assuming 7B trainable parameters).

Adding LoRA slightly increases the memory associated with the model weights and substantially decreases the memory requirement for the optimizer state to 12 * 0.2 = 2.4 GB.

Read more about PEFT and its detailed usage in [the PEFT documentation](https://huggingface.co/docs/peft/) or the [PEFT repository](https://github.com/huggingface/peft).

## Using 🤗 Accelerate

With [🤗 Accelerate](https://huggingface.co/docs/accelerate/index) you can use the above methods while gaining full control over the training loop and can essentially write the loop in pure PyTorch with some minor modifications.

Suppose you have combined the methods in the [`TrainingArguments`] like so:

```py
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    fp16=True,
    **default_args,
)
```

The full example training loop with 🤗 Accelerate is only a handful of lines of code long:

```py
from accelerate import Accelerator
from torch.utils.data.dataloader import DataLoader

dataloader = DataLoader(ds, batch_size=training_args.per_device_train_batch_size)

if training_args.gradient_checkpointing:
    model.gradient_checkpointing_enable()

# recent Accelerate versions configure mixed precision via `mixed_precision`
# instead of the legacy `fp16` argument
accelerator = Accelerator(mixed_precision="fp16" if training_args.fp16 else "no")
model, optimizer, dataloader = accelerator.prepare(model, adam_bnb_optim, dataloader)

model.train()
for step, batch in enumerate(dataloader, start=1):
    loss = model(**batch).loss
    loss = loss / training_args.gradient_accumulation_steps
    accelerator.backward(loss)
    if step % training_args.gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

First we wrap the dataset in a [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader). Then we can enable gradient checkpointing by calling the model's [`~PreTrainedModel.gradient_checkpointing_enable`] method. When we initialize the [`Accelerator`](https://huggingface.co/docs/accelerate/package_reference/accelerator#accelerate.Accelerator) we can specify if we want to use mixed precision training and it will take care of it for us in the [`prepare`] call.
During the [`prepare`](https://huggingface.co/docs/accelerate/package_reference/accelerator#accelerate.Accelerator.prepare) call the dataloader will also be distributed across workers should we use multiple GPUs. We use the same [8-bit optimizer](#8-bit-adam) from the earlier example.

Finally, we can add the main training loop. Note that the `backward` call is handled by 🤗 Accelerate. We can also see how gradient accumulation works: we normalize the loss, so we get the average at the end of accumulation, and once we have enough steps we run the optimization.

Implementing these optimization techniques with 🤗 Accelerate only takes a handful of lines of code and comes with the benefit of more flexibility in the training loop. For full documentation of all features, have a look at the [Accelerate documentation](https://huggingface.co/docs/accelerate/index).


## Efficient Software Prebuilds

PyTorch's [pip and conda builds](https://pytorch.org/get-started/locally/#start-locally) come prebuilt with the CUDA toolkit, which is enough to run PyTorch, but it is insufficient if you need to build CUDA extensions.

At times, additional efforts may be required to pre-build some components, for instance, if you're using libraries like `apex` that don't come pre-compiled. In other situations figuring out how to install the right CUDA toolkit system-wide can be complicated. To address these scenarios, PyTorch and NVIDIA release the NGC docker containers, which already come with everything prebuilt. You just need to install your programs on it, and it will run out of the box.

This approach is also useful if you want to tweak the PyTorch source and/or make a new customized build. To find the docker image version you want, start [with the PyTorch release notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/) and choose one of the latest monthly releases. Go into the release's notes for the desired release, check that the environment's components are matching your needs (including NVIDIA Driver requirements!) and then at the very top of that document go to the corresponding NGC page. If for some reason you get lost, here is [the index of all PyTorch NGC images](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch).

Next follow the instructions to download and deploy the docker image.

## Mixture of Experts

Some recent papers reported a 4-5x training speedup and faster inference by integrating Mixture of Experts (MoE) into the Transformer models.

Since it has been discovered that more parameters lead to better performance, this technique makes it possible to increase the number of parameters by an order of magnitude without increasing training costs.

In this approach every other FFN layer is replaced with an MoE layer which consists of many experts, with a gating function that trains each expert in a balanced way depending on the input token's position in a sequence.

![MoE Transformer 2x block](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/perf-moe-transformer.png)

(source: [GLAM](https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html))

You can find exhaustive details and comparison tables in the papers listed at the end of this section.

The main drawback of this approach is that it requires staggering amounts of GPU memory - almost an order of magnitude larger than its dense equivalent.
Various distillation approaches have been proposed to overcome the much higher memory requirements.

There is a direct trade-off though: you can use just a few experts with a 2-3x smaller base model instead of dozens or hundreds of experts, leading to a 5x smaller model that increases the training speed moderately while increasing the memory requirements moderately as well.

Most related papers and implementations are built around TensorFlow/TPUs:

- [GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding](https://arxiv.org/abs/2006.16668)
- [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961)
- [GLaM: Generalist Language Model (GLaM)](https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html)

And for PyTorch, DeepSpeed has built one as well: [DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale](https://arxiv.org/abs/2201.05596), [Mixture of Experts](https://www.deepspeed.ai/tutorials/mixture-of-experts/) - blog posts: [1](https://www.microsoft.com/en-us/research/blog/deepspeed-powers-8x-larger-moe-model-training-with-high-performance/), [2](https://www.microsoft.com/en-us/research/publication/scalable-and-efficient-moe-training-for-multitask-multilingual-models/) and specific deployment with large transformer-based natural language generation models: [blog post](https://www.deepspeed.ai/2021/12/09/deepspeed-moe-nlg.html), [Megatron-Deepspeed branch](https://github.com/microsoft/Megatron-DeepSpeed/tree/moe-training).

## Using PyTorch native attention and Flash Attention

PyTorch 2.0 released a native [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html) (SDPA) that allows using fused GPU kernels such as [memory-efficient attention](https://arxiv.org/abs/2112.05682) and [flash attention](https://arxiv.org/abs/2205.14135).

After installing the [`optimum`](https://github.com/huggingface/optimum) package, the relevant internal modules can be replaced to use PyTorch's native attention with:

```python
model = model.to_bettertransformer()
```

Once converted, train the model as usual.

The PyTorch-native `scaled_dot_product_attention` operator can only dispatch to Flash Attention if no `attention_mask` is provided.

By default, in training mode, the BetterTransformer integration **drops the mask support and can only be used for training that does not require a padding mask for batched training**. This is the case, for example, during masked language modeling or causal language modeling. BetterTransformer is not suited for fine-tuning models on tasks that require a padding mask.

Check out this [blogpost](https://pytorch.org/blog/out-of-the-box-acceleration/) to learn more about acceleration and memory-savings with SDPA.
diff --git a/docs/source/en/perf_train_gpu_one.mdx b/docs/source/en/perf_train_gpu_one.mdx deleted file mode 100644 index 07299b016f5991..00000000000000 --- a/docs/source/en/perf_train_gpu_one.mdx +++ /dev/null @@ -1,744 +0,0 @@ - - -# Efficient Training on a Single GPU - -This guide focuses on training large models efficiently on a single GPU. These approaches are still valid if you have access to a machine with multiple GPUs but you will also have access to additional methods outlined in the [multi-GPU section](perf_train_gpu_many).
- -In this section we have a look at a few tricks to reduce the memory footprint and speed up training for large models and how they are integrated in the [`Trainer`] and [🤗 Accelerate](https://huggingface.co/docs/accelerate/). Each method can improve speed or memory usage which is summarized in the table below: - -|Method|Speed|Memory| -|:-----|:----|:-----| -| Gradient accumulation | No | Yes | -| Gradient checkpointing | No| Yes | -| Mixed precision training | Yes | (No) | -| Batch size | Yes | Yes | -| Optimizer choice | Yes | Yes | -| DataLoader | Yes | No | -| DeepSpeed Zero | No | Yes | - -A bracket means that it might not be strictly the case but is usually either not a main concern or negligible. Before we start make sure you have installed the following libraries: - -```bash -pip install transformers datasets accelerate nvidia-ml-py3 -``` - -The `nvidia-ml-py3` library allows us to monitor the memory usage of the models from within Python. You might be familiar with the `nvidia-smi` command in the terminal - this library allows to access the same information in Python directly. - -Then we create some dummy data. We create random token IDs between 100 and 30000 and binary labels for a classifier. In total we get 512 sequences each with length 512 and store them in a [`~datasets.Dataset`] with PyTorch format. - - -```py -import numpy as np -from datasets import Dataset - - -seq_len, dataset_size = 512, 512 -dummy_data = { - "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)), - "labels": np.random.randint(0, 1, (dataset_size)), -} -ds = Dataset.from_dict(dummy_data) -ds.set_format("pt") -``` - -We want to print some summary statistics for the GPU utilization and the training run with the [`Trainer`]. We setup a two helper functions to do just that: - -```py -from pynvml import * - - -def print_gpu_utilization(): - nvmlInit() - handle = nvmlDeviceGetHandleByIndex(0) - info = nvmlDeviceGetMemoryInfo(handle) - print(f"GPU memory occupied: {info.used//1024**2} MB.") - - -def print_summary(result): - print(f"Time: {result.metrics['train_runtime']:.2f}") - print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}") - print_gpu_utilization() -``` - -Let's verify that we start with a free GPU memory: - -```py ->>> print_gpu_utilization() -GPU memory occupied: 0 MB. -``` - -That looks good: the GPU memory is not occupied as we would expect before we load any models. If that's not the case on your machine make sure to stop all processes that are using GPU memory. However, not all free GPU memory can be used by the user. When a model is loaded to the GPU also the kernels are loaded which can take up 1-2GB of memory. To see how much it is we load a tiny tensor into the GPU which triggers the kernels to be loaded as well. - -```py ->>> import torch - - ->>> torch.ones((1, 1)).to("cuda") ->>> print_gpu_utilization() -GPU memory occupied: 1343 MB. -``` - -We see that the kernels alone take up 1.3GB of GPU memory. Now let's see how much space the model uses. - -## Load Model - -First, we load the `bert-large-uncased` model. We load the model weights directly to the GPU so that we can check how much space just weights use. - - -```py ->>> from transformers import AutoModelForSequenceClassification - - ->>> model = AutoModelForSequenceClassification.from_pretrained("bert-large-uncased").to("cuda") ->>> print_gpu_utilization() -GPU memory occupied: 2631 MB. -``` - -We can see that the model weights alone take up 1.3 GB of the GPU memory. 
The exact number depends on the specific GPU you are using. Note that on newer GPUs a model can sometimes take up more space since the weights are loaded in an optimized fashion that speeds up the usage of the model. Now we can also quickly check if we get the same result as with `nvidia-smi` CLI: - - -```bash -nvidia-smi -``` - -```bash -Tue Jan 11 08:58:05 2022 -+-----------------------------------------------------------------------------+ -| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 | -|-------------------------------+----------------------+----------------------+ -| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | -| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | -| | | MIG M. | -|===============================+======================+======================| -| 0 Tesla V100-SXM2... On | 00000000:00:04.0 Off | 0 | -| N/A 37C P0 39W / 300W | 2631MiB / 16160MiB | 0% Default | -| | | N/A | -+-------------------------------+----------------------+----------------------+ - -+-----------------------------------------------------------------------------+ -| Processes: | -| GPU GI CI PID Type Process name GPU Memory | -| ID ID Usage | -|=============================================================================| -| 0 N/A N/A 3721 C ...nvs/codeparrot/bin/python 2629MiB | -+-----------------------------------------------------------------------------+ -``` - -We get the same number as before and you can also see that we are using a V100 GPU with 16GB of memory. So now we can start training the model and see how the GPU memory consumption changes. First, we set up a few standard training arguments that we will use across all our experiments: - -```py -default_args = { - "output_dir": "tmp", - "evaluation_strategy": "steps", - "num_train_epochs": 1, - "log_level": "error", - "report_to": "none", -} -``` - - - - Note: In order to properly clear the memory after experiments we need restart the Python kernel between experiments. Run all steps above and then just one of the experiments below. - - - -## Vanilla Training - -As a first experiment we will use the [`Trainer`] and train the model without any further modifications and a batch size of 4: - -```py -from transformers import TrainingArguments, Trainer, logging - -logging.set_verbosity_error() - - -training_args = TrainingArguments(per_device_train_batch_size=4, **default_args) -trainer = Trainer(model=model, args=training_args, train_dataset=ds) -result = trainer.train() -print_summary(result) -``` - -``` -Time: 57.82 -Samples/second: 8.86 -GPU memory occupied: 14949 MB. -``` - -We see that already a relatively small batch size almost fills up our GPU's entire memory. However, a larger batch size can often result in faster model convergence or better end performance. So ideally we want to tune the batch size to our model's needs and not to the GPU limitations. What's interesting is that we use much more memory than the size of the model. To understand a bit better why this is the case let's have look at a model's operations and memory needs. - -## Anatomy of Model's Operations - -Transformers architecture includes 3 main groups of operations grouped below by compute-intensity. - -1. **Tensor Contractions** - - Linear layers and components of Multi-Head Attention all do batched **matrix-matrix multiplications**. These operations are the most compute-intensive part of training a transformer. - -2. 
**Statistical Normalizations** - - Softmax and layer normalization are less compute-intensive than tensor contractions, and involve one or more **reduction operations**, the result of which is then applied via a map. - -3. **Element-wise Operators** - - These are the remaining operators: **biases, dropout, activations, and residual connections**. These are the least compute-intensive operations. - -This knowledge can be helpful to know when analyzing performance bottlenecks. - -This summary is derived from [Data Movement Is All You Need: A Case Study on Optimizing Transformers 2020](https://arxiv.org/abs/2007.00072) - - -## Anatomy of Model's Memory -We've seen that training the model uses much more memory than just putting the model on the GPU. This is because there are many components during training that use GPU memory. The components on GPU memory are the following: -1. model weights -2. optimizer states -3. gradients -4. forward activations saved for gradient computation -5. temporary buffers -6. functionality-specific memory - -A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter plus activation memory. For inference there are no optimizer states and gradients, so we can subtract those. And thus we end up with 6 bytes per model parameter for mixed precision inference, plus activation memory. - -Let's look at the details. - -**Model Weights:** - -- 4 bytes * number of parameters for fp32 training -- 6 bytes * number of parameters for mixed precision training (maintains a model in fp32 and one in fp16 in memory) - -**Optimizer States:** - -- 8 bytes * number of parameters for normal AdamW (maintains 2 states) -- 2 bytes * number of parameters for 8-bit AdamW optimizers like [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) -- 4 bytes * number of parameters for optimizers like SGD with momentum (maintains only 1 state) - -**Gradients** - -- 4 bytes * number of parameters for either fp32 or mixed precision training (gradients are always kept in fp32) - -**Forward Activations** - -- size depends on many factors, the key ones being sequence length, hidden size and batch size. - -There are the input and output that are being passed and returned by the forward and the backward functions and the forward activations saved for gradient computation. - -**Temporary Memory** - -Additionally there are all kinds of temporary variables which get released once the calculation is done, but in the moment these could require additional memory and could push to OOM. Therefore when coding it's crucial to think strategically about such temporary variables and sometimes to explicitly free those as soon as they are no longer needed. - -**Functionality-specific memory** - -Then your software could have special memory needs. For example, when generating text using beam search, the software needs to maintain multiple copies of inputs and outputs. - -**`forward` vs `backward` Execution Speed** - -For convolutions and linear layers there are 2x flops in the backward compared to the forward, which generally translates into ~2x slower (sometimes more, because sizes in the backward tend to be more awkward). Activations are usually bandwidth-limited, and it’s typical for an activation to have to read more data in the backward than in the forward (e.g. activation forward reads once, writes once, activation backward reads twice, gradOutput and output of the forward, and writes once, gradInput). 
- -So there are potentially a few places where we could save GPU memory or speed up operations. Let's start with a simple optimization: choosing the right batch size. - -## Batch sizes - -One gets the most efficient performance when batch sizes and input/output neuron counts are divisible by a certain number, which typically starts at 8, but can be much higher as well. That number varies a lot depending on the specific hardware being used and the dtype of the model. - -For example for fully connected layers (which correspond to GEMMs), NVIDIA provides recommendations for [input/output neuron counts]( -https://docs.nvidia.com/deeplearning/performance/dl-performance-fully-connected/index.html#input-features) and [batch size](https://docs.nvidia.com/deeplearning/performance/dl-performance-fully-connected/index.html#batch-size). - -[Tensor Core Requirements](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc) define the multiplier based on the dtype and the hardware. For example, for fp16 a multiple of 8 is recommended, but on A100 it's 64! - -For parameters that are small, there is also [Dimension Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#dim-quantization) to consider, this is where tiling happens and the right multiplier can have a significant speedup. - -## Gradient Accumulation - -The idea behind gradient accumulation is to instead of calculating the gradients for the whole batch at once to do it in smaller steps. The way we do that is to calculate the gradients iteratively in smaller batches by doing a forward and backward pass through the model and accumulating the gradients in the process. When enough gradients are accumulated we run the model's optimization step. This way we can easily increase the overall batch size to numbers that would never fit into the GPU's memory. In turn, however, the added forward and backward passes can slow down the training a bit. - -We can use gradient accumulation in the [`Trainer`] by simply adding the `gradient_accumulation_steps` argument to [`TrainingArguments`]. Let's see how it impacts the models memory footprint: - -```py -training_args = TrainingArguments(per_device_train_batch_size=1, gradient_accumulation_steps=4, **default_args) - -trainer = Trainer(model=model, args=training_args, train_dataset=ds) -result = trainer.train() -print_summary(result) -``` - -``` -Time: 66.03 -Samples/second: 7.75 -GPU memory occupied: 8681 MB. -``` - -We can see that the memory footprint was dramatically reduced at the cost of being only slightly slower than the vanilla run. Of course, this would change as you increase the number of accumulation steps. In general you would want to max out the GPU usage as much as possible. So in our case, the batch_size of 4 was already pretty close to the GPU's limit. If we wanted to train with a batch size of 64 we should not use `per_device_train_batch_size=1` and `gradient_accumulation_steps=64` but instead `per_device_train_batch_size=4` and `gradient_accumulation_steps=16` which has the same effective batch size while making better use of the available GPU resources. - -For more details see the benchmarks for [RTX-3090](https://github.com/huggingface/transformers/issues/14608#issuecomment-1004392537) -and [A100](https://github.com/huggingface/transformers/issues/15026#issuecomment-1005033957). 
- -Next we have a look at another trick to save a little bit more GPU memory called gradient checkpointing. - -## Gradient Checkpointing - -Even when we set the batch size to 1 and use gradient accumulation we can still run out of memory when working with large models. In order to compute the gradients during the backward pass all activations from the forward pass are normally saved. This can create a big memory overhead. Alternatively, one could forget all activations during the forward pass and recompute them on demand during the backward pass. This would however add a significant computational overhead and slow down training. - -Gradient checkpointing strikes a compromise between the two approaches and saves strategically selected activations throughout the computational graph so only a fraction of the activations need to be re-computed for the gradients. See [this great article](https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9) explaining the ideas behind gradient checkpointing. - -To enable gradient checkpointing in the [`Trainer`] we only need to pass it as a flag to the [`TrainingArguments`]. Everything else is handled under the hood: - -```py -training_args = TrainingArguments( - per_device_train_batch_size=1, gradient_accumulation_steps=4, gradient_checkpointing=True, **default_args -) - -trainer = Trainer(model=model, args=training_args, train_dataset=ds) -result = trainer.train() -print_summary(result) -``` - -``` -Time: 85.47 -Samples/second: 5.99 -GPU memory occupied: 6775 MB. -``` - -We can see that this saved some more memory but at the same time training became a bit slower. A general rule of thumb is that gradient checkpointing slows down training by about 20%. Let's have a look at another method with which we can regain some speed: mixed precision training. - - -## Floating Data Types - -The idea of mixed precision training is that not all variables need to be stored in full (32-bit) floating point precision. If we can reduce the precision the variables and their computations are faster. Here are the commonly used floating point data types choice of which impacts both memory usage and throughput: - -- fp32 (`float32`) -- fp16 (`float16`) -- bf16 (`bfloat16`) -- tf32 (CUDA internal data type) - -Here is a diagram that shows how these data types correlate to each other. - -![data types](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/tf32-bf16-fp16-fp32.png) -(source: [NVIDIA Blog](https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/)) - -While fp16 and fp32 have been around for quite some time, bf16 and tf32 are only available on the Ampere architecture GPUS and TPUs support bf16 as well. Let's start with the most commonly used method which is FP16 training/ - - -### FP16 Training - -The idea of mixed precision training is that not all variables need to be stored in full (32-bit) floating point precision. If we can reduce the precision the variales and their computations are faster. The main advantage comes from saving the activations in half (16-bit) precision. Although the gradients are also computed in half precision they are converted back to full precision for the optimization step so no memory is saved here. Since the model is present on the GPU in both 16-bit and 32-bit precision this can use more GPU memory (1.5x the original model is on the GPU), especially for small batch sizes. 
Since some computations are performed in full and some in half precision this approach is also called mixed precision training. Enabling mixed precision training is also just a matter of setting the `fp16` flag to `True`: - -```py -training_args = TrainingArguments(per_device_train_batch_size=4, fp16=True, **default_args) - -trainer = Trainer(model=model, args=training_args, train_dataset=ds) -result = trainer.train() -print_summary(result) -``` - -``` -Time: 27.46 -Samples/second: 18.64 -GPU memory occupied: 13939 MB. -``` - -We can see that this is almost twice as fast as the vanilla training. Let's add it to the mix of the previous methods: - - -```py -training_args = TrainingArguments( - per_device_train_batch_size=1, - gradient_accumulation_steps=4, - gradient_checkpointing=True, - fp16=True, - **default_args, -) - -trainer = Trainer(model=model, args=training_args, train_dataset=ds) -result = trainer.train() -print_summary(result) -``` - -``` -Time: 50.76 -Samples/second: 10.09 -GPU memory occupied: 7275 MB. -``` - -We can see that with these tweaks we use about half the GPU memory as at the beginning while also being slightly faster. - -### BF16 -If you have access to a Ampere or newer hardware you can use bf16 for your training and evaluation. While bf16 has a worse precision than fp16, it has a much much bigger dynamic range. Therefore, if in the past you were experiencing overflow issues while training the model, bf16 will prevent this from happening most of the time. Remember that in fp16 the biggest number you can have is `65535` and any number above that will overflow. A bf16 number can be as large as `3.39e+38` (!) which is about the same as fp32 - because both have 8-bits used for the numerical range. - -You can enable BF16 in the 🤗 Trainer with: - -```python -TrainingArguments(bf16=True) -``` - -### TF32 -The Ampere hardware uses a magical data type called tf32. It has the same numerical range as fp32 (8-bits), but instead of 23 bits precision it has only 10 bits (same as fp16) and uses only 19 bits in total. - -It's magical in the sense that you can use the normal fp32 training and/or inference code and by enabling tf32 support you can get up to 3x throughput improvement. All you need to do is to add this to your code: - -``` -import torch -torch.backends.cuda.matmul.allow_tf32 = True -``` - -When this is done CUDA will automatically switch to using tf32 instead of fp32 where it's possible. This, of course, assumes that the used GPU is from the Ampere series. - -Like all cases with reduced precision this may or may not be satisfactory for your needs, so you have to experiment and see. According to [NVIDIA research](https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/) the majority of machine learning training shouldn't be impacted and showed the same perplexity and convergence as the fp32 training. - -If you're already using fp16 or bf16 mixed precision it may help with the throughput as well. - -You can enable this mode in the 🤗 Trainer with: -```python -TrainingArguments(tf32=True) -``` -By default the PyTorch default is used. - -Note: tf32 mode is internal to CUDA and can't be accessed directly via `tensor.to(dtype=torch.tf32)` as `torch.tf32` doesn't exist. - -Note: you need `torch>=1.7` to enjoy this feature. 
- -You can also see a variety of benchmarks on tf32 vs other precisions: -[RTX-3090](https://github.com/huggingface/transformers/issues/14608#issuecomment-1004390803) and -[A100](https://github.com/huggingface/transformers/issues/15026#issuecomment-1004543189). - -We've now seen how we can change the floating types to increase throughput, but we are not done, yet! There is another area where we can save GPU memory: the optimizer. - -## Optimizer - -The most common optimizer used to train transformer model is Adam or AdamW (Adam with weight decay). Adam achieves good convergence by storing the rolling average of the previous gradients which, however, adds an additional memory footprint of the order of the number of model parameters. One remedy to this is to use an alternative optimizer such as Adafactor, which works well for some models but often it has instability issues. - -HF Trainer integrates a variety of optimisers that can be used out of box. To activate the desired optimizer simply pass the `--optim` flag to the command line. - -To see which optimizers are currently supported: - -```bash -$ python examples/pytorch/translation/run_translation.py -h | grep "\-optim" - [--optim {adamw_hf,adamw_torch,adamw_torch_xla,adamw_apex_fused,adafactor}] -``` - -For example, if you have [NVIDIA/apex](https://github.com/NVIDIA/apex) installed `--optim adamw_apex_fused` will give you the fastest training experience among all supported AdamW optimizers. - -On the other hand [8bit BNB optimizer](https://github.com/TimDettmers/bitsandbytes) can save 3/4 of memory normally used by a typical AdamW optimizer if it is configured to quantize all optimizer states, but in some situations only some optimizer states are quintized and then more memory is used. - -Let's get a feel for the numbers and use for example use a 3B-parameter model, like `t5-3b`. Note that since a Gigabyte correpsonds to a billion bytes we can simply multiply the parameters (in billions) with the number of necessary bytes per parameter to get Gigabytes of GPU memory usage: - -- A standard AdamW uses 8 bytes for each parameter, here the optimizer will need (`8*3`) 24GB of GPU memory. -- Adafactor uses slightly more than 4 bytes, so (`4*3`) 12GB and then some extra. -- 8bit BNB quantized optimizer will use only (`2*3`) 6GB if all optimizer states are quantized. - -Let's have a look at Adafactor first. - -### Adafactor - -Instead of keeping the rolling average for each element in the weight matrices Adafactor only stores aggregated information (row- and column-wise sums of the rolling averages) which reduces the footprint considerably. One downside of Adafactor is that in some instances convergence can be slower than Adam's so some experimentation is advised here. We can use Adafactor simply by setting `optim="adafactor"`: - - -```py -training_args = TrainingArguments(per_device_train_batch_size=4, optim="adafactor", **default_args) - -trainer = Trainer(model=model, args=training_args, train_dataset=ds) -result = trainer.train() -print_summary(result) -``` - -``` -Time: 64.31 -Samples/second: 7.96 -GPU memory occupied: 12295 MB. -``` - -We can see that this saves a few more GB on the GPU. 
Let's see how it looks when we add it to the other methods we introduced earlier: - - -```py -training_args = TrainingArguments( - per_device_train_batch_size=1, - gradient_accumulation_steps=4, - gradient_checkpointing=True, - fp16=True, - optim="adafactor", - **default_args, -) - -trainer = Trainer(model=model, args=training_args, train_dataset=ds) -result = trainer.train() -print_summary(result) -``` - -``` -Time: 56.54 -Samples/second: 9.06 -GPU memory occupied: 4847 MB. -``` - -We went from 15 GB memory usage to 5 GB - a 3x improvement while maintaining the throughput! However, as mentioned before, the convergence of Adafactor can be worse than Adam. There is an alternative to Adafactor called 8-bit Adam that takes a slightly different approach. - -### 8-bit Adam - -Instead of aggregating optimizer states like Adafactor, 8-bit Adam keeps the full state and quantizes it. Quantization means that it stores the state with lower precision and dequantizes it only for the optimization. This is similar to the idea behind FP16 training where using variables with lower precision saves memory. - -In contrast to the previous approaches is this one not integrated into the [`Trainer`] as a simple flag. We need to install the 8-bit optimizer and then pass it as a custom optimizer to the [`Trainer`]. Follow the installation guide in the Github [repo](https://github.com/TimDettmers/bitsandbytes) to install the `bitsandbytes` library that implements the 8-bit Adam optimizer. - -Once installed, we just need to initialize the the optimizer. Although this looks like a considerable amount of work it actually just involves two steps: first we need to group the model's parameters into two groups where to one group we apply weight decay and to the other we don't. Usually, biases and layer norm parameters are not weight decayed. Then in a second step we just do some argument housekeeping to use the same parameters as the previously used AdamW optimizer. - - -Note that in order to use the 8-bit optimizer with an existing pretrained model a change to the embedding layer is needed. -Read [this issue](https://github.com/huggingface/transformers/issues/14819) for more information. - - -```py -import bitsandbytes as bnb -from torch import nn -from transformers.trainer_pt_utils import get_parameter_names - -training_args = TrainingArguments(per_device_train_batch_size=4, **default_args) - -decay_parameters = get_parameter_names(model, [nn.LayerNorm]) -decay_parameters = [name for name in decay_parameters if "bias" not in name] -optimizer_grouped_parameters = [ - { - "params": [p for n, p in model.named_parameters() if n in decay_parameters], - "weight_decay": training_args.weight_decay, - }, - { - "params": [p for n, p in model.named_parameters() if n not in decay_parameters], - "weight_decay": 0.0, - }, -] - -optimizer_kwargs = { - "betas": (training_args.adam_beta1, training_args.adam_beta2), - "eps": training_args.adam_epsilon, -} -optimizer_kwargs["lr"] = training_args.learning_rate -adam_bnb_optim = bnb.optim.Adam8bit( - optimizer_grouped_parameters, - betas=(training_args.adam_beta1, training_args.adam_beta2), - eps=training_args.adam_epsilon, - lr=training_args.learning_rate, -) -``` - -We can now pass the custom optimizer as an argument to the `Trainer`: -```py -trainer = Trainer(model=model, args=training_args, train_dataset=ds, optimizers=(adam_bnb_optim, None)) -result = trainer.train() -print_summary(result) -``` - -``` -Time: 55.95 -Samples/second: 9.15 -GPU memory occupied: 13085 MB. 
-``` - -We can see that we get a similar memory improvement as with Adafactor while keeping the full rolling average of the gradients. Let's repeat the experiment with the full settings: - -```py -training_args = TrainingArguments( - per_device_train_batch_size=1, - gradient_accumulation_steps=4, - gradient_checkpointing=True, - fp16=True, - **default_args, -) - -trainer = Trainer(model=model, args=training_args, train_dataset=ds, optimizers=(adam_bnb_optim, None)) -result = trainer.train() -print_summary(result) -``` - -``` -Time: 49.46 -Samples/second: 10.35 -GPU memory occupied: 5363 MB. -``` - -Again, we get about a 3x memory improvement and even slightly higher throughput as using Adafactor. So we have seen how we can optimize the memory footprint of large models. The following plot summarizes all our experiments: - -![png](https://huggingface.co/datasets/lvwerra/repo-images/raw/main/gpu-memory-savings.png) - -### `_multi_tensor` -pytorch-nightly introduced `torch.optim._multi_tensor` which should significantly speed up the optimizers for situations with lots of small feature tensors. It should eventually become the default, but if you want to experiment with it sooner and don't mind using the bleed-edge, see: https://github.com/huggingface/transformers/issues/9965 - - -## Using 🤗 Accelerate - -So far we have used the [`Trainer`] to run the experiments but a more flexible alternative to that approach is to use 🤗 Accelerate. With 🤗 Accelerate you have full control over the training loop and can essentially write the loop in pure PyTorch with some minor modifications. In turn it allows you to easily scale across different infrastructures such as CPUs, GPUs, TPUs, or distributed multi-GPU setups without changing any code. Let's see what it takes to implement all of the above tweaks in 🤗 Accelerate. We can still use the [`TrainingArguments`] to wrap the training settings: - - -```py -training_args = TrainingArguments( - per_device_train_batch_size=1, - gradient_accumulation_steps=4, - gradient_checkpointing=True, - fp16=True, - **default_args, -) -``` - -The full example training loop with 🤗 Accelerate is only a handful of lines of code long: - - -```py -from accelerate import Accelerator -from torch.utils.data.dataloader import DataLoader - -dataloader = DataLoader(ds, batch_size=training_args.per_device_train_batch_size) - -if training_args.gradient_checkpointing: - model.gradient_checkpointing_enable() - -accelerator = Accelerator(fp16=training_args.fp16) -model, optimizer, dataloader = accelerator.prepare(model, adam_bnb_optim, dataloader) - -model.train() -for step, batch in enumerate(dataloader, start=1): - loss = model(**batch).loss - loss = loss / training_args.gradient_accumulation_steps - accelerator.backward(loss) - if step % training_args.gradient_accumulation_steps == 0: - optimizer.step() - optimizer.zero_grad() -``` - -First we wrap the dataset in a [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader). Then we can enable gradient checkpointing by calling the model's [`~PreTrainedModel.gradient_checkpointing_enable`] method. When we initialize the [`Accelerator`](https://huggingface.co/docs/accelerate/package_reference/accelerator#accelerate.Accelerator) we can specify if we want to use mixed precision training and it will take care of it for us in the [`prepare`] call. 
During the [`prepare`](https://huggingface.co/docs/accelerate/package_reference/accelerator#accelerate.Accelerator.prepare) call the dataloader will also be distributed across workers should we use multiple GPUs. We use the same 8-bit optimizer from the earlier experiments. - -Finally, we can write the main training loop. Note that the `backward` call is handled by 🤗 Accelerate. We can also see how gradient accumulation works: we normalize the loss so we get the average at the end of accumulation and once we have enough steps we run the optimization. Now the question is: does this use the same amount of memory as the previous steps? Let's check: - - -```py ->>> print_gpu_utilization() -GPU memory occupied: 5363 MB. -``` - -Indeed it does. Implementing these optimization techniques with 🤗 Accelerate only takes a handful of lines of code and comes with the benefit of more flexiblity in the training loop. For a full documentation of all features have a look at the [Accelerate documentation](https://huggingface.co/docs/accelerate/index). - -## DataLoader - -One of the important requirements to reach great training speed is the ability to feed the GPU at the maximum speed it can handle. By default everything happens in the main process and it might not be able to read the data from disk fast enough, and thus create a bottleneck, leading to GPU under-utilization. - -- `DataLoader(pin_memory=True, ...)` which ensures that the data gets preloaded into the pinned memory on CPU and typically leads to much faster transfers from CPU to GPU memory. -- `DataLoader(num_workers=4, ...)` - spawn several workers to pre-load data faster - during training watch the GPU utilization stats and if it's far from 100% experiment with raising the number of workers. Of course, the problem could be elsewhere so a very big number of workers won't necessarily lead to a better performance. - -## DeepSpeed ZeRO - -The in-depth details on how to use Deepspeed can be found [here](main_classes/deepspeed). - -First, a quick decision tree: - -1. Model fits onto a single GPU and you have enough space to fit a small batch size - you don't need to use Deepspeed as it'll only slow things down in this use case. -2. Model doesn't fit onto a single GPU or you can't fit a small batch - use DeepSpeed ZeRO + CPU Offload and for much larger models NVMe Offload. - -Now if the decision tree suggested you use DeepSpeed first you need to [install it](main_classes/deepspeed#installation), then follow one of the following guides to create a configuration file and launch DeepSpeed. - -Activation: - -- HF Trainer-based examples: see this [guide](main_classes/deepspeed#deployment-with-one-gpu). -- Custom HF Trainer-based program: Same as above, but pass: - - ```python - TrainingArguments(deepspeed="/path/to/ds_config.json") - ``` -- Deployment in Notebooks: see this [guide](main_classes/deepspeed#deployment-in-notebooks). - -- Custom training loop: This is somewhat complex but you can study how this is implemented in [HF Trainer]( -https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) - simply search for `deepspeed` in the code. - - -## Choice of GPU -Sometimes, even when applying all the above tweaks the throughput on a given GPU might still not be good enough. One easy solution is to change the type of GPU. For example switching from let's say a K80 (which you typically get on Google Colab) to a fancier GPU such as the V100 or A100. 
Although they are more expensive they are usually more cost effective than cheaper GPUs due to their larger memory and faster architecture. - -Now, let's take a step back and discuss what we should optimize for when scaling the training of large models. - -## How to scale - -When we train models there are a two aspects we want to optimize at the same time: - -- Data throughput/training time -- Model performance - -We have seen that each method changes the memory usage and throughput. In general we want to maximize the throughput (samples/second) to minimize the training cost. This is generally achieved by utilizing the GPU as much as possible and thus filling GPU memory to its limit. For example, as mentioned earlier, we only employ gradient accumulation when we want to use a batch size beyond the size of the GPU memory. If the desired batch size fits into memory then there is no reason to apply gradient accumulation which will only slow down training. - -The second objective is model performance. Just because we can does not mean we should use a large batch size. As part of hyperparameter tuning you should determine which batch size yields the best result and then optimize the throughput accordingly. - - -## Efficient Software Prebuilds - -PyTorch's [pip and conda builds](https://pytorch.org/get-started/locally/#start-locally) come prebuit with the cuda toolkit which is enough to run PyTorch, but it is insufficient if you need to build cuda extensions. - -At times it may take an additional effort to pre-build some components, e.g., if you're using libraries like `apex` that don't come pre-compiled. In other situations figuring out how to install the right cuda toolkit system-wide can be complicated. To address these users' needs PyTorch and NVIDIA release a new version of NGC docker container which already comes with everything prebuilt and you just need to install your programs on it and it will run out of the box. - -This approach is also useful if you want to tweak the pytorch source and/or make a new customized build. - -To find the docker image version you want start [here](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/), choose one of the latest monthly releases. Go into the release's notes for the desired release, check that the environment's components are matching your needs (including NVIDIA Driver requirements!) and then at the very top of that document go to the corresponding NGC page. If for some reason you get lost, here is [the index of all PyTorch NGC images](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch). - -Next follow the instructions to download and deploy the docker image. - -## Sparsity - -### Mixture of Experts - -Quite a few of the recent papers reported a 4-5x training speedup and a faster inference by integrating -Mixture of Experts (MoE) into the Transformer models. - -Since it has been discovered that more parameters lead to better performance, this technique allows to increase the number of parameters by an order of magnitude without increasing training costs. - -In this approach every other FFN layer is replaced with a MoE Layer which consists of many experts, with a gated function that trains each expert in a balanced way depending on the input token's position in a sequence. 
- -![MoE Transformer 2x block](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/perf-moe-transformer.png) - -(source: [GLAM](https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html)) - -You can find exhaustive details and comparison tables in the papers listed at the end of this section. - -The main drawback of this approach is that it requires staggering amounts of GPU memory - almost an order of magnitude larger than its dense equivalent. Various distillation and other approaches have been proposed to overcome the much higher memory requirements. - -There is a direct trade-off though: you can use just a few experts with a 2-3x smaller base model instead of dozens or hundreds of experts, leading to a 5x smaller model, and thus increase the training speed moderately while increasing the memory requirements moderately as well. - -Most related papers and implementations are built around TensorFlow/TPUs: - -- [GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding](https://arxiv.org/abs/2006.16668) -- [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961) -- [GLaM: Generalist Language Model (GLaM)](https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html) - -And for PyTorch, DeepSpeed has built one as well: [DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale](https://arxiv.org/abs/2201.05596), [Mixture of Experts](https://www.deepspeed.ai/tutorials/mixture-of-experts/) - blog posts: [1](https://www.microsoft.com/en-us/research/blog/deepspeed-powers-8x-larger-moe-model-training-with-high-performance/), [2](https://www.microsoft.com/en-us/research/publication/scalable-and-efficient-moe-training-for-multitask-multilingual-models/) and specific deployment with large transformer-based natural language generation models: [blog post](https://www.deepspeed.ai/news/2021/12/09/deepspeed-moe-nlg.html), [Megatron-Deepspeed branch](https://github.com/microsoft/Megatron-DeepSpeed/tree/moe-training). - - -## Scaling beyond a single GPU - -For some applications, such as pretraining large language models, applying all the approaches above might still not be fast enough. In this case you want to scale your experiment to several GPUs. - -Another use case for training on many GPUs is if the model does not fit on a single GPU with all the mentioned tricks. There are still more methods we can apply, although life starts to get a bit more complicated. This usually involves some form of pipeline or tensor parallelism where the model itself is distributed across several GPUs. One can also make use of DeepSpeed, which implements some of these parallelism strategies along with some more optimizations to reduce the memory footprint, such as partitioning the optimizer states. You can read more about this in the ["Multi-GPU training" section](perf_train_gpu_many). - -## Using torch.compile - -PyTorch 2.0 introduces a new compile function; you can learn more about it [in their documentation](https://pytorch.org/get-started/pytorch-2.0/). It uses Python’s frame evaluation API to automatically create a graph from existing PyTorch programs. After capturing the graph, different backends can be deployed to lower the graph to an optimized engine. You can choose one of the options below for a performance boost. 
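Before going through the backend list, here is a minimal sketch of what using `torch.compile` looks like in practice. It assumes PyTorch 2.0+ and a CUDA GPU; the checkpoint name is only illustrative:

```py
import torch
from transformers import AutoModelForSequenceClassification

# Illustrative checkpoint; any PyTorch model can be compiled the same way.
model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased").to("cuda")

# Compile with the default TorchInductor backend; the first forward pass triggers
# graph capture and compilation, and subsequent calls reuse the compiled graph.
model = torch.compile(model)
```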
- -`torch.compile` has a growing list of backends, which can be found in [backends.py](https://github.com/pytorch/pytorch/blob/master/torch/_dynamo/optimizations/backends.py) -or `torchdynamo.list_backends()` each of which with its optional dependencies. - -Some of the most commonly used backends are - -**Debugging backends**: -* `dynamo.optimize("eager")` - Uses PyTorch to run the extracted GraphModule. This is quite useful in debugging TorchDynamo issues. -* `dynamo.optimize("aot_eager")` - Uses AotAutograd with no compiler, i.e, just using PyTorch eager for the AotAutograd's extracted forward and backward graphs. This is useful for debugging, and unlikely to give speedups. - -**Training & inference backends**: -* `dynamo.optimize("inductor")` - Uses TorchInductor backend with AotAutograd and cudagraphs by leveraging codegened Triton kernels [Read more](https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747) -* `dynamo.optimize("nvfuser")` - nvFuser with TorchScript. [Read more](https://dev-discuss.pytorch.org/t/tracing-with-primitives-update-1-nvfuser-and-its-primitives/593) -* `dynamo.optimize("aot_nvfuser")` - nvFuser with AotAutograd. [Read more](https://dev-discuss.pytorch.org/t/tracing-with-primitives-update-1-nvfuser-and-its-primitives/593) -* `dynamo.optimize("aot_cudagraphs")` - cudagraphs with AotAutograd. [Read more](https://github.com/pytorch/torchdynamo/pull/757) - -**Inference-only backend**s: -* `dynamo.optimize("ofi")` - Uses Torchscript optimize_for_inference. [Read more](https://pytorch.org/docs/stable/generated/torch.jit.optimize_for_inference.html) -* `dynamo.optimize("fx2trt")` - Uses Nvidia TensorRT for inference optimizations. [Read more](https://github.com/pytorch/TensorRT/blob/master/docsrc/tutorials/getting_started_with_fx_path.rst) -* `dynamo.optimize("onnxrt")` - Uses ONNXRT for inference on CPU/GPU. [Read more](https://onnxruntime.ai/) -* `dynamo.optimize("ipex")` - Uses IPEX for inference on CPU. [Read more](https://github.com/intel/intel-extension-for-pytorch) diff --git a/docs/source/en/perf_train_special.md b/docs/source/en/perf_train_special.md new file mode 100644 index 00000000000000..d98d3e0e32e5a0 --- /dev/null +++ b/docs/source/en/perf_train_special.md @@ -0,0 +1,63 @@ + + +# PyTorch training on Apple silicon + +Previously, training models on a Mac was limited to the CPU only. With the release of PyTorch v1.12, you can take advantage of training models with Apple's silicon GPUs for significantly faster performance and training. This is powered in PyTorch by integrating Apple's Metal Performance Shaders (MPS) as a backend. The [MPS backend](https://pytorch.org/docs/stable/notes/mps.html) implements PyTorch operations as custom Metal shaders and places these modules on a `mps` device. + + + +Some PyTorch operations are not implemented in MPS yet and will throw an error. To avoid this, you should set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU kernels instead (you'll still see a `UserWarning`). + +
+ +If you run into any other errors, please open an issue in the [PyTorch](https://github.com/pytorch/pytorch/issues) repository because the [`Trainer`] only integrates the MPS backend. + +
+ +With the `mps` device set, you can: + +* train larger networks or batch sizes locally +* reduce data retrieval latency because the GPU's unified memory architecture allows direct access to the full memory store +* reduce costs because you don't need to train on cloud-based GPUs or add additional local GPUs + +Get started by making sure you have PyTorch installed. MPS acceleration is supported on macOS 12.3+. + +```bash +pip install torch torchvision torchaudio +``` + +[`TrainingArguments`] uses the `mps` device by default if it's available which means you don't need to explicitly set the device. For example, you can run the [run_glue.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification/run_glue.py) script with the MPS backend automatically enabled without making any changes. + +```diff +export TASK_NAME=mrpc + +python examples/pytorch/text-classification/run_glue.py \ + --model_name_or_path google-bert/bert-base-cased \ + --task_name $TASK_NAME \ +- --use_mps_device \ + --do_train \ + --do_eval \ + --max_seq_length 128 \ + --per_device_train_batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3 \ + --output_dir /tmp/$TASK_NAME/ \ + --overwrite_output_dir +``` + +Backends for [distributed setups](https://pytorch.org/docs/stable/distributed.html#backends) like `gloo` and `nccl` are not supported by the `mps` device which means you can only train on a single GPU with the MPS backend. + +You can learn more about the MPS backend in the [Introducing Accelerated PyTorch Training on Mac](https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/) blog post. diff --git a/docs/source/en/perf_train_special.mdx b/docs/source/en/perf_train_special.mdx deleted file mode 100644 index cb6b8d4090e2ee..00000000000000 --- a/docs/source/en/perf_train_special.mdx +++ /dev/null @@ -1,20 +0,0 @@ - - -# Training on Specialized Hardware - - - - Note: Most of the strategies introduced in the [single GPU section](perf_train_gpu_one) (such as mixed precision training or gradient accumulation) and [multi-GPU section](perf_train_gpu_many) are generic and apply to training models in general so make sure to have a look at it before diving into this section. - - - -This document will be completed soon with information on how to train on specialized hardware. diff --git a/docs/source/en/perf_train_tpu.mdx b/docs/source/en/perf_train_tpu.mdx deleted file mode 100644 index bc37e00877c2c6..00000000000000 --- a/docs/source/en/perf_train_tpu.mdx +++ /dev/null @@ -1,20 +0,0 @@ - - -# Training on TPUs - - - - Note: Most of the strategies introduced in the [single GPU section](perf_train_gpu_one) (such as mixed precision training or gradient accumulation) and [multi-GPU section](perf_train_gpu_many) are generic and apply to training models in general so make sure to have a look at it before diving into this section. - - - -This document will be completed soon with information on how to train on TPUs. 
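Before launching a training run with the Apple silicon guide above, it can be useful to verify that the MPS backend is actually available on your machine. The following is a small sketch (assuming PyTorch 1.12+ on macOS 12.3+), not part of the original guide:

```py
import torch

# Check that this PyTorch build ships the MPS backend and that the device is usable.
if torch.backends.mps.is_available():
    device = torch.device("mps")
    x = torch.ones(2, 3, device=device)
    print(x.device)  # mps:0
else:
    print("MPS not available; training will fall back to the CPU.")
```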
diff --git a/docs/source/en/perf_train_tpu_tf.mdx b/docs/source/en/perf_train_tpu_tf.md similarity index 98% rename from docs/source/en/perf_train_tpu_tf.mdx rename to docs/source/en/perf_train_tpu_tf.md index 344031f6474e0d..011421b629c0ba 100644 --- a/docs/source/en/perf_train_tpu_tf.mdx +++ b/docs/source/en/perf_train_tpu_tf.md @@ -7,6 +7,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Training on TPU with TensorFlow diff --git a/docs/source/en/performance.md b/docs/source/en/performance.md new file mode 100644 index 00000000000000..ccd78d326d52e3 --- /dev/null +++ b/docs/source/en/performance.md @@ -0,0 +1,73 @@ + + +# Performance and Scalability + +Training large transformer models and deploying them to production present various challenges. +During training, the model may require more GPU memory than available or exhibit slow training speed. In the deployment +phase, the model can struggle to handle the required throughput in a production environment. + +This documentation aims to assist you in overcoming these challenges and finding the optimal setting for your use-case. +The guides are divided into training and inference sections, as each comes with different challenges and solutions. +Within each section you'll find separate guides for different hardware configurations, such as single GPU vs. multi-GPU +for training or CPU vs. GPU for inference. + +Use this document as your starting point to navigate further to the methods that match your scenario. + +## Training + +Training large transformer models efficiently requires an accelerator such as a GPU or TPU. The most common case is where +you have a single GPU. The methods that you can apply to improve training efficiency on a single GPU extend to other setups +such as multiple GPU. However, there are also techniques that are specific to multi-GPU or CPU training. We cover them in +separate sections. + +* [Methods and tools for efficient training on a single GPU](perf_train_gpu_one): start here to learn common approaches that can help optimize GPU memory utilization, speed up the training, or both. +* [Multi-GPU training section](perf_train_gpu_many): explore this section to learn about further optimization methods that apply to a multi-GPU settings, such as data, tensor, and pipeline parallelism. +* [CPU training section](perf_train_cpu): learn about mixed precision training on CPU. +* [Efficient Training on Multiple CPUs](perf_train_cpu_many): learn about distributed CPU training. +* [Training on TPU with TensorFlow](perf_train_tpu_tf): if you are new to TPUs, refer to this section for an opinionated introduction to training on TPUs and using XLA. +* [Custom hardware for training](perf_hardware): find tips and tricks when building your own deep learning rig. +* [Hyperparameter Search using Trainer API](hpo_train) + +## Inference + +Efficient inference with large models in a production environment can be as challenging as training them. In the following +sections we go through the steps to run inference on CPU and single/multi-GPU setups. 
+ +* [Inference on a single CPU](perf_infer_cpu) +* [Inference on a single GPU](perf_infer_gpu_one) +* [Multi-GPU inference](perf_infer_gpu_one) +* [XLA Integration for TensorFlow Models](tf_xla) + + +## Training and inference + +Here you'll find techniques, tips and tricks that apply whether you are training a model, or running inference with it. + +* [Instantiating a big model](big_models) +* [Troubleshooting performance issues](debugging) + +## Contribute + +This document is far from being complete and a lot more needs to be added, so if you have additions or corrections to +make please don't hesitate to open a PR or if you aren't sure start an Issue and we can discuss the details there. + +When making contributions that A is better than B, please try to include a reproducible benchmark and/or a link to the +source of that information (unless it comes directly from you). diff --git a/docs/source/en/performance.mdx b/docs/source/en/performance.mdx deleted file mode 100644 index 6c68e9b2acce60..00000000000000 --- a/docs/source/en/performance.mdx +++ /dev/null @@ -1,92 +0,0 @@ - - -# Performance and Scalability - -Training larger and larger transformer models and deploying them to production comes with a range of challenges. During training your model can require more GPU memory than is available or be very slow to train and when you deploy it for inference it can be overwhelmed with the throughput that is required in the production environment. This documentation is designed to help you navigate these challenges and find the best setting for your use-case. We split the guides into training and inference as they come with different challenges and solutions. Then within each of them we have separate guides for different kinds of hardware setting (e.g. single vs. multi-GPU for training or CPU vs. GPU for infrence). - -![perf_overview](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/perf_overview.png) - -This document serves as an overview and entry point for the methods that could be useful for your scenario. - -## Training - -Training transformer models efficiently requires an accelerator such as a GPU or TPU. The most common case is where you only have a single GPU, but there is also a section about multi-GPU and CPU training (with more coming soon). - - - - Note: Most of the strategies introduced in the single GPU sections (such as mixed precision training or gradient accumulation) are generic and apply to training models in general so make sure to have a look at it before diving into the following sections such as multi-GPU or CPU training. - - - -### Single GPU - -Training large models on a single GPU can be challenging but there are a number of tools and methods that make it feasible. In this section methods such as mixed precision training, gradient accumulation and checkpointing, efficient optimizers, as well as strategies to determine the best batch size are discussed. - -[Go to single GPU training section](perf_train_gpu_one) - -### Multi-GPU - -In some cases training on a single GPU is still too slow or won't fit the large model. Moving to a multi-GPU setup is the logical step, but training on multiple GPUs at once comes with new decisions: does each GPU have a full copy of the model or is the model itself also distributed? In this section we look at data, tensor, and pipeline parallism. 
- -[Go to multi-GPU training section](perf_train_gpu_many) - -### CPU - - -[Go to CPU training section](perf_train_cpu) - - -### TPU - -[_Coming soon_](perf_train_tpu) - -### Specialized Hardware - -[_Coming soon_](perf_train_special) - -## Inference - -Efficient inference with large models in a production environment can be as challenging as training them. In the following sections we go through the steps to run inference on CPU and single/multi-GPU setups. - -### CPU - -[Go to CPU inference section](perf_infer_cpu) - -### Single GPU - -[Go to single GPU inference section](perf_infer_gpu_one) - -### Multi-GPU - -[Go to multi-GPU inference section](perf_infer_gpu_many) - -### Specialized Hardware - -[_Coming soon_](perf_infer_special) - -## Hardware - -In the hardware section you can find tips and tricks when building your own deep learning rig. - -[Go to hardware section](perf_hardware) - - -## Contribute - -This document is far from being complete and a lot more needs to be added, so if you have additions or corrections to make please don't hesitate to open a PR or if you aren't sure start an Issue and we can discuss the details there. - -When making contributions that A is better than B, please try to include a reproducible benchmark and/or a link to the source of that information (unless it comes directly from you). diff --git a/docs/source/en/perplexity.mdx b/docs/source/en/perplexity.md similarity index 93% rename from docs/source/en/perplexity.mdx rename to docs/source/en/perplexity.md index 01f861c99c5ea2..7555619fe488d2 100644 --- a/docs/source/en/perplexity.mdx +++ b/docs/source/en/perplexity.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Perplexity of fixed-length models @@ -71,7 +75,7 @@ Let's demonstrate this process with GPT-2. from transformers import GPT2LMHeadModel, GPT2TokenizerFast device = "cuda" -model_id = "gpt2-large" +model_id = "openai-community/gpt2-large" model = GPT2LMHeadModel.from_pretrained(model_id).to(device) tokenizer = GPT2TokenizerFast.from_pretrained(model_id) ``` @@ -115,11 +119,10 @@ for begin_loc in tqdm(range(0, seq_len, stride)): with torch.no_grad(): outputs = model(input_ids, labels=target_ids) - # loss is calculated using CrossEntropyLoss which averages over input tokens. - # Multiply it with trg_len to get the summation instead of average. - # We will take average over all the tokens to get the true average - # in the last step of this example. - neg_log_likelihood = outputs.loss * trg_len + # loss is calculated using CrossEntropyLoss which averages over valid labels + # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels + # to the left by 1. 
+ neg_log_likelihood = outputs.loss nlls.append(neg_log_likelihood) @@ -127,14 +130,14 @@ for begin_loc in tqdm(range(0, seq_len, stride)): if end_loc == seq_len: break -ppl = torch.exp(torch.stack(nlls).sum() / end_loc) +ppl = torch.exp(torch.stack(nlls).mean()) ``` Running this with the stride length equal to the max input length is equivalent to the suboptimal, non-sliding-window strategy we discussed above. The smaller the stride, the more context the model will have in making each prediction, and the better the reported perplexity will typically be. -When we run the above with `stride = 1024`, i.e. no overlap, the resulting PPL is `19.64`, which is about the same +When we run the above with `stride = 1024`, i.e. no overlap, the resulting PPL is `19.44`, which is about the same as the `19.93` reported in the GPT-2 paper. By using `stride = 512` and thereby employing our striding window -strategy, this jumps down to `16.44`. This is not only a more favorable score, but is calculated in a way that is +strategy, this jumps down to `16.45`. This is not only a more favorable score, but is calculated in a way that is closer to the true autoregressive decomposition of a sequence likelihood. diff --git a/docs/source/en/philosophy.mdx b/docs/source/en/philosophy.md similarity index 95% rename from docs/source/en/philosophy.mdx rename to docs/source/en/philosophy.md index 7788d7836236c9..628cb39bbb3300 100644 --- a/docs/source/en/philosophy.mdx +++ b/docs/source/en/philosophy.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Philosophy @@ -60,7 +64,7 @@ A few other goals: The library is built around three types of classes for each model: -- **Model classes** can be PyTorch models ([torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)), Keras models ([tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model)) or JAX/Flax models ([flax.linen.Module](https://flax.readthedocs.io/en/latest/api_reference/flax.linen.html)) that work with the pretrained weights provided in the library. +- **Model classes** can be PyTorch models ([torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)), Keras models ([tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model)) or JAX/Flax models ([flax.linen.Module](https://flax.readthedocs.io/en/latest/api_reference/flax.linen/module.html)) that work with the pretrained weights provided in the library. - **Configuration classes** store the hyperparameters required to build a model (such as the number of layers and hidden size). You don't always need to instantiate these yourself. In particular, if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model). - **Preprocessing classes** convert the raw data into a format accepted by the model. 
A [tokenizer](main_classes/tokenizer) stores the vocabulary for each model and provide methods for encoding and decoding strings in a list of token embedding indices to be fed to a model. [Image processors](main_classes/image_processor) preprocess vision inputs, [feature extractors](main_classes/feature_extractor) preprocess audio inputs, and a [processor](main_classes/processors) handles multimodal inputs. diff --git a/docs/source/en/pipeline_tutorial.mdx b/docs/source/en/pipeline_tutorial.md similarity index 66% rename from docs/source/en/pipeline_tutorial.mdx rename to docs/source/en/pipeline_tutorial.md index 00dceeb4f24315..e3e4e2e5cb6b7e 100644 --- a/docs/source/en/pipeline_tutorial.mdx +++ b/docs/source/en/pipeline_tutorial.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Pipelines for inference @@ -26,33 +30,44 @@ Take a look at the [`pipeline`] documentation for a complete list of supported t ## Pipeline usage -While each task has an associated [`pipeline`], it is simpler to use the general [`pipeline`] abstraction which contains all the task-specific pipelines. The [`pipeline`] automatically loads a default model and a preprocessing class capable of inference for your task. +While each task has an associated [`pipeline`], it is simpler to use the general [`pipeline`] abstraction which contains +all the task-specific pipelines. The [`pipeline`] automatically loads a default model and a preprocessing class capable +of inference for your task. Let's take the example of using the [`pipeline`] for automatic speech recognition (ASR), or +speech-to-text. -1. Start by creating a [`pipeline`] and specify an inference task: + +1. Start by creating a [`pipeline`] and specify the inference task: ```py >>> from transformers import pipeline ->>> generator = pipeline(task="automatic-speech-recognition") +>>> transcriber = pipeline(task="automatic-speech-recognition") ``` -2. Pass your input text to the [`pipeline`]: +2. Pass your input to the [`pipeline`]. In the case of speech recognition, this is an audio input file: ```py ->>> generator("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +>>> transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") {'text': 'I HAVE A DREAM BUT ONE DAY THIS NATION WILL RISE UP LIVE UP THE TRUE MEANING OF ITS TREES'} ``` -Not the result you had in mind? Check out some of the [most downloaded automatic speech recognition models](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads) on the Hub to see if you can get a better transcription. -Let's try [openai/whisper-large](https://huggingface.co/openai/whisper-large): +Not the result you had in mind? Check out some of the [most downloaded automatic speech recognition models](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=trending) +on the Hub to see if you can get a better transcription. + +Let's try the [Whisper large-v2](https://huggingface.co/openai/whisper-large) model from OpenAI. 
Whisper was released +2 years later than Wav2Vec2, and was trained on close to 10x more data. As such, it beats Wav2Vec2 on most downstream +benchmarks. It also has the added benefit of predicting punctuation and casing, neither of which are possible with +Wav2Vec2. + +Let's give it a try here to see how it performs: ```py ->>> generator = pipeline(model="openai/whisper-large") ->>> generator("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +>>> transcriber = pipeline(model="openai/whisper-large-v2") +>>> transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") {'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'} ``` -Now this result looks more accurate! +Now this result looks more accurate! For a deep-dive comparison on Wav2Vec2 vs Whisper, refer to the [Audio Transformers Course](https://huggingface.co/learn/audio-course/chapter5/asr_models). We really encourage you to check out the Hub for models in different languages, models specialized in your field, and more. You can check out and compare model results directly from your browser on the Hub to see if it fits or handles corner cases better than other ones. @@ -61,7 +76,7 @@ And if you don't find a model for your use case, you can always start [training] If you have several inputs, you can pass your input as a list: ```py -generator( +transcriber( [ "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac", "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac", @@ -69,22 +84,22 @@ generator( ) ``` -If you want to iterate over a whole dataset, or want to use it for inference in a webserver, check out dedicated parts - -[Using pipelines on a dataset](#using-pipelines-on-a-dataset) - -[Using pipelines for a webserver](./pipeline_webserver) +Pipelines are great for experimentation as switching from one model to another is trivial; however, there are some ways to optimize them for larger workloads than experimentation. See the following guides that dive into iterating over whole datasets or using pipelines in a webserver: +of the docs: +* [Using pipelines on a dataset](#using-pipelines-on-a-dataset) +* [Using pipelines for a webserver](./pipeline_webserver) ## Parameters [`pipeline`] supports many parameters; some are task specific, and some are general to all pipelines. -In general you can specify parameters anywhere you want: +In general, you can specify parameters anywhere you want: ```py -generator(model="openai/whisper-large", my_parameter=1) -out = generate(...) # This will use `my_parameter=1`. -out = generate(..., my_parameter=2) # This will override and use `my_parameter=2`. -out = generate(...) # This will go back to using `my_parameter=1`. +transcriber = pipeline(model="openai/whisper-large-v2", my_parameter=1) + +out = transcriber(...) # This will use `my_parameter=1`. +out = transcriber(..., my_parameter=2) # This will override and use `my_parameter=2`. +out = transcriber(...) # This will go back to using `my_parameter=1`. ``` Let's check out 3 important ones: @@ -95,14 +110,21 @@ If you use `device=n`, the pipeline automatically puts the model on the specifie This will work regardless of whether you are using PyTorch or Tensorflow. 
```py -generator(model="openai/whisper-large", device=0) +transcriber = pipeline(model="openai/whisper-large-v2", device=0) ``` -If the model is too large for a single GPU, you can set `device_map="auto"` to allow 🤗 [Accelerate](https://huggingface.co/docs/accelerate) to automatically determine how to load and store the model weights. +If the model is too large for a single GPU and you are using PyTorch, you can set `device_map="auto"` to automatically +determine how to load and store the model weights. Using the `device_map` argument requires the 🤗 [Accelerate](https://huggingface.co/docs/accelerate) +package: + +```bash +pip install --upgrade accelerate +``` + +The following code automatically loads and stores model weights across devices: ```py -#!pip install accelerate -generator(model="openai/whisper-large", device_map="auto") +transcriber = pipeline(model="openai/whisper-large-v2", device_map="auto") ``` Note that if `device_map="auto"` is passed, there is no need to add the argument `device=device` when instantiating your `pipeline` as you may encounter some unexpected behavior! @@ -114,12 +136,12 @@ By default, pipelines will not batch inference for reasons explained in detail [ But if it works in your use case, you can use: ```py -generator(model="openai/whisper-large", device=0, batch_size=2) -audio_filenames = [f"audio_{i}.flac" for i in range(10)] -texts = generator(audio_filenames) +transcriber = pipeline(model="openai/whisper-large-v2", device=0, batch_size=2) +audio_filenames = [f"https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/{i}.flac" for i in range(1, 5)] +texts = transcriber(audio_filenames) ``` -This runs the pipeline on the 10 provided audio files, but it will pass them in batches of 2 +This runs the pipeline on the 4 provided audio files, but it will pass them in batches of 2 to the model (which is on a GPU, where batching is more likely to help) without requiring any further code from you. The output should always match what you would have received without batching. It is only meant as a way to help you get more speed out of a pipeline. @@ -132,18 +154,23 @@ For instance, the [`transformers.AutomaticSpeechRecognitionPipeline.__call__`] m ```py ->>> # Not using whisper, as it cannot provide timestamps. 
->>> generator = pipeline(model="facebook/wav2vec2-large-960h-lv60-self", return_timestamps="word") ->>> generator("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") -{'text': 'I HAVE A DREAM BUT ONE DAY THIS NATION WILL RISE UP AND LIVE OUT THE TRUE MEANING OF ITS CREED', 'chunks': [{'text': 'I', 'timestamp': (1.22, 1.24)}, {'text': 'HAVE', 'timestamp': (1.42, 1.58)}, {'text': 'A', 'timestamp': (1.66, 1.68)}, {'text': 'DREAM', 'timestamp': (1.76, 2.14)}, {'text': 'BUT', 'timestamp': (3.68, 3.8)}, {'text': 'ONE', 'timestamp': (3.94, 4.06)}, {'text': 'DAY', 'timestamp': (4.16, 4.3)}, {'text': 'THIS', 'timestamp': (6.36, 6.54)}, {'text': 'NATION', 'timestamp': (6.68, 7.1)}, {'text': 'WILL', 'timestamp': (7.32, 7.56)}, {'text': 'RISE', 'timestamp': (7.8, 8.26)}, {'text': 'UP', 'timestamp': (8.38, 8.48)}, {'text': 'AND', 'timestamp': (10.08, 10.18)}, {'text': 'LIVE', 'timestamp': (10.26, 10.48)}, {'text': 'OUT', 'timestamp': (10.58, 10.7)}, {'text': 'THE', 'timestamp': (10.82, 10.9)}, {'text': 'TRUE', 'timestamp': (10.98, 11.18)}, {'text': 'MEANING', 'timestamp': (11.26, 11.58)}, {'text': 'OF', 'timestamp': (11.66, 11.7)}, {'text': 'ITS', 'timestamp': (11.76, 11.88)}, {'text': 'CREED', 'timestamp': (12.0, 12.38)}]} +>>> transcriber = pipeline(model="openai/whisper-large-v2", return_timestamps=True) +>>> transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.', 'chunks': [{'timestamp': (0.0, 11.88), 'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its'}, {'timestamp': (11.88, 12.38), 'text': ' creed.'}]} ``` -As you can see, the model inferred the text and also outputted **when** the various words were pronounced -in the sentence. +As you can see, the model inferred the text and also outputted **when** the various sentences were pronounced. There are many parameters available for each task, so check out each task's API reference to see what you can tinker with! -For instance, the [`~transformers.AutomaticSpeechRecognitionPipeline`] has a `chunk_length_s` parameter which is helpful for working on really long audio files (for example, subtitling entire movies or hour-long videos) that a model typically cannot handle on its own. - +For instance, the [`~transformers.AutomaticSpeechRecognitionPipeline`] has a `chunk_length_s` parameter which is helpful +for working on really long audio files (for example, subtitling entire movies or hour-long videos) that a model typically +cannot handle on its own: + +```python +>>> transcriber = pipeline(model="openai/whisper-large-v2", chunk_length_s=30, return_timestamps=True) +>>> transcriber("https://huggingface.co/datasets/sanchit-gandhi/librispeech_long/resolve/main/audio.wav") +{'text': " Chapter 16. I might have told you of the beginning of this liaison in a few lines, but I wanted you to see every step by which we came. I, too, agree to whatever Marguerite wished, Marguerite to be unable to live apart from me. It was the day after the evening... +``` If you can't find a parameter that would really help you out, feel free to [request it](https://github.com/huggingface/transformers/issues/new?assignees=&labels=feature&template=feature-request.yml)! 
@@ -158,7 +185,7 @@ def data(): yield f"My example {i}" -pipe = pipeline(model="gpt2", device=0) +pipe = pipeline(model="openai-community/gpt2", device=0) generated_characters = 0 for out in pipe(data()): generated_characters += len(out[0]["generated_text"]) @@ -200,7 +227,7 @@ page. Using a [`pipeline`] for vision tasks is practically identical. -Specify your task and pass your image to the classifier. The image can be a link or a local path to the image. For example, what species of cat is shown below? +Specify your task and pass your image to the classifier. The image can be a link, a local path or a base64-encoded image. For example, what species of cat is shown below? ![pipeline-cat-chonk](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg) @@ -247,7 +274,7 @@ For example, if you use this [invoice image](https://huggingface.co/spaces/impir ... image="https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/invoice.png", ... question="What is the invoice number?", ... ) -[{'score': 0.42514941096305847, 'answer': 'us-001', 'start': 16, 'end': 16}] +[{'score': 0.42515, 'answer': 'us-001', 'start': 16, 'end': 16}] ``` @@ -287,4 +314,4 @@ pipe = pipeline(model="facebook/opt-1.3b", device_map="auto", model_kwargs={"loa output = pipe("This is a cool example!", do_sample=True, top_p=0.95) ``` -Note that you can replace the checkpoint with any of the Hugging Face model that supports large model loading such as BLOOM! \ No newline at end of file +Note that you can replace the checkpoint with any of the Hugging Face model that supports large model loading such as BLOOM! diff --git a/docs/source/en/pipeline_webserver.mdx b/docs/source/en/pipeline_webserver.md similarity index 92% rename from docs/source/en/pipeline_webserver.mdx rename to docs/source/en/pipeline_webserver.md index f62985ec26b5bb..17b5fbd958dd30 100644 --- a/docs/source/en/pipeline_webserver.mdx +++ b/docs/source/en/pipeline_webserver.md @@ -1,3 +1,7 @@ + + # Using pipelines for a webserver @@ -44,7 +48,7 @@ async def homepage(request): async def server_loop(q): - pipe = pipeline(model="bert-base-uncased") + pipe = pipeline(model="google-bert/bert-base-uncased") while True: (string, response_q) = await q.get() out = pipe(string) @@ -83,6 +87,13 @@ of the model on the webserver. This way, no unnecessary RAM is being used. Then the queuing mechanism allows you to do fancy stuff like maybe accumulating a few items before inferring to use dynamic batching: + + +The code sample below is intentionally written like pseudo-code for readability. +Do not run this without checking if it makes sense for your system resources! + + + ```py (string, rq) = await q.get() strings = [] @@ -100,11 +111,7 @@ for rq, out in zip(queues, outs): await rq.put(out) ``` - -Do not activate this without checking it makes sense for your load! - - -The proposed code is optimized for readability, not for being the best code. +Again, the proposed code is optimized for readability, not for being the best code. First of all, there's no batch size limit which is usually not a great idea. 
Next, the timeout is reset on every queue fetch, meaning you could wait much more than 1ms before running the inference (delaying the first request diff --git a/docs/source/en/pr_checks.mdx b/docs/source/en/pr_checks.md similarity index 57% rename from docs/source/en/pr_checks.mdx rename to docs/source/en/pr_checks.md index 6d7ea5d4d407a0..266cc1ca68d44b 100644 --- a/docs/source/en/pr_checks.mdx +++ b/docs/source/en/pr_checks.md @@ -12,6 +12,10 @@ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Checks on a Pull Request @@ -24,7 +28,7 @@ When you open a pull request on 🤗 Transformers, a fair number of checks will In this document, we will take a stab at explaining what those various checks are and the reason behind them, as well as how to debug them locally if one of them fails on your PR. -Note that they all require you to have a dev install: +Note that, ideally, they require you to have a dev install: ```bash pip install transformers[dev] @@ -36,7 +40,18 @@ or for an editable install: pip install -e .[dev] ``` -inside the Transformers repo. +inside the Transformers repo. Since the number of optional dependencies of Transformers has grown a lot, it's possible you don't manage to get all of them. If the dev install fails, make sure to install the Deep Learning framework you are working with (PyTorch, TensorFlow and/or Flax) then do + +```bash +pip install transformers[quality] +``` + +or for an editable install: + +```bash +pip install -e .[quality] +``` + ## Tests @@ -109,6 +124,7 @@ This checks that: - The translations of the READMEs and the index of the doc have the same model list as the main README (performed by `utils/check_copies.py`) - The auto-generated tables in the documentation are up to date (performed by `utils/check_table.py`) - The library has all objects available even if not all optional dependencies are installed (performed by `utils/check_dummies.py`) +- All docstrings properly document the arguments in the signature of the object (performed by `utils/check_docstrings.py`) Should this check fail, the first two items require manual fixing, the last four can be fixed automatically for you by running the command @@ -127,3 +143,58 @@ Additional checks concern PRs that add new models, mainly that: - All checkpoints used actually exist on the Hub --> + +### Check copies + +Since the Transformers library is very opinionated with respect to model code, and each model should fully be implemented in a single file without relying on other models, we have added a mechanism that checks whether a copy of the code of a layer of a given model stays consistent with the original. This way, when there is a bug fix, we can see all other impacted models and choose to trickle down the modification or break the copy. + + + +If a file is a full copy of another file, you should register it in the constant `FULL_COPIES` of `utils/check_copies.py`. + + + +This mechanism relies on comments of the form `# Copied from xxx`. The `xxx` should contain the whole path to the class of function which is being copied below. 
For instance, `RobertaSelfOutput` is a direct copy of the `BertSelfOutput` class, so you can see [here](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L289) it has a comment: + +```py +# Copied from transformers.models.bert.modeling_bert.BertSelfOutput +``` + +Note that instead of applying this to a whole class, you can apply it to the relevant methods that are copied. For instance [here](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L598) you can see how `RobertaPreTrainedModel._init_weights` is copied from the same method in `BertPreTrainedModel` with the comment: + +```py +# Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights +``` + +Sometimes the copy is exactly the same except for names: for instance, in `RobertaAttention`, we use `RobertaSelfAttention` instead of `BertSelfAttention`, but other than that, the code is exactly the same. This is why `# Copied from` supports simple string replacements with the following syntax: `Copied from xxx with foo->bar`. This means the code is copied with all instances of `foo` being replaced by `bar`. You can see how it is used [here](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L304C1-L304C86) in `RobertaAttention` with the comment: + +```py +# Copied from transformers.models.bert.modeling_bert.BertAttention with Bert->Roberta +``` + +Note that there shouldn't be any spaces around the arrow (unless that space is part of the pattern to replace, of course). + +You can add several patterns separated by a comma. For instance, here `CamembertForMaskedLM` is a direct copy of `RobertaForMaskedLM` with two replacements: `Roberta` to `Camembert` and `ROBERTA` to `CAMEMBERT`. You can see [here](https://github.com/huggingface/transformers/blob/15082a9dc6950ecae63a0d3e5060b2fc7f15050a/src/transformers/models/camembert/modeling_camembert.py#L929) that this is done with the comment: + +```py +# Copied from transformers.models.roberta.modeling_roberta.RobertaForMaskedLM with Roberta->Camembert, ROBERTA->CAMEMBERT +``` + +If the order matters (because one of the replacements might conflict with a previous one), the replacements are executed from left to right. + + + +If the replacements change the formatting (if you replace a short name by a very long name for instance), the copy is checked after applying the auto-formatter. + + + +Another way, when the patterns are just different casings of the same replacement (with uppercased and lowercased variants), is to add the option `all-casing`. 
[Here](https://github.com/huggingface/transformers/blob/15082a9dc6950ecae63a0d3e5060b2fc7f15050a/src/transformers/models/mobilebert/modeling_mobilebert.py#L1237) is an example in `MobileBertForSequenceClassification` with the comment: + +```py +# Copied from transformers.models.bert.modeling_bert.BertForSequenceClassification with Bert->MobileBert all-casing +``` + +In this case, the code is copied from `BertForSequenceClassification` by replacing: +- `Bert` by `MobileBert` (for instance when using `MobileBertModel` in the init) +- `bert` by `mobilebert` (for instance when defining `self.mobilebert`) +- `BERT` by `MOBILEBERT` (in the constant `MOBILEBERT_INPUTS_DOCSTRING`) diff --git a/docs/source/en/preprocessing.mdx b/docs/source/en/preprocessing.md similarity index 90% rename from docs/source/en/preprocessing.mdx rename to docs/source/en/preprocessing.md index 9896b6898931dc..82381057d3742b 100644 --- a/docs/source/en/preprocessing.mdx +++ b/docs/source/en/preprocessing.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Preprocess @@ -18,7 +22,7 @@ Before you can train a model on a dataset, it needs to be preprocessed into the * Text, use a [Tokenizer](./main_classes/tokenizer) to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors. * Speech and audio, use a [Feature extractor](./main_classes/feature_extractor) to extract sequential features from audio waveforms and convert them into tensors. -* Image inputs use a [ImageProcessor](./main_classes/image) to convert images into tensors. +* Image inputs use a [ImageProcessor](./main_classes/image_processor) to convert images into tensors. * Multimodal inputs, use a [Processor](./main_classes/processors) to combine a tokenizer and a feature extractor or image processor. @@ -41,7 +45,7 @@ The main tool for preprocessing textual data is a [tokenizer](main_classes/token -If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer. This ensures the text is split the same way as the pretraining corpus, and uses the same corresponding tokens-to-index (usually referrred to as the *vocab*) during pretraining. +If you plan on using a pretrained model, it's important to use the associated pretrained tokenizer. This ensures the text is split the same way as the pretraining corpus, and uses the same corresponding tokens-to-index (usually referred to as the *vocab*) during pretraining. 
@@ -50,7 +54,7 @@ Get started by loading a pretrained tokenizer with the [`AutoTokenizer.from_pret ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") ``` Then pass your text to the tokenizer: @@ -58,8 +62,8 @@ Then pass your text to the tokenizer: ```py >>> encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.") >>> print(encoded_input) -{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102], - 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], +{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102], + 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]} ``` @@ -89,14 +93,14 @@ If there are several sentences you want to preprocess, pass them as a list to th ... ] >>> encoded_inputs = tokenizer(batch_sentences) >>> print(encoded_inputs) -{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102], - [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], - [101, 1327, 1164, 5450, 23434, 136, 102]], - 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], - [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], - [0, 0, 0, 0, 0, 0, 0]], - 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], - [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], +{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102], + [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], + [101, 1327, 1164, 5450, 23434, 136, 102]], + 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0]], + 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]]} ``` @@ -114,14 +118,14 @@ Set the `padding` parameter to `True` to pad the shorter sequences in the batch ... ] >>> encoded_input = tokenizer(batch_sentences, padding=True) >>> print(encoded_input) -{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], - [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], - [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], - 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], - [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], - [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], - 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], - [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], +{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], + [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], + [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], + 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], + 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], + [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]} ``` @@ -141,14 +145,14 @@ Set the `truncation` parameter to `True` to truncate a sequence to the maximum l ... 
] >>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True) >>> print(encoded_input) -{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], - [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], - [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], - 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], - [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], - [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], - 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], - [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], +{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], + [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], + [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], + 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], + 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], + [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]} ``` @@ -177,10 +181,10 @@ Set the `return_tensors` parameter to either `pt` for PyTorch, or `tf` for Tenso >>> print(encoded_input) {'input_ids': tensor([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], - [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]]), + [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], - [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])} @@ -199,11 +203,11 @@ Set the `return_tensors` parameter to either `pt` for PyTorch, or `tf` for Tenso array([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], - dtype=int32)>, + dtype=int32)>, 'token_type_ids': , + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': + +Different pipelines support tokenizer arguments in their `__call__()` differently. `text-2-text-generation` pipelines support (i.e. pass on) +only `truncation`. `text-generation` pipelines support `max_length`, `truncation`, `padding` and `add_special_tokens`. +In `fill-mask` pipelines, tokenizer arguments can be passed in the `tokenizer_kwargs` argument (dictionary). + + ## Audio For audio tasks, you'll need a [feature extractor](main_classes/feature_extractor) to prepare your dataset for the model. The feature extractor is designed to extract features from raw audio data, and convert them into tensors. 
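Before working through the full MInDS-14 example below, here is a minimal sketch of what a feature extractor call looks like. This is only an illustration: it assumes the `facebook/wav2vec2-base` checkpoint used later in this section and a dummy one-second waveform of silence at 16kHz.

```py
>>> import numpy as np
>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
>>> dummy_waveform = np.zeros(16000, dtype=np.float32)  # one second of silence sampled at 16kHz
>>> features = feature_extractor(dummy_waveform, sampling_rate=16000, return_tensors="pt")
>>> features["input_values"].shape
torch.Size([1, 16000])
```

The rest of this section walks through the same steps on a real dataset.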
-Load the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use a feature extractor with audio datasets: +Load the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub) for more details on how to load a dataset) to see how you can use a feature extractor with audio datasets: ```py >>> from datasets import load_dataset, Audio @@ -240,7 +250,7 @@ This returns three items: * `path` points to the location of the audio file. * `sampling_rate` refers to how many data points in the speech signal are measured per second. -For this tutorial, you'll use the [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) model. Take a look at the model card, and you'll learn Wav2Vec2 is pretrained on 16kHz sampled speech audio. It is important your audio data's sampling rate matches the sampling rate of the dataset used to pretrain the model. If your data's sampling rate isn't the same, then you need to resample your data. +For this tutorial, you'll use the [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) model. Take a look at the model card, and you'll learn Wav2Vec2 is pretrained on 16kHz sampled speech audio. It is important your audio data's sampling rate matches the sampling rate of the dataset used to pretrain the model. If your data's sampling rate isn't the same, then you need to resample your data. 1. Use 🤗 Datasets' [`~datasets.Dataset.cast_column`] method to upsample the sampling rate to 16kHz: @@ -302,7 +312,7 @@ Create a function to preprocess the dataset so the audio samples are the same le ... return inputs ``` -Apply the `preprocess_function` to the the first few examples in the dataset: +Apply the `preprocess_function` to the first few examples in the dataset: ```py >>> processed_dataset = preprocess_function(dataset[:5]) @@ -336,7 +346,7 @@ You can use any library you like for image augmentation. For image preprocessing -Load the [food101](https://huggingface.co/datasets/food101) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use an image processor with computer vision datasets: +Load the [food101](https://huggingface.co/datasets/food101) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub) for more details on how to load a dataset) to see how you can use an image processor with computer vision datasets: @@ -350,7 +360,7 @@ Use 🤗 Datasets `split` parameter to only load a small sample from the trainin >>> dataset = load_dataset("food101", split="train[:100]") ``` -Next, take a look at the image with 🤗 Datasets [`Image`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=image#datasets.Image) feature: +Next, take a look at the image with 🤗 Datasets [`Image`](https://huggingface.co/docs/datasets/package_reference/main_classes?highlight=image#datasets.Image) feature: ```py >>> dataset[0]["image"] @@ -387,7 +397,7 @@ width are expected, for others only the `shortest_edge` is defined. >>> _transforms = Compose([RandomResizedCrop(size), ColorJitter(brightness=0.5, hue=0.5)]) ``` -2. The model accepts [`pixel_values`](model_doc/visionencoderdecoder#transformers.VisionEncoderDecoderModel.forward.pixel_values) +2. 
The model accepts [`pixel_values`](model_doc/vision-encoder-decoder#transformers.VisionEncoderDecoderModel.forward.pixel_values) as its input. `ImageProcessor` can take care of normalizing the images, and generating appropriate tensors. Create a function that combines image augmentation and image preprocessing for a batch of images and generates `pixel_values`: @@ -408,8 +418,7 @@ If you wish to normalize images as a part of the augmentation transformation, us and `image_processor.image_std` values. -3. Then use 🤗 Datasets [`set_transform`](https://huggingface.co/docs/datasets/process.html#format-transform) to apply the transforms on the fly: - +3. Then use 🤗 Datasets[`~datasets.Dataset.set_transform`] to apply the transforms on the fly: ```py >>> dataset.set_transform(transforms) ``` @@ -445,13 +454,13 @@ or segmentation maps. ### Pad In some cases, for instance, when fine-tuning [DETR](./model_doc/detr), the model applies scale augmentation at training -time. This may cause images to be different sizes in a batch. You can use [`DetrImageProcessor.pad_and_create_pixel_mask`] +time. This may cause images to be different sizes in a batch. You can use [`DetrImageProcessor.pad`] from [`DetrImageProcessor`] and define a custom `collate_fn` to batch images together. ```py >>> def collate_fn(batch): ... pixel_values = [item["pixel_values"] for item in batch] -... encoding = image_processor.pad_and_create_pixel_mask(pixel_values, return_tensors="pt") +... encoding = image_processor.pad(pixel_values, return_tensors="pt") ... labels = [item["labels"] for item in batch] ... batch = {} ... batch["pixel_values"] = encoding["pixel_values"] @@ -464,7 +473,7 @@ from [`DetrImageProcessor`] and define a custom `collate_fn` to batch images tog For tasks involving multimodal inputs, you'll need a [processor](main_classes/processors) to prepare your dataset for the model. A processor couples together two processing objects such as as tokenizer and feature extractor. -Load the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use a processor for automatic speech recognition (ASR): +Load the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub) for more details on how to load a dataset) to see how you can use a processor for automatic speech recognition (ASR): ```py >>> from datasets import load_dataset diff --git a/docs/source/en/quantization.md b/docs/source/en/quantization.md new file mode 100644 index 00000000000000..29ee188852feca --- /dev/null +++ b/docs/source/en/quantization.md @@ -0,0 +1,620 @@ + + +# Quantization + +Quantization techniques focus on representing data with less information while also trying to not lose too much accuracy. This often means converting a data type to represent the same information with fewer bits. For example, if your model weights are stored as 32-bit floating points and they're quantized to 16-bit floating points, this halves the model size which makes it easier to store and reduces memory-usage. Lower precision can also speedup inference because it takes less time to perform calculations with fewer bits. + +Transformers supports several quantization schemes to help you run inference with large language models (LLMs) and finetune adapters on quantized models. 
This guide will show you how to use Activation-aware Weight Quantization (AWQ), AutoGPTQ, and bitsandbytes. + + + +Interested in adding a new quantization method to Transformers? Read the [HfQuantizer](./hf_quantizer) guide to learn how! + + + +## AQLM + + + +Try AQLM on [Google Colab](https://colab.research.google.com/drive/1-xZmBRXT5Fm3Ghn4Mwa2KRypORXb855X?usp=sharing)! + +Additive Quantization of Language Models ([AQLM](https://arxiv.org/abs/2401.06118)) is a compression method for Large Language Models. It quantizes multiple weights together and takes advantage of interdependencies between them. AQLM represents groups of 8-16 weights as a sum of multiple vector codes. + +Inference support for AQLM is realised in the `aqlm` library. Make sure to install it to run the models (note that aqlm works only with python>=3.10): +```bash +pip install aqlm[gpu,cpu] +``` + +The library provides efficient kernels for both GPU and CPU inference. + +The instructions on how to quantize models yourself, as well as all the relevant code, can be found in the corresponding GitHub [repository](https://github.com/Vahe1994/AQLM). + +### AQLM configurations + +AQLM quantization setups vary mainly in the number of codebooks used as well as the codebook sizes in bits. The most popular setups, as well as the inference kernels they support, are: + +| Kernel | Number of codebooks | Codebook size, bits | Notation | Accuracy | Speedup | Fast GPU inference | Fast CPU inference | +|---|---------------------|---------------------|----------|-------------|-------------|--------------------|--------------------| +| Triton | K | N | KxN | - | Up to ~0.7x | ✅ | ❌ | +| CUDA | 1 | 16 | 1x16 | Best | Up to ~1.3x | ✅ | ❌ | +| CUDA | 2 | 8 | 2x8 | OK | Up to ~3.0x | ✅ | ❌ | +| Numba | K | 8 | Kx8 | Good | Up to ~4.0x | ❌ | ✅ | + +## AWQ + + + +Try AWQ quantization with this [notebook](https://colab.research.google.com/drive/1HzZH89yAXJaZgwJDhQj9LqSBux932BvY)! + + + +[Activation-aware Weight Quantization (AWQ)](https://hf.co/papers/2306.00978) doesn't quantize all the weights in a model, and instead, it preserves a small percentage of weights that are important for LLM performance. This significantly reduces quantization loss such that you can run models in 4-bit precision without experiencing any performance degradation. + +There are several libraries for quantizing models with the AWQ algorithm, such as [llm-awq](https://github.com/mit-han-lab/llm-awq), [autoawq](https://github.com/casper-hansen/AutoAWQ) or [optimum-intel](https://huggingface.co/docs/optimum/main/en/intel/optimization_inc). Transformers supports loading models quantized with the llm-awq and autoawq libraries. This guide will show you how to load models quantized with autoawq, but the process is similar for llm-awq quantized models. + +Make sure you have autoawq installed: + +```bash +pip install autoawq +``` + +AWQ-quantized models can be identified by checking the `quantization_config` attribute in the model's [config.json](https://huggingface.co/TheBloke/zephyr-7B-alpha-AWQ/blob/main/config.json) file: + +```json +{ + "_name_or_path": "/workspace/process/huggingfaceh4_zephyr-7b-alpha/source", + "architectures": [ + "MistralForCausalLM" + ], + ... + ... + ... + "quantization_config": { + "quant_method": "awq", + "zero_point": true, + "group_size": 128, + "bits": 4, + "version": "gemm" + } +} +``` + +A quantized model is loaded with the [`~PreTrainedModel.from_pretrained`] method. If you loaded your model on the CPU, make sure to move it to a GPU device first.
Use the `device_map` parameter to specify where to place the model: + +```py +from transformers import AutoModelForCausalLM, AutoTokenizer + +model_id = "TheBloke/zephyr-7B-alpha-AWQ" +model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0") +``` + +Loading an AWQ-quantized model automatically sets other weights to fp16 by default for performance reasons. If you want to load these other weights in a different format, use the `torch_dtype` parameter: + +```py +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer + +model_id = "TheBloke/zephyr-7B-alpha-AWQ" +model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32) +``` + +AWQ quantization can also be combined with [FlashAttention-2](perf_infer_gpu_one#flashattention-2) to further accelerate inference: + +```py +from transformers import AutoModelForCausalLM, AutoTokenizer + +model = AutoModelForCausalLM.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ", attn_implementation="flash_attention_2", device_map="cuda:0") +``` + +### Fused modules + +Fused modules offer improved accuracy and performance, and they are supported out-of-the-box for AWQ modules of the [Llama](https://huggingface.co/meta-llama) and [Mistral](https://huggingface.co/mistralai/Mistral-7B-v0.1) architectures, but you can also fuse AWQ modules for unsupported architectures. + + + +Fused modules cannot be combined with other optimization techniques such as FlashAttention-2. + + + + + + +To enable fused modules for supported architectures, create an [`AwqConfig`] and set the parameters `fuse_max_seq_len` and `do_fuse=True`. The `fuse_max_seq_len` parameter is the total sequence length and it should include the context length and the expected generation length. You can set it to a larger value to be safe. + +For example, to fuse the AWQ modules of the [TheBloke/Mistral-7B-OpenOrca-AWQ](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ) model: + +```python +import torch +from transformers import AwqConfig, AutoModelForCausalLM + +model_id = "TheBloke/Mistral-7B-OpenOrca-AWQ" + +quantization_config = AwqConfig( + bits=4, + fuse_max_seq_len=512, + do_fuse=True, +) + +model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config).to(0) +``` + + + + +For architectures that don't support fused modules yet, you need to create a custom fusing mapping to define which modules need to be fused with the `modules_to_fuse` parameter. For example, to fuse the AWQ modules of the [TheBloke/Yi-34B-AWQ](https://huggingface.co/TheBloke/Yi-34B-AWQ) model: + +```python +import torch +from transformers import AwqConfig, AutoModelForCausalLM + +model_id = "TheBloke/Yi-34B-AWQ" + +quantization_config = AwqConfig( + bits=4, + fuse_max_seq_len=512, + modules_to_fuse={ + "attention": ["q_proj", "k_proj", "v_proj", "o_proj"], + "layernorm": ["ln1", "ln2", "norm"], + "mlp": ["gate_proj", "up_proj", "down_proj"], + "use_alibi": False, + "num_attention_heads": 56, + "num_key_value_heads": 8, + "hidden_size": 7168 + } +) + +model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config).to(0) +``` + +The parameter `modules_to_fuse` should include: + +- `"attention"`: The names of the attention layers to fuse in the following order: query, key, value and output projection layer. If you don't want to fuse these layers, pass an empty list. +- `"layernorm"`: The names of all the LayerNorm layers you want to replace with a custom fused LayerNorm. If you don't want to fuse these layers, pass an empty list.
+- `"mlp"`: The names of the MLP layers you want to fuse into a single MLP layer in the order: (gate (dense, layer, post-attention) / up / down layers). +- `"use_alibi"`: If your model uses ALiBi positional embedding. +- `"num_attention_heads"`: The number of attention heads. +- `"num_key_value_heads"`: The number of key value heads that should be used to implement Grouped Query Attention (GQA). If `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if `num_key_value_heads=1` the model will use Multi Query Attention (MQA), otherwise GQA is used. +- `"hidden_size"`: The dimension of the hidden representations. + + + + +## AutoGPTQ + + + +Try GPTQ quantization with PEFT in this [notebook](https://colab.research.google.com/drive/1_TIrmuKOFhuRRiTWN94iLKUFu6ZX4ceb?usp=sharing) and learn more about it's details in this [blog post](https://huggingface.co/blog/gptq-integration)! + + + +The [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) library implements the GPTQ algorithm, a post-training quantization technique where each row of the weight matrix is quantized independently to find a version of the weights that minimizes the error. These weights are quantized to int4, but they're restored to fp16 on the fly during inference. This can save your memory-usage by 4x because the int4 weights are dequantized in a fused kernel rather than a GPU's global memory, and you can also expect a speedup in inference because using a lower bitwidth takes less time to communicate. + +Before you begin, make sure the following libraries are installed: + +```bash +pip install auto-gptq +pip install git+https://github.com/huggingface/optimum.git +pip install git+https://github.com/huggingface/transformers.git +pip install --upgrade accelerate +``` + +To quantize a model (currently only supported for text models), you need to create a [`GPTQConfig`] class and set the number of bits to quantize to, a dataset to calibrate the weights for quantization, and a tokenizer to prepare the dataset. + +```py +from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig + +model_id = "facebook/opt-125m" +tokenizer = AutoTokenizer.from_pretrained(model_id) +gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer) +``` + +You could also pass your own dataset as a list of strings, but it is highly recommended to use the same dataset from the GPTQ paper. + +```py +dataset = ["auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."] +gptq_config = GPTQConfig(bits=4, dataset=dataset, tokenizer=tokenizer) +``` + +Load a model to quantize and pass the `gptq_config` to the [`~AutoModelForCausalLM.from_pretrained`] method. Set `device_map="auto"` to automatically offload the model to a CPU to help fit the model in memory, and allow the model modules to be moved between the CPU and GPU for quantization. + +```py +quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=gptq_config) +``` + +If you're running out of memory because a dataset is too large, disk offloading is not supported. If this is the case, try passing the `max_memory` parameter to allocate the amount of memory to use on your device (GPU and CPU): + +```py +quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", max_memory={0: "30GiB", 1: "46GiB", "cpu": "30GiB"}, quantization_config=gptq_config) +``` + + + +Depending on your hardware, it can take some time to quantize a model from scratch. 
It can take ~5 minutes to quantize the [facebook/opt-350m](https://huggingface.co/facebook/opt-350m) model on a free-tier Google Colab GPU, but it'll take ~4 hours to quantize a 175B parameter model on a NVIDIA A100. Before you quantize a model, it is a good idea to check the Hub if a GPTQ-quantized version of the model already exists. + + + +Once your model is quantized, you can push the model and tokenizer to the Hub where it can be easily shared and accessed. Use the [`~PreTrainedModel.push_to_hub`] method to save the [`GPTQConfig`]: + +```py +quantized_model.push_to_hub("opt-125m-gptq") +tokenizer.push_to_hub("opt-125m-gptq") +``` + +You could also save your quantized model locally with the [`~PreTrainedModel.save_pretrained`] method. If the model was quantized with the `device_map` parameter, make sure to move the entire model to a GPU or CPU before saving it. For example, to save the model on a CPU: + +```py +quantized_model.save_pretrained("opt-125m-gptq") +tokenizer.save_pretrained("opt-125m-gptq") + +# if quantized with device_map set +quantized_model.to("cpu") +quantized_model.save_pretrained("opt-125m-gptq") +``` + +Reload a quantized model with the [`~PreTrainedModel.from_pretrained`] method, and set `device_map="auto"` to automatically distribute the model on all available GPUs to load the model faster without using more memory than needed. + +```py +from transformers import AutoModelForCausalLM + +model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto") +``` + +### ExLlama + +[ExLlama](https://github.com/turboderp/exllama) is a Python/C++/CUDA implementation of the [Llama](model_doc/llama) model that is designed for faster inference with 4-bit GPTQ weights (check out these [benchmarks](https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark)). The ExLlama kernel is activated by default when you create a [`GPTQConfig`] object. To boost inference speed even further, use the [ExLlamaV2](https://github.com/turboderp/exllamav2) kernels by configuring the `exllama_config` parameter: + +```py +import torch +from transformers import AutoModelForCausalLM, GPTQConfig + +gptq_config = GPTQConfig(bits=4, exllama_config={"version":2}) +model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config=gptq_config) +``` + + + +Only 4-bit models are supported, and we recommend deactivating the ExLlama kernels if you're finetuning a quantized model with PEFT. + + + +The ExLlama kernels are only supported when the entire model is on the GPU. If you're doing inference on a CPU with AutoGPTQ (version > 0.4.2), then you'll need to disable the ExLlama kernel. This overwrites the attributes related to the ExLlama kernels in the quantization config of the config.json file. + +```py +import torch +from transformers import AutoModelForCausalLM, GPTQConfig +gptq_config = GPTQConfig(bits=4, use_exllama=False) +model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="cpu", quantization_config=gptq_config) +``` + +## bitsandbytes + +[bitsandbytes](https://github.com/TimDettmers/bitsandbytes) is the easiest option for quantizing a model to 8 and 4-bit. 8-bit quantization multiplies outliers in fp16 with non-outliers in int8, converts the non-outlier values back to fp16, and then adds them together to return the weights in fp16. This reduces the degradative effect outlier values have on a model's performance. 
4-bit quantization compresses a model even further, and it is commonly used with [QLoRA](https://hf.co/papers/2305.14314) to finetune quantized LLMs. + +To use bitsandbytes, make sure you have the following libraries installed: + + + + +```bash +pip install transformers accelerate bitsandbytes>0.37.0 +``` + + + + +```bash +pip install bitsandbytes>=0.39.0 +pip install --upgrade accelerate +pip install --upgrade transformers +``` + + + + +Now you can quantize a model with the `load_in_8bit` or `load_in_4bit` parameters in the [`~PreTrainedModel.from_pretrained`] method. This works for any model in any modality, as long as it supports loading with Accelerate and contains `torch.nn.Linear` layers. + + + + +Quantizing a model in 8-bit halves the memory-usage, and for large models, set `device_map="auto"` to efficiently use the GPUs available: + +```py +from transformers import AutoModelForCausalLM + +model_8bit = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b7", device_map="auto", load_in_8bit=True) +``` + +By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter if you want: + +```py +import torch +from transformers import AutoModelForCausalLM + +model_8bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_8bit=True, torch_dtype=torch.float32) +model_8bit.model.decoder.layers[-1].final_layer_norm.weight.dtype +``` + +Once a model is quantized to 8-bit, you can't push the quantized weights to the Hub unless you're using the latest version of Transformers and bitsandbytes. If you have the latest versions, then you can push the 8-bit model to the Hub with the [`~PreTrainedModel.push_to_hub`] method. The quantization config.json file is pushed first, followed by the quantized model weights. + +```py +from transformers import AutoModelForCausalLM, AutoTokenizer + +model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m", device_map="auto", load_in_8bit=True) +tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m") + +model.push_to_hub("bloom-560m-8bit") +``` + + + + +Quantizing a model in 4-bit reduces your memory-usage by 4x, and for large models, set `device_map="auto"` to efficiently use the GPUs available: + +```py +from transformers import AutoModelForCausalLM + +model_4bit = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b7", device_map="auto", load_in_4bit=True) +``` + +By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter if you want: + +```py +import torch +from transformers import AutoModelForCausalLM + +model_4bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_4bit=True, torch_dtype=torch.float32) +model_4bit.model.decoder.layers[-1].final_layer_norm.weight.dtype +``` + +If you have `bitsandbytes>=0.41.3`, you can serialize 4-bit models and push them on Hugging Face Hub. Simply call `model.push_to_hub()` after loading it in 4-bit precision. You can also save the serialized 4-bit models locally with `model.save_pretrained()` command. + + + + + + +Training with 8-bit and 4-bit weights are only supported for training *extra* parameters. 
+ + + +You can check your memory footprint with the `get_memory_footprint` method: + +```py +print(model.get_memory_footprint()) +``` + +Quantized models can be loaded from the [`~PreTrainedModel.from_pretrained`] method without needing to specify the `load_in_8bit` or `load_in_4bit` parameters: + +```py +from transformers import AutoModelForCausalLM, AutoTokenizer + +model = AutoModelForCausalLM.from_pretrained("{your_username}/bloom-560m-8bit", device_map="auto") +``` + +### 8-bit + + + +Learn more about the details of 8-bit quantization in this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration)! + + + +This section explores some of the specific features of 8-bit models, such as offloading, outlier thresholds, skipping module conversion, and finetuning. + +#### Offloading + +8-bit models can offload weights between the CPU and GPU to support fitting very large models into memory. The weights dispatched to the CPU are actually stored in **float32**, and aren't converted to 8-bit. For example, to enable offloading for the [bigscience/bloom-1b7](https://huggingface.co/bigscience/bloom-1b7) model, start by creating a [`BitsAndBytesConfig`]: + +```py +from transformers import AutoModelForCausalLM, BitsAndBytesConfig + +quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True) +``` + +Design a custom device map to fit everything on your GPU except for the `lm_head`, which you'll dispatch to the CPU: + +```py +device_map = { + "transformer.word_embeddings": 0, + "transformer.word_embeddings_layernorm": 0, + "lm_head": "cpu", + "transformer.h": 0, + "transformer.ln_f": 0, +} +``` + +Now load your model with the custom `device_map` and `quantization_config`: + +```py +model_8bit = AutoModelForCausalLM.from_pretrained( + "bigscience/bloom-1b7", + device_map=device_map, + quantization_config=quantization_config, +) +``` + +#### Outlier threshold + +An "outlier" is a hidden state value greater than a certain threshold, and these values are computed in fp16. While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, 6] or [6, 60]). 8-bit quantization works well for values ~5, but beyond that, there is a significant performance penalty. A good default threshold value is 6, but a lower threshold may be needed for more unstable models (small models or finetuning). + +To find the best threshold for your model, we recommend experimenting with the `llm_int8_threshold` parameter in [`BitsAndBytesConfig`]: + +```py +from transformers import AutoModelForCausalLM, BitsAndBytesConfig + +model_id = "bigscience/bloom-1b7" + +quantization_config = BitsAndBytesConfig( + llm_int8_threshold=10, +) + +model_8bit = AutoModelForCausalLM.from_pretrained( + model_id, + device_map=device_map, + quantization_config=quantization_config, +) +``` + +#### Skip module conversion + +For some models, like [Jukebox](model_doc/jukebox), you don't need to quantize every module to 8-bit which can actually cause instability. 
With Jukebox, there are several `lm_head` modules that should be skipped using the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`]: + +```py +from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig + +model_id = "bigscience/bloom-1b7" + +quantization_config = BitsAndBytesConfig( + llm_int8_skip_modules=["lm_head"], +) + +model_8bit = AutoModelForCausalLM.from_pretrained( + model_id, + device_map="auto", + quantization_config=quantization_config, +) +``` + +#### Finetuning + +With the [PEFT](https://github.com/huggingface/peft) library, you can finetune large models like [flan-t5-large](https://huggingface.co/google/flan-t5-large) and [facebook/opt-6.7b](https://huggingface.co/facebook/opt-6.7b) with 8-bit quantization. You don't need to pass the `device_map` parameter for training because it'll automatically load your model on a GPU. However, you can still customize the device map with the `device_map` parameter if you want to (`device_map="auto"` should only be used for inference). + +### 4-bit + + + +Try 4-bit quantization in this [notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf) and learn more about its details in this [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes). + + + +This section explores some of the specific features of 4-bit models, such as changing the compute data type, using the Normal Float 4 (NF4) data type, and using nested quantization. + + +#### Compute data type + +To speed up computation, you can change the data type from float32 (the default value) to bf16 using the `bnb_4bit_compute_dtype` parameter in [`BitsAndBytesConfig`]: + +```py +import torch +from transformers import BitsAndBytesConfig + +quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16) +``` + +#### Normal Float 4 (NF4) + +NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models. This can be configured with the `bnb_4bit_quant_type` parameter in the [`BitsAndBytesConfig`]: + +```py +from transformers import AutoModelForCausalLM, BitsAndBytesConfig + +nf4_config = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_quant_type="nf4", +) + +model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config) +``` + +For inference, the `bnb_4bit_quant_type` does not have a huge impact on performance. However, to remain consistent with the model weights, you should use the same `bnb_4bit_compute_dtype` and `torch_dtype` values. + +#### Nested quantization + +Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter. For example, with nested quantization, you can finetune a [Llama-13b](https://huggingface.co/meta-llama/Llama-2-13b) model on a 16GB NVIDIA T4 GPU with a sequence length of 1024, a batch size of 1, and enabling gradient accumulation with 4 steps.
+ +```py +from transformers import AutoModelForCausalLM, BitsAndBytesConfig + +double_quant_config = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_use_double_quant=True, +) + +model_double_quant = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b", quantization_config=double_quant_config) +``` + +## Optimum + +The [Optimum](https://huggingface.co/docs/optimum/index) library supports quantization for Intel, Furiosa, ONNX Runtime, GPTQ, and lower-level PyTorch quantization functions. Consider using Optimum for quantization if you're using specific and optimized hardware like Intel CPUs, Furiosa NPUs or a model accelerator like ONNX Runtime. + +## Benchmarks + +To compare the speed, throughput, and latency of each quantization scheme, check the following benchmarks obtained from the [optimum-benchmark](https://github.com/huggingface/optimum-benchmark) library. The benchmark was run on an NVIDIA A1000 for the [TheBloke/Mistral-7B-v0.1-AWQ](https://huggingface.co/TheBloke/Mistral-7B-v0.1-AWQ) and [TheBloke/Mistral-7B-v0.1-GPTQ](https://huggingface.co/TheBloke/Mistral-7B-v0.1-GPTQ) models. These were also tested against the bitsandbytes quantization methods as well as a native fp16 model.
+*[Benchmark figures (per batch size): forward peak memory, generate peak memory, generate throughput, and forward latency.]*
+ +The benchmarks indicate that AWQ quantization is the fastest for inference and text generation, and it has the lowest peak memory for text generation. However, AWQ has the largest forward latency per batch size. For a more detailed discussion about the pros and cons of each quantization method, read the [Overview of natively supported quantization schemes in 🤗 Transformers](https://huggingface.co/blog/overview-quantization-transformers) blog post. + +### Fused AWQ modules + +The [TheBloke/Mistral-7B-OpenOrca-AWQ](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ) model was benchmarked with `batch_size=1` with and without fused modules.
Unfused module
+ +| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) | +|-------------:|-----------------:|----------------:|-------------------:|------------------:|:----------------| +| 1 | 32 | 32 | 60.0984 | 38.4537 | 4.50 GB (5.68%) | +| 1 | 64 | 64 | 1333.67 | 31.6604 | 4.50 GB (5.68%) | +| 1 | 128 | 128 | 2434.06 | 31.6272 | 4.50 GB (5.68%) | +| 1 | 256 | 256 | 3072.26 | 38.1731 | 4.50 GB (5.68%) | +| 1 | 512 | 512 | 3184.74 | 31.6819 | 4.59 GB (5.80%) | +| 1 | 1024 | 1024 | 3148.18 | 36.8031 | 4.81 GB (6.07%) | +| 1 | 2048 | 2048 | 2927.33 | 35.2676 | 5.73 GB (7.23%) | + +
Fused module
+ +| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) | +|-------------:|-----------------:|----------------:|-------------------:|------------------:|:----------------| +| 1 | 32 | 32 | 81.4899 | 80.2569 | 4.00 GB (5.05%) | +| 1 | 64 | 64 | 1756.1 | 106.26 | 4.00 GB (5.05%) | +| 1 | 128 | 128 | 2479.32 | 105.631 | 4.00 GB (5.06%) | +| 1 | 256 | 256 | 1813.6 | 85.7485 | 4.01 GB (5.06%) | +| 1 | 512 | 512 | 2848.9 | 97.701 | 4.11 GB (5.19%) | +| 1 | 1024 | 1024 | 3044.35 | 87.7323 | 4.41 GB (5.57%) | +| 1 | 2048 | 2048 | 2715.11 | 89.4709 | 5.57 GB (7.04%) | + +The speed and throughput of fused and unfused modules were also tested with the [optimum-benchmark](https://github.com/huggingface/optimum-benchmark) library. + +
+*[Benchmark figures comparing fused and unfused modules: forward peak memory/batch size and generate throughput/batch size.]*
diff --git a/docs/source/en/quicktour.mdx b/docs/source/en/quicktour.md similarity index 92% rename from docs/source/en/quicktour.mdx rename to docs/source/en/quicktour.md index 76f46a28a671df..904e0bbc745340 100644 --- a/docs/source/en/quicktour.mdx +++ b/docs/source/en/quicktour.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Quick tour @@ -26,11 +30,13 @@ You'll also need to install your preferred machine learning framework: + ```bash pip install torch ``` + ```bash pip install tensorflow ``` @@ -60,7 +66,7 @@ For a complete list of available tasks, check out the [pipeline API reference](. | Audio classification | assign a label to some audio data | Audio | pipeline(task=“audio-classification”) | | Automatic speech recognition | transcribe speech into text | Audio | pipeline(task=“automatic-speech-recognition”) | | Visual question answering | answer a question about the image, given an image and a question | Multimodal | pipeline(task=“vqa”) | -| Document question answering | answer a question about a document, given an image and a question | Multimodal | pipeline(task="document-question-answering") | +| Document question answering | answer a question about the document, given a document and a question | Multimodal | pipeline(task="document-question-answering") | | Image captioning | generate a caption for a given image | Multimodal | pipeline(task="image-to-text") | Start by creating an instance of [`pipeline`] and specifying a task you want to use it for. In this guide, you'll use the [`pipeline`] for sentiment analysis as an example: @@ -71,7 +77,7 @@ Start by creating an instance of [`pipeline`] and specifying a task you want to >>> classifier = pipeline("sentiment-analysis") ``` -The [`pipeline`] downloads and caches a default [pretrained model](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) and tokenizer for sentiment analysis. Now you can use the `classifier` on your target text: +The [`pipeline`] downloads and caches a default [pretrained model](https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english) and tokenizer for sentiment analysis. 
Now you can use the `classifier` on your target text: ```py >>> classifier("We are very happy to show you the 🤗 Transformers library.") @@ -118,7 +124,7 @@ Extract the raw waveform arrays from the first 4 samples and pass it as a list t ```py >>> result = speech_recognizer(dataset[:4]["audio"]) >>> print([d["text"] for d in result]) -['I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT', "FODING HOW I'D SET UP A JOIN TO HET WITH MY WIFE AND WHERE THE AP MIGHT BE", "I I'D LIKE TOY SET UP A JOINT ACCOUNT WITH MY PARTNER I'M NOT SEEING THE OPTION TO DO IT ON THE AP SO I CALLED IN TO GET SOME HELP CAN I JUST DO IT OVER THE PHONE WITH YOU AND GIVE YOU THE INFORMATION OR SHOULD I DO IT IN THE AP AND I'M MISSING SOMETHING UQUETTE HAD PREFERRED TO JUST DO IT OVER THE PHONE OF POSSIBLE THINGS", 'HOW DO I THURN A JOIN A COUNT'] +['I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT', "FONDERING HOW I'D SET UP A JOIN TO HELL T WITH MY WIFE AND WHERE THE AP MIGHT BE", "I I'D LIKE TOY SET UP A JOINT ACCOUNT WITH MY PARTNER I'M NOT SEEING THE OPTION TO DO IT ON THE APSO I CALLED IN TO GET SOME HELP CAN I JUST DO IT OVER THE PHONE WITH YOU AND GIVE YOU THE INFORMATION OR SHOULD I DO IT IN THE AP AN I'M MISSING SOMETHING UQUETTE HAD PREFERRED TO JUST DO IT OVER THE PHONE OF POSSIBLE THINGS", 'HOW DO I FURN A JOINA COUT'] ``` For larger datasets where the inputs are big (like in speech or vision), you'll want to pass a generator instead of a list to load all the inputs in memory. Take a look at the [pipeline API reference](./main_classes/pipelines) for more information. @@ -204,6 +210,7 @@ A tokenizer can also accept a list of inputs, and pad and truncate the text to r + ```py >>> pt_batch = tokenizer( ... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."], @@ -215,6 +222,7 @@ A tokenizer can also accept a list of inputs, and pad and truncate the text to r ``` + ```py >>> tf_batch = tokenizer( ... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."], @@ -285,7 +293,7 @@ See the [task summary](./task_summary) for tasks supported by an [`AutoModel`] c
-Now pass your preprocessed batch of inputs directly to the model by passing the dictionary keys directly to the tensors: +Now pass your preprocessed batch of inputs directly to the model. You can pass the tensors as-is: ```py >>> tf_outputs = tf_model(tf_batch) @@ -348,6 +356,7 @@ One particularly cool 🤗 Transformers feature is the ability to save a model a + ```py >>> from transformers import AutoModel @@ -356,6 +365,7 @@ One particularly cool 🤗 Transformers feature is the ability to save a model a ``` + ```py >>> from transformers import TFAutoModel @@ -374,7 +384,7 @@ Start by importing [`AutoConfig`], and then load the pretrained model you want t ```py >>> from transformers import AutoConfig ->>> my_config = AutoConfig.from_pretrained("distilbert-base-uncased", n_heads=12) +>>> my_config = AutoConfig.from_pretrained("distilbert/distilbert-base-uncased", n_heads=12) ``` @@ -406,12 +416,12 @@ All models are a standard [`torch.nn.Module`](https://pytorch.org/docs/stable/nn Depending on your task, you'll typically pass the following parameters to [`Trainer`]: -1. A [`PreTrainedModel`] or a [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module): +1. You'll start with a [`PreTrainedModel`] or a [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module): ```py >>> from transformers import AutoModelForSequenceClassification - >>> model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") + >>> model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` 2. [`TrainingArguments`] contains the model hyperparameters you can change like learning rate, batch size, and the number of epochs to train for. The default values are used if you don't specify any training arguments: @@ -428,12 +438,12 @@ Depending on your task, you'll typically pass the following parameters to [`Trai ... ) ``` -3. A preprocessing class like a tokenizer, image processor, feature extractor, or processor: +3. Load a preprocessing class like a tokenizer, image processor, feature extractor, or processor: ```py >>> from transformers import AutoTokenizer - >>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") + >>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") ``` 4. Load a dataset: @@ -505,15 +515,15 @@ All models are a standard [`tf.keras.Model`](https://www.tensorflow.org/api_docs ```py >>> from transformers import TFAutoModelForSequenceClassification - >>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") + >>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` -2. A preprocessing class like a tokenizer, image processor, feature extractor, or processor: +2. Load a preprocessing class like a tokenizer, image processor, feature extractor, or processor: ```py >>> from transformers import AutoTokenizer - >>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") + >>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") ``` 3. Create a function to tokenize the dataset: @@ -528,17 +538,17 @@ All models are a standard [`tf.keras.Model`](https://www.tensorflow.org/api_docs ```py >>> dataset = dataset.map(tokenize_dataset) # doctest: +SKIP >>> tf_dataset = model.prepare_tf_dataset( - ... dataset, batch_size=16, shuffle=True, tokenizer=tokenizer + ... dataset["train"], batch_size=16, shuffle=True, tokenizer=tokenizer ... ) # doctest: +SKIP ``` -5. 
When you're ready, you can call `compile` and `fit` to start training: +5. When you're ready, you can call `compile` and `fit` to start training. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to: ```py >>> from tensorflow.keras.optimizers import Adam - >>> model.compile(optimizer=Adam(3e-5)) - >>> model.fit(dataset) # doctest: +SKIP + >>> model.compile(optimizer=Adam(3e-5)) # No loss argument! + >>> model.fit(tf_dataset) # doctest: +SKIP ``` ## What's next? diff --git a/docs/source/en/run_scripts.mdx b/docs/source/en/run_scripts.md similarity index 92% rename from docs/source/en/run_scripts.mdx rename to docs/source/en/run_scripts.md index 58d6b8dd3e208c..845befc5638133 100644 --- a/docs/source/en/run_scripts.mdx +++ b/docs/source/en/run_scripts.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Train with a script @@ -83,11 +87,11 @@ pip install -r requirements.txt -The example script downloads and preprocesses a dataset from the 🤗 [Datasets](https://huggingface.co/docs/datasets/) library. Then the script fine-tunes a dataset with the [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) on an architecture that supports summarization. The following example shows how to fine-tune [T5-small](https://huggingface.co/t5-small) on the [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail) dataset. The T5 model requires an additional `source_prefix` argument due to how it was trained. This prompt lets T5 know this is a summarization task. +The example script downloads and preprocesses a dataset from the 🤗 [Datasets](https://huggingface.co/docs/datasets/) library. Then the script fine-tunes a dataset with the [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) on an architecture that supports summarization. The following example shows how to fine-tune [T5-small](https://huggingface.co/google-t5/t5-small) on the [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail) dataset. The T5 model requires an additional `source_prefix` argument due to how it was trained. This prompt lets T5 know this is a summarization task. ```bash python examples/pytorch/summarization/run_summarization.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ @@ -101,11 +105,11 @@ python examples/pytorch/summarization/run_summarization.py \ ``` -The example script downloads and preprocesses a dataset from the 🤗 [Datasets](https://huggingface.co/docs/datasets/) library. Then the script fine-tunes a dataset using Keras on an architecture that supports summarization. The following example shows how to fine-tune [T5-small](https://huggingface.co/t5-small) on the [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail) dataset. The T5 model requires an additional `source_prefix` argument due to how it was trained. This prompt lets T5 know this is a summarization task. 
+The example script downloads and preprocesses a dataset from the 🤗 [Datasets](https://huggingface.co/docs/datasets/) library. Then the script fine-tunes a dataset using Keras on an architecture that supports summarization. The following example shows how to fine-tune [T5-small](https://huggingface.co/google-t5/t5-small) on the [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail) dataset. The T5 model requires an additional `source_prefix` argument due to how it was trained. This prompt lets T5 know this is a summarization task. ```bash python examples/tensorflow/summarization/run_summarization.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --dataset_name cnn_dailymail \ --dataset_config "3.0.0" \ --output_dir /tmp/tst-summarization \ @@ -126,10 +130,10 @@ The [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) sup - Set the number of GPUs to use with the `nproc_per_node` argument. ```bash -python -m torch.distributed.launch \ +torchrun \ --nproc_per_node 8 pytorch/summarization/run_summarization.py \ --fp16 \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ @@ -153,7 +157,7 @@ Tensor Processing Units (TPUs) are specifically designed to accelerate performan ```bash python xla_spawn.py --num_cores 8 \ summarization/run_summarization.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ @@ -172,7 +176,7 @@ Tensor Processing Units (TPUs) are specifically designed to accelerate performan ```bash python run_summarization.py \ --tpu name_of_tpu_resource \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --dataset_name cnn_dailymail \ --dataset_config "3.0.0" \ --output_dir /tmp/tst-summarization \ @@ -210,7 +214,7 @@ Now you are ready to launch the training: ```bash accelerate launch run_summarization_no_trainer.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --dataset_name cnn_dailymail \ --dataset_config "3.0.0" \ --source_prefix "summarize: " \ @@ -229,7 +233,7 @@ A summarization script using a custom dataset would look like this: ```bash python examples/pytorch/summarization/run_summarization.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --train_file path_to_csv_or_jsonlines_file \ @@ -254,7 +258,7 @@ It is often a good idea to run your script on a smaller number of dataset exampl ```bash python examples/pytorch/summarization/run_summarization.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --max_train_samples 50 \ --max_eval_samples 50 \ --max_predict_samples 50 \ @@ -284,7 +288,7 @@ The first method uses the `output_dir previous_output_dir` argument to resume tr ```bash python examples/pytorch/summarization/run_summarization.py - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ @@ -301,7 +305,7 @@ The second method uses the `resume_from_checkpoint path_to_specific_checkpoint` ```bash python examples/pytorch/summarization/run_summarization.py - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ @@ -331,7 +335,7 @@ The following example shows how to upload a model with a specific repository nam ```bash python 
examples/pytorch/summarization/run_summarization.py - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ diff --git a/docs/source/en/sagemaker.mdx b/docs/source/en/sagemaker.md similarity index 86% rename from docs/source/en/sagemaker.mdx rename to docs/source/en/sagemaker.md index 1ffdd4326e4d65..579caa499c2fcd 100644 --- a/docs/source/en/sagemaker.mdx +++ b/docs/source/en/sagemaker.md @@ -12,6 +12,10 @@ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Run training on Amazon SageMaker @@ -22,4 +26,3 @@ The documentation has been moved to [hf.co/docs/sagemaker](https://huggingface.c - [Train Hugging Face models on Amazon SageMaker with the SageMaker Python SDK](https://huggingface.co/docs/sagemaker/train) - [Deploy Hugging Face models to Amazon SageMaker with the SageMaker Python SDK](https://huggingface.co/docs/sagemaker/inference) -- [Frequently Asked Questions](https://huggingface.co/docs/sagemaker/faq) diff --git a/docs/source/en/serialization.md b/docs/source/en/serialization.md new file mode 100644 index 00000000000000..5995d9042de6fb --- /dev/null +++ b/docs/source/en/serialization.md @@ -0,0 +1,210 @@ + + +# Export to ONNX + +Deploying 🤗 Transformers models in production environments often requires, or can benefit from exporting the models into +a serialized format that can be loaded and executed on specialized runtimes and hardware. + +🤗 Optimum is an extension of Transformers that enables exporting models from PyTorch or TensorFlow to serialized formats +such as ONNX and TFLite through its `exporters` module. 🤗 Optimum also provides a set of performance optimization tools to train +and run models on targeted hardware with maximum efficiency. + +This guide demonstrates how you can export 🤗 Transformers models to ONNX with 🤗 Optimum, for the guide on exporting models to TFLite, +please refer to the [Export to TFLite page](tflite). + +## Export to ONNX + +[ONNX (Open Neural Network eXchange)](http://onnx.ai) is an open standard that defines a common set of operators and a +common file format to represent deep learning models in a wide variety of frameworks, including PyTorch and +TensorFlow. When a model is exported to the ONNX format, these operators are used to +construct a computational graph (often called an _intermediate representation_) which +represents the flow of data through the neural network. + +By exposing a graph with standardized operators and data types, ONNX makes it easy to +switch between frameworks. For example, a model trained in PyTorch can be exported to +ONNX format and then imported in TensorFlow (and vice versa). + +Once exported to ONNX format, a model can be: +- optimized for inference via techniques such as [graph optimization](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization) and [quantization](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/quantization). +- run with ONNX Runtime via [`ORTModelForXXX` classes](https://huggingface.co/docs/optimum/onnxruntime/package_reference/modeling_ort), +which follow the same `AutoModel` API as the one you are used to in 🤗 Transformers. 
+- run with [optimized inference pipelines](https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/pipelines),
+which have the same API as the [`pipeline`] function in 🤗 Transformers.
+
+🤗 Optimum provides support for the ONNX export by leveraging configuration objects. These configuration objects come
+ready-made for a number of model architectures, and are designed to be easily extendable to other architectures.
+
+For the list of ready-made configurations, please refer to the [🤗 Optimum documentation](https://huggingface.co/docs/optimum/exporters/onnx/overview).
+
+There are two ways to export a 🤗 Transformers model to ONNX; here we show both:
+
+- export with 🤗 Optimum via CLI.
+- export with 🤗 Optimum with `optimum.onnxruntime`.
+
+### Exporting a 🤗 Transformers model to ONNX with CLI
+
+To export a 🤗 Transformers model to ONNX, first install an extra dependency:
+
+```bash
+pip install optimum[exporters]
+```
+
+To check out all available arguments, refer to the [🤗 Optimum docs](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli),
+or view the help on the command line:
+
+```bash
+optimum-cli export onnx --help
+```
+
+To export a model's checkpoint from the 🤗 Hub, for example, `distilbert/distilbert-base-uncased-distilled-squad`, run the following command:
+
+```bash
+optimum-cli export onnx --model distilbert/distilbert-base-uncased-distilled-squad distilbert_base_uncased_squad_onnx/
+```
+
+You should see the logs indicating progress and showing where the resulting `model.onnx` is saved, like this:
+
+```bash
+Validating ONNX model distilbert_base_uncased_squad_onnx/model.onnx...
+    -[✓] ONNX model output names match reference model (start_logits, end_logits)
+    - Validating ONNX Model output "start_logits":
+        -[✓] (2, 16) matches (2, 16)
+        -[✓] all values close (atol: 0.0001)
+    - Validating ONNX Model output "end_logits":
+        -[✓] (2, 16) matches (2, 16)
+        -[✓] all values close (atol: 0.0001)
+The ONNX export succeeded and the exported model was saved at: distilbert_base_uncased_squad_onnx
+```
+
+The example above illustrates exporting a checkpoint from the 🤗 Hub. When exporting a local model, first make sure that you
+saved both the model's weights and tokenizer files in the same directory (`local_path`). When using the CLI, pass the
+`local_path` to the `model` argument instead of the checkpoint name on the 🤗 Hub and provide the `--task` argument.
+You can review the list of supported tasks in the [🤗 Optimum documentation](https://huggingface.co/docs/optimum/exporters/task_manager).
+If the `task` argument is not provided, it will default to the model architecture without any task-specific head.
+
+```bash
+optimum-cli export onnx --model local_path --task question-answering distilbert_base_uncased_squad_onnx/
+```
+
+The resulting `model.onnx` file can then be run on one of the [many
+accelerators](https://onnx.ai/supported-tools.html#deployModel) that support the ONNX
+standard.
For example, we can load and run the model with [ONNX
+Runtime](https://onnxruntime.ai/) as follows:
+
+```python
+>>> from transformers import AutoTokenizer
+>>> from optimum.onnxruntime import ORTModelForQuestionAnswering
+
+>>> tokenizer = AutoTokenizer.from_pretrained("distilbert_base_uncased_squad_onnx")
+>>> model = ORTModelForQuestionAnswering.from_pretrained("distilbert_base_uncased_squad_onnx")
+>>> inputs = tokenizer("What am I using?", "Using DistilBERT with ONNX Runtime!", return_tensors="pt")
+>>> outputs = model(**inputs)
+```
+
+The process is identical for TensorFlow checkpoints on the Hub. For instance, here's how you would
+export a pure TensorFlow checkpoint from the [Keras organization](https://huggingface.co/keras-io):
+
+```bash
+optimum-cli export onnx --model keras-io/transformers-qa distilbert_base_cased_squad_onnx/
+```
+
+### Exporting a 🤗 Transformers model to ONNX with `optimum.onnxruntime`
+
+As an alternative to the CLI, you can export a 🤗 Transformers model to ONNX programmatically like so:
+
+```python
+>>> from optimum.onnxruntime import ORTModelForSequenceClassification
+>>> from transformers import AutoTokenizer
+
+>>> model_checkpoint = "distilbert_base_uncased_squad"
+>>> save_directory = "onnx/"
+
+>>> # Load a model from transformers and export it to ONNX
+>>> ort_model = ORTModelForSequenceClassification.from_pretrained(model_checkpoint, export=True)
+>>> tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
+
+>>> # Save the onnx model and tokenizer
+>>> ort_model.save_pretrained(save_directory)
+>>> tokenizer.save_pretrained(save_directory)
+```
+
+### Exporting a model for an unsupported architecture
+
+If you wish to contribute by adding support for a model that cannot currently be exported, you should first check if it is
+supported in [`optimum.exporters.onnx`](https://huggingface.co/docs/optimum/exporters/onnx/overview),
+and if it is not, [contribute to 🤗 Optimum](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute)
+directly.
+
+### Exporting a model with `transformers.onnx`
+
+
+
+`transformers.onnx` is no longer maintained; please export models with 🤗 Optimum as described above. This section will be removed in future versions.
+
+
+
+To export a 🤗 Transformers model to ONNX with `transformers.onnx`, install the extra dependencies:
+
+```bash
+pip install transformers[onnx]
+```
+
+Use the `transformers.onnx` package as a Python module to export a checkpoint using a ready-made configuration:
+
+```bash
+python -m transformers.onnx --model=distilbert/distilbert-base-uncased onnx/
+```
+
+This exports an ONNX graph of the checkpoint defined by the `--model` argument. Pass any checkpoint on the 🤗 Hub or one that's stored locally.
+The resulting `model.onnx` file can then be run on one of the many accelerators that support the ONNX standard. For example,
+load and run the model with ONNX Runtime as follows:
+
+```python
+>>> from transformers import AutoTokenizer
+>>> from onnxruntime import InferenceSession
+
+>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
+>>> session = InferenceSession("onnx/model.onnx")
+>>> # ONNX Runtime expects NumPy arrays as input
+>>> inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np")
+>>> outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))
+```
+
+The required output names (like `["last_hidden_state"]`) can be obtained by taking a look at the ONNX configuration of
+each model.
For example, for DistilBERT we have: + +```python +>>> from transformers.models.distilbert import DistilBertConfig, DistilBertOnnxConfig + +>>> config = DistilBertConfig() +>>> onnx_config = DistilBertOnnxConfig(config) +>>> print(list(onnx_config.outputs.keys())) +["last_hidden_state"] +``` + +The process is identical for TensorFlow checkpoints on the Hub. For example, export a pure TensorFlow checkpoint like so: + +```bash +python -m transformers.onnx --model=keras-io/transformers-qa onnx/ +``` + +To export a model that's stored locally, save the model's weights and tokenizer files in the same directory (e.g. `local-pt-checkpoint`), +then export it to ONNX by pointing the `--model` argument of the `transformers.onnx` package to the desired directory: + +```bash +python -m transformers.onnx --model=local-pt-checkpoint onnx/ +``` \ No newline at end of file diff --git a/docs/source/en/serialization.mdx b/docs/source/en/serialization.mdx deleted file mode 100644 index d1485fb9cdbef1..00000000000000 --- a/docs/source/en/serialization.mdx +++ /dev/null @@ -1,539 +0,0 @@ - - -# Export to ONNX - -If you need to deploy 🤗 Transformers models in production environments, we recommend -exporting them to a serialized format that can be loaded and executed on specialized -runtimes and hardware. In this guide, we'll show you how to export 🤗 Transformers -models to [ONNX (Open Neural Network eXchange)](http://onnx.ai). - -ONNX is an open standard that defines a common set of operators and a common file format -to represent deep learning models in a wide variety of frameworks, including PyTorch and -TensorFlow. When a model is exported to the ONNX format, these operators are used to -construct a computational graph (often called an _intermediate representation_) which -represents the flow of data through the neural network. - -By exposing a graph with standardized operators and data types, ONNX makes it easy to -switch between frameworks. For example, a model trained in PyTorch can be exported to -ONNX format and then imported in TensorFlow (and vice versa). - -🤗 Transformers provides a [`transformers.onnx`](main_classes/onnx) package that enables -you to convert model checkpoints to an ONNX graph by leveraging configuration objects. -These configuration objects come ready made for a number of model architectures, and are -designed to be easily extendable to other architectures. - - - -You can also export 🤗 Transformers models with the [`optimum.exporters.onnx` package](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model) -from 🤗 Optimum. - -Once exported, a model can be: - -- Optimized for inference via techniques such as quantization and graph optimization. -- Run with ONNX Runtime via [`ORTModelForXXX` classes](https://huggingface.co/docs/optimum/onnxruntime/package_reference/modeling_ort), -which follow the same `AutoModel` API as the one you are used to in 🤗 Transformers. -- Run with [optimized inference pipelines](https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/pipelines), -which has the same API as the [`pipeline`] function in 🤗 Transformers. - -To explore all these features, check out the [🤗 Optimum library](https://github.com/huggingface/optimum). 
- - - -Ready-made configurations include the following architectures: - - - -- ALBERT -- BART -- BEiT -- BERT -- BigBird -- BigBird-Pegasus -- Blenderbot -- BlenderbotSmall -- BLOOM -- CamemBERT -- Chinese-CLIP -- CLIP -- CodeGen -- Conditional DETR -- ConvBERT -- ConvNeXT -- Data2VecText -- Data2VecVision -- DeBERTa -- DeBERTa-v2 -- DeiT -- DETR -- DistilBERT -- ELECTRA -- ERNIE -- FlauBERT -- GPT Neo -- GPT-J -- GPT-Sw3 -- GroupViT -- I-BERT -- ImageGPT -- LayoutLM -- LayoutLMv3 -- LeViT -- Longformer -- LongT5 -- M2M100 -- Marian -- mBART -- MobileBERT -- MobileNetV1 -- MobileNetV2 -- MobileViT -- MT5 -- OpenAI GPT-2 -- OWL-ViT -- Perceiver -- PLBart -- PoolFormer -- RemBERT -- ResNet -- RoBERTa -- RoBERTa-PreLayerNorm -- RoFormer -- SegFormer -- SqueezeBERT -- Swin Transformer -- T5 -- Table Transformer -- Vision Encoder decoder -- ViT -- Whisper -- X-MOD -- XLM -- XLM-RoBERTa -- XLM-RoBERTa-XL -- YOLOS - -In the next two sections, we'll show you how to: - -* Export a supported model using the `transformers.onnx` package. -* Export a custom model for an unsupported architecture. - -## Exporting a model to ONNX - - - -The recommended way of exporting a model is now to use -[`optimum.exporters.onnx`](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli), -do not worry it is very similar to `transformers.onnx`! - - - -To export a 🤗 Transformers model to ONNX, you'll first need to install some extra -dependencies: - -```bash -pip install transformers[onnx] -``` - -The `transformers.onnx` package can then be used as a Python module: - -```bash -python -m transformers.onnx --help - -usage: Hugging Face Transformers ONNX exporter [-h] -m MODEL [--feature {causal-lm, ...}] [--opset OPSET] [--atol ATOL] output - -positional arguments: - output Path indicating where to store generated ONNX model. - -optional arguments: - -h, --help show this help message and exit - -m MODEL, --model MODEL - Model ID on huggingface.co or path on disk to load model from. - --feature {causal-lm, ...} - The type of features to export the model with. - --opset OPSET ONNX opset version to export the model with. - --atol ATOL Absolute difference tolerance when validating the model. -``` - -Exporting a checkpoint using a ready-made configuration can be done as follows: - -```bash -python -m transformers.onnx --model=distilbert-base-uncased onnx/ -``` - -You should see the following logs: - -```bash -Validating ONNX model... - -[✓] ONNX model output names match reference model ({'last_hidden_state'}) - - Validating ONNX Model output "last_hidden_state": - -[✓] (2, 8, 768) matches (2, 8, 768) - -[✓] all values close (atol: 1e-05) -All good, model saved at: onnx/model.onnx -``` - -This exports an ONNX graph of the checkpoint defined by the `--model` argument. In this -example, it is `distilbert-base-uncased`, but it can be any checkpoint on the Hugging -Face Hub or one that's stored locally. - -The resulting `model.onnx` file can then be run on one of the [many -accelerators](https://onnx.ai/supported-tools.html#deployModel) that support the ONNX -standard. 
For example, we can load and run the model with [ONNX -Runtime](https://onnxruntime.ai/) as follows: - -```python ->>> from transformers import AutoTokenizer ->>> from onnxruntime import InferenceSession - ->>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") ->>> session = InferenceSession("onnx/model.onnx") ->>> # ONNX Runtime expects NumPy arrays as input ->>> inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np") ->>> outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs)) -``` - -The required output names (like `["last_hidden_state"]`) can be obtained by taking a -look at the ONNX configuration of each model. For example, for DistilBERT we have: - -```python ->>> from transformers.models.distilbert import DistilBertConfig, DistilBertOnnxConfig - ->>> config = DistilBertConfig() ->>> onnx_config = DistilBertOnnxConfig(config) ->>> print(list(onnx_config.outputs.keys())) -["last_hidden_state"] -``` - -The process is identical for TensorFlow checkpoints on the Hub. For example, we can -export a pure TensorFlow checkpoint from the [Keras -organization](https://huggingface.co/keras-io) as follows: - -```bash -python -m transformers.onnx --model=keras-io/transformers-qa onnx/ -``` - -To export a model that's stored locally, you'll need to have the model's weights and -tokenizer files stored in a directory. For example, we can load and save a checkpoint as -follows: - - -```python ->>> from transformers import AutoTokenizer, AutoModelForSequenceClassification - ->>> # Load tokenizer and PyTorch weights form the Hub ->>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") ->>> pt_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") ->>> # Save to disk ->>> tokenizer.save_pretrained("local-pt-checkpoint") ->>> pt_model.save_pretrained("local-pt-checkpoint") -``` - -Once the checkpoint is saved, we can export it to ONNX by pointing the `--model` -argument of the `transformers.onnx` package to the desired directory: - -```bash -python -m transformers.onnx --model=local-pt-checkpoint onnx/ -``` - -```python ->>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification - ->>> # Load tokenizer and TensorFlow weights from the Hub ->>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") ->>> tf_model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") ->>> # Save to disk ->>> tokenizer.save_pretrained("local-tf-checkpoint") ->>> tf_model.save_pretrained("local-tf-checkpoint") -``` - -Once the checkpoint is saved, we can export it to ONNX by pointing the `--model` -argument of the `transformers.onnx` package to the desired directory: - -```bash -python -m transformers.onnx --model=local-tf-checkpoint onnx/ -``` - - -## Selecting features for different model tasks - - - -The recommended way of exporting a model is now to use `optimum.exporters.onnx`. -You can check the [🤗 Optimum documentation](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#selecting-a-task) -to learn how to select a task. - - - -Each ready-made configuration comes with a set of _features_ that enable you to export -models for different types of tasks. 
As shown in the table below, each feature is -associated with a different `AutoClass`: - -| Feature | Auto Class | -| ------------------------------------ | ------------------------------------ | -| `causal-lm`, `causal-lm-with-past` | `AutoModelForCausalLM` | -| `default`, `default-with-past` | `AutoModel` | -| `masked-lm` | `AutoModelForMaskedLM` | -| `question-answering` | `AutoModelForQuestionAnswering` | -| `seq2seq-lm`, `seq2seq-lm-with-past` | `AutoModelForSeq2SeqLM` | -| `sequence-classification` | `AutoModelForSequenceClassification` | -| `token-classification` | `AutoModelForTokenClassification` | - -For each configuration, you can find the list of supported features via the -[`~transformers.onnx.FeaturesManager`]. For example, for DistilBERT we have: - -```python ->>> from transformers.onnx.features import FeaturesManager - ->>> distilbert_features = list(FeaturesManager.get_supported_features_for_model_type("distilbert").keys()) ->>> print(distilbert_features) -["default", "masked-lm", "causal-lm", "sequence-classification", "token-classification", "question-answering"] -``` - -You can then pass one of these features to the `--feature` argument in the -`transformers.onnx` package. For example, to export a text-classification model we can -pick a fine-tuned model from the Hub and run: - -```bash -python -m transformers.onnx --model=distilbert-base-uncased-finetuned-sst-2-english \ - --feature=sequence-classification onnx/ -``` - -This displays the following logs: - -```bash -Validating ONNX model... - -[✓] ONNX model output names match reference model ({'logits'}) - - Validating ONNX Model output "logits": - -[✓] (2, 2) matches (2, 2) - -[✓] all values close (atol: 1e-05) -All good, model saved at: onnx/model.onnx -``` - -Notice that in this case, the output names from the fine-tuned model are `logits` -instead of the `last_hidden_state` we saw with the `distilbert-base-uncased` checkpoint -earlier. This is expected since the fine-tuned model has a sequence classification head. - - - -The features that have a `with-past` suffix (like `causal-lm-with-past`) correspond to -model classes with precomputed hidden states (key and values in the attention blocks) -that can be used for fast autoregressive decoding. - - - - - -For `VisionEncoderDecoder` type models, the encoder and decoder parts are -exported separately as two ONNX files named `encoder_model.onnx` and `decoder_model.onnx` respectively. - - - - -## Exporting a model for an unsupported architecture - - - -If you wish to contribute by adding support for a model that cannot be currently exported, you should first check if it is -supported in [`optimum.exporters.onnx`](https://huggingface.co/docs/optimum/main/en/exporters/onnx/package_reference/configuration#supported-architectures), -and if it is not, [contribute to 🤗 Optimum](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/contribute) -directly. - - - -If you wish to export a model whose architecture is not natively supported by the -library, there are three main steps to follow: - -1. Implement a custom ONNX configuration. -2. Export the model to ONNX. -3. Validate the outputs of the PyTorch and exported models. - -In this section, we'll look at how DistilBERT was implemented to show what's involved -with each step. - -### Implementing a custom ONNX configuration - -Let's start with the ONNX configuration object. 
We provide three abstract classes that -you should inherit from, depending on the type of model architecture you wish to export: - -* Encoder-based models inherit from [`~onnx.config.OnnxConfig`] -* Decoder-based models inherit from [`~onnx.config.OnnxConfigWithPast`] -* Encoder-decoder models inherit from [`~onnx.config.OnnxSeq2SeqConfigWithPast`] - - - -A good way to implement a custom ONNX configuration is to look at the existing -implementation in the `configuration_.py` file of a similar architecture. - - - -Since DistilBERT is an encoder-based model, its configuration inherits from -`OnnxConfig`: - -```python ->>> from typing import Mapping, OrderedDict ->>> from transformers.onnx import OnnxConfig - - ->>> class DistilBertOnnxConfig(OnnxConfig): -... @property -... def inputs(self) -> Mapping[str, Mapping[int, str]]: -... return OrderedDict( -... [ -... ("input_ids", {0: "batch", 1: "sequence"}), -... ("attention_mask", {0: "batch", 1: "sequence"}), -... ] -... ) -``` - -Every configuration object must implement the `inputs` property and return a mapping, -where each key corresponds to an expected input, and each value indicates the axis of -that input. For DistilBERT, we can see that two inputs are required: `input_ids` and -`attention_mask`. These inputs have the same shape of `(batch_size, sequence_length)` -which is why we see the same axes used in the configuration. - - - -Notice that `inputs` property for `DistilBertOnnxConfig` returns an `OrderedDict`. This -ensures that the inputs are matched with their relative position within the -`PreTrainedModel.forward()` method when tracing the graph. We recommend using an -`OrderedDict` for the `inputs` and `outputs` properties when implementing custom ONNX -configurations. - - - -Once you have implemented an ONNX configuration, you can instantiate it by providing the -base model's configuration as follows: - -```python ->>> from transformers import AutoConfig - ->>> config = AutoConfig.from_pretrained("distilbert-base-uncased") ->>> onnx_config = DistilBertOnnxConfig(config) -``` - -The resulting object has several useful properties. For example, you can view the ONNX -operator set that will be used during the export: - -```python ->>> print(onnx_config.default_onnx_opset) -11 -``` - -You can also view the outputs associated with the model as follows: - -```python ->>> print(onnx_config.outputs) -OrderedDict([("last_hidden_state", {0: "batch", 1: "sequence"})]) -``` - -Notice that the outputs property follows the same structure as the inputs; it returns an -`OrderedDict` of named outputs and their shapes. The output structure is linked to the -choice of feature that the configuration is initialised with. By default, the ONNX -configuration is initialized with the `default` feature that corresponds to exporting a -model loaded with the `AutoModel` class. If you want to export a model for another task, -just provide a different feature to the `task` argument when you initialize the ONNX -configuration. 
For example, if we wished to export DistilBERT with a sequence -classification head, we could use: - -```python ->>> from transformers import AutoConfig - ->>> config = AutoConfig.from_pretrained("distilbert-base-uncased") ->>> onnx_config_for_seq_clf = DistilBertOnnxConfig(config, task="sequence-classification") ->>> print(onnx_config_for_seq_clf.outputs) -OrderedDict([('logits', {0: 'batch'})]) -``` - - - -All of the base properties and methods associated with [`~onnx.config.OnnxConfig`] and -the other configuration classes can be overridden if needed. Check out [`BartOnnxConfig`] -for an advanced example. - - - -### Exporting the model - -Once you have implemented the ONNX configuration, the next step is to export the model. -Here we can use the `export()` function provided by the `transformers.onnx` package. -This function expects the ONNX configuration, along with the base model and tokenizer, -and the path to save the exported file: - -```python ->>> from pathlib import Path ->>> from transformers.onnx import export ->>> from transformers import AutoTokenizer, AutoModel - ->>> onnx_path = Path("model.onnx") ->>> model_ckpt = "distilbert-base-uncased" ->>> base_model = AutoModel.from_pretrained(model_ckpt) ->>> tokenizer = AutoTokenizer.from_pretrained(model_ckpt) - ->>> onnx_inputs, onnx_outputs = export(tokenizer, base_model, onnx_config, onnx_config.default_onnx_opset, onnx_path) -``` - -The `onnx_inputs` and `onnx_outputs` returned by the `export()` function are lists of -the keys defined in the `inputs` and `outputs` properties of the configuration. Once the -model is exported, you can test that the model is well formed as follows: - -```python ->>> import onnx - ->>> onnx_model = onnx.load("model.onnx") ->>> onnx.checker.check_model(onnx_model) -``` - - - -If your model is larger than 2GB, you will see that many additional files are created -during the export. This is _expected_ because ONNX uses [Protocol -Buffers](https://developers.google.com/protocol-buffers/) to store the model and these -have a size limit of 2GB. See the [ONNX -documentation](https://github.com/onnx/onnx/blob/master/docs/ExternalData.md) for -instructions on how to load models with external data. - - - -### Validating the model outputs - -The final step is to validate that the outputs from the base and exported model agree -within some absolute tolerance. Here we can use the `validate_model_outputs()` function -provided by the `transformers.onnx` package as follows: - -```python ->>> from transformers.onnx import validate_model_outputs - ->>> validate_model_outputs( -... onnx_config, tokenizer, base_model, onnx_path, onnx_outputs, onnx_config.atol_for_validation -... ) -``` - -This function uses the [`~transformers.onnx.OnnxConfig.generate_dummy_inputs`] method to -generate inputs for the base and exported model, and the absolute tolerance can be -defined in the configuration. We generally find numerical agreement in the 1e-6 to 1e-4 -range, although anything smaller than 1e-3 is likely to be OK. - -## Contributing a new configuration to 🤗 Transformers - -We are looking to expand the set of ready-made configurations and welcome contributions -from the community! 
If you would like to contribute your addition to the library, you -will need to: - -* Implement the ONNX configuration in the corresponding `configuration_.py` -file -* Include the model architecture and corresponding features in - [`~onnx.features.FeatureManager`] -* Add your model architecture to the tests in `test_onnx_v2.py` - -Check out how the configuration for [IBERT was -contributed](https://github.com/huggingface/transformers/pull/14868/files) to get an -idea of what's involved. diff --git a/docs/source/en/task_summary.mdx b/docs/source/en/task_summary.md similarity index 90% rename from docs/source/en/task_summary.mdx rename to docs/source/en/task_summary.md index 67181d2d0c53f3..8f7eb041f1f2d7 100644 --- a/docs/source/en/task_summary.mdx +++ b/docs/source/en/task_summary.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # What 🤗 Transformers can do @@ -178,7 +182,7 @@ Like classification tasks in any modality, text classification labels a sequence ### Token classification -In any NLP task, text is preprocessed by separating the sequence of text into individual words or subwords. These are known as [tokens](/glossary#token). Token classification assigns each token a label from a predefined set of classes. +In any NLP task, text is preprocessed by separating the sequence of text into individual words or subwords. These are known as [tokens](glossary#token). Token classification assigns each token a label from a predefined set of classes. Two common types of token classification are: @@ -264,7 +268,7 @@ In the early days, translation models were mostly monolingual, but recently, the >>> from transformers import pipeline >>> text = "translate English to French: Hugging Face is a community-based open-source platform for machine learning." ->>> translator = pipeline(task="translation", model="t5-small") +>>> translator = pipeline(task="translation", model="google-t5/t5-small") >>> translator(text) [{'translation_text': "Hugging Face est une tribune communautaire de l'apprentissage des machines."}] ``` @@ -307,4 +311,31 @@ There are two types of language modeling: 'sequence': 'Hugging Face is a community-based open-source platform for machine learning.'}] ``` +## Multimodal + +Multimodal tasks require a model to process multiple data modalities (text, image, audio, video) to solve a particular problem. Image captioning is an example of a multimodal task where the model takes an image as input and outputs a sequence of text describing the image or some properties of the image. + +Although multimodal models work with different data types or modalities, internally, the preprocessing steps help the model convert all the data types into embeddings (vectors or list of numbers that holds meaningful information about the data). For a task like image captioning, the model learns relationships between image embeddings and text embeddings. + +### Document question answering + +Document question answering is a task that answers natural language questions from a document. 
Unlike a token-level question answering task which takes text as input, document question answering takes an image of a document as input along with a question about the document and returns an answer. Document question answering can be used to parse structured documents and extract key information from it. In the example below, the total amount and change due can be extracted from a receipt. + +```py +>>> from transformers import pipeline +>>> from PIL import Image +>>> import requests + +>>> url = "https://datasets-server.huggingface.co/assets/hf-internal-testing/example-documents/--/hf-internal-testing--example-documents/test/2/image/image.jpg" +>>> image = Image.open(requests.get(url, stream=True).raw) + +>>> doc_question_answerer = pipeline("document-question-answering", model="magorshunov/layoutlm-invoices") +>>> preds = doc_question_answerer( +... question="What is the total amount?", +... image=image, +... ) +>>> preds +[{'score': 0.8531, 'answer': '17,000', 'start': 4, 'end': 4}] +``` + Hopefully, this page has given you some more background information about all the types of tasks in each modality and the practical importance of each one. In the next [section](tasks_explained), you'll learn **how** 🤗 Transformers work to solve these tasks. \ No newline at end of file diff --git a/docs/source/en/tasks/asr.mdx b/docs/source/en/tasks/asr.md similarity index 97% rename from docs/source/en/tasks/asr.mdx rename to docs/source/en/tasks/asr.md index 4c31cd2841895f..737460ed297bcf 100644 --- a/docs/source/en/tasks/asr.mdx +++ b/docs/source/en/tasks/asr.md @@ -8,10 +8,16 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Automatic speech recognition +[[open-in-colab]] + Automatic speech recognition (ASR) converts a speech signal to text, mapping a sequence of audio inputs to text outputs. Virtual assistants like Siri and Alexa use ASR models to help users everyday, and there are many other useful user-facing applications like live captioning and note-taking during meetings. @@ -26,7 +32,7 @@ The task illustrated in this tutorial is supported by the following model archit -[Data2VecAudio](../model_doc/data2vec-audio), [Hubert](../model_doc/hubert), [M-CTC-T](../model_doc/mctct), [SEW](../model_doc/sew), [SEW-D](../model_doc/sew-d), [UniSpeech](../model_doc/unispeech), [UniSpeechSat](../model_doc/unispeech-sat), [Wav2Vec2](../model_doc/wav2vec2), [Wav2Vec2-Conformer](../model_doc/wav2vec2-conformer), [WavLM](../model_doc/wavlm) +[Data2VecAudio](../model_doc/data2vec-audio), [Hubert](../model_doc/hubert), [M-CTC-T](../model_doc/mctct), [SEW](../model_doc/sew), [SEW-D](../model_doc/sew-d), [UniSpeech](../model_doc/unispeech), [UniSpeechSat](../model_doc/unispeech-sat), [Wav2Vec2](../model_doc/wav2vec2), [Wav2Vec2-BERT](../model_doc/wav2vec2-bert), [Wav2Vec2-Conformer](../model_doc/wav2vec2-conformer), [WavLM](../model_doc/wavlm) @@ -280,7 +286,7 @@ At this point, only three steps remain: ... args=training_args, ... train_dataset=encoded_minds["train"], ... eval_dataset=encoded_minds["test"], -... 
tokenizer=processor.feature_extractor, +... tokenizer=processor, ... data_collator=data_collator, ... compute_metrics=compute_metrics, ... ) diff --git a/docs/source/en/tasks/audio_classification.mdx b/docs/source/en/tasks/audio_classification.md similarity index 97% rename from docs/source/en/tasks/audio_classification.mdx rename to docs/source/en/tasks/audio_classification.md index 75c278c6f47f02..678af90c4fa079 100644 --- a/docs/source/en/tasks/audio_classification.mdx +++ b/docs/source/en/tasks/audio_classification.md @@ -8,10 +8,16 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Audio classification +[[open-in-colab]] + Audio classification - just like with text - assigns a class label output from the input data. The only difference is instead of text inputs, you have raw audio waveforms. Some practical applications of audio classification include identifying speaker intent, language classification, and even animal species by their sounds. @@ -26,7 +32,7 @@ The task illustrated in this tutorial is supported by the following model archit -[Audio Spectrogram Transformer](../model_doc/audio-spectrogram-transformer), [Data2VecAudio](../model_doc/data2vec-audio), [Hubert](../model_doc/hubert), [SEW](../model_doc/sew), [SEW-D](../model_doc/sew-d), [UniSpeech](../model_doc/unispeech), [UniSpeechSat](../model_doc/unispeech-sat), [Wav2Vec2](../model_doc/wav2vec2), [Wav2Vec2-Conformer](../model_doc/wav2vec2-conformer), [WavLM](../model_doc/wavlm) +[Audio Spectrogram Transformer](../model_doc/audio-spectrogram-transformer), [Data2VecAudio](../model_doc/data2vec-audio), [Hubert](../model_doc/hubert), [SEW](../model_doc/sew), [SEW-D](../model_doc/sew-d), [UniSpeech](../model_doc/unispeech), [UniSpeechSat](../model_doc/unispeech-sat), [Wav2Vec2](../model_doc/wav2vec2), [Wav2Vec2-BERT](../model_doc/wav2vec2-bert), [Wav2Vec2-Conformer](../model_doc/wav2vec2-conformer), [WavLM](../model_doc/wavlm), [Whisper](../model_doc/whisper) diff --git a/docs/source/en/tasks/document_question_answering.mdx b/docs/source/en/tasks/document_question_answering.md similarity index 98% rename from docs/source/en/tasks/document_question_answering.mdx rename to docs/source/en/tasks/document_question_answering.md index 4c5208820642f7..24bf3a069ac9a5 100644 --- a/docs/source/en/tasks/document_question_answering.mdx +++ b/docs/source/en/tasks/document_question_answering.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. 
+ --> # Document Question Answering @@ -40,9 +44,6 @@ LayoutLMv2 solves the document question-answering task by adding a question-answ states of the tokens, to predict the positions of the start and end tokens of the answer. In other words, the problem is treated as extractive question answering: given the context, extract which piece of information answers the question. The context comes from the output of an OCR engine, here it is Google's Tesseract. -states of the tokens, in order to predict which token is at the start of the answer and which token is at the end of the -answer. In other words, the problem is treated as extractive question answering: given the context, extract which piece -of information answers the question. The context comes from the output of an OCR engine, here it is Google's Tesseract. Before you begin, make sure you have all the necessary libraries installed. LayoutLMv2 depends on detectron2, torchvision and tesseract. diff --git a/docs/source/en/tasks/idefics.md b/docs/source/en/tasks/idefics.md new file mode 100644 index 00000000000000..a780124edea9c6 --- /dev/null +++ b/docs/source/en/tasks/idefics.md @@ -0,0 +1,425 @@ + + +# Image tasks with IDEFICS + +[[open-in-colab]] + +While individual tasks can be tackled by fine-tuning specialized models, an alternative approach +that has recently emerged and gained popularity is to use large models for a diverse set of tasks without fine-tuning. +For instance, large language models can handle such NLP tasks as summarization, translation, classification, and more. +This approach is no longer limited to a single modality, such as text, and in this guide, we will illustrate how you can +solve image-text tasks with a large multimodal model called IDEFICS. + +[IDEFICS](../model_doc/idefics) is an open-access vision and language model based on [Flamingo](https://huggingface.co/papers/2204.14198), +a state-of-the-art visual language model initially developed by DeepMind. The model accepts arbitrary sequences of image +and text inputs and generates coherent text as output. It can answer questions about images, describe visual content, +create stories grounded in multiple images, and so on. IDEFICS comes in two variants - [80 billion parameters](https://huggingface.co/HuggingFaceM4/idefics-80b) +and [9 billion parameters](https://huggingface.co/HuggingFaceM4/idefics-9b), both of which are available on the 🤗 Hub. For each variant, you can also find fine-tuned instructed +versions of the model adapted for conversational use cases. + +This model is exceptionally versatile and can be used for a wide range of image and multimodal tasks. However, +being a large model means it requires significant computational resources and infrastructure. It is up to you to decide whether +this approach suits your use case better than fine-tuning specialized models for each individual task. 
+ +In this guide, you'll learn how to: +- [Load IDEFICS](#loading-the-model) and [load the quantized version of the model](#quantized-model) +- Use IDEFICS for: + - [Image captioning](#image-captioning) + - [Prompted image captioning](#prompted-image-captioning) + - [Few-shot prompting](#few-shot-prompting) + - [Visual question answering](#visual-question-answering) + - [Image classification](#image-classification) + - [Image-guided text generation](#image-guided-text-generation) +- [Run inference in batch mode](#running-inference-in-batch-mode) +- [Run IDEFICS instruct for conversational use](#idefics-instruct-for-conversational-use) + +Before you begin, make sure you have all the necessary libraries installed. + +```bash +pip install -q bitsandbytes sentencepiece accelerate transformers +``` + + +To run the following examples with a non-quantized version of the model checkpoint you will need at least 20GB of GPU memory. + + +## Loading the model + +Let's start by loading the model's 9 billion parameters checkpoint: + +```py +>>> checkpoint = "HuggingFaceM4/idefics-9b" +``` + +Just like for other Transformers models, you need to load a processor and the model itself from the checkpoint. +The IDEFICS processor wraps a [`LlamaTokenizer`] and IDEFICS image processor into a single processor to take care of +preparing text and image inputs for the model. + +```py +>>> import torch + +>>> from transformers import IdeficsForVisionText2Text, AutoProcessor + +>>> processor = AutoProcessor.from_pretrained(checkpoint) + +>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto") +``` + +Setting `device_map` to `"auto"` will automatically determine how to load and store the model weights in the most optimized +manner given existing devices. + +### Quantized model + +If high-memory GPU availability is an issue, you can load the quantized version of the model. To load the model and the +processor in 4bit precision, pass a `BitsAndBytesConfig` to the `from_pretrained` method and the model will be compressed +on the fly while loading. + +```py +>>> import torch +>>> from transformers import IdeficsForVisionText2Text, AutoProcessor, BitsAndBytesConfig + +>>> quantization_config = BitsAndBytesConfig( +... load_in_4bit=True, +... bnb_4bit_compute_dtype=torch.float16, +... ) + +>>> processor = AutoProcessor.from_pretrained(checkpoint) + +>>> model = IdeficsForVisionText2Text.from_pretrained( +... checkpoint, +... quantization_config=quantization_config, +... device_map="auto" +... ) +``` + +Now that you have the model loaded in one of the suggested ways, let's move on to exploring tasks that you can use IDEFICS for. + +## Image captioning +Image captioning is the task of predicting a caption for a given image. A common application is to aid visually impaired +people navigate through different situations, for instance, explore image content online. + +To illustrate the task, get an image to be captioned, e.g.: + +
+ Image of a puppy in a flower bed +
+
+Photo by [Hendo Wang](https://unsplash.com/@hendoo).
+
+IDEFICS accepts text and image prompts. However, to caption an image, you do not have to provide a text prompt to the
+model, only the preprocessed input image. Without a text prompt, the model will start generating text from the
+BOS (beginning-of-sequence) token, thus creating a caption.
+
+As image input to the model, you can use either an image object (`PIL.Image`) or a URL from which the image can be retrieved.
+
+```py
+>>> prompt = [
+...     "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80",
+... ]
+
+>>> inputs = processor(prompt, return_tensors="pt").to("cuda")
+>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
+
+>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids)
+>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
+>>> print(generated_text[0])
+A puppy in a flower bed
+```
+
+
+
+It is a good idea to include the `bad_words_ids` in the call to `generate` to avoid errors arising when increasing
+the `max_new_tokens`: the model will want to generate a new `<image>` or `<fake_token_around_image>` token when there
+is no image being generated by the model.
+You can set it on the fly as in this guide, or store it in the `GenerationConfig` as described in the [Text generation strategies](../generation_strategies) guide.
+
+
+## Prompted image captioning
+
+You can extend image captioning by providing a text prompt, which the model will continue given the image. Let's take
+another image to illustrate:
+ Image of the Eiffel Tower at night +
+ +Photo by [Denys Nevozhai](https://unsplash.com/@dnevozhai). + +Textual and image prompts can be passed to the model's processor as a single list to create appropriate inputs. + +```py +>>> prompt = [ +... "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80", +... "This is an image of ", +... ] + +>>> inputs = processor(prompt, return_tensors="pt").to("cuda") +>>> bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids + +>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) +>>> print(generated_text[0]) +This is an image of the Eiffel Tower in Paris, France. +``` + +## Few-shot prompting + +While IDEFICS demonstrates great zero-shot results, your task may require a certain format of the caption, or come with +other restrictions or requirements that increase task's complexity. Few-shot prompting can be used to enable in-context learning. +By providing examples in the prompt, you can steer the model to generate results that mimic the format of given examples. + +Let's use the previous image of the Eiffel Tower as an example for the model and build a prompt that demonstrates to the model +that in addition to learning what the object in an image is, we would also like to get some interesting information about it. +Then, let's see, if we can get the same response format for an image of the Statue of Liberty: + +
+ Image of the Statue of Liberty +
+ +Photo by [Juan Mayobre](https://unsplash.com/@jmayobres). + +```py +>>> prompt = ["User:", +... "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80", +... "Describe this image.\nAssistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.\n", +... "User:", +... "https://images.unsplash.com/photo-1524099163253-32b7f0256868?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3387&q=80", +... "Describe this image.\nAssistant:" +... ] + +>>> inputs = processor(prompt, return_tensors="pt").to("cuda") +>>> bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids + +>>> generated_ids = model.generate(**inputs, max_new_tokens=30, bad_words_ids=bad_words_ids) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) +>>> print(generated_text[0]) +User: Describe this image. +Assistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building. +User: Describe this image. +Assistant: An image of the Statue of Liberty. Fun fact: the Statue of Liberty is 151 feet tall. +``` + +Notice that just from a single example (i.e., 1-shot) the model has learned how to perform the task. For more complex tasks, +feel free to experiment with a larger number of examples (e.g., 3-shot, 5-shot, etc.). + +## Visual question answering + +Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. Similar to image +captioning it can be used in accessibility applications, but also in education (reasoning about visual materials), customer +service (questions about products based on images), and image retrieval. + +Let's get a new image for this task: + +
+ Image of a couple having a picnic +
+ +Photo by [Jarritos Mexican Soda](https://unsplash.com/@jarritos). + +You can steer the model from image captioning to visual question answering by prompting it with appropriate instructions: + +```py +>>> prompt = [ +... "Instruction: Provide an answer to the question. Use the image to answer.\n", +... "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80", +... "Question: Where are these people and what's the weather like? Answer:" +... ] + +>>> inputs = processor(prompt, return_tensors="pt").to("cuda") +>>> bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids + +>>> generated_ids = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_ids) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) +>>> print(generated_text[0]) +Instruction: Provide an answer to the question. Use the image to answer. + Question: Where are these people and what's the weather like? Answer: They're in a park in New York City, and it's a beautiful day. +``` + +## Image classification + +IDEFICS is capable of classifying images into different categories without being explicitly trained on data containing +labeled examples from those specific categories. Given a list of categories and using its image and text understanding +capabilities, the model can infer which category the image likely belongs to. + +Say, we have this image of a vegetable stand: + +
+ Image of a vegetable stand +
+ +Photo by [Peter Wendt](https://unsplash.com/@peterwendt). + +We can instruct the model to classify the image into one of the categories that we have: + +```py +>>> categories = ['animals','vegetables', 'city landscape', 'cars', 'office'] +>>> prompt = [f"Instruction: Classify the following image into a single category from the following list: {categories}.\n", +... "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80", +... "Category: " +... ] + +>>> inputs = processor(prompt, return_tensors="pt").to("cuda") +>>> bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids + +>>> generated_ids = model.generate(**inputs, max_new_tokens=6, bad_words_ids=bad_words_ids) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) +>>> print(generated_text[0]) +Instruction: Classify the following image into a single category from the following list: ['animals', 'vegetables', 'city landscape', 'cars', 'office']. +Category: Vegetables +``` + +In the example above we instruct the model to classify the image into a single category, however, you can also prompt the model to do rank classification. + +## Image-guided text generation + +For more creative applications, you can use image-guided text generation to generate text based on an image. This can be +useful to create descriptions of products, ads, descriptions of a scene, etc. + +Let's prompt IDEFICS to write a story based on a simple image of a red door: + +
+ Image of a red door with a pumpkin on the steps +
+ +Photo by [Craig Tidball](https://unsplash.com/@devonshiremedia). + +```py +>>> prompt = ["Instruction: Use the image to write a story. \n", +... "https://images.unsplash.com/photo-1517086822157-2b0358e7684a?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=2203&q=80", +... "Story: \n"] + +>>> inputs = processor(prompt, return_tensors="pt").to("cuda") +>>> bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids + +>>> generated_ids = model.generate(**inputs, num_beams=2, max_new_tokens=200, bad_words_ids=bad_words_ids) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) +>>> print(generated_text[0]) +Instruction: Use the image to write a story. + Story: +Once upon a time, there was a little girl who lived in a house with a red door. She loved her red door. It was the prettiest door in the whole world. + +One day, the little girl was playing in her yard when she noticed a man standing on her doorstep. He was wearing a long black coat and a top hat. + +The little girl ran inside and told her mother about the man. + +Her mother said, “Don’t worry, honey. He’s just a friendly ghost.” + +The little girl wasn’t sure if she believed her mother, but she went outside anyway. + +When she got to the door, the man was gone. + +The next day, the little girl was playing in her yard again when she noticed the man standing on her doorstep. + +He was wearing a long black coat and a top hat. + +The little girl ran +``` + +Looks like IDEFICS noticed the pumpkin on the doorstep and went with a spooky Halloween story about a ghost. + + + +For longer outputs like this, you will greatly benefit from tweaking the text generation strategy. This can help +you significantly improve the quality of the generated output. Check out [Text generation strategies](../generation_strategies) +to learn more. + + +## Running inference in batch mode + +All of the earlier sections illustrated IDEFICS for a single example. In a very similar fashion, you can run inference +for a batch of examples by passing a list of prompts: + +```py +>>> prompts = [ +... [ "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80", +... "This is an image of ", +... ], +... [ "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80", +... "This is an image of ", +... ], +... [ "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80", +... "This is an image of ", +... ], +... ] + +>>> inputs = processor(prompts, return_tensors="pt").to("cuda") +>>> bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids + +>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) +>>> for i,t in enumerate(generated_text): +... print(f"{i}:\n{t}\n") +0: +This is an image of the Eiffel Tower in Paris, France. + +1: +This is an image of a couple on a picnic blanket. + +2: +This is an image of a vegetable stand. 
+```
+
+## IDEFICS instruct for conversational use
+
+For conversational use cases, you can find fine-tuned instructed versions of the model on the 🤗 Hub:
+`HuggingFaceM4/idefics-80b-instruct` and `HuggingFaceM4/idefics-9b-instruct`.
+
+These checkpoints are the result of fine-tuning the respective base models on a mixture of supervised and instruction
+fine-tuning datasets, which boosts the downstream performance while making the models more usable in conversational settings.
+
+Usage and prompting for the conversational checkpoints are very similar to the base models:
+
+```py
+>>> import torch
+>>> from transformers import IdeficsForVisionText2Text, AutoProcessor
+
+>>> device = "cuda" if torch.cuda.is_available() else "cpu"
+
+>>> checkpoint = "HuggingFaceM4/idefics-9b-instruct"
+>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
+>>> processor = AutoProcessor.from_pretrained(checkpoint)
+
+>>> prompts = [
+...     [
+...         "User: What is in this image?",
+...         "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
+...         "<end_of_utterance>",
+
+...         "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>",
+
+...         "\nUser:",
+...         "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052",
+...         "And who is that?<end_of_utterance>",
+
+...         "\nAssistant:",
+...     ],
+... ]
+
+>>> # --batched mode
+>>> inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(device)
+>>> # --single sample mode
+>>> # inputs = processor(prompts[0], return_tensors="pt").to(device)
+
+>>> # Generation args
+>>> exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
+>>> bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids
+
+>>> generated_ids = model.generate(**inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_length=100)
+>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
+>>> for i, t in enumerate(generated_text):
+...     print(f"{i}:\n{t}\n")
+```
diff --git a/docs/source/en/tasks/image_captioning.mdx b/docs/source/en/tasks/image_captioning.md
similarity index 97%
rename from docs/source/en/tasks/image_captioning.mdx
rename to docs/source/en/tasks/image_captioning.md
index 2922de0549f0e0..b426cbf6383187 100644
--- a/docs/source/en/tasks/image_captioning.mdx
+++ b/docs/source/en/tasks/image_captioning.md
@@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
 specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
 -->
@@ -69,7 +73,7 @@ Many image captioning datasets contain multiple captions per image. In those cas
-Split the dataset’s train split into a train and test set with the [~datasets.Dataset.train_test_split] method: +Split the dataset’s train split into a train and test set with the [`~datasets.Dataset.train_test_split`] method: ```python diff --git a/docs/source/en/tasks/image_classification.mdx b/docs/source/en/tasks/image_classification.md similarity index 90% rename from docs/source/en/tasks/image_classification.mdx rename to docs/source/en/tasks/image_classification.md index 43041c45c36639..c1817780a1621b 100644 --- a/docs/source/en/tasks/image_classification.mdx +++ b/docs/source/en/tasks/image_classification.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Image classification @@ -30,7 +34,8 @@ The task illustrated in this tutorial is supported by the following model archit -[BEiT](../model_doc/beit), [BiT](../model_doc/bit), [ConvNeXT](../model_doc/convnext), [CvT](../model_doc/cvt), [Data2VecVision](../model_doc/data2vec-vision), [DeiT](../model_doc/deit), [DiNAT](../model_doc/dinat), [EfficientFormer](../model_doc/efficientformer), [ImageGPT](../model_doc/imagegpt), [LeViT](../model_doc/levit), [MobileNetV1](../model_doc/mobilenet_v1), [MobileNetV2](../model_doc/mobilenet_v2), [MobileViT](../model_doc/mobilevit), [NAT](../model_doc/nat), [Perceiver](../model_doc/perceiver), [PoolFormer](../model_doc/poolformer), [RegNet](../model_doc/regnet), [ResNet](../model_doc/resnet), [SegFormer](../model_doc/segformer), [Swin Transformer](../model_doc/swin), [Swin Transformer V2](../model_doc/swinv2), [VAN](../model_doc/van), [ViT](../model_doc/vit), [ViT Hybrid](../model_doc/vit_hybrid), [ViTMSN](../model_doc/vit_msn) +[BEiT](../model_doc/beit), [BiT](../model_doc/bit), [CLIP](../model_doc/clip), [ConvNeXT](../model_doc/convnext), [ConvNeXTV2](../model_doc/convnextv2), [CvT](../model_doc/cvt), [Data2VecVision](../model_doc/data2vec-vision), [DeiT](../model_doc/deit), [DiNAT](../model_doc/dinat), [DINOv2](../model_doc/dinov2), [EfficientFormer](../model_doc/efficientformer), [EfficientNet](../model_doc/efficientnet), [FocalNet](../model_doc/focalnet), [ImageGPT](../model_doc/imagegpt), [LeViT](../model_doc/levit), [MobileNetV1](../model_doc/mobilenet_v1), [MobileNetV2](../model_doc/mobilenet_v2), [MobileViT](../model_doc/mobilevit), [MobileViTV2](../model_doc/mobilevitv2), [NAT](../model_doc/nat), [Perceiver](../model_doc/perceiver), [PoolFormer](../model_doc/poolformer), [PVT](../model_doc/pvt), [RegNet](../model_doc/regnet), [ResNet](../model_doc/resnet), [SegFormer](../model_doc/segformer), [SigLIP](../model_doc/siglip), [SwiftFormer](../model_doc/swiftformer), [Swin Transformer](../model_doc/swin), [Swin Transformer V2](../model_doc/swinv2), [VAN](../model_doc/van), [ViT](../model_doc/vit), [ViT Hybrid](../model_doc/vit_hybrid), [ViTMSN](../model_doc/vit_msn) + @@ -289,7 +294,7 @@ You're ready to start training your model now! Load ViT with [`AutoModelForImage At this point, only three steps remain: -1. Define your training hyperparameters in [`TrainingArguments`]. 
It is important you don't remove unused columns because this'll drop the `image` column. Without the `image` column, you can't create `pixel_values`. Set `remove_unused_columns=False` to prevent this behavior! The only other required parameter is `output_dir` which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [`Trainer`] will evaluate the accuracy and save the training checkpoint. +1. Define your training hyperparameters in [`TrainingArguments`]. It is important you don't remove unused columns because that'll drop the `image` column. Without the `image` column, you can't create `pixel_values`. Set `remove_unused_columns=False` to prevent this behavior! The only other required parameter is `output_dir` which specifies where to save your model. You'll push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [`Trainer`] will evaluate the accuracy and save the training checkpoint. 2. Pass the training arguments to [`Trainer`] along with the model, dataset, tokenizer, data collator, and `compute_metrics` function. 3. Call [`~Trainer.train`] to finetune your model. @@ -343,7 +348,7 @@ If you are unfamiliar with fine-tuning a model with Keras, check out the [basic To fine-tune a model in TensorFlow, follow these steps: 1. Define the training hyperparameters, and set up an optimizer and a learning rate schedule. -2. Instantiate a pre-treined model. +2. Instantiate a pre-trained model. 3. Convert a 🤗 Dataset to a `tf.data.Dataset`. 4. Compile your model. 5. Add callbacks and use the `fit()` method to run the training. @@ -385,12 +390,12 @@ Convert your datasets to the `tf.data.Dataset` format using the [`~datasets.Data ```py >>> # converting our train dataset to tf.data.Dataset >>> tf_train_dataset = food["train"].to_tf_dataset( -... columns=["pixel_values"], label_cols=["label"], shuffle=True, batch_size=batch_size, collate_fn=data_collator +... columns="pixel_values", label_cols="label", shuffle=True, batch_size=batch_size, collate_fn=data_collator ... ) >>> # converting our test dataset to tf.data.Dataset >>> tf_eval_dataset = food["test"].to_tf_dataset( -... columns=["pixel_values"], label_cols=["label"], shuffle=True, batch_size=batch_size, collate_fn=data_collator +... columns="pixel_values", label_cols="label", shuffle=True, batch_size=batch_size, collate_fn=data_collator ... ) ``` @@ -403,9 +408,9 @@ Configure the model for training with `compile()`: >>> model.compile(optimizer=optimizer, loss=loss) ``` -To compute the accuracy from the predictions and push your model to the 🤗 Hub, use [Keras callbacks](./main_classes/keras_callbacks). -Pass your `compute_metrics` function to [KerasMetricCallback](./main_classes/keras_callbacks#transformers.KerasMetricCallback), -and use the [PushToHubCallback](./main_classes/keras_callbacks#transformers.PushToHubCallback) to upload the model: +To compute the accuracy from the predictions and push your model to the 🤗 Hub, use [Keras callbacks](../main_classes/keras_callbacks). 
+Pass your `compute_metrics` function to [KerasMetricCallback](../main_classes/keras_callbacks#transformers.KerasMetricCallback),
+and use the [PushToHubCallback](../main_classes/keras_callbacks#transformers.PushToHubCallback) to upload the model:
```py
>>> from transformers.keras_callbacks import KerasMetricCallback, PushToHubCallback
diff --git a/docs/source/en/tasks/image_to_image.md b/docs/source/en/tasks/image_to_image.md
new file mode 100644
index 00000000000000..6a11b515947c24
--- /dev/null
+++ b/docs/source/en/tasks/image_to_image.md
@@ -0,0 +1,132 @@
+
+
+# Image-to-Image Task Guide
+
+[[open-in-colab]]
+
+Image-to-image is a task where an application receives an image and outputs another image. It has various subtasks, including image enhancement (super resolution, low light enhancement, deraining and so on), image inpainting, and more.
+
+This guide will show you how to:
+- Use an image-to-image pipeline for a super resolution task,
+- Run image-to-image models for the same task without a pipeline.
+
+Note that as of the time this guide is released, the `image-to-image` pipeline only supports the super resolution task.
+
+Let's begin by installing the necessary libraries.
+
+```bash
+pip install transformers
+```
+
+We can now initialize the pipeline with a [Swin2SR model](https://huggingface.co/caidas/swin2SR-lightweight-x2-64). We can then run inference with the pipeline by calling it with an image. As of now, only [Swin2SR models](https://huggingface.co/models?sort=trending&search=swin2sr) are supported in this pipeline.
+
+```python
+import torch
+from transformers import pipeline
+
+device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+pipe = pipeline(task="image-to-image", model="caidas/swin2SR-lightweight-x2-64", device=device)
+```
+
+Now, let's load an image.
+
+```python
+from PIL import Image
+import requests
+
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/cat.jpg"
+image = Image.open(requests.get(url, stream=True).raw)
+
+print(image.size)
+```
+```bash
+# (532, 432)
+```
+[Image: Photo of a cat]
+
+We can now do inference with the pipeline. We will get an upscaled version of the cat image.
+
+```python
+upscaled = pipe(image)
+print(upscaled.size)
+```
+```bash
+# (1072, 880)
+```
+
+If you wish to run inference yourself without the pipeline, you can use the `Swin2SRForImageSuperResolution` and `Swin2SRImageProcessor` classes of transformers. We will use the same model checkpoint for this. Let's initialize the model and the processor.
+
+```python
+from transformers import Swin2SRForImageSuperResolution, Swin2SRImageProcessor
+
+model = Swin2SRForImageSuperResolution.from_pretrained("caidas/swin2SR-lightweight-x2-64").to(device)
+processor = Swin2SRImageProcessor.from_pretrained("caidas/swin2SR-lightweight-x2-64")
+```
+
+`pipeline` abstracts away the preprocessing and postprocessing steps that we have to do ourselves, so let's preprocess the image. We will pass the image to the processor and then move the pixel values to GPU.
+
+```python
+pixel_values = processor(image, return_tensors="pt").pixel_values
+print(pixel_values.shape)
+
+pixel_values = pixel_values.to(device)
+```
+
+We can now run inference by passing the pixel values to the model.
+
+```python
+import torch
+
+with torch.no_grad():
+  outputs = model(pixel_values)
+```
+The output is an object of type `ImageSuperResolutionOutput` that looks like below 👇
+
+```
+(loss=None, reconstruction=tensor([[[[0.8270, 0.8269, 0.8275,  ..., 0.7463, 0.7446, 0.7453],
+          [0.8287, 0.8278, 0.8283,  ..., 0.7451, 0.7448, 0.7457],
+          [0.8280, 0.8273, 0.8269,  ..., 0.7447, 0.7446, 0.7452],
+          ...,
+          [0.5923, 0.5933, 0.5924,  ..., 0.0697, 0.0695, 0.0706],
+          [0.5926, 0.5932, 0.5926,  ..., 0.0673, 0.0687, 0.0705],
+          [0.5927, 0.5914, 0.5922,  ..., 0.0664, 0.0694, 0.0718]]]],
+       device='cuda:0'), hidden_states=None, attentions=None)
+```
+We need to get the `reconstruction` and post-process it for visualization. Let's see how it looks.
+
+```python
+outputs.reconstruction.data.shape
+# torch.Size([1, 3, 880, 1072])
+```
+
+We need to squeeze the output to get rid of axis 0, clip the values, and convert the tensor to a NumPy float array. Then we move the channel axis to the end, which gives an array of shape [880, 1072, 3], and finally bring the values back to the [0, 255] pixel range.
+
+```python
+import numpy as np
+
+# squeeze, take to CPU and clip the values
+output = outputs.reconstruction.data.squeeze().cpu().clamp_(0, 1).numpy()
+# rearrange the axes
+output = np.moveaxis(output, source=0, destination=-1)
+# bring values back to pixel values range
+output = (output * 255.0).round().astype(np.uint8)
+Image.fromarray(output)
+```
+[Image: Upscaled photo of a cat]
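+If you plan to upscale several images, the manual steps above can be wrapped into a small helper. This is only a convenience sketch that reuses the `model`, `processor`, and `device` objects defined earlier in this guide; the `upscale` function name is ours, not part of the Transformers API.
+
+```python
+import numpy as np
+import torch
+from PIL import Image
+
+def upscale(image: Image.Image) -> Image.Image:
+    """Preprocess, run Swin2SR, and post-process back to a PIL image."""
+    pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)
+    with torch.no_grad():
+        outputs = model(pixel_values)
+    # same post-processing as above: drop the batch axis, clip, move channels last, scale to [0, 255]
+    array = outputs.reconstruction.data.squeeze().cpu().clamp_(0, 1).numpy()
+    array = np.moveaxis(array, source=0, destination=-1)
+    return Image.fromarray((array * 255.0).round().astype(np.uint8))
+
+print(upscale(image).size)
+```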
diff --git a/docs/source/en/tasks/knowledge_distillation_for_image_classification.md b/docs/source/en/tasks/knowledge_distillation_for_image_classification.md
new file mode 100644
index 00000000000000..8448e53011494c
--- /dev/null
+++ b/docs/source/en/tasks/knowledge_distillation_for_image_classification.md
@@ -0,0 +1,186 @@
+
+# Knowledge Distillation for Computer Vision
+
+[[open-in-colab]]
+
+Knowledge distillation is a technique used to transfer knowledge from a larger, more complex model (teacher) to a smaller, simpler model (student). To distill knowledge from one model to another, we take a pre-trained teacher model trained on a certain task (image classification in this case) and randomly initialize a student model to be trained on image classification. Next, we train the student model to minimize the difference between its outputs and the teacher's outputs, thus making it mimic the teacher's behavior. This technique was first introduced in [Distilling the Knowledge in a Neural Network by Hinton et al](https://arxiv.org/abs/1503.02531). In this guide, we will do task-specific knowledge distillation. We will use the [beans dataset](https://huggingface.co/datasets/beans) for this.
+
+This guide demonstrates how you can distill a [fine-tuned ViT model](https://huggingface.co/merve/vit-mobilenet-beans-224) (teacher model) to a [MobileNet](https://huggingface.co/google/mobilenet_v2_1.4_224) (student model) using the [Trainer API](https://huggingface.co/docs/transformers/en/main_classes/trainer#trainer) of 🤗 Transformers.
+
+Let's install the libraries needed for distillation and evaluating the process.
+
+```bash
+pip install transformers datasets accelerate tensorboard evaluate --upgrade
+```
+
+In this example, we are using the `merve/beans-vit-224` model as the teacher model. It's an image classification model, based on `google/vit-base-patch16-224-in21k`, fine-tuned on the beans dataset. We will distill this model to a randomly initialized MobileNetV2.
+
+We will now load the dataset.
+
+```python
+from datasets import load_dataset
+
+dataset = load_dataset("beans")
+```
+
+We can use an image processor from either of the models, as in this case they return the same output with the same resolution. We will use the `map()` method of `dataset` to apply the preprocessing to every split of the dataset.
+
+```python
+from transformers import AutoImageProcessor
+teacher_processor = AutoImageProcessor.from_pretrained("merve/beans-vit-224")
+
+def process(examples):
+    processed_inputs = teacher_processor(examples["image"])
+    return processed_inputs
+
+processed_datasets = dataset.map(process, batched=True)
+```
+
+Essentially, we want the student model (a randomly initialized MobileNet) to mimic the teacher model (fine-tuned vision transformer). To achieve this, we first get the logits output from the teacher and the student. Then, we divide each of them by the parameter `temperature`, which controls the importance of each soft target. A parameter called `lambda` weighs the importance of the distillation loss. In this example, we will use `temperature=5` and `lambda=0.5`. We will use the Kullback-Leibler divergence loss to compute the divergence between the student and teacher. Given two distributions P and Q, KL divergence measures how much extra information we need to represent P using Q. If the two are identical, their KL divergence is zero, as no extra information is needed to explain P from Q. Thus, in the context of knowledge distillation, KL divergence is useful.
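+Before wiring this loss into a trainer, here is a minimal, self-contained sketch of the soft-target term described above. The logits below are made-up numbers used purely for illustration; the `temperature ** 2` scaling follows Hinton et al.
+
+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+# Toy logits standing in for teacher and student outputs (batch of 2 examples, 3 classes)
+teacher_logits = torch.tensor([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]])
+student_logits = torch.tensor([[1.0, 0.8, 0.2], [0.1, 0.9, 0.5]])
+
+temperature = 5.0
+lambda_param = 0.5
+
+# Soften both distributions with the temperature
+soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
+soft_student = F.log_softmax(student_logits / temperature, dim=-1)
+
+# KL divergence between the softened distributions, scaled by temperature^2
+kl_loss = nn.KLDivLoss(reduction="batchmean")
+distillation_loss = kl_loss(soft_student, soft_teacher) * temperature**2
+
+# The final training loss mixes this term with the student's regular classification loss:
+# loss = (1 - lambda_param) * student_target_loss + lambda_param * distillation_loss
+print(distillation_loss)
+```
+
+The custom trainer below implements exactly this computation inside `compute_loss`, combined with the student's label loss.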
+
+
+```python
+from transformers import TrainingArguments, Trainer
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+
+class ImageDistilTrainer(Trainer):
+    def __init__(self, teacher_model=None, student_model=None, temperature=None, lambda_param=None, *args, **kwargs):
+        super().__init__(model=student_model, *args, **kwargs)
+        self.teacher = teacher_model
+        self.student = student_model
+        self.loss_function = nn.KLDivLoss(reduction="batchmean")
+        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+        self.teacher.to(device)
+        self.teacher.eval()
+        self.temperature = temperature
+        self.lambda_param = lambda_param
+
+    def compute_loss(self, student, inputs, return_outputs=False):
+        student_output = self.student(**inputs)
+
+        with torch.no_grad():
+            teacher_output = self.teacher(**inputs)
+
+        # Compute soft targets for teacher and student
+        soft_teacher = F.softmax(teacher_output.logits / self.temperature, dim=-1)
+        soft_student = F.log_softmax(student_output.logits / self.temperature, dim=-1)
+
+        # Compute the loss
+        distillation_loss = self.loss_function(soft_student, soft_teacher) * (self.temperature ** 2)
+
+        # Compute the true label loss
+        student_target_loss = student_output.loss
+
+        # Calculate final loss
+        loss = (1. - self.lambda_param) * student_target_loss + self.lambda_param * distillation_loss
+        return (loss, student_output) if return_outputs else loss
+```
+
+We will now log in to the Hugging Face Hub so we can push our model to the Hub through the `Trainer`.
+
+```python
+from huggingface_hub import notebook_login
+
+notebook_login()
+```
+
+Let's set the `TrainingArguments`, the teacher model and the student model.
+
+```python
+from transformers import AutoModelForImageClassification, MobileNetV2Config, MobileNetV2ForImageClassification
+
+repo_name = "my-awesome-model"  # repository name used for logging and for pushing to the Hub
+
+training_args = TrainingArguments(
+    output_dir=repo_name,
+    num_train_epochs=30,
+    fp16=True,
+    logging_dir=f"{repo_name}/logs",
+    logging_strategy="epoch",
+    evaluation_strategy="epoch",
+    save_strategy="epoch",
+    load_best_model_at_end=True,
+    metric_for_best_model="accuracy",
+    report_to="tensorboard",
+    push_to_hub=True,
+    hub_strategy="every_save",
+    hub_model_id=repo_name,
+    )
+
+num_labels = len(processed_datasets["train"].features["labels"].names)
+
+# initialize models
+teacher_model = AutoModelForImageClassification.from_pretrained(
+    "merve/beans-vit-224",
+    num_labels=num_labels,
+    ignore_mismatched_sizes=True
+)
+
+# training MobileNetV2 from scratch
+student_config = MobileNetV2Config()
+student_config.num_labels = num_labels
+student_model = MobileNetV2ForImageClassification(student_config)
+```
+
+We can use the `compute_metrics` function to evaluate our model on the test set. This function will be used during the training process to compute the `accuracy` of our model.
+
+```python
+import evaluate
+import numpy as np
+
+accuracy = evaluate.load("accuracy")
+
+def compute_metrics(eval_pred):
+    predictions, labels = eval_pred
+    acc = accuracy.compute(references=labels, predictions=np.argmax(predictions, axis=1))
+    return {"accuracy": acc["accuracy"]}
+```
+
+Let's initialize the `Trainer` with the training arguments we defined. We will also initialize our data collator.
+
+```python
+from transformers import DefaultDataCollator
+
+data_collator = DefaultDataCollator()
+trainer = ImageDistilTrainer(
+    student_model=student_model,
+    teacher_model=teacher_model,
+    args=training_args,
+    train_dataset=processed_datasets["train"],
+    eval_dataset=processed_datasets["validation"],
+    data_collator=data_collator,
+    tokenizer=teacher_processor,
+    compute_metrics=compute_metrics,
+    temperature=5,
+    lambda_param=0.5
+)
+```
+
+We can now train our model.
+
+```python
+trainer.train()
+```
+
+We can evaluate the model on the test set.
+
+```python
+trainer.evaluate(processed_datasets["test"])
+```
+
+On the test set, our model reaches 72 percent accuracy. As a sanity check on the efficiency of distillation, we also trained MobileNet on the beans dataset from scratch with the same hyperparameters and observed 63 percent accuracy on the test set. We invite the readers to try different pre-trained teacher models, student architectures, distillation parameters and report their findings. The training logs and checkpoints for the distilled model can be found in [this repository](https://huggingface.co/merve/vit-mobilenet-beans-224), and MobileNetV2 trained from scratch can be found in this [repository](https://huggingface.co/merve/resnet-mobilenet-beans-5).
diff --git a/docs/source/en/tasks/language_modeling.mdx b/docs/source/en/tasks/language_modeling.md
similarity index 61%
rename from docs/source/en/tasks/language_modeling.mdx
rename to docs/source/en/tasks/language_modeling.md
index ff1911ff2bffb9..4022867a027af7 100644
--- a/docs/source/en/tasks/language_modeling.mdx
+++ b/docs/source/en/tasks/language_modeling.md
@@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
-->
# Causal language modeling
@@ -25,7 +29,7 @@ the left. This means the model cannot see future tokens. GPT-2 is an example of
This guide will show you how to:
-1. Finetune [DistilGPT2](https://huggingface.co/distilgpt2) on the [r/askscience](https://www.reddit.com/r/askscience/) subset of the [ELI5](https://huggingface.co/datasets/eli5) dataset.
+1. Finetune [DistilGPT2](https://huggingface.co/distilbert/distilgpt2) on the [r/askscience](https://www.reddit.com/r/askscience/) subset of the [ELI5](https://huggingface.co/datasets/eli5) dataset.
2. Use your finetuned model for inference.
@@ -33,8 +37,9 @@ You can finetune other architectures for causal language modeling following the Choose one of the following architectures: +[BART](../model_doc/bart), [BERT](../model_doc/bert), [Bert Generation](../model_doc/bert-generation), [BigBird](../model_doc/big_bird), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [BioGpt](../model_doc/biogpt), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [BLOOM](../model_doc/bloom), [CamemBERT](../model_doc/camembert), [CodeLlama](../model_doc/code_llama), [CodeGen](../model_doc/codegen), [CPM-Ant](../model_doc/cpmant), [CTRL](../model_doc/ctrl), [Data2VecText](../model_doc/data2vec-text), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [Falcon](../model_doc/falcon), [Fuyu](../model_doc/fuyu), [GIT](../model_doc/git), [GPT-Sw3](../model_doc/gpt-sw3), [OpenAI GPT-2](../model_doc/gpt2), [GPTBigCode](../model_doc/gpt_bigcode), [GPT Neo](../model_doc/gpt_neo), [GPT NeoX](../model_doc/gpt_neox), [GPT NeoX Japanese](../model_doc/gpt_neox_japanese), [GPT-J](../model_doc/gptj), [LLaMA](../model_doc/llama), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [Mistral](../model_doc/mistral), [Mixtral](../model_doc/mixtral), [MPT](../model_doc/mpt), [MusicGen](../model_doc/musicgen), [MVP](../model_doc/mvp), [OpenLlama](../model_doc/open-llama), [OpenAI GPT](../model_doc/openai-gpt), [OPT](../model_doc/opt), [Pegasus](../model_doc/pegasus), [Persimmon](../model_doc/persimmon), [Phi](../model_doc/phi), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [QDQBert](../model_doc/qdqbert), [Qwen2](../model_doc/qwen2), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [RWKV](../model_doc/rwkv), [Speech2Text2](../model_doc/speech_to_text_2), [StableLm](../model_doc/stablelm), [Transformer-XL](../model_doc/transfo-xl), [TrOCR](../model_doc/trocr), [Whisper](../model_doc/whisper), [XGLM](../model_doc/xglm), [XLM](../model_doc/xlm), [XLM-ProphetNet](../model_doc/xlm-prophetnet), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod) + -[BART](../model_doc/bart), [BERT](../model_doc/bert), [Bert Generation](../model_doc/bert-generation), [BigBird](../model_doc/big_bird), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [BioGpt](../model_doc/biogpt), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [BLOOM](../model_doc/bloom), [CamemBERT](../model_doc/camembert), [CodeGen](../model_doc/codegen), [CTRL](../model_doc/ctrl), [Data2VecText](../model_doc/data2vec-text), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [GIT](../model_doc/git), [GPT-Sw3](../model_doc/gpt-sw3), [OpenAI GPT-2](../model_doc/gpt2), [GPT Neo](../model_doc/gpt_neo), [GPT NeoX](../model_doc/gpt_neox), [GPT NeoX Japanese](../model_doc/gpt_neox_japanese), [GPT-J](../model_doc/gptj), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [Megatron-BERT](../model_doc/megatron-bert), [MVP](../model_doc/mvp), [OpenAI GPT](../model_doc/openai-gpt), [OPT](../model_doc/opt), [Pegasus](../model_doc/pegasus), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [QDQBert](../model_doc/qdqbert), 
[Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [Speech2Text2](../model_doc/speech_to_text_2), [Transformer-XL](../model_doc/transfo-xl), [TrOCR](../model_doc/trocr), [XGLM](../model_doc/xglm), [XLM](../model_doc/xlm), [XLM-ProphetNet](../model_doc/xlm-prophetnet), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod) @@ -56,16 +61,15 @@ We encourage you to log in to your Hugging Face account so you can upload and sh ## Load ELI5 dataset -Start by loading a smaller subset of the r/askscience subset of the ELI5 dataset from the 🤗 Datasets library. - This'll give you a chance to experiment and make sure everything works before spending more time training on the full dataset. +Start by loading the first 5000 examples from the [ELI5-Category](https://huggingface.co/datasets/eli5_category) dataset with the 🤗 Datasets library. This'll give you a chance to experiment and make sure everything works before spending more time training on the full dataset. ```py >>> from datasets import load_dataset ->>> eli5 = load_dataset("eli5", split="train_asks[:5000]") +>>> eli5 = load_dataset("eli5_category", split="train[:5000]") ``` -Split the dataset's `train_asks` split into a train and test set with the [`~datasets.Dataset.train_test_split`] method: +Split the dataset's `train` split into a train and test set with the [`~datasets.Dataset.train_test_split`] method: ```py >>> eli5 = eli5.train_test_split(test_size=0.2) @@ -75,18 +79,23 @@ Then take a look at an example: ```py >>> eli5["train"][0] -{'answers': {'a_id': ['c3d1aib', 'c3d4lya'], - 'score': [6, 3], - 'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.", - "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]}, - 'answers_urls': {'url': []}, - 'document': '', - 'q_id': 'nyxfp', - 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?', - 'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']}, - 'subreddit': 'askscience', - 'title': 'Few questions about this space walk photograph.', - 'title_urls': {'url': []}} +{'q_id': '7h191n', + 'title': 'What does the tax bill that was passed today mean? 
How will it affect Americans in each tax bracket?', + 'selftext': '', + 'category': 'Economics', + 'subreddit': 'explainlikeimfive', + 'answers': {'a_id': ['dqnds8l', 'dqnd1jl', 'dqng3i1', 'dqnku5x'], + 'text': ["The tax bill is 500 pages long and there were a lot of changes still going on right to the end. It's not just an adjustment to the income tax brackets, it's a whole bunch of changes. As such there is no good answer to your question. The big take aways are: - Big reduction in corporate income tax rate will make large companies very happy. - Pass through rate change will make certain styles of business (law firms, hedge funds) extremely happy - Income tax changes are moderate, and are set to expire (though it's the kind of thing that might just always get re-applied without being made permanent) - People in high tax states (California, New York) lose out, and many of them will end up with their taxes raised.", + 'None yet. It has to be reconciled with a vastly different house bill and then passed again.', + 'Also: does this apply to 2017 taxes? Or does it start with 2018 taxes?', + 'This article explains both the House and senate bills, including the proposed changes to your income taxes based on your income level. URL_0'], + 'score': [21, 19, 5, 3], + 'text_urls': [[], + [], + [], + ['https://www.investopedia.com/news/trumps-tax-reform-what-can-be-done/']]}, + 'title_urls': ['url'], + 'selftext_urls': ['url']} ``` While this may look like a lot, you're only really interested in the `text` field. What's cool about language modeling @@ -101,40 +110,45 @@ The next step is to load a DistilGPT2 tokenizer to process the `text` subfield: ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("distilgpt2") +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2") ``` You'll notice from the example above, the `text` field is actually nested inside `answers`. This means you'll need to -extract the `text` subfield from its nested structure with the [`flatten`](https://huggingface.co/docs/datasets/process.html#flatten) method: +extract the `text` subfield from its nested structure with the [`flatten`](https://huggingface.co/docs/datasets/process#flatten) method: ```py >>> eli5 = eli5.flatten() >>> eli5["train"][0] -{'answers.a_id': ['c3d1aib', 'c3d4lya'], - 'answers.score': [6, 3], - 'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.", - "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"], - 'answers_urls.url': [], - 'document': '', - 'q_id': 'nyxfp', - 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? 
And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?', - 'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'], - 'subreddit': 'askscience', - 'title': 'Few questions about this space walk photograph.', - 'title_urls.url': []} +{'q_id': '7h191n', + 'title': 'What does the tax bill that was passed today mean? How will it affect Americans in each tax bracket?', + 'selftext': '', + 'category': 'Economics', + 'subreddit': 'explainlikeimfive', + 'answers.a_id': ['dqnds8l', 'dqnd1jl', 'dqng3i1', 'dqnku5x'], + 'answers.text': ["The tax bill is 500 pages long and there were a lot of changes still going on right to the end. It's not just an adjustment to the income tax brackets, it's a whole bunch of changes. As such there is no good answer to your question. The big take aways are: - Big reduction in corporate income tax rate will make large companies very happy. - Pass through rate change will make certain styles of business (law firms, hedge funds) extremely happy - Income tax changes are moderate, and are set to expire (though it's the kind of thing that might just always get re-applied without being made permanent) - People in high tax states (California, New York) lose out, and many of them will end up with their taxes raised.", + 'None yet. It has to be reconciled with a vastly different house bill and then passed again.', + 'Also: does this apply to 2017 taxes? Or does it start with 2018 taxes?', + 'This article explains both the House and senate bills, including the proposed changes to your income taxes based on your income level. URL_0'], + 'answers.score': [21, 19, 5, 3], + 'answers.text_urls': [[], + [], + [], + ['https://www.investopedia.com/news/trumps-tax-reform-what-can-be-done/']], + 'title_urls': ['url'], + 'selftext_urls': ['url']} ``` Each subfield is now a separate column as indicated by the `answers` prefix, and the `text` field is a list now. Instead of tokenizing each sentence separately, convert the list to a string so you can jointly tokenize them. -Here is how you can create a preprocessing function to convert the list to a string, and truncate sequences to be no longer than DistilGPT2's maximum input length: +Here is a first preprocessing function to join the list of strings for each example and tokenize the result: ```py >>> def preprocess_function(examples): -... return tokenizer([" ".join(x) for x in examples["answers.text"]], truncation=True) +... return tokenizer([" ".join(x) for x in examples["answers.text"]]) ``` -To apply the preprocessing function over the entire dataset, use 🤗 Datasets [`~datasets.Dataset.with_transform`] method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once, and increasing the number of processes with `num_proc`. Remove any columns you don't need: +To apply this preprocessing function over the entire dataset, use the 🤗 Datasets [`~datasets.Dataset.map`] method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once, and increasing the number of processes with `num_proc`. Remove any columns you don't need: ```py >>> tokenized_eli5 = eli5.map( @@ -145,19 +159,26 @@ To apply the preprocessing function over the entire dataset, use 🤗 Datasets [ ... ) ``` -Now you'll need a second preprocessing function to capture text truncated from the lengthier examples to avoid losing any information. 
This preprocessing function should: +This dataset contains the token sequences, but some of these are longer than the maximum input length for the model. + +You can now use a second preprocessing function to -- Concatenate all the text. -- Split the concatenated text into smaller chunks defined by `block_size`. +- concatenate all the sequences +- split the concatenated sequences into shorter chunks defined by `block_size`, which should be both shorter than the maximum input length and short enough for your GPU RAM. ```py >>> block_size = 128 >>> def group_texts(examples): +... # Concatenate all texts. ... concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()} ... total_length = len(concatenated_examples[list(examples.keys())[0]]) -... total_length = (total_length // block_size) * block_size +... # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can +... # customize this part to your needs. +... if total_length >= block_size: +... total_length = (total_length // block_size) * block_size +... # Split by chunks of block_size. ... result = { ... k: [t[i : i + block_size] for i in range(0, total_length, block_size)] ... for k, t in concatenated_examples.items() @@ -215,7 +236,7 @@ You're ready to start training your model now! Load DistilGPT2 with [`AutoModelF ```py >>> from transformers import AutoModelForCausalLM, TrainingArguments, Trainer ->>> model = AutoModelForCausalLM.from_pretrained("distilgpt2") +>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2") ``` At this point, only three steps remain: @@ -279,7 +300,7 @@ Then you can load DistilGPT2 with [`TFAutoModelForCausalLM`]: ```py >>> from transformers import TFAutoModelForCausalLM ->>> model = TFAutoModelForCausalLM.from_pretrained("distilgpt2") +>>> model = TFAutoModelForCausalLM.from_pretrained("distilbert/distilgpt2") ``` Convert your datasets to the `tf.data.Dataset` format with [`~transformers.TFPreTrainedModel.prepare_tf_dataset`]: @@ -300,12 +321,12 @@ Convert your datasets to the `tf.data.Dataset` format with [`~transformers.TFPre ... ) ``` -Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method): +Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method). Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to: ```py >>> import tensorflow as tf ->>> model.compile(optimizer=optimizer) +>>> model.compile(optimizer=optimizer) # No loss argument! 
``` This can be done by specifying where to push your model and tokenizer in the [`~transformers.PushToHubCallback`]: @@ -352,7 +373,7 @@ The simplest way to try out your finetuned model for inference is to use it in a ```py >>> from transformers import pipeline ->>> generator = pipeline("text-generation", model="my_awesome_eli5_clm-model") +>>> generator = pipeline("text-generation", model="username/my_awesome_eli5_clm-model") >>> generator(prompt) [{'generated_text': "Somatic hypermutation allows the immune system to be able to effectively reverse the damage caused by an infection.\n\n\nThe damage caused by an infection is caused by the immune system's ability to perform its own self-correcting tasks."}] ``` @@ -364,7 +385,7 @@ Tokenize the text and return the `input_ids` as PyTorch tensors: ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model") +>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_eli5_clm-model") >>> inputs = tokenizer(prompt, return_tensors="pt").input_ids ``` @@ -374,7 +395,7 @@ For more details about the different text generation strategies and parameters f ```py >>> from transformers import AutoModelForCausalLM ->>> model = AutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model") +>>> model = AutoModelForCausalLM.from_pretrained("username/my_awesome_eli5_clm-model") >>> outputs = model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95) ``` @@ -391,7 +412,7 @@ Tokenize the text and return the `input_ids` as TensorFlow tensors: ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model") +>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_eli5_clm-model") >>> inputs = tokenizer(prompt, return_tensors="tf").input_ids ``` @@ -400,7 +421,7 @@ Use the [`~transformers.generation_tf_utils.TFGenerationMixin.generate`] method ```py >>> from transformers import TFAutoModelForCausalLM ->>> model = TFAutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model") +>>> model = TFAutoModelForCausalLM.from_pretrained("username/my_awesome_eli5_clm-model") >>> outputs = model.generate(input_ids=inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95) ``` diff --git a/docs/source/en/tasks/mask_generation.md b/docs/source/en/tasks/mask_generation.md new file mode 100644 index 00000000000000..e16b014f3757ab --- /dev/null +++ b/docs/source/en/tasks/mask_generation.md @@ -0,0 +1,238 @@ + + +# Mask Generation + +Mask generation is the task of generating semantically meaningful masks for an image. +This task is very similar to [image segmentation](semantic_segmentation), but many differences exist. Image segmentation models are trained on labeled datasets and are limited to the classes they have seen during training; they return a set of masks and corresponding classes, given an image. + +Mask generation models are trained on large amounts of data and operate in two modes. +- Prompting mode: In this mode, the model takes in an image and a prompt, where a prompt can be a 2D point location (XY coordinates) in the image within an object or a bounding box surrounding an object. In prompting mode, the model only returns the mask over the object +that the prompt is pointing out. +- Segment Everything mode: In segment everything, given an image, the model generates every mask in the image. To do so, a grid of points is generated and overlaid on the image for inference. 
+ +Mask generation task is supported by [Segment Anything Model (SAM)](model_doc/sam). It's a powerful model that consists of a Vision Transformer-based image encoder, a prompt encoder, and a two-way transformer mask decoder. Images and prompts are encoded, and the decoder takes these embeddings and generates valid masks. + +
+[Image: SAM Architecture]
+
+SAM serves as a powerful foundation model for segmentation as it has large data coverage. It is trained on
+[SA-1B](https://ai.meta.com/datasets/segment-anything/), a dataset with 11 million images and 1.1 billion masks.
+
+In this guide, you will learn how to:
+- Infer in segment everything mode with batching,
+- Infer in point prompting mode,
+- Infer in box prompting mode.
+
+First, let's install `transformers`:
+
+```bash
+pip install -q transformers
+```
+
+## Mask Generation Pipeline
+
+The easiest way to run inference with mask generation models is to use the `mask-generation` pipeline.
+
+```python
+>>> from transformers import pipeline
+
+>>> checkpoint = "facebook/sam-vit-base"
+>>> mask_generator = pipeline(model=checkpoint, task="mask-generation")
+```
+
+Let's see the image.
+
+```python
+from PIL import Image
+import requests
+
+img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
+image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
+```
+[Image: Example Image]
+ +Let's segment everything. `points-per-batch` enables parallel inference of points in segment everything mode. This enables faster inference, but consumes more memory. Moreover, SAM only enables batching over points and not the images. `pred_iou_thresh` is the IoU confidence threshold where only the masks above that certain threshold are returned. + +```python +masks = mask_generator(image, points_per_batch=128, pred_iou_thresh=0.88) +``` + +The `masks` looks like the following: + +```bash +{'masks': [array([[False, False, False, ..., True, True, True], + [False, False, False, ..., True, True, True], + [False, False, False, ..., True, True, True], + ..., + [False, False, False, ..., False, False, False], + [False, False, False, ..., False, False, False], + [False, False, False, ..., False, False, False]]), + array([[False, False, False, ..., False, False, False], + [False, False, False, ..., False, False, False], + [False, False, False, ..., False, False, False], + ..., +'scores': tensor([0.9972, 0.9917, + ..., +} +``` + +We can visualize them like this: + +```python +import matplotlib.pyplot as plt + +plt.imshow(image, cmap='gray') + +for i, mask in enumerate(masks["masks"]): + plt.imshow(mask, cmap='viridis', alpha=0.1, vmin=0, vmax=1) + +plt.axis('off') +plt.show() +``` + +Below is the original image in grayscale with colorful maps overlaid. Very impressive. + +
+[Image: Visualized]
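+Since the pipeline returns plain boolean arrays together with a score tensor (as shown above), you can also post-process the result yourself. The sketch below keeps only the most confident masks; the 0.95 threshold is an arbitrary value chosen for illustration.
+
+```python
+scores = masks["scores"]
+
+# keep only the masks whose predicted IoU score exceeds the threshold
+confident_masks = [m for m, s in zip(masks["masks"], scores) if s > 0.95]
+print(f"kept {len(confident_masks)} of {len(masks['masks'])} masks")
+```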
+
+
+## Model Inference
+
+### Point Prompting
+
+You can also use the model without the pipeline. To do so, initialize the model and
+the processor.
+
+```python
+import torch
+from transformers import SamModel, SamProcessor
+
+device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+
+model = SamModel.from_pretrained("facebook/sam-vit-base").to(device)
+processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
+```
+
+To do point prompting, pass the input point to the processor, then take the processor output
+and pass it to the model for inference. To post-process the model output, pass the outputs along with the
+`original_sizes` and `reshaped_input_sizes` taken from the processor's initial output. We need to pass these
+since the processor resizes the image, and the output needs to be projected back to the original image size.
+
+```python
+input_points = [[[2592, 1728]]] # point location of the bee
+
+inputs = processor(image, input_points=input_points, return_tensors="pt").to(device)
+with torch.no_grad():
+    outputs = model(**inputs)
+masks = processor.image_processor.post_process_masks(outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu())
+```
+We can visualize the three masks in the `masks` output.
+
+```python
+import torch
+import matplotlib.pyplot as plt
+import numpy as np
+
+fig, axes = plt.subplots(1, 4, figsize=(15, 5))
+
+axes[0].imshow(image)
+axes[0].set_title('Original Image')
+mask_list = [masks[0][0][0].numpy(), masks[0][0][1].numpy(), masks[0][0][2].numpy()]
+
+for i, mask in enumerate(mask_list, start=1):
+    overlayed_image = np.array(image).copy()
+
+    overlayed_image[:,:,0] = np.where(mask == 1, 255, overlayed_image[:,:,0])
+    overlayed_image[:,:,1] = np.where(mask == 1, 0, overlayed_image[:,:,1])
+    overlayed_image[:,:,2] = np.where(mask == 1, 0, overlayed_image[:,:,2])
+
+    axes[i].imshow(overlayed_image)
+    axes[i].set_title(f'Mask {i}')
+for ax in axes:
+    ax.axis('off')
+
+plt.show()
+```
+[Image: Visualized]
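+SAM returns three candidate masks for the prompted point, and the raw model output also carries the model's own quality estimate for each of them in `outputs.iou_scores`. If you only need a single mask per prompt, one simple heuristic is to keep the highest-scoring candidate, as sketched below; it reuses `masks`, `outputs`, and the plotting imports from the previous step.
+
+```python
+# index of the highest-scoring of the three candidate masks for our single point
+best_idx = outputs.iou_scores[0, 0].argmax().item()
+best_mask = masks[0][0][best_idx].numpy()
+
+plt.imshow(image)
+plt.imshow(best_mask, cmap='viridis', alpha=0.4)
+plt.axis('off')
+plt.show()
+```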
+
+### Box Prompting
+
+You can also do box prompting in a similar fashion to point prompting. You can simply pass the input box as a list
+`[x_min, y_min, x_max, y_max]` along with the image to the `processor`. Take the processor output and directly pass it
+to the model, then post-process the output again.
+
+
+```python
+# bounding box around the bee
+box = [2350, 1600, 2850, 2100]
+
+inputs = processor(
+        image,
+        input_boxes=[[[box]]],
+        return_tensors="pt"
+    ).to(device)
+
+with torch.no_grad():
+    outputs = model(**inputs)
+
+mask = processor.image_processor.post_process_masks(
+    outputs.pred_masks.cpu(),
+    inputs["original_sizes"].cpu(),
+    inputs["reshaped_input_sizes"].cpu()
+)[0][0][0].numpy()
+```
+
+You can visualize the bounding box around the bee as shown below.
+
+```python
+import matplotlib.patches as patches
+
+fig, ax = plt.subplots()
+ax.imshow(image)
+
+rectangle = patches.Rectangle((2350, 1600), 500, 500, linewidth=2, edgecolor='r', facecolor='none')
+ax.add_patch(rectangle)
+ax.axis("off")
+plt.show()
+```
+[Image: Visualized Bbox]
+ +You can see the inference output below. + +```python +fig, ax = plt.subplots() +ax.imshow(image) +ax.imshow(mask, cmap='viridis', alpha=0.4) + +ax.axis("off") +plt.show() +``` + +
+[Image: Visualized Inference]
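+As a final step, you may want to use the mask for something other than visualization, for example to cut the bee out of the photo. The sketch below reuses the `mask` array from the box prompting step; the white background value is an arbitrary choice.
+
+```python
+import numpy as np
+from PIL import Image
+
+# keep the pixels inside the mask and replace everything else with white
+cutout = np.where(mask[..., None], np.array(image), 255).astype(np.uint8)
+Image.fromarray(cutout).save("bee_cutout.png")
+```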
+ diff --git a/docs/source/en/tasks/masked_language_modeling.mdx b/docs/source/en/tasks/masked_language_modeling.md similarity index 63% rename from docs/source/en/tasks/masked_language_modeling.mdx rename to docs/source/en/tasks/masked_language_modeling.md index 0e5488e7343c28..de91cd587a6a0c 100644 --- a/docs/source/en/tasks/masked_language_modeling.mdx +++ b/docs/source/en/tasks/masked_language_modeling.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Masked language modeling @@ -22,7 +26,7 @@ require a good contextual understanding of an entire sequence. BERT is an exampl This guide will show you how to: -1. Finetune [DistilRoBERTa](https://huggingface.co/distilroberta-base) on the [r/askscience](https://www.reddit.com/r/askscience/) subset of the [ELI5](https://huggingface.co/datasets/eli5) dataset. +1. Finetune [DistilRoBERTa](https://huggingface.co/distilbert/distilroberta-base) on the [r/askscience](https://www.reddit.com/r/askscience/) subset of the [ELI5](https://huggingface.co/datasets/eli5) dataset. 2. Use your finetuned model for inference. @@ -31,7 +35,7 @@ Choose one of the following architectures: -[ALBERT](../model_doc/albert), [BART](../model_doc/bart), [BERT](../model_doc/bert), [BigBird](../model_doc/big_bird), [CamemBERT](../model_doc/camembert), [ConvBERT](../model_doc/convbert), [Data2VecText](../model_doc/data2vec-text), [DeBERTa](../model_doc/deberta), [DeBERTa-v2](../model_doc/deberta-v2), [DistilBERT](../model_doc/distilbert), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [ESM](../model_doc/esm), [FlauBERT](../model_doc/flaubert), [FNet](../model_doc/fnet), [Funnel Transformer](../model_doc/funnel), [I-BERT](../model_doc/ibert), [LayoutLM](../model_doc/layoutlm), [Longformer](../model_doc/longformer), [LUKE](../model_doc/luke), [mBART](../model_doc/mbart), [Megatron-BERT](../model_doc/megatron-bert), [MobileBERT](../model_doc/mobilebert), [MPNet](../model_doc/mpnet), [MVP](../model_doc/mvp), [Nezha](../model_doc/nezha), [Nyströmformer](../model_doc/nystromformer), [Perceiver](../model_doc/perceiver), [QDQBert](../model_doc/qdqbert), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [SqueezeBERT](../model_doc/squeezebert), [TAPAS](../model_doc/tapas), [Wav2Vec2](../model_doc/wav2vec2), [XLM](../model_doc/xlm), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [X-MOD](../model_doc/xmod), [YOSO](../model_doc/yoso) +[ALBERT](../model_doc/albert), [BART](../model_doc/bart), [BERT](../model_doc/bert), [BigBird](../model_doc/big_bird), [CamemBERT](../model_doc/camembert), [ConvBERT](../model_doc/convbert), [Data2VecText](../model_doc/data2vec-text), [DeBERTa](../model_doc/deberta), [DeBERTa-v2](../model_doc/deberta-v2), [DistilBERT](../model_doc/distilbert), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [ESM](../model_doc/esm), 
[FlauBERT](../model_doc/flaubert), [FNet](../model_doc/fnet), [Funnel Transformer](../model_doc/funnel), [I-BERT](../model_doc/ibert), [LayoutLM](../model_doc/layoutlm), [Longformer](../model_doc/longformer), [LUKE](../model_doc/luke), [mBART](../model_doc/mbart), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [MobileBERT](../model_doc/mobilebert), [MPNet](../model_doc/mpnet), [MRA](../model_doc/mra), [MVP](../model_doc/mvp), [Nezha](../model_doc/nezha), [Nyströmformer](../model_doc/nystromformer), [Perceiver](../model_doc/perceiver), [QDQBert](../model_doc/qdqbert), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [SqueezeBERT](../model_doc/squeezebert), [TAPAS](../model_doc/tapas), [Wav2Vec2](../model_doc/wav2vec2), [XLM](../model_doc/xlm), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [X-MOD](../model_doc/xmod), [YOSO](../model_doc/yoso) @@ -53,16 +57,15 @@ We encourage you to log in to your Hugging Face account so you can upload and sh ## Load ELI5 dataset -Start by loading a smaller subset of the r/askscience subset of the ELI5 dataset from the 🤗 Datasets library. This'll -give you a chance to experiment and make sure everything works before spending more time training on the full dataset. +Start by loading the first 5000 examples from the [ELI5-Category](https://huggingface.co/datasets/eli5_category) dataset with the 🤗 Datasets library. This'll give you a chance to experiment and make sure everything works before spending more time training on the full dataset. ```py >>> from datasets import load_dataset ->>> eli5 = load_dataset("eli5", split="train_asks[:5000]") +>>> eli5 = load_dataset("eli5_category", split="train[:5000]") ``` -Split the dataset's `train_asks` split into a train and test set with the [`~datasets.Dataset.train_test_split`] method: +Split the dataset's `train` split into a train and test set with the [`~datasets.Dataset.train_test_split`] method: ```py >>> eli5 = eli5.train_test_split(test_size=0.2) @@ -72,18 +75,23 @@ Then take a look at an example: ```py >>> eli5["train"][0] -{'answers': {'a_id': ['c3d1aib', 'c3d4lya'], - 'score': [6, 3], - 'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.", - "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]}, - 'answers_urls': {'url': []}, - 'document': '', - 'q_id': 'nyxfp', - 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? 
And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?', - 'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']}, - 'subreddit': 'askscience', - 'title': 'Few questions about this space walk photograph.', - 'title_urls': {'url': []}} +{'q_id': '7h191n', + 'title': 'What does the tax bill that was passed today mean? How will it affect Americans in each tax bracket?', + 'selftext': '', + 'category': 'Economics', + 'subreddit': 'explainlikeimfive', + 'answers': {'a_id': ['dqnds8l', 'dqnd1jl', 'dqng3i1', 'dqnku5x'], + 'text': ["The tax bill is 500 pages long and there were a lot of changes still going on right to the end. It's not just an adjustment to the income tax brackets, it's a whole bunch of changes. As such there is no good answer to your question. The big take aways are: - Big reduction in corporate income tax rate will make large companies very happy. - Pass through rate change will make certain styles of business (law firms, hedge funds) extremely happy - Income tax changes are moderate, and are set to expire (though it's the kind of thing that might just always get re-applied without being made permanent) - People in high tax states (California, New York) lose out, and many of them will end up with their taxes raised.", + 'None yet. It has to be reconciled with a vastly different house bill and then passed again.', + 'Also: does this apply to 2017 taxes? Or does it start with 2018 taxes?', + 'This article explains both the House and senate bills, including the proposed changes to your income taxes based on your income level. URL_0'], + 'score': [21, 19, 5, 3], + 'text_urls': [[], + [], + [], + ['https://www.investopedia.com/news/trumps-tax-reform-what-can-be-done/']]}, + 'title_urls': ['url'], + 'selftext_urls': ['url']} ``` While this may look like a lot, you're only really interested in the `text` field. What's cool about language modeling tasks is you don't need labels (also known as an unsupervised task) because the next word *is* the label. @@ -97,40 +105,44 @@ For masked language modeling, the next step is to load a DistilRoBERTa tokenizer ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("distilroberta-base") +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilroberta-base") ``` -You'll notice from the example above, the `text` field is actually nested inside `answers`. This means you'll need to e -xtract the `text` subfield from its nested structure with the [`flatten`](https://huggingface.co/docs/datasets/process.html#flatten) method: +You'll notice from the example above, the `text` field is actually nested inside `answers`. This means you'll need to extract the `text` subfield from its nested structure with the [`flatten`](https://huggingface.co/docs/datasets/process#flatten) method: ```py >>> eli5 = eli5.flatten() >>> eli5["train"][0] -{'answers.a_id': ['c3d1aib', 'c3d4lya'], - 'answers.score': [6, 3], - 'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. 
If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.", - "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"], - 'answers_urls.url': [], - 'document': '', - 'q_id': 'nyxfp', - 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?', - 'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'], - 'subreddit': 'askscience', - 'title': 'Few questions about this space walk photograph.', - 'title_urls.url': []} +{'q_id': '7h191n', + 'title': 'What does the tax bill that was passed today mean? How will it affect Americans in each tax bracket?', + 'selftext': '', + 'category': 'Economics', + 'subreddit': 'explainlikeimfive', + 'answers.a_id': ['dqnds8l', 'dqnd1jl', 'dqng3i1', 'dqnku5x'], + 'answers.text': ["The tax bill is 500 pages long and there were a lot of changes still going on right to the end. It's not just an adjustment to the income tax brackets, it's a whole bunch of changes. As such there is no good answer to your question. The big take aways are: - Big reduction in corporate income tax rate will make large companies very happy. - Pass through rate change will make certain styles of business (law firms, hedge funds) extremely happy - Income tax changes are moderate, and are set to expire (though it's the kind of thing that might just always get re-applied without being made permanent) - People in high tax states (California, New York) lose out, and many of them will end up with their taxes raised.", + 'None yet. It has to be reconciled with a vastly different house bill and then passed again.', + 'Also: does this apply to 2017 taxes? Or does it start with 2018 taxes?', + 'This article explains both the House and senate bills, including the proposed changes to your income taxes based on your income level. URL_0'], + 'answers.score': [21, 19, 5, 3], + 'answers.text_urls': [[], + [], + [], + ['https://www.investopedia.com/news/trumps-tax-reform-what-can-be-done/']], + 'title_urls': ['url'], + 'selftext_urls': ['url']} ``` Each subfield is now a separate column as indicated by the `answers` prefix, and the `text` field is a list now. Instead of tokenizing each sentence separately, convert the list to a string so you can jointly tokenize them. -Here is how you can create a preprocessing function to convert the list to a string, and truncate sequences to be no longer than DistilRoBERTa's maximum input length: +Here is a first preprocessing function to join the list of strings for each example and tokenize the result: ```py >>> def preprocess_function(examples): -... return tokenizer([" ".join(x) for x in examples["answers.text"]], truncation=True) +... return tokenizer([" ".join(x) for x in examples["answers.text"]]) ``` -To apply the preprocessing function over the entire dataset, use 🤗 Datasets [`~datasets.Dataset.with_transform`] method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once, and increasing the number of processes with `num_proc`. 
Remove any columns you don't need: +To apply this preprocessing function over the entire dataset, use the 🤗 Datasets [`~datasets.Dataset.map`] method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once, and increasing the number of processes with `num_proc`. Remove any columns you don't need: ```py >>> tokenized_eli5 = eli5.map( @@ -141,24 +153,29 @@ To apply the preprocessing function over the entire dataset, use 🤗 Datasets [ ... ) ``` -Now you'll need a second preprocessing function to capture text truncated from the lengthier examples to avoid losing any information. This preprocessing function should: +This dataset contains the token sequences, but some of these are longer than the maximum input length for the model. -- Concatenate all the text. -- Split the concatenated text into smaller chunks defined by `block_size`. +You can now use a second preprocessing function to +- concatenate all the sequences +- split the concatenated sequences into shorter chunks defined by `block_size`, which should be both shorter than the maximum input length and short enough for your GPU RAM. ```py >>> block_size = 128 >>> def group_texts(examples): +... # Concatenate all texts. ... concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()} ... total_length = len(concatenated_examples[list(examples.keys())[0]]) -... total_length = (total_length // block_size) * block_size +... # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can +... # customize this part to your needs. +... if total_length >= block_size: +... total_length = (total_length // block_size) * block_size +... # Split by chunks of block_size. ... result = { ... k: [t[i : i + block_size] for i in range(0, total_length, block_size)] ... for k, t in concatenated_examples.items() ... } -... result["labels"] = result["input_ids"].copy() ... return result ``` @@ -168,7 +185,7 @@ Apply the `group_texts` function over the entire dataset: >>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4) ``` -Now create a batch of examples using [`DataCollatorForLanguageModeling`]. It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximium length. +Now create a batch of examples using [`DataCollatorForLanguageModeling`]. It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length. @@ -209,7 +226,7 @@ You're ready to start training your model now! Load DistilRoBERTa with [`AutoMod ```py >>> from transformers import AutoModelForMaskedLM ->>> model = AutoModelForMaskedLM.from_pretrained("distilroberta-base") +>>> model = AutoModelForMaskedLM.from_pretrained("distilbert/distilroberta-base") ``` At this point, only three steps remain: @@ -274,7 +291,7 @@ Then you can load DistilRoBERTa with [`TFAutoModelForMaskedLM`]: ```py >>> from transformers import TFAutoModelForMaskedLM ->>> model = TFAutoModelForMaskedLM.from_pretrained("distilroberta-base") +>>> model = TFAutoModelForMaskedLM.from_pretrained("distilbert/distilroberta-base") ``` Convert your datasets to the `tf.data.Dataset` format with [`~transformers.TFPreTrainedModel.prepare_tf_dataset`]: @@ -295,12 +312,12 @@ Convert your datasets to the `tf.data.Dataset` format with [`~transformers.TFPre ... 
) ``` -Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method): +Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method). Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to: ```py >>> import tensorflow as tf ->>> model.compile(optimizer=optimizer) +>>> model.compile(optimizer=optimizer) # No loss argument! ``` This can be done by specifying where to push your model and tokenizer in the [`~transformers.PushToHubCallback`]: @@ -347,7 +364,7 @@ The simplest way to try out your finetuned model for inference is to use it in a ```py >>> from transformers import pipeline ->>> mask_filler = pipeline("fill-mask", "stevhliu/my_awesome_eli5_mlm_model") +>>> mask_filler = pipeline("fill-mask", "username/my_awesome_eli5_mlm_model") >>> mask_filler(text, top_k=3) [{'score': 0.5150994658470154, 'token': 21300, @@ -370,7 +387,7 @@ Tokenize the text and return the `input_ids` as PyTorch tensors. You'll also nee ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_mlm_model") +>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_eli5_mlm_model") >>> inputs = tokenizer(text, return_tensors="pt") >>> mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1] ``` @@ -380,7 +397,7 @@ Pass your inputs to the model and return the `logits` of the masked token: ```py >>> from transformers import AutoModelForMaskedLM ->>> model = AutoModelForMaskedLM.from_pretrained("stevhliu/my_awesome_eli5_mlm_model") +>>> model = AutoModelForMaskedLM.from_pretrained("username/my_awesome_eli5_mlm_model") >>> logits = model(**inputs).logits >>> mask_token_logits = logits[0, mask_token_index, :] ``` @@ -403,7 +420,7 @@ Tokenize the text and return the `input_ids` as TensorFlow tensors. You'll also ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_mlm_model") +>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_eli5_mlm_model") >>> inputs = tokenizer(text, return_tensors="tf") >>> mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1] ``` @@ -413,7 +430,7 @@ Pass your inputs to the model and return the `logits` of the masked token: ```py >>> from transformers import TFAutoModelForMaskedLM ->>> model = TFAutoModelForMaskedLM.from_pretrained("stevhliu/my_awesome_eli5_mlm_model") +>>> model = TFAutoModelForMaskedLM.from_pretrained("username/my_awesome_eli5_mlm_model") >>> logits = model(**inputs).logits >>> mask_token_logits = logits[0, mask_token_index, :] ``` @@ -430,4 +447,4 @@ The Milky Way is a massive galaxy. The Milky Way is a small galaxy. ``` - \ No newline at end of file + diff --git a/docs/source/en/tasks/monocular_depth_estimation.md b/docs/source/en/tasks/monocular_depth_estimation.md new file mode 100644 index 00000000000000..aea18299893196 --- /dev/null +++ b/docs/source/en/tasks/monocular_depth_estimation.md @@ -0,0 +1,151 @@ + + +# Monocular depth estimation + +Monocular depth estimation is a computer vision task that involves predicting the depth information of a scene from a +single image. In other words, it is the process of estimating the distance of objects in a scene from +a single camera viewpoint. 
+ +Monocular depth estimation has various applications, including 3D reconstruction, augmented reality, autonomous driving, +and robotics. It is a challenging task as it requires the model to understand the complex relationships between objects +in the scene and the corresponding depth information, which can be affected by factors such as lighting conditions, +occlusion, and texture. + + +The task illustrated in this tutorial is supported by the following model architectures: + + + +[Depth Anything](../model_doc/depth_anything), [DPT](../model_doc/dpt), [GLPN](../model_doc/glpn) + + + + + +In this guide you'll learn how to: + +* create a depth estimation pipeline +* run depth estimation inference by hand + +Before you begin, make sure you have all the necessary libraries installed: + +```bash +pip install -q transformers +``` + +## Depth estimation pipeline + +The simplest way to try out inference with a model supporting depth estimation is to use the corresponding [`pipeline`]. +Instantiate a pipeline from a [checkpoint on the Hugging Face Hub](https://huggingface.co/models?pipeline_tag=depth-estimation&sort=downloads): + +```py +>>> from transformers import pipeline + +>>> checkpoint = "vinvino02/glpn-nyu" +>>> depth_estimator = pipeline("depth-estimation", model=checkpoint) +``` + +Next, choose an image to analyze: + +```py +>>> from PIL import Image +>>> import requests + +>>> url = "https://unsplash.com/photos/HwBAsSbPBDU/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MzR8fGNhciUyMGluJTIwdGhlJTIwc3RyZWV0fGVufDB8MHx8fDE2Nzg5MDEwODg&force=true&w=640" +>>> image = Image.open(requests.get(url, stream=True).raw) +>>> image +``` + +
+ Photo of a busy street +
+ +Pass the image to the pipeline. + +```py +>>> predictions = depth_estimator(image) +``` + +The pipeline returns a dictionary with two entries. The first one, called `predicted_depth`, is a tensor with the values +being the depth expressed in meters for each pixel. +The second one, `depth`, is a PIL image that visualizes the depth estimation result. + +Let's take a look at the visualized result: + +```py +>>> predictions["depth"] +``` + +
+ Depth estimation visualization +
+ +## Depth estimation inference by hand + +Now that you've seen how to use the depth estimation pipeline, let's see how we can replicate the same result by hand. + +Start by loading the model and associated processor from a [checkpoint on the Hugging Face Hub](https://huggingface.co/models?pipeline_tag=depth-estimation&sort=downloads). +Here we'll use the same checkpoint as before: + +```py +>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation + +>>> checkpoint = "vinvino02/glpn-nyu" + +>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint) +>>> model = AutoModelForDepthEstimation.from_pretrained(checkpoint) +``` + +Prepare the image input for the model using the `image_processor` that will take care of the necessary image transformations +such as resizing and normalization: + +```py +>>> pixel_values = image_processor(image, return_tensors="pt").pixel_values +``` + +Pass the prepared inputs through the model: + +```py +>>> import torch + +>>> with torch.no_grad(): +... outputs = model(pixel_values) +... predicted_depth = outputs.predicted_depth +``` + +Visualize the results: + +```py +>>> import numpy as np + +>>> # interpolate to original size +>>> prediction = torch.nn.functional.interpolate( +... predicted_depth.unsqueeze(1), +... size=image.size[::-1], +... mode="bicubic", +... align_corners=False, +... ).squeeze() +>>> output = prediction.numpy() + +>>> formatted = (output * 255 / np.max(output)).astype("uint8") +>>> depth = Image.fromarray(formatted) +>>> depth +``` + +
+ Depth estimation visualization +
diff --git a/docs/source/en/tasks/multiple_choice.mdx b/docs/source/en/tasks/multiple_choice.md similarity index 90% rename from docs/source/en/tasks/multiple_choice.mdx rename to docs/source/en/tasks/multiple_choice.md index 74568d37caf8de..5cf17448f0a66a 100644 --- a/docs/source/en/tasks/multiple_choice.mdx +++ b/docs/source/en/tasks/multiple_choice.md @@ -8,15 +8,21 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Multiple choice +[[open-in-colab]] + A multiple choice task is similar to question answering, except several candidate answers are provided along with a context and the model is trained to select the correct answer. This guide will show you how to: -1. Finetune [BERT](https://huggingface.co/bert-base-uncased) on the `regular` configuration of the [SWAG](https://huggingface.co/datasets/swag) dataset to select the best answer given multiple options and some context. +1. Finetune [BERT](https://huggingface.co/google-bert/bert-base-uncased) on the `regular` configuration of the [SWAG](https://huggingface.co/datasets/swag) dataset to select the best answer given multiple options and some context. 2. Use your finetuned model for inference. @@ -24,7 +30,7 @@ The task illustrated in this tutorial is supported by the following model archit -[ALBERT](../model_doc/albert), [BERT](../model_doc/bert), [BigBird](../model_doc/big_bird), [CamemBERT](../model_doc/camembert), [CANINE](../model_doc/canine), [ConvBERT](../model_doc/convbert), [Data2VecText](../model_doc/data2vec-text), [DeBERTa-v2](../model_doc/deberta-v2), [DistilBERT](../model_doc/distilbert), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [ErnieM](../model_doc/ernie_m), [FlauBERT](../model_doc/flaubert), [FNet](../model_doc/fnet), [Funnel Transformer](../model_doc/funnel), [I-BERT](../model_doc/ibert), [Longformer](../model_doc/longformer), [LUKE](../model_doc/luke), [Megatron-BERT](../model_doc/megatron-bert), [MobileBERT](../model_doc/mobilebert), [MPNet](../model_doc/mpnet), [Nezha](../model_doc/nezha), [Nyströmformer](../model_doc/nystromformer), [QDQBert](../model_doc/qdqbert), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [SqueezeBERT](../model_doc/squeezebert), [XLM](../model_doc/xlm), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod), [YOSO](../model_doc/yoso) +[ALBERT](../model_doc/albert), [BERT](../model_doc/bert), [BigBird](../model_doc/big_bird), [CamemBERT](../model_doc/camembert), [CANINE](../model_doc/canine), [ConvBERT](../model_doc/convbert), [Data2VecText](../model_doc/data2vec-text), [DeBERTa-v2](../model_doc/deberta-v2), [DistilBERT](../model_doc/distilbert), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [ErnieM](../model_doc/ernie_m), [FlauBERT](../model_doc/flaubert), [FNet](../model_doc/fnet), [Funnel Transformer](../model_doc/funnel), [I-BERT](../model_doc/ibert), 
[Longformer](../model_doc/longformer), [LUKE](../model_doc/luke), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [MobileBERT](../model_doc/mobilebert), [MPNet](../model_doc/mpnet), [MRA](../model_doc/mra), [Nezha](../model_doc/nezha), [Nyströmformer](../model_doc/nystromformer), [QDQBert](../model_doc/qdqbert), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [SqueezeBERT](../model_doc/squeezebert), [XLM](../model_doc/xlm), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod), [YOSO](../model_doc/yoso) @@ -84,7 +90,7 @@ The next step is to load a BERT tokenizer to process the sentence starts and the ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") ``` The preprocessing function you want to create needs to: @@ -117,7 +123,7 @@ To apply the preprocessing function over the entire dataset, use 🤗 Datasets [ tokenized_swag = swag.map(preprocess_function, batched=True) ``` -🤗 Transformers doesn't have a data collator for multiple choice, so you'll need to adapt the [`DataCollatorWithPadding`] to create a batch of examples. It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximium length. +🤗 Transformers doesn't have a data collator for multiple choice, so you'll need to adapt the [`DataCollatorWithPadding`] to create a batch of examples. It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length. `DataCollatorForMultipleChoice` flattens all the model inputs, applies padding, and then unflattens the results: @@ -247,7 +253,7 @@ You're ready to start training your model now! Load BERT with [`AutoModelForMult ```py >>> from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer ->>> model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased") +>>> model = AutoModelForMultipleChoice.from_pretrained("google-bert/bert-base-uncased") ``` At this point, only three steps remain: @@ -311,7 +317,7 @@ Then you can load BERT with [`TFAutoModelForMultipleChoice`]: ```py >>> from transformers import TFAutoModelForMultipleChoice ->>> model = TFAutoModelForMultipleChoice.from_pretrained("bert-base-uncased") +>>> model = TFAutoModelForMultipleChoice.from_pretrained("google-bert/bert-base-uncased") ``` Convert your datasets to the `tf.data.Dataset` format with [`~transformers.TFPreTrainedModel.prepare_tf_dataset`]: @@ -333,13 +339,13 @@ Convert your datasets to the `tf.data.Dataset` format with [`~transformers.TFPre ... ) ``` -Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method): +Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method). Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to: ```py ->>> model.compile(optimizer=optimizer) +>>> model.compile(optimizer=optimizer) # No loss argument! 
``` -The last two things to setup before you start training is to compute the accuracy from the predictions, and provide a way to push your model to the Hub. Both are done by using [Keras callbacks](./main_classes/keras_callbacks). +The last two things to setup before you start training is to compute the accuracy from the predictions, and provide a way to push your model to the Hub. Both are done by using [Keras callbacks](../main_classes/keras_callbacks). Pass your `compute_metrics` function to [`~transformers.KerasMetricCallback`]: diff --git a/docs/source/en/tasks/object_detection.mdx b/docs/source/en/tasks/object_detection.md similarity index 92% rename from docs/source/en/tasks/object_detection.mdx rename to docs/source/en/tasks/object_detection.md index 411ed7d2e7393f..2513591f545238 100644 --- a/docs/source/en/tasks/object_detection.mdx +++ b/docs/source/en/tasks/object_detection.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Object detection @@ -132,9 +136,20 @@ To get an even better understanding of the data, visualize an example in the dat >>> label2id = {v: k for k, v in id2label.items()} >>> for i in range(len(annotations["id"])): -... box = annotations["bbox"][i - 1] -... class_idx = annotations["category"][i - 1] +... box = annotations["bbox"][i] +... class_idx = annotations["category"][i] ... x, y, w, h = tuple(box) +... # Check if coordinates are normalized or not +... if max(box) > 1.0: +... # Coordinates are un-normalized, no need to re-scale them +... x1, y1 = int(x), int(y) +... x2, y2 = int(x + w), int(y + h) +... else: +... # Coordinates are normalized, re-scale them +... x1 = int(x * width) +... y1 = int(y * height) +... x2 = int((x + w) * width) +... y2 = int((y + h) * height) ... draw.rectangle((x, y, x + w, y + h), outline="red", width=1) ... draw.text((x, y), id2label[class_idx], fill="white") @@ -149,7 +164,7 @@ To visualize the bounding boxes with associated labels, you can get the labels f the `category` field. You'll also want to create dictionaries that map a label id to a label class (`id2label`) and the other way around (`label2id`). You can use them later when setting up the model. Including these maps will make your model reusable by others if you share -it on the Hugging Face Hub. +it on the Hugging Face Hub. Please note that, the part of above code that draws the bounding boxes assume that it is in `XYWH` (x,y co-ordinates and width and height of the box) format. It might not work for other formats like `(x1, y1, x2, y2)`. As a final step of getting familiar with the data, explore it for potential issues. One common problem with datasets for object detection is bounding boxes that "stretch" beyond the edge of the image. Such "runaway" bounding boxes can raise @@ -301,7 +316,7 @@ to indicate which pixels are real (1) and which are padding (0). ```py >>> def collate_fn(batch): ... pixel_values = [item["pixel_values"] for item in batch] -... encoding = image_processor.pad_and_create_pixel_mask(pixel_values, return_tensors="pt") +... 
encoding = image_processor.pad(pixel_values, return_tensors="pt") ... labels = [item["labels"] for item in batch] ... batch = {} ... batch["pixel_values"] = encoding["pixel_values"] @@ -458,9 +473,9 @@ Next, prepare an instance of a `CocoDetection` class that can be used with `coco >>> class CocoDetection(torchvision.datasets.CocoDetection): -... def __init__(self, img_folder, feature_extractor, ann_file): +... def __init__(self, img_folder, image_processor, ann_file): ... super().__init__(img_folder, ann_file) -... self.feature_extractor = feature_extractor +... self.image_processor = image_processor ... def __getitem__(self, idx): ... # read in PIL image and target in COCO format @@ -470,14 +485,14 @@ Next, prepare an instance of a `CocoDetection` class that can be used with `coco ... # resizing + normalization of both image and target) ... image_id = self.ids[idx] ... target = {"image_id": image_id, "annotations": target} -... encoding = self.feature_extractor(images=img, annotations=target, return_tensors="pt") +... encoding = self.image_processor(images=img, annotations=target, return_tensors="pt") ... pixel_values = encoding["pixel_values"].squeeze() # remove batch dimension ... target = encoding["labels"][0] # remove batch dimension ... return {"pixel_values": pixel_values, "labels": target} ->>> im_processor = AutoImageProcessor.from_pretrained("MariaK/detr-resnet-50_finetuned_cppe5") +>>> im_processor = AutoImageProcessor.from_pretrained("devonho/detr-resnet-50_finetuned_cppe5") >>> path_output_cppe5, path_anno = save_cppe5_annotation_file_images(cppe5["test"]) >>> test_ds_coco_format = CocoDetection(path_output_cppe5, im_processor, path_anno) @@ -489,7 +504,7 @@ Finally, load the metrics and run the evaluation. >>> import evaluate >>> from tqdm import tqdm ->>> model = AutoModelForObjectDetection.from_pretrained("MariaK/detr-resnet-50_finetuned_cppe5") +>>> model = AutoModelForObjectDetection.from_pretrained("devonho/detr-resnet-50_finetuned_cppe5") >>> module = evaluate.load("ybelkada/cocoevaluate", coco=test_ds_coco_format.coco) >>> val_dataloader = torch.utils.data.DataLoader( ... test_ds_coco_format, batch_size=8, shuffle=False, num_workers=4, collate_fn=collate_fn @@ -508,7 +523,7 @@ Finally, load the metrics and run the evaluation. ... outputs = model(pixel_values=pixel_values, pixel_mask=pixel_mask) ... orig_target_sizes = torch.stack([target["orig_size"] for target in labels], dim=0) -... results = im_processor.post_process(outputs, orig_target_sizes) # convert outputs of model to COCO api +... results = im_processor.post_process(outputs, orig_target_sizes) # convert outputs of model to Pascal VOC format (xmin, ymin, xmax, ymax) ... module.add(prediction=results, reference=labels) ... del batch @@ -518,18 +533,18 @@ Finally, load the metrics and run the evaluation. Accumulating evaluation results... DONE (t=0.08s). 
IoU metric: bbox - Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.150 - Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.280 - Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.130 - Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.038 - Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.036 - Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.182 - Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.166 - Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.317 - Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.335 - Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.104 - Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.146 - Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.382 + Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.352 + Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.681 + Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.292 + Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.168 + Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.208 + Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.429 + Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.274 + Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.484 + Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.501 + Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.191 + Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.323 + Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.590 ``` These results can be further improved by adjusting the hyperparameters in [`~transformers.TrainingArguments`]. Give it a go! @@ -545,15 +560,15 @@ for object detection with your model, and pass an image to it: >>> url = "https://i.imgur.com/2lnWoly.jpg" >>> image = Image.open(requests.get(url, stream=True).raw) ->>> obj_detector = pipeline("object-detection", model="MariaK/detr-resnet-50_finetuned_cppe5") +>>> obj_detector = pipeline("object-detection", model="devonho/detr-resnet-50_finetuned_cppe5") >>> obj_detector(image) ``` You can also manually replicate the results of the pipeline if you'd like: ```py ->>> image_processor = AutoImageProcessor.from_pretrained("MariaK/detr-resnet-50_finetuned_cppe5") ->>> model = AutoModelForObjectDetection.from_pretrained("MariaK/detr-resnet-50_finetuned_cppe5") +>>> image_processor = AutoImageProcessor.from_pretrained("devonho/detr-resnet-50_finetuned_cppe5") +>>> model = AutoModelForObjectDetection.from_pretrained("devonho/detr-resnet-50_finetuned_cppe5") >>> with torch.no_grad(): ... inputs = image_processor(images=image, return_tensors="pt") @@ -587,4 +602,3 @@ Let's plot the result:
Object detection result on a new image
- diff --git a/docs/source/en/tasks/prompting.md b/docs/source/en/tasks/prompting.md new file mode 100644 index 00000000000000..1746e36fb9675f --- /dev/null +++ b/docs/source/en/tasks/prompting.md @@ -0,0 +1,439 @@ + + + +# LLM prompting guide + +[[open-in-colab]] + +Large Language Models such as Falcon, LLaMA, etc. are pretrained transformer models initially trained to predict the +next token given some input text. They typically have billions of parameters and have been trained on trillions of +tokens for an extended period of time. As a result, these models become quite powerful and versatile, and you can use +them to solve multiple NLP tasks out of the box by instructing the models with natural language prompts. + +Designing such prompts to ensure the optimal output is often called "prompt engineering". Prompt engineering is an +iterative process that requires a fair amount of experimentation. Natural languages are much more flexible and expressive +than programming languages, however, they can also introduce some ambiguity. At the same time, prompts in natural language +are quite sensitive to changes. Even minor modifications in prompts can lead to wildly different outputs. + +While there is no exact recipe for creating prompts to match all cases, researchers have worked out a number of best +practices that help to achieve optimal results more consistently. + +This guide covers the prompt engineering best practices to help you craft better LLM prompts and solve various NLP tasks. +You'll learn: + +- [Basics of prompting](#basics-of-prompting) +- [Best practices of LLM prompting](#best-practices-of-llm-prompting) +- [Advanced prompting techniques: few-shot prompting and chain-of-thought](#advanced-prompting-techniques) +- [When to fine-tune instead of prompting](#prompting-vs-fine-tuning) + + + +Prompt engineering is only a part of the LLM output optimization process. Another essential component is choosing the +optimal text generation strategy. You can customize how your LLM selects each of the subsequent tokens when generating +the text without modifying any of the trainable parameters. By tweaking the text generation parameters, you can reduce +repetition in the generated text and make it more coherent and human-sounding. +Text generation strategies and parameters are out of scope for this guide, but you can learn more about these topics in +the following guides: + +* [Generation with LLMs](../llm_tutorial) +* [Text generation strategies](../generation_strategies) + + + +## Basics of prompting + +### Types of models + +The majority of modern LLMs are decoder-only transformers. Some examples include: [LLaMA](../model_doc/llama), +[Llama2](../model_doc/llama2), [Falcon](../model_doc/falcon), [GPT2](../model_doc/gpt2). However, you may encounter +encoder-decoder transformer LLMs as well, for instance, [Flan-T5](../model_doc/flan-t5) and [BART](../model_doc/bart). + +Encoder-decoder-style models are typically used in generative tasks where the output **heavily** relies on the input, for +example, in translation and summarization. The decoder-only models are used for all other types of generative tasks. + +When using a pipeline to generate text with an LLM, it's important to know what type of LLM you are using, because +they use different pipelines. 
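+
+If you are not sure which of the two families a checkpoint belongs to, a quick check is the `is_encoder_decoder` flag in its configuration. This is a minimal sketch, shown with the same two checkpoints used in the examples below:
+
+```python
+>>> from transformers import AutoConfig
+
+>>> # Decoder-only checkpoints report `is_encoder_decoder=False` in their config
+>>> AutoConfig.from_pretrained("openai-community/gpt2").is_encoder_decoder
+False
+
+>>> # Encoder-decoder checkpoints such as Flan-T5 report `True`
+>>> AutoConfig.from_pretrained("google/flan-t5-base").is_encoder_decoder
+True
+```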
+ +Run inference with decoder-only models with the `text-generation` pipeline: + +```python +>>> from transformers import pipeline +>>> import torch + +>>> torch.manual_seed(0) # doctest: +IGNORE_RESULT + +>>> generator = pipeline('text-generation', model = 'openai-community/gpt2') +>>> prompt = "Hello, I'm a language model" + +>>> generator(prompt, max_length = 30) +[{'generated_text': "Hello, I'm a language model expert, so I'm a big believer in the concept that I know very well and then I try to look into"}] +``` + +To run inference with an encoder-decoder, use the `text2text-generation` pipeline: + +```python +>>> text2text_generator = pipeline("text2text-generation", model = 'google/flan-t5-base') +>>> prompt = "Translate from English to French: I'm very happy to see you" + +>>> text2text_generator(prompt) +[{'generated_text': 'Je suis très heureuse de vous rencontrer.'}] +``` + +### Base vs instruct/chat models + +Most of the recent LLM checkpoints available on 🤗 Hub come in two versions: base and instruct (or chat). For example, +[`tiiuae/falcon-7b`](https://huggingface.co/tiiuae/falcon-7b) and [`tiiuae/falcon-7b-instruct`](https://huggingface.co/tiiuae/falcon-7b-instruct). + +Base models are excellent at completing the text when given an initial prompt, however, they are not ideal for NLP tasks +where they need to follow instructions, or for conversational use. This is where the instruct (chat) versions come in. +These checkpoints are the result of further fine-tuning of the pre-trained base versions on instructions and conversational data. +This additional fine-tuning makes them a better choice for many NLP tasks. + +Let's illustrate some simple prompts that you can use with [`tiiuae/falcon-7b-instruct`](https://huggingface.co/tiiuae/falcon-7b-instruct) +to solve some common NLP tasks. + +### NLP tasks + +First, let's set up the environment: + +```bash +pip install -q transformers accelerate +``` + +Next, let's load the model with the appropriate pipeline (`"text-generation"`): + +```python +>>> from transformers import pipeline, AutoTokenizer +>>> import torch + +>>> torch.manual_seed(0) # doctest: +IGNORE_RESULT +>>> model = "tiiuae/falcon-7b-instruct" + +>>> tokenizer = AutoTokenizer.from_pretrained(model) +>>> pipe = pipeline( +... "text-generation", +... model=model, +... tokenizer=tokenizer, +... torch_dtype=torch.bfloat16, +... device_map="auto", +... ) +``` + + + +Note that Falcon models were trained using the `bfloat16` datatype, so we recommend you use the same. This requires a recent +version of CUDA and works best on modern cards. + + + +Now that we have the model loaded via the pipeline, let's explore how you can use prompts to solve NLP tasks. + +#### Text classification + +One of the most common forms of text classification is sentiment analysis, which assigns a label like "positive", "negative", +or "neutral" to a sequence of text. Let's write a prompt that instructs the model to classify a given text (a movie review). +We'll start by giving the instruction, and then specifying the text to classify. Note that instead of leaving it at that, we're +also adding the beginning of the response - `"Sentiment: "`: + +```python +>>> torch.manual_seed(0) # doctest: +IGNORE_RESULT +>>> prompt = """Classify the text into neutral, negative or positive. +... Text: This movie is definitely one of my favorite movies of its kind. The interaction between respectable and morally strong characters is an ode to chivalry and the honor code amongst thieves and policemen. +... 
Sentiment: +... """ + +>>> sequences = pipe( +... prompt, +... max_new_tokens=10, +... ) + +>>> for seq in sequences: +... print(f"Result: {seq['generated_text']}") +Result: Classify the text into neutral, negative or positive. +Text: This movie is definitely one of my favorite movies of its kind. The interaction between respectable and morally strong characters is an ode to chivalry and the honor code amongst thieves and policemen. +Sentiment: +Positive +``` + +As a result, the output contains a classification label from the list we have provided in the instructions, and it is a correct one! + + + +You may notice that in addition to the prompt, we pass a `max_new_tokens` parameter. It controls the number of tokens the +model shall generate, and it is one of the many text generation parameters that you can learn about +in [Text generation strategies](../generation_strategies) guide. + + + +#### Named Entity Recognition + +Named Entity Recognition (NER) is a task of finding named entities in a piece of text, such as a person, location, or organization. +Let's modify the instructions in the prompt to make the LLM perform this task. Here, let's also set `return_full_text = False` +so that output doesn't contain the prompt: + +```python +>>> torch.manual_seed(1) # doctest: +IGNORE_RESULT +>>> prompt = """Return a list of named entities in the text. +... Text: The Golden State Warriors are an American professional basketball team based in San Francisco. +... Named entities: +... """ + +>>> sequences = pipe( +... prompt, +... max_new_tokens=15, +... return_full_text = False, +... ) + +>>> for seq in sequences: +... print(f"{seq['generated_text']}") +- Golden State Warriors +- San Francisco +``` + +As you can see, the model correctly identified two named entities from the given text. + +#### Translation + +Another task LLMs can perform is translation. You can choose to use encoder-decoder models for this task, however, here, +for the simplicity of the examples, we'll keep using Falcon-7b-instruct, which does a decent job. Once again, here's how +you can write a basic prompt to instruct a model to translate a piece of text from English to Italian: + +```python +>>> torch.manual_seed(2) # doctest: +IGNORE_RESULT +>>> prompt = """Translate the English text to Italian. +... Text: Sometimes, I've believed as many as six impossible things before breakfast. +... Translation: +... """ + +>>> sequences = pipe( +... prompt, +... max_new_tokens=20, +... do_sample=True, +... top_k=10, +... return_full_text = False, +... ) + +>>> for seq in sequences: +... print(f"{seq['generated_text']}") +A volte, ho creduto a sei impossibili cose prima di colazione. +``` + +Here we've added a `do_sample=True` and `top_k=10` to allow the model to be a bit more flexible when generating output. + +#### Text summarization + +Similar to the translation, text summarization is another generative task where the output **heavily** relies on the input, +and encoder-decoder models can be a better choice. However, decoder-style models can be used for this task as well. +Previously, we have placed the instructions at the very beginning of the prompt. However, the very end of the prompt can +also be a suitable location for instructions. Typically, it's better to place the instruction on one of the extreme ends. + +```python +>>> torch.manual_seed(3) # doctest: +IGNORE_RESULT +>>> prompt = """Permaculture is a design process mimicking the diversity, functionality and resilience of natural ecosystems. 
The principles and practices are drawn from traditional ecological knowledge of indigenous cultures combined with modern scientific understanding and technological innovations. Permaculture design provides a framework helping individuals and communities develop innovative, creative and effective strategies for meeting basic needs while preparing for and mitigating the projected impacts of climate change. +... Write a summary of the above text. +... Summary: +... """ + +>>> sequences = pipe( +... prompt, +... max_new_tokens=30, +... do_sample=True, +... top_k=10, +... return_full_text = False, +... ) + +>>> for seq in sequences: +... print(f"{seq['generated_text']}") +Permaculture is an ecological design mimicking natural ecosystems to meet basic needs and prepare for climate change. It is based on traditional knowledge and scientific understanding. +``` + +#### Question answering + +For question answering task we can structure the prompt into the following logical components: instructions, context, question, and +the leading word or phrase (`"Answer:"`) to nudge the model to start generating the answer: + +```python +>>> torch.manual_seed(4) # doctest: +IGNORE_RESULT +>>> prompt = """Answer the question using the context below. +... Context: Gazpacho is a cold soup and drink made of raw, blended vegetables. Most gazpacho includes stale bread, tomato, cucumbers, onion, bell peppers, garlic, olive oil, wine vinegar, water, and salt. Northern recipes often include cumin and/or pimentón (smoked sweet paprika). Traditionally, gazpacho was made by pounding the vegetables in a mortar with a pestle; this more laborious method is still sometimes used as it helps keep the gazpacho cool and avoids the foam and silky consistency of smoothie versions made in blenders or food processors. +... Question: What modern tool is used to make gazpacho? +... Answer: +... """ + +>>> sequences = pipe( +... prompt, +... max_new_tokens=10, +... do_sample=True, +... top_k=10, +... return_full_text = False, +... ) + +>>> for seq in sequences: +... print(f"Result: {seq['generated_text']}") +Result: Modern tools are used, such as immersion blenders +``` + +#### Reasoning + +Reasoning is one of the most difficult tasks for LLMs, and achieving good results often requires applying advanced prompting techniques, like +[Chain-of-though](#chain-of-thought). + +Let's try if we can make a model reason about a simple arithmetics task with a basic prompt: + +```python +>>> torch.manual_seed(5) # doctest: +IGNORE_RESULT +>>> prompt = """There are 5 groups of students in the class. Each group has 4 students. How many students are there in the class?""" + +>>> sequences = pipe( +... prompt, +... max_new_tokens=30, +... do_sample=True, +... top_k=10, +... return_full_text = False, +... ) + +>>> for seq in sequences: +... print(f"Result: {seq['generated_text']}") +Result: +There are a total of 5 groups, so there are 5 x 4=20 students in the class. +``` + +Correct! Let's increase the complexity a little and see if we can still get away with a basic prompt: + +```python +>>> torch.manual_seed(6) # doctest: +IGNORE_RESULT +>>> prompt = """I baked 15 muffins. I ate 2 muffins and gave 5 muffins to a neighbor. My partner then bought 6 more muffins and ate 2. How many muffins do we now have?""" + +>>> sequences = pipe( +... prompt, +... max_new_tokens=10, +... do_sample=True, +... top_k=10, +... return_full_text = False, +... ) + +>>> for seq in sequences: +... 
print(f"Result: {seq['generated_text']}") +Result: +The total number of muffins now is 21 +``` + +This is a wrong answer, it should be 12. In this case, this can be due to the prompt being too basic, or due to the choice +of model, after all we've picked the smallest version of Falcon. Reasoning is difficult for models of all sizes, but larger +models are likely to perform better. + +## Best practices of LLM prompting + +In this section of the guide we have compiled a list of best practices that tend to improve the prompt results: + +* When choosing the model to work with, the latest and most capable models are likely to perform better. +* Start with a simple and short prompt, and iterate from there. +* Put the instructions at the beginning of the prompt, or at the very end. When working with large context, models apply various optimizations to prevent Attention complexity from scaling quadratically. This may make a model more attentive to the beginning or end of a prompt than the middle. +* Clearly separate instructions from the text they apply to - more on this in the next section. +* Be specific and descriptive about the task and the desired outcome - its format, length, style, language, etc. +* Avoid ambiguous descriptions and instructions. +* Favor instructions that say "what to do" instead of those that say "what not to do". +* "Lead" the output in the right direction by writing the first word (or even begin the first sentence for the model). +* Use advanced techniques like [Few-shot prompting](#few-shot-prompting) and [Chain-of-thought](#chain-of-thought) +* Test your prompts with different models to assess their robustness. +* Version and track the performance of your prompts. + +## Advanced prompting techniques + +### Few-shot prompting + +The basic prompts in the sections above are the examples of "zero-shot" prompts, meaning, the model has been given +instructions and context, but no examples with solutions. LLMs that have been fine-tuned on instruction datasets, generally +perform well on such "zero-shot" tasks. However, you may find that your task has more complexity or nuance, and, perhaps, +you have some requirements for the output that the model doesn't catch on just from the instructions. In this case, you can +try the technique called few-shot prompting. + +In few-shot prompting, we provide examples in the prompt giving the model more context to improve the performance. +The examples condition the model to generate the output following the patterns in the examples. + +Here's an example: + +```python +>>> torch.manual_seed(0) # doctest: +IGNORE_RESULT +>>> prompt = """Text: The first human went into space and orbited the Earth on April 12, 1961. +... Date: 04/12/1961 +... Text: The first-ever televised presidential debate in the United States took place on September 28, 1960, between presidential candidates John F. Kennedy and Richard Nixon. +... Date:""" + +>>> sequences = pipe( +... prompt, +... max_new_tokens=8, +... do_sample=True, +... top_k=10, +... ) + +>>> for seq in sequences: +... print(f"Result: {seq['generated_text']}") +Result: Text: The first human went into space and orbited the Earth on April 12, 1961. +Date: 04/12/1961 +Text: The first-ever televised presidential debate in the United States took place on September 28, 1960, between presidential candidates John F. Kennedy and Richard Nixon. 
+Date: 09/28/1960 +``` + +In the above code snippet we used a single example to demonstrate the desired output to the model, so this can be called a +"one-shot" prompting. However, depending on the task complexity you may need to use more than one example. + +Limitations of the few-shot prompting technique: +- While LLMs can pick up on the patterns in the examples, these technique doesn't work well on complex reasoning tasks +- Few-shot prompting requires creating lengthy prompts. Prompts with large number of tokens can increase computation and latency. There's also a limit to the length of the prompts. +- Sometimes when given a number of examples, models can learn patterns that you didn't intend them to learn, e.g. that the third movie review is always negative. + +### Chain-of-thought + +Chain-of-thought (CoT) prompting is a technique that nudges a model to produce intermediate reasoning steps thus improving +the results on complex reasoning tasks. + +There are two ways of steering a model to producing the reasoning steps: +- few-shot prompting by illustrating examples with detailed answers to questions, showing the model how to work through a problem. +- by instructing the model to reason by adding phrases like "Let's think step by step" or "Take a deep breath and work through the problem step by step." + +If we apply the CoT technique to the muffins example from the [reasoning section](#reasoning) and use a larger model, +such as (`tiiuae/falcon-180B-chat`) which you can play with in the [HuggingChat](https://huggingface.co/chat/), +we'll get a significant improvement on the reasoning result: + +```text +Let's go through this step-by-step: +1. You start with 15 muffins. +2. You eat 2 muffins, leaving you with 13 muffins. +3. You give 5 muffins to your neighbor, leaving you with 8 muffins. +4. Your partner buys 6 more muffins, bringing the total number of muffins to 14. +5. Your partner eats 2 muffins, leaving you with 12 muffins. +Therefore, you now have 12 muffins. +``` + +## Prompting vs fine-tuning + +You can achieve great results by optimizing your prompts, however, you may still ponder whether fine-tuning a model +would work better for your case. Here are some scenarios when fine-tuning a smaller model may be a preferred option: + +- Your domain is wildly different from what LLMs were pre-trained on and extensive prompt optimization did not yield sufficient results. +- You need your model to work well in a low-resource language. +- You need the model to be trained on sensitive data that is under strict regulations. +- You have to use a small model due to cost, privacy, infrastructure or other limitations. + +In all of the above examples, you will need to make sure that you either already have or can easily obtain a large enough +domain-specific dataset at a reasonable cost to fine-tune a model. You will also need to have enough time and resources +to fine-tune a model. + +If the above examples are not the case for you, optimizing prompts can prove to be more beneficial. 
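+
+If you decide to keep iterating on prompts instead, the zero-shot chain-of-thought trigger described in the [Chain-of-thought](#chain-of-thought) section is an easy experiment to run. Here is a minimal sketch that reuses the `pipe` object created in the [NLP tasks](#nlp-tasks) section and simply appends the trigger phrase to the muffin problem; the generated text will vary with the model and the sampling parameters you choose:
+
+```python
+>>> prompt = """I baked 15 muffins. I ate 2 muffins and gave 5 muffins to a neighbor.
+... My partner then bought 6 more muffins and ate 2. How many muffins do we now have?
+... Let's think step by step."""
+
+>>> sequences = pipe(
+...     prompt,
+...     max_new_tokens=60,
+...     do_sample=True,
+...     top_k=10,
+...     return_full_text=False,
+... )
+
+>>> for seq in sequences:
+...     print(f"Result: {seq['generated_text']}")
+```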
+ + diff --git a/docs/source/en/tasks/question_answering.mdx b/docs/source/en/tasks/question_answering.md similarity index 86% rename from docs/source/en/tasks/question_answering.mdx rename to docs/source/en/tasks/question_answering.md index 3b0bf33b2f518f..2c4706ad93b001 100644 --- a/docs/source/en/tasks/question_answering.mdx +++ b/docs/source/en/tasks/question_answering.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Question answering @@ -23,7 +27,7 @@ Question answering tasks return an answer given a question. If you've ever asked This guide will show you how to: -1. Finetune [DistilBERT](https://huggingface.co/distilbert-base-uncased) on the [SQuAD](https://huggingface.co/datasets/squad) dataset for extractive question answering. +1. Finetune [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) on the [SQuAD](https://huggingface.co/datasets/squad) dataset for extractive question answering. 2. Use your finetuned model for inference. @@ -31,7 +35,8 @@ The task illustrated in this tutorial is supported by the following model archit -[ALBERT](../model_doc/albert), [BART](../model_doc/bart), [BERT](../model_doc/bert), [BigBird](../model_doc/big_bird), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [BLOOM](../model_doc/bloom), [CamemBERT](../model_doc/camembert), [CANINE](../model_doc/canine), [ConvBERT](../model_doc/convbert), [Data2VecText](../model_doc/data2vec-text), [DeBERTa](../model_doc/deberta), [DeBERTa-v2](../model_doc/deberta-v2), [DistilBERT](../model_doc/distilbert), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [ErnieM](../model_doc/ernie_m), [FlauBERT](../model_doc/flaubert), [FNet](../model_doc/fnet), [Funnel Transformer](../model_doc/funnel), [GPT-J](../model_doc/gptj), [I-BERT](../model_doc/ibert), [LayoutLMv2](../model_doc/layoutlmv2), [LayoutLMv3](../model_doc/layoutlmv3), [LED](../model_doc/led), [LiLT](../model_doc/lilt), [Longformer](../model_doc/longformer), [LUKE](../model_doc/luke), [LXMERT](../model_doc/lxmert), [MarkupLM](../model_doc/markuplm), [mBART](../model_doc/mbart), [Megatron-BERT](../model_doc/megatron-bert), [MobileBERT](../model_doc/mobilebert), [MPNet](../model_doc/mpnet), [MVP](../model_doc/mvp), [Nezha](../model_doc/nezha), [Nyströmformer](../model_doc/nystromformer), [OPT](../model_doc/opt), [QDQBert](../model_doc/qdqbert), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [Splinter](../model_doc/splinter), [SqueezeBERT](../model_doc/squeezebert), [XLM](../model_doc/xlm), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod), [YOSO](../model_doc/yoso) + +[ALBERT](../model_doc/albert), [BART](../model_doc/bart), [BERT](../model_doc/bert), [BigBird](../model_doc/big_bird), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [BLOOM](../model_doc/bloom), 
[CamemBERT](../model_doc/camembert), [CANINE](../model_doc/canine), [ConvBERT](../model_doc/convbert), [Data2VecText](../model_doc/data2vec-text), [DeBERTa](../model_doc/deberta), [DeBERTa-v2](../model_doc/deberta-v2), [DistilBERT](../model_doc/distilbert), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [ErnieM](../model_doc/ernie_m), [Falcon](../model_doc/falcon), [FlauBERT](../model_doc/flaubert), [FNet](../model_doc/fnet), [Funnel Transformer](../model_doc/funnel), [OpenAI GPT-2](../model_doc/gpt2), [GPT Neo](../model_doc/gpt_neo), [GPT NeoX](../model_doc/gpt_neox), [GPT-J](../model_doc/gptj), [I-BERT](../model_doc/ibert), [LayoutLMv2](../model_doc/layoutlmv2), [LayoutLMv3](../model_doc/layoutlmv3), [LED](../model_doc/led), [LiLT](../model_doc/lilt), [LLaMA](../model_doc/llama), [Longformer](../model_doc/longformer), [LUKE](../model_doc/luke), [LXMERT](../model_doc/lxmert), [MarkupLM](../model_doc/markuplm), [mBART](../model_doc/mbart), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [MobileBERT](../model_doc/mobilebert), [MPNet](../model_doc/mpnet), [MPT](../model_doc/mpt), [MRA](../model_doc/mra), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [Nezha](../model_doc/nezha), [Nyströmformer](../model_doc/nystromformer), [OPT](../model_doc/opt), [QDQBert](../model_doc/qdqbert), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [Splinter](../model_doc/splinter), [SqueezeBERT](../model_doc/squeezebert), [T5](../model_doc/t5), [UMT5](../model_doc/umt5), [XLM](../model_doc/xlm), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod), [YOSO](../model_doc/yoso) @@ -54,7 +59,7 @@ We encourage you to login to your Hugging Face account so you can upload and sha ## Load SQuAD dataset -Start by loading a smaller subset of the SQuAD dataset from the 🤗 Datasets library. This'll give you a chance to experiment and make sure everythings works before spending more time training on the full dataset. +Start by loading a smaller subset of the SQuAD dataset from the 🤗 Datasets library. This'll give you a chance to experiment and make sure everything works before spending more time training on the full dataset. ```py >>> from datasets import load_dataset @@ -95,7 +100,7 @@ The next step is to load a DistilBERT tokenizer to process the `question` and `c ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") ``` There are a few preprocessing steps particular to question answering tasks you should be aware of: @@ -201,7 +206,7 @@ You're ready to start training your model now! 
Load DistilBERT with [`AutoModelF ```py >>> from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer ->>> model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased") +>>> model = AutoModelForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased") ``` At this point, only three steps remain: @@ -266,7 +271,7 @@ Then you can load DistilBERT with [`TFAutoModelForQuestionAnswering`]: ```py >>> from transformers import TFAutoModelForQuestionAnswering ->>> model = TFAutoModelForQuestionAnswering("distilbert-base-uncased") +>>> model = TFAutoModelForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased") ``` Convert your datasets to the `tf.data.Dataset` format with [`~transformers.TFPreTrainedModel.prepare_tf_dataset`]: @@ -327,7 +332,7 @@ or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/no Evaluation for question answering requires a significant amount of postprocessing. To avoid taking up too much of your time, this guide skips the evaluation step. The [`Trainer`] still calculates the evaluation loss during training so you're not completely in the dark about your model's performance. -If have more time and you're interested in how to evaluate your model for question answering, take a look at the [Question answering](https://huggingface.co/course/chapter7/7?fw=pt#postprocessing) chapter from the 🤗 Hugging Face Course! +If have more time and you're interested in how to evaluate your model for question answering, take a look at the [Question answering](https://huggingface.co/course/chapter7/7?fw=pt#post-processing) chapter from the 🤗 Hugging Face Course! ## Inference @@ -369,6 +374,7 @@ Tokenize the text and return PyTorch tensors: Pass your inputs to the model and return the `logits`: ```py +>>> import torch >>> from transformers import AutoModelForQuestionAnswering >>> model = AutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model") diff --git a/docs/source/en/tasks/semantic_segmentation.mdx b/docs/source/en/tasks/semantic_segmentation.md similarity index 64% rename from docs/source/en/tasks/semantic_segmentation.mdx rename to docs/source/en/tasks/semantic_segmentation.md index 8e6e7b2370a1d4..e99499bbbbd4cd 100644 --- a/docs/source/en/tasks/semantic_segmentation.mdx +++ b/docs/source/en/tasks/semantic_segmentation.md @@ -8,31 +8,23 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> -# Semantic segmentation +# Image Segmentation [[open-in-colab]] -Semantic segmentation assigns a label or class to each individual pixel of an image. There are several types of segmentation, and in the case of semantic segmentation, no distinction is made between unique instances of the same object. Both objects are given the same label (for example, "car" instead of "car-1" and "car-2"). 
Common real-world applications of semantic segmentation include training self-driving cars to identify pedestrians and important traffic information, identifying cells and abnormalities in medical imagery, and monitoring environmental changes from satellite imagery. - -This guide will show you how to: - -1. Finetune [SegFormer](https://huggingface.co/docs/transformers/main/en/model_doc/segformer#segformer) on the [SceneParse150](https://huggingface.co/datasets/scene_parse_150) dataset. -2. Use your finetuned model for inference. +Image segmentation models separate areas corresponding to different areas of interest in an image. These models work by assigning a label to each pixel. There are several types of segmentation: semantic segmentation, instance segmentation, and panoptic segmentation. - -The task illustrated in this tutorial is supported by the following model architectures: - - - -[BEiT](../model_doc/beit), [Data2VecVision](../model_doc/data2vec-vision), [DPT](../model_doc/dpt), [MobileNetV2](../model_doc/mobilenet_v2), [MobileViT](../model_doc/mobilevit), [SegFormer](../model_doc/segformer), [UPerNet](../model_doc/upernet) - - - - +In this guide, we will: +1. [Take a look at different types of segmentation](#types-of-segmentation). +2. [Have an end-to-end fine-tuning example for semantic segmentation](#fine-tuning-a-model-for-segmentation). Before you begin, make sure you have all the necessary libraries installed: @@ -48,9 +40,180 @@ We encourage you to log in to your Hugging Face account so you can upload and sh >>> notebook_login() ``` -## Load SceneParse150 dataset +## Types of Segmentation + +Semantic segmentation assigns a label or class to every single pixel in an image. Let's take a look at a semantic segmentation model output. It will assign the same class to every instance of an object it comes across in an image, for example, all cats will be labeled as "cat" instead of "cat-1", "cat-2". +We can use transformers' image segmentation pipeline to quickly infer a semantic segmentation model. Let's take a look at the example image. + +```python +from transformers import pipeline +from PIL import Image +import requests + +url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/segmentation_input.jpg" +image = Image.open(requests.get(url, stream=True).raw) +image +``` + +
+ Segmentation Input +
+ +We will use [nvidia/segformer-b1-finetuned-cityscapes-1024-1024](https://huggingface.co/nvidia/segformer-b1-finetuned-cityscapes-1024-1024). + +```python +semantic_segmentation = pipeline("image-segmentation", "nvidia/segformer-b1-finetuned-cityscapes-1024-1024") +results = semantic_segmentation(image) +results +``` + +The segmentation pipeline output includes a mask for every predicted class. +```bash +[{'score': None, + 'label': 'road', + 'mask': }, + {'score': None, + 'label': 'sidewalk', + 'mask': }, + {'score': None, + 'label': 'building', + 'mask': }, + {'score': None, + 'label': 'wall', + 'mask': }, + {'score': None, + 'label': 'pole', + 'mask': }, + {'score': None, + 'label': 'traffic sign', + 'mask': }, + {'score': None, + 'label': 'vegetation', + 'mask': }, + {'score': None, + 'label': 'terrain', + 'mask': }, + {'score': None, + 'label': 'sky', + 'mask': }, + {'score': None, + 'label': 'car', + 'mask': }] +``` + +Taking a look at the mask for the car class, we can see every car is classified with the same mask. + +```python +results[-1]["mask"] +``` +
+ Semantic Segmentation Output +
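If you'd rather work with a predicted mask numerically than just display it, note that each `mask` returned by the pipeline is a PIL image. Below is a minimal sketch reusing the `results` list from the semantic segmentation pipeline above; treating the mask as a binary image is an assumption about the pipeline's output format:

```python
import numpy as np

# each "mask" is typically a binary PIL image the same size as the input
car_mask = np.array(results[-1]["mask"])

print(car_mask.shape)         # (height, width) of the input image
print((car_mask > 0).mean())  # fraction of pixels predicted as "car"
```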
+ +In instance segmentation, the goal is not to classify every pixel, but to predict a mask for **every instance of an object** in a given image. It works very similarly to object detection, but instead of a bounding box for every instance, there is a segmentation mask. We will use [facebook/mask2former-swin-large-cityscapes-instance](https://huggingface.co/facebook/mask2former-swin-large-cityscapes-instance) for this. + +```python +instance_segmentation = pipeline("image-segmentation", "facebook/mask2former-swin-large-cityscapes-instance") +results = instance_segmentation(image) +results +``` + +As you can see below, there are multiple cars classified, and there is no classification for pixels other than those that belong to the car and person instances. + +```bash +[{'score': 0.999944, + 'label': 'car', + 'mask': }, + {'score': 0.999945, + 'label': 'car', + 'mask': }, + {'score': 0.999652, + 'label': 'car', + 'mask': }, + {'score': 0.903529, + 'label': 'person', + 'mask': }] +``` +Let's check out one of the car masks below. + +```python +results[2]["mask"] +``` +
+ Semantic Segmentation Output +
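Because instance segmentation attaches a confidence score to each detected instance, you can filter the output before using it. Here is a small sketch reusing the `results` from the instance segmentation pipeline above; the 0.99 threshold is an arbitrary value chosen for illustration:

```python
# keep only the instances the model is very confident about
confident = [r for r in results if r["score"] is not None and r["score"] > 0.99]
print([(r["label"], round(r["score"], 3)) for r in confident])
```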
+ +Panoptic segmentation combines semantic segmentation and instance segmentation, where every pixel is classified into a class and an instance of that class, and there are multiple masks for each instance of a class. We can use [facebook/mask2former-swin-large-cityscapes-panoptic](https://huggingface.co/facebook/mask2former-swin-large-cityscapes-panoptic) for this. + +```python +panoptic_segmentation = pipeline("image-segmentation", "facebook/mask2former-swin-large-cityscapes-panoptic") +results = panoptic_segmentation(image) +results +``` +As you can see below, we have more classes. We will illustrate later that every pixel is classified into one of these classes. + +```bash +[{'score': 0.999981, + 'label': 'car', + 'mask': }, + {'score': 0.999958, + 'label': 'car', + 'mask': }, + {'score': 0.99997, + 'label': 'vegetation', + 'mask': }, + {'score': 0.999575, + 'label': 'pole', + 'mask': }, + {'score': 0.999958, + 'label': 'building', + 'mask': }, + {'score': 0.999634, + 'label': 'road', + 'mask': }, + {'score': 0.996092, + 'label': 'sidewalk', + 'mask': }, + {'score': 0.999221, + 'label': 'car', + 'mask': }, + {'score': 0.99987, + 'label': 'sky', + 'mask': }] +``` + +Let's have a side-by-side comparison of all types of segmentation. + +
+ Segmentation Maps Compared +
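If you'd like to build a comparison like this yourself, a rough matplotlib sketch is shown below. It assumes you stored the three pipeline outputs above under separate, hypothetical names (`semantic_results`, `instance_results`, `panoptic_results`) and simply overlays one predicted mask from each output on the input image:

```python
import matplotlib.pyplot as plt

# assumes the three pipeline outputs were kept separately, e.g.
# semantic_results = semantic_segmentation(image), and so on
outputs = {
    "semantic": semantic_results,
    "instance": instance_results,
    "panoptic": panoptic_results,
}

fig, axes = plt.subplots(1, len(outputs), figsize=(15, 5))
for ax, (name, results) in zip(axes, outputs.items()):
    ax.imshow(image)
    ax.imshow(results[0]["mask"], alpha=0.5)  # overlay one mask per output as an example
    ax.set_title(name)
    ax.axis("off")
plt.show()
```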
+ +Seeing all types of segmentation, let's have a deep dive on fine-tuning a model for semantic segmentation. -Start by loading a smaller subset of the SceneParse150 dataset from the 🤗 Datasets library. This'll give you a chance to experiment and make sure everythings works before spending more time training on the full dataset. +Common real-world applications of semantic segmentation include training self-driving cars to identify pedestrians and important traffic information, identifying cells and abnormalities in medical imagery, and monitoring environmental changes from satellite imagery. + +## Fine-tuning a Model for Segmentation + +We will now: + +1. Finetune [SegFormer](https://huggingface.co/docs/transformers/main/en/model_doc/segformer#segformer) on the [SceneParse150](https://huggingface.co/datasets/scene_parse_150) dataset. +2. Use your fine-tuned model for inference. + + +The task illustrated in this tutorial is supported by the following model architectures: + + + +[BEiT](../model_doc/beit), [Data2VecVision](../model_doc/data2vec-vision), [DPT](../model_doc/dpt), [MobileNetV2](../model_doc/mobilenet_v2), [MobileViT](../model_doc/mobilevit), [MobileViTV2](../model_doc/mobilevitv2), [SegFormer](../model_doc/segformer), [UPerNet](../model_doc/upernet) + + + + + + +### Load SceneParse150 dataset + +Start by loading a smaller subset of the SceneParse150 dataset from the 🤗 Datasets library. This'll give you a chance to experiment and make sure everything works before spending more time training on the full dataset. ```py >>> from datasets import load_dataset @@ -93,7 +256,59 @@ You'll also want to create a dictionary that maps a label id to a label class wh >>> num_labels = len(id2label) ``` -## Preprocess +#### Custom dataset + +You could also create and use your own dataset if you prefer to train with the [run_semantic_segmentation.py](https://github.com/huggingface/transformers/blob/main/examples/pytorch/semantic-segmentation/run_semantic_segmentation.py) script instead of a notebook instance. The script requires: + +1. a [`~datasets.DatasetDict`] with two [`~datasets.Image`] columns, "image" and "label" + + ```py + from datasets import Dataset, DatasetDict, Image + + image_paths_train = ["path/to/image_1.jpg/jpg", "path/to/image_2.jpg/jpg", ..., "path/to/image_n.jpg/jpg"] + label_paths_train = ["path/to/annotation_1.png", "path/to/annotation_2.png", ..., "path/to/annotation_n.png"] + + image_paths_validation = [...] + label_paths_validation = [...] + + def create_dataset(image_paths, label_paths): + dataset = Dataset.from_dict({"image": sorted(image_paths), + "label": sorted(label_paths)}) + dataset = dataset.cast_column("image", Image()) + dataset = dataset.cast_column("label", Image()) + return dataset + + # step 1: create Dataset objects + train_dataset = create_dataset(image_paths_train, label_paths_train) + validation_dataset = create_dataset(image_paths_validation, label_paths_validation) + + # step 2: create DatasetDict + dataset = DatasetDict({ + "train": train_dataset, + "validation": validation_dataset, + } + ) + + # step 3: push to Hub (assumes you have ran the huggingface-cli login command in a terminal/notebook) + dataset.push_to_hub("your-name/dataset-repo") + + # optionally, you can push to a private repo on the Hub + # dataset.push_to_hub("name of repo on the hub", private=True) + ``` + +2. 
an id2label dictionary mapping the class integers to their class names + + ```py + import json + # simple example + id2label = {0: 'cat', 1: 'dog'} + with open('id2label.json', 'w') as fp: + json.dump(id2label, fp) + ``` + +As an example, take a look at this [example dataset](https://huggingface.co/datasets/nielsr/ade20k-demo) which was created with the steps shown above. + +### Preprocess The next step is to load a SegFormer image processor to prepare the images and annotations for the model. Some datasets, like this one, use the zero-index as the background class. However, the background class isn't actually included in the 150 classes, so you'll need to set `reduce_labels=True` to subtract one from all the labels. The zero-index is replaced by `255` so it's ignored by SegFormer's loss function: @@ -200,9 +415,9 @@ The transform is applied on the fly which is faster and consumes less disk space -## Evaluate +### Evaluate -Including a metric during training is often helpful for evaluating your model's performance. You can quickly load a evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [mean Intersection over Union](https://huggingface.co/spaces/evaluate-metric/accuracy) (IoU) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric): +Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [mean Intersection over Union](https://huggingface.co/spaces/evaluate-metric/accuracy) (IoU) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric): ```py >>> import evaluate @@ -217,6 +432,10 @@ logits first, and then reshaped to match the size of the labels before you can c ```py +>>> import numpy as np +>>> import torch +>>> from torch import nn + >>> def compute_metrics(eval_pred): ... with torch.no_grad(): ... logits, labels = eval_pred @@ -237,7 +456,7 @@ logits first, and then reshaped to match the size of the labels before you can c ... reduce_labels=False, ... ) ... for key, value in metrics.items(): -... if type(value) is np.ndarray: +... if isinstance(value, np.ndarray): ... metrics[key] = value.tolist() ... return metrics ``` @@ -281,7 +500,7 @@ logits first, and then reshaped to match the size of the labels before you can c Your `compute_metrics` function is ready to go now, and you'll return to it when you setup your training. -## Train +### Train @@ -377,7 +596,7 @@ Start by defining the hyperparameters, optimizer and learning rate schedule: ``` Then, load SegFormer with [`TFAutoModelForSemanticSegmentation`] along with the label mappings, and compile it with the -optimizer: +optimizer. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to: ```py >>> from transformers import TFAutoModelForSemanticSegmentation @@ -387,7 +606,7 @@ optimizer: ... id2label=id2label, ... label2id=label2id, ... ) ->>> model.compile(optimizer=optimizer) +>>> model.compile(optimizer=optimizer) # No loss argument! 
``` Convert your datasets to the `tf.data.Dataset` format using the [`~datasets.Dataset.to_tf_dataset`] and the [`DefaultDataCollator`]: @@ -412,7 +631,7 @@ Convert your datasets to the `tf.data.Dataset` format using the [`~datasets.Data ... ) ``` -To compute the accuracy from the predictions and push your model to the 🤗 Hub, use [Keras callbacks](./main_classes/keras_callbacks). +To compute the accuracy from the predictions and push your model to the 🤗 Hub, use [Keras callbacks](../main_classes/keras_callbacks). Pass your `compute_metrics` function to [`KerasMetricCallback`], and use the [`PushToHubCallback`] to upload the model: @@ -445,7 +664,7 @@ Congratulations! You have fine-tuned your model and shared it on the 🤗 Hub. Y -## Inference +### Inference Great, now that you've finetuned a model, you can use it for inference! @@ -462,43 +681,8 @@ Load an image for inference: -The simplest way to try out your finetuned model for inference is to use it in a [`pipeline`]. Instantiate a `pipeline` for image segmentation with your model, and pass your image to it: - -```py ->>> from transformers import pipeline ->>> segmenter = pipeline("image-segmentation", model="my_awesome_seg_model") ->>> segmenter(image) -[{'score': None, - 'label': 'wall', - 'mask': }, - {'score': None, - 'label': 'sky', - 'mask': }, - {'score': None, - 'label': 'floor', - 'mask': }, - {'score': None, - 'label': 'ceiling', - 'mask': }, - {'score': None, - 'label': 'bed ', - 'mask': }, - {'score': None, - 'label': 'windowpane', - 'mask': }, - {'score': None, - 'label': 'cabinet', - 'mask': }, - {'score': None, - 'label': 'chair', - 'mask': }, - {'score': None, - 'label': 'armchair', - 'mask': }] -``` - -You can also manually replicate the results of the `pipeline` if you'd like. Process the image with an image processor and place the `pixel_values` on a GPU: +We will now see how to infer without a pipeline. Process the image with an image processor and place the `pixel_values` on a GPU: ```py >>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # use GPU if available, otherwise use a CPU @@ -587,4 +771,4 @@ To visualize the results, load the [dataset color palette](https://github.com/te
Image of bedroom overlaid with segmentation map -
\ No newline at end of file +
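As a rough illustration of the overlay shown above, the sketch below blends a predicted segmentation map with the original photo. It assumes `pred_seg` is the 2D array of predicted class ids produced in the previous step and `image` is the input PIL image; the three-color `palette` is only a stand-in for the full ADE20K palette the guide loads:

```python
import matplotlib.pyplot as plt
import numpy as np

# stand-in palette: one RGB color per class id (the guide uses the full ADE20K palette)
palette = np.array([[120, 120, 120], [180, 120, 120], [6, 230, 230]], dtype=np.uint8)

pred_seg = np.asarray(pred_seg)                    # predicted class id per pixel, shape (H, W)
color_seg = np.zeros((*pred_seg.shape, 3), dtype=np.uint8)
for label, color in enumerate(palette):
    color_seg[pred_seg == label] = color           # paint every pixel of this class

overlay = np.array(image) * 0.5 + color_seg * 0.5  # blend the map with the original photo
plt.imshow(overlay.astype(np.uint8))
plt.axis("off")
plt.show()
```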
diff --git a/docs/source/en/tasks/sequence_classification.mdx b/docs/source/en/tasks/sequence_classification.md similarity index 79% rename from docs/source/en/tasks/sequence_classification.mdx rename to docs/source/en/tasks/sequence_classification.md index 00e003a15aa382..8459ae4c08babe 100644 --- a/docs/source/en/tasks/sequence_classification.mdx +++ b/docs/source/en/tasks/sequence_classification.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Text classification @@ -16,11 +20,11 @@ specific language governing permissions and limitations under the License. -Text classification is a common NLP task that assigns a label or class to text. Some of the largest companies run text classification in production for a wide range of practical applications. One of the most popular forms of text classification is sentiment analysis, which assigns a label like 🙂 positive, 🙁 negative, or 😐 neutral to a sequence of text. +Text classification is a common NLP task that assigns a label or class to text. Some of the largest companies run text classification in production for a wide range of practical applications. One of the most popular forms of text classification is sentiment analysis, which assigns a label like 🙂 positive, 🙁 negative, or 😐 neutral to a sequence of text. This guide will show you how to: -1. Finetune [DistilBERT](https://huggingface.co/distilbert-base-uncased) on the [IMDb](https://huggingface.co/datasets/imdb) dataset to determine whether a movie review is positive or negative. +1. Finetune [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) on the [IMDb](https://huggingface.co/datasets/imdb) dataset to determine whether a movie review is positive or negative. 2. Use your finetuned model for inference. 
@@ -28,7 +32,9 @@ The task illustrated in this tutorial is supported by the following model archit -[ALBERT](../model_doc/albert), [BART](../model_doc/bart), [BERT](../model_doc/bert), [BigBird](../model_doc/big_bird), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [BLOOM](../model_doc/bloom), [CamemBERT](../model_doc/camembert), [CANINE](../model_doc/canine), [ConvBERT](../model_doc/convbert), [CTRL](../model_doc/ctrl), [Data2VecText](../model_doc/data2vec-text), [DeBERTa](../model_doc/deberta), [DeBERTa-v2](../model_doc/deberta-v2), [DistilBERT](../model_doc/distilbert), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [ErnieM](../model_doc/ernie_m), [ESM](../model_doc/esm), [FlauBERT](../model_doc/flaubert), [FNet](../model_doc/fnet), [Funnel Transformer](../model_doc/funnel), [GPT-Sw3](../model_doc/gpt-sw3), [OpenAI GPT-2](../model_doc/gpt2), [GPT Neo](../model_doc/gpt_neo), [GPT-J](../model_doc/gptj), [I-BERT](../model_doc/ibert), [LayoutLM](../model_doc/layoutlm), [LayoutLMv2](../model_doc/layoutlmv2), [LayoutLMv3](../model_doc/layoutlmv3), [LED](../model_doc/led), [LiLT](../model_doc/lilt), [Longformer](../model_doc/longformer), [LUKE](../model_doc/luke), [MarkupLM](../model_doc/markuplm), [mBART](../model_doc/mbart), [Megatron-BERT](../model_doc/megatron-bert), [MobileBERT](../model_doc/mobilebert), [MPNet](../model_doc/mpnet), [MVP](../model_doc/mvp), [Nezha](../model_doc/nezha), [Nyströmformer](../model_doc/nystromformer), [OpenAI GPT](../model_doc/openai-gpt), [OPT](../model_doc/opt), [Perceiver](../model_doc/perceiver), [PLBart](../model_doc/plbart), [QDQBert](../model_doc/qdqbert), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [SqueezeBERT](../model_doc/squeezebert), [TAPAS](../model_doc/tapas), [Transformer-XL](../model_doc/transfo-xl), [XLM](../model_doc/xlm), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod), [YOSO](../model_doc/yoso) + +[ALBERT](../model_doc/albert), [BART](../model_doc/bart), [BERT](../model_doc/bert), [BigBird](../model_doc/big_bird), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [BioGpt](../model_doc/biogpt), [BLOOM](../model_doc/bloom), [CamemBERT](../model_doc/camembert), [CANINE](../model_doc/canine), [CodeLlama](../model_doc/code_llama), [ConvBERT](../model_doc/convbert), [CTRL](../model_doc/ctrl), [Data2VecText](../model_doc/data2vec-text), [DeBERTa](../model_doc/deberta), [DeBERTa-v2](../model_doc/deberta-v2), [DistilBERT](../model_doc/distilbert), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [ErnieM](../model_doc/ernie_m), [ESM](../model_doc/esm), [Falcon](../model_doc/falcon), [FlauBERT](../model_doc/flaubert), [FNet](../model_doc/fnet), [Funnel Transformer](../model_doc/funnel), [GPT-Sw3](../model_doc/gpt-sw3), [OpenAI GPT-2](../model_doc/gpt2), [GPTBigCode](../model_doc/gpt_bigcode), [GPT Neo](../model_doc/gpt_neo), [GPT NeoX](../model_doc/gpt_neox), [GPT-J](../model_doc/gptj), [I-BERT](../model_doc/ibert), [LayoutLM](../model_doc/layoutlm), [LayoutLMv2](../model_doc/layoutlmv2), [LayoutLMv3](../model_doc/layoutlmv3), [LED](../model_doc/led), [LiLT](../model_doc/lilt), [LLaMA](../model_doc/llama), [Longformer](../model_doc/longformer), [LUKE](../model_doc/luke), [MarkupLM](../model_doc/markuplm), [mBART](../model_doc/mbart), 
[MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [Mistral](../model_doc/mistral), [Mixtral](../model_doc/mixtral), [MobileBERT](../model_doc/mobilebert), [MPNet](../model_doc/mpnet), [MPT](../model_doc/mpt), [MRA](../model_doc/mra), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [Nezha](../model_doc/nezha), [Nyströmformer](../model_doc/nystromformer), [OpenLlama](../model_doc/open-llama), [OpenAI GPT](../model_doc/openai-gpt), [OPT](../model_doc/opt), [Perceiver](../model_doc/perceiver), [Persimmon](../model_doc/persimmon), [Phi](../model_doc/phi), [PLBart](../model_doc/plbart), [QDQBert](../model_doc/qdqbert), [Qwen2](../model_doc/qwen2), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [SqueezeBERT](../model_doc/squeezebert), [StableLm](../model_doc/stablelm), [T5](../model_doc/t5), [TAPAS](../model_doc/tapas), [Transformer-XL](../model_doc/transfo-xl), [UMT5](../model_doc/umt5), [XLM](../model_doc/xlm), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod), [YOSO](../model_doc/yoso) + @@ -38,7 +44,7 @@ The task illustrated in this tutorial is supported by the following model archit Before you begin, make sure you have all the necessary libraries installed: ```bash -pip install transformers datasets evaluate +pip install transformers datasets evaluate accelerate ``` We encourage you to login to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to login: @@ -69,7 +75,7 @@ Then take a look at an example: } ``` -There are two fields in this dataset: +There are two fields in this dataset: - `text`: the movie review text. - `label`: a value that is either `0` for a negative review or `1` for a positive review. @@ -81,7 +87,7 @@ The next step is to load a DistilBERT tokenizer to preprocess the `text` field: ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") ``` Create a preprocessing function to tokenize `text` and truncate sequences to be no longer than DistilBERT's maximum input length: @@ -97,7 +103,7 @@ To apply the preprocessing function over the entire dataset, use 🤗 Datasets [ tokenized_imdb = imdb.map(preprocess_function, batched=True) ``` -Now create a batch of examples using [`DataCollatorWithPadding`]. It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximium length. +Now create a batch of examples using [`DataCollatorWithPadding`]. It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length. @@ -163,7 +169,7 @@ You're ready to start training your model now! Load DistilBERT with [`AutoModelF >>> from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer >>> model = AutoModelForSequenceClassification.from_pretrained( -... "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id +... "distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id ... 
) ``` @@ -237,7 +243,7 @@ Then you can load DistilBERT with [`TFAutoModelForSequenceClassification`] along >>> from transformers import TFAutoModelForSequenceClassification >>> model = TFAutoModelForSequenceClassification.from_pretrained( -... "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id +... "distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id ... ) ``` @@ -259,15 +265,15 @@ Convert your datasets to the `tf.data.Dataset` format with [`~transformers.TFPre ... ) ``` -Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method): +Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method). Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to: ```py >>> import tensorflow as tf ->>> model.compile(optimizer=optimizer) +>>> model.compile(optimizer=optimizer) # No loss argument! ``` -The last two things to setup before you start training is to compute the accuracy from the predictions, and provide a way to push your model to the Hub. Both are done by using [Keras callbacks](./main_classes/keras_callbacks). +The last two things to setup before you start training is to compute the accuracy from the predictions, and provide a way to push your model to the Hub. Both are done by using [Keras callbacks](../main_classes/keras_callbacks). Pass your `compute_metrics` function to [`~transformers.KerasMetricCallback`]: diff --git a/docs/source/en/tasks/summarization.mdx b/docs/source/en/tasks/summarization.md similarity index 93% rename from docs/source/en/tasks/summarization.mdx rename to docs/source/en/tasks/summarization.md index f8127a8aabba2d..28dd3f5a49ebe3 100644 --- a/docs/source/en/tasks/summarization.mdx +++ b/docs/source/en/tasks/summarization.md @@ -8,20 +8,26 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Summarization +[[open-in-colab]] + Summarization creates a shorter version of a document or an article that captures all the important information. Along with translation, it is another example of a task that can be formulated as a sequence-to-sequence task. Summarization can be: - Extractive: extract the most relevant information from a document. -- Abstractive: generate new text that captures the most relevant information. +- Abstractive: generate new text that captures the most relevant information. This guide will show you how to: -1. Finetune [T5](https://huggingface.co/t5-small) on the California state bill subset of the [BillSum](https://huggingface.co/datasets/billsum) dataset for abstractive summarization. +1. Finetune [T5](https://huggingface.co/google-t5/t5-small) on the California state bill subset of the [BillSum](https://huggingface.co/datasets/billsum) dataset for abstractive summarization. 2. Use your finetuned model for inference. 
@@ -29,7 +35,7 @@ The task illustrated in this tutorial is supported by the following model archit -[BART](../model_doc/bart), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [Encoder decoder](../model_doc/encoder-decoder), [FairSeq Machine-Translation](../model_doc/fsmt), [LED](../model_doc/led), [LongT5](../model_doc/longt5), [M2M100](../model_doc/m2m_100), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [NLLB](../model_doc/nllb), [Pegasus](../model_doc/pegasus), [PEGASUS-X](../model_doc/pegasus_x), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [SwitchTransformers](../model_doc/switch_transformers), [T5](../model_doc/t5), [XLM-ProphetNet](../model_doc/xlm-prophetnet) +[BART](../model_doc/bart), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [Encoder decoder](../model_doc/encoder-decoder), [FairSeq Machine-Translation](../model_doc/fsmt), [GPTSAN-japanese](../model_doc/gptsan-japanese), [LED](../model_doc/led), [LongT5](../model_doc/longt5), [M2M100](../model_doc/m2m_100), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [NLLB](../model_doc/nllb), [NLLB-MOE](../model_doc/nllb-moe), [Pegasus](../model_doc/pegasus), [PEGASUS-X](../model_doc/pegasus_x), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [SeamlessM4T](../model_doc/seamless_m4t), [SeamlessM4Tv2](../model_doc/seamless_m4t_v2), [SwitchTransformers](../model_doc/switch_transformers), [T5](../model_doc/t5), [UMT5](../model_doc/umt5), [XLM-ProphetNet](../model_doc/xlm-prophetnet) @@ -86,7 +92,7 @@ The next step is to load a T5 tokenizer to process `text` and `summary`: ```py >>> from transformers import AutoTokenizer ->>> checkpoint = "t5-small" +>>> checkpoint = "google-t5/t5-small" >>> tokenizer = AutoTokenizer.from_pretrained(checkpoint) ``` @@ -116,10 +122,11 @@ To apply the preprocessing function over the entire dataset, use 🤗 Datasets [ >>> tokenized_billsum = billsum.map(preprocess_function, batched=True) ``` -Now create a batch of examples using [`DataCollatorForSeq2Seq`]. It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximium length. +Now create a batch of examples using [`DataCollatorForSeq2Seq`]. It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length. + ```py >>> from transformers import DataCollatorForSeq2Seq @@ -127,6 +134,7 @@ Now create a batch of examples using [`DataCollatorForSeq2Seq`]. It's more effic ``` + ```py >>> from transformers import DataCollatorForSeq2Seq @@ -265,15 +273,15 @@ Convert your datasets to the `tf.data.Dataset` format with [`~transformers.TFPre ... ) ``` -Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method): +Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method). 
Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to: ```py >>> import tensorflow as tf ->>> model.compile(optimizer=optimizer) +>>> model.compile(optimizer=optimizer) # No loss argument! ``` -The last two things to setup before you start training is to compute the ROUGE score from the predictions, and provide a way to push your model to the Hub. Both are done by using [Keras callbacks](./main_classes/keras_callbacks). +The last two things to setup before you start training is to compute the ROUGE score from the predictions, and provide a way to push your model to the Hub. Both are done by using [Keras callbacks](../main_classes/keras_callbacks). Pass your `compute_metrics` function to [`~transformers.KerasMetricCallback`]: @@ -352,7 +360,7 @@ Tokenize the text and return the `input_ids` as PyTorch tensors: >>> inputs = tokenizer(text, return_tensors="pt").input_ids ``` -Use the [`~transformers.generation_utils.GenerationMixin.generate`] method to create the summarization. For more details about the different text generation strategies and parameters for controlling generation, check out the [Text Generation](./main_classes/text_generation) API. +Use the [`~transformers.generation_utils.GenerationMixin.generate`] method to create the summarization. For more details about the different text generation strategies and parameters for controlling generation, check out the [Text Generation](../main_classes/text_generation) API. ```py >>> from transformers import AutoModelForSeq2SeqLM @@ -378,7 +386,7 @@ Tokenize the text and return the `input_ids` as TensorFlow tensors: >>> inputs = tokenizer(text, return_tensors="tf").input_ids ``` -Use the [`~transformers.generation_tf_utils.TFGenerationMixin.generate`] method to create the summarization. For more details about the different text generation strategies and parameters for controlling generation, check out the [Text Generation](./main_classes/text_generation) API. +Use the [`~transformers.generation_tf_utils.TFGenerationMixin.generate`] method to create the summarization. For more details about the different text generation strategies and parameters for controlling generation, check out the [Text Generation](../main_classes/text_generation) API. ```py >>> from transformers import TFAutoModelForSeq2SeqLM @@ -394,4 +402,4 @@ Decode the generated token ids back into text: 'the inflation reduction act lowers prescription drug costs, health care costs, and energy costs. it's the most aggressive action on tackling the climate crisis in american history. it will ask the ultra-wealthy and corporations to pay their fair share.' ``` - \ No newline at end of file + diff --git a/docs/source/en/tasks/text-to-speech.md b/docs/source/en/tasks/text-to-speech.md new file mode 100644 index 00000000000000..0b324904e9e263 --- /dev/null +++ b/docs/source/en/tasks/text-to-speech.md @@ -0,0 +1,637 @@ + + +# Text to speech + +[[open-in-colab]] + +Text-to-speech (TTS) is the task of creating natural-sounding speech from text, where the speech can be generated in multiple +languages and for multiple speakers. Several text-to-speech models are currently available in 🤗 Transformers, such as +[Bark](../model_doc/bark), [MMS](../model_doc/mms), [VITS](../model_doc/vits) and [SpeechT5](../model_doc/speecht5). + +You can easily generate audio using the `"text-to-audio"` pipeline (or its alias - `"text-to-speech"`). 
Some models, like Bark, +can also be conditioned to generate non-verbal communications such as laughing, sighing and crying, or even add music. +Here's an example of how you would use the `"text-to-speech"` pipeline with Bark: + +```py +>>> from transformers import pipeline + +>>> pipe = pipeline("text-to-speech", model="suno/bark-small") +>>> text = "[clears throat] This is a test ... and I just took a long pause." +>>> output = pipe(text) +``` + +Here's a code snippet you can use to listen to the resulting audio in a notebook: + +```python +>>> from IPython.display import Audio +>>> Audio(output["audio"], rate=output["sampling_rate"]) +``` + +For more examples on what Bark and other pretrained TTS models can do, refer to our +[Audio course](https://huggingface.co/learn/audio-course/chapter6/pre-trained_models). + +If you are looking to fine-tune a TTS model, the only text-to-speech models currently available in 🤗 Transformers +are [SpeechT5](model_doc/speecht5) and [FastSpeech2Conformer](model_doc/fastspeech2_conformer), though more will be added in the future. SpeechT5 is pre-trained on a combination of speech-to-text and text-to-speech data, allowing it to learn a unified space of hidden representations shared by both text and speech. This means that the same pre-trained model can be fine-tuned for different tasks. Furthermore, SpeechT5 supports multiple speakers through x-vector speaker embeddings. + +The remainder of this guide illustrates how to: + +1. Fine-tune [SpeechT5](../model_doc/speecht5) that was originally trained on English speech on the Dutch (`nl`) language subset of the [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) dataset. +2. Use your refined model for inference in one of two ways: using a pipeline or directly. + +Before you begin, make sure you have all the necessary libraries installed: + +```bash +pip install datasets soundfile speechbrain accelerate +``` + +Install 🤗Transformers from source as not all the SpeechT5 features have been merged into an official release yet: + +```bash +pip install git+https://github.com/huggingface/transformers.git +``` + + + +To follow this guide you will need a GPU. If you're working in a notebook, run the following line to check if a GPU is available: + +```bash +!nvidia-smi +``` + +or alternatively for AMD GPUs: + +```bash +!rocm-smi +``` + + + +We encourage you to log in to your Hugging Face account to upload and share your model with the community. When prompted, enter your token to log in: + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## Load the dataset + +[VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) is a large-scale multilingual speech corpus consisting of +data sourced from 2009-2020 European Parliament event recordings. It contains labelled audio-transcription data for 15 +European languages. In this guide, we are using the Dutch language subset, feel free to pick another subset. + +Note that VoxPopuli or any other automated speech recognition (ASR) dataset may not be the most suitable +option for training TTS models. The features that make it beneficial for ASR, such as excessive background noise, are +typically undesirable in TTS. However, finding top-quality, multilingual, and multi-speaker TTS datasets can be quite +challenging. 
+ +Let's load the data: + +```py +>>> from datasets import load_dataset, Audio + +>>> dataset = load_dataset("facebook/voxpopuli", "nl", split="train") +>>> len(dataset) +20968 +``` + +20968 examples should be sufficient for fine-tuning. SpeechT5 expects audio data to have a sampling rate of 16 kHz, so +make sure the examples in the dataset meet this requirement: + +```py +dataset = dataset.cast_column("audio", Audio(sampling_rate=16000)) +``` + +## Preprocess the data + +Let's begin by defining the model checkpoint to use and loading the appropriate processor: + +```py +>>> from transformers import SpeechT5Processor + +>>> checkpoint = "microsoft/speecht5_tts" +>>> processor = SpeechT5Processor.from_pretrained(checkpoint) +``` + +### Text cleanup for SpeechT5 tokenization + +Start by cleaning up the text data. You'll need the tokenizer part of the processor to process the text: + +```py +>>> tokenizer = processor.tokenizer +``` + +The dataset examples contain `raw_text` and `normalized_text` features. When deciding which feature to use as the text input, +consider that the SpeechT5 tokenizer doesn't have any tokens for numbers. In `normalized_text` the numbers are written +out as text. Thus, it is a better fit, and we recommend using `normalized_text` as input text. + +Because SpeechT5 was trained on the English language, it may not recognize certain characters in the Dutch dataset. If +left as is, these characters will be converted to `` tokens. However, in Dutch, certain characters like `à` are +used to stress syllables. In order to preserve the meaning of the text, we can replace this character with a regular `a`. + +To identify unsupported tokens, extract all unique characters in the dataset using the `SpeechT5Tokenizer` which +works with characters as tokens. To do this, write the `extract_all_chars` mapping function that concatenates +the transcriptions from all examples into one string and converts it to a set of characters. +Make sure to set `batched=True` and `batch_size=-1` in `dataset.map()` so that all transcriptions are available at once for +the mapping function. + +```py +>>> def extract_all_chars(batch): +... all_text = " ".join(batch["normalized_text"]) +... vocab = list(set(all_text)) +... return {"vocab": [vocab], "all_text": [all_text]} + + +>>> vocabs = dataset.map( +... extract_all_chars, +... batched=True, +... batch_size=-1, +... keep_in_memory=True, +... remove_columns=dataset.column_names, +... ) + +>>> dataset_vocab = set(vocabs["vocab"][0]) +>>> tokenizer_vocab = {k for k, _ in tokenizer.get_vocab().items()} +``` + +Now you have two sets of characters: one with the vocabulary from the dataset and one with the vocabulary from the tokenizer. +To identify any unsupported characters in the dataset, you can take the difference between these two sets. The resulting +set will contain the characters that are in the dataset but not in the tokenizer. + +```py +>>> dataset_vocab - tokenizer_vocab +{' ', 'à', 'ç', 'è', 'ë', 'í', 'ï', 'ö', 'ü'} +``` + +To handle the unsupported characters identified in the previous step, define a function that maps these characters to +valid tokens. Note that spaces are already replaced by `▁` in the tokenizer and don't need to be handled separately. + +```py +>>> replacements = [ +... ("à", "a"), +... ("ç", "c"), +... ("è", "e"), +... ("ë", "e"), +... ("í", "i"), +... ("ï", "i"), +... ("ö", "o"), +... ("ü", "u"), +... ] + + +>>> def cleanup_text(inputs): +... for src, dst in replacements: +... 
inputs["normalized_text"] = inputs["normalized_text"].replace(src, dst) +... return inputs + + +>>> dataset = dataset.map(cleanup_text) +``` + +Now that you have dealt with special characters in the text, it's time to shift focus to the audio data. + +### Speakers + +The VoxPopuli dataset includes speech from multiple speakers, but how many speakers are represented in the dataset? To +determine this, we can count the number of unique speakers and the number of examples each speaker contributes to the dataset. +With a total of 20,968 examples in the dataset, this information will give us a better understanding of the distribution of +speakers and examples in the data. + +```py +>>> from collections import defaultdict + +>>> speaker_counts = defaultdict(int) + +>>> for speaker_id in dataset["speaker_id"]: +... speaker_counts[speaker_id] += 1 +``` + +By plotting a histogram you can get a sense of how much data there is for each speaker. + +```py +>>> import matplotlib.pyplot as plt + +>>> plt.figure() +>>> plt.hist(speaker_counts.values(), bins=20) +>>> plt.ylabel("Speakers") +>>> plt.xlabel("Examples") +>>> plt.show() +``` + +
+ Speakers histogram +
+ +The histogram reveals that approximately one-third of the speakers in the dataset have fewer than 100 examples, while +around ten speakers have more than 500 examples. To improve training efficiency and balance the dataset, we can limit +the data to speakers with between 100 and 400 examples. + +```py +>>> def select_speaker(speaker_id): +... return 100 <= speaker_counts[speaker_id] <= 400 + + +>>> dataset = dataset.filter(select_speaker, input_columns=["speaker_id"]) +``` + +Let's check how many speakers remain: + +```py +>>> len(set(dataset["speaker_id"])) +42 +``` + +Let's see how many examples are left: + +```py +>>> len(dataset) +9973 +``` + +You are left with just under 10,000 examples from approximately 40 unique speakers, which should be sufficient. + +Note that some speakers with few examples may actually have more audio available if the examples are long. However, +determining the total amount of audio for each speaker requires scanning through the entire dataset, which is a +time-consuming process that involves loading and decoding each audio file. As such, we have chosen to skip this step here. + +### Speaker embeddings + +To enable the TTS model to differentiate between multiple speakers, you'll need to create a speaker embedding for each example. +The speaker embedding is an additional input into the model that captures a particular speaker's voice characteristics. +To generate these speaker embeddings, use the pre-trained [spkrec-xvect-voxceleb](https://huggingface.co/speechbrain/spkrec-xvect-voxceleb) +model from SpeechBrain. + +Create a function `create_speaker_embedding()` that takes an input audio waveform and outputs a 512-element vector +containing the corresponding speaker embedding. + +```py +>>> import os +>>> import torch +>>> from speechbrain.pretrained import EncoderClassifier + +>>> spk_model_name = "speechbrain/spkrec-xvect-voxceleb" + +>>> device = "cuda" if torch.cuda.is_available() else "cpu" +>>> speaker_model = EncoderClassifier.from_hparams( +... source=spk_model_name, +... run_opts={"device": device}, +... savedir=os.path.join("/tmp", spk_model_name), +... ) + + +>>> def create_speaker_embedding(waveform): +... with torch.no_grad(): +... speaker_embeddings = speaker_model.encode_batch(torch.tensor(waveform)) +... speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2) +... speaker_embeddings = speaker_embeddings.squeeze().cpu().numpy() +... return speaker_embeddings +``` + +It's important to note that the `speechbrain/spkrec-xvect-voxceleb` model was trained on English speech from the VoxCeleb +dataset, whereas the training examples in this guide are in Dutch. While we believe that this model will still generate +reasonable speaker embeddings for our Dutch dataset, this assumption may not hold true in all cases. + +For optimal results, we recommend training an X-vector model on the target speech first. This will ensure that the model +is better able to capture the unique voice characteristics present in the Dutch language. + +### Processing the dataset + +Finally, let's process the data into the format the model expects. Create a `prepare_dataset` function that takes in a +single example and uses the `SpeechT5Processor` object to tokenize the input text and load the target audio into a log-mel spectrogram. +It should also add the speaker embeddings as an additional input. + +```py +>>> def prepare_dataset(example): +... audio = example["audio"] + +... example = processor( +... text=example["normalized_text"], +... 
audio_target=audio["array"], +... sampling_rate=audio["sampling_rate"], +... return_attention_mask=False, +... ) + +... # strip off the batch dimension +... example["labels"] = example["labels"][0] + +... # use SpeechBrain to obtain x-vector +... example["speaker_embeddings"] = create_speaker_embedding(audio["array"]) + +... return example +``` + +Verify the processing is correct by looking at a single example: + +```py +>>> processed_example = prepare_dataset(dataset[0]) +>>> list(processed_example.keys()) +['input_ids', 'labels', 'stop_labels', 'speaker_embeddings'] +``` + +Speaker embeddings should be a 512-element vector: + +```py +>>> processed_example["speaker_embeddings"].shape +(512,) +``` + +The labels should be a log-mel spectrogram with 80 mel bins. + +```py +>>> import matplotlib.pyplot as plt + +>>> plt.figure() +>>> plt.imshow(processed_example["labels"].T) +>>> plt.show() +``` + +
+ Log-mel spectrogram with 80 mel bins +
+ +Side note: If you find this spectrogram confusing, it may be due to your familiarity with the convention of placing low frequencies +at the bottom and high frequencies at the top of a plot. However, when plotting spectrograms as an image using the matplotlib library, +the y-axis is flipped and the spectrograms appear upside down. + +Now apply the processing function to the entire dataset. This will take between 5 and 10 minutes. + +```py +>>> dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names) +``` + +You'll see a warning saying that some examples in the dataset are longer than the maximum input length the model can handle (600 tokens). +Remove those examples from the dataset. Here we go even further and to allow for larger batch sizes we remove anything over 200 tokens. + +```py +>>> def is_not_too_long(input_ids): +... input_length = len(input_ids) +... return input_length < 200 + + +>>> dataset = dataset.filter(is_not_too_long, input_columns=["input_ids"]) +>>> len(dataset) +8259 +``` + +Next, create a basic train/test split: + +```py +>>> dataset = dataset.train_test_split(test_size=0.1) +``` + +### Data collator + +In order to combine multiple examples into a batch, you need to define a custom data collator. This collator will pad shorter sequences with padding +tokens, ensuring that all examples have the same length. For the spectrogram labels, the padded portions are replaced with the special value `-100`. This special value +instructs the model to ignore that part of the spectrogram when calculating the spectrogram loss. + +```py +>>> from dataclasses import dataclass +>>> from typing import Any, Dict, List, Union + + +>>> @dataclass +... class TTSDataCollatorWithPadding: +... processor: Any + +... def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]: +... input_ids = [{"input_ids": feature["input_ids"]} for feature in features] +... label_features = [{"input_values": feature["labels"]} for feature in features] +... speaker_features = [feature["speaker_embeddings"] for feature in features] + +... # collate the inputs and targets into a batch +... batch = processor.pad(input_ids=input_ids, labels=label_features, return_tensors="pt") + +... # replace padding with -100 to ignore loss correctly +... batch["labels"] = batch["labels"].masked_fill(batch.decoder_attention_mask.unsqueeze(-1).ne(1), -100) + +... # not used during fine-tuning +... del batch["decoder_attention_mask"] + +... # round down target lengths to multiple of reduction factor +... if model.config.reduction_factor > 1: +... target_lengths = torch.tensor([len(feature["input_values"]) for feature in label_features]) +... target_lengths = target_lengths.new( +... [length - length % model.config.reduction_factor for length in target_lengths] +... ) +... max_length = max(target_lengths) +... batch["labels"] = batch["labels"][:, :max_length] + +... # also add in the speaker embeddings +... batch["speaker_embeddings"] = torch.tensor(speaker_features) + +... return batch +``` + +In SpeechT5, the input to the decoder part of the model is reduced by a factor 2. In other words, it throws away every +other timestep from the target sequence. The decoder then predicts a sequence that is twice as long. Since the original +target sequence length may be odd, the data collator makes sure to round the maximum length of the batch down to be a +multiple of 2. 
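As a quick, standalone illustration of that rounding logic (the lengths below are made-up examples; 2 mirrors SpeechT5's default `reduction_factor`):

```py
reduction_factor = 2              # model.config.reduction_factor for SpeechT5
target_lengths = [173, 201, 164]  # hypothetical spectrogram lengths in one batch

# round each length down to a multiple of the reduction factor, as the collator does
rounded = [length - length % reduction_factor for length in target_lengths]
print(rounded)       # [172, 200, 164]
print(max(rounded))  # 200 -> the padded labels are truncated to this length
```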
+ +```py +>>> data_collator = TTSDataCollatorWithPadding(processor=processor) +``` + +## Train the model + +Load the pre-trained model from the same checkpoint as you used for loading the processor: + +```py +>>> from transformers import SpeechT5ForTextToSpeech + +>>> model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint) +``` + +The `use_cache=True` option is incompatible with gradient checkpointing. Disable it for training. + +```py +>>> model.config.use_cache = False +``` + +Define the training arguments. Here we are not computing any evaluation metrics during the training process. Instead, we'll +only look at the loss: + +```python +>>> from transformers import Seq2SeqTrainingArguments + +>>> training_args = Seq2SeqTrainingArguments( +... output_dir="speecht5_finetuned_voxpopuli_nl", # change to a repo name of your choice +... per_device_train_batch_size=4, +... gradient_accumulation_steps=8, +... learning_rate=1e-5, +... warmup_steps=500, +... max_steps=4000, +... gradient_checkpointing=True, +... fp16=True, +... evaluation_strategy="steps", +... per_device_eval_batch_size=2, +... save_steps=1000, +... eval_steps=1000, +... logging_steps=25, +... report_to=["tensorboard"], +... load_best_model_at_end=True, +... greater_is_better=False, +... label_names=["labels"], +... push_to_hub=True, +... ) +``` + +Instantiate the `Trainer` object and pass the model, dataset, and data collator to it. + +```py +>>> from transformers import Seq2SeqTrainer + +>>> trainer = Seq2SeqTrainer( +... args=training_args, +... model=model, +... train_dataset=dataset["train"], +... eval_dataset=dataset["test"], +... data_collator=data_collator, +... tokenizer=processor, +... ) +``` + +And with that, you're ready to start training! Training will take several hours. Depending on your GPU, +it is possible that you will encounter a CUDA "out-of-memory" error when you start training. In this case, you can reduce +the `per_device_train_batch_size` incrementally by factors of 2 and increase `gradient_accumulation_steps` by 2x to compensate. + +```py +>>> trainer.train() +``` + +To be able to use your checkpoint with a pipeline, make sure to save the processor with the checkpoint: + +```py +>>> processor.save_pretrained("YOUR_ACCOUNT_NAME/speecht5_finetuned_voxpopuli_nl") +``` + +Push the final model to the 🤗 Hub: + +```py +>>> trainer.push_to_hub() +``` + +## Inference + +### Inference with a pipeline + +Great, now that you've fine-tuned a model, you can use it for inference! +First, let's see how you can use it with a corresponding pipeline. Let's create a `"text-to-speech"` pipeline with your +checkpoint: + +```py +>>> from transformers import pipeline + +>>> pipe = pipeline("text-to-speech", model="YOUR_ACCOUNT_NAME/speecht5_finetuned_voxpopuli_nl") +``` + +Pick a piece of text in Dutch you'd like narrated, e.g.: + +```py +>>> text = "hallo allemaal, ik praat nederlands. groetjes aan iedereen!" +``` + +To use SpeechT5 with the pipeline, you'll need a speaker embedding. 
Let's get it from an example in the test dataset: + +```py +>>> example = dataset["test"][304] +>>> speaker_embeddings = torch.tensor(example["speaker_embeddings"]).unsqueeze(0) +``` + +Now you can pass the text and speaker embeddings to the pipeline, and it will take care of the rest: + +```py +>>> forward_params = {"speaker_embeddings": speaker_embeddings} +>>> output = pipe(text, forward_params=forward_params) +>>> output +{'audio': array([-6.82714235e-05, -4.26525949e-04, 1.06134125e-04, ..., + -1.22392643e-03, -7.76011671e-04, 3.29112721e-04], dtype=float32), + 'sampling_rate': 16000} +``` + +You can then listen to the result: + +```py +>>> from IPython.display import Audio +>>> Audio(output['audio'], rate=output['sampling_rate']) +``` + +### Run inference manually + +You can achieve the same inference results without using the pipeline; however, more steps will be required. + +Load the model from the 🤗 Hub: + +```py +>>> model = SpeechT5ForTextToSpeech.from_pretrained("YOUR_ACCOUNT/speecht5_finetuned_voxpopuli_nl") +``` + +Pick an example from the test dataset to obtain a speaker embedding. + +```py +>>> example = dataset["test"][304] +>>> speaker_embeddings = torch.tensor(example["speaker_embeddings"]).unsqueeze(0) +``` + +Define the input text and tokenize it. + +```py +>>> text = "hallo allemaal, ik praat nederlands. groetjes aan iedereen!" +>>> inputs = processor(text=text, return_tensors="pt") +``` + +Create a spectrogram with your model: + +```py +>>> spectrogram = model.generate_speech(inputs["input_ids"], speaker_embeddings) +``` + +Visualize the spectrogram, if you'd like to: + +```py +>>> plt.figure() +>>> plt.imshow(spectrogram.T) +>>> plt.show() +``` + +
+ Generated log-mel spectrogram +
+ +Finally, use the vocoder to turn the spectrogram into sound. + +```py +>>> with torch.no_grad(): +... speech = vocoder(spectrogram) + +>>> from IPython.display import Audio + +>>> Audio(speech.numpy(), rate=16000) +``` + +In our experience, obtaining satisfactory results from this model can be challenging. The quality of the speaker +embeddings appears to be a significant factor. Since SpeechT5 was pre-trained with English x-vectors, it performs best +when using English speaker embeddings. If the synthesized speech sounds poor, try using a different speaker embedding. + +Increasing the training duration is also likely to enhance the quality of the results. Even so, the speech clearly is Dutch instead of English, and it does +capture the voice characteristics of the speaker (compare to the original audio in the example). +Another thing to experiment with is the model's configuration. For example, try using `config.reduction_factor = 1` to +see if this improves the results. + +Finally, it is essential to consider ethical considerations. Although TTS technology has numerous useful applications, it +may also be used for malicious purposes, such as impersonating someone's voice without their knowledge or consent. Please +use TTS judiciously and responsibly. diff --git a/docs/source/en/tasks/token_classification.mdx b/docs/source/en/tasks/token_classification.md similarity index 84% rename from docs/source/en/tasks/token_classification.mdx rename to docs/source/en/tasks/token_classification.md index 10ac695a870bd7..791737b677c871 100644 --- a/docs/source/en/tasks/token_classification.mdx +++ b/docs/source/en/tasks/token_classification.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Token classification @@ -16,11 +20,11 @@ specific language governing permissions and limitations under the License. -Token classification assigns a label to individual tokens in a sentence. One of the most common token classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence, such as a person, location, or organization. +Token classification assigns a label to individual tokens in a sentence. One of the most common token classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence, such as a person, location, or organization. This guide will show you how to: -1. Finetune [DistilBERT](https://huggingface.co/distilbert-base-uncased) on the [WNUT 17](https://huggingface.co/datasets/wnut_17) dataset to detect new entities. +1. Finetune [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) on the [WNUT 17](https://huggingface.co/datasets/wnut_17) dataset to detect new entities. 2. Use your finetuned model for inference. 
@@ -28,7 +32,7 @@ The task illustrated in this tutorial is supported by the following model archit -[ALBERT](../model_doc/albert), [BERT](../model_doc/bert), [BigBird](../model_doc/big_bird), [BLOOM](../model_doc/bloom), [CamemBERT](../model_doc/camembert), [CANINE](../model_doc/canine), [ConvBERT](../model_doc/convbert), [Data2VecText](../model_doc/data2vec-text), [DeBERTa](../model_doc/deberta), [DeBERTa-v2](../model_doc/deberta-v2), [DistilBERT](../model_doc/distilbert), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [ErnieM](../model_doc/ernie_m), [ESM](../model_doc/esm), [FlauBERT](../model_doc/flaubert), [FNet](../model_doc/fnet), [Funnel Transformer](../model_doc/funnel), [GPT-Sw3](../model_doc/gpt-sw3), [OpenAI GPT-2](../model_doc/gpt2), [I-BERT](../model_doc/ibert), [LayoutLM](../model_doc/layoutlm), [LayoutLMv2](../model_doc/layoutlmv2), [LayoutLMv3](../model_doc/layoutlmv3), [LiLT](../model_doc/lilt), [Longformer](../model_doc/longformer), [LUKE](../model_doc/luke), [MarkupLM](../model_doc/markuplm), [Megatron-BERT](../model_doc/megatron-bert), [MobileBERT](../model_doc/mobilebert), [MPNet](../model_doc/mpnet), [Nezha](../model_doc/nezha), [Nyströmformer](../model_doc/nystromformer), [QDQBert](../model_doc/qdqbert), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [SqueezeBERT](../model_doc/squeezebert), [XLM](../model_doc/xlm), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod), [YOSO](../model_doc/yoso) +[ALBERT](../model_doc/albert), [BERT](../model_doc/bert), [BigBird](../model_doc/big_bird), [BioGpt](../model_doc/biogpt), [BLOOM](../model_doc/bloom), [BROS](../model_doc/bros), [CamemBERT](../model_doc/camembert), [CANINE](../model_doc/canine), [ConvBERT](../model_doc/convbert), [Data2VecText](../model_doc/data2vec-text), [DeBERTa](../model_doc/deberta), [DeBERTa-v2](../model_doc/deberta-v2), [DistilBERT](../model_doc/distilbert), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [ErnieM](../model_doc/ernie_m), [ESM](../model_doc/esm), [Falcon](../model_doc/falcon), [FlauBERT](../model_doc/flaubert), [FNet](../model_doc/fnet), [Funnel Transformer](../model_doc/funnel), [GPT-Sw3](../model_doc/gpt-sw3), [OpenAI GPT-2](../model_doc/gpt2), [GPTBigCode](../model_doc/gpt_bigcode), [GPT Neo](../model_doc/gpt_neo), [GPT NeoX](../model_doc/gpt_neox), [I-BERT](../model_doc/ibert), [LayoutLM](../model_doc/layoutlm), [LayoutLMv2](../model_doc/layoutlmv2), [LayoutLMv3](../model_doc/layoutlmv3), [LiLT](../model_doc/lilt), [Longformer](../model_doc/longformer), [LUKE](../model_doc/luke), [MarkupLM](../model_doc/markuplm), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [MobileBERT](../model_doc/mobilebert), [MPNet](../model_doc/mpnet), [MPT](../model_doc/mpt), [MRA](../model_doc/mra), [MT5](../model_doc/mt5), [Nezha](../model_doc/nezha), [Nyströmformer](../model_doc/nystromformer), [Phi](../model_doc/phi), [QDQBert](../model_doc/qdqbert), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [SqueezeBERT](../model_doc/squeezebert), [T5](../model_doc/t5), [UMT5](../model_doc/umt5), [XLM](../model_doc/xlm), [XLM-RoBERTa](../model_doc/xlm-roberta), 
[XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod), [YOSO](../model_doc/yoso) @@ -106,7 +110,7 @@ The next step is to load a DistilBERT tokenizer to preprocess the `tokens` field ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") ``` As you saw in the example `tokens` field above, it looks like the input has already been tokenized. But the input actually hasn't been tokenized yet and you'll need to set `is_split_into_words=True` to tokenize the words into subwords. For example: @@ -121,8 +125,8 @@ As you saw in the example `tokens` field above, it looks like the input has alre However, this adds some special tokens `[CLS]` and `[SEP]` and the subword tokenization creates a mismatch between the input and labels. A single word corresponding to a single label may now be split into two subwords. You'll need to realign the tokens and labels by: -1. Mapping all tokens to their corresponding word with the [`word_ids`](https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#tokenizers.Encoding.word_ids) method. -2. Assigning the label `-100` to the special tokens `[CLS]` and `[SEP]` so they're ignored by the PyTorch loss function. +1. Mapping all tokens to their corresponding word with the [`word_ids`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.BatchEncoding.word_ids) method. +2. Assigning the label `-100` to the special tokens `[CLS]` and `[SEP]` so they're ignored by the PyTorch loss function (see [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)). 3. Only labeling the first token of a given word. Assign `-100` to other subtokens from the same word. Here is how you can create a function to realign the tokens and labels, and truncate sequences to be no longer than DistilBERT's maximum input length: @@ -156,7 +160,7 @@ To apply the preprocessing function over the entire dataset, use 🤗 Datasets [ >>> tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True) ``` -Now create a batch of examples using [`DataCollatorWithPadding`]. It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximium length. +Now create a batch of examples using [`DataCollatorWithPadding`]. It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length. @@ -268,7 +272,7 @@ You're ready to start training your model now! Load DistilBERT with [`AutoModelF >>> from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer >>> model = AutoModelForTokenClassification.from_pretrained( -... "distilbert-base-uncased", num_labels=13, id2label=id2label, label2id=label2id +... "distilbert/distilbert-base-uncased", num_labels=13, id2label=id2label, label2id=label2id ... ) ``` @@ -339,7 +343,7 @@ Then you can load DistilBERT with [`TFAutoModelForTokenClassification`] along wi >>> from transformers import TFAutoModelForTokenClassification >>> model = TFAutoModelForTokenClassification.from_pretrained( -... "distilbert-base-uncased", num_labels=13, id2label=id2label, label2id=label2id +... "distilbert/distilbert-base-uncased", num_labels=13, id2label=id2label, label2id=label2id ... 
) ``` @@ -361,15 +365,15 @@ Convert your datasets to the `tf.data.Dataset` format with [`~transformers.TFPre ... ) ``` -Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method): +Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method). Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to: ```py >>> import tensorflow as tf ->>> model.compile(optimizer=optimizer) +>>> model.compile(optimizer=optimizer) # No loss argument! ``` -The last two things to setup before you start training is to compute the seqeval scores from the predictions, and provide a way to push your model to the Hub. Both are done by using [Keras callbacks](./main_classes/keras_callbacks). +The last two things to setup before you start training is to compute the seqeval scores from the predictions, and provide a way to push your model to the Hub. Both are done by using [Keras callbacks](../main_classes/keras_callbacks). Pass your `compute_metrics` function to [`~transformers.KerasMetricCallback`]: diff --git a/docs/source/en/tasks/translation.mdx b/docs/source/en/tasks/translation.md similarity index 89% rename from docs/source/en/tasks/translation.mdx rename to docs/source/en/tasks/translation.md index f1a03b808774f6..f0433a0dad797d 100644 --- a/docs/source/en/tasks/translation.mdx +++ b/docs/source/en/tasks/translation.md @@ -8,17 +8,23 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Translation +[[open-in-colab]] + Translation converts a sequence of text from one language to another. It is one of several tasks you can formulate as a sequence-to-sequence problem, a powerful framework for returning some output from an input, like translation or summarization. Translation systems are commonly used for translation between different language texts, but it can also be used for speech or some combination in between like text-to-speech or speech-to-text. This guide will show you how to: -1. Finetune [T5](https://huggingface.co/t5-small) on the English-French subset of the [OPUS Books](https://huggingface.co/datasets/opus_books) dataset to translate English text to French. +1. Finetune [T5](https://huggingface.co/google-t5/t5-small) on the English-French subset of the [OPUS Books](https://huggingface.co/datasets/opus_books) dataset to translate English text to French. 2. Use your finetuned model for inference. 
@@ -26,7 +32,7 @@ The task illustrated in this tutorial is supported by the following model archit -[BART](../model_doc/bart), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [Encoder decoder](../model_doc/encoder-decoder), [FairSeq Machine-Translation](../model_doc/fsmt), [LED](../model_doc/led), [LongT5](../model_doc/longt5), [M2M100](../model_doc/m2m_100), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [NLLB](../model_doc/nllb), [Pegasus](../model_doc/pegasus), [PEGASUS-X](../model_doc/pegasus_x), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [SwitchTransformers](../model_doc/switch_transformers), [T5](../model_doc/t5), [XLM-ProphetNet](../model_doc/xlm-prophetnet) +[BART](../model_doc/bart), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [Encoder decoder](../model_doc/encoder-decoder), [FairSeq Machine-Translation](../model_doc/fsmt), [GPTSAN-japanese](../model_doc/gptsan-japanese), [LED](../model_doc/led), [LongT5](../model_doc/longt5), [M2M100](../model_doc/m2m_100), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [NLLB](../model_doc/nllb), [NLLB-MOE](../model_doc/nllb-moe), [Pegasus](../model_doc/pegasus), [PEGASUS-X](../model_doc/pegasus_x), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [SeamlessM4T](../model_doc/seamless_m4t), [SeamlessM4Tv2](../model_doc/seamless_m4t_v2), [SwitchTransformers](../model_doc/switch_transformers), [T5](../model_doc/t5), [UMT5](../model_doc/umt5), [XLM-ProphetNet](../model_doc/xlm-prophetnet) @@ -82,7 +88,7 @@ The next step is to load a T5 tokenizer to process the English-French language p ```py >>> from transformers import AutoTokenizer ->>> checkpoint = "t5-small" +>>> checkpoint = "google-t5/t5-small" >>> tokenizer = AutoTokenizer.from_pretrained(checkpoint) ``` @@ -111,7 +117,7 @@ To apply the preprocessing function over the entire dataset, use 🤗 Datasets [ >>> tokenized_books = books.map(preprocess_function, batched=True) ``` -Now create a batch of examples using [`DataCollatorForSeq2Seq`]. It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximium length. +Now create a batch of examples using [`DataCollatorForSeq2Seq`]. It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length. @@ -226,7 +232,7 @@ At this point, only three steps remain: ... ) >>> trainer.train() -```` +``` Once training is completed, share your model to the Hub with the [`~transformers.Trainer.push_to_hub`] method so everyone can use your model: @@ -274,15 +280,15 @@ Convert your datasets to the `tf.data.Dataset` format with [`~transformers.TFPre ... ) ``` -Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method): +Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method). 
Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to: ```py >>> import tensorflow as tf ->>> model.compile(optimizer=optimizer) +>>> model.compile(optimizer=optimizer) # No loss argument! ``` -The last two things to setup before you start training is to compute the SacreBLEU metric from the predictions, and provide a way to push your model to the Hub. Both are done by using [Keras callbacks](./main_classes/keras_callbacks). +The last two things to setup before you start training is to compute the SacreBLEU metric from the predictions, and provide a way to push your model to the Hub. Both are done by using [Keras callbacks](../main_classes/keras_callbacks). Pass your `compute_metrics` function to [`~transformers.KerasMetricCallback`]: @@ -360,7 +366,7 @@ Tokenize the text and return the `input_ids` as PyTorch tensors: >>> inputs = tokenizer(text, return_tensors="pt").input_ids ``` -Use the [`~transformers.generation_utils.GenerationMixin.generate`] method to create the translation. For more details about the different text generation strategies and parameters for controlling generation, check out the [Text Generation](./main_classes/text_generation) API. +Use the [`~transformers.generation_utils.GenerationMixin.generate`] method to create the translation. For more details about the different text generation strategies and parameters for controlling generation, check out the [Text Generation](../main_classes/text_generation) API. ```py >>> from transformers import AutoModelForSeq2SeqLM @@ -386,7 +392,7 @@ Tokenize the text and return the `input_ids` as TensorFlow tensors: >>> inputs = tokenizer(text, return_tensors="tf").input_ids ``` -Use the [`~transformers.generation_tf_utils.TFGenerationMixin.generate`] method to create the translation. For more details about the different text generation strategies and parameters for controlling generation, check out the [Text Generation](./main_classes/text_generation) API. +Use the [`~transformers.generation_tf_utils.TFGenerationMixin.generate`] method to create the translation. For more details about the different text generation strategies and parameters for controlling generation, check out the [Text Generation](../main_classes/text_generation) API. ```py >>> from transformers import TFAutoModelForSeq2SeqLM @@ -402,4 +408,4 @@ Decode the generated token ids back into text: 'Les lugumes partagent les ressources avec des bactéries fixatrices d'azote.' ```
-
\ No newline at end of file +
diff --git a/docs/source/en/tasks/video_classification.mdx b/docs/source/en/tasks/video_classification.md similarity index 98% rename from docs/source/en/tasks/video_classification.mdx rename to docs/source/en/tasks/video_classification.md index 57dc00c1bf4923..38bdceba41b7b4 100644 --- a/docs/source/en/tasks/video_classification.mdx +++ b/docs/source/en/tasks/video_classification.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Video classification @@ -26,7 +30,7 @@ The task illustrated in this tutorial is supported by the following model archit -[TimeSformer](../model_doc/timesformer), [VideoMAE](../model_doc/videomae) +[TimeSformer](../model_doc/timesformer), [VideoMAE](../model_doc/videomae), [ViViT](../model_doc/vivit) @@ -479,7 +483,7 @@ You can also manually replicate the results of the `pipeline` if you'd like. Now, pass your input to the model and return the `logits`: -``` +```py >>> logits = run_inference(trained_model, sample_test_video["video"]) ``` diff --git a/docs/source/en/tasks/visual_question_answering.md b/docs/source/en/tasks/visual_question_answering.md new file mode 100644 index 00000000000000..c45f12dbc1e7a8 --- /dev/null +++ b/docs/source/en/tasks/visual_question_answering.md @@ -0,0 +1,401 @@ + + +# Visual Question Answering + +[[open-in-colab]] + +Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. +The input to models supporting this task is typically a combination of an image and a question, and the output is an +answer expressed in natural language. + +Some noteworthy use case examples for VQA include: +* Accessibility applications for visually impaired individuals. +* Education: posing questions about visual materials presented in lectures or textbooks. VQA can also be utilized in interactive museum exhibits or historical sites. +* Customer service and e-commerce: VQA can enhance user experience by letting users ask questions about products. +* Image retrieval: VQA models can be used to retrieve images with specific characteristics. For example, the user can ask "Is there a dog?" to find all images with dogs from a set of images. + +In this guide you'll learn how to: + +- Fine-tune a classification VQA model, specifically [ViLT](../model_doc/vilt), on the [`Graphcore/vqa` dataset](https://huggingface.co/datasets/Graphcore/vqa). +- Use your fine-tuned ViLT for inference. +- Run zero-shot VQA inference with a generative model, like BLIP-2. + +## Fine-tuning ViLT + +ViLT model incorporates text embeddings into a Vision Transformer (ViT), allowing it to have a minimal design for +Vision-and-Language Pre-training (VLP). This model can be used for several downstream tasks. For the VQA task, a classifier +head is placed on top (a linear layer on top of the final hidden state of the `[CLS]` token) and randomly initialized. +Visual Question Answering is thus treated as a **classification problem**. + +More recent models, such as BLIP, BLIP-2, and InstructBLIP, treat VQA as a generative task. 
Later in this guide we +illustrate how to use them for zero-shot VQA inference. + +Before you begin, make sure you have all the necessary libraries installed. + +```bash +pip install -q transformers datasets +``` + +We encourage you to share your model with the community. Log in to your Hugging Face account to upload it to the 🤗 Hub. +When prompted, enter your token to log in: + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +Let's define the model checkpoint as a global variable. + +```py +>>> model_checkpoint = "dandelin/vilt-b32-mlm" +``` + +## Load the data + +For illustration purposes, in this guide we use a very small sample of the annotated visual question answering `Graphcore/vqa` dataset. +You can find the full dataset on [🤗 Hub](https://huggingface.co/datasets/Graphcore/vqa). + +As an alternative to the [`Graphcore/vqa` dataset](https://huggingface.co/datasets/Graphcore/vqa), you can download the +same data manually from the official [VQA dataset page](https://visualqa.org/download.html). If you prefer to follow the +tutorial with your custom data, check out how to [Create an image dataset](https://huggingface.co/docs/datasets/image_dataset#loading-script) +guide in the 🤗 Datasets documentation. + +Let's load the first 200 examples from the validation split and explore the dataset's features: + +```python +>>> from datasets import load_dataset + +>>> dataset = load_dataset("Graphcore/vqa", split="validation[:200]") +>>> dataset +Dataset({ + features: ['question', 'question_type', 'question_id', 'image_id', 'answer_type', 'label'], + num_rows: 200 +}) +``` + +Let's take a look at an example to understand the dataset's features: + +```py +>>> dataset[0] +{'question': 'Where is he looking?', + 'question_type': 'none of the above', + 'question_id': 262148000, + 'image_id': '/root/.cache/huggingface/datasets/downloads/extracted/ca733e0e000fb2d7a09fbcc94dbfe7b5a30750681d0e965f8e0a23b1c2f98c75/val2014/COCO_val2014_000000262148.jpg', + 'answer_type': 'other', + 'label': {'ids': ['at table', 'down', 'skateboard', 'table'], + 'weights': [0.30000001192092896, + 1.0, + 0.30000001192092896, + 0.30000001192092896]}} +``` + +The features relevant to the task include: +* `question`: the question to be answered from the image +* `image_id`: the path to the image the question refers to +* `label`: the annotations + +We can remove the rest of the features as they won't be necessary: + +```py +>>> dataset = dataset.remove_columns(['question_type', 'question_id', 'answer_type']) +``` + +As you can see, the `label` feature contains several answers to the same question (called `ids` here) collected by different human annotators. +This is because the answer to a question can be subjective. In this case, the question is "where is he looking?". Some people +annotated this with "down", others with "at table", another one with "skateboard", etc. + +Take a look at the image and consider which answer would you give: + +```python +>>> from PIL import Image + +>>> image = Image.open(dataset[0]['image_id']) +>>> image +``` + +
+ VQA Image Example +
+ +Due to the questions' and answers' ambiguity, datasets like this are treated as a multi-label classification problem (as +multiple answers are possibly valid). Moreover, rather than just creating a one-hot encoded vector, one creates a +soft encoding, based on the number of times a certain answer appeared in the annotations. + +For instance, in the example above, because the answer "down" is selected way more often than other answers, it has a +score (called `weight` in the dataset) of 1.0, and the rest of the answers have scores < 1.0. + +To later instantiate the model with an appropriate classification head, let's create two dictionaries: one that maps +the label name to an integer and vice versa: + +```py +>>> import itertools + +>>> labels = [item['ids'] for item in dataset['label']] +>>> flattened_labels = list(itertools.chain(*labels)) +>>> unique_labels = list(set(flattened_labels)) + +>>> label2id = {label: idx for idx, label in enumerate(unique_labels)} +>>> id2label = {idx: label for label, idx in label2id.items()} +``` + +Now that we have the mappings, we can replace the string answers with their ids, and flatten the dataset for a more convenient further preprocessing. + +```python +>>> def replace_ids(inputs): +... inputs["label"]["ids"] = [label2id[x] for x in inputs["label"]["ids"]] +... return inputs + + +>>> dataset = dataset.map(replace_ids) +>>> flat_dataset = dataset.flatten() +>>> flat_dataset.features +{'question': Value(dtype='string', id=None), + 'image_id': Value(dtype='string', id=None), + 'label.ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), + 'label.weights': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None)} +``` + +## Preprocessing data + +The next step is to load a ViLT processor to prepare the image and text data for the model. +[`ViltProcessor`] wraps a BERT tokenizer and ViLT image processor into a convenient single processor: + +```py +>>> from transformers import ViltProcessor + +>>> processor = ViltProcessor.from_pretrained(model_checkpoint) +``` + +To preprocess the data we need to encode the images and questions using the [`ViltProcessor`]. The processor will use +the [`BertTokenizerFast`] to tokenize the text and create `input_ids`, `attention_mask` and `token_type_ids` for the text data. +As for images, the processor will leverage [`ViltImageProcessor`] to resize and normalize the image, and create `pixel_values` and `pixel_mask`. + +All these preprocessing steps are done under the hood, we only need to call the `processor`. However, we still need to +prepare the target labels. In this representation, each element corresponds to a possible answer (label). For correct answers, the element holds +their respective score (weight), while the remaining elements are set to zero. + +The following function applies the `processor` to the images and questions and formats the labels as described above: + +```py +>>> import torch + +>>> def preprocess_data(examples): +... image_paths = examples['image_id'] +... images = [Image.open(image_path) for image_path in image_paths] +... texts = examples['question'] + +... encoding = processor(images, texts, padding="max_length", truncation=True, return_tensors="pt") + +... for k, v in encoding.items(): +... encoding[k] = v.squeeze() + +... targets = [] + +... for labels, scores in zip(examples['label.ids'], examples['label.weights']): +... target = torch.zeros(len(id2label)) + +... for label, score in zip(labels, scores): +... target[label] = score + +... 
targets.append(target) + +... encoding["labels"] = targets + +... return encoding +``` + +To apply the preprocessing function over the entire dataset, use 🤗 Datasets [`~datasets.map`] function. You can speed up `map` by +setting `batched=True` to process multiple elements of the dataset at once. At this point, feel free to remove the columns you don't need. + +```py +>>> processed_dataset = flat_dataset.map(preprocess_data, batched=True, remove_columns=['question','question_type', 'question_id', 'image_id', 'answer_type', 'label.ids', 'label.weights']) +>>> processed_dataset +Dataset({ + features: ['input_ids', 'token_type_ids', 'attention_mask', 'pixel_values', 'pixel_mask', 'labels'], + num_rows: 200 +}) +``` + +As a final step, create a batch of examples using [`DefaultDataCollator`]: + +```py +>>> from transformers import DefaultDataCollator + +>>> data_collator = DefaultDataCollator() +``` + +## Train the model + +You’re ready to start training your model now! Load ViLT with [`ViltForQuestionAnswering`]. Specify the number of labels +along with the label mappings: + +```py +>>> from transformers import ViltForQuestionAnswering + +>>> model = ViltForQuestionAnswering.from_pretrained(model_checkpoint, num_labels=len(id2label), id2label=id2label, label2id=label2id) +``` + +At this point, only three steps remain: + +1. Define your training hyperparameters in [`TrainingArguments`]: + +```py +>>> from transformers import TrainingArguments + +>>> repo_id = "MariaK/vilt_finetuned_200" + +>>> training_args = TrainingArguments( +... output_dir=repo_id, +... per_device_train_batch_size=4, +... num_train_epochs=20, +... save_steps=200, +... logging_steps=50, +... learning_rate=5e-5, +... save_total_limit=2, +... remove_unused_columns=False, +... push_to_hub=True, +... ) +``` + +2. Pass the training arguments to [`Trainer`] along with the model, dataset, processor, and data collator. + +```py +>>> from transformers import Trainer + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... data_collator=data_collator, +... train_dataset=processed_dataset, +... tokenizer=processor, +... ) +``` + +3. Call [`~Trainer.train`] to finetune your model. + +```py +>>> trainer.train() +``` + +Once training is completed, share your model to the Hub with the [`~Trainer.push_to_hub`] method to share your final model on the 🤗 Hub: + +```py +>>> trainer.push_to_hub() +``` + +## Inference + +Now that you have fine-tuned a ViLT model, and uploaded it to the 🤗 Hub, you can use it for inference. The simplest +way to try out your fine-tuned model for inference is to use it in a [`Pipeline`]. + +```py +>>> from transformers import pipeline + +>>> pipe = pipeline("visual-question-answering", model="MariaK/vilt_finetuned_200") +``` + +The model in this guide has only been trained on 200 examples, so don't expect a lot from it. Let's see if it at least +learned something from the data and take the first example from the dataset to illustrate inference: + +```py +>>> example = dataset[0] +>>> image = Image.open(example['image_id']) +>>> question = example['question'] +>>> print(question) +>>> pipe(image, question, top_k=1) +"Where is he looking?" +[{'score': 0.5498199462890625, 'answer': 'down'}] +``` + +Even though not very confident, the model indeed has learned something. With more examples and longer training, you'll get far better results! + +You can also manually replicate the results of the pipeline if you'd like: +1. 
Take an image and a question, prepare them for the model using the processor from your model. +2. Forward the result or preprocessing through the model. +3. From the logits, get the most likely answer's id, and find the actual answer in the `id2label`. + +```py +>>> processor = ViltProcessor.from_pretrained("MariaK/vilt_finetuned_200") + +>>> image = Image.open(example['image_id']) +>>> question = example['question'] + +>>> # prepare inputs +>>> inputs = processor(image, question, return_tensors="pt") + +>>> model = ViltForQuestionAnswering.from_pretrained("MariaK/vilt_finetuned_200") + +>>> # forward pass +>>> with torch.no_grad(): +... outputs = model(**inputs) + +>>> logits = outputs.logits +>>> idx = logits.argmax(-1).item() +>>> print("Predicted answer:", model.config.id2label[idx]) +Predicted answer: down +``` + +## Zero-shot VQA + +The previous model treated VQA as a classification task. Some recent models, such as BLIP, BLIP-2, and InstructBLIP approach +VQA as a generative task. Let's take [BLIP-2](../model_doc/blip-2) as an example. It introduced a new visual-language pre-training +paradigm in which any combination of pre-trained vision encoder and LLM can be used (learn more in the [BLIP-2 blog post](https://huggingface.co/blog/blip-2)). +This enables achieving state-of-the-art results on multiple visual-language tasks including visual question answering. + +Let's illustrate how you can use this model for VQA. First, let's load the model. Here we'll explicitly send the model to a +GPU, if available, which we didn't need to do earlier when training, as [`Trainer`] handles this automatically: + +```py +>>> from transformers import AutoProcessor, Blip2ForConditionalGeneration +>>> import torch + +>>> processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b") +>>> model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16) +>>> device = "cuda" if torch.cuda.is_available() else "cpu" +>>> model.to(device) +``` + +The model takes image and text as input, so let's use the exact same image/question pair from the first example in the VQA dataset: + +```py +>>> example = dataset[0] +>>> image = Image.open(example['image_id']) +>>> question = example['question'] +``` + +To use BLIP-2 for visual question answering task, the textual prompt has to follow a specific format: `Question: {} Answer:`. + +```py +>>> prompt = f"Question: {question} Answer:" +``` + +Now we need to preprocess the image/prompt with the model's processor, pass the processed input through the model, and decode the output: + +```py +>>> inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16) + +>>> generated_ids = model.generate(**inputs, max_new_tokens=10) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip() +>>> print(generated_text) +"He is looking at the crowd" +``` + +As you can see, the model recognized the crowd, and the direction of the face (looking down), however, it seems to miss +the fact the crowd is behind the skater. Still, in cases where acquiring human-annotated datasets is not feasible, this +approach can quickly produce useful results. 
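+
+If you plan to query many image/question pairs this way, it can be convenient to wrap the prompt formatting,
+preprocessing, generation, and decoding into a single helper. The function below is only an illustrative sketch; the
+`answer_question` name is ours and not part of the 🤗 Transformers API:
+
+```py
+>>> def answer_question(image, question, max_new_tokens=10):
+...     # build the BLIP-2 prompt, preprocess the image/text pair, generate, and decode
+...     prompt = f"Question: {question} Answer:"
+...     inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)
+...     generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
+...     return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
+
+>>> example = dataset[1]
+>>> answer_question(Image.open(example["image_id"]), example["question"])
+```
+
+Keep in mind that the generated answers are free-form text, so their wording may differ from the short labels produced
+by the fine-tuned classification model above.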
+ diff --git a/docs/source/en/tasks/zero_shot_image_classification.md b/docs/source/en/tasks/zero_shot_image_classification.md new file mode 100644 index 00000000000000..45775d40cad32c --- /dev/null +++ b/docs/source/en/tasks/zero_shot_image_classification.md @@ -0,0 +1,147 @@ + + +# Zero-shot image classification + +[[open-in-colab]] + +Zero-shot image classification is a task that involves classifying images into different categories using a model that was +not explicitly trained on data containing labeled examples from those specific categories. + +Traditionally, image classification requires training a model on a specific set of labeled images, and this model learns to +"map" certain image features to labels. When there's a need to use such model for a classification task that introduces a +new set of labels, fine-tuning is required to "recalibrate" the model. + +In contrast, zero-shot or open vocabulary image classification models are typically multi-modal models that have been trained on a large +dataset of images and associated descriptions. These models learn aligned vision-language representations that can be used for many downstream tasks including zero-shot image classification. + +This is a more flexible approach to image classification that allows models to generalize to new and unseen categories +without the need for additional training data and enables users to query images with free-form text descriptions of their target objects . + +In this guide you'll learn how to: + +* create a zero-shot image classification pipeline +* run zero-shot image classification inference by hand + +Before you begin, make sure you have all the necessary libraries installed: + +```bash +pip install -q transformers +``` + +## Zero-shot image classification pipeline + +The simplest way to try out inference with a model supporting zero-shot image classification is to use the corresponding [`pipeline`]. +Instantiate a pipeline from a [checkpoint on the Hugging Face Hub](https://huggingface.co/models?pipeline_tag=zero-shot-image-classification&sort=downloads): + +```python +>>> from transformers import pipeline + +>>> checkpoint = "openai/clip-vit-large-patch14" +>>> detector = pipeline(model=checkpoint, task="zero-shot-image-classification") +``` + +Next, choose an image you'd like to classify. + +```py +>>> from PIL import Image +>>> import requests + +>>> url = "https://unsplash.com/photos/g8oS8-82DxI/download?ixid=MnwxMjA3fDB8MXx0b3BpY3x8SnBnNktpZGwtSGt8fHx8fDJ8fDE2NzgxMDYwODc&force=true&w=640" +>>> image = Image.open(requests.get(url, stream=True).raw) + +>>> image +``` + +
+ Photo of an owl +
+ +Pass the image and the candidate object labels to the pipeline. Here we pass the image directly; other suitable options +include a local path to an image or an image url. +The candidate labels can be simple words like in this example, or more descriptive. + +```py +>>> predictions = detector(image, candidate_labels=["fox", "bear", "seagull", "owl"]) +>>> predictions +[{'score': 0.9996670484542847, 'label': 'owl'}, + {'score': 0.000199399160919711, 'label': 'seagull'}, + {'score': 7.392891711788252e-05, 'label': 'fox'}, + {'score': 5.96074532950297e-05, 'label': 'bear'}] +``` + +## Zero-shot image classification by hand + +Now that you've seen how to use the zero-shot image classification pipeline, let's take a look how you can run zero-shot +image classification manually. + +Start by loading the model and associated processor from a [checkpoint on the Hugging Face Hub](https://huggingface.co/models?pipeline_tag=zero-shot-image-classification&sort=downloads). +Here we'll use the same checkpoint as before: + +```py +>>> from transformers import AutoProcessor, AutoModelForZeroShotImageClassification + +>>> model = AutoModelForZeroShotImageClassification.from_pretrained(checkpoint) +>>> processor = AutoProcessor.from_pretrained(checkpoint) +``` + +Let's take a different image to switch things up. + +```py +>>> from PIL import Image +>>> import requests + +>>> url = "https://unsplash.com/photos/xBRQfR2bqNI/download?ixid=MnwxMjA3fDB8MXxhbGx8fHx8fHx8fHwxNjc4Mzg4ODEx&force=true&w=640" +>>> image = Image.open(requests.get(url, stream=True).raw) + +>>> image +``` + +
+ Photo of a car +
+ +Use the processor to prepare the inputs for the model. The processor combines an image processor that prepares the +image for the model by resizing and normalizing it, and a tokenizer that takes care of the text inputs. + +```py +>>> candidate_labels = ["tree", "car", "bike", "cat"] +>>> inputs = processor(images=image, text=candidate_labels, return_tensors="pt", padding=True) +``` + +Pass the inputs through the model, and post-process the results: + +```py +>>> import torch + +>>> with torch.no_grad(): +... outputs = model(**inputs) + +>>> logits = outputs.logits_per_image[0] +>>> probs = logits.softmax(dim=-1).numpy() +>>> scores = probs.tolist() + +>>> result = [ +... {"score": score, "label": candidate_label} +... for score, candidate_label in sorted(zip(probs, candidate_labels), key=lambda x: -x[0]) +... ] + +>>> result +[{'score': 0.998572, 'label': 'car'}, + {'score': 0.0010570387, 'label': 'bike'}, + {'score': 0.0003393686, 'label': 'tree'}, + {'score': 3.1572064e-05, 'label': 'cat'}] +``` \ No newline at end of file diff --git a/docs/source/en/tasks/zero_shot_object_detection.md b/docs/source/en/tasks/zero_shot_object_detection.md new file mode 100644 index 00000000000000..03e849a6c79d6f --- /dev/null +++ b/docs/source/en/tasks/zero_shot_object_detection.md @@ -0,0 +1,301 @@ + + +# Zero-shot object detection + +[[open-in-colab]] + +Traditionally, models used for [object detection](object_detection) require labeled image datasets for training, +and are limited to detecting the set of classes from the training data. + +Zero-shot object detection is supported by the [OWL-ViT](../model_doc/owlvit) model which uses a different approach. OWL-ViT +is an open-vocabulary object detector. It means that it can detect objects in images based on free-text queries without +the need to fine-tune the model on labeled datasets. + +OWL-ViT leverages multi-modal representations to perform open-vocabulary detection. It combines [CLIP](../model_doc/clip) with +lightweight object classification and localization heads. Open-vocabulary detection is achieved by embedding free-text queries with the text encoder of CLIP and using them as input to the object classification and localization heads. +associate images and their corresponding textual descriptions, and ViT processes image patches as inputs. The authors +of OWL-ViT first trained CLIP from scratch and then fine-tuned OWL-ViT end to end on standard object detection datasets using +a bipartite matching loss. + +With this approach, the model can detect objects based on textual descriptions without prior training on labeled datasets. + +In this guide, you will learn how to use OWL-ViT: +- to detect objects based on text prompts +- for batch object detection +- for image-guided object detection + +Before you begin, make sure you have all the necessary libraries installed: + +```bash +pip install -q transformers +``` + +## Zero-shot object detection pipeline + +The simplest way to try out inference with OWL-ViT is to use it in a [`pipeline`]. Instantiate a pipeline +for zero-shot object detection from a [checkpoint on the Hugging Face Hub](https://huggingface.co/models?other=owlvit): + +```python +>>> from transformers import pipeline + +>>> checkpoint = "google/owlv2-base-patch16-ensemble" +>>> detector = pipeline(model=checkpoint, task="zero-shot-object-detection") +``` + +Next, choose an image you'd like to detect objects in. 
Here we'll use the image of astronaut Eileen Collins that is +a part of the [NASA](https://www.nasa.gov/multimedia/imagegallery/index.html) Great Images dataset. + +```py +>>> import skimage +>>> import numpy as np +>>> from PIL import Image + +>>> image = skimage.data.astronaut() +>>> image = Image.fromarray(np.uint8(image)).convert("RGB") + +>>> image +``` + +
+ Astronaut Eileen Collins +
+ +Pass the image and the candidate object labels to look for to the pipeline. +Here we pass the image directly; other suitable options include a local path to an image or an image url. We also pass text descriptions for all items we want to query the image for. + +```py +>>> predictions = detector( +... image, +... candidate_labels=["human face", "rocket", "nasa badge", "star-spangled banner"], +... ) +>>> predictions +[{'score': 0.3571370542049408, + 'label': 'human face', + 'box': {'xmin': 180, 'ymin': 71, 'xmax': 271, 'ymax': 178}}, + {'score': 0.28099656105041504, + 'label': 'nasa badge', + 'box': {'xmin': 129, 'ymin': 348, 'xmax': 206, 'ymax': 427}}, + {'score': 0.2110239565372467, + 'label': 'rocket', + 'box': {'xmin': 350, 'ymin': -1, 'xmax': 468, 'ymax': 288}}, + {'score': 0.13790413737297058, + 'label': 'star-spangled banner', + 'box': {'xmin': 1, 'ymin': 1, 'xmax': 105, 'ymax': 509}}, + {'score': 0.11950037628412247, + 'label': 'nasa badge', + 'box': {'xmin': 277, 'ymin': 338, 'xmax': 327, 'ymax': 380}}, + {'score': 0.10649408400058746, + 'label': 'rocket', + 'box': {'xmin': 358, 'ymin': 64, 'xmax': 424, 'ymax': 280}}] +``` + +Let's visualize the predictions: + +```py +>>> from PIL import ImageDraw + +>>> draw = ImageDraw.Draw(image) + +>>> for prediction in predictions: +... box = prediction["box"] +... label = prediction["label"] +... score = prediction["score"] + +... xmin, ymin, xmax, ymax = box.values() +... draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1) +... draw.text((xmin, ymin), f"{label}: {round(score,2)}", fill="white") + +>>> image +``` + +
+ Visualized predictions on NASA image +
+ +## Text-prompted zero-shot object detection by hand + +Now that you've seen how to use the zero-shot object detection pipeline, let's replicate the same +result manually. + +Start by loading the model and associated processor from a [checkpoint on the Hugging Face Hub](https://huggingface.co/models?other=owlvit). +Here we'll use the same checkpoint as before: + +```py +>>> from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection + +>>> model = AutoModelForZeroShotObjectDetection.from_pretrained(checkpoint) +>>> processor = AutoProcessor.from_pretrained(checkpoint) +``` + +Let's take a different image to switch things up. + +```py +>>> import requests + +>>> url = "https://unsplash.com/photos/oj0zeY2Ltk4/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MTR8fHBpY25pY3xlbnwwfHx8fDE2Nzc0OTE1NDk&force=true&w=640" +>>> im = Image.open(requests.get(url, stream=True).raw) +>>> im +``` + +
+ Beach photo +
+ +Use the processor to prepare the inputs for the model. The processor combines an image processor that prepares the +image for the model by resizing and normalizing it, and a [`CLIPTokenizer`] that takes care of the text inputs. + +```py +>>> text_queries = ["hat", "book", "sunglasses", "camera"] +>>> inputs = processor(text=text_queries, images=im, return_tensors="pt") +``` + +Pass the inputs through the model, post-process, and visualize the results. Since the image processor resized images before +feeding them to the model, you need to use the [`~OwlViTImageProcessor.post_process_object_detection`] method to make sure the predicted bounding +boxes have the correct coordinates relative to the original image: + +```py +>>> import torch + +>>> with torch.no_grad(): +... outputs = model(**inputs) +... target_sizes = torch.tensor([im.size[::-1]]) +... results = processor.post_process_object_detection(outputs, threshold=0.1, target_sizes=target_sizes)[0] + +>>> draw = ImageDraw.Draw(im) + +>>> scores = results["scores"].tolist() +>>> labels = results["labels"].tolist() +>>> boxes = results["boxes"].tolist() + +>>> for box, score, label in zip(boxes, scores, labels): +... xmin, ymin, xmax, ymax = box +... draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1) +... draw.text((xmin, ymin), f"{text_queries[label]}: {round(score,2)}", fill="white") + +>>> im +``` + +
+ Beach photo with detected objects +
+ +## Batch processing + +You can pass multiple sets of images and text queries to search for different (or same) objects in several images. +Let's use both an astronaut image and the beach image together. +For batch processing, you should pass text queries as a nested list to the processor and images as lists of PIL images, +PyTorch tensors, or NumPy arrays. + +```py +>>> images = [image, im] +>>> text_queries = [ +... ["human face", "rocket", "nasa badge", "star-spangled banner"], +... ["hat", "book", "sunglasses", "camera"], +... ] +>>> inputs = processor(text=text_queries, images=images, return_tensors="pt") +``` + +Previously for post-processing you passed the single image's size as a tensor, but you can also pass a tuple, or, in case +of several images, a list of tuples. Let's create predictions for the two examples, and visualize the second one (`image_idx = 1`). + +```py +>>> with torch.no_grad(): +... outputs = model(**inputs) +... target_sizes = [x.size[::-1] for x in images] +... results = processor.post_process_object_detection(outputs, threshold=0.1, target_sizes=target_sizes) + +>>> image_idx = 1 +>>> draw = ImageDraw.Draw(images[image_idx]) + +>>> scores = results[image_idx]["scores"].tolist() +>>> labels = results[image_idx]["labels"].tolist() +>>> boxes = results[image_idx]["boxes"].tolist() + +>>> for box, score, label in zip(boxes, scores, labels): +... xmin, ymin, xmax, ymax = box +... draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1) +... draw.text((xmin, ymin), f"{text_queries[image_idx][label]}: {round(score,2)}", fill="white") + +>>> images[image_idx] +``` + +
+ Beach photo with detected objects +
+ +## Image-guided object detection + +In addition to zero-shot object detection with text queries, OWL-ViT offers image-guided object detection. This means +you can use an image query to find similar objects in the target image. +Unlike text queries, only a single example image is allowed. + +Let's take an image with two cats on a couch as a target image, and an image of a single cat +as a query: + +```py +>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" +>>> image_target = Image.open(requests.get(url, stream=True).raw) + +>>> query_url = "http://images.cocodataset.org/val2017/000000524280.jpg" +>>> query_image = Image.open(requests.get(query_url, stream=True).raw) +``` + +Let's take a quick look at the images: + +```py +>>> import matplotlib.pyplot as plt + +>>> fig, ax = plt.subplots(1, 2) +>>> ax[0].imshow(image_target) +>>> ax[1].imshow(query_image) +``` + +
+ Cats +
+ +In the preprocessing step, instead of text queries, you now need to use `query_images`: + +```py +>>> inputs = processor(images=image_target, query_images=query_image, return_tensors="pt") +``` + +For predictions, instead of passing the inputs to the model, pass them to [`~OwlViTForObjectDetection.image_guided_detection`]. Draw the predictions +as before except now there are no labels. + +```py +>>> with torch.no_grad(): +... outputs = model.image_guided_detection(**inputs) +... target_sizes = torch.tensor([image_target.size[::-1]]) +... results = processor.post_process_image_guided_detection(outputs=outputs, target_sizes=target_sizes)[0] + +>>> draw = ImageDraw.Draw(image_target) + +>>> scores = results["scores"].tolist() +>>> boxes = results["boxes"].tolist() + +>>> for box, score, label in zip(boxes, scores, labels): +... xmin, ymin, xmax, ymax = box +... draw.rectangle((xmin, ymin, xmax, ymax), outline="white", width=4) + +>>> image_target +``` + +
+ Cats with bounding boxes +
+ diff --git a/docs/source/en/tasks_explained.mdx b/docs/source/en/tasks_explained.md similarity index 99% rename from docs/source/en/tasks_explained.mdx rename to docs/source/en/tasks_explained.md index fba64f4a7a5cce..d453e38e86b9fa 100644 --- a/docs/source/en/tasks_explained.mdx +++ b/docs/source/en/tasks_explained.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # How 🤗 Transformers solve tasks diff --git a/docs/source/en/testing.mdx b/docs/source/en/testing.md similarity index 91% rename from docs/source/en/testing.mdx rename to docs/source/en/testing.md index cb03a57b041334..fda2fc0cb34352 100644 --- a/docs/source/en/testing.mdx +++ b/docs/source/en/testing.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Testing @@ -53,8 +57,6 @@ RUN_SLOW=1 pytest examples/ - - ### Choosing which tests to run This document goes into many details of how tests can be run. If after reading everything, you need even more details @@ -108,7 +110,7 @@ pytest tests/test_optimization.py --collect-only -q To run an individual test module: ```bash -pytest tests/test_logging.py +pytest tests/utils/test_logging.py ``` ### Run specific tests @@ -176,6 +178,16 @@ If you want to include only tests that include both patterns, `and` is to be use ```bash pytest -k "test and ada" tests/test_optimization.py ``` + +### Run `accelerate` tests + +Sometimes you need to run `accelerate` tests on your models. For that you can just add `-m accelerate_tests` to your command, if let's say you want to run these tests on `OPT` run: + +```bash +RUN_SLOW=1 pytest -m accelerate_tests tests/models/opt/test_modeling_opt.py +``` + + ### Run documentation tests In order to test whether the documentation examples are correct, you should check that the `doctests` are passing. @@ -203,20 +215,12 @@ Example: ```""" ``` -3 steps are required to debug the docstring examples: -1. In order to properly run the test, **an extra line has to be added** at the end of the docstring. This can be automatically done on any file using: -```bash -python utils/prepare_for_doc_test.py -``` -2. Then, you can use the following line to automatically test every docstring example in the desired file: +Just run the following line to automatically test every docstring example in the desired file: ```bash pytest --doctest-modules ``` -3. 
Once you are done debugging, you need to remove the extra line added in step **1.** by running the following: -```bash -python utils/prepare_for_doc_test.py --remove_new_line -``` +If the file has a markdown extention, you should add the `--doctest-glob="*.md"` argument. ### Run only modified tests @@ -427,14 +431,14 @@ pytest --instafail On a GPU-enabled setup, to test in CPU-only mode add `CUDA_VISIBLE_DEVICES=""`: ```bash -CUDA_VISIBLE_DEVICES="" pytest tests/test_logging.py +CUDA_VISIBLE_DEVICES="" pytest tests/utils/test_logging.py ``` or if you have multiple gpus, you can specify which one is to be used by `pytest`. For example, to use only the second gpu if you have gpus `0` and `1`, you can run: ```bash -CUDA_VISIBLE_DEVICES="1" pytest tests/test_logging.py +CUDA_VISIBLE_DEVICES="1" pytest tests/utils/test_logging.py ``` This is handy when you want to run different tasks on different GPUs. @@ -506,6 +510,42 @@ from transformers.testing_utils import get_gpu_count n_gpu = get_gpu_count() # works with torch and tf ``` +### Testing with a specific PyTorch backend or device + +To run the test suite on a specific torch device add `TRANSFORMERS_TEST_DEVICE="$device"` where `$device` is the target backend. For example, to test on CPU only: + +```bash +TRANSFORMERS_TEST_DEVICE="cpu" pytest tests/utils/test_logging.py +``` + +This variable is useful for testing custom or less common PyTorch backends such as `mps`. It can also be used to achieve the same effect as `CUDA_VISIBLE_DEVICES` by targeting specific GPUs or testing in CPU-only mode. + +Certain devices will require an additional import after importing `torch` for the first time. This can be specified using the environment variable `TRANSFORMERS_TEST_BACKEND`: + +```bash +TRANSFORMERS_TEST_BACKEND="torch_npu" pytest tests/utils/test_logging.py +``` +Alternative backends may also require the replacement of device-specific functions. For example `torch.cuda.manual_seed` may need to be replaced with a device-specific seed setter like `torch.npu.manual_seed` to correctly set a random seed on the device. To specify a new backend with backend-specific device functions when running the test suite, create a Python device specification file in the format: + +``` +import torch +import torch_npu +# !! Further additional imports can be added here !! + +# Specify the device name (eg. 'cuda', 'cpu', 'npu') +DEVICE_NAME = 'npu' + +# Specify device-specific backends to dispatch to. +# If not specified, will fallback to 'default' in 'testing_utils.py` +MANUAL_SEED_FN = torch.npu.manual_seed +EMPTY_CACHE_FN = torch.npu.empty_cache +DEVICE_COUNT_FN = torch.npu.device_count +``` +This format also allows for specification of any additional imports required. To use this file to replace equivalent methods in the test suite, set the environment variable `TRANSFORMERS_TEST_DEVICE_SPEC` to the path of the spec file. + +Currently, only `MANUAL_SEED_FN`, `EMPTY_CACHE_FN` and `DEVICE_COUNT_FN` are supported for device-specific dispatch. + + ### Distributed training `pytest` can't deal with distributed training directly. 
If this is attempted - the sub-processes don't do the right @@ -533,7 +573,7 @@ according captured output will usually be shown along with the failure traceback To disable output capturing and to get the `stdout` and `stderr` normally, use `-s` or `--capture=no`: ```bash -pytest -s tests/test_logging.py +pytest -s tests/utils/test_logging.py ``` To send test results to JUnit format output: @@ -547,7 +587,7 @@ py.test tests --junitxml=result.xml To have no color (e.g., yellow on white background is not readable): ```bash -pytest --color=no tests/test_logging.py +pytest --color=no tests/utils/test_logging.py ``` ### Sending test report to online pastebin service @@ -555,7 +595,7 @@ pytest --color=no tests/test_logging.py Creating a URL for each test failure: ```bash -pytest --pastebin=failed tests/test_logging.py +pytest --pastebin=failed tests/utils/test_logging.py ``` This will submit test run information to a remote Paste service and provide a URL for each failure. You may select @@ -564,7 +604,7 @@ tests as usual or add for example -x if you only want to send one particular fai Creating a URL for a whole test session log: ```bash -pytest --pastebin=all tests/test_logging.py +pytest --pastebin=all tests/utils/test_logging.py ``` ## Writing tests @@ -859,7 +899,8 @@ or the `xfail` way: def test_feature_x(): ``` -- Here is how to skip a test based on some internal check inside the test: + +Here's how to skip a test based on internal checks within the test: ```python def test_feature_x(): @@ -935,7 +976,7 @@ Some decorators like `@parameterized` rewrite test names, therefore `@slow` and `@require_*` have to be listed last for them to work correctly. Here is an example of the correct usage: ```python no-style -@parameteriz ed.expand(...) +@parameterized.expand(...) @slow def test_integration_foo(): ``` @@ -1194,7 +1235,7 @@ tf.random.set_seed(seed) To start a debugger at the point of the warning, do this: ```bash -pytest tests/test_logging.py -W error::UserWarning --pdb +pytest tests/utils/test_logging.py -W error::UserWarning --pdb ``` ## Working with github actions workflows @@ -1271,3 +1312,19 @@ You can vote for this feature and see where it is at these CI-specific threads: - [Github Actions:](https://github.com/actions/toolkit/issues/399) - [CircleCI:](https://ideas.circleci.com/ideas/CCI-I-344) + +## DeepSpeed integration + +For a PR that involves the DeepSpeed integration, keep in mind our CircleCI PR CI setup doesn't have GPUs. Tests requiring GPUs are run on a different CI nightly. This means if you get a passing CI report in your PR, it doesn’t mean the DeepSpeed tests pass. + +To run DeepSpeed tests: + +```bash +RUN_SLOW=1 pytest tests/deepspeed/test_deepspeed.py +``` + +Any changes to the modeling or PyTorch examples code requires running the model zoo tests as well. + +```bash +RUN_SLOW=1 pytest tests/deepspeed +``` diff --git a/docs/source/en/tf_xla.mdx b/docs/source/en/tf_xla.md similarity index 92% rename from docs/source/en/tf_xla.mdx rename to docs/source/en/tf_xla.md index 1d9e13e8b35cc2..86ed1035fccc9e 100644 --- a/docs/source/en/tf_xla.mdx +++ b/docs/source/en/tf_xla.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
+ +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # XLA Integration for TensorFlow Models @@ -81,8 +85,8 @@ from transformers.utils import check_min_version check_min_version("4.21.0") -tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left", pad_token="") -model = TFAutoModelForCausalLM.from_pretrained("gpt2") +tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2", padding_side="left", pad_token="") +model = TFAutoModelForCausalLM.from_pretrained("openai-community/gpt2") input_string = ["TensorFlow is"] # One line to create an XLA generation function @@ -110,8 +114,8 @@ To ensure `xla_generate()` always operates with the same input shapes, you can s import tensorflow as tf from transformers import AutoTokenizer, TFAutoModelForCausalLM -tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left", pad_token="") -model = TFAutoModelForCausalLM.from_pretrained("gpt2") +tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2", padding_side="left", pad_token="") +model = TFAutoModelForCausalLM.from_pretrained("openai-community/gpt2") input_string = ["TensorFlow is"] xla_generate = tf.function(model.generate, jit_compile=True) @@ -131,8 +135,8 @@ import time import tensorflow as tf from transformers import AutoTokenizer, TFAutoModelForCausalLM -tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left", pad_token="") -model = TFAutoModelForCausalLM.from_pretrained("gpt2") +tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2", padding_side="left", pad_token="") +model = TFAutoModelForCausalLM.from_pretrained("openai-community/gpt2") xla_generate = tf.function(model.generate, jit_compile=True) diff --git a/docs/source/en/tflite.md b/docs/source/en/tflite.md new file mode 100644 index 00000000000000..09434a81508d35 --- /dev/null +++ b/docs/source/en/tflite.md @@ -0,0 +1,62 @@ + + +# Export to TFLite + +[TensorFlow Lite](https://www.tensorflow.org/lite/guide) is a lightweight framework for deploying machine learning models +on resource-constrained devices, such as mobile phones, embedded systems, and Internet of Things (IoT) devices. +TFLite is designed to optimize and run models efficiently on these devices with limited computational power, memory, and +power consumption. +A TensorFlow Lite model is represented in a special efficient portable format identified by the `.tflite` file extension. + +🤗 Optimum offers functionality to export 🤗 Transformers models to TFLite through the `exporters.tflite` module. +For the list of supported model architectures, please refer to [🤗 Optimum documentation](https://huggingface.co/docs/optimum/exporters/tflite/overview). + +To export a model to TFLite, install the required dependencies: + +```bash +pip install optimum[exporters-tf] +``` + +To check out all available arguments, refer to the [🤗 Optimum docs](https://huggingface.co/docs/optimum/main/en/exporters/tflite/usage_guides/export_a_model), +or view help in command line: + +```bash +optimum-cli export tflite --help +``` + +To export a model's checkpoint from the 🤗 Hub, for example, `google-bert/bert-base-uncased`, run the following command: + +```bash +optimum-cli export tflite --model google-bert/bert-base-uncased --sequence_length 128 bert_tflite/ +``` + +You should see the logs indicating progress and showing where the resulting `model.tflite` is saved, like this: + +```bash +Validating TFLite model... 
+ -[✓] TFLite model output names match reference model (logits) + - Validating TFLite Model output "logits": + -[✓] (1, 128, 30522) matches (1, 128, 30522) + -[x] values not close enough, max diff: 5.817413330078125e-05 (atol: 1e-05) +The TensorFlow Lite export succeeded with the warning: The maximum absolute difference between the output of the reference model and the TFLite exported model is not within the set tolerance 1e-05: +- logits: max diff = 5.817413330078125e-05. + The exported model was saved at: bert_tflite + ``` + +The example above illustrates exporting a checkpoint from 🤗 Hub. When exporting a local model, first make sure that you +saved both the model's weights and tokenizer files in the same directory (`local_path`). When using CLI, pass the +`local_path` to the `model` argument instead of the checkpoint name on 🤗 Hub. \ No newline at end of file diff --git a/docs/source/en/tokenizer_summary.mdx b/docs/source/en/tokenizer_summary.md similarity index 97% rename from docs/source/en/tokenizer_summary.mdx rename to docs/source/en/tokenizer_summary.md index 942fe279068ecf..fbe8f6f7a17743 100644 --- a/docs/source/en/tokenizer_summary.mdx +++ b/docs/source/en/tokenizer_summary.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Summary of the tokenizers @@ -105,7 +109,7 @@ seen before, by decomposing them into known subwords. For instance, the [`~trans ```py >>> from transformers import BertTokenizer ->>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") +>>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased") >>> tokenizer.tokenize("I have a new GPU!") ["i", "have", "a", "new", "gp", "##u", "!"] ``` @@ -119,7 +123,7 @@ As another example, [`~transformers.XLNetTokenizer`] tokenizes our previously ex ```py >>> from transformers import XLNetTokenizer ->>> tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased") +>>> tokenizer = XLNetTokenizer.from_pretrained("xlnet/xlnet-base-cased") >>> tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.") ["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?", "▁We", "▁sure", "▁do", "."] ``` @@ -137,11 +141,11 @@ on. Byte-Pair Encoding (BPE) was introduced in [Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015)](https://arxiv.org/abs/1508.07909). BPE relies on a pre-tokenizer that splits the training data into -words. Pretokenization can be as simple as space tokenization, e.g. [GPT-2](model_doc/gpt2), [Roberta](model_doc/roberta). More advanced pre-tokenization include rule-based tokenization, e.g. [XLM](model_doc/xlm), +words. Pretokenization can be as simple as space tokenization, e.g. [GPT-2](model_doc/gpt2), [RoBERTa](model_doc/roberta). More advanced pre-tokenization include rule-based tokenization, e.g. [XLM](model_doc/xlm), [FlauBERT](model_doc/flaubert) which uses Moses for most languages, or [GPT](model_doc/gpt) which uses -Spacy and ftfy, to count the frequency of each word in the training corpus. 
+spaCy and ftfy, to count the frequency of each word in the training corpus. -After pre-tokenization, a set of unique words has been created and the frequency of each word it occurred in the +After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so until the vocabulary has attained the desired vocabulary size. Note that the desired vocabulary size is a hyperparameter to diff --git a/docs/source/en/torchscript.mdx b/docs/source/en/torchscript.md similarity index 95% rename from docs/source/en/torchscript.mdx rename to docs/source/en/torchscript.md index 0840973ad078ab..171e337ca7f846 100644 --- a/docs/source/en/torchscript.mdx +++ b/docs/source/en/torchscript.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Export to TorchScript @@ -93,7 +97,7 @@ class and then save it to disk under the filename `traced_bert.pt`: from transformers import BertModel, BertTokenizer, BertConfig import torch -enc = BertTokenizer.from_pretrained("bert-base-uncased") +enc = BertTokenizer.from_pretrained("google-bert/bert-base-uncased") # Tokenizing input text text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]" @@ -128,7 +132,7 @@ model = BertModel(config) model.eval() # If you are instantiating the model with *from_pretrained* you can also easily set the TorchScript flag -model = BertModel.from_pretrained("bert-base-uncased", torchscript=True) +model = BertModel.from_pretrained("google-bert/bert-base-uncased", torchscript=True) # Creating the trace traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors]) @@ -201,7 +205,7 @@ AMI](https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-inferentia-launc ### Converting a model for AWS Neuron Convert a model for AWS NEURON using the same code from [Using TorchScript in -Python](serialization#using-torchscript-in-python) to trace a `BertModel`. Import the +Python](torchscript#using-torchscript-in-python) to trace a `BertModel`. Import the `torch.neuron` framework extension to access the components of the Neuron SDK through a Python API: diff --git a/docs/source/en/trainer.md b/docs/source/en/trainer.md new file mode 100644 index 00000000000000..22ef9a0c160e9c --- /dev/null +++ b/docs/source/en/trainer.md @@ -0,0 +1,414 @@ + + +# Trainer + +The [`Trainer`] is a complete training and evaluation loop for PyTorch models implemented in the Transformers library. You only need to pass it the necessary pieces for training (model, tokenizer, dataset, evaluation function, training hyperparameters, etc.), and the [`Trainer`] class takes care of the rest. This makes it easier to start training faster without manually writing your own training loop. 
But at the same time, [`Trainer`] is very customizable and offers a ton of training options so you can tailor it to your exact training needs. + + + +In addition to the [`Trainer`] class, Transformers also provides a [`Seq2SeqTrainer`] class for sequence-to-sequence tasks like translation or summarization. There is also the [`~trl.SFTTrainer`] class from the [TRL](https://hf.co/docs/trl) library which wraps the [`Trainer`] class and is optimized for training language models like Llama-2 and Mistral with autoregressive techniques. [`~trl.SFTTrainer`] also supports features like sequence packing, LoRA, quantization, and DeepSpeed for efficiently scaling to any model size. + +
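As a rough sketch of what supervised fine-tuning with [`~trl.SFTTrainer`] can look like (the model id, dataset, and keyword arguments below are illustrative assumptions and may differ between `trl` versions):

```py
from datasets import load_dataset
from trl import SFTTrainer

# assumption: a small causal LM and a plain-text dataset, chosen only for illustration
dataset = load_dataset("imdb", split="train")

trainer = SFTTrainer(
    "facebook/opt-350m",        # a model id or an already instantiated model
    train_dataset=dataset,
    dataset_text_field="text",  # column that holds the raw training text
    max_seq_length=512,
    packing=True,               # pack short examples into full-length sequences
)
trainer.train()
```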
+ +Feel free to check out the [API reference](./main_classes/trainer) for these other [`Trainer`]-type classes to learn more about when to use which one. In general, [`Trainer`] is the most versatile option and is appropriate for a broad spectrum of tasks. [`Seq2SeqTrainer`] is designed for sequence-to-sequence tasks and [`~trl.SFTTrainer`] is designed for training language models. + +
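For the sequence-to-sequence case, a minimal sketch (assuming you have already prepared a tokenized dataset with `train`/`test` splits; the model id and split names here are placeholders) could look like this, using [`Seq2SeqTrainingArguments`] so evaluation runs `generate()`:

```py
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# assumption: `tokenized_dataset` was built beforehand, e.g. with 🤗 Datasets `map`
tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

training_args = Seq2SeqTrainingArguments(
    output_dir="my-seq2seq-model",
    evaluation_strategy="epoch",
    predict_with_generate=True,  # call generate() during evaluation, e.g. to compute ROUGE or BLEU
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
trainer.train()
```

Everything else - checkpointing, logging, and launching in a distributed setup - works the same way as with [`Trainer`].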
+ +Before you start, make sure [Accelerate](https://hf.co/docs/accelerate) - a library for enabling and running PyTorch training across distributed environments - is installed. + +```bash +pip install accelerate + +# upgrade +pip install accelerate --upgrade +``` + +This guide provides an overview of the [`Trainer`] class. + +## Basic usage + +[`Trainer`] includes all the code you'll find in a basic training loop: + +1. perform a training step to calculate the loss +2. calculate the gradients with the [`~accelerate.Accelerator.backward`] method +3. update the weights based on the gradients +4. repeat this process until you've reached a predetermined number of epochs + +The [`Trainer`] class abstracts all of this code away so you don't have to worry about manually writing a training loop every time or if you're just getting started with PyTorch and training. You only need to provide the essential components required for training, such as a model and a dataset, and the [`Trainer`] class handles everything else. + +If you want to specify any training options or hyperparameters, you can find them in the [`TrainingArguments`] class. For example, let's define where to save the model in `output_dir` and push the model to the Hub after training with `push_to_hub=True`. + +```py +from transformers import TrainingArguments + +training_args = TrainingArguments( + output_dir="your-model", + learning_rate=2e-5, + per_device_train_batch_size=16, + per_device_eval_batch_size=16, + num_train_epochs=2, + weight_decay=0.01, + evaluation_strategy="epoch", + save_strategy="epoch", + load_best_model_at_end=True, + push_to_hub=True, +) +``` + +Pass `training_args` to the [`Trainer`] along with a model, dataset, something to preprocess the dataset with (depending on your data type it could be a tokenizer, feature extractor or image processor), a data collator, and a function to compute the metrics you want to track during training. + +Finally, call [`~Trainer.train`] to start training! + +```py +from transformers import Trainer + +trainer = Trainer( + model=model, + args=training_args, + train_dataset=dataset["train"], + eval_dataset=dataset["test"], + tokenizer=tokenizer, + data_collator=data_collator, + compute_metrics=compute_metrics, +) + +trainer.train() +``` + +### Checkpoints + +The [`Trainer`] class saves your model checkpoints to the directory specified in the `output_dir` parameter of [`TrainingArguments`]. You'll find the checkpoints saved in a `checkpoint-000` subfolder where the numbers at the end correspond to the training step. Saving checkpoints are useful for resuming training later. + +```py +# resume from latest checkpoint +trainer.train(resume_from_checkpoint=True) + +# resume from specific checkpoint saved in output directory +trainer.train(resume_from_checkpoint="your-model/checkpoint-1000") +``` + +You can save your checkpoints (the optimizer state is not saved by default) to the Hub by setting `push_to_hub=True` in [`TrainingArguments`] to commit and push them. 
Other options for deciding how your checkpoints are saved are set up in the [`hub_strategy`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.hub_strategy) parameter: + +* `hub_strategy="checkpoint"` pushes the latest checkpoint to a subfolder named "last-checkpoint" from which you can resume training +* `hub_strategy="all_checkpoints"` pushes all checkpoints to the directory defined in `output_dir` (you'll see one checkpoint per folder in your model repository) + +When you resume training from a checkpoint, the [`Trainer`] tries to keep the Python, NumPy, and PyTorch RNG states the same as they were when the checkpoint was saved. But because PyTorch has various non-deterministic default settings, the RNG states aren't guaranteed to be the same. If you want to enable full determinism, take a look at the [Controlling sources of randomness](https://pytorch.org/docs/stable/notes/randomness#controlling-sources-of-randomness) guide to learn what you can enable to make your training fully deterministic. Keep in mind though that by making certain settings deterministic, training may be slower. + +## Customize the Trainer + +While the [`Trainer`] class is designed to be accessible and easy-to-use, it also offers a lot of customizability for more adventurous users. Many of the [`Trainer`]'s methods can be subclassed and overridden to support the functionality you want, without having to rewrite the entire training loop from scratch to accommodate it. These methods include: + +* [`~Trainer.get_train_dataloader`] creates a training DataLoader +* [`~Trainer.get_eval_dataloader`] creates an evaluation DataLoader +* [`~Trainer.get_test_dataloader`] creates a test DataLoader +* [`~Trainer.log`] logs information on the various objects that watch training +* [`~Trainer.create_optimizer_and_scheduler`] creates an optimizer and learning rate scheduler if they weren't passed in the `__init__`; these can also be separately customized with [`~Trainer.create_optimizer`] and [`~Trainer.create_scheduler`] respectively +* [`~Trainer.compute_loss`] computes the loss on a batch of training inputs +* [`~Trainer.training_step`] performs the training step +* [`~Trainer.prediction_step`] performs the prediction and test step +* [`~Trainer.evaluate`] evaluates the model and returns the evaluation metrics +* [`~Trainer.predict`] makes predictions (with metrics if labels are available) on the test set + +For example, here's how to customize the [`~Trainer.compute_loss`] method to use a weighted loss instead: + +```py +import torch +from torch import nn +from transformers import Trainer + +class CustomTrainer(Trainer): + def compute_loss(self, model, inputs, return_outputs=False): + labels = inputs.pop("labels") + # forward pass + outputs = model(**inputs) + logits = outputs.get("logits") + # compute custom loss for 3 labels with different weights + loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0, 3.0], device=model.device)) + loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1)) + return (loss, outputs) if return_outputs else loss +``` + +### Callbacks + +Another option for customizing the [`Trainer`] is to use [callbacks](callbacks). Callbacks *don't change* anything in the training loop. They inspect the training loop state and then execute some action (early stopping, logging results, etc.) depending on the state.
In other words, a callback can't be used to implement something like a custom loss function and you'll need to subclass and override the [`~Trainer.compute_loss`] method for that. + +For example, here's how to add an early stopping callback that stops training after 10 steps: + +```py +from transformers import TrainerCallback + +class EarlyStoppingCallback(TrainerCallback): + def __init__(self, num_steps=10): + self.num_steps = num_steps + + def on_step_end(self, args, state, control, **kwargs): + if state.global_step >= self.num_steps: + control.should_training_stop = True + return control +``` + +Then pass it to the [`Trainer`]'s `callbacks` parameter. + +```py +from transformers import Trainer + +trainer = Trainer( + model=model, + args=training_args, + train_dataset=dataset["train"], + eval_dataset=dataset["test"], + tokenizer=tokenizer, + data_collator=data_collator, + compute_metrics=compute_metrics, + callbacks=[EarlyStoppingCallback()], +) +``` + +## Logging + + + +Check out the [logging](./main_classes/logging) API reference for more information about the different logging levels. + + + +The [`Trainer`] is set to `logging.INFO` by default which reports errors, warnings, and other basic information. A [`Trainer`] replica - in distributed environments - is set to `logging.WARNING` which only reports errors and warnings. You can change the logging level with the [`log_level`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.log_level) and [`log_level_replica`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.log_level_replica) parameters in [`TrainingArguments`]. + +To configure the log level setting for each node, use the [`log_on_each_node`](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments.log_on_each_node) parameter to determine whether to use the log level on each node or only on the main node. + + + +[`Trainer`] sets the log level separately for each node in the [`Trainer.__init__`] method, so you may want to consider setting this sooner if you're using other Transformers functionalities before creating the [`Trainer`] object. + + + +For example, to set your main code and modules to use the same log level according to each node: + +```py +logger = logging.getLogger(__name__) + +logging.basicConfig( + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%m/%d/%Y %H:%M:%S", + handlers=[logging.StreamHandler(sys.stdout)], +) + +log_level = training_args.get_process_log_level() +logger.setLevel(log_level) +datasets.utils.logging.set_verbosity(log_level) +transformers.utils.logging.set_verbosity(log_level) + +trainer = Trainer(...) +``` + +Use different combinations of `log_level` and `log_level_replica` to configure what gets logged on each of the nodes. + + + + +```bash +my_app.py ... --log_level warning --log_level_replica error +``` + + + + +Add the `--log_on_each_node 0` parameter for multi-node environments. + +```bash +my_app.py ... --log_level warning --log_level_replica error --log_on_each_node 0 + +# set to only report errors +my_app.py ... --log_level error --log_level_replica error --log_on_each_node 0 +``` + + + + +## NEFTune + +[NEFTune](https://hf.co/papers/2310.05914) is a technique that can improve performance by adding noise to the embedding vectors during training. To enable it in [`Trainer`], set the `neftune_noise_alpha` parameter in [`TrainingArguments`] to control how much noise is added.
+ +```py +from transformers import TrainingArguments, Trainer + +training_args = TrainingArguments(..., neftune_noise_alpha=0.1) +trainer = Trainer(..., args=training_args) +``` + +NEFTune is disabled after training to restore the original embedding layer to avoid any unexpected behavior. + +## Accelerate and Trainer + +The [`Trainer`] class is powered by [Accelerate](https://hf.co/docs/accelerate), a library for easily training PyTorch models in distributed environments with support for integrations such as [FullyShardedDataParallel (FSDP)](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) and [DeepSpeed](https://www.deepspeed.ai/). + + + +Learn more about FSDP sharding strategies, CPU offloading, and more with the [`Trainer`] in the [Fully Sharded Data Parallel](fsdp) guide. + + + +To use Accelerate with [`Trainer`], run the [`accelerate.config`](https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-config) command to set up training for your training environment. This command creates a `config_file.yaml` that'll be used when you launch your training script. For example, some example configurations you can setup are: + + + + +```yml +compute_environment: LOCAL_MACHINE +distributed_type: MULTI_GPU +downcast_bf16: 'no' +gpu_ids: all +machine_rank: 0 #change rank as per the node +main_process_ip: 192.168.20.1 +main_process_port: 9898 +main_training_function: main +mixed_precision: fp16 +num_machines: 2 +num_processes: 8 +rdzv_backend: static +same_network: true +tpu_env: [] +tpu_use_cluster: false +tpu_use_sudo: false +use_cpu: false +``` + + + + +```yml +compute_environment: LOCAL_MACHINE +distributed_type: FSDP +downcast_bf16: 'no' +fsdp_config: + fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP + fsdp_backward_prefetch_policy: BACKWARD_PRE + fsdp_forward_prefetch: true + fsdp_offload_params: false + fsdp_sharding_strategy: 1 + fsdp_state_dict_type: FULL_STATE_DICT + fsdp_sync_module_states: true + fsdp_transformer_layer_cls_to_wrap: BertLayer + fsdp_use_orig_params: true +machine_rank: 0 +main_training_function: main +mixed_precision: bf16 +num_machines: 1 +num_processes: 2 +rdzv_backend: static +same_network: true +tpu_env: [] +tpu_use_cluster: false +tpu_use_sudo: false +use_cpu: false +``` + + + + +```yml +compute_environment: LOCAL_MACHINE +deepspeed_config: + deepspeed_config_file: /home/user/configs/ds_zero3_config.json + zero3_init_flag: true +distributed_type: DEEPSPEED +downcast_bf16: 'no' +machine_rank: 0 +main_training_function: main +num_machines: 1 +num_processes: 4 +rdzv_backend: static +same_network: true +tpu_env: [] +tpu_use_cluster: false +tpu_use_sudo: false +use_cpu: false +``` + + + + +```yml +compute_environment: LOCAL_MACHINE +deepspeed_config: + gradient_accumulation_steps: 1 + gradient_clipping: 0.7 + offload_optimizer_device: cpu + offload_param_device: cpu + zero3_init_flag: true + zero_stage: 2 +distributed_type: DEEPSPEED +downcast_bf16: 'no' +machine_rank: 0 +main_training_function: main +mixed_precision: bf16 +num_machines: 1 +num_processes: 4 +rdzv_backend: static +same_network: true +tpu_env: [] +tpu_use_cluster: false +tpu_use_sudo: false +use_cpu: false +``` + + + + +The [`accelerate_launch`](https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-launch) command is the recommended way to launch your training script on a distributed system with Accelerate and [`Trainer`] with the parameters specified in `config_file.yaml`. 
This file is saved to the Accelerate cache folder and automatically loaded when you run `accelerate_launch`. + +For example, to run the [run_glue.py](https://github.com/huggingface/transformers/blob/f4db565b695582891e43a5e042e5d318e28f20b8/examples/pytorch/text-classification/run_glue.py#L4) training script with the FSDP configuration: + +```bash +accelerate launch \ + ./examples/pytorch/text-classification/run_glue.py \ + --model_name_or_path google-bert/bert-base-cased \ + --task_name $TASK_NAME \ + --do_train \ + --do_eval \ + --max_seq_length 128 \ + --per_device_train_batch_size 16 \ + --learning_rate 5e-5 \ + --num_train_epochs 3 \ + --output_dir /tmp/$TASK_NAME/ \ + --overwrite_output_dir +``` + +You could also specify the parameters from the `config_file.yaml` file directly in the command line: + +```bash +accelerate launch --num_processes=2 \ + --use_fsdp \ + --mixed_precision=bf16 \ + --fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP \ + --fsdp_transformer_layer_cls_to_wrap="BertLayer" \ + --fsdp_sharding_strategy=1 \ + --fsdp_state_dict_type=FULL_STATE_DICT \ + ./examples/pytorch/text-classification/run_glue.py + --model_name_or_path google-bert/bert-base-cased \ + --task_name $TASK_NAME \ + --do_train \ + --do_eval \ + --max_seq_length 128 \ + --per_device_train_batch_size 16 \ + --learning_rate 5e-5 \ + --num_train_epochs 3 \ + --output_dir /tmp/$TASK_NAME/ \ + --overwrite_output_dir +``` + +Check out the [Launching your Accelerate scripts](https://huggingface.co/docs/accelerate/basic_tutorials/launch) tutorial to learn more about `accelerate_launch` and custom configurations. diff --git a/docs/source/en/training.mdx b/docs/source/en/training.md similarity index 94% rename from docs/source/en/training.mdx rename to docs/source/en/training.md index 4d802db56359a6..4bd72aa9f6384d 100644 --- a/docs/source/en/training.mdx +++ b/docs/source/en/training.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Fine-tune a pretrained model @@ -39,12 +43,12 @@ Begin by loading the [Yelp Reviews](https://huggingface.co/datasets/yelp_review_ 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. 
I\'ve worked at more than one location. I expect bad days, bad moods, and the occasional mistake. But I have yet to have a decent experience at this store. It will remain a place I avoid unless someone in my party needs to avoid illness from low blood sugar. Perhaps I should go back to the racially biased service of Steak n Shake instead!'} ``` -As you now know, you need a tokenizer to process the text and include a padding and truncation strategy to handle any variable sequence lengths. To process your dataset in one step, use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/process.html#map) method to apply a preprocessing function over the entire dataset: +As you now know, you need a tokenizer to process the text and include a padding and truncation strategy to handle any variable sequence lengths. To process your dataset in one step, use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/process#map) method to apply a preprocessing function over the entire dataset: ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") >>> def tokenize_function(examples): @@ -82,7 +86,7 @@ Start by loading your model and specify the number of expected labels. From the ```py >>> from transformers import AutoModelForSequenceClassification ->>> model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5) +>>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) ``` @@ -115,7 +119,7 @@ Specify where to save the checkpoints from your training: >>> metric = evaluate.load("accuracy") ``` -Call [`~evaluate.compute`] on `metric` to calculate the accuracy of your predictions. Before passing your predictions to `compute`, you need to convert the predictions to logits (remember all 🤗 Transformers models return logits): +Call [`~evaluate.compute`] on `metric` to calculate the accuracy of your predictions. Before passing your predictions to `compute`, you need to convert the logits to predictions (remember all 🤗 Transformers models return logits): ```py >>> def compute_metrics(eval_pred): @@ -183,7 +187,7 @@ so we can just convert that directly to a NumPy array without tokenization! ```py from transformers import AutoTokenizer -tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") +tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") tokenized_data = tokenizer(dataset["sentence"], return_tensors="np", padding=True) # Tokenizer returns a BatchEncoding, but we convert that to a dict for Keras tokenized_data = dict(tokenized_data) @@ -191,16 +195,16 @@ tokenized_data = dict(tokenized_data) labels = np.array(dataset["label"]) # Label is already an array of 0 and 1 ``` -Finally, load, [`compile`](https://keras.io/api/models/model_training_apis/#compile-method), and [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) the model: +Finally, load, [`compile`](https://keras.io/api/models/model_training_apis/#compile-method), and [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) the model. 
Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to: ```py from transformers import TFAutoModelForSequenceClassification from tensorflow.keras.optimizers import Adam # Load and compile our model -model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased") +model = TFAutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased") # Lower learning rates are often better for fine-tuning transformers -model.compile(optimizer=Adam(3e-5)) +model.compile(optimizer=Adam(3e-5)) # No loss argument! model.fit(tokenized_data, labels) ``` @@ -247,7 +251,7 @@ reduces the number of padding tokens compared to padding the entire dataset. ```py ->>> tf_dataset = model.prepare_tf_dataset(dataset, batch_size=16, shuffle=True, tokenizer=tokenizer) +>>> tf_dataset = model.prepare_tf_dataset(dataset["train"], batch_size=16, shuffle=True, tokenizer=tokenizer) ``` Note that in the code sample above, you need to pass the tokenizer to `prepare_tf_dataset` so it can correctly pad batches as they're loaded. @@ -261,7 +265,7 @@ list of samples into a batch and apply any preprocessing you want. See our Once you've created a `tf.data.Dataset`, you can compile and fit the model as before: ```py -model.compile(optimizer=Adam(3e-5)) +model.compile(optimizer=Adam(3e-5)) # No loss argument! model.fit(tf_dataset) ``` @@ -330,7 +334,7 @@ Load your model with the number of expected labels: ```py >>> from transformers import AutoModelForSequenceClassification ->>> model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5) +>>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) ``` ### Optimizer and learning rate scheduler diff --git a/docs/source/en/transformers_agents.md b/docs/source/en/transformers_agents.md new file mode 100644 index 00000000000000..424f5b15f5dd68 --- /dev/null +++ b/docs/source/en/transformers_agents.md @@ -0,0 +1,323 @@ + + +# Transformers Agents + + + +Transformers Agents is an experimental API which is subject to change at any time. Results returned by the agents +can vary as the APIs or underlying models are prone to change. + + + +Transformers version v4.29.0, building on the concept of *tools* and *agents*. You can play with in +[this colab](https://colab.research.google.com/drive/1c7MHD-T1forUPGcC_jlwsIptOzpG3hSj). + +In short, it provides a natural language API on top of transformers: we define a set of curated tools and design an +agent to interpret natural language and to use these tools. It is extensible by design; we curated some relevant tools, +but we'll show you how the system can be extended easily to use any tool developed by the community. + +Let's start with a few examples of what can be achieved with this new API. It is particularly powerful when it comes +to multimodal tasks, so let's take it for a spin to generate images and read text out loud. 
+ +```py +agent.run("Caption the following image", image=image) +``` + +| **Input** | **Output** | +|-----------------------------------------------------------------------------------------------------------------------------|-----------------------------------| +| | A beaver is swimming in the water | + +--- + +```py +agent.run("Read the following text out loud", text=text) +``` +| **Input** | **Output** | +|-------------------------------------------------------------------------------------------------------------------------|----------------------------------------------| +| A beaver is swimming in the water | + +--- + +```py +agent.run( + "In the following `document`, where will the TRRF Scientific Advisory Council Meeting take place?", + document=document, +) +``` +| **Input** | **Output** | +|-----------------------------------------------------------------------------------------------------------------------------|----------------| +| | ballroom foyer | + +## Quickstart + +Before being able to use `agent.run`, you will need to instantiate an agent, which is a large language model (LLM). +We provide support for openAI models as well as opensource alternatives from BigCode and OpenAssistant. The openAI +models perform better (but require you to have an openAI API key, so cannot be used for free); Hugging Face is +providing free access to endpoints for BigCode and OpenAssistant models. + +To start with, please install the `agents` extras in order to install all default dependencies. +```bash +pip install transformers[agents] +``` + +To use openAI models, you instantiate an [`OpenAiAgent`] after installing the `openai` dependency: + +```bash +pip install openai +``` + + +```py +from transformers import OpenAiAgent + +agent = OpenAiAgent(model="text-davinci-003", api_key="") +``` + +To use BigCode or OpenAssistant, start by logging in to have access to the Inference API: + +```py +from huggingface_hub import login + +login("") +``` + +Then, instantiate the agent + +```py +from transformers import HfAgent + +# Starcoder +agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder") +# StarcoderBase +# agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoderbase") +# OpenAssistant +# agent = HfAgent(url_endpoint="https://api-inference.huggingface.co/models/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5") +``` + +This is using the inference API that Hugging Face provides for free at the moment. If you have your own inference +endpoint for this model (or another one) you can replace the URL above with your URL endpoint. + + + +StarCoder and OpenAssistant are free to use and perform admirably well on simple tasks. However, the checkpoints +don't hold up when handling more complex prompts. If you're facing such an issue, we recommend trying out the OpenAI +model which, while sadly not open-source, performs better at this given time. + + + +You're now good to go! Let's dive into the two APIs that you now have at your disposal. + +### Single execution (run) + +The single execution method is when using the [`~Agent.run`] method of the agent: + +```py +agent.run("Draw me a picture of rivers and lakes.") +``` + + + +It automatically selects the tool (or tools) appropriate for the task you want to perform and runs them appropriately. It +can perform one or several tasks in the same instruction (though the more complex your instruction, the more likely +the agent is to fail). 
+ +```py +agent.run("Draw me a picture of the sea then transform the picture to add an island") +``` + + + +
+ + +Every [`~Agent.run`] operation is independent, so you can run it several times in a row with different tasks. + +Note that your `agent` is just a large-language model, so small variations in your prompt might yield completely +different results. It's important to explain as clearly as possible the task you want to perform. We go more in-depth +on how to write good prompts [here](custom_tools#writing-good-user-inputs). + +If you'd like to keep a state across executions or to pass non-text objects to the agent, you can do so by specifying +variables that you would like the agent to use. For example, you could generate the first image of rivers and lakes, +and ask the model to update that picture to add an island by doing the following: + +```python +picture = agent.run("Generate a picture of rivers and lakes.") +updated_picture = agent.run("Transform the image in `picture` to add an island to it.", picture=picture) +``` + + + +This can be helpful when the model is unable to understand your request and mixes tools. An example would be: + +```py +agent.run("Draw me the picture of a capybara swimming in the sea") +``` + +Here, the model could interpret in two ways: +- Have the `text-to-image` generate a capybara swimming in the sea +- Or, have the `text-to-image` generate capybara, then use the `image-transformation` tool to have it swim in the sea + +In case you would like to force the first scenario, you could do so by passing it the prompt as an argument: + +```py +agent.run("Draw me a picture of the `prompt`", prompt="a capybara swimming in the sea") +``` + + + + +### Chat-based execution (chat) + +The agent also has a chat-based approach, using the [`~Agent.chat`] method: + +```py +agent.chat("Generate a picture of rivers and lakes") +``` + + + +```py +agent.chat("Transform the picture so that there is a rock in there") +``` + + + +
+ +This is an interesting approach when you want to keep the state across instructions. It's better for experimentation, +but will tend to be much better at single instructions rather than complex instructions (which the [`~Agent.run`] +method is better at handling). + +This method can also take arguments if you would like to pass non-text types or specific prompts. + +### ⚠️ Remote execution + +For demonstration purposes and so that it could be used with all setups, we had created remote executors for several +of the default tools the agent has access for the release. These are created using +[inference endpoints](https://huggingface.co/inference-endpoints). + +We have turned these off for now, but in order to see how to set up remote executors tools yourself, +we recommend reading the [custom tool guide](./custom_tools). + +### What's happening here? What are tools, and what are agents? + + + +#### Agents + +The "agent" here is a large language model, and we're prompting it so that it has access to a specific set of tools. + +LLMs are pretty good at generating small samples of code, so this API takes advantage of that by prompting the +LLM gives a small sample of code performing a task with a set of tools. This prompt is then completed by the +task you give your agent and the description of the tools you give it. This way it gets access to the doc of the +tools you are using, especially their expected inputs and outputs, and can generate the relevant code. + +#### Tools + +Tools are very simple: they're a single function, with a name, and a description. We then use these tools' descriptions +to prompt the agent. Through the prompt, we show the agent how it would leverage tools to perform what was +requested in the query. + +This is using brand-new tools and not pipelines, because the agent writes better code with very atomic tools. +Pipelines are more refactored and often combine several tasks in one. Tools are meant to be focused on +one very simple task only. + +#### Code-execution?! + +This code is then executed with our small Python interpreter on the set of inputs passed along with your tools. +We hear you screaming "Arbitrary code execution!" in the back, but let us explain why that is not the case. + +The only functions that can be called are the tools you provided and the print function, so you're already +limited in what can be executed. You should be safe if it's limited to Hugging Face tools. + +Then, we don't allow any attribute lookup or imports (which shouldn't be needed anyway for passing along +inputs/outputs to a small set of functions) so all the most obvious attacks (and you'd need to prompt the LLM +to output them anyway) shouldn't be an issue. If you want to be on the super safe side, you can execute the +run() method with the additional argument return_code=True, in which case the agent will just return the code +to execute and you can decide whether to do it or not. + +The execution will stop at any line trying to perform an illegal operation or if there is a regular Python error +with the code generated by the agent. + +### A curated set of tools + +We identify a set of tools that can empower such agents. 
Here is an updated list of the tools we have integrated +in `transformers`: + +- **Document question answering**: given a document (such as a PDF) in image format, answer a question on this document ([Donut](./model_doc/donut)) +- **Text question answering**: given a long text and a question, answer the question in the text ([Flan-T5](./model_doc/flan-t5)) +- **Unconditional image captioning**: Caption the image! ([BLIP](./model_doc/blip)) +- **Image question answering**: given an image, answer a question on this image ([VILT](./model_doc/vilt)) +- **Image segmentation**: given an image and a prompt, output the segmentation mask of that prompt ([CLIPSeg](./model_doc/clipseg)) +- **Speech to text**: given an audio recording of a person talking, transcribe the speech into text ([Whisper](./model_doc/whisper)) +- **Text to speech**: convert text to speech ([SpeechT5](./model_doc/speecht5)) +- **Zero-shot text classification**: given a text and a list of labels, identify to which label the text corresponds the most ([BART](./model_doc/bart)) +- **Text summarization**: summarize a long text in one or a few sentences ([BART](./model_doc/bart)) +- **Translation**: translate the text into a given language ([NLLB](./model_doc/nllb)) + +These tools have an integration in transformers, and can be used manually as well, for example: + +```py +from transformers import load_tool + +tool = load_tool("text-to-speech") +audio = tool("This is a text to speech tool") +``` + +### Custom tools + +While we identify a curated set of tools, we strongly believe that the main value provided by this implementation is +the ability to quickly create and share custom tools. + +By pushing the code of a tool to a Hugging Face Space or a model repository, you're then able to leverage the tool +directly with the agent. We've added a few +**transformers-agnostic** tools to the [`huggingface-tools` organization](https://huggingface.co/huggingface-tools): + +- **Text downloader**: to download a text from a web URL +- **Text to image**: generate an image according to a prompt, leveraging stable diffusion +- **Image transformation**: modify an image given an initial image and a prompt, leveraging instruct pix2pix stable diffusion +- **Text to video**: generate a small video according to a prompt, leveraging damo-vilab + +The text-to-image tool we have been using since the beginning is a remote tool that lives in +[*huggingface-tools/text-to-image*](https://huggingface.co/spaces/huggingface-tools/text-to-image)! We will +continue releasing such tools on this and other organizations, to further supercharge this implementation. + +The agents have by default access to tools that reside on [`huggingface-tools`](https://huggingface.co/huggingface-tools). +We explain how to you can write and share your tools as well as leverage any custom tool that resides on the Hub in [following guide](custom_tools). + +### Code generation + +So far we have shown how to use the agents to perform actions for you. However, the agent is only generating code +that we then execute using a very restricted Python interpreter. In case you would like to use the code generated in +a different setting, the agent can be prompted to return the code, along with tool definition and accurate imports. 
+ +For example, the following instruction +```python +agent.run("Draw me a picture of rivers and lakes", return_code=True) +``` + +returns the following code + +```python +from transformers import load_tool + +image_generator = load_tool("huggingface-tools/text-to-image") + +image = image_generator(prompt="rivers and lakes") +``` + +that you can then modify and execute yourself. diff --git a/docs/source/en/troubleshooting.mdx b/docs/source/en/troubleshooting.md similarity index 84% rename from docs/source/en/troubleshooting.mdx rename to docs/source/en/troubleshooting.md index 74346bccef97ae..c1bf338c13bebb 100644 --- a/docs/source/en/troubleshooting.mdx +++ b/docs/source/en/troubleshooting.md @@ -12,6 +12,10 @@ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Troubleshoot @@ -130,7 +134,7 @@ In some cases, the output `hidden_state` may be incorrect if the `input_ids` inc >>> from transformers import AutoModelForSequenceClassification >>> import torch ->>> model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased") +>>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-uncased") >>> model.config.pad_token_id 0 ``` @@ -173,4 +177,22 @@ tensor([[ 0.0082, -0.2307], 🤗 Transformers doesn't automatically create an `attention_mask` to mask a padding token if it is provided because: - Some models don't have a padding token. -- For some use-cases, users want a model to attend to a padding token. \ No newline at end of file +- For some use-cases, users want a model to attend to a padding token. + +## ValueError: Unrecognized configuration class XYZ for this kind of AutoModel + +Generally, we recommend using the [`AutoModel`] class to load pretrained instances of models. This class +can automatically infer and load the correct architecture from a given checkpoint based on the configuration. If you see +this `ValueError` when loading a model from a checkpoint, this means the Auto class couldn't find a mapping from +the configuration in the given checkpoint to the kind of model you are trying to load. Most commonly, this happens when a +checkpoint doesn't support a given task. +For instance, you'll see this error in the following example because there is no GPT2 for question answering: + +```py +>>> from transformers import AutoProcessor, AutoModelForQuestionAnswering + +>>> processor = AutoProcessor.from_pretrained("openai-community/gpt2-medium") +>>> model = AutoModelForQuestionAnswering.from_pretrained("openai-community/gpt2-medium") +ValueError: Unrecognized configuration class for this kind of AutoModel: AutoModelForQuestionAnswering. +Model type should be one of AlbertConfig, BartConfig, BertConfig, BigBirdConfig, BigBirdPegasusConfig, BloomConfig, ... 
+``` diff --git a/docs/source/es/_toctree.yml b/docs/source/es/_toctree.yml index dd110b746c6ee6..e9a99b59599ed8 100644 --- a/docs/source/es/_toctree.yml +++ b/docs/source/es/_toctree.yml @@ -21,60 +21,71 @@ title: Compartir un modelo title: Tutoriales - sections: - - sections: - - local: create_a_model - title: Crea una arquitectura personalizada - - local: custom_models - title: Compartir modelos personalizados - - local: run_scripts - title: Entrenamiento con scripts - - local: sagemaker - title: Ejecutar el entrenamiento en Amazon SageMaker - - local: converting_tensorflow_models - title: Convertir checkpoints de TensorFlow - - local: serialization - title: Exportar a ONNX - title: Uso general - - sections: - - local: fast_tokenizers - title: Usa tokenizadores de 🤗 Tokenizers - - local: multilingual - title: Modelos multilingües para inferencia - - sections: - - local: tasks/question_answering - title: Respuesta a preguntas - - local: tasks/language_modeling - title: Modelado de lenguaje - - local: tasks/summarization - title: Generación de resúmenes - - local: tasks/multiple_choice - title: Selección múltiple - title: Guías de tareas + - isExpanded: false + sections: + - local: tasks/question_answering + title: Respuesta a preguntas + - local: tasks/language_modeling + title: Modelado de lenguaje + - local: tasks/summarization + title: Generación de resúmenes + - local: tasks/multiple_choice + title: Selección múltiple title: Procesamiento del Lenguaje Natural - - sections: + - isExpanded: false + sections: - local: tasks/asr title: Reconocimiento automático del habla title: Audio - - sections: + - isExpanded: false + sections: - local: tasks/image_classification title: Clasificación de imágenes title: Visión Artificial - - sections: - - local: debugging - title: Debugging - title: Rendimiento y escalabilidad - - sections: - - local: add_new_pipeline - title: ¿Cómo puedo añadir un pipeline a 🤗 Transformers? - - local: pr_checks - title: Verificaciones en un Pull Request - title: Contribuir + title: Guías prácticas +- sections: + - local: fast_tokenizers + title: Usa tokenizadores de 🤗 Tokenizers + - local: multilingual + title: Modelos multilingües para inferencia + - local: create_a_model + title: Crea una arquitectura personalizada + - local: custom_models + title: Compartir modelos personalizados + - local: run_scripts + title: Entrenamiento con scripts + - local: sagemaker + title: Ejecutar el entrenamiento en Amazon SageMaker + - local: converting_tensorflow_models + title: Convertir checkpoints de TensorFlow + - local: serialization + title: Exportar a ONNX - local: community title: Los recursos de la comunidad - title: Guías prácticas + title: Guías para desarrolladores +- sections: + - local: performance + title: Descripción general + - local: debugging + title: Debugging + title: Rendimiento y escalabilidad +- sections: + - local: add_new_pipeline + title: ¿Cómo puedo añadir un pipeline a 🤗 Transformers? 
+ - local: pr_checks + title: Verificaciones en un Pull Request + title: Contribuir - sections: - local: philosophy title: Filosofía + - local: glossary + title: Glosario + - local: task_summary + title: Lo que 🤗 Transformers puede hacer + - local: pad_truncation + title: Relleno y truncamiento - local: bertology title: BERTología + - local: perplexity + title: Perplejidad de los modelos de longitud fija title: Guías conceptuales diff --git a/docs/source/es/accelerate.mdx b/docs/source/es/accelerate.md similarity index 96% rename from docs/source/es/accelerate.mdx rename to docs/source/es/accelerate.md index 6065bc110a1d71..2c4063b7ca3bca 100644 --- a/docs/source/es/accelerate.mdx +++ b/docs/source/es/accelerate.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Entrenamiento distribuido con 🤗 Accelerate diff --git a/docs/source/es/add_new_pipeline.mdx b/docs/source/es/add_new_pipeline.md similarity index 98% rename from docs/source/es/add_new_pipeline.mdx rename to docs/source/es/add_new_pipeline.md index 8e022077972fff..289444350dfa35 100644 --- a/docs/source/es/add_new_pipeline.mdx +++ b/docs/source/es/add_new_pipeline.md @@ -7,11 +7,15 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # ¿Cómo puedo crear un pipeline personalizado? -En esta guía, veremos cómo crear un pipeline personalizado y cómo compartirlo en el [Hub](hf.co/models) o añadirlo +En esta guía, veremos cómo crear un pipeline personalizado y cómo compartirlo en el [Hub](https://hf.co/models) o añadirlo a la biblioteca 🤗 Transformers. En primer lugar, debes decidir las entradas que tu pipeline podrá recibir. Pueden ser strings, bytes, diff --git a/docs/source/es/autoclass_tutorial.mdx b/docs/source/es/autoclass_tutorial.md similarity index 89% rename from docs/source/es/autoclass_tutorial.mdx rename to docs/source/es/autoclass_tutorial.md index e04a639422bbf1..cea44c3c1ea6cf 100644 --- a/docs/source/es/autoclass_tutorial.mdx +++ b/docs/source/es/autoclass_tutorial.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. 
+ --> # Carga instancias preentrenadas con un AutoClass @@ -16,7 +20,7 @@ Con tantas arquitecturas diferentes de Transformer puede ser retador crear una p -Recuerda, la arquitectura se refiere al esqueleto del modelo y los checkpoints son los pesos para una arquitectura dada. Por ejemplo, [BERT](https://huggingface.co/bert-base-uncased) es una arquitectura, mientras que `bert-base-uncased` es un checkpoint. Modelo es un término general que puede significar una arquitectura o un checkpoint. +Recuerda, la arquitectura se refiere al esqueleto del modelo y los checkpoints son los pesos para una arquitectura dada. Por ejemplo, [BERT](https://huggingface.co/google-bert/bert-base-uncased) es una arquitectura, mientras que `google-bert/bert-base-uncased` es un checkpoint. Modelo es un término general que puede significar una arquitectura o un checkpoint. @@ -36,7 +40,7 @@ Carga un tokenizador con [`AutoTokenizer.from_pretrained`]: ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") ``` Luego tokeniza tu input como lo mostrado a continuación: @@ -84,7 +88,7 @@ Finalmente, las clases `AutoModelFor` te permiten cargar un modelo preentrenado ```py >>> from transformers import AutoModelForSequenceClassification ->>> model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") +>>> model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` Reutiliza fácilmente el mismo checkpoint para cargar una aquitectura para alguna tarea diferente: @@ -92,7 +96,7 @@ Reutiliza fácilmente el mismo checkpoint para cargar una aquitectura para algun ```py >>> from transformers import AutoModelForTokenClassification ->>> model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased") +>>> model = AutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` Generalmente recomendamos utilizar las clases `AutoTokenizer` y `AutoModelFor` para cargar instancias pre-entrenadas de modelos. Ésto asegurará que cargues la arquitectura correcta en cada ocasión. En el siguiente [tutorial](preprocessing), aprende a usar tu tokenizador recién cargado, el extractor de características y el procesador para preprocesar un dataset para fine-tuning. @@ -103,7 +107,7 @@ Finalmente, la clase `TFAutoModelFor` te permite cargar tu modelo pre-entrenado ```py >>> from transformers import TFAutoModelForSequenceClassification ->>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") +>>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` Reutiliza fácilmente el mismo checkpoint para cargar una aquitectura para alguna tarea diferente: @@ -111,7 +115,7 @@ Reutiliza fácilmente el mismo checkpoint para cargar una aquitectura para algun ```py >>> from transformers import TFAutoModelForTokenClassification ->>> model = TFAutoModelForTokenClassification.from_pretrained("distilbert-base-uncased") +>>> model = TFAutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` Generalmente recomendamos utilizar las clases `AutoTokenizer` y `TFAutoModelFor` para cargar instancias de modelos pre-entrenados. Ésto asegurará que cargues la arquitectura correcta cada vez. 
En el siguiente [tutorial](preprocessing), aprende a usar tu tokenizador recién cargado, el extractor de características y el procesador para preprocesar un dataset para fine-tuning. diff --git a/docs/source/es/bertology.mdx b/docs/source/es/bertology.md similarity index 86% rename from docs/source/es/bertology.mdx rename to docs/source/es/bertology.md index 4a3a1e551bcf78..ed4e12a8d59ceb 100644 --- a/docs/source/es/bertology.mdx +++ b/docs/source/es/bertology.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # BERTología @@ -21,6 +25,7 @@ Hay un creciente campo de estudio empeñado en la investigación del funcionamie - Are Sixteen Heads Really Better than One? por Paul Michel, Omer Levy, Graham Neubig: https://arxiv.org/abs/1905.10650 - What Does BERT Look At? An Analysis of BERT's Attention por Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning: https://arxiv.org/abs/1906.04341 +- CAT-probing: A Metric-based Approach to Interpret How Pre-trained Models for Programming Language Attend Code Structure: https://arxiv.org/abs/2210.04633 Para asistir al desarrollo de este nuevo campo, hemos incluido algunas features adicionales en los modelos BERT/GPT/GPT-2 para ayudar a acceder a las representaciones internas, principalmente adaptado de la gran obra de Paul Michel diff --git a/docs/source/es/community.mdx b/docs/source/es/community.md similarity index 95% rename from docs/source/es/community.mdx rename to docs/source/es/community.md index a34fa30104b20c..71153fbc8336f6 100644 --- a/docs/source/es/community.mdx +++ b/docs/source/es/community.md @@ -1,3 +1,7 @@ + + # Comunidad Esta página agrupa los recursos de 🤗 Transformers desarrollados por la comunidad. @@ -6,7 +10,7 @@ Esta página agrupa los recursos de 🤗 Transformers desarrollados por la comun | Recurso | Descripción | Autor | |:----------|:-------------|------:| -| [Hugging Face Transformers Glossary Flashcards](https://www.darigovresearch.com/huggingface-transformers-glossary-flashcards) | Un conjunto de flashcards basadas en el [Glosario de documentos de Transformers] (glosario) que se ha puesto en un formato que se puede aprender/revisar fácilmente usando [Anki] (https://apps.ankiweb.net/) una fuente abierta, aplicación de multiplataforma diseñada específicamente para la retención de conocimientos a largo plazo. Ve este [Introductory video on how to use the flashcards](https://www.youtube.com/watch?v=Dji_h7PILrw). | [Darigov Research](https://www.darigovresearch.com/) | +| [Hugging Face Transformers Glossary Flashcards](https://www.darigovresearch.com/huggingface-transformers-glossary-flashcards) | Un conjunto de flashcards basadas en el [Glosario de documentos de Transformers] (glosario) que se ha puesto en un formato que se puede aprender/revisar fácilmente usando [Anki](https://apps.ankiweb.net/) una fuente abierta, aplicación de multiplataforma diseñada específicamente para la retención de conocimientos a largo plazo. Ve este [Introductory video on how to use the flashcards](https://www.youtube.com/watch?v=Dji_h7PILrw). 
| [Darigov Research](https://www.darigovresearch.com/) | ## Los cuadernos de la comunidad: @@ -39,8 +43,8 @@ Esta página agrupa los recursos de 🤗 Transformers desarrollados por la comun |[Ajustar a Roberta para el análisis de sentimientos](https://github.com/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb) | Cómo ajustar un modelo de Roberta para el análisis de sentimientos | [Dhaval Taunk](https://github.com/DhavalTaunk08) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb)| |[Evaluación de modelos de generación de preguntas](https://github.com/flexudy-pipe/qugeev) | ¿Qué tan precisas son las respuestas a las preguntas generadas por tu modelo de transformador seq2seq? | [Pascal Zoleko](https://github.com/zolekode) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bpsSqCQU-iw_5nNoRm_crPq6FRuJthq_?usp=sharing)| |[Clasificar texto con DistilBERT y Tensorflow](https://github.com/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb) | Cómo ajustar DistilBERT para la clasificación de texto en TensorFlow | [Peter Bayerle](https://github.com/peterbayerle) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb)| -|[Aprovechar BERT para el resumen de codificador y decodificador en CNN/Dailymail](https://github.com/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb) | Cómo iniciar en caliente un *EncoderDecoderModel* con un punto de control *bert-base-uncased* para resumir en CNN/Dailymail | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb)| -|[Aprovechar RoBERTa para el resumen de codificador-decodificador en BBC XSum](https://github.com/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb) | Cómo iniciar en caliente un *EncoderDecoderModel* compartido con un punto de control *roberta-base* para resumir en BBC/XSum | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb)| +|[Aprovechar BERT para el resumen de codificador y decodificador en CNN/Dailymail](https://github.com/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb) | Cómo iniciar en caliente un *EncoderDecoderModel* con un punto de control *google-bert/bert-base-uncased* para resumir en CNN/Dailymail | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb)| +|[Aprovechar RoBERTa para el resumen de codificador-decodificador en BBC XSum](https://github.com/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb) | Cómo iniciar en caliente un *EncoderDecoderModel* compartido con un punto de control *FacebookAI/roberta-base* para resumir en BBC/XSum | [Patrick von 
Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb)| |[Ajustar TAPAS en Sequential Question Answering (SQA)](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) | Cómo ajustar *TapasForQuestionAnswering* con un punto de control *tapas-base* en el conjunto de datos del Sequential Question Answering (SQA) | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb)| |[Evaluar TAPAS en Table Fact Checking (TabFact)](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Evaluating_TAPAS_on_the_Tabfact_test_set.ipynb) | Cómo evaluar un *TapasForSequenceClassification* ajustado con un punto de control *tapas-base-finetuned-tabfact* usando una combinación de 🤗 conjuntos de datos y 🤗 bibliotecas de transformadores | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Evaluating_TAPAS_on_the_Tabfact_test_set.ipynb)| |[Ajustar de mBART para traducción](https://colab.research.google.com/github/vasudevgupta7/huggingface-tutorials/blob/main/translation_training.ipynb) | Cómo ajustar mBART utilizando Seq2SeqTrainer para la traducción del hindi al inglés | [Vasudev Gupta](https://github.com/vasudevgupta7) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vasudevgupta7/huggingface-tutorials/blob/main/translation_training.ipynb)| diff --git a/docs/source/es/converting_tensorflow_models.mdx b/docs/source/es/converting_tensorflow_models.md similarity index 90% rename from docs/source/es/converting_tensorflow_models.mdx rename to docs/source/es/converting_tensorflow_models.md index 2ab15e81b2508a..f56eb02d87006a 100644 --- a/docs/source/es/converting_tensorflow_models.mdx +++ b/docs/source/es/converting_tensorflow_models.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. 
+ --> # Convertir checkpoints de Tensorflow @@ -83,29 +87,15 @@ transformers-cli convert --model_type gpt \ Aquí hay un ejemplo del proceso para convertir un modelo OpenAI GPT-2 pre-entrenado (más información [aquí](https://github.com/openai/gpt-2)): ```bash -export OPENAI_GPT2_CHECKPOINT_PATH=/path/to/gpt2/pretrained/weights +export OPENAI_GPT2_CHECKPOINT_PATH=/path/to/openai-community/gpt2/pretrained/weights -transformers-cli convert --model_type gpt2 \ +transformers-cli convert --model_type openai-community/gpt2 \ --tf_checkpoint $OPENAI_GPT2_CHECKPOINT_PATH \ --pytorch_dump_output $PYTORCH_DUMP_OUTPUT \ [--config OPENAI_GPT2_CONFIG] \ [--finetuning_task_name OPENAI_GPT2_FINETUNED_TASK] ``` -## Transformer-XL - -Aquí hay un ejemplo del proceso para convertir un modelo Transformer-XL pre-entrenado (más información [aquí](https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models)): - -```bash -export TRANSFO_XL_CHECKPOINT_FOLDER_PATH=/path/to/transfo/xl/checkpoint - -transformers-cli convert --model_type transfo_xl \ - --tf_checkpoint $TRANSFO_XL_CHECKPOINT_FOLDER_PATH \ - --pytorch_dump_output $PYTORCH_DUMP_OUTPUT \ - [--config TRANSFO_XL_CONFIG] \ - [--finetuning_task_name TRANSFO_XL_FINETUNED_TASK] -``` - ## XLNet Aquí hay un ejemplo del proceso para convertir un modelo XLNet pre-entrenado: diff --git a/docs/source/es/create_a_model.mdx b/docs/source/es/create_a_model.md similarity index 94% rename from docs/source/es/create_a_model.mdx rename to docs/source/es/create_a_model.md index 99ded53ee653a9..560fbd74e3851c 100644 --- a/docs/source/es/create_a_model.mdx +++ b/docs/source/es/create_a_model.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Crea una arquitectura personalizada @@ -82,7 +86,7 @@ DistilBertConfig { Los atributos de los modelos preentrenados pueden ser modificados con la función [`~PretrainedConfig.from_pretrained`]: ```py ->>> my_config = DistilBertConfig.from_pretrained("distilbert-base-uncased", activation="relu", attention_dropout=0.4) +>>> my_config = DistilBertConfig.from_pretrained("distilbert/distilbert-base-uncased", activation="relu", attention_dropout=0.4) ``` Cuando estés satisfecho con la configuración de tu modelo, puedes guardarlo con la función [`~PretrainedConfig.save_pretrained`]. Tu configuración se guardará en un archivo JSON dentro del directorio que le especifiques como parámetro. @@ -105,7 +109,7 @@ También puedes guardar los archivos de configuración como un diccionario; o in ## Modelo -El siguiente paso será crear un [modelo](main_classes/models). El modelo, al que a veces también nos referimos como arquitectura, es el encargado de definir cada capa y qué operaciones se realizan. Los atributos como `num_hidden_layers` de la configuración se usan para definir la arquitectura. Todos los modelos comparten una clase base, [`PreTrainedModel`], y algunos métodos comunes que se pueden usar para redimensionar los _embeddings_ o para recortar cabezas de auto-atención (también llamadas _self-attention heads_). 
Además, todos los modelos son subclases de [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html), [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) o [`flax.linen.Module`](https://flax.readthedocs.io/en/latest/flax.linen.html#module), lo que significa que son compatibles con su respectivo framework. +El siguiente paso será crear un [modelo](main_classes/models). El modelo, al que a veces también nos referimos como arquitectura, es el encargado de definir cada capa y qué operaciones se realizan. Los atributos como `num_hidden_layers` de la configuración se usan para definir la arquitectura. Todos los modelos comparten una clase base, [`PreTrainedModel`], y algunos métodos comunes que se pueden usar para redimensionar los _embeddings_ o para recortar cabezas de auto-atención (también llamadas _self-attention heads_). Además, todos los modelos son subclases de [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html), [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) o [`flax.linen.Module`](https://flax.readthedocs.io/en/latest/api_reference/flax.linen/module.html), lo que significa que son compatibles con su respectivo framework. @@ -124,13 +128,13 @@ Esto crea un modelo con valores aleatorios, en lugar de crearlo con los pesos de Puedes crear un modelo preentrenado con [`~PreTrainedModel.from_pretrained`]: ```py ->>> model = DistilBertModel.from_pretrained("distilbert-base-uncased") +>>> model = DistilBertModel.from_pretrained("distilbert/distilbert-base-uncased") ``` Cuando cargues tus pesos del preentrenamiento, el modelo por defecto se carga automáticamente si nos lo proporciona 🤗 Transformers. Sin embargo, siempre puedes reemplazar (todos o algunos de) los atributos del modelo por defecto por los tuyos: ```py ->>> model = DistilBertModel.from_pretrained("distilbert-base-uncased", config=my_config) +>>> model = DistilBertModel.from_pretrained("distilbert/distilbert-base-uncased", config=my_config) ``` @@ -149,13 +153,13 @@ Esto crea un modelo con valores aleatorios, en lugar de crearlo con los pesos de Puedes crear un modelo preentrenado con [`~TFPreTrainedModel.from_pretrained`]: ```py ->>> tf_model = TFDistilBertModel.from_pretrained("distilbert-base-uncased") +>>> tf_model = TFDistilBertModel.from_pretrained("distilbert/distilbert-base-uncased") ``` Cuando cargues tus pesos del preentrenamiento, el modelo por defecto se carga automáticamente si este nos lo proporciona 🤗 Transformers. Sin embargo, siempre puedes reemplazar (todos o algunos de) los atributos del modelo por defecto por los tuyos: ```py ->>> tf_model = TFDistilBertModel.from_pretrained("distilbert-base-uncased", config=my_config) +>>> tf_model = TFDistilBertModel.from_pretrained("distilbert/distilbert-base-uncased", config=my_config) ``` @@ -173,7 +177,7 @@ Por ejemplo, [`DistilBertForSequenceClassification`] es un modelo DistilBERT ba ```py >>> from transformers import DistilBertForSequenceClassification ->>> model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased") +>>> model = DistilBertForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` Puedes reutilizar este punto de guardado o *checkpoint* para otra tarea fácilmente cambiando a una cabeza de un modelo diferente. Para una tarea de respuesta a preguntas, puedes usar la cabeza del modelo [`DistilBertForQuestionAnswering`]. 
La cabeza de respuesta a preguntas es similar a la de clasificación de secuencias, excepto porque consta de una capa lineal delante de la salida de los *hidden states*. @@ -182,7 +186,7 @@ Puedes reutilizar este punto de guardado o *checkpoint* para otra tarea fácilme ```py >>> from transformers import DistilBertForQuestionAnswering ->>> model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased") +>>> model = DistilBertForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased") ``` @@ -192,7 +196,7 @@ Por ejemplo, [`TFDistilBertForSequenceClassification`] es un modelo DistilBERT ```py >>> from transformers import TFDistilBertForSequenceClassification ->>> tf_model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased") +>>> tf_model = TFDistilBertForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` Puedes reutilizar este punto de guardado o *checkpoint* para otra tarea fácilmente cambiando a una cabeza de un modelo diferente. Para una tarea de respuesta a preguntas, puedes usar la cabeza del modelo [`TFDistilBertForQuestionAnswering`]. La cabeza de respuesta a preguntas es similar a la de clasificación de secuencias, excepto porque consta de una capa lineal delante de la salida de los *hidden states*. @@ -201,7 +205,7 @@ Puedes reutilizar este punto de guardado o *checkpoint* para otra tarea fácilme ```py >>> from transformers import TFDistilBertForQuestionAnswering ->>> tf_model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased") +>>> tf_model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased") ```
@@ -235,7 +239,7 @@ Es importante recordar que los vocabularios que provienen de un *tokenizer* pers ```py >>> from transformers import DistilBertTokenizer ->>> slow_tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased") +>>> slow_tokenizer = DistilBertTokenizer.from_pretrained("distilbert/distilbert-base-uncased") ``` Crea un *tokenizer* rápido con la clase [`DistilBertTokenizerFast`]: @@ -244,7 +248,7 @@ Crea un *tokenizer* rápido con la clase [`DistilBertTokenizerFast`]: ```py >>> from transformers import DistilBertTokenizerFast ->>> fast_tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased") +>>> fast_tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert/distilbert-base-uncased") ``` diff --git a/docs/source/es/custom_models.mdx b/docs/source/es/custom_models.md similarity index 98% rename from docs/source/es/custom_models.mdx rename to docs/source/es/custom_models.md index 434d59f87daed6..e616a056055e3d 100644 --- a/docs/source/es/custom_models.mdx +++ b/docs/source/es/custom_models.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Compartir modelos personalizados diff --git a/docs/source/es/debugging.mdx b/docs/source/es/debugging.md similarity index 98% rename from docs/source/es/debugging.mdx rename to docs/source/es/debugging.md index a709e0407b8b51..313566753052cb 100644 --- a/docs/source/es/debugging.mdx +++ b/docs/source/es/debugging.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Debugging diff --git a/docs/source/es/fast_tokenizers.mdx b/docs/source/es/fast_tokenizers.md similarity index 94% rename from docs/source/es/fast_tokenizers.mdx rename to docs/source/es/fast_tokenizers.md index 63b43cc1c4c7e9..92b925f67f7e47 100644 --- a/docs/source/es/fast_tokenizers.mdx +++ b/docs/source/es/fast_tokenizers.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. 
+ --> # Usa los tokenizadores de 🤗 Tokenizers diff --git a/docs/source/es/glossary.md b/docs/source/es/glossary.md new file mode 100644 index 00000000000000..790fa1fecbe69a --- /dev/null +++ b/docs/source/es/glossary.md @@ -0,0 +1,464 @@ + + +# Glosario + +Este glosario define términos generales de aprendizaje automático y términos relacionados con 🤗 Transformers para ayudarte a comprender mejor la documentación. + +## A + +### attention mask + +La máscara de atención es un argumento opcional utilizado al agrupar secuencias. + + + +Este argumento indica al modelo qué tokens deben recibir atención y cuáles no. + +Por ejemplo, considera estas dos secuencias: + +```python +>>> from transformers import BertTokenizer + +>>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-cased") + +>>> sequence_a = "This is a short sequence." +>>> sequence_b = "This is a rather long sequence. It is at least longer than the sequence A." + +>>> encoded_sequence_a = tokenizer(sequence_a)["input_ids"] +>>> encoded_sequence_b = tokenizer(sequence_b)["input_ids"] +``` + +Las versiones codificadas tienen longitudes diferentes: + +```python +>>> len(encoded_sequence_a), len(encoded_sequence_b) +(8, 19) +``` + +Por lo tanto, no podemos colocarlas juntas en el mismo tensor tal cual. La primera secuencia necesita ser rellenada hasta la longitud de la segunda, o la segunda necesita ser truncada hasta la longitud de la primera. + +En el primer caso, la lista de IDs se extenderá con los índices de relleno. Podemos pasar una lista al tokenizador y pedirle que realice el relleno de esta manera: + +```python +>>> padded_sequences = tokenizer([sequence_a, sequence_b], padding=True) +``` + +Podemos ver que se han agregado ceros a la derecha de la primera oración para que tenga la misma longitud que la segunda: + +```python +>>> padded_sequences["input_ids"] +[[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]] +``` + +Esto luego se puede convertir en un tensor en PyTorch o TensorFlow. La máscara de atención es un tensor binario que indica la posición de los índices de relleno para que el modelo no los tenga en cuenta. Para el [`BertTokenizer`], `1` indica un valor al que se debe prestar atención, mientras que `0` indica un valor de relleno. Esta máscara de atención está en el diccionario devuelto por el tokenizador bajo la clave "attention_mask": + +```python +>>> padded_sequences["attention_mask"] +[[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]] +``` + +### autoencoding models + +Consulta [modelos de codificación](#encoder-models) y [modelado de lenguaje enmascarado](#masked-language-modeling-mlm) + +### autoregressive models + +Consulta [modelado de lenguaje causal](#causal-language-modeling) y [modelos de decodificación](#decoder-models) + +## B + +### backbone + +La columna vertebral, backbone en inglés, es la red (embeddings y layers) que produce los estados ocultos o características crudas. Normalmente, está conectado a una [cabecera](#head), que acepta las características como entrada para hacer una predicción. Por ejemplo, [`ViTModel`] es una columna vertebral sin una cabecera específica encima. Otros modelos también pueden usar [`VitModel`] como columna vertebral, como por ejemplo [DPT](model_doc/dpt). 
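The attention mask entry above shows the padded `input_ids` and `attention_mask` as plain Python lists and notes that they can then be converted into PyTorch or TensorFlow tensors. A minimal sketch of that step, assuming PyTorch, the `AutoTokenizer`/`AutoModel` classes, and the same `google-bert/bert-base-cased` checkpoint used in the entry: `return_tensors="pt"` returns the padded batch as tensors that can be passed straight to a model, which then ignores the positions where the mask is `0`.

```python
>>> import torch
>>> from transformers import AutoModel, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
>>> model = AutoModel.from_pretrained("google-bert/bert-base-cased")

>>> sequence_a = "This is a short sequence."
>>> sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."

>>> # padding=True pads the shorter sequence; the attention mask marks the padded positions with 0
>>> batch = tokenizer([sequence_a, sequence_b], padding=True, return_tensors="pt")
>>> batch["input_ids"].shape, batch["attention_mask"].shape
(torch.Size([2, 19]), torch.Size([2, 19]))

>>> # The model only attends to positions where attention_mask == 1
>>> with torch.no_grad():
...     outputs = model(**batch)
```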
+ +## C + +### causal language modeling + +Una tarea de preentrenamiento donde el modelo lee los textos en orden y tiene que predecir la siguiente palabra. Generalmente, se realiza leyendo toda la oración, pero utilizando una máscara dentro del modelo para ocultar los tokens futuros en un cierto paso de tiempo. + +### channel + +Las imágenes a color están compuestas por alguna combinación de valores en tres canales: rojo, verde y azul (RGB), y las imágenes en escala de grises solo tienen un canal. En 🤗 Transformers, el canal puede ser la primera o última dimensión del tensor de una imagen: [`n_channels`, `height`, `width`] o [`height`, `width`, `n_channels`]. + +### connectionist temporal classification (CTC) + +Un algoritmo que permite que un modelo aprenda sin saber exactamente cómo están alineadas la entrada y la salida; CTC calcula la distribución de todas las salidas posibles para una entrada dada y elige la salida más probable de ella. CTC se utiliza comúnmente en tareas de reconocimiento de voz porque el habla no siempre se alinea perfectamente con la transcripción debido a diversas razones, como las diferentes velocidades de habla de los oradores. + +### convolution + +Un tipo de capa en una red neuronal donde la matriz de entrada se multiplica elemento por elemento por una matriz más pequeña (núcleo o filtro) y los valores se suman en una nueva matriz. Esto se conoce como una operación de convolución que se repite sobre toda la matriz de entrada. Cada operación se aplica a un segmento diferente de la matriz de entrada. Las redes neuronales convolucionales (CNN) se utilizan comúnmente en visión por computadora. + +## D + +### DataParallel (DP) + +Técnica de paralelismo para entrenamiento en múltiples GPUs donde se replica la misma configuración varias veces, con cada instancia recibiendo una porción de datos única. El procesamiento se realiza en paralelo y todas las configuraciones se sincronizan al final de cada paso de entrenamiento. + +Obtén más información sobre cómo funciona el DataParallel [aquí](perf_train_gpu_many#dataparallel-vs-distributeddataparallel). + +### decoder input IDs + +Esta entrada es específica para modelos codificador-decodificador y contiene los IDs de entrada que se enviarán al decodificador. Estas entradas deben usarse para tareas de secuencia a secuencia, como traducción o resumen, y generalmente se construyen de una manera específica para cada modelo. + +La mayoría de los modelos codificador-decodificador (BART, T5) crean sus `decoder_input_ids` por sí mismos a partir de las `labels`. En tales modelos, pasar las `labels` es la forma preferida de manejar el entrenamiento. + +Consulta la documentación de cada modelo para ver cómo manejan estos IDs de entrada para el entrenamiento de secuencia a secuencia. + +### decoder models + +También conocidos como modelos autorregresivos, los modelos decodificadores involucran una tarea de preentrenamiento (llamada modelado de lenguaje causal) donde el modelo lee los textos en orden y tiene que predecir la siguiente palabra. Generalmente, se realiza leyendo la oración completa con una máscara para ocultar los tokens futuros en un cierto paso de tiempo. + + + +### deep learning (DL) + +Algoritmos de aprendizaje automático que utilizan redes neuronales con varias capas. 
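To make the decoder input IDs entry above concrete, here is a minimal sketch of the preferred pattern it describes, assuming the `google-t5/t5-small` checkpoint and the `AutoTokenizer`/`AutoModelForSeq2SeqLM` classes: only `input_ids` and `labels` are passed, and the model derives its `decoder_input_ids` internally by shifting the labels.

```python
>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")

>>> inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
>>> labels = tokenizer(text_target="Das Haus ist wunderbar.", return_tensors="pt").input_ids

>>> # No decoder_input_ids are supplied: the model builds them from the labels during training
>>> outputs = model(input_ids=inputs.input_ids, labels=labels)
>>> loss = outputs.loss
```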
+ +## E + +### encoder models + +También conocidos como modelos de codificación automática (autoencoding models), los modelos codificadores toman una entrada (como texto o imágenes) y las transforman en una representación numérica condensada llamada embedding. A menudo, los modelos codificadores se entrenan previamente utilizando técnicas como el [modelado de lenguaje enmascarado](#masked-language-modeling-mlm), que enmascara partes de la secuencia de entrada y obliga al modelo a crear representaciones más significativas. + + + +## F + +### feature extraction + +El proceso de seleccionar y transformar datos crudos en un conjunto de características más informativas y útiles para algoritmos de aprendizaje automático. Algunos ejemplos de extracción de características incluyen transformar texto crudo en embeddings de palabras y extraer características importantes como bordes o formas de datos de imágenes/videos. + +### feed forward chunking + +En cada bloque de atención residual en los transformadores, la capa de autoatención suele ir seguida de 2 capas de avance. El tamaño de embedding intermedio de las capas de avance suele ser mayor que el tamaño oculto del modelo (por ejemplo, para `google-bert/bert-base-uncased`). + +Para una entrada de tamaño `[batch_size, sequence_length]`, la memoria requerida para almacenar los embeddings intermedios de avance `[batch_size, sequence_length, config.intermediate_size]` puede representar una gran fracción del uso de memoria. Los autores de [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) observaron que, dado que el cálculo es independiente de la dimensión `sequence_length`, es matemáticamente equivalente calcular los embeddings de salida de ambas capas de avance `[batch_size, config.hidden_size]_0, ..., [batch_size, config.hidden_size]_n` individualmente y concatenarlos después a `[batch_size, sequence_length, config.hidden_size]` con `n = sequence_length`, lo que intercambia el aumento del tiempo de cálculo por una reducción en el uso de memoria, pero produce un resultado matemáticamente **equivalente**. + +Para modelos que utilizan la función [`apply_chunking_to_forward`], el `chunk_size` define el número de embeddings de salida que se calculan en paralelo y, por lo tanto, define el equilibrio entre la complejidad de memoria y tiempo. Si `chunk_size` se establece en 0, no se realiza ninguna fragmentación de avance. + +### finetuned models + +El ajuste fino es una forma de transferencia de aprendizaje que implica tomar un modelo entrenado previamente, congelar sus pesos y reemplazar la capa de salida con una nueva [cabecera de modelo](#head) recién añadida. La cabecera del modelo se entrena en tu conjunto de datos objetivo. + +Consulta el tutorial [Ajustar finamente un modelo pre-entrenado](https://huggingface.co/docs/transformers/training) para obtener más detalles y aprende cómo ajustar finamente modelos con 🤗 Transformers. + +## H + +### head + +La cabecera del modelo se refiere a la última capa de una red neuronal que acepta los estados ocultos crudos y los proyecta en una dimensión diferente. Hay una cabecera de modelo diferente para cada tarea. Por ejemplo: + + * [`GPT2ForSequenceClassification`] es una cabecera de clasificación de secuencias, es decir, una capa lineal, encima del modelo base [`GPT2Model`]. + * [`ViTForImageClassification`] es una cabecera de clasificación de imágenes, es decir, una capa lineal encima del estado oculto final del token `CLS`, encima del modelo base [`ViTModel`]. 
+ * [`Wav2Vec2ForCTC`] es una cabecera de modelado de lenguaje con [CTC](#connectionist-temporal-classification-ctc) encima del modelo base [`Wav2Vec2Model`]. + +## I + +### image patch + +Los modelos de Transformers basados en visión dividen una imagen en parches más pequeños que se incorporan linealmente y luego se pasan como una secuencia al modelo. Puedes encontrar el `patch_size` (o resolución del modelo) en su configuración. + +### inference + +La inferencia es el proceso de evaluar un modelo en nuevos datos después de completar el entrenamiento. Consulta el tutorial [Pipeline for inference](https://huggingface.co/docs/transformers/pipeline_tutorial) para aprender cómo realizar inferencias con 🤗 Transformers. + +### input IDs + +Los IDs de entrada a menudo son los únicos parámetros necesarios que se deben pasar al modelo como entrada. Son índices de tokens, representaciones numéricas de tokens que construyen las secuencias que se utilizarán como entrada por el modelo. + + + +Cada tokenizador funciona de manera diferente, pero el mecanismo subyacente sigue siendo el mismo. Aquí tienes un ejemplo utilizando el tokenizador BERT, que es un tokenizador [WordPiece](https://arxiv.org/pdf/1609.08144.pdf): + +```python +>>> from transformers import BertTokenizer + +>>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-cased") + +>>> sequence = "A Titan RTX has 24GB of VRAM" +``` + +El tokenizador se encarga de dividir la secuencia en tokens disponibles en el vocabulario del tokenizador. + +```python +>>> tokenized_sequence = tokenizer.tokenize(sequence) +``` + +Los tokens son palabras o sub palabras. Por ejemplo, "VRAM" no estaba en el vocabulario del modelo, así que se dividió +en "V", "RA" y "M". Para indicar que estos tokens no son palabras separadas sino partes de la misma palabra, se añade un prefijo de doble almohadilla para "RA" y "M": + +```python +>>> print(tokenized_sequence) +['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M'] +``` + +Estos tokens luego se pueden convertir en IDs que son comprensibles por el modelo. Esto se puede hacer alimentando directamente la oración al tokenizador, que aprovecha la implementación en Rust de [🤗 Tokenizers](https://github.com/huggingface/tokenizers) para obtener un rendimiento óptimo. + +```python +>>> inputs = tokenizer(sequence) +``` + +El tokenizador devuelve un diccionario con todos los argumentos necesarios para que su modelo correspondiente funcione correctamente. Los índices de los tokens están bajo la clave `input_ids`: + +```python +>>> encoded_sequence = inputs["input_ids"] +>>> print(encoded_sequence) +[101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102] +``` + +Ten en cuenta que el tokenizador añade automáticamente "tokens especiales" (si el modelo asociado depende de ellos), que son IDs especiales que el modelo utiliza en ocasiones. + +Si descodificamos la secuencia anterior de IDs, + +```python +>>> decoded_sequence = tokenizer.decode(encoded_sequence) +``` + +Veremos + +```python +>>> print(decoded_sequence) +[CLS] A Titan RTX has 24GB of VRAM [SEP] +``` + +Porque esta es la forma en que un [`BertModel`] espera sus entradas. + +## L + +### labels + +Las etiquetas son un argumento opcional que se puede pasar para que el modelo calcule la pérdida por sí mismo. Estas etiquetas deberían ser la predicción esperada del modelo: usará la pérdida estándar para calcular la pérdida entre sus +predicciones y el valor esperado (la etiqueta). 
+ +Estas etiquetas son diferentes según la cabecera del modelo, por ejemplo: + +- Para modelos de clasificación de secuencias ([`BertForSequenceClassification`]), el modelo espera un tensor de dimensión + `(batch_size)` con cada valor del lote correspondiente a la etiqueta esperada de toda la secuencia. +- Para modelos de clasificación de tokens ([`BertForTokenClassification`]), el modelo espera un tensor de dimensión + `(batch_size, seq_length)` con cada valor correspondiente a la etiqueta esperada de cada token individual. +- Para el modelado de lenguaje enmascarado ([`BertForMaskedLM`]), el modelo espera un tensor de dimensión `(batch_size, seq_length)` con cada valor correspondiente a la etiqueta esperada de cada token individual: las etiquetas son el ID del token enmascarado y los valores deben ignorarse para el resto (generalmente -100). +- Para tareas de secuencia a secuencia ([`BartForConditionalGeneration`], [`MBartForConditionalGeneration`]), el modelo + espera un tensor de dimensión `(batch_size, tgt_seq_length)` con cada valor correspondiente a las secuencias objetivo asociadas con cada secuencia de entrada. Durante el entrenamiento, tanto BART como T5 generarán internamente los `decoder_input_ids` y las máscaras de atención del decodificador. Por lo general, no es necesario suministrarlos. Esto no se aplica a los modelos que aprovechan el marco codificador-decodificador. +- Para modelos de clasificación de imágenes ([`ViTForImageClassification`]), el modelo espera un tensor de dimensión + `(batch_size)` con cada valor del lote correspondiente a la etiqueta esperada de cada imagen individual. +- Para modelos de segmentación semántica ([`SegformerForSemanticSegmentation`]), el modelo espera un tensor de dimensión + `(batch_size, height, width)` con cada valor del lote correspondiente a la etiqueta esperada de cada píxel individual. +- Para modelos de detección de objetos ([`DetrForObjectDetection`]), el modelo espera una lista de diccionarios con claves `class_labels` y `boxes` donde cada valor del lote corresponde a la etiqueta esperada y el número de cajas delimitadoras de cada imagen individual. +- Para modelos de reconocimiento automático de voz ([`Wav2Vec2ForCTC`]), el modelo espera un tensor de dimensión `(batch_size, target_length)` con cada valor correspondiente a la etiqueta esperada de cada token individual. + + + +Las etiquetas de cada modelo pueden ser diferentes, así que asegúrate siempre de revisar la documentación de cada modelo para obtener más información sobre sus etiquetas específicas. + + + +Los modelos base ([`BertModel`]) no aceptan etiquetas, ya que estos son los modelos base de transformadores, que simplemente generan características. + +### large language models (LLM) + +Un término genérico que se refiere a modelos de lenguaje de transformadores (GPT-3, BLOOM, OPT) que fueron entrenados con una gran cantidad de datos. Estos modelos también tienden a tener un gran número de parámetros que se pueden aprender (por ejemplo, 175 mil millones para GPT-3). + +## M + +### masked language modeling (MLM) + +Una tarea de preentrenamiento en la que el modelo ve una versión corrupta de los textos, generalmente hecha +al enmascarar algunos tokens al azar, y tiene que predecir el texto original. + +### multimodal + +Una tarea que combina textos con otro tipo de entradas (por ejemplo: imágenes). 
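A minimal sketch of the first case in the list above (sequence classification), assuming `google-bert/bert-base-cased` with a randomly initialized two-class head: the `labels` tensor has shape `(batch_size,)`, one expected class per sequence, and the model returns the loss computed between its logits and those labels.

```python
>>> import torch
>>> from transformers import AutoModelForSequenceClassification, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
>>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=2)

>>> batch = tokenizer(["I love this movie.", "I did not like it at all."], padding=True, return_tensors="pt")
>>> labels = torch.tensor([1, 0])  # one expected label per sequence -> shape (batch_size,)

>>> outputs = model(**batch, labels=labels)
>>> loss = outputs.loss  # cross-entropy between the model's logits and the labels
```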
+ +## N + +### Natural language generation (NLG) + +Todas las tareas relacionadas con la generación de texto (por ejemplo: [Escribe con Transformers](https://transformer.huggingface.co/) o traducción). + +### Natural language processing (NLP) + +Una forma genérica de decir "trabajar con textos". + +### Natural language understanding (NLU) + +Todas las tareas relacionadas con entender lo que hay en un texto (por ejemplo: clasificar el +texto completo o palabras individuales). + +## P + +### Pipeline + +Un pipeline en 🤗 Transformers es una abstracción que se refiere a una serie de pasos que se ejecutan en un orden específico para preprocesar y transformar datos y devolver una predicción de un modelo. Algunas etapas de ejemplo que se encuentran en un pipeline pueden ser el preprocesamiento de datos, la extracción de características y la normalización. + +Para obtener más detalles, consulta [Pipelines para inferencia](https://huggingface.co/docs/transformers/pipeline_tutorial). + +### PipelineParallel (PP) + +Técnica de paralelismo en la que el modelo se divide verticalmente (a nivel de capa) en varios GPU, de modo que solo una o varias capas del modelo se colocan en un solo GPU. Cada GPU procesa en paralelo diferentes etapas del pipeline y trabaja en un pequeño fragmento del lote. Obtén más información sobre cómo funciona PipelineParallel [aquí](perf_train_gpu_many#from-naive-model-parallelism-to-pipeline-parallelism). + +### pixel values + +Un tensor de las representaciones numéricas de una imagen que se pasa a un modelo. Los valores de píxeles tienen una forma de [`batch_size`, `num_channels`, `height`, `width`], y se generan a partir de un procesador de imágenes. + +### pooling + +Una operación que reduce una matriz a una matriz más pequeña, ya sea tomando el máximo o el promedio de la dimensión (o dimensiones) agrupada(s). Las capas de agrupación se encuentran comúnmente entre capas convolucionales para reducir la representación de características. + +### position IDs + +A diferencia de las RNN que tienen la posición de cada token incrustada en ellas, los transformers no son conscientes de la posición de cada token. Por lo tanto, se utilizan los IDs de posición (`position_ids`) para que el modelo identifique la posición de cada token en la lista de tokens. + +Son un parámetro opcional. Si no se pasan `position_ids` al modelo, los IDs se crean automáticamente como embeddings de posición absolutas. + +Los embeddings de posición absolutas se seleccionan en el rango `[0, config.max_position_embeddings - 1]`. Algunos modelos utilizan otros tipos de embeddings de posición, como embeddings de posición sinusoidales o embeddings de posición relativas. + +### preprocessing + +La tarea de preparar datos crudos en un formato que pueda ser fácilmente consumido por modelos de aprendizaje automático. Por ejemplo, el texto se preprocesa típicamente mediante la tokenización. Para tener una mejor idea de cómo es el preprocesamiento para otros tipos de entrada, consulta el tutorial [Pre-procesar](https://huggingface.co/docs/transformers/preprocessing). + +### pretrained model + +Un modelo que ha sido pre-entrenado en algunos datos (por ejemplo, toda Wikipedia). Los métodos de preentrenamiento involucran un objetivo auto-supervisado, que puede ser leer el texto e intentar predecir la siguiente palabra (ver [modelado de lenguaje causal](#causal-language-modeling)) o enmascarar algunas palabras e intentar predecirlas (ver [modelado de lenguaje enmascarado](#masked-language-modeling-mlm)). 
+ +Los modelos de habla y visión tienen sus propios objetivos de pre-entrenamiento. Por ejemplo, Wav2Vec2 es un modelo de habla pre-entrenado en una tarea contrastiva que requiere que el modelo identifique la representación de habla "verdadera" de un conjunto de representaciones de habla "falsas". Por otro lado, BEiT es un modelo de visión pre-entrenado en una tarea de modelado de imágenes enmascaradas que enmascara algunos de los parches de la imagen y requiere que el modelo prediga los parches enmascarados (similar al objetivo de modelado de lenguaje enmascarado). + +## R + +### recurrent neural network (RNN) + +Un tipo de modelo que utiliza un bucle sobre una capa para procesar textos. + +### representation learning + +Un subcampo del aprendizaje automático que se centra en aprender representaciones significativas de datos en bruto. Algunos ejemplos de técnicas de aprendizaje de representaciones incluyen embeddings de palabras, auto-encoders y Redes Generativas Adversarias (Generative Adversarial Networks, GANs). + +## S + +### sampling rate + +Una medida en hercios del número de muestras (la señal de audio) tomadas por segundo. La tasa de muestreo es el resultado de aproximar una señal continua como el habla. + +### self-attention + +Cada elemento de la entrada averigua a cuáles otros elementos de la entrada debe prestar atención. + +### self-supervised learning + +Una categoría de técnicas de aprendizaje automático en la que un modelo crea su propio objetivo de aprendizaje a partir de datos no etiquetados. Difiere del [aprendizaje no supervisado](#unsupervised-learning) y del [aprendizaje supervisado](#supervised-learning) en que el proceso de aprendizaje está supervisado, pero no explícitamente por el usuario. + +Un ejemplo de aprendizaje auto-supervisado es el [modelado de lenguaje enmascarado](#masked-language-modeling-mlm), donde un modelo recibe oraciones con una proporción de sus tokens eliminados y aprende a predecir los tokens faltantes. + +### semi-supervised learning + +Una amplia categoría de técnicas de entrenamiento de aprendizaje automático que aprovecha una pequeña cantidad de datos etiquetados con una mayor cantidad de datos no etiquetados para mejorar la precisión de un modelo, a diferencia del [aprendizaje supervisado](#supervised-learning) y del [aprendizaje no supervisado](#unsupervised-learning). + +Un ejemplo de un enfoque de aprendizaje semi-supervisado es "auto-entrenamiento", en el que un modelo se entrena con datos etiquetados y luego se utiliza para hacer predicciones sobre los datos no etiquetados. La porción de datos no etiquetados que el modelo predice con mayor confianza se agrega al conjunto de datos etiquetados y se utiliza para volver a entrenar el modelo. + +### sequence-to-sequence (seq2seq) + +Modelos que generan una nueva secuencia a partir de una entrada, como modelos de traducción o modelos de resumen (como +[Bart](model_doc/bart) o [T5](model_doc/t5)). + +### Sharded DDP + +Otro nombre para el concepto fundamental de [ZeRO](#zero-redundancy-optimizer-zero) utilizado por varias otras implementaciones de ZeRO. + +### stride + +En [convolución](#convolution) o [agrupación](#pooling), el paso (stride) se refiere a la distancia que recorre el núcleo sobre una matriz. Un paso de 1 significa que el núcleo se mueve un píxel a la vez, y un paso de 2 significa que el núcleo se mueve dos píxeles a la vez. 
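The convolution, pooling, and stride entries above can be tied together with a short numeric sketch. PyTorch is assumed here purely for illustration; the output shapes follow directly from the kernel size, stride, and padding.

```python
>>> import torch
>>> from torch import nn

>>> image = torch.randn(1, 3, 32, 32)  # (batch_size, n_channels, height, width)

>>> # 3x3 kernel, stride 1, padding 1: the spatial size is preserved
>>> conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=1, padding=1)
>>> features = conv(image)
>>> features.shape
torch.Size([1, 8, 32, 32])

>>> # 2x2 max pooling with stride 2: the kernel moves two pixels at a time, halving height and width
>>> pool = nn.MaxPool2d(kernel_size=2, stride=2)
>>> pool(features).shape
torch.Size([1, 8, 16, 16])
```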
+ +### supervised learning + +Una forma de entrenamiento de modelos que utiliza directamente datos etiquetados para corregir y dirigir el rendimiento del modelo. Los datos se introducen en el modelo en entrenamiento, y sus predicciones se comparan con las etiquetas conocidas. El modelo actualiza sus pesos en función de cuán incorrectas fueron sus predicciones, y el proceso se repite para optimizar el rendimiento del modelo. + +## T + +### Tensor Parallelism (TP) + +Técnica de paralelismo para entrenamiento en múltiples GPU en la que cada tensor se divide en múltiples fragmentos, de modo que en lugar de tener todo el tensor en una sola GPU, cada fragmento del tensor reside en su GPU designada. Los fragmentos se procesan por separado y en paralelo en diferentes GPU y los resultados se sincronizan al final del paso de procesamiento.Esto es lo que a veces se llama paralelismo horizontal, ya que la división ocurre a nivel horizontal. +Obtén más información sobre el Paralelismo de Tensores [aquí](perf_train_gpu_many#tensor-parallelism). + +### token + +Parte de una oración, generalmente una palabra, pero también puede ser una sub-palabra (las palabras no comunes a menudo se dividen en sub-palabras) o un símbolo de puntuación. + +### token Type IDs + +Algunos modelos tienen como objetivo realizar clasificación en pares de oraciones o responder preguntas. + + + +Estos requieren que dos secuencias diferentes se unan en una única entrada "input_ids", lo cual generalmente se realiza con +la ayuda de tokens especiales, como el token de clasificación (`[CLS]`) y el token separador (`[SEP]`). Por ejemplo, el modelo BERT construye sus dos secuencias de entrada de la siguiente manera: + +```python +>>> # [CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP] +``` + +Podemos utilizar nuestro tokenizador para generar automáticamente una oración de este tipo al pasar las dos secuencias a `tokenizer` como dos argumentos (y no como una lista, como antes) de la siguiente manera: + +```python +>>> from transformers import BertTokenizer + +>>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-cased") +>>> sequence_a = "HuggingFace is based in NYC" +>>> sequence_b = "Where is HuggingFace based?" + +>>> encoded_dict = tokenizer(sequence_a, sequence_b) +>>> decoded = tokenizer.decode(encoded_dict["input_ids"]) +``` + +Que devolverá: + +```python +>>> print(decoded) +[CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP] +``` + +Esto es suficiente para que algunos modelos comprendan dónde termina una secuencia y comienza otra. Sin embargo, otros modelos, como BERT, también utilizan identificadores de tipo de token (también llamados identificadores de segmento). Se representan como una máscara binaria que identifica los dos tipos de secuencia en el modelo. + +El tokenizador devuelve esta máscara como la entrada "token_type_ids": + +```python +>>> encoded_dict["token_type_ids"] +[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1] +``` + +La primera secuencia, el "contexto" utilizado para la pregunta, tiene todos sus tokens representados por un `0`, mientras que la segunda secuencia, correspondiente a la "pregunta", tiene todos sus tokens representados por un `1`. + +Algunos modelos, como [`XLNetModel`], utilizan un token adicional representado por un `2`. + +### transfer learning + +Una técnica que implica tomar un modelo pre-entrenado y adaptarlo a un conjunto de datos específico para tu tarea. 
En lugar de entrenar un modelo desde cero, puedes aprovechar el conocimiento obtenido de un modelo existente como punto de partida. Esto acelera el proceso de aprendizaje y reduce la cantidad de datos de entrenamiento necesarios. + +### transformer + +Arquitectura de modelo de aprendizaje profundo basada en auto-atención (Self-attention). + +## U + +### unsupervised learning + +Una forma de entrenamiento de modelos en la que los datos proporcionados al modelo no están etiquetados. Las técnicas de aprendizaje no supervisado aprovechan la información estadística de la distribución de datos para encontrar patrones útiles para la tarea en cuestión. + +## Z + +### Zero Redundancy Optimizer (ZeRO) + +Técnica de paralelismo que realiza la fragmentación de los tensores de manera algo similar a [TensorParallel](#tensor-parallelism-tp), excepto que todo el tensor se reconstruye a tiempo para una computación hacia adelante o hacia atrás, por lo tanto, el modelo no necesita ser modificado. Este método también admite diversas técnicas de descarga para compensar la memoria limitada de la GPU. Obtén más información sobre ZeRO [aquí](perf_train_gpu_many#zero-data-parallelism). \ No newline at end of file diff --git a/docs/source/es/index.mdx b/docs/source/es/index.md similarity index 96% rename from docs/source/es/index.mdx rename to docs/source/es/index.md index aca5d9f705d2bb..fe7d65d94e356c 100644 --- a/docs/source/es/index.mdx +++ b/docs/source/es/index.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # 🤗 Transformers @@ -46,6 +50,7 @@ La biblioteca actualmente contiene implementaciones de JAX, PyTorch y TensorFlow 1. **[ALBERT](model_doc/albert)** (de Google Research y el Instituto Tecnológico de Toyota en Chicago) publicado con el paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), por Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. +1. **[ALIGN](model_doc/align)** (de Google Research) publicado con el paper [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918) por Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. 1. **[BART](model_doc/bart)** (de Facebook) publicado con el paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) por Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov y Luke Zettlemoyer. 1. **[BARThez](model_doc/barthez)** (de École polytechnique) publicado con el paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) por Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis. 1. 
**[BARTpho](model_doc/bartpho)** (de VinAI Research) publicado con el paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) por Nguyen Luong Tran, Duong Minh Le y Dat Quoc Nguyen. @@ -62,6 +67,7 @@ La biblioteca actualmente contiene implementaciones de JAX, PyTorch y TensorFlow 1. **[CamemBERT](model_doc/camembert)** (de Inria/Facebook/Sorbonne) publicado con el paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) por Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah y Benoît Sagot. 1. **[CANINE](model_doc/canine)** (de Google Research) publicado con el paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) por Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. 1. **[ConvNeXT](model_doc/convnext)** (de Facebook AI) publicado con el paper [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) por Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie. +1. **[ConvNeXTV2](model_doc/convnextv2)** (de Facebook AI) publicado con el paper [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) por Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie. 1. **[CLIP](model_doc/clip)** (de OpenAI) publicado con el paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) por Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. 1. **[ConvBERT](model_doc/convbert)** (de YituTech) publicado con el paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) por Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan. 1. **[CPM](model_doc/cpm)** (de Universidad de Tsinghua) publicado con el paper [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) por Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun. @@ -77,16 +83,18 @@ La biblioteca actualmente contiene implementaciones de JAX, PyTorch y TensorFlow 1. **[DistilBERT](model_doc/distilbert)** (de HuggingFace), publicado junto con el paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) por Victor Sanh, Lysandre Debut y Thomas Wolf. Se ha aplicado el mismo método para comprimir GPT2 en [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa en [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), BERT multilingüe en [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) y una versión alemana de DistilBERT. 1. 
**[DPR](model_doc/dpr)** (de Facebook) publicado con el paper [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) por Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, y Wen-tau Yih. 1. **[DPT](master/model_doc/dpt)** (de Intel Labs) publicado con el paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) por René Ranftl, Alexey Bochkovskiy, Vladlen Koltun. +1. **[EfficientNet](model_doc/efficientnet)** (from Google Research) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan and Quoc V. Le. 1. **[EncoderDecoder](model_doc/encoder-decoder)** (de Google Research) publicado con el paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) por Sascha Rothe, Shashi Narayan, Aliaksei Severyn. 1. **[ELECTRA](model_doc/electra)** (de Google Research/Universidad de Stanford) publicado con el paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) por Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. 1. **[FlauBERT](model_doc/flaubert)** (de CNRS) publicado con el paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) por Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab. 1. **[FNet](model_doc/fnet)** (de Google Research) publicado con el paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) por James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. 1. **[Funnel Transformer](model_doc/funnel)** (de CMU/Google Brain) publicado con el paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) por Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le. 1. **[GLPN](model_doc/glpn)** (de KAIST) publicado con el paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) por Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim. -1. **[GPT](model_doc/openai-gpt)** (de OpenAI) publicado con el paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) por Alec Radford, Karthik Narasimhan, Tim Salimans y Ilya Sutskever. -1. **[GPT-2](model_doc/gpt2)** (de OpenAI) publicado con el paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) por Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** y Ilya Sutskever**. +1. **[GPT](model_doc/openai-gpt)** (de OpenAI) publicado con el paper [Improving Language Understanding by Generative Pre-Training](https://openai.com/research/language-unsupervised/) por Alec Radford, Karthik Narasimhan, Tim Salimans y Ilya Sutskever. +1. **[GPT-2](model_doc/gpt2)** (de OpenAI) publicado con el paper [Language Models are Unsupervised Multitask Learners](https://openai.com/research/better-language-models/) por Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei y Ilya Sutskever. 1. **[GPT-J](model_doc/gptj)** (de EleutherAI) publicado con el repositorio [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) por Ben Wang y Aran Komatsuzaki. 
1. **[GPT Neo](model_doc/gpt_neo)** (de EleutherAI) publicado en el paper [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) por Sid Black, Stella Biderman, Leo Gao, Phil Wang y Connor Leahy. +1. **[GPTSAN-japanese](model_doc/gptsan-japanese)** released with [GPTSAN](https://github.com/tanreinama/GPTSAN) by Toshiyuki Sakamoto (tanreinama). 1. **[Hubert](model_doc/hubert)** (de Facebook) publicado con el paper [HuBERT: Self-Supervised Speech Representation Learning por Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) por Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed. 1. **[I-BERT](model_doc/ibert)** (de Berkeley) publicado con el paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) por Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer. 1. **[ImageGPT](model_doc/imagegpt)** (de OpenAI) publicado con el paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) por Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. @@ -231,9 +239,9 @@ Flax), PyTorch y/o TensorFlow. | RAG | ✅ | ❌ | ✅ | ✅ | ❌ | | Realm | ✅ | ✅ | ✅ | ❌ | ❌ | | Reformer | ✅ | ✅ | ✅ | ❌ | ❌ | -| RegNet | ❌ | ❌ | ✅ | ❌ | ❌ | +| RegNet | ❌ | ❌ | ✅ | ✅ | ✅ | | RemBERT | ✅ | ✅ | ✅ | ✅ | ❌ | -| ResNet | ❌ | ❌ | ✅ | ❌ | ❌ | +| ResNet | ❌ | ❌ | ✅ | ❌ | ✅ | | RetriBERT | ✅ | ✅ | ✅ | ❌ | ❌ | | RoBERTa | ✅ | ✅ | ✅ | ✅ | ✅ | | RoFormer | ✅ | ✅ | ✅ | ✅ | ✅ | diff --git a/docs/source/es/installation.mdx b/docs/source/es/installation.md similarity index 95% rename from docs/source/es/installation.mdx rename to docs/source/es/installation.md index 01b9d81409d447..b79d0af4a46436 100644 --- a/docs/source/es/installation.mdx +++ b/docs/source/es/installation.md @@ -12,6 +12,10 @@ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Instalación @@ -127,10 +131,10 @@ El entorno de Python que creaste para la instalación de 🤗 Transformers encon ## Instalación con conda -Puedes instalar 🤗 Transformers desde el canal de conda `huggingface` con el siguiente comando: +Puedes instalar 🤗 Transformers desde el canal de conda `conda-forge` con el siguiente comando: ```bash -conda install -c huggingface transformers +conda install conda-forge::transformers ``` ## Configuración de Caché @@ -144,7 +148,7 @@ Los modelos preentrenados se descargan y almacenan en caché localmente en: `~/. 🤗 Transformers usará las variables de entorno de shell `PYTORCH_TRANSFORMERS_CACHE` o `PYTORCH_PRETRAINED_BERT_CACHE` si viene de una iteración anterior de la biblioteca y ha configurado esas variables de entorno, a menos que especifiques la variable de entorno de shell `TRANSFORMERS_CACHE`. - + @@ -161,14 +165,14 @@ Puedes añadir [🤗 Datasets](https://huggingface.co/docs/datasets/) al flujo d Por ejemplo, normalmente ejecutarías un programa en una red normal con firewall para instancias externas con el siguiente comando: ```bash -python examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --dataset_name wmt16 --dataset_config ro-en ... 
+python examples/pytorch/translation/run_translation.py --model_name_or_path google-t5/t5-small --dataset_name wmt16 --dataset_config ro-en ... ``` Ejecuta este mismo programa en una instancia offline con el siguiente comando: ```bash HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \ -python examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --dataset_name wmt16 --dataset_config ro-en ... +python examples/pytorch/translation/run_translation.py --model_name_or_path google-t5/t5-small --dataset_name wmt16 --dataset_config ro-en ... ``` El script ahora debería ejecutarse sin bloquearse ni esperar a que se agote el tiempo de espera porque sabe que solo debe buscar archivos locales. @@ -200,7 +204,7 @@ Otra opción para usar 🤗 Transformers offline es descargando previamente los >>> model.save_pretrained("./your/path/bigscience_t0") ``` - 3. Cuando te encuentres offline, recarga los archivos con [`PreTrainedModel.from_pretrained`] desde el directorio especificado: + 3. Cuando te encuentres offline, recarga los archivos con [`PreTrainedModel.from_pretrained`] desde el directorio especificado: ```py >>> tokenizer = AutoTokenizer.from_pretrained("./your/path/bigscience_t0") @@ -209,7 +213,7 @@ Otra opción para usar 🤗 Transformers offline es descargando previamente los * Descarga de manera programática los archivos con la biblioteca [huggingface_hub](https://github.com/huggingface/huggingface_hub/tree/main/src/huggingface_hub): - 1. Instala la biblioteca [huggingface_hub](https://github.com/huggingface/huggingface_hub/tree/main/src/huggingface_hub) en tu entorno virtual: + 1. Instala la biblioteca [huggingface_hub](https://github.com/huggingface/huggingface_hub/tree/main/src/huggingface_hub) en tu entorno virtual: ```bash python -m pip install huggingface_hub diff --git a/docs/source/es/model_sharing.mdx b/docs/source/es/model_sharing.md similarity index 94% rename from docs/source/es/model_sharing.mdx rename to docs/source/es/model_sharing.md index 06029880fb1487..43cf0b8eddb8f7 100644 --- a/docs/source/es/model_sharing.mdx +++ b/docs/source/es/model_sharing.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Compartir un modelo @@ -216,4 +220,4 @@ Para asegurarnos que los usuarios entiendan las capacidades de tu modelo, sus li * Elaborando y subiendo manualmente el archivo`README.md`. * Dando click en el botón **Edit model card** dentro del repositorio. -Toma un momento para ver la [tarjeta de modelo](https://huggingface.co/distilbert-base-uncased) de DistilBert para que tengas un buen ejemplo del tipo de información que debería incluir. Consulta [la documentación](https://huggingface.co/docs/hub/models-cards) para más detalles acerca de otras opciones que puedes controlar dentro del archivo `README.md` como la huella de carbono del modelo o ejemplos de widgets. Consulta la documentación [aquí] (https://huggingface.co/docs/hub/models-cards). 
+Toma un momento para ver la [tarjeta de modelo](https://huggingface.co/distilbert/distilbert-base-uncased) de DistilBert para que tengas un buen ejemplo del tipo de información que debería incluir. Consulta [la documentación](https://huggingface.co/docs/hub/models-cards) para más detalles acerca de otras opciones que puedes controlar dentro del archivo `README.md` como la huella de carbono del modelo o ejemplos de widgets. Consulta la documentación [aquí](https://huggingface.co/docs/hub/models-cards). diff --git a/docs/source/es/multilingual.mdx b/docs/source/es/multilingual.md similarity index 78% rename from docs/source/es/multilingual.mdx rename to docs/source/es/multilingual.md index 4849416a44db85..d49d54f196d54e 100644 --- a/docs/source/es/multilingual.mdx +++ b/docs/source/es/multilingual.md @@ -8,13 +8,17 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Modelos multilingües para inferencia [[open-in-colab]] -Existen varios modelos multilingües en 🤗 Transformers y su uso para inferencia difiere de los modelos monolingües. Sin embargo, no *todos* los usos de los modelos multilingües son diferentes. Algunos modelos, como [bert-base-multilingual-uncased](https://huggingface.co/bert-base-multilingual-uncased), pueden utilizarse igual que un modelo monolingüe. Esta guía te enseñará cómo utilizar modelos multilingües cuyo uso difiere en la inferencia. +Existen varios modelos multilingües en 🤗 Transformers y su uso para inferencia difiere de los modelos monolingües. Sin embargo, no *todos* los usos de los modelos multilingües son diferentes. Algunos modelos, como [google-bert/bert-base-multilingual-uncased](https://huggingface.co/google-bert/bert-base-multilingual-uncased), pueden utilizarse igual que un modelo monolingüe. Esta guía te enseñará cómo utilizar modelos multilingües cuyo uso difiere en la inferencia. ## XLM @@ -24,24 +28,24 @@ XLM tiene diez checkpoints diferentes de los cuales solo uno es monolingüe. 
Los siguientes modelos XLM usan language embeddings para especificar el lenguaje utilizado en la inferencia:

-- `xlm-mlm-ende-1024` (Masked language modeling, English-German)
-- `xlm-mlm-enfr-1024` (Masked language modeling, English-French)
-- `xlm-mlm-enro-1024` (Masked language modeling, English-Romanian)
-- `xlm-mlm-xnli15-1024` (Masked language modeling, XNLI languages)
-- `xlm-mlm-tlm-xnli15-1024` (Masked language modeling + translation, XNLI languages)
-- `xlm-clm-enfr-1024` (Causal language modeling, English-French)
-- `xlm-clm-ende-1024` (Causal language modeling, English-German)
+- `FacebookAI/xlm-mlm-ende-1024` (Masked language modeling, English-German)
+- `FacebookAI/xlm-mlm-enfr-1024` (Masked language modeling, English-French)
+- `FacebookAI/xlm-mlm-enro-1024` (Masked language modeling, English-Romanian)
+- `FacebookAI/xlm-mlm-xnli15-1024` (Masked language modeling, XNLI languages)
+- `FacebookAI/xlm-mlm-tlm-xnli15-1024` (Masked language modeling + translation, XNLI languages)
+- `FacebookAI/xlm-clm-enfr-1024` (Causal language modeling, English-French)
+- `FacebookAI/xlm-clm-ende-1024` (Causal language modeling, English-German)

Los language embeddings son representados como un tensor de las mismas dimensiones que los `input_ids` pasados al modelo. Los valores de estos tensores dependen del idioma utilizado y se identifican mediante los atributos `lang2id` y `id2lang` del tokenizador.

-En este ejemplo, carga el checkpoint `xlm-clm-enfr-1024` (Causal language modeling, English-French):
+En este ejemplo, carga el checkpoint `FacebookAI/xlm-clm-enfr-1024` (Causal language modeling, English-French):

```py
>>> import torch
>>> from transformers import XLMTokenizer, XLMWithLMHeadModel

->>> tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
->>> model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")
+>>> tokenizer = XLMTokenizer.from_pretrained("FacebookAI/xlm-clm-enfr-1024")
+>>> model = XLMWithLMHeadModel.from_pretrained("FacebookAI/xlm-clm-enfr-1024")
```

El atributo `lang2id` del tokenizador muestra los idiomas de este modelo y sus ids:

@@ -79,8 +83,8 @@ El script [run_generation.py](https://github.com/huggingface/transformers/tree/m

Los siguientes modelos XLM no requieren language embeddings durante la inferencia:

-- `xlm-mlm-17-1280` (modelado de lenguaje enmascarado, 17 idiomas)
-- `xlm-mlm-100-1280` (modelado de lenguaje enmascarado, 100 idiomas)
+- `FacebookAI/xlm-mlm-17-1280` (modelado de lenguaje enmascarado, 17 idiomas)
+- `FacebookAI/xlm-mlm-100-1280` (modelado de lenguaje enmascarado, 100 idiomas)

Estos modelos se utilizan para representaciones genéricas de frases a diferencia de los anteriores checkpoints XLM.

@@ -88,8 +92,8 @@ Estos modelos se utilizan para representaciones genéricas de frases a diferenci

Los siguientes modelos de BERT pueden utilizarse para tareas multilingües:

-- `bert-base-multilingual-uncased` (modelado de lenguaje enmascarado + predicción de la siguiente oración, 102 idiomas)
-- `bert-base-multilingual-cased` (modelado de lenguaje enmascarado + predicción de la siguiente oración, 104 idiomas)
+- `google-bert/bert-base-multilingual-uncased` (modelado de lenguaje enmascarado + predicción de la siguiente oración, 102 idiomas)
+- `google-bert/bert-base-multilingual-cased` (modelado de lenguaje enmascarado + predicción de la siguiente oración, 104 idiomas)

Estos modelos no requieren language embeddings durante la inferencia. Deben identificar la lengua a partir del contexto e inferir en consecuencia.
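Como complemento a la sección sobre XLM con language embeddings descrita arriba, el siguiente boceto mínimo (no forma parte del diff original; la frase de entrada y el idioma `"en"` son solo ilustrativos) muestra cómo construir el tensor `langs` a partir de `lang2id` y pasarlo junto con los `input_ids`:

```py
>>> import torch
>>> from transformers import XLMTokenizer, XLMWithLMHeadModel

>>> tokenizer = XLMTokenizer.from_pretrained("FacebookAI/xlm-clm-enfr-1024")
>>> model = XLMWithLMHeadModel.from_pretrained("FacebookAI/xlm-clm-enfr-1024")

>>> # Frase de ejemplo (batch de tamaño 1)
>>> input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")])

>>> # `langs` tiene las mismas dimensiones que `input_ids`,
>>> # relleno con el id del idioma obtenido de `lang2id`
>>> language_id = tokenizer.lang2id["en"]
>>> langs = torch.full_like(input_ids, language_id)

>>> # Los language embeddings se pasan mediante el argumento `langs`
>>> outputs = model(input_ids, langs=langs)
```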
@@ -98,8 +102,8 @@ contexto e inferir en consecuencia. Los siguientes modelos de XLM-RoBERTa pueden utilizarse para tareas multilingües: -- `xlm-roberta-base` (modelado de lenguaje enmascarado, 100 idiomas) -- `xlm-roberta-large` (Modelado de lenguaje enmascarado, 100 idiomas) +- `FacebookAI/xlm-roberta-base` (modelado de lenguaje enmascarado, 100 idiomas) +- `FacebookAI/xlm-roberta-large` (Modelado de lenguaje enmascarado, 100 idiomas) XLM-RoBERTa se entrenó con 2,5 TB de datos CommonCrawl recién creados y depurados en 100 idiomas. Proporciona fuertes ventajas sobre los modelos multilingües publicados anteriormente como mBERT o XLM en tareas posteriores como la clasificación, el etiquetado de secuencias y la respuesta a preguntas. diff --git a/docs/source/es/pad_truncation.md b/docs/source/es/pad_truncation.md new file mode 100644 index 00000000000000..6a31a69103502f --- /dev/null +++ b/docs/source/es/pad_truncation.md @@ -0,0 +1,69 @@ + + +# Relleno y truncamiento + +Las entradas agrupadas por lotes (batched) suelen tener longitudes diferentes, por lo que no se pueden convertir en tensores de tamaño fijo. El relleno (también conocido como "Padding") y el truncamiento (conocido como "Truncation") son estrategias para abordar este problema y crear tensores rectangulares a partir de lotes de longitudes variables. El relleno agrega un **padding token** especial para garantizar que las secuencias más cortas tengan la misma longitud que la secuencia más larga en un lote o la longitud máxima aceptada por el modelo. El truncamiento funciona en la otra dirección al truncar secuencias largas. + +En la mayoría de los casos, es bastante eficaz rellenar el lote hasta la longitud de la secuencia más larga y truncar hasta la longitud máxima que un modelo puede aceptar. Sin embargo, la API admite más estrategias si las necesitas. Los tres argumentos que necesitas son: `padding`, `truncation` y `max_length`. + +El argumento `padding` controla el relleno. Puede ser un booleano o una cadena: + + - `True` o `'longest'`: rellena hasta la longitud de la secuencia más larga en el lote (no se aplica relleno si solo proporcionas una única secuencia). + - `'max_length'`: rellena hasta una longitud especificada por el argumento `max_length` o la longitud máxima aceptada + por el modelo si no se proporciona `max_length` (`max_length=None`). El relleno se aplicará incluso si solo proporcionas una única secuencia. + - `False` o `'do_not_pad'`: no se aplica relleno. Este es el comportamiento predeterminado. + +El argumento `truncation` controla el truncamiento. Puede ser un booleano o una cadena: + + - `True` o `'longest_first'`: trunca hasta una longitud máxima especificada por el argumento `max_length` o + la longitud máxima aceptada por el modelo si no se proporciona `max_length` (`max_length=None`). Esto + truncará token por token, eliminando un token de la secuencia más larga en el par hasta alcanzar la longitud adecuada. + - `'only_second'`: trunca hasta una longitud máxima especificada por el argumento `max_length` o la longitud máxima + aceptada por el modelo si no se proporciona `max_length` (`max_length=None`). Esto solo truncará + la segunda oración de un par si se proporciona un par de secuencias (o un lote de pares de secuencias). + - `'only_first'`: trunca hasta una longitud máxima especificada por el argumento `max_length` o la longitud máxima + aceptada por el modelo si no se proporciona `max_length` (`max_length=None`). 
Esto solo truncará + la primera oración de un par si se proporciona un par de secuencias (o un lote de pares de secuencias). + - `False` o `'do_not_truncate'`: no se aplica truncamiento. Este es el comportamiento predeterminado. + +El argumento `max_length` controla la longitud del relleno y del truncamiento. Puede ser un número entero o `None`, en cuyo caso se establecerá automáticamente en la longitud máxima que el modelo puede aceptar. Si el modelo no tiene una longitud máxima de entrada específica, se desactiva el truncamiento o el relleno hasta `max_length`. + +La siguiente tabla resume la forma recomendada de configurar el relleno y el truncamiento. Si usas pares de secuencias de entrada en alguno de los siguientes ejemplos, puedes reemplazar `truncation=True` por una `ESTRATEGIA` seleccionada en +`['only_first', 'only_second', 'longest_first']`, es decir, `truncation='only_second'` o `truncation='longest_first'` para controlar cómo se truncan ambas secuencias en el par, como se detalló anteriormente. + +| Truncation | Padding | Instrucción | +|-----------------------------------------|--------------------------------------|---------------------------------------------------------------------------------------------| +| sin truncamiento | sin relleno | `tokenizer(batch_sentences)` | +| | relleno hasta la longitud máxima del lote | `tokenizer(batch_sentences, padding=True)` o | +| | | `tokenizer(batch_sentences, padding='longest')` | +| | relleno hasta la longitud máxima del modelo | `tokenizer(batch_sentences, padding='max_length')` | +| | relleno hasta una longitud específica | `tokenizer(batch_sentences, padding='max_length', max_length=42)` | +| | relleno hasta un múltiplo de un valor | `tokenizer(batch_sentences, padding=True, pad_to_multiple_of=8)` | +| truncamiento hasta la longitud máxima del modelo | sin relleno | `tokenizer(batch_sentences, truncation=True)` o | +| | | `tokenizer(batch_sentences, truncation=ESTRATEGIA)` | +| | relleno hasta la longitud máxima del lote | `tokenizer(batch_sentences, padding=True, truncation=True)` o | +| | | `tokenizer(batch_sentences, padding=True, truncation=ESTRATEGIA)` | +| | relleno hasta la longitud máxima del modelo | `tokenizer(batch_sentences, padding='max_length', truncation=True)` o | +| | | `tokenizer(batch_sentences, padding='max_length', truncation=ESTRATEGIA)` | +| | relleno hasta una longitud específica | No es posible | +| truncamiento hasta una longitud específica | sin relleno | `tokenizer(batch_sentences, truncation=True, max_length=42)` o | +| | | `tokenizer(batch_sentences, truncation=ESTRATEGIA, max_length=42)` | +| | relleno hasta la longitud máxima del lote | `tokenizer(batch_sentences, padding=True, truncation=True, max_length=42)` o | +| | | `tokenizer(batch_sentences, padding=True, truncation=ESTRATEGIA, max_length=42)` | +| | relleno hasta la longitud máxima del modelo | No es posible | +| | relleno hasta una longitud específica | `tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)` o | +| | | `tokenizer(batch_sentences, padding='max_length', truncation=ESTRATEGIA, max_length=42)` | diff --git a/docs/source/es/performance.md b/docs/source/es/performance.md new file mode 100644 index 00000000000000..4665c2961b3f23 --- /dev/null +++ b/docs/source/es/performance.md @@ -0,0 +1,61 @@ + + +# Rendimiento y Escalabilidad + +Entrenar modelos grandes de transformadores y desplegarlos en producción presenta varios desafíos. 
Durante el entrenamiento, el modelo puede requerir más memoria de GPU de la disponible o mostrar una velocidad de entrenamiento lenta. En la fase de implementación, el modelo puede tener dificultades para manejar el rendimiento necesario en un entorno de producción. + +Esta documentación tiene como objetivo ayudarte a superar estos desafíos y encontrar la configuración óptima para tu caso de uso. Las guías están divididas en secciones de entrenamiento e inferencia, ya que cada una presenta diferentes desafíos y soluciones. Dentro de cada sección, encontrarás guías separadas para diferentes configuraciones de hardware, como GPU única vs. multi-GPU para el entrenamiento o CPU vs. GPU para la inferencia. + +Utiliza este documento como punto de partida para navegar hacia los métodos que se ajusten a tu escenario. + +## Entrenamiento + +Entrenar modelos grandes de transformadores de manera eficiente requiere un acelerador como una GPU o TPU. El caso más común es cuando tienes una GPU única. Los métodos que puedes aplicar para mejorar la eficiencia de entrenamiento en una GPU única también se aplican a otras configuraciones, como múltiples GPU. Sin embargo, también existen técnicas específicas para entrenamiento con múltiples GPU o CPU, las cuales cubrimos en secciones separadas. + +* [Métodos y herramientas para un entrenamiento eficiente en una sola GPU](https://huggingface.co/docs/transformers/perf_train_gpu_one): comienza aquí para aprender enfoques comunes que pueden ayudar a optimizar la utilización de memoria de la GPU, acelerar el entrenamiento o ambas cosas. +* [Sección de entrenamiento con varias GPU](https://huggingface.co/docs/transformers/perf_train_gpu_many): explora esta sección para conocer métodos de optimización adicionales que se aplican a configuraciones con varias GPU, como paralelismo de datos, tensores y canalizaciones. +* [Sección de entrenamiento en CPU](https://huggingface.co/docs/transformers/perf_train_cpu): aprende sobre entrenamiento de precisión mixta en CPU. +* [Entrenamiento eficiente en múltiples CPUs](https://huggingface.co/docs/transformers/perf_train_cpu_many): aprende sobre el entrenamiento distribuido en CPU. +* [Entrenamiento en TPU con TensorFlow](https://huggingface.co/docs/transformers/perf_train_tpu_tf): si eres nuevo en TPUs, consulta esta sección para obtener una introducción basada en opiniones sobre el entrenamiento en TPUs y el uso de XLA. +* [Hardware personalizado para el entrenamiento](https://huggingface.co/docs/transformers/perf_hardware): encuentra consejos y trucos al construir tu propia plataforma de aprendizaje profundo. +* [Búsqueda de hiperparámetros utilizando la API del Entrenador](https://huggingface.co/docs/transformers/hpo_train) + +## Inferencia + +Realizar inferencias eficientes con modelos grandes en un entorno de producción puede ser tan desafiante como entrenarlos. En las siguientes secciones, describimos los pasos para ejecutar inferencias en CPU y configuraciones con GPU única/múltiple. 
+
+* [Inferencia en una sola CPU](https://huggingface.co/docs/transformers/perf_infer_cpu)
+* [Inferencia en una sola GPU](https://huggingface.co/docs/transformers/perf_infer_gpu_one)
+* [Inferencia con múltiples GPU](https://huggingface.co/docs/transformers/perf_infer_gpu_one)
+* [Integración de XLA para modelos de TensorFlow](https://huggingface.co/docs/transformers/tf_xla)
+
+## Entrenamiento e Inferencia
+
+Aquí encontrarás técnicas, consejos y trucos que aplican tanto si estás entrenando un modelo como si estás ejecutando inferencias con él.
+
+* [Instanciar un modelo grande](https://huggingface.co/docs/transformers/big_models)
+* [Solución de problemas de rendimiento](https://huggingface.co/docs/transformers/debugging)
+
+## Contribuir
+
+Este documento está lejos de estar completo y aún se deben agregar muchas cosas, así que si tienes adiciones o correcciones que hacer, no dudes en abrir un PR. Si no estás seguro, inicia un Issue y podemos discutir los detalles allí.
+
+Cuando hagas contribuciones que indiquen que A es mejor que B, intenta incluir un benchmark reproducible y/o un enlace a la fuente de esa información (a menos que provenga directamente de ti).
diff --git a/docs/source/es/perplexity.md b/docs/source/es/perplexity.md
new file mode 100644
index 00000000000000..f07dc663f5524e
--- /dev/null
+++ b/docs/source/es/perplexity.md
@@ -0,0 +1,116 @@
+
+
+# Perplejidad de los modelos de longitud fija
+
+[[open-in-colab]]
+
+La perplejidad, perplexity en inglés (PPL), es una de las métricas más comunes para evaluar modelos de lenguaje. Antes de sumergirnos, debemos tener en cuenta que esta métrica se aplica específicamente a modelos de lenguaje clásicos (a veces llamados modelos autorregresivos o causales) y no está bien definida para modelos de lenguaje enmascarados como BERT (ver [resumen del modelo](model_summary)).
+
+La perplejidad se define como la media negativa exponenciada del log-likelihood de una secuencia. Si tenemos una secuencia tokenizada \\(X = (x_0, x_1, \dots, x_t)\\), entonces la perplejidad de \\(X\\) es,
+
+$$\text{PPL}(X) = \exp \left\{ {-\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{<i}) } \right\}$$
+
+Sin embargo, al trabajar con modelos aproximados, generalmente tenemos una restricción en la cantidad de tokens que el modelo puede procesar. La versión más grande de [GPT-2](model_doc/gpt2), por ejemplo, tiene una longitud fija de 1024 tokens, por lo que no podemos calcular \\(p_\theta(x_t|x_{<t})\\) directamente cuando \\(t\\) es mayor que 1024.
+
+Esto es rápido de calcular, ya que la perplejidad de cada segmento se puede calcular en un solo pase hacia adelante, pero sirve como una aproximación pobre de la perplejidad completamente factorizada y generalmente dará como resultado una PPL más alta (peor) porque el modelo tendrá menos contexto en la mayoría de los pasos de predicción.
+
+En cambio, la PPL de modelos de longitud fija debería evaluarse con una estrategia de ventana deslizante. Esto implica deslizar repetidamente la ventana de contexto para que el modelo tenga más contexto al hacer cada predicción.
+
+Sliding window PPL taking advantage of all available context
+
+Esta es una aproximación más cercana a la verdadera descomposición de la probabilidad de la secuencia y generalmente dará como resultado una puntuación más favorable. La desventaja es que requiere un pase hacia adelante separado para cada token en el corpus. Un buen compromiso práctico es emplear una ventana deslizante estratificada, moviendo el contexto con pasos más grandes en lugar de deslizarse de 1 token a la vez.
Esto permite que la computación avance mucho más rápido, mientras le da al modelo un contexto amplio para hacer
+predicciones en cada paso.
+
+## Ejemplo: Cálculo de la perplejidad con GPT-2 en 🤗 Transformers
+
+Demostremos este proceso con GPT-2.
+
+```python
+from transformers import GPT2LMHeadModel, GPT2TokenizerFast
+
+device = "cuda"
+model_id = "openai-community/gpt2-large"
+model = GPT2LMHeadModel.from_pretrained(model_id).to(device)
+tokenizer = GPT2TokenizerFast.from_pretrained(model_id)
+```
+
+Carguemos el conjunto de datos WikiText-2 y evaluemos la perplejidad utilizando algunas estrategias de ventana deslizante diferentes. Dado que este conjunto de datos es pequeño y solo estamos realizando un pase hacia adelante sobre el conjunto, podemos cargar y codificar todo el conjunto de datos en la memoria.
+
+```python
+from datasets import load_dataset
+
+test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
+encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")
+```
+
+Con 🤗 Transformers, simplemente podemos pasar los `input_ids` como las `labels` a nuestro modelo, y la media negativa del log-likelihood para cada token se devuelve como la pérdida. Sin embargo, con nuestro enfoque de ventana deslizante, hay superposición en los tokens que pasamos al modelo en cada iteración. No queremos que el log-likelihood de los tokens que estamos tratando solo como contexto se incluya en nuestra pérdida, por lo que podemos establecer estos objetivos en `-100` para que se ignoren. El siguiente es un ejemplo de cómo podríamos hacer esto con un paso de `512`. Esto significa que el modelo tendrá al menos `512` tokens como contexto al calcular el log-likelihood condicional de cualquier token (siempre que haya `512` tokens precedentes disponibles para condicionar).
+
+```python
+import torch
+from tqdm import tqdm
+
+max_length = model.config.n_positions
+stride = 512
+seq_len = encodings.input_ids.size(1)
+
+nlls = []
+prev_end_loc = 0
+for begin_loc in tqdm(range(0, seq_len, stride)):
+    end_loc = min(begin_loc + max_length, seq_len)
+    trg_len = end_loc - prev_end_loc  # puede ser diferente del paso en el último bucle
+    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
+    target_ids = input_ids.clone()
+    target_ids[:, :-trg_len] = -100
+
+    with torch.no_grad():
+        outputs = model(input_ids, labels=target_ids)
+
+        # la pérdida se calcula utilizando CrossEntropyLoss, que promedia las etiquetas válidas
+        # N.B. el modelo solo calcula la pérdida sobre trg_len - 1 etiquetas, porque desplaza las etiquetas internamente
+        # a la izquierda por 1.
+        neg_log_likelihood = outputs.loss
+
+    nlls.append(neg_log_likelihood)
+
+    prev_end_loc = end_loc
+    if end_loc == seq_len:
+        break
+
+ppl = torch.exp(torch.stack(nlls).mean())
+```
+
+Ejecutar esto con la longitud de paso igual a la longitud máxima de entrada es equivalente a la estrategia subóptima,
+sin ventana deslizante, que discutimos anteriormente. Cuanto menor sea el paso, más contexto tendrá el modelo para
+realizar cada predicción y, por lo general, mejor será la perplejidad informada.
+
+Cuando ejecutamos lo anterior con `stride = 1024`, es decir, sin superposición, la PPL resultante es `19.44`, que es
+aproximadamente la misma que la `19.93` informada en el artículo de GPT-2. Al utilizar `stride = 512` y, por lo tanto,
+emplear nuestra estrategia de ventana deslizante, esto disminuye a `16.45`.
Esto no solo es una puntuación más favorable, sino que se calcula de una manera más cercana a la verdadera descomposición autorregresiva de la probabilidad de una secuencia. diff --git a/docs/source/es/philosophy.mdx b/docs/source/es/philosophy.md similarity index 97% rename from docs/source/es/philosophy.mdx rename to docs/source/es/philosophy.md index 65e9a2c67a4293..4054ac0ae50716 100644 --- a/docs/source/es/philosophy.mdx +++ b/docs/source/es/philosophy.md @@ -7,6 +7,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Al menos que sea requrido por la ley aplicable o acordado por escrito, el software distribuido bajo la Licencia es distribuido sobre una BASE "AS IS", SIN GARANTIAS O CONDICIONES DE NINGÚN TIPO. Ver la Licencia para el idioma específico que rige los permisos y limitaciones bajo la Licencia. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Filosofía diff --git a/docs/source/es/pipeline_tutorial.mdx b/docs/source/es/pipeline_tutorial.md similarity index 95% rename from docs/source/es/pipeline_tutorial.mdx rename to docs/source/es/pipeline_tutorial.md index af202758eb134f..279f3593ba95c5 100644 --- a/docs/source/es/pipeline_tutorial.mdx +++ b/docs/source/es/pipeline_tutorial.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Pipelines para inferencia @@ -70,8 +74,8 @@ El [`pipeline`] acepta cualquier modelo del [Model Hub](https://huggingface.co/m ```py >>> from transformers import AutoTokenizer, AutoModelForCausalLM ->>> tokenizer = AutoTokenizer.from_pretrained("distilgpt2") ->>> model = AutoModelForCausalLM.from_pretrained("distilgpt2") +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2") +>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2") ``` Crea un [`pipeline`] para tu tarea y específica el modelo y el tokenizador que cargaste: diff --git a/docs/source/es/pr_checks.mdx b/docs/source/es/pr_checks.md similarity index 97% rename from docs/source/es/pr_checks.mdx rename to docs/source/es/pr_checks.md index 283f025a81fa23..ba67e85306d3a9 100644 --- a/docs/source/es/pr_checks.mdx +++ b/docs/source/es/pr_checks.md @@ -12,6 +12,10 @@ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. 
+ --> # Verificaciones en un Pull Request diff --git a/docs/source/es/preprocessing.mdx b/docs/source/es/preprocessing.md similarity index 98% rename from docs/source/es/preprocessing.mdx rename to docs/source/es/preprocessing.md index 869f90c4177357..8486d6a0687abc 100644 --- a/docs/source/es/preprocessing.mdx +++ b/docs/source/es/preprocessing.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Preprocesamiento @@ -41,7 +45,7 @@ Carga un tokenizador pre-entrenado con [`AutoTokenizer.from_pretrained`]: ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") ``` A continuación, pasa tu frase al tokenizador: @@ -191,7 +195,7 @@ Las entradas de audio se preprocesan de forma diferente a las entradas textuales pip install datasets ``` -Carga la tarea de detección de palabras clave del benchmark [SUPERB](https://huggingface.co/datasets/superb) (consulta el [tutorial 🤗 Dataset](https://huggingface.co/docs/datasets/load_hub.html) para que obtengas más detalles sobre cómo cargar un dataset): +Carga la tarea de detección de palabras clave del benchmark [SUPERB](https://huggingface.co/datasets/superb) (consulta el [tutorial 🤗 Dataset](https://huggingface.co/docs/datasets/load_hub) para que obtengas más detalles sobre cómo cargar un dataset): ```py >>> from datasets import load_dataset, Audio @@ -230,7 +234,7 @@ Por ejemplo, carga el dataset [LJ Speech](https://huggingface.co/datasets/lj_spe 'sampling_rate': 22050} ``` -1. Usa el método 🤗 Datasets' [`cast_column`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.cast_column) para reducir la tasa de muestreo a 16kHz: +1. Usa el método 🤗 Datasets' [`cast_column`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.cast_column) para reducir la tasa de muestreo a 16kHz: ```py >>> lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000)) @@ -325,7 +329,7 @@ Vamos a cargar el dataset [food101](https://huggingface.co/datasets/food101) par >>> dataset = load_dataset("food101", split="train[:100]") ``` -A continuación, observa la imagen con la función 🤗 Datasets [`Image`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=image#datasets.Image): +A continuación, observa la imagen con la función 🤗 Datasets [`Image`](https://huggingface.co/docs/datasets/package_reference/main_classes?highlight=image#datasets.Image): ```py >>> dataset[0]["image"] @@ -366,7 +370,7 @@ Para las tareas de visión por computadora es común añadir algún tipo de aume ... return examples ``` -3. A continuación, utiliza 🤗 Datasets [`set_transform`](https://huggingface.co/docs/datasets/process.html#format-transform) para aplicar las transformaciones sobre la marcha: +3. 
A continuación, utiliza 🤗 Datasets [`set_transform`](https://huggingface.co/docs/datasets/process#format-transform) para aplicar las transformaciones sobre la marcha: ```py >>> dataset.set_transform(transforms) @@ -457,7 +461,7 @@ Recuerda la sección anterior sobre el procesamiento de datos de audio, siempre ### Processor -Un processor combina un extractor de características y un tokenizador. Cargue un procesador con [`AutoProcessor.from_pretrained]: +Un processor combina un extractor de características y un tokenizador. Cargue un procesador con [`AutoProcessor.from_pretrained`]: ```py >>> from transformers import AutoProcessor diff --git a/docs/source/es/quicktour.mdx b/docs/source/es/quicktour.md similarity index 98% rename from docs/source/es/quicktour.mdx rename to docs/source/es/quicktour.md index 408c3fa375a074..ad2549ef450bb2 100644 --- a/docs/source/es/quicktour.mdx +++ b/docs/source/es/quicktour.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Tour rápido @@ -64,11 +68,13 @@ Instala las siguientes dependencias si aún no lo has hecho: + ```bash pip install torch ``` + ```bash pip install tensorflow ``` @@ -87,7 +93,7 @@ El pipeline descarga y almacena en caché el [modelo preentrenado](https://huggi ```py >>> clasificador("Estamos muy felices de mostrarte la biblioteca de 🤗 Transformers.") -[{'label': 'POS', 'score': 0.9916}] +[{'label': 'POS', 'score': 0.9320}] ``` Para más de un enunciado, entrega una lista al [`pipeline`] que devolverá una lista de diccionarios: @@ -129,7 +135,7 @@ Extraigamos las matrices de onda cruda (raw waveform, en inglés) de las primera ```py >>> resultado = reconocedor_de_voz(dataset[:4]["audio"]) >>> print([d["text"] for d in resultado]) -['ahora buenas eh a ver tengo un problema con vuestra aplicación resulta que que quiero hacer una transferencia bancaria a una cuenta conocida pero me da error la aplicación a ver que a ver que puede ser', 'la aplicación no cargue saldo de mi nueva cuenta', 'hola tengo un problema con la aplicación no carga y y tampoco veo que carga el saldo de mi cuenta nueva dice que la aplicación está siendo reparada y ahora no puedo acceder a mi cuenta no necesito inmediatamente', 'hora buena la aplicación no se carga la vileza no carga el saldo de mi cuenta nueva dice que la villadenta siendo reparada y oro no puedo hacer a mi cuenta'] +['ahora buenas eh a ver tengo un problema con vuestra aplicación resulta que que quiero hacer una transferencia bancaria a una cuenta conocida pero me da error la aplicación a ver que a ver que puede ser', 'la aplicación no cargue saldo de mi nueva cuenta', 'hola tengo un problema con la aplicación no carga y y tampoco veo que carga el saldo de mi cuenta nueva dice que la aplicación está siendo reparada y ahora no puedo acceder a mi cuenta no necesito inmediatamente', 'hora buena la aplicación no se carga la vida no carga el saldo de mi cuenta nueva dice que la villadenta siendo reparada y oro no puedo hacer a mi cuenta'] ``` Para un dataset más grande, donde los inputs son de mayor tamaño (como en habla/audio o 
visión), querrás pasar un generador en lugar de una lista que carga todos los inputs en memoria. Ve la [documentación del pipeline](./main_classes/pipelines) para más información. @@ -220,6 +226,7 @@ Como con el [`pipeline`], el tokenizador aceptará una lista de inputs. Además, + ```py >>> pt_batch = tokenizer( ... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."], @@ -231,6 +238,7 @@ Como con el [`pipeline`], el tokenizador aceptará una lista de inputs. Además, ``` + ```py >>> tf_batch = tokenizer( ... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."], @@ -373,6 +381,7 @@ Una característica particularmente interesante de 🤗 Transformers es la habil + ```py >>> from transformers import AutoModel @@ -381,6 +390,7 @@ Una característica particularmente interesante de 🤗 Transformers es la habil ``` + ```py >>> from transformers import TFAutoModel diff --git a/docs/source/es/run_scripts.mdx b/docs/source/es/run_scripts.md similarity index 93% rename from docs/source/es/run_scripts.mdx rename to docs/source/es/run_scripts.md index d0ab716f80ff55..ff1afa340c9a1d 100644 --- a/docs/source/es/run_scripts.mdx +++ b/docs/source/es/run_scripts.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Entrenamiento con scripts @@ -83,11 +87,11 @@ pip install -r requirements.txt -El script de ejemplo descarga y preprocesa un conjunto de datos de la biblioteca 🤗 [Datasets](https://huggingface.co/docs/datasets/). Luego, el script ajusta un conjunto de datos con [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) en una arquitectura que soporta la tarea de resumen. El siguiente ejemplo muestra cómo ajustar un [T5-small](https://huggingface.co/t5-small) en el conjunto de datos [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail). El modelo T5 requiere un argumento adicional `source_prefix` debido a cómo fue entrenado. Este aviso le permite a T5 saber que se trata de una tarea de resumir. +El script de ejemplo descarga y preprocesa un conjunto de datos de la biblioteca 🤗 [Datasets](https://huggingface.co/docs/datasets/). Luego, el script ajusta un conjunto de datos con [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) en una arquitectura que soporta la tarea de resumen. El siguiente ejemplo muestra cómo ajustar un [T5-small](https://huggingface.co/google-t5/t5-small) en el conjunto de datos [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail). El modelo T5 requiere un argumento adicional `source_prefix` debido a cómo fue entrenado. Este aviso le permite a T5 saber que se trata de una tarea de resumir. 
```bash python examples/pytorch/summarization/run_summarization.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ @@ -101,11 +105,11 @@ python examples/pytorch/summarization/run_summarization.py \ ``` -El script de ejemplo descarga y preprocesa un conjunto de datos de la biblioteca 🤗 [Datasets](https://huggingface.co/docs/datasets/). Luego, el script ajusta un conjunto de datos utilizando Keras en una arquitectura que soporta la tarea de resumir. El siguiente ejemplo muestra cómo ajustar un [T5-small](https://huggingface.co/t5-small) en el conjunto de datos [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail). El modelo T5 requiere un argumento adicional `source_prefix` debido a cómo fue entrenado. Este aviso le permite a T5 saber que se trata de una tarea de resumir. +El script de ejemplo descarga y preprocesa un conjunto de datos de la biblioteca 🤗 [Datasets](https://huggingface.co/docs/datasets/). Luego, el script ajusta un conjunto de datos utilizando Keras en una arquitectura que soporta la tarea de resumir. El siguiente ejemplo muestra cómo ajustar un [T5-small](https://huggingface.co/google-t5/t5-small) en el conjunto de datos [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail). El modelo T5 requiere un argumento adicional `source_prefix` debido a cómo fue entrenado. Este aviso le permite a T5 saber que se trata de una tarea de resumir. ```bash python examples/tensorflow/summarization/run_summarization.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --dataset_name cnn_dailymail \ --dataset_config "3.0.0" \ --output_dir /tmp/tst-summarization \ @@ -126,10 +130,10 @@ python examples/tensorflow/summarization/run_summarization.py \ - Establece la cantidad de GPU que se usará con el argumento `nproc_per_node`. 
```bash -python -m torch.distributed.launch \ +torchrun \ --nproc_per_node 8 pytorch/summarization/run_summarization.py \ --fp16 \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ @@ -153,7 +157,7 @@ Las Unidades de Procesamiento de Tensor (TPUs) están diseñadas específicament ```bash python xla_spawn.py --num_cores 8 \ summarization/run_summarization.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ @@ -172,7 +176,7 @@ Las Unidades de Procesamiento de Tensor (TPUs) están diseñadas específicament ```bash python run_summarization.py \ --tpu name_of_tpu_resource \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --dataset_name cnn_dailymail \ --dataset_config "3.0.0" \ --output_dir /tmp/tst-summarization \ @@ -210,7 +214,7 @@ Todo listo para iniciar el entrenamiento: ```bash accelerate launch run_summarization_no_trainer.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --dataset_name cnn_dailymail \ --dataset_config "3.0.0" \ --source_prefix "summarize: " \ @@ -229,7 +233,7 @@ Un script para resumir que utiliza un conjunto de datos personalizado se vera as ```bash python examples/pytorch/summarization/run_summarization.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --train_file path_to_csv_or_jsonlines_file \ @@ -254,7 +258,7 @@ A veces, es una buena idea ejecutar tu secuencia de comandos en una cantidad men ```bash python examples/pytorch/summarization/run_summarization.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --max_train_samples 50 \ --max_eval_samples 50 \ --max_predict_samples 50 \ @@ -284,7 +288,7 @@ El primer método utiliza el argumento `output_dir previous_output_dir` para rea ```bash python examples/pytorch/summarization/run_summarization.py - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ @@ -301,7 +305,7 @@ El segundo método utiliza el argumento `resume_from_checkpoint path_to_specific ```bash python examples/pytorch/summarization/run_summarization.py - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ @@ -331,7 +335,7 @@ El siguiente ejemplo muestra cómo cargar un modelo con un nombre de repositorio ```bash python examples/pytorch/summarization/run_summarization.py - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ diff --git a/docs/source/es/sagemaker.mdx b/docs/source/es/sagemaker.md similarity index 86% rename from docs/source/es/sagemaker.mdx rename to docs/source/es/sagemaker.md index 491d93e10d4d14..9bc5b741084198 100644 --- a/docs/source/es/sagemaker.mdx +++ b/docs/source/es/sagemaker.md @@ -12,6 +12,10 @@ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. 
+ --> # Ejecutar el entrenamiento en Amazon SageMaker @@ -22,4 +26,3 @@ La documentación ha sido trasladada a [hf.co/docs/sagemaker](https://huggingfac - [Entrenar modelos de Hugging Face en Amazon SageMaker con SageMaker Python SDK](https://huggingface.co/docs/sagemaker/train) - [Desplegar modelos de Hugging Face en Amazon SageMaker con SageMaker Python SDK](https://huggingface.co/docs/sagemaker/inference) -- [Preguntas Frecuentes](https://huggingface.co/docs/sagemaker/faq) diff --git a/docs/source/es/serialization.mdx b/docs/source/es/serialization.md similarity index 95% rename from docs/source/es/serialization.mdx rename to docs/source/es/serialization.md index 4c42fd5d830ec4..3ad7d089853053 100644 --- a/docs/source/es/serialization.mdx +++ b/docs/source/es/serialization.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Exportar modelos 🤗 Transformers @@ -59,6 +63,7 @@ Las configuraciones a la medida incluyen las siguientes arquitecturas: - CodeGen - ConvBERT - ConvNeXT +- ConvNeXTV2 - Data2VecText - Data2VecVision - DeBERTa @@ -132,7 +137,7 @@ optional arguments: Exportar un checkpoint usando una configuración a la medida se puede hacer de la siguiente manera: ```bash -python -m transformers.onnx --model=distilbert-base-uncased onnx/ +python -m transformers.onnx --model=distilbert/distilbert-base-uncased onnx/ ``` que debería mostrar los siguientes registros: @@ -147,7 +152,7 @@ All good, model saved at: onnx/model.onnx ``` Esto exporta un grafo ONNX del checkpoint definido por el argumento `--model`. -En este ejemplo, es un modelo `distilbert-base-uncased`, pero puede ser cualquier +En este ejemplo, es un modelo `distilbert/distilbert-base-uncased`, pero puede ser cualquier checkpoint en Hugging Face Hub o que esté almacenado localmente. 
El archivo `model.onnx` resultante se puede ejecutar en uno de los @@ -159,7 +164,7 @@ modelo con [ONNX Runtime](https://onnxruntime.ai/) de la siguiente manera: >>> from transformers import AutoTokenizer >>> from onnxruntime import InferenceSession ->>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") >>> session = InferenceSession("onnx/model.onnx") >>> # ONNX Runtime expects NumPy arrays as input >>> inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np") @@ -196,8 +201,8 @@ y guardar un checkpoint de la siguiente manera: >>> from transformers import AutoTokenizer, AutoModelForSequenceClassification >>> # Load tokenizer and PyTorch weights form the Hub ->>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") ->>> pt_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") +>>> pt_model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") >>> # Save to disk >>> tokenizer.save_pretrained("local-pt-checkpoint") >>> pt_model.save_pretrained("local-pt-checkpoint") @@ -215,8 +220,8 @@ python -m transformers.onnx --model=local-pt-checkpoint onnx/ >>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification >>> # Load tokenizer and TensorFlow weights from the Hub ->>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") ->>> tf_model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") +>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") >>> # Save to disk >>> tokenizer.save_pretrained("local-tf-checkpoint") >>> tf_model.save_pretrained("local-tf-checkpoint") @@ -262,7 +267,7 @@ Le puedes pasar una de estas características al argumento `--feature` en el paq Por ejemplo, para exportar un modelo de clasificación de texto, podemos elegir un modelo ya ajustado del Hub y ejecutar: ```bash -python -m transformers.onnx --model=distilbert-base-uncased-finetuned-sst-2-english \ +python -m transformers.onnx --model=distilbert/distilbert-base-uncased-finetuned-sst-2-english \ --feature=sequence-classification onnx/ ``` @@ -278,7 +283,7 @@ All good, model saved at: onnx/model.onnx ``` Ten en cuenta que, en este caso, los nombres de salida del modelo ajustado son `logits` en lugar de `last_hidden_state` -que vimos anteriormente con el checkpoint `distilbert-base-uncased`. Esto es de esperarse ya que el modelo ajustado +que vimos anteriormente con el checkpoint `distilbert/distilbert-base-uncased`. Esto es de esperarse ya que el modelo ajustado tiene un cabezal de clasificación secuencial. 
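A modo de ilustración (este fragmento no aparece en el documento original y asume que el grafo exportado con el comando anterior se guardó en `onnx/model.onnx`), así podría consumirse ese checkpoint de clasificación con ONNX Runtime, convirtiendo los `logits` en probabilidades:

```python
>>> import numpy as np
>>> from onnxruntime import InferenceSession
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased-finetuned-sst-2-english")
>>> session = InferenceSession("onnx/model.onnx")

>>> # ONNX Runtime espera arrays de NumPy como entrada
>>> inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np")
>>> (logits,) = session.run(output_names=["logits"], input_feed=dict(inputs))

>>> # Softmax numéricamente estable para obtener probabilidades por clase
>>> exp = np.exp(logits - logits.max(-1, keepdims=True))
>>> probs = exp / exp.sum(-1, keepdims=True)
```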
@@ -357,7 +362,7 @@ instancia proporcionando la configuración del modelo base de la siguiente maner ```python >>> from transformers import AutoConfig ->>> config = AutoConfig.from_pretrained("distilbert-base-uncased") +>>> config = AutoConfig.from_pretrained("distilbert/distilbert-base-uncased") >>> onnx_config = DistilBertOnnxConfig(config) ``` @@ -388,7 +393,7 @@ exportar DistilBERT con un cabezal de clasificación de secuencias, podríamos u ```python >>> from transformers import AutoConfig ->>> config = AutoConfig.from_pretrained("distilbert-base-uncased") +>>> config = AutoConfig.from_pretrained("distilbert/distilbert-base-uncased") >>> onnx_config_for_seq_clf = DistilBertOnnxConfig(config, task="sequence-classification") >>> print(onnx_config_for_seq_clf.outputs) OrderedDict([('logits', {0: 'batch'})]) @@ -415,7 +420,7 @@ y la ruta para guardar el archivo exportado: >>> from transformers import AutoTokenizer, AutoModel >>> onnx_path = Path("model.onnx") ->>> model_ckpt = "distilbert-base-uncased" +>>> model_ckpt = "distilbert/distilbert-base-uncased" >>> base_model = AutoModel.from_pretrained(model_ckpt) >>> tokenizer = AutoTokenizer.from_pretrained(model_ckpt) @@ -545,7 +550,7 @@ con la clase `BertConfig` y luego se guarda en el disco con el nombre de archivo from transformers import BertModel, BertTokenizer, BertConfig import torch -enc = BertTokenizer.from_pretrained("bert-base-uncased") +enc = BertTokenizer.from_pretrained("google-bert/bert-base-uncased") # Tokenizing input text text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]" @@ -580,7 +585,7 @@ model = BertModel(config) model.eval() # If you are instantiating the model with *from_pretrained* you can also easily set the TorchScript flag -model = BertModel.from_pretrained("bert-base-uncased", torchscript=True) +model = BertModel.from_pretrained("google-bert/bert-base-uncased", torchscript=True) # Creating the trace traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors]) diff --git a/docs/source/es/task_summary.md b/docs/source/es/task_summary.md new file mode 100644 index 00000000000000..4aa6852ed35606 --- /dev/null +++ b/docs/source/es/task_summary.md @@ -0,0 +1,347 @@ + + +# Lo que 🤗 Transformers puede hacer + +🤗 Transformers es una biblioteca de modelos preentrenados de última generación para procesamiento del lenguaje natural (NLP, por sus siglas en inglés), visión por computadora y tareas de procesamiento de audio y voz. No solo contiene modelos Transformer, sino también modelos no Transformer como redes convolucionales modernas para tareas de visión por computadora. Si observas algunos de los productos de consumo más populares hoy en día, como teléfonos inteligentes, aplicaciones y televisores, es probable que haya alguna tecnología de aprendizaje profundo detrás. ¿Quieres quitar un objeto de fondo de una foto tomada por tu teléfono inteligente? Este es un ejemplo de una tarea de segmentación panóptica (no te preocupes si aún no sabes qué significa, ¡lo describiremos en las siguientes secciones!). + +Esta página proporciona una descripción general de las diferentes tareas de procesamiento de audio y voz, visión por computadora y NLP que se pueden resolver con la biblioteca 🤗 Transformers en solo tres líneas de código. + +## Audio + +Las tareas de procesamiento de audio y voz son un poco diferentes de las otras modalidades principalmente porque el audio como entrada es una señal continua. 
A diferencia del texto, una forma de onda de audio cruda no se puede dividir ordenadamente en fragmentos discretos de la misma manera en que una oración puede dividirse en palabras. Para superar esto, la señal de audio cruda generalmente se muestrea a intervalos regulares. Si tomas más muestras dentro de un intervalo, la tasa de muestreo es mayor y el audio se asemeja más a la fuente de audio original. + +Enfoques anteriores preprocesaban el audio para extraer características útiles. Ahora es más común comenzar las tareas de procesamiento de audio y voz alimentando directamente la forma de onda de audio cruda a un codificador de características para extraer una representación de audio. Esto simplifica el paso de preprocesamiento y permite que el modelo aprenda las características más esenciales. + +### Clasificación de audio + +La clasificación de audio es una tarea que etiqueta datos de audio con un conjunto predefinido de clases. Es una categoría amplia con muchas aplicaciones específicas, algunas de las cuales incluyen: + +* clasificación de escena acústica: etiquetar audio con una etiqueta de escena ("oficina", "playa", "estadio") +* detección de eventos acústicos: etiquetar audio con una etiqueta de evento de sonido ("bocina de automóvil", "llamada de ballena", "cristal rompiéndose") +* etiquetado: etiquetar audio que contiene varios sonidos (canto de pájaros, identificación de altavoces en una reunión) +* clasificación de música: etiquetar música con una etiqueta de género ("metal", "hip-hop", "country") + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline(task="audio-classification", model="superb/hubert-base-superb-er") +>>> preds = classifier("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds] +>>> preds +[{'score': 0.4532, 'label': 'hap'}, + {'score': 0.3622, 'label': 'sad'}, + {'score': 0.0943, 'label': 'neu'}, + {'score': 0.0903, 'label': 'ang'}] +``` + +### Reconocimiento automático del habla + +El reconocimiento automático del habla (ASR, por sus siglas en inglés) transcribe el habla a texto. Es una de las tareas de audio más comunes, en parte debido a que el habla es una forma natural de comunicación humana. Hoy en día, los sistemas ASR están integrados en productos de tecnología "inteligente" como altavoces, teléfonos y automóviles. Podemos pedirle a nuestros asistentes virtuales que reproduzcan música, establezcan recordatorios y nos informen sobre el clima. + +Pero uno de los desafíos clave que las arquitecturas Transformer han ayudado a superar es en los idiomas con recursos limitados. Al preentrenar con grandes cantidades de datos de habla, afinar el modelo solo con una hora de datos de habla etiquetados en un idioma con recursos limitados aún puede producir resultados de alta calidad en comparación con los sistemas ASR anteriores entrenados con 100 veces más datos etiquetados. 
+ +```py +>>> from transformers import pipeline + +>>> transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-small") +>>> transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'} +``` + +## Visión por computadora + +Una de las primeras y exitosas tareas de visión por computadora fue reconocer imágenes de números de código postal utilizando una [red neuronal convolucional](glossary#convolution) (CNN, por sus siglas en inglés). Una imagen está compuesta por píxeles, y cada píxel tiene un valor numérico. Esto facilita representar una imagen como una matriz de valores de píxeles. Cada combinación particular de valores de píxeles describe los colores de una imagen. + +Dos formas generales en las que se pueden resolver las tareas de visión por computadora son: + +1. Utilizar convoluciones para aprender las características jerárquicas de una imagen, desde características de bajo nivel hasta cosas abstractas de alto nivel. +2. Dividir una imagen en parches y utilizar un Transformer para aprender gradualmente cómo cada parche de imagen se relaciona entre sí para formar una imagen. A diferencia del enfoque ascendente preferido por una CNN, esto es como comenzar con una imagen borrosa y luego enfocarla gradualmente. + +### Clasificación de imágenes + +La clasificación de imágenes etiqueta una imagen completa con un conjunto predefinido de clases. Como la mayoría de las tareas de clasificación, hay muchos casos prácticos para la clasificación de imágenes, algunos de los cuales incluyen: + +* salud: etiquetar imágenes médicas para detectar enfermedades o monitorear la salud del paciente +* medio ambiente: etiquetar imágenes de satélite para monitorear la deforestación, informar la gestión de áreas silvestres o detectar incendios forestales +* agricultura: etiquetar imágenes de cultivos para monitorear la salud de las plantas o imágenes de satélite para el monitoreo del uso del suelo +* ecología: etiquetar imágenes de especies animales o vegetales para monitorear poblaciones de vida silvestre o rastrear especies en peligro de extinción + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline(task="image-classification") +>>> preds = classifier( +... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +... ) +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds] +>>> print(*preds, sep="\n") +{'score': 0.4335, 'label': 'lynx, catamount'} +{'score': 0.0348, 'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor'} +{'score': 0.0324, 'label': 'snow leopard, ounce, Panthera uncia'} +{'score': 0.0239, 'label': 'Egyptian cat'} +{'score': 0.0229, 'label': 'tiger cat'} +``` + +### Detección de objetos + +A diferencia de la clasificación de imágenes, la detección de objetos identifica múltiples objetos dentro de una imagen y las posiciones de los objetos en la imagen (definidas por el cuadro delimitador). 
Algunas aplicaciones ejemplares de la detección de objetos incluyen: + +* vehículos autónomos: detectar objetos de tráfico cotidianos como otros vehículos, peatones y semáforos +* teledetección: monitoreo de desastres, planificación urbana y pronóstico del tiempo +* detección de defectos: detectar grietas o daños estructurales en edificios y defectos de fabricación + +```py +>>> from transformers import pipeline + +>>> detector = pipeline(task="object-detection") +>>> preds = detector( +... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +... ) +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"], "box": pred["box"]} for pred in preds] +>>> preds +[{'score': 0.9865, + 'label': 'cat', + 'box': {'xmin': 178, 'ymin': 154, 'xmax': 882, 'ymax': 598}}] +``` + +### Segmentación de imágenes + +La segmentación de imágenes es una tarea a nivel de píxeles que asigna cada píxel en una imagen a una clase. A diferencia de la detección de objetos, que utiliza cuadros delimitadores para etiquetar y predecir objetos en una imagen, la segmentación es más granular. La segmentación puede detectar objetos a nivel de píxeles. Hay varios tipos de segmentación de imágenes: + +* segmentación de instancias: además de etiquetar la clase de un objeto, también etiqueta cada instancia distinta de un objeto ("perro-1", "perro-2") +* segmentación panóptica: una combinación de segmentación semántica y de instancias; etiqueta cada píxel con una clase semántica **y** cada instancia distinta de un objeto + +Las tareas de segmentación son útiles en vehículos autónomos para crear un mapa a nivel de píxeles del mundo que los rodea para que puedan navegar de manera segura alrededor de peatones y otros vehículos. También es útil en imágenes médicas, donde la mayor granularidad de la tarea puede ayudar a identificar células anormales o características de órganos. La segmentación de imágenes también se puede utilizar en comercio electrónico para probar virtualmente la ropa o crear experiencias de realidad aumentada superponiendo objetos en el mundo real a través de tu cámara. + +```py +>>> from transformers import pipeline + +>>> segmenter = pipeline(task="image-segmentation") +>>> preds = segmenter( +... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +... ) +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds] +>>> print(*preds, sep="\n") +{'score': 0.9879, 'label': 'LABEL_184'} +{'score': 0.9973, 'label': 'snow'} +{'score': 0.9972, 'label': 'cat'} +``` + +### Estimación de profundidad + +La estimación de profundidad predice la distancia de cada píxel en una imagen desde la cámara. Esta tarea de visión por computadora es especialmente importante para la comprensión y reconstrucción de escenas. Por ejemplo, en los vehículos autónomos, es necesario entender qué tan lejos están los objetos como peatones, señales de tráfico y otros vehículos para evitar obstáculos y colisiones. La información de profundidad también es útil para construir representaciones 3D a partir de imágenes 2D y se puede utilizar para crear representaciones 3D de alta calidad de estructuras biológicas o edificios. 
+ +Hay dos enfoques para la estimación de profundidad: + +* estéreo: las profundidades se estiman comparando dos imágenes de la misma escena desde ángulos ligeramente diferentes +* monocular: las profundidades se estiman a partir de una sola imagen + +```py +>>> from transformers import pipeline + +>>> depth_estimator = pipeline(task="depth-estimation") +>>> preds = depth_estimator( +... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +... ) +``` + +## Procesamiento del lenguaje natural + +Las tareas de procesamiento del lenguaje natural (NLP, por sus siglas en inglés) están entre los tipos de tareas más comunes porque el texto es una forma natural de comunicación para nosotros. Para convertir el texto en un formato reconocido por un modelo, es necesario tokenizarlo. Esto significa dividir una secuencia de texto en palabras o subpalabras separadas (tokens) y luego convertir estos tokens en números. Como resultado, puedes representar una secuencia de texto como una secuencia de números, y una vez que tienes una secuencia de números, se puede ingresar a un modelo para resolver todo tipo de tareas de NLP. + +### Clasificación de texto + +Al igual que las tareas de clasificación en cualquier modalidad, la clasificación de texto etiqueta una secuencia de texto (puede ser a nivel de oración, párrafo o documento) de un conjunto predefinido de clases. Hay muchas aplicaciones prácticas para la clasificación de texto, algunas de las cuales incluyen: + +* análisis de sentimientos: etiquetar texto según alguna polaridad como `positivo` o `negativo`, lo que puede informar y respaldar la toma de decisiones en campos como política, finanzas y marketing +* clasificación de contenido: etiquetar texto según algún tema para ayudar a organizar y filtrar información en noticias y feeds de redes sociales (`clima`, `deportes`, `finanzas`, etc.) + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline(task="sentiment-analysis") +>>> preds = classifier("Hugging Face is the best thing since sliced bread!") +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds] +>>> preds +[{'score': 0.9991, 'label': 'POSITIVE'}] +``` + +### Clasificación de tokens + +En cualquier tarea de NLP, el texto se procesa separando la secuencia de texto en palabras o subpalabras individuales. Estas se conocen como [tokens](glossary#token). La clasificación de tokens asigna a cada token una etiqueta de un conjunto predefinido de clases. + +Dos tipos comunes de clasificación de tokens son: + +* reconocimiento de entidades nombradas (NER, por sus siglas en inglés): etiquetar un token según una categoría de entidad como organización, persona, ubicación o fecha. NER es especialmente popular en entornos biomédicos, donde puede etiquetar genes, proteínas y nombres de medicamentos +* etiquetado de partes del discurso (POS, por sus siglas en inglés): etiquetar un token según su parte del discurso, como sustantivo, verbo o adjetivo. POS es útil para ayudar a los sistemas de traducción a comprender cómo dos palabras idénticas son gramaticalmente diferentes (por ejemplo, "corte" como sustantivo versus "corte" como verbo) + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline(task="ner") +>>> preds = classifier("Hugging Face is a French company based in New York City.") +>>> preds = [ +... { +... "entity": pred["entity"], +... "score": round(pred["score"], 4), +... "index": pred["index"], +... "word": pred["word"], +... 
"start": pred["start"], +... "end": pred["end"], +... } +... for pred in preds +... ] +>>> print(*preds, sep="\n") +{'entity': 'I-ORG', 'score': 0.9968, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2} +{'entity': 'I-ORG', 'score': 0.9293, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7} +{'entity': 'I-ORG', 'score': 0.9763, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12} +{'entity': 'I-MISC', 'score': 0.9983, 'index': 6, 'word': 'French', 'start': 18, 'end': 24} +{'entity': 'I-LOC', 'score': 0.999, 'index': 10, 'word': 'New', 'start': 42, 'end': 45} +{'entity': 'I-LOC', 'score': 0.9987, 'index': 11, 'word': 'York', 'start': 46, 'end': 50} +{'entity': 'I-LOC', 'score': 0.9992, 'index': 12, 'word': 'City', 'start': 51, 'end': 55} +``` + +### Respuestas a preguntas + +Responder preguntas es otra tarea a nivel de tokens que devuelve una respuesta a una pregunta, a veces con contexto (dominio abierto) y otras veces sin contexto (dominio cerrado). Esta tarea ocurre cuando le preguntamos algo a un asistente virtual, como si un restaurante está abierto. También puede proporcionar soporte al cliente o técnico y ayudar a los motores de búsqueda a recuperar la información relevante que estás buscando. + +Hay dos tipos comunes de respuestas a preguntas: + +* extractivas: dada una pregunta y algún contexto, la respuesta es un fragmento de texto del contexto que el modelo debe extraer +* abstractivas: dada una pregunta y algún contexto, la respuesta se genera a partir del contexto; este enfoque lo maneja la [`Text2TextGenerationPipeline`] en lugar del [`QuestionAnsweringPipeline`] que se muestra a continuación + +```py +>>> from transformers import pipeline + +>>> question_answerer = pipeline(task="question-answering") +>>> preds = question_answerer( +... question="What is the name of the repository?", +... context="The name of the repository is huggingface/transformers", +... ) +>>> print( +... f"score: {round(preds['score'], 4)}, start: {preds['start']}, end: {preds['end']}, answer: {preds['answer']}" +... ) +score: 0.9327, start: 30, end: 54, answer: huggingface/transformers +``` + +### Resumir + +Al resumir se crea una versión más corta de un texto más largo mientras intenta preservar la mayor parte del significado del documento original. Resumir es una tarea de secuencia a secuencia; produce una secuencia de texto más corta que la entrada. Hay muchos documentos de formato largo que se pueden resumir para ayudar a los lectores a comprender rápidamente los puntos principales. Proyectos de ley legislativos, documentos legales y financieros, patentes y artículos científicos son algunos ejemplos de documentos que podrían resumirse para ahorrar tiempo a los lectores y servir como ayuda para la lectura. + +Al igual que en las respuestas a preguntas, hay dos tipos de resumen: + +* extractiva: identifica y extrae las oraciones más importantes del texto original +* abstractiva: genera el resumen objetivo (que puede incluir nuevas palabras no presentes en el documento de entrada) a partir del texto original; el [`SummarizationPipeline`] utiliza el enfoque abstractivo + +```py +>>> from transformers import pipeline + +>>> summarizer = pipeline(task="summarization") +>>> summarizer( +... "In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. 
For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles." +... ) +[{'summary_text': ' The Transformer is the first sequence transduction model based entirely on attention . It replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention . For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers .'}] +``` + +### Traducción + +La traducción convierte una secuencia de texto en un idioma a otro. Es importante para ayudar a personas de diferentes orígenes a comunicarse entre sí, traducir contenido para llegar a audiencias más amplias e incluso ser una herramienta de aprendizaje para ayudar a las personas a aprender un nuevo idioma. Al igual que resumir, la traducción es una tarea de secuencia a secuencia, lo que significa que el modelo recibe una secuencia de entrada y devuelve una secuencia de salida objetivo. + +En sus primeros días, los modelos de traducción eran principalmente monolingües, pero recientemente ha habido un creciente interés en modelos multilingües que pueden traducir entre muchas combinaciones de idiomas. + +```py +>>> from transformers import pipeline + +>>> text = "translate English to French: Hugging Face is a community-based open-source platform for machine learning." +>>> translator = pipeline(task="translation", model="t5-small") +>>> translator(text) +[{'translation_text': "Hugging Face est une tribune communautaire de l'apprentissage des machines."}] +``` + +### Modelado de lenguaje + +El modelado de lenguaje es una tarea que predice una palabra en una secuencia de texto. Se ha vuelto una tarea de NLP muy popular porque un modelo de lenguaje preentrenado puede ser afinado para muchas otras tareas secundarias. Últimamente, ha habido mucho interés en modelos de lenguaje grandes (LLM, por sus siglas en inglés) que demuestran aprendizaje de cero o con pocas muestras (zero- or few-shot learning). ¡Esto significa que el modelo puede resolver tareas para las cuales no fue entrenado explícitamente! Los modelos de lenguaje se pueden utilizar para generar texto fluido y convincente, aunque debes tener cuidado, ya que el texto no siempre puede ser preciso. + +Hay dos tipos de modelado de lenguaje: + +* causal: el objetivo del modelo es predecir el próximo token en una secuencia, y los tokens futuros están enmascarados + + ```py + >>> from transformers import pipeline + + >>> prompt = "Hugging Face is a community-based open-source platform for machine learning." + >>> generator = pipeline(task="text-generation") + >>> generator(prompt) # doctest: +SKIP + ``` + +* enmascarado: el objetivo del modelo es predecir un token enmascarado en una secuencia con acceso completo a los tokens en la secuencia + + ```py + >>> text = "Hugging Face is a community-based open-source for machine learning." + >>> fill_mask = pipeline(task="fill-mask") + >>> preds = fill_mask(text, top_k=1) + >>> preds = [ + ... { + ... "score": round(pred["score"], 4), + ... "token": pred["token"], + ... "token_str": pred["token_str"], + ... "sequence": pred["sequence"], + ... } + ... for pred in preds + ... 
] + >>> preds + [{'score': 0.2236, + 'token': 1761, + 'token_str': ' platform', + 'sequence': 'Hugging Face is a community-based open-source platform for machine learning.'}] + ``` + +## Multimodal + +Las tareas multimodales requieren que un modelo procese múltiples modalidades de datos (texto, imagen, audio, video) para resolver un problema particular. La descripción de imágenes es un ejemplo de una tarea multimodal en la que el modelo toma una imagen como entrada y produce una secuencia de texto que describe la imagen o algunas propiedades de la imagen. + +Aunque los modelos multimodales trabajan con diferentes tipos de datos o modalidades, internamente, los pasos de preprocesamiento ayudan al modelo a convertir todos los tipos de datos en embeddings (vectores o listas de números que contienen información significativa sobre los datos). Para una tarea como la descripción de imágenes, el modelo aprende las relaciones entre los embeddings de imágenes y los embeddings de texto. + +### Respuestas a preguntas de documentos + +Las respuestas a preguntas de documentos es una tarea que responde preguntas en lenguaje natural a partir de un documento. A diferencia de una tarea de respuestas a preguntas a nivel de token que toma texto como entrada, las respuestas a preguntas de documentos toman una imagen de un documento como entrada junto con una pregunta sobre el documento y devuelven una respuesta. Las respuestas a preguntas de documentos pueden usarse para analizar documentos estructurados y extraer información clave de ellos. En el ejemplo a continuación, el monto total y el cambio debido se pueden extraer de un recibo. + +```py +>>> from transformers import pipeline +>>> from PIL import Image +>>> import requests + +>>> url = "https://datasets-server.huggingface.co/assets/hf-internal-testing/example-documents/--/hf-internal-testing--example-documents/test/2/image/image.jpg" +>>> image = Image.open(requests.get(url, stream=True).raw) + +>>> doc_question_answerer = pipeline("document-question-answering", model="magorshunov/layoutlm-invoices") +>>> preds = doc_question_answerer( +... question="What is the total amount?", +... image=image, +... ) +>>> preds +[{'score': 0.8531, 'answer': '17,000', 'start': 4, 'end': 4}] +``` + +Con suerte, esta página te ha proporcionado más información de fondo sobre todos los tipos de tareas en cada modalidad y la importancia práctica de cada una. En la próxima [sección](https://huggingface.co/docs/transformers/tasks_explained), aprenderás **cómo** 🤗 Transformers trabaja para resolver estas tareas. + + \ No newline at end of file diff --git a/docs/source/es/tasks/asr.mdx b/docs/source/es/tasks/asr.md similarity index 98% rename from docs/source/es/tasks/asr.mdx rename to docs/source/es/tasks/asr.md index f3747a332d7f42..850bdfd711e7e0 100644 --- a/docs/source/es/tasks/asr.mdx +++ b/docs/source/es/tasks/asr.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. 
+ --> # Reconocimiento automático del habla diff --git a/docs/source/es/tasks/image_classification.mdx b/docs/source/es/tasks/image_classification.md similarity index 87% rename from docs/source/es/tasks/image_classification.mdx rename to docs/source/es/tasks/image_classification.md index 9b8b03207d0822..f09730caf69fee 100644 --- a/docs/source/es/tasks/image_classification.mdx +++ b/docs/source/es/tasks/image_classification.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Clasificación de imágenes @@ -69,12 +73,12 @@ Cada clase de alimento - o label - corresponde a un número; `79` indica una cos ## Preprocesa -Carga el feature extractor de ViT para procesar la imagen en un tensor: +Carga el image processor de ViT para procesar la imagen en un tensor: ```py ->>> from transformers import AutoFeatureExtractor +>>> from transformers import AutoImageProcessor ->>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k") +>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k") ``` Aplica varias transformaciones de imagen al dataset para hacer el modelo más robusto contra el overfitting. En este caso se utilizará el módulo [`transforms`](https://pytorch.org/vision/stable/transforms.html) de torchvision. Recorta una parte aleatoria de la imagen, cambia su tamaño y normalízala con la media y la desviación estándar de la imagen: @@ -82,8 +86,8 @@ Aplica varias transformaciones de imagen al dataset para hacer el modelo más ro ```py >>> from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor ->>> normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std) ->>> _transforms = Compose([RandomResizedCrop(feature_extractor.size), ToTensor(), normalize]) +>>> normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std) +>>> _transforms = Compose([RandomResizedCrop(image_processor.size["height"]), ToTensor(), normalize]) ``` Crea una función de preprocesamiento que aplique las transformaciones y devuelva los `pixel_values` - los inputs al modelo - de la imagen: @@ -95,7 +99,7 @@ Crea una función de preprocesamiento que aplique las transformaciones y devuelv ... return examples ``` -Utiliza el método [`with_transform`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?#datasets.Dataset.with_transform) de 🤗 Dataset para aplicar las transformaciones sobre todo el dataset. Las transformaciones se aplican sobre la marcha cuando se carga un elemento del dataset: +Utiliza el método [`with_transform`](https://huggingface.co/docs/datasets/package_reference/main_classes?#datasets.Dataset.with_transform) de 🤗 Dataset para aplicar las transformaciones sobre todo el dataset. Las transformaciones se aplican sobre la marcha cuando se carga un elemento del dataset: ```py >>> food = food.with_transform(transforms) @@ -156,7 +160,7 @@ Al llegar a este punto, solo quedan tres pasos: ... data_collator=data_collator, ... train_dataset=food["train"], ... 
eval_dataset=food["test"], -... tokenizer=feature_extractor, +... tokenizer=image_processor, ... ) >>> trainer.train() diff --git a/docs/source/es/tasks/language_modeling.mdx b/docs/source/es/tasks/language_modeling.md similarity index 88% rename from docs/source/es/tasks/language_modeling.mdx rename to docs/source/es/tasks/language_modeling.md index 565185072a119b..010d1bccae7bbf 100644 --- a/docs/source/es/tasks/language_modeling.mdx +++ b/docs/source/es/tasks/language_modeling.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Modelado de lenguaje @@ -22,11 +26,11 @@ El modelado de lenguaje causal predice el siguiente token en una secuencia de to El modelado de lenguaje por enmascaramiento predice un token enmascarado en una secuencia, y el modelo puede considerar los tokens bidireccionalmente. -Esta guía te mostrará cómo realizar fine-tuning [DistilGPT2](https://huggingface.co/distilgpt2) para modelos de lenguaje causales y [DistilRoBERTa](https://huggingface.co/distilroberta-base) para modelos de lenguaje por enmascaramiento en el [r/askscience](https://www.reddit.com/r/askscience/) subdataset [ELI5](https://huggingface.co/datasets/eli5). +Esta guía te mostrará cómo realizar fine-tuning [DistilGPT2](https://huggingface.co/distilbert/distilgpt2) para modelos de lenguaje causales y [DistilRoBERTa](https://huggingface.co/distilbert/distilroberta-base) para modelos de lenguaje por enmascaramiento en el [r/askscience](https://www.reddit.com/r/askscience/) subdataset [ELI5](https://huggingface.co/datasets/eli5). -Puedes realizar fine-tuning a otras arquitecturas para modelos de lenguaje como [GPT-Neo](https://huggingface.co/EleutherAI/gpt-neo-125M), [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B) y [BERT](https://huggingface.co/bert-base-uncased) siguiendo los mismos pasos presentados en esta guía! +Puedes realizar fine-tuning a otras arquitecturas para modelos de lenguaje como [GPT-Neo](https://huggingface.co/EleutherAI/gpt-neo-125M), [GPT-J](https://huggingface.co/EleutherAI/gpt-j-6B) y [BERT](https://huggingface.co/google-bert/bert-base-uncased) siguiendo los mismos pasos presentados en esta guía! Mira la [página de tarea](https://huggingface.co/tasks/text-generation) para generación de texto y la [página de tarea](https://huggingface.co/tasks/fill-mask) para modelos de lenguajes por enmascaramiento para obtener más información sobre los modelos, datasets, y métricas asociadas. 
@@ -77,7 +81,7 @@ Para modelados de lenguaje causales carga el tokenizador DistilGPT2 para procesa ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("distilgpt2") +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2") ``` @@ -87,10 +91,10 @@ Para modelados de lenguaje por enmascaramiento carga el tokenizador DistilRoBERT ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("distilroberta-base") +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilroberta-base") ``` -Extrae el subcampo `text` desde su estructura anidado con el método [`flatten`](https://huggingface.co/docs/datasets/process.html#flatten): +Extrae el subcampo `text` desde su estructura anidado con el método [`flatten`](https://huggingface.co/docs/datasets/process#flatten): ```py >>> eli5 = eli5.flatten() @@ -118,7 +122,7 @@ Así es como puedes crear una función de preprocesamiento para convertir la lis ... return tokenizer([" ".join(x) for x in examples["answers.text"]], truncation=True) ``` -Usa de 🤗 Datasets la función [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) para aplicar la función de preprocesamiento sobre el dataset en su totalidad. Puedes acelerar la función `map` configurando el argumento `batched=True` para procesar múltiples elementos del dataset a la vez y aumentar la cantidad de procesos con `num_proc`. Elimina las columnas que no necesitas: +Usa de 🤗 Datasets la función [`map`](https://huggingface.co/docs/datasets/process#map) para aplicar la función de preprocesamiento sobre el dataset en su totalidad. Puedes acelerar la función `map` configurando el argumento `batched=True` para procesar múltiples elementos del dataset a la vez y aumentar la cantidad de procesos con `num_proc`. Elimina las columnas que no necesitas: ```py >>> tokenized_eli5 = eli5.map( @@ -199,7 +203,7 @@ Para modelados de lenguajes por enmascaramiento usa el mismo [`DataCollatorForLa ## Modelado de lenguaje causal -El modelado de lenguaje causal es frecuentemente utilizado para generación de texto. Esta sección te muestra cómo realizar fine-tuning a [DistilGPT2](https://huggingface.co/distilgpt2) para generar nuevo texto. +El modelado de lenguaje causal es frecuentemente utilizado para generación de texto. Esta sección te muestra cómo realizar fine-tuning a [DistilGPT2](https://huggingface.co/distilbert/distilgpt2) para generar nuevo texto. ### Entrenamiento @@ -210,7 +214,7 @@ Carga DistilGPT2 con [`AutoModelForCausalLM`]: ```py >>> from transformers import AutoModelForCausalLM, TrainingArguments, Trainer ->>> model = AutoModelForCausalLM.from_pretrained("distilgpt2") +>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2") ``` @@ -245,7 +249,7 @@ A este punto, solo faltan tres pasos: ``` -Para realizar el fine-tuning de un modelo en TensorFlow, comienza por convertir tus datasets al formato `tf.data.Dataset` con [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Especifica los inputs y etiquetas en `columns`, ya sea para mezclar el dataset, tamaño de lote, y el data collator: +Para realizar el fine-tuning de un modelo en TensorFlow, comienza por convertir tus datasets al formato `tf.data.Dataset` con [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.to_tf_dataset). 
Especifica los inputs y etiquetas en `columns`, ya sea para mezclar el dataset, tamaño de lote, y el data collator: ```py >>> tf_train_set = lm_dataset["train"].to_tf_dataset( @@ -284,7 +288,7 @@ Carga DistilGPT2 con [`TFAutoModelForCausalLM`]: ```py >>> from transformers import TFAutoModelForCausalLM ->>> model = TFAutoModelForCausalLM.from_pretrained("distilgpt2") +>>> model = TFAutoModelForCausalLM.from_pretrained("distilbert/distilgpt2") ``` Configura el modelo para entrenamiento con [`compile`](https://keras.io/api/models/model_training_apis/#compile-method): @@ -305,7 +309,7 @@ Llama a [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) par ## Modelado de lenguaje por enmascaramiento -El modelado de lenguaje por enmascaramiento es también conocido como una tarea de rellenar la máscara, pues predice un token enmascarado dada una secuencia. Los modelos de lenguaje por enmascaramiento requieren una buena comprensión del contexto de una secuencia entera, en lugar de solo el contexto a la izquierda. Esta sección te enseña como realizar el fine-tuning de [DistilRoBERTa](https://huggingface.co/distilroberta-base) para predecir una palabra enmascarada. +El modelado de lenguaje por enmascaramiento es también conocido como una tarea de rellenar la máscara, pues predice un token enmascarado dada una secuencia. Los modelos de lenguaje por enmascaramiento requieren una buena comprensión del contexto de una secuencia entera, en lugar de solo el contexto a la izquierda. Esta sección te enseña como realizar el fine-tuning de [DistilRoBERTa](https://huggingface.co/distilbert/distilroberta-base) para predecir una palabra enmascarada. ### Entrenamiento @@ -316,7 +320,7 @@ Carga DistilRoBERTa con [`AutoModelForMaskedlM`]: ```py >>> from transformers import AutoModelForMaskedLM ->>> model = AutoModelForMaskedLM.from_pretrained("distilroberta-base") +>>> model = AutoModelForMaskedLM.from_pretrained("distilbert/distilroberta-base") ``` @@ -352,7 +356,7 @@ A este punto, solo faltan tres pasos: ``` -Para realizar el fine-tuning de un modelo en TensorFlow, comienza por convertir tus datasets al formato `tf.data.Dataset` con [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Especifica los inputs y etiquetas en `columns`, ya sea para mezclar el dataset, tamaño de lote, y el data collator: +Para realizar el fine-tuning de un modelo en TensorFlow, comienza por convertir tus datasets al formato `tf.data.Dataset` con [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.to_tf_dataset). 
Especifica los inputs y etiquetas en `columns`, ya sea para mezclar el dataset, tamaño de lote, y el data collator: ```py >>> tf_train_set = lm_dataset["train"].to_tf_dataset( @@ -391,7 +395,7 @@ Carga DistilRoBERTa con [`TFAutoModelForMaskedLM`]: ```py >>> from transformers import TFAutoModelForMaskedLM ->>> model = TFAutoModelForCausalLM.from_pretrained("distilroberta-base") +>>> model = TFAutoModelForCausalLM.from_pretrained("distilbert/distilroberta-base") ``` Configura el modelo para entrenamiento con [`compile`](https://keras.io/api/models/model_training_apis/#compile-method): diff --git a/docs/source/es/tasks/multiple_choice.mdx b/docs/source/es/tasks/multiple_choice.md similarity index 94% rename from docs/source/es/tasks/multiple_choice.mdx rename to docs/source/es/tasks/multiple_choice.md index 2ece0969bf96a1..ca2e3d15f63546 100644 --- a/docs/source/es/tasks/multiple_choice.mdx +++ b/docs/source/es/tasks/multiple_choice.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Selección múltiple @@ -15,7 +19,7 @@ specific language governing permissions and limitations under the License. La tarea de selección múltiple es parecida a la de responder preguntas, con la excepción de que se dan varias opciones de respuesta junto con el contexto. El modelo se entrena para escoger la respuesta correcta entre varias opciones a partir del contexto dado. -Esta guía te mostrará como hacerle fine-tuning a [BERT](https://huggingface.co/bert-base-uncased) en la configuración `regular` del dataset [SWAG](https://huggingface.co/datasets/swag), de forma +Esta guía te mostrará como hacerle fine-tuning a [BERT](https://huggingface.co/google-bert/bert-base-uncased) en la configuración `regular` del dataset [SWAG](https://huggingface.co/datasets/swag), de forma que seleccione la mejor respuesta a partir de varias opciones y algún contexto. 
## Cargar el dataset SWAG @@ -54,7 +58,7 @@ Carga el tokenizer de BERT para procesar el comienzo de cada oración y los cuat ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") ``` La función de preprocesmaiento debe hacer lo siguiente: @@ -190,7 +194,7 @@ Carga el modelo BERT con [`AutoModelForMultipleChoice`]: ```py >>> from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer ->>> model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased") +>>> model = AutoModelForMultipleChoice.from_pretrained("google-bert/bert-base-uncased") ``` @@ -270,7 +274,7 @@ Carga el modelo BERT con [`TFAutoModelForMultipleChoice`]: ```py >>> from transformers import TFAutoModelForMultipleChoice ->>> model = TFAutoModelForMultipleChoice.from_pretrained("bert-base-uncased") +>>> model = TFAutoModelForMultipleChoice.from_pretrained("google-bert/bert-base-uncased") ``` Configura el modelo para entrenarlo con [`compile`](https://keras.io/api/models/model_training_apis/#compile-method): diff --git a/docs/source/es/tasks/question_answering.mdx b/docs/source/es/tasks/question_answering.md similarity index 94% rename from docs/source/es/tasks/question_answering.mdx rename to docs/source/es/tasks/question_answering.md index d599fa8f1a3713..5cd59f6b064f71 100644 --- a/docs/source/es/tasks/question_answering.mdx +++ b/docs/source/es/tasks/question_answering.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Respuesta a preguntas @@ -19,7 +23,7 @@ La respuesta a preguntas devuelve una respuesta a partir de una pregunta dada. E - Extractiva: extraer la respuesta a partir del contexto dado. - Abstractiva: generar una respuesta que responda correctamente la pregunta a partir del contexto dado. -Esta guía te mostrará como hacer fine-tuning de [DistilBERT](https://huggingface.co/distilbert-base-uncased) en el dataset [SQuAD](https://huggingface.co/datasets/squad) para responder preguntas de forma extractiva. +Esta guía te mostrará como hacer fine-tuning de [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) en el dataset [SQuAD](https://huggingface.co/datasets/squad) para responder preguntas de forma extractiva. 
@@ -60,7 +64,7 @@ Carga el tokenizer de DistilBERT para procesar los campos `question` (pregunta) ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") ``` Hay algunos pasos de preprocesamiento específicos para la tarea de respuesta a preguntas que debes tener en cuenta: @@ -160,7 +164,7 @@ Carga el modelo DistilBERT con [`AutoModelForQuestionAnswering`]: ```py >>> from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer ->>> model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased") +>>> model = AutoModelForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased") ``` @@ -243,7 +247,7 @@ Carga el modelo DistilBERT con [`TFAutoModelForQuestionAnswering`]: ```py >>> from transformers import TFAutoModelForQuestionAnswering ->>> model = TFAutoModelForQuestionAnswering("distilbert-base-uncased") +>>> model = TFAutoModelForQuestionAnswering("distilbert/distilbert-base-uncased") ``` Configura el modelo para entrenarlo con [`compile`](https://keras.io/api/models/model_training_apis/#compile-method): diff --git a/docs/source/es/tasks/summarization.mdx b/docs/source/es/tasks/summarization.md similarity index 95% rename from docs/source/es/tasks/summarization.mdx rename to docs/source/es/tasks/summarization.md index c09c4b0b833a13..19ceb90b22cbb2 100644 --- a/docs/source/es/tasks/summarization.mdx +++ b/docs/source/es/tasks/summarization.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Generación de resúmenes @@ -19,7 +23,7 @@ La generación de resúmenes (summarization, en inglés) crea una versión más - Extractiva: Extrae la información más relevante de un documento. - Abstractiva: Genera un texto nuevo que captura la información más importante. -Esta guía te mostrará cómo puedes hacer fine-tuning del modelo [T5](https://huggingface.co/t5-small) sobre el subset de proyectos de ley del estado de California, dentro del dataset [BillSum](https://huggingface.co/datasets/billsum) para hacer generación de resúmenes abstractiva. +Esta guía te mostrará cómo puedes hacer fine-tuning del modelo [T5](https://huggingface.co/google-t5/t5-small) sobre el subset de proyectos de ley del estado de California, dentro del dataset [BillSum](https://huggingface.co/datasets/billsum) para hacer generación de resúmenes abstractiva. 
@@ -61,7 +65,7 @@ Carga el tokenizador T5 para procesar `text` y `summary`: ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("t5-small") +>>> tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small") ``` La función de preprocesamiento necesita: @@ -118,7 +122,7 @@ Carga T5 con [`AutoModelForSeq2SeqLM`]: ```py >>> from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer ->>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-small") +>>> model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small") ``` @@ -196,7 +200,7 @@ Carga T5 con [`TFAutoModelForSeq2SeqLM`]: ```py >>> from transformers import TFAutoModelForSeq2SeqLM ->>> model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-small") +>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small") ``` Configura el modelo para entrenamiento con [`compile`](https://keras.io/api/models/model_training_apis/#compile-method): @@ -219,4 +223,4 @@ Para un ejemplo con mayor profundidad de cómo hacer fine-tuning a un modelo par [notebook en PyTorch](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb) o a la [notebook en TensorFlow](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization-tf.ipynb). - \ No newline at end of file + diff --git a/docs/source/es/training.mdx b/docs/source/es/training.md similarity index 95% rename from docs/source/es/training.mdx rename to docs/source/es/training.md index 467df17d138076..fef44ed3f9ff72 100644 --- a/docs/source/es/training.mdx +++ b/docs/source/es/training.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Fine-tuning a un modelo pre-entrenado @@ -44,7 +48,7 @@ Como ya sabes, necesitas un tokenizador para procesar el texto e incluir una est ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") >>> def tokenize_function(examples): @@ -74,7 +78,7 @@ Comienza cargando tu modelo y especifica el número de labels previstas. A parti ```py >>> from transformers import AutoModelForSequenceClassification ->>> model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5) +>>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) ``` @@ -98,7 +102,7 @@ Especifica dónde vas a guardar los checkpoints de tu entrenamiento: ### Métricas -El [`Trainer`] no evalúa automáticamente el rendimiento del modelo durante el entrenamiento. Tendrás que pasarle a [`Trainer`] una función para calcular y hacer un reporte de las métricas. 
La biblioteca de 🤗 Datasets proporciona una función de [`accuracy`](https://huggingface.co/metrics/accuracy) simple que puedes cargar con la función `load_metric` (ver este [tutorial](https://huggingface.co/docs/datasets/metrics.html) para más información): +El [`Trainer`] no evalúa automáticamente el rendimiento del modelo durante el entrenamiento. Tendrás que pasarle a [`Trainer`] una función para calcular y hacer un reporte de las métricas. La biblioteca de 🤗 Datasets proporciona una función de [`accuracy`](https://huggingface.co/metrics/accuracy) simple que puedes cargar con la función `load_metric` (ver este [tutorial](https://huggingface.co/docs/datasets/metrics) para más información): ```py >>> import numpy as np @@ -168,12 +172,12 @@ El [`DefaultDataCollator`] junta los tensores en un batch para que el modelo se -A continuación, convierte los datasets tokenizados en datasets de TensorFlow con el método [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Especifica tus entradas en `columns` y tu etiqueta en `label_cols`: +A continuación, convierte los datasets tokenizados en datasets de TensorFlow con el método [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.to_tf_dataset). Especifica tus entradas en `columns` y tu etiqueta en `label_cols`: ```py >>> tf_train_dataset = small_train_dataset.to_tf_dataset( ... columns=["attention_mask", "input_ids", "token_type_ids"], -... label_cols=["labels"], +... label_cols="labels", ... shuffle=True, ... collate_fn=data_collator, ... batch_size=8, @@ -181,7 +185,7 @@ A continuación, convierte los datasets tokenizados en datasets de TensorFlow co >>> tf_validation_dataset = small_eval_dataset.to_tf_dataset( ... columns=["attention_mask", "input_ids", "token_type_ids"], -... label_cols=["labels"], +... label_cols="labels", ... shuffle=False, ... collate_fn=data_collator, ... batch_size=8, @@ -196,7 +200,7 @@ Carguemos un modelo TensorFlow con el número esperado de labels: >>> import tensorflow as tf >>> from transformers import TFAutoModelForSequenceClassification ->>> model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5) +>>> model = TFAutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) ``` A continuación, compila y aplica fine-tuning a tu modelo con [`fit`](https://keras.io/api/models/model_training_apis/) como lo harías con cualquier otro modelo de Keras: @@ -271,7 +275,7 @@ Carga tu modelo con el número de labels previstas: ```py >>> from transformers import AutoModelForSequenceClassification ->>> model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5) +>>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) ``` ### Optimiza y programa el learning rate @@ -338,7 +342,7 @@ Para hacer un seguimiento al progreso del entrenamiento, utiliza la biblioteca [ ### Métricas -De la misma manera que necesitas añadir una función de evaluación al [`Trainer`], necesitas hacer lo mismo cuando escribas tu propio ciclo de entrenamiento. Pero en lugar de calcular y reportar la métrica al final de cada época, esta vez acumularás todos los batches con [`add_batch`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=add_batch#datasets.Metric.add_batch) y calcularás la métrica al final. 
+De la misma manera que necesitas añadir una función de evaluación al [`Trainer`], necesitas hacer lo mismo cuando escribas tu propio ciclo de entrenamiento. Pero en lugar de calcular y reportar la métrica al final de cada época, esta vez acumularás todos los batches con [`add_batch`](https://huggingface.co/docs/datasets/package_reference/main_classes?highlight=add_batch#datasets.Metric.add_batch) y calcularás la métrica al final. ```py >>> metric = load_metric("accuracy") diff --git a/docs/source/fr/_toctree.yml b/docs/source/fr/_toctree.yml index 11632a423b6af1..12c2feb0a02eb5 100755 --- a/docs/source/fr/_toctree.yml +++ b/docs/source/fr/_toctree.yml @@ -1,156 +1,30 @@ - sections: - - local: index - title: 🤗 Transformers - - local: quicktour - title: Visite rapide - - local: in_translation - title: Installation + - local: index + title: 🤗 Transformers + - local: quicktour + title: Visite rapide + - local: installation + title: Installation title: Démarrer - sections: - - local: in_translation - title: Pipelines pour l'inférence - - local: in_translation - title: Chargement d'instances pré-entraînées avec une AutoClass - - local: in_translation - title: Préparation des données - - local: in_translation - title: Fine-tune un modèle pré-entraîné - - local: in_translation - title: Entraînement distribué avec 🤗 Accelerate - - local: in_translation - title: Partager un modèle - title: Tutoriels -- sections: - - sections: - - local: in_translation - title: Créer votre architecture - - local: in_translation - title: Partager vos modèles - - local: in_translation - title: Entraînement avec un script - - local: in_translation - title: Entraînement avec Amazon SageMaker - - local: in_translation - title: Convertir depuis des checkpoints Tensorflow - - local: in_translation - title: Exporter vers ONNX - - local: in_translation - title: Exporter vers TorchScript - - local: in_translation - title: Aide au dépannage - title: Usage général - - sections: - - local: in_translation - title: Utiliser les tokenizers de 🤗 Tokenizers - - local: in_translation - title: Inférence avec les modèles multilingues - - local: in_translation - title: Stratégies de génération de texte - - sections: - - isExpanded: false - local: in_translation - title: Classification de texte - - local: in_translation - title: Classification de token - - local: in_translation - title: Système de question-réponse - - local: in_translation - title: Modélisation causale du langage - - local: in_translation - title: Modélisation du langage avec masque - - local: in_translation - title: Traduction - - local: in_translation - title: Génération de résumé - - local: in_translation - title: Question à choix multiple - title: Guides des tâches - title: Traitement automatique des langues - - sections: - - local: in_translation - title: Classification audio - - local: in_translation - title: Reconnaissance automatique de la parole - title: Audio - - sections: - local: in_translation - title: Classification d'images + title: Pipelines pour l'inférence + - local: autoclass_tutorial + title: Chargement d'instances pré-entraînées avec une AutoClass - local: in_translation - title: Segmentation sémantique + title: Préparation des données - local: in_translation - title: Classification de vidéos + title: Fine-tune un modèle pré-entraîné - local: in_translation - title: Détection d'objets - title: Vision par ordinateur - - sections: - - local: in_translation - title: Performance et extensibilité - - sections: - - local: in_translation - title: Comment 
contribuer à transformers? - - local: in_translation - title: Comment ajouter un modèle à 🤗 Transformers? - - local: in_translation - title: Comment convertir un modèle 🤗 Transformers vers TensorFlow? - - local: in_translation - title: Comment ajouter un pipeline à 🤗 Transformers? - - local: in_translation - title: Tester - - local: in_translation - title: Vérification pour une Pull Request - title: Contribuer - - local: in_translation - title: 🤗 Transformers Notebooks - - local: in_translation - title: Ressources communautaires - - local: in_translation - title: Benchmarks - - local: in_translation - title: Migration à partir de versions précédentes - title: Guides d'utilisation -- sections: - - local: in_translation - title: Philosophie - - local: in_translation - title: Glossaire - - local: in_translation - title: Qu'est ce 🤗 Transformers peut faire ? - - local: in_translation - title: Quelles tâches 🤗 Transformers peut résoudre ? - - local: in_translation - title: Résumé des modèles - - local: in_translation - title: Résumé des tokenizers - - local: in_translation - title: Remplissage et troncature - - local: in_translation - title: BERTology - - local: in_translation - title: Perplexité des modèles à longueur fixe - - local: in_translation - title: Pipelines pour inférence avec des serveurs web - title: Guides conceptuels -- sections: - - isExpanded: false - sections: - - local: in_translation - title: Classes principales - - local: in_translation - title: Modèles textuels - - local: in_translation - title: Modèles visuels - - local: in_translation - title: Modèles audio + title: Entraînement avec un script - local: in_translation - title: Modèles multimodal + title: Entraînement distribué avec 🤗 Accelerate - local: in_translation - title: Modèles d'apprentissage par renforcement + title: Chargement et entraînement des adaptateurs avec 🤗 PEFT - local: in_translation - title: Modèles de séries temporelles + title: Partager un modèle - local: in_translation - title: Graph models - title: Modèles - - sections: + title: Agents - local: in_translation - title: Utilitaires internes - title: API + title: Génération avec LLMs + title: Tutoriels diff --git a/docs/source/fr/autoclass_tutorial.md b/docs/source/fr/autoclass_tutorial.md new file mode 100644 index 00000000000000..f569966d0c6043 --- /dev/null +++ b/docs/source/fr/autoclass_tutorial.md @@ -0,0 +1,142 @@ + + +# Chargement d'instances pré-entraînées avec une AutoClass + +Avec autant d'architectures Transformer différentes, il peut être difficile d'en créer une pour votre ensemble de poids (aussi appelés "weights" ou "checkpoint" en anglais). Dans l'idée de créer une librairie facile, simple et flexible à utiliser, 🤗 Transformers fournit une `AutoClass` qui infère et charge automatiquement l'architecture correcte à partir d'un ensemble de poids donné. La fonction `from_pretrained()` vous permet de charger rapidement un modèle pré-entraîné pour n'importe quelle architecture afin que vous n'ayez pas à consacrer du temps et des ressources à l'entraînement d'un modèle à partir de zéro. Produire un tel code indépendant d'un ensemble de poids signifie que si votre code fonctionne pour un ensemble de poids, il fonctionnera avec un autre ensemble - tant qu'il a été entraîné pour une tâche similaire - même si l'architecture est différente. + + + +Rappel, l'architecture fait référence au squelette du modèle et l'ensemble de poids contient les poids pour une architecture donnée. 
Par exemple, [BERT](https://huggingface.co/google-bert/bert-base-uncased) est une architecture, tandis que `google-bert/bert-base-uncased` est un ensemble de poids. Le terme modèle est général et peut signifier soit architecture soit ensemble de poids. + + + +Dans ce tutoriel, vous apprendrez à: + + * Charger un tokenizer pré-entraîné. + * Charger un processeur d'image pré-entraîné. + * Charger un extracteur de caractéristiques pré-entraîné. + * Charger un processeur pré-entraîné. + * Charger un modèle pré-entraîné. + +## AutoTokenizer + +Quasiment toutes les tâches de traitement du langage (NLP) commencent avec un tokenizer. Un tokenizer convertit votre texte initial dans un format qui peut être traité par le modèle. + +Chargez un tokenizer avec [`AutoTokenizer.from_pretrained`]: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +``` + +Puis, transformez votre texte initial comme montré ci-dessous: + +```py +>>> sequence = "In a hole in the ground there lived a hobbit." +>>> print(tokenizer(sequence)) +{'input_ids': [101, 1999, 1037, 4920, 1999, 1996, 2598, 2045, 2973, 1037, 7570, 10322, 4183, 1012, 102], + 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]} +``` + +## AutoImageProcessor + +Pour les tâches de vision, un processeur d'image traite l'image pour la formater correctement. + +Chargez un processeur d'image avec [`AutoImageProcessor.from_pretrained`]: + +```py +>>> from transformers import AutoImageProcessor + +>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224") +``` + +## AutoFeatureExtractor + +Pour les tâches audio, un extracteur de caractéristiques (aussi appelées "features" en anglais) traite le signal audio pour le formater correctement. + +Chargez un extracteur de caractéristiques avec [`AutoFeatureExtractor.from_pretrained`]: + +```py +>>> from transformers import AutoFeatureExtractor + +>>> feature_extractor = AutoFeatureExtractor.from_pretrained( +... "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition" +... ) +``` + +## AutoProcessor + +Les tâches multimodales nécessitent un processeur qui combine deux types d'outils de prétraitement. Par exemple, le modèle [LayoutLMV2](model_doc/layoutlmv2) nécessite un processeur d'image pour traiter les images et un tokenizer pour traiter le texte ; un processeur combine les deux. + +Chargez un processeur avec [`AutoProcessor.from_pretrained`]: + +```py +>>> from transformers import AutoProcessor + +>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv2-base-uncased") +``` + +## AutoModel + + + +Enfin, les classes `AutoModelFor` vous permettent de charger un modèle pré-entraîné pour une tâche donnée (voir [ici](model_doc/auto) pour une liste complète des tâches disponibles).
Par exemple, chargez un modèle pour la classification de séquence avec [`AutoModelForSequenceClassification.from_pretrained`]: + +```py +>>> from transformers import AutoModelForSequenceClassification + +>>> model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") +``` + +Réutilisez facilement le même ensemble de poids pour charger une architecture pour une tâche différente : + +```py +>>> from transformers import AutoModelForTokenClassification + +>>> model = AutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased") +``` + + + +Pour les modèles PyTorch, la fonction `from_pretrained()` utilise `torch.load()` qui utilise `pickle` en interne et est connu pour être non sécurisé. En général, ne chargez jamais un modèle qui pourrait provenir d'une source non fiable, ou qui pourrait avoir été altéré. Ce risque de sécurité est partiellement atténué pour les modèles hébergés publiquement sur le Hugging Face Hub, qui sont [scannés pour les logiciels malveillants](https://huggingface.co/docs/hub/security-malware) à chaque modification. Consultez la [documentation du Hub](https://huggingface.co/docs/hub/security) pour connaître les meilleures pratiques comme la [vérification des modifications signées](https://huggingface.co/docs/hub/security-gpg#signing-commits-with-gpg) avec GPG. + +Les points de contrôle TensorFlow et Flax ne sont pas concernés, et peuvent être chargés dans des architectures PyTorch en utilisant les arguments `from_tf` et `from_flax` de la fonction `from_pretrained` pour contourner ce problème. + + + +En général, nous recommandons d'utiliser les classes `AutoTokenizer` et `AutoModelFor` pour charger des instances pré-entraînées de tokenizers et modèles respectivement. Cela vous permettra de charger la bonne architecture à chaque fois. Dans le prochain [tutoriel](preprocessing), vous apprenez à utiliser un tokenizer, processeur d'image, extracteur de caractéristiques et processeur pour pré-traiter un jeu de données pour le fine-tuning. + + +Enfin, les classes `TFAutoModelFor` vous permettent de charger un modèle pré-entraîné pour une tâche donnée (voir [ici](model_doc/auto) pour une liste complète des tâches disponibles). Par exemple, chargez un modèle pour la classification de séquence avec [`TFAutoModelForSequenceClassification.from_pretrained`]: + +```py +>>> from transformers import TFAutoModelForSequenceClassification + +>>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") +``` + +Réutilisez facilement le même ensemble de poids pour charger une architecture pour une tâche différente : + +```py +>>> from transformers import TFAutoModelForTokenClassification + +>>> model = TFAutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased") +``` + +En général, nous recommandons d'utiliser les classes `AutoTokenizer` et `TFAutoModelFor` pour charger des instances pré-entraînées de tokenizers et modèles respectivement. Cela vous permettra de charger la bonne architecture à chaque fois. Dans le prochain [tutoriel](preprocessing), vous apprenez à utiliser un tokenizer, processeur d'image, extracteur de caractéristiques et processeur pour pré-traiter un jeu de données pour le fine-tuning. + + diff --git a/docs/source/fr/in_translation.md b/docs/source/fr/in_translation.md new file mode 100644 index 00000000000000..910559ef6c9a0a --- /dev/null +++ b/docs/source/fr/in_translation.md @@ -0,0 +1,5 @@ + + +# Traduction en cours. 
\ No newline at end of file diff --git a/docs/source/fr/in_translation.mdx b/docs/source/fr/in_translation.mdx deleted file mode 100644 index 619f76420bd506..00000000000000 --- a/docs/source/fr/in_translation.mdx +++ /dev/null @@ -1 +0,0 @@ -# Traduction en cours. \ No newline at end of file diff --git a/docs/source/fr/index.mdx b/docs/source/fr/index.md similarity index 97% rename from docs/source/fr/index.mdx rename to docs/source/fr/index.md index 2903a8b1625d4d..187864a0874a98 100644 --- a/docs/source/fr/index.mdx +++ b/docs/source/fr/index.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # 🤗 Transformers @@ -21,7 +25,7 @@ Apprentissage automatique de pointe pour [PyTorch](https://pytorch.org/), [Tenso 🗣️ **Audio**: reconnaissance automatique de la parole et classification audio.
🐙 **Multimodalité**: système de question-réponse avec des tableaux ou images, reconnaissance optique de caractères, extraction d'information depuis des documents scannés et classification de vidéo. -🤗 Transformers prend en charge l'interopérabilité entre PyTorch, TensorFlow et JAX. Cela permet d'utiliser un framework différent à chaque étape de la vie d'un modèle, par example entraîner un modèle en trois lignes de code avec un framework, et le charger pour l'inférence avec un autre. Les modèles peuvent également être exportés dans un format comme ONNX et TorchScript pour être déployés dans des environnements de production. +🤗 Transformers prend en charge l'interopérabilité entre PyTorch, TensorFlow et JAX. Cela permet d'utiliser un framework différent à chaque étape de la vie d'un modèle, par exemple entraîner un modèle en trois lignes de code avec un framework, et le charger pour l'inférence avec un autre. Les modèles peuvent également être exportés dans un format comme ONNX et TorchScript pour être déployés dans des environnements de production. Rejoignez la communauté grandissante sur le [Hub](https://huggingface.co/models), le [forum](https://discuss.huggingface.co/) ou [Discord](https://discord.com/invite/JfAtkvEtRb) dès aujourd'hui ! @@ -37,7 +41,7 @@ La documentation est organisée en 5 parties: - **DEMARRER** propose une visite rapide de la bibliothèque et des instructions d'installation pour être opérationnel. - **TUTORIELS** excellent point de départ pour les débutants. Cette section vous aidera à acquérir les compétences de base dont vous avez besoin pour commencer à utiliser la bibliothèque. -- **GUIDES D'UTILISATION** pour différentes tâches comme par example le finetuning d'un modèle pré-entraîné pour la classification de texte ou comment créer et partger votre propre modèle. +- **GUIDES D'UTILISATION** pour différentes tâches comme par exemple le finetuning d'un modèle pré-entraîné pour la classification de texte ou comment créer et partager votre propre modèle. - **GUIDES CONCEPTUELS** pour plus de discussions et d'explications sur les concepts et les idées sous-jacentes aux modèles, aux tâches et à la philosophie de conception de 🤗 Transformers. - **API** décrit toutes les classes et fonctions : @@ -45,11 +49,12 @@ La documentation est organisée en 5 parties: - **MODELES** détaille les classes et les fonctions propres à chaque modèle de la bibliothèque. - **UTILITAIRES INTERNES** détaille les classes et fonctions utilitaires utilisées en interne. -### Modèles supportées +### Modèles supportés 1. **[ALBERT](model_doc/albert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. +1. **[ALIGN](model_doc/align)** (from Google Research) released with the paper [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918) by Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. 1. **[AltCLIP](model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell. 1. 
**[Audio Spectrogram Transformer](model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass. 1. **[BART](model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer. @@ -79,6 +84,7 @@ La documentation est organisée en 5 parties: 1. **[Conditional DETR](model_doc/conditional_detr)** (from Microsoft Research Asia) released with the paper [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang. 1. **[ConvBERT](model_doc/convbert)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan. 1. **[ConvNeXT](model_doc/convnext)** (from Facebook AI) released with the paper [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie. +1. **[ConvNeXTV2](model_doc/convnextv2)** (from Facebook AI) released with the paper [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie. 1. **[CPM](model_doc/cpm)** (from Tsinghua University) released with the paper [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun. 1. **[CTRL](model_doc/ctrl)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher. 1. **[CvT](model_doc/cvt)** (from Microsoft) released with the paper [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang. @@ -102,6 +108,7 @@ La documentation est organisée en 5 parties: 1. **[EncoderDecoder](model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn. 1. **[ERNIE](model_doc/ernie)** (from Baidu) released with the paper [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) by Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu. 1. **[ESM](model_doc/esm)** (from Meta AI) are transformer protein language models. 
**ESM-1b** was released with the paper [Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences](https://www.pnas.org/content/118/15/e2016239118) by Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. **ESM-1v** was released with the paper [Language models enable zero-shot prediction of the effects of mutations on protein function](https://doi.org/10.1101/2021.07.09.450648) by Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu and Alexander Rives. **ESM-2 and ESMFold** were released with the paper [Language models of protein sequences at the scale of evolution enable accurate structure prediction](https://doi.org/10.1101/2022.07.20.500902) by Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, Alexander Rives. +1. **[FastSpeech2Conformer](model_doc/fastspeech2_conformer)** (from ESPnet) released with the paper [Recent Developments On Espnet Toolkit Boosted By Conformer](https://arxiv.org/abs/2010.13956) by Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang. 1. **[FLAN-T5](model_doc/flan-t5)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei 1. **[FlauBERT](model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab. 1. **[FLAVA](model_doc/flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. @@ -109,11 +116,11 @@ La documentation est organisée en 5 parties: 1. **[Funnel Transformer](model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le. 1. **[GIT](model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang. 1. **[GLPN](model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim. -1. 
**[GPT](model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. +1. **[GPT](model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://openai.com/research/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. 1. **[GPT Neo](model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. 1. **[GPT NeoX](model_doc/gpt_neox)** (from EleutherAI) released with the paper [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745) by Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach 1. **[GPT NeoX Japanese](model_doc/gpt_neox_japanese)** (from ABEJA) released by Shinya Otani, Takayoshi Makabe, Anuj Arora, and Kyo Hattori. -1. **[GPT-2](model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**. +1. **[GPT-2](model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://openai.com/research/better-language-models/) by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. 1. **[GPT-J](model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki. 1. **[GPT-Sw3](model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren. 1. **[Graphormer](model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu. @@ -230,9 +237,7 @@ La documentation est organisée en 5 parties: ### Frameworks compatibles -Le tableau ci-dessous représente la prise en charge actuelle dans la bibliothèque pour chacun de ces modèles, qu'ils aient un tokenizer Python -(appelé "slow"). Un tokenizer "fast" soutenu par la bibliothèque 🤗 Tokenizers, qu'ils aient un support en Jax (via. -Flax), PyTorch, et/ou TensorFlow. +Le tableau ci-dessous représente la prise en charge actuelle dans la bibliothèque pour chacun de ces modèles, qu'ils aient ou non un tokenizer Python (appelé "slow"). Un tokenizer rapide ("fast") soutenu par la bibliothèque 🤗 Tokenizers, qu'ils aient un support en Jax (via Flax), PyTorch, et/ou TensorFlow. @@ -286,6 +291,7 @@ Flax), PyTorch, et/ou TensorFlow. 
| ERNIE | ❌ | ❌ | ✅ | ❌ | ❌ | | ESM | ✅ | ❌ | ✅ | ✅ | ❌ | | FairSeq Machine-Translation | ✅ | ❌ | ✅ | ❌ | ❌ | +| FastSpeech2Conformer | ✅ | ❌ | ✅ | ❌ | ❌ | | FlauBERT | ✅ | ❌ | ✅ | ✅ | ❌ | | FLAVA | ❌ | ❌ | ✅ | ❌ | ❌ | | FNet | ✅ | ✅ | ✅ | ❌ | ❌ | @@ -347,7 +353,7 @@ Flax), PyTorch, et/ou TensorFlow. | RAG | ✅ | ❌ | ✅ | ✅ | ❌ | | REALM | ✅ | ✅ | ✅ | ❌ | ❌ | | Reformer | ✅ | ✅ | ✅ | ❌ | ❌ | -| RegNet | ❌ | ❌ | ✅ | ✅ | ❌ | +| RegNet | ❌ | ❌ | ✅ | ✅ | ✅ | | RemBERT | ✅ | ✅ | ✅ | ✅ | ❌ | | ResNet | ❌ | ❌ | ✅ | ✅ | ❌ | | RetriBERT | ✅ | ✅ | ✅ | ❌ | ❌ | @@ -403,4 +409,4 @@ Flax), PyTorch, et/ou TensorFlow. | YOLOS | ❌ | ❌ | ✅ | ❌ | ❌ | | YOSO | ❌ | ❌ | ✅ | ❌ | ❌ | - \ No newline at end of file + diff --git a/docs/source/fr/installation.md b/docs/source/fr/installation.md new file mode 100644 index 00000000000000..cd68911bc3564d --- /dev/null +++ b/docs/source/fr/installation.md @@ -0,0 +1,258 @@ + + +# Installation + +Installez 🤗 Transformers pour n'importe quelle librairie d'apprentissage profond avec laquelle vous avez l'habitude de travaillez, configurez votre cache et configurez 🤗 Transformers pour un usage hors ligne (facultatif). + +🤗 Transformers est testé avec Python 3.6+, PyTorch 1.1.0+, TensorFlow 2.0+ et Flax. +Consulter les instructions d'installation ci-dessous pour la librairie d'apprentissage profond que vous utilisez: + + * Instructions d'installation pour [PyTorch](https://pytorch.org/get-started/locally/). + * Instructions d'installation pour [TensorFlow 2.0](https://www.tensorflow.org/install/pip). + * Instructions d'installation pour [Flax](https://flax.readthedocs.io/en/latest/). + +## Installation avec pip + +Vous devriez installer 🤗 Transformers dans un [environnement virtuel](https://docs.python.org/3/library/venv.html). +Si vous n'êtes pas à l'aise avec les environnements virtuels, consultez ce [guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/). +Utiliser un environnement virtuel permet de facilement gérer différents projets et d'éviter des erreurs de compatibilité entre les différentes dépendances. + +Commencez par créer un environnement virtuel dans l'espace de travail de votre projet : + +```bash +python -m venv .env +``` + +Activez l'environnement virtuel. Sur Linux ou MacOs : + +```bash +source .env/bin/activate +``` + +Activez l'environnement virtuel sur Windows : + +```bash +.env/Scripts/activate +``` + +Maintenant, 🤗 Transformers peut être installé avec la commande suivante : + +```bash +pip install transformers +``` + +Pour une utilisation avec CPU seulement, 🤗 Transformers et la librairie d'apprentissage profond de votre choix peuvent être installés en une seule ligne. +Par exemple, installez 🤗 Transformers et PyTorch avec la commande suivante : + +```bash +pip install 'transformers[torch]' +``` + +🤗 Transformers et TensorFlow 2.0 : + +```bash +pip install 'transformers[tf-cpu]' +``` + + + +Pour les architectures mac M1 / ARM + +Vous devez installer les outils suivants avant d'installer TensorFLow 2.0 + +```bash +brew install cmake +brew install pkg-config +``` + + + +🤗 Transformers et Flax : + +```bash +pip install 'transformers[flax]' +``` + +Vérifiez que 🤗 Transformers a bien été installé avec la commande suivante. 
La commande va télécharger un modèle pré-entraîné : + +```bash +python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('we love you'))" +``` + +Le label et score sont ensuite affichés : + +```bash +[{'label': 'POSITIVE', 'score': 0.9998704791069031}] +``` + +## Installation depuis le code source + +Installez 🤗 Transformers depuis le code source avec la commande suivante : + +```bash +pip install git+https://github.com/huggingface/transformers +``` + +Cette commande installe la version depuis la branche `main` au lieu de la dernière version stable. La version de la branche `main` est utile pour avoir les derniers développements. Par exemple, si un bug a été résolu depuis la dernière version stable mais n'a pas encore été publié officiellement. Cependant, cela veut aussi dire que la version de la branche `main` n'est pas toujours stable. Nous nous efforçons de maintenir la version de la branche `main` opérationnelle, et la plupart des problèmes sont généralement résolus en l'espace de quelques heures ou d'un jour. Si vous recontrez un problème, n'hésitez pas à créer une [Issue](https://github.com/huggingface/transformers/issues) pour que l'on puisse trouver une solution au plus vite ! + +Vérifiez que 🤗 Transformers a bien été installé avec la commande suivante : + +```bash +python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I love you'))" +``` + +## Installation modifiable + +Vous aurez besoin d'une installation modifiable si vous le souhaitez : + + * Utiliser la version de la branche `main` du code source. + * Contribuer à 🤗 Transformers et vouler tester vos modifications du code source. + +Clonez le projet et installez 🤗 Transformers avec les commandes suivantes : + +```bash +git clone https://github.com/huggingface/transformers.git +cd transformers +pip install -e . +``` + +Ces commandes créent des liens entre le dossier où le projet a été cloné et les chemins de vos librairies Python. Python regardera maintenant dans le dossier que vous avez cloné en plus des dossiers où sont installées vos autres librairies. Par exemple, si vos librairies Python sont installées dans `~/anaconda3/envs/main/lib/python3.7/site-packages/`, Python cherchera aussi dans le dossier où vous avez cloné : `~/transformers/`. + + + +Vous devez garder le dossier `transformers` si vous voulez continuer d'utiliser la librairie. + + + +Maintenant, vous pouvez facilement mettre à jour votre clone avec la dernière version de 🤗 Transformers en utilisant la commande suivante : + +```bash +cd ~/transformers/ +git pull +``` + +Votre environnement Python utilisera la version de la branche `main` lors de la prochaine exécution. + +## Installation avec conda + +Installation via le canal `conda-forge` de conda : + +```bash +conda install conda-forge::transformers +``` + +## Configuration du cache + +Les modèles pré-entraînés sont téléchargés et mis en cache localement dans le dossier suivant : `~/.cache/huggingface/hub`. C'est le dossier par défaut donné par la variable d'environnement `TRANSFORMERS_CACHE`. Sur Windows, le dossier par défaut est `C:\Users\nom_utilisateur\.cache\huggingface\hub`. Vous pouvez modifier les variables d'environnement indiquées ci-dessous - par ordre de priorité - pour spécifier un dossier de cache différent : + +1. Variable d'environnement (par défaut) : `HUGGINGFACE_HUB_CACHE` ou `TRANSFORMERS_CACHE`. +2. Variable d'environnement : `HF_HOME`. +3. Variable d'environnement : `XDG_CACHE_HOME` + `/huggingface`. 
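À titre d'illustration, voici une esquisse minimale de deux façons de contrôler ce dossier de cache (le chemin `./mon_cache` est purement hypothétique) :

```py
import os

# Option 1 : définir la variable d'environnement avant d'importer 🤗 Transformers
os.environ["TRANSFORMERS_CACHE"] = "./mon_cache"

from transformers import AutoModel

# Option 2 : passer explicitement un dossier de cache lors du chargement d'un modèle
model = AutoModel.from_pretrained("distilbert/distilbert-base-uncased", cache_dir="./mon_cache")
```

Dans les deux cas, les fichiers téléchargés sont simplement stockés dans le dossier indiqué au lieu du dossier par défaut.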
+ + + +🤗 Transformers utilisera les variables d'environnement `PYTORCH_TRANSFORMERS_CACHE` ou `PYTORCH_PRETRAINED_BERT_CACHE` si vous utilisez une version précédente de cette librairie et avez défini ces variables d'environnement, sauf si vous spécifiez la variable d'environnement `TRANSFORMERS_CACHE`. + + + +## Mode hors ligne + +🤗 Transformers peut fonctionner dans un environnement cloisonné ou hors ligne en n'utilisant que des fichiers locaux. Définissez la variable d'environnement `TRANSFORMERS_OFFLINE=1` pour activer ce mode. + + + +Ajoutez [🤗 Datasets](https://huggingface.co/docs/datasets/) à votre processus d'entraînement hors ligne en définissant la variable d'environnement `HF_DATASETS_OFFLINE=1`. + + + +```bash +HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \ +python examples/pytorch/translation/run_translation.py --model_name_or_path google-t5/t5-small --dataset_name wmt16 --dataset_config ro-en ... +``` + +Le script devrait maintenant s'exécuter sans rester en attente ou attendre une expiration, car il n'essaiera pas de télécharger des modèle sur le Hub. + +Vous pouvez aussi éviter de télécharger un modèle à chaque appel de la fonction [`~PreTrainedModel.from_pretrained`] en utilisant le paramètre [local_files_only]. Seuls les fichiers locaux sont chargés lorsque ce paramètre est activé (c.-à-d. `local_files_only=True`) : + +```py +from transformers import T5Model + +model = T5Model.from_pretrained("./path/to/local/directory", local_files_only=True) +``` + +### Récupérer des modèles et des tokenizers pour une utilisation hors ligne + +Une autre option pour utiliser 🤗 Transformers hors ligne est de télécharger les fichiers à l'avance, puis d'utiliser les chemins locaux lorsque vous en avez besoin en mode hors ligne. Il existe trois façons de faire cela : + + * Téléchargez un fichier via l'interface utilisateur sur le [Model Hub](https://huggingface.co/models) en cliquant sur l'icône ↓. + + ![download-icon](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/download-icon.png) + + * Utilisez les fonctions [`PreTrainedModel.from_pretrained`] et [`PreTrainedModel.save_pretrained`] : + + 1. Téléchargez vos fichiers à l'avance avec [`PreTrainedModel.from_pretrained`]: + + ```py + >>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM + + >>> tokenizer = AutoTokenizer.from_pretrained("bigscience/T0_3B") + >>> model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0_3B") + ``` + + 2. Sauvegardez les fichiers dans un dossier de votre choix avec [`PreTrainedModel.save_pretrained`]: + + ```py + >>> tokenizer.save_pretrained("./your/path/bigscience_t0") + >>> model.save_pretrained("./your/path/bigscience_t0") + ``` + + 3. Maintenant, lorsque vous êtes hors ligne, rechargez vos fichiers avec [`PreTrainedModel.from_pretrained`] depuis le dossier où vous les avez sauvegardés : + + ```py + >>> tokenizer = AutoTokenizer.from_pretrained("./your/path/bigscience_t0") + >>> model = AutoModel.from_pretrained("./your/path/bigscience_t0") + ``` + + * Téléchargez des fichiers de manière automatique avec la librairie [huggingface_hub](https://github.com/huggingface/huggingface_hub/tree/main/src/huggingface_hub) : + + 1. Installez la librairie `huggingface_hub` dans votre environnement virtuel : + + ```bash + python -m pip install huggingface_hub + ``` + + 2. Utilisez la fonction [`hf_hub_download`](https://huggingface.co/docs/hub/adding-a-library#download-files-from-the-hub) pour télécharger un fichier vers un chemin de votre choix. 
Par exemple, la commande suivante télécharge le fichier `config.json` du modèle [T0](https://huggingface.co/bigscience/T0_3B) vers le chemin de votre choix : + + ```py + >>> from huggingface_hub import hf_hub_download + + >>> hf_hub_download(repo_id="bigscience/T0_3B", filename="config.json", cache_dir="./your/path/bigscience_t0") + ``` + +Une fois que votre fichier est téléchargé et caché localement, spécifiez son chemin local pour le charger et l'utiliser : + +```py +>>> from transformers import AutoConfig + +>>> config = AutoConfig.from_pretrained("./your/path/bigscience_t0/config.json") +``` + + + +Consultez la section [How to download files from the Hub (Comment télécharger des fichiers depuis le Hub)](https://huggingface.co/docs/hub/how-to-downstream) pour plus de détails sur le téléchargement de fichiers stockés sur le Hub. + + diff --git a/docs/source/fr/quicktour.mdx b/docs/source/fr/quicktour.md similarity index 86% rename from docs/source/fr/quicktour.mdx rename to docs/source/fr/quicktour.md index 33fce8b1c0b213..f76764f103387a 100644 --- a/docs/source/fr/quicktour.mdx +++ b/docs/source/fr/quicktour.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Visite rapide @@ -23,15 +27,16 @@ Avant de commencer, assurez-vous que vous avez installé toutes les bibliothèqu ``` Vous aurez aussi besoin d'installer votre bibliothèque d'apprentissage profond favorite : -You'll also need to install your preferred machine learning framework: + ```bash pip install torch ``` + ```bash pip install tensorflow ``` @@ -46,19 +51,19 @@ Le [`pipeline`] est le moyen le plus simple d'utiliser un modèle pré-entraîn | **Tâche** | **Description** | **Modalité** | **Identifiant du pipeline** | |------------------------------|--------------------------------------------------------------------------------------------------------------|----------------------|-----------------------------------------------| -| Classification de texte | Attribut une catégorie à une séquence de texte donnée | Texte | pipeline(task="sentiment-analysis") | +| Classification de texte | Attribue une catégorie à une séquence de texte donnée | Texte | pipeline(task="sentiment-analysis") | | Génération de texte | Génère du texte à partir d'une consigne donnée | Texte | pipeline(task="text-generation") | -| Reconnaissance de token nommé | Attribut une catégorie à chaque token dans une séquence (personnes, organisation, localisation, etc.) | Texte | pipeline(task="ner") | +| Reconnaissance de token nommé | Attribue une catégorie à chaque token dans une séquence (personnes, organisation, localisation, etc.) 
| Texte | pipeline(task="ner") | | Question réponse | Extrait une réponse du texte en fonction du contexte et d'une question | Texte | pipeline(task="question-answering") | | Prédiction de token masqué | Prédit correctement le token masqué dans une séquence | Texte | pipeline(task="fill-mask") | | Génération de résumé | Génère un résumé d'une séquence de texte donnée ou d'un document | Texte | pipeline(task="summarization") | | Traduction | Traduit du texte d'un langage à un autre | Texte | pipeline(task="translation") | -| Classification d'image | Attribut une catégorie à une image | Image | pipeline(task="image-classification") | -| Segmentation d'image | Attribut une catégorie à chaque pixel d'une image (supporte la segmentation sémantique, panoptique et d'instance) | Image | pipeline(task="image-segmentation") | -| Détection d'objects | Prédit les delimitations et catégories d'objects dans une image | Image | pipeline(task="object-detection") | -| Classification d'audio | Attribut une catégorie à un fichier audio | Audio | pipeline(task="audio-classification") | +| Classification d'image | Attribue une catégorie à une image | Image | pipeline(task="image-classification") | +| Segmentation d'image | Attribue une catégorie à chaque pixel d'une image (supporte la segmentation sémantique, panoptique et d'instance) | Image | pipeline(task="image-segmentation") | +| Détection d'objets | Prédit les délimitations et catégories d'objets dans une image | Image | pipeline(task="object-detection") | +| Classification d'audio | Attribue une catégorie à un fichier audio | Audio | pipeline(task="audio-classification") | | Reconnaissance automatique de la parole | Extrait le discours d'un fichier audio en texte | Audio | pipeline(task="automatic-speech-recognition") | -| Question réponse visuels | Etant donné une image et une question, répond correctement à une question sur l'image | Modalités multiples | pipeline(task="vqa") | +| Question réponse visuels | Etant données une image et une question, répond correctement à une question sur l'image | Modalités multiples | pipeline(task="vqa") | Commencez par créer une instance de [`pipeline`] et spécifiez la tâche pour laquelle vous souhaitez l'utiliser. Vous pouvez utiliser le [`pipeline`] pour n'importe laquelle des tâches mentionnées dans le tableau précédent. Pour obtenir une liste complète des tâches prises en charge, consultez la documentation de l'[API pipeline](./main_classes/pipelines). Dans ce guide, nous utiliserons le [`pipeline`] pour l'analyse des sentiments à titre d'exemple : @@ -68,7 +73,7 @@ Commencez par créer une instance de [`pipeline`] et spécifiez la tâche pour l >>> classifier = pipeline("sentiment-analysis") ``` -Le [`pipeline`] télécharge et cache un [modèle pré-entraîné](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) et un tokenizer par défaut pour l'analyse des sentiments. Vous pouvez maintenant utiliser le `classifier` sur le texte de votre choix : +Le [`pipeline`] télécharge et stocke en cache un [modèle pré-entraîné](https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english) et un tokenizer par défaut pour l'analyse des sentiments. 
Vous pouvez maintenant utiliser le `classifier` sur le texte de votre choix : ```py >>> classifier("We are very happy to show you the 🤗 Transformers library.") @@ -85,7 +90,7 @@ label: POSITIVE, avec le score de: 0.9998 label: NEGATIVE, avec le score de: 0.5309 ``` -Le [`pipeline`] peut aussi itérer sur un jeu de données entier pour n'importe quelle tâche. Prenons la reconnaissance automatique de la parole pour example : +Le [`pipeline`] peut aussi itérer sur un jeu de données entier pour n'importe quelle tâche. Prenons par exemple la reconnaissance automatique de la parole : ```py >>> import torch @@ -94,7 +99,7 @@ Le [`pipeline`] peut aussi itérer sur un jeu de données entier pour n'importe >>> speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h") ``` -Chargez un jeu de données audio (voir le 🤗 Datasets [Quick Start](https://huggingface.co/docs/datasets/quickstart#audio) pour plus de détails) sur lequel vous souhaitez itérer. Pour cet example, nous chargons le jeu de données [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) : +Chargez un jeu de données audio (voir le 🤗 Datasets [Quick Start](https://huggingface.co/docs/datasets/quickstart#audio) pour plus de détails) sur lequel vous souhaitez itérer. Pour cet exemple, nous chargeons le jeu de données [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) : ```py >>> from datasets import load_dataset, Audio @@ -109,7 +114,7 @@ Vous devez vous assurer que le taux d'échantillonnage de l'ensemble de données ``` Les fichiers audio sont automatiquement chargés et rééchantillonnés lors de l'appel de la colonne `"audio"`. -Extrayez les tableaux de formes d'ondes brutes des 4 premiers échantillons et passez-les comme une liste au pipeline : +Extrayez les tableaux de formes d'ondes brutes des quatre premiers échantillons et passez-les comme une liste au pipeline : ```py >>> result = speech_recognizer(dataset[:4]["audio"]) @@ -117,11 +122,11 @@ Extrayez les tableaux de formes d'ondes brutes des 4 premiers échantillons et p ['I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT', "FODING HOW I'D SET UP A JOIN TO HET WITH MY WIFE AND WHERE THE AP MIGHT BE", "I I'D LIKE TOY SET UP A JOINT ACCOUNT WITH MY PARTNER I'M NOT SEEING THE OPTION TO DO IT ON THE AP SO I CALLED IN TO GET SOME HELP CAN I JUST DO IT OVER THE PHONE WITH YOU AND GIVE YOU THE INFORMATION OR SHOULD I DO IT IN THE AP AND I'M MISSING SOMETHING UQUETTE HAD PREFERRED TO JUST DO IT OVER THE PHONE OF POSSIBLE THINGS", 'HOW DO I THURN A JOIN A COUNT'] ``` -Pour les ensembles de données plus importants où les entrées sont volumineuses (comme dans les domaines de la parole ou de la vision), vous devriez utiliser un générateur au lieu d'une liste pour charger toutes les entrées en mémoire. Pour plus d'informations, consultez la documentation de l'[API pipeline](./main_classes/pipelines). +Pour les ensembles de données plus importants où les entrées sont volumineuses (comme dans les domaines de la parole ou de la vision), utilisez plutôt un générateur au lieu d'une liste pour charger toutes les entrées en mémoire. Pour plus d'informations, consultez la documentation de l'[API pipeline](./main_classes/pipelines). ### Utiliser une autre modèle et tokenizer dans le pipeline -Le [`pipeline`] peut être utiliser avec n'importe quel modèle du [Hub](https://huggingface.co/models), ce qui permet d'adapter facilement le [`pipeline`] à d'autres cas d'utilisation. 
Par exemple, si vous souhaitez un modèle capable de traiter du texte français, utilisez les filtres du Hub pour trouver un modèle approprié. Le premier résultat renvoie un [modèle BERT](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment) multilingue finetuné pour l'analyse des sentiments que vous pouvez utiliser pour le texte français : +Le [`pipeline`] peut être utilisé avec n'importe quel modèle du [Hub](https://huggingface.co/models), ce qui permet d'adapter facilement le [`pipeline`] à d'autres cas d'utilisation. Par exemple, si vous souhaitez un modèle capable de traiter du texte français, utilisez les filtres du Hub pour trouver un modèle approprié. Le premier résultat renvoie un [modèle BERT](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment) multilingue finetuné pour l'analyse des sentiments que vous pouvez utiliser pour le texte français : ```py >>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" @@ -150,7 +155,7 @@ Utilisez [`TFAutoModelForSequenceClassification`] et [`AutoTokenizer`] pour char -Specifiez le modèle et le tokenizer dans le [`pipeline`], et utilisez le `classifier` sur le texte en français : +Spécifiez le modèle et le tokenizer dans le [`pipeline`], et utilisez le `classifier` sur le texte en français : ```py >>> classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer) @@ -164,13 +169,13 @@ Si vous ne parvenez pas à trouver un modèle adapté à votre cas d'utilisation -Les classes [`AutoModelForSequenceClassification`] et [`AutoTokenizer`] fonctionnent ensemble pour créer un [`pipeline`] comme ceux que vous avez utilisé ci-dessus. Une [AutoClass](./model_doc/auto) est un raccourci qui récupère automatiquement l'architecture d'un modèle pré-entraîné à partir de son nom ou de son emplacement. Il vous suffit de sélectionner l'`AutoClass` appropriée à votre tâche et la classe de prétraitement qui lui est associée. +Les classes [`AutoModelForSequenceClassification`] et [`AutoTokenizer`] fonctionnent ensemble pour créer un [`pipeline`] comme celui que vous avez utilisé ci-dessus. Une [AutoClass](./model_doc/auto) est un raccourci qui récupère automatiquement l'architecture d'un modèle pré-entraîné à partir de son nom ou de son emplacement. Il vous suffit de sélectionner l'`AutoClass` appropriée à votre tâche et la classe de prétraitement qui lui est associée. Reprenons l'exemple de la section précédente et voyons comment vous pouvez utiliser l'`AutoClass` pour reproduire les résultats du [`pipeline`]. ### AutoTokenizer -Un tokenizer est chargé de prétraiter le texte pour en faire un tableau de chiffres qui servira d'entrée à un modèle. De nombreuses règles régissent le processus de tokenisation, notamment la manière de diviser un mot et le niveau auquel les mots doivent être divisés (pour en savoir plus sur la tokenisation, consultez le [résumé](./tokenizer_summary)). La chose la plus importante à retenir est que vous devez instancier un tokenizer avec le même nom de modèle pour vous assurer que vous utilisez les mêmes règles de tokenisation que celles avec lesquelles un modèle a été pré-entra^né. +Un tokenizer est chargé de prétraiter le texte pour en faire un tableau de chiffres qui servira d'entrée à un modèle. De nombreuses règles régissent le processus de tokenisation, notamment la manière de diviser un mot et le niveau auquel les mots doivent être divisés (pour en savoir plus sur la tokenisation, consultez le [résumé](./tokenizer_summary)). 
La chose la plus importante à retenir est que vous devez instancier un tokenizer avec le même nom de modèle pour vous assurer que vous utilisez les mêmes règles de tokenisation que celles avec lesquelles un modèle a été pré-entraîné. Chargez un tokenizer avec [`AutoTokenizer`] : @@ -200,6 +205,7 @@ Un tokenizer peut également accepter une liste de textes, et remplir et tronque + ```py >>> pt_batch = tokenizer( ... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."], @@ -211,6 +217,7 @@ Un tokenizer peut également accepter une liste de textes, et remplir et tronque ``` + ```py >>> tf_batch = tokenizer( ... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."], @@ -225,7 +232,7 @@ Un tokenizer peut également accepter une liste de textes, et remplir et tronque -Consultez le tutoriel [prétraitement](./preprocessing) pour plus de détails sur la tokenisation, et sur la manière d'utiliser un [`AutoImageProcessor`], un [`AutoFeatureExtractor`] et un [`AutoProcessor`] pour prétraiter les image, l'audio et les contenus multimodaux. +Consultez le tutoriel [prétraitement](./preprocessing) pour plus de détails sur la tokenisation, et sur la manière d'utiliser un [`AutoImageProcessor`], un [`AutoFeatureExtractor`] et un [`AutoProcessor`] pour prétraiter les images, l'audio et les contenus multimodaux. @@ -233,7 +240,7 @@ Consultez le tutoriel [prétraitement](./preprocessing) pour plus de détails su -🤗 Transformers fournit un moyen simple et unifié de charger des instances pré-entraînés. Cela signifie que vous pouvez charger un [`AutoModel`] comme vous chargeriez un [`AutoTokenizer`]. La seule différence est de sélectionner l'[`AutoModel`] approprié pour la tâche. Pour une classification de texte (ou de séquence de textes), vous devez charger [`AutoModelForSequenceClassification`] : +🤗 Transformers fournit un moyen simple et unifié de charger des instances pré-entraînées. Cela signifie que vous pouvez charger un [`AutoModel`] comme vous chargeriez un [`AutoTokenizer`]. La seule différence est de sélectionner l'[`AutoModel`] approprié pour la tâche. Pour une classification de texte (ou de séquence de textes), vous devez charger [`AutoModelForSequenceClassification`] : ```py >>> from transformers import AutoModelForSequenceClassification @@ -343,6 +350,7 @@ Une fonctionnalité particulièrement cool 🤗 Transformers est la possibilité + ```py >>> from transformers import AutoModel @@ -351,6 +359,7 @@ Une fonctionnalité particulièrement cool 🤗 Transformers est la possibilité ``` + ```py >>> from transformers import TFAutoModel @@ -369,7 +378,7 @@ Commencez par importer [`AutoConfig`], puis chargez le modèle pré-entraîné q ```py >>> from transformers import AutoConfig ->>> my_config = AutoConfig.from_pretrained("distilbert-base-uncased", n_heads=12) +>>> my_config = AutoConfig.from_pretrained("distilbert/distilbert-base-uncased", n_heads=12) ``` @@ -406,10 +415,10 @@ En fonction de votre tâche, vous passerez généralement les paramètres suivan ```py >>> from transformers import AutoModelForSequenceClassification - >>> model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") + >>> model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` -2. [`TrainingArguments`] contient les hyperparamètres du modèle que vous pouvez changer comme le taux d'apprentissage, la taille due l'échantillon, et le nombre d'époques pour s'entraîner. 
Les valeurs par défaut sont utilisées si vous ne spécifiez pas d'hyperparamètres d'apprentissage : +2. [`TrainingArguments`] contient les hyperparamètres du modèle que vous pouvez changer comme le taux d'apprentissage, la taille de l'échantillon, et le nombre d'époques pour s'entraîner. Les valeurs par défaut sont utilisées si vous ne spécifiez pas d'hyperparamètres d'apprentissage : ```py >>> from transformers import TrainingArguments @@ -428,7 +437,7 @@ En fonction de votre tâche, vous passerez généralement les paramètres suivan ```py >>> from transformers import AutoTokenizer - >>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") + >>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") ``` 4. Chargez un jeu de données : @@ -446,7 +455,7 @@ En fonction de votre tâche, vous passerez généralement les paramètres suivan ... return tokenizer(dataset["text"]) ``` - Then apply it over the entire dataset with [`~datasets.Dataset.map`]: + Puis appliquez-la à l'intégralité du jeu de données avec [`~datasets.Dataset.map`]: ```py >>> dataset = dataset.map(tokenize_dataset, batched=True) @@ -475,7 +484,7 @@ Maintenant, rassemblez tous ces éléments dans un [`Trainer`] : ... ) # doctest: +SKIP ``` -Une fois que vous êtes prêt, vous appelez la fonction [`~Trainer.train`] pour commencer l'entraînement : +Une fois que vous êtes prêt, appelez la fonction [`~Trainer.train`] pour commencer l'entraînement : ```py >>> trainer.train() # doctest: +SKIP @@ -487,7 +496,7 @@ Pour les tâches - comme la traduction ou la génération de résumé - qui util
-Vous pouvez personnaliser le comportement de la boucle d'apprentissage en redéfinissant les méthodes à l'intérieur de [`Trainer`]. Cela vous permet de personnaliser des caractéristiques telles que la fonction de perte, l'optimiseur et le planificateur. Consultez la documentation de [`Trainer`] pour savoir quelles méthodes peuvent être redéfinites. +Vous pouvez personnaliser le comportement de la boucle d'apprentissage en redéfinissant les méthodes à l'intérieur de [`Trainer`]. Cela vous permet de personnaliser des caractéristiques telles que la fonction de perte, l'optimiseur et le planificateur. Consultez la documentation de [`Trainer`] pour savoir quelles méthodes peuvent être redéfinies. L'autre moyen de personnaliser la boucle d'apprentissage est d'utiliser les [Callbacks](./main_classes/callbacks). Vous pouvez utiliser les callbacks pour intégrer d'autres bibliothèques et inspecter la boucle d'apprentissage afin de suivre la progression ou d'arrêter l'apprentissage plus tôt. Les callbacks ne modifient rien dans la boucle d'apprentissage elle-même. Pour personnaliser quelque chose comme la fonction de perte, vous devez redéfinir le [`Trainer`] à la place. @@ -500,7 +509,7 @@ Tous les modèles sont des modèles standard [`tf.keras.Model`](https://www.tens ```py >>> from transformers import TFAutoModelForSequenceClassification - >>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") + >>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` 2. Une classe de prétraitement comme un tokenizer, un processeur d'images ou un extracteur de caractéristiques : @@ -508,7 +517,7 @@ Tous les modèles sont des modèles standard [`tf.keras.Model`](https://www.tens ```py >>> from transformers import AutoTokenizer - >>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") + >>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") ``` 3. Créez une fonction qui transforme le texte du jeu de données en token : @@ -518,7 +527,7 @@ Tous les modèles sont des modèles standard [`tf.keras.Model`](https://www.tens ... return tokenizer(dataset["text"]) # doctest: +SKIP ``` -4. Appliquez le tokenizer sur l'ensemble du jeu de données avec [`~datasets.Dataset.map`] et passez ensuite le jeu de données et le tokenizer à [`~TFPreTrainedModel.prepare_tf_dataset`]. Vous pouvez également modifier la taille de l'échantillon et mélanger le jeu de données ici si vous le souhaitez : +4. Appliquez le tokenizer à l'ensemble du jeu de données avec [`~datasets.Dataset.map`] et passez ensuite le jeu de données et le tokenizer à [`~TFPreTrainedModel.prepare_tf_dataset`]. Vous pouvez également modifier la taille de l'échantillon et mélanger le jeu de données ici si vous le souhaitez : ```py >>> dataset = dataset.map(tokenize_dataset) # doctest: +SKIP @@ -527,7 +536,7 @@ Tous les modèles sont des modèles standard [`tf.keras.Model`](https://www.tens ... ) # doctest: +SKIP ``` -5. Une fois que vous êtes prêt, vous appelez les fonctions `compile` et `fit` pour commencer l'entraînement : +5. Une fois que vous êtes prêt, appelez les fonctions `compile` et `fit` pour commencer l'entraînement : ```py >>> from tensorflow.keras.optimizers import Adam @@ -538,4 +547,4 @@ Tous les modèles sont des modèles standard [`tf.keras.Model`](https://www.tens ## Et après ? 
-Maintenant que vous avez terminé la visite rapide de 🤗 Transformers, consultez nos guides et apprenez à faire des choses plus spécifiques comme créer un modèle personnalisé, finetuner un modèle pour une tâche, et comment entraîner un modèle avec un script. Si vous souhaitez en savoir plus sur les concepts fondamentaux de 🤗 Transformers, jetez un œil à nos guides conceptuels ! \ No newline at end of file +Maintenant que vous avez terminé la visite rapide de 🤗 Transformers, consultez nos guides et apprenez à faire des choses plus spécifiques comme créer un modèle personnalisé, finetuner un modèle pour une tâche, et comment entraîner un modèle avec un script. Si vous souhaitez en savoir plus sur les concepts fondamentaux de 🤗 Transformers, jetez un œil à nos guides conceptuels ! diff --git a/docs/source/hi/_toctree.yml b/docs/source/hi/_toctree.yml new file mode 100644 index 00000000000000..546a8663cc4d88 --- /dev/null +++ b/docs/source/hi/_toctree.yml @@ -0,0 +1,3 @@ +- sections: + - local: pipeline_tutorial + title: पाइपलाइनों के साथ अनुमान चलाएँ \ No newline at end of file diff --git a/docs/source/hi/pipeline_tutorial.md b/docs/source/hi/pipeline_tutorial.md new file mode 100644 index 00000000000000..5f3cd680480d63 --- /dev/null +++ b/docs/source/hi/pipeline_tutorial.md @@ -0,0 +1,317 @@ + + +# अनुमान के लिए पाइपलाइन + +[`pipeline`] किसी भी भाषा, कंप्यूटर दृष्टि, भाषण और मल्टीमॉडल कार्यों पर अनुमान लगाने के लिए [Hub](https://huggingface.co/models) से किसी भी मॉडल का उपयोग करना आसान बनाता है। भले ही आपके पास किसी विशिष्ट तौर-तरीके का अनुभव न हो या आप मॉडलों के पीछे अंतर्निहित कोड से परिचित न हों, फिर भी आप [`pipeline`] के अनुमान के लिए उनका उपयोग कर सकते हैं! यह ट्यूटोरियल आपको ये सिखाएगा: + +* अनुमान के लिए [`pipeline`] का उपयोग करें। +* एक विशिष्ट टोकननाइज़र या मॉडल का उपयोग करें। +* ऑडियो, विज़न और मल्टीमॉडल कार्यों के लिए [`pipeline`] का उपयोग करें। + + + +समर्थित कार्यों और उपलब्ध मापदंडों की पूरी सूची के लिए [`pipeline`] दस्तावेज़ पर एक नज़र डालें। + + + +## पाइपलाइन का उपयोग + +जबकि प्रत्येक कार्य में एक संबद्ध [`pipeline`] होता है, सामान्य [`pipeline`] अमूर्त का उपयोग करना आसान होता है जिसमें शामिल होता है +सभी कार्य-विशिष्ट पाइपलाइनें। [`pipeline`] स्वचालित रूप से एक डिफ़ॉल्ट मॉडल और सक्षम प्रीप्रोसेसिंग क्लास लोड करता है +आपके कार्य के लिए अनुमान का. आइए स्वचालित वाक् पहचान (एएसआर) के लिए [`pipeline`] का उपयोग करने का उदाहरण लें, या +वाक्-से-पाठ. + + +1. एक [`pipeline`] बनाकर प्रारंभ करें और अनुमान कार्य निर्दिष्ट करें: + +```py +>>> from transformers import pipeline + +>>> transcriber = pipeline(task="automatic-speech-recognition") +``` + +2. अपना इनपुट [`pipeline`] पर भेजें। वाक् पहचान के मामले में, यह एक ऑडियो इनपुट फ़ाइल है: + +```py +>>> transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +{'text': 'I HAVE A DREAM BUT ONE DAY THIS NATION WILL RISE UP LIVE UP THE TRUE MEANING OF ITS TREES'} +``` + +क्या वह परिणाम नहीं जो आपके मन में था? कुछ [सबसे अधिक डाउनलोड किए गए स्वचालित वाक् पहचान मॉडल](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=trending) देखें +यह देखने के लिए हब पर जाएं कि क्या आपको बेहतर ट्रांस्क्रिप्शन मिल सकता है। + +आइए OpenAI से [व्हिस्पर लार्ज-v2](https://huggingface.co/openai/whisper-large) मॉडल आज़माएं। व्हिस्पर जारी किया गया +Wav2Vec2 की तुलना में 2 साल बाद, और लगभग 10 गुना अधिक डेटा पर प्रशिक्षित किया गया था। इस प्रकार, यह अधिकांश डाउनस्ट्रीम पर Wav2Vec2 को मात देता है +बेंचमार्क. इसमें विराम चिह्न और आवरण की भविष्यवाणी करने का अतिरिक्त लाभ भी है, जिनमें से कोई भी संभव नहीं है +Wav2Vec2. 
+ +आइए इसे यहां आज़माकर देखें कि यह कैसा प्रदर्शन करता है: + +```py +>>> transcriber = pipeline(model="openai/whisper-large-v2") +>>> transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'} +``` + +अब यह परिणाम अधिक सटीक दिखता है! Wav2Vec2 बनाम व्हिस्पर पर गहन तुलना के लिए, [ऑडियो ट्रांसफॉर्मर्स कोर्स](https://huggingface.co/learn/audio-course/chapter5/asr_models) देखें। +हम वास्तव में आपको विभिन्न भाषाओं में मॉडल, आपके क्षेत्र में विशेषीकृत मॉडल और बहुत कुछ के लिए हब की जांच करने के लिए प्रोत्साहित करते हैं। +आप हब पर सीधे अपने ब्राउज़र से मॉडल परिणामों की जांच और तुलना कर सकते हैं कि यह फिट बैठता है या नहीं +अन्य मामलों की तुलना में कोने के मामलों को बेहतर ढंग से संभालता है। +और यदि आपको अपने उपयोग के मामले के लिए कोई मॉडल नहीं मिलता है, तो आप हमेशा अपना खुद का [प्रशिक्षण](training) शुरू कर सकते हैं! + +यदि आपके पास कई इनपुट हैं, तो आप अपने इनपुट को एक सूची के रूप में पास कर सकते हैं: + +```py +transcriber( + [ + "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac", + "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac", + ] +) +``` + +पाइपलाइनें प्रयोग के लिए बहुत अच्छी हैं क्योंकि एक मॉडल से दूसरे मॉडल पर स्विच करना मामूली काम है; हालाँकि, प्रयोग की तुलना में बड़े कार्यभार के लिए उन्हें अनुकूलित करने के कुछ तरीके हैं। संपूर्ण डेटासेट पर पुनरावृत्ति करने या वेबसर्वर में पाइपलाइनों का उपयोग करने के बारे में निम्नलिखित मार्गदर्शिकाएँ देखें: +दस्तावेज़ों में से: +* [डेटासेट पर पाइपलाइनों का उपयोग करना](#using-pipelines-on-a-dataset) +* [वेबसर्वर के लिए पाइपलाइनों का उपयोग करना](./pipeline_webserver) + +## प्राचल + +[`pipeline`] कई मापदंडों का समर्थन करता है; कुछ कार्य विशिष्ट हैं, और कुछ सभी पाइपलाइनों के लिए सामान्य हैं। +सामान्य तौर पर, आप अपनी इच्छानुसार कहीं भी पैरामीटर निर्दिष्ट कर सकते हैं: + +```py +transcriber = pipeline(model="openai/whisper-large-v2", my_parameter=1) + +out = transcriber(...) # This will use `my_parameter=1`. +out = transcriber(..., my_parameter=2) # This will override and use `my_parameter=2`. +out = transcriber(...) # This will go back to using `my_parameter=1`. +``` + +आइए 3 महत्वपूर्ण बातों पर गौर करें: + +### उपकरण + +यदि आप `device=0` का उपयोग करते हैं, तो पाइपलाइन स्वचालित रूप से मॉडल को निर्दिष्ट डिवाइस पर डाल देती है। +यह इस पर ध्यान दिए बिना काम करेगा कि आप PyTorch या Tensorflow का उपयोग कर रहे हैं या नहीं। + +```py +transcriber = pipeline(model="openai/whisper-large-v2", device=0) +``` + +यदि मॉडल एकल GPU के लिए बहुत बड़ा है और आप PyTorch का उपयोग कर रहे हैं, तो आप `device_map="auto"` को स्वचालित रूप से सेट कर सकते हैं +निर्धारित करें कि मॉडल वज़न को कैसे लोड और संग्रहीत किया जाए। `device_map` तर्क का उपयोग करने के लिए 🤗 [Accelerate](https://huggingface.co/docs/accelerate) की आवश्यकता होती है +पैकेट: + +```bash +pip install --upgrade accelerate +``` + +निम्नलिखित कोड स्वचालित रूप से सभी डिवाइसों में मॉडल भार को लोड और संग्रहीत करता है: + +```py +transcriber = pipeline(model="openai/whisper-large-v2", device_map="auto") +``` + +ध्यान दें कि यदि `device_map='auto'` पारित हो गया है, तो अपनी `pipeline` को चालू करते समय `device=device` तर्क जोड़ने की कोई आवश्यकता नहीं है क्योंकि आपको कुछ अप्रत्याशित व्यवहार का सामना करना पड़ सकता है! 
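As a quick illustration of the `device` argument described above, here is a minimal sketch (assuming PyTorch is installed) that falls back to CPU when no GPU is visible. For pipelines, `device=-1` means CPU and `device=0` is the first GPU; pass either `device` or `device_map="auto"`, not both:

```py
# Minimal sketch (assumes PyTorch): use the first GPU if one is available, otherwise stay on CPU.
# Do not combine this with device_map="auto" — pass one or the other.
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1  # -1 keeps the pipeline on CPU
transcriber = pipeline(model="openai/whisper-large-v2", device=device)
```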
+ +### बैच का आकार + +डिफ़ॉल्ट रूप से, पाइपलाइनें [यहां](https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching) विस्तार से बताए गए कारणों के लिए बैच अनुमान नहीं लगाएंगी। इसका कारण यह है कि बैचिंग आवश्यक रूप से तेज़ नहीं है, और वास्तव में कुछ मामलों में काफी धीमी हो सकती है। + +लेकिन अगर यह आपके उपयोग के मामले में काम करता है, तो आप इसका उपयोग कर सकते हैं: + +```py +transcriber = pipeline(model="openai/whisper-large-v2", device=0, batch_size=2) +audio_filenames = [f"https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/{i}.flac" for i in range(1, 5)] +texts = transcriber(audio_filenames) +``` + +यह प्रदान की गई 4 ऑडियो फाइलों पर पाइपलाइन चलाता है, लेकिन यह उन्हें 2 के बैच में पास करेगा +आपसे किसी और कोड की आवश्यकता के बिना मॉडल (जो एक जीपीयू पर है, जहां बैचिंग से मदद मिलने की अधिक संभावना है) पर जाएं। +आउटपुट हमेशा उसी से मेल खाना चाहिए जो आपको बैचिंग के बिना प्राप्त हुआ होगा। इसका उद्देश्य केवल पाइपलाइन से अधिक गति प्राप्त करने में आपकी सहायता करना है। + +पाइपलाइनें बैचिंग की कुछ जटिलताओं को भी कम कर सकती हैं क्योंकि, कुछ पाइपलाइनों के लिए, एक एकल आइटम (जैसे एक लंबी ऑडियो फ़ाइल) को एक मॉडल द्वारा संसाधित करने के लिए कई भागों में विभाजित करने की आवश्यकता होती है। पाइपलाइन आपके लिए यह [*chunk batching*](./main_classes/pipelines#pipeline-chunk-batching) करती है। + +### कार्य विशिष्ट प्राचल + +सभी कार्य कार्य विशिष्ट प्राचल प्रदान करते हैं जो आपको अपना काम पूरा करने में मदद करने के लिए अतिरिक्त लचीलेपन और विकल्पों की अनुमति देते हैं। +उदाहरण के लिए, [`transformers.AutomaticSpeechRecognitionPipeline.__call__`] विधि में एक `return_timestamps` प्राचल है जो वीडियो उपशीर्षक के लिए आशाजनक लगता है: + + +```py +>>> transcriber = pipeline(model="openai/whisper-large-v2", return_timestamps=True) +>>> transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.', 'chunks': [{'timestamp': (0.0, 11.88), 'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its'}, {'timestamp': (11.88, 12.38), 'text': ' creed.'}]} +``` + +जैसा कि आप देख सकते हैं, मॉडल ने पाठ का अनुमान लगाया और **when** विभिन्न वाक्यों का उच्चारण किया गया तो आउटपुट भी दिया। + +प्रत्येक कार्य के लिए कई प्राचल उपलब्ध हैं, इसलिए यह देखने के लिए कि आप किसके साथ छेड़छाड़ कर सकते हैं, प्रत्येक कार्य का API संदर्भ देखें! +उदाहरण के लिए, [`~transformers.AutomaticSpeechRecognitionPipeline`] में एक `chunk_length_s` प्राचल है जो सहायक है +वास्तव में लंबी ऑडियो फ़ाइलों पर काम करने के लिए (उदाहरण के लिए, संपूर्ण फिल्मों या घंटे-लंबे वीडियो को उपशीर्षक देना) जो आमतौर पर एक मॉडल होता है +अपने आप संभाल नहीं सकता: + +```python +>>> transcriber = pipeline(model="openai/whisper-large-v2", chunk_length_s=30, return_timestamps=True) +>>> transcriber("https://huggingface.co/datasets/sanchit-gandhi/librispeech_long/resolve/main/audio.wav") +{'text': " Chapter 16. I might have told you of the beginning of this liaison in a few lines, but I wanted you to see every step by which we came. I, too, agree to whatever Marguerite wished, Marguerite to be unable to live apart from me. It was the day after the evening... +``` + +यदि आपको कोई ऐसा पैरामीटर नहीं मिल रहा है जो वास्तव में आपकी मदद करेगा, तो बेझिझक [अनुरोध करें](https://github.com/huggingface/transformers/issues/new?assignees=&labels=feature&template=feature-request.yml)! 
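To make the interplay of these options concrete, here is a small sketch that combines the task-specific parameters shown above with batching on GPU. The numeric values (`batch_size=4`, `chunk_length_s=30`) are illustrative assumptions, not recommendations:

```py
# Sketch only: chunking, timestamps and batching together for a long audio file.
# The numeric values are illustrative; tune them for your hardware and audio length.
from transformers import pipeline

transcriber = pipeline(
    model="openai/whisper-large-v2",
    device=0,                # first GPU; use -1 for CPU
    batch_size=4,            # forward several chunks at once
    chunk_length_s=30,       # split long audio into 30-second chunks
    return_timestamps=True,  # also return when each segment was spoken
)
out = transcriber("https://huggingface.co/datasets/sanchit-gandhi/librispeech_long/resolve/main/audio.wav")
```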
+ + +## डेटासेट पर पाइपलाइनों का उपयोग करना + +पाइपलाइन बड़े डेटासेट पर भी अनुमान चला सकती है। ऐसा करने का सबसे आसान तरीका हम एक पुनरावर्तक का उपयोग करने की सलाह देते हैं: + +```py +def data(): + for i in range(1000): + yield f"My example {i}" + + +pipe = pipeline(model="openai-community/gpt2", device=0) +generated_characters = 0 +for out in pipe(data()): + generated_characters += len(out[0]["generated_text"]) +``` + +पुनरावर्तक `data()` प्रत्येक परिणाम और पाइपलाइन स्वचालित रूप से उत्पन्न करता है +पहचानता है कि इनपुट पुनरावर्तनीय है और डेटा प्राप्त करना शुरू कर देगा +यह इसे GPU पर प्रोसेस करना जारी रखता है (यह हुड के तहत [DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) का उपयोग करता है)। +यह महत्वपूर्ण है क्योंकि आपको संपूर्ण डेटासेट के लिए मेमोरी आवंटित करने की आवश्यकता नहीं है +और आप जितनी जल्दी हो सके GPU को फीड कर सकते हैं। + +चूंकि बैचिंग से चीज़ें तेज़ हो सकती हैं, इसलिए यहां `batch_size` प्राचल को ट्यून करने का प्रयास करना उपयोगी हो सकता है। + +किसी डेटासेट पर पुनरावृति करने का सबसे सरल तरीका बस एक को 🤗 [Dataset](https://github.com/huggingface/datasets/) से लोड करना है: + +```py +# KeyDataset is a util that will just output the item we're interested in. +from transformers.pipelines.pt_utils import KeyDataset +from datasets import load_dataset + +pipe = pipeline(model="hf-internal-testing/tiny-random-wav2vec2", device=0) +dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation[:10]") + +for out in pipe(KeyDataset(dataset, "audio")): + print(out) +``` + + +## वेबसर्वर के लिए पाइपलाइनों का उपयोग करना + + +एक अनुमान इंजन बनाना एक जटिल विषय है जो अपने आप में उपयुक्त है +पृष्ठ। + + +[Link](./pipeline_webserver) + +## विज़न पाइपलाइन + +दृष्टि कार्यों के लिए [`pipeline`] का उपयोग करना व्यावहारिक रूप से समान है। + +अपना कार्य निर्दिष्ट करें और अपनी छवि क्लासिफायरियर को भेजें। छवि एक लिंक, एक स्थानीय पथ या बेस64-एन्कोडेड छवि हो सकती है। उदाहरण के लिए, बिल्ली की कौन सी प्रजाति नीचे दिखाई गई है? + +![pipeline-cat-chonk](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg) + +```py +>>> from transformers import pipeline + +>>> vision_classifier = pipeline(model="google/vit-base-patch16-224") +>>> preds = vision_classifier( +... images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +... ) +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds] +>>> preds +[{'score': 0.4335, 'label': 'lynx, catamount'}, {'score': 0.0348, 'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor'}, {'score': 0.0324, 'label': 'snow leopard, ounce, Panthera uncia'}, {'score': 0.0239, 'label': 'Egyptian cat'}, {'score': 0.0229, 'label': 'tiger cat'}] +``` + +## पाठ पाइपलाइन + +NLP कार्यों के लिए [`pipeline`] का उपयोग करना व्यावहारिक रूप से समान है। + +```py +>>> from transformers import pipeline + +>>> # This model is a `zero-shot-classification` model. +>>> # It will classify text, except you are free to choose any label you might imagine +>>> classifier = pipeline(model="facebook/bart-large-mnli") +>>> classifier( +... "I have a problem with my iphone that needs to be resolved asap!!", +... candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"], +... 
) +{'sequence': 'I have a problem with my iphone that needs to be resolved asap!!', 'labels': ['urgent', 'phone', 'computer', 'not urgent', 'tablet'], 'scores': [0.504, 0.479, 0.013, 0.003, 0.002]} +``` + +## बहुविध पाइपलाइन + +[`pipeline`] एक से अधिक तौर-तरीकों का समर्थन करती है। उदाहरण के लिए, एक दृश्य प्रश्न उत्तर (VQA) कार्य पाठ और छवि को जोड़ता है। अपनी पसंद के किसी भी छवि लिंक और छवि के बारे में कोई प्रश्न पूछने के लिए स्वतंत्र महसूस करें। छवि एक URL या छवि का स्थानीय पथ हो सकती है। + +उदाहरण के लिए, यदि आप इस [invoice image](https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/invoice.png) का उपयोग करते हैं: + +```py +>>> from transformers import pipeline + +>>> vqa = pipeline(model="impira/layoutlm-document-qa") +>>> vqa( +... image="https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/invoice.png", +... question="What is the invoice number?", +... ) +[{'score': 0.42515, 'answer': 'us-001', 'start': 16, 'end': 16}] +``` + + + +ऊपर दिए गए उदाहरण को चलाने के लिए आपको 🤗 ट्रांसफॉर्मर के अलावा [`pytesseract`](https://pypi.org/project/pytesseract/) इंस्टॉल करना होगा: + +```bash +sudo apt install -y tesseract-ocr +pip install pytesseract +``` + + + +## 🤗 `त्वरण` के साथ बड़े मॉडलों पर `pipeline` का उपयोग करना: + +आप 🤗 `accelerate` का उपयोग करके बड़े मॉडलों पर आसानी से `pipeline` चला सकते हैं! पहले सुनिश्चित करें कि आपने `accelerate` को `pip install accelerate` के साथ इंस्टॉल किया है। + +सबसे पहले `device_map='auto'` का उपयोग करके अपना मॉडल लोड करें! हम अपने उदाहरण के लिए `facebook/opt-1.3b` का उपयोग करेंगे। + +```py +# pip install accelerate +import torch +from transformers import pipeline + +pipe = pipeline(model="facebook/opt-1.3b", torch_dtype=torch.bfloat16, device_map="auto") +output = pipe("This is a cool example!", do_sample=True, top_p=0.95) +``` + +यदि आप `bitsandbytes` इंस्टॉल करते हैं और `load_in_8bit=True` तर्क जोड़ते हैं तो आप 8-बिट लोडेड मॉडल भी पास कर सकते हैं + +```py +# pip install accelerate bitsandbytes +import torch +from transformers import pipeline + +pipe = pipeline(model="facebook/opt-1.3b", device_map="auto", model_kwargs={"load_in_8bit": True}) +output = pipe("This is a cool example!", do_sample=True, top_p=0.95) +``` + +ध्यान दें कि आप चेकपॉइंट को किसी भी हगिंग फेस मॉडल से बदल सकते हैं जो BLOOM जैसे बड़े मॉडल लोडिंग का समर्थन करता है! 
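As a concrete illustration of that last note, the same pattern works with a BLOOM-family checkpoint. `bigscience/bloom-560m` is used here only as a small stand-in; any causal LM on the Hub that supports `device_map="auto"` should behave the same way:

```py
# Sketch: the same accelerate-backed pipeline with a BLOOM-family checkpoint.
# bigscience/bloom-560m is an illustrative choice, not the only option.
import torch
from transformers import pipeline

pipe = pipeline(model="bigscience/bloom-560m", torch_dtype=torch.bfloat16, device_map="auto")
output = pipe("This is a cool example!", do_sample=True, top_p=0.95)
```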
diff --git a/docs/source/it/_toctree.yml b/docs/source/it/_toctree.yml index 9c18dcdf9b705a..47d90f9a9a85c9 100644 --- a/docs/source/it/_toctree.yml +++ b/docs/source/it/_toctree.yml @@ -1,47 +1,71 @@ -- sections: - - local: index - title: 🤗 Transformers - - local: quicktour - title: Tour rapido - - local: installation - title: Installazione - title: Iniziare -- sections: - - local: pipeline_tutorial - title: Pipeline per l'inferenza - - local: autoclass_tutorial - title: Carica istanze pre-allenate con AutoClass - - local: preprocessing - title: Preprocess - - local: training - title: Fine-tuning di un modello pre-addestrato - - local: accelerate - title: Allenamento distribuito con 🤗 Accelerate - - local: model_sharing - title: Condividere un modello - title: Esercitazione -- sections: - - local: create_a_model - title: Crea un'architettura personalizzata - - local: custom_models - title: Condividere modelli personalizzati - - local: run_scripts - title: Addestramento con script - - local: multilingual - title: Modelli multilingua per l'inferenza - - local: converting_tensorflow_models - title: Convertire modelli tensorflow - - local: serialization - title: Esporta modelli Transformers - - local: debugging - title: Debugging - title: Guide pratiche -- sections: - - local: add_new_pipeline - title: Come aggiungere una pipeline a 🤗 Transformers? - - local: add_new_model - title: Come aggiungere un modello a 🤗 Transformers? - - local: perf_hardware - title: Hardware ottimizzato per l'addestramento - title: Guide How-to - +- sections: + - local: index + title: 🤗 Transformers + - local: quicktour + title: Tour rapido + - local: installation + title: Installazione + title: Iniziare +- sections: + - local: pipeline_tutorial + title: Pipeline per l'inferenza + - local: autoclass_tutorial + title: Carica istanze pre-allenate con AutoClass + - local: preprocessing + title: Preprocess + - local: training + title: Fine-tuning di un modello pre-addestrato + - local: accelerate + title: Allenamento distribuito con 🤗 Accelerate + - local: model_sharing + title: Condividere un modello + title: Esercitazione +- sections: + - local: create_a_model + title: Crea un'architettura personalizzata + - local: custom_models + title: Condividere modelli personalizzati + - local: run_scripts + title: Addestramento con script + - local: multilingual + title: Modelli multilingua per l'inferenza + - local: converting_tensorflow_models + title: Convertire modelli tensorflow + - local: serialization + title: Esporta modelli Transformers + - local: perf_train_cpu + title: Addestramento efficiente su CPU + - local: perf_train_cpu_many + title: Addestramento efficiente su multiple CPU + - local: perf_train_tpu + title: Addestramento su TPU + - local: perf_train_special + title: Addestramento su Hardware Specializzato + - local: perf_infer_cpu + title: Inferenza Efficiente su CPU + - local: perf_infer_gpu_one + title: Inferenza su una GPU + - local: perf_infer_gpu_many + title: Inferenza Efficiente su GPU Multiple + - local: perf_infer_special + title: Inferenza su Hardware Specializzato + - local: big_models + title: Istanziare un big model + - local: migration + title: Passaggio da pacchetti precedenti + - local: debugging + title: Debugging + title: Guide pratiche +- sections: + - local: add_new_pipeline + title: Come aggiungere una pipeline a 🤗 Transformers? + - local: add_new_model + title: Come aggiungere un modello a 🤗 Transformers? 
+ - local: perf_hardware + title: Hardware ottimizzato per l'addestramento + - local: community + title: Risorse della comunità + - local: pr_checks + title: Controlli su una Pull Request + title: Guide How-to + diff --git a/docs/source/it/accelerate.mdx b/docs/source/it/accelerate.md similarity index 96% rename from docs/source/it/accelerate.mdx rename to docs/source/it/accelerate.md index 20dc1a7ff90b53..3114613a9a7994 100644 --- a/docs/source/it/accelerate.mdx +++ b/docs/source/it/accelerate.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Allenamento distribuito con 🤗 Accelerate diff --git a/docs/source/it/add_new_model.mdx b/docs/source/it/add_new_model.md similarity index 99% rename from docs/source/it/add_new_model.mdx rename to docs/source/it/add_new_model.md index 8dce90a816b895..f6daeeaf85d350 100644 --- a/docs/source/it/add_new_model.mdx +++ b/docs/source/it/add_new_model.md @@ -7,6 +7,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Come aggiungere un modello a 🤗 Transformers? @@ -579,7 +583,7 @@ model.save_pretrained("/path/to/converted/checkpoint/folder") **7. Implementare il forward pass** Una volta che i weights pretrained sono stati correttamente caricati in 🤗 Transformers, dovrete assicurarvi che il forward pass -sia correttamente implementato. [Qui](#provare-un-pretrained-checkpoint-usando-la-repo-originale), avete give creato e provato +sia correttamente implementato. [Qui](#3-4-provare-un-pretrained-checkpoint-usando-la-repo-originale), avete give creato e provato uno script che testi il forward pass del modello usando la repo originaria. Ora dovrete fare lo stesso con uno script analogo usando l'implementazione in 🤗 Transformers anziché l'originale. Piu o meno lo script dovrebbe essere: diff --git a/docs/source/it/add_new_pipeline.mdx b/docs/source/it/add_new_pipeline.md similarity index 97% rename from docs/source/it/add_new_pipeline.mdx rename to docs/source/it/add_new_pipeline.md index cf9acd2902fcfa..cd42e5cc2cd3d9 100644 --- a/docs/source/it/add_new_pipeline.mdx +++ b/docs/source/it/add_new_pipeline.md @@ -7,11 +7,15 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Come creare una pipeline personalizzata? 
-In questa guida, scopriremo come creare una pipeline personalizzata e condividerla sull' [Hub](hf.co/models) o aggiungerla nella libreria +In questa guida, scopriremo come creare una pipeline personalizzata e condividerla sull' [Hub](https://hf.co/models) o aggiungerla nella libreria Transformers. Innanzitutto, è necessario decidere gli input grezzi che la pipeline sarà in grado di accettare. Possono essere strings, raw bytes, diff --git a/docs/source/it/autoclass_tutorial.mdx b/docs/source/it/autoclass_tutorial.md similarity index 89% rename from docs/source/it/autoclass_tutorial.mdx rename to docs/source/it/autoclass_tutorial.md index 88dd6cad6c4212..edb96528e705ea 100644 --- a/docs/source/it/autoclass_tutorial.mdx +++ b/docs/source/it/autoclass_tutorial.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Carica istanze pre-allenate con AutoClass @@ -16,7 +20,7 @@ Con così tante architetture Transformer differenti, può essere sfidante crearn -Ricorda, con architettura ci si riferisce allo scheletro del modello e con checkpoint ai pesi di una determinata architettura. Per esempio, [BERT](https://huggingface.co/bert-base-uncased) è un'architettura, mentre `bert-base-uncased` è un checkpoint. Modello è un termine generale che può significare sia architettura che checkpoint. +Ricorda, con architettura ci si riferisce allo scheletro del modello e con checkpoint ai pesi di una determinata architettura. Per esempio, [BERT](https://huggingface.co/google-bert/bert-base-uncased) è un'architettura, mentre `google-bert/bert-base-uncased` è un checkpoint. Modello è un termine generale che può significare sia architettura che checkpoint. @@ -36,7 +40,7 @@ Carica un tokenizer con [`AutoTokenizer.from_pretrained`]: ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base") +>>> tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-base") ``` Poi tokenizza il tuo input come mostrato in seguito: @@ -83,7 +87,7 @@ Infine, le classi `AutoModelFor` ti permettono di caricare un modello pre-allena ```py >>> from transformers import AutoModelForSequenceClassification ->>> model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") +>>> model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` Semplicemente utilizza lo stesso checkpoint per caricare un'architettura per un task differente: @@ -91,7 +95,7 @@ Semplicemente utilizza lo stesso checkpoint per caricare un'architettura per un ```py >>> from transformers import AutoModelForTokenClassification ->>> model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased") +>>> model = AutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` Generalmente, raccomandiamo di utilizzare la classe `AutoTokenizer` e la classe `AutoModelFor` per caricare istanze pre-allenate dei modelli. Questo ti assicurerà di aver caricato la corretta architettura ogni volta. 
Nel prossimo [tutorial](preprocessing), imparerai come utilizzare il tokenizer, il feature extractor e il processore per elaborare un dataset per il fine-tuning. @@ -103,7 +107,7 @@ Infine, le classi `TFAutoModelFor` ti permettono di caricare un modello pre-alle ```py >>> from transformers import TFAutoModelForSequenceClassification ->>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") +>>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` Semplicemente utilizza lo stesso checkpoint per caricare un'architettura per un task differente: @@ -111,7 +115,7 @@ Semplicemente utilizza lo stesso checkpoint per caricare un'architettura per un ```py >>> from transformers import TFAutoModelForTokenClassification ->>> model = TFAutoModelForTokenClassification.from_pretrained("distilbert-base-uncased") +>>> model = TFAutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` Generalmente, raccomandiamo di utilizzare la classe `AutoTokenizer` e la classe `TFAutoModelFor` per caricare istanze pre-allenate dei modelli. Questo ti assicurerà di aver caricato la corretta architettura ogni volta. Nel prossimo [tutorial](preprocessing), imparerai come utilizzare il tokenizer, il feature extractor e il processore per elaborare un dataset per il fine-tuning. diff --git a/docs/source/it/big_models.md b/docs/source/it/big_models.md new file mode 100644 index 00000000000000..6a5c346dec890f --- /dev/null +++ b/docs/source/it/big_models.md @@ -0,0 +1,123 @@ + + +# Istanziare un big model + +Quando vuoi utilizzare un modello preaddestrato (pretrained) molto grande, una sfida è minimizzare l'uso della RAM. Il workflow classico +in PyTorch è: + +1. Crea il tuo modello con pesi casuali (random weights). +2. Carica i tuoi pesi preaddestrati. +3. Inserisci i pesi preaddestrati nel tuo modello casuale. + +I passi 1 e 2 una versione completa del modello in memoria, in molti casi non è un problema, ma se il modello inizia a pesare diversi GigaBytes, queste due copie possono sturare la nostra RAM. Ancora peggio, se stai usando `torch.distributed` per seguire l'addestramento (training) in distribuito, ogni processo caricherà il modello preaddestrato e memorizzerà queste due copie nella RAM. + + + +Nota che il modello creato casualmente è inizializzato con tensori "vuoti", che occupano spazio in memoria ma senza riempirlo (quindi i valori casuali sono quelli che si trovavano in questa porzione di memoria in un determinato momento). L'inizializzazione casuale che segue la distribuzione appropriata per il tipo di modello/parametri istanziato (come la distribuzione normale per le istanze) è eseguito solo dopo il passaggio 3 sui pesi non inizializzati, per essere più rapido possibile! + + + +In questa guida, esploreremo le soluzioni che Transformers offre per affrontare questo problema. C'è da tenere in conto che questa è un'area in cui si sta attualmente sviluppando, quindi le API spiegate qui possono variare velocemente in futuro. + +## Checkpoints condivisi + +Dalla versione 4.18.0, i checkpoints dei modelli che occupano più di 10GB di spazio vengono automaticamente frammentati in più parti. Per quanto riguarda la possibilità di avere un unico checkpoint quando si utilizza `model.save_pretrained(save_dir)`, si hanno diversi checkpoint parziali (ognuno con dimensione < 10GB) e un indice che mappa i nomi dei parametri ai file in cui sono memorizzati. 
+ +Puoi controllare la dimensione massima dopo la frammentazione con il parametro `max_shard_size`, nel prossimo esempio, useremo modelli di dimensioni normali con frammenti di piccoli dimensioni: prendiamo un modello BERT classico. + +```py +from transformers import AutoModel + +model = AutoModel.from_pretrained("google-bert/bert-base-cased") +``` + +Se tu salvi usando [`~PreTrainedModel.save_pretrained`], avrai una nuova cartella con due file: il config del modello e i suoi pesi: + +```py +>>> import os +>>> import tempfile + +>>> with tempfile.TemporaryDirectory() as tmp_dir: +... model.save_pretrained(tmp_dir) +... print(sorted(os.listdir(tmp_dir))) +['config.json', 'pytorch_model.bin'] +``` + +Adesso usiamo una dimensione massima di frammentazione di 200MB: + +```py +>>> with tempfile.TemporaryDirectory() as tmp_dir: +... model.save_pretrained(tmp_dir, max_shard_size="200MB") +... print(sorted(os.listdir(tmp_dir))) +['config.json', 'pytorch_model-00001-of-00003.bin', 'pytorch_model-00002-of-00003.bin', 'pytorch_model-00003-of-00003.bin', 'pytorch_model.bin.index.json'] +``` + +In aggiunta alla configurazione del modello, vediamo tre differenti file dei pesi, e un file `index.json` che è il nostro indice. Un checkpoint può essere ricaricato totalmente usando il metodo [`~PreTrainedModel.from_pretrained`]: + +```py +>>> with tempfile.TemporaryDirectory() as tmp_dir: +... model.save_pretrained(tmp_dir, max_shard_size="200MB") +... new_model = AutoModel.from_pretrained(tmp_dir) +``` + +Il vantaggio principale di applicare questo metodo per modelli grandi è che durante il passo 2 del workflow illustrato in precedenza, ogni frammento del checkpoint viene caricato dopo il precedente, limitando l'utilizzo della RAM alla dimensione del modello più la dimensione del frammento più grande. + +Dietro le quinte, il file indice è utilizzato per determinare quali chiavi sono nel checkpoint, e dove i corrispondenti pesi sono memorizzati. Possiamo caricare l'indice come un qualsiasi json e ottenere un dizionario: + +```py +>>> import json + +>>> with tempfile.TemporaryDirectory() as tmp_dir: +... model.save_pretrained(tmp_dir, max_shard_size="200MB") +... with open(os.path.join(tmp_dir, "pytorch_model.bin.index.json"), "r") as f: +... index = json.load(f) + +>>> print(index.keys()) +dict_keys(['metadata', 'weight_map']) +``` + +I metadati consistono solo nella dimensione totale del modello per ora. Abbiamo in programma di aggiungere altre informazioni in futuro: + +```py +>>> index["metadata"] +{'total_size': 433245184} +``` + +La mappa dei pesi è la parte principale di questo indice, che mappa ogni nome dei parametri (si trova solitamente nei modelli PyTorch come `state_dict`) al file in cui è memorizzato: + +```py +>>> index["weight_map"] +{'embeddings.LayerNorm.bias': 'pytorch_model-00001-of-00003.bin', + 'embeddings.LayerNorm.weight': 'pytorch_model-00001-of-00003.bin', + ... +``` + +Se vuoi caricare direttamente un checkpoint frammentato in un modello senza usare [`~PreTrainedModel.from_pretrained`] (come si farebbe con `model.load_state_dict()` per un checkpoint completo) devi usare [`~modeling_utils.load_sharded_checkpoint`]: + +```py +>>> from transformers.modeling_utils import load_sharded_checkpoint + +>>> with tempfile.TemporaryDirectory() as tmp_dir: +... model.save_pretrained(tmp_dir, max_shard_size="200MB") +... 
load_sharded_checkpoint(model, tmp_dir) +``` + +## Caricamento low memory + +Frammentare i checkpoint l'utilizzo di memoria al passo 2 del workflow citato in precedenza, ma per utilizzare questo modello in un ambiente con poca memoria, consigliamo di utilizzare i nostri strumenti basati sulla libreria Accelerate. + +Per ulteriori informazioni, leggere la seguente guida: [Large model loading using Accelerate](./main_classes/model#large-model-loading) \ No newline at end of file diff --git a/docs/source/it/community.md b/docs/source/it/community.md new file mode 100644 index 00000000000000..92f6698a9a89bb --- /dev/null +++ b/docs/source/it/community.md @@ -0,0 +1,68 @@ + + +# Comunità + +Questa pagina raggruppa le risorse sviluppate dalla comunità riguardo 🤗 Transformers. + +## Risorse della comunità: + +| Risorsa | Descrizione | Autore | +|:----------|:-------------|------:| +| [Glossario delle Flashcards di Transformers](https://www.darigovresearch.com/huggingface-transformers-glossary-flashcards) | Un insieme di flashcards basate sul [glossario della documentazione di Transformers](glossary), creato in un formato tale da permettere un facile apprendimento e revisione usando [Anki](https://apps.ankiweb.net/), un'applicazione open-source e multi-piattaforma, specificatamente progettata per ricordare informazioni nel lungo termine. Guarda questo [video introduttivo su come usare le flashcards](https://www.youtube.com/watch?v=Dji_h7PILrw). | [Darigov Research](https://www.darigovresearch.com/) | + +## Notebook della comunità: + +| Notebook | Descrizione | Autore | | +|:----------|:-------------|:-------------|------:| +| [Fine-tuning di un Transformer pre-addestrato, al fine di generare testi di canzoni](https://github.com/AlekseyKorshuk/huggingartists) | Come generare testi di canzoni nello stile del vostro artista preferito attraverso il fine-tuning di un modello GPT-2. | [Aleksey Korshuk](https://github.com/AlekseyKorshuk) | [![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AlekseyKorshuk/huggingartists/blob/master/huggingartists-demo.ipynb) | +| [Addestramento di T5 in Tensorflow 2](https://github.com/snapthat/TF-T5-text-to-text) | Come addestrare T5 per qualsiasi attività usando Tensorflow 2. Questo notebook mostra come risolvere l'attività di "Question Answering" usando Tensorflow 2 e SQUAD. | [Muhammad Harris](https://github.com/HarrisDePerceptron) |[![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snapthat/TF-T5-text-to-text/blob/master/snapthatT5/notebooks/TF-T5-Datasets%20Training.ipynb) | +| [Addestramento di T5 con TPU](https://github.com/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb) | Come addestrare T5 su SQUAD con Transformers e NLP. | [Suraj Patil](https://github.com/patil-suraj) |[![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb#scrollTo=QLGiFCDqvuil) | +| [Fine-tuning di T5 per la classificazione e scelta multipla](https://github.com/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb) | Come effettuare il fine-tuning di T5 per le attività di classificazione a scelta multipla - usando un formato testo-a-testo - con PyTorch Lightning. 
| [Suraj Patil](https://github.com/patil-suraj) | [![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb) | +| [Fine-tuning di DialoGPT su nuovi dataset e lingue](https://github.com/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb) | Come effettuare il fine-tuning di un modello DialoGPT su un nuovo dataset per chatbots conversazionali open-dialog. | [Nathan Cooper](https://github.com/ncoop57) | [![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb) | +| [Modellamento di una lunga sequenza con Reformer](https://github.com/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb) | Come addestrare su sequenze di lunghezza fino a 500 mila token con Reformer. | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb) | +| [Fine-tuning di BART per riassumere testi](https://github.com/ohmeow/ohmeow_website/blob/master/_notebooks/2020-05-23-text-generation-with-blurr.ipynb) | Come effettuare il fine-tuning di BART per riassumere testi con fastai usando blurr. | [Wayde Gilliam](https://ohmeow.com/) | [![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ohmeow/ohmeow_website/blob/master/_notebooks/2020-05-23-text-generation-with-blurr.ipynb) | +| [Fine-tuning di un Transformer pre-addestrato su tweet](https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb) | Come generare tweet nello stile del tuo account Twitter preferito attraverso il fine-tuning di un modello GPT-2. | [Boris Dayma](https://github.com/borisdayma) | [![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb) | +| [Ottimizzazione di modelli 🤗 Hugging Face con Weights & Biases](https://colab.research.google.com/github/wandb/examples/blob/master/colabs/huggingface/Optimize_Hugging_Face_models_with_Weights_%26_Biases.ipynb) | Un tutorial completo che mostra l'integrazione di W&B con Hugging Face. | [Boris Dayma](https://github.com/borisdayma) | [![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/wandb/examples/blob/master/colabs/huggingface/Optimize_Hugging_Face_models_with_Weights_%26_Biases.ipynb) | +| [Longformer pre-addestrato](https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb) | Come costruire una versione "long" degli esistenti modelli pre-addestrati. 
| [Iz Beltagy](https://beltagy.net) | [![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb) | +| [Fine-tuning di Longformer per QA](https://github.com/patil-suraj/Notebooks/blob/master/longformer_qa_training.ipynb) | Come effettuare il fine-tuning di un modello longformer per un task di QA.| [Suraj Patil](https://github.com/patil-suraj) | [![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/Notebooks/blob/master/longformer_qa_training.ipynb) | +| [Valutazione di modelli con 🤗NLP](https://github.com/patrickvonplaten/notebooks/blob/master/How_to_evaluate_Longformer_on_TriviaQA_using_NLP.ipynb) | Come valutare longformer su TriviaQA con `NLP`. | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1m7eTGlPmLRgoPkkA7rkhQdZ9ydpmsdLE?usp=sharing) | +| [Fine-tuning di T5 per Sentiment Span Extraction](https://github.com/enzoampil/t5-intro/blob/master/t5_qa_training_pytorch_span_extraction.ipynb) | Come effettuare il fine-tuning di T5 per la sentiment span extraction - usando un formato testo-a-testo - con PyTorch Lightning. | [Lorenzo Ampil](https://github.com/enzoampil) | [![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/enzoampil/t5-intro/blob/master/t5_qa_training_pytorch_span_extraction.ipynb) | +| [Fine-tuning di DistilBert per la classificazione multi-classe](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb) | Come effettuare il fine-tuning di DistilBert per la classificazione multi-classe con PyTorch. | [Abhishek Kumar Mishra](https://github.com/abhimishra91) | [![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb)| +|[Fine-tuning di BERT per la classificazione multi-etichetta](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb)|Come effettuare il fine-tuning di BERT per la classificazione multi-etichetta con PyTorch. |[Abhishek Kumar Mishra](https://github.com/abhimishra91) |[![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb)| +|[Accelerazione del fine-tuning con il Dynamic Padding / Bucketing](https://github.com/ELS-RD/transformers-notebook/blob/master/Divide_Hugging_Face_Transformers_training_time_by_2_or_more.ipynb)| Come velocizzare il fine-tuning di un fattore 2X usando il dynamic padding / bucketing. 
|[Michael Benesty](https://github.com/pommedeterresautee) |[![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1CBfRU1zbfu7-ijiOqAAQUA-RJaxfcJoO?usp=sharing)| +|[Pre-addestramento di Reformer per Masked Language Modeling](https://github.com/patrickvonplaten/notebooks/blob/master/Reformer_For_Masked_LM.ipynb)| Come addestrare un modello Reformer usando livelli di self-attention bi-direzionali.| [Patrick von Platen](https://github.com/patrickvonplaten) | [![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1tzzh0i8PgDQGV3SMFUGxM7_gGae3K-uW?usp=sharing)| +|[Espansione e fine-tuning di Sci-BERT](https://github.com/lordtt13/word-embeddings/blob/master/COVID-19%20Research%20Data/COVID-SciBERT.ipynb)| Come incrementare il vocabolario di un modello SciBERT - pre-addestrato da AllenAI sul dataset CORD - e crearne una pipeline. | [Tanmay Thakur](https://github.com/lordtt13) | [![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1rqAR40goxbAfez1xvF3hBJphSCsvXmh8)| +|[Fine-tuning di BlenderBotSmall per riassumere testi usando Trainer API](https://github.com/lordtt13/transformers-experiments/blob/master/Custom%20Tasks/fine-tune-blenderbot_small-for-summarization.ipynb)| Come effettuare il fine-tuning di BlenderBotSmall per riassumere testi su un dataset personalizzato, usando Trainer API. | [Tanmay Thakur](https://github.com/lordtt13) | [![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/19Wmupuls7mykSGyRN_Qo6lPQhgp56ymq?usp=sharing)| +|[Fine-tuning di Electra e interpretazione con Integrated Gradients](https://github.com/elsanns/xai-nlp-notebooks/blob/master/electra_fine_tune_interpret_captum_ig.ipynb) | Come effettuare il fine-tuning di Electra per l'analisi dei sentimenti e intepretare le predizioni con Captum Integrated Gradients. | [Eliza Szczechla](https://elsanns.github.io) | [![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elsanns/xai-nlp-notebooks/blob/master/electra_fine_tune_interpret_captum_ig.ipynb)| +|[Fine-tuning di un modello GPT-2 non inglese con la classe Trainer](https://github.com/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb) | Come effettuare il fine-tuning di un modello GPT-2 non inglese con la classe Trainer. | [Philipp Schmid](https://www.philschmid.de) | [![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb)| +|[Fine-tuning di un modello DistilBERT per la classficazione multi-etichetta](https://github.com/DhavalTaunk08/Transformers_scripts/blob/master/Transformers_multilabel_distilbert.ipynb) | Come effettuare il fine-tuning di un modello DistilBERT per l'attività di classificazione multi-etichetta. 
| [Dhaval Taunk](https://github.com/DhavalTaunk08) | [![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhavalTaunk08/Transformers_scripts/blob/master/Transformers_multilabel_distilbert.ipynb)| +|[Fine-tuning di ALBERT per la classifcazione di coppie di frasi](https://github.com/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb) | Come effettuare il fine-tuning di un modello ALBERT - o un altro modello BERT-based - per l'attività di classificazione di coppie di frasi. | [Nadir El Manouzi](https://github.com/NadirEM) | [![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb)| +|[Fine-tuning di Roberta per l'analisi di sentimenti](https://github.com/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb) | Come effettuare il fine-tuning di un modello Roberta per l'analisi di sentimenti. | [Dhaval Taunk](https://github.com/DhavalTaunk08) | [![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb)| +|[Valutazione di modelli che generano domande](https://github.com/flexudy-pipe/qugeev) | Quanto sono accurante le risposte alle domande generate dal tuo modello transformer seq2seq? | [Pascal Zoleko](https://github.com/zolekode) | [![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bpsSqCQU-iw_5nNoRm_crPq6FRuJthq_?usp=sharing)| +|[Classificazione di testo con DistilBERT e Tensorflow](https://github.com/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb) | Come effettuare il fine-tuning di DistilBERT per la classificazione di testo in TensorFlow. | [Peter Bayerle](https://github.com/peterbayerle) | [![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb)| +|[Utilizzo di BERT per riassumere testi con un modello Encoder-Decoder su CNN/Dailymail](https://github.com/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb) | Come avviare "a caldo" un *EncoderDecoderModel* attraverso l'utilizzo di un checkpoint *google-bert/bert-base-uncased* per riassumere testi su CNN/Dailymail. | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Aprilo in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb)| +|[Utilizzo di RoBERTa per riassumere testi con un modello Encoder-Decoder su BBC XSum](https://github.com/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb) | Come avviare "a caldo" un *EncoderDecoderModel* (condiviso) attraverso l'utilizzo di un checkpoint *FacebookAI/roberta-base* per riassumere testi su BBC/XSum. 
| [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb)| +|[Fine-tuning di TAPAS su Sequential Question Answering (SQA)](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) | Come effettuare il fine-tuning di un modello *TapasForQuestionAnswering* attraverso l'utilizzo di un checkpoint *tapas-base* sul dataset Sequential Question Answering (SQA). | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb)| +|[Valutazione di TAPAS su Table Fact Checking (TabFact)](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Evaluating_TAPAS_on_the_Tabfact_test_set.ipynb) | Come valutare un modello *TapasForSequenceClassification* - fine-tuned con un checkpoint *tapas-base-finetuned-tabfact* - usando una combinazione delle librerie 🤗 datasets e 🤗 transformers. | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Evaluating_TAPAS_on_the_Tabfact_test_set.ipynb)| +|[Fine-tuning di mBART per la traduzione](https://colab.research.google.com/github/vasudevgupta7/huggingface-tutorials/blob/main/translation_training.ipynb) | Come effettuare il fine-tuning di mBART usando Seq2SeqTrainer per la traduzione da hindi a inglese.| [Vasudev Gupta](https://github.com/vasudevgupta7) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vasudevgupta7/huggingface-tutorials/blob/main/translation_training.ipynb)| +|[Fine-tuning di LayoutLM su FUNSD (un dataset per la comprensione della forma)](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb) | Come effettuare il fine-tuning di un modello *LayoutLMForTokenClassification* sul dataset FUNSD per l'estrazione di informazioni da documenti scannerizzati.| [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb)| +|[Fine-tuning di DistilGPT2 e generazione di testo](https://colab.research.google.com/github/tripathiaakash/DistilGPT2-Tutorial/blob/main/distilgpt2_fine_tuning.ipynb) | Come effettuare il fine-tuning di DistilGPT2 e generare testo. | [Aakash Tripathi](https://github.com/tripathiaakash) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tripathiaakash/DistilGPT2-Tutorial/blob/main/distilgpt2_fine_tuning.ipynb)| +|[Fine-tuning di LED fino a 8 mila token](https://github.com/patrickvonplaten/notebooks/blob/master/Fine_tune_Longformer_Encoder_Decoder_(LED)_for_Summarization_on_pubmed.ipynb) | Come effettuare il fine-tuning di LED su PubMed per riassumere "lunghi" testi. 
| [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Fine_tune_Longformer_Encoder_Decoder_(LED)_for_Summarization_on_pubmed.ipynb)| +|[Valutazione di LED su Arxiv](https://github.com/patrickvonplaten/notebooks/blob/master/LED_on_Arxiv.ipynb) | Come valutare efficacemente LED sull'attività di riassumere "lunghi" testi. | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/LED_on_Arxiv.ipynb)| +|[Fine-tuning di LayoutLM su RVL-CDIP, un dataset per la classificazione di documenti (immagini)](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForSequenceClassification_on_RVL_CDIP.ipynb) | Come effettuare il fine-tuning di un modello *LayoutLMForSequenceClassification* sul dataset RVL-CDIP per la classificazione di documenti scannerizzati. | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForSequenceClassification_on_RVL_CDIP.ipynb)| +|[Decodifica Wav2Vec2 CTC con variazioni di GPT2](https://github.com/voidful/huggingface_notebook/blob/main/xlsr_gpt.ipynb) | Come decodificare sequenze CTC, variate da modelli di linguaggio. | [Eric Lam](https://github.com/voidful) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1e_z5jQHYbO2YKEaUgzb1ww1WwiAyydAj?usp=sharing) +|[Fine-tuning di BART per riassumere testi in due lingue con la classe Trainer](https://github.com/elsanns/xai-nlp-notebooks/blob/master/fine_tune_bart_summarization_two_langs.ipynb) | Come effettuare il fine-tuning di BART per riassumere testi in due lingue usando la classe Trainer. | [Eliza Szczechla](https://github.com/elsanns) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elsanns/xai-nlp-notebooks/blob/master/fine_tune_bart_summarization_two_langs.ipynb)| +|[Valutazione di Big Bird su Trivia QA](https://github.com/patrickvonplaten/notebooks/blob/master/Evaluating_Big_Bird_on_TriviaQA.ipynb) | Come valutare BigBird su question answering di "lunghi" documenti attraverso Trivia QA. | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Evaluating_Big_Bird_on_TriviaQA.ipynb)| +| [Creazione di sottotitoli per video usando Wav2Vec2](https://github.com/Muennighoff/ytclipcc/blob/main/wav2vec_youtube_captions.ipynb) | Come creare sottotitoli per qualsiasi video di YouTube trascrivendo l'audio con Wav2Vec. 
| [Niklas Muennighoff](https://github.com/Muennighoff) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Muennighoff/ytclipcc/blob/main/wav2vec_youtube_captions.ipynb) | +| [Fine-tuning di Vision Transformer su CIFAR-10 usando PyTorch Lightning](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/VisionTransformer/Fine_tuning_the_Vision_Transformer_on_CIFAR_10_with_PyTorch_Lightning.ipynb) | Come effettuare il fine-tuning di Vision Transformer (ViT) su CIFAR-10 usando HuggingFace Transformers, Datasets e PyTorch Lightning.| [Niels Rogge](https://github.com/nielsrogge) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/VisionTransformer/Fine_tuning_the_Vision_Transformer_on_CIFAR_10_with_PyTorch_Lightning.ipynb) | +| [Fine-tuning di Vision Transformer su CIFAR-10 usando 🤗 Trainer](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/VisionTransformer/Fine_tuning_the_Vision_Transformer_on_CIFAR_10_with_the_%F0%9F%A4%97_Trainer.ipynb) | Come effettuare il fine-tuning di Vision Transformer (ViT) su CIFAR-10 usando HuggingFace Transformers, Datasets e 🤗 Trainer. | [Niels Rogge](https://github.com/nielsrogge) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/VisionTransformer/Fine_tuning_the_Vision_Transformer_on_CIFAR_10_with_the_%F0%9F%A4%97_Trainer.ipynb) | +| [Valutazione di LUKE su Open Entity, un dataset di entity typing](https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_open_entity.ipynb) | Come valutare un modello *LukeForEntityClassification* sul dataset Open Entity. | [Ikuya Yamada](https://github.com/ikuyamada) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/studio-ousia/luke/blob/master/notebooks/huggingface_open_entity.ipynb) | +| [Valutazione di LUKE su TACRED, un dataset per l'estrazione di relazioni](https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_tacred.ipynb) | Come valutare un modello *LukeForEntityPairClassification* sul dataset TACRED. | [Ikuya Yamada](https://github.com/ikuyamada) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/studio-ousia/luke/blob/master/notebooks/huggingface_tacred.ipynb) | +| [Valutazione di LUKE su CoNLL-2003, un importante benchmark NER](https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_conll_2003.ipynb) | Come valutare un modello *LukeForEntitySpanClassification* sul dataset CoNLL-2003. | [Ikuya Yamada](https://github.com/ikuyamada) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/studio-ousia/luke/blob/master/notebooks/huggingface_conll_2003.ipynb) | +| [Valutazione di BigBird-Pegasus su dataset PubMed](https://github.com/vasudevgupta7/bigbird/blob/main/notebooks/bigbird_pegasus_evaluation.ipynb) | Come valutare un modello *BigBirdPegasusForConditionalGeneration* su dataset PubMed. 
| [Vasudev Gupta](https://github.com/vasudevgupta7) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vasudevgupta7/bigbird/blob/main/notebooks/bigbird_pegasus_evaluation.ipynb) | +| [Classificazione di emozioni dal discorso con Wav2Vec2](https://github.com/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb) | Come utilizzare un modello pre-addestrato Wav2Vec2 per la classificazione di emozioni sul dataset MEGA. | [Mehrdad Farahani](https://github.com/m3hrdadfi) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb) | +| [Rilevamento oggetti in un'immagine con DETR](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DETR/DETR_minimal_example_(with_DetrFeatureExtractor).ipynb) | Come usare un modello addestrato *DetrForObjectDetection* per rilevare oggetti in un'immagine e visualizzare l'attention. | [Niels Rogge](https://github.com/NielsRogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/DETR/DETR_minimal_example_(with_DetrFeatureExtractor).ipynb) | +| [Fine-tuning di DETR su un dataset personalizzato per rilevare oggetti](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DETR/Fine_tuning_DetrForObjectDetection_on_custom_dataset_(balloon).ipynb) | Come effettuare fine-tuning di un modello *DetrForObjectDetection* su un dataset personalizzato per rilevare oggetti. | [Niels Rogge](https://github.com/NielsRogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/DETR/Fine_tuning_DetrForObjectDetection_on_custom_dataset_(balloon).ipynb) | +| [Fine-tuning di T5 per Named Entity Recognition](https://github.com/ToluClassics/Notebooks/blob/main/T5_Ner_Finetuning.ipynb) | Come effettuare fine-tuning di *T5* per un'attività di Named Entity Recognition. | [Ogundepo Odunayo](https://github.com/ToluClassics) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1obr78FY_cBmWY5ODViCmzdY6O1KB65Vc?usp=sharing) | diff --git a/docs/source/it/converting_tensorflow_models.mdx b/docs/source/it/converting_tensorflow_models.md similarity index 90% rename from docs/source/it/converting_tensorflow_models.mdx rename to docs/source/it/converting_tensorflow_models.md index b9b30a315c6a19..b1de0113388254 100644 --- a/docs/source/it/converting_tensorflow_models.mdx +++ b/docs/source/it/converting_tensorflow_models.md @@ -5,6 +5,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer.
+ --> # Convertire checkpoint di Tensorflow @@ -92,29 +96,14 @@ transformers-cli convert --model_type gpt \ Ecco un esempio del processo di conversione di un modello OpenAI GPT-2 pre-allenato (vedi [qui](https://github.com/openai/gpt-2)): ```bash -export OPENAI_GPT2_CHECKPOINT_PATH=/path/to/gpt2/pretrained/weights -transformers-cli convert --model_type gpt2 \ +export OPENAI_GPT2_CHECKPOINT_PATH=/path/to/openai-community/gpt2/pretrained/weights +transformers-cli convert --model_type openai-community/gpt2 \ --tf_checkpoint $OPENAI_GPT2_CHECKPOINT_PATH \ --pytorch_dump_output $PYTORCH_DUMP_OUTPUT \ [--config OPENAI_GPT2_CONFIG] \ [--finetuning_task_name OPENAI_GPT2_FINETUNED_TASK] ``` -## Transformer-XL - - -Ecco un esempio del processo di conversione di un modello Transformer-XL pre-allenato -(vedi [qui](https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models)): - -```bash -export TRANSFO_XL_CHECKPOINT_FOLDER_PATH=/path/to/transfo/xl/checkpoint -transformers-cli convert --model_type transfo_xl \ - --tf_checkpoint $TRANSFO_XL_CHECKPOINT_FOLDER_PATH \ - --pytorch_dump_output $PYTORCH_DUMP_OUTPUT \ - [--config TRANSFO_XL_CONFIG] \ - [--finetuning_task_name TRANSFO_XL_FINETUNED_TASK] -``` - ## XLNet Ecco un esempio del processo di conversione di un modello XLNet pre-allenato: diff --git a/docs/source/it/create_a_model.mdx b/docs/source/it/create_a_model.md similarity index 93% rename from docs/source/it/create_a_model.mdx rename to docs/source/it/create_a_model.md index 6e11f3f1d0292c..caacf4fadc5db6 100644 --- a/docs/source/it/create_a_model.mdx +++ b/docs/source/it/create_a_model.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Crea un'architettura personalizzata @@ -82,7 +86,7 @@ DistilBertConfig { Nella funzione [`~PretrainedConfig.from_pretrained`] possono essere modificati gli attributi del modello pre-allenato: ```py ->>> my_config = DistilBertConfig.from_pretrained("distilbert-base-uncased", activation="relu", attention_dropout=0.4) +>>> my_config = DistilBertConfig.from_pretrained("distilbert/distilbert-base-uncased", activation="relu", attention_dropout=0.4) ``` Quando la configurazione del modello ti soddisfa, la puoi salvare con [`~PretrainedConfig.save_pretrained`]. Il file della tua configurazione è memorizzato come file JSON nella save directory specificata: @@ -105,7 +109,7 @@ Puoi anche salvare il file di configurazione come dizionario oppure come la diff ## Modello -Il prossimo passo e di creare [modello](main_classes/models). Il modello - vagamente riferito anche come architettura - definisce cosa ogni strato deve fare e quali operazioni stanno succedendo. Attributi come `num_hidden_layers` provenienti dalla configurazione sono usati per definire l'architettura. Ogni modello condivide la classe base [`PreTrainedModel`] e alcuni metodi comuni come il ridimensionamento degli input embeddings e la soppressione delle self-attention heads . 
Inoltre, tutti i modelli sono la sottoclasse di [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html), [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) o [`flax.linen.Module`](https://flax.readthedocs.io/en/latest/flax.linen.html#module). Cio significa che i modelli sono compatibili con l'uso di ciascun di framework. +Il prossimo passo e di creare [modello](main_classes/models). Il modello - vagamente riferito anche come architettura - definisce cosa ogni strato deve fare e quali operazioni stanno succedendo. Attributi come `num_hidden_layers` provenienti dalla configurazione sono usati per definire l'architettura. Ogni modello condivide la classe base [`PreTrainedModel`] e alcuni metodi comuni come il ridimensionamento degli input embeddings e la soppressione delle self-attention heads . Inoltre, tutti i modelli sono la sottoclasse di [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html), [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) o [`flax.linen.Module`](https://flax.readthedocs.io/en/latest/api_reference/flax.linen/module.html). Cio significa che i modelli sono compatibili con l'uso di ciascun di framework. @@ -123,13 +127,13 @@ Questo crea modelli con valori casuali invece di pesi pre-allenati. Non sarai in Crea un modello pre-allenato con [`~PreTrainedModel.from_pretrained`]: ```py ->>> model = DistilBertModel.from_pretrained("distilbert-base-uncased") +>>> model = DistilBertModel.from_pretrained("distilbert/distilbert-base-uncased") ``` Quando carichi pesi pre-allenati, la configurazione del modello predefinito è automaticamente caricata se il modello è fornito da 🤗 Transformers. Tuttavia, puoi ancora sostituire gli attributi - alcuni o tutti - di configurazione del modello predefinito con i tuoi se lo desideri: ```py ->>> model = DistilBertModel.from_pretrained("distilbert-base-uncased", config=my_config) +>>> model = DistilBertModel.from_pretrained("distilbert/distilbert-base-uncased", config=my_config) ``` @@ -148,13 +152,13 @@ Questo crea modelli con valori casuali invece di pesi pre-allenati. Non sarai in Crea un modello pre-allenoto con [`~TFPreTrainedModel.from_pretrained`]: ```py ->>> tf_model = TFDistilBertModel.from_pretrained("distilbert-base-uncased") +>>> tf_model = TFDistilBertModel.from_pretrained("distilbert/distilbert-base-uncased") ``` Quando carichi pesi pre-allenati, la configurazione del modello predefinito è automaticamente caricato se il modello è fornito da 🤗 Transformers. Tuttavia, puoi ancora sostituire gli attributi - alcuni o tutti - di configurazione del modello predefinito con i tuoi se lo desideri: ```py ->>> tf_model = TFDistilBertModel.from_pretrained("distilbert-base-uncased", config=my_config) +>>> tf_model = TFDistilBertModel.from_pretrained("distilbert/distilbert-base-uncased", config=my_config) ``` @@ -171,7 +175,7 @@ Per esempio, [`DistilBertForSequenceClassification`] è un modello DistilBERT ba ```py >>> from transformers import DistilBertForSequenceClassification ->>> model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased") +>>> model = DistilBertForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` Riutilizza facilmente questo checkpoint per un'altra attività passando ad un model head differente. Per un attività di risposta alle domande, utilizzerai il model head [`DistilBertForQuestionAnswering`]. 
La head per compiti di question answering è simile alla classificazione di sequenza head tranne per il fatto che è uno strato lineare sopra l'output degli stati nascosti (hidden states in inglese) @@ -179,7 +183,7 @@ Riutilizza facilmente questo checkpoint per un'altra attività passando ad un mo ```py >>> from transformers import DistilBertForQuestionAnswering ->>> model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased") +>>> model = DistilBertForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased") ``` @@ -188,7 +192,7 @@ Per esempio, [`TFDistilBertForSequenceClassification`] è un modello DistilBERT ```py >>> from transformers import TFDistilBertForSequenceClassification ->>> tf_model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased") +>>> tf_model = TFDistilBertForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` Riutilizza facilmente questo checkpoint per un altra attività passando ad un modello head diverso. Per un attività di risposta alle domande, utilizzerai il model head [`TFDistilBertForQuestionAnswering`]. Il head di risposta alle domande è simile alla sequenza di classificazione head tranne per il fatto che è uno strato lineare sopra l'output degli stati nascosti (hidden states in inglese) @@ -196,7 +200,7 @@ Riutilizza facilmente questo checkpoint per un altra attività passando ad un mo ```py >>> from transformers import TFDistilBertForQuestionAnswering ->>> tf_model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased") +>>> tf_model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased") ``` @@ -229,7 +233,7 @@ Se hai addestrato il tuo tokenizer, puoi crearne uno dal tuo file *vocabolario*: ```py >>> from transformers import DistilBertTokenizer ->>> slow_tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased") +>>> slow_tokenizer = DistilBertTokenizer.from_pretrained("distilbert/distilbert-base-uncased") ``` Crea un tokenizer veloce con la classe [`DistilBertTokenizerFast`]: @@ -237,7 +241,7 @@ Crea un tokenizer veloce con la classe [`DistilBertTokenizerFast`]: ```py >>> from transformers import DistilBertTokenizerFast ->>> fast_tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased") +>>> fast_tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert/distilbert-base-uncased") ``` diff --git a/docs/source/it/custom_models.mdx b/docs/source/it/custom_models.md similarity index 98% rename from docs/source/it/custom_models.mdx rename to docs/source/it/custom_models.md index b4b0302e29e3d9..b0cdf4cd7bf030 100644 --- a/docs/source/it/custom_models.mdx +++ b/docs/source/it/custom_models.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. 
+ --> # Condividere modelli personalizzati diff --git a/docs/source/it/debugging.mdx b/docs/source/it/debugging.md similarity index 98% rename from docs/source/it/debugging.mdx rename to docs/source/it/debugging.md index 5b392489eab9f3..5c1dab51bd1179 100644 --- a/docs/source/it/debugging.mdx +++ b/docs/source/it/debugging.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Debugging diff --git a/docs/source/it/index.mdx b/docs/source/it/index.md similarity index 97% rename from docs/source/it/index.mdx rename to docs/source/it/index.md index 38ab8f8aa64ce9..76cdc0ad246104 100644 --- a/docs/source/it/index.mdx +++ b/docs/source/it/index.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # 🤗 Transformers @@ -51,6 +55,7 @@ La libreria attualmente contiene implementazioni in JAX, PyTorch e TensorFlow, p 1. **[ALBERT](model_doc/albert)** (da Google Research e l'Istituto Tecnologico di Chicago) rilasciato con il paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), da Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. +1. **[ALIGN](model_doc/align)** (from Google Research) rilasciato con il paper [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918) da Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. 1. **[BART](model_doc/bart)** (da Facebook) rilasciato con il paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) da Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov e Luke Zettlemoyer. 1. **[BARThez](model_doc/barthez)** (da politecnico di École) rilasciato con il paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) da Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis. 1. **[BARTpho](model_doc/bartpho)** (da VinAI Research) rilasciato con il paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) da Nguyen Luong Tran, Duong Minh Le e Dat Quoc Nguyen. @@ -67,6 +72,7 @@ La libreria attualmente contiene implementazioni in JAX, PyTorch e TensorFlow, p 1. 
**[CamemBERT](model_doc/camembert)** (da Inria/Facebook/Sorbonne) rilasciato con il paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) da Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah e Benoît Sagot. 1. **[CANINE](model_doc/canine)** (da Google Research) rilasciato con il paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) da Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. 1. **[ConvNeXT](model_doc/convnext)** (da Facebook AI) rilasciato con il paper [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) da Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie. +1. **[ConvNeXTV2](model_doc/convnextv2)** (da Facebook AI) rilasciato con il paper [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) da Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie. 1. **[CLIP](model_doc/clip)** (da OpenAI) rilasciato con il paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) da Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. 1. **[ConvBERT](model_doc/convbert)** (da YituTech) rilasciato con il paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) da Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan. 1. **[CPM](model_doc/cpm)** (dalla Università di Tsinghua) rilasciato con il paper [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) da Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun. @@ -83,6 +89,7 @@ La libreria attualmente contiene implementazioni in JAX, PyTorch e TensorFlow, p 1. **[DistilBERT](model_doc/distilbert)** (da HuggingFace), rilasciato assieme al paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) da Victor Sanh, Lysandre Debut e Thomas Wolf. La stessa tecnica è stata applicata per comprimere GPT2 in [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa in [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT in [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) and a German version of DistilBERT. 1. **[DPR](model_doc/dpr)** (da Facebook) rilasciato con il paper [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) da Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, e Wen-tau Yih. 1. **[DPT](master/model_doc/dpt)** (da Intel Labs) rilasciato con il paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) da René Ranftl, Alexey Bochkovskiy, Vladlen Koltun. +1. 
**[EfficientNet](model_doc/efficientnet)** (from Google Research) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan and Quoc V. Le. 1. **[EncoderDecoder](model_doc/encoder-decoder)** (da Google Research) rilasciato con il paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) da Sascha Rothe, Shashi Narayan, Aliaksei Severyn. 1. **[ELECTRA](model_doc/electra)** (da Google Research/Stanford University) rilasciato con il paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) da Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. 1. **[FlauBERT](model_doc/flaubert)** (da CNRS) rilasciato con il paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) da Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab. @@ -90,8 +97,8 @@ La libreria attualmente contiene implementazioni in JAX, PyTorch e TensorFlow, p 1. **[FNet](model_doc/fnet)** (da Google Research) rilasciato con il paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) da James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. 1. **[Funnel Transformer](model_doc/funnel)** (da CMU/Google Brain) rilasciato con il paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) da Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le. 1. **[GLPN](model_doc/glpn)** (da KAIST) rilasciato con il paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) da Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim. -1. **[GPT](model_doc/openai-gpt)** (da OpenAI) rilasciato con il paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) da Alec Radford, Karthik Narasimhan, Tim Salimans e Ilya Sutskever. -1. **[GPT-2](model_doc/gpt2)** (da OpenAI) rilasciato con il paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) da Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** e Ilya Sutskever**. +1. **[GPT](model_doc/openai-gpt)** (da OpenAI) rilasciato con il paper [Improving Language Understanding by Generative Pre-Training](https://openai.com/research/language-unsupervised/) da Alec Radford, Karthik Narasimhan, Tim Salimans e Ilya Sutskever. +1. **[GPT-2](model_doc/gpt2)** (da OpenAI) rilasciato con il paper [Language Models are Unsupervised Multitask Learners](https://openai.com/research/better-language-models/) da Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei e Ilya Sutskever. 1. **[GPT-J](model_doc/gptj)** (da EleutherAI) rilasciato nel repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) da Ben Wang e Aran Komatsuzaki. 1. **[GPT Neo](model_doc/gpt_neo)** (da EleutherAI) rilasciato nel repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) da Sid Black, Stella Biderman, Leo Gao, Phil Wang e Connor Leahy. 1. 
**[GPT NeoX](model_doc/gpt_neox)** (da EleutherAI) rilasciato con il paper [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745) da Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach @@ -249,9 +256,9 @@ tokenizer (chiamato "slow"). Un tokenizer "fast" supportato dalla libreria 🤗 | RAG | ✅ | ❌ | ✅ | ✅ | ❌ | | Realm | ✅ | ✅ | ✅ | ❌ | ❌ | | Reformer | ✅ | ✅ | ✅ | ❌ | ❌ | -| RegNet | ❌ | ❌ | ✅ | ❌ | ❌ | +| RegNet | ❌ | ❌ | ✅ | ✅ | ✅ | | RemBERT | ✅ | ✅ | ✅ | ✅ | ❌ | -| ResNet | ❌ | ❌ | ✅ | ❌ | ❌ | +| ResNet | ❌ | ❌ | ✅ | ✅ | ✅ | | RetriBERT | ✅ | ✅ | ✅ | ❌ | ❌ | | RoBERTa | ✅ | ✅ | ✅ | ✅ | ✅ | | RoFormer | ✅ | ✅ | ✅ | ✅ | ✅ | diff --git a/docs/source/it/installation.mdx b/docs/source/it/installation.md similarity index 95% rename from docs/source/it/installation.mdx rename to docs/source/it/installation.md index 1ff47c110cffad..2f45f4182d24c9 100644 --- a/docs/source/it/installation.mdx +++ b/docs/source/it/installation.md @@ -12,6 +12,10 @@ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Installazione @@ -126,10 +130,10 @@ Il tuo ambiente Python troverà la versione `main` di 🤗 Transformers alla pro ## Installazione con conda -Installazione dal canale conda `huggingface`: +Installazione dal canale conda `conda-forge`: ```bash -conda install -c huggingface transformers +conda install conda-forge::transformers ``` ## Impostazione della cache @@ -159,14 +163,14 @@ Aggiungi [🤗 Datasets](https://huggingface.co/docs/datasets/) al tuo flusso di Ad esempio, in genere si esegue un programma su una rete normale, protetta da firewall per le istanze esterne, con il seguente comando: ```bash -python examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --dataset_name wmt16 --dataset_config ro-en ... +python examples/pytorch/translation/run_translation.py --model_name_or_path google-t5/t5-small --dataset_name wmt16 --dataset_config ro-en ... ``` Esegui lo stesso programma in un'istanza offline con: ```bash HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \ -python examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --dataset_name wmt16 --dataset_config ro-en ... +python examples/pytorch/translation/run_translation.py --model_name_or_path google-t5/t5-small --dataset_name wmt16 --dataset_config ro-en ... ``` Lo script viene ora eseguito senza bloccarsi o attendere il timeout, perché sa di dover cercare solo file locali. @@ -232,4 +236,4 @@ Una volta che il tuo file è scaricato e salvato in cache localmente, specifica Fai riferimento alla sezione [How to download files from the Hub](https://huggingface.co/docs/hub/how-to-downstream) per avere maggiori dettagli su come scaricare modelli presenti sull Hub. - \ No newline at end of file +
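The installation hunks above switch the offline-mode examples to the `google-t5/t5-small` checkpoint but keep the environment-variable workflow. As a quick illustration (not part of the patch), the same offline behaviour can also be reached from Python through the `local_files_only` flag of `from_pretrained`; this sketch assumes the checkpoint has already been downloaded into the local cache:

```py
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Roughly equivalent to running with TRANSFORMERS_OFFLINE=1: only read files
# already present in the local cache and fail fast instead of contacting the Hub.
tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small", local_files_only=True)
model = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small", local_files_only=True)
```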
diff --git a/docs/source/it/migration.md b/docs/source/it/migration.md new file mode 100644 index 00000000000000..9a5f4d005505e8 --- /dev/null +++ b/docs/source/it/migration.md @@ -0,0 +1,313 @@ + + +# Migrazione da pacchetti precedenti + +## Migrazione da transformers `v3.x` a `v4.x` + +Un paio di modifiche sono state introdotte nel passaggio dalla versione 3 alla versione 4. Di seguito è riportato un riepilogo delle +modifiche previste: + +#### 1. AutoTokenizer e pipeline ora utilizzano tokenizer veloci (rust) per impostazione predefinita. + +I tokenizer python e rust hanno all'incirca le stesse API, ma i tokenizer rust hanno un set di funzionalità più completo. + +Ciò introduce due modifiche sostanziali: +- La gestione dei token in overflow tra i tokenizer Python e Rust è diversa. +- I tokenizers di rust non accettano numeri interi nei metodi di codifica. + +##### Come ottenere lo stesso comportamento di v3.x in v4.x + +- Le pipeline ora contengono funzionalità aggiuntive pronte all'uso. Vedi la [pipeline di classificazione dei token con il flag `grouped_entities`](main_classes/pipelines#transformers.TokenClassificationPipeline). +- Gli auto-tokenizer ora restituiscono tokenizer rust. Per ottenere invece i tokenizer python, l'utente deve usare il flag `use_fast` impostandolo a `False`: + +Nella versione `v3.x`: +```py +from transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") +``` +per ottenere lo stesso nella versione `v4.x`: +```py +from transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased", use_fast=False) +``` + +#### 2. SentencePiece è stato rimosso dalle dipendenze richieste + +Il requisito sulla dipendenza SentencePiece è stato rimosso da `setup.py`. È stato fatto per avere un canale su anaconda cloud senza basarsi su `conda-forge`. Ciò significa che i tokenizer che dipendono dalla libreria SentencePiece non saranno disponibili con un'installazione standard di `transformers`. + +Ciò include le versioni **lente** di: +- `XLNetTokenizer` +- `AlbertTokenizer` +- `CamembertTokenizer` +- `MBartTokenizer` +- `PegasusTokenizer` +- `T5Tokenizer` +- `ReformerTokenizer` +- `XLMRobertaTokenizer` + +##### Come ottenere lo stesso comportamento della v3.x nella v4.x + +Per ottenere lo stesso comportamento della versione `v3.x`, devi installare anche `sentencepiece`: + +Nella versione `v3.x`: +```bash +pip install transformers +``` +per ottenere lo stesso nella versione `v4.x`: +```bash +pip install transformers[sentencepiece] +``` +o +```bash +pip install transformers sentencepiece +``` +#### 3. L'architettura delle repo è stata aggiornata in modo che ogni modello abbia la propria cartella + +Con l’aggiunta di nuovi modelli, il numero di file nella cartella `src/transformers` continua a crescere e diventa più difficile navigare e capire. Abbiamo fatto la scelta di inserire ogni modello e i file che lo accompagnano nelle proprie sottocartelle. + +Si tratta di una modifica sostanziale in quanto l'importazione di layer intermedi utilizzando direttamente il modulo di un modello deve essere eseguita tramite un percorso diverso. + +##### Come ottenere lo stesso comportamento della v3.x nella v4.x + +Per ottenere lo stesso comportamento della versione `v3.x`, devi aggiornare il percorso utilizzato per accedere ai layer.
+ +Nella versione `v3.x`: +```bash +from transformers.modeling_bert import BertLayer +``` +per ottenere lo stesso nella versione `v4.x`: +```bash +from transformers.models.bert.modeling_bert import BertLayer +``` + +#### 4. Impostare l'argomento `return_dict` su `True` per impostazione predefinita + +L'[argomento `return_dict`](main_classes/output) abilita la restituzione di oggetti python dict-like contenenti gli output del modello, invece delle tuple standard. Questo oggetto è self-documented poiché le chiavi possono essere utilizzate per recuperare valori, comportandosi anche come una tupla e gli utenti possono recuperare oggetti per indexing o slicing. + +Questa è una modifica sostanziale poiché la tupla non può essere decompressa: `value0, value1 = outputs` non funzionerà. + +##### Come ottenere lo stesso comportamento della v3.x nella v4.x + +Per ottenere lo stesso comportamento della versione `v3.x`, specifica l'argomento `return_dict` come `False`, sia nella configurazione del modello che nel passaggio successivo. + +Nella versione `v3.x`: +```bash +model = BertModel.from_pretrained("google-bert/bert-base-cased") +outputs = model(**inputs) +``` +per ottenere lo stesso nella versione `v4.x`: +```bash +model = BertModel.from_pretrained("google-bert/bert-base-cased") +outputs = model(**inputs, return_dict=False) +``` +o +```bash +model = BertModel.from_pretrained("google-bert/bert-base-cased", return_dict=False) +outputs = model(**inputs) +``` + +#### 5. Rimozione di alcuni attributi deprecati + +Gli attributi sono stati rimossi se deprecati da almeno un mese. L'elenco completo degli attributi obsoleti è disponibile in [#8604](https://github.com/huggingface/transformers/pull/8604). + +Ecco un elenco di questi attributi/metodi/argomenti e quali dovrebbero essere le loro sostituzioni: + +In diversi modelli, le etichette diventano coerenti con gli altri modelli: +- `masked_lm_labels` diventa `labels` in `AlbertForMaskedLM` e `AlbertForPreTraining`. +- `masked_lm_labels` diventa `labels` in `BertForMaskedLM` e `BertForPreTraining`. +- `masked_lm_labels` diventa `labels` in `DistilBertForMaskedLM`. +- `masked_lm_labels` diventa `labels` in `ElectraForMaskedLM`. +- `masked_lm_labels` diventa `labels` in `LongformerForMaskedLM`. +- `masked_lm_labels` diventa `labels` in `MobileBertForMaskedLM`. +- `masked_lm_labels` diventa `labels` in `RobertaForMaskedLM`. +- `lm_labels` diventa `labels` in `BartForConditionalGeneration`. +- `lm_labels` diventa `labels` in `GPT2DoubleHeadsModel`. +- `lm_labels` diventa `labels` in `OpenAIGPTDoubleHeadsModel`. +- `lm_labels` diventa `labels` in `T5ForConditionalGeneration`. + +In diversi modelli, il meccanismo di memorizzazione nella cache diventa coerente con gli altri: +- `decoder_cached_states` diventa `past_key_values` in tutti i modelli BART-like, FSMT e T5. +- `decoder_past_key_values` diventa `past_key_values` in tutti i modelli BART-like, FSMT e T5. +- `past` diventa `past_key_values` in tutti i modelli CTRL. +- `past` diventa `past_key_values` in tutti i modelli GPT-2. + +Per quanto riguarda le classi tokenizer: +- L'attributo tokenizer `max_len` diventa `model_max_length`. +- L'attributo tokenizer `return_lengths` diventa `return_length`. +- L'argomento di codifica del tokenizer `is_pretokenized` diventa `is_split_into_words`. + +Per quanto riguarda la classe `Trainer`: +- L'argomento `tb_writer` di `Trainer` è stato rimosso in favore della funzione richiamabile `TensorBoardCallback(tb_writer=...)`. 
+- L'argomento `prediction_loss_only` di `Trainer` è stato rimosso in favore dell'argomento di classe `args.prediction_loss_only`. +- L'attributo `data_collator` di `Trainer` sarà richiamabile. +- Il metodo `_log` di `Trainer` è deprecato a favore di `log`. +- Il metodo `_training_step` di `Trainer` è deprecato a favore di `training_step`. +- Il metodo `_prediction_loop` di `Trainer` è deprecato a favore di `prediction_loop`. +- Il metodo `is_local_master` di `Trainer` è deprecato a favore di `is_local_process_zero`. +- Il metodo `is_world_master` di `Trainer` è deprecato a favore di `is_world_process_zero`. + +Per quanto riguarda la classe `TrainingArguments`: +- L'argomento `evaluate_during_training` di `TrainingArguments` è deprecato a favore di `evaluation_strategy`. + +Per quanto riguarda il modello Transfo-XL: +- L'attributo di configurazione `tie_weight` di Transfo-XL diventa `tie_words_embeddings`. +- Il metodo di modellazione `reset_length` di Transfo-XL diventa `reset_memory_length`. + +Per quanto riguarda le pipeline: +- L'argomento `topk` di `FillMaskPipeline` diventa `top_k`. + + + +## Passaggio da pytorch-transformers a 🤗 Transformers + +Ecco un breve riepilogo di ciò a cui prestare attenzione durante il passaggio da `pytorch-transformers` a 🤗 Transformers. + +### L’ordine posizionale di alcune parole chiave di input dei modelli (`attention_mask`, `token_type_ids`...) è cambiato + +Per usare Torchscript (vedi #1010, #1204 e #1195) l'ordine specifico delle **parole chiave di input** di alcuni modelli (`attention_mask`, `token_type_ids`...) è stato modificato. + +Se inizializzavi i modelli usando parole chiave per gli argomenti, ad esempio `model(inputs_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)`, questo non dovrebbe causare alcun cambiamento. + +Se inizializzavi i modelli con input posizionali per gli argomenti, ad esempio `model(inputs_ids, attention_mask, token_type_ids)`, potrebbe essere necessario ricontrollare l'ordine esatto degli argomenti di input. + +## Migrazione da pytorch-pretrained-bert + +Ecco un breve riepilogo di ciò a cui prestare attenzione durante la migrazione da `pytorch-pretrained-bert` a 🤗 Transformers + +### I modelli restituiscono sempre `tuple` + +La principale modifica di rilievo durante la migrazione da `pytorch-pretrained-bert` a 🤗 Transformers è che il metodo dei modelli di previsione dà sempre una `tupla` con vari elementi a seconda del modello e dei parametri di configurazione. + +Il contenuto esatto delle tuple per ciascun modello è mostrato in dettaglio nelle docstring dei modelli e nella [documentazione](https://huggingface.co/transformers/). + +In quasi tutti i casi, andrà bene prendendo il primo elemento dell'output come quello che avresti precedentemente utilizzato in `pytorch-pretrained-bert`. 
+ +Ecco un esempio di conversione da `pytorch-pretrained-bert` + a 🤗 Transformers per un modello di classificazione `BertForSequenceClassification`: + +```python +# Carichiamo il nostro modello +model = BertForSequenceClassification.from_pretrained("google-bert/bert-base-uncased") + +# Se usavi questa riga in pytorch-pretrained-bert: +loss = model(input_ids, labels=labels) + +# Ora usa questa riga in 🤗 Transformers per estrarre la perdita dalla tupla di output: +outputs = model(input_ids, labels=labels) +loss = outputs[0] + +# In 🤗 Transformers puoi anche avere accesso ai logit: +loss, logits = outputs[:2] + +# Ed anche agli attention weight se configuri il modello per restituirli (e anche altri output, vedi le docstring e la documentazione) +model = BertForSequenceClassification.from_pretrained("google-bert/bert-base-uncased", output_attentions=True) +outputs = model(input_ids, labels=labels) +loss, logits, attentions = outputs +``` + +### Serializzazione + +Modifica sostanziale nel metodo `from_pretrained()`: + +1. I modelli sono ora impostati in modalità di valutazione in maniera predefinita quando usi il metodo `from_pretrained()`. Per addestrarli non dimenticare di riportarli in modalità di addestramento (`model.train()`) per attivare i moduli di dropout. + +2. Gli argomenti aggiuntivi `*inputs` e `**kwargs` forniti al metodo `from_pretrained()` venivano passati direttamente al metodo `__init__()` della classe sottostante del modello. Ora sono usati per aggiornare prima l'attributo di configurazione del modello, che può non funzionare con le classi del modello derivate costruite basandosi sui precedenti esempi di `BertForSequenceClassification`. Più precisamente, gli argomenti posizionali `*inputs` forniti a `from_pretrained()` vengono inoltrati direttamente al metodo `__init__()` del modello mentre gli argomenti keyword `**kwargs` (i) che corrispondono agli attributi della classe di configurazione, vengono utilizzati per aggiornare tali attributi (ii) che non corrispondono ad alcun attributo della classe di configurazione, vengono inoltrati al metodo `__init__()`. + +Inoltre, sebbene non si tratti di una modifica sostanziale, i metodi di serializzazione sono stati standardizzati e probabilmente dovresti passare al nuovo metodo `save_pretrained(save_directory)` se prima usavi qualsiasi altro metodo di serializzazione.
+ +Ecco un esempio: + +```python +### Carichiamo un modello e un tokenizer +model = BertForSequenceClassification.from_pretrained("google-bert/bert-base-uncased") +tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased") + +### Facciamo fare alcune cose al nostro modello e tokenizer +# Es: aggiungiamo nuovi token al vocabolario e agli embending del nostro modello +tokenizer.add_tokens(["[SPECIAL_TOKEN_1]", "[SPECIAL_TOKEN_2]"]) +model.resize_token_embeddings(len(tokenizer)) +# Alleniamo il nostro modello +train(model) + +### Ora salviamo il nostro modello e il tokenizer in una cartella +model.save_pretrained("./my_saved_model_directory/") +tokenizer.save_pretrained("./my_saved_model_directory/") + +### Ricarichiamo il modello e il tokenizer +model = BertForSequenceClassification.from_pretrained("./my_saved_model_directory/") +tokenizer = BertTokenizer.from_pretrained("./my_saved_model_directory/") +``` + +### Ottimizzatori: BertAdam e OpenAIAdam ora sono AdamW, lo scheduling è quello standard PyTorch + +I due ottimizzatori precedenti inclusi, `BertAdam` e `OpenAIAdam`, sono stati sostituiti da un singolo `AdamW` che presenta alcune differenze: + +- implementa solo la correzione del weights decay, +- lo scheduling ora è esterno (vedi sotto), +- anche il gradient clipping ora è esterno (vedi sotto). + +Il nuovo ottimizzatore `AdamW` corrisponde alle API di `Adam` di PyTorch e ti consente di utilizzare metodi PyTorch o apex per lo scheduling e il clipping. + +Lo scheduling è ora standard [PyTorch learning rate schedulers](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate) e non fanno più parte dell'ottimizzatore. + +Ecco un esempio di linear warmup e decay con `BertAdam` e con `AdamW`: + +```python +# Parametri: +lr = 1e-3 +max_grad_norm = 1.0 +num_training_steps = 1000 +num_warmup_steps = 100 +warmup_proportion = float( num_warmup_steps) / float(num_training_steps) # 0.1 + +### In precedenza l'ottimizzatore BertAdam veniva istanziato in questo modo: +optimizer = BertAdam( + model.parameters(), + lr=lr, + schedule="warmup_linear", + warmup=warmup_proportion, + num_training_steps=num_training_steps, +) +### e usato in questo modo: +for batch in train_data: + loss = model(batch) + loss.backward() + optimizer.step() + +### In 🤗 Transformers, ottimizzatore e schedule sono divisi e usati in questo modo: +optimizer = AdamW( + model.parameters(), lr=lr, correct_bias=False +) # Per riprodurre il comportamento specifico di BertAdam impostare correct_bias=False +scheduler = get_linear_schedule_with_warmup( + optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps +) # PyTorch scheduler +### e va usato così: +for batch in train_data: + loss = model(batch) + loss.backward() + torch.nn.utils.clip_grad_norm_( + model.parameters(), max_grad_norm + ) # Gradient clipping non è più in AdamW (quindi puoi usare amp senza problemi) + optimizer.step() + scheduler.step() +``` diff --git a/docs/source/it/model_sharing.mdx b/docs/source/it/model_sharing.md similarity index 95% rename from docs/source/it/model_sharing.mdx rename to docs/source/it/model_sharing.md index 9e1ca9588a1053..81257717ed9a70 100644 --- a/docs/source/it/model_sharing.mdx +++ b/docs/source/it/model_sharing.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Condividi un modello @@ -231,4 +235,4 @@ Per assicurarti che chiunque possa comprendere le abilità, limitazioni, i poten * Creando manualmente e caricando un file `README.md`. * Premendo sul pulsante **Edit model card** nel repository del tuo modello. -Dai un'occhiata alla [scheda del modello](https://huggingface.co/distilbert-base-uncased) di DistilBert per avere un buon esempio del tipo di informazioni che una scheda di un modello deve includere. Per maggiori dettagli legati ad altre opzioni che puoi controllare nel file `README.md`, come l'impatto ambientale o widget di esempio, fai riferimento alla documentazione [qui](https://huggingface.co/docs/hub/models-cards). +Dai un'occhiata alla [scheda del modello](https://huggingface.co/distilbert/distilbert-base-uncased) di DistilBert per avere un buon esempio del tipo di informazioni che una scheda di un modello deve includere. Per maggiori dettagli legati ad altre opzioni che puoi controllare nel file `README.md`, come l'impatto ambientale o widget di esempio, fai riferimento alla documentazione [qui](https://huggingface.co/docs/hub/models-cards). diff --git a/docs/source/it/multilingual.mdx b/docs/source/it/multilingual.md similarity index 77% rename from docs/source/it/multilingual.mdx rename to docs/source/it/multilingual.md index a8ccec97d0a7de..e9e85beec1d966 100644 --- a/docs/source/it/multilingual.mdx +++ b/docs/source/it/multilingual.md @@ -8,13 +8,17 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Modelli multilingue per l'inferenza [[open-in-colab]] -Ci sono diversi modelli multilingue in 🤗 Transformers, e il loro utilizzo per l'inferenza differisce da quello dei modelli monolingua. Non *tutti* gli utilizzi dei modelli multilingue sono però diversi. Alcuni modelli, come [bert-base-multilingual-uncased](https://huggingface.co/bert-base-multilingual-uncased), possono essere usati come un modello monolingua. Questa guida ti mostrerà come utilizzare modelli multilingue che utilizzano un modo diverso per fare l'inferenza. +Ci sono diversi modelli multilingue in 🤗 Transformers, e il loro utilizzo per l'inferenza differisce da quello dei modelli monolingua. Non *tutti* gli utilizzi dei modelli multilingue sono però diversi. Alcuni modelli, come [google-bert/bert-base-multilingual-uncased](https://huggingface.co/google-bert/bert-base-multilingual-uncased), possono essere usati come un modello monolingua. Questa guida ti mostrerà come utilizzare modelli multilingue che utilizzano un modo diverso per fare l'inferenza. ## XLM @@ -24,24 +28,24 @@ XLM ha dieci diversi checkpoint, di cui solo uno è monolingua. 
I nove checkpoin I seguenti modelli XLM utilizzano gli embeddings linguistici per specificare la lingua utilizzata per l'inferenza: -- `xlm-mlm-ende-1024` (Modellazione mascherata del linguaggio (Masked language modeling, in inglese), Inglese-Tedesco) -- `xlm-mlm-enfr-1024` (Modellazione mascherata del linguaggio, Inglese-Francese) -- `xlm-mlm-enro-1024` (Modellazione mascherata del linguaggio, Inglese-Rumeno) -- `xlm-mlm-xnli15-1024` (Modellazione mascherata del linguaggio, lingue XNLI) -- `xlm-mlm-tlm-xnli15-1024` (Modellazione mascherata del linguaggio + traduzione, lingue XNLI) -- `xlm-clm-enfr-1024` (Modellazione causale del linguaggio, Inglese-Francese) -- `xlm-clm-ende-1024` (Modellazione causale del linguaggio, Inglese-Tedesco) +- `FacebookAI/xlm-mlm-ende-1024` (Modellazione mascherata del linguaggio (Masked language modeling, in inglese), Inglese-Tedesco) +- `FacebookAI/xlm-mlm-enfr-1024` (Modellazione mascherata del linguaggio, Inglese-Francese) +- `FacebookAI/xlm-mlm-enro-1024` (Modellazione mascherata del linguaggio, Inglese-Rumeno) +- `FacebookAI/xlm-mlm-xnli15-1024` (Modellazione mascherata del linguaggio, lingue XNLI) +- `FacebookAI/xlm-mlm-tlm-xnli15-1024` (Modellazione mascherata del linguaggio + traduzione, lingue XNLI) +- `FacebookAI/xlm-clm-enfr-1024` (Modellazione causale del linguaggio, Inglese-Francese) +- `FacebookAI/xlm-clm-ende-1024` (Modellazione causale del linguaggio, Inglese-Tedesco) Gli embeddings linguistici sono rappresentati come un tensore delle stesse dimensioni dell' `input_ids` passato al modello. I valori in questi tensori dipendono dal linguaggio usato e sono identificati dagli attributi `lang2id` e `id2lang` del tokenizer. -In questo esempio, carica il checkpoint `xlm-clm-enfr-1024` (Modellazione causale del linguaggio, Inglese-Francese): +In questo esempio, carica il checkpoint `FacebookAI/xlm-clm-enfr-1024` (Modellazione causale del linguaggio, Inglese-Francese): ```py >>> import torch >>> from transformers import XLMTokenizer, XLMWithLMHeadModel ->>> tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024") ->>> model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024") +>>> tokenizer = XLMTokenizer.from_pretrained("FacebookAI/xlm-clm-enfr-1024") +>>> model = XLMWithLMHeadModel.from_pretrained("FacebookAI/xlm-clm-enfr-1024") ``` L'attributo `lang2id` del tokenizer mostra il linguaggio del modello e il suo ids: @@ -79,8 +83,8 @@ Lo script [run_generation.py](https://github.com/huggingface/transformers/tree/m I seguenti modelli XLM non richiedono l'utilizzo dei language embeddings per fare inferenza: -- `xlm-mlm-17-1280` (Modellazione mascherata del linguaggio, 17 lingue) -- `xlm-mlm-100-1280` (Modellazione mascherata del linguaggio, 100 lingue) +- `FacebookAI/xlm-mlm-17-1280` (Modellazione mascherata del linguaggio, 17 lingue) +- `FacebookAI/xlm-mlm-100-1280` (Modellazione mascherata del linguaggio, 100 lingue) Questi modelli sono utilizzati per rappresentazioni generiche di frasi, a differenza dei precedenti checkpoints XML. 
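The hunks above only rename the XLM checkpoints; the surrounding usage code is unchanged and therefore not visible in the diff. As a reminder of how the language embeddings discussed here are passed at inference time, a minimal sketch (not part of the patch) using the `FacebookAI/xlm-clm-enfr-1024` checkpoint loaded above could look like this:

```py
import torch
from transformers import XLMTokenizer, XLMWithLMHeadModel

tokenizer = XLMTokenizer.from_pretrained("FacebookAI/xlm-clm-enfr-1024")
model = XLMWithLMHeadModel.from_pretrained("FacebookAI/xlm-clm-enfr-1024")

input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")])  # batch of size 1
language_id = tokenizer.lang2id["en"]                 # id of the language to condition on
langs = torch.full_like(input_ids, language_id)       # one language id per input token

outputs = model(input_ids, langs=langs)
```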
@@ -88,8 +92,8 @@ Questi modelli sono utilizzati per rappresentazioni generiche di frasi, a differ Il seguente modello BERT può essere usato per compiti multilingue: -- `bert-base-multilingual-uncased` (Modellazione mascherata del linguaggio + Previsione della prossima frase, 102 lingue) -- `bert-base-multilingual-cased` (Modellazione mascherata del linguaggio + Previsione della prossima frase, 104 lingue) +- `google-bert/bert-base-multilingual-uncased` (Modellazione mascherata del linguaggio + Previsione della prossima frase, 102 lingue) +- `google-bert/bert-base-multilingual-cased` (Modellazione mascherata del linguaggio + Previsione della prossima frase, 104 lingue) Questi modelli non richiedono language embeddings per fare inferenza. Riescono ad identificare il linguaggio dal contesto e inferire di conseguenza. @@ -97,8 +101,8 @@ Questi modelli non richiedono language embeddings per fare inferenza. Riescono a Il seguente modello XLM-RoBERTa può essere usato per compiti multilingue: -- `xlm-roberta-base` (Modellazione mascherata del linguaggio, 100 lingue) -- `xlm-roberta-large` (Modellazione mascherata del linguaggio, 100 lingue) +- `FacebookAI/xlm-roberta-base` (Modellazione mascherata del linguaggio, 100 lingue) +- `FacebookAI/xlm-roberta-large` (Modellazione mascherata del linguaggio, 100 lingue) XLM-RoBERTa è stato addestrato su 2.5TB di dati CommonCrawl appena creati e puliti in 100 lingue. Offre notevoli vantaggi rispetto ai modelli multilingue rilasciati in precedenza, come mBERT o XLM, in compiti come la classificazione, l'etichettatura delle sequenze e la risposta alle domande. diff --git a/docs/source/it/perf_hardware.mdx b/docs/source/it/perf_hardware.md similarity index 95% rename from docs/source/it/perf_hardware.mdx rename to docs/source/it/perf_hardware.md index 0bfdbc8fe686b9..946dcb3238d057 100644 --- a/docs/source/it/perf_hardware.mdx +++ b/docs/source/it/perf_hardware.md @@ -12,6 +12,10 @@ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> @@ -59,7 +63,7 @@ Diamo quindi un'occhiata a uno degli aspetti più importanti quando si hanno pi Se utilizzi più GPU, il modo in cui le schede sono interconnesse può avere un enorme impatto sul tempo totale di allenamento. Se le GPU si trovano sullo stesso nodo fisico, puoi eseguire: -``` +```bash nvidia-smi topo -m ``` @@ -112,7 +116,7 @@ Ogni nuova generazione fornisce una larghezza di banda più veloce, ad es. ecco Quindi più `X` si ottiene nel rapporto di `NVX` nell'output di `nvidia-smi topo -m`, meglio è. La generazione dipenderà dall'architettura della tua GPU. 
-Confrontiamo l'esecuzione di un training del modello di linguaggio gpt2 su un piccolo campione di wikitext +Confrontiamo l'esecuzione di un training del modello di linguaggio openai-community/gpt2 su un piccolo campione di wikitext I risultati sono: @@ -130,8 +134,8 @@ Ecco il codice benchmark completo e gli output: ```bash # DDP w/ NVLink -rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \ ---nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \ +rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \ +--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \ --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \ --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 @@ -139,8 +143,8 @@ rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch # DDP w/o NVLink -rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 python -m torch.distributed.launch \ ---nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \ +rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \ +--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \ --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 diff --git a/docs/source/it/perf_infer_cpu.md b/docs/source/it/perf_infer_cpu.md new file mode 100644 index 00000000000000..baae51a5a97897 --- /dev/null +++ b/docs/source/it/perf_infer_cpu.md @@ -0,0 +1,79 @@ + + +# Inferenza Efficiente su CPU + +Questa guida si concentra sull'inferenza di modelli di grandi dimensioni in modo efficiente sulla CPU. + +## `BetterTransformer` per inferenza più rapida + +Abbiamo integrato di recente `BetterTransformer` per fare inferenza più rapidamente con modelli per testi, immagini e audio. Visualizza la documentazione sull'integrazione [qui](https://huggingface.co/docs/optimum/bettertransformer/overview) per maggiori dettagli. + +## PyTorch JIT-mode (TorchScript) + +TorchScript è un modo di creare modelli serializzabili e ottimizzabili da codice PyTorch. Ogni programma TorchScript può essere salvato da un processo Python e caricato in un processo dove non ci sono dipendenze Python. +Comparandolo con l'eager mode di default, jit mode in PyTorch normalmente fornisce prestazioni migliori per l'inferenza del modello grazie a metodologie di ottimizzazione come l'operator fusion. + +Per una prima introduzione a TorchScript, vedi l'[Introduction to PyTorch TorchScript tutorial](https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html#tracing-modules). + +### IPEX Graph Optimization con JIT-mode + +Intel® Extension per PyTorch fornisce ulteriori ottimizzazioni in jit mode per i modelli della serie Transformers. Consigliamo vivamente agli utenti di usufruire dei vantaggi di Intel® Extension per PyTorch con jit mode. Alcuni operator patterns usati frequentemente dai modelli Transformers sono già supportati in Intel® Extension per PyTorch con jit mode fusions. Questi fusion patterns come Multi-head-attention fusion, Concat Linear, Linear+Add, Linear+Gelu, Add+LayerNorm fusion, ecc. sono abilitati e hanno buone performance. I benefici della fusion sono forniti agli utenti in modo trasparente.
In base alle analisi, il ~70% dei problemi più popolari in NLP question-answering, text-classification e token-classification possono avere benefici sulle performance grazie ai fusion patterns sia per Float32 precision che per BFloat16 Mixed precision. + +Vedi maggiori informazioni per [IPEX Graph Optimization](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/features/graph_optimization.html). + +#### Installazione di IPEX + +I rilasci di IPEX seguono PyTorch, verifica i vari approcci per [IPEX installation](https://intel.github.io/intel-extension-for-pytorch/). + +### Utilizzo del JIT-mode + +Per abilitare JIT-mode in Trainer per evaluation e prediction, devi aggiungere `jit_mode_eval` negli argomenti di Trainer. + + + +per PyTorch >= 1.14.0. JIT-mode potrebbe giovare a qualsiasi modello di prediction e evaluation visto che il dict input è supportato in jit.trace + +per PyTorch < 1.14.0. JIT-mode potrebbe giovare ai modelli il cui ordine dei parametri corrisponde all'ordine delle tuple in ingresso in jit.trace, come i modelli per question-answering. +Nel caso in cui l'ordine dei parametri seguenti non corrisponda all'ordine delle tuple in ingresso in jit.trace, come nei modelli di text-classification, jit.trace fallirà e lo cattureremo con una eccezione al fine di renderlo un fallback. Il logging è usato per notificare gli utenti. + + + +Trovi un esempio con caso d'uso in [Transformers question-answering](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) + +- Inference using jit mode on CPU: + +
python run_qa.py \
+--model_name_or_path csarron/bert-base-uncased-squad-v1 \
+--dataset_name squad \
+--do_eval \
+--max_seq_length 384 \
+--doc_stride 128 \
+--output_dir /tmp/ \
+--no_cuda \
+--jit_mode_eval 
+ +- Inference with IPEX using jit mode on CPU: + +
python run_qa.py \
+--model_name_or_path csarron/bert-base-uncased-squad-v1 \
+--dataset_name squad \
+--do_eval \
+--max_seq_length 384 \
+--doc_stride 128 \
+--output_dir /tmp/ \
+--no_cuda \
+--use_ipex \
+--jit_mode_eval
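A titolo puramente illustrativo, ecco uno schizzo ipotetico di ciò che `--jit_mode_eval` fa dietro le quinte: il modello viene tracciato con `torch.jit.trace` e il grafo risultante viene usato per l'inferenza. Non è il codice effettivo del Trainer; il checkpoint è lo stesso dell'esempio sopra, mentre domanda e contesto sono inventati.

```py
# Schizzo ipotetico (non il codice del Trainer): tracciamento manuale di un modello
# di question-answering per l'inferenza su CPU in jit mode.
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_id = "csarron/bert-base-uncased-squad-v1"  # stesso checkpoint dell'esempio sopra
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id, torchscript=True).eval()

inputs = tokenizer("Where does Clara live?", "Clara lives in Berlin.", return_tensors="pt")

with torch.no_grad():
    # l'ordine degli input posizionali deve corrispondere alla firma di forward
    traced_model = torch.jit.trace(
        model,
        (inputs["input_ids"], inputs["attention_mask"], inputs["token_type_ids"]),
        strict=False,
    )
    traced_model = torch.jit.freeze(traced_model)
    start_logits, end_logits = traced_model(
        inputs["input_ids"], inputs["attention_mask"], inputs["token_type_ids"]
    )
```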
diff --git a/docs/source/it/perf_infer_gpu_many.md b/docs/source/it/perf_infer_gpu_many.md new file mode 100644 index 00000000000000..b78cb34e1d6d81 --- /dev/null +++ b/docs/source/it/perf_infer_gpu_many.md @@ -0,0 +1,28 @@ + + +# Inferenza Efficiente su GPU Multiple + +Questo documento contiene informazioni su come fare inferenza in maniera efficiente su GPU multiple. + + + +Nota: Un setup con GPU multiple può utilizzare la maggior parte delle strategie descritte nella [sezione con GPU singola](./perf_infer_gpu_one). Tuttavia, è necessario conoscere delle tecniche semplici che possono essere utilizzate per un risultato migliore. + + + +## `BetterTransformer` per inferenza più rapida + +Abbiamo recentemente integrato `BetterTransformer` per inferenza più rapida su multi-GPU per modelli su testo, immagini e audio. Controlla il documento con queste integrazioni [qui](https://huggingface.co/docs/optimum/bettertransformer/overview) per maggiori dettagli. diff --git a/docs/source/it/perf_infer_gpu_one.md b/docs/source/it/perf_infer_gpu_one.md new file mode 100644 index 00000000000000..16f77b3b1f31cc --- /dev/null +++ b/docs/source/it/perf_infer_gpu_one.md @@ -0,0 +1,112 @@ + + +# Inferenza efficiente su GPU singola + +Questo documento sarà presto completato con informazioni su come effetture l'inferenza su una singola GPU. Nel frattempo è possibile consultare [la guida per l'addestramento su una singola GPU](perf_train_gpu_one) e [la guida per l'inferenza su CPU](perf_infer_cpu). + +## `BetterTransformer` per l'inferenza più veloce + +Abbiamo recentemente integrato `BetterTransformer` per velocizzare l'inferenza su GPU per modelli di testo, immagini e audio. Per maggiori dettagli, consultare la documentazione su questa integrazione [qui](https://huggingface.co/docs/optimum/bettertransformer/overview). + +## Integrazione di `bitsandbytes` per Int8 mixed-precision matrix decomposition + + + +Nota che questa funzione può essere utilizzata anche nelle configurazioni multi GPU. + + + +Dal paper [`LLM.int8() : 8-bit Matrix Multiplication for Transformers at Scale`](https://arxiv.org/abs/2208.07339), noi supportiamo l'integrazione di Hugging Face per tutti i modelli dell'Hub con poche righe di codice. +Il metodo `nn.Linear` riduce la dimensione di 2 per i pesi `float16` e `bfloat16` e di 4 per i pesi `float32`, con un impatto quasi nullo sulla qualità, operando sugli outlier in half-precision. + +![HFxbitsandbytes.png](https://cdn-uploads.huggingface.co/production/uploads/1659861207959-62441d1d9fdefb55a0b7d12c.png) + +Il metodo Int8 mixed-precision matrix decomposition funziona separando la moltiplicazione tra matrici in due flussi: (1) una matrice di flusso di outlier di caratteristiche sistematiche moltiplicata in fp16, (2) in flusso regolare di moltiplicazione di matrici int8 (99,9%). Con questo metodo, è possibile effettutare inferenza int8 per modelli molto grandi senza degrado predittivo. +Per maggiori dettagli sul metodo, consultare il [paper](https://arxiv.org/abs/2208.07339) o il nostro [blogpost sull'integrazione](https://huggingface.co/blog/hf-bitsandbytes-integration). + +![MixedInt8.gif](https://cdn-uploads.huggingface.co/production/uploads/1660567469965-62441d1d9fdefb55a0b7d12c.gif) + +Nota che è necessaria una GPU per eseguire modelli di tipo mixed-8bit, poiché i kernel sono stati compilati solo per le GPU. 
Prima di utilizzare questa funzione, assicurarsi di disporre di memoria sufficiente sulla GPU per memorizzare un quarto del modello (o la metà se i pesi del modello sono in mezza precisione). +Di seguito sono riportate alcune note per aiutarvi a utilizzare questo modulo, oppure seguite le dimostrazioni su [Google colab](#colab-demos). + +### Requisiti + +- Se si dispone di `bitsandbytes<0.37.0`, assicurarsi di eseguire su GPU NVIDIA che supportano tensor cores a 8 bit (Turing, Ampere o architetture più recenti - ad esempio T4, RTX20s RTX30s, A40-A100). Per `bitsandbytes>=0.37.0`, tutte le GPU dovrebbero essere supportate. +- Installare la versione corretta di `bitsandbytes` eseguendo: +`pip install bitsandbytes>=0.31.5`. +- Installare `accelerate` +`pip install accelerate>=0.12.0` + +### Esecuzione di modelli mixed-Int8 - configurazione per singola GPU + +Dopo aver installato le librerie necessarie, ecco come caricare il tuo modello mixed 8-bit: + +```py +from transformers import AutoModelForCausalLM + +model_name = "bigscience/bloom-2b5" +model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True) +``` + +Per la generazione di testo, si consiglia di: + +* utilizzare il metodo `generate()` del modello invece della funzione `pipeline()`. Sebbene l'inferenza sia possibile con la funzione `pipeline()`, essa non è ottimizzata per i modelli mixed-8bit e sarà più lenta rispetto all'uso del metodo `generate()`. Inoltre, alcune strategie di campionamento, come il campionamento nucleus, non sono supportate dalla funzione `pipeline()` per i modelli mixed-8bit. +* collocare tutti gli ingressi sullo stesso dispositivo del modello. + +Ecco un semplice esempio: + +```py +from transformers import AutoModelForCausalLM, AutoTokenizer + +model_name = "bigscience/bloom-2b5" +tokenizer = AutoTokenizer.from_pretrained(model_name) +model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True) + +text = "Hello, my llama is cute" +inputs = tokenizer(text, return_tensors="pt").to("cuda") +generated_ids = model_8bit.generate(**inputs) +outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True) +``` + + +### Esecuzione di modelli mixed-8bit - configurazione multi GPU + +Usa il codice seguente per caricare il modello mixed-8bit su più GPU (è lo stesso comando della configurazione a GPU singola): +```py +model_name = "bigscience/bloom-2b5" +model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True) +``` +Puoi controllare la memoria GPU da allocare su ogni GPU usando `accelerate`, tramite l'argomento `max_memory`, come segue: + +```py +max_memory_mapping = {0: "1GB", 1: "2GB"} +model_name = "bigscience/bloom-3b" +model_8bit = AutoModelForCausalLM.from_pretrained( + model_name, device_map="auto", load_in_8bit=True, max_memory=max_memory_mapping +) +``` +In questo esempio, la prima GPU utilizzerà 1 GB di memoria e la seconda 2 GB. + +### Colab demos + +Con questo metodo è possibile fare inferenza su modelli che prima non era possibile eseguire su Google Colab. +Guardate la demo per l'esecuzione di T5-11b (42GB in fp32)!
Utilizzando la quantizzazione a 8 bit su Google Colab: + +[![Open In Colab: T5-11b demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1YORPWx4okIHXnjW7MSAidXN29mPVNT7F?usp=sharing) + +Oppure questa demo di BLOOM-3B: + +[![Open In Colab: BLOOM-3b demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1qOjXfQIAULfKvZqwCen8-MoWKGdSatZ4?usp=sharing) \ No newline at end of file diff --git a/docs/source/en/perf_infer_special.mdx b/docs/source/it/perf_infer_special.md similarity index 56% rename from docs/source/en/perf_infer_special.mdx rename to docs/source/it/perf_infer_special.md index e18a9a10488302..3e2c0a5c288e37 100644 --- a/docs/source/en/perf_infer_special.mdx +++ b/docs/source/it/perf_infer_special.md @@ -7,8 +7,12 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> -# Inference on Specialized Hardware +# Inferenza su Hardware Specializzato -This document will be completed soon with information on how to infer on specialized hardware. In the meantime you can check out [the guide for inference on CPUs](perf_infer_cpu). \ No newline at end of file +Questo documento sarà completato a breve con la documentazione per l'inferenza su hardware specializzato. Nel frattempo puoi controllare [la guida per fare inferenza sulle CPU](perf_infer_cpu). \ No newline at end of file diff --git a/docs/source/it/perf_train_cpu.md b/docs/source/it/perf_train_cpu.md new file mode 100644 index 00000000000000..ff71d10d5c9d6c --- /dev/null +++ b/docs/source/it/perf_train_cpu.md @@ -0,0 +1,69 @@ + + +# Addestramento efficiente su CPU + +Questa guida si concentra su come addestrare in maniera efficiente grandi modelli su CPU. + +## Mixed precision con IPEX + +IPEX è ottimizzato per CPU con AVX-512 o superiore, e funziona anche per le CPU con solo AVX2. Pertanto, si prevede che le prestazioni saranno più vantaggiose per le CPU Intel con AVX-512 o superiori, mentre le CPU con solo AVX2 (ad esempio, le CPU AMD o le CPU Intel più vecchie) potrebbero ottenere prestazioni migliori con IPEX, ma non sono garantite. IPEX offre ottimizzazioni delle prestazioni per l'addestramento su CPU sia con Float32 che con BFloat16. L'uso di BFloat16 è l'argomento principale delle seguenti sezioni. + +Il tipo di dati a bassa precisione BFloat16 è stato supportato in modo nativo su 3rd Generation Xeon® Scalable Processors (aka Cooper Lake) con AVX512 e sarà supportato dalla prossima generazione di Intel® Xeon® Scalable Processors con Intel® Advanced Matrix Extensions (Intel® AMX) instruction set, con prestazioni ulteriormente migliorate. L'Auto Mixed Precision per il backend della CPU è stata abilitata a partire da PyTorch-1.10. Allo stesso tempo, il supporto di Auto Mixed Precision con BFloat16 per CPU e l'ottimizzazione degli operatori BFloat16 sono stati ampiamente abilitati in Intel® Extension per PyTorch e parzialmente integrati nel branch master di PyTorch. Gli utenti possono ottenere prestazioni e user experience migliori con IPEX Auto Mixed Precision.
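Per chiarire il meccanismo, di seguito uno schizzo ipotetico di un singolo passo di addestramento su CPU con IPEX e Auto Mixed Precision in BFloat16, al di fuori del Trainer; il checkpoint e i dati sono semplici segnaposto di esempio.

```py
# Schizzo ipotetico: ipex.optimize prepara modello e optimizer per BF16,
# mentre torch.cpu.amp.autocast esegue il forward in mixed precision.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "google-bert/bert-base-uncased"  # checkpoint di esempio
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

model.train()
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

batch = tokenizer(["una frase di esempio"], return_tensors="pt")
labels = torch.tensor([1])

with torch.cpu.amp.autocast(dtype=torch.bfloat16):
    outputs = model(**batch, labels=labels)
    loss = outputs.loss

loss.backward()
optimizer.step()
optimizer.zero_grad()
```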
+ +Vedi informazioni più dettagliate su [Auto Mixed Precision](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/features/amp.html). + +### Installazione di IPEX: + +Il rilascio di IPEX segue quello di PyTorch, da installare via pip: + +| PyTorch Version | IPEX version | +| :---------------: | :----------: | +| 1.13 | 1.13.0+cpu | +| 1.12 | 1.12.300+cpu | +| 1.11 | 1.11.200+cpu | +| 1.10 | 1.10.100+cpu | + +```bash +pip install intel_extension_for_pytorch== -f https://developer.intel.com/ipex-whl-stable-cpu +``` + +Vedi altri approcci per [IPEX installation](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/installation.html). + +### Utilizzo nel Trainer + +Per abilitare l'auto mixed precision con IPEX nel Trainer, l'utente dovrebbe aggiungere `use_ipex`, `bf16` e `no_cuda` negli argomenti del comando di addestramento. + +Vedi un esempio di caso d'uso in [Transformers question-answering](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) + +- Training with IPEX using BF16 auto mixed precision on CPU: + +
 python run_qa.py \
+--model_name_or_path google-bert/bert-base-uncased \
+--dataset_name squad \
+--do_train \
+--do_eval \
+--per_device_train_batch_size 12 \
+--learning_rate 3e-5 \
+--num_train_epochs 2 \
+--max_seq_length 384 \
+--doc_stride 128 \
+--output_dir /tmp/debug_squad/ \
+--use_ipex \
+--bf16 --no_cuda
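Gli stessi flag (`use_ipex`, `bf16`, `no_cuda`) si possono impostare anche da Python tramite `TrainingArguments`; lo schizzo seguente è solo indicativo e usa un piccolo dataset giocattolo al posto di SQuAD.

```py
# Schizzo indicativo: Trainer su CPU con IPEX e BF16 auto mixed precision.
# Il dataset giocattolo serve solo a mostrare i flag, non a riprodurre l'esempio run_qa.py.
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

model_id = "google-bert/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

data = Dataset.from_dict({"text": ["frase positiva", "frase negativa"], "label": [1, 0]})
data = data.map(lambda x: tokenizer(x["text"], truncation=True, padding="max_length", max_length=32))

training_args = TrainingArguments(
    output_dir="/tmp/debug_squad/",
    per_device_train_batch_size=2,
    learning_rate=3e-5,
    num_train_epochs=1,
    use_ipex=True,  # ottimizzazioni IPEX
    bf16=True,      # auto mixed precision in BFloat16
    no_cuda=True,   # forza l'addestramento su CPU
)

trainer = Trainer(model=model, args=training_args, train_dataset=data)
trainer.train()
```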
+ +### Esempi pratici + +Blog: [Accelerating PyTorch Transformers with Intel Sapphire Rapids](https://huggingface.co/blog/intel-sapphire-rapids) diff --git a/docs/source/it/perf_train_cpu_many.md b/docs/source/it/perf_train_cpu_many.md new file mode 100644 index 00000000000000..c1f8833829ac3b --- /dev/null +++ b/docs/source/it/perf_train_cpu_many.md @@ -0,0 +1,141 @@ + + +# Addestramento efficiente su CPU multiple + +Quando l'addestramento su una singola CPU è troppo lento, possiamo usare CPU multiple. Questa guida si concentra su DDP basato su PyTorch, abilitando l'addestramento distribuito su CPU in maniera efficiente. + +## Intel® oneCCL Bindings per PyTorch + +[Intel® oneCCL](https://github.com/oneapi-src/oneCCL) (collective communications library) è una libreria per l'addestramento distribuito efficiente del deep learning che implementa collettivi come allreduce, allgather, alltoall. Per maggiori informazioni su oneCCL, fai riferimento a [oneCCL documentation](https://spec.oneapi.com/versions/latest/elements/oneCCL/source/index.html) e [oneCCL specification](https://spec.oneapi.com/versions/latest/elements/oneCCL/source/index.html). + +Il modulo `oneccl_bindings_for_pytorch` (`torch_ccl` precedentemente alla versione 1.12) implementa la PyTorch C10D ProcessGroup API e può essere caricato dinamicamente come external ProcessGroup; al momento funziona solo su piattaforma Linux. + +Qui trovi informazioni più dettagliate per [oneccl_bind_pt](https://github.com/intel/torch-ccl). + +### Installazione di Intel® oneCCL Bindings per PyTorch: + +I file wheel sono disponibili per le seguenti versioni di Python: + +| Extension Version | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10 | +| :---------------: | :--------: | :--------: | :--------: | :--------: | :---------: | +| 1.13.0 | | √ | √ | √ | √ | +| 1.12.100 | | √ | √ | √ | √ | +| 1.12.0 | | √ | √ | √ | √ | +| 1.11.0 | | √ | √ | √ | √ | +| 1.10.0 | √ | √ | √ | √ | | + +```bash +pip install oneccl_bind_pt=={pytorch_version} -f https://developer.intel.com/ipex-whl-stable-cpu +``` + +dove `{pytorch_version}` deve essere la tua versione di PyTorch, per esempio 1.13.0. +Verifica altri approcci per [oneccl_bind_pt installation](https://github.com/intel/torch-ccl). +Le versioni di oneCCL e PyTorch devono combaciare. + + + +oneccl_bindings_for_pytorch 1.12.0 prebuilt wheel does not work with PyTorch 1.12.1 (it is for PyTorch 1.12.0) +PyTorch 1.12.1 should work with oneccl_bindings_for_pytorch 1.12.100 + + + +## Intel® MPI library + +Questa implementazione basata sullo standard MPI fornisce un'architettura flessibile, efficiente e scalabile su cluster per Intel®. Questo componente è parte di Intel® oneAPI HPC Toolkit. + +oneccl_bindings_for_pytorch è installato insieme al set di strumenti MPI. È necessario eseguire il source dell'ambiente prima di utilizzarlo. + +per Intel® oneCCL >= 1.12.0 + +```bash +oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)") +source $oneccl_bindings_for_pytorch_path/env/setvars.sh +``` + +per Intel® oneCCL con versione < 1.12.0 + +```bash +torch_ccl_path=$(python -c "import torch; import torch_ccl; import os; print(os.path.abspath(os.path.dirname(torch_ccl.__file__)))") +source $torch_ccl_path/env/setvars.sh +``` + +#### Installazione IPEX: + +IPEX fornisce ottimizzazioni delle prestazioni per l'addestramento su CPU sia con Float32 che con BFloat16; puoi fare riferimento a [single CPU section](./perf_train_cpu).
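Prima di passare all'utilizzo nel Trainer, ecco uno schizzo ipotetico di come il backend `ccl` fornito da `oneccl_bindings_for_pytorch` viene inizializzato a basso livello; il Trainer con `--ddp_backend ccl` fa questo per te, e le variabili d'ambiente `PMI_RANK`/`PMI_SIZE` sono un'assunzione basata sui tipici lanci con mpirun.

```py
# Schizzo ipotetico: inizializzazione manuale del process group con backend "ccl".
# Con il Trainer e `--ddp_backend ccl` questa parte è gestita automaticamente.
import os

import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401 - l'import registra il backend "ccl"

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# assunzione: rank e world size presi dalle variabili impostate da mpirun (PMI)
rank = int(os.environ.get("PMI_RANK", 0))
world_size = int(os.environ.get("PMI_SIZE", 1))

dist.init_process_group(backend="ccl", rank=rank, world_size=world_size)
print(f"processo {dist.get_rank()} di {dist.get_world_size()} inizializzato con backend ccl")
```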
+ +Il seguente "Utilizzo in Trainer" prende come esempio mpirun nella libreria Intel® MPI. + +## Utilizzo in Trainer + +Per abilitare l'addestramento distribuito multi CPU nel Trainer con il ccl backend, gli utenti devono aggiungere **`--ddp_backend ccl`** negli argomenti del comando. + +Vediamo un esempio per il [question-answering example](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) + +Il seguente comando abilita due processi sul nodo Xeon, con un processo in esecuzione per ogni socket. Le variabili OMP_NUM_THREADS/CCL_WORKER_COUNT possono essere impostate per una prestazione ottimale. + +```shell script + export CCL_WORKER_COUNT=1 + export MASTER_ADDR=127.0.0.1 + mpirun -n 2 -genv OMP_NUM_THREADS=23 \ + python3 run_qa.py \ + --model_name_or_path google-bert/bert-large-uncased \ + --dataset_name squad \ + --do_train \ + --do_eval \ + --per_device_train_batch_size 12 \ + --learning_rate 3e-5 \ + --num_train_epochs 2 \ + --max_seq_length 384 \ + --doc_stride 128 \ + --output_dir /tmp/debug_squad/ \ + --no_cuda \ + --ddp_backend ccl \ + --use_ipex +``` + +Il seguente comando abilita l'addestramento per un totale di quattro processi su due Xeon (node0 e node1, prendendo node0 come processo principale), ppn (processes per node) è impostato a 2, on un processo in esecuzione per ogni socket. Le variabili OMP_NUM_THREADS/CCL_WORKER_COUNT possono essere impostate per una prestazione ottimale. + +In node0, è necessario creare un file di configurazione che contenga gli indirizzi IP di ciascun nodo (per esempio hostfile) e passare il percorso del file di configurazione come parametro. + +```shell script + cat hostfile + xxx.xxx.xxx.xxx #node0 ip + xxx.xxx.xxx.xxx #node1 ip +``` + +A questo punto, esegui il seguente comando nel nodo0 e **4DDP** sarà abilitato in node0 e node1 con BF16 auto mixed precision: + +```shell script + export CCL_WORKER_COUNT=1 + export MASTER_ADDR=xxx.xxx.xxx.xxx #node0 ip + mpirun -f hostfile -n 4 -ppn 2 \ + -genv OMP_NUM_THREADS=23 \ + python3 run_qa.py \ + --model_name_or_path google-bert/bert-large-uncased \ + --dataset_name squad \ + --do_train \ + --do_eval \ + --per_device_train_batch_size 12 \ + --learning_rate 3e-5 \ + --num_train_epochs 2 \ + --max_seq_length 384 \ + --doc_stride 128 \ + --output_dir /tmp/debug_squad/ \ + --no_cuda \ + --ddp_backend ccl \ + --use_ipex \ + --bf16 +``` diff --git a/docs/source/it/perf_train_special.md b/docs/source/it/perf_train_special.md new file mode 100644 index 00000000000000..afe05d801d66e3 --- /dev/null +++ b/docs/source/it/perf_train_special.md @@ -0,0 +1,24 @@ + + +# Addestramento su Hardware Specializzato + + + + Nota: Molte delle strategie introdotte nella [sezione sulla GPU singola](perf_train_gpu_one) (come mixed precision training o gradient accumulation) e [sezione multi-GPU](perf_train_gpu_many) sono generiche e applicabili all'addestramento di modelli in generale quindi assicurati di dargli un'occhiata prima di immergerti in questa sezione. + + + +Questo documento sarà presto completato con informazioni su come effettuare la formazione su hardware specializzato. 
diff --git a/docs/source/it/perf_train_tpu.md b/docs/source/it/perf_train_tpu.md new file mode 100644 index 00000000000000..663f83c499cba4 --- /dev/null +++ b/docs/source/it/perf_train_tpu.md @@ -0,0 +1,24 @@ + + +# Addestramento su TPU + + + + Nota: Molte delle strategie introdotte nella [sezione sulla GPU singola](perf_train_gpu_one) (come mixed precision training o gradient accumulation) e [sezione multi-GPU](perf_train_gpu_many) sono generiche e applicabili all'addestramento di modelli in generale quindi assicurati di dargli un'occhiata prima di immergerti in questa sezione. + + + +Questo documento sarà presto completato con informazioni su come effettuare la formazione su TPU. diff --git a/docs/source/it/pipeline_tutorial.mdx b/docs/source/it/pipeline_tutorial.md similarity index 95% rename from docs/source/it/pipeline_tutorial.mdx rename to docs/source/it/pipeline_tutorial.md index 64347164505f40..87f3166623b05a 100644 --- a/docs/source/it/pipeline_tutorial.mdx +++ b/docs/source/it/pipeline_tutorial.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Pipeline per l'inferenza @@ -72,8 +76,8 @@ La [`pipeline`] accetta qualsiasi modello dal [Model Hub](https://huggingface.co ```py >>> from transformers import AutoTokenizer, AutoModelForCausalLM ->>> tokenizer = AutoTokenizer.from_pretrained("distilgpt2") ->>> model = AutoModelForCausalLM.from_pretrained("distilgpt2") +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2") +>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2") ``` Crea una [`pipeline`] per il tuo compito, specificando il modello e il tokenizer che hai caricato: diff --git a/docs/source/it/pr_checks.md b/docs/source/it/pr_checks.md new file mode 100644 index 00000000000000..caa5fe32965bde --- /dev/null +++ b/docs/source/it/pr_checks.md @@ -0,0 +1,135 @@ + + +# Controlli su una Pull Request + +Quando apri una pull request sui 🤗 Transformers, vengono eseguiti un discreto numero di controlli per assicurarsi che la patch che stai aggiungendo non stia rompendo qualcosa di esistente. Questi controlli sono di quattro tipi: +- test regolari +- costruzione della documentazione +- stile del codice e della documentazione +- coerenza generale del repository + +In questo documento, cercheremo di spiegare quali sono i vari controlli e le loro ragioni, oltre a spiegare come eseguire il debug locale se uno di essi fallisce sulla tua PR. + +Nota che tutti richiedono un'installazione dev: + +```bash +pip install transformers[dev] +``` + +o un'installazione modificabile: + +```bash +pip install -e .[dev] +``` + +all'interno del repo Transformers. + +## Tests + +Tutti i job che iniziano con `ci/circleci: run_tests_` eseguono parti della suite di test dei Transformers. Ognuno di questi job si concentra su una parte della libreria in un determinato ambiente: per esempio `ci/circleci: run_tests_pipelines_tf` esegue il test delle pipeline in un ambiente in cui è installato solo TensorFlow. 
+ +Nota che per evitare di eseguire i test quando non ci sono cambiamenti reali nei moduli che si stanno testando, ogni volta viene eseguita solo una parte della suite di test: viene eseguita una utility per determinare le differenze nella libreria tra prima e dopo la PR (ciò che GitHub mostra nella scheda "Files changes") e sceglie i test che sono stati impattati dalla diff. Questa utility può essere eseguita localmente con: + +```bash +python utils/tests_fetcher.py +``` + +dalla root del repo Transformers. Di seguito ciò che farà: + +1. Controlla per ogni file nel diff se le modifiche sono nel codice o solo nei commenti o nelle docstrings. Vengono mantenuti solo i file con modifiche reali al codice. +2. Costruisce una mappa interna che fornisce per ogni file del codice sorgente della libreria tutti i file su cui ha un impatto ricorsivo. Si dice che il modulo A ha un impatto sul modulo B se il modulo B importa il modulo A. Per l'impatto ricorsivo, abbiamo bisogno di una catena di moduli che va dal modulo A al modulo B in cui ogni modulo importa il precedente. +3. Applica questa mappa ai file raccolti nel passaggio 1, si ottiene l'elenco dei file del modello interessati dalla PR. +4. Mappa ciascuno di questi file con i corrispondenti file di test e ottiene l'elenco dei test da eseguire. + +Quando esegui lo script in locale, dovresti ottenere la stampa dei risultati dei passi 1, 3 e 4 e quindi sapere quali test sono stati eseguiti. Lo script creerà anche un file chiamato `test_list.txt` che contiene l'elenco dei test da eseguire e che puoi eseguire localmente con il seguente comando: + +```bash +python -m pytest -n 8 --dist=loadfile -rA -s $(cat test_list.txt) +``` + +Nel caso in cui qualcosa sia sfuggito, l'intera suite di test viene eseguita quotidianamente. + +## Build della documentazione + +Il job `ci/circleci: build_doc` esegue una build della documentazione per assicurarsi che tutto sia a posto una volta che la PR è stata unita. Se questo passaggio fallisce, puoi controllare localmente entrando nella cartella `docs` del repo Transformers e digitare + +```bash +make html +``` + +Sphinx non è noto per i suoi messaggi di errore chiari, quindi potrebbe essere necessario che provi alcune cose per trovare davvero la fonte dell'errore. + +## Stile del codice e della documentazione + +La formattazione del codice viene applicata a tutti i file sorgenti, agli esempi e ai test usando `black` e `isort`. Abbiamo anche uno strumento personalizzato che si occupa della formattazione delle docstring e dei file `rst` (`utils/style_doc.py`), così come dell'ordine dei lazy imports eseguiti nei file `__init__.py` dei Transformers (`utils/custom_init_isort.py`). Tutto questo può essere lanciato eseguendo + +```bash +make style +``` + +I controlli della CI sono applicati all'interno del controllo `ci/circleci: check_code_quality`. Esegue anche `flake8`, che dà un'occhiata di base al codice e si lamenta se trova una variabile non definita o non utilizzata. Per eseguire questo controllo localmente, usare + +```bash +make quality +``` + +Questa operazione può richiedere molto tempo, quindi per eseguire la stessa operazione solo sui file modificati nel branch corrente, eseguire + +```bash +make fixup +``` + +Quest'ultimo comando eseguirà anche tutti i controlli aggiuntivi per la consistenza del repository. Diamogli un'occhiata. 
+ +## Coerenza del repository + +All'interno sono raggruppati tutti i test per assicurarsi che la tua PR lasci il repository in un buono stato ed è eseguito dal controllo `ci/circleci: check_repository_consistency`. Puoi eseguire localmente questo controllo eseguendo quanto segue: + +```bash +make repo-consistency +``` + +Questo verifica che: + +- Tutti gli oggetti aggiunti all'init sono documentati (eseguito da `utils/check_repo.py`) +- Tutti i file `__init__.py` hanno lo stesso contenuto nelle loro due sezioni (eseguito da `utils/check_inits.py`) +- Tutto il codice identificato come copia da un altro modulo è coerente con l'originale (eseguito da `utils/check_copies.py`) +- Le traduzioni dei README e l'indice della documentazione hanno lo stesso elenco di modelli del README principale (eseguito da `utils/check_copies.py`) +- Le tabelle autogenerate nella documentazione sono aggiornate (eseguito da `utils/check_table.py`) +- La libreria ha tutti gli oggetti disponibili anche se non tutte le dipendenze opzionali sono installate (eseguito da `utils/check_dummies.py`) + +Se questo controllo fallisce, le prime due voci richiedono una correzione manuale, mentre le ultime quattro possono essere corrette automaticamente per te eseguendo il comando + +```bash +make fix-copies +``` + +Ulteriori controlli riguardano le PR che aggiungono nuovi modelli, principalmente che: + +- Tutti i modelli aggiunti sono in un Auto-mapping (eseguita da `utils/check_repo.py`) + +- Tutti i modelli sono testati correttamente (eseguito da `utils/check_repo.py`) + + \ No newline at end of file diff --git a/docs/source/it/preprocessing.mdx b/docs/source/it/preprocessing.md similarity index 97% rename from docs/source/it/preprocessing.mdx rename to docs/source/it/preprocessing.md index a57ff9df9151e1..6d7bc5b2e3df7e 100644 --- a/docs/source/it/preprocessing.mdx +++ b/docs/source/it/preprocessing.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Preprocess @@ -41,7 +45,7 @@ Carica un tokenizer preaddestrato con [`AutoTokenizer.from_pretrained`]: ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") ``` Poi inserisci le tue frasi nel tokenizer: @@ -190,7 +194,7 @@ Gli input audio sono processati in modo differente rispetto al testo, ma l'obiet pip install datasets ``` -Carica il dataset [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) (vedi il 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) per avere maggiori dettagli su come caricare un dataset): +Carica il dataset [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) (vedi il 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub) per avere maggiori dettagli su come caricare un dataset): ```py >>> from datasets import load_dataset, Audio @@ -229,7 +233,7 @@ Per esempio, il dataset [MInDS-14](https://huggingface.co/datasets/PolyAI/minds1 'sampling_rate': 8000} ``` -1. 
Usa il metodo di 🤗 Datasets' [`cast_column`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.cast_column) per alzare la frequenza di campionamento a 16kHz: +1. Usa il metodo di 🤗 Datasets' [`cast_column`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.cast_column) per alzare la frequenza di campionamento a 16kHz: ```py >>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000)) @@ -366,7 +370,7 @@ Per le attività di visione, è usuale aggiungere alcuni tipi di data augmentati ... return examples ``` -3. Poi utilizza 🤗 Datasets [`set_transform`](https://huggingface.co/docs/datasets/process.html#format-transform)per applicare al volo la trasformazione: +3. Poi utilizza 🤗 Datasets [`set_transform`](https://huggingface.co/docs/datasets/process#format-transform)per applicare al volo la trasformazione: ```py >>> dataset.set_transform(transforms) @@ -457,7 +461,7 @@ Ricorda dalla sezione precedente sull'elaborazione dei dati audio, tu dovresti s ### Processor -Un processor combina un estrattore di caratteristiche e un tokenizer. Carica un processor con [`AutoProcessor.from_pretrained]: +Un processor combina un estrattore di caratteristiche e un tokenizer. Carica un processor con [`AutoProcessor.from_pretrained`]: ```py >>> from transformers import AutoProcessor diff --git a/docs/source/it/quicktour.mdx b/docs/source/it/quicktour.md similarity index 98% rename from docs/source/it/quicktour.mdx rename to docs/source/it/quicktour.md index 2378edd2c2a155..07e7a2974a1fbc 100644 --- a/docs/source/it/quicktour.mdx +++ b/docs/source/it/quicktour.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Quick tour @@ -64,11 +68,13 @@ Installa le seguenti dipendenze se non lo hai già fatto: + ```bash pip install torch ``` + ```bash pip install tensorflow ``` @@ -119,7 +125,7 @@ Crea una [`pipeline`] con il compito che vuoi risolvere e con il modello che vuo ... ) ``` -Poi, carica un dataset (vedi 🤗 Datasets [Quick Start](https://huggingface.co/docs/datasets/quickstart.html) per maggiori dettagli) sul quale vuoi iterare. Per esempio, carichiamo il dataset [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14): +Poi, carica un dataset (vedi 🤗 Datasets [Quick Start](https://huggingface.co/docs/datasets/quickstart) per maggiori dettagli) sul quale vuoi iterare. 
Per esempio, carichiamo il dataset [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14): ```py >>> from datasets import load_dataset, Audio @@ -375,6 +381,7 @@ Una caratteristica particolarmente interessante di 🤗 Transformers è la sua a + ```py >>> from transformers import AutoModel @@ -383,6 +390,7 @@ Una caratteristica particolarmente interessante di 🤗 Transformers è la sua a ``` + ```py >>> from transformers import TFAutoModel diff --git a/docs/source/it/run_scripts.mdx b/docs/source/it/run_scripts.md similarity index 92% rename from docs/source/it/run_scripts.mdx rename to docs/source/it/run_scripts.md index 3ffd58a62830aa..7fc3fb6c6ac67a 100644 --- a/docs/source/it/run_scripts.mdx +++ b/docs/source/it/run_scripts.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Addestramento con script @@ -83,11 +87,11 @@ pip install -r requirements.txt -Lo script di esempio scarica e pre-processa un dataset dalla libreria 🤗 [Datasets](https://huggingface.co/docs/datasets/). Successivamente, lo script esegue il fine-tuning su un dataset usando il [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) su un'architettura che supporta la summarization. Il seguente esempio mostra come eseguire il fine-tuning di [T5-small](https://huggingface.co/t5-small) sul dataset [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail). Il modello T5 richiede un parametro addizionale `source_prefix` a causa del modo in cui è stato addestrato. Questo prefisso permette a T5 di sapere che si tratta di un task di summarization. +Lo script di esempio scarica e pre-processa un dataset dalla libreria 🤗 [Datasets](https://huggingface.co/docs/datasets/). Successivamente, lo script esegue il fine-tuning su un dataset usando il [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) su un'architettura che supporta la summarization. Il seguente esempio mostra come eseguire il fine-tuning di [T5-small](https://huggingface.co/google-t5/t5-small) sul dataset [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail). Il modello T5 richiede un parametro addizionale `source_prefix` a causa del modo in cui è stato addestrato. Questo prefisso permette a T5 di sapere che si tratta di un task di summarization. ```bash python examples/pytorch/summarization/run_summarization.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ @@ -101,11 +105,11 @@ python examples/pytorch/summarization/run_summarization.py \ ``` -Lo script di esempio scarica e pre-processa un dataset dalla libreria 🤗 [Datasets](https://huggingface.co/docs/datasets/). Successivamente, lo script esegue il fine-tuning su un dataset usando Keras su un'architettura che supporta la summarization. Il seguente esempio mostra come eseguire il fine-tuning di [T5-small](https://huggingface.co/t5-small) sul dataset [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail). 
Il modello T5 richiede un parametro addizionale `source_prefix` a causa del modo in cui è stato addestrato. Questo prefisso permette a T5 di sapere che si tratta di un task di summarization. +Lo script di esempio scarica e pre-processa un dataset dalla libreria 🤗 [Datasets](https://huggingface.co/docs/datasets/). Successivamente, lo script esegue il fine-tuning su un dataset usando Keras su un'architettura che supporta la summarization. Il seguente esempio mostra come eseguire il fine-tuning di [T5-small](https://huggingface.co/google-t5/t5-small) sul dataset [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail). Il modello T5 richiede un parametro addizionale `source_prefix` a causa del modo in cui è stato addestrato. Questo prefisso permette a T5 di sapere che si tratta di un task di summarization. ```bash python examples/tensorflow/summarization/run_summarization.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --dataset_name cnn_dailymail \ --dataset_config "3.0.0" \ --output_dir /tmp/tst-summarization \ @@ -126,10 +130,10 @@ Il [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) supp - Imposta un numero di GPU da usare con l'argomento `nproc_per_node`. ```bash -python -m torch.distributed.launch \ +torchrun \ --nproc_per_node 8 pytorch/summarization/run_summarization.py \ --fp16 \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ @@ -153,7 +157,7 @@ Le Tensor Processing Units (TPU) sono state progettate per migliorare le prestaz ```bash python xla_spawn.py --num_cores 8 \ summarization/run_summarization.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ @@ -172,7 +176,7 @@ Le Tensor Processing Units (TPU) sono state progettate per migliorare le prestaz ```bash python run_summarization.py \ --tpu name_of_tpu_resource \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --dataset_name cnn_dailymail \ --dataset_config "3.0.0" \ --output_dir /tmp/tst-summarization \ @@ -210,7 +214,7 @@ Ora sei pronto per avviare l'addestramento: ```bash accelerate launch run_summarization_no_trainer.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --dataset_name cnn_dailymail \ --dataset_config "3.0.0" \ --source_prefix "summarize: " \ @@ -229,7 +233,7 @@ Uno script di summarization usando un dataset personalizzato sarebbe simile a qu ```bash python examples/pytorch/summarization/run_summarization.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --train_file path_to_csv_or_jsonlines_file \ @@ -254,7 +258,7 @@ python examples/pytorch/summarization/run_summarization.py \ ```bash python examples/pytorch/summarization/run_summarization.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --max_train_samples 50 \ --max_eval_samples 50 \ --max_predict_samples 50 \ @@ -284,7 +288,7 @@ Il primo metodo usa l'argomento `output_dir previous_output_dir` per riavviare l ```bash python examples/pytorch/summarization/run_summarization.py - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ @@ -301,7 +305,7 @@ Il secondo metodo usa l'argomento `resume_from_checkpoint path_to_specific_check ```bash python examples/pytorch/summarization/run_summarization.py - 
--model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ @@ -331,7 +335,7 @@ Il seguente esempio mostra come caricare un modello specificando il nome del rep ```bash python examples/pytorch/summarization/run_summarization.py - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ diff --git a/docs/source/it/serialization.mdx b/docs/source/it/serialization.md similarity index 95% rename from docs/source/it/serialization.mdx rename to docs/source/it/serialization.md index 1dde00f429bdfb..974aee0d81cae0 100644 --- a/docs/source/it/serialization.mdx +++ b/docs/source/it/serialization.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Esporta modelli 🤗 Transformers @@ -118,7 +122,7 @@ optional arguments: L'esportazione di un checkpoint utilizzando una configurazione già pronta può essere eseguita come segue: ```bash -python -m transformers.onnx --model=distilbert-base-uncased onnx/ +python -m transformers.onnx --model=distilbert/distilbert-base-uncased onnx/ ``` che dovrebbe mostrare i seguenti log: @@ -133,7 +137,7 @@ All good, model saved at: onnx/model.onnx ``` Questo esporta un grafico ONNX del checkpoint definito dall'argomento `--model`. -In questo esempio è `distilbert-base-uncased`, ma può essere qualsiasi checkpoint +In questo esempio è `distilbert/distilbert-base-uncased`, ma può essere qualsiasi checkpoint Hugging Face Hub o uno memorizzato localmente. 
Il file risultante `model.onnx` può quindi essere eseguito su uno dei [tanti @@ -145,7 +149,7 @@ Runtime](https://onnxruntime.ai/) come segue: >>> from transformers import AutoTokenizer >>> from onnxruntime import InferenceSession ->>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") >>> session = InferenceSession("onnx/model.onnx") >>> # ONNX Runtime expects NumPy arrays as input >>> inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np") @@ -183,8 +187,8 @@ checkpoint come segue: >>> from transformers import AutoTokenizer, AutoModelForSequenceClassification >>> # Load tokenizer and PyTorch weights form the Hub ->>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") ->>> pt_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") +>>> pt_model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") >>> # Save to disk >>> tokenizer.save_pretrained("local-pt-checkpoint") >>> pt_model.save_pretrained("local-pt-checkpoint") @@ -202,8 +206,8 @@ python -m transformers.onnx --model=local-pt-checkpoint onnx/ >>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification >>> # Load tokenizer and TensorFlow weights from the Hub ->>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") ->>> tf_model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") +>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") >>> # Save to disk >>> tokenizer.save_pretrained("local-tf-checkpoint") >>> tf_model.save_pretrained("local-tf-checkpoint") @@ -250,7 +254,7 @@ pacchetto `transformers.onnx`. Ad esempio, per esportare un modello di classific possiamo scegliere un modello ottimizzato dall'Hub ed eseguire: ```bash -python -m transformers.onnx --model=distilbert-base-uncased-finetuned-sst-2-english \ +python -m transformers.onnx --model=distilbert/distilbert-base-uncased-finetuned-sst-2-english \ --feature=sequence-classification onnx/ ``` @@ -267,7 +271,7 @@ All good, model saved at: onnx/model.onnx Puoi notare che in questo caso, i nomi di output del modello ottimizzato sono `logits` invece di `last_hidden_state` che abbiamo visto con il -checkpoint `distilbert-base-uncased` precedente. Questo è previsto dal +checkpoint `distilbert/distilbert-base-uncased` precedente. Questo è previsto dal modello ottimizato visto che ha una testa di e. 
@@ -350,7 +354,7 @@ fornendo alla configurazione del modello base come segue: ```python >>> from transformers import AutoConfig ->>> config = AutoConfig.from_pretrained("distilbert-base-uncased") +>>> config = AutoConfig.from_pretrained("distilbert/distilbert-base-uncased") >>> onnx_config = DistilBertOnnxConfig(config) ``` @@ -382,7 +386,7 @@ usare: ```python >>> from transformers import AutoConfig ->>> config = AutoConfig.from_pretrained("distilbert-base-uncased") +>>> config = AutoConfig.from_pretrained("distilbert/distilbert-base-uncased") >>> onnx_config_for_seq_clf = DistilBertOnnxConfig(config, task="sequence-classification") >>> print(onnx_config_for_seq_clf.outputs) OrderedDict([('logits', {0: 'batch'})]) @@ -409,7 +413,7 @@ con il modello base e il tokenizer e il percorso per salvare il file esportato: >>> from transformers import AutoTokenizer, AutoModel >>> onnx_path = Path("model.onnx") ->>> model_ckpt = "distilbert-base-uncased" +>>> model_ckpt = "distilbert/distilbert-base-uncased" >>> base_model = AutoModel.from_pretrained(model_ckpt) >>> tokenizer = AutoTokenizer.from_pretrained(model_ckpt) @@ -470,8 +474,7 @@ _.py` * Includere l'architettura del modello e le funzioni corrispondenti in [`~onnx.features.FeatureManager`] * Aggiungere la tua architettura del modello ai test in `test_onnx_v2.py` -Scopri come stato contribuito la configurazione per [IBERT] -(https://github.com/huggingface/transformers/pull/14868/files) per +Scopri come stato contribuito la configurazione per [IBERT](https://github.com/huggingface/transformers/pull/14868/files) per avere un'idea di cosa è coinvolto. ## TorchScript @@ -546,7 +549,7 @@ una classe `BertConfig` e quindi salvato su disco con il nome del file `traced_b from transformers import BertModel, BertTokenizer, BertConfig import torch -enc = BertTokenizer.from_pretrained("bert-base-uncased") +enc = BertTokenizer.from_pretrained("google-bert/bert-base-uncased") # Tokenizing input text text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]" @@ -581,7 +584,7 @@ model = BertModel(config) model.eval() # If you are instantiating the model with *from_pretrained* you can also easily set the TorchScript flag -model = BertModel.from_pretrained("bert-base-uncased", torchscript=True) +model = BertModel.from_pretrained("google-bert/bert-base-uncased", torchscript=True) # Creating the trace traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors]) @@ -608,7 +611,7 @@ Usare il modello tracciato per l'inferenza è semplice come usare il suo metodo traced_model(tokens_tensor, segments_tensors) ``` -###Implementare modelli HuggingFace TorchScript su AWS utilizzando Neuron SDK +### Implementare modelli HuggingFace TorchScript su AWS utilizzando Neuron SDK AWS ha introdotto [Amazon EC2 Inf1](https://aws.amazon.com/ec2/instance-types/inf1/) famiglia di istanze per l'inferenza di machine learning a basso costo e ad alte prestazioni nel cloud. diff --git a/docs/source/it/training.mdx b/docs/source/it/training.md similarity index 95% rename from docs/source/it/training.mdx rename to docs/source/it/training.md index 68f6434bbb5a6f..2a64cfca375f69 100644 --- a/docs/source/it/training.mdx +++ b/docs/source/it/training.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Fine-tuning di un modello pre-addestrato @@ -39,12 +43,12 @@ Inizia caricando il dataset [Yelp Reviews](https://huggingface.co/datasets/yelp_ 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I\'ve worked at more than one location. I expect bad days, bad moods, and the occasional mistake. But I have yet to have a decent experience at this store. It will remain a place I avoid unless someone in my party needs to avoid illness from low blood sugar. Perhaps I should go back to the racially biased service of Steak n Shake instead!'} ``` -Come già sai, hai bisogno di un tokenizer per processare il testo e includere una strategia di padding e truncation per gestire sequenze di lunghezza variabile. Per processare il dataset in un unico passo, usa il metodo [`map`](https://huggingface.co/docs/datasets/process.html#map) di 🤗 Datasets che applica la funzione di preprocessing all'intero dataset: +Come già sai, hai bisogno di un tokenizer per processare il testo e includere una strategia di padding e truncation per gestire sequenze di lunghezza variabile. Per processare il dataset in un unico passo, usa il metodo [`map`](https://huggingface.co/docs/datasets/process#map) di 🤗 Datasets che applica la funzione di preprocessing all'intero dataset: ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") >>> def tokenize_function(examples): @@ -76,7 +80,7 @@ Inizia caricando il tuo modello e specificando il numero di etichette (labels) a ```py >>> from transformers import AutoModelForSequenceClassification ->>> model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5) +>>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) ``` @@ -99,7 +103,7 @@ Specifica dove salvare i checkpoints del tuo addestramento: ### Metriche -[`Trainer`] non valuta automaticamente le performance del modello durante l'addestramento. Dovrai passare a [`Trainer`] una funzione che calcola e restituisce le metriche. 
La libreria 🤗 Datasets mette a disposizione una semplice funzione [`accuracy`](https://huggingface.co/metrics/accuracy) che puoi caricare con la funzione `load_metric` (guarda questa [esercitazione](https://huggingface.co/docs/datasets/metrics.html) per maggiori informazioni): +[`Trainer`] non valuta automaticamente le performance del modello durante l'addestramento. Dovrai passare a [`Trainer`] una funzione che calcola e restituisce le metriche. La libreria 🤗 Datasets mette a disposizione una semplice funzione [`accuracy`](https://huggingface.co/metrics/accuracy) che puoi caricare con la funzione `load_metric` (guarda questa [esercitazione](https://huggingface.co/docs/datasets/metrics) per maggiori informazioni): ```py >>> import numpy as np @@ -196,7 +200,7 @@ Carica un modello TensorFlow col numero atteso di etichette: >>> import tensorflow as tf >>> from transformers import TFAutoModelForSequenceClassification ->>> model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5) +>>> model = TFAutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) ``` Poi compila e fai il fine-tuning del tuo modello usando [`fit`](https://keras.io/api/models/model_training_apis/) come faresti con qualsiasi altro modello di Keras: @@ -275,7 +279,7 @@ Carica il tuo modello con il numero atteso di etichette: ```py >>> from transformers import AutoModelForSequenceClassification ->>> model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5) +>>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) ``` ### Ottimizzatore e learning rate scheduler @@ -342,7 +346,7 @@ Per tenere traccia dei tuoi progressi durante l'addestramento, usa la libreria [ ### Metriche -Proprio come è necessario aggiungere una funzione di valutazione del [`Trainer`], è necessario fare lo stesso quando si scrive il proprio ciclo di addestramento. Ma invece di calcolare e riportare la metrica alla fine di ogni epoca, questa volta accumulerai tutti i batch con [`add_batch`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=add_batch#datasets.Metric.add_batch) e calcolerai la metrica alla fine. +Proprio come è necessario aggiungere una funzione di valutazione del [`Trainer`], è necessario fare lo stesso quando si scrive il proprio ciclo di addestramento. Ma invece di calcolare e riportare la metrica alla fine di ogni epoca, questa volta accumulerai tutti i batch con [`add_batch`](https://huggingface.co/docs/datasets/package_reference/main_classes?highlight=add_batch#datasets.Metric.add_batch) e calcolerai la metrica alla fine. 
```py >>> metric = load_metric("accuracy") diff --git a/docs/source/ja/_toctree.yml b/docs/source/ja/_toctree.yml index a85f0ee11dcf69..354e22344a904a 100644 --- a/docs/source/ja/_toctree.yml +++ b/docs/source/ja/_toctree.yml @@ -1,10 +1,398 @@ - sections: - local: index title: 🤗 Transformers + - local: quicktour + title: クイックツアー - local: installation title: インストール - title: はじめに + title: Get started - sections: + - local: pipeline_tutorial + title: パイプラインを使用して推論を実行する + - local: autoclass_tutorial + title: AutoClass を使用して移植可能なコードを作成する + - local: preprocessing + title: データの前処理 + - local: training + title: 事前トレーニングされたモデルを微調整する + - local: run_scripts + title: スクリプトを使用してトレーニングする + - local: accelerate + title: 🤗 Accelerate を使用して分散トレーニングをセットアップする + - local: peft + title: 🤗 PEFT を使用してアダプターをロードしてトレーニングする + - local: model_sharing + title: モデルを共有する + - local: transformers_agents + title: エージェント + - local: llm_tutorial + title: LLM を使用した生成 + title: Tutorials +- sections: + - isExpanded: false + sections: + - local: tasks/sequence_classification + title: テキストの分類 + - local: tasks/token_classification + title: トークンの分類 + - local: tasks/question_answering + title: 質疑応答 + - local: tasks/language_modeling + title: 因果言語モデリング + - local: tasks/masked_language_modeling + title: マスクされた言語モデリング + - local: tasks/translation + title: 翻訳 + - local: tasks/summarization + title: 要約 + - local: tasks/multiple_choice + title: 複数の選択肢 + title: 自然言語処理 + - isExpanded: false + sections: + - local: tasks/audio_classification + title: 音声の分類 + - local: tasks/asr + title: 自動音声認識 + title: オーディオ + - isExpanded: false + sections: + - local: tasks/image_classification + title: 画像分類 + - local: tasks/semantic_segmentation + title: セマンティックセグメンテーション + - local: tasks/video_classification + title: ビデオの分類 + - local: tasks/object_detection + title: 物体検出 + - local: tasks/zero_shot_object_detection + title: ゼロショット物体検出 + - local: tasks/zero_shot_image_classification + title: ゼロショット画像分類 + - local: tasks/monocular_depth_estimation + title: 深さの推定 + - local: tasks/image_to_image + title: 画像から画像へ + - local: tasks/knowledge_distillation_for_image_classification + title: コンピュータビジョンのための知識の蒸留 + title: コンピュータビジョン + - isExpanded: false + sections: + - local: tasks/image_captioning + title: 画像のキャプション + - local: tasks/document_question_answering + title: 文書の質問への回答 + - local: tasks/visual_question_answering + title: 視覚的な質問への回答 + - local: tasks/text-to-speech + title: テキスト読み上げ + title: マルチモーダル + - isExpanded: false + sections: + - local: generation_strategies + title: 生成戦略をカスタマイズする + title: 世代 + - isExpanded: false + sections: + - local: tasks/idefics + title: IDEFICS を使用したイメージ タスク + - local: tasks/prompting + title: LLM プロンプト ガイド + title: プロンプト + title: Task Guides +- sections: + - local: fast_tokenizers + title: 🤗 トークナイザーの高速トークナイザーを使用する + - local: multilingual + title: 多言語モデルで推論を実行する + - local: create_a_model + title: モデル固有の API を使用する + - local: custom_models + title: カスタムモデルを共有する + - local: chat_templating + title: チャットモデルのテンプレート + - local: serialization + title: ONNX へのエクスポート + - local: tflite + title: TFLite へのエクスポート + - local: torchscript + title: トーチスクリプトへのエクスポート + - local: benchmarks + title: ベンチマーク + - local: community + title: コミュニティリソース + - local: custom_tools + title: カスタムツールとプロンプト + - local: troubleshooting + title: トラブルシューティング + title: 開発者ガイド +- sections: + - local: performance + title: 概要 + - sections: + - local: perf_train_gpu_one + title: 単一の GPU で効率的にトレーニングするための方法とツール + - local: perf_train_gpu_many + title: 複数の GPU と並列処理 + - local: 
perf_train_cpu + title: CPU での効率的なトレーニング + - local: perf_train_cpu_many + title: 分散CPUトレーニング + - local: perf_train_tpu + title: TPU に関するトレーニング + - local: perf_train_tpu_tf + title: TensorFlow を使用した TPU のトレーニング + - local: perf_train_special + title: 特殊なハードウェアに関するトレーニング + - local: perf_hardware + title: トレーニング用のカスタム ハードウェア + - local: hpo_train + title: Trainer API を使用したハイパーパラメータ検索 + title: 効率的なトレーニングテクニック + - sections: + - local: perf_infer_cpu + title: CPUでの推論 + - local: perf_infer_gpu_one + title: 1 つの GPU での推論 + - local: perf_infer_gpu_many + title: 多くの GPU での推論 + - local: perf_infer_special + title: 特殊なハードウェアでの推論 + title: 推論の最適化 + - local: big_models + title: 大きなモデルのインスタンス化 + - local: tf_xla + title: TensorFlowモデルのXLA統合 + - local: perf_torch_compile + title: torch.compile()を使用した推論の最適化 + title: パフォーマンスとスケーラビリティ +- sections: + - local: add_new_model + title: 🤗 Transformersにモデルを追加する方法 + - local: add_tensorflow_model + title: 🤗 TransformersモデルをTensorFlowに変換する方法 + - local: testing + title: テスト + - local: pr_checks + title: プルリクエストのチェック + title: 貢献する +- sections: + - local: philosophy + title: フィロソフィー + - local: glossary + title: 用語集 + - local: task_summary + title: 🤗 Transformersの機能 + - local: tasks_explained + title: 🤗 Transformersがタスクを解決する方法 + - local: model_summary + title: Transformerモデルファミリー + - local: tokenizer_summary + title: トークナイザーの概要 + - local: attention + title: 注意機構 + - local: pad_truncation + title: パディングと切り詰め + - local: bertology + title: BERTology + - local: perplexity + title: 固定長モデルのパープレキシティ + - local: pipeline_webserver + title: Webサーバー推論用パイプライン + - local: model_memory_anatomy + title: モデルトレーニングの解剖学 + title: コンセプチュアルガイド +- sections: + - sections: + - local: main_classes/agent + title: エージェントとツール + - local: model_doc/auto + title: Auto Classes + - local: main_classes/callback + title: コールバック + - local: main_classes/configuration + title: 構成 + - local: main_classes/data_collator + title: データ照合者 + - local: main_classes/keras_callbacks + title: Keras コールバック + - local: main_classes/logging + title: ロギング + - local: main_classes/model + title: モデル + - local: main_classes/text_generation + title: テキストの生成 + - local: main_classes/onnx + title: ONNX + - local: main_classes/optimizer_schedules + title: 最適化 + - local: main_classes/output + title: モデルの出力 + - local: main_classes/pipelines + title: パイプライン + - local: main_classes/processors + title: プロセッサー + - local: main_classes/quantization + title: 量子化 + - local: main_classes/tokenizer + title: トークナイザー + - local: main_classes/trainer + title: トレーナー + - local: main_classes/deepspeed + title: ディープスピードの統合 + - local: main_classes/feature_extractor + title: 特徴抽出器 + - local: main_classes/image_processor + title: 画像処理プロセッサ + title: 主要なクラス + - sections: + - isExpanded: false + sections: + - local: model_doc/albert + title: ALBERT + - local: model_doc/bart + title: BART + - local: model_doc/barthez + title: BARThez + - local: model_doc/bartpho + title: BARTpho + - local: model_doc/bert + title: BERT + - local: model_doc/bert-generation + title: BertGeneration + - local: model_doc/bert-japanese + title: BertJapanese + - local: model_doc/bertweet + title: Bertweet + - local: model_doc/big_bird + title: BigBird + - local: model_doc/bigbird_pegasus + title: BigBirdPegasus + - local: model_doc/biogpt + title: BioGpt + - local: model_doc/blenderbot + title: Blenderbot + - local: model_doc/blenderbot-small + title: Blenderbot Small + - local: model_doc/bloom + title: BLOOM + - local: model_doc/bort + title: BORT + - local: model_doc/byt5 + title: ByT5 + 
- local: model_doc/camembert + title: CamemBERT + - local: model_doc/canine + title: CANINE + - local: model_doc/codegen + title: CodeGen + - local: model_doc/code_llama + title: CodeLlama + - local: model_doc/convbert + title: ConvBERT + - local: model_doc/cpm + title: CPM + - local: model_doc/cpmant + title: CPMANT + - local: model_doc/ctrl + title: CTRL + - local: model_doc/deberta + title: DeBERTa + - local: model_doc/deberta-v2 + title: DeBERTa-v2 + - local: model_doc/dialogpt + title: DialoGPT + title: 文章モデル + - isExpanded: false + sections: + - local: model_doc/beit + title: BEiT + - local: model_doc/bit + title: BiT + - local: model_doc/conditional_detr + title: Conditional DETR + - local: model_doc/convnext + title: ConvNeXT + - local: model_doc/convnextv2 + title: ConvNeXTV2 + - local: model_doc/cvt + title: CvT + - local: model_doc/deformable_detr + title: Deformable DETR + - local: model_doc/deit + title: DeiT + - local: model_doc/deta + title: DETA + - local: model_doc/detr + title: DETR + - local: model_doc/dinat + title: DiNAT + title: ビジョンモデル + - isExpanded: false + sections: + - local: model_doc/audio-spectrogram-transformer + title: Audio Spectrogram Transformer + - local: model_doc/bark + title: Bark + - local: model_doc/clap + title: CLAP + title: 音声モデル + - isExpanded: false + sections: + - local: model_doc/align + title: ALIGN + - local: model_doc/altclip + title: AltCLIP + - local: model_doc/blip + title: BLIP + - local: model_doc/blip-2 + title: BLIP-2 + - local: model_doc/bridgetower + title: BridgeTower + - local: model_doc/bros + title: BROS + - local: model_doc/chinese_clip + title: Chinese-CLIP + - local: model_doc/clip + title: CLIP + - local: model_doc/clipseg + title: CLIPSeg + - local: model_doc/clvp + title: CLVP + - local: model_doc/data2vec + title: Data2Vec + - local: model_doc/deplot + title: DePlot + title: マルチモーダルモデル + - isExpanded: false + sections: + - local: model_doc/decision_transformer + title: Decision Transformer + title: 強化学習モデル + - isExpanded: false + sections: + - local: model_doc/autoformer + title: Autoformer + title: 時系列モデル + title: モデル - sections: - - local: multilingual - title: 推論のための多言語モデル \ No newline at end of file + - local: internal/modeling_utils + title: カスタムレイヤーとユーティリティ + - local: internal/pipelines_utils + title: パイプライン用のユーティリティ + - local: internal/tokenization_utils + title: トークナイザー用のユーティリティ + - local: internal/trainer_utils + title: トレーナー用ユーティリティ + - local: internal/generation_utils + title: 生成用ユーティリティ + - local: internal/image_processing_utils + title: 画像プロセッサ用ユーティリティ + - local: internal/audio_utils + title: オーディオ処理用のユーティリティ + - local: internal/file_utils + title: 一般的なユーティリティ + - local: internal/time_series_utils + title: 時系列用のユーティリティ + title: 内部ヘルパー + title: API diff --git a/docs/source/ja/accelerate.md b/docs/source/ja/accelerate.md new file mode 100644 index 00000000000000..73e45b9cd3c5ec --- /dev/null +++ b/docs/source/ja/accelerate.md @@ -0,0 +1,136 @@ + + +# 🤗 Accelerate を用いた分散学習 + +モデルが大きくなるにつれて、限られたハードウェアでより大きなモデルを訓練し、訓練速度を大幅に上昇させるための方法として並列処理が浮上してきました。1台のマシンに複数のGPUがあっても、複数のマシンにまたがる複数のGPUがあっても、あらゆるタイプの分散処理セットアップ上でユーザーが簡単に 🤗 Transformers モデルを訓練できるように、 Hugging Face では [🤗 Accelerate](https://huggingface.co/docs/accelerate) ライブラリを作成しました。このチュートリアルでは、PyTorch の訓練ループをカスタマイズして、分散処理環境での訓練を可能にする方法について学びます。 + +## セットアップ + +はじめに 🤗 Accelerate をインストールしましょう: + +```bash +pip install accelerate +``` + +インストールしたら、インポートして [`~accelerate.Accelerator`] オブジェクトを作成しましょう。[`~accelerate.Accelerator`] 
は分散処理セットアップを自動的に検出し、訓練のために必要な全てのコンポーネントを初期化します。モデルをデバイスに明示的に配置する必要はありません。 + +```py +>>> from accelerate import Accelerator + +>>> accelerator = Accelerator() +``` + +## Accelerate する準備をしましょう + +次に、関連する全ての訓練オブジェクトを [`~accelerate.Accelerator.prepare`] メソッドに渡します。これには、訓練と評価それぞれのDataloader、モデル、optimizer が含まれます: + +```py +>>> train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare( +... train_dataloader, eval_dataloader, model, optimizer +... ) +``` + +## Backward + +最後に訓練ループ内の `loss.backward()` を 🤗 Accelerate の [`~accelerate.Accelerator.backward`] メソッドで置き換えます: + +```py +>>> for epoch in range(num_epochs): +... for batch in train_dataloader: +... outputs = model(**batch) +... loss = outputs.loss +... accelerator.backward(loss) + +... optimizer.step() +... lr_scheduler.step() +... optimizer.zero_grad() +... progress_bar.update(1) +``` + +以下のコードで確認できる通り、訓練ループに4行のコードを追加するだけで分散学習が可能です! + +```diff ++ from accelerate import Accelerator + from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler + ++ accelerator = Accelerator() + + model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2) + optimizer = AdamW(model.parameters(), lr=3e-5) + +- device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu") +- model.to(device) + ++ train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare( ++ train_dataloader, eval_dataloader, model, optimizer ++ ) + + num_epochs = 3 + num_training_steps = num_epochs * len(train_dataloader) + lr_scheduler = get_scheduler( + "linear", + optimizer=optimizer, + num_warmup_steps=0, + num_training_steps=num_training_steps + ) + + progress_bar = tqdm(range(num_training_steps)) + + model.train() + for epoch in range(num_epochs): + for batch in train_dataloader: +- batch = {k: v.to(device) for k, v in batch.items()} + outputs = model(**batch) + loss = outputs.loss +- loss.backward() ++ accelerator.backward(loss) + + optimizer.step() + lr_scheduler.step() + optimizer.zero_grad() + progress_bar.update(1) +``` + +## 訓練する + +関連するコードを追加したら、スクリプトまたは Colaboratory などのノートブックで訓練を開始します。 + +### スクリプトで訓練する + +スクリプトから訓練をしている場合は、設定ファイルを作成・保存するために以下のコマンドを実行してください: + +```bash +accelerate config +``` + +そして次のようにして訓練を開始します: + +```bash +accelerate launch train.py +``` + +### ノートブックで訓練する + +Colaboratory の TPU の利用をお考えの場合、🤗 Accelerate はノートブック上で実行することもできます。訓練に必要な全てのコードを関数に含め、[`~accelerate.notebook_launcher`] に渡してください: + +```py +>>> from accelerate import notebook_launcher + +>>> notebook_launcher(training_function) +``` + +🤗 Accelerate と豊富な機能についてもっと知りたい方は[ドキュメント](https://huggingface.co/docs/accelerate)を参照してください。 diff --git a/docs/source/ja/add_new_model.md b/docs/source/ja/add_new_model.md new file mode 100644 index 00000000000000..0701e973deeb3a --- /dev/null +++ b/docs/source/ja/add_new_model.md @@ -0,0 +1,756 @@ + + +# How to add a model to 🤗 Transformers? + +🤗 Transformersライブラリは、コミュニティの貢献者のおかげで新しいモデルを提供できることがよくあります。 +しかし、これは難しいプロジェクトであり、🤗 Transformersライブラリと実装するモデルについての深い知識が必要です。 +Hugging Faceでは、コミュニティの多くの人々に積極的にモデルを追加する力を与えようと努力しており、 +このガイドをまとめて、PyTorchモデルを追加するプロセスを説明します([PyTorchがインストールされていることを確認してください](https://pytorch.org/get-started/locally/))。 + + + +TensorFlowモデルを実装する興味がある場合は、[🤗 TransformersモデルをTensorFlowに変換する方法](add_tensorflow_model)ガイドを参照してみてください! 
+ + + +この過程で、以下のことを学びます: + +- オープンソースのベストプラクティスに関する洞察 +- 最も人気のある深層学習ライブラリの設計原則を理解する +- 大規模なモデルを効率的にテストする方法を学ぶ +- `black`、`ruff`、および`make fix-copies`などのPythonユーティリティを統合して、クリーンで読みやすいコードを確保する方法を学ぶ + +Hugging Faceチームのメンバーがサポートを提供するので、一人ぼっちになることはありません。 🤗 ❤️ + +さあ、始めましょう!🤗 Transformersで見たいモデルについての[New model addition](https://github.com/huggingface/transformers/issues/new?assignees=&labels=New+model&template=new-model-addition.yml)のイシューを開いてください。 +特定のモデルを提供することに特にこだわりがない場合、[New model label](https://github.com/huggingface/transformers/labels/New%20model)で未割り当てのモデルリクエストがあるかどうかを確認して、それに取り組むことができます。 + +新しいモデルリクエストを開いたら、最初のステップは🤗 Transformersをよく理解することです! + +## General overview of 🤗 Transformers + +まず、🤗 Transformersの一般的な概要を把握する必要があります。🤗 Transformersは非常に意見が分かれるライブラリですので、 +ライブラリの哲学や設計選択について同意できない可能性があります。ただし、私たちの経験から、ライブラリの基本的な設計選択と哲学は、 +🤗 Transformersを効率的にスケーリングし、適切なレベルで保守コストを抑えるために不可欠です。 + +ライブラリの理解を深めるための良い出発点は、[哲学のドキュメント](philosophy)を読むことです。 +私たちの作業方法の結果、すべてのモデルに適用しようとするいくつかの選択肢があります: + +- 一般的に、抽象化よりも構成が優先されます。 +- コードの重複は、読みやすさやアクセス可能性を大幅に向上させる場合、必ずしも悪いわけではありません。 +- モデルファイルはできるだけ自己完結的であるべきで、特定のモデルのコードを読む際には、理想的には該当する`modeling_....py`ファイルのみを見る必要があります。 + +私たちの意見では、このライブラリのコードは単なる製品を提供する手段だけでなく、*例えば、推論のためにBERTを使用する能力*などの製品そのもの. + +### Overview of models + +モデルを正常に追加するためには、モデルとその設定、[`PreTrainedModel`]、および[`PretrainedConfig`]の相互作用を理解することが重要です。 +例示的な目的で、🤗 Transformersに追加するモデルを「BrandNewBert」と呼びます。 + +以下をご覧ください: + + + +ご覧のように、🤗 Transformersでは継承を使用していますが、抽象化のレベルを最小限に保っています。 +ライブラリ内のどのモデルにも、抽象化のレベルが2つを超えることはありません。 +`BrandNewBertModel` は `BrandNewBertPreTrainedModel` を継承し、さらに[`PreTrainedModel`]を継承しています。 +これだけです。 +一般的なルールとして、新しいモデルは[`PreTrainedModel`]にのみ依存するようにしたいと考えています。 +すべての新しいモデルに自動的に提供される重要な機能は、[`~PreTrainedModel.from_pretrained`]および +[`~PreTrainedModel.save_pretrained`]です。 +これらはシリアライゼーションとデシリアライゼーションに使用されます。 +`BrandNewBertModel.forward`などの他の重要な機能は、新しい「modeling_brand_new_bert.py」スクリプトで完全に定義されるべきです。 +次に、特定のヘッドレイヤーを持つモデル(たとえば `BrandNewBertForMaskedLM` )が `BrandNewBertModel` を継承するのではなく、 +抽象化のレベルを低く保つために、そのフォワードパスで `BrandNewBertModel` を呼び出すコンポーネントとして使用されるようにしたいと考えています。 +新しいモデルには常に `BrandNewBertConfig` という設定クラスが必要です。この設定は常に[`PreTrainedModel`]の属性として保存され、 +したがって、`BrandNewBertPreTrainedModel`から継承するすべてのクラスで`config`属性を介してアクセスできます。 + +```python +model = BrandNewBertModel.from_pretrained("brandy/brand_new_bert") +model.config # model has access to its config +``` + +モデルと同様に、設定は[`PretrainedConfig`]から基本的なシリアル化および逆シリアル化の機能を継承しています。注意すべきは、設定とモデルは常に2つの異なる形式にシリアル化されることです - モデルは*pytorch_model.bin*ファイルに、設定は*config.json*ファイルにシリアル化されます。[`~PreTrainedModel.save_pretrained`]を呼び出すと、自動的に[`~PretrainedConfig.save_pretrained`]も呼び出され、モデルと設定の両方が保存されます。 + +### Code style + +新しいモデルをコーディングする際には、Transformersは意見があるライブラリであり、コードの書き方に関していくつかの独自の考え方があります :-) + +1. モデルのフォワードパスはモデリングファイルに完全に記述され、ライブラリ内の他のモデルとは完全に独立している必要があります。他のモデルからブロックを再利用したい場合、コードをコピーしてトップに`# Copied from`コメントを付けて貼り付けます(良い例は[こちら](https://github.com/huggingface/transformers/blob/v4.17.0/src/transformers/models/roberta/modeling_roberta.py#L160)、コピーに関する詳細なドキュメンテーションは[ここ](pr_checks#check-copies)を参照してください)。 +2. コードは完全に理解可能でなければなりません。これは記述的な変数名を選択し、省略形を避けるべきであることを意味します。例えば、`act`ではなく`activation`が好まれます。1文字の変数名は、forループ内のインデックスでない限り、強く非推奨です。 +3. より一般的に、魔法のような短いコードよりも長くて明示的なコードを好みます。 +4. PyTorchでは`nn.Sequential`をサブクラス化せずに、`nn.Module`をサブクラス化し、フォワードパスを記述し、コードを使用する他の人が簡単にデバッグできるようにします。プリントステートメントやブレークポイントを追加してデバッグできるようにします。 +5. 関数のシグネチャは型アノテーションを付けるべきです。その他の部分に関しては、型アノテーションよりも良い変数名が読みやすく理解しやすいことがあります。 + +### Overview of tokenizers + +まだ完了していません :-( このセクションは近日中に追加されます! 
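+ステップごとの手順に進む前に、上で説明したモデルと設定のシリアル化の関係を、小さな例で確認しておくとイメージしやすいかもしれません。以下は、説明のために *brand_new_bert* の代わりに既存の BERT クラスを使った仮のスケッチで、[`~PreTrainedModel.save_pretrained`] がモデルの重みと設定の両方を書き出し、読み込んだモデルが `config` 属性から自身の設定にアクセスできることを示しています(保存先のパスは仮のものです):
+
+```python
+from transformers import BertConfig, BertModel
+
+# デフォルト設定からランダムに初期化されたモデルを作成する
+config = BertConfig()
+model = BertModel(config)
+
+# save_pretrained はモデルの重みと設定(config.json)の両方を保存する
+model.save_pretrained("/path/to/tmp_folder")  # 仮のパス
+
+# from_pretrained で読み戻すと、モデルは `config` 属性として自身の設定にアクセスできる
+reloaded_model = BertModel.from_pretrained("/path/to/tmp_folder")
+print(reloaded_model.config.hidden_size)
+```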
+ +## Step-by-step recipe to add a model to 🤗 Transformers + +モデルを追加する方法は人それぞれ異なるため、他のコントリビューターが🤗 Transformersにモデルを追加する際の要約を確認することが非常に役立つ場合があります。以下は、他のコントリビューターが🤗 Transformersにモデルをポートする際のコミュニティブログ投稿のリストです。 + +1. [GPT2モデルのポーティング](https://medium.com/huggingface/from-tensorflow-to-pytorch-265f40ef2a28) by [Thomas](https://huggingface.co/thomwolf) +2. [WMT19 MTモデルのポーティング](https://huggingface.co/blog/porting-fsmt) by [Stas](https://huggingface.co/stas) + +経験から言えることは、モデルを追加する際に最も重要なことは次のようになります: + +- 車輪の再発明をしないでください!新しい🤗 Transformersモデルのために追加するコードのほとんどはすでに🤗 Transformers内のどこかに存在しています。類似した既存のモデルやトークナイザを見つけるために、いくつかの時間をかけて探すことが重要です。[grep](https://www.gnu.org/software/grep/)と[rg](https://github.com/BurntSushi/ripgrep)はあなたの友達です。モデルのトークナイザは1つのモデル実装に基づいているかもしれませんが、モデルのモデリングコードは別の実装に基づいていることがあることに注意してください。例えば、FSMTのモデリングコードはBARTに基づいており、FSMTのトークナイザコードはXLMに基づいています。 +- これは科学的な課題よりもエンジニアリングの課題です。モデルの論文の理論的な側面をすべて理解しようとするよりも、効率的なデバッグ環境を作成するために時間を費やすべきです。 +- 行き詰まった場合は助けを求めてください!モデルは🤗 Transformersのコアコンポーネントであり、Hugging Faceではモデルを追加するための各ステップでお手伝いするのを喜んでいます。進行がないことに気付いた場合は、進展していないことを気にしないでください。 + +以下では、🤗 Transformersにモデルをポートする際に最も役立つと考えられる一般的なレシピを提供しようとしています。 + +次のリストは、モデルを追加するために行う必要があるすべてのことの要約であり、To-Doリストとして使用できます: + +- ☐ (オプション)モデルの理論的な側面を理解しました +- ☐ 🤗 Transformersの開発環境を準備しました +- ☐ オリジナルのリポジトリのデバッグ環境をセットアップしました +- ☐ `forward()` パスをオリジナルのリポジトリとチェックポイントで正常に実行するスクリプトを作成しました +- ☐ モデルの骨格を🤗 Transformersに正常に追加しました +- ☐ オリジナルのチェックポイントを🤗 Transformersのチェックポイントに正常に変換しました +- ☐ 🤗 Transformersで実行される `forward()` パスを正常に実行し、オリジナルのチェックポイントと同一の出力を得ました +- ☐ 🤗 Transformersでのモデルテストを完了しました +- ☐ 🤗 Transformersにトークナイザを正常に追加しました +- ☐ エンドツーエンドの統合テストを実行しました +- ☐ ドキュメントを完成させました +- ☐ モデルのウェイトをHubにアップロードしました +- ☐ プルリクエストを提出しました +- ☐ (オプション)デモノートブックを追加しました + +まず、通常、`BrandNewBert`の理論的な理解を深めることをお勧めします。 +ただし、もしモデルの理論的な側面を「実務中に理解する」方が好ましい場合、`BrandNewBert`のコードベースに直接アクセスするのも問題ありません。 +このオプションは、エンジニアリングのスキルが理論的なスキルよりも優れている場合、 +`BrandNewBert`の論文を理解するのに苦労している場合、または科学的な論文を読むよりもプログラミングを楽しんでいる場合に適しています。 + +### 1. (Optional) Theoretical aspects of BrandNewBert + +BrandNewBertの論文がある場合、その説明を読むための時間を取るべきです。論文の中には理解が難しい部分があるかもしれません。 +その場合でも心配しないでください。目標は論文の深い理論的理解を得ることではなく、 +🤗 Transformersでモデルを効果的に再実装するために必要な情報を抽出することです。 +ただし、理論的な側面にあまり多くの時間をかける必要はありません。代わりに、実践的な側面に焦点を当てましょう。具体的には次の点です: + +- *brand_new_bert*はどの種類のモデルですか? BERTのようなエンコーダーのみのモデルですか? GPT2のようなデコーダーのみのモデルですか? BARTのようなエンコーダー-デコーダーモデルですか? + [model_summary](model_summary)を参照して、これらの違いについて詳しく知りたい場合があります。 +- *brand_new_bert*の応用分野は何ですか? テキスト分類ですか? テキスト生成ですか? Seq2Seqタスク、例えば要約ですか? +- モデルをBERT/GPT-2/BARTとは異なるものにする新しい機能は何ですか? +- 既存の[🤗 Transformersモデル](https://huggingface.co/transformers/#contents)の中で*brand_new_bert*に最も似ているモデルはどれですか? +- 使用されているトークナイザの種類は何ですか? SentencePieceトークナイザですか? WordPieceトークナイザですか? BERTやBARTで使用されているトークナイザと同じですか? + +モデルのアーキテクチャの良い概要を得たと感じたら、Hugging Faceチームに質問を送ることができます。 +これにはモデルのアーキテクチャ、注意層などに関する質問が含まれるかもしれません。 +私たちは喜んでお手伝いします。 + +### 2. Next prepare your environment + +1. リポジトリのページで「Fork」ボタンをクリックして、[リポジトリ](https://github.com/huggingface/transformers)をフォークします。 + これにより、コードのコピーがGitHubユーザーアカウントの下に作成されます。 + +2. ローカルディスクにある`transformers`フォークをクローンし、ベースリポジトリをリモートとして追加します: + +```bash +git clone https://github.com/[your Github handle]/transformers.git +cd transformers +git remote add upstream https://github.com/huggingface/transformers.git +``` + +```bash +python -m venv .env +source .env/bin/activate +pip install -e ".[dev]" +``` + +3. 
開発環境をセットアップするために、次のコマンドを実行してください: + +```bash +python -m venv .env +source .env/bin/activate +pip install -e ".[dev]" +``` + +お使いのOSに応じて、およびTransformersのオプションの依存関係の数が増えているため、このコマンドでエラーが発生する可能性があります。 +その場合は、作業しているDeep Learningフレームワーク(PyTorch、TensorFlow、および/またはFlax)をインストールし、次の手順を実行してください: + +```bash +pip install -e ".[quality]" +``` + +これはほとんどのユースケースには十分であるはずです。その後、親ディレクトリに戻ることができます。 + +```bash +cd .. +``` + +4. Transformersに*brand_new_bert*のPyTorchバージョンを追加することをお勧めします。PyTorchをインストールするには、 + https://pytorch.org/get-started/locally/ の指示に従ってください。 + + **注意:** CUDAをインストールする必要はありません。新しいモデルをCPUで動作させることで十分です。 + +5. *brand_new_bert*を移植するには、元のリポジトリへのアクセスも必要です。 + +```bash +git clone https://github.com/org_that_created_brand_new_bert_org/brand_new_bert.git +cd brand_new_bert +pip install -e . +``` + + +*brand_new_bert*を🤗 Transformersにポートするための開発環境を設定しました。 + +### 3.-4. Run a pretrained checkpoint using the original repository + +最初に、オリジナルの*brand_new_bert*リポジトリで作業します。通常、オリジナルの実装は非常に「研究的」であり、ドキュメンテーションが不足していたり、コードが理解しにくいことがあります。しかし、これが*brand_new_bert*を再実装する動機となるべきです。Hugging Faceでは、主要な目標の1つが、動作するモデルを取り、それをできるだけ**アクセス可能でユーザーフレンドリーで美しい**ものに書き直すことです。これは、🤗 Transformersにモデルを再実装する最も重要な動機です - 複雑な新しいNLP技術を**誰にでも**アクセス可能にしようとする試みです。 + +まず、オリジナルのリポジトリに入り込むことから始めるべきです。 + +公式の事前学習済みモデルをオリジナルのリポジトリで正常に実行することは、通常、**最も困難な**ステップです。 +私たちの経験から、オリジナルのコードベースに慣れるのに時間をかけることが非常に重要です。以下のことを理解する必要があります: + +- 事前学習済みの重みをどこで見つけるか? +- 対応するモデルに事前学習済みの重みをロードする方法は? +- モデルから独立してトークナイザを実行する方法は? +- 1つのフォワードパスを追跡して、単純なフォワードパスに必要なクラスと関数がわかるようにします。通常、これらの関数だけを再実装する必要があります。 +- モデルの重要なコンポーネントを特定できること:モデルのクラスはどこにありますか?モデルのサブクラス、*例* EncoderModel、DecoderModelがありますか?自己注意レイヤーはどこにありますか?複数の異なる注意レイヤー、*例* *自己注意*、*クロスアテンション*などが存在しますか? +- オリジナルのリポジトリの環境でモデルをデバッグする方法は?*print*ステートメントを追加する必要があるか、*ipdb*のような対話型デバッガを使用できるか、PyCharmのような効率的なIDEを使用してモデルをデバッグする必要がありますか? + +重要なのは、ポーティングプロセスを開始する前に、オリジナルのリポジトリでコードを**効率的に**デバッグできることです!また、これはオープンソースライブラリで作業していることを覚えておいてください。オリジナルのリポジトリでコードを調べる誰かを歓迎するために、問題をオープンにしたり、プルリクエストを送信したりすることをためらわないでください。このリポジトリのメンテナーは、彼らのコードを調べてくれる人に対して非常に喜んでいる可能性が高いです! 
+ +この段階では、オリジナルのモデルのデバッグにどのような環境と戦略を使用するかは、あなた次第です。最初にオリジナルのリポジトリに関するコードをデバッグできることが非常に重要です。また、GPU環境をセットアップすることはお勧めしません。まず、CPU上で作業し、モデルがすでに🤗 Transformersに正常にポートされていることを確認します。最後に、モデルがGPU上でも期待通りに動作するかどうかを検証する必要があります。 + +一般的に、オリジナルのモデルを実行するための2つのデバッグ環境があります: + +- [Jupyter notebooks](https://jupyter.org/) / [google colab](https://colab.research.google.com/notebooks/intro.ipynb) +- ローカルなPythonスクリプト。 + +Jupyterノートブックは、セルごとに実行できるため、論理的なコンポーネントをより分割し、中間結果を保存できるため、デバッグサイクルが速くなるという利点があります。また、ノートブックは他の共同作業者と簡単に共有できることが多く、Hugging Faceチームに助けを求める場合に非常に役立つ場合があります。Jupyterノートブックに精通している場合、それ + + +```python +model = BrandNewBertModel.load_pretrained_checkpoint("/path/to/checkpoint/") +input_ids = [0, 4, 5, 2, 3, 7, 9] # vector of input ids +original_output = model.predict(input_ids) +``` + +デバッグ戦略については、通常、いくつかの選択肢があります: + +- 元のモデルを多くの小さなテスト可能なコンポーネントに分解し、それぞれに対して前方パスを実行して検証します +- 元のモデルを元のトークナイザと元のモデルにのみ分解し、それらに対して前方パスを実行し、検証のために中間のプリントステートメントまたはブレークポイントを使用します + +再度、どの戦略を選択するかはあなた次第です。元のコードベースに依存することが多く、元のコードベースに応じて一方または他方が有利なことがあります。 + +元のコードベースがモデルを小さなサブコンポーネントに分解できる場合、*例えば*元のコードベースが簡単にイーガーモードで実行できる場合、それを行う価値が通常あります。最初からより難しい方法を選択することにはいくつかの重要な利点があります: + +- 後で元のモデルを🤗 Transformersの実装と比較する際に、各コンポーネントが対応する🤗 Transformers実装のコンポーネントと一致することを自動的に検証できるため、視覚的な比較に依存せずに済みます +- 大きな問題を小さな問題に分解する、つまり個々のコンポーネントのみをポーティングする問題に分割するのに役立ち、作業を構造化するのに役立ちます +- モデルを論理的な意味のあるコンポーネントに分割することで、モデルの設計をよりよく理解しやすくし、モデルをよりよく理解するのに役立ちます +- 後で、コンポーネントごとのテストを行うことで、コードを変更し続ける際にリグレッションが発生しないことを確認するのに役立ちます + +[Lysandreの](https://gist.github.com/LysandreJik/db4c948f6b4483960de5cbac598ad4ed) ELECTRAの統合チェックは、これがどのように行われるかの良い例です。 + +ただし、元のコードベースが非常に複雑で、中間コンポーネントをコンパイルモードで実行することしか許可しない場合、モデルを小さなテスト可能なサブコンポーネントに分解することが時間がかかりすぎるか、不可能であることがあります。 +良い例は[T5のMeshTensorFlow](https://github.com/tensorflow/mesh/tree/master/mesh_tensorflow)ライブラリであり、非常に複雑でモデルをサブコンポーネントに分解する簡単な方法を提供しないことがあります。このようなライブラリでは、通常、プリントステートメントを検証することに依存します。 + +どの戦略を選択しても、推奨される手順は通常同じで、最初のレイヤーからデバッグを開始し、最後のレイヤーからデバッグを行うべきです。 + +通常、以下の順序で次のレイヤーからの出力を取得することをお勧めします: + +1. モデルに渡された入力IDを取得する +2. 単語の埋め込みを取得する +3. 最初のTransformerレイヤーの入力を取得する +4. 最初のTransformerレイヤーの出力を取得する +5. 次のn - 1つのTransformerレイヤーの出力を取得する +6. 
BrandNewBertモデル全体の出力を取得する + +入力IDは整数の配列である必要があり、*例:* `input_ids = [0, 4, 4, 3, 2, 4, 1, 7, 19]` のようになります。 + +以下のレイヤーの出力は多次元の浮動小数点配列であることが多く、次のようになることがあります: + + +``` +[[ + [-0.1465, -0.6501, 0.1993, ..., 0.1451, 0.3430, 0.6024], + [-0.4417, -0.5920, 0.3450, ..., -0.3062, 0.6182, 0.7132], + [-0.5009, -0.7122, 0.4548, ..., -0.3662, 0.6091, 0.7648], + ..., + [-0.5613, -0.6332, 0.4324, ..., -0.3792, 0.7372, 0.9288], + [-0.5416, -0.6345, 0.4180, ..., -0.3564, 0.6992, 0.9191], + [-0.5334, -0.6403, 0.4271, ..., -0.3339, 0.6533, 0.8694]]], +``` + +🤗 Transformersに追加されるすべてのモデルは、統合テストを数回合格することが期待されており、元のモデルと🤗 Transformersで再実装されたバージョンが、0.001の精度までまったく同じ出力を提供する必要があります。 +異なるライブラリフレームワークで同じモデルを書いた場合、わずかに異なる出力を返すことが正常であるため、誤差許容値として1e-3(0.001)を受け入れています。モデルがほぼ同じ出力を返すだけでは不十分で、ほぼ同一である必要があります。そのため、🤗 Transformersバージョンの中間出力を元の*brand_new_bert*の実装の中間出力と複数回にわたって比較することになるでしょう。その際、元のリポジトリの**効率的な**デバッグ環境が非常に重要です。以下は、デバッグ環境をできるだけ効率的にするためのアドバイスです。 + +- 中間結果をデバッグする最適な方法を見つける。元のリポジトリはPyTorchで書かれていますか?その場合、元のモデルをより小さなサブコンポーネントに分解して中間値を取得する長いスクリプトを書くことがおそらく適切です。元のリポジトリがTensorflow 1で書かれている場合、[tf.print](https://www.tensorflow.org/api_docs/python/tf/print)などのTensorFlowのプリント操作を使用して中間値を出力する必要があるかもしれません。元のリポジトリがJaxで書かれている場合、フォワードパスの実行時にモデルが**jittedされていない**ことを確認してください。例:[このリンク](https://github.com/google/jax/issues/196)をチェック。 +- 使用可能な最小の事前学習済みチェックポイントを使用します。チェックポイントが小さいほど、デバッグサイクルが速くなります。事前学習済みモデルがフォワードパスに10秒以上かかる場合、効率的ではありません。非常に大きなチェックポイントしか利用できない場合、新しい環境でランダムに初期化されたウェイトを持つダミーモデルを作成し、それらのウェイトを🤗 Transformersバージョンのモデルと比較する方が良いかもしれません。 +- 元のリポジトリでフォワードパスを呼び出す最も簡単な方法を使用していることを確認してください。理想的には、元のリポジトリで**単一のフォワードパス**を呼び出す関数を見つけたいです。これは通常「predict」、「evaluate」、「forward」、「__call__」と呼ばれます。複数回「forward」を呼び出す関数をデバッグしたくありません。例:テキストを生成するために「autoregressive_sample」、「generate」と呼ばれる関数。 +- トークナイゼーションとモデルの「フォワード」パスを分離しようとしてください。元のリポジトリが入力文字列を入力する必要がある例を示す場合、フォワードコール内で文字列入力が入力IDに変更される場所を特定し、このポイントから開始します。これは、スクリプトを自分で書くか、入力文字列ではなく入力IDを直接入力できるように元のコードを変更する必要があるかもしれません。 +- デバッグセットアップ内のモデルがトレーニングモードではないことを確認してください。トレーニングモードでは、モデル内の複数のドロップアウトレイヤーのためにランダムな出力が生成されることがあります。デバッグ環境のフォワードパスが**決定論的**であることを確認し、ドロップアウトレイヤーが使用されないようにします。または、新しい実装が同じフレームワーク内にある場合、*transformers.utils.set_seed*を使用してください。 + +以下のセクションでは、*brand_new_bert*についてこれを具体的にどのように行うかについての詳細/ヒントを提供します。 + +### 5.-14. Port BrandNewBert to 🤗 Transformers + +次に、ついに新しいコードを🤗 Transformersに追加できます。🤗 Transformersのフォークのクローンに移動してください: + +```bash +cd transformers +``` + +特別なケースとして、既存のモデルと完全に一致するアーキテクチャのモデルを追加する場合、 +[このセクション](#write-a-conversion-script)で説明されているように、変換スクリプトを追加するだけで済みます。 +この場合、既存のモデルの完全なモデルアーキテクチャを再利用できます。 + +それ以外の場合、新しいモデルの生成を開始します。ここで2つの選択肢があります: + +- `transformers-cli add-new-model-like`を使用して既存のモデルのような新しいモデルを追加します +- `transformers-cli add-new-model`を使用して、テンプレートから新しいモデルを追加します(モデルのタイプに応じてBERTまたはBartのように見えます) + +どちらの場合でも、モデルの基本情報を入力するための質問事項が表示されます。 +2番目のコマンドを実行するには、`cookiecutter`をインストールする必要があります。 +詳細については[こちら](https://github.com/huggingface/transformers/tree/main/templates/adding_a_new_model)をご覧ください。 + +**主要な huggingface/transformers リポジトリでプルリクエストを開く** + +自動生成されたコードを適応し始める前に、🤗 Transformers に「作業中(WIP)」プルリクエストを開くタイミングです。 +例:「[WIP] *brand_new_bert* を追加」などです。 +これにより、ユーザーと Hugging Face チームが🤗 Transformers にモデルを統合する作業を並行して行うことができます。 + +以下の手順を実行してください: + +1. メインブランチから分かりやすい名前のブランチを作成します。 + +```bash +git checkout -b add_brand_new_bert +``` + +2. 自動生成されたコードをコミットしてください: + +```bash +git add . +git commit +``` + +3. 現在の main ブランチにフェッチしてリベース + +```bash +git fetch upstream +git rebase upstream/main +``` + +4. 
変更をあなたのアカウントにプッシュするには、次のコマンドを使用します: + +```bash +git push -u origin a-descriptive-name-for-my-changes +``` + +5. 満足したら、GitHub上のフォークのウェブページに移動します。[プルリクエスト]をクリックします。将来の変更に備えて、Hugging Face チームのメンバーのGitHubハンドルをレビュアーとして追加してください。 + +6. GitHubのプルリクエストウェブページの右側にある「ドラフトに変換」をクリックして、PRをドラフトに変更します。 + +以下では、進捗があった場合は常に作業をコミットし、プッシュしてプルリクエストに表示されるようにしてください。さらに、定期的にメインからの最新の変更を取り込むために、次のように行うことを忘れないでください: + +```bash +git fetch upstream +git merge upstream/main +``` + +一般的に、モデルや実装に関する質問はPull Request (PR) で行い、PR内で議論し、解決します。 +これにより、Hugging Face チームは新しいコードをコミットする際や質問がある場合に常に通知を受けることができます。 +質問や問題が解決された際に、問題や質問が理解されやすいように、Hugging Face チームにコードを指摘することが非常に役立ちます。 + +このためには、「Files changed」タブに移動してすべての変更を表示し、質問したい行に移動して「+」シンボルをクリックしてコメントを追加します。 +質問や問題が解決された場合は、作成されたコメントの「Resolve」ボタンをクリックできます。 + +同様に、Hugging Face チームはコードをレビューする際にコメントを開きます。 +PR上でのほとんどの質問はGitHub上で行うことをお勧めします。 +一般的な質問に関しては、公にはあまり役立たない質問については、SlackやメールでHugging Face チームに連絡することもできます。 + +**5. 生成されたモデルコードを"brand_new_bert"に適応させる** + +最初に、モデル自体に焦点を当て、トークナイザには気にしないでください。 +関連するコードは、生成されたファイル`src/transformers/models/brand_new_bert/modeling_brand_new_bert.py`および`src/transformers/models/brand_new_bert/configuration_brand_new_bert.py`で見つかるはずです。 + +さて、ついにコーディングを始めることができます :smile:。 +`src/transformers/models/brand_new_bert/modeling_brand_new_bert.py`にある生成されたコードは、エンコーダーのみのモデルであればBERTと同じアーキテクチャを持っているか、エンコーダー-デコーダーモデルであればBARTと同じアーキテクチャを持っているはずです。 +この段階では、モデルの理論的な側面について学んだことを思い出すべきです。つまり、「このモデルはBERTまたはBARTとどのように異なるのか?」ということです。 +これらの変更を実装しますが、これは通常、セルフアテンションレイヤー、正規化レイヤーの順序などを変更することを意味します。 +再び、あなたのモデルがどのように実装されるべきかをより良く理解するために、Transformers内に既存のモデルの類似アーキテクチャを見ることが役立つことがあります。 + +この時点では、コードが完全に正確またはクリーンである必要はありません。 +むしろ、まずは必要なコードの最初の*クリーンでない*コピー&ペーストバージョンを +`src/transformers/models/brand_new_bert/modeling_brand_new_bert.py`に追加し、必要なコードがすべて追加されていると感じるまで改善/修正を反復的に行うことがお勧めです。 +私たちの経験から、必要なコードの最初のバージョンを迅速に追加し、次のセクションで説明する変換スクリプトを使用してコードを繰り返し改善/修正する方が効率的であることが多いです。 +この時点で動作する必要があるのは、🤗 Transformersの"brand_new_bert"の実装をインスタンス化できることだけです。つまり、以下のコマンドが機能する必要があります: + +```python +from transformers import BrandNewBertModel, BrandNewBertConfig + +model = BrandNewBertModel(BrandNewBertConfig()) +``` + +上記のコマンドは、`BrandNewBertConfig()` で定義されたデフォルトパラメータに従ってモデルを作成し、 +すべてのコンポーネントの `init()` メソッドが正常に動作することを確認します。 + +すべてのランダムな初期化は、`BrandnewBertPreTrainedModel` クラスの `_init_weights` メソッドで行う必要があります。 +このメソッドは、設定変数に依存するすべてのリーフモジュールを初期化する必要があります。以下は、BERT の `_init_weights` メソッドの例です: + +```py +def _init_weights(self, module): + """Initialize the weights""" + if isinstance(module, nn.Linear): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.Embedding): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.padding_idx is not None: + module.weight.data[module.padding_idx].zero_() + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) +``` + +特定のモジュールに特別な初期化が必要な場合、カスタムスキームをさらに持つことができます。たとえば、 +`Wav2Vec2ForPreTraining`では、最後の2つの線形層には通常のPyTorchの`nn.Linear`の初期化が必要ですが、 +他のすべての層は上記のような初期化を使用する必要があります。これは以下のようにコーディングされています: + +```py +def _init_weights(self, module): + """Initialize the weights""" + if isinstance(module, Wav2Vec2ForPreTraining): + module.project_hid.reset_parameters() + module.project_q.reset_parameters() + module.project_hid._is_hf_initialized = True + module.project_q._is_hf_initialized = True + elif isinstance(module, nn.Linear): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.bias 
is not None: + module.bias.data.zero_() +``` + +`_is_hf_initialized`フラグは、サブモジュールを一度だけ初期化することを確実にするために内部で使用されます。 +`module.project_q`と`module.project_hid`のためにそれを`True`に設定することで、 +カスタム初期化が後で上書きされないようにし、`_init_weights`関数がそれらに適用されないようにします。 + +**6. 変換スクリプトを書く** + +次に、*brand_new_bert* の元のリポジトリでデバッグに使用したチェックポイントを、新しく作成した 🤗 Transformers 実装の *brand_new_bert* と互換性のあるチェックポイントに変換できる変換スクリプトを書く必要があります。 +変換スクリプトをゼロから書くことはお勧めされませんが、代わりに 🤗 Transformers で既に存在する類似のモデルを同じフレームワークで変換したスクリプトを調べることが良いでしょう。 +通常、既存の変換スクリプトをコピーして、自分のユースケースにわずかに適応させることで十分です。 +Hugging Face チームに既存のモデルに類似した変換スクリプトを教えてもらうことも躊躇しないでください。 + +- TensorFlowからPyTorchにモデルを移植している場合、良い出発点はBERTの変換スクリプトかもしれません [here](https://github.com/huggingface/transformers/blob/7acfa95afb8194f8f9c1f4d2c6028224dbed35a2/src/transformers/models/bert/modeling_bert.py#L91) +- PyTorchからPyTorchにモデルを移植している場合、良い出発点はBARTの変換スクリプトかもしれません [here](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bart/convert_bart_original_pytorch_checkpoint_to_pytorch.py) + +以下では、PyTorchモデルが層の重みをどのように保存し、層の名前を定義するかについて簡単に説明します。 +PyTorchでは、層の名前は層に与えるクラス属性の名前によって定義されます。 +PyTorchで `SimpleModel` というダミーモデルを定義しましょう: + +```python +from torch import nn + + +class SimpleModel(nn.Module): + def __init__(self): + super().__init__() + self.dense = nn.Linear(10, 10) + self.intermediate = nn.Linear(10, 10) + self.layer_norm = nn.LayerNorm(10) +``` + +これで、このモデル定義のインスタンスを作成し、`dense`、`intermediate`、`layer_norm`のすべての重みをランダムな重みで埋めたモデルを作成できます。モデルのアーキテクチャを確認するために、モデルを印刷してみましょう。 + +```python +model = SimpleModel() + +print(model) +``` +これは以下を出力します: + +``` +SimpleModel( + (dense): Linear(in_features=10, out_features=10, bias=True) + (intermediate): Linear(in_features=10, out_features=10, bias=True) + (layer_norm): LayerNorm((10,), eps=1e-05, elementwise_affine=True) +) +``` + +層の名前はPyTorchのクラス属性の名前によって定義されています。特定の層の重み値を出力することができます: + + +```python +print(model.dense.weight.data) +``` + +ランダムに初期化された重みを確認するために + +``` +tensor([[-0.0818, 0.2207, -0.0749, -0.0030, 0.0045, -0.1569, -0.1598, 0.0212, + -0.2077, 0.2157], + [ 0.1044, 0.0201, 0.0990, 0.2482, 0.3116, 0.2509, 0.2866, -0.2190, + 0.2166, -0.0212], + [-0.2000, 0.1107, -0.1999, -0.3119, 0.1559, 0.0993, 0.1776, -0.1950, + -0.1023, -0.0447], + [-0.0888, -0.1092, 0.2281, 0.0336, 0.1817, -0.0115, 0.2096, 0.1415, + -0.1876, -0.2467], + [ 0.2208, -0.2352, -0.1426, -0.2636, -0.2889, -0.2061, -0.2849, -0.0465, + 0.2577, 0.0402], + [ 0.1502, 0.2465, 0.2566, 0.0693, 0.2352, -0.0530, 0.1859, -0.0604, + 0.2132, 0.1680], + [ 0.1733, -0.2407, -0.1721, 0.1484, 0.0358, -0.0633, -0.0721, -0.0090, + 0.2707, -0.2509], + [-0.1173, 0.1561, 0.2945, 0.0595, -0.1996, 0.2988, -0.0802, 0.0407, + 0.1829, -0.1568], + [-0.1164, -0.2228, -0.0403, 0.0428, 0.1339, 0.0047, 0.1967, 0.2923, + 0.0333, -0.0536], + [-0.1492, -0.1616, 0.1057, 0.1950, -0.2807, -0.2710, -0.1586, 0.0739, + 0.2220, 0.2358]]). +``` + +スクリプト内の変換スクリプトでは、ランダムに初期化された重みを、対応するチェックポイント内の正確な重みで埋める必要があります。例えば、以下のように翻訳します: + + +```python +# retrieve matching layer weights, e.g. 
by +# recursive algorithm +layer_name = "dense" +pretrained_weight = array_of_dense_layer + +model_pointer = getattr(model, "dense") + +model_pointer.weight.data = torch.from_numpy(pretrained_weight) +``` + +PyTorchモデルの各ランダム初期化された重みと対応する事前学習済みチェックポイントの重みが +**形状と名前の両方**で正確に一致することを確認する必要があります。 +これを行うために、形状に対するassertステートメントを追加し、チェックポイントの重みの名前を出力することが +**必要不可欠**です。例えば、次のようなステートメントを追加する必要があります: + + +```python +assert ( + model_pointer.weight.shape == pretrained_weight.shape +), f"Pointer shape of random weight {model_pointer.shape} and array shape of checkpoint weight {pretrained_weight.shape} mismatched" +``` + +また、両方の重みの名前を印刷して、一致していることを確認する必要があります。例えば、次のようにします: + +```python +logger.info(f"Initialize PyTorch weight {layer_name} from {pretrained_weight.name}") +``` + +もし形状または名前のいずれかが一致しない場合、おそらく誤って🤗 Transformersの実装に初期化されたレイヤーに間違ったチェックポイントの重みを割り当ててしまった可能性があります。 + +誤った形状は、おそらく`BrandNewBertConfig()`での設定パラメーターが、変換したいチェックポイントで使用されたものと正確に一致しないためです。 +ただし、PyTorchのレイヤーの実装によっては、重みを事前に転置する必要がある場合もあります。 + +最後に、**すべて**の必要な重みが初期化されていることを確認し、初期化に使用されなかったすべてのチェックポイントの重みを表示して、モデルが正しく変換されていることを確認してください。 +変換トライアルが誤った形状ステートメントまたは誤った名前割り当てで失敗するのは完全に正常です。 +これはおそらく、`BrandNewBertConfig()`で誤ったパラメーターを使用したか、🤗 Transformersの実装に誤ったアーキテクチャがあるか、🤗 Transformersの実装の1つのコンポーネントの`init()`関数にバグがあるか、チェックポイントの重みの1つを転置する必要があるためです。 + +このステップは、以前のステップと繰り返すべきです。すべてのチェックポイントの重みが正しく🤗 Transformersモデルに読み込まれるまで繰り返すべきです。 +🤗 Transformers実装に正しくチェックポイントを読み込んだ後、選択したフォルダーにモデルを保存できます `/path/to/converted/checkpoint/folder`。このフォルダには`pytorch_model.bin`ファイルと`config.json`ファイルの両方が含まれるはずです。 + + +```python +model.save_pretrained("/path/to/converted/checkpoint/folder") +``` + +**7. 順伝播(forward pass)の実装** + +🤗 Transformers実装で事前学習済みの重みを正しく読み込んだ後、順伝播が正しく実装されていることを確認する必要があります。[元のリポジトリを理解する](#3-4-run-a-pretrained-checkpoint-using-the-original-repository)で、元のリポジトリを使用してモデルの順伝播を実行するスクリプトをすでに作成しました。今度は、元のリポジトリの代わりに🤗 Transformers実装を使用して類似のスクリプトを作成する必要があります。以下のようになります: + +```python +model = BrandNewBertModel.from_pretrained("/path/to/converted/checkpoint/folder") +input_ids = [0, 4, 4, 3, 2, 4, 1, 7, 19] +output = model(input_ids).last_hidden_states +``` + +🤗 Transformersの実装と元のモデルの実装が最初の実行で完全に同じ出力を提供しないか、 +フォワードパスでエラーが発生する可能性が非常に高いです。失望しないでください - これは予想されていることです! 
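+出力を比較する際には、あらかじめ次のような小さな比較用ヘルパーを手元に用意しておくと、このあとのデバッグ作業が楽になるかもしれません。これはあくまで説明用の仮のスケッチで、形状の一致と、上で述べた `1e-3` の許容誤差での数値の一致を確認するだけのものです:
+
+```python
+import torch
+
+
+def check_outputs(original_output: torch.Tensor, hf_output: torch.Tensor, atol: float = 1e-3):
+    """元の実装と 🤗 Transformers 実装の出力テンソルを比較する説明用のヘルパー。"""
+    # まず形状が一致していることを確認する
+    assert original_output.shape == hf_output.shape, (
+        f"Shape mismatch: {original_output.shape} vs {hf_output.shape}"
+    )
+    # 次に最大絶対誤差を表示し、許容誤差内に収まっていることを確認する
+    max_diff = (original_output - hf_output).abs().max().item()
+    print(f"Max absolute difference: {max_diff}")
+    assert torch.allclose(original_output, hf_output, atol=atol), "Outputs are not close enough"
+```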
+まず、フォワードパスがエラーをスローしないことを確認する必要があります。 +間違った次元が使用され、*次元の不一致*エラーや、誤ったデータ型オブジェクトが使用されることがよくあります。 +例えば、`torch.long`ではなく`torch.float32`が使用されます。特定のエラーを解決できない場合は、 +Hugging Faceチームに助けを求めることを躊躇しないでください。 + +🤗 Transformers実装が正しく機能することを確認する最終的な部分は、出力が`1e-3`の精度で同等であることを確認することです。 +まず、出力の形状が同一であること、つまりスクリプトの🤗 Transformers実装と元の実装の両方で`outputs.shape`が同じ値を生成する必要があります。 +次に、出力値が同一であることを確認する必要があります。 +これは新しいモデルを追加する際の最も難しい部分の1つです。 +出力が同一でない理由の一般的な間違いは以下の通りです。 + +- 一部のレイヤーが追加されていない、つまり*活性化*レイヤーが追加されていないか、リザバル接続が忘れられている +- 単語埋め込み行列が結ばれていない +- オリジナルの実装がオフセットを使用しているため、誤った位置埋め込みが使用されている +- フォワードパス中にドロップアウトが適用されています。これを修正するには、*model.trainingがFalse*であることを確認し、フォワードパス中に誤ってドロップアウトレイヤーがアクティブ化されないようにします。 +*つまり* [PyTorchのfunctional dropout](https://pytorch.org/docs/stable/nn.functional.html?highlight=dropout#torch.nn.functional.dropout)に*model.training*を渡します。 + +問題を修正する最良の方法は、通常、元の実装と🤗 Transformers実装のフォワードパスを並べて表示し、違いがあるかどうかを確認することです。 +理想的には、フォワードパスの両方の実装の中間出力をデバッグ/プリントアウトして、🤗 Transformers実装が元の実装と異なる出力を示すネットワーク内の正確な位置を見つけることができます。 +最初に、両方のスクリプトのハードコーディングされた`input_ids`が同一であることを確認します。 +次に、`input_ids`の最初の変換(通常、単語埋め込み)の出力が同一であることを確認します。 +その後、ネットワークの最後のレイヤーまで作業を進めます。 +いずれかの時点で、2つの実装間で違いがあることに気付くはずで、それにより🤗 Transformers実装のバグの場所が特定されます。 +経験上、元の実装と🤗 Transformers実装のフォワードパスの同じ位置に多くのプリントステートメントを追加し、 +中間プレゼンテーションで同じ値を示すプリントステートメントを段階的に削除するのがシンプルかつ効果的な方法です。 + +両方の実装が同じ出力を生成することに自信を持っている場合、`torch.allclose(original_output, output, atol=1e-3)`を使用して出力を確認すると、最も難しい部分が完了します! +おめでとうございます - 完了する作業は簡単なものになるはずです 😊。 + +**8. 必要なすべてのモデルテストを追加** + +この時点で、新しいモデルが正常に追加されました。 +ただし、モデルがまだ必要な設計に完全に準拠していない可能性が非常に高いです。 +🤗 Transformersと完全に互換性があることを確認するために、すべての一般的なテストがパスする必要があります。 +Cookiecutterはおそらくモデル用のテストファイルを自動的に追加しているはずで、おそらく同じディレクトリに`tests/models/brand_new_bert/test_modeling_brand_new_bert.py`として存在します。 +このテストファイルを実行して、すべての一般的なテストがパスすることを確認してください: + +```bash +pytest tests/models/brand_new_bert/test_modeling_brand_new_bert.py +``` + +すべての一般的なテストを修正したら、今度は実行したすべての素晴らしい作業が適切にテストされていることを確認することが非常に重要です。これにより、 + +- a) コミュニティは*brand_new_bert*の特定のテストを見ることで、あなたの作業を簡単に理解できます。 +- b) モデルへの将来の変更がモデルの重要な機能を壊さないようにすることができます。 + +まず、統合テストを追加する必要があります。これらの統合テストは、基本的にはデバッグスクリプトと同じことを行います。これらのモデルテストのテンプレートはCookiecutterによって既に追加されており、「BrandNewBertModelIntegrationTests」と呼ばれています。このテストを記入するだけです。これらのテストが合格していることを確認するには、次のコマンドを実行します。 + +```bash +RUN_SLOW=1 pytest -sv tests/models/brand_new_bert/test_modeling_brand_new_bert.py::BrandNewBertModelIntegrationTests +``` + + + +Windowsを使用している場合、`RUN_SLOW=1`を`SET RUN_SLOW=1`に置き換えてください。 + + + +次に、*brand_new_bert*に特有のすべての特徴は、別個のテスト内で追加されるべきです。 +`BrandNewBertModelTester`/`BrandNewBertModelTest`の下に。この部分はよく忘れられますが、2つの点で非常に役立ちます: + +- モデルの追加中に獲得した知識をコミュニティに伝え、*brand_new_bert*の特別な機能がどのように動作するかを示すことによって、知識の共有を支援します。 +- 将来の貢献者は、これらの特別なテストを実行することでモデルへの変更を迅速にテストできます。 + +**9. トークナイザの実装** + +次に、*brand_new_bert*のトークナイザを追加する必要があります。通常、トークナイザは🤗 Transformersの既存のトークナイザと同等か非常に似ています。 + +トークナイザが正しく動作することを確認するためには、まず、元のリポジトリ内で文字列を入力し、`input_ids`を返すスクリプトを作成することをお勧めします。 +このスクリプトは、次のように見えるかもしれません(疑似コードで示します): + +```python +input_str = "This is a long example input string containing special characters .$?-, numbers 2872 234 12 and words." 
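+# Note: this is pseudocode - the exact checkpoint-loading and tokenization
+# function names depend on the original repository.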
+model = BrandNewBertModel.load_pretrained_checkpoint("/path/to/checkpoint/") +input_ids = model.tokenize(input_str) +``` + +オリジナルのリポジトリを詳しく調査し、正しいトークナイザの関数を見つける必要があるかもしれません。 +または、オリジナルのリポジトリのクローンを変更して、`input_ids`だけを出力するようにする必要があるかもしれません。 +オリジナルのリポジトリを使用した機能的なトークナイゼーションスクリプトを作成した後、 +🤗 Transformers向けの類似したスクリプトを作成する必要があります。 +以下のように見えるべきです: + +```python +from transformers import BrandNewBertTokenizer + +input_str = "This is a long example input string containing special characters .$?-, numbers 2872 234 12 and words." + +tokenizer = BrandNewBertTokenizer.from_pretrained("/path/to/tokenizer/folder/") + +input_ids = tokenizer(input_str).input_ids +``` + +`input_ids`が同じ値を生成した場合、最終ステップとしてトークナイザのテストファイルも追加するべきです。 + +*brand_new_bert*のモデルングテストファイルと同様に、*brand_new_bert*のトークナイズテストファイルには、いくつかのハードコードされた統合テストが含まれるべきです。 + +**10. エンドツーエンド統合テストの実行** + +トークナイザを追加した後、`🤗 Transformers`内の`tests/models/brand_new_bert/test_modeling_brand_new_bert.py`に +モデルとトークナイザの両方を使用するいくつかのエンドツーエンド統合テストも追加する必要があります。 +このようなテストは、🤗 Transformersの実装が期待どおりに機能することを示すべきです。 +意味のあるテキスト対テキストのサンプルが含まれます。有用なテキスト対テキストのサンプルには、ソースからターゲットへの翻訳ペア、記事から要約へのペア、質問から回答へのペアなどが含まれます。 +ポートされたチェックポイントがダウンストリームタスクでファインチューニングされていない場合、モデルのテストに依存するだけで十分です。 +モデルが完全に機能していることを確認するために、すべてのテストをGPU上で実行することもお勧めします。 +モデルの内部テンソルに`.to(self.device)`ステートメントを追加するのを忘れる可能性があるため、そのようなテストではエラーが表示されることがあります。 +GPUにアクセスできない場合、Hugging Faceチームが代わりにこれらのテストを実行できます。 + +**11. ドキュメントの追加** + +これで、*brand_new_bert*の必要なすべての機能が追加されました - ほぼ完了です!残りの追加すべきことは、良いドキュメントとドキュメントページです。 +Cookiecutterが`docs/source/model_doc/brand_new_bert.md`というテンプレートファイルを追加しているはずで、これを記入する必要があります。 +モデルのユーザーは通常、モデルを使用する前にまずこのページを見ます。したがって、ドキュメンテーションは理解しやすく簡潔である必要があります。 +モデルの使用方法を示すためにいくつかの*Tips*を追加することはコミュニティにとって非常に役立ちます。ドキュメンテーションに関しては、Hugging Faceチームに問い合わせることをためらわないでください。 + +次に、`src/transformers/models/brand_new_bert/modeling_brand_new_bert.py`に追加されたドキュメンテーション文字列が正しいこと、およびすべての必要な入力および出力を含んでいることを確認してください。 +ドキュメンテーションの書き方とドキュメンテーション文字列のフォーマットについて詳細なガイドが[こちら](writing-documentation)にあります。 +ドキュメンテーションは通常、コミュニティとモデルの最初の接触点であるため、コードと同じくらい注意深く扱うべきであることを常に念頭に置いてください。 + +**コードのリファクタリング** + +素晴らしい、これで*brand_new_bert*に必要なすべてのコードが追加されました。 +この時点で、次のようなポテンシャルなコードスタイルの誤りを訂正するために以下を実行する必要があります: + +```bash +make style +``` + +あなたのコーディングスタイルが品質チェックをパスすることを確認してください: + +```bash +make quality +``` + +🤗 Transformersの非常に厳格なデザインテストには、まだ合格していない可能性があるいくつかの他のテストが存在するかもしれません。 +これは、ドキュメント文字列に情報が不足しているか、名前が間違っていることが原因であることが多いです。Hugging Faceチームは、ここで詰まっている場合には必ず助けてくれるでしょう。 + +最後に、コードが正しく機能することを確認した後、コードをリファクタリングするのは常に良いアイデアです。 +すべてのテストがパスした今、追加したコードを再度確認してリファクタリングを行うのは良いタイミングです。 + +これでコーディングの部分は完了しました、おめでとうございます! 🎉 あなたは素晴らしいです! 😎 + +**12. モデルをモデルハブにアップロード** + +最後のパートでは、すべてのチェックポイントをモデルハブに変換してアップロードし、各アップロードしたモデルチェックポイントにモデルカードを追加する必要があります。 +モデルハブの機能について詳しくは、[Model sharing and uploading Page](model_sharing)を読んで理解できます。 +ここでは、*brand_new_bert*の著者組織の下にモデルをアップロードできるように必要なアクセス権を取得するために、Hugging Faceチームと協力する必要があります。 +`transformers`のすべてのモデルに存在する`push_to_hub`メソッドは、チェックポイントをハブにプッシュする迅速かつ効率的な方法です。 +以下に、少しのコードスニペットを示します: + +```python +brand_new_bert.push_to_hub("brand_new_bert") +# Uncomment the following line to push to an organization. +# brand_new_bert.push_to_hub("/brand_new_bert") +``` + +各チェックポイントに適切なモデルカードを作成する価値があります。モデルカードは、この特定のチェックポイントの特性をハイライトするべきです。例えば、このチェックポイントはどのデータセットで事前学習/ファインチューニングされたか、どのような下流タスクでモデルを使用すべきかを示すべきです。また、モデルの正しい使用方法に関するコードも含めるべきです。 + +**13.(オプション)ノートブックの追加** + +*brand_new_bert*を推論または下流タスクのファインチューニングにどのように詳細に使用できるかを示すノートブックを追加することは非常に役立ちます。これはあなたのPRをマージするために必須ではありませんが、コミュニティにとって非常に有用です。 + +**14. 
完成したPRの提出** + +プログラミングが完了したら、最後のステップに移動し、PRをメインブランチにマージしましょう。通常、Hugging Faceチームはこの時点で既にあなたをサポートしているはずですが、PRに良い説明を追加し、コードにコメントを追加して、レビュアーに特定の設計の選択肢を指摘したい場合はコメントを追加することも価値があります。 + +### Share your work!! + +さあ、コミュニティからあなたの作業に対する評価を得る時が来ました!モデルの追加を完了することは、TransformersおよびNLPコミュニティにとって重要な貢献です。あなたのコードとポートされた事前学習済みモデルは、何百人、何千人という開発者や研究者によって確実に使用されるでしょう。あなたの仕事に誇りを持ち、コミュニティとあなたの成果を共有しましょう。 + +**あなたはコミュニティの誰でも簡単にアクセスできる別のモデルを作成しました! 🤯** + + diff --git a/docs/source/ja/add_tensorflow_model.md b/docs/source/ja/add_tensorflow_model.md new file mode 100644 index 00000000000000..8bc7ed0d9ee740 --- /dev/null +++ b/docs/source/ja/add_tensorflow_model.md @@ -0,0 +1,296 @@ + + + +# How to convert a 🤗 Transformers model to TensorFlow? + +🤗 Transformersを使用するために複数のフレームワークが利用可能であることは、アプリケーションを設計する際にそれぞれの強みを活かす柔軟性を提供しますが、 +互換性をモデルごとに追加する必要があることを意味します。しかし、幸いなことに +既存のモデルにTensorFlow互換性を追加することは、[ゼロから新しいモデルを追加すること](add_new_model)よりも簡単です! +大規模なTensorFlowモデルの詳細を理解したり、主要なオープンソースの貢献を行ったり、 +選択したモデルをTensorFlowで有効にするためのガイドです。 + +このガイドは、コミュニティのメンバーであるあなたに、TensorFlowモデルの重みおよび/または +アーキテクチャを🤗 Transformersで使用するために、Hugging Faceチームからの最小限の監視で貢献できる力を与えます。新しいモデルを書くことは小さな偉業ではありませんが、 +このガイドを読むことで、それがローラーコースターのようなものから散歩のようなものになることを願っています🎢🚶。 +このプロセスをますます簡単にするために、私たちの共通の経験を活用することは非常に重要ですので、 +このガイドの改善を提案することを強くお勧めします! + +さらに詳しく調べる前に、以下のリソースをチェックすることをお勧めします。🤗 Transformersが初めての場合: + +- [🤗 Transformersの一般的な概要](add_new_model#general-overview-of-transformers) +- [Hugging FaceのTensorFlow哲学](https://huggingface.co/blog/tensorflow-philosophy) + +このガイドの残りの部分では、新しいTensorFlowモデルアーキテクチャを追加するために必要なもの、 +PyTorchをTensorFlowモデルの重みに変換する手順、およびMLフレームワーク間の不一致を効率的にデバッグする方法について学びます。それでは始めましょう! + + + +使用したいモデルに対応するTensorFlowアーキテクチャがすでに存在するかどうかわからないですか? + +  + +選択したモデルの`config.json`の`model_type`フィールドをチェックしてみてください +([例](https://huggingface.co/google-bert/bert-base-uncased/blob/main/config.json#L14))。 +🤗 Transformersの該当するモデルフォルダに、名前が"modeling_tf"で始まるファイルがある場合、それは対応するTensorFlow +アーキテクチャを持っていることを意味します([例](https://github.com/huggingface/transformers/tree/main/src/transformers/models/bert))。 + + + +## Step-by-step guide to add TensorFlow model architecture code + +大規模なモデルアーキテクチャを設計する方法はさまざまであり、その設計を実装する方法もさまざまです。 +しかし、[🤗 Transformersの一般的な概要](add_new_model#general-overview-of-transformers)から +思い出していただけるかもしれませんが、私たちは意見のあるグループです - 🤗 Transformersの使いやすさは一貫性のある設計の選択肢に依存しています。経験から、TensorFlowモデルを追加する際に重要なことをいくつかお伝えできます: + +- 車輪を再発明しないでください!ほとんどの場合、確認すべき少なくとも2つの参照実装があります。それは、 +あなたが実装しているモデルのPyTorchバージョンと、同じ種類の問題に対する他のTensorFlowモデルです。 +- 優れたモデル実装は時間の試練を乗り越えます。これは、コードがきれいだからではなく、コードが明確で、デバッグしやすく、 +構築しやすいからです。TensorFlow実装でPyTorch実装と一致するパターンを複製し、PyTorch実装との不一致を最小限に抑えることで、 +あなたの貢献が長期間にわたって有用であることを保証します。 +- 行き詰まったら助けを求めてください! 🤗 Transformersチームはここにいますし、おそらくあなたが直面している同じ問題に対する解決策を見つけています。 + +TensorFlowモデルアーキテクチャを追加するために必要なステップの概要は次のとおりです: +1. 変換したいモデルを選択 +2. transformersの開発環境を準備 +3. (オプション)理論的な側面と既存の実装を理解 +4. モデルアーキテクチャを実装 +5. モデルのテストを実装 +6. プルリクエストを提出 +7. (オプション)デモを構築して世界と共有 + +### 1.-3. Prepare your model contribution + +**1. 
変換したいモデルを選択する** + +まず、基本から始めましょう。最初に知っておく必要があることは、変換したいアーキテクチャです。 +特定のアーキテクチャを決めていない場合、🤗 Transformers チームに提案を求めることは、影響を最大限にする素晴らしい方法です。 +チームは、TensorFlow サイドで不足している最も注目されるアーキテクチャに向けてガイドします。 +TensorFlow で使用したい特定のモデルに、🤗 Transformers に既に TensorFlow アーキテクチャの実装が存在しているが、重みが不足している場合、 +このページの[重みの追加セクション](#adding-tensorflow-weights-to--hub)に直接移動してください。 + +簡単にするために、このガイドの残りの部分では、TensorFlow バージョンの *BrandNewBert* を貢献することを決定したと仮定しています +(これは、[新しいモデルの追加ガイド](add_new_model)での例と同じです)。 + + + +TensorFlow モデルのアーキテクチャに取り組む前に、それを行うための進行中の取り組みがないかを再確認してください。 +GitHub ページの[プルリクエスト](https://github.com/huggingface/transformers/pulls?q=is%3Apr)で `BrandNewBert` を検索して、 +TensorFlow 関連のプルリクエストがないことを確認できます。 + + + + +**2. transformers 開発環境の準備** + +モデルアーキテクチャを選択したら、意向を示すためにドラフト PR を開くための環境を設定してください。 +以下の手順に従って、環境を設定し、ドラフト PR を開いてください。 + +1. リポジトリのページで 'Fork' ボタンをクリックして、[リポジトリ](https://github.com/huggingface/transformers)をフォークします。 + これにより、コードのコピーが GitHub ユーザーアカウントの下に作成されます。 + +2. ローカルディスクにある 'transformers' フォークをクローンし、ベースリポジトリをリモートとして追加します: + +```bash +git clone https://github.com/[your Github handle]/transformers.git +cd transformers +git remote add upstream https://github.com/huggingface/transformers.git +``` + +3. 開発環境を設定します。たとえば、以下のコマンドを実行してください: + +```bash +git clone https://github.com/[your Github handle]/transformers.git +cd transformers +git remote add upstream https://github.com/huggingface/transformers.git +``` + +依存関係が増えているため、OSに応じて、Transformersのオプションの依存関係の数が増えるかもしれません。その場合は、TensorFlowをインストールしてから次のコマンドを実行してください。 + +```bash +pip install -e ".[quality]" +``` + +**注意:** CUDAをインストールする必要はありません。新しいモデルをCPUで動作させることが十分です。 + +4. メインブランチからわかりやすい名前のブランチを作成してください。 + +```bash +git checkout -b add_tf_brand_new_bert +``` +5. 現在のmainブランチにフェッチしてリベースする + +```bash +git fetch upstream +git rebase upstream/main +``` + +6. `transformers/src/models/brandnewbert/`に`modeling_tf_brandnewbert.py`という名前の空の`.py`ファイルを追加します。これはあなたのTensorFlowモデルファイルです。 + +7. 以下を使用して変更内容をアカウントにプッシュします: + +```bash +git add . +git commit -m "initial commit" +git push -u origin add_tf_brand_new_bert +``` + +8. GitHub上でフォークしたウェブページに移動し、「プルリクエスト」をクリックします。将来の変更に備えて、Hugging Face チームのメンバーのGitHubハンドルをレビュアーとして追加してください。 + +9. GitHubのプルリクエストウェブページの右側にある「ドラフトに変換」をクリックして、プルリクエストをドラフトに変更します。 + +これで、🤗 Transformers内に*BrandNewBert*をTensorFlowに移植するための開発環境が設定されました。 + +**3. (任意) 理論的な側面と既存の実装を理解する** + +*BrandNewBert*の論文が存在する場合、その記述的な作業を読む時間を取るべきです。論文には理解が難しい大きなセクションがあるかもしれません。その場合でも問題ありません - 心配しないでください!目標は論文の理論的な理解を深めることではなく、🤗 Transformersを使用してTensorFlowでモデルを効果的に再実装するために必要な情報を抽出することです。とは言え、理論的な側面にあまり時間をかける必要はありません。代わりに、既存のモデルのドキュメンテーションページ(たとえば、[BERTのモデルドキュメント](model_doc/bert)など)に焦点を当てるべきです。 + +実装するモデルの基本を把握した後、既存の実装を理解することは重要です。これは、動作する実装がモデルに対する期待と一致することを確認する絶好の機会であり、TensorFlow側での技術的な課題を予測することもできます。 + +情報の多さに圧倒されていると感じるのは完全に自然です。この段階ではモデルのすべての側面を理解する必要はありません。ただし、[フォーラム](https://discuss.huggingface.co/)で急な質問を解決することを強くお勧めします。 + + +### 4. 
Model implementation + +さあ、いよいよコーディングを始めましょう。お勧めする出発点は、PyTorchファイルそのものです。 +`src/transformers/models/brand_new_bert/`内の`modeling_brand_new_bert.py`の内容を +`modeling_tf_brand_new_bert.py`にコピーします。このセクションの目標は、 +🤗 Transformersのインポート構造を更新し、`TFBrandNewBert`と +`TFBrandNewBert.from_pretrained(model_repo, from_pt=True)`を正常に読み込む動作するTensorFlow *BrandNewBert*モデルを +インポートできるようにすることです。 + +残念ながら、PyTorchモデルをTensorFlowに変換する明確な方法はありません。ただし、プロセスをできるだけスムーズにするためのヒントを以下に示します: + +- すべてのクラスの名前の前に `TF` を付けます(例: `BrandNewBert` は `TFBrandNewBert` になります)。 +- ほとんどのPyTorchの操作には、直接TensorFlowの代替があります。たとえば、`torch.nn.Linear` は `tf.keras.layers.Dense` に対応し、`torch.nn.Dropout` は `tf.keras.layers.Dropout` に対応します。特定の操作について不明確な場合は、[TensorFlowのドキュメント](https://www.tensorflow.org/api_docs/python/tf)または[PyTorchのドキュメント](https://pytorch.org/docs/stable/)を参照できます。 +- 🤗 Transformersのコードベースにパターンが見つかります。特定の操作に直接的な代替がない場合、誰かがすでに同じ問題に対処している可能性が高いです。 +- デフォルトでは、PyTorchと同じ変数名と構造を維持します。これにより、デバッグや問題の追跡、修正の追加が容易になります。 +- 一部のレイヤーには、各フレームワークで異なるデフォルト値があります。注目すべき例は、バッチ正規化レイヤーの epsilon です(PyTorchでは`1e-5`、[TensorFlowでは](https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization) `1e-3` です)。ドキュメントを再確認してください! +- PyTorchの `nn.Parameter` 変数は通常、TF Layerの `build()` 内で初期化する必要があります。次の例を参照してください:[PyTorch](https://github.com/huggingface/transformers/blob/655f72a6896c0533b1bdee519ed65a059c2425ac/src/transformers/models/vit_mae/modeling_vit_mae.py#L212) / [TensorFlow](https://github.com/huggingface/transformers/blob/655f72a6896c0533b1bdee519ed65a059c2425ac/src/transformers/models/vit_mae/modeling_tf_vit_mae.py#L220) +- PyTorchモデルに関数の上部に `#copied from ...` がある場合、TensorFlowモデルも同じアーキテクチャからその関数を借りることができる可能性が高いです。TensorFlowアーキテクチャがある場合です。 +- TensorFlow関数内で `name`属性を正しく設定することは、`from_pt=True`のウェイトのクロスロードロードを行うために重要です。通常、`name`はPyTorchコード内の対応する変数の名前です。`name`が正しく設定されていない場合、モデルウェイトのロード時にエラーメッセージで表示されます。 +- ベースモデルクラス `BrandNewBertModel` のロジックは実際には `TFBrandNewBertMainLayer` にあります。これはKerasレイヤーのサブクラスです([例](https://github.com/huggingface/transformers/blob/4fd32a1f499e45f009c2c0dea4d81c321cba7e02/src/transformers/models/bert/modeling_tf_bert.py#L719))。`TFBrandNewBertModel` は、単にこのレイヤーのラッパーです。 +- モデルを読み込むためには、Kerasモデルをビルドする必要があります。そのため、`TFBrandNewBertPreTrainedModel` はモデルへの入力の例、`dummy_inputs` を持つ必要があります([例](https://github.com/huggingface/transformers/blob/4fd32a1f499e45f009c2c0dea4d81c321cba7e02/src/transformers/models/bert/modeling_tf_bert.py#L916))。 +- 表示が止まった場合は、助けを求めてください。私たちはあなたのお手伝いにここにいます! 🤗 + +モデルファイル自体だけでなく、モデルクラスと関連するドキュメンテーションページへのポインターも追加する必要があります。他のPRのパターンに従ってこの部分を完了できます +([例](https://github.com/huggingface/transformers/pull/18020/files))。 +以下は手動での変更が必要な一覧です: +- *BrandNewBert*のすべてのパブリッククラスを `src/transformers/__init__.py` に含める +- *BrandNewBert*クラスを `src/transformers/models/auto/modeling_tf_auto.py` の対応するAutoクラスに追加 +- ドキュメンテーションテストファイルのリストにモデリングファイルを追加する `utils/documentation_tests.txt` +- `src/transformers/utils/dummy_tf_objects.py` に関連する *BrandNewBert* に関連する遅延ロードクラスを追加 +- `src/transformers/models/brand_new_bert/__init__.py` でパブリッククラスのインポート構造を更新 +- `docs/source/en/model_doc/brand_new_bert.md` に *BrandNewBert* のパブリックメソッドのドキュメンテーションポインターを追加 +- `docs/source/en/model_doc/brand_new_bert.md` の *BrandNewBert* の貢献者リストに自分自身を追加 +- 最後に、`docs/source/en/index.md` の *BrandNewBert* のTensorFlow列に緑色のチェックマーク ✅ を追加 + +モデルアーキテクチャが準備できていることを確認するために、以下のチェックリストを実行してください: +1. 訓練時に異なる動作をするすべてのレイヤー(例:Dropout)は、`training`引数を使用して呼び出され、それが最上位クラスから伝播されます。 +2. 可能な限り `#copied from ...` を使用しました +3. `TFBrandNewBertMainLayer` およびそれを使用するすべてのクラスの `call` 関数が `@unpack_inputs` でデコレートされています +4. 
`TFBrandNewBertMainLayer` は `@keras_serializable` でデコレートされています +5. PyTorchウェイトからTensorFlowウェイトを使用してTensorFlowモデルをロードできます `TFBrandNewBert.from_pretrained(model_repo, from_pt=True)` +6. 予期される入力形式を使用してTensorFlowモデルを呼び出すことができます + + +### 5. Add model tests + +やったね、TensorFlowモデルを実装しました! +今度は、モデルが期待通りに動作することを確認するためのテストを追加する時間です。 +前のセクションと同様に、`tests/models/brand_new_bert/`ディレクトリ内の`test_modeling_brand_new_bert.py`ファイルを`test_modeling_tf_brand_new_bert.py`にコピーし、必要なTensorFlowの置換を行うことをお勧めします。 +今の段階では、すべての`.from_pretrained()`呼び出しで、既存のPyTorchの重みをロードするために`from_pt=True`フラグを使用する必要があります。 + +作業が完了したら、テストを実行する準備が整いました! 😬 + +```bash +NVIDIA_TF32_OVERRIDE=0 RUN_SLOW=1 RUN_PT_TF_CROSS_TESTS=1 \ +py.test -vv tests/models/brand_new_bert/test_modeling_tf_brand_new_bert.py +``` + +最も可能性の高い結果は、多くのエラーが表示されることです。心配しないでください、これは予想される動作です! +MLモデルのデバッグは非常に難しいとされており、成功の鍵は忍耐力(と`breakpoint()`)です。私たちの経験では、 +最も難しい問題はMLフレームワーク間の微妙な不一致から発生し、これについてはこのガイドの最後にいくつかのポインタを示します。 +他の場合では、一般的なテストが直接モデルに適用できない場合もあり、その場合はモデルのテストクラスレベルでオーバーライドを提案します。 +問題の種類に関係なく、詰まった場合は、ドラフトのプルリクエストで助けを求めることをためらわないでください。 + +すべてのテストがパスしたら、おめでとうございます。あなたのモデルはほぼ🤗 Transformersライブラリに追加する準備が整いました!🎉 + +**6. プルリクエストを提出する** + +実装とテストが完了したら、プルリクエストを提出する準備が整いました。コードをプッシュする前に、 +コードフォーマットユーティリティである `make fixup` 🪄 を実行してください。 +これにより、自動的なチェックに失敗する可能性のあるフォーマットの問題が自動的に修正されます。 + +これで、ドラフトプルリクエストを実際のプルリクエストに変換する準備が整いました。 +これを行うには、「レビュー待ち」ボタンをクリックし、Joao(`@gante`)とMatt(`@Rocketknight1`)をレビュワーとして追加します。 +モデルプルリクエストには少なくとも3人のレビュワーが必要ですが、モデルに適切な追加のレビュワーを見つけるのは彼らの責任です。 + +すべてのレビュワーがプルリクエストの状態に満足したら、最後のアクションポイントは、`.from_pretrained()` 呼び出しで `from_pt=True` フラグを削除することです。 +TensorFlowのウェイトが存在しないため、それらを追加する必要があります!これを行う方法については、以下のセクションを確認してください。 + +最後に、TensorFlowのウェイトがマージされ、少なくとも3人のレビューアが承認し、すべてのCIチェックが +成功した場合、テストをローカルで最後にもう一度確認してください。 + +```bash +NVIDIA_TF32_OVERRIDE=0 RUN_SLOW=1 RUN_PT_TF_CROSS_TESTS=1 \ +py.test -vv tests/models/brand_new_bert/test_modeling_tf_brand_new_bert.py +``` + +そして、あなたのPRをマージします!マイルストーン達成おめでとうございます 🎉 + +**7. (Optional) デモを作成して世界と共有** + +オープンソースの最も難しい部分の1つは、発見です。あなたの素晴らしいTensorFlowの貢献が存在することを他のユーザーがどのように知ることができるでしょうか?適切なコミュニケーションです! 📣 + +コミュニティとモデルを共有する主要な方法は2つあります。 +- デモを作成します。これにはGradioデモ、ノートブック、およびモデルを紹介するための他の楽しい方法が含まれます。[コミュニティ駆動のデモ](https://huggingface.co/docs/transformers/community)にノートブックを追加することを強くお勧めします。 +- TwitterやLinkedInなどのソーシャルメディアでストーリーを共有します。あなたの仕事に誇りを持ち、コミュニティとあなたの成果を共有するべきです - あなたのモデルは今や世界中の何千人ものエンジニアや研究者によって使用される可能性があります 🌍!私たちはあなたの投稿をリツイートして共同体と共有するお手伝いを喜んでします。 + +## Adding TensorFlow weights to 🤗 Hub + +TensorFlowモデルのアーキテクチャが🤗 Transformersで利用可能な場合、PyTorchの重みをTensorFlowの重みに変換することは簡単です! + +以下がその方法です: +1. ターミナルでHugging Faceアカウントにログインしていることを確認してください。コマンド`huggingface-cli login`を使用してログインできます(アクセストークンは[こちら](https://huggingface.co/settings/tokens)で見つけることができます)。 +2. `transformers-cli pt-to-tf --model-name foo/bar`というコマンドを実行します。ここで、`foo/bar`は変換したいPyTorchの重みを含むモデルリポジトリの名前です。 +3. 上記のコマンドで作成された🤗 Hub PRに`@joaogante`と`@Rocketknight1`をタグ付けします。 + +それだけです! 🎉 + +## Debugging mismatches across ML frameworks 🐛 + +新しいアーキテクチャを追加したり、既存のアーキテクチャのTensorFlowの重みを作成したりする際、PyTorchとTensorFlow間の不一致についてのエラーに遭遇することがあります。 +場合によっては、PyTorchとTensorFlowのモデルアーキテクチャがほぼ同一であるにもかかわらず、不一致を指摘するエラーが表示されることがあります。 +どうしてでしょうか? 
🤔 + +まず最初に、なぜこれらの不一致を理解することが重要かについて話しましょう。多くのコミュニティメンバーは🤗 Transformersモデルをそのまま使用し、モデルが期待どおりに動作すると信頼しています。 +2つのフレームワーク間で大きな不一致があると、少なくとも1つのフレームワークのリファレンス実装に従ってモデルが動作しないことを意味します。 +これにより、モデルは実行されますが性能が低下する可能性があり、静かな失敗が発生する可能性があります。これは、全く実行されないモデルよりも悪いと言えるかもしれません!そのため、モデルのすべての段階でのフレームワークの不一致が`1e-5`未満であることを目指しています。 + +数値計算の問題と同様に、詳細については細かいところにあります。そして、詳細指向の技術である以上、秘密の要素は忍耐です。 +この種の問題に遭遇した場合のお勧めのワークフローは次のとおりです: +1. 不一致の原因を特定します。変換中のモデルにはおそらく特定の点までほぼ同一の内部変数があります。 + 両方のフレームワークのアーキテクチャに`breakpoint()`ステートメントを配置し、トップダウンの方法で数値変数の値を比較し、問題の原因を見つけます。 +2. 問題の原因を特定したら、🤗 Transformersチームと連絡を取りましょう。同様の問題に遭遇したことがあるかもしれず、迅速に解決策を提供できるかもしれません。最終手段として、StackOverflowやGitHubの問題など、人気のあるページをスキャンします。 +3. 解決策が見当たらない場合、問題を掘り下げる必要があることを意味します。良いニュースは、問題の原因を特定したことです。したがって、問題のある命令に焦点を当て、モデルの残りを抽象化できます!悪いニュースは、その命令のソース実装に進む必要があることです。一部の場合では、リファレンス実装に問題があるかもしれません - 上流リポジトリで問題を開くのを控えないでください。 + +🤗 Transformersチームとの話し合いで、不一致を修正することが困難であることが判明することがあります。 +出力レイヤーのモデルで不一致が非常に小さい場合(ただし、隠れた状態では大きい可能性がある)、モデルを配布するためにそれを無視することにするかもしれません。 +上記で言及した`pt-to-tf` CLIには、重み変換時にエラーメッセージを無視するための`--max-error`フラグがあります。 + + + + + + diff --git a/docs/source/ja/attention.md b/docs/source/ja/attention.md new file mode 100644 index 00000000000000..4c452ffa422299 --- /dev/null +++ b/docs/source/ja/attention.md @@ -0,0 +1,52 @@ + + +# Attention mechanism + +ほとんどのTransformerモデルは、アテンション行列が正方形であるという意味で完全なアテンションを使用します。 +これは、長いテキストを扱う場合に計算のボトルネックとなることがあります。LongformerやReformerは、より効率的でトレーニングを高速化するためにアテンション行列のスパースバージョンを使用しようとするモデルです。 + +## LSH attention + +[Reformer](model_doc/reformer)はLSH(局所的に散在ハッシュ)アテンションを使用します。 +ソフトマックス(QK^t)では、行列QK^tの中で(ソフトマックス次元で)最も大きな要素のみが有用な寄与を提供します。 +したがって、各クエリqについて、クエリqに近いキーkのみを考慮できます。 +qとkが近いかどうかを決定するために、ハッシュ関数が使用されます。 +アテンションマスクは変更され、現在のトークンをマスク化します(最初の位置を除く)。 +なぜなら、それはクエリとキーが等しい(つまり非常に似ている)クエリとキーを提供するからです。 +ハッシュは多少ランダムかもしれないため、実際にはいくつかのハッシュ関数が使用され(n_roundsパラメータで決定されます)、それらが平均化されます。 + +## Local attention + +[Longformer](model_doc/longformer)はローカルアテンションを使用します。 +しばしば、ローカルコンテキスト(例:左右の2つのトークンは何ですか?)は、特定のトークンに対して行動を起こすのに十分です。 +また、小さなウィンドウを持つアテンションレイヤーを積み重ねることで、最後のレイヤーはウィンドウ内のトークンだけでなく、ウィンドウ内のトークンを超えて受容野を持つようになり、文全体の表現を構築できます。 + +一部の事前選択された入力トークンにはグローバルアテンションも与えられます。 +これらの少数のトークンに対して、アテンション行列はすべてのトークンにアクセスでき、このプロセスは対称的です。 +他のすべてのトークンは、これらの特定のトークンにアクセスできます(ローカルウィンドウ内のトークンに加えて)。 +これは、論文の図2dに示されており、以下はサンプルのアテンションマスクです: + +
+ +
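+このようなローカルウィンドウとグローバルトークンを組み合わせたマスクの考え方は、次のような小さなスケッチで概念的に確認できます。これは Longformer の実際の実装ではなく、あくまで説明用の仮のコードです:
+
+```python
+import torch
+
+
+def local_global_attention_mask(seq_len, window, global_positions):
+    """ローカルウィンドウ + グローバルトークンのアテンションマスクを作る説明用のスケッチ。"""
+    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
+    for i in range(seq_len):
+        # 各トークンは自分の左右 window 個のトークンのみを参照する
+        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
+        mask[i, lo:hi] = True
+    for g in global_positions:
+        # グローバルトークンは全トークンを参照でき、すべてのトークンからも参照される(対称)
+        mask[g, :] = True
+        mask[:, g] = True
+    return mask
+
+
+print(local_global_attention_mask(seq_len=8, window=2, global_positions=[0]).int())
+```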
+ + +## Other tricks + +### Axial positional encodings + +[Reformer](model_doc/reformer)は軸方向の位置エンコーディングを使用しています。伝統的なトランスフォーマーモデルでは、位置エンコーディングEはサイズが \\(l\\) × \\(d\\) の行列で、\\(l\\) はシーケンスの長さ、\\(d\\) は隠れ状態の次元です。非常に長いテキストを扱う場合、この行列は非常に大きく、GPU上で大量のスペースを占有します。これを緩和するために、軸方向の位置エンコーディングは、この大きな行列Eを2つの小さな行列E1とE2に分解します。それぞれの行列はサイズ \\(l_{1} \times d_{1}\\) および \\(l_{2} \times d_{2}\\) を持ち、 \\(l_{1} \times l_{2} = l\\) および \\(d_{1} + d_{2} = d\\) という条件を満たします(長さの積を考えると、これがはるかに小さくなります)。行列E内の時刻 \\(j\\) の埋め込みは、E1内の時刻 \\(j \% l1\\) の埋め込みとE2内の時刻 \\(j // l1\\) の埋め込みを連結することによって得られます。 + diff --git a/docs/source/ja/autoclass_tutorial.md b/docs/source/ja/autoclass_tutorial.md new file mode 100644 index 00000000000000..f8fbeaa221f6aa --- /dev/null +++ b/docs/source/ja/autoclass_tutorial.md @@ -0,0 +1,161 @@ + + +# AutoClassを使用して事前学習済みインスタンスをロードする + +さまざまなTransformerアーキテクチャが存在するため、自分のタスクに合ったモデルを作成するのは難しいことがあります。 +🤗 Transformersのコア哲学の一環として、ライブラリを使用しやすく、シンプルで柔軟にするために、 +`AutoClass`は与えられたチェックポイントから正しいアーキテクチャを自動的に推論してロードします。 +`from_pretrained()`メソッドを使用すると、事前学習済みモデルを素早くロードできるため、モデルをゼロからトレーニングするために時間とリソースを費やす必要がありません。 +この種のチェックポイントに依存しないコードを生成することは、 +コードが1つのチェックポイントで動作すれば、アーキテクチャが異なっていても、同じタスクに向けてトレーニングされた場合は別のチェックポイントでも動作することを意味します。 + + + +アーキテクチャはモデルの骨格を指し、チェックポイントは特定のアーキテクチャの重みです。 +たとえば、[BERT](https://huggingface.co/google-bert/bert-base-uncased)はアーキテクチャであり、`google-bert/bert-base-uncased`はチェックポイントです。 +モデルはアーキテクチャまたはチェックポイントのどちらを指す一般的な用語です。 + + + +このチュートリアルでは、以下を学習します: + +* 事前学習済みトークナイザをロードする。 +* 事前学習済み画像プロセッサをロードする。 +* 事前学習済み特徴量抽出器をロードする。 +* 事前学習済みプロセッサをロードする。 +* 事前学習済みモデルをロードする。 + +## AutoTokenizer + +ほとんどのNLPタスクはトークナイザで始まります。トークナイザは入力をモデルで処理できる形式に変換します。 + +[`AutoTokenizer.from_pretrained`]を使用してトークナイザをロードします: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +``` + + +次に、以下のように入力をトークナイズします: + +```py +>>> sequence = "In a hole in the ground there lived a hobbit." +>>> print(tokenizer(sequence)) +{'input_ids': [101, 1999, 1037, 4920, 1999, 1996, 2598, 2045, 2973, 1037, 7570, 10322, 4183, 1012, 102], + 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]} +``` + +## AutoImageProcessor + +ビジョンタスクの場合、画像プロセッサが画像を正しい入力形式に変換します。 + +```py +>>> from transformers import AutoImageProcessor + +>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224") +``` + +## AutoFeatureExtractor + +オーディオタスクの場合、特徴量抽出器がオーディオ信号を正しい入力形式に変換します。 + +[`AutoFeatureExtractor.from_pretrained`]を使用して特徴量抽出器をロードします. + +```py +>>> from transformers import AutoFeatureExtractor + +>>> feature_extractor = AutoFeatureExtractor.from_pretrained( +... "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition" +... 
) +``` + +## AutoProcessor + +マルチモーダルタスクの場合、2つの前処理ツールを組み合わせるプロセッサが必要です。たとえば、 +[LayoutLMV2](model_doc/layoutlmv2)モデルは画像を処理するための画像プロセッサとテキストを処理するためのトークナイザが必要です。 +プロセッサはこれらの両方を組み合わせます。 + +[`AutoProcessor.from_pretrained`]を使用してプロセッサをロードします: + +```py +>>> from transformers import AutoProcessor + +>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv2-base-uncased") +``` + +## AutoModel + + + +最後に、`AutoModelFor`クラスは特定のタスクに対して事前学習済みモデルをロードできます(使用可能なタスクの完全な一覧については[こちら](model_doc/auto)を参照)。 +たとえば、[`AutoModelForSequenceClassification.from_pretrained`]を使用してシーケンス分類用のモデルをロードできます: + +```py +>>> from transformers import AutoModelForSequenceClassification + +>>> model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") +``` + +同じチェックポイントを再利用して異なるタスクのアーキテクチャをロードできます: + +```py +>>> from transformers import AutoModelForTokenClassification + +>>> model = AutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased") +``` + + + +PyTorchモデルの場合、 `from_pretrained()`メソッドは内部で`torch.load()`を使用し、内部的には`pickle`を使用しており、セキュリティの問題が知られています。 +一般的には、信頼性のないソースから取得した可能性があるモデルや改ざんされた可能性のあるモデルをロードしないでください。 +このセキュリティリスクは、`Hugging Face Hub`でホストされている公開モデルに対して部分的に緩和されており、各コミットでマルウェアのスキャンが行われています。 +GPGを使用した署名済みコミットの検証などのベストプラクティスについては、Hubのドキュメンテーションを参照してください。 + +TensorFlowおよびFlaxのチェックポイントには影響がなく、`from_pretrained`メソッドの`from_tf`および`from_flax`引数を使用してPyTorchアーキテクチャ内でロードできます。 + + + +一般的に、事前学習済みモデルのインスタンスをロードするために`AutoTokenizer`クラスと`AutoModelFor`クラスの使用をお勧めします。 +これにより、常に正しいアーキテクチャをロードできます。 +次の[tutorial](preprocessing)では、新しくロードしたトークナイザ、画像プロセッサ、特徴量抽出器、およびプロセッサを使用して、ファインチューニング用にデータセットを前処理する方法を学びます。 + + +最後に、`TFAutoModelFor`クラスは特定のタスクに対して事前学習済みモデルをロードできます(使用可能なタスクの完全な一覧についてはこちらを参照)。 +たとえば、[`TFAutoModelForSequenceClassification.from_pretrained`]を使用してシーケンス分類用のモデルをロードできます: + +```py +>>> from transformers import TFAutoModelForSequenceClassification + +>>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") +``` + +同じチェックポイントを再利用して異なるタスクのアーキテクチャをロードできます: + +```py +>>> from transformers import TFAutoModelForTokenClassification + +>>> model = TFAutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased") +``` + +一般的には、事前学習済みモデルのインスタンスをロードするために`AutoTokenizer`クラスと`TFAutoModelFor`クラスの使用をお勧めします。 +これにより、常に正しいアーキテクチャをロードできます。 +次の[tutorial](preproccesing)では、新しくロードしたトークナイザ、画像プロセッサ、特徴量抽出器、およびプロセッサを使用して、ファインチューニング用にデータセットを前処理する方法を学びます。 + + \ No newline at end of file diff --git a/docs/source/ja/benchmarks.md b/docs/source/ja/benchmarks.md new file mode 100644 index 00000000000000..7312aae8ce5b7c --- /dev/null +++ b/docs/source/ja/benchmarks.md @@ -0,0 +1,381 @@ + + +# Benchmarks + + + +Hugging Faceのベンチマークツールは非推奨であり、Transformerモデルの速度とメモリの複雑さを測定するために外部のベンチマークライブラリを使用することをお勧めします。 + + + +[[open-in-colab]] + +🤗 Transformersモデルをベンチマークし、ベストプラクティス、すでに利用可能なベンチマークについて見てみましょう。 + +🤗 Transformersモデルをベンチマークする方法について詳しく説明したノートブックは[こちら](https://github.com/huggingface/notebooks/tree/main/examples/benchmark.ipynb)で利用できます。 + +## How to benchmark 🤗 Transformers models + +[`PyTorchBenchmark`]クラスと[`TensorFlowBenchmark`]クラスを使用すると、🤗 Transformersモデルを柔軟にベンチマークできます。 +ベンチマーククラスを使用すると、_ピークメモリ使用量_ および _必要な時間_ を _推論_ および _トレーニング_ の両方について測定できます。 + + + +ここでの _推論_ は、単一のフォワードパスによって定義され、 _トレーニング_ は単一のフォワードパスと +バックワードパスによって定義されます。 + + + +ベンチマーククラス[`PyTorchBenchmark`]と[`TensorFlowBenchmark`]は、それぞれのベンチマーククラスに対する適切な設定を含む [`PyTorchBenchmarkArguments`] および [`TensorFlowBenchmarkArguments`] タイプのオブジェクトを必要とします。 +[`PyTorchBenchmarkArguments`] および 
[`TensorFlowBenchmarkArguments`] はデータクラスであり、それぞれのベンチマーククラスに対するすべての関連する設定を含んでいます。 +次の例では、タイプ _bert-base-cased_ のBERTモデルをベンチマークする方法が示されています。 + + + +```py +>>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments + +>>> args = PyTorchBenchmarkArguments(models=["google-bert/bert-base-uncased"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512]) +>>> benchmark = PyTorchBenchmark(args) +``` + + +```py +>>> from transformers import TensorFlowBenchmark, TensorFlowBenchmarkArguments + +>>> args = TensorFlowBenchmarkArguments( +... models=["google-bert/bert-base-uncased"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512] +... ) +>>> benchmark = TensorFlowBenchmark(args) +``` + + + + +ここでは、ベンチマーク引数のデータクラスに対して、`models`、`batch_sizes` +および`sequence_lengths`の3つの引数が指定されています。引数`models`は必須で、 +[モデルハブ](https://huggingface.co/models)からのモデル識別子の`リスト`を期待し +ます。`batch_sizes`と`sequence_lengths`の2つの`リスト`引数は +モデルのベンチマーク対象となる`input_ids`のサイズを定義します。 +ベンチマーク引数データクラスを介して設定できる他の多くのパラメータがあります。これらの詳細については、直接ファイル +`src/transformers/benchmark/benchmark_args_utils.py`、 +`src/transformers/benchmark/benchmark_args.py`(PyTorch用)、および`src/transformers/benchmark/benchmark_args_tf.py`(Tensorflow用) +を参照するか、次のシェルコマンドをルートから実行すると、PyTorchとTensorflowのそれぞれに対して設定可能なすべてのパラメータの記述的なリストが表示されます。 + + + +```bash +python examples/pytorch/benchmarking/run_benchmark.py --help +``` + +インスタンス化されたベンチマークオブジェクトは、単に `benchmark.run()` を呼び出すことで実行できます。 + + +```py +>>> results = benchmark.run() +>>> print(results) +==================== INFERENCE - SPEED - RESULT ==================== +-------------------------------------------------------------------------------- +Model Name Batch Size Seq Length Time in s +-------------------------------------------------------------------------------- +google-bert/bert-base-uncased 8 8 0.006 +google-bert/bert-base-uncased 8 32 0.006 +google-bert/bert-base-uncased 8 128 0.018 +google-bert/bert-base-uncased 8 512 0.088 +-------------------------------------------------------------------------------- + +==================== INFERENCE - MEMORY - RESULT ==================== +-------------------------------------------------------------------------------- +Model Name Batch Size Seq Length Memory in MB +-------------------------------------------------------------------------------- +google-bert/bert-base-uncased 8 8 1227 +google-bert/bert-base-uncased 8 32 1281 +google-bert/bert-base-uncased 8 128 1307 +google-bert/bert-base-uncased 8 512 1539 +-------------------------------------------------------------------------------- + +==================== ENVIRONMENT INFORMATION ==================== + +- transformers_version: 2.11.0 +- framework: PyTorch +- use_torchscript: False +- framework_version: 1.4.0 +- python_version: 3.6.10 +- system: Linux +- cpu: x86_64 +- architecture: 64bit +- date: 2020-06-29 +- time: 08:58:43.371351 +- fp16: False +- use_multiprocessing: True +- only_pretrain_model: False +- cpu_ram_mb: 32088 +- use_gpu: True +- num_gpus: 1 +- gpu: TITAN RTX +- gpu_ram_mb: 24217 +- gpu_power_watts: 280.0 +- gpu_performance_state: 2 +- use_tpu: False +``` + + +```bash +python examples/tensorflow/benchmarking/run_benchmark_tf.py --help +``` + +インスタンス化されたベンチマークオブジェクトは、単に `benchmark.run()` を呼び出すことで実行できます。 + + + +```py +>>> results = benchmark.run() +>>> print(results) +>>> results = benchmark.run() +>>> print(results) +==================== INFERENCE - SPEED - RESULT ==================== +-------------------------------------------------------------------------------- +Model Name Batch Size Seq Length Time in s 
+-------------------------------------------------------------------------------- +google-bert/bert-base-uncased 8 8 0.005 +google-bert/bert-base-uncased 8 32 0.008 +google-bert/bert-base-uncased 8 128 0.022 +google-bert/bert-base-uncased 8 512 0.105 +-------------------------------------------------------------------------------- + +==================== INFERENCE - MEMORY - RESULT ==================== +-------------------------------------------------------------------------------- +Model Name Batch Size Seq Length Memory in MB +-------------------------------------------------------------------------------- +google-bert/bert-base-uncased 8 8 1330 +google-bert/bert-base-uncased 8 32 1330 +google-bert/bert-base-uncased 8 128 1330 +google-bert/bert-base-uncased 8 512 1770 +-------------------------------------------------------------------------------- + +==================== ENVIRONMENT INFORMATION ==================== + +- transformers_version: 2.11.0 +- framework: Tensorflow +- use_xla: False +- framework_version: 2.2.0 +- python_version: 3.6.10 +- system: Linux +- cpu: x86_64 +- architecture: 64bit +- date: 2020-06-29 +- time: 09:26:35.617317 +- fp16: False +- use_multiprocessing: True +- only_pretrain_model: False +- cpu_ram_mb: 32088 +- use_gpu: True +- num_gpus: 1 +- gpu: TITAN RTX +- gpu_ram_mb: 24217 +- gpu_power_watts: 280.0 +- gpu_performance_state: 2 +- use_tpu: False +``` + + + +デフォルトでは、_推論時間_ と _必要なメモリ_ がベンチマークされます。 +上記の例の出力では、最初の2つのセクションが _推論時間_ と _推論メモリ_ +に対応する結果を示しています。さらに、計算環境に関するすべての関連情報、 +例えば GPU タイプ、システム、ライブラリのバージョンなどが、_ENVIRONMENT INFORMATION_ の下に表示されます。この情報は、[`PyTorchBenchmarkArguments`] +および [`TensorFlowBenchmarkArguments`] に引数 `save_to_csv=True` +を追加することで、オプションで _.csv_ ファイルに保存することができます。この場合、各セクションは別々の _.csv_ ファイルに保存されます。_.csv_ +ファイルへのパスは、データクラスの引数を使用してオプションで定義できます。 + +モデル識別子、例えば `google-bert/bert-base-uncased` を使用して事前学習済みモデルをベンチマークする代わりに、利用可能な任意のモデルクラスの任意の設定をベンチマークすることもできます。この場合、ベンチマーク引数と共に設定の `list` を挿入する必要があります。 + + + + +```py +>>> from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments, BertConfig + +>>> args = PyTorchBenchmarkArguments( +... models=["bert-base", "bert-384-hid", "bert-6-lay"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512] +... 
) +>>> config_base = BertConfig() +>>> config_384_hid = BertConfig(hidden_size=384) +>>> config_6_lay = BertConfig(num_hidden_layers=6) + +>>> benchmark = PyTorchBenchmark(args, configs=[config_base, config_384_hid, config_6_lay]) +>>> benchmark.run() +==================== INFERENCE - SPEED - RESULT ==================== +-------------------------------------------------------------------------------- +Model Name Batch Size Seq Length Time in s +-------------------------------------------------------------------------------- +bert-base 8 128 0.006 +bert-base 8 512 0.006 +bert-base 8 128 0.018 +bert-base 8 512 0.088 +bert-384-hid 8 8 0.006 +bert-384-hid 8 32 0.006 +bert-384-hid 8 128 0.011 +bert-384-hid 8 512 0.054 +bert-6-lay 8 8 0.003 +bert-6-lay 8 32 0.004 +bert-6-lay 8 128 0.009 +bert-6-lay 8 512 0.044 +-------------------------------------------------------------------------------- + +==================== INFERENCE - MEMORY - RESULT ==================== +-------------------------------------------------------------------------------- +Model Name Batch Size Seq Length Memory in MB +-------------------------------------------------------------------------------- +bert-base 8 8 1277 +bert-base 8 32 1281 +bert-base 8 128 1307 +bert-base 8 512 1539 +bert-384-hid 8 8 1005 +bert-384-hid 8 32 1027 +bert-384-hid 8 128 1035 +bert-384-hid 8 512 1255 +bert-6-lay 8 8 1097 +bert-6-lay 8 32 1101 +bert-6-lay 8 128 1127 +bert-6-lay 8 512 1359 +-------------------------------------------------------------------------------- + +==================== ENVIRONMENT INFORMATION ==================== + +- transformers_version: 2.11.0 +- framework: PyTorch +- use_torchscript: False +- framework_version: 1.4.0 +- python_version: 3.6.10 +- system: Linux +- cpu: x86_64 +- architecture: 64bit +- date: 2020-06-29 +- time: 09:35:25.143267 +- fp16: False +- use_multiprocessing: True +- only_pretrain_model: False +- cpu_ram_mb: 32088 +- use_gpu: True +- num_gpus: 1 +- gpu: TITAN RTX +- gpu_ram_mb: 24217 +- gpu_power_watts: 280.0 +- gpu_performance_state: 2 +- use_tpu: False +``` + + +```py +>>> from transformers import TensorFlowBenchmark, TensorFlowBenchmarkArguments, BertConfig + +>>> args = TensorFlowBenchmarkArguments( +... models=["bert-base", "bert-384-hid", "bert-6-lay"], batch_sizes=[8], sequence_lengths=[8, 32, 128, 512] +... 
) +>>> config_base = BertConfig() +>>> config_384_hid = BertConfig(hidden_size=384) +>>> config_6_lay = BertConfig(num_hidden_layers=6) + +>>> benchmark = TensorFlowBenchmark(args, configs=[config_base, config_384_hid, config_6_lay]) +>>> benchmark.run() +==================== INFERENCE - SPEED - RESULT ==================== +-------------------------------------------------------------------------------- +Model Name Batch Size Seq Length Time in s +-------------------------------------------------------------------------------- +bert-base 8 8 0.005 +bert-base 8 32 0.008 +bert-base 8 128 0.022 +bert-base 8 512 0.106 +bert-384-hid 8 8 0.005 +bert-384-hid 8 32 0.007 +bert-384-hid 8 128 0.018 +bert-384-hid 8 512 0.064 +bert-6-lay 8 8 0.002 +bert-6-lay 8 32 0.003 +bert-6-lay 8 128 0.0011 +bert-6-lay 8 512 0.074 +-------------------------------------------------------------------------------- + +==================== INFERENCE - MEMORY - RESULT ==================== +-------------------------------------------------------------------------------- +Model Name Batch Size Seq Length Memory in MB +-------------------------------------------------------------------------------- +bert-base 8 8 1330 +bert-base 8 32 1330 +bert-base 8 128 1330 +bert-base 8 512 1770 +bert-384-hid 8 8 1330 +bert-384-hid 8 32 1330 +bert-384-hid 8 128 1330 +bert-384-hid 8 512 1540 +bert-6-lay 8 8 1330 +bert-6-lay 8 32 1330 +bert-6-lay 8 128 1330 +bert-6-lay 8 512 1540 +-------------------------------------------------------------------------------- + +==================== ENVIRONMENT INFORMATION ==================== + +- transformers_version: 2.11.0 +- framework: Tensorflow +- use_xla: False +- framework_version: 2.2.0 +- python_version: 3.6.10 +- system: Linux +- cpu: x86_64 +- architecture: 64bit +- date: 2020-06-29 +- time: 09:38:15.487125 +- fp16: False +- use_multiprocessing: True +- only_pretrain_model: False +- cpu_ram_mb: 32088 +- use_gpu: True +- num_gpus: 1 +- gpu: TITAN RTX +- gpu_ram_mb: 24217 +- gpu_power_watts: 280.0 +- gpu_performance_state: 2 +- use_tpu: False +``` + + + +カスタマイズされたBertModelクラスの構成に対する推論時間と必要なメモリのベンチマーク + +この機能は、モデルをトレーニングする際にどの構成を選択すべきかを決定する際に特に役立つことがあります。 + +## Benchmark best practices + +このセクションでは、モデルをベンチマークする際に注意すべきいくつかのベストプラクティスをリストアップしています。 + +- 現在、単一デバイスのベンチマークしかサポートされていません。GPUでベンチマークを実行する場合、コードを実行するデバイスをユーザーが指定することを推奨します。 + これはシェルで`CUDA_VISIBLE_DEVICES`環境変数を設定することで行えます。例:`export CUDA_VISIBLE_DEVICES=0`を実行してからコードを実行します。 +- `no_multi_processing`オプションは、テストおよびデバッグ用にのみ`True`に設定すべきです。正確なメモリ計測を確保するために、各メモリベンチマークを別々のプロセスで実行することをお勧めします。これにより、`no_multi_processing`が`True`に設定されます。 +- モデルのベンチマーク結果を共有する際には、常に環境情報を記述するべきです。異なるGPUデバイス、ライブラリバージョンなどでベンチマーク結果が大きく異なる可能性があるため、ベンチマーク結果単体ではコミュニティにとってあまり有用ではありません。 + +## Sharing your benchmark + +以前、すべての利用可能なコアモデル(当時10モデル)に対して、多くの異なる設定で推論時間のベンチマークが行われました:PyTorchを使用し、TorchScriptの有無、TensorFlowを使用し、XLAの有無などです。これらのテストはすべてCPUで行われました(TensorFlow XLAを除く)。 + +このアプローチの詳細については、[次のブログポスト](https://medium.com/huggingface/benchmarking-transformers-pytorch-and-tensorflow-e2917fb891c2)に詳しく説明されており、結果は[こちら](https://docs.google.com/spreadsheets/d/1sryqufw2D0XlUH4sq3e9Wnxu5EAQkaohzrJbd5HdQ_w/edit?usp=sharing)で利用できます。 + +新しいベンチマークツールを使用すると、コミュニティとベンチマーク結果を共有することがこれまで以上に簡単になります。 + +- [PyTorchベンチマーク結果](https://github.com/huggingface/transformers/tree/main/examples/pytorch/benchmarking/README.md)。 +- [TensorFlowベンチマーク結果](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/benchmarking/README.md)。 diff --git a/docs/source/ja/bertology.md b/docs/source/ja/bertology.md 
new file mode 100644 index 00000000000000..167ed007bbe437 --- /dev/null +++ b/docs/source/ja/bertology.md @@ -0,0 +1,34 @@ + + + +# BERTology + +大規模なトランスフォーマー、例えばBERTの内部動作を調査する研究領域が急成長しています(これを「BERTology」とも呼びます)。この分野の良い例は以下です: + +- BERT Rediscovers the Classical NLP Pipeline by Ian Tenney, Dipanjan Das, Ellie Pavlick: + [論文リンク](https://arxiv.org/abs/1905.05950) +- Are Sixteen Heads Really Better than One? by Paul Michel, Omer Levy, Graham Neubig: [論文リンク](https://arxiv.org/abs/1905.10650) +- What Does BERT Look At? An Analysis of BERT's Attention by Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning: [論文リンク](https://arxiv.org/abs/1906.04341) +- CAT-probing: A Metric-based Approach to Interpret How Pre-trained Models for Programming Language Attend Code Structure: [論文リンク](https://arxiv.org/abs/2210.04633) + +この新しい分野の発展を支援するために、BERT/GPT/GPT-2モデルにいくつかの追加機能を組み込み、人々が内部表現にアクセスできるようにしました。これらの機能は、主にPaul Michel氏の優れた研究([論文リンク](https://arxiv.org/abs/1905.10650))に基づいています。具体的には、以下の機能が含まれています: + +- BERT/GPT/GPT-2のすべての隠れ状態にアクセスすることができます。 +- BERT/GPT/GPT-2の各ヘッドの注意重みにアクセスできます。 +- ヘッドの出力値と勾配を取得し、ヘッドの重要性スコアを計算し、[論文リンク](https://arxiv.org/abs/1905.10650)で説明されているようにヘッドを削減できます。 + +これらの機能を理解し、使用するのを支援するために、特定のサンプルスクリプト「[bertology.py](https://github.com/huggingface/transformers/tree/main/examples/research_projects/bertology/run_bertology.py)」を追加しました。このスクリプトは、GLUEで事前トレーニングされたモデルから情報を抽出し、ヘッドを削減する役割を果たします。 diff --git a/docs/source/ja/big_models.md b/docs/source/ja/big_models.md new file mode 100644 index 00000000000000..78852dc4374cce --- /dev/null +++ b/docs/source/ja/big_models.md @@ -0,0 +1,130 @@ + + +# Instantiating a big model + +非常に大規模な事前学習済みモデルを使用する場合、RAMの使用量を最小限に抑えることは課題の1つです。通常のPyTorchのワークフローは次のとおりです: + +1. ランダムな重みを持つモデルを作成します。 +2. 事前学習済みの重みをロードします。 +3. これらの事前学習済みの重みをランダムなモデルに配置します。 + +ステップ1と2の両方がメモリにモデルの完全なバージョンを必要とし、ほとんどの場合は問題ありませんが、モデルのサイズが数ギガバイトになると、これらの2つのコピーをRAMから排除することができなくなる可能性があります。さらに悪いことに、分散トレーニングを実行するために`torch.distributed`を使用している場合、各プロセスは事前学習済みモデルをロードし、これらの2つのコピーをRAMに保存します。 + + + +ランダムに作成されたモデルは、メモリ内に「空の」テンソルで初期化されます。これらのランダムな値は、メモリの特定のチャンクにあったものを使用します(したがって、ランダムな値はその時点でのメモリチャンク内の値です)。モデル/パラメータの種類に適した分布(たとえば、正規分布)に従うランダムな初期化は、ステップ3で初期化されていない重みに対して、できるだけ高速に実行されます! + + + +このガイドでは、Transformersがこの問題に対処するために提供するソリューションを探ります。なお、これは現在も開発が進行中の分野であり、将来、ここで説明されているAPIがわずかに変更される可能性があることに注意してください。 + +## Sharded checkpoints + +バージョン4.18.0から、10GBを超えるサイズのモデルチェックポイントは自動的に複数の小さな部分に分割されます。`model.save_pretrained(save_dir)`を実行する際に1つの単一のチェックポイントを持つ代わりに、いくつかの部分的なチェックポイント(それぞれのサイズが<10GB)と、パラメータ名をそれらが格納されているファイルにマップするインデックスが生成されます。 + +`max_shard_size`パラメータでシャーディング前の最大サイズを制御できるため、例として通常サイズのモデルと小さなシャードサイズを使用します。従来のBERTモデルを使用してみましょう。 + + +```py +from transformers import AutoModel + +model = AutoModel.from_pretrained("google-bert/bert-base-cased") +``` + +もし[`~PreTrainedModel.save_pretrained`]を使用して保存する場合、新しいフォルダが2つのファイルを含む形で作成されます: モデルの設定情報とその重み情報です。 + +```py +>>> import os +>>> import tempfile + +>>> with tempfile.TemporaryDirectory() as tmp_dir: +... model.save_pretrained(tmp_dir) +... print(sorted(os.listdir(tmp_dir))) +['config.json', 'pytorch_model.bin'] +``` + +最大シャードサイズを200MBに設定します: + +```py +>>> with tempfile.TemporaryDirectory() as tmp_dir: +... model.save_pretrained(tmp_dir, max_shard_size="200MB") +... 
print(sorted(os.listdir(tmp_dir))) +['config.json', 'pytorch_model-00001-of-00003.bin', 'pytorch_model-00002-of-00003.bin', 'pytorch_model-00003-of-00003.bin', 'pytorch_model.bin.index.json'] +``` + +モデルの設定の上に、3つの異なる重みファイルと、`index.json`ファイルが見られます。これは私たちのインデックスです。 +このようなチェックポイントは、[`~PreTrainedModel.from_pretrained`]メソッドを使用して完全に再ロードできます: + +```py +>>> with tempfile.TemporaryDirectory() as tmp_dir: +... model.save_pretrained(tmp_dir, max_shard_size="200MB") +... new_model = AutoModel.from_pretrained(tmp_dir) +``` + +主要な利点は、大規模なモデルの場合、上記のワークフローのステップ2において、各チェックポイントのシャードが前のシャードの後にロードされ、RAMのメモリ使用量をモデルのサイズと最大のシャードのサイズを合わせたものに制限できることです。 + +内部では、インデックスファイルが使用され、どのキーがチェックポイントに存在し、対応する重みがどこに格納されているかを判断します。このインデックスは通常のJSONファイルのように読み込むことができ、辞書として取得できます。 + + +```py +>>> import json + +>>> with tempfile.TemporaryDirectory() as tmp_dir: +... model.save_pretrained(tmp_dir, max_shard_size="200MB") +... with open(os.path.join(tmp_dir, "pytorch_model.bin.index.json"), "r") as f: +... index = json.load(f) + +>>> print(index.keys()) +dict_keys(['metadata', 'weight_map']) +``` + +メタデータには現時点ではモデルの総サイズのみが含まれています。 +将来的には他の情報を追加する予定です: + +```py +>>> index["metadata"] +{'total_size': 433245184} +``` + +重みマップはこのインデックスの主要な部分であり、各パラメータ名(通常はPyTorchモデルの`state_dict`で見つかるもの)をその格納されているファイルにマップします: + +```py +>>> index["weight_map"] +{'embeddings.LayerNorm.bias': 'pytorch_model-00001-of-00003.bin', + 'embeddings.LayerNorm.weight': 'pytorch_model-00001-of-00003.bin', + ... +``` + +直接モデル内で[`~PreTrainedModel.from_pretrained`]を使用せずに、 +シャーディングされたチェックポイントをロードしたい場合(フルチェックポイントの場合に`model.load_state_dict()`を使用するように行う方法)、[`~modeling_utils.load_sharded_checkpoint`]を使用する必要があります: + + +```py +>>> from transformers.modeling_utils import load_sharded_checkpoint + +>>> with tempfile.TemporaryDirectory() as tmp_dir: +... model.save_pretrained(tmp_dir, max_shard_size="200MB") +... load_sharded_checkpoint(model, tmp_dir) +``` + + +## Low memory loading + +シャードされたチェックポイントは、上記のワークフローのステップ2におけるメモリ使用量を削減しますが、 +低メモリの環境でそのモデルを使用するために、Accelerateライブラリに基づいた当社のツールを活用することをお勧めします。 + +詳細については、以下のガイドをご覧ください:[Accelerateを使用した大規模モデルの読み込み](./main_classes/model#large-model-loading) diff --git a/docs/source/ja/chat_templating.md b/docs/source/ja/chat_templating.md new file mode 100644 index 00000000000000..78d900b5bea8b2 --- /dev/null +++ b/docs/source/ja/chat_templating.md @@ -0,0 +1,249 @@ + + +# Templates for Chat Models + +## Introduction + +LLM(Language Model)のますます一般的な使用事例の1つは「チャット」です。 +チャットのコンテキストでは、通常の言語モデルのように単一のテキストストリングを継続するのではなく、モデルは1つ以上の「メッセージ」からなる会話を継続します。 +各メッセージには「ロール」とメッセージテキストが含まれます。 + +最も一般的に、これらのロールはユーザーからのメッセージには「ユーザー」、モデルからのメッセージには「アシスタント」が割り当てられます。 +一部のモデルは「システム」ロールもサポートしています。 +システムメッセージは通常会話の開始時に送信され、モデルの動作方法に関する指示が含まれます。 + +すべての言語モデル、チャット用に微調整されたモデルを含むすべてのモデルは、トークンのリニアシーケンスで動作し、ロールに特有の特別な処理を持ちません。 +つまり、ロール情報は通常、メッセージ間に制御トークンを追加して注入され、メッセージの境界と関連するロールを示すことで提供されます。 + +残念ながら、トークンの使用方法については(まだ!)標準が存在せず、異なるモデルはチャット用のフォーマットや制御トークンが大きく異なる形式でトレーニングされています。 +これはユーザーにとって実際の問題になる可能性があります。正しいフォーマットを使用しないと、モデルは入力に混乱し、パフォーマンスが本来よりも遥かに低下します。 +これが「チャットテンプレート」が解決しようとする問題です。 + +チャット会話は通常、各辞書が「ロール」と「コンテンツ」のキーを含み、単一のチャットメッセージを表すリストとして表現されます。 +チャットテンプレートは、指定されたモデルの会話を単一のトークン化可能なシーケンスにどのようにフォーマットするかを指定するJinjaテンプレートを含む文字列です。 +トークナイザとこの情報を保存することにより、モデルが期待する形式の入力データを取得できるようになります。 + +さっそく、`BlenderBot` モデルを使用した例を示して具体的にしましょう。`BlenderBot` のデフォルトテンプレートは非常にシンプルで、ほとんどが対話のラウンド間に空白を追加するだけです。 + + +```python +>>> from transformers import AutoTokenizer +>>> tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill") + +>>> chat = [ +... 
{"role": "user", "content": "Hello, how are you?"}, +... {"role": "assistant", "content": "I'm doing great. How can I help you today?"}, +... {"role": "user", "content": "I'd like to show off how chat templating works!"}, +... ] + +>>> tokenizer.apply_chat_template(chat, tokenize=False) +" Hello, how are you? I'm doing great. How can I help you today? I'd like to show off how chat templating works!" +``` + +指定された通り、チャット全体が単一の文字列にまとめられています。デフォルトの設定である「tokenize=True」を使用すると、 +その文字列もトークン化されます。しかし、より複雑なテンプレートが実際にどのように機能するかを確認するために、 +「meta-llama/Llama-2-7b-chat-hf」モデルを使用してみましょう。ただし、このモデルはゲート付きアクセスを持っており、 +このコードを実行する場合は[リポジトリでアクセスをリクエスト](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)する必要があります。 + +```python +>> from transformers import AutoTokenizer +>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf") + +>> chat = [ +... {"role": "user", "content": "Hello, how are you?"}, +... {"role": "assistant", "content": "I'm doing great. How can I help you today?"}, +... {"role": "user", "content": "I'd like to show off how chat templating works!"}, +... ] + +>> tokenizer.use_default_system_prompt = False +>> tokenizer.apply_chat_template(chat, tokenize=False) +"[INST] Hello, how are you? [/INST] I'm doing great. How can I help you today? [INST] I'd like to show off how chat templating works! [/INST]" +``` + +今回、トークナイザは制御トークン [INST] と [/INST] を追加しました。これらはユーザーメッセージの開始と終了を示すためのものです(ただし、アシスタントメッセージには適用されません!) + +## How do chat templates work? + +モデルのチャットテンプレートは、`tokenizer.chat_template`属性に格納されています。チャットテンプレートが設定されていない場合、そのモデルクラスのデフォルトテンプレートが代わりに使用されます。`BlenderBot`のテンプレートを見てみましょう: + +```python + +>>> from transformers import AutoTokenizer +>>> tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill") + +>>> tokenizer.default_chat_template +"{% for message in messages %}{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}{{ message['content'] }}{% if not loop.last %}{{ ' ' }}{% endif %}{% endfor %}{{ eos_token }}" +``` + + +これは少し抑圧的ですね。可読性を高めるために、新しい行とインデントを追加しましょう。 +各ブロックの直前の空白と、ブロックの直後の最初の改行は、デフォルトでJinjaの `trim_blocks` および `lstrip_blocks` フラグを使用して削除します。 +これにより、インデントと改行を含むテンプレートを書いても正常に機能することができます。 + +``` +{% for message in messages %} + {% if message['role'] == 'user' %} + {{ ' ' }} + {% endif %} + {{ message['content'] }} + {% if not loop.last %} + {{ ' ' }} + {% endif %} +{% endfor %} +{{ eos_token }} +``` + +これが初めて見る方へ、これは[Jinjaテンプレート](https://jinja.palletsprojects.com/en/3.1.x/templates/)です。 +Jinjaはテキストを生成するためのシンプルなコードを記述できるテンプレート言語です。多くの点で、コードと +構文はPythonに似ています。純粋なPythonでは、このテンプレートは次のようになるでしょう: + +```python +for idx, message in enumerate(messages): + if message['role'] == 'user': + print(' ') + print(message['content']) + if not idx == len(messages) - 1: # Check for the last message in the conversation + print(' ') +print(eos_token) +``` + +実際に、このテンプレートは次の3つのことを行います: +1. 各メッセージに対して、メッセージがユーザーメッセージである場合、それの前に空白を追加し、それ以外の場合は何も表示しません。 +2. メッセージの内容を追加します。 +3. メッセージが最後のメッセージでない場合、その後に2つのスペースを追加します。最後のメッセージの後にはEOSトークンを表示します。 + +これは非常にシンプルなテンプレートです。制御トークンを追加しないし、モデルに対する指示を伝える一般的な方法である「システム」メッセージをサポートしていません。 +ただし、Jinjaはこれらのことを行うための多くの柔軟性を提供しています! +LLaMAがフォーマットする方法に類似した入力をフォーマットするためのJinjaテンプレートを見てみましょう +(実際のLLaMAテンプレートはデフォルトのシステムメッセージの処理や、一般的なシステムメッセージの処理が若干異なるため、 +実際のコードではこのテンプレートを使用しないでください!) 
+ + +``` +{% for message in messages %} + {% if message['role'] == 'user' %} + {{ bos_token + '[INST] ' + message['content'] + ' [/INST]' }} + {% elif message['role'] == 'system' %} + {{ '<>\\n' + message['content'] + '\\n<>\\n\\n' }} + {% elif message['role'] == 'assistant' %} + {{ ' ' + message['content'] + ' ' + eos_token }} + {% endif %} +{% endfor %} +``` + +願わくば、少し見つめていただければ、このテンプレートが何を行っているかがわかるかもしれません。 +このテンプレートは、各メッセージの「役割」に基づいて特定のトークンを追加します。これらのトークンは、メッセージを送信した人を表すものです。 +ユーザー、アシスタント、およびシステムメッセージは、それらが含まれるトークンによってモデルによって明確に区別されます。 + + +## How do I create a chat template? + +簡単です。単純にJinjaテンプレートを書いて、`tokenizer.chat_template`を設定します。 +他のモデルから既存のテンプレートを始点にして、必要に応じて編集すると便利かもしれません! +例えば、上記のLLaMAテンプレートを取って、アシスタントメッセージに"[ASST]"と"[/ASST]"を追加できます。 + +``` +{% for message in messages %} + {% if message['role'] == 'user' %} + {{ bos_token + '[INST] ' + message['content'].strip() + ' [/INST]' }} + {% elif message['role'] == 'system' %} + {{ '<>\\n' + message['content'].strip() + '\\n<>\\n\\n' }} + {% elif message['role'] == 'assistant' %} + {{ '[ASST] ' + message['content'] + ' [/ASST]' + eos_token }} + {% endif %} +{% endfor %} +``` + +次に、単に`tokenizer.chat_template`属性を設定してください。 +次回、[`~PreTrainedTokenizer.apply_chat_template`]を使用する際に、新しいテンプレートが使用されます! +この属性は`tokenizer_config.json`ファイルに保存されるため、[`~utils.PushToHubMixin.push_to_hub`]を使用して +新しいテンプレートをHubにアップロードし、みんなが正しいテンプレートを使用していることを確認できます! + +```python +template = tokenizer.chat_template +template = template.replace("SYS", "SYSTEM") # Change the system token +tokenizer.chat_template = template # Set the new template +tokenizer.push_to_hub("model_name") # Upload your new template to the Hub! +``` + +[`~PreTrainedTokenizer.apply_chat_template`] メソッドは、あなたのチャットテンプレートを使用するために [`ConversationalPipeline`] クラスによって呼び出されます。 +したがって、正しいチャットテンプレートを設定すると、あなたのモデルは自動的に [`ConversationalPipeline`] と互換性があるようになります。 + + +## What are "default" templates? + +チャットテンプレートの導入前に、チャットの処理はモデルクラスレベルでハードコードされていました。 +後方互換性のために、このクラス固有の処理をデフォルトテンプレートとして保持し、クラスレベルで設定されています。 +モデルにチャットテンプレートが設定されていない場合、ただしモデルクラスのデフォルトテンプレートがある場合、 +`ConversationalPipeline`クラスや`apply_chat_template`などのメソッドはクラステンプレートを使用します。 +トークナイザのデフォルトのチャットテンプレートを確認するには、`tokenizer.default_chat_template`属性をチェックしてください。 + +これは、後方互換性のために純粋に行っていることで、既存のワークフローを壊さないようにしています。 +モデルにとってクラステンプレートが適切である場合でも、デフォルトテンプレートをオーバーライドして +`chat_template`属性を明示的に設定することを強くお勧めします。これにより、ユーザーにとって +モデルがチャット用に正しく構成されていることが明確になり、デフォルトテンプレートが変更されたり廃止された場合に備えることができます。 + +## What template should I use? 
+ +すでにチャットのトレーニングを受けたモデルのテンプレートを設定する場合、テンプレートがトレーニング中にモデルが見たメッセージのフォーマットとまったく一致することを確認する必要があります。 +そうでない場合、性能の低下を経験する可能性が高いです。これはモデルをさらにトレーニングしている場合でも同様です - チャットトークンを一定に保つと、おそらく最高の性能が得られます。 +これはトークン化と非常に類似しており、通常はトレーニング中に使用されたトークン化と正確に一致する場合に、推論またはファインチューニングの際に最良の性能が得られます。 + +一方、ゼロからモデルをトレーニングするか、チャットのためにベース言語モデルをファインチューニングする場合、適切なテンプレートを選択する自由度があります。 +LLM(Language Model)はさまざまな入力形式を処理できるほどスマートです。クラス固有のテンプレートがないモデル用のデフォルトテンプレートは、一般的なユースケースに対して良い柔軟な選択肢です。 +これは、[ChatMLフォーマット](https://github.com/openai/openai-python/blob/main/chatml.md)に従ったもので、多くのユースケースに適しています。次のようになります: + +``` +{% for message in messages %} + {{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}} +{% endfor %} +``` + +If you like this one, here it is in one-liner form, ready to copy into your code: + +```python +tokenizer.chat_template = "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}" +``` + +このテンプレートは、各メッセージを「``」トークンで囲み、役割を文字列として単純に記述します。 +これにより、トレーニングで使用する役割に対する柔軟性が得られます。出力は以下のようになります: + + +``` +<|im_start|>system +You are a helpful chatbot that will do its best not to say anything so stupid that people tweet about it.<|im_end|> +<|im_start|>user +How are you?<|im_end|> +<|im_start|>assistant +I'm doing great!<|im_end|> +``` + +「ユーザー」、「システム」、および「アシスタント」の役割は、チャットの標準です。 +特に、[`ConversationalPipeline`]との連携をスムーズに行う場合には、これらの役割を使用することをお勧めします。ただし、これらの役割に制約はありません。テンプレートは非常に柔軟で、任意の文字列を役割として使用できます。 + +## I want to use chat templates! How should I get started? + +チャットモデルを持っている場合、そのモデルの`tokenizer.chat_template`属性を設定し、[`~PreTrainedTokenizer.apply_chat_template`]を使用してテストする必要があります。 +これはモデルの所有者でない場合でも適用されます。モデルのリポジトリが空のチャットテンプレートを使用している場合、またはデフォルトのクラステンプレートを使用している場合でも、 +この属性を適切に設定できるように[プルリクエスト](https://huggingface.co/docs/hub/repositories-pull-requests-discussions)を開いてください。 + +一度属性が設定されれば、それで完了です! `tokenizer.apply_chat_template`は、そのモデルに対して正しく動作するようになります。これは、 +`ConversationalPipeline`などの場所でも自動的にサポートされます。 + +モデルがこの属性を持つことを確認することで、オープンソースモデルの全コミュニティがそのフルパワーを使用できるようになります。 +フォーマットの不一致はこの分野に悩み続け、パフォーマンスに黙って影響を与えてきました。それを終わらせる時が来ました! 
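例として、上記の ChatML 形式のワンライナーをトークナイザに設定し、[`~PreTrainedTokenizer.apply_chat_template`]で出力を確認する流れを最小限のスケッチとして示します(チェックポイントには本ドキュメントの例で使った `facebook/blenderbot-400M-distill` を説明のために流用しているだけで、実際には対象のチャットモデルに置き換えることを想定しています):

```python
from transformers import AutoTokenizer

# 説明用に、このドキュメントにも登場するチェックポイントを流用(実際は対象のチャットモデルに置き換える想定)
tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

# 上記の ChatML 形式のワンライナーをテンプレートとして設定
tokenizer.chat_template = "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}"

chat = [
    {"role": "system", "content": "You are a helpful chatbot."},
    {"role": "user", "content": "How are you?"},
]

# tokenize=False にすると、モデルに渡る前のフォーマット済み文字列を目視で確認できる
print(tokenizer.apply_chat_template(chat, tokenize=False))
```

出力が期待どおりであれば、`tokenizer.save_pretrained(...)` や `tokenizer.push_to_hub(...)` でテンプレートごと保存・共有できます。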
+ diff --git a/docs/source/ja/community.md b/docs/source/ja/community.md new file mode 100644 index 00000000000000..ffe28d042d237e --- /dev/null +++ b/docs/source/ja/community.md @@ -0,0 +1,69 @@ + + +# Community + +このページは、コミュニティによって開発された🤗 Transformersに関するリソースをまとめたものです。 + +## Community resources: + +| リソース | 説明 | 作者 | +|:----------|:-------------|------:| +| [Hugging Face Transformers Glossary Flashcards](https://www.darigovresearch.com/huggingface-transformers-glossary-flashcards) | [Transformers Docs Glossary](glossary)に基づいたフラッシュカードセットです。このセットは、長期の知識定着を特に考慮して設計されたオープンソースのクロスプラットフォームアプリである[Anki](https://apps.ankiweb.net/)を使用して簡単に学習/復習できる形式になっています。[フラッシュカードの使用方法に関する紹介ビデオはこちら](https://www.youtube.com/watch?v=Dji_h7PILrw)をご覧ください。 | [Darigov Research](https://www.darigovresearch.com/) | + +## Community notebooks: + +| ノートブック | 説明 | 著者 | | +|:----------|:-------------|:-------------|------:| +| [事前学習済みのTransformerを微調整して歌詞を生成](https://github.com/AlekseyKorshuk/huggingartists) | GPT-2モデルを微調整してお気に入りのアーティストのスタイルで歌詞を生成する方法 | [Aleksey Korshuk](https://github.com/AlekseyKorshuk) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AlekseyKorshuk/huggingartists/blob/master/huggingartists-demo.ipynb) | +| [Tensorflow 2でT5をトレーニング](https://github.com/snapthat/TF-T5-text-to-text) | Tensorflow 2を使用して任意のタスクに対してT5をトレーニングする方法。このノートブックはTensorflow 2を使用してSQUADで実装された質問と回答タスクを示しています。 | [Muhammad Harris](https://github.com/HarrisDePerceptron) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snapthat/TF-T5-text-to-text/blob/master/snapthatT5/notebooks/TF-T5-Datasets%20Training.ipynb) | +| [TPUでT5をトレーニング](https://github.com/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb) | TransformersとNlpを使用してSQUADでT5をトレーニングする方法 | [Suraj Patil](https://github.com/patil-suraj) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb#scrollTo=QLGiFCDqvuil) | +| [分類と多肢選択のためにT5を微調整](https://github.com/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb) | PyTorch Lightningを使用してテキスト対テキスト形式でT5を分類と多肢選択タスクに微調整する方法 | [Suraj Patil](https://github.com/patil-suraj) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb) | +| [新しいデータセットと言語でDialoGPTを微調整](https://github.com/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb) | DialoGPTモデルを新しいデータセットでオープンダイアログ会話用の微調整する方法 | [Nathan Cooper](https://github.com/ncoop57) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb) | +| [Reformerを使用した長いシーケンスモデリング](https://github.com/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb) | Reformerを使用して500,000トークンまでのシーケンスをトレーニングする方法 | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb) | +| [要約のためにBARTを微調整](https://github.com/ohmeow/ohmeow_website/blob/master/posts/2021-05-25-mbart-sequence-classification-with-blurr.ipynb) | Blurrを使用して要約のためにBARTを微調整する方法 | [Wayde Gilliam](https://ohmeow.com/) | 
[![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ohmeow/ohmeow_website/blob/master/posts/2021-05-25-mbart-sequence-classification-with-blurr.ipynb) | +| [事前学習済みのTransformerを微調整して誰かのツイートを生成](https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb) | GPT-2モデルを微調整してお気に入りのTwitterアカウントのスタイルでツイートを生成する方法 | [Boris Dayma](https://github.com/borisdayma) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb) | +| [🤗 Hugging FaceモデルをWeights & Biasesで最適化](https://colab.research.google.com/github/wandb/examples/blob/master/colabs/huggingface/Optimize_Hugging_Face_models_with_Weights_%26_Biases.ipynb) | Hugging FaceとWeights & Biasesの統合を示す完全なチュートリアル | [Boris Dayma](https://github.com/borisdayma) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/wandb/examples/blob/master/colabs/huggingface/Optimize_Hugging_Face_models_with_Weights_%26_Biases.ipynb) | +| [Longformerの事前学習](https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb) | 既存の事前学習済みモデルの「長い」バージョンを構築する方法 | [Iz Beltagy](https://beltagy.net) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb) | +| [QAタスクのためにLongformerを微調整](https://github.com/patil-suraj/Notebooks/blob/master/longformer_qa_training.ipynb) | QAタスクのためにLongformerモデルを微調整する方法 | [Suraj Patil](https://github.com/patil-suraj) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/Notebooks/blob/master/longformer_qa_training.ipynb) | +| [🤗nlpを使用したモデルの評価](https://github.com/patrickvonplaten/notebooks/blob/master/How_to_evaluate_Longformer_on_TriviaQA_using_NLP.ipynb) | `nlp`を使用してTriviaQAでLongformerを評価する方法 | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1m7eTGlPmLRgoPkkA7rkhQdZ9ydpmsdLE?usp=sharing) | +| [感情スパン抽出のためにT5を微調整](https://github.com/enzoampil/t5-intro/blob/master/t5_qa_training_pytorch_span_extraction.ipynb) | PyTorch Lightningを使用して感情スパン抽出のためにT5を微調整する方法 | [Lorenzo Ampil](https://github.com/enzoampil) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/enzoampil/t5-intro/blob/master/t5_qa_training_pytorch_span_extraction.ipynb) | +| [DistilBertをマルチクラス分類にファインチューニング](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb) | PyTorchを使用してDistilBertをマルチクラス分類にファインチューニングする方法 | [Abhishek Kumar Mishra](https://github.com/abhimishra91) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb)| +|[BERTをマルチラベル分類にファインチューニング](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb)|PyTorchを使用してBERTをマルチラベル分類にファインチューニングする方法|[Abhishek Kumar Mishra](https://github.com/abhimishra91) |[![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb)| +|[T5を要約にファインチューニング](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb)|PyTorchを使用してT5を要約にファインチューニングし、WandBで実験をトラッキングする方法|[Abhishek Kumar Mishra](https://github.com/abhimishra91) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb)| +|[ダイナミックパディング/バケッティングを使用してTransformersのファインチューニングを高速化](https://github.com/ELS-RD/transformers-notebook/blob/master/Divide_Hugging_Face_Transformers_training_time_by_2_or_more.ipynb)|ダイナミックパディング/バケッティングを使用してファインチューニングを2倍高速化する方法|[Michael Benesty](https://github.com/pommedeterresautee) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1CBfRU1zbfu7-ijiOqAAQUA-RJaxfcJoO?usp=sharing)| +|[マスク言語モデリングのためのReformerの事前学習](https://github.com/patrickvonplaten/notebooks/blob/master/Reformer_For_Masked_LM.ipynb)|双方向セルフアテンションレイヤーを備えたReformerモデルのトレーニング方法|[Patrick von Platen](https://github.com/patrickvonplaten) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1tzzh0i8PgDQGV3SMFUGxM7_gGae3K-uW?usp=sharing)| +|[Sci-BERTを拡張してファインチューニング](https://github.com/lordtt13/word-embeddings/blob/master/COVID-19%20Research%20Data/COVID-SciBERT.ipynb)|AllenAIのCORDデータセットで事前学習済みのSciBERTモデルの語彙を拡張し、パイプライン化する方法|[Tanmay Thakur](https://github.com/lordtt13) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1rqAR40goxbAfez1xvF3hBJphSCsvXmh8)| +|[Trainer APIを使用してBlenderBotSmallを要約のためにファインチューニング](https://github.com/lordtt13/transformers-experiments/blob/master/Custom%20Tasks/fine-tune-blenderbot_small-for-summarization.ipynb)|カスタムデータセットでBlenderBotSmallを要約のためにファインチューニングする方法、Trainer APIを使用|[Tanmay Thakur](https://github.com/lordtt13) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/19Wmupuls7mykSGyRN_Qo6lPQhgp56ymq?usp=sharing)| +|[ElectraをファインチューニングしてCaptum Integrated Gradientsで解釈](https://github.com/elsanns/xai-nlp-notebooks/blob/master/electra_fine_tune_interpret_captum_ig.ipynb) |Electraを感情分析のためにファインチューニングし、Captum Integrated Gradientsで予測を解釈する方法|[Eliza Szczechla](https://elsanns.github.io) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elsanns/xai-nlp-notebooks/blob/master/electra_fine_tune_interpret_captum_ig.ipynb)| +|[Trainerクラスを使用して非英語のGPT-2モデルをファインチューニング](https://github.com/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb) |Trainerクラスを使用して非英語のGPT-2モデルをファインチューニングする方法|[Philipp Schmid](https://www.philschmid.de) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb)| +|[DistilBERTモデルをマルチラベル分類タスクのためにファインチューニング](https://github.com/DhavalTaunk08/Transformers_scripts/blob/master/Transformers_multilabel_distilbert.ipynb) |DistilBERTモデルをマルチラベル分類タスクのためにファインチューニングする方法|[Dhaval Taunk](https://github.com/DhavalTaunk08) |[![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhavalTaunk08/Transformers_scripts/blob/master/Transformers_multilabel_distilbert.ipynb)| +|[ALBERTを文ペア分類タスクのためにファインチューニング](https://github.com/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb) |ALBERTモデルまたは他のBERTベースのモデルを文ペア分類タスクのためにファインチューニングする方法|[Nadir El Manouzi](https://github.com/NadirEM) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb)| +|[RoBERTaを感情分析のためにファインチューニング](https://github.com/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb) |RoBERTaモデルを感情分析のためにファインチューニングする方法|[Dhaval Taunk](https://github.com/DhavalTaunk08) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb)| +|[質問生成モデルの評価](https://github.com/flexudy-pipe/qugeev) | seq2seqトランスフォーマーモデルによって生成された質問の回答の正確さを評価する方法 | [Pascal Zoleko](https://github.com/zolekode) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bpsSqCQU-iw_5nNoRm_crPq6FRuJthq_?usp=sharing)| +|[DistilBERTとTensorflowを使用してテキストを分類](https://github.com/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb) | TensorFlowでテキスト分類のためにDistilBERTをファインチューニングする方法 | [Peter Bayerle](https://github.com/peterbayerle) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb)| +|[CNN/Dailymailでのエンコーダーデコーダー要約にBERTを活用](https://github.com/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb) | *google-bert/bert-base-uncased* チェックポイントを使用してCNN/Dailymailの要約のために *EncoderDecoderModel* をウォームスタートする方法 | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb)| +|[BBC XSumでのエンコーダーデコーダー要約にRoBERTaを活用](https://github.com/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb) | *FacebookAI/roberta-base* チェックポイントを使用してBBC/XSumの要約のための共有 *EncoderDecoderModel* をウォームスタートする方法 | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb)| +|[TAPASをシーケンシャル質問応答(SQA)でファインチューニング](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) | シーケンシャル質問応答(SQA)データセットで *tapas-base* チェックポイントを使用して *TapasForQuestionAnswering* をファインチューニングする方法 | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb)| +|[TabFactでTAPASを評価](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Evaluating_TAPAS_on_the_Tabfact_test_set.ipynb) | *tapas-base-finetuned-tabfact* チェックポイントを使用してファインチューニングされた *TapasForSequenceClassification* を評価する方法、🤗 datasets と 🤗 transformers 
ライブラリを組み合わせて使用 | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Evaluating_TAPAS_on_the_Tabfact_test_set.ipynb)| +|[翻訳のためのmBARTをファインチューニング](https://colab.research.google.com/github/vasudevgupta7/huggingface-tutorials/blob/main/translation_training.ipynb) | Seq2SeqTrainerを使用してHindiからEnglishへの翻訳のためにmBARTをファインチューニングする方法 | [Vasudev Gupta](https://github.com/vasudevgupta7) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vasudevgupta7/huggingface-tutorials/blob/main/translation_training.ipynb)| +|[FUNSD(フォーム理解データセット)でLayoutLMをファインチューニング](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb) | スキャンされたドキュメントからの情報抽出のためにFUNSDデータセットで *LayoutLMForTokenClassification* をファインチューニングする方法 | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb)| +| [DistilGPT2のファインチューニングとテキスト生成](https://colab.research.google.com/github/tripathiaakash/DistilGPT2-Tutorial/blob/main/distilgpt2_fine_tuning.ipynb) | DistilGPT2のファインチューニングとテキスト生成方法 | [Aakash Tripathi](https://github.com/tripathiaakash) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tripathiaakash/DistilGPT2-Tutorial/blob/main/distilgpt2_fine_tuning.ipynb)| +| [最大8KトークンでのLEDのファインチューニング](https://github.com/patrickvonplaten/notebooks/blob/master/Fine_tune_Longformer_Encoder_Decoder_(LED)_for_Summarization_on_pubmed.ipynb) | ロングレンジ要約のためのpubmedでLEDをファインチューニングする方法 | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Fine_tune_Longformer_Encoder_Decoder_(LED)_for_Summarization_on_pubmed.ipynb)| +| [ArxivでのLEDの評価](https://github.com/patrickvonplaten/notebooks/blob/master/LED_on_Arxiv.ipynb) | ロングレンジ要約のためのLEDの効果的な評価方法 | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/LED_on_Arxiv.ipynb)| +| [RVL-CDIP(文書画像分類データセット)でのLayoutLMのファインチューニング](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForSequenceClassification_on_RVL_CDIP.ipynb) | スキャンされた文書の分類のためのRVL-CDIPデータセットで*LayoutLMForSequenceClassification*をファインチューニングする方法 | [Niels Rogge](https://github.com/nielsrogge) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForSequenceClassification_on_RVL_CDIP.ipynb)| +| [Wav2Vec2 CTCデコーディングとGPT2の調整](https://github.com/voidful/huggingface_notebook/blob/main/xlsr_gpt.ipynb) | 言語モデルの調整を伴うCTCシーケンスのデコーディング方法 | [Eric Lam](https://github.com/voidful) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1e_z5jQHYbO2YKEaUgzb1ww1WwiAyydAj?usp=sharing)| +| 
[Trainerクラスを使用した2言語の要約用にBARTをファインチューニング](https://github.com/elsanns/xai-nlp-notebooks/blob/master/fine_tune_bart_summarization_two_langs.ipynb) | トレーナークラスを使用して2つの言語での要約用にBARTをファインチューニングする方法 | [Eliza Szczechla](https://github.com/elsanns) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elsanns/xai-nlp-notebooks/blob/master/fine_tune_bart_summarization_two_langs.ipynb)| +| [PubMedデータセットでBigBirdの評価](https://github.com/patrickvonplaten/notebooks/blob/master/Evaluating_Big_Bird_on_TriviaQA.ipynb) | Trivia QAの長いドキュメント質問応答でBigBirdの評価方法 | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Evaluating_Big_Bird_on_TriviaQA.ipynb)| +| [Wav2Vec2を使用してビデオの字幕を作成する](https://github.com/Muennighoff/ytclipcc/blob/main/wav2vec_youtube_captions.ipynb) | Wav2Vecでオーディオを転記して任意のビデオからYouTubeの字幕を作成する方法 | [Niklas Muennighoff](https://github.com/Muennighoff) |[![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Muennighoff/ytclipcc/blob/main/wav2vec_youtube_captions.ipynb) | +| [PyTorch Lightningを使用したCIFAR-10でのVision Transformerのファインチューニング](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/VisionTransformer/Fine_tuning_the_Vision_Transformer_on_CIFAR_10_with_PyTorch_Lightning.ipynb) | HuggingFace Transformers、Datasets、およびPyTorch Lightningを使用してCIFAR-10でVision Transformer(ViT)をファインチューニングする方法 | [Niels Rogge](https://github.com/nielsrogge) |[![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/VisionTransformer/Fine_tuning_the_Vision_Transformer_on_CIFAR_10_with_PyTorch_Lightning.ipynb) | +| [🤗 Trainerを使用したCIFAR-10でのVision Transformerのファインチューニング](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/VisionTransformer/Fine_tuning_the_Vision_Transformer_on_CIFAR_10_with_the_%F0%9F%A4%97_Trainer.ipynb) | HuggingFace Transformers、Datasets、および🤗 Trainerを使用してCIFAR-10でVision Transformer(ViT)をファインチューニングする方法 | [Niels Rogge](https://github.com/nielsrogge) |[![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/VisionTransformer/Fine_tuning_the_Vision_Transformer_on_CIFAR_10_with_the_%F0%9F%A4%97_Trainer.ipynb) | +| [Open Entity、エンティティタイピングデータセットでLUKEの評価](https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_open_entity.ipynb) | Open Entityデータセットで*LukeForEntityClassification*の評価方法 | [Ikuya Yamada](https://github.com/ikuyamada) |[![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/studio-ousia/luke/blob/master/notebooks/huggingface_open_entity.ipynb) | +| [TACRED、関係抽出データセットでLUKEの評価](https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_tacred.ipynb) | TACREDデータセットで*LukeForEntityPairClassification*の評価方法 | [Ikuya Yamada](https://github.com/ikuyamada) |[![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/studio-ousia/luke/blob/master/notebooks/huggingface_tacred.ipynb) | +| [CoNLL-2003、重要なNERベンチマークでLUKEの評価](https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_conll_2003.ipynb) | CoNLL-2003データセットで*LukeForEntitySpanClassification*の評価方法 | [Ikuya 
Yamada](https://github.com/ikuyamada) |[![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/studio-ousia/luke/blob/master/notebooks/huggingface_conll_2003.ipynb) | +| [PubMedデータセットでBigBird-Pegasusの評価](https://github.com/vasudevgupta7/bigbird/blob/main/notebooks/bigbird_pegasus_evaluation.ipynb) | PubMedデータセットで*BigBirdPegasusForConditionalGeneration*の評価方法 | [Vasudev Gupta](https://github.com/vasudevgupta7) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vasudevgupta7/bigbird/blob/main/notebooks/bigbird_pegasus_evaluation.ipynb) | +| [Wav2Vec2を使用したスピーチエモーション分類](https://github/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb) | MEGAデータセットでの感情分類のための事前学習済みWav2Vec2モデルの利用方法 | [Mehrdad Farahani](https://github.com/m3hrdadfi) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb) | +| [DETRを使用して画像内のオブジェクトを検出する](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DETR/DETR_minimal_example_(with_DetrFeatureExtractor).ipynb) | トレーニング済み*DetrForObjectDetection*モデルを使用して画像内のオブジェクトを検出し、注意を可視化する方法 | [Niels Rogge](https://github.com/NielsRogge) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/DETR/DETR_minimal_example_(with_DetrFeatureExtractor).ipynb) | +| [カスタムオブジェクト検出データセットでDETRをファインチューニングする](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DETR/Fine_tuning_DetrForObjectDetection_on_custom_dataset_(balloon).ipynb) | カスタムオブジェクト検出データセットで*DetrForObjectDetection*をファインチューニングする方法 | [Niels Rogge](https://github.com/NielsRogge) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/DETR/Fine_tuning_DetrForObjectDetection_on_custom_dataset_(balloon).ipynb) | +| [Named Entity RecognitionのためにT5をファインチューニング](https://github.com/ToluClassics/Notebooks/blob/main/T5_Ner_Finetuning.ipynb) | Named Entity RecognitionタスクでT5をファインチューニングする方法 | [Ogundepo Odunayo](https://github.com/ToluClassics) | [![Colabで開く](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1obr78FY_cBmWY5ODViCmzdY6O1KB65Vc?usp=sharing) | diff --git a/docs/source/ja/create_a_model.md b/docs/source/ja/create_a_model.md new file mode 100644 index 00000000000000..fdb23f98e7b107 --- /dev/null +++ b/docs/source/ja/create_a_model.md @@ -0,0 +1,420 @@ + + +# Create a custom architecture + +[`AutoClass`](model_doc/auto)は、モデルのアーキテクチャを自動的に推論し、事前学習済みの設定と重みをダウンロードします。一般的には、チェックポイントに依存しないコードを生成するために`AutoClass`を使用することをお勧めします。ただし、特定のモデルパラメータに対する制御をより詳細に行いたいユーザーは、いくつかの基本クラスからカスタム🤗 Transformersモデルを作成できます。これは、🤗 Transformersモデルを研究、トレーニング、または実験する興味があるユーザーに特に役立つかもしれません。このガイドでは、`AutoClass`を使用しないカスタムモデルの作成について詳しく説明します。次の方法を学びます: + +- モデルの設定をロードおよびカスタマイズする。 +- モデルアーキテクチャを作成する。 +- テキスト用の遅いトークナイザと高速トークナイザを作成する。 +- ビジョンタスク用の画像プロセッサを作成する。 +- オーディオタスク用の特徴抽出器を作成する。 +- マルチモーダルタスク用のプロセッサを作成する。 + +## Configuration + +[設定](main_classes/configuration)は、モデルの特定の属性を指します。各モデルの設定には異なる属性があります。たとえば、すべてのNLPモデルには、`hidden_size`、`num_attention_heads`、`num_hidden_layers`、および`vocab_size`属性が共通してあります。これらの属性は、モデルを構築するための注意ヘッドの数や隠れ層の数を指定します。 + 
+[DistilBERT](model_doc/distilbert)をより詳しく調べるために、[`DistilBertConfig`]にアクセスしてその属性を調べてみましょう: + +```py +>>> from transformers import DistilBertConfig + +>>> config = DistilBertConfig() +>>> print(config) +DistilBertConfig { + "activation": "gelu", + "attention_dropout": 0.1, + "dim": 768, + "dropout": 0.1, + "hidden_dim": 3072, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "model_type": "distilbert", + "n_heads": 12, + "n_layers": 6, + "pad_token_id": 0, + "qa_dropout": 0.1, + "seq_classif_dropout": 0.2, + "sinusoidal_pos_embds": false, + "transformers_version": "4.16.2", + "vocab_size": 30522 +} +``` + +[`DistilBertConfig`]は、基本の[`DistilBertModel`]を構築するために使用されるすべてのデフォルト属性を表示します。 +すべての属性はカスタマイズ可能で、実験のためのスペースを提供します。例えば、デフォルトのモデルをカスタマイズして以下のようなことができます: + +- `activation`パラメータで異なる活性化関数を試す。 +- `attention_dropout`パラメータで注意確率の高いドロップアウト率を使用する。 + + +```py +>>> my_config = DistilBertConfig(activation="relu", attention_dropout=0.4) +>>> print(my_config) +DistilBertConfig { + "activation": "relu", + "attention_dropout": 0.4, + "dim": 768, + "dropout": 0.1, + "hidden_dim": 3072, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "model_type": "distilbert", + "n_heads": 12, + "n_layers": 6, + "pad_token_id": 0, + "qa_dropout": 0.1, + "seq_classif_dropout": 0.2, + "sinusoidal_pos_embds": false, + "transformers_version": "4.16.2", + "vocab_size": 30522 +} +``` + +事前学習済みモデルの属性は、[`~PretrainedConfig.from_pretrained`] 関数で変更できます: + +```py +>>> my_config = DistilBertConfig.from_pretrained("distilbert/distilbert-base-uncased", activation="relu", attention_dropout=0.4) +``` + +Once you are satisfied with your model configuration, you can save it with [`PretrainedConfig.save_pretrained`]. Your configuration file is stored as a JSON file in the specified save directory. 
+ +```py +>>> my_config.save_pretrained(save_directory="./your_model_save_path") +``` + +設定ファイルを再利用するには、[`~PretrainedConfig.from_pretrained`]を使用してそれをロードします: + +```py +>>> my_config = DistilBertConfig.from_pretrained("./your_model_save_path/config.json") +``` + + + +カスタム構成ファイルを辞書として保存することも、カスタム構成属性とデフォルトの構成属性の違いだけを保存することもできます!詳細については[configuration](main_classes/configuration)のドキュメンテーションをご覧ください。 + + + +## Model + +次のステップは、[モデル](main_classes/models)を作成することです。モデル(アーキテクチャとも緩く言われることがあります)は、各レイヤーが何をしているか、どの操作が行われているかを定義します。構成からの `num_hidden_layers` のような属性はアーキテクチャを定義するために使用されます。 +すべてのモデルは [`PreTrainedModel`] をベースクラスとし、入力埋め込みのリサイズやセルフアテンションヘッドのプルーニングなど、共通のメソッドがいくつかあります。 +さらに、すべてのモデルは [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html)、[`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model)、または [`flax.linen.Module`](https://flax.readthedocs.io/en/latest/api_reference/flax.linen/module.html) のいずれかのサブクラスでもあります。つまり、モデルはそれぞれのフレームワークの使用法と互換性があります。 + + + +モデルにカスタム構成属性をロードします: + +```py +>>> from transformers import DistilBertModel + +>>> my_config = DistilBertConfig.from_pretrained("./your_model_save_path/config.json") +>>> model = DistilBertModel(my_config) +``` + +これにより、事前トレーニング済みの重みではなくランダムな値を持つモデルが作成されます。 +これは、トレーニングが行われるまで、まだ有用なものとして使用することはできません。 +トレーニングはコストと時間がかかるプロセスです。 +通常、トレーニングに必要なリソースの一部しか使用せず、より速くより良い結果を得るために事前学習済みモデルを使用することが良いでしょう。 + +[`~PreTrainedModel.from_pretrained`]を使用して事前学習済みモデルを作成します: + + +```py +>>> model = DistilBertModel.from_pretrained("distilbert/distilbert-base-uncased") +``` + +事前学習済みの重みをロードする際、モデルが🤗 Transformersによって提供されている場合、デフォルトのモデル設定が自動的にロードされます。ただし、必要に応じてデフォルトのモデル設定属性の一部またはすべてを独自のもので置き換えることができます。 + +```py +>>> model = DistilBertModel.from_pretrained("distilbert/distilbert-base-uncased", config=my_config) +``` + + +モデルにカスタム設定属性をロードしてください: + +```py +>>> from transformers import TFDistilBertModel + +>>> my_config = DistilBertConfig.from_pretrained("./your_model_save_path/my_config.json") +>>> tf_model = TFDistilBertModel(my_config) +``` + +これにより、事前学習済みの重みではなくランダムな値を持つモデルが作成されます。 +このモデルを有用な目的にはまだ使用することはできません。トレーニングはコストがかかり、時間がかかるプロセスです。 +一般的には、トレーニングに必要なリソースの一部しか使用せずに、より速く優れた結果を得るために事前学習済みモデルを使用することが良いでしょう。 + +[`~TFPreTrainedModel.from_pretrained`]を使用して事前学習済みモデルを作成します: + + +```py +>>> tf_model = TFDistilBertModel.from_pretrained("distilbert/distilbert-base-uncased") +``` + +事前学習済みの重みをロードする際、モデルが🤗 Transformersによって提供されている場合、デフォルトのモデル構成が自動的にロードされます。ただし、必要であればデフォルトのモデル構成属性の一部またはすべてを独自のもので置き換えることもできます: + +```py +>>> tf_model = TFDistilBertModel.from_pretrained("distilbert/distilbert-base-uncased", config=my_config) +``` + + + + +### Model heads + +この時点で、ベースのDistilBERTモデルがあり、これは隠れた状態を出力します。隠れた状態はモデルのヘッドへの入力として渡され、最終的な出力を生成します。🤗 Transformersは、モデルがそのタスクをサポートしている限り、各タスクに対応する異なるモデルヘッドを提供します(つまり、DistilBERTを翻訳のようなシーケンス対シーケンスタスクに使用することはできません)。 + + + +たとえば、[`DistilBertForSequenceClassification`]は、シーケンス分類ヘッドを持つベースのDistilBERTモデルです。シーケンス分類ヘッドは、プールされた出力の上にある線形層です。 + +```py +>>> from transformers import DistilBertForSequenceClassification + +>>> model = DistilBertForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") +``` + +新しいタスクにこのチェックポイントを簡単に再利用するには、異なるモデルヘッドに切り替えます。 +質問応答タスクの場合、[`DistilBertForQuestionAnswering`] モデルヘッドを使用します。 +質問応答ヘッドはシーケンス分類ヘッドと類似していますが、隠れ状態の出力の上に線形層があります。 + +```py +>>> from transformers import DistilBertForQuestionAnswering + +>>> model = DistilBertForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased") +``` + + + 
+例えば、[`TFDistilBertForSequenceClassification`]は、シーケンス分類ヘッドを持つベースのDistilBERTモデルです。シーケンス分類ヘッドは、プールされた出力の上にある線形層です。 + +```py +>>> from transformers import TFDistilBertForSequenceClassification + +>>> tf_model = TFDistilBertForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") +``` + +別のタスクにこのチェックポイントを簡単に再利用することができ、異なるモデルヘッドに切り替えるだけです。 +質問応答タスクの場合、[`TFDistilBertForQuestionAnswering`]モデルヘッドを使用します。 +質問応答ヘッドはシーケンス分類ヘッドと似ていますが、隠れ状態の出力の上に線形層があるだけです。 + + +```py +>>> from transformers import TFDistilBertForQuestionAnswering + +>>> tf_model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased") +``` + + + +## Tokenizer + +テキストデータをモデルで使用する前に必要な最後のベースクラスは、生のテキストをテンソルに変換するための[トークナイザ](main_classes/tokenizer)です。 +🤗 Transformersで使用できる2つのタイプのトークナイザがあります: + +- [`PreTrainedTokenizer`]: トークナイザのPython実装です。 +- [`PreTrainedTokenizerFast`]: Rustベースの[🤗 Tokenizer](https://huggingface.co/docs/tokenizers/python/latest/)ライブラリからのトークナイザです。 +このトークナイザのタイプは、そのRust実装により、特にバッチトークナイゼーション中に高速です。 +高速なトークナイザは、トークンを元の単語または文字にマッピングする*オフセットマッピング*などの追加メソッドも提供します。 + +両方のトークナイザは、エンコードとデコード、新しいトークンの追加、特別なトークンの管理など、共通のメソッドをサポートしています。 + + + +すべてのモデルが高速なトークナイザをサポートしているわけではありません。 +モデルが高速なトークナイザをサポートしているかどうかを確認するには、この[表](index#supported-frameworks)をご覧ください。 + + + +独自のトークナイザをトレーニングした場合、*ボキャブラリー*ファイルからトークナイザを作成できます。 + +```py +>>> from transformers import DistilBertTokenizer + +>>> my_tokenizer = DistilBertTokenizer(vocab_file="my_vocab_file.txt", do_lower_case=False, padding_side="left") +``` + +カスタムトークナイザーから生成される語彙は、事前学習済みモデルのトークナイザーが生成する語彙とは異なることを覚えておくことは重要です。 +事前学習済みモデルを使用する場合は、事前学習済みモデルの語彙を使用する必要があります。そうしないと、入力が意味をなさなくなります。 +[`DistilBertTokenizer`]クラスを使用して、事前学習済みモデルの語彙を持つトークナイザーを作成します: + + +```py +>>> from transformers import DistilBertTokenizer + +>>> slow_tokenizer = DistilBertTokenizer.from_pretrained("distilbert/distilbert-base-uncased") +``` + +[`DistilBertTokenizerFast`]クラスを使用して高速なトークナイザを作成します: + +```py +>>> from transformers import DistilBertTokenizerFast + +>>> fast_tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert/distilbert-base-uncased") +``` + + + +デフォルトでは、[`AutoTokenizer`]は高速なトークナイザを読み込もうとします。`from_pretrained`内で`use_fast=False`を設定することで、この動作を無効にすることができます。 + + + +## Image Processor + +画像プロセッサはビジョン入力を処理します。これは基本クラス [`~image_processing_utils.ImageProcessingMixin`] を継承しています。 + +使用するには、使用しているモデルに関連付けられた画像プロセッサを作成します。 +たとえば、画像分類に[ViT](model_doc/vit)を使用する場合、デフォルトの [`ViTImageProcessor`] を作成します。 + +```py +>>> from transformers import ViTImageProcessor + +>>> vit_extractor = ViTImageProcessor() +>>> print(vit_extractor) +ViTImageProcessor { + "do_normalize": true, + "do_resize": true, + "image_processor_type": "ViTImageProcessor", + "image_mean": [ + 0.5, + 0.5, + 0.5 + ], + "image_std": [ + 0.5, + 0.5, + 0.5 + ], + "resample": 2, + "size": 224 +} +``` + + + +カスタマイズを必要としない場合、モデルのデフォルトの画像プロセッサパラメータをロードするには、単純に`from_pretrained`メソッドを使用してください。 + + + +[`ViTImageProcessor`]のパラメータを変更して、カスタムの画像プロセッサを作成できます: + + +```py +>>> from transformers import ViTImageProcessor + +>>> my_vit_extractor = ViTImageProcessor(resample="PIL.Image.BOX", do_normalize=False, image_mean=[0.3, 0.3, 0.3]) +>>> print(my_vit_extractor) +ViTImageProcessor { + "do_normalize": false, + "do_resize": true, + "image_processor_type": "ViTImageProcessor", + "image_mean": [ + 0.3, + 0.3, + 0.3 + ], + "image_std": [ + 0.5, + 0.5, + 0.5 + ], + "resample": "PIL.Image.BOX", + "size": 224 +} +``` + +## Feature Extractor + +フィーチャー抽出器は音声入力を処理します。これは基本的な [`~feature_extraction_utils.FeatureExtractionMixin`] 
クラスから継承され、音声入力を処理するための [`SequenceFeatureExtractor`] クラスからも継承されることがあります。 + +使用するには、モデルに関連付けられたフィーチャー抽出器を作成します。たとえば、音声分類に [Wav2Vec2](model_doc/wav2vec2) を使用する場合、デフォルトの [`Wav2Vec2FeatureExtractor`] を作成します。 + + +```py +>>> from transformers import Wav2Vec2FeatureExtractor + +>>> w2v2_extractor = Wav2Vec2FeatureExtractor() +>>> print(w2v2_extractor) +Wav2Vec2FeatureExtractor { + "do_normalize": true, + "feature_extractor_type": "Wav2Vec2FeatureExtractor", + "feature_size": 1, + "padding_side": "right", + "padding_value": 0.0, + "return_attention_mask": false, + "sampling_rate": 16000 +} +``` + + + +カスタマイズを行わない場合、モデルのデフォルトの特徴抽出器パラメーターをロードするには、単に `from_pretrained` メソッドを使用してください。 + + + +[`Wav2Vec2FeatureExtractor`] のパラメーターを変更して、カスタム特徴抽出器を作成できます: + +```py +>>> from transformers import Wav2Vec2FeatureExtractor + +>>> w2v2_extractor = Wav2Vec2FeatureExtractor(sampling_rate=8000, do_normalize=False) +>>> print(w2v2_extractor) +Wav2Vec2FeatureExtractor { + "do_normalize": false, + "feature_extractor_type": "Wav2Vec2FeatureExtractor", + "feature_size": 1, + "padding_side": "right", + "padding_value": 0.0, + "return_attention_mask": false, + "sampling_rate": 8000 +} +``` + +## Processor + +マルチモーダルタスクをサポートするモデルに対して、🤗 Transformersは便利なプロセッサクラスを提供しています。 +このプロセッサクラスは、特徴量抽出器やトークナイザなどの処理クラスを便利にラップし、単一のオブジェクトに結合します。 +たとえば、自動音声認識タスク(ASR)用に[`Wav2Vec2Processor`]を使用してみましょう。 +ASRは音声をテキストに転写するタスクであり、音声入力を処理するために特徴量抽出器とトークナイザが必要です。 + +音声入力を処理する特徴量抽出器を作成します: + +```py +>>> from transformers import Wav2Vec2FeatureExtractor + +>>> feature_extractor = Wav2Vec2FeatureExtractor(padding_value=1.0, do_normalize=True) +``` + +テキスト入力を処理するトークナイザを作成します: + +```py +>>> from transformers import Wav2Vec2CTCTokenizer + +>>> tokenizer = Wav2Vec2CTCTokenizer(vocab_file="my_vocab_file.txt") +``` + +[`Wav2Vec2Processor`]で特徴量抽出器とトークナイザを組み合わせます: + + +```py +>>> from transformers import Wav2Vec2Processor + +>>> processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer) +``` + +二つの基本クラス - 設定とモデル - および追加の前処理クラス(トークナイザ、画像プロセッサ、特徴抽出器、またはプロセッサ)を使用することで、🤗 Transformers がサポートするモデルのいずれかを作成できます。これらの基本クラスは設定可能で、必要な特性を使用できます。モデルをトレーニング用に簡単にセットアップしたり、既存の事前学習済みモデルを微調整することができます。 diff --git a/docs/source/ja/custom_models.md b/docs/source/ja/custom_models.md new file mode 100644 index 00000000000000..bf306f491bcca3 --- /dev/null +++ b/docs/source/ja/custom_models.md @@ -0,0 +1,336 @@ + + +# Sharing custom models + +🤗 Transformersライブラリは、簡単に拡張できるように設計されています。すべてのモデルはリポジトリの特定のサブフォルダに完全にコード化されており、抽象化はありません。したがって、モデリングファイルをコピーして調整することが簡単です。 + +新しいモデルを書いている場合、ゼロから始める方が簡単かもしれません。このチュートリアルでは、カスタムモデルとその設定をどのように書き、Transformers内で使用できるようにし、コードに依存する共同体と共有する方法を説明します。ライブラリに存在しない場合でも、誰でも使用できるようにします。 + +これを実証するために、[timmライブラリ](https://github.com/rwightman/pytorch-image-models)のResNetクラスを[`PreTrainedModel`]にラップすることによって、ResNetモデルを使用します。 + +## Writing a custom configuration + +モデルに取り組む前に、まずその設定を書きましょう。モデルの設定は、モデルを構築するために必要なすべての情報を含むオブジェクトです。次のセクションで見るように、モデルは初期化するために`config`しか受け取ることができないため、そのオブジェクトができるだけ完全である必要があります。 + +この例では、ResNetクラスのいくつかの引数を取得し、調整したいかもしれないとします。異なる設定は、異なるタイプのResNetを提供します。その後、これらの引数を確認した後、それらの引数を単に格納します。 + +```python +from transformers import PretrainedConfig +from typing import List + + +class ResnetConfig(PretrainedConfig): + model_type = "resnet" + + def __init__( + self, + block_type="bottleneck", + layers: List[int] = [3, 4, 6, 3], + num_classes: int = 1000, + input_channels: int = 3, + cardinality: int = 1, + base_width: int = 64, + stem_width: int = 64, + stem_type: str = "", + avg_down: bool = False, + **kwargs, + ): + if 
block_type not in ["basic", "bottleneck"]: + raise ValueError(f"`block_type` must be 'basic' or bottleneck', got {block_type}.") + if stem_type not in ["", "deep", "deep-tiered"]: + raise ValueError(f"`stem_type` must be '', 'deep' or 'deep-tiered', got {stem_type}.") + + self.block_type = block_type + self.layers = layers + self.num_classes = num_classes + self.input_channels = input_channels + self.cardinality = cardinality + self.base_width = base_width + self.stem_width = stem_width + self.stem_type = stem_type + self.avg_down = avg_down + super().__init__(**kwargs) +``` + +重要なことを3つ覚えておくべきポイントは次のとおりです: +- `PretrainedConfig` を継承する必要があります。 +- あなたの `PretrainedConfig` の `__init__` は任意の kwargs を受け入れる必要があります。 +- これらの `kwargs` は親クラスの `__init__` に渡す必要があります。 + +継承は、🤗 Transformers ライブラリのすべての機能を取得できるようにするためです。他の2つの制約は、 +`PretrainedConfig` が設定しているフィールド以外にも多くのフィールドを持っていることから来ています。 +`from_pretrained` メソッドで設定を再ロードする場合、これらのフィールドはあなたの設定に受け入れられ、 +その後、親クラスに送信される必要があります。 + +設定の `model_type` を定義すること(ここでは `model_type="resnet"`)は、 +自動クラスにモデルを登録したい場合を除いては必須ではありません(最後のセクションを参照)。 + +これで、ライブラリの他のモデル設定と同様に、設定を簡単に作成して保存できます。 +以下は、resnet50d 設定を作成して保存する方法の例です: + + +```py +resnet50d_config = ResnetConfig(block_type="bottleneck", stem_width=32, stem_type="deep", avg_down=True) +resnet50d_config.save_pretrained("custom-resnet") +``` + +これにより、`custom-resnet` フォルダ内に `config.json` という名前のファイルが保存されます。その後、`from_pretrained` メソッドを使用して構成を再ロードできます。 + + +```py +resnet50d_config = ResnetConfig.from_pretrained("custom-resnet") +``` + +また、[`PretrainedConfig`] クラスの他のメソッドを使用することもできます。たとえば、[`~PretrainedConfig.push_to_hub`] を使用して、設定を直接 Hub にアップロードできます。 + +## Writing a custom model + +ResNet の設定ができたので、モデルを書き始めることができます。実際には2つのモデルを書きます。1つはバッチの画像から隠れた特徴を抽出するモデル([`BertModel`] のようなもの)で、もう1つは画像分類に適したモデル([`BertForSequenceClassification`] のようなもの)です。 + +前述したように、この例をシンプルに保つために、モデルの緩いラッパーのみを書きます。このクラスを書く前に行う必要がある唯一のことは、ブロックタイプと実際のブロッククラスの間のマップです。その後、すべてを `ResNet` クラスに渡して設定からモデルを定義します: + +```py +from transformers import PreTrainedModel +from timm.models.resnet import BasicBlock, Bottleneck, ResNet +from .configuration_resnet import ResnetConfig + + +BLOCK_MAPPING = {"basic": BasicBlock, "bottleneck": Bottleneck} + + +class ResnetModel(PreTrainedModel): + config_class = ResnetConfig + + def __init__(self, config): + super().__init__(config) + block_layer = BLOCK_MAPPING[config.block_type] + self.model = ResNet( + block_layer, + config.layers, + num_classes=config.num_classes, + in_chans=config.input_channels, + cardinality=config.cardinality, + base_width=config.base_width, + stem_width=config.stem_width, + stem_type=config.stem_type, + avg_down=config.avg_down, + ) + + def forward(self, tensor): + return self.model.forward_features(tensor) +``` + +画像を分類するモデルの場合、forwardメソッドを変更するだけです: + +```py +import torch + + +class ResnetModelForImageClassification(PreTrainedModel): + config_class = ResnetConfig + + def __init__(self, config): + super().__init__(config) + block_layer = BLOCK_MAPPING[config.block_type] + self.model = ResNet( + block_layer, + config.layers, + num_classes=config.num_classes, + in_chans=config.input_channels, + cardinality=config.cardinality, + base_width=config.base_width, + stem_width=config.stem_width, + stem_type=config.stem_type, + avg_down=config.avg_down, + ) + + def forward(self, tensor, labels=None): + logits = self.model(tensor) + if labels is not None: + loss = torch.nn.cross_entropy(logits, labels) + return {"loss": loss, "logits": logits} + return {"logits": logits} +``` + 
+両方の場合、`PreTrainedModel`から継承し、`config`を使用してスーパークラスの初期化を呼び出します(通常の`torch.nn.Module`を書くときのような感じです)。 +`config_class`を設定する行は必須ではありませんが、(最後のセクションを参照)、モデルを自動クラスに登録したい場合に使用できます。 + + + +モデルがライブラリ内のモデルと非常に似ている場合、このモデルと同じ構成を再利用できます。 + + + +モデルが返す内容は何でも構いませんが、ラベルが渡されるときに損失を含む辞書を返す(`ResnetModelForImageClassification`のように行ったもの)と、 +モデルを[`Trainer`]クラス内で直接使用できるようになります。独自のトレーニングループまたは他のライブラリを使用する予定である限り、 +別の出力形式を使用することも問題ありません。 + +さて、モデルクラスができたので、1つ作成しましょう: + +```py +resnet50d = ResnetModelForImageClassification(resnet50d_config) +``` + +再度、[`PreTrainedModel`]のいずれかのメソッド、例えば[`~PreTrainedModel.save_pretrained`]や +[`~PreTrainedModel.push_to_hub`]などを使用できます。次のセクションでは、モデルの重みをコードと一緒に +Hugging Face Hub にプッシュする方法を見てみます。 +しかし、まずはモデル内に事前学習済みの重みをロードしましょう。 + +独自のユースケースでは、おそらく独自のデータでカスタムモデルをトレーニングすることになるでしょう。 +このチュートリアルではスピードアップのために、resnet50dの事前学習済みバージョンを使用します。 +私たちのモデルはそれをラップするだけなので、これらの重みを転送するのは簡単です: + + +```py +import timm + +pretrained_model = timm.create_model("resnet50d", pretrained=True) +resnet50d.model.load_state_dict(pretrained_model.state_dict()) +``` + +さて、[`~PreTrainedModel.save_pretrained`]または[`~PreTrainedModel.push_to_hub`]を実行したときに、 +モデルのコードが保存されるようにする方法を見てみましょう。 + +## Sending the code to the Hub + + + +このAPIは実験的であり、次のリリースでわずかな変更があるかもしれません。 + + + +まず、モデルが`.py`ファイルに完全に定義されていることを確認してください。 +ファイルは相対インポートを他のファイルに依存できますが、すべてのファイルが同じディレクトリにある限り(まだこの機能ではサブモジュールはサポートしていません)、問題ありません。 +この例では、現在の作業ディレクトリ内に名前が「resnet_model」のフォルダを作成し、その中に`modeling_resnet.py`ファイルと`configuration_resnet.py`ファイルを定義します。 +構成ファイルには`ResnetConfig`のコードが含まれ、モデリングファイルには`ResnetModel`と`ResnetModelForImageClassification`のコードが含まれています。 + + +``` +. +└── resnet_model + ├── __init__.py + ├── configuration_resnet.py + └── modeling_resnet.py +``` + +`__init__.py`は空であっても問題ありません。Pythonが`resnet_model`をモジュールとして検出できるようにするために存在します。 + + + +ライブラリからモデリングファイルをコピーする場合、ファイルの先頭にあるすべての相対インポートを`transformers`パッケージからインポートに置き換える必要があります。 + + + +既存の設定やモデルを再利用(またはサブクラス化)できることに注意してください。 + +コミュニティとモデルを共有するために、次の手順に従ってください:まず、新しく作成したファイルからResNetモデルと設定をインポートします: + + +```py +from resnet_model.configuration_resnet import ResnetConfig +from resnet_model.modeling_resnet import ResnetModel, ResnetModelForImageClassification +``` + +次に、`save_pretrained`メソッドを使用してこれらのオブジェクトのコードファイルをコピーし、特定のAutoクラス(特にモデルの場合)に正しく登録するようライブラリに指示する必要があります。次のように実行します: + +```py +ResnetConfig.register_for_auto_class() +ResnetModel.register_for_auto_class("AutoModel") +ResnetModelForImageClassification.register_for_auto_class("AutoModelForImageClassification") +``` + +注意: 設定については自動クラスを指定する必要はありません(設定用の自動クラスは1つしかなく、[`AutoConfig`]です)が、 +モデルについては異なります。カスタムモデルは多くの異なるタスクに適している可能性があるため、 +モデルが正確な自動クラスのうちどれに適しているかを指定する必要があります。 + +次に、前述のように設定とモデルを作成しましょう: + +```py +resnet50d_config = ResnetConfig(block_type="bottleneck", stem_width=32, stem_type="deep", avg_down=True) +resnet50d = ResnetModelForImageClassification(resnet50d_config) + +pretrained_model = timm.create_model("resnet50d", pretrained=True) +resnet50d.model.load_state_dict(pretrained_model.state_dict()) +``` + +モデルをHubに送信するには、ログインしていることを確認してください。ターミナルで次のコマンドを実行します: + +```bash +huggingface-cli login +``` + +またはノートブックから: + +```py +from huggingface_hub import notebook_login + +notebook_login() +``` + +次に、次のようにして、独自の名前空間にプッシュできます(または、メンバーである組織にプッシュできます): + +```py +resnet50d.push_to_hub("custom-resnet50d") +``` + +モデリングの重みとJSON形式の構成に加えて、このフォルダー「custom-resnet50d」内のモデリングおよび構成「.py」ファイルもコピーされ、結果はHubにアップロードされました。結果はこの[model repo](https://huggingface.co/sgugger/custom-resnet50d)で確認できます。 + +詳細については、[Hubへのプッシュ方法](model_sharing)を参照してください。 + +## Using a model with custom code + 
+自動クラスと `from_pretrained` メソッドを使用して、リポジトリ内のカスタムコードファイルと共に任意の構成、モデル、またはトークナイザを使用できます。 Hubにアップロードされるすべてのファイルとコードはマルウェアのスキャンが実施されます(詳細は[Hubセキュリティ](https://huggingface.co/docs/hub/security#malware-scanning)ドキュメンテーションを参照してください)、しかし、依然として悪意のあるコードを実行しないために、モデルコードと作者を確認する必要があります。 +`trust_remote_code=True` を設定してカスタムコードを持つモデルを使用できます: + + +```py +from transformers import AutoModelForImageClassification + +model = AutoModelForImageClassification.from_pretrained("sgugger/custom-resnet50d", trust_remote_code=True) +``` + +コミットハッシュを「revision」として渡すことも強く推奨されています。これにより、モデルの作者がコードを悪意のある新しい行で更新しなかったことを確認できます(モデルの作者を完全に信頼している場合を除きます)。 + + +```py +commit_hash = "ed94a7c6247d8aedce4647f00f20de6875b5b292" +model = AutoModelForImageClassification.from_pretrained( + "sgugger/custom-resnet50d", trust_remote_code=True, revision=commit_hash +) +``` + +モデルリポジトリのコミット履歴をブラウジングする際には、任意のコミットのコミットハッシュを簡単にコピーできるボタンがあります。 + +## Registering a model with custom code to the auto classes + +🤗 Transformersを拡張するライブラリを作成している場合、独自のモデルを含めるために自動クラスを拡張したい場合があります。 +これはコードをHubにプッシュすることとは異なり、ユーザーはカスタムモデルを取得するためにあなたのライブラリをインポートする必要があります +(Hubからモデルコードを自動的にダウンロードするのとは対照的です)。 + +構成に既存のモデルタイプと異なる `model_type` 属性がある限り、またあなたのモデルクラスが適切な `config_class` 属性を持っている限り、 +次のようにそれらを自動クラスに追加できます: + +```py +from transformers import AutoConfig, AutoModel, AutoModelForImageClassification + +AutoConfig.register("resnet", ResnetConfig) +AutoModel.register(ResnetConfig, ResnetModel) +AutoModelForImageClassification.register(ResnetConfig, ResnetModelForImageClassification) +``` + +注意: `AutoConfig` にカスタム設定を登録する際の最初の引数は、カスタム設定の `model_type` と一致する必要があります。 +また、任意の自動モデルクラスにカスタムモデルを登録する際の最初の引数は、それらのモデルの `config_class` と一致する必要があります。 diff --git a/docs/source/ja/custom_tools.md b/docs/source/ja/custom_tools.md new file mode 100644 index 00000000000000..8c51ebaeb9d1ca --- /dev/null +++ b/docs/source/ja/custom_tools.md @@ -0,0 +1,764 @@ + + +# Custom Tools and Prompts + + + +トランスフォーマーのコンテキストでツールとエージェントが何であるかを知らない場合、 +まず[Transformers Agents](transformers_agents)ページをお読みいただくことをお勧めします。 + + + + + +Transformers Agentsは実験的なAPIであり、いつでも変更される可能性があります。 +エージェントによって返される結果は、APIや基礎となるモデルが変更される可能性があるため、変化することがあります。 + + + +カスタムツールとプロンプトを作成し、使用することは、エージェントを強化し、新しいタスクを実行させるために非常に重要です。 +このガイドでは、以下の内容を説明します: + +- プロンプトのカスタマイズ方法 +- カスタムツールの使用方法 +- カスタムツールの作成方法 + +## Customizing the prompt + +[Transformers Agents](transformers_agents)で説明されているように、エージェントは[`~Agent.run`]および[`~Agent.chat`]モードで実行できます。 +`run`モードと`chat`モードの両方は同じロジックに基づいています。 +エージェントを駆動する言語モデルは、長いプロンプトに基づいて条件付けられ、 +次のトークンを生成して停止トークンに達するまでプロンプトを完了します。 +両者の唯一の違いは、`chat`モードの間にプロンプトが前のユーザーの入力とモデルの生成と共に拡張されることです。 +これにより、エージェントは過去の対話にアクセスでき、エージェントにあたかもメモリがあるかのように見えます。 + +### Structure of the prompt + +プロンプトがどのように構築され、どのように最適化できるかを理解するために、プロンプトは大まかに4つの部分に分かれています。 + +1. イントロダクション:エージェントの振る舞い、ツールの概念の説明。 +2. すべてのツールの説明。これはユーザーによって定義/選択されたツールでランタイム時に動的に置換される`<>`トークンによって定義されます。 +3. タスクとその解決策の一連の例。 +4. 現在の例と解決策の要求。 + +各部分をよりよく理解するために、`run`プロンプトがどのように見えるかの簡略版を見てみましょう: + +````text +タスクを実行するために、Pythonのシンプルなコマンドのシリーズを考えてくることがあるでしょう。 +[...] +意味がある場合は、中間結果を表示することができます。 + +ツール: +- document_qa:これはドキュメント(pdf)に関する質問に答えるツールです。情報を含むドキュメントである `document` と、ドキュメントに関する質問である `question` を受け取り、質問に対する回答を含むテキストを返します。 +- image_captioner:これは画像の説明を生成するツールです。キャプションにする画像である `image` と、説明を含む英語のテキストを返すテキストを受け取ります。 +[...] 
+ +タスク: "変数 `question` に関する質問に答えるための画像について回答してください。質問はフランス語です。" + +次のツールを使用します:質問を英語に翻訳するための `translator`、そして入力画像に関する質問に答えるための `image_qa`。 + +回答: +```py +translated_question = translator(question=question, src_lang="French", tgt_lang="English") +print(f"The translated question is {translated_question}.") +answer = image_qa(image=image, question=translated_question) +print(f"The answer is {answer}") +``` + +タスク:「`document`内で最年長の人物を特定し、その結果をバナーとして表示する。」 + +以下のツールを使用します:`document_qa`を使用してドキュメント内で最年長の人物を見つけ、その回答に従って`image_generator`を使用して画像を生成します。 + +回答: +```py +answer = document_qa(document, question="What is the oldest person?") +print(f"The answer is {answer}.") +image = image_generator("A banner showing " + answer) +``` + +[...] +タスク: "川と湖の絵を描いてください" + +以下のものを使用します +```` + +導入部分("Tools:"の前のテキスト)は、モデルの振る舞いと実行すべきタスクを正確に説明しています。 +この部分はおそらくエージェントが常に同じ方法で振る舞う必要があるため、カスタマイズする必要はありません。 + +2番目の部分("Tools"の下の箇条書き)は、`run`または`chat`を呼び出すたびに動的に追加されます。 +`agent.toolbox`内のツールの数と同じ数の箇条書きがあり、それぞれの箇条書きにはツールの名前と説明が含まれています。 + +```text +- : +``` + +もうすぐ確認しましょう。 `document_qa` ツールを読み込んで名前と説明を出力します。 + +```py +from transformers import load_tool + +document_qa = load_tool("document-question-answering") +print(f"- {document_qa.name}: {document_qa.description}") +``` + +which gives: +```text +- document_qa: This is a tool that answers a question about a document (pdf). It takes an input named `document` which should be the document containing the information, as well as a `question` that is the question about the document. It returns a text that contains the answer to the question. +``` + +ツール説明: +このツールは、2つのパートから成り立っています。最初のパートでは、ツールが何を行うかを説明し、2番目のパートでは入力引数と戻り値がどのように期待されるかを述べています。 + +良いツール名とツールの説明は、エージェントが正しく使用するために非常に重要です。エージェントがツールについて持っている唯一の情報は、その名前と説明です。したがって、ツール名と説明の両方が正確に記述され、ツールボックス内の既存のツールのスタイルに合致することを確認する必要があります。特に、説明にはコードスタイルで名前で期待されるすべての引数が言及され、期待される型とそれらが何であるかの説明も含めるべきです。 + + + +キュレートされたTransformersツールの命名と説明を確認して、ツールがどのような名前と説明を持つべきかを理解するのに役立ちます。 +すべてのツールは[`Agent.toolbox`]プロパティで確認できます。 + + + + +カスタマイズされた例: +ツールの使い方をエージェントに正確に示す一連の例が含まれています。これらの例は、エージェントが実際に正確で実行可能なコードを生成する可能性を最大化するように書かれているため、非常に重要です。大規模な言語モデルは、プロンプト内のパターンを認識し、新しいデータを使用してそのパターンを繰り返すことに非常に優れています。したがって、実践で正しい実行可能なコードを生成するエージェントの可能性を最大化するように、これらの例は書かれている必要があります。 + +以下は、一つの例です: + +````text +Task: "Identify the oldest person in the `document` and create an image showcasing the result as a banner." + +I will use the following tools: `document_qa` to find the oldest person in the document, then `image_generator` to generate an image according to the answer. 
+ +Answer: +```py +answer = document_qa(document, question="What is the oldest person?") +print(f"The answer is {answer}.") +image = image_generator("A banner showing " + answer) +``` + +```` + +パターン:モデルが繰り返しを行うように指示されるパターンには、3つの部分があります。 +タスクの声明、エージェントの意図した動作の説明、そして最後に生成されるコードです。 +プロンプトの一部であるすべての例には、この正確なパターンがあり、エージェントが新しいトークンを生成する際にも +同じパターンを再現することを確認しています。 + +プロンプトの例はTransformersチームによって厳選され、一連の問題ステートメントで厳密に評価されます。 +これにより、エージェントのプロンプトがエージェントの実際の使用ケースを解決するためにできるだけ優れたものになります。 + +プロンプトの最後の部分に対応しています: + +[こちら](https://github.com/huggingface/transformers/blob/main/src/transformers/tools/evaluate_agent.py)の問題ステートメントで厳密に評価される、エージェントのプロンプトができるだけ優れたものになるように +慎重に選定されたプロンプト例を提供しています。 + +```text +Task: "Draw me a picture of rivers and lakes" + +I will use the following +``` + + +これがエージェントに完成させるための最終的で未完成の例です。未完成の例は、実際のユーザー入力に基づいて動的に作成されます。上記の例では、ユーザーが次のように実行しました: + +```py +agent.run("Draw me a picture of rivers and lakes") +``` + +ユーザーの入力 - つまり、タスク:"川と湖の絵を描いてください"は、以下のようなプロンプトテンプレートに変換されます:"タスク: \n\n 次に私は以下を使用します"。 +この文は、エージェントが条件付けられたプロンプトの最終行を構成し、したがってエージェントに対して前の例とまったく同じ方法で例を終了するよう強く影響します。 + +詳細には立ち入りませんが、チャットテンプレートは同じプロンプト構造を持ち、例はわずかに異なるスタイルを持っています。例: + +````text +[...] + +===== + +Human: Answer the question in the variable `question` about the image stored in the variable `image`. + +Assistant: I will use the tool `image_qa` to answer the question on the input image. + +```py +answer = image_qa(text=question, image=image) +print(f"The answer is {answer}") +``` + +Human: I tried this code, it worked but didn't give me a good result. The question is in French + +Assistant: In this case, the question needs to be translated first. I will use the tool `translator` to do this. + +```py +translated_question = translator(question=question, src_lang="French", tgt_lang="English") +print(f"The translated question is {translated_question}.") +answer = image_qa(text=translated_question, image=image) +print(f"The answer is {answer}") +``` + +===== + +[...] +```` + +*Human:* `run`プロンプトの例とは対照的に、各`chat`プロンプトの例には*Human*と*Assistant*の間で1つ以上のやりとりがあります。各やりとりは、`run`プロンプトの例と同様の構造になっています。ユーザーの入力は*Human:*の後ろに追加され、エージェントにはコードを生成する前に何を行う必要があるかを最初に生成するように指示されます。やりとりは以前のやりとりに基づいて行われることがあり、ユーザーが「I tried **this** code」と入力したように、以前に生成されたエージェントのコードを参照できます。 + +*Assistant:* `.chat`を実行すると、ユーザーの入力または*タスク*が未完了の形式に変換されます: + +```text +Human: \n\nAssistant: +``` + +以下のエージェントが完了するコマンドについて説明します。 `run` コマンドとは対照的に、`chat` コマンドは完了した例をプロンプトに追加します。そのため、次の `chat` ターンのためにエージェントにより多くの文脈を提供します。 + +さて、プロンプトの構造がわかったところで、どのようにカスタマイズできるかを見てみましょう! + +### Writing good user inputs + +大規模な言語モデルはユーザーの意図を理解する能力がますます向上していますが、エージェントが正しいタスクを選択するのを助けるために、できるだけ正確に記述することが非常に役立ちます。できるだけ正確であるとは何を意味するのでしょうか? + +エージェントは、プロンプトでツール名とその説明のリストを見ています。ツールが追加されるほど、エージェントが正しいツールを選択するのが難しくなり、正しいツールの連続を選択するのはさらに難しくなります。共通の失敗例を見てみましょう。ここではコードのみを返すことにします。 + + +```py +from transformers import HfAgent + +agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder") + +agent.run("Show me a tree", return_code=True) +``` + +gives: + +```text +==Explanation from the agent== +I will use the following tool: `image_segmenter` to create a segmentation mask for the image. + + +==Code generated by the agent== +mask = image_segmenter(image, prompt="tree") +``` + +これはおそらく私たちが望んでいたものではないでしょう。代わりに、木の画像が生成されることがより可能性が高いです。 +特定のツールを使用するようエージェントを誘導するために、ツールの名前や説明に含まれている重要なキーワードを使用することは非常に役立ちます。さて、詳しく見てみましょう。 + +```py +agent.toolbox["image_generator"].description +``` + +```text +'This is a tool that creates an image according to a prompt, which is a text description. 
It takes an input named `prompt` which contains the image description and outputs an image. +``` + +名前と説明文には、キーワード「画像」、「プロンプト」、「作成」、および「生成」が使用されています。これらの言葉を使用することで、ここでの動作がより効果的になる可能性が高いです。プロンプトを少し詳細に調整しましょう。 + +```py +agent.run("Create an image of a tree", return_code=True) +``` + +gives: +```text +==Explanation from the agent== +I will use the following tool `image_generator` to generate an image of a tree. + + +==Code generated by the agent== +image = image_generator(prompt="tree") +``` + +簡単に言うと、エージェントがタスクを正確に適切なツールにマッピングできない場合は、ツールの名前や説明の最も関連性のあるキーワードを調べて、タスクリクエストをそれに合わせて洗練させてみてください。 + + +### Customizing the tool descriptions + +以前にも見たように、エージェントは各ツールの名前と説明にアクセスできます。ベースのツールは非常に正確な名前と説明を持っているはずですが、特定のユースケースに合わせてツールの説明や名前を変更することが役立つかもしれません。これは、非常に類似した複数のツールを追加した場合や、特定のドメイン(たとえば、画像生成や変換など)でエージェントを使用する場合に特に重要になるかもしれません。 + +よくある問題は、エージェントが画像生成タスクに頻繁に使用される場合、画像生成と画像変換/修正を混同することです。 + +例: + +```py +agent.run("Make an image of a house and a car", return_code=True) +``` + +returns +```text +==Explanation from the agent== +I will use the following tools `image_generator` to generate an image of a house and `image_transformer` to transform the image of a car into the image of a house. + +==Code generated by the agent== +house_image = image_generator(prompt="A house") +car_image = image_generator(prompt="A car") +house_car_image = image_transformer(image=car_image, prompt="A house") +``` + +これはおそらく私たちがここで望んでいる正確なものではないようです。エージェントは「image_generator」と「image_transformer」の違いを理解するのが難しいようで、しばしば両方を一緒に使用します。 + +ここでエージェントをサポートするために、"image_transformer"のツール名と説明を変更して、少し"image"や"prompt"から切り離してみましょう。代わりにそれを「modifier」と呼びましょう: + +```py +agent.toolbox["modifier"] = agent.toolbox.pop("image_transformer") +agent.toolbox["modifier"].description = agent.toolbox["modifier"].description.replace( + "transforms an image according to a prompt", "modifies an image" +) +``` + +「変更」は、上記のプロンプトに新しい画像プロセッサを使用する強力な手がかりです。それでは、もう一度実行してみましょう。 + + +```py +agent.run("Make an image of a house and a car", return_code=True) +``` + +Now we're getting: +```text +==Explanation from the agent== +I will use the following tools: `image_generator` to generate an image of a house, then `image_generator` to generate an image of a car. + + +==Code generated by the agent== +house_image = image_generator(prompt="A house") +car_image = image_generator(prompt="A car") +``` + +これは、私たちが考えていたものに確実に近づいています!ただし、家と車を同じ画像に含めたいと考えています。タスクを単一の画像生成に向けることで、より適切な方向に進めるはずです: + +```py +agent.run("Create image: 'A house and car'", return_code=True) +``` + +```text +==Explanation from the agent== +I will use the following tool: `image_generator` to generate an image. + + +==Code generated by the agent== +image = image_generator(prompt="A house and car") +``` + + + +エージェントは、特に複数のオブジェクトの画像を生成するなど、やや複雑なユースケースに関しては、まだ多くのユースケースに対して脆弱です。 +エージェント自体とその基礎となるプロンプトは、今後数ヶ月でさらに改善され、さまざまなユーザーの入力に対してエージェントがより頑健になるようになります。 + + + +### Customizing the whole project + +ユーザーに最大限の柔軟性を提供するために、[上記](#structure-of-the-prompt)で説明されたプロンプトテンプレート全体をユーザーが上書きできます。この場合、カスタムプロンプトには導入セクション、ツールセクション、例セクション、未完了の例セクションが含まれていることを確認してください。`run` プロンプトテンプレートを上書きしたい場合、以下のように行うことができます: + + +```py +template = """ [...] 
""" + +agent = HfAgent(your_endpoint, run_prompt_template=template) +``` + + + +`<>` 文字列と `<>` は、エージェントが使用できるツールを認識し、ユーザーのプロンプトを正しく挿入できるように、`template` のどこかに定義されていることを確認してください。 + + + +同様に、`chat` プロンプトテンプレートを上書きすることもできます。なお、`chat` モードでは常に以下の形式で交換が行われます: + +上記のテキストの上に日本語の翻訳を提供してください。Markdownコードとして書いてください。 + + +```text +Human: <> + +Assistant: +``` + +したがって、カスタム`chat`プロンプトテンプレートの例もこのフォーマットを使用することが重要です。以下のように、インスタンス化時に`chat`テンプレートを上書きできます。 + +```python +template = """ [...] """ + +agent = HfAgent(url_endpoint=your_endpoint, chat_prompt_template=template) +``` + + + +`<>` という文字列が `template` 内で定義されていることを確認してください。これにより、エージェントは使用可能なツールを把握できます。 + + + +両方の場合、プロンプトテンプレートの代わりに、コミュニティの誰かがホストしたテンプレートを使用したい場合は、リポジトリIDを渡すことができます。デフォルトのプロンプトは、[このリポジトリ](https://huggingface.co/datasets/huggingface-tools/default-prompts) にありますので、参考になります。 + +カスタムプロンプトをHubのリポジトリにアップロードしてコミュニティと共有する場合は、次のことを確認してください: +- データセットリポジトリを使用すること +- `run` コマンド用のプロンプトテンプレートを `run_prompt_template.txt` という名前のファイルに配置すること +- `chat` コマンド用のプロンプトテンプレートを `chat_prompt_template.txt` という名前のファイルに配置すること + +## Using custom tools + +このセクションでは、画像生成に特化した2つの既存のカスタムツールを利用します: + +- [huggingface-tools/image-transformation](https://huggingface.co/spaces/huggingface-tools/image-transformation) をより多くの画像変更を可能にするために [diffusers/controlnet-canny-tool](https://huggingface.co/spaces/diffusers/controlnet-canny-tool) に置き換えます。 +- 画像のアップスケーリング用の新しいツールをデフォルトのツールボックスに追加します:[diffusers/latent-upscaler-tool](https://huggingface.co/spaces/diffusers/latent-upscaler-tool) は既存の画像変換ツールを置き換えます。 + +便利な [`load_tool`] 関数を使用してカスタムツールをロードします: + +```py +from transformers import load_tool + +controlnet_transformer = load_tool("diffusers/controlnet-canny-tool") +upscaler = load_tool("diffusers/latent-upscaler-tool") +``` + +エージェントにカスタムツールを追加すると、ツールの説明と名前がエージェントのプロンプトに自動的に含まれます。したがって、エージェントがカスタムツールの使用方法を理解できるように、カスタムツールには適切に記述された説明と名前が必要です。 + +`controlnet_transformer`の説明と名前を見てみましょう。 + +最初に、便利な[`load_tool`]関数を使用してカスタムツールをロードします。 + +```py +print(f"Description: '{controlnet_transformer.description}'") +print(f"Name: '{controlnet_transformer.name}'") +``` + +gives +```text +Description: 'This is a tool that transforms an image with ControlNet according to a prompt. +It takes two inputs: `image`, which should be the image to transform, and `prompt`, which should be the prompt to use to change it. It returns the modified image.' 
+Name: 'image_transformer' +``` + +名前と説明は正確であり、[厳選されたツール](./transformers_agents#a-curated-set-of-tools)のスタイルに合っています。 + +次に、`controlnet_transformer`と`upscaler`を使ってエージェントをインスタンス化します。 + +```py +tools = [controlnet_transformer, upscaler] +agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder", additional_tools=tools) + +``` + +以下のコマンドは、以下の情報を提供します: + + +```text +image_transformer has been replaced by as provided in `additional_tools` +``` + +一連の厳選されたツールにはすでに `image_transformer` ツールがあり、これをカスタムツールで置き換えます。 + + + +既存のツールを上書きすることは、特定のタスクに既存のツールをまったく同じ目的で使用したい場合に有益であることがあります。 +なぜなら、エージェントはその特定のタスクの使用方法に精通しているからです。この場合、カスタムツールは既存のツールとまったく同じAPIに従うか、そのツールを使用するすべての例が更新されるようにプロンプトテンプレートを適応させる必要があります。 + + + +アップスケーラーツールには `image_upscaler` という名前が付けられ、これはデフォルトのツールボックスにはまだ存在しないため、単にツールのリストに追加されます。 +エージェントが現在使用可能なツールボックスを確認するには、`agent.toolbox` 属性を使用できます。 + +```py +print("\n".join([f"- {a}" for a in agent.toolbox.keys()])) +``` + +```text +- document_qa +- image_captioner +- image_qa +- image_segmenter +- transcriber +- summarizer +- text_classifier +- text_qa +- text_reader +- translator +- image_transformer +- text_downloader +- image_generator +- video_generator +- image_upscaler +``` + +注意: `image_upscaler` がエージェントのツールボックスの一部となったことに注目してください。 + +それでは、新しいツールを試してみましょう![Transformers Agents Quickstart](./transformers_agents#single-execution-run) で生成した画像を再利用します。 + + +```py +from diffusers.utils import load_image + +image = load_image( + "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rivers_and_lakes.png" +) +``` + + + +美しい冬の風景にこの画像を変身させましょう: + +```py +image = agent.run("Transform the image: 'A frozen lake and snowy forest'", image=image) +``` + +```text +==Explanation from the agent== +I will use the following tool: `image_transformer` to transform the image. + + +==Code generated by the agent== +image = image_transformer(image, prompt="A frozen lake and snowy forest") +``` + + + +新しい画像処理ツールは、非常に強力な画像の変更を行うことができるControlNetに基づいています。 +デフォルトでは、画像処理ツールはサイズが512x512ピクセルの画像を返します。それを拡大できるか見てみましょう。 + + +```py +image = agent.run("Upscale the image", image) +``` + +```text +==Explanation from the agent== +I will use the following tool: `image_upscaler` to upscale the image. 
+ + +==Code generated by the agent== +upscaled_image = image_upscaler(image) +``` + + + + +エージェントは、プロンプト「画像の拡大」を、その説明とツールの名前だけを基に、新たに追加されたアップスケーリングツールに自動的にマッピングし、正しく実行できました。 + +次に、新しいカスタムツールを作成する方法を見てみましょう。 + +### Adding new tools + +このセクションでは、エージェントに追加できる新しいツールの作成方法を示します。 + +#### Creating a new tool + +まず、ツールの作成から始めましょう。次のコードで、特定のタスクに関してHugging Face Hubで最もダウンロードされたモデルを取得する、あまり役立たないけれども楽しいタスクを追加します。 + +以下のコードでそれを行うことができます: + + +```python +from huggingface_hub import list_models + +task = "text-classification" + +model = next(iter(list_models(filter=task, sort="downloads", direction=-1))) +print(model.id) +``` + +タスク `text-classification` の場合、これは `'facebook/bart-large-mnli'` を返します。`translation` の場合、`'google-t5/t5-base'` を返します。 + +これをエージェントが利用できるツールに変換する方法は何でしょうか?すべてのツールは、主要な属性を保持するスーパークラス `Tool` に依存しています。私たちは、それを継承したクラスを作成します: + + +```python +from transformers import Tool + + +class HFModelDownloadsTool(Tool): + pass +``` + +このクラスにはいくつかの必要な要素があります: +- `name` 属性:これはツール自体の名前に対応し、他のツールと調和するために `model_download_counter` と名付けます。 +- `description` 属性:これはエージェントのプロンプトを埋めるために使用されます。 +- `inputs` と `outputs` 属性:これらを定義することで、Python インタープリターが型に関する賢明な選択を行うのに役立ち、ツールをHubにプッシュする際にgradio-demoを生成できるようになります。これらは、予想される値のリストであり、`text`、`image`、または`audio`になることがあります。 +- `__call__` メソッド:これには推論コードが含まれています。これは上記で試したコードです! + +こちらが現在のクラスの外観です: + + +```python +from transformers import Tool +from huggingface_hub import list_models + + +class HFModelDownloadsTool(Tool): + name = "model_download_counter" + description = ( + "This is a tool that returns the most downloaded model of a given task on the Hugging Face Hub. " + "It takes the name of the category (such as text-classification, depth-estimation, etc), and " + "returns the name of the checkpoint." + ) + + inputs = ["text"] + outputs = ["text"] + + def __call__(self, task: str): + model = next(iter(list_models(filter=task, sort="downloads", direction=-1))) + return model.id +``` + +さて、今度はツールが使えるようになりました。このツールをファイルに保存し、メインスクリプトからインポートしましょう。このファイルを `model_downloads.py` という名前にし、結果のインポートコードは次のようになります: + +以下は、現在のクラスの外観です: + + +```python +from model_downloads import HFModelDownloadsTool + +tool = HFModelDownloadsTool() +``` + +他の人々に利益をもたらし、より簡単な初期化のために、それをHubにあなたの名前空間でプッシュすることをお勧めします。これを行うには、`tool` 変数で `push_to_hub` を呼び出すだけです: + +```python +tool.push_to_hub("hf-model-downloads") +``` + +エージェントがツールを使用する方法について、最終ステップを見てみましょう。 + +#### Having the agent use the tool + +Hubにあるツールがあります。これは次のようにインスタンス化できます(ユーザー名をツールに合わせて変更してください): + +```python +from transformers import load_tool + +tool = load_tool("lysandre/hf-model-downloads") +``` + +エージェントで使用するためには、エージェントの初期化メソッドの `additional_tools` パラメータにそれを渡すだけです: + + +```python +from transformers import HfAgent + +agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder", additional_tools=[tool]) + +agent.run( + "Can you read out loud the name of the model that has the most downloads in the 'text-to-video' task on the Hugging Face Hub?" +) +``` +which outputs the following: +```text +==Code generated by the agent== +model = model_download_counter(task="text-to-video") +print(f"The model with the most downloads is {model}.") +audio_model = text_reader(model) + + +==Result== +The model with the most downloads is damo-vilab/text-to-video-ms-1.7b. +``` + +以下のテキストは、次のオーディオを生成します。 + + + +**Audio** | +|------------------------------------------------------------------------------------------------------------------------------------------------------| +|
\ No newline at end of file +
diff --git a/docs/source/ja/internal/audio_utils.md b/docs/source/ja/internal/audio_utils.md new file mode 100644 index 00000000000000..967c716cd2f93c --- /dev/null +++ b/docs/source/ja/internal/audio_utils.md @@ -0,0 +1,39 @@ + + +# `FeatureExtractor` 用のユーティリティ + +このページには、*短時間フーリエ変換* や *ログ メル スペクトログラム* などの一般的なアルゴリズムを使用して生のオーディオから特別な特徴を計算するために、オーディオ [`FeatureExtractor`] で使用できるすべてのユーティリティ関数がリストされています。 + +これらのほとんどは、ライブラリ内のオーディオ プロセッサのコードを学習する場合にのみ役に立ちます。 + +## オーディオ変換 + +[[autodoc]] audio_utils.hertz_to_mel + +[[autodoc]] audio_utils.mel_to_hertz + +[[autodoc]] audio_utils.mel_filter_bank + +[[autodoc]] audio_utils.optimal_fft_length + +[[autodoc]] audio_utils.window_function + +[[autodoc]] audio_utils.spectrogram + +[[autodoc]] audio_utils.power_to_db + +[[autodoc]] audio_utils.amplitude_to_db diff --git a/docs/source/ja/internal/file_utils.md b/docs/source/ja/internal/file_utils.md new file mode 100644 index 00000000000000..51a025bfc67f17 --- /dev/null +++ b/docs/source/ja/internal/file_utils.md @@ -0,0 +1,49 @@ + + +# 一般的なユーティリティ + +このページには、ファイル `utils.py` にある Transformers の一般的なユーティリティ関数がすべてリストされています。 + +これらのほとんどは、ライブラリで一般的なコードを学習する場合にのみ役に立ちます。 + +## 列挙型と名前付きタプル + +[[autodoc]] utils.ExplicitEnum + +[[autodoc]] utils.PaddingStrategy + +[[autodoc]] utils.TensorType + +## 特別なデコレーター + +[[autodoc]] utils.add_start_docstrings + +[[autodoc]] utils.add_start_docstrings_to_model_forward + +[[autodoc]] utils.add_end_docstrings + +[[autodoc]] utils.add_code_sample_docstrings + +[[autodoc]] utils.replace_return_docstrings + +## 特殊なプロパティ + +[[autodoc]] utils.cached_property + +## その他のユーティリティ + +[[autodoc]] utils._LazyModule diff --git a/docs/source/ja/internal/generation_utils.md b/docs/source/ja/internal/generation_utils.md new file mode 100644 index 00000000000000..baeefd06abb01b --- /dev/null +++ b/docs/source/ja/internal/generation_utils.md @@ -0,0 +1,357 @@ + + +# 発電用ユーティリティ + +このページには、[`~generation.GenerationMixin.generate`] で使用されるすべてのユーティリティ関数がリストされています。 +[`~generation.GenerationMixin.greedy_search`], +[`~generation.GenerationMixin.contrastive_search`], +[`~generation.GenerationMixin.sample`], +[`~generation.GenerationMixin.beam_search`], +[`~generation.GenerationMixin.beam_sample`], +[`~generation.GenerationMixin.group_beam_search`]、および +[`~generation.GenerationMixin.constrained_beam_search`]。 + +これらのほとんどは、ライブラリ内の生成メソッドのコードを学習する場合にのみ役に立ちます。 + +## 出力を生成する + +[`~generation.GenerationMixin.generate`] の出力は、次のサブクラスのインスタンスです。 +[`~utils.ModelOutput`]。この出力は、返されたすべての情報を含むデータ構造です。 +[`~generation.GenerationMixin.generate`] によって作成されますが、タプルまたは辞書としても使用できます。 + +以下に例を示します。 + +```python +from transformers import GPT2Tokenizer, GPT2LMHeadModel + +tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2") +model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2") + +inputs = tokenizer("Hello, my dog is cute and ", return_tensors="pt") +generation_output = model.generate(**inputs, return_dict_in_generate=True, output_scores=True) +``` + +`generation_output` オブジェクトは、できる限り [`~generation.GenerateDecoderOnlyOutput`] です。 +以下のそのクラスのドキュメントを参照してください。これは、次の属性があることを意味します。 + +- `sequences`: 生成されたトークンのシーケンス +- `scores` (オプション): 各生成ステップの言語モデリング ヘッドの予測スコア +- `hidden_​​states` (オプション): 生成ステップごとのモデルの隠れた状態 +- `attentions` (オプション): 生成ステップごとのモデルのアテンションの重み + +ここでは、`output_scores=True`を渡したので `scores` がありますが、`hidden_​​states` はありません。 +`attentions` は、`output_hidden_​​states=True`または`output_attentions=True`を渡さなかったためです。 + +通常と同じように各属性にアクセスできます。その属性がモデルから返されなかった場合は、 
+は「なし」を取得します。ここで、たとえば`generation_output.scores`は、生成されたすべての予測スコアです。 +言語モデリングのヘッドであり、`generation_output.attentions`は`None`です。 + +`generation_output` オブジェクトをタプルとして使用する場合、`None` 値を持たない属性のみが保持されます。 +たとえば、ここには 2 つの要素、`loss`、次に`logits`があります。 + +```python +generation_output[:2] +``` + +たとえば、タプル `(generation_output.sequences,generation_output.scores)` を返します。 + +`generation_output` オブジェクトを辞書として使用する場合、`None` を持たない属性のみが保持されます。 +ここでは、たとえば、`sequences`と`scores`という 2 つのキーがあります。 + +ここではすべての出力タイプを文書化します。 + +### PyTorch + +[[autodoc]] generation.GenerateDecoderOnlyOutput + +[[autodoc]] generation.GenerateEncoderDecoderOutput + +[[autodoc]] generation.GenerateBeamDecoderOnlyOutput + +[[autodoc]] generation.GenerateBeamEncoderDecoderOutput + +### TensorFlow + +[[autodoc]] generation.TFGreedySearchEncoderDecoderOutput + +[[autodoc]] generation.TFGreedySearchDecoderOnlyOutput + +[[autodoc]] generation.TFSampleEncoderDecoderOutput + +[[autodoc]] generation.TFSampleDecoderOnlyOutput + +[[autodoc]] generation.TFBeamSearchEncoderDecoderOutput + +[[autodoc]] generation.TFBeamSearchDecoderOnlyOutput + +[[autodoc]] generation.TFBeamSampleEncoderDecoderOutput + +[[autodoc]] generation.TFBeamSampleDecoderOnlyOutput + +[[autodoc]] generation.TFContrastiveSearchEncoderDecoderOutput + +[[autodoc]] generation.TFContrastiveSearchDecoderOnlyOutput + +### FLAX + +[[autodoc]] generation.FlaxSampleOutput + +[[autodoc]] generation.FlaxGreedySearchOutput + +[[autodoc]] generation.FlaxBeamSearchOutput + +## LogitsProcessor + +[`LogitsProcessor`] を使用して、言語モデルのヘッドの予測スコアを変更できます。 +世代。 + +### PyTorch + +[[autodoc]] AlternatingCodebooksLogitsProcessor + - __call__ + +[[autodoc]] ClassifierFreeGuidanceLogitsProcessor + - __call__ + +[[autodoc]] EncoderNoRepeatNGramLogitsProcessor + - __call__ + +[[autodoc]] EncoderRepetitionPenaltyLogitsProcessor + - __call__ + +[[autodoc]] EpsilonLogitsWarper + - __call__ + +[[autodoc]] EtaLogitsWarper + - __call__ + +[[autodoc]] ExponentialDecayLengthPenalty + - __call__ + +[[autodoc]] ForcedBOSTokenLogitsProcessor + - __call__ + +[[autodoc]] ForcedEOSTokenLogitsProcessor + - __call__ + +[[autodoc]] ForceTokensLogitsProcessor + - __call__ + +[[autodoc]] HammingDiversityLogitsProcessor + - __call__ + +[[autodoc]] InfNanRemoveLogitsProcessor + - __call__ + +[[autodoc]] LogitNormalization + - __call__ + +[[autodoc]] LogitsProcessor + - __call__ + +[[autodoc]] LogitsProcessorList + - __call__ + +[[autodoc]] LogitsWarper + - __call__ + +[[autodoc]] MinLengthLogitsProcessor + - __call__ + +[[autodoc]] MinNewTokensLengthLogitsProcessor + - __call__ + +[[autodoc]] NoBadWordsLogitsProcessor + - __call__ + +[[autodoc]] NoRepeatNGramLogitsProcessor + - __call__ + +[[autodoc]] PrefixConstrainedLogitsProcessor + - __call__ + +[[autodoc]] RepetitionPenaltyLogitsProcessor + - __call__ + +[[autodoc]] SequenceBiasLogitsProcessor + - __call__ + +[[autodoc]] SuppressTokensAtBeginLogitsProcessor + - __call__ + +[[autodoc]] SuppressTokensLogitsProcessor + - __call__ + +[[autodoc]] TemperatureLogitsWarper + - __call__ + +[[autodoc]] TopKLogitsWarper + - __call__ + +[[autodoc]] TopPLogitsWarper + - __call__ + +[[autodoc]] TypicalLogitsWarper + - __call__ + +[[autodoc]] UnbatchedClassifierFreeGuidanceLogitsProcessor + - __call__ + +[[autodoc]] WhisperTimeStampLogitsProcessor + - __call__ + +### TensorFlow + +[[autodoc]] TFForcedBOSTokenLogitsProcessor + - __call__ + +[[autodoc]] TFForcedEOSTokenLogitsProcessor + - __call__ + +[[autodoc]] TFForceTokensLogitsProcessor + - __call__ + +[[autodoc]] TFLogitsProcessor + - __call__ + 
+[[autodoc]] TFLogitsProcessorList + - __call__ + +[[autodoc]] TFLogitsWarper + - __call__ + +[[autodoc]] TFMinLengthLogitsProcessor + - __call__ + +[[autodoc]] TFNoBadWordsLogitsProcessor + - __call__ + +[[autodoc]] TFNoRepeatNGramLogitsProcessor + - __call__ + +[[autodoc]] TFRepetitionPenaltyLogitsProcessor + - __call__ + +[[autodoc]] TFSuppressTokensAtBeginLogitsProcessor + - __call__ + +[[autodoc]] TFSuppressTokensLogitsProcessor + - __call__ + +[[autodoc]] TFTemperatureLogitsWarper + - __call__ + +[[autodoc]] TFTopKLogitsWarper + - __call__ + +[[autodoc]] TFTopPLogitsWarper + - __call__ + +### FLAX + +[[autodoc]] FlaxForcedBOSTokenLogitsProcessor + - __call__ + +[[autodoc]] FlaxForcedEOSTokenLogitsProcessor + - __call__ + +[[autodoc]] FlaxForceTokensLogitsProcessor + - __call__ + +[[autodoc]] FlaxLogitsProcessor + - __call__ + +[[autodoc]] FlaxLogitsProcessorList + - __call__ + +[[autodoc]] FlaxLogitsWarper + - __call__ + +[[autodoc]] FlaxMinLengthLogitsProcessor + - __call__ + +[[autodoc]] FlaxSuppressTokensAtBeginLogitsProcessor + - __call__ + +[[autodoc]] FlaxSuppressTokensLogitsProcessor + - __call__ + +[[autodoc]] FlaxTemperatureLogitsWarper + - __call__ + +[[autodoc]] FlaxTopKLogitsWarper + - __call__ + +[[autodoc]] FlaxTopPLogitsWarper + - __call__ + +[[autodoc]] FlaxWhisperTimeStampLogitsProcessor + - __call__ + +## StoppingCriteria + +[`StoppingCriteria`] を使用して、(EOS トークン以外の) 生成を停止するタイミングを変更できます。これは PyTorch 実装でのみ利用可能であることに注意してください。 + +[[autodoc]] StoppingCriteria + - __call__ + +[[autodoc]] StoppingCriteriaList + - __call__ + +[[autodoc]] MaxLengthCriteria + - __call__ + +[[autodoc]] MaxTimeCriteria + - __call__ + +## Constraints + +[`Constraint`] を使用すると、生成時に出力に特定のトークンまたはシーケンスが含まれるように強制できます。これは PyTorch 実装でのみ利用可能であることに注意してください。 + +[[autodoc]] Constraint + +[[autodoc]] PhrasalConstraint + +[[autodoc]] DisjunctiveConstraint + +[[autodoc]] ConstraintListState + +## BeamSearch + +[[autodoc]] BeamScorer + - process + - finalize + +[[autodoc]] BeamSearchScorer + - process + - finalize + +[[autodoc]] ConstrainedBeamSearchScorer + - process + - finalize + +## Utilities + +[[autodoc]] top_k_top_p_filtering + +[[autodoc]] tf_top_k_top_p_filtering + +## Streamers + +[[autodoc]] TextStreamer + +[[autodoc]] TextIteratorStreamer diff --git a/docs/source/ja/internal/image_processing_utils.md b/docs/source/ja/internal/image_processing_utils.md new file mode 100644 index 00000000000000..917f86875f4812 --- /dev/null +++ b/docs/source/ja/internal/image_processing_utils.md @@ -0,0 +1,48 @@ + + +# 画像プロセッサ用ユーティリティ + +このページには、画像プロセッサーで使用されるすべてのユーティリティー関数がリストされています。主に機能的なものです。 +画像を処理するために使用される変換。 + +これらのほとんどは、ライブラリ内の画像プロセッサのコードを学習する場合にのみ役に立ちます。 + +## Image Transformations + +[[autodoc]] image_transforms.center_crop + +[[autodoc]] image_transforms.center_to_corners_format + +[[autodoc]] image_transforms.corners_to_center_format + +[[autodoc]] image_transforms.id_to_rgb + +[[autodoc]] image_transforms.normalize + +[[autodoc]] image_transforms.pad + +[[autodoc]] image_transforms.rgb_to_id + +[[autodoc]] image_transforms.rescale + +[[autodoc]] image_transforms.resize + +[[autodoc]] image_transforms.to_pil_image + +## ImageProcessingMixin + +[[autodoc]] image_processing_utils.ImageProcessingMixin diff --git a/docs/source/ja/internal/modeling_utils.md b/docs/source/ja/internal/modeling_utils.md new file mode 100644 index 00000000000000..62aa2040c8a258 --- /dev/null +++ b/docs/source/ja/internal/modeling_utils.md @@ -0,0 +1,83 @@ + + +# カスタムレイヤーとユーティリティ + +このページには、ライブラリで使用されるすべてのカスタム 
レイヤーと、モデリングに提供されるユーティリティ関数がリストされます。 + +これらのほとんどは、ライブラリ内のモデルのコードを研究する場合にのみ役に立ちます。 + + +## Pytorch custom modules + +[[autodoc]] pytorch_utils.Conv1D + +[[autodoc]] modeling_utils.PoolerStartLogits + - forward + +[[autodoc]] modeling_utils.PoolerEndLogits + - forward + +[[autodoc]] modeling_utils.PoolerAnswerClass + - forward + +[[autodoc]] modeling_utils.SquadHeadOutput + +[[autodoc]] modeling_utils.SQuADHead + - forward + +[[autodoc]] modeling_utils.SequenceSummary + - forward + +## PyTorch Helper Functions + +[[autodoc]] pytorch_utils.apply_chunking_to_forward + +[[autodoc]] pytorch_utils.find_pruneable_heads_and_indices + +[[autodoc]] pytorch_utils.prune_layer + +[[autodoc]] pytorch_utils.prune_conv1d_layer + +[[autodoc]] pytorch_utils.prune_linear_layer + +## TensorFlow custom layers + +[[autodoc]] modeling_tf_utils.TFConv1D + +[[autodoc]] modeling_tf_utils.TFSequenceSummary + +## TensorFlow loss functions + +[[autodoc]] modeling_tf_utils.TFCausalLanguageModelingLoss + +[[autodoc]] modeling_tf_utils.TFMaskedLanguageModelingLoss + +[[autodoc]] modeling_tf_utils.TFMultipleChoiceLoss + +[[autodoc]] modeling_tf_utils.TFQuestionAnsweringLoss + +[[autodoc]] modeling_tf_utils.TFSequenceClassificationLoss + +[[autodoc]] modeling_tf_utils.TFTokenClassificationLoss + +## TensorFlow Helper Functions + +[[autodoc]] modeling_tf_utils.get_initializer + +[[autodoc]] modeling_tf_utils.keras_serializable + +[[autodoc]] modeling_tf_utils.shape_list diff --git a/docs/source/ja/internal/pipelines_utils.md b/docs/source/ja/internal/pipelines_utils.md new file mode 100644 index 00000000000000..833c98c4d0dc18 --- /dev/null +++ b/docs/source/ja/internal/pipelines_utils.md @@ -0,0 +1,44 @@ + + +# パイプライン用のユーティリティ + +このページには、ライブラリがパイプラインに提供するすべてのユーティリティ関数がリストされます。 + +これらのほとんどは、ライブラリ内のモデルのコードを研究する場合にのみ役に立ちます。 + + +## Argument handling + +[[autodoc]] pipelines.ArgumentHandler + +[[autodoc]] pipelines.ZeroShotClassificationArgumentHandler + +[[autodoc]] pipelines.QuestionAnsweringArgumentHandler + +## Data format + +[[autodoc]] pipelines.PipelineDataFormat + +[[autodoc]] pipelines.CsvPipelineDataFormat + +[[autodoc]] pipelines.JsonPipelineDataFormat + +[[autodoc]] pipelines.PipedPipelineDataFormat + +## Utilities + +[[autodoc]] pipelines.PipelineException diff --git a/docs/source/ja/internal/time_series_utils.md b/docs/source/ja/internal/time_series_utils.md new file mode 100644 index 00000000000000..9355ea090e1458 --- /dev/null +++ b/docs/source/ja/internal/time_series_utils.md @@ -0,0 +1,29 @@ + + +# 時系列ユーティリティ + +このページには、時系列ベースのモデルに使用できるすべてのユーティリティ関数とクラスがリストされます。 + +これらのほとんどは、時系列モデルのコードを研究している場合、または分散出力クラスのコレクションに追加したい場合にのみ役立ちます。 + +## Distributional Output + +[[autodoc]] time_series_utils.NormalOutput + +[[autodoc]] time_series_utils.StudentTOutput + +[[autodoc]] time_series_utils.NegativeBinomialOutput \ No newline at end of file diff --git a/docs/source/ja/internal/tokenization_utils.md b/docs/source/ja/internal/tokenization_utils.md new file mode 100644 index 00000000000000..8e36e4149e2784 --- /dev/null +++ b/docs/source/ja/internal/tokenization_utils.md @@ -0,0 +1,42 @@ + + +# Utilities for Tokenizers + +このページには、トークナイザーによって使用されるすべてのユーティリティ関数 (主にクラス) がリストされます。 +[`~tokenization_utils_base.PreTrainedTokenizerBase`] 間の共通メソッドを実装します。 +[`PreTrainedTokenizer`] と [`PreTrainedTokenizerFast`] およびミックスイン +[`~tokenization_utils_base.SpecialTokensMixin`]。 + +これらのほとんどは、ライブラリ内のトークナイザーのコードを学習する場合にのみ役に立ちます。 + +## PreTrainedTokenizerBase + +[[autodoc]] tokenization_utils_base.PreTrainedTokenizerBase + - __call__ + - all + +## 
SpecialTokensMixin + +[[autodoc]] tokenization_utils_base.SpecialTokensMixin + +## Enums and namedtuples + +[[autodoc]] tokenization_utils_base.TruncationStrategy + +[[autodoc]] tokenization_utils_base.CharSpan + +[[autodoc]] tokenization_utils_base.TokenSpan diff --git a/docs/source/ja/internal/trainer_utils.md b/docs/source/ja/internal/trainer_utils.md new file mode 100644 index 00000000000000..dddfc7dc55ee12 --- /dev/null +++ b/docs/source/ja/internal/trainer_utils.md @@ -0,0 +1,49 @@ + + +# トレーナー用ユーティリティ + +このページには、[`Trainer`] で使用されるすべてのユーティリティ関数がリストされています。 + +これらのほとんどは、ライブラリ内のトレーナーのコードを学習する場合にのみ役に立ちます。 + +## Utilities + +[[autodoc]] EvalPrediction + +[[autodoc]] IntervalStrategy + +[[autodoc]] enable_full_determinism + +[[autodoc]] set_seed + +[[autodoc]] torch_distributed_zero_first + +## Callbacks internals + +[[autodoc]] trainer_callback.CallbackHandler + +## Distributed Evaluation + +[[autodoc]] trainer_pt_utils.DistributedTensorGatherer + +## Trainer Argument Parser + +[[autodoc]] HfArgumentParser + +## Debug Utilities + +[[autodoc]] debug_utils.DebugUnderflowOverflow diff --git a/docs/source/ja/llm_tutorial.md b/docs/source/ja/llm_tutorial.md new file mode 100644 index 00000000000000..156a2423eedc79 --- /dev/null +++ b/docs/source/ja/llm_tutorial.md @@ -0,0 +1,223 @@ + + +# Generation with LLMs + +[[open-in-colab]] + +LLM、またはLarge Language Models(大規模言語モデル)は、テキスト生成の鍵となる要素です。要するに、これらは大規模な事前訓練済みトランスフォーマーモデルで、与えられた入力テキストに基づいて次の単語(または、より正確にはトークン)を予測するように訓練されています。トークンを1つずつ予測するため、モデルを呼び出すだけでは新しい文を生成するために何かより精巧なことをする必要があります。自己回帰生成を行う必要があります。 + +自己回帰生成は、推論時の手続きで、いくつかの初期入力を与えた状態で、モデルを反復的に呼び出す手法です。🤗 Transformersでは、これは[`~generation.GenerationMixin.generate`]メソッドによって処理され、これは生成能力を持つすべてのモデルで利用可能です。 + +このチュートリアルでは、以下のことを示します: + +* LLMを使用してテキストを生成する方法 +* 一般的な落とし穴を回避する方法 +* LLMを最大限に活用するための次のステップ + +始める前に、必要なライブラリがすべてインストールされていることを確認してください: + + +```bash +pip install transformers bitsandbytes>=0.39.0 -q +``` + +## Generate text + +[因果言語モデリング](tasks/language_modeling)のためにトレーニングされた言語モデルは、テキストトークンのシーケンスを入力として受け取り、次のトークンの確率分布を返します。 + + +
+*図: "Forward pass of an LLM"*
+ + +LLM(Language Model)による自己回帰生成の重要な側面の1つは、この確率分布から次のトークンを選択する方法です。このステップでは、次のイテレーションのためのトークンが得られる限り、何でも可能です。これは、確率分布から最も可能性の高いトークンを選択するだけのシンプルな方法から、結果の分布からサンプリングする前に数々の変換を適用するほど複雑な方法まで、あらゆる方法が考えられます。 + + + +
+*図: "Autoregressive generation iteratively selects the next token from a probability distribution to generate text"*
+ +上記のプロセスは、ある停止条件が満たされるまで反復的に繰り返されます。理想的には、停止条件はモデルによって指示され、モデルは終了シーケンス(`EOS`)トークンを出力するタイミングを学習すべきです。これがそうでない場合、生成はあらかじめ定義された最大長に達したときに停止します。 + +トークン選択ステップと停止条件を適切に設定することは、モデルがタスクで期待どおりに振る舞うために重要です。それが、各モデルに関連付けられた [`~generation.GenerationConfig`] ファイルがある理由であり、これには優れたデフォルトの生成パラメータ化が含まれ、モデルと一緒に読み込まれます。 + +コードについて話しましょう! + + + +基本的なLLMの使用に興味がある場合、高レベルの [`Pipeline`](pipeline_tutorial) インターフェースが良い出発点です。ただし、LLMはしばしば量子化やトークン選択ステップの細かい制御などの高度な機能が必要であり、これは [`~generation.GenerationMixin.generate`] を介して最良に行われます。LLMとの自己回帰生成はリソースが多く必要であり、適切なスループットのためにGPUで実行する必要があります。 + + + + +まず、モデルを読み込む必要があります。 + + +```py +>>> from transformers import AutoModelForCausalLM + +>>> model = AutoModelForCausalLM.from_pretrained( +... "openlm-research/open_llama_7b", device_map="auto", load_in_4bit=True +... ) +``` + +`from_pretrained` 呼び出しで2つのフラグがあることに注意してください: + +- `device_map` はモデルをあなたのGPUに移動させます +- `load_in_4bit` は[4ビットの動的量子化](main_classes/quantization)を適用してリソース要件を大幅に削減します + +モデルを初期化する他の方法もありますが、これはLLMを始めるための良い基準です。 + +次に、[トークナイザ](tokenizer_summary)を使用してテキスト入力を前処理する必要があります。 + + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b") +>>> model_inputs = tokenizer(["A list of colors: red, blue"], return_tensors="pt").to("cuda") +``` + + +`model_inputs` 変数は、トークン化されたテキスト入力とアテンションマスクを保持しています。 [`~generation.GenerationMixin.generate`] は、アテンションマスクが渡されていない場合でも、最善の努力をしてそれを推測しようとしますが、できる限り渡すことをお勧めします。最適な結果を得るためです。 + +最後に、[`~generation.GenerationMixin.generate`] メソッドを呼び出して生成されたトークンを取得し、それを表示する前にテキストに変換する必要があります。 + + +```py +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'A list of colors: red, blue, green, yellow, black, white, and brown' +``` + +これで完了です!わずかなコード行数で、LLM(Large Language Model)のパワーを活用できます。 + +## Common pitfalls + +[生成戦略](generation_strategies)はたくさんあり、デフォルトの値があなたのユースケースに適していないことがあります。出力が期待通りでない場合、最も一般的な落とし穴とその回避方法のリストを作成しました。 + +```py +>>> from transformers import AutoModelForCausalLM, AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b") +>>> tokenizer.pad_token = tokenizer.eos_token # Llama has no pad token by default +>>> model = AutoModelForCausalLM.from_pretrained( +... "openlm-research/open_llama_7b", device_map="auto", load_in_4bit=True +... 
) +``` + +### Generated output is too short/long + +[`~generation.GenerationConfig`] ファイルで指定されていない場合、`generate` はデフォルトで最大で 20 トークンまで返します。我々は `generate` コールで `max_new_tokens` を手動で設定することを強くお勧めします。これにより、返される新しいトークンの最大数を制御できます。LLM(正確には、[デコーダー専用モデル](https://huggingface.co/learn/nlp-course/chapter1/6?fw=pt))も出力の一部として入力プロンプトを返すことに注意してください。 + +```py +>>> model_inputs = tokenizer(["A sequence of numbers: 1, 2"], return_tensors="pt").to("cuda") + +>>> # By default, the output will contain up to 20 tokens +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'A sequence of numbers: 1, 2, 3, 4, 5' + +>>> # Setting `max_new_tokens` allows you to control the maximum length +>>> generated_ids = model.generate(**model_inputs, max_new_tokens=50) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'A sequence of numbers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,' +``` + +### Incorrect generation mode + +デフォルトでは、 [`~generation.GenerationConfig`] ファイルで指定されていない限り、`generate` は各イテレーションで最も可能性の高いトークンを選択します(貪欲デコーディング)。タスクに応じて、これは望ましくないことがあります。チャットボットやエッセイのような創造的なタスクでは、サンプリングが有益です。一方、音声の転写や翻訳のような入力に基づくタスクでは、貪欲デコーディングが有益です。`do_sample=True` でサンプリングを有効にできます。このトピックについての詳細は、この[ブログポスト](https://huggingface.co/blog/how-to-generate)で学ぶことができます。 + +```py +>>> # Set seed or reproducibility -- you don't need this unless you want full reproducibility +>>> from transformers import set_seed +>>> set_seed(0) + +>>> model_inputs = tokenizer(["I am a cat."], return_tensors="pt").to("cuda") + +>>> # LLM + greedy decoding = repetitive, boring output +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'I am a cat. I am a cat. I am a cat. I am a cat' + +>>> # With sampling, the output becomes more creative! +>>> generated_ids = model.generate(**model_inputs, do_sample=True) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'I am a cat.\nI just need to be. I am always.\nEvery time' +``` + +### Wrong padding side + +LLM(Large Language Models)は[デコーダー専用](https://huggingface.co/learn/nlp-course/chapter1/6?fw=pt)のアーキテクチャであり、入力プロンプトを繰り返し処理することを意味します。入力が同じ長さでない場合、それらをパディングする必要があります。LLMはパッドトークンからの続きを学習していないため、入力は左パディングする必要があります。また、生成に対して注目マスクを渡し忘れないようにしてください! + + +```py +>>> # The tokenizer initialized above has right-padding active by default: the 1st sequence, +>>> # which is shorter, has padding on the right side. Generation fails. +>>> model_inputs = tokenizer( +... ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt" +... ).to("cuda") +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids[0], skip_special_tokens=True)[0] +'' + +>>> # With left-padding, it works as expected! +>>> tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b", padding_side="left") +>>> tokenizer.pad_token = tokenizer.eos_token # Llama has no pad token by default +>>> model_inputs = tokenizer( +... ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt" +... ).to("cuda") +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'1, 2, 3, 4, 5, 6,' +``` + +## Further resources + +オートリグレッシブ生成プロセスは比較的簡単ですが、LLMを最大限に活用することは多くの要素が絡むため、挑戦的な試みとなります。LLMの使用と理解をさらに深めるための次のステップについては以下のリソースをご覧ください。 + + +### Advanced generate usage + +1. [ガイド](generation_strategies):異なる生成方法を制御する方法、生成構成ファイルの設定方法、出力のストリーミング方法についてのガイド; +2. 
[`~generation.GenerationConfig`]、[`~generation.GenerationMixin.generate`]、および[生成関連クラス](internal/generation_utils)に関するAPIリファレンス。 + +### LLM leaderboards + +1. [Open LLM リーダーボード](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard):オープンソースモデルの品質に焦点を当てたリーダーボード; +2. [Open LLM-Perf リーダーボード](https://huggingface.co/spaces/optimum/llm-perf-leaderboard):LLMのスループットに焦点を当てたリーダーボード。 + +### Latency and throughput + +1. [ガイド](main_classes/quantization):ダイナミッククオンタイズに関するガイド。これによりメモリ要件を劇的に削減する方法が示されています。 + +### Related libraries + +1. [`text-generation-inference`](https://github.com/huggingface/text-generation-inference):LLM用の本番向けサーバー; +2. [`optimum`](https://github.com/huggingface/optimum):特定のハードウェアデバイス向けに最適化された🤗 Transformersの拡張。 diff --git a/docs/source/ja/main_classes/agent.md b/docs/source/ja/main_classes/agent.md new file mode 100644 index 00000000000000..290f3b5b8c7247 --- /dev/null +++ b/docs/source/ja/main_classes/agent.md @@ -0,0 +1,105 @@ + + +# エージェントとツール + + + +Transformers Agents は実験的な API であり、いつでも変更される可能性があります。エージェントから返される結果 +API または基礎となるモデルは変更される傾向があるため、変更される可能性があります。 + + + +エージェントとツールの詳細については、[入門ガイド](../transformers_agents) を必ずお読みください。このページ +基礎となるクラスの API ドキュメントが含まれています。 + +## エージェント + +私たちは 3 種類のエージェントを提供します。[`HfAgent`] はオープンソース モデルの推論エンドポイントを使用し、[`LocalAgent`] は選択したモデルをローカルで使用し、[`OpenAiAgent`] は OpenAI クローズド モデルを使用します。 + +### HfAgent + +[[autodoc]] HfAgent + +### LocalAgent + +[[autodoc]] LocalAgent + +### OpenAiAgent + +[[autodoc]] OpenAiAgent + +### AzureOpenAiAgent + +[[autodoc]] AzureOpenAiAgent + +### Agent + +[[autodoc]] Agent + - chat + - run + - prepare_for_new_chat + +## Tools + +### load_tool + +[[autodoc]] load_tool + +### Tool + +[[autodoc]] Tool + +### PipelineTool + +[[autodoc]] PipelineTool + +### RemoteTool + +[[autodoc]] RemoteTool + +### launch_gradio_demo + +[[autodoc]] launch_gradio_demo + +## エージェントの種類 + +エージェントはツール間であらゆる種類のオブジェクトを処理できます。ツールは完全にマルチモーダルであるため、受け取りと返品が可能です +テキスト、画像、オーディオ、ビデオなどのタイプ。ツール間の互換性を高めるためだけでなく、 +これらの戻り値を ipython (jupyter、colab、ipython ノートブックなど) で正しくレンダリングするには、ラッパー クラスを実装します。 +このタイプの周り。 + +ラップされたオブジェクトは最初と同じように動作し続けるはずです。テキストオブジェクトは依然として文字列または画像として動作する必要があります +オブジェクトは依然として `PIL.Image` として動作するはずです。 + +これらのタイプには、次の 3 つの特定の目的があります。 + +- 型に対して `to_raw` を呼び出すと、基になるオブジェクトが返されるはずです +- 型に対して `to_string` を呼び出すと、オブジェクトを文字列として返す必要があります。`AgentText` の場合は文字列になる可能性があります。 + ただし、他のインスタンスのオブジェクトのシリアル化されたバージョンのパスになります。 +- ipython カーネルで表示すると、オブジェクトが正しく表示されるはずです + +### AgentText + +[[autodoc]] transformers.tools.agent_types.AgentText + +### AgentImage + +[[autodoc]] transformers.tools.agent_types.AgentImage + +### AgentAudio + +[[autodoc]] transformers.tools.agent_types.AgentAudio diff --git a/docs/source/ja/main_classes/callback.md b/docs/source/ja/main_classes/callback.md new file mode 100644 index 00000000000000..3ea4938841e386 --- /dev/null +++ b/docs/source/ja/main_classes/callback.md @@ -0,0 +1,135 @@ + + + +# コールバック数 + +コールバックは、PyTorch のトレーニング ループの動作をカスタマイズできるオブジェクトです。 +トレーニング ループを検査できる [`Trainer`] (この機能は TensorFlow にはまだ実装されていません) +状態を確認し (進捗レポート、TensorBoard または他の ML プラットフォームへのログ記録など)、決定を下します (初期段階など)。 +停止中)。 + +コールバックは、返される [`TrainerControl`] オブジェクトを除けば、「読み取り専用」のコード部分です。 +トレーニング ループ内では何も変更できません。トレーニング ループの変更が必要なカスタマイズの場合は、次のことを行う必要があります。 +[`Trainer`] をサブクラス化し、必要なメソッドをオーバーライドします (例については、[trainer](trainer) を参照してください)。 + +デフォルトでは、`TrainingArguments.report_to` は `"all"` に設定されているため、[`Trainer`] は次のコールバックを使用します。 + +- [`DefaultFlowCallback`] は、ログ記録、保存、評価のデフォルトの動作を処理します。 +- [`PrinterCallback`] または [`ProgressCallback`] で進行状況を表示し、 + ログ 
(最初のログは、[`TrainingArguments`] を通じて tqdm を非アクティブ化する場合に使用され、そうでない場合に使用されます) + 2番目です)。 +- [`~integrations.TensorBoardCallback`] (PyTorch >= 1.4 を介して) tensorboard にアクセスできる場合 + またはテンソルボードX)。 +- [`~integrations.WandbCallback`] [wandb](https://www.wandb.com/) がインストールされている場合。 +- [`~integrations.CometCallback`] [comet_ml](https://www.comet.ml/site/) がインストールされている場合。 +- [mlflow](https://www.mlflow.org/) がインストールされている場合は [`~integrations.MLflowCallback`]。 +- [`~integrations.NeptuneCallback`] [neptune](https://neptune.ai/) がインストールされている場合。 +- [`~integrations.AzureMLCallback`] [azureml-sdk](https://pypi.org/project/azureml-sdk/) の場合 + インストールされています。 +- [`~integrations.CodeCarbonCallback`] [codecarbon](https://pypi.org/project/codecarbon/) の場合 + インストールされています。 +- [`~integrations.ClearMLCallback`] [clearml](https://github.com/allegroai/clearml) がインストールされている場合。 +- [`~integrations.DagsHubCallback`] [dagshub](https://dagshub.com/) がインストールされている場合。 +- [`~integrations.FlyteCallback`] [flyte](https://flyte.org/) がインストールされている場合。 +- [`~integrations.DVCLiveCallback`] [dvclive](https://www.dvc.org/doc/dvclive) がインストールされている場合。 + +パッケージがインストールされているが、付随する統合を使用したくない場合は、`TrainingArguments.report_to` を、使用したい統合のみのリストに変更できます (例: `["azure_ml", "wandb"]`) 。 + +コールバックを実装するメインクラスは [`TrainerCallback`] です。それは、 +[`TrainingArguments`] は [`Trainer`] をインスタンス化するために使用され、それにアクセスできます。 +[`TrainerState`] を介してトレーナーの内部状態を取得し、トレーニング ループ上でいくつかのアクションを実行できます。 +[`TrainerControl`]。 + +## 利用可能なコールバック + +ライブラリで利用可能な [`TrainerCallback`] のリストは次のとおりです。 + +[[autodoc]] integrations.CometCallback + - setup + +[[autodoc]] DefaultFlowCallback + +[[autodoc]] PrinterCallback + +[[autodoc]] ProgressCallback + +[[autodoc]] EarlyStoppingCallback + +[[autodoc]] integrations.TensorBoardCallback + +[[autodoc]] integrations.WandbCallback + - setup + +[[autodoc]] integrations.MLflowCallback + - setup + +[[autodoc]] integrations.AzureMLCallback + +[[autodoc]] integrations.CodeCarbonCallback + +[[autodoc]] integrations.NeptuneCallback + +[[autodoc]] integrations.ClearMLCallback + +[[autodoc]] integrations.DagsHubCallback + +[[autodoc]] integrations.FlyteCallback + +[[autodoc]] integrations.DVCLiveCallback + - setup + +## TrainerCallback + +[[autodoc]] TrainerCallback + +以下は、カスタム コールバックを PyTorch [`Trainer`] に登録する方法の例です。 + +```python +class MyCallback(TrainerCallback): + "A callback that prints a message at the beginning of training" + + def on_train_begin(self, args, state, control, **kwargs): + print("Starting training") + + +trainer = Trainer( + model, + args, + train_dataset=train_dataset, + eval_dataset=eval_dataset, + callbacks=[MyCallback], # We can either pass the callback class this way or an instance of it (MyCallback()) +) +``` + +コールバックを登録する別の方法は、次のように `trainer.add_callback()` を呼び出すことです。 + +```python +trainer = Trainer(...) 
+trainer.add_callback(MyCallback) +# Alternatively, we can pass an instance of the callback class +trainer.add_callback(MyCallback()) +``` + +## TrainerState + +[[autodoc]] TrainerState + +## TrainerControl + +[[autodoc]] TrainerControl + + diff --git a/docs/source/ja/main_classes/configuration.md b/docs/source/ja/main_classes/configuration.md new file mode 100644 index 00000000000000..7fab5269e2049f --- /dev/null +++ b/docs/source/ja/main_classes/configuration.md @@ -0,0 +1,31 @@ + + +# 構成 + +基本クラス [`PretrainedConfig`] は、設定をロード/保存するための一般的なメソッドを実装します。 +ローカル ファイルまたはディレクトリから、またはライブラリ (ダウンロードされた) によって提供される事前トレーニング済みモデル構成から +HuggingFace の AWS S3 リポジトリから)。 + +各派生構成クラスはモデル固有の属性を実装します。すべての構成クラスに存在する共通の属性は次のとおりです。 +`hidden_​​size`、`num_attention_heads`、および `num_hidden_​​layers`。テキスト モデルはさらに以下を実装します。 +`vocab_size`。 + +## PretrainedConfig + +[[autodoc]] PretrainedConfig + - push_to_hub + - all diff --git a/docs/source/ja/main_classes/data_collator.md b/docs/source/ja/main_classes/data_collator.md new file mode 100644 index 00000000000000..c37f1aeef4d1cc --- /dev/null +++ b/docs/source/ja/main_classes/data_collator.md @@ -0,0 +1,67 @@ + + +# データ照合者 + +データ照合器は、データセット要素のリストを入力として使用してバッチを形成するオブジェクトです。これらの要素は、 +`train_dataset` または `eval_dataset` の要素と同じ型。 + +バッチを構築できるようにするために、データ照合者は何らかの処理 (パディングなど) を適用する場合があります。そのうちのいくつかは( +[`DataCollat​​orForLanguageModeling`]) ランダムなデータ拡張 (ランダム マスキングなど) も適用します +形成されたバッチ上で。 + +使用例は、[サンプル スクリプト](../examples) または [サンプル ノートブック](../notebooks) にあります。 + +## Default data collator + +[[autodoc]] data.data_collator.default_data_collator + +## DefaultDataCollator + +[[autodoc]] data.data_collator.DefaultDataCollator + +## DataCollatorWithPadding + +[[autodoc]] data.data_collator.DataCollatorWithPadding + +## DataCollatorForTokenClassification + +[[autodoc]] data.data_collator.DataCollatorForTokenClassification + +## DataCollatorForSeq2Seq + +[[autodoc]] data.data_collator.DataCollatorForSeq2Seq + +## DataCollatorForLanguageModeling + +[[autodoc]] data.data_collator.DataCollatorForLanguageModeling + - numpy_mask_tokens + - tf_mask_tokens + - torch_mask_tokens + +## DataCollatorForWholeWordMask + +[[autodoc]] data.data_collator.DataCollatorForWholeWordMask + - numpy_mask_tokens + - tf_mask_tokens + - torch_mask_tokens + +## DataCollatorForPermutationLanguageModeling + +[[autodoc]] data.data_collator.DataCollatorForPermutationLanguageModeling + - numpy_mask_tokens + - tf_mask_tokens + - torch_mask_tokens diff --git a/docs/source/ja/main_classes/deepspeed.md b/docs/source/ja/main_classes/deepspeed.md new file mode 100644 index 00000000000000..4406ce4a34e21e --- /dev/null +++ b/docs/source/ja/main_classes/deepspeed.md @@ -0,0 +1,2254 @@ + + +# DeepSpeed Integration + +[DeepSpeed](https://github.com/microsoft/DeepSpeed) は、[ZeRO 論文](https://arxiv.org/abs/1910.02054) で説明されているすべてを実装します。現在、次のものを完全にサポートしています。 + +1. オプティマイザーの状態分割 (ZeRO ステージ 1) +2. 勾配分割 (ZeRO ステージ 2) +3. パラメーターの分割 (ZeRO ステージ 3) +4. カスタム混合精度トレーニング処理 +5. 一連の高速 CUDA 拡張ベースのオプティマイザー +6. CPU および NVMe への ZeRO オフロード + +ZeRO-Offload には独自の専用ペーパーがあります: [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840)。 NVMe サポートについては、論文 [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)。 + +DeepSpeed ZeRO-2 は、その機能が推論には役に立たないため、主にトレーニングのみに使用されます。 + +DeepSpeed ZeRO-3 は、巨大なモデルを複数の GPU にロードできるため、推論にも使用できます。 +単一の GPU では不可能です。 + +🤗 Transformers は、2 つのオプションを介して [DeepSpeed](https://github.com/microsoft/DeepSpeed) を統合します。 + +1. 
[`Trainer`] によるコア DeepSpeed 機能の統合。何でもやってくれるタイプです + 統合の場合 - カスタム構成ファイルを指定するか、テンプレートを使用するだけで、他に何もする必要はありません。たいていの + このドキュメントではこの機能に焦点を当てています。 +2. [`Trainer`] を使用せず、DeepSpeed を統合した独自のトレーナーを使用したい場合 + `from_pretrained` や `from_config` などのコア機能には、重要な機能の統合が含まれています。 + ZeRO ステージ 3 以降の `zero.Init`などの DeepSpeed の部分。この機能を活用するには、次のドキュメントをお読みください。 + [非トレーナー DeepSpeed 統合](#nontrainer-deepspeed-integration)。 + +統合されているもの: + +トレーニング: + +1. DeepSpeed ZeRO トレーニングは、ZeRO-Infinity (CPU および NVME オフロード) を使用して完全な ZeRO ステージ 1、2、および 3 をサポートします。 + +推論: + +1. DeepSpeed ZeRO Inference は、ZeRO-Infinity による ZeRO ステージ 3 をサポートします。トレーニングと同じ ZeRO プロトコルを使用しますが、 + オプティマイザと lr スケジューラは使用せず、ステージ 3 のみが関連します。詳細については、以下を参照してください。 + [ゼロ推論](#zero-inference)。 + +DeepSpeed Inference もあります。これは、Tensor Parallelism の代わりに Tensor Parallelism を使用するまったく異なるテクノロジーです。 +ZeRO (近日公開)。 + + + + +## Trainer Deepspeed Integration + + + + +### Installation + +pypi 経由でライブラリをインストールします。 +```bash +pip install deepspeed +``` + +または`tansformers`, `extras`経由: + +```bash +pip install transformers[deepspeed] +``` + +または、[DeepSpeed の GitHub ページ](https://github.com/microsoft/deepspeed#installation) で詳細を確認してください。 +[高度なインストール](https://www.deepspeed.ai/tutorials/advanced-install/)。 + +それでもビルドに苦労する場合は、まず [CUDA 拡張機能のインストール ノート](trainer#cuda-extension-installation-notes) を必ず読んでください。 + +拡張機能を事前ビルドせず、実行時に拡張機能がビルドされることに依存しており、上記の解決策をすべて試した場合 +それが役に立たなかった場合、次に試すべきことは、モジュールをインストールする前にモジュールを事前にビルドすることです。 + +DeepSpeed のローカル ビルドを作成するには: + +```bash +git clone https://github.com/microsoft/DeepSpeed/ +cd DeepSpeed +rm -rf build +TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \ +--global-option="build_ext" --global-option="-j8" --no-cache -v \ +--disable-pip-version-check 2>&1 | tee build.log +``` + +NVMe オフロードを使用する場合は、上記の手順に`DS_BUILD_AIO=1`を含める必要があります (また、 +*libaio-dev* システム全体にインストールします)。 + +`TORCH_CUDA_ARCH_LIST` を編集して、使用する GPU カードのアーキテクチャのコードを挿入します。すべてを仮定すると +あなたのカードは同じで、次の方法でアーチを取得できます。 + +```bash +CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())" +``` + +したがって、`8, 6`を取得した場合は、`TORCH_CUDA_ARCH_LIST="8.6"`を使用します。複数の異なるカードをお持ちの場合は、すべてをリストすることができます +それらのうち、`TORCH_CUDA_ARCH_LIST="6.1;8.6"`が好きです + +複数のマシンで同じセットアップを使用する必要がある場合は、バイナリ ホイールを作成します。 + +```bash +git clone https://github.com/microsoft/DeepSpeed/ +cd DeepSpeed +rm -rf build +TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 \ +python setup.py build_ext -j8 bdist_wheel +``` + +`dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl`のようなものが生成されるので、これをインストールできます +`pip install deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl`としてローカルまたは他のマシンにインストールします。 + +繰り返しますが、`TORCH_CUDA_ARCH_LIST`をターゲット アーキテクチャに合わせて調整することを忘れないでください。 + +NVIDIA GPU の完全なリストと、それに対応する **コンピューティング機能** (この記事の Arch と同じ) を見つけることができます。 +コンテキスト) [ここ](https://developer.nvidia.com/cuda-gpus)。 + +以下を使用して、pytorch が構築されたアーチを確認できます。 + +```bash +python -c "import torch; print(torch.cuda.get_arch_list())" +``` + +ここでは、インストールされている GPU の 1 つのアーチを見つける方法を説明します。たとえば、GPU 0 の場合: + +```bash +CUDA_VISIBLE_DEVICES=0 python -c "import torch; \ +print(torch.cuda.get_device_properties(torch.device('cuda')))" +``` + +出力が次の場合: + +```bash +_CudaDeviceProperties(name='GeForce RTX 3090', major=8, minor=6, total_memory=24268MB, multi_processor_count=82) +``` + +そうすれば、このカードのアーチが`8.6`であることがわかります。 + +`TORCH_CUDA_ARCH_LIST` を完全に省略することもできます。そうすれば、ビルド プログラムが自動的にクエリを実行します。 +ビルドが行われる GPU のアーキテクチャ。これは、ターゲット マシンの GPU と一致する場合もあれば、一致しない場合もあります。 +目的のアーチを明示的に指定することをお勧めします。 + +提案されたことをすべて試してもまだビルドの問題が発生する場合は、GitHub 
の問題に進んでください。 +[ディープスピード](https://github.com/microsoft/DeepSpeed/issues)、 + + + +### Deployment with multiple GPUs + +DeepSpeed 統合をデプロイするには、[`Trainer`] コマンド ライン引数を調整して新しい引数 `--deepspeed ds_config.json` を含めます。ここで、`ds_config.json` は DeepSpeed 構成ファイルです。 + [こちら](https://www.deepspeed.ai/docs/config-json/)に記載されています。ファイル名はあなた次第です。 + DeepSpeed の`add_config_arguments`ユーティリティを使用して、必要なコマンド ライン引数をコードに追加することをお勧めします。 + 詳細については、[DeepSpeed の引数解析](https://deepspeed.readthedocs.io/en/latest/initialize.html#argument-parsing) ドキュメントを参照してください。 + +ここで選択したランチャーを使用できます。 pytorch ランチャーを引き続き使用できます。 + +```bash +torch.distributed.run --nproc_per_node=2 your_program.py --deepspeed ds_config.json +``` + +または、`deepspeed`によって提供されるランチャーを使用します。 + +```bash +deepspeed --num_gpus=2 your_program.py --deepspeed ds_config.json +``` + +ご覧のとおり、引数は同じではありませんが、ほとんどのニーズではどちらでも機能します。の +さまざまなノードと GPU を構成する方法の詳細については、[こちら](https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node) を参照してください。 + +`deepspeed`ランチャーを使用し、利用可能なすべての GPU を使用したい場合は、`--num_gpus`フラグを省略するだけです。 + +以下は、利用可能なすべての GPU をデプロイする DeepSpeed で`run_translation.py`を実行する例です。 + +```bash +deepspeed examples/pytorch/translation/run_translation.py \ +--deepspeed tests/deepspeed/ds_config_zero3.json \ +--model_name_or_path google-t5/t5-small --per_device_train_batch_size 1 \ +--output_dir output_dir --overwrite_output_dir --fp16 \ +--do_train --max_train_samples 500 --num_train_epochs 1 \ +--dataset_name wmt16 --dataset_config "ro-en" \ +--source_lang en --target_lang ro +``` + +DeepSpeed のドキュメントには、`--deepspeed --deepspeed_config ds_config.json`が表示される可能性が高いことに注意してください。 +DeepSpeed 関連の引数が 2 つありますが、簡単にするためであり、処理すべき引数がすでに非常に多いためです。 +この 2 つを 1 つの引数に結合しました。 + +実際の使用例については、この [投稿](https://github.com/huggingface/transformers/issues/8771#issuecomment-759248400) を参照してください。 + + + + +### Deployment with one GPU + +1 つの GPU で DeepSpeed をデプロイするには、[`Trainer`] コマンド ライン引数を次のように調整します。 + +```bash +deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \ +--deepspeed tests/deepspeed/ds_config_zero2.json \ +--model_name_or_path google-t5/t5-small --per_device_train_batch_size 1 \ +--output_dir output_dir --overwrite_output_dir --fp16 \ +--do_train --max_train_samples 500 --num_train_epochs 1 \ +--dataset_name wmt16 --dataset_config "ro-en" \ +--source_lang en --target_lang ro +``` + +これは複数の GPU の場合とほぼ同じですが、ここでは、DeepSpeed に 1 つの GPU だけを使用するように明示的に指示します。 +`--num_gpus=1`。デフォルトでは、DeepSpeed は指定されたノード上で認識できるすべての GPU をデプロイします。起動する GPU が 1 つだけの場合 +の場合、この引数は必要ありません。次の [ドキュメント](https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node) では、ランチャー オプションについて説明しています。 + +1 つの GPU だけで DeepSpeed を使用したいのはなぜですか? + +1. 一部の計算とメモリをホストの CPU と RAM に委任できる ZeRO オフロード機能を備えているため、 + モデルのニーズに合わせてより多くの GPU リソースを残しておきます。より大きなバッチ サイズ、または非常に大きなモデルのフィッティングを可能にする + 普通は合わないでしょう。 +2. 
スマートな GPU メモリ管理システムを提供し、メモリの断片化を最小限に抑えます。 + より大きなモデルとデータ バッチ。 + +次に構成について詳しく説明しますが、単一の GPU で大幅な改善を実現するための鍵は次のとおりです。 +DeepSpeed を使用するには、構成ファイルに少なくとも次の構成が必要です。 + +```json +{ + "zero_optimization": { + "stage": 2, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "allgather_partitions": true, + "allgather_bucket_size": 2e8, + "reduce_scatter": true, + "reduce_bucket_size": 2e8, + "overlap_comm": true, + "contiguous_gradients": true + } +} +``` + +これにより、オプティマイザーのオフロードやその他の重要な機能が有効になります。バッファ サイズを試してみるとよいでしょう。 +詳細については、以下のディスカッションを参照してください。 + +このタイプのデプロイメントの実際的な使用例については、この [投稿](https://github.com/huggingface/transformers/issues/8771#issuecomment-759176685) を参照してください。 + +このドキュメントで詳しく説明されているように、CPU および NVMe オフロードを備えた ZeRO-3 を試すこともできます。 + +ノート: + +- GPU 0 とは異なる特定の GPU で実行する必要がある場合、`CUDA_VISIBLE_DEVICES` を使用して制限することはできません。 + 利用可能な GPU の表示範囲。代わりに、次の構文を使用する必要があります。 + + ```bash + deepspeed --include localhost:1 examples/pytorch/translation/run_translation.py ... + ``` + + この例では、DeepSpeed に GPU 1 (2 番目の GPU) を使用するように指示します。 + + + +### 複数のノードを使用したデプロイメント + +このセクションの情報は DeepSpeed 統合に固有のものではなく、あらゆるマルチノード プログラムに適用できます。ただし、DeepSpeed は、SLURM 環境でない限り、他のランチャーよりも使いやすい`deepspeed`ランチャーを提供します。 + +このセクションでは、それぞれ 8 GPU を備えた 2 つのノードがあると仮定します。また、最初のノードには `ssh hostname1` を使用して、2 番目のノードには `ssh hostname2` を使用して接続できます。両方ともパスワードなしでローカルの ssh 経由で相互に接続できる必要があります。もちろん、これらのホスト (ノード) 名を、作業している実際のホスト名に変更する必要があります。 + +#### The torch.distributed.run launcher + + +たとえば、`torch.distributed.run` を使用するには、次のようにします。 + +```bash +python -m torch.distributed.run --nproc_per_node=8 --nnode=2 --node_rank=0 --master_addr=hostname1 \ +--master_port=9901 your_program.py --deepspeed ds_config.json +``` + +各ノードに SSH で接続し、それぞれのノードで同じコマンドを実行する必要があります。急ぐ必要はありません。ランチャーは両方のノードが同期するまで待機します。 + +詳細については、[torchrun](https://pytorch.org/docs/stable/elastic/run.html) を参照してください。ちなみに、これは pytorch の数バージョン前の`torch.distributed.launch`を置き換えたランチャーでもあります。 + +#### ディープスピード ランチャー + +代わりに`deepspeed`ランチャーを使用するには、まず`hostfile`ファイルを作成する必要があります。 + +``` +hostname1 slots=8 +hostname2 slots=8 +``` + +そして、次のように起動できます。 + +```bash +deepspeed --num_gpus 8 --num_nodes 2 --hostfile hostfile --master_addr hostname1 --master_port=9901 \ +your_program.py --deepspeed ds_config.json +``` + +`torch.distributed.run`ランチャーとは異なり、`deepspeed`は両方のノードでこのコマンドを自動的に起動します。 + +詳細については、[リソース構成 (マルチノード)](https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node) を参照してください。 + +#### Launching in a SLURM environment + +SLURM 環境では、次のアプローチを使用できます。以下は、特定の SLURM 環境に適合させるために必要な slurm スクリプト `launch.slurm` です。 + +```bash +#SBATCH --job-name=test-nodes # name +#SBATCH --nodes=2 # nodes +#SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node! 
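+# (torch.distributed.run itself spawns one process per GPU on each node,
+#  so SLURM only needs to start a single task per node here)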
+#SBATCH --cpus-per-task=10 # number of cores per tasks +#SBATCH --gres=gpu:8 # number of gpus +#SBATCH --time 20:00:00 # maximum execution time (HH:MM:SS) +#SBATCH --output=%x-%j.out # output file name + +export GPUS_PER_NODE=8 +export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1) +export MASTER_PORT=9901 + +srun --jobid $SLURM_JOBID bash -c 'python -m torch.distributed.run \ + --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES --node_rank $SLURM_PROCID \ + --master_addr $MASTER_ADDR --master_port $MASTER_PORT \ +your_program.py --deepspeed ds_config.json' +``` + +あとは実行をスケジュールするだけです。 +```bash +sbatch launch.slurm +``` + +#### Use of Non-shared filesystem + +デフォルトでは、DeepSpeed はマルチノード環境が共有ストレージを使用することを想定しています。これが当てはまらず、各ノードがローカル ファイルシステムしか参照できない場合は、設定ファイルを調整して [`checkpoint`_section](https://www.deepspeed.ai/docs/config-json/#) を含める必要があります。チェックポイント オプション) を次の設定で指定します。 + + +```json +{ + "checkpoint": { + "use_node_local_storage": true + } +} +``` + +あるいは、[`Trainer`] の `--save_on_each_node` 引数を使用することもでき、上記の設定は自動的に追加されます。 + + + +### Deployment in Notebooks + +ノートブックのセルをスクリプトとして実行する場合の問題は、依存する通常の`deepspeed`ランチャーがないことです。 +特定の設定では、それをエミュレートする必要があります。 + +GPU を 1 つだけ使用している場合、DeepSpeed を使用するためにノートブック内のトレーニング コードを調整する必要がある方法は次のとおりです。 + +```python +# DeepSpeed requires a distributed environment even when only one process is used. +# This emulates a launcher in the notebook +import os + +os.environ["MASTER_ADDR"] = "localhost" +os.environ["MASTER_PORT"] = "9994" # modify if RuntimeError: Address already in use +os.environ["RANK"] = "0" +os.environ["LOCAL_RANK"] = "0" +os.environ["WORLD_SIZE"] = "1" + +# Now proceed as normal, plus pass the deepspeed config file +training_args = TrainingArguments(..., deepspeed="ds_config_zero3.json") +trainer = Trainer(...) 
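+# the DeepSpeed engine is initialized from the config file above when training starts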
+trainer.train() +``` + +注: `...` は、関数に渡す通常の引数を表します。 + +複数の GPU を使用する場合、DeepSpeed が動作するにはマルチプロセス環境を使用する必要があります。つまり、あなたは持っています +その目的でランチャーを使用することはできませんが、これは、提示された分散環境をエミュレートすることによっては実現できません。 +このセクションの冒頭で。 + +現在のディレクトリのノートブックにその場で構成ファイルを作成したい場合は、専用の +セルの内容: + +```python no-style +%%bash +cat <<'EOT' > ds_config_zero3.json +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + + "optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } + }, + + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } + }, + + "zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "offload_param": { + "device": "cpu", + "pin_memory": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": "auto", + "stage3_prefetch_bucket_size": "auto", + "stage3_param_persistence_threshold": "auto", + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_16bit_weights_on_model_save": true + }, + + "gradient_accumulation_steps": "auto", + "gradient_clipping": "auto", + "steps_per_print": 2000, + "train_batch_size": "auto", + "train_micro_batch_size_per_gpu": "auto", + "wall_clock_breakdown": false +} +EOT +``` + +トレーニング スクリプトがノートブックのセルではなく通常のファイルにある場合は、次のようにして`deepspeed`を通常どおり起動できます。 +細胞からのシェル。たとえば、`run_translation.py` を使用するには、次のように起動します。 + +```python no-style +!git clone https://github.com/huggingface/transformers +!cd transformers; deepspeed examples/pytorch/translation/run_translation.py ... +``` + +または、`%%bash` マジックを使用すると、シェル プログラムを実行するための複数行のコードを記述することができます。 + +```python no-style +%%bash + +git clone https://github.com/huggingface/transformers +cd transformers +deepspeed examples/pytorch/translation/run_translation.py ... +``` + +そのような場合、このセクションの最初に示したコードは必要ありません。 + +注: `%%bash` マジックは優れていますが、現時点では出力をバッファリングするため、プロセスが終了するまでログは表示されません。 +完了します。 + + + +### Configuration + +設定ファイルで使用できる DeepSpeed 設定オプションの完全なガイドについては、次を参照してください。 +[次のドキュメント](https://www.deepspeed.ai/docs/config-json/) にアクセスしてください。 + +さまざまな実際のニーズに対応する数十の DeepSpeed 構成例を [DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples)で見つけることができます。 +リポジトリ: + +```bash +git clone https://github.com/microsoft/DeepSpeedExamples +cd DeepSpeedExamples +find . -name '*json' +``` + +上記のコードを続けて、Lamb オプティマイザーを構成しようとしているとします。したがって、次の中から検索できます +`.json` ファイルの例: + +```bash +grep -i Lamb $(find . 
-name '*json') +``` + +さらにいくつかの例が [メイン リポジトリ](https://github.com/microsoft/DeepSpeed) にもあります。 + +DeepSpeed を使用する場合は、常に DeepSpeed 構成ファイルを指定する必要がありますが、一部の構成パラメータには +コマンドライン経由で設定します。微妙な違いについては、このガイドの残りの部分で説明します。 + +DeepSpeed 構成ファイルがどのようなものかを理解するために、ZeRO ステージ 2 機能を有効にする構成ファイルを次に示します。 +オプティマイザー状態の CPU オフロードを含み、`AdamW`オプティマイザーと`WarmupLR`スケジューラーを使用し、混合を有効にします。 +`--fp16` が渡された場合の精度トレーニング: + + +```json +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + + "optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } + }, + + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } + }, + + "zero_optimization": { + "stage": 2, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "allgather_partitions": true, + "allgather_bucket_size": 2e8, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 2e8, + "contiguous_gradients": true + }, + + "gradient_accumulation_steps": "auto", + "gradient_clipping": "auto", + "train_batch_size": "auto", + "train_micro_batch_size_per_gpu": "auto", +} +``` + +プログラムを実行すると、DeepSpeed は [`Trainer`] から受け取った設定をログに記録します。 +コンソールに渡されるため、最終的にどのような設定が渡されたのかを正確に確認できます。 + + + +### Passing Configuration + +このドキュメントで説明したように、通常、DeepSpeed 設定は json ファイルへのパスとして渡されますが、 +トレーニングの設定にコマンド ライン インターフェイスを使用せず、代わりにインスタンスを作成します。 +[`Trainer`] via [`TrainingArguments`] その後、`deepspeed` 引数については次のことができます +ネストされた `dict` を渡します。これにより、その場で構成を作成でき、それを書き込む必要がありません。 +[`TrainingArguments`] に渡す前にファイル システムを変更します。 + +要約すると、次のことができます。 + +```python +TrainingArguments(..., deepspeed="/path/to/ds_config.json") +``` + +または: + +```python +ds_config_dict = dict(scheduler=scheduler_params, optimizer=optimizer_params) +TrainingArguments(..., deepspeed=ds_config_dict) +``` + + + + +### Shared Configuration + + + +このセクションは必読です + + + +[`Trainer`] と DeepSpeed の両方が正しく機能するには、いくつかの設定値が必要です。 +したがって、検出が困難なエラーにつながる可能性のある定義の競合を防ぐために、それらを構成することにしました。 +[`Trainer`] コマンドライン引数経由。 + +さらに、一部の構成値はモデルの構成に基づいて自動的に導出されます。 +複数の値を手動で調整することを忘れないでください。[`Trainer`] に大部分を任せるのが最善です +の設定を行います。 + +したがって、このガイドの残りの部分では、特別な設定値 `auto` が表示されます。これを設定すると、 +正しい値または最も効率的な値に自動的に置き換えられます。これを無視することを自由に選択してください +推奨事項を参照し、値を明示的に設定します。この場合、次の点に十分注意してください。 +[`Trainer`] 引数と DeepSpeed 設定は一致します。たとえば、同じものを使用していますか +学習率、バッチサイズ、または勾配累積設定?これらが一致しない場合、トレーニングは非常に失敗する可能性があります +方法を検出するのが難しい。あなたは警告を受けました。 + +DeepSpeed のみに固有の値や、それに合わせて手動で設定する必要がある値が他にも複数あります。 +あなたの要望。 + +独自のプログラムで、DeepSpeed 構成をマスターとして変更したい場合は、次のアプローチを使用することもできます。 +それに基づいて [`TrainingArguments`] を設定します。手順は次のとおりです。 + +1. マスター構成として使用する DeepSpeed 構成を作成またはロードします +2. 
これらの値に基づいて [`TrainingArguments`] オブジェクトを作成します + +`scheduler.params.total_num_steps`などの一部の値は次のように計算されることに注意してください。 +`train` 中に [`Trainer`] を実行しますが、もちろん自分で計算することもできます。 + + + +### ZeRO + +[Zero Redundancy Optimizer (ZeRO)](https://www.deepspeed.ai/tutorials/zero/) は、DeepSpeed の主力製品です。それ +3 つの異なるレベル (段階) の最適化をサポートします。最初のものは、スケーラビリティの観点からはあまり興味深いものではありません。 +したがって、このドキュメントではステージ 2 と 3 に焦点を当てます。ステージ 3 は、最新の ZeRO-Infinity の追加によってさらに改善されています。 +詳細については、DeepSpeed のドキュメントを参照してください。 + +構成ファイルの `zero_optimization` セクションは最も重要な部分です ([docs](https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training))。ここで定義します +どの ZeRO ステージを有効にするか、そしてそれらをどのように構成するか。各パラメータの説明は、 +DeepSpeed のドキュメント。 + +このセクションは、DeepSpeed 設定を介してのみ設定する必要があります - [`Trainer`] が提供します +同等のコマンドライン引数はありません。 + +注: 現在、DeepSpeed はパラメーター名を検証しないため、スペルを間違えると、デフォルト設定が使用されます。 +スペルが間違っているパラメータ。 DeepSpeed エンジンの起動ログ メッセージを見て、その値を確認できます。 +使用するつもりです。 + + + +#### ZeRO-2 Config + +以下は、ZeRO ステージ 2 の構成例です。 + +```json +{ + "zero_optimization": { + "stage": 2, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "allgather_partitions": true, + "allgather_bucket_size": 5e8, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 5e8, + "contiguous_gradients": true + } +} +``` + +**性能調整:** + +- `offload_optimizer` を有効にすると、GPU RAM の使用量が削減されます (`"stage": 2` が必要です) +- `"overlap_comm": true` は、GPU RAM 使用量の増加とトレードオフして、遅延をすべて削減します。 `overlap_comm`は 4.5x を使用します + `allgather_bucket_size`と`reduce_bucket_size`の値。したがって、5e8 に設定されている場合、9GB が必要になります。 + フットプリント (`5e8 x 2Bytes x 2 x 4.5`)。したがって、8GB 以下の RAM を搭載した GPU を使用している場合、 + OOM エラーが発生した場合は、これらのパラメータを`2e8`程度に減らす必要があり、それには 3.6GB が必要になります。やりたくなるでしょう + OOM に達し始めている場合は、より大容量の GPU でも同様です。 +- これらのバッファを減らすと、より多くの GPU RAM を利用するために通信速度を犠牲にすることになります。バッファサイズが小さいほど、 + 通信が遅くなり、他のタスクで使用できる GPU RAM が増えます。したがって、バッチサイズが大きい場合は、 + 重要なのは、トレーニング時間を少し遅らせることは良いトレードになる可能性があります。 + +さらに、`deepspeed==0.4.4`には、次のコマンドで有効にできる新しいオプション`round_robin_gradients`が追加されました。 + +```json +{ + "zero_optimization": { + "round_robin_gradients": true + } +} +``` + +これは、きめ細かい勾配パーティショニングによってランク間の CPU メモリへの勾配コピーを並列化する、CPU オフロードのステージ 2 最適化です。パフォーマンスの利点は、勾配累積ステップ (オプティマイザー ステップ間のコピーの増加) または GPU 数 (並列処理の増加) に応じて増加します。 + + + +#### ZeRO-3 Config + +以下は、ZeRO ステージ 3 の構成例です。 + +```json +{ + "zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "offload_param": { + "device": "cpu", + "pin_memory": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": "auto", + "stage3_prefetch_bucket_size": "auto", + "stage3_param_persistence_threshold": "auto", + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_16bit_weights_on_model_save": true + } +} +``` + +モデルまたはアクティベーションが GPU メモリに適合せず、CPU が未使用であるために OOM が発生している場合 +`"device": "cpu"` を使用してオプティマイザの状態とパラメータを CPU メモリにメモリオフロードすると、この制限が解決される可能性があります。 +CPU メモリにオフロードしたくない場合は、`device`エントリに`cpu`の代わりに`none`を使用します。オフロード先 +NVMe については後ほど説明します。 + +固定メモリは、`pin_memory`を`true`に設定すると有効になります。この機能により、次のようなコストをかけてスループットを向上させることができます。 +他のプロセスが使用できるメモリが少なくなります。ピン留めされたメモリは、それを要求した特定のプロセスのために確保されます。 +通常、通常の CPU メモリよりもはるかに高速にアクセスされます。 + +**性能調整:** + +- `stage3_max_live_parameters`: `1e9` +- `stage3_max_reuse_distance`: `1e9` + +OOM に達した場合は、「stage3_max_live_parameters」と「stage3_max_reuse_ distance」を減らします。影響は最小限に抑えられるはずです +アクティブ化チェックポイントを実行しない限り、パフォーマンスに影響します。 `1e9`は約 2GB を消費します。記憶を共有しているのは、 +`stage3_max_live_parameters` と `stage3_max_reuse_distance` なので、加算されるものではなく、合計で 2GB 
になります。 + +`stage3_max_live_parameters` は、特定の時点で GPU 上に保持する完全なパラメータの数の上限です。 +時間。 「再利用距離」は、パラメータが将来いつ再び使用されるかを判断するために使用する指標です。 +`stage3_max_reuse_ distance`を使用して、パラメータを破棄するか保持するかを決定します。パラメータが +近い将来に再び使用される予定 (`stage3_max_reuse_distance`未満) なので、通信を減らすために保持します。 +オーバーヘッド。これは、アクティベーション チェックポイントを有効にしている場合に非常に役立ちます。フォワード再計算が行われ、 +backward は単一レイヤー粒度を渡し、後方再計算までパラメータを前方再計算に保持したいと考えています。 + +次の構成値は、モデルの非表示サイズによって異なります。 + +- `reduce_bucket_size`: `hidden_size*hidden_size` +- `stage3_prefetch_bucket_size`: `0.9 * hidden_size * hidden_size` +- `stage3_param_persistence_threshold`: `10 * hidden_size` + +したがって、これらの値を `auto` に設定すると、[`Trainer`] が推奨される値を自動的に割り当てます。 +価値観。ただし、もちろん、これらを明示的に設定することもできます。 + +`stage3_gather_16bit_weights_on_model_save` は、モデルの保存時にモデル fp16 の重み統合を有効にします。大きい +モデルと複数の GPU の場合、これはメモリと速度の両方の点で高価な操作です。現在必須となっているのは、 +トレーニングを再開する予定です。この制限を取り除き、より便利にする今後のアップデートに注目してください。 +フレキシブル。 + +ZeRO-2 構成から移行している場合は、`allgather_partitions`、`allgather_bucket_size`、および +`reduce_scatter`設定パラメータは ZeRO-3 では使用されません。これらを設定ファイルに保存しておくと、 +無視される。 + +- `sub_group_size`: `1e9` + + +`sub_group_size` は、オプティマイザーのステップ中にパラメーターが更新される粒度を制御します。パラメータは次のとおりです。 +`sub_group_size` のバケットにグループ化され、各バケットは一度に 1 つずつ更新されます。 NVMeオフロードで使用する場合 +したがって、ZeRO-Infinity の `sub_group_size`は、モデルの状態が CPU に出入りする粒度を制御します。 +オプティマイザステップ中に NVMe からメモリを取得します。これにより、非常に大規模なモデルの CPU メモリ不足が防止されます。 + +NVMe オフロードを使用しない場合は、`sub_group_size`をデフォルト値の *1e9* のままにすることができます。変更することもできます +次の場合のデフォルト値: + +1. オプティマイザー ステップ中に OOM が発生する: `sub_group_size` を減らして、一時バッファーのメモリ使用量を削減します。 +2. オプティマイザー ステップに時間がかかります。`sub_group_size`を増やして、帯域幅の使用率を向上させます。 + データバッファの増加。 + +#### ZeRO-0 Config + +ステージ 0 と 1 はめったに使用されないため、最後にリストしていることに注意してください。 + +ステージ 0 では、すべてのタイプのシャーディングを無効にし、DDP として DeepSpeed のみを使用します。次のコマンドでオンにできます。 + +```json +{ + "zero_optimization": { + "stage": 0 + } +} +``` + +これにより、他に何も変更する必要がなく、基本的に ZeRO が無効になります。 + +#### ZeRO-1 Config + +ステージ 1 は、ステージ 2 からグラデーション シャーディングを除いたものです。オプティマイザーの状態をシャード化するだけで、処理を少し高速化するためにいつでも試すことができます。 + +```json +{ + "zero_optimization": { + "stage": 1 + } +} +``` + + + +### NVMe Support + +ZeRO-Infinity は、GPU と CPU メモリを NVMe メモリで拡張することで、非常に大規模なモデルのトレーニングを可能にします。おかげで +スマート パーティショニングおよびタイリング アルゴリズムでは、各 GPU が非常に少量のデータを送受信する必要があります。 +オフロードにより、最新の NVMe がトレーニングに利用できる合計メモリ プールをさらに大きくするのに適していることが判明しました。 +プロセス。 ZeRO-Infinity には、ZeRO-3 が有効になっている必要があります。 + +次の設定例では、NVMe がオプティマイザの状態とパラメータの両方をオフロードできるようにします。 + +```json +{ + "zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "nvme", + "nvme_path": "/local_nvme", + "pin_memory": true, + "buffer_count": 4, + "fast_init": false + }, + "offload_param": { + "device": "nvme", + "nvme_path": "/local_nvme", + "pin_memory": true, + "buffer_count": 5, + "buffer_size": 1e8, + "max_in_cpu": 1e9 + }, + "aio": { + "block_size": 262144, + "queue_depth": 32, + "thread_count": 1, + "single_submit": false, + "overlap_events": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": "auto", + "stage3_prefetch_bucket_size": "auto", + "stage3_param_persistence_threshold": "auto", + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_16bit_weights_on_model_save": true + }, +} +``` + +オプティマイザの状態とパラメータの両方を NVMe にオフロードするか、どちらか 1 つだけをオフロードするか、まったくオフロードしないかを選択できます。たとえば、次の場合 +利用可能な CPU メモリが大量にある場合は、高速になるため、必ず CPU メモリのみにオフロードしてください (ヒント: +*"device": "CPU"*)。 + +[オプティマイザーの状態](https://www.deepspeed.ai/docs/config-json/#optimizer-offloading) と [パラメーター](https://www.deepspeed.ai/docs/config-json/#parameter-offloading)。 + 
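+
+たとえば、オプティマイザーの状態だけを CPU に、パラメータだけを NVMe にオフロードするといった組み合わせも可能です。以下はその場合の設定のスケッチです(値はあくまで一例であり、お使いの環境に合わせて調整してください)。
+
+```json
+{
+    "zero_optimization": {
+        "stage": 3,
+        "offload_optimizer": {
+            "device": "cpu",
+            "pin_memory": true
+        },
+        "offload_param": {
+            "device": "nvme",
+            "nvme_path": "/local_nvme",
+            "pin_memory": true
+        }
+    }
+}
+```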
+`nvme_path`が実際に NVMe であることを確認してください。NVMe は通常のハードドライブまたは SSD で動作しますが、 +はるかに遅くなります。高速スケーラブルなトレーニングは、最新の NVMe 転送速度を念頭に置いて設計されました (この時点では +書き込みでは、読み取り最大 3.5 GB/秒、書き込み最大 3 GB/秒のピーク速度が得られます)。 + +最適な`aio`構成ブロックを見つけるには、ターゲット設定でベンチマークを実行する必要があります。 +[ここで説明](https://github.com/microsoft/DeepSpeed/issues/998)。 + + + + +#### ZeRO-2 vs ZeRO-3 Performance + +ZeRO-3 は、他のすべてが同じように構成されている場合、ZeRO-2 よりも遅くなる可能性があります。前者は収集する必要があるためです。 +ZeRO-2 の機能に加えてモデルの重み付けを行います。 ZeRO-2 がニーズを満たし、数個の GPU を超えて拡張する必要がない場合 +そうすれば、それに固執することを選択することもできます。 ZeRO-3 により、はるかに高いスケーラビリティ容量が可能になることを理解することが重要です +スピードを犠牲にして。 + +ZeRO-3 の構成を調整して、ZeRO-2 に近づけることができます。 + +- `stage3_param_persistence_threshold` を非常に大きな数値に設定します。たとえば、`6 * hidden_​​size * hidden_​​size` のように、最大​​パラメータよりも大きくなります。これにより、パラメータが GPU に保持されます。 +- ZeRO-2 にはそのオプションがないため、`offload_params` をオフにします。 + +変更しなくても、`offload_params`をオフにするだけでパフォーマンスが大幅に向上する可能性があります。 +`stage3_param_persistence_threshold`。もちろん、これらの変更はトレーニングできるモデルのサイズに影響します。それで +これらは、ニーズに応じて、スケーラビリティと引き換えに速度を向上させるのに役立ちます。 + + + +#### ZeRO-2 Example + +以下は、完全な ZeRO-2 自動構成ファイル `ds_config_zero2.json` です。 + +```json +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + + "optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } + }, + + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } + }, + + "zero_optimization": { + "stage": 2, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "allgather_partitions": true, + "allgather_bucket_size": 2e8, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 2e8, + "contiguous_gradients": true + }, + + "gradient_accumulation_steps": "auto", + "gradient_clipping": "auto", + "steps_per_print": 2000, + "train_batch_size": "auto", + "train_micro_batch_size_per_gpu": "auto", + "wall_clock_breakdown": false +} +``` + +以下は、手動で設定された完全な ZeRO-2 のすべてが有効な構成ファイルです。ここでは主に、典型的なものを確認するためのものです。 +値は次のようになりますが、複数の`auto`設定が含まれる値を使用することを強くお勧めします。 + +```json +{ + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + + "optimizer": { + "type": "AdamW", + "params": { + "lr": 3e-5, + "betas": [0.8, 0.999], + "eps": 1e-8, + "weight_decay": 3e-7 + } + }, + + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": 0, + "warmup_max_lr": 3e-5, + "warmup_num_steps": 500 + } + }, + + "zero_optimization": { + "stage": 2, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "allgather_partitions": true, + "allgather_bucket_size": 2e8, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 2e8, + "contiguous_gradients": true + }, + + "steps_per_print": 2000, + "wall_clock_breakdown": false +} +``` + + + +#### ZeRO-3 Example + +以下は、完全な ZeRO-3 自動構成ファイル`ds_config_zero3.json`です。 + +```json +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + + "optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } + }, + + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } + }, + + "zero_optimization": { + "stage": 3, + 
"offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "offload_param": { + "device": "cpu", + "pin_memory": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": "auto", + "stage3_prefetch_bucket_size": "auto", + "stage3_param_persistence_threshold": "auto", + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_16bit_weights_on_model_save": true + }, + + "gradient_accumulation_steps": "auto", + "gradient_clipping": "auto", + "steps_per_print": 2000, + "train_batch_size": "auto", + "train_micro_batch_size_per_gpu": "auto", + "wall_clock_breakdown": false +} +``` + +以下は、手動で設定された完全な ZeRO-3 のすべてが有効な構成ファイルです。ここでは主に、典型的なものを確認するためのものです。 +値は次のようになりますが、複数の`auto`設定が含まれる値を使用することを強くお勧めします。 + +```json +{ + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + + "optimizer": { + "type": "AdamW", + "params": { + "lr": 3e-5, + "betas": [0.8, 0.999], + "eps": 1e-8, + "weight_decay": 3e-7 + } + }, + + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": 0, + "warmup_max_lr": 3e-5, + "warmup_num_steps": 500 + } + }, + + "zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "offload_param": { + "device": "cpu", + "pin_memory": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": 1e6, + "stage3_prefetch_bucket_size": 0.94e6, + "stage3_param_persistence_threshold": 1e4, + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_16bit_weights_on_model_save": true + }, + + "steps_per_print": 2000, + "wall_clock_breakdown": false +} +``` + +#### How to Choose Which ZeRO Stage and Offloads To Use For Best Performance + +これで、さまざまな段階があることがわかりました。どちらを使用するかをどのように決定すればよいでしょうか?このセクションでは、この質問に答えていきます。 + +一般に、次のことが当てはまります。 + +- 速度の点(左の方が右より速い) + +ステージ 0 (DDP) > ステージ 1 > ステージ 2 > ステージ 2 + オフロード > ステージ 3 > ステージ 3 + オフロード + +- GPU メモリの使用状況 (右は左よりも GPU メモリ効率が高い) + +ステージ 0 (DDP) < ステージ 1 < ステージ 2 < ステージ 2 + オフロード < ステージ 3 < ステージ 3 + オフロード + +したがって、最小限の数の GPU に収まりながら最速の実行を実現したい場合は、次のプロセスに従うことができます。最も速いアプローチから開始し、GPU OOM に陥った場合は、次に遅いアプローチに進みますが、これにより使用される GPU メモリが少なくなります。などなど。 + +まず、バッチ サイズを 1 に設定します (必要な有効バッチ サイズに対して、いつでも勾配累積を使用できます)。 + +1. `--gradient_checkpointing 1` (HF Trainer) または直接 `model.gradient_checkpointing_enable()` を有効にします - OOM の場合 +2. 最初に ZeRO ステージ 2 を試してください。 OOMの場合 +3. ZeRO ステージ 2 + `offload_optimizer` を試します - OOM の場合 +4. ZeRO ステージ 3 に切り替える - OOM の場合 +5. `cpu` に対して `offload_param` を有効にします - OOM の場合 +6. OOM の場合は、`cpu`に対して`offload_optimizer`を有効にします。 + +7. それでもバッチ サイズ 1 に適合しない場合は、まずさまざまなデフォルト値を確認し、可能であれば値を下げます。たとえば、`generate`を使用し、広い検索ビームを使用しない場合は、大量のメモリを消費するため、検索ビームを狭くします。 + +8. fp32 では必ず混合半精度を使用します。つまり、Ampere 以上の GPU では bf16、古い GPU アーキテクチャでは fp16 を使用します。 + +9. それでも OOM を行う場合は、ハードウェアを追加するか、ZeRO-Infinity を有効にすることができます。つまり、オフロード `offload_param` と `offload_optimizer` を `nvme` に切り替えます。非常に高速な nvme であることを確認する必要があります。逸話として、ZeRO-Infinity を使用して小さな GPU で BLOOM-176B を推論することができましたが、非常に遅かったです。でも、うまくいきました! 
+ +もちろん、最も GPU メモリ効率の高い構成から始めて、後から逆に進むことで、これらの手順を逆に実行することもできます。あるいは二等分してみてください。 + +OOM を引き起こさないバッチ サイズ 1 を取得したら、実効スループットを測定します。 + +次に、バッチ サイズをできるだけ大きくしてみます。バッチ サイズが大きいほど、乗算する行列が巨大な場合に GPU のパフォーマンスが最高になるため、GPU の効率が向上します。 + +ここで、パフォーマンス最適化ゲームが始まります。一部のオフロード機能をオフにするか、ZeRO 段階でステップダウンしてバッチ サイズを増減して、実効スループットを再度測定することができます。満足するまで洗い流し、繰り返します。 + +永遠にこれに費やす必要はありませんが、3 か月のトレーニングを開始しようとしている場合は、スループットに関して最も効果的な設定を見つけるために数日かけてください。そのため、トレーニングのコストが最小限になり、トレーニングをより早く完了できます。現在の目まぐるしく変化する ML の世界では、何かをトレーニングするのにさらに 1 か月かかる場合、絶好の機会を逃す可能性があります。もちろん、これは私が意見を共有しているだけであり、決してあなたを急かそうとしているわけではありません。 BLOOM-176B のトレーニングを開始する前に、このプロセスに 2 日間費やし、スループットを 90 TFLOP から 150 TFLOP に向上させることができました。この取り組みにより、トレーニング時間を 1 か月以上節約できました。 + +これらのメモは主にトレーニング モード用に書かれたものですが、ほとんどの場合は推論にも適用されるはずです。たとえば、勾配チェックポイントはトレーニング中にのみ役立つため、推論中は何も行われません。さらに、マルチ GPU 推論を実行していて、[DeepSpeed-Inference](https://www.deepspeed.ai/tutorials/inference-tutorial/)、[Accelerate](https://ハグフェイス.co/blog/bloom-inference-pytorch-scripts) は優れたパフォーマンスを提供するはずです。 + + +その他のパフォーマンス関連の簡単なメモ: +- 何かを最初からトレーニングしている場合は、常に 16 で割り切れる形状のテンソル (隠れたサイズなど) を使用するようにしてください。バッチ サイズについては、少なくとも 2 で割り切れるようにしてください。 GPU からさらに高いパフォーマンスを引き出したい場合は、ハードウェア固有の [波とタイルの量子化](https://developer.nvidia.com/blog/optimizing-gpu-performance-tensor-cores/) の可分性があります。 + +### Activation Checkpointing or Gradient Checkpointing + +アクティベーション チェックポイントと勾配チェックポイントは、同じ方法論を指す 2 つの異なる用語です。とてもややこしいですが、こんな感じです。 + +勾配チェックポイントを使用すると、速度を GPU メモリと引き換えにできます。これにより、GPU OOM を克服したり、バッチ サイズを増やすことができ、多くの場合、パフォーマンスの向上につながります。 + +HF Transformers モデルは、DeepSpeed のアクティベーション チェックポイントについて何も知らないため、DeepSpeed 構成ファイルでその機能を有効にしようとしても、何も起こりません。 + +したがって、この非常に有益な機能を活用するには 2 つの方法があります。 + +1. HF Transformers モデルを使用したい場合は、`model.gradient_checkpointing_enable()` を実行するか、HF トレーナーで `--gradient_checkpointing` を使用します。これにより、これが自動的に有効になります。そこで使われるのが `torch.utils.checkpoint` です。 +2. 
独自のモデルを作成し、DeepSpeed のアクティベーション チェックポイントを使用したい場合は、[そこで規定されている API](https://deepspeed.readthedocs.io/en/latest/activation-checkpointing.html) を使用できます。 HF Transformers モデリング コードを使用して、`torch.utils.checkpoint` を DeepSpeed の API に置き換えることもできます。後者は、順方向アクティベーションを再計算する代わりに CPU メモリにオフロードできるため、より柔軟です。 + +### Optimizer and Scheduler + +`offload_optimizer`を有効にしない限り、DeepSpeed スケジューラーと HuggingFace スケジューラーを組み合わせて使用​​できます。 +オプティマイザー (HuggingFace スケジューラーと DeepSpeed オプティマイザーの組み合わせを除く): + +| Combos | HF Scheduler | DS Scheduler | +|:-------------|:-------------|:-------------| +| HF Optimizer | Yes | Yes | +| DS Optimizer | No | Yes | + +`offload_optimizer`が有効な場合、CPU と +GPU 実装 (LAMB を除く)。 + + + + +#### Optimizer + +DeepSpeed の主なオプティマイザーは、Adam、AdamW、OneBitAdam、Lamb です。これらは ZeRO で徹底的にテストされており、 +したがって、使用することをお勧めします。ただし、他のオプティマイザを「torch」からインポートすることはできます。完全なドキュメントは [こちら](https://www.deepspeed.ai/docs/config-json/#optimizer-parameters) にあります。 + +設定ファイルで `optimizer` エントリを設定しない場合、[`Trainer`] は +自動的に`AdamW`に設定され、指定された値または次のコマンドラインのデフォルトが使用されます。 +引数: `--learning_rate`、`--adam_beta1`、`--adam_beta2`、`--adam_epsilon`、および `--weight_decay`。 + +以下は、`AdamW`の自動構成された`optimizer`エントリの例です。 + +```json +{ + "optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } + } +} +``` + +コマンドライン引数によって構成ファイル内の値が設定されることに注意してください。これは 1 つあるためです +値の決定的なソースを提供し、たとえば学習率が次のように設定されている場合に、見つけにくいエラーを回避します。 +さまざまな場所でさまざまな価値観。コマンドラインのルール。オーバーライドされる値は次のとおりです。 + +- `lr` と `--learning_rate` の値 +- `betas` と `--adam_beta1 --adam_beta2` の値 +- `eps` と `--adam_epsilon` の値 +- `weight_decay` と `--weight_decay` の値 + +したがって、コマンドラインで共有ハイパーパラメータを調整することを忘れないでください。 + +値を明示的に設定することもできます。 + +```json +{ + "optimizer": { + "type": "AdamW", + "params": { + "lr": 0.001, + "betas": [0.8, 0.999], + "eps": 1e-8, + "weight_decay": 3e-7 + } + } +} +``` + +ただし、[`Trainer`] コマンドライン引数と DeepSpeed を自分で同期することになります。 +構成。 + +上記にリストされていない別のオプティマイザーを使用する場合は、トップレベルの構成に追加する必要があります。 + +```json +{ + "zero_allow_untested_optimizer": true +} +``` + +`AdamW`と同様に、公式にサポートされている他のオプティマイザーを構成できます。これらは異なる設定値を持つ可能性があることに注意してください。例えばAdam の場合は、`weight_decay`を`0.01`付近にする必要があります。 + +さらに、オフロードは、Deepspeed の CPU Adam オプティマイザーと併用すると最も効果的に機能します。 `deepspeed==0.8.3` なので、オフロードで別のオプティマイザーを使用したい場合は、以下も追加する必要があります。 + +```json +{ + "zero_force_ds_cpu_optimizer": false +} +``` + +最上位の構成に移行します。 + + + + +#### Scheduler + + +DeepSpeed は、`LRRangeTest`、`OneCycle`、`WarmupLR`、および`WarmupDecayLR`学習率スケジューラーをサポートしています。完全な +ドキュメントは[ここ](https://www.deepspeed.ai/docs/config-json/#scheduler-parameters)です。 + +ここでは、🤗 Transformers と DeepSpeed の間でスケジューラーが重複する場所を示します。 + +- `--lr_scheduler_type constant_with_warmup` 経由の `WarmupLR` +- `--lr_scheduler_type Linear` を介した `WarmupDecayLR`。これは `--lr_scheduler_type` のデフォルト値でもあります。 + したがって、スケジューラを設定しない場合、これがデフォルトで設定されるスケジューラになります。 + +設定ファイルで `scheduler` エントリを設定しない場合、[`Trainer`] は +`--lr_scheduler_type`、`--learning_rate`、および `--warmup_steps` または `--warmup_ratio` の値を設定します。 +🤗 それのトランスフォーマーバージョン。 + +以下は、`WarmupLR`の自動構成された`scheduler`エントリの例です。 + +```json +{ + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } + } +} +``` + +*"auto"* が使用されているため、[`Trainer`] 引数は設定に正しい値を設定します。 +ファイル。これは、値の決定的なソースが 1 つあることと、たとえば次のような場合に見つけにくいエラーを避けるためです。 +学習率は、場所ごとに異なる値に設定されます。コマンドラインのルール。設定される値は次のとおりです。 + +- `warmup_min_lr` の値は `0` です。 +- `warmup_max_lr` と `--learning_rate` の値。 +- `warmup_num_steps` と `--warmup_steps` の値 (指定されている場合)。それ以外の場合は `--warmup_ratio` を使用します + 
トレーニング ステップの数を乗算し、切り上げます。 +- `total_num_steps` には `--max_steps` の値を指定するか、指定されていない場合は実行時に自動的に導出されます。 + 環境、データセットのサイズ、およびその他のコマンド ライン引数 ( + `WarmupDecayLR`)。 + +もちろん、構成値の一部またはすべてを引き継いで、自分で設定することもできます。 + +```json +{ + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": 0, + "warmup_max_lr": 0.001, + "warmup_num_steps": 1000 + } + } +} +``` + +ただし、[`Trainer`] コマンドライン引数と DeepSpeed を自分で同期することになります。 +構成。 + +たとえば、`WarmupDecayLR`の場合は、次のエントリを使用できます。 + +```json +{ + "scheduler": { + "type": "WarmupDecayLR", + "params": { + "last_batch_iteration": -1, + "total_num_steps": "auto", + "warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } + } +} +``` + +`total_num_steps`、`warmup_max_lr`、`warmup_num_steps`、および `total_num_steps` はロード時に設定されます。 + + + +### fp32 Precision + +Deepspeed は、完全な fp32 と fp16 の混合精度をサポートします。 + +fp16 混合精度を使用すると、必要なメモリが大幅に削減され、速度が向上するため、 +使用しているモデルがこのトレーニング モードで適切に動作しない場合は、使用しない方がよいでしょう。通常これ +モデルが fp16 混合精度で事前トレーニングされていない場合に発生します (たとえば、これは bf16 で事前トレーニングされた場合によく発生します) +モデル)。このようなモデルでは、オーバーフローまたはアンダーフローが発生し、`NaN`損失が発生する可能性があります。これがあなたの場合は、使用したいと思うでしょう +完全な fp32 モード。デフォルトの fp16 混合精度モードを次のように明示的に無効にします。 + +```json +{ + "fp16": { + "enabled": false, + } +} +``` + +Ampere アーキテクチャ ベースの GPU を使用している場合、pytorch バージョン 1.7 以降は自動的に を使用するように切り替わります。 +一部の操作でははるかに効率的な tf32 形式を使用しますが、結果は依然として fp32 になります。詳細と +ベンチマークについては、[Ampere デバイス上の TensorFloat-32(TF32)](https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices) を参照してください。文書には以下が含まれます +何らかの理由でこの自動変換を使用したくない場合は、この自動変換を無効にする方法について説明します。 + +🤗 トレーナーでは、`--tf32` を使用して有効にするか、`--tf32 0` または `--no_tf32` を使用して無効にすることができます。デフォルトでは、PyTorch のデフォルトが使用されます。 + + + +### Automatic Mixed Precision + +pytorch のような AMP の方法または apex のような方法で自動混合精度を使用できます。 + +### fp16 + +fp16 (float16) を設定して pytorch AMP のようなモードを設定するには: + +```json +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + } +} +``` + +[`Trainer`] は、の値に基づいてそれを自動的に有効または無効にします。 +`args.fp16_backend`。残りの設定値はあなた次第です。 + +このモードは、`--fp16 --fp16_backend amp`または`--fp16_full_eval`コマンドライン引数が渡されると有効になります。 + +このモードを明示的に有効/無効にすることもできます。 + +```json +{ + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + } +} +``` + +ただし、[`Trainer`] コマンドライン引数と DeepSpeed を自分で同期することになります。 +構成。 + +これが[ドキュメント](https://www.deepspeed.ai/docs/config-json/#fp16-training-options)です。 + +### BF16 + +fp16 の代わりに bf16 (bfloat16) が必要な場合は、次の構成セクションが使用されます。 + +```json +{ + "bf16": { + "enabled": "auto" + } +} +``` + +bf16 は fp32 と同じダイナミック レンジを備えているため、損失スケーリングは必要ありません。 + +このモードは、`--bf16` または `--bf16_full_eval` コマンドライン引数が渡されると有効になります。 + +このモードを明示的に有効/無効にすることもできます。 + +```json +{ + "bf16": { + "enabled": true + } +} +``` + + + +`deepspeed==0.6.0`の時点では、bf16 サポートは新しく実験的なものです。 + +bf16 が有効な状態で [勾配累積](#gradient-accumulation) を使用する場合は、bf16 で勾配が累積されることに注意する必要があります。この形式の精度が低いため、これは希望どおりではない可能性があります。損失のある蓄積につながります。 + +この問題を修正し、より高精度の `dtype` (fp16 または fp32) を使用するオプションを提供するための作業が行われています。 + + + + +### NCCL Collectives + +訓練体制の`dtype`があり、さまざまな削減や収集/分散操作などのコミュニケーション集合体に使用される別の`dtype`があります。 + +すべての収集/分散操作は、データが含まれているのと同じ `dtype` で実行されるため、bf16 トレーニング体制を使用している場合、データは bf16 で収集されます。収集は損失のない操作です。 + +さまざまなリデュース操作は非常に損失が大きい可能性があります。たとえば、複数の GPU 間で勾配が平均化される場合、通信が fp16 または bf16 で行われる場合、結果は損失が多くなる可能性があります。複数の数値を低精度でアドバタイズすると結果は正確ではないためです。 。 bf16 では fp16 よりも精度が低いため、さらにそうです。通常は非常に小さい grad を平均する際の損失が最小限に抑えられるため、fp16 
で十分であることがよくあります。したがって、デフォルトでは、半精度トレーニングでは fp16 がリダクション演算のデフォルトとして使用されます。ただし、この機能を完全に制御でき、必要に応じて小さなオーバーヘッドを追加して、リダクションが累積 dtype として fp32 を使用し、結果の準備ができた場合にのみ半精度 `dtype` にダウンキャストするようにすることもできます。でトレーニング中です。 + +デフォルトをオーバーライドするには、新しい構成エントリを追加するだけです。 + +```json +{ + "communication_data_type": "fp32" +} +``` + +この記事の執筆時点での有効な値は、"fp16"、"bfp16"、"fp32"です。 + +注: ステージ ゼロ 3 には、bf16 通信タイプに関するバグがあり、`deepspeed==0.8.1`で修正されました。 + +### apex + +apex AMP のようなモード セットを設定するには: + +```json +"amp": { + "enabled": "auto", + "opt_level": "auto" +} +``` + +[`Trainer`] は `args.fp16_backend` の値に基づいて自動的に設定します。 +`args.fp16_opt_level`。 + +このモードは、`--fp16 --fp16_backend apex --fp16_opt_level 01`コマンド ライン引数が渡されると有効になります。 + +このモードを明示的に構成することもできます。 + +```json +{ + "amp": { + "enabled": true, + "opt_level": "O1" + } +} +``` + +ただし、[`Trainer`] コマンドライン引数と DeepSpeed を自分で同期することになります。 +構成。 + +これは[ドキュメント](https://www.deepspeed.ai/docs/config-json/#automatic-mixed-precision-amp-training-options)です。 + + + +### Batch Size + +バッチサイズを設定するには、次を使用します。 + + +```json +{ + "train_batch_size": "auto", + "train_micro_batch_size_per_gpu": "auto" +} +``` + +[`Trainer`] は自動的に `train_micro_batch_size_per_gpu` を次の値に設定します。 +`args.per_device_train_batch_size`と`train_batch_size`を`args.world_size * args.per_device_train_batch_size * args.gradient_accumulation_steps`に変更します。 + +値を明示的に設定することもできます。 + +```json +{ + "train_batch_size": 12, + "train_micro_batch_size_per_gpu": 4 +} +``` + +ただし、[`Trainer`] コマンドライン引数と DeepSpeed を自分で同期することになります。 +構成。 + + + +### Gradient Accumulation + +勾配累積セットを構成するには: + +```json +{ + "gradient_accumulation_steps": "auto" +} +``` + +[`Trainer`] は自動的にそれを `args.gradient_accumulation_steps` の値に設定します。 + +値を明示的に設定することもできます。 + +```json +{ + "gradient_accumulation_steps": 3 +} +``` + +ただし、[`Trainer`] コマンドライン引数と DeepSpeed を自分で同期することになります。 +構成。 + + + +### Gradient Clipping + +グラデーション グラデーション クリッピング セットを構成するには: + +```json +{ + "gradient_clipping": "auto" +} +``` + +[`Trainer`] は自動的にそれを `args.max_grad_norm` の値に設定します。 + +値を明示的に設定することもできます。 + +```json +{ + "gradient_clipping": 1.0 +} +``` + +ただし、[`Trainer`] コマンドライン引数と DeepSpeed を自分で同期することになります。 +構成。 + + + +### Getting The Model Weights Out + +トレーニングを継続し、DeepSpeed の使用を再開する限り、何も心配する必要はありません。 DeepSpeed ストア +fp32 のカスタム チェックポイント オプティマイザー ファイル内のマスターの重み。これは `global_step*/*optim_states.pt` (これは glob +パターン)、通常のチェックポイントの下に保存されます。 + +**FP16 ウェイト:** + +モデルを ZeRO-2 で保存すると、モデルの重みを含む通常の `pytorch_model.bin` ファイルが作成されますが、 +これらは重みの fp16 バージョンにすぎません。 + +ZeRO-3 では、モデルの重みが複数の GPU に分割されるため、状況はさらに複雑になります。 +したがって、fp16 を保存するための `Trainer` を取得するには、`"stage3_gather_16bit_weights_on_model_save": true` が必要です。 +重みのバージョン。この設定が`False`の場合、`pytorch_model.bin`は作成されません。これは、デフォルトで DeepSpeed の `state_dict` に実際の重みではなくプレースホルダーが含まれるためです。この `state_dict` を保存した場合、ロードし直すことはできません。 + +```json +{ + "zero_optimization": { + "stage3_gather_16bit_weights_on_model_save": true + } +} +``` + +**FP32 重量:** + +fp16 ウェイトはトレーニングを再開するのに適していますが、モデルの微調整が完了し、それを +[モデル ハブ](https://huggingface.co/models) にアクセスするか、fp32 を入手したいと思われる他の人に渡します。 +重み。これは大量のメモリを必要とするプロセスであるため、トレーニング中に行うべきではないのが理想的です。 +したがって、トレーニングの完了後にオフラインで実行するのが最適です。ただし、必要に応じて、空き CPU が十分にある場合は、 +同じトレーニング スクリプトで実行できることを思い出してください。次のセクションでは、両方のアプローチについて説明します。 + + +**ライブ FP32 ウェイト リカバリ:** + +モデルが大きく、トレーニングの終了時に空き CPU メモリがほとんど残っていない場合、このアプローチは機能しない可能性があります。 + +少なくとも 1 つのチェックポイントを保存していて、最新のチェックポイントを使用したい場合は、次の手順を実行できます。 + +```python +from transformers.trainer_utils import get_last_checkpoint +from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint + +checkpoint_dir = 
get_last_checkpoint(trainer.args.output_dir) +fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir) +``` + +`--load_best_model_at_end` class:*~transformers.TrainingArguments* 引数を使用している場合 (最適なモデルを追跡するため) +チェックポイント)、最初に最終モデルを明示的に保存してから、上記と同じことを行うことでトレーニングを終了できます。 + +```python +from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint + +checkpoint_dir = os.path.join(trainer.args.output_dir, "checkpoint-final") +trainer.deepspeed.save_checkpoint(checkpoint_dir) +fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir) +``` + + + +`load_state_dict_from_zero_checkpoint` が実行されると、`model` はもはや使用できなくなることに注意してください。 +同じアプリケーションの DeepSpeed コンテキスト。つまり、deepspeed エンジンを再初期化する必要があります。 +`model.load_state_dict(state_dict)` はそこからすべての DeepSpeed マジックを削除します。したがって、これは最後にのみ実行してください +トレーニングの様子。 + + + + +もちろん、class:*~transformers.Trainer* を使用する必要はなく、上記の例を独自のものに調整することができます。 +トレーナー。 + +何らかの理由でさらに改良したい場合は、重みの fp32 `state_dict` を抽出して適用することもできます。 +次の例に示すように、これらは自分で作成します。 + +```python +from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint + +state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu +model = model.cpu() +model.load_state_dict(state_dict) +``` + +**オフライン FP32 ウェイト リカバリ:** + +DeepSpeed は特別な変換スクリプト`zero_to_fp32.py`を作成し、チェックポイントの最上位に配置します。 +フォルダ。このスクリプトを使用すると、いつでも重みを抽出できます。スクリプトはスタンドアロンなので、もう必要ありません。 +抽出を行うための設定ファイルまたは `Trainer` が必要です。 + +チェックポイント フォルダーが次のようになっているとします。 + +```bash +$ ls -l output_dir/checkpoint-1/ +-rw-rw-r-- 1 stas stas 1.4K Mar 27 20:42 config.json +drwxrwxr-x 2 stas stas 4.0K Mar 25 19:52 global_step1/ +-rw-rw-r-- 1 stas stas 12 Mar 27 13:16 latest +-rw-rw-r-- 1 stas stas 827K Mar 27 20:42 optimizer.pt +-rw-rw-r-- 1 stas stas 231M Mar 27 20:42 pytorch_model.bin +-rw-rw-r-- 1 stas stas 623 Mar 27 20:42 scheduler.pt +-rw-rw-r-- 1 stas stas 1.8K Mar 27 20:42 special_tokens_map.json +-rw-rw-r-- 1 stas stas 774K Mar 27 20:42 spiece.model +-rw-rw-r-- 1 stas stas 1.9K Mar 27 20:42 tokenizer_config.json +-rw-rw-r-- 1 stas stas 339 Mar 27 20:42 trainer_state.json +-rw-rw-r-- 1 stas stas 2.3K Mar 27 20:42 training_args.bin +-rwxrw-r-- 1 stas stas 5.5K Mar 27 13:16 zero_to_fp32.py* +``` + +この例では、DeepSpeed チェックポイント サブフォルダー *global_step1* が 1 つだけあります。したがって、FP32を再構築するには +重みを実行するだけです: + +```bash +python zero_to_fp32.py . 
pytorch_model.bin +``` + +これだよ。 `pytorch_model.bin`には、複数の GPU から統合された完全な fp32 モデルの重みが含まれるようになります。 + +スクリプトは、ZeRO-2 または ZeRO-3 チェックポイントを自動的に処理できるようになります。 + +`python zero_to_fp32.py -h` を実行すると、使用方法の詳細が表示されます。 + +スクリプトは、ファイル`latest`の内容を使用して deepspeed サブフォルダーを自動検出します。 +例には`global_step1`が含まれます。 + +注: 現在、スクリプトには最終的な fp32 モデルの重みの 2 倍の一般 RAM が必要です。 + +### ZeRO-3 と Infinity Nuances + +ZeRO-3 は、パラメータ シャーディング機能の点で ZeRO-2 とは大きく異なります。 + +ZeRO-Infinity は ZeRO-3 をさらに拡張し、NVMe メモリやその他の複数の速度とスケーラビリティの向上をサポートします。 + +モデルに特別な変更を加える必要がなくても正常に動作するようにあらゆる努力が払われてきましたが、特定の点では +状況によっては、次の情報が必要になる場合があります。 + +#### Constructing Massive Models + + +DeepSpeed/ZeRO-3 は、既存の RAM に収まらない可能性のある数兆のパラメータを持つモデルを処理できます。そのような場合、 +また、初期化をより高速に実行したい場合は、*deepspeed.zero.Init()* を使用してモデルを初期化します。 +コンテキスト マネージャー (関数デコレーターでもあります)。次のようになります。 + +```python +from transformers import T5ForConditionalGeneration, T5Config +import deepspeed + +with deepspeed.zero.Init(): + config = T5Config.from_pretrained("google-t5/t5-small") + model = T5ForConditionalGeneration(config) +``` + +ご覧のとおり、これによりランダムに初期化されたモデルが得られます。 + +事前トレーニングされたモデルを使用したい場合、`model_class.from_pretrained` は次の条件を満たす限りこの機能を有効にします。 +`is_deepspeed_zero3_enabled()` は `True` を返します。これは現在、 +[`TrainingArguments`] オブジェクト (渡された DeepSpeed 構成ファイルに ZeRO-3 構成が含まれている場合) +セクション。したがって、呼び出しの前に** [`TrainingArguments`] オブジェクトを作成する必要があります。 +`from_pretrained`。考えられるシーケンスの例を次に示します。 + +```python +from transformers import AutoModel, Trainer, TrainingArguments + +training_args = TrainingArguments(..., deepspeed=ds_config) +model = AutoModel.from_pretrained("google-t5/t5-small") +trainer = Trainer(model=model, args=training_args, ...) +``` + +公式のサンプル スクリプトを使用していて、コマンド ライン引数に `--deepspeed ds_config.json` が含まれている場合 +ZeRO-3 設定を有効にすると、これがサンプル スクリプトの記述方法であるため、すべてがすでに完了しています。 + +注: モデルの fp16 重みが単一の GPU のメモリに収まらない場合は、この機能を使用する必要があります。 + +この方法とその他の関連機能の詳細については、[大規模モデルの構築](https://deepspeed.readthedocs.io/en/latest/zero3.html#constructing-massive-models) を参照してください。 + +また、fp16 で事前訓練されたモデルをロードするときは、`from_pretrained` に使用するように指示する必要があります。 +`torch_dtype=torch.float16`。詳細については、[from_pretrained-torch-dtype](#from_pretrained-torch-dtype) を参照してください。 + +#### Gathering Parameters + +複数の GPU 上の ZeRO-3 では、現在の GPU のパラメータでない限り、単一の GPU がすべてのパラメータを持つことはありません。 +実行層。したがって、すべてのレイヤーのすべてのパラメーターに一度にアクセスする必要がある場合は、それを行うための特定の方法があります。 +ほとんどの場合は必要ありませんが、必要な場合は、[パラメータの収集](https://deepspeed.readthedocs.io/en/latest/zero3.html#manual-parameter-coordination) を参照してください。 + +ただし、いくつかの場所で内部的に使用しています。その例の 1 つは、事前トレーニングされたモデルの重みをロードするときです。 +`from_pretrained`。一度に 1 つのレイヤーをロードし、参加しているすべての GPU に即座に分割します。 +大規模なモデルでは、メモリの関係で、1 つの GPU にロードしてから複数の GPU に分散することはできません。 +制限。 + +また、ZeRO-3 では、独自のコードを作成し、次のようなモデル パラメーターの重みが発生するとします。 + +```python +tensor([1.0], device="cuda:0", dtype=torch.float16, requires_grad=True) +``` + +`tensor([1.])` にストレスを感じた場合、またはパラメータのサイズが `1` であるというエラーが発生した場合 +より大きな多次元形状。これは、パラメーターが分割されており、表示されるのは ZeRO-3 プレースホルダーであることを意味します。 + + + + +### ZeRO Inference + +ZeRO Inference は、ZeRO-3 Training と同じ構成を使用します。オプティマイザーとスケジューラーのセクションは必要ありません。で +実際、同じものをトレーニングと共有したい場合は、これらを設定ファイルに残すことができます。彼らはただそうなるだろう +無視されました。 + +それ以外の場合は、通常の [`TrainingArguments`] 引数を渡すだけです。例えば: + +```bash +deepspeed --num_gpus=2 your_program.py --do_eval --deepspeed ds_config.json +``` + +唯一重要なことは、ZeRO-2 には何の利点もないため、ZeRO-3 構成を使用する必要があるということです。 +ZeRO-3 のみがパラメーターのシャーディングを実行するのに対し、ZeRO-1 は勾配とオプティマイザーの状態をシャーディングするため、推論に役立ちます。 + +以下は、利用可能なすべての GPU をデプロイする DeepSpeed で`run_translation.py`を実行する例です。 + + +```bash +deepspeed examples/pytorch/translation/run_translation.py \ +--deepspeed 
tests/deepspeed/ds_config_zero3.json \ +--model_name_or_path google-t5/t5-small --output_dir output_dir \ +--do_eval --max_eval_samples 50 --warmup_steps 50 \ +--max_source_length 128 --val_max_target_length 128 \ +--overwrite_output_dir --per_device_eval_batch_size 4 \ +--predict_with_generate --dataset_config "ro-en" --fp16 \ +--source_lang en --target_lang ro --dataset_name wmt16 \ +--source_prefix "translate English to Romanian: " +``` + +推論のために、オプティマイザーの状態と勾配によって使用される追加の大きなメモリは必要ないため、 +はるかに大きなバッチやシーケンス長を同じハードウェアに適合できる必要があります。 + +さらに、DeepSpeed は現在、Deepspeed-Inference と呼ばれる関連製品を開発していますが、これとは何の関係もありません。 +ZeRO テクノロジーに準拠していますが、代わりにテンソル並列処理を使用して、単一の GPU に収まらないモデルをスケーリングします。これは +現在開発中です。製品が完成したら統合を提供する予定です。 + + +### Memory Requirements + +Deepspeed ZeRO はメモリを CPU (および NVMe) にオフロードできるため、フレームワークは、使用されている GPU の数に応じて必要な CPU および GPU メモリの量を知ることができるユーティリティを提供します。 + +単一の GPU で `bigscience/T0_3B`を微調整するために必要なメモリの量を見積もってみましょう。 + +```bash +$ python -c 'from transformers import AutoModel; \ +from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \ +model = AutoModel.from_pretrained("bigscience/T0_3B"); \ +estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)' +[...] +Estimated memory needed for params, optim states and gradients for a: +HW: Setup with 1 node, 1 GPU per node. +SW: Model with 2783M total params, 65M largest layer params. + per CPU | per GPU | Options + 70.00GB | 0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1 + 70.00GB | 0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0 + 62.23GB | 5.43GB | offload_param=none, offload_optimizer=cpu , zero_init=1 + 62.23GB | 5.43GB | offload_param=none, offload_optimizer=cpu , zero_init=0 + 0.37GB | 46.91GB | offload_param=none, offload_optimizer=none, zero_init=1 + 15.56GB | 46.91GB | offload_param=none, offload_optimizer=none, zero_init=0 +``` + +したがって、単一の 80 GB GPU で CPU オフロードなしで搭載することも、小さな 8 GB GPU でも最大 60 GB の CPU メモリが必要になることも可能です。 (これはパラメータ、オプティマイザの状態、および勾配のためのメモリであることに注意してください。cuda カーネル、アクティベーション、および一時メモリにはもう少し多くのメモリが必要です。) + +次に、コストと速度のトレードオフになります。より小さい GPU を購入またはレンタルした方が安くなります (Deepspeed ZeRO では複数の GPU を使用できるため、GPU の数を減らすこともできます)。しかし、その場合は遅くなります。そのため、何かを実行する速度を気にしなくても、速度の低下は GPU の使用時間に直接影響し、コストが増大するため、どれが最も効果的かを実験して比較してください。 + +十分な GPU メモリがある場合は、すべてが高速になるため、CPU/NVMe オフロードを必ず無効にしてください。 + +たとえば、2 つの GPU に対して同じことを繰り返してみましょう。 + +```bash +$ python -c 'from transformers import AutoModel; \ +from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \ +model = AutoModel.from_pretrained("bigscience/T0_3B"); \ +estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=2, num_nodes=1)' +[...] +Estimated memory needed for params, optim states and gradients for a: +HW: Setup with 1 node, 2 GPUs per node. +SW: Model with 2783M total params, 65M largest layer params. 
+ per CPU | per GPU | Options + 70.00GB | 0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1 + 70.00GB | 0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0 + 62.23GB | 2.84GB | offload_param=none, offload_optimizer=cpu , zero_init=1 + 62.23GB | 2.84GB | offload_param=none, offload_optimizer=cpu , zero_init=0 + 0.74GB | 23.58GB | offload_param=none, offload_optimizer=none, zero_init=1 + 31.11GB | 23.58GB | offload_param=none, offload_optimizer=none, zero_init=0 + +``` + +したがって、ここでは、CPU にオフロードせずに 2x 32GB 以上の GPU が必要になります。 + +詳細については、[メモリ推定ツール](https://deepspeed.readthedocs.io/en/latest/memory.html) を参照してください。 + + +### Filing Issues + + +ここでは、問題の真相をすぐに解明し、作業のブロックを解除できるよう、問題を報告する方法を説明します。 + +レポートには必ず次の内容を含めてください。 + +1. レポート内の完全な Deepspeed 構成ファイル + +2. [`Trainer`] を使用している場合はコマンドライン引数、または + トレーナーのセットアップを自分でスクリプト作成している場合は、[`TrainingArguments`] 引数。しないでください + [`TrainingArguments`] には無関係なエントリが多数含まれているため、ダンプします。 + +3. 次の出力: + + ```bash + python -c 'import torch; print(f"torch: {torch.__version__}")' + python -c 'import transformers; print(f"transformers: {transformers.__version__}")' + python -c 'import deepspeed; print(f"deepspeed: {deepspeed.__version__}")' + ``` + +4. 可能であれば、問題を再現できる Google Colab ノートブックへのリンクを含めてください。これを使えます + [ノートブック](https://github.com/stas00/porting/blob/master/transformers/deepspeed/DeepSpeed_on_colab_CLI.ipynb) として + 出発点。 + +5. 不可能でない限り、カスタムデータセットではなく、常に使用できる標準データセットを使用してください。 + +6. 可能であれば、既存の [サンプル](https://github.com/huggingface/transformers/tree/main/examples/pytorch) のいずれかを使用して問題を再現してみてください。 + +- Deepspeed が問題の原因ではないことがよくあります。 + + 提出された問題の一部は、Deepspeed とは無関係であることが判明しました。それは、Deepspeed がセットアップから削除された後です。 + 問題はまだ残っていた。 + + したがって、完全に明白でない場合は、DeepSpeed 関連の問題です。 + 例外が発生し、DeepSpeed モジュールが関係していることがわかります。まず、DeepSpeed を含まないセットアップを再テストしてください。 + 問題が解決しない場合にのみ、Deepspeed について言及し、必要な詳細をすべて提供してください。 + +- 問題が統合部分ではなく DeepSpeed コアにあることが明らかな場合は、問題を提出してください。 + [Deepspeed](https://github.com/microsoft/DeepSpeed/) を直接使用します。よくわからない場合でも、ご安心ください。 + どちらの問題トラッカーでも問題ありません。投稿されたらそれを判断し、次の場合は別の問題トラッカーにリダイレクトします。 + そうである必要がある。 + + +### Troubleshooting + +#### the `deepspeed` process gets killed at startup without a traceback + +`deepspeed`プロセスが起動時にトレースバックなしで強制終了された場合、それは通常、プログラムが試行したことを意味します。 +システムが持っているよりも多くの CPU メモリを割り当てるか、プロセスが割り当てを許可されているため、OS カーネルがそれを強制終了します。 +プロセス。これは、設定ファイルに `offload_optimizer` または `offload_param` が含まれている可能性が高いためです。 +どちらも`cpu`にオフロードするように設定されています。 NVMe を使用している場合は、次の環境で実行している場合は NVMe へのオフロードを試してください。 +ゼロ-3。 [特定のモデルに必要なメモリ量を見積もる]方法は次のとおりです(https://deepspeed.readthedocs.io/en/latest/memory.html)。 + +#### training and/or eval/predict loss is `NaN` + +これは、bf16 混合精度モードで事前トレーニングされたモデルを取得し、それを fp16 (混合精度の有無にかかわらず) で使用しようとした場合によく発生します。 TPU でトレーニングされたほとんどのモデル、および多くの場合、Google によってリリースされたモデルは、このカテゴリに分類されます (たとえば、ほぼすべての t5 ベースのモデル)。ここでの解決策は、ハードウェアがサポートしている場合 (TPU、Ampere GPU 以降)、fp32 または bf16 を使用することです。 + +```json +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + } +} +``` + +ログには、Deepspeed が次のように`OVERFLOW!`を報告していることがわかります。 + +``` +0%| | 0/189 [00:00=4.28` 以降、`synced_gpus` が明示的に指定されていない場合、これらの条件が検出されると自動的に `True` に設定されます。ただし、必要に応じて `synced_gpus` の値をオーバーライドすることもできます。 + +## Deepspeed 統合のテスト + +DeepSpeed 統合を含む PR を送信する場合は、CircleCI PR CI セットアップには GPU がないことに注意してください。そのため、GPU を必要とするテストは別の CI で毎晩のみ実行されます。したがって、PR で緑色の CI レポートが表示されても、DeepSpeed テストが合格したことを意味するわけではありません。 + +DeepSpeed テストを実行するには、少なくとも以下を実行してください。 + +```bash +RUN_SLOW=1 pytest 
tests/deepspeed/test_deepspeed.py +``` + +モデリングまたは pytorch サンプル コードのいずれかを変更した場合は、Model Zoo テストも実行します。以下はすべての DeepSpeed テストを実行します。 + +```bash +RUN_SLOW=1 pytest tests/deepspeed +``` + + +## Main DeepSpeed Resources + +- [プロジェクトの github](https://github.com/microsoft/deepspeed) +- [使用方法ドキュメント](https://www.deepspeed.ai/getting-started/) +- [API ドキュメント](https://deepspeed.readthedocs.io/en/latest/index.html) +- [ブログ投稿](https://www.microsoft.com/en-us/research/search/?q=deepspeed) + +論文: + +- [ZeRO: 兆パラメータ モデルのトレーニングに向けたメモリの最適化](https://arxiv.org/abs/1910.02054) +- [ZeRO-Offload: 10 億規模のモデル トレーニングの民主化](https://arxiv.org/abs/2101.06840) +- [ZeRO-Infinity: 極限スケールの深層学習のための GPU メモリの壁を打ち破る](https://arxiv.org/abs/2104.07857) + +最後に、HuggingFace [`Trainer`] は DeepSpeed のみを統合していることを覚えておいてください。 +DeepSpeed の使用に関して問題や質問がある場合は、[DeepSpeed GitHub](https://github.com/microsoft/DeepSpeed/issues) に問題を提出してください。 diff --git a/docs/source/ja/main_classes/feature_extractor.md b/docs/source/ja/main_classes/feature_extractor.md new file mode 100644 index 00000000000000..a2bd8c59a84f35 --- /dev/null +++ b/docs/source/ja/main_classes/feature_extractor.md @@ -0,0 +1,41 @@ + + +# Feature Extractor + + +フィーチャーエクストラクタは、オーディオまたはビジョンモデルのための入力フィーチャーの準備を担当しています。これには、シーケンスからのフィーチャー抽出(例:オーディオファイルの前処理からLog-Melスペクトログラムフィーチャーへの変換)、画像からのフィーチャー抽出(例:画像ファイルのクロッピング)、またパディング、正規化、そしてNumpy、PyTorch、TensorFlowテンソルへの変換も含まれます。 + + +## FeatureExtractionMixin + +[[autodoc]] feature_extraction_utils.FeatureExtractionMixin + - from_pretrained + - save_pretrained + +## SequenceFeatureExtractor + +[[autodoc]] SequenceFeatureExtractor + - pad + +## BatchFeature + +[[autodoc]] BatchFeature + +## ImageFeatureExtractionMixin + +[[autodoc]] image_utils.ImageFeatureExtractionMixin diff --git a/docs/source/ja/main_classes/image_processor.md b/docs/source/ja/main_classes/image_processor.md new file mode 100644 index 00000000000000..bfd33b83c2e50b --- /dev/null +++ b/docs/source/ja/main_classes/image_processor.md @@ -0,0 +1,33 @@ + + +# Image Processor + +画像プロセッサは、ビジョン モデルの入力特徴の準備とその出力の後処理を担当します。これには、サイズ変更、正規化、PyTorch、TensorFlow、Flax、Numpy テンソルへの変換などの変換が含まれます。ロジットをセグメンテーション マスクに変換するなど、モデル固有の後処理も含まれる場合があります。 + +## ImageProcessingMixin + +[[autodoc]] image_processing_utils.ImageProcessingMixin + - from_pretrained + - save_pretrained + +## BatchFeature + +[[autodoc]] BatchFeature + +## BaseImageProcessor + +[[autodoc]] image_processing_utils.BaseImageProcessor diff --git a/docs/source/ja/main_classes/keras_callbacks.md b/docs/source/ja/main_classes/keras_callbacks.md new file mode 100644 index 00000000000000..ff28107a434579 --- /dev/null +++ b/docs/source/ja/main_classes/keras_callbacks.md @@ -0,0 +1,28 @@ + + +# Keras callbacks + +Keras を使用して Transformers モデルをトレーニングする場合、一般的な処理を自動化するために使用できるライブラリ固有のコールバックがいくつかあります。 +タスク: + +## KerasMetricCallback + +[[autodoc]] KerasMetricCallback + +## PushToHubCallback + +[[autodoc]] PushToHubCallback diff --git a/docs/source/ja/main_classes/logging.md b/docs/source/ja/main_classes/logging.md new file mode 100644 index 00000000000000..4b4f4a2a3e0940 --- /dev/null +++ b/docs/source/ja/main_classes/logging.md @@ -0,0 +1,121 @@ + + +# Logging + +🤗 Transformersには、ライブラリの詳細度を簡単に設定できる中央集中型のロギングシステムがあります。 + +現在、ライブラリのデフォルトの詳細度は「WARNING」です。 + +詳細度を変更するには、直接設定メソッドの1つを使用するだけです。例えば、詳細度をINFOレベルに変更する方法は以下の通りです。 + + +```python +import transformers + +transformers.logging.set_verbosity_info() +``` + + +環境変数 `TRANSFORMERS_VERBOSITY` を使用して、デフォルトの冗長性をオーバーライドすることもできます。設定できます +`debug`、`info`、`warning`、`error`、`critical` のいずれかに変更します。例えば: + +```bash 
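+# TRANSFORMERS_VERBOSITY には debug / info / warning / error / critical を指定できます
+# 以下はエラーのみをログに出力してスクリプトを実行する例です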
+TRANSFORMERS_VERBOSITY=error ./myprogram.py +``` + + +さらに、一部の「警告」は環境変数を設定することで無効にできます。 +`TRANSFORMERS_NO_ADVISORY_WARNINGS` を *1* などの true 値に設定します。これにより、次を使用してログに記録される警告が無効になります。 +[`logger.warning_advice`]。例えば: + +```bash +TRANSFORMERS_NO_ADVISORY_WARNINGS=1 ./myprogram.py +``` + + +以下は、独自のモジュールまたはスクリプトでライブラリと同じロガーを使用する方法の例です。 + +```python +from transformers.utils import logging + +logging.set_verbosity_info() +logger = logging.get_logger("transformers") +logger.info("INFO") +logger.warning("WARN") +``` + +このロギング モジュールのすべてのメソッドは以下に文書化されています。主なメソッドは次のとおりです。 +[`logging.get_verbosity`] ロガーの現在の冗長レベルを取得します。 +[`logging.set_verbosity`] を使用して、冗長性を選択したレベルに設定します。順番に(少ないものから) +冗長から最も冗長まで)、それらのレベル (括弧内は対応する int 値) は次のとおりです。 + +- `transformers.logging.CRITICAL` または `transformers.logging.FATAL` (int 値、50): 最も多いもののみをレポートします。 + 重大なエラー。 +- `transformers.logging.ERROR` (int 値、40): エラーのみを報告します。 +- `transformers.logging.WARNING` または `transformers.logging.WARN` (int 値、30): エラーと + 警告。これはライブラリで使用されるデフォルトのレベルです。 +- `transformers.logging.INFO` (int 値、20): エラー、警告、および基本情報をレポートします。 +- `transformers.logging.DEBUG` (int 値、10): すべての情報をレポートします。 + +デフォルトでは、モデルのダウンロード中に「tqdm」進行状況バーが表示されます。 [`logging.disable_progress_bar`] および [`logging.enable_progress_bar`] を使用して、この動作を抑制または抑制解除できます。 + +## `logging` vs `warnings` + +Python には、よく組み合わせて使用​​される 2 つのロギング システムがあります。上で説明した `logging` と `warnings` です。 +これにより、特定のバケット内の警告をさらに分類できます (例: 機能またはパスの`FutureWarning`) +これはすでに非推奨になっており、`DeprecationWarning`は今後の非推奨を示します。 + +両方とも`transformers`ライブラリで使用します。 `logging`の`captureWarning`メソッドを活用して適応させて、 +これらの警告メッセージは、上記の冗長設定ツールによって管理されます。 + +それはライブラリの開発者にとって何を意味しますか?次のヒューリスティックを尊重する必要があります。 +- `warnings`は、ライブラリおよび`transformers`に依存するライブラリの開発者に優先されるべきです。 +- `logging`は、日常のプロジェクトでライブラリを使用するライブラリのエンドユーザーに使用する必要があります。 + +以下の`captureWarnings`メソッドのリファレンスを参照してください。 + +[[autodoc]] logging.captureWarnings + +## Base setters + +[[autodoc]] logging.set_verbosity_error + +[[autodoc]] logging.set_verbosity_warning + +[[autodoc]] logging.set_verbosity_info + +[[autodoc]] logging.set_verbosity_debug + +## Other functions + +[[autodoc]] logging.get_verbosity + +[[autodoc]] logging.set_verbosity + +[[autodoc]] logging.get_logger + +[[autodoc]] logging.enable_default_handler + +[[autodoc]] logging.disable_default_handler + +[[autodoc]] logging.enable_explicit_format + +[[autodoc]] logging.reset_format + +[[autodoc]] logging.enable_progress_bar + +[[autodoc]] logging.disable_progress_bar diff --git a/docs/source/ja/main_classes/model.md b/docs/source/ja/main_classes/model.md new file mode 100644 index 00000000000000..916040c4a3b275 --- /dev/null +++ b/docs/source/ja/main_classes/model.md @@ -0,0 +1,160 @@ + + +# Models + +ベースクラスである [`PreTrainedModel`]、[`TFPreTrainedModel`]、[`FlaxPreTrainedModel`] は、モデルの読み込みと保存に関する共通のメソッドを実装しており、これはローカルのファイルやディレクトリから、またはライブラリが提供する事前学習モデル構成(HuggingFaceのAWS S3リポジトリからダウンロード)からモデルを読み込むために使用できます。 + +[`PreTrainedModel`] と [`TFPreTrainedModel`] は、次の共通のメソッドも実装しています: + +- 語彙に新しいトークンが追加された場合に、入力トークン埋め込みのリサイズを行う +- モデルのアテンションヘッドを刈り込む + +各モデルに共通するその他のメソッドは、[`~modeling_utils.ModuleUtilsMixin`](PyTorchモデル用)および[`~modeling_tf_utils.TFModuleUtilsMixin`](TensorFlowモデル用)で定義されており、テキスト生成の場合、[`~generation.GenerationMixin`](PyTorchモデル用)、[`~generation.TFGenerationMixin`](TensorFlowモデル用)、および[`~generation.FlaxGenerationMixin`](Flax/JAXモデル用)もあります。 + + +## PreTrainedModel + +[[autodoc]] PreTrainedModel + - push_to_hub + - all + + + + +### 大規模モデルの読み込み + +Transformers 4.20.0では、[`~PreTrainedModel.from_pretrained`] 
メソッドが再設計され、[Accelerate](https://huggingface.co/docs/accelerate/big_modeling) を使用して大規模モデルを扱うことが可能になりました。これには Accelerate >= 0.9.0 と PyTorch >= 1.9.0 が必要です。以前の方法でフルモデルを作成し、その後事前学習の重みを読み込む代わりに(これにはメモリ内のモデルサイズが2倍必要で、ランダムに初期化されたモデル用と重み用の2つが必要でした)、モデルを空の外殻として作成し、事前学習の重みが読み込まれるときにパラメーターを実体化するオプションが追加されました。 + +このオプションは `low_cpu_mem_usage=True` で有効にできます。モデルはまず空の重みを持つメタデバイス上に作成され、その後状態辞書が内部に読み込まれます(シャードされたチェックポイントの場合、シャードごとに読み込まれます)。この方法で使用される最大RAMは、モデルの完全なサイズだけです。 + + +```py +from transformers import AutoModelForSeq2SeqLM + +t0pp = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp", low_cpu_mem_usage=True) +``` + +さらに、モデルが完全にRAMに収まらない場合(現時点では推論のみ有効)、異なるデバイスにモデルを直接配置できます。`device_map="auto"` を使用すると、Accelerateは各レイヤーをどのデバイスに配置するかを決定し、最速のデバイス(GPU)を最大限に活用し、残りの部分をCPU、あるいはGPU RAMが不足している場合はハードドライブにオフロードします。モデルが複数のデバイスに分割されていても、通常どおり実行されます。 + +`device_map` を渡す際、`low_cpu_mem_usage` は自動的に `True` に設定されるため、それを指定する必要はありません。 + + +```py +from transformers import AutoModelForSeq2SeqLM + +t0pp = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp", device_map="auto") +``` + +モデルがデバイス間でどのように分割されたかは、その `hf_device_map` 属性を見ることで確認できます: + +```py +t0pp.hf_device_map +``` + +```python out +{'shared': 0, + 'decoder.embed_tokens': 0, + 'encoder': 0, + 'decoder.block.0': 0, + 'decoder.block.1': 1, + 'decoder.block.2': 1, + 'decoder.block.3': 1, + 'decoder.block.4': 1, + 'decoder.block.5': 1, + 'decoder.block.6': 1, + 'decoder.block.7': 1, + 'decoder.block.8': 1, + 'decoder.block.9': 1, + 'decoder.block.10': 1, + 'decoder.block.11': 1, + 'decoder.block.12': 1, + 'decoder.block.13': 1, + 'decoder.block.14': 1, + 'decoder.block.15': 1, + 'decoder.block.16': 1, + 'decoder.block.17': 1, + 'decoder.block.18': 1, + 'decoder.block.19': 1, + 'decoder.block.20': 1, + 'decoder.block.21': 1, + 'decoder.block.22': 'cpu', + 'decoder.block.23': 'cpu', + 'decoder.final_layer_norm': 'cpu', + 'decoder.dropout': 'cpu', + 'lm_head': 'cpu'} +``` + +同じフォーマットに従って、独自のデバイスマップを作成することもできます(レイヤー名からデバイスへの辞書です)。モデルのすべてのパラメータを指定されたデバイスにマップする必要がありますが、1つのレイヤーが完全に同じデバイスにある場合、そのレイヤーのサブモジュールのすべてがどこに行くかの詳細を示す必要はありません。例えば、次のデバイスマップはT0ppに適しています(GPUメモリがある場合): + +```python +device_map = {"shared": 0, "encoder": 0, "decoder": 1, "lm_head": 1} +``` + +モデルのメモリへの影響を最小限に抑えるもう 1 つの方法は、低精度の dtype (`torch.float16` など) でモデルをインスタンス化するか、以下で説明する直接量子化手法を使用することです。 + +### Model Instantiation dtype + +Pytorch では、モデルは通常 `torch.float32` 形式でインスタンス化されます。これは、しようとすると問題になる可能性があります +重みが fp16 にあるモデルをロードすると、2 倍のメモリが必要になるためです。この制限を克服するには、次のことができます。 +`torch_dtype` 引数を使用して、目的の `dtype` を明示的に渡します。 + +```python +model = T5ForConditionalGeneration.from_pretrained("t5", torch_dtype=torch.float16) +``` +または、モデルを常に最適なメモリ パターンでロードしたい場合は、特別な値 `"auto"` を使用できます。 +そして、`dtype` はモデルの重みから自動的に導出されます。 + +```python +model = T5ForConditionalGeneration.from_pretrained("t5", torch_dtype="auto") +``` + +スクラッチからインスタンス化されたモデルには、どの `dtype` を使用するかを指示することもできます。 + +```python +config = T5Config.from_pretrained("t5") +model = AutoModel.from_config(config) +``` + +Pytorch の設計により、この機能は浮動小数点 dtype でのみ使用できます。 + +## ModuleUtilsMixin + +[[autodoc]] modeling_utils.ModuleUtilsMixin + +## TFPreTrainedModel + +[[autodoc]] TFPreTrainedModel + - push_to_hub + - all + +## TFModelUtilsMixin + +[[autodoc]] modeling_tf_utils.TFModelUtilsMixin + +## FlaxPreTrainedModel + +[[autodoc]] FlaxPreTrainedModel + - push_to_hub + - all + +## Pushing to the Hub + +[[autodoc]] utils.PushToHubMixin + +## Sharded checkpoints + +[[autodoc]] modeling_utils.load_sharded_checkpoint diff --git a/docs/source/ja/main_classes/onnx.md 
b/docs/source/ja/main_classes/onnx.md new file mode 100644 index 00000000000000..f12427760976a0 --- /dev/null +++ b/docs/source/ja/main_classes/onnx.md @@ -0,0 +1,55 @@ + + +# Exporting 🤗 Transformers models to ONNX + +🤗 Transformers は `transformers.onnx` パッケージを提供します。 +設定オブジェクトを利用することで、モデルのチェックポイントをONNXグラフに変換することができます。 + +詳細は[ガイド](../serialization) を参照してください。 +を参照してください。 + +## ONNX Configurations + +以下の3つの抽象クラスを提供しています。 +エクスポートしたいモデルアーキテクチャのタイプに応じて、継承すべき3つの抽象クラスを提供します: + +* エンコーダーベースのモデルは [`~onnx.config.OnnxConfig`] を継承します。 +* デコーダーベースのモデルは [`~onnx.config.OnnxConfigWithPast`] を継承します。 +* エンコーダー・デコーダーモデルは [`~onnx.config.OnnxSeq2SeqConfigWithPast`] を継承しています。 + + +### OnnxConfig + +[[autodoc]] onnx.config.OnnxConfig + +### OnnxConfigWithPast + +[[autodoc]] onnx.config.OnnxConfigWithPast + +### OnnxSeq2SeqConfigWithPast + +[[autodoc]] onnx.config.OnnxSeq2SeqConfigWithPast + +## ONNX Features + +各 ONNX 構成は、次のことを可能にする一連の _機能_ に関連付けられています。 +さまざまなタイプのトポロジまたはタスクのモデルをエクスポートします。 + +### FeaturesManager + +[[autodoc]] onnx.features.FeaturesManager + diff --git a/docs/source/ja/main_classes/optimizer_schedules.md b/docs/source/ja/main_classes/optimizer_schedules.md new file mode 100644 index 00000000000000..fc7a13b9df531d --- /dev/null +++ b/docs/source/ja/main_classes/optimizer_schedules.md @@ -0,0 +1,77 @@ + + +# Optimization + +`.optimization` モジュールは以下を提供します。 + +- モデルの微調整に使用できる重み減衰が修正されたオプティマイザー、および +- `_LRSchedule` から継承するスケジュール オブジェクトの形式のいくつかのスケジュール: +- 複数のバッチの勾配を累積するための勾配累積クラス + +## AdamW (PyTorch) + +[[autodoc]] AdamW + +## AdaFactor (PyTorch) + +[[autodoc]] Adafactor + +## AdamWeightDecay (TensorFlow) + +[[autodoc]] AdamWeightDecay + +[[autodoc]] create_optimizer + +## Schedules + +### Learning Rate Schedules (Pytorch) + +[[autodoc]] SchedulerType + +[[autodoc]] get_scheduler + +[[autodoc]] get_constant_schedule + +[[autodoc]] get_constant_schedule_with_warmup + + + +[[autodoc]] get_cosine_schedule_with_warmup + + + +[[autodoc]] get_cosine_with_hard_restarts_schedule_with_warmup + + + +[[autodoc]] get_linear_schedule_with_warmup + + + +[[autodoc]] get_polynomial_decay_schedule_with_warmup + +[[autodoc]] get_inverse_sqrt_schedule + +### Warmup (TensorFlow) + +[[autodoc]] WarmUp + +## Gradient Strategies + +### GradientAccumulator (TensorFlow) + +[[autodoc]] GradientAccumulator diff --git a/docs/source/ja/main_classes/output.md b/docs/source/ja/main_classes/output.md new file mode 100644 index 00000000000000..beb9dcbb442355 --- /dev/null +++ b/docs/source/ja/main_classes/output.md @@ -0,0 +1,321 @@ + + +# Model outputs + +すべてのモデルには、[`~utils.ModelOutput`] のサブクラスのインスタンスである出力があります。それらは +モデルによって返されるすべての情報を含むデータ構造ですが、タプルまたは +辞書。 + +これがどのようになるかを例で見てみましょう。 + +```python +from transformers import BertTokenizer, BertForSequenceClassification +import torch + +tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased") +model = BertForSequenceClassification.from_pretrained("google-bert/bert-base-uncased") + +inputs = tokenizer("Hello, my dog is cute", return_tensors="pt") +labels = torch.tensor([1]).unsqueeze(0) # Batch size 1 +outputs = model(**inputs, labels=labels) +``` + +`outputs`オブジェクトは[`~modeling_outputs.SequenceClassifierOutput`]である。 +これは、オプションで `loss`、`logits`、オプションで `hidden_states`、オプションで `attentions` 属性を持つことを意味します。 +オプションの `attentions` 属性を持つことを意味する。ここでは、`labels`を渡したので`loss`があるが、`hidden_states`と`attentions`はない。 +`output_hidden_states=True`や`output_attentions=True`を渡していないので、`hidden_states`と`attentions`はない。 +`output_attentions=True`を渡さなかったからだ。 + + + 
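+補足として、上の例で得た `outputs` の属性には次のようにアクセスできます。以下は挙動を確認するための簡単なスケッチであり、`logits` の形状など具体的な値は使用するモデルと入力に依存する点に注意してください。
+
+```python
+# 上の `outputs` (SequenceClassifierOutput) の属性を確認する例
+print(outputs.loss)           # labels を渡したので損失テンソルが入っている
+print(outputs.logits.shape)   # 例: torch.Size([1, 2])。バッチサイズ 1、ラベル数 2 の場合
+print(outputs.hidden_states)  # output_hidden_states=True を渡していないため None
+print(outputs.attentions)     # output_attentions=True を渡していないため None
+```
+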
+`output_hidden_states=True`を渡すと、`outputs.hidden_states[-1]`が `outputs.last_hidden_states` と正確に一致することを期待するかもしれない。 +しかし、必ずしもそうなるとは限りません。モデルによっては、最後に隠された状態が返されたときに、正規化やその後の処理を適用するものもあります。 + + + + +通常と同じように各属性にアクセスできます。その属性がモデルから返されなかった場合は、 +は `None`を取得します。ここで、たとえば`outputs.loss`はモデルによって計算された損失であり、`outputs.attentions`は +`None`。 + +`outputs`オブジェクトをタプルとして考える場合、`None`値を持たない属性のみが考慮されます。 +たとえば、ここには 2 つの要素、`loss`、次に`logits`があります。 + +```python +outputs[:2] +``` + +たとえば、タプル `(outputs.loss, Outputs.logits)` を返します。 + +`outputs`オブジェクトを辞書として考慮する場合、「None」を持たない属性のみが考慮されます。 +価値観。たとえば、ここには`loss` と `logits`という 2 つのキーがあります。 + +ここでは、複数のモデル タイプで使用される汎用モデルの出力を文書化します。具体的な出力タイプは次のとおりです。 +対応するモデルのページに記載されています。 + +## ModelOutput + +[[autodoc]] utils.ModelOutput + - to_tuple + +## BaseModelOutput + +[[autodoc]] modeling_outputs.BaseModelOutput + +## BaseModelOutputWithPooling + +[[autodoc]] modeling_outputs.BaseModelOutputWithPooling + +## BaseModelOutputWithCrossAttentions + +[[autodoc]] modeling_outputs.BaseModelOutputWithCrossAttentions + +## BaseModelOutputWithPoolingAndCrossAttentions + +[[autodoc]] modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions + +## BaseModelOutputWithPast + +[[autodoc]] modeling_outputs.BaseModelOutputWithPast + +## BaseModelOutputWithPastAndCrossAttentions + +[[autodoc]] modeling_outputs.BaseModelOutputWithPastAndCrossAttentions + +## Seq2SeqModelOutput + +[[autodoc]] modeling_outputs.Seq2SeqModelOutput + +## CausalLMOutput + +[[autodoc]] modeling_outputs.CausalLMOutput + +## CausalLMOutputWithCrossAttentions + +[[autodoc]] modeling_outputs.CausalLMOutputWithCrossAttentions + +## CausalLMOutputWithPast + +[[autodoc]] modeling_outputs.CausalLMOutputWithPast + +## MaskedLMOutput + +[[autodoc]] modeling_outputs.MaskedLMOutput + +## Seq2SeqLMOutput + +[[autodoc]] modeling_outputs.Seq2SeqLMOutput + +## NextSentencePredictorOutput + +[[autodoc]] modeling_outputs.NextSentencePredictorOutput + +## SequenceClassifierOutput + +[[autodoc]] modeling_outputs.SequenceClassifierOutput + +## Seq2SeqSequenceClassifierOutput + +[[autodoc]] modeling_outputs.Seq2SeqSequenceClassifierOutput + +## MultipleChoiceModelOutput + +[[autodoc]] modeling_outputs.MultipleChoiceModelOutput + +## TokenClassifierOutput + +[[autodoc]] modeling_outputs.TokenClassifierOutput + +## QuestionAnsweringModelOutput + +[[autodoc]] modeling_outputs.QuestionAnsweringModelOutput + +## Seq2SeqQuestionAnsweringModelOutput + +[[autodoc]] modeling_outputs.Seq2SeqQuestionAnsweringModelOutput + +## Seq2SeqSpectrogramOutput + +[[autodoc]] modeling_outputs.Seq2SeqSpectrogramOutput + +## SemanticSegmenterOutput + +[[autodoc]] modeling_outputs.SemanticSegmenterOutput + +## ImageClassifierOutput + +[[autodoc]] modeling_outputs.ImageClassifierOutput + +## ImageClassifierOutputWithNoAttention + +[[autodoc]] modeling_outputs.ImageClassifierOutputWithNoAttention + +## DepthEstimatorOutput + +[[autodoc]] modeling_outputs.DepthEstimatorOutput + +## Wav2Vec2BaseModelOutput + +[[autodoc]] modeling_outputs.Wav2Vec2BaseModelOutput + +## XVectorOutput + +[[autodoc]] modeling_outputs.XVectorOutput + +## Seq2SeqTSModelOutput + +[[autodoc]] modeling_outputs.Seq2SeqTSModelOutput + +## Seq2SeqTSPredictionOutput + +[[autodoc]] modeling_outputs.Seq2SeqTSPredictionOutput + +## SampleTSPredictionOutput + +[[autodoc]] modeling_outputs.SampleTSPredictionOutput + +## TFBaseModelOutput + +[[autodoc]] modeling_tf_outputs.TFBaseModelOutput + +## TFBaseModelOutputWithPooling + +[[autodoc]] modeling_tf_outputs.TFBaseModelOutputWithPooling + +## 
TFBaseModelOutputWithPoolingAndCrossAttentions + +[[autodoc]] modeling_tf_outputs.TFBaseModelOutputWithPoolingAndCrossAttentions + +## TFBaseModelOutputWithPast + +[[autodoc]] modeling_tf_outputs.TFBaseModelOutputWithPast + +## TFBaseModelOutputWithPastAndCrossAttentions + +[[autodoc]] modeling_tf_outputs.TFBaseModelOutputWithPastAndCrossAttentions + +## TFSeq2SeqModelOutput + +[[autodoc]] modeling_tf_outputs.TFSeq2SeqModelOutput + +## TFCausalLMOutput + +[[autodoc]] modeling_tf_outputs.TFCausalLMOutput + +## TFCausalLMOutputWithCrossAttentions + +[[autodoc]] modeling_tf_outputs.TFCausalLMOutputWithCrossAttentions + +## TFCausalLMOutputWithPast + +[[autodoc]] modeling_tf_outputs.TFCausalLMOutputWithPast + +## TFMaskedLMOutput + +[[autodoc]] modeling_tf_outputs.TFMaskedLMOutput + +## TFSeq2SeqLMOutput + +[[autodoc]] modeling_tf_outputs.TFSeq2SeqLMOutput + +## TFNextSentencePredictorOutput + +[[autodoc]] modeling_tf_outputs.TFNextSentencePredictorOutput + +## TFSequenceClassifierOutput + +[[autodoc]] modeling_tf_outputs.TFSequenceClassifierOutput + +## TFSeq2SeqSequenceClassifierOutput + +[[autodoc]] modeling_tf_outputs.TFSeq2SeqSequenceClassifierOutput + +## TFMultipleChoiceModelOutput + +[[autodoc]] modeling_tf_outputs.TFMultipleChoiceModelOutput + +## TFTokenClassifierOutput + +[[autodoc]] modeling_tf_outputs.TFTokenClassifierOutput + +## TFQuestionAnsweringModelOutput + +[[autodoc]] modeling_tf_outputs.TFQuestionAnsweringModelOutput + +## TFSeq2SeqQuestionAnsweringModelOutput + +[[autodoc]] modeling_tf_outputs.TFSeq2SeqQuestionAnsweringModelOutput + +## FlaxBaseModelOutput + +[[autodoc]] modeling_flax_outputs.FlaxBaseModelOutput + +## FlaxBaseModelOutputWithPast + +[[autodoc]] modeling_flax_outputs.FlaxBaseModelOutputWithPast + +## FlaxBaseModelOutputWithPooling + +[[autodoc]] modeling_flax_outputs.FlaxBaseModelOutputWithPooling + +## FlaxBaseModelOutputWithPastAndCrossAttentions + +[[autodoc]] modeling_flax_outputs.FlaxBaseModelOutputWithPastAndCrossAttentions + +## FlaxSeq2SeqModelOutput + +[[autodoc]] modeling_flax_outputs.FlaxSeq2SeqModelOutput + +## FlaxCausalLMOutputWithCrossAttentions + +[[autodoc]] modeling_flax_outputs.FlaxCausalLMOutputWithCrossAttentions + +## FlaxMaskedLMOutput + +[[autodoc]] modeling_flax_outputs.FlaxMaskedLMOutput + +## FlaxSeq2SeqLMOutput + +[[autodoc]] modeling_flax_outputs.FlaxSeq2SeqLMOutput + +## FlaxNextSentencePredictorOutput + +[[autodoc]] modeling_flax_outputs.FlaxNextSentencePredictorOutput + +## FlaxSequenceClassifierOutput + +[[autodoc]] modeling_flax_outputs.FlaxSequenceClassifierOutput + +## FlaxSeq2SeqSequenceClassifierOutput + +[[autodoc]] modeling_flax_outputs.FlaxSeq2SeqSequenceClassifierOutput + +## FlaxMultipleChoiceModelOutput + +[[autodoc]] modeling_flax_outputs.FlaxMultipleChoiceModelOutput + +## FlaxTokenClassifierOutput + +[[autodoc]] modeling_flax_outputs.FlaxTokenClassifierOutput + +## FlaxQuestionAnsweringModelOutput + +[[autodoc]] modeling_flax_outputs.FlaxQuestionAnsweringModelOutput + +## FlaxSeq2SeqQuestionAnsweringModelOutput + +[[autodoc]] modeling_flax_outputs.FlaxSeq2SeqQuestionAnsweringModelOutput diff --git a/docs/source/ja/main_classes/pipelines.md b/docs/source/ja/main_classes/pipelines.md new file mode 100644 index 00000000000000..8e3f61130bdcaa --- /dev/null +++ b/docs/source/ja/main_classes/pipelines.md @@ -0,0 +1,500 @@ + + +# Pipelines + +パイプラインは、推論にモデルを使うための簡単で優れた方法である。パイプラインは、複雑なコードのほとんどを抽象化したオブジェクトです。 +パイプラインは、ライブラリから複雑なコードのほとんどを抽象化したオブジェクトで、名前付き固有表現認識、マスク言語モデリング、感情分析、特徴抽出、質問応答などのタスクに特化したシンプルなAPIを提供します。 
+Recognition、Masked Language Modeling、Sentiment Analysis、Feature Extraction、Question Answeringなどのタスクに特化したシンプルなAPIを提供します。以下を参照のこと。 +[タスク概要](../task_summary)を参照してください。 + + +パイプラインの抽象化には2つのカテゴリーがある: + +- [`pipeline`] は、他のすべてのパイプラインをカプセル化する最も強力なオブジェクトです。 +- タスク固有のパイプラインは、[オーディオ](#audio)、[コンピューター ビジョン](#computer-vision)、[自然言語処理](#natural-language-processing)、および [マルチモーダル](#multimodal) タスクで使用できます。 + +## The pipeline abstraction + +*パイプライン* 抽象化は、他のすべての利用可能なパイプラインのラッパーです。他のものと同様にインスタンス化されます +パイプラインですが、さらなる生活の質を提供できます。 + +1 つの項目に対する単純な呼び出し: + +```python +>>> pipe = pipeline("text-classification") +>>> pipe("This restaurant is awesome") +[{'label': 'POSITIVE', 'score': 0.9998743534088135}] +``` + +[ハブ](https://huggingface.co) の特定のモデルを使用したい場合は、モデルがオンになっている場合はタスクを無視できます。 +ハブはすでにそれを定義しています。 + +```python +>>> pipe = pipeline(model="FacebookAI/roberta-large-mnli") +>>> pipe("This restaurant is awesome") +[{'label': 'NEUTRAL', 'score': 0.7313136458396912}] +``` + +多くの項目に対してパイプラインを呼び出すには、*list* を使用してパイプラインを呼び出すことができます。 + +```python +>>> pipe = pipeline("text-classification") +>>> pipe(["This restaurant is awesome", "This restaurant is awful"]) +[{'label': 'POSITIVE', 'score': 0.9998743534088135}, + {'label': 'NEGATIVE', 'score': 0.9996669292449951}] +``` + +完全なデータセットを反復するには、`Dataset`を直接使用することをお勧めします。これは、割り当てる必要がないことを意味します +データセット全体を一度に処理することも、自分でバッチ処理を行う必要もありません。これはカスタムループと同じくらい速く動作するはずです。 +GPU。それが問題でない場合は、ためらわずに問題を作成してください。 + +```python +import datasets +from transformers import pipeline +from transformers.pipelines.pt_utils import KeyDataset +from tqdm.auto import tqdm + +pipe = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0) +dataset = datasets.load_dataset("superb", name="asr", split="test") + +# KeyDataset (only *pt*) will simply return the item in the dict returned by the dataset item +# as we're not interested in the *target* part of the dataset. For sentence pair use KeyPairDataset +for out in tqdm(pipe(KeyDataset(dataset, "file"))): + print(out) + # {"text": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND"} + # {"text": ....} + # .... +``` + +使いやすくするために、ジェネレーターを使用することもできます。 + +```python +from transformers import pipeline + +pipe = pipeline("text-classification") + + +def data(): + while True: + # This could come from a dataset, a database, a queue or HTTP request + # in a server + # Caveat: because this is iterative, you cannot use `num_workers > 1` variable + # to use multiple threads to preprocess data. You can still have 1 thread that + # does the preprocessing while the main runs the big inference + yield "This is a test" + + +for out in pipe(data()): + print(out) + # {"text": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND"} + # {"text": ....} + # .... 
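+    # 注: パイプラインはジェネレーターを遅延的に順次消費するため、
+    # 入力全体を事前にメモリへ載せておく必要はありません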
+``` + +[[autodoc]] pipeline + + +## Pipeline batching + + +すべてのパイプラインでバッチ処理を使用できます。これはうまくいきます +パイプラインがストリーミング機能を使用するときは常に (つまり、リスト、`dataset`、または `generator`を渡すとき)。 + +```python +from transformers import pipeline +from transformers.pipelines.pt_utils import KeyDataset +import datasets + +dataset = datasets.load_dataset("imdb", name="plain_text", split="unsupervised") +pipe = pipeline("text-classification", device=0) +for out in pipe(KeyDataset(dataset, "text"), batch_size=8, truncation="only_first"): + print(out) + # [{'label': 'POSITIVE', 'score': 0.9998743534088135}] + # Exactly the same output as before, but the content are passed + # as batches to the model +``` + + + + +ただし、これによってパフォーマンスが自動的に向上するわけではありません。状況に応じて、10 倍の高速化または 5 倍の低速化のいずれかになります。 +ハードウェア、データ、使用されている実際のモデルについて。 + +主に高速化である例: + + + + +```python +from transformers import pipeline +from torch.utils.data import Dataset +from tqdm.auto import tqdm + +pipe = pipeline("text-classification", device=0) + + +class MyDataset(Dataset): + def __len__(self): + return 5000 + + def __getitem__(self, i): + return "This is a test" + + +dataset = MyDataset() + +for batch_size in [1, 8, 64, 256]: + print("-" * 30) + print(f"Streaming batch_size={batch_size}") + for out in tqdm(pipe(dataset, batch_size=batch_size), total=len(dataset)): + pass +``` + +``` +# On GTX 970 +------------------------------ +Streaming no batching +100%|██████████████████████████████████████████████████████████████████████| 5000/5000 [00:26<00:00, 187.52it/s] +------------------------------ +Streaming batch_size=8 +100%|█████████████████████████████████████████████████████████████████████| 5000/5000 [00:04<00:00, 1205.95it/s] +------------------------------ +Streaming batch_size=64 +100%|█████████████████████████████████████████████████████████████████████| 5000/5000 [00:02<00:00, 2478.24it/s] +------------------------------ +Streaming batch_size=256 +100%|█████████████████████████████████████████████████████████████████████| 5000/5000 [00:01<00:00, 2554.43it/s] +(diminishing returns, saturated the GPU) +``` + +最も速度が低下する例: + + +```python +class MyDataset(Dataset): + def __len__(self): + return 5000 + + def __getitem__(self, i): + if i % 64 == 0: + n = 100 + else: + n = 1 + return "This is a test" * n +``` + +これは、他の文に比べて非常に長い文が時折あります。その場合、**全体**のバッチは 400 である必要があります。 +トークンが長いため、バッチ全体が [64, 4] ではなく [64, 400] になり、速度が大幅に低下します。さらに悪いことに、 +バッチが大きくなると、プログラムは単純にクラッシュします。 + +``` +------------------------------ +Streaming no batching +100%|█████████████████████████████████████████████████████████████████████| 1000/1000 [00:05<00:00, 183.69it/s] +------------------------------ +Streaming batch_size=8 +100%|█████████████████████████████████████████████████████████████████████| 1000/1000 [00:03<00:00, 265.74it/s] +------------------------------ +Streaming batch_size=64 +100%|██████████████████████████████████████████████████████████████████████| 1000/1000 [00:26<00:00, 37.80it/s] +------------------------------ +Streaming batch_size=256 + 0%| | 0/1000 [00:00 + for out in tqdm(pipe(dataset, batch_size=256), total=len(dataset)): +.... + q = q / math.sqrt(dim_per_head) # (bs, n_heads, q_length, dim_per_head) +RuntimeError: CUDA out of memory. 
Tried to allocate 376.00 MiB (GPU 0; 3.95 GiB total capacity; 1.72 GiB already allocated; 354.88 MiB free; 2.46 GiB reserved in total by PyTorch) +``` + +この問題に対する適切な (一般的な) 解決策はなく、使用できる距離はユースケースによって異なる場合があります。のルール +親指: + +ユーザーにとっての経験則は次のとおりです。 + +- **ハードウェアを使用して、負荷に対するパフォーマンスを測定します。測って、測って、測り続ける。実数というのは、 + 進むべき唯一の方法。** +- レイテンシに制約がある場合 (実際の製品が推論を実行している場合)、バッチ処理を行わないでください。 +- CPU を使用している場合は、バッチ処理を行わないでください。 +- GPU でスループットを使用している場合 (大量の静的データでモデルを実行したい場合)、次のようにします。 + + - sequence_length (「自然な」データ) のサイズについてまったくわからない場合は、デフォルトではバッチ処理や測定を行わず、 + 暫定的に追加してみます。失敗した場合に回復するために OOM チェックを追加します (失敗した場合は、ある時点で回復します)。 + sequence_length を制御します。) + - sequence_length が非常に規則的である場合、バッチ処理は非常に興味深いものとなる可能性が高く、測定してプッシュしてください。 + OOM が発生するまで続けます。 + - GPU が大きいほど、バッチ処理がより興味深いものになる可能性が高くなります。 +- バッチ処理を有効にしたらすぐに、OOM を適切に処理できることを確認してください。 + + +## Pipeline chunk batching + +`zero-shot-classification` と `question-answering` は、単一の入力で結果が得られる可能性があるという意味で、少し特殊です。 +モデルの複数の前方パス。通常の状況では、これにより `batch_size` 引数に関する問題が発生します。 + +この問題を回避するために、これらのパイプラインはどちらも少し特殊になっており、代わりに `ChunkPipeline` になっています。 +通常の `Pipeline`。要するに: + +```python +preprocessed = pipe.preprocess(inputs) +model_outputs = pipe.forward(preprocessed) +outputs = pipe.postprocess(model_outputs) +``` + +今は次のようになります: + +```python +all_model_outputs = [] +for preprocessed in pipe.preprocess(inputs): + model_outputs = pipe.forward(preprocessed) + all_model_outputs.append(model_outputs) +outputs = pipe.postprocess(all_model_outputs) +``` + +パイプラインは以下で使用されるため、これはコードに対して非常に透過的である必要があります。 +同じ方法。 + +パイプラインはバッチを自動的に処理できるため、これは簡略化されたビューです。気にする必要はないという意味です +入力が実際にトリガーする前方パスの数については、`batch_size` を最適化できます。 +入力とは独立して。前のセクションの注意事項が引き続き適用されます。 + +## Pipeline custom code + +特定のパイプラインをオーバーライドする場合。 + +目の前のタスクに関する問題を作成することを躊躇しないでください。パイプラインの目標は、使いやすく、ほとんどのユーザーをサポートすることです。 +したがって、`transformers`があなたのユースケースをサポートする可能性があります。 + + +単純に試してみたい場合は、次のことができます。 + +- 選択したパイプラインをサブクラス化します + +```python +class MyPipeline(TextClassificationPipeline): + def postprocess(): + # Your code goes here + scores = scores * 100 + # And here + + +my_pipeline = MyPipeline(model=model, tokenizer=tokenizer, ...) 
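+# 直接インスタンス化する場合は、ロード済みの model / tokenizer オブジェクトをそのまま渡します
+# (下の pipeline() 関数を使う方法では、モデル名からのロードとパイプラインの構築をまとめて行えます)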
+# or if you use *pipeline* function, then: +my_pipeline = pipeline(model="xxxx", pipeline_class=MyPipeline) +``` + +これにより、必要なカスタム コードをすべて実行できるようになります。 + +## Implementing a pipeline + +[Implementing a new pipeline](../add_new_pipeline) + +## Audio + +オーディオ タスクに使用できるパイプラインには次のものがあります。 + +### AudioClassificationPipeline + +[[autodoc]] AudioClassificationPipeline + - __call__ + - all + +### AutomaticSpeechRecognitionPipeline + +[[autodoc]] AutomaticSpeechRecognitionPipeline + - __call__ + - all + +### TextToAudioPipeline + +[[autodoc]] TextToAudioPipeline + - __call__ + - all + + +### ZeroShotAudioClassificationPipeline + +[[autodoc]] ZeroShotAudioClassificationPipeline + - __call__ + - all + +## Computer vision + +コンピューター ビジョン タスクに使用できるパイプラインには次のものがあります。 + +### DepthEstimationPipeline +[[autodoc]] DepthEstimationPipeline + - __call__ + - all + +### ImageClassificationPipeline + +[[autodoc]] ImageClassificationPipeline + - __call__ + - all + +### ImageSegmentationPipeline + +[[autodoc]] ImageSegmentationPipeline + - __call__ + - all + +### ImageToImagePipeline + +[[autodoc]] ImageToImagePipeline + - __call__ + - all + +### ObjectDetectionPipeline + +[[autodoc]] ObjectDetectionPipeline + - __call__ + - all + +### VideoClassificationPipeline + +[[autodoc]] VideoClassificationPipeline + - __call__ + - all + +### ZeroShotImageClassificationPipeline + +[[autodoc]] ZeroShotImageClassificationPipeline + - __call__ + - all + +### ZeroShotObjectDetectionPipeline + +[[autodoc]] ZeroShotObjectDetectionPipeline + - __call__ + - all + +## Natural Language Processing + +自然言語処理タスクに使用できるパイプラインには次のものがあります。 + +### ConversationalPipeline + +[[autodoc]] Conversation + +[[autodoc]] ConversationalPipeline + - __call__ + - all + +### FillMaskPipeline + +[[autodoc]] FillMaskPipeline + - __call__ + - all + +### NerPipeline + +[[autodoc]] NerPipeline + +詳細については、[`TokenClassificationPipeline`] を参照してください。 + +### QuestionAnsweringPipeline + +[[autodoc]] QuestionAnsweringPipeline + - __call__ + - all + +### SummarizationPipeline + +[[autodoc]] SummarizationPipeline + - __call__ + - all + +### TableQuestionAnsweringPipeline + +[[autodoc]] TableQuestionAnsweringPipeline + - __call__ + +### TextClassificationPipeline + +[[autodoc]] TextClassificationPipeline + - __call__ + - all + +### TextGenerationPipeline + +[[autodoc]] TextGenerationPipeline + - __call__ + - all + +### Text2TextGenerationPipeline + +[[autodoc]] Text2TextGenerationPipeline + - __call__ + - all + +### TokenClassificationPipeline + +[[autodoc]] TokenClassificationPipeline + - __call__ + - all + +### TranslationPipeline + +[[autodoc]] TranslationPipeline + - __call__ + - all + +### ZeroShotClassificationPipeline + +[[autodoc]] ZeroShotClassificationPipeline + - __call__ + - all + +## Multimodal + +マルチモーダル タスクに使用できるパイプラインには次のものがあります。 + +### DocumentQuestionAnsweringPipeline + +[[autodoc]] DocumentQuestionAnsweringPipeline + - __call__ + - all + +### FeatureExtractionPipeline + +[[autodoc]] FeatureExtractionPipeline + - __call__ + - all + +### ImageFeatureExtractionPipeline + +[[autodoc]] ImageFeatureExtractionPipeline + - __call__ + - all + +### ImageToTextPipeline + +[[autodoc]] ImageToTextPipeline + - __call__ + - all + +### VisualQuestionAnsweringPipeline + +[[autodoc]] VisualQuestionAnsweringPipeline + - __call__ + - all + +## Parent class: `Pipeline` + +[[autodoc]] Pipeline diff --git a/docs/source/ja/main_classes/processors.md b/docs/source/ja/main_classes/processors.md new file mode 100644 index 00000000000000..63b94af6ea43f3 --- /dev/null +++ 
b/docs/source/ja/main_classes/processors.md @@ -0,0 +1,160 @@ + + +# Processors + +Transformers ライブラリでは、プロセッサは 2 つの異なる意味を持ちます。 +- [Wav2Vec2](../model_doc/wav2vec2) などのマルチモーダル モデルの入力を前処理するオブジェクト (音声とテキスト) + または [CLIP](../model_doc/clip) (テキストとビジョン) +- 古いバージョンのライブラリで GLUE または SQUAD のデータを前処理するために使用されていたオブジェクトは非推奨になりました。 + +## Multi-modal processors + +マルチモーダル モデルでは、オブジェクトが複数のモダリティ (テキスト、 +視覚と音声)。これは、2 つ以上の処理オブジェクトをグループ化するプロセッサーと呼ばれるオブジェクトによって処理されます。 +トークナイザー (テキスト モダリティ用)、画像プロセッサー (視覚用)、特徴抽出器 (オーディオ用) など。 + +これらのプロセッサは、保存およびロード機能を実装する次の基本クラスを継承します。 + +[[autodoc]] ProcessorMixin + +## Deprecated processors + +すべてのプロセッサは、同じアーキテクチャに従っています。 +[`~data.processors.utils.DataProcessor`]。プロセッサは次のリストを返します。 +[`~data.processors.utils.InputExample`]。これら +[`~data.processors.utils.InputExample`] は次のように変換できます。 +[`~data.processors.utils.Input features`] をモデルにフィードします。 + +[[autodoc]] data.processors.utils.DataProcessor + +[[autodoc]] data.processors.utils.InputExample + +[[autodoc]] data.processors.utils.InputFeatures + +## GLUE + +[一般言語理解評価 (GLUE)](https://gluebenchmark.com/) は、 +既存の NLU タスクの多様なセットにわたるモデルのパフォーマンス。紙と同時発売された [GLUE: A +自然言語理解のためのマルチタスクベンチマークおよび分析プラットフォーム](https://openreview.net/pdf?id=rJ4km2R5t7) + +このライブラリは、MRPC、MNLI、MNLI (不一致)、CoLA、SST2、STSB、 +QQP、QNLI、RTE、WNLI。 + +それらのプロセッサは次のとおりです。 + +- [`~data.processors.utils.MrpcProcessor`] +- [`~data.processors.utils.MnliProcessor`] +- [`~data.processors.utils.MnliMismatchedProcessor`] +- [`~data.processors.utils.Sst2Processor`] +- [`~data.processors.utils.StsbProcessor`] +- [`~data.processors.utils.QqpProcessor`] +- [`~data.processors.utils.QnliProcessor`] +- [`~data.processors.utils.RteProcessor`] +- [`~data.processors.utils.WnliProcessor`] + + +さらに、次のメソッドを使用して、データ ファイルから値をロードし、それらをリストに変換することができます。 +[`~data.processors.utils.InputExample`]。 + +[[autodoc]] data.processors.glue.glue_convert_examples_to_features + +## XNLI + +[クロスリンガル NLI コーパス (XNLI)](https://www.nyu.edu/projects/bowman/xnli/) は、 +言語を超えたテキスト表現の品質。 XNLI は、[*MultiNLI*](http://www.nyu.edu/projects/bowman/multinli/) に基づくクラウドソースのデータセットです。テキストのペアには、15 個のテキスト含意アノテーションがラベル付けされています。 +さまざまな言語 (英語などの高リソース言語とスワヒリ語などの低リソース言語の両方を含む)。 + +論文 [XNLI: Evaluating Cross-lingual Sentence Representations](https://arxiv.org/abs/1809.05053) と同時にリリースされました。 + +このライブラリは、XNLI データをロードするプロセッサをホストします。 + +- [`~data.processors.utils.XnliProcessor`] + +テストセットにはゴールドラベルが付いているため、評価はテストセットで行われますのでご了承ください。 + +これらのプロセッサを使用する例は、[run_xnli.py](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification/run_xnli.py) スクリプトに示されています。 + +## SQuAD + +[The Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer//) は、次のベンチマークです。 +質問応答に関するモデルのパフォーマンスを評価します。 v1.1 と v2.0 の 2 つのバージョンが利用可能です。最初のバージョン +(v1.1) は、論文 [SQuAD: 100,000+ question for Machine Comprehension of Text](https://arxiv.org/abs/1606.05250) とともにリリースされました。 2 番目のバージョン (v2.0) は、論文 [Know What You Don't と同時にリリースされました。 +知っておくべき: SQuAD の答えられない質問](https://arxiv.org/abs/1806.03822)。 + +このライブラリは、次の 2 つのバージョンのそれぞれのプロセッサをホストします。 + +### Processors + +それらのプロセッサは次のとおりです。 + +- [`~data.processors.utils.SquadV1Processor`] +- [`~data.processors.utils.SquadV2Processor`] + +どちらも抽象クラス [`~data.processors.utils.SquadProcessor`] を継承しています。 + +[[autodoc]] data.processors.squad.SquadProcessor + - all + +さらに、次のメソッドを使用して、SQuAD の例を次の形式に変換できます。 +モデルの入力として使用できる [`~data.processors.utils.SquadFeatures`]。 + +[[autodoc]] data.processors.squad.squad_convert_examples_to_features + +これらのプロセッサと前述の方法は、データを含むファイルだけでなく、 +*tensorflow_datasets* パッケージ。以下に例を示します。 + 
+### Example usage + +以下にプロセッサを使用した例と、データ ファイルを使用した変換方法を示します。 + +```python +# Loading a V2 processor +processor = SquadV2Processor() +examples = processor.get_dev_examples(squad_v2_data_dir) + +# Loading a V1 processor +processor = SquadV1Processor() +examples = processor.get_dev_examples(squad_v1_data_dir) + +features = squad_convert_examples_to_features( + examples=examples, + tokenizer=tokenizer, + max_seq_length=max_seq_length, + doc_stride=args.doc_stride, + max_query_length=max_query_length, + is_training=not evaluate, +) +``` + +*tensorflow_datasets* の使用は、データ ファイルを使用するのと同じくらい簡単です。 + +```python +# tensorflow_datasets only handle Squad V1. +tfds_examples = tfds.load("squad") +examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate) + +features = squad_convert_examples_to_features( + examples=examples, + tokenizer=tokenizer, + max_seq_length=max_seq_length, + doc_stride=args.doc_stride, + max_query_length=max_query_length, + is_training=not evaluate, +) +``` + +これらのプロセッサを使用する別の例は、[run_squad.py](https://github.com/huggingface/transformers/tree/main/examples/legacy/question-answering/run_squad.py) スクリプトに示されています。 diff --git a/docs/source/ja/main_classes/quantization.md b/docs/source/ja/main_classes/quantization.md new file mode 100644 index 00000000000000..3af3130a849f19 --- /dev/null +++ b/docs/source/ja/main_classes/quantization.md @@ -0,0 +1,447 @@ + + +# Quantize 🤗 Transformers models + +## `AutoGPTQ` Integration + + +🤗 Transformers には、言語モデルで GPTQ 量子化を実行するための `optimum` API が統合されています。パフォーマンスを大幅に低下させることなく、推論速度を高速化することなく、モデルを 8、4、3、さらには 2 ビットでロードおよび量子化できます。これは、ほとんどの GPU ハードウェアでサポートされています。 + +量子化モデルの詳細については、以下を確認してください。 +- [GPTQ](https://arxiv.org/pdf/2210.17323.pdf) 論文 +- GPTQ 量子化に関する `optimum` [ガイド](https://huggingface.co/docs/optimum/llm_quantization/usage_guides/quantization) +- バックエンドとして使用される [`AutoGPTQ`](https://github.com/PanQiWei/AutoGPTQ) ライブラリ + +### Requirements + +以下のコードを実行するには、以下の要件がインストールされている必要があります: + +- 最新の `AutoGPTQ` ライブラリをインストールする。 +`pip install auto-gptq` をインストールする。 + +- 最新の `optimum` をソースからインストールする。 +`git+https://github.com/huggingface/optimum.git` をインストールする。 + +- 最新の `transformers` をソースからインストールする。 +最新の `transformers` をソースからインストールする `pip install git+https://github.com/huggingface/transformers.git` + +- 最新の `accelerate` ライブラリをインストールする。 +`pip install --upgrade accelerate` を実行する。 + +GPTQ統合は今のところテキストモデルのみをサポートしているので、視覚、音声、マルチモーダルモデルでは予期せぬ挙動に遭遇するかもしれないことに注意してください。 + +### Load and quantize a model + +GPTQ は、量子化モデルを使用する前に重みのキャリブレーションを必要とする量子化方法です。トランスフォーマー モデルを最初から量子化する場合は、量子化モデルを作成するまでに時間がかかることがあります (`facebook/opt-350m`モデルの Google colab では約 5 分)。 + +したがって、GPTQ 量子化モデルを使用するシナリオは 2 つあります。最初の使用例は、ハブで利用可能な他のユーザーによってすでに量子化されたモデルをロードすることです。2 番目の使用例は、モデルを最初から量子化し、保存するかハブにプッシュして、他のユーザーが使用できるようにすることです。それも使ってください。 + +#### GPTQ Configuration + +モデルをロードして量子化するには、[`GPTQConfig`] を作成する必要があります。データセットを準備するには、`bits`の数、量子化を調整するための`dataset`、およびモデルの`Tokenizer`を渡す必要があります。 + +```python +model_id = "facebook/opt-125m" +tokenizer = AutoTokenizer.from_pretrained(model_id) +gptq_config = GPTQConfig(bits=4, dataset = "c4", tokenizer=tokenizer) +``` + +独自のデータセットを文字列のリストとして渡すことができることに注意してください。ただし、GPTQ 論文のデータセットを使用することを強くお勧めします。 + +```python +dataset = ["auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."] +quantization = GPTQConfig(bits=4, dataset = dataset, tokenizer=tokenizer) +``` + +#### Quantization + +`from_pretrained` を使用し、`quantization_config` を設定することでモデルを量子化できます。 + +```python +from transformers import 
AutoModelForCausalLM +model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=gptq_config) +``` + +モデルを量子化するには GPU が必要であることに注意してください。モデルを CPU に配置し、量子化するためにモジュールを GPU に前後に移動させます。 + +CPU オフロードの使用中に GPU の使用量を最大化したい場合は、`device_map = "auto"` を設定できます。 + +```python +from transformers import AutoModelForCausalLM +model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=gptq_config) +``` + +ディスク オフロードはサポートされていないことに注意してください。さらに、データセットが原因でメモリが不足している場合は、`from_pretained` で `max_memory` を渡す必要がある場合があります。 `device_map`と`max_memory`の詳細については、この [ガイド](https://huggingface.co/docs/accelerate/usage_guides/big_modeling#designing-a-device-map) を参照してください。 + + +GPTQ 量子化は、現時点ではテキスト モデルでのみ機能します。さらに、量子化プロセスはハードウェアによっては長時間かかる場合があります (NVIDIA A100 を使用した場合、175B モデル = 4 gpu 時間)。モデルの GPTQ 量子化バージョンが存在しない場合は、ハブで確認してください。そうでない場合は、github で要求を送信できます。 + + +### Push quantized model to 🤗 Hub + +他の 🤗 モデルと同様に、`push_to_hub` を使用して量子化モデルをハブにプッシュできます。量子化構成は保存され、モデルに沿ってプッシュされます。 + +```python +quantized_model.push_to_hub("opt-125m-gptq") +tokenizer.push_to_hub("opt-125m-gptq") +``` + +量子化されたモデルをローカル マシンに保存したい場合は、`save_pretrained` を使用して行うこともできます。 + + +```python +quantized_model.save_pretrained("opt-125m-gptq") +tokenizer.save_pretrained("opt-125m-gptq") +``` + +`device_map` を使用してモデルを量子化した場合は、保存する前にモデル全体を GPU または `cpu` のいずれかに移動してください。 + +```python +quantized_model.to("cpu") +quantized_model.save_pretrained("opt-125m-gptq") +``` + +### Load a quantized model from the 🤗 Hub + +`from_pretrained`を使用して、量子化されたモデルをハブからロードできます。 +属性 `quantization_config` がモデル設定オブジェクトに存在することを確認して、プッシュされた重みが量子化されていることを確認します。 + +```python +from transformers import AutoModelForCausalLM +model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq") +``` + +必要以上のメモリを割り当てずにモデルをより速くロードしたい場合は、`device_map` 引数は量子化モデルでも機能します。 `accelerate`ライブラリがインストールされていることを確認してください。 + +```python +from transformers import AutoModelForCausalLM +model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto") +``` + +### Exllama kernels for faster inference + +4 ビット モデルの場合、推論速度を高めるために exllama カーネルを使用できます。デフォルトで有効になっています。 [`GPTQConfig`] で `disable_exllama` を渡すことで、その動作を変更できます。これにより、設定に保存されている量子化設定が上書きされます。カーネルに関連する属性のみを上書きできることに注意してください。さらに、exllama カーネルを使用したい場合は、モデル全体を GPU 上に置く必要があります。 + + +```py +import torch +gptq_config = GPTQConfig(bits=4, disable_exllama=False) +model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config = gptq_config) +``` + +現時点では 4 ビット モデルのみがサポートされていることに注意してください。さらに、peft を使用して量子化モデルを微調整している場合は、exllama カーネルを非アクティブ化することをお勧めします。 + +#### Fine-tune a quantized model + +Hugging Face エコシステムのアダプターの公式サポートにより、GPTQ で量子化されたモデルを微調整できます。 +詳細については、[`peft`](https://github.com/huggingface/peft) ライブラリをご覧ください。 + +### Example demo + +GPTQ を使用してモデルを量子化する方法と、peft を使用して量子化されたモデルを微調整する方法については、Google Colab [ノートブック](https://colab.research.google.com/drive/1_TIrmuKOFhuRRiTWN94iLKUFu6ZX4ceb?usp=sharing) を参照してください。 + +### GPTQConfig + +[[autodoc]] GPTQConfig + +## `bitsandbytes` Integration + +🤗 Transformers は、`bitsandbytes` で最もよく使用されるモジュールと緊密に統合されています。数行のコードでモデルを 8 ビット精度でロードできます。 +これは、`bitsandbytes`の `0.37.0`リリース以降、ほとんどの GPU ハードウェアでサポートされています。 + +量子化方法の詳細については、[LLM.int8()](https://arxiv.org/abs/2208.07339) 論文、または [ブログ投稿](https://huggingface.co/blog/hf-bitsandbytes-) をご覧ください。統合)コラボレーションについて。 + +`0.39.0`リリース以降、FP4 データ型を活用し、4 ビット量子化を使用して`device_map`をサポートする任意のモデルをロードできます。 + +独自の pytorch モデルを量子化したい場合は、🤗 Accelerate ライブラリの 
[ドキュメント](https://huggingface.co/docs/accelerate/main/en/usage_guides/quantization) をチェックしてください。 + +`bitsandbytes`統合を使用してできることは次のとおりです + +### General usage + +モデルが 🤗 Accelerate による読み込みをサポートし、`torch.nn.Linear` レイヤーが含まれている限り、 [`~PreTrainedModel.from_pretrained`] メソッドを呼び出すときに `load_in_8bit` または `load_in_4bit` 引数を使用してモデルを量子化できます。これはどのようなモダリティでも同様に機能するはずです。 + +```python +from transformers import AutoModelForCausalLM + +model_8bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_8bit=True) +model_4bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_4bit=True) +``` + +デフォルトでは、他のすべてのモジュール (例: `torch.nn.LayerNorm`) は `torch.float16` に変換されますが、その `dtype` を変更したい場合は、`torch_dtype` 引数を上書きできます。 + +```python +>>> import torch +>>> from transformers import AutoModelForCausalLM + +>>> model_8bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_8bit=True, torch_dtype=torch.float32) +>>> model_8bit.model.decoder.layers[-1].final_layer_norm.weight.dtype +torch.float32 +``` + +### FP4 quantization + +#### Requirements + +以下のコード スニペットを実行する前に、以下の要件がインストールされていることを確認してください。 + +- 最新の`bitsandbytes`ライブラリ +`pip install bitsandbytes>=0.39.0` + +- 最新の`accelerate`をインストールする +`pip install --upgrade accelerate` + +- 最新の `transformers` をインストールする +`pip install --upgrade transformers` + +#### Tips and best practices + +- **高度な使用法:** 可能なすべてのオプションを使用した 4 ビット量子化の高度な使用法については、[この Google Colab ノートブック](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf) を参照してください。 + +- **`batch_size=1` による高速推論 :** bitsandbytes の `0.40.0` リリース以降、`batch_size=1` では高速推論の恩恵を受けることができます。 [これらのリリース ノート](https://github.com/TimDettmers/bitsandbytes/releases/tag/0.40.0) を確認し、この機能を活用するには`0.40.0`以降のバージョンを使用していることを確認してください。箱の。 + +- **トレーニング:** [QLoRA 論文](https://arxiv.org/abs/2305.14314) によると、4 ビット基本モデルをトレーニングする場合 (例: LoRA アダプターを使用)、`bnb_4bit_quant_type='nf4'` を使用する必要があります。 。 + +- **推論:** 推論の場合、`bnb_4bit_quant_type` はパフォーマンスに大きな影響を与えません。ただし、モデルの重みとの一貫性を保つために、必ず同じ `bnb_4bit_compute_dtype` および `torch_dtype` 引数を使用してください。 + + +#### Load a large model in 4bit + +`.from_pretrained` メソッドを呼び出すときに `load_in_4bit=True` を使用すると、メモリ使用量を (おおよそ) 4 で割ることができます。 + +```python +# pip install transformers accelerate bitsandbytes +from transformers import AutoModelForCausalLM, AutoTokenizer + +model_id = "bigscience/bloom-1b7" + +tokenizer = AutoTokenizer.from_pretrained(model_id) +model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_4bit=True) +``` + + + +モデルが 4 ビットでロードされると、現時点では量子化された重みをハブにプッシュすることはできないことに注意してください。 4 ビットの重みはまだサポートされていないため、トレーニングできないことにも注意してください。ただし、4 ビット モデルを使用して追加のパラメーターをトレーニングすることもできます。これについては次のセクションで説明します。 + + + +### Load a large model in 8bit + +`.from_pretrained` メソッドを呼び出すときに `load_in_8bit=True` 引数を使用すると、メモリ要件をおよそ半分にしてモデルをロードできます。 + +```python +# pip install transformers accelerate bitsandbytes +from transformers import AutoModelForCausalLM, AutoTokenizer + +model_id = "bigscience/bloom-1b7" + +tokenizer = AutoTokenizer.from_pretrained(model_id) +model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True) +``` + +次に、通常 [`PreTrainedModel`] を使用するのと同じようにモデルを使用します。 + +`get_memory_footprint` メソッドを使用して、モデルのメモリ フットプリントを確認できます。 + +```python +print(model.get_memory_footprint()) +``` + +この統合により、大きなモデルを小さなデバイスにロードし、問題なく実行できるようになりました。 + + +モデルが 8 ビットでロードされると、最新の `transformers`と`bitsandbytes`を使用する場合を除き、量子化された重みをハブにプッシュすることは現在不可能であることに注意してください。 8 ビットの重みはまだサポートされていないため、トレーニングできないことにも注意してください。ただし、8 ビット 
モデルを使用して追加のパラメーターをトレーニングすることもできます。これについては次のセクションで説明します。 +また、`device_map` はオプションですが、利用可能なリソース上でモデルを効率的にディスパッチするため、推論には `device_map = 'auto'` を設定することが推奨されます。 + + + +#### Advanced use cases + +ここでは、FP4 量子化を使用して実行できるいくつかの高度な使用例について説明します。 + +##### Change the compute dtype + +compute dtype は、計算中に使用される dtype を変更するために使用されます。たとえば、隠し状態は`float32`にありますが、高速化のために計算を bf16 に設定できます。デフォルトでは、compute dtype は `float32` に設定されます。 + +```python +import torch +from transformers import BitsAndBytesConfig + +quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16) +``` + +##### Using NF4 (Normal Float 4) data type + +NF4 データ型を使用することもできます。これは、正規分布を使用して初期化された重みに適合した新しい 4 ビット データ型です。その実行のために: + +```python +from transformers import BitsAndBytesConfig + +nf4_config = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_quant_type="nf4", +) + +model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config) +``` + +##### Use nested quantization for more memory efficient inference + +また、ネストされた量子化手法を使用することをお勧めします。これにより、パフォーマンスを追加することなく、より多くのメモリが節約されます。経験的な観察から、これにより、NVIDIA-T4 16GB 上でシーケンス長 1024、バッチ サイズ 1、勾配累積ステップ 4 の llama-13b モデルを微調整することが可能になります。 + +```python +from transformers import BitsAndBytesConfig + +double_quant_config = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_use_double_quant=True, +) + +model_double_quant = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=double_quant_config) +``` + + +### Push quantized models on the 🤗 Hub + +`push_to_hub`メソッドを単純に使用することで、量子化されたモデルをハブにプッシュできます。これにより、最初に量子化構成ファイルがプッシュされ、次に量子化されたモデルの重みがプッシュされます。 +この機能を使用できるようにするには、必ず `bitsandbytes>0.37.2` を使用してください (この記事の執筆時点では、`bitsandbytes==0.38.0.post1` でテストしました)。 + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer + +model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m", device_map="auto", load_in_8bit=True) +tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m") + +model.push_to_hub("bloom-560m-8bit") +``` + + + +大規模なモデルでは、ハブ上で 8 ビット モデルをプッシュすることが強く推奨されます。これにより、コミュニティはメモリ フットプリントの削減と、たとえば Google Colab での大規模なモデルの読み込みによる恩恵を受けることができます。 + + + +### Load a quantized model from the 🤗 Hub + +`from_pretrained`メソッドを使用して、ハブから量子化モデルをロードできます。属性 `quantization_config` がモデル設定オブジェクトに存在することを確認して、プッシュされた重みが量子化されていることを確認します。 + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer + +model = AutoModelForCausalLM.from_pretrained("{your_username}/bloom-560m-8bit", device_map="auto") +``` + +この場合、引数 `load_in_8bit=True` を指定する必要はありませんが、`bitsandbytes` と `accelerate` がインストールされていることを確認する必要があることに注意してください。 +また、`device_map` はオプションですが、利用可能なリソース上でモデルを効率的にディスパッチするため、推論には `device_map = 'auto'` を設定することが推奨されます。 + +### Advanced use cases + +このセクションは、8 ビット モデルのロードと実行以外に何ができるかを探求したい上級ユーザーを対象としています。 + +#### Offload between `cpu` and `gpu` + +この高度な使用例の 1 つは、モデルをロードし、`CPU`と`GPU`の間で重みをディスパッチできることです。 CPU 上でディスパッチされる重みは **8 ビットに変換されない**ため、`float32`に保持されることに注意してください。この機能は、非常に大規模なモデルを適合させ、そのモデルを GPU と CPU の間でディスパッチしたいユーザーを対象としています。 + +まず、`transformers` から [`BitsAndBytesConfig`] をロードし、属性 `llm_int8_enable_fp32_cpu_offload` を `True` に設定します。 + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig + +quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True) +``` + +`bigscience/bloom-1b7`モデルをロードする必要があり、`lm_head`を除くモデル全体に​​適合するのに十分な GPU RAM があるとします。したがって、次のようにカスタム device_map を作成します。 + +```python +device_map = { + "transformer.word_embeddings": 0, + 
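+    # キーはモジュール名、値は配置先デバイスです (GPU インデックスまたは "cpu")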
"transformer.word_embeddings_layernorm": 0, + "lm_head": "cpu", + "transformer.h": 0, + "transformer.ln_f": 0, +} +``` + +そして、次のようにモデルをロードします。 +```python +model_8bit = AutoModelForCausalLM.from_pretrained( + "bigscience/bloom-1b7", + device_map=device_map, + quantization_config=quantization_config, +) +``` + +以上です!モデルを楽しんでください! + +#### Play with `llm_int8_threshold` + +`llm_int8_threshold` 引数を操作して、外れ値のしきい値を変更できます。 外れ値 とは、特定のしきい値より大きい隠れた状態の値です。 +これは、`LLM.int8()`論文で説明されている外れ値検出の外れ値しきい値に対応します。このしきい値を超える隠し状態の値は外れ値とみなされ、それらの値に対する操作は fp16 で実行されます。通常、値は正規分布します。つまり、ほとんどの値は [-3.5, 3.5] の範囲内にありますが、大規模なモデルでは大きく異なる分布を示す例外的な系統的外れ値がいくつかあります。これらの外れ値は、多くの場合 [-60, -6] または [6, 60] の範囲内にあります。 Int8 量子化は、大きさが 5 程度までの値ではうまく機能しますが、それを超えると、パフォーマンスが大幅に低下します。適切なデフォルトのしきい値は 6 ですが、より不安定なモデル (小規模なモデル、微調整) では、より低いしきい値が必要になる場合があります。 +この引数は、モデルの推論速度に影響を与える可能性があります。このパラメータを試してみて、ユースケースに最適なパラメータを見つけることをお勧めします。 + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig + +model_id = "bigscience/bloom-1b7" + +quantization_config = BitsAndBytesConfig( + llm_int8_threshold=10, +) + +model_8bit = AutoModelForCausalLM.from_pretrained( + model_id, + device_map=device_map, + quantization_config=quantization_config, +) +tokenizer = AutoTokenizer.from_pretrained(model_id) +``` + +#### Skip the conversion of some modules + +一部のモデルには、安定性を確保するために 8 ビットに変換する必要がないモジュールがいくつかあります。たとえば、ジュークボックス モデルには、スキップする必要があるいくつかの `lm_head` モジュールがあります。 `llm_int8_skip_modules` で遊んでみる + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig + +model_id = "bigscience/bloom-1b7" + +quantization_config = BitsAndBytesConfig( + llm_int8_skip_modules=["lm_head"], +) + +model_8bit = AutoModelForCausalLM.from_pretrained( + model_id, + device_map=device_map, + quantization_config=quantization_config, +) +tokenizer = AutoTokenizer.from_pretrained(model_id) +``` + +#### Fine-tune a model that has been loaded in 8-bit + +Hugging Face エコシステムのアダプターの公式サポートにより、8 ビットでロードされたモデルを微調整できます。 +これにより、単一の Google Colab で`flan-t5-large`や`facebook/opt-6.7b`などの大規模モデルを微調整することができます。詳細については、[`peft`](https://github.com/huggingface/peft) ライブラリをご覧ください。 + +トレーニング用のモデルをロードするときに `device_map` を渡す必要がないことに注意してください。モデルが GPU に自動的にロードされます。必要に応じて、デバイス マップを特定のデバイスに設定することもできます (例: `cuda:0`、`0`、`torch.device('cuda:0')`)。 `device_map=auto`は推論のみに使用する必要があることに注意してください。 + +### BitsAndBytesConfig + +[[autodoc]] BitsAndBytesConfig + +## Quantization with 🤗 `optimum` + +`optimum`でサポートされている量子化方法の詳細については、[Optimum ドキュメント](https://huggingface.co/docs/optimum/index) を参照し、これらが自分のユースケースに適用できるかどうかを確認してください。 diff --git a/docs/source/ja/main_classes/text_generation.md b/docs/source/ja/main_classes/text_generation.md new file mode 100644 index 00000000000000..279d9b40735b73 --- /dev/null +++ b/docs/source/ja/main_classes/text_generation.md @@ -0,0 +1,63 @@ + + +# Generation + +各フレームワークには、それぞれの `GenerationMixin` クラスに実装されたテキスト生成のための Generate メソッドがあります。 + +- PyTorch [`~generation.GenerationMixin.generate`] は [`~generation.GenerationMixin`] に実装されています。 +- TensorFlow [`~generation.TFGenerationMixin.generate`] は [`~generation.TFGenerationMixin`] に実装されています。 +- Flax/JAX [`~generation.FlaxGenerationMixin.generate`] は [`~generation.FlaxGenerationMixin`] に実装されています。 + +選択したフレームワークに関係なく、[`~generation.GenerationConfig`] を使用して生成メソッドをパラメータ化できます。 +クラスインスタンス。動作を制御する生成パラメータの完全なリストについては、このクラスを参照してください。 +生成方法のこと。 + +モデルの生成構成を検査する方法、デフォルトとは何か、パラメーターをアドホックに変更する方法を学習するには、 +カスタマイズされた生成構成を作成して保存する方法については、「 +[テキスト生成戦略ガイド](../generation_strategies)。このガイドでは、関連機能の使用方法についても説明しています。 
+トークンストリーミングのような。 + +## GenerationConfig + +[[autodoc]] generation.GenerationConfig + - from_pretrained + - from_model_config + - save_pretrained + +## GenerationMixin + +[[autodoc]] generation.GenerationMixin + - generate + - compute_transition_scores + - greedy_search + - sample + - beam_search + - beam_sample + - contrastive_search + - group_beam_search + - constrained_beam_search + +## TFGenerationMixin + +[[autodoc]] generation.TFGenerationMixin + - generate + - compute_transition_scores + +## FlaxGenerationMixin + +[[autodoc]] generation.FlaxGenerationMixin + - generate diff --git a/docs/source/ja/main_classes/tokenizer.md b/docs/source/ja/main_classes/tokenizer.md new file mode 100644 index 00000000000000..1cf5885bc81286 --- /dev/null +++ b/docs/source/ja/main_classes/tokenizer.md @@ -0,0 +1,80 @@ + + +# Tokenizer + +トークナイザーは、モデルの入力の準備を担当します。ライブラリには、すべてのモデルのトークナイザーが含まれています。ほとんど +トークナイザーの一部は、完全な Python 実装と、 +Rust ライブラリ [🤗 Tokenizers](https://github.com/huggingface/tokenizers)。 「高速」実装では次のことが可能になります。 + +1. 特にバッチトークン化を行う場合の大幅なスピードアップと +2. 元の文字列 (文字と単語) とトークン空間の間でマッピングする追加のメソッド (例: + 特定の文字を含むトークンのインデックス、または特定のトークンに対応する文字の範囲)。 + +基本クラス [`PreTrainedTokenizer`] および [`PreTrainedTokenizerFast`] +モデル入力の文字列入力をエンコードし (以下を参照)、Python をインスタンス化/保存するための一般的なメソッドを実装します。 +ローカル ファイルまたはディレクトリ、またはライブラリによって提供される事前トレーニング済みトークナイザーからの「高速」トークナイザー +(HuggingFace の AWS S3 リポジトリからダウンロード)。二人とも頼りにしているのは、 +共通メソッドを含む [`~tokenization_utils_base.PreTrainedTokenizerBase`] +[`~tokenization_utils_base.SpecialTokensMixin`]。 + +したがって、[`PreTrainedTokenizer`] と [`PreTrainedTokenizerFast`] はメインを実装します。 +すべてのトークナイザーを使用するためのメソッド: + +- トークン化 (文字列をサブワード トークン文字列に分割)、トークン文字列を ID に変換したり、その逆の変換を行ったりします。 + エンコード/デコード (つまり、トークン化と整数への変換)。 +- 基礎となる構造 (BPE、SentencePiece...) から独立した方法で、語彙に新しいトークンを追加します。 +- 特別なトークン (マスク、文の始まりなど) の管理: トークンの追加、属性への割り当て。 + トークナイザーにより、簡単にアクセスでき、トークン化中に分割されないようにすることができます。 + +[`BatchEncoding`] は、 +[`~tokenization_utils_base.PreTrainedTokenizerBase`] のエンコード メソッド (`__call__`、 +`encode_plus` および `batch_encode_plus`) であり、Python 辞書から派生しています。トークナイザーが純粋な Python の場合 +tokenizer の場合、このクラスは標準の Python 辞書と同じように動作し、によって計算されたさまざまなモデル入力を保持します。 +これらのメソッド (`input_ids`、`attention_mask`...)。トークナイザーが「高速」トークナイザーである場合 (つまり、 +HuggingFace [トークナイザー ライブラリ](https://github.com/huggingface/tokenizers))、このクラスはさらに提供します +元の文字列 (文字と単語) と +トークンスペース (例: 指定された文字または対応する文字の範囲を構成するトークンのインデックスの取得) +与えられたトークンに)。 + +## PreTrainedTokenizer + +[[autodoc]] PreTrainedTokenizer + - __call__ + - apply_chat_template + - batch_decode + - decode + - encode + - push_to_hub + - all + +## PreTrainedTokenizerFast + +[`PreTrainedTokenizerFast`] は [tokenizers](https://huggingface.co/docs/tokenizers) ライブラリに依存します。 🤗 トークナイザー ライブラリから取得したトークナイザーは、 +🤗 トランスに非常に簡単にロードされます。これがどのように行われるかを理解するには、[🤗 tokenizers からの tokenizers を使用する](../fast_tokenizers) ページを参照してください。 + +[[autodoc]] PreTrainedTokenizerFast + - __call__ + - apply_chat_template + - batch_decode + - decode + - encode + - push_to_hub + - all + +## BatchEncoding + +[[autodoc]] BatchEncoding diff --git a/docs/source/ja/main_classes/trainer.md b/docs/source/ja/main_classes/trainer.md new file mode 100644 index 00000000000000..61872996ab5938 --- /dev/null +++ b/docs/source/ja/main_classes/trainer.md @@ -0,0 +1,727 @@ + + +# Trainer + +[`Trainer`] クラスは、ほとんどの標準的なユースケースに対して、PyTorch で機能を完全にトレーニングするための API を提供します。これは、[サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples) のほとんどで使用されています。 + +[`Trainer`] をインスタンス化する前に、トレーニング中にカスタマイズのすべてのポイントにアクセスするために [`TrainingArguments`] を作成します。 + +この API は、複数の GPU/TPU 
での分散トレーニング、[NVIDIA Apex](https://github.com/NVIDIA/apex) および PyTorch のネイティブ AMP による混合精度をサポートします。
+
+[`Trainer`] には、上記の機能をサポートする基本的なトレーニング ループが含まれています。カスタム動作を挿入するには、[`Trainer`] をサブクラス化し、次のメソッドをオーバーライドします。
+
+- **get_train_dataloader** -- トレーニング データローダーを作成します。
+- **get_eval_dataloader** -- 評価用データローダーを作成します。
+- **get_test_dataloader** -- テスト データローダーを作成します。
+- **log** -- トレーニングを監視しているさまざまなオブジェクトに関する情報をログに記録します。
+- **create_optimizer_and_scheduler** -- 初期化時に渡されなかった場合に、オプティマイザと学習率スケジューラをセットアップします。`create_optimizer` メソッドと `create_scheduler` メソッドを個別にサブクラス化またはオーバーライドすることもできます。
+- **create_optimizer** -- init で渡されなかった場合にオプティマイザーをセットアップします。
+- **create_scheduler** -- init で渡されなかった場合に学習率スケジューラをセットアップします。
+- **compute_loss** -- トレーニング入力のバッチに対する損失を計算します。
+- **training_step** -- トレーニング ステップを実行します。
+- **prediction_step** -- 評価/テスト ステップを実行します。
+- **evaluate** -- 評価ループを実行し、メトリクスを返します。
+- **predict** -- テスト セットに対する予測 (ラベルが利用可能な場合はメトリクスも含む) を返します。
+
+<Tip warning={true}>
+
+[`Trainer`] クラスは 🤗 Transformers モデル用に最適化されているため、他のモデルで使用すると予期しない動作をする可能性があります。独自のモデルで使用する場合は、次の点を確認してください。
+
+- モデルが常にタプル、または [`~utils.ModelOutput`] のサブクラスを返すこと。
+- `labels` 引数が指定された場合にモデルが損失を計算でき、その損失が (モデルがタプルを返す場合) タプルの最初の要素として返されること。
+- モデルは複数のラベル引数を受け入れることができます ([`TrainingArguments`] の `label_names` を使用して、その名前を [`Trainer`] に伝えます) が、それらのいずれにも `"label"` という名前を付ける必要はありません。
+
+</Tip>
+
+以下は、加重損失を使用するように [`Trainer`] をカスタマイズする方法の例です (不均衡なトレーニング セットがある場合に役立ちます)。
+
+```python
+import torch
+from torch import nn
+from transformers import Trainer
+
+
+class CustomTrainer(Trainer):
+    def compute_loss(self, model, inputs, return_outputs=False):
+        labels = inputs.pop("labels")
+        # forward pass
+        outputs = model(**inputs)
+        logits = outputs.get("logits")
+        # compute custom loss (suppose one has 3 labels with different weights)
+        loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0, 3.0], device=model.device))
+        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
+        return (loss, outputs) if return_outputs else loss
+```
+
+PyTorch [`Trainer`] のトレーニング ループの動作をカスタマイズするもう 1 つの方法は、トレーニング ループの状態を検査し (進行状況のレポート、TensorBoard やその他の ML プラットフォームへのログ記録など)、決定 (早期停止など) を行うことができる [コールバック](callbacks) を使用することです。
+
+## Trainer
+
+[[autodoc]] Trainer
+    - all
+
+## Seq2SeqTrainer
+
+[[autodoc]] Seq2SeqTrainer
+    - evaluate
+    - predict
+
+## TrainingArguments
+
+[[autodoc]] TrainingArguments
+    - all
+
+## Seq2SeqTrainingArguments
+
+[[autodoc]] Seq2SeqTrainingArguments
+    - all
+
+## Checkpoints
+
+デフォルトでは、[`Trainer`] はすべてのチェックポイントを、使用している [`TrainingArguments`] で設定した `output_dir` に保存します。これらは、そのときのトレーニング ステップ数 xxx を含む `checkpoint-xxx` という名前のサブフォルダーに保存されます。
+
+チェックポイントからトレーニングを再開するには、次のいずれかを指定して [`Trainer.train`] を呼び出します。
+
+- `resume_from_checkpoint=True`: 最新のチェックポイントからトレーニングを再開します
+- `resume_from_checkpoint=checkpoint_dir`: 渡されたディレクトリ内の特定のチェックポイントからトレーニングを再開します
+
+さらに、`push_to_hub=True` を使用すると、モデル ハブにチェックポイントを簡単に保存できます。デフォルトでは、中間チェックポイントに保存されたモデルはすべて別々のコミットに保存されますが、オプティマイザーの状態は保存されません。[`TrainingArguments`] の `hub_strategy` 値を次のいずれかに変更できます。
+
+- `"checkpoint"`: 最新のチェックポイントも last-checkpoint という名前のサブフォルダーにプッシュされ、`trainer.train(resume_from_checkpoint="output_dir/last-checkpoint")` を使用してトレーニングを簡単に再開できます。
+- `"all_checkpoints"`: すべてのチェックポイントが、出力フォルダーに現れるとおりにプッシュされます (したがって、最終リポジトリにはフォルダーごとに 1 つのチェックポイント フォルダーが得られます)
+
+## Logging
+
+デフォルトでは、[`Trainer`] はメインプロセスに `logging.INFO` を使用し、レプリカがある場合には `logging.WARNING` を使用します。
+
+これらのデフォルトは、[`TrainingArguments`] の次の 2 つの引数を使用して、5 つの `logging` レベルのいずれかにオーバーライドできます。
+
+- `log_level` - メインプロセス用
+- `log_level_replica` - レプリカ用
+
+さらに、[`TrainingArguments`] の 
`log_on_each_node` が `False` に設定されている場合、メイン ノードのみが +メイン プロセスのログ レベル設定を使用すると、他のすべてのノードはレプリカのログ レベル設定を使用します。 + +[`Trainer`] は、`transformers` のログ レベルをノードごとに個別に設定することに注意してください。 +[`Trainer.__init__`]。したがって、他の機能を利用する場合は、これをより早く設定することをお勧めします (次の例を参照)。 +[`Trainer`] オブジェクトを作成する前の `transformers` 機能。 + +これをアプリケーションで使用する方法の例を次に示します。 + +```python +[...] +logger = logging.getLogger(__name__) + +# Setup logging +logging.basicConfig( + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%m/%d/%Y %H:%M:%S", + handlers=[logging.StreamHandler(sys.stdout)], +) + +# set the main code and the modules it uses to the same log-level according to the node +log_level = training_args.get_process_log_level() +logger.setLevel(log_level) +datasets.utils.logging.set_verbosity(log_level) +transformers.utils.logging.set_verbosity(log_level) + +trainer = Trainer(...) +``` + +そして、メイン ノードと他のすべてのノードで重複する可能性が高いものを出力しないように警告するだけを表示したい場合は、 +警告: 次のように実行できます。 + +```bash +my_app.py ... --log_level warning --log_level_replica error +``` + +マルチノード環境で、各ノードのメインプロセスのログを繰り返したくない場合は、次のようにします。 +上記を次のように変更します。 + +```bash +my_app.py ... --log_level warning --log_level_replica error --log_on_each_node 0 +``` + +その後、最初のノードのメイン プロセスのみが「警告」レベルでログに記録され、メイン ノード上の他のすべてのプロセスはログに記録されます。 +ノードと他のノード上のすべてのプロセスは「エラー」レベルでログに記録されます。 + +アプリケーションをできるだけ静かにする必要がある場合は、次のようにします。 + +```bash +my_app.py ... --log_level error --log_level_replica error --log_on_each_node 0 +``` + +(マルチノード環境の場合は `--log_on_each_node 0` を追加します) + +## Randomness + +[`Trainer`] によって生成されたチェックポイントから再開する場合、すべての努力がその状態を復元するために行われます。 +_python_、_numpy_、および _pytorch_ の RNG 状態は、そのチェックポイントを保存した時点と同じ状態になります。 +これにより、「停止して再開」というスタイルのトレーニングが、ノンストップトレーニングに可能な限り近づけられるはずです。 + +ただし、さまざまなデフォルトの非決定的な pytorch 設定により、これは完全に機能しない可能性があります。フルをご希望の場合は +決定論については、[ランダム性のソースの制御](https://pytorch.org/docs/stable/notes/randomness) を参照してください。ドキュメントで説明されているように、これらの設定の一部は +物事を決定論的にするもの (例: `torch.backends.cudnn.deterministic`) は物事を遅くする可能性があるため、これは +デフォルトでは実行できませんが、必要に応じて自分で有効にすることができます。 + +## Specific GPUs Selection + +どの GPU をどのような順序で使用するかをプログラムに指示する方法について説明します。 + +[`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.Parallel.DistributedDataParallel.html) を使用して GPU のサブセットのみを使用する場合、使用する GPU の数を指定するだけです。 。たとえば、GPU が 4 つあるが、最初の 2 つを使用したい場合は、次のようにします。 + +```bash +torchrun --nproc_per_node=2 trainer-program.py ... +``` + +[`accelerate`](https://github.com/huggingface/accelerate) または [`deepspeed`](https://github.com/microsoft/DeepSpeed) がインストールされている場合は、次を使用して同じことを達成することもできます。の一つ: + +```bash +accelerate launch --num_processes 2 trainer-program.py ... +``` + +```bash +deepspeed --num_gpus 2 trainer-program.py ... +``` + +これらのランチャーを使用するために、Accelerate または [Deepspeed 統合](deepspeed) 機能を使用する必要はありません。 + + +これまでは、プログラムに使用する GPU の数を指示できました。次に、特定の GPU を選択し、その順序を制御する方法について説明します。 + +次の環境変数は、使用する GPU とその順序を制御するのに役立ちます。 + +**`CUDA_VISIBLE_DEVICES`** + +複数の GPU があり、そのうちの 1 つまたはいくつかの GPU だけを使用したい場合は、環境変数 `CUDA_VISIBLE_DEVICES` を使用する GPU のリストに設定します。 + +たとえば、4 つの GPU (0、1、2、3) があるとします。物理 GPU 0 と 2 のみで実行するには、次のようにします。 + +```bash +CUDA_VISIBLE_DEVICES=0,2 torchrun trainer-program.py ... +``` + +したがって、pytorch は 2 つの GPU のみを認識し、物理 GPU 0 と 2 はそれぞれ `cuda:0` と `cuda:1` にマッピングされます。 + +順序を変更することもできます。 + +```bash +CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ... 
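+# (補足) この設定でプロセスから見える GPU とその順序は、たとえば次のような方法で確認できます (確認用の一例です)
+# CUDA_VISIBLE_DEVICES=2,0 python -c "import torch; print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])"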
+``` + +ここでは、物理 GPU 0 と 2 がそれぞれ`cuda:1`と`cuda:0`にマッピングされています。 + +上記の例はすべて `DistributedDataParallel` 使用パターンのものですが、同じ方法が [`DataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html) でも機能します。 + + +```bash +CUDA_VISIBLE_DEVICES=2,0 python trainer-program.py ... +``` + +GPU のない環境をエミュレートするには、次のようにこの環境変数を空の値に設定するだけです。 + +```bash +CUDA_VISIBLE_DEVICES= python trainer-program.py ... +``` + +他の環境変数と同様に、これらをコマンド ラインに追加する代わりに、次のようにエクスポートすることもできます。 + +```bash +export CUDA_VISIBLE_DEVICES=0,2 +torchrun trainer-program.py ... +``` + +ただし、この方法では、以前に環境変数を設定したことを忘れて、なぜ間違った GPU が使用されているのか理解できない可能性があるため、混乱を招く可能性があります。したがって、このセクションのほとんどの例で示されているように、同じコマンド ラインで特定の実行に対してのみ環境変数を設定するのが一般的です。 + +**`CUDA_DEVICE_ORDER`** + +物理デバイスの順序を制御する追加の環境変数 `CUDA_DEVICE_ORDER` があります。選択肢は次の 2 つです。 + +1. PCIe バス ID 順 (`nvidia-smi` の順序と一致) - これがデフォルトです。 + +```bash +export CUDA_DEVICE_ORDER=PCI_BUS_ID +``` + +2. GPU コンピューティング能力順に並べる + +```bash +export CUDA_DEVICE_ORDER=FASTEST_FIRST +``` + +ほとんどの場合、この環境変数を気にする必要はありませんが、古い GPU と新しい GPU が物理的に挿入されているため、遅い古いカードが遅くなっているように見えるような偏ったセットアップを行っている場合には、非常に役立ちます。初め。これを解決する 1 つの方法は、カードを交換することです。ただし、カードを交換できない場合 (デバイスの冷却が影響を受けた場合など)、`CUDA_DEVICE_ORDER=FASTEST_FIRST`を設定すると、常に新しい高速カードが最初に配置されます。ただし、`nvidia-smi`は依然として PCIe の順序でレポートするため、多少混乱するでしょう。 + +順序を入れ替えるもう 1 つの解決策は、以下を使用することです。 + +```bash +export CUDA_VISIBLE_DEVICES=1,0 +``` + +この例では 2 つの GPU だけを使用していますが、もちろん、コンピューターに搭載されている数の GPU にも同じことが当てはまります。 + +また、この環境変数を設定する場合は、`~/.bashrc` ファイルまたはその他の起動設定ファイルに設定して、忘れるのが最善です。 + +## Trainer Integrations + +[`Trainer`] は、トレーニングを劇的に改善する可能性のあるライブラリをサポートするように拡張されました。 +時間とはるかに大きなモデルに適合します。 + +現在、サードパーティのソリューション [DeepSpeed](https://github.com/microsoft/DeepSpeed) および [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html) をサポートしています。論文 [ZeRO: メモリの最適化兆パラメータ モデルのトレーニングに向けて、Samyam Rajbhandari、Jeff Rasley、Olatunji Ruwase、Yuxiong He 著](https://arxiv.org/abs/1910.02054)。 + +この提供されるサポートは、この記事の執筆時点では新しくて実験的なものです。 DeepSpeed と PyTorch FSDP のサポートはアクティブであり、それに関する問題は歓迎しますが、FairScale 統合は PyTorch メインに統合されているため、もうサポートしていません ([PyTorch FSDP 統合](#pytorch-fully-sharded-data-parallel)) + + + +### CUDA Extension Installation Notes + +この記事の執筆時点では、Deepspeed を使用するには、CUDA C++ コードをコンパイルする必要があります。 + +すべてのインストールの問題は、[Deepspeed](https://github.com/microsoft/DeepSpeed/issues) の対応する GitHub の問題を通じて対処する必要がありますが、ビルド中に発生する可能性のある一般的な問題がいくつかあります。 +CUDA 拡張機能を構築する必要がある PyTorch 拡張機能。 + +したがって、次の操作を実行中に CUDA 関連のビルドの問題が発生した場合は、次のとおりです。 + +```bash +pip install deepspeed +``` + +まず次の注意事項をお読みください。 + +これらのノートでは、`pytorch` が CUDA `10.2` でビルドされた場合に何をすべきかの例を示します。あなたの状況が次のような場合 +異なる場合は、バージョン番号を目的のバージョンに調整することを忘れないでください。 + +#### Possible problem #1 + +Pytorch には独自の CUDA ツールキットが付属していますが、これら 2 つのプロジェクトをビルドするには、同一バージョンの CUDA が必要です。 +システム全体にインストールされます。 + +たとえば、Python 環境に `cudatoolkit==10.2` を指定して `pytorch` をインストールした場合は、次のものも必要です。 +CUDA `10.2` がシステム全体にインストールされました。 + +正確な場所はシステムによって異なる場合がありますが、多くのシステムでは`/usr/local/cuda-10.2`が最も一般的な場所です。 +Unix システム。 CUDA が正しく設定され、`PATH`環境変数に追加されると、 +次のようにしてインストール場所を指定します。 + + +```bash +which nvcc +``` + +CUDA がシステム全体にインストールされていない場合は、最初にインストールしてください。お気に入りを使用して手順を見つけることができます +検索エンジン。たとえば、Ubuntu を使用している場合は、[ubuntu cuda 10.2 install](https://www.google.com/search?q=ubuntu+cuda+10.2+install) を検索するとよいでしょう。 + +#### Possible problem #2 + +もう 1 つの考えられる一般的な問題は、システム全体に複数の CUDA ツールキットがインストールされている可能性があることです。たとえばあなた +がある可能性があり: + +```bash +/usr/local/cuda-10.2 +/usr/local/cuda-11.0 +``` + +この状況では、`PATH` および `LD_LIBRARY_PATH` 環境変数に以下が含まれていることを確認する必要があります。 +目的の CUDA バージョンへの正しいパス。通常、パッケージ インストーラーは、これらに、 
+最後のバージョンがインストールされました。適切なパッケージが見つからないためにパッケージのビルドが失敗するという問題が発生した場合は、 +CUDA バージョンがシステム全体にインストールされているにもかかわらず、前述の 2 つを調整する必要があることを意味します +環境変数。 + +まず、その内容を見てみましょう。 + +```bash +echo $PATH +echo $LD_LIBRARY_PATH +``` + +それで、中に何が入っているかがわかります。 + +`LD_LIBRARY_PATH` が空である可能性があります。 + +`PATH` は実行可能ファイルが存在する場所をリストし、`LD_LIBRARY_PATH` は共有ライブラリの場所を示します。 +探すことです。どちらの場合も、前のエントリが後のエントリより優先されます。 `:` は複数を区切るために使用されます +エントリ。 + +ここで、ビルド プログラムに特定の CUDA ツールキットの場所を指示するには、最初にリストされる希望のパスを挿入します。 +やっていること: + +```bash +export PATH=/usr/local/cuda-10.2/bin:$PATH +export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH +``` + +既存の値を上書きするのではなく、先頭に追加することに注意してください。 + +もちろん、必要に応じてバージョン番号やフルパスを調整します。割り当てたディレクトリが実際に機能することを確認してください +存在する。 `lib64` サブディレクトリは、`libcudart.so` などのさまざまな CUDA `.so` オブジェクトが存在する場所です。 +システムでは別の名前が付けられますが、現実を反映するように調整してください。 + +#### Possible problem #3 + +一部の古い CUDA バージョンは、新しいコンパイラでのビルドを拒否する場合があります。たとえば、あなたは`gcc-9`を持っていますが、それが必要です +`gcc-7`。 + +それにはさまざまな方法があります。 + +最新の CUDA ツールキットをインストールできる場合は、通常、新しいコンパイラがサポートされているはずです。 + +あるいは、既に所有しているコンパイラに加えて、下位バージョンのコンパイラをインストールすることもできます。 +すでに存在しますが、デフォルトではないため、ビルドシステムはそれを認識できません。 「gcc-7」がインストールされているが、 +ビルドシステムが見つからないというメッセージを表示する場合は、次の方法で解決できる可能性があります。 + +```bash +sudo ln -s /usr/bin/gcc-7 /usr/local/cuda-10.2/bin/gcc +sudo ln -s /usr/bin/g++-7 /usr/local/cuda-10.2/bin/g++ +``` + +ここでは、`/usr/local/cuda-10.2/bin/gcc` から `gcc-7` へのシンボリックリンクを作成しています。 +`/usr/local/cuda-10.2/bin/` は `PATH` 環境変数内にある必要があります (前の問題の解決策を参照)。 +`gcc-7` (および `g++7`) が見つかるはずで、ビルドは成功します。 + +いつものように、状況に合わせて例のパスを編集してください。 + +### PyTorch Fully Sharded Data parallel + +より大きなバッチ サイズで巨大なモデルのトレーニングを高速化するには、完全にシャード化されたデータ並列モデルを使用できます。 +このタイプのデータ並列パラダイムでは、オプティマイザーの状態、勾配、パラメーターをシャーディングすることで、より多くのデータと大規模なモデルをフィッティングできます。 +この機能とその利点の詳細については、[完全シャーディング データ並列ブログ](https://pytorch.org/blog/introducing-pytorch-full-sharded-data-Parallel-api/) をご覧ください。 +最新の PyTorch の Fully Sharded Data Parallel (FSDP) トレーニング機能を統合しました。 +必要なのは、設定を通じて有効にすることだけです。 + +**FSDP サポートに必要な PyTorch バージョン**: PyTorch Nightly (リリース後にこれを読んだ場合は 1.12.0) +FSDP を有効にしたモデルの保存は、最近の修正でのみ利用できるためです。 + +**使用法**: + +- 配布されたランチャーが追加されていることを確認してください +まだ使用していない場合は、`-m torch.distributed.launch --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE`を使用します。 + +- **シャーディング戦略**: + - FULL_SHARD : データ並列ワーカー/GPU にわたるシャード オプティマイザーの状態 + 勾配 + モデル パラメーター。 + このためには、コマンドライン引数に`--fsdp full_shard`を追加します。 + - SHARD_GRAD_OP : シャード オプティマイザーの状態 + データ並列ワーカー/GPU 全体の勾配。 + このためには、コマンドライン引数に`--fsdp shard_grad_op`を追加します。 + - NO_SHARD : シャーディングなし。このためには、コマンドライン引数に`--fsdp no_shard`を追加します。 +- パラメータと勾配を CPU にオフロードするには、 + コマンドライン引数に`--fsdp "full_shard offload"`または`--fsdp "shard_grad_op offload"`を追加します。 +- `default_auto_wrap_policy` を使用して FSDP でレイヤーを自動的に再帰的にラップするには、 + コマンドライン引数に`--fsdp "full_shard auto_wrap"`または`--fsdp "shard_grad_op auto_wrap"`を追加します。 +- CPU オフロードと自動ラッピングの両方を有効にするには、 + コマンドライン引数に`--fsdp "full_shard offload auto_wrap"`または`--fsdp "shard_grad_op offload auto_wrap"`を追加します。 +- 残りの FSDP 構成は、`--fsdp_config `を介して渡されます。それは、次のいずれかの場所です。 + FSDP json 構成ファイル (例: `fsdp_config.json`)、またはすでにロードされている json ファイルを `dict` として使用します。 + - 自動ラッピングが有効な場合は、トランスベースの自動ラップ ポリシーまたはサイズ ベースの自動ラップ ポリシーを使用できます。 + - トランスフォーマーベースの自動ラップポリシーの場合、構成ファイルで `fsdp_transformer_layer_cls_to_wrap` を指定することをお勧めします。指定しない場合、使用可能な場合、デフォルト値は `model._no_split_modules` になります。 + これは、ラップするトランスフォーマー層クラス名のリスト (大文字と小文字を区別) を指定します (例: [`BertLayer`]、[`GPTJBlock`]、[`T5Block`] ...)。 + 重みを共有するサブモジュール (埋め込み層など) が異なる FSDP ラップされたユニットにならないようにする必要があるため、これは重要です。 + このポリシーを使用すると、マルチヘッド アテンションとそれに続くいくつかの MLP レイヤーを含むブロックごとにラッピングが発生します。 + 
共有埋め込みを含む残りの層は、同じ最も外側の FSDP ユニットにラップされるのが便利です。 + したがって、トランスベースのモデルにはこれを使用してください。 + - サイズベースの自動ラップポリシーの場合は、設定ファイルに`fsdp_min_num_params`を追加してください。 + 自動ラッピングのための FSDP のパラメータの最小数を指定します。 + - 設定ファイルで `fsdp_backward_prefetch` を指定できるようになりました。次のパラメータのセットをいつプリフェッチするかを制御します。 + `backward_pre` と `backward_pos` が利用可能なオプションです。 + 詳細については、`torch.distributed.fsdp.full_sharded_data_Parallel.BackwardPrefetch`を参照してください。 + - 設定ファイルで `fsdp_forward_prefetch` を指定できるようになりました。次のパラメータのセットをいつプリフェッチするかを制御します。 + `True`の場合、FSDP はフォワード パスでの実行中に、次に来るオールギャザーを明示的にプリフェッチします。 + - 設定ファイルで `limit_all_gathers` を指定できるようになりました。 + `True`の場合、FSDP は CPU スレッドを明示的に同期して、実行中のオールギャザが多すぎるのを防ぎます。 + - `activation_checkpointing`を設定ファイルで指定できるようになりました。 + `True`の場合、FSDP アクティベーション チェックポイントは、FSDP のアクティベーションをクリアすることでメモリ使用量を削減する手法です。 + 特定のレイヤーを処理し、バックワード パス中にそれらを再計算します。事実上、これは余分な計算時間を犠牲にします + メモリ使用量を削減します。 + +**注意すべき注意点がいくつかあります** +- これは `generate` と互換性がないため、 `--predict_with_generate` とも互換性がありません + すべての seq2seq/clm スクリプト (翻訳/要約/clm など)。 + 問題 [#21667](https://github.com/huggingface/transformers/issues/21667) を参照してください。 + +### PyTorch/XLA Fully Sharded Data parallel + +TPU ユーザーの皆様に朗報です。 PyTorch/XLA は FSDP をサポートするようになりました。 +最新の Fully Sharded Data Parallel (FSDP) トレーニングがすべてサポートされています。 +詳細については、[FSDP を使用した Cloud TPU での PyTorch モデルのスケーリング](https://pytorch.org/blog/scaling-pytorch-models-on-cloud-tpus-with-fsdp/) および [PyTorch/XLA 実装 を参照してください。 FSDP の](https://github.com/pytorch/xla/tree/master/torch_xla/distributed/fsdp) +必要なのは、設定を通じて有効にすることだけです。 + +**FSDP サポートに必要な PyTorch/XLA バージョン**: >=2.0 + +**使用法**: + +`--fsdp "full shard"` を、`--fsdp_config ` に加えられる次の変更とともに渡します。 +- PyTorch/XLA FSDP を有効にするには、`xla`を`True`に設定する必要があります。 +- `xla_fsdp_settings` 値は、XLA FSDP ラッピング パラメータを格納する辞書です。 + オプションの完全なリストについては、[こちら]( + https://github.com/pytorch/xla/blob/master/torch_xla/distributed/fsdp/xla_full_sharded_data_Parallel.py)。 +- `xla_fsdp_grad_ckpt`。 `True`の場合、ネストされた XLA FSDP でラップされた各レイヤー上で勾配チェックポイントを使用します。 + この設定は、xla フラグが true に設定されており、自動ラッピング ポリシーが指定されている場合にのみ使用できます。 + `fsdp_min_num_params` または `fsdp_transformer_layer_cls_to_wrap`。 +- トランスフォーマー ベースの自動ラップ ポリシーまたはサイズ ベースの自動ラップ ポリシーのいずれかを使用できます。 + - トランスフォーマーベースの自動ラップポリシーの場合、構成ファイルで `fsdp_transformer_layer_cls_to_wrap` を指定することをお勧めします。指定しない場合、使用可能な場合、デフォルト値は `model._no_split_modules` になります。 + これは、ラップするトランスフォーマー層クラス名のリスト (大文字と小文字を区別) を指定します (例: [`BertLayer`]、[`GPTJBlock`]、[`T5Block`] ...)。 + 重みを共有するサブモジュール (埋め込み層など) が異なる FSDP ラップされたユニットにならないようにする必要があるため、これは重要です。 + このポリシーを使用すると、マルチヘッド アテンションとそれに続くいくつかの MLP レイヤーを含むブロックごとにラッピングが発生します。 + 共有埋め込みを含む残りの層は、同じ最も外側の FSDP ユニットにラップされるのが便利です。 + したがって、トランスベースのモデルにはこれを使用してください。 + - サイズベースの自動ラップポリシーの場合は、設定ファイルに`fsdp_min_num_params`を追加してください。 + 自動ラッピングのための FSDP のパラメータの最小数を指定します。 + +### Using Trainer for accelerated PyTorch Training on Mac + +PyTorch v1.12 リリースにより、開発者と研究者は Apple シリコン GPU を利用してモデル トレーニングを大幅に高速化できます。 +これにより、プロトタイピングや微調整などの機械学習ワークフローを Mac 上でローカルで実行できるようになります。 +PyTorch のバックエンドとしての Apple の Metal Performance Shaders (MPS) はこれを可能にし、新しい `"mps"` デバイス経由で使用できます。 +これにより、計算グラフとプリミティブが MPS Graph フレームワークと MPS によって提供される調整されたカーネルにマッピングされます。 +詳細については、公式ドキュメント [Mac での Accelerated PyTorch Training の紹介](https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/) を参照してください。 +および [MPS バックエンド](https://pytorch.org/docs/stable/notes/mps.html)。 + + + +MacOS マシンに PyTorch >= 1.13 (執筆時点ではナイトリー バージョン) をインストールすることを強くお勧めします。 +トランスベースのモデルのモデルの正確性とパフォーマンスの向上に関連する主要な修正が行われています。 +詳細については、https://github.com/pytorch/pytorch/issues/82707 を参照してください。 + + + +**Apple Silicon チップを使用したトレーニングと推論の利点** + +1. 
ユーザーがローカルで大規模なネットワークやバッチ サイズをトレーニングできるようにします
+2. ユニファイド メモリ アーキテクチャにより、データ取得の遅延が短縮され、GPU がメモリ ストア全体に直接アクセスできるようになるため、エンドツーエンドのパフォーマンスが向上します。
+3. クラウドベースの開発に関連するコストや、追加のローカル GPU の必要性を削減します。
+
+**前提条件**: MPS サポート付きの torch をインストールするには、
+この素晴らしい Medium 記事 [GPU アクセラレーションが M1 Mac の PyTorch に登場](https://medium.com/towards-data-science/gpu-acceleration-comes-to-pytorch-on-m1-macs-195c399efcc1) に従ってください。
+
+**使用法**:
+`mps` デバイスは、`cuda` デバイスと同様に、利用可能な場合はデフォルトで使用されます。
+したがって、ユーザーによるアクションは必要ありません。
+たとえば、以下のコマンドを使用して、Apple Silicon GPU で公式の GLUE テキスト分類タスクを (ルート フォルダーから) 実行できます。
+
+```bash
+export TASK_NAME=mrpc
+
+python examples/pytorch/text-classification/run_glue.py \
+  --model_name_or_path google-bert/bert-base-cased \
+  --task_name $TASK_NAME \
+  --do_train \
+  --do_eval \
+  --max_seq_length 128 \
+  --per_device_train_batch_size 32 \
+  --learning_rate 2e-5 \
+  --num_train_epochs 3 \
+  --output_dir /tmp/$TASK_NAME/ \
+  --overwrite_output_dir
+```
+
+**注意すべき点がいくつかあります**
+
+1. 一部の PyTorch 操作は MPS に実装されていないため、エラーがスローされます。
+これを回避する 1 つの方法は、環境変数 `PYTORCH_ENABLE_MPS_FALLBACK=1` を設定することです。
+これにより、これらの操作は CPU にフォールバックします。ただし、それでも UserWarning はスローされます。
+2. 分散セットアップの `gloo` および `nccl` は、`mps` デバイスでは動作しません。
+これは、現在 `mps` デバイス タイプでは単一 GPU のみを使用できることを意味します。
+
+最後に、🤗 `Trainer` は MPS バックエンドを統合しているだけなので、
+MPS バックエンドの使用に関して問題や質問がある場合は、
+[PyTorch GitHub](https://github.com/pytorch/pytorch/issues) に問題を提出してください。
+
+## Using Accelerate Launcher with Trainer
+
+Accelerate がトレーナーを強化するようになりました。ユーザーが期待できることは次のとおりです。
+- トレーナー引数を変更することなく、FSDP、DeepSpeed などのトレーナー インテグレーションを引き続き使用できます。
+- トレーナーで Accelerate Launcher を使用できるようになりました (推奨)。
+
+トレーナーで Accelerate Launcher を使用する手順:
+1. 🤗 Accelerate がインストールされていることを確認してください。Accelerate がないと `Trainer` を使用することはできません。まだの場合は `pip install accelerate` を実行してください。Accelerate のバージョンを更新する必要がある場合もあります: `pip install accelerate --upgrade`
+2. `accelerate config` を実行し、アンケートに記入します。以下は Accelerate 設定の例です。
+  a. DDP マルチノード マルチ GPU 構成:
+  ```yaml
+  compute_environment: LOCAL_MACHINE
+  distributed_type: MULTI_GPU
+  downcast_bf16: 'no'
+  gpu_ids: all
+  machine_rank: 0 #change rank as per the node
+  main_process_ip: 192.168.20.1
+  main_process_port: 9898
+  main_training_function: main
+  mixed_precision: fp16
+  num_machines: 2
+  num_processes: 8
+  rdzv_backend: static
+  same_network: true
+  tpu_env: []
+  tpu_use_cluster: false
+  tpu_use_sudo: false
+  use_cpu: false
+  ```
+
+  b. 
FSDP config: + ```yaml + compute_environment: LOCAL_MACHINE + distributed_type: FSDP + downcast_bf16: 'no' + fsdp_config: + fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP + fsdp_backward_prefetch_policy: BACKWARD_PRE + fsdp_forward_prefetch: true + fsdp_offload_params: false + fsdp_sharding_strategy: 1 + fsdp_state_dict_type: FULL_STATE_DICT + fsdp_sync_module_states: true + fsdp_transformer_layer_cls_to_wrap: BertLayer + fsdp_use_orig_params: true + machine_rank: 0 + main_training_function: main + mixed_precision: bf16 + num_machines: 1 + num_processes: 2 + rdzv_backend: static + same_network: true + tpu_env: [] + tpu_use_cluster: false + tpu_use_sudo: false + use_cpu: false + ``` + c.ファイルを指す DeepSpeed 構成: + ```yaml + compute_environment: LOCAL_MACHINE + deepspeed_config: + deepspeed_config_file: /home/user/configs/ds_zero3_config.json + zero3_init_flag: true + distributed_type: DEEPSPEED + downcast_bf16: 'no' + machine_rank: 0 + main_training_function: main + num_machines: 1 + num_processes: 4 + rdzv_backend: static + same_network: true + tpu_env: [] + tpu_use_cluster: false + tpu_use_sudo: false + use_cpu: false + ``` + + d.加速プラグインを使用した DeepSpeed 構成: + + ```yaml + compute_environment: LOCAL_MACHINE + deepspeed_config: + gradient_accumulation_steps: 1 + gradient_clipping: 0.7 + offload_optimizer_device: cpu + offload_param_device: cpu + zero3_init_flag: true + zero_stage: 2 + distributed_type: DEEPSPEED + downcast_bf16: 'no' + machine_rank: 0 + main_training_function: main + mixed_precision: bf16 + num_machines: 1 + num_processes: 4 + rdzv_backend: static + same_network: true + tpu_env: [] + tpu_use_cluster: false + tpu_use_sudo: false + use_cpu: false + ``` + +3. 加速設定またはランチャー引数によって上記で処理された引数以外の引数を使用して、トレーナー スクリプトを実行します。 +以下は、上記の FSDP 構成で`accelerate launcher`を使用して`run_glue.py`を実行する例です。 + +```bash +cd transformers + +accelerate launch \ +./examples/pytorch/text-classification/run_glue.py \ +--model_name_or_path google-bert/bert-base-cased \ +--task_name $TASK_NAME \ +--do_train \ +--do_eval \ +--max_seq_length 128 \ +--per_device_train_batch_size 16 \ +--learning_rate 5e-5 \ +--num_train_epochs 3 \ +--output_dir /tmp/$TASK_NAME/ \ +--overwrite_output_dir +``` + +4. 
`accelerate launch`するための cmd 引数を直接使用することもできます。上の例は次のようにマッピングされます。 + +```bash +cd transformers + +accelerate launch --num_processes=2 \ +--use_fsdp \ +--mixed_precision=bf16 \ +--fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP \ +--fsdp_transformer_layer_cls_to_wrap="BertLayer" \ +--fsdp_sharding_strategy=1 \ +--fsdp_state_dict_type=FULL_STATE_DICT \ +./examples/pytorch/text-classification/run_glue.py +--model_name_or_path google-bert/bert-base-cased \ +--task_name $TASK_NAME \ +--do_train \ +--do_eval \ +--max_seq_length 128 \ +--per_device_train_batch_size 16 \ +--learning_rate 5e-5 \ +--num_train_epochs 3 \ +--output_dir /tmp/$TASK_NAME/ \ +--overwrite_output_dir +``` + +詳細については、🤗 Accelerate CLI ガイドを参照してください: [🤗 Accelerate スクリプトの起動](https://huggingface.co/docs/accelerate/basic_tutorials/launch)。 + +移動されたセクション: + +[ DeepSpeed +| Installation +| Deployment with multiple GPUs +| Deployment with one GPU +| Deployment in Notebooks +| Configuration +| Passing Configuration +| Shared Configuration +| ZeRO +| ZeRO-2 Config +| ZeRO-3 Config +| NVMe Support +| ZeRO-2 vs ZeRO-3 Performance +| ZeRO-2 Example +| ZeRO-3 Example +| Optimizer +| Scheduler +| fp32 Precision +| Automatic Mixed Precision +| Batch Size +| Gradient Accumulation +| Gradient Clipping +| Getting The Model Weights Out +] diff --git a/docs/source/ja/model_doc/albert.md b/docs/source/ja/model_doc/albert.md new file mode 100644 index 00000000000000..00403ea5376536 --- /dev/null +++ b/docs/source/ja/model_doc/albert.md @@ -0,0 +1,193 @@ + + +# ALBERT + +
+ +Models + + +Spaces + +
+ +## 概要 + +ALBERTモデルは、「[ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942)」という論文でZhenzhong Lan、Mingda Chen、Sebastian Goodman、Kevin Gimpel、Piyush Sharma、Radu Soricutによって提案されました。BERTのメモリ消費を減らしトレーニングを高速化するためのパラメータ削減技術を2つ示しています: + +- 埋め込み行列を2つの小さな行列に分割する。 +- グループ間で分割された繰り返し層を使用する。 + +論文の要旨は以下の通りです: + +*自然言語表現の事前学習時にモデルのサイズを増やすと、下流タスクのパフォーマンスが向上することがしばしばあります。しかし、ある時点でさらなるモデルの増大は、GPU/TPUのメモリ制限、長い訓練時間、予期せぬモデルの劣化といった問題のために困難になります。これらの問題に対処するために、我々はBERTのメモリ消費を低減し、訓練速度を高めるための2つのパラメータ削減技術を提案します。包括的な実証的証拠は、我々の提案方法が元のBERTに比べてはるかによくスケールするモデルを生み出すことを示しています。また、文間の一貫性をモデリングに焦点を当てた自己教師あり損失を使用し、複数の文が含まれる下流タスクに一貫して助けとなることを示します。その結果、我々の最良のモデルは、BERT-largeに比べてパラメータが少ないにもかかわらず、GLUE、RACE、SQuADベンチマークで新たな最先端の結果を確立します。* + +このモデルは[lysandre](https://huggingface.co/lysandre)により提供されました。このモデルのjaxバージョンは[kamalkraj](https://huggingface.co/kamalkraj)により提供されました。オリジナルのコードは[こちら](https://github.com/google-research/ALBERT)で見ることができます。 + +## 使用上のヒント + +- ALBERTは絶対位置埋め込みを使用するモデルなので、通常、入力を左側ではなく右側にパディングすることが推奨されます。 +- ALBERTは繰り返し層を使用するためメモリ使用量は小さくなりますが、同じ数の(繰り返し)層を反復しなければならないため、隠れ層の数が同じであればBERTのようなアーキテクチャと同様の計算コストがかかります。 +- 埋め込みサイズEは隠れサイズHと異なりますが、これは埋め込みが文脈に依存しない(一つの埋め込みベクトルが一つのトークンを表す)のに対し、隠れ状態は文脈に依存する(1つの隠れ状態がトークン系列を表す)ため、H >> Eとすることがより論理的です。また、埋め込み行列のサイズはV x Eと大きいです(Vは語彙サイズ)。E < Hであれば、パラメータは少なくなります。 +- 層はパラメータを共有するグループに分割されています(メモリ節約のため)。次文予測(NSP: Next Sentence Prediction)は文の順序予測に置き換えられます:入力では、2つの文AとB(それらは連続している)があり、Aに続いてBを与えるか、Bに続いてAを与えます。モデルはそれらが入れ替わっているかどうかを予測する必要があります。 + +## 参考資料 + +- [テキスト分類タスクガイド](../tasks/sequence_classification) +- [トークン分類タスクガイド](../tasks/token_classification) +- [質問応答タスクガイド](../tasks/question_answering) +- [マスクされた言語モデルタスクガイド](../tasks/masked_language_modeling) +- [多肢選択タスクガイド](../tasks/multiple_choice) + +## AlbertConfig + +[[autodoc]] AlbertConfig + +## AlbertTokenizer + +[[autodoc]] AlbertTokenizer + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - save_vocabulary + +## AlbertTokenizerFast + +[[autodoc]] AlbertTokenizerFast + +## Albert specific outputs + +[[autodoc]] models.albert.modeling_albert.AlbertForPreTrainingOutput + +[[autodoc]] models.albert.modeling_tf_albert.TFAlbertForPreTrainingOutput + + + + +## AlbertModel + +[[autodoc]] AlbertModel + - forward + +## AlbertForPreTraining + +[[autodoc]] AlbertForPreTraining + - forward + +## AlbertForMaskedLM + +[[autodoc]] AlbertForMaskedLM + - forward + +## AlbertForSequenceClassification + +[[autodoc]] AlbertForSequenceClassification + - forward + +## AlbertForMultipleChoice + +[[autodoc]] AlbertForMultipleChoice + +## AlbertForTokenClassification + +[[autodoc]] AlbertForTokenClassification + - forward + +## AlbertForQuestionAnswering + +[[autodoc]] AlbertForQuestionAnswering + - forward + + + + + +## TFAlbertModel + +[[autodoc]] TFAlbertModel + - call + +## TFAlbertForPreTraining + +[[autodoc]] TFAlbertForPreTraining + - call + +## TFAlbertForMaskedLM + +[[autodoc]] TFAlbertForMaskedLM + - call + +## TFAlbertForSequenceClassification + +[[autodoc]] TFAlbertForSequenceClassification + - call + +## TFAlbertForMultipleChoice + +[[autodoc]] TFAlbertForMultipleChoice + - call + +## TFAlbertForTokenClassification + +[[autodoc]] TFAlbertForTokenClassification + - call + +## TFAlbertForQuestionAnswering + +[[autodoc]] TFAlbertForQuestionAnswering + - call + + + + +## FlaxAlbertModel + +[[autodoc]] FlaxAlbertModel + - __call__ + +## FlaxAlbertForPreTraining + +[[autodoc]] FlaxAlbertForPreTraining + - __call__ + +## FlaxAlbertForMaskedLM + 
+[[autodoc]] FlaxAlbertForMaskedLM + - __call__ + +## FlaxAlbertForSequenceClassification + +[[autodoc]] FlaxAlbertForSequenceClassification + - __call__ + +## FlaxAlbertForMultipleChoice + +[[autodoc]] FlaxAlbertForMultipleChoice + - __call__ + +## FlaxAlbertForTokenClassification + +[[autodoc]] FlaxAlbertForTokenClassification + - __call__ + +## FlaxAlbertForQuestionAnswering + +[[autodoc]] FlaxAlbertForQuestionAnswering + - __call__ + + + diff --git a/docs/source/ja/model_doc/align.md b/docs/source/ja/model_doc/align.md new file mode 100644 index 00000000000000..84496e605def92 --- /dev/null +++ b/docs/source/ja/model_doc/align.md @@ -0,0 +1,104 @@ + + +# ALIGN + +## 概要 + +ALIGNモデルは、「[Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918)」という論文でChao Jia、Yinfei Yang、Ye Xia、Yi-Ting Chen、Zarana Parekh、Hieu Pham、Quoc V. Le、Yunhsuan Sung、Zhen Li、Tom Duerigによって提案されました。ALIGNはマルチモーダルな視覚言語モデルです。これは画像とテキストの類似度や、ゼロショット画像分類に使用できます。ALIGNは[EfficientNet](efficientnet)を視覚エンコーダーとして、[BERT](bert)をテキストエンコーダーとして搭載したデュアルエンコーダー構造を特徴とし、対照学習によって視覚とテキストの表現を整合させることを学びます。それまでの研究とは異なり、ALIGNは巨大でノイジーなデータセットを活用し、コーパスのスケールを利用して単純な方法ながら最先端の表現を達成できることを示しています。 + +論文の要旨は以下の通りです: + +*事前学習された表現は、多くの自然言語処理(NLP)および知覚タスクにとって重要になっています。NLPにおける表現学習は、人間のアノテーションのない生のテキストでの学習へと移行していますが、視覚および視覚言語の表現は依然として精巧な学習データセットに大きく依存しており、これは高価であったり専門知識を必要としたりします。視覚アプリケーションの場合、ImageNetやOpenImagesのような明示的なクラスラベルを持つデータセットを使用して学習されることがほとんどです。視覚言語の場合、Conceptual Captions、MSCOCO、CLIPなどの人気のあるデータセットはすべて、それぞれ無視できないデータ収集(およびクリーニング)プロセスを含みます。このコストのかかるキュレーションプロセスはデータセットのサイズを制限し、訓練されたモデルのスケーリングを妨げます。本論文では、Conceptual Captionsデータセットの高価なフィルタリングや後処理ステップなしで得られた、10億を超える画像alt-textペアのノイズの多いデータセットを活用します。シンプルなデュアルエンコーダーアーキテクチャは、対照損失を使用して画像とテキストペアの視覚的および言語的表現を整合させることを学習します。我々は、コーパスの規模がそのノイズを補い、このような単純な学習スキームでも最先端の表現につながることを示します。我々の視覚表現は、ImageNetやVTABなどの分類タスクへの転移において強力な性能を発揮します。整合した視覚的および言語的表現は、ゼロショット画像分類を可能にし、また、より洗練されたクロスアテンションモデルと比較しても、Flickr30KおよびMSCOCO画像テキスト検索ベンチマークにおいて新たな最先端の結果を達成します。また、これらの表現は、複雑なテキストおよびテキスト+画像のクエリを用いたクロスモーダル検索を可能にします。* + +このモデルは[Alara Dirik](https://huggingface.co/adirik)により提供されました。 +オリジナルのコードは公開されておらず、この実装は元論文に基づいたKakao Brainの実装をベースにしています。 + +## 使用例 + +ALIGNはEfficientNetを使用して視覚的特徴を、BERTを使用してテキスト特徴を取得します。テキストと視覚の両方の特徴は、同一の次元を持つ潜在空間に射影されます。射影された画像とテキスト特徴間のドット積が類似度スコアとして使用されます。 + +[`AlignProcessor`]は、テキストのエンコードと画像の前処理を両方行うために、[`EfficientNetImageProcessor`]と[`BertTokenizer`]を単一のインスタンスにラップします。以下の例は、[`AlignProcessor`]と[`AlignModel`]を使用して画像-テキスト類似度スコアを取得する方法を示しています。 + +```python +import requests +import torch +from PIL import Image +from transformers import AlignProcessor, AlignModel + +processor = AlignProcessor.from_pretrained("kakaobrain/align-base") +model = AlignModel.from_pretrained("kakaobrain/align-base") + +url = "http://images.cocodataset.org/val2017/000000039769.jpg" +image = Image.open(requests.get(url, stream=True).raw) +candidate_labels = ["an image of a cat", "an image of a dog"] + +inputs = processor(text=candidate_labels, images=image, return_tensors="pt") + +with torch.no_grad(): + outputs = model(**inputs) + +# this is the image-text similarity score +logits_per_image = outputs.logits_per_image + +# we can take the softmax to get the label probabilities +probs = logits_per_image.softmax(dim=1) +print(probs) +``` + +## 参考資料 + +ALIGNの使用を開始するのに役立つ公式のHugging Faceとコミュニティ(🌎で示されている)の参考資料の一覧です。 + +- [ALIGNとCOYO-700Mデータセット](https://huggingface.co/blog/vit-align)に関するブログ投稿。 +- ゼロショット画像分類[デモ](https://huggingface.co/spaces/adirik/ALIGN-zero-shot-image-classification)。 +- 
`kakaobrain/align-base` モデルの[モデルカード](https://huggingface.co/kakaobrain/align-base)。 + +ここに参考資料を提出したい場合は、気兼ねなくPull Requestを開いてください。私たちはそれをレビューいたします!参考資料は、既存のものを複製するのではなく、何か新しいことを示すことが理想的です。 + +## AlignConfig + +[[autodoc]] AlignConfig + - from_text_vision_configs + +## AlignTextConfig + +[[autodoc]] AlignTextConfig + +## AlignVisionConfig + +[[autodoc]] AlignVisionConfig + +## AlignProcessor + +[[autodoc]] AlignProcessor + +## AlignModel + +[[autodoc]] AlignModel + - forward + - get_text_features + - get_image_features + +## AlignTextModel + +[[autodoc]] AlignTextModel + - forward + +## AlignVisionModel + +[[autodoc]] AlignVisionModel + - forward diff --git a/docs/source/ja/model_doc/altclip.md b/docs/source/ja/model_doc/altclip.md new file mode 100644 index 00000000000000..87cf6cc17d6a0b --- /dev/null +++ b/docs/source/ja/model_doc/altclip.md @@ -0,0 +1,97 @@ + + +# AltCLIP + +## 概要 + + +AltCLIPモデルは、「[AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679v2)」という論文でZhongzhi Chen、Guang Liu、Bo-Wen Zhang、Fulong Ye、Qinghong Yang、Ledell Wuによって提案されました。AltCLIP(CLIPの言語エンコーダーの代替)は、様々な画像-テキストペアおよびテキスト-テキストペアでトレーニングされたニューラルネットワークです。CLIPのテキストエンコーダーを事前学習済みの多言語テキストエンコーダーXLM-Rに置き換えることで、ほぼ全てのタスクでCLIPに非常に近い性能を得られ、オリジナルのCLIPの能力を多言語理解などに拡張しました。 + +論文の要旨は以下の通りです: + +*この研究では、強力なバイリンガルマルチモーダル表現モデルを訓練するための概念的に単純で効果的な方法を提案します。OpenAIによってリリースされたマルチモーダル表現モデルCLIPから開始し、そのテキストエンコーダを事前学習済みの多言語テキストエンコーダXLM-Rに交換し、教師学習と対照学習からなる2段階のトレーニングスキーマを用いて言語と画像の表現を整合させました。幅広いタスクの評価を通じて、我々の方法を検証します。ImageNet-CN、Flicker30k-CN、COCO-CNを含む多くのタスクで新たな最先端の性能を達成しました。さらに、ほぼすべてのタスクでCLIPに非常に近い性能を得ており、これはCLIPのテキストエンコーダを変更するだけで、多言語理解などの拡張を実現できることを示唆しています。* + +このモデルは[jongjyh](https://huggingface.co/jongjyh)により提供されました。 + +## 使用上のヒントと使用例 + +AltCLIPの使用方法はCLIPに非常に似ています。CLIPとの違いはテキストエンコーダーにあります。私たちはカジュアルアテンションではなく双方向アテンションを使用し、XLM-Rの[CLS]トークンをテキスト埋め込みを表すものとして取ることに留意してください。 + +AltCLIPはマルチモーダルな視覚言語モデルです。これは画像とテキストの類似度や、ゼロショット画像分類に使用できます。AltCLIPはViTのようなTransformerを使用して視覚的特徴を、双方向言語モデルを使用してテキスト特徴を取得します。テキストと視覚の両方の特徴は、同一の次元を持つ潜在空間に射影されます。射影された画像とテキスト特徴間のドット積が類似度スコアとして使用されます。 + +Transformerエンコーダーに画像を与えるには、各画像を固定サイズの重複しないパッチの系列に分割し、それらを線形に埋め込みます。画像全体を表現するための[CLS]トークンが追加されます。著者は絶対位置埋め込みも追加し、結果として得られるベクトルの系列を標準的なTransformerエンコーダーに供給します。[`CLIPImageProcessor`]を使用して、モデルのために画像のサイズ変更(または拡大縮小)と正規化を行うことができます。 + +[`AltCLIPProcessor`]は、テキストのエンコードと画像の前処理を両方行うために、[`CLIPImageProcessor`]と[`XLMRobertaTokenizer`]を単一のインスタンスにラップします。以下の例は、[`AltCLIPProcessor`]と[`AltCLIPModel`]を使用して画像-テキスト類似スコアを取得する方法を示しています。 + +```python +>>> from PIL import Image +>>> import requests + +>>> from transformers import AltCLIPModel, AltCLIPProcessor + +>>> model = AltCLIPModel.from_pretrained("BAAI/AltCLIP") +>>> processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP") + +>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" +>>> image = Image.open(requests.get(url, stream=True).raw) + +>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True) + +>>> outputs = model(**inputs) +>>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score +>>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities +``` + + + +このモデルは`CLIPModel`をベースにしており、オリジナルの[CLIP](clip)と同じように使用してください。 + + + +## AltCLIPConfig + +[[autodoc]] AltCLIPConfig + - from_text_vision_configs + +## AltCLIPTextConfig + +[[autodoc]] AltCLIPTextConfig + +## AltCLIPVisionConfig + +[[autodoc]] AltCLIPVisionConfig + +## AltCLIPProcessor + 
+[[autodoc]] AltCLIPProcessor + +## AltCLIPModel + +[[autodoc]] AltCLIPModel + - forward + - get_text_features + - get_image_features + +## AltCLIPTextModel + +[[autodoc]] AltCLIPTextModel + - forward + +## AltCLIPVisionModel + +[[autodoc]] AltCLIPVisionModel + - forward diff --git a/docs/source/ja/model_doc/audio-spectrogram-transformer.md b/docs/source/ja/model_doc/audio-spectrogram-transformer.md new file mode 100644 index 00000000000000..a5107b14f83a98 --- /dev/null +++ b/docs/source/ja/model_doc/audio-spectrogram-transformer.md @@ -0,0 +1,69 @@ + + +# Audio Spectrogram Transformer + +## 概要 + +Audio Spectrogram Transformerモデルは、[AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778)という論文でYuan Gong、Yu-An Chung、James Glassによって提案されました。これは、音声を画像(スペクトログラム)に変換することで、音声に[Vision Transformer](vit)を適用します。このモデルは音声分類において最先端の結果を得ています。 + +論文の要旨は以下の通りです: + +*過去10年間で、畳み込みニューラルネットワーク(CNN)は、音声スペクトログラムから対応するラベルへの直接的なマッピングを学習することを目指す、エンドツーエンドの音声分類モデルの主要な構成要素として広く採用されてきました。長距離のグローバルなコンテキストをより良く捉えるため、最近の傾向として、CNNの上にセルフアテンション機構を追加し、CNN-アテンションハイブリッドモデルを形成することがあります。しかし、CNNへの依存が必要かどうか、そして純粋にアテンションに基づくニューラルネットワークだけで音声分類において良いパフォーマンスを得ることができるかどうかは明らかではありません。本論文では、これらの問いに答えるため、音声分類用では最初の畳み込みなしで純粋にアテンションベースのモデルであるAudio Spectrogram Transformer(AST)を紹介します。我々はASTを様々なオーディオ分類ベンチマークで評価し、AudioSetで0.485 mAP、ESC-50で95.6%の正解率、Speech Commands V2で98.1%の正解率という新たな最先端の結果を達成しました。* + + + + Audio Spectrogram Transformerのアーキテクチャ。元論文より抜粋。 + +このモデルは[nielsr](https://huggingface.co/nielsr)より提供されました。 +オリジナルのコードは[こちら](https://github.com/YuanGongND/ast)で見ることができます。 + +## 使用上のヒント + +- 独自のデータセットでAudio Spectrogram Transformer(AST)をファインチューニングする場合、入力の正規化(入力の平均を0、標準偏差を0.5にすること)処理することが推奨されます。[`ASTFeatureExtractor`]はこれを処理します。デフォルトではAudioSetの平均と標準偏差を使用していることに注意してください。著者が下流のデータセットの統計をどのように計算しているかは、[`ast/src/get_norm_stats.py`](https://github.com/YuanGongND/ast/blob/master/src/get_norm_stats.py)で確認することができます。 +- ASTは低い学習率が必要であり 著者は[PSLA論文](https://arxiv.org/abs/2102.01243)で提案されたCNNモデルに比べて10倍小さい学習率を使用しています)、素早く収束するため、タスクに適した学習率と学習率スケジューラーを探すことをお勧めします。 + +## 参考資料 + +Audio Spectrogram Transformerの使用を開始するのに役立つ公式のHugging Faceおよびコミュニティ(🌎で示されている)の参考資料の一覧です。 + + + +- ASTを用いた音声分類の推論を説明するノートブックは[こちら](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/AST)で見ることができます。 +- [`ASTForAudioClassification`]は、この[例示スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/audio-classification)と[ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb)によってサポートされています。 +- こちらも参照:[音声分類タスク](../tasks/audio_classification)。 + +ここに参考資料を提出したい場合は、気兼ねなくPull Requestを開いてください。私たちはそれをレビューいたします!参考資料は、既存のものを複製するのではなく、何か新しいことを示すことが理想的です。 + +## ASTConfig + +[[autodoc]] ASTConfig + +## ASTFeatureExtractor + +[[autodoc]] ASTFeatureExtractor + - __call__ + +## ASTModel + +[[autodoc]] ASTModel + - forward + +## ASTForAudioClassification + +[[autodoc]] ASTForAudioClassification + - forward diff --git a/docs/source/ja/model_doc/auto.md b/docs/source/ja/model_doc/auto.md new file mode 100644 index 00000000000000..d4baaf70e6fd48 --- /dev/null +++ b/docs/source/ja/model_doc/auto.md @@ -0,0 +1,370 @@ + + +# Auto Classes + +多くの場合、`from_pretrained()`メソッドに与えられた事前学習済みモデルの名前やパスから、使用したいアーキテクチャを推測することができます。自動クラスはこの仕事をあなたに代わって行うためにここにありますので、事前学習済みの重み/設定/語彙への名前/パスを与えると自動的に関連するモデルを取得できます。 + +[`AutoConfig`]、[`AutoModel`]、[`AutoTokenizer`]のいずれかをインスタンス化すると、関連するアーキテクチャのクラスが直接作成されます。例えば、 + +```python +model = AutoModel.from_pretrained("google-bert/bert-base-cased") +``` + +これは[`BertModel`]のインスタンスであるモデルを作成します。 + 
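+同様に、[`AutoTokenizer`] や [`AutoConfig`] も、同じ名前/パスから対応するクラスを解決します。以下は動作を確認するための簡単なスケッチです (表示されるクラス名は、インストールされているバックエンドやバージョンによって異なる場合があります)。
+
+```python
+from transformers import AutoConfig, AutoTokenizer
+
+# 同じチェックポイント名から、設定とトークナイザーも自動的に解決されます
+config = AutoConfig.from_pretrained("google-bert/bert-base-cased")
+tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
+
+print(type(config).__name__)  # BertConfig
+print(type(tokenizer).__name__)  # 例: BertTokenizerFast
+```
+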
+各タスクごと、そして各バックエンド(PyTorch、TensorFlow、またはFlax)ごとに`AutoModel`のクラスが存在します。 + +## 自動クラスの拡張 + +それぞれの自動クラスには、カスタムクラスで拡張するためのメソッドがあります。例えば、`NewModel`というモデルのカスタムクラスを定義した場合、`NewModelConfig`を確保しておけばこのようにして自動クラスに追加することができます: + +```python +from transformers import AutoConfig, AutoModel + +AutoConfig.register("new-model", NewModelConfig) +AutoModel.register(NewModelConfig, NewModel) +``` + +その後、通常どおりauto classesを使用することができるようになります! + + + +あなたの`NewModelConfig`が[`~transformers.PretrainedConfig`]のサブクラスである場合、その`model_type`属性がコンフィグを登録するときに使用するキー(ここでは`"new-model"`)と同じに設定されていることを確認してください。 + +同様に、あなたの`NewModel`が[`PreTrainedModel`]のサブクラスである場合、その`config_class`属性がモデルを登録する際に使用するクラス(ここでは`NewModelConfig`)と同じに設定されていることを確認してください。 + + + +## AutoConfig + +[[autodoc]] AutoConfig + +## AutoTokenizer + +[[autodoc]] AutoTokenizer + +## AutoFeatureExtractor + +[[autodoc]] AutoFeatureExtractor + +## AutoImageProcessor + +[[autodoc]] AutoImageProcessor + +## AutoProcessor + +[[autodoc]] AutoProcessor + +## Generic model classes + +以下の自動クラスは、特定のヘッドを持たないベースモデルクラスをインスタンス化するために利用可能です。 + +### AutoModel + +[[autodoc]] AutoModel + +### TFAutoModel + +[[autodoc]] TFAutoModel + +### FlaxAutoModel + +[[autodoc]] FlaxAutoModel + +## Generic pretraining classes + +以下の自動クラスは、事前学習ヘッドを持つモデルをインスタンス化するために利用可能です。 + +### AutoModelForPreTraining + +[[autodoc]] AutoModelForPreTraining + +### TFAutoModelForPreTraining + +[[autodoc]] TFAutoModelForPreTraining + +### FlaxAutoModelForPreTraining + +[[autodoc]] FlaxAutoModelForPreTraining + +## Natural Language Processing + +以下の自動クラスは、次の自然言語処理タスクに利用可能です。 + +### AutoModelForCausalLM + +[[autodoc]] AutoModelForCausalLM + +### TFAutoModelForCausalLM + +[[autodoc]] TFAutoModelForCausalLM + +### FlaxAutoModelForCausalLM + +[[autodoc]] FlaxAutoModelForCausalLM + +### AutoModelForMaskedLM + +[[autodoc]] AutoModelForMaskedLM + +### TFAutoModelForMaskedLM + +[[autodoc]] TFAutoModelForMaskedLM + +### FlaxAutoModelForMaskedLM + +[[autodoc]] FlaxAutoModelForMaskedLM + +### AutoModelForMaskGeneration + +[[autodoc]] AutoModelForMaskGeneration + +### TFAutoModelForMaskGeneration + +[[autodoc]] TFAutoModelForMaskGeneration + +### AutoModelForSeq2SeqLM + +[[autodoc]] AutoModelForSeq2SeqLM + +### TFAutoModelForSeq2SeqLM + +[[autodoc]] TFAutoModelForSeq2SeqLM + +### FlaxAutoModelForSeq2SeqLM + +[[autodoc]] FlaxAutoModelForSeq2SeqLM + +### AutoModelForSequenceClassification + +[[autodoc]] AutoModelForSequenceClassification + +### TFAutoModelForSequenceClassification + +[[autodoc]] TFAutoModelForSequenceClassification + +### FlaxAutoModelForSequenceClassification + +[[autodoc]] FlaxAutoModelForSequenceClassification + +### AutoModelForMultipleChoice + +[[autodoc]] AutoModelForMultipleChoice + +### TFAutoModelForMultipleChoice + +[[autodoc]] TFAutoModelForMultipleChoice + +### FlaxAutoModelForMultipleChoice + +[[autodoc]] FlaxAutoModelForMultipleChoice + +### AutoModelForNextSentencePrediction + +[[autodoc]] AutoModelForNextSentencePrediction + +### TFAutoModelForNextSentencePrediction + +[[autodoc]] TFAutoModelForNextSentencePrediction + +### FlaxAutoModelForNextSentencePrediction + +[[autodoc]] FlaxAutoModelForNextSentencePrediction + +### AutoModelForTokenClassification + +[[autodoc]] AutoModelForTokenClassification + +### TFAutoModelForTokenClassification + +[[autodoc]] TFAutoModelForTokenClassification + +### FlaxAutoModelForTokenClassification + +[[autodoc]] FlaxAutoModelForTokenClassification + +### AutoModelForQuestionAnswering + +[[autodoc]] AutoModelForQuestionAnswering + +### TFAutoModelForQuestionAnswering + 
+[[autodoc]] TFAutoModelForQuestionAnswering + +### FlaxAutoModelForQuestionAnswering + +[[autodoc]] FlaxAutoModelForQuestionAnswering + +### AutoModelForTextEncoding + +[[autodoc]] AutoModelForTextEncoding + +### TFAutoModelForTextEncoding + +[[autodoc]] TFAutoModelForTextEncoding + +## Computer vision + +以下の自動クラスは、次のコンピュータービジョンタスクに利用可能です。 + +### AutoModelForDepthEstimation + +[[autodoc]] AutoModelForDepthEstimation + +### AutoModelForImageClassification + +[[autodoc]] AutoModelForImageClassification + +### TFAutoModelForImageClassification + +[[autodoc]] TFAutoModelForImageClassification + +### FlaxAutoModelForImageClassification + +[[autodoc]] FlaxAutoModelForImageClassification + +### AutoModelForVideoClassification + +[[autodoc]] AutoModelForVideoClassification + +### AutoModelForMaskedImageModeling + +[[autodoc]] AutoModelForMaskedImageModeling + +### TFAutoModelForMaskedImageModeling + +[[autodoc]] TFAutoModelForMaskedImageModeling + +### AutoModelForObjectDetection + +[[autodoc]] AutoModelForObjectDetection + +### AutoModelForImageSegmentation + +[[autodoc]] AutoModelForImageSegmentation + +### AutoModelForImageToImage + +[[autodoc]] AutoModelForImageToImage + +### AutoModelForSemanticSegmentation + +[[autodoc]] AutoModelForSemanticSegmentation + +### TFAutoModelForSemanticSegmentation + +[[autodoc]] TFAutoModelForSemanticSegmentation + +### AutoModelForInstanceSegmentation + +[[autodoc]] AutoModelForInstanceSegmentation + +### AutoModelForUniversalSegmentation + +[[autodoc]] AutoModelForUniversalSegmentation + +### AutoModelForZeroShotImageClassification + +[[autodoc]] AutoModelForZeroShotImageClassification + +### TFAutoModelForZeroShotImageClassification + +[[autodoc]] TFAutoModelForZeroShotImageClassification + +### AutoModelForZeroShotObjectDetection + +[[autodoc]] AutoModelForZeroShotObjectDetection + +## Audio + +以下の自動クラスは、次の音声タスクに利用可能です。 + +### AutoModelForAudioClassification + +[[autodoc]] AutoModelForAudioClassification + +### AutoModelForAudioFrameClassification + +[[autodoc]] TFAutoModelForAudioClassification + +### TFAutoModelForAudioFrameClassification + +[[autodoc]] AutoModelForAudioFrameClassification + +### AutoModelForCTC + +[[autodoc]] AutoModelForCTC + +### AutoModelForSpeechSeq2Seq + +[[autodoc]] AutoModelForSpeechSeq2Seq + +### TFAutoModelForSpeechSeq2Seq + +[[autodoc]] TFAutoModelForSpeechSeq2Seq + +### FlaxAutoModelForSpeechSeq2Seq + +[[autodoc]] FlaxAutoModelForSpeechSeq2Seq + +### AutoModelForAudioXVector + +[[autodoc]] AutoModelForAudioXVector + +### AutoModelForTextToSpectrogram + +[[autodoc]] AutoModelForTextToSpectrogram + +### AutoModelForTextToWaveform + +[[autodoc]] AutoModelForTextToWaveform + +## Multimodal + +以下の自動クラスは、次のマルチモーダルタスクに利用可能です。 + +### AutoModelForTableQuestionAnswering + +[[autodoc]] AutoModelForTableQuestionAnswering + +### TFAutoModelForTableQuestionAnswering + +[[autodoc]] TFAutoModelForTableQuestionAnswering + +### AutoModelForDocumentQuestionAnswering + +[[autodoc]] AutoModelForDocumentQuestionAnswering + +### TFAutoModelForDocumentQuestionAnswering + +[[autodoc]] TFAutoModelForDocumentQuestionAnswering + +### AutoModelForVisualQuestionAnswering + +[[autodoc]] AutoModelForVisualQuestionAnswering + +### AutoModelForVision2Seq + +[[autodoc]] AutoModelForVision2Seq + +### TFAutoModelForVision2Seq + +[[autodoc]] TFAutoModelForVision2Seq + +### FlaxAutoModelForVision2Seq + +[[autodoc]] FlaxAutoModelForVision2Seq diff --git a/docs/source/ja/model_doc/autoformer.md b/docs/source/ja/model_doc/autoformer.md new file mode 100644 index 
00000000000000..b8b0948b960d69 --- /dev/null +++ b/docs/source/ja/model_doc/autoformer.md @@ -0,0 +1,50 @@ + + +# Autoformer + +## 概要 + +Autoformerモデルは、「[Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008)」という論文でHaixu Wu、Jiehui Xu、Jianmin Wang、Mingsheng Longによって提案されました。 + +このモデルは、予測プロセス中にトレンドと季節性成分を逐次的に分解できる深層分解アーキテクチャとしてTransformerを増強します。 + +論文の要旨は以下の通りです: + +*例えば異常気象の早期警告や長期的なエネルギー消費計画といった実応用において、予測時間を延長することは重要な要求です。本論文では、時系列の長期予測問題を研究しています。以前のTransformerベースのモデルは、長距離依存関係を発見するために様々なセルフアテンション機構を採用しています。しかし、長期未来の複雑な時間的パターンによってモデルが信頼できる依存関係を見つけることを妨げられます。また、Transformerは、長い系列の効率化のためにポイントワイズなセルフアテンションのスパースバージョンを採用する必要があり、情報利用のボトルネックとなります。Transformerを超えて、我々は自己相関機構を持つ新しい分解アーキテクチャとしてAutoformerを設計しました。系列分解の事前処理の慣行を破り、それを深層モデルの基本的な内部ブロックとして革新します。この設計は、複雑な時系列に対するAutoformerの進行的な分解能力を強化します。さらに、確率過程理論に触発されて、系列の周期性に基づいた自己相関機構を設計し、サブ系列レベルでの依存関係の発見と表現の集約を行います。自己相関は効率と精度の両方でセルフアテンションを上回ります。長期予測において、Autoformerは、エネルギー、交通、経済、気象、疾病の5つの実用的な応用をカバーする6つのベンチマークで38%の相対的な改善をもたらし、最先端の精度を達成します。* + +このモデルは[elisim](https://huggingface.co/elisim)と[kashif](https://huggingface.co/kashif)より提供されました。 +オリジナルのコードは[こちら](https://github.com/thuml/Autoformer)で見ることができます。 + +## 参考資料 + +Autoformerの使用を開始するのに役立つ公式のHugging Faceおよびコミュニティ(🌎で示されている)の参考資料の一覧です。ここに参考資料を提出したい場合は、気兼ねなくPull Requestを開いてください。私たちはそれをレビューいたします!参考資料は、既存のものを複製するのではなく、何か新しいことを示すことが理想的です。 + +- HuggingFaceブログでAutoformerに関するブログ記事をチェックしてください:[はい、Transformersは時系列予測に効果的です(+ Autoformer)](https://huggingface.co/blog/autoformer) + +## AutoformerConfig + +[[autodoc]] AutoformerConfig + +## AutoformerModel + +[[autodoc]] AutoformerModel + - forward + +## AutoformerForPrediction + +[[autodoc]] AutoformerForPrediction + - forward diff --git a/docs/source/ja/model_doc/bark.md b/docs/source/ja/model_doc/bark.md new file mode 100644 index 00000000000000..508b8938889bae --- /dev/null +++ b/docs/source/ja/model_doc/bark.md @@ -0,0 +1,198 @@ + + +# Bark + +## Overview + +Bark は、[suno-ai/bark](https://github.com/suno-ai/bark) で Suno AI によって提案されたトランスフォーマーベースのテキスト読み上げモデルです。 + + +Bark は 4 つの主要なモデルで構成されています。 + +- [`BarkSemanticModel`] ('テキスト'モデルとも呼ばれる): トークン化されたテキストを入力として受け取り、テキストの意味を捉えるセマンティック テキスト トークンを予測する因果的自己回帰変換モデル。 +- [`BarkCoarseModel`] ('粗い音響' モデルとも呼ばれる): [`BarkSemanticModel`] モデルの結果を入力として受け取る因果的自己回帰変換器。 EnCodec に必要な最初の 2 つのオーディオ コードブックを予測することを目的としています。 +- [`BarkFineModel`] ('微細音響' モデル)、今回は非因果的オートエンコーダー トランスフォーマーで、以前のコードブック埋め込みの合計に基づいて最後のコードブックを繰り返し予測します。 +- [`EncodecModel`] からすべてのコードブック チャネルを予測したので、Bark はそれを使用して出力オーディオ配列をデコードします。 + +最初の 3 つのモジュールはそれぞれ、特定の事前定義された音声に従って出力サウンドを調整するための条件付きスピーカー埋め込みをサポートできることに注意してください。 + +### Optimizing Bark + +Bark は、コードを数行追加するだけで最適化でき、**メモリ フットプリントが大幅に削減**され、**推論が高速化**されます。 + +#### Using half-precision + +モデルを半精度でロードするだけで、推論を高速化し、メモリ使用量を 50% 削減できます。 + +```python +from transformers import BarkModel +import torch + +device = "cuda" if torch.cuda.is_available() else "cpu" +model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to(device) +``` + +#### Using 🤗 Better Transformer + +Better Transformer は、内部でカーネル融合を実行する 🤗 最適な機能です。パフォーマンスを低下させることなく、速度を 20% ~ 30% 向上させることができます。モデルを 🤗 Better Transformer にエクスポートするのに必要なコードは 1 行だけです。 + +```python +model = model.to_bettertransformer() +``` + +この機能を使用する前に 🤗 Optimum をインストールする必要があることに注意してください。 [インストール方法はこちら](https://huggingface.co/docs/optimum/installation) + +#### Using CPU offload + +前述したように、Bark は 4 つのサブモデルで構成されており、オーディオ生成中に順番に呼び出されます。言い換えれば、1 つのサブモデルが使用されている間、他のサブモデルはアイドル状態になります。 + +CUDA デバイスを使用している場合、メモリ フットプリントの 80% 
削減による恩恵を受ける簡単な解決策は、アイドル状態のときにサブモデルを GPU から CPU にオフロードすることです。この操作は CPU オフロードと呼ばれ、1 行のコードで使用できます。
+
+```python
+model.enable_cpu_offload()
+```
+
+この機能を使用する前に、🤗 Accelerate をインストールする必要があることに注意してください。[インストール方法はこちら](https://huggingface.co/docs/accelerate/basic_tutorials/install)
+
+#### Combining optimization techniques
+
+最適化手法を組み合わせて、CPU オフロード、半精度、🤗 Better Transformer をすべて一度に使用できます。
+
+```python
+from transformers import BarkModel
+from optimum.bettertransformer import BetterTransformer
+import torch
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+# load in fp16
+model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to(device)
+
+# convert to bettertransformer
+model = BetterTransformer.transform(model, keep_original_model=False)
+
+# enable CPU offload
+model.enable_cpu_offload()
+```
+
+推論の最適化手法の詳細については、[こちら](https://huggingface.co/docs/transformers/perf_infer_gpu_one) をご覧ください。
+
+### Tips
+
+Suno は、多くの言語の音声プリセットのライブラリを [こちら](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c) で提供しています。
+これらのプリセットは、ハブの [こちら](https://huggingface.co/suno/bark-small/tree/main/speaker_embeddings) または [こちら](https://huggingface.co/suno/bark/tree/main/speaker_embeddings) でも公開されています。
+
+```python
+>>> from transformers import AutoProcessor, BarkModel
+
+>>> processor = AutoProcessor.from_pretrained("suno/bark")
+>>> model = BarkModel.from_pretrained("suno/bark")
+
+>>> voice_preset = "v2/en_speaker_6"
+
+>>> inputs = processor("Hello, my dog is cute", voice_preset=voice_preset)
+
+>>> audio_array = model.generate(**inputs)
+>>> audio_array = audio_array.cpu().numpy().squeeze()
+```
+
+Bark は、非常にリアルな**多言語**音声だけでなく、音楽、背景ノイズ、単純な効果音などの他の音声も生成できます。
+
+```python
+>>> # Multilingual speech - simplified Chinese
+>>> inputs = processor("惊人的!我会说中文")
+
+>>> # Multilingual speech - French - let's use a voice_preset as well
+>>> inputs = processor("Incroyable! Je peux générer du son.", voice_preset="fr_speaker_5")
+
+>>> # Bark can also generate music. You can help it out by adding music notes around your lyrics.
+>>> inputs = processor("♪ Hello, my dog is cute ♪")
+
+>>> audio_array = model.generate(**inputs)
+>>> audio_array = audio_array.cpu().numpy().squeeze()
+```
+
+このモデルは、笑う、ため息、泣くなどの**非言語コミュニケーション**も生成できます。
+
+```python
+>>> # Adding non-speech cues to the input text
+>>> inputs = processor("Hello uh ... 
[clears throat], my dog is cute [laughter]") + +>>> audio_array = model.generate(**inputs) +>>> audio_array = audio_array.cpu().numpy().squeeze() +``` + +オーディオを保存するには、モデル設定と scipy ユーティリティからサンプル レートを取得するだけです。 + +```python +>>> from scipy.io.wavfile import write as write_wav + +>>> # save audio to disk, but first take the sample rate from the model config +>>> sample_rate = model.generation_config.sample_rate +>>> write_wav("bark_generation.wav", sample_rate, audio_array) +``` + +このモデルは、[Yoach Lacombe (ylacombe)](https://huggingface.co/ylacombe) および [Sanchit Gandhi (sanchit-gandhi)](https://github.com/sanchit-gandhi) によって提供されました。 +元のコードは [ここ](https://github.com/suno-ai/bark) にあります。 + +## BarkConfig + +[[autodoc]] BarkConfig + - all + +## BarkProcessor + +[[autodoc]] BarkProcessor + - all + - __call__ + +## BarkModel + +[[autodoc]] BarkModel + - generate + - enable_cpu_offload + +## BarkSemanticModel + +[[autodoc]] BarkSemanticModel + - forward + +## BarkCoarseModel + +[[autodoc]] BarkCoarseModel + - forward + +## BarkFineModel + +[[autodoc]] BarkFineModel + - forward + +## BarkCausalModel + +[[autodoc]] BarkCausalModel + - forward + +## BarkCoarseConfig + +[[autodoc]] BarkCoarseConfig + - all + +## BarkFineConfig + +[[autodoc]] BarkFineConfig + - all + +## BarkSemanticConfig + +[[autodoc]] BarkSemanticConfig + - all diff --git a/docs/source/ja/model_doc/bart.md b/docs/source/ja/model_doc/bart.md new file mode 100644 index 00000000000000..5c71da37f01eb4 --- /dev/null +++ b/docs/source/ja/model_doc/bart.md @@ -0,0 +1,223 @@ + + +# BART + +
+ +Models + + +Spaces + +
+ +**免責事項:** 何か奇妙なものを見つけた場合は、[Github 問題](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) を提出し、割り当ててください。 +@patrickvonplaten + +## Overview + +Bart モデルは、[BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation、 +翻訳と理解](https://arxiv.org/abs/1910.13461) Mike Lewis、Yinhan Liu、Naman Goyal、Marjan 著 +ガズビニネジャド、アブデルラフマン・モハメド、オメル・レヴィ、ベス・ストヤノフ、ルーク・ゼトルモイヤー、2019年10月29日。 + +要約によると、 + +- Bart は、双方向エンコーダ (BERT など) を備えた標準の seq2seq/機械翻訳アーキテクチャを使用します。 + 左から右へのデコーダ (GPT など)。 +- 事前トレーニング タスクには、元の文の順序をランダムにシャッフルし、新しい埋め込みスキームが含まれます。 + ここで、テキストの範囲は単一のマスク トークンに置き換えられます。 +- BART は、テキスト生成用に微調整した場合に特に効果的ですが、理解タスクにも適しています。それ + RoBERTa のパフォーマンスを GLUE および SQuAD の同等のトレーニング リソースと同等にし、新たな成果を達成します。 + さまざまな抽象的な対話、質問応答、要約タスクに関する最先端の結果が得られ、成果が得られます。 + ルージュは最大6枚まで。 + +チップ: + +- BART は絶対位置埋め込みを備えたモデルであるため、通常は入力を右側にパディングすることをお勧めします。 + 左。 +- エンコーダーとデコーダーを備えたシーケンスツーシーケンス モデル。エンコーダには破損したバージョンのトークンが供給され、デコーダには元のトークンが供給されます(ただし、通常のトランスフォーマー デコーダと同様に、将来のワードを隠すためのマスクがあります)。次の変換の構成は、エンコーダーの事前トレーニング タスクに適用されます。 + + * ランダムなトークンをマスクします (BERT と同様) + * ランダムなトークンを削除します + * k 個のトークンのスパンを 1 つのマスク トークンでマスクします (0 トークンのスパンはマスク トークンの挿入です) + * 文を並べ替えます + * ドキュメントを回転して特定のトークンから開始するようにします + +このモデルは [sshleifer](https://huggingface.co/sshleifer) によって提供されました。著者のコードは [ここ](https://github.com/pytorch/fairseq/tree/master/examples/bart) にあります。 + +### Examples + +- シーケンス間タスク用の BART およびその他のモデルを微調整するための例とスクリプトは、次の場所にあります。 + [examples/pytorch/summarization/](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization/README.md)。 +- Hugging Face `datasets` を使用して [`BartForConditionalGeneration`] をトレーニングする方法の例 + オブジェクトは、この [フォーラム ディスカッション](https://discuss.huggingface.co/t/train-bart-for-conditional-generation-e-g-summarization/1904) で見つけることができます。 +- [抽出されたチェックポイント](https://huggingface.co/models?search=distilbart) は、この [論文](https://arxiv.org/abs/2010.13002) で説明されています。 + +## Implementation Notes + +- Bart はシーケンスの分類に `token_type_ids` を使用しません。 [`BartTokenizer`] を使用するか、 + [`~BartTokenizer.encode`] を使用して適切に分割します。 +- [`BartModel`] のフォワードパスは、渡されなかった場合、`decoder_input_ids` を作成します。 + これは、他のモデリング API とは異なります。この機能の一般的な使用例は、マスクの塗りつぶしです。 +- モデルの予測は、次の場合に元の実装と同一になるように意図されています。 + `forced_bos_token_id=0`。ただし、これは、渡す文字列が次の場合にのみ機能します。 + [`fairseq.encode`] はスペースで始まります。 +- [`~generation.GenerationMixin.generate`] は、次のような条件付き生成タスクに使用する必要があります。 + 要約については、その docstring の例を参照してください。 +- *facebook/bart-large-cnn* 重みをロードするモデルには `mask_token_id` がないか、実行できません。 + マスクを埋めるタスク。 + +## Mask Filling + +`facebook/bart-base` および `facebook/bart-large` チェックポイントを使用して、マルチトークン マスクを埋めることができます。 + +```python +from transformers import BartForConditionalGeneration, BartTokenizer + +model = BartForConditionalGeneration.from_pretrained("facebook/bart-large", forced_bos_token_id=0) +tok = BartTokenizer.from_pretrained("facebook/bart-large") +example_english_phrase = "UN Chief Says There Is No in Syria" +batch = tok(example_english_phrase, return_tensors="pt") +generated_ids = model.generate(batch["input_ids"]) +assert tok.batch_decode(generated_ids, skip_special_tokens=True) == [ + "UN Chief Says There Is No Plan to Stop Chemical Weapons in Syria" +] +``` + +## Resources + +BART を始めるのに役立つ公式 Hugging Face およびコミュニティ (🌎 で示されている) リソースのリスト。ここに含めるリソースの送信に興味がある場合は、お気軽にプル リクエストを開いてください。審査させていただきます。リソースは、既存のリソースを複製するのではなく、何か新しいものを示すことが理想的です。 + + + +- に関するブログ投稿 [分散トレーニング: 🤗 Transformers と Amazon SageMaker を使用した要約のための BART/T5 のトレーニング](https://huggingface.co/blog/sagemaker-distributed-training-seq2seq)。 +- 方法に関するノートブック [blurr 
を使用して fastai で要約するために BART を微調整する](https://colab.research.google.com/github/ohmeow/ohmeow_website/blob/master/posts/2021-05-25-mbart-sequence-classification-with-blurr.ipynb). 🌎 🌎 +- 方法に関するノートブック [トレーナー クラスを使用して 2 つの言語で要約するために BART を微調整する](https://colab.research.google.com/github/elsanns/xai-nlp-notebooks/blob/master/fine_tune_bart_summarization_two_langs.ipynb)。 🌎 +- [`BartForConditionalGeneration`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb)。 +- [`TFBartForConditionalGeneration`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/summarization) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization-tf.ipynb)。 +- [`FlaxBartForConditionalGeneration`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/flax/summarization) でサポートされています。 +- [要約](https://huggingface.co/course/chapter7/5?fw=pt#summarization) 🤗 ハグフェイスコースの章。 +- [要約タスクガイド](../tasks/summarization.md) + + + +- [`BartForConditionalGeneration`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#robertabertdistilbert-and-masked-language-modeling) でサポートされており、 [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb)。 +- [`TFBartForConditionalGeneration`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/language-modeling#run_mlmpy) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb)。 +- [`FlaxBartForConditionalGeneration`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling#masked-language-modeling) および [ノートブック]( https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/masked_language_modeling_flax.ipynb)。 +- [マスクされた言語モデリング](https://huggingface.co/course/chapter7/3?fw=pt) 🤗 顔ハグ コースの章。 +- [マスクされた言語モデリング タスク ガイド](../tasks/masked_lang_modeling) + + + +- [ヒンディー語から英語への翻訳に Seq2SeqTrainer を使用して mBART を微調整する方法に関するノート](https://colab.research.google.com/github/vasudevgupta7/huggingface-tutorials/blob/main/translation_training.ipynb)。 🌎 +- [`BartForConditionalGeneration`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/translation) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb)。 +- [`TFBartForConditionalGeneration`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/translation) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation-tf.ipynb)。 +- [翻訳タスクガイド](../tasks/translation) + +以下も参照してください。 +- [テキスト分類タスクガイド](../tasks/sequence_classification) +- [質問回答タスク ガイド](../tasks/question_answering) +- [因果言語モデリング タスク ガイド](../tasks/language_modeling) +- [抽出されたチェックポイント](https://huggingface.co/models?search=distilbart) は、この [論文](https://arxiv.org/abs/2010.13002) で説明されています。 + +## BartConfig + +[[autodoc]] BartConfig + - all + +## BartTokenizer + +[[autodoc]] BartTokenizer + - all + +## BartTokenizerFast + +[[autodoc]] BartTokenizerFast + - all + +## BartModel + +[[autodoc]] BartModel + - forward + +## BartForConditionalGeneration + +[[autodoc]] 
BartForConditionalGeneration + - forward + +## BartForSequenceClassification + +[[autodoc]] BartForSequenceClassification + - forward + +## BartForQuestionAnswering + +[[autodoc]] BartForQuestionAnswering + - forward + +## BartForCausalLM + +[[autodoc]] BartForCausalLM + - forward + +## TFBartModel + +[[autodoc]] TFBartModel + - call + +## TFBartForConditionalGeneration + +[[autodoc]] TFBartForConditionalGeneration + - call + +## TFBartForSequenceClassification + +[[autodoc]] TFBartForSequenceClassification + - call + +## FlaxBartModel + +[[autodoc]] FlaxBartModel + - __call__ + - encode + - decode + +## FlaxBartForConditionalGeneration + +[[autodoc]] FlaxBartForConditionalGeneration + - __call__ + - encode + - decode + +## FlaxBartForSequenceClassification + +[[autodoc]] FlaxBartForSequenceClassification + - __call__ + - encode + - decode + +## FlaxBartForQuestionAnswering + +[[autodoc]] FlaxBartForQuestionAnswering + - __call__ + - encode + - decode + +## FlaxBartForCausalLM + +[[autodoc]] FlaxBartForCausalLM + - __call__ diff --git a/docs/source/ja/model_doc/barthez.md b/docs/source/ja/model_doc/barthez.md new file mode 100644 index 00000000000000..94844c3f675d8a --- /dev/null +++ b/docs/source/ja/model_doc/barthez.md @@ -0,0 +1,60 @@ + + +# BARThez + +## Overview + +BARThez モデルは、Moussa Kamal Eddine、Antoine J.-P によって [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) で提案されました。ティクシエ、ミカリス・ヴァジルジャンニス、10月23日、 +2020年。 + +論文の要約: + + +*帰納的転移学習は、自己教師あり学習によって可能になり、自然言語処理全体を実行します。 +(NLP) 分野は、BERT や BART などのモデルにより、無数の自然言語に新たな最先端技術を確立し、嵐を巻き起こしています。 +タスクを理解すること。いくつかの注目すべき例外はありますが、利用可能なモデルと研究のほとんどは、 +英語を対象に実施されました。この作品では、フランス語用の最初の BART モデルである BARTez を紹介します。 +(我々の知る限りに)。 BARThez は、過去の研究から得た非常に大規模な単一言語フランス語コーパスで事前トレーニングされました +BART の摂動スキームに合わせて調整しました。既存の BERT ベースのフランス語モデルとは異なり、 +CamemBERT と FlauBERT、BARThez は、エンコーダだけでなく、 +そのデコーダは事前トレーニングされています。 FLUE ベンチマークからの識別タスクに加えて、BARThez を新しい評価に基づいて評価します。 +この論文とともにリリースする要約データセット、OrangeSum。また、すでに行われている事前トレーニングも継続します。 +BARTHez のコーパス上で多言語 BART を事前訓練し、結果として得られるモデル (mBARTHez と呼ぶ) が次のことを示します。 +バニラの BARThez を大幅に強化し、CamemBERT や FlauBERT と同等かそれを上回ります。* + +このモデルは [moussakam](https://huggingface.co/moussakam) によって寄稿されました。著者のコードは[ここ](https://github.com/moussaKam/BARThez)にあります。 + + + +BARThez の実装は、トークン化を除いて BART と同じです。詳細については、[BART ドキュメント](bart) を参照してください。 +構成クラスとそのパラメータ。 BARThez 固有のトークナイザーについては以下に記載されています。 + + + +### Resources + +- BARThez は、BART と同様の方法でシーケンス間のタスクを微調整できます。以下を確認してください。 + [examples/pytorch/summarization/](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization/README.md)。 + + +## BarthezTokenizer + +[[autodoc]] BarthezTokenizer + +## BarthezTokenizerFast + +[[autodoc]] BarthezTokenizerFast diff --git a/docs/source/ja/model_doc/bartpho.md b/docs/source/ja/model_doc/bartpho.md new file mode 100644 index 00000000000000..a9575d821ef916 --- /dev/null +++ b/docs/source/ja/model_doc/bartpho.md @@ -0,0 +1,86 @@ + + +# BARTpho + +## Overview + +BARTpho モデルは、Nguyen Luong Tran、Duong Minh Le、Dat Quoc Nguyen によって [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnam](https://arxiv.org/abs/2109.09701) で提案されました。 + +論文の要約は次のとおりです。 + +*BARTpho には、BARTpho_word と BARTpho_syllable の 2 つのバージョンがあり、初の公開された大規模な単一言語です。 +ベトナム語用に事前トレーニングされたシーケンスツーシーケンス モデル。当社の BARTpho は「大規模な」アーキテクチャと事前トレーニングを使用します +シーケンス間ノイズ除去モデル BART のスキームなので、生成 NLP タスクに特に適しています。実験 +ベトナム語テキスト要約の下流タスクでは、自動評価と人間による評価の両方で、BARTpho が +強力なベースライン mBART を上回り、最先端の性能を向上させます。将来を容易にするためにBARTphoをリリースします +生成的なベトナム語 NLP タスクの研究と応用。* + +このモデルは 
[dqnguyen](https://huggingface.co/dqnguyen) によって提供されました。元のコードは [こちら](https://github.com/VinAIResearch/BARTpho) にあります。 + +## Usage example + +```python +>>> import torch +>>> from transformers import AutoModel, AutoTokenizer + +>>> bartpho = AutoModel.from_pretrained("vinai/bartpho-syllable") + +>>> tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable") + +>>> line = "Chúng tôi là những nghiên cứu viên." + +>>> input_ids = tokenizer(line, return_tensors="pt") + +>>> with torch.no_grad(): +... features = bartpho(**input_ids) # Models outputs are now tuples + +>>> # With TensorFlow 2.0+: +>>> from transformers import TFAutoModel + +>>> bartpho = TFAutoModel.from_pretrained("vinai/bartpho-syllable") +>>> input_ids = tokenizer(line, return_tensors="tf") +>>> features = bartpho(**input_ids) +``` + +## Usage tips + +- mBARTに続いて、BARTphoはBARTの「大規模な」アーキテクチャを使用し、その上に追加の層正規化層を備えています。 + エンコーダとデコーダの両方。したがって、[BART のドキュメント](bart) の使用例は、使用に適応する場合に使用されます。 + BARTpho を使用する場合は、BART に特化したクラスを mBART に特化した対応するクラスに置き換えることによって調整する必要があります。 + 例えば: + +```python +>>> from transformers import MBartForConditionalGeneration + +>>> bartpho = MBartForConditionalGeneration.from_pretrained("vinai/bartpho-syllable") +>>> TXT = "Chúng tôi là nghiên cứu viên." +>>> input_ids = tokenizer([TXT], return_tensors="pt")["input_ids"] +>>> logits = bartpho(input_ids).logits +>>> masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item() +>>> probs = logits[0, masked_index].softmax(dim=0) +>>> values, predictions = probs.topk(5) +>>> print(tokenizer.decode(predictions).split()) +``` + +- この実装はトークン化のみを目的としています。`monolingual_vocab_file`はベトナム語に特化した型で構成されています + 多言語 XLM-RoBERTa から利用できる事前トレーニング済み SentencePiece モデル`vocab_file`から抽出されます。 + 他の言語 (サブワードにこの事前トレーニング済み多言語 SentencePiece モデル`vocab_file`を使用する場合) + セグメンテーションにより、独自の言語に特化した`monolingual_vocab_file`を使用して BartphoTokenizer を再利用できます。 + +## BartphoTokenizer + +[[autodoc]] BartphoTokenizer diff --git a/docs/source/ja/model_doc/beit.md b/docs/source/ja/model_doc/beit.md new file mode 100644 index 00000000000000..45eb1efa5dd873 --- /dev/null +++ b/docs/source/ja/model_doc/beit.md @@ -0,0 +1,143 @@ + + +# BEiT + +## Overview + +BEiT モデルは、[BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254) で提案されました。 +ハンボ・バオ、リー・ドン、フル・ウェイ。 BERT に触発された BEiT は、自己教師ありの事前トレーニングを作成した最初の論文です。 +ビジョン トランスフォーマー (ViT) は、教師付き事前トレーニングよりも優れたパフォーマンスを発揮します。クラスを予測するためにモデルを事前トレーニングするのではなく +([オリジナルの ViT 論文](https://arxiv.org/abs/2010.11929) で行われたように) 画像の BEiT モデルは、次のように事前トレーニングされています。 +マスクされた OpenAI の [DALL-E モデル](https://arxiv.org/abs/2102.12092) のコードブックからビジュアル トークンを予測します +パッチ。 + +論文の要約は次のとおりです。 + +*自己教師あり視覚表現モデル BEiT (Bidirectional Encoderpresentation) を導入します。 +イメージトランスフォーマーより。自然言語処理分野で開発されたBERTに倣い、マスク画像を提案します。 +ビジョントランスフォーマーを事前にトレーニングするためのモデリングタスク。具体的には、事前トレーニングでは各画像に 2 つのビューがあります。 +パッチ (16x16 ピクセルなど)、およびビジュアル トークン (つまり、個別のトークン)。まず、元の画像を「トークン化」して、 +ビジュアルトークン。次に、いくつかの画像パッチをランダムにマスクし、それらをバックボーンの Transformer に供給します。事前トレーニング +目的は、破損したイメージ パッチに基づいて元のビジュアル トークンを回復することです。 BEiTの事前トレーニング後、 +事前トレーニングされたエンコーダーにタスク レイヤーを追加することで、ダウンストリーム タスクのモデル パラメーターを直接微調整します。 +画像分類とセマンティックセグメンテーションに関する実験結果は、私たちのモデルが競争力のある結果を達成することを示しています +以前の事前トレーニング方法を使用して。たとえば、基本サイズの BEiT は、ImageNet-1K で 83.2% のトップ 1 精度を達成します。 +同じ設定でゼロからの DeiT トレーニング (81.8%) を大幅に上回りました。また、大型BEiTは +86.3% は ImageNet-1K のみを使用しており、ImageNet-22K での教師付き事前トレーニングを使用した ViT-L (85.2%) を上回っています。* + +## Usage tips + +- BEiT モデルは通常のビジョン トランスフォーマーですが、教師ありではなく自己教師ありの方法で事前トレーニングされています。彼らは + ImageNet-1K および CIFAR-100 で微調整すると、[オリジナル モデル (ViT)](vit) と 
[データ効率の高いイメージ トランスフォーマー (DeiT)](deit) の両方を上回るパフォーマンスを発揮します。推論に関するデモノートブックもチェックできます。 + カスタム データの微調整は [こちら](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) (置き換えるだけで済みます) + [`BeitImageProcessor`] による [`ViTFeatureExtractor`] と + [`ViTForImageClassification`] by [`BeitForImageClassification`])。 +- DALL-E の画像トークナイザーと BEiT を組み合わせる方法を紹介するデモ ノートブックも利用可能です。 + マスクされた画像モデリングを実行します。 [ここ](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/BEiT) で見つけることができます。 +- BEiT モデルは各画像が同じサイズ (解像度) であることを期待しているため、次のように使用できます。 + [`BeitImageProcessor`] を使用して、モデルの画像のサイズを変更 (または再スケール) し、正規化します。 +- 事前トレーニングまたは微調整中に使用されるパッチ解像度と画像解像度の両方が名前に反映されます。 + 各チェックポイント。たとえば、`microsoft/beit-base-patch16-224`は、パッチ付きの基本サイズのアーキテクチャを指します。 + 解像度は 16x16、微調整解像度は 224x224 です。すべてのチェックポイントは [ハブ](https://huggingface.co/models?search=microsoft/beit) で見つけることができます。 +- 利用可能なチェックポイントは、(1) [ImageNet-22k](http://www.image-net.org/) で事前トレーニングされています ( + 1,400 万の画像と 22,000 のクラス) のみ、(2) ImageNet-22k でも微調整、または (3) [ImageNet-1k](http://www.image-net.org/challenges/LSVRC)でも微調整/2012/) (ILSVRC 2012 とも呼ばれ、130 万件のコレクション) + 画像と 1,000 クラス)。 +- BEiT は、T5 モデルからインスピレーションを得た相対位置埋め込みを使用します。事前トレーニング中に、著者は次のことを共有しました。 + いくつかの自己注意層間の相対的な位置の偏り。微調整中、各レイヤーの相対位置 + バイアスは、事前トレーニング後に取得された共有相対位置バイアスで初期化されます。ご希望の場合は、 + モデルを最初から事前トレーニングするには、`use_relative_position_bias` または + 追加するには、[`BeitConfig`] の `use_relative_position_bias` 属性を `True` に設定します。 + 位置の埋め込み。 + + + + BEiT の事前トレーニング。 元の論文から抜粋。 + +このモデルは、[nielsr](https://huggingface.co/nielsr) によって提供されました。このモデルの JAX/FLAX バージョンは、 +[kamalkraj](https://huggingface.co/kamalkraj) による投稿。元のコードは [ここ](https://github.com/microsoft/unilm/tree/master/beit) にあります。 + +## Resources + +BEiT の使用を開始するのに役立つ公式 Hugging Face およびコミュニティ (🌎 で示されている) リソースのリスト。 + + + +- [`BeitForImageClassification`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb)。 +- 参照: [画像分類タスク ガイド](../tasks/image_classification) + +**セマンティック セグメンテーション** +- [セマンティック セグメンテーション タスク ガイド](../tasks/semantic_segmentation) + +ここに含めるリソースの送信に興味がある場合は、お気軽にプル リクエストを開いてください。審査させていただきます。リソースは、既存のリソースを複製するのではなく、何か新しいものを示すことが理想的です。 + +## BEiT specific outputs + +[[autodoc]] models.beit.modeling_beit.BeitModelOutputWithPooling + +[[autodoc]] models.beit.modeling_flax_beit.FlaxBeitModelOutputWithPooling + +## BeitConfig + +[[autodoc]] BeitConfig + +## BeitFeatureExtractor + +[[autodoc]] BeitFeatureExtractor + - __call__ + - post_process_semantic_segmentation + +## BeitImageProcessor + +[[autodoc]] BeitImageProcessor + - preprocess + - post_process_semantic_segmentation + +## BeitModel + +[[autodoc]] BeitModel + - forward + +## BeitForMaskedImageModeling + +[[autodoc]] BeitForMaskedImageModeling + - forward + +## BeitForImageClassification + +[[autodoc]] BeitForImageClassification + - forward + +## BeitForSemanticSegmentation + +[[autodoc]] BeitForSemanticSegmentation + - forward + +## FlaxBeitModel + +[[autodoc]] FlaxBeitModel + - __call__ + +## FlaxBeitForMaskedImageModeling + +[[autodoc]] FlaxBeitForMaskedImageModeling + - __call__ + +## FlaxBeitForImageClassification + +[[autodoc]] FlaxBeitForImageClassification + - __call__ diff --git a/docs/source/ja/model_doc/bert-generation.md b/docs/source/ja/model_doc/bert-generation.md new file mode 100644 index 00000000000000..d2c93a4644d943 --- /dev/null +++ b/docs/source/ja/model_doc/bert-generation.md @@ -0,0 +1,107 @@ + + +# 
BertGeneration + +## Overview + +BertGeneration モデルは、次を使用してシーケンス間のタスクに利用できる BERT モデルです。 +[Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) で提案されている [`EncoderDecoderModel`] +タスク、Sascha Rothe、Sishi Nagayan、Aliaksei Severyn 著。 + +論文の要約は次のとおりです。 + +*大規模なニューラル モデルの教師なし事前トレーニングは、最近、自然言語処理に革命をもたらしました。による +NLP 実践者は、公開されたチェックポイントからウォームスタートして、複数の項目で最先端の技術を推進してきました。 +コンピューティング時間を大幅に節約しながらベンチマークを実行します。これまでのところ、主に自然言語に焦点を当ててきました。 +タスクを理解する。この論文では、シーケンス生成のための事前トレーニングされたチェックポイントの有効性を実証します。私たちは +公開されている事前トレーニング済み BERT と互換性のある Transformer ベースのシーケンス間モデルを開発しました。 +GPT-2 および RoBERTa チェックポイントを使用し、モデルの初期化の有用性について広範な実証研究を実施しました。 +エンコーダとデコーダ、これらのチェックポイント。私たちのモデルは、機械翻訳に関する新しい最先端の結果をもたらします。 +テキストの要約、文の分割、および文の融合。* + +## Usage examples and tips + +- モデルを [`EncoderDecoderModel`] と組み合わせて使用​​して、2 つの事前トレーニングされたモデルを活用できます。 + 後続の微調整のための BERT チェックポイント。 + +```python +>>> # leverage checkpoints for Bert2Bert model... +>>> # use BERT's cls token as BOS token and sep token as EOS token +>>> encoder = BertGenerationEncoder.from_pretrained("google-bert/bert-large-uncased", bos_token_id=101, eos_token_id=102) +>>> # add cross attention layers and use BERT's cls token as BOS token and sep token as EOS token +>>> decoder = BertGenerationDecoder.from_pretrained( +... "google-bert/bert-large-uncased", add_cross_attention=True, is_decoder=True, bos_token_id=101, eos_token_id=102 +... ) +>>> bert2bert = EncoderDecoderModel(encoder=encoder, decoder=decoder) + +>>> # create tokenizer... +>>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-large-uncased") + +>>> input_ids = tokenizer( +... "This is a long article to summarize", add_special_tokens=False, return_tensors="pt" +... ).input_ids +>>> labels = tokenizer("This is a short summary", return_tensors="pt").input_ids + +>>> # train... +>>> loss = bert2bert(input_ids=input_ids, decoder_input_ids=labels, labels=labels).loss +>>> loss.backward() +``` + +- 事前トレーニングされた [`EncoderDecoderModel`] もモデル ハブで直接利用できます。 + +```python +>>> # instantiate sentence fusion model +>>> sentence_fuser = EncoderDecoderModel.from_pretrained("google/roberta2roberta_L-24_discofuse") +>>> tokenizer = AutoTokenizer.from_pretrained("google/roberta2roberta_L-24_discofuse") + +>>> input_ids = tokenizer( +... "This is the first sentence. This is the second sentence.", add_special_tokens=False, return_tensors="pt" +... 
).input_ids + +>>> outputs = sentence_fuser.generate(input_ids) + +>>> print(tokenizer.decode(outputs[0])) +``` + +チップ: + +- [`BertGenerationEncoder`] と [`BertGenerationDecoder`] は、 + [`EncoderDecoder`] と組み合わせます。 +- 要約、文の分割、文の融合、および翻訳の場合、入力に特別なトークンは必要ありません。 + したがって、入力の末尾に EOS トークンを追加しないでください。 + +このモデルは、[patrickvonplaten](https://huggingface.co/patrickvonplaten) によって提供されました。元のコードは次のとおりです +[ここ](https://tfhub.dev/s?module-type=text-generation&subtype=module,placeholder) があります。 + +## BertGenerationConfig + +[[autodoc]] BertGenerationConfig + +## BertGenerationTokenizer + +[[autodoc]] BertGenerationTokenizer + - save_vocabulary + +## BertGenerationEncoder + +[[autodoc]] BertGenerationEncoder + - forward + +## BertGenerationDecoder + +[[autodoc]] BertGenerationDecoder + - forward diff --git a/docs/source/ja/model_doc/bert-japanese.md b/docs/source/ja/model_doc/bert-japanese.md new file mode 100644 index 00000000000000..86cce741aac6cb --- /dev/null +++ b/docs/source/ja/model_doc/bert-japanese.md @@ -0,0 +1,81 @@ + + +# BertJapanese + +## Overview + +BERT モデルは日本語テキストでトレーニングされました。 + +2 つの異なるトークン化方法を備えたモデルがあります。 + +- MeCab と WordPiece を使用してトークン化します。これには、[MeCab](https://taku910.github.io/mecab/) のラッパーである [fugashi](https://github.com/polm/fugashi) という追加の依存関係が必要です。 +- 文字にトークン化します。 + +*MecabTokenizer* を使用するには、`pip installTransformers["ja"]` (または、インストールする場合は `pip install -e .["ja"]`) する必要があります。 +ソースから)依存関係をインストールします。 + +[cl-tohakuリポジトリの詳細](https://github.com/cl-tohaku/bert-japanese)を参照してください。 + +MeCab および WordPiece トークン化でモデルを使用する例: + + +```python +>>> import torch +>>> from transformers import AutoModel, AutoTokenizer + +>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese") +>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese") + +>>> ## Input Japanese Text +>>> line = "吾輩は猫である。" + +>>> inputs = tokenizer(line, return_tensors="pt") + +>>> print(tokenizer.decode(inputs["input_ids"][0])) +[CLS] 吾輩 は 猫 で ある 。 [SEP] + +>>> outputs = bertjapanese(**inputs) +``` + +文字トークン化を使用したモデルの使用例: + +```python +>>> bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese-char") +>>> tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char") + +>>> ## Input Japanese Text +>>> line = "吾輩は猫である。" + +>>> inputs = tokenizer(line, return_tensors="pt") + +>>> print(tokenizer.decode(inputs["input_ids"][0])) +[CLS] 吾 輩 は 猫 で あ る 。 [SEP] + +>>> outputs = bertjapanese(**inputs) +``` + + + +- この実装はトークン化方法を除いて BERT と同じです。その他の使用例については、[BERT のドキュメント](bert) を参照してください。 + + + +このモデルは[cl-tohaku](https://huggingface.co/cl-tohaku)から提供されました。 + +## BertJapaneseTokenizer + +[[autodoc]] BertJapaneseTokenizer diff --git a/docs/source/ja/model_doc/bert.md b/docs/source/ja/model_doc/bert.md new file mode 100644 index 00000000000000..d34df9c0d5d60e --- /dev/null +++ b/docs/source/ja/model_doc/bert.md @@ -0,0 +1,312 @@ + + +# BERT + +
+ +## Overview + +BERT モデルは、Jacob Devlin、Ming-Wei Chang、Kenton Lee、Kristina Toutanova によって [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) で提案されました。それは +マスクされた言語モデリング目標と次の文の組み合わせを使用して事前トレーニングされた双方向トランスフォーマー +Toronto Book Corpus と Wikipedia からなる大規模なコーパスでの予測。 + +論文の要約は次のとおりです。 + +*BERT と呼ばれる新しい言語表現モデルを導入します。これは Bidirectional Encoder Representations の略です +トランスフォーマーより。最近の言語表現モデルとは異なり、BERT は深い双方向性を事前にトレーニングするように設計されています。 +すべてのレイヤーの左と右の両方のコンテキストを共同で条件付けすることにより、ラベルのないテキストから表現します。結果として、 +事前トレーニングされた BERT モデルは、出力層を 1 つ追加するだけで微調整して、最先端のモデルを作成できます。 +実質的なタスク固有のものを必要とせず、質問応答や言語推論などの幅広いタスクに対応 +アーキテクチャの変更。* + +*BERT は概念的にはシンプルですが、経験的に強力です。 11 の自然な要素に関する新しい最先端の結果が得られます。 +言語処理タスク(GLUE スコアを 80.5% に押し上げる(7.7% ポイントの絶対改善)、MultiNLI を含む) +精度は 86.7% (絶対値 4.6% 向上)、SQuAD v1.1 質問応答テスト F1 は 93.2 (絶対値 1.5 ポイント) +改善) および SQuAD v2.0 テスト F1 から 83.1 (5.1 ポイントの絶対改善)。* + +## Usage tips + +- BERT は絶対位置埋め込みを備えたモデルであるため、通常は入力を右側にパディングすることをお勧めします。 + 左。 +- BERT は、マスク言語モデリング (MLM) および次の文予測 (NSP) の目標を使用してトレーニングされました。それは + マスクされたトークンの予測や NLU では一般に効率的ですが、テキスト生成には最適ではありません。 +- ランダム マスキングを使用して入力を破壊します。より正確には、事前トレーニング中に、トークンの指定された割合 (通常は 15%) が次によってマスクされます。 + + * 確率0.8の特別なマスクトークン + * 確率 0.1 でマスクされたトークンとは異なるランダムなトークン + * 確率 0.1 の同じトークン + +- モデルは元の文を予測する必要がありますが、2 番目の目的があります。入力は 2 つの文 A と B (間に分離トークンあり) です。確率 50% では、文はコーパス内で連続していますが、残りの 50% では関連性がありません。モデルは、文が連続しているかどうかを予測する必要があります。 + + + +このモデルは [thomwolf](https://huggingface.co/thomwolf) によって提供されました。元のコードは [こちら](https://github.com/google-research/bert) にあります。 + +## Resources + +BERT を始めるのに役立つ公式 Hugging Face およびコミュニティ (🌎 で示される) リソースのリスト。ここに含めるリソースの送信に興味がある場合は、お気軽にプル リクエストを開いてください。審査させていただきます。リソースは、既存のリソースを複製するのではなく、何か新しいものを示すことが理想的です。 + + + +- に関するブログ投稿 [別の言語での BERT テキスト分類](https://www.philschmid.de/bert-text-classification-in-a-different-language)。 +- [マルチラベル テキスト分類のための BERT (およびその友人) の微調整](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb) のノートブック. 
+- 方法に関するノートブック [PyTorch を使用したマルチラベル分類のための BERT の微調整](https://colab.research.google.com/github/abhmishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb)。 +- 方法に関するノートブック [要約のために BERT を使用して EncoderDecoder モデルをウォームスタートする](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb)。 +- [`BertForSequenceClassification`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb)。 +- [`TFBertForSequenceClassification`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/text-classification) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb)。 +- [`FlaxBertForSequenceClassification`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/flax/text-classification) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification_flax.ipynb)。 +- [テキスト分類タスクガイド](../tasks/sequence_classification) + + + +- [Hugging Face Transformers with Keras: Fine-tune a non-English BERT for Named Entity Recognition](https://www.philschmid.de/huggingface-transformers-keras-tf) の使用方法に関するブログ投稿。 +- 各単語の最初の単語部分のみを使用した [固有表現認識のための BERT の微調整](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/Custom_Named_Entity_Recognition_with_BERT_only_first_wordpiece.ipynb) のノートブックトークン化中の単語ラベル内。単語のラベルをすべての単語部分に伝播するには、代わりにノートブックのこの [バージョン](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/BERT/Custom_Named_Entity_Recognition_with_BERT.ipynb) を参照してください。 +- [`BertForTokenClassification`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/token-classification) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb)。 +- [`TFBertForTokenClassification`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/token-classification) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb)。 +- [`FlaxBertForTokenClassification`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/flax/token-classification) によってサポートされています。 +- [トークン分類](https://huggingface.co/course/chapter7/2?fw=pt) 🤗 ハグフェイスコースの章。 +- [トークン分類タスクガイド](../tasks/token_classification) + + + +- [`BertForMaskedLM`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#robertabertdistilbert-and-masked-language-modeling) でサポートされており、 [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb)。 +- [`TFBertForMaskedLM`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/lang-modeling#run_mlmpy) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb)。 +- [`FlaxBertForMaskedLM`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/flax/language-modeling#masked-language-modeling) および [ノートブック]( https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/masked_language_modeling_flax.ipynb)。 +- 
[マスクされた言語モデリング](https://huggingface.co/course/chapter7/3?fw=pt) 🤗 顔ハグ コースの章。 +- [マスクされた言語モデリング タスク ガイド](../tasks/masked_lang_modeling) + + + + +- [`BertForQuestionAnswering`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb)。 +- [`TFBertForQuestionAnswering`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/question-answering) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb)。 +- [`FlaxBertForQuestionAnswering`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/flax/question-answering) でサポートされています。 +- [質問回答](https://huggingface.co/course/chapter7/7?fw=pt) 🤗 ハグフェイスコースの章。 +- [質問回答タスク ガイド](../tasks/question_answering) + +**複数の選択肢** +- [`BertForMultipleChoice`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/multiple-choice) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb)。 +- [`TFBertForMultipleChoice`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/multiple-choice) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice-tf.ipynb)。 +- [多肢選択タスク ガイド](../tasks/multiple_choice) + +⚡️ **推論** +- 方法に関するブログ投稿 [Hugging Face Transformers と AWS Inferentia を使用して BERT 推論を高速化する](https://huggingface.co/blog/bert-inferentia-sagemaker)。 +- 方法に関するブログ投稿 [GPU 上の DeepSpeed-Inference を使用して BERT 推論を高速化する](https://www.philschmid.de/bert-deepspeed-inference)。 + +⚙️ **事前トレーニング** +- [Hugging Face Transformers と Habana Gaudi を使用した BERT の事前トレーニング に関するブログ投稿](https://www.philschmid.de/pre-training-bert-habana)。 + +🚀 **デプロイ** +- 方法に関するブログ投稿 [ハグフェイス最適化でトランスフォーマーを ONNX に変換する](https://www.philschmid.de/convert-transformers-to-onnx)。 +- 方法に関するブログ投稿 [AWS 上の Habana Gaudi を使用したハグ顔トランスフォーマーのための深層学習環境のセットアップ](https://www.philschmid.de/getting-started-habana-gaudi#conclusion)。 +- に関するブログ投稿 [Hugging Face Transformers、Amazon SageMaker、および Terraform モジュールを使用した自動スケーリング BERT](https://www.philschmid.de/terraform-huggingface-amazon-sagemaker-advanced)。 +- に関するブログ投稿 [HuggingFace、AWS Lambda、Docker を使用したサーバーレス BERT](https://www.philschmid.de/serverless-bert-with-huggingface-aws-lambda-docker)。 +- に関するブログ投稿 [Amazon SageMaker と Training Compiler を使用した Hugging Face Transformers BERT 微調整](https://www.philschmid.de/huggingface-amazon-sagemaker-training-compiler)。 +- に関するブログ投稿 [Transformers と Amazon SageMaker を使用した BERT のタスク固有の知識の蒸留](https://www.philschmid.de/knowledge-distillation-bert-transformers) + +## BertConfig + +[[autodoc]] BertConfig + - all + +## BertTokenizer + +[[autodoc]] BertTokenizer + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - save_vocabulary + + + + +## BertTokenizerFast + +[[autodoc]] BertTokenizerFast + + + + +## TFBertTokenizer + +[[autodoc]] TFBertTokenizer + + + + +## Bert specific outputs + +[[autodoc]] models.bert.modeling_bert.BertForPreTrainingOutput + +[[autodoc]] models.bert.modeling_tf_bert.TFBertForPreTrainingOutput + +[[autodoc]] models.bert.modeling_flax_bert.FlaxBertForPreTrainingOutput + + + + +## BertModel + +[[autodoc]] BertModel + - forward + +## BertForPreTraining + +[[autodoc]] BertForPreTraining + - forward + +## 
BertLMHeadModel + +[[autodoc]] BertLMHeadModel + - forward + +## BertForMaskedLM + +[[autodoc]] BertForMaskedLM + - forward + +## BertForNextSentencePrediction + +[[autodoc]] BertForNextSentencePrediction + - forward + +## BertForSequenceClassification + +[[autodoc]] BertForSequenceClassification + - forward + +## BertForMultipleChoice + +[[autodoc]] BertForMultipleChoice + - forward + +## BertForTokenClassification + +[[autodoc]] BertForTokenClassification + - forward + +## BertForQuestionAnswering + +[[autodoc]] BertForQuestionAnswering + - forward + + + + +## TFBertModel + +[[autodoc]] TFBertModel + - call + +## TFBertForPreTraining + +[[autodoc]] TFBertForPreTraining + - call + +## TFBertModelLMHeadModel + +[[autodoc]] TFBertLMHeadModel + - call + +## TFBertForMaskedLM + +[[autodoc]] TFBertForMaskedLM + - call + +## TFBertForNextSentencePrediction + +[[autodoc]] TFBertForNextSentencePrediction + - call + +## TFBertForSequenceClassification + +[[autodoc]] TFBertForSequenceClassification + - call + +## TFBertForMultipleChoice + +[[autodoc]] TFBertForMultipleChoice + - call + +## TFBertForTokenClassification + +[[autodoc]] TFBertForTokenClassification + - call + +## TFBertForQuestionAnswering + +[[autodoc]] TFBertForQuestionAnswering + - call + + + + + +## FlaxBertModel + +[[autodoc]] FlaxBertModel + - __call__ + +## FlaxBertForPreTraining + +[[autodoc]] FlaxBertForPreTraining + - __call__ + +## FlaxBertForCausalLM + +[[autodoc]] FlaxBertForCausalLM + - __call__ + +## FlaxBertForMaskedLM + +[[autodoc]] FlaxBertForMaskedLM + - __call__ + +## FlaxBertForNextSentencePrediction + +[[autodoc]] FlaxBertForNextSentencePrediction + - __call__ + +## FlaxBertForSequenceClassification + +[[autodoc]] FlaxBertForSequenceClassification + - __call__ + +## FlaxBertForMultipleChoice + +[[autodoc]] FlaxBertForMultipleChoice + - __call__ + +## FlaxBertForTokenClassification + +[[autodoc]] FlaxBertForTokenClassification + - __call__ + +## FlaxBertForQuestionAnswering + +[[autodoc]] FlaxBertForQuestionAnswering + - __call__ + + + \ No newline at end of file diff --git a/docs/source/ja/model_doc/bertweet.md b/docs/source/ja/model_doc/bertweet.md new file mode 100644 index 00000000000000..3a5dddbf04cc58 --- /dev/null +++ b/docs/source/ja/model_doc/bertweet.md @@ -0,0 +1,68 @@ + + +# BERTweet + +## Overview + +BERTweet モデルは、Dat Quoc Nguyen、Thanh Vu によって [BERTweet: A pre-trained language model for English Tweets](https://www.aclweb.org/anthology/2020.emnlp-demos.2.pdf) で提案されました。アン・トゥアン・グエンさん。 + +論文の要約は次のとおりです。 + +*私たちは、英語ツイート用に初めて公開された大規模な事前トレーニング済み言語モデルである BERTweet を紹介します。私たちのBERTweetは、 +BERT ベースと同じアーキテクチャ (Devlin et al., 2019) は、RoBERTa 事前トレーニング手順 (Liu et al.) を使用してトレーニングされます。 +al.、2019)。実験では、BERTweet が強力なベースラインである RoBERTa ベースおよび XLM-R ベースを上回るパフォーマンスを示すことが示されています (Conneau et al., +2020)、3 つのツイート NLP タスクにおいて、以前の最先端モデルよりも優れたパフォーマンス結果が得られました。 +品詞タグ付け、固有表現認識およびテキスト分類。* + +## Usage example + +```python +>>> import torch +>>> from transformers import AutoModel, AutoTokenizer + +>>> bertweet = AutoModel.from_pretrained("vinai/bertweet-base") + +>>> # For transformers v4.x+: +>>> tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False) + +>>> # For transformers v3.x: +>>> # tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base") + +>>> # INPUT TWEET IS ALREADY NORMALIZED! +>>> line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:" + +>>> input_ids = torch.tensor([tokenizer.encode(line)]) + +>>> with torch.no_grad(): +... 
features = bertweet(input_ids) # Models outputs are now tuples + +>>> # With TensorFlow 2.0+: +>>> # from transformers import TFAutoModel +>>> # bertweet = TFAutoModel.from_pretrained("vinai/bertweet-base") +``` + + +この実装は、トークン化方法を除いて BERT と同じです。詳細については、[BERT ドキュメント](bert) を参照してください。 +API リファレンス情報。 + + + +このモデルは [dqnguyen](https://huggingface.co/dqnguyen) によって提供されました。元のコードは [ここ](https://github.com/VinAIResearch/BERTweet) にあります。 + +## BertweetTokenizer + +[[autodoc]] BertweetTokenizer diff --git a/docs/source/ja/model_doc/big_bird.md b/docs/source/ja/model_doc/big_bird.md new file mode 100644 index 00000000000000..4c0dabbebb46d4 --- /dev/null +++ b/docs/source/ja/model_doc/big_bird.md @@ -0,0 +1,176 @@ + + +# BigBird + +## Overview + +BigBird モデルは、[Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) で提案されました。 +ザヒール、マンジルとグルガネシュ、グルとダベイ、クマール・アヴィナヴァとエインズリー、ジョシュアとアルベルティ、クリスとオンタノン、 +サンティアゴとファム、フィリップとラブラ、アニルードとワン、キーファンとヤン、リーなど。 BigBird は注目度が低い +BERT などの Transformer ベースのモデルをさらに長いシーケンスに拡張する、Transformer ベースのモデル。まばらに加えて +アテンションと同様に、BigBird は入力シーケンスにランダム アテンションだけでなくグローバル アテンションも適用します。理論的には、 +まばらで全体的でランダムな注意を適用すると、完全な注意に近づくことが示されていますが、 +長いシーケンスでは計算効率が大幅に向上します。より長いコンテキストを処理できる機能の結果として、 +BigBird は、質問応答や +BERT または RoBERTa と比較した要約。 + +論文の要約は次のとおりです。 + +*BERT などのトランスフォーマーベースのモデルは、NLP で最も成功した深層学習モデルの 1 つです。 +残念ながら、それらの中核的な制限の 1 つは、シーケンスに対する二次依存性 (主にメモリに関する) です。 +完全な注意メカニズムによる長さです。これを解決するために、BigBird は、まばらな注意メカニズムを提案します。 +この二次依存関係を線形に削減します。 BigBird がシーケンス関数の汎用近似器であることを示します。 +チューリングは完全であるため、二次完全注意モデルのこれらの特性が保存されます。途中、私たちの +理論分析により、O(1) 個のグローバル トークン (CLS など) を持つ利点の一部が明らかになり、 +スパース注意メカニズムの一部としてのシーケンス。提案されたスパース アテンションは、次の長さのシーケンスを処理できます。 +同様のハードウェアを使用して以前に可能であったものの 8 倍。より長いコンテキストを処理できる機能の結果として、 +BigBird は、質問応答や要約などのさまざまな NLP タスクのパフォーマンスを大幅に向上させます。私達も +ゲノミクスデータへの新しいアプリケーションを提案します。* + +チップ: + +- BigBird の注意がどのように機能するかについての詳細な説明については、[このブログ投稿](https://huggingface.co/blog/big-bird) を参照してください。 +- BigBird には、**original_full** と **block_sparse** の 2 つの実装が付属しています。シーケンス長が 1024 未満の場合、次を使用します。 + **block_sparse** を使用してもメリットがないため、**original_full** を使用することをお勧めします。 +- コードは現在、3 ブロックと 2 グローバル ブロックのウィンドウ サイズを使用しています。 +- シーケンスの長さはブロック サイズで割り切れる必要があります。 +- 現在の実装では **ITC** のみがサポートされています。 +- 現在の実装では **num_random_blocks = 0** はサポートされていません +- BigBird は絶対位置埋め込みを備えたモデルであるため、通常は入力を右側にパディングすることをお勧めします。 + 左。 + + このモデルは、[vasudevgupta](https://huggingface.co/vasudevgupta) によって提供されました。元のコードが見つかる +[こちら](https://github.com/google-research/bigbird)。 + +## ドキュメント リソース + +- [テキスト分類タスクガイド](../tasks/sequence_classification) +- [トークン分類タスクガイド](../tasks/token_classification) +- [質問回答タスク ガイド](../tasks/question_answering) +- [因果言語モデリング タスク ガイド](../tasks/language_modeling) +- [マスクされた言語モデリング タスク ガイド](../tasks/masked_lang_modeling) +- [多肢選択タスク ガイド](../tasks/multiple_choice) + +## BigBirdConfig + +[[autodoc]] BigBirdConfig + +## BigBirdTokenizer + +[[autodoc]] BigBirdTokenizer + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - save_vocabulary + +## BigBirdTokenizerFast + +[[autodoc]] BigBirdTokenizerFast + +## BigBird specific outputs + +[[autodoc]] models.big_bird.modeling_big_bird.BigBirdForPreTrainingOutput + + + + +## BigBirdModel + +[[autodoc]] BigBirdModel + - forward + +## BigBirdForPreTraining + +[[autodoc]] BigBirdForPreTraining + - forward + +## BigBirdForCausalLM + +[[autodoc]] BigBirdForCausalLM + - forward + +## BigBirdForMaskedLM + +[[autodoc]] BigBirdForMaskedLM + - forward + +## BigBirdForSequenceClassification + +[[autodoc]] BigBirdForSequenceClassification + - 
forward + +## BigBirdForMultipleChoice + +[[autodoc]] BigBirdForMultipleChoice + - forward + +## BigBirdForTokenClassification + +[[autodoc]] BigBirdForTokenClassification + - forward + +## BigBirdForQuestionAnswering + +[[autodoc]] BigBirdForQuestionAnswering + - forward + + + + +## FlaxBigBirdModel + +[[autodoc]] FlaxBigBirdModel + - __call__ + +## FlaxBigBirdForPreTraining + +[[autodoc]] FlaxBigBirdForPreTraining + - __call__ + +## FlaxBigBirdForCausalLM + +[[autodoc]] FlaxBigBirdForCausalLM + - __call__ + +## FlaxBigBirdForMaskedLM + +[[autodoc]] FlaxBigBirdForMaskedLM + - __call__ + +## FlaxBigBirdForSequenceClassification + +[[autodoc]] FlaxBigBirdForSequenceClassification + - __call__ + +## FlaxBigBirdForMultipleChoice + +[[autodoc]] FlaxBigBirdForMultipleChoice + - __call__ + +## FlaxBigBirdForTokenClassification + +[[autodoc]] FlaxBigBirdForTokenClassification + - __call__ + +## FlaxBigBirdForQuestionAnswering + +[[autodoc]] FlaxBigBirdForQuestionAnswering + - __call__ + + + + diff --git a/docs/source/ja/model_doc/bigbird_pegasus.md b/docs/source/ja/model_doc/bigbird_pegasus.md new file mode 100644 index 00000000000000..e0132b4b5f86b5 --- /dev/null +++ b/docs/source/ja/model_doc/bigbird_pegasus.md @@ -0,0 +1,95 @@ + + +# BigBirdPegasus + +## Overview + +BigBird モデルは、[Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) で提案されました。 +ザヒール、マンジルとグルガネシュ、グルとダベイ、クマール・アヴィナヴァとエインズリー、ジョシュアとアルベルティ、クリスとオンタノン、 +サンティアゴとファム、フィリップとラブラ、アニルードとワン、キーファンとヤン、リーなど。 BigBird は注目度が低い +BERT などの Transformer ベースのモデルをさらに長いシーケンスに拡張する、Transformer ベースのモデル。まばらに加えて +アテンションと同様に、BigBird は入力シーケンスにランダム アテンションだけでなくグローバル アテンションも適用します。理論的には、 +まばらで全体的でランダムな注意を適用すると、完全な注意に近づくことが示されていますが、 +長いシーケンスでは計算効率が大幅に向上します。より長いコンテキストを処理できる機能の結果として、 +BigBird は、質問応答や +BERT または RoBERTa と比較した要約。 + +論文の要約は次のとおりです。 + +*BERT などのトランスフォーマーベースのモデルは、NLP で最も成功した深層学習モデルの 1 つです。 +残念ながら、それらの中核的な制限の 1 つは、シーケンスに対する二次依存性 (主にメモリに関する) です。 +完全な注意メカニズムによる長さです。これを解決するために、BigBird は、まばらな注意メカニズムを提案します。 +この二次依存関係を線形に削減します。 BigBird がシーケンス関数の汎用近似器であることを示します。 +チューリングは完全であるため、二次完全注意モデルのこれらの特性が保存されます。途中、私たちの +理論分析により、O(1) 個のグローバル トークン (CLS など) を持つ利点の一部が明らかになり、 +スパース注意メカニズムの一部としてのシーケンス。提案されたスパース アテンションは、次の長さのシーケンスを処理できます。 +同様のハードウェアを使用して以前に可能であったものの 8 倍。より長いコンテキストを処理できる機能の結果として、 +BigBird は、質問応答や要約などのさまざまな NLP タスクのパフォーマンスを大幅に向上させます。私達も +ゲノミクスデータへの新しいアプリケーションを提案します。* + +## Usage tips + +- BigBird の注意がどのように機能するかについての詳細な説明については、[このブログ投稿](https://huggingface.co/blog/big-bird) を参照してください。 +- BigBird には、**original_full** と **block_sparse** の 2 つの実装が付属しています。シーケンス長が 1024 未満の場合、次を使用します。 + **block_sparse** を使用してもメリットがないため、**original_full** を使用することをお勧めします。 +- コードは現在、3 ブロックと 2 グローバル ブロックのウィンドウ サイズを使用しています。 +- シーケンスの長さはブロック サイズで割り切れる必要があります。 +- 現在の実装では **ITC** のみがサポートされています。 +- 現在の実装では **num_random_blocks = 0** はサポートされていません。 +- BigBirdPegasus は [PegasusTokenizer](https://github.com/huggingface/transformers/blob/main/src/transformers/models/pegasus/tokenization_pegasus.py) を使用します。 +- BigBird は絶対位置埋め込みを備えたモデルであるため、通常は入力を右側にパディングすることをお勧めします。 + 左。 + +元のコードは [こちら](https://github.com/google-research/bigbird) にあります。 + +## ドキュメント リソース + +- [テキスト分類タスクガイド](../tasks/sequence_classification) +- [質問回答タスク ガイド](../tasks/question_answering) +- [因果言語モデリング タスク ガイド](../tasks/language_modeling) +- [翻訳タスクガイド](../tasks/translation) +- [要約タスクガイド](../tasks/summarization) + +## BigBirdPegasusConfig + +[[autodoc]] BigBirdPegasusConfig + - all + +## BigBirdPegasusModel + +[[autodoc]] BigBirdPegasusModel + - forward + +## BigBirdPegasusForConditionalGeneration + +[[autodoc]] 
BigBirdPegasusForConditionalGeneration + - forward + +## BigBirdPegasusForSequenceClassification + +[[autodoc]] BigBirdPegasusForSequenceClassification + - forward + +## BigBirdPegasusForQuestionAnswering + +[[autodoc]] BigBirdPegasusForQuestionAnswering + - forward + +## BigBirdPegasusForCausalLM + +[[autodoc]] BigBirdPegasusForCausalLM + - forward diff --git a/docs/source/ja/model_doc/biogpt.md b/docs/source/ja/model_doc/biogpt.md new file mode 100644 index 00000000000000..0634062a6b7214 --- /dev/null +++ b/docs/source/ja/model_doc/biogpt.md @@ -0,0 +1,73 @@ + + +# BioGPT + +## Overview + +BioGPT モデルは、[BioGPT: generative pre-trained transformer for biomedical text generation and mining](https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac409/6713511?guestAccessKey=a66d9b5d-4f83-4017-bb52-405815c907b9) by Renqian Luo、Liai Sun、Yingce Xia、 Tao Qin、Sheng Zhang、Hoifung Poon、Tie-Yan Liu。 BioGPT は、生物医学テキストの生成とマイニングのための、ドメイン固有の生成事前トレーニング済み Transformer 言語モデルです。 BioGPT は、Transformer 言語モデルのバックボーンに従い、1,500 万の PubMed 抄録で最初から事前トレーニングされています。 + +論文の要約は次のとおりです。 + +*事前トレーニング済み言語モデルは、一般的な自然言語領域での大きな成功に触発されて、生物医学領域でますます注目を集めています。一般言語ドメインの事前トレーニング済み言語モデルの 2 つの主なブランチ、つまり BERT (およびそのバリアント) と GPT (およびそのバリアント) のうち、1 つ目は BioBERT や PubMedBERT などの生物医学ドメインで広く研究されています。これらはさまざまな下流の生物医学的タスクで大きな成功を収めていますが、生成能力の欠如により応用範囲が制限されています。この論文では、大規模な生物医学文献で事前トレーニングされたドメイン固有の生成 Transformer 言語モデルである BioGPT を提案します。私たちは 6 つの生物医学的自然言語処理タスクで BioGPT を評価し、ほとんどのタスクで私たちのモデルが以前のモデルよりも優れていることを実証しました。特に、BC5CDR、KD-DTI、DDI のエンドツーエンド関係抽出タスクではそれぞれ 44.98%、38.42%、40.76% の F1 スコアを獲得し、PubMedQA では 78.2% の精度を獲得し、新記録を樹立しました。テキスト生成に関する私たちのケーススタディは、生物医学文献における BioGPT の利点をさらに実証し、生物医学用語の流暢な説明を生成します。* + +## Usage tips + +- BioGPT は絶対位置埋め込みを備えたモデルであるため、通常は入力を左側ではなく右側にパディングすることをお勧めします。 +- BioGPT は因果言語モデリング (CLM) 目的でトレーニングされているため、シーケンス内の次のトークンを予測するのに強力です。 run_generation.py サンプル スクリプトで確認できるように、この機能を利用すると、BioGPT は構文的に一貫したテキストを生成できます。 +- モデルは、以前に計算されたキーと値のアテンション ペアである`past_key_values`(PyTorch の場合) を入力として受け取ることができます。この (past_key_values または past) 値を使用すると、モデルがテキスト生成のコンテキストで事前に計算された値を再計算できなくなります。 PyTorch の使用法の詳細については、BioGptForCausalLM.forward() メソッドの past_key_values 引数を参照してください。 + +このモデルは、[kamalkraj](https://huggingface.co/kamalkraj) によって提供されました。元のコードは [ここ](https://github.com/microsoft/BioGPT) にあります。 + +## Documentation resources + +- [因果言語モデリング タスク ガイド](../tasks/language_modeling) + +## BioGptConfig + +[[autodoc]] BioGptConfig + + +## BioGptTokenizer + +[[autodoc]] BioGptTokenizer + - save_vocabulary + + +## BioGptModel + +[[autodoc]] BioGptModel + - forward + + +## BioGptForCausalLM + +[[autodoc]] BioGptForCausalLM + - forward + + +## BioGptForTokenClassification + +[[autodoc]] BioGptForTokenClassification + - forward + + +## BioGptForSequenceClassification + +[[autodoc]] BioGptForSequenceClassification + - forward + + \ No newline at end of file diff --git a/docs/source/ja/model_doc/bit.md b/docs/source/ja/model_doc/bit.md new file mode 100644 index 00000000000000..76b24fa64470a2 --- /dev/null +++ b/docs/source/ja/model_doc/bit.md @@ -0,0 +1,65 @@ + + +# Big Transfer (BiT) + +## Overview + +BiT モデルは、Alexander Kolesnikov、Lucas Beyer、Xiaohua Zhai、Joan Puigcerver、Jessica Yung、Sylvain Gelly によって [Big Transfer (BiT): General Visual Representation Learning](https://arxiv.org/abs/1912.11370) で提案されました。ニール・ホールズビー。 +BiT は、[ResNet](resnet) のようなアーキテクチャ (具体的には ResNetv2) の事前トレーニングをスケールアップするための簡単なレシピです。この方法により、転移学習が大幅に改善されます。 + +論文の要約は次のとおりです。 + +*事前トレーニングされた表現の転送により、サンプル効率が向上し、視覚用のディープ ニューラル ネットワークをトレーニングする際のハイパーパラメーター調整が簡素化されます。大規模な教師ありデータセットでの事前トレーニングと、ターゲット 
タスクでのモデルの微調整のパラダイムを再検討します。私たちは事前トレーニングをスケールアップし、Big Transfer (BiT) と呼ぶシンプルなレシピを提案します。いくつかの慎重に選択されたコンポーネントを組み合わせ、シンプルなヒューリスティックを使用して転送することにより、20 を超えるデータセットで優れたパフォーマンスを実現します。 BiT は、クラスごとに 1 つのサンプルから合計 100 万のサンプルまで、驚くほど広範囲のデータ領域にわたって良好にパフォーマンスを発揮します。 BiT は、ILSVRC-2012 で 87.5%、CIFAR-10 で 99.4%、19 タスクの Visual Task Adaptation Benchmark (VTAB) で 76.3% のトップ 1 精度を達成しました。小規模なデータセットでは、BiT は ILSVRC-2012 (クラスあたり 10 例) で 76.8%、CIFAR-10 (クラスあたり 10 例) で 97.0% を達成しました。高い転写性能を実現する主要成分を詳細に分析※。 + +## Usage tips + +- BiT モデルは、アーキテクチャの点で ResNetv2 と同等ですが、次の点が異なります: 1) すべてのバッチ正規化層が [グループ正規化](https://arxiv.org/abs/1803.08494) に置き換えられます。 +2) [重みの標準化](https://arxiv.org/abs/1903.10520) は畳み込み層に使用されます。著者らは、両方の組み合わせが大きなバッチサイズでのトレーニングに役立ち、重要な効果があることを示しています。 +転移学習への影響。 + +このモデルは、[nielsr](https://huggingface.co/nielsr) によって提供されました。 +元のコードは [こちら](https://github.com/google-research/big_transfer) にあります。 + +## Resources + +BiT を始めるのに役立つ公式 Hugging Face およびコミュニティ (🌎 で示されている) リソースのリスト。 + + + +- [`BitForImageClassification`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb)。 +- 参照: [画像分類タスク ガイド](../tasks/image_classification) + +ここに含めるリソースの送信に興味がある場合は、お気軽にプル リクエストを開いてください。審査させていただきます。リソースは、既存のリソースを複製するのではなく、何か新しいものを示すことが理想的です。 + +## BitConfig + +[[autodoc]] BitConfig + +## BitImageProcessor + +[[autodoc]] BitImageProcessor + - preprocess + +## BitModel + +[[autodoc]] BitModel + - forward + +## BitForImageClassification + +[[autodoc]] BitForImageClassification + - forward diff --git a/docs/source/ja/model_doc/blenderbot-small.md b/docs/source/ja/model_doc/blenderbot-small.md new file mode 100644 index 00000000000000..ecb9c1174b285b --- /dev/null +++ b/docs/source/ja/model_doc/blenderbot-small.md @@ -0,0 +1,110 @@ + + +# Blenderbot Small + +[`BlenderbotSmallModel`] と +[`BlenderbotSmallForConditionalGeneration`] はチェックポイントと組み合わせてのみ使用されます +[facebook/blenderbot-90M](https://huggingface.co/facebook/blenderbot-90M)。より大規模な Blenderbot チェックポイントは、 +代わりに [`BlenderbotModel`] とともに使用してください。 +[`BlenderbotForConditionalGeneration`] + +## Overview + +Blender チャットボット モデルは、[Recipes for building an open-domain chatbot](https://arxiv.org/pdf/2004.13637.pdf) Stephen Roller、Emily Dinan、Naman Goyal、Da Ju、Mary Williamson、yinghan Liu、で提案されました。 +ジン・シュー、マイル・オット、カート・シャスター、エリック・M・スミス、Y-ラン・ブーロー、ジェイソン・ウェストン、2020年4月30日。 + +論文の要旨は次のとおりです。 + +*オープンドメインのチャットボットの構築は、機械学習研究にとって難しい分野です。これまでの研究では次のことが示されていますが、 +ニューラル モデルをパラメーターの数とトレーニング対象のデータのサイズでスケーリングすると、結果が向上します。 +高性能のチャットボットには他の要素も重要であることを示します。良い会話には多くのことが必要です +会話の専門家がシームレスに融合するスキル: 魅力的な話のポイントを提供し、話を聞く +一貫した態度を維持しながら、知識、共感、個性を適切に表現する +ペルソナ。適切なトレーニング データと選択が与えられた場合、大規模モデルがこれらのスキルを学習できることを示します。 +世代戦略。 90M、2.7B、9.4B パラメーター モデルを使用してこれらのレシピのバリアントを構築し、モデルを作成します。 +コードは公開されています。人間による評価では、当社の最良のモデルが既存のアプローチよりも優れていることがマルチターンで示されています +魅力と人間性の測定という観点からの対話。次に、分析によってこの作業の限界について説明します。 +弊社機種の故障事例* + +チップ: + +- Blenderbot Small は絶対位置埋め込みを備えたモデルなので、通常は入力を右側にパディングすることをお勧めします。 + 左。 + +このモデルは、[patrickvonplaten](https://huggingface.co/patrickvonplaten) によって提供されました。著者のコードは次のとおりです +[ここ](https://github.com/facebookresearch/ParlAI) をご覧ください。 + +## Documentation resources + +- [因果言語モデリング タスク ガイド](../tasks/language_modeling) +- [翻訳タスクガイド](../tasks/translation) +- [要約タスクガイド](../tasks/summarization) + +## BlenderbotSmallConfig + +[[autodoc]] BlenderbotSmallConfig + +## BlenderbotSmallTokenizer + +[[autodoc]] BlenderbotSmallTokenizer + - 
build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - save_vocabulary + +## BlenderbotSmallTokenizerFast + +[[autodoc]] BlenderbotSmallTokenizerFast + +## BlenderbotSmallModel + +[[autodoc]] BlenderbotSmallModel + - forward + +## BlenderbotSmallForConditionalGeneration + +[[autodoc]] BlenderbotSmallForConditionalGeneration + - forward + +## BlenderbotSmallForCausalLM + +[[autodoc]] BlenderbotSmallForCausalLM + - forward + +## TFBlenderbotSmallModel + +[[autodoc]] TFBlenderbotSmallModel + - call + +## TFBlenderbotSmallForConditionalGeneration + +[[autodoc]] TFBlenderbotSmallForConditionalGeneration + - call + +## FlaxBlenderbotSmallModel + +[[autodoc]] FlaxBlenderbotSmallModel + - __call__ + - encode + - decode + +## FlaxBlenderbotForConditionalGeneration + +[[autodoc]] FlaxBlenderbotSmallForConditionalGeneration + - __call__ + - encode + - decode diff --git a/docs/source/ja/model_doc/blenderbot.md b/docs/source/ja/model_doc/blenderbot.md new file mode 100644 index 00000000000000..f7ee23e7557d7b --- /dev/null +++ b/docs/source/ja/model_doc/blenderbot.md @@ -0,0 +1,132 @@ + + +# Blenderbot + +**免責事項:** 何か奇妙なものを見つけた場合は、 [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title) を報告してください。 + +## Overview + +Blender チャットボット モデルは、[Recipes for building an open-domain chatbot](https://arxiv.org/pdf/2004.13637.pdf) Stephen Roller、Emily Dinan、Naman Goyal、Da Ju、Mary Williamson、yinghan Liu、で提案されました。 +ジン・シュー、マイル・オット、カート・シャスター、エリック・M・スミス、Y-ラン・ブーロー、ジェイソン・ウェストン、2020年4月30日。 + +論文の要旨は次のとおりです。 + +*オープンドメインのチャットボットの構築は、機械学習研究にとって難しい分野です。これまでの研究では次のことが示されていますが、 +ニューラル モデルをパラメーターの数とトレーニング対象のデータのサイズでスケーリングすると、結果が向上します。 +高性能のチャットボットには他の要素も重要であることを示します。良い会話には多くのことが必要です +会話の専門家がシームレスに融合するスキル: 魅力的な話のポイントを提供し、話を聞く +一貫した態度を維持しながら、知識、共感、個性を適切に表現する +ペルソナ。適切なトレーニング データと選択が与えられた場合、大規模モデルがこれらのスキルを学習できることを示します。 +世代戦略。 90M、2.7B、9.4B パラメーター モデルを使用してこれらのレシピのバリアントを構築し、モデルを作成します。 +コードは公開されています。人間による評価では、当社の最良のモデルが既存のアプローチよりも優れていることがマルチターンで示されています +魅力と人間性の測定という観点からの対話。次に、分析によってこの作業の限界について説明します。 +弊社機種の故障事例* + +チップ: + +- Blenderbot は絶対位置埋め込みを備えたモデルであるため、通常は入力を右側にパディングすることをお勧めします。 + 左。 + +このモデルは [sshleifer](https://huggingface.co/sshleifer) によって提供されました。著者のコードは [ここ](https://github.com/facebookresearch/ParlAI) にあります。 + +## Implementation Notes + +- Blenderbot は、標準の [seq2seq モデル トランスフォーマー](https://arxiv.org/pdf/1706.03762.pdf) ベースのアーキテクチャを使用します。 +- 利用可能なチェックポイントは、[モデル ハブ](https://huggingface.co/models?search=blenderbot) で見つけることができます。 +- これは *デフォルト* Blenderbot モデル クラスです。ただし、次のような小さなチェックポイントもいくつかあります。 + `facebook/blenderbot_small_90M` はアーキテクチャが異なるため、一緒に使用する必要があります。 + [BlenderbotSmall](ブレンダーボット小)。 + +## Usage + +モデルの使用例を次に示します。 + +```python +>>> from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration + +>>> mname = "facebook/blenderbot-400M-distill" +>>> model = BlenderbotForConditionalGeneration.from_pretrained(mname) +>>> tokenizer = BlenderbotTokenizer.from_pretrained(mname) +>>> UTTERANCE = "My friends are cool but they eat too many carbs." +>>> inputs = tokenizer([UTTERANCE], return_tensors="pt") +>>> reply_ids = model.generate(**inputs) +>>> print(tokenizer.batch_decode(reply_ids)) +[" That's unfortunate. 
Are they trying to lose weight or are they just trying to be healthier?"] +``` + +## Documentation resources + +- [因果言語モデリング タスク ガイド](../tasks/language_modeling) +- [翻訳タスクガイド](../tasks/translation) +- [要約タスクガイド](../tasks/summarization) + +## BlenderbotConfig + +[[autodoc]] BlenderbotConfig + +## BlenderbotTokenizer + +[[autodoc]] BlenderbotTokenizer + - build_inputs_with_special_tokens + +## BlenderbotTokenizerFast + +[[autodoc]] BlenderbotTokenizerFast + - build_inputs_with_special_tokens + +## BlenderbotModel + +*forward* および *generate* の引数については、`transformers.BartModel`を参照してください。 + +[[autodoc]] BlenderbotModel + - forward + +## BlenderbotForConditionalGeneration + +*forward* と *generate* の引数については、[`~transformers.BartForConditionalGeneration`] を参照してください。 + +[[autodoc]] BlenderbotForConditionalGeneration + - forward + +## BlenderbotForCausalLM + +[[autodoc]] BlenderbotForCausalLM + - forward + +## TFBlenderbotModel + +[[autodoc]] TFBlenderbotModel + - call + +## TFBlenderbotForConditionalGeneration + +[[autodoc]] TFBlenderbotForConditionalGeneration + - call + +## FlaxBlenderbotModel + +[[autodoc]] FlaxBlenderbotModel + - __call__ + - encode + - decode + +## FlaxBlenderbotForConditionalGeneration + +[[autodoc]] FlaxBlenderbotForConditionalGeneration + - __call__ + - encode + - decode diff --git a/docs/source/ja/model_doc/blip-2.md b/docs/source/ja/model_doc/blip-2.md new file mode 100644 index 00000000000000..bd110522e27d60 --- /dev/null +++ b/docs/source/ja/model_doc/blip-2.md @@ -0,0 +1,90 @@ + + +# BLIP-2 + +## Overview + +BLIP-2 モデルは、[BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models](https://arxiv.org/abs/2301.12597) で提案されました。 +Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi.・サバレーゼ、スティーブン・ホイ。 BLIP-2 は、軽量の 12 層 Transformer をトレーニングすることで、フリーズされた事前トレーニング済み画像エンコーダーと大規模言語モデル (LLM) を活用します。 +それらの間にエンコーダーを配置し、さまざまな視覚言語タスクで最先端のパフォーマンスを実現します。最も注目すべき点は、BLIP-2 が 800 億パラメータ モデルである [Flamingo](https://arxiv.org/abs/2204.14198) を 8.7% 改善していることです。 +ゼロショット VQAv2 ではトレーニング可能なパラメーターが 54 分の 1 に減少します。 + +論文の要約は次のとおりです。 + +*大規模モデルのエンドツーエンドのトレーニングにより、視覚と言語の事前トレーニングのコストはますます法外なものになってきています。この論文では、市販の凍結済み事前トレーニング画像エンコーダと凍結された大規模言語モデルから視覚言語の事前トレーニングをブートストラップする、汎用的で効率的な事前トレーニング戦略である BLIP-2 を提案します。 BLIP-2 は、2 段階で事前トレーニングされた軽量の Querying Transformer でモダリティのギャップを橋渡しします。最初のステージでは、フリーズされた画像エンコーダーから学習する視覚言語表現をブートストラップします。第 2 段階では、凍結された言語モデルから視覚から言語への生成学習をブートストラップします。 BLIP-2 は、既存の方法よりもトレーニング可能なパラメーターが大幅に少ないにもかかわらず、さまざまな視覚言語タスクで最先端のパフォーマンスを実現します。たとえば、私たちのモデルは、トレーニング可能なパラメーターが 54 分の 1 少ないゼロショット VQAv2 で、Flamingo80B を 8.7% 上回っています。また、自然言語の命令に従うことができる、ゼロショット画像からテキストへの生成というモデルの新しい機能も実証します* + + + + BLIP-2 アーキテクチャ。 元の論文から抜粋。 + +このモデルは、[nielsr](https://huggingface.co/nielsr) によって提供されました。 +元のコードは [ここ](https://github.com/salesforce/LAVIS/tree/5ee63d688ba4cebff63acee04adaef2dee9af207) にあります。 + +## Usage tips + +- BLIP-2 は、画像とオプションのテキスト プロンプトを指定して条件付きテキストを生成するために使用できます。推論時には、 [`generate`] メソッドを使用することをお勧めします。 +- [`Blip2Processor`] を使用してモデル用の画像を準備し、予測されたトークン ID をデコードしてテキストに戻すことができます。 + +## Resources + +BLIP-2 の使用を開始するのに役立つ公式 Hugging Face およびコミュニティ (🌎 で示されている) リソースのリスト。 + +- 画像キャプション、ビジュアル質問応答 (VQA)、およびチャットのような会話のための BLIP-2 のデモ ノートブックは、[こちら](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/BLIP-2) にあります。 + +ここに含めるリソースの送信に興味がある場合は、お気軽にプル リクエストを開いてください。審査させていただきます。リソースは、既存のリソースを複製するのではなく、何か新しいものを示すことが理想的です。 + +## Blip2Config + +[[autodoc]] Blip2Config + - from_vision_qformer_text_configs + +## Blip2VisionConfig + +[[autodoc]] Blip2VisionConfig + +## Blip2QFormerConfig + 
+[[autodoc]] Blip2QFormerConfig + +## Blip2Processor + +[[autodoc]] Blip2Processor + +## Blip2VisionModel + +[[autodoc]] Blip2VisionModel + - forward + +## Blip2QFormerModel + +[[autodoc]] Blip2QFormerModel + - forward + +## Blip2Model + +[[autodoc]] Blip2Model + - forward + - get_text_features + - get_image_features + - get_qformer_features + +## Blip2ForConditionalGeneration + +[[autodoc]] Blip2ForConditionalGeneration + - forward + - generate \ No newline at end of file diff --git a/docs/source/ja/model_doc/blip.md b/docs/source/ja/model_doc/blip.md new file mode 100644 index 00000000000000..c145af701f23bb --- /dev/null +++ b/docs/source/ja/model_doc/blip.md @@ -0,0 +1,134 @@ + + +# BLIP + +## Overview + +BLIP モデルは、[BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086) で Junnan Li、Dongxu Li、Caiming Xiong、Steven Hoi によって提案されました。 。 + +BLIP は、次のようなさまざまなマルチモーダル タスクを実行できるモデルです。 +- 視覚的な質問応答 +- 画像とテキストの検索(画像とテキストのマッチング) +- 画像キャプション + +論文の要約は次のとおりです。 + +*視覚言語事前トレーニング (VLP) により、多くの視覚言語タスクのパフォーマンスが向上しました。 +ただし、既存の事前トレーニング済みモデルのほとんどは、理解ベースのタスクまたは世代ベースのタスクのいずれかでのみ優れています。さらに、最適ではない監視ソースである Web から収集されたノイズの多い画像とテキストのペアを使用してデータセットをスケールアップすることで、パフォーマンスの向上が大幅に達成されました。この論文では、視覚言語の理解と生成タスクの両方に柔軟に移行する新しい VLP フレームワークである BLIP を提案します。 BLIP は、キャプションをブートストラップすることでノイズの多い Web データを効果的に利用します。キャプショナーが合成キャプションを生成し、フィルターがノイズの多いキャプションを除去します。画像テキスト検索 (平均再現率 +2.7%@1)、画像キャプション作成 (CIDEr で +2.8%)、VQA ( VQA スコアは +1.6%)。 BLIP は、ゼロショット方式でビデオ言語タスクに直接転送した場合にも、強力な一般化能力を発揮します。コード、モデル、データセットがリリースされています。* + +![BLIP.gif](https://cdn-uploads.huggingface.co/production/uploads/1670928184033-62441d1d9fdefb55a0b7d12c.gif) + +このモデルは [ybelkada](https://huggingface.co/ybelkada) によって提供されました。 +元のコードは [ここ](https://github.com/salesforce/BLIP) にあります。 + +## Resources + +- [Jupyter ノートブック](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_blip.ipynb) カスタム データセットの画像キャプション用に BLIP を微調整する方法 + +## BlipConfig + +[[autodoc]] BlipConfig + - from_text_vision_configs + +## BlipTextConfig + +[[autodoc]] BlipTextConfig + +## BlipVisionConfig + +[[autodoc]] BlipVisionConfig + +## BlipProcessor + +[[autodoc]] BlipProcessor + +## BlipImageProcessor + +[[autodoc]] BlipImageProcessor + - preprocess + + + + +## BlipModel + +[[autodoc]] BlipModel + - forward + - get_text_features + - get_image_features + +## BlipTextModel + +[[autodoc]] BlipTextModel + - forward + +## BlipVisionModel + +[[autodoc]] BlipVisionModel + - forward + +## BlipForConditionalGeneration + +[[autodoc]] BlipForConditionalGeneration + - forward + +## BlipForImageTextRetrieval + +[[autodoc]] BlipForImageTextRetrieval + - forward + +## BlipForQuestionAnswering + +[[autodoc]] BlipForQuestionAnswering + - forward + + + + +## TFBlipModel + +[[autodoc]] TFBlipModel + - call + - get_text_features + - get_image_features + +## TFBlipTextModel + +[[autodoc]] TFBlipTextModel + - call + +## TFBlipVisionModel + +[[autodoc]] TFBlipVisionModel + - call + +## TFBlipForConditionalGeneration + +[[autodoc]] TFBlipForConditionalGeneration + - call + +## TFBlipForImageTextRetrieval + +[[autodoc]] TFBlipForImageTextRetrieval + - call + +## TFBlipForQuestionAnswering + +[[autodoc]] TFBlipForQuestionAnswering + - call + + \ No newline at end of file diff --git a/docs/source/ja/model_doc/bloom.md b/docs/source/ja/model_doc/bloom.md new file mode 100644 index 00000000000000..1ac9396fa8ab76 --- /dev/null +++ b/docs/source/ja/model_doc/bloom.md @@ -0,0 +1,107 @@ + + +# BLOOM + +## Overview + +BLOOM 
モデルは、[BigScience Workshop](https://bigscience.huggingface.co/) を通じてさまざまなバージョンで提案されています。 BigScience は、研究者が時間とリソースをプールして共同でより高い効果を達成する他のオープン サイエンス イニシアチブからインスピレーションを得ています。 +BLOOM のアーキテクチャは基本的に GPT3 (次のトークン予測のための自己回帰モデル) に似ていますが、46 の異なる言語と 13 のプログラミング言語でトレーニングされています。 +モデルのいくつかの小さいバージョンが同じデータセットでトレーニングされています。 BLOOM は次のバージョンで利用できます。 + +- [bloom-560m](https://huggingface.co/bigscience/bloom-560m) +- [bloom-1b1](https://huggingface.co/bigscience/bloom-1b1) +- [bloom-1b7](https://huggingface.co/bigscience/bloom-1b7) +- [bloom-3b](https://huggingface.co/bigscience/bloom-3b) +- [bloom-7b1](https://huggingface.co/bigscience/bloom-7b1) +- [bloom](https://huggingface.co/bigscience/bloom) (176B parameters) + +## Resources + +BLOOM を使い始めるのに役立つ公式 Hugging Face およびコミュニティ (🌎 で示されている) リソースのリスト。ここに含めるリソースの送信に興味がある場合は、お気軽にプル リクエストを開いてください。審査させていただきます。リソースは、既存のリソースを複製するのではなく、何か新しいものを示すことが理想的です。 + + + +- [`BloomForCausalLM`] これによってサポートされています [causal language modeling example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#gpt-2gpt-and-causal-language-modeling) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb). + +以下も参照してください。 +- [因果言語モデリング タスク ガイド](../tasks/language_modeling) +- [テキスト分類タスクガイド](../tasks/sequence_classification) +- [トークン分類タスクガイド](../tasks/token_classification) +- [質問回答タスク ガイド](../tasks/question_answering) + + +⚡️ 推論 +- に関するブログ [最適化の話: ブルーム推論](https://huggingface.co/blog/bloom-inference-optimization)。 +- に関するブログ [DeepSpeed と Accelerate を使用した信じられないほど高速な BLOOM 推論](https://huggingface.co/blog/bloom-inference-pytorch-scripts)。 + +⚙️トレーニング +- に関するブログ [BLOOM トレーニングの背後にあるテクノロジー](https://huggingface.co/blog/bloom-megatron-deepspeed)。 + +## BloomConfig + +[[autodoc]] BloomConfig + - all + +## BloomTokenizerFast + +[[autodoc]] BloomTokenizerFast + - all + + + + + +## BloomModel + +[[autodoc]] BloomModel + - forward + +## BloomForCausalLM + +[[autodoc]] BloomForCausalLM + - forward + +## BloomForSequenceClassification + +[[autodoc]] BloomForSequenceClassification + - forward + +## BloomForTokenClassification + +[[autodoc]] BloomForTokenClassification + - forward + +## BloomForQuestionAnswering + +[[autodoc]] BloomForQuestionAnswering + - forward + + + + +## FlaxBloomModel + +[[autodoc]] FlaxBloomModel + - __call__ + +## FlaxBloomForCausalLM + +[[autodoc]] FlaxBloomForCausalLM + - __call__ + + + diff --git a/docs/source/ja/model_doc/bort.md b/docs/source/ja/model_doc/bort.md new file mode 100644 index 00000000000000..2b892a35bb9cc3 --- /dev/null +++ b/docs/source/ja/model_doc/bort.md @@ -0,0 +1,55 @@ + + +# BORT + + + +このモデルはメンテナンス モードのみであり、コードを変更する新しい PR は受け付けられません。 + +このモデルの実行中に問題が発生した場合は、このモデルをサポートしていた最後のバージョン (v4.30.0) を再インストールしてください。 +これを行うには、コマンド `pip install -U Transformers==4.30.0` を実行します。 + + + +## Overview + +BORT モデルは、[Optimal Subarchitecture Extraction for BERT](https://arxiv.org/abs/2010.10499) で提案されました。 +Adrian de Wynter and Daniel J. 
Perry.これは、BERT のアーキテクチャ パラメータの最適なサブセットです。 +著者は「ボルト」と呼んでいます。 + +論文の要約は次のとおりです。 + +*Devlin らから BERT アーキテクチャのアーキテクチャ パラメータの最適なサブセットを抽出します。 (2018) +ニューラル アーキテクチャ検索のアルゴリズムにおける最近の画期的な技術を適用します。この最適なサブセットを次のように呼びます。 +"Bort" は明らかに小さく、有効 (つまり、埋め込み層を考慮しない) サイズは 5.5% です。 +オリジナルの BERT 大規模アーキテクチャ、およびネット サイズの 16%。 Bort は 288 GPU 時間で事前トレーニングすることもできます。 +最高パフォーマンスの BERT パラメトリック アーキテクチャ バリアントである RoBERTa-large の事前トレーニングに必要な時間の 1.2% +(Liu et al., 2019)、同じマシンで BERT-large をトレーニングするのに必要な GPU 時間の世界記録の約 33% +ハードウェア。また、CPU 上で 7.9 倍高速であるだけでなく、他の圧縮バージョンよりもパフォーマンスが優れています。 +アーキテクチャ、および一部の非圧縮バリアント: 0.3% ~ 31% のパフォーマンス向上が得られます。 +BERT-large に関して、複数の公開自然言語理解 (NLU) ベンチマークにおける絶対的な評価。* + +このモデルは [stefan-it](https://huggingface.co/stefan-it) によって提供されました。元のコードは[ここ](https://github.com/alexa/bort/)にあります。 + +## Usage tips + +- BORT のモデル アーキテクチャは BERT に基づいています。詳細については、[BERT のドキュメント ページ](bert) を参照してください。 + モデルの API リファレンスと使用例。 +- BORT は BERT トークナイザーの代わりに RoBERTa トークナイザーを使用します。トークナイザーの API リファレンスと使用例については、[RoBERTa のドキュメント ページ](roberta) を参照してください。 +- BORT には、 [Agora](https://adewynter.github.io/notes/bort_algorithms_and_applications.html#fine-tuning-with-algebraic-topology) と呼ばれる特定の微調整アルゴリズムが必要です。 + 残念ながらまだオープンソース化されていません。誰かが実装しようとすると、コミュニティにとって非常に役立ちます。 + BORT の微調整を機能させるためのアルゴリズム。 \ No newline at end of file diff --git a/docs/source/ja/model_doc/bridgetower.md b/docs/source/ja/model_doc/bridgetower.md new file mode 100644 index 00000000000000..e2ce1cb4c57709 --- /dev/null +++ b/docs/source/ja/model_doc/bridgetower.md @@ -0,0 +1,171 @@ + + +# BridgeTower + +## Overview + +BridgeTower モデルは、Xiao Xu、Chenfei Wu、Shachar Rosenman、Vasudev Lal、Wanxiang Che、Nan Duan [BridgeTower: Building Bridges Between Encoders in Vision-Language Representative Learning](https://arxiv.org/abs/2206.08657) で提案されました。ドゥアン。このモデルの目標は、 +各ユニモーダル エンコーダとクロスモーダル エンコーダの間のブリッジにより、クロスモーダル エンコーダの各層での包括的かつ詳細な対話が可能になり、追加のパフォーマンスと計算コストがほとんど無視できる程度で、さまざまな下流タスクで優れたパフォーマンスを実現します。 + +この論文は [AAAI'23](https://aaai.org/Conferences/AAAI-23/) 会議に採択されました。 + +論文の要約は次のとおりです。 + +*TWO-TOWER アーキテクチャを備えたビジョン言語 (VL) モデルは、近年の視覚言語表現学習の主流となっています。 +現在の VL モデルは、軽量のユニモーダル エンコーダーを使用して、ディープ クロスモーダル エンコーダーで両方のモダリティを同時に抽出、位置合わせ、融合することを学習するか、事前にトレーニングされたディープ ユニモーダル エンコーダーから最終層のユニモーダル表現を上部のクロスモーダルエンコーダー。 +どちらのアプローチも、視覚言語表現の学習を制限し、モデルのパフォーマンスを制限する可能性があります。この論文では、ユニモーダル エンコーダの最上位層とクロスモーダル エンコーダの各層の間の接続を構築する複数のブリッジ層を導入する BRIDGETOWER を提案します。 +これにより、効果的なボトムアップのクロスモーダル調整と、クロスモーダル エンコーダー内の事前トレーニング済みユニモーダル エンコーダーのさまざまなセマンティック レベルの視覚表現とテキスト表現の間の融合が可能になります。 BRIDGETOWER は 4M 画像のみで事前トレーニングされており、さまざまな下流の視覚言語タスクで最先端のパフォーマンスを実現します。 +特に、VQAv2 テスト標準セットでは、BRIDGETOWER は 78.73% の精度を達成し、同じ事前トレーニング データとほぼ無視できる追加パラメータと計算コストで以前の最先端モデル METER を 1.09% 上回りました。 +特に、モデルをさらにスケーリングすると、BRIDGETOWER は 81.15% の精度を達成し、桁違いに大きなデータセットで事前トレーニングされたモデルを上回りました。* + + + + ブリッジタワー アーキテクチャ。 元の論文から抜粋。 + +このモデルは、[Anahita Bhiwandiwalla](https://huggingface.co/anahita-b)、[Tiep Le](https://huggingface.co/Tile)、[Shaoyen Tseng](https://huggingface.co/shaoyent) 。元のコードは [ここ](https://github.com/microsoft/BridgeTower) にあります。 + +## Usage tips and examples + +BridgeTower は、ビジュアル エンコーダー、テキスト エンコーダー、および複数の軽量ブリッジ レイヤーを備えたクロスモーダル エンコーダーで構成されます。 +このアプローチの目標は、各ユニモーダル エンコーダーとクロスモーダル エンコーダーの間にブリッジを構築し、クロスモーダル エンコーダーの各層で包括的かつ詳細な対話を可能にすることでした。 +原則として、提案されたアーキテクチャでは、任意のビジュアル、テキスト、またはクロスモーダル エンコーダを適用できます。 + +[`BridgeTowerProcessor`] は、[`RobertaTokenizer`] と [`BridgeTowerImageProcessor`] を単一のインスタンスにラップし、両方の機能を実現します。 +テキストをエンコードし、画像をそれぞれ用意します。 + +次の例は、[`BridgeTowerProcessor`] と [`BridgeTowerForContrastiveLearning`] を使用して対照学習を実行する方法を示しています。 + +```python +>>> from transformers 
import BridgeTowerProcessor, BridgeTowerForContrastiveLearning +>>> import requests +>>> from PIL import Image + +>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" +>>> image = Image.open(requests.get(url, stream=True).raw) +>>> texts = ["An image of two cats chilling on a couch", "A football player scoring a goal"] + +>>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc") +>>> model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc") + +>>> # forward pass +>>> scores = dict() +>>> for text in texts: +... # prepare inputs +... encoding = processor(image, text, return_tensors="pt") +... outputs = model(**encoding) +... scores[text] = outputs +``` + +次の例は、[`BridgeTowerProcessor`] と [`BridgeTowerForImageAndTextRetrieval`] を使用して画像テキストの取得を実行する方法を示しています。 + +```python +>>> from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval +>>> import requests +>>> from PIL import Image + +>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" +>>> image = Image.open(requests.get(url, stream=True).raw) +>>> texts = ["An image of two cats chilling on a couch", "A football player scoring a goal"] + +>>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mlm") +>>> model = BridgeTowerForImageAndTextRetrieval.from_pretrained("BridgeTower/bridgetower-base-itm-mlm") + +>>> # forward pass +>>> scores = dict() +>>> for text in texts: +... # prepare inputs +... encoding = processor(image, text, return_tensors="pt") +... outputs = model(**encoding) +... scores[text] = outputs.logits[0, 1].item() +``` + +次の例は、[`BridgeTowerProcessor`] と [`BridgeTowerForMaskedLM`] を使用してマスクされた言語モデリングを実行する方法を示しています。 + +```python +>>> from transformers import BridgeTowerProcessor, BridgeTowerForMaskedLM +>>> from PIL import Image +>>> import requests + +>>> url = "http://images.cocodataset.org/val2017/000000360943.jpg" +>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB") +>>> text = "a looking out of the window" + +>>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mlm") +>>> model = BridgeTowerForMaskedLM.from_pretrained("BridgeTower/bridgetower-base-itm-mlm") + +>>> # prepare inputs +>>> encoding = processor(image, text, return_tensors="pt") + +>>> # forward pass +>>> outputs = model(**encoding) + +>>> results = processor.decode(outputs.logits.argmax(dim=-1).squeeze(0).tolist()) + +>>> print(results) +.a cat looking out of the window. 
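+>>> # Descriptive note (assumption): `results` is the full decoded prediction, i.e. the
+>>> # prompt with the masked word filled in — here the model predicts "cat".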
+``` + +チップ: + +- BridgeTower のこの実装では、[`RobertaTokenizer`] を使用してテキスト埋め込みを生成し、OpenAI の CLIP/ViT モデルを使用して視覚的埋め込みを計算します。 +- 事前トレーニングされた [bridgeTower-base](https://huggingface.co/BridgeTower/bridgetower-base) および [bridgetower マスクされた言語モデリングと画像テキスト マッチング](https://huggingface.co/BridgeTower/bridgetower--base-itm-mlm) のチェックポイント がリリースされました。 +- 画像検索およびその他の下流タスクにおける BridgeTower のパフォーマンスについては、[表 5](https://arxiv.org/pdf/2206.08657.pdf) を参照してください。 +- このモデルの PyTorch バージョンは、torch 1.10 以降でのみ使用できます。 + +## BridgeTowerConfig + +[[autodoc]] BridgeTowerConfig + +## BridgeTowerTextConfig + +[[autodoc]] BridgeTowerTextConfig + +## BridgeTowerVisionConfig + +[[autodoc]] BridgeTowerVisionConfig + +## BridgeTowerImageProcessor + +[[autodoc]] BridgeTowerImageProcessor + - preprocess + +## BridgeTowerProcessor + +[[autodoc]] BridgeTowerProcessor + - __call__ + +## BridgeTowerModel + +[[autodoc]] BridgeTowerModel + - forward + +## BridgeTowerForContrastiveLearning + +[[autodoc]] BridgeTowerForContrastiveLearning + - forward + +## BridgeTowerForMaskedLM + +[[autodoc]] BridgeTowerForMaskedLM + - forward + +## BridgeTowerForImageAndTextRetrieval + +[[autodoc]] BridgeTowerForImageAndTextRetrieval + - forward + diff --git a/docs/source/ja/model_doc/bros.md b/docs/source/ja/model_doc/bros.md new file mode 100644 index 00000000000000..3749a172a8286b --- /dev/null +++ b/docs/source/ja/model_doc/bros.md @@ -0,0 +1,113 @@ + + +# BROS + +## Overview + +BROS モデルは、Teakgyu Hon、Donghyun Kim、Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park によって [BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents](https://arxiv.org/abs/2108.04539) で提案されました。 + +BROS は *BERT Relying On Spatality* の略です。これは、一連のトークンとその境界ボックスを入力として受け取り、一連の隠れ状態を出力するエンコーダー専用の Transformer モデルです。 BROS は、絶対的な空間情報を使用する代わりに、相対的な空間情報をエンコードします。 + +BERT で使用されるトークンマスク言語モデリング目標 (TMLM) と新しいエリアマスク言語モデリング目標 (AMLM) の 2 つの目標で事前トレーニングされています。 +TMLM では、トークンはランダムにマスクされ、モデルは空間情報と他のマスクされていないトークンを使用してマスクされたトークンを予測します。 +AMLM は TMLM の 2D バージョンです。テキスト トークンをランダムにマスクし、TMLM と同じ情報で予測しますが、テキスト ブロック (領域) をマスクします。 + +`BrosForTokenClassification`には、BrosModel の上に単純な線形層があります。各トークンのラベルを予測します。 +`BrosSpadeEEForTokenClassification`には、BrosModel の上に`initial_token_classifier`と`subsequent_token_classifier`があります。 `initial_token_classifier` は各エンティティの最初のトークンを予測するために使用され、`subsequent_token_classifier` はエンティティ内の次のトークンを予測するために使用されます。 `BrosSpadeELForTokenClassification`には BrosModel の上に`entity_linker`があります。 `entity_linker` は 2 つのエンティティ間の関係を予測するために使用されます。 + +`BrosForTokenClassification`と`BrosSpadeEEForTokenClassification`は基本的に同じジョブを実行します。ただし、`BrosForTokenClassification`は入力トークンが完全にシリアル化されていることを前提としています (トークンは 2D 空間に存在するため、これは非常に困難な作業です)。一方、`BrosSpadeEEForTokenClassification`は 1 つのトークンから次の接続トークンを予測するため、シリアル化エラーの処理をより柔軟に行うことができます。 + +`BrosSpadeELForTokenClassification` はエンティティ内のリンク タスクを実行します。これら 2 つのエンティティが何らかの関係を共有する場合、(あるエンティティの) 1 つのトークンから (別のエンティティの) 別のトークンへの関係を予測します。 + +BROS は、明示的な視覚機能に依存せずに、FUNSD、SROIE、CORD、SciTSR などの Key Information Extraction (KIE) ベンチマークで同等以上の結果を達成します。 + +論文の要約は次のとおりです。 + +*文書画像からの重要情報抽出 (KIE) には、2 次元 (2D) 空間におけるテキストの文脈的および空間的意味論を理解する必要があります。最近の研究の多くは、文書画像の視覚的特徴とテキストおよびそのレイアウトを組み合わせることに重点を置いた事前トレーニング済み言語モデルを開発することで、この課題を解決しようとしています。一方、このペーパーでは、テキストとレイアウトの効果的な組み合わせという基本に立ち返ってこの問題に取り組みます。具体的には、BROS (BERT Relying On Spatality) という名前の事前トレーニング済み言語モデルを提案します。この言語モデルは、2D 空間内のテキストの相対位置をエンコードし、エリア マスキング戦略を使用してラベルのないドキュメントから学習します。 2D 空間内のテキストを理解するためのこの最適化されたトレーニング スキームにより、BROS は、視覚的な特徴に依存することなく、4 つの KIE ベンチマーク (FUNSD、SROIE*、CORD、および SciTSR) 
で以前の方法と比較して同等以上のパフォーマンスを示しました。また、この論文では、KIE タスクにおける 2 つの現実世界の課題 ((1) 間違ったテキスト順序によるエラーの最小化、および (2) 少数の下流例からの効率的な学習) を明らかにし、以前の方法に対する BROS の優位性を実証します。* + +このモデルは [jinho8345](https://huggingface.co/jinho8345) によって寄稿されました。元のコードは [ここ](https://github.com/clovaai/bros) にあります。 + +## Usage tips and examples + +- [`~transformers.BrosModel.forward`] には、`input_ids` と `bbox` (バウンディング ボックス) が必要です。各境界ボックスは、(x0、y0、x1、y1) 形式 (左上隅、右下隅) である必要があります。境界ボックスの取得は外部 OCR システムに依存します。 「x」座標はドキュメント画像の幅で正規化する必要があり、「y」座標はドキュメント画像の高さで正規化する必要があります。 + +```python +def expand_and_normalize_bbox(bboxes, doc_width, doc_height): + # here, bboxes are numpy array + + # Normalize bbox -> 0 ~ 1 + bboxes[:, [0, 2]] = bboxes[:, [0, 2]] / width + bboxes[:, [1, 3]] = bboxes[:, [1, 3]] / height +``` + +- [`~transformers.BrosForTokenClassification.forward`、`~transformers.BrosSpadeEEForTokenClassification.forward`、`~transformers.BrosSpadeEEForTokenClassification.forward`] では、損失計算に `input_ids` と `bbox` だけでなく `box_first_token_mask` も必要です。これは、各ボックスの先頭以外のトークンを除外するためのマスクです。このマスクは、単語から `input_ids` を作成するときに境界ボックスの開始トークン インデックスを保存することで取得できます。次のコードで`box_first_token_mask`を作成できます。 + +```python +def make_box_first_token_mask(bboxes, words, tokenizer, max_seq_length=512): + + box_first_token_mask = np.zeros(max_seq_length, dtype=np.bool_) + + # encode(tokenize) each word from words (List[str]) + input_ids_list: List[List[int]] = [tokenizer.encode(e, add_special_tokens=False) for e in words] + + # get the length of each box + tokens_length_list: List[int] = [len(l) for l in input_ids_list] + + box_end_token_indices = np.array(list(itertools.accumulate(tokens_length_list))) + box_start_token_indices = box_end_token_indices - np.array(tokens_length_list) + + # filter out the indices that are out of max_seq_length + box_end_token_indices = box_end_token_indices[box_end_token_indices < max_seq_length - 1] + if len(box_start_token_indices) > len(box_end_token_indices): + box_start_token_indices = box_start_token_indices[: len(box_end_token_indices)] + + # set box_start_token_indices to True + box_first_token_mask[box_start_token_indices] = True + + return box_first_token_mask + +``` + +## Resources + +- デモ スクリプトは [こちら](https://github.com/clovaai/bros) にあります。 + +## BrosConfig + +[[autodoc]] BrosConfig + +## BrosProcessor + +[[autodoc]] BrosProcessor + - __call__ + +## BrosModel + +[[autodoc]] BrosModel + - forward + + +## BrosForTokenClassification + +[[autodoc]] BrosForTokenClassification + - forward + +## BrosSpadeEEForTokenClassification + +[[autodoc]] BrosSpadeEEForTokenClassification + - forward + +## BrosSpadeELForTokenClassification + +[[autodoc]] BrosSpadeELForTokenClassification + - forward diff --git a/docs/source/ja/model_doc/byt5.md b/docs/source/ja/model_doc/byt5.md new file mode 100644 index 00000000000000..c6796f981817dc --- /dev/null +++ b/docs/source/ja/model_doc/byt5.md @@ -0,0 +1,154 @@ + + +# ByT5 + +## Overview + +ByT5 モデルは、[ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir +Kale, Adam Roberts, Colin Raffel. 
+ +論文の要約は次のとおりです。 + +*最も広く使用されている事前トレーニング済み言語モデルは、単語またはサブワード単位に対応するトークンのシーケンスで動作します。 +テキストをトークンのシーケンスとしてエンコードするには、トークナイザーが必要です。トークナイザーは通常、 +モデル。代わりに生のテキスト (バイトまたは文字) を直接操作するトークンフリー モデルには多くの利点があります。 +すぐに使用できるあらゆる言語のテキストを処理でき、ノイズに対してより堅牢であり、技術的負債を最小限に抑えます。 +複雑でエラーが発生しやすいテキスト前処理パイプラインを削除します。バイトまたは文字列がトークンより長いため +トークンフリー モデルに関する過去の研究では、シーケンスのコストを償却するように設計された新しいモデル アーキテクチャが導入されることがよくありました。 +生のテキストを直接操作します。この論文では、標準的な Transformer アーキテクチャが次のようなもので使用できることを示します。 +バイトシーケンスを処理するための最小限の変更。パラメータ数の観点からトレードオフを注意深く特徴付けます。 +FLOP のトレーニングと推論速度を調べ、バイトレベルのモデルがトークンレベルと競合できることを示します。 +対応者。また、バイトレベルのモデルはノイズに対して大幅に堅牢であり、より優れたパフォーマンスを発揮することも示しています。 +スペルと発音に敏感なタスク。私たちの貢献の一環として、新しいセットをリリースします。 +T5 アーキテクチャに基づいた事前トレーニング済みのバイトレベルの Transformer モデルと、そこで使用されるすべてのコードとデータ +実験。* + +このモデルは、[patrickvonplaten](https://huggingface.co/patrickvonplaten) によって提供されました。元のコードは次のとおりです +[ここ](https://github.com/google-research/byt5) にあります。 + + + +ByT5 のアーキテクチャは T5v1.1 モデルに基づいています。API リファレンスについては、[T5v1.1 のドキュメント ページ](t5v1.1) を参照してください。彼らは +モデルの入力を準備する方法が異なるだけです。以下のコード例を参照してください。 + + + +ByT5 は教師なしで事前トレーニングされているため、単一タスク中にタスク プレフィックスを使用する利点はありません。 +微調整。マルチタスクの微調整を行う場合は、プレフィックスを使用する必要があります。 + +## Usage Examples + +ByT5 は生の UTF-8 バイトで動作するため、トークナイザーなしで使用できます。 + +```python +>>> from transformers import T5ForConditionalGeneration +>>> import torch + +>>> model = T5ForConditionalGeneration.from_pretrained("google/byt5-small") + +>>> num_special_tokens = 3 +>>> # Model has 3 special tokens which take up the input ids 0,1,2 of ByT5. +>>> # => Need to shift utf-8 character encodings by 3 before passing ids to model. + +>>> input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + num_special_tokens + +>>> labels = torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + num_special_tokens + +>>> loss = model(input_ids, labels=labels).loss +>>> loss.item() +2.66 +``` + +ただし、バッチ推論とトレーニングの場合は、トークナイザーを使用することをお勧めします。 + + +```python +>>> from transformers import T5ForConditionalGeneration, AutoTokenizer + +>>> model = T5ForConditionalGeneration.from_pretrained("google/byt5-small") +>>> tokenizer = AutoTokenizer.from_pretrained("google/byt5-small") + +>>> model_inputs = tokenizer( +... ["Life is like a box of chocolates.", "Today is Monday."], padding="longest", return_tensors="pt" +... ) +>>> labels_dict = tokenizer( +... ["La vie est comme une boîte de chocolat.", "Aujourd'hui c'est lundi."], padding="longest", return_tensors="pt" +... ) +>>> labels = labels_dict.input_ids + +>>> loss = model(**model_inputs, labels=labels).loss +>>> loss.item() +17.9 +``` + +[T5](t5) と同様に、ByT5 はスパンマスクノイズ除去タスクでトレーニングされました。しかし、 +モデルはキャラクターに直接作用するため、事前トレーニングタスクは少し複雑です +違う。のいくつかの文字を破損してみましょう +`"The dog chases a ball in the park."`という文を入力し、ByT5 に予測してもらいます。 +わたしたちのため。 + +```python +>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM +>>> import torch + +>>> tokenizer = AutoTokenizer.from_pretrained("google/byt5-base") +>>> model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-base") + +>>> input_ids_prompt = "The dog chases a ball in the park." +>>> input_ids = tokenizer(input_ids_prompt).input_ids + +>>> # Note that we cannot add "{extra_id_...}" to the string directly +>>> # as the Byte tokenizer would incorrectly merge the tokens +>>> # For ByT5, we need to work directly on the character level +>>> # Contrary to T5, ByT5 does not use sentinel tokens for masking, but instead +>>> # uses final utf character ids. +>>> # UTF-8 is represented by 8 bits and ByT5 has 3 special tokens. 
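+>>> # (Clarifying comment: ids 0-2 are the special tokens, so the 256 possible byte
+>>> # values map to ids 3-258 — hence the shift by 3 used above.)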
+>>> # => There are 2**8+2 = 259 input ids and mask tokens count down from index 258. +>>> # => mask to "The dog [258]a ball [257]park." + +>>> input_ids = torch.tensor([input_ids[:8] + [258] + input_ids[14:21] + [257] + input_ids[28:]]) +>>> input_ids +tensor([[ 87, 107, 104, 35, 103, 114, 106, 35, 258, 35, 100, 35, 101, 100, 111, 111, 257, 35, 115, 100, 117, 110, 49, 1]]) + +>>> # ByT5 produces only one char at a time so we need to produce many more output characters here -> set `max_length=100`. +>>> output_ids = model.generate(input_ids, max_length=100)[0].tolist() +>>> output_ids +[0, 258, 108, 118, 35, 119, 107, 104, 35, 114, 113, 104, 35, 122, 107, 114, 35, 103, 114, 104, 118, 257, 35, 108, 113, 35, 119, 107, 104, 35, 103, 108, 118, 102, 114, 256, 108, 113, 35, 119, 107, 104, 35, 115, 100, 117, 110, 49, 35, 87, 107, 104, 35, 103, 114, 106, 35, 108, 118, 35, 119, 107, 104, 35, 114, 113, 104, 35, 122, 107, 114, 35, 103, 114, 104, 118, 35, 100, 35, 101, 100, 111, 111, 35, 108, 113, 255, 35, 108, 113, 35, 119, 107, 104, 35, 115, 100, 117, 110, 49] + +>>> # ^- Note how 258 descends to 257, 256, 255 + +>>> # Now we need to split on the sentinel tokens, let's write a short loop for this +>>> output_ids_list = [] +>>> start_token = 0 +>>> sentinel_token = 258 +>>> while sentinel_token in output_ids: +... split_idx = output_ids.index(sentinel_token) +... output_ids_list.append(output_ids[start_token:split_idx]) +... start_token = split_idx +... sentinel_token -= 1 + +>>> output_ids_list.append(output_ids[start_token:]) +>>> output_string = tokenizer.batch_decode(output_ids_list) +>>> output_string +['', 'is the one who does', ' in the disco', 'in the park. The dog is the one who does a ball in', ' in the park.'] +``` + +## ByT5Tokenizer + +[[autodoc]] ByT5Tokenizer + +詳細については、[`ByT5Tokenizer`] を参照してください。 \ No newline at end of file diff --git a/docs/source/ja/model_doc/camembert.md b/docs/source/ja/model_doc/camembert.md new file mode 100644 index 00000000000000..db8e0aa936936d --- /dev/null +++ b/docs/source/ja/model_doc/camembert.md @@ -0,0 +1,135 @@ + + +# CamemBERT + +## Overview + +CamemBERT モデルは、[CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) で提案されました。 +Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la +Clergerie, Djamé Seddah, and Benoît Sagot. 
2019年にリリースされたFacebookのRoBERTaモデルをベースにしたモデルです。 +138GBのフランス語テキストでトレーニングされました。 + +論文の要約は次のとおりです。 + +*事前トレーニングされた言語モデルは現在、自然言語処理で広く普及しています。成功にもかかわらず、利用可能なほとんどの +モデルは英語のデータ、または複数言語のデータの連結でトレーニングされています。これにより、 +このようなモデルの実際の使用は、英語を除くすべての言語で非常に限られています。フランス人にとってこの問題に対処することを目指して、 +Bi-direction Encoders for Transformers (BERT) のフランス語版である CamemBERT をリリースします。測定します +複数の下流タスク、つまり品詞タグ付けにおける多言語モデルと比較した CamemBERT のパフォーマンス +依存関係解析、固有表現認識、自然言語推論。 CamemBERT は最先端技術を向上させます +検討されているほとんどのタスクに対応します。私たちは、研究と +フランス語 NLP の下流アプリケーション。* + +このモデルは [camembert](https://huggingface.co/camembert) によって提供されました。元のコードは [ここ](https://camembert-model.fr/) にあります。 + + + + +この実装はRoBERTaと同じです。使用例については[RoBERTaのドキュメント](roberta)も参照してください。 +入力と出力に関する情報として。 + + + +## Resources + +- [テキスト分類タスクガイド](../tasks/sequence_classification) +- [トークン分類タスクガイド](../tasks/token_classification) +- [質問回答タスク ガイド](../tasks/question_answering) +- [因果言語モデリング タスク ガイド](../tasks/language_modeling) +- [マスク言語モデリング タスク ガイド](../tasks/masked_language_modeling) +- [多肢選択タスク ガイド](../tasks/multiple_choice) + +## CamembertConfig + +[[autodoc]] CamembertConfig + +## CamembertTokenizer + +[[autodoc]] CamembertTokenizer + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - save_vocabulary + +## CamembertTokenizerFast + +[[autodoc]] CamembertTokenizerFast + + + + +## CamembertModel + +[[autodoc]] CamembertModel + +## CamembertForCausalLM + +[[autodoc]] CamembertForCausalLM + +## CamembertForMaskedLM + +[[autodoc]] CamembertForMaskedLM + +## CamembertForSequenceClassification + +[[autodoc]] CamembertForSequenceClassification + +## CamembertForMultipleChoice + +[[autodoc]] CamembertForMultipleChoice + +## CamembertForTokenClassification + +[[autodoc]] CamembertForTokenClassification + +## CamembertForQuestionAnswering + +[[autodoc]] CamembertForQuestionAnswering + + + + +## TFCamembertModel + +[[autodoc]] TFCamembertModel + +## TFCamembertForCasualLM + +[[autodoc]] TFCamembertForCausalLM + +## TFCamembertForMaskedLM + +[[autodoc]] TFCamembertForMaskedLM + +## TFCamembertForSequenceClassification + +[[autodoc]] TFCamembertForSequenceClassification + +## TFCamembertForMultipleChoice + +[[autodoc]] TFCamembertForMultipleChoice + +## TFCamembertForTokenClassification + +[[autodoc]] TFCamembertForTokenClassification + +## TFCamembertForQuestionAnswering + +[[autodoc]] TFCamembertForQuestionAnswering + + + diff --git a/docs/source/ja/model_doc/canine.md b/docs/source/ja/model_doc/canine.md new file mode 100644 index 00000000000000..18af699967d1ad --- /dev/null +++ b/docs/source/ja/model_doc/canine.md @@ -0,0 +1,144 @@ + + +# CANINE + +## Overview + +CANINE モデルは、[CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language +Representation](https://arxiv.org/abs/2103.06874)、Jonathan H. 
Clark、Dan Garrette、Iulia Turc、John Wieting 著。その +明示的なトークン化ステップ (バイト ペアなど) を使用せずに Transformer をトレーニングする最初の論文の 1 つ +エンコーディング (BPE、WordPiece または SentencePiece)。代わりに、モデルは Unicode 文字レベルで直接トレーニングされます。 +キャラクターレベルでのトレーニングでは必然的にシーケンスの長さが長くなりますが、CANINE はこれを効率的な方法で解決します。 +ディープ Transformer エンコーダを適用する前に、ダウンサンプリング戦略を実行します。 + +論文の要約は次のとおりです。 + +*パイプライン NLP システムは、エンドツーエンドのニューラル モデリングに大部分が取って代わられていますが、一般的に使用されているほぼすべてのモデルは +依然として明示的なトークン化手順が必要です。最近のトークン化アプローチはデータ由来のサブワードに基づいていますが、 +レキシコンは手動で作成されたトークナイザーよりも脆弱ではありませんが、これらの技術はすべての言語に等しく適しているわけではありません。 +言語や固定語彙の使用により、モデルの適応能力が制限される可能性があります。この論文では、CANINE を紹介します。 +明示的なトークン化や語彙を使用せずに、文字シーケンスを直接操作するニューラル エンコーダーと、 +文字に直接作用するか、オプションでサブワードをソフト誘導バイアスとして使用する事前トレーニング戦略。 +よりきめの細かい入力を効果的かつ効率的に使用するために、CANINE はダウンサンプリングを組み合わせて、入力を削減します。 +コンテキストをエンコードするディープトランスフォーマースタックを備えたシーケンスの長さ。 CANINE は、同等の mBERT モデルよりも次の点で優れています。 +TyDi QA の 2.8 F1 は、モデル パラメータが 28% 少ないにもかかわらず、困難な多言語ベンチマークです。* + +このモデルは、[nielsr](https://huggingface.co/nielsr) によって提供されました。元のコードは [ここ](https://github.com/google-research/language/tree/master/language/canine) にあります。 + +## Usage tips + +- CANINE は内部で少なくとも 3 つの Transformer エンコーダーを使用します: 2 つの「浅い」エンコーダー (単一のエンコーダーのみで構成) + レイヤー) と 1 つの「ディープ」エンコーダー (通常の BERT エンコーダー)。まず、「浅い」エンコーダを使用してコンテキストを設定します。 + ローカル アテンションを使用した文字の埋め込み。次に、ダウンサンプリングの後、「ディープ」エンコーダーが適用されます。ついに、 + アップサンプリング後、「浅い」エンコーダを使用して最終的な文字埋め込みが作成されます。アップと + ダウンサンプリングについては論文に記載されています。 +- CANINE は、デフォルトで 2048 文字の最大シーケンス長を使用します。 [`CanineTokenizer`] を使用できます + モデル用のテキストを準備します。 +- 特別な [CLS] トークンの最終的な非表示状態の上に線形レイヤーを配置することで分類を行うことができます。 + (事前定義された Unicode コード ポイントがあります)。ただし、トークン分類タスクの場合は、ダウンサンプリングされたシーケンス + トークンは、元の文字シーケンスの長さ (2048) と一致するように再度アップサンプリングする必要があります。の + 詳細については、論文を参照してください。 + +モデルのチェックポイント: + + - [google/canine-c](https://huggingface.co/google/canine-c): 自己回帰文字損失で事前トレーニング済み、 + 12 レイヤー、768 隠し、12 ヘッド、121M パラメーター (サイズ ~500 MB)。 + - [google/canine-s](https://huggingface.co/google/canine-s): サブワード損失で事前トレーニング済み、12 層、 + 768 個の非表示、12 ヘッド、121M パラメーター (サイズ ~500 MB)。 + +## Usage example + +CANINE は生の文字で動作するため、**トークナイザーなし**で使用できます。 + +```python +>>> from transformers import CanineModel +>>> import torch + +>>> model = CanineModel.from_pretrained("google/canine-c") # model pre-trained with autoregressive character loss + +>>> text = "hello world" +>>> # use Python's built-in ord() function to turn each character into its unicode code point id +>>> input_ids = torch.tensor([[ord(char) for char in text]]) + +>>> outputs = model(input_ids) # forward pass +>>> pooled_output = outputs.pooler_output +>>> sequence_output = outputs.last_hidden_state +``` + +ただし、バッチ推論とトレーニングの場合は、トークナイザーを使用することをお勧めします(すべてをパディング/切り詰めるため) +シーケンスを同じ長さにします): + +```python +>>> from transformers import CanineTokenizer, CanineModel + +>>> model = CanineModel.from_pretrained("google/canine-c") +>>> tokenizer = CanineTokenizer.from_pretrained("google/canine-c") + +>>> inputs = ["Life is like a box of chocolates.", "You never know what you gonna get."] +>>> encoding = tokenizer(inputs, padding="longest", truncation=True, return_tensors="pt") + +>>> outputs = model(**encoding) # forward pass +>>> pooled_output = outputs.pooler_output +>>> sequence_output = outputs.last_hidden_state +``` + +## Resources + +- [テキスト分類タスクガイド](../tasks/sequence_classification) +- [トークン分類タスクガイド](../tasks/token_classification) +- [質問回答タスク ガイド](../tasks/question_answering) +- [多肢選択タスク ガイド](../tasks/multiple_choice) + +## CanineConfig + +[[autodoc]] CanineConfig + +## CanineTokenizer + +[[autodoc]] CanineTokenizer + - build_inputs_with_special_tokens + - get_special_tokens_mask + - 
create_token_type_ids_from_sequences + +## CANINE specific outputs + +[[autodoc]] models.canine.modeling_canine.CanineModelOutputWithPooling + +## CanineModel + +[[autodoc]] CanineModel + - forward + +## CanineForSequenceClassification + +[[autodoc]] CanineForSequenceClassification + - forward + +## CanineForMultipleChoice + +[[autodoc]] CanineForMultipleChoice + - forward + +## CanineForTokenClassification + +[[autodoc]] CanineForTokenClassification + - forward + +## CanineForQuestionAnswering + +[[autodoc]] CanineForQuestionAnswering + - forward diff --git a/docs/source/ja/model_doc/chinese_clip.md b/docs/source/ja/model_doc/chinese_clip.md new file mode 100644 index 00000000000000..8d7dc401d2ae7e --- /dev/null +++ b/docs/source/ja/model_doc/chinese_clip.md @@ -0,0 +1,112 @@ + + +# Chinese-CLIP + +## Overview + +Chinese-CLIP An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) で提案されました。周、張周。 +Chinese-CLIP は、中国語の画像とテキストのペアの大規模なデータセットに対する CLIP (Radford et al., 2021) の実装です。クロスモーダル検索を実行できるほか、ゼロショット画像分類、オープンドメインオブジェクト検出などのビジョンタスクのビジョンバックボーンとしても機能します。オリジナルの中国語-CLIPコードは[このリンクで](https://github.com/OFA-Sys/Chinese-CLIP)。 + +論文の要約は次のとおりです。 + +*CLIP の大成功 (Radford et al., 2021) により、視覚言語の事前訓練のための対照学習の研究と応用が促進されました。この研究では、ほとんどのデータが公開されているデータセットから取得された中国語の画像とテキストのペアの大規模なデータセットを構築し、新しいデータセットで中国語の CLIP モデルを事前トレーニングします。当社では、7,700 万から 9 億 5,800 万のパラメータにわたる、複数のサイズの 5 つの中国 CLIP モデルを開発しています。さらに、モデルのパフォーマンスを向上させるために、最初に画像エンコーダーをフリーズさせてモデルをトレーニングし、次にすべてのパラメーターを最適化してトレーニングする 2 段階の事前トレーニング方法を提案します。私たちの包括的な実験では、中国の CLIP がゼロショット学習と微調整のセットアップで MUGE、Flickr30K-CN、および COCO-CN 上で最先端のパフォーマンスを達成でき、ゼロで競争力のあるパフォーマンスを達成できることを実証しています。 - ELEVATER ベンチマークでの評価に基づくショット画像の分類 (Li et al., 2022)。コード、事前トレーニング済みモデル、デモがリリースされました。* + +Chinese-CLIP モデルは、[OFA-Sys](https://huggingface.co/OFA-Sys) によって提供されました。 + +## Usage example + +以下のコード スニペットは、画像とテキストの特徴と類似性を計算する方法を示しています。 + +```python +>>> from PIL import Image +>>> import requests +>>> from transformers import ChineseCLIPProcessor, ChineseCLIPModel + +>>> model = ChineseCLIPModel.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16") +>>> processor = ChineseCLIPProcessor.from_pretrained("OFA-Sys/chinese-clip-vit-base-patch16") + +>>> url = "https://clip-cn-beijing.oss-cn-beijing.aliyuncs.com/pokemon.jpeg" +>>> image = Image.open(requests.get(url, stream=True).raw) +>>> # Squirtle, Bulbasaur, Charmander, Pikachu in English +>>> texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"] + +>>> # compute image feature +>>> inputs = processor(images=image, return_tensors="pt") +>>> image_features = model.get_image_features(**inputs) +>>> image_features = image_features / image_features.norm(p=2, dim=-1, keepdim=True) # normalize + +>>> # compute text features +>>> inputs = processor(text=texts, padding=True, return_tensors="pt") +>>> text_features = model.get_text_features(**inputs) +>>> text_features = text_features / text_features.norm(p=2, dim=-1, keepdim=True) # normalize + +>>> # compute image-text similarity scores +>>> inputs = processor(text=texts, images=image, return_tensors="pt", padding=True) +>>> outputs = model(**inputs) +>>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score +>>> probs = logits_per_image.softmax(dim=1) # probs: [[1.2686e-03, 5.4499e-02, 6.7968e-04, 9.4355e-01]] +``` + +現在、次のスケールの事前トレーニング済み Chinese-CLIP モデルが 🤗 Hub で利用可能です。 + +- 
[OFA-Sys/chinese-clip-vit-base-patch16](https://huggingface.co/OFA-Sys/chinese-clip-vit-base-patch16) +- [OFA-Sys/chinese-clip-vit-large-patch14](https://huggingface.co/OFA-Sys/chinese-clip-vit-large-patch14) +- [OFA-Sys/chinese-clip-vit-large-patch14-336px](https://huggingface.co/OFA-Sys/chinese-clip-vit-large-patch14-336px) +- [OFA-Sys/chinese-clip-vit-huge-patch14](https://huggingface.co/OFA-Sys/chinese-clip-vit-huge-patch14) + +## ChineseCLIPConfig + +[[autodoc]] ChineseCLIPConfig + - from_text_vision_configs + +## ChineseCLIPTextConfig + +[[autodoc]] ChineseCLIPTextConfig + +## ChineseCLIPVisionConfig + +[[autodoc]] ChineseCLIPVisionConfig + +## ChineseCLIPImageProcessor + +[[autodoc]] ChineseCLIPImageProcessor + - preprocess + +## ChineseCLIPFeatureExtractor + +[[autodoc]] ChineseCLIPFeatureExtractor + +## ChineseCLIPProcessor + +[[autodoc]] ChineseCLIPProcessor + +## ChineseCLIPModel + +[[autodoc]] ChineseCLIPModel + - forward + - get_text_features + - get_image_features + +## ChineseCLIPTextModel + +[[autodoc]] ChineseCLIPTextModel + - forward + +## ChineseCLIPVisionModel + +[[autodoc]] ChineseCLIPVisionModel + - forward \ No newline at end of file diff --git a/docs/source/ja/model_doc/clap.md b/docs/source/ja/model_doc/clap.md new file mode 100644 index 00000000000000..f1e08d76018fac --- /dev/null +++ b/docs/source/ja/model_doc/clap.md @@ -0,0 +1,80 @@ + + +# CLAP + +## Overview + +CLAP モデルは、[Large Scale Contrastive Language-Audio pretraining with +feature fusion and keyword-to-caption augmentation](https://arxiv.org/pdf/2211.06687.pdf)、Yusong Wu、Ke Chen、Tianyu Zhang、Yuchen Hui、Taylor Berg-Kirkpatrick、Shlomo Dubnov 著。 + +CLAP (Contrastive Language-Audio Pretraining) は、さまざまな (音声、テキスト) ペアでトレーニングされたニューラル ネットワークです。タスクに合わせて直接最適化することなく、音声が与えられた場合に最も関連性の高いテキスト スニペットを予測するように指示できます。 CLAP モデルは、SWINTransformer を使用して log-Mel スペクトログラム入力からオーディオ特徴を取得し、RoBERTa モデルを使用してテキスト特徴を取得します。次に、テキストとオーディオの両方の特徴が、同じ次元の潜在空間に投影されます。投影されたオーディオとテキストの特徴の間のドット積が、同様のスコアとして使用されます。 + +論文の要約は次のとおりです。 + +*対照学習は、マルチモーダル表現学習の分野で目覚ましい成功を収めています。この論文では、音声データと自然言語記述を組み合わせて音声表現を開発する、対照的な言語音声事前トレーニングのパイプラインを提案します。この目標を達成するために、私たちはまず、さまざまなデータ ソースからの 633,526 個の音声とテキストのペアの大規模なコレクションである LAION-Audio-630K をリリースします。次に、さまざまなオーディオ エンコーダとテキスト エンコーダを考慮して、対照的な言語とオーディオの事前トレーニング モデルを構築します。機能融合メカニズムとキーワードからキャプションへの拡張をモデル設計に組み込んで、モデルが可変長の音声入力を処理できるようにし、パフォーマンスを向上させます。 3 番目に、包括的な実験を実行して、テキストから音声への取得、ゼロショット音声分類、教師付き音声分類の 3 つのタスクにわたってモデルを評価します。結果は、私たちのモデルがテキストから音声への検索タスクにおいて優れたパフォーマンスを達成していることを示しています。オーディオ分類タスクでは、モデルはゼロショット設定で最先端のパフォーマンスを達成し、非ゼロショット設定でもモデルの結果に匹敵するパフォーマンスを得ることができます。 LAION-オーディオ-6* + +このモデルは、[Younes Belkada](https://huggingface.co/ybelkada) および [Arthur Zucker](https://huggingface.co/ArthurZ) によって提供されました。 +元のコードは [こちら](https://github.com/LAION-AI/Clap) にあります。 + +## ClapConfig + +[[autodoc]] ClapConfig + - from_text_audio_configs + +## ClapTextConfig + +[[autodoc]] ClapTextConfig + +## ClapAudioConfig + +[[autodoc]] ClapAudioConfig + +## ClapFeatureExtractor + +[[autodoc]] ClapFeatureExtractor + +## ClapProcessor + +[[autodoc]] ClapProcessor + +## ClapModel + +[[autodoc]] ClapModel + - forward + - get_text_features + - get_audio_features + +## ClapTextModel + +[[autodoc]] ClapTextModel + - forward + +## ClapTextModelWithProjection + +[[autodoc]] ClapTextModelWithProjection + - forward + +## ClapAudioModel + +[[autodoc]] ClapAudioModel + - forward + +## ClapAudioModelWithProjection + +[[autodoc]] ClapAudioModelWithProjection + - forward + \ No newline at end of file diff --git a/docs/source/ja/model_doc/clip.md 
b/docs/source/ja/model_doc/clip.md new file mode 100644 index 00000000000000..697971e9224848 --- /dev/null +++ b/docs/source/ja/model_doc/clip.md @@ -0,0 +1,220 @@ + + +# CLIP + +## Overview + +CLIP モデルは、Alec Radford、Jong Wook Kim、Chris Hallacy、Aditya Ramesh、Gabriel Goh Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) で提案されました。 +サンディニ・アガルワル、ギリッシュ・サストリー、アマンダ・アスケル、パメラ・ミシュキン、ジャック・クラーク、グレッチェン・クルーガー、イリヤ・サツケヴァー。クリップ +(Contrastive Language-Image Pre-Training) は、さまざまな (画像、テキスト) ペアでトレーニングされたニューラル ネットワークです。かもね +直接最適化することなく、与えられた画像から最も関連性の高いテキスト スニペットを予測するように自然言語で指示されます。 +GPT-2 および 3 のゼロショット機能と同様に、タスクに対して。 + +論文の要約は次のとおりです。 + +*最先端のコンピューター ビジョン システムは、あらかじめ定められたオブジェクト カテゴリの固定セットを予測するようにトレーニングされています。これ +制限された形式の監視では、指定するために追加のラベル付きデータが必要となるため、一般性と使いやすさが制限されます。 +その他の視覚的なコンセプト。画像に関する生のテキストから直接学習することは、 +より広範な監督源。どのキャプションが表示されるかを予測するという単純な事前トレーニング タスクが有効であることを示します。 +400 のデータセットで SOTA 画像表現を最初から学習するための効率的かつスケーラブルな方法はどの画像ですか +インターネットから収集された数百万の(画像、テキスト)ペア。事前トレーニング後、自然言語を使用して参照します。 +視覚的な概念を学習し(または新しい概念を説明し)、下流のタスクへのモデルのゼロショット転送を可能にします。私たちは勉強します +30 を超えるさまざまな既存のコンピューター ビジョン データセットでタスクをまたがってベンチマークを行うことにより、このアプローチのパフォーマンスを評価します。 +OCR、ビデオ内のアクション認識、地理的位置特定、およびさまざまな種類のきめ細かいオブジェクト分類など。の +モデルはほとんどのタスクに簡単に移行でき、多くの場合、必要がなくても完全に監視されたベースラインと競合します。 +データセット固有のトレーニングに適しています。たとえば、ImageNet ゼロショットではオリジナルの ResNet-50 の精度と一致します。 +トレーニングに使用された 128 万のトレーニング サンプルを使用する必要はありません。コードをリリースし、事前トレーニング済み +モデルの重みはこの https URL で確認できます。* + +このモデルは [valhalla](https://huggingface.co/valhalla) によって提供されました。元のコードは [ここ](https://github.com/openai/CLIP) にあります。 + +## Usage tips and example + +CLIP は、マルチモーダルなビジョンおよび言語モデルです。画像とテキストの類似性やゼロショット画像に使用できます。 +分類。 CLIP は、ViT のようなトランスフォーマーを使用して視覚的特徴を取得し、因果言語モデルを使用してテキストを取得します +特徴。次に、テキストと視覚の両方の特徴が、同じ次元の潜在空間に投影されます。ドット +投影された画像とテキストの特徴間の積が同様のスコアとして使用されます。 + +画像を Transformer エンコーダに供給するために、各画像は固定サイズの重複しないパッチのシーケンスに分割されます。 +これらは線形に埋め込まれます。 [CLS] トークンは、イメージ全体の表現として機能するために追加されます。作家たち +また、絶対位置埋め込みを追加し、結果として得られるベクトルのシーケンスを標準の Transformer エンコーダに供給します。 +[`CLIPImageProcessor`] を使用して、モデルの画像のサイズ変更 (または再スケール) および正規化を行うことができます。 + +[`CLIPTokenizer`] はテキストのエンコードに使用されます。 [`CLIPProcessor`] はラップします +[`CLIPImageProcessor`] と [`CLIPTokenizer`] を両方の単一インスタンスに統合 +テキストをエンコードして画像を準備します。次の例は、次のメソッドを使用して画像とテキストの類似性スコアを取得する方法を示しています。 +[`CLIPProcessor`] と [`CLIPModel`]。 + +```python +>>> from PIL import Image +>>> import requests + +>>> from transformers import CLIPProcessor, CLIPModel + +>>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32") +>>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32") + +>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" +>>> image = Image.open(requests.get(url, stream=True).raw) + +>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True) + +>>> outputs = model(**inputs) +>>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score +>>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities +``` + +## Resources + +CLIP を使い始めるのに役立つ公式 Hugging Face およびコミュニティ (🌎 で示されている) リソースのリスト。 + +- [リモート センシング (衛星) 画像とキャプションを使用した CLIP の微調整](https://huggingface.co/blog/fine-tune-clip-rsicd)、[RSICD データセット] を使用して CLIP を微調整する方法に関するブログ投稿(https://github.com/201528014227051/RSICD_optimal) と、データ拡張によるパフォーマンスの変化の比較。 +- この [サンプル 
スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/contrastive-image-text) は、プレ- [COCO データセット](https://cocodataset.org/#home) を使用してトレーニングされたビジョンおよびテキスト エンコーダー。 + + + +- 画像キャプションのビーム検索による推論に事前トレーニング済み CLIP を使用する方法に関する [ノートブック](https://colab.research.google.com/drive/1tuoAC5F4sC7qid56Z0ap-stR3rwdk0ZV?usp=sharing)。 🌎 + +**画像検索** + +- 事前トレーニングされた CLIP を使用した画像検索と MRR (平均相互ランク) スコアの計算に関する [ノートブック](https://colab.research.google.com/drive/1bLVwVKpAndpEDHqjzxVPr_9nGrSbuOQd?usp=sharing)。 🌎 +- 画像の取得と類似性スコアの表示に関する [ノートブック](https://colab.research.google.com/github/deep-diver/image_search_with_natural_language/blob/main/notebooks/Image_Search_CLIP.ipynb)。 🌎 +- 多言語 CLIP を使用して画像とテキストを同じベクトル空間にマッピングする方法に関する [ノートブック](https://colab.research.google.com/drive/1xO-wC_m_GNzgjIBQ4a4znvQkvDoZJvH4?usp=sharing)。 🌎 +- を使用してセマンティック イメージ検索で CLIP を実行する方法に関する [ノートブック](https://colab.research.google.com/github/vivien000/clip-demo/blob/master/clip.ipynb#scrollTo=uzdFhRGqiWkR) [Unsplash](https://unsplash.com) および [TMDB](https://www.themoviedb.org/) データセット。 🌎 + +**説明可能性** + +- 入力トークンと画像セグメントの類似性を視覚化する方法に関する [ノートブック](https://colab.research.google.com/github/hila-chefer/Transformer-MM-Explainability/blob/main/CLIP_explainability.ipynb)。 🌎 + +ここに含めるリソースの送信に興味がある場合は、お気軽にプル リクエストを開いてください。審査させていただきます。 +リソースは、既存のリソースを複製するのではなく、何か新しいものを示すことが理想的です。 + +## CLIPConfig + +[[autodoc]] CLIPConfig + - from_text_vision_configs + +## CLIPTextConfig + +[[autodoc]] CLIPTextConfig + +## CLIPVisionConfig + +[[autodoc]] CLIPVisionConfig + +## CLIPTokenizer + +[[autodoc]] CLIPTokenizer + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - save_vocabulary + +## CLIPTokenizerFast + +[[autodoc]] CLIPTokenizerFast + +## CLIPImageProcessor + +[[autodoc]] CLIPImageProcessor + - preprocess + +## CLIPFeatureExtractor + +[[autodoc]] CLIPFeatureExtractor + +## CLIPProcessor + +[[autodoc]] CLIPProcessor + + + + +## CLIPModel + +[[autodoc]] CLIPModel + - forward + - get_text_features + - get_image_features + +## CLIPTextModel + +[[autodoc]] CLIPTextModel + - forward + +## CLIPTextModelWithProjection + +[[autodoc]] CLIPTextModelWithProjection + - forward + +## CLIPVisionModelWithProjection + +[[autodoc]] CLIPVisionModelWithProjection + - forward + +## CLIPVisionModel + +[[autodoc]] CLIPVisionModel + - forward + + + + +## TFCLIPModel + +[[autodoc]] TFCLIPModel + - call + - get_text_features + - get_image_features + +## TFCLIPTextModel + +[[autodoc]] TFCLIPTextModel + - call + +## TFCLIPVisionModel + +[[autodoc]] TFCLIPVisionModel + - call + + + + +## FlaxCLIPModel + +[[autodoc]] FlaxCLIPModel + - __call__ + - get_text_features + - get_image_features + +## FlaxCLIPTextModel + +[[autodoc]] FlaxCLIPTextModel + - __call__ + +## FlaxCLIPTextModelWithProjection + +[[autodoc]] FlaxCLIPTextModelWithProjection + - __call__ + +## FlaxCLIPVisionModel + +[[autodoc]] FlaxCLIPVisionModel + - __call__ + + + diff --git a/docs/source/ja/model_doc/clipseg.md b/docs/source/ja/model_doc/clipseg.md new file mode 100644 index 00000000000000..c8bdb0a0e4789b --- /dev/null +++ b/docs/source/ja/model_doc/clipseg.md @@ -0,0 +1,104 @@ + + +# CLIPSeg + +## Overview + +CLIPSeg モデルは、Timo Lüddecke, Alexander Ecker によって [Image Segmentation using Text and Image Prompts](https://arxiv.org/abs/2112.10003) で提案されました。 +そしてアレクサンダー・エッカー。 CLIPSeg は、ゼロショットおよびワンショット画像セグメンテーションのために、凍結された [CLIP](clip) モデルの上に最小限のデコーダを追加します。 + +論文の要約は次のとおりです。 + +*画像のセグメンテーションは通常、トレーニングによって解決されます。 +オブジェクト 
クラスの固定セットのモデル。後で追加のクラスやより複雑なクエリを組み込むとコストがかかります +これらの式を含むデータセットでモデルを再トレーニングする必要があるためです。ここでシステムを提案します +任意の情報に基づいて画像セグメンテーションを生成できます。 +テスト時にプロンプ​​トが表示されます。プロンプトはテキストまたは +画像。このアプローチにより、統一されたモデルを作成できます。 +3 つの一般的なセグメンテーション タスクについて (1 回トレーニング済み) +参照式のセグメンテーション、ゼロショット セグメンテーション、ワンショット セグメンテーションという明確な課題が伴います。 +CLIP モデルをバックボーンとして構築し、これをトランスベースのデコーダで拡張して、高密度なデータ通信を可能にします。 +予測。の拡張バージョンでトレーニングした後、 +PhraseCut データセット、私たちのシステムは、フリーテキスト プロンプトまたは +クエリを表す追加の画像。後者の画像ベースのプロンプトのさまざまなバリエーションを詳細に分析します。 +この新しいハイブリッド入力により、動的適応が可能になります。 +前述の 3 つのセグメンテーション タスクのみですが、 +テキストまたは画像をクエリするバイナリ セグメンテーション タスクに +定式化することができる。最後に、システムがうまく適応していることがわかりました +アフォーダンスまたはプロパティを含む一般化されたクエリ* + + + + CLIPSeg の概要。 元の論文から抜粋。 + +このモデルは、[nielsr](https://huggingface.co/nielsr) によって提供されました。 +元のコードは [ここ](https://github.com/timojl/clipseg) にあります。 + +## Usage tips + +- [`CLIPSegForImageSegmentation`] は、[`CLIPSegModel`] の上にデコーダを追加します。後者は [`CLIPModel`] と同じです。 +- [`CLIPSegForImageSegmentation`] は、テスト時に任意のプロンプトに基づいて画像セグメンテーションを生成できます。プロンプトはテキストのいずれかです +(`input_ids` としてモデルに提供される) または画像 (`conditional_pixel_values` としてモデルに提供される)。カスタムを提供することもできます +条件付き埋め込み (`conditional_embeddings`としてモデルに提供されます)。 + +## Resources + +CLIPSeg の使用を開始するのに役立つ、公式 Hugging Face およびコミュニティ (🌎 で示されている) リソースのリスト。ここに含めるリソースの送信に興味がある場合は、お気軽にプル リクエストを開いてください。審査させていただきます。リソースは、既存のリソースを複製するのではなく、何か新しいものを示すことが理想的です。 + + + +- [CLIPSeg を使用したゼロショット画像セグメンテーション](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/CLIPSeg/Zero_shot_image_segmentation_with_CLIPSeg.ipynb) を説明するノートブック。 + +## CLIPSegConfig + +[[autodoc]] CLIPSegConfig + - from_text_vision_configs + +## CLIPSegTextConfig + +[[autodoc]] CLIPSegTextConfig + +## CLIPSegVisionConfig + +[[autodoc]] CLIPSegVisionConfig + +## CLIPSegProcessor + +[[autodoc]] CLIPSegProcessor + +## CLIPSegModel + +[[autodoc]] CLIPSegModel + - forward + - get_text_features + - get_image_features + +## CLIPSegTextModel + +[[autodoc]] CLIPSegTextModel + - forward + +## CLIPSegVisionModel + +[[autodoc]] CLIPSegVisionModel + - forward + +## CLIPSegForImageSegmentation + +[[autodoc]] CLIPSegForImageSegmentation + - forward \ No newline at end of file diff --git a/docs/source/ja/model_doc/clvp.md b/docs/source/ja/model_doc/clvp.md new file mode 100644 index 00000000000000..0803f5e027ce81 --- /dev/null +++ b/docs/source/ja/model_doc/clvp.md @@ -0,0 +1,123 @@ + + +# CLVP + +## Overview + +CLVP (Contrastive Language-Voice Pretrained Transformer) モデルは、James Betker によって [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) で提案されました。 + +論文の要約は次のとおりです。 + +*近年、画像生成の分野は自己回帰変換器と DDPM の応用によって革命を起こしています。これらのアプローチは、画像生成のプロセスを段階的な確率的プロセスとしてモデル化し、大量のコンピューティングとデータを活用して画像の分布を学習します。パフォーマンスを向上させるこの方法論は、画像に限定される必要はありません。この論文では、画像生成ドメインの進歩を音声合成に適用する方法について説明します。その結果、表現力豊かなマルチ音声テキスト読み上げシステムである TorToise が誕生しました。 + + +このモデルは [Susnato Dhar](https://huggingface.co/susnato) によって提供されました。 +元のコードは [ここ](https://github.com/neonbjb/tortoise-tts) にあります。 + +## Usage tips + +1. CLVP は Tortoise TTS モデルの不可欠な部分です。 +2. CLVP を使用して、生成されたさまざまな音声候補を提供されたテキストと比較することができ、最良の音声トークンが拡散モデルに転送されます。 +3. Tortoise の使用には、[`ClvpModelForConditionalGeneration.generate()`] メソッドの使用を強くお勧めします。 +4. 
16 kHz を期待する他のオーディオ モデルとは対照的に、CLVP モデルはオーディオが 22.05 kHz でサンプリングされることを期待していることに注意してください。 + +## Brief Explanation: + +- [`ClvpTokenizer`] はテキスト入力をトークン化し、[`ClvpFeatureExtractor`] は目的のオーディオからログ メル スペクトログラムを抽出します。 +- [`ClvpConditioningEncoder`] は、これらのテキスト トークンとオーディオ表現を取得し、テキストとオーディオに基づいて条件付けされた埋め込みに変換します。 +- [`ClvpForCausalLM`] は、これらの埋め込みを使用して複数の音声候補を生成します。 +- 各音声候補は音声エンコーダ ([`ClvpEncoder`]) を通過してベクトル表現に変換され、テキスト エンコーダ ([`ClvpEncoder`]) はテキスト トークンを同じ潜在空間に変換します。 +- 最後に、各音声ベクトルをテキスト ベクトルと比較して、どの音声ベクトルがテキスト ベクトルに最も類似しているかを確認します。 +- [`ClvpModelForConditionalGeneration.generate()`] は、上記のすべてのロジックを 1 つのメソッドに圧縮します。 + +例 : + +```python +>>> import datasets +>>> from transformers import ClvpProcessor, ClvpModelForConditionalGeneration + +>>> # Define the Text and Load the Audio (We are taking an audio example from HuggingFace Hub using `datasets` library). +>>> text = "This is an example text." + +>>> ds = datasets.load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") +>>> ds = ds.cast_column("audio", datasets.Audio(sampling_rate=22050)) +>>> sample = ds[0]["audio"] + +>>> # Define processor and model. +>>> processor = ClvpProcessor.from_pretrained("susnato/clvp_dev") +>>> model = ClvpModelForConditionalGeneration.from_pretrained("susnato/clvp_dev") + +>>> # Generate processor output and model output. +>>> processor_output = processor(raw_speech=sample["array"], sampling_rate=sample["sampling_rate"], text=text, return_tensors="pt") +>>> generated_output = model.generate(**processor_output) +``` + + +## ClvpConfig + +[[autodoc]] ClvpConfig + - from_sub_model_configs + +## ClvpEncoderConfig + +[[autodoc]] ClvpEncoderConfig + +## ClvpDecoderConfig + +[[autodoc]] ClvpDecoderConfig + +## ClvpTokenizer + +[[autodoc]] ClvpTokenizer + - save_vocabulary + +## ClvpFeatureExtractor + +[[autodoc]] ClvpFeatureExtractor + - __call__ + +## ClvpProcessor + +[[autodoc]] ClvpProcessor + - __call__ + - decode + - batch_decode + +## ClvpModelForConditionalGeneration + +[[autodoc]] ClvpModelForConditionalGeneration + - forward + - generate + - get_text_features + - get_speech_features + +## ClvpForCausalLM + +[[autodoc]] ClvpForCausalLM + +## ClvpModel + +[[autodoc]] ClvpModel + +## ClvpEncoder + +[[autodoc]] ClvpEncoder + +## ClvpDecoder + +[[autodoc]] ClvpDecoder + diff --git a/docs/source/ja/model_doc/code_llama.md b/docs/source/ja/model_doc/code_llama.md new file mode 100644 index 00000000000000..4ba345b8d7b9c5 --- /dev/null +++ b/docs/source/ja/model_doc/code_llama.md @@ -0,0 +1,125 @@ + + +# CodeLlama + +## Overview + +Code Llama モデルはによって [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) で提案されました。 Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve. 
+論文の要約は次のとおりです。 + +*私たちは Code Llama をリリースします。これは Llama 2 に基づくコードの大規模言語モデル ファミリであり、オープン モデルの中で最先端のパフォーマンス、埋め込み機能、大規模な入力コンテキストのサポート、プログラミング タスクのゼロショット命令追従機能を提供します。 。幅広いアプリケーションをカバーするための複数のフレーバーを提供しています。基盤モデル (Code Llama)、Python 特化 (Code Llama - Python)、およびそれぞれ 7B、13B、および 34B パラメーターを備えた命令追従モデル (Code Llama - Instruct) です。すべてのモデルは 16,000 トークンのシーケンスでトレーニングされ、最大 100,000 トークンの入力で改善が見られます。 7B および 13B コード ラマとコード ラマ - 命令バリアントは、周囲のコンテンツに基づいた埋め込みをサポートします。 Code Llama は、いくつかのコード ベンチマークでオープン モデルの中で最先端のパフォーマンスに達し、HumanEval と MBPP でそれぞれ最大 53% と 55% のスコアを獲得しました。特に、Code Llama - Python 7B は HumanEval および MBPP 上で Llama 2 70B よりも優れたパフォーマンスを示し、すべてのモデルは MultiPL-E 上で公開されている他のすべてのモデルよりも優れています。私たちは、研究と商業利用の両方を許可する寛容なライセンスに基づいて Code Llama をリリースしています。* + +すべての Code Llama モデル チェックポイントを [こちら](https://huggingface.co/models?search=code_llama) で確認し、[codellama org](https://huggingface.co/codellama) で正式にリリースされたチェックポイントを確認してください。 + +このモデルは [ArthurZucker](https://huggingface.co/ArthurZ) によって提供されました。著者のオリジナルのコードは [こちら](https://github.com/facebookresearch/llama) にあります。 + +## Usage tips and examples + + + +Code Llama のベースとなる`Llama2`ファミリー モデルは、`bfloat16`を使用してトレーニングされましたが、元の推論では`float16`を使用します。さまざまな精度を見てみましょう。 + +* `float32`: モデルの初期化に関する PyTorch の規約では、モデルの重みがどの `dtype` で格納されたかに関係なく、モデルを `float32` にロードします。 「transformers」も、PyTorch との一貫性を保つためにこの規則に従っています。これはデフォルトで選択されます。 `AutoModel` API でストレージの重み付けタイプを使用してチェックポイントのロードをキャストする場合は、`torch_dtype="auto"` を指定する必要があります。 `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto")`。 +* `bfloat16`: コード Llama はこの精度でトレーニングされているため、さらなるトレーニングや微調整に使用することをお勧めします。 +* `float16`: この精度を使用して推論を実行することをお勧めします。通常は `bfloat16` より高速であり、評価メトリクスには `bfloat16` と比べて明らかな低下が見られないためです。 bfloat16 を使用して推論を実行することもできます。微調整後、float16 と bfloat16 の両方で推論結果を確認することをお勧めします。 + +上で述べたように、モデルを初期化するときに `torch_dtype="auto"` を使用しない限り、ストレージの重みの `dtype` はほとんど無関係です。その理由は、モデルが最初にダウンロードされ (オンラインのチェックポイントの `dtype` を使用)、次に `torch` のデフォルトの `dtype` にキャストされるためです (`torch.float32` になります)。指定された `torch_dtype` がある場合は、代わりにそれが使用されます。 + + + +チップ: +- 充填タスクはすぐにサポートされます。入力を埋めたい場所には `tokenizer.fill_token` を使用する必要があります。 +- モデル変換スクリプトは、`Llama2` ファミリの場合と同じです。 + +使用例は次のとおりです。 + +```bash +python src/transformers/models/llama/convert_llama_weights_to_hf.py \ + --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path +``` + +スクリプトを実行するには、(最大のバージョンであっても) float16 精度でモデル全体をホストするのに十分な CPU RAM が必要であることに注意してください。 +いくつかのチェックポイントがあり、それぞれにモデルの各重みの一部が含まれているため、すべてを RAM にロードする必要があります)。 + +変換後、モデルとトークナイザーは次の方法でロードできます。 + +```python +>>> from transformers import LlamaForCausalLM, CodeLlamaTokenizer + +>>> tokenizer = CodeLlamaTokenizer.from_pretrained("codellama/CodeLlama-7b-hf") +>>> model = LlamaForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf") +>>> PROMPT = '''def remove_non_ascii(s: str) -> str: + """ + return result +''' +>>> input_ids = tokenizer(PROMPT, return_tensors="pt")["input_ids"] +>>> generated_ids = model.generate(input_ids, max_new_tokens=128) + +>>> filling = tokenizer.batch_decode(generated_ids[:, input_ids.shape[1]:], skip_special_tokens = True)[0] +>>> print(PROMPT.replace("", filling)) +def remove_non_ascii(s: str) -> str: + """ Remove non-ASCII characters from a string. + + Args: + s: The string to remove non-ASCII characters from. + + Returns: + The string with non-ASCII characters removed. 
+ """ + result = "" + for c in s: + if ord(c) < 128: + result += c + return result +``` + +塗りつぶされた部分だけが必要な場合: + +```python +>>> from transformers import pipeline +>>> import torch + +>>> generator = pipeline("text-generation",model="codellama/CodeLlama-7b-hf",torch_dtype=torch.float16, device_map="auto") +>>> generator('def remove_non_ascii(s: str) -> str:\n """ \n return result', max_new_tokens = 128, return_type = 1) +``` + +内部では、トークナイザーが [`` によって自動的に分割](https://huggingface.co/docs/transformers/main/model_doc/code_llama#transformers.CodeLlamaTokenizer.fill_token) して、[ に続く書式設定された入力文字列を作成します。オリジナルのトレーニング パターン](https://github.com/facebookresearch/codellama/blob/cb51c14ec761370ba2e2bc351374a79265d0465e/llama/generation.py#L402)。これは、パターンを自分で準備するよりも堅牢です。トークンの接着など、デバッグが非常に難しい落とし穴を回避できます。このモデルまたは他のモデルに必要な CPU および GPU メモリの量を確認するには、その値を決定するのに役立つ [この計算ツール](https://huggingface.co/spaces/hf-accelerate/model-memory-usage) を試してください。 + +LLaMA トークナイザーは、[sentencepiece](https://github.com/google/sentencepiece) に基づく BPE モデルです。センテンスピースの癖の 1 つは、シーケンスをデコードするときに、最初のトークンが単語の先頭 (例: 「Banana」) である場合、トークナイザーは文字列の先頭にプレフィックス スペースを追加しないことです。 + + + +コード Llama は、`Llama2` モデルと同じアーキテクチャを持っています。API リファレンスについては、[Llama2 のドキュメント ページ](llama2) を参照してください。 +以下の Code Llama トークナイザーのリファレンスを見つけてください。 + + +## CodeLlamaTokenizer + +[[autodoc]] CodeLlamaTokenizer + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - save_vocabulary + +## CodeLlamaTokenizerFast + +[[autodoc]] CodeLlamaTokenizerFast + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - update_post_processor + - save_vocabulary diff --git a/docs/source/ja/model_doc/codegen.md b/docs/source/ja/model_doc/codegen.md new file mode 100644 index 00000000000000..78caefe043319b --- /dev/null +++ b/docs/source/ja/model_doc/codegen.md @@ -0,0 +1,90 @@ + + +# CodeGen + +## Overview + + +CodeGen モデルは、[A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) で Erik Nijkamp、Bo Pang、林宏明、Lifu Tu、Huan Wang、Yingbo Zhou、Silvio Savarese、Caiming Xiong およびカイミン・ションさん。 + +CodeGen は、[The Pile](https://pile.eleuther.ai/)、BigQuery、BigPython で順次トレーニングされたプログラム合成用の自己回帰言語モデルです。 + +論文の要約は次のとおりです。 + +*プログラム合成は、与えられた問題仕様の解決策としてコンピューター プログラムを生成することを目的としています。我々は、大規模な言語モデルを介した会話型プログラム合成アプローチを提案します。これは、従来のアプローチで直面した広大なプログラム空間とユーザーの意図の仕様を検索するという課題に対処します。私たちの新しいアプローチでは、仕様とプログラムを作成するプロセスを、ユーザーとシステムの間の複数回の対話として捉えます。これはプログラム合成をシーケンス予測問題として扱い、仕様が自然言語で表現され、目的のプログラムが条件付きでサンプリングされます。私たちは、自然言語とプログラミング言語のデータに基づいて、CodeGen と呼ばれる大規模な言語モデルのファミリーをトレーニングします。データの監視が弱く、データ サイズとモデル サイズが拡大すると、単純な自己回帰言語モデリングから会話能力が生まれます。会話型プログラム合成におけるモデルの動作を研究するために、マルチターン プログラミング ベンチマーク (MTPB) を開発します。このベンチマークでは、各問題を解決するには、ユーザーとモデル間のマルチターン会話を介したマルチステップ合成が必要です。私たちの調査結果は、会話機能の出現と、提案されている会話プログラム合成パラダイムの有効性を示しています。さらに、私たちのモデル CodeGen (TPU-v4 でトレーニングされた最大 16B パラメーターを含む) は、HumanEval ベンチマークで OpenAI の Codex を上回ります。私たちはチェックポイントを含むトレーニング ライブラリ JaxFormer をオープン ソースのコントリビューションとして利用できるようにしています: [この https URL](https://github.com/salesforce/codegen)*。 + +このモデルは [林 宏明](https://huggingface.co/rooa) によって寄稿されました。 +元のコードは [ここ](https://github.com/salesforce/codegen) にあります。 + +## Checkpoint Naming + +* CodeGen モデル [チェックポイント](https://huggingface.co/models?other=codegen) は、可変サイズのさまざまな事前トレーニング データで利用できます。 +* 形式は「Salesforce/codegen-{size}-{data}」です。ここで、 + * `size`: `350M`、`2B`、`6B`、`16B` + * `data`: + * `nl`: パイルで事前トレーニング済み + * `multi`: `nl` で初期化され、複数のプログラミング言語データでさらに事前トレーニングされます。 + * `mono`: `multi` で初期化され、Python データでさらに事前トレーニングされます。 +* 
たとえば、`Salesforce/codegen-350M-mono` は、Pile、複数のプログラミング言語、および Python で順次事前トレーニングされた 3 億 5,000 万のパラメーターのチェックポイントを提供します。 + +## Usage example + +```python +>>> from transformers import AutoModelForCausalLM, AutoTokenizer + +>>> checkpoint = "Salesforce/codegen-350M-mono" +>>> model = AutoModelForCausalLM.from_pretrained(checkpoint) +>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint) + +>>> text = "def hello_world():" + +>>> completion = model.generate(**tokenizer(text, return_tensors="pt")) + +>>> print(tokenizer.decode(completion[0])) +def hello_world(): + print("Hello World") + +hello_world() +``` + +## Resources + +- [因果言語モデリング タスク ガイド](../tasks/language_modeling) + +## CodeGenConfig + +[[autodoc]] CodeGenConfig + - all + +## CodeGenTokenizer + +[[autodoc]] CodeGenTokenizer + - save_vocabulary + +## CodeGenTokenizerFast + +[[autodoc]] CodeGenTokenizerFast + +## CodeGenModel + +[[autodoc]] CodeGenModel + - forward + +## CodeGenForCausalLM + +[[autodoc]] CodeGenForCausalLM + - forward diff --git a/docs/source/ja/model_doc/conditional_detr.md b/docs/source/ja/model_doc/conditional_detr.md new file mode 100644 index 00000000000000..4ef09f0b6d898e --- /dev/null +++ b/docs/source/ja/model_doc/conditional_detr.md @@ -0,0 +1,75 @@ + + +# Conditional DETR + +## Overview + +条件付き DETR モデルは、[Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) で Depu Meng、Xiaokang Chen、Zejia Fan、Gang Zeng、Houqiang Li、Yuhui Yuan、Lei Sun, Jingdong Wang によって提案されました。王京東。条件付き DETR は、高速 DETR トレーニングのための条件付きクロスアテンション メカニズムを提供します。条件付き DETR は DETR よりも 6.7 倍から 10 倍速く収束します。 + +論文の要約は次のとおりです。 + +*最近開発された DETR アプローチは、トランスフォーマー エンコーダーおよびデコーダー アーキテクチャを物体検出に適用し、有望なパフォーマンスを実現します。この論文では、トレーニングの収束が遅いという重要な問題を扱い、高速 DETR トレーニングのための条件付きクロスアテンション メカニズムを紹介します。私たちのアプローチは、DETR におけるクロスアテンションが 4 つの四肢の位置特定とボックスの予測にコンテンツの埋め込みに大きく依存しているため、高品質のコンテンツの埋め込みの必要性が高まり、トレーニングの難易度が高くなるという点に動機づけられています。条件付き DETR と呼ばれる私たちのアプローチは、デコーダーのマルチヘッド クロスアテンションのためにデコーダーの埋め込みから条件付きの空間クエリを学習します。利点は、条件付き空間クエリを通じて、各クロスアテンション ヘッドが、個別の領域 (たとえば、1 つのオブジェクトの端またはオブジェクト ボックス内の領域) を含むバンドに注目できることです。これにより、オブジェクト分類とボックス回帰のための個別の領域をローカライズするための空間範囲が狭まり、コンテンツの埋め込みへの依存が緩和され、トレーニングが容易になります。実験結果は、条件付き DETR がバックボーン R50 および R101 で 6.7 倍速く収束し、より強力なバックボーン DC5-R50 および DC5-R101 で 10 倍速く収束することを示しています。コードは https://github.com/Atten4Vis/ConditionalDETR で入手できます。* + + + + 条件付き DETR は、元の DETR に比べてはるかに速い収束を示します。 元の論文から引用。 + +このモデルは [DepuMeng](https://huggingface.co/DepuMeng) によって寄稿されました。元のコードは [ここ](https://github.com/Atten4Vis/ConditionalDETR) にあります。 + +## Resources + +- [オブジェクト検出タスクガイド](../tasks/object_detection) + +## ConditionalDetrConfig + +[[autodoc]] ConditionalDetrConfig + +## ConditionalDetrImageProcessor + +[[autodoc]] ConditionalDetrImageProcessor + - preprocess + - post_process_object_detection + - post_process_instance_segmentation + - post_process_semantic_segmentation + - post_process_panoptic_segmentation + +## ConditionalDetrFeatureExtractor + +[[autodoc]] ConditionalDetrFeatureExtractor + - __call__ + - post_process_object_detection + - post_process_instance_segmentation + - post_process_semantic_segmentation + - post_process_panoptic_segmentation + +## ConditionalDetrModel + +[[autodoc]] ConditionalDetrModel + - forward + +## ConditionalDetrForObjectDetection + +[[autodoc]] ConditionalDetrForObjectDetection + - forward + +## ConditionalDetrForSegmentation + +[[autodoc]] ConditionalDetrForSegmentation + - forward + + \ No newline at end of file diff --git a/docs/source/ja/model_doc/convbert.md b/docs/source/ja/model_doc/convbert.md new file mode 100644 
index 00000000000000..5d15f86c5136b7 --- /dev/null +++ b/docs/source/ja/model_doc/convbert.md @@ -0,0 +1,145 @@ + + +# ConvBERT + +
+ +## Overview + +ConvBERT モデルは、[ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) で Zihang Jiang、Weihao Yu、Daquan Zhou、Yunpeng Chen、Jiashi Feng、Shuicheng Yan によって提案されました。 +やん。 + +論文の要約は次のとおりです。 + +*BERT やそのバリアントなどの事前トレーニング済み言語モデルは、最近、さまざまな環境で目覚ましいパフォーマンスを達成しています。 +自然言語理解タスク。ただし、BERT はグローバルな自己注意ブロックに大きく依存しているため、問題が発生します。 +メモリ使用量と計算コストが大きくなります。すべての注意が入力シーケンス全体に対してクエリを実行しますが、 +グローバルな観点からアテンション マップを生成すると、一部のヘッドはローカルな依存関係のみを学習する必要があることがわかります。 +これは、計算の冗長性が存在することを意味します。したがって、我々は、新しいスパンベースの動的畳み込みを提案します。 +これらのセルフアテンション ヘッドを置き換えて、ローカルの依存関係を直接モデル化します。新しいコンボリューションヘッドと、 +自己注意の頭を休め、グローバルとローカルの両方の状況でより効率的な新しい混合注意ブロックを形成します +学ぶ。この混合注意設計を BERT に装備し、ConvBERT モデルを構築します。実験でわかったことは、 +ConvBERT は、トレーニング コストが低く、さまざまな下流タスクにおいて BERT およびその亜種よりも大幅に優れたパフォーマンスを発揮します。 +モデルパラメータが少なくなります。注目すべきことに、ConvBERTbase モデルは 86.4 GLUE スコアを達成し、ELECTRAbase よりも 0.7 高いのに対し、 +トレーニングコストは 1/4 未満です。コードと事前トレーニングされたモデルがリリースされます。* + +このモデルは、[abhishek](https://huggingface.co/abhishek) によって提供されました。オリジナルの実装が見つかります +ここ: https://github.com/yitu-opensource/ConvBert + +## Usage tips + +ConvBERT トレーニングのヒントは BERT のヒントと似ています。使用上のヒントについては、[BERT ドキュメント](bert) を参照してください。 + +## Resources + +- [テキスト分類タスクガイド](../tasks/sequence_classification) +- [トークン分類タスクガイド](../tasks/token_classification) +- [質問回答タスク ガイド](../tasks/question_answering) +- [マスクされた言語モデリング タスク ガイド](../tasks/masked_lang_modeling) +- [多肢選択タスク ガイド](../tasks/multiple_choice) + +## ConvBertConfig + +[[autodoc]] ConvBertConfig + +## ConvBertTokenizer + +[[autodoc]] ConvBertTokenizer + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - save_vocabulary + +## ConvBertTokenizerFast + +[[autodoc]] ConvBertTokenizerFast + + + + +## ConvBertModel + +[[autodoc]] ConvBertModel + - forward + +## ConvBertForMaskedLM + +[[autodoc]] ConvBertForMaskedLM + - forward + +## ConvBertForSequenceClassification + +[[autodoc]] ConvBertForSequenceClassification + - forward + +## ConvBertForMultipleChoice + +[[autodoc]] ConvBertForMultipleChoice + - forward + +## ConvBertForTokenClassification + +[[autodoc]] ConvBertForTokenClassification + - forward + +## ConvBertForQuestionAnswering + +[[autodoc]] ConvBertForQuestionAnswering + - forward + + + + +## TFConvBertModel + +[[autodoc]] TFConvBertModel + - call + +## TFConvBertForMaskedLM + +[[autodoc]] TFConvBertForMaskedLM + - call + +## TFConvBertForSequenceClassification + +[[autodoc]] TFConvBertForSequenceClassification + - call + +## TFConvBertForMultipleChoice + +[[autodoc]] TFConvBertForMultipleChoice + - call + +## TFConvBertForTokenClassification + +[[autodoc]] TFConvBertForTokenClassification + - call + +## TFConvBertForQuestionAnswering + +[[autodoc]] TFConvBertForQuestionAnswering + - call + + + diff --git a/docs/source/ja/model_doc/convnext.md b/docs/source/ja/model_doc/convnext.md new file mode 100644 index 00000000000000..4386a7df8ceadb --- /dev/null +++ b/docs/source/ja/model_doc/convnext.md @@ -0,0 +1,94 @@ + + +# ConvNeXT + +## Overview + +ConvNeXT モデルは、[A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) で Zhuang Liu、Hanzi Mao、Chao-Yuan Wu、Christoph Feichtenhofer、Trevor Darrell、Saining Xie によって提案されました。 +ConvNeXT は、ビジョン トランスフォーマーの設計からインスピレーションを得た純粋な畳み込みモデル (ConvNet) であり、ビジョン トランスフォーマーよりも優れたパフォーマンスを発揮すると主張しています。 + +論文の要約は次のとおりです。 + +*視覚認識の「狂騒の 20 年代」は、最先端の画像分類モデルとして ConvNet にすぐに取って代わられた Vision Transformers (ViT) の導入から始まりました。 +一方、バニラ ViT は、オブジェクト検出やセマンティック セグメンテーションなどの一般的なコンピューター ビジョン タスクに適用すると困難に直面します。階層型トランスフォーマーです +(Swin Transformers など) は、いくつかの 
ConvNet の以前の機能を再導入し、Transformers を汎用ビジョン バックボーンとして実用的に可能にし、幅広い環境で顕著なパフォーマンスを実証しました。 +さまざまな視覚タスク。ただし、このようなハイブリッド アプローチの有効性は、依然として、固有の誘導性ではなく、トランスフォーマーの本質的な優位性によるところが大きいと考えられています。 +畳み込みのバイアス。この作業では、設計空間を再検討し、純粋な ConvNet が達成できる限界をテストします。標準 ResNet を設計に向けて徐々に「最新化」します。 +ビジョン Transformer の概要を確認し、途中でパフォーマンスの違いに寄与するいくつかの重要なコンポーネントを発見します。この調査の結果は、純粋な ConvNet モデルのファミリーです。 +ConvNextと呼ばれます。 ConvNeXts は完全に標準の ConvNet モジュールから構築されており、精度と拡張性の点で Transformers と有利に競合し、87.8% の ImageNet トップ 1 精度を達成しています。 +標準 ConvNet のシンプルさと効率を維持しながら、COCO 検出と ADE20K セグメンテーションでは Swin Transformers よりも優れたパフォーマンスを発揮します。* + + + + ConvNeXT アーキテクチャ。 元の論文から抜粋。 + +このモデルは、[nielsr](https://huggingface.co/nielsr) によって提供されました。 TensorFlow バージョンのモデルは [ariG23498](https://github.com/ariG23498) によって提供されました。 +[gante](https://github.com/gante)、および [sayakpaul](https://github.com/sayakpaul) (同等の貢献)。元のコードは [こちら](https://github.com/facebookresearch/ConvNeXt) にあります。 + +## Resources + +ConvNeXT の使用を開始するのに役立つ公式 Hugging Face およびコミュニティ (🌎 で示される) リソースのリスト。 + + + +- [`ConvNextForImageClassification`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb)。 +- 参照: [画像分類タスク ガイド](../tasks/image_classification) + +ここに含めるリソースの送信に興味がある場合は、お気軽にプル リクエストを開いてください。審査させていただきます。リソースは、既存のリソースを複製するのではなく、何か新しいものを示すことが理想的です。 + +## ConvNextConfig + +[[autodoc]] ConvNextConfig + +## ConvNextFeatureExtractor + +[[autodoc]] ConvNextFeatureExtractor + +## ConvNextImageProcessor + +[[autodoc]] ConvNextImageProcessor + - preprocess + + + + +## ConvNextModel + +[[autodoc]] ConvNextModel + - forward + +## ConvNextForImageClassification + +[[autodoc]] ConvNextForImageClassification + - forward + + + + +## TFConvNextModel + +[[autodoc]] TFConvNextModel + - call + +## TFConvNextForImageClassification + +[[autodoc]] TFConvNextForImageClassification + - call + + + \ No newline at end of file diff --git a/docs/source/ja/model_doc/convnextv2.md b/docs/source/ja/model_doc/convnextv2.md new file mode 100644 index 00000000000000..9e4d54df24b1c1 --- /dev/null +++ b/docs/source/ja/model_doc/convnextv2.md @@ -0,0 +1,68 @@ + + +# ConvNeXt V2 + +## Overview + +ConvNeXt V2 モデルは、Sanghyun Woo、Shobhik Debnath、Ronghang Hu、Xinlei Chen、Zhuang Liu, In So Kweon, Saining Xie. 
によって [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) で提案されました。 +ConvNeXt V2 は、Vision Transformers の設計からインスピレーションを得た純粋な畳み込みモデル (ConvNet) であり、[ConvNeXT](convnext) の後継です。 + +論文の要約は次のとおりです。 + +*アーキテクチャの改善と表現学習フレームワークの改善により、視覚認識の分野は 2020 年代初頭に急速な近代化とパフォーマンスの向上を実現しました。たとえば、ConvNeXt に代表される最新の ConvNet は、さまざまなシナリオで強力なパフォーマンスを実証しています。これらのモデルはもともと ImageNet ラベルを使用した教師あり学習用に設計されましたが、マスク オートエンコーダー (MAE) などの自己教師あり学習手法からも潜在的に恩恵を受けることができます。ただし、これら 2 つのアプローチを単純に組み合わせると、パフォーマンスが標準以下になることがわかりました。この論文では、完全畳み込みマスク オートエンコーダ フレームワークと、チャネル間の機能競合を強化するために ConvNeXt アーキテクチャに追加できる新しい Global Response Normalization (GRN) 層を提案します。この自己教師あり学習手法とアーキテクチャの改善の共同設計により、ConvNeXt V2 と呼ばれる新しいモデル ファミリが誕生しました。これにより、ImageNet 分類、COCO 検出、ADE20K セグメンテーションなどのさまざまな認識ベンチマークにおける純粋な ConvNet のパフォーマンスが大幅に向上します。また、ImageNet でトップ 1 の精度 76.7% を誇る効率的な 370 万パラメータの Atto モデルから、最先端の 88.9% を達成する 650M Huge モデルまで、さまざまなサイズの事前トレーニング済み ConvNeXt V2 モデルも提供しています。公開トレーニング データのみを使用した精度*。 + + + + ConvNeXt V2 アーキテクチャ。 元の論文から抜粋。 + +このモデルは [adirik](https://huggingface.co/adirik) によって提供されました。元のコードは [こちら](https://github.com/facebookresearch/ConvNeXt-V2) にあります。 + +## Resources + +ConvNeXt V2 の使用を開始するのに役立つ公式 Hugging Face およびコミュニティ (🌎 で示される) リソースのリスト。 + + + +- [`ConvNextV2ForImageClassification`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb)。 + +ここに含めるリソースの送信に興味がある場合は、お気軽にプル リクエストを開いてください。審査させていただきます。リソースは、既存のリソースを複製するのではなく、何か新しいものを示すことが理想的です。 + +## ConvNextV2Config + +[[autodoc]] ConvNextV2Config + +## ConvNextV2Model + +[[autodoc]] ConvNextV2Model + - forward + +## ConvNextV2ForImageClassification + +[[autodoc]] ConvNextV2ForImageClassification + - forward + +## TFConvNextV2Model + +[[autodoc]] TFConvNextV2Model + - call + + +## TFConvNextV2ForImageClassification + +[[autodoc]] TFConvNextV2ForImageClassification + - call diff --git a/docs/source/ja/model_doc/cpm.md b/docs/source/ja/model_doc/cpm.md new file mode 100644 index 00000000000000..afac35823e641a --- /dev/null +++ b/docs/source/ja/model_doc/cpm.md @@ -0,0 +1,54 @@ + + +# CPM + +## Overview + +CPM モデルは、Zhengyan Zhang、Xu Han、Hao Zhou、Pei Ke、Yuxian Gu によって [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) で提案されました。葉徳明、秦裕佳、 +Yusheng Su、Haozhe Ji、Jian Guan、Fanchao Qi、Xiaozi Wang、Yanan Zheng、Guoyang Zeng、Huanqi Cao、Shengqi Chen、 +Daixuan Li、Zhenbo Sun、Zhiyuan Liu、Minlie Huang、Wentao Han、Jie Tang、Juanzi Li、Xiaoyan Zhu、Maosong Sun。 + +論文の要約は次のとおりです。 + +*事前トレーニングされた言語モデル (PLM) は、さまざまな下流の NLP タスクに有益であることが証明されています。最近ではGPT-3、 +1,750億個のパラメータと570GBの学習データを備え、数回の撮影(1枚でも)の容量で大きな注目を集めました +ゼロショット)学習。ただし、GPT-3 を適用して中国語の NLP タスクに対処することは依然として困難です。 +GPT-3 の言語は主に英語であり、パラメーターは公開されていません。この技術レポートでは、 +大規模な中国語トレーニング データに対する生成的事前トレーニングを備えた中国語事前トレーニング済み言語モデル (CPM)。最高に +私たちの知識の限りでは、26 億のパラメータと 100GB の中国語トレーニング データを備えた CPM は、事前トレーニングされた中国語としては最大のものです。 +言語モデルは、会話、エッセイの作成、 +クローゼテストと言語理解。広範な実験により、CPM が多くの環境で優れたパフォーマンスを達成できることが実証されています。 +少数ショット (ゼロショットでも) 学習の設定での NLP タスク。* + +このモデルは [canwenxu](https://huggingface.co/canwenxu) によって提供されました。オリジナルの実装が見つかります +ここ: https://github.com/TsinghuaAI/CPM-Generate + + + + +CPM のアーキテクチャは、トークン化方法を除いて GPT-2 と同じです。詳細については、[GPT-2 ドキュメント](openai-community/gpt2) を参照してください。 +API リファレンス情報。 + + + +## CpmTokenizer + +[[autodoc]] CpmTokenizer + +## CpmTokenizerFast + +[[autodoc]] CpmTokenizerFast diff --git 
a/docs/source/ja/model_doc/cpmant.md b/docs/source/ja/model_doc/cpmant.md new file mode 100644 index 00000000000000..ca1f65caa16c04 --- /dev/null +++ b/docs/source/ja/model_doc/cpmant.md @@ -0,0 +1,47 @@ + + +# CPMAnt + +## Overview + +CPM-Ant は、10B パラメータを備えたオープンソースの中国語の事前トレーニング済み言語モデル (PLM) です。これは、CPM-Live のライブ トレーニング プロセスの最初のマイルストーンでもあります。トレーニングプロセスは費用対効果が高く、環境に優しいものです。 CPM-Ant は、CUGE ベンチマークでのデルタ チューニングでも有望な結果を達成しています。フル モデルに加えて、さまざまなハードウェア構成の要件を満たすさまざまな圧縮バージョンも提供しています。 [詳細を見る](https://github.com/OpenBMB/CPM-Live/tree/cpm-ant/cpm-live) + +このモデルは [OpenBMB](https://huggingface.co/openbmb) によって提供されました。元のコードは [ここ](https://github.com/OpenBMB/CPM-Live/tree/cpm-ant/cpm-live) にあります。 + +## Resources + +- [CPM-Live](https://github.com/OpenBMB/CPM-Live/tree/cpm-ant/cpm-live) に関するチュートリアル。 + +## CpmAntConfig + +[[autodoc]] CpmAntConfig + - all + +## CpmAntTokenizer + +[[autodoc]] CpmAntTokenizer + - all + +## CpmAntModel + +[[autodoc]] CpmAntModel + - all + +## CpmAntForCausalLM + +[[autodoc]] CpmAntForCausalLM + - all \ No newline at end of file diff --git a/docs/source/ja/model_doc/ctrl.md b/docs/source/ja/model_doc/ctrl.md new file mode 100644 index 00000000000000..f93345d30e79bc --- /dev/null +++ b/docs/source/ja/model_doc/ctrl.md @@ -0,0 +1,113 @@ + + +# CTRL + +
+ +## Overview + +CTRL モデルは、Nitish Shirish Keskar*、Bryan McCann*、Lav R. Varshney、Caiming Xiong, Richard Socher によって [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) で提案されました。 +リチャード・ソーチャー。これは、非常に大規模なコーパスの言語モデリングを使用して事前トレーニングされた因果的 (一方向) トランスフォーマーです +最初のトークンが制御コード (リンク、書籍、Wikipedia など) として予約されている、約 140 GB のテキスト データ。 + +論文の要約は次のとおりです。 + +*大規模な言語モデルは有望なテキスト生成機能を示していますが、ユーザーは特定の言語モデルを簡単に制御できません +生成されたテキストの側面。 16 億 3,000 万パラメータの条件付きトランスフォーマー言語モデルである CTRL をリリースします。 +スタイル、コンテンツ、タスク固有の動作を制御する制御コードを条件付けるように訓練されています。制御コードは +生のテキストと自然に共生する構造から派生し、教師なし学習の利点を維持しながら、 +テキスト生成をより明示的に制御できるようになります。これらのコードを使用すると、CTRL でどの部分が予測されるのかを予測することもできます。 +トレーニング データにはシーケンスが与えられる可能性が最も高くなります。これにより、大量のデータを分析するための潜在的な方法が提供されます。 +モデルベースのソース帰属を介して。* + +このモデルは、[keskarnitishr](https://huggingface.co/keskarnitishr) によって提供されました。元のコードが見つかる +[こちら](https://github.com/salesforce/Salesforce/ctrl)。 + +## Usage tips + +- CTRL は制御コードを利用してテキストを生成します。生成を特定の単語や文で開始する必要があります。 + またはリンクして一貫したテキストを生成します。 [元の実装](https://github.com/salesforce/Salesforce/ctrl) を参照してください。 + 詳しくは。 +- CTRL は絶対位置埋め込みを備えたモデルであるため、通常は入力を右側にパディングすることをお勧めします。 + 左。 +- CTRL は因果言語モデリング (CLM) の目的でトレーニングされているため、次の予測に強力です。 + シーケンス内のトークン。この機能を利用すると、CTRL は構文的に一貫したテキストを生成できるようになります。 + *run_generation.py* サンプル スクリプトで確認できます。 +- PyTorch モデルは、以前に計算されたキーと値のアテンション ペアである`past_key_values`を入力として受け取ることができます。 + TensorFlow モデルは`past`を入力として受け入れます。 `past_key_values`値を使用すると、モデルが再計算されなくなります。 + テキスト生成のコンテキストで事前に計算された値。 [`forward`](model_doc/ctrl#transformers.CTRLModel.forward) を参照してください。 + この引数の使用法の詳細については、メソッドを参照してください。 + +## Resources + +- [テキスト分類タスクガイド](../tasks/sequence_classification) +- [因果言語モデリング タスク ガイド](../tasks/language_modeling) + +## CTRLConfig + +[[autodoc]] CTRLConfig + +## CTRLTokenizer + +[[autodoc]] CTRLTokenizer + - save_vocabulary + + + + +## CTRLModel + +[[autodoc]] CTRLModel + - forward + +## CTRLLMHeadModel + +[[autodoc]] CTRLLMHeadModel + - forward + +## CTRLForSequenceClassification + +[[autodoc]] CTRLForSequenceClassification + - forward + + + + +## TFCTRLModel + +[[autodoc]] TFCTRLModel + - call + +## TFCTRLLMHeadModel + +[[autodoc]] TFCTRLLMHeadModel + - call + +## TFCTRLForSequenceClassification + +[[autodoc]] TFCTRLForSequenceClassification + - call + + + diff --git a/docs/source/ja/model_doc/cvt.md b/docs/source/ja/model_doc/cvt.md new file mode 100644 index 00000000000000..16d39d1b55d35c --- /dev/null +++ b/docs/source/ja/model_doc/cvt.md @@ -0,0 +1,88 @@ + + +# Convolutional Vision Transformer (CvT) + +## Overview + +CvT モデルは、Haping Wu、Bin Xiao、Noel Codella、Mengchen Liu、Xiyang Dai、Lu Yuan、Lei Zhang によって [CvT: Introduction Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) で提案されました。畳み込みビジョン トランスフォーマー (CvT) は、ViT に畳み込みを導入して両方の設計の長所を引き出すことにより、[ビジョン トランスフォーマー (ViT)](vit) のパフォーマンスと効率を向上させます。 + +論文の要約は次のとおりです。 + +*この論文では、ビジョン トランスフォーマー (ViT) を改善する、畳み込みビジョン トランスフォーマー (CvT) と呼ばれる新しいアーキテクチャを紹介します。 +ViT に畳み込みを導入して両方の設計の長所を引き出すことで、パフォーマンスと効率を向上させます。これは次のようにして実現されます。 +2 つの主要な変更: 新しい畳み込みトークンの埋め込みを含むトランスフォーマーの階層と、畳み込みトランスフォーマー +畳み込み射影を利用したブロック。これらの変更により、畳み込みニューラル ネットワーク (CNN) の望ましい特性が導入されます。 +トランスフォーマーの利点 (動的な注意力、 +グローバルなコンテキストとより良い一般化)。私たちは広範な実験を実施することで CvT を検証し、このアプローチが達成できることを示しています。 +ImageNet-1k 上の他のビジョン トランスフォーマーや ResNet よりも、パラメータが少なく、FLOP が低い、最先端のパフォーマンスを実現します。加えて、 +より大きなデータセット (例: ImageNet-22k) で事前トレーニングし、下流のタスクに合わせて微調整すると、パフォーマンスの向上が維持されます。事前トレーニング済み +ImageNet-22k、当社の CvT-W24 は、ImageNet-1k val set で 87.7\% というトップ 1 の精度を獲得しています。最後に、私たちの結果は、位置エンコーディングが、 +既存のビジョン 
トランスフォーマーの重要なコンポーネントであるこのコンポーネントは、モデルでは安全に削除できるため、高解像度のビジョン タスクの設計が簡素化されます。* + +このモデルは [anugunj](https://huggingface.co/anugunj) によって提供されました。元のコードは [ここ](https://github.com/microsoft/CvT) にあります。 + +## Usage tips + +- CvT モデルは通常の Vision Transformer ですが、畳み込みでトレーニングされています。 ImageNet-1K および CIFAR-100 で微調整すると、[オリジナル モデル (ViT)](vit) よりも優れたパフォーマンスを発揮します。 +- カスタム データの微調整だけでなく推論に関するデモ ノートブックも [ここ](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) で確認できます ([`ViTFeatureExtractor を置き換えるだけで済みます) `] による [`AutoImageProcessor`] および [`ViTForImageClassification`] による [`CvtForImageClassification`])。 +- 利用可能なチェックポイントは、(1) [ImageNet-22k](http://www.image-net.org/) (1,400 万の画像と 22,000 のクラスのコレクション) でのみ事前トレーニングされている、(2) も問題ありません。 ImageNet-22k で調整、または (3) [ImageNet-1k](http://www.image-net.org/challenges/LSVRC/2012/) (ILSVRC 2012 とも呼ばれるコレクション) でも微調整130万の + 画像と 1,000 クラス)。 + +## Resources + +CvT を始めるのに役立つ公式 Hugging Face およびコミュニティ (🌎 で示される) リソースのリスト。 + + + +- [`CvtForImageClassification`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb)。 +- 参照: [画像分類タスク ガイド](../tasks/image_classification) + +ここに含めるリソースの送信に興味がある場合は、お気軽にプル リクエストを開いてください。審査させていただきます。リソースは、既存のリソースを複製するのではなく、何か新しいものを示すことが理想的です。 + +## CvtConfig + +[[autodoc]] CvtConfig + + + + +## CvtModel + +[[autodoc]] CvtModel + - forward + +## CvtForImageClassification + +[[autodoc]] CvtForImageClassification + - forward + + + + +## TFCvtModel + +[[autodoc]] TFCvtModel + - call + +## TFCvtForImageClassification + +[[autodoc]] TFCvtForImageClassification + - call + + + + diff --git a/docs/source/ja/model_doc/data2vec.md b/docs/source/ja/model_doc/data2vec.md new file mode 100644 index 00000000000000..78ae71e6947e4d --- /dev/null +++ b/docs/source/ja/model_doc/data2vec.md @@ -0,0 +1,187 @@ + + +# Data2Vec + +## Overview + +Data2Vec モデルは、[data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/pdf/2202.03555) で Alexei Baevski、Wei-Ning Hsu、Qiantong Xu、バArun Babu, Jiatao Gu and Michael Auli. 
+Data2Vec は、テキスト、音声、画像などのさまざまなデータ モダリティにわたる自己教師あり学習のための統一フレームワークを提案します。 +重要なのは、事前トレーニングの予測ターゲットは、モダリティ固有のコンテキストに依存しないターゲットではなく、入力のコンテキスト化された潜在表現であることです。 + +論文の要約は次のとおりです。 + +*自己教師あり学習の一般的な考え方はどのモダリティでも同じですが、実際のアルゴリズムと +単一のモダリティを念頭に置いて開発されたため、目的は大きく異なります。一般に近づけるために +自己教師あり学習では、どちらの音声に対しても同じ学習方法を使用するフレームワークである data2vec を紹介します。 +NLP またはコンピューター ビジョン。中心となるアイデアは、完全な入力データの潜在的な表現を、 +標準の Transformer アーキテクチャを使用した自己蒸留セットアップの入力のマスクされたビュー。 +単語、視覚的トークン、人間の音声単位などのモダリティ固有のターゲットを予測するのではなく、 +本質的にローカルであるため、data2vec は、からの情報を含む文脈化された潜在表現を予測します。 +入力全体。音声認識、画像分類、および +自然言語理解は、新しい最先端技術や、主流のアプローチに匹敵するパフォーマンスを実証します。 +モデルとコードは、www.github.com/pytorch/fairseq/tree/master/examples/data2vec.* で入手できます。 + +このモデルは、[edugp](https://huggingface.co/edugp) および [patrickvonplaten](https://huggingface.co/patrickvonplaten) によって提供されました。 +[sayakpaul](https://github.com/sayakpaul) と [Rocketknight1](https://github.com/Rocketknight1) は、TensorFlow のビジョンに Data2Vec を提供しました。 + +元のコード (NLP および音声用) は、[こちら](https://github.com/pytorch/fairseq/tree/main/examples/data2vec) にあります。 +ビジョンの元のコードは [こちら](https://github.com/facebookresearch/data2vec_vision/tree/main/beit) にあります。 + +## Usage tips + +- Data2VecAudio、Data2VecText、および Data2VecVision はすべて、同じ自己教師あり学習方法を使用してトレーニングされています。 +- Data2VecAudio の場合、前処理は特徴抽出を含めて [`Wav2Vec2Model`] と同じです。 +- Data2VecText の場合、前処理はトークン化を含めて [`RobertaModel`] と同じです。 +- Data2VecVision の場合、前処理は特徴抽出を含めて [`BeitModel`] と同じです。 + +## Resources + +Data2Vec の使用を開始するのに役立つ公式 Hugging Face およびコミュニティ (🌎 で示される) リソースのリスト。 + + + +- [`Data2VecVisionForImageClassification`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) および [ノートブック](https://cola.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb)。 +- カスタム データセットで [`TFData2VecVisionForImageClassification`] を微調整するには、[このノートブック](https://colab.research.google.com/github/sayakpaul/TF-2.0-Hacks/blob/master/data2vec_vision_image_classification.ipynb) を参照してください。 )。 + +**Data2VecText ドキュメント リソース** +- [テキスト分類タスクガイド](../tasks/sequence_classification) +- [トークン分類タスクガイド](../tasks/token_classification) +- [質問回答タスク ガイド](../tasks/question_answering) +- [因果言語モデリング タスク ガイド](../tasks/language_modeling) +- [マスク言語モデリング タスク ガイド](../tasks/masked_language_modeling) +- [多肢選択タスク ガイド](../tasks/multiple_choice) + +**Data2VecAudio ドキュメント リソース** +- [音声分類タスクガイド](../tasks/audio_classification) +- [自動音声認識タスクガイド](../tasks/asr) + +**Data2VecVision ドキュメント リソース** +- [画像分類](../tasks/image_classification) +- [セマンティック セグメンテーション](../tasks/semantic_segmentation) + +ここに含めるリソースの送信に興味がある場合は、お気軽にプル リクエストを開いてください。審査させていただきます。リソースは、既存のリソースを複製するのではなく、何か新しいものを示すことが理想的です。 + +## Data2VecTextConfig + +[[autodoc]] Data2VecTextConfig + +## Data2VecAudioConfig + +[[autodoc]] Data2VecAudioConfig + +## Data2VecVisionConfig + +[[autodoc]] Data2VecVisionConfig + + + + +## Data2VecAudioModel + +[[autodoc]] Data2VecAudioModel + - forward + +## Data2VecAudioForAudioFrameClassification + +[[autodoc]] Data2VecAudioForAudioFrameClassification + - forward + +## Data2VecAudioForCTC + +[[autodoc]] Data2VecAudioForCTC + - forward + +## Data2VecAudioForSequenceClassification + +[[autodoc]] Data2VecAudioForSequenceClassification + - forward + +## Data2VecAudioForXVector + +[[autodoc]] Data2VecAudioForXVector + - forward + +## Data2VecTextModel + +[[autodoc]] Data2VecTextModel + - forward + +## Data2VecTextForCausalLM + +[[autodoc]] Data2VecTextForCausalLM + - forward + +## Data2VecTextForMaskedLM + +[[autodoc]] Data2VecTextForMaskedLM + - forward + +## 
Data2VecTextForSequenceClassification + +[[autodoc]] Data2VecTextForSequenceClassification + - forward + +## Data2VecTextForMultipleChoice + +[[autodoc]] Data2VecTextForMultipleChoice + - forward + +## Data2VecTextForTokenClassification + +[[autodoc]] Data2VecTextForTokenClassification + - forward + +## Data2VecTextForQuestionAnswering + +[[autodoc]] Data2VecTextForQuestionAnswering + - forward + +## Data2VecVisionModel + +[[autodoc]] Data2VecVisionModel + - forward + +## Data2VecVisionForImageClassification + +[[autodoc]] Data2VecVisionForImageClassification + - forward + +## Data2VecVisionForSemanticSegmentation + +[[autodoc]] Data2VecVisionForSemanticSegmentation + - forward + + + + +## TFData2VecVisionModel + +[[autodoc]] TFData2VecVisionModel + - call + +## TFData2VecVisionForImageClassification + +[[autodoc]] TFData2VecVisionForImageClassification + - call + +## TFData2VecVisionForSemanticSegmentation + +[[autodoc]] TFData2VecVisionForSemanticSegmentation + - call + + + diff --git a/docs/source/ja/model_doc/deberta-v2.md b/docs/source/ja/model_doc/deberta-v2.md new file mode 100644 index 00000000000000..35da9d0a1d584f --- /dev/null +++ b/docs/source/ja/model_doc/deberta-v2.md @@ -0,0 +1,167 @@ + + +# DeBERTa-v2 + +## Overview + +DeBERTa モデルは、Pengcheng He、Xiaodong Liu、Jianfeng Gao、Weizhu Chen によって [DeBERTa: Decoding-enhanced BERT with Disentangled Attendant](https://arxiv.org/abs/2006.03654) で提案されました。Google のモデルに基づいています。 +2018年にリリースされたBERTモデルと2019年にリリースされたFacebookのRoBERTaモデル。 + +これは、もつれた注意を解きほぐし、使用されるデータの半分を使用して強化されたマスク デコーダ トレーニングを備えた RoBERTa に基づいて構築されています。 +ロベルタ。 + +論文の要約は次のとおりです。 + +*事前トレーニングされたニューラル言語モデルの最近の進歩により、多くの自然言語モデルのパフォーマンスが大幅に向上しました。 +言語処理 (NLP) タスク。この論文では、新しいモデル アーキテクチャ DeBERTa (Decoding-enhanced BERT with +これは、2 つの新しい技術を使用して BERT モデルと RoBERTa モデルを改善します。 1つ目は、 +もつれを解く注意メカニズム。各単語は、その内容をエンコードする 2 つのベクトルを使用して表現され、 +単語間の注意の重みは、それらの単語のもつれ解除行列を使用して計算されます。 +内容と相対的な位置。 2 番目に、強化されたマスク デコーダを使用して、出力ソフトマックス レイヤを次のように置き換えます。 +モデルの事前トレーニング用にマスクされたトークンを予測します。これら 2 つの手法により効率が大幅に向上することを示します。 +モデルの事前トレーニングと下流タスクのパフォーマンスの向上。 RoBERTa-Large と比較すると、DeBERTa モデルは半分のレベルでトレーニングされています。 +トレーニング データは幅広い NLP タスクで一貫して優れたパフォーマンスを示し、MNLI で +0.9% の改善を達成しました。 +(90.2% 対 91.1%)、SQuAD v2.0 では +2.3% (88.4% 対 90.7%)、RACE では +3.6% (83.2% 対 86.8%) でした。 DeBERTa コードと +事前トレーニングされたモデルは https://github.com/microsoft/DeBERTa で公開されます。* + +次の情報は、[元の実装で直接表示されます リポジトリ](https://github.com/microsoft/DeBERTa)。 DeBERTa v2 は、DeBERTa モデルの 2 番目のバージョンです。それには以下が含まれます +SuperGLUE 単一モデルの提出に使用された 1.5B モデルは、人間のベースライン 89.8 に対して 89.9 を達成しました。あなたはできる +この投稿に関する詳細については、著者のドキュメントを参照してください。 +[ブログ](https://www.microsoft.com/en-us/research/blog/microsoft-deberta-surpasses-human-performance-on-the-superglue-benchmark/) + +v2 の新機能: + +- **語彙** v2 では、トレーニング データから構築されたサイズ 128K の新しい語彙を使用するようにトークナイザーが変更されました。 + GPT2 ベースのトークナイザーの代わりに、トークナイザーは + [sentencepiece ベース](https://github.com/google/sentencepiece) トークナイザー。 +- **nGiE(nGram Induced Input Encoding)** DeBERTa-v2 モデルは、最初の畳み込み層とは別に追加の畳み込み層を使用します。 + トランスフォーマー層を使用して、入力トークンのローカル依存関係をよりよく学習します。 +- **位置射影行列を注目レイヤーのコンテンツ射影行列と共有** 以前に基づく + 実験では、パフォーマンスに影響を与えることなくパラメータを保存できます。 +- **バケットを適用して相対位置をエンコードします** DeBERTa-v2 モデルはログ バケットを使用して相対位置をエンコードします + T5に似ています。 +- **900M モデル & 1.5B モデル** 2 つの追加モデル サイズ: 900M と 1.5B が利用可能で、これにより、パフォーマンスが大幅に向上します。 + 下流タスクのパフォーマンス。 + +このモデルは [DeBERTa](https://huggingface.co/DeBERTa) によって寄稿されました。このモデルの TF 2.0 実装は、 +[kamalkraj](https://huggingface.co/kamalkraj) による投稿。元のコードは [こちら](https://github.com/microsoft/DeBERTa) にあります。 + +## Resources +- 
[テキスト分類タスクガイド](../tasks/sequence_classification) +- [トークン分類タスクガイド](../tasks/token_classification) +- [質問回答タスク ガイド](../tasks/question_answering) +- [マスク言語モデリング タスク ガイド](../tasks/masked_language_modeling) +- [多肢選択タスク ガイド](../tasks/multiple_choice) + +## DebertaV2Config + +[[autodoc]] DebertaV2Config + +## DebertaV2Tokenizer + +[[autodoc]] DebertaV2Tokenizer + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - save_vocabulary + +## DebertaV2TokenizerFast + +[[autodoc]] DebertaV2TokenizerFast + - build_inputs_with_special_tokens + - create_token_type_ids_from_sequences + + + + +## DebertaV2Model + +[[autodoc]] DebertaV2Model + - forward + +## DebertaV2PreTrainedModel + +[[autodoc]] DebertaV2PreTrainedModel + - forward + +## DebertaV2ForMaskedLM + +[[autodoc]] DebertaV2ForMaskedLM + - forward + +## DebertaV2ForSequenceClassification + +[[autodoc]] DebertaV2ForSequenceClassification + - forward + +## DebertaV2ForTokenClassification + +[[autodoc]] DebertaV2ForTokenClassification + - forward + +## DebertaV2ForQuestionAnswering + +[[autodoc]] DebertaV2ForQuestionAnswering + - forward + +## DebertaV2ForMultipleChoice + +[[autodoc]] DebertaV2ForMultipleChoice + - forward + + + + +## TFDebertaV2Model + +[[autodoc]] TFDebertaV2Model + - call + +## TFDebertaV2PreTrainedModel + +[[autodoc]] TFDebertaV2PreTrainedModel + - call + +## TFDebertaV2ForMaskedLM + +[[autodoc]] TFDebertaV2ForMaskedLM + - call + +## TFDebertaV2ForSequenceClassification + +[[autodoc]] TFDebertaV2ForSequenceClassification + - call + +## TFDebertaV2ForTokenClassification + +[[autodoc]] TFDebertaV2ForTokenClassification + - call + +## TFDebertaV2ForQuestionAnswering + +[[autodoc]] TFDebertaV2ForQuestionAnswering + - call + +## TFDebertaV2ForMultipleChoice + +[[autodoc]] TFDebertaV2ForMultipleChoice + - call + + + + + diff --git a/docs/source/ja/model_doc/deberta.md b/docs/source/ja/model_doc/deberta.md new file mode 100644 index 00000000000000..f7e00ad3b2bcce --- /dev/null +++ b/docs/source/ja/model_doc/deberta.md @@ -0,0 +1,164 @@ + + +# DeBERTa + +## Overview + +DeBERTa モデルは、Pengcheng He、Xiaodong Liu、Jianfeng Gao、Weizhu Chen によって [DeBERTa: Decoding-enhanced BERT with Disentangled Attendant](https://arxiv.org/abs/2006.03654) で提案されました。Google のモデルに基づいています。 +2018年にリリースされたBERTモデルと2019年にリリースされたFacebookのRoBERTaモデル。 + +これは、もつれた注意を解きほぐし、使用されるデータの半分を使用して強化されたマスク デコーダ トレーニングを備えた RoBERTa に基づいて構築されています。 +ロベルタ。 + +論文の要約は次のとおりです。 + +*事前トレーニングされたニューラル言語モデルの最近の進歩により、多くの自然言語モデルのパフォーマンスが大幅に向上しました。 +言語処理 (NLP) タスク。この論文では、新しいモデル アーキテクチャ DeBERTa (Decoding-enhanced BERT with +これは、2 つの新しい技術を使用して BERT モデルと RoBERTa モデルを改善します。 1つ目は、 +もつれを解く注意メカニズム。各単語は、その内容をエンコードする 2 つのベクトルを使用して表現され、 +単語間の注意の重みは、それらの単語のもつれ解除行列を使用して計算されます。 +内容と相対的な位置。 2 番目に、強化されたマスク デコーダを使用して、出力ソフトマックス レイヤを次のように置き換えます。 +モデルの事前トレーニング用にマスクされたトークンを予測します。これら 2 つの手法により効率が大幅に向上することを示します。 +モデルの事前トレーニングと下流タスクのパフォーマンスの向上。 RoBERTa-Large と比較すると、DeBERTa モデルは半分のレベルでトレーニングされています。 +トレーニング データは幅広い NLP タスクで一貫して優れたパフォーマンスを示し、MNLI で +0.9% の改善を達成しました。 +(90.2% 対 91.1%)、SQuAD v2.0 では +2.3% (88.4% 対 90.7%)、RACE では +3.6% (83.2% 対 86.8%) でした。 DeBERTa コードと +事前トレーニングされたモデルは https://github.com/microsoft/DeBERTa で公開されます。* + + +このモデルは [DeBERTa](https://huggingface.co/DeBERTa) によって寄稿されました。このモデルの TF 2.0 実装は、 +[kamalkraj](https://huggingface.co/kamalkraj) による寄稿。元のコードは [こちら](https://github.com/microsoft/DeBERTa) にあります。 + +## Resources + +DeBERTa を使い始めるのに役立つ公式 Hugging Face およびコミュニティ (🌎 で示される) リソースのリスト。ここに含めるリソースの送信に興味がある場合は、お気軽にプル 
リクエストを開いてください。審査させていただきます。リソースは、既存のリソースを複製するのではなく、何か新しいものを示すことが理想的です。 + + + +- DeBERTa を使用して [DeepSpeed を使用して大規模モデルのトレーニングを加速する](https://huggingface.co/blog/accelerate-deepspeed) 方法に関するブログ投稿。 +- DeBERTa による [機械学習によるスーパーチャージされた顧客サービス](https://huggingface.co/blog/supercharge-customer-service-with-machine-learning) に関するブログ投稿。 +- [`DebertaForSequenceClassification`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb)。 +- [`TFDebertaForSequenceClassification`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/text-classification) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb)。 +- [テキスト分類タスクガイド](../tasks/sequence_classification) + + + +- [`DebertaForTokenClassification`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/token-classification) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb)。 +- [`TFDebertaForTokenClassification`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/token-classification) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb)。 +- [トークン分類](https://huggingface.co/course/chapter7/2?fw=pt) 🤗 ハグフェイスコースの章。 +- 🤗 ハグフェイスコースの [バイトペアエンコーディングのトークン化](https://huggingface.co/course/chapter6/5?fw=pt) の章。 +- [トークン分類タスクガイド](../tasks/token_classification) + + + +- [`DebertaForMaskedLM`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling#robertabertdistilbert-and-masked-language-modeling) でサポートされています。 [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb)。 +- [`TFDebertaForMaskedLM`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/lang-modeling#run_mlmpy) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb)。 +- [マスクされた言語モデリング](https://huggingface.co/course/chapter7/3?fw=pt) 🤗 顔のハグ コースの章。 +- [マスク言語モデリング タスク ガイド](../tasks/masked_language_modeling) + + + +- [`DebertaForQuestionAnswering`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb)。 +- [`TFDebertaForQuestionAnswering`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/question-answering) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb)。 +- [質問回答](https://huggingface.co/course/chapter7/7?fw=pt) 🤗 ハグフェイスコースの章。 +- [質問回答タスク ガイド](../tasks/question_answering) + +## DebertaConfig + +[[autodoc]] DebertaConfig + +## DebertaTokenizer + +[[autodoc]] DebertaTokenizer + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - save_vocabulary + +## DebertaTokenizerFast + +[[autodoc]] DebertaTokenizerFast + - build_inputs_with_special_tokens + - create_token_type_ids_from_sequences + + + + +## DebertaModel + +[[autodoc]] DebertaModel + - forward + +## 
DebertaPreTrainedModel + +[[autodoc]] DebertaPreTrainedModel + +## DebertaForMaskedLM + +[[autodoc]] DebertaForMaskedLM + - forward + +## DebertaForSequenceClassification + +[[autodoc]] DebertaForSequenceClassification + - forward + +## DebertaForTokenClassification + +[[autodoc]] DebertaForTokenClassification + - forward + +## DebertaForQuestionAnswering + +[[autodoc]] DebertaForQuestionAnswering + - forward + + + + +## TFDebertaModel + +[[autodoc]] TFDebertaModel + - call + +## TFDebertaPreTrainedModel + +[[autodoc]] TFDebertaPreTrainedModel + - call + +## TFDebertaForMaskedLM + +[[autodoc]] TFDebertaForMaskedLM + - call + +## TFDebertaForSequenceClassification + +[[autodoc]] TFDebertaForSequenceClassification + - call + +## TFDebertaForTokenClassification + +[[autodoc]] TFDebertaForTokenClassification + - call + +## TFDebertaForQuestionAnswering + +[[autodoc]] TFDebertaForQuestionAnswering + - call + + + + diff --git a/docs/source/ja/model_doc/decision_transformer.md b/docs/source/ja/model_doc/decision_transformer.md new file mode 100644 index 00000000000000..9c7f27bbeeec2d --- /dev/null +++ b/docs/source/ja/model_doc/decision_transformer.md @@ -0,0 +1,53 @@ + + +# Decision Transformer + +## Overview + +Decision Transformer モデルは、[Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345) で提案されました。 +Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch. + +論文の要約は次のとおりです。 + +*強化学習(RL)をシーケンスモデリング問題として抽象化するフレームワークを紹介します。 +これにより、Transformer アーキテクチャのシンプルさとスケーラビリティ、および関連する進歩を活用できるようになります。 + GPT-x や BERT などの言語モデリングで。特に、Decision Transformer というアーキテクチャを紹介します。 + RL の問題を条件付きシーケンス モデリングとして投げかけます。値関数に適合する以前の RL アプローチとは異なり、 + ポリシー勾配を計算すると、Decision Transformer は因果的にマスクされたアルゴリズムを利用して最適なアクションを出力するだけです。 + 変成器。望ましいリターン (報酬)、過去の状態、アクションに基づいて自己回帰モデルを条件付けすることにより、 + Decision Transformer モデルは、望ましいリターンを達成する将来のアクションを生成できます。そのシンプルさにも関わらず、 + Decision Transformer は、最先端のモデルフリーのオフライン RL ベースラインのパフォーマンスと同等、またはそれを超えています。 + Atari、OpenAI Gym、Key-to-Door タスク* + +このバージョンのモデルは、状態がベクトルであるタスク用です。 + +このモデルは、[edbeeching](https://huggingface.co/edbeeching) によって提供されました。元のコードは [ここ](https://github.com/kzl/decion-transformer) にあります。 + +## DecisionTransformerConfig + +[[autodoc]] DecisionTransformerConfig + + +## DecisionTransformerGPT2Model + +[[autodoc]] DecisionTransformerGPT2Model + - forward + +## DecisionTransformerModel + +[[autodoc]] DecisionTransformerModel + - forward diff --git a/docs/source/ja/model_doc/deformable_detr.md b/docs/source/ja/model_doc/deformable_detr.md new file mode 100644 index 00000000000000..ccb6ec42f869b6 --- /dev/null +++ b/docs/source/ja/model_doc/deformable_detr.md @@ -0,0 +1,75 @@ + + +# Deformable DETR + +## Overview + +変形可能 DETR モデルは、Xizhou Zhu、Weijie Su、Lewei Lu、Bin Li、Xiaogang Wang, Jifeng Dai によって [Deformable DETR: Deformable Transformers for End-to-End Object Detection](https://arxiv.org/abs/2010.04159) で提案されました +変形可能な DETR は、参照周囲の少数の主要なサンプリング ポイントのみに注目する新しい変形可能なアテンション モジュールを利用することにより、収束の遅さの問題と元の [DETR](detr) の制限された特徴の空間解像度を軽減します。 + +論文の要約は次のとおりです。 + +*DETR は、優れたパフォーマンスを実証しながら、物体検出における多くの手作業で設計されたコンポーネントの必要性を排除するために最近提案されました。ただし、画像特徴マップの処理における Transformer アテンション モジュールの制限により、収束が遅く、特徴の空間解像度が制限されるという問題があります。これらの問題を軽減するために、私たちは Deformable DETR を提案しました。この DETR のアテンション モジュールは、参照周囲の少数の主要なサンプリング ポイントのみに注目します。変形可能な DETR は、10 分の 1 のトレーニング エポックで、DETR よりも優れたパフォーマンス (特に小さなオブジェクトの場合) を達成できます。 COCO ベンチマークに関する広範な実験により、私たちのアプローチの有効性が実証されました。* + + + + 変形可能な DETR アーキテクチャ。 元の論文から抜粋。 + 
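+以下は、事前学習済みチェックポイントを用いた物体検出推論の最小限のスケッチです。チェックポイント名 `SenseTime/deformable-detr`、画像 URL、しきい値 0.5 はあくまで説明用の例であり、後処理には後述の [`DeformableDetrImageProcessor`] の `post_process_object_detection` を使用しています。
+
+```python
+import torch
+import requests
+from PIL import Image
+from transformers import AutoImageProcessor, DeformableDetrForObjectDetection
+
+# 例として COCO の検証画像を 1 枚ロードします（URL は一例です）
+url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+image = Image.open(requests.get(url, stream=True).raw)
+
+# チェックポイント名は一例です
+processor = AutoImageProcessor.from_pretrained("SenseTime/deformable-detr")
+model = DeformableDetrForObjectDetection.from_pretrained("SenseTime/deformable-detr")
+
+inputs = processor(images=image, return_tensors="pt")
+with torch.no_grad():
+    outputs = model(**inputs)
+
+# 信頼度 0.5 以上の予測を元の画像サイズ (height, width) に変換します
+target_sizes = torch.tensor([image.size[::-1]])
+results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.5)[0]
+
+for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
+    print(model.config.id2label[label.item()], round(score.item(), 3), [round(v, 2) for v in box.tolist()])
+```
+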
+このモデルは、[nielsr](https://huggingface.co/nielsr) によって提供されました。元のコードは [ここ](https://github.com/fundamentalvision/Deformable-DETR) にあります。 + +## Usage tips + + + - トレーニング Deformable DETR は、元の [DETR](detr) モデルをトレーニングすることと同等です。デモ ノートブックについては、以下の [resources](#resources) セクションを参照してください。 + +## Resources + +Deformable DETR の使用を開始するのに役立つ公式 Hugging Face およびコミュニティ (🌎 で示される) リソースのリスト。 + + + +- [`DeformableDetrForObjectDetection`] のカスタム データセットでの推論と微調整に関するデモ ノートブックは、[こちら](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Deformable-DETR) にあります。 +- [物体検出タスクガイド](../tasks/object_detection) も参照してください。 + +ここに含めるリソースの送信に興味がある場合は、お気軽にプル リクエストを開いてください。審査させていただきます。リソースは、既存のリソースを複製するのではなく、何か新しいものを示すことが理想的です。 + +## DeformableDetrImageProcessor + +[[autodoc]] DeformableDetrImageProcessor + - preprocess + - post_process_object_detection + +## DeformableDetrFeatureExtractor + +[[autodoc]] DeformableDetrFeatureExtractor + - __call__ + - post_process_object_detection + +## DeformableDetrConfig + +[[autodoc]] DeformableDetrConfig + +## DeformableDetrModel + +[[autodoc]] DeformableDetrModel + - forward + +## DeformableDetrForObjectDetection + +[[autodoc]] DeformableDetrForObjectDetection + - forward diff --git a/docs/source/ja/model_doc/deit.md b/docs/source/ja/model_doc/deit.md new file mode 100644 index 00000000000000..aa8c66c90be0b8 --- /dev/null +++ b/docs/source/ja/model_doc/deit.md @@ -0,0 +1,148 @@ + + +# DeiT + +## Overview + +DeiT モデルは、Hugo Touvron、Matthieu Cord、Matthijs Douze、Francisco Massa、Alexandre +Sablayrolles, Hervé Jégou.によって [Training data-efficient image Transformers & distillation through attention](https://arxiv.org/abs/2012.12877) で提案されました。 +サブレイロール、エルヴェ・ジェグー。 [Dosovitskiy et al., 2020](https://arxiv.org/abs/2010.11929) で紹介された [Vision Transformer (ViT)](vit) は、既存の畳み込みニューラルと同等、またはそれを上回るパフォーマンスを発揮できることを示しました。 +Transformer エンコーダ (BERT のような) を使用したネットワーク。ただし、その論文で紹介された ViT モデルには、次のトレーニングが必要でした。 +外部データを使用して、数週間にわたる高価なインフラストラクチャ。 DeiT (データ効率の高い画像変換器) はさらに優れています +画像分類用に効率的にトレーニングされたトランスフォーマーにより、必要なデータとコンピューティング リソースがはるかに少なくなります。 +オリジナルの ViT モデルとの比較。 + +論文の要約は次のとおりです。 + +*最近、純粋に注意に基づくニューラル ネットワークが、画像などの画像理解タスクに対処できることが示されました。 +分類。ただし、これらのビジュアル トランスフォーマーは、 +インフラストラクチャが高価であるため、その採用が制限されています。この作業では、コンボリューションフリーの競争力のあるゲームを作成します。 +Imagenet のみでトレーニングしてトランスフォーマーを作成します。 1 台のコンピューターで 3 日以内にトレーニングを行います。私たちの基準となるビジョン +トランス (86M パラメータ) は、外部なしで ImageNet 上で 83.1% (単一クロップ評価) のトップ 1 の精度を達成します。 +データ。さらに重要なのは、トランスフォーマーに特有の教師と生徒の戦略を導入することです。蒸留に依存している +学生が注意を払って教師から学ぶことを保証するトークン。私たちはこのトークンベースに興味を示します +特に convnet を教師として使用する場合。これにより、convnet と競合する結果を報告できるようになります。 +Imagenet (最大 85.2% の精度が得られます) と他のタスクに転送するときの両方で。私たちはコードを共有し、 +モデル。* + +このモデルは、[nielsr](https://huggingface.co/nielsr) によって提供されました。このモデルの TensorFlow バージョンは、[amyeroberts](https://huggingface.co/amyeroberts) によって追加されました。 + +## Usage tips + +- ViT と比較して、DeiT モデルはいわゆる蒸留トークンを使用して教師から効果的に学習します (これは、 + DeiT 論文は、ResNet のようなモデルです)。蒸留トークンは、バックプロパゲーションを通じて、と対話することによって学習されます。 + セルフアテンション層を介したクラス ([CLS]) とパッチ トークン。 +- 抽出されたモデルを微調整するには 2 つの方法があります。(1) 上部に予測ヘッドを配置するだけの古典的な方法。 + クラス トークンの最終的な非表示状態を抽出し、蒸留シグナルを使用しない、または (2) 両方の + 予測ヘッドはクラス トークンの上と蒸留トークンの上にあります。その場合、[CLS] 予測は + head は、head の予測とグラウンド トゥルース ラベル間の通常のクロスエントロピーを使用してトレーニングされます。 + 蒸留予測ヘッドは、硬蒸留 (予測と予測の間のクロスエントロピー) を使用してトレーニングされます。 + 蒸留ヘッドと教師が予測したラベル)。推論時に、平均予測を取得します。 + 最終的な予測として両頭の間で。 (2) は「蒸留による微調整」とも呼ばれます。 + 下流のデータセットですでに微調整されている教師。モデル的には (1) に相当します。 + [`DeiTForImageClassification`] と (2) に対応します。 + [`DeiTForImageClassificationWithTeacher`]。 +- 著者らは (2) についてもソフト蒸留を試みたことに注意してください (この場合、蒸留予測ヘッドは + 教師のソフトマックス出力に一致するように KL 
ダイバージェンスを使用してトレーニングしました)が、ハード蒸留が最良の結果をもたらしました。 +- リリースされたすべてのチェックポイントは、ImageNet-1k のみで事前トレーニングおよび微調整されました。外部データは使用されませんでした。これは + JFT-300M データセット/Imagenet-21k などの外部データを使用した元の ViT モデルとは対照的です。 + 事前トレーニング。 +- DeiT の作者は、より効率的にトレーニングされた ViT モデルもリリースしました。これは、直接プラグインできます。 + [`ViTModel`] または [`ViTForImageClassification`]。データなどのテクニック + はるかに大規模なデータセットでのトレーニングをシミュレートするために、拡張、最適化、正則化が使用されました。 + (ただし、事前トレーニングには ImageNet-1k のみを使用します)。 4 つのバリエーション (3 つの異なるサイズ) が利用可能です。 + *facebook/deit-tiny-patch16-224*、*facebook/deit-small-patch16-224*、*facebook/deit-base-patch16-224* および + *facebook/deit-base-patch16-384*。以下を行うには [`DeiTImageProcessor`] を使用する必要があることに注意してください。 + モデル用の画像を準備します。 + +## Resources + +DeiT を始めるのに役立つ公式 Hugging Face およびコミュニティ (🌎 で示されている) リソースのリスト。 + + + +- [`DeiTForImageClassification`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb)。 +- 参照: [画像分類タスク ガイド](../tasks/image_classification) + +それに加えて: + +- [`DeiTForMaskedImageModeling`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining) でサポートされています。 + +ここに含めるリソースの送信に興味がある場合は、お気軽にプル リクエストを開いてください。審査させていただきます。リソースは、既存のリソースを複製するのではなく、何か新しいものを示すことが理想的です。 + +## DeiTConfig + +[[autodoc]] DeiTConfig + +## DeiTFeatureExtractor + +[[autodoc]] DeiTFeatureExtractor + - __call__ + +## DeiTImageProcessor + +[[autodoc]] DeiTImageProcessor + - preprocess + + + + +## DeiTModel + +[[autodoc]] DeiTModel + - forward + +## DeiTForMaskedImageModeling + +[[autodoc]] DeiTForMaskedImageModeling + - forward + +## DeiTForImageClassification + +[[autodoc]] DeiTForImageClassification + - forward + +## DeiTForImageClassificationWithTeacher + +[[autodoc]] DeiTForImageClassificationWithTeacher + - forward + + + + +## TFDeiTModel + +[[autodoc]] TFDeiTModel + - call + +## TFDeiTForMaskedImageModeling + +[[autodoc]] TFDeiTForMaskedImageModeling + - call + +## TFDeiTForImageClassification + +[[autodoc]] TFDeiTForImageClassification + - call + +## TFDeiTForImageClassificationWithTeacher + +[[autodoc]] TFDeiTForImageClassificationWithTeacher + - call + + + \ No newline at end of file diff --git a/docs/source/ja/model_doc/deplot.md b/docs/source/ja/model_doc/deplot.md new file mode 100644 index 00000000000000..26871d1e7dde66 --- /dev/null +++ b/docs/source/ja/model_doc/deplot.md @@ -0,0 +1,65 @@ + + +# DePlot + +## Overview + +DePlot は、Fangyu Liu、Julian Martin Aisenschlos、Francesco Piccinno、Syrine Krichene、Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, Yasemin Altun. 
の論文 [DePlot: One-shot visual language reasoning by plot-to-table translation](https://arxiv.org/abs/2212.10505) で提案されました。パン・ + +論文の要約には次のように記載されています。 + +*チャートやプロットなどの視覚言語は人間の世界に遍在しています。プロットやチャートを理解するには、強力な推論スキルが必要です。従来の最先端 (SOTA) モデルには少なくとも数万のトレーニング サンプルが必要であり、その推論能力は、特に人間が作成した複雑なクエリでは依然として大幅に制限されています。この論文では、視覚言語推論に対する最初のワンショット ソリューションを紹介します。私たちは、視覚言語推論の課題を 2 つのステップに分解します。(1) プロットからテキストへの翻訳と、(2) 翻訳されたテキストに対する推論です。この方法の鍵となるのは、プロットまたはチャートの画像を線形化されたテーブルに変換する、DePlot という名前のモダリティ変換モジュールです。その後、DePlot の出力を直接使用して、事前トレーニング済みの大規模言語モデル (LLM) をプロンプトし、LLM の少数ショット推論機能を利用できます。 DePlot を取得するには、統一されたタスク形式とメトリクスを確立することでプロットからテーブルへのタスクを標準化し、このタスクで DePlot をエンドツーエンドでトレーニングします。 DePlot は、プラグアンドプレイ方式で LLM とともに既製で使用できます。 28,000 を超えるデータ ポイントで微調整された SOTA モデルと比較して、ワンショット プロンプトのみを使用する DePlot+LLM は、チャート QA タスクからの人が作成したクエリに関して、微調整された SOTA より 24.0% の改善を達成しました。* + +DePlot は、`Pix2Struct` アーキテクチャを使用してトレーニングされたモデルです。 `Pix2Struct` の詳細については、[Pix2Struct ドキュメント](https://huggingface.co/docs/transformers/main/en/model_doc/pix2struct) を参照してください。 +DePlot は、`Pix2Struct` アーキテクチャの Visual Question Answering サブセットです。入力された質問を画像上にレンダリングし、答えを予測します。 + +## Usage example + +現在、DePlot で使用できるチェックポイントは 1 つです。 + +- `google/deplot`: ChartQA データセットで微調整された DePlot + +```python +from transformers import AutoProcessor, Pix2StructForConditionalGeneration +import requests +from PIL import Image + +model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot") +processor = AutoProcessor.from_pretrained("google/deplot") +url = "https://raw.githubusercontent.com/vis-nlp/ChartQA/main/ChartQA%20Dataset/val/png/5090.png" +image = Image.open(requests.get(url, stream=True).raw) + +inputs = processor(images=image, text="Generate underlying data table of the figure below:", return_tensors="pt") +predictions = model.generate(**inputs, max_new_tokens=512) +print(processor.decode(predictions[0], skip_special_tokens=True)) +``` + +## Fine-tuning + +DePlot を微調整するには、pix2struct [微調整ノートブック](https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_pix2struct.ipynb) を参照してください。 `Pix2Struct` モデルの場合、Adafactor とコサイン学習率スケジューラを使用してモデルを微調整すると、収束が高速化されることがわかりました。 +```python +from transformers.optimization import Adafactor, get_cosine_schedule_with_warmup + +optimizer = Adafactor(self.parameters(), scale_parameter=False, relative_step=False, lr=0.01, weight_decay=1e-05) +scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=1000, num_training_steps=40000) +``` + + + +DePlot は、`Pix2Struct`アーキテクチャを使用してトレーニングされたモデルです。 API リファレンスについては、[`Pix2Struct` ドキュメント](pix2struct) を参照してください。 + + \ No newline at end of file diff --git a/docs/source/ja/model_doc/deta.md b/docs/source/ja/model_doc/deta.md new file mode 100644 index 00000000000000..615f8396577fdb --- /dev/null +++ b/docs/source/ja/model_doc/deta.md @@ -0,0 +1,64 @@ + + +# DETA + +## Overview + +DETA モデルは、[NMS Strikes Back](https://arxiv.org/abs/2212.06137) で Jeffrey Ouyang-Zhang、Jang Hyun Cho、Xingyi Zhou、Philipp Krähenbühl によって提案されました。 +DETA (Detection Transformers with Assignment の略) は、1 対 1 の 2 部ハンガリアン マッチング損失を置き換えることにより、[Deformable DETR](deformable_detr) を改善します。 +非最大抑制 (NMS) を備えた従来の検出器で使用される 1 対多のラベル割り当てを使用します。これにより、最大 2.5 mAP の大幅な増加が得られます。 + +論文の要約は次のとおりです。 + +*Detection Transformer (DETR) は、トレーニング中に 1 対 1 の 2 部マッチングを使用してクエリを一意のオブジェクトに直接変換し、エンドツーエンドのオブジェクト検出を可能にします。最近、これらのモデルは、紛れもない優雅さで COCO の従来の検出器を上回りました。ただし、モデル アーキテクチャやトレーニング スケジュールなど、さまざまな設計において従来の検出器とは異なるため、1 対 1 マッチングの有効性は完全には理解されていません。この研究では、DETR での 1 対 1 のハンガリー語マッチングと、非最大監視 (NMS) を備えた従来の検出器での 1 
対多のラベル割り当てとの間の厳密な比較を行います。驚くべきことに、NMS を使用した 1 対多の割り当ては、同じ設定の下で標準的な 1 対 1 のマッチングよりも一貫して優れており、最大 2.5 mAP という大幅な向上が見られます。従来の IoU ベースのラベル割り当てを使用して Deformable-DETR をトレーニングする当社の検出器は、ResNet50 バックボーンを使用して 12 エポック (1x スケジュール) 以内に 50.2 COCO mAP を達成し、この設定で既存のすべての従来の検出器またはトランスベースの検出器を上回りました。複数のデータセット、スケジュール、アーキテクチャに関して、私たちは一貫して、パフォーマンスの高い検出トランスフォーマーには二部マッチングが不要であることを示しています。さらに、検出トランスの成功は、表現力豊かなトランス アーキテクチャによるものであると考えています。* + + + + DETA の概要。 元の論文から抜粋。 + +このモデルは、[nielsr](https://huggingface.co/nielsr) によって提供されました。 +元のコードは [ここ](https://github.com/jozhang97/DETA) にあります。 + +## Resources + +DETA の使用を開始するのに役立つ公式 Hugging Face およびコミュニティ (🌎 で示されている) リソースのリスト。 + +- DETA のデモ ノートブックは [こちら](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DETA) にあります。 +- 参照: [オブジェクト検出タスク ガイド](../tasks/object_detection) + +ここに含めるリソースの送信に興味がある場合は、お気軽にプル リクエストを開いてください。審査させていただきます。リソースは、既存のリソースを複製するのではなく、何か新しいものを示すことが理想的です。 + +## DetaConfig + +[[autodoc]] DetaConfig + +## DetaImageProcessor + +[[autodoc]] DetaImageProcessor + - preprocess + - post_process_object_detection + +## DetaModel + +[[autodoc]] DetaModel + - forward + +## DetaForObjectDetection + +[[autodoc]] DetaForObjectDetection + - forward diff --git a/docs/source/ja/model_doc/detr.md b/docs/source/ja/model_doc/detr.md new file mode 100644 index 00000000000000..1b9e64eb5486ee --- /dev/null +++ b/docs/source/ja/model_doc/detr.md @@ -0,0 +1,217 @@ + + +# DETR + +## Overview + +DETR モデルは、[Transformers を使用したエンドツーエンドのオブジェクト検出](https://arxiv.org/abs/2005.12872) で提案されました。 +Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov and Sergey Zagoruyko ルイコ。 DETR +畳み込みバックボーンと、その後にエンドツーエンドでトレーニングできるエンコーダー/デコーダー Transformer で構成されます。 +物体の検出。 Faster-R-CNN や Mask-R-CNN などのモデルの複雑さの多くが大幅に簡素化されます。 +領域提案、非最大抑制手順、アンカー生成などです。さらに、DETR は次のようにすることもできます。 +デコーダ出力の上にマスク ヘッドを追加するだけで、パノプティック セグメンテーションを実行できるように自然に拡張されています。 + +論文の要約は次のとおりです。 + +*物体検出を直接集合予測問題として見る新しい方法を紹介します。私たちのアプローチは、 +検出パイプラインにより、非最大抑制などの多くの手作業で設計されたコンポーネントの必要性が効果的に排除されます。 +タスクに関する事前の知識を明示的にエンコードするプロシージャまたはアンカーの生成。の主な成分は、 +DEtection TRansformer または DETR と呼ばれる新しいフレームワークは、セットベースのグローバル損失であり、 +二部マッチング、およびトランスフォーマー エンコーダー/デコーダー アーキテクチャ。学習されたオブジェクト クエリの固定された小さなセットが与えられると、 +DETR は、オブジェクトとグローバル イメージ コンテキストの関係について推論し、最終セットを直接出力します。 +並行して予想も。新しいモデルは概念的にシンプルであり、多くのモデルとは異なり、特殊なライブラリを必要としません。 +他の最新の検出器。 DETR は、確立された、および同等の精度と実行時のパフォーマンスを実証します。 +困難な COCO 物体検出データセットに基づく、高度に最適化された Faster RCNN ベースライン。さらに、DETR は簡単に実行できます。 +統一された方法でパノプティック セグメンテーションを生成するために一般化されました。競合他社を大幅に上回るパフォーマンスを示しています +ベースライン* + +このモデルは、[nielsr](https://huggingface.co/nielsr) によって提供されました。元のコードは [こちら](https://github.com/facebookresearch/detr) にあります。 + +## How DETR works + +[`~transformers.DetrForObjectDetection`] がどのように機能するかを説明する TLDR は次のとおりです。 + +まず、事前にトレーニングされた畳み込みバックボーンを通じて画像が送信されます (論文では、著者らは次のように使用しています)。 +ResNet-50/ResNet-101)。バッチ ディメンションも追加すると仮定します。これは、バックボーンへの入力が +画像に 3 つのカラー チャネル (RGB) があると仮定した場合の、形状 `(batch_size, 3, height, width)` のテンソル。 CNNのバックボーン +通常は `(batch_size, 2048, height/32, width/32)` の形状の、新しい低解像度の特徴マップを出力します。これは +次に、DETR の Transformer の隠れ次元 (デフォルトでは `256`) に一致するように投影されます。 +`nn.Conv2D` レイヤー。これで、形状 `(batch_size, 256, height/32, width/32)` のテンソルが完成しました。 +特徴マップは平坦化および転置され、形状 `(batch_size, seq_len, d_model)` のテンソルを取得します = +`(batch_size, width/32*height/32, 256)`。したがって、NLP モデルとの違いは、シーケンスの長さが実際には +通常よりも長くなりますが、「d_model」は小さくなります (NLP では通常 768 以上です)。 + +次に、これがエンコーダを介して送信され、同じ形状の `encoder_hidden_​​states` が出力されます (次のように考えることができます)。 +これらは画像の特徴として)。次に、いわゆる **オブジェクト クエリ**がデコーダを通じて送信されます。これは形状のテンソルです +`(batch_size, num_queries, 
d_model)`。通常、`num_queries` は 100 に設定され、ゼロで初期化されます。 +これらの入力埋め込みは学習された位置エンコーディングであり、作成者はこれをオブジェクト クエリと呼び、同様に +エンコーダでは、それらは各アテンション層の入力に追加されます。各オブジェクト クエリは特定のオブジェクトを検索します。 +画像では。デコーダは、複数のセルフ アテンション レイヤとエンコーダ デコーダ アテンション レイヤを通じてこれらの埋め込みを更新します。 +同じ形状の `decoder_hidden_​​states` を出力します: `(batch_size, num_queries, d_model)`。次に頭が2つ +オブジェクト検出のために上部に追加されます。各オブジェクト クエリをオブジェクトの 1 つに分類するための線形レイヤー、または「いいえ」 +オブジェクト」、および各クエリの境界ボックスを予測する MLP。 + +モデルは **2 部マッチング損失**を使用してトレーニングされます。つまり、実際に行うことは、予測されたクラスを比較することです + +グラウンド トゥルース アノテーションに対する N = 100 個の各オブジェクト クエリの境界ボックス (同じ長さ N までパディング) +(したがって、画像にオブジェクトが 4 つしか含まれていない場合、96 個の注釈にはクラスとして「オブジェクトなし」、およびクラスとして「境界ボックスなし」が含まれるだけになります。 +境界ボックス)。 [Hungarian matching algorithm](https://en.wikipedia.org/wiki/Hungarian_algorithm) は、検索に使用されます。 +N 個のクエリのそれぞれから N 個の注釈のそれぞれへの最適な 1 対 1 のマッピング。次に、標準クロスエントロピー ( +クラス)、および L1 と [generalized IoU loss](https://giou.stanford.edu/) の線形結合 ( +境界ボックス) は、モデルのパラメーターを最適化するために使用されます。 + +DETR は、パノプティック セグメンテーション (セマンティック セグメンテーションとインスタンスを統合する) を実行するように自然に拡張できます。 +セグメンテーション)。 [`~transformers.DetrForSegmentation`] はセグメンテーション マスク ヘッドを上に追加します +[`~transformers.DetrForObjectDetection`]。マスク ヘッドは、共同でトレーニングすることも、2 段階のプロセスでトレーニングすることもできます。 +ここで、最初に [`~transformers.DetrForObjectDetection`] モデルをトレーニングして、両方の周囲の境界ボックスを検出します。 +「もの」(インスタンス)と「もの」(木、道路、空などの背景のもの)をすべて凍結し、すべての重みをフリーズしてのみトレーニングします。 +25 エポックのマスクヘッド。実験的には、これら 2 つのアプローチは同様の結果をもたらします。ボックスの予測は +ハンガリー語のマッチングはボックス間の距離を使用して計算されるため、トレーニングを可能にするためにはこれが必要です。 + +## Usage tips + +- DETR は、いわゆる **オブジェクト クエリ** を使用して、画像内のオブジェクトを検出します。クエリの数によって最大値が決まります + 単一の画像内で検出できるオブジェクトの数。デフォルトでは 100 に設定されます (パラメーターを参照) + [`~transformers.DetrConfig`] の `num_queries`)。ある程度の余裕があるのは良いことです (COCO では、 + 著者は 100 を使用しましたが、COCO イメージ内のオブジェクトの最大数は約 70 です)。 +- DETR のデコーダーは、クエリの埋め込みを並行して更新します。これは GPT-2 のような言語モデルとは異なります。 + 並列ではなく自己回帰デコードを使用します。したがって、因果的注意マスクは使用されません。 +- DETR は、投影前に各セルフアテンション層とクロスアテンション層の隠れ状態に位置埋め込みを追加します。 + クエリとキーに。画像の位置埋め込みについては、固定正弦波または学習済みのどちらかを選択できます。 + 絶対位置埋め込み。デフォルトでは、パラメータ `position_embedding_type` は + [`~transformers.DetrConfig`] は `"sine"` に設定されます。 +- DETR の作成者は、トレーニング中に、特にデコーダで補助損失を使用すると役立つことに気づきました。 + モデルは各クラスの正しい数のオブジェクトを出力します。パラメータ `auxiliary_loss` を設定すると、 + [`~transformers.DetrConfig`] を`True`に設定し、フィードフォワード ニューラル ネットワークとハンガリー損失を予測します + は各デコーダ層の後に追加されます (FFN がパラメータを共有する)。 +- 複数のノードにわたる分散環境でモデルをトレーニングする場合は、 + _modeling_detr.py_ の _DetrLoss_ クラスの _num_boxes_ 変数。複数のノードでトレーニングする場合、これは次のようにする必要があります + 元の実装で見られるように、すべてのノードにわたるターゲット ボックスの平均数に設定されます [こちら](https://github.com/facebookresearch/detr/blob/a54b77800eb8e64e3ad0d8237789fcbf2f8350c5/models/detr.py#L227-L232) 。 +- [`~transformers.DetrForObjectDetection`] および [`~transformers.DetrForSegmentation`] は次のように初期化できます。 + [timm ライブラリ](https://github.com/rwightman/pytorch-image-models) で利用可能な畳み込みバックボーン。 + たとえば、MobileNet バックボーンを使用した初期化は、次の `backbone` 属性を設定することで実行できます。 + [`~transformers.DetrConfig`] を `"tf_mobilenetv3_small_075"` に設定し、それを使用してモデルを初期化します。 + 構成。 +- DETR は、最短辺が一定のピクセル数以上になり、最長辺が一定量以上になるように入力画像のサイズを変更します。 + 最大 1333 ピクセル。トレーニング時に、最短辺がランダムに に設定されるようにスケール拡張が使用されます。 + 最小 480、最大 800 ピクセル。推論時には、最短辺が 800 に設定されます。 + +使用できます + [`~transformers.DetrImageProcessor`] 用の画像 (およびオプションの COCO 形式の注釈) を準備します。 + モデル。このサイズ変更により、バッチ内の画像のサイズが異なる場合があります。 DETR は、画像を最大までパディングすることでこの問題を解決します。 + どのピクセルが実数でどのピクセルがパディングであるかを示すピクセル マスクを作成することによって、バッチ内の最大サイズを決定します。 + あるいは、画像をバッチ処理するためにカスタムの `collat​​e_fn` を定義することもできます。 + [`~transformers.DetrImageProcessor.pad_and_create_pixel_mask`]。 +- 画像のサイズによって使用されるメモリの量が決まり、したがって「batch_size」も決まります。 + GPU あたり 2 のバッチ サイズを使用することをお勧めします。詳細については、[この Github 
スレッド](https://github.com/facebookresearch/detr/issues/150) を参照してください。 + +DETR モデルをインスタンス化するには 3 つの方法があります (好みに応じて)。 + +オプション 1: モデル全体の事前トレーニングされた重みを使用して DETR をインスタンス化する + +```py +>>> from transformers import DetrForObjectDetection + +>>> model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50") +``` + +オプション 2: Transformer についてはランダムに初期化された重みを使用して DETR をインスタンス化しますが、バックボーンについては事前にトレーニングされた重みを使用します + +```py +>>> from transformers import DetrConfig, DetrForObjectDetection + +>>> config = DetrConfig() +>>> model = DetrForObjectDetection(config) +``` + +オプション 3: バックボーン + トランスフォーマーのランダムに初期化された重みを使用して DETR をインスタンス化します。 + +```py +>>> config = DetrConfig(use_pretrained_backbone=False) +>>> model = DetrForObjectDetection(config) +``` + +| Task | Object detection | Instance segmentation | Panoptic segmentation | +|------|------------------|-----------------------|-----------------------| +| **Description** |画像内のオブジェクトの周囲の境界ボックスとクラス ラベルを予測する | 画像内のオブジェクト (つまりインスタンス) の周囲のマスクを予測する | 画像内のオブジェクト (インスタンス) と「もの」 (木や道路などの背景) の両方の周囲のマスクを予測します | +| **Model** | [`~transformers.DetrForObjectDetection`] | [`~transformers.DetrForSegmentation`] | [`~transformers.DetrForSegmentation`] | +| **Example dataset** | COCO detection | COCO detection, COCO panoptic | COCO panoptic | | +| **Format of annotations to provide to** [`~transformers.DetrImageProcessor`] | {'image_id': `int`, 'annotations': `List[Dict]`} each Dict being a COCO object annotation | {'image_id': `int`, 'annotations': `List[Dict]`} (in case of COCO detection) or {'file_name': `str`, 'image_id': `int`, 'segments_info': `List[Dict]`} (in case of COCO panoptic) | {'file_name': `str`, 'image_id': `int`, 'segments_info': `List[Dict]`} and masks_path (path to directory containing PNG files of the masks) | +| **Postprocessing** (i.e. 
converting the output of the model to Pascal VOC format) | [`~transformers.DetrImageProcessor.post_process`] | [`~transformers.DetrImageProcessor.post_process_segmentation`] | [`~transformers.DetrImageProcessor.post_process_segmentation`], [`~transformers.DetrImageProcessor.post_process_panoptic`] | +| **evaluators** | `CocoEvaluator` with `iou_types="bbox"` | `CocoEvaluator` with `iou_types="bbox"` or `"segm"` | `CocoEvaluator` with `iou_tupes="bbox"` or `"segm"`, `PanopticEvaluator` | + +つまり、COCO 検出または COCO パノプティック形式でデータを準備してから、次を使用する必要があります。 +[`~transformers.DetrImageProcessor`] `pixel_values`、`pixel_mask`、およびオプションを作成します。 +「ラベル」。これを使用してモデルをトレーニング (または微調整) できます。評価するには、まず、 +[`~transformers.DetrImageProcessor`] の後処理メソッドの 1 つを使用したモデルの出力。これらはできます +`CocoEvaluator` または `PanopticEvaluator` のいずれかに提供され、次のようなメトリクスを計算できます。 +平均平均精度 (mAP) とパノラマ品質 (PQ)。後者のオブジェクトは [元のリポジトリ](https://github.com/facebookresearch/detr) に実装されています。評価の詳細については、[サンプル ノートブック](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DETR) を参照してください。 + +## Resources + +DETR の使用を開始するのに役立つ公式 Hugging Face およびコミュニティ (🌎 で示されている) リソースのリスト。 + + + +- カスタム データセットの [`DetrForObjectDetection`] と [`DetrForSegmentation`] の微調整を説明するすべてのサンプル ノートブックは、[こちら](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DETR) で見つけることができます。 。 +- 参照: [オブジェクト検出タスク ガイド](../tasks/object_detection) + +ここに含めるリソースの送信に興味がある場合は、お気軽にプル リクエストを開いてください。審査させていただきます。リソースは、既存のリソースを複製するのではなく、何か新しいものを示すことが理想的です。 + +## DetrConfig + +[[autodoc]] DetrConfig + +## DetrImageProcessor + +[[autodoc]] DetrImageProcessor + - preprocess + - post_process_object_detection + - post_process_semantic_segmentation + - post_process_instance_segmentation + - post_process_panoptic_segmentation + +## DetrFeatureExtractor + +[[autodoc]] DetrFeatureExtractor + - __call__ + - post_process_object_detection + - post_process_semantic_segmentation + - post_process_instance_segmentation + - post_process_panoptic_segmentation + +## DETR specific outputs + +[[autodoc]] models.detr.modeling_detr.DetrModelOutput + +[[autodoc]] models.detr.modeling_detr.DetrObjectDetectionOutput + +[[autodoc]] models.detr.modeling_detr.DetrSegmentationOutput + +## DetrModel + +[[autodoc]] DetrModel + - forward + +## DetrForObjectDetection + +[[autodoc]] DetrForObjectDetection + - forward + +## DetrForSegmentation + +[[autodoc]] DetrForSegmentation + - forward diff --git a/docs/source/ja/model_doc/dialogpt.md b/docs/source/ja/model_doc/dialogpt.md new file mode 100644 index 00000000000000..22ce0c9a099f75 --- /dev/null +++ b/docs/source/ja/model_doc/dialogpt.md @@ -0,0 +1,57 @@ + + +# DialoGPT + +## Overview + +DialoGPT は、[DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) で Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, +Jianfeng Gao, Jingjing Liu, Bill Dolan.これは、から抽出された 147M 万の会話のようなやりとりでトレーニングされた GPT2 モデルです。 +レディット。 + +論文の要約は次のとおりです。 + +*私たちは、大規模で調整可能なニューラル会話応答生成モデル DialoGPT (対話生成事前トレーニング済み) を紹介します。 +変成器)。 Reddit のコメント チェーンから抽出された 1 億 4,700 万件の会話のようなやり取りを対象にトレーニングされました。 +2005 年から 2017 年にかけて、DialoGPT は人間に近いパフォーマンスを達成するために Hugging Face PyTorch トランスフォーマーを拡張しました。 +シングルターンダイアログ設定における自動評価と人間による評価の両方。会話システムが +DialoGPT を活用すると、強力なベースラインよりも関連性が高く、内容が充実し、コンテキストに一貫性のある応答が生成されます。 +システム。神経反応の研究を促進するために、事前トレーニングされたモデルとトレーニング パイプラインが公開されています。 +よりインテリジェントなオープンドメイン対話システムの生成と開発。* + +元のコードは [ここ](https://github.com/microsoft/DialoGPT) にあります。 + +## Usage tips + + +- DialoGPT は絶対位置埋め込みを備えたモデルであるため、通常は入力を右側にパディングすることをお勧めします。 + 
左よりも。 +- DialoGPT は、会話データの因果言語モデリング (CLM) 目標に基づいてトレーニングされているため、強力です + オープンドメイン対話システムにおける応答生成時。 +- DialoGPT を使用すると、[DialoGPT's model card](https://huggingface.co/microsoft/DialoGPT-medium) に示されているように、ユーザーはわずか 10 行のコードでチャット ボットを作成できます。 + +トレーニング: + +DialoGPT をトレーニングまたは微調整するには、因果言語モデリング トレーニングを使用できます。公式論文を引用すると: *私たちは +OpenAI GPT-2に従って、マルチターン対話セッションを長いテキストとしてモデル化し、生成タスクを言語としてフレーム化します +モデリング。まず、ダイアログ セッション内のすべてのダイアログ ターンを長いテキスト x_1,..., x_N に連結します (N は +* 詳細については、元の論文を参照してください。 + + + +DialoGPT のアーキテクチャは GPT2 モデルに基づいています。API リファレンスと例については、[GPT2 のドキュメント ページ](openai-community/gpt2) を参照してください。 + + diff --git a/docs/source/ja/model_doc/dinat.md b/docs/source/ja/model_doc/dinat.md new file mode 100644 index 00000000000000..a59b073d4669ef --- /dev/null +++ b/docs/source/ja/model_doc/dinat.md @@ -0,0 +1,93 @@ + + +# Dilated Neighborhood Attention Transformer + +## Overview + +DiNAT は [Dilated Neighborhood Attender Transformer](https://arxiv.org/abs/2209.15001) で提案されました。 +Ali Hassani and Humphrey Shi. + +[NAT](nat) を拡張するために、拡張近隣アテンション パターンを追加してグローバル コンテキストをキャプチャします。 +そしてそれと比較して大幅なパフォーマンスの向上が見られます。 + +論文の要約は次のとおりです。 + +*トランスフォーマーは急速に、さまざまなモダリティにわたって最も頻繁に適用される深層学習アーキテクチャの 1 つになりつつあります。 +ドメインとタスク。ビジョンでは、単純なトランスフォーマーへの継続的な取り組みに加えて、階層型トランスフォーマーが +また、そのパフォーマンスと既存のフレームワークへの簡単な統合のおかげで、大きな注目を集めました。 +これらのモデルは通常、スライディング ウィンドウの近隣アテンション (NA) などの局所的な注意メカニズムを採用しています。 +または Swin Transformer のシフト ウィンドウ セルフ アテンション。自己注意の二次複雑さを軽減するのに効果的ですが、 +局所的な注意は、自己注意の最も望ましい 2 つの特性を弱めます。それは、長距離の相互依存性モデリングです。 +そして全体的な受容野。このペーパーでは、自然で柔軟で、 +NA への効率的な拡張により、よりグローバルなコンテキストを捕捉し、受容野をゼロから指数関数的に拡張することができます。 +追加費用。 NA のローカルな注目と DiNA のまばらなグローバルな注目は相互に補完し合うため、私たちは +両方に基づいて構築された新しい階層型ビジョン トランスフォーマーである Dilated Neighborhood Attendant Transformer (DiNAT) を導入します。 +DiNAT のバリアントは、NAT、Swin、ConvNeXt などの強力なベースラインに比べて大幅に改善されています。 +私たちの大規模モデルは、COCO オブジェクト検出において Swin モデルよりも高速で、ボックス AP が 1.5% 優れています。 +COCO インスタンス セグメンテーションでは 1.3% のマスク AP、ADE20K セマンティック セグメンテーションでは 1.1% の mIoU。 +新しいフレームワークと組み合わせた当社の大規模バリアントは、COCO (58.2 PQ) 上の新しい最先端のパノプティック セグメンテーション モデルです。 +および ADE20K (48.5 PQ)、および Cityscapes (44.5 AP) および ADE20K (35.4 AP) のインスタンス セグメンテーション モデル (追加データなし)。 +また、ADE20K (58.2 mIoU) 上の最先端の特殊なセマンティック セグメンテーション モデルとも一致します。 +都市景観 (84.5 mIoU) では 2 位にランクされています (追加データなし)。 * + + + + + 異なる拡張値を使用した近隣アテンション。 +元の論文から抜粋。 + +このモデルは [Ali Hassani](https://huggingface.co/alihassanijr) によって提供されました。 +元のコードは [ここ](https://github.com/SHI-Labs/Neighborhood-Attendance-Transformer) にあります。 + +## Usage tips + +DiNAT は *バックボーン* として使用できます。 「output_hidden_​​states = True」の場合、 +`hidden_​​states` と `reshaped_hidden_​​states` の両方を出力します。 `reshape_hidden_​​states` は、`(batch_size, height, width, num_channels)` ではなく、`(batch, num_channels, height, width)` の形状を持っています。 + +ノート: +- DiNAT は、[NATTEN](https://github.com/SHI-Labs/NATTEN/) による近隣アテンションと拡張近隣アテンションの実装に依存しています。 +[shi-labs.com/natten](https://shi-labs.com/natten) を参照して、Linux 用のビルド済みホイールを使用してインストールするか、`pip install natten` を実行してシステム上に構築できます。 +後者はコンパイルに時間がかかる可能性があることに注意してください。 NATTEN はまだ Windows デバイスをサポートしていません。 +- 現時点ではパッチ サイズ 4 のみがサポートされています。 + +## Resources + +DiNAT の使用を開始するのに役立つ公式 Hugging Face およびコミュニティ (🌎 で示されている) リソースのリスト。 + + + + +- [`DinatForImageClassification`] は、この [サンプル スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) および [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb)。 +- 参照: [画像分類タスク ガイド](../tasks/image_classification) + +ここに含めるリソースの送信に興味がある場合は、お気軽にプル リクエストを開いてください。審査させていただきます。リソースは、既存のリソースを複製するのではなく、何か新しいものを示すことが理想的です。 + 
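+上記の Usage tips で述べたバックボーンとしての利用を補足するために、`reshaped_hidden_states` の形状を確認する最小限のスケッチを示します。チェックポイント名 `shi-labs/dinat-mini-in1k-224` と画像 URL は説明用の一例であり、実行には前述のとおり NATTEN のインストールが必要です。
+
+```python
+import torch
+import requests
+from PIL import Image
+from transformers import AutoImageProcessor, DinatModel
+
+# 画像 URL は一例です
+url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+image = Image.open(requests.get(url, stream=True).raw)
+
+# チェックポイント名は一例です
+processor = AutoImageProcessor.from_pretrained("shi-labs/dinat-mini-in1k-224")
+model = DinatModel.from_pretrained("shi-labs/dinat-mini-in1k-224")
+
+inputs = processor(images=image, return_tensors="pt")
+with torch.no_grad():
+    outputs = model(**inputs, output_hidden_states=True)
+
+# 各ステージの特徴マップは (batch_size, num_channels, height, width) 形式で返されます
+for feature_map in outputs.reshaped_hidden_states:
+    print(feature_map.shape)
+```
+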
+## DinatConfig + +[[autodoc]] DinatConfig + +## DinatModel + +[[autodoc]] DinatModel + - forward + +## DinatForImageClassification + +[[autodoc]] DinatForImageClassification + - forward diff --git a/docs/source/ja/model_memory_anatomy.md b/docs/source/ja/model_memory_anatomy.md new file mode 100644 index 00000000000000..5f09489b7f79aa --- /dev/null +++ b/docs/source/ja/model_memory_anatomy.md @@ -0,0 +1,255 @@ + + +# Model training anatomy + +モデルトレーニングの効率を向上させるために適用できるパフォーマンス最適化テクニックを理解するには、トレーニング中にGPUがどのように利用されるか、および実行される操作に応じて計算強度がどのように変化するかを理解することが役立ちます。 + +まずは、GPUの利用例とモデルのトレーニング実行に関する示唆に富む例を探求することから始めましょう。デモンストレーションのために、いくつかのライブラリをインストールする必要があります: + +```bash +pip install transformers datasets accelerate nvidia-ml-py3 +``` + +`nvidia-ml-py3` ライブラリは、Python内からモデルのメモリ使用状況をモニターすることを可能にします。おそらく、ターミナルでの `nvidia-smi` コマンドについてはお聞きかもしれませんが、このライブラリを使用すると、Pythonから同じ情報にアクセスできます。 + +それから、いくつかのダミーデータを作成します。100から30000の間のランダムなトークンIDと、分類器のためのバイナリラベルです。合計で、512のシーケンスがあり、それぞれの長さは512で、PyTorchフォーマットの [`~datasets.Dataset`] に格納されます。 + + +```py +>>> import numpy as np +>>> from datasets import Dataset + + +>>> seq_len, dataset_size = 512, 512 +>>> dummy_data = { +... "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)), +... "labels": np.random.randint(0, 1, (dataset_size)), +... } +>>> ds = Dataset.from_dict(dummy_data) +>>> ds.set_format("pt") +``` + + +[`Trainer`]を使用してGPU利用率とトレーニング実行の要約統計情報を表示するために、2つのヘルパー関数を定義します。 + + +```py +>>> from pynvml import * + + +>>> def print_gpu_utilization(): +... nvmlInit() +... handle = nvmlDeviceGetHandleByIndex(0) +... info = nvmlDeviceGetMemoryInfo(handle) +... print(f"GPU memory occupied: {info.used//1024**2} MB.") + + +>>> def print_summary(result): +... print(f"Time: {result.metrics['train_runtime']:.2f}") +... print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}") +... print_gpu_utilization() +``` + +以下は、無料のGPUメモリから開始していることを確認しましょう: + + +```py +>>> print_gpu_utilization() +GPU memory occupied: 0 MB. +``` + +GPUメモリがモデルを読み込む前のように占有されていないように見えます。これがお使いのマシンでの状況でない場合は、GPUメモリを使用しているすべてのプロセスを停止してください。ただし、すべての空きGPUメモリをユーザーが使用できるわけではありません。モデルがGPUに読み込まれると、カーネルも読み込まれ、1〜2GBのメモリを使用することがあります。それがどれくらいかを確認するために、GPUに小さなテンソルを読み込むと、カーネルも読み込まれます。 + + +```py +>>> import torch + + +>>> torch.ones((1, 1)).to("cuda") +>>> print_gpu_utilization() +GPU memory occupied: 1343 MB. +``` + +カーネルだけで1.3GBのGPUメモリを使用していることがわかります。次に、モデルがどれだけのスペースを使用しているかを見てみましょう。 + +## Load Model + +まず、`google-bert/bert-large-uncased` モデルを読み込みます。モデルの重みを直接GPUに読み込むことで、重みだけがどれだけのスペースを使用しているかを確認できます。 + + +```py +>>> from transformers import AutoModelForSequenceClassification + + +>>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-large-uncased").to("cuda") +>>> print_gpu_utilization() +GPU memory occupied: 2631 MB. +``` + +モデルの重みだけで、GPUメモリを1.3 GB使用していることがわかります。正確な数値は、使用している具体的なGPUに依存します。新しいGPUでは、モデルの重みが最適化された方法で読み込まれるため、モデルの使用を高速化することがあるため、モデルがより多くのスペースを占有することがあります。さて、`nvidia-smi` CLIと同じ結果が得られるかを簡単に確認することもできます。 + + +```bash +nvidia-smi +``` + +```bash +Tue Jan 11 08:58:05 2022 ++-----------------------------------------------------------------------------+ +| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 | +|-------------------------------+----------------------+----------------------+ +| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | +| | | MIG M. | +|===============================+======================+======================| +| 0 Tesla V100-SXM2... 
On | 00000000:00:04.0 Off | 0 | +| N/A 37C P0 39W / 300W | 2631MiB / 16160MiB | 0% Default | +| | | N/A | ++-------------------------------+----------------------+----------------------+ + ++-----------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=============================================================================| +| 0 N/A N/A 3721 C ...nvs/codeparrot/bin/python 2629MiB | ++-----------------------------------------------------------------------------+ +``` + +前回と同じ数値を取得し、16GBのメモリを搭載したV100 GPUを使用していることがわかります。さて、モデルのトレーニングを開始し、GPUメモリの消費がどのように変化するかを確認してみましょう。まず、いくつかの標準的なトレーニング引数を設定します: + + +```py +default_args = { + "output_dir": "tmp", + "evaluation_strategy": "steps", + "num_train_epochs": 1, + "log_level": "error", + "report_to": "none", +} +``` + + + +複数の実験を実行する予定がある場合、実験間でメモリを適切にクリアするために、実験の間に Python カーネルを再起動してください。 + + + +## Memory utilization at vanilla training + +[`Trainer`] を使用して、GPU パフォーマンスの最適化テクニックを使用せずにバッチサイズ 4 でモデルをトレーニングしましょう: + + +```py +>>> from transformers import TrainingArguments, Trainer, logging + +>>> logging.set_verbosity_error() + + +>>> training_args = TrainingArguments(per_device_train_batch_size=4, **default_args) +>>> trainer = Trainer(model=model, args=training_args, train_dataset=ds) +>>> result = trainer.train() +>>> print_summary(result) +``` + +``` +Time: 57.82 +Samples/second: 8.86 +GPU memory occupied: 14949 MB. +``` + +既に、比較的小さいバッチサイズでも、GPUのほとんどのメモリがすでに使用されていることがわかります。しかし、より大きなバッチサイズを使用することは、しばしばモデルの収束が速くなったり、最終的な性能が向上したりすることがあります。したがって、理想的には、バッチサイズをモデルの要件に合わせて調整したいのですが、GPUの制限に合わせて調整する必要はありません。興味深いことに、モデルのサイズよりもはるかに多くのメモリを使用しています。なぜそうなるのかを少し理解するために、モデルの操作とメモリの必要性を見てみましょう。 + + +## Anatomy of Model's Operations + +Transformerアーキテクチャには、計算強度によって以下の3つの主要な操作グループが含まれています。 + +1. **テンソルの収縮** + + 線形層とMulti-Head Attentionのコンポーネントは、すべてバッチ処理された **行列-行列の乗算** を行います。これらの操作は、Transformerのトレーニングにおいて最も計算集約的な部分です。 + +2. **統計的正規化** + + Softmaxと層正規化は、テンソルの収縮よりも計算負荷が少なく、1つまたは複数の **縮約操作** を含み、その結果がマップを介して適用されます。 + +3. **要素ごとの演算子** + + これらは残りの演算子です:**バイアス、ドロップアウト、活性化、および残差接続** です。これらは最も計算集約的な操作ではありません。 + +パフォーマンスのボトルネックを分析する際に、この知識は役立つことがあります。 + +この要約は、[Data Movement Is All You Need: Optimizing Transformers 2020に関するケーススタディ](https://arxiv.org/abs/2007.00072)から派生しています。 + +## Anatomy of Model's Memory + +モデルのトレーニングがGPUに配置されたモデルよりもはるかに多くのメモリを使用することを見てきました。これは、トレーニング中にGPUメモリを使用する多くのコンポーネントが存在するためです。GPUメモリ上のコンポーネントは以下の通りです: + +1. モデルの重み +2. オプティマイザの状態 +3. 勾配 +4. 勾配計算のために保存された前向き活性化 +5. 一時バッファ +6. 
機能固有のメモリ + +通常、AdamWを使用して混合精度でトレーニングされたモデルは、モデルパラメータごとに18バイトとアクティベーションメモリが必要です。推論ではオプティマイザの状態と勾配は不要ですので、これらを差し引くことができます。したがって、混合精度の推論においては、モデルパラメータごとに6バイトとアクティベーションメモリが必要です。 + +詳細を見てみましょう。 + + +**モデルの重み:** + +- fp32トレーニングのパラメーター数 * 4バイト +- ミックスプレシジョントレーニングのパラメーター数 * 6バイト(メモリ内にfp32とfp16のモデルを維持) + +**オプティマイザの状態:** + +- 通常のAdamWのパラメーター数 * 8バイト(2つの状態を維持) +- 8-bit AdamWオプティマイザのパラメーター数 * 2バイト([bitsandbytes](https://github.com/TimDettmers/bitsandbytes)のようなオプティマイザ) +- モーメンタムを持つSGDのようなオプティマイザのパラメーター数 * 4バイト(1つの状態を維持) + +**勾配** + +- fp32またはミックスプレシジョントレーニングのパラメーター数 * 4バイト(勾配は常にfp32で保持) + +**フォワードアクティベーション** + +- サイズは多くの要因に依存し、主要な要因はシーケンスの長さ、隠れ層のサイズ、およびバッチサイズです。 + +フォワードとバックワードの関数によって渡され、返される入力と出力、および勾配計算のために保存されるフォワードアクティベーションがあります。 + +**一時的なメモリ** + +さらに、計算が完了した後に解放されるさまざまな一時変数がありますが、これらは一時的に追加のメモリを必要とし、OOMに達する可能性があります。したがって、コーディング時にはこのような一時変数に戦略的に考え、必要なくなったら明示的に解放することが非常に重要です。 + +**機能固有のメモリ** + +次に、ソフトウェアには特別なメモリ要件がある場合があります。たとえば、ビームサーチを使用してテキストを生成する場合、ソフトウェアは複数の入力と出力のコピーを維持する必要があります。 + +**`forward`と`backward`の実行速度** + +畳み込み層と線形層では、バックワードにフォワードと比べて2倍のFLOPSがあり、一般的には約2倍遅くなります(バックワードのサイズが不便であることがあるため、それ以上になることがあります)。 アクティベーションは通常、バンド幅制限されており、バックワードでアクティベーションがフォワードよりも多くのデータを読むことが一般的です(たとえば、アクティベーションフォワードは1回読み取り、1回書き込み、アクティベーションバックワードはフォワードのgradOutputおよび出力を2回読み取り、1回書き込みます)。 + +ご覧の通り、GPUメモリを節約したり操作を高速化できる可能性のあるいくつかの場所があります。 GPUの利用と計算速度に影響を与える要因を理解したので、パフォーマンス最適化の技術については、[単一GPUでの効率的なトレーニングのための方法とツール](perf_train_gpu_one)のドキュメンテーションページを参照してください。 + +詳細を見てみましょう。 + + + + + + diff --git a/docs/source/ja/model_sharing.md b/docs/source/ja/model_sharing.md new file mode 100644 index 00000000000000..aa8f7a3d1e3327 --- /dev/null +++ b/docs/source/ja/model_sharing.md @@ -0,0 +1,262 @@ + + +# Share a Model + +最後の2つのチュートリアルでは、PyTorch、Keras、および🤗 Accelerateを使用してモデルをファインチューニングする方法を示しました。次のステップは、モデルをコミュニティと共有することです!Hugging Faceでは、知識とリソースを公開的に共有し、人工知能を誰にでも提供することを信じています。他の人々が時間とリソースを節約できるように、モデルをコミュニティと共有することを検討することをお勧めします。 + +このチュートリアルでは、訓練済みまたはファインチューニングされたモデルを[Model Hub](https://huggingface.co/models)に共有する2つの方法を学びます: + +- プログラムでファイルをHubにプッシュする。 +- ウェブインターフェースを使用してファイルをHubにドラッグアンドドロップする。 + + + + + +コミュニティとモデルを共有するには、[huggingface.co](https://huggingface.co/join)でアカウントが必要です。既存の組織に参加したり、新しい組織を作成したりすることもできます。 + + + +## Repository Features + +Model Hub上の各リポジトリは、通常のGitHubリポジトリのように動作します。リポジトリはバージョニング、コミット履歴、違いの視覚化の機能を提供します。 + +Model Hubの組み込みバージョニングはgitおよび[git-lfs](https://git-lfs.github.com/)に基づいています。言い換えれば、モデルを1つのリポジトリとして扱うことができ、より大きなアクセス制御とスケーラビリティを実現します。バージョン管理には*リビジョン*があり、コミットハッシュ、タグ、またはブランチ名で特定のモデルバージョンをピン留めする方法です。 + +その結果、`revision`パラメータを使用して特定のモデルバージョンをロードできます: + +```py +>>> model = AutoModel.from_pretrained( +... "julien-c/EsperBERTo-small", revision="v2.0.1" # タグ名、またはブランチ名、またはコミットハッシュ +... 
) +``` + +ファイルはリポジトリ内で簡単に編集でき、コミット履歴と差分を表示できます: + +![vis_diff](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vis_diff.png) + +## Set Up + +モデルをHubに共有する前に、Hugging Faceの認証情報が必要です。ターミナルへのアクセス権がある場合、🤗 Transformersがインストールされている仮想環境で以下のコマンドを実行します。これにより、アクセストークンがHugging Faceのキャッシュフォルダに保存されます(デフォルトでは `~/.cache/` に保存されます): + +```bash +huggingface-cli login +``` + +JupyterやColaboratoryのようなノートブックを使用している場合、[`huggingface_hub`](https://huggingface.co/docs/hub/adding-a-library)ライブラリがインストールされていることを確認してください。 +このライブラリを使用すると、Hubとプログラム的に対話できます。 + +```bash +pip install huggingface_hub +``` + +次に、`notebook_login`を使用してHubにサインインし、[こちらのリンク](https://huggingface.co/settings/token)にアクセスしてログインに使用するトークンを生成します: + +```python +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## Convert a Model for all frameworks + +異なるフレームワークで作業している他のユーザーがあなたのモデルを使用できるようにするために、 +PyTorchおよびTensorFlowのチェックポイントでモデルを変換してアップロードすることをお勧めします。 +このステップをスキップすると、ユーザーは異なるフレームワークからモデルをロードできますが、 +モデルをオンザフライで変換する必要があるため、遅くなります。 + +別のフレームワーク用にチェックポイントを変換することは簡単です。 +PyTorchとTensorFlowがインストールされていることを確認してください(インストール手順については[こちら](installation)を参照)し、 +その後、他のフレームワーク向けに特定のタスク用のモデルを見つけます。 + + + +TensorFlowからPyTorchにチェックポイントを変換するには、`from_tf=True`を指定します: + +```python +>>> pt_model = DistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_tf=True) +>>> pt_model.save_pretrained("path/to/awesome-name-you-picked") +``` + + + +指定して、PyTorchからTensorFlowにチェックポイントを変換するには `from_pt=True` を使用します: + +```python +>>> tf_model = TFDistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_pt=True) +``` + +新しいTensorFlowモデルとその新しいチェックポイントを保存できます: + +```python +>>> tf_model.save_pretrained("path/to/awesome-name-you-picked") +``` + + + +Flaxでモデルが利用可能な場合、PyTorchからFlaxへのチェックポイントの変換も行うことができます: + +```py +>>> flax_model = FlaxDistilBertForSequenceClassification.from_pretrained( +... "path/to/awesome-name-you-picked", from_pt=True +... ) +``` + + + +## Push a model during traning + + + + + +モデルをHubにプッシュすることは、追加のパラメーターまたはコールバックを追加するだけで簡単です。 +[ファインチューニングチュートリアル](training)から思い出してください、[`TrainingArguments`]クラスはハイパーパラメーターと追加のトレーニングオプションを指定する場所です。 +これらのトレーニングオプションの1つに、モデルを直接Hubにプッシュする機能があります。[`TrainingArguments`]で`push_to_hub=True`を設定します: + +```py +>>> training_args = TrainingArguments(output_dir="my-awesome-model", push_to_hub=True) +``` + +Pass your training arguments as usual to [`Trainer`]: + +```py +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=small_train_dataset, +... eval_dataset=small_eval_dataset, +... compute_metrics=compute_metrics, +... ) +``` + +[`Trainer`]に通常通りトレーニング引数を渡します: + +```py +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=small_train_dataset, +... eval_dataset=small_eval_dataset, +... compute_metrics=compute_metrics, +... ) +``` + +ファインチューニングが完了したら、[`Trainer`]で[`~transformers.Trainer.push_to_hub`]を呼び出して、トレーニング済みモデルをHubにプッシュします。🤗 Transformersは、トレーニングのハイパーパラメータ、トレーニング結果、およびフレームワークのバージョンを自動的にモデルカードに追加します! + +```py +>>> trainer.push_to_hub() +``` + + + + +[`PushToHubCallback`]を使用してモデルをHubに共有します。[`PushToHubCallback`]関数には、次のものを追加します: + +- モデルの出力ディレクトリ。 +- トークナイザ。 +- `hub_model_id`、つまりHubのユーザー名とモデル名。 + +```python +>>> from transformers import PushToHubCallback + +>>> push_to_hub_callback = PushToHubCallback( +... output_dir="./your_model_save_path", tokenizer=tokenizer, hub_model_id="your-username/my-awesome-model" +... 
) +``` + +🤗 Transformersは[`fit`](https://keras.io/api/models/model_training_apis/)にコールバックを追加し、トレーニング済みモデルをHubにプッシュします: + +```py +>>> model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3, callbacks=push_to_hub_callback) +``` + + + +## `push_to_hub` 関数を使用する + +また、モデルを直接Hubにアップロードするために、`push_to_hub` を呼び出すこともできます。 + +`push_to_hub` でモデル名を指定します: + +```py +>>> pt_model.push_to_hub("my-awesome-model") +``` + +これにより、ユーザー名の下にモデル名 `my-awesome-model` を持つリポジトリが作成されます。 +ユーザーは、`from_pretrained` 関数を使用してモデルをロードできます: + +```py +>>> from transformers import AutoModel + +>>> model = AutoModel.from_pretrained("your_username/my-awesome-model") +``` + +組織に所属し、モデルを組織名のもとにプッシュしたい場合、`repo_id` にそれを追加してください: + +```python +>>> pt_model.push_to_hub("my-awesome-org/my-awesome-model") +``` + +`push_to_hub`関数は、モデルリポジトリに他のファイルを追加するためにも使用できます。例えば、トークナイザをモデルリポジトリに追加します: + +```py +>>> tokenizer.push_to_hub("my-awesome-model") +``` + +あるいは、ファインチューニングされたPyTorchモデルのTensorFlowバージョンを追加したいかもしれません: + +```python +>>> tf_model.push_to_hub("my-awesome-model") +``` + +Hugging Faceプロフィールに移動すると、新しく作成したモデルリポジトリが表示されるはずです。**Files**タブをクリックすると、リポジトリにアップロードしたすべてのファイルが表示されます。 + +リポジトリにファイルを作成およびアップロードする方法の詳細については、Hubドキュメンテーション[こちら](https://huggingface.co/docs/hub/how-to-upstream)を参照してください。 + +## Upload with the web interface + +コードを書かずにモデルをアップロードしたいユーザーは、Hubのウェブインターフェースを使用してモデルをアップロードできます。[huggingface.co/new](https://huggingface.co/new)を訪れて新しいリポジトリを作成します: + +![new_model_repo](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/new_model_repo.png) + +ここから、モデルに関するいくつかの情報を追加します: + +- リポジトリの**所有者**を選択します。これはあなた自身または所属している組織のいずれかです。 +- モデルの名前を選択します。これはリポジトリの名前にもなります。 +- モデルが公開か非公開かを選択します。 +- モデルのライセンス使用方法を指定します。 + +その後、**Files**タブをクリックし、**Add file**ボタンをクリックしてリポジトリに新しいファイルをアップロードします。次に、ファイルをドラッグアンドドロップしてアップロードし、コミットメッセージを追加します。 + +![upload_file](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/upload_file.png) + +## Add a model card + +ユーザーがモデルの機能、制限、潜在的な偏り、倫理的な考慮事項を理解できるようにするために、モデルリポジトリにモデルカードを追加してください。モデルカードは`README.md`ファイルで定義されます。モデルカードを追加する方法: + +* 手動で`README.md`ファイルを作成およびアップロードする。 +* モデルリポジトリ内の**Edit model card**ボタンをクリックする。 + +モデルカードに含めるべき情報の例については、DistilBert [モデルカード](https://huggingface.co/distilbert/distilbert-base-uncased)をご覧ください。`README.md`ファイルで制御できる他のオプション、例えばモデルの炭素フットプリントやウィジェットの例などについての詳細は、[こちらのドキュメンテーション](https://huggingface.co/docs/hub/models-cards)を参照してください。 + + + + + diff --git a/docs/source/ja/model_summary.md b/docs/source/ja/model_summary.md new file mode 100644 index 00000000000000..8f8b6af48cc17b --- /dev/null +++ b/docs/source/ja/model_summary.md @@ -0,0 +1,110 @@ + + +# The Transformer model family + +2017年に導入されて以来、[元のTransformer](https://arxiv.org/abs/1706.03762)モデルは、自然言語処理(NLP)のタスクを超える多くの新しいエキサイティングなモデルをインスパイアしました。[タンパク質の折りたたまれた構造を予測](https://huggingface.co/blog/deep-learning-with-proteins)するモデル、[チーターを走らせるためのトレーニング](https://huggingface.co/blog/train-decision-transformers)するモデル、そして[時系列予測](https://huggingface.co/blog/time-series-transformers)のためのモデルなどがあります。Transformerのさまざまなバリアントが利用可能ですが、大局を見落とすことがあります。これらのすべてのモデルに共通するのは、元のTransformerアーキテクチャに基づいていることです。一部のモデルはエンコーダまたはデコーダのみを使用し、他のモデルは両方を使用します。これは、Transformerファミリー内のモデルの高レベルの違いをカテゴライズし、調査するための有用な分類法を提供し、以前に出会ったことのないTransformerを理解するのに役立ちます。 + +元のTransformerモデルに慣れていないか、リフレッシュが必要な場合は、Hugging Faceコースの[Transformerの動作原理](https://huggingface.co/course/chapter1/4?fw=pt)章をチェックしてください。 + +
+ +
+ +## Computer vision + + + +### Convolutional network + +長い間、畳み込みネットワーク(CNN)はコンピュータビジョンのタスクにおいて支配的なパラダイムでしたが、[ビジョンTransformer](https://arxiv.org/abs/2010.11929)はそのスケーラビリティと効率性を示しました。それでも、一部のCNNの最高の特性、特に特定のタスクにとっては非常に強力な翻訳不変性など、一部のTransformerはアーキテクチャに畳み込みを組み込んでいます。[ConvNeXt](model_doc/convnext)は、畳み込みを現代化するためにTransformerから設計の選択肢を取り入れ、例えば、ConvNeXtは画像をパッチに分割するために重なり合わないスライディングウィンドウと、グローバル受容野を増加させるための大きなカーネルを使用します。ConvNeXtは、メモリ効率を向上させ、パフォーマンスを向上させるためにいくつかのレイヤーデザインの選択肢も提供し、Transformerと競合的になります! + + +### Encoder[[cv-encoder]] + +[ビジョン トランスフォーマー(ViT)](model_doc/vit) は、畳み込みを使用しないコンピュータビジョンタスクの扉を開けました。ViT は標準のトランスフォーマーエンコーダーを使用しますが、画像を扱う方法が主要なブレークスルーでした。画像を固定サイズのパッチに分割し、それらをトークンのように使用して埋め込みを作成します。ViT は、当時のCNNと競争力のある結果を示すためにトランスフォーマーの効率的なアーキテクチャを活用しましたが、トレーニングに必要なリソースが少なくて済みました。ViT に続いて、セグメンテーションや検出などの密なビジョンタスクを処理できる他のビジョンモデルも登場しました。 + +これらのモデルの1つが[Swin](model_doc/swin) トランスフォーマーです。Swin トランスフォーマーは、より小さなサイズのパッチから階層的な特徴マップ(CNNのようで ViT とは異なります)を構築し、深層のパッチと隣接するパッチとマージします。注意はローカルウィンドウ内でのみ計算され、ウィンドウは注意のレイヤー間でシフトされ、モデルがより良く学習するのをサポートする接続を作成します。Swin トランスフォーマーは階層的な特徴マップを生成できるため、セグメンテーションや検出などの密な予測タスクに適しています。[SegFormer](model_doc/segformer) も階層的な特徴マップを構築するためにトランスフォーマーエンコーダーを使用しますが、すべての特徴マップを組み合わせて予測するためにシンプルなマルチレイヤーパーセプトロン(MLP)デコーダーを追加します。 + +BeIT および ViTMAE などの他のビジョンモデルは、BERTの事前トレーニング目標からインスピレーションを得ました。[BeIT](model_doc/beit) は *masked image modeling (MIM)* によって事前トレーニングされています。画像パッチはランダムにマスクされ、画像も視覚トークンにトークン化されます。BeIT はマスクされたパッチに対応する視覚トークンを予測するようにトレーニングされます。[ViTMAE](model_doc/vitmae) も似たような事前トレーニング目標を持っており、視覚トークンの代わりにピクセルを予測する必要があります。異例なのは画像パッチの75%がマスクされていることです!デコーダーはマスクされたトークンとエンコードされたパッチからピクセルを再構築します。事前トレーニングの後、デコーダーは捨てられ、エンコーダーはダウンストリームのタスクで使用できる状態です。 + +### Decoder[[cv-decoder]] + +デコーダーのみのビジョンモデルは珍しいです。なぜなら、ほとんどのビジョンモデルは画像表現を学ぶためにエンコーダーを使用するからです。しかし、画像生成などのユースケースでは、デコーダーは自然な適応です。これは、GPT-2などのテキスト生成モデルから見てきたように、[ImageGPT](model_doc/imagegpt) でも同様のアーキテクチャを使用しますが、シーケンス内の次のトークンを予測する代わりに、画像内の次のピクセルを予測します。画像生成に加えて、ImageGPT は画像分類のためにもファインチューニングできます。 + +### Encoder-decoder[[cv-encoder-decoder]] + +ビジョンモデルは一般的にエンコーダー(バックボーンとも呼ばれます)を使用して重要な画像特徴を抽出し、それをトランスフォーマーデコーダーに渡すために使用します。[DETR](model_doc/detr) は事前トレーニング済みのバックボーンを持っていますが、オブジェクト検出のために完全なトランスフォーマーエンコーダーデコーダーアーキテクチャも使用しています。エンコーダーは画像表現を学び、デコーダー内のオブジェクトクエリ(各オブジェクトクエリは画像内の領域またはオブジェクトに焦点を当てた学習された埋め込みです)と組み合わせます。DETR は各オブジェクトクエリに対する境界ボックスの座標とクラスラベルを予測します。 + +## Natural lanaguage processing + + + +### Encoder[[nlp-encoder]] + +[BERT](model_doc/bert) はエンコーダー専用のTransformerで、入力の一部のトークンをランダムにマスクして他のトークンを見ないようにしています。これにより、トークンをマスクした文脈に基づいてマスクされたトークンを予測することが事前トレーニングの目標です。これにより、BERTは入力のより深いかつ豊かな表現を学習するのに左右の文脈を完全に活用できます。しかし、BERTの事前トレーニング戦略にはまだ改善の余地がありました。[RoBERTa](model_doc/roberta) は、トレーニングを長時間行い、より大きなバッチでトレーニングし、事前処理中に一度だけでなく各エポックでトークンをランダムにマスクし、次文予測の目標を削除する新しい事前トレーニングレシピを導入することでこれを改善しました。 + +性能を向上させる主要な戦略はモデルのサイズを増やすことですが、大規模なモデルのトレーニングは計算コストがかかります。計算コストを削減する方法の1つは、[DistilBERT](model_doc/distilbert) のような小さなモデルを使用することです。DistilBERTは[知識蒸留](https://arxiv.org/abs/1503.02531) - 圧縮技術 - を使用して、BERTのほぼすべての言語理解機能を保持しながら、より小さなバージョンを作成します。 + +しかし、ほとんどのTransformerモデルは引き続きより多くのパラメータに焦点を当て、トレーニング効率を向上させる新しいモデルが登場しています。[ALBERT](model_doc/albert) は、2つの方法でパラメータの数を減らすことによってメモリ消費量を削減します。大きな語彙埋め込みを2つの小さな行列に分割し、レイヤーがパラメータを共有できるようにします。[DeBERTa](model_doc/deberta) は、単語とその位置を2つのベクトルで別々にエンコードする解かれた注意機構を追加しました。注意はこれらの別々のベクトルから計算されます。単語と位置の埋め込みが含まれる単一のベクトルではなく、[Longformer](model_doc/longformer) は、特に長いシーケンス長のドキュメントを処理するために注意をより効率的にすることに焦点を当てました。固定されたウィンドウサイズの周りの各トークンから計算されるローカルウィンドウ付き注意(特定のタスクトークン(分類のための `[CLS]` など)のみのためのグローバルな注意を含む)の組み合わせを使用して、完全な注意行列ではなく疎な注意行列を作成します。 + + +### Decoder[[nlp-decoder]] + 
+[GPT-2](model_doc/gpt2)は、シーケンス内の次の単語を予測するデコーダー専用のTransformerです。モデルは先を見ることができないようにトークンを右にマスクし、"のぞき見"を防ぎます。大量のテキストを事前トレーニングしたことにより、GPT-2はテキスト生成が非常に得意で、テキストが正確であることがあるにしても、時折正確ではないことがあります。しかし、GPT-2にはBERTの事前トレーニングからの双方向コンテキストが不足しており、特定のタスクには適していませんでした。[XLNET](model_doc/xlnet)は、双方向に学習できる順列言語モデリング目標(PLM)を使用することで、BERTとGPT-2の事前トレーニング目標のベストを組み合わせています。 + +GPT-2の後、言語モデルはさらに大きく成長し、今では*大規模言語モデル(LLM)*として知られています。大規模なデータセットで事前トレーニングされれば、LLMはほぼゼロショット学習を示すことがあります。[GPT-J](model_doc/gptj)は、6Bのパラメータを持つLLMで、400Bのトークンでトレーニングされています。GPT-Jには[OPT](model_doc/opt)が続き、そのうち最大のモデルは175Bで、180Bのトークンでトレーニングされています。同じ時期に[BLOOM](model_doc/bloom)がリリースされ、このファミリーの最大のモデルは176Bのパラメータを持ち、46の言語と13のプログラミング言語で366Bのトークンでトレーニングされています。 + +### Encoder-decoder[[nlp-encoder-decoder]] + +[BART](model_doc/bart)は、元のTransformerアーキテクチャを保持していますが、事前トレーニング目標を*テキスト補完*の破損に変更しています。一部のテキストスパンは単一の`mask`トークンで置換されます。デコーダーは破損していないトークンを予測し(未来のトークンはマスクされます)、エンコーダーの隠れた状態を使用して予測を補助します。[Pegasus](model_doc/pegasus)はBARTに似ていますが、Pegasusはテキストスパンの代わりに文全体をマスクします。マスクされた言語モデリングに加えて、Pegasusはギャップ文生成(GSG)によって事前トレーニングされています。GSGの目標は、文書に重要な文をマスクし、それらを`mask`トークンで置換することです。デコーダーは残りの文から出力を生成しなければなりません。[T5](model_doc/t5)は、すべてのNLPタスクを特定のプレフィックスを使用してテキスト対テキストの問題に変換するよりユニークなモデルです。たとえば、プレフィックス`Summarize:`は要約タスクを示します。T5は教師ありトレーニング(GLUEとSuperGLUE)と自己教師ありトレーニング(トークンの15%をランダムにサンプルしドロップアウト)によって事前トレーニングされています。 + + +## Audio + + + +### Encoder[[audio-encoder]] + +[Wav2Vec2](model_doc/wav2vec2) は、生のオーディオ波形から直接音声表現を学習するためのTransformerエンコーダーを使用します。これは、対照的なタスクで事前学習され、一連の偽の表現から真の音声表現を特定します。 [HuBERT](model_doc/hubert) はWav2Vec2に似ていますが、異なるトレーニングプロセスを持っています。ターゲットラベルは、類似したオーディオセグメントがクラスタに割り当てられ、これが隠れユニットになるクラスタリングステップによって作成されます。隠れユニットは埋め込みにマップされ、予測を行います。 + +### Encoder-decoder[[audio-encoder-decoder]] + +[Speech2Text](model_doc/speech_to_text) は、自動音声認識(ASR)および音声翻訳のために設計された音声モデルです。このモデルは、オーディオ波形から抽出されたログメルフィルターバンクフィーチャーを受け入れ、事前トレーニングされた自己回帰的にトランスクリプトまたは翻訳を生成します。 [Whisper](model_doc/whisper) もASRモデルですが、他の多くの音声モデルとは異なり、✨ ラベル付き ✨ オーディオトランスクリプションデータを大量に事前に学習して、ゼロショットパフォーマンスを実現します。データセットの大部分には非英語の言語も含まれており、Whisperは低リソース言語にも使用できます。構造的には、WhisperはSpeech2Textに似ています。オーディオ信号はエンコーダーによってエンコードされたログメルスペクトログラムに変換されます。デコーダーはエンコーダーの隠れ状態と前のトークンからトランスクリプトを自己回帰的に生成します。 + +## Multimodal + + + +### Encoder[[mm-encoder]] + +[VisualBERT](model_doc/visual_bert) は、BERTの後にリリースされたビジョン言語タスク向けのマルチモーダルモデルです。これはBERTと事前トレーニングされた物体検出システムを組み合わせ、画像特徴をビジュアル埋め込みに抽出し、テキスト埋め込みと一緒にBERTに渡します。VisualBERTは非マスクテキストを基にしたマスクテキストを予測し、テキストが画像と整合しているかどうかも予測する必要があります。ViTがリリースされた際、[ViLT](model_doc/vilt) は画像埋め込みを取得するためにこの方法を採用しました。画像埋め込みはテキスト埋め込みと共に共同で処理されます。それから、ViLTは画像テキストマッチング、マスク言語モデリング、および全単語マスキングによる事前トレーニングが行われます。 + +[CLIP](model_doc/clip) は異なるアプローチを取り、(`画像`、`テキスト`) のペア予測を行います。画像エンコーダー(ViT)とテキストエンコーダー(Transformer)は、(`画像`、`テキスト`) ペアデータセット上で共同トレーニングされ、(`画像`、`テキスト`) ペアの画像とテキストの埋め込みの類似性を最大化します。事前トレーニング後、CLIPを使用して画像からテキストを予測したり、その逆を行うことができます。[OWL-ViT](model_doc/owlvit) は、ゼロショット物体検出のバックボーンとしてCLIPを使用しています。事前トレーニング後、物体検出ヘッドが追加され、(`クラス`、`バウンディングボックス`) ペアに対するセット予測が行われます。 + +### Encoder-decoder[[mm-encoder-decoder]] + +光学文字認識(OCR)は、通常、画像を理解しテキストを生成するために複数のコンポーネントが関与するテキスト認識タスクです。 [TrOCR](model_doc/trocr) は、エンドツーエンドのTransformerを使用してこのプロセスを簡略化します。エンコーダーは画像を固定サイズのパッチとして処理するためのViTスタイルのモデルであり、デコーダーはエンコーダーの隠れ状態を受け入れ、テキストを自己回帰的に生成します。[Donut](model_doc/donut) はOCRベースのアプローチに依存しないより一般的なビジュアルドキュメント理解モデルで、エンコーダーとしてSwin Transformer、デコーダーとして多言語BARTを使用します。 Donutは画像とテキストの注釈に基づいて次の単語を予測することにより、テキストを読むために事前トレーニングされます。デコーダーはプロンプトを与えられたトークンシーケンスを生成します。プロンプトは各ダウンストリームタスクごとに特別なトークンを使用して表現されます。例えば、ドキュメントの解析には`解析`トークンがあり、エンコーダーの隠れ状態と組み合わされてドキュメントを構造化された出力フォーマット(JSON)に解析します。 + +## 
Reinforcement learning + + + +### Decoder[[rl-decoder]] + +意思決定と軌跡トランスフォーマーは、状態、アクション、報酬をシーケンスモデリングの問題として捉えます。 [Decision Transformer](model_doc/decision_transformer) は、リターン・トゥ・ゴー、過去の状態、およびアクションに基づいて将来の希望リターンにつながるアクションの系列を生成します。最後の *K* タイムステップでは、3つのモダリティそれぞれがトークン埋め込みに変換され、将来のアクショントークンを予測するためにGPTのようなモデルによって処理されます。[Trajectory Transformer](model_doc/trajectory_transformer) も状態、アクション、報酬をトークン化し、GPTアーキテクチャで処理します。報酬調整に焦点を当てたDecision Transformerとは異なり、Trajectory Transformerはビームサーチを使用して将来のアクションを生成します。 diff --git a/docs/source/ja/multilingual.mdx b/docs/source/ja/multilingual.md similarity index 77% rename from docs/source/ja/multilingual.mdx rename to docs/source/ja/multilingual.md index a5ccc18385a247..39524195f88810 100644 --- a/docs/source/ja/multilingual.mdx +++ b/docs/source/ja/multilingual.md @@ -8,13 +8,17 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # 推論のための多言語モデル [[open-in-colab]] -🤗 Transformers にはいくつかの多言語モデルがあり、それらの推論の使用方法は単一言語モデルとは異なります。ただし、多言語モデルの使用方法がすべて異なるわけではありません。 [bert-base-multilingual-uncased](https://huggingface.co/bert-base-multilingual-uncased) などの一部のモデルは、単一言語モデルと同様に使用できます。 このガイドでは、推論のために使用方法が異なる多言語モデルをどのように使うかを示します。 +🤗 Transformers にはいくつかの多言語モデルがあり、それらの推論の使用方法は単一言語モデルとは異なります。ただし、多言語モデルの使用方法がすべて異なるわけではありません。 [google-bert/bert-base-multilingual-uncased](https://huggingface.co/google-bert/bert-base-multilingual-uncased) などの一部のモデルは、単一言語モデルと同様に使用できます。 このガイドでは、推論のために使用方法が異なる多言語モデルをどのように使うかを示します。 ## XLM @@ -24,24 +28,24 @@ XLM には10の異なるチェックポイントがあり、そのうちの1つ 次の XLM モデルは、言語の埋め込みを使用して、推論で使用される言語を指定します。 -- `xlm-mlm-ende-1024` (マスク化された言語モデリング、英語-ドイツ語) -- `xlm-mlm-enfr-1024` (マスク化された言語モデリング、英語-フランス語) -- `xlm-mlm-enro-1024` (マスク化された言語モデリング、英語-ルーマニア語) -- `xlm-mlm-xnli15-1024` (マスク化された言語モデリング、XNLI 言語) -- `xlm-mlm-tlm-xnli15-1024` (マスク化された言語モデリング + 翻訳 + XNLI 言語) -- `xlm-clm-enfr-1024` (因果言語モデリング、英語-フランス語) -- `xlm-clm-ende-1024` (因果言語モデリング、英語-ドイツ語) +- `FacebookAI/xlm-mlm-ende-1024` (マスク化された言語モデリング、英語-ドイツ語) +- `FacebookAI/xlm-mlm-enfr-1024` (マスク化された言語モデリング、英語-フランス語) +- `FacebookAI/xlm-mlm-enro-1024` (マスク化された言語モデリング、英語-ルーマニア語) +- `FacebookAI/xlm-mlm-xnli15-1024` (マスク化された言語モデリング、XNLI 言語) +- `FacebookAI/xlm-mlm-tlm-xnli15-1024` (マスク化された言語モデリング + 翻訳 + XNLI 言語) +- `FacebookAI/xlm-clm-enfr-1024` (因果言語モデリング、英語-フランス語) +- `FacebookAI/xlm-clm-ende-1024` (因果言語モデリング、英語-ドイツ語) 言語の埋め込みは、モデルに渡される `input_ids` と同じ形状のテンソルとして表されます。 これらのテンソルの値は、使用される言語に依存し、トークナイザーの `lang2id` および `id2lang` 属性によって識別されます。 -この例では、`xlm-clm-enfr-1024` チェックポイントをロードします (因果言語モデリング、英語-フランス語)。 +この例では、`FacebookAI/xlm-clm-enfr-1024` チェックポイントをロードします (因果言語モデリング、英語-フランス語)。 ```py >>> import torch >>> from transformers import XLMTokenizer, XLMWithLMHeadModel ->>> tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024") ->>> model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024") +>>> tokenizer = XLMTokenizer.from_pretrained("FacebookAI/xlm-clm-enfr-1024") +>>> model = XLMWithLMHeadModel.from_pretrained("FacebookAI/xlm-clm-enfr-1024") ``` トークナイザーの `lang2id` 属性は、このモデルの言語とその ID を表示します。 @@ -79,8 +83,8 @@ XLM には10の異なるチェックポイントがあり、そのうちの1つ 次の XLM モデルは、推論中に言語の埋め込みを必要としません。 -- 
`xlm-mlm-17-1280` (マスク化された言語モデリング、17の言語) -- `xlm-mlm-100-1280` (マスク化された言語モデリング、100の言語) +- `FacebookAI/xlm-mlm-17-1280` (マスク化された言語モデリング、17の言語) +- `FacebookAI/xlm-mlm-100-1280` (マスク化された言語モデリング、100の言語) これらのモデルは、以前の XLM チェックポイントとは異なり、一般的な文の表現に使用されます。 @@ -88,8 +92,8 @@ XLM には10の異なるチェックポイントがあり、そのうちの1つ 以下の BERT モデルは、多言語タスクに使用できます。 -- `bert-base-multilingual-uncased` (マスク化された言語モデリング + 次の文の予測、102の言語) -- `bert-base-multilingual-cased` (マスク化された言語モデリング + 次の文の予測、104の言語) +- `google-bert/bert-base-multilingual-uncased` (マスク化された言語モデリング + 次の文の予測、102の言語) +- `google-bert/bert-base-multilingual-cased` (マスク化された言語モデリング + 次の文の予測、104の言語) これらのモデルは、推論中に言語の埋め込みを必要としません。 文脈から言語を識別し、それに応じて推測する必要があります。 @@ -97,8 +101,8 @@ XLM には10の異なるチェックポイントがあり、そのうちの1つ 次の XLM-RoBERTa モデルは、多言語タスクに使用できます。 -- `xlm-roberta-base` (マスク化された言語モデリング、100の言語) -- `xlm-roberta-large` (マスク化された言語モデリング、100の言語) +- `FacebookAI/xlm-roberta-base` (マスク化された言語モデリング、100の言語) +- `FacebookAI/xlm-roberta-large` (マスク化された言語モデリング、100の言語) XLM-RoBERTa は、100の言語で新しく作成およびクリーニングされた2.5 TB の CommonCrawl データでトレーニングされました。 これは、分類、シーケンスのラベル付け、質問応答などのダウンストリームタスクで、mBERT や XLM などの以前にリリースされた多言語モデルを大幅に改善します。 diff --git a/docs/source/ja/pad_truncation.md b/docs/source/ja/pad_truncation.md new file mode 100644 index 00000000000000..4da23fd82e0f49 --- /dev/null +++ b/docs/source/ja/pad_truncation.md @@ -0,0 +1,63 @@ + + +# Padding and truncation + +バッチ入力はしばしば異なる長さであり、固定サイズのテンソルに変換できないため、変動する長さのバッチから長方形のテンソルを作成するための戦略として、パディングと切り詰めがあります。パディングは、短いシーケンスがバッチ内の最長シーケンスまたはモデルが受け入れる最大長と同じ長さになるように、特別な**パディングトークン**を追加します。切り詰めは、長いシーケンスを切り詰めることで逆方向に機能します。 + +ほとんどの場合、バッチを最長シーケンスの長さにパディングし、モデルが受け入れる最大長に切り詰めることで、うまく動作します。ただし、APIはそれ以上の戦略もサポートしています。必要な3つの引数は次のとおりです:`padding`、`truncation`、および `max_length`。 + +`padding`引数はパディングを制御します。ブール値または文字列であることができます: + + - `True`または`'longest'`:バッチ内の最長シーケンスにパディングを追加します(シーケンスが1つしか提供されない場合、パディングは適用されません)。 + - `max_length'`:`max_length`引数で指定された長さまでパディングを追加します。または`max_length`が提供されていない場合はモデルが受け入れる最大長(`max_length=None`)。シーケンスが1つしか提供されている場合でも、パディングは適用されます。 + - `False`または`'do_not_pad'`:パディングは適用されません。これがデフォルトの動作です。 + +`truncation`引数は切り詰めを制御します。ブール値または文字列であることができます: + + - `True`または`'longest_first'`:最大長を`max_length`引数で指定するか、モデルが受け入れる最大長(`max_length=None`)まで切り詰めます。これはトークンごとに切り詰め、適切な長さに達するまでペア内の最長シーケンスからトークンを削除します。 + - `'only_second'`:最大長を`max_length`引数で指定するか、モデルが受け入れる最大長(`max_length=None`)まで切り詰めます。これはペアの2番目の文だけを切り詰めます(シーケンスのペアまたはシーケンスのバッチのペアが提供された場合)。 + - `'only_first'`:最大長を`max_length`引数で指定するか、モデルが受け入れる最大長(`max_length=None`)まで切り詰めます。これはペアの最初の文だけを切り詰めます(シーケンスのペアまたはシーケンスのバッチのペアが提供された場合)。 + - `False`または`'do_not_truncate'`:切り詰めは適用されません。これがデフォルトの動作です。 + +`max_length`引数はパディングと切り詰めの長さを制御します。整数または`None`であり、この場合、モデルが受け入れる最大入力長にデフォルトで設定されます。モデルに特定の最大入力長がない場合、`max_length`への切り詰めまたはパディングは無効になります。 + +以下の表は、パディングと切り詰めを設定する推奨方法を要約しています。以下の例のいずれかで入力シーケンスのペアを使用する場合、`truncation=True`を`['only_first', 'only_second', 'longest_first']`で選択した`STRATEGY`に置き換えることができます。つまり、`truncation='only_second'`または`truncation='longest_first'`を使用して、ペア内の両方のシーケンスを前述のように切り詰める方法を制御できます。 + + + +| Truncation | Padding | Instruction | +|--------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------------| +| no truncation | no padding | `tokenizer(batch_sentences)` | +| | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True)` or | +| | | `tokenizer(batch_sentences, padding='longest')` | +| | padding to max model input length | `tokenizer(batch_sentences, padding='max_length')` | +| | padding to specific 
length | `tokenizer(batch_sentences, padding='max_length', max_length=42)` | +| | padding to a multiple of a value | `tokenizer(batch_sentences, padding=True, pad_to_multiple_of=8)` | +| truncation to max model input length | no padding | `tokenizer(batch_sentences, truncation=True)` or | +| | | `tokenizer(batch_sentences, truncation=STRATEGY)` | +| | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True)` or | +| | | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY)` | +| | padding to max model input length | `tokenizer(batch_sentences, padding='max_length', truncation=True)` or | +| | | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY)` | +| | padding to specific length | Not possible | +| truncation to specific length | no padding | `tokenizer(batch_sentences, truncation=True, max_length=42)` or | +| | | `tokenizer(batch_sentences, truncation=STRATEGY, max_length=42)` | +| | padding to max sequence in batch | `tokenizer(batch_sentences, padding=True, truncation=True, max_length=42)` or | +| | | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42)` | +| | padding to max model input length | Not possible | +| | padding to specific length | `tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)` or | +| | | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42)` | diff --git a/docs/source/ja/peft.md b/docs/source/ja/peft.md new file mode 100644 index 00000000000000..5cc687f70bf835 --- /dev/null +++ b/docs/source/ja/peft.md @@ -0,0 +1,214 @@ + + + +# Load adapters with 🤗 PEFT + +[[open-in-colab]] + +[Parameter-Efficient Fine Tuning (PEFT)](https://huggingface.co/blog/peft) メソッドは、事前学習済みモデルのパラメータをファインチューニング中に凍結し、その上にわずかな訓練可能なパラメータ(アダプター)を追加するアプローチです。アダプターは、タスク固有の情報を学習するために訓練されます。このアプローチは、メモリ使用量が少なく、完全にファインチューニングされたモデルと比較して計算リソースを低く抑えつつ、同等の結果を生成することが示されています。 + +PEFTで訓練されたアダプターは通常、完全なモデルのサイズよりも1桁小さく、共有、保存、読み込むのが便利です。 + +
+ +
Hubに格納されているOPTForCausalLMモデルのアダプター重みは約6MBにすぎず、約700MBにもなるモデル重み全体のサイズと比べてはるかに小さくて済みます。
+
+ +🤗 PEFTライブラリについて詳しく知りたい場合は、[ドキュメンテーション](https://huggingface.co/docs/peft/index)をご覧ください。 + +## Setup + +🤗 PEFTをインストールして始めましょう: + + +```bash +pip install peft +``` + +新機能を試してみたい場合、ソースからライブラリをインストールすることに興味があるかもしれません: + +```bash +pip install git+https://github.com/huggingface/peft.git +``` + +## Supported PEFT models + +🤗 Transformersは、いくつかのPEFT(Parameter Efficient Fine-Tuning)メソッドをネイティブにサポートしており、ローカルまたはHubに格納されたアダプターウェイトを簡単に読み込んで実行またはトレーニングできます。以下のメソッドがサポートされています: + +- [Low Rank Adapters](https://huggingface.co/docs/peft/conceptual_guides/lora) +- [IA3](https://huggingface.co/docs/peft/conceptual_guides/ia3) +- [AdaLoRA](https://arxiv.org/abs/2303.10512) + +他のPEFTメソッドを使用したい場合、プロンプト学習やプロンプト調整などについて詳しく知りたい場合、または🤗 PEFTライブラリ全般については、[ドキュメンテーション](https://huggingface.co/docs/peft/index)を参照してください。 + + +## Load a PEFT adapter + +🤗 TransformersからPEFTアダプターモデルを読み込んで使用するには、Hubリポジトリまたはローカルディレクトリに `adapter_config.json` ファイルとアダプターウェイトが含まれていることを確認してください。次に、`AutoModelFor` クラスを使用してPEFTアダプターモデルを読み込むことができます。たとえば、因果言語モデリング用のPEFTアダプターモデルを読み込むには: + +1. PEFTモデルのIDを指定します。 +2. それを[`AutoModelForCausalLM`] クラスに渡します。 + + +```py +from transformers import AutoModelForCausalLM, AutoTokenizer + +peft_model_id = "ybelkada/opt-350m-lora" +model = AutoModelForCausalLM.from_pretrained(peft_model_id) +``` + + + +PEFTアダプターを`AutoModelFor`クラスまたは基本モデルクラス(`OPTForCausalLM`または`LlamaForCausalLM`など)で読み込むことができます。 + + + +また、`load_adapter`メソッドを呼び出すことで、PEFTアダプターを読み込むこともできます: + + +```py +from transformers import AutoModelForCausalLM, AutoTokenizer + +model_id = "facebook/opt-350m" +peft_model_id = "ybelkada/opt-350m-lora" + +model = AutoModelForCausalLM.from_pretrained(model_id) +model.load_adapter(peft_model_id) +``` + +## Load in 8bit or 4bit + + +`bitsandbytes` 統合は、8ビットおよび4ビットの精度データ型をサポートしており、大規模なモデルを読み込む際にメモリを節約するのに役立ちます(詳細については `bitsandbytes` 統合の[ガイド](./quantization#bitsandbytes-integration)を参照してください)。[`~PreTrainedModel.from_pretrained`] に `load_in_8bit` または `load_in_4bit` パラメータを追加し、`device_map="auto"` を設定してモデルを効果的にハードウェアに分散配置できます: + +```py +from transformers import AutoModelForCausalLM, AutoTokenizer + +peft_model_id = "ybelkada/opt-350m-lora" +model = AutoModelForCausalLM.from_pretrained(peft_model_id, device_map="auto", load_in_8bit=True) +``` + +## Add a new adapter + +既存のアダプターを持つモデルに新しいアダプターを追加するために [`~peft.PeftModel.add_adapter`] を使用できます。ただし、新しいアダプターは現在のアダプターと同じタイプである限り、これを行うことができます。たとえば、モデルに既存の LoRA アダプターがアタッチされている場合: + + +```py +from transformers import AutoModelForCausalLM, OPTForCausalLM, AutoTokenizer +from peft import PeftConfig + +model_id = "facebook/opt-350m" +model = AutoModelForCausalLM.from_pretrained(model_id) + +lora_config = LoraConfig( + target_modules=["q_proj", "k_proj"], + init_lora_weights=False +) + +model.add_adapter(lora_config, adapter_name="adapter_1") +``` + +新しいアダプタを追加するには: + + +```py +# attach new adapter with same config +model.add_adapter(lora_config, adapter_name="adapter_2") +``` + +[`~peft.PeftModel.set_adapter`] を使用して、どのアダプターを使用するかを設定できます: + + +```py +# use adapter_1 +model.set_adapter("adapter_1") +output = model.generate(**inputs) +print(tokenizer.decode(output_disabled[0], skip_special_tokens=True)) + +# use adapter_2 +model.set_adapter("adapter_2") +output_enabled = model.generate(**inputs) +print(tokenizer.decode(output_enabled[0], skip_special_tokens=True)) +``` + +## Enable and disable adapters + +モデルにアダプターを追加したら、アダプターモジュールを有効または無効にすることができます。アダプターモジュールを有効にするには、次の手順を実行します: + +```py +from transformers import AutoModelForCausalLM, OPTForCausalLM, AutoTokenizer +from peft import PeftConfig + 
+model_id = "facebook/opt-350m" +adapter_model_id = "ybelkada/opt-350m-lora" +tokenizer = AutoTokenizer.from_pretrained(model_id) +text = "Hello" +inputs = tokenizer(text, return_tensors="pt") + +model = AutoModelForCausalLM.from_pretrained(model_id) +peft_config = PeftConfig.from_pretrained(adapter_model_id) + +# to initiate with random weights +peft_config.init_lora_weights = False + +model.add_adapter(peft_config) +model.enable_adapters() +output = model.generate(**inputs) +``` + +アダプターモジュールを無効にするには: + +```py +model.disable_adapters() +output = model.generate(**inputs) +``` + +## Train a PEFT adapter + +PEFTアダプターは[`Trainer`]クラスでサポートされており、特定のユースケースに対してアダプターをトレーニングすることができます。数行のコードを追加するだけで済みます。たとえば、LoRAアダプターをトレーニングする場合: + + + +[`Trainer`]を使用したモデルの微調整に慣れていない場合は、[事前トレーニング済みモデルの微調整](training)チュートリアルをご覧ください。 + + + +1. タスクタイプとハイパーパラメータに対するアダプターの構成を定義します(ハイパーパラメータの詳細については[`~peft.LoraConfig`]を参照してください)。 + + +```py +from peft import LoraConfig + +peft_config = LoraConfig( + lora_alpha=16, + lora_dropout=0.1, + r=64, + bias="none", + task_type="CAUSAL_LM", +) +``` + +2. モデルにアダプターを追加する。 + + +```py +model.add_adapter(peft_config) +``` + +3. これで、モデルを [`Trainer`] に渡すことができます! + +```py +trainer = Trainer(model=model, ...) +trainer.train() +``` + +保存するトレーニング済みアダプタとそれを読み込むための手順: diff --git a/docs/source/ja/perf_hardware.md b/docs/source/ja/perf_hardware.md new file mode 100644 index 00000000000000..0d104ed3ddb0b3 --- /dev/null +++ b/docs/source/ja/perf_hardware.md @@ -0,0 +1,160 @@ + + +# Custom hardware for training + +モデルのトレーニングおよび推論に使用するハードウェアは、パフォーマンスに大きな影響を与えることがあります。GPUについて詳しく知りたい場合は、Tim Dettmerの優れた[ブログ記事](https://timdettmers.com/2020/09/07/which-gpu-for-deep-learning/)をチェックしてみてください。 + +GPUセットアップの実用的なアドバイスをいくつか見てみましょう。 + +## GPU +より大きなモデルをトレーニングする場合、基本的には以下の3つのオプションがあります: + +- より大きなGPU +- より多くのGPU +- より多くのCPUおよびNVMe([DeepSpeed-Infinity](main_classes/deepspeed#nvme-support)によるオフロード) + +まず、単一のGPUを使用する場合から始めましょう。 + +### Power and Cooling + +高価なハイエンドGPUを購入した場合、正しい電力供給と十分な冷却を提供することが重要です。 + +**電力**: + +一部の高級コンシューマGPUカードには、2つまたは3つのPCI-E 8ピン電源ソケットがあります。カードにあるソケットの数だけ、独立した12V PCI-E 8ピンケーブルが接続されていることを確認してください。同じケーブルの一端にある2つの分岐(またはピッグテールケーブルとしても知られています)を使用しないでください。つまり、GPUに2つのソケットがある場合、PSUからカードに向けて2つのPCI-E 8ピンケーブルを使用し、1つのケーブルの端に2つのPCI-E 8ピンコネクタがあるものは使用しないでください!そうしないと、カードからのパフォーマンスを十分に引き出すことができません。 + +各PCI-E 8ピン電源ケーブルは、PSU側の12Vレールに接続する必要があり、最大で150Wの電力を供給できます。 + +一部のカードはPCI-E 12ピンコネクタを使用することがあり、これらは最大で500-600Wの電力を供給できます。 + +低価格帯のカードは6ピンコネクタを使用することがあり、最大で75Wの電力を供給します。 + +さらに、カードが必要とする安定した電圧を提供する高品質な電源ユニット(PSU)を使用する必要があります。 + +もちろん、PSUにはカードを駆動するために十分な未使用の電力が必要です。 + +**冷却**: + +GPUが過熱すると、スロットリングが開始され、フルパフォーマンスを提供しなくなり、過熱しすぎるとシャットダウンすることさえあります。 + +GPUが重要な負荷の下でどのような温度を目指すべきかを正確に示すことは難しいですが、おそらく+80℃未満であれば良いでしょうが、それより低い方が良いです - おそらく70-75℃が優れた範囲でしょう。スロットリングの開始温度はおそらく84-90℃のあたりからでしょう。スロットリングによるパフォーマンスの低下以外にも、長時間にわたる非常に高い温度はGPUの寿命を短縮する可能性があります。 + +次に、複数のGPUを持つ際に最も重要な側面の一つである接続について詳しく見てみましょう。 + +### Multi-GPU Connectivity + +複数のGPUを使用する場合、カードの相互接続方法はトータルのトレーニング時間に大きな影響を与える可能性があります。GPUが同じ物理ノードにある場合、次のように実行できます: + + +```bash +nvidia-smi topo -m +``` + +もちろん、GPUがどのように相互接続されているかについて説明します。デュアルGPUを搭載し、NVLinkで接続されているマシンでは、おそらく以下のような情報が表示されるでしょう: + +``` + GPU0 GPU1 CPU Affinity NUMA Affinity +GPU0 X NV2 0-23 N/A +GPU1 NV2 X 0-23 N/A +``` + +別のNVLinkなしのマシンでは、以下のような状況が発生するかもしれません: + +``` + GPU0 GPU1 CPU Affinity NUMA Affinity +GPU0 X PHB 0-11 N/A +GPU1 PHB X 0-11 N/A +``` + +こちらが伝説です: + +``` + X = Self + SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) + NODE = Connection traversing 
PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node + PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) + PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) + PIX = Connection traversing at most a single PCIe bridge + NV# = Connection traversing a bonded set of # NVLinks +``` + +最初のレポートである `NV2` では、GPUは2つのNVLinkで接続されており、2番目のレポートである `PHB` では、典型的な消費者向けのPCIe+Bridgeセットアップが行われています。 + +あなたのセットアップでどの種類の接続性があるかを確認してください。これらの接続方法のいくつかはカード間の通信を速くすることができます(例:NVLink)、他のものは遅くすることができます(例:PHB)。 + +使用されるスケーラビリティソリューションの種類に応じて、接続速度は大きな影響を与えることも、小さな影響を与えることもあります。GPUがあまり頻繁に同期する必要がない場合、DDPのように、遅い接続の影響はそれほど重要ではありません。しかし、GPUが頻繁にメッセージを送信する必要がある場合、ZeRO-DPのように、高速の接続がより高速なトレーニングを実現するために非常に重要になります。 + +#### NVlink + +[NVLink](https://en.wikipedia.org/wiki/NVLink) は、Nvidiaによって開発された有線のシリアルマルチレーンの近距離通信リンクです。 + +各新世代では、より高速な帯域幅が提供されます。たとえば、[Nvidia Ampere GA102 GPU Architecture](https://www.nvidia.com/content/dam/en-zz/Solutions/geforce/ampere/pdf/NVIDIA-ampere-GA102-GPU-Architecture-Whitepaper-V1.pdf) からの引用です。 + +> Third-Generation NVLink® +> GA102 GPUs utilize NVIDIA’s third-generation NVLink interface, which includes four x4 links, +> with each link providing 14.0625 GB/sec bandwidth in each direction between two GPUs. Four +> links provide 56.25 GB/sec bandwidth in each direction, and 112.5 GB/sec total bandwidth +> between two GPUs. Two RTX 3090 GPUs can be connected together for SLI using NVLink. +> (Note that 3-Way and 4-Way SLI configurations are not supported.) + +したがって、`nvidia-smi topo -m` の出力の `NVX` レポートで取得する `X` が高いほど良いです。世代はあなたのGPUアーキテクチャに依存します。 + +小さなサンプルのwikitextを使用したgpt2言語モデルのトレーニングの実行を比較しましょう。 + +結果は次のとおりです:
+ +| NVlink | Time | +| ----- | ---: | +| Y | 101s | +| N | 131s | + + +NVLinkを使用すると、トレーニングが約23%速く完了することがわかります。2番目のベンチマークでは、`NCCL_P2P_DISABLE=1`を使用して、GPUがNVLinkを使用しないように指示しています。 + +以下は、完全なベンチマークコードと出力です: + + +```bash +# DDP w/ NVLink + +rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \ +--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \ +--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \ +--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 + +{'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69} + +# DDP w/o NVLink + +rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \ +--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \ +--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train +--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 + +{'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69} +``` + +Hardware: 2x TITAN RTX 24GB each + NVlink with 2 NVLinks (`NV2` in `nvidia-smi topo -m`) +Software: `pytorch-1.8-to-be` + `cuda-11.0` / `transformers==4.3.0.dev0` diff --git a/docs/source/ja/perf_infer_cpu.md b/docs/source/ja/perf_infer_cpu.md new file mode 100644 index 00000000000000..d23ae65f309f41 --- /dev/null +++ b/docs/source/ja/perf_infer_cpu.md @@ -0,0 +1,74 @@ + + + +# Efficient Inference on CPU + +このガイドは、CPU上で大規模なモデルの効率的な推論に焦点を当てています。 + +## `BetterTransformer` for faster inference + +最近、テキスト、画像、および音声モデルのCPU上での高速な推論のために`BetterTransformer`を統合しました。詳細については、この統合に関するドキュメンテーションを[こちら](https://huggingface.co/docs/optimum/bettertransformer/overview)で確認してください。 + +## PyTorch JITモード(TorchScript) +TorchScriptは、PyTorchコードからシリアライズ可能で最適化可能なモデルを作成する方法です。任意のTorchScriptプログラムは、Python依存性のないプロセスで保存およびロードできます。 +デフォルトのイーガーモードと比較して、PyTorchのjitモードは通常、オペレーターフュージョンなどの最適化手法によりモデル推論のパフォーマンスが向上します。 + +TorchScriptの簡単な紹介については、[PyTorch TorchScriptチュートリアル](https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html#tracing-modules)を参照してください。 + +### JITモードでのIPEXグラフ最適化 +Intel® Extension for PyTorchは、Transformersシリーズモデルのjitモードにさらなる最適化を提供します。Intel® Extension for PyTorchをjitモードで使用することを強くお勧めします。Transformersモデルからよく使用されるオペレーターパターンのいくつかは、既にIntel® Extension for PyTorchでjitモードのフュージョンに対応しています。これらのフュージョンパターン(Multi-head-attentionフュージョン、Concat Linear、Linear+Add、Linear+Gelu、Add+LayerNormフュージョンなど)は有効でパフォーマンスが良いです。フュージョンの利点は、ユーザーに透過的に提供されます。分析によれば、最も人気のある質問応答、テキスト分類、トークン分類のNLPタスクの約70%が、これらのフュージョンパターンを使用してFloat32精度とBFloat16混合精度の両方でパフォーマンスの利点を得ることができます。 + +[IPEXグラフ最適化の詳細情報](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/features/graph_optimization.html)を確認してください。 + +#### IPEX installation: + +IPEXのリリースはPyTorchに従っています。[IPEXのインストール方法](https://intel.github.io/intel-extension-for-pytorch/)を確認してください。 + +### Usage of JIT-mode +Trainerで評価または予測のためにJITモードを有効にするには、ユーザーはTrainerコマンド引数に`jit_mode_eval`を追加する必要があります。 + + + +PyTorch >= 1.14.0の場合、jitモードはjit.traceでdict入力がサポートされているため、予測と評価に任意のモデルに利益をもたらす可能性があります。 + +PyTorch < 1.14.0の場合、jitモードはforwardパラメーターの順序がjit.traceのタプル入力の順序と一致するモデルに利益をもたらす可能性があります(質問応答モデルなど)。jit.traceがタプル入力の順序と一致しない場合、テキスト分類モデルなど、jit.traceは失敗し、これをフォールバックさせるために例外でキャッチしています。ログはユーザーに通知するために使用されます。 + + + +[Transformers質問応答の使用例](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering)を参考にしてください。 + +- Inference using jit mode on CPU: +
+```bash
+python run_qa.py \
+--model_name_or_path csarron/bert-base-uncased-squad-v1 \
+--dataset_name squad \
+--do_eval \
+--max_seq_length 384 \
+--doc_stride 128 \
+--output_dir /tmp/ \
+--no_cuda \
+--jit_mode_eval
+```
+ +- Inference with IPEX using jit mode on CPU: +
+```bash
+python run_qa.py \
+--model_name_or_path csarron/bert-base-uncased-squad-v1 \
+--dataset_name squad \
+--do_eval \
+--max_seq_length 384 \
+--doc_stride 128 \
+--output_dir /tmp/ \
+--no_cuda \
+--use_ipex \
+--jit_mode_eval
+```
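+
+上記は Trainer 経由での実行例ですが、参考までに、Trainer を使わずに IPEX のグラフ最適化と jit モード(TorchScript)を試す場合の最小限のスケッチを以下に示します(仮定: `intel_extension_for_pytorch` がインストール済みであること、dict 入力の trace をサポートする新しめの PyTorch(1.14 相当以降)を使用していること。API の細部はバージョンによって異なる可能性があります)。
+
+```python
+import torch
+import intel_extension_for_pytorch as ipex
+from transformers import AutoModelForQuestionAnswering, AutoTokenizer
+
+model_id = "csarron/bert-base-uncased-squad-v1"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+# torchscript=True にするとタプルを返すようになり、trace しやすくなります
+model = AutoModelForQuestionAnswering.from_pretrained(model_id, torchscript=True)
+model.eval()
+
+# IPEX のグラフ最適化を適用します(仮定: 推論用途では ipex.optimize(model) で十分)
+model = ipex.optimize(model)
+
+question = "What does IPEX provide?"
+context = "Intel Extension for PyTorch provides graph optimizations for Transformer models in jit mode."
+inputs = tokenizer(question, context, return_tensors="pt")
+
+with torch.no_grad():
+    # dict 入力での trace(PyTorch 1.14 相当以降を仮定)
+    traced_model = torch.jit.trace(model, example_kwarg_inputs=dict(inputs))
+    traced_model = torch.jit.freeze(traced_model)
+    start_logits, end_logits = traced_model(**inputs)
+```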
diff --git a/docs/source/ja/perf_infer_gpu_many.md b/docs/source/ja/perf_infer_gpu_many.md new file mode 100644 index 00000000000000..378bb2a248fe11 --- /dev/null +++ b/docs/source/ja/perf_infer_gpu_many.md @@ -0,0 +1,125 @@ + + +# Efficient Inference on a Multiple GPUs + +この文書には、複数のGPUで効率的に推論を行う方法に関する情報が含まれています。 + + +注意: 複数のGPUセットアップは、[単一のGPUセクション](./perf_infer_gpu_one)で説明されているほとんどの戦略を使用できます。ただし、より良い使用法のために使用できる簡単なテクニックについても認識しておく必要があります。 + + + +## Flash Attention 2 + +Flash Attention 2の統合は、複数のGPUセットアップでも機能します。詳細については、[単一のGPUセクション](./perf_infer_gpu_one#Flash-Attention-2)の適切なセクションをご覧ください。 + +## BetterTransformer + +[BetterTransformer](https://huggingface.co/docs/optimum/bettertransformer/overview)は、🤗 TransformersモデルをPyTorchネイティブの高速実行パスを使用するように変換し、その下でFlash Attentionなどの最適化されたカーネルを呼び出します。 + +BetterTransformerは、テキスト、画像、音声モデルの単一GPUおよび複数GPUでの高速推論もサポートしています。 + + +Flash Attentionは、fp16またはbf16 dtypeを使用しているモデルにのみ使用できます。BetterTransformerを使用する前に、モデルを適切なdtypeにキャストしてください。 + + + +### Decoder models + +テキストモデル、特にデコーダーベースのモデル(GPT、T5、Llamaなど)の場合、BetterTransformer APIはすべての注意操作を[`torch.nn.functional.scaled_dot_product_attention`オペレーター](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention)(SDPA)を使用するように変換します。これはPyTorch 2.0以降でのみ使用可能です。 + +モデルをBetterTransformerに変換するには: + +```python +from transformers import AutoModelForCausalLM + +model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m") +# convert the model to BetterTransformer +model.to_bettertransformer() + +# Use it for training or inference +``` + +SDPAは、ハードウェアや問題のサイズなどの特定の設定で[Flash Attention](https://arxiv.org/abs/2205.14135)カーネルを呼び出すこともできます。Flash Attentionを有効にするか、特定の設定(ハードウェア、問題のサイズ)で利用可能かを確認するには、[`torch.backends.cuda.sdp_kernel`](https://pytorch.org/docs/master/backends.html#torch.backends.cuda.sdp_kernel)をコンテキストマネージャとして使用します。 + + +```diff +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m") +model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m").to("cuda") +# convert the model to BetterTransformer +model.to_bettertransformer() + +input_text = "Hello my dog is cute and" +inputs = tokenizer(input_text, return_tensors="pt").to("cuda") + ++ with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False): + outputs = model.generate(**inputs) + +print(tokenizer.decode(outputs[0], skip_special_tokens=True)) +``` + +もしトレースバックで次のようなエラーメッセージが表示された場合: + + +```bash +RuntimeError: No available kernel. Aborting execution. 
+``` + +当日、Flash Attentionのカバレッジが広範囲である可能性があるPyTorch Nightlyバージョンを試すようにお勧めします。 + +```bash +pip3 install -U --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118 +``` + +[このブログ投稿](https://pytorch.org/blog/out-of-the-box-acceleration/)をチェックして、BetterTransformer + SDPA APIで可能なことについて詳しく学びましょう。 + +### Encoder Models + +推論中のエンコーダーモデルでは、BetterTransformerはエンコーダーレイヤーのforward呼び出しを、エンコーダーレイヤーの[`torch.nn.TransformerEncoderLayer`](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html)の相当するものにディスパッチします。これにより、エンコーダーレイヤーの高速実装が実行されます。 + +`torch.nn.TransformerEncoderLayer`の高速実装はトレーニングをサポートしていないため、代わりに`torch.nn.functional.scaled_dot_product_attention`にディスパッチされます。これにより、ネストされたテンソルを活用しないFlash AttentionまたはMemory-Efficient Attentionの融合カーネルを使用できます。 + +BetterTransformerのパフォーマンスの詳細については、この[ブログ投稿](https://medium.com/pytorch/bettertransformer-out-of-the-box-performance-for-huggingface-transformers-3fbe27d50ab2)をご覧いただけます。また、エンコーダーモデル用のBetterTransformerについては、この[ブログ](https://pytorch.org/blog/a-better-transformer-for-fast-transformer-encoder-inference/)で詳しく学ぶことができます。 + + +## Advanced usage: mixing FP4 (or Int8) and BetterTransformer + +モデルの最良のパフォーマンスを得るために、上記で説明した異なる方法を組み合わせることができます。例えば、FP4ミックスプレシジョン推論+Flash Attentionを使用したBetterTransformerを組み合わせることができます。 + + +```py +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig + +quantization_config = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_compute_dtype=torch.float16 +) + +tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m") +model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", quantization_config=quantization_config) + +input_text = "Hello my dog is cute and" +inputs = tokenizer(input_text, return_tensors="pt").to("cuda") + +with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False): + outputs = model.generate(**inputs) + +print(tokenizer.decode(outputs[0], skip_special_tokens=True)) +``` \ No newline at end of file diff --git a/docs/source/ja/perf_infer_gpu_one.md b/docs/source/ja/perf_infer_gpu_one.md new file mode 100644 index 00000000000000..6d7466e022220a --- /dev/null +++ b/docs/source/ja/perf_infer_gpu_one.md @@ -0,0 +1,441 @@ + + +# Efficient Inference on a Single GPU + +このガイドに加えて、[1つのGPUでのトレーニングガイド](perf_train_gpu_one)と[CPUでの推論ガイド](perf_infer_cpu)に関連する情報があります。 + +## Flash Attention 2 + + + +この機能は実験的であり、将来のバージョンで大幅に変更される可能性があります。たとえば、Flash Attention 2 APIは近い将来`BetterTransformer` APIに移行するかもしれません。 + + + +Flash Attention 2は、トランスフォーマーベースのモデルのトレーニングと推論速度を大幅に高速化できます。Flash Attention 2は、Tri Dao氏によって[公式のFlash Attentionリポジトリ](https://github.com/Dao-AILab/flash-attention)で導入されました。Flash Attentionに関する科学論文は[こちら](https://arxiv.org/abs/2205.14135)で見ることができます。 + +Flash Attention 2を正しくインストールするには、上記のリポジトリに記載されているインストールガイドに従ってください。 + +以下のモデルに対してFlash Attention 2をネイティブサポートしています: + +- Llama +- Falcon + +さらに多くのモデルにFlash Attention 2のサポートを追加することをGitHubで提案することもでき、変更を統合するためにプルリクエストを開くこともできます。サポートされているモデルは、パディングトークンを使用してトレーニングを含む、推論とトレーニングに使用できます(現在の`BetterTransformer` APIではサポートされていない)。 + + + +Flash Attention 2は、モデルのdtypeが`fp16`または`bf16`の場合にのみ使用でき、NVIDIA-GPUデバイスでのみ実行されます。この機能を使用する前に、モデルを適切なdtypeにキャストし、サポートされているデバイスにロードしてください。 + + + +### Quick usage + +モデルでFlash Attention 2を有効にするには、`from_pretrained`の引数に`attn_implementation="flash_attention_2"`を追加します。 + + +```python +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM + +model_id = "tiiuae/falcon-7b" +tokenizer = 
AutoTokenizer.from_pretrained(model_id) + +model = AutoModelForCausalLM.from_pretrained( + model_id, + torch_dtype=torch.bfloat16, + attn_implementation="flash_attention_2", +) +``` + +こちらは、生成または微調整のために使用するテキストです。 + +### Expected speedups + +特に長いシーケンスに対して、微調整と推論の際には、かなりの高速化が期待できます。ただし、Flash Attentionはパディングトークンを使用してアテンションスコアを計算しないため、シーケンスにパディングトークンが含まれる場合、バッチ推論においてアテンションスコアを手動でパッド/アンパッドする必要があり、パディングトークンを含むバッチ生成の大幅な遅延が発生します。 + +これを克服するために、トレーニング中にシーケンスにパディングトークンを使用せずにFlash Attentionを使用する必要があります(たとえば、データセットをパックすることにより、シーケンスを最大シーケンス長に達するまで連結することなど)。ここに[例](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py#L516)が提供されています。 + +以下は、パディングトークンのない場合に、シーケンス長が4096の[tiiuae/falcon-7b](https://hf.co/tiiuae/falcon-7b)に対する単純なフォワードパスの予想される高速化です。さまざまなバッチサイズが示されています: + +
+ +
+ +以下は、パディングトークンのない場合に、シーケンス長が4096の[`meta-llama/Llama-7b-hf`](https://hf.co/meta-llama/Llama-7b-hf)に対する単純なフォワードパスの予想される高速化です。さまざまなバッチサイズが示されています: + +
+ +
+ +パディングトークンを含むシーケンス(パディングトークンを使用してトレーニングまたは生成する)の場合、アテンションスコアを正しく計算するために入力シーケンスをアンパッド/パッドする必要があります。比較的小さいシーケンス長の場合、純粋なフォワードパスではパディングトークンが30%未満しか埋められていないため、これはわずかな高速化をもたらします。 + +
+ +
+ +しかし、大きなシーケンス長の場合、純粋な推論(トレーニングも含む)には興味深い高速化が得られます。 + +Flash Attentionは、アテンション計算をよりメモリ効率の良いものにし、大きなシーケンス長でのCUDA OOMの問題を回避できるようにします。大きなシーケンス長に対して最大20のメモリ削減をもたらすことがあります。詳細については、[公式のFlash Attentionリポジトリ](https://github.com/Dao-AILab/flash-attention)をご覧ください。 + +
+ +
+ + +### Advanced usage + +この機能をモデルの最適化に多くの既存の機能と組み合わせることができます。以下にいくつかの例を示します: + +### Combining Flash Attention 2 and 8-bit models + +この機能を8ビットの量子化と組み合わせることができます: + +```python +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM + +model_id = "tiiuae/falcon-7b" +tokenizer = AutoTokenizer.from_pretrained(model_id) + +model = AutoModelForCausalLM.from_pretrained( + model_id, + load_in_8bit=True, + attn_implementation="flash_attention_2", +) +``` + +### Combining Flash Attention 2 and 4-bit models + +この機能を 4 ビットの量子化と組み合わせることができます: + +```python +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM + +model_id = "tiiuae/falcon-7b" +tokenizer = AutoTokenizer.from_pretrained(model_id) + +model = AutoModelForCausalLM.from_pretrained( + model_id, + load_in_4bit=True, + attn_implementation="flash_attention_2", +) +``` + +### Combining Flash Attention 2 and PEFT + +この機能を使用して、Flash Attention 2をベースにアダプターをトレーニングする際にPEFTを組み合わせることができます。 + +```python +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM +from peft import LoraConfig + +model_id = "tiiuae/falcon-7b" +tokenizer = AutoTokenizer.from_pretrained(model_id) + +model = AutoModelForCausalLM.from_pretrained( + model_id, + load_in_4bit=True, + attn_implementation="flash_attention_2", +) + +lora_config = LoraConfig( + r=8, + task_type="CAUSAL_LM" +) + +model.add_adapter(lora_config) + +... # train your model +``` + +## BetterTransformer + +[BetterTransformer](https://huggingface.co/docs/optimum/bettertransformer/overview)は、🤗 TransformersモデルをPyTorchネイティブの高速パス実行に変換します。これにより、Flash Attentionなどの最適化されたカーネルが内部で呼び出されます。 + +BetterTransformerは、テキスト、画像、およびオーディオモデルの単一およびマルチGPUでの高速な推論をサポートしています。 + + + +Flash Attentionは、fp16またはbf16のdtypeを使用するモデルにのみ使用できます。BetterTransformerを使用する前に、モデルを適切なdtypeにキャストしてください。 + + + +### Encoder models + +PyTorchネイティブの[`nn.MultiHeadAttention`](https://pytorch.org/blog/a-better-transformer-for-fast-transformer-encoder-inference/)アテンション高速パス、BetterTransformerと呼ばれるものは、[🤗 Optimumライブラリ](https://huggingface.co/docs/optimum/bettertransformer/overview)の統合を通じてTransformersと一緒に使用できます。 + +PyTorchのアテンション高速パスを使用すると、カーネルフュージョンと[ネストされたテンソル](https://pytorch.org/docs/stable/nested.html)の使用により、推論を高速化できます。詳細なベンチマーク情報は[このブログ記事](https://medium.com/pytorch/bettertransformer-out-of-the-box-performance-for-huggingface-transformers-3fbe27d50ab2)にあります。 + +[`optimum`](https://github.com/huggingface/optimum)パッケージをインストールした後、推論中にBetter Transformerを使用するには、関連する内部モジュールを呼び出すことで置き換える必要があります[`~PreTrainedModel.to_bettertransformer`]: + + +```python +model = model.to_bettertransformer() +``` + +メソッド [`~PreTrainedModel.reverse_bettertransformer`] は、モデルを保存する前に使用すべきで、標準のトランスフォーマーモデリングを使用するためのものです: + +```python +model = model.reverse_bettertransformer() +model.save_pretrained("saved_model") +``` + +BetterTransformer APIを使ったエンコーダーモデルの可能性について詳しく知るには、[このブログポスト](https://medium.com/pytorch/bettertransformer-out-of-the-box-performance-for-huggingface-transformers-3fbe27d50ab2)をご覧ください。 + +### Decoder models + +テキストモデル、特にデコーダーベースのモデル(GPT、T5、Llamaなど)にとって、BetterTransformer APIはすべての注意操作を[`torch.nn.functional.scaled_dot_product_attention`オペレーター](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention)(SDPA)を使用するように変換します。このオペレーターはPyTorch 2.0以降でのみ利用可能です。 + +モデルをBetterTransformerに変換するには、以下の手順を実行してください: + +```python +from transformers import AutoModelForCausalLM + +model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m") +# convert the 
model to BetterTransformer +model.to_bettertransformer() + +# Use it for training or inference +``` + +SDPAは、ハードウェアや問題のサイズに応じて[Flash Attention](https://arxiv.org/abs/2205.14135)カーネルを使用することもできます。Flash Attentionを有効にするか、特定の設定(ハードウェア、問題サイズ)で使用可能かどうかを確認するには、[`torch.backends.cuda.sdp_kernel`](https://pytorch.org/docs/master/backends.html#torch.backends.cuda.sdp_kernel)をコンテキストマネージャとして使用します。 + + +```diff +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m") +model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.float16).to("cuda") +# convert the model to BetterTransformer +model.to_bettertransformer() + +input_text = "Hello my dog is cute and" +inputs = tokenizer(input_text, return_tensors="pt").to("cuda") + ++ with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False): + outputs = model.generate(**inputs) + +print(tokenizer.decode(outputs[0], skip_special_tokens=True)) +``` + +もしトレースバックにバグが表示された場合 + +```bash +RuntimeError: No available kernel. Aborting execution. +``` + +Flash Attention の広範なカバレッジを持つかもしれない PyTorch のナイトリーバージョンを試してみることをお勧めします。 + +```bash +pip3 install -U --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118 +``` + +Or make sure your model is correctly casted in float16 or bfloat16 + +モデルが正しくfloat16またはbfloat16にキャストされていることを確認してください。 + +Have a look at [this detailed blogpost](https://pytorch.org/blog/out-of-the-box-acceleration/) to read more about what is possible to do with `BetterTransformer` + SDPA API. + +`BetterTransformer` + SDPA APIを使用して何が可能かについて詳しく読むには、[この詳細なブログポスト](https://pytorch.org/blog/out-of-the-box-acceleration/)をご覧ください。 + +## `bitsandbytes` integration for FP4 mixed-precision inference + +FP4混合精度推論のための`bitsandbytes`統合 + +You can install `bitsandbytes` and benefit from easy model compression on GPUs. Using FP4 quantization you can expect to reduce up to 8x the model size compared to its native full precision version. Check out below how to get started. + +`bitsandbytes`をインストールし、GPUで簡単なモデルの圧縮を利用できます。FP4量子化を使用すると、ネイティブのフルプレシジョンバージョンと比較してモデルサイズを最大8倍削減できることが期待できます。以下を確認して、どのように始めるかをご覧ください。 + + + +Note that this feature can also be used in a multi GPU setup. 
+ +この機能は、マルチGPUセットアップでも使用できることに注意してください。 + + + +### Requirements [[requirements-for-fp4-mixedprecision-inference]] + +- Latest `bitsandbytes` library +`pip install bitsandbytes>=0.39.0` + +- Install latest `accelerate` from source +`pip install git+https://github.com/huggingface/accelerate.git` + +- Install latest `transformers` from source +`pip install git+https://github.com/huggingface/transformers.git` + + +### Running FP4 models - single GPU setup - Quickstart + +以下のコードを実行することで、簡単に単一のGPUでFP4モデルを実行できます: + + +```py +from transformers import AutoModelForCausalLM + +model_name = "bigscience/bloom-2b5" +model_4bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True) +``` + +注意: `device_map`はオプションですが、推論時に `device_map = 'auto'` を設定することが推奨されています。これにより、利用可能なリソースに効率的にモデルがディスパッチされます。 + +### Running FP4 models - multi GPU setup + +混合4ビットモデルを複数のGPUにロードする方法は、単一GPUセットアップと同じです(単一GPUセットアップと同じコマンドです): + +```py +model_name = "bigscience/bloom-2b5" +model_4bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True) +``` + +しかし、`accelerate`を使用して、各GPUに割り当てるGPU RAMを制御することができます。以下のように、`max_memory`引数を使用します: + + +```py +max_memory_mapping = {0: "600MB", 1: "1GB"} +model_name = "bigscience/bloom-3b" +model_4bit = AutoModelForCausalLM.from_pretrained( + model_name, device_map="auto", load_in_4bit=True, max_memory=max_memory_mapping +) +``` + +この例では、最初のGPUは600MBのメモリを使用し、2番目のGPUは1GBを使用します。 + +### Advanced usage + +このメソッドのさらなる高度な使用法については、[量子化](main_classes/quantization)のドキュメンテーションページをご覧ください。 + +## `bitsandbytes` integration for Int8 mixed-precision matrix decomposition + + + +この機能は、マルチGPU環境でも使用できます。 + + + +論文[`LLM.int8():スケーラブルなTransformer向けの8ビット行列乗算`](https://arxiv.org/abs/2208.07339)によれば、Hugging Face統合がHub内のすべてのモデルでわずか数行のコードでサポートされています。このメソッドは、半精度(`float16`および`bfloat16`)の重みの場合に`nn.Linear`サイズを2倍、単精度(`float32`)の重みの場合は4倍に縮小し、外れ値に対してほとんど影響を与えません。 + +![HFxbitsandbytes.png](https://cdn-uploads.huggingface.co/production/uploads/1659861207959-62441d1d9fdefb55a0b7d12c.png) + +Int8混合精度行列分解は、行列乗算を2つのストリームに分割することによって動作します:(1) システマティックな特徴外れ値ストリームがfp16で行列乗算(0.01%)、(2) int8行列乗算の通常のストリーム(99.9%)。この方法を使用すると、非常に大きなモデルに対して予測の劣化なしにint8推論が可能です。 +このメソッドの詳細については、[論文](https://arxiv.org/abs/2208.07339)または[この統合に関するブログ記事](https://huggingface.co/blog/hf-bitsandbytes-integration)をご確認ください。 + +![MixedInt8.gif](https://cdn-uploads.huggingface.co/production/uploads/1660567469965-62441d1d9fdefb55a0b7d12c.gif) + +なお、この機能を使用するにはGPUが必要であり、カーネルはGPU専用にコンパイルされている必要があります。この機能を使用する前に、モデルの1/4(またはハーフ精度の重みの場合は1/2)を保存するのに十分なGPUメモリがあることを確認してください。 +このモジュールを使用する際のヘルプに関する詳細は、以下のノートをご覧いただくか、[Google Colabのデモ](#colab-demos)をご覧ください。 + +### Requirements [[requirements-for-int8-mixedprecision-matrix-decomposition]] + +- `bitsandbytes<0.37.0`を使用する場合、NVIDIA GPUを使用していることを確認し、8ビットテンソルコアをサポートしていることを確認してください(Turing、Ampere、またはそれ以降のアーキテクチャー、例:T4、RTX20s RTX30s、A40-A100など)。`bitsandbytes>=0.37.0`の場合、すべてのGPUがサポートされるはずです。 +- 正しいバージョンの`bitsandbytes`をインストールするには、次のコマンドを実行してください: +`pip install bitsandbytes>=0.31.5` +- `accelerate`をインストールします: +`pip install accelerate>=0.12.0` + + +### Running mixed-Int8 models - single GPU setup + +必要なライブラリをインストールした後、ミックス 8 ビットモデルを読み込む方法は次の通りです: + +```py +from transformers import AutoModelForCausalLM + +model_name = "bigscience/bloom-2b5" +model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True) +``` + +以下はシンプルな例です: + +* `pipeline()` 関数の代わりに、モデルの `generate()` メソッドを使用することをお勧めします。`pipeline()` 関数を使用して推論することは可能ですが、混合8ビットモデルに最適化されておらず、`generate()` 
メソッドを使用するよりも遅くなります。また、一部のサンプリング戦略(例:ヌクレウスサンプリング)は、混合8ビットモデルでは `pipeline()` 関数でサポートされていません。
+* すべての入力をモデルと同じデバイスに配置してください。
+
+
+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_name = "bigscience/bloom-2b5"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
+
+prompt = "Hello, my llama is cute"
+inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
+generated_ids = model_8bit.generate(**inputs)
+outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
+```
+
+### Running mixed-int8 models - multi GPU setup
+
+複数のGPUに混合8ビットモデルをロードする方法は、次の通りです(シングルGPUセットアップと同じコマンドです):
+
+```py
+model_name = "bigscience/bloom-2b5"
+model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
+```
+
+`accelerate`を使用して各GPUに割り当てるGPU RAMを制御する際には、以下のように`max_memory`引数を使用します:
+
+
+```py
+max_memory_mapping = {0: "1GB", 1: "2GB"}
+model_name = "bigscience/bloom-3b"
+model_8bit = AutoModelForCausalLM.from_pretrained(
+    model_name, device_map="auto", load_in_8bit=True, max_memory=max_memory_mapping
+)
+```
+
+この例では、最初のGPUは1GBのメモリを使用し、2番目のGPUは2GBを使用します。
+
+### Colab demos
+
+この方法を使用すると、以前のGoogle Colabでは推論できなかったモデルに対して推論を行うことができます。以下は、Google Colabで8ビット量子化を使用してT5-11b(fp32で42GB)を実行するデモのリンクです:
+
+[![Open In Colab: T5-11b demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1YORPWx4okIHXnjW7MSAidXN29mPVNT7F?usp=sharing)
+
+また、BLOOM-3Bのデモもご覧いただけます:
+
+[![Open In Colab: BLOOM-3b demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1qOjXfQIAULfKvZqwCen8-MoWKGdSatZ4?usp=sharing)
+
+## Advanced usage: mixing FP4 (or Int8) and BetterTransformer
+
+異なる方法を組み合わせて、モデルの最適なパフォーマンスを得ることができます。例えば、BetterTransformerを使用してFP4ミックスプレシジョン推論とフラッシュアテンションを組み合わせることができます。
+
+
+```py
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+
+quantization_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_compute_dtype=torch.float16
+)
+
+tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
+model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", quantization_config=quantization_config)
+
+# BetterTransformerに変換し、PyTorchネイティブのSDPA(Flash Attention)を利用できるようにします
+model = model.to_bettertransformer()
+
+input_text = "Hello my dog is cute and"
+inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
+
+with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
+    outputs = model.generate(**inputs)
+
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
\ No newline at end of file
diff --git a/docs/source/ja/perf_infer_special.md b/docs/source/ja/perf_infer_special.md
new file mode 100644
index 00000000000000..a5102590a0504c
--- /dev/null
+++ b/docs/source/ja/perf_infer_special.md
@@ -0,0 +1,18 @@
+
+
+# Inference on Specialized Hardware
+
+このドキュメントには、専用のハードウェアでの推論方法に関する情報がまもなく追加される予定です。それまでの間は、[CPUでの推論に関するガイド](perf_infer_cpu)をご覧ください。
\ No newline at end of file
diff --git a/docs/source/ja/perf_torch_compile.md b/docs/source/ja/perf_torch_compile.md
new file mode 100644
index 00000000000000..6eb69ec8eb9f68
--- /dev/null
+++ b/docs/source/ja/perf_torch_compile.md
@@ -0,0 +1,359 @@
+
+
+# Optimize inference using torch.compile()
+
+このガイドは、[`torch.compile()`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) を使用した推論速度の向上に関するベンチマークを提供することを目的としています。これは、[🤗 Transformers のコンピュータビジョンモデル](https://huggingface.co/models?pipeline_tag=image-classification&library=transformers&sort=trending)向けのものです。
+
+## Benefits of torch.compile
+
+モデルとGPUによっては、`torch.compile()`は推論時に最大30%の高速化を実現します。`torch.compile()`を使用するには、バージョン2.0以上のtorchをインストールするだけです。
+
+モデルのコンパイルには時間がかかるため、推論のたびにコンパイルし直すのではなく、一度だけコンパイルしたモデルで繰り返し推論する場合に有効です。
+任意のコンピュータビジョンモデルをコンパイルするには、以下のようにモデルに`torch.compile()`を呼び出します:
+
+```diff
+from transformers import AutoModelForImageClassification
+
+model = AutoModelForImageClassification.from_pretrained(MODEL_ID).to("cuda")
++ model = torch.compile(model)
+```
+
+`compile()` は、コンパイルに関する異なるモードを備えており、基本的にはコンパイル時間と推論のオーバーヘッドが異なります。`max-autotune` は `reduce-overhead` よりも時間がかかりますが、推論速度が速くなります。デフォルトモードはコンパイルにおいては最速ですが、推論時間においては `reduce-overhead` に比べて効率が良くありません。このガイドでは、デフォルトモードを使用しました。詳細については、[こちら](https://pytorch.org/get-started/pytorch-2.0/#user-experience) を参照してください。
+
+`torch` バージョン 2.0.1 で異なるコンピュータビジョンモデル、タスク、ハードウェアの種類、およびバッチサイズを使用して `torch.compile` をベンチマークしました。
+
+## Benchmarking code
+
+以下に、各タスクのベンチマークコードを示します。推論前にGPUをウォームアップし、毎回同じ画像を使用して300回の推論の平均時間を取得します。
+
+### Image Classification with ViT
+
+```python
+import torch
+from PIL import Image
+import requests
+from transformers import AutoImageProcessor, AutoModelForImageClassification
+
+url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
+image = Image.open(requests.get(url, stream=True).raw)
+
+processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
+model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224").to("cuda")
+model = torch.compile(model)
+
+processed_input = processor(image, return_tensors='pt').to(device="cuda")
+
+with torch.no_grad():
+    _ = model(**processed_input)
+```
+
+### Object Detection with DETR
+
+```python
+from transformers import AutoImageProcessor, AutoModelForObjectDetection
+
+# 上記のViTの例と同じ `image` を使用します
+processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50")
+model = AutoModelForObjectDetection.from_pretrained("facebook/detr-resnet-50").to("cuda")
+model = torch.compile(model)
+
+inputs = processor(images=image, return_tensors="pt").to("cuda")
+
+with torch.no_grad():
+    _ = model(**inputs)
+```
+
+### Image Segmentation with Segformer
+
+```python
+from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation
+
+processor = SegformerImageProcessor.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")
+model = SegformerForSemanticSegmentation.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512").to("cuda")
+model = torch.compile(model)
+seg_inputs = processor(images=image, return_tensors="pt").to("cuda")
+
+with torch.no_grad():
+    _ = model(**seg_inputs)
+```
+
+以下は、私たちがベンチマークを行ったモデルのリストです。
+
+
+**Image Classification**
+- [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224)
+- [microsoft/beit-base-patch16-224-pt22k-ft22k](https://huggingface.co/microsoft/beit-base-patch16-224-pt22k-ft22k)
+- [facebook/convnext-large-224](https://huggingface.co/facebook/convnext-large-224)
+- [microsoft/resnet-50](https://huggingface.co/microsoft/resnet-50)
+
+**Image Segmentation**
+- [nvidia/segformer-b0-finetuned-ade-512-512](https://huggingface.co/nvidia/segformer-b0-finetuned-ade-512-512)
+- [facebook/mask2former-swin-tiny-coco-panoptic](https://huggingface.co/facebook/mask2former-swin-tiny-coco-panoptic)
+- [facebook/maskformer-swin-base-ade](https://huggingface.co/facebook/maskformer-swin-base-ade)
+- [google/deeplabv3_mobilenet_v2_1.0_513](https://huggingface.co/google/deeplabv3_mobilenet_v2_1.0_513)
+
+**Object Detection**
+- [google/owlvit-base-patch32](https://huggingface.co/google/owlvit-base-patch32)
+- [facebook/detr-resnet-101](https://huggingface.co/facebook/detr-resnet-101)
+- [microsoft/conditional-detr-resnet-50](https://huggingface.co/microsoft/conditional-detr-resnet-50)
+
+
+以下は、`torch.compile()`を使用した場合と使用しない場合の推論時間の可視化と、異なるハードウェアとバッチサイズの各モデルに対するパフォーマンス向上の割合です。
+ +![Duration Comparison on V100 with Batch Size of 1](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/torch_compile/v100_1_duration.png) + +![Percentage Improvement on T4 with Batch Size of 4](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/torch_compile/T4_4_percentage.png) + +下記は、各モデルについて`compile()`を使用した場合と使用しなかった場合の推論時間(ミリ秒単位)です。なお、OwlViTは大きなバッチサイズでの使用時にメモリ不足(OOM)が発生することに注意してください。 + +### A100 (batch size: 1) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 9.325 | 7.584 | +| Image Segmentation/Segformer | 11.759 | 10.500 | +| Object Detection/OwlViT | 24.978 | 18.420 | +| Image Classification/BeiT | 11.282 | 8.448 | +| Object Detection/DETR | 34.619 | 19.040 | +| Image Classification/ConvNeXT | 10.410 | 10.208 | +| Image Classification/ResNet | 6.531 | 4.124 | +| Image Segmentation/Mask2former | 60.188 | 49.117 | +| Image Segmentation/Maskformer | 75.764 | 59.487 | +| Image Segmentation/MobileNet | 8.583 | 3.974 | +| Object Detection/Resnet-101 | 36.276 | 18.197 | +| Object Detection/Conditional-DETR | 31.219 | 17.993 | + + +### A100 (batch size: 4) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 14.832 | 14.499 | +| Image Segmentation/Segformer | 18.838 | 16.476 | +| Image Classification/BeiT | 13.205 | 13.048 | +| Object Detection/DETR | 48.657 | 32.418| +| Image Classification/ConvNeXT | 22.940 | 21.631 | +| Image Classification/ResNet | 6.657 | 4.268 | +| Image Segmentation/Mask2former | 74.277 | 61.781 | +| Image Segmentation/Maskformer | 180.700 | 159.116 | +| Image Segmentation/MobileNet | 14.174 | 8.515 | +| Object Detection/Resnet-101 | 68.101 | 44.998 | +| Object Detection/Conditional-DETR | 56.470 | 35.552 | + +### A100 (batch size: 16) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 40.944 | 40.010 | +| Image Segmentation/Segformer | 37.005 | 31.144 | +| Image Classification/BeiT | 41.854 | 41.048 | +| Object Detection/DETR | 164.382 | 161.902 | +| Image Classification/ConvNeXT | 82.258 | 75.561 | +| Image Classification/ResNet | 7.018 | 5.024 | +| Image Segmentation/Mask2former | 178.945 | 154.814 | +| Image Segmentation/Maskformer | 638.570 | 579.826 | +| Image Segmentation/MobileNet | 51.693 | 30.310 | +| Object Detection/Resnet-101 | 232.887 | 155.021 | +| Object Detection/Conditional-DETR | 180.491 | 124.032 | + +### V100 (batch size: 1) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 10.495 | 6.00 | +| Image Segmentation/Segformer | 13.321 | 5.862 | +| Object Detection/OwlViT | 25.769 | 22.395 | +| Image Classification/BeiT | 11.347 | 7.234 | +| Object Detection/DETR | 33.951 | 19.388 | +| Image Classification/ConvNeXT | 11.623 | 10.412 | +| Image Classification/ResNet | 6.484 | 3.820 | +| Image Segmentation/Mask2former | 64.640 | 49.873 | +| Image Segmentation/Maskformer | 95.532 | 72.207 | +| Image Segmentation/MobileNet | 9.217 | 4.753 | +| Object Detection/Resnet-101 | 52.818 | 28.367 | +| Object Detection/Conditional-DETR | 39.512 | 20.816 | + +### V100 (batch size: 4) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 15.181 | 14.501 | +| Image Segmentation/Segformer | 16.787 | 16.188 | +| Image Classification/BeiT | 15.171 | 14.753 | +| Object Detection/DETR | 88.529 | 64.195 | +| Image Classification/ConvNeXT | 29.574 | 27.085 | +| Image Classification/ResNet | 6.109 | 4.731 | +| Image Segmentation/Mask2former | 90.402 | 76.926 | +| Image Segmentation/Maskformer | 234.261 | 205.456 | +| Image Segmentation/MobileNet | 24.623 | 14.816 | +| Object Detection/Resnet-101 | 134.672 | 101.304 | +| Object Detection/Conditional-DETR | 97.464 | 69.739 | + +### V100 (batch size: 16) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 52.209 | 51.633 | +| Image Segmentation/Segformer | 61.013 | 55.499 | +| Image Classification/BeiT | 53.938 | 53.581 | +| Object Detection/DETR | OOM | OOM | +| Image Classification/ConvNeXT | 109.682 | 100.771 | +| Image Classification/ResNet | 14.857 | 12.089 | +| Image Segmentation/Mask2former | 249.605 | 222.801 | +| Image Segmentation/Maskformer | 831.142 | 743.645 | +| Image Segmentation/MobileNet | 93.129 | 55.365 | +| Object Detection/Resnet-101 | 482.425 | 361.843 | +| Object Detection/Conditional-DETR | 344.661 | 255.298 | + +### T4 (batch size: 1) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 16.520 | 15.786 | +| Image Segmentation/Segformer | 16.116 | 14.205 | +| Object Detection/OwlViT | 53.634 | 51.105 | +| Image Classification/BeiT | 16.464 | 15.710 | +| Object Detection/DETR | 73.100 | 53.99 | +| Image Classification/ConvNeXT | 32.932 | 30.845 | +| Image Classification/ResNet | 6.031 | 4.321 | +| Image Segmentation/Mask2former | 79.192 | 66.815 | +| Image Segmentation/Maskformer | 200.026 | 188.268 | +| Image Segmentation/MobileNet | 18.908 | 11.997 | +| Object Detection/Resnet-101 | 106.622 | 82.566 | +| Object Detection/Conditional-DETR | 77.594 | 56.984 | + +### T4 (batch size: 4) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 43.653 | 43.626 | +| Image Segmentation/Segformer | 45.327 | 42.445 | +| Image Classification/BeiT | 52.007 | 51.354 | +| Object Detection/DETR | 277.850 | 268.003 | +| Image Classification/ConvNeXT | 119.259 | 105.580 | +| Image Classification/ResNet | 13.039 | 11.388 | +| Image Segmentation/Mask2former | 201.540 | 184.670 | +| Image Segmentation/Maskformer | 764.052 | 711.280 | +| Image Segmentation/MobileNet | 74.289 | 48.677 | +| Object Detection/Resnet-101 | 421.859 | 357.614 | +| Object Detection/Conditional-DETR | 289.002 | 226.945 | + +### T4 (batch size: 16) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 163.914 | 160.907 | +| Image Segmentation/Segformer | 192.412 | 163.620 | +| Image Classification/BeiT | 188.978 | 187.976 | +| Object Detection/DETR | OOM | OOM | +| Image Classification/ConvNeXT | 422.886 | 388.078 | +| Image Classification/ResNet | 44.114 | 37.604 | +| Image Segmentation/Mask2former | 756.337 | 695.291 | +| Image Segmentation/Maskformer | 2842.940 | 2656.88 | +| Image Segmentation/MobileNet | 299.003 | 201.942 | +| Object Detection/Resnet-101 | 1619.505 | 1262.758 | +| Object Detection/Conditional-DETR | 1137.513 | 897.390| + +## PyTorch Nightly +また、PyTorchのナイトリーバージョン(2.1.0dev)でのベンチマークを行い、コンパイルされていないモデルとコンパイル済みモデルの両方でレイテンシーの向上を観察しました。ホイールは[こちら](https://download.pytorch.org/whl/nightly/cu118)から入手できます。 + + +### A100 + +| **Task/Model** | **Batch Size** | **torch 2.0 - no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:|:---:| +| Image Classification/BeiT | Unbatched | 12.462 | 6.954 | +| Image Classification/BeiT | 4 | 14.109 | 12.851 | +| Image Classification/BeiT | 16 | 42.179 | 42.147 | +| Object Detection/DETR | Unbatched | 30.484 | 15.221 | +| Object Detection/DETR | 4 | 46.816 | 30.942 | +| Object Detection/DETR | 16 | 163.749 | 163.706 | + +### T4 + +| **Task/Model** | **Batch Size** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:|:---:| +| Image Classification/BeiT | Unbatched | 14.408 | 14.052 | +| Image Classification/BeiT | 4 | 47.381 | 46.604 | +| Image Classification/BeiT | 16 | 42.179 | 42.147 | +| Object Detection/DETR | Unbatched | 68.382 | 53.481 | +| Object Detection/DETR | 4 | 269.615 | 204.785 | +| Object Detection/DETR | 16 | OOM | OOM | + +### V100 + +| **Task/Model** | **Batch Size** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:|:---:| +| Image Classification/BeiT | Unbatched | 13.477 | 7.926 | +| Image Classification/BeiT | 4 | 15.103 | 14.378 | +| Image Classification/BeiT | 16 | 52.517 | 51.691 | +| Object Detection/DETR | Unbatched | 28.706 | 19.077 | +| Object Detection/DETR | 4 | 88.402 | 62.949| +| Object Detection/DETR | 16 | OOM | OOM | + + +## Reduce Overhead +NightlyビルドでA100およびT4向けの `reduce-overhead` コンパイルモードをベンチマークしました。 + +### A100 + +| **Task/Model** | **Batch Size** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:|:---:| +| Image Classification/ConvNeXT | Unbatched | 11.758 | 7.335 | +| Image Classification/ConvNeXT | 4 | 23.171 | 21.490 | +| Image Classification/ResNet | Unbatched | 7.435 | 3.801 | +| Image Classification/ResNet | 4 | 7.261 | 2.187 | +| Object Detection/Conditional-DETR | Unbatched | 32.823 | 11.627 | +| Object Detection/Conditional-DETR | 4 | 50.622 | 33.831 | +| Image Segmentation/MobileNet | Unbatched | 9.869 | 4.244 | +| Image Segmentation/MobileNet | 4 | 14.385 | 7.946 | + + +### T4 + +| **Task/Model** | **Batch Size** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:|:---:| +| Image Classification/ConvNeXT | Unbatched | 32.137 | 31.84 | +| Image Classification/ConvNeXT | 4 | 120.944 | 110.209 | +| Image Classification/ResNet | Unbatched | 9.761 | 7.698 | +| Image Classification/ResNet | 4 | 15.215 | 13.871 | +| Object Detection/Conditional-DETR | Unbatched | 72.150 | 57.660 | +| Object Detection/Conditional-DETR | 4 | 301.494 | 247.543 | +| Image Segmentation/MobileNet | Unbatched | 22.266 | 19.339 | +| Image Segmentation/MobileNet | 4 | 78.311 | 50.983 | diff --git a/docs/source/ja/perf_train_cpu.md b/docs/source/ja/perf_train_cpu.md new file mode 100644 index 00000000000000..bf623d131363b5 --- /dev/null +++ b/docs/source/ja/perf_train_cpu.md @@ -0,0 +1,67 @@ + + +# Efficient Training on CPU + +このガイドは、CPU上で大規模なモデルを効率的にトレーニングする方法に焦点を当てています。 + +## Mixed precision with IPEX + +IPEXはAVX-512以上のCPUに最適化されており、AVX2のみのCPUでも機能的に動作します。そのため、AVX-512以上のIntel CPU世代ではパフォーマンスの向上が期待されますが、AVX2のみのCPU(例:AMD CPUまたは古いIntel CPU)ではIPEXの下でより良いパフォーマンスが得られるかもしれませんが、保証されません。IPEXは、Float32とBFloat16の両方でCPUトレーニングのパフォーマンスを最適化します。以下のセクションでは、BFloat16の使用に重点を置いて説明します。 + +低精度データ型であるBFloat16は、AVX512命令セットを備えた第3世代Xeon® Scalable Processors(別名Cooper Lake)でネイティブサポートされており、さらに高性能なIntel® Advanced Matrix Extensions(Intel® AMX)命令セットを備えた次世代のIntel® Xeon® Scalable Processorsでもサポートされます。CPUバックエンド用の自動混合精度がPyTorch-1.10以降で有効になっています。同時に、Intel® Extension for PyTorchでのCPU用BFloat16の自動混合精度サポートと、オペレーターのBFloat16最適化のサポートが大幅に向上し、一部がPyTorchのメインブランチにアップストリームされています。ユーザーはIPEX Auto Mixed Precisionを使用することで、より優れたパフォーマンスとユーザーエクスペリエンスを得ることができます。 + +詳細な情報については、[Auto Mixed Precision](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/features/amp.html)を確認してください。 + +### IPEX installation: + +IPEXのリリースはPyTorchに従っており、pipを使用してインストールできます: + +| PyTorch Version | IPEX version | +| :---------------: | :----------: | +| 1.13 | 1.13.0+cpu | +| 1.12 | 1.12.300+cpu | +| 1.11 | 1.11.200+cpu | +| 1.10 | 1.10.100+cpu | + +```bash +pip install intel_extension_for_pytorch== -f https://developer.intel.com/ipex-whl-stable-cpu +``` + +[IPEXのインストール方法](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/installation.html)について、さらなるアプローチを確認してください。 + +### Trainerでの使用方法 +TrainerでIPEXの自動混合精度を有効にするには、ユーザーはトレーニングコマンド引数に `use_ipex`、`bf16`、および `no_cuda` を追加する必要があります。 + +[Transformersの質問応答](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering)のユースケースを例に説明します。 + +- CPU上でBF16自動混合精度を使用してIPEXでトレーニングを行う場合: +
```bash
+ python run_qa.py \
+--model_name_or_path google-bert/bert-base-uncased \
+--dataset_name squad \
+--do_train \
+--do_eval \
+--per_device_train_batch_size 12 \
+--learning_rate 3e-5 \
+--num_train_epochs 2 \
+--max_seq_length 384 \
+--doc_stride 128 \
+--output_dir /tmp/debug_squad/ \
+--use_ipex \
+--bf16 --no_cuda
+```
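+
+また、スクリプト側で同じ設定を行う場合は、`TrainingArguments` に `use_ipex`、`bf16`、`no_cuda` を直接指定できます。以下は上記のコマンドに対応させた最小限のスケッチであり、値(バッチサイズや出力先など)はあくまで一例です:
+
+```python
+from transformers import TrainingArguments
+
+# 上記のコマンドライン引数に対応する設定の一例(値は環境に合わせて調整してください)
+training_args = TrainingArguments(
+    output_dir="/tmp/debug_squad/",
+    per_device_train_batch_size=12,
+    learning_rate=3e-5,
+    num_train_epochs=2,
+    use_ipex=True,  # IPEXによる最適化を有効化
+    bf16=True,      # BF16自動混合精度
+    no_cuda=True,   # CPUでトレーニング
+)
+```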
+ +### Practice example + +Blog: [Accelerating PyTorch Transformers with Intel Sapphire Rapids](https://huggingface.co/blog/intel-sapphire-rapids) diff --git a/docs/source/ja/perf_train_cpu_many.md b/docs/source/ja/perf_train_cpu_many.md new file mode 100644 index 00000000000000..26da32f577251f --- /dev/null +++ b/docs/source/ja/perf_train_cpu_many.md @@ -0,0 +1,151 @@ + + + +# Efficient Training on Multiple CPUs + +1つのCPUでのトレーニングが遅すぎる場合、複数のCPUを使用できます。このガイドは、PyTorchベースのDDPを使用した分散CPUトレーニングに焦点を当てています。 + +## Intel® oneCCL Bindings for PyTorch + +[Intel® oneCCL](https://github.com/oneapi-src/oneCCL)(集合通信ライブラリ)は、allreduce、allgather、alltoallなどの収集通信を実装した効率的な分散ディープラーニングトレーニング用のライブラリです。oneCCLの詳細については、[oneCCLドキュメント](https://spec.oneapi.com/versions/latest/elements/oneCCL/source/index.html)と[oneCCL仕様](https://spec.oneapi.com/versions/latest/elements/oneCCL/source/index.html)を参照してください。 + +モジュール`oneccl_bindings_for_pytorch`(バージョン1.12以前は`torch_ccl`)は、PyTorch C10D ProcessGroup APIを実装し、外部のProcessGroupとして動的にロードでき、現在はLinuxプラットフォームでのみ動作します。 + +[torch-ccl](https://github.com/intel/torch-ccl)の詳細情報を確認してください。 + +### Intel® oneCCL Bindings for PyTorch installation: + +Wheelファイルは、以下のPythonバージョン用に利用可能です: + +| Extension Version | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10 | +| :---------------: | :--------: | :--------: | :--------: | :--------: | :---------: | +| 1.13.0 | | √ | √ | √ | √ | +| 1.12.100 | | √ | √ | √ | √ | +| 1.12.0 | | √ | √ | √ | √ | +| 1.11.0 | | √ | √ | √ | √ | +| 1.10.0 | √ | √ | √ | √ | | + +```bash +pip install oneccl_bind_pt=={pytorch_version} -f https://developer.intel.com/ipex-whl-stable-cpu +``` + +where `{pytorch_version}` should be your PyTorch version, for instance 1.13.0. +Check more approaches for [oneccl_bind_pt installation](https://github.com/intel/torch-ccl). +Versions of oneCCL and PyTorch must match. 
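+
+参考までに、PyTorch 1.13.0 を使用している場合の具体的なインストール例を以下に示します(バージョンはお使いの環境に合わせて読み替えてください):
+
+```bash
+pip install oneccl_bind_pt==1.13.0 -f https://developer.intel.com/ipex-whl-stable-cpu
+```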
+ + + +oneccl_bindings_for_pytorch 1.12.0 prebuilt wheel does not work with PyTorch 1.12.1 (it is for PyTorch 1.12.0) +PyTorch 1.12.1 should work with oneccl_bindings_for_pytorch 1.12.100 + + + +`{pytorch_version}` は、あなたのPyTorchのバージョン(例:1.13.0)に置き換える必要があります。重要なのは、oneCCLとPyTorchのバージョンが一致していることです。[oneccl_bind_ptのインストール](https://github.com/intel/torch-ccl)に関するさらなるアプローチを確認できます。 + + + +`oneccl_bindings_for_pytorch`の1.12.0プリビルトホイールはPyTorch 1.12.1と互換性がありません(これはPyTorch 1.12.0用です)。PyTorch 1.12.1を使用する場合は、`oneccl_bindings_for_pytorch`バージョン1.12.100を使用する必要があります。 + + + +## Intel® MPI library + + +この基準ベースのMPI実装を使用して、Intel®アーキテクチャ上で柔軟で効率的、スケーラブルなクラスタメッセージングを提供します。このコンポーネントは、Intel® oneAPI HPC Toolkitの一部です。 + +oneccl_bindings_for_pytorchはMPIツールセットと一緒にインストールされます。使用する前に環境をソース化する必要があります。 + + +for Intel® oneCCL >= 1.12.0 +```bash +oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)") +source $oneccl_bindings_for_pytorch_path/env/setvars.sh +``` + +for Intel® oneCCL whose version < 1.12.0 +```bash +torch_ccl_path=$(python -c "import torch; import torch_ccl; import os; print(os.path.abspath(os.path.dirname(torch_ccl.__file__)))") +source $torch_ccl_path/env/setvars.sh +``` + +#### IPEX installation: + +IPEXは、Float32およびBFloat16の両方でCPUトレーニングのパフォーマンス最適化を提供します。詳細は[こちらのシングルCPUセクション](./perf_train_cpu)をご参照ください。 + +以下の「トレーナーでの使用」は、Intel® MPIライブラリでmpirunを使用する例を示しています。 + +## Usage in Trainer +トレーナーでのマルチCPU分散トレーニングを有効にするために、ユーザーはコマンド引数に **`--ddp_backend ccl`** を追加する必要があります。 + +例を見てみましょう。[質問応答の例](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) + +以下のコマンドは、1つのXeonノードで2つのプロセスを使用してトレーニングを有効にします。1つのプロセスが1つのソケットで実行されます。OMP_NUM_THREADS/CCL_WORKER_COUNT変数は、最適なパフォーマンスを調整するために調整できます。 + + +```shell script + export CCL_WORKER_COUNT=1 + export MASTER_ADDR=127.0.0.1 + mpirun -n 2 -genv OMP_NUM_THREADS=23 \ + python3 run_qa.py \ + --model_name_or_path google-bert/bert-large-uncased \ + --dataset_name squad \ + --do_train \ + --do_eval \ + --per_device_train_batch_size 12 \ + --learning_rate 3e-5 \ + --num_train_epochs 2 \ + --max_seq_length 384 \ + --doc_stride 128 \ + --output_dir /tmp/debug_squad/ \ + --no_cuda \ + --ddp_backend ccl \ + --use_ipex +``` + +以下のコマンドは、2つのXeonプロセッサ(node0とnode1、node0をメインプロセスとして使用)で合計4つのプロセスを使用してトレーニングを有効にします。ppn(ノードごとのプロセス数)は2に設定され、1つのソケットごとに1つのプロセスが実行されます。最適なパフォーマンスを得るために、OMP_NUM_THREADS/CCL_WORKER_COUNT変数を調整できます。 + +node0では、各ノードのIPアドレスを含む構成ファイルを作成し、その構成ファイルのパスを引数として渡す必要があります。 + +```shell script + cat hostfile + xxx.xxx.xxx.xxx #node0 ip + xxx.xxx.xxx.xxx #node1 ip +``` + +ノード0で次のコマンドを実行すると、ノード0とノード1で**4DDP**がBF16自動混合精度で有効になります。 + + +```shell script + export CCL_WORKER_COUNT=1 + export MASTER_ADDR=xxx.xxx.xxx.xxx #node0 ip + mpirun -f hostfile -n 4 -ppn 2 \ + -genv OMP_NUM_THREADS=23 \ + python3 run_qa.py \ + --model_name_or_path google-bert/bert-large-uncased \ + --dataset_name squad \ + --do_train \ + --do_eval \ + --per_device_train_batch_size 12 \ + --learning_rate 3e-5 \ + --num_train_epochs 2 \ + --max_seq_length 384 \ + --doc_stride 128 \ + --output_dir /tmp/debug_squad/ \ + --no_cuda \ + --ddp_backend ccl \ + --use_ipex \ + --bf16 +``` \ No newline at end of file diff --git a/docs/source/ja/perf_train_gpu_many.md b/docs/source/ja/perf_train_gpu_many.md new file mode 100644 index 00000000000000..d85165d0c547a0 --- /dev/null +++ b/docs/source/ja/perf_train_gpu_many.md @@ -0,0 +1,529 @@ + + +# Efficient Training on Multiple GPUs + 
+単一のGPUでのトレーニングが遅すぎる場合や、モデルの重みが単一のGPUのメモリに収まらない場合、複数のGPUを使用したセットアップが必要となります。単一のGPUから複数のGPUへの切り替えには、ワークロードを分散するためのある種の並列処理が必要です。データ、テンソル、またはパイプラインの並列処理など、さまざまな並列処理技術があります。ただし、すべてに適した一つの解決策は存在せず、最適な設定は使用するハードウェアに依存します。この記事は、おそらく他のフレームワークにも適用される主要な概念に焦点を当てつつ、PyTorchベースの実装に焦点を当てています。 + + + +**注意**: [単一GPUセクション](perf_train_gpu_one) で紹介された多くの戦略(混合精度トレーニングや勾配蓄積など)は一般的であり、モデルのトレーニングに一般的に適用されます。したがって、マルチGPUやCPUトレーニングなどの次のセクションに入る前に、それを確認してください。 + + + +まず、さまざまな1D並列処理技術とその利点および欠点について詳しく説明し、それらを2Dおよび3D並列処理に組み合わせてさらに高速なトレーニングを実現し、より大きなモデルをサポートする方法を検討します。さまざまな他の強力な代替手法も紹介されます。 + +## Concepts + +以下は、この文書で後で詳しく説明される主要な概念の簡単な説明です。 + +1. **DataParallel (DP)** - 同じセットアップが複数回複製され、各セットアップにデータのスライスが供給されます。処理は並行して行われ、各セットアップはトレーニングステップの最後に同期されます。 +2. **TensorParallel (TP)** - 各テンソルは複数のチャンクに分割され、単一のGPUにテンソル全体が存在するのではなく、テンソルの各シャードが指定されたGPUに存在します。処理中に、各シャードは別々に並行して処理され、異なるGPUで同期され、ステップの最後に結果が同期されます。これは水平並列処理と呼ばれるもので、分割は水平レベルで行われます。 +3. **PipelineParallel (PP)** - モデルは垂直(レイヤーレベル)に複数のGPUに分割され、モデルの単一または複数のレイヤーが単一のGPUに配置されます。各GPUはパイプラインの異なるステージを並行して処理し、バッチの小さなチャンクで作業します。 +4. **Zero Redundancy Optimizer (ZeRO)** - TPといくらか似たようなテンソルのシャーディングを実行しますが、前向きまたは後向きの計算のためにテンソル全体が再構築されるため、モデルを変更する必要はありません。また、GPUメモリが制限されている場合に補償するためのさまざまなオフロード技術をサポートします。 +5. **Sharded DDP** - Sharded DDPは、さまざまなZeRO実装で使用される基本的なZeROコンセプトの別名です。 + +各コンセプトの詳細に深入りする前に、大規模なインフラストラクチャで大規模なモデルをトレーニングする際の大まかな決定プロセスを見てみましょう。 + +## Scalability Strategy + +**⇨ シングルノード / マルチGPU** +* モデルが単一のGPUに収まる場合: + + 1. DDP - 分散データ並列 + 2. ZeRO - 状況と使用される構成に応じて速いかどうかが異なります + +* モデルが単一のGPUに収まらない場合: + + 1. PP + 2. ZeRO + 3. TP + + 非常に高速なノード内接続(NVLINKまたはNVSwitchなど)があれば、これらの3つはほぼ同じ速度になるはずで、これらがない場合、PPはTPまたはZeROよりも速くなります。TPの程度も差を生じるかもしれません。特定のセットアップでの勝者を見つけるために実験することが最善です。 + + TPはほとんどの場合、単一ノード内で使用されます。つまり、TPサイズ <= ノードごとのGPU数です。 + +* 最大のレイヤーが単一のGPUに収まらない場合: + + 1. ZeROを使用しない場合 - TPを使用する必要があります。PP単独では収まらないでしょう。 + 2. ZeROを使用する場合 - "シングルGPU"のエントリと同じものを参照してください + +**⇨ マルチノード / マルチGPU** + +* ノード間の高速接続がある場合: + + 1. ZeRO - モデルへのほとんどの変更が不要です + 2. PP+TP+DP - 通信が少なく、モデルへの大規模な変更が必要です + +* ノード間の接続が遅く、GPUメモリがまだ不足している場合: + + 1. DP+PP+TP+ZeRO-1 + +## Data Parallelism + +2つのGPUを持つほとんどのユーザーは、`DataParallel`(DP)と`DistributedDataParallel`(DDP)によって提供されるトレーニング速度の向上をすでに享受しています。これらはほぼ自明に使用できるPyTorchの組み込み機能です。一般的に、すべてのモデルで動作するDDPを使用することをお勧めします。DPは一部のモデルで失敗する可能性があるためです。[PyTorchのドキュメンテーション](https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html)自体もDDPの使用を推奨しています。 + +### DP vs DDP + +`DistributedDataParallel`(DDP)は通常、`DataParallel`(DP)よりも高速ですが、常にそうとは限りません: +* DPはPythonスレッドベースですが、DDPはマルチプロセスベースです。そのため、GIL(Global Interpreter Lock)などのPythonスレッドの制約がないためです。 +* 一方、GPUカード間の遅い相互接続性は、DDPの場合に実際には遅い結果をもたらす可能性があります。 + +以下は、2つのモード間のGPU間通信の主な違いです: + +[DDP](https://pytorch.org/docs/master/notes/ddp.html): + +- 開始時、メインプロセスはモデルをGPU 0から他のGPUに複製します。 +- それから各バッチごとに: + 1. 各GPUは各自のミニバッチのデータを直接消費します。 + 2. `backward`中、ローカル勾配が準備できると、それらはすべてのプロセスで平均化されます。 + +[DP](https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html): + +各バッチごとに: + 1. GPU 0はデータバッチを読み取り、それから各GPUにミニバッチを送信します。 + 2. GPU 0から各GPUに最新のモデルを複製します。 + 3. `forward`を実行し、各GPUからGPU 0に出力を送信し、損失を計算します。 + 4. GPU 0からすべてのGPUに損失を分散し、`backward`を実行します。 + 5. 
各GPUからGPU 0に勾配を送信し、それらを平均化します。 + +DDPはバッチごとに行う通信は勾配の送信のみであり、一方、DPはバッチごとに5つの異なるデータ交換を行います。 + +DPはプロセス内でデータをPythonスレッドを介してコピーしますが、DDPは[torch.distributed](https://pytorch.org/docs/master/distributed.html)を介してデータをコピーします。 + +DPではGPU 0は他のGPUよりもはるかに多くの作業を行うため、GPUの未使用率が高くなります。 + +DDPは複数のマシン間で使用できますが、DPの場合はそうではありません。 + +DPとDDPの他にも違いがありますが、この議論には関係ありません。 + +これら2つのモードを深く理解したい場合、この[記事](https://www.telesens.co/2019/04/04/distributed-data-parallel-training-using-pytorch-on-aws/)を強くお勧めします。素晴らしいダイアグラムを含み、さまざまなハードウェアでの複数のベンチマークとプロファイラの出力を示し、知っておく必要があるすべての微妙なニュアンスを説明しています。 + +実際のベンチマークを見てみましょう: + +| Type | NVlink | Time | +| :----- | ----- | ---: | +| 2:DP | Y | 110s | +| 2:DDP | Y | 101s | +| 2:DDP | N | 131s | + + +解析: + +ここで、DPはNVlinkを使用したDDPに比べて約10%遅く、NVlinkを使用しないDDPに比べて約15%高速であることが示されています。 + +実際の違いは、各GPUが他のGPUと同期する必要があるデータの量に依存します。同期するデータが多いほど、遅いリンクが合計の実行時間を遅くする可能性が高くなります。 + +以下は完全なベンチマークコードと出力です: + +`NCCL_P2P_DISABLE=1`を使用して、対応するベンチマークでNVLink機能を無効にしました。 + + +```bash + +# DP +rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \ +python examples/pytorch/language-modeling/run_clm.py \ +--model_name_or_path openai-community/gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ +--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 + +{'train_runtime': 110.5948, 'train_samples_per_second': 1.808, 'epoch': 0.69} + +# DDP w/ NVlink +rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \ +torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \ +--model_name_or_path openai-community/gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ +--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 + +{'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69} + +# DDP w/o NVlink +rm -r /tmp/test-clm; NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 \ +torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \ +--model_name_or_path openai-community/gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ +--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 + +{'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69} +``` + +ハードウェア: 2x TITAN RTX、各24GB + 2つのNVLink(`nvidia-smi topo -m`で `NV2`) + +ソフトウェア: `pytorch-1.8-to-be` + `cuda-11.0` / `transformers==4.3.0.dev0` + +## ZeRO Data Parallelism + +ZeROパワードデータ並列処理(ZeRO-DP)は、次の[ブログ投稿](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/)のダイアグラムで説明されています。 +![DeepSpeed-Image-1](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-zero.png) + +これは理解が難しいかもしれませんが、実際にはこの概念は非常にシンプルです。これは通常の`DataParallel`(DP)ですが、完全なモデルパラメータ、勾配、およびオプティマイザの状態を複製する代わりに、各GPUはそれぞれのスライスのみを保存します。そして、実行時に、特定のレイヤーに必要な完全なレイヤーパラメータが必要な場合、すべてのGPUが同期して、お互いに不足している部分を提供します。それがすべてです。 + +3つのレイヤーからなる単純なモデルを考えてみましょう。各レイヤーには3つのパラメータがあります: + + +``` +La | Lb | Lc +---|----|--- +a0 | b0 | c0 +a1 | b1 | c1 +a2 | b2 | c2 +``` + +レイヤーLaには、重みa0、a1、およびa2があります。 + +3つのGPUがある場合、Sharded DDP(= Zero-DP)はモデルを3つのGPUに次のように分割します: + + +``` +GPU0: +La | Lb | Lc +---|----|--- +a0 | b0 | c0 + +GPU1: +La | Lb | Lc +---|----|--- +a1 | b1 | c1 + +GPU2: +La | Lb | Lc +---|----|--- +a2 | b2 | c2 +``` + +これは、典型的なディープニューラルネットワーク(DNN)のダイアグラムを想像すると、テンソル並列処理と同様の水平スライスであるようなものです。垂直スライスは、異なるGPUに完全な層グループを配置する方法です。しかし、これは単なる出発点に過ぎません。 + +これから、各GPUは通常のデータ並列処理(DP)と同様に、通常のミニバッチを受け取ります: + +``` +x0 => GPU0 +x1 => GPU1 +x2 => GPU2 +``` 
+ +最初に、入力データはレイヤーLaに適用されます。 + +GPU0に焦点を当てましょう:x0は、その前向きパスを実行するためにa0、a1、a2のパラメータが必要ですが、GPU0にはa0しかありません。GPU1からa1を、GPU2からa2を受け取り、モデルの各部分をまとめます。 + +同様に、GPU1はミニバッチx1を受け取り、a1しか持っていませんが、a0とa2のパラメータが必要です。これらはGPU0とGPU2から取得します。 + +GPU2もx2を受け取ります。a0とa1はGPU0とGPU1から受け取り、a2とともに完全なテンソルを再構築します。 + +3つのGPUは完全なテンソルを再構築し、前向き計算が行われます。 + +計算が完了すると、不要になったデータは削除されます。計算中だけ使用され、再構築は事前にフェッチを使用して効率的に行われます。 + +そして、このプロセス全体がレイヤーLb、次に前向きでLc、そして逆方向でLc -> Lb -> Laに対して繰り返されます。 + +私にとって、これは効率的なグループでの重みの分散戦略のように聞こえます: + +1. 人Aはテントを持っています。 +2. 人Bはストーブを持っています。 +3. 人Cは斧を持っています。 + +今、彼らは毎晩持っているものを共有し、他の人から持っていないものをもらい、朝には割り当てられたタイプのギアを詰めて旅を続けます。これがSharded DDP / Zero DPです。 + +この戦略を、各人が独自のテント、ストーブ、斧を持って運ばなければならないシンプルな戦略と比較してみてください。これがPyTorchのDataParallel(DPおよびDDP)です。 + +このトピックの文献を読む際に、以下の類義語に出会うかもしれません:Sharded、Partitioned。 + +ZeROがモデルの重みを分割する方法に注意を払うと、これはテンソルパラレリズムと非常に似ているように見えます。これは後で議論される垂直モデルパラレリズムとは異なり、各レイヤーの重みをパーティション/シャーディングします。 + +Implementations: + +- [DeepSpeed](https://www.deepspeed.ai/tutorials/zero/) ZeRO-DP stages 1+2+3 +- [`transformers` integration](main_classes/trainer#trainer-integrations) + + +## Naive Model Parallelism (Vertical) and Pipeline Parallelism + +ナイーブモデルパラレリズム(MP)は、モデルの層を複数のGPUに分散させる方法です。このメカニズムは比較的単純で、希望する層を`.to()`メソッドを使用して特定のデバイスに切り替えるだけです。これにより、データがこれらの層を通過するたびに、データも層と同じデバイスに切り替えられ、残りの部分は変更されません。 + +私たちはこれを「垂直MP」と呼びます。なぜなら、ほとんどのモデルがどのように描かれるかを思い出すと、層を垂直にスライスするからです。たとえば、以下の図は8層のモデルを示しています: + + +``` +=================== =================== +| 0 | 1 | 2 | 3 | | 4 | 5 | 6 | 7 | +=================== =================== + gpu0 gpu1 +``` + +我々は、モデルを垂直に2つに分割し、レイヤー0から3をGPU0に配置し、レイヤー4から7をGPU1に配置しました。 + +データがレイヤー0から1、1から2、2から3に移動する間は通常のモデルと同じです。しかし、データがレイヤー3からレイヤー4に移動する必要がある場合、GPU0からGPU1への移動が発生し、通信のオーバーヘッドが発生します。参加しているGPUが同じコンピュートノード(例:同じ物理マシン)にある場合、このコピーは非常に高速ですが、異なるコンピュートノード(例:複数のマシン)にある場合、通信のオーバーヘッドは大幅に増加する可能性があります。 + +その後、レイヤー4から5、6から7までは通常のモデルと同様に動作し、7番目のレイヤーが完了すると、データをしばしばレイヤー0に戻す必要があります(またはラベルを最後のレイヤーに送信します)。これで損失を計算し、オプティマイザが作業を開始できます。 + +問題点: +- 主な欠点、およびなぜこれを「単純な」MPと呼ぶのかは、1つを除いてすべてのGPUがどんな瞬間でもアイドル状態であることです。したがって、4つのGPUを使用する場合、単純なMPは、1つのGPUのメモリ容量を4倍にするのとほぼ同じであり、ハードウェアの残りを無視します。さらに、データのコピーのオーバーヘッドがあることを忘れてはいけません。したがって、4枚の6GBのカードは、データのコピーのオーバーヘッドがない1枚の24GBのカードと同じサイズを収容できるでしょうが、後者はトレーニングをより迅速に完了します。ただし、たとえば40GBのカードがあり、45GBのモデルを収める必要がある場合、勾配とオプティマイザの状態のためにほとんど収めることができません。 +- 共有の埋め込みは、GPU間でコピーする必要があるかもしれません。 + +パイプライン並列処理(PP)は、ほぼ単純なMPと同じですが、GPUがアイドル状態になる問題を解決し、入力バッチをマイクロバッチに分割し、パイプラインを人工的に作成することにより、異なるGPUが計算プロセスに同時に参加できるようにします。 + +以下は、[GPipe論文](https://ai.googleblog.com/2019/03/introducing-gpipe-open-source-library.html)からの図で、上部には単純なMP、下部にはPPが示されています: + +![mp-pp](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-gpipe-bubble.png) + +この図から、PPがGPUがアイドル状態の領域である「バブル」を少なく持つことがわかります。アイドル状態の部分は「バブル」と呼ばれます。 + +図の両方の部分は、4つのGPUがパイプラインに参加している4の次元の並列性を示しています。つまり、4つのパイプステージF0、F1、F2、F3のフォワードパスがあり、逆順のバックワードパスB3、B2、B1、B0があります。 + +PPは調整する新しいハイパーパラメータを導入します。それは `chunks` で、同じパイプステージを通じて連続して送信されるデータのチャンクの数を定義します。たとえば、下の図では `chunks=4` が表示されています。GPU0はチャンク0、1、2、3(F0,0、F0,1、F0,2、F0,3)で同じフォワードパスを実行し、他のGPUが作業を開始し始めるのを待ってから、GPU0はチャンク3、2、1、0(B0,3、B0,2、B0,1、B0,0)で逆順パスを実行します。 + +注意すべきは、概念的にはこれが勾配蓄積ステップ(GAS)と同じコンセプトであることです。PyTorchは `chunks` を使用し、DeepSpeedは同じハイパーパラメータをGASと呼びます。 + +`chunks` の導入により、PPはマイクロバッチ(MBS)の概念を導入します。DPはグローバルデータバッチサイズをミニバッチに分割します。したがって、DPの次数が4で、グローバルバッチサイズが1024の場合、4つのミニバッチ(それぞれ256)に分割されます(1024/4)。そして、`chunks`(またはGAS)の数が32である場合、マイクロバッチサイズは8になります(256/32)。各パイプラインステージは1つのマイクロバッチで作業します。 + +DP + PPセットアップのグローバルバッチサイズを計算するには、`mbs*chunks*dp_degree`(`8*32*4=1024`)を行います。 + +図に戻りましょう。 + 
+`chunks=1` であれば、非効率な単純なMPになります。非常に大きな `chunks` 値を使用すると、非常に小さなマイクロバッチサイズになり、効率があまり高くないかもしれません。したがって、GPUの効率的な利用を最大化する値を見つけるために実験する必要があります。これは、バブルのサイズを最小限にすることに対応する、すべての参加GPUにわたる高い並行GPU利用を可能にするためです。 + +2つのソリューショングループがあります。従来のパイプラインAPIソリューションと、ユーザーのモデルを大幅に変更する必要があるより現代的なソリューションです。 + +従来のパイプラインAPIソリューション: +- PyTorch +- DeepSpeed +- Megatron-LM + +現代的なソリューション: +- Varuna +- Sagemaker + +従来のパイプラインAPIソリューションの問題点: +- モデルをかなり変更する必要があるため、Pipelineはモジュールの通常のフローを`nn.Sequential`シーケンスに再書き込む必要があり、モデルの設計を変更することが必要です。 +- 現在、Pipeline APIは非常に制限的です。最初のパイプラインステージに渡されるPython変数のセットがある場合、回避策を見つける必要があります。現在、パイプラインインターフェースでは、唯一のテンソルまたはテンソルのタプルを入力と出力として要求しています。これらのテンソルはバッチサイズを最初の次元として持っている必要があります。パイプラインはミニバッチをマイクロバッチに分割します。可能な改善点については、こちらの議論が行われています:https://github.com/pytorch/pytorch/pull/50693 +- パイプステージのレベルでの条件付き制御フローは不可能です。例えば、T5のようなエンコーダーデコーダーモデルは、条件付きエンコーダーステージを処理するために特別な回避策が必要です。 +- 各レイヤーを配置する必要があるため、1つのモデルの出力が他のモデルの入力になるようにします。 + +VarunaとSageMakerとの実験はまだ行っていませんが、彼らの論文によれば、上記で述べた問題のリストを克服し、ユーザーのモデルにははるかに小さな変更しか必要としないと報告されています。 + +実装: + +- [Pytorch](https://pytorch.org/docs/stable/pipeline.html) (initial support in pytorch-1.8, and progressively getting improved in 1.9 and more so in 1.10). Some [examples](https://github.com/pytorch/pytorch/blob/master/benchmarks/distributed/pipeline/pipe.py) +- [DeepSpeed](https://www.deepspeed.ai/tutorials/pipeline/) +- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) has an internal implementation - no API. +- [Varuna](https://github.com/microsoft/varuna) +- [SageMaker](https://arxiv.org/abs/2111.05972) - this is a proprietary solution that can only be used on AWS. +- [OSLO](https://github.com/tunib-ai/oslo) - この実装は、Hugging Face Transformersに基づいています。 + +🤗 Transformersのステータス: この執筆時点では、いずれのモデルも完全なPP(パイプライン並列処理)をサポートしていません。GPT2モデルとT5モデルは単純なMP(モデル並列処理)サポートを持っています。主な障害は、モデルを`nn.Sequential`に変換できず、すべての入力がテンソルである必要があることです。現在のモデルには、変換を非常に複雑にする多くの機能が含まれており、これらを削除する必要があります。 + +他のアプローチ: + +DeepSpeed、Varuna、およびSageMakerは、[交互にパイプラインを実行](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html)するコンセプトを使用しています。ここでは、バックワードパスを優先させてバブル(アイドル時間)をさらに最小限に抑えます。 + +Varunaは、最適なスケジュールを発見するためにシミュレーションを使用してスケジュールをさらに改善しようとします。 + +OSLOは、`nn.Sequential`の変換なしでTransformersに基づくパイプライン並列処理を実装しています。 + +## Tensor Parallelism + +テンソル並列処理では、各GPUがテンソルのスライスのみを処理し、全体が必要な操作のためにのみ完全なテンソルを集約します。 + +このセクションでは、[Megatron-LM](https://github.com/NVIDIA/Megatron-LM)論文からのコンセプトと図を使用します:[GPUクラスタでの効率的な大規模言語モデルトレーニング](https://arxiv.org/abs/2104.04473)。 + +どのトランスフォーマの主要な構築要素は、完全に接続された`nn.Linear`に続く非線形アクティベーション`GeLU`です。 + +Megatronの論文の表記法に従って、行列の乗算部分を`Y = GeLU(XA)`と書くことができます。ここで、`X`と`Y`は入力ベクトルと出力ベクトルで、`A`は重み行列です。 + +行列の計算を行列形式で見ると、行列乗算を複数のGPUで分割できる方法が簡単に理解できます: +![Parallel GEMM](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-tp-parallel_gemm.png) + +重み行列`A`を`N`個のGPUに対して列ごとに分割し、並列で行列乗算`XA_1`から`XA_n`を実行すると、`N`個の出力ベクトル`Y_1、Y_2、...、Y_n`が得られ、それらを独立して`GeLU`に供給できます: +![独立したGeLU](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-tp-independent-gelu.png) + +この原理を使用して、最後まで同期が必要ないまま、任意の深さのMLPを更新できます。Megatron-LMの著者はそのための有用なイラストを提供しています: +![並列シャード処理](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-tp-parallel_shard_processing.png) + +マルチヘッドアテンションレイヤーを並列化することはさらに簡単です。それらは既に複数の独立したヘッドを持っているため、本質的に並列です! 
+![並列セルフアテンション](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-tp-parallel_self_attention.png) + +特別な考慮事項:TPには非常に高速なネットワークが必要であり、したがって1つのノードを超えてTPを実行しないことがお勧めされません。実際には、1つのノードに4つのGPUがある場合、最大のTP度数は4です。TP度数8が必要な場合は、少なくとも8つのGPUを持つノードを使用する必要があります。 + +このセクションは、元のより詳細な[TPの概要](https://github.com/huggingface/transformers/issues/10321#issuecomment-783543530)に基づいています。 +by [@anton-l](https://github.com/anton-l)。 + +SageMakerは、より効率的な処理のためにTPとDPを組み合わせて使用します。 + +代替名: +- [DeepSpeed](https://github.com/microsoft/DeepSpeed)はこれを「テンソルスライシング」と呼びます。詳細は[DeepSpeedの特徴](https://www.deepspeed.ai/training/#model-parallelism)をご覧ください。 + +実装例: +- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)には、モデル固有の内部実装があります。 +- [parallelformers](https://github.com/tunib-ai/parallelformers)(現時点では推論のみ)。 +- [SageMaker](https://arxiv.org/abs/2111.05972) - これはAWSでのみ使用できるプロプライエタリなソリューションです。 +- [OSLO](https://github.com/tunib-ai/oslo)には、Transformersに基づいたテンソル並列実装があります。 + +🤗 Transformersの状況: +- コア: まだコアには実装されていません。 +- ただし、推論が必要な場合、[parallelformers](https://github.com/tunib-ai/parallelformers)はほとんどのモデルに対してサポートを提供します。これがコアに実装されるまで、これを使用できます。そして、トレーニングモードもサポートされることを期待しています。 +- Deepspeed-Inferenceでは、BERT、GPT-2、およびGPT-NeoモデルをCUDAカーネルベースの高速推論モードでサポートしています。詳細は[こちら](https://www.deepspeed.ai/tutorials/inference-tutorial/)をご覧ください。 + +## DP+PP + +DeepSpeedの[パイプラインチュートリアル](https://www.deepspeed.ai/tutorials/pipeline/)からの次の図は、DPをPPと組み合わせる方法を示しています。 + +![dp-pp-2d](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-zero-dp-pp.png) + +ここで重要なのは、DPランク0がGPU2を見えなくし、DPランク1がGPU3を見えなくすることです。DPにとって、存在するのはGPU 0 と 1 のみで、それらの2つのGPUのようにデータを供給します。GPU0はPPを使用してGPU2に一部の負荷を「秘密裏に」オフロードし、GPU1も同様にGPU3を支援に引き入れます。 + +各次元には少なくとも2つのGPUが必要ですので、ここでは少なくとも4つのGPUが必要です。 + +実装例: +- [DeepSpeed](https://github.com/microsoft/DeepSpeed) +- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) +- [Varuna](https://github.com/microsoft/varuna) +- [SageMaker](https://arxiv.org/abs/2111.05972) +- [OSLO](https://github.com/tunib-ai/oslo) + +🤗 Transformersの状況: まだ実装されていません + +## DP+PP+TP + +さらに効率的なトレーニングを行うために、3Dパラレリズムを使用し、PPをTPとDPと組み合わせます。これは次の図で示されています。 + +![dp-pp-tp-3d](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-deepspeed-3d.png) + +この図は[3Dパラレリズム:兆パラメータモデルへのスケーリング](https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/)というブログ投稿から取得されたもので、おすすめの読み物です。 + +各次元には少なくとも2つのGPUが必要ですので、ここでは少なくとも8つのGPUが必要です。 + +実装例: +- [DeepSpeed](https://github.com/microsoft/DeepSpeed) - DeepSpeedには、さらに効率的なDPであるZeRO-DPと呼ばれるものも含まれています。 +- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) +- [Varuna](https://github.com/microsoft/varuna) +- [SageMaker](https://arxiv.org/abs/2111.05972) +- [OSLO](https://github.com/tunib-ai/oslo) + +🤗 Transformersの状況: まだ実装されていません。PPとTPがないため。 + +## ZeRO DP+PP+TP + +DeepSpeedの主要な機能の1つはZeROで、これはDPの拡張機能です。これについてはすでに「ZeROデータ並列化」で説明されています。通常、これは単独で動作する機能で、PPやTPは必要ありません。しかし、PPとTPと組み合わせることもできます。 + +ZeRO-DPがPPと組み合わされる場合、通常はZeROステージ1(オプティマイザーシャーディング)のみが有効になります。 + +ZeROステージ2(勾配シャーディング)をパイプライン並列化と組み合わせて使用する理論的な可能性はありますが、性能に悪影響を及ぼします。各マイクロバッチごとに勾配をシャーディングする前に、勾配を集約するための追加のリダクションスキャッター集計が必要で、通信オーバーヘッドが発生する可能性があります。パイプライン並列化の性質上、小さなマイクロバッチが使用され、計算の集中度(マイクロバッチサイズ)をバランスにかけ、パイプラインバブル(マイクロバッチ数)を最小限に抑えることに焦点が当てられています。したがって、これらの通信コストは影響を及ぼすでしょう。 + +さらに、PPには通常よりも少ない層が含まれており、メモリの節約はそれほど大きくありません。PPは既に勾配サイズを「1/PP」に削減するため、勾配シャーディングの節約は純粋なDPよりもはるかに重要ではありません。 + +ZeROステージ3も同様の理由で適していません - より多くのノード間通信が必要です。 + 
+そして、ZeROを持っているので、もう一つの利点はZeRO-Offloadです。これはステージ1オプティマイザーステートをCPUにオフロードできます。 + +実装例: +- [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed)と[BigScienceからのMegatron-Deepspeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed)は、前者のリポジトリのフォークです。 +- [OSLO](https://github.com/tunib-ai/oslo) + +重要な論文: + +- [DeepSpeedとMegatronを使用したMegatron-Turing NLG 530Bのトレーニング](https://arxiv.org/abs/2201.11990) + +🤗 Transformersの状況: まだ実装されていません。PPとTPがないため。 + + +## FlexFlow + +[FlexFlow](https://github.com/flexflow/FlexFlow)は、わずかに異なるアプローチで並列化の問題を解決します。 + +論文: [Zhihao Jia、Matei Zaharia、Alex Aikenによる "Deep Neural Networksのデータとモデルの並列化を超えて"](https://arxiv.org/abs/1807.05358) + +FlexFlowは、サンプル-オペレータ-属性-パラメータの4D並列化を行います。 + +1. サンプル = データ並列化(サンプル単位の並列化) +2. オペレータ = 単一の操作をいくつかのサブ操作に並列化 +3. 属性 = データ並列化(長さ方向の並列化) +4. パラメータ = モデル並列化(次元に関係なく、水平または垂直) + +例: +* サンプル + +シーケンス長512の10バッチを考えてみましょう。これらをサンプル次元で2つのデバイスに並列化すると、10 x 512が5 x 2 x 512になります。 + +* オペレータ + +層正規化を行う場合、まずstdを計算し、次にmeanを計算し、データを正規化できます。オペレータの並列化により、stdとmeanを並列に計算できます。したがって、オペレータ次元で2つのデバイス(cuda:0、cuda:1)に並列化すると、最初に入力データを両方のデバイスにコピーし、cuda:0でstdを計算し、cuda:1でmeanを同時に計算します。 + +* 属性 + +10バッチの512長があります。これらを属性次元で2つのデバイスに並列化すると、10 x 512が10 x 2 x 256になります。 + +* パラメータ + +これはテンソルモデルの並列化または単純な層ごとのモデルの並列化と似ています。 + +このフレームワークの重要性は、(1)GPU/TPU/CPU対(2)RAM/DRAM対(3)高速内部接続/低速外部接続などのリソースを取り、これらすべてをアルゴリズムによって自動的に最適化することです。どの並列化をどこで使用するかをアルゴリズム的に決定します。 + +非常に重要な側面の1つは、FlexFlowは静的で固定のワークロードを持つモデルのために設計されており、動的な動作を持つモデルはイテレーションごとに異なる並列化戦略を好む場合があることです。 + +したがって、このフレームワークの約束は非常に魅力的です。選択したクラスタで30分間のシミュレーションを実行し、この特定の環境を最適に利用するための最良の戦略を提供します。部分を追加/削除/置換すると、それに対して実行して再最適化プランを作成します。その後、トレーニングできます。異なるセットアップには独自の最適化があります。 + +🤗 Transformersの現在の状況: まだ統合されていません。すでに[transformers.utils.fx](https://github.com/huggingface/transformers/blob/master/src/transformers/utils/fx.py)を使用してモデルがFXトレース可能であるため、FlexFlowを動作させるために必要な手順を誰かが見つける必要があります。 + +## Which Strategy To Use When + +ここでは、どの並列化戦略をいつ使用するかの非常におおまかなアウトラインを示します。各リストの最初が通常よりも速いことが一般的です。 + +**⇨ 単一GPU** + +* モデルが単一GPUに収まる場合: + + 1. 通常の使用 + +* モデルが単一GPUに収まらない場合: + + 1. ZeRO + CPUをオフロードし、オプションでNVMeをオフロード + 2. 上記に加えて、最大のレイヤーが単一GPUに収まらない場合、[Memory Centric Tiling](https://deepspeed.readthedocs.io/en/latest/zero3.html#memory-centric-tiling)(詳細は以下参照)を有効化 + +* 最大のレイヤーが単一GPUに収まらない場合: + + 1. ZeROを使用しない場合 - TPを有効化する必要があります。なぜなら、PPだけでは収めることができないからです。 + 2. ZeROを使用する場合は、上記の「単一GPU」のエントリと同じものを参照してください + +**⇨ 単一ノード/マルチGPU** + +* モデルが単一GPUに収まる場合: + + 1. DDP - 分散データ並列 + 2. ZeRO - 状況と使用される構成に依存して速いかどうかが異なることがあります + +* モデルが単一GPUに収まらない場合: + + 1. PP + 2. ZeRO + 3. TP + + 非常に高速なノード内接続がNVLINKまたはNVSwitchである場合、これらのすべてはほとんど同等の性能です。これらがない場合、PPはTPまたはZeROよりも速くなります。TPの度合いも違いを生じるかもしれません。特定のセットアップで勝者を見つけるために実験するのが最善です。 + + TPはほとんど常に単一ノード内で使用されます。つまり、TPサイズ <= ノードあたりのGPUです。 + +* 最大のレイヤーが単一GPUに収まらない場合: + + 1. ZeROを使用しない場合 - TPを使用する必要があります。なぜなら、PPだけでは収めることができないからです。 + 2. ZeROを使用する場合は、上記の「単一GPU」のエントリと同じものを参照してください + +**⇨ マルチノード/マルチGPU** + +* 高速なノード間接続がある場合: + + 1. ZeRO - モデルへのほとんどの変更が不要です + 2. PP+TP+DP - 通信が少なく、モデルに大規模な変更が必要です + +* 遅いノード間接続があり、GPUメモリが少ない場合: + + 1. 
DP+PP+TP+ZeRO-1 + diff --git a/docs/source/ja/perf_train_gpu_one.md b/docs/source/ja/perf_train_gpu_one.md new file mode 100644 index 00000000000000..2c2bc540e48384 --- /dev/null +++ b/docs/source/ja/perf_train_gpu_one.md @@ -0,0 +1,438 @@ + + +# Methods and tools for efficient training on a single GPU + +このガイドでは、メモリの利用効率を最適化し、トレーニングを高速化することで、モデルのトレーニング効率を向上させるために使用できる実用的なテクニックを紹介します。トレーニング中にGPUがどのように利用されるかを理解したい場合は、最初に「[モデルトレーニングの解剖学](model_memory_anatomy)」のコンセプトガイドを参照してください。このガイドは実用的なテクニックに焦点を当てています。 + + + +複数のGPUを搭載したマシンにアクセスできる場合、これらのアプローチは依然として有効です。さらに、[マルチGPUセクション](perf_train_gpu_many)で説明されている追加の方法を活用できます。 + + + +大規模なモデルをトレーニングする際、同時に考慮すべき2つの側面があります: + +* データのスループット/トレーニング時間 +* モデルのパフォーマンス + +スループット(サンプル/秒)を最大化することは、トレーニングコストを低減させます。これは一般的に、GPUをできるだけ効果的に活用し、GPUメモリを限界まで埋めることによって達成されます。希望するバッチサイズがGPUメモリの制限を超える場合、勾配蓄積などのメモリ最適化テクニックが役立ちます。 + +しかし、好みのバッチサイズがメモリに収まる場合、メモリを最適化するテクニックを適用する理由はありません。大きなバッチサイズを使用できるからといって、それを必ずしも使用すべきではありません。ハイパーパラメータの調整の一環として、どのバッチサイズが最良の結果を生み出すかを決定し、リソースを適切に最適化する必要があります。 + +このガイドでカバーされている方法とツールは、トレーニングプロセスに与える影響に基づいて分類できます: + + +| Method/tool | Improves training speed | Optimizes memory utilization | +|:-----------------------------------------------------------|:------------------------|:-----------------------------| +| [Batch size choice](#batch-size-choice) | Yes | Yes | +| [Gradient accumulation](#gradient-accumulation) | No | Yes | +| [Gradient checkpointing](#gradient-checkpointing) | No | Yes | +| [Mixed precision training](#mixed-precision-training) | Yes | (No) | +| [Optimizer choice](#optimizer-choice) | Yes | Yes | +| [Data preloading](#data-preloading) | Yes | No | +| [DeepSpeed Zero](#deepspeed-zero) | No | Yes | +| [torch.compile](#using-torchcompile) | Yes | No | + + + +**注意**: 小さなモデルと大きなバッチサイズを使用する場合、メモリの節約が行われますが、大きなモデルと小さなバッチサイズを使用する場合、メモリの使用量が増加します。 + + + +これらのテクニックは、[`Trainer`]でモデルをトレーニングしている場合や、純粋なPyTorchループを記述している場合の両方で利用できます。詳細な最適化の設定については、🤗 Accelerateを使用して[これらの最適化を設定できます](#using--accelerate)。 + +これらの方法が十分な利益をもたらさない場合、以下のオプションを検討できます: +* [効率的なソフトウェアプリビルドを備えたカスタムDockerコンテナの作成](#efficient-software-prebuilds) +* [Mixture of Experts(MoE)を使用するモデルを検討](#mixture-of-experts) +* [モデルをBetterTransformerに変換して、PyTorchネイティブのアテンションを活用](#using-pytorch-native-attention) + +最後に、これらの方法がまだ十分でない場合、A100などのサーバーグレードGPUに切り替えても、さらなる改善が必要かもしれません。これらのアプローチは、マルチGPUセットアップでも有効であり、[マルチGPUセクション](perf_train_gpu_many)で説明されている追加の並列化技術を活用できます。 + +## Batch size choice + +最適なパフォーマンスを実現するために、適切なバッチサイズを特定することから始めましょう。2^Nのサイズのバッチサイズと入力/出力ニューロン数を使用することが推奨されています。通常、これは8の倍数ですが、使用するハードウェアとモデルのデータ型に依存することがあります。 + +参考までに、NVIDIAの[入力/出力ニューロン数の推奨事項](https://docs.nvidia.com/deeplearning/performance/dl-performance-fully-connected/index.html#input-features)と[バッチサイズ](https://docs.nvidia.com/deeplearning/performance/dl-performance-fully-connected/index.html#batch-size)を確認してください(これらはGEMM(一般的な行列乗算)に関与します)。 + +[Tensor Core要件](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc)では、データ型とハードウェアに基づいて乗数が定義されています。たとえば、fp16データ型の場合、64の倍数を使用することが推奨されます(A100 GPUの場合を除く)。 + +小さなパラメータの場合、[次元量子化効果](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#dim-quantization)も考慮してください。これはタイリングが行われ、適切な乗数が大幅な高速化をもたらす場合があります。 + +## Gradient Accumulation + 
+**勾配蓄積**メソッドは、GPUのメモリ容量の制約によって課せられる制限を超えた効果的なバッチサイズを実現するために、勾配を小さな増分で計算することを目的としています。このアプローチでは、モデルを順方向および逆方向に小さなバッチで反復的に計算し、その過程で勾配を蓄積します。十分な数の勾配が蓄積されたら、モデルの最適化ステップを実行します。勾配蓄積を使用することで、GPUのメモリ容量による制約を超えて**効果的なバッチサイズ**を増やすことができますが、勾配蓄積によって導入される追加の順方向および逆方向の計算はトレーニングプロセスを遅くする可能性があることに注意が必要です。 + +`TrainingArguments`に`gradient_accumulation_steps`引数を追加することで、勾配蓄積を有効にすることができます: + +```py +training_args = TrainingArguments(per_device_train_batch_size=1, gradient_accumulation_steps=4, **default_args) +``` + +上記の例では、効果的なバッチサイズは4になります。 + +また、トレーニングループを完全に制御するために🤗 Accelerateを使用することもできます。🤗 Accelerateの例は、[このガイドの後半にある](#using--accelerate)で見つけることができます。 + +できるだけGPUの使用率を最大限にすることが推奨されていますが、高い勾配蓄積ステップ数はトレーニングの遅延をより顕著にすることがあります。以下の例を考えてみましょう。`per_device_train_batch_size=4`の場合、勾配蓄積を使用しないとGPUの制限に達します。バッチサイズ64でトレーニングしたい場合、`per_device_train_batch_size`を1に設定し、`gradient_accumulation_steps`を64に設定しないでください。代わりに、`per_device_train_batch_size=4`を保持し、`gradient_accumulation_steps=16`を設定します。これにより、同じ効果的なバッチサイズが得られ、利用可能なGPUリソースが効果的に活用されます。 + +詳細な情報については、[RTX-3090用のバッチサイズと勾配蓄積のベンチマーク](https://github.com/huggingface/transformers/issues/14608#issuecomment-1004392537)および[A100用のバッチサイズと勾配蓄積のベンチマーク](https://github.com/huggingface/transformers/issues/15026#issuecomment-1005033957)を参照してください。 + +## Gradient Checkpointing + +一部の大きなモデルは、バッチサイズを1に設定し、勾配蓄積を使用している場合でもメモリの問題に直面することがあります。これは、メモリストレージが必要な他のコンポーネントも存在するためです。 + +前向きパスからのすべてのアクティベーションを保存して、逆向きパスで勾配を計算すると、かなりのメモリオーバーヘッドが発生します。逆向きパスで必要なときにアクティベーションを破棄して再計算する代替アプローチは、計算オーバーヘッドが大幅に増加し、トレーニングプロセスが遅くなります。 + +**勾配チェックポイント**は、これらの2つのアプローチの折衷案を提供し、計算グラフ全体で戦略的に選択されたアクティベーションのみを保存するため、勾配を再計算する必要があるアクティベーションの一部だけを節約します。勾配チェックポイントの詳細については、[この素晴らしい記事](https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9)を参照してください。 + +[`Trainer`]で勾配チェックポイントを有効にするには、[`TrainingArguments`]に対応するフラグを渡します: + + +```py +training_args = TrainingArguments( + per_device_train_batch_size=1, gradient_accumulation_steps=4, gradient_checkpointing=True, **default_args +) +``` + +代替手段として、🤗 Accelerateを使用することもできます - 🤗 Accelerateの例は[このガイドのさらに後ろにあります](#using--accelerate)。 + + + +勾配チェックポイントを使用することでメモリ効率が向上する場合がありますが、トレーニング速度は約20%遅くなることに注意してください。 + + + +## Mixed precision training + +**混合精度トレーニング**は、モデルのトレーニングの計算効率を最適化する技術で、特定の変数に対して低精度の数値フォーマットを利用します。従来、ほとんどのモデルは変数を表現し処理するために32ビット浮動小数点精度(fp32またはfloat32)を使用しています。しかし、すべての変数が正確な結果を得るためにこの高精度のレベルを必要としない場合があります。一部の変数の精度を16ビット浮動小数点(fp16またはfloat16)などのより低い数値フォーマットに変更することで、計算を高速化できます。このアプローチでは、一部の計算は半精度で行われ、一部はまだ完全な精度で行われるため、このアプローチは混合精度トレーニングと呼ばれています。 + +最も一般的に混合精度トレーニングは、fp16(float16)データ型を使用して実現されますが、一部のGPUアーキテクチャ(アンペアアーキテクチャなど)ではbf16およびtf32(CUDA内部データ型)データ型も提供されています。これらのデータ型の違いについて詳しく知りたい場合は、[NVIDIAのブログ](https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/)を確認してください。 + +### fp16 + +混合精度トレーニングの主な利点は、半精度(fp16)でアクティベーションを保存することから得られます。 +勾配も半精度で計算されますが、最適化ステップでは再び完全精度に変換されるため、ここではメモリは保存されません。 +混合精度トレーニングは計算速度を向上させる一方、特に小さなバッチサイズの場合、より多くのGPUメモリを使用することがあります。 +これは、モデルがGPU上に16ビットおよび32ビット精度の両方で存在するためです(GPU上の元のモデルの1.5倍)。 + +混合精度トレーニングを有効にするには、`fp16`フラグを`True`に設定します: + +```py +training_args = TrainingArguments(per_device_train_batch_size=4, fp16=True, **default_args) +``` + +🤗 Accelerateを使用する場合、🤗 Accelerateの例は[このガイドのさらに後ろにあります](#using--accelerate)。 + +### BF16 + +Ampereまたはそれ以降のハードウェアにアクセスできる場合、混合精度トレーニングと評価にbf16を使用できます。bf16はfp16よりも精度が劣りますが、はるかに大きな動的範囲を持っています。fp16では、持つことができる最大の数は `65535` であり、それを超える数値はオーバーフローを引き起こします。一方、bf16の数値は `3.39e+38` のように大きく、これはfp32とほぼ同じです - どちらも数値範囲に8ビットを使用しているためです。 + +BF16を有効にするには、🤗 Trainerで以下のように設定します: + + +```python 
+training_args = TrainingArguments(bf16=True, **default_args) +``` + + +### TF32 + +アンペアハードウェアは、tf32という特別なデータ型を使用します。これは、fp32と同じ数値範囲(8ビット)を持っていますが、23ビットの精度ではなく、10ビットの精度(fp16と同じ)を持ち、合計で19ビットしか使用しません。これは通常のfp32トレーニングおよび推論コードを使用し、tf32サポートを有効にすることで、最大3倍のスループットの向上が得られる点で「魔法のよう」です。行う必要があるのは、次のコードを追加するだけです: + +```python +import torch +torch.backends.cuda.matmul.allow_tf32 = True +torch.backends.cudnn.allow_tf32 = True +``` + + +使用されているGPUがアンペアシリーズであると仮定し、CUDAは可能な限りtf32を使用するように自動的に切り替えます。 + +[NVIDIAの研究によれば](https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/)、ほとんどの機械学習トレーニングワークロードはtf32トレーニングとfp32トレーニングで同じ難解度と収束を示します。すでにfp16またはbf16混合精度を使用している場合、スループットの向上に役立つこともあります。 + +🤗 Trainerでこのモードを有効にすることができます: + + +```python +TrainingArguments(tf32=True, **default_args) +``` + + + +tf32は`tensor.to(dtype=torch.tf32)`を介して直接アクセスできません。これは内部のCUDAデータ型です。tf32データ型を使用するには、`torch>=1.7`が必要です。 + + + +tf32と他の精度に関する詳細な情報については、以下のベンチマークを参照してください: +[RTX-3090](https://github.com/huggingface/transformers/issues/14608#issuecomment-1004390803)および +[A100](https://github.com/huggingface/transformers/issues/15026#issuecomment-1004543189)。 + +## Flash Attention 2 + +transformersでFlash Attention 2統合を使用することで、トレーニングのスループットを向上させることができます。Flash Attention 2モジュールを含むモデルの読み込み方法については、[single GPU section](./perf_infer_gpu_one#Flash-Attention-2)の適切なセクションを確認して詳細を学びましょう。 + +## オプティマイザの選択 + +Transformerモデルをトレーニングするために最も一般的に使用されるオプティマイザはAdamまたはAdamW(重み減衰を伴うAdam)です。Adamは前回の勾配の移動平均を保存することで収束を達成しますが、モデルパラメータの数のオーダーの追加メモリフットプリントを追加します。これを解消するために、代替オプティマイザを使用できます。たとえば、[NVIDIA/apex](https://github.com/NVIDIA/apex)がインストールされている場合、`adamw_apex_fused`はすべてのサポートされているAdamWオプティマイザの中で最も高速なトレーニング体験を提供します。 + +[`Trainer`]は、直接使用できるさまざまなオプティマイザを統合しており、`adamw_hf`、`adamw_torch`、`adamw_torch_fused`、`adamw_apex_fused`、`adamw_anyprecision`、`adafactor`、または`adamw_bnb_8bit`が含まれています。サードパーティの実装を介してさらに多くのオプティマイザを追加できます。 + +AdamWオプティマイザの代替手段について詳しく見てみましょう: +1. [`Trainer`]で使用可能な`adafactor` +2. 
Trainerで使用可能な`adamw_bnb_8bit`は、デモンストレーション用に以下でサードパーティの統合が提供されています。 + +比較のため、3Bパラメータモデル(例:「google-t5/t5-3b」)の場合: +* 標準のAdamWオプティマイザは、各パラメータに8バイトを使用するため、24GBのGPUメモリが必要です(8 * 3 => 24GB)。 +* Adafactorオプティマイザは12GB以上必要です。各パラメータにわずか4バイト以上を使用するため、4 * 3と少し余分になります。 +* 8ビットのBNB量子化オプティマイザは、すべてのオプティマイザの状態が量子化されている場合、わずか6GBしか使用しません。 + +### Adafactor + +Adafactorは、重み行列の各要素のために前回の平均を保存しません。代わりに、(行ごとと列ごとの平均の合計など)集 + + +```py +training_args = TrainingArguments(per_device_train_batch_size=4, optim="adafactor", **default_args) +``` + + +他のアプローチ(勾配蓄積、勾配チェックポイント、混合精度トレーニング)と組み合わせることで、スループットを維持しながら最大3倍の向上が見られることがあります!ただし、前述のように、Adafactorの収束性はAdamよりも悪いことがあります。 + +### 8ビット Adam + +Adafactorのようにオプティマイザの状態を集約する代わりに、8ビットのAdamは完全な状態を保持し、それを量子化します。量子化とは、状態を低い精度で保存し、最適化のためだけに非量子化することを意味します。これは混合精度トレーニングの背後にあるアイデアと似ています。 + +`adamw_bnb_8bit`を使用するには、単に[`TrainingArguments`]で`optim="adamw_bnb_8bit"`を設定するだけです: + + +```py +training_args = TrainingArguments(per_device_train_batch_size=4, optim="adamw_bnb_8bit", **default_args) +``` + +ただし、デモンストレーション目的で8ビットオプティマイザをサードパーティの実装を使用することもできます。これを統合する方法を確認するためです。 + +まず、8ビットAdamオプティマイザを実装した`bitsandbytes`ライブラリをインストールするために、GitHub [リポジトリ](https://github.com/TimDettmers/bitsandbytes)内のインストールガイドに従ってください。 + +次に、オプティマイザを初期化する必要があります。これには2つのステップが含まれます: +* まず、モデルのパラメータを2つのグループに分けます - 重み減衰を適用するべきグループと、適用すべきでないグループです。通常、バイアスとレイヤー正規化パラメータは重み減衰されません。 +* 次に、以前に使用したAdamWオプティマイザと同じパラメータを使用するために、いくつかの引数の調整を行います。 + + +```py +import bitsandbytes as bnb +from torch import nn +from transformers.trainer_pt_utils import get_parameter_names + +training_args = TrainingArguments(per_device_train_batch_size=4, **default_args) + +decay_parameters = get_parameter_names(model, [nn.LayerNorm]) +decay_parameters = [name for name in decay_parameters if "bias" not in name] +optimizer_grouped_parameters = [ + { + "params": [p for n, p in model.named_parameters() if n in decay_parameters], + "weight_decay": training_args.weight_decay, + }, + { + "params": [p for n, p in model.named_parameters() if n not in decay_parameters], + "weight_decay": 0.0, + }, +] + +optimizer_kwargs = { + "betas": (training_args.adam_beta1, training_args.adam_beta2), + "eps": training_args.adam_epsilon, +} +optimizer_kwargs["lr"] = training_args.learning_rate +adam_bnb_optim = bnb.optim.Adam8bit( + optimizer_grouped_parameters, + betas=(training_args.adam_beta1, training_args.adam_beta2), + eps=training_args.adam_epsilon, + lr=training_args.learning_rate, +) +``` + +最後に、カスタムオプティマイザを`Trainer`に引数として渡します: + + +```py +trainer = Trainer(model=model, args=training_args, train_dataset=ds, optimizers=(adam_bnb_optim, None)) +``` + +他のアプローチ(勾配蓄積、勾配チェックポイント、混合精度トレーニング)と組み合わせることで、Adafactorの使用と同等以上の3倍のメモリ改善およびわずかに高いスループットを期待できます。 + +### multi_tensor + +pytorch-nightlyは、多くの小さな特徴テンソルがある状況のオプティマイザを大幅に高速化するはずの`torch.optim._multi_tensor`を導入しました。これは最終的にはデフォルトになるはずですが、それを早く試してみたい場合は、このGitHub [issue](https://github.com/huggingface/transformers/issues/9965)をご覧ください。 + +## データの事前読み込み + +優れたトレーニング速度に到達するための重要な要件の1つは、GPUが処理できる最大速度でデータを供給できる能力です。デフォルトではすべてがメインプロセスで行われ、データをディスクから十分速く読み取ることができない場合、GPUのアンダーユーティリゼーションを引き起こすボトルネックが発生する可能性があります。ボトルネックを減らすために、以下の引数を設定します: + +- `DataLoader(pin_memory=True, ...)` - データをCPUのピンメモリに事前読み込みし、通常、CPUからGPUメモリへの転送がはるかに高速化されます。 +- `DataLoader(num_workers=4, ...)` - データをより速く事前読み込みするために複数のワーカーを生成します。トレーニング中にGPUの利用状況の統計情報を確認し、100%から遠い場合、ワーカーの数を増やす実験を行ってください。もちろん、問題は他の場所にあるかもしれませんので、多くのワーカーが必ずしも性能向上につながるわけではありません。 + +[`Trainer`]を使用する場合、対応する[`TrainingArguments`]は`dataloader_pin_memory`(デフォルトでは`True`)および`dataloader_num_workers`(デフォルトは`0`)です。 + 
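+
+[`Trainer`]でこれらを設定する場合の最小限のスケッチは次のとおりです(このガイドの他の例と同様に `default_args` を使用しており、ワーカー数はあくまで一例です):
+
+```py
+training_args = TrainingArguments(
+    dataloader_pin_memory=True,  # データをCPUのピンメモリに事前読み込み(デフォルトで True)
+    dataloader_num_workers=4,    # データ読み込み用のワーカープロセス数(デフォルトは 0)
+    **default_args,
+)
+```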
+## DeepSpeed ZeRO + +DeepSpeedは、🤗 Transformersと🤗 Accelerateと統合されたオープンソースのディープラーニング最適化ライブラリです。 +大規模なディープラーニングトレーニングの効率とスケーラビリティを向上させるために設計されたさまざまな機能と最適化を提供します。 + +モデルが単一のGPUに収まり、小さなバッチサイズを収めるスペースがある場合、DeepSpeedを使用する必要はありません。それはむしろ遅くなります。ただし、モデルが単一のGPUに収まらない場合、または小さなバッチを収めることができない場合、DeepSpeed ZeRO + CPU OffloadまたはNVMe Offloadを利用できます。この場合、[ライブラリを別途インストール](main_classes/deepspeed#installation)し、設定ファイルを作成し、DeepSpeedを起動するためのガイドをフォローする必要があります: + +* [`Trainer`]とのDeepSpeed統合の詳細ガイドについては、[該当するドキュメンテーション](main_classes/deepspeed)を確認してください。特に、[単一GPU用のデプロイメント](main_classes/deepspeed#deployment-with-one-gpu)に関するセクションです。DeepSpeedをノートブックで使用するにはいくつかの調整が必要ですので、[該当するガイド](main_classes/deepspeed#deployment-in-notebooks)もご覧ください。 +* 🤗 Accelerateを使用する場合は、[🤗 Accelerate DeepSpeedガイド](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed)を参照してください。 + +## torch.compileの使用 + +PyTorch 2.0は新しいコンパイル関数を導入しました。これは既存のPyTorchコードを変更する必要はありませんが、1行のコードを追加することでコードを最適化できます:`model = torch.compile(model)`。 + +[`Trainer`]を使用する場合、[`TrainingArguments`]内の`torch_compile`オプションを渡すだけです: + + +```python +training_args = TrainingArguments(torch_compile=True, **default_args) +``` + +`torch.compile`は、既存のPyTorchプログラムからグラフを自動的に作成するためにPythonのフレーム評価APIを使用します。グラフをキャプチャした後、異なるバックエンドを展開して最適化されたエンジンに変換できます。 +詳細およびベンチマークについては、[PyTorchドキュメント](https://pytorch.org/get-started/pytorch-2.0/)を参照してください。 + +`torch.compile`には、オプションの依存関係を持つ成長中のバックエンドのリストがあり、`torchdynamo.list_backends()`を呼び出して確認できます。最も一般的に使用される一部のバックエンドは次のとおりです。 + +**デバッグ用バックエンド**: +* `dynamo.optimize("eager")` - 抽出されたGraphModuleを実行するためにPyTorchを使用します。これはTorchDynamoの問題をデバッグする際に非常に役立ちます。 +* `dynamo.optimize("aot_eager")` - コンパイラーを使用しないAotAutogradを使用してAotAutogradの抽出されたフォワードおよびバックワードグラフに対して単にPyTorch eagerを使用します。これはデバッグに役立ち、高速化は期待できません。 + +**トレーニングおよび推論バックエンド**: +* `dynamo.optimize("inductor")` - TorchInductorバックエンドを使用し、AotAutogradおよびcudagraphsを活用してコード生成されたTritonカーネルを使用します [詳細はこちら](https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747) +* `dynamo.optimize("nvfuser")` - nvFuser with TorchScriptを使用します。 [詳細はこちら](https://dev-discuss.pytorch.org/t/tracing-with-primitives-update-1-nvfuser-and-its-primitives/593) +* `dynamo.optimize("aot_nvfuser")` - nvFuser with AotAutogradを使用します。 [詳細はこちら](https://dev-discuss.pytorch.org/t/tracing-with-primitives-update-1-nvfuser-and-its-primitives/593) +* `dynamo.optimize("aot_cudagraphs")` - AotAutogradを使用してcudagraphsを使用します。 [詳細はこちら](https://github.com/pytorch/torchdynamo/pull/757) + +**推論専用バックエンド**: +* `dynamo.optimize("ofi")` - Torchscriptの`optimize_for_inference`を使用します。 [詳細はこちら](https://pytorch.org/docs/stable/generated/torch.jit.optimize_for_inference.html) +* `dynamo.optimize("fx2trt")` - Nvidia TensorRTを使用した推論の最適化にNvidia TensorRTを使用します。 [詳細はこちら](https://pytorch.org/TensorRT/tutorials/getting_started_with_fx_path.html) +* `dynamo.optimize("onnxrt")` - CPU/GPUでの推論にONNX Runtimeを使用します。 [詳細はこちら](https://onnxruntime.ai/) +* `dynamo.optimize("ipex")` - CPUでの推論にIPEXを使用します。 [詳細はこちら](https://github.com/intel/intel-extension-for-pytorch) + +🤗 Transformersを使用した`torch.compile`の使用例については、この[ブログ記事](https://www.philschmid.de/getting-started-pytorch-2-0-transformers)をご覧ください。 + +## Using 🤗 Accelerate + +[🤗 Accelerate](https://huggingface.co/docs/accelerate/index)を使用すると、上記の方法を使用しながらトレーニングループを完全に制御でき、基本的には純粋なPyTorchでループを書くことができます。 + +次に、[`TrainingArguments`]内で方法を組み合わせた場合を想 + +```py +training_args = TrainingArguments( + per_device_train_batch_size=1, + gradient_accumulation_steps=4, + gradient_checkpointing=True, + 
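+    # 続く fp16=True で混合精度も有効化し、ここまでに紹介した手法を 1 つの設定にまとめています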
fp16=True, + **default_args, +) +``` + +🤗 Accelerateを使用した完全なトレーニングループの例は、ほんの数行のコードです: + +```py +from accelerate import Accelerator +from torch.utils.data.dataloader import DataLoader + +dataloader = DataLoader(ds, batch_size=training_args.per_device_train_batch_size) + +if training_args.gradient_checkpointing: + model.gradient_checkpointing_enable() + +accelerator = Accelerator(fp16=training_args.fp16) +model, optimizer, dataloader = accelerator.prepare(model, adam_bnb_optim, dataloader) + +model.train() +for step, batch in enumerate(dataloader, start=1): + loss = model(**batch).loss + loss = loss / training_args.gradient_accumulation_steps + accelerator.backward(loss) + if step % training_args.gradient_accumulation_steps == 0: + optimizer.step() + optimizer.zero_grad() +``` + +まず、データセットを[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)でラップします。 +次に、モデルの[`~PreTrainedModel.gradient_checkpointing_enable`]メソッドを呼び出すことで勾配チェックポイントを有効にできます。 +[`Accelerator`](https://huggingface.co/docs/accelerate/package_reference/accelerator#accelerate.Accelerator)を初期化する際に、混合精度トレーニングを使用するかどうかを[`prepare`](https://huggingface.co/docs/accelerate/package_reference/accelerator#accelerate.Accelerator.prepare)の呼び出しで指定し、複数のGPUを使用する場合、`prepare`の間にデータローダーもワーカー間で分散されます。同じ[8ビットオプティマイザ](#8-bit-adam)を前の例から使用します。 + +最後に、主要なトレーニングループを追加できます。`backward`の呼び出しは🤗 Accelerateによって処理されることに注意してください。また、勾配の蓄積がどのように機能するかも確認できます。損失を正規化しているため、蓄積の最後に平均を得て、十分なステップがあると最適化が実行されます。 + +これらの最適化技術を🤗 Accelerateを使用して実装するのは、わずかなコード行で行うことができ、トレーニングループの柔軟性が向上します。すべての機能の詳細については、[Accelerateのドキュメント](https://huggingface.co/docs/accelerate/index)を参照してください。 + +## Efficient Software Prebuilds + +PyTorchの[pipとcondaビルド](https://pytorch.org/get-started/locally/#start-locally)は、PyTorchを実行するのに十分なcudaツールキットで事前にビルドされていますが、cuda拡張をビルドする必要がある場合には不十分です。 + +時折、追加の努力が必要な場合があります。たとえば、事前にコンパイルされていない`apex`などのライブラリを使用している場合です。また、システム全体で適切なcudaツールキットをインストールする方法を見つけることが難しい場合もあります。 +これらのシナリオに対処するために、PyTorchとNVIDIAはcuda拡張がすでに事前にビルドされているNGC dockerコンテナの新しいバージョンをリリースしました。プログラムをインストールするだけで、そのまま実行できます。 + +このアプローチは、PyTorchのソースを調整したり、新しいカスタマイズされたビルドを作成したりしたい場合にも役立ちます。 +欲しいdockerイメージバージョンを見つけるには、まず[PyTorchのリリースノート](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/)から始め、最新の月次リリースのいずれかを選択します。希望のリリースのリリースノートに移動し、環境のコンポーネントが必要なものと一致していることを確認します(NVIDIA Driverの要件も含む!)、その文書の一番上に行き、対応するNGCページに移動します。なぜかわからない場合は、[すべてのPyTorch NGCイメージのインデックス](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)です。 + +次に、dockerイメージをダウンロードして展開する手順に従います。 + +## Mixture of Experts + +最近の論文によれば、Transformerモデルに専門家の混合(MoE)を統合することで、トレーニング速度が4〜5倍向上し、推論も高速化されることが報告されています。 + +より多くのパラメータがより良いパフォーマンスにつながることがわかっているため、この技術はトレーニングコストを増やすことなくパラメータの数を桁違いに増やすことを可能にします。 + +このアプローチでは、他のFFN層の代わりにMoE層が配置され、各専門家をトークンの位置に応じてバランスよくトレーニングするゲート関数で構成されます。 + +![MoE Transformer 2x block](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/perf-moe-transformer.png) + +(出典: [GLAM](https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html)) + +このアプローチの主な欠点は、GPUメモリをほぼ桁違いに多く必要とすることです。メモリ要件がはるかに大きいことがそのまま反映されます。より高いメモリ要件を克服する方法については、さまざまな蒸留およびアプローチが提案されています。 + +ただし、直接のトレードオフがあります。数人の専門家を使用してベースモデルを2〜3倍小さくすることで、5倍小さなモデルにし、トレーニング速度を適度に向上させ、メモリ要件を適度に増やすことができます。 + +関連するほとんどの論文および実装はTensorflow/TPUを中心に構築されています。 + +- [GShard: Conditional Computation and Automatic Shardingを活用した巨大モデルのスケーリング](https://arxiv.org/abs/2006.16668) +- [Switch Transformers: シンプルで効率的なスパース性を備えたトリリオンパラメータモデルへのスケーリング](https://arxiv.org/abs/2101.03961) +- [GLaM: Generalist Language Model 
(GLaM)](https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html) + +PytorchにはDeepSpeedが構築したものもあります: [DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale](https://arxiv.org/abs/2201.05596)、[Mixture of Experts](https://www.deepspeed.ai/tutorials/mixture-of-experts/) - ブログ記事: [1](https://www.microsoft.com/en-us/research/blog/deepspeed-powers-8x-larger-moe-model-training-with-high-performance/)、[2](https://www.microsoft.com/en-us/research/publication/scalable-and-efficient-moe-training-for-multitask-multilingual-models/)、大規模なTransformerベースの自然言語生成モデルの具体的な展開については、[ブログ記事](https://www.deepspeed.ai/2021/12/09/deepspeed-moe-nlg.html)、[Megatron-Deepspeedブランチ](https://github.com/microsoft/Megatron-DeepSpeed/tree/moe-training)を参照してください。 + + +## PyTorchネイティブアテンションとFlash Attentionの使用 + +PyTorch 2.0では、ネイティブの[`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html)(SDPA)がリリースされ、[メモリ効率の高いアテンション](https://arxiv.org/abs/2112.05682)や[フラッシュアテンション](https://arxiv.org/abs/2205.14135)などの融合されたGPUカーネルの使用を可能にします。 + +[`optimum`](https://github.com/huggingface/optimum)パッケージをインストールした後、関連する内部モジュールを置き換えて、PyTorchのネイティブアテンションを使用できます。以下のように設定します: + + +```python +model = model.to_bettertransformer() +``` + +変換後、通常通りモデルをトレーニングしてください。 + + + +PyTorchネイティブの`scaled_dot_product_attention`演算子は、`attention_mask`が提供されていない場合にのみFlash Attentionにディスパッチできます。 + +デフォルトでは、トレーニングモードでBetterTransformer統合はマスクサポートを削除し、バッチトレーニングにパディングマスクが必要ないトレーニングにしか使用できません。これは、例えばマスク言語モデリングや因果言語モデリングのような、バッチトレーニングにパディングマスクが不要なトレーニングの場合に該当します。BetterTransformerはパディングマスクが必要なタスクに対するモデルの微調整には適していません。 + + + +SDPAを使用したアクセラレーションとメモリの節約について詳しく知りたい場合は、この[ブログ記事](https://pytorch.org/blog/out-of-the-box-acceleration/)をチェックしてください。 diff --git a/docs/source/ja/perf_train_special.md b/docs/source/ja/perf_train_special.md new file mode 100644 index 00000000000000..080ff66f4cf562 --- /dev/null +++ b/docs/source/ja/perf_train_special.md @@ -0,0 +1,24 @@ + + +# Training on Specialized Hardware + + + +注意: [単一GPUセクション](perf_train_gpu_one)で紹介されたほとんどの戦略(混合精度トレーニングや勾配蓄積など)および[マルチGPUセクション](perf_train_gpu_many)は一般的なトレーニングモデルに適用される汎用的なものですので、このセクションに入る前にそれを確認してください。 + + + +このドキュメントは、専用ハードウェアでトレーニングする方法に関する情報を近日中に追加予定です。 diff --git a/docs/source/ja/perf_train_tpu.md b/docs/source/ja/perf_train_tpu.md new file mode 100644 index 00000000000000..aadd588ae84d35 --- /dev/null +++ b/docs/source/ja/perf_train_tpu.md @@ -0,0 +1,24 @@ + + +# Training on TPUs + + + + 注意: [シングルGPUセクション](perf_train_gpu_one)で紹介されているほとんどの戦略(混合精度トレーニングや勾配蓄積など)および[マルチGPUセクション](perf_train_gpu_many)は一般的なモデルのトレーニングに適用できますので、このセクションに入る前にそれを確認してください。 + + + +このドキュメントは、TPUでのトレーニング方法に関する情報をまもなく追加いたします。 diff --git a/docs/source/ja/perf_train_tpu_tf.md b/docs/source/ja/perf_train_tpu_tf.md new file mode 100644 index 00000000000000..3ffe88267cddeb --- /dev/null +++ b/docs/source/ja/perf_train_tpu_tf.md @@ -0,0 +1,168 @@ + + +# Training on TPU with TensorFlow + + + +詳細な説明が不要で、単にTPUのコードサンプルを入手してトレーニングを開始したい場合は、[私たちのTPUの例のノートブックをチェックしてください!](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb) + + + +### What is a TPU? 
+ +TPUは**Tensor Processing Unit(テンソル処理ユニット)**の略です。これらはGoogleが設計したハードウェアで、ニューラルネットワーク内のテンソル計算を大幅に高速化するために使用されます。これはGPUのようなものです。ネットワークのトレーニングと推論の両方に使用できます。一般的にはGoogleのクラウドサービスを介してアクセスされますが、Google ColabとKaggle Kernelsを通じても無料で小規模のTPUに直接アクセスできます。 + +[🤗 TransformersのすべてのTensorFlowモデルはKerasモデルです](https://huggingface.co/blog/tensorflow-philosophy)ので、この文書のほとんどの方法は一般的にKerasモデル用のTPUトレーニングに適用できます!ただし、TransformersとDatasetsのHuggingFaceエコシステム(hug-o-system?)に固有のポイントもいくつかあり、それについては適用するときにそれを示します。 + +### What kinds of TPU are available? + +新しいユーザーは、さまざまなTPUとそのアクセス方法に関する幅広い情報によく混乱します。理解するための最初の重要な違いは、**TPUノード**と**TPU VM**の違いです。 + +**TPUノード**を使用すると、事実上リモートのTPUに間接的にアクセスします。別個のVMが必要で、ネットワークとデータパイプラインを初期化し、それらをリモートノードに転送します。Google ColabでTPUを使用すると、**TPUノード**スタイルでアクセスしています。 + +TPUノードを使用すると、それに慣れていない人々にはかなり予期しない動作が発生することがあります!特に、TPUはPythonコードを実行しているマシンと物理的に異なるシステムに配置されているため、データはローカルマシンにローカルで格納されているデータパイプラインが完全に失敗します。代わりに、データはGoogle Cloud Storageに格納する必要があります。ここでデータパイプラインはリモートのTPUノードで実行されている場合でも、データにアクセスできます。 + + + +すべてのデータを`np.ndarray`または`tf.Tensor`としてメモリに収めることができる場合、ColabまたはTPUノードを使用している場合でも、データをGoogle Cloud Storageにアップロードせずに`fit()`でトレーニングできます。 + + + + + +**🤗 Hugging Face固有のヒント🤗:** TFコードの例でよく見るであろう`Dataset.to_tf_dataset()`とその高レベルのラッパーである`model.prepare_tf_dataset()`は、TPUノードで失敗します。これは、`tf.data.Dataset`を作成しているにもかかわらず、それが「純粋な」`tf.data`パイプラインではなく、`tf.numpy_function`または`Dataset.from_generator()`を使用して基盤となるHuggingFace `Dataset`からデータをストリームで読み込むことからです。このHuggingFace `Dataset`はローカルディスク上のデータをバックアップしており、リモートTPUノードが読み取ることができないためです。 + + + +TPUにアクセスする第二の方法は、**TPU VM**を介してです。TPU VMを使用する場合、TPUが接続されているマシンに直接接続します。これはGPU VMでトレーニングを行うのと同様です。TPU VMは一般的にデータパイプラインに関しては特に作業がしやすく、上記のすべての警告はTPU VMには適用されません! + +これは主観的な文書ですので、こちらの意見です:**可能な限りTPUノードの使用を避けてください。** TPU VMよりも混乱しやすく、デバッグが難しいです。将来的にはサポートされなくなる可能性もあります - Googleの最新のTPUであるTPUv4は、TPU VMとしてのみアクセスできるため、TPUノードは将来的には「レガシー」のアクセス方法になる可能性が高いです。ただし、無料でTPUにアクセスできるのはColabとKaggle Kernelsの場合があります。その場合、どうしても使用しなければならない場合の取り扱い方法を説明しようとします!詳細は[TPUの例のノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb)で詳細な説明を確認してください。 + +### What sizes of TPU are available? + +単一のTPU(v2-8/v3-8/v4-8)は8つのレプリカを実行します。TPUは数百から数千のレプリカを同時に実行できる**ポッド**に存在します。単一のTPUよりも多くのTPUを使用するが、ポッド全体ではない場合(たとえばv3-32)、TPUフリートは**ポッドスライス**として参照されます。 + +Colabを介して無料のTPUにアクセスする場合、通常は単一のv2-8 TPUが提供されます。 + + +### I keep hearing about this XLA thing. What’s XLA, and how does it relate to TPUs? + +XLAは、TensorFlowとJAXの両方で使用される最適化コンパイラです。JAXでは唯一のコンパイラであり、TensorFlowではオプションですが(しかしTPUでは必須です!)、Kerasモデルをトレーニングする際に`model.compile()`に引数`jit_compile=True`を渡すことで最も簡単に有効にできます。エラーが発生せず、パフォーマンスが良好であれば、それはTPUに移行する準備が整った良い兆候です! + +TPU上でのデバッグは一般的にCPU/GPUよりも少し難しいため、TPUで試す前にまずCPU/GPUでXLAを使用してコードを実行することをお勧めします。もちろん、長時間トレーニングする必要はありません。モデルとデータパイプラインが期待通りに動作するかを確認するための数ステップだけです。 + + + +XLAコンパイルされたコードは通常高速です。したがって、TPUで実行する予定がない場合でも、`jit_compile=True`を追加することでパフォーマンスを向上させることができます。ただし、以下のXLA互換性に関する注意事項に注意してください! + + + + + +**苦い経験から生まれたヒント:** `jit_compile=True`を使用することは、CPU/GPUコードがXLA互換であることを確認し、速度を向上させる良い方法ですが、実際にTPUでコードを実行する際には多くの問題を引き起こす可能性があります。 XLAコンパイルはTPU上で暗黙的に行われるため、実際にコードをTPUで実行する前にその行を削除することを忘れないでください! + + + +### How do I make my model XLA compatible? + +多くの場合、コードはすでにXLA互換かもしれません!ただし、XLAでは動作する通常のTensorFlowでも動作しないいくつかの要素があります。以下に、3つの主要なルールにまとめています: + + + +**🤗 HuggingFace固有のヒント🤗:** TensorFlowモデルと損失関数をXLA互換に書き直すために多くの努力を払っています。通常、モデルと損失関数はデフォルトでルール#1と#2に従っているため、`transformers`モデルを使用している場合はこれらをスキップできます。ただし、独自のモデルと損失関数を記述する場合は、これらのルールを忘れないでください! 
+ + + +#### XLA Rule #1: Your code cannot have “data-dependent conditionals” + +これは、任意の`if`ステートメントが`tf.Tensor`内の値に依存していない必要があることを意味します。例えば、次のコードブロックはXLAでコンパイルできません! + +```python +if tf.reduce_sum(tensor) > 10: + tensor = tensor / 2.0 +``` + +これは最初は非常に制限的に思えるかもしれませんが、ほとんどのニューラルネットコードはこれを行う必要はありません。通常、この制約を回避するために`tf.cond`を使用するか(ドキュメントはこちらを参照)、条件を削除して代わりに指示変数を使用したりすることができます。次のように: + +```python +sum_over_10 = tf.cast(tf.reduce_sum(tensor) > 10, tf.float32) +tensor = tensor / (1.0 + sum_over_10) +``` + +このコードは、上記のコードとまったく同じ効果を持っていますが、条件を回避することで、XLAで問題なくコンパイルできることを確認します! + +#### XLA Rule #2: Your code cannot have “data-dependent shapes” + +これは、コード内のすべての `tf.Tensor` オブジェクトの形状が、その値に依存しないことを意味します。たとえば、`tf.unique` 関数はXLAでコンパイルできないので、このルールに違反します。なぜなら、これは入力 `Tensor` の一意の値の各インスタンスを含む `tensor` を返すためです。この出力の形状は、入力 `Tensor` の重複具合によって異なるため、XLAはそれを処理しないことになります! + +一般的に、ほとんどのニューラルネットワークコードはデフォルトでルール#2に従います。ただし、いくつかの一般的なケースでは問題が発生することがあります。非常に一般的なケースの1つは、**ラベルマスキング**を使用する場合です。ラベルを無視して損失を計算する場所を示すために、ラベルを負の値に設定する方法です。NumPyまたはPyTorchのラベルマスキングをサポートする損失関数を見ると、次のような[ブールインデックス](https://numpy.org/doc/stable/user/basics.indexing.html#boolean-array-indexing)を使用したコードがよく見られます: + + +```python +label_mask = labels >= 0 +masked_outputs = outputs[label_mask] +masked_labels = labels[label_mask] +loss = compute_loss(masked_outputs, masked_labels) +mean_loss = torch.mean(loss) +``` + +このコードはNumPyやPyTorchでは完全に機能しますが、XLAでは動作しません!なぜなら、`masked_outputs`と`masked_labels`の形状はマスクされた位置の数に依存するため、これは**データ依存の形状**になります。ただし、ルール#1と同様に、このコードを書き直して、データ依存の形状なしでまったく同じ出力を生成できることがあります。 + + +```python +label_mask = tf.cast(labels >= 0, tf.float32) +loss = compute_loss(outputs, labels) +loss = loss * label_mask # Set negative label positions to 0 +mean_loss = tf.reduce_sum(loss) / tf.reduce_sum(label_mask) +``` + + +ここでは、データ依存の形状を避けるために、各位置で損失を計算してから、平均を計算する際に分子と分母の両方でマスクされた位置をゼロ化する方法を紹介します。これにより、最初のアプローチとまったく同じ結果が得られますが、XLA互換性を維持します。注意点として、ルール#1と同じトリックを使用します - `tf.bool`を`tf.float32`に変換して指標変数として使用します。これは非常に便利なトリックですので、自分のコードをXLAに変換する必要がある場合には覚えておいてください! + +#### XLA Rule #3: XLA will need to recompile your model for every different input shape it sees + +これは重要なルールです。これはつまり、入力形状が非常に変動的な場合、XLA はモデルを何度も再コンパイルする必要があるため、大きなパフォーマンスの問題が発生する可能性があるということです。これは NLP モデルで一般的に発生し、トークナイズ後の入力テキストの長さが異なる場合があります。他のモダリティでは、静的な形状が一般的であり、このルールはほとんど問題になりません。 + +ルール#3を回避する方法は何でしょうか?鍵は「パディング」です - すべての入力を同じ長さにパディングし、次に「attention_mask」を使用することで、可変形状と同じ結果を得ることができますが、XLA の問題は発生しません。ただし、過度のパディングも深刻な遅延を引き起こす可能性があります - データセット全体で最大の長さにすべてのサンプルをパディングすると、多くの計算とメモリを無駄にする可能性があります! + +この問題には完璧な解決策はありませんが、いくつかのトリックを試すことができます。非常に便利なトリックの1つは、**バッチのサンプルを32または64トークンの倍数までパディングする**ことです。これにより、トークン数がわずかに増加するだけで、すべての入力形状が32または64の倍数である必要があるため、一意の入力形状の数が大幅に減少します。一意の入力形状が少ないと、XLA の再コンパイルが少なくなります! + + + +**🤗 HuggingFace に関する具体的なヒント🤗:** 弊社のトークナイザーとデータコレクターには、ここで役立つメソッドがあります。トークナイザーを呼び出す際に `padding="max_length"` または `padding="longest"` を使用して、パディングされたデータを出力するように設定できます。トークナイザーとデータコレクターには、一意の入力形状の数を減らすのに役立つ `pad_to_multiple_of` 引数もあります! + + + +### How do I actually train my model on TPU? + +一度トレーニングが XLA 互換性があることを確認し、(TPU Node/Colab を使用する場合は)データセットが適切に準備されている場合、TPU 上で実行することは驚くほど簡単です!コードを変更する必要があるのは、いくつかの行を追加して TPU を初期化し、モデルとデータセットが `TPUStrategy` スコープ内で作成されるようにすることだけです。これを実際に見るには、[TPU のサンプルノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb)をご覧ください! 
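+
+参考までに、TPU の初期化と `TPUStrategy` スコープ内でのモデル作成の流れを示す最小限のスケッチを以下に示します(モデル名と `tf_dataset` は説明用の仮定であり、データセットの準備方法については上記のノートブックを参照してください):
+
+```python
+import tensorflow as tf
+from transformers import TFAutoModelForSequenceClassification
+
+# TPU に接続して初期化する(Colab / Kaggle / TPU VM を想定)
+resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
+tf.config.experimental_connect_to_cluster(resolver)
+tf.tpu.experimental.initialize_tpu_system(resolver)
+strategy = tf.distribute.TPUStrategy(resolver)
+
+# モデル(およびオプティマイザ)は必ず strategy.scope() の中で作成する
+with strategy.scope():
+    model = TFAutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased")
+    model.compile(optimizer="adam")  # TPU 上では XLA コンパイルが暗黙的に行われるため jit_compile=True は不要
+
+# tf_dataset は事前に準備した tf.data.Dataset を想定
+# model.fit(tf_dataset, epochs=1)
+```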
+ +### Summary + +ここでは多くの情報が提供されましたので、TPU でモデルをトレーニングする際に以下のチェックリストを使用できます: + +- コードが XLA の三つのルールに従っていることを確認します。 +- CPU/GPU で `jit_compile=True` を使用してモデルをコンパイルし、XLA でトレーニングできることを確認します。 +- データセットをメモリに読み込むか、TPU 互換のデータセット読み込みアプローチを使用します([ノートブックを参照](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb))。 +- コードを Colab(アクセラレータを「TPU」に設定)または Google Cloud の TPU VM に移行します。 +- TPU 初期化コードを追加します([ノートブックを参照](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb))。 +- `TPUStrategy` を作成し、データセットの読み込みとモデルの作成が `strategy.scope()` 内で行われることを確認します([ノートブックを参照](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb))。 +- TPU に移行する際に `jit_compile=True` を外すのを忘れないでください! +- 🙏🙏🙏🥺🥺🥺 +- `model.fit()` を呼び出します。 +- おめでとうございます! + + diff --git a/docs/source/ja/performance.md b/docs/source/ja/performance.md new file mode 100644 index 00000000000000..bcd3987c553536 --- /dev/null +++ b/docs/source/ja/performance.md @@ -0,0 +1,68 @@ + + +# Performance and Scalability + +大規模なトランスフォーマーモデルのトレーニングおよび本番環境への展開はさまざまな課題を提起します。 +トレーニング中には、モデルが利用可能なGPUメモリよりも多くを必要としたり、トレーニング速度が遅かったりする可能性があります。 +デプロイフェーズでは、モデルが本番環境で必要なスループットを処理するのに苦労することがあります。 + +このドキュメンテーションは、これらの課題を克服し、ユースケースに最適な設定を見つけるのに役立つことを目的としています。 +ガイドはトレーニングと推論のセクションに分かれており、それぞれ異なる課題と解決策が存在します。 +各セクション内には、トレーニング用のシングルGPU対マルチGPU、推論用のCPU対GPUなど、異なるハードウェア構成用の別々のガイドが用意されています。 + +このドキュメントを出発点として、シナリオに合った方法に進むための情報源としてご利用ください。 + +## Training + +大規模なトランスフォーマーモデルを効率的にトレーニングするには、GPUやTPUなどのアクセラレータが必要です。 +最も一般的なケースは、シングルGPUがある場合です。シングルGPUでのトレーニング効率を最適化するための一般的なアプローチを学ぶには、以下を参照してください。 + +* [シングルGPUでの効率的なトレーニングのための方法とツール](perf_train_gpu_one): GPUメモリの効果的な利用、トレーニングの高速化などを支援する共通のアプローチを学ぶためにここから始めてください。 +* [マルチGPUトレーニングセクション](perf_train_gpu_many): マルチGPU環境に適用されるデータ、テンソル、パイプライン並列性など、さらなる最適化方法について詳細に学びます。 +* [CPUトレーニングセクション](perf_train_cpu): CPU上での混合精度トレーニングについて学びます。 +* [複数CPUでの効率的なトレーニング](perf_train_cpu_many): 分散CPUトレーニングについて学びます。 +* [TensorFlowでTPUを使用したトレーニング](perf_train_tpu_tf): TPUに慣れていない場合は、TPUでのトレーニングとXLAの使用についてのセクションを参照してください。 +* [トレーニングのためのカスタムハードウェア](perf_hardware): 独自のディープラーニング環境を構築する際のヒントやトリックを見つけます。 +* [Trainer APIを使用したハイパーパラメーター検索](hpo_train) + +## Inference + +本番環境で大規模なモデルを効率的に推論することは、それらをトレーニングすることと同じくらい難しいことがあります。 +以下のセクションでは、CPUおよびシングル/マルチGPU環境で推論を実行する手順について説明します。 + +* [シングルCPUでの推論](perf_infer_cpu) +* [シングルGPUでの推論](perf_infer_gpu_one) +* [マルチGPU推論](perf_infer_gpu_many) +* [TensorFlowモデルのXLA統合](tf_xla) + +## Training and inference + +モデルをトレーニングするか、それを使用して推論を実行するかに関係なく適用されるテクニック、ヒント、トリックがここにあります。 + +* [大規模モデルのインスタンス化](big_models) +* [パフォーマンスの問題のトラブルシューティング](debugging) + +## Contribute + +このドキュメントはまだ完全ではなく、さらに追加する必要がある項目がたくさんあります。 +追加や訂正が必要な場合は、遠慮せずにPRをオープンするか、詳細を議論するためにIssueを開始してください。 + +AがBよりも優れているという貢献を行う際には、再現可能なベンチマークやその情報の出典へのリンクを含めてみてください(あなた自身の情報である場合を除く)。 diff --git a/docs/source/ja/perplexity.md b/docs/source/ja/perplexity.md new file mode 100644 index 00000000000000..368a301ec3ab4a --- /dev/null +++ b/docs/source/ja/perplexity.md @@ -0,0 +1,116 @@ + + +# Perplexity of fixed-length models + +[[open-in-colab]] + +パープレキシティ(PPL)は言語モデルの評価に最も一般的な指標の1つです。深入りする前に、この指標は特に古典的な言語モデル(時にはオートレグレッシブまたは因果言語モデルとも呼ばれる)に適用され、BERTなどのマスクされた言語モデルには適していないことに注意すべきです(モデルの概要を参照してください[モデルの概要](model_summary))。 + +パープレキシティは、シーケンスの指数平均負の対数尤度として定義されます。トークン化されたシーケンス \\(X = (x_0, x_1, \dots, x_t)\\) がある場合、\\(X\\) のパープレキシティは次のように表されます。 + +$$\text{PPL}(X) = \exp \left\{ {-\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{ + 
+しかし、通常、近似モデルを使用する場合、モデルが処理できるトークン数に制約があります。例えば、最大の[GPT-2](model_doc/gpt2)のバージョンは1024トークンの固定長を持っているため、1024よりも大きい \\(t\\) に対して \\(p_\theta(x_t|x_{ + +これは各セグメントのパープレキシティが1回のフォワードパスで計算できるため、計算が迅速ですが、通常、モデルはほとんどの予測ステップでコンテキストが少ないため、完全に因子分解されたパープレキシティの悪い近似となり、通常、より高い(悪い)PPLを返します。 + +代わりに、固定長モデルのPPLはスライディングウィンドウ戦略を用いて評価するべきです。これには、モデルが各予測ステップでより多くのコンテキストを持つように、コンテキストウィンドウを繰り返しスライドさせるという方法が含まれます。 + +Sliding window PPL taking advantage of all available context + +これはシーケンスの確率のより正確な分解に近いものであり、通常はより有利なスコアを生成します。欠点は、コーパス内の各トークンに対して別個の前方パスが必要です。実用的な妥協案は、1トークンずつスライドする代わりに、より大きなストライドでコンテキストを移動するストライド型のスライディングウィンドウを使用することです。これにより、計算がはるかに高速に進行できる一方で、モデルには各ステップで予測を行うための大きなコンテキストが提供されます。 + +## Example: Calculating perplexity with GPT-2 in 🤗 Transformers + +GPT-2を使用してこのプロセスをデモンストレーションしてみましょう。 + +```python +from transformers import GPT2LMHeadModel, GPT2TokenizerFast + +device = "cuda" +model_id = "openai-community/gpt2-large" +model = GPT2LMHeadModel.from_pretrained(model_id).to(device) +tokenizer = GPT2TokenizerFast.from_pretrained(model_id) +``` + +WikiText-2データセットを読み込み、異なるスライディングウィンドウ戦略を使用してパープレキシティを評価します。このデータセットは小規模で、セット全体に対して単一のフォワードパスを実行するだけなので、データセット全体をメモリに読み込んでエンコードするだけで十分です。 + + +```python +from datasets import load_dataset + +test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test") +encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt") +``` + +🤗 Transformersを使用すると、単純に`input_ids`をモデルの`labels`として渡すことで、各トークンの平均負の対数尤度が損失として返されます。しかし、スライディングウィンドウのアプローチでは、各イテレーションでモデルに渡すトークンにオーバーラップがあります。私たちは、コンテキストとして扱っているトークンの対数尤度を損失に含めたくありません。そのため、これらの対象を `-100` に設定して無視されるようにします。以下は、ストライドを `512` とした場合の例です。これにより、モデルは任意のトークンの条件付けの尤度を計算する際に、少なくともコンテキストとして 512 トークンを持つことになります(512 個の前のトークンが利用可能である場合)。 + + +```python +import torch +from tqdm import tqdm + +max_length = model.config.n_positions +stride = 512 +seq_len = encodings.input_ids.size(1) + +nlls = [] +prev_end_loc = 0 +for begin_loc in tqdm(range(0, seq_len, stride)): + end_loc = min(begin_loc + max_length, seq_len) + trg_len = end_loc - prev_end_loc # may be different from stride on last loop + input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device) + target_ids = input_ids.clone() + target_ids[:, :-trg_len] = -100 + + with torch.no_grad(): + outputs = model(input_ids, labels=target_ids) + + # loss is calculated using CrossEntropyLoss which averages over valid labels + # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels + # to the left by 1. + neg_log_likelihood = outputs.loss + + nlls.append(neg_log_likelihood) + + prev_end_loc = end_loc + if end_loc == seq_len: + break + +ppl = torch.exp(torch.stack(nlls).mean()) +``` + +ストライド長が最大入力長と同じ場合、上述の最適でないスライディングウィンドウ戦略と同等です。ストライドが小さいほど、モデルは各予測を行う際により多くのコンテキストを持つため、通常、報告される困難度(perplexity)が向上します。 + +上記のコードを `stride = 1024` で実行すると、オーバーラップがない状態で、結果の困難度(perplexity)は `19.44` になります。これは GPT-2 の論文に報告された `19.93` とほぼ同等です。一方、`stride = 512` を使用し、このようにストライディングウィンドウ戦略を採用すると、困難度(perplexity)が `16.45` に向上します。これはより好意的なスコアだけでなく、シーケンスの尤度の真の自己回帰分解により近い方法で計算されています。 + + + diff --git a/docs/source/ja/philosophy.md b/docs/source/ja/philosophy.md new file mode 100644 index 00000000000000..3edef0bd2add3b --- /dev/null +++ b/docs/source/ja/philosophy.md @@ -0,0 +1,67 @@ + + +# Philosophy + +🤗 Transformersは、次のような目的で構築された意見を持つライブラリです: + +- 大規模なTransformersモデルを使用、研究、または拡張したい機械学習研究者および教育者。 +- これらのモデルを微調整したり、本番環境で提供したり、またはその両方を行いたい実務家。 +- 与えられた機械学習タスクを解決するために、事前トレーニングされたモデルをダウンロードして使用したいエンジニア。 + +このライブラリは、2つの強力な目標を持って設計されました: + +1. 
できるだけ簡単かつ高速に使用できるようにすること: + + - ユーザー向けの抽象化を限りなく少なくし、実際、ほとんどの場合、抽象化はありません。 + 各モデルを使用するために必要な3つの標準クラスだけが存在します:[構成](main_classes/configuration)、 + [モデル](main_classes/model)、および前処理クラス(NLP用の[トークナイザ](main_classes/tokenizer)、ビジョン用の[イメージプロセッサ](main_classes/image_processor)、 + オーディオ用の[特徴抽出器](main_classes/feature_extractor)、およびマルチモーダル入力用の[プロセッサ](main_classes/processors))。 + - これらのクラスは、共通の`from_pretrained()`メソッドを使用して、事前トレーニング済みのインスタンスから簡単かつ統一された方法で初期化できます。このメソッドは、事前トレーニング済みのチェックポイントから関連するクラスのインスタンスと関連データ(構成のハイパーパラメータ、トークナイザの語彙、モデルの重み)をダウンロード(必要な場合はキャッシュ)して読み込みます。これらの基本クラスの上に、ライブラリは2つのAPIを提供しています:[パイプライン]は、特定のタスクでモデルをすばやく推論に使用するためのものであり、[`Trainer`]はPyTorchモデルを迅速にトレーニングまたは微調整するためのものです(すべてのTensorFlowモデルは`Keras.fit`と互換性があります)。 + - その結果、このライブラリはニューラルネットワークのモジュラーツールボックスではありません。ライブラリを拡張または構築したい場合は、通常のPython、PyTorch、TensorFlow、Kerasモジュールを使用し、ライブラリの基本クラスから継承してモデルの読み込みと保存などの機能を再利用するだけです。モデルのコーディング哲学について詳しく知りたい場合は、[Repeat Yourself](https://huggingface.co/blog/transformers-design-philosophy)ブログ投稿をチェックしてみてください。 + +2. オリジナルのモデルにできるだけ近い性能を持つ最新のモデルを提供すること: + + - 各アーキテクチャに対して、公式な著者から提供された結果を再現する少なくとも1つの例を提供します。 + - コードは通常、可能な限り元のコードベースに近いものであり、これはPyTorchコードがTensorFlowコードに変換されることから生じ、逆もまた然りです。 + +その他のいくつかの目標: + +- モデルの内部をできるだけ一貫して公開すること: + + - フルな隠れ状態と注意の重みにアクセスできる単一のAPIを提供します。 + - 前処理クラスと基本モデルのAPIは標準化され、簡単にモデル間を切り替えることができます。 + +- これらのモデルの微調整と調査のための有望なツールを主観的に選定すること: + + - 語彙と埋め込みに新しいトークンを追加するための簡単で一貫した方法。 + - Transformerヘッドをマスクおよびプルーンするための簡単な方法。 + +- PyTorch、TensorFlow 2.0、およびFlaxの間を簡単に切り替えて、1つのフレームワークでトレーニングし、別のフレームワークで推論を行うことを可能にすること。 + +## Main concepts + +このライブラリは、各モデルについて次の3つのタイプのクラスを中心に構築されています: + +- **モデルクラス**は、ライブラリで提供される事前トレーニング済みの重みと互換性のあるPyTorchモデル([torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module))、Kerasモデル([tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model))またはJAX/Flaxモデル([flax.linen.Module](https://flax.readthedocs.io/en/latest/api_reference/flax.linen/module.html))を使用できます。 +- **構成クラス**は、モデルを構築するために必要なハイパーパラメータを格納します(層の数や隠れ層のサイズなど)。これらを自分でインスタンス化する必要はありません。特に、変更を加えずに事前トレーニング済みモデルを使用している場合、モデルを作成すると自動的に構成がインスタンス化されるようになります(これはモデルの一部です)。 +- **前処理クラス**は、生データをモデルが受け入れる形式に変換します。[トークナイザ](main_classes/tokenizer)は各モデルの語彙を保存し、文字列をトークン埋め込みのインデックスのリストにエンコードおよびデコードするためのメソッドを提供します。[イメージプロセッサ](main_classes/image_processor)はビジョン入力を前処理し、[特徴抽出器](main_classes/feature_extractor)はオーディオ入力を前処理し、[プロセッサ](main_classes/processors)はマルチモーダル入力を処理します。 + +これらのすべてのクラスは、事前トレーニング済みのインスタンスからインスタンス化し、ローカルに保存し、Hubで共有することができる3つのメソッドを使用しています: + +- `from_pretrained()`は、ライブラリ自体によって提供される([モデルハブ](https://huggingface.co/models)でサポートされているモデルがあります)か、ユーザーによってローカルに保存された(またはサーバーに保存された)事前トレーニング済みバージョンからモデル、構成、前処理クラスをインスタンス化するためのメソッドです。 +- `save_pretrained()`は、モデル、構成、前処理クラスをローカルに保存し、`from_pretrained()`を使用して再読み込みできるようにします。 +- `push_to_hub()`は、モデル、構成、前処理クラスをHubに共有し、誰でも簡単にアクセスできるようにします。 diff --git a/docs/source/ja/pipeline_tutorial.md b/docs/source/ja/pipeline_tutorial.md new file mode 100644 index 00000000000000..354e2a2be38022 --- /dev/null +++ b/docs/source/ja/pipeline_tutorial.md @@ -0,0 +1,293 @@ + + +# Pipelines for inference + +[`pipeline`]を使用することで、[Hub](https://huggingface.co/models)からの任意のモデルを言語、コンピュータビジョン、音声、およびマルチモーダルタスクの推論に簡単に使用できます。 +特定のモダリティに関する経験がない場合や、モデルの背後にあるコードに精通していない場合でも、[`pipeline`]を使用して推論できます! 
+このチュートリアルでは、次のことを学びます: + +- 推論のための[`pipeline`]の使用方法。 +- 特定のトークナイザやモデルの使用方法。 +- オーディオ、ビジョン、マルチモーダルタスクのための[`pipeline`]の使用方法。 + + + +サポートされているタスクと利用可能なパラメータの完全な一覧については、[`pipeline`]のドキュメンテーションをご覧ください。 + + + +## Pipeline usage + +各タスクには関連する[`pipeline`]がありますが、タスク固有の[`pipeline`]を使用する代わりに、すべてのタスク固有のパイプラインを含む一般的な[`pipeline`]の抽象化を使用すると、より簡単です。[`pipeline`]は自動的にデフォルトのモデルと、タスクの推論が可能な前処理クラスを読み込みます。 + +1. [`pipeline`]を作成し、推論タスクを指定して始めます: + +```py +>>> from transformers import pipeline + +>>> generator = pipeline(task="automatic-speech-recognition") +``` + +2. [`pipeline`]に入力テキストを渡します: + +```python +>>> generator("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +{'text': 'I HAVE A DREAM BUT ONE DAY THIS NATION WILL RISE UP LIVE UP THE TRUE MEANING OF ITS TREES'} +``` + +チェックアウトできなかったか? [Hubの最もダウンロードされた自動音声認識モデル](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads) のいくつかを見て、より良い転写を得ることができるかどうかを確認してみてください。 +[openai/whisper-large](https://huggingface.co/openai/whisper-large) を試してみましょう: + +```python +>>> generator = pipeline(model="openai/whisper-large") +>>> generator("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'} +``` + +この結果はより正確に見えますね! +異なる言語、専門分野に特化したモデル、その他のモデルについては、Hubをチェックすることを強くお勧めします。 +Hubでは、ブラウザから直接モデルの結果をチェックして、他のモデルよりも適しているか、特殊なケースをよりよく処理できるかを確認できます。 +そして、あなたのユースケースに適したモデルが見つからない場合、いつでも[トレーニング](training)を開始できます! + +複数の入力がある場合、入力をリストとして渡すことができます: + +```py +generator( + [ + "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac", + "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac", + ] +) +``` + +データセット全体を繰り返し処理したり、ウェブサーバーで推論に使用したい場合は、専用の部分をチェックしてください。 + +[データセットでパイプラインを使用する](#using-pipeline-in-a-dataset) + +[ウェブサーバーでパイプラインを使用する](./pipeline_webserver) + + + +## パラメータ + +[`pipeline`]は多くのパラメータをサポートしており、一部はタスク固有であり、一部はすべてのパイプラインに共通です。 +一般的には、どこでもパラメータを指定できます: + +```py +generator = pipeline(model="openai/whisper-large", my_parameter=1) +out = generator(...) # これは `my_parameter=1` を使用します。 +out = generator(..., my_parameter=2) # これは上書きして `my_parameter=2` を使用します。 +out = generator(...) # これは再び `my_parameter=1` を使用します。 +``` + +3つの重要なものを確認しましょう: + +### Device + +`device=n` を使用すると、パイプラインはモデルを指定したデバイスに自動的に配置します。 +これは、PyTorchまたはTensorflowを使用しているかどうかに関係なく機能します。 + +```py +generator = pipeline(model="openai/whisper-large", device=0) +``` + +もしモデルが単一のGPUには大きすぎる場合、`device_map="auto"`を設定して、🤗 [Accelerate](https://huggingface.co/docs/accelerate) にモデルの重みをどのようにロードし、保存するかを自動的に決定させることができます。 + +```python +#!pip install accelerate +generator = pipeline(model="openai/whisper-large", device_map="auto") +``` + +注意: `device_map="auto"` が渡された場合、`pipeline` をインスタンス化する際に `device=device` 引数を追加する必要はありません。そうしないと、予期しない動作に遭遇する可能性があります! 
+ +### Batch size + +デフォルトでは、パイプラインは詳細について[こちら](https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching)で説明されている理由から、推論をバッチ処理しません。その理由は、バッチ処理が必ずしも速くないためであり、実際にはいくつかのケースでかなり遅くなることがあるからです。 + +ただし、あなたのユースケースで機能する場合は、次のように使用できます: + +```py +generator = pipeline(model="openai/whisper-large", device=0, batch_size=2) +audio_filenames = [f"audio_{i}.flac" for i in range(10)] +texts = generator(audio_filenames) +``` + +これにより、パイプラインは提供された10個のオーディオファイルでパイプラインを実行しますが、 +モデルにはバッチ処理がより効果的であるGPU上にあり、バッチ処理を行うための追加のコードは必要ありません。 +出力は常にバッチ処理なしで受け取ったものと一致するはずです。これは単にパイプラインからより高速な処理を得るための方法として提供されています。 + +パイプラインは、バッチ処理のいくつかの複雑さを軽減することもできます。なぜなら、一部のパイプラインでは、 +モデルで処理するために1つのアイテム(長いオーディオファイルのようなもの)を複数の部分に分割する必要がある場合があるからです。 +パイプラインはこれをあなたのために実行します。[*チャンクバッチ処理*](./main_classes/pipelines#pipeline-chunk-batching)として知られるものを実行します。 + +### Task specific parameters + +すべてのタスクは、タスク固有のパラメータを提供し、追加の柔軟性とオプションを提供して、作業をスムーズに進めるのに役立ちます。 +たとえば、[`transformers.AutomaticSpeechRecognitionPipeline.__call__`]メソッドには、ビデオの字幕作成に有用な`return_timestamps`パラメータがあります。 + +```py +>>> # Not using whisper, as it cannot provide timestamps. +>>> generator = pipeline(model="facebook/wav2vec2-large-960h-lv60-self", return_timestamps="word") +>>> generator("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +{'text': 'I HAVE A DREAM BUT ONE DAY THIS NATION WILL RISE UP AND LIVE OUT THE TRUE MEANING OF ITS CREED', 'chunks': [{'text': 'I', 'timestamp': (1.22, 1.24)}, {'text': 'HAVE', 'timestamp': (1.42, 1.58)}, {'text': 'A', 'timestamp': (1.66, 1.68)}, {'text': 'DREAM', 'timestamp': (1.76, 2.14)}, {'text': 'BUT', 'timestamp': (3.68, 3.8)}, {'text': 'ONE', 'timestamp': (3.94, 4.06)}, {'text': 'DAY', 'timestamp': (4.16, 4.3)}, {'text': 'THIS', 'timestamp': (6.36, 6.54)}, {'text': 'NATION', 'timestamp': (6.68, 7.1)}, {'text': 'WILL', 'timestamp': (7.32, 7.56)}, {'text': 'RISE', 'timestamp': (7.8, 8.26)}, {'text': 'UP', 'timestamp': (8.38, 8.48)}, {'text': 'AND', 'timestamp': (10.08, 10.18)}, {'text': 'LIVE', 'timestamp': (10.26, 10.48)}, {'text': 'OUT', 'timestamp': (10.58, 10.7)}, {'text': 'THE', 'timestamp': (10.82, 10.9)}, {'text': 'TRUE', 'timestamp': (10.98, 11.18)}, {'text': 'MEANING', 'timestamp': (11.26, 11.58)}, {'text': 'OF', 'timestamp': (11.66, 11.7)}, {'text': 'ITS', 'timestamp': (11.76, 11.88)}, {'text': 'CREED', 'timestamp': (12.0, 12.38)}]} +``` + +モデルは、テキストを推測し、文の中で各単語がいつ発音されたかを出力しました。 + +各タスクごとに利用可能な多くのパラメータがありますので、何を調整できるかを確認するために各タスクのAPIリファレンスを確認してください! +たとえば、[`~transformers.AutomaticSpeechRecognitionPipeline`]には、モデル単体では処理できない非常に長いオーディオファイル(たとえば、映画全体や1時間のビデオの字幕付けなど)で役立つ`chunk_length_s`パラメータがあります。 + + + +役立つパラメータが見つからない場合は、[リクエスト](https://github.com/huggingface/transformers/issues/new?assignees=&labels=feature&template=feature-request.yml)してください! 
+ +## Using pipeline in a dataset + +パイプラインは大規模なデータセット上で推論を実行することもできます。これを行う最も簡単な方法は、イテレータを使用することです: + +```py +def data(): + for i in range(1000): + yield f"My example {i}" + + +pipe = pipeline(model="openai-community/gpt2", device=0) +generated_characters = 0 +for out in pipe(data()): + generated_characters += len(out[0]["generated_text"]) +``` + +イテレーター `data()` は各結果を生成し、パイプラインは自動的に入力が反復可能であることを認識し、データを取得し続けながらGPU上で処理を行います(これは[DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)を内部で使用しています)。 +これは、データセット全体にメモリを割り当てる必要がなく、GPUにできるだけ速くデータを供給できるため重要です。 + +バッチ処理は処理を高速化できる可能性があるため、ここで`batch_size`パラメータを調整して試すことが役立つかもしれません。 + +データセットを反復処理する最も簡単な方法は、🤗 [Datasets](https://github.com/huggingface/datasets/)からデータセットを読み込むことです: + +```py +# KeyDataset is a util that will just output the item we're interested in. +from transformers.pipelines.pt_utils import KeyDataset +from datasets import load_dataset + +pipe = pipeline(model="hf-internal-testing/tiny-random-wav2vec2", device=0) +dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation[:10]") + +for out in pipe(KeyDataset(dataset, "audio")): + print(out) +``` + +## Using pipelines for a webserver + + +推論エンジンを作成することは複雑なトピックで、独自のページが必要です。 + + +[リンク](./pipeline_webserver) + +## Vision pipeline + +ビジョンタスク用の[`pipeline`]を使用する方法はほぼ同じです。 + +タスクを指定し、画像をクラシファイアに渡します。画像はリンク、ローカルパス、またはBase64エンコードされた画像であることができます。例えば、以下の画像はどの種類の猫ですか? + +![pipeline-cat-chonk](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg) + +```py +>>> from transformers import pipeline + +>>> vision_classifier = pipeline(model="google/vit-base-patch16-224") +>>> preds = vision_classifier( +... images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +... ) +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds] +>>> preds +[{'score': 0.4335, 'label': 'lynx, catamount'}, {'score': 0.0348, 'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor'}, {'score': 0.0324, 'label': 'snow leopard, ounce, Panthera uncia'}, {'score': 0.0239, 'label': 'Egyptian cat'}, {'score': 0.0229, 'label': 'tiger cat'}] +``` + +## Text pipeline + +[`pipeline`]を使用することは、NLPタスクに対してほぼ同じです。 + +```py +>>> from transformers import pipeline + +>>> # This model is a `zero-shot-classification` model. +>>> # It will classify text, except you are free to choose any label you might imagine +>>> classifier = pipeline(model="facebook/bart-large-mnli") +>>> classifier( +... "I have a problem with my iphone that needs to be resolved asap!!", +... candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"], +... ) +{'sequence': 'I have a problem with my iphone that needs to be resolved asap!!', 'labels': ['urgent', 'phone', 'computer', 'not urgent', 'tablet'], 'scores': [0.504, 0.479, 0.013, 0.003, 0.002]} +``` + +## Multimodal pipeline + +[`pipeline`]は、1つ以上のモダリティをサポートしています。たとえば、視覚的な質問応答(VQA)タスクはテキストと画像を組み合わせています。 +好きな画像リンクと画像に関する質問を自由に使ってください。画像はURLまたは画像のローカルパスで指定できます。 + +例えば、この[請求書画像](https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/invoice.png)を使用する場合: + +```py +>>> from transformers import pipeline + +>>> vqa = pipeline(model="impira/layoutlm-document-qa") +>>> vqa( +... image="https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/invoice.png", +... question="What is the invoice number?", +... 
) +[{'score': 0.42515, 'answer': 'us-001', 'start': 16, 'end': 16}] +``` + + + +上記の例を実行するには、🤗 Transformersに加えて [`pytesseract`](https://pypi.org/project/pytesseract/) がインストールされている必要があります。 + +```bash +sudo apt install -y tesseract-ocr +pip install pytesseract +``` + + + +## Using `pipeline` on large models with 🤗 `accelerate`: + +まず、`accelerate` を`pip install accelerate` でインストールしていることを確認してください。 + +次に、`device_map="auto"` を使用してモデルをロードします。この例では `facebook/opt-1.3b` を使用します。 + +```python +# pip install accelerate +import torch +from transformers import pipeline + +pipe = pipeline(model="facebook/opt-1.3b", torch_dtype=torch.bfloat16, device_map="auto") +output = pipe("これは素晴らしい例です!", do_sample=True, top_p=0.95) +``` + +もし `bitsandbytes` をインストールし、`load_in_8bit=True` 引数を追加すれば、8ビットで読み込まれたモデルを渡すこともできます。 + +```py +# pip install accelerate bitsandbytes +import torch +from transformers import pipeline + +pipe = pipeline(model="facebook/opt-1.3b", device_map="auto", model_kwargs={"load_in_8bit": True}) +output = pipe("This is a cool example!", do_sample=True, top_p=0.95) +``` + +注意: BLOOMなどの大規模モデルのロードをサポートするHugging Faceモデルのいずれかで、チェックポイントを置き換えることができます! diff --git a/docs/source/ja/pipeline_webserver.md b/docs/source/ja/pipeline_webserver.md new file mode 100644 index 00000000000000..3b35a01490d409 --- /dev/null +++ b/docs/source/ja/pipeline_webserver.md @@ -0,0 +1,132 @@ + + +# Webサーバー用のパイプラインの使用 + + +推論エンジンの作成は複雑なトピックであり、"最適な"ソリューションはおそらく問題の領域に依存するでしょう。CPUまたはGPUを使用していますか?最低のレイテンシ、最高のスループット、多くのモデルのサポート、または特定のモデルの高度な最適化を望んでいますか? +このトピックに取り組むための多くの方法があり、私たちが紹介するのは、おそらく最適なソリューションではないかもしれないが、始めるための良いデフォルトです。 + + +重要なことは、Webサーバーはリクエストを待機し、受信したように扱うシステムであるため、[データセット](pipeline_tutorial#using-pipelines-on-a-dataset)のように、イテレータを使用できることです。 + +通常、Webサーバーは並列処理(マルチスレッド、非同期など)されて、さまざまなリクエストを同時に処理します。一方、パイプライン(および主にその基礎となるモデル)は並列処理にはあまり適していません。それらは多くのRAMを使用するため、実行中に利用可能なリソースをすべて提供するか、計算集約型のジョブである場合に最適です。 + +Webサーバーは受信と送信の軽い負荷を処理し、実際の作業を1つのスレッドで処理するようにします。この例では`starlette`を使用します。実際のフレームワークはあまり重要ではありませんが、別のフレームワークを使用している場合は、同じ効果を得るためにコードを調整または変更する必要があるかもしれません。 + +`server.py`を作成してください: + + +```py +from starlette.applications import Starlette +from starlette.responses import JSONResponse +from starlette.routing import Route +from transformers import pipeline +import asyncio + + +async def homepage(request): + payload = await request.body() + string = payload.decode("utf-8") + response_q = asyncio.Queue() + await request.app.model_queue.put((string, response_q)) + output = await response_q.get() + return JSONResponse(output) + + +async def server_loop(q): + pipe = pipeline(model="google-bert/bert-base-uncased") + while True: + (string, response_q) = await q.get() + out = pipe(string) + await response_q.put(out) + + +app = Starlette( + routes=[ + Route("/", homepage, methods=["POST"]), + ], +) + + +@app.on_event("startup") +async def startup_event(): + q = asyncio.Queue() + app.model_queue = q + asyncio.create_task(server_loop(q)) +``` + +ここから始めることができます: +```bash +uvicorn server:app +``` + +そして、次のようにクエリできます: +```bash +curl -X POST -d "test [MASK]" http://localhost:8000/ +#[{"score":0.7742936015129089,"token":1012,"token_str":".","sequence":"test."},...] +``` + + + +そして、これでウェブサーバーを作成する方法の良いアイデアを持っています! + +本当に重要なのは、モデルを**一度だけ**ロードすることです。これにより、ウェブサーバー上にモデルのコピーがないため、不必要なRAMが使用されなくなります。 +その後、キューイングメカニズムを使用して、動的バッチ処理を行うなど、いくつかのアイテムを蓄積してから推論を行うなど、高度な処理を行うことができます: + + + +以下のコードサンプルは、可読性のために擬似コードのように書かれています。システムリソースに合理的かどうかを確認せずに実行しないでください! 
+ + + + +```py +(string, rq) = await q.get() +strings = [] +queues = [] +while True: + try: + (string, rq) = await asyncio.wait_for(q.get(), timeout=0.001) # 1ms + except asyncio.exceptions.TimeoutError: + break + strings.append(string) + queues.append(rq) +strings +outs = pipe(strings, batch_size=len(strings)) +for rq, out in zip(queues, outs): + await rq.put(out) +``` + +まず第一に、通常はあまり良いアイデアではないバッチサイズの制限がありません。次に、タイムアウトはキューの取得ごとにリセットされるため、推論を実行する前に1ms以上待つ可能性があります(最初のリクエストの遅延に1ms分遅れが生じます)。 + +1msの締め切りを1回だけ持つのが良いでしょう。 + +これは、キューに何もない場合でも常に1ms待機しますが、キューに何もない場合に推論を開始したい場合は適していないかもしれません。ただし、バッチ処理が本当に重要な場合には意味があるかもしれません。再度、1つの最適な解決策は存在しません。 + +## Few things you might want to consider + +### Error checking + +本番環境では多くの問題が発生する可能性があります:メモリ不足、スペース不足、モデルの読み込みが失敗するかもしれません、クエリが誤っているかもしれません、クエリが正しい場合でもモデルの構成エラーのために実行に失敗するかもしれませんなど。 + +一般的には、サーバーがエラーをユーザーに出力すると良いため、これらのエラーを表示するための多くの`try..except`ステートメントを追加することは良いアイデアです。ただし、セキュリティコンテキストに応じてこれらのエラーをすべて表示することはセキュリティリスクになる可能性があることに注意してください。 + +### Circuit breaking + +Webサーバーは通常、過負荷時に正しいエラーを返す方が良いです。クエリを無期限に待つ代わりに適切なエラーを返します。長時間待つ代わりに503エラーを返すか、長時間待ってから504エラーを返すかです。 + +提案されたコードでは単一のキューがあるため、キューサイズを見ることは、Webサーバーが負荷に耐える前にエラーを返すための基本的な方法です。 + +### Blocking the main thread + +現在、PyTorchは非同期を認識していないため、計算はメインスレッドをブロックします。つまり、PyTorchが独自のスレッド/プロセスで実行されるようにすると良いでしょう。提案されたコードは、スレッドと非同期とキューがうまく連携しないため、これは行われていませんが、最終的には同じことを行います。 + +これは、単一のアイテムの推論が長い場合(>1秒)に重要です。この場合、推論中にすべてのクエリが1秒待たなければならないことを意味します。 + +### Dynamic batching + +一般的に、バッチ処理は1回のアイテムを1回渡すよりも改善されることは必ずしもありません(詳細は[バッチ処理の詳細](./main_classes/pipelines#pipeline-batching)を参照)。しかし、正しい設定で使用すると非常に効果的です。APIではデフォルトで動的バッチ処理は行われません(遅延の機会が多すぎます)。しかし、非常に大規模なモデルであるBLOOM推論の場合、動的バッチ処理は**重要**です。これにより、すべてのユーザーにとってまともなエクスペリエンスを提供できます。 + +以上が、提供されたテキストのMarkdown形式の翻訳です。 diff --git a/docs/source/ja/pr_checks.md b/docs/source/ja/pr_checks.md new file mode 100644 index 00000000000000..dc8450b52502e6 --- /dev/null +++ b/docs/source/ja/pr_checks.md @@ -0,0 +1,208 @@ + + + +# Checks on a Pull Request + +🤗 Transformersリポジトリでプルリクエストを開くと、追加しているパッチが既存のものを壊していないことを確認するために、かなりの数のチェックが実行されます。これらのチェックには、次の4つのタイプがあります: +- 通常のテスト +- ドキュメンテーションのビルド +- コードとドキュメンテーションのスタイル +- リポジトリ全体の一貫性 + +このドキュメントでは、これらのさまざまなチェックとその背後にある理由、そしてそれらのいずれかがあなたのプルリクエストで失敗した場合のローカルでのデバッグ方法について説明します。 + +なお、理想的には、開発者用のインストールが必要です: + + +```bash +pip install transformers[dev] +``` + +または編集可能なインストールの場合: + + +```bash +pip install -e .[dev] +``` + +トランスフォーマーズのリポジトリ内で作業しています。トランスフォーマーズのオプションの依存関係の数が増えたため、すべてを取得できない可能性があります。開発用インストールが失敗した場合、作業しているディープラーニングフレームワーク(PyTorch、TensorFlow、および/またはFlax)をインストールし、次の手順を実行してください。 + + +```bash +pip install transformers[quality] +``` + +または編集可能なインストールの場合: + +```bash +pip install -e .[quality] +``` + +## Tests + +`ci/circleci: run_tests_` で始まるすべてのジョブは、Transformersのテストスイートの一部を実行します。これらのジョブは、特定の環境でライブラリの一部に焦点を当てて実行されます。たとえば、`ci/circleci: run_tests_pipelines_tf` は、TensorFlowのみがインストールされた環境でパイプラインのテストを実行します。 + +テストスイートの一部のみが実行されるように注意してください。テストスイートは、変更前と変更後のPRのライブラリの違いを決定し、その違いに影響を受けるテストを選択するためのユーティリティが実行されます。このユーティリティは、ローカルで以下のコマンドを実行して実行できます: + + +```bash +python utils/tests_fetcher.py +``` + +1. リポジトリのルートからスクリプトを実行します。これは次のステップを実行します: + + 1. 差分内の各ファイルをチェックし、変更がコード内にあるか、コメントやドキュメンテーション文字列のみにあるかを確認します。実際のコード変更があるファイルのみを保持します。 + + 2. 内部のマップを構築します。このマップは、ライブラリのソースコードの各ファイルが再帰的に影響を与えるすべてのファイルを提供します。モジュールAがモジュールBに影響を与えるとは、モジュールBがモジュールAをインポートする場合を指します。再帰的な影響を得るには、モジュールAからモジュールBへのモジュールのチェーンが必要で、各モジュールは前のモジュールをインポートする必要があります。 + + 3. このマップをステップ1で収集したファイルに適用します。これにより、PRに影響を受けるモデルファイルのリストが得られます。 + + 4. これらのファイルをそれに対応するテストファイルにマップし、実行するテストのリストを取得します。 + +2. 
スクリプトをローカルで実行する場合、ステップ1、3、および4の結果が表示され、実行するテストがわかります。スクリプトはまた、`test_list.txt` という名前のファイルを作成します。このファイルには実行するテストのリストが含まれており、次のコマンドでローカルで実行できます: + +```bash +python -m pytest -n 8 --dist=loadfile -rA -s $(cat test_list.txt) +``` + +## Documentation build + +`build_pr_documentation` ジョブは、ドキュメンテーションのビルドを行い、あなたのPRがマージされた後にすべてが正常に表示されることを確認します。ボットがプレビューのドキュメンテーションへのリンクをPRに追加します。PRに対する変更は、プレビューに自動的に反映されます。ドキュメンテーションのビルドに失敗した場合、失敗したジョブの隣にある「詳細」をクリックして、何が問題になっているかを確認できます。多くの場合、問題は`toctree`内のファイルが不足しているなど、単純なものです。 + +ドキュメンテーションをローカルでビルドまたはプレビューしたい場合は、[docsフォルダ内の`README.md`](https://github.com/huggingface/transformers/tree/main/docs)をご覧ください。 + +## Code and documentation style + +すべてのソースファイル、例、テストには、`black`と`ruff`を使用してコードのフォーマットが適用されます。また、ドックストリングと`rst`ファイルのフォーマット、Transformersの`__init__.py`ファイルで実行される遅延インポートの順序についてもカスタムツールが存在します(`utils/style_doc.py`と`utils/custom_init_isort.py`)。これらすべては、以下を実行することで起動できます。 + + +```bash +make style +``` + +CIは、`ci/circleci: check_code_quality` チェック内でこれらのチェックが適用されていることを確認します。また、`ruff` を実行し、未定義の変数や使用されていない変数がある場合にエラーを報告します。このチェックをローカルで実行するには、以下のコマンドを使用してください。 + + +```bash +make quality +``` + +時間がかかることがあります。したがって、現在のブランチで変更したファイルのみで同じことを実行するには、次のコマンドを実行します。 + + +```bash +make fixup +``` + +この最後のコマンドは、リポジトリの整合性のためのすべての追加のチェックも実行します。それらを詳しく見てみましょう。 + +## Repository consistency + +これには、あなたのPRがリポジトリを適切な状態に保ったままであることを確認するためのすべてのテストが含まれており、ci/`circleci: check_repository_consistency` チェックによって実行されます。ローカルでこのチェックを実行するには、以下を実行します。 + +```bash +make repo-consistency +``` + +これを確認します: + +- `utils/check_repo.py` によって実行される、init に追加されたすべてのオブジェクトが文書化されています。 +- `utils/check_inits.py` によって実行される、すべての `__init__.py` ファイルがその2つのセクションで同じ内容を持っています。 +- `utils/check_copies.py` によって実行される、他のモジュールからのコピーとして識別されたすべてのコードが元のコードと一致しています。 +- `utils/check_config_docstrings.py` によって実行される、すべての設定クラスには少なくとも1つの有効なチェックポイントがドキュメント文字列に記載されています。 +- `utils/check_config_attributes.py` によって実行される、すべての設定クラスには、対応するモデリングファイルで使用されている属性のみが含まれています。 +- `utils/check_copies.py` によって実行される、README とドキュメントのインデックスの翻訳が、メインのREADME と同じモデルリストを持っています。 +- `utils/check_table.py` によって実行される、ドキュメンテーションの自動生成テーブルが最新であることを確認します。 +- `utils/check_dummies.py` によって実行される、すべてのオブジェクトが利用可能であり、オプションの依存関係がすべてインストールされていなくても問題ありません。 + +このチェックが失敗する場合、最初の2つの項目は手動で修正する必要があり、最後の4つはコマンドを実行して自動的に修正できます。 + + +```bash +make fix-copies +``` + +追加のチェックポイントは、新しいモデルを追加するPull Request(PR)に関連しています。主に次の点を確認します: + +- すべての追加されたモデルは、Auto-mapping(`utils/check_repo.py`で実行)に含まれています。 + +- すべてのモデルが適切にテストされています(`utils/check_repo.py`で実行)。 + + + + +### Check copies + +Transformersライブラリは、モデルコードに関して非常に意見があるため、各モデルは他のモデルに依存せずに完全に1つのファイルに実装する必要があります。したがって、特定のモデルのコードのコピーが元のコードと一貫しているかどうかを確認する仕組みを追加しました。これにより、バグ修正がある場合、他の影響を受けるモデルをすべて確認し、変更を伝達するかコピーを破棄するかを選択できます。 + + + +ファイルが別のファイルの完全なコピーである場合、それを`utils/check_copies.py`の`FULL_COPIES`定数に登録する必要があります。 + + + +この仕組みは、`# Copied from xxx`という形式のコメントに依存しています。`xxx`は、コピーされているクラスまたは関数の完全なパスを含む必要があります。例えば、`RobertaSelfOutput`は`BertSelfOutput`クラスの直接のコピーですので、[こちら](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L289)にコメントがあります。 + + +```py +# Copied from transformers.models.bert.modeling_bert.BertSelfOutput +``` + +注意点として、これをクラス全体に適用する代わりに、コピー元の関連メソッドに適用できます。たとえば、[こちら](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L598)では、`RobertaPreTrainedModel._init_weights` が `BertPreTrainedModel` からコピーされており、以下のコメントがあります: + + +```py +# Copied from 
transformers.models.bert.modeling_bert.BertAttention with Bert->Roberta +``` + +注:矢印の周りにはスペースが含まれていてはいけません(もちろん、そのスペースが置換パターンの一部である場合を除きます)。 + +カンマで区切られた複数のパターンを追加できます。例えば、ここでは `CamemberForMaskedLM` は `RobertaForMaskedLM` の直接のコピーで、2つの置換があります: `Roberta` から `Camembert` へ、そして `ROBERTA` から `CAMEMBERT` へと置換されます。[こちら](https://github.com/huggingface/transformers/blob/15082a9dc6950ecae63a0d3e5060b2fc7f15050a/src/transformers/models/camembert/modeling_camembert.py#L929)で、この作業はコメント付きで行われています。 + + +```py +# Copied from transformers.models.roberta.modeling_roberta.RobertaForMaskedLM with Roberta->Camembert, ROBERTA->CAMEMBERT +``` + + +もし順序が重要な場合(以前の置換と競合する可能性があるため)、置換は左から右に実行されます。 + + + +もし置換がフォーマットを変更する場合(たとえば、短い名前を非常に長い名前に置き換える場合など)、自動フォーマッタを適用した後にコピーが確認されます。 + + + +パターンが同じ置換の異なるケース(大文字と小文字のバリアントがある)の場合、オプションとして `all-casing` を追加するだけの別の方法もあります。[こちら](https://github.com/huggingface/transformers/blob/15082a9dc6950ecae63a0d3e5060b2fc7f15050a/src/transformers/models/mobilebert/modeling_mobilebert.py#L1237)は、`MobileBertForSequenceClassification` 内の例で、コメントがついています。 + + +```py +# Copied from transformers.models.bert.modeling_bert.BertForSequenceClassification with Bert->MobileBert all-casing +``` + +この場合、コードは「BertForSequenceClassification」からコピーされ、次のように置換されます: +- `Bert` を `MobileBert` に置き換える(例:`init`で `MobileBertModel` を使用する場合) +- `bert` を `mobilebert` に置き換える(例:`self.mobilebert` を定義する場合) +- `BERT` を `MOBILEBERT` に置き換える(定数 `MOBILEBERT_INPUTS_DOCSTRING` 内で) + diff --git a/docs/source/ja/preprocessing.md b/docs/source/ja/preprocessing.md new file mode 100644 index 00000000000000..ea0b98df028031 --- /dev/null +++ b/docs/source/ja/preprocessing.md @@ -0,0 +1,533 @@ + + +# Preprocess + +[[open-in-colab]] + +データセットでモデルをトレーニングする前に、それをモデルの期待する入力形式に前処理する必要があります。 +データがテキスト、画像、またはオーディオであるかどうかにかかわらず、それらはテンソルのバッチに変換して組み立てる必要があります。 +🤗 Transformersは、データをモデル用に準備するのに役立つ前処理クラスのセットを提供しています。 +このチュートリアルでは、次のことを学びます: + +* テキストの場合、[Tokenizer](./main_classes/tokenizer)を使用してテキストをトークンのシーケンスに変換し、トークンの数値表現を作成し、それらをテンソルに組み立てる方法。 +* 音声とオーディオの場合、[Feature extractor](./main_classes/feature_extractor)を使用してオーディオ波形から連続的な特徴を抽出し、それらをテンソルに変換する方法。 +* 画像入力の場合、[ImageProcessor](./main_classes/image)を使用して画像をテンソルに変換する方法。 +* マルチモーダル入力の場合、[Processor](./main_classes/processors)を使用してトークナイザと特徴抽出器または画像プロセッサを組み合わせる方法。 + + + +`AutoProcessor`は常に動作し、使用するモデルに適切なクラスを自動的に選択します。 +トークナイザ、画像プロセッサ、特徴抽出器、またはプロセッサを使用しているかにかかわらず、動作します。 + + + +始める前に、🤗 Datasetsをインストールして、いくつかのデータセットを試すことができるようにしてください: + +```bash +pip install datasets +``` + +## Natural Language Processing + + + +テキストデータの前処理に使用する主要なツールは、[トークナイザ](main_classes/tokenizer)です。トークナイザは、一連のルールに従ってテキストを*トークン*に分割します。トークンは数値に変換され、その後テンソルに変換され、モデルの入力となります。モデルが必要とする追加の入力は、トークナイザによって追加されます。 + + + +事前学習済みモデルを使用する予定の場合、関連する事前学習済みトークナイザを使用することが重要です。これにより、テキストが事前学習コーパスと同じ方法で分割され、事前学習中に通常*ボキャブ*として参照される対応するトークンインデックスを使用します。 + + + +[`AutoTokenizer.from_pretrained`]メソッドを使用して事前学習済みトークナイザをロードして、開始しましょう。これにより、モデルが事前学習された*ボキャブ*がダウンロードされます: + +```python +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") +``` + +次に、テキストをトークナイザに渡します: + +```py +>>> encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.") +>>> print(encoded_input) +{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102], + 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1]} +``` + +トークナイザは、重要な3つの項目を持つ辞書を返します: + +* [input_ids](glossary#input-ids) は文中の各トークンに対応するインデックスです。 +* [attention_mask](glossary#attention-mask) はトークンがアテンションを受ける必要があるかどうかを示します。 +* [token_type_ids](glossary#token-type-ids) は複数のシーケンスがある場合、トークンがどのシーケンスに属しているかを識別します。 + +`input_ids` をデコードして入力を返します: + +```python +>>> tokenizer.decode(encoded_input["input_ids"]) +'[CLS] 魔法使いの事に干渉するな、彼らは微妙で怒りっぽい。 [SEP]' +``` + +如何にお分かりいただけるかと思いますが、トークナイザはこの文章に2つの特別なトークン、`CLS`(クラシファイア)と`SEP`(セパレータ)を追加しました。 +すべてのモデルが特別なトークンを必要とするわけではありませんが、必要な場合、トークナイザは自動的にそれらを追加します。 + +複数の文章を前処理する場合、トークナイザにリストとして渡してください: + +```py +>>> batch_sentences = [ +... "But what about second breakfast?", +... "Don't think he knows about second breakfast, Pip.", +... "What about elevensies?", +... ] +>>> encoded_inputs = tokenizer(batch_sentences) +>>> print(encoded_inputs) +{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102], + [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], + [101, 1327, 1164, 5450, 23434, 136, 102]], + 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0]], + 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1, 1]]} +``` + +### Pad + +文章は常に同じ長さではないことがあり、これはテンソル(モデルの入力)が均一な形状を持つ必要があるため問題となります。 +パディングは、短い文に特別な「パディングトークン」を追加して、テンソルを長いシーケンスに合わせるための戦略です。 + +バッチ内の短いシーケンスを最長のシーケンスに合わせるために、`padding`パラメータを`True`に設定します: + +```py +>>> batch_sentences = [ +... "But what about second breakfast?", +... "Don't think he knows about second breakfast, Pip.", +... "What about elevensies?", +... ] +>>> encoded_input = tokenizer(batch_sentences, padding=True) +>>> print(encoded_input) +{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], + [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], + [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], + 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], + 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], + [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]} +``` + +1番目と3番目の文は、短いために`0`でパディングされています。 + +### Truncation + +逆のスペクトルでは、時折、モデルが処理するのに長すぎるシーケンスがあるかもしれません。この場合、シーケンスを短縮する必要があります。 + +モデルが受け入れる最大の長さにシーケンスを切り詰めるには、`truncation`パラメータを`True`に設定します: + +```py +>>> batch_sentences = [ +... "But what about second breakfast?", +... "Don't think he knows about second breakfast, Pip.", +... "What about elevensies?", +... 
] +>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True) +>>> print(encoded_input) +{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], + [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], + [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], + 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], + 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], + [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]} +``` + + + +異なるパディングと切り詰めの引数について詳しくは、[パディングと切り詰め](./pad_truncation)のコンセプトガイドをご覧ください。 + + + +### Build tensors + +最後に、トークナイザがモデルに供給される実際のテンソルを返すように設定します。 + +`return_tensors`パラメータを`pt`(PyTorch用)または`tf`(TensorFlow用)に設定します: + + + + +```py +>>> batch_sentences = [ +... "But what about second breakfast?", +... "Don't think he knows about second breakfast, Pip.", +... "What about elevensies?", +... ] +>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt") +>>> print(encoded_input) +{'input_ids': tensor([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], + [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], + [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]]), + 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), + 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], + [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])} +``` + + +```py +>>> batch_sentences = [ +... "But what about second breakfast?", +... "Don't think he knows about second breakfast, Pip.", +... "What about elevensies?", +... ] +>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf") +>>> print(encoded_input) +{'input_ids': , + 'token_type_ids': , + 'attention_mask': } +``` + + + +## Audio + +オーディオタスクの場合、データセットをモデル用に準備するために[特徴抽出器](main_classes/feature_extractor)が必要です。 +特徴抽出器は生のオーディオデータから特徴を抽出し、それらをテンソルに変換するために設計されています。 + +[PolyAI/minds14](https://huggingface.co/datasets/PolyAI/minds14)データセットをロードして(データセットのロード方法の詳細については🤗 [Datasetsチュートリアル](https://huggingface.co/docs/datasets/load_hub)を参照)、 +オーディオデータセットで特徴抽出器をどのように使用できるかを確認してみましょう: + +```python +>>> from datasets import load_dataset, Audio + +>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train") +``` + +アクセスして`audio`列の最初の要素を確認します。`audio`列を呼び出すと、自動的にオーディオファイルが読み込まれ、リサンプリングされます: + +```py +>>> dataset[0]["audio"] +{'array': array([ 0. , 0.00024414, -0.00024414, ..., -0.00024414, + 0. , 0. ], dtype=float32), + 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav', + 'sampling_rate': 8000} +``` + +これにより、3つのアイテムが返されます: + +* `array` は読み込まれた音声信号で、1Dの配列として読み込まれます。必要に応じてリサンプリングされることもあります。 +* `path` は音声ファイルの場所を指します。 +* `sampling_rate` は音声信号内のデータポイントが1秒間にいくつ測定されるかを示します。 + +このチュートリアルでは、[Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base)モデルを使用します。 +モデルカードを確認すると、Wav2Vec2が16kHzのサンプリングされた音声オーディオで事前学習されていることがわかります。 +モデルの事前学習に使用されたデータセットのサンプリングレートと、あなたのオーディオデータのサンプリングレートが一致することが重要です。 +データのサンプリングレートが異なる場合、データをリサンプリングする必要があります。 + +1. 
🤗 Datasetsの [`~datasets.Dataset.cast_column`] メソッドを使用して、サンプリングレートを16kHzにアップサンプリングします: + +```py +>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000)) +``` + +2. 再び `audio` 列を呼び出してオーディオファイルをリサンプルします: + +```py +>>> dataset[0]["audio"] +{'array': array([ 2.3443763e-05, 2.1729663e-04, 2.2145823e-04, ..., + 3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32), + 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav', + 'sampling_rate': 16000} +``` + +次に、入力を正規化しパディングするために特徴抽出器をロードします。テキストデータをパディングする場合、短いシーケンスには `0` が追加されます。同じ考え方がオーディオデータにも適用されます。特徴抽出器は `array` に `0` を追加します(これは無音として解釈されます)。 + +[`AutoFeatureExtractor.from_pretrained`]を使用して特徴抽出器をロードします: + +```python +>>> from transformers import AutoFeatureExtractor + +>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base") +``` + +オーディオ `array` を特徴抽出器に渡します。特徴抽出器で発生する可能性のある無音エラーをより良くデバッグするために、特徴抽出器に `sampling_rate` 引数を追加することをお勧めします。 + +```python +>>> audio_input = [dataset[0]["audio"]["array"]] +>>> feature_extractor(audio_input, sampling_rate=16000) +{'input_values': [array([ 3.8106556e-04, 2.7506407e-03, 2.8015103e-03, ..., + 5.6335266e-04, 4.6588284e-06, -1.7142107e-04], dtype=float32)]} +``` + +同様に、トークナイザと同様に、バッチ内の可変シーケンスを処理するためにパディングまたは切り詰めを適用できます。次に、これらの2つのオーディオサンプルのシーケンス長を確認してみましょう: + +```python +>>> dataset[0]["audio"]["array"].shape +(173398,) + +>>> dataset[1]["audio"]["array"].shape +(106496,) +``` + +この関数は、データセットを前処理してオーディオサンプルの長さを同じにするためのものです。最大サンプル長を指定し、特徴抽出器はシーケンスをそれに合わせてパディングまたは切り詰めます。 + +```py +>>> def preprocess_function(examples): +... audio_arrays = [x["array"] for x in examples["audio"]] +... inputs = feature_extractor( +... audio_arrays, +... sampling_rate=16000, +... padding=True, +... max_length=100000, +... truncation=True, +... ) +... return inputs +``` + +`preprocess_function`をデータセットの最初の数例に適用します: + +```python +>>> processed_dataset = preprocess_function(dataset[:5]) +``` + +サンプルの長さは現在同じで、指定された最大長と一致しています。これで処理されたデータセットをモデルに渡すことができます! + +```py +>>> processed_dataset["input_values"][0].shape +(100000,) + +>>> processed_dataset["input_values"][1].shape +(100000,) +``` + +## Computer Vision + +コンピュータビジョンタスクでは、モデル用にデータセットを準備するための[画像プロセッサ](main_classes/image_processor)が必要です。 +画像の前処理には、画像をモデルが期待する入力形式に変換するためのいくつかのステップが含まれています。これらのステップには、リサイズ、正規化、カラーチャネルの補正、および画像をテンソルに変換するなどが含まれます。 + + + +画像の前処理は、通常、画像の増強の形式に従います。画像の前処理と画像の増強の両方は画像データを変換しますが、異なる目的があります: + +* 画像の増強は、過学習を防ぎ、モデルの堅牢性を向上させるのに役立つ方法で画像を変更します。データを増強する方法は無限で、明るさや色の調整、クロップ、回転、リサイズ、ズームなど、様々な方法があります。ただし、増強操作によって画像の意味が変わらないように注意する必要があります。 +* 画像の前処理は、画像がモデルの期待する入力形式と一致することを保証します。コンピュータビジョンモデルをファインチューニングする場合、画像はモデルが最初にトレーニングされたときとまったく同じ方法で前処理する必要があります。 + +画像の増強には任意のライブラリを使用できます。画像の前処理には、モデルに関連付けられた`ImageProcessor`を使用します。 + + + +コンピュータビジョンのデータセットで画像プロセッサを使用する方法を示すために、[food101](https://huggingface.co/datasets/food101)データセットをロードします(データセットのロード方法の詳細については🤗[Datasetsチュートリアル](https://huggingface.co/docs/datasets/load_hub)を参照): + + + +データセットがかなり大きいため、🤗 Datasetsの`split`パラメータを使用してトレーニングデータの小さなサンプルのみをロードします! + + + +```python +>>> from datasets import load_dataset + +>>> dataset = load_dataset("food101", split="train[:100]") +``` + +次に、🤗 Datasetsの [`Image`](https://huggingface.co/docs/datasets/package_reference/main_classes?highlight=image#datasets.Image) 機能で画像を見てみましょう: + +```python +>>> dataset[0]["image"] +``` + +
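+
+`image` 列は PIL の `Image` オブジェクトとして読み込まれます。参考までに、サンプルごとに画像サイズが異なり得ることを確認する小さな例を以下に示します(出力はデータセットの内容に依存するため、ここでは省略しています):
+
+```python
+>>> # 最初の数サンプルの (幅, 高さ) を確認します
+>>> [dataset[i]["image"].size for i in range(3)]  # doctest: +SKIP
+```
+
+入力画像の形状が揃っていない場合があるため、後述のリサイズや切り抜きなどの前処理が必要になります。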
+
+(図: food101 データセットのサンプル画像)
+
+ +AutoImageProcessorを[`AutoImageProcessor.from_pretrained`]を使用してロードします: + +```py +>>> from transformers import AutoImageProcessor + +>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224") +``` + +1. まず、画像の拡張を追加しましょう。好きなライブラリを使用できますが、このチュートリアルではtorchvisionの[`transforms`](https://pytorch.org/vision/stable/transforms.html)モジュールを使用します。別のデータ拡張ライブラリを使用したい場合は、[Albumentations](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_albumentations.ipynb)または[Kornia notebooks](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_kornia.ipynb)で詳細を学ぶことができます。 + + ここでは、[`Compose`](https://pytorch.org/vision/master/generated/torchvision.transforms.Compose.html)を使用していくつかの変換を連鎖させます - [`RandomResizedCrop`](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html)と[`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html)。 + サイズの変更に関しては、`image_processor`から画像サイズの要件を取得できます。 + 一部のモデルでは、正確な高さと幅が必要ですが、他のモデルでは`shortest_edge`のみが定義されています。 + +```py +>>> from torchvision.transforms import RandomResizedCrop, ColorJitter, Compose + +>>> size = ( +... image_processor.size["shortest_edge"] +... if "shortest_edge" in image_processor.size +... else (image_processor.size["height"], image_processor.size["width"]) +... ) + +>>> _transforms = Compose([RandomResizedCrop(size), ColorJitter(brightness=0.5, hue=0.5)]) +``` + +2. モデルは[`pixel_values`](model_doc/visionencoderdecoder#transformers.VisionEncoderDecoderModel.forward.pixel_values)を入力として受け取ります。 +`ImageProcessor`は画像の正規化と適切なテンソルの生成を処理できます。 +一連の画像に対する画像拡張と画像前処理を組み合わせ、`pixel_values`を生成する関数を作成します: + +```python +>>> def transforms(examples): +... images = [_transforms(img.convert("RGB")) for img in examples["image"]] +... examples["pixel_values"] = image_processor(images, do_resize=False, return_tensors="pt")["pixel_values"] +... return examples +``` + + + +上記の例では、画像のサイズ変更を既に画像増強変換で行っているため、`do_resize=False`を設定しました。 +適切な `image_processor` からの `size` 属性を活用しています。画像増強中に画像のサイズ変更を行わない場合は、このパラメータを省略してください。 +デフォルトでは、`ImageProcessor` がサイズ変更を処理します。 + +画像を増強変換の一部として正規化したい場合は、`image_processor.image_mean` と `image_processor.image_std` の値を使用してください。 + + +3. 次に、🤗 Datasetsの[`set_transform`](https://huggingface.co/docs/datasets/process#format-transform)を使用して、変換をリアルタイムで適用します: + +```python +>>> dataset.set_transform(transforms) +``` + +4. 画像にアクセスすると、画像プロセッサが `pixel_values` を追加したことがわかります。これで処理済みのデータセットをモデルに渡すことができます! + +```python +>>> dataset[0].keys() +``` + +以下は、変換が適用された後の画像の外観です。 画像はランダムに切り抜かれ、その色の特性も異なります。 + +```py +>>> import numpy as np +>>> import matplotlib.pyplot as plt + +>>> img = dataset[0]["pixel_values"] +>>> plt.imshow(img.permute(1, 2, 0)) +``` + +
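+
+なお、`pixel_values` は正規化済みのため、そのまま表示すると色が不自然に見えることがあります。表示用に正規化を元に戻したい場合の簡単なスケッチを以下に示します(`image_processor` が `image_mean` と `image_std` を持つことを前提とした補足例です):
+
+```py
+>>> import torch
+>>> import matplotlib.pyplot as plt
+
+>>> img = dataset[0]["pixel_values"]
+>>> mean = torch.tensor(image_processor.image_mean)[:, None, None]  # (3, 1, 1) の形にして画像全体へブロードキャスト
+>>> std = torch.tensor(image_processor.image_std)[:, None, None]
+>>> unnormalized = (img * std + mean).clamp(0, 1)  # 正規化の逆変換を適用し、表示用に [0, 1] へクリップ
+>>> plt.imshow(unnormalized.permute(1, 2, 0))  # doctest: +SKIP
+```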
+
+(図: 変換適用後の画像)
+
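+
+処理済みのデータセットは、たとえば PyTorch の `DataLoader` でバッチ化してモデルに渡せます。以下は、バッチ内の画像サイズが揃っている場合のシンプルな collate 関数のスケッチです(food101 の `label` 列を想定した補足例で、列名は使用するデータセットに合わせて読み替えてください)。サイズが揃っていない場合の扱いは、この後の Pad セクションを参照してください:
+
+```py
+>>> import torch
+>>> from torch.utils.data import DataLoader
+
+>>> def collate_fn(batch):
+...     # 変換済みの pixel_values を積み重ね、ラベルをテンソル化します
+...     pixel_values = torch.stack([example["pixel_values"] for example in batch])
+...     labels = torch.tensor([example["label"] for example in batch])
+...     return {"pixel_values": pixel_values, "labels": labels}
+
+>>> dataloader = DataLoader(dataset, batch_size=4, collate_fn=collate_fn)
+>>> batch = next(iter(dataloader))
+>>> batch["pixel_values"].shape  # doctest: +SKIP
+```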
+ + + +オブジェクト検出、意味セグメンテーション、インスタンスセグメンテーション、およびパノプティックセグメンテーションなどのタスクの場合、`ImageProcessor`は +ポスト処理メソッドを提供します。これらのメソッドは、モデルの生の出力を境界ボックスやセグメンテーションマップなどの意味のある予測に変換します。 + + + +### Pad + +一部の場合、たとえば、[DETR](./model_doc/detr)をファインチューニングする場合、モデルはトレーニング時にスケールの変更を適用します。 +これにより、バッチ内の画像のサイズが異なる場合があります。[`DetrImageProcessor`]から[`DetrImageProcessor.pad`]を使用し、 +カスタムの`collate_fn`を定義して画像を一緒にバッチ処理できます。 + +```py +>>> def collate_fn(batch): +... pixel_values = [item["pixel_values"] for item in batch] +... encoding = image_processor.pad(pixel_values, return_tensors="pt") +... labels = [item["labels"] for item in batch] +... batch = {} +... batch["pixel_values"] = encoding["pixel_values"] +... batch["pixel_mask"] = encoding["pixel_mask"] +... batch["labels"] = labels +... return batch +``` + +## Multi Modal + +マルチモーダル入力を使用するタスクの場合、モデル用にデータセットを準備するための[プロセッサ](main_classes/processors)が必要です。プロセッサは、トークナイザや特徴量抽出器などの2つの処理オブジェクトを結合します。 + +自動音声認識(ASR)のためのプロセッサの使用方法を示すために、[LJ Speech](https://huggingface.co/datasets/lj_speech)データセットをロードします(データセットのロード方法の詳細については🤗 [Datasets チュートリアル](https://huggingface.co/docs/datasets/load_hub)を参照): + +```python +>>> from datasets import load_dataset + +>>> lj_speech = load_dataset("lj_speech", split="train") +``` + +ASR(自動音声認識)の場合、主に `audio` と `text` に焦点を当てているため、他の列を削除できます: + +```python +>>> lj_speech = lj_speech.map(remove_columns=["file", "id", "normalized_text"]) +``` + +次に、`audio`と`text`の列を見てみましょう: + +```python +>>> lj_speech[0]["audio"] +{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ..., + 7.3242188e-04, 2.1362305e-04, 6.1035156e-05], dtype=float32), + 'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav', + 'sampling_rate': 22050} + +>>> lj_speech[0]["text"] +'Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition' +``` + +常に、オーディオデータセットのサンプリングレートを、モデルの事前学習に使用されたデータセットのサンプリングレートと一致させるように[リサンプル](preprocessing#audio)する必要があります! + +```py +>>> lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000)) +``` + +プロセッサを [`AutoProcessor.from_pretrained`] を使用してロードします: + +```py +>>> from transformers import AutoProcessor + +>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h") +``` + +1. `array`内に含まれるオーディオデータを`input_values`に処理し、`text`を`labels`にトークン化する関数を作成します: + +```py +>>> def prepare_dataset(example): +... audio = example["audio"] + +... example.update(processor(audio=audio["array"], text=example["text"], sampling_rate=16000)) + +... return example +``` + +2. サンプルに`prepare_dataset`関数を適用します: + +```py +>>> prepare_dataset(lj_speech[0]) +``` diff --git a/docs/source/ja/quicktour.md b/docs/source/ja/quicktour.md new file mode 100644 index 00000000000000..3bec2f827a47ee --- /dev/null +++ b/docs/source/ja/quicktour.md @@ -0,0 +1,588 @@ + + +# Quick tour + +[[open-in-colab]] + +🤗 Transformersを使い始めましょう! 
開発者であろうと、日常的なユーザーであろうと、このクイックツアーは +初めて始めるのを支援し、[`pipeline`]を使った推論方法、[AutoClass](./model_doc/auto)で事前学習済みモデルとプリプロセッサをロードする方法、 +そしてPyTorchまたはTensorFlowで素早くモデルをトレーニングする方法を示します。 初心者の場合、ここで紹介されたコンセプトの詳細な説明を提供する +チュートリアルまたは[コース](https://huggingface.co/course/chapter1/1)を次に参照することをお勧めします。 + +始める前に、必要なライブラリがすべてインストールされていることを確認してください: + +```bash +!pip install transformers datasets +``` + +あなたはまた、好きな機械学習フレームワークをインストールする必要があります: + + + + +```bash +pip install torch +``` + + + +```bash +pip install tensorflow +``` + + + +## Pipeline + + + +[`pipeline`] は、事前学習済みモデルを推論に最も簡単で高速な方法です。 +[`pipeline`] を使用することで、さまざまなモダリティにわたる多くのタスクに対して即座に使用できます。 +いくつかのタスクは以下の表に示されています: + + + +使用可能なタスクの完全な一覧については、[pipeline API リファレンス](./main_classes/pipelines)を確認してください。 + + + +| **タスク** | **説明** | **モダリティ** | **パイプライン識別子** | +|------------------------------|--------------------------------------------------------------------------------------------------------------|-----------------|-----------------------------------------------| +| テキスト分類 | テキストのシーケンスにラベルを割り当てる | NLP | pipeline(task="sentiment-analysis") | +| テキスト生成 | プロンプトを指定してテキストを生成する | NLP | pipeline(task="text-generation") | +| 要約 | テキストまたはドキュメントの要約を生成する | NLP | pipeline(task="summarization") | +| 画像分類 | 画像にラベルを割り当てる | コンピュータビジョン | pipeline(task="image-classification") | +| 画像セグメンテーション | 画像の各個別のピクセルにラベルを割り当てる(セマンティック、パノプティック、およびインスタンスセグメンテーションをサポート) | コンピュータビジョン | pipeline(task="image-segmentation") | +| オブジェクト検出 | 画像内のオブジェクトの境界ボックスとクラスを予測する | コンピュータビジョン | pipeline(task="object-detection") | +| オーディオ分類 | オーディオデータにラベルを割り当てる | オーディオ | pipeline(task="audio-classification") | +| 自動音声認識 | 音声をテキストに変換する | オーディオ | pipeline(task="automatic-speech-recognition") | +| ビジュアルクエスチョン応答 | 画像と質問が与えられた場合に、画像に関する質問に回答する | マルチモーダル | pipeline(task="vqa") | +| ドキュメントクエスチョン応答 | ドキュメントと質問が与えられた場合に、ドキュメントに関する質問に回答する | マルチモーダル | pipeline(task="document-question-answering") | +| 画像キャプショニング | 与えられた画像にキャプションを生成する | マルチモーダル | pipeline(task="image-to-text") | + +まず、[`pipeline`] のインスタンスを作成し、使用したいタスクを指定します。 +このガイドでは、センチメント分析のために [`pipeline`] を使用する例を示します: + +```python +>>> from transformers import pipeline + +>>> classifier = pipeline("sentiment-analysis") +``` + +[`pipeline`]は、感情分析のためのデフォルトの[事前学習済みモデル](https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english)とトークナイザをダウンロードしてキャッシュし、使用できるようになります。 +これで、`classifier`を対象のテキストに使用できます: + +```python +>>> classifier("私たちは🤗 Transformersライブラリをお見せできてとても嬉しいです。") +[{'label': 'POSITIVE', 'score': 0.9998}] +``` + +複数の入力がある場合は、[`pipeline`]に入力をリストとして渡して、辞書のリストを返します: + +```py +>>> results = classifier(["🤗 Transformersライブラリをご紹介できて非常に嬉しいです。", "嫌いにならないでほしいです。"]) +>>> for result in results: +... 
print(f"label: {result['label']}, スコア: {round(result['score'], 4)}") +label: POSITIVE, スコア: 0.9998 +label: NEGATIVE, スコア: 0.5309 +``` + +[`pipeline`]は、任意のタスクに対してデータセット全体を繰り返し処理することもできます。この例では、自動音声認識をタスクとして選びましょう: + +```python +>>> import torch +>>> from transformers import pipeline + +>>> speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h") +``` + +オーディオデータセットをロードします(詳細については🤗 Datasets [クイックスタート](https://huggingface.co/docs/datasets/quickstart#audio)を参照してください)。 +たとえば、[MInDS-14](https://huggingface.co/datasets/PolyAI/minds14)データセットをロードします: + +```python +>>> from datasets import load_dataset, Audio + +>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train") # doctest: +IGNORE_RESULT +``` + +データセットのサンプリングレートが[`facebook/wav2vec2-base-960h`](https://huggingface.co/facebook/wav2vec2-base-960h)がトレーニングされたサンプリングレートと一致することを確認してください: + +```py +>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=speech_recognizer.feature_extractor.sampling_rate)) +``` + +"audio"列を呼び出すと、オーディオファイルは自動的にロードされ、リサンプリングされます。最初の4つのサンプルから生の波形配列を抽出し、それをパイプラインにリストとして渡します。 + +```py +>>> result = speech_recognizer(dataset[:4]["audio"]) +>>> print([d["text"] for d in result]) +['I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT', "FONDERING HOW I'D SET UP A JOIN TO HELL T WITH MY WIFE AND WHERE THE AP MIGHT BE", "I I'D LIKE TOY SET UP A JOINT ACCOUNT WITH MY PARTNER I'M NOT SEEING THE OPTION TO DO IT ON THE APSO I CALLED IN TO GET SOME HELP CAN I JUST DO IT OVER THE PHONE WITH YOU AND GIVE YOU THE INFORMATION OR SHOULD I DO IT IN THE AP AN I'M MISSING SOMETHING UQUETTE HAD PREFERRED TO JUST DO IT OVER THE PHONE OF POSSIBLE THINGS", 'HOW DO I FURN A JOINA COUT'] +``` + +大規模なデータセットで、入力が大きい場合(音声や画像など)、すべての入力をメモリに読み込む代わりに、リストではなくジェネレータを渡すことがお勧めです。詳細については[パイプラインAPIリファレンス](./main_classes/pipelines)を参照してください。 + +### Use another model and tokenizer in the pipeline + +[`pipeline`]は[Hub](https://huggingface.co/models)からの任意のモデルを収容でき、他のユースケースに[`pipeline`]を適応させることが容易です。たとえば、フランス語のテキストを処理できるモデルが必要な場合、Hubのタグを使用して適切なモデルをフィルタリングできます。トップのフィルタリングされた結果は、フランス語のテキストに使用できる感情分析用に調整された多言語の[BERTモデル](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment)を返します: + +```py +>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" +``` + + + +[`AutoModelForSequenceClassification`]と[`AutoTokenizer`]を使用して事前学習済みモデルとそれに関連するトークナイザをロードします(次のセクションで`AutoClass`について詳しく説明します): + +```python +>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification + +>>> model = AutoModelForSequenceClassification.from_pretrained(model_name) +>>> tokenizer = AutoTokenizer.from_pretrained(model_name) +``` + + + +以下のコードは、[`TFAutoModelForSequenceClassification`]および[`AutoTokenizer`]を使用して、事前学習済みモデルとその関連するトークナイザをロードする方法を示しています(`TFAutoClass`については次のセクションで詳しく説明します): + +```python +>>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification + +>>> model = TFAutoModelForSequenceClassification.from_pretrained(model_name) +>>> tokenizer = AutoTokenizer.from_pretrained(model_name) +``` + + + +指定したモデルとトークナイザを[`pipeline`]に設定し、今度はフランス語のテキストに`classifier`を適用できます: + +```py +>>> classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer) +>>> classifier("Nous sommes très heureux de vous présenter la bibliothèque 🤗 Transformers.") +[{'label': '5 stars', 'score': 0.7273}] +``` + +もし、あなたのユースケースに適したモデルが見つからない場合、事前学習済みモデルをあなたのデータでファインチューニングする必要があります。 +ファインチューニングの方法については、[ファインチューニングのチュートリアル](./training)をご覧ください。 
+最後に、ファインチューニングした事前学習済みモデルを共有し、コミュニティと共有ハブで共有することを検討してください。これにより、機械学習を民主化する手助けができます! 🤗 + +## AutoClass + + + +[`AutoModelForSequenceClassification`] および [`AutoTokenizer`] クラスは、上記で使用した [`pipeline`] を駆動するために協力して動作します。 +[AutoClass](./model_doc/auto) は、事前学習済みモデルのアーキテクチャをその名前またはパスから自動的に取得するショートカットです。 +適切な `AutoClass` を選択し、それに関連する前処理クラスを選択するだけで済みます。 + +前のセクションからの例に戻り、`AutoClass` を使用して [`pipeline`] の結果を再現する方法を見てみましょう。 + +### AutoTokenizer + +トークナイザはテキストをモデルの入力として使用できる数値の配列に前処理する役割を果たします。 +トークナイゼーションプロセスには、単語をどのように分割するかや、単語をどのレベルで分割するかといった多くのルールがあります +(トークナイゼーションについての詳細は [トークナイザサマリー](./tokenizer_summary) をご覧ください)。 +最も重要なことは、モデルが事前学習済みになったときと同じトークナイゼーションルールを使用するために、同じモデル名でトークナイザをインスタンス化する必要があることです。 + +[`AutoTokenizer`] を使用してトークナイザをロードします: + +```python +>>> from transformers import AutoTokenizer + +>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" +>>> tokenizer = AutoTokenizer.from_pretrained(model_name) +``` + +Pass your text to the tokenizer: + +```python +>>> encoding = tokenizer("私たちは🤗 Transformersライブラリをお見せできてとても嬉しいです。") +>>> print(encoding) +{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102], + 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]} +``` + +トークナイザは、次の情報を含む辞書を返します: + +- [input_ids](./glossary#input-ids): トークンの数値表現。 +- [attention_mask](.glossary#attention-mask): どのトークンにアテンションを向けるかを示します。 + +トークナイザはまた、入力のリストを受け入れ、一様な長さのバッチを返すためにテキストをパディングおよび切り詰めることができます。 + + + + +```py +>>> pt_batch = tokenizer( +... ["🤗 Transformersライブラリをお見せできて非常に嬉しいです。", "嫌いではないことを願っています。"], +... padding=True, +... truncation=True, +... max_length=512, +... return_tensors="pt", +... ) +``` + + + +```py +>>> tf_batch = tokenizer( +... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."], +... padding=True, +... truncation=True, +... max_length=512, +... return_tensors="tf", +... 
) +``` + + + + + +[前処理](./preprocessing)チュートリアルをご覧いただき、トークナイゼーションの詳細や、[`AutoImageProcessor`]、[`AutoFeatureExtractor`]、[`AutoProcessor`]を使用して画像、オーディオ、およびマルチモーダル入力を前処理する方法について詳しく説明されているページもご覧ください。 + + + +### AutoModel + + + +🤗 Transformersは事前学習済みインスタンスを簡単に統一的にロードする方法を提供します。 +これは、[`AutoTokenizer`]をロードするのと同じように[`AutoModel`]をロードできることを意味します。 +タスクに適した[`AutoModel`]を選択する以外の違いはありません。 +テキスト(またはシーケンス)分類の場合、[`AutoModelForSequenceClassification`]をロードする必要があります: + +```py +>>> from transformers import AutoModelForSequenceClassification + +>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" +>>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name) +``` + + + +[`AutoModel`]クラスでサポートされているタスクに関する詳細については、[タスクの概要](./task_summary)を参照してください。 + + + +今、前処理済みのバッチを直接モデルに渡します。辞書を展開するだけで、`**`を追加する必要があります: + +```python +>>> pt_outputs = pt_model(**pt_batch) +``` + +モデルは、`logits`属性に最終的なアクティベーションを出力します。 `logits`にsoftmax関数を適用して確率を取得します: + +```py +>>> from torch import nn + +>>> pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1) +>>> print(pt_predictions) +tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725], + [0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=) +``` + + + +🤗 Transformersは事前学習済みインスタンスをロードするためのシンプルで統一された方法を提供します。 +これは、[`TFAutoModel`]を[`AutoTokenizer`]をロードするのと同じようにロードできることを意味します。 +唯一の違いは、タスクに適した[`TFAutoModel`]を選択することです。 +テキスト(またはシーケンス)分類の場合、[`TFAutoModelForSequenceClassification`]をロードする必要があります: + +```py +>>> from transformers import TFAutoModelForSequenceClassification + +>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" +>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name) +``` + + + +詳細については、[`AutoModel`]クラスでサポートされているタスクに関する情報は、[タスクの概要](./task_summary)を参照してください。 + + + +次に、前処理済みのバッチを直接モデルに渡します。テンソルをそのまま渡すことができます: + +```python +>>> tf_outputs = tf_model(tf_batch) +``` + +モデルは`logits`属性に最終的なアクティベーションを出力します。`logits`にソフトマックス関数を適用して確率を取得します: + +```python +>>> import tensorflow as tf + +>>> tf_predictions = tf.nn.softmax(tf_outputs.logits, axis=-1) +>>> tf_predictions # doctest: +IGNORE_RESULT +``` + + + + + + +🤗 Transformersのすべてのモデル(PyTorchまたはTensorFlow)は、最終的な活性化関数(softmaxなど)*前*のテンソルを出力します。 +最終的な活性化関数は、しばしば損失と結合されているためです。モデルの出力は特別なデータクラスであり、その属性はIDEで自動補完されます。 +モデルの出力は、タプルまたは辞書のように動作します(整数、スライス、または文字列でインデックスを付けることができます)。 +この場合、Noneである属性は無視されます。 + + + +### Save a Model + + + +モデルをファインチューニングしたら、[`PreTrainedModel.save_pretrained`]を使用してトークナイザと共に保存できます: + +```py +>>> pt_save_directory = "./pt_save_pretrained" +>>> tokenizer.save_pretrained(pt_save_directory) # doctest: +IGNORE_RESULT +>>> pt_model.save_pretrained(pt_save_directory) +``` + +再びモデルを使用する準備ができたら、[`PreTrainedModel.from_pretrained`]を使用して再度ロードします: + +```py +>>> pt_model = AutoModelForSequenceClassification.from_pretrained("./pt_save_pretrained") +``` + + + +モデルをファインチューニングしたら、そのトークナイザを使用してモデルを保存できます。[`TFPreTrainedModel.save_pretrained`]を使用します: + +```py +>>> tf_save_directory = "./tf_save_pretrained" +>>> tokenizer.save_pretrained(tf_save_directory) # doctest: +IGNORE_RESULT +>>> tf_model.save_pretrained(tf_save_directory) +``` + +モデルを再度使用する準備ができたら、[`TFPreTrainedModel.from_pretrained`]を使用して再度ロードします: + +```py +>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained("./tf_save_pretrained") +``` + + + + +🤗 Transformersの特に素晴らしい機能の一つは、モデルを保存し、それをPyTorchモデルまたはTensorFlowモデルとして再ロードできることです。 `from_pt`または`from_tf`パラメータを使用してモデルをフレームワーク間で変換できます: + + + + +```py +>>> from transformers import AutoModel + +>>> tokenizer = AutoTokenizer.from_pretrained(tf_save_directory) +>>> 
pt_model = AutoModelForSequenceClassification.from_pretrained(tf_save_directory, from_tf=True) +``` + + + + +```py +>>> from transformers import TFAutoModel + +>>> tokenizer = AutoTokenizer.from_pretrained(pt_save_directory) +>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained(pt_save_directory, from_pt=True) +``` + + + +## Custom model builds + +モデルを構築方法を変更するには、モデルの設定クラスを変更できます。設定はモデルの属性を指定します。例えば、隠れ層の数やアテンションヘッドの数などがこれに含まれます。カスタム設定クラスからモデルを初期化する際には、ゼロから始めます。モデルの属性はランダムに初期化され、有意義な結果を得るためにモデルをトレーニングする必要があります。 + +最初に[`AutoConfig`]をインポートし、変更したい事前学習済みモデルをロードします。[`AutoConfig.from_pretrained`]内で、変更したい属性(例:アテンションヘッドの数)を指定できます: + +```python +>>> from transformers import AutoConfig + +>>> my_config = AutoConfig.from_pretrained("distilbert/distilbert-base-uncased", n_heads=12) +``` + + + +[`AutoModel.from_config`]を使用してカスタム設定からモデルを作成します: + +```python +>>> from transformers import AutoModel + +>>> my_model = AutoModel.from_config(my_config) +``` + + + +カスタム構成からモデルを作成するには、[`TFAutoModel.from_config`]を使用します: + +```py +>>> from transformers import TFAutoModel + +>>> my_model = TFAutoModel.from_config(my_config) +``` + + + + +[カスタムアーキテクチャを作成](./create_a_model)ガイドを参照して、カスタム構成の詳細情報を確認してください。 + +## Trainer - PyTorch向けの最適化されたトレーニングループ + +すべてのモデルは標準の[`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)であるため、通常のトレーニングループで使用できます。 +独自のトレーニングループを作成できますが、🤗 TransformersはPyTorch向けに[`Trainer`]クラスを提供しており、基本的なトレーニングループに加えて、 +分散トレーニング、混合精度などの機能の追加を行っています。 + +タスクに応じて、通常は[`Trainer`]に以下のパラメータを渡します: + +1. [`PreTrainedModel`]または[`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)から始めます: + + ```py + >>> from transformers import AutoModelForSequenceClassification + + >>> model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") + ``` + +2. [`TrainingArguments`]には、変更できるモデルのハイパーパラメータが含まれており、学習率、バッチサイズ、トレーニングエポック数などが変更できます。指定しない場合、デフォルト値が使用されます: + + ```py + >>> from transformers import TrainingArguments + + >>> training_args = TrainingArguments( + ... output_dir="path/to/save/folder/", + ... learning_rate=2e-5, + ... per_device_train_batch_size=8, + ... per_device_eval_batch_size=8, + ... num_train_epochs=2, + ... ) + ``` + +3. トークナイザ、画像プロセッサ、特徴量抽出器、またはプロセッサのような前処理クラスをロードします: + + ```py + >>> from transformers import AutoTokenizer + + >>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") + ``` + +4. データセットをロードする: + + ```py + >>> from datasets import load_dataset + + >>> dataset = load_dataset("rotten_tomatoes") # doctest: +IGNORE_RESULT + ``` + +5. データセットをトークン化するための関数を作成します: + + ```python + >>> def tokenize_dataset(dataset): + ... return tokenizer(dataset["text"]) + ``` + + その後、[`~datasets.Dataset.map`]を使用してデータセット全体に適用します: + + ```python + >>> dataset = dataset.map(tokenize_dataset, batched=True) + ``` + +6. データセットからの例のバッチを作成するための [`DataCollatorWithPadding`]: + + ```py + >>> from transformers import DataCollatorWithPadding + + >>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer) + ``` + +次に、これらのクラスを[`Trainer`]にまとめます: + +```python +>>> from transformers import Trainer + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=dataset["train"], +... eval_dataset=dataset["test"], +... tokenizer=tokenizer, +... data_collator=data_collator, +... 
) # doctest: +SKIP +``` + +訓練を開始する準備ができたら、[`~Trainer.train`]を呼び出してトレーニングを開始します: + +```py +>>> trainer.train() # doctest: +SKIP +``` + + + +翻訳や要約など、シーケンス間モデルを使用するタスクには、代わりに[`Seq2SeqTrainer`]と[`Seq2SeqTrainingArguments`]クラスを使用してください。 + + + +[`Trainer`]内のメソッドをサブクラス化することで、トレーニングループの動作をカスタマイズできます。これにより、損失関数、オプティマイザ、スケジューラなどの機能をカスタマイズできます。サブクラス化できるメソッドの一覧については、[`Trainer`]リファレンスをご覧ください。 + +トレーニングループをカスタマイズする別の方法は、[Callbacks](./main_classes/callbacks)を使用することです。コールバックを使用して他のライブラリと統合し、トレーニングループを監視して進捗状況を報告したり、トレーニングを早期に停止したりできます。コールバックはトレーニングループ自体には何も変更を加えません。損失関数などのカスタマイズを行う場合は、[`Trainer`]をサブクラス化する必要があります。 + +## Train with TensorFlow + +すべてのモデルは標準の[`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model)であるため、[Keras](https://keras.io/) APIを使用してTensorFlowでトレーニングできます。 +🤗 Transformersは、データセットを`tf.data.Dataset`として簡単にロードできるようにする[`~TFPreTrainedModel.prepare_tf_dataset`]メソッドを提供しており、Kerasの[`compile`](https://keras.io/api/models/model_training_apis/#compile-method)および[`fit`](https://keras.io/api/models/model_training_apis/#fit-method)メソッドを使用してトレーニングをすぐに開始できます。 + +1. [`TFPreTrainedModel`]または[`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model)から始めます: + + ```py + >>> from transformers import TFAutoModelForSequenceClassification + + >>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") + ``` + +2. トークナイザ、画像プロセッサ、特徴量抽出器、またはプロセッサのような前処理クラスをロードします: + + ```py + >>> from transformers import AutoTokenizer + + >>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") + ``` + +3. データセットをトークナイズするための関数を作成します: + + ```python + >>> def tokenize_dataset(dataset): + ... return tokenizer(dataset["text"]) # doctest: +SKIP + ``` + +4. [`~datasets.Dataset.map`]を使用してデータセット全体にトークナイザを適用し、データセットとトークナイザを[`~TFPreTrainedModel.prepare_tf_dataset`]に渡します。バッチサイズを変更し、データセットをシャッフルすることもできます。 + + ```python + >>> dataset = dataset.map(tokenize_dataset) # doctest: +SKIP + >>> tf_dataset = model.prepare_tf_dataset( + ... dataset["train"], batch_size=16, shuffle=True, tokenizer=tokenizer + ... ) # doctest: +SKIP + ``` + +5. 準備ができたら、`compile`と`fit`を呼び出してトレーニングを開始できます。 Transformersモデルはすべてデフォルトのタスクに関連する損失関数を持っているため、指定しない限り、損失関数を指定する必要はありません。 + + ```python + >>> from tensorflow.keras.optimizers import Adam + + >>> model.compile(optimizer=Adam(3e-5)) # 損失関数の引数は不要です! + >>> model.fit(tf + ``` + +## What's next? + +🤗 Transformersのクイックツアーを完了したら、ガイドをチェックして、カスタムモデルの作成、タスクのためのファインチューニング、スクリプトを使用したモデルのトレーニングなど、より具体的なことを学ぶことができます。🤗 Transformersのコアコンセプトについてもっと詳しく知りたい場合は、コンセプチュアルガイドを読んでみてください! 
diff --git a/docs/source/ja/run_scripts.md b/docs/source/ja/run_scripts.md new file mode 100644 index 00000000000000..af99d1c6da9702 --- /dev/null +++ b/docs/source/ja/run_scripts.md @@ -0,0 +1,370 @@ + + +# Train with a script + +🤗 Transformersの[notebooks](./notebooks/README)と一緒に、[PyTorch](https://github.com/huggingface/transformers/tree/main/examples/pytorch)、[TensorFlow](https://github.com/huggingface/transformers/tree/main/examples/tensorflow)、または[JAX/Flax](https://github.com/huggingface/transformers/tree/main/examples/flax)を使用してモデルをトレーニングする方法を示すサンプルスクリプトもあります。 + +また、私たちの[研究プロジェクト](https://github.com/huggingface/transformers/tree/main/examples/research_projects)や[レガシーの例](https://github.com/huggingface/transformers/tree/main/examples/legacy)で使用したスクリプトも見つかります。これらのスクリプトは現在メンテナンスされておらず、おそらく最新バージョンのライブラリと互換性がない特定の🤗 Transformersのバージョンが必要です。 + +サンプルスクリプトはすべての問題でそのまま動作することは期待されておらず、解決しようとしている問題にスクリプトを適応させる必要があるかもしれません。この点をサポートするために、ほとんどのスクリプトはデータがどのように前処理されているかを完全に公開し、必要に応じて編集できるようにしています。 + +サンプルスクリプトで実装したい機能がある場合は、[フォーラム](https://discuss.huggingface.co/)か[イシュートラッカー](https://github.com/huggingface/transformers/issues)で議論してからプルリクエストを提出してください。バグ修正は歓迎しますが、読みやすさのコストで機能を追加するプルリクエストはほとんどマージされない可能性が高いです。 + +このガイドでは、[PyTorch](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization)と[TensorFlow](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/summarization)で実行するサマリゼーショントレーニングスクリプトの実行方法を示します。すべての例は、明示的に指定されていない限り、両方のフレームワークともに動作することが期待されています。 + +## Setup + +最新バージョンのサンプルスクリプトを正常に実行するには、新しい仮想環境に🤗 Transformersをソースからインストールする必要があります: + + +```bash +git clone https://github.com/huggingface/transformers +cd transformers +pip install . +``` + +以前のスクリプトのバージョンについては、以下のトグルをクリックしてください: + +
+ 以前の🤗 Transformersのバージョンに関する例 + +
+ +次に、現在の🤗 Transformersのクローンを特定のバージョンに切り替えてください。たとえば、v3.5.1などです。 + + +```bash +git checkout tags/v3.5.1 +``` + + +適切なライブラリバージョンを設定したら、任意の例のフォルダに移動し、例固有の要件をインストールします: + + + +```bash +pip install -r requirements.txt +``` + +## Run a script + + + +この例のスクリプトは、🤗 [Datasets](https://huggingface.co/docs/datasets/) ライブラリからデータセットをダウンロードし、前処理を行います。次に、[Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) を使用して要約をサポートするアーキテクチャ上でデータセットをファインチューニングします。以下の例では、[CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail) データセット上で [T5-small](https://huggingface.co/google-t5/t5-small) をファインチューニングする方法が示されています。T5 モデルは、そのトレーニング方法に起因して追加の `source_prefix` 引数が必要です。このプロンプトにより、T5 はこれが要約タスクであることを知ることができます。 + + +```bash +python examples/pytorch/summarization/run_summarization.py \ + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --overwrite_output_dir \ + --predict_with_generate +``` + + + +この例のスクリプトは、🤗 [Datasets](https://huggingface.co/docs/datasets/) ライブラリからデータセットをダウンロードして前処理します。その後、スクリプトは要約をサポートするアーキテクチャ上で Keras を使用してデータセットをファインチューニングします。以下の例では、[T5-small](https://huggingface.co/google-t5/t5-small) を [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail) データセットでファインチューニングする方法を示しています。T5 モデルは、そのトレーニング方法に起因して追加の `source_prefix` 引数が必要です。このプロンプトは、T5 にこれが要約タスクであることを知らせます。 + + +```bash +python examples/tensorflow/summarization/run_summarization.py \ + --model_name_or_path google-t5/t5-small \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 16 \ + --num_train_epochs 3 \ + --do_train \ + --do_eval +``` + + + +## Distributed training and mixed precision + +[Trainer](https://huggingface.co/docs/transformers/main_classes/trainer)は、分散トレーニングと混合精度をサポートしています。つまり、この機能をスクリプトで使用することができます。これらの機能を有効にするには、次の手順を実行します。 + +- `fp16`引数を追加して混合精度を有効にします。 +- `nproc_per_node`引数で使用するGPUの数を設定します。 + +以下は提供されたBashコードです。このコードの日本語訳をMarkdown形式で記載します。 + +```bash +torchrun \ + --nproc_per_node 8 pytorch/summarization/run_summarization.py \ + --fp16 \ + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --overwrite_output_dir \ + --predict_with_generate +``` + + +TensorFlowスクリプトは、分散トレーニングに[`MirroredStrategy`](https://www.tensorflow.org/guide/distributed_training#mirroredstrategy)を使用し、トレーニングスクリプトに追加の引数を追加する必要はありません。TensorFlowスクリプトは、デフォルトで複数のGPUが利用可能な場合にそれらを使用します。 + +## Run a script on a TPU + + + +Tensor Processing Units (TPUs)は、パフォーマンスを加速させるために特別に設計されています。PyTorchは、[XLA](https://www.tensorflow.org/xla)ディープラーニングコンパイラを使用してTPUsをサポートしており、詳細については[こちら](https://github.com/pytorch/xla/blob/master/README.md)をご覧ください。TPUを使用するには、`xla_spawn.py`スクリプトを起動し、`num_cores`引数を使用して使用するTPUコアの数を設定します。 +```bash +python xla_spawn.py --num_cores 8 \ + summarization/run_summarization.py \ + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --overwrite_output_dir 
\ + --predict_with_generate +``` + + +もちろん、Tensor Processing Units(TPUs)は性能を高速化するために特別に設計されています。TensorFlowスクリプトは、TPUsでトレーニングするために[`TPUStrategy`](https://www.tensorflow.org/guide/distributed_training#tpustrategy)を利用します。TPUを使用するには、TPUリソースの名前を`tpu`引数に渡します。 + +```bash +python run_summarization.py \ + --tpu name_of_tpu_resource \ + --model_name_or_path google-t5/t5-small \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 16 \ + --num_train_epochs 3 \ + --do_train \ + --do_eval +``` + + + +## Run a script with 🤗 Accelerate + +🤗 [Accelerate](https://huggingface.co/docs/accelerate)は、PyTorch専用のライブラリで、CPUのみ、複数のGPU、TPUなど、さまざまなセットアップでモデルをトレーニングするための統一された方法を提供します。PyTorchのトレーニングループを完全に可視化しながら実行できます。まだインストールしていない場合は、🤗 Accelerateをインストールしてください: + +> 注意:Accelerateは急速に開発が進行しているため、スクリプトを実行するにはaccelerateのgitバージョンをインストールする必要があります +```bash +pip install git+https://github.com/huggingface/accelerate +``` + +代わりに、`run_summarization_no_trainer.py` スクリプトを使用する必要があります。 🤗 Accelerate がサポートするスクリプトには、フォルダ内に `task_no_trainer.py` ファイルが含まれています。まず、次のコマンドを実行して設定ファイルを作成し、保存します: + +```bash +accelerate config +``` + +テストを行い、設定が正しく構成されているか確認してください: + + +```bash +accelerate test +``` + +Now you are ready to launch the training: + + +```bash +accelerate launch run_summarization_no_trainer.py \ + --model_name_or_path google-t5/t5-small \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir ~/tmp/tst-summarization +``` + +## Use a custom dataset + +要約スクリプトは、CSVまたはJSON Lineファイルであれば、カスタムデータセットをサポートしています。独自のデータセットを使用する場合、いくつかの追加の引数を指定する必要があります。 + +- `train_file`および`validation_file`は、トレーニングとバリデーションのファイルへのパスを指定します。 +- `text_column`は要約するための入力テキストです。 +- `summary_column`は出力する対象テキストです。 + +カスタムデータセットを使用した要約スクリプトは、以下のようになります: + +```bash +python examples/pytorch/summarization/run_summarization.py \ + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --train_file path_to_csv_or_jsonlines_file \ + --validation_file path_to_csv_or_jsonlines_file \ + --text_column text_column_name \ + --summary_column summary_column_name \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --overwrite_output_dir \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --predict_with_generate +``` + +## Test a script + +すべてが予想通りに動作することを確認するために、データセット全体を処理する前に、データセットの一部の例でスクリプトを実行することは良いアイデアです。以下の引数を使用して、データセットを最大サンプル数に切り詰めます: + +- `max_train_samples` +- `max_eval_samples` +- `max_predict_samples` + +```bash +python examples/pytorch/summarization/run_summarization.py \ + --model_name_or_path google-t5/t5-small \ + --max_train_samples 50 \ + --max_eval_samples 50 \ + --max_predict_samples 50 \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --overwrite_output_dir \ + --predict_with_generate +``` + +一部の例のスクリプトは、`max_predict_samples`引数をサポートしていないことがあります。この引数がサポートされているかどうかがわからない場合は、`-h`引数を追加して確認してください。 + +```bash +examples/pytorch/summarization/run_summarization.py -h +``` + +## Resume training from checkpoint + +以前のチェックポイントからトレーニングを再開するための役立つオプションもあります。これにより、トレーニングが中断された場合でも、最初からやり直すことなく、中断したところから再開できます。チェックポイントからトレーニングを再開するための2つの方法があります。 + +最初の方法は、`output_dir previous_output_dir` 引数を使用して、`output_dir` 
に保存された最新のチェックポイントからトレーニングを再開する方法です。この場合、`overwrite_output_dir` を削除する必要があります: + +```bash +python examples/pytorch/summarization/run_summarization.py + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --output_dir previous_output_dir \ + --predict_with_generate +``` + +2番目の方法では、`resume_from_checkpoint path_to_specific_checkpoint` 引数を使用して、特定のチェックポイントフォルダからトレーニングを再開します。 + + +```bash +python examples/pytorch/summarization/run_summarization.py + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --overwrite_output_dir \ + --resume_from_checkpoint path_to_specific_checkpoint \ + --predict_with_generate +``` + +## Share your model + +すべてのスクリプトは、最終的なモデルを [Model Hub](https://huggingface.co/models) にアップロードできます。開始する前に Hugging Face にログインしていることを確認してください。 + +```bash +huggingface-cli login +``` + +次に、スクリプトに `push_to_hub` 引数を追加します。この引数は、Hugging Face のユーザー名と `output_dir` で指定したフォルダ名でリポジトリを作成します。 + +特定の名前をリポジトリに付けるには、`push_to_hub_model_id` 引数を使用して追加します。このリポジトリは自動的にあなたの名前空間の下にリストされます。 + +以下の例は、特定のリポジトリ名でモデルをアップロードする方法を示しています: + + + +```bash +python examples/pytorch/summarization/run_summarization.py + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --push_to_hub \ + --push_to_hub_model_id finetuned-t5-cnn_dailymail \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --overwrite_output_dir \ + --predict_with_generate +``` + + + + diff --git a/docs/source/ja/serialization.md b/docs/source/ja/serialization.md new file mode 100644 index 00000000000000..3e9d81180de046 --- /dev/null +++ b/docs/source/ja/serialization.md @@ -0,0 +1,191 @@ + + +# Export to ONNX + +🤗 Transformersモデルを本番環境に展開する際には、モデルを特殊なランタイムおよびハードウェアで読み込み、実行できるように、モデルをシリアライズされた形式にエクスポートすることが必要であるか、その恩恵を受けることができることがあります。 + +🤗 Optimumは、Transformersの拡張機能であり、PyTorchまたはTensorFlowからモデルをONNXやTFLiteなどのシリアライズされた形式にエクスポートすることを可能にする「exporters」モジュールを提供しています。また、🤗 Optimumは、最大の効率でターゲットハードウェアでモデルをトレーニングおよび実行するためのパフォーマンス最適化ツールも提供しています。 + +このガイドでは、🤗 Transformersモデルを🤗 Optimumを使用してONNXにエクスポートする方法を示しており、モデルをTFLiteにエクスポートする方法については[Export to TFLiteページ](tflite)を参照してください。 + +## Export to ONNX + +[ONNX(Open Neural Network eXchange)](http://onnx.ai)は、PyTorchおよびTensorFlowを含むさまざまなフレームワークで深層学習モデルを表現するための共通の一連の演算子とファイル形式を定義するオープンスタンダードです。モデルがONNX形式にエクスポートされると、これらの演算子はニューラルネットワークを介するデータの流れを表す計算グラフ(一般的には「中間表現」と呼ばれる)を構築するために使用されます。 + +標準化された演算子とデータ型を備えたグラフを公開することで、ONNXはフレームワーク間の切り替えを容易にします。たとえば、PyTorchでトレーニングされたモデルはONNX形式にエクスポートし、それをTensorFlowでインポートすることができます(逆も同様です)。 + +ONNX形式にエクスポートされたモデルは、以下のように使用できます: +- [グラフ最適化](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization)や[量子化](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/quantization)などのテクニックを使用して推論のために最適化。 +- [`ORTModelForXXX`クラス](https://huggingface.co/docs/optimum/onnxruntime/package_reference/modeling_ort)を介してONNX Runtimeで実行し、🤗 Transformersでおなじみの`AutoModel` APIに従います。 +- 
[最適化された推論パイプライン](https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/pipelines)を介して実行し、🤗 Transformersの[`pipeline`]関数と同じAPIを持っています。 + +🤗 Optimumは、設定オブジェクトを活用してONNXエクスポートをサポートしており、これらの設定オブジェクトは多くのモデルアーキテクチャ用に事前に作成されており、他のアーキテクチャにも簡単に拡張できるように設計されています。 + +事前に作成された設定のリストについては、[🤗 Optimumドキュメント](https://huggingface.co/docs/optimum/exporters/onnx/overview)を参照してください。 + +🤗 TransformersモデルをONNXにエクスポートする方法は2つあります。以下では両方の方法を示します: + +- export with 🤗 Optimum via CLI. +- export with 🤗 Optimum with `optimum.onnxruntime`. + +### Exporting a 🤗 Transformers model to ONNX with CLI + +🤗 TransformersモデルをONNXにエクスポートするには、まず追加の依存関係をインストールしてください: + +```bash +pip install optimum[exporters] +``` + +すべての利用可能な引数を確認するには、[🤗 Optimumドキュメント](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli)を参照してください。または、コマンドラインでヘルプを表示することもできます: + + +```bash +optimum-cli export onnx --help +``` + +🤗 Hubからモデルのチェックポイントをエクスポートするには、例えば `distilbert/distilbert-base-uncased-distilled-squad` を使いたい場合、以下のコマンドを実行してください: + +```bash +optimum-cli export onnx --model distilbert/distilbert-base-uncased-distilled-squad distilbert_base_uncased_squad_onnx/ +``` + +進行状況を示し、結果の `model.onnx` が保存される場所を表示するログは、以下のように表示されるはずです: + + +```bash +Validating ONNX model distilbert_base_uncased_squad_onnx/model.onnx... + -[✓] ONNX model output names match reference model (start_logits, end_logits) + - Validating ONNX Model output "start_logits": + -[✓] (2, 16) matches (2, 16) + -[✓] all values close (atol: 0.0001) + - Validating ONNX Model output "end_logits": + -[✓] (2, 16) matches (2, 16) + -[✓] all values close (atol: 0.0001) +The ONNX export succeeded and the exported model was saved at: distilbert_base_uncased_squad_onnx +``` + +上記の例は🤗 Hubからのチェックポイントのエクスポートを示しています。ローカルモデルをエクスポートする場合、まずモデルの重みとトークナイザのファイルを同じディレクトリ(`local_path`)に保存してください。CLIを使用する場合、🤗 Hubのチェックポイント名の代わりに`model`引数に`local_path`を渡し、`--task`引数を指定してください。[🤗 Optimumドキュメント](https://huggingface.co/docs/optimum/exporters/task_manager)でサポートされているタスクのリストを確認できます。`task`引数が指定されていない場合、タスク固有のヘッドを持たないモデルアーキテクチャがデフォルトで選択されます。 + + +```bash +optimum-cli export onnx --model local_path --task question-answering distilbert_base_uncased_squad_onnx/ +``` + +エクスポートされた `model.onnx` ファイルは、ONNX標準をサポートする[多くのアクセラレータ](https://onnx.ai/supported-tools.html#deployModel)の1つで実行できます。たとえば、[ONNX Runtime](https://onnxruntime.ai/)を使用してモデルを読み込み、実行する方法は以下の通りです: + + +```python +>>> from transformers import AutoTokenizer +>>> from optimum.onnxruntime import ORTModelForQuestionAnswering + +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert_base_uncased_squad_onnx") +>>> model = ORTModelForQuestionAnswering.from_pretrained("distilbert_base_uncased_squad_onnx") +>>> inputs = tokenizer("What am I using?", "Using DistilBERT with ONNX Runtime!", return_tensors="pt") +>>> outputs = model(**inputs) +``` + +🤗 HubからTensorFlowのチェックポイントをエクスポートするプロセスは、同様です。例えば、[Keras organization](https://huggingface.co/keras-io)から純粋なTensorFlowのチェックポイントをエクスポートする方法は以下の通りです: + + +```bash +optimum-cli export onnx --model keras-io/transformers-qa distilbert_base_cased_squad_onnx/ +``` + +### Exporting a 🤗 Transformers model to ONNX with `optimum.onnxruntime` + +CLIの代わりに、🤗 TransformersモデルをONNXにプログラム的にエクスポートすることもできます。以下のように行います: + +```python +>>> from optimum.onnxruntime import ORTModelForSequenceClassification +>>> from transformers import AutoTokenizer + +>>> model_checkpoint = "distilbert_base_uncased_squad" +>>> save_directory = "onnx/" + +>>> # Load a model from transformers and export it to 
ONNX +>>> ort_model = ORTModelForSequenceClassification.from_pretrained(model_checkpoint, export=True) +>>> tokenizer = AutoTokenizer.from_pretrained(model_checkpoint) + +>>> # Save the onnx model and tokenizer +>>> ort_model.save_pretrained(save_directory) +>>> tokenizer.save_pretrained(save_directory) +``` + +### Exporting a model for an unsupported architecture + +現在エクスポートできないモデルをサポートするために貢献したい場合、まず[`optimum.exporters.onnx`](https://huggingface.co/docs/optimum/exporters/onnx/overview)でサポートされているかどうかを確認し、サポートされていない場合は[🤗 Optimumに貢献](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute)してください。 + +### Exporting a model with `transformers.onnx` + + + +`transformers.onnx`はもはやメンテナンスされていないため、モデルを上記で説明したように🤗 Optimumでエクスポートしてください。このセクションは将来のバージョンで削除されます。 + + + +🤗 TransformersモデルをONNXにエクスポートするには、追加の依存関係をインストールしてください: + + +```bash +pip install transformers[onnx] +``` + +`transformers.onnx`パッケージをPythonモジュールとして使用して、事前に用意された設定を使用してチェックポイントをエクスポートする方法は以下の通りです: + +```bash +python -m transformers.onnx --model=distilbert/distilbert-base-uncased onnx/ +``` + +この方法は、`--model`引数で定義されたチェックポイントのONNXグラフをエクスポートします。🤗 Hubのいずれかのチェックポイントまたはローカルに保存されたチェックポイントを渡すことができます。エクスポートされた`model.onnx`ファイルは、ONNX標準をサポートする多くのアクセラレータで実行できます。例えば、ONNX Runtimeを使用してモデルを読み込んで実行する方法は以下の通りです: + + +```python +>>> from transformers import AutoTokenizer +>>> from onnxruntime import InferenceSession + +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") +>>> session = InferenceSession("onnx/model.onnx") +>>> # ONNX Runtime expects NumPy arrays as input +>>> inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np") +>>> outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs)) +``` + +必要な出力名(例: `["last_hidden_state"]`)は、各モデルのONNX構成を確認することで取得できます。例えば、DistilBERTの場合、次のようになります: + + +```python +>>> from transformers.models.distilbert import DistilBertConfig, DistilBertOnnxConfig + +>>> config = DistilBertConfig() +>>> onnx_config = DistilBertOnnxConfig(config) +>>> print(list(onnx_config.outputs.keys())) +["last_hidden_state"] +``` + +ハブから純粋なTensorFlowのチェックポイントをプログラム的にエクスポートするプロセスは、以下のように同様です: + +```bash +python -m transformers.onnx --model=keras-io/transformers-qa onnx/ +``` + +ローカルに保存されたモデルをエクスポートする場合、モデルの重みとトークナイザのファイルを同じディレクトリに保存してください(例: `local-pt-checkpoint`)。その後、`transformers.onnx`パッケージの `--model`引数を希望するディレクトリに向けて設定して、ONNXにエクスポートします: + + +```bash +python -m transformers.onnx --model=local-pt-checkpoint onnx/ +``` + diff --git a/docs/source/ja/task_summary.md b/docs/source/ja/task_summary.md new file mode 100644 index 00000000000000..0069f6afaf3205 --- /dev/null +++ b/docs/source/ja/task_summary.md @@ -0,0 +1,355 @@ + + +# What 🤗 Transformers can do + +🤗 Transformersは、自然言語処理(NLP)、コンピュータビジョン、音声処理などの最新の事前学習済みモデルのライブラリです。このライブラリには、Transformerモデルだけでなく、コンピュータビジョンタスク向けのモダンな畳み込みニューラルネットワークなど、Transformer以外のモデルも含まれています。現代のスマートフォン、アプリ、テレビなど、最も人気のある消費者製品の多くには、深層学習技術が使用されています。スマートフォンで撮影した写真から背景オブジェクトを削除したいですか?これはパノプティック・セグメンテーション(まだ何を意味するかわからなくても心配しないでください、以下のセクションで説明します!)のタスクの一例です。 + +このページでは、🤗 Transformersライブラリを使用して、たった3行のコードで解決できるさまざまな音声および音声、コンピュータビジョン、NLPタスクの概要を提供します! 
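+
+「たった3行のコード」のイメージをつかむために、最小限の例を先に示しておきます(ここでは感情分析を例にしており、タスクに応じた既定のモデルが自動的にダウンロードされます。詳細は以下の各セクションを参照してください):
+
+```py
+>>> from transformers import pipeline
+
+>>> classifier = pipeline(task="sentiment-analysis")
+>>> classifier("Hugging Face is the best thing since sliced bread!")  # doctest: +SKIP
+```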
+ +## Audio + +音声と音声処理のタスクは、他のモダリティとは少し異なります。なぜなら、入力としての生の音声波形は連続的な信号であるからです。テキストとは異なり、生の音声波形は文章を単語に分割できるようにきれいに分割できません。これを解決するために、通常、生の音声信号は定期的な間隔でサンプリングされます。間隔内でより多くのサンプルを取ると、サンプリングレートが高くなり、音声はより元の音声ソースに近づきます。 + +以前のアプローチでは、音声を前処理してそれから有用な特徴を抽出しました。しかし、現在では、生の音声波形を特徴エンコーダに直接フィードし、音声表現を抽出することから始めることが一般的です。これにより、前処理ステップが簡素化され、モデルは最も重要な特徴を学習できます。 + +### Audio classification + +音声分類は、事前に定義されたクラスのセットから音声データにラベルを付けるタスクです。これは多くの具体的なアプリケーションを含む広範なカテゴリであり、いくつかの例は次のとおりです: + +- 音響シーン分類:音声にシーンラベルを付ける(「オフィス」、「ビーチ」、「スタジアム」) +- 音響イベント検出:音声に音声イベントラベルを付ける(「車のクラクション」、「クジラの呼び声」、「ガラスの破裂」) +- タギング:複数の音を含む音声にラベルを付ける(鳥の鳴き声、会議でのスピーカー識別) +- 音楽分類:音楽にジャンルラベルを付ける(「メタル」、「ヒップホップ」、「カントリー」) + + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline(task="audio-classification", model="superb/hubert-base-superb-er") +>>> preds = classifier("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds] +>>> preds +[{'score': 0.4532, 'label': 'hap'}, + {'score': 0.3622, 'label': 'sad'}, + {'score': 0.0943, 'label': 'neu'}, + {'score': 0.0903, 'label': 'ang'}] +``` + +### Automatic speech recognition + +自動音声認識(ASR)は音声をテキストに変換します。これは、人間のコミュニケーションの自然な形式である音声の一部として、部分的にそのような一般的なオーディオタスクの一つです。今日、ASRシステムはスピーカー、スマートフォン、自動車などの「スマート」テクノロジープロダクトに組み込まれています。私たちは仮想アシスタントに音楽を再生してもらったり、リマインダーを設定してもらったり、天気を教えてもらったりできます。 + +しかし、Transformerアーキテクチャが助けた主要な課題の一つは、低リソース言語におけるものです。大量の音声データで事前トレーニングし、低リソース言語でラベル付き音声データをわずか1時間だけでファインチューニングすることでも、以前のASRシステムと比較して高品質な結果を得ることができます。以前のASRシステムは100倍以上のラベル付きデータでトレーニングされていましたが、Transformerアーキテクチャはこの問題に貢献しました。 + + +```py +>>> from transformers import pipeline + +>>> transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-small") +>>> transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'} +``` + + +## Computer vision + +最初で初めて成功したコンピュータビジョンのタスクの一つは、[畳み込みニューラルネットワーク(CNN)](glossary#convolution)を使用して郵便番号の画像を認識することでした。画像はピクセルから構成され、各ピクセルには数値があります。これにより、画像をピクセル値の行列として簡単に表現できます。特定のピクセル値の組み合わせは、画像の色を記述します。 + +コンピュータビジョンのタスクを解決する一般的な方法は次の2つです: + +1. 畳み込みを使用して、低レベルの特徴から高レベルの抽象的な情報まで、画像の階層的な特徴を学習する。 +2. 画像をパッチに分割し、各画像パッチが画像全体とどのように関連しているかを徐々に学習するためにTransformerを使用する。CNNが好むボトムアップアプローチとは異なり、これはぼんやりとした画像から始めて徐々に焦点を合わせるようなものです。 + +### Image classification + +画像分類は、事前に定義されたクラスのセットから画像全体にラベルを付けます。多くの分類タスクと同様に、画像分類には多くの実用的な用途があり、その一部は次のとおりです: + +* 医療:疾患の検出や患者の健康の監視に使用するために医療画像にラベルを付ける +* 環境:森林伐採の監視、野生地の管理情報、または山火事の検出に使用するために衛星画像にラベルを付ける +* 農業:作物の健康を監視するための作物の画像や、土地利用の監視のための衛星画像にラベルを付ける +* 生態学:野生動物の個体数を監視したり、絶滅危惧種を追跡するために動植物の種の画像にラベルを付ける + + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline(task="image-classification") +>>> preds = classifier( +... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +... 
) +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds] +>>> print(*preds, sep="\n") +{'score': 0.4335, 'label': 'lynx, catamount'} +{'score': 0.0348, 'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor'} +{'score': 0.0324, 'label': 'snow leopard, ounce, Panthera uncia'} +{'score': 0.0239, 'label': 'Egyptian cat'} +{'score': 0.0229, 'label': 'tiger cat'} +``` + +### Object detection + +画像分類とは異なり、オブジェクト検出は画像内の複数のオブジェクトを識別し、オブジェクトの位置を画像内で定義する境界ボックスによって特定します。オブジェクト検出の例には次のような用途があります: + +* 自動運転車:他の車両、歩行者、信号機などの日常の交通オブジェクトを検出 +* リモートセンシング:災害モニタリング、都市計画、天候予測 +* 欠陥検出:建物のクラックや構造上の損傷、製造上の欠陥を検出 + + +```py +>>> from transformers import pipeline + +>>> detector = pipeline(task="object-detection") +>>> preds = detector( +... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +... ) +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"], "box": pred["box"]} for pred in preds] +>>> preds +[{'score': 0.9865, + 'label': 'cat', + 'box': {'xmin': 178, 'ymin': 154, 'xmax': 882, 'ymax': 598}}] +``` + +### Image segmentation + +画像セグメンテーションは、画像内のすべてのピクセルをクラスに割り当てるピクセルレベルのタスクです。これはオブジェクト検出とは異なり、オブジェクトをラベル付けし、予測するために境界ボックスを使用する代わりに、セグメンテーションはより詳細になります。セグメンテーションはピクセルレベルでオブジェクトを検出できます。画像セグメンテーションにはいくつかのタイプがあります: + +* インスタンスセグメンテーション:オブジェクトのクラスをラベル付けするだけでなく、オブジェクトの個別のインスタンス("犬-1"、"犬-2")もラベル付けします。 +* パノプティックセグメンテーション:セマンティックセグメンテーションとインスタンスセグメンテーションの組み合わせ。セマンティッククラスごとに各ピクセルにラベルを付け、オブジェクトの個別のインスタンスもラベル付けします。 + +セグメンテーションタスクは、自動運転車にとって、周囲の世界のピクセルレベルのマップを作成し、歩行者や他の車両を安全に回避できるようにするのに役立ちます。また、医療画像では、タスクの細かい粒度が異常な細胞や臓器の特徴を識別するのに役立ちます。画像セグメンテーションは、eコマースで衣類を仮想的に試着したり、カメラを通じて実世界にオブジェクトを重ねて拡張現実の体験を作成したりするためにも使用できます。 + + + +```py +>>> from transformers import pipeline + +>>> segmenter = pipeline(task="image-segmentation") +>>> preds = segmenter( +... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +... ) +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds] +>>> print(*preds, sep="\n") +{'score': 0.9879, 'label': 'LABEL_184'} +{'score': 0.9973, 'label': 'snow'} +{'score': 0.9972, 'label': 'cat'} +``` + +### Depth estimation + +深度推定は、画像内の各ピクセルがカメラからの距離を予測します。このコンピュータビジョンタスクは、特にシーンの理解と再構築に重要です。たとえば、自動運転車では、歩行者、交通標識、他の車などの物体がどれだけ遠いかを理解し、障害物や衝突を回避するために必要です。深度情報はまた、2D画像から3D表現を構築し、生物学的構造や建物の高品質な3D表現を作成するのに役立ちます。 + +深度推定には次の2つのアプローチがあります: + +* ステレオ:深度は、わずかに異なる角度からの同じ画像の2つの画像を比較して推定されます。 +* モノキュラー:深度は単一の画像から推定されます。 + + +```py +>>> from transformers import pipeline + +>>> depth_estimator = pipeline(task="depth-estimation") +>>> preds = depth_estimator( +... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +... ) +``` + + +## Natural language processing + +NLPタスクは、テキストが私たちにとって自然なコミュニケーション手段であるため、最も一般的なタスクの一つです。モデルが認識するための形式にテキストを変換するには、トークン化が必要です。これは、テキストのシーケンスを単語やサブワード(トークン)に分割し、それらのトークンを数字に変換することを意味します。その結果、テキストのシーケンスを数字のシーケンスとして表現し、一度数字のシーケンスがあれば、さまざまなNLPタスクを解決するためにモデルに入力できます! 
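+
+トークン化の流れを具体的にイメージするために、小さな例を示します(チェックポイントは説明用の一例で、得られるトークンや ID は使用するトークナイザによって異なります):
+
+```py
+>>> from transformers import AutoTokenizer
+
+>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
+>>> encoding = tokenizer("Transformers can solve many NLP tasks.")
+>>> encoding["input_ids"]  # テキストを数値のシーケンスへ変換したもの # doctest: +SKIP
+```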
+ +### Text classification + +どんなモダリティの分類タスクと同様に、テキスト分類は事前に定義されたクラスのセットからテキストのシーケンス(文レベル、段落、またはドキュメントであることがあります)にラベルを付けます。テキスト分類には多くの実用的な用途があり、その一部は次のとおりです: + +* 感情分析:`positive`や`negative`のような極性に従ってテキストにラベルを付け、政治、金融、マーケティングなどの分野での意思決定をサポートします。 +* コンテンツ分類:テキストをトピックに従ってラベル付けし、ニュースやソーシャルメディアのフィード内の情報を整理し、フィルタリングするのに役立ちます(`天気`、`スポーツ`、`金融`など)。 + + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline(task="sentiment-analysis") +>>> preds = classifier("Hugging Face is the best thing since sliced bread!") +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds] +>>> preds +[{'score': 0.9991, 'label': 'POSITIVE'}] +``` + +### Token classification + +どんなNLPタスクでも、テキストはテキストのシーケンスを個々の単語やサブワードに分割して前処理されます。これらは[トークン](/glossary#token)として知られています。トークン分類は、事前に定義されたクラスのセットから各トークンにラベルを割り当てます。 + +トークン分類の一般的なタイプは次の2つです: + +* 固有表現認識(NER):組織、人物、場所、日付などのエンティティのカテゴリに従ってトークンにラベルを付けます。NERは特にバイオメディカル環境で人気であり、遺伝子、タンパク質、薬物名などをラベル付けできます。 +* 品詞タグ付け(POS):名詞、動詞、形容詞などの品詞に従ってトークンにラベルを付けます。POSは、翻訳システムが同じ単語が文法的にどのように異なるかを理解するのに役立ちます(名詞としての「銀行」と動詞としての「銀行」など)。 + + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline(task="ner") +>>> preds = classifier("Hugging Face is a French company based in New York City.") +>>> preds = [ +... { +... "entity": pred["entity"], +... "score": round(pred["score"], 4), +... "index": pred["index"], +... "word": pred["word"], +... "start": pred["start"], +... "end": pred["end"], +... } +... for pred in preds +... ] +>>> print(*preds, sep="\n") +{'entity': 'I-ORG', 'score': 0.9968, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2} +{'entity': 'I-ORG', 'score': 0.9293, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7} +{'entity': 'I-ORG', 'score': 0.9763, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12} +{'entity': 'I-MISC', 'score': 0.9983, 'index': 6, 'word': 'French', 'start': 18, 'end': 24} +{'entity': 'I-LOC', 'score': 0.999, 'index': 10, 'word': 'New', 'start': 42, 'end': 45} +{'entity': 'I-LOC', 'score': 0.9987, 'index': 11, 'word': 'York', 'start': 46, 'end': 50} +{'entity': 'I-LOC', 'score': 0.9992, 'index': 12, 'word': 'City', 'start': 51, 'end': 55} +``` + +### Question answering + +質問応答は、コンテキスト(オープンドメイン)を含む場合と含まない場合(クローズドドメイン)がある場合もある、別のトークンレベルのタスクで、質問に対する回答を返します。このタスクは、仮想アシスタントにレストランが営業しているかどうかのような質問をするときに発生します。また、顧客や技術サポートを提供し、検索エンジンがあなたが求めている関連情報を取得するのにも役立ちます。 + +質問応答の一般的なタイプは次の2つです: + +* 抽出型:質問と一部のコンテキストが与えられた場合、モデルがコンテキストから抽出する必要のあるテキストのスパンが回答となります。 +* 抽象的:質問と一部のコンテキストが与えられた場合、回答はコンテキストから生成されます。このアプローチは、[`QuestionAnsweringPipeline`]ではなく[`Text2TextGenerationPipeline`]で処理されます。 + + +```py +>>> from transformers import pipeline + +>>> question_answerer = pipeline(task="question-answering") +>>> preds = question_answerer( +... question="What is the name of the repository?", +... context="The name of the repository is huggingface/transformers", +... ) +>>> print( +... f"score: {round(preds['score'], 4)}, start: {preds['start']}, end: {preds['end']}, answer: {preds['answer']}" +... 
) +score: 0.9327, start: 30, end: 54, answer: huggingface/transformers +``` + +### Summarization + +要約は、長いテキストから短いバージョンを作成し、元の文書の意味の大部分を保ちながら試みるタスクです。要約はシーケンスからシーケンスへのタスクであり、入力よりも短いテキストシーケンスを出力します。要約を行うことで、読者が主要なポイントを迅速に理解できるようにするのに役立つ長文書がたくさんあります。法案、法的および財務文書、特許、科学論文など、読者の時間を節約し読書の支援となる文書の例があります。 + +質問応答と同様に、要約には2つのタイプがあります: + +* 抽出的要約:元のテキストから最も重要な文を識別して抽出します。 +* 抽象的要約:元のテキストからターゲットの要約(入力文書に含まれていない新しい単語を含むことがあります)を生成します。[`SummarizationPipeline`]は抽象的なアプローチを使用しています。 + + +```py +>>> from transformers import pipeline + +>>> summarizer = pipeline(task="summarization") +>>> summarizer( +... "In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles." +... ) +[{'summary_text': ' The Transformer is the first sequence transduction model based entirely on attention . It replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention . For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers .'}] +``` + +### Translation + +翻訳は、ある言語のテキストシーケンスを別の言語に変換する作業です。これは異なるバックグラウンドを持つ人々がコミュニケーションをとるのに役立ち、広範な観客にコンテンツを翻訳して伝えるのに役立ち、新しい言語を学ぶのを支援する学習ツールにもなります。要約と共に、翻訳はシーケンス間のタスクであり、モデルは入力シーケンスを受け取り、ターゲットの出力シーケンスを返します。 + +初期の翻訳モデルは主に単一言語でしたが、最近では多言語モデルに対する関心が高まり、多くの言語対で翻訳できるような多言語モデルに注目が集まっています。 + + +```py +>>> from transformers import pipeline + +>>> text = "translate English to French: Hugging Face is a community-based open-source platform for machine learning." +>>> translator = pipeline(task="translation", model="google-t5/t5-small") +>>> translator(text) +[{'translation_text': "Hugging Face est une tribune communautaire de l'apprentissage des machines."}] +``` + +### 言語モデリング + +言語モデリングは、テキストのシーケンス内の単語を予測するタスクです。事前学習された言語モデルは、多くの他のダウンストリームタスクに対してファインチューニングできるため、非常に人気のあるNLPタスクとなっています。最近では、ゼロショットまたはフューショット学習を実証する大規模な言語モデル(LLM)に大きな関心が寄せられています。これは、モデルが明示的にトレーニングされていないタスクを解決できることを意味します!言語モデルは、流暢で説得力のあるテキストを生成するために使用できますが、テキストが常に正確であるわけではないため、注意が必要です。 + +言語モデリングには2つのタイプがあります: + +* 因果的:モデルの目標は、シーケンス内の次のトークンを予測することであり、将来のトークンはマスクされます。 + + ```py + >>> from transformers import pipeline + + >>> prompt = "Hugging Face is a community-based open-source platform for machine learning." + >>> generator = pipeline(task="text-generation") + >>> generator(prompt) # doctest: +SKIP + ``` + +* マスクされた:モデルの目的は、シーケンス内のトークン全体にアクセスしながら、シーケンス内のマスクされたトークンを予測することです。 + + ```py + >>> text = "Hugging Face is a community-based open-source for machine learning." + >>> fill_mask = pipeline(task="fill-mask") + >>> preds = fill_mask(text, top_k=1) + >>> preds = [ + ... { + ... "score": round(pred["score"], 4), + ... "token": pred["token"], + ... "token_str": pred["token_str"], + ... "sequence": pred["sequence"], + ... } + ... for pred in preds + ... 
] + >>> preds + [{'score': 0.2236, + 'token': 1761, + 'token_str': ' platform', + 'sequence': 'Hugging Face is a community-based open-source platform for machine learning.'}] + ``` + +## Multimodal + +マルチモーダルタスクは、特定の問題を解決するために複数のデータモダリティ(テキスト、画像、音声、ビデオ)を処理するためにモデルを必要とします。画像キャプショニングは、モデルが入力として画像を受け取り、画像を説明するテキストのシーケンスまたは画像のいくつかの特性を出力するマルチモーダルタスクの例です。 + +マルチモーダルモデルは異なるデータタイプまたはモダリティで作業しますが、内部的には前処理ステップがモデルにすべてのデータタイプを埋め込み(データに関する意味のある情報を保持するベクトルまたは数字のリスト)に変換するのを支援します。画像キャプショニングのようなタスクでは、モデルは画像の埋め込みとテキストの埋め込みの間の関係を学習します。 + +### Document question answering + +ドキュメント質問応答は、ドキュメントからの自然言語の質問に答えるタスクです。テキストを入力とするトークンレベルの質問応答タスクとは異なり、ドキュメント質問応答はドキュメントの画像とそのドキュメントに関する質問を受け取り、答えを返します。ドキュメント質問応答は構造化されたドキュメントを解析し、それから重要な情報を抽出するために使用できます。以下の例では、レシートから合計金額とお釣りを抽出することができます。 + + +```py +>>> from transformers import pipeline +>>> from PIL import Image +>>> import requests + +>>> url = "https://datasets-server.huggingface.co/assets/hf-internal-testing/example-documents/--/hf-internal-testing--example-documents/test/2/image/image.jpg" +>>> image = Image.open(requests.get(url, stream=True).raw) + +>>> doc_question_answerer = pipeline("document-question-answering", model="magorshunov/layoutlm-invoices") +>>> preds = doc_question_answerer( +... question="What is the total amount?", +... image=image, +... ) +>>> preds +[{'score': 0.8531, 'answer': '17,000', 'start': 4, 'end': 4}] +``` + +このページが各モダリティのタスクの種類とそれぞれの重要性についての追加の背景情報を提供できたことを願っています。次の [セクション](tasks_explained) では、🤗 トランスフォーマーがこれらのタスクを解決するために **どのように** 動作するかを学びます。 diff --git a/docs/source/ja/tasks/asr.md b/docs/source/ja/tasks/asr.md new file mode 100644 index 00000000000000..fd564abdc5c908 --- /dev/null +++ b/docs/source/ja/tasks/asr.md @@ -0,0 +1,380 @@ + + +# Automatic speech recognition + +[[open-in-colab]] + + + +自動音声認識 (ASR) は音声信号をテキストに変換し、一連の音声入力をテキスト出力にマッピングします。 Siri や Alexa などの仮想アシスタントは ASR モデルを使用してユーザーを日常的に支援しており、ライブキャプションや会議中のメモ取りなど、他にも便利なユーザー向けアプリケーションが数多くあります。 + +このガイドでは、次の方法を説明します。 + +1. [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) データセットの [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) を微調整して、音声をテキストに書き起こします。 +2. 
微調整したモデルを推論に使用します。 + + +このチュートリアルで説明するタスクは、次のモデル アーキテクチャでサポートされています。 + + + +[Data2VecAudio](../model_doc/data2vec-audio), [Hubert](../model_doc/hubert), [M-CTC-T](../model_doc/mctct), [SEW](../model_doc/sew), [SEW-D](../model_doc/sew-d), [UniSpeech](../model_doc/unispeech), [UniSpeechSat](../model_doc/unispeech-sat), [Wav2Vec2](../model_doc/wav2vec2), [Wav2Vec2-Conformer](../model_doc/wav2vec2-conformer), [WavLM](../model_doc/wavlm) + + + + + +始める前に、必要なライブラリがすべてインストールされていることを確認してください。 + +```bash +pip install transformers datasets evaluate jiwer +``` + +モデルをアップロードしてコミュニティと共有できるように、Hugging Face アカウントにログインすることをお勧めします。プロンプトが表示されたら、トークンを入力してログインします。 + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## Load MInDS-14 dataset + +まず、🤗 データセット ライブラリから [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) データセットの小さいサブセットをロードします。これにより、完全なデータセットのトレーニングにさらに時間を費やす前に、実験してすべてが機能することを確認する機会が得られます。 + +```py +>>> from datasets import load_dataset, Audio + +>>> minds = load_dataset("PolyAI/minds14", name="en-US", split="train[:100]") +``` + +[`~Dataset.train_test_split`] メソッドを使用して、データセットの `train` 分割をトレイン セットとテスト セットに分割します。 + +```py +>>> minds = minds.train_test_split(test_size=0.2) +``` + +次に、データセットを見てみましょう。 + +```py +>>> minds +DatasetDict({ + train: Dataset({ + features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'], + num_rows: 16 + }) + test: Dataset({ + features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'], + num_rows: 4 + }) +}) +``` + +データセットには`lang_id`や`english_transcription`などの多くの有用な情報が含まれていますが、このガイドでは「`audio`」と「`transciption`」に焦点を当てます。 [`~datasets.Dataset.remove_columns`] メソッドを使用して他の列を削除します。 + +```py +>>> minds = minds.remove_columns(["english_transcription", "intent_class", "lang_id"]) +``` + +もう一度例を見てみましょう。 + +```py +>>> minds["train"][0] +{'audio': {'array': array([-0.00024414, 0. , 0. 
, ..., 0.00024414, + 0.00024414, 0.00024414], dtype=float32), + 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav', + 'sampling_rate': 8000}, + 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav', + 'transcription': "hi I'm trying to use the banking app on my phone and currently my checking and savings account balance is not refreshing"} +``` + +次の 2 つのフィールドがあります。 + +- `audio`: 音声ファイルをロードしてリサンプリングするために呼び出す必要がある音声信号の 1 次元の `array`。 +- `transcription`: ターゲットテキスト。 + +## Preprocess + +次のステップでは、Wav2Vec2 プロセッサをロードしてオーディオ信号を処理します。 + +```py +>>> from transformers import AutoProcessor + +>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base") +``` + +MInDS-14 データセットのサンプリング レートは 8000kHz です (この情報は [データセット カード](https://huggingface.co/datasets/PolyAI/minds14) で確認できます)。つまり、データセットを再サンプリングする必要があります。事前トレーニングされた Wav2Vec2 モデルを使用するには、16000kHz に設定します。 + +```py +>>> minds = minds.cast_column("audio", Audio(sampling_rate=16_000)) +>>> minds["train"][0] +{'audio': {'array': array([-2.38064706e-04, -1.58618059e-04, -5.43987835e-06, ..., + 2.78103951e-04, 2.38446111e-04, 1.18740834e-04], dtype=float32), + 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav', + 'sampling_rate': 16000}, + 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav', + 'transcription': "hi I'm trying to use the banking app on my phone and currently my checking and savings account balance is not refreshing"} +``` + +上の `transcription` でわかるように、テキストには大文字と小文字が混在しています。 Wav2Vec2 トークナイザーは大文字のみでトレーニングされるため、テキストがトークナイザーの語彙と一致することを確認する必要があります。 + +```py +>>> def uppercase(example): +... return {"transcription": example["transcription"].upper()} + + +>>> minds = minds.map(uppercase) +``` + +次に、次の前処理関数を作成します。 + +1. `audio`列を呼び出して、オーディオ ファイルをロードしてリサンプリングします。 +2. オーディオ ファイルから `input_values` を抽出し、プロセッサを使用して `transcription` 列をトークン化します。 + +```py +>>> def prepare_dataset(batch): +... audio = batch["audio"] +... batch = processor(audio["array"], sampling_rate=audio["sampling_rate"], text=batch["transcription"]) +... batch["input_length"] = len(batch["input_values"][0]) +... return batch +``` + +データセット全体に前処理関数を適用するには、🤗 Datasets [`~datasets.Dataset.map`] 関数を使用します。 `num_proc` パラメータを使用してプロセスの数を増やすことで、`map` を高速化できます。 [`~datasets.Dataset.remove_columns`] メソッドを使用して、不要な列を削除します。 + +```py +>>> encoded_minds = minds.map(prepare_dataset, remove_columns=minds.column_names["train"], num_proc=4) +``` + +🤗 Transformers には ASR 用のデータ照合器がないため、[`DataCollat​​orWithPadding`] を調整してサンプルのバッチを作成する必要があります。また、テキストとラベルが (データセット全体ではなく) バッチ内の最も長い要素の長さに合わせて動的に埋め込まれ、均一な長さになります。 `padding=True` を設定すると、`tokenizer` 関数でテキストを埋め込むことができますが、動的な埋め込みの方が効率的です。 + +他のデータ照合器とは異なり、この特定のデータ照合器は、`input_values`と `labels`」に異なるパディング方法を適用する必要があります。 + +```py +>>> import torch + +>>> from dataclasses import dataclass, field +>>> from typing import Any, Dict, List, Optional, Union + + +>>> @dataclass +... class DataCollatorCTCWithPadding: +... processor: AutoProcessor +... padding: Union[bool, str] = "longest" + +... def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]: +... 
# split inputs and labels since they have to be of different lengths and need +... # different padding methods +... input_features = [{"input_values": feature["input_values"][0]} for feature in features] +... label_features = [{"input_ids": feature["labels"]} for feature in features] + +... batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt") + +... labels_batch = self.processor.pad(labels=label_features, padding=self.padding, return_tensors="pt") + +... # replace padding with -100 to ignore loss correctly +... labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100) + +... batch["labels"] = labels + +... return batch +``` + +次に、`DataCollat​​orForCTCWithPadding` をインスタンス化します。 + +```py +>>> data_collator = DataCollatorCTCWithPadding(processor=processor, padding="longest") +``` + +## Evaluate + +トレーニング中にメトリクスを含めると、多くの場合、モデルのパフォーマンスを評価するのに役立ちます。 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) ライブラリを使用して、評価メソッドをすばやくロードできます。このタスクでは、[単語エラー率](https://huggingface.co/spaces/evaluate-metric/wer) (WER) メトリクスを読み込みます (🤗 Evaluate [クイック ツアー](https://huggingface.co/docs/evaluate/a_quick_tour) を参照して、メトリクスをロードして計算する方法の詳細を確認してください)。 + +```py +>>> import evaluate + +>>> wer = evaluate.load("wer") +``` + +次に、予測とラベルを [`~evaluate.EvaluationModule.compute`] に渡して WER を計算する関数を作成します。 + +```py +>>> import numpy as np + + +>>> def compute_metrics(pred): +... pred_logits = pred.predictions +... pred_ids = np.argmax(pred_logits, axis=-1) + +... pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id + +... pred_str = processor.batch_decode(pred_ids) +... label_str = processor.batch_decode(pred.label_ids, group_tokens=False) + +... wer = wer.compute(predictions=pred_str, references=label_str) + +... return {"wer": wer} +``` + +これで`compute_metrics`関数の準備が整いました。トレーニングをセットアップするときにこの関数に戻ります。 + +## Train + + + + + +[`Trainer`] を使用したモデルの微調整に慣れていない場合は、[ここ](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 + + + +これでモデルのトレーニングを開始する準備が整いました。 [`AutoModelForCTC`] で Wav2Vec2 をロードします。 `ctc_loss_reduction` パラメータで適用する削減を指定します。多くの場合、デフォルトの合計ではなく平均を使用する方が適切です。 + +```py +>>> from transformers import AutoModelForCTC, TrainingArguments, Trainer + +>>> model = AutoModelForCTC.from_pretrained( +... "facebook/wav2vec2-base", +... ctc_loss_reduction="mean", +... pad_token_id=processor.tokenizer.pad_token_id, +... ) +``` + +この時点で残っている手順は次の 3 つだけです。 + +1. [`TrainingArguments`] でトレーニング ハイパーパラメータを定義します。唯一の必須パラメータは、モデルの保存場所を指定する `output_dir` です。 `push_to_hub=True`を設定して、このモデルをハブにプッシュします (モデルをアップロードするには、Hugging Face にサインインする必要があります)。各エポックの終了時に、[`トレーナー`] は WER を評価し、トレーニング チェックポイントを保存します。 +2. トレーニング引数を、モデル、データセット、トークナイザー、データ照合器、および `compute_metrics` 関数とともに [`Trainer`] に渡します。 +3. [`~Trainer.train`] を呼び出してモデルを微調整します。 + +```py +>>> training_args = TrainingArguments( +... output_dir="my_awesome_asr_mind_model", +... per_device_train_batch_size=8, +... gradient_accumulation_steps=2, +... learning_rate=1e-5, +... warmup_steps=500, +... max_steps=2000, +... gradient_checkpointing=True, +... fp16=True, +... group_by_length=True, +... evaluation_strategy="steps", +... per_device_eval_batch_size=8, +... save_steps=1000, +... eval_steps=1000, +... logging_steps=25, +... load_best_model_at_end=True, +... metric_for_best_model="wer", +... greater_is_better=False, +... push_to_hub=True, +... ) + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=encoded_minds["train"], +... eval_dataset=encoded_minds["test"], +... 
tokenizer=processor, +... data_collator=data_collator, +... compute_metrics=compute_metrics, +... ) + +>>> trainer.train() +``` + +トレーニングが完了したら、 [`~transformers.Trainer.push_to_hub`] メソッドを使用してモデルをハブに共有し、誰もがモデルを使用できるようにします。 + +```py +>>> trainer.push_to_hub() +``` + + + + + + +自動音声認識用にモデルを微調整する方法のより詳細な例については、英語 ASR および英語のこのブログ [投稿](https://huggingface.co/blog/fine-tune-wav2vec2-english) を参照してください。多言語 ASR については、この [投稿](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2) を参照してください。 + + + +## Inference + +モデルを微調整したので、それを推論に使用できるようになりました。 + +推論を実行したい音声ファイルをロードします。必要に応じて、オーディオ ファイルのサンプリング レートをモデルのサンプリング レートと一致するようにリサンプリングすることを忘れないでください。 + +```py +>>> from datasets import load_dataset, Audio + +>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train") +>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000)) +>>> sampling_rate = dataset.features["audio"].sampling_rate +>>> audio_file = dataset[0]["audio"]["path"] +``` + +推論用に微調整されたモデルを試す最も簡単な方法は、それを [`pipeline`] で使用することです。モデルを使用して自動音声認識用の`pipeline`をインスタンス化し、オーディオ ファイルをそれに渡します。 + +```py +>>> from transformers import pipeline + +>>> transcriber = pipeline("automatic-speech-recognition", model="stevhliu/my_awesome_asr_minds_model") +>>> transcriber(audio_file) +{'text': 'I WOUD LIKE O SET UP JOINT ACOUNT WTH Y PARTNER'} +``` + + + +転写はまあまあですが、もっと良くなる可能性があります。さらに良い結果を得るには、より多くの例でモデルを微調整してみてください。 + + + +必要に応じて、「パイプライン」の結果を手動で複製することもできます。 + + + + +プロセッサをロードしてオーディオ ファイルと文字起こしを前処理し、`input`を PyTorch テンソルとして返します。 + +```py +>>> from transformers import AutoProcessor + +>>> processor = AutoProcessor.from_pretrained("stevhliu/my_awesome_asr_mind_model") +>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt") +``` + +Pass your inputs to the model and return the logits: + +```py +>>> from transformers import AutoModelForCTC + +>>> model = AutoModelForCTC.from_pretrained("stevhliu/my_awesome_asr_mind_model") +>>> with torch.no_grad(): +... logits = model(**inputs).logits +``` + +最も高い確率で予測された `input_ids` を取得し、プロセッサを使用して予測された `input_ids` をデコードしてテキストに戻します。 + + +```py +>>> import torch + +>>> predicted_ids = torch.argmax(logits, dim=-1) +>>> transcription = processor.batch_decode(predicted_ids) +>>> transcription +['I WOUL LIKE O SET UP JOINT ACOUNT WTH Y PARTNER'] +``` + + + diff --git a/docs/source/ja/tasks/audio_classification.md b/docs/source/ja/tasks/audio_classification.md new file mode 100644 index 00000000000000..58d42f3f4d4ff1 --- /dev/null +++ b/docs/source/ja/tasks/audio_classification.md @@ -0,0 +1,330 @@ + + +# Audio classification + +[[open-in-colab]] + + + + +音声分類では、テキストと同様に、入力データから出力されたクラス ラベルを割り当てます。唯一の違いは、テキスト入力の代わりに生のオーディオ波形があることです。音声分類の実際的な応用例には、話者の意図、言語分類、さらには音による動物の種類の識別などがあります。 + +このガイドでは、次の方法を説明します。 + +1. [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) データセットで [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) を微調整して話者の意図を分類します。 +2. 
微調整したモデルを推論に使用します。 + + +このチュートリアルで説明するタスクは、次のモデル アーキテクチャでサポートされています。 + + + +[Audio Spectrogram Transformer](../model_doc/audio-spectrogram-transformer), [Data2VecAudio](../model_doc/data2vec-audio), [Hubert](../model_doc/hubert), [SEW](../model_doc/sew), [SEW-D](../model_doc/sew-d), [UniSpeech](../model_doc/unispeech), [UniSpeechSat](../model_doc/unispeech-sat), [Wav2Vec2](../model_doc/wav2vec2), [Wav2Vec2-Conformer](../model_doc/wav2vec2-conformer), [WavLM](../model_doc/wavlm), [Whisper](../model_doc/whisper) + + + + + +始める前に、必要なライブラリがすべてインストールされていることを確認してください。 + +```bash +pip install transformers datasets evaluate +``` + +モデルをアップロードしてコミュニティと共有できるように、Hugging Face アカウントにログインすることをお勧めします。プロンプトが表示されたら、トークンを入力してログインします。 + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## Load MInDS-14 dataset + +まず、🤗 データセット ライブラリから MInDS-14 データセットをロードします。 + +```py +>>> from datasets import load_dataset, Audio + +>>> minds = load_dataset("PolyAI/minds14", name="en-US", split="train") +``` + +[`~datasets.Dataset.train_test_split`] メソッドを使用して、データセットの `train` をより小さなトレインとテスト セットに分割します。これにより、完全なデータセットにさらに時間を費やす前に、実験してすべてが機能することを確認する機会が得られます。 + +```py +>>> minds = minds.train_test_split(test_size=0.2) +``` + +次に、データセットを見てみましょう。 + +```py +>>> minds +DatasetDict({ + train: Dataset({ + features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'], + num_rows: 450 + }) + test: Dataset({ + features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'], + num_rows: 113 + }) +}) +``` + + +データセットには`lang_id`や`english_transcription`などの多くの有用な情報が含まれていますが、このガイドでは`audio`と`intent_class`に焦点を当てます。 [`~datasets.Dataset.remove_columns`] メソッドを使用して他の列を削除します。 + +```py +>>> minds = minds.remove_columns(["path", "transcription", "english_transcription", "lang_id"]) +``` + +ここで例を見てみましょう。 + +```py +>>> minds["train"][0] +{'audio': {'array': array([ 0. , 0. , 0. , ..., -0.00048828, + -0.00024414, -0.00024414], dtype=float32), + 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602b9a5fbb1e6d0fbce91f52.wav', + 'sampling_rate': 8000}, + 'intent_class': 2} +``` + +次の 2 つのフィールドがあります。 + +- `audio`: 音声ファイルをロードしてリサンプリングするために呼び出す必要がある音声信号の 1 次元の `array`。 +- `intent_class`: スピーカーのインテントのクラス ID を表します。 + +モデルがラベル ID からラベル名を取得しやすくするために、ラベル名を整数に、またはその逆にマップする辞書を作成します。 + +```py +>>> labels = minds["train"].features["intent_class"].names +>>> label2id, id2label = dict(), dict() +>>> for i, label in enumerate(labels): +... label2id[label] = str(i) +... 
id2label[str(i)] = label +``` + +これで、ラベル ID をラベル名に変換できるようになりました。 + +```py +>>> id2label[str(2)] +'app_error' +``` + +## Preprocess + +次のステップでは、Wav2Vec2 特徴抽出プログラムをロードしてオーディオ信号を処理します。 + +```py +>>> from transformers import AutoFeatureExtractor + +>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base") +``` + +MInDS-14 データセットのサンプリング レートは 8000khz です (この情報は [データセット カード](https://huggingface.co/datasets/PolyAI/minds14) で確認できます)。つまり、データセットを再サンプリングする必要があります。事前トレーニングされた Wav2Vec2 モデルを使用するには、16000kHz に設定します。 + +```py +>>> minds = minds.cast_column("audio", Audio(sampling_rate=16_000)) +>>> minds["train"][0] +{'audio': {'array': array([ 2.2098757e-05, 4.6582241e-05, -2.2803260e-05, ..., + -2.8419291e-04, -2.3305941e-04, -1.1425107e-04], dtype=float32), + 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602b9a5fbb1e6d0fbce91f52.wav', + 'sampling_rate': 16000}, + 'intent_class': 2} +``` + +次に、次の前処理関数を作成します。 + +1. `audio`列を呼び出してロードし、必要に応じてオーディオ ファイルをリサンプリングします。 +2. オーディオ ファイルのサンプリング レートが、モデルが事前トレーニングされたオーディオ データのサンプリング レートと一致するかどうかを確認します。この情報は、Wav2Vec2 [モデル カード](https://huggingface.co/facebook/wav2vec2-base) で見つけることができます。 +3. 入力の最大長を設定して、長い入力を切り捨てずにバッチ処理します。 + +```py +>>> def preprocess_function(examples): +... audio_arrays = [x["array"] for x in examples["audio"]] +... inputs = feature_extractor( +... audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=True +... ) +... return inputs +``` + +データセット全体に前処理関数を適用するには、🤗 Datasets [`~datasets.Dataset.map`] 関数を使用します。 `batched=True` を設定してデータセットの複数の要素を一度に処理することで、`map` を高速化できます。不要な列を削除し、`intent_class` の名前を `label` に変更します。これはモデルが期待する名前であるためです。 + +```py +>>> encoded_minds = minds.map(preprocess_function, remove_columns="audio", batched=True) +>>> encoded_minds = encoded_minds.rename_column("intent_class", "label") +``` +## Evaluate + +トレーニング中にメトリクスを含めると、多くの場合、モデルのパフォーマンスを評価するのに役立ちます。 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) ライブラリを使用して、評価メソッドをすばやくロードできます。このタスクでは、[accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) メトリクスを読み込みます (🤗 Evaluate [クイック ツアー](https://huggingface.co/docs/evaluate/a_quick_tour) を参照してください) メトリクスの読み込みと計算方法の詳細については、次を参照してください。 + +```py +>>> import evaluate + +>>> accuracy = evaluate.load("accuracy") +``` + +次に、予測とラベルを [`~evaluate.EvaluationModule.compute`] に渡して精度を計算する関数を作成します。 + +```py +>>> import numpy as np + + +>>> def compute_metrics(eval_pred): +... predictions = np.argmax(eval_pred.predictions, axis=1) +... return accuracy.compute(predictions=predictions, references=eval_pred.label_ids) +``` + +これで`compute_metrics`関数の準備が整いました。トレーニングをセットアップするときにこの関数に戻ります。 + +## Train + + + + + +[`Trainer`] を使用したモデルの微調整に慣れていない場合は、[こちら](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 + + + +これでモデルのトレーニングを開始する準備が整いました。 [`AutoModelForAudioClassification`] を使用して、予期されるラベルの数とラベル マッピングを使用して Wav2Vec2 を読み込みます。 + +```py +>>> from transformers import AutoModelForAudioClassification, TrainingArguments, Trainer + +>>> num_labels = len(id2label) +>>> model = AutoModelForAudioClassification.from_pretrained( +... "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label +... ) +``` + +この時点で残っている手順は次の 3 つだけです。 + +1. [`TrainingArguments`] でトレーニング ハイパーパラメータを定義します。唯一の必須パラメータは、モデルの保存場所を指定する `output_dir` です。 `push_to_hub=True`を設定して、このモデルをハブにプッシュします (モデルをアップロードするには、Hugging Face にサインインする必要があります)。各エポックの終了時に、[`トレーナー`] は精度を評価し、トレーニング チェックポイントを保存します。 +2. 
トレーニング引数を、モデル、データセット、トークナイザー、データ照合器、および `compute_metrics` 関数とともに [`Trainer`] に渡します。 +3. [`~Trainer.train`] を呼び出してモデルを微調整します。 + +```py +>>> training_args = TrainingArguments( +... output_dir="my_awesome_mind_model", +... evaluation_strategy="epoch", +... save_strategy="epoch", +... learning_rate=3e-5, +... per_device_train_batch_size=32, +... gradient_accumulation_steps=4, +... per_device_eval_batch_size=32, +... num_train_epochs=10, +... warmup_ratio=0.1, +... logging_steps=10, +... load_best_model_at_end=True, +... metric_for_best_model="accuracy", +... push_to_hub=True, +... ) + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=encoded_minds["train"], +... eval_dataset=encoded_minds["test"], +... tokenizer=feature_extractor, +... compute_metrics=compute_metrics, +... ) + +>>> trainer.train() +``` + +トレーニングが完了したら、 [`~transformers.Trainer.push_to_hub`] メソッドを使用してモデルをハブに共有し、誰もがモデルを使用できるようにします。 + +```py +>>> trainer.push_to_hub() +``` + + + + + +音声分類用のモデルを微調整する方法の詳細な例については、対応する [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb). + + + +## Inference + +モデルを微調整したので、それを推論に使用できるようになりました。 + +推論を実行したい音声ファイルをロードします。必要に応じて、オーディオ ファイルのサンプリング レートをモデルのサンプリング レートと一致するようにリサンプリングすることを忘れないでください。 + +```py +>>> from datasets import load_dataset, Audio + +>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train") +>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000)) +>>> sampling_rate = dataset.features["audio"].sampling_rate +>>> audio_file = dataset[0]["audio"]["path"] +``` + +推論用に微調整されたモデルを試す最も簡単な方法は、それを [`pipeline`] で使用することです。モデルを使用して音声分類用の`pipeline`をインスタンス化し、それに音声ファイルを渡します。 + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline("audio-classification", model="stevhliu/my_awesome_minds_model") +>>> classifier(audio_file) +[ + {'score': 0.09766869246959686, 'label': 'cash_deposit'}, + {'score': 0.07998877018690109, 'label': 'app_error'}, + {'score': 0.0781070664525032, 'label': 'joint_account'}, + {'score': 0.07667109370231628, 'label': 'pay_bill'}, + {'score': 0.0755252093076706, 'label': 'balance'} +] +``` + +必要に応じて、`pipeline` の結果を手動で複製することもできます。 + + + + +特徴抽出器をロードしてオーディオ ファイルを前処理し、`input`を PyTorch テンソルとして返します。 + +```py +>>> from transformers import AutoFeatureExtractor + +>>> feature_extractor = AutoFeatureExtractor.from_pretrained("stevhliu/my_awesome_minds_model") +>>> inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt") +``` + +入力をモデルに渡し、ロジットを返します。 + +```py +>>> from transformers import AutoModelForAudioClassification + +>>> model = AutoModelForAudioClassification.from_pretrained("stevhliu/my_awesome_minds_model") +>>> with torch.no_grad(): +... 
logits = model(**inputs).logits +``` + +最も高い確率でクラスを取得し、モデルの `id2label` マッピングを使用してそれをラベルに変換します。 + +```py +>>> import torch + +>>> predicted_class_ids = torch.argmax(logits).item() +>>> predicted_label = model.config.id2label[predicted_class_ids] +>>> predicted_label +'cash_deposit' +``` + + \ No newline at end of file diff --git a/docs/source/ja/tasks/document_question_answering.md b/docs/source/ja/tasks/document_question_answering.md new file mode 100644 index 00000000000000..478c6af2235490 --- /dev/null +++ b/docs/source/ja/tasks/document_question_answering.md @@ -0,0 +1,502 @@ + + +# Document Question Answering + +[[open-in-colab]] + +文書による質問応答は、文書による視覚的な質問応答とも呼ばれ、以下を提供するタスクです。 +ドキュメント画像に関する質問への回答。このタスクをサポートするモデルへの入力は通常、画像と画像の組み合わせです。 +質問があり、出力は自然言語で表現された回答です。これらのモデルは、以下を含む複数のモダリティを利用します。 +テキスト、単語の位置 (境界ボックス)、および画像自体。 + +このガイドでは、次の方法を説明します。 + +- [DocVQA データセット](https://huggingface.co/datasets/nielsr/docvqa_1200_examples_donut) の [LayoutLMv2](../model_doc/layoutlmv2) を微調整します。 +- 微調整されたモデルを推論に使用します。 + + + +このチュートリアルで説明するタスクは、次のモデル アーキテクチャでサポートされています。 + + + + +[LayoutLM](../model_doc/layoutlm), [LayoutLMv2](../model_doc/layoutlmv2), [LayoutLMv3](../model_doc/layoutlmv3) + + + + + +LayoutLMv2 は、最後の非表示のヘッダーの上に質問応答ヘッドを追加することで、ドキュメントの質問応答タスクを解決します。 +トークンの状態を調べて、トークンの開始トークンと終了トークンの位置を予測します。 +答え。言い換えれば、問題は抽出的質問応答として扱われます。つまり、コンテキストを考慮して、どの部分を抽出するかということです。 +の情報が質問に答えます。コンテキストは OCR エンジンの出力から取得されます。ここでは Google の Tesseract です。 + +始める前に、必要なライブラリがすべてインストールされていることを確認してください。 LayoutLMv2 は detectron2、torchvision、tesseract に依存します。 + +```bash +pip install -q transformers datasets +``` + +```bash +pip install 'git+https://github.com/facebookresearch/detectron2.git' +pip install torchvision +``` + +```bash +sudo apt install tesseract-ocr +pip install -q pytesseract +``` + +すべての依存関係をインストールしたら、ランタイムを再起動します。 + +モデルをコミュニティと共有することをお勧めします。 Hugging Face アカウントにログインして、🤗 ハブにアップロードします。 +プロンプトが表示されたら、トークンを入力してログインします。 + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +いくつかのグローバル変数を定義しましょう。 + +```py +>>> model_checkpoint = "microsoft/layoutlmv2-base-uncased" +>>> batch_size = 4 +``` + +## Load the data + +このガイドでは、🤗 Hub にある前処理された DocVQA の小さなサンプルを使用します。フルに使いたい場合は、 +DocVQA データセットは、[DocVQA ホームページ](https://rrc.cvc.uab.es/?ch=17) で登録してダウンロードできます。そうすれば、 +このガイドを進めて、[🤗 データセットにファイルをロードする方法](https://huggingface.co/docs/datasets/loading#local-and-remote-files) を確認してください。 + + +```py +>>> from datasets import load_dataset + +>>> dataset = load_dataset("nielsr/docvqa_1200_examples") +>>> dataset +DatasetDict({ + train: Dataset({ + features: ['id', 'image', 'query', 'answers', 'words', 'bounding_boxes', 'answer'], + num_rows: 1000 + }) + test: Dataset({ + features: ['id', 'image', 'query', 'answers', 'words', 'bounding_boxes', 'answer'], + num_rows: 200 + }) +}) +``` + +ご覧のとおり、データセットはすでにトレーニング セットとテスト セットに分割されています。理解するためにランダムな例を見てみましょう +機能を備えた自分自身。 + +```py +>>> dataset["train"].features +``` + +個々のフィールドが表す内容は次のとおりです。 +* `id`: サンプルのID +* `image`: ドキュメント画像を含む PIL.Image.Image オブジェクト +* `query`: 質問文字列 - いくつかの言語での自然言語による質問 +* `answers`: ヒューマン アノテーターによって提供された正解のリスト +* `words` と `bounding_boxes`: OCR の結果。ここでは使用しません。 +* `answer`: 別のモデルと一致する答え。ここでは使用しません。 + +英語の質問だけを残し、別のモデルによる予測が含まれていると思われる`answer`機能を削除しましょう。 +また、アノテーターによって提供されたセットから最初の回答を取得します。あるいは、ランダムにサンプリングすることもできます。 + +```py +>>> updated_dataset = dataset.map(lambda example: {"question": example["query"]["en"]}, remove_columns=["query"]) +>>> updated_dataset = updated_dataset.map( +... 
lambda example: {"answer": example["answers"][0]}, remove_columns=["answer", "answers"] +... ) +``` + +このガイドで使用する LayoutLMv2 チェックポイントは、`max_position_embeddings = 512` でトレーニングされていることに注意してください ( +この情報は、[チェックポイントの `config.json` ファイル](https://huggingface.co/microsoft/layoutlmv2-base-uncased/blob/main/config.json#L18)) で見つけてください。 +例を省略することもできますが、答えが大きな文書の最後にあり、結局省略されてしまうという状況を避けるために、 +ここでは、埋め込みが 512 を超える可能性があるいくつかの例を削除します。 +データセット内のほとんどのドキュメントが長い場合は、スライディング ウィンドウ戦略を実装できます。詳細については、[このノートブック](https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb) を確認してください。 。 + +```py +>>> updated_dataset = updated_dataset.filter(lambda x: len(x["words"]) + len(x["question"].split()) < 512) +``` + +この時点で、このデータセットから OCR 機能も削除しましょう。これらは、異なるデータを微調整するための OCR の結果です。 +モデル。これらは入力要件と一致しないため、使用したい場合はさらに処理が必要になります。 +このガイドで使用するモデルの。代わりに、OCR と OCR の両方の元のデータに対して [`LayoutLMv2Processor`] を使用できます。 +トークン化。このようにして、モデルの予想される入力と一致する入力を取得します。画像を手動で加工したい場合は、 +モデルがどのような入力形式を想定しているかを知るには、[`LayoutLMv2` モデルのドキュメント](../model_doc/layoutlmv2) を確認してください。 + +```py +>>> updated_dataset = updated_dataset.remove_columns("words") +>>> updated_dataset = updated_dataset.remove_columns("bounding_boxes") +``` + +最後に、画像サンプルを確認しないとデータ探索は完了しません。 + + +```py +>>> updated_dataset["train"][11]["image"] +``` + +
+ DocVQA Image Example +
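+
+なお、先ほど触れた `max_position_embeddings = 512` という値は、`config.json` を直接開かなくてもチェックポイントの設定から確認できます。以下は、上で定義した `model_checkpoint` を前提とした、確認用の最小限のスケッチです。
+
+```py
+>>> from transformers import AutoConfig
+
+>>> config = AutoConfig.from_pretrained(model_checkpoint)
+>>> config.max_position_embeddings
+512
+```
+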
+ +## Preprocess the data + +文書の質問に答えるタスクはマルチモーダル タスクであるため、各モダリティからの入力が確実に行われるようにする必要があります。 +モデルの期待に従って前処理されます。まず、[`LayoutLMv2Processor`] をロードします。これは、画像データを処理できる画像プロセッサとテキスト データをエンコードできるトークナイザーを内部で組み合わせています。 + +```py +>>> from transformers import AutoProcessor + +>>> processor = AutoProcessor.from_pretrained(model_checkpoint) +``` + +### Preprocessing document images + +まず、プロセッサからの `image_processor` を利用して、モデルのドキュメント画像を準備しましょう。 +デフォルトでは、画像プロセッサは画像のサイズを 224x224 に変更し、カラー チャネルの順序が正しいことを確認します。 +tesseract を使用して OCR を適用し、単語と正規化された境界ボックスを取得します。このチュートリアルでは、これらのデフォルトはすべて、まさに必要なものです。 +デフォルトの画像処理を画像のバッチに適用し、OCR の結果を返す関数を作成します。 + +```py +>>> image_processor = processor.image_processor + + +>>> def get_ocr_words_and_boxes(examples): +... images = [image.convert("RGB") for image in examples["image"]] +... encoded_inputs = image_processor(images) + +... examples["image"] = encoded_inputs.pixel_values +... examples["words"] = encoded_inputs.words +... examples["boxes"] = encoded_inputs.boxes + +... return examples +``` + +この前処理をデータセット全体に高速に適用するには、[`~datasets.Dataset.map`] を使用します。 + +```py +>>> dataset_with_ocr = updated_dataset.map(get_ocr_words_and_boxes, batched=True, batch_size=2) +``` + +### Preprocessing text data + +画像に OCR を適用したら、データセットのテキスト部分をエンコードしてモデル用に準備する必要があります。 +これには、前のステップで取得した単語とボックスをトークンレベルの `input_ids`、`attention_mask`、 +`token_type_ids`と`bbox`。テキストを前処理するには、プロセッサからの`Tokenizer`が必要になります。 + +```py +>>> tokenizer = processor.tokenizer +``` + +前述の前処理に加えて、モデルのラベルを追加する必要もあります。 `xxxForQuestionAnswering` モデルの場合 +🤗 Transformers では、ラベルは `start_positions` と `end_positions` で構成され、どのトークンがその位置にあるかを示します。 +開始点と、どのトークンが回答の最後にあるか。 + +それから始めましょう。より大きなリスト (単語リスト) 内のサブリスト (単語に分割された回答) を検索できるヘルパー関数を定義します。 + +この関数は、`words_list` と `answer_list` という 2 つのリストを入力として受け取ります。次に、`words_list`を反復処理してチェックします。 +`words_list` (words_list[i]) 内の現在の単語が、answer_list (answer_list[0]) の最初の単語と等しいかどうか、および +現在の単語から始まり、`answer_list` と同じ長さの `words_list` のサブリストは、`to answer_list` と等しくなります。 +この条件が true の場合、一致が見つかったことを意味し、関数は一致とその開始インデックス (idx) を記録します。 +とその終了インデックス (idx + len(answer_list) - 1)。複数の一致が見つかった場合、関数は最初のもののみを返します。 +一致するものが見つからない場合、関数は (`None`、0、および 0) を返します。 + +```py +>>> def subfinder(words_list, answer_list): +... matches = [] +... start_indices = [] +... end_indices = [] +... for idx, i in enumerate(range(len(words_list))): +... if words_list[i] == answer_list[0] and words_list[i : i + len(answer_list)] == answer_list: +... matches.append(answer_list) +... start_indices.append(idx) +... end_indices.append(idx + len(answer_list) - 1) +... if matches: +... return matches[0], start_indices[0], end_indices[0] +... else: +... return None, 0, 0 +``` + +この関数が答えの位置を見つける方法を説明するために、例で使用してみましょう。 + +```py +>>> example = dataset_with_ocr["train"][1] +>>> words = [word.lower() for word in example["words"]] +>>> match, word_idx_start, word_idx_end = subfinder(words, example["answer"].lower().split()) +>>> print("Question: ", example["question"]) +>>> print("Words:", words) +>>> print("Answer: ", example["answer"]) +>>> print("start_index", word_idx_start) +>>> print("end_index", word_idx_end) +Question: Who is in cc in this letter? 
+Words: ['wie', 'baw', 'brown', '&', 'williamson', 'tobacco', 'corporation', 'research', '&', 'development', 'internal', 'correspondence', 'to:', 'r.', 'h.', 'honeycutt', 'ce:', 't.f.', 'riehl', 'from:', '.', 'c.j.', 'cook', 'date:', 'may', '8,', '1995', 'subject:', 'review', 'of', 'existing', 'brainstorming', 'ideas/483', 'the', 'major', 'function', 'of', 'the', 'product', 'innovation', 'graup', 'is', 'to', 'develop', 'marketable', 'nove!', 'products', 'that', 'would', 'be', 'profitable', 'to', 'manufacture', 'and', 'sell.', 'novel', 'is', 'defined', 'as:', 'of', 'a', 'new', 'kind,', 'or', 'different', 'from', 'anything', 'seen', 'or', 'known', 'before.', 'innovation', 'is', 'defined', 'as:', 'something', 'new', 'or', 'different', 'introduced;', 'act', 'of', 'innovating;', 'introduction', 'of', 'new', 'things', 'or', 'methods.', 'the', 'products', 'may', 'incorporate', 'the', 'latest', 'technologies,', 'materials', 'and', 'know-how', 'available', 'to', 'give', 'then', 'a', 'unique', 'taste', 'or', 'look.', 'the', 'first', 'task', 'of', 'the', 'product', 'innovation', 'group', 'was', 'to', 'assemble,', 'review', 'and', 'categorize', 'a', 'list', 'of', 'existing', 'brainstorming', 'ideas.', 'ideas', 'were', 'grouped', 'into', 'two', 'major', 'categories', 'labeled', 'appearance', 'and', 'taste/aroma.', 'these', 'categories', 'are', 'used', 'for', 'novel', 'products', 'that', 'may', 'differ', 'from', 'a', 'visual', 'and/or', 'taste/aroma', 'point', 'of', 'view', 'compared', 'to', 'canventional', 'cigarettes.', 'other', 'categories', 'include', 'a', 'combination', 'of', 'the', 'above,', 'filters,', 'packaging', 'and', 'brand', 'extensions.', 'appearance', 'this', 'category', 'is', 'used', 'for', 'novel', 'cigarette', 'constructions', 'that', 'yield', 'visually', 'different', 'products', 'with', 'minimal', 'changes', 'in', 'smoke', 'chemistry', 'two', 'cigarettes', 'in', 'cne.', 'emulti-plug', 'te', 'build', 'yaur', 'awn', 'cigarette.', 'eswitchable', 'menthol', 'or', 'non', 'menthol', 'cigarette.', '*cigarettes', 'with', 'interspaced', 'perforations', 'to', 'enable', 'smoker', 'to', 'separate', 'unburned', 'section', 'for', 'future', 'smoking.', '«short', 'cigarette,', 'tobacco', 'section', '30', 'mm.', '«extremely', 'fast', 'buming', 'cigarette.', '«novel', 'cigarette', 'constructions', 'that', 'permit', 'a', 'significant', 'reduction', 'iretobacco', 'weight', 'while', 'maintaining', 'smoking', 'mechanics', 'and', 'visual', 'characteristics.', 'higher', 'basis', 'weight', 'paper:', 'potential', 'reduction', 'in', 'tobacco', 'weight.', '«more', 'rigid', 'tobacco', 'column;', 'stiffing', 'agent', 'for', 'tobacco;', 'e.g.', 'starch', '*colored', 'tow', 'and', 'cigarette', 'papers;', 'seasonal', 'promotions,', 'e.g.', 'pastel', 'colored', 'cigarettes', 'for', 'easter', 'or', 'in', 'an', 'ebony', 'and', 'ivory', 'brand', 'containing', 'a', 'mixture', 'of', 'all', 'black', '(black', 'paper', 'and', 'tow)', 'and', 'ail', 'white', 'cigarettes.', '499150498'] +Answer: T.F. Riehl +start_index 17 +end_index 18 +``` + +ただし、サンプルがエンコードされると、次のようになります。 + +```py +>>> encoding = tokenizer(example["question"], example["words"], example["boxes"]) +>>> tokenizer.decode(encoding["input_ids"]) +[CLS] who is in cc in this letter? [SEP] wie baw brown & williamson tobacco corporation research & development ... 
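+
+>>> # optional, illustrative check: these fields of `encoding` are what we will use
+>>> # below to locate the answer span (outputs omitted)
+>>> encoding.keys()  # doctest: +SKIP
+>>> encoding.word_ids(0)[:10]  # doctest: +SKIP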
+``` + +エンコードされた入力内で答えの位置を見つける必要があります。 +* `token_type_ids` は、どのトークンが質問の一部であり、どのトークンが文書の単語の一部であるかを示します。 +* `tokenizer.cls_token_id` は、入力の先頭で特別なトークンを見つけるのに役立ちます。 +* `word_ids` は、元の `words` で見つかった回答を、完全にエンコードされた入力内の同じ回答と照合して判断するのに役立ちます。 +エンコードされた入力内の応答の開始/終了位置。 + +これを念頭に置いて、データセット内のサンプルのバッチをエンコードする関数を作成しましょう。 + + +```py +>>> def encode_dataset(examples, max_length=512): +... questions = examples["question"] +... words = examples["words"] +... boxes = examples["boxes"] +... answers = examples["answer"] + +... # encode the batch of examples and initialize the start_positions and end_positions +... encoding = tokenizer(questions, words, boxes, max_length=max_length, padding="max_length", truncation=True) +... start_positions = [] +... end_positions = [] + +... # loop through the examples in the batch +... for i in range(len(questions)): +... cls_index = encoding["input_ids"][i].index(tokenizer.cls_token_id) + +... # find the position of the answer in example's words +... words_example = [word.lower() for word in words[i]] +... answer = answers[i] +... match, word_idx_start, word_idx_end = subfinder(words_example, answer.lower().split()) + +... if match: +... # if match is found, use `token_type_ids` to find where words start in the encoding +... token_type_ids = encoding["token_type_ids"][i] +... token_start_index = 0 +... while token_type_ids[token_start_index] != 1: +... token_start_index += 1 + +... token_end_index = len(encoding["input_ids"][i]) - 1 +... while token_type_ids[token_end_index] != 1: +... token_end_index -= 1 + +... word_ids = encoding.word_ids(i)[token_start_index : token_end_index + 1] +... start_position = cls_index +... end_position = cls_index + +... # loop over word_ids and increase `token_start_index` until it matches the answer position in words +... # once it matches, save the `token_start_index` as the `start_position` of the answer in the encoding +... for id in word_ids: +... if id == word_idx_start: +... start_position = token_start_index +... else: +... token_start_index += 1 + +... # similarly loop over `word_ids` starting from the end to find the `end_position` of the answer +... for id in word_ids[::-1]: +... if id == word_idx_end: +... end_position = token_end_index +... else: +... token_end_index -= 1 + +... start_positions.append(start_position) +... end_positions.append(end_position) + +... else: +... start_positions.append(cls_index) +... end_positions.append(cls_index) + +... encoding["image"] = examples["image"] +... encoding["start_positions"] = start_positions +... encoding["end_positions"] = end_positions + +... return encoding +``` + +この前処理関数が完成したので、データセット全体をエンコードできます。 + +```py +>>> encoded_train_dataset = dataset_with_ocr["train"].map( +... encode_dataset, batched=True, batch_size=2, remove_columns=dataset_with_ocr["train"].column_names +... ) +>>> encoded_test_dataset = dataset_with_ocr["test"].map( +... encode_dataset, batched=True, batch_size=2, remove_columns=dataset_with_ocr["test"].column_names +... 
) +``` + +エンコードされたデータセットの特徴がどのようなものかを確認してみましょう。 + +```py +>>> encoded_train_dataset.features +{'image': Sequence(feature=Sequence(feature=Sequence(feature=Value(dtype='uint8', id=None), length=-1, id=None), length=-1, id=None), length=-1, id=None), + 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), + 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), + 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), + 'bbox': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None), + 'start_positions': Value(dtype='int64', id=None), + 'end_positions': Value(dtype='int64', id=None)} +``` + +## Evaluation + +文書の質問回答の評価には、大量の後処理が必要です。過剰摂取を避けるために +現時点では、このガイドでは評価ステップを省略しています。 [`Trainer`] はトレーニング中に評価損失を計算するため、 +モデルのパフォーマンスについてまったくわからないわけではありません。抽出的質問応答は通常、F1/完全一致を使用して評価されます。 +自分で実装したい場合は、[質問応答の章](https://huggingface.co/course/chapter7/7?fw=pt#postprocessing) を確認してください。 +インスピレーションを得るためにハグフェイスコースの。 + +## Train + +おめでとう!このガイドの最も難しい部分を無事にナビゲートできたので、独自のモデルをトレーニングする準備が整いました。 +トレーニングには次の手順が含まれます。 +* 前処理と同じチェックポイントを使用して、[`AutoModelForDocumentQuestionAnswering`] でモデルを読み込みます。 +* [`TrainingArguments`] でトレーニング ハイパーパラメータを定義します。 +* サンプルをバッチ処理する関数を定義します。ここでは [`DefaultDataCollat​​or`] が適切に機能します。 +* モデル、データセット、データ照合器とともにトレーニング引数を [`Trainer`] に渡します。 +* [`~Trainer.train`] を呼び出してモデルを微調整します。 + +```py +>>> from transformers import AutoModelForDocumentQuestionAnswering + +>>> model = AutoModelForDocumentQuestionAnswering.from_pretrained(model_checkpoint) +``` + +[`TrainingArguments`] で、`output_dir` を使用してモデルの保存場所を指定し、必要に応じてハイパーパラメーターを構成します。 +モデルをコミュニティと共有したい場合は、`push_to_hub`を`True`に設定します (モデルをアップロードするには、Hugging Face にサインインする必要があります)。 +この場合、`output_dir`はモデルのチェックポイントがプッシュされるリポジトリの名前にもなります。 + +```py +>>> from transformers import TrainingArguments + +>>> # REPLACE THIS WITH YOUR REPO ID +>>> repo_id = "MariaK/layoutlmv2-base-uncased_finetuned_docvqa" + +>>> training_args = TrainingArguments( +... output_dir=repo_id, +... per_device_train_batch_size=4, +... num_train_epochs=20, +... save_steps=200, +... logging_steps=50, +... evaluation_strategy="steps", +... learning_rate=5e-5, +... save_total_limit=2, +... remove_unused_columns=False, +... push_to_hub=True, +... ) +``` + +サンプルをまとめてバッチ処理するための単純なデータ照合器を定義します。 + +```py +>>> from transformers import DefaultDataCollator + +>>> data_collator = DefaultDataCollator() +``` + +最後に、すべてをまとめて、[`~Trainer.train`] を呼び出します。 + +```py +>>> from transformers import Trainer + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... data_collator=data_collator, +... train_dataset=encoded_train_dataset, +... eval_dataset=encoded_test_dataset, +... tokenizer=processor, +... ) + +>>> trainer.train() +``` + +最終モデルを 🤗 Hub に追加するには、モデル カードを作成し、`push_to_hub` を呼び出します。 + +```py +>>> trainer.create_model_card() +>>> trainer.push_to_hub() +``` + +## Inference + +LayoutLMv2 モデルを微調整し、🤗 ハブにアップロードしたので、それを推論に使用できます。もっとも単純な +推論用に微調整されたモデルを試す方法は、それを [`Pipeline`] で使用することです。 + +例を挙げてみましょう: +```py +>>> example = dataset["test"][2] +>>> question = example["query"]["en"] +>>> image = example["image"] +>>> print(question) +>>> print(example["answers"]) +'Who is ‘presiding’ TRRF GENERAL SESSION (PART 1)?' +['TRRF Vice President', 'lee a. 
waller'] +``` + +次に、パイプラインをインスタンス化します。 +モデルを使用して質問への回答を文書化し、画像と質問の組み合わせをモデルに渡します。 + +```py +>>> from transformers import pipeline + +>>> qa_pipeline = pipeline("document-question-answering", model="MariaK/layoutlmv2-base-uncased_finetuned_docvqa") +>>> qa_pipeline(image, question) +[{'score': 0.9949808120727539, + 'answer': 'Lee A. Waller', + 'start': 55, + 'end': 57}] +``` + +必要に応じて、パイプラインの結果を手動で複製することもできます。 +1. 画像と質問を取得し、モデルのプロセッサを使用してモデル用に準備します。 +2. モデルを通じて結果または前処理を転送します。 +3. モデルは`start_logits`と`end_logits`を返します。これらは、どのトークンが応答の先頭にあるのかを示し、 +どのトークンが回答の最後にありますか。どちらも形状 (batch_size、sequence_length) を持ちます。 +4. `start_logits` と `end_logits` の両方の最後の次元で argmax を取得し、予測される `start_idx` と `end_idx` を取得します。 +5. トークナイザーを使用して回答をデコードします。 + +```py +>>> import torch +>>> from transformers import AutoProcessor +>>> from transformers import AutoModelForDocumentQuestionAnswering + +>>> processor = AutoProcessor.from_pretrained("MariaK/layoutlmv2-base-uncased_finetuned_docvqa") +>>> model = AutoModelForDocumentQuestionAnswering.from_pretrained("MariaK/layoutlmv2-base-uncased_finetuned_docvqa") + +>>> with torch.no_grad(): +... encoding = processor(image.convert("RGB"), question, return_tensors="pt") +... outputs = model(**encoding) +... start_logits = outputs.start_logits +... end_logits = outputs.end_logits +... predicted_start_idx = start_logits.argmax(-1).item() +... predicted_end_idx = end_logits.argmax(-1).item() + +>>> processor.tokenizer.decode(encoding.input_ids.squeeze()[predicted_start_idx : predicted_end_idx + 1]) +'lee a. waller' +``` diff --git a/docs/source/ja/tasks/idefics.md b/docs/source/ja/tasks/idefics.md new file mode 100644 index 00000000000000..3ee9f25e74555d --- /dev/null +++ b/docs/source/ja/tasks/idefics.md @@ -0,0 +1,430 @@ + + + +# Image tasks with IDEFICS + +[[open-in-colab]] + +個別のタスクは特殊なモデルを微調整することで対処できますが、別のアプローチも可能です。 +最近登場して人気を博しているのは、微調整を行わずにさまざまなタスクに大規模なモデルを使用することです。 +たとえば、大規模な言語モデルは、要約、翻訳、分類などの NLP タスクを処理できます。 +このアプローチは、テキストなどの単一のモダリティに限定されなくなりました。このガイドでは、次のような方法を説明します。 +IDEFICS と呼ばれる大規模なマルチモーダル モデルを使用して、画像とテキストのタスクを解決します。 + +[IDEFICS](../model_doc/idefics) は、[Flamingo](https://huggingface.co/papers/2204.14198) に基づくオープンアクセスのビジョンおよび言語モデルです。 +DeepMind によって最初に開発された最先端の視覚言語モデル。モデルは任意の画像シーケンスを受け入れます +テキストを入力し、出力として一貫したテキストを生成します。画像に関する質問に答えたり、視覚的なコンテンツについて説明したり、 +複数のイメージに基づいたストーリーを作成するなど。 IDEFICS には 2 つのバリエーションがあります - [800 億パラメータ](https://huggingface.co/HuggingFaceM4/idefics-80b) +および [90 億のパラメータ](https://huggingface.co/HuggingFaceM4/idefics-9b)、どちらも 🤗 Hub で入手できます。各バリエーションについて、細かく調整された指示も見つけることができます。 +会話のユースケースに適応したモデルのバージョン。 + +このモデルは非常に多用途で、幅広い画像タスクやマルチモーダル タスクに使用できます。しかし、 +大規模なモデルであるということは、大量の計算リソースとインフラストラクチャが必要であることを意味します。それはあなた次第です +このアプローチは、個別のタスクごとに特化したモデルを微調整するよりも、ユースケースに適しています。 + +このガイドでは、次の方法を学習します。 +- [IDEFICS をロード](#loading-the-model) および [モデルの量子化バージョンをロード](#quantized-model) +- IDEFICS を次の目的で使用します。 + - [画像キャプション](#image-captioning) + - [プロンプト画像キャプション](#prompted-image-captioning) + - [Few-shot プロンプト](#few-shot-prompting) + - [ビジュアル質問回答](#visual-question-answering) + - [画像分類](#image-classification) + - [画像ガイド付きテキスト生成](#image-guided-text-generation) +- [バッチモードで推論を実行する](#running-inference-in-batch-mode) +- [会話用に IDEFICS 命令を実行](#idefics-instruct-for-conversational-use) + +始める前に、必要なライブラリがすべてインストールされていることを確認してください。 + +```bash +pip install -q bitsandbytes sentencepiece accelerate transformers +``` + + +量子化されていないバージョンのモデル チェックポイントを使用して次の例を実行するには、少なくとも 20GB の GPU メモリが必要です。 + + +## Loading the model + +まずはモデルの 90 億個のパラメーターのチェックポイントをロードしましょう。 + +```py +>>> checkpoint = 
"HuggingFaceM4/idefics-9b" +``` + +他の Transformers モデルと同様に、プロセッサとモデル自体をチェックポイントからロードする必要があります。 +IDEFICS プロセッサは、[`LlamaTokenizer`] と IDEFICS 画像プロセッサを単一のプロセッサにラップして処理します。 +モデルのテキストと画像の入力を準備します。 + + +```py +>>> import torch + +>>> from transformers import IdeficsForVisionText2Text, AutoProcessor + +>>> processor = AutoProcessor.from_pretrained(checkpoint) + +>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, device_map="auto") +``` + +`device_map`を`auto`に設定すると、モデルの重みを最も最適化された状態でロードおよび保存する方法が自動的に決定されます。 +既存のデバイスを考慮した方法。 + +### Quantized model + +ハイメモリ GPU の可用性が問題となる場合は、モデルの量子化されたバージョンをロードできます。モデルと +プロセッサを 4 ビット精度で使用する場合、`BitsAndBytesConfig`を`from_pretrained`メソッドに渡すと、モデルが圧縮されます。 +ロード中にその場で。 + + +```py +>>> import torch +>>> from transformers import IdeficsForVisionText2Text, AutoProcessor, BitsAndBytesConfig + +>>> quantization_config = BitsAndBytesConfig( +... load_in_4bit=True, +... bnb_4bit_compute_dtype=torch.float16, +... ) + +>>> processor = AutoProcessor.from_pretrained(checkpoint) + +>>> model = IdeficsForVisionText2Text.from_pretrained( +... checkpoint, +... quantization_config=quantization_config, +... device_map="auto" +... ) +``` + +提案された方法のいずれかでモデルをロードしたので、IDEFICS を使用できるタスクの探索に進みましょう。 + +## Image captioning + +画像のキャプション付けは、特定の画像のキャプションを予測するタスクです。一般的な用途は視覚障害者を支援することです +人々はさまざまな状況をナビゲートします。たとえば、オンラインで画像コンテンツを探索します。 + +タスクを説明するには、キャプションを付ける画像を取得します。例: + +
+ Image of a puppy in a flower bed +
+ +写真提供:[Hendo Wang](https://unsplash.com/@hendoo) + +IDEFICS はテキストと画像のプロンプトを受け入れます。ただし、画像にキャプションを付けるには、テキスト プロンプトをユーザーに提供する必要はありません。 +モデル、前処理された入力画像のみ。テキスト プロンプトがない場合、モデルはテキストの生成を開始します。 +BOS (Beginning-of-sequence) トークンによりキャプションが作成されます。 + +モデルへの画像入力として、画像オブジェクト (`PIL.Image`) または画像を取得できる URL のいずれかを使用できます。 + +```py +>>> prompt = [ +... "https://images.unsplash.com/photo-1583160247711-2191776b4b91?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3542&q=80", +... ] + +>>> inputs = processor(prompt, return_tensors="pt").to("cuda") +>>> bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids + +>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) +>>> print(generated_text[0]) +A puppy in a flower bed +``` + + + +増加時に発生するエラーを避けるために、`generate`の呼び出しに`bad_words_ids`を含めることをお勧めします。 +`max_new_tokens`: モデルは、新しい `` または `` トークンを生成する必要があります。 +モデルによって画像が生成されていません。 +このガイドのようにオンザフライで設定することも、[テキスト生成戦略](../generation_strategies) ガイドで説明されているように `GenerationConfig` に保存することもできます。 + + +## Prompted image captioning + +テキスト プロンプトを提供することで画像キャプションを拡張でき、モデルは画像を指定して続行します。持っていきましょう +別の図で説明します。 + +
+ Image of the Eiffel Tower at night +
+ +写真提供:[Denys Nevozhai](https://unsplash.com/@dnevozhai)。 + +テキストおよび画像のプロンプトを単一のリストとしてモデルのプロセッサに渡し、適切な入力を作成できます。 + +```py +>>> prompt = [ +... "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80", +... "This is an image of ", +... ] + +>>> inputs = processor(prompt, return_tensors="pt").to("cuda") +>>> bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids + +>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) +>>> print(generated_text[0]) +This is an image of the Eiffel Tower in Paris, France. +``` + +## Few-shot prompting + +IDEFICS はゼロショットで優れた結果を示しますが、タスクによっては特定の形式のキャプションが必要になる場合や、キャプションが付属する場合があります。 +タスクの複雑さを増大させるその他の制限または要件。少数のショットのプロンプトを使用して、コンテキスト内の学習を有効にすることができます。 +プロンプトに例を指定することで、指定された例の形式を模倣した結果を生成するようにモデルを操作できます。 + +前のエッフェル塔の画像をモデルの例として使用し、モデルにデモンストレーションするプロンプトを作成してみましょう。 +画像内のオブジェクトが何であるかを知ることに加えて、それに関する興味深い情報も取得したいと考えています。 +次に、自由の女神の画像に対して同じ応答形式を取得できるかどうかを見てみましょう。 + +
+ Image of the Statue of Liberty +
+ +写真提供:[Juan Mayobre](https://unsplash.com/@jmayobres)。 + +```py +>>> prompt = ["User:", +... "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80", +... "Describe this image.\nAssistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building.\n", +... "User:", +... "https://images.unsplash.com/photo-1524099163253-32b7f0256868?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3387&q=80", +... "Describe this image.\nAssistant:" +... ] + +>>> inputs = processor(prompt, return_tensors="pt").to("cuda") +>>> bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids + +>>> generated_ids = model.generate(**inputs, max_new_tokens=30, bad_words_ids=bad_words_ids) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) +>>> print(generated_text[0]) +User: Describe this image. +Assistant: An image of the Eiffel Tower at night. Fun fact: the Eiffel Tower is the same height as an 81-storey building. +User: Describe this image. +Assistant: An image of the Statue of Liberty. Fun fact: the Statue of Liberty is 151 feet tall. +``` + +モデルは 1 つの例 (つまり、1 ショット) だけからタスクの実行方法を学習していることに注目してください。より複雑なタスクの場合は、 +より多くの例 (3 ショット、5 ショットなど) を自由に試してみてください。 + +## Visual question answering + +Visual Question Answering (VQA) は、画像に基づいて自由形式の質問に答えるタスクです。画像に似ている +キャプションは、アクセシビリティ アプリケーションだけでなく、教育 (視覚資料についての推論) にも使用できます。 +サービス(画像を基にした商品に関する質問)、画像検索など。 + +このタスク用に新しい画像を取得しましょう。 + +
+ Image of a couple having a picnic +
+ +写真提供 [Jarritos Mexican Soda](https://unsplash.com/@jarritos). + +適切な指示をプロンプトすることで、モデルを画像キャプションから視覚的な質問への応答に導くことができます。 + +```py +>>> prompt = [ +... "Instruction: Provide an answer to the question. Use the image to answer.\n", +... "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80", +... "Question: Where are these people and what's the weather like? Answer:" +... ] + +>>> inputs = processor(prompt, return_tensors="pt").to("cuda") +>>> bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids + +>>> generated_ids = model.generate(**inputs, max_new_tokens=20, bad_words_ids=bad_words_ids) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) +>>> print(generated_text[0]) +Instruction: Provide an answer to the question. Use the image to answer. + Question: Where are these people and what's the weather like? Answer: They're in a park in New York City, and it's a beautiful day. +``` + +## Image classification + +IDEFICS は、次のデータを含むデータについて明示的にトレーニングしなくても、画像をさまざまなカテゴリに分類できます。 +これらの特定のカテゴリからのラベル付きの例。カテゴリのリストを指定し、その画像とテキストを使用して理解する +機能を利用すると、モデルは画像がどのカテゴリに属する​​可能性が高いかを推測できます。 + +たとえば、次のような野菜スタンドの画像があるとします。 + +
+ Image of a vegetable stand +
+ +写真提供:[Peter Wendt](https://unsplash.com/@peterwendt)。 + +画像を次のいずれかのカテゴリに分類するようにモデルに指示できます。 + +```py +>>> categories = ['animals','vegetables', 'city landscape', 'cars', 'office'] +>>> prompt = [f"Instruction: Classify the following image into a single category from the following list: {categories}.\n", +... "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80", +... "Category: " +... ] + +>>> inputs = processor(prompt, return_tensors="pt").to("cuda") +>>> bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids + +>>> generated_ids = model.generate(**inputs, max_new_tokens=6, bad_words_ids=bad_words_ids) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) +>>> print(generated_text[0]) +Instruction: Classify the following image into a single category from the following list: ['animals', 'vegetables', 'city landscape', 'cars', 'office']. +Category: Vegetables +``` + +上の例では、画像を 1 つのカテゴリに分類するようにモデルに指示していますが、ランク分類を行うようにモデルに指示することもできます。 + +## Image-guided text generation + +よりクリエイティブなアプリケーションの場合は、画像ガイド付きテキスト生成を使用して、画像に基づいてテキストを生成できます。これは可能です +製品、広告、シーンの説明などを作成するのに役立ちます。 + +IDEFICS に、赤いドアの単純な画像に基づいてストーリーを書くように促してみましょう。 + +
+ Image of a red door with a pumpkin on the steps +
+ +写真提供:[Craig Tidball](https://unsplash.com/@devonshiremedia)。 + +```py +>>> prompt = ["Instruction: Use the image to write a story. \n", +... "https://images.unsplash.com/photo-1517086822157-2b0358e7684a?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=2203&q=80", +... "Story: \n"] + +>>> inputs = processor(prompt, return_tensors="pt").to("cuda") +>>> bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids + +>>> generated_ids = model.generate(**inputs, num_beams=2, max_new_tokens=200, bad_words_ids=bad_words_ids) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) +>>> print(generated_text[0]) +Instruction: Use the image to write a story. + Story: +Once upon a time, there was a little girl who lived in a house with a red door. She loved her red door. It was the prettiest door in the whole world. + +One day, the little girl was playing in her yard when she noticed a man standing on her doorstep. He was wearing a long black coat and a top hat. + +The little girl ran inside and told her mother about the man. + +Her mother said, “Don’t worry, honey. He’s just a friendly ghost.” + +The little girl wasn’t sure if she believed her mother, but she went outside anyway. + +When she got to the door, the man was gone. + +The next day, the little girl was playing in her yard again when she noticed the man standing on her doorstep. + +He was wearing a long black coat and a top hat. + +The little girl ran +``` + +IDEFICS は玄関先にあるカボチャに気づき、幽霊に関する不気味なハロウィーンの話をしたようです。 + + + +このような長い出力の場合、テキスト生成戦略を微調整すると大きなメリットが得られます。これは役に立ちます +生成される出力の品質が大幅に向上します。 [テキスト生成戦略](../generation_strategies) を確認してください。 +詳しく知ることができ。 + + + +## Running inference in batch mode + +これまでのすべてのセクションでは、IDEFICS を 1 つの例として説明しました。非常に似た方法で、推論を実行できます。 +プロンプトのリストを渡すことにより、サンプルのバッチを取得します。 + +```py +>>> prompts = [ +... [ "https://images.unsplash.com/photo-1543349689-9a4d426bee8e?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3501&q=80", +... "This is an image of ", +... ], +... [ "https://images.unsplash.com/photo-1623944889288-cd147dbb517c?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80", +... "This is an image of ", +... ], +... [ "https://images.unsplash.com/photo-1471193945509-9ad0617afabf?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=3540&q=80", +... "This is an image of ", +... ], +... ] + +>>> inputs = processor(prompts, return_tensors="pt").to("cuda") +>>> bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids + +>>> generated_ids = model.generate(**inputs, max_new_tokens=10, bad_words_ids=bad_words_ids) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) +>>> for i,t in enumerate(generated_text): +... print(f"{i}:\n{t}\n") +0: +This is an image of the Eiffel Tower in Paris, France. + +1: +This is an image of a couple on a picnic blanket. + +2: +This is an image of a vegetable stand. 
+``` + +## IDEFICS instruct for conversational use + +会話型のユースケースの場合は、🤗 ハブでモデルの微調整された指示されたバージョンを見つけることができます。 +`HuggingFaceM4/idefics-80b-instruct` および `HuggingFaceM4/idefics-9b-instruct`。 + +これらのチェックポイントは、教師ありモデルと命令モデルを組み合わせたそれぞれの基本モデルを微調整した結果です。 +データセットを微調整することで、ダウンストリームのパフォーマンスを向上させながら、会話設定でモデルをより使いやすくします。 + +会話での使用とプロンプトは、基本モデルの使用と非常に似ています。 + +```py +>>> import torch +>>> from transformers import IdeficsForVisionText2Text, AutoProcessor + +>>> device = "cuda" if torch.cuda.is_available() else "cpu" + +>>> checkpoint = "HuggingFaceM4/idefics-9b-instruct" +>>> model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device) +>>> processor = AutoProcessor.from_pretrained(checkpoint) + +>>> prompts = [ +... [ +... "User: What is in this image?", +... "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG", +... "", + +... "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.", + +... "\nUser:", +... "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052", +... "And who is that?", + +... "\nAssistant:", +... ], +... ] + +>>> # --batched mode +>>> inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(device) +>>> # --single sample mode +>>> # inputs = processor(prompts[0], return_tensors="pt").to(device) + +>>> # Generation args +>>> exit_condition = processor.tokenizer("", add_special_tokens=False).input_ids +>>> bad_words_ids = processor.tokenizer(["", ""], add_special_tokens=False).input_ids + +>>> generated_ids = model.generate(**inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_length=100) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) +>>> for i, t in enumerate(generated_text): +... 
print(f"{i}:\n{t}\n") +``` diff --git a/docs/source/ja/tasks/image_captioning.md b/docs/source/ja/tasks/image_captioning.md new file mode 100644 index 00000000000000..31c687c111c071 --- /dev/null +++ b/docs/source/ja/tasks/image_captioning.md @@ -0,0 +1,276 @@ + + +# Image captioning + +[[open-in-colab]] + +画像のキャプション付けは、特定の画像のキャプションを予測するタスクです。一般的な現実世界のアプリケーションには次のものがあります。 +視覚障害者がさまざまな状況を乗り越えられるよう支援します。したがって、画像のキャプション +画像を説明することで人々のコンテンツへのアクセシビリティを向上させるのに役立ちます。 + +このガイドでは、次の方法を説明します。 + +* 画像キャプション モデルを微調整します。 +* 微調整されたモデルを推論に使用します。 + +始める前に、必要なライブラリがすべてインストールされていることを確認してください。 + +```bash +pip install transformers datasets evaluate -q +pip install jiwer -q +``` + +モデルをアップロードしてコミュニティと共有できるように、Hugging Face アカウントにログインすることをお勧めします。プロンプトが表示されたら、トークンを入力してログインします。 + + +```python +from huggingface_hub import notebook_login + +notebook_login() +``` + +## Load the Pokémon BLIP captions dataset + +🤗 データセット ライブラリを使用して、{image-caption} ペアで構成されるデータセットを読み込みます。独自の画像キャプション データセットを作成するには +PyTorch では、[このノートブック](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/GIT/Fine_tune_GIT_on_an_image_captioning_dataset.ipynb) を参照できます。 + +```py +ds = load_dataset("lambdalabs/pokemon-blip-captions") +ds +``` + +```bash +DatasetDict({ + train: Dataset({ + features: ['image', 'text'], + num_rows: 833 + }) +}) +``` + +データセットには `image`と`text`の 2 つの機能があります。 + + + +多くの画像キャプション データセットには、画像ごとに複数のキャプションが含まれています。このような場合、一般的な戦略は、トレーニング中に利用可能なキャプションの中からランダムにキャプションをサンプリングすることです。 + + + +[`~datasets.Dataset.train_test_split`] メソッドを使用して、データセットのトレイン スプリットをトレイン セットとテスト セットに分割します。 + +```python +ds = ds["train"].train_test_split(test_size=0.1) +train_ds = ds["train"] +test_ds = ds["test"] +``` + +トレーニング セットからのいくつかのサンプルを視覚化してみましょう。 + +```python +from textwrap import wrap +import matplotlib.pyplot as plt +import numpy as np + + +def plot_images(images, captions): + plt.figure(figsize=(20, 20)) + for i in range(len(images)): + ax = plt.subplot(1, len(images), i + 1) + caption = captions[i] + caption = "\n".join(wrap(caption, 12)) + plt.title(caption) + plt.imshow(images[i]) + plt.axis("off") + + +sample_images_to_visualize = [np.array(train_ds[i]["image"]) for i in range(5)] +sample_captions = [train_ds[i]["text"] for i in range(5)] +plot_images(sample_images_to_visualize, sample_captions) +``` + +
+ Sample training images +
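+
+なお、上のヒントで触れたように、画像ごとに複数のキャプションを持つデータセットでは、トレーニング中に利用可能なキャプションの中から 1 つをランダムにサンプリングするのが一般的な戦略です。以下は、各例の `text` フィールドがキャプションのリストであると仮定した場合の最小限のスケッチです (この Pokémon データセットは画像ごとにキャプションが 1 つなので、あくまで説明用の仮の例です)。
+
+```python
+import random
+
+
+def sample_one_caption(example_batch):
+    # hypothetical helper: if `text` held a list of captions per image,
+    # draw one at random each time the batch is materialized
+    example_batch["text"] = [random.choice(captions) for captions in example_batch["text"]]
+    return example_batch
+```
+
+このような関数は、次のセクションで示す `set_transform` ベースの前処理と組み合わせることで、エポックごとに異なるキャプションが選ばれるようにできます。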
+ +## Preprocess the dataset + +データセットには 2 つのモダリティ (画像とテキスト) があるため、前処理パイプラインは画像とキャプションを前処理します。 + +これを行うには、微調整しようとしているモデルに関連付けられたプロセッサ クラスをロードします。 + +```python +from transformers import AutoProcessor + +checkpoint = "microsoft/git-base" +processor = AutoProcessor.from_pretrained(checkpoint) +``` + + +プロセッサは内部で画像を前処理し (サイズ変更やピクセル スケーリングを含む)、キャプションをトークン化します。 + +```python +def transforms(example_batch): + images = [x for x in example_batch["image"]] + captions = [x for x in example_batch["text"]] + inputs = processor(images=images, text=captions, padding="max_length") + inputs.update({"labels": inputs["input_ids"]}) + return inputs + + +train_ds.set_transform(transforms) +test_ds.set_transform(transforms) +``` + +データセットの準備ができたら、微調整用にモデルをセットアップできます。 + +## Load a base model + +["microsoft/git-base"](https://huggingface.co/microsoft/git-base) を [`AutoModelForCausalLM`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForCausalLM) オブジェクト。 + +```python +from transformers import AutoModelForCausalLM + +model = AutoModelForCausalLM.from_pretrained(checkpoint) +``` + +```python +from transformers import AutoModelForCausalLM + +model = AutoModelForCausalLM.from_pretrained(checkpoint) +``` +## Evaluate + +画像キャプション モデルは通常、[Rouge Score](https://huggingface.co/spaces/evaluate-metric/rouge) または [Word Error Rate](https://huggingface.co/spaces/evaluate-metric/) で評価されます。そうだった)。このガイドでは、Word Error Rate (WER) を使用します。 + +これを行うには 🤗 Evaluate ライブラリを使用します。 WER の潜在的な制限やその他の問題点については、[このガイド](https://huggingface.co/spaces/evaluate-metric/wer) を参照してください。 + +```python +from evaluate import load +import torch + +wer = load("wer") + + +def compute_metrics(eval_pred): + logits, labels = eval_pred + predicted = logits.argmax(-1) + decoded_labels = processor.batch_decode(labels, skip_special_tokens=True) + decoded_predictions = processor.batch_decode(predicted, skip_special_tokens=True) + wer_score = wer.compute(predictions=decoded_predictions, references=decoded_labels) + return {"wer_score": wer_score} +``` + +## Train! + +これで、モデルの微調整を開始する準備が整いました。これには 🤗 [`Trainer`] を使用します。 + +まず、[`TrainingArguments`] を使用してトレーニング引数を定義します。 + +```python +from transformers import TrainingArguments, Trainer + +model_name = checkpoint.split("/")[1] + +training_args = TrainingArguments( + output_dir=f"{model_name}-pokemon", + learning_rate=5e-5, + num_train_epochs=50, + fp16=True, + per_device_train_batch_size=32, + per_device_eval_batch_size=32, + gradient_accumulation_steps=2, + save_total_limit=3, + evaluation_strategy="steps", + eval_steps=50, + save_strategy="steps", + save_steps=50, + logging_steps=50, + remove_unused_columns=False, + push_to_hub=True, + label_names=["labels"], + load_best_model_at_end=True, +) +``` + +Trainer 次に、次に、データセットとモデルと一緒に 🤗 に渡します。 + +```python +trainer = Trainer( + model=model, + args=training_args, + train_dataset=train_ds, + eval_dataset=test_ds, + compute_metrics=compute_metrics, +) +``` + +トレーニングを開始するには、[`Trainer`] オブジェクトの [`~Trainer.train`] を呼び出すだけです。 + +```python +trainer.train() +``` + +トレーニングが進むにつれて、トレーニングの損失がスムーズに減少することがわかります。 + +トレーニングが完了したら、 [`~Trainer.push_to_hub`] メソッドを使用してモデルをハブに共有し、誰もがモデルを使用できるようにします。 + +```python +trainer.push_to_hub() +``` + +## Inference + +`test_ds` からサンプル画像を取得してモデルをテストします。 + +```python +from PIL import Image +import requests + +url = "https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/pokemon.png" +image = Image.open(requests.get(url, stream=True).raw) +image +``` + +
+ Test image +
+ +モデル用の画像を準備します。 + +```python +device = "cuda" if torch.cuda.is_available() else "cpu" + +inputs = processor(images=image, return_tensors="pt").to(device) +pixel_values = inputs.pixel_values +``` + +[`generate`] を呼び出して予測をデコードします。 + +```python +generated_ids = model.generate(pixel_values=pixel_values, max_length=50) +generated_caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0] +print(generated_caption) +``` +```bash +a drawing of a pink and blue pokemon +``` + +微調整されたモデルにより、非常に優れたキャプションが生成されたようです。 + + + + diff --git a/docs/source/ja/tasks/image_classification.md b/docs/source/ja/tasks/image_classification.md new file mode 100644 index 00000000000000..f8d8d0d55238b9 --- /dev/null +++ b/docs/source/ja/tasks/image_classification.md @@ -0,0 +1,559 @@ + + + +# Image classification + +[[open-in-colab]] + + + +画像分類では、画像にラベルまたはクラスを割り当てます。テキストや音声の分類とは異なり、入力は +画像を構成するピクセル値。損傷の検出など、画像分類には多くの用途があります +自然災害の後、作物の健康状態を監視したり、病気の兆候がないか医療画像をスクリーニングしたりするのに役立ちます。 + +このガイドでは、次の方法を説明します。 + +1. [Food-101](https://huggingface.co/datasets/food101) データセットの [ViT](model_doc/vit) を微調整して、画像内の食品を分類します。 +2. 微調整したモデルを推論に使用します。 + + +このチュートリアルで説明するタスクは、次のモデル アーキテクチャでサポートされています。 + + + +[BEiT](../model_doc/beit), [BiT](../model_doc/bit), [ConvNeXT](../model_doc/convnext), [ConvNeXTV2](../model_doc/convnextv2), [CvT](../model_doc/cvt), [Data2VecVision](../model_doc/data2vec-vision), [DeiT](../model_doc/deit), [DiNAT](../model_doc/dinat), [DINOv2](../model_doc/dinov2), [EfficientFormer](../model_doc/efficientformer), [EfficientNet](../model_doc/efficientnet), [FocalNet](../model_doc/focalnet), [ImageGPT](../model_doc/imagegpt), [LeViT](../model_doc/levit), [MobileNetV1](../model_doc/mobilenet_v1), [MobileNetV2](../model_doc/mobilenet_v2), [MobileViT](../model_doc/mobilevit), [MobileViTV2](../model_doc/mobilevitv2), [NAT](../model_doc/nat), [Perceiver](../model_doc/perceiver), [PoolFormer](../model_doc/poolformer), [PVT](../model_doc/pvt), [RegNet](../model_doc/regnet), [ResNet](../model_doc/resnet), [SegFormer](../model_doc/segformer), [SwiftFormer](../model_doc/swiftformer), [Swin Transformer](../model_doc/swin), [Swin Transformer V2](../model_doc/swinv2), [VAN](../model_doc/van), [ViT](../model_doc/vit), [ViT Hybrid](../model_doc/vit_hybrid), [ViTMSN](../model_doc/vit_msn) + + + + + +始める前に、必要なライブラリがすべてインストールされていることを確認してください。 + +```bash +pip install transformers datasets evaluate +``` + +Hugging Face アカウントにログインして、モデルをアップロードしてコミュニティと共有することをお勧めします。プロンプトが表示されたら、トークンを入力してログインします。 + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## Load Food-101 dataset + +Datasets、🤗 データセット ライブラリから Food-101 データセットの小さいサブセットを読み込みます。これにより、次の機会が得られます +完全なデータセットのトレーニングにさらに時間を費やす前に、実験してすべてが機能することを確認してください。 + +```py +>>> from datasets import load_dataset + +>>> food = load_dataset("food101", split="train[:5000]") +``` + +[`~datasets.Dataset.train_test_split`] メソッドを使用して、データセットの `train` 分割をトレイン セットとテスト セットに分割します。 + + +```py +>>> food = food.train_test_split(test_size=0.2) +``` + +次に、例を見てみましょう。 + +```py +>>> food["train"][0] +{'image': , + 'label': 79} +``` + +データセット内の各例には 2 つのフィールドがあります。 + +- `image`: 食品の PIL 画像 +- `label`: 食品のラベルクラス + +モデルがラベル ID からラベル名を取得しやすくするために、ラベル名をマップする辞書を作成します。 +整数への変換、またはその逆: + +```py +>>> labels = food["train"].features["label"].names +>>> label2id, id2label = dict(), dict() +>>> for i, label in enumerate(labels): +... label2id[label] = str(i) +... 
id2label[str(i)] = label +``` + +これで、ラベル ID をラベル名に変換できるようになりました。 + +```py +>>> id2label[str(79)] +'prime_rib' +``` + +## Preprocess + +次のステップでは、ViT 画像プロセッサをロードして画像をテンソルに処理します。 + +```py +>>> from transformers import AutoImageProcessor + +>>> checkpoint = "google/vit-base-patch16-224-in21k" +>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint) +``` + + + + + +いくつかの画像変換を画像に適用して、モデルの過学習に対する堅牢性を高めます。ここでは torchvision の [`transforms`](https://pytorch.org/vision/stable/transforms.html) モジュールを使用しますが、任意の画像ライブラリを使用することもできます。 + +画像のランダムな部分をトリミングし、サイズを変更し、画像の平均と標準偏差で正規化します。 + + +```py +>>> from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor + +>>> normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std) +>>> size = ( +... image_processor.size["shortest_edge"] +... if "shortest_edge" in image_processor.size +... else (image_processor.size["height"], image_processor.size["width"]) +... ) +>>> _transforms = Compose([RandomResizedCrop(size), ToTensor(), normalize]) +``` + +次に、変換を適用し、画像の `pixel_values` (モデルへの入力) を返す前処理関数を作成します。 + +```py +>>> def transforms(examples): +... examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]] +... del examples["image"] +... return examples +``` + +データセット全体に前処理関数を適用するには、🤗 Datasets [`~datasets.Dataset.with_transform`] メソッドを使用します。変換は、データセットの要素を読み込むときにオンザフライで適用されます。 + +```py +>>> food = food.with_transform(transforms) +``` + +次に、[`DefaultDataCollat​​or`] を使用してサンプルのバッチを作成します。 🤗 Transformers の他のデータ照合器とは異なり、`DefaultDataCollat​​or` はパディングなどの追加の前処理を適用しません。 + +```py +>>> from transformers import DefaultDataCollator + +>>> data_collator = DefaultDataCollator() +``` + + + + + + + +過剰適合を回避し、モデルをより堅牢にするために、データセットのトレーニング部分にデータ拡張を追加します。 +ここでは、Keras 前処理レイヤーを使用してトレーニング データの変換 (データ拡張を含む) を定義します。 +検証データの変換 (中央のトリミング、サイズ変更、正規化のみ)。 `tf.image` または +他のライブラリでも構いません。 + + +```py +>>> from tensorflow import keras +>>> from tensorflow.keras import layers + +>>> size = (image_processor.size["height"], image_processor.size["width"]) + +>>> train_data_augmentation = keras.Sequential( +... [ +... layers.RandomCrop(size[0], size[1]), +... layers.Rescaling(scale=1.0 / 127.5, offset=-1), +... layers.RandomFlip("horizontal"), +... layers.RandomRotation(factor=0.02), +... layers.RandomZoom(height_factor=0.2, width_factor=0.2), +... ], +... name="train_data_augmentation", +... ) + +>>> val_data_augmentation = keras.Sequential( +... [ +... layers.CenterCrop(size[0], size[1]), +... layers.Rescaling(scale=1.0 / 127.5, offset=-1), +... ], +... name="val_data_augmentation", +... ) +``` + +次に、一度に 1 つの画像ではなく、画像のバッチに適切な変換を適用する関数を作成します。 + +```py +>>> import numpy as np +>>> import tensorflow as tf +>>> from PIL import Image + + +>>> def convert_to_tf_tensor(image: Image): +... np_image = np.array(image) +... tf_image = tf.convert_to_tensor(np_image) +... # `expand_dims()` is used to add a batch dimension since +... # the TF augmentation layers operates on batched inputs. +... return tf.expand_dims(tf_image, 0) + + +>>> def preprocess_train(example_batch): +... """Apply train_transforms across a batch.""" +... images = [ +... train_data_augmentation(convert_to_tf_tensor(image.convert("RGB"))) for image in example_batch["image"] +... ] +... example_batch["pixel_values"] = [tf.transpose(tf.squeeze(image)) for image in images] +... return example_batch + + +... def preprocess_val(example_batch): +... """Apply val_transforms across a batch.""" +... images = [ +... 
val_data_augmentation(convert_to_tf_tensor(image.convert("RGB"))) for image in example_batch["image"] +... ] +... example_batch["pixel_values"] = [tf.transpose(tf.squeeze(image)) for image in images] +... return example_batch +``` + +🤗 データセット [`~datasets.Dataset.set_transform`] を使用して、その場で変換を適用します。 + +```py +food["train"].set_transform(preprocess_train) +food["test"].set_transform(preprocess_val) +``` + +最後の前処理ステップとして、`DefaultDataCollat​​or`を使用してサンプルのバッチを作成します。 🤗 Transformers の他のデータ照合機能とは異なり、 +`DefaultDataCollat​​or` は、パディングなどの追加の前処理を適用しません。 + +```py +>>> from transformers import DefaultDataCollator + +>>> data_collator = DefaultDataCollator(return_tensors="tf") +``` + + + +## Evaluate + +トレーニング中にメトリクスを含めると、多くの場合、モデルのパフォーマンスを評価するのに役立ちます。すぐにロードできます +🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) ライブラリを使用した評価方法。このタスクでは、ロードします +[accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) 指標 (詳細については、🤗 評価 [クイック ツアー](https://huggingface.co/docs/evaluate/a_quick_tour) を参照してくださいメトリクスをロードして計算する方法): + +```py +>>> import evaluate + +>>> accuracy = evaluate.load("accuracy") +``` + +次に、予測とラベルを [`~evaluate.EvaluationModule.compute`] に渡して精度を計算する関数を作成します。 + +```py +>>> import numpy as np + + +>>> def compute_metrics(eval_pred): +... predictions, labels = eval_pred +... predictions = np.argmax(predictions, axis=1) +... return accuracy.compute(predictions=predictions, references=labels) +``` + +これで `compute_metrics`関数の準備が整いました。トレーニングを設定するときにこの関数に戻ります。 + +## Train + + + + + +[`Trainer`] を使用したモデルの微調整に慣れていない場合は、[こちら](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 + + + +これでモデルのトレーニングを開始する準備が整いました。 [`AutoModelForImageClassification`] を使用して ViT をロードします。ラベルの数と予想されるラベルの数、およびラベル マッピングを指定します。 + +```py +>>> from transformers import AutoModelForImageClassification, TrainingArguments, Trainer + +>>> model = AutoModelForImageClassification.from_pretrained( +... checkpoint, +... num_labels=len(labels), +... id2label=id2label, +... label2id=label2id, +... ) +``` + +この時点で残っているステップは 3 つだけです。 + +1. [`TrainingArguments`] でトレーニング ハイパーパラメータを定義します。 `image` 列が削除されるため、未使用の列を削除しないことが重要です。 `image` 列がないと、`pixel_values` を作成できません。この動作を防ぐには、`remove_unused_columns=False`を設定してください。他に必要なパラメータは、モデルの保存場所を指定する `output_dir` だけです。 `push_to_hub=True`を設定して、このモデルをハブにプッシュします (モデルをアップロードするには、Hugging Face にサインインする必要があります)。各エポックの終了時に、[`Trainer`] は精度を評価し、トレーニング チェックポイントを保存します。 +2. トレーニング引数を、モデル、データセット、トークナイザー、データ照合器、および `compute_metrics` 関数とともに [`Trainer`] に渡します。 +3. [`~Trainer.train`] を呼び出してモデルを微調整します。 + +```py +>>> training_args = TrainingArguments( +... output_dir="my_awesome_food_model", +... remove_unused_columns=False, +... evaluation_strategy="epoch", +... save_strategy="epoch", +... learning_rate=5e-5, +... per_device_train_batch_size=16, +... gradient_accumulation_steps=4, +... per_device_eval_batch_size=16, +... num_train_epochs=3, +... warmup_ratio=0.1, +... logging_steps=10, +... load_best_model_at_end=True, +... metric_for_best_model="accuracy", +... push_to_hub=True, +... ) + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... data_collator=data_collator, +... train_dataset=food["train"], +... eval_dataset=food["test"], +... tokenizer=image_processor, +... compute_metrics=compute_metrics, +... 
) + +>>> trainer.train() +``` + +トレーニングが完了したら、 [`~transformers.Trainer.push_to_hub`] メソッドを使用してモデルをハブに共有し、誰もがモデルを使用できるようにします。 + +```py +>>> trainer.push_to_hub() +``` + + + + + + + + + +Keras を使用したモデルの微調整に慣れていない場合は、まず [基本チュートリアル](./training#train-a-tensorflow-model-with-keras) を確認してください。 + + + + +TensorFlow でモデルを微調整するには、次の手順に従います。 +1. トレーニングのハイパーパラメータを定義し、オプティマイザーと学習率スケジュールを設定します。 +2. 事前トレーニングされたモデルをインスタンス化します。 +3. 🤗 データセットを `tf.data.Dataset` に変換します。 +4. モデルをコンパイルします。 +5. コールバックを追加し、`fit()` メソッドを使用してトレーニングを実行します。 +6. モデルを 🤗 Hub にアップロードしてコミュニティと共有します。 + +まず、ハイパーパラメーター、オプティマイザー、学習率スケジュールを定義します。 + +```py +>>> from transformers import create_optimizer + +>>> batch_size = 16 +>>> num_epochs = 5 +>>> num_train_steps = len(food["train"]) * num_epochs +>>> learning_rate = 3e-5 +>>> weight_decay_rate = 0.01 + +>>> optimizer, lr_schedule = create_optimizer( +... init_lr=learning_rate, +... num_train_steps=num_train_steps, +... weight_decay_rate=weight_decay_rate, +... num_warmup_steps=0, +... ) +``` + +次に、ラベル マッピングとともに [`TFAutoModelForImageClassification`] を使用して ViT を読み込みます。 + +```py +>>> from transformers import TFAutoModelForImageClassification + +>>> model = TFAutoModelForImageClassification.from_pretrained( +... checkpoint, +... id2label=id2label, +... label2id=label2id, +... ) +``` + +Convert your datasets to the `tf.data.Dataset` format using the [`~datasets.Dataset.to_tf_dataset`] and your `data_collator`: + +```py +>>> # converting our train dataset to tf.data.Dataset +>>> tf_train_dataset = food["train"].to_tf_dataset( +... columns="pixel_values", label_cols="label", shuffle=True, batch_size=batch_size, collate_fn=data_collator +... ) + +>>> # converting our test dataset to tf.data.Dataset +>>> tf_eval_dataset = food["test"].to_tf_dataset( +... columns="pixel_values", label_cols="label", shuffle=True, batch_size=batch_size, collate_fn=data_collator +... ) +``` + +`compile()` を使用してトレーニング用にモデルを設定します。 + +```py +>>> from tensorflow.keras.losses import SparseCategoricalCrossentropy + +>>> loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True) +>>> model.compile(optimizer=optimizer, loss=loss) +``` + +予測から精度を計算し、モデルを 🤗 ハブにプッシュするには、[Keras callbacks](../main_classes/keras_callbacks) を使用します。 +`compute_metrics` 関数を [KerasMetricCallback](../main_classes/keras_callbacks#transformers.KerasMetricCallback) に渡します。 +[PushToHubCallback](../main_classes/keras_callbacks#transformers.PushToHubCallback) を使用してモデルをアップロードします。 + +```py +>>> from transformers.keras_callbacks import KerasMetricCallback, PushToHubCallback + +>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_eval_dataset) +>>> push_to_hub_callback = PushToHubCallback( +... output_dir="food_classifier", +... tokenizer=image_processor, +... save_strategy="no", +... 
)
+>>> callbacks = [metric_callback, push_to_hub_callback]
+```
+
+ついに、モデルをトレーニングする準備が整いました。トレーニングデータセットと検証データセット、エポック数、
+コールバックを指定して `fit()` を呼び出し、モデルを微調整します。
+
+```py
+>>> model.fit(tf_train_dataset, validation_data=tf_eval_dataset, epochs=num_epochs, callbacks=callbacks)
+Epoch 1/5
+250/250 [==============================] - 313s 1s/step - loss: 2.5623 - val_loss: 1.4161 - accuracy: 0.9290
+Epoch 2/5
+250/250 [==============================] - 265s 1s/step - loss: 0.9181 - val_loss: 0.6808 - accuracy: 0.9690
+Epoch 3/5
+250/250 [==============================] - 252s 1s/step - loss: 0.3910 - val_loss: 0.4303 - accuracy: 0.9820
+Epoch 4/5
+250/250 [==============================] - 251s 1s/step - loss: 0.2028 - val_loss: 0.3191 - accuracy: 0.9900
+Epoch 5/5
+250/250 [==============================] - 238s 949ms/step - loss: 0.1232 - val_loss: 0.3259 - accuracy: 0.9890
+```
+
+おめでとうございます!モデルを微調整し、🤗 Hub で共有しました。これで推論に使用できるようになりました。
+
+
+
+
+
+画像分類用のモデルを微調整する方法の詳細な例については、対応する [PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb) を参照してください。
+
+
+
+## Inference
+
+モデルを微調整したので、それを推論に使用できるようになりました。
+
+推論を実行したい画像を読み込みます。
+
+```py
+>>> ds = load_dataset("food101", split="validation[:10]")
+>>> image = ds["image"][0]
+```
+
+ image of beignets +
+ +推論用に微調整されたモデルを試す最も簡単な方法は、それを [`pipeline`] で使用することです。モデルを使用して画像分類用の`pipeline`をインスタンス化し、それに画像を渡します。 + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline("image-classification", model="my_awesome_food_model") +>>> classifier(image) +[{'score': 0.31856709718704224, 'label': 'beignets'}, + {'score': 0.015232225880026817, 'label': 'bruschetta'}, + {'score': 0.01519392803311348, 'label': 'chicken_wings'}, + {'score': 0.013022331520915031, 'label': 'pork_chop'}, + {'score': 0.012728818692266941, 'label': 'prime_rib'}] +``` + +必要に応じて、`pipeline`の結果を手動で複製することもできます。 + + + + +画像プロセッサをロードして画像を前処理し、`input`を PyTorch テンソルとして返します。 + +```py +>>> from transformers import AutoImageProcessor +>>> import torch + +>>> image_processor = AutoImageProcessor.from_pretrained("my_awesome_food_model") +>>> inputs = image_processor(image, return_tensors="pt") +``` + +入力をモデルに渡し、ロジットを返します。 + +```py +>>> from transformers import AutoModelForImageClassification + +>>> model = AutoModelForImageClassification.from_pretrained("my_awesome_food_model") +>>> with torch.no_grad(): +... logits = model(**inputs).logits +``` + +最も高い確率で予測されたラベルを取得し、モデルの `id2label` マッピングを使用してラベルに変換します。 + +```py +>>> predicted_label = logits.argmax(-1).item() +>>> model.config.id2label[predicted_label] +'beignets' +``` + + + + + + +画像プロセッサをロードして画像を前処理し、`input`を TensorFlow テンソルとして返します。 + +```py +>>> from transformers import AutoImageProcessor + +>>> image_processor = AutoImageProcessor.from_pretrained("MariaK/food_classifier") +>>> inputs = image_processor(image, return_tensors="tf") +``` + +入力をモデルに渡し、ロジットを返します。 + +```py +>>> from transformers import TFAutoModelForImageClassification + +>>> model = TFAutoModelForImageClassification.from_pretrained("MariaK/food_classifier") +>>> logits = model(**inputs).logits +``` + +最も高い確率で予測されたラベルを取得し、モデルの `id2label` マッピングを使用してラベルに変換します。 + + +```py +>>> predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0]) +>>> model.config.id2label[predicted_class_id] +'beignets' +``` + + + + diff --git a/docs/source/ja/tasks/image_to_image.md b/docs/source/ja/tasks/image_to_image.md new file mode 100644 index 00000000000000..cd429a4e1e27c3 --- /dev/null +++ b/docs/source/ja/tasks/image_to_image.md @@ -0,0 +1,135 @@ + + +# Image-to-Image Task Guide + +[[open-in-colab]] + +Image-to-Image タスクは、アプリケーションが画像を受信し、別の画像を出力するタスクです。これには、画像強化 (超解像度、低光量強化、ディレインなど)、画像修復などを含むさまざまなサブタスクがあります。 + +このガイドでは、次の方法を説明します。 +- 超解像度タスクに画像間のパイプラインを使用します。 +- パイプラインを使用せずに、同じタスクに対してイメージ間モデルを実行します。 + +このガイドがリリースされた時点では、`image-to-image`パイプラインは超解像度タスクのみをサポートしていることに注意してください。 + +必要なライブラリをインストールすることから始めましょう。 + +```bash +pip install transformers +``` + +[Swin2SR モデル](https://huggingface.co/caidas/swin2SR-lightweight-x2-64) を使用してパイプラインを初期化できるようになりました。次に、イメージを使用してパイプラインを呼び出すことで、パイプラインを推論できます。現時点では、[Swin2SR モデル](https://huggingface.co/models?sort=trending&search=swin2sr) のみがこのパイプラインでサポートされています。 + +```python +from transformers import pipeline + +device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') +pipe = pipeline(task="image-to-image", model="caidas/swin2SR-lightweight-x2-64", device=device) +``` + +では、画像を読み込みましょう。 + +```python +from PIL import Image +import requests + +url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/cat.jpg" +image = Image.open(requests.get(url, stream=True).raw) + +print(image.size) +``` +```bash +# (532, 432) +``` +
+ Photo of a cat +
+ + +これで、パイプラインを使用して推論を実行できるようになりました。猫の画像の拡大バージョンを取得します。 + +```python +upscaled = pipe(image) +print(upscaled.size) +``` +```bash +# (1072, 880) +``` + +パイプラインを使用せずに自分で推論を実行したい場合は、トランスフォーマーの `Swin2SRForImageSuperResolution` クラスと `Swin2SRImageProcessor` クラスを使用できます。これには同じモデルのチェックポイントを使用します。モデルとプロセッサを初期化しましょう。 + +```python +from transformers import Swin2SRForImageSuperResolution, Swin2SRImageProcessor + +model = Swin2SRForImageSuperResolution.from_pretrained("caidas/swin2SR-lightweight-x2-64").to(device) +processor = Swin2SRImageProcessor("caidas/swin2SR-lightweight-x2-64") +``` + +`pipeline`」は、自分で行う必要がある前処理と後処理のステップを抽象化するので、画像を前処理しましょう。画像をプロセッサに渡してから、ピクセル値を GPU に移動します。 + +```python +pixel_values = processor(image, return_tensors="pt").pixel_values +print(pixel_values.shape) + +pixel_values = pixel_values.to(device) +``` + +これで、ピクセル値をモデルに渡すことで画像を推測できるようになりました。 + +```python +import torch + +with torch.no_grad(): + outputs = model(pixel_values) +``` + +出力は、以下のような `ImageSuperResolutionOutput` タイプのオブジェクトです 👇 + +``` +(loss=None, reconstruction=tensor([[[[0.8270, 0.8269, 0.8275, ..., 0.7463, 0.7446, 0.7453], + [0.8287, 0.8278, 0.8283, ..., 0.7451, 0.7448, 0.7457], + [0.8280, 0.8273, 0.8269, ..., 0.7447, 0.7446, 0.7452], + ..., + [0.5923, 0.5933, 0.5924, ..., 0.0697, 0.0695, 0.0706], + [0.5926, 0.5932, 0.5926, ..., 0.0673, 0.0687, 0.0705], + [0.5927, 0.5914, 0.5922, ..., 0.0664, 0.0694, 0.0718]]]], + device='cuda:0'), hidden_states=None, attentions=None) +``` + +`reconstruction`を取得し、それを視覚化するために後処理する必要があります。どのように見えるか見てみましょう。 + +```python +outputs.reconstruction.data.shape +# torch.Size([1, 3, 880, 1072]) +``` + +出力を圧縮して軸 0 を削除し、値をクリップしてから、それを numpy float に変換する必要があります。次に、軸を [1072, 880] の形状になるように配置し、最後に出力を範囲 [0, 255] に戻します。 + +```python +import numpy as np + +# squeeze, take to CPU and clip the values +output = outputs.reconstruction.data.squeeze().cpu().clamp_(0, 1).numpy() +# rearrange the axes +output = np.moveaxis(output, source=0, destination=-1) +# bring values back to pixel values range +output = (output * 255.0).round().astype(np.uint8) +Image.fromarray(output) +``` +
+ Upscaled photo of a cat +
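+
+参考までに、上記の手動の手順 (前処理 → 順伝播 → 後処理) を 1 つの関数にまとめると次のようになります。これはこのガイドで作成した `model`、`processor`、`device` をそのまま再利用することを前提とした最小限のスケッチであり、`upscale` という関数名は説明用の仮のものです。
+
+```python
+import numpy as np
+import torch
+from PIL import Image
+
+
+def upscale(image: Image.Image) -> Image.Image:
+    # preprocess and move the pixel values to the same device as the model
+    pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)
+    with torch.no_grad():
+        outputs = model(pixel_values)
+    # drop the batch dimension, clip to [0, 1] and rearrange to height x width x channels
+    output = outputs.reconstruction.data.squeeze().cpu().clamp_(0, 1).numpy()
+    output = np.moveaxis(output, source=0, destination=-1)
+    # bring values back to the [0, 255] pixel range
+    return Image.fromarray((output * 255.0).round().astype(np.uint8))
+```
+
+`upscale(image)` のように呼び出すと、上と同じ拡大画像が PIL イメージとして得られます。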
\ No newline at end of file diff --git a/docs/source/ja/tasks/knowledge_distillation_for_image_classification.md b/docs/source/ja/tasks/knowledge_distillation_for_image_classification.md new file mode 100644 index 00000000000000..16df6e3b9d9658 --- /dev/null +++ b/docs/source/ja/tasks/knowledge_distillation_for_image_classification.md @@ -0,0 +1,188 @@ + +# Knowledge Distillation for Computer Vision + +[[open-in-colab]] + +知識の蒸留は、より大規模で複雑なモデル (教師) からより小規模で単純なモデル (生徒) に知識を伝達するために使用される手法です。あるモデルから別のモデルに知識を抽出するには、特定のタスク (この場合は画像分類) でトレーニングされた事前トレーニング済み教師モデルを取得し、画像分類でトレーニングされる生徒モデルをランダムに初期化します。次に、学生モデルをトレーニングして、その出力と教師の出力の差を最小限に抑え、動作を模倣します。これは [Distilling the Knowledge in a Neural Network by Hinton et al](https://arxiv.org/abs/1503.02531) で最初に導入されました。このガイドでは、タスク固有の知識の蒸留を行います。これには [Beans データセット](https://huggingface.co/datasets/beans) を使用します。 + +このガイドでは、[微調整された ViT モデル](https://huggingface.co/merve/vit-mobilenet-beans-224) (教師モデル) を抽出して [MobileNet](https://huggingface.co/google/mobilenet_v2_1.4_224) (学生モデル) 🤗 Transformers の [Trainer API](https://huggingface.co/docs/transformers/en/main_classes/trainer#trainer) を使用します。 + +蒸留とプロセスの評価に必要なライブラリをインストールしましょう。 + +```bash +pip install transformers datasets accelerate tensorboard evaluate --upgrade +``` + +この例では、教師モデルとして`merve/beans-vit-224`モデルを使用しています。これは、Bean データセットに基づいて微調整された`google/vit-base-patch16-224-in21k`に基づく画像分類モデルです。このモデルをランダムに初期化された MobileNetV2 に抽出します。 + +次に、データセットをロードします。 + +```python +from datasets import load_dataset + +dataset = load_dataset("beans") +``` + +この場合、同じ解像度で同じ出力が返されるため、どちらのモデルの画像プロセッサも使用できます。 `dataset`の`map()`メソッドを使用して、データセットのすべての分割に前処理を適用します。 + +```python +from transformers import AutoImageProcessor +teacher_processor = AutoImageProcessor.from_pretrained("merve/beans-vit-224") + +def process(examples): + processed_inputs = teacher_processor(examples["image"]) + return processed_inputs + +processed_datasets = dataset.map(process, batched=True) +``` + +基本的に、我々は生徒モデル(ランダムに初期化されたMobileNet)が教師モデル(微調整されたビジョン変換器)を模倣することを望む。これを実現するために、まず教師と生徒からロジット出力を得る。次に、それぞれのソフトターゲットの重要度を制御するパラメータ`temperature`で分割する。`lambda`と呼ばれるパラメータは蒸留ロスの重要度を量る。この例では、`temperature=5`、`lambda=0.5`とする。生徒と教師の間の発散を計算するために、Kullback-Leibler発散損失を使用します。2つのデータPとQが与えられたとき、KLダイバージェンスはQを使ってPを表現するためにどれだけの余分な情報が必要かを説明します。もし2つが同じであれば、QからPを説明するために必要な他の情報はないので、それらのKLダイバージェンスはゼロになります。 + + +```python +from transformers import TrainingArguments, Trainer +import torch +import torch.nn as nn +import torch.nn.functional as F + + +class ImageDistilTrainer(Trainer): + def __init__(self, *args, teacher_model=None, **kwargs): + super().__init__(*args, **kwargs) + self.teacher = teacher_model + self.student = student_model + self.loss_function = nn.KLDivLoss(reduction="batchmean") + device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') + self.teacher.to(device) + self.teacher.eval() + self.temperature = temperature + self.lambda_param = lambda_param + + def compute_loss(self, student, inputs, return_outputs=False): + student_output = self.student(**inputs) + + with torch.no_grad(): + teacher_output = self.teacher(**inputs) + + # Compute soft targets for teacher and student + soft_teacher = F.softmax(teacher_output.logits / self.temperature, dim=-1) + soft_student = F.log_softmax(student_output.logits / self.temperature, dim=-1) + + # Compute the loss + distillation_loss = self.loss_function(soft_student, soft_teacher) * (self.temperature ** 2) + + # Compute the true label loss + student_target_loss = student_output.loss + + # Calculate final loss + loss = (1. 
- self.lambda_param) * student_target_loss + self.lambda_param * distillation_loss + return (loss, student_output) if return_outputs else loss +``` + +次に、Hugging Face Hub にログインして、`trainer`を通じてモデルを Hugging Face Hub にプッシュできるようにします。 + +```python +from huggingface_hub import notebook_login + +notebook_login() +``` + +教師モデルと生徒モデルである`TrainingArguments`を設定しましょう。 + +```python +from transformers import AutoModelForImageClassification, MobileNetV2Config, MobileNetV2ForImageClassification + +training_args = TrainingArguments( + output_dir="my-awesome-model", + num_train_epochs=30, + fp16=True, + logging_dir=f"{repo_name}/logs", + logging_strategy="epoch", + evaluation_strategy="epoch", + save_strategy="epoch", + load_best_model_at_end=True, + metric_for_best_model="accuracy", + report_to="tensorboard", + push_to_hub=True, + hub_strategy="every_save", + hub_model_id=repo_name, + ) + +num_labels = len(processed_datasets["train"].features["labels"].names) + +# initialize models +teacher_model = AutoModelForImageClassification.from_pretrained( + "merve/beans-vit-224", + num_labels=num_labels, + ignore_mismatched_sizes=True +) + +# training MobileNetV2 from scratch +student_config = MobileNetV2Config() +student_config.num_labels = num_labels +student_model = MobileNetV2ForImageClassification(student_config) +``` + +`compute_metrics` 関数を使用して、テスト セットでモデルを評価できます。この関数は、トレーニング プロセス中にモデルの`accuracy`と`f1`を計算するために使用されます。 + +```python +import evaluate +import numpy as np + +accuracy = evaluate.load("accuracy") + +def compute_metrics(eval_pred): + predictions, labels = eval_pred + acc = accuracy.compute(references=labels, predictions=np.argmax(predictions, axis=1)) + return {"accuracy": acc["accuracy"]} +``` + +定義したトレーニング引数を使用して`Trainer`を初期化しましょう。データ照合装置も初期化します。 + + +```python +from transformers import DefaultDataCollator + +data_collator = DefaultDataCollator() +trainer = ImageDistilTrainer( + student_model=student_model, + teacher_model=teacher_model, + training_args=training_args, + train_dataset=processed_datasets["train"], + eval_dataset=processed_datasets["validation"], + data_collator=data_collator, + tokenizer=teacher_extractor, + compute_metrics=compute_metrics, + temperature=5, + lambda_param=0.5 +) +``` + +これでモデルをトレーニングできるようになりました。 + +```python +trainer.train() +``` + +テスト セットでモデルを評価できます。 + + +```python +trainer.evaluate(processed_datasets["test"]) +``` + +テスト セットでは、モデルの精度は 72% に達します。蒸留効率の健全性チェックを行うために、同じハイパーパラメータを使用して Bean データセットで MobileNet を最初からトレーニングし、テスト セットで 63% の精度を観察しました。読者の皆様には、さまざまな事前トレーニング済み教師モデル、学生アーキテクチャ、蒸留パラメータを試していただき、その結果を報告していただくようお勧めします。抽出されたモデルのトレーニング ログとチェックポイントは [このリポジトリ](https://huggingface.co/merve/vit-mobilenet-beans-224) にあり、最初からトレーニングされた MobileNetV2 はこの [リポジトリ](https://huggingface.co/merve/resnet-mobilenet-beans-5)。 diff --git a/docs/source/ja/tasks/language_modeling.md b/docs/source/ja/tasks/language_modeling.md new file mode 100644 index 00000000000000..835a0d54ea4ffd --- /dev/null +++ b/docs/source/ja/tasks/language_modeling.md @@ -0,0 +1,444 @@ + + +# Causal language modeling + + +[[open-in-colab]] + + +言語モデリングには、因果的モデリングとマスクされた言語モデリングの 2 つのタイプがあります。このガイドでは、因果関係のある言語モデリングについて説明します。 +因果言語モデルはテキスト生成によく使用されます。これらのモデルは、次のようなクリエイティブなアプリケーションに使用できます。 +独自のテキスト アドベンチャーを選択するか、Copilot や CodeParrot などのインテリジェントなコーディング アシスタントを選択します。 + + + + +因果言語モデリングは、一連のトークン内の次のトークンを予測します。モデルは、次のトークンにのみ対応できます。 +左。これは、モデルが将来のトークンを認識できないことを意味します。 GPT-2 は因果的言語モデルの一例です。 + +このガイドでは、次の方法を説明します。 + +1. 
[ELI5](https:/) の [r/askscience](https://www.reddit.com/r/askscience/) サブセットで [DistilGPT2](https://huggingface.co/distilbert/distilgpt2) を微調整します。 /huggingface.co/datasets/eli5) データセット。 +2. 微調整したモデルを推論に使用します。 + + + +このガイドと同じ手順に従って、因果言語モデリング用に他のアーキテクチャを微調整できます。 +次のアーキテクチャのいずれかを選択します。 + + +[BART](../model_doc/bart), [BERT](../model_doc/bert), [Bert Generation](../model_doc/bert-generation), [BigBird](../model_doc/big_bird), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [BioGpt](../model_doc/biogpt), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [BLOOM](../model_doc/bloom), [CamemBERT](../model_doc/camembert), [CodeLlama](../model_doc/code_llama), [CodeGen](../model_doc/codegen), [CPM-Ant](../model_doc/cpmant), [CTRL](../model_doc/ctrl), [Data2VecText](../model_doc/data2vec-text), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [Falcon](../model_doc/falcon), [Fuyu](../model_doc/fuyu), [GIT](../model_doc/git), [GPT-Sw3](../model_doc/gpt-sw3), [OpenAI GPT-2](../model_doc/gpt2), [GPTBigCode](../model_doc/gpt_bigcode), [GPT Neo](../model_doc/gpt_neo), [GPT NeoX](../model_doc/gpt_neox), [GPT NeoX Japanese](../model_doc/gpt_neox_japanese), [GPT-J](../model_doc/gptj), [LLaMA](../model_doc/llama), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [Mistral](../model_doc/mistral), [MPT](../model_doc/mpt), [MusicGen](../model_doc/musicgen), [MVP](../model_doc/mvp), [OpenLlama](../model_doc/open-llama), [OpenAI GPT](../model_doc/openai-gpt), [OPT](../model_doc/opt), [Pegasus](../model_doc/pegasus), [Persimmon](../model_doc/persimmon), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [QDQBert](../model_doc/qdqbert), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [RWKV](../model_doc/rwkv), [Speech2Text2](../model_doc/speech_to_text_2), [Transformer-XL](../model_doc/transfo-xl), [TrOCR](../model_doc/trocr), [XGLM](../model_doc/xglm), [XLM](../model_doc/xlm), [XLM-ProphetNet](../model_doc/xlm-prophetnet), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod) + + + + + + +始める前に、必要なライブラリがすべてインストールされていることを確認してください。 + +```bash +pip install transformers datasets evaluate +``` + +モデルをアップロードしてコミュニティと共有できるように、Hugging Face アカウントにログインすることをお勧めします。プロンプトが表示されたら、トークンを入力してログインします。 + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## Load ELI5 dataset + + +まず、ELI5 データセットの r/askscience サブセットの小さいサブセットを 🤗 データセット ライブラリからロードします。 + これにより、完全なデータセットのトレーニングにさらに時間を費やす前に、実験してすべてが機能することを確認する機会が得られます。 + +```py +>>> from datasets import load_dataset + +>>> eli5 = load_dataset("eli5", split="train_asks[:5000]") +``` + +[`~datasets.Dataset.train_test_split`] メソッドを使用して、データセットの `train_asks` をトレイン セットとテスト セットに分割します。 + +```py +>>> eli5 = eli5.train_test_split(test_size=0.2) +``` + +次に、例を見てみましょう。 + +```py +>>> eli5["train"][0] +{'answers': {'a_id': ['c3d1aib', 'c3d4lya'], + 'score': [6, 3], + 'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. 
That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.", + "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]}, + 'answers_urls': {'url': []}, + 'document': '', + 'q_id': 'nyxfp', + 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?', + 'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']}, + 'subreddit': 'askscience', + 'title': 'Few questions about this space walk photograph.', + 'title_urls': {'url': []}} +``` + +これは多くのことのように見えるかもしれませんが、実際に関心があるのは`text`フィールドだけです。言語モデリングの優れている点 +タスクでは、次の単語がラベル * であるため、ラベル (教師なしタスクとも呼ばれます) は必要ありません。 + + +## Preprocess + + + + +次のステップは、`text`サブフィールドを処理するために DistilGPT2 トークナイザーをロードすることです。 + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2") +``` + +上の例からわかるように、`text`フィールドは実際には`answers`内にネストされています。つまり、次のことが必要になります。 +[` flatten`](https://huggingface.co/docs/datasets/process.html#flatten) メソッドを使用して、ネストされた構造から `text` サブフィールドを抽出します。 + +```py +>>> eli5 = eli5.flatten() +>>> eli5["train"][0] +{'answers.a_id': ['c3d1aib', 'c3d4lya'], + 'answers.score': [6, 3], + 'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.", + "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"], + 'answers_urls.url': [], + 'document': '', + 'q_id': 'nyxfp', + 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?', + 'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'], + 'subreddit': 'askscience', + 'title': 'Few questions about this space walk photograph.', + 'title_urls.url': []} +``` + +`answers`接頭辞で示されるように、各サブフィールドは個別の列になり、`text`フィールドはリストになりました。その代わり +各文を個別にトークン化する場合は、リストを文字列に変換して、それらをまとめてトークン化できるようにします。 + +以下は、各例の文字列のリストを結合し、結果をトークン化する最初の前処理関数です。 + +```py +>>> def preprocess_function(examples): +... 
return tokenizer([" ".join(x) for x in examples["answers.text"]]) +``` + +この前処理関数をデータセット全体に適用するには、🤗 Datasets [`~datasets.Dataset.map`] メソッドを使用します。 `map` 関数を高速化するには、`batched=True` を設定してデータセットの複数の要素を一度に処理し、`num_proc` でプロセスの数を増やします。不要な列を削除します。 + +```py +>>> tokenized_eli5 = eli5.map( +... preprocess_function, +... batched=True, +... num_proc=4, +... remove_columns=eli5["train"].column_names, +... ) +``` + +このデータセットにはトークン シーケンスが含まれていますが、その一部はモデルの最大入力長よりも長くなります。 + +2 番目の前処理関数を使用して、 +- すべてのシーケンスを連結します +- 連結されたシーケンスを`block_size`で定義された短いチャンクに分割します。これは、最大入力長より短く、GPU RAM に十分な長さである必要があります。 + +```py +>>> block_size = 128 + + +>>> def group_texts(examples): +... # Concatenate all texts. +... concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()} +... total_length = len(concatenated_examples[list(examples.keys())[0]]) +... # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can +... # customize this part to your needs. +... if total_length >= block_size: +... total_length = (total_length // block_size) * block_size +... # Split by chunks of block_size. +... result = { +... k: [t[i : i + block_size] for i in range(0, total_length, block_size)] +... for k, t in concatenated_examples.items() +... } +... result["labels"] = result["input_ids"].copy() +... return result +``` + +Apply the `group_texts` function over the entire dataset: + +```py +>>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4) +``` + +次に、[`DataCollat​​orForLanguageModeling`] を使用してサンプルのバッチを作成します。 *動的にパディング*する方が効率的です。 +データセット全体を最大長までパディングするのではなく、照合中にバッチ内の文を最長の長さにします。 + + + + +シーケンス終了トークンをパディング トークンとして使用し、`mlm=False` を設定します。これは、入力を 1 要素分右にシフトしたラベルとして使用します。 + +```py +>>> from transformers import DataCollatorForLanguageModeling + +>>> tokenizer.pad_token = tokenizer.eos_token +>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False) +``` + + + +シーケンス終了トークンをパディング トークンとして使用し、`mlm=False` を設定します。これは、入力を 1 要素分右にシフトしたラベルとして使用します。 + +```py +>>> from transformers import DataCollatorForLanguageModeling + +>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf") +``` + + + + + +## Train + + + + + +[`Trainer`] を使用したモデルの微調整に慣れていない場合は、[基本チュートリアル](../training#train-with-pytorch-trainer) を参照してください。 + + + +これでモデルのトレーニングを開始する準備が整いました。 [`AutoModelForCausalLM`] を使用して DistilGPT2 をロードします。 + + +```py +>>> from transformers import AutoModelForCausalLM, TrainingArguments, Trainer + +>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2") +``` + +この時点で残っている手順は次の 3 つだけです。 + +1. [`TrainingArguments`] でトレーニング ハイパーパラメータを定義します。唯一の必須パラメータは、モデルの保存場所を指定する `output_dir` です。 `push_to_hub=True`を設定して、このモデルをハブにプッシュします (モデルをアップロードするには、Hugging Face にサインインする必要があります)。 +2. トレーニング引数をモデル、データセット、データ照合器とともに [`Trainer`] に渡します。 +3. [`~Trainer.train`] を呼び出してモデルを微調整します。 + +```py +>>> training_args = TrainingArguments( +... output_dir="my_awesome_eli5_clm-model", +... evaluation_strategy="epoch", +... learning_rate=2e-5, +... weight_decay=0.01, +... push_to_hub=True, +... ) + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=lm_dataset["train"], +... eval_dataset=lm_dataset["test"], +... data_collator=data_collator, +... 
) + +>>> trainer.train() +``` + +トレーニングが完了したら、 [`~transformers.Trainer.evaluate`] メソッドを使用してモデルを評価し、その複雑さを取得します。 + +```py +>>> import math + +>>> eval_results = trainer.evaluate() +>>> print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}") +Perplexity: 49.61 +``` + +次に、 [`~transformers.Trainer.push_to_hub`] メソッドを使用してモデルをハブに共有し、誰もがモデルを使用できるようにします。 + +```py +>>> trainer.push_to_hub() +``` + + + + +Keras を使用したモデルの微調整に慣れていない場合は、[基本チュートリアル](../training#train-a-tensorflow-model-with-keras) をご覧ください。 + + +TensorFlow でモデルを微調整するには、オプティマイザー関数、学習率スケジュール、およびいくつかのトレーニング ハイパーパラメーターをセットアップすることから始めます。 + +```py +>>> from transformers import create_optimizer, AdamWeightDecay + +>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01) +``` + +次に、[`TFAutoModelForCausalLM`] を使用して DistilGPT2 をロードできます。 + +```py +>>> from transformers import TFAutoModelForCausalLM + +>>> model = TFAutoModelForCausalLM.from_pretrained("distilbert/distilgpt2") +``` + +[`~transformers.TFPreTrainedModel.prepare_tf_dataset`] を使用して、データセットを `tf.data.Dataset` 形式に変換します。 + +```py +>>> tf_train_set = model.prepare_tf_dataset( +... lm_dataset["train"], +... shuffle=True, +... batch_size=16, +... collate_fn=data_collator, +... ) + +>>> tf_test_set = model.prepare_tf_dataset( +... lm_dataset["test"], +... shuffle=False, +... batch_size=16, +... collate_fn=data_collator, +... ) +``` + +[`compile`](https://keras.io/api/models/model_training_apis/#compile-method) を使用してトレーニング用のモデルを設定します。 Transformers モデルにはすべてデフォルトのタスク関連の損失関数があるため、次の場合を除き、損失関数を指定する必要はないことに注意してください。 + +```py +>>> import tensorflow as tf + +>>> model.compile(optimizer=optimizer) # No loss argument! +``` + +これは、モデルとトークナイザーを [`~transformers.PushToHubCallback`] でプッシュする場所を指定することで実行できます。 + + + +```py +>>> from transformers.keras_callbacks import PushToHubCallback + +>>> callback = PushToHubCallback( +... output_dir="my_awesome_eli5_clm-model", +... tokenizer=tokenizer, +... 
) +``` + +ついに、モデルのトレーニングを開始する準備が整いました。トレーニングおよび検証データセット、エポック数、コールバックを指定して [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) を呼び出し、モデルを微調整します。 + + + +```py +>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback]) +``` + +トレーニングが完了すると、モデルは自動的にハブにアップロードされ、誰でも使用できるようになります。 + + + + + + +因果言語モデリング用にモデルを微調整する方法のより詳細な例については、対応するドキュメントを参照してください。 +[PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb) +または [TensorFlow ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb)。 + + + +## Inference + +モデルを微調整したので、それを推論に使用できるようになりました。 + +テキストを生成するプロンプトを考え出します。 + +```py +>>> prompt = "Somatic hypermutation allows the immune system to" +``` + +推論用に微調整されたモデルを試す最も簡単な方法は、それを [`pipeline`] で使用することです。モデルを使用してテキスト生成用の`pipeline`をインスタンス化し、それにテキストを渡します。 + + +```py +>>> from transformers import pipeline + +>>> generator = pipeline("text-generation", model="my_awesome_eli5_clm-model") +>>> generator(prompt) +[{'generated_text': "Somatic hypermutation allows the immune system to be able to effectively reverse the damage caused by an infection.\n\n\nThe damage caused by an infection is caused by the immune system's ability to perform its own self-correcting tasks."}] +``` + + + + + +テキストをトークン化し、「input_ids」を PyTorch テンソルとして返します。 + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model") +>>> inputs = tokenizer(prompt, return_tensors="pt").input_ids +``` + +[`~transformers.generation_utils.GenerationMixin.generate`] メソッドを使用してテキストを生成します。 +さまざまなテキスト生成戦略と生成を制御するためのパラメーターの詳細については、[テキスト生成戦略](../generation_strategies) ページを参照してください。 + +```py +>>> from transformers import AutoModelForCausalLM + +>>> model = AutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model") +>>> outputs = model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95) +``` + +生成されたトークン ID をデコードしてテキストに戻します。 + +```py +>>> tokenizer.batch_decode(outputs, skip_special_tokens=True) +["Somatic hypermutation allows the immune system to react to drugs with the ability to adapt to a different environmental situation. In other words, a system of 'hypermutation' can help the immune system to adapt to a different environmental situation or in some cases even a single life. In contrast, researchers at the University of Massachusetts-Boston have found that 'hypermutation' is much stronger in mice than in humans but can be found in humans, and that it's not completely unknown to the immune system. 
A study on how the immune system"] +``` + + + + +テキストをトークン化し、`input_ids`を TensorFlow テンソルとして返します。 + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model") +>>> inputs = tokenizer(prompt, return_tensors="tf").input_ids +``` + +[`~transformers.generation_tf_utils.TFGenerationMixin.generate`] メソッドを使用して要約を作成します。さまざまなテキスト生成戦略と生成を制御するためのパラメーターの詳細については、[テキスト生成戦略](../generation_strategies) ページを参照してください。 + +```py +>>> from transformers import TFAutoModelForCausalLM + +>>> model = TFAutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model") +>>> outputs = model.generate(input_ids=inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95) +``` + +生成されたトークン ID をデコードしてテキストに戻します。 + +```py +>>> tokenizer.batch_decode(outputs, skip_special_tokens=True) +['Somatic hypermutation allows the immune system to detect the presence of other viruses as they become more prevalent. Therefore, researchers have identified a high proportion of human viruses. The proportion of virus-associated viruses in our study increases with age. Therefore, we propose a simple algorithm to detect the presence of these new viruses in our samples as a sign of improved immunity. A first study based on this algorithm, which will be published in Science on Friday, aims to show that this finding could translate into the development of a better vaccine that is more effective for'] +``` + + + diff --git a/docs/source/ja/tasks/masked_language_modeling.md b/docs/source/ja/tasks/masked_language_modeling.md new file mode 100644 index 00000000000000..b0fff72f9b0e26 --- /dev/null +++ b/docs/source/ja/tasks/masked_language_modeling.md @@ -0,0 +1,455 @@ + + +# Masked language modeling + +[[open-in-colab]] + + + +マスクされた言語モデリングはシーケンス内のマスクされたトークンを予測し、モデルはトークンを双方向に処理できます。これ +これは、モデルが左右のトークンに完全にアクセスできることを意味します。マスクされた言語モデリングは、次のようなタスクに最適です。 +シーケンス全体の文脈をよく理解する必要があります。 BERT はマスクされた言語モデルの一例です。 + +このガイドでは、次の方法を説明します。 + +1. [ELI5](https://huggingface.co/distilbert/distilroberta-base) の [r/askscience](https://www.reddit.com/r/askscience/) サブセットで [DistilRoBERTa](https://huggingface.co/distilbert/distilroberta-base) を微調整します。 ://huggingface.co/datasets/eli5) データセット。 +2. 
微調整したモデルを推論に使用します。 + + +このガイドと同じ手順に従って、マスクされた言語モデリング用に他のアーキテクチャを微調整できます。 +次のアーキテクチャのいずれかを選択します。 + + + +[ALBERT](../model_doc/albert), [BART](../model_doc/bart), [BERT](../model_doc/bert), [BigBird](../model_doc/big_bird), [CamemBERT](../model_doc/camembert), [ConvBERT](../model_doc/convbert), [Data2VecText](../model_doc/data2vec-text), [DeBERTa](../model_doc/deberta), [DeBERTa-v2](../model_doc/deberta-v2), [DistilBERT](../model_doc/distilbert), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [ESM](../model_doc/esm), [FlauBERT](../model_doc/flaubert), [FNet](../model_doc/fnet), [Funnel Transformer](../model_doc/funnel), [I-BERT](../model_doc/ibert), [LayoutLM](../model_doc/layoutlm), [Longformer](../model_doc/longformer), [LUKE](../model_doc/luke), [mBART](../model_doc/mbart), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [MobileBERT](../model_doc/mobilebert), [MPNet](../model_doc/mpnet), [MRA](../model_doc/mra), [MVP](../model_doc/mvp), [Nezha](../model_doc/nezha), [Nyströmformer](../model_doc/nystromformer), [Perceiver](../model_doc/perceiver), [QDQBert](../model_doc/qdqbert), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [SqueezeBERT](../model_doc/squeezebert), [TAPAS](../model_doc/tapas), [Wav2Vec2](../model_doc/wav2vec2), [XLM](../model_doc/xlm), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [X-MOD](../model_doc/xmod), [YOSO](../model_doc/yoso) + + + + + +始める前に、必要なライブラリがすべてインストールされていることを確認してください。 + +```bash +pip install transformers datasets evaluate +``` + +モデルをアップロードしてコミュニティと共有できるように、Hugging Face アカウントにログインすることをお勧めします。プロンプトが表示されたら、トークンを入力してログインします。 + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## Load ELI5 dataset + +まず、ELI5 データセットの r/askscience サブセットの小さいサブセットを 🤗 データセット ライブラリからロードします。これで +データセット全体のトレーニングにさらに時間を費やす前に、実験してすべてが機能することを確認する機会が与えられます。 + +```py +>>> from datasets import load_dataset + +>>> eli5 = load_dataset("eli5", split="train_asks[:5000]") +``` + +[`~datasets.Dataset.train_test_split`] メソッドを使用して、データセットの `train_asks` をトレイン セットとテスト セットに分割します。 + +```py +>>> eli5 = eli5.train_test_split(test_size=0.2) +``` + +次に、例を見てみましょう。 + +```py +>>> eli5["train"][0] +{'answers': {'a_id': ['c3d1aib', 'c3d4lya'], + 'score': [6, 3], + 'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.", + "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]}, + 'answers_urls': {'url': []}, + 'document': '', + 'q_id': 'nyxfp', + 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? 
And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?', + 'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']}, + 'subreddit': 'askscience', + 'title': 'Few questions about this space walk photograph.', + 'title_urls': {'url': []}} +``` + +これは多くのことのように見えるかもしれませんが、実際に関心があるのは`text`フィールドだけです。言語モデリング タスクの優れた点は、次の単語がラベル * であるため、ラベル (教師なしタスクとも呼ばれます) が必要ないことです。 + +## Preprocess + + + +マスクされた言語モデリングの場合、次のステップは、`text`サブフィールドを処理するために DistilRoBERTa トークナイザーをロードすることです。 + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilroberta-base") +``` + +上の例からわかるように、`text`フィールドは実際には`answers`内にネストされています。これは、次のことを行う必要があることを意味します +[` flatten`](https://huggingface.co/docs/datasets/process.html#flatten) メソッドを使用して、ネストされた構造から `text` サブフィールドを抽出します。 + +```py +>>> eli5 = eli5.flatten() +>>> eli5["train"][0] +{'answers.a_id': ['c3d1aib', 'c3d4lya'], + 'answers.score': [6, 3], + 'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.", + "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"], + 'answers_urls.url': [], + 'document': '', + 'q_id': 'nyxfp', + 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?', + 'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'], + 'subreddit': 'askscience', + 'title': 'Few questions about this space walk photograph.', + 'title_urls.url': []} +``` + +`answers`接頭辞で示されるように、各サブフィールドは個別の列になり、`text`フィールドはリストになりました。その代わり +各文を個別にトークン化する場合は、リストを文字列に変換して、それらをまとめてトークン化できるようにします。 + +以下は、各例の文字列のリストを結合し、結果をトークン化する最初の前処理関数です。 + +```py +>>> def preprocess_function(examples): +... return tokenizer([" ".join(x) for x in examples["answers.text"]]) +``` + +この前処理関数をデータセット全体に適用するには、🤗 Datasets [`~datasets.Dataset.map`] メソッドを使用します。 `map` 関数を高速化するには、`batched=True` を設定してデータセットの複数の要素を一度に処理し、`num_proc` でプロセスの数を増やします。不要な列を削除します。 + +```py +>>> tokenized_eli5 = eli5.map( +... preprocess_function, +... batched=True, +... num_proc=4, +... remove_columns=eli5["train"].column_names, +... ) +``` + +このデータセットにはトークン シーケンスが含まれていますが、その一部はモデルの最大入力長よりも長くなります。 + +2 番目の前処理関数を使用して、 +- すべてのシーケンスを連結します +- 連結されたシーケンスを`block_size`で定義された短いチャンクに分割します。これは、最大入力長より短く、GPU RAM に十分な長さである必要があります。 + +```py +>>> block_size = 128 + + +>>> def group_texts(examples): +... # Concatenate all texts. +... concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()} +... total_length = len(concatenated_examples[list(examples.keys())[0]]) +... 
# We drop the small remainder, we could add padding if the model supported it instead of this drop, you can +... # customize this part to your needs. +... if total_length >= block_size: +... total_length = (total_length // block_size) * block_size +... # Split by chunks of block_size. +... result = { +... k: [t[i : i + block_size] for i in range(0, total_length, block_size)] +... for k, t in concatenated_examples.items() +... } +... return result +``` + +データセット全体に`group_texts`関数を適用します。 + +```py +>>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4) +``` + +次に、[`DataCollat​​orForLanguageModeling`] を使用してサンプルのバッチを作成します。データセット全体を最大長までパディングするのではなく、照合中にバッチ内の最長の長さまで文を *動的にパディング* する方が効率的です。 + + + + +シーケンス終了トークンをパディング トークンとして使用し、データを反復するたびにランダムにトークンをマスクするために `mlm_probability` を指定します。 + +```py +>>> from transformers import DataCollatorForLanguageModeling + +>>> tokenizer.pad_token = tokenizer.eos_token +>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15) +``` + + + +シーケンス終了トークンをパディング トークンとして使用し、データを反復するたびにランダムにトークンをマスクするために `mlm_probability` を指定します。 + + +```py +>>> from transformers import DataCollatorForLanguageModeling + +>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf") +``` + + + +## Train + + + + + +[`Trainer`] を使用したモデルの微調整に慣れていない場合は、[ここ](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 + + + +これでモデルのトレーニングを開始する準備が整いました。 [`AutoModelForMaskedLM`] を使用して DistilRoBERTa をロードします。 + +```py +>>> from transformers import AutoModelForMaskedLM + +>>> model = AutoModelForMaskedLM.from_pretrained("distilbert/distilroberta-base") +``` + +この時点で残っている手順は次の 3 つだけです。 + +1. [`TrainingArguments`] でトレーニング ハイパーパラメータを定義します。唯一の必須パラメータは、モデルの保存場所を指定する `output_dir` です。 `push_to_hub=True`を設定して、このモデルをハブにプッシュします (モデルをアップロードするには、Hugging Face にサインインする必要があります)。 +2. トレーニング引数をモデル、データセット、データ照合器とともに [`Trainer`] に渡します。 +3. [`~Trainer.train`] を呼び出してモデルを微調整します。 + +```py +>>> training_args = TrainingArguments( +... output_dir="my_awesome_eli5_mlm_model", +... evaluation_strategy="epoch", +... learning_rate=2e-5, +... num_train_epochs=3, +... weight_decay=0.01, +... push_to_hub=True, +... ) + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=lm_dataset["train"], +... eval_dataset=lm_dataset["test"], +... data_collator=data_collator, +... ) + +>>> trainer.train() +``` + +トレーニングが完了したら、 [`~transformers.Trainer.evaluate`] メソッドを使用してモデルを評価し、その複雑さを取得します。 + + +```py +>>> import math + +>>> eval_results = trainer.evaluate() +>>> print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}") +Perplexity: 8.76 +``` + +次に、 [`~transformers.Trainer.push_to_hub`] メソッドを使用してモデルをハブに共有し、誰もがモデルを使用できるようにします。 + +```py +>>> trainer.push_to_hub() +``` + + + + + +Keras を使用したモデルの微調整に慣れていない場合は、[こちら](../training#train-a-tensorflow-model-with-keras) の基本的なチュートリアルをご覧ください。 + + + +TensorFlow でモデルを微調整するには、オプティマイザー関数、学習率スケジュール、およびいくつかのトレーニング ハイパーパラメーターをセットアップすることから始めます。 + +```py +>>> from transformers import create_optimizer, AdamWeightDecay + +>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01) +``` + +次に、[`TFAutoModelForMaskedLM`] を使用して DistilRoBERTa をロードできます。 + +```py +>>> from transformers import TFAutoModelForMaskedLM + +>>> model = TFAutoModelForMaskedLM.from_pretrained("distilbert/distilroberta-base") +``` + +[`~transformers.TFPreTrainedModel.prepare_tf_dataset`] を使用して、データセットを `tf.data.Dataset` 形式に変換します。 + +```py +>>> tf_train_set = model.prepare_tf_dataset( +... 
lm_dataset["train"], +... shuffle=True, +... batch_size=16, +... collate_fn=data_collator, +... ) + +>>> tf_test_set = model.prepare_tf_dataset( +... lm_dataset["test"], +... shuffle=False, +... batch_size=16, +... collate_fn=data_collator, +... ) +``` + +[`compile`](https://keras.io/api/models/model_training_apis/#compile-method) を使用してトレーニング用のモデルを設定します。 Transformers モデルにはすべてデフォルトのタスク関連の損失関数があるため、次の場合を除き、損失関数を指定する必要はないことに注意してください。 + + +```py +>>> import tensorflow as tf + +>>> model.compile(optimizer=optimizer) # No loss argument! +``` + +This can be done by specifying where to push your model and tokenizer in the [`~transformers.PushToHubCallback`]: + +```py +>>> from transformers.keras_callbacks import PushToHubCallback + +>>> callback = PushToHubCallback( +... output_dir="my_awesome_eli5_mlm_model", +... tokenizer=tokenizer, +... ) +``` + +ついに、モデルのトレーニングを開始する準備が整いました。トレーニングおよび検証データセット、エポック数、コールバックを指定して [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) を呼び出し、モデルを微調整します。 + + + +```py +>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback]) +``` + +トレーニングが完了すると、モデルは自動的にハブにアップロードされ、誰でも使用できるようになります。 + + + + + + +マスクされた言語モデリング用にモデルを微調整する方法のより詳細な例については、対応するドキュメントを参照してください。 +[PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb) +または [TensorFlow ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb)。 + + + +## Inference + +モデルを微調整したので、それを推論に使用できるようになりました。 + +モデルに空白を埋めるテキストを考え出し、特別な `` トークンを使用して空白を示します。 + +```py +>>> text = "The Milky Way is a galaxy." +``` + +推論用に微調整されたモデルを試す最も簡単な方法は、それを [`pipeline`] で使用することです。モデルを使用してフィルマスクの`pipeline`をインスタンス化し、テキストをそれに渡します。必要に応じて、`top_k`パラメータを使用して、返す予測の数を指定できます。 + +```py +>>> from transformers import pipeline + +>>> mask_filler = pipeline("fill-mask", "stevhliu/my_awesome_eli5_mlm_model") +>>> mask_filler(text, top_k=3) +[{'score': 0.5150994658470154, + 'token': 21300, + 'token_str': ' spiral', + 'sequence': 'The Milky Way is a spiral galaxy.'}, + {'score': 0.07087188959121704, + 'token': 2232, + 'token_str': ' massive', + 'sequence': 'The Milky Way is a massive galaxy.'}, + {'score': 0.06434620916843414, + 'token': 650, + 'token_str': ' small', + 'sequence': 'The Milky Way is a small galaxy.'}] +``` + + + + +テキストをトークン化し、`input_ids`を PyTorch テンソルとして返します。 `` トークンの位置も指定する必要があります。 + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_eli5_mlm_model") +>>> inputs = tokenizer(text, return_tensors="pt") +>>> mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1] +``` + +入力をモデルに渡し、マスクされたトークンの`logits`を返します。 + +```py +>>> from transformers import AutoModelForMaskedLM + +>>> model = AutoModelForMaskedLM.from_pretrained("stevhliu/my_awesome_eli5_mlm_model") +>>> logits = model(**inputs).logits +>>> mask_token_logits = logits[0, mask_token_index, :] +``` + +次に、マスクされた 3 つのトークンを最も高い確率で返し、出力します。 + +```py +>>> top_3_tokens = torch.topk(mask_token_logits, 3, dim=1).indices[0].tolist() + +>>> for token in top_3_tokens: +... print(text.replace(tokenizer.mask_token, tokenizer.decode([token]))) +The Milky Way is a spiral galaxy. +The Milky Way is a massive galaxy. +The Milky Way is a small galaxy. 
+``` + + + + +テキストをトークン化し、`input_ids`を TensorFlow テンソルとして返します。 `` トークンの位置も指定する必要があります。 + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_eli5_mlm_model") +>>> inputs = tokenizer(text, return_tensors="tf") +>>> mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1] +``` + +入力をモデルに渡し、マスクされたトークンの`logits`を返します。 + + +```py +>>> from transformers import TFAutoModelForMaskedLM + +>>> model = TFAutoModelForMaskedLM.from_pretrained("stevhliu/my_awesome_eli5_mlm_model") +>>> logits = model(**inputs).logits +>>> mask_token_logits = logits[0, mask_token_index, :] +``` + +次に、マスクされた 3 つのトークンを最も高い確率で返し、出力します。 + + +```py +>>> top_3_tokens = tf.math.top_k(mask_token_logits, 3).indices.numpy() + +>>> for token in top_3_tokens: +... print(text.replace(tokenizer.mask_token, tokenizer.decode([token]))) +The Milky Way is a spiral galaxy. +The Milky Way is a massive galaxy. +The Milky Way is a small galaxy. +``` + + diff --git a/docs/source/ja/tasks/monocular_depth_estimation.md b/docs/source/ja/tasks/monocular_depth_estimation.md new file mode 100644 index 00000000000000..984631fd3d5500 --- /dev/null +++ b/docs/source/ja/tasks/monocular_depth_estimation.md @@ -0,0 +1,154 @@ + + +# Monocular depth estimation + +単眼奥行き推定は、シーンの奥行き情報を画像から予測することを含むコンピューター ビジョン タスクです。 +単一の画像。言い換えれば、シーン内のオブジェクトの距離を距離から推定するプロセスです。 +単一カメラの視点。 + +単眼奥行き推定には、3D 再構築、拡張現実、自動運転、 +そしてロボット工学。モデルがオブジェクト間の複雑な関係を理解する必要があるため、これは困難な作業です。 +シーンとそれに対応する深度情報(照明条件などの要因の影響を受ける可能性があります) +オクルージョンとテクスチャ。 + + +このチュートリアルで説明するタスクは、次のモデル アーキテクチャでサポートされています。 + + + +[DPT](../model_doc/dpt), [GLPN](../model_doc/glpn) + + + + + +このガイドでは、次の方法を学びます。 + +* 深度推定パイプラインを作成する +* 手動で深度推定推論を実行します + +始める前に、必要なライブラリがすべてインストールされていることを確認してください。 + +```bash +pip install -q transformers +``` + +## Depth estimation pipeline + +深度推定をサポートするモデルで推論を試す最も簡単な方法は、対応する [`pipeline`] を使用することです。 +[Hugging Face Hub のチェックポイント](https://huggingface.co/models?pipeline_tag=Depth-estimation&sort=downloads) からパイプラインをインスタンス化します。 + + +```py +>>> from transformers import pipeline + +>>> checkpoint = "vinvino02/glpn-nyu" +>>> depth_estimator = pipeline("depth-estimation", model=checkpoint) +``` + +次に、分析する画像を選択します。 + +```py +>>> from PIL import Image +>>> import requests + +>>> url = "https://unsplash.com/photos/HwBAsSbPBDU/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MzR8fGNhciUyMGluJTIwdGhlJTIwc3RyZWV0fGVufDB8MHx8fDE2Nzg5MDEwODg&force=true&w=640" +>>> image = Image.open(requests.get(url, stream=True).raw) +>>> image +``` + +
+ Photo of a busy street
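+
+You can of course load an image from disk instead of downloading one. The path below is just a placeholder, assuming you have a local photo you want to test with:
+
+```py
+>>> from PIL import Image
+
+>>> image = Image.open("path/to/your_photo.jpg")  # hypothetical local file
+```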
+ +画像をパイプラインに渡します。 + +```py +>>> predictions = depth_estimator(image) +``` + +パイプラインは 2 つのエントリを含む辞書を返します。最初のものは`predicted_ Depth`と呼ばれ、次の値を持つテンソルです。 +深さは各ピクセルのメートル単位で表されます。 +2 番目の`depth`は、深度推定結果を視覚化する PIL 画像です。 + +視覚化された結果を見てみましょう。 + +```py +>>> predictions["depth"] +``` + +
+ Depth estimation visualization
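+
+If you prefer to work with the raw values rather than the rendered image, you can inspect the `predicted_depth` tensor directly. This is only a quick sketch; the exact shape and value range depend on the checkpoint and the input image:
+
+```py
+>>> predicted_depth = predictions["predicted_depth"]
+>>> predicted_depth.shape  # roughly a (1, height, width) tensor of per-pixel depth values
+>>> float(predicted_depth.min()), float(predicted_depth.max())
+```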
+ +## Depth estimation inference by hand + +深度推定パイプラインの使用方法を理解したので、同じ結果を手動で複製する方法を見てみましょう。 + +まず、[Hugging Face Hub のチェックポイント](https://huggingface.co/models?pipeline_tag=Depth-estimation&sort=downloads) からモデルと関連プロセッサをロードします。 +ここでは、前と同じチェックポイントを使用します。 + + +```py +>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation + +>>> checkpoint = "vinvino02/glpn-nyu" + +>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint) +>>> model = AutoModelForDepthEstimation.from_pretrained(checkpoint) +``` + +必要な画像変換を処理する`image_processor`を使用して、モデルの画像入力を準備します。 +サイズ変更や正規化など: + +```py +>>> pixel_values = image_processor(image, return_tensors="pt").pixel_values +``` + +準備された入力をモデルに渡します。 + +```py +>>> import torch + +>>> with torch.no_grad(): +... outputs = model(pixel_values) +... predicted_depth = outputs.predicted_depth +``` + +結果を視覚化します。 + + +```py +>>> import numpy as np + +>>> # interpolate to original size +>>> prediction = torch.nn.functional.interpolate( +... predicted_depth.unsqueeze(1), +... size=image.size[::-1], +... mode="bicubic", +... align_corners=False, +... ).squeeze() +>>> output = prediction.numpy() + +>>> formatted = (output * 255 / np.max(output)).astype("uint8") +>>> depth = Image.fromarray(formatted) +>>> depth +``` + +
+ Depth estimation visualization
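+
+The grayscale image above is the simplest rendering of the prediction. As an optional extra that is not part of the recipe above, you could apply a matplotlib colormap to the same `output` array for an easier-to-read visualization (this sketch assumes matplotlib is installed):
+
+```py
+>>> import matplotlib.cm as cm
+>>> from PIL import Image
+
+>>> # scale the raw depth values to [0, 1] before applying the colormap
+>>> normalized = (output - output.min()) / (output.max() - output.min())
+>>> colored = (cm.viridis(normalized)[:, :, :3] * 255).astype("uint8")
+>>> Image.fromarray(colored)
+```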
diff --git a/docs/source/ja/tasks/multiple_choice.md b/docs/source/ja/tasks/multiple_choice.md new file mode 100644 index 00000000000000..bfe5f388cb4ab6 --- /dev/null +++ b/docs/source/ja/tasks/multiple_choice.md @@ -0,0 +1,470 @@ + + +# Multiple choice + +[[open-in-colab]] + +多肢選択タスクは質問応答に似ていますが、いくつかの候補の回答がコンテキストとともに提供され、正しい回答を選択するようにモデルがトレーニングされる点が異なります。 + +このガイドでは、次の方法を説明します。 + +1. [SWAG](https://huggingface.co/datasets/swag) データセットの「通常」構成で [BERT](https://huggingface.co/google-bert/bert-base-uncased) を微調整して、最適なデータセットを選択します複数の選択肢と何らかのコンテキストを考慮して回答します。 +2. 微調整したモデルを推論に使用します。 + + +このチュートリアルで説明するタスクは、次のモデル アーキテクチャでサポートされています。 + + + +[ALBERT](../model_doc/albert), [BERT](../model_doc/bert), [BigBird](../model_doc/big_bird), [CamemBERT](../model_doc/camembert), [CANINE](../model_doc/canine), [ConvBERT](../model_doc/convbert), [Data2VecText](../model_doc/data2vec-text), [DeBERTa-v2](../model_doc/deberta-v2), [DistilBERT](../model_doc/distilbert), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [ErnieM](../model_doc/ernie_m), [FlauBERT](../model_doc/flaubert), [FNet](../model_doc/fnet), [Funnel Transformer](../model_doc/funnel), [I-BERT](../model_doc/ibert), [Longformer](../model_doc/longformer), [LUKE](../model_doc/luke), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [MobileBERT](../model_doc/mobilebert), [MPNet](../model_doc/mpnet), [MRA](../model_doc/mra), [Nezha](../model_doc/nezha), [Nyströmformer](../model_doc/nystromformer), [QDQBert](../model_doc/qdqbert), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [SqueezeBERT](../model_doc/squeezebert), [XLM](../model_doc/xlm), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod), [YOSO](../model_doc/yoso) + + + + + +始める前に、必要なライブラリがすべてインストールされていることを確認してください。 + +```bash +pip install transformers datasets evaluate +``` + +モデルをアップロードしてコミュニティと共有できるように、Hugging Face アカウントにログインすることをお勧めします。プロンプトが表示されたら、トークンを入力してログインします。 + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## Load SWAG dataset + +まず、🤗 データセット ライブラリから SWAG データセットの「通常」構成をロードします。 + +```py +>>> from datasets import load_dataset + +>>> swag = load_dataset("swag", "regular") +``` + +次に、例を見てみましょう。 + +```py +>>> swag["train"][0] +{'ending0': 'passes by walking down the street playing their instruments.', + 'ending1': 'has heard approaching them.', + 'ending2': "arrives and they're outside dancing and asleep.", + 'ending3': 'turns the lead singer watches the performance.', + 'fold-ind': '3416', + 'gold-source': 'gold', + 'label': 0, + 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.', + 'sent2': 'A drum line', + 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line', + 'video-id': 'anetv_jkn6uvmqwh4'} +``` + +ここにはたくさんのフィールドがあるように見えますが、実際は非常に簡単です。 + +- `sent1` と `sent2`: これらのフィールドは文の始まりを示し、この 2 つを組み合わせると `startphrase` フィールドが得られます。 +- `ending`: 文の終わり方として考えられる終わり方を示唆しますが、正しいのは 1 つだけです。 +- `label`: 正しい文の終わりを識別します。 + +## Preprocess + +次のステップでは、BERT トークナイザーをロードして、文の始まりと 4 つの可能な終わりを処理します。 + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +``` + +作成する前処理関数は次のことを行う必要があります。 + +1. 
`sent1` フィールドのコピーを 4 つ作成し、それぞれを `sent2` と組み合わせて文の始まりを再現します。 +2. `sent2` を 4 つの可能な文末尾のそれぞれと組み合わせます。 +3. これら 2 つのリストをトークン化できるようにフラット化し、その後、各例に対応する `input_ids`、`attention_mask`、および `labels` フィールドが含まれるように非フラット化します。 + +```py +>>> ending_names = ["ending0", "ending1", "ending2", "ending3"] + + +>>> def preprocess_function(examples): +... first_sentences = [[context] * 4 for context in examples["sent1"]] +... question_headers = examples["sent2"] +... second_sentences = [ +... [f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers) +... ] + +... first_sentences = sum(first_sentences, []) +... second_sentences = sum(second_sentences, []) + +... tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True) +... return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()} +``` + +データセット全体に前処理関数を適用するには、🤗 Datasets [`~datasets.Dataset.map`] メソッドを使用します。 `batched=True` を設定してデータセットの複数の要素を一度に処理することで、`map` 関数を高速化できます。 + + +```py +tokenized_swag = swag.map(preprocess_function, batched=True) +``` + +🤗 Transformers には多肢選択用のデータ照合器がないため、[`DataCollat​​orWithPadding`] を調整してサンプルのバッチを作成する必要があります。データセット全体を最大長までパディングするのではなく、照合中にバッチ内の最長の長さまで文を *動的にパディング* する方が効率的です。 + +`DataCollat​​orForMultipleChoice` は、すべてのモデル入力を平坦化し、パディングを適用して、結果を非平坦化します。 + + + +```py +>>> from dataclasses import dataclass +>>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy +>>> from typing import Optional, Union +>>> import torch + + +>>> @dataclass +... class DataCollatorForMultipleChoice: +... """ +... Data collator that will dynamically pad the inputs for multiple choice received. +... """ + +... tokenizer: PreTrainedTokenizerBase +... padding: Union[bool, str, PaddingStrategy] = True +... max_length: Optional[int] = None +... pad_to_multiple_of: Optional[int] = None + +... def __call__(self, features): +... label_name = "label" if "label" in features[0].keys() else "labels" +... labels = [feature.pop(label_name) for feature in features] +... batch_size = len(features) +... num_choices = len(features[0]["input_ids"]) +... flattened_features = [ +... [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features +... ] +... flattened_features = sum(flattened_features, []) + +... batch = self.tokenizer.pad( +... flattened_features, +... padding=self.padding, +... max_length=self.max_length, +... pad_to_multiple_of=self.pad_to_multiple_of, +... return_tensors="pt", +... ) + +... batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()} +... batch["labels"] = torch.tensor(labels, dtype=torch.int64) +... return batch +``` + + +```py +>>> from dataclasses import dataclass +>>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy +>>> from typing import Optional, Union +>>> import tensorflow as tf + + +>>> @dataclass +... class DataCollatorForMultipleChoice: +... """ +... Data collator that will dynamically pad the inputs for multiple choice received. +... """ + +... tokenizer: PreTrainedTokenizerBase +... padding: Union[bool, str, PaddingStrategy] = True +... max_length: Optional[int] = None +... pad_to_multiple_of: Optional[int] = None + +... def __call__(self, features): +... label_name = "label" if "label" in features[0].keys() else "labels" +... labels = [feature.pop(label_name) for feature in features] +... batch_size = len(features) +... num_choices = len(features[0]["input_ids"]) +... flattened_features = [ +... 
[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features +... ] +... flattened_features = sum(flattened_features, []) + +... batch = self.tokenizer.pad( +... flattened_features, +... padding=self.padding, +... max_length=self.max_length, +... pad_to_multiple_of=self.pad_to_multiple_of, +... return_tensors="tf", +... ) + +... batch = {k: tf.reshape(v, (batch_size, num_choices, -1)) for k, v in batch.items()} +... batch["labels"] = tf.convert_to_tensor(labels, dtype=tf.int64) +... return batch +``` + + + +## Evaluate + +トレーニング中にメトリクスを含めると、多くの場合、モデルのパフォーマンスを評価するのに役立ちます。 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) ライブラリを使用して、評価メソッドをすばやくロードできます。このタスクでは、[accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) メトリクスを読み込みます (🤗 Evaluate [クイック ツアー](https://huggingface.co/docs/evaluate/a_quick_tour) を参照してください) ) メトリクスの読み込みと計算方法の詳細については、次を参照してください)。 + +```py +>>> import evaluate + +>>> accuracy = evaluate.load("accuracy") +``` + +次に、予測とラベルを [`~evaluate.EvaluationModule.compute`] に渡して精度を計算する関数を作成します。 + +```py +>>> import numpy as np + + +>>> def compute_metrics(eval_pred): +... predictions, labels = eval_pred +... predictions = np.argmax(predictions, axis=1) +... return accuracy.compute(predictions=predictions, references=labels) +``` + +これで`compute_metrics`関数の準備が整いました。トレーニングをセットアップするときにこの関数に戻ります。 + +## Train + + + + + +[`Trainer`] を使用したモデルの微調整に慣れていない場合は、[ここ](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 + + + +これでモデルのトレーニングを開始する準備が整いました。 [`AutoModelForMultipleChoice`] を使用して BERT をロードします。 + +```py +>>> from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer + +>>> model = AutoModelForMultipleChoice.from_pretrained("google-bert/bert-base-uncased") +``` + +この時点で残っている手順は次の 3 つだけです。 + +1. [`TrainingArguments`] でトレーニング ハイパーパラメータを定義します。唯一の必須パラメータは、モデルの保存場所を指定する `output_dir` です。 `push_to_hub=True`を設定して、このモデルをハブにプッシュします (モデルをアップロードするには、Hugging Face にサインインする必要があります)。各エポックの終了時に、[`Trainer`] は精度を評価し、トレーニング チェックポイントを保存します。 +2. トレーニング引数を、モデル、データセット、トークナイザー、データ照合器、および `compute_metrics` 関数とともに [`Trainer`] に渡します。 +3. [`~Trainer.train`] を呼び出してモデルを微調整します。 + +```py +>>> training_args = TrainingArguments( +... output_dir="my_awesome_swag_model", +... evaluation_strategy="epoch", +... save_strategy="epoch", +... load_best_model_at_end=True, +... learning_rate=5e-5, +... per_device_train_batch_size=16, +... per_device_eval_batch_size=16, +... num_train_epochs=3, +... weight_decay=0.01, +... push_to_hub=True, +... ) + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=tokenized_swag["train"], +... eval_dataset=tokenized_swag["validation"], +... tokenizer=tokenizer, +... data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer), +... compute_metrics=compute_metrics, +... 
) + +>>> trainer.train() +``` + +トレーニングが完了したら、 [`~transformers.Trainer.push_to_hub`] メソッドを使用してモデルをハブに共有し、誰もがモデルを使用できますように。 + +```py +>>> trainer.push_to_hub() +``` + + + + +Keras を使用したモデルの微調整に慣れていない場合は、[こちら](../training#train-a-tensorflow-model-with-keras) の基本的なチュートリアルをご覧ください。 + + +TensorFlow でモデルを微調整するには、オプティマイザー関数、学習率スケジュール、およびいくつかのトレーニング ハイパーパラメーターをセットアップすることから始めます。 + +```py +>>> from transformers import create_optimizer + +>>> batch_size = 16 +>>> num_train_epochs = 2 +>>> total_train_steps = (len(tokenized_swag["train"]) // batch_size) * num_train_epochs +>>> optimizer, schedule = create_optimizer(init_lr=5e-5, num_warmup_steps=0, num_train_steps=total_train_steps) +``` + +次に、[`TFAutoModelForMultipleChoice`] を使用して BERT をロードできます。 + +```py +>>> from transformers import TFAutoModelForMultipleChoice + +>>> model = TFAutoModelForMultipleChoice.from_pretrained("google-bert/bert-base-uncased") +``` + +[`~transformers.TFPreTrainedModel.prepare_tf_dataset`] を使用して、データセットを `tf.data.Dataset` 形式に変換します。 + +```py +>>> data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer) +>>> tf_train_set = model.prepare_tf_dataset( +... tokenized_swag["train"], +... shuffle=True, +... batch_size=batch_size, +... collate_fn=data_collator, +... ) + +>>> tf_validation_set = model.prepare_tf_dataset( +... tokenized_swag["validation"], +... shuffle=False, +... batch_size=batch_size, +... collate_fn=data_collator, +... ) +``` + +[`compile`](https://keras.io/api/models/model_training_apis/#compile-method) を使用してトレーニング用のモデルを設定します。 Transformers モデルにはすべてデフォルトのタスク関連の損失関数があるため、次の場合を除き、損失関数を指定する必要はないことに注意してください。 + +```py +>>> model.compile(optimizer=optimizer) # No loss argument! +``` + +トレーニングを開始する前にセットアップする最後の 2 つのことは、予測から精度を計算することと、モデルをハブにプッシュする方法を提供することです。どちらも [Keras コールバック](../main_classes/keras_callbacks) を使用して行われます。 + +`compute_metrics` 関数を [`~transformers.KerasMetricCallback`] に渡します。 + +```py +>>> from transformers.keras_callbacks import KerasMetricCallback + +>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set) +``` + +[`~transformers.PushToHubCallback`] でモデルとトークナイザーをプッシュする場所を指定します。 + +```py +>>> from transformers.keras_callbacks import PushToHubCallback + +>>> push_to_hub_callback = PushToHubCallback( +... output_dir="my_awesome_model", +... tokenizer=tokenizer, +... ) +``` + +次に、コールバックをまとめてバンドルします。 + +```py +>>> callbacks = [metric_callback, push_to_hub_callback] +``` + +ついに、モデルのトレーニングを開始する準備が整いました。トレーニングおよび検証データセット、エポック数、コールバックを指定して [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) を呼び出し、モデルを微調整します。 + +```py +>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=2, callbacks=callbacks) +``` + +トレーニングが完了すると、モデルは自動的にハブにアップロードされ、誰でも使用できるようになります。 + + + + + + + +複数選択用にモデルを微調整する方法の詳細な例については、対応するセクションを参照してください。 +[PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb) +または [TensorFlow ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice-tf.ipynb)。 + + + + +# Inference + +モデルを微調整したので、それを推論に使用できるようになりました。 + +いくつかのテキストと 2 つの回答候補を考えてください。 + +```py +>>> prompt = "France has a bread law, Le Décret Pain, with strict rules on what is allowed in a traditional baguette." +>>> candidate1 = "The law does not apply to croissants and brioche." +>>> candidate2 = "The law applies to baguettes." 
+``` + + + + +各プロンプトと回答候補のペアをトークン化し、PyTorch テンソルを返します。いくつかの`lables`も作成する必要があります。 + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_swag_model") +>>> inputs = tokenizer([[prompt, candidate1], [prompt, candidate2]], return_tensors="pt", padding=True) +>>> labels = torch.tensor(0).unsqueeze(0) +``` + +入力とラベルをモデルに渡し、`logits`を返します。 + +```py +>>> from transformers import AutoModelForMultipleChoice + +>>> model = AutoModelForMultipleChoice.from_pretrained("my_awesome_swag_model") +>>> outputs = model(**{k: v.unsqueeze(0) for k, v in inputs.items()}, labels=labels) +>>> logits = outputs.logits +``` + +最も高い確率でクラスを取得します。 + +```py +>>> predicted_class = logits.argmax().item() +>>> predicted_class +'0' +``` + + + +各プロンプトと回答候補のペアをトークン化し、TensorFlow テンソルを返します。 + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_swag_model") +>>> inputs = tokenizer([[prompt, candidate1], [prompt, candidate2]], return_tensors="tf", padding=True) +``` + +入力をモデルに渡し、`logits`を返します。 + +```py +>>> from transformers import TFAutoModelForMultipleChoice + +>>> model = TFAutoModelForMultipleChoice.from_pretrained("my_awesome_swag_model") +>>> inputs = {k: tf.expand_dims(v, 0) for k, v in inputs.items()} +>>> outputs = model(inputs) +>>> logits = outputs.logits +``` + +最も高い確率でクラスを取得します。 + +```py +>>> predicted_class = int(tf.math.argmax(logits, axis=-1)[0]) +>>> predicted_class +'0' +``` + + diff --git a/docs/source/ja/tasks/object_detection.md b/docs/source/ja/tasks/object_detection.md new file mode 100644 index 00000000000000..389e7bdf2f455e --- /dev/null +++ b/docs/source/ja/tasks/object_detection.md @@ -0,0 +1,603 @@ + + +# Object detection + +[[open-in-colab]] + +オブジェクト検出は、画像内のインスタンス (人間、建物、車など) を検出するコンピューター ビジョン タスクです。物体検出モデルは画像を入力および出力として受け取ります +検出されたオブジェクトの境界ボックスと関連するラベルの座標。画像には複数のオブジェクトを含めることができます。 +それぞれに独自の境界ボックスとラベルがあり (例: 車と建物を持つことができます)、各オブジェクトは +画像のさまざまな部分に存在する必要があります (たとえば、画像には複数の車が含まれている可能性があります)。 +このタスクは、歩行者、道路標識、信号機などを検出するために自動運転で一般的に使用されます。 +他のアプリケーションには、画像内のオブジェクトのカウント、画像検索などが含まれます。 + +このガイドでは、次の方法を学習します。 + + 1. Finetune [DETR](https://huggingface.co/docs/transformers/model_doc/detr)、畳み込みアルゴリズムを組み合わせたモデル + [CPPE-5](https://huggingface.co/datasets/cppe-5) 上のエンコーダー/デコーダー トランスフォーマーを備えたバックボーン + データセット。 + 2. 
微調整したモデルを推論に使用します。 + + +このチュートリアルで説明するタスクは、次のモデル アーキテクチャでサポートされています。 + + + +[Conditional DETR](../model_doc/conditional_detr), [Deformable DETR](../model_doc/deformable_detr), [DETA](../model_doc/deta), [DETR](../model_doc/detr), [Table Transformer](../model_doc/table-transformer), [YOLOS](../model_doc/yolos) + + + + + +始める前に、必要なライブラリがすべてインストールされていることを確認してください。 + + +```bash +pip install -q datasets transformers evaluate timm albumentations +``` + +🤗 データセットを使用して Hugging Face Hub からデータセットをロードし、🤗 トランスフォーマーを使用してモデルをトレーニングします。 +データを増強するための`albumentations`。 `timm` は現在、DETR モデルの畳み込みバックボーンをロードするために必要です。 + +モデルをコミュニティと共有することをお勧めします。 Hugging Face アカウントにログインして、ハブにアップロードします。 +プロンプトが表示されたら、トークンを入力してログインします。 + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## Load the CPPE-5 dataset + +[CPPE-5 データセット](https://huggingface.co/datasets/cppe-5) には、次の画像が含まれています。 +新型コロナウイルス感染症のパンデミックにおける医療用個人保護具 (PPE) を識別する注釈。 + +データセットをロードすることから始めます。 + +```py +>>> from datasets import load_dataset + +>>> cppe5 = load_dataset("cppe-5") +>>> cppe5 +DatasetDict({ + train: Dataset({ + features: ['image_id', 'image', 'width', 'height', 'objects'], + num_rows: 1000 + }) + test: Dataset({ + features: ['image_id', 'image', 'width', 'height', 'objects'], + num_rows: 29 + }) +}) +``` + +このデータセットには、1000 枚の画像を含むトレーニング セットと 29 枚の画像を含むテスト セットがすでに付属していることがわかります。 + +データに慣れるために、例がどのようなものかを調べてください。 + +```py +>>> cppe5["train"][0] +{'image_id': 15, + 'image': , + 'width': 943, + 'height': 663, + 'objects': {'id': [114, 115, 116, 117], + 'area': [3796, 1596, 152768, 81002], + 'bbox': [[302.0, 109.0, 73.0, 52.0], + [810.0, 100.0, 57.0, 28.0], + [160.0, 31.0, 248.0, 616.0], + [741.0, 68.0, 202.0, 401.0]], + 'category': [4, 4, 0, 0]}} +``` + +データセット内の例には次のフィールドがあります。 +- `image_id`: サンプルの画像ID +- `image`: 画像を含む `PIL.Image.Image` オブジェクト +- `width`: 画像の幅 +- `height`: 画像の高さ +- `objects`: 画像内のオブジェクトの境界ボックスのメタデータを含む辞書: + - `id`: アノテーションID + - `area`: 境界ボックスの領域 + - `bbox`: オブジェクトの境界ボックス ([COCO 形式](https://albumentations.ai/docs/getting_started/bounding_boxes_augmentation/#coco) ) + - `category`: オブジェクトのカテゴリー。可能な値には、`Coverall (0)`、`Face_Shield (1)`、`Gloves (2)`、`Goggles (3)`、および `Mask (4)` が含まれます。 + +`bbox`フィールドが COCO 形式に従っていることに気づくかもしれません。これは DETR モデルが予期する形式です。 +ただし、「オブジェクト」内のフィールドのグループ化は、DETR が必要とする注釈形式とは異なります。あなたはするであろう +このデータをトレーニングに使用する前に、いくつかの前処理変換を適用する必要があります。 + +データをさらに深く理解するには、データセット内の例を視覚化します。 + +```py +>>> import numpy as np +>>> import os +>>> from PIL import Image, ImageDraw + +>>> image = cppe5["train"][0]["image"] +>>> annotations = cppe5["train"][0]["objects"] +>>> draw = ImageDraw.Draw(image) + +>>> categories = cppe5["train"].features["objects"].feature["category"].names + +>>> id2label = {index: x for index, x in enumerate(categories, start=0)} +>>> label2id = {v: k for k, v in id2label.items()} + +>>> for i in range(len(annotations["id"])): +... box = annotations["bbox"][i] +... class_idx = annotations["category"][i] +... x, y, w, h = tuple(box) +... draw.rectangle((x, y, x + w, y + h), outline="red", width=1) +... draw.text((x, y), id2label[class_idx], fill="white") + +>>> image +``` + +
+ CPPE-5 Image Example
+ +関連付けられたラベルを使用して境界ボックスを視覚化するには、データセットのメタデータからラベルを取得します。 +`category`フィールド。 +また、ラベル ID をラベル クラスにマッピングする辞書 (`id2label`) やその逆 (`label2id`) を作成することもできます。 +これらは、後でモデルをセットアップするときに使用できます。これらのマップを含めると、共有した場合に他の人がモデルを再利用できるようになります。 +ハグフェイスハブに取り付けます。 + +データに慣れるための最後のステップとして、潜在的な問題がないかデータを調査します。データセットに関する一般的な問題の 1 つは、 +オブジェクト検出は、画像の端を越えて「伸びる」境界ボックスです。このような「暴走」境界ボックスは、 +トレーニング中にエラーが発生するため、この段階で対処する必要があります。このデータセットには、この問題に関する例がいくつかあります。 +このガイドでは内容をわかりやすくするために、これらの画像をデータから削除します。 + +```py +>>> remove_idx = [590, 821, 822, 875, 876, 878, 879] +>>> keep = [i for i in range(len(cppe5["train"])) if i not in remove_idx] +>>> cppe5["train"] = cppe5["train"].select(keep) +``` + +## Preprocess the data + +モデルを微調整するには、事前トレーニングされたモデルに使用されるアプローチと正確に一致するように、使用する予定のデータを前処理する必要があります。 +[`AutoImageProcessor`] は、画像データを処理して `pixel_values`、`pixel_mask`、および +DETR モデルをトレーニングできる「ラベル」。画像プロセッサには、心配する必要のないいくつかの属性があります。 + +- `image_mean = [0.485, 0.456, 0.406 ]` +- `image_std = [0.229, 0.224, 0.225]` + +これらは、モデルの事前トレーニング中に画像を正規化するために使用される平均と標準偏差です。これらの価値観は非常に重要です +事前にトレーニングされた画像モデルを推論または微調整するときに複製します。 + +微調整するモデルと同じチェックポイントからイメージ プロセッサをインスタンス化します。 + +```py +>>> from transformers import AutoImageProcessor + +>>> checkpoint = "facebook/detr-resnet-50" +>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint) +``` + +画像を`image_processor`に渡す前に、2 つの前処理変換をデータセットに適用します。 +- 画像の拡張 +- DETR の期待に応えるための注釈の再フォーマット + +まず、モデルがトレーニング データにオーバーフィットしないようにするために、任意のデータ拡張ライブラリを使用して画像拡張を適用できます。ここでは[Albumentations](https://albumentations.ai/docs/)を使用します... +このライブラリは、変換が画像に影響を与え、それに応じて境界ボックスを更新することを保証します。 +🤗 データセット ライブラリのドキュメントには、詳細な [物体検出用に画像を拡張する方法に関するガイド](https://huggingface.co/docs/datasets/object_detection) が記載されています。 +例としてまったく同じデータセットを使用しています。ここでも同じアプローチを適用し、各画像のサイズを (480, 480) に変更します。 +水平に反転して明るくします。 + +```py +>>> import albumentations +>>> import numpy as np +>>> import torch + +>>> transform = albumentations.Compose( +... [ +... albumentations.Resize(480, 480), +... albumentations.HorizontalFlip(p=1.0), +... albumentations.RandomBrightnessContrast(p=1.0), +... ], +... bbox_params=albumentations.BboxParams(format="coco", label_fields=["category"]), +... ) +``` + +`image_processor` は、注釈が次の形式であることを期待します: `{'image_id': int, 'annotations': List[Dict]}`, + ここで、各辞書は COCO オブジェクトの注釈です。 1 つの例として、注釈を再フォーマットする関数を追加してみましょう。 + + ```py +>>> def formatted_anns(image_id, category, area, bbox): +... annotations = [] +... for i in range(0, len(category)): +... new_ann = { +... "image_id": image_id, +... "category_id": category[i], +... "isCrowd": 0, +... "area": area[i], +... "bbox": list(bbox[i]), +... } +... annotations.append(new_ann) + +... return annotations +``` + +これで、画像と注釈の変換を組み合わせてサンプルのバッチで使用できるようになりました。 + +```py +>>> # transforming a batch +>>> def transform_aug_ann(examples): +... image_ids = examples["image_id"] +... images, bboxes, area, categories = [], [], [], [] +... for image, objects in zip(examples["image"], examples["objects"]): +... image = np.array(image.convert("RGB"))[:, :, ::-1] +... out = transform(image=image, bboxes=objects["bbox"], category=objects["category"]) + +... area.append(objects["area"]) +... images.append(out["image"]) +... bboxes.append(out["bboxes"]) +... categories.append(out["category"]) + +... targets = [ +... {"image_id": id_, "annotations": formatted_anns(id_, cat_, ar_, box_)} +... for id_, cat_, ar_, box_ in zip(image_ids, categories, area, bboxes) +... ] + +... 
return image_processor(images=images, annotations=targets, return_tensors="pt") +``` + +🤗 Datasets [`~datasets.Dataset.with_transform`] メソッドを使用して、この前処理関数をデータセット全体に適用します。この方法が適用されるのは、 +データセットの要素を読み込むときに、その場で変換します。 + +この時点で、データセットの例が変換後にどのようになるかを確認できます。テンソルが表示されるはずです +`pixel_values`、テンソルと `pixel_mask`、および `labels` を使用します。 + +```py +>>> cppe5["train"] = cppe5["train"].with_transform(transform_aug_ann) +>>> cppe5["train"][15] +{'pixel_values': tensor([[[ 0.9132, 0.9132, 0.9132, ..., -1.9809, -1.9809, -1.9809], + [ 0.9132, 0.9132, 0.9132, ..., -1.9809, -1.9809, -1.9809], + [ 0.9132, 0.9132, 0.9132, ..., -1.9638, -1.9638, -1.9638], + ..., + [-1.5699, -1.5699, -1.5699, ..., -1.9980, -1.9980, -1.9980], + [-1.5528, -1.5528, -1.5528, ..., -1.9980, -1.9809, -1.9809], + [-1.5528, -1.5528, -1.5528, ..., -1.9980, -1.9809, -1.9809]], + + [[ 1.3081, 1.3081, 1.3081, ..., -1.8431, -1.8431, -1.8431], + [ 1.3081, 1.3081, 1.3081, ..., -1.8431, -1.8431, -1.8431], + [ 1.3081, 1.3081, 1.3081, ..., -1.8256, -1.8256, -1.8256], + ..., + [-1.3179, -1.3179, -1.3179, ..., -1.8606, -1.8606, -1.8606], + [-1.3004, -1.3004, -1.3004, ..., -1.8606, -1.8431, -1.8431], + [-1.3004, -1.3004, -1.3004, ..., -1.8606, -1.8431, -1.8431]], + + [[ 1.4200, 1.4200, 1.4200, ..., -1.6476, -1.6476, -1.6476], + [ 1.4200, 1.4200, 1.4200, ..., -1.6476, -1.6476, -1.6476], + [ 1.4200, 1.4200, 1.4200, ..., -1.6302, -1.6302, -1.6302], + ..., + [-1.0201, -1.0201, -1.0201, ..., -1.5604, -1.5604, -1.5604], + [-1.0027, -1.0027, -1.0027, ..., -1.5604, -1.5430, -1.5430], + [-1.0027, -1.0027, -1.0027, ..., -1.5604, -1.5430, -1.5430]]]), + 'pixel_mask': tensor([[1, 1, 1, ..., 1, 1, 1], + [1, 1, 1, ..., 1, 1, 1], + [1, 1, 1, ..., 1, 1, 1], + ..., + [1, 1, 1, ..., 1, 1, 1], + [1, 1, 1, ..., 1, 1, 1], + [1, 1, 1, ..., 1, 1, 1]]), + 'labels': {'size': tensor([800, 800]), 'image_id': tensor([756]), 'class_labels': tensor([4]), 'boxes': tensor([[0.7340, 0.6986, 0.3414, 0.5944]]), 'area': tensor([519544.4375]), 'iscrowd': tensor([0]), 'orig_size': tensor([480, 480])}} +``` + +個々の画像を正常に拡張し、それらの注釈を準備しました。ただし、前処理はそうではありません。 +まだ完成しています。最後のステップでは、画像をバッチ処理するためのカスタム `collat​​e_fn` を作成します。 +画像 (現在は `pixel_values`) をバッチ内の最大の画像にパディングし、対応する `pixel_mask` を作成します +どのピクセルが実数 (1) で、どのピクセルがパディング (0) であるかを示します。 + + +```py +>>> def collate_fn(batch): +... pixel_values = [item["pixel_values"] for item in batch] +... encoding = image_processor.pad(pixel_values, return_tensors="pt") +... labels = [item["labels"] for item in batch] +... batch = {} +... batch["pixel_values"] = encoding["pixel_values"] +... batch["pixel_mask"] = encoding["pixel_mask"] +... batch["labels"] = labels +... return batch +``` + +## Training the DETR model + +前のセクションで重労働のほとんどを完了したので、モデルをトレーニングする準備が整いました。 +このデータセット内の画像は、サイズを変更した後でも依然として非常に大きいです。これは、このモデルを微調整すると、 +少なくとも 1 つの GPU が必要です。 + +トレーニングには次の手順が含まれます。 +1. 前処理と同じチェックポイントを使用して、[`AutoModelForObjectDetection`] でモデルを読み込みます。 +2. [`TrainingArguments`] でトレーニング ハイパーパラメータを定義します。 +3. トレーニング引数をモデル、データセット、画像プロセッサ、データ照合器とともに [`Trainer`] に渡します。 +4. [`~Trainer.train`] を呼び出してモデルを微調整します。 + +前処理に使用したのと同じチェックポイントからモデルをロードするときは、必ず`label2id`を渡してください。 +および `id2label` マップは、以前にデータセットのメタデータから作成したものです。さらに、`ignore_mismatched_sizes=True`を指定して、既存の分類頭部を新しい分類頭部に置き換えます。 + +```py +>>> from transformers import AutoModelForObjectDetection + +>>> model = AutoModelForObjectDetection.from_pretrained( +... checkpoint, +... id2label=id2label, +... label2id=label2id, +... ignore_mismatched_sizes=True, +... 
) +``` + +[`TrainingArguments`] で、`output_dir` を使用してモデルの保存場所を指定し、必要に応じてハイパーパラメーターを構成します。 +画像列が削除されるため、未使用の列を削除しないことが重要です。画像列がないと、 +`pixel_values` を作成できません。このため、`remove_unused_columns`を`False`に設定します。 +ハブにプッシュしてモデルを共有したい場合は、`push_to_hub` を `True` に設定します (Hugging にサインインする必要があります) +顔に向かってモデルをアップロードします)。 + +```py +>>> from transformers import TrainingArguments + +>>> training_args = TrainingArguments( +... output_dir="detr-resnet-50_finetuned_cppe5", +... per_device_train_batch_size=8, +... num_train_epochs=10, +... fp16=True, +... save_steps=200, +... logging_steps=50, +... learning_rate=1e-5, +... weight_decay=1e-4, +... save_total_limit=2, +... remove_unused_columns=False, +... push_to_hub=True, +... ) +``` + +最後に、すべてをまとめて、[`~transformers.Trainer.train`] を呼び出します。 + +```py +>>> from transformers import Trainer + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... data_collator=collate_fn, +... train_dataset=cppe5["train"], +... tokenizer=image_processor, +... ) + +>>> trainer.train() +``` + +`training_args`で`push_to_hub`を`True`に設定した場合、トレーニング チェックポイントは +ハグフェイスハブ。トレーニングが完了したら、[`~transformers.Trainer.push_to_hub`] メソッドを呼び出して、最終モデルもハブにプッシュします。 + +```py +>>> trainer.push_to_hub() +``` + +## Evaluate + +物体検出モデルは通常、一連の COCO スタイルの指標を使用して評価されます。 +既存のメトリクス実装のいずれかを使用できますが、ここでは`torchvision`のメトリクス実装を使用して最終的なメトリクスを評価します。 +ハブにプッシュしたモデル。 + +`torchvision`エバリュエーターを使用するには、グラウンド トゥルース COCO データセットを準備する必要があります。 COCO データセットを構築するための API +データを特定の形式で保存する必要があるため、最初に画像と注釈をディスクに保存する必要があります。と同じように +トレーニング用にデータを準備するとき、`cppe5["test"]` からの注釈をフォーマットする必要があります。ただし、画像 +そのままでいるべきです。 + +評価ステップには少し作業が必要ですが、大きく 3 つのステップに分けることができます。 +まず、`cppe5["test"]` セットを準備します。注釈をフォーマットし、データをディスクに保存します。 + + +```py +>>> import json + + +>>> # format annotations the same as for training, no need for data augmentation +>>> def val_formatted_anns(image_id, objects): +... annotations = [] +... for i in range(0, len(objects["id"])): +... new_ann = { +... "id": objects["id"][i], +... "category_id": objects["category"][i], +... "iscrowd": 0, +... "image_id": image_id, +... "area": objects["area"][i], +... "bbox": objects["bbox"][i], +... } +... annotations.append(new_ann) + +... return annotations + + +>>> # Save images and annotations into the files torchvision.datasets.CocoDetection expects +>>> def save_cppe5_annotation_file_images(cppe5): +... output_json = {} +... path_output_cppe5 = f"{os.getcwd()}/cppe5/" + +... if not os.path.exists(path_output_cppe5): +... os.makedirs(path_output_cppe5) + +... path_anno = os.path.join(path_output_cppe5, "cppe5_ann.json") +... categories_json = [{"supercategory": "none", "id": id, "name": id2label[id]} for id in id2label] +... output_json["images"] = [] +... output_json["annotations"] = [] +... for example in cppe5: +... ann = val_formatted_anns(example["image_id"], example["objects"]) +... output_json["images"].append( +... { +... "id": example["image_id"], +... "width": example["image"].width, +... "height": example["image"].height, +... "file_name": f"{example['image_id']}.png", +... } +... ) +... output_json["annotations"].extend(ann) +... output_json["categories"] = categories_json + +... with open(path_anno, "w") as file: +... json.dump(output_json, file, ensure_ascii=False, indent=4) + +... for im, img_id in zip(cppe5["image"], cppe5["image_id"]): +... path_img = os.path.join(path_output_cppe5, f"{img_id}.png") +... im.save(path_img) + +... 
return path_output_cppe5, path_anno +``` + +次に、`cocoevaluator`で利用できる`CocoDetection`クラスのインスタンスを用意します。 + + +```py +>>> import torchvision + + +>>> class CocoDetection(torchvision.datasets.CocoDetection): +... def __init__(self, img_folder, image_processor, ann_file): +... super().__init__(img_folder, ann_file) +... self.image_processor = image_processor + +... def __getitem__(self, idx): +... # read in PIL image and target in COCO format +... img, target = super(CocoDetection, self).__getitem__(idx) + +... # preprocess image and target: converting target to DETR format, +... # resizing + normalization of both image and target) +... image_id = self.ids[idx] +... target = {"image_id": image_id, "annotations": target} +... encoding = self.image_processor(images=img, annotations=target, return_tensors="pt") +... pixel_values = encoding["pixel_values"].squeeze() # remove batch dimension +... target = encoding["labels"][0] # remove batch dimension + +... return {"pixel_values": pixel_values, "labels": target} + + +>>> im_processor = AutoImageProcessor.from_pretrained("devonho/detr-resnet-50_finetuned_cppe5") + +>>> path_output_cppe5, path_anno = save_cppe5_annotation_file_images(cppe5["test"]) +>>> test_ds_coco_format = CocoDetection(path_output_cppe5, im_processor, path_anno) +``` + +最後に、メトリクスをロードして評価を実行します。 + +```py +>>> import evaluate +>>> from tqdm import tqdm + +>>> model = AutoModelForObjectDetection.from_pretrained("devonho/detr-resnet-50_finetuned_cppe5") +>>> module = evaluate.load("ybelkada/cocoevaluate", coco=test_ds_coco_format.coco) +>>> val_dataloader = torch.utils.data.DataLoader( +... test_ds_coco_format, batch_size=8, shuffle=False, num_workers=4, collate_fn=collate_fn +... ) + +>>> with torch.no_grad(): +... for idx, batch in enumerate(tqdm(val_dataloader)): +... pixel_values = batch["pixel_values"] +... pixel_mask = batch["pixel_mask"] + +... labels = [ +... {k: v for k, v in t.items()} for t in batch["labels"] +... ] # these are in DETR format, resized + normalized + +... # forward pass +... outputs = model(pixel_values=pixel_values, pixel_mask=pixel_mask) + +... orig_target_sizes = torch.stack([target["orig_size"] for target in labels], dim=0) +... results = im_processor.post_process(outputs, orig_target_sizes) # convert outputs of model to Pascal VOC format (xmin, ymin, xmax, ymax) + +... module.add(prediction=results, reference=labels) +... del batch + +>>> results = module.compute() +>>> print(results) +Accumulating evaluation results... +DONE (t=0.08s). +IoU metric: bbox + Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.352 + Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.681 + Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.292 + Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.168 + Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.208 + Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.429 + Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.274 + Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.484 + Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.501 + Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.191 + Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.323 + Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.590 +``` + +これらの結果は、[`~transformers.TrainingArguments`] のハイパーパラメータを調整することでさらに改善できます。試してごらん! 
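+
+For example, training for more epochs with a different learning rate schedule is usually the first thing worth trying. The values below are purely illustrative assumptions and have not been validated on CPPE-5:
+
+```py
+>>> from transformers import TrainingArguments
+
+>>> training_args = TrainingArguments(
+...     output_dir="detr-resnet-50_finetuned_cppe5",
+...     per_device_train_batch_size=8,
+...     num_train_epochs=30,  # train longer than the 10 epochs used above
+...     learning_rate=5e-5,  # illustrative value, tune for your data and hardware
+...     lr_scheduler_type="cosine",
+...     warmup_ratio=0.05,
+...     weight_decay=1e-4,
+...     fp16=True,
+...     save_total_limit=2,
+...     remove_unused_columns=False,
+... )
+```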
+ +## Inference + +DETR モデルを微調整して評価し、Hugging Face Hub にアップロードしたので、それを推論に使用できます。 +推論用に微調整されたモデルを試す最も簡単な方法は、それを [`pipeline`] で使用することです。パイプラインをインスタンス化する +モデルを使用してオブジェクトを検出し、それに画像を渡します。 + + +```py +>>> from transformers import pipeline +>>> import requests + +>>> url = "https://i.imgur.com/2lnWoly.jpg" +>>> image = Image.open(requests.get(url, stream=True).raw) + +>>> obj_detector = pipeline("object-detection", model="devonho/detr-resnet-50_finetuned_cppe5") +>>> obj_detector(image) +``` + +必要に応じて、パイプラインの結果を手動で複製することもできます。 + +```py +>>> image_processor = AutoImageProcessor.from_pretrained("devonho/detr-resnet-50_finetuned_cppe5") +>>> model = AutoModelForObjectDetection.from_pretrained("devonho/detr-resnet-50_finetuned_cppe5") + +>>> with torch.no_grad(): +... inputs = image_processor(images=image, return_tensors="pt") +... outputs = model(**inputs) +... target_sizes = torch.tensor([image.size[::-1]]) +... results = image_processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[0] + +>>> for score, label, box in zip(results["scores"], results["labels"], results["boxes"]): +... box = [round(i, 2) for i in box.tolist()] +... print( +... f"Detected {model.config.id2label[label.item()]} with confidence " +... f"{round(score.item(), 3)} at location {box}" +... ) +Detected Coverall with confidence 0.566 at location [1215.32, 147.38, 4401.81, 3227.08] +Detected Mask with confidence 0.584 at location [2449.06, 823.19, 3256.43, 1413.9] +``` + +結果をプロットしてみましょう: + +```py +>>> draw = ImageDraw.Draw(image) + +>>> for score, label, box in zip(results["scores"], results["labels"], results["boxes"]): +... box = [round(i, 2) for i in box.tolist()] +... x, y, x2, y2 = tuple(box) +... draw.rectangle((x, y, x2, y2), outline="red", width=1) +... draw.text((x, y), model.config.id2label[label.item()], fill="white") + +>>> image +``` + +
+ Object detection result on a new image
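+
+If you want to keep the annotated result or reuse the individual detections, the predicted boxes can be used directly. A minimal sketch; the file name is an arbitrary placeholder:
+
+```py
+>>> # save the annotated image to disk
+>>> image.save("cppe5_detections.png")
+
+>>> # crop each detected object out of the image using its bounding box
+>>> crops = [image.crop(tuple(round(c) for c in box.tolist())) for box in results["boxes"]]
+```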
diff --git a/docs/source/ja/tasks/prompting.md b/docs/source/ja/tasks/prompting.md new file mode 100644 index 00000000000000..bd66e751ee61d6 --- /dev/null +++ b/docs/source/ja/tasks/prompting.md @@ -0,0 +1,438 @@ + + + +# LLM prompting guide + +[[open-in-colab]] + +Falcon、LLaMA などの大規模言語モデルは、事前にトレーニングされたトランスフォーマー モデルであり、最初は予測するようにトレーニングされています。 +入力テキストが与えられた場合の次のトークン。通常、数十億のパラメータがあり、何兆ものパラメータでトレーニングされています。 +長期間のトークン。その結果、これらのモデルは非常に強力で多用途になり、次のようなことが可能になります。 +自然言語プロンプトでモデルに指示することで、すぐに複数の NLP タスクを解決できます。 + +最適な出力を保証するためにこのようなプロンプトを設計することは、多くの場合「プロンプト エンジニアリング」と呼ばれます。プロンプトエンジニアリングとは、 +かなりの量の実験を必要とする反復プロセス。自然言語ははるかに柔軟で表現力豊かです +ただし、プログラミング言語よりもあいまいさが生じる可能性があります。同時に、自然言語によるプロンプト +変化にはかなり敏感です。プロンプトにわずかな変更を加えただけでも、出力が大幅に異なる場合があります。 + +すべてのケースに適合するプロンプトを作成するための正確なレシピはありませんが、研究者はいくつかの最良のレシピを考案しました。 +最適な結果をより一貫して達成するのに役立つ実践。 + +このガイドでは、より優れた LLM プロンプトを作成し、さまざまな NLP タスクを解決するのに役立つプロンプト エンジニアリングのベスト プラクティスについて説明します。 +次のことを学びます: + +- [プロンプトの基本](#basics-of-prompting) +- [LLM プロンプトのベスト プラクティス](#best-practices-of-llm-prompting) +- [高度なプロンプト テクニック: 数回のプロンプトと思考の連鎖](#advanced-prompting-techniques) +- [プロンプトを表示する代わりに微調整する場合](#prompting-vs-fine-tuning) + + + +迅速なエンジニアリングは、LLM 出力最適化プロセスの一部にすぎません。もう 1 つの重要な要素は、 +最適なテキスト生成戦略。 LLM が生成時に後続の各トークンを選択する方法をカスタマイズできます。 +トレーニング可能なパラメータを一切変更せずにテキストを作成します。テキスト生成パラメータを微調整することで、 +生成されたテキストに繰り返しが含まれているため、より一貫性があり人間らしい響きになります。 +テキスト生成戦略とパラメーターはこのガイドの範囲外ですが、これらのトピックについて詳しくは、次のトピックを参照してください。 +次のガイド: + +* [LLM による生成](../llm_tutorial) +* [テキスト生成戦略](../generation_strategies) + + + +## Basics of prompting + +### Types of models + +最新の LLM の大部分は、デコーダ専用のトランスフォーマーです。例としては、[LLaMA](../model_doc/llama)、 +[Llama2](../model_doc/llama2)、[Falcon](../model_doc/falcon)、[GPT2](../model_doc/gpt2)。ただし、遭遇する可能性があります +エンコーダ デコーダ トランスフォーマ LLM も同様です。たとえば、[Flan-T5](../model_doc/flan-t5) や [BART](../model_doc/bart) です。 + +エンコーダ デコーダ スタイルのモデルは通常、出力が入力に**大きく**依存する生成タスクで使用されます。 +たとえば、翻訳と要約です。デコーダ専用モデルは、他のすべてのタイプの生成タスクに使用されます。 + +パイプラインを使用して LLM でテキストを生成する場合、使用している LLM のタイプを知ることが重要です。 +異なるパイプラインを使用します。 + +`text-generation`パイプラインを使用してデコーダのみのモデルで推論を実行します。 + +```python +>>> from transformers import pipeline +>>> import torch + +>>> torch.manual_seed(0) # doctest: +IGNORE_RESULT + +>>> generator = pipeline('text-generation', model = 'openai-community/gpt2') +>>> prompt = "Hello, I'm a language model" + +>>> generator(prompt, max_length = 30) +[{'generated_text': "Hello, I'm a language model expert, so I'm a big believer in the concept that I know very well and then I try to look into"}] +``` + +エンコーダー/デコーダーを使用して推論を実行するには、`text2text-generation` パイプラインを使用します。 + +```python +>>> text2text_generator = pipeline("text2text-generation", model = 'google/flan-t5-base') +>>> prompt = "Translate from English to French: I'm very happy to see you" + +>>> text2text_generator(prompt) +[{'generated_text': 'Je suis très heureuse de vous rencontrer.'}] +``` + +### Base vs instruct/chat models + +🤗 Hub で利用できる最近の LLM チェックポイントのほとんどには、base と instruct (または chat) の 2 つのバージョンがあります。例えば、 +[`tiiuae/falcon-7b`](https://huggingface.co/tiiuae/falcon-7b) および [`tiiuae/falcon-7b-instruct`](https://huggingface.co/tiiuae/falcon-7b) -指示する)。 + +基本モデルは、最初のプロンプトが与えられたときにテキストを完成させるのには優れていますが、NLP タスクには理想的ではありません。 +指示に従う必要がある場合、または会話で使用する場合に使用します。ここで、指示 (チャット) バージョンが登場します。 +これらのチェックポイントは、命令と会話データに基づいて事前トレーニングされたベース バージョンをさらに微調整した結果です。 +この追加の微調整により、多くの NLP タスクにとってより適切な選択肢になります。 + +[`tiiuae/falcon-7b-instruct`](https://huggingface.co/tiiuae/falcon-7b-instruct) で使用できるいくつかの簡単なプロンプトを示してみましょう。 +いくつかの一般的な NLP タスクを解決します。 + +### NLP tasks + +まず、環境をセットアップしましょう。 + 
+```bash +pip install -q transformers accelerate +``` + +次に、適切なパイプライン (`text_generation`) を使用してモデルをロードしましょう。 + +```python +>>> from transformers import pipeline, AutoTokenizer +>>> import torch + +>>> torch.manual_seed(0) # doctest: +IGNORE_RESULT +>>> model = "tiiuae/falcon-7b-instruct" + +>>> tokenizer = AutoTokenizer.from_pretrained(model) +>>> pipe = pipeline( +... "text-generation", +... model=model, +... tokenizer=tokenizer, +... torch_dtype=torch.bfloat16, +... device_map="auto", +... ) +``` + + + +Falcon モデルは `bfloat16` データ型を使用してトレーニングされたため、同じものを使用することをお勧めします。これには、最近の +CUDA のバージョンに準拠しており、最新のカードで最適に動作します。 + + + +パイプライン経由でモデルをロードしたので、プロンプトを使用して NLP タスクを解決する方法を見てみましょう。 + +#### Text classification + +テキスト分類の最も一般的な形式の 1 つはセンチメント分析であり、「ポジティブ」、「ネガティブ」、「ネガティブ」などのラベルを割り当てます。 +または、一連のテキストに対して「中立」です。与えられたテキスト (映画レビュー) を分類するようにモデルに指示するプロンプトを作成してみましょう。 +まず指示を与え、次に分類するテキストを指定します。そのままにしておくのではなく、 +応答の先頭にも追加します - `"Sentiment: "`: + +```python +>>> torch.manual_seed(0) # doctest: +IGNORE_RESULT +>>> prompt = """Classify the text into neutral, negative or positive. +... Text: This movie is definitely one of my favorite movies of its kind. The interaction between respectable and morally strong characters is an ode to chivalry and the honor code amongst thieves and policemen. +... Sentiment: +... """ + +>>> sequences = pipe( +... prompt, +... max_new_tokens=10, +... ) + +>>> for seq in sequences: +... print(f"Result: {seq['generated_text']}") +Result: Classify the text into neutral, negative or positive. +Text: This movie is definitely one of my favorite movies of its kind. The interaction between respectable and morally strong characters is an ode to chivalry and the honor code amongst thieves and policemen. +Sentiment: +Positive +``` + +その結果、出力には、手順で提供したリストの分類ラベルが含まれており、それは正しいラベルです。 + + + +プロンプトに加えて、`max_new_tokens`パラメータを渡していることに気づくかもしれません。トークンの数を制御します。 +モデルが生成します。これは、学習できる多くのテキスト生成パラメーターの 1 つです。 +[テキスト生成戦略](../generation_strategies) ガイドを参照してください。 + + + +#### Named Entity Recognition + +固有表現認識 (NER) は、テキスト内の人物、場所、組織などの固有表現を検索するタスクです。 +プロンプトの指示を変更して、LLM にこのタスクを実行させましょう。ここでは`return_full_text = False`も設定しましょう +出力にプロンプ​​トが含​​まれないようにします。 + +```python +>>> torch.manual_seed(1) # doctest: +IGNORE_RESULT +>>> prompt = """Return a list of named entities in the text. +... Text: The Golden State Warriors are an American professional basketball team based in San Francisco. +... Named entities: +... """ + +>>> sequences = pipe( +... prompt, +... max_new_tokens=15, +... return_full_text = False, +... ) + +>>> for seq in sequences: +... print(f"{seq['generated_text']}") +- Golden State Warriors +- San Francisco +``` + +ご覧のとおり、モデルは指定されたテキストから 2 つの名前付きエンティティを正しく識別しました。 + +#### Translation + +LLM が実行できるもう 1 つのタスクは翻訳です。このタスクにはエンコーダー/デコーダー モデルを使用することを選択できますが、ここでは +例を簡単にするために、きちんとした仕事をする Falcon-7b-instruct を使い続けます。もう一度、方法は次のとおりです +テキストの一部を英語からイタリア語に翻訳するようにモデルに指示する基本的なプロンプトを作成できます。 + +```python +>>> torch.manual_seed(2) # doctest: +IGNORE_RESULT +>>> prompt = """Translate the English text to Italian. +... Text: Sometimes, I've believed as many as six impossible things before breakfast. +... Translation: +... """ + +>>> sequences = pipe( +... prompt, +... max_new_tokens=20, +... do_sample=True, +... top_k=10, +... return_full_text = False, +... ) + +>>> for seq in sequences: +... print(f"{seq['generated_text']}") +A volte, ho creduto a sei impossibili cose prima di colazione. 
+``` + +ここでは、出力生成時にモデルがもう少し柔軟になるように `do_sample=True` と `top_k=10` を追加しました。 + +#### Text summarization + +翻訳と同様に、テキストの要約も、出力が入力に**大きく**依存する生成タスクです。 +エンコーダ/デコーダ モデルの方が良い選択になる可能性があります。ただし、デコーダ スタイルのモデルもこのタスクに使用できます。 +以前は、プロンプトの先頭に指示を配置していました。ただし、プロンプトの最後で、 +指示を与えるのに適した場所でもあります。通常、命令はどちらかの端に配置することをお勧めします。 + +```python +>>> torch.manual_seed(3) # doctest: +IGNORE_RESULT +>>> prompt = """Permaculture is a design process mimicking the diversity, functionality and resilience of natural ecosystems. The principles and practices are drawn from traditional ecological knowledge of indigenous cultures combined with modern scientific understanding and technological innovations. Permaculture design provides a framework helping individuals and communities develop innovative, creative and effective strategies for meeting basic needs while preparing for and mitigating the projected impacts of climate change. +... Write a summary of the above text. +... Summary: +... """ + +>>> sequences = pipe( +... prompt, +... max_new_tokens=30, +... do_sample=True, +... top_k=10, +... return_full_text = False, +... ) + +>>> for seq in sequences: +... print(f"{seq['generated_text']}") +Permaculture is an ecological design mimicking natural ecosystems to meet basic needs and prepare for climate change. It is based on traditional knowledge and scientific understanding. +``` + +#### Question answering + +質問応答タスクの場合、プロンプトを次の論理コンポーネントに構造化できます: 指示、コンテキスト、質問、 +先頭の単語またはフレーズ (`"Answer:"`) を使用して、モデルを操作して答えの生成を開始します。 + + +```python +>>> torch.manual_seed(4) # doctest: +IGNORE_RESULT +>>> prompt = """Answer the question using the context below. +... Context: Gazpacho is a cold soup and drink made of raw, blended vegetables. Most gazpacho includes stale bread, tomato, cucumbers, onion, bell peppers, garlic, olive oil, wine vinegar, water, and salt. Northern recipes often include cumin and/or pimentón (smoked sweet paprika). Traditionally, gazpacho was made by pounding the vegetables in a mortar with a pestle; this more laborious method is still sometimes used as it helps keep the gazpacho cool and avoids the foam and silky consistency of smoothie versions made in blenders or food processors. +... Question: What modern tool is used to make gazpacho? +... Answer: +... """ + +>>> sequences = pipe( +... prompt, +... max_new_tokens=10, +... do_sample=True, +... top_k=10, +... return_full_text = False, +... ) + +>>> for seq in sequences: +... print(f"Result: {seq['generated_text']}") +Result: Modern tools are used, such as immersion blenders +``` + +#### Reasoning + +LLM にとって推論は最も困難なタスクの 1 つであり、良い結果を達成するには、多くの場合、次のような高度なプロンプト テクニックを適用する必要があります。 +[Chain-of-thought](#chain-of-thought)。 + +基本的なプロンプトを使用して、単純な算術タスクに関するモデル推論を作成できるかどうか試してみましょう。 + +```python +>>> torch.manual_seed(5) # doctest: +IGNORE_RESULT +>>> prompt = """There are 5 groups of students in the class. Each group has 4 students. How many students are there in the class?""" + +>>> sequences = pipe( +... prompt, +... max_new_tokens=30, +... do_sample=True, +... top_k=10, +... return_full_text = False, +... ) + +>>> for seq in sequences: +... print(f"Result: {seq['generated_text']}") +Result: +There are a total of 5 groups, so there are 5 x 4=20 students in the class. +``` + +正しい!もう少し複雑さを増やして、基本的なプロンプトで問題を解決できるかどうかを確認してみましょう。 + +```python +>>> torch.manual_seed(6) # doctest: +IGNORE_RESULT +>>> prompt = """I baked 15 muffins. I ate 2 muffins and gave 5 muffins to a neighbor. My partner then bought 6 more muffins and ate 2. 
How many muffins do we now have?""" + +>>> sequences = pipe( +... prompt, +... max_new_tokens=10, +... do_sample=True, +... top_k=10, +... return_full_text = False, +... ) + +>>> for seq in sequences: +... print(f"Result: {seq['generated_text']}") +Result: +The total number of muffins now is 21 +``` + +これは間違った答えです。12 である必要があります。この場合、プロンプトが基本的すぎるか、選択内容が原因である可能性があります。 +結局のところ、Falcon の最小バージョンを選択しました。あらゆるサイズのモデルでは推論が困難ですが、より大きなモデルでは +モデルのパフォーマンスが向上する可能性があります。 + +## Best practices of LLM prompting + +ガイドのこのセクションでは、プロンプトの結果を改善する傾向にあるベスト プラクティスのリストをまとめました。 + +* 使用するモデルを選択する場合は、最新かつ最も機能的なモデルの方がパフォーマンスが向上する可能性があります。 +* シンプルで短いプロンプトから始めて、そこから繰り返します。 +* 指示はプロンプトの最初または最後に入力してください。大規模なコンテキストを扱う場合、モデルはさまざまな最適化を適用して、アテンションの複雑さが二次的に拡大するのを防ぎます。これにより、モデルはプロンプトの途中よりも最初または最後に注意を払うようになります。 +* 指示と、それが適用されるテキストを明確に区別してください。これについては、次のセクションで詳しく説明します。 +* タスクと望ましい結果 (その形式、長さ、スタイル、言語など) について具体的かつ説明的にします。 +* 曖昧な説明や指示は避けてください。 +*「何をしてはいけないか」という指示ではなく、「何をすべきか」という指示を優先します。 +* 最初の単語を書いて (またはモデルの最初の文を始めて)、出力を正しい方向に「導き」ます。 +* [Few-shot prompting](#few-shot-prompting) や [Chain-of-thought](#chain-of-thought) などの高度なテクニックを使用します。 +* さまざまなモデルでプロンプトをテストして、その堅牢性を評価します。 +* プロンプトのバージョンを確認し、パフォーマンスを追跡します。 + +## Advanced prompting techniques + +### Few-shot prompting + +上記のセクションの基本的なプロンプトは、「ゼロショット」プロンプトの例です。つまり、モデルにはすでに与えられています。 +指示とコンテキストはありますが、解決策を含む例はありません。通常、命令データセットに基づいて微調整された LLM +このような「ゼロショット」タスクでも優れたパフォーマンスを発揮します。ただし、タスクがより複雑であったり微妙な点があったりする場合があります。 +出力には、命令だけではモデルが理解できないいくつかの要件があります。この場合、次のことができます。 +少数ショット プロンプトと呼ばれるテクニックを試してください。 + +少数ショット プロンプトでは、モデルにパフォーマンスを向上させるためのより多くのコンテキストを提供するプロンプト内の例が提供されます。 +例では、例のパターンに従って出力を生成するようにモデルを条件付けします。 + +以下に例を示します。 + +```python +>>> torch.manual_seed(0) # doctest: +IGNORE_RESULT +>>> prompt = """Text: The first human went into space and orbited the Earth on April 12, 1961. +... Date: 04/12/1961 +... Text: The first-ever televised presidential debate in the United States took place on September 28, 1960, between presidential candidates John F. Kennedy and Richard Nixon. +... Date:""" + +>>> sequences = pipe( +... prompt, +... max_new_tokens=8, +... do_sample=True, +... top_k=10, +... ) + +>>> for seq in sequences: +... print(f"Result: {seq['generated_text']}") +Result: Text: The first human went into space and orbited the Earth on April 12, 1961. +Date: 04/12/1961 +Text: The first-ever televised presidential debate in the United States took place on September 28, 1960, between presidential candidates John F. Kennedy and Richard Nixon. +Date: 09/28/1960 +``` + +上記のコード スニペットでは、モデルへの目的の出力を示すために 1 つの例を使用しました。したがって、これは、 +「ワンショット」プロンプト。ただし、タスクの複雑さに応じて、複数の例を使用する必要がある場合があります。 + +数回のプロンプト手法の制限: +- LLM は例のパターンを理解できますが、これらの手法は複雑な推論タスクではうまく機能しません。 +- 少数ショットのプロンプトでは、長いプロンプトを作成する必要があります。大量のトークンを含むプロンプトでは、計算量と待ち時間が増加する可能性があります。プロンプトの長さにも制限があります。 +- 多くの例を与えると、モデルが学習するつもりのなかったパターンを学習することがあります。 3番目の映画レビューはいつも否定的だということ。 + +### Chain-of-thought + +思考連鎖 (CoT) プロンプトは、モデルを微調整して中間推論ステップを生成し、改善する手法です。 +複雑な推論タスクの結果。 + +モデルを操作して推論ステップを生成するには、2 つの方法があります。 +- 質問に対する詳細な回答を含む例を示し、問題に対処する方法をモデルに示すことで、数回のプロンプトを表示します。 +- 「ステップごとに考えてみましょう」または「深呼吸して、問題をステップごとに解決してください」などのフレーズを追加してモデルに推論を指示します。 + +[推論セクション](#reasoning) のマフィンの例に CoT テクニックを適用し、より大きなモデルを使用すると、 +[HuggingChat](https://huggingface.co/chat/)で遊べる(`tiiuae/falcon-180B-chat`)など、 +推論結果は大幅に改善されます。 + +```text +Let's go through this step-by-step: +1. You start with 15 muffins. +2. You eat 2 muffins, leaving you with 13 muffins. +3. You give 5 muffins to your neighbor, leaving you with 8 muffins. +4. Your partner buys 6 more muffins, bringing the total number of muffins to 14. 
+5. Your partner eats 2 muffins, leaving you with 12 muffins. +Therefore, you now have 12 muffins. +``` + +## Prompting vs fine-tuning + +プロンプトを最適化することで優れた結果を達成できますが、モデルを微調整するかどうかについてはまだ思案するかもしれません。 +あなたの場合にはもっとうまくいくでしょう。より小規模なモデルを微調整することが好ましいオプションである場合のいくつかのシナリオを次に示します。 + +- ドメインが LLM が事前にトレーニングされたものと大きく異なっており、広範なプロンプト最適化では十分な結果が得られませんでした。 +- モデルが低リソース言語で適切に動作する必要があります。 +- 厳格な規制の下にある機密データでモデルをトレーニングする必要があります。 +- コスト、プライバシー、インフラストラクチャ、またはその他の制限により、小規模なモデルを使用する必要があります。 + +上記のすべての例で、十分な大きさのファイルをすでに持っているか、簡単に入手できるかを確認する必要があります。 +ドメイン固有のデータセットを合理的なコストでモデルを微調整できます。十分な時間とリソースも必要になります +モデルを微調整します。 + +上記の例が当てはまらない場合は、プロンプトを最適化する方が有益であることがわかります。 diff --git a/docs/source/ja/tasks/question_answering.md b/docs/source/ja/tasks/question_answering.md new file mode 100644 index 00000000000000..54df687c2f047f --- /dev/null +++ b/docs/source/ja/tasks/question_answering.md @@ -0,0 +1,445 @@ + + +# Question answering + +[[open-in-colab]] + + + +質問応答タスクは、質問に対して回答を返します。 Alexa、Siri、Google などの仮想アシスタントに天気を尋ねたことがあるなら、質問応答モデルを使用したことがあるはずです。質問応答タスクには一般的に 2 つのタイプがあります。 + +- 抽出: 与えられたコンテキストから回答を抽出します。 +- 抽象的: 質問に正しく答えるコンテキストから回答を生成します。 + +このガイドでは、次の方法を説明します。 + +1. 抽出的質問応答用に [SQuAD](https://huggingface.co/datasets/squad) データセット上の [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) を微調整します。 +2. 微調整したモデルを推論に使用します。 + + +このチュートリアルで説明するタスクは、次のモデル アーキテクチャでサポートされています。 + + + + +[ALBERT](../model_doc/albert), [BART](../model_doc/bart), [BERT](../model_doc/bert), [BigBird](../model_doc/big_bird), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [BLOOM](../model_doc/bloom), [CamemBERT](../model_doc/camembert), [CANINE](../model_doc/canine), [ConvBERT](../model_doc/convbert), [Data2VecText](../model_doc/data2vec-text), [DeBERTa](../model_doc/deberta), [DeBERTa-v2](../model_doc/deberta-v2), [DistilBERT](../model_doc/distilbert), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [ErnieM](../model_doc/ernie_m), [Falcon](../model_doc/falcon), [FlauBERT](../model_doc/flaubert), [FNet](../model_doc/fnet), [Funnel Transformer](../model_doc/funnel), [OpenAI GPT-2](../model_doc/gpt2), [GPT Neo](../model_doc/gpt_neo), [GPT NeoX](../model_doc/gpt_neox), [GPT-J](../model_doc/gptj), [I-BERT](../model_doc/ibert), [LayoutLMv2](../model_doc/layoutlmv2), [LayoutLMv3](../model_doc/layoutlmv3), [LED](../model_doc/led), [LiLT](../model_doc/lilt), [Longformer](../model_doc/longformer), [LUKE](../model_doc/luke), [LXMERT](../model_doc/lxmert), [MarkupLM](../model_doc/markuplm), [mBART](../model_doc/mbart), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [MobileBERT](../model_doc/mobilebert), [MPNet](../model_doc/mpnet), [MPT](../model_doc/mpt), [MRA](../model_doc/mra), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [Nezha](../model_doc/nezha), [Nyströmformer](../model_doc/nystromformer), [OPT](../model_doc/opt), [QDQBert](../model_doc/qdqbert), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [Splinter](../model_doc/splinter), [SqueezeBERT](../model_doc/squeezebert), [T5](../model_doc/t5), [UMT5](../model_doc/umt5), [XLM](../model_doc/xlm), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod), [YOSO](../model_doc/yoso) + + + + + + +始める前に、必要なライブラリがすべてインストールされていることを確認してください。 + +```bash +pip install transformers datasets evaluate 
+``` + +モデルをアップロードしてコミュニティと共有できるように、Hugging Face アカウントにログインすることをお勧めします。プロンプトが表示されたら、トークンを入力してログインします。 + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## Load SQuAD dataset + +まず、🤗 データセット ライブラリから SQuAD データセットの小さいサブセットを読み込みます。これにより、完全なデータセットのトレーニングにさらに時間を費やす前に、実験してすべてが機能することを確認する機会が得られます。 + + +```py +>>> from datasets import load_dataset + +>>> squad = load_dataset("squad", split="train[:5000]") +``` + +[`~datasets.Dataset.train_test_split`] メソッドを使用して、データセットの `train` 分割をトレイン セットとテスト セットに分割します。 + +```py +>>> squad = squad.train_test_split(test_size=0.2) +``` + +次に、例を見てみましょう。 + +```py +>>> squad["train"][0] +{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']}, + 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', + 'id': '5733be284776f41900661182', + 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', + 'title': 'University_of_Notre_Dame' +} +``` + +ここにはいくつかの重要なフィールドがあります。 + +- `answers`: 回答トークンと回答テキストの開始位置。 +- `context`: モデルが答えを抽出するために必要な背景情報。 +- `question`: モデルが答える必要がある質問。 + +## Preprocess + + + +次のステップでは、DistilBERT トークナイザーをロードして`question`フィールドと`context`フィールドを処理します。 + + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") +``` + +質問応答タスクに特有の、注意すべき前処理手順がいくつかあります。 + +1. データセット内の一部の例には、モデルの最大入力長を超える非常に長い「コンテキスト」が含まれる場合があります。より長いシーケンスを処理するには、`truncation="only_second"` を設定して `context` のみを切り捨てます。 +2. 次に、設定によって、回答の開始位置と終了位置を元の `context`にマッピングします。 + 「`return_offset_mapping=True`」。 +3. マッピングが手元にあるので、答えの開始トークンと終了トークンを見つけることができます。 [`~tokenizers.Encoding.sequence_ids`] メソッドを使用して、 + オフセットのどの部分が`question`に対応し、どの部分が`context`に対応するかを見つけます。 + +以下に、`answer`の開始トークンと終了トークンを切り詰めて`context`にマッピングする関数を作成する方法を示します。 + +```py +>>> def preprocess_function(examples): +... questions = [q.strip() for q in examples["question"]] +... inputs = tokenizer( +... questions, +... examples["context"], +... max_length=384, +... truncation="only_second", +... return_offsets_mapping=True, +... padding="max_length", +... ) + +... offset_mapping = inputs.pop("offset_mapping") +... answers = examples["answers"] +... start_positions = [] +... end_positions = [] + +... for i, offset in enumerate(offset_mapping): +... answer = answers[i] +... start_char = answer["answer_start"][0] +... end_char = answer["answer_start"][0] + len(answer["text"][0]) +... sequence_ids = inputs.sequence_ids(i) + +... # Find the start and end of the context +... idx = 0 +... while sequence_ids[idx] != 1: +... idx += 1 +... context_start = idx +... while sequence_ids[idx] == 1: +... idx += 1 +... context_end = idx - 1 + +... # If the answer is not fully inside the context, label it (0, 0) +... if offset[context_start][0] > end_char or offset[context_end][1] < start_char: +... start_positions.append(0) +... end_positions.append(0) +... 
else: +... # Otherwise it's the start and end token positions +... idx = context_start +... while idx <= context_end and offset[idx][0] <= start_char: +... idx += 1 +... start_positions.append(idx - 1) + +... idx = context_end +... while idx >= context_start and offset[idx][1] >= end_char: +... idx -= 1 +... end_positions.append(idx + 1) + +... inputs["start_positions"] = start_positions +... inputs["end_positions"] = end_positions +... return inputs +``` + +データセット全体に前処理関数を適用するには、🤗 Datasets [`~datasets.Dataset.map`] 関数を使用します。 `batched=True` を設定してデータセットの複数の要素を一度に処理することで、`map` 関数を高速化できます。不要な列を削除します。 + +```py +>>> tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names) +``` + +次に、[`DefaultDataCollat​​or`] を使用してサンプルのバッチを作成します。 🤗 Transformers の他のデータ照合器とは異なり、[`DefaultDataCollat​​or`] はパディングなどの追加の前処理を適用しません。 + + + +```py +>>> from transformers import DefaultDataCollator + +>>> data_collator = DefaultDataCollator() +``` + + +```py +>>> from transformers import DefaultDataCollator + +>>> data_collator = DefaultDataCollator(return_tensors="tf") +``` + + + +## Train + + + + + +[`Trainer`] を使用したモデルの微調整に慣れていない場合は、[ここ](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 + + + +これでモデルのトレーニングを開始する準備が整いました。 [`AutoModelForQuestionAnswering`] を使用して DitilBERT をロードします。 + +```py +>>> from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer + +>>> model = AutoModelForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased") +``` + +この時点で残っている手順は次の 3 つだけです。 + +1. [`TrainingArguments`] でトレーニング ハイパーパラメータを定義します。唯一の必須パラメータは、モデルの保存場所を指定する `output_dir` です。 `push_to_hub=True`を設定して、このモデルをハブにプッシュします (モデルをアップロードするには、Hugging Face にサインインする必要があります)。 +2. トレーニング引数をモデル、データセット、トークナイザー、データ照合器とともに [`Trainer`] に渡します。 +3. [`~Trainer.train`] を呼び出してモデルを微調整します。 + +```py +>>> training_args = TrainingArguments( +... output_dir="my_awesome_qa_model", +... evaluation_strategy="epoch", +... learning_rate=2e-5, +... per_device_train_batch_size=16, +... per_device_eval_batch_size=16, +... num_train_epochs=3, +... weight_decay=0.01, +... push_to_hub=True, +... ) + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=tokenized_squad["train"], +... eval_dataset=tokenized_squad["test"], +... tokenizer=tokenizer, +... data_collator=data_collator, +... ) + +>>> trainer.train() +``` + +トレーニングが完了したら、 [`~transformers.Trainer.push_to_hub`] メソッドを使用してモデルをハブに共有し、誰もがモデルを使用できるようにします。 + + +```py +>>> trainer.push_to_hub() +``` + + + + +Keras を使用したモデルの微調整に慣れていない場合は、[こちら](../training#train-a-tensorflow-model-with-keras) の基本的なチュートリアルをご覧ください。 + + + + +TensorFlow でモデルを微調整するには、オプティマイザー関数、学習率スケジュール、およびいくつかのトレーニング ハイパーパラメーターをセットアップすることから始めます。 + +```py +>>> from transformers import create_optimizer + +>>> batch_size = 16 +>>> num_epochs = 2 +>>> total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs +>>> optimizer, schedule = create_optimizer( +... init_lr=2e-5, +... num_warmup_steps=0, +... num_train_steps=total_train_steps, +... ) +``` + +次に、[`TFAutoModelForQuestionAnswering`] を使用して DistilBERT をロードできます。 + +```py +>>> from transformers import TFAutoModelForQuestionAnswering + +>>> model = TFAutoModelForQuestionAnswering("distilbert/distilbert-base-uncased") +``` + +[`~transformers.TFPreTrainedModel.prepare_tf_dataset`] を使用して、データセットを `tf.data.Dataset` 形式に変換します。 + +```py +>>> tf_train_set = model.prepare_tf_dataset( +... tokenized_squad["train"], +... shuffle=True, +... batch_size=16, +... 
collate_fn=data_collator, +... ) + +>>> tf_validation_set = model.prepare_tf_dataset( +... tokenized_squad["test"], +... shuffle=False, +... batch_size=16, +... collate_fn=data_collator, +... ) +``` + +[`compile`](https://keras.io/api/models/model_training_apis/#compile-method) を使用してトレーニング用のモデルを設定します。 + +```py +>>> import tensorflow as tf + +>>> model.compile(optimizer=optimizer) +``` + +トレーニングを開始する前に最後にセットアップすることは、モデルをハブにプッシュする方法を提供することです。これは、モデルとトークナイザーを [`~transformers.PushToHubCallback`] でプッシュする場所を指定することで実行できます。 + +```py +>>> from transformers.keras_callbacks import PushToHubCallback + +>>> callback = PushToHubCallback( +... output_dir="my_awesome_qa_model", +... tokenizer=tokenizer, +... ) +``` + +ついに、モデルのトレーニングを開始する準備が整いました。トレーニングおよび検証データセット、エポック数、コールバックを指定して [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) を呼び出し、モデルを微調整します。 + +```py +>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=[callback]) +``` + +トレーニングが完了すると、モデルは自動的にハブにアップロードされ、誰でも使用できるようになります。 + + + + + +質問応答用のモデルを微調整する方法の詳細な例については、対応するドキュメントを参照してください。 +[PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb) +または [TensorFlow ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb)。 + + + +## Evaluate + +質問応答の評価には、大量の後処理が必要です。時間がかかりすぎないように、このガイドでは評価ステップを省略しています。 [`Trainer`] はトレーニング中に評価損失を計算するため、モデルのパフォーマンスについて完全に分からないわけではありません。 + +もっと時間があり、質問応答用のモデルを評価する方法に興味がある場合は、[質問応答](https://huggingface.co/course/chapter7/7?fw=pt#postprocessing) の章を参照してください。 🤗ハグフェイスコースから! + +## Inference + +モデルを微調整したので、それを推論に使用できるようになりました。 + +質問と、モデルに予測させたいコンテキストを考え出します。 + +```py +>>> question = "How many programming languages does BLOOM support?" +>>> context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages." +``` + +推論用に微調整されたモデルを試す最も簡単な方法は、それを [`pipeline`] で使用することです。モデルを使用して質問応答用の`pipeline`をインスタンス化し、それにテキストを渡します。 + +```py +>>> from transformers import pipeline + +>>> question_answerer = pipeline("question-answering", model="my_awesome_qa_model") +>>> question_answerer(question=question, context=context) +{'score': 0.2058267742395401, + 'start': 10, + 'end': 95, + 'answer': '176 billion parameters and can generate text in 46 languages natural languages and 13'} +``` + +必要に応じて、`pipeline`の結果を手動で複製することもできます。 + + + + +テキストをトークン化して PyTorch テンソルを返します。 + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model") +>>> inputs = tokenizer(question, context, return_tensors="pt") +``` + +入力をモデルに渡し、`logits`を返します。 + + +```py +>>> import torch +>>> from transformers import AutoModelForQuestionAnswering + +>>> model = AutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model") +>>> with torch.no_grad(): +... 
outputs = model(**inputs) +``` + +モデル出力から開始位置と終了位置の最も高い確率を取得します。 + +```py +>>> answer_start_index = outputs.start_logits.argmax() +>>> answer_end_index = outputs.end_logits.argmax() +``` + +予測されたトークンをデコードして答えを取得します。 + +```py +>>> predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1] +>>> tokenizer.decode(predict_answer_tokens) +'176 billion parameters and can generate text in 46 languages natural languages and 13' +``` + + + +テキストをトークン化し、TensorFlow テンソルを返します。 + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model") +>>> inputs = tokenizer(question, text, return_tensors="tf") +``` + +入力をモデルに渡し、`logits`を返します。 + + +```py +>>> from transformers import TFAutoModelForQuestionAnswering + +>>> model = TFAutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model") +>>> outputs = model(**inputs) +``` + +モデル出力から開始位置と終了位置の最も高い確率を取得します。 + +```py +>>> answer_start_index = int(tf.math.argmax(outputs.start_logits, axis=-1)[0]) +>>> answer_end_index = int(tf.math.argmax(outputs.end_logits, axis=-1)[0]) +``` + +予測されたトークンをデコードして答えを取得します。 + +```py +>>> predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1] +>>> tokenizer.decode(predict_answer_tokens) +'176 billion parameters and can generate text in 46 languages natural languages and 13' +``` + + + diff --git a/docs/source/ja/tasks/semantic_segmentation.md b/docs/source/ja/tasks/semantic_segmentation.md new file mode 100644 index 00000000000000..2816688b4e1c14 --- /dev/null +++ b/docs/source/ja/tasks/semantic_segmentation.md @@ -0,0 +1,605 @@ + + +# Semantic segmentation + +[[open-in-colab]] + + + +セマンティック セグメンテーションでは、画像の個々のピクセルにラベルまたはクラスを割り当てます。セグメンテーションにはいくつかのタイプがありますが、セマンティック セグメンテーションの場合、同じオブジェクトの一意のインスタンス間の区別は行われません。両方のオブジェクトに同じラベルが付けられます (たとえば、`car-1`と`car-2`の代わりに`car`)。セマンティック セグメンテーションの一般的な現実世界のアプリケーションには、歩行者や重要な交通情報を識別するための自動運転車のトレーニング、医療画像内の細胞と異常の識別、衛星画像からの環境変化の監視などが含まれます。 + +このガイドでは、次の方法を説明します。 + +1. [SceneParse150](https://huggingface.co/datasets/scene_parse_150) データセットの [SegFormer](https://huggingface.co/docs/transformers/main/en/model_doc/segformer#segformer) を微調整します。 +2. 
微調整したモデルを推論に使用します。 + + + +このチュートリアルで説明するタスクは、次のモデル アーキテクチャでサポートされています。 + + + +[BEiT](../model_doc/beit), [Data2VecVision](../model_doc/data2vec-vision), [DPT](../model_doc/dpt), [MobileNetV2](../model_doc/mobilenet_v2), [MobileViT](../model_doc/mobilevit), [MobileViTV2](../model_doc/mobilevitv2), [SegFormer](../model_doc/segformer), [UPerNet](../model_doc/upernet) + + + + + +始める前に、必要なライブラリがすべてインストールされていることを確認してください。 + +```bash +pip install -q datasets transformers evaluate +``` + +モデルをアップロードしてコミュニティと共有できるように、Hugging Face アカウントにログインすることをお勧めします。プロンプトが表示されたら、トークンを入力してログインします。 + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## Load SceneParse150 dataset + +まず、SceneParse150 データセットの小さいサブセットを 🤗 データセット ライブラリから読み込みます。これにより、完全なデータセットのトレーニングにさらに時間を費やす前に、実験してすべてが機能することを確認する機会が得られます。 + +```py +>>> from datasets import load_dataset + +>>> ds = load_dataset("scene_parse_150", split="train[:50]") +``` + +[`~datasets.Dataset.train_test_split`] メソッドを使用して、データセットの `train` 分割をトレイン セットとテスト セットに分割します。 + + +```py +>>> ds = ds.train_test_split(test_size=0.2) +>>> train_ds = ds["train"] +>>> test_ds = ds["test"] +``` + +次に、例を見てみましょう。 + +```py +>>> train_ds[0] +{'image': , + 'annotation': , + 'scene_category': 368} +``` + +- `image`: シーンの PIL イメージ。 +- `annotation`: セグメンテーション マップの PIL イメージ。モデルのターゲットでもあります。 +- `scene_category`: "kitchen"や"office"などの画像シーンを説明するカテゴリ ID。このガイドでは、`image`と`annotation`のみが必要になります。どちらも PIL イメージです。 + +また、ラベル ID をラベル クラスにマップする辞書を作成することもできます。これは、後でモデルを設定するときに役立ちます。ハブからマッピングをダウンロードし、`id2label` および `label2id` ディクショナリを作成します。 + +```py +>>> import json +>>> from huggingface_hub import cached_download, hf_hub_url + +>>> repo_id = "huggingface/label-files" +>>> filename = "ade20k-id2label.json" +>>> id2label = json.load(open(cached_download(hf_hub_url(repo_id, filename, repo_type="dataset")), "r")) +>>> id2label = {int(k): v for k, v in id2label.items()} +>>> label2id = {v: k for k, v in id2label.items()} +>>> num_labels = len(id2label) +``` + +## Preprocess + +次のステップでは、SegFormer 画像プロセッサをロードして、モデルの画像と注釈を準備します。このデータセットのような一部のデータセットは、バックグラウンド クラスとしてゼロインデックスを使用します。ただし、実際には背景クラスは 150 個のクラスに含まれていないため、`reduce_labels=True`を設定してすべてのラベルから 1 つを引く必要があります。ゼロインデックスは `255` に置き換えられるため、SegFormer の損失関数によって無視されます。 + +```py +>>> from transformers import AutoImageProcessor + +>>> checkpoint = "nvidia/mit-b0" +>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint, reduce_labels=True) +``` + + + + +モデルを過学習に対してより堅牢にするために、画像データセットにいくつかのデータ拡張を適用するのが一般的です。このガイドでは、[torchvision](https://pytorch.org/vision/stable/index.html) の [`ColorJitter`](https://pytorch.org/vision/stable/generated/torchvision.transforms.ColorJitter.html) 関数を使用します。 ) を使用して画像の色のプロパティをランダムに変更しますが、任意の画像ライブラリを使用することもできます。 + +```py +>>> from torchvision.transforms import ColorJitter + +>>> jitter = ColorJitter(brightness=0.25, contrast=0.25, saturation=0.25, hue=0.1) +``` + +次に、モデルの画像と注釈を準備するための 2 つの前処理関数を作成します。これらの関数は、画像を`pixel_values`に変換し、注釈を`labels`に変換します。トレーニング セットの場合、画像を画像プロセッサに提供する前に `jitter` が適用されます。テスト セットの場合、テスト中にデータ拡張が適用されないため、画像プロセッサは`images`を切り取って正規化し、`ラベル`のみを切り取ります。 + +```py +>>> def train_transforms(example_batch): +... images = [jitter(x) for x in example_batch["image"]] +... labels = [x for x in example_batch["annotation"]] +... inputs = image_processor(images, labels) +... return inputs + + +>>> def val_transforms(example_batch): +... images = [x for x in example_batch["image"]] +... labels = [x for x in example_batch["annotation"]] +... inputs = image_processor(images, labels) +... 
return inputs +``` + +データセット全体に`jitter`を適用するには、🤗 Datasets [`~datasets.Dataset.set_transform`] 関数を使用します。変換はオンザフライで適用されるため、高速で消費するディスク容量が少なくなります。 + +```py +>>> train_ds.set_transform(train_transforms) +>>> test_ds.set_transform(val_transforms) +``` + + + + + + + +モデルを過学習に対してより堅牢にするために、画像データセットにいくつかのデータ拡張を適用するのが一般的です。 +このガイドでは、[`tf.image`](https://www.tensorflow.org/api_docs/python/tf/image) を使用して画像の色のプロパティをランダムに変更しますが、任意のプロパティを使用することもできます。画像 +好きな図書館。 +2 つの別々の変換関数を定義します。 +- 画像拡張を含むトレーニング データ変換 +- 🤗 Transformers のコンピューター ビジョン モデルはチャネル優先のレイアウトを想定しているため、画像を転置するだけの検証データ変換 + +```py +>>> import tensorflow as tf + + +>>> def aug_transforms(image): +... image = tf.keras.utils.img_to_array(image) +... image = tf.image.random_brightness(image, 0.25) +... image = tf.image.random_contrast(image, 0.5, 2.0) +... image = tf.image.random_saturation(image, 0.75, 1.25) +... image = tf.image.random_hue(image, 0.1) +... image = tf.transpose(image, (2, 0, 1)) +... return image + + +>>> def transforms(image): +... image = tf.keras.utils.img_to_array(image) +... image = tf.transpose(image, (2, 0, 1)) +... return image +``` + +次に、モデルの画像と注釈のバッチを準備する 2 つの前処理関数を作成します。これらの機能が適用されます +画像変換を行い、以前にロードされた `image_processor` を使用して画像を `pixel_values` に変換し、 +`labels`への注釈。 `ImageProcessor` は、画像のサイズ変更と正規化も処理します。 + +```py +>>> def train_transforms(example_batch): +... images = [aug_transforms(x.convert("RGB")) for x in example_batch["image"]] +... labels = [x for x in example_batch["annotation"]] +... inputs = image_processor(images, labels) +... return inputs + + +>>> def val_transforms(example_batch): +... images = [transforms(x.convert("RGB")) for x in example_batch["image"]] +... labels = [x for x in example_batch["annotation"]] +... inputs = image_processor(images, labels) +... return inputs +``` + +データセット全体に前処理変換を適用するには、🤗 Datasets [`~datasets.Dataset.set_transform`] 関数を使用します。 +変換はオンザフライで適用されるため、高速で消費するディスク容量が少なくなります。 + +```py +>>> train_ds.set_transform(train_transforms) +>>> test_ds.set_transform(val_transforms) +``` + + + +## Evaluate + +トレーニング中にメトリクスを含めると、多くの場合、モデルのパフォーマンスを評価するのに役立ちます。 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) ライブラリを使用して、評価メソッドをすばやくロードできます。このタスクでは、[Mean Intersection over Union](https://huggingface.co/spaces/evaluate-metric/accuracy) (IoU) メトリックをロードします (🤗 Evaluate [クイック ツアー](https://huggingface.co/docs/evaluate/a_quick_tour) を参照して、メトリクスをロードして計算する方法の詳細を確認してください)。 + +```py +>>> import evaluate + +>>> metric = evaluate.load("mean_iou") +``` + +次に、メトリクスを [`~evaluate.EvaluationModule.compute`] する関数を作成します。予測を次のように変換する必要があります +最初にロジットを作成し、次に [`~evaluate.EvaluationModule.compute`] を呼び出す前にラベルのサイズに一致するように再形成します。 + + + + +```py +>>> import numpy as np +>>> import torch +>>> from torch import nn + +>>> def compute_metrics(eval_pred): +... with torch.no_grad(): +... logits, labels = eval_pred +... logits_tensor = torch.from_numpy(logits) +... logits_tensor = nn.functional.interpolate( +... logits_tensor, +... size=labels.shape[-2:], +... mode="bilinear", +... align_corners=False, +... ).argmax(dim=1) + +... pred_labels = logits_tensor.detach().cpu().numpy() +... metrics = metric.compute( +... predictions=pred_labels, +... references=labels, +... num_labels=num_labels, +... ignore_index=255, +... reduce_labels=False, +... ) +... for key, value in metrics.items(): +... if type(value) is np.ndarray: +... metrics[key] = value.tolist() +... return metrics +``` + + + + + + + + +```py +>>> def compute_metrics(eval_pred): +... logits, labels = eval_pred +... logits = tf.transpose(logits, perm=[0, 2, 3, 1]) +... 
logits_resized = tf.image.resize( +... logits, +... size=tf.shape(labels)[1:], +... method="bilinear", +... ) + +... pred_labels = tf.argmax(logits_resized, axis=-1) +... metrics = metric.compute( +... predictions=pred_labels, +... references=labels, +... num_labels=num_labels, +... ignore_index=-1, +... reduce_labels=image_processor.do_reduce_labels, +... ) + +... per_category_accuracy = metrics.pop("per_category_accuracy").tolist() +... per_category_iou = metrics.pop("per_category_iou").tolist() + +... metrics.update({f"accuracy_{id2label[i]}": v for i, v in enumerate(per_category_accuracy)}) +... metrics.update({f"iou_{id2label[i]}": v for i, v in enumerate(per_category_iou)}) +... return {"val_" + k: v for k, v in metrics.items()} +``` + + + + +これで`compute_metrics`関数の準備が整いました。トレーニングをセットアップするときにこの関数に戻ります。 + +## Train + + + + +[`Trainer`] を使用したモデルの微調整に慣れていない場合は、[ここ](../training#finetune-with-trainer) の基本的なチュートリアルをご覧ください。 + + + +これでモデルのトレーニングを開始する準備が整いました。 [`AutoModelForSemanticSegmentation`] を使用して SegFormer をロードし、ラベル ID とラベル クラス間のマッピングをモデルに渡します。 + +```py +>>> from transformers import AutoModelForSemanticSegmentation, TrainingArguments, Trainer + +>>> model = AutoModelForSemanticSegmentation.from_pretrained(checkpoint, id2label=id2label, label2id=label2id) +``` + +この時点で残っている手順は次の 3 つだけです。 + +1. [`TrainingArguments`] でトレーニング ハイパーパラメータを定義します。 `image` 列が削除されるため、未使用の列を削除しないことが重要です。 `image` 列がないと、`pixel_values` を作成できません。この動作を防ぐには、`remove_unused_columns=False`を設定してください。他に必要なパラメータは、モデルの保存場所を指定する `output_dir` だけです。 `push_to_hub=True`を設定して、このモデルをハブにプッシュします (モデルをアップロードするには、Hugging Face にサインインする必要があります)。各エポックの終了時に、[`Trainer`] は IoU メトリックを評価し、トレーニング チェックポイントを保存します。 +2. トレーニング引数を、モデル、データセット、トークナイザー、データ照合器、および `compute_metrics` 関数とともに [`Trainer`] に渡します。 +3. [`~Trainer.train`] を呼び出してモデルを微調整します。 + +```py +>>> training_args = TrainingArguments( +... output_dir="segformer-b0-scene-parse-150", +... learning_rate=6e-5, +... num_train_epochs=50, +... per_device_train_batch_size=2, +... per_device_eval_batch_size=2, +... save_total_limit=3, +... evaluation_strategy="steps", +... save_strategy="steps", +... save_steps=20, +... eval_steps=20, +... logging_steps=1, +... eval_accumulation_steps=5, +... remove_unused_columns=False, +... push_to_hub=True, +... ) + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=train_ds, +... eval_dataset=test_ds, +... compute_metrics=compute_metrics, +... ) + +>>> trainer.train() +``` + +トレーニングが完了したら、 [`~transformers.Trainer.push_to_hub`] メソッドを使用してモデルをハブに共有し、誰もがモデルを使用できるようにします。 + +```py +>>> trainer.push_to_hub() +``` + + + + + + + +Keras を使用したモデルの微調整に慣れていない場合は、まず [基本チュートリアル](./training#train-a-tensorflow-model-with-keras) を確認してください。 + + + +TensorFlow でモデルを微調整するには、次の手順に従います。 +1. トレーニングのハイパーパラメータを定義し、オプティマイザーと学習率スケジュールを設定します。 +2. 事前トレーニングされたモデルをインスタンス化します。 +3. 🤗 データセットを `tf.data.Dataset` に変換します。 +4. モデルをコンパイルします。 +5. コールバックを追加してメトリクスを計算し、モデルを 🤗 Hub にアップロードします +6. `fit()` メソッドを使用してトレーニングを実行します。 + +まず、ハイパーパラメーター、オプティマイザー、学習率スケジュールを定義します。 + +```py +>>> from transformers import create_optimizer + +>>> batch_size = 2 +>>> num_epochs = 50 +>>> num_train_steps = len(train_ds) * num_epochs +>>> learning_rate = 6e-5 +>>> weight_decay_rate = 0.01 + +>>> optimizer, lr_schedule = create_optimizer( +... init_lr=learning_rate, +... num_train_steps=num_train_steps, +... weight_decay_rate=weight_decay_rate, +... num_warmup_steps=0, +... 
) +``` + +次に、ラベル マッピングとともに [`TFAutoModelForSemanticSegmentation`] を使用して SegFormer をロードし、それをコンパイルします。 +オプティマイザ。 Transformers モデルにはすべてデフォルトのタスク関連の損失関数があるため、次の場合を除き、損失関数を指定する必要はないことに注意してください。 + +```py +>>> from transformers import TFAutoModelForSemanticSegmentation + +>>> model = TFAutoModelForSemanticSegmentation.from_pretrained( +... checkpoint, +... id2label=id2label, +... label2id=label2id, +... ) +>>> model.compile(optimizer=optimizer) # No loss argument! +``` + +[`~datasets.Dataset.to_tf_dataset`] と [`DefaultDataCollat​​or`] を使用して、データセットを `tf.data.Dataset` 形式に変換します。 + +```py +>>> from transformers import DefaultDataCollator + +>>> data_collator = DefaultDataCollator(return_tensors="tf") + +>>> tf_train_dataset = train_ds.to_tf_dataset( +... columns=["pixel_values", "label"], +... shuffle=True, +... batch_size=batch_size, +... collate_fn=data_collator, +... ) + +>>> tf_eval_dataset = test_ds.to_tf_dataset( +... columns=["pixel_values", "label"], +... shuffle=True, +... batch_size=batch_size, +... collate_fn=data_collator, +... ) +``` + +予測から精度を計算し、モデルを 🤗 ハブにプッシュするには、[Keras callbacks](../main_classes/keras_callbacks) を使用します。 +`compute_metrics` 関数を [`KerasMetricCallback`] に渡します。 +そして [`PushToHubCallback`] を使用してモデルをアップロードします。 + +```py +>>> from transformers.keras_callbacks import KerasMetricCallback, PushToHubCallback + +>>> metric_callback = KerasMetricCallback( +... metric_fn=compute_metrics, eval_dataset=tf_eval_dataset, batch_size=batch_size, label_cols=["labels"] +... ) + +>>> push_to_hub_callback = PushToHubCallback(output_dir="scene_segmentation", tokenizer=image_processor) + +>>> callbacks = [metric_callback, push_to_hub_callback] +``` + +ついに、モデルをトレーニングする準備が整いました。トレーニングおよび検証データセット、エポック数、 +モデルを微調整するためのコールバック: + + +```py +>>> model.fit( +... tf_train_dataset, +... validation_data=tf_eval_dataset, +... callbacks=callbacks, +... epochs=num_epochs, +... ) +``` + +おめでとう!モデルを微調整し、🤗 Hub で共有しました。これで推論に使用できるようになりました。 + + + +## Inference + +モデルを微調整したので、それを推論に使用できるようになりました。 + +推論のために画像をロードします。 + +```py +>>> image = ds[0]["image"] +>>> image +``` + +
+ Image of bedroom +
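+The snippet above takes a sample straight from the dataset. If you would rather segment a picture of your own, a minimal sketch (the file name `my_room.jpg` is only a placeholder) could look like this:
+
+```py
+>>> from PIL import Image
+
+>>> # "my_room.jpg" is a hypothetical local file; any RGB image works here
+>>> image = Image.open("my_room.jpg").convert("RGB")
+```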
+ + + + +推論用に微調整されたモデルを試す最も簡単な方法は、それを [`pipeline`] で使用することです。モデルを使用して画像セグメンテーション用の `pipeline`をインスタンス化し、それに画像を渡します。 + +```py +>>> from transformers import pipeline + +>>> segmenter = pipeline("image-segmentation", model="my_awesome_seg_model") +>>> segmenter(image) +[{'score': None, + 'label': 'wall', + 'mask': }, + {'score': None, + 'label': 'sky', + 'mask': }, + {'score': None, + 'label': 'floor', + 'mask': }, + {'score': None, + 'label': 'ceiling', + 'mask': }, + {'score': None, + 'label': 'bed ', + 'mask': }, + {'score': None, + 'label': 'windowpane', + 'mask': }, + {'score': None, + 'label': 'cabinet', + 'mask': }, + {'score': None, + 'label': 'chair', + 'mask': }, + {'score': None, + 'label': 'armchair', + 'mask': }] +``` + +必要に応じて、`pipeline`の結果を手動で複製することもできます。画像を画像プロセッサで処理し、`pixel_values` を GPU に配置します。 + +```py +>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # use GPU if available, otherwise use a CPU +>>> encoding = image_processor(image, return_tensors="pt") +>>> pixel_values = encoding.pixel_values.to(device) +``` + +入力をモデルに渡し、`logits`を返します。 + + +```py +>>> outputs = model(pixel_values=pixel_values) +>>> logits = outputs.logits.cpu() +``` + +次に、ロジットを元の画像サイズに再スケールします。 + +```py +>>> upsampled_logits = nn.functional.interpolate( +... logits, +... size=image.size[::-1], +... mode="bilinear", +... align_corners=False, +... ) + +>>> pred_seg = upsampled_logits.argmax(dim=1)[0] +``` + + + + + + + +画像プロセッサをロードして画像を前処理し、入力を TensorFlow テンソルとして返します。 + +```py +>>> from transformers import AutoImageProcessor + +>>> image_processor = AutoImageProcessor.from_pretrained("MariaK/scene_segmentation") +>>> inputs = image_processor(image, return_tensors="tf") +``` + +入力をモデルに渡し、`logits`を返します。 + +```py +>>> from transformers import TFAutoModelForSemanticSegmentation + +>>> model = TFAutoModelForSemanticSegmentation.from_pretrained("MariaK/scene_segmentation") +>>> logits = model(**inputs).logits +``` + +次に、ロジットを元の画像サイズに再スケールし、クラス次元に argmax を適用します。 + +```py +>>> logits = tf.transpose(logits, [0, 2, 3, 1]) + +>>> upsampled_logits = tf.image.resize( +... logits, +... # We reverse the shape of `image` because `image.size` returns width and height. +... image.size[::-1], +... ) + +>>> pred_seg = tf.math.argmax(upsampled_logits, axis=-1)[0] +``` + + + + +結果を視覚化するには、[データセット カラー パレット](https://github.com/tensorflow/models/blob/3f1ca33afe3c1631b733ea7e40c294273b9e406d/research/deeplab/utils/get_dataset_colormap.py#L51) を、それぞれをマップする `ade_palette()` としてロードします。クラスを RGB 値に変換します。次に、画像と予測されたセグメンテーション マップを組み合わせてプロットできます。 + +```py +>>> import matplotlib.pyplot as plt +>>> import numpy as np + +>>> color_seg = np.zeros((pred_seg.shape[0], pred_seg.shape[1], 3), dtype=np.uint8) +>>> palette = np.array(ade_palette()) +>>> for label, color in enumerate(palette): +... color_seg[pred_seg == label, :] = color +>>> color_seg = color_seg[..., ::-1] # convert to BGR + +>>> img = np.array(image) * 0.5 + color_seg * 0.5 # plot the image with the segmentation map +>>> img = img.astype(np.uint8) + +>>> plt.figure(figsize=(15, 10)) +>>> plt.imshow(img) +>>> plt.show() +``` + +
+ Image of bedroom overlaid with segmentation map +
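+If you want to keep the visualization, a small optional sketch (assuming the `img` overlay array computed above; the output file name is arbitrary) is to save it with matplotlib:
+
+```py
+>>> import matplotlib.pyplot as plt
+
+>>> # Persist the blended overlay to disk; "segmentation_overlay.png" is just an example name
+>>> plt.imsave("segmentation_overlay.png", img)
+```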
diff --git a/docs/source/ja/tasks/sequence_classification.md b/docs/source/ja/tasks/sequence_classification.md new file mode 100644 index 00000000000000..6673cfe9e56938 --- /dev/null +++ b/docs/source/ja/tasks/sequence_classification.md @@ -0,0 +1,608 @@ + + +# Sequence classification + +[[open-in-colab]] + + + +セマンティック セグメンテーションでは、画像の個々のピクセルにラベルまたはクラスを割り当てます。セグメンテーションにはいくつかのタイプがありますが、セマンティック セグメンテーションの場合、同じオブジェクトの一意のインスタンス間の区別は行われません。両方のオブジェクトに同じラベルが付けられます (たとえば、「car-1」と「car-2」の代わりに「car」)。セマンティック セグメンテーションの一般的な現実世界のアプリケーションには、歩行者や重要な交通情報を識別するための自動運転車のトレーニング、医療画像内の細胞と異常の識別、衛星画像からの環境変化の監視などが含まれます。 + +このガイドでは、次の方法を説明します。 + +1. [SceneParse150](https://huggingface.co/datasets/scene_parse_150) データセットの [SegFormer](https://huggingface.co/docs/transformers/main/en/model_doc/segformer#segformer) を微調整します。 +2. 微調整したモデルを推論に使用します。 + + +このチュートリアルで説明するタスクは、次のモデル アーキテクチャでサポートされています。 + + + +[BEiT](../model_doc/beit), [Data2VecVision](../model_doc/data2vec-vision), [DPT](../model_doc/dpt), [MobileNetV2](../model_doc/mobilenet_v2), [MobileViT](../model_doc/mobilevit), [MobileViTV2](../model_doc/mobilevitv2), [SegFormer](../model_doc/segformer), [UPerNet](../model_doc/upernet) + + + + + +始める前に、必要なライブラリがすべてインストールされていることを確認してください。 + +```bash +pip install -q datasets transformers evaluate +``` + +モデルをアップロードしてコミュニティと共有できるように、Hugging Face アカウントにログインすることをお勧めします。プロンプトが表示されたら、トークンを入力してログインします。 + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## Load SceneParse150 dataset + + +まず、SceneParse150 データセットの小さいサブセットを 🤗 データセット ライブラリから読み込みます。これにより、完全なデータセットのトレーニングにさらに時間を費やす前に、実験してすべてが機能することを確認する機会が得られます。 + +```py +>>> from datasets import load_dataset + +>>> ds = load_dataset("scene_parse_150", split="train[:50]") +``` + +[`~datasets.Dataset.train_test_split`] メソッドを使用して、データセットの `train` 分割をトレイン セットとテスト セットに分割します。 + +```py +>>> ds = ds.train_test_split(test_size=0.2) +>>> train_ds = ds["train"] +>>> test_ds = ds["test"] +``` + +次に、例を見てみましょう。 + +```py +>>> train_ds[0] +{'image': , + 'annotation': , + 'scene_category': 368} +``` + +- `image`: シーンの PIL イメージ。 +- `annotation`: セグメンテーション マップの PIL イメージ。モデルのターゲットでもあります。 +- `scene_category`: 「キッチン」や「オフィス」などの画像シーンを説明するカテゴリ ID。このガイドでは、「image」と「annotation」のみが必要になります。どちらも PIL イメージです。 + +また、ラベル ID をラベル クラスにマップする辞書を作成することもできます。これは、後でモデルを設定するときに役立ちます。ハブからマッピングをダウンロードし、`id2label` および `label2id` ディクショナリを作成します。 + +```py +>>> import json +>>> from huggingface_hub import cached_download, hf_hub_url + +>>> repo_id = "huggingface/label-files" +>>> filename = "ade20k-id2label.json" +>>> id2label = json.load(open(cached_download(hf_hub_url(repo_id, filename, repo_type="dataset")), "r")) +>>> id2label = {int(k): v for k, v in id2label.items()} +>>> label2id = {v: k for k, v in id2label.items()} +>>> num_labels = len(id2label) +``` + +## Preprocess + +次のステップでは、SegFormer 画像プロセッサをロードして、モデルの画像と注釈を準備します。このデータセットのような一部のデータセットは、バックグラウンド クラスとしてゼロインデックスを使用します。ただし、実際には背景クラスは 150 個のクラスに含まれていないため、`reduce_labels=True`を設定してすべてのラベルから 1 つを引く必要があります。ゼロインデックスは `255` に置き換えられるため、SegFormer の損失関数によって無視されます。 + +```py +>>> from transformers import AutoImageProcessor + +>>> checkpoint = "nvidia/mit-b0" +>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint, reduce_labels=True) +``` + + + + +モデルを過学習に対してより堅牢にするために、画像データセットにいくつかのデータ拡張を適用するのが一般的です。このガイドでは、[torchvision](https://pytorch.org) の [`ColorJitter`](https://pytorch.org/vision/stable/generated/torchvision.transforms.ColorJitter.html) 関数を使用します。 /vision/stable/index.html) を使用して画像の色のプロパティをランダムに変更しますが、任意の画像ライブラリを使用することもできます。 + 
+```py +>>> from torchvision.transforms import ColorJitter + +>>> jitter = ColorJitter(brightness=0.25, contrast=0.25, saturation=0.25, hue=0.1) +``` + +次に、モデルの画像と注釈を準備するための 2 つの前処理関数を作成します。これらの関数は、画像を`pixel_values`に変換し、注釈を`labels`に変換します。トレーニング セットの場合、画像を画像プロセッサに提供する前に`jitter`が適用されます。テスト セットの場合、テスト中にデータ拡張が適用されないため、画像プロセッサは`images`を切り取って正規化し、`labels` のみを切り取ります。 + +```py +>>> def train_transforms(example_batch): +... images = [jitter(x) for x in example_batch["image"]] +... labels = [x for x in example_batch["annotation"]] +... inputs = image_processor(images, labels) +... return inputs + + +>>> def val_transforms(example_batch): +... images = [x for x in example_batch["image"]] +... labels = [x for x in example_batch["annotation"]] +... inputs = image_processor(images, labels) +... return inputs +``` + +データセット全体に`jitter`を適用するには、🤗 Datasets [`~datasets.Dataset.set_transform`] 関数を使用します。変換はオンザフライで適用されるため、高速で消費するディスク容量が少なくなります。 + +```py +>>> train_ds.set_transform(train_transforms) +>>> test_ds.set_transform(val_transforms) +``` + + + + + + + +モデルを過学習に対してより堅牢にするために、画像データセットにいくつかのデータ拡張を適用するのが一般的です。 +このガイドでは、[`tf.image`](https://www.tensorflow.org/api_docs/python/tf/image) を使用して画像の色のプロパティをランダムに変更しますが、任意のプロパティを使用することもできます。画像 +好きな図書館。 +2 つの別々の変換関数を定義します。 +- 画像拡張を含むトレーニング データ変換 +- 🤗 Transformers のコンピューター ビジョン モデルはチャネル優先のレイアウトを想定しているため、画像を転置するだけの検証データ変換 + +```py +>>> import tensorflow as tf + + +>>> def aug_transforms(image): +... image = tf.keras.utils.img_to_array(image) +... image = tf.image.random_brightness(image, 0.25) +... image = tf.image.random_contrast(image, 0.5, 2.0) +... image = tf.image.random_saturation(image, 0.75, 1.25) +... image = tf.image.random_hue(image, 0.1) +... image = tf.transpose(image, (2, 0, 1)) +... return image + + +>>> def transforms(image): +... image = tf.keras.utils.img_to_array(image) +... image = tf.transpose(image, (2, 0, 1)) +... return image +``` + +次に、モデルの画像と注釈のバッチを準備する 2 つの前処理関数を作成します。これらの機能が適用されます +画像変換を行い、以前にロードされた `image_processor` を使用して画像を `pixel_values` に変換し、 +`labels`への注釈。 `ImageProcessor` は、画像のサイズ変更と正規化も処理します。 + +```py +>>> def train_transforms(example_batch): +... images = [aug_transforms(x.convert("RGB")) for x in example_batch["image"]] +... labels = [x for x in example_batch["annotation"]] +... inputs = image_processor(images, labels) +... return inputs + + +>>> def val_transforms(example_batch): +... images = [transforms(x.convert("RGB")) for x in example_batch["image"]] +... labels = [x for x in example_batch["annotation"]] +... inputs = image_processor(images, labels) +... 
return inputs +``` + +データセット全体に前処理変換を適用するには、🤗 Datasets [`~datasets.Dataset.set_transform`] 関数を使用します。 +変換はオンザフライで適用されるため、高速で消費するディスク容量が少なくなります。 + +```py +>>> train_ds.set_transform(train_transforms) +>>> test_ds.set_transform(val_transforms) +``` + + + +## Evaluate + +トレーニング中にメトリクスを含めると、多くの場合、モデルのパフォーマンスを評価するのに役立ちます。 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) ライブラリを使用して、評価メソッドをすばやくロードできます。このタスクでは、[Mean Intersection over Union](https://huggingface.co/spaces/evaluate-metric/accuracy) (IoU) メトリックをロードします (🤗 Evaluate [クイック ツアー](https://huggingface.co) を参照してください) /docs/evaluate/a_quick_tour) を参照して、メトリクスをロードして計算する方法の詳細を確認してください)。 + +```py +>>> import evaluate + +>>> metric = evaluate.load("mean_iou") +``` + +次に、メトリクスを [`~evaluate.EvaluationModule.compute`] する関数を作成します。予測を次のように変換する必要があります +最初にロジットを作成し、次に [`~evaluate.EvaluationModule.compute`] を呼び出す前にラベルのサイズに一致するように再形成します。 + + + + +```py +>>> import numpy as np +>>> import torch +>>> from torch import nn + +>>> def compute_metrics(eval_pred): +... with torch.no_grad(): +... logits, labels = eval_pred +... logits_tensor = torch.from_numpy(logits) +... logits_tensor = nn.functional.interpolate( +... logits_tensor, +... size=labels.shape[-2:], +... mode="bilinear", +... align_corners=False, +... ).argmax(dim=1) + +... pred_labels = logits_tensor.detach().cpu().numpy() +... metrics = metric.compute( +... predictions=pred_labels, +... references=labels, +... num_labels=num_labels, +... ignore_index=255, +... reduce_labels=False, +... ) +... for key, value in metrics.items(): +... if type(value) is np.ndarray: +... metrics[key] = value.tolist() +... return metrics +``` + + + + + + + + +```py +>>> def compute_metrics(eval_pred): +... logits, labels = eval_pred +... logits = tf.transpose(logits, perm=[0, 2, 3, 1]) +... logits_resized = tf.image.resize( +... logits, +... size=tf.shape(labels)[1:], +... method="bilinear", +... ) + +... pred_labels = tf.argmax(logits_resized, axis=-1) +... metrics = metric.compute( +... predictions=pred_labels, +... references=labels, +... num_labels=num_labels, +... ignore_index=-1, +... reduce_labels=image_processor.do_reduce_labels, +... ) + +... per_category_accuracy = metrics.pop("per_category_accuracy").tolist() +... per_category_iou = metrics.pop("per_category_iou").tolist() + +... metrics.update({f"accuracy_{id2label[i]}": v for i, v in enumerate(per_category_accuracy)}) +... metrics.update({f"iou_{id2label[i]}": v for i, v in enumerate(per_category_iou)}) +... return {"val_" + k: v for k, v in metrics.items()} +``` + + + + +これで`compute_metrics`関数の準備が整いました。トレーニングをセットアップするときにこの関数に戻ります。 + +## Train + + + + +[`Trainer`] を使用したモデルの微調整に慣れていない場合は、[こちら](../training#finetune-with-trainer) の基本的なチュートリアルをご覧ください。 + + + + +これでモデルのトレーニングを開始する準備が整いました。 [`AutoModelForSemanticSegmentation`] を使用して SegFormer をロードし、ラベル ID とラベル クラス間のマッピングをモデルに渡します。 + +```py +>>> from transformers import AutoModelForSemanticSegmentation, TrainingArguments, Trainer + +>>> model = AutoModelForSemanticSegmentation.from_pretrained(checkpoint, id2label=id2label, label2id=label2id) +``` + +この時点で残っている手順は次の 3 つだけです。 + +1. [`TrainingArguments`] でトレーニング ハイパーパラメータを定義します。 `image` 列が削除されるため、未使用の列を削除しないことが重要です。 `image` 列がないと、`pixel_values` を作成できません。この動作を防ぐには、`remove_unused_columns=False`を設定してください。他に必要なパラメータは、モデルの保存場所を指定する `output_dir` だけです。 `push_to_hub=True`を設定して、このモデルをハブにプッシュします (モデルをアップロードするには、Hugging Face にサインインする必要があります)。各エポックの終了時に、[`Trainer`] は IoU メトリックを評価し、トレーニング チェックポイントを保存します。 +2. 
トレーニング引数を、モデル、データセット、トークナイザー、データ照合器、および `compute_metrics` 関数とともに [`Trainer`] に渡します。 +3. [`~Trainer.train`] を呼び出してモデルを微調整します。 + + +```py +>>> training_args = TrainingArguments( +... output_dir="segformer-b0-scene-parse-150", +... learning_rate=6e-5, +... num_train_epochs=50, +... per_device_train_batch_size=2, +... per_device_eval_batch_size=2, +... save_total_limit=3, +... evaluation_strategy="steps", +... save_strategy="steps", +... save_steps=20, +... eval_steps=20, +... logging_steps=1, +... eval_accumulation_steps=5, +... remove_unused_columns=False, +... push_to_hub=True, +... ) + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=train_ds, +... eval_dataset=test_ds, +... compute_metrics=compute_metrics, +... ) + +>>> trainer.train() +``` + +トレーニングが完了したら、 [`~transformers.Trainer.push_to_hub`] メソッドを使用してモデルをハブに共有し、誰もがモデルを使用できるようにします。 + +```py +>>> trainer.push_to_hub() +``` + + + + + + + +Keras を使用したモデルの微調整に慣れていない場合は、まず [基本チュートリアル](./training#train-a-tensorflow-model-with-keras) を確認してください。 + + + +TensorFlow でモデルを微調整するには、次の手順に従います。 +1. トレーニングのハイパーパラメータを定義し、オプティマイザーと学習率スケジュールを設定します。 +2. 事前トレーニングされたモデルをインスタンス化します。 +3. 🤗 データセットを `tf.data.Dataset` に変換します。 +4. モデルをコンパイルします。 +5. コールバックを追加してメトリクスを計算し、モデルを 🤗 Hub にアップロードします +6. `fit()` メソッドを使用してトレーニングを実行します。 + +まず、ハイパーパラメーター、オプティマイザー、学習率スケジュールを定義します。 + + +```py +>>> from transformers import create_optimizer + +>>> batch_size = 2 +>>> num_epochs = 50 +>>> num_train_steps = len(train_ds) * num_epochs +>>> learning_rate = 6e-5 +>>> weight_decay_rate = 0.01 + +>>> optimizer, lr_schedule = create_optimizer( +... init_lr=learning_rate, +... num_train_steps=num_train_steps, +... weight_decay_rate=weight_decay_rate, +... num_warmup_steps=0, +... ) +``` + +次に、ラベル マッピングとともに [`TFAutoModelForSemanticSegmentation`] を使用して SegFormer をロードし、それをコンパイルします。 +オプティマイザ。 Transformers モデルにはすべてデフォルトのタスク関連の損失関数があるため、次の場合を除き、損失関数を指定する必要はないことに注意してください。 + +```py +>>> from transformers import TFAutoModelForSemanticSegmentation + +>>> model = TFAutoModelForSemanticSegmentation.from_pretrained( +... checkpoint, +... id2label=id2label, +... label2id=label2id, +... ) +>>> model.compile(optimizer=optimizer) # No loss argument! +``` + +[`~datasets.Dataset.to_tf_dataset`] と [`DefaultDataCollat​​or`] を使用して、データセットを `tf.data.Dataset` 形式に変換します。 + +```py +>>> from transformers import DefaultDataCollator + +>>> data_collator = DefaultDataCollator(return_tensors="tf") + +>>> tf_train_dataset = train_ds.to_tf_dataset( +... columns=["pixel_values", "label"], +... shuffle=True, +... batch_size=batch_size, +... collate_fn=data_collator, +... ) + +>>> tf_eval_dataset = test_ds.to_tf_dataset( +... columns=["pixel_values", "label"], +... shuffle=True, +... batch_size=batch_size, +... collate_fn=data_collator, +... ) +``` + +予測から精度を計算し、モデルを 🤗 ハブにプッシュするには、[Keras callbacks](../main_classes/keras_callbacks) を使用します。 +`compute_metrics` 関数を [`KerasMetricCallback`] に渡します。 +そして [`PushToHubCallback`] を使用してモデルをアップロードします。 + +```py +>>> from transformers.keras_callbacks import KerasMetricCallback, PushToHubCallback + +>>> metric_callback = KerasMetricCallback( +... metric_fn=compute_metrics, eval_dataset=tf_eval_dataset, batch_size=batch_size, label_cols=["labels"] +... ) + +>>> push_to_hub_callback = PushToHubCallback(output_dir="scene_segmentation", tokenizer=image_processor) + +>>> callbacks = [metric_callback, push_to_hub_callback] +``` + +ついに、モデルをトレーニングする準備が整いました。`fit()`トレーニングおよび検証データセット、エポック数、 +モデルを微調整するためのコールバック: + +```py +>>> model.fit( +... tf_train_dataset, +... 
validation_data=tf_eval_dataset, +... callbacks=callbacks, +... epochs=num_epochs, +... ) +``` + +おめでとう!モデルを微調整し、🤗 Hub で共有しました。これで推論に使用できるようになりました。 + + + + + +## Inference + +モデルを微調整したので、それを推論に使用できるようになりました。 + +推論のために画像をロードします。 + +```py +>>> image = ds[0]["image"] +>>> image +``` + +
+ Image of bedroom +
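+The manual PyTorch steps further below reuse the `model` object from the training section. If you are starting from a fresh session, a minimal sketch for reloading a fine-tuned checkpoint first (`my_awesome_seg_model` is the same placeholder repository name used in this guide) would be:
+
+```py
+>>> from transformers import AutoModelForSemanticSegmentation
+
+>>> # "my_awesome_seg_model" is a placeholder; point it at the Hub repo or local path you trained above
+>>> model = AutoModelForSemanticSegmentation.from_pretrained("my_awesome_seg_model")
+```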
+ + + + +推論用に微調整されたモデルを試す最も簡単な方法は、それを [`pipeline`] で使用することです。モデルを使用して画像セグメンテーション用の `pipeline` をインスタンス化し、それに画像を渡します。 + +```py +>>> from transformers import pipeline + +>>> segmenter = pipeline("image-segmentation", model="my_awesome_seg_model") +>>> segmenter(image) +[{'score': None, + 'label': 'wall', + 'mask': }, + {'score': None, + 'label': 'sky', + 'mask': }, + {'score': None, + 'label': 'floor', + 'mask': }, + {'score': None, + 'label': 'ceiling', + 'mask': }, + {'score': None, + 'label': 'bed ', + 'mask': }, + {'score': None, + 'label': 'windowpane', + 'mask': }, + {'score': None, + 'label': 'cabinet', + 'mask': }, + {'score': None, + 'label': 'chair', + 'mask': }, + {'score': None, + 'label': 'armchair', + 'mask': }] +``` + +必要に応じて、`pipeline` の結果を手動で複製することもできます。画像プロセッサで画像を処理し、`pixel_values`を GPU に配置します。 + +```py +>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # use GPU if available, otherwise use a CPU +>>> encoding = image_processor(image, return_tensors="pt") +>>> pixel_values = encoding.pixel_values.to(device) +``` + +入力をモデルに渡し、「logits」を返します。 + +```py +>>> outputs = model(pixel_values=pixel_values) +>>> logits = outputs.logits.cpu() +``` + +次に、ロジットを元の画像サイズに再スケールします。 + + +```py +>>> upsampled_logits = nn.functional.interpolate( +... logits, +... size=image.size[::-1], +... mode="bilinear", +... align_corners=False, +... ) + +>>> pred_seg = upsampled_logits.argmax(dim=1)[0] +``` + + + + + + + +画像プロセッサをロードして画像を前処理し、入力を TensorFlow テンソルとして返します。 + +```py +>>> from transformers import AutoImageProcessor + +>>> image_processor = AutoImageProcessor.from_pretrained("MariaK/scene_segmentation") +>>> inputs = image_processor(image, return_tensors="tf") +``` + +入力をモデルに渡し、`logits`を返します。 + +```py +>>> from transformers import TFAutoModelForSemanticSegmentation + +>>> model = TFAutoModelForSemanticSegmentation.from_pretrained("MariaK/scene_segmentation") +>>> logits = model(**inputs).logits +``` + +次に、ロジットを元の画像サイズに再スケールし、クラス次元に argmax を適用します。 + +```py +>>> logits = tf.transpose(logits, [0, 2, 3, 1]) + +>>> upsampled_logits = tf.image.resize( +... logits, +... # We reverse the shape of `image` because `image.size` returns width and height. +... image.size[::-1], +... ) + +>>> pred_seg = tf.math.argmax(upsampled_logits, axis=-1)[0] +``` + + + + +結果を視覚化するには、[データセット カラー パレット](https://github.com/tensorflow/models/blob/3f1ca33afe3c1631b733ea7e40c294273b9e406d/research/deeplab/utils/get_dataset_colormap.py#L51) を、それぞれをマップする `ade_palette()` としてロードします。クラスを RGB 値に変換します。次に、画像と予測されたセグメンテーション マップを組み合わせてプロットできます。 + +```py +>>> import matplotlib.pyplot as plt +>>> import numpy as np + +>>> color_seg = np.zeros((pred_seg.shape[0], pred_seg.shape[1], 3), dtype=np.uint8) +>>> palette = np.array(ade_palette()) +>>> for label, color in enumerate(palette): +... color_seg[pred_seg == label, :] = color +>>> color_seg = color_seg[..., ::-1] # convert to BGR + +>>> img = np.array(image) * 0.5 + color_seg * 0.5 # plot the image with the segmentation map +>>> img = img.astype(np.uint8) + +>>> plt.figure(figsize=(15, 10)) +>>> plt.imshow(img) +>>> plt.show() +``` + +
+ Image of bedroom overlaid with segmentation map +
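+To see which classes the model actually predicted, you can count the pixels assigned to each label; a small sketch, assuming the `pred_seg` map and the `id2label` mapping built earlier in this guide:
+
+```py
+>>> import numpy as np
+
+>>> # Count pixels per predicted class id and map the ids back to readable names
+>>> seg = np.array(pred_seg)
+>>> for label_id, count in zip(*np.unique(seg, return_counts=True)):
+...     print(f"{id2label[int(label_id)]}: {int(count)} pixels")
+```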
diff --git a/docs/source/ja/tasks/summarization.md b/docs/source/ja/tasks/summarization.md new file mode 100644 index 00000000000000..a4b012d712f2e7 --- /dev/null +++ b/docs/source/ja/tasks/summarization.md @@ -0,0 +1,409 @@ + + +# Summarization + +[[open-in-colab]] + + + +要約により、すべての重要な情報をまとめた短いバージョンの文書または記事が作成されます。これは、翻訳と並んで、シーケンス間のタスクとして定式化できるタスクのもう 1 つの例です。要約は次のようになります。 + +- 抽出: 文書から最も関連性の高い情報を抽出します。 +- 抽象的: 最も関連性の高い情報を捉えた新しいテキストを生成します。 + +このガイドでは、次の方法を説明します。 + +1. 抽象的な要約のために、[BillSum](https://huggingface.co/datasets/billsum) データセットのカリフォルニア州請求書サブセットで [T5](https://huggingface.co/google-t5/t5-small) を微調整します。 +2. 微調整したモデルを推論に使用します。 + + +このチュートリアルで説明するタスクは、次のモデル アーキテクチャでサポートされています。 + + + +[BART](../model_doc/bart), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [Encoder decoder](../model_doc/encoder-decoder), [FairSeq Machine-Translation](../model_doc/fsmt), [GPTSAN-japanese](../model_doc/gptsan-japanese), [LED](../model_doc/led), [LongT5](../model_doc/longt5), [M2M100](../model_doc/m2m_100), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [NLLB](../model_doc/nllb), [NLLB-MOE](../model_doc/nllb-moe), [Pegasus](../model_doc/pegasus), [PEGASUS-X](../model_doc/pegasus_x), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [SwitchTransformers](../model_doc/switch_transformers), [T5](../model_doc/t5), [UMT5](../model_doc/umt5), [XLM-ProphetNet](../model_doc/xlm-prophetnet) + + + + + +始める前に、必要なライブラリがすべてインストールされていることを確認してください。 + +```bash +pip install transformers datasets evaluate rouge_score +``` + +モデルをアップロードしてコミュニティと共有できるように、Hugging Face アカウントにログインすることをお勧めします。プロンプトが表示されたら、トークンを入力してログインします。 + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## Load BillSum dataset + +まず、🤗 データセット ライブラリから BillSum データセットの小さいカリフォルニア州請求書サブセットを読み込みます。 + +```py +>>> from datasets import load_dataset + +>>> billsum = load_dataset("billsum", split="ca_test") +``` + +[`~datasets.Dataset.train_test_split`] メソッドを使用して、データセットをトレイン セットとテスト セットに分割します。 + +```py +>>> billsum = billsum.train_test_split(test_size=0.2) +``` + +次に、例を見てみましょう。 + +```py +>>> billsum["train"][0] +{'summary': 'Existing law authorizes state agencies to enter into contracts for the acquisition of goods or services upon approval by the Department of General Services. Existing law sets forth various requirements and prohibitions for those contracts, including, but not limited to, a prohibition on entering into contracts for the acquisition of goods or services of $100,000 or more with a contractor that discriminates between spouses and domestic partners or same-sex and different-sex couples in the provision of benefits. Existing law provides that a contract entered into in violation of those requirements and prohibitions is void and authorizes the state or any person acting on behalf of the state to bring a civil action seeking a determination that a contract is in violation and therefore void. Under existing law, a willful violation of those requirements and prohibitions is a misdemeanor.\nThis bill would also prohibit a state agency from entering into contracts for the acquisition of goods or services of $100,000 or more with a contractor that discriminates between employees on the basis of gender identity in the provision of benefits, as specified. 
By expanding the scope of a crime, this bill would impose a state-mandated local program.\nThe California Constitution requires the state to reimburse local agencies and school districts for certain costs mandated by the state. Statutory provisions establish procedures for making that reimbursement.\nThis bill would provide that no reimbursement is required by this act for a specified reason.', + 'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 10295.35 is added to the Public Contract Code, to read:\n10295.35.\n(a) (1) Notwithstanding any other law, a state agency shall not enter into any contract for the acquisition of goods or services in the amount of one hundred thousand dollars ($100,000) or more with a contractor that, in the provision of benefits, discriminates between employees on the basis of an employee’s or dependent’s actual or perceived gender identity, including, but not limited to, the employee’s or dependent’s identification as transgender.\n(2) For purposes of this section, “contract” includes contracts with a cumulative amount of one hundred thousand dollars ($100,000) or more per contractor in each fiscal year.\n(3) For purposes of this section, an employee health plan is discriminatory if the plan is not consistent with Section 1365.5 of the Health and Safety Code and Section 10140 of the Insurance Code.\n(4) The requirements of this section shall apply only to those portions of a contractor’s operations that occur under any of the following conditions:\n(A) Within the state.\n(B) On real property outside the state if the property is owned by the state or if the state has a right to occupy the property, and if the contractor’s presence at that location is connected to a contract with the state.\n(C) Elsewhere in the United States where work related to a state contract is being performed.\n(b) Contractors shall treat as confidential, to the maximum extent allowed by law or by the requirement of the contractor’s insurance provider, any request by an employee or applicant for employment benefits or any documentation of eligibility for benefits submitted by an employee or applicant for employment.\n(c) After taking all reasonable measures to find a contractor that complies with this section, as determined by the state agency, the requirements of this section may be waived under any of the following circumstances:\n(1) There is only one prospective contractor willing to enter into a specific contract with the state agency.\n(2) The contract is necessary to respond to an emergency, as determined by the state agency, that endangers the public health, welfare, or safety, or the contract is necessary for the provision of essential services, and no entity that complies with the requirements of this section capable of responding to the emergency is immediately available.\n(3) The requirements of this section violate, or are inconsistent with, the terms or conditions of a grant, subvention, or agreement, if the agency has made a good faith attempt to change the terms or conditions of any grant, subvention, or agreement to authorize application of this section.\n(4) The contractor is providing wholesale or bulk water, power, or natural gas, the conveyance or transmission of the same, or ancillary services, as required for ensuring reliable services in accordance with good utility practice, if the purchase of the same cannot practically be accomplished through the standard competitive bidding procedures and the contractor is not providing 
direct retail services to end users.\n(d) (1) A contractor shall not be deemed to discriminate in the provision of benefits if the contractor, in providing the benefits, pays the actual costs incurred in obtaining the benefit.\n(2) If a contractor is unable to provide a certain benefit, despite taking reasonable measures to do so, the contractor shall not be deemed to discriminate in the provision of benefits.\n(e) (1) Every contract subject to this chapter shall contain a statement by which the contractor certifies that the contractor is in compliance with this section.\n(2) The department or other contracting agency shall enforce this section pursuant to its existing enforcement powers.\n(3) (A) If a contractor falsely certifies that it is in compliance with this section, the contract with that contractor shall be subject to Article 9 (commencing with Section 10420), unless, within a time period specified by the department or other contracting agency, the contractor provides to the department or agency proof that it has complied, or is in the process of complying, with this section.\n(B) The application of the remedies or penalties contained in Article 9 (commencing with Section 10420) to a contract subject to this chapter shall not preclude the application of any existing remedies otherwise available to the department or other contracting agency under its existing enforcement powers.\n(f) Nothing in this section is intended to regulate the contracting practices of any local jurisdiction.\n(g) This section shall be construed so as not to conflict with applicable federal laws, rules, or regulations. In the event that a court or agency of competent jurisdiction holds that federal law, rule, or regulation invalidates any clause, sentence, paragraph, or section of this code or the application thereof to any person or circumstances, it is the intent of the state that the court or agency sever that clause, sentence, paragraph, or section so that the remainder of this section shall remain in effect.\nSEC. 2.\nSection 10295.35 of the Public Contract Code shall not be construed to create any new enforcement authority or responsibility in the Department of General Services or any other contracting agency.\nSEC. 3.\nNo reimbursement is required by this act pursuant to Section 6 of Article XIII\u2009B of the California Constitution because the only costs that may be incurred by a local agency or school district will be incurred because this act creates a new crime or infraction, eliminates a crime or infraction, or changes the penalty for a crime or infraction, within the meaning of Section 17556 of the Government Code, or changes the definition of a crime within the meaning of Section 6 of Article XIII\u2009B of the California Constitution.', + 'title': 'An act to add Section 10295.35 to the Public Contract Code, relating to public contracts.'} +``` + +使用するフィールドが 2 つあります。 + +- `text`: モデルへの入力となる請求書のテキスト。 +- `summary`: モデルのターゲットとなる `text` の要約版。 + +## Preprocess + +次のステップでは、T5 トークナイザーをロードして「text」と`summary`を処理します。 + +```py +>>> from transformers import AutoTokenizer + +>>> checkpoint = "google-t5/t5-small" +>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint) +``` + +作成する前処理関数は次のことを行う必要があります。 + +1. T5 がこれが要約タスクであることを認識できるように、入力の前にプロンプ​​トを付けます。複数の NLP タスクが可能な一部のモデルでは、特定のタスクのプロンプトが必要です。 +2. ラベルをトークン化するときにキーワード `text_target` 引数を使用します。 +3. `max_length`パラメータで設定された最大長を超えないようにシーケンスを切り詰めます。 + +```py +>>> prefix = "summarize: " + + +>>> def preprocess_function(examples): +... 
inputs = [prefix + doc for doc in examples["text"]] +... model_inputs = tokenizer(inputs, max_length=1024, truncation=True) + +... labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True) + +... model_inputs["labels"] = labels["input_ids"] +... return model_inputs +``` + +データセット全体に前処理関数を適用するには、🤗 Datasets [`~datasets.Dataset.map`] メソッドを使用します。 `batched=True` を設定してデータセットの複数の要素を一度に処理することで、`map` 関数を高速化できます。 + +```py +>>> tokenized_billsum = billsum.map(preprocess_function, batched=True) +``` + +次に、[`DataCollat​​orForSeq2Seq`] を使用してサンプルのバッチを作成します。データセット全体を最大長までパディングするのではなく、照合中にバッチ内の最長の長さまで文を *動的にパディング* する方が効率的です。 + + + + +```py +>>> from transformers import DataCollatorForSeq2Seq + +>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint) +``` + + + +```py +>>> from transformers import DataCollatorForSeq2Seq + +>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint, return_tensors="tf") +``` + + + +## Evaluate + +トレーニング中にメトリクスを含めると、多くの場合、モデルのパフォーマンスを評価するのに役立ちます。 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) ライブラリを使用して、評価メソッドをすばやくロードできます。このタスクでは、[ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge) メトリックを読み込みます (🤗 Evaluate [クイック ツアー](https://huggingface.co/docs/evaluate/a_quick_tour) を参照してください) ) メトリクスをロードして計算する方法の詳細については、次を参照してください)。 + +```py +>>> import evaluate + +>>> rouge = evaluate.load("rouge") +``` + +次に、予測とラベルを [`~evaluate.EvaluationModule.compute`] に渡して ROUGE メトリクスを計算する関数を作成します。 + +```py +>>> import numpy as np + + +>>> def compute_metrics(eval_pred): +... predictions, labels = eval_pred +... decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True) +... labels = np.where(labels != -100, labels, tokenizer.pad_token_id) +... decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True) + +... result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True) + +... prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions] +... result["gen_len"] = np.mean(prediction_lens) + +... return {k: round(v, 4) for k, v in result.items()} +``` + +これで`compute_metrics`関数の準備が整いました。トレーニングをセットアップするときにこの関数に戻ります。 + +## Train + + + + + + +[`Trainer`] を使用したモデルの微調整に慣れていない場合は、[こちら](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 + + + +これでモデルのトレーニングを開始する準備が整いました。 [`AutoModelForSeq2SeqLM`] を使用して T5 をロードします。 + + +```py +>>> from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer + +>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint) +``` + +この時点で残っている手順は次の 3 つだけです。 + +1. [`Seq2SeqTrainingArguments`] でトレーニング ハイパーパラメータを定義します。唯一の必須パラメータは、モデルの保存場所を指定する `output_dir` です。 `push_to_hub=True`を設定して、このモデルをハブにプッシュします (モデルをアップロードするには、Hugging Face にサインインする必要があります)。各エポックの終了時に、[`Trainer`] は ROUGE メトリクスを評価し、トレーニング チェックポイントを保存します。 +2. トレーニング引数をモデル、データセット、トークナイザー、データ照合器、および `compute_metrics` 関数とともに [`Seq2SeqTrainer`] に渡します。 +3. [`~Trainer.train`] を呼び出してモデルを微調整します。 + +```py +>>> training_args = Seq2SeqTrainingArguments( +... output_dir="my_awesome_billsum_model", +... evaluation_strategy="epoch", +... learning_rate=2e-5, +... per_device_train_batch_size=16, +... per_device_eval_batch_size=16, +... weight_decay=0.01, +... save_total_limit=3, +... num_train_epochs=4, +... predict_with_generate=True, +... fp16=True, +... push_to_hub=True, +... ) + +>>> trainer = Seq2SeqTrainer( +... model=model, +... args=training_args, +... train_dataset=tokenized_billsum["train"], +... 
eval_dataset=tokenized_billsum["test"], +... tokenizer=tokenizer, +... data_collator=data_collator, +... compute_metrics=compute_metrics, +... ) + +>>> trainer.train() +``` + +トレーニングが完了したら、 [`~transformers.Trainer.push_to_hub`] メソッドを使用してモデルをハブに共有し、誰もがモデルを使用できるようにします。 + +```py +>>> trainer.push_to_hub() +``` + + + + +Keras を使用したモデルの微調整に慣れていない場合は、[こちら](../training#train-a-tensorflow-model-with-keras) の基本的なチュートリアルをご覧ください。 + + +TensorFlow でモデルを微調整するには、オプティマイザー関数、学習率スケジュール、およびいくつかのトレーニング ハイパーパラメーターをセットアップすることから始めます。 + +```py +>>> from transformers import create_optimizer, AdamWeightDecay + +>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01) +``` + +次に、[`TFAutoModelForSeq2SeqLM`] を使用して T5 をロードできます。 + + +```py +>>> from transformers import TFAutoModelForSeq2SeqLM + +>>> model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint) +``` + +[`~transformers.TFPreTrainedModel.prepare_tf_dataset`] を使用して、データセットを `tf.data.Dataset` 形式に変換します。 + +```py +>>> tf_train_set = model.prepare_tf_dataset( +... tokenized_billsum["train"], +... shuffle=True, +... batch_size=16, +... collate_fn=data_collator, +... ) + +>>> tf_test_set = model.prepare_tf_dataset( +... tokenized_billsum["test"], +... shuffle=False, +... batch_size=16, +... collate_fn=data_collator, +... ) +``` + +[`compile`](https://keras.io/api/models/model_training_apis/#compile-method) を使用してトレーニング用のモデルを設定します。 Transformers モデルにはすべてデフォルトのタスク関連の損失関数があるため、次の場合を除き、損失関数を指定する必要はないことに注意してください。 + +```py +>>> import tensorflow as tf + +>>> model.compile(optimizer=optimizer) # No loss argument! +``` + +トレーニングを開始する前にセットアップする最後の 2 つのことは、予測から ROUGE スコアを計算し、モデルをハブにプッシュする方法を提供することです。どちらも [Keras コールバック](../main_classes/keras_callbacks) を使用して行われます。 + +`compute_metrics` 関数を [`~transformers.KerasMetricCallback`] に渡します。 + +```py +>>> from transformers.keras_callbacks import KerasMetricCallback + +>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set) +``` + +Specify where to push your model and tokenizer in the [`~transformers.PushToHubCallback`]: + +```py +>>> from transformers.keras_callbacks import PushToHubCallback + +>>> push_to_hub_callback = PushToHubCallback( +... output_dir="my_awesome_billsum_model", +... tokenizer=tokenizer, +... ) +``` + +次に、コールバックをまとめてバンドルします。 + +```py +>>> callbacks = [metric_callback, push_to_hub_callback] +``` + +ついに、モデルのトレーニングを開始する準備が整いました。トレーニングおよび検証データセット、エポック数、コールバックを指定して [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) を呼び出し、モデルを微調整します。 + +```py +>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=callbacks) +``` + +トレーニングが完了すると、モデルは自動的にハブにアップロードされ、誰でも使用できるようになります。 + + + + + + +要約用にモデルを微調整する方法のより詳細な例については、対応するセクションを参照してください。 +[PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb) +または [TensorFlow ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization-tf.ipynb)。 + + + +## Inference + +モデルを微調整したので、それを推論に使用できるようになりました。 + +要約したいテキストを考え出します。 T5 の場合、作業中のタスクに応じて入力に接頭辞を付ける必要があります。要約するには、以下に示すように入力にプレフィックスを付ける必要があります。 + +```py +>>> text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. 
And no one making under $400,000 per year will pay a penny more in taxes." +``` + +推論用に微調整されたモデルを試す最も簡単な方法は、それを [`pipeline`] で使用することです。モデルを使用して要約用の `pipeline` をインスタンス化し、テキストをそれに渡します。 + +```py +>>> from transformers import pipeline + +>>> summarizer = pipeline("summarization", model="stevhliu/my_awesome_billsum_model") +>>> summarizer(text) +[{"summary_text": "The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country."}] +``` + +必要に応じて、`pipeline`」の結果を手動で複製することもできます。 + + + +Tokenize the text and return the `input_ids` as PyTorch tensors: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_billsum_model") +>>> inputs = tokenizer(text, return_tensors="pt").input_ids +``` + +[`~transformers.generation_utils.GenerationMixin.generate`] メソッドを使用して要約を作成します。さまざまなテキスト生成戦略と生成を制御するためのパラメーターの詳細については、[Text Generation](../main_classes/text_generation) API を確認してください。 + +```py +>>> from transformers import AutoModelForSeq2SeqLM + +>>> model = AutoModelForSeq2SeqLM.from_pretrained("stevhliu/my_awesome_billsum_model") +>>> outputs = model.generate(inputs, max_new_tokens=100, do_sample=False) +``` + +生成されたトークン ID をデコードしてテキストに戻します。 + +```py +>>> tokenizer.decode(outputs[0], skip_special_tokens=True) +'the inflation reduction act lowers prescription drug costs, health care costs, and energy costs. it's the most aggressive action on tackling the climate crisis in american history. it will ask the ultra-wealthy and corporations to pay their fair share.' +``` + + + +テキストをトークン化し、`input_ids`を TensorFlow テンソルとして返します。 + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_billsum_model") +>>> inputs = tokenizer(text, return_tensors="tf").input_ids +``` + +[`~transformers.generation_tf_utils.TFGenerationMixin.generate`] メソッドを使用して要約を作成します。さまざまなテキスト生成戦略と生成を制御するためのパラメーターの詳細については、[Text Generation](../main_classes/text_generation) API を確認してください。 + +```py +>>> from transformers import TFAutoModelForSeq2SeqLM + +>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("stevhliu/my_awesome_billsum_model") +>>> outputs = model.generate(inputs, max_new_tokens=100, do_sample=False) +``` + +生成されたトークン ID をデコードしてテキストに戻します。 + +```py +>>> tokenizer.decode(outputs[0], skip_special_tokens=True) +'the inflation reduction act lowers prescription drug costs, health care costs, and energy costs. it's the most aggressive action on tackling the climate crisis in american history. it will ask the ultra-wealthy and corporations to pay their fair share.' 
+``` + + diff --git a/docs/source/ja/tasks/text-to-speech.md b/docs/source/ja/tasks/text-to-speech.md new file mode 100644 index 00000000000000..357ec18855149e --- /dev/null +++ b/docs/source/ja/tasks/text-to-speech.md @@ -0,0 +1,638 @@ + + +# Text to speech + +[[open-in-colab]] + +テキスト読み上げ (TTS) は、テキストから自然な音声を作成するタスクです。音声は複数の形式で生成できます。 +言語と複数の話者向け。現在、いくつかのテキスト読み上げモデルが 🤗 Transformers で利用可能です。 +[Bark](../model_doc/bark)、[MMS](../model_doc/mms)、[VITS](../model_doc/vits)、および [SpeechT5](../model_doc/speecht5)。 + +`text-to-audio`パイプライン (またはその別名 - `text-to-speech`) を使用して、音声を簡単に生成できます。 Bark などの一部のモデルは、 +笑い、ため息、泣きなどの非言語コミュニケーションを生成したり、音楽を追加したりするように条件付けすることもできます。 +Bark で`text-to-speech`パイプラインを使用する方法の例を次に示します。 + +```py +>>> from transformers import pipeline + +>>> pipe = pipeline("text-to-speech", model="suno/bark-small") +>>> text = "[clears throat] This is a test ... and I just took a long pause." +>>> output = pipe(text) +``` + +ノートブックで結果の音声を聞くために使用できるコード スニペットを次に示します。 + +```python +>>> from IPython.display import Audio +>>> Audio(output["audio"], rate=output["sampling_rate"]) +``` +Bark およびその他の事前トレーニングされた TTS モデルができることの詳細な例については、次のドキュメントを参照してください。 +[音声コース](https://huggingface.co/learn/audio-course/chapter6/pre-trained_models)。 + +TTS モデルを微調整する場合、現在微調整できるのは SpeechT5 のみです。 SpeechT5 は、次の組み合わせで事前トレーニングされています。 +音声からテキストへのデータとテキストから音声へのデータ。両方のテキストに共有される隠された表現の統一された空間を学習できるようにします。 +そしてスピーチ。これは、同じ事前トレーニング済みモデルをさまざまなタスクに合わせて微調整できることを意味します。さらに、SpeechT5 +X ベクトル スピーカーの埋め込みを通じて複数のスピーカーをサポートします。 + +このガイドの残りの部分では、次の方法を説明します。 + +1. [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) のオランダ語 (`nl`) 言語サブセット上の英語音声で元々トレーニングされた [SpeechT5](../model_doc/speecht5) を微調整します。 データセット。 +2. パイプラインを使用するか直接使用するかの 2 つの方法のいずれかで、洗練されたモデルを推論に使用します。 + +始める前に、必要なライブラリがすべてインストールされていることを確認してください。 + + +```bash +pip install datasets soundfile speechbrain accelerate +``` + +SpeechT5 のすべての機能がまだ正式リリースにマージされていないため、ソースから 🤗Transformers をインストールします。 + +```bash +pip install git+https://github.com/huggingface/transformers.git +``` + + + +このガイドに従うには、GPU が必要です。ノートブックで作業している場合は、次の行を実行して GPU が利用可能かどうかを確認します。 + +```bash +!nvidia-smi +``` + + + +Hugging Face アカウントにログインして、モデルをアップロードしてコミュニティと共有することをお勧めします。プロンプトが表示されたら、トークンを入力してログインします。 + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## Load the dataset + +[VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) は、以下で構成される大規模な多言語音声コーパスです。 +データは 2009 年から 2020 年の欧州議会のイベント記録をソースとしています。 15 件分のラベル付き音声文字起こしデータが含まれています。 +ヨーロッパの言語。このガイドではオランダ語のサブセットを使用していますが、自由に別のサブセットを選択してください。 + +VoxPopuli またはその他の自動音声認識 (ASR) データセットは最適ではない可能性があることに注意してください。 +TTS モデルをトレーニングするためのオプション。過剰なバックグラウンドノイズなど、ASR にとって有益となる機能は次のとおりです。 +通常、TTS では望ましくありません。ただし、最高品質、多言語、マルチスピーカーの TTS データセットを見つけるのは非常に困難な場合があります。 +挑戦的。 + +データをロードしましょう: + +```py +>>> from datasets import load_dataset, Audio + +>>> dataset = load_dataset("facebook/voxpopuli", "nl", split="train") +>>> len(dataset) +20968 +``` + +微調整には 20968 個の例で十分です。 SpeechT5 はオーディオ データのサンプリング レートが 16 kHz であることを想定しているため、 +データセット内の例がこの要件を満たしていることを確認してください。 + + +```py +dataset = dataset.cast_column("audio", Audio(sampling_rate=16000)) +``` + +## Preprocess the data + +使用するモデル チェックポイントを定義し、適切なプロセッサをロードすることから始めましょう。 + + +```py +>>> from transformers import SpeechT5Processor + +>>> checkpoint = "microsoft/speecht5_tts" +>>> processor = SpeechT5Processor.from_pretrained(checkpoint) +``` + +### Text cleanup for SpeechT5 tokenization + + +まずはテキストデータをクリーンアップすることから始めます。テキストを処理するには、プロセッサのトークナイザー部分が必要です。 + +```py +>>> tokenizer = processor.tokenizer 
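+
+>>> # 補足(必須の手順ではありません): このトークナイザーは文字をトークンとして扱うため、
+>>> # 語彙サイズを見ておくと、この後で行う文字チェックの目安になります。
+>>> len(tokenizer.get_vocab())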
+``` + +データセットの例には、`raw_text`機能と `normalized_text`機能が含まれています。テキスト入力としてどの機能を使用するかを決めるときは、 +SpeechT5 トークナイザーには数値のトークンがないことを考慮してください。 `normalized_text`には数字が書かれています +テキストとして出力します。したがって、これはより適切であり、入力テキストとして `normalized_text` を使用することをお勧めします。 + +SpeechT5 は英語でトレーニングされているため、オランダ語のデータセット内の特定の文字を認識しない可能性があります。もし +残っているように、これらの文字は ``トークンに変換されます。ただし、オランダ語では、`à`などの特定の文字は +音節を強調することに慣れています。テキストの意味を保持するために、この文字を通常の`a`に置き換えることができます。 + +サポートされていないトークンを識別するには、`SpeechT5Tokenizer`を使用してデータセット内のすべての一意の文字を抽出します。 +文字をトークンとして扱います。これを行うには、以下を連結する `extract_all_chars` マッピング関数を作成します。 +すべての例からの転写を 1 つの文字列にまとめ、それを文字セットに変換します。 +すべての文字起こしが一度に利用できるように、`dataset.map()`で`b​​atched=True`と`batch_size=-1`を必ず設定してください。 +マッピング機能。 + +```py +>>> def extract_all_chars(batch): +... all_text = " ".join(batch["normalized_text"]) +... vocab = list(set(all_text)) +... return {"vocab": [vocab], "all_text": [all_text]} + + +>>> vocabs = dataset.map( +... extract_all_chars, +... batched=True, +... batch_size=-1, +... keep_in_memory=True, +... remove_columns=dataset.column_names, +... ) + +>>> dataset_vocab = set(vocabs["vocab"][0]) +>>> tokenizer_vocab = {k for k, _ in tokenizer.get_vocab().items()} +``` + +これで、2 つの文字セットができました。1 つはデータセットの語彙を持ち、もう 1 つはトークナイザーの語彙を持ちます。 +データセット内でサポートされていない文字を特定するには、これら 2 つのセットの差分を取ることができます。結果として +set には、データセットにはあるがトークナイザーには含まれていない文字が含まれます。 + +```py +>>> dataset_vocab - tokenizer_vocab +{' ', 'à', 'ç', 'è', 'ë', 'í', 'ï', 'ö', 'ü'} +``` + +前の手順で特定されたサポートされていない文字を処理するには、これらの文字を +有効なトークン。スペースはトークナイザーですでに `▁` に置き換えられているため、個別に処理する必要がないことに注意してください。 + +```py +>>> replacements = [ +... ("à", "a"), +... ("ç", "c"), +... ("è", "e"), +... ("ë", "e"), +... ("í", "i"), +... ("ï", "i"), +... ("ö", "o"), +... ("ü", "u"), +... ] + + +>>> def cleanup_text(inputs): +... for src, dst in replacements: +... inputs["normalized_text"] = inputs["normalized_text"].replace(src, dst) +... return inputs + + +>>> dataset = dataset.map(cleanup_text) +``` + +テキスト内の特殊文字を扱ったので、今度は音声データに焦点を移します。 + +### Speakers + +VoxPopuli データセットには複数の話者の音声が含まれていますが、データセットには何人の話者が含まれているのでしょうか?に +これを決定すると、一意の話者の数と、各話者がデータセットに寄与する例の数を数えることができます。 +データセットには合計 20,968 個の例が含まれており、この情報により、分布をより深く理解できるようになります。 +講演者とデータ内の例。 + +```py +>>> from collections import defaultdict + +>>> speaker_counts = defaultdict(int) + +>>> for speaker_id in dataset["speaker_id"]: +... speaker_counts[speaker_id] += 1 +``` + +ヒストグラムをプロットすると、各話者にどれだけのデータがあるかを把握できます。 + +```py +>>> import matplotlib.pyplot as plt + +>>> plt.figure() +>>> plt.hist(speaker_counts.values(), bins=20) +>>> plt.ylabel("Speakers") +>>> plt.xlabel("Examples") +>>> plt.show() +``` + +
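+ヒストグラムに加えて、しきい値ごとの話者数を数値でも確認できます。以下は `speaker_counts` を使った簡単な集計のスケッチです(しきい値の 100 と 500 は、この後の説明に合わせた目安にすぎません)。
+
+```py
+>>> import numpy as np
+
+>>> counts = np.array(list(speaker_counts.values()))
+>>> print(f"speakers: {len(counts)}")
+>>> print(f"fewer than 100 examples: {(counts < 100).sum()}")
+>>> print(f"more than 500 examples: {(counts > 500).sum()}")
+```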
+ Speakers histogram +
+ +ヒストグラムから、データセット内の話者の約 3 分の 1 の例が 100 未満であることがわかります。 +約 10 人の講演者が 500 以上の例を持っています。トレーニング効率を向上させ、データセットのバランスをとるために、次のことを制限できます。 +100 ~ 400 個の例を含むデータを講演者に提供します。 + +```py +>>> def select_speaker(speaker_id): +... return 100 <= speaker_counts[speaker_id] <= 400 + + +>>> dataset = dataset.filter(select_speaker, input_columns=["speaker_id"]) +``` + +残りのスピーカーの数を確認してみましょう。 + +```py +>>> len(set(dataset["speaker_id"])) +42 +``` + +残りの例がいくつあるか見てみましょう。 + + +```py +>>> len(dataset) +9973 +``` + +約 40 人のユニークな講演者からの 10,000 弱の例が残りますが、これで十分です。 + +例が少ないスピーカーの中には、例が長い場合、実際にはより多くの音声が利用できる場合があることに注意してください。しかし、 +各話者の音声の合計量を決定するには、データセット全体をスキャンする必要があります。 +各オーディオ ファイルのロードとデコードを伴う時間のかかるプロセス。そのため、ここではこのステップをスキップすることにしました。 + +### Speaker embeddings + +TTS モデルが複数のスピーカーを区別できるようにするには、サンプルごとにスピーカーの埋め込みを作成する必要があります。 +スピーカーの埋め込みは、特定のスピーカーの音声特性をキャプチャするモデルへの追加入力です。 +これらのスピーカー埋め込みを生成するには、事前トレーニングされた [spkrec-xvect-voxceleb](https://huggingface.co/speechbrain/spkrec-xvect-voxceleb) を使用します。 +SpeechBrain のモデル。 + +入力オーディオ波形を受け取り、512 要素のベクトルを出力する関数 `create_speaker_embedding()` を作成します。 +対応するスピーカー埋め込みが含まれます。 + + +```py +>>> import os +>>> import torch +>>> from speechbrain.pretrained import EncoderClassifier + +>>> spk_model_name = "speechbrain/spkrec-xvect-voxceleb" + +>>> device = "cuda" if torch.cuda.is_available() else "cpu" +>>> speaker_model = EncoderClassifier.from_hparams( +... source=spk_model_name, +... run_opts={"device": device}, +... savedir=os.path.join("/tmp", spk_model_name), +... ) + + +>>> def create_speaker_embedding(waveform): +... with torch.no_grad(): +... speaker_embeddings = speaker_model.encode_batch(torch.tensor(waveform)) +... speaker_embeddings = torch.nn.functional.normalize(speaker_embeddings, dim=2) +... speaker_embeddings = speaker_embeddings.squeeze().cpu().numpy() +... return speaker_embeddings +``` + +`speechbrain/spkrec-xvect-voxceleb`モデルは、VoxCeleb からの英語音声でトレーニングされたことに注意することが重要です。 +データセットですが、このガイドのトレーニング例はオランダ語です。このモデルは今後も生成されると信じていますが、 +オランダ語のデータセットに適切な話者埋め込みを行っても、この仮定はすべての場合に当てはまらない可能性があります。 + +最適な結果を得るには、最初にターゲット音声で X ベクトル モデルをトレーニングすることをお勧めします。これにより、モデルが確実に +オランダ語に存在する独特の音声特徴をよりよく捉えることができます。 + +### Processing the dataset + +最後に、モデルが期待する形式にデータを処理しましょう。を取り込む `prepare_dataset` 関数を作成します。 +これは 1 つの例であり、`SpeechT5Processor` オブジェクトを使用して入力テキストをトークン化し、ターゲット オーディオをログメル スペクトログラムにロードします。 +また、追加の入力としてスピーカーの埋め込みも追加する必要があります。 + +```py +>>> def prepare_dataset(example): +... audio = example["audio"] + +... example = processor( +... text=example["normalized_text"], +... audio_target=audio["array"], +... sampling_rate=audio["sampling_rate"], +... return_attention_mask=False, +... ) + +... # strip off the batch dimension +... example["labels"] = example["labels"][0] + +... # use SpeechBrain to obtain x-vector +... example["speaker_embeddings"] = create_speaker_embedding(audio["array"]) + +... return example +``` + +単一の例を見て、処理が正しいことを確認します。 + +```py +>>> processed_example = prepare_dataset(dataset[0]) +>>> list(processed_example.keys()) +['input_ids', 'labels', 'stop_labels', 'speaker_embeddings'] +``` +スピーカーのエンベディングは 512 要素のベクトルである必要があります。 + +```py +>>> processed_example["speaker_embeddings"].shape +(512,) +``` + +ラベルは、80 メル ビンを含むログメル スペクトログラムである必要があります。 + +```py +>>> import matplotlib.pyplot as plt + +>>> plt.figure() +>>> plt.imshow(processed_example["labels"].T) +>>> plt.show() +``` + +
+ Log-mel spectrogram with 80 mel bins +
+ +補足: このスペクトログラムがわかりにくいと感じる場合は、低周波を配置する規則に慣れていることが原因である可能性があります。 +プロットの下部に高周波、上部に高周波が表示されます。ただし、matplotlib ライブラリを使用してスペクトログラムを画像としてプロットする場合、 +Y 軸が反転され、スペクトログラムが上下逆に表示されます。 + +次に、処理関数をデータセット全体に適用します。これには 5 ~ 10 分かかります。 + +```py +>>> dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names) +``` + +データセット内の一部の例が、モデルが処理できる最大入力長 (600 トークン) を超えていることを示す警告が表示されます。 +それらの例をデータセットから削除します。ここではさらに進んで、より大きなバッチ サイズを可能にするために、200 トークンを超えるものはすべて削除します。 + +```py +>>> def is_not_too_long(input_ids): +... input_length = len(input_ids) +... return input_length < 200 + + +>>> dataset = dataset.filter(is_not_too_long, input_columns=["input_ids"]) +>>> len(dataset) +8259 +``` + +次に、基本的なトレーニング/テスト分割を作成します。 + +```py +>>> dataset = dataset.train_test_split(test_size=0.1) +``` + +### Data collator + +複数の例を 1 つのバッチに結合するには、カスタム データ照合器を定義する必要があります。このコレーターは、短いシーケンスをパディングで埋め込みます。 +トークンを使用して、すべての例が同じ長さになるようにします。スペクトログラム ラベルの場合、埋め込まれた部分は特別な値 `-100` に置き換えられます。この特別な価値は +スペクトログラム損失を計算するときに、スペクトログラムのその部分を無視するようにモデルに指示します。 + +```py +>>> from dataclasses import dataclass +>>> from typing import Any, Dict, List, Union + + +>>> @dataclass +... class TTSDataCollatorWithPadding: +... processor: Any + +... def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]: +... input_ids = [{"input_ids": feature["input_ids"]} for feature in features] +... label_features = [{"input_values": feature["labels"]} for feature in features] +... speaker_features = [feature["speaker_embeddings"] for feature in features] + +... # collate the inputs and targets into a batch +... batch = processor.pad(input_ids=input_ids, labels=label_features, return_tensors="pt") + +... # replace padding with -100 to ignore loss correctly +... batch["labels"] = batch["labels"].masked_fill(batch.decoder_attention_mask.unsqueeze(-1).ne(1), -100) + +... # not used during fine-tuning +... del batch["decoder_attention_mask"] + +... # round down target lengths to multiple of reduction factor +... if model.config.reduction_factor > 1: +... target_lengths = torch.tensor([len(feature["input_values"]) for feature in label_features]) +... target_lengths = target_lengths.new( +... [length - length % model.config.reduction_factor for length in target_lengths] +... ) +... max_length = max(target_lengths) +... batch["labels"] = batch["labels"][:, :max_length] + +... # also add in the speaker embeddings +... batch["speaker_embeddings"] = torch.tensor(speaker_features) + +... return batch +``` + +SpeechT5 では、モデルのデコーダ部分への入力が 2 分の 1 に削減されます。つまり、すべてのデータが破棄されます。 +ターゲット シーケンスからの他のタイムステップ。次に、デコーダは 2 倍の長さのシーケンスを予測します。オリジナル以来 +ターゲット シーケンスの長さが奇数である可能性がある場合、データ照合機能はバッチの最大長を切り捨てて、 +2の倍数。 + +```py +>>> data_collator = TTSDataCollatorWithPadding(processor=processor) +``` + +## Train the model + +プロセッサのロードに使用したのと同じチェックポイントから事前トレーニングされたモデルをロードします。 + +```py +>>> from transformers import SpeechT5ForTextToSpeech + +>>> model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint) +``` + +`use_cache=True`オプションは、勾配チェックポイントと互換性がありません。トレーニングのために無効にします。 + +```py +>>> model.config.use_cache = False +``` + +トレーニング引数を定義します。ここでは、トレーニング プロセス中に評価メトリクスを計算していません。代わりに、 +損失だけを見てください。 + +```python +>>> from transformers import Seq2SeqTrainingArguments + +>>> training_args = Seq2SeqTrainingArguments( +... output_dir="speecht5_finetuned_voxpopuli_nl", # change to a repo name of your choice +... per_device_train_batch_size=4, +... gradient_accumulation_steps=8, +... learning_rate=1e-5, +... warmup_steps=500, +... max_steps=4000, +... 
gradient_checkpointing=True, +... fp16=True, +... evaluation_strategy="steps", +... per_device_eval_batch_size=2, +... save_steps=1000, +... eval_steps=1000, +... logging_steps=25, +... report_to=["tensorboard"], +... load_best_model_at_end=True, +... greater_is_better=False, +... label_names=["labels"], +... push_to_hub=True, +... ) +``` + +`Trainer`オブジェクトをインスタンス化し、モデル、データセット、データ照合器をそれに渡します。 + +```py +>>> from transformers import Seq2SeqTrainer + +>>> trainer = Seq2SeqTrainer( +... args=training_args, +... model=model, +... train_dataset=dataset["train"], +... eval_dataset=dataset["test"], +... data_collator=data_collator, +... tokenizer=processor, +... ) +``` +これで、トレーニングを開始する準備が整いました。トレーニングには数時間かかります。 GPU に応じて、 +トレーニングを開始するときに、CUDA の「メモリ不足」エラーが発生する可能性があります。この場合、減らすことができます +`per_device_train_batch_size`を 2 倍に増分し、`gradient_accumulation_steps`を 2 倍に増やして補正します。 + +```py +>>> trainer.train() +``` +パイプラインでチェックポイントを使用できるようにするには、必ずプロセッサをチェックポイントとともに保存してください。 + +```py +>>> processor.save_pretrained("YOUR_ACCOUNT_NAME/speecht5_finetuned_voxpopuli_nl") +``` + +最終モデルを 🤗 ハブにプッシュします。 + +```py +>>> trainer.push_to_hub() +``` + +## Inference + +### Inference with a pipeline + +モデルを微調整したので、それを推論に使用できるようになりました。 +まず、対応するパイプラインでそれを使用する方法を見てみましょう。 `"text-to-speech"` パイプラインを作成しましょう +チェックポイント: + +```py +>>> from transformers import pipeline + +>>> pipe = pipeline("text-to-speech", model="YOUR_ACCOUNT_NAME/speecht5_finetuned_voxpopuli_nl") +``` + +ナレーションを希望するオランダ語のテキストを選択してください。例: + +```py +>>> text = "hallo allemaal, ik praat nederlands. groetjes aan iedereen!" +``` + +パイプラインで SpeechT5 を使用するには、スピーカーの埋め込みが必要です。テスト データセットの例から取得してみましょう。 + +```py +>>> example = dataset["test"][304] +>>> speaker_embeddings = torch.tensor(example["speaker_embeddings"]).unsqueeze(0) +``` + +これで、テキストとスピーカーの埋め込みをパイプラインに渡すことができ、残りはパイプラインが処理します。 + + +```py +>>> forward_params = {"speaker_embeddings": speaker_embeddings} +>>> output = pipe(text, forward_params=forward_params) +>>> output +{'audio': array([-6.82714235e-05, -4.26525949e-04, 1.06134125e-04, ..., + -1.22392643e-03, -7.76011671e-04, 3.29112721e-04], dtype=float32), + 'sampling_rate': 16000} +``` + +その後、結果を聞くことができます。 + + +```py +>>> from IPython.display import Audio +>>> Audio(output['audio'], rate=output['sampling_rate']) +``` + +### Run inference manually + +パイプラインを使用しなくても同じ推論結果を得ることができますが、より多くの手順が必要になります。 + +🤗 ハブからモデルをロードします。 + + +```py +>>> model = SpeechT5ForTextToSpeech.from_pretrained("YOUR_ACCOUNT/speecht5_finetuned_voxpopuli_nl") +``` + +テスト データセットから例を選択して、スピーカーの埋め込みを取得します。 + +```py +>>> example = dataset["test"][304] +>>> speaker_embeddings = torch.tensor(example["speaker_embeddings"]).unsqueeze(0) +``` + +入力テキストを定義し、トークン化します。 + +```py +>>> text = "hallo allemaal, ik praat nederlands. groetjes aan iedereen!" +>>> inputs = processor(text=text, return_tensors="pt") +``` + +モデルを使用してスペクトログラムを作成します。 + +```py +>>> spectrogram = model.generate_speech(inputs["input_ids"], speaker_embeddings) +``` + +次のことを行う場合は、スペクトログラムを視覚化します。 + +```py +>>> plt.figure() +>>> plt.imshow(spectrogram.T) +>>> plt.show() +``` + +
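+スペクトログラムを波形に変換するには、HiFi-GAN ボコーダーが必要です。まだロードしていない場合は、SpeechT5 用に公開されている事前トレーニング済みボコーダーを次のようにロードできます(ここでは `microsoft/speecht5_hifigan` チェックポイントを使う想定です)。
+
+```py
+>>> from transformers import SpeechT5HifiGan
+
+>>> vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
+```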
+ Generated log-mel spectrogram +
+ +最後に、ボコーダーを使用してスペクトログラムをサウンドに変換します。 + +```py +>>> with torch.no_grad(): +... speech = vocoder(spectrogram) + +>>> from IPython.display import Audio + +>>> Audio(speech.numpy(), rate=16000) +``` + +私たちの経験では、このモデルから満足のいく結果を得るのは難しい場合があります。スピーカーの品質 +埋め込みは重要な要素であるようです。 SpeechT5 は英語の x ベクトルで事前トレーニングされているため、最高のパフォーマンスを発揮します +英語スピーカーの埋め込みを使用する場合。合成音声の音質が悪い場合は、別のスピーカー埋め込みを使用してみてください。 + +トレーニング期間を長くすると、結果の質も向上する可能性があります。それでも、そのスピーチは明らかに英語ではなくオランダ語です。 +話者の音声特性をキャプチャします (例の元の音声と比較)。 +もう 1 つ実験すべきことは、モデルの構成です。たとえば、`config.reduction_factor = 1`を使用してみてください。 +これにより結果が改善されるかどうかを確認してください。 + +最後に、倫理的配慮を考慮することが不可欠です。 TTS テクノロジーには数多くの有用な用途がありますが、 +また、知らないうちに誰かの声を偽装するなど、悪意のある目的に使用される可能性もあります。お願いします +TTS は賢明かつ責任を持って使用してください。 diff --git a/docs/source/ja/tasks/token_classification.md b/docs/source/ja/tasks/token_classification.md new file mode 100644 index 00000000000000..2b650c4a844d84 --- /dev/null +++ b/docs/source/ja/tasks/token_classification.md @@ -0,0 +1,565 @@ + + +# Token classification + +[[open-in-colab]] + + + +トークン分類では、文内の個々のトークンにラベルを割り当てます。最も一般的なトークン分類タスクの 1 つは、固有表現認識 (NER) です。 NER は、人、場所、組織など、文内の各エンティティのラベルを見つけようとします。 + +このガイドでは、次の方法を説明します。 + +1. [WNUT 17](https://huggingface.co/datasets/wnut_17) データセットで [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) を微調整して、新しいエンティティを検出します。 +2. 微調整されたモデルを推論に使用します。 + + +このチュートリアルで説明するタスクは、次のモデル アーキテクチャでサポートされています。 + + +[ALBERT](../model_doc/albert), [BERT](../model_doc/bert), [BigBird](../model_doc/big_bird), [BioGpt](../model_doc/biogpt), [BLOOM](../model_doc/bloom), [BROS](../model_doc/bros), [CamemBERT](../model_doc/camembert), [CANINE](../model_doc/canine), [ConvBERT](../model_doc/convbert), [Data2VecText](../model_doc/data2vec-text), [DeBERTa](../model_doc/deberta), [DeBERTa-v2](../model_doc/deberta-v2), [DistilBERT](../model_doc/distilbert), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [ErnieM](../model_doc/ernie_m), [ESM](../model_doc/esm), [Falcon](../model_doc/falcon), [FlauBERT](../model_doc/flaubert), [FNet](../model_doc/fnet), [Funnel Transformer](../model_doc/funnel), [GPT-Sw3](../model_doc/gpt-sw3), [OpenAI GPT-2](../model_doc/gpt2), [GPTBigCode](../model_doc/gpt_bigcode), [GPT Neo](../model_doc/gpt_neo), [GPT NeoX](../model_doc/gpt_neox), [I-BERT](../model_doc/ibert), [LayoutLM](../model_doc/layoutlm), [LayoutLMv2](../model_doc/layoutlmv2), [LayoutLMv3](../model_doc/layoutlmv3), [LiLT](../model_doc/lilt), [Longformer](../model_doc/longformer), [LUKE](../model_doc/luke), [MarkupLM](../model_doc/markuplm), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [MobileBERT](../model_doc/mobilebert), [MPNet](../model_doc/mpnet), [MPT](../model_doc/mpt), [MRA](../model_doc/mra), [Nezha](../model_doc/nezha), [Nyströmformer](../model_doc/nystromformer), [QDQBert](../model_doc/qdqbert), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [SqueezeBERT](../model_doc/squeezebert), [XLM](../model_doc/xlm), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod), [YOSO](../model_doc/yoso) + + + + + +始める前に、必要なライブラリがすべてインストールされていることを確認してください。 + +```bash +pip install transformers datasets evaluate seqeval +``` +モデルをアップロードしてコミュニティと共有できるように、Hugging Face アカウントにログインすることをお勧めします。プロンプトが表示されたら、トークンを入力してログインします。 + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() 
+``` + +## Load WNUT 17 dataset + +まず、🤗 データセット ライブラリから WNUT 17 データセットをロードします。 + +```py +>>> from datasets import load_dataset + +>>> wnut = load_dataset("wnut_17") +``` + +次に、例を見てみましょう。 + +```py +>>> wnut["train"][0] +{'id': '0', + 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0], + 'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.'] +} +``` + +`ner_tags`内の各数字はエンティティを表します。数値をラベル名に変換して、エンティティが何であるかを調べます。 + +```py +>>> label_list = wnut["train"].features[f"ner_tags"].feature.names +>>> label_list +[ + "O", + "B-corporation", + "I-corporation", + "B-creative-work", + "I-creative-work", + "B-group", + "I-group", + "B-location", + "I-location", + "B-person", + "I-person", + "B-product", + "I-product", +] +``` +各 `ner_tag` の前に付く文字は、エンティティのトークンの位置を示します。 + +- `B-` はエンティティの始まりを示します。 +- `I-` は、トークンが同じエンティティ内に含まれていることを示します (たとえば、`State` トークンは次のようなエンティティの一部です) + `Empire State Building`)。 +- `0` は、トークンがどのエンティティにも対応しないことを示します。 + +## Preprocess + + + +次のステップでは、DistilBERT トークナイザーをロードして`tokens`フィールドを前処理します。 + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") +``` + +上の `tokens`フィールドの例で見たように、入力はすでにトークン化されているようです。しかし、実際には入力はまだトークン化されていないため、単語をサブワードにトークン化するには`is_split_into_words=True` を設定する必要があります。例えば: + +```py +>>> example = wnut["train"][0] +>>> tokenized_input = tokenizer(example["tokens"], is_split_into_words=True) +>>> tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"]) +>>> tokens +['[CLS]', '@', 'paul', '##walk', 'it', "'", 's', 'the', 'view', 'from', 'where', 'i', "'", 'm', 'living', 'for', 'two', 'weeks', '.', 'empire', 'state', 'building', '=', 'es', '##b', '.', 'pretty', 'bad', 'storm', 'here', 'last', 'evening', '.', '[SEP]'] +``` + +ただし、これによりいくつかの特別なトークン `[CLS]` と `[SEP]` が追加され、サブワードのトークン化により入力とラベルの間に不一致が生じます。 1 つのラベルに対応する 1 つの単語を 2 つのサブワードに分割できるようになりました。次の方法でトークンとラベルを再調整する必要があります。 + +1. [`word_ids`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.BatchEncoding.word_ids) メソッドを使用して、すべてのトークンを対応する単語にマッピングします。 +2. 特別なトークン `[CLS]` と `[SEP]` にラベル `-100` を割り当て、それらが PyTorch 損失関数によって無視されるようにします ([CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html))。 +3. 特定の単語の最初のトークンのみにラベルを付けます。同じ単語の他のサブトークンに `-100`を割り当てます。 + +トークンとラベルを再調整し、シーケンスを DistilBERT の最大入力長以下に切り詰める関数を作成する方法を次に示します。 + +```py +>>> def tokenize_and_align_labels(examples): +... tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True) + +... labels = [] +... for i, label in enumerate(examples[f"ner_tags"]): +... word_ids = tokenized_inputs.word_ids(batch_index=i) # Map tokens to their respective word. +... previous_word_idx = None +... label_ids = [] +... for word_idx in word_ids: # Set the special tokens to -100. +... if word_idx is None: +... label_ids.append(-100) +... elif word_idx != previous_word_idx: # Only label the first token of a given word. +... label_ids.append(label[word_idx]) +... else: +... label_ids.append(-100) +... previous_word_idx = word_idx +... labels.append(label_ids) + +... tokenized_inputs["labels"] = labels +... 
return tokenized_inputs +``` + +データセット全体に前処理関数を適用するには、🤗 Datasets [`~datasets.Dataset.map`] 関数を使用します。 `batched=True` を設定してデータセットの複数の要素を一度に処理することで、`map` 関数を高速化できます。 + +```py +>>> tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True) +``` + +次に、[`DataCollat​​orWithPadding`] を使用してサンプルのバッチを作成します。データセット全体を最大長までパディングするのではなく、照合中にバッチ内の最長の長さまで文を *動的にパディング* する方が効率的です。 + + + + +```py +>>> from transformers import DataCollatorForTokenClassification + +>>> data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer) +``` + + + +```py +>>> from transformers import DataCollatorForTokenClassification + +>>> data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer, return_tensors="tf") +``` + + + +## Evaluate + +トレーニング中にメトリクスを含めると、多くの場合、モデルのパフォーマンスを評価するのに役立ちます。 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) ライブラリを使用して、評価メソッドをすばやくロードできます。このタスクでは、[seqeval](https://huggingface.co/spaces/evaluate-metric/seqeval) フレームワークを読み込みます (🤗 Evaluate [クイック ツアー](https://huggingface.co/docs/evaluate/a_quick_tour) を参照してください) ) メトリクスの読み込みと計算の方法について詳しくは、こちらをご覧ください)。 Seqeval は実際に、精度、再現率、F1、精度などのいくつかのスコアを生成します。 + +```py +>>> import evaluate + +>>> seqeval = evaluate.load("seqeval") +``` + +まず NER ラベルを取得してから、真の予測と真のラベルを [`~evaluate.EvaluationModule.compute`] に渡してスコアを計算する関数を作成します。 + +```py +>>> import numpy as np + +>>> labels = [label_list[i] for i in example[f"ner_tags"]] + + +>>> def compute_metrics(p): +... predictions, labels = p +... predictions = np.argmax(predictions, axis=2) + +... true_predictions = [ +... [label_list[p] for (p, l) in zip(prediction, label) if l != -100] +... for prediction, label in zip(predictions, labels) +... ] +... true_labels = [ +... [label_list[l] for (p, l) in zip(prediction, label) if l != -100] +... for prediction, label in zip(predictions, labels) +... ] + +... results = seqeval.compute(predictions=true_predictions, references=true_labels) +... return { +... "precision": results["overall_precision"], +... "recall": results["overall_recall"], +... "f1": results["overall_f1"], +... "accuracy": results["overall_accuracy"], +... } +``` + +これで`compute_metrics`関数の準備が整いました。トレーニングをセットアップするときにこの関数に戻ります。 + +## Train + +モデルのトレーニングを開始する前に、`id2label`と`label2id`を使用して、予想される ID とそのラベルのマップを作成します。 +```py +>>> id2label = { +... 0: "O", +... 1: "B-corporation", +... 2: "I-corporation", +... 3: "B-creative-work", +... 4: "I-creative-work", +... 5: "B-group", +... 6: "I-group", +... 7: "B-location", +... 8: "I-location", +... 9: "B-person", +... 10: "I-person", +... 11: "B-product", +... 12: "I-product", +... } +>>> label2id = { +... "O": 0, +... "B-corporation": 1, +... "I-corporation": 2, +... "B-creative-work": 3, +... "I-creative-work": 4, +... "B-group": 5, +... "I-group": 6, +... "B-location": 7, +... "I-location": 8, +... "B-person": 9, +... "I-person": 10, +... "B-product": 11, +... "I-product": 12, +... } +``` + + + + + +[`Trainer`] を使用したモデルの微調整に慣れていない場合は、[ここ](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 + + + +これでモデルのトレーニングを開始する準備が整いました。 [`AutoModelForTokenClassification`] を使用して、予期されるラベルの数とラベル マッピングを指定して DistilBERT を読み込みます。 + +```py +>>> from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer + +>>> model = AutoModelForTokenClassification.from_pretrained( +... "distilbert/distilbert-base-uncased", num_labels=13, id2label=id2label, label2id=label2id +... ) +``` + +この時点で残っているステップは 3 つだけです。 + +1. 
[`TrainingArguments`] でトレーニング ハイパーパラメータを定義します。唯一の必須パラメータは、モデルの保存場所を指定する `output_dir` です。 `push_to_hub=True`を設定して、このモデルをハブにプッシュします (モデルをアップロードするには、Hugging Face にサインインする必要があります)。各エポックの終了時に、[`Trainer`] は連続スコアを評価し、トレーニング チェックポイントを保存します。 +2. トレーニング引数を、モデル、データセット、トークナイザー、データ照合器、および `compute_metrics` 関数とともに [`Trainer`] に渡します。 +3. [`~Trainer.train`] を呼び出してモデルを微調整します。 + +```py +>>> training_args = TrainingArguments( +... output_dir="my_awesome_wnut_model", +... learning_rate=2e-5, +... per_device_train_batch_size=16, +... per_device_eval_batch_size=16, +... num_train_epochs=2, +... weight_decay=0.01, +... evaluation_strategy="epoch", +... save_strategy="epoch", +... load_best_model_at_end=True, +... push_to_hub=True, +... ) + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=tokenized_wnut["train"], +... eval_dataset=tokenized_wnut["test"], +... tokenizer=tokenizer, +... data_collator=data_collator, +... compute_metrics=compute_metrics, +... ) + +>>> trainer.train() +``` + +トレーニングが完了したら、 [`~transformers.Trainer.push_to_hub`] メソッドを使用してモデルをハブに共有し、誰もがモデルを使用できるようにします。 + +```py +>>> trainer.push_to_hub() +``` + + + + +Keras を使用したモデルの微調整に慣れていない場合は、[こちら](../training#train-a-tensorflow-model-with-keras) の基本的なチュートリアルをご覧ください。 + + +TensorFlow でモデルを微調整するには、オプティマイザー関数、学習率スケジュール、およびいくつかのトレーニング ハイパーパラメーターをセットアップすることから始めます。 + +```py +>>> from transformers import create_optimizer + +>>> batch_size = 16 +>>> num_train_epochs = 3 +>>> num_train_steps = (len(tokenized_wnut["train"]) // batch_size) * num_train_epochs +>>> optimizer, lr_schedule = create_optimizer( +... init_lr=2e-5, +... num_train_steps=num_train_steps, +... weight_decay_rate=0.01, +... num_warmup_steps=0, +... ) +``` +次に、[`TFAutoModelForTokenClassification`] を使用して、予期されるラベルの数とラベル マッピングを指定して DistilBERT をロードできます。 + +```py +>>> from transformers import TFAutoModelForTokenClassification + +>>> model = TFAutoModelForTokenClassification.from_pretrained( +... "distilbert/distilbert-base-uncased", num_labels=13, id2label=id2label, label2id=label2id +... ) +``` +[`~transformers.TFPreTrainedModel.prepare_tf_dataset`] を使用して、データセットを `tf.data.Dataset` 形式に変換します。 + +```py +>>> tf_train_set = model.prepare_tf_dataset( +... tokenized_wnut["train"], +... shuffle=True, +... batch_size=16, +... collate_fn=data_collator, +... ) + +>>> tf_validation_set = model.prepare_tf_dataset( +... tokenized_wnut["validation"], +... shuffle=False, +... batch_size=16, +... collate_fn=data_collator, +... ) +``` + +[`compile`](https://keras.io/api/models/model_training_apis/#compile-method) を使用してトレーニング用のモデルを設定します。 Transformers モデルにはすべてデフォルトのタスク関連の損失関数があるため、次の場合を除き、損失関数を指定する必要はないことに注意してください。 + +```py +>>> import tensorflow as tf + +>>> model.compile(optimizer=optimizer) # No loss argument! +``` + +トレーニングを開始する前にセットアップする最後の 2 つのことは、予測から連続スコアを計算することと、モデルをハブにプッシュする方法を提供することです。どちらも [Keras コールバック](../main_classes/keras_callbacks) を使用して行われます。 + +`compute_metrics` 関数を [`~transformers.KerasMetricCallback`] に渡します。 + + +```py +>>> from transformers.keras_callbacks import KerasMetricCallback + +>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set) +``` + +[`~transformers.PushToHubCallback`] でモデルとトークナイザーをプッシュする場所を指定します。 + +```py +>>> from transformers.keras_callbacks import PushToHubCallback + +>>> push_to_hub_callback = PushToHubCallback( +... output_dir="my_awesome_wnut_model", +... tokenizer=tokenizer, +... 
) +``` + +次に、コールバックをまとめてバンドルします。 + +```py +>>> callbacks = [metric_callback, push_to_hub_callback] +``` + +ついに、モデルのトレーニングを開始する準備が整いました。トレーニングおよび検証データセット、エポック数、コールバックを指定して [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) を呼び出し、モデルを微調整します。 + + +```py +>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=callbacks) +``` + +トレーニングが完了すると、モデルは自動的にハブにアップロードされ、誰でも使用できるようになります。 + + + + + + +トークン分類のモデルを微調整する方法のより詳細な例については、対応するセクションを参照してください。 +[PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb) +または [TensorFlow ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb)。 + + + + +## Inference + +モデルを微調整したので、それを推論に使用できるようになりました。 + +推論を実行したいテキストをいくつか取得します。 + +```py +>>> text = "The Golden State Warriors are an American professional basketball team based in San Francisco." +``` + +推論用に微調整されたモデルを試す最も簡単な方法は、それを [`pipeline`] で使用することです。モデルを使用して NER の`pipeline`をインスタンス化し、テキストをそれに渡します。 + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline("ner", model="stevhliu/my_awesome_wnut_model") +>>> classifier(text) +[{'entity': 'B-location', + 'score': 0.42658573, + 'index': 2, + 'word': 'golden', + 'start': 4, + 'end': 10}, + {'entity': 'I-location', + 'score': 0.35856336, + 'index': 3, + 'word': 'state', + 'start': 11, + 'end': 16}, + {'entity': 'B-group', + 'score': 0.3064001, + 'index': 4, + 'word': 'warriors', + 'start': 17, + 'end': 25}, + {'entity': 'B-location', + 'score': 0.65523505, + 'index': 13, + 'word': 'san', + 'start': 80, + 'end': 83}, + {'entity': 'B-location', + 'score': 0.4668663, + 'index': 14, + 'word': 'francisco', + 'start': 84, + 'end': 93}] +``` + +必要に応じて、`pipeline`の結果を手動で複製することもできます。 + + + +テキストをトークン化して PyTorch テンソルを返します。 + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_wnut_model") +>>> inputs = tokenizer(text, return_tensors="pt") +``` + +入力をモデルに渡し、`logits`を返します。 + +```py +>>> from transformers import AutoModelForTokenClassification + +>>> model = AutoModelForTokenClassification.from_pretrained("stevhliu/my_awesome_wnut_model") +>>> with torch.no_grad(): +... 
logits = model(**inputs).logits +``` + +最も高い確率でクラスを取得し、モデルの `id2label` マッピングを使用してそれをテキスト ラベルに変換します。 + +```py +>>> predictions = torch.argmax(logits, dim=2) +>>> predicted_token_class = [model.config.id2label[t.item()] for t in predictions[0]] +>>> predicted_token_class +['O', + 'O', + 'B-location', + 'I-location', + 'B-group', + 'O', + 'O', + 'O', + 'O', + 'O', + 'O', + 'O', + 'O', + 'B-location', + 'B-location', + 'O', + 'O'] +``` + + + + +テキストをトークン化し、TensorFlow テンソルを返します。 + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_wnut_model") +>>> inputs = tokenizer(text, return_tensors="tf") +``` + +入力をモデルに渡し、`logits`を返します。 + + +```py +>>> from transformers import TFAutoModelForTokenClassification + +>>> model = TFAutoModelForTokenClassification.from_pretrained("stevhliu/my_awesome_wnut_model") +>>> logits = model(**inputs).logits +``` + +最も高い確率でクラスを取得し、モデルの `id2label` マッピングを使用してそれをテキスト ラベルに変換します。 + +```py +>>> predicted_token_class_ids = tf.math.argmax(logits, axis=-1) +>>> predicted_token_class = [model.config.id2label[t] for t in predicted_token_class_ids[0].numpy().tolist()] +>>> predicted_token_class +['O', + 'O', + 'B-location', + 'I-location', + 'B-group', + 'O', + 'O', + 'O', + 'O', + 'O', + 'O', + 'O', + 'O', + 'B-location', + 'B-location', + 'O', + 'O'] +``` + + diff --git a/docs/source/ja/tasks/translation.md b/docs/source/ja/tasks/translation.md new file mode 100644 index 00000000000000..fb2c89f3856d49 --- /dev/null +++ b/docs/source/ja/tasks/translation.md @@ -0,0 +1,417 @@ + + +# Translation + +[[open-in-colab]] + + + +翻訳では、一連のテキストをある言語から別の言語に変換します。これは、シーケンス間問題として定式化できるいくつかのタスクの 1 つであり、翻訳や要約など、入力から何らかの出力を返すための強力なフレームワークです。翻訳システムは通常、異なる言語のテキスト間の翻訳に使用されますが、音声、またはテキストから音声への変換や音声からテキストへの変換など、音声間の組み合わせにも使用できます。 + +このガイドでは、次の方法を説明します。 + +1. [OPUS Books](https://huggingface.co/datasets/opus_books) データセットの英語-フランス語サブセットの [T5](https://huggingface.co/google-t5/t5-small) を微調整して、英語のテキストを次の形式に翻訳します。フランス語。 +2. 
微調整されたモデルを推論に使用します。 + + +このチュートリアルで説明するタスクは、次のモデル アーキテクチャでサポートされています。 + + + +[BART](../model_doc/bart), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [Encoder decoder](../model_doc/encoder-decoder), [FairSeq Machine-Translation](../model_doc/fsmt), [GPTSAN-japanese](../model_doc/gptsan-japanese), [LED](../model_doc/led), [LongT5](../model_doc/longt5), [M2M100](../model_doc/m2m_100), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [NLLB](../model_doc/nllb), [NLLB-MOE](../model_doc/nllb-moe), [Pegasus](../model_doc/pegasus), [PEGASUS-X](../model_doc/pegasus_x), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [SwitchTransformers](../model_doc/switch_transformers), [T5](../model_doc/t5), [UMT5](../model_doc/umt5), [XLM-ProphetNet](../model_doc/xlm-prophetnet) + + + + + +始める前に、必要なライブラリがすべてインストールされていることを確認してください。 + +```bash +pip install transformers datasets evaluate sacrebleu +``` + +モデルをアップロードしてコミュニティと共有できるように、Hugging Face アカウントにログインすることをお勧めします。プロンプトが表示されたら、トークンを入力してログインします。 + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` +## Load OPUS Books dataset + +まず、🤗 データセット ライブラリから [OPUS Books](https://huggingface.co/datasets/opus_books) データセットの英語とフランス語のサブセットを読み込みます。 + +```py +>>> from datasets import load_dataset + +>>> books = load_dataset("opus_books", "en-fr") +``` + +[`~datasets.Dataset.train_test_split`] メソッドを使用して、データセットをトレイン セットとテスト セットに分割します。 + + +```py +>>> books = books["train"].train_test_split(test_size=0.2) +``` + +次に、例を見てみましょう。 + +```py +>>> books["train"][0] +{'id': '90560', + 'translation': {'en': 'But this lofty plateau measured only a few fathoms, and soon we reentered Our Element.', + 'fr': 'Mais ce plateau élevé ne mesurait que quelques toises, et bientôt nous fûmes rentrés dans notre élément.'}} +``` + +`translation`: テキストの英語とフランス語の翻訳。 + +## Preprocess + + + +次のステップでは、T5 トークナイザーをロードして英語とフランス語の言語ペアを処理します。 + +```py +>>> from transformers import AutoTokenizer + +>>> checkpoint = "google-t5/t5-small" +>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint) +``` + +作成する前処理関数は次のことを行う必要があります。 + +1. T5 がこれが翻訳タスクであることを認識できるように、入力の前にプロンプ​​トを付けます。複数の NLP タスクが可能な一部のモデルでは、特定のタスクのプロンプトが必要です。 +2. 英語の語彙で事前トレーニングされたトークナイザーを使用してフランス語のテキストをトークン化することはできないため、入力 (英語) とターゲット (フランス語) を別々にトークン化します。 +3. `max_length`パラメータで設定された最大長を超えないようにシーケンスを切り詰めます。 +```py +>>> source_lang = "en" +>>> target_lang = "fr" +>>> prefix = "translate English to French: " + + +>>> def preprocess_function(examples): +... inputs = [prefix + example[source_lang] for example in examples["translation"]] +... targets = [example[target_lang] for example in examples["translation"]] +... model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True) +... 
return model_inputs +``` + +データセット全体に前処理関数を適用するには、🤗 Datasets [`~datasets.Dataset.map`] メソッドを使用します。 `batched=True` を設定してデータセットの複数の要素を一度に処理することで、`map` 関数を高速化できます。 + +```py +>>> tokenized_books = books.map(preprocess_function, batched=True) +``` + +次に、[`DataCollat​​orForSeq2Seq`] を使用してサンプルのバッチを作成します。データセット全体を最大長までパディングするのではなく、照合中にバッチ内の最長の長さまで文を *動的にパディング* する方が効率的です。 + + + + +```py +>>> from transformers import DataCollatorForSeq2Seq + +>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint) +``` + + + +```py +>>> from transformers import DataCollatorForSeq2Seq + +>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint, return_tensors="tf") +``` + + + +## Evaluate + +トレーニング中にメトリクスを含めると、多くの場合、モデルのパフォーマンスを評価するのに役立ちます。 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) ライブラリを使用して、評価メソッドをすばやくロードできます。このタスクでは、[SacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu) メトリクスをロードします (🤗 Evaluate [クイック ツアー](https://huggingface.co/docs/evaluate/a_quick_tour) を参照してください) ) メトリクスの読み込みと計算方法の詳細については、次を参照してください)。 + +```py +>>> import evaluate + +>>> metric = evaluate.load("sacrebleu") +``` + +次に、予測とラベルを [`~evaluate.EvaluationModule.compute`] に渡して SacreBLEU スコアを計算する関数を作成します。 +```py +>>> import numpy as np + + +>>> def postprocess_text(preds, labels): +... preds = [pred.strip() for pred in preds] +... labels = [[label.strip()] for label in labels] + +... return preds, labels + + +>>> def compute_metrics(eval_preds): +... preds, labels = eval_preds +... if isinstance(preds, tuple): +... preds = preds[0] +... decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True) + +... labels = np.where(labels != -100, labels, tokenizer.pad_token_id) +... decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True) + +... decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels) + +... result = metric.compute(predictions=decoded_preds, references=decoded_labels) +... result = {"bleu": result["score"]} + +... prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds] +... result["gen_len"] = np.mean(prediction_lens) +... result = {k: round(v, 4) for k, v in result.items()} +... return result +``` +これで`compute_metrics`関数の準備が整いました。トレーニングをセットアップするときにこの関数に戻ります。 + +## Train + + + + + +[`Trainer`] を使用したモデルの微調整に慣れていない場合は、[ここ](../training#train-with-pytorch-trainer) の基本的なチュートリアルをご覧ください。 + + + +これでモデルのトレーニングを開始する準備が整いました。 [`AutoModelForSeq2SeqLM`] を使用して T5 をロードします。 + +```py +>>> from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer + +>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint) +``` + +この時点で残っているステップは 3 つだけです。 + +1. [`Seq2SeqTrainingArguments`] でトレーニング ハイパーパラメータを定義します。唯一の必須パラメータは、モデルの保存場所を指定する `output_dir` です。 `push_to_hub=True`を設定して、このモデルをハブにプッシュします (モデルをアップロードするには、Hugging Face にサインインする必要があります)。各エポックの終了時に、[`Trainer`] は SacreBLEU メトリクスを評価し、トレーニング チェックポイントを保存します。 +2. トレーニング引数をモデル、データセット、トークナイザー、データ照合器、および `compute_metrics` 関数とともに [`Seq2SeqTrainer`] に渡します。 +3. [`~Trainer.train`] を呼び出してモデルを微調整します。 + + +```py +>>> training_args = Seq2SeqTrainingArguments( +... output_dir="my_awesome_opus_books_model", +... evaluation_strategy="epoch", +... learning_rate=2e-5, +... per_device_train_batch_size=16, +... per_device_eval_batch_size=16, +... weight_decay=0.01, +... save_total_limit=3, +... num_train_epochs=2, +... predict_with_generate=True, +... fp16=True, +... push_to_hub=True, +... ) + +>>> trainer = Seq2SeqTrainer( +... model=model, +... 
args=training_args, +... train_dataset=tokenized_books["train"], +... eval_dataset=tokenized_books["test"], +... tokenizer=tokenizer, +... data_collator=data_collator, +... compute_metrics=compute_metrics, +... ) + +>>> trainer.train() +``` + +トレーニングが完了したら、 [`~transformers.Trainer.push_to_hub`] メソッドを使用してモデルをハブに共有し、誰もがモデルを使用できるようにします。 + +```py +>>> trainer.push_to_hub() +``` + + + + +Keras を使用したモデルの微調整に慣れていない場合は、[こちら](../training#train-a-tensorflow-model-with-keras) の基本的なチュートリアルをご覧ください。 + + +TensorFlow でモデルを微調整するには、オプティマイザー関数、学習率スケジュール、およびいくつかのトレーニング ハイパーパラメーターをセットアップすることから始めます。 + + +```py +>>> from transformers import AdamWeightDecay + +>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01) +``` + +次に、[`TFAutoModelForSeq2SeqLM`] を使用して T5 をロードできます。 + +```py +>>> from transformers import TFAutoModelForSeq2SeqLM + +>>> model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint) +``` + +[`~transformers.TFPreTrainedModel.prepare_tf_dataset`] を使用して、データセットを `tf.data.Dataset` 形式に変換します。 + +```py +>>> tf_train_set = model.prepare_tf_dataset( +... tokenized_books["train"], +... shuffle=True, +... batch_size=16, +... collate_fn=data_collator, +... ) + +>>> tf_test_set = model.prepare_tf_dataset( +... tokenized_books["test"], +... shuffle=False, +... batch_size=16, +... collate_fn=data_collator, +... ) +``` + +[`compile`](https://keras.io/api/models/model_training_apis/#compile-method) を使用してトレーニング用のモデルを設定します。 Transformers モデルにはすべてデフォルトのタスク関連の損失関数があるため、次の場合を除き、損失関数を指定する必要はないことに注意してください。 + +```py +>>> import tensorflow as tf + +>>> model.compile(optimizer=optimizer) # No loss argument! +``` +トレーニングを開始する前にセットアップする最後の 2 つのことは、予測から SacreBLEU メトリクスを計算し、モデルをハブにプッシュする方法を提供することです。どちらも [Keras コールバック](../main_classes/keras_callbacks) を使用して行われます。 + +`compute_metrics` 関数を [`~transformers.KerasMetricCallback`] に渡します。 + +```py +>>> from transformers.keras_callbacks import KerasMetricCallback + +>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set) +``` + +[`~transformers.PushToHubCallback`] でモデルとトークナイザーをプッシュする場所を指定します。 + +```py +>>> from transformers.keras_callbacks import PushToHubCallback + +>>> push_to_hub_callback = PushToHubCallback( +... output_dir="my_awesome_opus_books_model", +... tokenizer=tokenizer, +... ) +``` + +次に、コールバックをまとめてバンドルします。 + +```py +>>> callbacks = [metric_callback, push_to_hub_callback] +``` + +ついに、モデルのトレーニングを開始する準備が整いました。トレーニングおよび検証データセット、エポック数、コールバックを指定して [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) を呼び出し、モデルを微調整します。 + +```py +>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=callbacks) +``` + +トレーニングが完了すると、モデルは自動的にハブにアップロードされ、誰でも使用できるようになります。 + + + + + + +翻訳用にモデルを微調整する方法の詳細な例については、対応するドキュメントを参照してください。 +[PyTorch ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb) +または [TensorFlow ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation-tf.ipynb)。 + + + +## Inference + +モデルを微調整したので、それを推論に使用できるようになりました。 + +別の言語に翻訳したいテキストを考え出します。 T5 の場合、作業中のタスクに応じて入力に接頭辞を付ける必要があります。英語からフランス語に翻訳する場合は、以下に示すように入力に接頭辞を付ける必要があります。 + +```py +>>> text = "translate English to French: Legumes share resources with nitrogen-fixing bacteria." 
+``` + +推論用に微調整されたモデルを試す最も簡単な方法は、それを [`pipeline`] で使用することです。モデルを使用して翻訳用の`pipeline`をインスタンス化し、テキストをそれに渡します。 + + +```py +>>> from transformers import pipeline + +>>> translator = pipeline("translation", model="my_awesome_opus_books_model") +>>> translator(text) +[{'translation_text': 'Legumes partagent des ressources avec des bactéries azotantes.'}] +``` + +必要に応じて、`pipeline`の結果を手動で複製することもできます。 + + + + +テキストをトークン化し、`input_ids` を PyTorch テンソルとして返します。 + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_opus_books_model") +>>> inputs = tokenizer(text, return_tensors="pt").input_ids +``` + +[`~transformers.generation_utils.GenerationMixin.generate`] メソッドを使用して翻訳を作成します。さまざまなテキスト生成戦略と生成を制御するためのパラメーターの詳細については、[Text Generation](../main_classes/text_generation) API を確認してください。 + + +```py +>>> from transformers import AutoModelForSeq2SeqLM + +>>> model = AutoModelForSeq2SeqLM.from_pretrained("my_awesome_opus_books_model") +>>> outputs = model.generate(inputs, max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95) +``` + +生成されたトークン ID をデコードしてテキストに戻します。 + + +```py +>>> tokenizer.decode(outputs[0], skip_special_tokens=True) +'Les lignées partagent des ressources avec des bactéries enfixant l'azote.' +``` + + + + +`input_ids`を TensorFlow テンソルとして返します。 tensors: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_opus_books_model") +>>> inputs = tokenizer(text, return_tensors="tf").input_ids +``` + +[`~transformers.generation_tf_utils.TFGenerationMixin.generate`] メソッドを使用して翻訳を作成します。さまざまなテキスト生成戦略と生成を制御するためのパラメーターの詳細については、[Text Generation](../main_classes/text_generation) API を確認してください。 + +```py +>>> from transformers import TFAutoModelForSeq2SeqLM + +>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("my_awesome_opus_books_model") +>>> outputs = model.generate(inputs, max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95) +``` + +生成されたトークン ID をデコードしてテキストに戻します。 + +```py +>>> tokenizer.decode(outputs[0], skip_special_tokens=True) +'Les lugumes partagent les ressources avec des bactéries fixatrices d'azote.' +``` + + diff --git a/docs/source/ja/tasks/video_classification.md b/docs/source/ja/tasks/video_classification.md new file mode 100644 index 00000000000000..e0c383619411bf --- /dev/null +++ b/docs/source/ja/tasks/video_classification.md @@ -0,0 +1,503 @@ + + +# Video classification + +[[open-in-colab]] + + +ビデオ分類は、ビデオ全体にラベルまたはクラスを割り当てるタスクです。ビデオには、各ビデオに 1 つのクラスのみが含まれることが期待されます。ビデオ分類モデルはビデオを入力として受け取り、ビデオがどのクラスに属するかについての予測を返します。これらのモデルを使用して、ビデオの内容を分類できます。ビデオ分類の実際のアプリケーションはアクション/アクティビティ認識であり、フィットネス アプリケーションに役立ちます。また、視覚障害のある人にとって、特に通勤時に役立ちます。 + +このガイドでは、次の方法を説明します。 + +1. [UCF101](https://www.crcv.ucf.edu/) のサブセットで [VideoMAE](https://huggingface.co/docs/transformers/main/en/model_doc/videomae) を微調整します。 data/UCF101.php) データセット。 +2. 
微調整したモデルを推論に使用します。 + + +このチュートリアルで説明するタスクは、次のモデル アーキテクチャでサポートされています。 + + + +[TimeSformer](../model_doc/timesformer), [VideoMAE](../model_doc/videomae), [ViViT](../model_doc/vivit) + + + + + +始める前に、必要なライブラリがすべてインストールされていることを確認してください。 +```bash +pip install -q pytorchvideo transformers evaluate +``` + +[PyTorchVideo](https://pytorchvideo.org/) (`pytorchvideo` と呼ばれます) を使用してビデオを処理し、準備します。 + +モデルをアップロードしてコミュニティと共有できるように、Hugging Face アカウントにログインすることをお勧めします。プロンプトが表示されたら、トークンを入力してログインします。 + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## Load UCF101 dataset + +まず、[UCF-101 データセット](https://www.crcv.ucf.edu/data/UCF101.php) のサブセットをロードします。これにより、完全なデータセットのトレーニングにさらに時間を費やす前に、実験してすべてが機能することを確認する機会が得られます。 + +```py +>>> from huggingface_hub import hf_hub_download + +>>> hf_dataset_identifier = "sayakpaul/ucf101-subset" +>>> filename = "UCF101_subset.tar.gz" +>>> file_path = hf_hub_download(repo_id=hf_dataset_identifier, filename=filename, repo_type="dataset") +``` + +サブセットをダウンロードした後、圧縮アーカイブを抽出する必要があります。 + +```py +>>> import tarfile + +>>> with tarfile.open(file_path) as t: +... t.extractall(".") +``` + +大まかに言うと、データセットは次のように構成されています。 + +```bash +UCF101_subset/ + train/ + BandMarching/ + video_1.mp4 + video_2.mp4 + ... + Archery + video_1.mp4 + video_2.mp4 + ... + ... + val/ + BandMarching/ + video_1.mp4 + video_2.mp4 + ... + Archery + video_1.mp4 + video_2.mp4 + ... + ... + test/ + BandMarching/ + video_1.mp4 + video_2.mp4 + ... + Archery + video_1.mp4 + video_2.mp4 + ... + ... +``` + +(`sorted`)された ビデオ パスは次のように表示されます。 + + +```bash +... +'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g07_c04.avi', +'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g07_c06.avi', +'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi', +'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g09_c02.avi', +'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g09_c06.avi' +... +``` + +同じグループ/シーンに属するビデオ クリップがあり、ビデオ ファイル パスではグループが`g`で示されていることがわかります。たとえば、`v_ApplyEyeMakeup_g07_c04.avi`や`v_ApplyEyeMakeup_g07_c06.avi`などです。 + +検証と評価の分割では、[データ漏洩](https://www.kaggle.com/code/alexisbcook/data-leakage) を防ぐために、同じグループ/シーンからのビデオ クリップを使用しないでください。このチュートリアルで使用しているサブセットでは、この情報が考慮されています。 + +次に、データセット内に存在するラベルのセットを取得します。また、モデルを初期化するときに役立つ 2 つの辞書を作成します。 + +* `label2id`: クラス名を整数にマップします。 +* `id2label`: 整数をクラス名にマッピングします。 + + +```py +>>> class_labels = sorted({str(path).split("/")[2] for path in all_video_file_paths}) +>>> label2id = {label: i for i, label in enumerate(class_labels)} +>>> id2label = {i: label for label, i in label2id.items()} + +>>> print(f"Unique classes: {list(label2id.keys())}.") + +# Unique classes: ['ApplyEyeMakeup', 'ApplyLipstick', 'Archery', 'BabyCrawling', 'BalanceBeam', 'BandMarching', 'BaseballPitch', 'Basketball', 'BasketballDunk', 'BenchPress']. +``` + +個性的なクラスが10種類あります。トレーニング セットには、クラスごとに 30 個のビデオがあります。 + +## Load a model to fine-tune + +事前トレーニングされたチェックポイントとそれに関連する画像プロセッサからビデオ分類モデルをインスタンス化します。モデルのエンコーダーには事前トレーニングされたパラメーターが付属しており、分類ヘッドはランダムに初期化されます。画像プロセッサは、データセットの前処理パイプラインを作成するときに役立ちます。 + +```py +>>> from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification + +>>> model_ckpt = "MCG-NJU/videomae-base" +>>> image_processor = VideoMAEImageProcessor.from_pretrained(model_ckpt) +>>> model = VideoMAEForVideoClassification.from_pretrained( +... model_ckpt, +... label2id=label2id, +... id2label=id2label, +... ignore_mismatched_sizes=True, # provide this in case you're planning to fine-tune an already fine-tuned checkpoint +... 
) +``` + +モデルのロード中に、次の警告が表示される場合があります。 + +```bash +Some weights of the model checkpoint at MCG-NJU/videomae-base were not used when initializing VideoMAEForVideoClassification: [..., 'decoder.decoder_layers.1.attention.output.dense.bias', 'decoder.decoder_layers.2.attention.attention.key.weight'] +- This IS expected if you are initializing VideoMAEForVideoClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). +- This IS NOT expected if you are initializing VideoMAEForVideoClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). +Some weights of VideoMAEForVideoClassification were not initialized from the model checkpoint at MCG-NJU/videomae-base and are newly initialized: ['classifier.bias', 'classifier.weight'] +You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. +``` + +この警告は、一部の重み (たとえば、`classifier`層の重みとバイアス) を破棄し、他のいくつかの重み (新しい`classifier`層の重みとバイアス) をランダムに初期化していることを示しています。この場合、これは予想されることです。事前にトレーニングされた重みを持たない新しい頭部を追加しているため、推論に使用する前にこのモデルを微調整する必要があるとライブラリが警告します。これはまさに私たちが行おうとしているものです。する。 + +**注意** [このチェックポイント](https://huggingface.co/MCG-NJU/videomae-base-finetuned-kinetics) は、同様のダウンストリームで微調整されてチェックポイントが取得されたため、このタスクのパフォーマンスが向上することに注意してください。かなりのドメインの重複があるタスク。 `MCG-NJU/videomae-base-finetuned-kinetics` を微調整して取得した [このチェックポイント](https://huggingface.co/sayakpaul/videomae-base-finetuned-kinetics-finetuned-ucf101-subset) を確認できます。 -キネティクス`。 + +## Prepare the datasets for training + +ビデオの前処理には、[PyTorchVideo ライブラリ](https://pytorchvideo.org/) を利用します。まず、必要な依存関係をインポートします。 + + +```py +>>> import pytorchvideo.data + +>>> from pytorchvideo.transforms import ( +... ApplyTransformToKey, +... Normalize, +... RandomShortSideScale, +... RemoveKey, +... ShortSideScale, +... UniformTemporalSubsample, +... ) + +>>> from torchvision.transforms import ( +... Compose, +... Lambda, +... RandomCrop, +... RandomHorizontalFlip, +... Resize, +... ) +``` + +トレーニング データセットの変換には、均一な時間サブサンプリング、ピクセル正規化、ランダム クロッピング、およびランダムな水平反転を組み合わせて使用​​します。検証および評価のデータセット変換では、ランダムなトリミングと水平反転を除き、同じ変換チェーンを維持します。これらの変換の詳細については、[PyTorchVideo の公式ドキュメント](https://pytorchvideo.org) を参照してください。 + +事前トレーニングされたモデルに関連付けられた`image_processor`を使用して、次の情報を取得します。 + +* ビデオ フレームのピクセルが正規化される画像の平均値と標準偏差。 +* ビデオ フレームのサイズが変更される空間解像度。 + +まず、いくつかの定数を定義します。 + +```py +>>> mean = image_processor.image_mean +>>> std = image_processor.image_std +>>> if "shortest_edge" in image_processor.size: +... height = width = image_processor.size["shortest_edge"] +>>> else: +... height = image_processor.size["height"] +... width = image_processor.size["width"] +>>> resize_to = (height, width) + +>>> num_frames_to_sample = model.config.num_frames +>>> sample_rate = 4 +>>> fps = 30 +>>> clip_duration = num_frames_to_sample * sample_rate / fps +``` + +次に、データセット固有の変換とデータセットをそれぞれ定義します。トレーニングセットから始めます: + + +```py +>>> train_transform = Compose( +... [ +... ApplyTransformToKey( +... key="video", +... transform=Compose( +... [ +... UniformTemporalSubsample(num_frames_to_sample), +... Lambda(lambda x: x / 255.0), +... Normalize(mean, std), +... RandomShortSideScale(min_size=256, max_size=320), +... RandomCrop(resize_to), +... RandomHorizontalFlip(p=0.5), +... ] +... ), +... ), +... ] +... ) + +>>> train_dataset = pytorchvideo.data.Ucf101( +... 
data_path=os.path.join(dataset_root_path, "train"), +... clip_sampler=pytorchvideo.data.make_clip_sampler("random", clip_duration), +... decode_audio=False, +... transform=train_transform, +... ) +``` + +同じ一連のワークフローを検証セットと評価セットに適用できます。 + + +```py +>>> val_transform = Compose( +... [ +... ApplyTransformToKey( +... key="video", +... transform=Compose( +... [ +... UniformTemporalSubsample(num_frames_to_sample), +... Lambda(lambda x: x / 255.0), +... Normalize(mean, std), +... Resize(resize_to), +... ] +... ), +... ), +... ] +... ) + +>>> val_dataset = pytorchvideo.data.Ucf101( +... data_path=os.path.join(dataset_root_path, "val"), +... clip_sampler=pytorchvideo.data.make_clip_sampler("uniform", clip_duration), +... decode_audio=False, +... transform=val_transform, +... ) + +>>> test_dataset = pytorchvideo.data.Ucf101( +... data_path=os.path.join(dataset_root_path, "test"), +... clip_sampler=pytorchvideo.data.make_clip_sampler("uniform", clip_duration), +... decode_audio=False, +... transform=val_transform, +... ) +``` + +**注意**: 上記のデータセット パイプラインは、[公式 PyTorchVideo サンプル](https://pytorchvideo.org/docs/tutorial_classification#dataset) から取得したものです。 [`pytorchvideo.data.Ucf101()`](https://pytorchvideo.readthedocs.io/en/latest/api/data/data.html#pytorchvideo.data.Ucf101) 関数を使用しています。 UCF-101 データセット。内部では、[`pytorchvideo.data.labeled_video_dataset.LabeledVideoDataset`](https://pytorchvideo.readthedocs.io/en/latest/api/data/data.html#pytorchvideo.data.LabeledVideoDataset) オブジェクトを返します。 `LabeledVideoDataset` クラスは、PyTorchVideo データセット内のすべてのビデオの基本クラスです。したがって、PyTorchVideo で既製でサポートされていないカスタム データセットを使用したい場合は、それに応じて `LabeledVideoDataset` クラスを拡張できます。詳細については、`data`API [ドキュメント](https://pytorchvideo.readthedocs.io/en/latest/api/data/data.html)を参照してください。また、データセットが同様の構造 (上に示したもの) に従っている場合は、`pytorchvideo.data.Ucf101()` を使用すると問題なく動作するはずです。 + +`num_videos` 引数にアクセスすると、データセット内のビデオの数を知ることができます。 + + + +```py +>>> print(train_dataset.num_videos, val_dataset.num_videos, test_dataset.num_videos) +# (300, 30, 75) +``` + +## Visualize the preprocessed video for better debugging + +```py +>>> import imageio +>>> import numpy as np +>>> from IPython.display import Image + +>>> def unnormalize_img(img): +... """Un-normalizes the image pixels.""" +... img = (img * std) + mean +... img = (img * 255).astype("uint8") +... return img.clip(0, 255) + +>>> def create_gif(video_tensor, filename="sample.gif"): +... """Prepares a GIF from a video tensor. +... +... The video tensor is expected to have the following shape: +... (num_frames, num_channels, height, width). +... """ +... frames = [] +... for video_frame in video_tensor: +... frame_unnormalized = unnormalize_img(video_frame.permute(1, 2, 0).numpy()) +... frames.append(frame_unnormalized) +... kargs = {"duration": 0.25} +... imageio.mimsave(filename, frames, "GIF", **kargs) +... return filename + +>>> def display_gif(video_tensor, gif_name="sample.gif"): +... """Prepares and displays a GIF from a video tensor.""" +... video_tensor = video_tensor.permute(1, 0, 2, 3) +... gif_filename = create_gif(video_tensor, gif_name) +... return Image(filename=gif_filename) + +>>> sample_video = next(iter(train_dataset)) +>>> video_tensor = sample_video["video"] +>>> display_gif(video_tensor) +``` + +
+ Person playing basketball +
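+
+参考までに、可視化した `video_tensor` の形状も確認しておくと、この後の `collate_fn` がなぜ次元を並べ替えるのかが分かりやすくなります(出力の具体的な値はチェックポイントの設定に依存するため、以下はあくまで一例です)。
+
+```py
+>>> print(video_tensor.shape)
+# 例: torch.Size([3, 16, 224, 224]) -- (num_channels, num_frames, height, width)
+# モデル側は (num_frames, num_channels, height, width) を期待するため、この後の collate_fn で次元を並べ替えます
+```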
+ +## Train the model + +🤗 Transformers の [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer) をモデルのトレーニングに利用します。 `Trainer`をインスタンス化するには、トレーニング構成と評価メトリクスを定義する必要があります。最も重要なのは [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments) で、これはトレーニングを構成するためのすべての属性を含むクラスです。モデルのチェックポイントを保存するために使用される出力フォルダー名が必要です。また、🤗 Hub 上のモデル リポジトリ内のすべての情報を同期するのにも役立ちます。 + +トレーニング引数のほとんどは一目瞭然ですが、ここで非常に重要なのは`remove_unused_columns=False`です。これにより、モデルの呼び出し関数で使用されない機能が削除されます。デフォルトでは`True`です。これは、通常、未使用の特徴列を削除し、モデルの呼び出し関数への入力を解凍しやすくすることが理想的であるためです。ただし、この場合、`pixel_values` (モデルが入力で期待する必須キーです) を作成するには、未使用の機能 (特に`video`) が必要です。 + +```py +>>> from transformers import TrainingArguments, Trainer + +>>> model_name = model_ckpt.split("/")[-1] +>>> new_model_name = f"{model_name}-finetuned-ucf101-subset" +>>> num_epochs = 4 + +>>> args = TrainingArguments( +... new_model_name, +... remove_unused_columns=False, +... evaluation_strategy="epoch", +... save_strategy="epoch", +... learning_rate=5e-5, +... per_device_train_batch_size=batch_size, +... per_device_eval_batch_size=batch_size, +... warmup_ratio=0.1, +... logging_steps=10, +... load_best_model_at_end=True, +... metric_for_best_model="accuracy", +... push_to_hub=True, +... max_steps=(train_dataset.num_videos // batch_size) * num_epochs, +... ) +``` + +`pytorchvideo.data.Ucf101()` によって返されるデータセットは `__len__` メソッドを実装していません。そのため、`TrainingArguments`をインスタンス化するときに`max_steps`を定義する必要があります。 + +次に、予測からメトリクスを計算する関数を定義する必要があります。これは、これからロードする`metric`を使用します。必要な前処理は、予測されたロジットの argmax を取得することだけです。 + +```py +import evaluate + +metric = evaluate.load("accuracy") + + +def compute_metrics(eval_pred): + predictions = np.argmax(eval_pred.predictions, axis=1) + return metric.compute(predictions=predictions, references=eval_pred.label_ids) +``` + +**評価に関する注意事項**: + +[VideoMAE 論文](https://arxiv.org/abs/2203.12602) では、著者は次の評価戦略を使用しています。彼らはテスト ビデオからのいくつかのクリップでモデルを評価し、それらのクリップにさまざまなクロップを適用して、合計スコアを報告します。ただし、単純さと簡潔さを保つために、このチュートリアルではそれを考慮しません。 + +また、サンプルをまとめてバッチ処理するために使用される `collat​​e_fn` を定義します。各バッチは、`pixel_values` と `labels` という 2 つのキーで構成されます。 + + +```py +>>> def collate_fn(examples): +... # permute to (num_frames, num_channels, height, width) +... pixel_values = torch.stack( +... [example["video"].permute(1, 0, 2, 3) for example in examples] +... ) +... labels = torch.tensor([example["label"] for example in examples]) +... return {"pixel_values": pixel_values, "labels": labels} +``` + +次に、これらすべてをデータセットとともに`Trainer`に渡すだけです。 + +```py +>>> trainer = Trainer( +... model, +... args, +... train_dataset=train_dataset, +... eval_dataset=val_dataset, +... tokenizer=image_processor, +... compute_metrics=compute_metrics, +... data_collator=collate_fn, +... ) +``` + +すでにデータを前処理しているのに、なぜトークナイザーとして`image_processor`を渡したのか不思議に思うかもしれません。これは、イメージ プロセッサ構成ファイル (JSON として保存) もハブ上のリポジトリにアップロードされるようにするためだけです。 + +次に、`train` メソッドを呼び出してモデルを微調整します。 + +```py +>>> train_results = trainer.train() +``` + +トレーニングが完了したら、 [`~transformers.Trainer.push_to_hub`] メソッドを使用してモデルをハブに共有し、誰もがモデルを使用できるようにします。 + +```py +>>> trainer.push_to_hub() +``` + +## Inference + +モデルを微調整したので、それを推論に使用できるようになりました。 + +推論のためにビデオをロードします。 + +```py +>>> sample_test_video = next(iter(test_dataset)) +``` + +
+ Teams playing basketball +
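+
+なお、この後の「`pipeline` の結果を手動で複製する」例では `trained_model` という変数を参照しますが、この抜粋では明示的に定義されていません。たとえば、トレーニング済みの `trainer.model` をそのまま使うか、Hub にプッシュしたチェックポイントから再ロードできます(以下のリポジトリ名は一例です)。
+
+```py
+>>> trained_model = trainer.model
+
+>>> # あるいは、Hub にプッシュしたチェックポイントから再ロードする場合 (リポジトリ名は一例)
+>>> # trained_model = VideoMAEForVideoClassification.from_pretrained("your-username/videomae-base-finetuned-ucf101-subset")
+```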
+ +推論用に微調整されたモデルを試す最も簡単な方法は、それを [`pipeline`](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.VideoClassificationPipeline). で使用することです。モデルを使用してビデオ分類用の` pipeline`をインスタンス化し、それにビデオを渡します。 + + +```py +>>> from transformers import pipeline + +>>> video_cls = pipeline(model="my_awesome_video_cls_model") +>>> video_cls("https://huggingface.co/datasets/sayakpaul/ucf101-subset/resolve/main/v_BasketballDunk_g14_c06.avi") +[{'score': 0.9272987842559814, 'label': 'BasketballDunk'}, + {'score': 0.017777055501937866, 'label': 'BabyCrawling'}, + {'score': 0.01663011871278286, 'label': 'BalanceBeam'}, + {'score': 0.009560945443809032, 'label': 'BandMarching'}, + {'score': 0.0068979403004050255, 'label': 'BaseballPitch'}] +``` + +必要に応じて、`pipeline`の結果を手動で複製することもできます。 + +```py +>>> def run_inference(model, video): +... # (num_frames, num_channels, height, width) +... perumuted_sample_test_video = video.permute(1, 0, 2, 3) +... inputs = { +... "pixel_values": perumuted_sample_test_video.unsqueeze(0), +... "labels": torch.tensor( +... [sample_test_video["label"]] +... ), # this can be skipped if you don't have labels available. +... } + +... device = torch.device("cuda" if torch.cuda.is_available() else "cpu") +... inputs = {k: v.to(device) for k, v in inputs.items()} +... model = model.to(device) + +... # forward pass +... with torch.no_grad(): +... outputs = model(**inputs) +... logits = outputs.logits + +... return logits +``` + +次に、入力をモデルに渡し、`logits `を返します。 + +```py +>>> logits = run_inference(trained_model, sample_test_video["video"]) +``` + +`logits` をデコードすると、次のようになります。 + +```py +>>> predicted_class_idx = logits.argmax(-1).item() +>>> print("Predicted class:", model.config.id2label[predicted_class_idx]) +# Predicted class: BasketballDunk +``` diff --git a/docs/source/ja/tasks/visual_question_answering.md b/docs/source/ja/tasks/visual_question_answering.md new file mode 100644 index 00000000000000..f6c2989693708b --- /dev/null +++ b/docs/source/ja/tasks/visual_question_answering.md @@ -0,0 +1,405 @@ + + +# Visual Question Answering + +[[open-in-colab]] + +Visual Question Answering (VQA) は、画像に基づいて自由形式の質問に答えるタスクです。 +このタスクをサポートするモデルへの入力は通常、画像と質問の組み合わせであり、出力は +自然言語で表現された答え。 + +VQA の注目すべき使用例には次のようなものがあります。 +* 視覚障害者向けのアクセシビリティ アプリケーション。 +* 教育: 講義や教科書で示されている視覚的な資料について質問を投げかけること。 VQA は、インタラクティブな博物館の展示物や史跡でも利用できます。 +* カスタマー サービスと電子商取引: VQA は、ユーザーが製品について質問できるようにすることでユーザー エクスペリエンスを向上させます。 +* 画像検索: VQA モデルを使用して、特定の特徴を持つ画像を検索できます。たとえば、ユーザーは「犬はいますか?」と尋ねることができます。一連の画像から犬が写っているすべての画像を検索します。 + +このガイドでは、次の方法を学びます。 + +- [`Graphcore/vqa` データセット](https://huggingface.co/datasets/Graphcore/vqa) 上で分類 VQA モデル、特に [ViLT](../model_doc/vilt) を微調整します。 +- 微調整された ViLT を推論に使用します。 +- BLIP-2 などの生成モデルを使用してゼロショット VQA 推論を実行します。 + +## Fine-tuning ViLT + +ViLT モデルは、Vision Transformer (ViT) にテキスト埋め込みを組み込んでおり、最小限の設計を可能にします。 +視覚と言語の事前トレーニング (VLP)。このモデルは、いくつかの下流タスクに使用できます。 VQA タスクの場合、分類子 +head は最上部 (`[CLS]` トークンの最終的な非表示状態の最上部にある線形層) に配置され、ランダムに初期化されます。 +したがって、視覚的質問応答は **分類問題** として扱われます。 + +BLIP、BLIP-2、InstructBLIP などの最近のモデルは、VQA を生成タスクとして扱います。このガイドの後半では、 +ゼロショット VQA 推論にそれらを使用する方法を示します。 + +始める前に、必要なライブラリがすべてインストールされていることを確認してください。 + +```bash +pip install -q transformers datasets +``` +モデルをコミュニティと共有することをお勧めします。 Hugging Face アカウントにログインして、🤗 ハブにアップロードします。 +プロンプトが表示されたら、トークンを入力してログインします。 + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +モデルのチェックポイントをグローバル変数として定義しましょう。 + +```py +>>> model_checkpoint = "dandelin/vilt-b32-mlm" +``` + +## Load the data + 
+説明の目的で、このガイドでは、注釈付きの視覚的な質問に答える「Graphcore/vqa」データセットの非常に小さなサンプルを使用します。 +完全なデータセットは [🤗 Hub](https://huggingface.co/datasets/Graphcore/vqa) で見つけることができます。 + +[`Graphcore/vqa` データセット](https://huggingface.co/datasets/Graphcore/vqa) の代わりに、 +公式 [VQA データセット ページ](https://visualqa.org/download.html) から同じデータを手動で取得します。フォローしたい場合は、 +カスタム データを使用したチュートリアルでは、[画像データセットを作成する](https://huggingface.co/docs/datasets/image_dataset#loading-script) 方法を確認してください。 +🤗 データセットのドキュメントのガイド。 + +検証分割から最初の 200 個の例をロードし、データセットの機能を調べてみましょう。 + +```python +>>> from datasets import load_dataset + +>>> dataset = load_dataset("Graphcore/vqa", split="validation[:200]") +>>> dataset +Dataset({ + features: ['question', 'question_type', 'question_id', 'image_id', 'answer_type', 'label'], + num_rows: 200 +}) +``` +データセットの特徴を理解するために例を見てみましょう。 + +```py +>>> dataset[0] +{'question': 'Where is he looking?', + 'question_type': 'none of the above', + 'question_id': 262148000, + 'image_id': '/root/.cache/huggingface/datasets/downloads/extracted/ca733e0e000fb2d7a09fbcc94dbfe7b5a30750681d0e965f8e0a23b1c2f98c75/val2014/COCO_val2014_000000262148.jpg', + 'answer_type': 'other', + 'label': {'ids': ['at table', 'down', 'skateboard', 'table'], + 'weights': [0.30000001192092896, + 1.0, + 0.30000001192092896, + 0.30000001192092896]}} +``` + +このタスクに関連する機能には次のものがあります。 +* `question`: 画像から回答する質問 +* `image_id`: 質問が参照する画像へのパス +* `label`: 注釈 + +残りの機能は必要ないので削除できます。 + + +```py +>>> dataset = dataset.remove_columns(['question_type', 'question_id', 'answer_type']) +``` + +ご覧のとおり、`label`機能には、さまざまなヒューマン・アノテーターによって収集された、同じ質問に対する複数の回答 (ここでは`id`と呼びます) が含まれています。 +質問に対する答えは主観的なものになる可能性があるためです。この場合、問題は "彼はどこを見ているのか?"ということです。一部の人々 +これには "ダウン" という注釈が付けられ、他のものには "テーブルで" という注釈が付けられ、別の注釈には "スケートボード" という注釈が付けられました。 + +画像を見て、どの答えを出すかを考えてください。 + +```python +>>> from PIL import Image + +>>> image = Image.open(dataset[0]['image_id']) +>>> image +``` + +
+ VQA Image Example +
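+
+参考として、この例のアノテーションのうちどの回答が最も高い重み (`weight`) を持つかは、次のように確認できます(上に示したサンプルの値に基づく確認用スニペットです)。
+
+```py
+>>> label = dataset[0]["label"]
+>>> max(zip(label["weights"], label["ids"]))
+(1.0, 'down')
+```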
+ + +質問と回答のあいまいさのため、このようなデータセットはマルチラベル分類問題として扱われます ( +複数の回答が有効である可能性があります)。さらに、ワンホット エンコードされたベクトルを作成するだけではなく、 +注釈内に特定の回答が出現した回数に基づくソフト エンコーディング。 + +たとえば、上の例では、"down"という回答が他の回答よりも頻繁に選択されるため、 +スコア (データセットでは`weight`と呼ばれます) は 1.0 で、残りの回答のスコアは 1.0 未満です。 + +後で適切な分類ヘッドを使用してモデルをインスタンス化するために、2 つの辞書を作成しましょう。 +ラベル名を整数に変換する、またはその逆: + +```py +>>> import itertools + +>>> labels = [item['ids'] for item in dataset['label']] +>>> flattened_labels = list(itertools.chain(*labels)) +>>> unique_labels = list(set(flattened_labels)) + +>>> label2id = {label: idx for idx, label in enumerate(unique_labels)} +>>> id2label = {idx: label for label, idx in label2id.items()} +``` + +マッピングができたので、文字列の回答をその ID に置き換え、さらに前処理をより便利にするためにデータセットをフラット化することができます。 + +```python +>>> def replace_ids(inputs): +... inputs["label"]["ids"] = [label2id[x] for x in inputs["label"]["ids"]] +... return inputs + + +>>> dataset = dataset.map(replace_ids) +>>> flat_dataset = dataset.flatten() +>>> flat_dataset.features +{'question': Value(dtype='string', id=None), + 'image_id': Value(dtype='string', id=None), + 'label.ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), + 'label.weights': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None)} +``` + +## Preprocessing data + +次のステップでは、ViLT プロセッサをロードして、モデルの画像データとテキスト データを準備します。 +[`ViltProcessor`] は、BERT トークナイザーと ViLT 画像プロセッサを便利な単一プロセッサにラップします。 + +```py +>>> from transformers import ViltProcessor + +>>> processor = ViltProcessor.from_pretrained(model_checkpoint) +``` + +データを前処理するには、[`ViltProcessor`] を使用して画像と質問をエンコードする必要があります。プロセッサーは使用します +[`BertTokenizerFast`] を使用してテキストをトークン化し、テキスト データの `input_ids`、`attention_mask`、および `token_type_ids` を作成します。 +画像に関しては、プロセッサは [`ViltImageProcessor`] を利用して画像のサイズ変更と正規化を行い、`pixel_values` と `pixel_mask` を作成します。 + +これらの前処理ステップはすべて内部で行われ、`processor`を呼び出すだけで済みます。ただし、それでも必要なのは、 +対象のラベルを準備します。この表現では、各要素は考えられる答え (ラベル) に対応します。正解の場合、要素は保持されます。 +それぞれのスコア (重み) が設定され、残りの要素は 0 に設定されます。 + +次の関数は、画像と質問に `processor` を適用し、上で説明したようにラベルをフォーマットします。 + +```py +>>> import torch + +>>> def preprocess_data(examples): +... image_paths = examples['image_id'] +... images = [Image.open(image_path) for image_path in image_paths] +... texts = examples['question'] + +... encoding = processor(images, texts, padding="max_length", truncation=True, return_tensors="pt") + +... for k, v in encoding.items(): +... encoding[k] = v.squeeze() + +... targets = [] + +... for labels, scores in zip(examples['label.ids'], examples['label.weights']): +... target = torch.zeros(len(id2label)) + +... for label, score in zip(labels, scores): +... target[label] = score + +... targets.append(target) + +... encoding["labels"] = targets + +... 
return encoding +``` + +データセット全体に前処理関数を適用するには、🤗 Datasets [`~datasets.map`] 関数を使用します。 `map` を高速化するには、次のようにします。 +データセットの複数の要素を一度に処理するには、`batched=True` を設定します。この時点で、不要な列は自由に削除してください。 + + +```py +>>> processed_dataset = flat_dataset.map(preprocess_data, batched=True, remove_columns=['question','question_type', 'question_id', 'image_id', 'answer_type', 'label.ids', 'label.weights']) +>>> processed_dataset +Dataset({ + features: ['input_ids', 'token_type_ids', 'attention_mask', 'pixel_values', 'pixel_mask', 'labels'], + num_rows: 200 +}) +``` + +最後のステップとして、[`DefaultDataCollat​​or`] を使用してサンプルのバッチを作成します。 + + +```py +>>> from transformers import DefaultDataCollator + +>>> data_collator = DefaultDataCollator() +``` + +## Train the model + +これでモデルのトレーニングを開始する準備が整いました。 [`ViltForQuestionAnswering`] で ViLT をロードします。ラベルの数を指定します +ラベルマッピングとともに: + +```py +>>> from transformers import ViltForQuestionAnswering + +>>> model = ViltForQuestionAnswering.from_pretrained(model_checkpoint, num_labels=len(id2label), id2label=id2label, label2id=label2id) +``` + +この時点で残っているステップは 3 つだけです。 + +1. [`TrainingArguments`] でトレーニング ハイパーパラメータを定義します。 + +```py +>>> from transformers import TrainingArguments + +>>> repo_id = "MariaK/vilt_finetuned_200" + +>>> training_args = TrainingArguments( +... output_dir=repo_id, +... per_device_train_batch_size=4, +... num_train_epochs=20, +... save_steps=200, +... logging_steps=50, +... learning_rate=5e-5, +... save_total_limit=2, +... remove_unused_columns=False, +... push_to_hub=True, +... ) +``` + +2. トレーニング引数をモデル、データセット、プロセッサー、データ照合器とともに [`Trainer`] に渡します。 + +```py +>>> from transformers import Trainer + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... data_collator=data_collator, +... train_dataset=processed_dataset, +... tokenizer=processor, +... ) +``` + +3. [`~Trainer.train`] を呼び出してモデルを微調整します。 + +```py +>>> trainer.train() +``` + +トレーニングが完了したら、 [`~Trainer.push_to_hub`] メソッドを使用してモデルをハブに共有し、🤗 ハブで最終モデルを共有します。 + +```py +>>> trainer.push_to_hub() +``` + +## Inference + +ViLT モデルを微調整し、🤗 Hub にアップロードしたので、それを推論に使用できます。もっとも単純な +推論用に微調整されたモデルを試す方法は、それを [`pipeline`] で使用することです。 + +```py +>>> from transformers import pipeline + +>>> pipe = pipeline("visual-question-answering", model="MariaK/vilt_finetuned_200") +``` + +このガイドのモデルは 200 の例でのみトレーニングされているため、多くを期待しないでください。少なくともそれがあるかどうか見てみましょう +データから何かを学習し、推論を説明するためにデータセットから最初の例を取り出します。 + +```py +>>> example = dataset[0] +>>> image = Image.open(example['image_id']) +>>> question = example['question'] +>>> print(question) +>>> pipe(image, question, top_k=1) +"Where is he looking?" +[{'score': 0.5498199462890625, 'answer': 'down'}] +``` + +あまり自信がありませんが、モデルは確かに何かを学習しました。より多くの例とより長いトレーニングを行うと、はるかに良い結果が得られます。 + +必要に応じて、パイプラインの結果を手動で複製することもできます。 +1. 画像と質問を取得し、モデルのプロセッサを使用してモデル用に準備します。 +2. モデルを通じて結果または前処理を転送します。 +3. ロジットから、最も可能性の高い回答の ID を取得し、`id2label` で実際の回答を見つけます。 + +```py +>>> processor = ViltProcessor.from_pretrained("MariaK/vilt_finetuned_200") + +>>> image = Image.open(example['image_id']) +>>> question = example['question'] + +>>> # prepare inputs +>>> inputs = processor(image, question, return_tensors="pt") + +>>> model = ViltForQuestionAnswering.from_pretrained("MariaK/vilt_finetuned_200") + +>>> # forward pass +>>> with torch.no_grad(): +... 
outputs = model(**inputs) + +>>> logits = outputs.logits +>>> idx = logits.argmax(-1).item() +>>> print("Predicted answer:", model.config.id2label[idx]) +Predicted answer: down +``` + +## Zero-shot VQA + +以前のモデルでは、VQA を分類タスクとして扱いました。 BLIP、BLIP-2、InstructBLIP アプローチなどの一部の最近のモデル +生成タスクとしての VQA。 [BLIP-2](../model_doc/blip-2) を例として考えてみましょう。新しいビジュアル言語の事前トレーニングを導入しました +事前にトレーニングされたビジョン エンコーダーと LLM を任意に組み合わせて使用​​できるパラダイム (詳細については、[BLIP-2 ブログ投稿](https://huggingface.co/blog/blip-2) を参照)。 +これにより、視覚的な質問応答を含む複数の視覚言語タスクで最先端の結果を達成することができます。 + +このモデルを VQA に使用する方法を説明しましょう。まず、モデルをロードしましょう。ここではモデルを明示的に送信します。 +GPU (利用可能な場合)。これは [`Trainer`] が自動的に処理するため、トレーニング時に事前に行う必要はありませんでした。 + + +```py +>>> from transformers import AutoProcessor, Blip2ForConditionalGeneration +>>> import torch + +>>> processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b") +>>> model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16) +>>> device = "cuda" if torch.cuda.is_available() else "cpu" +>>> model.to(device) +``` + +モデルは画像とテキストを入力として受け取るため、VQA データセットの最初の例とまったく同じ画像と質問のペアを使用してみましょう。 + + +```py +>>> example = dataset[0] +>>> image = Image.open(example['image_id']) +>>> question = example['question'] +``` + +視覚的な質問応答タスクに BLIP-2 を使用するには、テキスト プロンプトが特定の形式 (`Question: {} Answer:`) に従う必要があります。 + + +```py +>>> prompt = f"Question: {question} Answer:" +``` + +次に、モデルのプロセッサで画像/プロンプトを前処理し、処理された入力をモデルに渡し、出力をデコードする必要があります。 + +```py +>>> inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16) + +>>> generated_ids = model.generate(**inputs, max_new_tokens=10) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip() +>>> print(generated_text) +"He is looking at the crowd" +``` + +ご覧のとおり、モデルは群衆と顔の向き (下を向いている) を認識しましたが、見逃しているようです。 +観客がスケーターの後ろにいるという事実。それでも、人間が注釈を付けたデータセットを取得することが不可能な場合には、これは +このアプローチにより、有用な結果がすぐに得られます。 diff --git a/docs/source/ja/tasks/zero_shot_image_classification.md b/docs/source/ja/tasks/zero_shot_image_classification.md new file mode 100644 index 00000000000000..d8d0b530334ec4 --- /dev/null +++ b/docs/source/ja/tasks/zero_shot_image_classification.md @@ -0,0 +1,148 @@ + + +# Zero-shot image classification + +[[open-in-colab]] + +ゼロショット画像分類は、次のモデルを使用して画像をさまざまなカテゴリに分類するタスクです。 +これらの特定のカテゴリのラベル付きの例を含むデータに対して明示的にトレーニングされていない。 + +従来、画像分類には、ラベル付き画像の特定のセットでモデルをトレーニングする必要があり、このモデルは次のことを学習します。 +特定の画像の特徴をラベルに「マッピング」します。分類タスクにそのようなモデルを使用する必要がある場合、 +新しいラベルのセットでは、モデルを "再調整" するために微調整が必​​要です。 + +対照的に、ゼロショットまたはオープン語彙画像分類モデルは、通常、大規模なシステムでトレーニングされたマルチモーダル モデルです。 +画像と関連する説明のデータセット。これらのモデルは、ゼロショット画像分類を含む多くの下流タスクに使用できる、調整された視覚言語表現を学習します。 + +これは、画像分類に対するより柔軟なアプローチであり、モデルを新しいまだ見たことのないカテゴリに一般化できるようになります。 +追加のトレーニング データを必要とせず、ユーザーはターゲット オブジェクトの自由形式のテキスト説明を含む画像をクエリできるようになります。 + +このガイドでは、次の方法を学びます。 + +* ゼロショット画像分類パイプラインを作成する +* 手動でゼロショット画像分類推論を実行します + +始める前に、必要なライブラリがすべてインストールされていることを確認してください。 + +```bash +pip install -q transformers +``` + +## Zero-shot image classification pipeline + +ゼロショット画像分類をサポートするモデルで推論を試す最も簡単な方法は、対応する [`パイプライン`] を使用することです。 +[Hugging Face Hub のチェックポイント](https://huggingface.co/models?pipeline_tag=zero-shot-image-classification&sort=downloads) からパイプラインをインスタンス化します。 + +```python +>>> from transformers import pipeline + +>>> checkpoint = "openai/clip-vit-large-patch14" +>>> detector = pipeline(model=checkpoint, task="zero-shot-image-classification") +``` + +次に、分類したい画像を選択します。 + +```py +>>> from PIL import Image +>>> import requests + +>>> url = 
"https://unsplash.com/photos/g8oS8-82DxI/download?ixid=MnwxMjA3fDB8MXx0b3BpY3x8SnBnNktpZGwtSGt8fHx8fDJ8fDE2NzgxMDYwODc&force=true&w=640" +>>> image = Image.open(requests.get(url, stream=True).raw) + +>>> image +``` + +
+ Photo of an owl +
+ +画像と候補オブジェクトのラベルをパイプラインに渡します。ここでは画像を直接渡します。他の適切なオプション +画像へのローカル パスまたは画像 URL を含めます。 +候補ラベルは、この例のように単純な単語にすることも、より説明的な単語にすることもできます。 + +```py +>>> predictions = detector(image, candidate_labels=["fox", "bear", "seagull", "owl"]) +>>> predictions +[{'score': 0.9996670484542847, 'label': 'owl'}, + {'score': 0.000199399160919711, 'label': 'seagull'}, + {'score': 7.392891711788252e-05, 'label': 'fox'}, + {'score': 5.96074532950297e-05, 'label': 'bear'}] +``` + +## Zero-shot image classification by hand + +ゼロショット画像分類パイプラインの使用方法を理解したところで、ゼロショットを実行する方法を見てみましょう。 +画像を手動で分類します。 + +まず、[Hugging Face Hub のチェックポイント](https://huggingface.co/models?pipeline_tag=zero-shot-image-classification&sort=downloads) からモデルと関連プロセッサをロードします。 +ここでは、前と同じチェックポイントを使用します。 + +```py +>>> from transformers import AutoProcessor, AutoModelForZeroShotImageClassification + +>>> model = AutoModelForZeroShotImageClassification.from_pretrained(checkpoint) +>>> processor = AutoProcessor.from_pretrained(checkpoint) +``` + +気分を変えて、別の画像を撮ってみましょう。 + +```py +>>> from PIL import Image +>>> import requests + +>>> url = "https://unsplash.com/photos/xBRQfR2bqNI/download?ixid=MnwxMjA3fDB8MXxhbGx8fHx8fHx8fHwxNjc4Mzg4ODEx&force=true&w=640" +>>> image = Image.open(requests.get(url, stream=True).raw) + +>>> image +``` + +
+ Photo of a car +
+ +プロセッサを使用してモデルの入力を準備します。プロセッサーは、 +サイズ変更と正規化によるモデルの画像、およびテキスト入力を処理するトークナイザー。 + +```py +>>> candidate_labels = ["tree", "car", "bike", "cat"] +>>> inputs = processor(images=image, text=candidate_labels, return_tensors="pt", padding=True) +``` + +入力をモデルに渡し、結果を後処理します。 + + +```py +>>> import torch + +>>> with torch.no_grad(): +... outputs = model(**inputs) + +>>> logits = outputs.logits_per_image[0] +>>> probs = logits.softmax(dim=-1).numpy() +>>> scores = probs.tolist() + +>>> result = [ +... {"score": score, "label": candidate_label} +... for score, candidate_label in sorted(zip(probs, candidate_labels), key=lambda x: -x[0]) +... ] + +>>> result +[{'score': 0.998572, 'label': 'car'}, + {'score': 0.0010570387, 'label': 'bike'}, + {'score': 0.0003393686, 'label': 'tree'}, + {'score': 3.1572064e-05, 'label': 'cat'}] +``` diff --git a/docs/source/ja/tasks/zero_shot_object_detection.md b/docs/source/ja/tasks/zero_shot_object_detection.md new file mode 100644 index 00000000000000..7dbc97ef3a1842 --- /dev/null +++ b/docs/source/ja/tasks/zero_shot_object_detection.md @@ -0,0 +1,310 @@ + + +# Zero-shot object detection + +[[open-in-colab]] + +従来、[オブジェクト検出](object_detection) に使用されるモデルには、トレーニング用のラベル付き画像データセットが必要でした。 +トレーニング データからのクラスのセットの検出に限定されます。 + +ゼロショットオブジェクト検出は、別のアプローチを使用する [OWL-ViT](../model_doc/owlvit) モデルによってサポートされています。 OWL-ViT +オープン語彙オブジェクト検出器です。これは、フリーテキストクエリに基づいて画像内のオブジェクトを検出できることを意味します。 +ラベル付きデータセットでモデルを微調整する必要性。 + +OWL-ViTは、マルチモーダル表現を利用してオープン語彙の検出を実行します。 [CLIP](../model_doc/clip) とを組み合わせます。 +軽量のオブジェクト分類および位置特定ヘッド。オープン語彙の検出は、CLIP のテキスト エンコーダーを使用してフリーテキスト クエリを埋め込み、それらをオブジェクト分類およびローカリゼーション ヘッドへの入力として使用することによって実現されます。 +画像とそれに対応するテキストの説明を関連付け、ViT は画像パッチを入力として処理します。作家たち +のOWL-ViTは、まずCLIPをゼロからトレーニングし、次に標準の物体検出データセットを使用してOWL-ViTをエンドツーエンドで微調整しました。 +二部マッチング損失。 + +このアプローチを使用すると、モデルはラベル付きデータセットで事前にトレーニングしなくても、テキストの説明に基づいてオブジェクトを検出できます。 + +このガイドでは、OWL-ViT の使用方法を学習します。 +- テキストプロンプトに基づいてオブジェクトを検出します +- バッチオブジェクト検出用 +- 画像誘導物体検出用 + +始める前に、必要なライブラリがすべてインストールされていることを確認してください。 + +```bash +pip install -q transformers +``` + +## Zero-shot object detection pipeline + +OWL-ViTによる推論を試す最も簡単な方法は、OWL-ViTを[`pipeline`]で使用することです。パイプラインをインスタンス化する +[Hugging Face Hub のチェックポイント](https://huggingface.co/models?other=owlvit) からのゼロショット オブジェクト検出の場合: + +```python +>>> from transformers import pipeline + +>>> checkpoint = "google/owlvit-base-patch32" +>>> detector = pipeline(model=checkpoint, task="zero-shot-object-detection") +``` + +次に、物体を検出したい画像を選択します。ここでは、宇宙飛行士アイリーン・コリンズの画像を使用します。 +[NASA](https://www.nasa.gov/multimedia/imagegallery/index.html) Great Images データセットの一部。 + +```py +>>> import skimage +>>> import numpy as np +>>> from PIL import Image + +>>> image = skimage.data.astronaut() +>>> image = Image.fromarray(np.uint8(image)).convert("RGB") + +>>> image +``` + +
+ Astronaut Eileen Collins +
+ +検索する画像と候補オブジェクトのラベルをパイプラインに渡します。 +ここでは画像を直接渡します。他の適切なオプションには、画像へのローカル パスまたは画像 URL が含まれます。また、画像をクエリするすべてのアイテムのテキスト説明も渡します。 + +```py +>>> predictions = detector( +... image, +... candidate_labels=["human face", "rocket", "nasa badge", "star-spangled banner"], +... ) +>>> predictions +[{'score': 0.3571370542049408, + 'label': 'human face', + 'box': {'xmin': 180, 'ymin': 71, 'xmax': 271, 'ymax': 178}}, + {'score': 0.28099656105041504, + 'label': 'nasa badge', + 'box': {'xmin': 129, 'ymin': 348, 'xmax': 206, 'ymax': 427}}, + {'score': 0.2110239565372467, + 'label': 'rocket', + 'box': {'xmin': 350, 'ymin': -1, 'xmax': 468, 'ymax': 288}}, + {'score': 0.13790413737297058, + 'label': 'star-spangled banner', + 'box': {'xmin': 1, 'ymin': 1, 'xmax': 105, 'ymax': 509}}, + {'score': 0.11950037628412247, + 'label': 'nasa badge', + 'box': {'xmin': 277, 'ymin': 338, 'xmax': 327, 'ymax': 380}}, + {'score': 0.10649408400058746, + 'label': 'rocket', + 'box': {'xmin': 358, 'ymin': 64, 'xmax': 424, 'ymax': 280}}] +``` + +予測を視覚化してみましょう。 + + +```py +>>> from PIL import ImageDraw + +>>> draw = ImageDraw.Draw(image) + +>>> for prediction in predictions: +... box = prediction["box"] +... label = prediction["label"] +... score = prediction["score"] + +... xmin, ymin, xmax, ymax = box.values() +... draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1) +... draw.text((xmin, ymin), f"{label}: {round(score,2)}", fill="white") + +>>> image +``` + +
+ Visualized predictions on NASA image +
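+
+上の出力にはスコアの低い検出も含まれています。用途によっては、次のようにしきい値でフィルタリングするとよいでしょう(しきい値 0.2 は一例です)。
+
+```py
+>>> threshold = 0.2
+>>> [(p["label"], round(p["score"], 2)) for p in predictions if p["score"] >= threshold]
+[('human face', 0.36), ('nasa badge', 0.28), ('rocket', 0.21)]
+```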
+ +## Text-prompted zero-shot object detection by hand + +ゼロショット物体検出パイプラインの使用方法を確認したので、同じことを再現してみましょう。 +手動で結果を取得します。 + +まず、[Hugging Face Hub のチェックポイント](https://huggingface.co/models?other=owlvit) からモデルと関連プロセッサをロードします。 +ここでは、前と同じチェックポイントを使用します。 + +```py +>>> from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection + +>>> model = AutoModelForZeroShotObjectDetection.from_pretrained(checkpoint) +>>> processor = AutoProcessor.from_pretrained(checkpoint) +``` + +気分を変えて、別の画像を撮ってみましょう。 + +```py +>>> import requests + +>>> url = "https://unsplash.com/photos/oj0zeY2Ltk4/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MTR8fHBpY25pY3xlbnwwfHx8fDE2Nzc0OTE1NDk&force=true&w=640" +>>> im = Image.open(requests.get(url, stream=True).raw) +>>> im +``` + +
+ Beach photo +
+
+プロセッサを使用してモデルへの入力を準備します。プロセッサーは、サイズ変更と正規化によって画像をモデル用に準備する画像プロセッサと、テキスト入力を処理する [`CLIPTokenizer`] を組み合わせたものです。
+
+```py
+>>> text_queries = ["hat", "book", "sunglasses", "camera"]
+>>> inputs = processor(text=text_queries, images=im, return_tensors="pt")
+```
+
+入力をモデルに渡し、後処理して、結果を視覚化します。画像プロセッサはモデルに渡す前に画像のサイズを変更しているため、[`~OwlViTImageProcessor.post_process_object_detection`] メソッドを使用して、予測されたバウンディング ボックスが元の画像を基準とした正しい座標を持つようにする必要があります。
+
+```py
+>>> import torch
+
+>>> with torch.no_grad():
+...     outputs = model(**inputs)
+...     target_sizes = torch.tensor([im.size[::-1]])
+...     results = processor.post_process_object_detection(outputs, threshold=0.1, target_sizes=target_sizes)[0]
+
+>>> draw = ImageDraw.Draw(im)
+
+>>> scores = results["scores"].tolist()
+>>> labels = results["labels"].tolist()
+>>> boxes = results["boxes"].tolist()
+
+>>> for box, score, label in zip(boxes, scores, labels):
+...     xmin, ymin, xmax, ymax = box
+...     draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
+...     draw.text((xmin, ymin), f"{text_queries[label]}: {round(score,2)}", fill="white")
+
+>>> im
+```
+
+<div class="flex justify-center">
+ Beach photo with detected objects +
+
+## Batch processing
+
+複数の画像セットとテキスト クエリを渡して、複数の画像内の異なる (または同じ) オブジェクトを検索できます。宇宙飛行士の画像とビーチの画像を組み合わせてみましょう。
+バッチ処理の場合、テキスト クエリはネストされたリストとして、画像は PIL イメージ、PyTorch テンソル、または NumPy 配列のリストとしてプロセッサに渡す必要があります。
+
+```py
+>>> images = [image, im]
+>>> text_queries = [
+...     ["human face", "rocket", "nasa badge", "star-spangled banner"],
+...     ["hat", "book", "sunglasses", "camera"],
+... ]
+>>> inputs = processor(text=text_queries, images=images, return_tensors="pt")
+```
+
+以前は後処理のために単一の画像のサイズをテンソルとして渡しましたが、タプルを渡すこともでき、複数の画像の場合はタプルのリストを渡せます。2 つの例に対して予測を作成し、2 番目の例 (`image_idx = 1`) を視覚化しましょう。
+
+```py
+>>> with torch.no_grad():
+...     outputs = model(**inputs)
+...     target_sizes = [x.size[::-1] for x in images]
+...     results = processor.post_process_object_detection(outputs, threshold=0.1, target_sizes=target_sizes)
+
+>>> image_idx = 1
+>>> draw = ImageDraw.Draw(images[image_idx])
+
+>>> scores = results[image_idx]["scores"].tolist()
+>>> labels = results[image_idx]["labels"].tolist()
+>>> boxes = results[image_idx]["boxes"].tolist()
+
+>>> for box, score, label in zip(boxes, scores, labels):
+...     xmin, ymin, xmax, ymax = box
+...     draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
+...     draw.text((xmin, ymin), f"{text_queries[image_idx][label]}: {round(score,2)}", fill="white")
+
+>>> images[image_idx]
+```
+
+<div class="flex justify-center">
+ Beach photo with detected objects +
+
+## Image-guided object detection
+
+テキスト クエリによるゼロショット オブジェクト検出に加えて、OWL-ViT は画像ガイドによるオブジェクト検出を提供します。これは、画像クエリを使用して、ターゲット画像内の類似したオブジェクトを検索できるということです。
+テキスト クエリとは異なり、使用できるサンプル画像は 1 つだけです。
+
+対象画像として、ソファに 2 匹の猫がいる画像を、クエリとして 1 匹の猫の画像を使用しましょう:
+
+```py
+>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+>>> image_target = Image.open(requests.get(url, stream=True).raw)
+
+>>> query_url = "http://images.cocodataset.org/val2017/000000524280.jpg"
+>>> query_image = Image.open(requests.get(query_url, stream=True).raw)
+```
+
+画像を簡単に見てみましょう。
+
+```py
+>>> import matplotlib.pyplot as plt
+
+>>> fig, ax = plt.subplots(1, 2)
+>>> ax[0].imshow(image_target)
+>>> ax[1].imshow(query_image)
+```
+
+<div class="flex justify-center">
+ Cats +
+
+前処理ステップでは、テキスト クエリの代わりに `query_images` を使用する必要があります。
+
+```py
+>>> inputs = processor(images=image_target, query_images=query_image, return_tensors="pt")
+```
+
+予測の場合、入力をモデルに渡す代わりに、[`~OwlViTForObjectDetection.image_guided_detection`] に渡します。ラベルがないことを除いて、予測の描画は以前と同様です。
+
+```py
+>>> with torch.no_grad():
+...     outputs = model.image_guided_detection(**inputs)
+...     target_sizes = torch.tensor([image_target.size[::-1]])
+...     results = processor.post_process_image_guided_detection(outputs=outputs, target_sizes=target_sizes)[0]
+
+>>> draw = ImageDraw.Draw(image_target)
+
+>>> scores = results["scores"].tolist()
+>>> boxes = results["boxes"].tolist()
+
+>>> # 画像誘導検出ではクラスラベルは返されないため、ボックスとスコアのみを使用します
+>>> for box, score in zip(boxes, scores):
+...     xmin, ymin, xmax, ymax = box
+...     draw.rectangle((xmin, ymin, xmax, ymax), outline="white", width=4)
+
+>>> image_target
+```
+
+<div class="flex justify-center">
+ Cats with bounding boxes +
+ +OWL-ViTによる推論をインタラクティブに試したい場合は、このデモをチェックしてください。 + + \ No newline at end of file diff --git a/docs/source/ja/tasks_explained.md b/docs/source/ja/tasks_explained.md new file mode 100644 index 00000000000000..7c027e7a7394a9 --- /dev/null +++ b/docs/source/ja/tasks_explained.md @@ -0,0 +1,301 @@ + + +# How 🤗 Transformers solve tasks + +[🤗 Transformersでできること](task_summary)で、自然言語処理(NLP)、音声とオーディオ、コンピュータビジョンのタスク、それらの重要なアプリケーションについて学びました。このページでは、モデルがこれらのタスクをどのように解決するかを詳しく見て、モデルの内部で何が起こっているかを説明します。特定のタスクを解決するためには多くの方法があり、一部のモデルは特定のテクニックを実装するか、または新しい観点からタスクに取り組むかもしれませんが、Transformerモデルにとって、一般的なアイデアは同じです。柔軟なアーキテクチャのおかげで、ほとんどのモデルはエンコーダ、デコーダ、またはエンコーダ-デコーダ構造の変種です。Transformerモデル以外にも、当社のライブラリにはコンピュータビジョンタスクに今でも使用されているいくつかの畳み込みニューラルネットワーク(CNN)もあります。また、現代のCNNがどのように機能するかも説明します。 + +タスクがどのように解決されるかを説明するために、モデル内部で有用な予測を出力するために何が起こるかについて説明します。 + +- [Wav2Vec2](model_doc/wav2vec2):オーディオ分類および自動音声認識(ASR)向け +- [Vision Transformer(ViT)](model_doc/vit)および[ConvNeXT](model_doc/convnext):画像分類向け +- [DETR](model_doc/detr):オブジェクト検出向け +- [Mask2Former](model_doc/mask2former):画像セグメンテーション向け +- [GLPN](model_doc/glpn):深度推定向け +- [BERT](model_doc/bert):エンコーダを使用するテキスト分類、トークン分類、および質問応答などのNLPタスク向け +- [GPT2](model_doc/gpt2):デコーダを使用するテキスト生成などのNLPタスク向け +- [BART](model_doc/bart):エンコーダ-デコーダを使用する要約および翻訳などのNLPタスク向け + + + +さらに進む前に、元のTransformerアーキテクチャの基本的な知識を持つと良いです。エンコーダ、デコーダ、および注意力がどのように動作するかを知っておくと、異なるTransformerモデルがどのように動作するかを理解するのに役立ちます。始めているか、リフレッシュが必要な場合は、詳細な情報については当社の[コース](https://huggingface.co/course/chapter1/4?fw=pt)をチェックしてください! + + + +## Speech and audio + +[Wav2Vec2](model_doc/wav2vec2)は、未ラベルの音声データで事前トレーニングされ、オーディオ分類および自動音声認識のラベル付きデータでファインチューンされた自己教師モデルです。 + +
+ +
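+
+この後で説明するように、事前トレーニング済みの Wav2Vec2 は、上にタスク固有のヘッドを追加するだけでオーディオ分類にも自動音声認識にも使えます。以下はその対応関係を示すだけの最小限のスケッチです(チェックポイント名 `facebook/wav2vec2-base` は一例で、追加されるヘッドはランダムに初期化されるため微調整が必要です)。
+
+```py
+from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2ForCTC
+
+# オーディオ分類: プールされた隠れ状態をクラスロジットに変換するヘッド付き
+audio_cls_model = Wav2Vec2ForSequenceClassification.from_pretrained("facebook/wav2vec2-base", num_labels=10)
+
+# 自動音声認識 (CTC): 各フレームの隠れ状態を語彙サイズのロジットに変換するヘッド付き
+asr_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base")
+```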
+ +このモデルには主に次の4つのコンポーネントがあります。 + +1. *特徴エンコーダ*:生の音声波形を受け取り、平均値をゼロに正規化し、単位分散に変換し、それを20msごとの特徴ベクトルのシーケンスに変換します。 + +2. 波形は自然に連続しているため、テキストのシーケンスを単語に分割できるようにできるように、特徴ベクトルは*量子化モジュール*に渡され、離散音声ユニットを学習しようとします。音声ユニットは*コードブック*(語彙と考えることができます)として知られるコードワードのコレクションから選択されます。コードブックから、連続したオーディオ入力を最もよく表すベクトルまたは音声ユニット(ターゲットラベルと考えることができます)が選択され、モデルを介して転送されます。 + +3. 特徴ベクトルの約半分はランダムにマスクされ、マスクされた特徴ベクトルは*コンテキストネットワーク*に供給されます。これは、相対的な位置エンベッディングも追加するTransformerエンコーダです。 + +4. コンテキストネットワークの事前トレーニングの目的は*コントラスティブタスク*です。モデルはマスクされた予測の真の量子化音声表現を、偽の予測のセットから予測しなければならず、モデルは最も似たコンテキストベクトルと量子化音声ユニット(ターゲットラベル)を見つけるように促されます。 + +今、Wav2Vec2は事前トレーニングされているので、オーディオ分類または自動音声認識のためにデータをファインチューンできます! + +### Audio classification + +事前トレーニングされたモデルをオーディオ分類に使用するには、基本的なWav2Vec2モデルの上にシーケンス分類ヘッドを追加します。分類ヘッドはエンコーダの隠れた状態を受け入れる線形層で、各オーディオフレームから学習された特徴を表します。これらの隠れた状態は長さが異なる可能性があるため、最初に隠れた状態がプールされ、次にクラスラベルに対するロジットに変換されます。ロジットとターゲット間のクロスエントロピー損失が計算され、最も可能性の高いクラスを見つけるために使用されます。 + +オーディオ分類を試す準備はできましたか?Wav2Vec2をファインチューンして推論に使用する方法を学ぶための完全な[オーディオ分類ガイド](tasks/audio_classification)をチェックしてください! + +### Automatic speech recognition + +事前トレーニングされたモデルを自動音声認識に使用するには、[connectionist temporal classification(CTC)](glossary#connectionist-temporal-classification-ctc)のための基本的なWav2Vec2モデルの上に言語モデリングヘッドを追加します。言語モデリングヘッドはエンコーダの隠れた状態を受け入れ、それらをロジットに変換します。各ロジットはトークンクラスを表し(トークン数はタスクの語彙から来ます)、ロジットとターゲット間のCTC損失が計算され、次に転写に変換されます。 + +自動音声認識を試す準備はできましたか?Wav2Vec2をファインチューンして推論に使用する方法を学ぶための完全な[自動音声認識ガイド](tasks/asr)をチェックしてください! + +## Computer vision + +コンピュータビジョンのタスクをアプローチする方法は2つあります。 + +1. 画像をパッチのシーケンスに分割し、Transformerを使用して並列に処理します。 +2. [ConvNeXT](model_doc/convnext)などのモダンなCNNを使用します。これらは畳み込み層を使用しますが、モダンなネットワーク設計を採用しています。 + + + +サードアプローチでは、Transformerと畳み込みを組み合わせたものもあります(例:[Convolutional Vision Transformer](model_doc/cvt)または[LeViT](model_doc/levit))。これらについては議論しませんが、これらはここで調べる2つのアプローチを組み合わせています。 + + + +ViTとConvNeXTは画像分類によく使用されますが、オブジェクト検出、セグメンテーション、深度推定などの他のビジョンタスクに対しては、DETR、Mask2Former、GLPNなどが適しています。 + +### Image classification + +ViTとConvNeXTの両方を画像分類に使用できます。主な違いは、ViTが注意メカニズムを使用し、ConvNeXTが畳み込みを使用することです。 + +#### Transformer + +[ViT](model_doc/vit)は畳み込みを完全にTransformerアーキテクチャで置き換えます。元のTransformerに精通している場合、ViTの理解は既にほとんど完了しています。 + +
+ +
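+
+この後で説明するパッチ埋め込みの考え方は、コードにすると分かりやすくなります。以下は、数値を一例 (224x224 の画像、16x16 のパッチ、768 次元の埋め込み) とした最小限のスケッチです。
+
+```py
+import torch
+from torch import nn
+
+# カーネルサイズ = ストライド = パッチサイズ の Conv2d は、
+# 重なりのないパッチへの分割と線形射影を一度に行います
+patch_embed = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)
+
+pixel_values = torch.randn(1, 3, 224, 224)            # (batch, channels, height, width)
+patches = patch_embed(pixel_values)                   # (1, 768, 14, 14)
+patch_embeddings = patches.flatten(2).transpose(1, 2) # (1, 196, 768) -- 196 = (224 / 16) ** 2
+
+# BERT と同様に、学習可能な [CLS] トークンを先頭に連結します
+cls_token = nn.Parameter(torch.zeros(1, 1, 768))
+embeddings = torch.cat([cls_token.expand(pixel_values.shape[0], -1, -1), patch_embeddings], dim=1)
+print(embeddings.shape)  # torch.Size([1, 197, 768])
+```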
+ +ViTが導入した主な変更点は、画像をTransformerに供給する方法です。 + +1. 画像は正方形で重ならないパッチのシーケンスに分割され、各パッチはベクトルまたは*パッチ埋め込み*に変換されます。パッチ埋め込みは、適切な入力次元を作成するために2D畳み込み層から生成されます(基本のTransformerの場合、各パッチ埋め込みに768の値があります)。224x224ピクセルの画像がある場合、それを16x16の画像パッチに分割できます。テキストが単語にトークン化されるように、画像はパッチのシーケンスに「トークン化」されます。 + +2. *学習埋め込み*、つまり特別な `[CLS]` トークンが、BERTのようにパッチ埋め込みの先頭に追加されます。 `[CLS]` トークンの最終的な隠れた状態は、付属の分類ヘッドの入力として使用されます。他の出力は無視されます。このトークンは、モデルが画像の表現をエンコードする方法を学ぶのに役立ちます。 + +3. パッチと学習埋め込みに追加する最後の要素は*位置埋め込み*です。モデルは画像パッチがどのように並べられているかを知りませんので、位置埋め込みも学習可能で、パッチ埋め込みと同じサイズを持ちます。最後に、すべての埋め込みがTransformerエンコーダに渡されます。 + +4. 出力、具体的には `[CLS]` トークンの出力だけが、多層パーセプトロンヘッド(MLP)に渡されます。ViTの事前トレーニングの目的は単純に分類です。他の分類ヘッドと同様に、MLPヘッドは出力をクラスラベルに対するロジットに変換し、クロスエントロピー損失を計算して最も可能性の高いクラスを見つけます。 + +画像分類を試す準備はできましたか?ViTをファインチューンして推論に使用する方法を学ぶための完全な[画像分類ガイド](tasks/image_classification)をチェックしてください! + + +#### CNN + + + +このセクションでは畳み込みについて簡単に説明していますが、画像の形状とサイズがどのように変化するかを事前に理解していると役立ちます。畳み込みに慣れていない場合は、fastaiの書籍から[Convolution Neural Networks chapter](https://github.com/fastai/fastbook/blob/master/13_convolutions.ipynb)をチェックしてみてください! + + + +[ConvNeXT](model_doc/convnext)は、性能を向上させるために新しいモダンなネットワーク設計を採用したCNNアーキテクチャです。ただし、畳み込みはモデルの中核にまだあります。高レベルから見た場合、[畳み込み(convolution)](glossary#convolution)は、小さな行列(*カーネル*)が画像のピクセルの小さなウィンドウに乗算される操作です。それは特定のテクスチャや線の曲率などの特徴を計算します。その後、次のピクセルのウィンドウに移動します。畳み込みが移動する距離は*ストライド*として知られています。 + +
+ +
+ +[Convolution Arithmetic for Deep Learning](https://arxiv.org/abs/1603.07285) からの基本的なパディングやストライドのない畳み込み。 + +この出力を別の畳み込み層に供給し、各連続した層ごとに、ネットワークはホットドッグやロケットのようなより複雑で抽象的なものを学習します。畳み込み層の間には、特徴の次元を削減し、特徴の位置の変動に対してモデルをより堅牢にするためにプーリング層を追加するのが一般的です。 + +
+ +
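+
+上で説明した「畳み込み層を重ね、間にプーリング層を挟む」という構成を、ごく小さなコードで示すと次のようになります(チャネル数やカーネルサイズは説明用の一例です)。
+
+```py
+import torch
+from torch import nn
+
+features = nn.Sequential(
+    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # エッジやテクスチャなどの低レベルな特徴
+    nn.ReLU(),
+    nn.MaxPool2d(2),                              # 空間次元を半分にし、位置の変動に頑健にする
+    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # より抽象的な特徴
+    nn.ReLU(),
+    nn.MaxPool2d(2),
+)
+
+image = torch.randn(1, 3, 224, 224)
+print(features(image).shape)  # torch.Size([1, 32, 56, 56])
+```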
+ +ConvNeXTは、以下の5つの方法でCNNをモダン化しています。 + +1. 各ステージのブロック数を変更し、画像をより大きなストライドと対応するカーネルサイズで*パッチ化*します。重ならないスライディングウィンドウは、これにより画像をパッチに分割するViTの戦略と似ています。 + +2. *ボトルネック* レイヤーはチャネル数を縮小し、それを復元します。1x1の畳み込みを実行するのは速く、深さを増やすことができます。逆ボトルネックは逆のことを行い、チャネル数を拡張し、それを縮小します。これはメモリ効率が高いです。 + +3. ボトルネックレイヤー内の通常の3x3の畳み込み層を、*深度方向の畳み込み*で置き換えます。これは各入力チャネルに個別に畳み込みを適用し、最後にそれらを積み重ねる畳み込みです。これにより、性能向上のためにネットワーク幅が広がります。 + +4. ViTはグローバル受容野を持っているため、その注意メカニズムのおかげで一度に画像の多くを見ることができます。ConvNeXTはこの効果を再現しようとし、カーネルサイズを7x7に増やします。 + +5. ConvNeXTはまた、Transformerモデルを模倣するいくつかのレイヤーデザイン変更を行っています。アクティベーションと正規化レイヤーが少なく、活性化関数はReLUの代わりにGELUに切り替え、BatchNormの代わりにLayerNormを使用しています。 + +畳み込みブロックからの出力は、分類ヘッドに渡され、出力をロジットに変換し、最も可能性の高いラベルを見つけるためにクロスエントロピー損失が計算されます。 + +### Object detection + +[DETR](model_doc/detr)、*DEtection TRansformer*、はCNNとTransformerエンコーダーデコーダーを組み合わせたエンドツーエンドのオブジェクト検出モデルです。 + +
+ +
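+
+この後で説明するように、DETR は固定数のオブジェクトクエリに対するセット予測を 1 回のパスで出力します。以下はその出力形式を確認するだけの参考スケッチです(チェックポイント名は一例で、バックボーンの読み込みに `timm` が必要になる場合があります)。
+
+```py
+import torch
+from transformers import DetrForObjectDetection
+
+model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
+
+pixel_values = torch.randn(1, 3, 800, 800)  # ダミー画像
+with torch.no_grad():
+    outputs = model(pixel_values=pixel_values)
+
+# オブジェクトクエリ (既定では 100 個) ごとにクラスロジットとバウンディングボックスが返されます
+print(outputs.logits.shape)      # (batch, num_queries, クラス数 + 1) -- +1 は "no object"
+print(outputs.pred_boxes.shape)  # (batch, num_queries, 4)
+```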
+ +1. 事前トレーニングされたCNN *バックボーン* は、ピクセル値で表される画像を受け取り、それの低解像度の特徴マップを作成します。特徴マップには次元削減のために1x1の畳み込みが適用され、高レベルの画像表現を持つ新しい特徴マップが作成されます。Transformerは連続モデルであるため、特徴マップは特徴ベクトルのシーケンスに平坦化され、位置エンベディングと組み合わせられます。 + +2. 特徴ベクトルはエンコーダーに渡され、その注意レイヤーを使用して画像表現を学習します。次に、エンコーダーの隠れ状態はデコーダーの*オブジェクトクエリ*と組み合わされます。オブジェクトクエリは、画像の異なる領域に焦点を当てる学習埋め込みで、各注意レイヤーを進行するにつれて更新されます。デコーダーの隠れ状態は、各オブジェクトクエリに対してバウンディングボックスの座標とクラスラベルを予測するフィードフォワードネットワークに渡されます。または、存在しない場合は `no object` が渡されます。 + + DETRは各オブジェクトクエリを並行してデコードして、*N*の最終的な予測(*N*はクエリの数)を出力します。典型的な自己回帰モデルが1つの要素を1回ずつ予測するのとは異なり、オブジェクト検出はセット予測タスク(`バウンディングボックス`、`クラスラベル`)であり、1回のパスで*N*の予測を行います。 + +3. 訓練中、DETRは*二部マッチング損失*を使用して、固定された数の予測と固定された一連の正解ラベルを比較します。 *N*のラベルセットに正解ラベルが少ない場合、 `no object` クラスでパディングされます。この損失関数は、DETRに予測と正解ラベルとの間で1対1の割り当てを見つけるように促します。バウンディングボックスまたはクラスラベルのどちらかが正しくない場合、損失が発生します。同様に、DETRが存在しないオブジェクトを予測した場合、罰金が科せられます。これにより、DETRは1つの非常に顕著なオブジェクトに焦点を当てるのではなく、画像内の他のオブジェクトを見つけるように促されます。 + +DETRの上にオブジェクト検出ヘッドを追加して、クラスラベルとバウンディングボックスの座標を見つけます。オブジェクト検出ヘッドには2つのコンポーネントがあります:デコーダーの隠れ状態をクラスラベルのロジットに変換するための線形層、およびバウンディングボックスを予測するためのMLPです。 + +オブジェクト検出を試す準備はできましたか?DETROの完全な[オブジェクト検出ガイド](tasks/object_detection)をチェックして、DETROのファインチューニング方法と推論方法を学んでください! + +### Image segmentation + +[Mask2Former](model_doc/mask2former)は、すべての種類の画像セグメンテーションタスクを解決するためのユニバーサルアーキテクチャです。従来のセグメンテーションモデルは通常、インスタンス、セマンティック、またはパノプティックセグメンテーションの特定のサブタスクに合わせて設計されています。Mask2Formerは、それらのタスクのそれぞれを*マスク分類*の問題として捉えます。マスク分類はピクセルを*N*のセグメントにグループ化し、与えられた画像に対して*N*のマスクとそれに対応するクラスラベルを予測します。このセクションでは、Mask2Formerの動作方法を説明し、最後にSegFormerのファインチューニングを試すことができます。 + +
+ +
+ +Mask2Formerの主要なコンポーネントは次の3つです。 + +1. [Swin](model_doc/swin)バックボーンは画像を受け入れ、3つの連続する3x3の畳み込みから低解像度の画像特徴マップを作成します。 + +2. 特徴マップは*ピクセルデコーダー*に渡され、低解像度の特徴を高解像度のピクセル埋め込みに徐々にアップサンプリングします。ピクセルデコーダーは実際には解像度1/32、1/16、および1/8のオリジナル画像のマルチスケール特徴(低解像度と高解像度の特徴を含む)を生成します。 + +3. これらの異なるスケールの特徴マップのそれぞれは、高解像度の特徴から小さいオブジェクトをキャプチャするために1回ずつトランスフォーマーデコーダーレイヤーに渡されます。Mask2Formerの要点は、デコーダーの*マスクアテンション*メカニズムです。クロスアテンションが画像全体に注意を向けることができるのに対し、マスクアテンションは画像の特定の領域にのみ焦点を当てます。これは速く、ローカルな画像特徴だけでもモデルが学習できるため、パフォーマンスが向上します。 + +4. [DETR](tasks_explained#object-detection)と同様に、Mask2Formerも学習されたオブジェクトクエリを使用し、画像の特徴と組み合わせてセットの予測(`クラスラベル`、`マスク予測`)を行います。デコーダーの隠れ状態は線形層に渡され、クラスラベルに対するロジットに変換されます。ロジットと正解ラベル間のクロスエントロピー損失が最も可能性の高いものを見つけます。 + + マスク予測は、ピクセル埋め込みと最終的なデコーダーの隠れ状態を組み合わせて生成されます。シグモイドクロスエントロピーやダイス損失がロジットと正解マスクの間で最も可能性の高いマスクを見つけます。 + +セグメンテーションタスクに取り組む準備ができましたか?SegFormerのファインチューニング方法と推論方法を学ぶために、完全な[画像セグメンテーションガイド](tasks/semantic_segmentation)をチェックしてみてください! + +### Depth estimation + +[GLPN](model_doc/glpn)、*Global-Local Path Network*、はセグメンテーションまたは深度推定などの密な予測タスクに適しています。[SegFormer](model_doc/segformer)エンコーダーを軽量デコーダーと組み合わせたTransformerベースの深度推定モデルです。 + +
+ +
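+
+以下は、GLPN を深度推定に使ったときの出力が「ピクセルごとの深度マップ」になることを確認するだけの参考スケッチです(チェックポイント名 `vinvino02/glpn-nyu` は一例です)。
+
+```py
+import torch
+from transformers import GLPNForDepthEstimation
+
+model = GLPNForDepthEstimation.from_pretrained("vinvino02/glpn-nyu")
+
+pixel_values = torch.randn(1, 3, 480, 640)  # ダミー画像 (高さ・幅は 32 の倍数を想定)
+with torch.no_grad():
+    outputs = model(pixel_values=pixel_values)
+
+# (batch, height, width) の密な深度マップが返されます (必要に応じて元画像サイズへ補間します)
+print(outputs.predicted_depth.shape)
+```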
+ +1. ViTのように、画像はパッチのシーケンスに分割されますが、これらの画像パッチは小さいです。これはセグメンテーションや深度推定などの密な予測タスクに適しています。画像パッチはパッチ埋め込みに変換されます(パッチ埋め込みの作成方法の詳細については、[画像分類](#image-classification)セクションを参照してください)。これらのパッチ埋め込みはエンコーダーに渡されます。 + +2. エンコーダーはパッチ埋め込みを受け入れ、複数のエンコーダーブロックを通じてそれらを渡します。各ブロックにはアテンションとMix-FFNレイヤーが含まれています。後者の役割は位置情報を提供することです。各エンコーダーブロックの最後には、階層的表現を作成するための*パッチマージング*レイヤーがあります。隣接するパッチのグループごとの特徴が連結され、連結された特徴に対して線形層が適用され、パッチの数を1/4の解像度に削減します。これが次のエンコーダーブロックへの入力となり、ここではこのプロセス全体が繰り返され、元の画像の1/8、1/16、および1/32の解像度の画像特徴が得られます。 + +3. 軽量デコーダーは、エンコーダーからの最後の特徴マップ(1/32スケール)を受け取り、それを1/16スケールにアップサンプリングします。その後、特徴は各特徴に対するアテンションマップからローカルとグローバルな特徴を選択して組み合わせる*セレクティブフィーチャーフュージョン(SFF)*モジュールに渡され、1/8にアップサンプリングされます。このプロセスはデコードされた特徴が元の画像と同じサイズになるまで繰り返されます。 + +4. デコードされた特徴は、最終的な予測を行うためにセマンティックセグメンテーション、深度推定、またはその他の密な予測タスクに供給されます。セマンティックセグメンテーションの場合、特徴はクラス数に対するロジットに変換され、クロスエントロピー損失を使用して最適化されます。深度推定の場合、特徴は深度マップに変換され、平均絶対誤差(MAE)または平均二乗誤差(MSE)損失が使用されます。 + + + +## Natural language processing + +Transformerは最初に機械翻訳のために設計され、それ以降、ほとんどのNLPタスクを解決するためのデフォルトのアーキテクチャとなっています。一部のタスクはTransformerのエンコーダー構造に適しており、他のタスクはデコーダーに適しています。さらに、一部のタスクではTransformerのエンコーダー-デコーダー構造を使用します。 + +### Text classification + +[BERT](model_doc/bert)はエンコーダーのみのモデルであり、テキストの豊かな表現を学習するために両側の単語に注意を払うことで、深い双方向性を効果的に実装した最初のモデルです。 + +1. BERTは[WordPiece](tokenizer_summary#wordpiece)トークナイゼーションを使用してテキストのトークン埋め込みを生成します。単一の文と文のペアを区別するために、特別な `[SEP]` トークンが追加されます。 `[CLS]` トークンはすべてのテキストシーケンスの先頭に追加されます。 `[CLS]` トークンとともに最終出力は、分類タスクのための入力として使用されます。BERTはまた、トークンが文のペアの最初または2番目の文に属するかどうかを示すセグメント埋め込みを追加します。 + +2. BERTは、事前トレーニングで2つの目標を使用します:マスクされた言語モデリングと次の文の予測です。マスクされた言語モデリングでは、入力トークンの一部がランダムにマスクされ、モデルはこれらを予測する必要があります。これにより、モデルが全ての単語を見て「次の単語」を予測することができる双方向性の問題が解決されます。予測されたマスクトークンの最終的な隠れた状態は、ソフトマックスを使用した単語のマスクを予測するためのフィードフォワードネットワークに渡されます。 + + 2番目の事前トレーニングオブジェクトは次の文の予測です。モデルは文Aの後に文Bが続くかどうかを予測する必要があります。半分の場合、文Bは次の文であり、残りの半分の場合、文Bはランダムな文です。予測(次の文かどうか)は、2つのクラス(`IsNext`および`NotNext`)に対するソフトマックスを持つフィードフォワードネットワークに渡されます。 + +3. 入力埋め込みは、最終的な隠れた状態を出力するために複数のエンコーダーレイヤーを介して渡されます。 + +事前訓練済みモデルをテキスト分類に使用するには、ベースのBERTモデルの上にシーケンス分類ヘッドを追加します。シーケンス分類ヘッドは最終的な隠れた状態を受け入れ、それらをロジットに変換するための線形層です。クロスエントロピー損失は、ロジットとターゲット間で最も可能性の高いラベルを見つけるために計算されます。 + +テキスト分類を試してみる準備はできましたか?DistilBERTを微調整し、推論に使用する方法を学ぶために、完全な[テキスト分類ガイド](tasks/sequence_classification)をチェックしてみてください! + +### Token classification + +BERTを名前エンティティ認識(NER)などのトークン分類タスクに使用するには、ベースのBERTモデルの上にトークン分類ヘッドを追加します。トークン分類ヘッドは最終的な隠れた状態を受け入れ、それらをロジットに変換するための線形層です。クロスエントロピー損失は、ロジットと各トークン間で最も可能性の高いラベルを見つけるために計算されます。 + +トークン分類を試してみる準備はできましたか?DistilBERTを微調整し、推論に使用する方法を学ぶために、完全な[トークン分類ガイド](tasks/token_classification)をチェックしてみてください! + +### Question answering + +BERTを質問応答に使用するには、ベースのBERTモデルの上にスパン分類ヘッドを追加します。この線形層は最終的な隠れた状態を受け入れ、回答に対応するテキストの「スパン」開始と終了のロジットを計算します。クロスエントロピー損失は、ロジットとラベル位置との間で最も可能性の高いテキストスパンを見つけるために計算されます。 + +質問応答を試してみる準備はできましたか?DistilBERTを微調整し、推論に使用する方法を学ぶために、完全な[質問応答ガイド](tasks/question_answering)をチェックしてみてください! + + + +💡 注意してください。一度事前トレーニングが完了したBERTを使用してさまざまなタスクに簡単に適用できることに注目してください。必要なのは、事前トレーニング済みモデルに特定のヘッドを追加して、隠れた状態を所望の出力に変換することだけです! + + + +### Text generation + +[GPT-2](model_doc/gpt2)は大量のテキストで事前トレーニングされたデコーダー専用モデルです。プロンプトを与えると説得力のあるテキストを生成し、明示的にトレーニングされていないにもかかわらず、質問応答などの他のNLPタスクも完了できます。 + +
+ +
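+
+この後の説明にある「ラベルを 1 トークン分シフトして次のトークンを予測する」という点は、次の小さな例で確認できます(チェックポイント名 `gpt2` と入力文は一例です)。
+
+```py
+import torch
+from transformers import AutoTokenizer, GPT2LMHeadModel
+
+tokenizer = AutoTokenizer.from_pretrained("gpt2")
+model = GPT2LMHeadModel.from_pretrained("gpt2")
+
+inputs = tokenizer("Transformers solve many NLP tasks", return_tensors="pt")
+
+# labels に input_ids をそのまま渡すと、モデル内部でロジットとラベルが 1 トークン分シフトされ、
+# 「次のトークンを予測する」クロスエントロピー損失が計算されます
+with torch.no_grad():
+    outputs = model(**inputs, labels=inputs["input_ids"])
+
+print(outputs.loss)          # 因果言語モデリングの損失
+print(outputs.logits.shape)  # (batch, sequence_length, vocab_size)
+```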
+ +1. GPT-2は[バイトペアエンコーディング(BPE)](tokenizer_summary#bytepair-encoding-bpe)を使用して単語をトークナイズし、トークン埋め込みを生成します。位置エンコーディングがトークン埋め込みに追加され、各トークンの位置を示します。入力埋め込みは複数のデコーダーブロックを介して最終的な隠れた状態を出力するために渡されます。各デコーダーブロック内で、GPT-2は「マスクされた自己注意」レイヤーを使用します。これは、GPT-2が未来のトークンに注意を払うことはできないことを意味します。GPT-2は左側のトークンにのみ注意を払うことが許可されています。これはBERTの[`mask`]トークンとは異なり、マスクされた自己注意では未来のトークンに対してスコアを`0`に設定するための注意マスクが使用されます。 + +2. デコーダーからの出力は、言語モデリングヘッドに渡され、最終的な隠れた状態をロジットに変換するための線形変換を実行します。ラベルはシーケンス内の次のトークンであり、これはロジットを右に1つずらして生成されます。クロスエントロピー損失は、シフトされたロジットとラベル間で計算され、次に最も可能性の高いトークンを出力します。 + +GPT-2の事前トレーニングの目標は完全に[因果言語モデリング](glossary#causal-language-modeling)に基づいており、シーケンス内の次の単語を予測します。これにより、GPT-2はテキスト生成を含むタスクで特に優れた性能を発揮します。 + +テキスト生成を試してみる準備はできましたか?DistilGPT-2を微調整し、推論に使用する方法を学ぶために、完全な[因果言語モデリングガイド](tasks/language_modeling#causal-language-modeling)をチェックしてみてください! + + + +テキスト生成に関する詳細は、[テキスト生成戦略](generation_strategies)ガイドをチェックしてみてください! + + + + +### Summarization + +[BART](model_doc/bart) や [T5](model_doc/t5) のようなエンコーダーデコーダーモデルは、要約タスクのシーケンス・トゥ・シーケンス・パターンに設計されています。このセクションでは、BARTの動作方法を説明し、最後にT5の微調整を試すことができます。 + +
+ +
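+
+この後で説明するテキストインフィリング (複数トークンにまたがるスパンを単一の `<mask>` トークンで置き換え、デコーダーに元のテキストを復元させる) は、次のように試すことができます(チェックポイントと入力文は一例です)。
+
+```py
+from transformers import AutoTokenizer, BartForConditionalGeneration
+
+tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
+model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
+
+# 複数トークン分のスパンを単一の <mask> に置き換えた入力
+inputs = tokenizer("UN Chief Says There Is No <mask> in Syria", return_tensors="pt")
+
+# デコーダーが欠落したスパンを補いながら元のテキストを生成します
+generated_ids = model.generate(inputs["input_ids"], max_new_tokens=20)
+print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
+```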
+ +1. BARTのエンコーダーアーキテクチャは、BERTと非常に似ており、テキストのトークンと位置エンベディングを受け入れます。BARTは、入力を破壊してからデコーダーで再構築することによって事前トレーニングされます。特定の破壊戦略を持つ他のエンコーダーとは異なり、BARTは任意の種類の破壊を適用できます。ただし、*テキストインフィリング*破壊戦略が最適です。テキストインフィリングでは、いくつかのテキストスパンが**単一の** [`mask`] トークンで置き換えられます。これは重要です、なぜならモデルはマスクされたトークンを予測しなければならず、モデルに欠落トークンの数を予測させるからです。入力埋め込みとマスクされたスパンはエンコーダーを介して最終的な隠れた状態を出力しますが、BERTとは異なり、BARTは単語を予測するための最終的なフィードフォワードネットワークを最後に追加しません。 + +2. エンコーダーの出力はデコーダーに渡され、デコーダーはエンコーダーの出力からマスクされたトークンと非破壊トークンを予測する必要があります。これにより、デコーダーは元のテキストを復元するのに役立つ追加のコンテキストが提供されます。デコーダーからの出力は言語モデリングヘッドに渡され、隠れた状態をロジットに変換するための線形変換を実行します。クロスエントロピー損失は、ロジットとラベルの間で計算され、ラベルは単に右にシフトされたトークンです。 + +要約を試す準備はできましたか?T5を微調整して推論に使用する方法を学ぶために、完全な[要約ガイド](tasks/summarization)をご覧ください! + + + +テキスト生成に関する詳細は、[テキスト生成戦略](generation_strategies)ガイドをチェックしてみてください! + + + +### Translation + +翻訳は、もう一つのシーケンス・トゥ・シーケンス・タスクの例であり、[BART](model_doc/bart) や [T5](model_doc/t5) のようなエンコーダーデコーダーモデルを使用して実行できます。このセクションでは、BARTの動作方法を説明し、最後にT5の微調整を試すことができます。 + +BARTは、ソース言語をターゲット言語にデコードできるようにするために、別個にランダムに初期化されたエンコーダーを追加することで翻訳に適応します。この新しいエンコーダーの埋め込みは、元の単語埋め込みの代わりに事前トレーニング済みのエンコーダーに渡されます。ソースエンコーダーは、モデルの出力からのクロスエントロピー損失を用いてソースエンコーダー、位置エンベディング、および入力エンベディングを更新することによって訓練されます。この最初のステップではモデルパラメータが固定され、すべてのモデルパラメータが2番目のステップで一緒に訓練されます。 + +その後、翻訳のために多言語版のmBARTが登場し、多言語で事前トレーニングされたモデルとして利用可能です。 + +翻訳を試す準備はできましたか?T5を微調整して推論に使用する方法を学ぶために、完全な[翻訳ガイド](tasks/summarization)をご覧ください! + + + +テキスト生成に関する詳細は、[テキスト生成戦略](generation_strategies)ガイドをチェックしてみてください! + + diff --git a/docs/source/ja/testing.md b/docs/source/ja/testing.md new file mode 100644 index 00000000000000..a7b357acd66e7e --- /dev/null +++ b/docs/source/ja/testing.md @@ -0,0 +1,1214 @@ + + +# Testing + +🤗 Transformersモデルがどのようにテストされ、新しいテストを書いて既存のテストを改善できるかを見てみましょう。 + +このリポジトリには2つのテストスイートがあります: + +1. `tests` -- 一般的なAPI用のテスト +2. `examples` -- APIの一部ではないさまざまなアプリケーション用のテスト + +## How transformers are tested + +1. PRが提出されると、9つのCircleCiジョブでテストされます。PRへの新しいコミットごとに再テストされます。これらのジョブは、[この設定ファイル](https://github.com/huggingface/transformers/tree/main/.circleci/config.yml)で定義されており、必要な場合は同じ環境を自分のマシンで再現できます。 + + これらのCIジョブは `@slow` テストを実行しません。 + +2. 
[GitHub Actions](https://github.com/huggingface/transformers/actions)によって実行される3つのジョブがあります: + + - [torch hub integration](https://github.com/huggingface/transformers/tree/main/.github/workflows/github-torch-hub.yml): torch hubの統合が動作するかどうかを確認します。 + + - [self-hosted (push)](https://github.com/huggingface/transformers/tree/main/.github/workflows/self-push.yml): `main` にコミットが行われた場合に、GPUで高速テストを実行します。このジョブは、`main` でのコミットが以下のフォルダーのコードを更新した場合にのみ実行されます:`src`、`tests`、`.github`(追加されたモデルカード、ノートブックなどの実行を防ぐため)。 + + - [self-hosted runner](https://github.com/huggingface/transformers/tree/main/.github/workflows/self-scheduled.yml): GPUで `tests` と `examples` の通常のテストと遅いテストを実行します。 + +```bash +RUN_SLOW=1 pytest tests/ +RUN_SLOW=1 pytest examples/ +``` + 結果は[here](https://github.com/huggingface/transformers/actions)で観察できます。 + +## Running tests + + + +### Choosing which tests to run + +このドキュメントは、テストを実行する方法の多くの詳細について説明しています。すべてを読んだ後でも、さらに詳細が必要な場合は、[こちら](https://docs.pytest.org/en/latest/usage.html)で見つけることができます。 + +以下は、テストを実行するためのいくつかの最も便利な方法です。 + +すべて実行します: +```console +pytest +``` + +または: +```bash +make test +``` + + +後者は次のように定義されることに注意してください。 + +```bash +python -m pytest -n auto --dist=loadfile -s -v ./tests/ +``` + +以下は、pytestに渡す設定情報です。 + +- テストプロセスをCPUコアの数と同じだけ実行するように指示します。ただし、RAMが十分でない場合は注意が必要です。 +- 同じファイルからのすべてのテストは、同じテストプロセスで実行されるようにします。 +- 出力のキャプチャを行いません。 +- 冗長モードで実行します。 + + +### Getting the list of all tests + +テストスイートのすべてのテスト: + +```bash +pytest --collect-only -q +``` + +指定されたテスト ファイルのすべてのテスト: + +```bash +pytest tests/test_optimization.py --collect-only -q +``` + +### Run a specific test module + +個別のテスト モジュールを実行するには: + +```bash +pytest tests/utils/test_logging.py +``` + +### Run specific tests + +ほとんどのテストでunittestが使用されているため、特定のサブテストを実行するには、それらのテストを含むunittestクラスの名前を知っている必要があります。例えば、それは次のようになるかもしれません: + + +```bash +pytest tests/test_optimization.py::OptimizationTest::test_adam_w +``` + +テストの実行方法: + +テストファイル: `tests/test_optimization.py` +クラス名: `OptimizationTest` +テスト関数の名前: `test_adam_w` + +ファイルに複数のクラスが含まれている場合は、特定のクラスのテストのみを実行することを選択できます。例えば: + +```bash +pytest tests/test_optimization.py::OptimizationTest +``` + +テストクラス内のすべてのテストを実行します。 + +前述の通り、`OptimizationTest` クラスに含まれるテストを実行するには、次のコマンドを実行できます: + +```bash +pytest tests/test_optimization.py::OptimizationTest --collect-only -q +``` + +キーワード式を使用してテストを実行できます。 + +名前に `adam` が含まれるテストのみを実行するには: + +```bash +pytest -k adam tests/test_optimization.py +``` + +`and`および`or`は、すべてのキーワードが一致するか、いずれかを示すために使用できます。`not`は否定するために使用できます。 + +`adam`という名前を含むテストを除いてすべてのテストを実行するには: + +```bash +pytest -k "not adam" tests/test_optimization.py +``` + + +以下は、提供されたテキストの日本語訳です。 + +```bash +pytest -k "ada and not adam" tests/test_optimization.py +``` + +たとえば、`test_adafactor`と`test_adam_w`の両方を実行するには、以下のコマンドを使用できます: + +```bash +pytest -k "test_adam_w or test_adam_w" tests/test_optimization.py +``` + +注意: ここでは、`or` を使用しています。キーワードのいずれか一つが一致すれば、両方を含めるためです。 + +両方のパターンを含むテストのみを含めたい場合は、`and` を使用してください。 + +```bash +pytest -k "test and ada" tests/test_optimization.py +``` + +### Run `accelerate` tests + +時々、モデルに対して `accelerate` テストを実行する必要があります。たとえば、`OPT` 実行に対してこれらのテストを実行したい場合、コマンドに `-m accelerate_tests` を追加するだけで済みます: + +```bash +RUN_SLOW=1 pytest -m accelerate_tests tests/models/opt/test_modeling_opt.py +``` + +### Run documentation tests + +ドキュメンテーションの例が正しいかどうかをテストするには、`doctests` が合格しているかを確認する必要があります。 +例として、[`WhisperModel.forward` のドックストリング](https://github.com/huggingface/transformers/blob/main/src/transformers/models/whisper/modeling_whisper.py#L1017-L1035)を使用しましょう。 + + +```python +r""" 
+Returns: + +Example: + ```python + >>> import torch + >>> from transformers import WhisperModel, WhisperFeatureExtractor + >>> from datasets import load_dataset + + >>> model = WhisperModel.from_pretrained("openai/whisper-base") + >>> feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base") + >>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") + >>> inputs = feature_extractor(ds[0]["audio"]["array"], return_tensors="pt") + >>> input_features = inputs.input_features + >>> decoder_input_ids = torch.tensor([[1, 1]]) * model.config.decoder_start_token_id + >>> last_hidden_state = model(input_features, decoder_input_ids=decoder_input_ids).last_hidden_state + >>> list(last_hidden_state.shape) + [1, 2, 512] + ```""" + +``` + +指定したファイル内のすべてのドックストリング例を自動的にテストするために、以下の行を実行してください: + +```bash +pytest --doctest-modules +``` + +ファイルにマークダウン拡張子がある場合は、`--doctest-glob="*.md"`引数を追加する必要があります。 + + +### Run only modified tests + +[pytest-picked](https://github.com/anapaulagomes/pytest-picked)を使用すると、未ステージングのファイルまたは現在のブランチ(Gitに従って)に関連するテストを実行できます。これは、変更内容に関連するテストのみ実行されるため、変更が何も壊れていないことを迅速に確認する素晴らしい方法です。変更されていないファイルに関連するテストは実行されません。 + +```bash +pip install pytest-picked +``` + +```bash +pytest --picked +``` + +すべてのテストは、変更されたがまだコミットされていないファイルとフォルダから実行されます。 + +### Automatically rerun failed tests on source modification + +[pytest-xdist](https://github.com/pytest-dev/pytest-xdist)は、非常に便利な機能を提供しており、すべての失敗したテストを検出し、ファイルを修正する間にそれらの失敗したテストを連続して再実行することができます。そのため、修正を行った後にpytestを再起動する必要がありません。すべてのテストが合格するまで繰り返され、その後再度フルランが実行されます。 + + +```bash +pip install pytest-xdist +``` + +モードに入るには: `pytest -f`または`pytest --looponfail` + +ファイルの変更は、`looponfailroots`ルートディレクトリとその内容全体(再帰的に)を見て検出されます。この値のデフォルトが機能しない場合、`setup.cfg`で設定オプションを変更してプロジェクト内で変更できます。 + + +```ini +[tool:pytest] +looponfailroots = transformers tests +``` + +または `pytest.ini`/`tox.ini` ファイル: + +```ini +[pytest] +looponfailroots = transformers tests +``` + +ファイルの変更を探すことは、iniファイルのディレクトリを基準にして指定されたディレクトリ内でのみ行われます。 + +[pytest-watch](https://github.com/joeyespo/pytest-watch) は、この機能の代替実装です。 + +### Skip a test module + +特定のテストモジュールを除外してすべてのテストモジュールを実行したい場合、実行するテストの明示的なリストを指定することができます。例えば、`test_modeling_*.py` テストを除外してすべてを実行するには次のようにします: + +```bash +pytest *ls -1 tests/*py | grep -v test_modeling* +``` + +### Clearing state + +CIビルドおよび速度に対する隔離が重要な場合(キャッシュに対して)、キャッシュをクリアする必要があります: + +```bash +pytest --cache-clear tests +``` + +### Running tests in parallel + +前述のように、`make test` は `pytest-xdist` プラグインを介してテストを並列実行します(`-n X` 引数、例: `-n 2` で2つの並列ジョブを実行)。 + +`pytest-xdist` の `--dist=` オプションを使用すると、テストがどのようにグループ化されるかを制御できます。`--dist=loadfile` は同じファイルにあるテストを同じプロセスに配置します。 + +テストの実行順序が異なり予測不可能であるため、`pytest-xdist` を使用してテストスイートを実行すると失敗が発生する場合(つまり、いくつかの未検出の連動テストがある場合)、[pytest-replay](https://github.com/ESSS/pytest-replay) を使用してテストを同じ順序で再生し、その後、失敗するシーケンスを最小限にするのに役立ちます。 + +### Test order and repetition + +潜在的な相互依存性や状態に関連するバグ(ティアダウン)を検出するために、テストを複数回、連続して、ランダムに、またはセットで繰り返すことは有用です。そして、単純な複数回の繰り返しは、DLのランダム性によって明らかになるいくつかの問題を検出するのに役立ちます。 + +#### Repeat tests + +- [pytest-flakefinder](https://github.com/dropbox/pytest-flakefinder): + +```bash +pip install pytest-flakefinder +``` + +そして、すべてのテストを複数回実行します (デフォルトでは 50 回)。 + +```bash +pytest --flake-finder --flake-runs=5 tests/test_failing_test.py +``` + + + +このプラグインは、`pytest-xdist` の `-n` フラグでは動作しません。 + + + + + + +別のプラグイン `pytest-repeat` もありますが、これは `unittest` では動作しません。 + + + +#### Run tests in a random order + +```bash +pip install pytest-random-order +``` + +重要: `pytest-random-order` 
が存在すると、テストは自動的にランダム化されます。設定の変更や変更は必要ありません。 +コマンドラインオプションは必須です。 + +前に説明したように、これにより、結合されたテスト (1 つのテストの状態が別のテストの状態に影響を与える) の検出が可能になります。いつ +`pytest-random-order` がインストールされていると、そのセッションに使用されたランダム シードが出力されます。例: + + +```bash +pytest tests +[...] +Using --random-order-bucket=module +Using --random-order-seed=573663 +``` + +そのため、指定された特定のシーケンスが失敗した場合、その正確なシードを追加することでそれを再現できます。例: + +```bash +pytest --random-order-seed=573663 +[...] +Using --random-order-bucket=module +Using --random-order-seed=573663 +``` + +特定のテストのリストを使用しない場合、またはまったくリストを使用しない場合、同じテストの正確な順序を再現します。テストのリストを手動で絞り込み始めると、シードに依存せず、テストが失敗した正確な順序で手動でリストを指定する必要があります。これには、`--random-order-bucket=none` を使用してランダム化を無効にするようpytestに指示する必要があります。例えば、次のようにします: + + +```bash +pytest --random-order-bucket=none tests/test_a.py tests/test_c.py tests/test_b.py +``` + +すべてのテストのシャッフルを無効にするには: + +```bash +pytest --random-order-bucket=none +``` + +デフォルトでは、`--random-order-bucket=module` が暗黙的に適用され、モジュールレベルでファイルをシャッフルします。また、`class`、`package`、`global`、および`none` レベルでシャッフルすることもできます。詳細については、その[ドキュメンテーション](https://github.com/jbasko/pytest-random-order)を参照してください。 + +別のランダム化の代替手段は、[`pytest-randomly`](https://github.com/pytest-dev/pytest-randomly) です。このモジュールは非常に似た機能/インターフェースを持っていますが、`pytest-random-order` で利用可能なバケットモードを持っていません。インストール後に自動的に有効になるという同じ問題があります。 + +### Look and feel variations + +#### pytest-sugar + +[pytest-sugar](https://github.com/Frozenball/pytest-sugar) は、外観と操作性を向上させ、プログレスバーを追加し、即座に失敗したテストとアサーションを表示するプラグインです。インストール後に自動的にアクティブ化されます。 + +```bash +pip install pytest-sugar +``` + + +これを使用せずにテストを実行するには、次を実行します。 + +```bash +pytest -p no:sugar +``` + +またはアンインストールします。 + +#### Report each sub-test name and its progress + +`pytest` による単一またはグループのテストの場合 (`pip install pytest-pspec` の後): + + +```bash +pytest --pspec tests/test_optimization.py +``` + + +#### Instantly shows failed tests + +[pytest-instafail](https://github.com/pytest-dev/pytest-instafail) では、失敗とエラーが即座に表示されます。 +テストセッションが終了するまで待機します。 + +```bash +pip install pytest-instafail +``` + +```bash +pytest --instafail +``` + +### To GPU or not to GPU + +GPU が有効な設定で、CPU のみモードでテストするには、`CUDA_VISIBLE_DEVICES=""`を追加します。 + +```bash +CUDA_VISIBLE_DEVICES="" pytest tests/utils/test_logging.py +``` + + +または、複数の GPU がある場合は、`pytest` でどれを使用するかを指定できます。たとえば、 +2 番目の GPU GPU `0` と `1` がある場合は、次を実行できます。 + +```bash +CUDA_VISIBLE_DEVICES="1" pytest tests/utils/test_logging.py +``` + +これは、異なるGPUで異なるタスクを実行したい場合に便利です。 + +一部のテストはCPUのみで実行する必要があり、他のテストはCPU、GPU、またはTPUで実行する必要があり、また別のテストは複数のGPUで実行する必要があります。次のスキップデコレーターは、テストのCPU/GPU/TPUに関する要件を設定するために使用されます: + +- `require_torch` - このテストはtorchの下でのみ実行されます。 +- `require_torch_gpu` - `require_torch` に加えて、少なくとも1つのGPUが必要です。 +- `require_torch_multi_gpu` - `require_torch` に加えて、少なくとも2つのGPUが必要です。 +- `require_torch_non_multi_gpu` - `require_torch` に加えて、0または1つのGPUが必要です。 +- `require_torch_up_to_2_gpus` - `require_torch` に加えて、0、1、または2つのGPUが必要です。 +- `require_torch_tpu` - `require_torch` に加えて、少なくとも1つのTPUが必要です。 + +以下の表にGPUの要件を示します: + +| n gpus | decorator | +|--------+--------------------------------| +| `>= 0` | `@require_torch` | +| `>= 1` | `@require_torch_gpu` | +| `>= 2` | `@require_torch_multi_gpu` | +| `< 2` | `@require_torch_non_multi_gpu` | +| `< 3` | `@require_torch_up_to_2_gpus` | + + +たとえば、使用可能な GPU が 2 つ以上あり、pytorch がインストールされている場合にのみ実行する必要があるテストを次に示します。 + + +```python no-style +@require_torch_multi_gpu +def test_example_with_multi_gpu(): +``` + +テストに `tensorflow` が必要な場合は、`require_tf` デコレータを使用します。例えば: + +```python no-style +@require_tf +def test_tf_thing_with_tensorflow(): +``` + 
+これらのデコレータは積み重ねることができます。たとえば、テストが遅く、pytorch で少なくとも 1 つの GPU が必要な場合は、次のようになります。 +設定方法: + +```python no-style +@require_torch_gpu +@slow +def test_example_slow_on_gpu(): +``` + +`@parametrized` のような一部のデコレータはテスト名を書き換えるため、`@require_*` スキップ デコレータをリストする必要があります。 +最後にそれらが正しく動作するようにします。正しい使用例は次のとおりです + +```python no-style +@parameterized.expand(...) +@require_torch_multi_gpu +def test_integration_foo(): +``` + +この順序の問題は `@pytest.mark.parametrize` には存在しません。最初または最後に配置しても、それでも問題は解決されます。 +仕事。ただし、それは非単体テストでのみ機能します。 + +内部テスト: + +- 利用可能な GPU の数: + +```python +from transformers.testing_utils import get_gpu_count + +n_gpu = get_gpu_count() # works with torch and tf +``` + +### Testing with a specific PyTorch backend or device + +特定のtorchデバイスでテストスイートを実行するには、`TRANSFORMERS_TEST_DEVICE="$device"` を追加します。ここで `$device` は対象のバックエンドです。例えば、CPUでテストするには以下のようにします: + +```bash +TRANSFORMERS_TEST_DEVICE="cpu" pytest tests/utils/test_logging.py +``` + +この変数は、`mps`などのカスタムまたはあまり一般的ではない PyTorch バックエンドをテストするのに役立ちます。また、特定の GPU をターゲットにしたり、CPU 専用モードでテストしたりすることで、`CUDA_VISIBLE_DEVICES`と同じ効果を達成するために使用することもできます。 + +特定のデバイスでは、初めて「torch」をインポートした後、追加のインポートが必要になります。これは、環境変数 `TRANSFORMERS_TEST_BACKEND` を使用して指定できます。 + +```bash +TRANSFORMERS_TEST_BACKEND="torch_npu" pytest tests/utils/test_logging.py +``` + +### Distributed training + +`pytest` は直接的に分散トレーニングを処理することはできません。試みると、サブプロセスは正しい処理を行わず、自分自身が `pytest` であると思い込んでテストスイートをループで実行し続けます。ただし、通常のプロセスを生成し、それから複数のワーカーを生成し、IOパイプを管理するプロセスを生成すれば機能します。 + +これを使用するいくつかのテストがあります: + +- [test_trainer_distributed.py](https://github.com/huggingface/transformers/tree/main/tests/trainer/test_trainer_distributed.py) +- [test_deepspeed.py](https://github.com/huggingface/transformers/tree/main/tests/deepspeed/test_deepspeed.py) + +実行ポイントにすぐに移動するには、これらのテスト内で `execute_subprocess_async` 呼び出しを検索してください。 + +これらのテストを実行するには、少なくとも2つのGPUが必要です: + +```bash +CUDA_VISIBLE_DEVICES=0,1 RUN_SLOW=1 pytest -sv tests/test_trainer_distributed.py +``` + +### Output capture + +テストの実行中に、`stdout` および `stderr` に送信された出力はキャプチャされます。テストまたはセットアップメソッドが失敗した場合、通常、それに対応するキャプチャされた出力が失敗のトレースバックと共に表示されます。 + +出力のキャプチャを無効にし、`stdout` と `stderr` を通常通りに取得するには、`-s` または `--capture=no` を使用してください: + +これらのテストを実行するには少なくとも2つのGPUが必要です: + +```bash +pytest -s tests/utils/test_logging.py +``` + +テスト結果を JUnit 形式の出力に送信するには: + +```bash +py.test tests --junitxml=result.xml +``` + +### Color control + +色を持たないようにする(例:黄色のテキストを白い背景に表示すると読みにくいです): + + +```bash +pytest --color=no tests/utils/test_logging.py +``` + +### Sending test report to online pastebin service + +テスト失敗ごとに URL を作成します。 + + +```bash +pytest --pastebin=failed tests/utils/test_logging.py +``` + +これにより、テスト実行情報がリモートのPasteサービスに送信され、各エラーに対してURLが提供されます。通常通りテストを選択するか、たとえば特定のエラーのみを送信したい場合は `-x` を追加で指定できます。 + +テストセッション全体のログに対するURLを作成する方法: + + +```bash +pytest --pastebin=all tests/utils/test_logging.py +``` + +## Writing tests + +🤗 transformersのテストは `unittest` を基にしていますが、 `pytest` で実行されるため、ほとんどの場合、両方のシステムの機能を使用できます。 + +[こちら](https://docs.pytest.org/en/stable/unittest.html)でサポートされている機能を読むことができますが、重要なことは、ほとんどの `pytest` のフィクスチャが動作しないことです。パラメータ化も同様ですが、似たような方法で動作する `parameterized` モジュールを使用しています。 + +### Parametrization + +同じテストを異なる引数で複数回実行する必要があることがよくあります。これはテスト内部から行うこともできますが、その場合、そのテストを単一の引数セットで実行する方法はありません。 + + +```python +# test_this1.py +import unittest +from parameterized import parameterized + + +class TestMathUnitTest(unittest.TestCase): + @parameterized.expand( + [ + ("negative", -1.5, -2.0), + ("integer", 1, 1.0), + ("large fraction", 1.6, 1), + ] + ) + def test_floor(self, name, input, expected): + 
assert_equal(math.floor(input), expected) +``` + +デフォルトでは、このテストは3回実行され、それぞれの実行で `test_floor` の最後の3つの引数がパラメータリストの対応する引数に割り当てられます。 + +そして、`negative` と `integer` パラメータのセットのみを実行することもできます: + +```bash +pytest -k "negative and integer" tests/test_mytest.py +``` + +または、`Negative`のサブテストを除くすべての場合、次のようになります。 + +```bash +pytest -k "not negative" tests/test_mytest.py +``` + +`-k` フィルターを使用することに加えて、各サブテストの正確な名前を調べ、その正確な名前を使用して任意のサブテストまたはすべてのサブテストを実行することができます。 + + +```bash +pytest test_this1.py --collect-only -q +``` + +すると次のものがリストされます: + +```bash +test_this1.py::TestMathUnitTest::test_floor_0_negative +test_this1.py::TestMathUnitTest::test_floor_1_integer +test_this1.py::TestMathUnitTest::test_floor_2_large_fraction +``` + + +したがって、2 つの特定のサブテストのみを実行できるようになりました。 + +```bash +pytest test_this1.py::TestMathUnitTest::test_floor_0_negative test_this1.py::TestMathUnitTest::test_floor_1_integer +``` + +`transformers`の開発者依存関係にすでに含まれているモジュール[parameterized](https://pypi.org/project/parameterized/) は、`unittests` と `pytest` テストの両方で機能します。 + +ただし、テストが `unittest` でない場合、`pytest.mark.parametrize` を使用することができます(または既存のテストのいくつかで、主に `examples` の下で使用されているのを見ることができます)。 + +次に、同じ例を示しますが、今度は `pytest` の `parametrize` マーカーを使用しています: + + +```python +# test_this2.py +import pytest + + +@pytest.mark.parametrize( + "name, input, expected", + [ + ("negative", -1.5, -2.0), + ("integer", 1, 1.0), + ("large fraction", 1.6, 1), + ], +) +def test_floor(name, input, expected): + assert_equal(math.floor(input), expected) +``` + +`parameterized` と同様に、`pytest.mark.parametrize` を使用すると、`-k` フィルタが役立たない場合でも、サブテストの実行を細かく制御できます。ただし、このパラメータ化関数はサブテストの名前をわずかに異なるものにします。以下にその例を示します: + + +```bash +pytest test_this2.py --collect-only -q +``` + +すると次のものがリストされます: + +```bash +test_this2.py::test_floor[integer-1-1.0] +test_this2.py::test_floor[negative--1.5--2.0] +test_this2.py::test_floor[large fraction-1.6-1] +``` + +これで、特定のテストのみを実行できるようになりました。 + +```bash +pytest test_this2.py::test_floor[negative--1.5--2.0] test_this2.py::test_floor[integer-1-1.0] +``` + +前の例と同様に。 + +### Files and directories + +テストの中で、現在のテストファイルからの相対位置を知る必要があることがよくあります。しかし、これは簡単なことではありません。なぜなら、テストは複数のディレクトリから呼び出されるか、異なる深さのサブディレクトリに存在することがあるからです。`transformers.test_utils.TestCasePlus` というヘルパークラスは、すべての基本パスを整理し、簡単にアクセスできるようにすることで、この問題を解決します。 + +- `pathlib` オブジェクト(すべて完全に解決されたもの): + + - `test_file_path` - 現在のテストファイルのパス、つまり `__file__` + - `test_file_dir` - 現在のテストファイルを含むディレクトリ + - `tests_dir` - `tests` テストスイートのディレクトリ + - `examples_dir` - `examples` テストスイートのディレクトリ + - `repo_root_dir` - リポジトリのディレクトリ + - `src_dir` - `transformers` サブディレクトリが存在する場所 + +- パスの文字列表現――上記と同じですが、これらは `pathlib` オブジェクトではなく文字列としてパスを返します: + + - `test_file_path_str` + - `test_file_dir_str` + - `tests_dir_str` + - `examples_dir_str` + - `repo_root_dir_str` + - `src_dir_str` + +これらを使用し始めるには、テストが `transformers.test_utils.TestCasePlus` のサブクラスに存在することを確認するだけです。例: + +```python +from transformers.testing_utils import TestCasePlus + + +class PathExampleTest(TestCasePlus): + def test_something_involving_local_locations(self): + data_dir = self.tests_dir / "fixtures/tests_samples/wmt_en_ro" +``` + +もし、`pathlib` を介してパスを操作する必要がない場合、または単に文字列としてパスが必要な場合は、`pathlib` オブジェクトに `str()` を呼び出すか、`_str` で終わるアクセサを使用できます。例: + +```python +from transformers.testing_utils import TestCasePlus + + +class PathExampleTest(TestCasePlus): + def test_something_involving_stringified_locations(self): + examples_dir = self.examples_dir_str +``` + +### Temporary files and directories + 
+一意の一時ファイルとディレクトリの使用は、並列テストの実行には欠かせません。これにより、テストがお互いのデータを上書きしないようにします。また、これらを作成した各テストの終了時に一時ファイルとディレクトリが削除されることを望みます。そのため、これらのニーズを満たすパッケージである `tempfile` のようなパッケージの使用は重要です。 + +しかし、テストのデバッグ時には、一時ファイルやディレクトリに何が格納されているかを確認できる必要があり、テストを再実行するたびにランダムに変更されないその正確なパスを知りたいと思います。 + +`transformers.test_utils.TestCasePlus` というヘルパークラスは、このような目的に最適です。これは `unittest.TestCase` のサブクラスであるため、テストモジュールで簡単に継承することができます。 + +以下はその使用例です: + + +```python +from transformers.testing_utils import TestCasePlus + + +class ExamplesTests(TestCasePlus): + def test_whatever(self): + tmp_dir = self.get_auto_remove_tmp_dir() +``` + +このコードはユニークな一時ディレクトリを作成し、`tmp_dir` をその場所に設定します。 + +- ユニークな一時ディレクトリを作成します: + +```python +def test_whatever(self): + tmp_dir = self.get_auto_remove_tmp_dir() +``` + +`tmp_dir` には、作成された一時ディレクトリへのパスが含まれます。期間終了後は自動的に削除されます +テスト。 + +- 任意の一時ディレクトリを作成し、テストの開始前にそれが空であることを確認し、テスト後には空にしないでください。 + +```python +def test_whatever(self): + tmp_dir = self.get_auto_remove_tmp_dir("./xxx") +``` + +これは、特定のディレクトリを監視し、前のテストがそこにデータを残さないことを確認したい場合に、デバッグに役立ちます。 + +- `before` と `after` 引数を直接オーバーライドすることで、デフォルトの動作をオーバーライドできます。以下のいずれかの動作に導きます: + + - `before=True`:テストの開始時に常に一時ディレクトリがクリアされます。 + - `before=False`:一時ディレクトリが既に存在する場合、既存のファイルはそのままになります。 + - `after=True`:テストの終了時に常に一時ディレクトリが削除されます。 + - `after=False`:テストの終了時に常に一時ディレクトリはそのままになります。 + + + +`rm -r`の相当を安全に実行するために、明示的な `tmp_dir` が使用される場合、プロジェクトリポジトリのチェックアウトのサブディレクトリのみが許可されます。誤って `/tmp` などのファイルシステムの重要な部分が削除されないように、常に `./` から始まるパスを渡してください。 + + + + + +各テストは複数の一時ディレクトリを登録でき、要求がない限りすべて自動で削除されます。 + + + +### Temporary sys.path override + +別のテストからインポートするために一時的に `sys.path` をオーバーライドする必要がある場合、`ExtendSysPath` コンテキストマネージャを使用できます。例: + + +```python +import os +from transformers.testing_utils import ExtendSysPath + +bindir = os.path.abspath(os.path.dirname(__file__)) +with ExtendSysPath(f"{bindir}/.."): + from test_trainer import TrainerIntegrationCommon # noqa +``` + +### Skipping tests + +これは、バグが見つかり、新しいテストが作成された場合であっても、バグがまだ修正されていない場合に役立ちます。メインリポジトリにコミットできるようにするには、`make test` の実行中にそれをスキップする必要があります。 + +メソッド: + +- **skip** は、テストが特定の条件が満たされた場合にのみパスすることを期待しており、それ以外の場合は pytest がテストの実行をスキップします。一般的な例は、Windows専用のテストを非Windowsプラットフォームでスキップする場合、または現在利用できない外部リソースに依存するテストをスキップする場合です(例: データベースが利用できない場合)。 + +- **xfail** は、何らかの理由でテストが失敗することを期待しています。一般的な例は、まだ実装されていない機能のテストや、まだ修正されていないバグのテストです。テストが予想される失敗にもかかわらずパスした場合(pytest.mark.xfailでマークされたテスト)、それはxpassとしてテストサマリーに報告されます。 + +これらの2つの間の重要な違いの1つは、`skip` はテストを実行しない点であり、`xfail` は実行します。したがって、バグのあるコードが他のテストに影響を与える場合は、`xfail` を使用しないでください。 + +#### Implementation + +- テスト全体を無条件にスキップする方法は次のとおりです: + + +```python no-style +@unittest.skip("this bug needs to be fixed") +def test_feature_x(): +``` + +または pytest 経由: + +```python no-style +@pytest.mark.skip(reason="this bug needs to be fixed") +``` + +または `xfail` の方法: + +```python no-style +@pytest.mark.xfail +def test_feature_x(): +``` + + +- テスト内の内部チェックに基づいてテストをスキップする方法は次のとおりです。 + +```python +def test_feature_x(): + if not has_something(): + pytest.skip("unsupported configuration") +``` + +またはモジュール全体: + +```python +import pytest + +if not pytest.config.getoption("--custom-flag"): + pytest.skip("--custom-flag is missing, skipping tests", allow_module_level=True) +``` + +または `xfail` の方法: + +```python +def test_feature_x(): + pytest.xfail("expected to fail until bug XYZ is fixed") +``` + +- 一部のインポートが欠落している場合にモジュール内のすべてのテストをスキップする方法は次のとおりです。 + +```python +docutils = pytest.importorskip("docutils", minversion="0.3") +``` + +- 条件に基づいてテストをスキップします。 + +```python no-style +@pytest.mark.skipif(sys.version_info < (3,6), reason="requires 
python3.6 or higher") +def test_feature_x(): +``` + +または: + +```python no-style +@unittest.skipIf(torch_device == "cpu", "Can't do half precision") +def test_feature_x(): +``` + + +またはモジュール全体をスキップします。 + +```python no-style +@pytest.mark.skipif(sys.platform == 'win32', reason="does not run on windows") +class TestClass(): + def test_feature_x(self): +``` + +詳細、例、および方法についての詳細は[こちら](https://docs.pytest.org/en/latest/skipping.html)を参照してください。 + +### Slow tests + +テストライブラリは着実に成長しており、テストの一部は数分かかります。そのため、CIでテストスイートの完了を待つのは1時間待つ余裕がないことがあります。したがって、いくつかの例外を除いて、遅いテストは以下の例のようにマークすべきです: + + +```python no-style +from transformers.testing_utils import slow +@slow +def test_integration_foo(): +``` + + +テストが`@slow`としてマークされたら、そのようなテストを実行するには、環境変数 `RUN_SLOW=1`を設定します。例: + +```bash +RUN_SLOW=1 pytest tests +``` + +`@parameterized` のようなデコレータはテスト名を書き換えるため、`@slow` および他のスキップデコレータ `@require_*` は正しく動作するためには、最後にリストアップする必要があります。以下は正しい使用例の一例です: + + +```python no-style +@parameterized.expand(...) +@slow +def test_integration_foo(): +``` + +このドキュメントの冒頭で説明したように、遅いテストは定期的なスケジュールに従って実行され、PRのCIチェックでは実行されません。そのため、一部の問題がPRの提出時に見落とされ、マージされる可能性があります。そのような問題は次回のスケジュールされたCIジョブで検出されます。しかし、それはまた、PRを提出する前に自分のマシンで遅いテストを実行する重要性を意味しています。 + +どのテストを遅いテストとしてマークすべきかを選択するための、おおまかな意思決定メカニズムが次に示されています: + +- テストがライブラリの内部コンポーネントの1つに焦点を当てている場合(例: モデリングファイル、トークン化ファイル、パイプライン)、そのテストは遅いテストスイートで実行する必要があります。それがライブラリの他の側面、たとえばドキュメンテーションや例に焦点を当てている場合、それらのテストは遅いテストスイートで実行する必要があります。そして、このアプローチを洗練させるために例外を設ける必要があります。 + +- 重いウェイトセットや約50MB以上のデータセットをダウンロードする必要があるすべてのテスト(例: モデル統合テスト、トークナイザ統合テスト、パイプライン統合テスト)は遅いテストとして設定する必要があります。新しいモデルを追加する場合、統合テスト用にランダムなウェイトを持つ小さなバージョンを作成し、ハブにアップロードする必要があります。これについては以下の段落で詳しく説明します。 + +- 特に高速化されていないトレーニングを行う必要があるすべてのテストは遅いテストとして設定する必要があります。 + +- 一部の「遅い」であるべきでないテストが非常に遅い場合、およびそれらを `@slow` として設定する必要がある場合には例外を導入できます。大容量のファイルをディスクに保存および読み込みする自動モデリングテストは、`@slow` としてマークされたテストの良い例です。 + +- CIで1秒未満でテストが完了する場合(ダウンロードを含む)、それは通常のテストであるべきです。 + +すべての非遅いテストは、さまざまな内部要素を完全にカバーする必要がありますが、高速である必要があります。たとえば、特別に作成された小さなモデル(レイヤー数が最小限で、語彙サイズが小さいなど)を使用して、かなりのカバレッジを実現できます。その後、`@slow` テストでは大規模な遅いモデルを使用して質的なテストを実行できます。これらを使用するには、以下のように *tiny* モデルを探してください: + + +```bash +grep tiny tests examples +``` + +[スクリプトの例](https://github.com/huggingface/transformers/tree/main/scripts/fsmt/fsmt-make-tiny-model.py)があり、これにより tiny-wmt19-en-de のような小さなモデルが作成されます。特定のモデルのアーキテクチャに簡単に調整できます。 + +実行時間を誤って測定することが簡単です。たとえば、巨大なモデルのダウンロードに関するオーバーヘッドがある場合、ローカルでテストするとダウンロードされたファイルがキャッシュされ、ダウンロード時間が計測されなくなります。したがって、CIログの実行速度レポート(`pytest --durations=0 tests` の出力)を確認してください。 + +このレポートは、遅いテストとしてマークされていない遅い外れ値や、高速に書き直す必要があるテストを見つけるのにも役立ちます。テストスイートがCIで遅くなり始めた場合、このレポートのトップリストには最も遅いテストが表示されます。 + +### Testing the stdout/stderr output + +`stdout` および/または `stderr` に書き込む関数をテストするために、テストは `pytest` の [capsys システム](https://docs.pytest.org/en/latest/capture.html) を使用してこれらのストリームにアクセスできます。以下はその方法です: + + +```python +import sys + + +def print_to_stdout(s): + print(s) + + +def print_to_stderr(s): + sys.stderr.write(s) + + +def test_result_and_stdout(capsys): + msg = "Hello" + print_to_stdout(msg) + print_to_stderr(msg) + out, err = capsys.readouterr() # consume the captured output streams + # optional: if you want to replay the consumed streams: + sys.stdout.write(out) + sys.stderr.write(err) + # test: + assert msg in out + assert msg in err +``` + + +そしてもちろん、ほとんどの場合、`stderr`は例外の一部として提供されるため、そのような場合には try/excel を使用する必要があります。 +ケース: + +```python +def raise_exception(msg): + raise ValueError(msg) + + +def test_something_exception(): + msg = "Not a good value" + error = "" + try: + raise_exception(msg) + except 
Exception as e: + error = str(e) + assert msg in error, f"{msg} is in the exception:\n{error}" +``` + + +stdout をキャプチャするもう 1 つのアプローチは、`contextlib.redirect_stdout`を使用することです。 + +```python +from io import StringIO +from contextlib import redirect_stdout + + +def print_to_stdout(s): + print(s) + + +def test_result_and_stdout(): + msg = "Hello" + buffer = StringIO() + with redirect_stdout(buffer): + print_to_stdout(msg) + out = buffer.getvalue() + # optional: if you want to replay the consumed streams: + sys.stdout.write(out) + # test: + assert msg in out +``` + +stdout をキャプチャする際の重要な潜在的な問題は、通常の `print` でこれまでに出力された内容をリセットする可能性がある `\r` 文字が含まれている可能性があることです。`pytest` 自体には問題はありませんが、`pytest -s` ではこれらの文字がバッファに含まれるため、`-s` ありとなしでテストを実行できるようにするには、`re.sub(r'~.*\r', '', buf, 0, re.M)` を使用してキャプチャされた出力に対して追加のクリーンアップを行う必要があります。 + +しかし、その後、`\r` が含まれているかどうかにかかわらず、すべての操作を自動的に処理するヘルパーコンテキストマネージャラッパーがあります。したがって、次のように簡単に行えます: + + +```python +from transformers.testing_utils import CaptureStdout + +with CaptureStdout() as cs: + function_that_writes_to_stdout() +print(cs.out) +``` + +完全なテスト例は次のとおりです。 + +```python +from transformers.testing_utils import CaptureStdout + +msg = "Secret message\r" +final = "Hello World" +with CaptureStdout() as cs: + print(msg + final) +assert cs.out == final + "\n", f"captured: {cs.out}, expecting {final}" +``` + +`stderr` をキャプチャしたい場合は、代わりに `CaptureStderr` クラスを使用してください。 + +```python +from transformers.testing_utils import CaptureStderr + +with CaptureStderr() as cs: + function_that_writes_to_stderr() +print(cs.err) +``` + +両方のストリームを一度にキャプチャする必要がある場合は、親の `CaptureStd` クラスを使用します。 + +```python +from transformers.testing_utils import CaptureStd + +with CaptureStd() as cs: + function_that_writes_to_stdout_and_stderr() +print(cs.err, cs.out) +``` + + +また、テストの問題のデバッグを支援するために、デフォルトで、これらのコンテキスト マネージャーは終了時にキャプチャされたストリームを自動的に再生します。 +文脈から。 + +### Capturing logger stream + +ロガーの出力を検証する必要がある場合は、`CaptureLogger`を使用できます。 + +```python +from transformers import logging +from transformers.testing_utils import CaptureLogger + +msg = "Testing 1, 2, 3" +logging.set_verbosity_info() +logger = logging.get_logger("transformers.models.bart.tokenization_bart") +with CaptureLogger(logger) as cl: + logger.info(msg) +assert cl.out, msg + "\n" +``` + +### Testing with environment variables + +特定のテストで環境変数の影響をテストしたい場合は、ヘルパー デコレータを使用できます。 +`transformers.testing_utils.mockenv` + +```python +from transformers.testing_utils import mockenv + + +class HfArgumentParserTest(unittest.TestCase): + @mockenv(TRANSFORMERS_VERBOSITY="error") + def test_env_override(self): + env_level_str = os.getenv("TRANSFORMERS_VERBOSITY", None) +``` + +場合によっては、外部プログラムを呼び出す必要があるため、`os.environ` に`PYTHONPATH`を設定してインクルードする必要があります。 +複数のローカル パス。ヘルパー クラス `transformers.test_utils.TestCasePlus` が役に立ちます。 + +```python +from transformers.testing_utils import TestCasePlus + + +class EnvExampleTest(TestCasePlus): + def test_external_prog(self): + env = self.get_env() + # now call the external program, passing `env` to it +``` + +テストファイルが `tests` テストスイートまたは `examples` のどちらにあるかに応じて +`env[PYTHONPATH]` を使用して、これら 2 つのディレクトリのいずれかを含めます。また、テストが確実に行われるようにするための `src` ディレクトリも含めます。 +現在のリポジトリに対して実行され、最後に、テストが実行される前にすでに設定されていた `env[PYTHONPATH]` を使用して実行されます。 +何かあれば呼ばれます。 + +このヘルパー メソッドは `os.environ` オブジェクトのコピーを作成するため、元のオブジェクトはそのまま残ります。 + + +### Getting reproducible results + +状況によっては、テストのランダム性を削除したい場合があります。同一の再現可能な結果セットを取得するには、 +シードを修正する必要があります: + +```python +seed = 42 + +# python RNG +import random + +random.seed(seed) + +# pytorch RNGs +import torch + +torch.manual_seed(seed) 
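+# make cuDNN deterministic and seed all GPUs as well (when CUDA is available)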
+torch.backends.cudnn.deterministic = True +if torch.cuda.is_available(): + torch.cuda.manual_seed_all(seed) + +# numpy RNG +import numpy as np + +np.random.seed(seed) + +# tf RNG +tf.random.set_seed(seed) +``` + + +### Debugging tests + +警告が発生した時点でデバッガーを開始するには、次の手順を実行します。 + +```bash +pytest tests/utils/test_logging.py -W error::UserWarning --pdb +``` + +## Working with github actions workflows + +セルフプッシュのワークフローCIジョブをトリガーするには、以下の手順を実行する必要があります: + +1. `transformers` のリモートリポジトリで新しいブランチを作成します(フォークではなく、元のリポジトリで行います)。 +2. ブランチの名前は `ci_` または `ci-` で始まる必要があります(`main` もトリガーしますが、`main` ではPRを作成できません)。また、特定のパスでのみトリガーされます - このドキュメントが書かれた後に変更された場合に備えて、最新の定義は[こちら](https://github.com/huggingface/transformers/blob/main/.github/workflows/self-push.yml)の *push:* にあります。 +3. このブランチからPRを作成します。 +4. その後、このジョブが[ここ](https://github.com/huggingface/transformers/actions/workflows/self-push.yml)に表示されます。ジョブはバックログがある場合、すぐに実行されないことがあります。 + +## Testing Experimental CI Features + +CI機能のテストは通常のCIの正常な動作に干渉する可能性があるため、新しいCI機能を追加する場合、以下の手順に従う必要があります。 + +1. テストが必要なものをテストするための新しい専用のジョブを作成します。 +2. 新しいジョブは常に成功する必要があるため、常にグリーン ✓(詳細は以下参照)を表示する必要があります。 +3. さまざまな種類のPR(ユーザーフォークブランチ、非フォークブランチ、github.com UIから直接ファイルを編集するブランチ、さまざまな強制プッシュなど)が実行されるまでいくつかの日間実行し、実験的なジョブのログを監視します(意図的に常にグリーンになるようになっている全体のジョブの緑ではなく)。 +4. すべてが安定していることが明確になったら、新しい変更を既存のジョブに統合します。 + +このように、CI機能自体の実験が通常のワークフローに干渉しないようにできます。 + +では、新しいCI機能が開発中である間、ジョブを常に成功させるにはどうすればいいでしょうか? + +TravisCIのような一部のCIは `ignore-step-failure` をサポートし、全体のジョブを成功として報告しますが、この文書が作成された時点ではCircleCIとGithub Actionsはそれをサポートしていません。 + +したがって、以下のワークアラウンドを使用できます: + +1. bashスクリプト内で潜在的な失敗を抑制するために実行コマンドの冒頭に `set +euo pipefail` を記述します。 +2. 最後のコマンドは成功する必要があります。たとえば `echo "done"` または単に `true` を使用できます。 + +以下は例です: + + + +```yaml +- run: + name: run CI experiment + command: | + set +euo pipefail + echo "setting run-all-despite-any-errors-mode" + this_command_will_fail + echo "but bash continues to run" + # emulate another failure + false + # but the last command must be a success + echo "during experiment do not remove: reporting success to CI, even if there were failures" +``` + + +単純なコマンドの場合は、次のようにすることもできます。 + +```bash +cmd_that_may_fail || true +``` + +もちろん、結果に満足したら、実験的なステップやジョブを通常のジョブと統合し、`set +euo pipefail` などの追加した要素を削除して、実験的なジョブが通常のCIの動作に干渉しないようにします。 + +このプロセス全体は、実験的なステップに対して `allow-failure` のようなものを設定し、PRの全体のステータスに影響を与えずに失敗させることができれば、はるかに簡単になったでしょう。しかし、前述の通り、現在はCircleCIとGithub Actionsはこの機能をサポートしていません。 + +この機能に関しての投票や、CIに特有のスレッドでその進捗状況を確認できます: + +- [Github Actions:](https://github.com/actions/toolkit/issues/399) +- [CircleCI:](https://ideas.circleci.com/ideas/CCI-I-344) + diff --git a/docs/source/ja/tf_xla.md b/docs/source/ja/tf_xla.md new file mode 100644 index 00000000000000..1f5a2af1a5a288 --- /dev/null +++ b/docs/source/ja/tf_xla.md @@ -0,0 +1,179 @@ + + +# XLA Integration for TensorFlow Models + +[[open-in-colab]] + +加速線形代数(Accelerated Linear Algebra)、通称XLAは、TensorFlowモデルのランタイムを高速化するためのコンパイラです。[公式ドキュメント](https://www.tensorflow.org/xla)によれば、XLA(Accelerated Linear Algebra)は線形代数のためのドメイン固有のコンパイラで、TensorFlowモデルを潜在的にソースコードの変更なしで高速化できます。 + +TensorFlowでXLAを使用するのは簡単です。XLAは`tensorflow`ライブラリ内にパッケージ化されており、[`tf.function`](https://www.tensorflow.org/guide/intro_to_graphs)などのグラフを作成する関数内で`jit_compile`引数を使用してトリガーできます。`fit()`や`predict()`などのKerasメソッドを使用する場合、`model.compile()`に`jit_compile`引数を渡すだけでXLAを有効にできます。ただし、XLAはこれらのメソッドに限定されているわけではありません。任意の`tf.function`を高速化するためにも使用できます。 + +🤗 
Transformers内のいくつかのTensorFlowメソッドは、XLAと互換性があるように書き直されています。これには、[GPT2](https://huggingface.co/docs/transformers/model_doc/gpt2)、[T5](https://huggingface.co/docs/transformers/model_doc/t5)、[OPT](https://huggingface.co/docs/transformers/model_doc/opt)などのテキスト生成モデルや、[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)などの音声処理モデルも含まれます。 + +速度向上の具体的な量はモデルに非常に依存しますが、🤗 Transformers内のTensorFlowテキスト生成モデルでは、約100倍の速度向上を確認しています。このドキュメントでは、これらのモデルにXLAを使用して最大のパフォーマンスを得る方法を説明します。また、ベンチマークとXLA統合のデザイン哲学について詳しく学びたい場合の追加リソースへのリンクも提供します。 + +## Running TF functions with XLA + +以下のTensorFlowモデルを考えてみましょう: + + +```py +import tensorflow as tf + +model = tf.keras.Sequential( + [tf.keras.layers.Dense(10, input_shape=(10,), activation="relu"), tf.keras.layers.Dense(5, activation="softmax")] +) +``` + +上記のモデルは、次元が`(10, )`の入力を受け入れます。このモデルをフォワードパスで実行するには、次のようにします: + + +```py +# Generate random inputs for the model. +batch_size = 16 +input_vector_dim = 10 +random_inputs = tf.random.normal((batch_size, input_vector_dim)) + +# Run a forward pass. +_ = model(random_inputs) +``` + +XLAでコンパイルされた関数を使用してフォワードパスを実行するには、以下のようにします: + + +```py +xla_fn = tf.function(model, jit_compile=True) +_ = xla_fn(random_inputs) +``` + +`model`のデフォルトの `call()` 関数はXLAグラフをコンパイルするために使用されます。ただし、XLAにコンパイルしたい他のモデル関数がある場合、それも可能です。以下はその方法です: + + +```py +my_xla_fn = tf.function(model.my_xla_fn, jit_compile=True) +``` + +## Running a TF text generation model with XLA from 🤗 Transformers + +🤗 Transformers内でXLAでの高速化された生成を有効にするには、最新バージョンの`transformers`がインストールされている必要があります。次のコマンドを実行してインストールできます: + +```bash +pip install transformers --upgrade +``` + +次に、次のコードを実行できます: + + +```py +import tensorflow as tf +from transformers import AutoTokenizer, TFAutoModelForCausalLM + +# Will error if the minimal version of Transformers is not installed. +from transformers.utils import check_min_version + +check_min_version("4.21.0") + + +tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2", padding_side="left", pad_token="
") +model = TFAutoModelForCausalLM.from_pretrained("openai-community/gpt2") +input_string = ["TensorFlow is"] + +# One line to create an XLA generation function +xla_generate = tf.function(model.generate, jit_compile=True) + +tokenized_input = tokenizer(input_string, return_tensors="tf") +generated_tokens = xla_generate(**tokenized_input, num_beams=2) + +decoded_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True) +print(f"Generated -- {decoded_text}") +# Generated -- TensorFlow is an open-source, open-source, distributed-source application # framework for the +``` + +`generate()`でXLAを有効にするのは、たった一行のコードです。コードの残り部分は変更されていません。ただし、XLA固有のいくつかの注意点が上記のコードスニペットにあります。これらに注意する必要があり、XLAがもたらす速度向上を実現するためにそれらを把握することが重要です。次のセクションでこれらについて詳しく説明します。 + + +## Gotchas to be aware of + +XLAを有効にした関数(上記の`xla_generate()`など)を初めて実行すると、内部で計算グラフを推論しようとしますが、これは時間がかかります。このプロセスは["トレーシング"(tracing)](https://www.tensorflow.org/guide/intro_to_graphs#when_is_a_function_tracing)として知られています。 + +生成時間が高速ではないことに気付くかもしれません。`xla_generate()`(または他のXLA対応関数)の連続呼び出しでは、関数への入力が最初に計算グラフが構築されたときと同じ形状に従っている場合、計算グラフを推論する必要はありません。これは、入力形状が固定されているモダリティ(例:画像)には問題ありませんが、変数の入力形状モダリティ(例:テキスト)を扱う場合には注意が必要です。 + +`xla_generate()`が常に同じ入力形状で動作するようにするには、トークナイザを呼び出す際に`padding`引数を指定できます。 + +```py +import tensorflow as tf +from transformers import AutoTokenizer, TFAutoModelForCausalLM + +tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2", padding_side="left", pad_token="") +model = TFAutoModelForCausalLM.from_pretrained("openai-community/gpt2") +input_string = ["TensorFlow is"] + +xla_generate = tf.function(model.generate, jit_compile=True) + +# Here, we call the tokenizer with padding options. +tokenized_input = tokenizer(input_string, pad_to_multiple_of=8, padding=True, return_tensors="tf") + +generated_tokens = xla_generate(**tokenized_input, num_beams=2) +decoded_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True) +print(f"Generated -- {decoded_text}") +``` + +これにより、`xla_generate()`への入力が常にトレースされた形状の入力を受け取ることを確認し、生成時間の高速化を実現できます。以下のコードでこれを確認できます: + +```py +import time +import tensorflow as tf +from transformers import AutoTokenizer, TFAutoModelForCausalLM + +tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2", padding_side="left", pad_token="") +model = TFAutoModelForCausalLM.from_pretrained("openai-community/gpt2") + +xla_generate = tf.function(model.generate, jit_compile=True) + +for input_string in ["TensorFlow is", "TensorFlow is a", "TFLite is a"]: + tokenized_input = tokenizer(input_string, pad_to_multiple_of=8, padding=True, return_tensors="tf") + start = time.time_ns() + generated_tokens = xla_generate(**tokenized_input, num_beams=2) + end = time.time_ns() + print(f"Execution time -- {(end - start) / 1e6:.1f} ms\n") +``` + +Tesla T4 GPUを使用すると、次のような出力が期待されます: + +```bash +Execution time -- 30819.6 ms + +Execution time -- 79.0 ms + +Execution time -- 78.9 ms +``` + +最初の`xla_generate()`呼び出しはトレーシングのために時間がかかりますが、連続する呼び出しは桁違いに高速です。生成オプションのいかなる変更も、再トレーシングを引き起こし、生成時間の遅延を引き起こすことに注意してください。 + +このドキュメントでは、🤗 Transformersが提供するテキスト生成オプションをすべて網羅していません。高度なユースケースについてはドキュメンテーションを参照することをお勧めします。 + +## Additional Resources + +ここでは、🤗 Transformersと一般的なXLAについてさらに詳しく学びたい場合のいくつかの追加リソースを提供します。 + +* [このColab Notebook](https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/91_tf_xla_generate.ipynb)では、XLA対応のエンコーダーデコーダー([T5](https://huggingface.co/docs/transformers/model_doc/t5)など)およびデコーダー専用([GPT2](https://huggingface.co/docs/transformers/model_doc/gpt2)など)テキスト生成モデルを試すための対話型デモが提供されています。 +* 
[このブログ記事](https://huggingface.co/blog/tf-xla-generate)では、XLA対応モデルの比較ベンチマークの概要と、TensorFlowでのXLAについての友好的な紹介が提供されています。 +* [このブログ記事](https://blog.tensorflow.org/2022/11/how-hugging-face-improved-text-generation-performance-with-xla.html)では、🤗 TransformersのTensorFlowモデルにXLAサポートを追加する際の設計哲学について説明しています。 +* 一般的なXLAとTensorFlowグラフについて詳しく学ぶためのおすすめの投稿: + * [XLA: 機械学習用の最適化コンパイラ](https://www.tensorflow.org/xla) + * [グラフと`tf.function`の紹介](https://www.tensorflow.org/guide/intro_to_graphs) + * [`tf.function`を使用したパフォーマンス向上](https://www.tensorflow.org/guide/function) diff --git a/docs/source/ja/tflite.md b/docs/source/ja/tflite.md new file mode 100644 index 00000000000000..ad3e9a3f484e2c --- /dev/null +++ b/docs/source/ja/tflite.md @@ -0,0 +1,58 @@ + + +# Export to TFLite + +[TensorFlow Lite](https://www.tensorflow.org/lite/guide)は、モバイルフォン、組み込みシステム、およびモノのインターネット(IoT)デバイスなど、リソースに制約のあるデバイスに機械学習モデルを展開するための軽量なフレームワークです。TFLiteは、計算能力、メモリ、および電力消費が限られているこれらのデバイス上でモデルを効率的に最適化して実行するために設計されています。 +TensorFlow Liteモデルは、`.tflite`ファイル拡張子で識別される特別な効率的なポータブル形式で表されます。 + +🤗 Optimumは、🤗 TransformersモデルをTFLiteにエクスポートするための機能を`exporters.tflite`モジュールを介して提供しています。サポートされているモデルアーキテクチャのリストについては、[🤗 Optimumのドキュメント](https://huggingface.co/docs/optimum/exporters/tflite/overview)をご参照ください。 + +モデルをTFLiteにエクスポートするには、必要な依存関係をインストールしてください: + + +```bash +pip install optimum[exporters-tf] +``` + +すべての利用可能な引数を確認するには、[🤗 Optimumドキュメント](https://huggingface.co/docs/optimum/main/en/exporters/tflite/usage_guides/export_a_model)を参照するか、コマンドラインでヘルプを表示してください: + +```bash +optimum-cli export tflite --help +``` + +🤗 Hubからモデルのチェックポイントをエクスポートするには、例えば `google-bert/bert-base-uncased` を使用する場合、次のコマンドを実行します: + +```bash +optimum-cli export tflite --model google-bert/bert-base-uncased --sequence_length 128 bert_tflite/ +``` + +進行状況を示すログが表示され、生成された `model.tflite` が保存された場所も表示されるはずです: + +```bash +Validating TFLite model... + -[✓] TFLite model output names match reference model (logits) + - Validating TFLite Model output "logits": + -[✓] (1, 128, 30522) matches (1, 128, 30522) + -[x] values not close enough, max diff: 5.817413330078125e-05 (atol: 1e-05) +The TensorFlow Lite export succeeded with the warning: The maximum absolute difference between the output of the reference model and the TFLite exported model is not within the set tolerance 1e-05: +- logits: max diff = 5.817413330078125e-05. + The exported model was saved at: bert_tflite + ``` + +上記の例は🤗 Hubからチェックポイントをエクスポートする方法を示しています。ローカルモデルをエクスポートする場合、まずモデルの重みファイルとトークナイザファイルを同じディレクトリ(`local_path`)に保存したことを確認してください。CLIを使用する場合、🤗 Hubのチェックポイント名の代わりに`model`引数に`local_path`を渡します。 + + diff --git a/docs/source/ja/tokenizer_summary.md b/docs/source/ja/tokenizer_summary.md new file mode 100644 index 00000000000000..448ad9c871aaa3 --- /dev/null +++ b/docs/source/ja/tokenizer_summary.md @@ -0,0 +1,179 @@ + + +# Summary of the tokenizers + +[[open-in-colab]] + +このページでは、トークナイゼーションについて詳しく見ていきます。 + + + +[前処理のチュートリアル](preprocessing)で見たように、テキストをトークン化することは、それを単語またはサブワードに分割し、それらをルックアップテーブルを介してIDに変換することです。単語またはサブワードをIDに変換することは簡単ですので、この要約ではテキストを単語またはサブワードに分割する(つまり、テキストをトークナイズする)ことに焦点を当てます。具体的には、🤗 Transformersで使用される3つの主要なトークナイザ、[Byte-Pair Encoding(BPE)](#byte-pair-encoding)、[WordPiece](#wordpiece)、および[SentencePiece](#sentencepiece)を見て、どのモデルがどのトークナイザタイプを使用しているかの例を示します。 + +各モデルページでは、事前トレーニング済みモデルがどのトークナイザタイプを使用しているかを知るために、関連するトークナイザのドキュメントを確認できます。例えば、[`BertTokenizer`]を見ると、モデルが[WordPiece](#wordpiece)を使用していることがわかります。 + +## Introduction + +テキストをより小さなチャンクに分割することは、見かけ以上に難しいタスクであり、複数の方法があります。例えば、次の文を考えてみましょう。「"Don't you love 🤗 Transformers? 
We sure do."」 + + + +このテキストをトークン化する簡単な方法は、スペースで分割することです。これにより、以下のようになります: + + +``` +["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."] +``` + +これは合理的な第一歩ですが、トークン "Transformers?" と "do." を見ると、句読点が単語 "Transformer" と "do" に結合されていることがわかり、これは最適ではありません。句読点を考慮に入れるべきで、モデルが単語とそれに続く可能性のあるすべての句読点記号の異なる表現を学ばなければならないことを避けるべきです。これにより、モデルが学ばなければならない表現の数が爆発的に増加します。句読点を考慮に入れた場合、例文のトークン化は次のようになります: + + +``` +["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."] +``` + +ただし、単語「"Don't"」をトークン化する方法に関しては、不利な側面があります。 「"Don't"」は「"do not"」を表しているため、「["Do", "n't"]」としてトークン化する方が適しています。ここから事柄が複雑になり、各モデルが独自のトークナイザータイプを持つ理由の一部でもあります。テキストをトークン化するために適用するルールに応じて、同じテキストに対して異なるトークナイズされた出力が生成されます。事前トレーニング済みモデルは、トレーニングデータをトークナイズするのに使用されたルールと同じルールでトークナイズされた入力を提供する場合にのみ正常に機能します。 + +[spaCy](https://spacy.io/)と[Moses](http://www.statmt.org/moses/?n=Development.GetStarted)は、2つの人気のあるルールベースのトークナイザーです。これらを私たちの例に適用すると、*spaCy*と*Moses*は次のような出力を生成します: + +``` +["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."] +``` + +空白と句読点のトークン化、およびルールベースのトークン化が使用されていることがわかります。空白と句読点のトークン化、およびルールベースのトークン化は、文を単語に分割することをゆるやかに定義される単語トークン化の例です。テキストをより小さなチャンクに分割するための最も直感的な方法である一方、このトークン化方法は大規模なテキストコーパスに対して問題を引き起こすことがあります。この場合、空白と句読点のトークン化は通常、非常に大きな語彙(すべての一意な単語とトークンのセット)を生成します。例えば、[Transformer XL](model_doc/transformerxl)は空白と句読点のトークン化を使用しており、語彙サイズは267,735です! + +このような大きな語彙サイズは、モデルに非常に大きな埋め込み行列を入力および出力レイヤーとして持たせることを強制し、メモリおよび時間の複雑さの増加を引き起こします。一般的に、トランスフォーマーモデルは、特に単一の言語で事前トレーニングされた場合、50,000を超える語彙サイズを持つことはほとんどありません。 + +したがって、シンプルな空白と句読点のトークン化が不十分な場合、なぜ単に文字単位でトークン化しないのかという疑問が生じますか? + + + +文字単位のトークン化は非常にシンプルであり、メモリと時間の複雑さを大幅に削減できますが、モデルに意味のある入力表現を学習させることが非常に難しくなります。たとえば、文字「"t"」のための意味のあるコンテキスト独立の表現を学習することは、単語「"today"」のためのコンテキスト独立の表現を学習するよりもはるかに難しいです。そのため、文字単位のトークン化はしばしばパフォーマンスの低下を伴います。したがって、トランスフォーマーモデルは単語レベルと文字レベルのトークン化のハイブリッドである**サブワード**トークン化を使用して、両方の世界の利点を活かします。 + +## Subword tokenization + + + +サブワードトークン化アルゴリズムは、頻繁に使用される単語をより小さなサブワードに分割すべきではないが、珍しい単語は意味のあるサブワードに分解されるという原則に依存しています。たとえば、「"annoyingly"」は珍しい単語と見なされ、その単語は「"annoying"」と「"ly"」に分解されるかもしれません。独立した「"annoying"」と「"ly"」はより頻繁に現れますが、「"annoyingly"」の意味は「"annoying"」と「"ly"」の合成的な意味によって保持されます。これは特にトルコ語などの結合言語で役立ちます。ここではサブワードを連結して(ほぼ)任意の長い複雑な単語を形成できます。 + +サブワードトークン化により、モデルは合理的な語彙サイズを持つことができ、意味のあるコンテキスト独立の表現を学習できます。さらに、サブワードトークン化により、モデルは以前に見たことのない単語を処理し、それらを既知のサブワードに分解することができます。例えば、[`~transformers.BertTokenizer`]は`"I have a new GPU!"`を以下のようにトークン化します: + + +```py +>>> from transformers import BertTokenizer + +>>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> tokenizer.tokenize("I have a new GPU!") +["i", "have", "a", "new", "gp", "##u", "!"] +``` + +「uncased」モデルを考慮しているため、まず文を小文字に変換しました。トークナイザの語彙に「["i", "have", "a", "new"]」という単語が存在することがわかりますが、「"gpu"」という単語は存在しません。したがって、トークナイザは「"gpu"」を既知のサブワード「["gp"、"##u"]」に分割します。ここで「"##"」は、トークンのデコードまたはトークナイゼーションの逆転のために、トークンの前の部分にスペースなしで接続する必要があることを意味します。 + +別の例として、[`~transformers.XLNetTokenizer`]は以下のように以前のサンプルテキストをトークン化します: + +```py +>>> from transformers import XLNetTokenizer + +>>> tokenizer = XLNetTokenizer.from_pretrained("xlnet/xlnet-base-cased") +>>> tokenizer.tokenize("Don't you love 🤗 Transformers? 
We sure do.") +["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?", "▁We", "▁sure", "▁do", "."] +``` + +これらの「▁」の意味については、[SentencePiece](#sentencepiece)を見るときに詳しく説明します。ご覧の通り、「Transformers」という珍しい単語は、より頻繁に現れるサブワード「Transform」と「ers」に分割されています。 + +さて、異なるサブワードトークン化アルゴリズムがどのように動作するかを見てみましょう。これらのトークナイゼーションアルゴリズムはすべて、通常は対応するモデルがトレーニングされるコーパスで行われる形式のトレーニングに依存しています。 + + + +### Byte-Pair Encoding(BPE) + +Byte-Pair Encoding(BPE)は、[Neural Machine Translation of Rare Words with Subword Units(Sennrich et al., 2015)](https://arxiv.org/abs/1508.07909)で導入されました。BPEは、トレーニングデータを単語に分割するプリトークナイザに依存しています。プリトークナイゼーションは、空白のトークナイゼーションなど、非常に単純なものであることがあります。例えば、[GPT-2](model_doc/gpt2)、[RoBERTa](model_doc/roberta)です。より高度なプリトークナイゼーションには、ルールベースのトークナイゼーション([XLM](model_doc/xlm)、[FlauBERT](model_doc/flaubert)などが大部分の言語にMosesを使用)や、[GPT](model_doc/gpt)(Spacyとftfyを使用してトレーニングコーパス内の各単語の頻度を数える)などが含まれます。 + +プリトークナイゼーションの後、一意の単語セットが作成され、各単語がトレーニングデータで出現した頻度が決定されます。次に、BPEはベース語彙を作成し、ベース語彙の二つのシンボルから新しいシンボルを形成するためのマージルールを学習します。このプロセスは、語彙が所望の語彙サイズに達するまで続けられます。なお、所望の語彙サイズはトークナイザをトレーニングする前に定義するハイパーパラメータであることに注意してください。 + +例として、プリトークナイゼーションの後、次のセットの単語とその出現頻度が決定されたと仮定しましょう: + + +``` +("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5) +``` + +したがって、ベース語彙は「["b", "g", "h", "n", "p", "s", "u"]」です。すべての単語をベース語彙のシンボルに分割すると、次のようになります: + + +``` +("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5) +``` + +その後、BPEは可能なすべてのシンボルペアの頻度を数え、最も頻繁に発生するシンボルペアを選択します。上記の例では、`"h"`の後に`"u"`が15回(`"hug"`の10回、`"hugs"`の5回)出現します。しかし、最も頻繁なシンボルペアは、合計で20回(`"u"`の10回、`"g"`の5回、`"u"`の5回)出現する`"u"`の後に`"g"`が続くシンボルペアです。したがって、トークナイザが最初に学習するマージルールは、`"u"`の後に`"g"`が続くすべての`"u"`シンボルを一緒にグループ化することです。次に、`"ug"`が語彙に追加されます。単語のセットは次になります: + + +``` +("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5) +``` + +次に、BPEは次に最も一般的なシンボルペアを識別します。それは「"u"」に続いて「"n"」で、16回出現します。したがって、「"u"」と「"n"」は「"un"」に結合され、語彙に追加されます。次に最も頻度の高いシンボルペアは、「"h"」に続いて「"ug"」で、15回出現します。再びペアが結合され、「hug」が語彙に追加できます。 + +この段階では、語彙は`["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]`であり、一意の単語のセットは以下のように表されます: + + +``` +("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5) +``` + +前提として、Byte-Pair Encoding(BPE)のトレーニングがこの段階で停止すると、学習されたマージルールが新しい単語に適用されます(新しい単語にはベースボキャブラリに含まれていないシンボルが含まれていない限り)。 例えば、単語 "bug" は ["b", "ug"] としてトークン化されますが、"mug" はベースボキャブラリに "m" シンボルが含まれていないため、["", "ug"] としてトークン化されます。 一般的に、"m" のような単一の文字は、トレーニングデータには通常、各文字の少なくとも1つの出現が含まれているため、"" シンボルに置き換えられることはありませんが、絵文字のような非常に特殊な文字の場合には発生する可能性があります。 + +前述のように、ボキャブラリサイズ、すなわちベースボキャブラリサイズ + マージの回数は選択するハイパーパラメータです。 例えば、[GPT](model_doc/gpt) はベース文字が478文字で、40,000回のマージ後にトレーニングを停止したため、ボキャブラリサイズは40,478です。 + +#### Byte-level BPE + +すべてのUnicode文字をベース文字と考えると、すべての可能なベース文字が含まれるかもしれないベースボキャブラリはかなり大きくなることがあります。 [GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) は、ベースボキャブラリを256バイトにする賢いトリックとしてバイトをベースボキャブラリとして使用し、すべてのベース文字がボキャブラリに含まれるようにしています。 パンクチュエーションを扱うためのいくつかの追加ルールを備えたGPT2のトークナイザは、 シンボルを必要とせずにすべてのテキストをトークン化できます。 [GPT-2](model_doc/gpt) は50,257のボキャブラリサイズを持っており、これは256バイトのベーストークン、特別なテキストの終了を示すトークン、および50,000回のマージで学習したシンボルに対応しています。 + +### WordPiece + +WordPieceは、[BERT](model_doc/bert)、[DistilBERT](model_doc/distilbert)、および[Electra](model_doc/electra)で使用されるサブワードトークナイゼーションアルゴリズムです。 このアルゴリズムは、[Japanese and Korean Voice Search (Schuster et al., 2012)](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf) で概説されており、BPEに非常に似ています。 
WordPieceは最も頻繁なシンボルペアを選択するのではなく、トレーニングデータに追加した場合にトレーニングデータの尤度を最大化するシンボルペアを選択します。 + +これは具体的にはどういう意味ですか?前の例を参照すると、トレーニングデータの尤度を最大化することは、そのシンボルペアの確率をその最初のシンボルに続く2番目のシンボルの確率で割ったものが、すべてのシンボルペアの中で最も大きい場合に該当するシンボルペアを見つけることに等しいです。 たとえば、"u" の後に "g" が続く場合、他のどのシンボルペアよりも "ug" の確率を "u"、"g" で割った確率が高ければ、それらのシンボルは結合されます。直感的に言えば、WordPieceは2つのシンボルを結合することによって失われるものを評価し、それがそれに値するかどうかを確認する点でBPEとはわずかに異なります。 + +### Unigram + +Unigramは、[Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Kudo, 2018)](https://arxiv.org/pdf/1804.10959.pdf) で導入されたサブワードトークナイゼーションアルゴリズムです。 BPEやWordPieceとは異なり、Unigramはベースボキャブラリを多数のシンボルで初期化し、各シンボルを削減してより小さなボキャブラリを取得します。 ベースボキャブラリは、事前にトークン化されたすべての単語と最も一般的な部分文字列に対応する可能性があります。 Unigramはtransformersのモデルの直接の使用には適していませんが、[SentencePiece](#sentencepiece)と組み合わせて使用されます。 + +各トレーニングステップで、Unigramアルゴリズムは現在のボキャブラリとユニグラム言語モデルを使用してトレーニングデータ上の損失(通常は対数尤度として定義)を定義します。その後、ボキャブラリ内の各シンボルについて、そのシンボルがボキャブラリから削除された場合に全体の損失がどれだけ増加するかを計算します。 Unigramは、損失の増加が最も低いp(通常は10%または20%)パーセントのシンボルを削除します。つまり、トレーニングデータ全体の損失に最も影響を与えない、最も損失の少ないシンボルを削除します。 このプロセスは、ボキャブラリが望ましいサイズに達するまで繰り返されます。 Unigramアルゴリズムは常にベース文字を保持するため、任意の単語をトークン化できます。 + +Unigramはマージルールに基づいていないため(BPEとWordPieceとは対照的に)、トレーニング後の新しいテキストのトークン化にはいくつかの方法があります。例として、トレーニングされたUnigramトークナイザが持つボキャブラリが次のような場合: + + +``` +["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"], +``` + +`"hugs"`は、`["hug", "s"]`、`["h", "ug", "s"]`、または`["h", "u", "g", "s"]`のようにトークン化できます。では、どれを選択すべきでしょうか? Unigramは、トレーニングコーパス内の各トークンの確率を保存し、トレーニング後に各可能なトークン化の確率を計算できるようにします。このアルゴリズムは実際には最も可能性の高いトークン化を選択しますが、確率に従って可能なトークン化をサンプリングするオプションも提供します。 + +これらの確率は、トークナイザーがトレーニングに使用する損失によって定義されます。トレーニングデータが単語 \\(x_{1}, \dots, x_{N}\\) で構成され、単語 \\(x_{i}\\) のすべての可能なトークン化のセットが \\(S(x_{i})\\) と定義される場合、全体の損失は次のように定義されます。 + +$$\mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )$$ + + + +### SentencePiece + +これまでに説明したすべてのトークン化アルゴリズムには同じ問題があります。それは、入力テキストが単語を区切るためにスペースを使用していると仮定しているということです。しかし、すべての言語が単語を区切るためにスペースを使用しているわけではありません。この問題を一般的に解決するための1つの方法は、言語固有の前トークナイザーを使用することです(例:[XLM](model_doc/xlm)は特定の中国語、日本語、およびタイ語の前トークナイザーを使用しています)。より一般的にこの問題を解決するために、[SentencePiece:ニューラルテキスト処理のためのシンプルで言語非依存のサブワードトークナイザーおよびデトークナイザー(Kudo et al.、2018)](https://arxiv.org/pdf/1808.06226.pdf) は、入力を生の入力ストリームとして扱い、スペースを使用する文字のセットに含めます。それからBPEまたはunigramアルゴリズムを使用して適切な語彙を構築します。 + +たとえば、[`XLNetTokenizer`]はSentencePieceを使用しており、そのために前述の例で`"▁"`文字が語彙に含まれていました。SentencePieceを使用したデコードは非常に簡単で、すべてのトークンを単純に連結し、`"▁"`はスペースに置換されます。 + +ライブラリ内のすべてのtransformersモデルは、SentencePieceをunigramと組み合わせて使用します。SentencePieceを使用するモデルの例には、[ALBERT](model_doc/albert)、[XLNet](model_doc/xlnet)、[Marian](model_doc/marian)、および[T5](model_doc/t5)があります。 diff --git a/docs/source/ja/torchscript.md b/docs/source/ja/torchscript.md new file mode 100644 index 00000000000000..27d64a625c8c42 --- /dev/null +++ b/docs/source/ja/torchscript.md @@ -0,0 +1,177 @@ + + +# Export to TorchScript + + + +これはTorchScriptを使用した実験の最初であり、可変入力サイズのモデルに対するその能力をまだ探求中です。これは私たちの関心の焦点であり、今後のリリースでは、より柔軟な実装や、PythonベースのコードとコンパイルされたTorchScriptを比較するベンチマークを含む、より多くのコード例で詳細な分析を行います。 + + + +[TorchScriptのドキュメント](https://pytorch.org/docs/stable/jit.html)によれば: + +> TorchScriptは、PyTorchコードから直列化および最適化可能なモデルを作成する方法です。 + +TorchScriptを使用すると、効率志向のC++プログラムなど、他のプログラムでモデルを再利用できるようになります。PyTorchベースのPythonプログラム以外の環境で🤗 Transformersモデルをエクスポートして使用するためのインターフェースを提供しています。ここでは、TorchScriptを使用してモデルをエクスポートし、使用する方法を説明します。 + +モデルをエクスポートするには、次の2つの要件があります: + +- `torchscript`フラグを使用したモデルのインスタンス化 +- ダミーの入力を使用したフォワードパス + +これらの必要条件は、以下で詳細に説明されているように、開発者が注意する必要があるいくつかのことを意味します。 + +## 
TorchScript flag and tied weights + +`torchscript`フラグは、ほとんどの🤗 Transformers言語モデルにおいて、`Embedding`レイヤーと`Decoding`レイヤー間で重みが連結されているため必要です。 +TorchScriptでは、重みが連結されているモデルをエクスポートすることはできませんので、事前に重みを切り離して複製する必要があります。 + +`torchscript`フラグを使用してインスタンス化されたモデルは、`Embedding`レイヤーと`Decoding`レイヤーが分離されており、そのため後でトレーニングしてはいけません。 +トレーニングは、これらの2つのレイヤーを非同期にする可能性があり、予期しない結果をもたらす可能性があります。 + +言語モデルヘッドを持たないモデルには言及しませんが、これらのモデルには連結された重みが存在しないため、`torchscript`フラグなしで安全にエクスポートできます。 + +## Dummy inputs and standard lengths + +ダミー入力はモデルのフォワードパスに使用されます。入力の値はレイヤーを通じて伝播される間、PyTorchは各テンソルに実行された異なる操作を追跡します。これらの記録された操作は、モデルの*トレース*を作成するために使用されます。 + +トレースは入力の寸法に対して作成されます。そのため、ダミー入力の寸法に制約され、他のシーケンス長やバッチサイズでは動作しません。異なるサイズで試すと、以下のエラーが発生します: + +``` +`The expanded size of the tensor (3) must match the existing size (7) at non-singleton dimension 2` +``` + +お勧めしますのは、モデルの推論中に供給される最大の入力と同じ大きさのダミー入力サイズでモデルをトレースすることです。パディングを使用して不足値を補完することもできます。ただし、モデルがより大きな入力サイズでトレースされるため、行列の寸法も大きくなり、より多くの計算が発生します。 + +異なるシーケンス長のモデルをエクスポートする際に、各入力に対して実行される演算の総数に注意して、パフォーマンスを密接にフォローすることをお勧めします。 + +## Using TorchScript in Python + +このセクションでは、モデルの保存と読み込み、および推論にトレースを使用する方法を示します。 + +### Saving a model + +TorchScriptで`BertModel`をエクスポートするには、`BertConfig`クラスから`BertModel`をインスタンス化し、それをファイル名`traced_bert.pt`でディスクに保存します: + +```python +from transformers import BertModel, BertTokenizer, BertConfig +import torch + +enc = BertTokenizer.from_pretrained("google-bert/bert-base-uncased") + +# Tokenizing input text +text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]" +tokenized_text = enc.tokenize(text) + +# Masking one of the input tokens +masked_index = 8 +tokenized_text[masked_index] = "[MASK]" +indexed_tokens = enc.convert_tokens_to_ids(tokenized_text) +segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1] + +# Creating a dummy input +tokens_tensor = torch.tensor([indexed_tokens]) +segments_tensors = torch.tensor([segments_ids]) +dummy_input = [tokens_tensor, segments_tensors] + +# Initializing the model with the torchscript flag +# Flag set to True even though it is not necessary as this model does not have an LM Head. 
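+# (it is only strictly required for models whose input and output embeddings are tied, e.g. models with a language modeling head)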
+config = BertConfig( + vocab_size_or_config_json_file=32000, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + intermediate_size=3072, + torchscript=True, +) + +# Instantiating the model +model = BertModel(config) + +# The model needs to be in evaluation mode +model.eval() + +# If you are instantiating the model with *from_pretrained* you can also easily set the TorchScript flag +model = BertModel.from_pretrained("google-bert/bert-base-uncased", torchscript=True) + +# Creating the trace +traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors]) +torch.jit.save(traced_model, "traced_bert.pt") +``` + +### Loading a model + +以前に保存した `BertModel`、`traced_bert.pt` をディスクから読み込んで、以前に初期化した `dummy_input` で使用できます。 + +```python +loaded_model = torch.jit.load("traced_bert.pt") +loaded_model.eval() + +all_encoder_layers, pooled_output = loaded_model(*dummy_input) +``` + + +### Using a traced model for inference + +トレースモデルを使用して推論を行うには、その `__call__` ダンダーメソッドを使用します。 + +```python +traced_model(tokens_tensor, segments_tensors) +``` + + +## Deploy Hugging Face TorchScript models to AWS with the Neuron SDK + +AWSはクラウドでの低コストで高性能な機械学習推論向けに [Amazon EC2 Inf1](https://aws.amazon.com/ec2/instance-types/inf1/) インスタンスファミリーを導入しました。Inf1インスタンスはAWS Inferentiaチップによって駆動され、ディープラーニング推論ワークロードに特化したカスタムビルドのハードウェアアクセラレータです。[AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/#) はInferentia用のSDKで、トランスフォーマーモデルをトレースして最適化し、Inf1に展開するためのサポートを提供します。 + +Neuron SDK が提供するもの: + +1. クラウドでの推論のためにTorchScriptモデルをトレースして最適化するための、1行のコード変更で使用できる簡単なAPI。 +2. [改善されたコストパフォーマンス](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/benchmark/) のためのボックス外のパフォーマンス最適化。 +3. [PyTorch](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/bert_tutorial/tutorial_pretrained_bert.html) または [TensorFlow](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/tensorflow/huggingface_bert/huggingface_bert.html) で構築されたHugging Faceトランスフォーマーモデルへのサポート。 + +### Implications + +BERT(Bidirectional Encoder Representations from Transformers)アーキテクチャやその変種([distilBERT](https://huggingface.co/docs/transformers/main/model_doc/distilbert) や [roBERTa](https://huggingface.co/docs/transformers/main/model_doc/roberta) など)に基づくトランスフォーマーモデルは、非生成タスク(抽出型質問応答、シーケンス分類、トークン分類など)において、Inf1上で最適に動作します。ただし、テキスト生成タスクも [AWS Neuron MarianMT チュートリアル](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/transformers-marianmt.html) に従ってInf1上で実行できます。Inferentiaでボックス外で変換できるモデルに関する詳細情報は、Neuronドキュメンテーションの [Model Architecture Fit](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/models/models-inferentia.html#models-inferentia) セクションにあります。 + +### Dependencies + +モデルをAWS Neuronに変換するには、[Neuron SDK 環境](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/pytorch-neuron/index.html#installation-guide) が必要で、[AWS Deep Learning AMI](https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-inferentia-launching.html) に事前に構成されています。 + +### Converting a model for AWS Neuron + +モデルをAWS NEURON用に変換するには、[PythonでTorchScriptを使用する](torchscript#using-torchscript-in-python) と同じコードを使用して `BertModel` をトレースします。Python APIを介してNeuron SDKのコンポーネントにアクセスするために、`torch.neuron` フレームワーク拡張をインポートします。 + +```python +from transformers import BertModel, BertTokenizer, BertConfig +import torch +import torch.neuron +``` + +次の行を変更するだけで済みます。 + +```diff +- torch.jit.trace(model, [tokens_tensor, segments_tensors]) ++ torch.neuron.trace(model, [token_tensor, segments_tensors]) +``` 
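+参考として、上記の変更を適用した場合の変換から保存までの流れを示す最小限のスケッチを以下に示します。これは Neuron SDK(`torch-neuron`)がセットアップ済みの環境を前提とした仮定のコードで、保存ファイル名 `bert_neuron.pt` も例示用の仮の名前です。公式チュートリアルの置き換えではありません:
+
+```python
+# A minimal sketch assuming a configured AWS Neuron SDK (torch-neuron) environment.
+import torch
+import torch.neuron
+from transformers import BertModel, BertTokenizer
+
+enc = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
+model = BertModel.from_pretrained("google-bert/bert-base-uncased", torchscript=True)
+model.eval()
+
+# Build dummy inputs with the same shapes that will be used at inference time.
+text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
+indexed_tokens = enc.convert_tokens_to_ids(enc.tokenize(text))
+segments_ids = [0] * len(indexed_tokens)
+tokens_tensor = torch.tensor([indexed_tokens])
+segments_tensors = torch.tensor([segments_ids])
+
+# Trace with the Neuron compiler instead of torch.jit.trace.
+neuron_model = torch.neuron.trace(model, [tokens_tensor, segments_tensors])
+
+# Save the compiled artifact; it can later be loaded on an Inf1 instance with torch.jit.load.
+neuron_model.save("bert_neuron.pt")
+```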
+ +これにより、Neuron SDKはモデルをトレースし、Inf1インスタンス向けに最適化します。 + +AWS Neuron SDKの機能、ツール、サンプルチュートリアル、最新のアップデートについて詳しく知りたい場合は、[AWS NeuronSDK ドキュメンテーション](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html) をご覧ください。 + + + diff --git a/docs/source/ja/training.md b/docs/source/ja/training.md new file mode 100644 index 00000000000000..79fbb1b7fb2571 --- /dev/null +++ b/docs/source/ja/training.md @@ -0,0 +1,434 @@ + + +# Fine-tune a pretrained model + +[[open-in-colab]] + +事前学習済みモデルを使用すると、計算コストを削減し、炭素排出量を減少させ、ゼロからモデルをトレーニングする必要なしに最新のモデルを使用できる利点があります。 +🤗 Transformersは、さまざまなタスクに対応した数千もの事前学習済みモデルへのアクセスを提供します。 +事前学習済みモデルを使用する場合、それを特定のタスクに合わせたデータセットでトレーニングします。これはファインチューニングとして知られ、非常に強力なトレーニング技術です。 +このチュートリアルでは、事前学習済みモデルを選択したディープラーニングフレームワークでファインチューニングする方法について説明します: + +* 🤗 Transformersの[`Trainer`]を使用して事前学習済みモデルをファインチューニングする。 +* TensorFlowとKerasを使用して事前学習済みモデルをファインチューニングする。 +* ネイティブのPyTorchを使用して事前学習済みモデルをファインチューニングする。 + + + +## Prepare a dataset + + + +事前学習済みモデルをファインチューニングする前に、データセットをダウンロードしてトレーニング用に準備する必要があります。 +前のチュートリアルでは、トレーニングデータの処理方法を説明しましたが、これからはそれらのスキルを活かす機会があります! + +まず、[Yelp Reviews](https://huggingface.co/datasets/yelp_review_full)データセットを読み込んでみましょう: + +```python +>>> from datasets import load_dataset + +>>> dataset = load_dataset("yelp_review_full") +>>> dataset["train"][100] +{'label': 0, + 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I\'ve worked at more than one location. I expect bad days, bad moods, and the occasional mistake. But I have yet to have a decent experience at this store. It will remain a place I avoid unless someone in my party needs to avoid illness from low blood sugar. Perhaps I should go back to the racially biased service of Steak n Shake instead!'} +``` + +トークナイザがテキストを処理し、可変のシーケンス長を処理するためのパディングと切り捨て戦略を含める必要があることをご存知の通り、 +データセットを1つのステップで処理するには、🤗 Datasets の [`map`](https://huggingface.co/docs/datasets/process#map) メソッドを使用して、 +データセット全体に前処理関数を適用します: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") + +>>> def tokenize_function(examples): +... 
return tokenizer(examples["text"], padding="max_length", truncation=True) + +>>> tokenized_datasets = dataset.map(tokenize_function, batched=True) +``` + +お好みで、実行時間を短縮するためにフルデータセットの小さなサブセットを作成することができます: + +```py +>>> small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000)) +>>> small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000)) +``` + + + +## Train + +この時点で、使用したいフレームワークに対応するセクションに従う必要があります。右側のサイドバーのリンクを使用して、ジャンプしたいフレームワークに移動できます。 +そして、特定のフレームワークのすべてのコンテンツを非表示にしたい場合は、そのフレームワークのブロック右上にあるボタンを使用してください! + + + + + +## Train with Pytorch Trainer + +🤗 Transformersは、🤗 Transformersモデルのトレーニングを最適化した[`Trainer`]クラスを提供し、独自のトレーニングループを手動で記述せずにトレーニングを開始しやすくしています。 +[`Trainer`] APIは、ログ記録、勾配累積、混合精度など、さまざまなトレーニングオプションと機能をサポートしています。 + +まず、モデルをロードし、予想されるラベルの数を指定します。Yelp Review [dataset card](https://huggingface.co/datasets/yelp_review_full#data-fields)から、5つのラベルがあることがわかります: + +```py +>>> from transformers import AutoModelForSequenceClassification + +>>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) +``` + + + +一部の事前学習済みの重みが使用されず、一部の重みがランダムに初期化された警告が表示されることがあります。心配しないでください、これは完全に正常です! +BERTモデルの事前学習済みのヘッドは破棄され、ランダムに初期化された分類ヘッドで置き換えられます。この新しいモデルヘッドをシーケンス分類タスクでファインチューニングし、事前学習モデルの知識をそれに転送します。 + + + +### Training Hyperparameters + +次に、トレーニングオプションをアクティベートするためのすべてのハイパーパラメータと、調整できるハイパーパラメータを含む[`TrainingArguments`]クラスを作成します。 +このチュートリアルでは、デフォルトのトレーニング[ハイパーパラメータ](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments)を使用して開始できますが、最適な設定を見つけるためにこれらを実験しても構いません。 + +トレーニングのチェックポイントを保存する場所を指定します: + +```python +>>> from transformers import TrainingArguments + +>>> training_args = TrainingArguments(output_dir="test_trainer") +``` + +### Evaluate + +[`Trainer`]はトレーニング中に自動的にモデルのパフォーマンスを評価しません。メトリクスを計算して報告する関数を[`Trainer`]に渡す必要があります。 +[🤗 Evaluate](https://huggingface.co/docs/evaluate/index)ライブラリでは、[`evaluate.load`]関数を使用して読み込むことができるシンプルな[`accuracy`](https://huggingface.co/spaces/evaluate-metric/accuracy)関数が提供されています(詳細については[こちらのクイックツアー](https://huggingface.co/docs/evaluate/a_quick_tour)を参照してください): + +```python +>>> import numpy as np +>>> import evaluate + +>>> metric = evaluate.load("accuracy") +``` + +`metric`の`~evaluate.compute`を呼び出して、予測の正確度を計算します。 `compute`に予測を渡す前に、予測をロジットに変換する必要があります(すべての🤗 Transformersモデルはロジットを返すことを覚えておいてください): + +```py +>>> def compute_metrics(eval_pred): +... logits, labels = eval_pred +... predictions = np.argmax(logits, axis=-1) +... return metric.compute(predictions=predictions, references=labels) +``` + +評価メトリクスをファインチューニング中に監視したい場合、トレーニング引数で `evaluation_strategy` パラメータを指定して、各エポックの終了時に評価メトリクスを報告します: + +```python +>>> from transformers import TrainingArguments, Trainer + +>>> training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch") +``` + +### Trainer + +モデル、トレーニング引数、トレーニングおよびテストデータセット、評価関数を使用して[`Trainer`]オブジェクトを作成します: + +```py +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=small_train_dataset, +... eval_dataset=small_eval_dataset, +... compute_metrics=compute_metrics, +... ) +``` + +その後、[`~transformers.Trainer.train`]を呼び出してモデルを微調整します: + +```python +>>> trainer.train() +``` + + + + + + + +## Kerasを使用してTensorFlowモデルをトレーニングする + +Keras APIを使用して🤗 TransformersモデルをTensorFlowでトレーニングすることもできます! 
+ +### Loading Data from Keras + +🤗 TransformersモデルをKeras APIでトレーニングする場合、データセットをKerasが理解できる形式に変換する必要があります。 +データセットが小さい場合、データセット全体をNumPy配列に変換してKerasに渡すことができます。 +複雑なことをする前に、まずそれを試してみましょう。 + +まず、データセットを読み込みます。GLUEベンチマークからCoLAデータセットを使用します +([GLUE Banchmark](https://huggingface.co/datasets/glue))、これは単純なバイナリテキスト分類タスクです。今のところトレーニング分割のみを使用します。 + +```py +from datasets import load_dataset + +dataset = load_dataset("glue", "cola") +dataset = dataset["train"] # 今のところトレーニング分割のみを使用します +``` + +次に、トークナイザをロードし、データをNumPy配列としてトークン化します。ラベルは既に`0`と`1`のリストであるため、トークン化せずに直接NumPy配列に変換できます! + +```python +from transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") +tokenized_data = tokenizer(dataset["sentence"], return_tensors="np", padding=True) +# トークナイザはBatchEncodingを返しますが、それをKeras用に辞書に変換します +tokenized_data = dict(tokenized_data) + +labels = np.array(dataset["label"]) # ラベルはすでに0と1の配列です +``` + +最後に、モデルをロードし、[`compile`](https://keras.io/api/models/model_training_apis/#compile-method) と [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) メソッドを実行します。 +注意点として、Transformersモデルはすべてデフォルトでタスクに関連した損失関数を持っているため、指定しなくても構いません(指定する場合を除く): + +```python +from transformers import TFAutoModelForSequenceClassification +from tensorflow.keras.optimizers import Adam + +# モデルをロードしてコンパイルする +model = TFAutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased") +# ファインチューニングには通常、学習率を下げると良いです +model.compile(optimizer=Adam(3e-5)) # 損失関数の指定は不要です! + +model.fit(tokenized_data, labels) +``` + + + +モデルを`compile()`する際に`loss`引数を渡す必要はありません!Hugging Faceモデルは、この引数を空白のままにしておくと、タスクとモデルアーキテクチャに適した損失を自動的に選択します。 +必要に応じて自分で損失を指定してオーバーライドすることもできます! + + + +このアプローチは、小規模なデータセットには適していますが、大規模なデータセットに対しては問題になることがあります。なぜなら、トークナイズされた配列とラベルはメモリに完全に読み込まれる必要があり、またNumPyは「ジャギー」な配列を処理しないため、トークナイズされた各サンプルを全体のデータセット内で最も長いサンプルの長さにパディングする必要があります。 +これにより、配列がさらに大きくなり、すべてのパディングトークンがトレーニングを遅くする原因になります! + +### Loading data as a tf.data.Dataset + +トレーニングを遅くせずにデータを読み込むには、データを`tf.data.Dataset`として読み込むことができます。独自の`tf.data`パイプラインを作成することもできますが、これを行うための便利な方法が2つあります: + +- [`~TFPreTrainedModel.prepare_tf_dataset`]: これはほとんどの場合で推奨する方法です。モデル上のメソッドなので、モデルを検査してモデル入力として使用可能な列を自動的に把握し、他の列を破棄してより単純で高性能なデータセットを作成できます。 +- [`~datasets.Dataset.to_tf_dataset`]: このメソッドはより低レベルで、データセットがどのように作成されるかを正確に制御する場合に便利です。`columns`と`label_cols`を指定して、データセットに含める列を正確に指定できます。 + +[`~TFPreTrainedModel.prepare_tf_dataset`]を使用する前に、次のコードサンプルに示すように、トークナイザの出力をデータセットに列として追加する必要があります: + +```py +def tokenize_dataset(data): + # 返された辞書のキーはデータセットに列として追加されます + return tokenizer(data["text"]) + + +dataset = dataset.map(tokenize_dataset) +``` + +Hugging Faceのデータセットはデフォルトでディスクに保存されるため、これによりメモリの使用量が増えることはありません! +列が追加されたら、データセットからバッチをストリームし、各バッチにパディングを追加できます。これにより、 +データセット全体にパディングを追加する場合と比べて、パディングトークンの数が大幅に削減されます。 + +```python +>>> tf_dataset = model.prepare_tf_dataset(dataset["train"], batch_size=16, shuffle=True, tokenizer=tokenizer) +``` + +上記のコードサンプルでは、トークナイザを`prepare_tf_dataset`に渡して、バッチを正しく読み込む際に正しくパディングできるようにする必要があります。 +データセットのすべてのサンプルが同じ長さであり、パディングが不要な場合は、この引数をスキップできます。 +パディング以外の複雑な処理を行う必要がある場合(例:マスク言語モデリングのためのトークンの破損など)、 +代わりに`collate_fn`引数を使用して、サンプルのリストをバッチに変換し、必要な前処理を適用する関数を渡すことができます。 +このアプローチを実際に使用した例については、 +[examples](https://github.com/huggingface/transformers/tree/main/examples)や +[notebooks](https://huggingface.co/docs/transformers/notebooks)をご覧ください。 + +`tf.data.Dataset`を作成したら、以前と同様にモデルをコンパイルし、適合させることができます: + +```python +model.compile(optimizer=Adam(3e-5)) # 損失引数は不要です! 
+ +model.fit(tf_dataset) +``` + + + + + + +## Train in native Pytorch + + + + + +[`Trainer`]はトレーニングループを処理し、1行のコードでモデルをファインチューニングできるようにします。 +トレーニングループを独自に記述したいユーザーのために、🤗 TransformersモデルをネイティブのPyTorchでファインチューニングすることもできます。 + +この時点で、ノートブックを再起動するか、以下のコードを実行してメモリを解放する必要があるかもしれません: + +```py +del model +del trainer +torch.cuda.empty_cache() +``` + +1. モデルは生のテキストを入力として受け取らないため、`text` 列を削除します: + +```py +>>> tokenized_datasets = tokenized_datasets.remove_columns(["text"]) +``` + +2. `label`列を`labels`に名前を変更します。モデルは引数の名前を`labels`と期待しています: + +```py +>>> tokenized_datasets = tokenized_datasets.rename_column("label", "labels") +``` + +3. データセットの形式をリストではなくPyTorchテンソルを返すように設定します: + +```py +>>> tokenized_datasets.set_format("torch") +``` + +以前に示したように、ファインチューニングを高速化するためにデータセットの小さなサブセットを作成します: + +```py +>>> small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000)) +>>> small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000)) +``` + +### DataLoader + +トレーニングデータセットとテストデータセット用の`DataLoader`を作成して、データのバッチをイテレートできるようにします: + +```py +>>> from torch.utils.data import DataLoader + +>>> train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8) +>>> eval_dataloader = DataLoader(small_eval_dataset, batch_size=8) +``` + +ロードするモデルと期待されるラベルの数を指定してください: + +```py +>>> from transformers import AutoModelForSequenceClassification + +>>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) +``` + +### Optimizer and learning rate scheduler + +モデルをファインチューニングするためのオプティマイザと学習率スケジューラーを作成しましょう。 +PyTorchから[`AdamW`](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html)オプティマイザを使用します: + +```python +>>> from torch.optim import AdamW + +>>> optimizer = AdamW(model.parameters(), lr=5e-5) +``` + +デフォルトの学習率スケジューラを[`Trainer`]から作成する: + +```py +>>> from transformers import get_scheduler + +>>> num_epochs = 3 +>>> num_training_steps = num_epochs * len(train_dataloader) +>>> lr_scheduler = get_scheduler( +... name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps +... ) +``` + +最後に、GPUを利用できる場合は `device` を指定してください。それ以外の場合、CPUでのトレーニングは数時間かかる可能性があり、数分で完了することができます。 + +```py +>>> import torch + +>>> device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu") +>>> model.to(device) +``` + + + +クラウドGPUが利用できない場合、[Colaboratory](https://colab.research.google.com/)や[SageMaker StudioLab](https://studiolab.sagemaker.aws/)などのホストされたノートブックを使用して無料でGPUにアクセスできます。 + + + +さて、トレーニングの準備が整いました! 🥳 + +### トレーニングループ + +トレーニングの進捗を追跡するために、[tqdm](https://tqdm.github.io/)ライブラリを使用してトレーニングステップの数に対して進行状況バーを追加します: + +```py +>>> from tqdm.auto import tqdm + +>>> progress_bar = tqdm(range(num_training_steps)) + +>>> model.train() +>>> for epoch in range(num_epochs): +... for batch in train_dataloader: +... batch = {k: v.to(device) for k, v in batch.items()} +... outputs = model(**batch) +... loss = outputs.loss +... loss.backward() + +... optimizer.step() +... lr_scheduler.step() +... optimizer.zero_grad() +... progress_bar.update(1) +``` + +### Evaluate + +[`Trainer`]に評価関数を追加したのと同様に、独自のトレーニングループを作成する際にも同様の操作を行う必要があります。 +ただし、各エポックの最後にメトリックを計算および報告する代わりに、今回は[`~evaluate.add_batch`]を使用してすべてのバッチを蓄積し、最後にメトリックを計算します。 + +```python +>>> import evaluate + +>>> metric = evaluate.load("accuracy") +>>> model.eval() +>>> for batch in eval_dataloader: +... batch = {k: v.to(device) for k, v in batch.items()} +... with torch.no_grad(): +... outputs = model(**batch) + +... logits = outputs.logits +... 
predictions = torch.argmax(logits, dim=-1) +... metric.add_batch(predictions=predictions, references=batch["labels"]) + +>>> metric.compute() +``` + + + + + + +## 追加リソース + +さらなるファインチューニングの例については、以下を参照してください: + +- [🤗 Transformers Examples](https://github.com/huggingface/transformers/tree/main/examples) には、PyTorchとTensorFlowで一般的なNLPタスクをトレーニングするスクリプトが含まれています。 + +- [🤗 Transformers Notebooks](notebooks) には、特定のタスクにモデルをファインチューニングする方法に関するさまざまなノートブックが含まれています。 diff --git a/docs/source/ja/transformers_agents.md b/docs/source/ja/transformers_agents.md new file mode 100644 index 00000000000000..572d7f290c96dc --- /dev/null +++ b/docs/source/ja/transformers_agents.md @@ -0,0 +1,282 @@ + + +# Transformers Agents + + + +Transformers Agentsは、いつでも変更される可能性のある実験的なAPIです。エージェントが返す結果は、APIまたは基礎となるモデルが変更される可能性があるため、異なることがあります。 + + + +Transformersバージョンv4.29.0は、*ツール*と*エージェント*のコンセプトを基に構築されています。この[colab](https://colab.research.google.com/drive/1c7MHD-T1forUPGcC_jlwsIptOzpG3hSj)で試すことができます。 + +要するに、これはtransformersの上に自然言語APIを提供するものです:私たちは一連の厳選されたツールを定義し、自然言語を解釈し、これらのツールを使用するエージェントを設計します。これは設計上拡張可能です。私たちはいくつかの関連するツールを厳選しましたが、コミュニティによって開発された任意のツールを使用するためにシステムを簡単に拡張できる方法も示します。 + +この新しいAPIで何ができるかのいくつかの例から始めましょう。特に多モーダルなタスクに関して強力ですので、画像を生成したりテキストを読み上げたりするのに最適です。 + +上記のテキストの上に、日本語の翻訳を提供します。 + + +```py +agent.run("Caption the following image", image=image) +``` + +| **Input** | **Output** | +|-----------------------------------------------------------------------------------------------------------------------------|-----------------------------------| +| | A beaver is swimming in the water | + +--- + +```py +agent.run("Read the following text out loud", text=text) +``` +| **Input** | **Output** | +|-------------------------------------------------------------------------------------------------------------------------|----------------------------------------------| +| A beaver is swimming in the water | + +--- + +```py +agent.run( + "In the following `document`, where will the TRRF Scientific Advisory Council Meeting take place?", + document=document, +) +``` +| **Input** | **Output** | +|-----------------------------------------------------------------------------------------------------------------------------|----------------| +| | ballroom foyer | + +## Quickstart + +`agent.run`を使用する前に、エージェントをインスタンス化する必要があります。エージェントは、大規模な言語モデル(LLM)です。 +OpenAIモデルとBigCode、OpenAssistantからのオープンソースの代替モデルをサポートしています。OpenAIモデルはパフォーマンスが優れていますが、OpenAIのAPIキーが必要であり、無料で使用することはできません。一方、Hugging FaceはBigCodeとOpenAssistantモデルのエンドポイントへの無料アクセスを提供しています。 + +まず、デフォルトの依存関係をすべてインストールするために`agents`のエクストラをインストールしてください。 + + +```bash +pip install transformers[agents] +``` + +OpenAIモデルを使用するには、`openai`の依存関係をインストールした後、`OpenAiAgent`をインスタンス化します。 + + +```bash +pip install openai +``` + + +```py +from transformers import OpenAiAgent + +agent = OpenAiAgent(model="text-davinci-003", api_key="") +``` + +BigCodeまたはOpenAssistantを使用するには、まずログインしてInference APIにアクセスしてください。 + +```py +from huggingface_hub import login + +login("") +``` + +次に、エージェントをインスタンス化してください。 + +```py +from transformers import HfAgent + +# Starcoder +agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder") +# StarcoderBase +# agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoderbase") +# OpenAssistant +# agent = HfAgent(url_endpoint="https://api-inference.huggingface.co/models/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5") +``` + +これは、Hugging Faceが現在無料で提供している推論APIを使用しています。このモデル(または別のモデル)の独自の推論エンドポイントをお持ちの場合は、上記のURLエンドポイントをご自分のURLエンドポイントで置き換えることができます。 + + + 
+StarCoderとOpenAssistantは無料で利用でき、シンプルなタスクには非常に優れた性能を発揮します。ただし、より複雑なプロンプトを処理する際には、チェックポイントが十分でないことがあります。そのような場合には、現時点ではオープンソースではないものの、パフォーマンスが向上する可能性のあるOpenAIモデルを試してみることをお勧めします。 + + + +これで準備が整いました!これから、あなたが利用できる2つのAPIについて詳しく説明します。 + +### Single execution (run) + +単一実行メソッドは、エージェントの [`~Agent.run`] メソッドを使用する場合です。 + + +```py +agent.run("Draw me a picture of rivers and lakes.") +``` + + + + +これは、実行したいタスクに適したツール(またはツール)を自動的に選択し、適切に実行します。1つまたは複数のタスクを同じ命令で実行することができます(ただし、命令が複雑であるほど、エージェントが失敗する可能性が高くなります)。 + + +```py +agent.run("Draw me a picture of the sea then transform the picture to add an island") +``` + + + +
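+
+なお、`run` の返り値は選択されたツールに応じた通常の Python オブジェクトです。以下は、画像生成ツールが選ばれて PIL 互換の画像オブジェクトが返されると仮定した場合に、返り値をそのまま保存する簡単なスケッチです(返り値の型は実際に使われるツールに依存します):
+
+```py
+# 仮定: text-to-image ツールが選択され、PIL 互換の画像オブジェクトが返されるものとします
+picture = agent.run("Draw me a picture of rivers and lakes.")
+picture.save("rivers_and_lakes.png")
+```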
+ +[`~Agent.run`] 操作は独立して実行できますので、異なるタスクで何度も実行することができます。 + +注意点として、あなたの `agent` は単なる大規模な言語モデルであるため、プロンプトのわずかな変更でも完全に異なる結果が得られる可能性があります。したがって、実行したいタスクをできるだけ明確に説明することが重要です。良いプロンプトの書き方については、[こちら](custom_tools#writing-good-user-inputs) で詳しく説明しています。 + +実行ごとに状態を保持したり、テキスト以外のオブジェクトをエージェントに渡したりする場合は、エージェントが使用する変数を指定することができます。例えば、最初の川や湖の画像を生成し、その画像に島を追加するようにモデルに指示するには、次のように行うことができます: + +```python +picture = agent.run("Generate a picture of rivers and lakes.") +updated_picture = agent.run("Transform the image in `picture` to add an island to it.", picture=picture) +``` + + + +これは、モデルがあなたのリクエストを理解できない場合や、ツールを混同する場合に役立つことがあります。例えば: + +```py +agent.run("Draw me the picture of a capybara swimming in the sea") +``` + +ここでは、モデルは2つの方法で解釈できます: +- `text-to-image`に海で泳ぐカピバラを生成させる +- または、`text-to-image`でカピバラを生成し、それを海で泳がせるために`image-transformation`ツールを使用する + +最初のシナリオを強制したい場合は、プロンプトを引数として渡すことができます: + + +```py +agent.run("Draw me a picture of the `prompt`", prompt="a capybara swimming in the sea") +``` + + + + +### Chat-based execution (チャット) + +エージェントは、[`~Agent.chat`] メソッドを使用することで、チャットベースのアプローチも可能です。 + + + +```py +agent.chat("Transform the picture so that there is a rock in there") +``` + + + +
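+
+以下は、上のチャット例に続けてもう 1 ターン指示を送る場合の簡単なスケッチです。直前のターンの結果が保持されていることを前提にしています(生成される結果はモデルと実行環境に依存します):
+
+```py
+# 直前のターンで変換した画像を、さらに続けて編集するように指示します
+agent.chat("Now make the rock bigger")
+```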
+ +これは、指示をまたいで状態を保持したい場合に便利なアプローチで、単一の指示に比べて複雑な指示を処理するのは難しいかもしれません(その場合は [`~Agent.run`] メソッドの方が適しています)。 + +このメソッドは、非テキスト型の引数や特定のプロンプトを渡したい場合にも使用できます。 + +### ⚠️ Remote execution + +デモンストレーションの目的やすべてのセットアップで使用できるように、リリースのためにいくつかのデフォルトツール用のリモート実行ツールも作成しました。これらは [推論エンドポイント](https://huggingface.co/inference-endpoints) を使用して作成されます。 + +これらは現在オフになっていますが、リモート実行ツールを自分で設定する方法については、[カスタムツールガイド](./custom_tools) を読むことをお勧めします。 + +### What's happening here? What are tools, and what are agents? + +![エージェントとツールのダイアグラム](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/diagram.png) + +#### Agents + +ここでの「エージェント」とは、大規模な言語モデルのことであり、特定の一連のツールにアクセスできるようにプロンプトを設定しています。 + +LLM(大規模言語モデル)は、コードの小さなサンプルを生成するのにかなり優れており、このAPIは、エージェントに特定のツールセットを使用してタスクを実行するコードの小さなサンプルを生成させることに利用しています。このプロンプトは、エージェントにタスクとツールの説明を提供することで、エージェントが使用しているツールのドキュメントにアクセスし、関連するコードを生成できるようになります。 + +#### Tools + +ツールは非常に単純で、名前と説明からなる単一の関数です。それから、これらのツールの説明を使用してエージェントをプロンプトします。プロンプトを通じて、エージェントに、ツールを使用してクエリで要求されたタスクをどのように実行するかを示します。特に、ツールの期待される入力と出力を示します。 + +これは新しいツールを使用しており、パイプラインではなくツールを使用しています。なぜなら、エージェントは非常に原子的なツールでより良いコードを生成するからです。パイプラインはよりリファクタリングされ、しばしば複数のタスクを組み合わせています。ツールは非常に単純なタスクに焦点を当てることを意図しています。 + +#### Code-execution?! + +このコードは、ツールとツールと一緒に渡される入力のセットで、当社の小規模なPythonインタープリタで実行されます。すでに提供されたツールとprint関数しか呼び出すことができないため、実行できることはすでに制限されています。Hugging Faceのツールに制限されているため、安全だと考えても問題ありません。 + +さらに、属性の検索やインポートは許可しておらず(それらは渡された入力/出力を処理するためには必要ないはずです)、最も明らかな攻撃は問題ありません(エージェントにそれらを出力するようにプロンプトする必要があります)。超安全な側に立ちたい場合は、追加の引数 return_code=True を指定して run() メソッドを実行できます。その場合、エージェントは実行するコードを返すだけで、実行するかどうかはあなた次第です。 + +実行は、違法な操作を試みる行またはエージェントが生成したコードに通常のPythonエラーがある場合に停止します。 + +### A curated set of tools + +私たちは、このようなエージェントを強化できるツールのセットを特定します。以下は、`transformers`に統合されたツールの更新されたリストです: + +- **ドキュメント質問応答**: 画像形式のドキュメント(PDFなど)が与えられた場合、このドキュメントに関する質問に回答します([Donut](./model_doc/donut)) +- **テキスト質問応答**: 長いテキストと質問が与えられた場合、テキスト内の質問に回答します([Flan-T5](./model_doc/flan-t5)) +- **無条件の画像キャプション**: 画像にキャプションを付けます!([BLIP](./model_doc/blip)) +- **画像質問応答**: 画像が与えられた場合、その画像に関する質問に回答します([VILT](./model_doc/vilt)) +- **画像セグメンテーション**: 画像とプロンプトが与えられた場合、そのプロンプトのセグメンテーションマスクを出力します([CLIPSeg](./model_doc/clipseg)) +- **音声からテキストへの変換**: 人の話し声のオーディオ録音が与えられた場合、その音声をテキストに転記します([Whisper](./model_doc/whisper)) +- **テキストから音声への変換**: テキストを音声に変換します([SpeechT5](./model_doc/speecht5)) +- **ゼロショットテキスト分類**: テキストとラベルのリストが与えられた場合、テキストが最も対応するラベルを識別します([BART](./model_doc/bart)) +- **テキスト要約**: 長いテキストを1つまたは数文に要約します([BART](./model_doc/bart)) +- **翻訳**: テキストを指定された言語に翻訳します([NLLB](./model_doc/nllb)) + +これらのツールはtransformersに統合されており、手動でも使用できます。たとえば、次のように使用できます: + +```py +from transformers import load_tool + +tool = load_tool("text-to-speech") +audio = tool("This is a text to speech tool") +``` + +### Custom tools + +私たちは、厳選されたツールのセットを特定する一方、この実装が提供する主要な価値は、カスタムツールを迅速に作成して共有できる能力だと強く信じています。 + +ツールのコードをHugging Face Spaceまたはモデルリポジトリにプッシュすることで、エージェントと直接連携してツールを活用できます。[`huggingface-tools` organization](https://huggingface.co/huggingface-tools)には、**transformers非依存**のいくつかのツールが追加されました: + +- **テキストダウンローダー**: ウェブURLからテキストをダウンロードするためのツール +- **テキストから画像へ**: プロンプトに従って画像を生成するためのツール。安定した拡散を活用します +- **画像変換**: 初期画像とプロンプトを指定して画像を変更するためのツール。instruct pix2pixの安定した拡散を活用します +- **テキストからビデオへ**: プロンプトに従って小さなビデオを生成するためのツール。damo-vilabを活用します + +最初から使用しているテキストから画像へのツールは、[*huggingface-tools/text-to-image*](https://huggingface.co/spaces/huggingface-tools/text-to-image)にあるリモートツールです!今後も、この組織および他の組織にさらにこのようなツールをリリースし、この実装をさらに強化していきます。 + 
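+こうした Hub 上のツールは、前述の `load_tool` で読み込んでエージェントに追加できます。以下は、`additional_tools` 引数でツールを渡せることを前提にした簡単なスケッチです(手順の詳細は後述のカスタムツールガイドを参照してください):
+
+```py
+from transformers import HfAgent, load_tool
+
+# Hub 上のコミュニティツールを読み込みます(ここでは例として text-to-image を使用)
+image_tool = load_tool("huggingface-tools/text-to-image")
+
+# 読み込んだツールを追加してエージェントを作成します
+agent = HfAgent(
+    "https://api-inference.huggingface.co/models/bigcode/starcoder",
+    additional_tools=[image_tool],
+)
+```
+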
+エージェントはデフォルトで[`huggingface-tools`](https://huggingface.co/huggingface-tools)にあるツールにアクセスできます。 +ツールの作成と共有方法、またHubに存在するカスタムツールを活用する方法についての詳細は、[次のガイド](custom_tools)で説明しています。 + +### Code generation + +これまで、エージェントを使用してあなたのためにアクションを実行する方法を示しました。ただし、エージェントはコードを生成するだけで、非常に制限されたPythonインタープリタを使用して実行します。生成されたコードを異なる環境で使用したい場合、エージェントにコードを返すように指示できます。ツールの定義と正確なインポートも含めて。 + +例えば、以下の命令: +```python +agent.run("Draw me a picture of rivers and lakes", return_code=True) +``` + +次のコードを返します +```python +from transformers import load_tool + +image_generator = load_tool("huggingface-tools/text-to-image") + +image = image_generator(prompt="rivers and lakes") +``` + +その後、自分で変更して実行できます。 diff --git a/docs/source/ja/troubleshooting.md b/docs/source/ja/troubleshooting.md new file mode 100644 index 00000000000000..b13b5993171a0a --- /dev/null +++ b/docs/source/ja/troubleshooting.md @@ -0,0 +1,195 @@ + + +# Troubleshoot + +時にはエラーが発生することがありますが、私たちはここにいます!このガイドでは、私たちがよく見る最も一般的な問題と、それらを解決する方法について説明します。ただし、このガイドはすべての 🤗 Transformers の問題の包括的なコレクションではありません。問題をトラブルシューティングするための詳細なヘルプが必要な場合は、以下の方法を試してみてください: + + + +1. [フォーラム](https://discuss.huggingface.co/)で助けを求める。 [初心者向け](https://discuss.huggingface.co/c/beginners/5) または [🤗 Transformers](https://discuss.huggingface.co/c/transformers/9) など、質問を投稿できる特定のカテゴリがあります。問題が解決される可能性を最大限にするために、再現可能なコードを含む良い説明的なフォーラム投稿を書くことを確認してください! + + + +2. バグがライブラリに関連する場合は、🤗 Transformers リポジトリで [Issue](https://github.com/huggingface/transformers/issues/new/choose) を作成してください。バグを説明するためのできるだけ多くの情報を含めるように心がけ、何が問題で、どのように修正できるかをより良く理解できるようにしてください。 + +3. より古いバージョンの 🤗 Transformers を使用している場合は、[Migration](migration) ガイドを確認してください。バージョン間で重要な変更が導入されているためです。 + +トラブルシューティングとヘルプの詳細については、Hugging Faceコースの [第8章](https://huggingface.co/course/chapter8/1?fw=pt) を参照してください。 + +## Firewalled environments + +一部のクラウド上のGPUインスタンスやイントラネットセットアップは、外部接続に対してファイアウォールで保護されているため、接続エラーが発生することがあります。スクリプトがモデルの重みやデータセットをダウンロードしようとすると、ダウンロードが途中で止まり、次のメッセージとタイムアウトエラーが表示されます: + +``` +ValueError: Connection error, and we cannot find the requested files in the cached path. +Please try again or make sure your Internet connection is on. 
+``` + + +この場合、接続エラーを回避するために[オフラインモード](installation#offline-mode)で🤗 Transformersを実行してみてください。 + +## CUDA out of memory + +数百万のパラメータを持つ大規模なモデルのトレーニングは、適切なハードウェアなしでは課題です。GPUのメモリが不足するとよくあるエラーの1つは次のとおりです: + +以下はメモリ使用量を減らすために試すことができるいくつかの解決策です: + +- [`TrainingArguments`]の中で [`per_device_train_batch_size`](main_classes/trainer#transformers.TrainingArguments.per_device_train_batch_size) の値を減らす。 +- [`TrainingArguments`]の中で [`gradient_accumulation_steps`](main_classes/trainer#transformers.TrainingArguments.gradient_accumulation_steps) を使用して、全体的なバッチサイズを効果的に増やすことを試す。 + + + +メモリ節約のテクニックについての詳細は、[ガイド](performance)を参照してください。 + + + +## Unable to load a saved TensorFlow model + +TensorFlowの[model.save](https://www.tensorflow.org/tutorials/keras/save_and_load#save_the_entire_model)メソッドは、モデル全体 - アーキテクチャ、重み、トレーニング設定 - を1つのファイルに保存します。しかし、モデルファイルを再度読み込む際にエラーが発生することがあります。これは、🤗 Transformersがモデルファイル内のすべてのTensorFlow関連オブジェクトを読み込まないためです。TensorFlowモデルの保存と読み込みに関する問題を回避するために、次のことをお勧めします: + +- モデルの重みを`h5`ファイル拡張子で保存し、[`~TFPreTrainedModel.from_pretrained`]を使用してモデルを再読み込みする: + +```py +>>> from transformers import TFPreTrainedModel + +>>> model.save_weights("some_folder/tf_model.h5") +>>> model = TFPreTrainedModel.from_pretrained("some_folder") +``` + +- Save the model with [`~TFPretrainedModel.save_pretrained`] and load it again with [`~TFPreTrainedModel.from_pretrained`]: + +```py +>>> from transformers import TFPreTrainedModel + +>>> model.save_pretrained("path_to/model") +>>> model = TFPreTrainedModel.from_pretrained("path_to/model") +``` + +## ImportError + +もう一つよくあるエラーは、特に新しくリリースされたモデルの場合に遭遇することがある `ImportError` です: + + +``` +ImportError: cannot import name 'ImageGPTImageProcessor' from 'transformers' (unknown location) +``` + +これらのエラータイプに関しては、最新バージョンの 🤗 Transformers がインストールされていることを確認して、最新のモデルにアクセスできるようにしてください: + +```bash +pip install transformers --upgrade +``` + +## CUDA error: device-side assert triggered + +時々、デバイスコードでエラーが発生したという一般的な CUDA エラーに遭遇することがあります。 + +``` +RuntimeError: CUDA error: device-side assert triggered +``` + +より具体的なエラーメッセージを取得するために、まずはCPU上でコードを実行してみることをお勧めします。以下の環境変数をコードの冒頭に追加して、CPUに切り替えてみてください: + +```py +>>> import os + +>>> os.environ["CUDA_VISIBLE_DEVICES"] = "" +``` + +GPUからより良いトレースバックを取得する別のオプションは、次の環境変数をコードの先頭に追加することです。これにより、エラーの発生源を指すトレースバックが得られます: + +```py +>>> import os + +>>> os.environ["CUDA_LAUNCH_BLOCKING"] = "1" +``` + + +## Incorrect output when padding tokens aren't masked + +一部のケースでは、`input_ids`にパディングトークンが含まれている場合、出力の`hidden_state`が正しくないことがあります。デモンストレーションのために、モデルとトークナイザーをロードします。モデルの`pad_token_id`にアクセスして、その値を確認できます。一部のモデルでは`pad_token_id`が`None`になることもありますが、常に手動で設定することができます。 + + +```py +>>> from transformers import AutoModelForSequenceClassification +>>> import torch + +>>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-uncased") +>>> model.config.pad_token_id +0 +``` + +以下の例は、パディングトークンをマスクせずに出力を表示したものです: + +```py +>>> input_ids = torch.tensor([[7592, 2057, 2097, 2393, 9611, 2115], [7592, 0, 0, 0, 0, 0]]) +>>> output = model(input_ids) +>>> print(output.logits) +tensor([[ 0.0082, -0.2307], + [ 0.1317, -0.1683]], grad_fn=) +``` + +以下は、第2のシーケンスの実際の出力です: + +```py +>>> input_ids = torch.tensor([[7592]]) +>>> output = model(input_ids) +>>> print(output.logits) +tensor([[-0.1008, -0.4061]], grad_fn=) +``` + +大抵の場合、モデルには `attention_mask` を提供して、パディングトークンを無視し、このような無音のエラーを回避する必要があります。これにより、2番目のシーケンスの出力が実際の出力と一致するようになります。 + + + +デフォルトでは、トークナイザは、トークナイザのデフォルトに基づいて `attention_mask` を自動で作成します。 + + + +```py +>>> attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1], [1, 
0, 0, 0, 0, 0]]) +>>> output = model(input_ids, attention_mask=attention_mask) +>>> print(output.logits) +tensor([[ 0.0082, -0.2307], + [-0.1008, -0.4061]], grad_fn=) +``` + +🤗 Transformersは、提供されるパディングトークンをマスクするために自動的に`attention_mask`を作成しません。その理由は以下の通りです: + +- 一部のモデルにはパディングトークンが存在しない場合があるためです。 +- 一部のユースケースでは、ユーザーがパディングトークンにアテンションを向けることを望む場合があるためです。 + +## ValueError: Unrecognized configuration class XYZ for this kind of AutoModel + +一般的に、事前学習済みモデルのインスタンスをロードするためには[`AutoModel`]クラスを使用することをお勧めします。このクラスは、設定に基づいて与えられたチェックポイントから正しいアーキテクチャを自動的に推測およびロードできます。モデルをロードする際にこの`ValueError`が表示される場合、Autoクラスは与えられたチェックポイントの設定から、ロードしようとしているモデルの種類へのマッピングを見つけることができなかったことを意味します。最も一般的には、特定のタスクをサポートしないチェックポイントがある場合にこのエラーが発生します。 +例えば、質問応答のためのGPT2が存在しない場合、次の例でこのエラーが表示されます: + +上記のテキストを日本語に翻訳し、Markdownファイルとしてフォーマットしました。 + + +```py +>>> from transformers import AutoProcessor, AutoModelForQuestionAnswering + +>>> processor = AutoProcessor.from_pretrained("openai-community/gpt2-medium") +>>> model = AutoModelForQuestionAnswering.from_pretrained("openai-community/gpt2-medium") +ValueError: Unrecognized configuration class for this kind of AutoModel: AutoModelForQuestionAnswering. +Model type should be one of AlbertConfig, BartConfig, BertConfig, BigBirdConfig, BigBirdPegasusConfig, BloomConfig, ... +``` diff --git a/docs/source/ko/_config.py b/docs/source/ko/_config.py index 5d966e8c40f081..9bdfef7af94b5a 100644 --- a/docs/source/ko/_config.py +++ b/docs/source/ko/_config.py @@ -10,5 +10,5 @@ black_avoid_patterns = { "{processor_class}": "FakeProcessorClass", "{model_class}": "FakeModelClass", - "{object_class}": "FakeObjectClass", + "{object_class}": "FakeObjectClass", } diff --git a/docs/source/ko/_toctree.yml b/docs/source/ko/_toctree.yml index 06ab5ee766345f..f7a5f640107526 100644 --- a/docs/source/ko/_toctree.yml +++ b/docs/source/ko/_toctree.yml @@ -7,56 +7,686 @@ title: 설치방법 title: 시작하기 - sections: - - local: in_translation - title: (번역 중) + - local: pipeline_tutorial + title: Pipeline으로 추론하기 + - local: autoclass_tutorial + title: AutoClass로 사전 학습된 인스턴스 로드하기 + - local: preprocessing + title: 데이터 전처리하기 + - local: training + title: 사전 학습된 모델 미세 조정하기 + - local: run_scripts + title: 스크립트로 학습하기 + - local: accelerate + title: 🤗 Accelerate로 분산 학습 구성하기 + - local: peft + title: 🤗 PEFT로 어댑터 로드 및 학습하기 + - local: model_sharing + title: 만든 모델 공유하기 + - local: transformers_agents + title: 에이전트 + - local: llm_tutorial + title: 대규모 언어 모델로 생성하기 title: 튜토리얼 - sections: - - local: in_translation - title: (번역 중) - title: How-to 가이드 + - sections: + - local: tasks/sequence_classification + title: 텍스트 분류 + - local: tasks/token_classification + title: 토큰 분류 + - local: tasks/question_answering + title: 질의 응답(Question Answering) + - local: tasks/language_modeling + title: 인과적 언어 모델링(Causal language modeling) + - local: tasks/masked_language_modeling + title: 마스킹된 언어 모델링(Masked language modeling) + - local: tasks/translation + title: 번역 + - local: tasks/summarization + title: 요약 + - local: tasks/multiple_choice + title: 객관식 문제(Multiple Choice) + title: 자연어처리 + isExpanded: false + - sections: + - local: tasks/audio_classification + title: 오디오 분류 + - local: tasks/asr + title: 자동 음성 인식 + title: 오디오 + isExpanded: false + - sections: + - local: tasks/image_classification + title: 이미지 분류 + - local: tasks/semantic_segmentation + title: 의미적 분할(Semantic segmentation) + - local: tasks/video_classification + title: 영상 분류 + - local: tasks/object_detection + title: 객체 탐지 + - local: tasks/zero_shot_object_detection + title: 제로샷(zero-shot) 객체 탐지 + - local: 
tasks/zero_shot_image_classification + title: 제로샷(zero-shot) 이미지 분류 + - local: tasks/monocular_depth_estimation + title: 단일 영상 기반 깊이 추정 + title: 컴퓨터 비전 + isExpanded: false + - sections: + - local: tasks/image_captioning + title: 이미지 캡셔닝 + - local: tasks/document_question_answering + title: 문서 질의 응답(Document Question Answering) + - local: tasks/visual_question_answering + title: 시각적 질의응답 (Visual Question Answering) + title: 멀티모달 + isExpanded: false + title: 태스크 가이드 +- sections: + - local: fast_tokenizers + title: 🤗 Tokenizers 라이브러리에서 토크나이저 사용하기 + - local: multilingual + title: 다국어 모델 추론하기 + - local: in_translation + title: (번역중) Customize text generation strategy + - local: create_a_model + title: 모델별 API 사용하기 + - local: custom_models + title: 사용자 정의 모델 공유하기 + - local: sagemaker + title: Amazon SageMaker에서 학습 실행하기 + - local: serialization + title: ONNX로 내보내기 + - local: tflite + title: TFLite로 내보내기 + - local: torchscript + title: TorchScript로 내보내기 + - local: in_translation + title: (번역중) Benchmarks + - local: in_translation + title: (번역중) Notebooks with examples + - local: community + title: 커뮤니티 리소스 + - local: custom_tools + title: 사용자 정의 도구와 프롬프트 + - local: troubleshooting + title: 문제 해결 + title: (번역중) 개발자 가이드 +- sections: + - local: performance + title: 성능 및 확장성 + - local: in_translation + title: (번역중) Training on one GPU + - local: perf_train_gpu_many + title: 다중 GPU에서 훈련 진행하기 + - local: perf_train_cpu + title: CPU에서 훈련 + - local: perf_train_cpu_many + title: 다중 CPU에서 훈련하기 + - local: in_translation + title: (번역중) Training on TPUs + - local: perf_train_tpu_tf + title: TensorFlow로 TPU에서 훈련하기 + - local: in_translation + title: (번역중) Training on Specialized Hardware + - local: perf_infer_cpu + title: CPU로 추론하기 + - local: perf_infer_gpu_one + title: 하나의 GPU를 활용한 추론 + - local: perf_infer_gpu_many + title: 다중 GPU에서 추론 + - local: in_translation + title: (번역중) Inference on Specialized Hardware + - local: perf_hardware + title: 훈련용 사용자 맞춤형 하드웨어 + - local: big_models + title: 대형 모델을 인스턴스화 + - local: debugging + title: 디버깅 + - local: hpo_train + title: Trainer API를 사용한 하이퍼파라미터 탐색 + - local: tf_xla + title: TensorFlow 모델을 위한 XLA 통합 + title: (번역중) 성능 및 확장성 +- sections: + - local: contributing + title: 🤗 Transformers에 기여하는 방법 + - local: add_new_model + title: 🤗 Transformers에 새로운 모델을 추가하는 방법 + - local: add_tensorflow_model + title: 어떻게 🤗 Transformers 모델을 TensorFlow로 변환하나요? + - local: add_new_pipeline + title: 어떻게 🤗 Transformers에 파이프라인을 추가하나요? 
+ - local: testing + title: 테스트 + - local: pr_checks + title: Pull Request에 대한 검사 + title: (번역중) 기여하기 + - sections: + - local: philosophy + title: 이념과 목표 - local: in_translation - title: (번역 중) - title: 개념 가이드 + title: (번역중) Glossary + - local: task_summary + title: 🤗 Transformers로 할 수 있는 작업 + - local: tasks_explained + title: 🤗 Transformers로 작업을 해결하는 방법 + - local: model_summary + title: Transformer 모델군 + - local: tokenizer_summary + title: 토크나이저 요약 + - local: attention + title: 어텐션 매커니즘 + - local: pad_truncation + title: 패딩과 잘라내기 + - local: bertology + title: BERTology + - local: perplexity + title: 고정 길이 모델의 펄플렉서티(Perplexity) + - local: pipeline_webserver + title: 추론 웹 서버를 위한 파이프라인 + - local: model_memory_anatomy + title: 모델 학습 해부하기 + title: (번역중) 개념 가이드 - sections: - sections: - local: in_translation - title: (번역 중) - title: 메인 클래스 + title: (번역중) Auto Classes + - local: in_translation + title: (번역중) Callbacks + - local: in_translation + title: (번역중) Configuration + - local: in_translation + title: (번역중) Data Collator + - local: in_translation + title: (번역중) Keras callbacks + - local: in_translation + title: (번역중) Logging + - local: in_translation + title: (번역중) Models + - local: in_translation + title: (번역중) Text Generation + - local: in_translation + title: (번역중) ONNX + - local: in_translation + title: (번역중) Optimization + - local: in_translation + title: (번역중) Model outputs + - local: in_translation + title: (번역중) Pipelines + - local: in_translation + title: (번역중) Processors + - local: in_translation + title: (번역중) Quantization + - local: in_translation + title: (번역중) Tokenizer + - local: in_translation + title: (번역중) Trainer + - local: in_translation + title: (번역중) DeepSpeed Integration + - local: in_translation + title: (번역중) Feature Extractor + - local: in_translation + title: (번역중) Image Processor + title: (번역중) 메인 클래스 - sections: - isExpanded: false sections: - local: in_translation - title: (번역 중) - title: 텍스트 모델 + title: (번역중) ALBERT + - local: in_translation + title: (번역중) BART + - local: in_translation + title: (번역중) BARThez + - local: in_translation + title: (번역중) BARTpho + - local: in_translation + title: (번역중) BERT + - local: in_translation + title: (번역중) BertGeneration + - local: in_translation + title: (번역중) BertJapanese + - local: in_translation + title: (번역중) Bertweet + - local: in_translation + title: (번역중) BigBird + - local: in_translation + title: (번역중) BigBirdPegasus + - local: in_translation + title: (번역중) BioGpt + - local: in_translation + title: (번역중) Blenderbot + - local: in_translation + title: (번역중) Blenderbot Small + - local: in_translation + title: (번역중) BLOOM + - local: in_translation + title: (번역중) BORT + - local: in_translation + title: (번역중) ByT5 + - local: in_translation + title: (번역중) CamemBERT + - local: in_translation + title: (번역중) CANINE + - local: in_translation + title: (번역중) CodeGen + - local: in_translation + title: (번역중) ConvBERT + - local: in_translation + title: (번역중) CPM + - local: in_translation + title: (번역중) CPMANT + - local: in_translation + title: (번역중) CTRL + - local: in_translation + title: (번역중) DeBERTa + - local: in_translation + title: (번역중) DeBERTa-v2 + - local: in_translation + title: (번역중) DialoGPT + - local: in_translation + title: (번역중) DistilBERT + - local: in_translation + title: (번역중) DPR + - local: in_translation + title: (번역중) ELECTRA + - local: in_translation + title: (번역중) Encoder Decoder Models + - local: in_translation + title: (번역중) ERNIE + - local: in_translation + title: (번역중) ErnieM + - local: in_translation + 
title: (번역중) ESM + - local: in_translation + title: (번역중) FLAN-T5 + - local: in_translation + title: (번역중) FLAN-UL2 + - local: in_translation + title: (번역중) FlauBERT + - local: in_translation + title: (번역중) FNet + - local: in_translation + title: (번역중) FSMT + - local: in_translation + title: (번역중) Funnel Transformer + - local: in_translation + title: (번역중) GPT + - local: in_translation + title: (번역중) GPT Neo + - local: in_translation + title: (번역중) GPT NeoX + - local: in_translation + title: (번역중) GPT NeoX Japanese + - local: in_translation + title: (번역중) GPT-J + - local: in_translation + title: (번역중) GPT2 + - local: in_translation + title: (번역중) GPTBigCode + - local: in_translation + title: (번역중) GPTSAN Japanese + - local: in_translation + title: (번역중) GPTSw3 + - local: in_translation + title: (번역중) HerBERT + - local: in_translation + title: (번역중) I-BERT + - local: in_translation + title: (번역중) Jukebox + - local: in_translation + title: (번역중) LED + - local: model_doc/llama + title: LLaMA + - local: model_doc/llama2 + title: LLaMA2 + - local: in_translation + title: (번역중) Longformer + - local: in_translation + title: (번역중) LongT5 + - local: in_translation + title: (번역중) LUKE + - local: in_translation + title: (번역중) M2M100 + - local: in_translation + title: (번역중) MarianMT + - local: in_translation + title: (번역중) MarkupLM + - local: in_translation + title: (번역중) MBart and MBart-50 + - local: in_translation + title: (번역중) MEGA + - local: in_translation + title: (번역중) MegatronBERT + - local: in_translation + title: (번역중) MegatronGPT2 + - local: in_translation + title: (번역중) mLUKE + - local: in_translation + title: (번역중) MobileBERT + - local: in_translation + title: (번역중) MPNet + - local: in_translation + title: (번역중) MT5 + - local: in_translation + title: (번역중) MVP + - local: in_translation + title: (번역중) NEZHA + - local: in_translation + title: (번역중) NLLB + - local: in_translation + title: (번역중) NLLB-MoE + - local: in_translation + title: (번역중) Nyströmformer + - local: in_translation + title: (번역중) Open-Llama + - local: in_translation + title: (번역중) OPT + - local: in_translation + title: (번역중) Pegasus + - local: in_translation + title: (번역중) PEGASUS-X + - local: in_translation + title: (번역중) PhoBERT + - local: in_translation + title: (번역중) PLBart + - local: in_translation + title: (번역중) ProphetNet + - local: in_translation + title: (번역중) QDQBert + - local: in_translation + title: (번역중) RAG + - local: in_translation + title: (번역중) REALM + - local: in_translation + title: (번역중) Reformer + - local: in_translation + title: (번역중) RemBERT + - local: in_translation + title: (번역중) RetriBERT + - local: in_translation + title: (번역중) RoBERTa + - local: in_translation + title: (번역중) RoBERTa-PreLayerNorm + - local: in_translation + title: (번역중) RoCBert + - local: in_translation + title: (번역중) RoFormer + - local: in_translation + title: (번역중) Splinter + - local: in_translation + title: (번역중) SqueezeBERT + - local: in_translation + title: (번역중) SwitchTransformers + - local: in_translation + title: (번역중) T5 + - local: in_translation + title: (번역중) T5v1.1 + - local: in_translation + title: (번역중) TAPEX + - local: in_translation + title: (번역중) Transformer XL + - local: in_translation + title: (번역중) UL2 + - local: in_translation + title: (번역중) X-MOD + - local: in_translation + title: (번역중) XGLM + - local: in_translation + title: (번역중) XLM + - local: in_translation + title: (번역중) XLM-ProphetNet + - local: in_translation + title: (번역중) XLM-RoBERTa + - local: in_translation + title: (번역중) XLM-RoBERTa-XL + - local: 
in_translation + title: (번역중) XLM-V + - local: in_translation + title: (번역중) XLNet + - local: in_translation + title: (번역중) YOSO + title: (번역중) 텍스트 모델 - isExpanded: false sections: - local: in_translation - title: (번역 중) - title: 비전 모델 + title: (번역중) BEiT + - local: in_translation + title: (번역중) BiT + - local: in_translation + title: (번역중) Conditional DETR + - local: in_translation + title: (번역중) ConvNeXT + - local: in_translation + title: (번역중) ConvNeXTV2 + - local: in_translation + title: (번역중) CvT + - local: in_translation + title: (번역중) Deformable DETR + - local: in_translation + title: (번역중) DeiT + - local: in_translation + title: (번역중) DETA + - local: in_translation + title: (번역중) DETR + - local: in_translation + title: (번역중) DiNAT + - local: in_translation + title: (번역중) DiT + - local: in_translation + title: (번역중) DPT + - local: in_translation + title: (번역중) EfficientFormer + - local: in_translation + title: (번역중) EfficientNet + - local: in_translation + title: (번역중) FocalNet + - local: in_translation + title: (번역중) GLPN + - local: in_translation + title: (번역중) ImageGPT + - local: in_translation + title: (번역중) LeViT + - local: in_translation + title: (번역중) Mask2Former + - local: in_translation + title: (번역중) MaskFormer + - local: in_translation + title: (번역중) MobileNetV1 + - local: in_translation + title: (번역중) MobileNetV2 + - local: in_translation + title: (번역중) MobileViT + - local: in_translation + title: (번역중) NAT + - local: in_translation + title: (번역중) PoolFormer + - local: in_translation + title: (번역중) RegNet + - local: in_translation + title: (번역중) ResNet + - local: in_translation + title: (번역중) SegFormer + - local: in_translation + title: (번역중) Swin Transformer + - local: in_translation + title: (번역중) Swin Transformer V2 + - local: in_translation + title: (번역중) Swin2SR + - local: in_translation + title: (번역중) Table Transformer + - local: in_translation + title: (번역중) TimeSformer + - local: in_translation + title: (번역중) UperNet + - local: in_translation + title: (번역중) VAN + - local: in_translation + title: (번역중) VideoMAE + - local: in_translation + title: (번역중) Vision Transformer (ViT) + - local: in_translation + title: (번역중) ViT Hybrid + - local: in_translation + title: (번역중) ViTMAE + - local: in_translation + title: (번역중) ViTMSN + - local: in_translation + title: (번역중) YOLOS + title: (번역중) 비전 모델 - isExpanded: false sections: - local: in_translation - title: (번역 중) - title: 오디오 모델 + title: (번역중) Audio Spectrogram Transformer + - local: in_translation + title: (번역중) CLAP + - local: in_translation + title: (번역중) Hubert + - local: in_translation + title: (번역중) MCTCT + - local: in_translation + title: (번역중) SEW + - local: in_translation + title: (번역중) SEW-D + - local: in_translation + title: (번역중) Speech2Text + - local: in_translation + title: (번역중) Speech2Text2 + - local: in_translation + title: (번역중) SpeechT5 + - local: in_translation + title: (번역중) UniSpeech + - local: in_translation + title: (번역중) UniSpeech-SAT + - local: in_translation + title: (번역중) Wav2Vec2 + - local: in_translation + title: (번역중) Wav2Vec2-Conformer + - local: in_translation + title: (번역중) Wav2Vec2Phoneme + - local: in_translation + title: (번역중) WavLM + - local: model_doc/whisper + title: Whisper + - local: in_translation + title: (번역중) XLS-R + - local: in_translation + title: (번역중) XLSR-Wav2Vec2 + title: (번역중) 오디오 모델 + - isExpanded: false + sections: + - local: in_translation + title: (번역중) ALIGN + - local: in_translation + title: (번역중) AltCLIP + - local: in_translation + title: (번역중) BLIP + - local: 
in_translation + title: (번역중) BLIP-2 + - local: in_translation + title: (번역중) BridgeTower + - local: in_translation + title: (번역중) Chinese-CLIP + - local: in_translation + title: (번역중) CLIP + - local: in_translation + title: (번역중) CLIPSeg + - local: in_translation + title: (번역중) Data2Vec + - local: in_translation + title: (번역중) DePlot + - local: in_translation + title: (번역중) Donut + - local: in_translation + title: (번역중) FLAVA + - local: in_translation + title: (번역중) GIT + - local: in_translation + title: (번역중) GroupViT + - local: in_translation + title: (번역중) LayoutLM + - local: in_translation + title: (번역중) LayoutLMV2 + - local: in_translation + title: (번역중) LayoutLMV3 + - local: in_translation + title: (번역중) LayoutXLM + - local: in_translation + title: (번역중) LiLT + - local: in_translation + title: (번역중) LXMERT + - local: in_translation + title: (번역중) MatCha + - local: in_translation + title: (번역중) MGP-STR + - local: in_translation + title: (번역중) OneFormer + - local: in_translation + title: (번역중) OWL-ViT + - local: in_translation + title: (번역중) Perceiver + - local: in_translation + title: (번역중) Pix2Struct + - local: in_translation + title: (번역중) Segment Anything + - local: in_translation + title: (번역중) Speech Encoder Decoder Models + - local: in_translation + title: (번역중) TAPAS + - local: in_translation + title: (번역중) TrOCR + - local: in_translation + title: (번역중) TVLT + - local: in_translation + title: (번역중) ViLT + - local: in_translation + title: (번역중) Vision Encoder Decoder Models + - local: in_translation + title: (번역중) Vision Text Dual Encoder + - local: in_translation + title: (번역중) VisualBERT + - local: in_translation + title: (번역중) X-CLIP + title: (번역중) 멀티모달 모델 - isExpanded: false sections: - local: in_translation - title: (번역 중) - title: 멀티모달 모델 + title: (번역중) Decision Transformer + - local: in_translation + title: (번역중) Trajectory Transformer + title: (번역중) 강화학습 모델 - isExpanded: false sections: - local: in_translation - title: (번역 중) - title: 강화학습 모델 + title: (번역중) Informer + - local: in_translation + title: (번역중) Time Series Transformer + title: (번역중) 시계열 모델 - isExpanded: false sections: - local: in_translation - title: (번역 중) - title: 시계열 모델 - title: 모델 + title: (번역중) Graphormer + title: (번역중) Graph models + title: (번역중) 모델 - sections: - local: in_translation - title: (번역 중) - title: 내부 유틸리티 - title: API + title: (번역중) Custom Layers and Utilities + - local: in_translation + title: (번역중) Utilities for pipelines + - local: in_translation + title: (번역중) Utilities for Tokenizers + - local: in_translation + title: (번역중) Utilities for Trainer + - local: in_translation + title: (번역중) Utilities for Generation + - local: in_translation + title: (번역중) Utilities for Image Processors + - local: in_translation + title: (번역중) Utilities for Audio processing + - local: in_translation + title: (번역중) General Utilities + - local: in_translation + title: (번역중) Utilities for Time Series + title: (번역중) Internal Helpers + title: (번역중) API diff --git a/docs/source/ko/accelerate.md b/docs/source/ko/accelerate.md new file mode 100644 index 00000000000000..0ef8957de3ac20 --- /dev/null +++ b/docs/source/ko/accelerate.md @@ -0,0 +1,136 @@ + + +# 🤗 Accelerate를 활용한 분산 학습[[distributed-training-with-accelerate]] + +모델이 커지면서 병렬 처리는 제한된 하드웨어에서 더 큰 모델을 훈련하고 훈련 속도를 몇 배로 가속화하기 위한 전략으로 등장했습니다. Hugging Face에서는 사용자가 하나의 머신에 여러 개의 GPU를 사용하든 여러 머신에 여러 개의 GPU를 사용하든 모든 유형의 분산 설정에서 🤗 Transformers 모델을 쉽게 훈련할 수 있도록 돕기 위해 [🤗 Accelerate](https://huggingface.co/docs/accelerate) 라이브러리를 만들었습니다. 
이 튜토리얼에서는 분산 환경에서 훈련할 수 있도록 기본 PyTorch 훈련 루프를 커스터마이즈하는 방법을 알아봅시다. + +## 설정[[setup]] + +🤗 Accelerate 설치 시작하기: + +```bash +pip install accelerate +``` + +그 다음, [`~accelerate.Accelerator`] 객체를 불러오고 생성합니다. [`~accelerate.Accelerator`]는 자동으로 분산 설정 유형을 감지하고 훈련에 필요한 모든 구성 요소를 초기화합니다. 장치에 모델을 명시적으로 배치할 필요는 없습니다. + +```py +>>> from accelerate import Accelerator + +>>> accelerator = Accelerator() +``` + +## 가속화를 위한 준비[[prepare-to-accelerate]] + +다음 단계는 관련된 모든 훈련 객체를 [`~accelerate.Accelerator.prepare`] 메소드에 전달하는 것입니다. 여기에는 훈련 및 평가 데이터로더, 모델 및 옵티마이저가 포함됩니다: + +```py +>>> train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare( +... train_dataloader, eval_dataloader, model, optimizer +... ) +``` + +## 백워드(Backward)[[backward]] + +마지막으로 훈련 루프의 일반적인 `loss.backward()`를 🤗 Accelerate의 [`~accelerate.Accelerator.backward`] 메소드로 대체하기만 하면 됩니다: + +```py +>>> for epoch in range(num_epochs): +... for batch in train_dataloader: +... outputs = model(**batch) +... loss = outputs.loss +... accelerator.backward(loss) + +... optimizer.step() +... lr_scheduler.step() +... optimizer.zero_grad() +... progress_bar.update(1) +``` + +다음 코드에서 볼 수 있듯이, 훈련 루프에 코드 네 줄만 추가하면 분산 학습을 활성화할 수 있습니다! + +```diff ++ from accelerate import Accelerator + from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler + ++ accelerator = Accelerator() + + model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2) + optimizer = AdamW(model.parameters(), lr=3e-5) + +- device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu") +- model.to(device) + ++ train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare( ++ train_dataloader, eval_dataloader, model, optimizer ++ ) + + num_epochs = 3 + num_training_steps = num_epochs * len(train_dataloader) + lr_scheduler = get_scheduler( + "linear", + optimizer=optimizer, + num_warmup_steps=0, + num_training_steps=num_training_steps + ) + + progress_bar = tqdm(range(num_training_steps)) + + model.train() + for epoch in range(num_epochs): + for batch in train_dataloader: +- batch = {k: v.to(device) for k, v in batch.items()} + outputs = model(**batch) + loss = outputs.loss +- loss.backward() ++ accelerator.backward(loss) + + optimizer.step() + lr_scheduler.step() + optimizer.zero_grad() + progress_bar.update(1) +``` + +## 학습[[train]] + +관련 코드를 추가한 후에는 스크립트나 Colaboratory와 같은 노트북에서 훈련을 시작하세요. + +### 스크립트로 학습하기[[train-with-a-script]] + +스크립트에서 훈련을 실행하는 경우, 다음 명령을 실행하여 구성 파일을 생성하고 저장합니다: + +```bash +accelerate config +``` + +Then launch your training with: + +```bash +accelerate launch train.py +``` + +### 노트북으로 학습하기[[train-with-a-notebook]] + +Collaboratory의 TPU를 사용하려는 경우, 노트북에서도 🤗 Accelerate를 실행할 수 있습니다. 훈련을 담당하는 모든 코드를 함수로 감싸서 [`~accelerate.notebook_launcher`]에 전달하세요: + +```py +>>> from accelerate import notebook_launcher + +>>> notebook_launcher(training_function) +``` + +🤗 Accelerate 및 다양한 기능에 대한 자세한 내용은 [documentation](https://huggingface.co/docs/accelerate)를 참조하세요. \ No newline at end of file diff --git a/docs/source/ko/add_new_model.md b/docs/source/ko/add_new_model.md new file mode 100644 index 00000000000000..752bbd4e4e3aae --- /dev/null +++ b/docs/source/ko/add_new_model.md @@ -0,0 +1,630 @@ + + +# Hugging Face Transformers를 추가하는 방법은 무엇인가요? [[how-to-add-a-model-to-transformers]] + +Hugging Face Transformers 라이브러리는 커뮤니티 기여자들 덕분에 새로운 모델을 제공할 수 있는 경우가 많습니다. 하지만 이는 도전적인 프로젝트이며 Hugging Face Transformers 라이브러리와 구현할 모델에 대한 깊은 이해가 필요합니다. 
Hugging Face에서는 더 많은 커뮤니티 멤버가 모델을 적극적으로 추가할 수 있도록 지원하고자 하며, 이 가이드를 통해 PyTorch 모델을 추가하는 과정을 안내하고 있습니다 (PyTorch가 설치되어 있는지 확인해주세요). + + + +TensorFlow 모델을 구현하고자 하는 경우 [🤗 Transformers 모델을 TensorFlow로 변환하는 방법](add_tensorflow_model) 가이드를 살펴보세요! + + + +이 과정을 진행하면 다음과 같은 내용을 이해하게 됩니다: + +- 오픈 소스의 모범 사례에 대한 통찰력을 얻습니다. +- 가장 인기 있는 딥러닝 라이브러리의 설계 원칙을 이해합니다. +- 대규모 모델을 효율적으로 테스트하는 방법을 배웁니다. +- `black`, `ruff`, `make fix-copies`와 같은 Python 유틸리티를 통합하여 깔끔하고 가독성 있는 코드를 작성하는 방법을 배웁니다. + +Hugging Face 팀은 항상 도움을 줄 준비가 되어 있으므로 혼자가 아니라는 점을 기억하세요. 🤗 ❤️ + +시작에 앞서 🤗 Transformers에 원하는 모델을 추가하기 위해 [New model addition](https://github.com/huggingface/transformers/issues/new?assignees=&labels=New+model&template=new-model-addition.yml) 이슈를 열어야 합니다. 특정 모델을 기여하는 데 특별히 까다로운 기준을 가지지 않는 경우 [New model label](https://github.com/huggingface/transformers/labels/New%20model)을 필터링하여 요청되지 않은 모델이 있는지 확인하고 작업할 수 있습니다. + +새로운 모델 요청을 열었다면 첫 번째 단계는 🤗 Transformers에 익숙해지는 것입니다! + +## 🤗 Transformers의 전반적인 개요 [[general-overview-of-transformers]] + +먼저 🤗 Transformers에 대한 전반적인 개요를 파악해야 합니다. 🤗 Transformers는 매우 주관적인 라이브러리이기 때문에 해당 라이브러리의 철학이나 설계 선택 사항에 동의하지 않을 수도 있습니다. 그러나 우리의 경험상 라이브러리의 기본적인 설계 선택과 철학은 🤗 Transformers의 규모를 효율적으로 확장하면서 유지 보수 비용을 합리적인 수준으로 유지하는 것입니다. + +[라이브러리의 철학에 대한 문서](philosophy)를 읽는 것이 라이브러리를 더 잘 이해하는 좋은 시작점입니다. 모든 모델에 적용하려는 몇 가지 작업 방식에 대한 선택 사항이 있습니다: + +- 일반적으로 추상화보다는 구성을 선호합니다. +- 코드를 복제하는 것이 항상 나쁜 것은 아닙니다. 코드의 가독성이나 접근성을 크게 향상시킨다면 복제하는 것은 좋습니다. +- 모델 파일은 가능한 한 독립적으로 유지되어야 합니다. 따라서 특정 모델의 코드를 읽을 때 해당 `modeling_....py` 파일만 확인하면 됩니다. + +우리는 라이브러리의 코드가 제품을 제공하는 수단뿐만 아니라 개선하고자 하는 제품이라고도 생각합니다. 따라서 모델을 추가할 때, 사용자는 모델을 사용할 사람뿐만 아니라 코드를 읽고 이해하고 필요한 경우 조정할 수 있는 모든 사람까지도 포함한다는 점을 기억해야 합니다. + +이를 염두에 두고 일반적인 라이브러리 설계에 대해 조금 더 자세히 알아보겠습니다. + +### 모델 개요 [[overview-of-models]] + +모델을 성공적으로 추가하려면 모델과 해당 구성인 [`PreTrainedModel`] 및 [`PretrainedConfig`] 간의 상호작용을 이해하는 것이 중요합니다. 예를 들어, 🤗 Transformers에 추가하려는 모델을 `BrandNewBert`라고 부르겠습니다. + +다음을 살펴보겠습니다: + + + +보다시피, 🤗 Transformers에서는 상속을 사용하지만 추상화 수준을 최소한으로 유지합니다. 라이브러리의 어떤 모델에서도 두 수준 이상의 추상화가 존재하지 않습니다. `BrandNewBertModel`은 `BrandNewBertPreTrainedModel`에서 상속받고, 이 클래스는 [`PreTrainedModel`]에서 상속받습니다. 이로써 새로운 모델은 [`PreTrainedModel`]에만 의존하도록 하려고 합니다. 모든 새로운 모델에 자동으로 제공되는 중요한 기능은 [`~PreTrainedModel.from_pretrained`] 및 [`~PreTrainedModel.save_pretrained`]입니다. 이러한 기능 외에도 `BrandNewBertModel.forward`와 같은 다른 중요한 기능은 새로운 `modeling_brand_new_bert.py` 스크립트에서 완전히 정의되어야 합니다. 또한 `BrandNewBertForMaskedLM`과 같은 특정 헤드 레이어를 가진 모델은 `BrandNewBertModel`을 상속받지 않고 forward pass에서 호출할 수 있는 `BrandNewBertModel`을 사용하여 추상화 수준을 낮게 유지합니다. 모든 새로운 모델은 `BrandNewBertConfig`라는 구성 클래스를 필요로 합니다. 이 구성은 항상 [`PreTrainedModel`]의 속성으로 저장되며, 따라서 `BrandNewBertPreTrainedModel`을 상속받는 모든 클래스에서 `config` 속성을 통해 액세스할 수 있습니다: + +```python +model = BrandNewBertModel.from_pretrained("brandy/brand_new_bert") +model.config # model has access to its config +``` + +모델과 마찬가지로 구성은 [`PretrainedConfig`]에서 기본 직렬화 및 역직렬화 기능을 상속받습니다. 구성과 모델은 항상 *pytorch_model.bin* 파일과 *config.json* 파일로 각각 별도로 직렬화됩니다. [`~PreTrainedModel.save_pretrained`]를 호출하면 자동으로 [`~PretrainedConfig.save_pretrained`]도 호출되므로 모델과 구성이 모두 저장됩니다. + + +### 코드 스타일 [[code-style]] + +새로운 모델을 작성할 때, Transformers는 주관적인 라이브러리이며 몇 가지 독특한 코딩 스타일이 있습니다: + +1. 모델의 forward pass는 모델 파일에 완전히 작성되어야 합니다. 라이브러리의 다른 모델에서 블록을 재사용하려면 코드를 복사하여 위에 `# Copied from` 주석과 함께 붙여넣으면 됩니다 (예: [여기](https://github.com/huggingface/transformers/blob/v4.17.0/src/transformers/models/roberta/modeling_roberta.py#L160)를 참조하세요). +2. 코드는 완전히 이해하기 쉬워야 합니다. 변수 이름을 명확하게 지정하고 약어를 사용하지 않는 것이 좋습니다. 예를 들어, `act`보다는 `activation`을 선호합니다. 
한 글자 변수 이름은 루프의 인덱스인 경우를 제외하고 권장되지 않습니다. +3. 더 일반적으로, 짧은 마법 같은 코드보다는 길고 명시적인 코드를 선호합니다. +4. PyTorch에서 `nn.Sequential`을 하위 클래스로 만들지 말고 `nn.Module`을 하위 클래스로 만들고 forward pass를 작성하여 다른 사람이 코드를 빠르게 디버그할 수 있도록 합니다. print 문이나 중단점을 추가할 수 있습니다. +5. 함수 시그니처에는 타입 주석을 사용해야 합니다. 그 외에는 타입 주석보다 변수 이름이 훨씬 읽기 쉽고 이해하기 쉽습니다. + +### 토크나이저 개요 [[overview-of-tokenizers]] + +아직 준비되지 않았습니다 :-( 이 섹션은 곧 추가될 예정입니다! + +## 🤗 Transformers에 모델 추가하는 단계별 방법 [[stepbystep-recipe-to-add-a-model-to-transformers]] + +각자 모델을 이식하는 방법에 대한 선호가 다르기 때문에 다른 기여자들이 Hugging Face에 모델을 이식하는 방법에 대한 요약을 살펴보는 것이 매우 유용할 수 있습니다. 다음은 모델을 이식하는 방법에 대한 커뮤니티 블로그 게시물 목록입니다: + +1. [GPT2 모델 이식하기](https://medium.com/huggingface/from-tensorflow-to-pytorch-265f40ef2a28) - [Thomas](https://huggingface.co/thomwolf) +2. [WMT19 MT 모델 이식하기](https://huggingface.co/blog/porting-fsmt) - [Stas](https://huggingface.co/stas) + +경험상 모델을 추가할 때 주의해야 할 가장 중요한 사항은 다음과 같습니다: + +- 같은 일을 반복하지 마세요! 새로운 🤗 Transformers 모델을 위해 추가할 코드의 대부분은 이미 🤗 Transformers 어딘가에 존재합니다. 이미 존재하는 복사할 수 있는 유사한 모델과 토크나이저를 찾는데 시간을 투자하세요. [grep](https://www.gnu.org/software/grep/)와 [rg](https://github.com/BurntSushi/ripgrep)를 참고하세요. 모델의 토크나이저가 한 모델을 기반으로 하고 모델링 코드가 다른 모델을 기반으로 하는 경우가 존재할 수도 있습니다. 예를 들어 FSMT의 모델링 코드는 BART를 기반으로 하고 FSMT의 토크나이저 코드는 XLM을 기반으로 합니다. +- 이것은 과학적인 도전보다는 공학적인 도전입니다. 논문의 모델의 모든 이론적 측면을 이해하려는 것보다 효율적인 디버깅 환경을 만드는 데 더 많은 시간을 소비해야 합니다. +- 막힐 때 도움을 요청하세요! 모델은 🤗 Transformers의 핵심 구성 요소이므로 Hugging Face의 우리는 당신이 모델을 추가하는 각 단계에서 기꺼이 도움을 줄 준비가 되어 있습니다. 진전이 없다고 느끼면 주저하지 말고 도움을 요청하세요. + +다음에서는 모델을 🤗 Transformers로 이식하는 데 가장 유용한 일반적인 절차를 제공하려고 노력합니다. + +다음 목록은 모델을 추가하는 데 수행해야 할 모든 작업의 요약이며 To-Do 목록으로 사용할 수 있습니다: + +☐ (선택 사항) BrandNewBert의 이론적 측면 이해
+☐ Hugging Face 개발 환경 준비
+☐ 원본 리포지토리의 디버깅 환경 설정
+☐ 원본 리포지토리와 체크포인트를 사용하여 `forward()` pass가 성공적으로 실행되는 스크립트 작성
+☐ 🤗 Transformers에 모델 스켈레톤 성공적으로 추가
+☐ 원본 체크포인트를 🤗 Transformers 체크포인트로 성공적으로 변환
+☐ 🤗 Transformers에서 원본 체크포인트와 동일한 출력을 내주는 `forward()` pass 성공적으로 실행
+☐ 🤗 Transformers에서 모델 테스트 완료
+☐ 🤗 Transformers에 토크나이저 성공적으로 추가
+☐ 종단 간 통합 테스트 실행
+☐ 문서 작성 완료
+☐ 모델 가중치를 허브에 업로드
+☐ Pull request 제출
+☐ (선택 사항) 데모 노트북 추가 + +우선, 일반적으로는 `BrandNewBert`의 이론적인 이해로 시작하는 것을 권장합니다. 그러나 이론적 측면을 직접 이해하는 대신 *직접 해보면서* 모델의 이론적 측면을 이해하는 것을 선호하는 경우 바로 `BrandNewBert` 코드 베이스로 빠져드는 것도 괜찮습니다. 이 옵션은 엔지니어링 기술이 이론적 기술보다 더 뛰어난 경우, `BrandNewBert`의 논문을 이해하는 데 어려움이 있는 경우, 또는 과학적인 논문을 읽는 것보다 프로그래밍에 훨씬 더 흥미 있는 경우에 더 적합할 수 있습니다. + +### 1. (선택 사항) BrandNewBert의 이론적 측면 [[1-optional-theoretical-aspects-of-brandnewbert]] + +만약 그런 서술적인 작업이 존재한다면, *BrandNewBert*의 논문을 읽어보는 시간을 가져야 합니다. 이해하기 어려운 섹션이 많을 수 있습니다. 그렇더라도 걱정하지 마세요! 목표는 논문의 깊은 이론적 이해가 아니라 *BrandNewBert*를 🤗 Transformers에서 효과적으로 재구현하기 위해 필요한 정보를 추출하는 것입니다. 이를 위해 이론적 측면에 너무 많은 시간을 투자할 필요는 없지만 다음과 같은 실제적인 측면에 집중해야 합니다: + +- *BrandNewBert*는 어떤 유형의 모델인가요? BERT와 유사한 인코더 모델인가요? GPT2와 유사한 디코더 모델인가요? BART와 유사한 인코더-디코더 모델인가요? 이들 간의 차이점에 익숙하지 않은 경우[model_summary](model_summary)를 참조하세요. +- *BrandNewBert*의 응용 분야는 무엇인가요? 텍스트 분류인가요? 텍스트 생성인가요? 요약과 같은 Seq2Seq 작업인가요? +- *brand_new_bert*와 BERT/GPT-2/BART의 차이점은 무엇인가요? +- *brand_new_bert*와 가장 유사한 [🤗 Transformers 모델](https://huggingface.co/transformers/#contents)은 무엇인가요? +- 어떤 종류의 토크나이저가 사용되나요? Sentencepiece 토크나이저인가요? Word piece 토크나이저인가요? BERT 또는 BART에 사용되는 동일한 토크나이저인가요? + +모델의 아키텍처에 대해 충분히 이해했다는 생각이 든 후, 궁금한 사항이 있으면 Hugging Face 팀에 문의하십시오. 이는 모델의 아키텍처, 어텐션 레이어 등에 관한 질문을 포함할 수 있습니다. Hugging Face의 유지 관리자들은 보통 코드를 검토하는 것에 대해 매우 기뻐하므로 당신을 돕는 일을 매우 환영할 것입니다! + +### 2. 개발 환경 설정 [[2-next-prepare-your-environment]] + +1. 저장소 페이지에서 "Fork" 버튼을 클릭하여 저장소의 사본을 GitHub 사용자 계정으로 만듭니다. + +2. `transformers` fork를 로컬 디스크에 클론하고 베이스 저장소를 원격 저장소로 추가합니다: + +```bash +git clone https://github.com/[your Github handle]/transformers.git +cd transformers +git remote add upstream https://github.com/huggingface/transformers.git +``` + +3. 개발 환경을 설정합니다. 다음 명령을 실행하여 개발 환경을 설정할 수 있습니다: + +```bash +python -m venv .env +source .env/bin/activate +pip install -e ".[dev]" +``` + +각 운영 체제에 따라 Transformers의 선택적 의존성이 개수가 증가하면 이 명령이 실패할 수 있습니다. 그런 경우에는 작업 중인 딥 러닝 프레임워크 (PyTorch, TensorFlow 및/또는 Flax)을 설치한 후, 다음 명령을 수행하면 됩니다: + +```bash +pip install -e ".[quality]" +``` + +대부분의 경우에는 이것으로 충분합니다. 그런 다음 상위 디렉토리로 돌아갑니다. + +```bash +cd .. +``` + +4. Transformers에 *brand_new_bert*의 PyTorch 버전을 추가하는 것을 권장합니다. PyTorch를 설치하려면 다음 링크의 지침을 따르십시오: https://pytorch.org/get-started/locally/. + +**참고:** CUDA를 설치할 필요는 없습니다. 새로운 모델이 CPU에서 작동하도록 만드는 것으로 충분합니다. + +5. *brand_new_bert*를 이식하기 위해서는 해당 원본 저장소에 접근할 수 있어야 합니다: + +```bash +git clone https://github.com/org_that_created_brand_new_bert_org/brand_new_bert.git +cd brand_new_bert +pip install -e . +``` + +이제 *brand_new_bert*를 🤗 Transformers로 이식하기 위한 개발 환경을 설정하였습니다. + +### 3.-4. 원본 저장소에서 사전 훈련된 체크포인트 실행하기 [[3.-4.-run-a-pretrained-checkpoint-using-the-original-repository]] + +먼저, 원본 *brand_new_bert* 저장소에서 작업을 시작합니다. 원본 구현은 보통 "연구용"으로 많이 사용됩니다. 즉, 문서화가 부족하고 코드가 이해하기 어려울 수 있습니다. 그러나 이것이 바로 *brand_new_bert*를 다시 구현하려는 동기가 되어야 합니다. Hugging Face에서의 주요 목표 중 하나는 **거인의 어깨 위에 서는 것**이며, 이는 여기에서 쉽게 해석되어 동작하는 모델을 가져와서 가능한 한 **접근 가능하고 사용자 친화적이며 아름답게** 만드는 것입니다. 이것은 🤗 Transformers에서 모델을 다시 구현하는 가장 중요한 동기입니다 - 새로운 복잡한 NLP 기술을 **모두에게** 접근 가능하게 만드는 것을 목표로 합니다. + +따라서 원본 저장소에 대해 자세히 살펴보는 것으로 시작해야 합니다. + +원본 저장소에서 공식 사전 훈련된 모델을 성공적으로 실행하는 것은 종종 **가장 어려운** 단계입니다. 우리의 경험에 따르면, 원본 코드 베이스에 익숙해지는 데 시간을 투자하는 것이 매우 중요합니다. 다음을 파악해야 합니다: + +- 사전 훈련된 가중치를 어디서 찾을 수 있는지? +- 사전 훈련된 가중치를 해당 모델에로드하는 방법은? +- 모델과 독립적으로 토크나이저를 실행하는 방법은? +- 간단한 forward pass에 필요한 클래스와 함수를 파악하기 위해 forward pass를 한 번 추적해 보세요. 일반적으로 해당 함수들만 다시 구현하면 됩니다. +- 모델의 중요한 구성 요소를 찾을 수 있어야 합니다. 모델 클래스는 어디에 있나요? 모델 하위 클래스(*EncoderModel*, *DecoderModel* 등)가 있나요? self-attention 레이어는 어디에 있나요? 
self-attention, cross-attention 등 여러 가지 다른 어텐션 레이어가 있나요? +- 원본 환경에서 모델을 디버그할 수 있는 방법은 무엇인가요? *print* 문을 추가해야 하나요? *ipdb*와 같은 대화식 디버거를 사용할 수 있나요? PyCharm과 같은 효율적인 IDE를 사용해 모델을 디버그할 수 있나요? + +원본 저장소에서 코드를 이식하는 작업을 시작하기 전에 원본 저장소에서 코드를 **효율적으로** 디버그할 수 있어야 합니다! 또한, 오픈 소스 라이브러리로 작업하고 있다는 것을 기억해야 합니다. 따라서 원본 저장소에서 issue를 열거나 pull request를 열기를 주저하지 마십시오. 이 저장소의 유지 관리자들은 누군가가 자신들의 코드를 살펴본다는 것에 대해 매우 기뻐할 것입니다! + +현재 시점에서, 원래 모델을 디버깅하기 위해 어떤 디버깅 환경과 전략을 선호하는지는 당신에게 달렸습니다. 우리는 고가의 GPU 환경을 구축하는 것은 비추천합니다. 대신, 원래 저장소로 들어가서 작업을 시작할 때와 🤗 Transformers 모델의 구현을 시작할 때에도 CPU에서 작업하는 것이 좋습니다. 모델이 이미 🤗 Transformers로 성공적으로 이식되었을 때에만 모델이 GPU에서도 예상대로 작동하는지 확인해야합니다. + +일반적으로, 원래 모델을 실행하기 위한 두 가지 가능한 디버깅 환경이 있습니다. + +- [Jupyter 노트북](https://jupyter.org/) / [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb) +- 로컬 Python 스크립트 + +Jupyter 노트북의 장점은 셀 단위로 실행할 수 있다는 것입니다. 이는 논리적인 구성 요소를 더 잘 분리하고 중간 결과를 저장할 수 있으므로 디버깅 사이클이 더 빨라질 수 있습니다. 또한, 노트북은 다른 기여자와 쉽게 공유할 수 있으므로 Hugging Face 팀의 도움을 요청하려는 경우 매우 유용할 수 있습니다. Jupyter 노트북에 익숙하다면 이를 사용하는 것을 강력히 추천합니다. + +Jupyter 노트북의 단점은 사용에 익숙하지 않은 경우 새로운 프로그래밍 환경에 적응하는 데 시간을 할애해야 하며, `ipdb`와 같은 알려진 디버깅 도구를 더 이상 사용할 수 없을 수도 있다는 것입니다. + +각 코드 베이스에 대해 좋은 첫 번째 단계는 항상 **작은** 사전 훈련된 체크포인트를 로드하고 더미 정수 벡터 입력을 사용하여 단일 forward pass를 재현하는 것입니다. 이와 같은 스크립트는 다음과 같을 수 있습니다(의사 코드로 작성): + +```python +model = BrandNewBertModel.load_pretrained_checkpoint("/path/to/checkpoint/") +input_ids = [0, 4, 5, 2, 3, 7, 9] # vector of input ids +original_output = model.predict(input_ids) +``` + +다음으로, 디버깅 전략에 대해 일반적으로 다음과 같은 몇 가지 선택지가 있습니다: + +- 원본 모델을 많은 작은 테스트 가능한 구성 요소로 분해하고 각각에 대해 forward pass를 실행하여 검증합니다. +- 원본 모델을 원본 *tokenizer*과 원본 *model*로만 분해하고 해당 부분에 대해 forward pass를 실행한 후 검증을 위해 중간 출력(print 문 또는 중단점)을 사용합니다. + +다시 말하지만, 어떤 전략을 선택할지는 당신에게 달려 있습니다. 원본 코드 베이스에 따라 하나 또는 다른 전략이 유리할 수 있습니다. + +원본 코드 베이스를 모델의 작은 하위 구성 요소로 분해할 수 있는지 여부, 예를 들어 원본 코드 베이스가 즉시 실행 모드에서 간단히 실행될 수 있는 경우, 그런 경우에는 그 노력이 가치가 있다는 것이 일반적입니다. 초기에 더 어려운 방법을 선택하는 것에는 몇 가지 중요한 장점이 있습니다. + +- 원본 모델을 🤗 Transformers 구현과 비교할 때 각 구성 요소가 일치하는지 자동으로 확인할 수 있습니다. 즉, 시각적인 비교(print 문을 통한 비교가 아닌) 대신 🤗 Transformers 구현과 그에 대응하는 원본 구성 요소가 일치하는지 확인할 수 있습니다. +- 전체 모델을 모듈별로, 즉 작은 구성 요소로 분해함으로써 모델을 이식하는 큰 문제를 단순히 개별 구성 요소를 이식하는 작은 문제로 분해할 수 있으므로 작업을 더 잘 구조화할 수 있습니다. +- 모델을 논리적으로 의미 있는 구성 요소로 분리하는 것은 모델의 설계에 대한 더 나은 개요를 얻고 모델을 더 잘 이해하는 데 도움이 됩니다. +- 이러한 구성 요소별 테스트를 통해 코드를 변경하면서 회귀가 발생하지 않도록 보장할 수 있습니다. + +[Lysandre의 ELECTRA 통합 검사](https://gist.github.com/LysandreJik/db4c948f6b4483960de5cbac598ad4ed)는 이를 수행하는 좋은 예제입니다. + +그러나 원본 코드 베이스가 매우 복잡하거나 중간 구성 요소를 컴파일된 모드에서 실행하는 것만 허용하는 경우, 모델을 테스트 가능한 작은 하위 구성 요소로 분해하는 것이 시간이 많이 소요되거나 불가능할 수도 있습니다. [T5의 MeshTensorFlow](https://github.com/tensorflow/mesh/tree/master/mesh_tensorflow) 라이브러리는 매우 복잡하며 모델을 하위 구성 요소로 분해하는 간단한 방법을 제공하지 않습니다. 이러한 라이브러리의 경우, 보통 print 문을 통해 확인합니다. + +어떤 전략을 선택하더라도 권장되는 절차는 동일합니다. 먼저 시작 레이어를 디버그하고 마지막 레이어를 마지막에 디버그하는 것이 좋습니다. + +다음 순서로 각 레이어의 출력을 검색하는 것이 좋습니다: + +1. 모델에 전달된 입력 ID 가져오기 +2. 워드 임베딩 가져오기 +3. 첫 번째 Transformer 레이어의 입력 가져오기 +4. 첫 번째 Transformer 레이어의 출력 가져오기 +5. 다음 n-1개의 Transformer 레이어의 출력 가져오기 +6. BrandNewBert 모델의 출력 가져오기 + +입력 ID는 정수 배열로 구성되며, 예를 들어 `input_ids = [0, 4, 4, 3, 2, 4, 1, 7, 19]`와 같을 수 있습니다. 
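+
+원본 모델이 PyTorch로 작성된 경우, 예를 들어 forward hook을 등록하여 위 순서대로 중간 출력을 수집할 수 있습니다. 아래는 `original_model`과 모듈 이름(`embeddings`, `encoder.layer.0`)이 모두 예시라는 가정하에 작성한 간단한 스케치입니다:
+
+```python
+import torch
+
+# 가정: original_model은 원본 저장소에서 로드한 PyTorch 모델이며,
+# "embeddings", "encoder.layer.0" 같은 모듈 이름은 예시일 뿐입니다.
+captured = {}
+
+
+def make_hook(name):
+    def hook(module, inputs, output):
+        # 나중에 🤗 Transformers 구현의 중간 출력과 비교할 수 있도록 저장합니다
+        captured[name] = output
+
+    return hook
+
+
+for name, module in original_model.named_modules():
+    if name in {"embeddings", "encoder.layer.0"}:  # 확인하려는 레이어 이름(예시)
+        module.register_forward_hook(make_hook(name))
+
+input_ids = torch.tensor([[0, 4, 4, 3, 2, 4, 1, 7, 19]])
+with torch.no_grad():
+    original_output = original_model(input_ids)
+```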
+ +다음 레이어의 출력은 종종 다차원 실수 배열로 구성되며, 다음과 같이 나타낼 수 있습니다: + +``` +[[ + [-0.1465, -0.6501, 0.1993, ..., 0.1451, 0.3430, 0.6024], + [-0.4417, -0.5920, 0.3450, ..., -0.3062, 0.6182, 0.7132], + [-0.5009, -0.7122, 0.4548, ..., -0.3662, 0.6091, 0.7648], + ..., + [-0.5613, -0.6332, 0.4324, ..., -0.3792, 0.7372, 0.9288], + [-0.5416, -0.6345, 0.4180, ..., -0.3564, 0.6992, 0.9191], + [-0.5334, -0.6403, 0.4271, ..., -0.3339, 0.6533, 0.8694]]], +``` + +🤗 Transformers에 추가되는 모든 모델은 통합 테스트를 통과해야 합니다. 즉, 원본 모델과 🤗 Transformers의 재구현 버전이 0.001의 정밀도로 정확히 동일한 출력을 내야 합니다! 동일한 모델이 다른 라이브러리에서 작성되었을 때 라이브러리 프레임워크에 따라 약간 다른 출력을 얻는 것은 정상이므로 1e-3(0.001)의 오차는 허용합니다. 거의 동일한 출력을 내는 것만으로는 충분하지 않으며, 완벽히 일치하는 수준이어야 합니다. 따라서 🤗 Transformers 버전의 중간 출력을 *brand_new_bert*의 원래 구현의 중간 출력과 여러 번 비교해야 합니다. 이 경우 원본 저장소의 **효율적인** 디버깅 환경이 절대적으로 중요합니다. 디버깅 환경을 가능한 한 효율적으로 만드는 몇 가지 조언을 제시합니다. + +- 중간 결과를 디버그하는 가장 좋은 방법을 찾으세요. 원본 저장소가 PyTorch로 작성되었다면 원본 모델을 더 작은 하위 구성 요소로 분해하여 중간 값을 검색하는 긴 스크립트를 작성하는 것에 시간을 투자할 가치가 있습니다. 원본 저장소가 Tensorflow 1로 작성되었다면 [tf.print](https://www.tensorflow.org/api_docs/python/tf/print)와 같은 Tensorflow 출력 작업을 사용하여 중간 값을 출력해야 할 수도 있습니다. 원본 저장소가 Jax로 작성되었다면 forward pass를 실행할 때 모델이 **jit 되지 않도록** 해야 합니다. 예를 들어 [이 링크](https://github.com/google/jax/issues/196)를 확인해 보세요. +- 사용 가능한 가장 작은 사전 훈련된 체크포인트를 사용하세요. 체크포인트가 작을수록 디버그 사이클이 더 빨라집니다. 전반적으로 forward pass에 10초 이상이 걸리는 경우 효율적이지 않습니다. 매우 큰 체크포인트만 사용할 수 있는 경우, 새 환경에서 임의로 초기화된 가중치로 더미 모델을 만들고 해당 가중치를 🤗 Transformers 버전과 비교하기 위해 저장하는 것이 더 의미가 있을 수 있습니다. +- 디버깅 설정에서 가장 쉽게 forward pass를 호출하는 방법을 사용하세요. 원본 저장소에서 **단일** forward pass만 호출하는 함수를 찾는 것이 이상적입니다. 이 함수는 일반적으로 `predict`, `evaluate`, `forward`, `__call__`과 같이 호출됩니다. `autoregressive_sample`과 같은 텍스트 생성에서 `forward`를 여러 번 호출하여 텍스트를 생성하는 등의 작업을 수행하는 함수를 디버그하고 싶지 않을 것입니다. +- 토큰화 과정을 모델의 *forward* pass와 분리하려고 노력하세요. 원본 저장소에서 입력 문자열을 입력해야 하는 예제가 있는 경우, 입력 문자열이 입력 ID로 변경되는 순간을 찾아서 시작하세요. 이 경우 직접 ID를 입력할 수 있도록 작은 스크립트를 작성하거나 원본 코드를 수정해야 할 수도 있습니다. +- 디버깅 설정에서 모델이 훈련 모드가 아니라는 것을 확인하세요. 훈련 모드에서는 모델의 여러 드롭아웃 레이어 때문에 무작위 출력이 생성될 수 있습니다. 디버깅 환경에서 forward pass가 **결정론적**이도록 해야 합니다. 또는 동일한 프레임워크에 있는 경우 *transformers.utils.set_seed*를 사용하세요. + +다음 섹션에서는 *brand_new_bert*에 대해 이 작업을 수행하는 데 더 구체적인 세부 사항/팁을 제공합니다. + +### 5.-14. 🤗 Transformers에 BrandNewBert를 이식하기 [[5.-14.-port-brandnewbert-to-transformers]] + +이제, 마침내 🤗 Transformers에 새로운 코드를 추가할 수 있습니다. 🤗 Transformers 포크의 클론으로 이동하세요: + +```bash +cd transformers +``` + +다음과 같이 이미 존재하는 모델의 모델 아키텍처와 정확히 일치하는 모델을 추가하는 특별한 경우에는 [이 섹션](#write-a-conversion-script)에 설명된대로 변환 스크립트만 추가하면 됩니다. 이 경우에는 이미 존재하는 모델의 전체 모델 아키텍처를 그대로 재사용할 수 있습니다. + +그렇지 않으면 새로운 모델 생성을 시작합시다. 여기에서 두 가지 선택지가 있습니다: + +- `transformers-cli add-new-model-like`를 사용하여 기존 모델과 유사한 새로운 모델 추가하기 +- `transformers-cli add-new-model`을 사용하여 템플릿을 기반으로 한 새로운 모델 추가하기 (선택한 모델 유형에 따라 BERT 또는 Bart와 유사한 모습일 것입니다) + +두 경우 모두, 모델의 기본 정보를 입력하는 설문조사가 제시됩니다. 두 번째 명령어는 `cookiecutter`를 설치해야 합니다. 자세한 정보는 [여기](https://github.com/huggingface/transformers/tree/main/templates/adding_a_new_model)에서 확인할 수 있습니다. + +**huggingface/transformers 메인 저장소에 Pull Request 열기** + +자동으로 생성된 코드를 수정하기 전에, 지금은 "작업 진행 중 (WIP)" 풀 리퀘스트를 열기 위한 시기입니다. 예를 들어, 🤗 Transformers에 "*brand_new_bert* 추가"라는 제목의 "[WIP] Add *brand_new_bert*" 풀 리퀘스트를 엽니다. 이렇게 하면 당신과 Hugging Face 팀이 🤗 Transformers에 모델을 통합하는 작업을 함께할 수 있습니다. + +다음을 수행해야 합니다: + +1. 메인 브랜치에서 작업을 잘 설명하는 이름으로 브랜치 생성 + +```bash +git checkout -b add_brand_new_bert +``` + +2. 자동으로 생성된 코드 커밋 + +```bash +git add . +git commit +``` + +3. 현재 메인을 가져오고 리베이스 + +```bash +git fetch upstream +git rebase upstream/main +``` + +4. 
변경 사항을 계정에 푸시 + +```bash +git push -u origin a-descriptive-name-for-my-changes +``` + +5. 만족스럽다면, GitHub에서 자신의 포크한 웹 페이지로 이동합니다. "Pull request"를 클릭합니다. Hugging Face 팀의 일부 멤버의 GitHub 핸들을 리뷰어로 추가하여 Hugging Face 팀이 앞으로의 변경 사항에 대해 알림을 받을 수 있도록 합니다. + +6. GitHub 풀 리퀘스트 웹 페이지 오른쪽에 있는 "Convert to draft"를 클릭하여 PR을 초안으로 변경합니다. + +다음으로, 어떤 진전을 이루었다면 작업을 커밋하고 계정에 푸시하여 풀 리퀘스트에 표시되도록 해야 합니다. 또한, 다음과 같이 현재 메인과 작업을 업데이트해야 합니다: + +```bash +git fetch upstream +git merge upstream/main +``` + +일반적으로, 모델 또는 구현에 관한 모든 질문은 자신의 PR에서 해야 하며, PR에서 토론되고 해결되어야 합니다. 이렇게 하면 Hugging Face 팀이 새로운 코드를 커밋하거나 질문을 할 때 항상 알림을 받을 수 있습니다. Hugging Face 팀에게 문제 또는 질문을 효율적으로 이해할 수 있도록 추가한 코드를 명시하는 것이 도움이 될 때가 많습니다. + +이를 위해, 변경 사항을 모두 볼 수 있는 "Files changed" 탭으로 이동하여 질문하고자 하는 줄로 이동한 다음 "+" 기호를 클릭하여 코멘트를 추가할 수 있습니다. 질문이나 문제가 해결되면, 생성된 코멘트의 "Resolve" 버튼을 클릭할 수 있습니다. + +마찬가지로, Hugging Face 팀은 코드를 리뷰할 때 코멘트를 남길 것입니다. 우리는 PR에서 대부분의 질문을 GitHub에서 묻는 것을 권장합니다. 공개에 크게 도움이 되지 않는 매우 일반적인 질문의 경우, Slack이나 이메일을 통해 Hugging Face 팀에게 문의할 수 있습니다. + +**5. brand_new_bert에 대해 생성된 모델 코드를 적용하기** + +먼저, 우리는 모델 자체에만 초점을 맞추고 토크나이저에 대해서는 신경 쓰지 않을 것입니다. 모든 관련 코드는 다음의 생성된 파일에서 찾을 수 있습니다: `src/transformers/models/brand_new_bert/modeling_brand_new_bert.py` 및 `src/transformers/models/brand_new_bert/configuration_brand_new_bert.py`. + +이제 마침내 코딩을 시작할 수 있습니다 :). `src/transformers/models/brand_new_bert/modeling_brand_new_bert.py`의 생성된 코드는 인코더 전용 모델인 경우 BERT와 동일한 아키텍처를 가지거나, 인코더-디코더 모델인 경우 BART와 동일한 아키텍처를 가질 것입니다. 이 시점에서, 모델의 이론적 측면에 대해 배운 내용을 다시 상기해야 합니다: *모델이 BERT 또는 BART와 어떻게 다른가요?*. 자주 변경해야 하는 것은 *self-attention* 레이어, 정규화 레이어의 순서 등을 변경하는 것입니다. 다시 말하지만, 자신의 모델을 구현하는 데 도움이 되도록 Transformers에서 이미 존재하는 모델의 유사한 아키텍처를 살펴보는 것이 유용할 수 있습니다. + +**참고로** 이 시점에서, 코드가 완전히 정확하거나 깨끗하다고 확신할 필요는 없습니다. 오히려 처음에는 원본 코드의 첫 번째 *불완전하고* 복사된 버전을 `src/transformers/models/brand_new_bert/modeling_brand_new_bert.py`에 추가하는 것이 좋습니다. 필요한 모든 코드가 추가될 때까지 이러한 작업을 진행한 후, 다음 섹션에서 설명한 변환 스크립트를 사용하여 코드를 점진적으로 개선하고 수정하는 것이 훨씬 효율적입니다. 이 시점에서 작동해야 하는 유일한 것은 다음 명령이 작동하는 것입니다: + +```python +from transformers import BrandNewBertModel, BrandNewBertConfig + +model = BrandNewBertModel(BrandNewBertConfig()) +``` + +위의 명령은 `BrandNewBertConfig()`에 정의된 기본 매개변수에 따라 무작위 가중치로 모델을 생성하며, 이로써 모든 구성 요소의 `init()` 메서드가 작동함을 보장합니다. + +모든 무작위 초기화는 `BrandnewBertPreTrainedModel` 클래스의 `_init_weights` 메서드에서 수행되어야 합니다. 이 메서드는 구성 설정 변수에 따라 모든 리프 모듈을 초기화해야 합니다. BERT의 `_init_weights` 메서드 예제는 다음과 같습니다: + +```py +def _init_weights(self, module): + """Initialize the weights""" + if isinstance(module, nn.Linear): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.Embedding): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.padding_idx is not None: + module.weight.data[module.padding_idx].zero_() + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) +``` + +몇 가지 모듈에 대해 특별한 초기화가 필요한 경우 사용자 정의 방식을 사용할 수도 있습니다. 예를 들어, `Wav2Vec2ForPreTraining`에서 마지막 두 개의 선형 레이어는 일반적인 PyTorch `nn.Linear`의 초기화를 가져야 하지만, 다른 모든 레이어는 위와 같은 초기화를 사용해야 합니다. 
이는 다음과 같이 코드화됩니다: + +```py +def _init_weights(self, module): + """Initialize the weights""" + if isinstance(module, Wav2Vec2ForPreTraining): + module.project_hid.reset_parameters() + module.project_q.reset_parameters() + module.project_hid._is_hf_initialized = True + module.project_q._is_hf_initialized = True + elif isinstance(module, nn.Linear): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.bias is not None: + module.bias.data.zero_() +``` + +`_is_hf_initialized` 플래그는 서브모듈을 한 번만 초기화하도록 내부적으로 사용됩니다. `module.project_q` 및 `module.project_hid`에 대해 `True`로 설정함으로써, 우리가 수행한 사용자 정의 초기화가 이후에 덮어쓰이지 않도록 합니다. 즉, `_init_weights` 함수가 이들에게 적용되지 않습니다. + +**6. 변환 스크립트 작성하기** + +다음으로, 디버그에 사용한 체크포인트를 기존 저장소에서 만든 🤗 Transformers 구현과 호환되는 체크포인트로 변환할 수 있는 변환 스크립트를 작성해야 합니다. 변환 스크립트를 처음부터 작성하는 것보다는 *brand_new_bert*와 동일한 프레임워크로 작성된 유사한 모델을 변환한 기존 변환 스크립트를 찾아보는 것이 좋습니다. 일반적으로 기존 변환 스크립트를 복사하여 사용 사례에 맞게 약간 수정하는 것으로 충분합니다. 모델에 대해 유사한 기존 변환 스크립트를 어디에서 찾을 수 있는지 Hugging Face 팀에게 문의하는 것을 망설이지 마세요. + +- TensorFlow에서 PyTorch로 모델을 이전하는 경우, 좋은 참고 자료로 BERT의 변환 스크립트 [여기](https://github.com/huggingface/transformers/blob/7acfa95afb8194f8f9c1f4d2c6028224dbed35a2/src/transformers/models/bert/modeling_bert.py#L91)를 참조할 수 있습니다. +- PyTorch에서 PyTorch로 모델을 이전하는 경우, 좋은 참고 자료로 BART의 변환 스크립트 [여기](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bart/convert_bart_original_pytorch_checkpoint_to_pytorch.py)를 참조할 수 있습니다. + +다음에서는 PyTorch 모델이 레이어 가중치를 저장하고 레이어 이름을 정의하는 방법에 대해 간단히 설명하겠습니다. PyTorch에서 레이어의 이름은 레이어에 지정한 클래스 속성의 이름으로 정의됩니다. 다음과 같이 PyTorch에서 `SimpleModel`이라는 더미 모델을 정의해 봅시다: + +```python +from torch import nn + + +class SimpleModel(nn.Module): + def __init__(self): + super().__init__() + self.dense = nn.Linear(10, 10) + self.intermediate = nn.Linear(10, 10) + self.layer_norm = nn.LayerNorm(10) +``` + +이제 이 모델 정의의 인스턴스를 생성할 수 있으며 `dense`, `intermediate`, `layer_norm` 등의 가중치가 랜덤하게 할당됩니다. 모델을 출력하여 아키텍처를 확인할 수 있습니다. + +```python +model = SimpleModel() + +print(model) +``` + +이는 다음과 같이 출력됩니다: + +``` +SimpleModel( + (dense): Linear(in_features=10, out_features=10, bias=True) + (intermediate): Linear(in_features=10, out_features=10, bias=True) + (layer_norm): LayerNorm((10,), eps=1e-05, elementwise_affine=True) +) +``` + +우리는 레이어의 이름이 PyTorch에서 클래스 속성의 이름으로 정의되어 있는 것을 볼 수 있습니다. 특정 레이어의 가중치 값을 출력하여 확인할 수 있습니다: + +```python +print(model.dense.weight.data) +``` + +가중치가 무작위로 초기화되었음을 확인할 수 있습니다. + +``` +tensor([[-0.0818, 0.2207, -0.0749, -0.0030, 0.0045, -0.1569, -0.1598, 0.0212, + -0.2077, 0.2157], + [ 0.1044, 0.0201, 0.0990, 0.2482, 0.3116, 0.2509, 0.2866, -0.2190, + 0.2166, -0.0212], + [-0.2000, 0.1107, -0.1999, -0.3119, 0.1559, 0.0993, 0.1776, -0.1950, + -0.1023, -0.0447], + [-0.0888, -0.1092, 0.2281, 0.0336, 0.1817, -0.0115, 0.2096, 0.1415, + -0.1876, -0.2467], + [ 0.2208, -0.2352, -0.1426, -0.2636, -0.2889, -0.2061, -0.2849, -0.0465, + 0.2577, 0.0402], + [ 0.1502, 0.2465, 0.2566, 0.0693, 0.2352, -0.0530, 0.1859, -0.0604, + 0.2132, 0.1680], + [ 0.1733, -0.2407, -0.1721, 0.1484, 0.0358, -0.0633, -0.0721, -0.0090, + 0.2707, -0.2509], + [-0.1173, 0.1561, 0.2945, 0.0595, -0.1996, 0.2988, -0.0802, 0.0407, + 0.1829, -0.1568], + [-0.1164, -0.2228, -0.0403, 0.0428, 0.1339, 0.0047, 0.1967, 0.2923, + 0.0333, -0.0536], + [-0.1492, -0.1616, 0.1057, 0.1950, -0.2807, -0.2710, -0.1586, 0.0739, + 0.2220, 0.2358]]). +``` + +변환 스크립트에서는 이러한 무작위로 초기화된 가중치를 체크포인트의 해당 레이어의 정확한 가중치로 채워야 합니다. 예를 들면 다음과 같습니다: + +```python +# retrieve matching layer weights, e.g. 
by +# recursive algorithm +layer_name = "dense" +pretrained_weight = array_of_dense_layer + +model_pointer = getattr(model, "dense") + +model_pointer.weight.data = torch.from_numpy(pretrained_weight) +``` + +이렇게 하면 PyTorch 모델의 무작위로 초기화된 각 가중치와 해당 체크포인트 가중치가 **모양과 이름** 모두에서 정확히 일치하는지 확인해야 합니다. 이를 위해 모양에 대한 assert 문을 추가하고 체크포인트 가중치의 이름을 출력해야 합니다. 예를 들어 다음과 같은 문장을 추가해야 합니다: + +```python +assert ( + model_pointer.weight.shape == pretrained_weight.shape +), f"Pointer shape of random weight {model_pointer.shape} and array shape of checkpoint weight {pretrained_weight.shape} mismatched" +``` + +또한 두 가중치의 이름을 출력하여 일치하는지 확인해야 합니다. *예시*: + +```python +logger.info(f"Initialize PyTorch weight {layer_name} from {pretrained_weight.name}") +``` + +모양 또는 이름이 일치하지 않는 경우, 랜덤으로 초기화된 레이어에 잘못된 체크포인트 가중치를 할당한 것으로 추측됩니다. + +잘못된 모양은 `BrandNewBertConfig()`의 구성 매개변수 설정이 변환하려는 체크포인트에 사용된 설정과 정확히 일치하지 않기 때문일 가능성이 가장 큽니다. 그러나 PyTorch의 레이어 구현 자체에서 가중치를 전치해야 할 수도 있습니다. + +마지막으로, **모든** 필요한 가중치가 초기화되었는지 확인하고 초기화에 사용되지 않은 모든 체크포인트 가중치를 출력하여 모델이 올바르게 변환되었는지 확인해야 합니다. 잘못된 모양 문장이나 잘못된 이름 할당으로 인해 변환 시도가 실패하는 것은 완전히 정상입니다. 이는 `BrandNewBertConfig()`에서 잘못된 매개변수를 사용하거나 🤗 Transformers 구현에서 잘못된 아키텍처, 🤗 Transformers 구현의 구성 요소 중 하나의 `init()` 함수에 버그가 있는 경우이거나 체크포인트 가중치 중 하나를 전치해야 하는 경우일 가능성이 가장 높습니다. + +이 단계는 이전 단계와 함께 반복되어야 하며 모든 체크포인트의 가중치가 Transformers 모델에 올바르게 로드되었을 때까지 계속되어야 합니다. 🤗 Transformers 구현에 체크포인트를 올바르게 로드한 후에는 `/path/to/converted/checkpoint/folder`와 같은 원하는 폴더에 모델을 저장할 수 있어야 합니다. 해당 폴더에는 `pytorch_model.bin` 파일과 `config.json` 파일이 모두 포함되어야 합니다. + +```python +model.save_pretrained("/path/to/converted/checkpoint/folder") +``` + +**7. 순방향 패스 구현하기** + +🤗 Transformers 구현에 사전 훈련된 가중치를 정확하게 로드한 후에는 순방향 패스가 올바르게 구현되었는지 확인해야 합니다. [원본 저장소에 익숙해지기](#3-4-run-a-pretrained-checkpoint-using-the-original-repository)에서 이미 원본 저장소를 사용하여 모델의 순방향 패스를 실행하는 스크립트를 만들었습니다. 이제 원본 대신 🤗 Transformers 구현을 사용하는 유사한 스크립트를 작성해야 합니다. 다음과 같이 작성되어야 합니다: + +```python +model = BrandNewBertModel.from_pretrained("/path/to/converted/checkpoint/folder") +input_ids = [0, 4, 4, 3, 2, 4, 1, 7, 19] +output = model(input_ids).last_hidden_states +``` + +🤗 Transformers 구현과 원본 모델 구현이 처음부터 정확히 동일한 출력을 제공하지 않거나 순방향 패스에서 오류가 발생할 가능성이 매우 높습니다. 실망하지 마세요. 예상된 일입니다! 먼저, 순방향 패스에서 오류가 발생하지 않도록 해야 합니다. 종종 잘못된 차원이 사용되어 *차원 불일치* 오류가 발생하거나 잘못된 데이터 유형 개체가 사용되는 경우가 있습니다. 예를 들면 `torch.long` 대신에 `torch.float32`가 사용된 경우입니다. 해결할 수 없는 오류가 발생하면 Hugging Face 팀에 도움을 요청하는 것이 좋습니다. + +🤗 Transformers 구현이 올바르게 작동하는지 확인하는 마지막 단계는 출력이 `1e-3`의 정밀도로 동일한지 확인하는 것입니다. 먼저, 출력 모양이 동일하도록 보장해야 합니다. 즉, 🤗 Transformers 구현 스크립트와 원본 구현 사이에서 `outputs.shape`는 동일한 값을 반환해야 합니다. 그 다음으로, 출력 값이 동일하도록 해야 합니다. 이는 새로운 모델을 추가할 때 가장 어려운 부분 중 하나입니다. 출력이 동일하지 않은 일반적인 실수 사례는 다음과 같습니다: + +- 일부 레이어가 추가되지 않았습니다. 즉, *활성화* 레이어가 추가되지 않았거나 잔차 연결이 빠졌습니다. +- 단어 임베딩 행렬이 연결되지 않았습니다. +- 잘못된 위치 임베딩이 사용되었습니다. 원본 구현에서는 오프셋을 사용합니다. +- 순방향 패스 중에 Dropout이 적용되었습니다. 이를 수정하려면 *model.training이 False*인지 확인하고 순방향 패스 중에 Dropout 레이어가 잘못 활성화되지 않도록 하세요. 즉, [PyTorch의 기능적 Dropout](https://pytorch.org/docs/stable/nn.functional.html?highlight=dropout#torch.nn.functional.dropout)에 *self.training*을 전달하세요. + +문제를 해결하는 가장 좋은 방법은 일반적으로 원본 구현과 🤗 Transformers 구현의 순방향 패스를 나란히 놓고 차이점이 있는지 확인하는 것입니다. 이상적으로는 순방향 패스의 중간 출력을 디버그/출력하여 원본 구현과 🤗 Transformers 구현의 정확한 위치를 찾을 수 있어야 합니다. 먼저, 두 스크립트의 하드코딩된 `input_ids`가 동일한지 확인하세요. 다음으로, `input_ids`의 첫 번째 변환의 출력(일반적으로 단어 임베딩)이 동일한지 확인하세요. 그런 다음 네트워크의 가장 마지막 레이어까지 진행해보세요. 어느 시점에서 두 구현 사이에 차이가 있는 것을 알게 되는데, 이는 🤗 Transformers 구현의 버그 위치를 가리킬 것입니다. 
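+
+아래는 이러한 중간 출력 비교를 자동화하는 방법을 보여주는 최소한의 스케치입니다. 두 구현에서 동일한 입력으로 수집한 중간 텐서 목록을 이미 확보했다고 가정하며, 함수와 변수 이름은 설명을 위해 임의로 정한 것입니다:
+
+```python
+import torch
+
+
+def compare_intermediate_outputs(original_outputs, ported_outputs, names, atol=1e-3):
+    """같은 위치에서 수집한 중간 텐서 쌍을 차례로 비교하여 처음 어긋나는 지점을 출력합니다."""
+    for name, original, ported in zip(names, original_outputs, ported_outputs):
+        if original.shape != ported.shape:
+            print(f"{name}: 모양 불일치 {tuple(original.shape)} vs {tuple(ported.shape)}")
+            return name
+        max_diff = (original - ported).abs().max().item()
+        if not torch.allclose(original, ported, atol=atol):
+            print(f"{name}: 값 불일치 (최대 오차 {max_diff:.2e})")
+            return name
+        print(f"{name}: 일치 (최대 오차 {max_diff:.2e})")
+    return None
+```
+
+`None`이 아닌 이름이 반환되는 첫 번째 위치가 버그를 찾기 시작할 지점입니다.
+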
저희 경험상으로는 원본 구현과 🤗 Transformers 구현 모두에서 동일한 위치에 많은 출력 문을 추가하고 이들의 중간 표현에 대해 동일한 값을 보이는 출력 문을 연속적으로 제거하는 것이 간단하고 효과적인 방법입니다. + +`torch.allclose(original_output, output, atol=1e-3)`로 출력을 확인하여 두 구현이 동일한 출력을 하는 것을 확신한다면, 가장 어려운 부분은 끝났습니다! 축하드립니다. 남은 작업은 쉬운 일이 될 것입니다 😊. + +**8. 필요한 모든 모델 테스트 추가하기** + +이 시점에서 새로운 모델을 성공적으로 추가했습니다. 그러나 해당 모델이 요구되는 디자인에 완전히 부합하지 않을 수도 있습니다. 🤗 Transformers와 완벽하게 호환되는 구현인지 확인하기 위해 모든 일반 테스트를 통과해야 합니다. Cookiecutter는 아마도 모델을 위한 테스트 파일을 자동으로 추가했을 것입니다. 아마도 `tests/models/brand_new_bert/test_modeling_brand_new_bert.py`와 같은 경로에 위치할 것입니다. 이 테스트 파일을 실행하여 일반 테스트가 모두 통과하는지 확인하세요. + +```bash +pytest tests/models/brand_new_bert/test_modeling_brand_new_bert.py +``` + +모든 일반 테스트를 수정한 후, 이제 수행한 작업을 충분히 테스트하여 다음 사항을 보장해야 합니다. + +- a) 커뮤니티가 *brand_new_bert*의 특정 테스트를 살펴봄으로써 작업을 쉽게 이해할 수 있도록 함 +- b) 모델에 대한 향후 변경 사항이 모델의 중요한 기능을 손상시키지 않도록 함 + +먼저 통합 테스트를 추가해야 합니다. 이러한 통합 테스트는 이전에 모델을 🤗 Transformers로 구현하기 위해 사용한 디버깅 스크립트와 동일한 작업을 수행합니다. Cookiecutter에 이미 이러한 모델 테스트의 템플릿인 `BrandNewBertModelIntegrationTests`가 추가되어 있으며, 여러분이 작성해야 할 내용으로만 채워 넣으면 됩니다. 이러한 테스트가 통과하는지 확인하려면 다음을 실행하세요. + +```bash +RUN_SLOW=1 pytest -sv tests/models/brand_new_bert/test_modeling_brand_new_bert.py::BrandNewBertModelIntegrationTests +``` + + + +Windows를 사용하는 경우 `RUN_SLOW=1`을 `SET RUN_SLOW=1`로 바꿔야 합니다. + + + +둘째로, *brand_new_bert*에 특화된 모든 기능도 별도의 테스트에서 추가로 테스트해야 합니다. 이 부분은 종종 잊히는데, 두 가지 측면에서 굉장히 유용합니다. + +- *brand_new_bert*의 특수 기능이 어떻게 작동해야 하는지 보여줌으로써 커뮤니티에게 모델 추가 과정에서 습득한 지식을 전달하는 데 도움이 됩니다. +- 향후 기여자는 이러한 특수 테스트를 실행하여 모델에 대한 변경 사항을 빠르게 테스트할 수 있습니다. + + +**9. 토크나이저 구현하기** + +다음으로, *brand_new_bert*의 토크나이저를 추가해야 합니다. 보통 토크나이저는 🤗 Transformers의 기존 토크나이저와 동일하거나 매우 유사합니다. + +토크나이저가 올바르게 작동하는지 확인하기 위해 먼저 원본 리포지토리에서 문자열을 입력하고 `input_ids`를 반환하는 스크립트를 생성하는 것이 좋습니다. 다음과 같은 유사한 스크립트일 수 있습니다 (의사 코드로 작성): + +```python +input_str = "This is a long example input string containing special characters .$?-, numbers 2872 234 12 and words." +model = BrandNewBertModel.load_pretrained_checkpoint("/path/to/checkpoint/") +input_ids = model.tokenize(input_str) +``` + +원본 리포지토리를 자세히 살펴보고 올바른 토크나이저 함수를 찾거나, 복제본에서 변경 사항을 적용하여 `input_ids`만 출력하도록 해야 합니다. 원본 리포지토리를 사용하는 기능적인 토큰화 스크립트를 작성한 후, 🤗 Transformers의 유사한 스크립트를 생성해야 합니다. 다음과 같이 작성되어야 합니다: + +```python +from transformers import BrandNewBertTokenizer + +input_str = "This is a long example input string containing special characters .$?-, numbers 2872 234 12 and words." + +tokenizer = BrandNewBertTokenizer.from_pretrained("/path/to/tokenizer/folder/") + +input_ids = tokenizer(input_str).input_ids +``` + +두 개의 `input_ids`가 동일한 값을 반환할 때, 마지막 단계로 토크나이저 테스트 파일도 추가해야 합니다. + +*brand_new_bert*의 모델링 테스트 파일과 유사하게, *brand_new_bert*의 토크나이제이션 테스트 파일에는 몇 가지 하드코딩된 통합 테스트가 포함되어야 합니다. + +**10. 종단 간 통합 테스트 실행** + +토크나이저를 추가한 후에는 모델과 토크나이저를 사용하여 몇 가지 종단 간 통합 테스트를 추가해야 합니다. `tests/models/brand_new_bert/test_modeling_brand_new_bert.py`에 추가해주세요. 이러한 테스트는 🤗 Transformers 구현이 예상대로 작동하는지를 의미 있는 text-to-text 예시로 보여줘야 합니다. 그 예시로는 *예를 들어* source-to-target 번역 쌍, article-to-summary 쌍, question-to-answer 쌍 등이 포함될 수 있습니다. 불러온 체크포인트 중 어느 것도 다운스트림 작업에서 미세 조정되지 않았다면, 모델 테스트만으로 충분합니다. 모델이 완전히 기능을 갖추었는지 확인하기 위해 마지막 단계로 GPU에서 모든 테스트를 실행하는 것이 좋습니다. 모델의 내부 텐서의 일부에 `.to(self.device)` 문을 추가하는 것을 잊었을 수 있으며, 이 경우 테스트에서 오류로 표시됩니다. GPU에 액세스할 수 없는 경우, Hugging Face 팀이 테스트를 대신 실행할 수 있습니다. + +**11. 기술문서 추가** + +이제 *brand_new_bert*에 필요한 모든 기능이 추가되었습니다. 거의 끝났습니다! 추가해야 할 것은 멋진 기술문서과 기술문서 페이지입니다. Cookiecutter가 `docs/source/model_doc/brand_new_bert.md`라는 템플릿 파일을 추가해줬을 것입니다. 이 페이지를 사용하기 전에 모델을 사용하는 사용자들은 일반적으로 이 페이지를 먼저 확인합니다. 
따라서 문서는 이해하기 쉽고 간결해야 합니다. 모델을 사용하는 방법을 보여주기 위해 *팁*을 추가하는 것이 커뮤니티에 매우 유용합니다. 독스트링에 관련하여 Hugging Face 팀에 문의하는 것을 주저하지 마세요. + +다음으로, `src/transformers/models/brand_new_bert/modeling_brand_new_bert.py`에 추가된 독스트링이 올바르며 필요한 모든 입력 및 출력을 포함하도록 확인하세요. [여기](writing-documentation)에서 우리의 문서 작성 가이드와 독스트링 형식에 대한 상세 가이드가 있습니다. 문서는 일반적으로 커뮤니티와 모델의 첫 번째 접점이기 때문에, 문서는 적어도 코드만큼의 주의를 기울여야 합니다. + +**코드 리팩토링** + +좋아요, 이제 *brand_new_bert*를 위한 모든 필요한 코드를 추가했습니다. 이 시점에서 다음을 실행하여 잠재적으로 잘못된 코드 스타일을 수정해야 합니다: + +그리고 코딩 스타일이 품질 점검을 통과하는지 확인하기 위해 다음을 실행하고 확인해야 합니다: + +```bash +make style +``` + +🤗 Transformers에는 여전히 실패할 수 있는 몇 가지 매우 엄격한 디자인 테스트가 있습니다. 이는 독스트링에 누락된 정보나 잘못된 명명 때문에 종종 발생합니다. 여기서 막히면 Hugging Face 팀이 도움을 줄 것입니다. + +```bash +make quality +``` + +마지막으로, 코드가 정확히 작동하는 것을 확인한 후에는 항상 코드를 리팩토링하는 것이 좋은 생각입니다. 모든 테스트가 통과된 지금은 추가한 코드를 다시 검토하고 리팩토링하는 좋은 시기입니다. + +이제 코딩 부분을 완료했습니다. 축하합니다! 🎉 멋져요! 😎 + +**12. 모델을 모델 허브에 업로드하세요** + +이 마지막 파트에서는 모든 체크포인트를 변환하여 모델 허브에 업로드하고 각 업로드된 모델 체크포인트에 대한 모델 카드를 추가해야 합니다. [Model sharing and uploading Page](model_sharing)를 읽고 허브 기능에 익숙해지세요. *brand_new_bert*의 저자 조직 아래에 모델을 업로드할 수 있는 필요한 액세스 권한을 얻기 위해 Hugging Face 팀과 협업해야 합니다. `transformers`의 모든 모델에 있는 `push_to_hub` 메서드는 체크포인트를 허브에 빠르고 효율적으로 업로드하는 방법입니다. 아래에 작은 코드 조각이 붙여져 있습니다: + +각 체크포인트에 적합한 모델 카드를 만드는 데 시간을 할애하는 것은 가치가 있습니다. 모델 카드는 체크포인트의 특성을 강조해야 합니다. *예를 들어* 이 체크포인트는 어떤 데이터셋에서 사전 훈련/세부 훈련되었는지? 이 모델은 어떤 하위 작업에서 사용해야 하는지? 그리고 모델을 올바르게 사용하는 방법에 대한 몇 가지 코드도 포함해야 합니다. + +```python +brand_new_bert.push_to_hub("brand_new_bert") +# Uncomment the following line to push to an organization. +# brand_new_bert.push_to_hub("/brand_new_bert") +``` + +**13. (선택 사항) 노트북 추가** + +*brand_new_bert*를 다운스트림 작업에서 추론 또는 미세 조정에 사용하는 방법을 자세히 보여주는 노트북을 추가하는 것이 매우 유용합니다. 이것은 PR을 병합하는 데 필수적이지는 않지만 커뮤니티에 매우 유용합니다. + +**14. 완료된 PR 제출** + +이제 프로그래밍을 마쳤으며, 마지막 단계로 PR을 메인 브랜치에 병합해야 합니다. 보통 Hugging Face 팀은 이미 여기까지 도움을 주었을 것입니다. 그러나 PR에 멋진 설명을 추가하고 리뷰어에게 특정 디자인 선택 사항을 강조하려면 완료된 PR에 약간의 설명을 추가하는 시간을 할애하는 것이 가치가 있습니다. + +### 작업물을 공유하세요!! [[share-your-work]] + +이제 커뮤니티에서 작업물을 인정받을 시간입니다! 모델 추가 작업을 완료하는 것은 Transformers와 전체 NLP 커뮤니티에 큰 기여입니다. 당신의 코드와 이식된 사전 훈련된 모델은 수백, 심지어 수천 명의 개발자와 연구원에 의해 확실히 사용될 것입니다. 당신의 작업에 자랑스러워해야 하며 이를 커뮤니티와 공유해야 합니다. + +**당신은 커뮤니티 내 모든 사람들에게 매우 쉽게 접근 가능한 또 다른 모델을 만들었습니다! 🤯** diff --git a/docs/source/ko/add_new_pipeline.md b/docs/source/ko/add_new_pipeline.md new file mode 100644 index 00000000000000..9ddd4981154a37 --- /dev/null +++ b/docs/source/ko/add_new_pipeline.md @@ -0,0 +1,248 @@ + + +# 어떻게 사용자 정의 파이프라인을 생성하나요? [[how-to-create-a-custom-pipeline]] + +이 가이드에서는 사용자 정의 파이프라인을 어떻게 생성하고 [허브](https://hf.co/models)에 공유하거나 🤗 Transformers 라이브러리에 추가하는 방법을 살펴보겠습니다. + +먼저 파이프라인이 수용할 수 있는 원시 입력을 결정해야 합니다. +문자열, 원시 바이트, 딕셔너리 또는 가장 원하는 입력일 가능성이 높은 것이면 무엇이든 가능합니다. +이 입력을 가능한 한 순수한 Python 형식으로 유지해야 (JSON을 통해 다른 언어와도) 호환성이 좋아집니다. +이것이 전처리(`preprocess`) 파이프라인의 입력(`inputs`)이 될 것입니다. + +그런 다음 `outputs`를 정의하세요. +`inputs`와 같은 정책을 따르고, 간단할수록 좋습니다. +이것이 후처리(`postprocess`) 메소드의 출력이 될 것입니다. + +먼저 4개의 메소드(`preprocess`, `_forward`, `postprocess` 및 `_sanitize_parameters`)를 구현하기 위해 기본 클래스 `Pipeline`을 상속하여 시작합니다. 
+ + +```python +from transformers import Pipeline + + +class MyPipeline(Pipeline): + def _sanitize_parameters(self, **kwargs): + preprocess_kwargs = {} + if "maybe_arg" in kwargs: + preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"] + return preprocess_kwargs, {}, {} + + def preprocess(self, inputs, maybe_arg=2): + model_input = Tensor(inputs["input_ids"]) + return {"model_input": model_input} + + def _forward(self, model_inputs): + # model_inputs == {"model_input": model_input} + outputs = self.model(**model_inputs) + # Maybe {"logits": Tensor(...)} + return outputs + + def postprocess(self, model_outputs): + best_class = model_outputs["logits"].softmax(-1) + return best_class +``` + +이 분할 구조는 CPU/GPU에 대한 비교적 원활한 지원을 제공하는 동시에, 다른 스레드에서 CPU에 대한 사전/사후 처리를 수행할 수 있게 지원하는 것입니다. + +`preprocess`는 원래 정의된 입력을 가져와 모델에 공급할 수 있는 형식으로 변환합니다. +더 많은 정보를 포함할 수 있으며 일반적으로 `Dict` 형태입니다. + +`_forward`는 구현 세부 사항이며 직접 호출할 수 없습니다. +`forward`는 예상 장치에서 모든 것이 작동하는지 확인하기 위한 안전장치가 포함되어 있어 선호되는 호출 메소드입니다. +실제 모델과 관련된 것은 `_forward` 메소드에 속하며, 나머지는 전처리/후처리 과정에 있습니다. + +`postprocess` 메소드는 `_forward`의 출력을 가져와 이전에 결정한 최종 출력 형식으로 변환합니다. + +`_sanitize_parameters`는 초기화 시간에 `pipeline(...., maybe_arg=4)`이나 호출 시간에 `pipe = pipeline(...); output = pipe(...., maybe_arg=4)`과 같이, 사용자가 원하는 경우 언제든지 매개변수를 전달할 수 있도록 허용합니다. + +`_sanitize_parameters`의 반환 값은 `preprocess`, `_forward`, `postprocess`에 직접 전달되는 3개의 kwargs 딕셔너리입니다. +호출자가 추가 매개변수로 호출하지 않았다면 아무것도 채우지 마십시오. +이렇게 하면 항상 더 "자연스러운" 함수 정의의 기본 인수를 유지할 수 있습니다. + +분류 작업에서 `top_k` 매개변수가 대표적인 예입니다. + +```python +>>> pipe = pipeline("my-new-task") +>>> pipe("This is a test") +[{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}, {"label": "3-star", "score": 0.05} +{"label": "4-star", "score": 0.025}, {"label": "5-star", "score": 0.025}] + +>>> pipe("This is a test", top_k=2) +[{"label": "1-star", "score": 0.8}, {"label": "2-star", "score": 0.1}] +``` + +이를 달성하기 위해 우리는 `postprocess` 메소드를 기본 매개변수인 `5`로 업데이트하고 `_sanitize_parameters`를 수정하여 이 새 매개변수를 허용합니다. + + +```python +def postprocess(self, model_outputs, top_k=5): + best_class = model_outputs["logits"].softmax(-1) + # top_k를 처리하는 로직 추가 + return best_class + + +def _sanitize_parameters(self, **kwargs): + preprocess_kwargs = {} + if "maybe_arg" in kwargs: + preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"] + + postprocess_kwargs = {} + if "top_k" in kwargs: + postprocess_kwargs["top_k"] = kwargs["top_k"] + return preprocess_kwargs, {}, postprocess_kwargs +``` + +입/출력을 가능한 한 간단하고 완전히 JSON 직렬화 가능한 형식으로 유지하려고 노력하십시오. +이렇게 하면 사용자가 새로운 종류의 개체를 이해하지 않고도 파이프라인을 쉽게 사용할 수 있습니다. +또한 사용 용이성을 위해 여러 가지 유형의 인수(오디오 파일은 파일 이름, URL 또는 순수한 바이트일 수 있음)를 지원하는 것이 비교적 일반적입니다. + + + +## 지원되는 작업 목록에 추가하기 [[adding-it-to-the-list-of-supported-tasks]] + +`new-task`를 지원되는 작업 목록에 등록하려면 `PIPELINE_REGISTRY`에 추가해야 합니다: + +```python +from transformers.pipelines import PIPELINE_REGISTRY + +PIPELINE_REGISTRY.register_pipeline( + "new-task", + pipeline_class=MyPipeline, + pt_model=AutoModelForSequenceClassification, +) +``` + +원하는 경우 기본 모델을 지정할 수 있으며, 이 경우 특정 개정(분기 이름 또는 커밋 해시일 수 있음, 여기서는 "abcdef")과 타입을 함께 가져와야 합니다: + +```python +PIPELINE_REGISTRY.register_pipeline( + "new-task", + pipeline_class=MyPipeline, + pt_model=AutoModelForSequenceClassification, + default={"pt": ("user/awesome_model", "abcdef")}, + type="text", # 현재 지원 유형: text, audio, image, multimodal +) +``` + +## Hub에 파이프라인 공유하기 [[share-your-pipeline-on-the-hub]] + +Hub에 사용자 정의 파이프라인을 공유하려면 `Pipeline` 하위 클래스의 사용자 정의 코드를 Python 파일에 저장하기만 하면 됩니다. 
+예를 들어, 다음과 같이 문장 쌍 분류를 위한 사용자 정의 파이프라인을 사용한다고 가정해 보겠습니다: + +```py +import numpy as np + +from transformers import Pipeline + + +def softmax(outputs): + maxes = np.max(outputs, axis=-1, keepdims=True) + shifted_exp = np.exp(outputs - maxes) + return shifted_exp / shifted_exp.sum(axis=-1, keepdims=True) + + +class PairClassificationPipeline(Pipeline): + def _sanitize_parameters(self, **kwargs): + preprocess_kwargs = {} + if "second_text" in kwargs: + preprocess_kwargs["second_text"] = kwargs["second_text"] + return preprocess_kwargs, {}, {} + + def preprocess(self, text, second_text=None): + return self.tokenizer(text, text_pair=second_text, return_tensors=self.framework) + + def _forward(self, model_inputs): + return self.model(**model_inputs) + + def postprocess(self, model_outputs): + logits = model_outputs.logits[0].numpy() + probabilities = softmax(logits) + + best_class = np.argmax(probabilities) + label = self.model.config.id2label[best_class] + score = probabilities[best_class].item() + logits = logits.tolist() + return {"label": label, "score": score, "logits": logits} +``` + +구현은 프레임워크에 구애받지 않으며, PyTorch와 TensorFlow 모델에 대해 작동합니다. +이를 `pair_classification.py`라는 파일에 저장한 경우, 다음과 같이 가져오고 등록할 수 있습니다: + +```py +from pair_classification import PairClassificationPipeline +from transformers.pipelines import PIPELINE_REGISTRY +from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification + +PIPELINE_REGISTRY.register_pipeline( + "pair-classification", + pipeline_class=PairClassificationPipeline, + pt_model=AutoModelForSequenceClassification, + tf_model=TFAutoModelForSequenceClassification, +) +``` + +이 작업이 완료되면 사전훈련된 모델과 함께 사용할 수 있습니다. +예를 들어, `sgugger/finetuned-bert-mrpc`은 MRPC 데이터 세트에서 미세 조정되어 문장 쌍을 패러프레이즈인지 아닌지를 분류합니다. + +```py +from transformers import pipeline + +classifier = pipeline("pair-classification", model="sgugger/finetuned-bert-mrpc") +``` + +그런 다음 `Repository`의 `save_pretrained` 메소드를 사용하여 허브에 공유할 수 있습니다: + +```py +from huggingface_hub import Repository + +repo = Repository("test-dynamic-pipeline", clone_from="{your_username}/test-dynamic-pipeline") +classifier.save_pretrained("test-dynamic-pipeline") +repo.push_to_hub() +``` + +이렇게 하면 "test-dynamic-pipeline" 폴더 내에 `PairClassificationPipeline`을 정의한 파일이 복사되며, 파이프라인의 모델과 토크나이저도 저장한 후, `{your_username}/test-dynamic-pipeline` 저장소에 있는 모든 것을 푸시합니다. +이후에는 `trust_remote_code=True` 옵션만 제공하면 누구나 사용할 수 있습니다. + +```py +from transformers import pipeline + +classifier = pipeline(model="{your_username}/test-dynamic-pipeline", trust_remote_code=True) +``` + +## 🤗 Transformers에 파이프라인 추가하기 [[add-the-pipeline-to-transformers]] + +🤗 Transformers에 사용자 정의 파이프라인을 기여하려면, `pipelines` 하위 모듈에 사용자 정의 파이프라인 코드와 함께 새 모듈을 추가한 다음, `pipelines/__init__.py`에서 정의된 작업 목록에 추가해야 합니다. + +그런 다음 테스트를 추가해야 합니다. +`tests/test_pipelines_MY_PIPELINE.py`라는 새 파일을 만들고 다른 테스트와 예제를 함께 작성합니다. + +`run_pipeline_test` 함수는 매우 일반적이며, `model_mapping` 및 `tf_model_mapping`에서 정의된 가능한 모든 아키텍처의 작은 무작위 모델에서 실행됩니다. + +이는 향후 호환성을 테스트하는 데 매우 중요하며, 누군가 `XXXForQuestionAnswering`을 위한 새 모델을 추가하면 파이프라인 테스트가 해당 모델에서 실행을 시도한다는 의미입니다. +모델이 무작위이기 때문에 실제 값을 확인하는 것은 불가능하므로, 단순히 파이프라인 출력 `TYPE`과 일치시키기 위한 도우미 `ANY`가 있습니다. + +또한 2개(이상적으로는 4개)의 테스트를 구현해야 합니다. + +- `test_small_model_pt`: 이 파이프라인에 대한 작은 모델 1개를 정의(결과가 의미 없어도 상관없음)하고 파이프라인 출력을 테스트합니다. +결과는 `test_small_model_tf`와 동일해야 합니다. +- `test_small_model_tf`: 이 파이프라인에 대한 작은 모델 1개를 정의(결과가 의미 없어도 상관없음)하고 파이프라인 출력을 테스트합니다. +결과는 `test_small_model_pt`와 동일해야 합니다. 
+- `test_large_model_pt`(`선택사항`): 결과가 의미 있을 것으로 예상되는 실제 파이프라인에서 파이프라인을 테스트합니다. +이러한 테스트는 속도가 느리므로 이를 표시해야 합니다. +여기서의 목표는 파이프라인을 보여주고 향후 릴리즈에서의 변화가 없는지 확인하는 것입니다. +- `test_large_model_tf`(`선택사항`): 결과가 의미 있을 것으로 예상되는 실제 파이프라인에서 파이프라인을 테스트합니다. +이러한 테스트는 속도가 느리므로 이를 표시해야 합니다. +여기서의 목표는 파이프라인을 보여주고 향후 릴리즈에서의 변화가 없는지 확인하는 것입니다. diff --git a/docs/source/ko/add_tensorflow_model.md b/docs/source/ko/add_tensorflow_model.md new file mode 100644 index 00000000000000..22980b1320c55b --- /dev/null +++ b/docs/source/ko/add_tensorflow_model.md @@ -0,0 +1,262 @@ + + +# 어떻게 🤗 Transformers 모델을 TensorFlow로 변환하나요? [[how-to-convert-a-transformers-model-to-tensorflow]] + +🤗 Transformers에서처럼 사용할 수 있는 여러 가지 프레임워크가 있다는 것은 애플리케이션을 설계할 때 그들의 강점을 유연하게 이용할 수 있다는 장점이 있지만, 모델 별로 호환성을 추가해야 한다는 단점 또한 존재한다는 것을 의미합니다. 좋은 소식은 기존 모델에 TensorFlow 호환성을 추가하는 것이 [처음부터 새로운 모델을 추가하는 것](add_new_model)보다도 간단하다는 것입니다! + +만약 대규모 TensorFlow 모델을 더 깊이 이해하려거나, 오픈 소스에 큰 기여를 하려거나, 선택한 모델에 Tensorflow를 활용하려한다면, 이 안내서는 여러분께 도움이 될 것입니다. + +이 가이드는 Hugging Face 팀의 최소한의 감독 아래에서 🤗 Transformers에서 사용되는 TensorFlow 모델 가중치와/또는 아키텍처를 기여할 수 있는 커뮤니티 구성원인 여러분을 대상으로 합니다. +새로운 모델을 작성하는 것은 쉬운 일이 아니지만, 이 가이드를 통해 조금 덜 힘들고 훨씬 쉬운 작업으로 만들 수 있습니다. +모두의 경험을 모으는 것은 이 작업을 점차적으로 더 쉽게 만드는 데 굉장히 중요하기 때문에, 이 가이드를 개선시킬만한 제안이 떠오르면 공유하시는걸 적극적으로 권장합니다! + +더 깊이 알아보기 전에, 🤗 Transformers를 처음 접하는 경우 다음 자료를 확인하는 것이 좋습니다: +- [🤗 Transformers의 일반 개요](add_new_model#general-overview-of-transformers) +- [Hugging Face의 TensorFlow 철학](https://huggingface.co/blog/tensorflow-philosophy) + +이 가이드의 나머지 부분에서는 새로운 TensorFlow 모델 아키텍처를 추가하는 데 필요한 단계, Pytorch를 TensorFlow 모델 가중치로 변환하는 절차 및 ML 프레임워크 간의 불일치를 효율적으로 디버깅하는 방법을 알게 될 것입니다. 시작해봅시다! + + + +사용하려는 모델이 이미 해당하는 TensorFlow 아키텍처가 있는지 확실하지 않나요? + +선택한 모델([예](https://huggingface.co/google-bert/bert-base-uncased/blob/main/config.json#L14))의 `config.json`의 `model_type` 필드를 확인해보세요. 🤗 Transformers의 해당 모델 폴더에는 "modeling_tf"로 시작하는 파일이 있는 경우, 해당 모델에는 해당 TensorFlow 아키텍처([예](https://github.com/huggingface/transformers/tree/main/src/transformers/models/bert))가 있다는 의미입니다. + + + +## TensorFlow 모델 아키텍처 코드 추가하는 단계별 가이드 [[step-by-step-guide-to add-tensorFlow-model-architecture-code]] + +대규모 아키텍처를 가진 모델을 설계하는 방법에는 여러가지가 있으며, 해당 설계를 구현하는 방법도 여러 가지입니다. +그러나 우리는 [🤗 Transformers 일반 개요](add_new_model#general-overview-of-transformers)에서 언급한 대로 일관된 설계 선택에 따라야지만 🤗 Transformers를 사용하기 편할 것이라는 확고한 의견을 가지고 있습니다. +우리의 경험을 통해 TensorFlow 모델을 추가하는 데 관련된 중요한 몇 가지 사항을 알려 드릴 수 있습니다: + +- 이미 있는걸 다시 개발하려 하지 마세요! 최소한 2개의 이미 구현된 모델을 대개 참조해야 합니다. 구현하려는 모델과 기능상 동일한 Pytorch 모델 하나와 같은 문제 유형을 풀고 있는 다른 TensorFlow 모델 하나를 살펴보세요. +- 우수한 모델 구현은 시간이 지나도 남아있습니다. 이것은 코드가 아름답다는 이유가 아니라 코드가 명확하고 디버깅 및 개선이 쉽기 때문입니다. TensorFlow 구현에서 다른 모델들과 패턴을 똑같이 하고 Pytorch 구현과의 불일치를 최소화하여 메인테이너의 업무를 쉽게 한다면, 기여한 코드가 오래도록 유지될 수 있습니다. +- 필요하다면 도움을 요청하세요! 🤗 Transformers 팀은 여러분을 돕기 위해 있으며, 여러분이 직면한 동일한 문제에 대한 해결책을 이미 찾은 경우도 있을 수 있습니다. + +TensorFlow 모델 아키텍처를 추가하는 데 필요한 단계를 개략적으로 써보면: +1. 변환하려는 모델 선택 +2. transformers 개발 환경 준비 +3. (선택 사항) 이론적 측면 및 기존 구현 이해 +4. 모델 아키텍처 구현 +5. 모델 테스트 구현 +6. PR (pull request) 제출 +7. (선택 사항) 데모 빌드 및 공유 + +### 1.-3. 모델 기여 준비 [[1.-3.-prepare-your-model-contribution]] + +**1. 변환하려는 모델 선택** + +우선 기본 사항부터 시작해 보겠습니다. 먼저 변환하려는 아키텍처를 알아야 합니다. +특정 아키텍처에 대한 관심 없는 경우, 🤗 Transformers 팀에게 제안을 요청하는 것은 여러분의 영향력을 극대화하는 좋은 방법입니다. +우리는 TensorFlow에서 빠져 있는 가장 유명한 아키텍처로 이끌어 드리겠습니다. +TensorFlow에서 사용할 모델이 이미 🤗 Transformers에 TensorFlow 아키텍처 구현이 있지만 가중치가 없는 경우, +이 페이지의 [가중치 추가 섹션](#adding-tensorflow-weights-to-hub)으로 바로 이동하셔도 됩니다. 
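+
+아키텍처가 이미 TensorFlow로 구현되어 있는지는 코드로도 빠르게 확인해 볼 수 있습니다. 아래는 이를 위한 간단한 스케치로, 예시 체크포인트와 클래스 이름(`TFBertModel`)은 설명을 위한 가정입니다:
+
+```python
+import transformers
+from transformers import AutoConfig
+
+# 체크포인트의 모델 타입을 확인합니다 (config.json의 "model_type" 필드와 동일).
+config = AutoConfig.from_pretrained("google-bert/bert-base-uncased")
+print(config.model_type)  # 예: "bert"
+
+# 해당 모델 타입에 대응하는 TF 클래스가 라이브러리에 포함되어 있는지 확인합니다.
+print(getattr(transformers, "TFBertModel", None) is not None)
+```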
+ +간단히 말해서, 이 안내서의 나머지 부분은 TensorFlow 버전의 *BrandNewBert*([가이드](add_new_model)와 동일한 예제)를 기여하려고 결정했다고 가정합니다. + + + +TensorFlow 모델 아키텍처에 작업을 시작하기 전에 해당 작업이 진행 중인지 확인하세요. +`BrandNewBert`를 검색하여 +[pull request GitHub 페이지](https://github.com/huggingface/transformers/pulls?q=is%3Apr)에서 TensorFlow 관련 pull request가 없는지 확인할 수 있습니다. + + + +**2. transformers 개발 환경 준비** + + +모델 아키텍처를 선택한 후, 관련 작업을 수행할 의도를 미리 알리기 위해 Draft PR을 여세요. 아래 지침대로 하시면 환경을 설정하고 Draft PR을 열 수 있습니다. + +1. 'Fork' 버튼을 클릭하여 [리포지터리](https://github.com/huggingface/transformers)를 포크하세요. 이렇게 하면 GitHub 사용자 계정에 코드의 사본이 생성됩니다. + + +2. `transformers` 포크를 로컬 디스크에 클론하고 원본 리포지터리를 원격 리포지터리로 추가하세요. + +```bash +git clone https://github.com/[your Github handle]/transformers.git +cd transformers +git remote add upstream https://github.com/huggingface/transformers.git +``` + +3. 개발 환경을 설정하세요. 예를 들어, 다음 명령을 실행하여 개발 환경을 설정할 수 있습니다. + +```bash +python -m venv .env +source .env/bin/activate +pip install -e ".[dev]" +``` + +운영 체제에 따라서 Transformers의 선택적 종속성이 증가하면서 위 명령이 실패할 수도 있습니다. 그런 경우 TensorFlow를 설치한 후 다음을 실행하세요. + +```bash +pip install -e ".[quality]" +``` + +**참고:** CUDA를 설치할 필요는 없습니다. 새로운 모델이 CPU에서 작동하도록 만드는 것만으로 충분합니다. + +4. 메인 브랜치에서 만드려는 기능이 잘 표현되는 이름으로 브랜치를 만듭니다. + +```bash +git checkout -b add_tf_brand_new_bert +``` + +5. 메인 브랜치의 현재 상태를 페치(fetch)하고 리베이스하세요. + +```bash +git fetch upstream +git rebase upstream/main +``` + +6. `transformers/src/models/brandnewbert/`에 `modeling_tf_brandnewbert.py`라는 빈 `.py` 파일을 추가하세요. 이 파일이 TensorFlow 모델 파일이 될 것입니다. + +7. 변경 사항을 계정에 푸시하세요. + +```bash +git add . +git commit -m "initial commit" +git push -u origin add_tf_brand_new_bert +``` + +8. 만족스러운 경우 GitHub에서 포크된 웹 페이지로 이동합니다. "Pull request"를 클릭합니다. Hugging Face 팀의 GitHub ID를 리뷰어로 추가해서, 앞으로의 변경 사항에 대해 Hugging Face 팀이 알림을 받을 수 있도록 합니다. + + +9. GitHub Pull Requests 페이지의 오른쪽에 있는 "Convert to draft"를 클릭하여 PR을 초안으로 변경하세요. + +이제 🤗 Transformers에서 *BrandNewBert*를 TensorFlow로 변환할 개발 환경을 설정했습니다. + + +**3. (선택 사항) 이론적 측면 및 기존 구현 이해** + + +*BrandNewBert*처럼 자세한 글이 있다면 시간을 내어 논문을 읽는걸 추천드립니다. 이해하기 어려운 부분이 많을 수 있습니다. 그렇다고 해서 걱정하지 마세요! 목표는 논문의 심도있는 이론적 이해가 아니라 TensorFlow를 사용하여 🤗 Transformers에 모델을 효과적으로 다시 구현하는 데 필요한 필수 정보를 추출하는 것입니다. 많은 시간을 이론적 이해에 투자할 필요는 없지만 실용적인 측면에서 현재 존재하는 모델 문서 페이지(e.g. [model docs for BERT](model_doc/bert))에 집중하는 것이 좋습니다. + + +모델의 기본 사항을 이해한 후, 기존 구현을 이해하는 것이 중요합니다. 이는 작업 중인 모델에 대한 실제 구현이 여러분의 기대와 일치함을 확인하고, TensorFlow 측면에서의 기술적 문제를 예상할 수 있습니다. + +막대한 양의 정보를 처음으로 학습할 때 압도당하는 것은 자연스러운 일입니다. 이 단계에서 모델의 모든 측면을 이해해야 하는 필요는 전혀 없습니다. 그러나 우리는 Hugging Face의 [포럼](https://discuss.huggingface.co/)을 통해 질문이 있는 경우 대답을 구할 것을 권장합니다. + +### 4. 모델 구현 [[4-model-implementation]] + + +이제 드디어 코딩을 시작할 시간입니다. 우리의 제안된 시작점은 PyTorch 파일 자체입니다: `modeling_brand_new_bert.py`의 내용을 +`src/transformers/models/brand_new_bert/` 내부의 +`modeling_tf_brand_new_bert.py`에 복사합니다. 이 섹션의 목표는 파일을 수정하고 🤗 Transformers의 import 구조를 업데이트하여 `TFBrandNewBert` 및 `TFBrandNewBert.from_pretrained(model_repo, from_pt=True)`가 성공적으로 작동하는 TensorFlow *BrandNewBert* 모델을 가져올 수 있도록 하는 것입니다. + +유감스럽게도, PyTorch 모델을 TensorFlow로 변환하는 규칙은 없습니다. 그러나 프로세스를 가능한한 원활하게 만들기 위해 다음 팁을 따를 수 있습니다. + +- 모든 클래스 이름 앞에 `TF`를 붙입니다(예: `BrandNewBert`는 `TFBrandNewBert`가 됩니다). +- 대부분의 PyTorch 작업에는 직접적인 TensorFlow 대체가 있습니다. 예를 들어, `torch.nn.Linear`는 `tf.keras.layers.Dense`에 해당하고, `torch.nn.Dropout`은 `tf.keras.layers.Dropout`에 해당합니다. 특정 작업에 대해 확신이 없는 경우 [TensorFlow 문서](https://www.tensorflow.org/api_docs/python/tf)나 [PyTorch 문서](https://pytorch.org/docs/stable/)를 참조할 수 있습니다. +- 🤗 Transformers 코드베이스에서 패턴을 찾으세요. 
직접적인 대체가 없는 특정 작업을 만나면 다른 사람이 이미 동일한 문제를 해결한 경우가 많습니다. +- 기본적으로 PyTorch와 동일한 변수 이름과 구조를 유지하세요. 이렇게 하면 디버깅과 문제 추적, 그리고 문제 해결 추가가 더 쉬워집니다. +- 일부 레이어는 각 프레임워크마다 다른 기본값을 가지고 있습니다. 대표적인 예로 배치 정규화 레이어의 epsilon은 [PyTorch](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html#torch.nn.BatchNorm2d)에서 `1e-5`이고 [TensorFlow](https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization)에서 `1e-3`입니다. 문서를 모두 확인하세요! +- PyTorch의 `nn.Parameter` 변수는 일반적으로 TF 레이어의 `build()` 내에서 초기화해야 합니다. 다음 예를 참조하세요: [PyTorch](https://github.com/huggingface/transformers/blob/655f72a6896c0533b1bdee519ed65a059c2425ac/src/transformers/models/vit_mae/modeling_vit_mae.py#L212) / + [TensorFlow](https://github.com/huggingface/transformers/blob/655f72a6896c0533b1bdee519ed65a059c2425ac/src/transformers/models/vit_mae/modeling_tf_vit_mae.py#L220) +- PyTorch 모델의 함수 상단에 `#copied from ...`가 있는 경우, TensorFlow 모델에 TensorFlow 아키텍처가 있다면 TensorFlow 모델이 해당 함수를 복사한 아키텍처에서 사용할 수 있습니다. +- TensorFlow 함수에서 `name` 속성을 올바르게 할당하는 것은 `from_pt=True` 가중치 교차 로딩을 수행하는 데 중요합니다. `name`은 대부분 PyTorch 코드의 해당 변수의 이름입니다. `name`이 제대로 설정되지 않으면 모델 가중치를 로드할 때 오류 메시지에서 확인할 수 있습니다. +- 기본 모델 클래스인 `BrandNewBertModel`의 로직은 실제로 Keras 레이어 서브클래스([예시](https://github.com/huggingface/transformers/blob/4fd32a1f499e45f009c2c0dea4d81c321cba7e02/src/transformers/models/bert/modeling_tf_bert.py#L719))인 `TFBrandNewBertMainLayer`에 있습니다. `TFBrandNewBertModel`은 이 레이어를 감싸기만 하는 래퍼 역할을 합니다. +- Keras 모델은 사전 훈련된 가중치를 로드하기 위해 빌드되어야 합니다. 따라서 `TFBrandNewBertPreTrainedModel`은 모델의 입력 예제인 `dummy_inputs`([예시](https://github.com/huggingface/transformers/blob/4fd32a1f499e45f009c2c0dea4d81c321cba7e02/src/transformers/models/bert/modeling_tf_bert.py#L916)) 유지해야 합니다. +- 도움이 필요한 경우 도움을 요청하세요. 우리는 여기 있어서 도움을 드리기 위해 있는 것입니다! 🤗 + +모델 파일 자체 외에도 모델 클래스 및 관련 문서 페이지에 대한 포인터를 추가해야 합니다. 이 부분은 다른 PR([예시](https://github.com/huggingface/transformers/pull/18020/files))의 패턴을 따라 완전히 완료할 수 있습니다. 다음은 필요한 수동 변경 목록입니다. + +- `src/transformers/__init__.py`에 *BrandNewBert*의 모든 공개 클래스를 포함합니다. +- `src/transformers/models/auto/modeling_tf_auto.py`에서 *BrandNewBert* 클래스를 해당 Auto 클래스에 추가합니다. +- `src/transformers/utils/dummy_tf_objects.py`에 *BrandNewBert*와 관련된 레이지 로딩 클래스를 추가합니다. +- `src/transformers/models/brand_new_bert/__init__.py`에서 공개 클래스에 대한 import 구조를 업데이트합니다. +- `docs/source/en/model_doc/brand_new_bert.md`에서 *BrandNewBert*의 공개 메서드에 대한 문서 포인터를 추가합니다. +- `docs/source/en/model_doc/brand_new_bert.md`의 *BrandNewBert* 기여자 목록에 자신을 추가합니다. +- 마지막으로 ✅ 녹색 체크박스를 TensorFlow 열 docs/source/en/index.md 안 BrandNewBert에 추가합니다. + +구현이 만족하면 다음 체크리스트를 실행하여 모델 아키텍처가 준비되었는지 확인하세요. + +1. 훈련 시간에 다르게 동작하는 `training` 인수로 불리는 모든 레이어(예: Dropout)는 최상위 클래스에서 전파됩니다. +2. #copied from ...가능할 때마다 사용했습니다. +3. `TFBrandNewBertMainLayer`와 그것을 사용하는 모든 클래스는 `call`함수로 `@unpack_inputs`와 함께 데코레이터 됩니다. +4. `TFBrandNewBertMainLayer`는 `@keras_serializable`로 데코레이터 됩니다. +5. TensorFlow 모델은 `TFBrandNewBert.from_pretrained(model_repo, from_pt=True)`를 사용하여 PyTorch 가중치에서 로드할 수 있습니다. +6. 예상 입력 형식을 사용하여 TensorFlow 모델을 호출할 수 있습니다. + +### 5. 모델 테스트 구현 [[5-add-model-tests]] + +TensorFlow 모델 아키텍처를 구현하는 데 성공했습니다! 이제 TensorFlow 모델을 테스트하는 구현을 작성할 차례입니다. 이를 통해 모델이 예상대로 작동하는지 확인할 수 있습니다. 이전에 우리는 `test_modeling_brand_new_bert.py` 파일을 `tests/models/brand_new_bert/ into test_modeling_tf_brand_new_bert.py`에 복사한 뒤, TensorFlow로 교체하는 것이 좋습니다. 지금은, 모든 `.from_pretrained()`을 `from_pt=True`를 사용하여 존재하는 Pytorch 가중치를 가져오도록 해야합니다. + +완료하셨으면, 이제 진실의 순간이 찾아왔습니다: 테스트를 실행해 보세요! 
😬 + +```bash +NVIDIA_TF32_OVERRIDE=0 RUN_SLOW=1 RUN_PT_TF_CROSS_TESTS=1 \ +py.test -vv tests/models/brand_new_bert/test_modeling_tf_brand_new_bert.py +``` + +오류가 많이 나타날 것이지만 괜찮습니다! 기계 학습 모델을 디버깅하는 것은 악명높게 어려우며 성공의 핵심 요소는 인내심입니다 (`breakpoint()`도 필요합니다). 우리의 경험상으로는 ML 프레임워크 사이의 미묘한 불일치로 인해 가장 어려운 문제가 발생합니다. 이에 대한 몇 가지 지침이 이 가이드의 끝 부분에 있습니다. 다른 경우에는 일반 테스트가 직접 모델에 적용되지 않을 수 있으며, 이 경우 모델 테스트 클래스 레벨에서 재정의를 제안합니다. 문제가 무엇이든지 상관없이 문제가 있으면 당신이 고립되었다면 draft pull request에서 도움을 요청하는 것이 좋습니다. + +모든 테스트가 통과되면 축하합니다. 이제 모델을 🤗 Transformers 라이브러리에 추가할 준비가 거의 완료된 것입니다! 🎉 + + +테스트를 추가하는 방법에 대한 자세한 내용은 [🤗 Transformers의 테스트 가이드](https://huggingface.co/transformers/contributing.html#running-tests)를 참조하세요. + +### 6.-7. 모든 사용자가 당신의 모델을 사용할 수 있게 하기 [[6.-7.-ensure-everyone -can-use-your-model]] + +**6. 풀 요청 제출하기** + +구현과 테스트가 완료되면 풀 요청을 제출할 시간입니다. 코드를 푸시하기 전에 코드 서식 맞추기 유틸리티인 `make fixup` 🪄 를 실행하세요. 이렇게 하면 자동으로 서식 오류를 수정하며 자동 검사가 실패하는 것을 방지할 수 있습니다. + +이제 드래프트 풀 요청을 실제 풀 요청으로 변환하는 시간입니다. "리뷰 준비됨" 버튼을 클릭하고 Joao (`@gante`)와 Matt (`@Rocketknight1`)를 리뷰어로 추가하세요. 모델 풀 요청에는 적어도 3명의 리뷰어가 필요하지만, 그들이 당신의 모델에 적절한 추가 리뷰어를 찾을 것입니다. + +모든 리뷰어들이 PR 상태에 만족하면 마지막으로 `.from_pretrained()` 호출에서 `from_pt=True` 플래그를 제거하는 것입니다. TensorFlow 가중치가 없기 때문에 이를 추가해야 합니다! 이를 수행하는 방법은 아래 섹션의 지침을 확인하세요. + +마침내 TensorFlow 가중치가 병합되고, 적어도 3명의 리뷰어 승인을 받았으며 모든 CI 검사가 통과되었다면, 로컬로 테스트를 한 번 더 확인하세요. + +```bash +NVIDIA_TF32_OVERRIDE=0 RUN_SLOW=1 RUN_PT_TF_CROSS_TESTS=1 \ +py.test -vv tests/models/brand_new_bert/test_modeling_tf_brand_new_bert.py +``` + +그리고 우리는 당신의 PR을 병합할 것입니다! 마일스톤 달성을 축하드립니다! 🎉 + +**7. (선택 사항) 데모를 만들고 세상과 공유하기** + +오픈 소스의 가장 어려운 부분 중 하나는 발견입니다. 다른 사용자들이 당신의 멋진 TensorFlow 기여를 어떻게 알 수 있을까요? 물론 적절한 커뮤니케이션으로 가능합니다! 📣 + +커뮤니티와 모델을 공유하는 두 가지 주요 방법이 있습니다: +- 데모 만들기. Gradio 데모, 노트북 및 모델을 자랑하는 다른 재미있는 방법을 포함합니다. [커뮤니티 기반 데모](https://huggingface.co/docs/transformers/community)에 노트북을 추가하는 것을 적극 권장합니다. +- Twitter와 LinkedIn과 같은 소셜 미디어에 이야기 공유하기. 당신의 작업에 자랑스러워하고 커뮤니티와 당신의 업적을 공유해야 합니다. 이제 당신의 모델은 전 세계의 수천 명의 엔지니어와 연구원들에 의해 사용될 수 있습니다 🌍! 우리는 당신의 게시물을 리트윗하고 커뮤니티와 함께 당신의 작업을 공유하는 데 도움이 될 것입니다. + + +## 🤗 허브에 TensorFlow 가중치 추가하기 [[adding-tensorFlow-weights-to-🤗-hub]] + +TensorFlow 모델 아키텍처가 🤗 Transformers에서 사용 가능하다고 가정하고, PyTorch 가중치를 TensorFlow 가중치로 변환하는 것은 쉽습니다! + +다음은 그 방법입니다: +1. 터미널에서 Hugging Face 계정으로 로그인되어 있는지 확인하십시오. `huggingface-cli login` 명령어를 사용하여 로그인할 수 있습니다. (액세스 토큰은 [여기](https://huggingface.co/settings/tokens)에서 찾을 수 있습니다.) +2. `transformers-cli pt-to-tf --model-name foo/bar`를 실행하십시오. 여기서 `foo/bar`는 변환하려는 PyTorch 가중치가 있는 모델 저장소의 이름입니다. +3. 방금 만든 🤗 허브 PR에서 `@joaogante`와 `@Rocketknight1`을 태그합니다. + +그게 다입니다! 🎉 + + +## ML 프레임워크 간 디버깅 🐛[[debugging-mismatches-across-ml-frameworks]] + +새로운 아키텍처를 추가하거나 기존 아키텍처에 대한 TensorFlow 가중치를 생성할 때, PyTorch와 TensorFlow 간의 불일치로 인한 오류가 발생할 수 있습니다. 심지어 두 프레임워크의 모델 아키텍처 코드가 동일해 보일 수도 있습니다. 무슨 일이 벌어지고 있는 걸까요? 🤔 + +먼저, 이러한 불일치를 이해하는 이유에 대해 이야기해 보겠습니다. 많은 커뮤니티 멤버들은 🤗 Transformers 모델을 그대로 사용하고, 우리의 모델이 예상대로 작동할 것이라고 믿습니다. 두 프레임워크 간에 큰 불일치가 있으면 모델이 적어도 하나의 프레임워크에 대한 참조 구현을 따르지 않음을 의미합니다. 이는 모델이 의도한 대로 작동하지 않을 수 있음을 나타냅니다. 이는 아예 실행되지 않는 모델보다 나쁠 수 있습니다! 따라서 우리는 모든 모델의 프레임워크 불일치를 `1e-5`보다 작게 유지하는 것을 목표로 합니다. + +기타 숫자 문제와 마찬가지로, 세세한 문제가 있습니다. 그리고 세세함에 집중하는 공정에서 필수 요소는 인내심입니다. 이러한 종류의 문제가 발생할 때 권장되는 작업 흐름은 다음과 같습니다: +1. 불일치의 원인을 찾아보십시오. 변환 중인 모델은 아마도 특정 지점까지 거의 동일한 내부 변수를 가지고 있을 것입니다. 두 프레임워크의 아키텍처에 `breakpoint()` 문을 넣고, 위에서 아래로 숫자 변수의 값을 비교하여 문제의 근원을 찾아냅니다. +2. 이제 문제의 근원을 찾았으므로 🤗 Transformers 팀에 연락하세요. 우리는 비슷한 문제를 이전에 겪었을 수 있으며 빠르게 해결책을 제공할 수 있습니다. 예외적인 경우에는 StackOverflow와 GitHub 이슈와 같은 인기있는 페이지를 확인하십시오. +3. 
더 이상 해결책이 없는 경우, 더 깊이 들어가야 합니다. 좋은 소식은 문제의 원인을 찾았으므로 나머지 모델을 추상화하고 문제가 있는 명령어에 초점을 맞출 수 있습니다! 나쁜 소식은 해당 명령어의 소스 구현에 대해 알아봐야 한다는 것입니다. 일부 경우에는 참조 구현에 문제가 있을 수도 있으니 업스트림 저장소에서 이슈를 열기를 꺼리지 마십시오. + +어떤 경우에는 🤗 Transformers 팀과의 토론을 통해 불일치를 수정할 수 없을 수도 있습니다. 모델의 출력 레이어에서 불일치가 매우 작지만 숨겨진 상태에서 크게 나타날 수 있기 때문입니다. 이 경우 모델을 배포하는 것을 우선시하기 위해 불일치를 무시하기로 결정할 수도 있습니다. 위에서 언급한 `pt-to-tf` CLI에는 가중치 변환 시 오류 메시지를 무시하는 `--max-error` 플래그가 있습니다. diff --git a/docs/source/ko/attention.md b/docs/source/ko/attention.md new file mode 100644 index 00000000000000..d9a5eeee1b73ea --- /dev/null +++ b/docs/source/ko/attention.md @@ -0,0 +1,54 @@ + + +# 어텐션 메커니즘[[attention_mechanisms]] + +대부분의 트랜스포머 모델은 정방행렬인 전체 어텐션을 사용합니다. +하지만 이는 긴 텍스트를 다룰 때는 큰 계산 병목 현상을 유발할 수 있습니다. +`Longformer`와 `Reformer`는 훈련 속도를 높이기 위해 어텐션 행렬의 희소 버전을 사용하여 효율을 높이려는 모델입니다. + +## LSH 어텐션[[lsh_attention]] + + +[Reformer](model_doc/reformer)는 LSH(Locality Sensitive Hashing) 어텐션을 사용합니다. softmax(QK^t)에서는 행렬 QK^t의 (softmax 차원에서) 가장 큰 요소들만 유용한 기여를 할 것입니다. +따라서 각각의 쿼리 q에 대해, q와 가까운 키 k만 고려할 수 있습니다. 해시 함수는 q와 k가 가까운지 여부를 결정하는 데 사용됩니다. +어텐션 마스크는 현재 토큰을 마스킹하여 변경됩니다. 이 때 첫 번째 위치의 토큰은 제외합니다. 왜냐하면 쿼리와 키가 동일한 값을 갖게 되기 때문입니다(서로 매우 유사함). +해시는 약간의 무작위성을 가질 수 있으므로, 실제로는 여러 개의 해시 함수가 사용되고 (`n_rounds` 매개변수에 의해 결정됨) 그 후에 평균값을 취하게 됩니다. + +## 지역 어텐션[[local_attention]] + +[Longformer](model_doc/longformer)는 지역 어텐션을 사용합니다. 종종 특정 토큰에 대해 지역 컨텍스트(예: 왼쪽과 오른쪽에 있는 두 개의 토큰은 무엇인가요?)만으로도 작업을 수행하는데 충분합니다. +또한 작은 창(window)을 가진 어텐션 레이어를 쌓음으로써 마지막 레이어는 창 내의 토큰뿐만 아니라 더 많은 수의 토큰에 대한 수용 영역(receptive field)을 갖게 되어 전체 문장의 표현을 구축할 수 있습니다. + +사전에 선택된 일부 입력 토큰들은 전역 어텐션을 받습니다. 이 몇 개의 토큰에 대해서는 어텐션 행렬이 모든 토큰에 접근할 수 있으며, 이 과정은 대칭적으로 이루어집니다. +다른 모든 토큰들은 로컬 창 내의 토큰들에 더해 해당 특정 토큰들에도 접근할 수 있습니다. 이는 논문의 Figure 2d에서 나타나며, 아래에 샘플 어텐션 마스크가 제시되어 있습니다: + + +
+
+(그림: 지역 어텐션 마스크 예시)
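+
+개념을 그려보기 위한 최소한의 스케치로, 지역 창과 몇 개의 전역 토큰을 갖는 어텐션 마스크를 다음과 같이 만들어 볼 수 있습니다. 이는 Longformer의 실제 구현이 아니라 설명용 예시이며, 함수 이름과 매개변수는 임의로 정한 것입니다:
+
+```python
+import torch
+
+
+def build_local_attention_mask(seq_len, window, global_token_ids=()):
+    """지역 창 안의 토큰과 전역 토큰만 서로 볼 수 있는 불리언 어텐션 마스크를 만듭니다."""
+    positions = torch.arange(seq_len)
+    # 각 토큰은 자신을 중심으로 좌우 window개의 토큰만 볼 수 있습니다.
+    mask = (positions[None, :] - positions[:, None]).abs() <= window
+    # 전역 토큰은 모든 토큰을 보고, 모든 토큰도 전역 토큰을 볼 수 있습니다(대칭).
+    for idx in global_token_ids:
+        mask[idx, :] = True
+        mask[:, idx] = True
+    return mask
+
+
+print(build_local_attention_mask(seq_len=8, window=1, global_token_ids=(0,)).int())
+```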
+ + +적은 파라미터의 어텐션 행렬을 사용하면 모델이 더 큰 시퀀스 입력 길이를 가질 수 있습니다. + +## 다른 방법들[[other_tricks]] + +### 축별 위치 인코딩[[axial_positional_encodings]] + +[Reformer](model_doc/reformer)는 축별 위치 인코딩(axial positional encodings)을 사용합니다. 기존의 트랜스포머 모델에서는 위치 인코딩 행렬 E는 크기가 \\(l \times d\\)인 행렬이며, +여기서 \\(l\\)은 시퀀스 길이(sequence length)이고 \\(d\\)는 숨겨진 상태(hidden state)의 차원입니다. 매우 긴 텍스트의 경우, 이 행렬은 매우 크며 GPU 상에서 공간을 많이 차지할 수 있습니다. +이를 완화하기 위해, 축별 위치 인코딩은 큰 행렬 E를 두 개의 작은 행렬 E1과 E2로 분해합니다. 이때 E1의 크기는 \\(l_{1} \times d_{1}\\)이고, E2의 크기는 \\(l_{2} \times d_{2}\\)입니다. +이때 \\(l_{1} \times l_{2} = l\\)이고 \\(d_{1} + d_{2} = d\\)(길이에 대한 곱셈 연산을 사용하면 훨씬 작아집니다). E의 시간 단계 j에 대한 임베딩은 E1에서 시간 단계 \\(j \% l1\\)의 임베딩과 E2에서 시간 단계 \\(j // l1\\)의 임베딩을 연결하여 얻습니다. \ No newline at end of file diff --git a/docs/source/ko/autoclass_tutorial.md b/docs/source/ko/autoclass_tutorial.md new file mode 100644 index 00000000000000..e41a2acc7b486b --- /dev/null +++ b/docs/source/ko/autoclass_tutorial.md @@ -0,0 +1,144 @@ + + +# AutoClass로 사전 학습된 인스턴스 로드[[load-pretrained-instances-with-an-autoclass]] + +트랜스포머 아키텍처가 매우 다양하기 때문에 체크포인트에 맞는 아키텍처를 생성하는 것이 어려울 수 있습니다. 라이브러리를 쉽고 간단하며 유연하게 사용하기 위한 Transformer 핵심 철학의 일환으로, `AutoClass`는 주어진 체크포인트에서 올바른 아키텍처를 자동으로 추론하여 로드합니다. `from_pretrained()` 메서드를 사용하면 모든 아키텍처에 대해 사전 학습된 모델을 빠르게 로드할 수 있으므로 모델을 처음부터 학습하는 데 시간과 리소스를 투입할 필요가 없습니다. +체크포인트에 구애받지 않는 코드를 생성한다는 것은 코드가 한 체크포인트에서 작동하면 아키텍처가 다르더라도 다른 체크포인트(유사한 작업에 대해 학습된 경우)에서도 작동한다는 것을 의미합니다. + + + +아키텍처는 모델의 골격을 의미하며 체크포인트는 주어진 아키텍처에 대한 가중치입니다. 예를 들어, [BERT](https://huggingface.co/google-bert/bert-base-uncased)는 아키텍처이고, `google-bert/bert-base-uncased`는 체크포인트입니다. 모델은 아키텍처 또는 체크포인트를 의미할 수 있는 일반적인 용어입니다. + + + +이 튜토리얼에서는 다음을 학습합니다: + +* 사전 학습된 토크나이저 로드하기. +* 사전 학습된 이미지 프로세서 로드하기. +* 사전 학습된 특징 추출기 로드하기. +* 사전 훈련된 프로세서 로드하기. +* 사전 학습된 모델 로드하기. + +## AutoTokenizer[[autotokenizer]] + +거의 모든 NLP 작업은 토크나이저로 시작됩니다. 토크나이저는 사용자의 입력을 모델에서 처리할 수 있는 형식으로 변환합니다. +[`AutoTokenizer.from_pretrained`]로 토크나이저를 로드합니다: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +``` + +그리고 아래와 같이 입력을 토큰화합니다: + +```py +>>> sequence = "In a hole in the ground there lived a hobbit." +>>> print(tokenizer(sequence)) +{'input_ids': [101, 1999, 1037, 4920, 1999, 1996, 2598, 2045, 2973, 1037, 7570, 10322, 4183, 1012, 102], + 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]} +``` + +## AutoImageProcessor[[autoimageprocessor]] + +비전 작업의 경우 이미지 프로세서가 이미지를 올바른 입력 형식으로 처리합니다. + +```py +>>> from transformers import AutoImageProcessor + +>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224") +``` + + +## AutoFeatureExtractor[[autofeatureextractor]] + +오디오 작업의 경우 특징 추출기가 오디오 신호를 올바른 입력 형식으로 처리합니다. + +[`AutoFeatureExtractor.from_pretrained`]로 특징 추출기를 로드합니다: + +```py +>>> from transformers import AutoFeatureExtractor + +>>> feature_extractor = AutoFeatureExtractor.from_pretrained( +... "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition" +... ) +``` + +## AutoProcessor[[autoprocessor]] + +멀티모달 작업에는 두 가지 유형의 전처리 도구를 결합한 프로세서가 필요합니다. 예를 들어 LayoutLMV2 모델에는 이미지를 처리하는 이미지 프로세서와 텍스트를 처리하는 토크나이저가 필요하며, 프로세서는 이 두 가지를 결합합니다. 
+ +[`AutoProcessor.from_pretrained()`]로 프로세서를 로드합니다: + +```py +>>> from transformers import AutoProcessor + +>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv2-base-uncased") +``` + +## AutoModel[[automodel]] + + + +마지막으로 AutoModelFor클래스를 사용하면 주어진 작업에 대해 미리 학습된 모델을 로드할 수 있습니다 (사용 가능한 작업의 전체 목록은 [여기](model_doc/auto)를 참조하세요). 예를 들어, [`AutoModelForSequenceClassification.from_pretrained`]를 사용하여 시퀀스 분류용 모델을 로드할 수 있습니다: + +```py +>>> from transformers import AutoModelForSequenceClassification + +>>> model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") +``` + +동일한 체크포인트를 쉽게 재사용하여 다른 작업에 아키텍처를 로드할 수 있습니다: + +```py +>>> from transformers import AutoModelForTokenClassification + +>>> model = AutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased") +``` + + + +PyTorch모델의 경우 `from_pretrained()` 메서드는 내부적으로 피클을 사용하여 안전하지 않은 것으로 알려진 `torch.load()`를 사용합니다. +일반적으로 신뢰할 수 없는 소스에서 가져왔거나 변조되었을 수 있는 모델은 로드하지 마세요. 허깅 페이스 허브에서 호스팅되는 공개 모델의 경우 이러한 보안 위험이 부분적으로 완화되며, 각 커밋 시 멀웨어를 [검사합니다](https://huggingface.co/docs/hub/security-malware). GPG를 사용해 서명된 [커밋 검증](https://huggingface.co/docs/hub/security-gpg#signing-commits-with-gpg)과 같은 모범사례는 [문서](https://huggingface.co/docs/hub/security)를 참조하세요. + +텐서플로우와 Flax 체크포인트는 영향을 받지 않으며, `from_pretrained`메서드에 `from_tf` 와 `from_flax` 키워드 가변 인자를 사용하여 이 문제를 우회할 수 있습니다. + + + +일반적으로 AutoTokenizer 클래스와 AutoModelFor 클래스를 사용하여 미리 학습된 모델 인스턴스를 로드하는 것이 좋습니다. 이렇게 하면 매번 올바른 아키텍처를 로드할 수 있습니다. 다음 [튜토리얼](preprocessing)에서는 새롭게 로드한 토크나이저, 이미지 프로세서, 특징 추출기를 사용하여 미세 튜닝용 데이터 세트를 전처리하는 방법에 대해 알아봅니다. + + +마지막으로 `TFAutoModelFor` 클래스를 사용하면 주어진 작업에 대해 사전 훈련된 모델을 로드할 수 있습니다. (사용 가능한 작업의 전체 목록은 [여기](model_doc/auto)를 참조하세요. 예를 들어, [`TFAutoModelForSequenceClassification.from_pretrained`]로 시퀀스 분류를 위한 모델을 로드합니다: + +```py +>>> from transformers import TFAutoModelForSequenceClassification + +>>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") +``` + +쉽게 동일한 체크포인트를 재사용하여 다른 작업에 아키텍처를 로드할 수 있습니다: + +```py +>>> from transformers import TFAutoModelForTokenClassification + +>>> model = TFAutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased") +``` + +일반적으로, `AutoTokenizer`클래스와 `TFAutoModelFor` 클래스를 사용하여 미리 학습된 모델 인스턴스를 로드하는 것이 좋습니다. 이렇게 하면 매번 올바른 아키텍처를 로드할 수 있습니다. 다음 [튜토리얼](preprocessing)에서는 새롭게 로드한 토크나이저, 이미지 프로세서, 특징 추출기를 사용하여 미세 튜닝용 데이터 세트를 전처리하는 방법에 대해 알아봅니다. + + diff --git a/docs/source/ko/bertology.md b/docs/source/ko/bertology.md new file mode 100644 index 00000000000000..7b4f3dc4c4939b --- /dev/null +++ b/docs/source/ko/bertology.md @@ -0,0 +1,41 @@ + + +# BERTology + +BERT와 같은 대규모 트랜스포머의 내부 동작을 조사하는 연구 분야가 점점 더 중요해지고 있습니다. +혹자는 "BERTology"라 칭하기도 합니다. 이 분야의 좋은 예시는 다음과 같습니다: + + +- BERT는 고전적인 NLP 파이프라인의 재발견 - Ian Tenney, Dipanjan Das, Ellie Pavlick: + https://arxiv.org/abs/1905.05950 +- 16개의 헤드가 정말로 1개보다 나은가? - Paul Michel, Omer Levy, Graham Neubig: + https://arxiv.org/abs/1905.10650 +- BERT는 무엇을 보는가? BERT의 어텐션 분석 - Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning: + https://arxiv.org/abs/1906.04341 +- CAT-probing: 프로그래밍 언어에 대해 사전훈련된 모델이 어떻게 코드 구조를 보는지 알아보기 위한 메트릭 기반 접근 방법: + https://arxiv.org/abs/2210.04633 + +우리는 이 새로운 연구 분야의 발전을 돕기 위해, BERT/GPT/GPT-2 모델에 내부 표현을 살펴볼 수 있는 몇 가지 기능을 추가했습니다. 
+이 기능들은 주로 Paul Michel의 훌륭한 작업을 참고하여 개발되었습니다 +(https://arxiv.org/abs/1905.10650): + + +- BERT/GPT/GPT-2의 모든 은닉 상태에 접근하기, +- BERT/GPT/GPT-2의 각 헤드의 모든 어텐션 가중치에 접근하기, +- 헤드의 출력 값과 그래디언트를 검색하여 헤드 중요도 점수를 계산하고 https://arxiv.org/abs/1905.10650에서 설명된 대로 헤드를 제거하는 기능을 제공합니다. + +이러한 기능들을 이해하고 직접 사용해볼 수 있도록 [bertology.py](https://github.com/huggingface/transformers/tree/main/examples/research_projects/bertology/run_bertology.py) 예제 스크립트를 추가했습니다. 이 예제 스크립트에서는 GLUE에 대해 사전훈련된 모델에서 정보를 추출하고 모델을 가지치기(prune)해봅니다. diff --git a/docs/source/ko/big_models.md b/docs/source/ko/big_models.md new file mode 100644 index 00000000000000..3180b51117a97b --- /dev/null +++ b/docs/source/ko/big_models.md @@ -0,0 +1,122 @@ + + +# 큰 모델 인스턴스화 [[instantiating-a-big-model]] + +매우 큰 사전훈련된 모델을 사용하려면, RAM 사용을 최소화해야 하는 과제가 있습니다. 일반적인 PyTorch 워크플로우는 다음과 같습니다: + +1. 무작위 가중치로 모델을 생성합니다. +2. 사전훈련된 가중치를 불러옵니다. +3. 사전훈련된 가중치를 무작위 모델에 적용합니다. + +1단계와 2단계 모두 모델의 전체 버전을 메모리에 적재해야 하며, 대부분 문제가 없지만 모델이 기가바이트급의 용량을 차지하기 시작하면 복사본 2개가 RAM을 초과하여 메모리 부족 이슈를 야기할 수 있습니다. 더 심각한 문제는 분산 학습을 위해 `torch.distributed`를 사용하는 경우, 프로세스마다 사전훈련된 모델을 로드하고 복사본을 2개씩 RAM에 저장한다는 것입니다. + + + +무작위로 생성된 모델은 "비어 있는" (즉 그때 메모리에 있던 것으로 이뤄진) 텐서로 초기화되며 메모리 공간을 차지합니다. 초기화된 모델/파라미터의 종류에 적합한 분포(예: 정규 분포)에 따른 무작위 초기화는 가능한 한 빠르게 하기 위해 초기화되지 않은 가중치에 대해 3단계 이후에만 수행됩니다! + + + +이 안내서에서는 Transformers가 이 문제를 해결하기 위해 제공하는 솔루션을 살펴봅니다. 주의할 점은 아직 활발히 개발 중인 분야이므로 여기서 설명하는 API가 앞으로 약간 변경될 수 있다는 것입니다. + +## 샤딩된 체크포인트 [[sharded-checkpoints]] + +4.18.0 버전 이후, 10GB 이상의 공간을 차지하는 모델 체크포인트는 자동으로 작은 조각들로 샤딩됩니다. `model.save_pretrained(save_dir)`를 실행할 때 하나의 단일 체크포인트를 가지게 될 대신, 여러 부분 체크포인트(각각의 크기는 10GB 미만)와 매개변수 이름을 해당 파일에 매핑하는 인덱스가 생성됩니다. + +`max_shard_size` 매개변수로 샤딩 전 최대 크기를 제어할 수 있으므로, 이 예제를 위해 샤드 크기가 작은 일반 크기의 모델을 사용하겠습니다: 전통적인 BERT 모델을 사용해 봅시다. + +```py +from transformers import AutoModel + +model = AutoModel.from_pretrained("google-bert/bert-base-cased") +``` + +[`~PreTrainedModel.save_pretrained`]을 사용하여 모델을 저장하면, 모델의 구성과 가중치가 들어있는 두 개의 파일이 있는 새 폴더가 생성됩니다: + +```py +>>> import os +>>> import tempfile + +>>> with tempfile.TemporaryDirectory() as tmp_dir: +... model.save_pretrained(tmp_dir) +... print(sorted(os.listdir(tmp_dir))) +['config.json', 'pytorch_model.bin'] +``` + +이제 최대 샤드 크기를 200MB로 사용해 봅시다: + +```py +>>> with tempfile.TemporaryDirectory() as tmp_dir: +... model.save_pretrained(tmp_dir, max_shard_size="200MB") +... print(sorted(os.listdir(tmp_dir))) +['config.json', 'pytorch_model-00001-of-00003.bin', 'pytorch_model-00002-of-00003.bin', 'pytorch_model-00003-of-00003.bin', 'pytorch_model.bin.index.json'] +``` + +모델의 구성에 더해, 세 개의 다른 가중치 파일과 파라미터 이름과 해당 파일의 매핑이 포함된 `index.json` 파일을 볼 수 있습니다. 이러한 체크포인트는 [`~PreTrainedModel.from_pretrained`] 메서드를 사용하여 완전히 다시 로드할 수 있습니다: + +```py +>>> with tempfile.TemporaryDirectory() as tmp_dir: +... model.save_pretrained(tmp_dir, max_shard_size="200MB") +... new_model = AutoModel.from_pretrained(tmp_dir) +``` + +큰 모델의 경우 이러한 방식으로 처리하는 주된 장점은 위에서 보여준 흐름의 2단계에서, 각 샤드가 이전 샤드 다음에 로드되므로 메모리 사용량이 모델 크기와 가장 큰 샤드의 크기를 초과하지 않는다는 점입니다. + +이 인덱스 파일은 키가 체크포인트에 있는지, 그리고 해당 가중치가 어디에 저장되어 있는지를 결정하는 데 사용됩니다. 이 인덱스를 json과 같이 로드하고 딕셔너리를 얻을 수 있습니다: + +```py +>>> import json + +>>> with tempfile.TemporaryDirectory() as tmp_dir: +... model.save_pretrained(tmp_dir, max_shard_size="200MB") +... with open(os.path.join(tmp_dir, "pytorch_model.bin.index.json"), "r") as f: +... index = json.load(f) + +>>> print(index.keys()) +dict_keys(['metadata', 'weight_map']) +``` + +메타데이터는 현재 모델의 총 크기만 포함됩니다. 
앞으로 다른 정보를 추가할 계획입니다: + +```py +>>> index["metadata"] +{'total_size': 433245184} +``` + +가중치 맵은 이 인덱스의 주요 부분으로, 각 매개변수 이름(PyTorch 모델 `state_dict`에서 보통 찾을 수 있는)을 해당 파일에 매핑합니다: + +```py +>>> index["weight_map"] +{'embeddings.LayerNorm.bias': 'pytorch_model-00001-of-00003.bin', + 'embeddings.LayerNorm.weight': 'pytorch_model-00001-of-00003.bin', + ... +``` + +만약 [`~PreTrainedModel.from_pretrained`]를 사용하지 않고 모델 내에서 이러한 샤딩된 체크포인트를 직접 가져오려면 (전체 체크포인트를 위해 `model.load_state_dict()`를 수행하는 것처럼), [`~modeling_utils.load_sharded_checkpoint`]를 사용해야 합니다. + +```py +>>> from transformers.modeling_utils import load_sharded_checkpoint + +>>> with tempfile.TemporaryDirectory() as tmp_dir: +... model.save_pretrained(tmp_dir, max_shard_size="200MB") +... load_sharded_checkpoint(model, tmp_dir) +``` + +## 저(低)메모리 로딩 [[low-memory-loading]] + +샤딩된 체크포인트는 위에서 언급한 작업 흐름의 2단계에서 메모리 사용량을 줄이지만, 저(低)메모리 설정에서 모델을 사용하기 위해 우리의 Accelerate 라이브러리를 기반으로 한 도구를 활용하는 것이 좋습니다. + +자세한 사항은 다음 가이드를 참조해주세요: [Accelerate로 대규모 모델 가져오기 (영문)](../en/main_classes/model#large-model-loading) \ No newline at end of file diff --git a/docs/source/ko/community.md b/docs/source/ko/community.md new file mode 100644 index 00000000000000..d50168d7548620 --- /dev/null +++ b/docs/source/ko/community.md @@ -0,0 +1,69 @@ + + +# 커뮤니티 [[community]] + +이 페이지는 커뮤니티에서 개발한 🤗 Transformers 리소스를 재구성한 페이지입니다. + +## 커뮤니티 리소스: [[community-resources]] + +| 리소스 | 설명 | 만든이 | +|:----------|:-------------|------:| +| [Hugging Face Transformers 용어집 플래시카드](https://www.darigovresearch.com/huggingface-transformers-glossary-flashcards) | [Transformers 문서 용어집](glossary)을 기반으로 한 플래시카드 세트로, 지식을 장기적으로 유지하기 위해 특별히 설계된 오픈소스 크로스 플랫폼 앱인 [Anki](https://apps.ankiweb.net/)를 사용하여 쉽게 학습/수정할 수 있는 형태로 제작되었습니다. [플래시카드 사용법에 대한 소개 동영상](https://www.youtube.com/watch?v=Dji_h7PILrw)을 참조하세요. | [Darigov 리서치](https://www.darigovresearch.com/) | + +## 커뮤니티 노트북: [[community-notebooks]] + +| 노트북 | 설명 | 만든이 | | +|:----------|:-------------|:-------------|------:| +| [가사를 생성하기 위해 사전훈련된 트랜스포머를 미세 조정하기](https://github.com/AlekseyKorshuk/huggingartists) | GPT-2 모델을 미세 조정하여 좋아하는 아티스트의 스타일로 가사를 생성하는 방법 | [Aleksey Korshuk](https://github.com/AlekseyKorshuk) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AlekseyKorshuk/huggingartists/blob/master/huggingartists-demo.ipynb) | +| [Tensorflow 2로 T5 훈련하기](https://github.com/snapthat/TF-T5-text-to-text) | Tensorflow 2를 사용하여 T5를 훈련시키는 방법. 이 노트북은 Tensorflow 2로 SQUAD를 사용하여 구현한 질의응답 작업을 보여줍니다. 
| [Muhammad Harris](https://github.com/HarrisDePerceptron) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/snapthat/TF-T5-text-to-text/blob/master/snapthatT5/notebooks/TF-T5-Datasets%20Training.ipynb) | +| [TPU에서 T5 훈련하기](https://github.com/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb) | Transformers와 Nlp를 사용하여 SQUAD로 T5를 훈련하는 방법 | [Suraj Patil](https://github.com/patil-suraj) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/T5_on_TPU.ipynb#scrollTo=QLGiFCDqvuil) | +| [분류 및 객관식 문제를 위해 T5 미세 조정하기](https://github.com/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb) | 분류 및 객관식 문제에 맞게 텍스트-텍스트 형식을 사용하여 PyTorch Lightning으로 T5를 미세 조정하는 방법 | [Suraj Patil](https://github.com/patil-suraj) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb) | +| [새로운 데이터 세트와 언어로 DialoGPT 미세 조정하기](https://github.com/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb) | 자유 대화형 챗봇을 만들기 위해 새로운 데이터 세트로 DialoGPT 모델을 미세 조정하는 방법 | [Nathan Cooper](https://github.com/ncoop57) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ncoop57/i-am-a-nerd/blob/master/_notebooks/2020-05-12-chatbot-part-1.ipynb) | +| [Reformer로 긴 시퀀스 모델링하기](https://github.com/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb) | Reformer로 최대 50만 토큰의 시퀀스를 훈련하는 방법 | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb) | +| [요약을 위해 BART 미세 조정하기](https://github.com/ohmeow/ohmeow_website/blob/master/posts/2021-05-25-mbart-sequence-classification-with-blurr.ipynb) | blurr를 사용하여 fastai로 요약하기 위해 BART를 미세 조정하는 방법 | [Wayde Gilliam](https://ohmeow.com/) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ohmeow/ohmeow_website/blob/master/posts/2021-05-25-mbart-sequence-classification-with-blurr.ipynb) | +| [다른 사람의 트윗으로 사전훈련된 트랜스포머 미세 조정하기](https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb) | GPT-2 모델을 미세 조정하여 좋아하는 트위터 계정 스타일로 트윗을 생성하는 방법 | [Boris Dayma](https://github.com/borisdayma) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/borisdayma/huggingtweets/blob/master/huggingtweets-demo.ipynb) | +| [Weights & Biases로 🤗 Hugging Face 모델 최적화하기](https://colab.research.google.com/github/wandb/examples/blob/master/colabs/huggingface/Optimize_Hugging_Face_models_with_Weights_%26_Biases.ipynb) | W&B와 Hugging Face의 통합을 보여주는 전체 튜토리얼 | [Boris Dayma](https://github.com/borisdayma) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/wandb/examples/blob/master/colabs/huggingface/Optimize_Hugging_Face_models_with_Weights_%26_Biases.ipynb) | +| [Longformer 사전훈련하기](https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb) | 기존 사전훈련된 모델의 "긴" 버전을 빌드하는 방법 | [Iz Beltagy](https://beltagy.net) | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb) | +| [QA를 위해 Longformer 미세 조정하기](https://github.com/patil-suraj/Notebooks/blob/master/longformer_qa_training.ipynb) | QA 작업을 위해 Longformer를 미세 조정하는 방법 | [Suraj Patil](https://github.com/patil-suraj) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patil-suraj/Notebooks/blob/master/longformer_qa_training.ipynb) | +| [🤗 Nlp로 모델 평가하기](https://github.com/patrickvonplaten/notebooks/blob/master/How_to_evaluate_Longformer_on_TriviaQA_using_NLP.ipynb) | `Nlp`로 TriviaQA에서 Longformer를 평가하는 방법 | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1m7eTGlPmLRgoPkkA7rkhQdZ9ydpmsdLE?usp=sharing) | +| [감정 범위 추출을 위해 T5 미세 조정하기](https://github.com/enzoampil/t5-intro/blob/master/t5_qa_training_pytorch_span_extraction.ipynb) | 감정 범위 추출을 위해 텍스트-텍스트 형식을 사용하여 PyTorch Lightning으로 T5를 미세 조정하는 방법 | [Lorenzo Ampil](https://github.com/enzoampil) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/enzoampil/t5-intro/blob/master/t5_qa_training_pytorch_span_extraction.ipynb) | +| [다중 클래스 분류를 위해 DistilBert 미세 조정하기](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb) | 다중 클래스 분류를 위해 PyTorch를 사용하여 DistilBert를 미세 조정하는 방법 | [Abhishek Kumar Mishra](https://github.com/abhimishra91) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb)| +| [다중 레이블 분류를 위해 BERT 미세 조정하기](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb) | 다중 레이블 분류를 위해 PyTorch를 사용하여 BERT를 미세 조정하는 방법 | [Abhishek Kumar Mishra](https://github.com/abhimishra91) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_multi_label_classification.ipynb)| +| [요약을 위해 T5 미세 조정하기](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb) | 요약을 위해 PyTorch로 T5를 미세 조정하고 WandB로 실험을 추적하는 방법 | [Abhishek Kumar Mishra](https://github.com/abhimishra91) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb)| +| [동적 패딩/버켓팅으로 Transformers 미세 조정 속도 높이기](https://github.com/ELS-RD/transformers-notebook/blob/master/Divide_Hugging_Face_Transformers_training_time_by_2_or_more.ipynb)| 동적 패딩/버켓팅을 사용하여 미세 조정 속도를 2배로 높이는 방법 |[Michael Benesty](https://github.com/pommedeterresautee) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1CBfRU1zbfu7-ijiOqAAQUA-RJaxfcJoO?usp=sharing)| +|[마스킹된 언어 모델링을 위해 Reformer 사전훈련하기](https://github.com/patrickvonplaten/notebooks/blob/master/Reformer_For_Masked_LM.ipynb)| 양방향 셀프 어텐션 레이어를 이용해서 Reformer 모델을 훈련하는 방법 | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1tzzh0i8PgDQGV3SMFUGxM7_gGae3K-uW?usp=sharing)| +| [Sci-BERT 확장 및 미세 조정하기](https://github.com/lordtt13/word-embeddings/blob/master/COVID-19%20Research%20Data/COVID-SciBERT.ipynb)| CORD 데이터 세트로 AllenAI에서 사전훈련된 SciBERT 모델의 어휘를 늘리고 파이프라인을 구축하는 방법 | [Tanmay Thakur](https://github.com/lordtt13) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1rqAR40goxbAfez1xvF3hBJphSCsvXmh8)| +| [요약을 위해 Trainer API로 BlenderBotSmall 미세 조정하기](https://github.com/lordtt13/transformers-experiments/blob/master/Custom%20Tasks/fine-tune-blenderbot_small-for-summarization.ipynb)| 요약을 위해 Trainer API를 사용하여 사용자 지정 데이터 세트로 BlenderBotSmall 미세 조정하기 | [Tanmay Thakur](https://github.com/lordtt13) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/19Wmupuls7mykSGyRN_Qo6lPQhgp56ymq?usp=sharing)| +| [통합 기울기(Integrated Gradient)를 이용하여 Electra 미세 조정하고 해석하기](https://github.com/elsanns/xai-nlp-notebooks/blob/master/electra_fine_tune_interpret_captum_ig.ipynb) | 감정 분석을 위해 Electra를 미세 조정하고 Captum 통합 기울기로 예측을 해석하는 방법 | [Eliza Szczechla](https://elsanns.github.io) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elsanns/xai-nlp-notebooks/blob/master/electra_fine_tune_interpret_captum_ig.ipynb)| +| [Trainer 클래스로 비영어권 GPT-2 모델 미세 조정하기](https://github.com/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb) | Trainer 클래스로 비영어권 GPT-2 모델을 미세 조정하는 방법 | [Philipp Schmid](https://www.philschmid.de) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb)| +|[다중 라벨 분류 작업을 위해 DistilBERT 모델 미세 조정하기](https://github.com/DhavalTaunk08/Transformers_scripts/blob/master/Transformers_multilabel_distilbert.ipynb) | 다중 라벨 분류 작업을 위해 DistilBERT 모델을 미세 조정하는 방법 | [Dhaval Taunk](https://github.com/DhavalTaunk08) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhavalTaunk08/Transformers_scripts/blob/master/Transformers_multilabel_distilbert.ipynb)| +|[문장쌍 분류를 위해 ALBERT 미세 조정하기](https://github.com/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb) | 문장쌍 분류 작업을 위해 ALBERT 모델 또는 다른 BERT 기반 모델을 미세 조정하는 방법 | [Nadir El Manouzi](https://github.com/NadirEM) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb)| +|[감정 분석을 위해 Roberta 미세 조정하기](https://github.com/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb) | 감정 분석을 위해 Roberta 모델을 미세 조정하는 방법 | [Dhaval Taunk](https://github.com/DhavalTaunk08) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb)| +|[질문 생성 모델 평가하기](https://github.com/flexudy-pipe/qugeev) | seq2seq 트랜스포머 모델이 생성한 질문과 이에 대한 답변이 얼마나 정확한가요? 
| [Pascal Zoleko](https://github.com/zolekode) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bpsSqCQU-iw_5nNoRm_crPq6FRuJthq_?usp=sharing)|
+|[DistilBERT와 Tensorflow로 텍스트 분류하기](https://github.com/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb) | 텍스트 분류를 위해 TensorFlow로 DistilBERT를 미세 조정하는 방법 | [Peter Bayerle](https://github.com/peterbayerle) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/peterbayerle/huggingface_notebook/blob/main/distilbert_tf.ipynb)|
+|[CNN/Dailymail 요약을 위해 인코더-디코더 모델에 BERT 활용하기](https://github.com/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb) | CNN/Dailymail 요약을 위해 *google-bert/bert-base-uncased* 체크포인트를 활용하여 *EncoderDecoderModel*을 워밍업하는 방법 | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/BERT2BERT_for_CNN_Dailymail.ipynb)|
+|[BBC XSum 요약을 위해 인코더-디코더 모델에 RoBERTa 활용하기](https://github.com/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb) | BBC/XSum 요약을 위해 *FacebookAI/roberta-base* 체크포인트를 활용하여 공유 *EncoderDecoderModel*을 워밍업하는 방법 | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb)|
+|[순차적 질문 답변(SQA)을 위해 TAPAS 미세 조정하기](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb) | *tapas-base* 체크포인트를 활용하여 순차적 질문 답변(SQA) 데이터 세트로 *TapasForQuestionAnswering*을 미세 조정하는 방법 | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Fine_tuning_TapasForQuestionAnswering_on_SQA.ipynb)|
+|[표 사실 검사(TabFact)로 TAPAS 평가하기](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Evaluating_TAPAS_on_the_Tabfact_test_set.ipynb) | 🤗 Datasets와 🤗 Transformers 라이브러리를 함께 사용하여 *tapas-base-finetuned-tabfact* 체크포인트로 미세 조정된 *TapasForSequenceClassification*을 평가하는 방법 | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/TAPAS/Evaluating_TAPAS_on_the_Tabfact_test_set.ipynb)|
+|[번역을 위해 mBART 미세 조정하기](https://colab.research.google.com/github/vasudevgupta7/huggingface-tutorials/blob/main/translation_training.ipynb) | 힌디어에서 영어로 번역하기 위해 Seq2SeqTrainer를 사용하여 mBART를 미세 조정하는 방법 | [Vasudev Gupta](https://github.com/vasudevgupta7) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vasudevgupta7/huggingface-tutorials/blob/main/translation_training.ipynb)|
+|[FUNSD(양식 이해 데이터 세트)로 LayoutLM 미세 조정하기](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb) | 스캔한 문서에서 정보 추출을 위해 FUNSD 데이터 세트로 *LayoutLMForTokenClassification*을 미세 조정하는 방법 | [Niels Rogge](https://github.com/nielsrogge) | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb)| +|[DistilGPT2 미세 조정하고 및 텍스트 생성하기](https://colab.research.google.com/github/tripathiaakash/DistilGPT2-Tutorial/blob/main/distilgpt2_fine_tuning.ipynb) | DistilGPT2를 미세 조정하고 텍스트를 생성하는 방법 | [Aakash Tripathi](https://github.com/tripathiaakash) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tripathiaakash/DistilGPT2-Tutorial/blob/main/distilgpt2_fine_tuning.ipynb)| +|[최대 8K 토큰에서 LED 미세 조정하기](https://github.com/patrickvonplaten/notebooks/blob/master/Fine_tune_Longformer_Encoder_Decoder_(LED)_for_Summarization_on_pubmed.ipynb) | 긴 범위를 요약하기 위해 PubMed로 LED를 미세 조정하는 방법 | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Fine_tune_Longformer_Encoder_Decoder_(LED)_for_Summarization_on_pubmed.ipynb)| +|[Arxiv로 LED 평가하기](https://github.com/patrickvonplaten/notebooks/blob/master/LED_on_Arxiv.ipynb) | 긴 범위 요약에 대해 LED를 효과적으로 평가하는 방법 | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/LED_on_Arxiv.ipynb)| +|[RVL-CDIP(문서 이미지 분류 데이터 세트)로 LayoutLM 미세 조정하기)](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForSequenceClassification_on_RVL_CDIP.ipynb) | 스캔 문서 분류를 위해 RVL-CDIP 데이터 세트로 *LayoutLMForSequenceClassification*을 미세 조정하는 방법 | [Niels Rogge](https://github.com/nielsrogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Fine_tuning_LayoutLMForSequenceClassification_on_RVL_CDIP.ipynb)| +|[GPT2 조정을 통한 Wav2Vec2 CTC 디코딩](https://github.com/voidful/huggingface_notebook/blob/main/xlsr_gpt.ipynb) | 언어 모델 조정을 통해 CTC 시퀀스를 디코딩하는 방법 | [Eric Lam](https://github.com/voidful) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1e_z5jQHYbO2YKEaUgzb1ww1WwiAyydAj?usp=sharing)| +|[Trainer 클래스로 두 개 언어로 요약하기 위해 BART 미세 조정하기](https://github.com/elsanns/xai-nlp-notebooks/blob/master/fine_tune_bart_summarization_two_langs.ipynb) | Trainer 클래스로 두 개 언어로 요약하기 위해 BART 미세 조정하는 방법 | [Eliza Szczechla](https://github.com/elsanns) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elsanns/xai-nlp-notebooks/blob/master/fine_tune_bart_summarization_two_langs.ipynb)| +|[Trivia QA로 Big Bird 평가하기](https://github.com/patrickvonplaten/notebooks/blob/master/Evaluating_Big_Bird_on_TriviaQA.ipynb) | Trivia QA로 긴 문서 질문에 대한 답변에 대해 BigBird를 평가하는 방법 | [Patrick von Platen](https://github.com/patrickvonplaten) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Evaluating_Big_Bird_on_TriviaQA.ipynb)| +| [Wav2Vec2를 사용하여 동영상 캡션 만들기](https://github.com/Muennighoff/ytclipcc/blob/main/wav2vec_youtube_captions.ipynb) | Wav2Vec으로 오디오를 텍스트로 변환하여 모든 동영상에서 YouTube 캡션 만드는 방법 | [Niklas Muennighoff](https://github.com/Muennighoff) |[![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Muennighoff/ytclipcc/blob/main/wav2vec_youtube_captions.ipynb) |
+| [PyTorch Lightning을 사용하여 CIFAR-10으로 비전 트랜스포머 미세 조정하기](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/VisionTransformer/Fine_tuning_the_Vision_Transformer_on_CIFAR_10_with_PyTorch_Lightning.ipynb) | HuggingFace Transformers, Datasets, PyTorch Lightning을 사용하여 CIFAR-10으로 비전 트랜스포머(ViT)를 미세 조정하는 방법 | [Niels Rogge](https://github.com/nielsrogge) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/VisionTransformer/Fine_tuning_the_Vision_Transformer_on_CIFAR_10_with_PyTorch_Lightning.ipynb) |
+| [🤗 Trainer를 사용하여 CIFAR-10에서 비전 트랜스포머 미세 조정하기](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/VisionTransformer/Fine_tuning_the_Vision_Transformer_on_CIFAR_10_with_the_%F0%9F%A4%97_Trainer.ipynb) | Datasets, 🤗 Trainer를 사용하여 CIFAR-10에서 비전 트랜스포머(ViT)를 미세 조정하는 방법 | [Niels Rogge](https://github.com/nielsrogge) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/VisionTransformer/Fine_tuning_the_Vision_Transformer_on_CIFAR_10_with_the_%F0%9F%A4%97_Trainer.ipynb) |
+| [개체 유형(entity typing) 데이터 세트인 Open Entity로 LUKE 평가하기](https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_open_entity.ipynb) | Open Entity 데이터 세트로 *LukeForEntityClassification*을 평가하는 방법 | [Ikuya Yamada](https://github.com/ikuyamada) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/studio-ousia/luke/blob/master/notebooks/huggingface_open_entity.ipynb) |
+| [관계 추출 데이터 세트인 TACRED로 LUKE 평가하기](https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_tacred.ipynb) | TACRED 데이터 세트로 *LukeForEntityPairClassification*을 평가하는 방법 | [Ikuya Yamada](https://github.com/ikuyamada) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/studio-ousia/luke/blob/master/notebooks/huggingface_tacred.ipynb) |
+| [중요 NER 벤치마크인 CoNLL-2003으로 LUKE 평가하기](https://github.com/studio-ousia/luke/blob/master/notebooks/huggingface_conll_2003.ipynb) | CoNLL-2003 데이터 세트로 *LukeForEntitySpanClassification*을 평가하는 방법 | [Ikuya Yamada](https://github.com/ikuyamada) |[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/studio-ousia/luke/blob/master/notebooks/huggingface_conll_2003.ipynb) |
+| [PubMed 데이터 세트로 BigBird-Pegasus 평가하기](https://github.com/vasudevgupta7/bigbird/blob/main/notebooks/bigbird_pegasus_evaluation.ipynb) | PubMed 데이터 세트로 *BigBirdPegasusForConditionalGeneration*을 평가하는 방법 | [Vasudev Gupta](https://github.com/vasudevgupta7) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vasudevgupta7/bigbird/blob/main/notebooks/bigbird_pegasus_evaluation.ipynb) |
+| [Wav2Vec2를 사용해서 음성 감정 분류하기](https://github.com/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb) | 감정 분류를 위해 사전훈련된 Wav2Vec2 모델을 MEGA 데이터 세트에 활용하는 방법 | [Mehrdad Farahani](https://github.com/m3hrdadfi) | [![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb) | +| [DETR로 이미지에서 객체 탐지하기](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DETR/DETR_minimal_example_(with_DetrFeatureExtractor).ipynb) | 훈련된 *DetrForObjectDetection* 모델을 사용하여 이미지에서 객체를 탐지하고 어텐션을 시각화하는 방법 | [Niels Rogge](https://github.com/NielsRogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/DETR/DETR_minimal_example_(with_DetrFeatureExtractor).ipynb) | +| [사용자 지정 객체 탐지 데이터 세트로 DETR 미세 조정하기](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DETR/Fine_tuning_DetrForObjectDetection_on_custom_dataset_(balloon).ipynb) | 사용자 지정 객체 탐지 데이터 세트로 *DetrForObjectDetection*을 미세 조정하는 방법 | [Niels Rogge](https://github.com/NielsRogge) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/DETR/Fine_tuning_DetrForObjectDetection_on_custom_dataset_(balloon).ipynb) | +| [개체명 인식을 위해 T5 미세 조정하기](https://github.com/ToluClassics/Notebooks/blob/main/T5_Ner_Finetuning.ipynb) | 개체명 인식 작업을 위해 *T5*를 미세 조정하는 방법 | [Ogundepo Odunayo](https://github.com/ToluClassics) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1obr78FY_cBmWY5ODViCmzdY6O1KB65Vc?usp=sharing) | diff --git a/docs/source/ko/contributing.md b/docs/source/ko/contributing.md new file mode 100644 index 00000000000000..56e51b326644f2 --- /dev/null +++ b/docs/source/ko/contributing.md @@ -0,0 +1,332 @@ + + +# 🤗 Transformers에 기여하기 [[contribute-to-transformers]] + +누구나 🤗 Transformers에 기여할 수 있으며, 우리는 모든 사람의 기여를 소중히 생각합니다. 코드 기여는 커뮤니티를 돕는 유일한 방법이 아닙니다. 질문에 답하거나 다른 사람을 도와 문서를 개선하는 것도 매우 가치가 있습니다. + +🤗 Transformers를 널리 알리는 것도 큰 도움이 됩니다! 멋진 프로젝트들을 가능하게 한 🤗 Transformers 라이브러리에 대해 블로그 게시글에 언급하거나, 도움이 되었을 때마다 Twitter에 알리거나, 저장소에 ⭐️ 를 표시하여 감사 인사를 전해주세요. + +어떤 방식으로 기여하든 [행동 규칙](https://github.com/huggingface/transformers/blob/main/CODE_OF_CONDUCT.md)을 숙지하고 존중해주세요. + +**이 안내서는 멋진 [scikit-learn 기여 안내서](https://github.com/scikit-learn/scikit-learn/blob/main/CONTRIBUTING.md)에서 큰 영감을 받았습니다.** + +## 기여하는 방법 [[ways-to-contribute]] + +여러 가지 방법으로 🤗 Transformers에 기여할 수 있습니다: + +* 기존 코드의 미해결된 문제를 수정합니다. +* 버그 또는 새로 추가되길 원하는 기능과 관련된 이슈를 제출합니다. +* 새로운 모델을 구현합니다. +* 예제나 문서에 기여합니다. + +어디서부터 시작할지 모르겠다면, [Good First Issue](https://github.com/huggingface/transformers/contribute) 목록을 확인해보세요. 이 목록은 초보자도 참여하기 쉬운 오픈 이슈 목록을 제공하며, 당신이 오픈소스에 처음으로 기여하는 데 큰 도움이 될 것입니다. 그저 작업하고 싶은 이슈에 댓글만 달아주면 됩니다. + +조금 더 도전적인 작업을 원한다면, [Good Second Issue](https://github.com/huggingface/transformers/labels/Good%20Second%20Issue) 목록도 확인해보세요. 이미 당신이 잘 하고 있다고 생각되더라도, 한 번 시도해보세요! 우리도 여러분을 도울 것입니다. 🚀 + +> 커뮤니티에 이루어지는 모든 기여는 똑같이 소중합니다. 🥰 + +## 미해결된 문제 수정하기 [[fixing-outstanding-issues]] + +기존 코드에서 발견한 문제점에 대한 해결책이 떠오른 경우, 언제든지 [기여를 시작](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md/#create-a-pull-request)하고 Pull Request를 생성해주세요! + +## 버그 관련 이슈를 제기하거나 새로운 기능 요청하기 [[submitting-a-bugrelated-issue-or-feature-request]] + +버그 관련 이슈를 제기하거나 새로운 기능을 요청할 때는 다음 가이드라인을 최대한 준수해주세요. 이렇게 하면 좋은 피드백과 함께 빠르게 답변해 드릴 수 있습니다. + +### 버그를 발견하셨나요? [[did-you-find-a-bug]] + +🤗 Transformers 라이브러리는 사용 중에 겪는 문제를 보고해주는 사용자들 덕분에 더욱 견고해지고 신뢰할 수 있게 되었습니다. + +이슈를 보고하기 전에, 버그가 이미 **보고되지 않았는지** 확인해주세요. 
(GitHub의 이슈 탭 아래의 검색 바를 사용하세요). 이슈는 라이브러리 자체에서 발생한 버그어야 하며, 코드의 다른 부분과 관련된 것이 아니어야 합니다. 버그가 라이브러리의 문제로 발생하였는지 확실하지 않은 경우 먼저 [포럼](https://discuss.huggingface.co/)에서 질문해 주세요. 이렇게 하면 일반적인 질문보다 라이브러리와 관련된 문제를 더 빠르게 해결할 수 있습니다. + +버그가 이미 보고되지 않았다는 것을 확인했다면, 다음 정보를 포함하여 이슈를 제출해 주세요. 그러면 우리가 빠르게 해결할 수 있습니다: + +* 사용 중인 **운영체제 종류와 버전**, 그리고 **Python**, **PyTorch** 또는 **TensorFlow** 버전. +* 버그를 30초 이내로 재현할 수 있는 간단하고 독립적인 코드 스니펫. +* 예외가 발생한 경우 *전체* 트레이스백. +* 스크린샷과 같이 도움이 될 것으로 생각되는 추가 정보를 첨부해 주세요. + +운영체제와 소프트웨어 버전을 자동으로 가져오려면 다음 명령을 실행하세요: + +```bash +transformers-cli env +``` + +저장소의 루트 디렉터리에서도 같은 명령을 실행할 수 있습니다: + +```bash +python src/transformers/commands/transformers_cli.py env +``` + + +### 새로운 기능을 원하시나요? [[do-you-want-a-new-feature]] + +🤗 Transformers에서 사용하고 싶은 새로운 기능이 있다면, 다음 내용을 포함하여 이슈를 제출해 주세요: + +1. 이 기능이 필요한 *이유*는 무엇인가요? 라이브러리에 대한 문제나 불만과 관련이 있나요? 프로젝트에 필요한 기능인가요? 커뮤니티에 도움이 될 만한 기능인가요? + + 어떤 내용이든 여러분의 이야기를 듣고 싶습니다! + +2. 요청하는 기능을 최대한 자세히 설명해 주세요. 더 많은 정보를 제공할수록 더 나은 도움을 드릴 수 있습니다. +3. 해당 기능의 사용법을 보여주는 *코드 스니펫*을 제공해 주세요. +4. 기능과 관련된 논문이 있는 경우 링크를 포함해 주세요. + +이슈가 잘 작성되었다면 이슈가 생성된 순간, 이미 80% 정도의 작업이 완료된 것입니다. + +이슈를 제기하는 데 도움이 될 만한 [템플릿](https://github.com/huggingface/transformers/tree/main/templates)도 준비되어 있습니다. + +## 새로운 모델을 구현하고 싶으신가요? [[do-you-want-to-implement-a-new-model]] + +새로운 모델은 계속해서 출시됩니다. 만약 여러분이 새로운 모델을 구현하고 싶다면 다음 정보를 제공해 주세요: + +* 모델에 대한 간단한 설명과 논문 링크. +* 구현이 공개되어 있다면 구현 링크. +* 모델 가중치가 사용 가능하다면 가중치 링크. + +만약 모델을 직접 기여하고 싶으시다면, 알려주세요. 🤗 Transformers에 추가할 수 있도록 도와드리겠습니다! + +새로운 모델을 추가하는 방법에 대한 [상세 안내서와 템플릿](https://github.com/huggingface/transformers/tree/main/templates)을 제공하고 있으며, [🤗 Transformers에 새로운 모델을 추가하는 방법](https://huggingface.co/docs/transformers/add_new_model)에 대한 기술적인 안내서도 있습니다. + +## 문서를 추가하고 싶으신가요? [[do-you-want-to-add-documentation]] + +우리는 언제나 더 명확하고 정확한 문서를 제공하기 위하여 개선점을 찾고 있습니다. 오탈자나 부족한 내용, 분명하지 않거나 부정확한 내용 등을 알려주시면 개선하는 데 도움이 됩니다. 관심이 있으시다면 변경하거나 기여하실 수 있도록 도와드리겠습니다! + +문서를 생성, 빌드 및 작성하는 방법에 대한 자세한 내용은 [README](https://github.com/huggingface/transformers/tree/main/docs) 문서를 확인해 주세요. + +## 풀 리퀘스트(Pull Request) 생성하기 [[create-a-pull-request]] + +코드를 작성하기 전에 기존의 Pull Request나 이슈를 검색하여 누군가 이미 동일한 작업을 하고 있는지 확인하는 것이 좋습니다. 확실하지 않다면 피드백을 받기 위해 이슈를 열어보는 것이 좋습니다. + +🤗 Transformers에 기여하기 위해서는 기본적인 `git` 사용 능력이 필요합니다. `git`은 사용하기 쉬운 도구는 아니지만, 매우 훌륭한 매뉴얼을 제공합니다. 쉘(shell)에서 `git --help`을 입력하여 확인해보세요! 만약 책을 선호한다면, [Pro Git](https://git-scm.com/book/en/v2)은 매우 좋은 참고 자료가 될 것입니다. + +🤗 Transformers에 기여하려면 **[Python 3.8](https://github.com/huggingface/transformers/blob/main/setup.py#L426)** 이상의 버전이 필요합니다. 기여를 시작하려면 다음 단계를 따르세요: + +1. 저장소 페이지에서 **[Fork](https://github.com/huggingface/transformers/fork)** 버튼을 클릭하여 저장소를 포크하세요. 이렇게 하면 코드의 복사본이 여러분의 GitHub 사용자 계정 아래에 생성됩니다. + +2. 포크한 저장소를 로컬 디스크로 클론하고, 기본 저장소를 원격(remote)으로 추가하세요: + + ```bash + git clone git@github.com:/transformers.git + cd transformers + git remote add upstream https://github.com/huggingface/transformers.git + ``` + +3. 개발 변경 사항을 저장할 새 브랜치를 생성하세요: + + ```bash + git checkout -b a-descriptive-name-for-my-changes + ``` + + 🚨 절대 `main` 브랜치에서 작업하지 **마세요!** + +4. 가상 환경에서 다음 명령을 실행하여 개발 환경을 설정하세요: + + ```bash + pip install -e ".[dev]" + ``` + + 만약 이미 가상 환경에 🤗 Transformers가 설치되어 있다면, `-e` 플래그를 사용하여 설치하기 전에 `pip uninstall transformers`로 제거해주세요. + + 여러분의 운영체제에 따라서, 그리고 🤗 Transformers의 선택적 의존성의 수가 증가하면서, 이 명령이 실패할 수도 있습니다. 그럴 경우 사용하려는 딥러닝 프레임워크(PyTorch, TensorFlow, 그리고/또는 Flax)를 설치한 후 아래 명령을 실행해주세요: + + ```bash + pip install -e ".[quality]" + ``` + + 대부분의 경우 이것으로 충분할 것입니다. + +5. 브랜치에서 기능을 개발하세요. 
+ + 코드를 작업하는 동안 테스트 스위트(test suite)가 통과하는지 확인하세요. 다음과 같이 변경 사항에 영향을 받는 테스트를 실행하세요: + + ```bash + pytest tests/.py + ``` + + 테스트에 대한 더 많은 정보는 [테스트](https://huggingface.co/docs/transformers/testing) 가이드를 확인하세요. + + 🤗 Transformers는 `black`과 `ruff`를 사용하여 소스 코드의 형식을 일관되게 유지합니다. 변경 사항을 적용한 후에는 다음 명령으로 자동으로 스타일 교정 및 코드 검증을 수행하세요: + + ```bash + make fixup + ``` + + 이것은 또한 작업 중인 PR에서 수정한 파일에서만 작동하도록 최적화되어 있습니다. + + 검사를 하나씩 실행하려는 경우, 다음 명령으로 스타일 교정을 적용할 수 있습니다: + + ```bash + make style + ``` + + 🤗 Transformers는 또한 `ruff`와 몇 가지 사용자 정의 스크립트를 사용하여 코딩 실수를 확인합니다. CI를 통해 품질 관리가 수행되지만, 다음 명령으로 동일한 검사를 실행할 수 있습니다: + + ```bash + make quality + ``` + + 마지막으로, 새 모델을 추가할 때 일부 파일을 업데이트하는 것을 잊지 않도록 하기 위한 많은 스크립트가 있습니다. 다음 명령으로 이러한 스크립트를 실행할 수 있습니다: + + ```bash + make repo-consistency + ``` + + 이러한 검사에 대해 자세히 알아보고 관련 문제를 해결하는 방법은 [Pull Request에 대한 검사](https://huggingface.co/docs/transformers/pr_checks) 가이드를 확인하세요. + + 만약 `docs/source` 디렉터리 아래의 문서를 수정하는 경우, 문서가 빌드될 수 있는지 확인하세요. 이 검사는 Pull Request를 열 때도 CI에서 실행됩니다. 로컬 검사를 실행하려면 문서 빌더를 설치해야 합니다: + + ```bash + pip install ".[docs]" + ``` + + 저장소의 루트 디렉터리에서 다음 명령을 실행하세요: + + ```bash + doc-builder build transformers docs/source/en --build_dir ~/tmp/test-build + ``` + + 이 명령은 `~/tmp/test-build` 폴더에 문서를 빌드하며, 생성된 Markdown 파일을 선호하는 편집기로 확인할 수 있습니다. Pull Request를 열 때 GitHub에서 문서를 미리 볼 수도 있습니다. + + 변경 사항에 만족하면 `git add`로 변경된 파일을 추가하고, `git commit`으로 변경 사항을 로컬에 기록하세요: + + ```bash + git add modified_file.py + git commit + ``` + + [좋은 커밋 메시지](https://chris.beams.io/posts/git-commit/)를 작성하여 변경 사항을 명확하게 전달하세요! + + 변경 사항을 프로젝트 원본 저장소와 동기화하려면, PR을 *열기 전에* 브랜치를 `upstream/branch`로 리베이스(rebase)하세요. 또는 관리자의 요청에 이 작업이 필요할 수 있습니다: + + ```bash + git fetch upstream + git rebase upstream/main + ``` + + 변경 사항을 브랜치에 푸시하세요: + + ```bash + git push -u origin a-descriptive-name-for-my-changes + ``` + + 이미 PR을 열었다면, `--force` 플래그와 함께 강제 푸시해야 합니다. 아직 PR이 열리지 않았다면 정상적으로 변경 사항을 푸시하면 됩니다. + +6. 이제 GitHub에서 포크한 저장소로 이동하고 **Pull request(풀 리퀘스트)**를 클릭하여 Pull Request를 열 수 있습니다. 아래의 [체크리스트](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md/#pull-request-checklist)에서 모든 항목에 체크 표시를 하세요. 준비가 완료되면 프로젝트 관리자에게 변경 사항을 보내 검토를 요청할 수 있습니다. + +7. 관리자가 변경 사항을 요청해도 괜찮습니다. 핵심 기여자들도 동일한 상황을 겪습니다! 모두가 변경 사항을 Pull Request에서 볼 수 있도록, 로컬 브랜치에서 작업하고 변경 사항을 포크한 저장소로 푸시하세요. 그러면 변경 사항이 자동으로 Pull Request에 나타납니다. + +### Pull Request 체크리스트 [[pull-request-checklist]] + +☐ Pull Request 제목은 기여 내용을 요약해야 합니다.
+☐ Pull Request가 이슈를 해결하는 경우, Pull Request 설명에 이슈 번호를 언급하여 연관되어 있음을 알려주세요. (이슈를 확인하는 사람들이 해당 이슈에 대한 작업이 진행 중임을 알 수 있게 합니다).
+☐ 작업이 진행중이라면 제목 앞에 `[WIP]`를 붙여주세요. 중복 작업을 피하고 병합할 준비가 된 PR과 구분하기에 유용합니다.
+☐ 기존 테스트를 통과하는지 확인하세요.
+☐ 새로운 기능을 추가하는 경우, 해당 기능에 대한 테스트도 추가하세요.
+ - 새 모델을 추가하는 경우, `ModelTester.all_model_classes = (MyModel, MyModelWithLMHead,...)`을 사용하여 일반적인 테스트를 활성화하세요. + - 새 `@slow` 테스트를 추가하는 경우, 다음 명령으로 테스트를 통과하는지 확인하세요: `RUN_SLOW=1 python -m pytest tests/models/my_new_model/test_my_new_model.py`. + - 새 토크나이저를 추가하는 경우, 테스트를 작성하고 다음 명령으로 테스트를 통과하는지 확인하세요: `RUN_SLOW=1 python -m pytest tests/models/{your_model_name}/test_tokenization_{your_model_name}.py`. + - CircleCI에서는 느린 테스트를 실행하지 않지만, GitHub Actions에서는 매일 밤 실행됩니다!
+ +☐ 모든 공개 메소드는 유용한 기술문서를 가져야 합니다 (예를 들어 [`modeling_bert.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py) 참조).
+☐ 저장소가 빠르게 성장하고 있으므로 저장소에 상당한 부담을 주는 이미지, 동영상 및 기타 텍스트가 아닌 파일은 추가하지 마세요. 대신 [`hf-internal-testing`](https://huggingface.co/hf-internal-testing)과 같은 Hub 저장소를 사용하여 이러한 파일을 호스팅하고 URL로 참조하세요. 문서와 관련된 이미지는 다음 저장소에 배치하는 것을 권장합니다: [huggingface/documentation-images](https://huggingface.co/datasets/huggingface/documentation-images). 이 데이터셋 저장소에서 PR을 열어서 Hugging Face 멤버에게 병합을 요청할 수 있습니다. + +Pull Request에서 실행되는 검사에 대한 자세한 정보는 [Pull Request에 대한 검사](https://huggingface.co/docs/transformers/pr_checks) 가이드를 확인하세요. + +### 테스트 [[tests]] + +라이브러리 동작과 여러 예제를 테스트할 수 있는 광범위한 테스트 스위트가 포함되어 있습니다. 라이브러리 테스트는 [tests](https://github.com/huggingface/transformers/tree/main/tests) 폴더에, 예제 테스트는 [examples](https://github.com/huggingface/transformers/tree/main/examples) 폴더에 있습니다. + +속도가 빠른 `pytest`와 `pytest-xdist`를 선호합니다. 저장소의 루트 디렉터리에서 테스트를 실행할 *하위 폴더 경로 또는 테스트 파일 경로*를 지정하세요: + +```bash +python -m pytest -n auto --dist=loadfile -s -v ./tests/models/my_new_model +``` + +마찬가지로 `examples` 디렉터리에서도 *하위 폴더 경로 또는 테스트 파일 경로*를 지정하세요. 예를 들어, 다음 명령은 PyTorch `examples` 디렉터리의 텍스트 분류 하위 폴더를 테스트합니다: + +```bash +pip install -r examples/xxx/requirements.txt # only needed the first time +python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/text-classification +``` + +이것이 실제로 `make test` 및 `make test-examples` 명령이 구현되는 방식입니다 (`pip install`은 제외합니다)! + +또한 특정 기능만 테스트하기 위한 더 작은 테스트를 지정할 수 있습니다. + +기본적으로 느린 테스트는 건너뛰지만 `RUN_SLOW` 환경 변수를 `yes`로 설정하여 실행할 수 있습니다. 이렇게 하면 많은 기가바이트 단위의 모델이 다운로드되므로 충분한 디스크 공간, 좋은 인터넷 연결과 많은 인내가 필요합니다! + + + +테스트를 실행하려면 *하위 폴더 경로 또는 테스트 파일 경로*를 지정하세요. 그렇지 않으면 `tests` 또는 `examples` 폴더의 모든 테스트를 실행하게 되어 매우 긴 시간이 걸립니다! + + + +```bash +RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./tests/models/my_new_model +RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/text-classification +``` + +느린 테스트와 마찬가지로, 다음과 같이 테스트 중에 기본적으로 활성화되지 않는 다른 환경 변수도 있습니다: +- `RUN_CUSTOM_TOKENIZERS`: 사용자 정의 토크나이저 테스트를 활성화합니다. +- `RUN_PT_FLAX_CROSS_TESTS`: PyTorch + Flax 통합 테스트를 활성화합니다. +- `RUN_PT_TF_CROSS_TESTS`: TensorFlow + PyTorch 통합 테스트를 활성화합니다. + +더 많은 환경 변수와 추가 정보는 [testing_utils.py](src/transformers/testing_utils.py)에서 찾을 수 있습니다. + +🤗 Transformers는 테스트 실행기로 `pytest`를 사용합니다. 그러나 테스트 스위트 자체에서는 `pytest` 관련 기능을 사용하지 않습니다. + +이것은 `unittest`가 완전히 지원된다는 것을 의미합니다. 다음은 `unittest`로 테스트를 실행하는 방법입니다: + +```bash +python -m unittest discover -s tests -t . -v +python -m unittest discover -s examples -t examples -v +``` + +### 스타일 가이드 [[style-guide]] + +문서는 [Google Python 스타일 가이드](https://google.github.io/styleguide/pyguide.html)를 따릅니다. 자세한 정보는 [문서 작성 가이드](https://github.com/huggingface/transformers/tree/main/docs#writing-documentation---specification)를 확인하세요. + +### Windows에서 개발 [[develop-on-windows]] + +Windows에서 개발할 경우([Windows Subsystem for Linux](https://learn.microsoft.com/en-us/windows/wsl/) 또는 WSL에서 작업하지 않는 한) Windows `CRLF` 줄 바꿈을 Linux `LF` 줄 바꿈으로 변환하도록 git을 구성해야 합니다: + +```bash +git config core.autocrlf input +``` + +Windows에서 `make` 명령을 실행하는 한 가지 방법은 MSYS2를 사용하는 것입니다: + +1. [MSYS2](https://www.msys2.org/)를 다운로드합니다. `C:\msys64`에 설치되었다고 가정합니다. +2. CLI에서 `C:\msys64\msys2.exe`를 엽니다 (시작 메뉴에서 사용 가능해야 함). +3. 쉘에서 다음을 실행하여: `pacman -Syu` 및 `pacman -S make`로 `make`를 설치합니다. +4. 환경 변수 PATH에 `C:\msys64\usr\bin`을 추가하세요. + +이제 모든 터미널 (PowerShell, cmd.exe 등)에서 `make`를 사용할 수 있습니다! 🎉 + +### 포크한 저장소를 상위 원본 브랜치(main)과 동기화하기 (Hugging Face 저장소) [[sync-a-forked-repository-with-upstream-main-the-hugging-face-repository]] + +포크한 저장소의 main 브랜치를 업데이트할 때, 다음 단계를 따라 수행해주세요. 
이렇게 하면 각 upstream PR에 참조 노트가 추가되는 것을 피하고 이러한 PR에 관여하는 개발자들에게 불필요한 알림이 전송되는 것을 방지할 수 있습니다. + +1. 가능하면 포크된 저장소의 브랜치 및 PR을 사용하여 upstream과 동기화하지 마세요. 대신 포크된 main 저장소에 직접 병합하세요. +2. PR이 반드시 필요한 경우, 브랜치를 확인한 후 다음 단계를 사용하세요: + + ```bash + git checkout -b your-branch-for-syncing + git pull --squash --no-commit upstream main + git commit -m '' + git push --set-upstream origin your-branch-for-syncing + ``` diff --git a/docs/source/ko/create_a_model.md b/docs/source/ko/create_a_model.md new file mode 100644 index 00000000000000..b911669bb174b9 --- /dev/null +++ b/docs/source/ko/create_a_model.md @@ -0,0 +1,388 @@ + + +# 맞춤형 아키텍처 만들기[[create-a-custom-architecture]] + +[`AutoClass`](model_doc/auto)는 모델 아키텍처를 자동으로 추론하고 미리 학습된 configuration과 가중치를 다운로드합니다. 일반적으로 체크포인트에 구애받지 않는 코드를 생성하려면 `AutoClass`를 사용하는 것이 좋습니다. 하지만 특정 모델 파라미터를 보다 세밀하게 제어하고자 하는 사용자는 몇 가지 기본 클래스만으로 커스텀 🤗 Transformers 모델을 생성할 수 있습니다. 이는 🤗 Transformers 모델을 연구, 교육 또는 실험하는 데 관심이 있는 모든 사용자에게 특히 유용할 수 있습니다. 이 가이드에서는 'AutoClass'를 사용하지 않고 커스텀 모델을 만드는 방법에 대해 알아보겠습니다: + +- 모델 configuration을 가져오고 사용자 지정합니다. +- 모델 아키텍처를 생성합니다. +- 텍스트에 사용할 느리거나 빠른 토큰화기를 만듭니다. +- 비전 작업을 위한 이미지 프로세서를 생성합니다. +- 오디오 작업을 위한 특성 추출기를 생성합니다. +- 멀티모달 작업용 프로세서를 생성합니다. + +## Configuration[[configuration]] + +[configuration](main_classes/configuration)은 모델의 특정 속성을 나타냅니다. 각 모델 구성에는 서로 다른 속성이 있습니다. 예를 들어, 모든 NLP 모델에는 `hidden_size`, `num_attention_heads`, `num_hidden_layers` 및 `vocab_size` 속성이 공통으로 있습니다. 이러한 속성은 모델을 구성할 attention heads 또는 hidden layers의 수를 지정합니다. + +[DistilBERT](model_doc/distilbert) 속성을 검사하기 위해 [`DistilBertConfig`]에 접근하여 자세히 살펴봅니다: + +```py +>>> from transformers import DistilBertConfig + +>>> config = DistilBertConfig() +>>> print(config) +DistilBertConfig { + "activation": "gelu", + "attention_dropout": 0.1, + "dim": 768, + "dropout": 0.1, + "hidden_dim": 3072, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "model_type": "distilbert", + "n_heads": 12, + "n_layers": 6, + "pad_token_id": 0, + "qa_dropout": 0.1, + "seq_classif_dropout": 0.2, + "sinusoidal_pos_embds": false, + "transformers_version": "4.16.2", + "vocab_size": 30522 +} +``` + +[`DistilBertConfig`]는 기본 [`DistilBertModel`]을 빌드하는 데 사용되는 모든 기본 속성을 표시합니다. 모든 속성은 커스터마이징이 가능하므로 실험을 위한 공간을 만들 수 있습니다. 예를 들어 기본 모델을 다음과 같이 커스터마이즈할 수 있습니다: + +- `activation` 파라미터로 다른 활성화 함수를 사용해 보세요. +- `attention_dropout` 파라미터를 사용하여 어텐션 확률에 더 높은 드롭아웃 비율을 사용하세요. + +```py +>>> my_config = DistilBertConfig(activation="relu", attention_dropout=0.4) +>>> print(my_config) +DistilBertConfig { + "activation": "relu", + "attention_dropout": 0.4, + "dim": 768, + "dropout": 0.1, + "hidden_dim": 3072, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "model_type": "distilbert", + "n_heads": 12, + "n_layers": 6, + "pad_token_id": 0, + "qa_dropout": 0.1, + "seq_classif_dropout": 0.2, + "sinusoidal_pos_embds": false, + "transformers_version": "4.16.2", + "vocab_size": 30522 +} +``` + +사전 학습된 모델 속성은 [`~PretrainedConfig.from_pretrained`] 함수에서 수정할 수 있습니다: + +```py +>>> my_config = DistilBertConfig.from_pretrained("distilbert/distilbert-base-uncased", activation="relu", attention_dropout=0.4) +``` + +모델 구성이 만족스러우면 [`~PretrainedConfig.save_pretrained`]로 저장할 수 있습니다. 
설정 파일은 지정된 작업 경로에 JSON 파일로 저장됩니다: + +```py +>>> my_config.save_pretrained(save_directory="./your_model_save_path") +``` + +configuration 파일을 재사용하려면 [`~PretrainedConfig.from_pretrained`]를 사용하여 가져오세요: + +```py +>>> my_config = DistilBertConfig.from_pretrained("./your_model_save_path/config.json") +``` + + + +configuration 파일을 딕셔너리로 저장하거나 사용자 정의 configuration 속성과 기본 configuration 속성의 차이점만 저장할 수도 있습니다! 자세한 내용은 [configuration](main_classes/configuration) 문서를 참조하세요. + + + +## 모델[[model]] + +다음 단계는 [모델(model)](main_classes/models)을 만드는 것입니다. 느슨하게 아키텍처라고도 불리는 모델은 각 계층이 수행하는 동작과 발생하는 작업을 정의합니다. configuration의 `num_hidden_layers`와 같은 속성은 아키텍처를 정의하는 데 사용됩니다. 모든 모델은 기본 클래스 [`PreTrainedModel`]과 입력 임베딩 크기 조정 및 셀프 어텐션 헤드 가지 치기와 같은 몇 가지 일반적인 메소드를 공유합니다. 또한 모든 모델은 [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html), [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) 또는 [`flax.linen.Module`](https://flax.readthedocs.io/en/latest/api_reference/flax.linen/module.html)의 서브클래스이기도 합니다. 즉, 모델은 각 프레임워크의 사용법과 호환됩니다. + + + +사용자 지정 configuration 속성을 모델에 가져옵니다: + +```py +>>> from transformers import DistilBertModel + +>>> my_config = DistilBertConfig.from_pretrained("./your_model_save_path/config.json") +>>> model = DistilBertModel(my_config) +``` + +이제 사전 학습된 가중치 대신 임의의 값을 가진 모델이 생성됩니다. 이 모델을 훈련하기 전까지는 유용하게 사용할 수 없습니다. 훈련은 비용과 시간이 많이 소요되는 프로세스입니다. 일반적으로 훈련에 필요한 리소스의 일부만 사용하면서 더 나은 결과를 더 빨리 얻으려면 사전 훈련된 모델을 사용하는 것이 좋습니다. + +사전 학습된 모델을 [`~PreTrainedModel.from_pretrained`]로 생성합니다: + +```py +>>> model = DistilBertModel.from_pretrained("distilbert/distilbert-base-uncased") +``` + +🤗 Transformers에서 제공한 모델의 사전 학습된 가중치를 사용하는 경우 기본 모델 configuration을 자동으로 불러옵니다. 그러나 원하는 경우 기본 모델 configuration 속성의 일부 또는 전부를 사용자 지정으로 바꿀 수 있습니다: + +```py +>>> model = DistilBertModel.from_pretrained("distilbert/distilbert-base-uncased", config=my_config) +``` + + +사용자 지정 configuration 속성을 모델에 불러옵니다: + +```py +>>> from transformers import TFDistilBertModel + +>>> my_config = DistilBertConfig.from_pretrained("./your_model_save_path/my_config.json") +>>> tf_model = TFDistilBertModel(my_config) +``` + +이제 사전 학습된 가중치 대신 임의의 값을 가진 모델이 생성됩니다. 이 모델을 훈련하기 전까지는 유용하게 사용할 수 없습니다. 훈련은 비용과 시간이 많이 소요되는 프로세스입니다. 일반적으로 훈련에 필요한 리소스의 일부만 사용하면서 더 나은 결과를 더 빨리 얻으려면 사전 훈련된 모델을 사용하는 것이 좋습니다. + +사전 학습된 모델을 [`~TFPreTrainedModel.from_pretrained`]로 생성합니다: + +```py +>>> tf_model = TFDistilBertModel.from_pretrained("distilbert/distilbert-base-uncased") +``` + +🤗 Transformers에서 제공한 모델의 사전 학습된 가중치를 사용하는 경우 기본 모델 configuration을 자동으로 불러옵니다. 그러나 원하는 경우 기본 모델 configuration 속성의 일부 또는 전부를 사용자 지정으로 바꿀 수 있습니다: + +```py +>>> tf_model = TFDistilBertModel.from_pretrained("distilbert/distilbert-base-uncased", config=my_config) +``` + + + +### 모델 헤드[[model-heads]] + +이 시점에서 *은닉 상태(hidden state)*를 출력하는 기본 DistilBERT 모델을 갖게 됩니다. 은닉 상태는 최종 출력을 생성하기 위해 모델 헤드에 입력으로 전달됩니다. 🤗 Transformers는 모델이 해당 작업을 지원하는 한 각 작업마다 다른 모델 헤드를 제공합니다(즉, 번역과 같은 시퀀스 간 작업에는 DistilBERT를 사용할 수 없음). + + + +예를 들어, [`DistilBertForSequenceClassification`]은 시퀀스 분류 헤드가 있는 기본 DistilBERT 모델입니다. 시퀀스 분류 헤드는 풀링된 출력 위에 있는 선형 레이어입니다. + +```py +>>> from transformers import DistilBertForSequenceClassification + +>>> model = DistilBertForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") +``` + +다른 모델 헤드로 전환하여 이 체크포인트를 다른 작업에 쉽게 재사용할 수 있습니다. 질의응답 작업의 경우, [`DistilBertForQuestionAnswering`] 모델 헤드를 사용할 수 있습니다. 질의응답 헤드는 숨겨진 상태 출력 위에 선형 레이어가 있다는 점을 제외하면 시퀀스 분류 헤드와 유사합니다. 
+ +```py +>>> from transformers import DistilBertForQuestionAnswering + +>>> model = DistilBertForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased") +``` + + +예를 들어, [`TFDistilBertForSequenceClassification`]은 시퀀스 분류 헤드가 있는 기본 DistilBERT 모델입니다. 시퀀스 분류 헤드는 풀링된 출력 위에 있는 선형 레이어입니다. + +```py +>>> from transformers import TFDistilBertForSequenceClassification + +>>> tf_model = TFDistilBertForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") +``` + +다른 모델 헤드로 전환하여 이 체크포인트를 다른 작업에 쉽게 재사용할 수 있습니다. 질의응답 작업의 경우, [`TFDistilBertForQuestionAnswering`] 모델 헤드를 사용할 수 있습니다. 질의응답 헤드는 숨겨진 상태 출력 위에 선형 레이어가 있다는 점을 제외하면 시퀀스 분류 헤드와 유사합니다. + +```py +>>> from transformers import TFDistilBertForQuestionAnswering + +>>> tf_model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased") +``` + + + +## 토크나이저[[tokenizer]] + +텍스트 데이터에 모델을 사용하기 전에 마지막으로 필요한 기본 클래스는 원시 텍스트를 텐서로 변환하는 [토크나이저](main_classes/tokenizer)입니다. 🤗 Transformers에 사용할 수 있는 토크나이저는 두 가지 유형이 있습니다: + +- [`PreTrainedTokenizer`]: 파이썬으로 구현된 토크나이저입니다. +- [`PreTrainedTokenizerFast`]: Rust 기반 [🤗 Tokenizer](https://huggingface.co/docs/tokenizers/python/latest/) 라이브러리로 만들어진 토크나이저입니다. 이 토크나이저는 Rust로 구현되어 배치 토큰화에서 특히 빠릅니다. 빠른 토크나이저는 토큰을 원래 단어나 문자에 매핑하는 *오프셋 매핑*과 같은 추가 메소드도 제공합니다. +두 토크나이저 모두 인코딩 및 디코딩, 새 토큰 추가, 특수 토큰 관리와 같은 일반적인 방법을 지원합니다. + + + +모든 모델이 빠른 토크나이저를 지원하는 것은 아닙니다. 이 [표](index#supported-frameworks)에서 모델의 빠른 토크나이저 지원 여부를 확인하세요. + + + +토크나이저를 직접 학습한 경우, *어휘(vocabulary)* 파일에서 토크나이저를 만들 수 있습니다: + +```py +>>> from transformers import DistilBertTokenizer + +>>> my_tokenizer = DistilBertTokenizer(vocab_file="my_vocab_file.txt", do_lower_case=False, padding_side="left") +``` + +사용자 지정 토크나이저의 어휘는 사전 학습된 모델의 토크나이저에서 생성된 어휘와 다를 수 있다는 점을 기억하는 것이 중요합니다. 사전 학습된 모델을 사용하는 경우 사전 학습된 모델의 어휘를 사용해야 하며, 그렇지 않으면 입력이 의미를 갖지 못합니다. [`DistilBertTokenizer`] 클래스를 사용하여 사전 학습된 모델의 어휘로 토크나이저를 생성합니다: + +```py +>>> from transformers import DistilBertTokenizer + +>>> slow_tokenizer = DistilBertTokenizer.from_pretrained("distilbert/distilbert-base-uncased") +``` + +[`DistilBertTokenizerFast`] 클래스로 빠른 토크나이저를 생성합니다: + +```py +>>> from transformers import DistilBertTokenizerFast + +>>> fast_tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert/distilbert-base-uncased") +``` + + + +[`AutoTokenizer`]는 기본적으로 빠른 토크나이저를 가져오려고 합니다. 이 동작을 비활성화하려면 `from_pretrained`에서 `use_fast=False`를 설정하면 됩니다. + + + +## 이미지 프로세서[[image-processor]] + +이미지 프로세서(image processor)는 비전 입력을 처리합니다. 기본 [`~image_processing_utils.ImageProcessingMixin`] 클래스에서 상속합니다. + +사용하려면 사용 중인 모델과 연결된 이미지 프로세서를 생성합니다. 예를 들어, 이미지 분류에 [ViT](model_doc/vit)를 사용하는 경우 기본 [`ViTImageProcessor`]를 생성합니다: + +```py +>>> from transformers import ViTImageProcessor + +>>> vit_extractor = ViTImageProcessor() +>>> print(vit_extractor) +ViTImageProcessor { + "do_normalize": true, + "do_resize": true, + "feature_extractor_type": "ViTImageProcessor", + "image_mean": [ + 0.5, + 0.5, + 0.5 + ], + "image_std": [ + 0.5, + 0.5, + 0.5 + ], + "resample": 2, + "size": 224 +} +``` + + + +사용자 지정을 원하지 않는 경우 `from_pretrained` 메소드를 사용하여 모델의 기본 이미지 프로세서 매개변수를 불러오면 됩니다. 
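위 팁을 간단히 보여주는 예시는 다음과 같습니다. 아래의 `google/vit-base-patch16-224`는 설명을 위해 가정한 예시 체크포인트이며, 실제로는 사용 중인 모델의 체크포인트 이름을 넣으면 됩니다:

```py
>>> from transformers import ViTImageProcessor

>>> # 체크포인트에 저장된 기본 이미지 프로세서 매개변수를 그대로 불러옵니다 (체크포인트 이름은 예시입니다)
>>> vit_extractor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
```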
+ + + +사용자 지정 이미지 프로세서를 생성하려면 [`ViTImageProcessor`] 파라미터를 수정합니다: + +```py +>>> from transformers import ViTImageProcessor + +>>> my_vit_extractor = ViTImageProcessor(resample="PIL.Image.BOX", do_normalize=False, image_mean=[0.3, 0.3, 0.3]) +>>> print(my_vit_extractor) +ViTImageProcessor { + "do_normalize": false, + "do_resize": true, + "feature_extractor_type": "ViTImageProcessor", + "image_mean": [ + 0.3, + 0.3, + 0.3 + ], + "image_std": [ + 0.5, + 0.5, + 0.5 + ], + "resample": "PIL.Image.BOX", + "size": 224 +} +``` + +## 특성 추출기[[feature-extractor]] + +특성 추출기(feature extractor)는 오디오 입력을 처리합니다. 기본 [`~feature_extraction_utils.FeatureExtractionMixin`] 클래스에서 상속되며, 오디오 입력을 처리하기 위해 [`SequenceFeatureExtractor`] 클래스에서 상속할 수도 있습니다. + +사용하려면 사용 중인 모델과 연결된 특성 추출기를 생성합니다. 예를 들어, 오디오 분류에 [Wav2Vec2](model_doc/wav2vec2)를 사용하는 경우 기본 [`Wav2Vec2FeatureExtractor`]를 생성합니다: + +```py +>>> from transformers import Wav2Vec2FeatureExtractor + +>>> w2v2_extractor = Wav2Vec2FeatureExtractor() +>>> print(w2v2_extractor) +Wav2Vec2FeatureExtractor { + "do_normalize": true, + "feature_extractor_type": "Wav2Vec2FeatureExtractor", + "feature_size": 1, + "padding_side": "right", + "padding_value": 0.0, + "return_attention_mask": false, + "sampling_rate": 16000 +} +``` + + + +사용자 지정이 필요하지 않은 경우 `from_pretrained` 메소드를 사용하여 모델의 기본 특성 추출기 ㅁ개변수를 불러 오면 됩니다. + + + +사용자 지정 특성 추출기를 만들려면 [`Wav2Vec2FeatureExtractor`] 매개변수를 수정합니다: + +```py +>>> from transformers import Wav2Vec2FeatureExtractor + +>>> w2v2_extractor = Wav2Vec2FeatureExtractor(sampling_rate=8000, do_normalize=False) +>>> print(w2v2_extractor) +Wav2Vec2FeatureExtractor { + "do_normalize": false, + "feature_extractor_type": "Wav2Vec2FeatureExtractor", + "feature_size": 1, + "padding_side": "right", + "padding_value": 0.0, + "return_attention_mask": false, + "sampling_rate": 8000 +} +``` + + +## 프로세서[[processor]] + +멀티모달 작업을 지원하는 모델의 경우, 🤗 Transformers는 특성 추출기 및 토크나이저와 같은 처리 클래스를 단일 객체로 편리하게 래핑하는 프로세서 클래스를 제공합니다. 예를 들어, 자동 음성 인식 작업(Automatic Speech Recognition task (ASR))에 [`Wav2Vec2Processor`]를 사용한다고 가정해 보겠습니다. 자동 음성 인식 작업은 오디오를 텍스트로 변환하므로 특성 추출기와 토크나이저가 필요합니다. + +오디오 입력을 처리할 특성 추출기를 만듭니다: + +```py +>>> from transformers import Wav2Vec2FeatureExtractor + +>>> feature_extractor = Wav2Vec2FeatureExtractor(padding_value=1.0, do_normalize=True) +``` + +텍스트 입력을 처리할 토크나이저를 만듭니다: + +```py +>>> from transformers import Wav2Vec2CTCTokenizer + +>>> tokenizer = Wav2Vec2CTCTokenizer(vocab_file="my_vocab_file.txt") +``` + +[`Wav2Vec2Processor`]에서 특성 추출기와 토크나이저를 결합합니다: + +```py +>>> from transformers import Wav2Vec2Processor + +>>> processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer) +``` + +configuration과 모델이라는 두 가지 기본 클래스와 추가 전처리 클래스(토크나이저, 이미지 프로세서, 특성 추출기 또는 프로세서)를 사용하면 🤗 Transformers에서 지원하는 모든 모델을 만들 수 있습니다. 이러한 각 기본 클래스는 구성이 가능하므로 원하는 특정 속성을 사용할 수 있습니다. 학습을 위해 모델을 쉽게 설정하거나 기존의 사전 학습된 모델을 수정하여 미세 조정할 수 있습니다. diff --git a/docs/source/ko/custom_models.md b/docs/source/ko/custom_models.md new file mode 100644 index 00000000000000..72dad7caaff203 --- /dev/null +++ b/docs/source/ko/custom_models.md @@ -0,0 +1,346 @@ + + +# 사용자 정의 모델 공유하기[[sharing-custom-models]] + +🤗 Transformers 라이브러리는 쉽게 확장할 수 있도록 설계되었습니다. +모든 모델은 추상화 없이 저장소의 지정된 하위 폴더에 완전히 코딩되어 있으므로, 손쉽게 모델링 파일을 복사하고 필요에 따라 조정할 수 있습니다. + +완전히 새로운 모델을 만드는 경우에는 처음부터 시작하는 것이 더 쉬울 수 있습니다. +이 튜토리얼에서는 Transformers 내에서 사용할 수 있도록 사용자 정의 모델과 구성을 작성하는 방법과 +🤗 Transformers 라이브러리에 없는 경우에도 누구나 사용할 수 있도록 (의존성과 함께) 커뮤니티에 공유하는 방법을 배울 수 있습니다. 
+ +[timm 라이브러리](https://github.com/rwightman/pytorch-image-models)의 ResNet 클래스를 [`PreTrainedModel`]로 래핑한 ResNet 모델을 예로 모든 것을 설명합니다. + +## 사용자 정의 구성 작성하기[[writing-a-custom-configuration]] + +모델에 들어가기 전에 먼저 구성을 작성해보도록 하겠습니다. +모델의 `configuration`은 모델을 만들기 위해 필요한 모든 중요한 것들을 포함하고 있는 객체입니다. +다음 섹션에서 볼 수 있듯이, 모델은 `config`를 사용해서만 초기화할 수 있기 때문에 완벽한 구성이 필요합니다. + +아래 예시에서는 ResNet 클래스의 인수(argument)를 조정해보겠습니다. +다른 구성은 가능한 ResNet 중 다른 유형을 제공합니다. +그런 다음 몇 가지 유효성을 확인한 후 해당 인수를 저장합니다. + +```python +from transformers import PretrainedConfig +from typing import List + + +class ResnetConfig(PretrainedConfig): + model_type = "resnet" + + def __init__( + self, + block_type="bottleneck", + layers: List[int] = [3, 4, 6, 3], + num_classes: int = 1000, + input_channels: int = 3, + cardinality: int = 1, + base_width: int = 64, + stem_width: int = 64, + stem_type: str = "", + avg_down: bool = False, + **kwargs, + ): + if block_type not in ["basic", "bottleneck"]: + raise ValueError(f"`block_type` must be 'basic' or bottleneck', got {block_type}.") + if stem_type not in ["", "deep", "deep-tiered"]: + raise ValueError(f"`stem_type` must be '', 'deep' or 'deep-tiered', got {stem_type}.") + + self.block_type = block_type + self.layers = layers + self.num_classes = num_classes + self.input_channels = input_channels + self.cardinality = cardinality + self.base_width = base_width + self.stem_width = stem_width + self.stem_type = stem_type + self.avg_down = avg_down + super().__init__(**kwargs) +``` + +사용자 정의 `configuration`을 작성할 때 기억해야 할 세 가지 중요한 사항은 다음과 같습니다: +- `PretrainedConfig`을 상속해야 합니다. +- `PretrainedConfig`의 `__init__`은 모든 kwargs를 허용해야 하고, +- 이러한 `kwargs`는 상위 클래스 `__init__`에 전달되어야 합니다. + +상속은 🤗 Transformers 라이브러리에서 모든 기능을 가져오는 것입니다. +이러한 점으로부터 비롯되는 두 가지 제약 조건은 `PretrainedConfig`에 설정하는 것보다 더 많은 필드가 있습니다. +`from_pretrained` 메서드로 구성을 다시 로드할 때 해당 필드는 구성에서 수락한 후 상위 클래스로 보내야 합니다. + +모델을 auto 클래스에 등록하지 않는 한, `configuration`에서 `model_type`을 정의(여기서 `model_type="resnet"`)하는 것은 필수 사항이 아닙니다 (마지막 섹션 참조). + +이렇게 하면 라이브러리의 다른 모델 구성과 마찬가지로 구성을 쉽게 만들고 저장할 수 있습니다. +다음은 resnet50d 구성을 생성하고 저장하는 방법입니다: + +```py +resnet50d_config = ResnetConfig(block_type="bottleneck", stem_width=32, stem_type="deep", avg_down=True) +resnet50d_config.save_pretrained("custom-resnet") +``` + +이렇게 하면 `custom-resnet` 폴더 안에 `config.json`이라는 파일이 저장됩니다. +그런 다음 `from_pretrained` 메서드를 사용하여 구성을 다시 로드할 수 있습니다. + +```py +resnet50d_config = ResnetConfig.from_pretrained("custom-resnet") +``` + +구성을 Hub에 직접 업로드하기 위해 [`PretrainedConfig`] 클래스의 [`~PretrainedConfig.push_to_hub`]와 같은 다른 메서드를 사용할 수 있습니다. + + +## 사용자 정의 모델 작성하기[[writing-a-custom-model]] + +이제 ResNet 구성이 있으므로 모델을 작성할 수 있습니다. +실제로는 두 개를 작성할 것입니다. 하나는 이미지 배치에서 hidden features를 추출하는 것([`BertModel`]과 같이), 다른 하나는 이미지 분류에 적합한 것입니다([`BertForSequenceClassification`]과 같이). + +이전에 언급했듯이 이 예제에서는 단순하게 하기 위해 모델의 느슨한 래퍼(loose wrapper)만 작성할 것입니다. +이 클래스를 작성하기 전에 블록 유형과 실제 블록 클래스 간의 매핑 작업만 하면 됩니다. 
+그런 다음 `ResNet` 클래스로 전달되어 `configuration`을 통해 모델이 선언됩니다:
+
+```py
+from transformers import PreTrainedModel
+from timm.models.resnet import BasicBlock, Bottleneck, ResNet
+from .configuration_resnet import ResnetConfig
+
+
+BLOCK_MAPPING = {"basic": BasicBlock, "bottleneck": Bottleneck}
+
+
+class ResnetModel(PreTrainedModel):
+    config_class = ResnetConfig
+
+    def __init__(self, config):
+        super().__init__(config)
+        block_layer = BLOCK_MAPPING[config.block_type]
+        self.model = ResNet(
+            block_layer,
+            config.layers,
+            num_classes=config.num_classes,
+            in_chans=config.input_channels,
+            cardinality=config.cardinality,
+            base_width=config.base_width,
+            stem_width=config.stem_width,
+            stem_type=config.stem_type,
+            avg_down=config.avg_down,
+        )
+
+    def forward(self, tensor):
+        return self.model.forward_features(tensor)
+```
+
+이미지 분류 모델을 만들기 위해서는 forward 메소드만 변경하면 됩니다:
+
+```py
+import torch
+
+
+class ResnetModelForImageClassification(PreTrainedModel):
+    config_class = ResnetConfig
+
+    def __init__(self, config):
+        super().__init__(config)
+        block_layer = BLOCK_MAPPING[config.block_type]
+        self.model = ResNet(
+            block_layer,
+            config.layers,
+            num_classes=config.num_classes,
+            in_chans=config.input_channels,
+            cardinality=config.cardinality,
+            base_width=config.base_width,
+            stem_width=config.stem_width,
+            stem_type=config.stem_type,
+            avg_down=config.avg_down,
+        )
+
+    def forward(self, tensor, labels=None):
+        logits = self.model(tensor)
+        if labels is not None:
+            loss = torch.nn.functional.cross_entropy(logits, labels)
+            return {"loss": loss, "logits": logits}
+        return {"logits": logits}
+```
+
+두 경우 모두 `PreTrainedModel`을 상속받고, `config`를 통해 상위 클래스 초기화를 호출한다는 점을 기억하세요 (일반적인 `torch.nn.Module`을 작성할 때와 비슷함).
+모델을 auto 클래스에 등록하고 싶은 경우에는 `config_class`를 설정하는 부분이 필수입니다 (마지막 섹션 참조).
+
+
+
+라이브러리에 존재하는 모델과 굉장히 유사하다면, 모델을 생성할 때 구성을 참조해 재사용할 수 있습니다.
+
+
+
+원하는 것을 모델이 반환하도록 할 수 있지만, `ResnetModelForImageClassification`에서 했던 것처럼
+레이블을 통과시켰을 때 손실과 함께 사전 형태로 반환하는 것이 [`Trainer`] 클래스 내에서 직접 모델을 사용하기에 유용합니다.
+자신만의 학습 루프 또는 다른 학습 라이브러리를 사용할 계획이라면 다른 출력 형식을 사용해도 좋습니다.
+
+이제 모델 클래스가 있으므로 하나 생성해 보겠습니다:
+
+```py
+resnet50d = ResnetModelForImageClassification(resnet50d_config)
+```
+
+다시 말하지만, [`~PreTrainedModel.save_pretrained`] 또는 [`~PreTrainedModel.push_to_hub`]처럼 [`PreTrainedModel`]에 속하는 모든 메소드를 사용할 수 있습니다.
+다음 섹션에서 두 번째 메소드를 사용해 모델 코드와 모델 가중치를 업로드하는 방법을 살펴보겠습니다.
+먼저, 모델 내부에 사전 훈련된 가중치를 로드해 보겠습니다.
+
+이 예제를 활용할 때는, 사용자 정의 모델을 자신만의 데이터로 학습시킬 것입니다.
+이 튜토리얼에서는 빠르게 진행하기 위해 사전 훈련된 resnet50d를 사용하겠습니다.
+아래 모델은 resnet50d의 래퍼이기 때문에, 가중치를 쉽게 로드할 수 있습니다.
+
+
+```py
+import timm
+
+pretrained_model = timm.create_model("resnet50d", pretrained=True)
+resnet50d.model.load_state_dict(pretrained_model.state_dict())
+```
+
+이제 [`~PreTrainedModel.save_pretrained`] 또는 [`~PreTrainedModel.push_to_hub`]를 사용할 때 모델 코드가 저장되는지 확인해봅시다.
+
+## Hub로 코드 업로드하기[[sending-the-code-to-the-hub]]
+
+
+
+이 API는 실험적이며 다음 릴리스에서 약간의 변경 사항이 있을 수 있습니다.
+
+
+
+먼저 모델이 `.py` 파일에 완전히 정의되어 있는지 확인하세요.
+모든 파일이 동일한 작업 경로에 있기 때문에 상대경로 임포트(relative import)에 의존할 수 있습니다 (transformers에서는 이 기능에 대한 하위 모듈을 지원하지 않습니다).
+이 예시에서는 작업 경로 안의 `resnet_model`에서 `modeling_resnet.py` 파일과 `configuration_resnet.py` 파일을 정의합니다.
+구성 파일에는 `ResnetConfig`에 대한 코드가 있고 모델링 파일에는 `ResnetModel` 및 `ResnetModelForImageClassification`에 대한 코드가 있습니다.
+
+```
+.
+└── resnet_model
+    ├── __init__.py
+    ├── configuration_resnet.py
+    └── modeling_resnet.py
+```
+
+Python이 `resnet_model`을 모듈로 사용할 수 있도록 감지하는 목적이기 때문에 `__init__.py`는 비어 있을 수 있습니다. 
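필수는 아니지만, 편의를 위해 `__init__.py`에서 클래스를 다시 내보내고 싶다면 다음과 같이 작성할 수도 있습니다. 아래는 이러한 선택 사항을 가정한 간단한 예시입니다:

```py
# resnet_model/__init__.py — 비워 두어도 되지만, 편의상 이렇게 클래스를 노출할 수도 있습니다 (선택 사항)
from .configuration_resnet import ResnetConfig
from .modeling_resnet import ResnetModel, ResnetModelForImageClassification
```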
+
+
+라이브러리에서 모델링 파일을 복사하는 경우,
+모든 파일 상단에 있는 상대 경로 임포트(relative import) 부분을 `transformers` 패키지에서 임포트 하도록 변경해야 합니다.
+
+
+
+기존 구성이나 모델을 재사용(또는 서브 클래스화)할 수 있습니다.
+
+커뮤니티에 모델을 공유하기 위해서는 다음 단계를 따라야 합니다:
+먼저, 새로 만든 파일에 ResNet 모델과 구성을 임포트합니다:
+
+```py
+from resnet_model.configuration_resnet import ResnetConfig
+from resnet_model.modeling_resnet import ResnetModel, ResnetModelForImageClassification
+```
+
+다음으로 `save_pretrained` 메소드를 사용해 해당 객체의 코드 파일을 복사하고,
+복사한 파일을 Auto 클래스로 등록하고(모델인 경우) 실행합니다:
+
+```py
+ResnetConfig.register_for_auto_class()
+ResnetModel.register_for_auto_class("AutoModel")
+ResnetModelForImageClassification.register_for_auto_class("AutoModelForImageClassification")
+```
+
+`configuration`에 대한 auto 클래스를 지정할 필요는 없지만(`configuration` 관련 auto 클래스는 AutoConfig 클래스 하나만 있음), 모델의 경우에는 지정해야 합니다.
+사용자 지정 모델은 다양한 작업에 적합할 수 있으므로, 모델에 맞는 auto 클래스를 지정해야 합니다.
+
+다음으로, 이전에 작업했던 것과 마찬가지로 구성과 모델을 작성합니다:
+
+```py
+resnet50d_config = ResnetConfig(block_type="bottleneck", stem_width=32, stem_type="deep", avg_down=True)
+resnet50d = ResnetModelForImageClassification(resnet50d_config)
+
+pretrained_model = timm.create_model("resnet50d", pretrained=True)
+resnet50d.model.load_state_dict(pretrained_model.state_dict())
+```
+
+이제 모델을 Hub로 업로드하기 위해 로그인 상태인지 확인하세요.
+터미널에서 다음 코드를 실행해 확인할 수 있습니다:
+
+```bash
+huggingface-cli login
+```
+
+주피터 노트북의 경우에는 다음과 같습니다:
+
+```py
+from huggingface_hub import notebook_login
+
+notebook_login()
+```
+
+그런 다음 이렇게 자신의 네임스페이스(또는 자신이 속한 조직)에 업로드할 수 있습니다:
+```py
+resnet50d.push_to_hub("custom-resnet50d")
+```
+
+json 형식의 모델링 가중치와 구성 외에도 `custom-resnet50d` 폴더 안의 모델링과 구성 `.py` 파일을 복사해 Hub에 업로드합니다.
+[모델 저장소](https://huggingface.co/sgugger/custom-resnet50d)에서 결과를 확인할 수 있습니다.
+
+[sharing tutorial](model_sharing) 문서의 `push_to_hub` 메소드에서 자세한 내용을 확인할 수 있습니다.
+
+
+## 사용자 정의 코드로 모델 사용하기[[using-a-model-with-custom-code]]
+
+auto 클래스와 `from_pretrained` 메소드를 사용하여 사용자 지정 코드 파일과 함께 모든 구성, 모델, 토크나이저를 사용할 수 있습니다.
+Hub에 업로드된 모든 파일 및 코드는 맬웨어가 있는지 검사되지만 (자세한 내용은 [Hub 보안](https://huggingface.co/docs/hub/security#malware-scanning) 설명 참조),
+자신의 컴퓨터에서 모델 코드와 작성자가 악성 코드를 실행하지 않는지 확인해야 합니다.
+사용자 정의 코드로 모델을 사용하려면 `trust_remote_code=True`로 설정하세요:
+
+```py
+from transformers import AutoModelForImageClassification
+
+model = AutoModelForImageClassification.from_pretrained("sgugger/custom-resnet50d", trust_remote_code=True)
+```
+
+모델 작성자가 악의적으로 코드를 업데이트하지 않았다는 점을 확인하기 위해, 커밋 해시(commit hash)를 `revision`으로 전달하는 것도 강력히 권장됩니다 (모델 작성자를 완전히 신뢰하지 않는 경우).
+
+```py
+commit_hash = "ed94a7c6247d8aedce4647f00f20de6875b5b292"
+model = AutoModelForImageClassification.from_pretrained(
+    "sgugger/custom-resnet50d", trust_remote_code=True, revision=commit_hash
+)
+```
+
+Hub에서 모델 저장소의 커밋 기록을 찾아볼 때, 모든 커밋의 커밋 해시를 쉽게 복사할 수 있는 버튼이 있습니다.
+
+## 사용자 정의 코드로 만든 모델을 auto 클래스로 등록하기[[registering-a-model-with-custom-code-to-the-auto-classes]]
+
+🤗 Transformers를 상속하는 라이브러리를 작성하는 경우 사용자 정의 모델을 auto 클래스에 추가할 수 있습니다.
+사용자 정의 모델을 사용하기 위해 해당 라이브러리를 임포트해야 하기 때문에, 이는 Hub로 코드를 업로드하는 것과 다릅니다 (Hub에서 자동적으로 모델 코드를 다운로드 하는 것과 반대). 
+ +구성에 기존 모델 유형과 다른 `model_type` 속성이 있고 모델 클래스에 올바른 `config_class` 속성이 있는 한, +다음과 같이 auto 클래스에 추가할 수 있습니다: + +```py +from transformers import AutoConfig, AutoModel, AutoModelForImageClassification + +AutoConfig.register("resnet", ResnetConfig) +AutoModel.register(ResnetConfig, ResnetModel) +AutoModelForImageClassification.register(ResnetConfig, ResnetModelForImageClassification) +``` + +사용자 정의 구성을 [`AutoConfig`]에 등록할 때 사용되는 첫 번째 인수는 사용자 정의 구성의 `model_type`과 일치해야 합니다. +또한, 사용자 정의 모델을 auto 클래스에 등록할 때 사용되는 첫 번째 인수는 해당 모델의 `config_class`와 일치해야 합니다. \ No newline at end of file diff --git a/docs/source/ko/custom_tools.md b/docs/source/ko/custom_tools.md new file mode 100644 index 00000000000000..853d69187f6aaa --- /dev/null +++ b/docs/source/ko/custom_tools.md @@ -0,0 +1,748 @@ + + +# 사용자 정의 도구와 프롬프트[[custom-tools-and-prompts]] + + + +Transformers와 관련하여 어떤 도구와 에이전트가 있는지 잘 모르신다면 [Transformers Agents](transformers_agents) 페이지를 먼저 읽어보시기 바랍니다. + + + + + +Transformers Agents는 실험 중인 API로 언제든지 변경될 수 있습니다. +API 또는 기반 모델이 변경되기 쉽기 때문에 에이전트가 반환하는 결과도 달라질 수 있습니다. + + + +에이전트에게 권한을 부여하고 새로운 작업을 수행하게 하려면 사용자 정의 도구와 프롬프트를 만들고 사용하는 것이 무엇보다 중요합니다. +이 가이드에서는 다음과 같은 내용을 살펴보겠습니다: + +- 프롬프트를 사용자 정의하는 방법 +- 사용자 정의 도구를 사용하는 방법 +- 사용자 정의 도구를 만드는 방법 + +## 프롬프트를 사용자 정의하기[[customizing-the-prompt]] + +[Transformers Agents](transformers_agents)에서 설명한 것처럼 에이전트는 [`~Agent.run`] 및 [`~Agent.chat`] 모드에서 실행할 수 있습니다. +`run`(실행) 모드와 `chat`(채팅) 모드 모두 동일한 로직을 기반으로 합니다. +에이전트를 구동하는 언어 모델은 긴 프롬프트에 따라 조건이 지정되고, 중지 토큰에 도달할 때까지 다음 토큰을 생성하여 프롬프트를 완수합니다. +`chat` 모드에서는 프롬프트가 이전 사용자 입력 및 모델 생성으로 연장된다는 점이 두 모드의 유일한 차이점입니다. +이를 통해 에이전트가 과거 상호작용에 접근할 수 있게 되므로 에이전트에게 일종의 메모리를 제공하는 셈입니다. + +### 프롬프트의 구조[[structure-of-the-prompt]] + +어떻게 프롬프트 사용자 정의를 잘 할 수 있는지 이해하기 위해 프롬프트의 구조를 자세히 살펴봅시다. +프롬프트는 크게 네 부분으로 구성되어 있습니다. + +- 1. 도입: 에이전트가 어떻게 행동해야 하는지, 도구의 개념에 대한 설명. +- 2. 모든 도구에 대한 설명. 이는 런타임에 사용자가 정의/선택한 도구로 동적으로 대체되는 `<>` 토큰으로 정의됩니다. +- 3. 작업 예제 및 해당 솔루션 세트. +- 4. 현재 예제 및 해결 요청. + +각 부분을 더 잘 이해할 수 있도록 짧은 버전을 통해 `run` 프롬프트가 어떻게 보이는지 살펴보겠습니다: + +````text +I will ask you to perform a task, your job is to come up with a series of simple commands in Python that will perform the task. +[...] +You can print intermediate results if it makes sense to do so. + +Tools: +- document_qa: This is a tool that answers a question about a document (pdf). It takes an input named `document` which should be the document containing the information, as well as a `question` that is the question about the document. It returns a text that contains the answer to the question. +- image_captioner: This is a tool that generates a description of an image. It takes an input named `image` which should be the image to the caption and returns a text that contains the description in English. +[...] + +Task: "Answer the question in the variable `question` about the image stored in the variable `image`. The question is in French." + +I will use the following tools: `translator` to translate the question into English and then `image_qa` to answer the question on the input image. + +Answer: +```py +translated_question = translator(question=question, src_lang="French", tgt_lang="English") +print(f"The translated question is {translated_question}.") +answer = image_qa(image=image, question=translated_question) +print(f"The answer is {answer}") +``` + +Task: "Identify the oldest person in the `document` and create an image showcasing the result as a banner." 
+ +I will use the following tools: `document_qa` to find the oldest person in the document, then `image_generator` to generate an image according to the answer. + +Answer: +```py +answer = document_qa(document, question="What is the oldest person?") +print(f"The answer is {answer}.") +image = image_generator("A banner showing " + answer) +``` + +[...] + +Task: "Draw me a picture of rivers and lakes" + +I will use the following +```` + +도입(*"도구:"* 앞의 텍스트)에서는 모델이 어떻게 작동하고 무엇을 해야 하는지 정확하게 설명합니다. +에이전트는 항상 같은 방식으로 작동해야 하므로 이 부분은 사용자 정의할 필요가 없을 가능성이 높습니다. + +두 번째 부분(*"도구"* 아래의 글머리 기호)은 `run` 또는 `chat`을 호출할 때 동적으로 추가됩니다. +정확히 `agent.toolbox`에 있는 도구 수만큼 글머리 기호가 있고, 각 글머리 기호는 도구의 이름과 설명으로 구성됩니다: + +```text +- : +``` + +문서 질의응답 도구를 가져오고 이름과 설명을 출력해서 빠르게 확인해 보겠습니다. + +```py +from transformers import load_tool + +document_qa = load_tool("document-question-answering") +print(f"- {document_qa.name}: {document_qa.description}") +``` + +그러면 다음 결과가 출력됩니다: +```text +- document_qa: This is a tool that answers a question about a document (pdf). It takes an input named `document` which should be the document containing the information, as well as a `question` that is the question about the document. It returns a text that contains the answer to the question. +``` + +여기서 도구 이름이 짧고 정확하다는 것을 알 수 있습니다. +설명은 두 부분으로 구성되어 있는데, 첫 번째 부분에서는 도구의 기능을 설명하고 두 번째 부분에서는 예상되는 입력 인수와 반환 값을 명시합니다. + +에이전트가 도구를 올바르게 사용하려면 좋은 도구 이름과 도구 설명이 매우 중요합니다. +에이전트가 도구에 대해 알 수 있는 유일한 정보는 이름과 설명뿐이므로, 이 두 가지를 정확하게 작성하고 도구 상자에 있는 기존 도구의 스타일과 일치하는지 확인해야 합니다. +특히 이름에 따라 예상되는 모든 인수가 설명에 코드 스타일로 언급되어 있는지, 예상되는 유형과 그 유형이 무엇인지에 대한 설명이 포함되어 있는지 확인하세요. + + + +도구에 어떤 이름과 설명이 있어야 하는지 이해하려면 엄선된 Transformers 도구의 이름과 설명을 확인하세요. +[`Agent.toolbox`] 속성을 가진 모든 도구를 볼 수 있습니다. + + + +세 번째 부분에는 에이전트가 어떤 종류의 사용자 요청에 대해 어떤 코드를 생성해야 하는지 정확하게 보여주는 엄선된 예제 세트가 포함되어 있습니다. +에이전트를 지원하는 대규모 언어 모델은 프롬프트에서 패턴을 인식하고 새로운 데이터로 패턴을 반복하는 데 매우 능숙합니다. +따라서 에이전트가 실제로 올바른 실행 가능한 코드를 생성할 가능성을 극대화하는 방식으로 예제를 작성하는 것이 매우 중요합니다. + +한 가지 예를 살펴보겠습니다: + +````text +Task: "Identify the oldest person in the `document` and create an image showcasing the result as a banner." + +I will use the following tools: `document_qa` to find the oldest person in the document, then `image_generator` to generate an image according to the answer. + +Answer: +```py +answer = document_qa(document, question="What is the oldest person?") +print(f"The answer is {answer}.") +image = image_generator("A banner showing " + answer) +``` + +```` +작업 설명, 에이전트가 수행하려는 작업에 대한 설명, 마지막으로 생성된 코드, 이 세 부분으로 구성된 프롬프트는 모델에 반복하여 제공됩니다. +프롬프트의 일부인 모든 예제는 이러한 정확한 패턴으로 되어 있으므로, 에이전트가 새 토큰을 생성할 때 정확히 동일한 패턴을 재현할 수 있습니다. + +프롬프트 예제는 Transformers 팀이 선별하고 일련의 [problem statements](https://github.com/huggingface/transformers/blob/main/src/transformers/tools/evaluate_agent.py)에 따라 엄격하게 평가하여 +에이전트의 프롬프트가 에이전트의 실제 사용 사례를 최대한 잘 해결할 수 있도록 보장합니다. + +프롬프트의 마지막 부분은 다음에 해당합니다: +```text +Task: "Draw me a picture of rivers and lakes" + +I will use the following +``` + +이는 에이전트가 완료해야 할 최종적인 미완성 예제입니다. 미완성 예제는 실제 사용자 입력에 따라 동적으로 만들어집니다. +위 예시의 경우 사용자가 다음과 같이 실행했습니다: + +```py +agent.run("Draw me a picture of rivers and lakes") +``` + +사용자 입력 - *즉* Task: *"Draw me a picture of rivers and lakes"*가 프롬프트 템플릿에 맞춰 "Task: \n\n I will use the following"로 캐스팅됩니다. +이 문장은 에이전트에게 조건이 적용되는 프롬프트의 마지막 줄을 구성하므로 에이전트가 이전 예제에서 수행한 것과 정확히 동일한 방식으로 예제를 완료하도록 강력하게 영향을 미칩니다. + +너무 자세히 설명하지 않더라도 채팅 템플릿의 프롬프트 구조는 동일하지만 예제의 스타일이 약간 다릅니다. *예를 들면*: + +````text +[...] + +===== + +Human: Answer the question in the variable `question` about the image stored in the variable `image`. 
+ +Assistant: I will use the tool `image_qa` to answer the question on the input image. + +```py +answer = image_qa(text=question, image=image) +print(f"The answer is {answer}") +``` + +Human: I tried this code, it worked but didn't give me a good result. The question is in French + +Assistant: In this case, the question needs to be translated first. I will use the tool `translator` to do this. + +```py +translated_question = translator(question=question, src_lang="French", tgt_lang="English") +print(f"The translated question is {translated_question}.") +answer = image_qa(text=translated_question, image=image) +print(f"The answer is {answer}") +``` + +===== + +[...] +```` + +`run` 프롬프트의 예와는 반대로, 각 `chat` 프롬프트의 예에는 *Human(사람)*과 *Assistant(어시스턴트)* 간에 하나 이상의 교환이 있습니다. 모든 교환은 `run` 프롬프트의 예와 유사한 구조로 되어 있습니다. +사용자의 입력이 *Human:* 뒤에 추가되며, 에이전트에게 코드를 생성하기 전에 수행해야 할 작업을 먼저 생성하라는 메시지가 표시됩니다. +교환은 이전 교환을 기반으로 할 수 있으므로 위와 같이 사용자가 "**이** 코드를 시도했습니다"라고 입력하면 이전에 생성된 에이전트의 코드를 참조하여 과거 교환을 참조할 수 있습니다. + +`.chat`을 실행하면 사용자의 입력 또는 *작업*이 미완성된 양식의 예시로 캐스팅됩니다: +```text +Human: \n\nAssistant: +``` +그러면 에이전트가 이를 완성합니다. `run` 명령과 달리 `chat` 명령은 완료된 예제를 프롬프트에 추가하여 에이전트에게 다음 `chat` 차례에 대한 더 많은 문맥을 제공합니다. + +이제 프롬프트가 어떻게 구성되어 있는지 알았으니 어떻게 사용자 정의할 수 있는지 살펴봅시다! + +### 좋은 사용자 입력 작성하기[[writing-good-user-inputs]] + +대규모 언어 모델이 사용자의 의도를 이해하는 능력이 점점 더 향상되고 있지만, 에이전트가 올바른 작업을 선택할 수 있도록 최대한 정확성을 유지하는 것은 큰 도움이 됩니다. +최대한 정확하다는 것은 무엇을 의미할까요? + +에이전트는 프롬프트에서 도구 이름 목록과 해당 설명을 볼 수 있습니다. +더 많은 도구가 추가될수록 에이전트가 올바른 도구를 선택하기가 더 어려워지고 실행할 도구의 올바른 순서를 선택하는 것은 더욱 어려워집니다. +일반적인 실패 사례를 살펴보겠습니다. 여기서는 분석할 코드만 반환하겠습니다. + +```py +from transformers import HfAgent + +agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder") + +agent.run("Show me a tree", return_code=True) +``` + +그러면 다음 결과가 출력됩니다: + +```text +==Explanation from the agent== +I will use the following tool: `image_segmenter` to create a segmentation mask for the image. + + +==Code generated by the agent== +mask = image_segmenter(image, prompt="tree") +``` + +우리가 원했던 결과가 아닐 수도 있습니다. 대신 나무 이미지가 생성되기를 원할 가능성이 더 높습니다. +따라서 에이전트가 특정 도구를 사용하도록 유도하려면 도구의 이름과 설명에 있는 중요한 키워드를 사용하는 것이 매우 유용할 수 있습니다. 한번 살펴보겠습니다. +```py +agent.toolbox["image_generator"].description +``` + +```text +'This is a tool that creates an image according to a prompt, which is a text description. It takes an input named `prompt` which contains the image description and outputs an image. +``` + +이름과 설명은 "image", "prompt", "create" 및 "generate" 키워드를 사용합니다. 이 단어들을 사용하면 더 잘 작동할 가능성이 높습니다. 프롬프트를 조금 더 구체화해 보겠습니다. + +```py +agent.run("Create an image of a tree", return_code=True) +``` + +이 코드는 다음 프롬프트를 만들어냅니다: +```text +==Explanation from the agent== +I will use the following tool `image_generator` to generate an image of a tree. + + +==Code generated by the agent== +image = image_generator(prompt="tree") +``` + +훨씬 낫네요! 저희가 원했던 것과 비슷해 보입니다. +즉, 에이전트가 작업을 올바른 도구에 올바르게 매핑하는 데 어려움을 겪고 있다면 도구 이름과 설명에서 가장 관련성이 높은 키워드를 찾아보고 이를 통해 작업 요청을 구체화해 보세요. + +### 도구 설명 사용자 정의하기[[customizing-the-tool-descriptions]] + +앞서 살펴본 것처럼 에이전트는 각 도구의 이름과 설명에 액세스할 수 있습니다. +기본 도구에는 매우 정확한 이름과 설명이 있어야 하지만 특정 사용 사례에 맞게 도구의 설명이나 이름을 변경하는 것이 도움이 될 수도 있습니다. +이는 매우 유사한 여러 도구를 추가했거나 특정 도메인(*예*: 이미지 생성 및 변환)에만 에이전트를 사용하려는 경우에 특히 중요해질 수 있습니다. + +일반적인 문제는 이미지 생성 작업에 많이 사용되는 경우 에이전트가 이미지 생성과 이미지 변환/수정을 혼동하는 것입니다. 
*예를 들어,* +```py +agent.run("Make an image of a house and a car", return_code=True) +``` +그러면 다음 결과가 출력됩니다: +```text +==Explanation from the agent== +I will use the following tools `image_generator` to generate an image of a house and `image_transformer` to transform the image of a car into the image of a house. + +==Code generated by the agent== +house_image = image_generator(prompt="A house") +car_image = image_generator(prompt="A car") +house_car_image = image_transformer(image=car_image, prompt="A house") +``` + +결과물이 우리가 여기서 원하는 것과 정확히 일치하지 않을 수 있습니다. 에이전트가 `image_generator`와 `image_transformer`의 차이점을 이해하기 어려워서 두 가지를 함께 사용하는 경우가 많은 것 같습니다. + +여기서 `image_transformer`의 도구 이름과 설명을 변경하여 에이전트가 도울 수 있습니다. +"image" 및 "prompt"와 약간 분리하기 위해 `modifier`라고 대신 부르겠습니다: +```py +agent.toolbox["modifier"] = agent.toolbox.pop("image_transformer") +agent.toolbox["modifier"].description = agent.toolbox["modifier"].description.replace( + "transforms an image according to a prompt", "modifies an image" +) +``` + +이제 "modify"은 새 이미지 프로세서를 사용하라는 강력한 신호이므로 위의 프롬프트에 도움이 될 것입니다. 다시 실행해 봅시다. + +```py +agent.run("Make an image of a house and a car", return_code=True) +``` + +여기서 다음과 같은 결과를 얻게 됩니다: +```text +==Explanation from the agent== +I will use the following tools: `image_generator` to generate an image of a house, then `image_generator` to generate an image of a car. + + +==Code generated by the agent== +house_image = image_generator(prompt="A house") +car_image = image_generator(prompt="A car") +``` + +우리가 염두에 두었던 것과 확실히 더 가까워졌습니다! 하지만 집과 자동차가 모두 같은 이미지에 포함되면 좋겠습니다. 작업을 단일 이미지 생성에 더 집중하면 도움이 될 것입니다: + +```py +agent.run("Create image: 'A house and car'", return_code=True) +``` + +```text +==Explanation from the agent== +I will use the following tool: `image_generator` to generate an image. + + +==Code generated by the agent== +image = image_generator(prompt="A house and car") +``` + + + +에이전트는 여전히 특히 여러 개체의 이미지를 생성하는 것과 같이 약간 더 복잡한 사용 사례에서 취약한 경우가 많습니다. +앞으로 몇 달 안에 에이전트 자체와 기본 프롬프트가 더욱 개선되어 에이전트가 다양한 사용자 입력에 더욱 강력하게 대응할 수 있도록 할 예정입니다. + + + +### 전체 프롬프트 사용자 정의하기[[customizing-the-whole-prompt]] + +사용자에게 최대한의 유연성을 제공하기 위해 [위](#structure-of-the-prompt)에 설명된 전체 프롬프트 템플릿을 사용자가 덮어쓸 수 있습니다. +이 경우 사용자 정의 프롬프트에 소개 섹션, 도구 섹션, 예제 섹션 및 미완성 예제 섹션이 포함되어 있는지 확인하세요. +`run` 프롬프트 템플릿을 덮어쓰려면 다음과 같이 하면 됩니다: + +```py +template = """ [...] """ + +agent = HfAgent(your_endpoint, run_prompt_template=template) +``` + + + +에이전트가 사용 가능한 도구를 인식하고 사용자의 프롬프트를 올바르게 삽입할 수 있도록 `<>` 문자열과 `<>`를 `template` 어딘가에 정의해야 합니다. + + + +마찬가지로 `chat` 프롬프트 템플릿을 덮어쓸 수 있습니다. `chat` 모드에서는 항상 다음과 같은 교환 형식을 사용한다는 점에 유의하세요: + +```text +Human: <> + +Assistant: +``` + +따라서 사용자 정의 `chat` 프롬프트 템플릿의 예제에서도 이 형식을 사용하는 것이 중요합니다. +다음과 같이 인스턴스화 할 때 `chat` 템플릿을 덮어쓸 수 있습니다. + +```python +template = """ [...] """ + +agent = HfAgent(url_endpoint=your_endpoint, chat_prompt_template=template) +``` + + + +에이전트가 사용 가능한 도구를 인식할 수 있도록 `<>` 문자열을 `template` 어딘가에 정의해야 합니다. + + + +두 경우 모두 커뮤니티의 누군가가 호스팅하는 템플릿을 사용하려는 경우 프롬프트 템플릿 대신 저장소 ID를 전달할 수 있습니다. +기본 프롬프트는 [이 저장소](https://huggingface.co/datasets/huggingface-tools/default-prompts)를 예로 들 수 있습니다. + +Hub의 저장소에 사용자 정의 프롬프트를 업로드하여 커뮤니티와 공유하려면 다음을 확인하세요: +- 데이터 세트 저장소를 사용하세요. +- `run` 명령에 대한 프롬프트 템플릿을 `run_prompt_template.txt`라는 파일에 넣으세요. +- `chat` 명령에 대한 프롬프트 템플릿을 `chat_prompt_template.txt`라는 파일에 넣으세요. 
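+
+참고로, 아래는 위 목록의 규칙에 따라 데이터 세트 저장소를 만들고 두 프롬프트 템플릿 파일을 업로드하는 과정을 `huggingface_hub` 라이브러리로 스케치해 본 것입니다. 저장소 ID(`your-username/my-agent-prompts`)와 로컬 파일 경로는 설명을 위해 가정한 값입니다.
+
+```py
+from huggingface_hub import HfApi, create_repo
+
+repo_id = "your-username/my-agent-prompts"  # 설명을 위해 가정한 저장소 ID입니다
+create_repo(repo_id, repo_type="dataset", exist_ok=True)  # 데이터 세트 저장소를 생성합니다
+
+api = HfApi()
+# `run`과 `chat` 프롬프트 템플릿을 약속된 파일 이름으로 각각 업로드합니다
+for filename in ["run_prompt_template.txt", "chat_prompt_template.txt"]:
+    api.upload_file(
+        path_or_fileobj=filename,  # 로컬에 이 이름의 파일이 있다고 가정합니다
+        path_in_repo=filename,
+        repo_id=repo_id,
+        repo_type="dataset",
+    )
+
+# 이후에는 템플릿 문자열 대신 저장소 ID를 전달할 수 있습니다:
+# agent = HfAgent(your_endpoint, run_prompt_template=repo_id, chat_prompt_template=repo_id)
+```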
+ +## 사용자 정의 도구 사용하기[[using-custom-tools]] + +이 섹션에서는 이미지 생성에 특화된 두 가지 기존 사용자 정의 도구를 활용하겠습니다: + +- 더 많은 이미지 수정을 허용하기 위해 [huggingface-tools/image-transformation](https://huggingface.co/spaces/huggingface-tools/image-transformation)을 + [diffusers/controlnet-canny-tool](https://huggingface.co/spaces/diffusers/controlnet-canny-tool)로 대체합니다. +- 기본 도구 상자에 이미지 업스케일링을 위한 새로운 도구가 추가되었습니다: + [diffusers/latent-upscaler-tool](https://huggingface.co/spaces/diffusers/latent-upscaler-tool)가 기존 이미지 변환 도구를 대체합니다. + +편리한 [`load_tool`] 함수를 사용하여 사용자 정의 도구를 가져오는 것으로 시작하겠습니다: + +```py +from transformers import load_tool + +controlnet_transformer = load_tool("diffusers/controlnet-canny-tool") +upscaler = load_tool("diffusers/latent-upscaler-tool") +``` + +에이전트에게 사용자 정의 도구를 추가하면 도구의 설명과 이름이 에이전트의 프롬프트에 자동으로 포함됩니다. +따라서 에이전트가 사용 방법을 이해할 수 있도록 사용자 정의 도구의 설명과 이름을 잘 작성해야 합니다. +`controlnet_transformer`의 설명과 이름을 살펴보겠습니다: + +```py +print(f"Description: '{controlnet_transformer.description}'") +print(f"Name: '{controlnet_transformer.name}'") +``` + +그러면 다음 결과가 출력됩니다: +```text +Description: 'This is a tool that transforms an image with ControlNet according to a prompt. +It takes two inputs: `image`, which should be the image to transform, and `prompt`, which should be the prompt to use to change it. It returns the modified image.' +Name: 'image_transformer' +``` + +이름과 설명이 정확하고 [큐레이팅 된 도구 세트(curated set of tools)](./transformers_agents#a-curated-set-of-tools)의 스타일에 맞습니다. +다음으로, `controlnet_transformer`와 `upscaler`로 에이전트를 인스턴스화해 봅시다: +```py +tools = [controlnet_transformer, upscaler] +agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder", additional_tools=tools) +``` + +이 명령을 실행하면 다음 정보가 표시됩니다: + +```text +image_transformer has been replaced by as provided in `additional_tools` +``` + +큐레이팅된 도구 세트에는 이미 'image_transformer' 도구가 있으며, 이 도구는 사용자 정의 도구로 대체됩니다. + + + +기존 도구와 똑같은 작업에 사용자 정의 도구를 사용하려는 경우 기존 도구를 덮어쓰는 것이 유용할 수 있습니다. +에이전트가 해당 작업에 능숙하기 때문입니다. +이 경우 사용자 정의 도구가 덮어쓴 도구와 정확히 동일한 API를 따라야 하며, 그렇지 않으면 해당 도구를 사용하는 모든 예제가 업데이트되도록 프롬프트 템플릿을 조정해야 한다는 점에 유의하세요. + + + +업스케일러 도구에 지정된 'image_upscaler'라는 이름 아직 기본 도구 상자에는 존재하지 않기 때문에, 도구 목록에 해당 이름이 간단히 추가되었습니다. +에이전트가 현재 사용할 수 있는 도구 상자는 언제든지 `agent.toolbox` 속성을 통해 확인할 수 있습니다: + +```py +print("\n".join([f"- {a}" for a in agent.toolbox.keys()])) +``` + +```text +- document_qa +- image_captioner +- image_qa +- image_segmenter +- transcriber +- summarizer +- text_classifier +- text_qa +- text_reader +- translator +- image_transformer +- text_downloader +- image_generator +- video_generator +- image_upscaler +``` + +에이전트의 도구 상자에 `image_upscaler`가 추가된 점을 주목하세요. + +이제 새로운 도구를 사용해봅시다! [Transformers Agents Quickstart](./transformers_agents#single-execution-run)에서 생성한 이미지를 다시 사용하겠습니다. + +```py +from diffusers.utils import load_image + +image = load_image( + "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rivers_and_lakes.png" +) +``` + + + +이미지를 아름다운 겨울 풍경으로 바꿔 봅시다: + +```py +image = agent.run("Transform the image: 'A frozen lake and snowy forest'", image=image) +``` + +```text +==Explanation from the agent== +I will use the following tool: `image_transformer` to transform the image. + + +==Code generated by the agent== +image = image_transformer(image, prompt="A frozen lake and snowy forest") +``` + + + +새로운 이미지 처리 도구는 이미지를 매우 강력하게 수정할 수 있는 ControlNet을 기반으로 합니다. +기본적으로 이미지 처리 도구는 512x512 픽셀 크기의 이미지를 반환합니다. 이를 업스케일링할 수 있는지 살펴봅시다. 
+ +```py +image = agent.run("Upscale the image", image) +``` + +```text +==Explanation from the agent== +I will use the following tool: `image_upscaler` to upscale the image. + + +==Code generated by the agent== +upscaled_image = image_upscaler(image) +``` + + + +에이전트는 업스케일러 도구의 설명과 이름만 보고 방금 추가한 업스케일러 도구에 "이미지 업스케일링"이라는 프롬프트를 자동으로 매핑하여 올바르게 실행했습니다. + +다음으로 새 사용자 정의 도구를 만드는 방법을 살펴보겠습니다. + +### 새 도구 추가하기[[adding-new-tools]] + +이 섹션에서는 에이전트에게 추가할 수 있는 새 도구를 만드는 방법을 보여 드립니다. + +#### 새 도구 만들기[[creating-a-new-tool]] + +먼저 도구를 만드는 것부터 시작하겠습니다. +특정 작업에 대해 가장 많은 다운로드를 받은 Hugging Face Hub의 모델을 가져오는, 그다지 유용하지는 않지만 재미있는 작업을 추가하겠습니다. + +다음 코드를 사용하면 됩니다: + +```python +from huggingface_hub import list_models + +task = "text-classification" + +model = next(iter(list_models(filter=task, sort="downloads", direction=-1))) +print(model.id) +``` +`text-classification`(텍스트 분류) 작업의 경우 `'facebook/bart-large-mnli'`를 반환하고, `translation`(번역) 작업의 경우 `'google-t5/t5-base'`를 반환합니다. + +이를 에이전트가 활용할 수 있는 도구로 변환하려면 어떻게 해야 할까요? +모든 도구는 필요한 주요 속성을 보유하는 슈퍼클래스 `Tool`에 의존합니다. 이를 상속하는 클래스를 만들어 보겠습니다: + +```python +from transformers import Tool + + +class HFModelDownloadsTool(Tool): + pass +``` + +이 클래스에는 몇 가지 요구사항이 있습니다: +- 도구 자체의 이름에 해당하는 `name` 속성. 수행명이 있는 다른 도구와 호환되도록 `model_download_counter`로 이름을 지정하겠습니다. +- 에이전트의 프롬프트를 채우는 데 사용되는 속성 `description`. +- `inputs` 및 `outputs` 속성. 이를 정의하면 Python 인터프리터가 유형에 대한 정보에 입각한 선택을 하는 데 도움이 되며, + 도구를 허브에 푸시할 때 gradio 데모를 생성할 수 있습니다. + 두 속성 모두 값은 '텍스트', '이미지' 또는 '오디오'가 될 수 있는 예상 값의 리스트입니다. +- 추론 코드가 포함된 `__call__` 메소드. 이것이 우리가 위에서 다루었던 코드입니다! + +이제 클래스의 모습은 다음과 같습니다: + +```python +from transformers import Tool +from huggingface_hub import list_models + + +class HFModelDownloadsTool(Tool): + name = "model_download_counter" + description = ( + "This is a tool that returns the most downloaded model of a given task on the Hugging Face Hub. " + "It takes the name of the category (such as text-classification, depth-estimation, etc), and " + "returns the name of the checkpoint." + ) + + inputs = ["text"] + outputs = ["text"] + + def __call__(self, task: str): + model = next(iter(list_models(filter=task, sort="downloads", direction=-1))) + return model.id +``` + +이제 도구를 손쉽게 사용할 수 있게 되었습니다. +도구를 파일에 저장하고 메인 스크립트에서 가져옵니다. 이 파일의 이름을 `model_downloads.py`로 지정하면 결과적으로 가져오기 코드는 다음과 같습니다: + +```python +from model_downloads import HFModelDownloadsTool + +tool = HFModelDownloadsTool() +``` + +다른 사람들이 이 기능을 활용할 수 있도록 하고 초기화를 더 간단하게 하려면 네임스페이스 아래의 Hub로 푸시하는 것이 좋습니다. +그렇게 하려면 `tool` 변수에서 `push_to_hub`를 호출하면 됩니다: + +```python +tool.push_to_hub("hf-model-downloads") +``` + +이제 허브에 코드가 생겼습니다! 마지막 단계인 에이전트가 코드를 사용하도록 하는 단계를 살펴보겠습니다. + +#### 에이전트가 도구를 사용하게 하기[[Having-the-agent-use-the-tool]] + +이제 이런 식으로 허브에 존재하는 도구를 인스턴스화할 수 있습니다(도구의 사용자 이름은 변경하세요): +We now have our tool that lives on the Hub which can be instantiated as such (change the user name for your tool): + +```python +from transformers import load_tool + +tool = load_tool("lysandre/hf-model-downloads") +``` + +이 도구를 에이전트에서 사용하려면 에이전트 초기화 메소드의 `additional_tools` 매개변수에 전달하기만 하면 됩니다: + +```python +from transformers import HfAgent + +agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder", additional_tools=[tool]) + +agent.run( + "Can you read out loud the name of the model that has the most downloads in the 'text-to-video' task on the Hugging Face Hub?" 
+) +``` +그러면 다음과 같은 결과가 출력됩니다: +```text +==Code generated by the agent== +model = model_download_counter(task="text-to-video") +print(f"The model with the most downloads is {model}.") +audio_model = text_reader(model) + + +==Result== +The model with the most downloads is damo-vilab/text-to-video-ms-1.7b. +``` + +and generates the following audio. + +| **Audio** | +|------------------------------------------------------------------------------------------------------------------------------------------------------| +|
\ No newline at end of file +
diff --git a/docs/source/ko/llm_tutorial.md b/docs/source/ko/llm_tutorial.md new file mode 100644 index 00000000000000..d5e0bd356edd2e --- /dev/null +++ b/docs/source/ko/llm_tutorial.md @@ -0,0 +1,222 @@ + + + +# 대규모 언어 모델로 생성하기 [[generation-with-llms]] + +[[open-in-colab]] + +LLM 또는 대규모 언어 모델은 텍스트 생성의 핵심 구성 요소입니다. 간단히 말하면, 주어진 입력 텍스트에 대한 다음 단어(정확하게는 토큰)를 예측하기 위해 훈련된 대규모 사전 훈련 변환기 모델로 구성됩니다. 토큰을 한 번에 하나씩 예측하기 때문에 새로운 문장을 생성하려면 모델을 호출하는 것 외에 더 복잡한 작업을 수행해야 합니다. 즉, 자기회귀 생성을 수행해야 합니다. + +자기회귀 생성은 몇 개의 초기 입력값을 제공한 후, 그 출력을 다시 모델에 입력으로 사용하여 반복적으로 호출하는 추론 과정입니다. 🤗 Transformers에서는 [`~generation.GenerationMixin.generate`] 메소드가 이 역할을 하며, 이는 생성 기능을 가진 모든 모델에서 사용 가능합니다. + +이 튜토리얼에서는 다음 내용을 다루게 됩니다: + +* LLM으로 텍스트 생성 +* 일반적으로 발생하는 문제 해결 +* LLM을 최대한 활용하기 위한 다음 단계 + +시작하기 전에 필요한 모든 라이브러리가 설치되어 있는지 확인하세요: + +```bash +pip install transformers bitsandbytes>=0.39.0 -q +``` + + +## 텍스트 생성 [[generate-text]] + +[인과적 언어 모델링(causal language modeling)](tasks/language_modeling)을 목적으로 학습된 언어 모델은 일련의 텍스트 토큰을 입력으로 사용하고, 그 결과로 다음 토큰이 나올 확률 분포를 제공합니다. + + +
+
+<figure class="image table text-center m-0 w-full">
+    <figcaption>"LLM의 전방 패스"</figcaption>
+</figure>
+
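+
+다음은 한 번의 전방 패스로 이러한 다음 토큰 확률 분포를 직접 확인해 보는 간단한 스케치입니다. 여기서는 설명을 위해 작은 공개 체크포인트인 `gpt2`를 사용한다고 가정했으며, 다른 인과적 언어 모델에서도 같은 방식으로 동작합니다.
+
+```py
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("gpt2")
+model = AutoModelForCausalLM.from_pretrained("gpt2")
+
+inputs = tokenizer("A list of colors: red, blue", return_tensors="pt")
+with torch.no_grad():
+    logits = model(**inputs).logits  # (배치, 시퀀스 길이, 어휘 크기)
+
+# 마지막 위치의 로짓을 소프트맥스에 통과시키면 "다음 토큰"에 대한 확률 분포가 됩니다
+next_token_probs = torch.softmax(logits[0, -1], dim=-1)
+top_probs, top_ids = next_token_probs.topk(5)
+for token_id, prob in zip(top_ids.tolist(), top_probs.tolist()):
+    print(f"{tokenizer.decode(token_id)!r}: {prob:.3f}")
+```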
+ +LLM과 자기회귀 생성을 함께 사용할 때 핵심적인 부분은 이 확률 분포로부터 다음 토큰을 어떻게 고를 것인지입니다. 다음 반복 과정에 사용될 토큰을 결정하는 한, 어떠한 방법도 가능합니다. 확률 분포에서 가장 가능성이 높은 토큰을 선택하는 것처럼 간단할 수도 있고, 결과 분포에서 샘플링하기 전에 수십 가지 변환을 적용하는 것처럼 복잡할 수도 있습니다. + + +
+
+<figure class="image table text-center m-0 w-full">
+    <figcaption>"자기회귀 생성은 확률 분포에서 다음 토큰을 반복적으로 선택하여 텍스트를 생성합니다."</figcaption>
+</figure>
+
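+
+예를 들어, 가장 단순한 두 가지 선택 방식인 그리디 선택과 샘플링은 다음과 같이 스케치해 볼 수 있습니다. 아래의 확률 값은 설명을 위해 임의로 가정한 분포입니다.
+
+```py
+import torch
+
+# 어휘 크기가 8이라고 가정한 가상의 다음 토큰 확률 분포입니다
+next_token_probs = torch.tensor([0.02, 0.05, 0.40, 0.10, 0.15, 0.08, 0.12, 0.08])
+
+# 1) 그리디 선택: 가장 확률이 높은 토큰을 고릅니다
+greedy_token = torch.argmax(next_token_probs)
+
+# 2) 샘플링: 분포에서 무작위로 토큰을 추출하므로 호출할 때마다 결과가 달라질 수 있습니다
+sampled_token = torch.multinomial(next_token_probs, num_samples=1)
+
+print(greedy_token.item(), sampled_token.item())
+```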
+ +위에서 설명한 과정은 어떤 종료 조건이 충족될 때까지 반복적으로 수행됩니다. 모델이 시퀀스의 끝(EOS 토큰)을 출력할 때까지를 종료 조건으로 하는 것이 이상적입니다. 그렇지 않은 경우에는 미리 정의된 최대 길이에 도달했을 때 생성이 중단됩니다. + +모델이 예상대로 동작하기 위해선 토큰 선택 단계와 정지 조건을 올바르게 설정하는 것이 중요합니다. 이러한 이유로, 각 모델에는 기본 생성 설정이 잘 정의된 [`~generation.GenerationConfig`] 파일이 함께 제공됩니다. + +코드를 확인해봅시다! + + + +기본 LLM 사용에 관심이 있다면, 우리의 [`Pipeline`](pipeline_tutorial) 인터페이스로 시작하는 것을 추천합니다. 그러나 LLM은 양자화나 토큰 선택 단계에서의 미세한 제어와 같은 고급 기능들을 종종 필요로 합니다. 이러한 작업은 [`~generation.GenerationMixin.generate`]를 통해 가장 잘 수행될 수 있습니다. LLM을 이용한 자기회귀 생성은 자원을 많이 소모하므로, 적절한 처리량을 위해 GPU에서 실행되어야 합니다. + + + +먼저, 모델을 불러오세요. + +```python +>>> from transformers import AutoModelForCausalLM + +>>> model = AutoModelForCausalLM.from_pretrained( +... "mistralai/Mistral-7B-v0.1", device_map="auto", load_in_4bit=True +... ) +``` + +`from_pretrained` 함수를 호출할 때 2개의 플래그를 주목하세요: + +- `device_map`은 모델이 GPU로 이동되도록 합니다. +- `load_in_4bit`는 리소스 요구 사항을 크게 줄이기 위해 [4비트 동적 양자화](main_classes/quantization)를 적용합니다. + +이 외에도 모델을 초기화하는 다양한 방법이 있지만, LLM을 처음 시작할 때 이 설정을 추천합니다. + +이어서 텍스트 입력을 [토크나이저](tokenizer_summary)으로 전처리하세요. + +```python +>>> from transformers import AutoTokenizer +>>> import torch + +>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1") +>>> device = "cuda" if torch.cuda.is_available() else "cpu" +>>> model_inputs = tokenizer(["A list of colors: red, blue"], return_tensors="pt").to(device) +``` + +`model_inputs` 변수에는 토큰화된 텍스트 입력과 함께 어텐션 마스크가 들어 있습니다. [`~generation.GenerationMixin.generate`]는 어텐션 마스크가 제공되지 않았을 경우에도 이를 추론하려고 노력하지만, 최상의 성능을 위해서는 가능하면 어텐션 마스크를 전달하는 것을 권장합니다. + +마지막으로 [`~generation.GenerationMixin.generate`] 메소드를 호출해 생성된 토큰을 얻은 후, 이를 출력하기 전에 텍스트 형태로 변환하세요. + +```python +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'A list of colors: red, blue, green, yellow, black, white, and brown' +``` + +이게 전부입니다! 몇 줄의 코드만으로 LLM의 능력을 활용할 수 있게 되었습니다. + + +## 일반적으로 발생하는 문제 [[common-pitfalls]] + +[생성 전략](generation_strategies)이 많고, 기본값이 항상 사용 사례에 적합하지 않을 수 있습니다. 출력이 예상과 다를 때 흔히 발생하는 문제와 이를 해결하는 방법에 대한 목록을 만들었습니다. + +```py +>>> from transformers import AutoModelForCausalLM, AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1") +>>> tokenizer.pad_token = tokenizer.eos_token # Mistral has no pad token by default +>>> model = AutoModelForCausalLM.from_pretrained( +... "mistralai/Mistral-7B-v0.1", device_map="auto", load_in_4bit=True +... ) +``` + +### 생성된 출력이 너무 짧거나 길다 [[generated-output-is-too-shortlong]] + +[`~generation.GenerationConfig`] 파일에서 별도로 지정하지 않으면, `generate`는 기본적으로 최대 20개의 토큰을 반환합니다. `generate` 호출에서 `max_new_tokens`을 수동으로 설정하여 반환할 수 있는 새 토큰의 최대 수를 설정하는 것이 좋습니다. LLM(정확하게는 [디코더 전용 모델](https://huggingface.co/learn/nlp-course/chapter1/6?fw=pt))은 입력 프롬프트도 출력의 일부로 반환합니다. 
+ + +```py +>>> model_inputs = tokenizer(["A sequence of numbers: 1, 2"], return_tensors="pt").to("cuda") + +>>> # By default, the output will contain up to 20 tokens +>>> generated_ids = model.generate(**model_inputs, pad_token_id=tokenizer.eos_token_id) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'A sequence of numbers: 1, 2, 3, 4, 5' + +>>> # Setting `max_new_tokens` allows you to control the maximum length +>>> generated_ids = model.generate(**model_inputs, pad_token_id=tokenizer.eos_token_id, max_new_tokens=50) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'A sequence of numbers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,' +``` + +### 잘못된 생성 모드 [[incorrect-generation-mode]] + +기본적으로 [`~generation.GenerationConfig`] 파일에서 별도로 지정하지 않으면, `generate`는 각 반복에서 가장 확률이 높은 토큰을 선택합니다(그리디 디코딩). 하려는 작업에 따라 이 방법은 바람직하지 않을 수 있습니다. 예를 들어, 챗봇이나 에세이 작성과 같은 창의적인 작업은 샘플링이 적합할 수 있습니다. 반면, 오디오를 텍스트로 변환하거나 번역과 같은 입력 기반 작업은 그리디 디코딩이 더 적합할 수 있습니다. `do_sample=True`로 샘플링을 활성화할 수 있으며, 이 주제에 대한 자세한 내용은 이 [블로그 포스트](https://huggingface.co/blog/how-to-generate)에서 볼 수 있습니다. + +```python +>>> # Set seed or reproducibility -- you don't need this unless you want full reproducibility +>>> from transformers import set_seed +>>> set_seed(0) + +>>> model_inputs = tokenizer(["I am a cat."], return_tensors="pt").to("cuda") + +>>> # LLM + greedy decoding = repetitive, boring output +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'I am a cat. I am a cat. I am a cat. I am a cat' + +>>> # With sampling, the output becomes more creative! +>>> generated_ids = model.generate(**model_inputs, do_sample=True) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'I am a cat.\nI just need to be. I am always.\nEvery time' +``` + +### 잘못된 패딩 [[wrong-padding-side]] + +LLM은 [디코더 전용](https://huggingface.co/learn/nlp-course/chapter1/6?fw=pt) 구조를 가지고 있어, 입력 프롬프트에 대해 지속적으로 반복 처리를 합니다. 입력 데이터의 길이가 다르면 패딩 작업이 필요합니다. LLM은 패딩 토큰에서 작동을 이어가도록 설계되지 않았기 때문에, 입력 왼쪽에 패딩이 추가 되어야 합니다. 그리고 어텐션 마스크도 꼭 `generate` 함수에 전달되어야 합니다! + +```python +>>> # The tokenizer initialized above has right-padding active by default: the 1st sequence, +>>> # which is shorter, has padding on the right side. Generation fails. +>>> model_inputs = tokenizer( +... ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt" +... ).to("cuda") +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids[0], skip_special_tokens=True)[0] +'' + +>>> # With left-padding, it works as expected! +>>> tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b", padding_side="left") +>>> tokenizer.pad_token = tokenizer.eos_token # Llama has no pad token by default +>>> model_inputs = tokenizer( +... ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt" +... ).to("cuda") +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'1, 2, 3, 4, 5, 6,' +``` + + + +## 추가 자료 [[further-resources]] + +자기회귀 생성 프로세스는 상대적으로 단순한 편이지만, LLM을 최대한 활용하려면 여러 가지 요소를 고려해야 하므로 쉽지 않을 수 있습니다. LLM에 대한 더 깊은 이해와 활용을 위한 다음 단계는 아래와 같습니다: + + +### 고급 생성 사용 [[advanced-generate-usage]] + +1. [가이드](generation_strategies)는 다양한 생성 방법을 제어하는 방법, 생성 설정 파일을 설정하는 방법, 출력을 스트리밍하는 방법에 대해 설명합니다. +2. [`~generation.GenerationConfig`]와 [`~generation.GenerationMixin.generate`], [generate-related classes](internal/generation_utils)를 참조해보세요. 
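+
+예를 들어 [`~generation.GenerationConfig`]는 다음과 같이 명시적으로 구성해서 `generate`에 전달할 수 있습니다. 아래는 이 튜토리얼에서 사용한 모델을 그대로 쓴다고 가정한 스케치이며, 설정 값들은 설명을 위한 예시일 뿐입니다.
+
+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
+
+tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
+model = AutoModelForCausalLM.from_pretrained(
+    "mistralai/Mistral-7B-v0.1", device_map="auto", load_in_4bit=True
+)
+
+# 생성 방식을 한곳에 모아 두면 `generate`를 호출할 때마다 인수를 일일이 넘기지 않아도 됩니다
+generation_config = GenerationConfig(
+    max_new_tokens=50,
+    do_sample=True,
+    temperature=0.7,
+    top_p=0.9,
+    pad_token_id=tokenizer.eos_token_id,
+)
+
+model_inputs = tokenizer(["A list of colors: red, blue"], return_tensors="pt").to(model.device)
+generated_ids = model.generate(**model_inputs, generation_config=generation_config)
+print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
+```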
+ +### LLM 리더보드 [[llm-leaderboards]] + +1. [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)는 오픈 소스 모델의 품질에 중점을 둡니다. +2. [Open LLM-Perf Leaderboard](https://huggingface.co/spaces/optimum/llm-perf-leaderboard)는 LLM 처리량에 중점을 둡니다. + +### 지연 시간 및 처리량 [[latency-and-throughput]] + +1. 메모리 요구 사항을 줄이려면, 동적 양자화에 대한 [가이드](main_classes/quantization)를 참조하세요. + +### 관련 라이브러리 [[related-libraries]] + +1. [`text-generation-inference`](https://github.com/huggingface/text-generation-inference)는 LLM을 위한 실제 운영 환경에 적합한 서버입니다. +2. [`optimum`](https://github.com/huggingface/optimum)은 특정 하드웨어 장치에서 LLM을 최적화하기 위해 🤗 Transformers를 확장한 것입니다. diff --git a/docs/source/ko/model_doc/llama.md b/docs/source/ko/model_doc/llama.md new file mode 100644 index 00000000000000..282befac213ddd --- /dev/null +++ b/docs/source/ko/model_doc/llama.md @@ -0,0 +1,117 @@ + + +# LLaMA [[llama]] + +## 개요 [[overview]] + +LLaMA 모델은 Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample에 의해 제안된 [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)에서 소개되었습니다. 이 모델은 7B에서 65B개의 파라미터까지 다양한 크기의 기초 언어 모델을 모아놓은 것입니다. + +논문의 초록은 다음과 같습니다: + +*"LLaMA는 7B에서 65B개의 파라미터 수를 가진 기초 언어 모델의 모음입니다. 우리는 수조 개의 토큰으로 모델을 훈련시켰고, 공개적으로 이용 가능한 데이터셋만을 사용하여 최고 수준의 모델을 훈련시킬 수 있음을 보여줍니다. 특히, LLaMA-13B 모델은 대부분의 벤치마크에서 GPT-3 (175B)를 능가하며, LLaMA-65B는 최고 수준의 모델인 Chinchilla-70B와 PaLM-540B에 버금가는 성능을 보입니다. 우리는 모든 모델을 연구 커뮤니티에 공개합니다."* + +팁: + +- LLaMA 모델의 가중치는 [이 양식](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform?usp=send_form)을 작성하여 얻을 수 있습니다. +- 가중치를 다운로드한 후에는 이를 [변환 스크립트](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py)를 사용하여 Hugging Face Transformers 형식으로 변환해야합니다. 변환 스크립트를 실행하려면 아래의 예시 명령어를 참고하세요: + +```bash +python src/transformers/models/llama/convert_llama_weights_to_hf.py \ + --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path +``` + +- 변환을 하였다면 모델과 토크나이저는 다음과 같이 로드할 수 있습니다: + +```python +from transformers import LlamaForCausalLM, LlamaTokenizer + +tokenizer = LlamaTokenizer.from_pretrained("/output/path") +model = LlamaForCausalLM.from_pretrained("/output/path") +``` + +스크립트를 실행하기 위해서는 모델을 float16 정밀도로 전부 로드할 수 있을 만큼의 충분한 CPU RAM이 필요합니다. (가장 큰 버전의 모델이 여러 체크포인트로 나뉘어 있더라도, 각 체크포인트는 모델의 각 가중치의 일부를 포함하고 있기 때문에 모든 체크포인트를 RAM에 로드해야 합니다) 65B 모델의 경우, 총 130GB의 RAM이 필요합니다. + + +- LLaMA 토크나이저는 [sentencepiece](https://github.com/google/sentencepiece)를 기반으로 하는 BPE 모델입니다. sentencepiece의 특징 중 하나는 시퀀스를 디코딩할 때 첫 토큰이 단어의 시작이라면 (예를 들어 "Banana"), 토크나이저는 문자열 앞에 공백을 추가하지 않는다는 것입니다. + +이 모델은 [BlackSamorez](https://huggingface.co/BlackSamorez)의 기여와 함께, [zphang](https://huggingface.co/zphang)에 의해 제공되었습니다. Hugging Face에서의 구현 코드는 GPT-NeoX를 기반으로 하며 [여기](https://github.com/EleutherAI/gpt-neox)에서 찾을 수 있고, 저자의 코드 원본은 [여기](https://github.com/facebookresearch/llama)에서 확인할 수 있습니다. + + +원래 LLaMA 모델을 기반으로 Meta AI에서 몇 가지 후속 작업을 발표했습니다: + +- **Llama2**: Llama2는 구조적인 몇 가지 수정(Grouped Query Attention)을 통해 개선된 버전이며, 2조 개의 토큰으로 사전 훈련이 되어 있습니다. Llama2에 대한 자세한 내용은 [이 문서](llama2)를 참고하세요. + +## 리소스 [[resources]] + +LLaMA를 시작하는 데 도움이 될 Hugging Face 및 커뮤니티(🌎로 표시)의 공식 자료 목록입니다. 여기에 자료를 제출하고 싶다면 Pull Request를 올려주세요! 추가할 자료는 기존의 자료와 중복되지 않고 새로운 내용을 보여주는 것이 좋습니다. 
+ + + +- LLaMA 모델을 텍스트 분류 작업에 적용하기 위한 프롬프트 튜닝 방법에 대한 [노트북](https://colab.research.google.com/github/bigscience-workshop/petals/blob/main/examples/prompt-tuning-sst2.ipynb#scrollTo=f04ba4d2) 🌎 + + + +- [Stack Exchange](https://stackexchange.com/)에서 질문에 답하는 LLaMA를 훈련하는 방법을 위한 [StackLLaMA: RLHF로 LLaMA를 훈련하는 실전 가이드](https://huggingface.co/blog/stackllama#stackllama-a-hands-on-guide-to-train-llama-with-rlhf) 🌎 + +⚗️ 최적화 +- 제한된 메모리를 가진 GPU에서 xturing 라이브러리를 사용하여 LLaMA 모델을 미세 조정하는 방법에 대한 [노트북](https://colab.research.google.com/drive/1SQUXq1AMZPSLD4mk3A3swUIc6Y2dclme?usp=sharing) 🌎 + +⚡️ 추론 +- 🤗 PEFT 라이브러리의 PeftModel을 사용하여 LLaMA 모델을 실행하는 방법에 대한 [노트북](https://colab.research.google.com/github/DominguesM/alpaca-lora-ptbr-7b/blob/main/notebooks/02%20-%20Evaluate.ipynb) 🌎 +- LangChain을 사용하여 PEFT 어댑터 LLaMA 모델을 로드하는 방법에 대한 [노트북](https://colab.research.google.com/drive/1l2GiSSPbajVyp2Nk3CFT4t3uH6-5TiBe?usp=sharing) 🌎 + +🚀 배포 +- 🤗 PEFT 라이브러리와 사용자 친화적인 UI로 LLaMA 모델을 미세 조정하는 방법에 대한 [노트북](https://colab.research.google.com/github/lxe/simple-llama-finetuner/blob/master/Simple_LLaMA_FineTuner.ipynb#scrollTo=3PM_DilAZD8T) 🌎 +- Amazon SageMaker에서 텍스트 생성을 위해 Open-LLaMA 모델을 배포하는 방법에 대한 [노트북](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/jumpstart-foundation-models/text-generation-open-llama.ipynb) 🌎 + +## LlamaConfig [[llamaconfig]] + +[[autodoc]] LlamaConfig + + +## LlamaTokenizer [[llamatokenizer]] + +[[autodoc]] LlamaTokenizer + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - save_vocabulary + +## LlamaTokenizerFast [[llamatokenizerfast]] + +[[autodoc]] LlamaTokenizerFast + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - update_post_processor + - save_vocabulary + +## LlamaModel [[llamamodel]] + +[[autodoc]] LlamaModel + - forward + + +## LlamaForCausalLM [[llamaforcausallm]] + +[[autodoc]] LlamaForCausalLM + - forward + +## LlamaForSequenceClassification [[llamaforsequenceclassification]] + +[[autodoc]] LlamaForSequenceClassification + - forward diff --git a/docs/source/ko/model_doc/llama2.md b/docs/source/ko/model_doc/llama2.md new file mode 100644 index 00000000000000..5290f2bb7b6f8d --- /dev/null +++ b/docs/source/ko/model_doc/llama2.md @@ -0,0 +1,129 @@ + + +# Llama2 [[llama2]] + +## 개요 [[overview]] + +Llama2 모델은 Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Ya1smine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing EllenTan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom의 논문 [LLaMA: Open Foundation and Fine-Tuned Chat 
Models](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/)에서 제안되었습니다. 채팅 어플리케이션에 맞게 미세 조정된 체크포인트를 포함된 7B에서 70B 범위의 매개변수를 가진 기초 언어 모델 모음입니다! + +논문의 초록은 다음과 같습니다: + +*이 연구에서 우리는 70억에서 700억 파라미터의 범위에서 사전 훈련 및 미세 조정된 대규모 언어 모델(LLMs)의 모음인 Llama 2를 개발 및 공개합니다. Llama 2-Chat라고 불리는 미세 조정된 LLMs은 대화 사용 사례에 최적화되었습니다. 우리의 모델은 테스트한 대부분의 벤치마크에서 오픈 소스 채팅 모델보다 성능이 뛰어나며, 유용성과 안전성에 대한 인적 평가를 바탕으로 비공개 소스 모델을 대체할 수 있는 적절한 대안이 될 수 있습니다. 우리는 Llama 2-Chat의 미세 조정 및 안전성 향상의 접근 방식에 대한 자세한 설명을 제공하여 커뮤니티가 우리의 작업을 기반으로 LLMs의 책임있는 개발에 기여할 수 있도록 합니다.* + +[여기](https://huggingface.co/models?search=llama2)에서 모든 Llama2 모델을 확인할 수 있습니다. + + + +`Llama2` 모델은 `bfloat16`을 사용하여 훈련되었지만, 원래 추론은 `float16`을 사용합니다. 허브에 업로드된 체크포인트는 `torch_dtype = 'float16'`을 사용하며, 이는 `AutoModel` API에 의해 체크포인트를 `torch.float32`에서 `torch.float16`으로 캐스팅하는 데 사용됩니다. + +온라인 가중치의 `dtype`은 `model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto")`를 사용하여 모델을 초기화할 때 `torch_dtype="auto"`를 사용하지 않는 한 대부분 관련이 없습니다. 그 이유는 모델이 먼저 다운로드될 것이고 (온라인 체크포인트의 `dtype`을 사용하여) 그다음에 기본 `dtype`인 `torch`로 캐스팅하고(`torch.float32`가 됨), 마지막으로 구성(configuration)에서 제공된 `torch_dtype`이 있는 경우 이를 사용하기 때문입니다. + +모델을 `float16`에서 훈련하는 것은 권장되지 않으며 `nan`을 생성하는 것으로 알려져 있습니다. 따라서 모델은 `bfloat16`에서 훈련되어야 합니다. + + + +🍯 팁: + +- Llama2 모델의 가중치는 [이 양식](https://ai.meta.com/resources/models-and-libraries/llama-downloads/)을 작성하여 얻을 수 있습니다. +- 아키텍처는 처음 버전의 Llama와 매우 유사하며, [이 논문](https://arxiv.org/pdf/2305.13245.pdf)의 내용에 따라 Grouped Query Attention (GQA)이 추가되었습니다. +- `config.pretraining_tp`를 1과 다른 값으로 설정하면 더 정확하지만 느린 선형 레이어 계산이 활성화되어 원본 로짓과 더 잘 일치하게 됩니다. +- 원래 모델은 `pad_id = -1`을 사용하는데, 이는 패딩 토큰이 없음을 의미합니다. 동일한 로직을 사용할 수 없으므로 `tokenizer.add_special_tokens({"pad_token":""})`를 사용하여 패딩 토큰을 추가하고 이에 따라 토큰 임베딩 크기를 조정해야 합니다. 또한 `model.config.pad_token_id`를 설정해야 합니다. 모델의 `embed_tokens` 레이어는 `self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.config.padding_idx)`로 초기화되어, 패딩 토큰 인코딩이 0을 출력하도록 합니다. 따라서 초기화 시에 전달하는 것을 권장합니다. +- 양식을 작성하고 모델 체크포인트 접근 권한을 얻은 후에는 이미 변환된 체크포인트를 사용할 수 있습니다. 그렇지 않고 자신의 모델을 직접 변환하려는 경우, [변환 스크립트](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py)를 자유롭게 사용하세요. 스크립트는 다음과 같은 예시의 명령어로 호출할 수 있습니다: + +```bash +python src/transformers/models/llama/convert_llama_weights_to_hf.py \ + --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path +``` + +- 변환 후 모델과 토크나이저는 다음과 같이 로드할 수 있습니다: + +```python +from transformers import LlamaForCausalLM, LlamaTokenizer + +tokenizer = LlamaTokenizer.from_pretrained("/output/path") +model = LlamaForCausalLM.from_pretrained("/output/path") +``` + +스크립트를 실행하려면 모델을 float16 정밀도로 전부 호스트할 수 있을 만큼 충분한 CPU RAM이 필요합니다 (가장 큰 버전이 여러 체크포인트로 제공되더라도 각 체크포인트는 모델 가중치의 일부만을 포함하므로 모두 RAM에 로드해야 합니다). 75B 모델의 경우, 총 145GB의 RAM이 필요합니다. + +- LLaMA 토크나이저는 [sentencepiece](https://github.com/google/sentencepiece)를 기반으로 한 BPE 모델입니다. sentencepiece의 특징 중 하나는 시퀀스를 디코딩할 때 첫 번째 토큰이 단어의 시작이면 (예: "Banana") 토크나이저는 문자열 앞에 접두사 공간을 추가하지 않는 것입니다. + +이 모델은 [Arthur Zucker](https://huggingface.co/ArthurZ)가 [Lysandre Debut](https://huggingface.co/lysandre)의 도움을 받아 제공하였습니다. Hugging Face에서의 구현 코드는 [여기](https://github.com/EleutherAI/gpt-neox)의 GPT-NeoX 를 기반으로 합니다. 저자의 원래 코드는 [여기](https://github.com/facebookresearch/llama)에서 찾을 수 있습니다. + +## 리소스 [[resources]] + +LLaMA2를 시작하는 데 도움이 될 Hugging Face의 공식 및 커뮤니티(🌎로 표시) 리소스 목록입니다. 여기에 새로운 리소스를 추가하기 위해서 Pull Request를 열어 주시면 검토하겠습니다! 리소스는 기존 리소스와 중복되지 않는 새로운 것을 보여주는 것이 이상적입니다. 
+ +- [Llama 2 is here - get it on Hugging Face](https://huggingface.co/blog/llama2), Llama 2에 관한 블로그 포스트와 🤗 Transformers 및 🤗 PEFT와 함께 사용하는 방법에 대한 내용입니다. +- [LLaMA 2 - Every Resource you need](https://www.philschmid.de/llama-2), LLaMA 2에 대해 알아보고 빠르게 시작하는 데 필요한 관련 리소스의 모음입니다. + + + +- Google Colab에서 QLoRA와 4-bit 정밀도를 사용하여 Llama 2를 미세 조정하는 방법에 대한 [노트북](https://colab.research.google.com/drive/1PEQyJO1-f6j0S_XJ8DV50NkpzasXkrzd?usp=sharing)입니다. 🌎 +- "Llama-v2-7b-guanaco" 모델을 4-bit QLoRA로 미세 조정하고 PDF에서 Q&A 데이터셋을 생성하는 방법에 대한 [노트북](https://colab.research.google.com/drive/134o_cXcMe_lsvl15ZE_4Y75Kstepsntu?usp=sharing)입니다. 🌎 + +⚗️ 최적화 +- [Llama 2를 DPO로 미세 조정하기](https://huggingface.co/blog/dpo-trl), TRL 라이브러리의 DPO 방법을 사용하여 특정 데이터셋에서 Llama 2를 미세 조정하는 방법을 안내하는 가이드입니다. +- [확장 가이드: Llama 2 명령어 조정](https://www.philschmid.de/instruction-tune-llama-2), 입력에서 명령어를 생성하도록 Llama 2를 훈련시키는 방법을 안내하는 가이드로, 명령어를 따르는 모델에서 명령어를 주는 모델로 변환합니다. +- 개인 컴퓨터에서 QLoRA와 TRL을 사용하여 Llama 2 모델을 미세 조정하는 방법에 대한 [노트북](https://colab.research.google.com/drive/1SYpgFpcmtIUzdE7pxqknrM4ArCASfkFQ?usp=sharing)입니다. 🌎 + +⚡️ 추론 +- AutoGPTQ 라이브러리의 GPTQ를 사용하여 Llama 2 모델을 양자화하는 방법에 대한 [노트북](https://colab.research.google.com/drive/1TC56ArKerXUpbgRy5vM3woRsbTEVNq7h?usp=sharing)입니다. 🌎 +- 로컬 컴퓨터나 Google Colab에서 4-bit 양자화로 Llama 2 채팅 모델을 실행하는 방법에 대한 [노트북](https://colab.research.google.com/drive/1X1z9Q6domMKl2CnEM0QGHNwidLfR4dW2?usp=sharing)입니다. 🌎 + +🚀 배포 +- [Amazon SageMaker에서 LLaMA 2 (7-70B) 미세 조정하기](https://www.philschmid.de/sagemaker-llama2-qlora), Amazon SageMaker에서 QLoRA 미세 조정 및 배포에 이르기까지의 완전한 가이드입니다. +- [Amazon SageMaker에서 Llama 2 7B/13B/70B 배포하기](https://www.philschmid.de/sagemaker-llama-llm), 안전하고 확장 가능한 배포를 위해 Hugging Face의 LLM DLC 컨테이너를 사용하는 방법에 대한 가이드입니다. + + +## LlamaConfig [[llamaconfig]] + +[[autodoc]] LlamaConfig + + +## LlamaTokenizer [[llamatokenizer]] + +[[autodoc]] LlamaTokenizer + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - save_vocabulary + +## LlamaTokenizerFast [[llamatokenizerfast]] + +[[autodoc]] LlamaTokenizerFast + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - update_post_processor + - save_vocabulary + +## LlamaModel [[llamamodel]] + +[[autodoc]] LlamaModel + - forward + + +## LlamaForCausalLM [[llamaforcausallm]] + +[[autodoc]] LlamaForCausalLM + - forward + +## LlamaForSequenceClassification [[llamaforsequenceclassification]] + +[[autodoc]] LlamaForSequenceClassification + - forward diff --git a/docs/source/ko/model_doc/whisper.md b/docs/source/ko/model_doc/whisper.md new file mode 100644 index 00000000000000..f48bae1e60f5d0 --- /dev/null +++ b/docs/source/ko/model_doc/whisper.md @@ -0,0 +1,128 @@ + + +# Whisper [[whisper]] + +## 개요 [[overview]] + +Whisper 모델은 Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever에 의해 [Robust Speech Recognition via Large-Scale Weak Supervision](https://cdn.openai.com/papers/whisper.pdf)에서 제안되었습니다. + +논문의 초록은 다음과 같습니다: + +*우리는 인터넷에서 대량의 오디오를 글로 옮긴 것을 예측하도록 간단히 훈련된 음성 처리 시스템의 성능을 연구합니다. 68만 시간의 다국어 및 다중 작업 지도(multitask supervision)에 확장했을 때, 결과 모델은 표준 벤치마크에 잘 일반화되며, 미세 조정이 필요 없는 제로샷 전송 설정에서 이전의 완전히 지도된(fully-supervised) 결과와 경쟁할 수 있는 경우가 많습니다. 사람과 비교하면, 이 모델은 사람의 정확도와 견고성에 근접합니다. 우리는 강력한 음성 처리를 위한 추가 작업의 기반이 될 모델과 추론 코드를 공개합니다.* + + + +팁: + +- 이 모델은 일반적으로 별도의 미세 조정 없이도 잘 작동합니다. +- 아키텍처는 고전적인 인코더-디코더 아키텍처를 따르기 때문에, 추론을 위해 [`~generation.GenerationMixin.generate`] 함수를 사용합니다. +- 현재 추론은 짧은 형식에만 구현되어 있으며, 오디오는 30초 미만의 세그먼트로 미리 분할되어야 합니다. 
타임스탬프를 포함한 긴 형식에 대한 추론은 향후 릴리스에서 구현될 예정입니다. +- [`WhisperProcessor`]를 사용하여 모델에 사용할 오디오를 준비하고, 예측된 ID를 텍스트로 디코딩할 수 있습니다. + +- 모델과 프로세서를 변환하려면 다음을 사용하는 것이 좋습니다: + +```bash +python src/transformers/models/whisper/convert_openai_to_hf.py --checkpoint_path "" --pytorch_dump_folder_path "Arthur/whisper-3" --convert_preprocessor True +``` +스크립트는 OpenAI 체크포인트에서 필요한 모든 매개변수를 자동으로 결정합니다. OpenAI 변환을 수행하려면 `tiktoken` 라이브러리를 설치해야 합니다. +라이브러리를 설치해야 OpenAI 토큰화기를 `tokenizers` 버전으로 변환할 수 있습니다. + +이 모델은 [Arthur Zucker](https://huggingface.co/ArthurZ)에 의해 제공되었습니다. 이 모델의 Tensorflow 버전은 [amyeroberts](https://huggingface.co/amyeroberts)에 의해 제공되었습니다. +원본 코드는 [여기](https://github.com/openai/whisper)에서 찾을 수 있습니다. + + + +## WhisperConfig [[whisperconfig]] + +[[autodoc]] WhisperConfig + +## WhisperTokenizer [[whispertokenizer]] + +[[autodoc]] WhisperTokenizer + - set_prefix_tokens + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - save_vocabulary + +## WhisperTokenizerFast [[whispertokenizerfast]] + +[[autodoc]] WhisperTokenizerFast + - set_prefix_tokens + - build_inputs_with_special_tokens + - get_special_tokens_mask + - create_token_type_ids_from_sequences + - save_vocabulary + +## WhisperFeatureExtractor [[whisperfeatureextractor]] + +[[autodoc]] WhisperFeatureExtractor + - __call__ + +## WhisperProcessor [[whisperprocessor]] + +[[autodoc]] WhisperProcessor + - __call__ + - from_pretrained + - save_pretrained + - batch_decode + - decode + +## WhisperModel [[whispermodel]] + +[[autodoc]] WhisperModel + - forward + - _mask_input_features + +## WhisperForConditionalGeneration [[whisperforconditionalgeneration]] + +[[autodoc]] WhisperForConditionalGeneration + - forward + +## WhisperForAudioClassification [[whisperforaudioclassification]] + +[[autodoc]] WhisperForAudioClassification + - forward + + + +## TFWhisperModel [[tfwhispermodel]] + +[[autodoc]] TFWhisperModel + - call + +## TFWhisperForConditionalGeneration [[tfwhisperforconditionalgeneration]] + +[[autodoc]] TFWhisperForConditionalGeneration + - call + + +## FlaxWhisperModel [[flaxwhispermodel]] + +[[autodoc]] FlaxWhisperModel + - __call__ + +## FlaxWhisperForConditionalGeneration [[flaxwhisperforconditionalgeneration]] + +[[autodoc]] FlaxWhisperForConditionalGeneration + - __call__ + +## FlaxWhisperForAudioClassification [[flaxwhisperforaudioclassification]] + +[[autodoc]] FlaxWhisperForAudioClassification + - __call__ + diff --git a/docs/source/ko/model_memory_anatomy.md b/docs/source/ko/model_memory_anatomy.md new file mode 100644 index 00000000000000..5701e19aaa085d --- /dev/null +++ b/docs/source/ko/model_memory_anatomy.md @@ -0,0 +1,242 @@ + + +# 모델 학습 해부하기 [[model-training-anatomy]] + +모델 훈련 속도와 메모리 활용의 효율성을 향상시키기 위해 적용할 수 있는 성능 최적화 기술을 이해하려면 GPU가 훈련 중에 어떻게 활용되는지, 그리고 수행되는 연산에 따라 연산 강도가 어떻게 변하는지에 익숙해져야 합니다. + +먼저 GPU 활용과 모델 훈련 실행에 대한 예시를 살펴보겠습니다. 데모를 위해 몇몇 라이브러리를 설치해야 합니다: + +```bash +pip install transformers datasets accelerate nvidia-ml-py3 +``` + +`nvidia-ml-py3` 라이브러리는 Python 내부에서 모델의 메모리 사용량을 모니터링할 수 있게 해줍니다. 터미널의 `nvidia-smi` 명령어에 익숙할 수 있는데, 이 라이브러리는 Python에서 직접 동일한 정보에 접근할 수 있게 해줍니다. + +그 다음, 100과 30000 사이의 무작위 토큰 ID와 분류기를 위한 이진 레이블인 더미 데이터를 생성합니다. +길이가 각각 512인 총 512개의 시퀀스를 가져와 PyTorch 형식의 [`~datasets.Dataset`]에 저장합니다. + + +```py +>>> import numpy as np +>>> from datasets import Dataset + + +>>> seq_len, dataset_size = 512, 512 +>>> dummy_data = { +... "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)), +... "labels": np.random.randint(0, 1, (dataset_size)), +... 
} +>>> ds = Dataset.from_dict(dummy_data) +>>> ds.set_format("pt") +``` + +GPU 활용 및 [`Trainer`]로 실행한 훈련 과정에 대한 요약 통계를 출력하기 위해 두 개의 도우미 함수를 정의하겠습니다: + +```py +>>> from pynvml import * + + +>>> def print_gpu_utilization(): +... nvmlInit() +... handle = nvmlDeviceGetHandleByIndex(0) +... info = nvmlDeviceGetMemoryInfo(handle) +... print(f"GPU memory occupied: {info.used//1024**2} MB.") + + +>>> def print_summary(result): +... print(f"Time: {result.metrics['train_runtime']:.2f}") +... print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}") +... print_gpu_utilization() +``` + +시작할 때 GPU 메모리가 비어 있는지 확인해 봅시다: + +```py +>>> print_gpu_utilization() +GPU memory occupied: 0 MB. +``` + +좋습니다. 모델을 로드하기 전에는 예상대로 GPU 메모리가 점유되지 않았습니다. 그렇지 않다면 사용자의 기기에서 GPU 메모리를 사용하는 모든 프로세스를 중단해야 합니다. 그러나 사용자는 모든 여유 GPU 메모리를 사용할 수는 없습니다. 모델이 GPU에 로드될 때 커널도 로드되므로 1-2GB의 메모리를 차지할 수 있습니다. 얼마나 되는지 확인하기 위해 GPU에 작은 텐서를 로드하여 커널이 로드되도록 트리거합니다. + +```py +>>> import torch + + +>>> torch.ones((1, 1)).to("cuda") +>>> print_gpu_utilization() +GPU memory occupied: 1343 MB. +``` + +커널만으로도 GPU 메모리의 1.3GB를 차지합니다. 이제 모델이 얼마나 많은 공간을 사용하는지 확인해 보겠습니다. + +## 모델 로드 [[load-model]] + +우선, `google-bert/bert-large-uncased` 모델을 로드합니다. 모델의 가중치를 직접 GPU에 로드해서 가중치만이 얼마나 많은 공간을 차지하는지 확인할 수 있습니다. + + +```py +>>> from transformers import AutoModelForSequenceClassification + + +>>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-large-uncased").to("cuda") +>>> print_gpu_utilization() +GPU memory occupied: 2631 MB. +``` + +모델의 가중치만으로도 GPU 메모리를 1.3 GB 차지하는 것을 볼 수 있습니다. 정확한 숫자는 사용하는 GPU에 따라 다릅니다. 최신 GPU에서는 모델 사용 속도를 높이는 최적화된 방식으로 가중치가 로드되므로, 모델이 더 많은 공간을 차지할 수 있습니다. 이제 `nvidia-smi` CLI와 동일한 결과를 얻는지 빠르게 확인할 수 있습니다: + + +```bash +nvidia-smi +``` + +```bash +Tue Jan 11 08:58:05 2022 ++-----------------------------------------------------------------------------+ +| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 | +|-------------------------------+----------------------+----------------------+ +| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | +| | | MIG M. | +|===============================+======================+======================| +| 0 Tesla V100-SXM2... On | 00000000:00:04.0 Off | 0 | +| N/A 37C P0 39W / 300W | 2631MiB / 16160MiB | 0% Default | +| | | N/A | ++-------------------------------+----------------------+----------------------+ + ++-----------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=============================================================================| +| 0 N/A N/A 3721 C ...nvs/codeparrot/bin/python 2629MiB | ++-----------------------------------------------------------------------------+ +``` + +이전과 동일한 숫자가 출력되고 16GB 메모리를 가진 V100 GPU를 사용하고 있다는 것도 볼 수 있습니다. 그러므로 이제 모델 훈련을 시작하여 GPU 메모리 사용량이 어떻게 달라지는지 볼 수 있습니다. 우선 몇몇 표준 훈련 인수를 설정합니다: + +```py +default_args = { + "output_dir": "tmp", + "evaluation_strategy": "steps", + "num_train_epochs": 1, + "log_level": "error", + "report_to": "none", +} +``` + + + +여러 실험을 실행할 계획이라면, 실험 간에 메모리를 제대로 비우기 위해서 Python 커널을 실험 사이마다 재시작해야 합니다. 
+ + + +## 기본 훈련에서의 메모리 활용 [[memory-utilization-at-vanilla-training]] + +[`Trainer`]를 사용하여, GPU 성능 최적화 기술을 사용하지 않고 배치 크기가 4인 모델을 훈련시키겠습니다: + +```py +>>> from transformers import TrainingArguments, Trainer, logging + +>>> logging.set_verbosity_error() + + +>>> training_args = TrainingArguments(per_device_train_batch_size=4, **default_args) +>>> trainer = Trainer(model=model, args=training_args, train_dataset=ds) +>>> result = trainer.train() +>>> print_summary(result) +``` + +``` +Time: 57.82 +Samples/second: 8.86 +GPU memory occupied: 14949 MB. +``` + +우리는 비교적 작은 배치 크기로도 전체 GPU 메모리를 거의 다 차지하는 것을 볼 수 있습니다. 그러나 배치 크기가 클수록 모델 수렴 속도가 빨라지고 최종 성능이 향상되는 경우가 많습니다. 그래서 이상적으로는 GPU 제한이 아닌 우리 모델의 요구사항에 맞게 배치 크기를 조정하려고 합니다. 흥미롭게도 우리는 모델의 크기보다 훨씬 더 많은 메모리를 사용합니다. 왜 이런 현상이 발생하는지 조금 더 잘 이해하기 위해 모델의 연산과 메모리 요구 사항을 살펴보겠습니다. + +## 모델의 연산 해부하기 [[anatomy-of-models-operations]] + +트랜스포머 아키텍처에는 연산 강도(compute-intensity)에 따라 그룹화된 3가지 주요 연산 그룹이 있습니다. + +1. **텐서 축약(Tensor Contractions)** + + 선형 레이어와 멀티헤드 어텐션의 구성 요소는 모두 **행렬-행렬 곱셈(matrix-matrix multiplications)**을 일괄적으로 처리합니다. 이 연산은 트랜스포머 훈련에서 가장 연산 강도가 높은 부분입니다. + +2. **통계 정규화(Statistical Normalizations)** + + 소프트맥스와 레이어 정규화는 텐서 축약보다 연산 강도가 낮습니다. 하나 이상의 **감소 연산(reduction operations)**을 포함하며, 그 결과는 map을 통해 적용됩니다. + +3. **원소별 연산자(Element-wise Operators)** + + 그 외 연산자들, **편향(biases), 드롭아웃(dropout), 활성화 함수(activations), 잔차 연결(residual connections)**이 여기에 해당합니다. 이 연산들은 연산 강도가 가장 낮습니다. + +이러한 지식은 성능 병목 현상을 분석할 때 도움이 될 수 있습니다. + +이 내용은 [Data Movement Is All You Need: A Case Study on Optimizing Transformers 2020](https://arxiv.org/abs/2007.00072)을 참고하였습니다. + + +## 모델의 메모리 구조 [[anatomy-of-models-memory]] + +모델을 훈련시키는 데는 단순히 GPU에 모델을 올리는 것보다 훨씬 더 많은 메모리를 사용한다는 것을 보았습니다. 이는 훈련 중 GPU 메모리를 사용하는 많은 구성 요소가 있기 때문입니다. GPU 메모리의 구성 요소는 다음과 같습니다: + +1. 모델 가중치 +2. 옵티마이저 상태 +3. 그라디언트 +4. 그라디언트 계산을 위해 저장된 순방향 활성화 +5. 임시 버퍼 +6. 기능별 메모리 + +AdamW를 사용하여 혼합 정밀도로 훈련된 일반적인 모델은 모델 파라미터당 18 바이트와 활성화 메모리가 필요합니다. 추론 단계에서는 옵티마이저와 그라디언트가 필요하지 않으므로 이들은 제외합니다. 따라서 혼합 정밀도 추론의 경우 모델 매개변수당 6 바이트와 활성화 메모리가 필요합니다. + +자세히 살펴보겠습니다. + +**모델 가중치:** + +- fp32 훈련의 경우 매개 변수 수 * 4 바이트 +- 혼합 정밀도 훈련의 경우 매개 변수 수 * 6 바이트 (메모리에 fp32와 fp16 두 가지 모델을 유지) + +**옵티마이저 상태:** + +- 일반 AdamW의 경우 매개 변수 수 * 8 바이트 (2가지 상태 유지) +- [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)와 같은 8비트 AdamW 옵티마이저의 경우 매개 변수 수 * 2 바이트 +- Momentum을 가진 SGD와 같은 옵티마이저의 경우 매개 변수 수 * 4 바이트 (하나의 상태만 유지) + +**그라디언트** + +- fp32 또는 혼합 정밀도 훈련의 경우 매개 변수 수 * 4 바이트 (그라디언트는 항상 fp32으로 유지됩니다.) + +**순방향 활성화** + +- 크기는 여러 요인에 따라 달라지며, 주요 요인은 시퀀스 길이, 은닉 상태의 크기 및 배치 크기입니다. + +순방향 및 역방향 함수에서 전달 및 반환되는 입력과 출력이 있으며, 그라디언트 계산을 위해 저장된 순방향 활성화가 있습니다. + +**임시 메모리** + +더불어 모든 종류의 임시 변수는 연산이 완료되면 곧바로 해제되지만, 그 순간에는 추가 메모리가 필요할 수 있고 OOM을 유발할 수 있습니다. 따라서 코딩할 때 이러한 임시 변수에 대해 전략적으로 생각하고 때로는 더 이상 필요 없는 임시 변수를 즉시 명시적으로 메모리에서 제거하는 것이 중요합니다. + +**기능별 메모리** + +그런 다음, 소프트웨어에는 특별한 메모리 요구 사항이 있을 수 있습니다. 예를 들어, 빔 검색을 사용하여 텍스트를 생성할 때 소프트웨어는 입력과 출력 사본을 여러 개 유지해야 합니다. + +**`forward` vs `backward` 실행 속도** + +합성곱과 선형 레이어의 경우 순방향에 비해 역방향에서는 2배의 플롭스가 필요하므로 일반적으로 2배 정도 느리게 변환됩니다(역방향의 경우 사이즈가 부자연스럽기 때문에, 때로는 더욱 느릴 수도 있습니다). 활성화는 일반적으로 대역폭이 제한되어 있으며, 일반적으로 순방향보다 역방향에서 더 많은 데이터를 읽어야 합니다. (예를 들어, 순방향 활성화 시 한 번 씩 읽고 쓰지만, 역방향 활성화에서는 순방향 gradOutput과 출력에 대해 총 두 번 읽고 gradInput에 대해 한 번 씁니다.) + +보다시피, GPU 메모리를 절약하거나 작업 속도를 높일 수 있는 몇 가지 방법이 있습니다. +이제 GPU 활용과 계산 속도에 영향을 주는 것이 무엇인지를 이해했으므로, [Methods and tools for efficient training on a single GPU](perf_train_gpu_one) 문서 페이지를 참조하여 성능 최적화 기법에 대해 알아보세요. 
\ No newline at end of file diff --git a/docs/source/ko/model_sharing.md b/docs/source/ko/model_sharing.md new file mode 100644 index 00000000000000..868cc3b231de93 --- /dev/null +++ b/docs/source/ko/model_sharing.md @@ -0,0 +1,232 @@ + + +# 모델 공유하기[[share-a-model]] + +지난 두 튜토리얼에서 분산 설정을 위해 PyTorch, Keras 및 🤗 Accelerate를 사용하여 모델을 미세 조정하는 방법을 보았습니다. 다음 단계는 모델을 커뮤니티와 공유하는 것입니다! Hugging Face는 인공지능의 민주화를 위해 모두에게 지식과 자원을 공개적으로 공유해야 한다고 믿습니다. 다른 사람들이 시간과 자원을 절약할 수 있도록 커뮤니티에 모델을 공유하는 것을 고려해 보세요. + +이 튜토리얼에서 [Model Hub](https://huggingface.co/models)에서 훈련되거나 미세 조정 모델을 공유하는 두 가지 방법에 대해 알아봅시다: + +- API를 통해 파일을 Hub에 푸시합니다. +- 웹사이트를 통해 파일을 Hub로 끌어다 놓습니다. + + + + + +커뮤니티에 모델을 공유하려면, [huggingface.co](https://huggingface.co/join)에 계정이 필요합니다. 기존 조직에 가입하거나 새로 만들 수도 있습니다. + + + +## 저장소 특징[[repository-features]] + +모델 허브의 각 저장소는 일반적인 GitHub 저장소처럼 작동합니다. 저장소는 버전 관리, 커밋 기록, 차이점 시각화 기능을 제공합니다. + +모델 허브에 내장된 버전 관리는 git 및 [git-lfs](https://git-lfs.github.com/)를 기반으로 합니다. 즉, 하나의 모델을 하나의 저장소로 취급하여 접근 제어 및 확장성이 향상됩니다. 버전 제어는 커밋 해시, 태그 또는 브랜치로 모델의 특정 버전을 고정하는 방법인 *revision*을 허용합니다. + +따라서 `revision` 매개변수를 사용하여 특정 모델 버전을 가져올 수 있습니다: + +```py +>>> model = AutoModel.from_pretrained( +... "julien-c/EsperBERTo-small", revision="v2.0.1" # tag name, or branch name, or commit hash +... ) +``` + +또한 저장소에서 파일을 쉽게 편집할 수 있으며, 커밋 기록과 차이를 볼 수 있습니다: + +![vis_diff](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vis_diff.png) + +## 설정[[setup]] + +모델을 허브에 공유하기 전에 Hugging Face 자격 증명이 필요합니다. 터미널에 액세스할 수 있는 경우, 🤗 Transformers가 설치된 가상 환경에서 다음 명령을 실행합니다. 그러면 Hugging Face 캐시 폴더(기본적으로 `~/.cache/`)에 액세스 토큰을 저장합니다: + +```bash +huggingface-cli login +``` + +Jupyter 또는 Colaboratory와 같은 노트북을 사용 중인 경우, [`huggingface_hub`](https://huggingface.co/docs/hub/adding-a-library) 라이브러리가 설치되었는지 확인하세요. 이 라이브러리를 사용하면 API로 허브와 상호 작용할 수 있습니다. + +```bash +pip install huggingface_hub +``` + +그런 다음 `notebook_login`로 허브에 로그인하고, [여기](https://huggingface.co/settings/token) 링크에서 로그인할 토큰을 생성합니다: + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## 프레임워크 간 모델 변환하기[[convert-a-model-for-all-frameworks]] + +다른 프레임워크로 작업하는 사용자가 모델을 사용할 수 있도록 하려면, PyTorch 및 TensorFlow 체크포인트를 모두 사용하여 모델을 변환하고 업로드하는 것이 좋습니다. 이 단계를 건너뛰어도 사용자는 다른 프레임워크에서 모델을 가져올 수 있지만, 🤗 Transformers가 체크포인트를 즉석에서 변환해야 하므로 속도가 느려질 수 있습니다. + +체크포인트를 다른 프레임워크로 변환하는 것은 쉽습니다. PyTorch 및 TensorFlow가 설치되어 있는지 확인한 다음(설치 지침은 [여기](installation) 참조) 다른 프레임워크에서 작업에 대한 특정 모델을 찾습니다. + + + +체크포인트를 TensorFlow에서 PyTorch로 변환하려면 `from_tf=True`를 지정하세요: + +```py +>>> pt_model = DistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_tf=True) +>>> pt_model.save_pretrained("path/to/awesome-name-you-picked") +``` + + +체크포인트를 PyTorch에서 TensorFlow로 변환하려면 `from_pt=True`를 지정하세요: + +```py +>>> tf_model = TFDistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_pt=True) +``` + +그런 다음 새로운 체크포인트와 함께 새로운 TensorFlow 모델을 저장할 수 있습니다: + +```py +>>> tf_model.save_pretrained("path/to/awesome-name-you-picked") +``` + + +Flax에서 모델을 사용하는 경우, PyTorch에서 Flax로 체크포인트를 변환할 수도 있습니다: + +```py +>>> flax_model = FlaxDistilBertForSequenceClassification.from_pretrained( +... "path/to/awesome-name-you-picked", from_pt=True +... ) +``` + + + +## 훈련 중 모델 푸시하기[[push-a-model-during-training]] + + + + + +모델을 허브에 공유하는 것은 추가 매개변수나 콜백을 추가하는 것만큼 간단합니다. [미세 조정 튜토리얼](training)에서 [`TrainingArguments`] 클래스는 하이퍼파라미터와 추가 훈련 옵션을 지정하는 곳이라는 것을 기억하세요. 이러한 훈련 옵션 중 하나는 모델을 허브로 직접 푸시하는 기능을 포함합니다. 
[`TrainingArguments`]에서 `push_to_hub=True`를 설정하세요: + +```py +>>> training_args = TrainingArguments(output_dir="my-awesome-model", push_to_hub=True) +``` + +평소와 같이 훈련 인수를 [`Trainer`]에 전달합니다: + +```py +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=small_train_dataset, +... eval_dataset=small_eval_dataset, +... compute_metrics=compute_metrics, +... ) +``` + +모델을 미세 조정한 후, [`Trainer`]에서 [`~transformers.Trainer.push_to_hub`]를 호출하여 훈련된 모델을 허브로 푸시하세요. 🤗 Transformers는 훈련 하이퍼파라미터, 훈련 결과 및 프레임워크 버전을 모델 카드에 자동으로 추가합니다! + +```py +>>> trainer.push_to_hub() +``` + + +[`PushToHubCallback`]을 사용하여 모델을 허브에 공유하려면, [`PushToHubCallback`]에 다음 인수를 정의하세요: + +- 출력된 모델의 파일 경로 +- 토크나이저 +- `{Hub 사용자 이름}/{모델 이름}` 형식의 `hub_model_id` + +```py +>>> from transformers import PushToHubCallback + +>>> push_to_hub_callback = PushToHubCallback( +... output_dir="./your_model_save_path", tokenizer=tokenizer, hub_model_id="your-username/my-awesome-model" +... ) +``` + +[`fit`](https://keras.io/api/models/model_training_apis/)에 콜백을 추가하면, 🤗 Transformers가 훈련된 모델을 허브로 푸시합니다: + +```py +>>> model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3, callbacks=push_to_hub_callback) +``` + + + +## `push_to_hub` 함수 사용하기[[use-the-pushtohub-function]] + +모델에서 직접 `push_to_hub`를 호출하여 허브에 업로드할 수도 있습니다. + +`push_to_hub`에 모델 이름을 지정하세요: + +```py +>>> pt_model.push_to_hub("my-awesome-model") +``` + +이렇게 하면 사용자 이름 아래에 모델 이름 `my-awesome-model`로 저장소가 생성됩니다. 이제 사용자는 `from_pretrained` 함수를 사용하여 모델을 가져올 수 있습니다: + +```py +>>> from transformers import AutoModel + +>>> model = AutoModel.from_pretrained("your_username/my-awesome-model") +``` + +조직에 속하고 모델을 조직 이름으로 대신 푸시하려면 `repo_id`에 추가하세요: + +```py +>>> pt_model.push_to_hub("my-awesome-org/my-awesome-model") +``` + +`push_to_hub` 함수는 모델 저장소에 다른 파일을 추가하는 데에도 사용할 수 있습니다. 예를 들어 모델 저장소에 토크나이저를 추가할 수 있습니다: + +```py +>>> tokenizer.push_to_hub("my-awesome-model") +``` + +또는 미세 조정된 PyTorch 모델의 TensorFlow 버전을 추가할 수도 있습니다: + +```py +>>> tf_model.push_to_hub("my-awesome-model") +``` + +이제 Hugging Face 프로필로 이동하면, 새로 생성한 모델 저장소가 표시됩니다. **Files** 탭을 클릭하면 저장소에 업로드한 모든 파일이 표시됩니다. + +저장소에 파일을 만들고 업로드하는 방법에 대한 자세한 내용은 허브 설명서 [여기](https://huggingface.co/docs/hub/how-to-upstream)를 참조하세요. + +## 웹 인터페이스로 업로드하기[[upload-with-the-web-interface]] + +코드 없는 접근 방식을 선호하는 사용자는 허브의 웹 인터페이스를 통해 모델을 업로드할 수 있습니다. [huggingface.co/new](https://huggingface.co/new)를 방문하여 새로운 저장소를 생성하세요: + +![new_model_repo](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/new_model_repo.png) + +여기서 모델에 대한 몇 가지 정보를 추가하세요: + +- 저장소의 **소유자**를 선택합니다. 이는 사용자 또는 사용자가 속한 조직일 수 있습니다. +- 저장소 이름이 될 모델의 이름을 선택합니다. +- 모델이 공개인지 비공개인지 선택합니다. +- 모델의 라이센스 사용을 지정합니다. + +이제 **Files** 탭을 클릭하고 **Add file** 버튼을 클릭하여 새로운 파일을 저장소에 업로드합니다. 그런 다음 업로드할 파일을 끌어다 놓고 커밋 메시지를 추가하세요. + +![upload_file](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/upload_file.png) + +## 모델 카드 추가하기[[add-a-model-card]] + +사용자가 모델의 기능, 제한, 잠재적 편향 및 윤리적 고려 사항을 이해할 수 있도록 저장소에 모델 카드를 추가하세요. 모델 카드는 `README.md` 파일에 정의되어 있습니다. 다음 방법으로 모델 카드를 추가할 수 있습니다: + +* `README.md` 파일을 수동으로 생성하여 업로드합니다. +* 모델 저장소에서 **Edit model card** 버튼을 클릭합니다. + +모델 카드에 포함할 정보 유형에 대한 좋은 예는 DistilBert [모델 카드](https://huggingface.co/distilbert/distilbert-base-uncased)를 참조하세요. 모델의 탄소 발자국이나 위젯 예시 등 `README.md` 파일에서 제어할 수 있는 다른 옵션에 대한 자세한 내용은 [여기](https://huggingface.co/docs/hub/models-cards) 문서를 참조하세요. 
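+
+모델 카드를 코드로 만들고 싶다면, `huggingface_hub` 라이브러리의 `ModelCard` 클래스를 사용하는 방법도 있습니다. 아래는 이를 가정한 간단한 스케치로, 저장소 이름과 카드 내용은 예시일 뿐입니다.
+
+```py
+from huggingface_hub import ModelCard
+
+content = """---
+license: apache-2.0
+language: ko
+---
+
+# my-awesome-model
+
+여기에 모델의 기능, 학습 데이터, 한계와 잠재적 편향에 대한 설명을 작성하세요.
+"""
+
+card = ModelCard(content)  # README.md 내용(YAML 메타데이터 + 본문)을 담은 모델 카드 객체
+card.push_to_hub("your_username/my-awesome-model")  # 저장소 이름은 예시입니다
+```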
diff --git a/docs/source/ko/model_summary.md b/docs/source/ko/model_summary.md new file mode 100644 index 00000000000000..568b9425335d7f --- /dev/null +++ b/docs/source/ko/model_summary.md @@ -0,0 +1,107 @@ + + +# Transformer 모델군[[the-transformer-model-family]] + +2017년에 소개된 [기본 Transformer](https://arxiv.org/abs/1706.03762) 모델은 자연어 처리(NLP) 작업을 넘어 새롭고 흥미로운 모델들에 영감을 주었습니다. [단백질 접힘 구조 예측](https://huggingface.co/blog/deep-learning-with-proteins), [치타의 달리기 훈련](https://huggingface.co/blog/train-decision-transformers), [시계열 예측](https://huggingface.co/blog/time-series-transformers) 등을 위한 다양한 모델이 생겨났습니다. Transformer의 변형이 너무 많아서, 큰 그림을 놓치기 쉽습니다. 하지만 여기 있는 모든 모델의 공통점은 기본 Trasnformer 아키텍처를 기반으로 한다는 점입니다. 일부 모델은 인코더 또는 디코더만 사용하고, 다른 모델들은 인코더와 디코더를 모두 사용하기도 합니다. 이렇게 Transformer 모델군 내 상위 레벨에서의 차이점을 분류하고 검토하면 유용한 분류 체계를 얻을 수 있으며, 이전에 접해보지 못한 Transformer 모델들 또한 이해하는 데 도움이 될 것입니다. + +기본 Transformer 모델에 익숙하지 않거나 복습이 필요한 경우, Hugging Face 강의의 [트랜스포머는 어떻게 동작하나요?](https://huggingface.co/course/chapter1/4?fw=pt) 챕터를 확인하세요. + +
+ +## 컴퓨터 비전[[computer-vision]] + + + +### 합성곱 네트워크[[convolutional-network]] + +[Vision Transformer](https://arxiv.org/abs/2010.11929)가 확장성과 효율성을 입증하기 전까지 오랫동안 합성곱 네트워크(CNN)가 컴퓨터 비전 작업의 지배적인 패러다임이었습니다. 그럼에도 불구하고, 이동 불변성(translation invariance)과 같은 CNN의 우수한 부분이 도드라지기 때문에 몇몇 (특히 특정 과업에서의) Transformer 모델은 아키텍처에 합성곱을 통합하기도 했습니다. [ConvNeXt](model_doc/convnext)는 이런 관례를 뒤집어 CNN을 현대화하기 위해 Transformer의 디자인을 차용합니다. 예를 들면 ConvNeXt는 겹치지 않는 슬라이딩 창(sliding window)을 사용하여 이미지를 패치화하고, 더 큰 커널로 전역 수용 필드(global receptive field)를 확장시킵니다. ConvNeXt는 또한 메모리 효율을 높이고 성능을 향상시키기 위해 여러 레이어 설계를 선택하기 때문에 Transformer와 견줄만합니다! + +### 인코더[[cv-encoder]] + +[Vision Transformer(ViT)](model_doc/vit)는 합성곱 없는 컴퓨터 비전 작업의 막을 열었습니다. ViT는 표준 Transformer 인코더를 사용하지만, 가장 큰 혁신은 이미지를 처리하는 방식이었습니다. 문장을 토큰으로 분할하는 것처럼 이미지를 고정된 크기의 패치로 분할하고, 이를 사용하여 임베딩을 생성합니다. ViT는 Transformer의 효율적인 아키텍처를 활용하여 훈련에 더 적은 자원을 사용하면서도 당시 CNN에 비견하는 결과를 입증했습니다. 그리고 ViT를 뒤이어 분할(segmentation)과 같은 고밀도 비전 작업과 탐지 작업도 다룰 수 있는 다른 비전 모델이 등장했습니다. + +이러한 모델 중 하나가 [Swin](model_doc/swin) Transformer입니다. 이 모델은 작은 크기의 패치에서 계층적 특징 맵(CNN 👀과 같지만 ViT와는 다름)을 만들고 더 깊은 레이어의 인접 패치와 병합합니다. 어텐션(Attention)은 지역 윈도우 내에서만 계산되며, 모델이 더 잘 학습할 수 있도록 어텐션 레이어 간에 윈도우를 이동하며 연결을 생성합니다. Swin Transformer는 계층적 특징 맵을 생성할 수 있으므로, 분할(segmentation)과 탐지와 같은 고밀도 예측 작업에 적합합니다. [SegFormer](model_doc/segformer) 역시 Transformer 인코더를 사용하여 계층적 특징 맵을 구축하지만, 상단에 간단한 다층 퍼셉트론(MLP) 디코더를 추가하여 모든 특징 맵을 결합하고 예측을 수행합니다. + +BeIT와 ViTMAE와 같은 다른 비전 모델은 BERT의 사전훈련 목표(objective)에서 영감을 얻었습니다. [BeIT](model_doc/beit)는 *마스크드 이미지 모델링(MIM)*으로 사전훈련되며, 이미지 패치는 임의로 마스킹되고 이미지도 시각적 토큰으로 토큰화됩니다. BeIT는 마스킹된 패치에 해당하는 시각적 토큰을 예측하도록 학습됩니다. [ViTMAE](model_doc/vitmae)도 비슷한 사전훈련 목표가 있지만, 시각적 토큰 대신 픽셀을 예측해야 한다는 점이 다릅니다. 특이한 점은 이미지 패치의 75%가 마스킹되어 있다는 것입니다! 디코더는 마스킹된 토큰과 인코딩된 패치에서 픽셀을 재구성합니다. 사전훈련이 끝나면 디코더는 폐기되고 인코더는 다운스트림 작업에 사용할 준비가 됩니다. + +### 디코더[[cv-decoder]] + +대부분의 비전 모델은 인코더에 의존하여 이미지 표현을 학습하기 때문에 디코더 전용 비전 모델은 드뭅니다. 하지만 이미지 생성 등의 사례의 경우, GPT-2와 같은 텍스트 생성 모델에서 보았듯이 디코더가 가장 적합합니다. [ImageGPT](model_doc/imagegpt)는 GPT-2와 동일한 아키텍처를 사용하지만, 시퀀스의 다음 토큰을 예측하는 대신 이미지의 다음 픽셀을 예측합니다. ImageGPT는 이미지 생성 뿐만 아니라 이미지 분류를 위해 미세 조정할 수도 있습니다. + +### 인코더-디코더[[cv-encoder-decoder]] + +비전 모델은 일반적으로 인코더(백본으로도 알려짐)를 사용하여 중요한 이미지 특징을 추출한 후, 이를 Transformer 디코더로 전달합니다. [DETR](model_doc/detr)에 사전훈련된 백본이 있지만, 객체 탐지를 위해 완전한 Transformer 인코더-디코더 아키텍처도 사용합니다. 인코더는 이미지 표현을 학습하고 이를 디코더에서 객체 쿼리(각 객체 쿼리는 이미지의 영역 또는 객체에 중점을 두고 학습된 임베딩)와 결합합니다. DETR은 각 객체 쿼리에 대한 바운딩 박스 좌표와 클래스 레이블을 예측합니다. + +## 자연어처리[[natural-language-processing]] + + + +### 인코더[[nlp-encoder]] + +[BERT](model_doc/bert)는 인코더 전용 Transformer로, 다른 토큰을 보고 소위 "부정 행위"를 저지르는 걸 막기 위해 입력에서 특정 토큰을 임의로 마스킹합니다. 사전훈련의 목표는 컨텍스트를 기반으로 마스킹된 토큰을 예측하는 것입니다. 이를 통해 BERT는 왼쪽과 오른쪽 컨텍스트를 충분히 활용하여 입력에 대해 더 깊고 풍부한 표현을 학습할 수 있습니다. 그러나 BERT의 사전훈련 전략에는 여전히 개선의 여지가 남아 있었습니다. [RoBERTa](model_doc/roberta)는 더 긴 시간 동안 더 큰 배치에 대한 훈련을 포함하고, 전처리 중에 한 번만 마스킹하는 것이 아니라 각 에폭에서 토큰을 임의로 마스킹하고, 다음 문장 예측 목표를 제거하는 새로운 사전훈련 방식을 도입함으로써 이를 개선했습니다. + +성능 개선을 위한 전략으로 모델 크기를 키우는 것이 지배적입니다. 하지만 큰 모델을 훈련하려면 계산 비용이 많이 듭니다. 계산 비용을 줄이는 한 가지 방법은 [DistilBERT](model_doc/distilbert)와 같이 작은 모델을 사용하는 것입니다. DistilBERT는 압축 기법인 [지식 증류(knowledge distillation)](https://arxiv.org/abs/1503.02531)를 사용하여, 거의 모든 언어 이해 능력을 유지하면서 더 작은 버전의 BERT를 만듭니다. + +그러나 대부분의 Transformer 모델에 더 많은 매개변수를 사용하는 경향이 이어졌고, 이에 따라 훈련 효율성을 개선하는 것에 중점을 둔 새로운 모델이 등장했습니다. [ALBERT](model_doc/albert)는 두 가지 방법으로 매개변수 수를 줄여 메모리 사용량을 줄였습니다. 바로 큰 어휘를 두 개의 작은 행렬로 분리하는 것과 레이어가 매개변수를 공유하도록 하는 것입니다. [DeBERTa](model_doc/deberta)는 단어와 그 위치를 두 개의 벡터로 개별적으로 인코딩하는 분리된(disentangled) 어텐션 메커니즘을 추가했습니다. 
어텐션은 단어와 위치 임베딩을 포함하는 단일 벡터 대신 이 별도의 벡터에서 계산됩니다. [Longformer](model_doc/longformer)는 특히 시퀀스 길이가 긴 문서를 처리할 때, 어텐션을 더 효율적으로 만드는 것에 중점을 두었습니다. 지역(local) 윈도우 어텐션(각 토큰 주변의 고정된 윈도우 크기에서만 계산되는 어텐션)과 전역(global) 어텐션(분류를 위해 `[CLS]`와 같은 특정 작업 토큰에만 해당)의 조합을 사용하여 전체(full) 어텐션 행렬 대신 희소(sparse) 어텐션 행렬을 생성합니다. + +### 디코더[[nlp-decoder]] + +[GPT-2](model_doc/gpt2)는 시퀀스에서 다음 단어를 예측하는 디코더 전용 Transformer입니다. 토큰을 오른쪽으로 마스킹하여 모델이 이전 토큰을 보고 "부정 행위"를 하지 못하도록 합니다. GPT-2는 방대한 텍스트에 대해 사전훈련하여 텍스트가 일부만 정확하거나 사실인 경우에도 상당히 능숙하게 텍스트를 생성할 수 있게 되었습니다. 하지만 GPT-2는 BERT가 사전훈련에서 갖는 양방향 컨텍스트가 부족하기 때문에 특정 작업에 적합하지 않았습니다. [XLNET](model_doc/xlnet)은 양방향 훈련이 가능한 permutation language modeling objective(PLM)를 사용하여 BERT와 GPT-2의 사전훈련 목표에 대한 장점을 함께 가지고 있습니다. + +GPT-2 이후, 언어 모델은 더욱 거대해졌고 현재는 *대규모 언어 모델(LLM)*로 알려져 있습니다. 충분히 큰 데이터 세트로 사전훈련된 LLM은 퓨샷(few-shot) 또는 제로샷(zero-shot) 학습을 수행합니다. [GPT-J](model_doc/gptj)는 6B 크기의 매개변수가 있고 400B 크기의 토큰으로 훈련된 LLM입니다. GPT-J에 이어 디코더 전용 모델군인 [OPT](model_doc/opt)가 등장했으며, 이 중 가장 큰 모델은 175B 크기이고 180B 크기의 토큰으로 훈련되었습니다. [BLOOM](model_doc/bloom)은 비슷한 시기에 출시되었으며, 이 중 가장 큰 모델은 176B 크기의 매개변수가 있고 46개의 언어와 13개의 프로그래밍 언어로 된 366B 크기의 토큰으로 훈련되었습니다. + +### 인코더-디코더[[nlp-encoder-decoder]] + +[BART](model_doc/bart)는 기본 Transformer 아키텍처를 유지하지만, 일부 텍스트 스팬(span)이 단일 `마스크` 토큰으로 대체되는 *text infilling* 변형으로 사전훈련 목표를 수정합니다. 디코더는 변형되지 않은 토큰(향후 토큰은 마스킹됨)을 예측하고 인코더의 은닉 상태를 사용하여 이 작업을 돕습니다. [Pegasus](model_doc/pegasus)는 BART와 유사하지만, Pegasus는 텍스트 스팬 대신 전체 문장을 마스킹합니다. Pegasus는 마스크드 언어 모델링 외에도 gap sentence generation(GSG)로 사전훈련됩니다. GSG는 문서에 중요한 문장 전체를 마스킹하여 `마스크` 토큰으로 대체하는 것을 목표로 합니다. 디코더는 남은 문장에서 출력을 생성해야 합니다. [T5](model_doc/t5)는 특정 접두사를 사용하여 모든 NLP 작업을 텍스트 투 텍스트 문제로 변환하는 더 특수한 모델입니다. 예를 들어, 접두사 `Summarize:`은 요약 작업을 나타냅니다. T5는 지도(GLUE 및 SuperGLUE) 훈련과 자기지도 훈련(토큰의 15%를 임의로 샘플링하여 제거)으로 사전훈련됩니다. + +## 오디오[[audio]] + + + +### 인코더[[audio-encoder]] + +[Wav2Vec2](model_doc/wav2vec2)는 Transformer 인코더를 사용하여 원본 오디오 파형(raw audio waveform)에서 직접 음성 표현을 학습합니다. 허위 음성 표현 세트에서 실제 음성 표현을 판별하는 대조 작업으로 사전훈련됩니다. [HuBERT](model_doc/hubert)는 Wav2Vec2와 유사하지만 훈련 과정이 다릅니다. 타겟 레이블이 유사한 오디오 세그먼트가 클러스터에 할당되어 은닉 단위(unit)가 되는 군집화(clustering) 단계에서 생성됩니다. 은닉 단위는 예측을 위한 임베딩에 매핑됩니다. + +### 인코더-디코더[[audio-encoder-decoder]] + +[Speech2Text](model_doc/speech_to_text)는 자동 음성 인식(ASR) 및 음성 번역을 위해 고안된 음성 모델입니다. 이 모델은 오디오 파형에서 추출한 log mel-filter bank 특징을 채택하고 자기회귀 방식으로 사전훈련하여, 전사본 또는 번역을 만듭니다. [Whisper](model_doc/whisper)은 ASR 모델이지만, 다른 많은 음성 모델과 달리 제로샷 성능을 위해 대량의 ✨ 레이블이 지정된 ✨ 오디오 전사 데이터에 대해 사전훈련됩니다. 데이터 세트의 큰 묶음에는 영어가 아닌 언어도 포함되어 있어서 자원이 적은 언어에도 Whisper를 사용할 수 있습니다. 구조적으로, Whisper는 Speech2Text와 유사합니다. 오디오 신호는 인코더에 의해 인코딩된 log-mel spectrogram으로 변환됩니다. 디코더는 인코더의 은닉 상태와 이전 토큰으로부터 자기회귀 방식으로 전사를 생성합니다. + +## 멀티모달[[multimodal]] + + + +### 인코더[[mm-encoder]] + +[VisualBERT](model_doc/visual_bert)는 BERT 이후에 출시된 비전 언어 작업을 위한 멀티모달 모델입니다. 이 모델은 BERT와 사전훈련된 객체 탐지 시스템을 결합하여 이미지 특징을 시각 임베딩으로 추출하고, 텍스트 임베딩과 함께 BERT로 전달합니다. VisualBERT는 마스킹되지 않은 텍스트와 시각 임베딩을 기반으로 마스킹된 텍스트를 예측하고, 텍스트가 이미지와 일치하는지 예측해야 합니다. ViT가 이미지 임베딩을 구하는 방식이 더 쉬웠기 때문에, ViT가 출시된 후 [ViLT](model_doc/vilt)는 아키텍처에 ViT를 채택했습니다. 이미지 임베딩은 텍스트 임베딩과 함께 처리됩니다. 여기에서, ViLT는 이미지 텍스트 매칭, 마스크드 언어 모델링, 전체 단어 마스킹을 통해 사전훈련됩니다. + +[CLIP](model_doc/clip)은 다른 접근 방식을 사용하여 (`이미지`, `텍스트`)의 쌍 예측을 수행합니다. (`이미지`, `텍스트`) 쌍에서의 이미지와 텍스트 임베딩 간의 유사도를 최대화하기 위해 4억 개의 (`이미지`, `텍스트`) 쌍 데이터 세트에 대해 이미지 인코더(ViT)와 텍스트 인코더(Transformer)를 함께 훈련합니다. 사전훈련 후, 자연어를 사용하여 이미지가 주어진 텍스트를 예측하거나 그 반대로 예측하도록 CLIP에 지시할 수 있습니다. [OWL-ViT](model_doc/owlvit)는 CLIP을 제로샷 객체 탐지를 위한 백본(backbone)으로 사용하여 CLIP 상에 구축됩니다. 사전훈련 후, 객체 탐지 헤드가 추가되어 (`클래스`, `바운딩 박스`) 쌍에 대한 집합(set) 예측을 수행합니다. 
+ +### 인코더-디코더[[mm-encoder-decoder]] + +광학 문자 인식(OCR)은 이미지를 이해하고 텍스트를 생성하기 위해 다양한 구성 요소를 필요로 하는 전통적인 텍스트 인식 작업입니다. [TrOCR](model_doc/trocr)은 종단간(end-to-end) Transformer를 사용하여 이 프로세스를 간소화합니다. 인코더는 이미지 이해를 위한 ViT 방식의 모델이며 이미지를 고정된 크기의 패치로 처리합니다. 디코더는 인코더의 은닉 상태를 받아서 자기회귀 방식으로 텍스트를 생성합니다. [Donut](model_doc/donut)은 OCR 기반 접근 방식에 의존하지 않는 더 일반적인 시각 문서 이해 모델입니다. 이 모델은 Swin Transformer를 인코더로, 다국어 BART를 디코더로 사용합니다. Donut은 이미지와 텍스트 주석을 기반으로 다음 단어를 예측하여 텍스트를 읽도록 사전훈련됩니다. 디코더는 프롬프트가 주어지면 토큰 시퀀스를 생성합니다. 프롬프트는 각 다운스트림 작업에 대한 특수 토큰으로 표현됩니다. 예를 들어, 문서 파싱(parsing)에는 인코더의 은닉 상태와 결합되어 문서를 정형 출력 형식(JSON)으로 파싱하는 특수 `파싱` 토큰이 있습니다. + +## 강화 학습[[reinforcement-learning]] + + + +### 디코더[[rl-decoder]] + +Decision 및 Trajectory Transformer는 상태(state), 행동(action), 보상(reward)을 시퀀스 모델링 문제로 표현합니다. [Decision Transformer](model_doc/decision_transformer)는 기대 보상(returns-to-go), 과거 상태 및 행동을 기반으로 미래의 원하는 수익(return)으로 이어지는 일련의 행동을 생성합니다. 마지막 *K* 시간 스텝(timestep)에 대해, 세 가지 모달리티는 각각 토큰 임베딩으로 변환되고 GPT와 같은 모델에 의해 처리되어 미래의 액션 토큰을 예측합니다. [Trajectory Transformer](model_doc/trajectory_transformer)도 상태, 행동, 보상을 토큰화하여 GPT 아키텍처로 처리합니다. 보상 조건에 중점을 둔 Decision Transformer와 달리 Trajectory Transformer는 빔 서치(beam search)로 미래 행동을 생성합니다. \ No newline at end of file diff --git a/docs/source/ko/multilingual.md b/docs/source/ko/multilingual.md new file mode 100644 index 00000000000000..c0eee024358f3e --- /dev/null +++ b/docs/source/ko/multilingual.md @@ -0,0 +1,192 @@ + + +# 다국어 모델 추론하기[[multilingual-models-for-inference]] + +[[open-in-colab]] + +🤗 Transformers에는 여러 종류의 다국어(multilingual) 모델이 있으며, 단일 언어(monolingual) 모델과 추론 시 사용법이 다릅니다. +그렇다고 해서 *모든* 다국어 모델의 사용법이 다른 것은 아닙니다. + +[google-bert/bert-base-multilingual-uncased](https://huggingface.co/google-bert/bert-base-multilingual-uncased)와 같은 몇몇 모델은 단일 언어 모델처럼 사용할 수 있습니다. +이번 가이드에서 다국어 모델의 추론 시 사용 방법을 알아볼 것입니다. + +## XLM[[xlm]] + +XLM에는 10가지 체크포인트(checkpoint)가 있는데, 이 중 하나만 단일 언어입니다. +나머지 체크포인트 9개는 언어 임베딩을 사용하는 체크포인트와 그렇지 않은 체크포인트의 두 가지 범주로 나눌 수 있습니다. + +### 언어 임베딩을 사용하는 XLM[[xlm-with-language-embeddings]] + +다음 XLM 모델은 추론 시에 언어 임베딩을 사용합니다: + +- `FacebookAI/xlm-mlm-ende-1024` (마스킹된 언어 모델링, 영어-독일어) +- `FacebookAI/xlm-mlm-enfr-1024` (마스킹된 언어 모델링, 영어-프랑스어) +- `FacebookAI/xlm-mlm-enro-1024` (마스킹된 언어 모델링, 영어-루마니아어) +- `FacebookAI/xlm-mlm-xnli15-1024` (마스킹된 언어 모델링, XNLI 데이터 세트에서 제공하는 15개 국어) +- `FacebookAI/xlm-mlm-tlm-xnli15-1024` (마스킹된 언어 모델링 + 번역, XNLI 데이터 세트에서 제공하는 15개 국어) +- `FacebookAI/xlm-clm-enfr-1024` (Causal language modeling, 영어-프랑스어) +- `FacebookAI/xlm-clm-ende-1024` (Causal language modeling, 영어-독일어) + +언어 임베딩은 모델에 전달된 `input_ids`와 동일한 shape의 텐서로 표현됩니다. +이러한 텐서의 값은 사용된 언어에 따라 다르며 토크나이저의 `lang2id` 및 `id2lang` 속성에 의해 식별됩니다. + +다음 예제에서는 `FacebookAI/xlm-clm-enfr-1024` 체크포인트(코잘 언어 모델링(causal language modeling), 영어-프랑스어)를 가져옵니다: + +```py +>>> import torch +>>> from transformers import XLMTokenizer, XLMWithLMHeadModel + +>>> tokenizer = XLMTokenizer.from_pretrained("FacebookAI/xlm-clm-enfr-1024") +>>> model = XLMWithLMHeadModel.from_pretrained("FacebookAI/xlm-clm-enfr-1024") +``` + +토크나이저의 `lang2id` 속성은 모델의 언어와 해당 ID를 표시합니다: + +```py +>>> print(tokenizer.lang2id) +{'en': 0, 'fr': 1} +``` + +다음으로, 예제 입력을 만듭니다: + +```py +>>> input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")]) # 배치 크기는 1입니다 +``` + +언어 ID를 `"en"`으로 설정해 언어 임베딩을 정의합니다. +언어 임베딩은 영어의 언어 ID인 `0`으로 채워진 텐서입니다. +이 텐서는 `input_ids`와 같은 크기여야 합니다. 
+ +```py +>>> language_id = tokenizer.lang2id["en"] # 0 +>>> langs = torch.tensor([language_id] * input_ids.shape[1]) # torch.tensor([0, 0, 0, ..., 0]) + +>>> # (batch_size, sequence_length) shape의 텐서가 되도록 만듭니다. +>>> langs = langs.view(1, -1) # 이제 [1, sequence_length] shape이 되었습니다(배치 크기는 1입니다) +``` + +이제 `input_ids`와 언어 임베딩을 모델로 전달합니다: + +```py +>>> outputs = model(input_ids, langs=langs) +``` + +[run_generation.py](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-generation/run_generation.py) 스크립트로 `xlm-clm` 체크포인트를 사용해 텍스트와 언어 임베딩을 생성할 수 있습니다. + +### 언어 임베딩을 사용하지 않는 XLM[[xlm-without-language-embeddings]] + +다음 XLM 모델은 추론 시에 언어 임베딩이 필요하지 않습니다: + +- `FacebookAI/xlm-mlm-17-1280` (마스킹된 언어 모델링, 17개 국어) +- `FacebookAI/xlm-mlm-100-1280` (마스킹된 언어 모델링, 100개 국어) + +이전의 XLM 체크포인트와 달리 이 모델은 일반 문장 표현에 사용됩니다. + +## BERT[[bert]] + +다음 BERT 모델은 다국어 태스크에 사용할 수 있습니다: + +- `google-bert/bert-base-multilingual-uncased` (마스킹된 언어 모델링 + 다음 문장 예측, 102개 국어) +- `google-bert/bert-base-multilingual-cased` (마스킹된 언어 모델링 + 다음 문장 예측, 104개 국어) + +이러한 모델은 추론 시에 언어 임베딩이 필요하지 않습니다. +문맥에서 언어를 식별하고, 식별된 언어로 추론합니다. + +## XLM-RoBERTa[[xlmroberta]] + +다음 XLM-RoBERTa 또한 다국어 다국어 태스크에 사용할 수 있습니다: + +- `FacebookAI/xlm-roberta-base` (마스킹된 언어 모델링, 100개 국어) +- `FacebookAI/xlm-roberta-large` (마스킹된 언어 모델링, 100개 국어) + +XLM-RoBERTa는 100개 국어에 대해 새로 생성되고 정제된 2.5TB 규모의 CommonCrawl 데이터로 학습되었습니다. +이전에 공개된 mBERT나 XLM과 같은 다국어 모델에 비해 분류, 시퀀스 라벨링, 질의 응답과 같은 다운스트림(downstream) 작업에서 이점이 있습니다. + +## M2M100[[m2m100]] + +다음 M2M100 모델 또한 다국어 다국어 태스크에 사용할 수 있습니다: + +- `facebook/m2m100_418M` (번역) +- `facebook/m2m100_1.2B` (번역) + +이 예제에서는 `facebook/m2m100_418M` 체크포인트를 가져와서 중국어를 영어로 번역합니다. +토크나이저에서 번역 대상 언어(source language)를 설정할 수 있습니다: + +```py +>>> from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer + +>>> en_text = "Do not meddle in the affairs of wizards, for they are subtle and quick to anger." +>>> chinese_text = "不要插手巫師的事務, 因為他們是微妙的, 很快就會發怒." + +>>> tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M", src_lang="zh") +>>> model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M") +``` + +문장을 토큰화합니다: + +```py +>>> encoded_zh = tokenizer(chinese_text, return_tensors="pt") +``` + +M2M100은 번역을 진행하기 위해 첫 번째로 생성되는 토큰은 번역할 언어(target language) ID로 강제 지정합니다. +영어로 번역하기 위해 `generate` 메소드에서 `forced_bos_token_id`를 `en`으로 설정합니다: + +```py +>>> generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en")) +>>> tokenizer.batch_decode(generated_tokens, skip_special_tokens=True) +'Do not interfere with the matters of the witches, because they are delicate and will soon be angry.' +``` + +## MBart[[mbart]] + +다음 MBart 모델 또한 다국어 태스크에 사용할 수 있습니다: + +- `facebook/mbart-large-50-one-to-many-mmt` (일대다 다국어 번역, 50개 국어) +- `facebook/mbart-large-50-many-to-many-mmt` (다대다 다국어 번역, 50개 국어) +- `facebook/mbart-large-50-many-to-one-mmt` (다대일 다국어 번역, 50개 국어) +- `facebook/mbart-large-50` (다국어 번역, 50개 국어) +- `facebook/mbart-large-cc25` + +이 예제에서는 핀란드어를 영어로 번역하기 위해 `facebook/mbart-large-50-many-to-many-mmt` 체크포인트를 가져옵니다. +토크나이저에서 번역 대상 언어(source language)를 설정할 수 있습니다: + +```py +>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM + +>>> en_text = "Do not meddle in the affairs of wizards, for they are subtle and quick to anger." +>>> fi_text = "Älä sekaannu velhojen asioihin, sillä ne ovat hienovaraisia ja nopeasti vihaisia." 
+
+>>> tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt", src_lang="fi_FI")
+>>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
+```
+
+핀란드어 문장을 토큰화합니다:
+
+```py
+>>> encoded_fi = tokenizer(fi_text, return_tensors="pt")
+```
+
+MBart는 번역을 진행하기 위해 첫 번째로 생성되는 토큰을 번역할 언어(target language) ID로 강제 지정합니다.
+영어로 번역하기 위해 `generate` 메소드에서 `forced_bos_token_id`를 `en`으로 설정합니다:
+
+```py
+>>> generated_tokens = model.generate(**encoded_fi, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
+>>> tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
+"Don't interfere with the wizard's affairs, because they are subtle, will soon get angry."
+```
+
+`facebook/mbart-large-50-many-to-one-mmt` 체크포인트를 사용하고 있다면, 첫 번째로 생성되는 토큰을 번역할 언어(target language) ID로 강제 지정할 필요는 없습니다.
diff --git a/docs/source/ko/pad_truncation.md b/docs/source/ko/pad_truncation.md
new file mode 100644
index 00000000000000..9ee4dc839b8416
--- /dev/null
+++ b/docs/source/ko/pad_truncation.md
@@ -0,0 +1,68 @@
+
+
+# 패딩과 잘라내기[[padding-and-truncation]]
+
+배치 입력은 길이가 다른 경우가 많아서 고정 크기 텐서로 변환할 수 없습니다. 패딩과 잘라내기는 다양한 길이의 배치에서 직사각형 텐서를 생성할 수 있도록 이 문제를 해결하는 전략입니다. 패딩은 특수한 **패딩 토큰**을 추가하여 짧은 시퀀스가 배치에서 가장 긴 시퀀스 또는 모델에서 허용하는 최대 길이와 동일한 길이를 갖도록 합니다. 잘라내기는 긴 시퀀스를 잘라내어 패딩과 다른 방식으로 시퀀스의 길이를 동일하게 합니다.
+
+대부분의 경우 배치에 가장 긴 시퀀스의 길이로 패딩하고 모델이 허용할 수 있는 최대 길이로 잘라내는 것이 잘 작동합니다. 그러나 필요하다면 API가 지원하는 더 많은 전략을 사용할 수 있습니다. 필요한 인수는 `padding`, `truncation`, `max_length` 세 가지입니다.
+
+`padding` 인수는 패딩을 제어합니다. 불리언 또는 문자열일 수 있습니다:
+
+  - `True` 또는 `'longest'`: 배치에서 가장 긴 시퀀스로 패딩합니다(단일 시퀀스만 제공하는 경우 패딩이 적용되지 않습니다).
+  - `'max_length'`: `max_length` 인수가 지정한 길이로 패딩하거나, `max_length`가 제공되지 않은 경우(`max_length=None`) 모델에서 허용되는 최대 길이로 패딩합니다. 단일 시퀀스만 제공하는 경우에도 패딩이 적용됩니다.
+  - `False` 또는 `'do_not_pad'`: 패딩이 적용되지 않습니다. 이것이 기본 동작입니다.
+
+`truncation` 인수는 잘라낼 방법을 정합니다. 불리언 또는 문자열일 수 있습니다:
+
+  - `True` 또는 `'longest_first'`: `max_length` 인수가 지정한 최대 길이로 잘라내거나,
+    `max_length`가 제공되지 않은 경우(`max_length=None`) 모델에서 허용되는 최대 길이로 잘라냅니다.
+    시퀀스 쌍에서 가장 긴 시퀀스의 토큰을 적절한 길이에 도달할 때까지 하나씩 제거합니다.
+  - `'only_second'`: `max_length` 인수가 지정한 최대 길이로 잘라내거나,
+    `max_length`가 제공되지 않은 경우(`max_length=None`) 모델에서 허용되는 최대 길이로 잘라냅니다.
+    시퀀스 쌍(또는 시퀀스 쌍의 배치)가 제공된 경우 쌍의 두 번째 문장만 잘라냅니다.
+  - `'only_first'`: `max_length` 인수가 지정한 최대 길이로 잘라내거나,
+    `max_length`가 제공되지 않은 경우(`max_length=None`) 모델에서 허용되는 최대 길이로 잘라냅니다.
+    시퀀스 쌍(또는 시퀀스 쌍의 배치)가 제공된 경우 쌍의 첫 번째 문장만 잘라냅니다.
+  - `False` 또는 `'do_not_truncate'`: 잘라내기를 적용하지 않습니다. 이것이 기본 동작입니다.
+
+`max_length` 인수는 패딩 및 잘라내기를 적용할 길이를 제어합니다. 이 인수는 정수 또는 `None`일 수 있으며, `None`일 경우 모델이 허용할 수 있는 최대 길이로 기본값이 설정됩니다. 모델에 특정한 최대 입력 길이가 없는 경우 `max_length`에 대한 잘라내기 또는 패딩이 비활성화됩니다.
+
+다음 표에는 패딩 및 잘라내기를 설정하는 권장 방법이 요약되어 있습니다.
+입력으로 시퀀스 쌍을 사용하는 경우, 다음 예제에서 `truncation=True`를 `['only_first', 'only_second', 'longest_first']`에서 선택한 `STRATEGY`, 즉 `truncation='only_second'` 또는 `truncation='longest_first'`로 바꾸면 앞서 설명한 대로 쌍의 두 시퀀스가 잘리는 방식을 제어할 수 있습니다.
+ +| 잘라내기 | 패딩 | 사용 방법 | +|--------------------------------------|-----------------------------------|------------------------------------------------------------------------------------------| +| 잘라내기 없음 | 패딩 없음 | `tokenizer(batch_sentences)` | +| | 배치 내 최대 길이로 패딩 | `tokenizer(batch_sentences, padding=True)` 또는 | +| | | `tokenizer(batch_sentences, padding='longest')` | +| | 모델의 최대 입력 길이로 패딩 | `tokenizer(batch_sentences, padding='max_length')` | +| | 특정 길이로 패딩 | `tokenizer(batch_sentences, padding='max_length', max_length=42)` | +| | 다양한 길이로 패딩 | `tokenizer(batch_sentences, padding=True, pad_to_multiple_of=8)` | +| 모델의 최대 입력 길이로 잘라내기 | 패딩 없음 | `tokenizer(batch_sentences, truncation=True)` 또는 | +| | | `tokenizer(batch_sentences, truncation=STRATEGY)` | +| | 배치 내 최대 길이로 패딩 | `tokenizer(batch_sentences, padding=True, truncation=True)` 또는 | +| | | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY)` | +| | 모델의 최대 입력 길이로 패딩 | `tokenizer(batch_sentences, padding='max_length', truncation=True)` 또는 | +| | | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY)` | +| | 특정 길이로 패딩 | 사용 불가 | +| 특정 길이로 잘라내기 | 패딩 없음 | `tokenizer(batch_sentences, truncation=True, max_length=42)` 또는 | +| | | `tokenizer(batch_sentences, truncation=STRATEGY, max_length=42)` | +| | 배치 내 최대 길이로 패딩 | `tokenizer(batch_sentences, padding=True, truncation=True, max_length=42)` 또는 | +| | | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42)` | +| | 모델의 최대 입력 길이로 패딩 | 사용 불가 | +| | 특정 길이로 패딩 | `tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)` 또는 | +| | | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42)` | diff --git a/docs/source/ko/peft.md b/docs/source/ko/peft.md new file mode 100644 index 00000000000000..90327e62c27ac4 --- /dev/null +++ b/docs/source/ko/peft.md @@ -0,0 +1,209 @@ + + +# 🤗 PEFT로 어댑터 가져오기 [[load-adapters-with-peft]] + +[[open-in-colab]] + +[Parameter-Efficient Fine Tuning (PEFT)](https://huggingface.co/blog/peft) 방법은 사전훈련된 모델의 매개변수를 미세 조정 중 고정시키고, 그 위에 훈련할 수 있는 매우 적은 수의 매개변수(어댑터)를 추가합니다. 어댑터는 작업별 정보를 학습하도록 훈련됩니다. 이 접근 방식은 완전히 미세 조정된 모델에 필적하는 결과를 생성하면서, 메모리 효율적이고 비교적 적은 컴퓨팅 리소스를 사용합니다. + +또한 PEFT로 훈련된 어댑터는 일반적으로 전체 모델보다 훨씬 작기 때문에 공유, 저장 및 가져오기가 편리합니다. + +
+Hub에 저장된 OPTForCausalLM 모델의 어댑터 가중치는 최대 700MB에 달하는 모델 가중치의 전체 크기에 비해 약 6MB에 불과합니다.
+
+🤗 PEFT 라이브러리에 대해 자세히 알아보려면 [문서](https://huggingface.co/docs/peft/index)를 확인하세요.
+
+## 설정 [[setup]]
+
+🤗 PEFT를 설치하여 시작하세요:
+
+```bash
+pip install peft
+```
+
+새로운 기능을 사용해보고 싶다면, 다음 소스에서 라이브러리를 설치하는 것이 좋습니다:
+
+```bash
+pip install git+https://github.com/huggingface/peft.git
+```
+
+## 지원되는 PEFT 모델 [[supported-peft-models]]
+
+🤗 Transformers는 기본적으로 일부 PEFT 방법을 지원하며, 로컬이나 Hub에 저장된 어댑터 가중치를 가져오고 몇 줄의 코드만으로 쉽게 실행하거나 훈련할 수 있습니다. 다음 방법을 지원합니다:
+
+- [Low Rank Adapters](https://huggingface.co/docs/peft/conceptual_guides/lora)
+- [IA3](https://huggingface.co/docs/peft/conceptual_guides/ia3)
+- [AdaLoRA](https://arxiv.org/abs/2303.10512)
+
+🤗 PEFT와 관련된 다른 방법(예: 프롬프트 훈련 또는 프롬프트 튜닝) 또는 일반적인 🤗 PEFT 라이브러리에 대해 자세히 알아보려면 [문서](https://huggingface.co/docs/peft/index)를 참조하세요.
+
+
+## PEFT 어댑터 가져오기 [[load-a-peft-adapter]]
+
+🤗 Transformers에서 PEFT 어댑터 모델을 가져오고 사용하려면 Hub 저장소나 로컬 디렉터리에 `adapter_config.json` 파일과 어댑터 가중치가 포함되어 있는지 확인하십시오. 그런 다음 `AutoModelFor` 클래스를 사용하여 PEFT 어댑터 모델을 가져올 수 있습니다. 예를 들어 인과 관계 언어 모델용 PEFT 어댑터 모델을 가져오려면 다음 단계를 따르십시오:
+
+1. PEFT 모델 ID를 지정하십시오.
+2. [`AutoModelForCausalLM`] 클래스에 전달하십시오.
+
+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+peft_model_id = "ybelkada/opt-350m-lora"
+model = AutoModelForCausalLM.from_pretrained(peft_model_id)
+```
+
+
+
+`AutoModelFor` 클래스나 기본 모델 클래스(예: `OPTForCausalLM` 또는 `LlamaForCausalLM`) 중 하나를 사용하여 PEFT 어댑터를 가져올 수 있습니다.
+
+
+
+`load_adapter` 메소드를 호출하여 PEFT 어댑터를 가져올 수도 있습니다.
+
+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "facebook/opt-350m"
+peft_model_id = "ybelkada/opt-350m-lora"
+
+model = AutoModelForCausalLM.from_pretrained(model_id)
+model.load_adapter(peft_model_id)
+```
+
+## 8비트 또는 4비트로 가져오기 [[load-in-8bit-or-4bit]]
+
+`bitsandbytes` 통합은 8비트와 4비트 정밀도 데이터 유형을 지원하므로 큰 모델을 가져올 때 유용하면서 메모리도 절약합니다. 모델을 하드웨어에 효과적으로 분배하려면 [`~PreTrainedModel.from_pretrained`]에 `load_in_8bit` 또는 `load_in_4bit` 매개변수를 추가하고 `device_map="auto"`를 설정하세요:
+
+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+peft_model_id = "ybelkada/opt-350m-lora"
+model = AutoModelForCausalLM.from_pretrained(peft_model_id, device_map="auto", load_in_8bit=True)
+```
+
+## 새 어댑터 추가 [[add-a-new-adapter]]
+
+새 어댑터가 현재 어댑터와 동일한 유형인 경우에 한해 기존 어댑터가 있는 모델에 새 어댑터를 추가하려면 [`~peft.PeftModel.add_adapter`]를 사용할 수 있습니다. 예를 들어 모델에 기존 LoRA 어댑터가 연결되어 있는 경우:
+
+```py
+from transformers import AutoModelForCausalLM, OPTForCausalLM, AutoTokenizer
+from peft import LoraConfig
+
+model_id = "facebook/opt-350m"
+model = AutoModelForCausalLM.from_pretrained(model_id)
+
+lora_config = LoraConfig(
+    target_modules=["q_proj", "k_proj"],
+    init_lora_weights=False
+)
+
+model.add_adapter(lora_config, adapter_name="adapter_1")
+```
+
+새 어댑터를 추가하려면:
+
+```py
+# attach new adapter with same config
+model.add_adapter(lora_config, adapter_name="adapter_2")
+```
+
+이제 [`~peft.PeftModel.set_adapter`]를 사용하여 사용할 어댑터를 설정할 수 있습니다:
+
+```py
+# use adapter_1
+model.set_adapter("adapter_1")
+output = model.generate(**inputs)
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+
+# use adapter_2
+model.set_adapter("adapter_2")
+output_enabled = model.generate(**inputs)
+print(tokenizer.decode(output_enabled[0], skip_special_tokens=True))
+```
+
+## 어댑터 활성화 및 비활성화 [[enable-and-disable-adapters]]
+
+모델에 어댑터를 추가한 후 어댑터 모듈을 활성화 또는 비활성화할 수 있습니다.
어댑터 모듈을 활성화하려면: + +```py +from transformers import AutoModelForCausalLM, OPTForCausalLM, AutoTokenizer +from peft import PeftConfig + +model_id = "facebook/opt-350m" +adapter_model_id = "ybelkada/opt-350m-lora" +tokenizer = AutoTokenizer.from_pretrained(model_id) +text = "Hello" +inputs = tokenizer(text, return_tensors="pt") + +model = AutoModelForCausalLM.from_pretrained(model_id) +peft_config = PeftConfig.from_pretrained(adapter_model_id) + +# to initiate with random weights +peft_config.init_lora_weights = False + +model.add_adapter(peft_config) +model.enable_adapters() +output = model.generate(**inputs) +``` + +어댑터 모듈을 비활성화하려면: + +```py +model.disable_adapters() +output = model.generate(**inputs) +``` + +## PEFT 어댑터 훈련 [[train-a-peft-adapter]] + +PEFT 어댑터는 [`Trainer`] 클래스에서 지원되므로 특정 사용 사례에 맞게 어댑터를 훈련할 수 있습니다. 몇 줄의 코드를 추가하기만 하면 됩니다. 예를 들어 LoRA 어댑터를 훈련하려면: + + + +[`Trainer`]를 사용하여 모델을 미세 조정하는 것이 익숙하지 않다면 [사전훈련된 모델을 미세 조정하기](training) 튜토리얼을 확인하세요. + + + +1. 작업 유형 및 하이퍼파라미터를 지정하여 어댑터 구성을 정의합니다. 하이퍼파라미터에 대한 자세한 내용은 [`~peft.LoraConfig`]를 참조하세요. + +```py +from peft import LoraConfig + +peft_config = LoraConfig( + lora_alpha=16, + lora_dropout=0.1, + r=64, + bias="none", + task_type="CAUSAL_LM", +) +``` + +2. 모델에 어댑터를 추가합니다. + +```py +model.add_adapter(peft_config) +``` + +3. 이제 모델을 [`Trainer`]에 전달할 수 있습니다! + +```py +trainer = Trainer(model=model, ...) +trainer.train() +``` + +훈련한 어댑터를 저장하고 다시 가져오려면: + +```py +model.save_pretrained(save_dir) +model = AutoModelForCausalLM.from_pretrained(save_dir) +``` diff --git a/docs/source/ko/perf_hardware.md b/docs/source/ko/perf_hardware.md new file mode 100644 index 00000000000000..01282a0c711147 --- /dev/null +++ b/docs/source/ko/perf_hardware.md @@ -0,0 +1,156 @@ + + + +# 훈련용 사용자 맞춤형 하드웨어 [[custom-hardware-for-training]] + +모델 훈련과 추론에 사용하는 하드웨어는 성능에 큰 영향을 미칠 수 있습니다. GPU에 대해 자세히 알아보려면, Tim Dettmer의 훌륭한 블로그 포스트를 확인해보세요. [블로그 포스트 링크](https://timdettmers.com/2020/09/07/which-gpu-for-deep-learning/) (영어로 작성됨). + +GPU 설정에 대한 실용적인 조언을 살펴보겠습니다. + +## GPU [[gpu]] +더 큰 모델을 훈련시킬 때는 기본적으로 세 가지 옵션이 있습니다: + +- 더 큰 GPU +- 더 많은 GPU +- 더 많은 CPU 및 NVMe ([DeepSpeed-Infinity](../en/main_classes/deepspeed#nvme-support)를 통한 오프로드(offload)) + +우선, 하나의 GPU만 사용하는 경우부터 시작해봅시다. + +### 전원 공급과 냉각 [[power-and-cooling]] + +비싼 고성능 GPU를 구매한 경우, 올바른 전원 공급과 충분한 냉각을 제공해야 합니다. + +**전원 공급**: + +일부 고성능 소비자용 GPU는 2개 혹은 가끔가다 3개의 PCI-E 8핀 전원 소켓이 있습니다. 카드에 있는 소켓 수만큼 독립적인 12V PCI-E 8핀 케이블이 연결되어 있는지 확인하세요. 같은 케이블의 한쪽 끝에 있는 2개의 스플릿(또는 피그테일(pigtail) 케이블)을 사용하지 마세요. 즉, GPU에 2개의 소켓이 있다면, PSU(전원 공급 장치)에서 카드로 연결되는 2개의 PCI-E 8핀 케이블이 필요하며, 끝에 2개의 PCI-E 8핀 커넥터가 있는 케이블이 필요하지 않습니다! 그렇지 않으면 카드의 전체 성능을 제대로 발휘하지 못할 수 있습니다. + +각각의 PCI-E 8핀 전원 케이블은 PSU 쪽의 12V 레일에 연결되어야 하며 최대 150W의 전력을 공급할 수 있습니다. + +일부 다른 GPU는 PCI-E 12핀 커넥터를 사용하며, 이러한 커넥터는 최대 500W-600W의 전력을 공급할 수 있습니다. + +저가형 GPU는 6핀 커넥터를 사용하며, 최대 75W의 전력을 공급합니다. + +또한 GPU가 안정적인 전압을 받을 수 있도록 고급 PSU를 선택해야 합니다. 일부 저품질의 PSU는 GPU가 최고 성능으로 동작하기 위해 필요한 전압을 안정적으로 공급하지 못할 수 있습니다. + +물론, PSU는 GPU에 전원을 공급하기에 충분한 여분의 전력 용량을 가져야 합니다. + +**냉각**: + +GPU가 과열되면 성능이 저하되고 최대 성능을 발휘하지 못할 수 있으며, 너무 뜨거워지면 중지될 수 있습니다. + +GPU가 과열될 때 정확한 적정 온도를 알기 어려우나, 아마도 +80℃ 미만이면 좋지만 더 낮을수록 좋습니다. 70℃-75℃ 정도가 훌륭한 온도 범위입니다. 성능 저하가 발생하기 시작하는 온도는 대략 84℃-90℃ 정도일 것입니다. 하지만 성능 저하 이외에도 지속적으로 매우 높은 온도는 GPU 수명을 단축시킬 수 있습니다. + +이어서, 여러 개의 GPU를 사용할 때 가장 중요한 측면 중 하나인 GPU 간 연결 방식을 살펴보겠습니다. + +### 다중 GPU 연결 방식 [[multigpu-connectivity]] + +다중 GPU를 사용하는 경우 GPU 간의 연결 방식은 전체 훈련 시간에 큰 영향을 미칠 수 있습니다. 
만약 GPU가 동일한 물리적 노드에 있을 경우, 다음과 같이 확인할 수 있습니다: + +```bash +nvidia-smi topo -m +``` + +만약 NVLink로 연결된 듀얼 GPU 환경이라면, 다음과 같은 결과를 확인할 수 있습니다: + +``` + GPU0 GPU1 CPU Affinity NUMA Affinity +GPU0 X NV2 0-23 N/A +GPU1 NV2 X 0-23 N/A +``` + +NVLink를 지원하지 않는 다른 환경의 경우에는 다음과 같은 결과를 확인할 수 있습니다: +``` + GPU0 GPU1 CPU Affinity NUMA Affinity +GPU0 X PHB 0-11 N/A +GPU1 PHB X 0-11 N/A +``` + +이 결과에는 다음과 같은 범례가 포함되어 있습니다: + +``` + X = Self + SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) + NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node + PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) + PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) + PIX = Connection traversing at most a single PCIe bridge + NV# = Connection traversing a bonded set of # NVLinks +``` + +따라서 첫 번째 결과의 `NV2`는 GPU가 2개의 NVLink로 연결되어 있다는 것을 나타내고, 두 번째 결과의 `PHB`는 일반적인 소비자용 PCIe+브릿지 설정을 가지고 있다는 것을 나타냅니다. + +설정에서 어떤 유형의 연결 방식을 가지고 있는지 확인하세요. 일부 연결 방식은 GPU 간 통신을 더 빠르게 만들 수 있으며(NVLink와 같이), 어떤 연결 방식은 더 느리게 만들 수 있습니다(PHB와 같이). + +사용하는 확장성 솔루션의 종류에 따라 연결 속도가 주요한 영향을 미칠 수도 있고 미미한 영향을 미칠 수도 있습니다. DDP와 같이 GPU가 거의 동기화하지 않아도 되는 경우, 연결 속도가 느려도 큰 영향을 받지 않습니다. 반면 ZeRO-DP와 같이 GPU간 통신이 많이 필요한 경우, 더 빠른 훈련을 위해서는 더 빠른 연결 속도가 중요합니다. + +#### NVLink [[nvlink]] + +[NVLink](https://en.wikipedia.org/wiki/NVLink)는 Nvidia에서 개발한 유선 기반의 직렬 다중 레인 근거리 통신 링크입니다. + +새로운 세대의 NVLink는 더 빠른 대역폭을 제공합니다. [Nvidia Ampere GA102 GPU Architecture](https://www.nvidia.com/content/dam/en-zz/Solutions/geforce/ampere/pdf/NVIDIA-ampere-GA102-GPU-Architecture-Whitepaper-V1.pdf)에서 아래와 같은 정보를 확인하실 수 있습니다: + +> 3세대 NVLink® +> GA102 GPU는 4개의 x4 링크를 포함하는 NVIDIA의 3세대 NVLink 인터페이스를 활용하며, +> 각 링크는 두 개의 GPU 간에 각 방향으로 초당 14.0625GB의 대역폭을 제공합니다. +> 4개의 링크는 각 방향에 초당 56.25GB의 대역폭을 제공하며, 두 개의 GPU 간에는 초당 112.5GB의 총 대역폭을 제공합니다. +> 두 개의 RTX 3090 GPU를 NVLink를 사용해 SLI로 연결할 수 있습니다. +> (3-Way 및 4-Way SLI 구성은 지원되지 않음에 유의하세요.) + + +따라서 `nvidia-smi topo -m`의 결과에서 `NVX`의 값이 높을수록 더 좋습니다. 세대는 GPU 아키텍처에 따라 다를 수 있습니다. + +그렇다면, openai-community/gpt2를 작은 wikitext 샘플로 학습시키는 예제를 통해, NVLink가 훈련에 어떤 영향을 미치는지 살펴보겠습니다. + +결과는 다음과 같습니다: + + +| NVlink | Time | +| ----- | ---: | +| Y | 101s | +| N | 131s | + + +NVLink 사용 시 훈련이 약 23% 더 빠르게 완료됨을 확인할 수 있습니다. 두 번째 벤치마크에서는 `NCCL_P2P_DISABLE=1`을 사용하여 NVLink를 사용하지 않도록 설정했습니다. 
+ +전체 벤치마크 코드와 결과는 다음과 같습니다: + +```bash +# DDP w/ NVLink + +rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \ +--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \ +--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \ +--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 + +{'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69} + +# DDP w/o NVLink + +rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \ +--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \ +--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train +--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 + +{'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69} +``` + +하드웨어: 각각 2개의 TITAN RTX 24GB + 2개의 NVLink (`NV2` in `nvidia-smi topo -m`) +소프트웨어: `pytorch-1.8-to-be` + `cuda-11.0` / `transformers==4.3.0.dev0` diff --git a/docs/source/ko/perf_infer_cpu.md b/docs/source/ko/perf_infer_cpu.md new file mode 100644 index 00000000000000..123e56b4f32c2f --- /dev/null +++ b/docs/source/ko/perf_infer_cpu.md @@ -0,0 +1,73 @@ + + +# CPU에서 효율적인 추론하기 [[efficient-inference-on-cpu]] + +이 가이드는 CPU에서 대규모 모델을 효율적으로 추론하는 방법에 중점을 두고 있습니다. + +## 더 빠른 추론을 위한 `BetterTransformer` [[bettertransformer-for-faster-inference]] + +우리는 최근 CPU에서 텍스트, 이미지 및 오디오 모델의 빠른 추론을 위해 `BetterTransformer`를 통합했습니다. 이 통합에 대한 더 자세한 내용은 [이 문서](https://huggingface.co/docs/optimum/bettertransformer/overview)를 참조하세요. + +## PyTorch JIT 모드 (TorchScript) [[pytorch-jitmode-torchscript]] +TorchScript는 PyTorch 코드에서 직렬화와 최적화가 가능한 모델을 생성할때 쓰입니다. TorchScript로 만들어진 프로그램은 기존 Python 프로세스에서 저장한 뒤, 종속성이 없는 새로운 프로세스로 가져올 수 있습니다. PyTorch의 기본 설정인 `eager` 모드와 비교했을때, `jit` 모드는 연산자 결합과 같은 최적화 방법론을 통해 모델 추론에서 대부분 더 나은 성능을 제공합니다. + +TorchScript에 대한 친절한 소개는 [PyTorch TorchScript 튜토리얼](https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html#tracing-modules)을 참조하세요. + +### JIT 모드와 함께하는 IPEX 그래프 최적화 [[ipex-graph-optimization-with-jitmode]] +Intel® Extension for PyTorch(IPEX)는 Transformers 계열 모델의 jit 모드에서 추가적인 최적화를 제공합니다. jit 모드와 더불어 Intel® Extension for PyTorch(IPEX)를 활용하시길 강력히 권장드립니다. Transformers 모델에서 자주 사용되는 일부 연산자 패턴은 이미 jit 모드 연산자 결합(operator fusion)의 형태로 Intel® Extension for PyTorch(IPEX)에서 지원되고 있습니다. Multi-head-attention, Concat Linear, Linear+Add, Linear+Gelu, Add+LayerNorm 결합 패턴 등이 이용 가능하며 활용했을 때 성능이 우수합니다. 연산자 결합의 이점은 사용자에게 고스란히 전달됩니다. 분석에 따르면, 질의 응답, 텍스트 분류 및 토큰 분류와 같은 가장 인기 있는 NLP 태스크 중 약 70%가 이러한 결합 패턴을 사용하여 Float32 정밀도와 BFloat16 혼합 정밀도 모두에서 성능상의 이점을 얻을 수 있습니다. + +[IPEX 그래프 최적화](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/features/graph_optimization.html)에 대한 자세한 정보를 확인하세요. + +#### IPEX 설치: [[ipex-installation]] + +IPEX 배포 주기는 PyTorch를 따라서 이루어집니다. 자세한 정보는 [IPEX 설치 방법](https://intel.github.io/intel-extension-for-pytorch/)을 확인하세요. + +### JIT 모드 사용법 [[usage-of-jitmode]] +평가 또는 예측을 위해 Trainer에서 JIT 모드를 사용하려면 Trainer의 명령 인수에 `jit_mode_eval`을 추가해야 합니다. + + + +PyTorch의 버전이 1.14.0 이상이라면, jit 모드는 jit.trace에서 dict 입력이 지원되므로, 모든 모델의 예측과 평가가 개선될 수 있습니다. + +PyTorch의 버전이 1.14.0 미만이라면, 질의 응답 모델과 같이 forward 매개변수의 순서가 jit.trace의 튜플 입력 순서와 일치하는 모델에 득이 될 수 있습니다. 텍스트 분류 모델과 같이 forward 매개변수 순서가 jit.trace의 튜플 입력 순서와 다른 경우, jit.trace가 실패하며 예외가 발생합니다. 이때 예외상황을 사용자에게 알리기 위해 Logging이 사용됩니다. 
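+
+스크립트가 아니라 [`Trainer`]를 직접 구성하는 경우에는 대략 다음과 같이 `jit_mode_eval`을 활성화할 수 있습니다. 아래 출력 경로 등은 설명을 위해 가정한 값입니다.
+
+```py
+from transformers import TrainingArguments
+
+training_args = TrainingArguments(
+    output_dir="/tmp/",      # 예시 경로
+    do_eval=True,
+    no_cuda=True,            # CPU에서 실행
+    jit_mode_eval=True,      # 평가/예측 시 TorchScript(jit) 모드 사용
+)
+```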
+ + + +[Transformers 질의 응답](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering)의 사용 사례 예시를 참조하세요. + + +- CPU에서 jit 모드를 사용한 추론: +
python run_qa.py \
+--model_name_or_path csarron/bert-base-uncased-squad-v1 \
+--dataset_name squad \
+--do_eval \
+--max_seq_length 384 \
+--doc_stride 128 \
+--output_dir /tmp/ \
+--no_cuda \
+--jit_mode_eval 
+ +- CPU에서 IPEX와 함께 jit 모드를 사용한 추론: +
python run_qa.py \
+--model_name_or_path csarron/bert-base-uncased-squad-v1 \
+--dataset_name squad \
+--do_eval \
+--max_seq_length 384 \
+--doc_stride 128 \
+--output_dir /tmp/ \
+--no_cuda \
+--use_ipex \
+--jit_mode_eval
diff --git a/docs/source/ko/perf_infer_gpu_many.md b/docs/source/ko/perf_infer_gpu_many.md new file mode 100644 index 00000000000000..3e4542180398e4 --- /dev/null +++ b/docs/source/ko/perf_infer_gpu_many.md @@ -0,0 +1,27 @@ + + +# 다중 GPU에서 효율적인 추론 [[efficient-inference-on-a-multiple-gpus]] + +이 문서에는 다중 GPU에서 효율적으로 추론하는 방법에 대한 정보가 포함되어 있습니다. + + +참고: 다중 GPU 설정은 [단일 GPU 섹션](./perf_infer_gpu_one)에서 설명된 대부분의 전략을 사용할 수 있습니다. 그러나 더 나은 활용을 위해 간단한 기법들을 알아야 합니다. + + + +## 더 빠른 추론을 위한 `BetterTransformer` [[bettertransformer-for-faster-inference]] + +우리는 최근 텍스트, 이미지 및 오디오 모델에 대한 다중 GPU에서 더 빠른 추론을 위해 `BetterTransformer`를 통합했습니다. 자세한 내용은 이 통합에 대한 [문서](https://huggingface.co/docs/optimum/bettertransformer/overview)를 확인하십시오. \ No newline at end of file diff --git a/docs/source/ko/perf_infer_gpu_one.md b/docs/source/ko/perf_infer_gpu_one.md new file mode 100644 index 00000000000000..73cef858b97def --- /dev/null +++ b/docs/source/ko/perf_infer_gpu_one.md @@ -0,0 +1,184 @@ + + +# 단일 GPU에서 효율적인 추론 [[efficient-inference-on-a-single-gpu]] + +이 가이드 외에도, [단일 GPU에서의 훈련 가이드](perf_train_gpu_one)와 [CPU에서의 추론 가이드](perf_infer_cpu)에서도 관련 정보를 찾을 수 있습니다. + +## Better Transformer: PyTorch 네이티브 Transformer 패스트패스 [[better-transformer-pytorchnative-transformer-fastpath]] + +PyTorch 네이티브 [`nn.MultiHeadAttention`](https://pytorch.org/blog/a-better-transformer-for-fast-transformer-encoder-inference/) 어텐션 패스트패스인 BetterTransformer는 [🤗 Optimum 라이브러리](https://huggingface.co/docs/optimum/bettertransformer/overview)의 통합을 통해 Transformers와 함께 사용할 수 있습니다. + +PyTorch의 어텐션 패스트패스는 커널 퓨전과 [중첩된 텐서](https://pytorch.org/docs/stable/nested.html)의 사용을 통해 추론 속도를 높일 수 있습니다. 자세한 벤치마크는 [이 블로그 글](https://medium.com/pytorch/bettertransformer-out-of-the-box-performance-for-huggingface-transformers-3fbe27d50ab2)에서 확인할 수 있습니다. + +[`optimum`](https://github.com/huggingface/optimum) 패키지를 설치한 후에는 추론 중 Better Transformer를 사용할 수 있도록 [`~PreTrainedModel.to_bettertransformer`]를 호출하여 관련 내부 모듈을 대체합니다: + +```python +model = model.to_bettertransformer() +``` + +[`~PreTrainedModel.reverse_bettertransformer`] 메소드는 정규화된 transformers 모델링을 사용하기 위해 모델을 저장하기 전 원래의 모델링으로 돌아갈 수 있도록 해줍니다: + +```python +model = model.reverse_bettertransformer() +model.save_pretrained("saved_model") +``` + +PyTorch 2.0부터는 어텐션 패스트패스가 인코더와 디코더 모두에서 지원됩니다. 지원되는 아키텍처 목록은 [여기](https://huggingface.co/docs/optimum/bettertransformer/overview#supported-models)에서 확인할 수 있습니다. + +## FP4 혼합 정밀도 추론을 위한 `bitsandbytes` 통합 [[bitsandbytes-integration-for-fp4-mixedprecision-inference]] + +`bitsandbytes`를 설치하면 GPU에서 손쉽게 모델을 압축할 수 있습니다. FP4 양자화를 사용하면 원래의 전체 정밀도 버전과 비교하여 모델 크기를 최대 8배 줄일 수 있습니다. 아래에서 시작하는 방법을 확인하세요. + + + +이 기능은 다중 GPU 설정에서도 사용할 수 있습니다. + + + +### 요구 사항 [[requirements-for-fp4-mixedprecision-inference]] + +- 최신 `bitsandbytes` 라이브러리 +`pip install bitsandbytes>=0.39.0` + +- 최신 `accelerate`를 소스에서 설치 +`pip install git+https://github.com/huggingface/accelerate.git` + +- 최신 `transformers`를 소스에서 설치 +`pip install git+https://github.com/huggingface/transformers.git` + +### FP4 모델 실행 - 단일 GPU 설정 - 빠른 시작 [[running-fp4-models-single-gpu-setup-quickstart]] + +다음 코드를 실행하여 단일 GPU에서 빠르게 FP4 모델을 실행할 수 있습니다. + +```py +from transformers import AutoModelForCausalLM + +model_name = "bigscience/bloom-2b5" +model_4bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True) +``` +`device_map`은 선택 사항입니다. 그러나 `device_map = 'auto'`로 설정하는 것이 사용 가능한 리소스를 효율적으로 디스패치하기 때문에 추론에 있어 권장됩니다. 
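+
+가져온 4비트 모델은 일반 모델과 동일하게 `generate`로 사용할 수 있습니다. 아래는 이를 보여주는 간단한 스케치이며, 프롬프트와 `max_new_tokens` 값은 설명을 위해 가정한 예시입니다.
+
+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_name = "bigscience/bloom-2b5"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model_4bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True)
+
+inputs = tokenizer("Hello, my llama is cute", return_tensors="pt").to("cuda")
+generated_ids = model_4bit.generate(**inputs, max_new_tokens=20)
+print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
+```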
+ +### FP4 모델 실행 - 다중 GPU 설정 [[running-fp4-models-multi-gpu-setup]] + +다중 GPU에서 혼합 4비트 모델을 가져오는 방법은 단일 GPU 설정과 동일합니다(동일한 명령어 사용): +```py +model_name = "bigscience/bloom-2b5" +model_4bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True) +``` +하지만 `accelerate`를 사용하여 각 GPU에 할당할 GPU RAM을 제어할 수 있습니다. 다음과 같이 `max_memory` 인수를 사용하세요: + +```py +max_memory_mapping = {0: "600MB", 1: "1GB"} +model_name = "bigscience/bloom-3b" +model_4bit = AutoModelForCausalLM.from_pretrained( + model_name, device_map="auto", load_in_4bit=True, max_memory=max_memory_mapping +) +``` +이 예에서는 첫 번째 GPU가 600MB의 메모리를 사용하고 두 번째 GPU가 1GB를 사용합니다. + +### 고급 사용법 [[advanced-usage]] + +이 방법의 더 고급 사용법에 대해서는 [양자화](main_classes/quantization) 문서 페이지를 참조하세요. + +## Int8 혼합 정밀도 행렬 분해를 위한 `bitsandbytes` 통합 [[bitsandbytes-integration-for-int8-mixedprecision-matrix-decomposition]] + + + +이 기능은 다중 GPU 설정에서도 사용할 수 있습니다. + + + +[`LLM.int8() : 8-bit Matrix Multiplication for Transformers at Scale`](https://arxiv.org/abs/2208.07339) 논문에서 우리는 몇 줄의 코드로 Hub의 모든 모델에 대한 Hugging Face 통합을 지원합니다. +이 방법은 `float16` 및 `bfloat16` 가중치에 대해 `nn.Linear` 크기를 2배로 줄이고, `float32` 가중치에 대해 4배로 줄입니다. 이는 절반 정밀도에서 이상치를 처리함으로써 품질에 거의 영향을 미치지 않습니다. + +![HFxbitsandbytes.png](https://cdn-uploads.huggingface.co/production/uploads/1659861207959-62441d1d9fdefb55a0b7d12c.png) + +Int8 혼합 정밀도 행렬 분해는 행렬 곱셈을 두 개의 스트림으로 분리합니다: (1) fp16로 곱해지는 체계적인 특이값 이상치 스트림 행렬(0.01%) 및 (2) int8 행렬 곱셈의 일반적인 스트림(99.9%). 이 방법을 사용하면 매우 큰 모델에 대해 예측 저하 없이 int8 추론이 가능합니다. +이 방법에 대한 자세한 내용은 [논문](https://arxiv.org/abs/2208.07339)이나 [통합에 관한 블로그 글](https://huggingface.co/blog/hf-bitsandbytes-integration)에서 확인할 수 있습니다. + +![MixedInt8.gif](https://cdn-uploads.huggingface.co/production/uploads/1660567469965-62441d1d9fdefb55a0b7d12c.gif) + +커널은 GPU 전용으로 컴파일되어 있기 때문에 혼합 8비트 모델을 실행하려면 GPU가 필요합니다. 이 기능을 사용하기 전에 모델의 1/4(또는 모델 가중치가 절반 정밀도인 경우 절반)을 저장할 충분한 GPU 메모리가 있는지 확인하세요. +이 모듈을 사용하는 데 도움이 되는 몇 가지 참고 사항이 아래에 나와 있습니다. 또는 [Google colab](#colab-demos)에서 데모를 따라할 수도 있습니다. + +### 요구 사항 [[requirements-for-int8-mixedprecision-matrix-decomposition]] + +- `bitsandbytes<0.37.0`을 사용하는 경우, 8비트 텐서 코어(Turing, Ampere 또는 이후 아키텍처 - 예: T4, RTX20s RTX30s, A40-A100)를 지원하는 NVIDIA GPU에서 실행하는지 확인하세요. `bitsandbytes>=0.37.0`을 사용하는 경우, 모든 GPU가 지원됩니다. +- 올바른 버전의 `bitsandbytes`를 다음 명령으로 설치하세요: +`pip install bitsandbytes>=0.31.5` +- `accelerate`를 설치하세요 +`pip install accelerate>=0.12.0` + +### 혼합 Int8 모델 실행 - 단일 GPU 설정 [[running-mixedint8-models-single-gpu-setup]] + +필요한 라이브러리를 설치한 후 혼합 8비트 모델을 가져오는 방법은 다음과 같습니다: + +```py +from transformers import AutoModelForCausalLM + +model_name = "bigscience/bloom-2b5" +model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True) +``` + +텍스트 생성의 경우: + +* `pipeline()` 함수 대신 모델의 `generate()` 메소드를 사용하는 것을 권장합니다. `pipeline()` 함수로는 추론이 가능하지만, 혼합 8비트 모델에 최적화되지 않았기 때문에 `generate()` 메소드를 사용하는 것보다 느릴 수 있습니다. 또한, nucleus 샘플링과 같은 일부 샘플링 전략은 혼합 8비트 모델에 대해 `pipeline()` 함수에서 지원되지 않습니다. +* 입력을 모델과 동일한 GPU에 배치하는 것이 좋습니다. 
+
+다음은 간단한 예입니다:
+
+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_name = "bigscience/bloom-2b5"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
+
+prompt = "Hello, my llama is cute"
+inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
+generated_ids = model_8bit.generate(**inputs)
+outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
+```
+
+
+### 혼합 Int8 모델 실행 - 다중 GPU 설정 [[running-mixedint8-models-multi-gpu-setup]]
+
+다중 GPU에서 혼합 8비트 모델을 로드하는 방법은 단일 GPU 설정과 동일합니다(동일한 명령어 사용):
+```py
+model_name = "bigscience/bloom-2b5"
+model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
+```
+하지만 `accelerate`를 사용하여 각 GPU에 할당할 GPU RAM을 제어할 수 있습니다. 다음과 같이 `max_memory` 인수를 사용하세요:
+
+```py
+max_memory_mapping = {0: "1GB", 1: "2GB"}
+model_name = "bigscience/bloom-3b"
+model_8bit = AutoModelForCausalLM.from_pretrained(
+    model_name, device_map="auto", load_in_8bit=True, max_memory=max_memory_mapping
+)
+```
+이 예시에서는 첫 번째 GPU가 1GB의 메모리를 사용하고 두 번째 GPU가 2GB를 사용합니다.
+
+### Colab 데모 [[colab-demos]]
+
+이 방법을 사용하면 이전에 Google Colab에서 추론할 수 없었던 모델에 대해 추론할 수 있습니다.
+Google Colab에서 8비트 양자화를 사용하여 T5-11b(42GB in fp32)를 실행하는 데모를 확인하세요:
+
+[![Open In Colab: T5-11b demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1YORPWx4okIHXnjW7MSAidXN29mPVNT7F?usp=sharing)
+
+또는 BLOOM-3B에 대한 데모를 확인하세요:
+
+[![Open In Colab: BLOOM-3b demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1qOjXfQIAULfKvZqwCen8-MoWKGdSatZ4?usp=sharing)
\ No newline at end of file
diff --git a/docs/source/ko/perf_train_cpu.md b/docs/source/ko/perf_train_cpu.md
new file mode 100644
index 00000000000000..1a6c58b25afae1
--- /dev/null
+++ b/docs/source/ko/perf_train_cpu.md
@@ -0,0 +1,67 @@
+
+
+# CPU에서 효율적인 훈련 [[efficient-training-on-cpu]]
+
+이 가이드는 CPU에서 대규모 모델을 효율적으로 훈련하는 데 초점을 맞춥니다.
+
+## IPEX와 혼합 정밀도 [[mixed-precision-with-ipex]]
+
+IPEX는 AVX-512 이상을 지원하는 CPU에 최적화되어 있으며, AVX2만 지원하는 CPU에도 기능적으로 작동합니다. 따라서 AVX-512 이상의 Intel CPU 세대에서는 성능상 이점이 있을 것으로 예상되며, AVX2만 지원하는 CPU(예: AMD CPU 또는 오래된 Intel CPU)에서도 IPEX로 더 나은 성능을 보일 수 있지만 이는 보장되지 않습니다. IPEX는 Float32와 BFloat16을 모두 사용하여 CPU 훈련을 위한 성능 최적화를 제공합니다. BFloat16의 사용은 다음 섹션의 주요 초점입니다.
+
+저정밀도 데이터 타입인 BFloat16은 3세대 Xeon® Scalable 프로세서 (코드명: Cooper Lake)에서 AVX512 명령어 집합을 네이티브로 지원해 왔으며, 다음 세대의 Intel® Xeon® Scalable 프로세서에서 Intel® Advanced Matrix Extensions (Intel® AMX) 명령어 집합을 지원하여 성능을 크게 향상시킬 예정입니다. CPU 백엔드의 자동 혼합 정밀도 기능은 PyTorch-1.10부터 활성화되었습니다. 동시에, Intel® Extension for PyTorch에서 BFloat16에 대한 CPU의 자동 혼합 정밀도 및 연산자의 BFloat16 최적화를 대규모로 활성화하고, PyTorch 마스터 브랜치로 부분적으로 업스트림을 반영했습니다. 사용자들은 IPEX 자동 혼합 정밀도를 사용하여 더 나은 성능과 사용자 경험을 얻을 수 있습니다.
+
+[자동 혼합 정밀도](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/features/amp.html)에 대한 자세한 정보를 확인하십시오.
+
+### IPEX 설치: [[ipex-installation]]
+
+IPEX 릴리스는 PyTorch를 따라갑니다. pip를 통해 설치하려면:
+
+| PyTorch Version | IPEX version |
+| :---------------: | :----------: |
+| 1.13 | 1.13.0+cpu |
+| 1.12 | 1.12.300+cpu |
+| 1.11 | 1.11.200+cpu |
+| 1.10 | 1.10.100+cpu |
+
+```bash
+pip install intel_extension_for_pytorch==<version_name> -f https://developer.intel.com/ipex-whl-stable-cpu
+```
+
+[IPEX 설치](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/installation.html)에 대한 더 많은 접근 방법을 확인하십시오.
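+
+[`Trainer`]를 사용하지 않고 IPEX의 BFloat16 자동 혼합 정밀도를 직접 적용하면 대략 다음과 같은 형태가 됩니다. 모델 이름, 배치 구성, 학습률은 설명을 위해 가정한 예시이며, 정확한 API는 IPEX 문서를 참고하세요.
+
+```py
+import torch
+import intel_extension_for_pytorch as ipex
+from transformers import AutoModelForSequenceClassification
+
+model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-uncased")
+optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
+model.train()
+
+# IPEX가 모델과 옵티마이저를 BFloat16 훈련에 맞게 최적화합니다.
+model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)
+
+# 가정한 더미 배치로 한 스텝을 수행하는 예시입니다.
+input_ids = torch.randint(0, model.config.vocab_size, (8, 128))
+labels = torch.randint(0, 2, (8,))
+with torch.cpu.amp.autocast(dtype=torch.bfloat16):
+    loss = model(input_ids=input_ids, labels=labels).loss
+loss.backward()
+optimizer.step()
+optimizer.zero_grad()
+```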
+ +### Trainer에서의 사용법 [[usage-in-trainer]] +Trainer에서 IPEX의 자동 혼합 정밀도를 활성화하려면 사용자는 훈련 명령 인수에 `use_ipex`, `bf16`, `no_cuda`를 추가해야 합니다. + +[Transformers 질문-응답](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering)의 사용 사례를 살펴보겠습니다. + +- CPU에서 BF16 자동 혼합 정밀도를 사용하여 IPEX로 훈련하기: +
 python run_qa.py \
+--model_name_or_path google-bert/bert-base-uncased \
+--dataset_name squad \
+--do_train \
+--do_eval \
+--per_device_train_batch_size 12 \
+--learning_rate 3e-5 \
+--num_train_epochs 2 \
+--max_seq_length 384 \
+--doc_stride 128 \
+--output_dir /tmp/debug_squad/ \
+--use_ipex \
+--bf16 --no_cuda
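+
+위 스크립트 인수는 [`Trainer`]를 직접 사용할 때 대략 다음과 같은 `TrainingArguments` 설정에 해당합니다. 출력 경로는 설명을 위해 가정한 값입니다.
+
+```py
+from transformers import TrainingArguments
+
+training_args = TrainingArguments(
+    output_dir="/tmp/debug_squad/",  # 예시 경로
+    use_ipex=True,   # IPEX 최적화 사용
+    bf16=True,       # BFloat16 자동 혼합 정밀도
+    no_cuda=True,    # CPU에서 훈련
+)
+```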
+ +### 실습 예시 [[practice-example]] + +블로그: [Intel Sapphire Rapids로 PyTorch Transformers 가속화](https://huggingface.co/blog/intel-sapphire-rapids) \ No newline at end of file diff --git a/docs/source/ko/perf_train_cpu_many.md b/docs/source/ko/perf_train_cpu_many.md new file mode 100644 index 00000000000000..e7a68971a7dc54 --- /dev/null +++ b/docs/source/ko/perf_train_cpu_many.md @@ -0,0 +1,134 @@ + + +# 다중 CPU에서 효율적으로 훈련하기 [[efficient-training-on-multiple-cpus]] + +하나의 CPU에서 훈련하는 것이 너무 느릴 때는 다중 CPU를 사용할 수 있습니다. 이 가이드는 PyTorch 기반의 DDP를 사용하여 분산 CPU 훈련을 효율적으로 수행하는 방법에 대해 설명합니다. + +## PyTorch용 Intel® oneCCL 바인딩 [[intel-oneccl-bindings-for-pytorch]] + +[Intel® oneCCL](https://github.com/oneapi-src/oneCCL) (collective communications library)은 allreduce, allgather, alltoall과 같은 집합 통신(collective communications)을 구현한 효율적인 분산 딥러닝 훈련을 위한 라이브러리입니다. oneCCL에 대한 자세한 정보는 [oneCCL 문서](https://spec.oneapi.com/versions/latest/elements/oneCCL/source/index.html)와 [oneCCL 사양](https://spec.oneapi.com/versions/latest/elements/oneCCL/source/index.html)을 참조하세요. + +`oneccl_bindings_for_pytorch` 모듈 (`torch_ccl`은 버전 1.12 이전에 사용)은 PyTorch C10D ProcessGroup API를 구현하며, 외부 ProcessGroup로 동적으로 가져올 수 있으며 현재 Linux 플랫폼에서만 작동합니다. + +[oneccl_bind_pt](https://github.com/intel/torch-ccl)에서 더 자세한 정보를 확인하세요. + +### PyTorch용 Intel® oneCCL 바인딩 설치: [[intel-oneccl-bindings-for-pytorch-installation]] + +다음 Python 버전에 대한 Wheel 파일을 사용할 수 있습니다. + +| Extension Version | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10 | +| :---------------: | :--------: | :--------: | :--------: | :--------: | :---------: | +| 1.13.0 | | √ | √ | √ | √ | +| 1.12.100 | | √ | √ | √ | √ | +| 1.12.0 | | √ | √ | √ | √ | +| 1.11.0 | | √ | √ | √ | √ | +| 1.10.0 | √ | √ | √ | √ | | + +```bash +pip install oneccl_bind_pt=={pytorch_version} -f https://developer.intel.com/ipex-whl-stable-cpu +``` +`{pytorch_version}`은 1.13.0과 같이 PyTorch 버전을 나타냅니다. +[oneccl_bind_pt 설치](https://github.com/intel/torch-ccl)에 대한 더 많은 접근 방법을 확인해 보세요. +oneCCL과 PyTorch의 버전은 일치해야 합니다. + + + +oneccl_bindings_for_pytorch 1.12.0 버전의 미리 빌드된 Wheel 파일은 PyTorch 1.12.1과 호환되지 않습니다(PyTorch 1.12.0용입니다). +PyTorch 1.12.1은 oneccl_bindings_for_pytorch 1.12.10 버전과 함께 사용해야 합니다. + + + +## Intel® MPI 라이브러리 [[intel-mpi-library]] +이 표준 기반 MPI 구현을 사용하여 Intel® 아키텍처에서 유연하고 효율적이며 확장 가능한 클러스터 메시징을 제공하세요. 이 구성 요소는 Intel® oneAPI HPC Toolkit의 일부입니다. + +oneccl_bindings_for_pytorch는 MPI 도구 세트와 함께 설치됩니다. 사용하기 전에 환경을 소스로 지정해야 합니다. + +Intel® oneCCL 버전 1.12.0 이상인 경우 +```bash +oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)") +source $oneccl_bindings_for_pytorch_path/env/setvars.sh +``` + +Intel® oneCCL 버전이 1.12.0 미만인 경우 +```bash +torch_ccl_path=$(python -c "import torch; import torch_ccl; import os; print(os.path.abspath(os.path.dirname(torch_ccl.__file__)))") +source $torch_ccl_path/env/setvars.sh +``` + +#### IPEX 설치: [[ipex-installation]] + +IPEX는 Float32와 BFloat16을 모두 사용하는 CPU 훈련을 위한 성능 최적화를 제공합니다. [single CPU section](./perf_train_cpu)을 참조하세요. + + +이어서 나오는 "Trainer에서의 사용"은 Intel® MPI 라이브러리의 mpirun을 예로 들었습니다. + + +## Trainer에서의 사용 [[usage-in-trainer]] +Trainer에서 ccl 백엔드를 사용하여 멀티 CPU 분산 훈련을 활성화하려면 명령 인수에 **`--ddp_backend ccl`**을 추가해야 합니다. + +[질의 응답 예제](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering)를 사용한 예를 살펴보겠습니다. + + +다음 명령은 한 Xeon 노드에서 2개의 프로세스로 훈련을 활성화하며, 각 소켓당 하나의 프로세스가 실행됩니다. OMP_NUM_THREADS/CCL_WORKER_COUNT 변수는 최적의 성능을 위해 조정할 수 있습니다. 
+```shell script + export CCL_WORKER_COUNT=1 + export MASTER_ADDR=127.0.0.1 + mpirun -n 2 -genv OMP_NUM_THREADS=23 \ + python3 run_qa.py \ + --model_name_or_path google-bert/bert-large-uncased \ + --dataset_name squad \ + --do_train \ + --do_eval \ + --per_device_train_batch_size 12 \ + --learning_rate 3e-5 \ + --num_train_epochs 2 \ + --max_seq_length 384 \ + --doc_stride 128 \ + --output_dir /tmp/debug_squad/ \ + --no_cuda \ + --ddp_backend ccl \ + --use_ipex +``` +다음 명령은 두 개의 Xeon(노드0 및 노드1, 주 프로세스로 노드0을 사용)에서 총 4개의 프로세스로 훈련을 활성화하며, 각 소켓당 하나의 프로세스가 실행됩니다. OMP_NUM_THREADS/CCL_WORKER_COUNT 변수는 최적의 성능을 위해 조정할 수 있습니다. + +노드0에서는 각 노드의 IP 주소를 포함하는 구성 파일(예: hostfile)을 생성하고 해당 구성 파일 경로를 인수로 전달해야 합니다. +```shell script + cat hostfile + xxx.xxx.xxx.xxx #node0 ip + xxx.xxx.xxx.xxx #node1 ip +``` +이제 노드0에서 다음 명령을 실행하면 **4DDP**가 노드0 및 노드1에서 BF16 자동 혼합 정밀도로 활성화됩니다. +```shell script + export CCL_WORKER_COUNT=1 + export MASTER_ADDR=xxx.xxx.xxx.xxx #node0 ip + mpirun -f hostfile -n 4 -ppn 2 \ + -genv OMP_NUM_THREADS=23 \ + python3 run_qa.py \ + --model_name_or_path google-bert/bert-large-uncased \ + --dataset_name squad \ + --do_train \ + --do_eval \ + --per_device_train_batch_size 12 \ + --learning_rate 3e-5 \ + --num_train_epochs 2 \ + --max_seq_length 384 \ + --doc_stride 128 \ + --output_dir /tmp/debug_squad/ \ + --no_cuda \ + --ddp_backend ccl \ + --use_ipex \ + --bf16 +``` diff --git a/docs/source/ko/perf_train_gpu_many.md b/docs/source/ko/perf_train_gpu_many.md new file mode 100644 index 00000000000000..c2a80505ef7659 --- /dev/null +++ b/docs/source/ko/perf_train_gpu_many.md @@ -0,0 +1,533 @@ + + +# 다중 GPU에서 효율적인 훈련 [[efficient-training-on-multiple-gpus]] + +단일 GPU에서의 훈련이 너무 느리거나 모델 가중치가 단일 GPU의 메모리에 맞지 않는 경우, 다중-GPU 설정을 사용합니다. 단일 GPU에서 다중 GPU로 전환하기 위해서는 작업을 분산해야 합니다. 데이터, 텐서 또는 파이프라인과 같은 병렬화 기법을 사용하여 작업을 병렬로 처리할 수 있습니다. 그러나 이러한 설정을 모두에게 적용할 수 있는 완벽한 해결책은 없으며, 어떤 설정이 가장 적합한지는 사용하는 하드웨어에 따라 달라집니다. 이 문서는 주로 PyTorch 기반의 구현을 중심으로 설명하며, 대부분의 개념은 다른 프레임워크에도 적용될 수 있을 것으로 예상됩니다. + + + + 참고: [단일 GPU 섹션](perf_train_gpu_one)에서 소개된 전략(혼합 정밀도 훈련 또는 그래디언트 누적 등)은 일반적으로 모델 훈련에 적용되며, 다중-GPU 또는 CPU 훈련과 같은 다음 섹션으로 진입하기 전에 해당 섹션을 참고하는 것이 좋습니다. + + + +먼저 1D 병렬화 기술에 대해 자세히 논의한 후, 이러한 기술을 결합하여 2D 및 3D 병렬화를 구현하여 더 빠른 훈련과 더 큰 모델을 지원하는 방법을 살펴볼 것입니다. 또한 다른 효과적인 대안 방식도 소개될 예정입니다. + +## 개념 [[concepts]] + +다음은 이 문서에서 자세히 설명될 주요 개념에 대한 간단한 설명입니다. + +1. **DataParallel (DP)** - 동일한 설정이 여러 번 복제되고, 각 설정에 데이터 일부를 받습니다. 처리는 병렬로 수행되며 모든 설정은 각 훈련 단계의 끝날 때 동기화됩니다. +2. **TensorParallel (TP)** - 각 텐서는 여러 개의 묶음으로 분할되기에, 전체 텐서가 단일 GPU에 상주하는 대신 텐서의 각 샤드가 지정된 GPU에 상주합니다. 처리하는 동안 각 샤드는 서로 다른 GPU에서 개별적으로 병렬 처리되며 결과는 단계가 끝날 때 동기화됩니다. 분할이 수평 수준에서 이루어지기 때문에 이를 수평 병렬 처리라고 부를 수 있습니다. +3. **PipelineParallel (PP)** - 모델이 수직으로 (레이어 수준) 여러 GPU에 분할되어 모델의 단일 GPU에는 하나 또는 여러 레이어가 배치됩니다. 각 GPU는 파이프라인의 서로 다른 단계를 병렬로 처리하며 작은 배치 묶음에서 작동합니다. +4. **Zero Redundancy Optimizer (ZeRO)** - TP와 유사하게 텐서를 샤딩하지만, 전체 텐서는 순방향 또는 역방향 계산을 위해 재구성되므로 모델을 수정할 필요가 없습니다. 또한 제한된 GPU 메모리를 보완하기 위해 다양한 오프로드 기술을 지원합니다. +5. **Sharded DDP** - ZeRO의 기본 개념으로 다른 ZeRO 구현에서도 사용되는 용어입니다. + +각 개념의 구체적인 내용에 대해 자세히 들어가기 전에 대규모 인프라에서 대규모 모델을 훈련하는 경우의 대략적인 결정 과정을 살펴보겠습니다. + +## 확장성 전략 [[scalability-strategy]] + +**⇨ 단일 노드 / 다중-GPU** +* 모델이 단일 GPU에 맞는 경우: + + 1. DDP - 분산 DP + 2. ZeRO - 상황과 구성에 따라 더 빠를 수도 있고 그렇지 않을 수도 있음 + +* 모델이 단일 GPU에 맞지 않는 경우: + + 1. PP + 2. ZeRO + 3. TP + + 노드 내 연결 속도가 매우 빠른 NVLINK 또는 NVSwitch의 경우 세 가지 방법은 대부분 비슷한 성능을 보여야 하며, PP가 없는 경우 TP 또는 ZeRO보다 빠를 것입니다. TP의 정도도 차이를 만들 수 있습니다. 특정 설정에서 승자를 찾기 위해 실험하는 것이 가장 좋습니다. + + TP는 거의 항상 단일 노드 내에서 사용됩니다. 즉, TP 크기 <= 노드당 GPU 수입니다. 
+ +* 가장 큰 레이어가 단일 GPU에 맞지 않는 경우: + + 1. ZeRO를 사용하지 않는 경우 - PP만으로는 맞지 않으므로 TP를 반드시 사용해야 함 + 2. ZeRO를 사용하는 경우에는 위의 "단일 GPU" 항목과 동일 + + +**⇨ 다중 노드 / 다중 GPU** + +* 노드 간 연결 속도가 빠른 경우: + + 1. ZeRO - 모델에 대부분의 수정을 필요로 하지 않음 + 2. PP+TP+DP - 통신이 적지만 모델에 대대적인 변경이 필요함 + +* 노드 간 연결 속도가 느리며, GPU 메모리가 여전히 부족한 경우: + + 1. DP+PP+TP+ZeRO-1 + + + +## 데이터 병렬화 [[data-parallelism]] + +2개의 GPU만으로도 대부분의 사용자들은 `DataParallel` (DP)과 `DistributedDataParallel` (DDP)을 통해 향상된 훈련 속도를 누릴 수 있습니다. 이는 PyTorch의 내장 기능입니다. 일반적으로 DDP를 사용하는 것이 좋으며, DP는 일부 모델에서 작동하지 않을 수 있으므로 주의해야 합니다. [PyTorch 문서](https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html)에서도 DDP의 사용을 권장합니다. + +### DP vs DDP [[dp-vs-ddp]] + +`DistributedDataParallel` (DDP)은 일반적으로 `DataParallel` (DP)보다 빠르지만, 항상 그렇지는 않습니다: +* DP는 파이썬 스레드 기반인 반면, DDP는 다중 프로세스 기반이기 때문에 GIL과 같은 파이썬 스레드 제한이 없습니다. +* 그러나 GPU 카드 간의 느린 상호 연결성은 DDP로 인해 실제로 느린 결과를 낼 수 있습니다. + +이 두 모드 간의 GPU 간 통신 오버헤드의 주요 차이점은 다음과 같습니다: + +[DDP](https://pytorch.org/docs/master/notes/ddp.html): + +- 시작할 때, 주 프로세스가 모델을 gpu 0에서 다른 모든 gpu로 복제합니다. +- 그런 다음 각 배치에 대해: + 1. 각 gpu는 자체 미니 배치 데이터를 직접 사용합니다. + 2. `backward` 동안 로컬 그래디언트가 준비되면, 모든 프로세스에 평균화됩니다. + +[DP](https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html): + +각 배치에 대해: + 1. gpu 0은 데이터 배치를 읽고 각 gpu에 미니 배치를 보냅니다. + 2. 업데이트된 모델을 gpu 0에서 각 gpu로 복제합니다. + 3. `forward`를 실행하고 각 gpu의 출력을 gpu 0으로 보내고 손실을 계산합니다. + 4. gpu 0에서 모든 gpu로 손실을 분산하고 `backward`를 실행합니다. + 5. 각 gpu에서 그래디언트를 gpu 0으로 보내고 이를 평균화합니다. + +DDP는 각 배치마다 그래디언트를 보내는 통신만을 수행하며, DP는 배치마다 5개의 다른 데이터 교환을 수행합니다. + +DP는 파이썬 스레드를 통해 프로세스 내에서 데이터를 복제하며, DDP는 [torch.distributed](https://pytorch.org/docs/master/distributed.html)를 통해 데이터를 복제합니다. + +DP에서는 gpu 0이 다른 gpu보다 훨씬 더 많은 작업을 수행하므로, gpu의 활용도가 낮아집니다. + +DDP는 여러 대의 컴퓨터에서 사용할 수 있지만, DP의 경우는 그렇지 않습니다. + +DP와 DDP 사이에는 다른 차이점이 있지만, 이 토론과는 관련이 없습니다. + +이 2가지 모드를 깊게 이해하고 싶다면, [이 문서](https://www.telesens.co/2019/04/04/distributed-data-parallel-training-using-pytorch-on-aws/)를 강력히 추천합니다. 이 문서는 멋진 다이어그램을 포함하고 있으며, 다양한 하드웨어에서 여러 벤치마크와 프로파일러 출력을 설명하여 필요한 세부 사항을 모두 설명합니다. + +실제 벤치마크를 살펴보겠습니다: + +| Type | NVlink | Time | +| :----- | ----- | ---: | +| 2:DP | Y | 110s | +| 2:DDP | Y | 101s | +| 2:DDP | N | 131s | + + +분석: + +여기서 DP는 NVlink가 있는 DDP보다 약 10% 느립니다. 그러나 NVlink가 없는 DDP보다 약 15% 빠릅니다. + +실제 차이는 각 GPU가 다른 GPU와 동기화해야 하는 데이터 양에 따라 달라질 것입니다. 동기화할 데이터가 많을수록 느린 링크가 총 실행 시간을 늦출 수 있습니다. + +다음은 전체 벤치마크 코드와 출력입니다: + +해당 벤치마크에서 `NCCL_P2P_DISABLE=1`을 사용하여 NVLink 기능을 비활성화했습니다. 
+ +```bash + +# DP +rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \ +python examples/pytorch/language-modeling/run_clm.py \ +--model_name_or_path openai-community/gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ +--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 + +{'train_runtime': 110.5948, 'train_samples_per_second': 1.808, 'epoch': 0.69} + +# DDP w/ NVlink +rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \ +torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \ +--model_name_or_path openai-community/gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ +--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 + +{'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69} + +# DDP w/o NVlink +rm -r /tmp/test-clm; NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 \ +torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \ +--model_name_or_path openai-community/gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ +--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 + +{'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69} +``` + +하드웨어: 각각 24GB의 TITAN RTX 2개 + NVlink과 2개의 NVLink (`nvidia-smi topo -m`에서 `NV2`입니다.) +소프트웨어: `pytorch-1.8-to-be` + `cuda-11.0` / `transformers==4.3.0.dev0` + +## ZeRO 데이터 병렬화 [[zero-data-parallelism]] + +ZeRO를 기반으로 한 데이터 병렬화 (ZeRO-DP)는 다음 [블로그 글](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/)의 다음 다이어그램에서 설명되고 있습니다. +![DeepSpeed-Image-1](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-zero.png) + +이 개념은 이해하기 어려울 수 있지만, 실제로는 매우 간단한 개념입니다. 이는 일반적인 `DataParallel` (DP)과 동일하지만, 전체 모델 매개변수, 그래디언트 및 옵티마이저 상태를 복제하는 대신 각 GPU는 그 중 일부만 저장합니다. 그리고 실행 시간에는 주어진 레이어에 대해 전체 레이어 매개변수가 필요할 때 각 GPU가 서로에게 필요한 부분을 제공하기 위해 동기화됩니다 - 그게 전부입니다. + +각각 3개의 레이어와 3개의 매개변수가 있는 간단한 모델을 생각해 봅시다: +``` +La | Lb | Lc +---|----|--- +a0 | b0 | c0 +a1 | b1 | c1 +a2 | b2 | c2 +``` +레이어 La에는 가중치 a0, a1 및 a2가 있습니다. + +3개의 GPU가 있는 경우, Sharded DDP (= Zero-DP)는 다음과 같이 모델을 3개의 GPU에 분할합니다: + +``` +GPU0: +La | Lb | Lc +---|----|--- +a0 | b0 | c0 + +GPU1: +La | Lb | Lc +---|----|--- +a1 | b1 | c1 + +GPU2: +La | Lb | Lc +---|----|--- +a2 | b2 | c2 +``` + +일반적인 DNN 다이어그램을 상상해보면 이는 텐서 병렬 처리와 같은 수평 슬라이싱입니다. 수직 슬라이싱은 전체 레이어 그룹을 다른 GPU에 배치하는 것입니다. 이는 시작에 불과합니다. + +이제 이러한 각각의 GPU는 DP에서 작동하는 것과 마찬가지로 일반적인 미니 배치를 받습니다: +``` +x0 => GPU0 +x1 => GPU1 +x2 => GPU2 +``` + +입력은 수정되지 않은 상태로 일반 모델에 의해 처리될 것으로 간주합니다. + +먼저, 입력은 레이어 La에 도달합니다. + +GPU0에만 집중해 보겠습니다. x0은 순방향 경로를 수행하기 위해 a0, a1, a2 파라미터가 필요하지만 GPU0에는 a0만 있습니다. GPU1에서 a1을, GPU2에서 a2를 전송받아 모델의 모든 조각을 하나로 모읍니다. + +병렬적으로, GPU1은 미니 배치 x1을 받고 a1만 가지고 있지만, a0 및 a2 매개변수가 필요합니다. 따라서 GPU0 및 GPU2에서 이를 가져옵니다. + +GPU2도 동일한 작업을 수행합니다. 입력 x2를 받고 GPU0 및 GPU1에서 각각 a0과 a1을, 그리고 자신의 a2와 함께 전체 텐서를 복원합니다. + +3개의 GPU는 복원된 전체 텐서를 받고 forward가 수행됩니다. + +계산이 완료되면 더 이상 필요하지 않은 데이터는 삭제되고, 해당 데이터는 계산 중에만 사용됩니다. 복원은 사전 패치를 통해 효율적으로 수행됩니다. + +그리고 전체 프로세스는 레이어 Lb에 대해 반복되고, 그 다음 Lc로 순방향으로, 그다음은 역방향으로 Lc -> Lb -> La로 반복됩니다. + +개인적으로 이것은 효율적인 그룹 배낭 여행자의 중량 분배 전략처럼 들립니다: + +1. 사람 A가 텐트를 운반합니다. +2. 사람 B가 난로를 운반합니다. +3. 사람 C가 도끼를 운반합니다. + +이제 매일 밤 각자 가진 것을 다른 사람들과 공유하고, 가지지 않은 것은 다른 사람들로부터 받고, 아침에는 할당된 유형의 장비를 싸고 계속해서 여행을 진행합니다. 이것이 Sharded DDP / Zero DP입니다. + +이 전략을 각각 자신의 텐트, 난로 및 도끼를 개별적으로 운반해야 하는 단순한 전략과 비교해보면 훨씬 비효율적일 것입니다. 
이것이 Pytorch의 DataParallel (DP 및 DDP)입니다. + +이 주제에 대해 논문을 읽을 때 다음 동의어를 만날 수 있습니다: Sharded, Partitioned. + +ZeRO가 모델 가중치를 분할하는 방식을 자세히 살펴보면, 텐서 병렬화와 매우 유사한 것을 알 수 있습니다. 이는 이후에 설명될 수직 모델 병렬화와는 달리 각 레이어의 가중치를 분할/분할하기 때문입니다. + +구현: + +- [DeepSpeed](https://www.deepspeed.ai/tutorials/zero/)는 1단계 + 2단계 + 3단계의 ZeRO-DP를 제공합니다. +- [Fairscale](https://github.com/facebookresearch/fairscale/#optimizer-state-sharding-zero)은 1단계 + 2단계 + 3단계의 ZeRO-DP를 제공합니다. +- [`transformers` 통합](main_classes/trainer#trainer-integrations) + +## 네이티브 모델 병렬 처리(수직적) 및 파이프라인 병렬 처리[[naive-model-parallelism-vertical-and-pipeline-parallelism]] + +Naive Model Parallelism (MP)은 모델 레이어 그룹을 다중 GPU에 분산하는 방식입니다. 메커니즘은 상대적으로 간단합니다. 원하는 레이어를 `.to()`를 사용하여 원하는 장치로 전환하면 데이터가 해당 레이어로 들어오고 나갈 때 데이터도 레이어와 동일한 장치로 전환되고 나머지는 수정되지 않습니다. + +대부분의 모델이 그려지는 방식이 레이어를 세로로 슬라이스하기 때문에 이를 수직 모델 병렬화라고 부릅니다. 예를 들어 다음 다이어그램은 8레이어 모델을 보여줍니다: + +``` +=================== =================== +| 0 | 1 | 2 | 3 | | 4 | 5 | 6 | 7 | +=================== =================== + gpu0 gpu1 +``` +우리는 모델을 수직으로 2개로 분할하여 레이어 0-3을 GPU0에 배치하고 레이어 4-7을 GPU1에 배치했습니다. + +이제 데이터가 레이어 0에서 1로, 1에서 2로, 2에서 3으로 이동하는 동안에는 일반적인 모델입니다. 그러나 데이터가 레이어 3에서 레이어 4로 전달되어야 할 때는 GPU0에서 GPU1로 이동해야 하므로 통신 오버헤드가 발생합니다. 참여하는 GPU가 동일한 컴퓨팅 노드(예: 동일한 물리적인 기계)에 있는 경우 이 복사는 매우 빠릅니다. 그러나 GPU가 서로 다른 컴퓨팅 노드(예: 여러 기계)에 위치한 경우 통신 오버헤드는 상당히 크게 될 수 있습니다. + +그런 다음 레이어 4부터 5로, 6으로, 7로 진행되는 것은 일반적인 모델과 동일하게 진행되고, 7번째 레이어가 완료되면 데이터를 다시 레이어 0으로 보내거나 또는 레이블을 마지막 레이어로 보내야 할 필요가 있습니다. 이제 손실을 계산하고 옵티마이저가 작동할 수 있습니다. + +문제점: +- 이 방식을 "naive" MP라고 부르는 이유는 주어진 상황에 하나의 GPU를 제외한 모든 GPU가 유휴 상태라는 점입니다. 따라서 4개의 GPU를 사용하는 경우 단일 GPU의 메모리 양을 4배로 늘리고 나머지 하드웨어는 무시하는 것과 거의 동일합니다. 또한 장치 간 데이터 복사의 오버헤드도 있습니다. 따라서 4개의 6GB 카드는 naive MP를 사용하여 1개의 24GB 카드와 동일한 크기를 수용할 수 있지만, 후자는 데이터 복사의 오버헤드가 없으므로 훈련을 더 빨리 완료합니다. 그러나 예를 들어 40GB 카드가 있고 45GB 모델을 맞추어야 할 경우 4개의 40GB 카드로 맞출 수 있습니다 (하지만 그래디언트와 옵티마이저 상태 때문에 가까스로 가능합니다). +- 공유 임베딩은 GPU 간에 복사해야 할 수도 있습니다. + +파이프라인 병렬화 (PP)은 거의 naive MP와 동일하지만 GPU 유휴 상태 문제를 해결하기 위해 들어오는 배치를 마이크로 배치로 나누고 인공적으로 파이프라인을 생성하여 서로 다른 GPU가 동시에 계산에 참여할 수 있게 합니다. + +[GPipe 논문](https://ai.googleblog.com/2019/03/introducing-gpipe-open-source-library.html)에서 가져온 그림은 상단에 naive MP를, 하단에는 PP를 보여줍니다: + +![mp-pp](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-gpipe-bubble.png) + +하단 다이어그램에서 PP가 유휴 영역이 적은 것을 쉽게 볼 수 있습니다. 유휴 부분을 "bubble"이라고 합니다. + +다이어그램의 양쪽 부분은 참여하는 GPU가 4개인 병렬성을 보여줍니다. 즉, 4개의 GPU가 파이프라인에 참여합니다. 따라서 4개의 파이프 단계 F0, F1, F2 및 F3의 순방향 경로와 B3, B2, B1 및 B0의 역방향 경로가 있습니다. + +PP는 조정해야 할 새로운 하이퍼파라미터인 `chunks`를 도입합니다. 이는 동일한 파이프 단계를 통해 일련의 데이터를 묶어서 보내는 방식을 정의합니다. 예를 들어, 아래 다이어그램에서 `chunks=4`를 볼 수 있습니다. GPU0은 0, 1, 2 및 3 (F0,0, F0,1, F0,2, F0,3) 묶음에서 동일한 순방향 경로를 수행하고, 다른 GPU가 작업을 수행하기 시작하고 완료가 시작될 때만 GPU0이 묶음의 역순으로 3, 2, 1 및 0 (B0,3, B0,2, B0,1, B0,0) 경로를 수행합니다. + +개념적으로 이는 그래디언트 누적 단계 (GAS)와 동일한 개념입니다. 파이토치에서는 `chunks`를 사용하고 DeepSpeed에서는 동일한 하이퍼파라미터를 GAS로 참조합니다. + +묶음으로 인해 PP는 마이크로 배치 (MBS)의 개념을 도입합니다. DP는 전역 데이터 배치 크기를 미니 배치로 나눕니다. 따라서 DP 차수가 4이고 전역 배치 크기가 1024이면 256씩 4개의 미니 배치로 분할됩니다 (1024/4). 그리고 `chunks` (또는 GAS)의 수가 32이면 마이크로 배치 크기는 8이 됩니다 (256/32). 각 파이프라인 단계는 한 번에 하나의 마이크로 배치와 함께 작동합니다. + +DP + PP 설정의 전역 배치 크기를 계산하려면 `mbs*chunks*dp_degree` (`8*32*4=1024`)를 수행합니다. + +다이어그램으로 돌아가 보겠습니다. + +`chunks=1`로 설정하면 매우 비효율적인 naive MP가 생성되며, 매우 큰 `chunks` 값으로 설정하면 아주 작은 마이크로 배치 크기가 생성되어 효율적이지 않을 수 있습니다. 따라서 가장 효율적인 GPU 활용을 위해 어떤 값이 가장 적절한지 실험을 해야 합니다. 
+ +다이어그램에서 보이는 것처럼 "dead" 시간의 버블이 존재하여 마지막 `forward` 단계가 `backward` 단계가 파이프라인을 완료하기를 기다려야 하는 상황이 발생하지만, `chunks`의 가장 적절한 값을 찾는 것의 목적은 모든 참여하는 GPU에서 동시에 고도로 활용되는 GPU 활용을 가능하게 하여 버블의 크기를 최소화하는 것입니다. + +해결책은 전통적인 파이프라인 API와 더 현대적인 솔루션으로 나뉩니다. 전통적인 파이프라인 API 솔루션과 현대적인 솔루션에 대해 알아보겠습니다. + +전통적인 파이프라인 API 솔루션: +- 파이토치 +- FairScale +- DeepSpeed +- Megatron-LM + +현대적인 솔루션: +- Varuna +- Sagemaker + +전통적인 파이프라인 API 솔루션의 문제점: +- 모델을 상당히 수정해야 한다는 점이 문제입니다. 파이프라인은 모듈의 정상적인 흐름을 `nn.Sequential` 시퀀스로 다시 작성해야 하므로 모델의 설계를 변경해야 할 수 있습니다. +- 현재 파이프라인 API는 매우 제한적입니다. 파이프라인의 매우 첫 번째 단계에서 전달되는 많은 파이썬 변수가 있는 경우 이를 해결해야 합니다. 현재 파이프라인 인터페이스는 하나의 텐서 또는 텐서의 튜플을 유일한 입력 및 출력으로 요구합니다. 이러한 텐서는 마이크로 배치로 미니 배치로 묶을 것이므로 첫 번째 차원으로 배치 크기가 있어야 합니다. 가능한 개선 사항은 여기에서 논의되고 있습니다. https://github.com/pytorch/pytorch/pull/50693 +- 파이프 단계 수준에서 조건부 제어 흐름은 불가능합니다. 예를 들어, T5와 같은 인코더-디코더 모델은 조건부 인코더 단계를 처리하기 위해 특별한 해결책이 필요합니다. +- 각 레이어를 정렬하여 하나의 모델의 출력이 다른 모델의 입력이 되도록해야 합니다. + +우리는 아직 Varuna와 SageMaker로 실험하지 않았지만, 해당 논문들은 위에서 언급한 문제들의 목록을 극복했고 사용자의 모델에 대한 변경 사항이 훨씬 적게 필요하다고 보고하고 있습니다. + +구현: +- [파이토치](https://pytorch.org/docs/stable/pipeline.html) (파이토치-1.8에서 초기 지원, 1.9에서 점진적으로 개선되고 1.10에서 더 개선됨). [예제](https://github.com/pytorch/pytorch/blob/master/benchmarks/distributed/pipeline/pipe.py)도 참고하세요. +- [FairScale](https://fairscale.readthedocs.io/en/latest/tutorials/pipe.html) +- [DeepSpeed](https://www.deepspeed.ai/tutorials/pipeline/) +- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)은 내부 구현을 가지고 있습니다 - API 없음. +- [Varuna](https://github.com/microsoft/varuna) +- [SageMaker](https://arxiv.org/abs/2111.05972) - 이는 AWS에서만 사용할 수 있는 소유 솔루션입니다. +- [OSLO](https://github.com/tunib-ai/oslo) - 이는 Hugging Face Transformers를 기반으로 구현된 파이프라인 병렬화입니다. + +🤗 Transformers 상태: 이 작성 시점에서 모델 중 어느 것도 완전한 PP를 지원하지 않습니다. GPT2와 T5 모델은 naive MP를 지원합니다. 주요 장애물은 모델을 `nn.Sequential`로 변환하고 모든 입력을 텐서로 가져와야 하는 것을 처리할 수 없기 때문입니다. 현재 모델에는 이러한 변환을 매우 복잡하게 만드는 많은 기능이 포함되어 있어 제거해야 합니다. + +기타 접근 방법: + +DeepSpeed, Varuna 및 SageMaker는 [교차 파이프라인(Interleaved Pipeline)](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html) 개념을 사용합니다. +![interleaved-pipeline-execution](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-sagemaker-interleaved-pipeline.png) + +여기서는 버블(유휴 시간)을 역방향 패스에 우선순위를 부여하여 최소화합니다. + +Varuna는 가장 효율적인 스케줄링을 찾기 위해 시뮬레이션을 사용하여 스케줄을 개선하려고 합니다. + +OSLO는 `nn.Sequential`로 변환하지 않고 Transformers를 기반으로 한 파이프라인 병렬화를 구현했습니다. + +## 텐서 병렬 처리 [[tensor-parallelism]] + +텐서 병렬 처리에서는 각 GPU가 텐서의 일부분만 처리하고 전체 텐서가 필요한 연산에 대해서만 전체 텐서를 집계합니다. + +이 섹션에서는 [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) 논문인 [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/abs/2104.04473)에서의 개념과 다이어그램을 사용합니다. + +Transformer의 주요 구성 요소는 fully connected `nn.Linear`와 비선형 활성화 함수인 `GeLU`입니다. + +Megatron 논문의 표기법을 따라 행렬의 점곱 부분을 `Y = GeLU(XA)`로 표현할 수 있습니다. 여기서 `X`와 `Y`는 입력 및 출력 벡터이고 `A`는 가중치 행렬입니다. + +행렬 형태로 계산을 살펴보면, 행렬 곱셈을 다중 GPU로 분할할 수 있는 방법을 쉽게 알 수 있습니다: +![Parallel GEMM](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-tp-parallel_gemm.png) + +가중치 행렬 `A`를 `N`개의 GPU에 대해 열별로 분할하고 병렬로 행렬 곱셈 `XA_1`에서 `XA_n`까지 수행하면 `N`개의 출력 벡터 `Y_1, Y_2, ..., Y_n`가 생성되며 독립적으로 `GeLU`에 전달될 수 있습니다: +![independent GeLU](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-tp-independent-gelu.png) + +이 원리를 사용하여 동기화가 필요하지 않은 GPU 간의 임의 깊이의 MLP를 업데이트할 수 있습니다. 그러나 결과 벡터를 샤드로부터 재구성해야 하는 마지막 단계까지는 GPU 간의 동기화가 필요합니다. 
Megatron-LM 논문의 저자들은 이에 대한 유용한 그림을 제공합니다: +![parallel shard processing](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-tp-parallel_shard_processing.png) + +다중 헤드 어텐션 레이어의 병렬화는 더욱 간단합니다. 이미 독립적인 다중 헤드를 가지고 있기 때문에 이미 병렬화되어 있습니다! +![parallel self-attention](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-tp-parallel_self_attention.png) + +특별 고려사항: TP는 매우 빠른 네트워크가 필요하므로 한 개 이상의 노드에서 TP를 수행하는 것은 권장되지 않습니다. 실제로 노드에 4개의 GPU가 있는 경우 TP의 최대 차수는 4입니다. TP 차수가 8인 경우 최소한 8개의 GPU가 있는 노드를 사용해야 합니다. + +이 섹션은 원래의 [더 자세한 TP 개요](https://github.com/huggingface/transformers/issues/10321#issuecomment-783543530)를 기반으로 합니다. +작성자는 [@anton-l](https://github.com/anton-l)입니다. + +SageMaker는 더 효율적인 처리를 위해 TP와 DP를 결합합니다. + +대체 이름: +- DeepSpeed는 이를 [텐서 슬라이싱](https://www.deepspeed.ai/training/#model-parallelism)이라고 부릅니다. + +구현: +- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)은 내부 구현을 가지고 있으므로 모델에 매우 특화되어 있습니다. +- [parallelformers](https://github.com/tunib-ai/parallelformers) (현재는 추론에만 해당) +- [SageMaker](https://arxiv.org/abs/2111.05972) - 이는 AWS에서만 사용할 수 있는 소유 솔루션입니다. +- [OSLO](https://github.com/tunib-ai/oslo)은 Transformers를 기반으로 한 텐서 병렬 처리 구현을 가지고 있습니다. + +🤗 Transformers 현황: +- core: 아직 핵심 부분에 구현되지 않음 +- 그러나 추론을 하려면 [parallelformers](https://github.com/tunib-ai/parallelformers)가 대부분의 모델을 지원합니다. 따라서 핵심 부분에 구현되기 전까지 그들의 것을 사용할 수 있습니다. 그리고 훈련 모드도 지원될 예정입니다. +- Deepspeed-Inference는 CUDA 커널을 기반으로 하는 매우 빠른 추론 모드에서 BERT, GPT-2 및 GPT-Neo 모델을 지원합니다. 자세한 내용은 [여기](https://www.deepspeed.ai/tutorials/inference-tutorial/)를 참조하세요. + +## DP+PP [[dppp]] + +DeepSpeed [pipeline tutorial](https://www.deepspeed.ai/tutorials/pipeline/)에서 다음 다이어그램은 DP와 PP를 결합하는 방법을 보여줍니다. + +![dp-pp-2d](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-zero-dp-pp.png) + +여기서 DP 랭크 0은 GPU2를 보지 못하고, DP 랭크 1은 GPU3을 보지 못하는 것이 중요합니다. DP에게는 딱 2개의 GPU인 것처럼 데이터를 공급합니다. GPU0은 PP를 사용하여 GPU2에게 일부 작업을 "비밀리에" 할당합니다. 그리고 GPU1도 GPU3을 도움으로 삼아 같은 방식으로 작업합니다. + +각 차원마다 적어도 2개의 GPU가 필요하므로 최소한 4개의 GPU가 필요합니다. + +구현: +- [DeepSpeed](https://github.com/microsoft/DeepSpeed) +- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) +- [Varuna](https://github.com/microsoft/varuna) +- [SageMaker](https://arxiv.org/abs/2111.05972) +- [OSLO](https://github.com/tunib-ai/oslo) + +🤗 Transformers 현황: 아직 구현되지 않음 + +## DP+PP+TP [[dppptp]] + +더 효율적인 훈련을 위해 PP와 TP 및 DP를 결합하여 3D 병렬 처리를 사용합니다. 다음 다이어그램에서 이를 확인할 수 있습니다. + +![dp-pp-tp-3d](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-deepspeed-3d.png) + +이 다이어그램은 [3D parallelism: Scaling to trillion-parameter models](https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/)이라는 블로그 글에서 확인할 수 있습니다. + +각 차원마다 적어도 2개의 GPU가 필요하므로 최소한 8개의 GPU가 필요합니다. + +구현: +- [DeepSpeed](https://github.com/microsoft/DeepSpeed) - DeepSpeed는 더욱 효율적인 DP인 ZeRO-DP라고도 부릅니다. +- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) +- [Varuna](https://github.com/microsoft/varuna) +- [SageMaker](https://arxiv.org/abs/2111.05972) +- [OSLO](https://github.com/tunib-ai/oslo) + +🤗 Transformers 현황: 아직 구현되지 않음. PP와 TP가 없기 때문입니다. + +## ZeRO DP+PP+TP [[zero-dppptp]] + +DeepSpeed의 주요 기능 중 하나는 DP의 확장인 ZeRO입니다. ZeRO-DP에 대해 이미 [ZeRO Data Parallelism](#zero-data-parallelism)에서 논의되었습니다. 일반적으로 이는 PP나 TP를 필요로하지 않는 독립적인 기능입니다. 그러나 PP와 TP와 결합할 수도 있습니다. + +ZeRO-DP가 PP와 (선택적으로 TP와) 결합되면 일반적으로 ZeRO 단계 1(옵티마이저 분할)만 활성화됩니다. 
+ +이론적으로는 ZeRO 단계 2(그라디언트 분할)를 파이프라인 병렬 처리와 함께 사용할 수도 있지만, 이는 성능에 나쁜 영향을 미칠 것입니다. 각 마이크로 배치마다 그라디언트를 샤딩하기 전에 추가적인 리듀스-스캐터 컬렉티브가 필요하며, 이는 잠재적으로 상당한 통신 오버헤드를 추가합니다. 파이프라인 병렬 처리의 특성상 작은 마이크로 배치가 사용되며, 산술 연산 강도(마이크로 배치 크기)를 균형 있게 유지하면서 파이프라인 버블(마이크로 배치 수)을 최소화하는 것에 중점을 둡니다. 따라서 해당 통신 비용은 문제가 될 것입니다. + +또한, PP로 인해 정상보다 적은 수의 레이어가 있으므로 메모리 절약은 크지 않을 것입니다. PP는 이미 그래디언트 크기를 ``1/PP``로 줄이기 때문에 그래디언트 샤딩의 절약 효과는 순수 DP보다는 미미합니다. + +ZeRO 단계 3도 같은 이유로 좋은 선택이 아닙니다 - 더 많은 노드 간 통신이 필요합니다. + +그리고 ZeRO가 있기 때문에 다른 이점은 ZeRO-Offload입니다. 이는 단계 1이므로 옵티마이저 상태를 CPU로 오프로드할 수 있습니다. + +구현: +- [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed) 및 [BigScience의 Megatron-Deepspeed](https://github.com/bigscience-workshop/Megatron-DeepSpeed), 이전 저장소의 포크입니다. +- [OSLO](https://github.com/tunib-ai/oslo) + +중요한 논문: + +- [Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model]( +https://arxiv.org/abs/2201.11990) + +🤗 Transformers 현황: 아직 구현되지 않음, PP와 TP가 없기 때문입니다. + +## FlexFlow [[flexflow]] + +[FlexFlow](https://github.com/flexflow/FlexFlow)는 약간 다른 방식으로 병렬화 문제를 해결합니다. + +논문: ["Beyond Data and Model Parallelism for Deep Neural Networks" by Zhihao Jia, Matei Zaharia, Alex Aiken](https://arxiv.org/abs/1807.05358) + +이는 Sample-Operator-Attribute-Parameter를 기반으로 하는 일종의 4D 병렬화를 수행합니다. + +1. Sample = 데이터 병렬화 (샘플별 병렬) +2. Operator = 단일 연산을 여러 하위 연산으로 병렬화 +3. Attribute = 데이터 병렬화 (길이별 병렬) +4. Parameter = 모델 병렬화 (수평 또는 수직과 관계없이) + +예시: +* Sample + +512 길이의 10개의 배치를 가정해 봅시다. 이를 sample 차원으로 2개의 장치에 병렬화하면, 10 x 512는 5 x 2 x 512가 됩니다. + +* Operator + +레이어 정규화를 수행한다면, 우선 std를 계산하고 두 번째로 mean을 계산한 다음 데이터를 정규화할 수 있습니다. Operator 병렬화는 std와 mean을 병렬로 계산할 수 있도록 합니다. 따라서 operator 차원으로 2개의 장치 (cuda:0, cuda:1)에 병렬화하면, 먼저 입력 데이터를 두 장치로 복사한 다음 cuda:0에서 std를 계산하고 cuda:1에서 동시에 mean을 계산합니다. + +* Attribute + +512 길이의 10개의 배치가 있습니다. 이를 attribute 차원으로 2개의 장치에 병렬화하면, 10 x 512는 10 x 2 x 256이 됩니다. + +* Parameter + +이는 tensor 모델 병렬화 또는 naive layer-wise 모델 병렬화와 유사합니다. + +![flex-flow-soap](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/parallelism-flexflow.jpeg) + +이 프레임워크의 중요한 점은 (1) GPU/TPU/CPU 대 (2) RAM/DRAM 대 (3) 빠른 인트라-커넥트 대 느린 인터-커넥트와 같은 리소스를 고려하여 어디에서 어떤 병렬화를 사용할지를 알고리즘적으로 자동으로 최적화한다는 것입니다. + +하나 매우 중요한 측면은 FlexFlow가 정적이고 고정된 워크로드를 가진 모델에 대한 DNN 병렬화를 최적화하기 위해 설계되었다는 것입니다. 동적인 동작을 가진 모델은 반복마다 다른 병렬화 전략을 선호할 수 있습니다. + +따라서 이 프레임워크의 장점은 선택한 클러스터에서 30분 동안 시뮬레이션을 실행하고 이 특정 환경을 최적으로 활용하기 위한 최상의 전략을 제안한다는 것입니다. 부품을 추가/제거/교체하면 실행하고 그에 대한 계획을 다시 최적화한 후 훈련할 수 있습니다. 다른 설정은 자체적인 사용자 정의 최적화를 가질 수 있습니다. + +🤗 Transformers 현황: 아직 통합되지 않음. 이미 [transformers.utils.fx](https://github.com/huggingface/transformers/blob/master/src/transformers/utils/fx.py)를 통해 모델을 FX-추적할 수 있으며, 이는 FlexFlow의 선행 조건입니다. 따라서 어떤 작업을 수행해야 FlexFlow가 우리의 모델과 함께 작동할 수 있는지 파악해야 합니다. + + +## 어떤 전략을 사용해야 할까요? [[which-strategy-to-use-when]] + +다음은 어떤 병렬화 전략을 언제 사용해야 하는지에 대한 매우 대략적인 개요입니다. 각 목록의 첫 번째 전략이 일반적으로 더 빠릅니다. + +**⇨ 단일 GPU** + +* 모델이 단일 GPU에 맞는 경우: + + 1. 일반적인 사용 + +* 모델이 단일 GPU에 맞지 않는 경우: + + 1. ZeRO + CPU 및 옵션으로 NVMe 언로드 + 2. 위와 동일하게 사용하되, 가장 큰 레이어가 단일 GPU에 맞지 않는 경우 Memory Centric Tiling(자세한 내용은 아래 참조)을 추가적으로 사용 + +* 가장 큰 레이어가 단일 GPU에 맞지 않는 경우: + +1. ZeRO - [Memory Centric Tiling](https://deepspeed.readthedocs.io/en/latest/zero3.html#memory-centric-tiling) (MCT) 활성화. 이를 통해 크기가 매우 큰 레이어를 임의로 분할하여 순차적으로 실행할 수 있습니다. MCT는 GPU에 활성화된 매개변수의 수를 줄이지만 활성화 메모리에는 영향을 주지 않습니다. 현재 작성 기준으로 이 요구사항은 매우 드물기 때문에 사용자가 `torch.nn.Linear`를 수동으로 수정해야 합니다. + +**⇨ 단일 노드 / 다중 GPU** + +* 모델이 단일 GPU에 맞는 경우: + + 1. DDP - 분산 DP + 2. 
ZeRO - 상황과 구성에 따라 빠를 수도 있고 그렇지 않을 수도 있습니다. + +* 모델이 단일 GPU에 맞지 않는 경우: + + 1. PP + 2. ZeRO + 3. TP + + NVLINK 또는 NVSwitch를 통한 매우 빠른 인트라-노드 연결이 있는 경우 이 세 가지 방법은 거의 동등할 것이며, 이러한 연결이 없는 경우 PP가 TP나 ZeRO보다 빠를 것입니다. 또한 TP의 차수도 영향을 줄 수 있습니다. 특정 설정에서 우승자를 찾기 위해 실험하는 것이 가장 좋습니다. + + TP는 거의 항상 단일 노드 내에서 사용됩니다. 즉, TP 크기 <= 노드당 GPU 수입니다. + +* 가장 큰 레이어가 단일 GPU에 맞지 않는 경우: + + 1. ZeRO를 사용하지 않을 경우 - PP만 사용할 수 없으므로 TP를 사용해야 합니다. + 2. ZeRO를 사용할 경우, "단일 GPU"의 항목과 동일한 항목 참조 + + +**⇨ 다중 노드 / 다중 GPU** + +* 빠른 노드 간 연결이 있는 경우: + + 1. ZeRO - 모델에 대한 수정이 거의 필요하지 않습니다. + 2. PP+TP+DP - 통신이 적지만 모델에 대한 대규모 변경이 필요합니다. + +* 느린 노드 간 연결 및 GPU 메모리 부족한 경우: + + 1. DP+PP+TP+ZeRO-1 diff --git a/docs/source/ko/perf_train_tpu_tf.md b/docs/source/ko/perf_train_tpu_tf.md new file mode 100644 index 00000000000000..28d4fdafb96ca8 --- /dev/null +++ b/docs/source/ko/perf_train_tpu_tf.md @@ -0,0 +1,162 @@ + + +# TensorFlow로 TPU에서 훈련하기[[training-on-tpu-with-tensorflow]] + + + +자세한 설명이 필요하지 않고 바로 TPU 샘플 코드를 시작하고 싶다면 [우리의 TPU 예제 노트북!](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb)을 확인하세요. + + + +### TPU가 무엇인가요?[[what-is-a-tpu]] + +TPU는 **텐서 처리 장치**입니다. Google에서 설계한 하드웨어로, GPU처럼 신경망 내에서 텐서 연산을 더욱 빠르게 처리하기 위해 사용됩니다. 네트워크 훈련과 추론 모두에 사용할 수 있습니다. 일반적으로 Google의 클라우드 서비스를 통해 이용할 수 있지만, Google Colab과 Kaggle Kernel을 통해 소규모 TPU를 무료로 직접 이용할 수도 있습니다. + +[🤗 Transformers의 모든 Tensorflow 모델은 Keras 모델](https://huggingface.co/blog/tensorflow-philosophy)이기 때문에, 이 문서에서 다루는 대부분의 메소드는 대체로 모든 Keras 모델을 위한 TPU 훈련에 적용할 수 있습니다! 하지만 Transformer와 데이터 세트의 HuggingFace 생태계(hug-o-system?)에 특화된 몇 가지 사항이 있으며, 해당 사항에 대해 설명할 때 반드시 언급하도록 하겠습니다. + +### 어떤 종류의 TPU가 있나요?[[what-kinds-of-tpu-are-available]] + +신규 사용자는 TPU의 범위와 다양한 이용 방법에 대해 매우 혼란스러워하는 경우가 많습니다. **TPU 노드**와 **TPU VM**의 차이점은 가장 먼저 이해해야 할 핵심적인 구분 사항입니다. + +**TPU 노드**를 사용한다면, 실제로는 원격 TPU를 간접적으로 이용하는 것입니다. 네트워크와 데이터 파이프라인을 초기화한 다음, 이를 원격 노드로 전달할 별도의 VM이 필요합니다. Google Colab에서 TPU를 사용하는 경우, **TPU 노드** 방식으로 이용하게 됩니다. + +TPU 노드를 사용하는 것은 이를 사용하지 않는 사용자에게 예기치 않은 현상이 발생하기도 합니다! 특히, TPU는 파이썬 코드를 실행하는 기기(machine)와 물리적으로 다른 시스템에 있기 때문에 로컬 기기에 데이터를 저장할 수 없습니다. 즉, 컴퓨터의 내부 저장소에서 가져오는 데이터 파이프라인은 절대 작동하지 않습니다! 로컬 기기에 데이터를 저장하는 대신에, 데이터 파이프라인이 원격 TPU 노드에서 실행 중일 때에도 데이터 파이프라인이 계속 이용할 수 있는 Google Cloud Storage에 데이터를 저장해야 합니다. + + + +메모리에 있는 모든 데이터를 `np.ndarray` 또는 `tf.Tensor`로 맞출 수 있다면, Google Cloud Storage에 업로드할 필요 없이, Colab 또는 TPU 노드를 사용해서 해당 데이터에 `fit()` 할 수 있습니다. + + + + + +**🤗특수한 Hugging Face 팁🤗:** TF 코드 예제에서 볼 수 있는 `Dataset.to_tf_dataset()` 메소드와 그 상위 래퍼(wrapper)인 `model.prepare_tf_dataset()`는 모두 TPU 노드에서 작동하지 않습니다. 그 이유는 `tf.data.Dataset`을 생성하더라도 “순수한” `tf.data` 파이프라인이 아니며 `tf.numpy_function` 또는 `Dataset.from_generator()`를 사용하여 기본 HuggingFace `Dataset`에서 데이터를 전송하기 때문입니다. 이 HuggingFace `Dataset`는 로컬 디스크에 있는 데이터로 지원되며 원격 TPU 노드가 읽을 수 없습니다. + + + +TPU를 이용하는 두 번째 방법은 **TPU VM**을 사용하는 것입니다. TPU VM을 사용할 때, GPU VM에서 훈련하는 것과 같이 TPU가 장착된 기기에 직접 연결합니다. 특히 데이터 파이프라인과 관련하여, TPU VM은 대체로 작업하기 더 쉽습니다. 위의 모든 경고는 TPU VM에는 해당되지 않습니다! + +이 문서는 의견이 포함된 문서이며, 저희의 의견이 여기에 있습니다: **가능하면 TPU 노드를 사용하지 마세요.** TPU 노드는 TPU VM보다 더 복잡하고 디버깅하기가 더 어렵습니다. 또한 향후에는 지원되지 않을 가능성이 높습니다. Google의 최신 TPU인 TPUv4는 TPU VM으로만 이용할 수 있으므로, TPU 노드는 점점 더 "구식" 이용 방법이 될 것으로 전망됩니다. 그러나 TPU 노드를 사용하는 Colab과 Kaggle Kernel에서만 무료 TPU 이용이 가능한 것으로 확인되어, 필요한 경우 이를 다루는 방법을 설명해 드리겠습니다! 이에 대한 자세한 설명이 담긴 코드 샘플은 [TPU 예제 노트북](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb)에서 확인하시기 바랍니다. 
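+
+참고로, Colab과 같은 TPU 노드 환경에 연결할 때 일반적으로 사용되는 TensorFlow 코드는 대략 다음과 같습니다. 세부 사항은 위의 예제 노트북을 따르는 것을 권장하며, 아래는 일반적인 패턴만 보여주는 간단한 스케치입니다:
+
+```python
+import tensorflow as tf
+
+# TPU 주소를 자동으로 감지하여 원격 TPU 시스템에 연결하고 초기화합니다.
+resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
+tf.config.experimental_connect_to_cluster(resolver)
+tf.tpu.experimental.initialize_tpu_system(resolver)
+
+# 이후 모델과 데이터 세트 생성은 strategy.scope() 안에서 수행합니다.
+strategy = tf.distribute.TPUStrategy(resolver)
+print("복제본 수:", strategy.num_replicas_in_sync)
+```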
+ +### 어떤 크기의 TPU를 사용할 수 있나요?[[what-sizes-of-tpu-are-available]] + +단일 TPU(v2-8/v3-8/v4-8)는 8개의 복제본(replicas)을 실행합니다. TPU는 수백 또는 수천 개의 복제본을 동시에 실행할 수 있는 **pod**로 존재합니다. 단일 TPU를 하나 이상 사용하지만 전체 Pod보다 적게 사용하는 경우(예를 들면, v3-32), TPU 구성을 **pod 슬라이스**라고 합니다. + +Colab을 통해 무료 TPU에 이용하는 경우, 기본적으로 단일 v2-8 TPU를 제공받습니다. + +### XLA에 대해 들어본 적이 있습니다. XLA란 무엇이고 TPU와 어떤 관련이 있나요?[[i-keep-hearing-about-this-xla-thing-whats-xla-and-how-does-it-relate-to-tpus]] + +XLA는 최적화 컴파일러로, TensorFlow와 JAX에서 모두 사용됩니다. JAX에서는 유일한 컴파일러이지만, TensorFlow에서는 선택 사항입니다(하지만 TPU에서는 필수입니다!). Keras 모델을 훈련할 때 이를 활성화하는 가장 쉬운 방법은 `jit_compile=True` 인수를 `model.compile()`에 전달하는 것입니다. 오류가 없고 성능이 양호하다면, TPU로 전환할 준비가 되었다는 좋은 신호입니다! + +TPU에서 디버깅하는 것은 대개 CPU/GPU보다 조금 더 어렵기 때문에, TPU에서 시도하기 전에 먼저 XLA로 CPU/GPU에서 코드를 실행하는 것을 권장합니다. 물론 오래 학습할 필요는 없습니다. 즉, 모델과 데이터 파이프라인이 예상대로 작동하는지 확인하기 위해 몇 단계만 거치면 됩니다. + + + +XLA로 컴파일된 코드는 대체로 더 빠릅니다. 따라서 TPU에서 실행할 계획이 없더라도, `jit_compile=True`를 추가하면 성능이 향상될 수 있습니다. 하지만 XLA 호환성에 대한 아래 주의 사항을 반드시 확인하세요! + + + + + +**뼈아픈 경험에서 얻은 팁:** `jit_compile=True`를 사용하면 속도를 높이고 CPU/GPU 코드가 XLA와 호환되는지 검증할 수 있는 좋은 방법이지만, 실제 TPU에서 훈련할 때 그대로 남겨두면 많은 문제를 초래할 수 있습니다. XLA 컴파일은 TPU에서 암시적으로 이뤄지므로, 실제 TPU에서 코드를 실행하기 전에 해당 줄을 제거하는 것을 잊지 마세요! + + + +### 제 XLA 모델과 호환하려면 어떻게 해야 하나요?[[how-do-i-make-my-model-xla-compatible]] + +대부분의 경우, 여러분의 코드는 이미 XLA와 호환될 것입니다! 그러나 표준 TensorFlow에서 작동하지만, XLA에서는 작동하지 않는 몇 가지 사항이 있습니다. 이를 아래 세 가지 핵심 규칙으로 간추렸습니다: + + + +**특수한 HuggingFace 팁🤗:** 저희는 TensorFlow 모델과 손실 함수를 XLA와 호환되도록 재작성하는 데 많은 노력을 기울였습니다. 저희의 모델과 손실 함수는 대개 기본적으로 규칙 #1과 #2를 따르므로 `transformers` 모델을 사용하는 경우, 이를 건너뛸 수 있습니다. 하지만 자체 모델과 손실 함수를 작성할 때는 이러한 규칙을 잊지 마세요! + + + +#### XLA 규칙 #1: 코드에서 “데이터 종속 조건문”을 사용할 수 없습니다[[xla-rule-1-your-code-cannot-have-datadependent-conditionals]] + +어떤 `if`문도 `tf.Tensor` 내부의 값에 종속될 수 없다는 것을 의미합니다. 예를 들어, 이 코드 블록은 XLA로 컴파일할 수 없습니다! + +```python +if tf.reduce_sum(tensor) > 10: + tensor = tensor / 2.0 +``` + +처음에는 매우 제한적으로 보일 수 있지만, 대부분의 신경망 코드에서는 이를 수행할 필요가 없습니다. `tf.cond`를 사용하거나([여기](https://www.tensorflow.org/api_docs/python/tf/cond) 문서를 참조), 다음과 같이 조건문을 제거하고 대신 지표 변수를 사용하는 영리한 수학 트릭을 찾아내어 이 제한을 우회할 수 있습니다: + +```python +sum_over_10 = tf.cast(tf.reduce_sum(tensor) > 10, tf.float32) +tensor = tensor / (1.0 + sum_over_10) +``` + +이 코드는 위의 코드와 정확히 동일한 효과를 구현하지만, 조건문을 제거하여 문제 없이 XLA로 컴파일되도록 합니다! + +#### XLA 규칙 #2: 코드에서 "데이터 종속 크기"를 가질 수 없습니다[[xla-rule-2-your-code-cannot-have-datadependent-shapes]] + +코드에서 모든 `tf.Tensor` 객체의 크기가 해당 값에 종속될 수 없다는 것을 의미합니다. 예를 들어, `tf.unique` 함수는 입력에서 각 고유 값의 인스턴스 하나를 포함하는 `tensor`를 반환하기 때문에 XLA로 컴파일할 수 없습니다. 이 출력의 크기는 입력 `Tensor`가 얼마나 반복적인지에 따라 분명히 달라질 것이므로, XLA는 이를 처리하지 못합니다! + +일반적으로, 대부분의 신경망 코드는 기본값으로 규칙 2를 따릅니다. 그러나 문제가 되는 몇 가지 대표적인 사례가 있습니다. 가장 흔한 사례 중 하나는 **레이블 마스킹**을 사용하여 손실(loss)을 계산할 때, 해당 위치를 무시하도록 나타내기 위해 레이블을 음수 값으로 설정하는 경우입니다. 레이블 마스킹을 지원하는 NumPy나 PyTorch 손실 함수를 보면 [불 인덱싱](https://numpy.org/doc/stable/user/basics.indexing.html#boolean-array-indexing)을 사용하는 다음과 같은 코드를 자주 접할 수 있습니다: + +```python +label_mask = labels >= 0 +masked_outputs = outputs[label_mask] +masked_labels = labels[label_mask] +loss = compute_loss(masked_outputs, masked_labels) +mean_loss = torch.mean(loss) +``` + +이 코드는 NumPy나 PyTorch에서는 문제 없이 작동하지만, XLA에서는 손상됩니다! 왜 그럴까요? 얼마나 많은 위치가 마스킹되는지에 따라 `masked_outputs`와 `masked_labels`의 크기가 달라져서, **데이터 종속 크기**가 되기 때문입니다. 그러나 규칙 #1과 마찬가지로, 이 코드를 다시 작성하면 데이터 종속적 모양 크기가 정확히 동일한 출력을 산출할 수 있습니다. 
+ +```python +label_mask = tf.cast(labels >= 0, tf.float32) +loss = compute_loss(outputs, labels) +loss = loss * label_mask # Set negative label positions to 0 +mean_loss = tf.reduce_sum(loss) / tf.reduce_sum(label_mask) +``` + +여기서, 모든 위치에 대한 손실을 계산하지만, 평균을 계산할 때 분자와 분모 모두에서 마스크된 위치를 0으로 처리합니다. 이는 데이터 종속 크기를 방지하고 XLA 호환성을 유지하면서 첫 번째 블록과 정확히 동일한 결과를 산출합니다. 규칙 #1에서와 동일한 트릭을 사용하여 `tf.bool`을 `tf.float32`로 변환하고 이를 지표 변수로 사용합니다. 해당 트릭은 매우 유용하며, 자체 코드를 XLA로 변환해야 할 경우 기억해 두세요! + +#### XLA 규칙 #3: XLA는 각기 다른 입력 크기가 나타날 때마다 모델을 다시 컴파일해야 합니다[[xla-rule-3-xla-will-need-to-recompile-your-model-for-every-different-input-shape-it-sees]] + +이것은 가장 큰 문제입니다. 입력 크기가 매우 가변적인 경우, XLA는 모델을 반복해서 다시 컴파일해야 하므로 성능에 큰 문제가 발생할 수 있습니다. 이 문제는 토큰화 후 입력 텍스트의 길이가 가변적인 NLP 모델에서 주로 발생합니다. 다른 모달리티에서는 정적 크기가 더 흔하며, 해당 규칙이 훨씬 덜 문제시 됩니다. + +규칙 #3을 어떻게 우회할 수 있을까요? 핵심은 **패딩**입니다. 모든 입력을 동일한 길이로 패딩한 다음, `attention_mask`를 사용하면 어떤 XLA 문제도 없이 가변 크기에서 가져온 것과 동일한 결과를 가져올 수 있습니다. 그러나 과도한 패딩은 심각한 속도 저하를 야기할 수도 있습니다. 모든 샘플을 전체 데이터 세트의 최대 길이로 패딩하면, 무한한 패딩 토큰으로 구성된 배치가 생성되어 많은 연산과 메모리가 낭비될 수 있습니다! + +이 문제에 대한 완벽한 해결책은 없습니다. 하지만, 몇 가지 트릭을 시도해볼 수 있습니다. 한 가지 유용한 트릭은 **샘플 배치를 32 또는 64 토큰과 같은 숫자의 배수까지 패딩하는 것입니다.** 이는 토큰 수가 소폭 증가하지만, 모든 입력 크기가 32 또는 64의 배수여야 하기 때문에 고유한 입력 크기의 수가 대폭 줄어듭니다. 고유한 입력 크기가 적다는 것은 XLA 컴파일 횟수가 적어진다는 것을 의미합니다! + + + +**🤗특수한 HuggingFace 팁🤗:** 토크나이저와 데이터 콜레이터에 도움이 될 수 있는 메소드가 있습니다. 토크나이저를 불러올 때 `padding="max_length"` 또는 `padding="longest"`를 사용하여 패딩된 데이터를 출력하도록 할 수 있습니다. 토크나이저와 데이터 콜레이터는 나타나는 고유한 입력 크기의 수를 줄이기 위해 사용할 수 있는 `pad_to_multiple_of` 인수도 있습니다! + + + +### 실제 TPU로 모델을 훈련하려면 어떻게 해야 하나요?[[how-do-i-actually-train-my-model-on-tpu]] + +훈련이 XLA와 호환되고 (TPU 노드/Colab을 사용하는 경우) 데이터 세트가 적절하게 준비되었다면, TPU에서 실행하는 것은 놀랍도록 쉽습니다! 코드에서 몇 줄만 추가하여, TPU를 초기화하고 모델과 데이터 세트가 `TPUStrategy` 범위 내에 생성되도록 변경하면 됩니다. [우리의 TPU 예제 노트북](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb)을 참조하여 실제로 작동하는 모습을 확인해 보세요! + +### 요약[[summary]] + +여기에 많은 내용이 포함되어 있으므로, TPU 훈련을 위한 모델을 준비할 때 따를 수 있는 간략한 체크리스트로 요약해 보겠습니다: + +- 코드가 XLA의 세 가지 규칙을 따르는지 확인합니다. +- CPU/GPU에서 `jit_compile=True`로 모델을 컴파일하고 XLA로 훈련할 수 있는지 확인합니다. +- 데이터 세트를 메모리에 가져오거나 TPU 호환 데이터 세트를 가져오는 방식을 사용합니다([노트북](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb) 참조) +- 코드를 Colab(accelerator가 “TPU”로 설정됨) 또는 Google Cloud의 TPU VM으로 마이그레이션합니다. +- TPU 초기화 코드를 추가합니다([노트북](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb) 참조) +- `TPUStrategy`를 생성하고 데이터 세트를 가져오는 것과 모델 생성이 `strategy.scope()` 내에 있는지 확인합니다([노트북](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb) 참조) +- TPU로 이동할 때 `jit_compile=True`를 다시 설정하는 것을 잊지 마세요! +- 🙏🙏🙏🥺🥺🥺 +- model.fit()을 불러옵니다. +- 여러분이 해냈습니다! \ No newline at end of file diff --git a/docs/source/ko/performance.md b/docs/source/ko/performance.md new file mode 100644 index 00000000000000..226bd5f249af5d --- /dev/null +++ b/docs/source/ko/performance.md @@ -0,0 +1,96 @@ + + +# 성능 및 확장성 [[performance-and-scalability]] + +점점 더 큰 규모의 트랜스포머 모델을 훈련하고 프로덕션에 배포하는 데에는 다양한 어려움이 따릅니다. 훈련 중에는 모델이 사용 가능한 GPU 메모리보다 더 많은 메모리를 필요로 하거나 훈련 속도가 매우 느릴 수 있으며, 추론을 위해 배포할 때는 제품 환경에서 요구되는 처리량으로 인해 과부하가 발생할 수 있습니다. 이 문서는 이러한 문제를 극복하고 사용 사례에 가장 적합한 설정을 찾도록 도움을 주기 위해 설계되었습니다. 훈련과 추론으로 가이드를 분할했는데, 이는 각각 다른 문제와 해결 방법이 있기 때문입니다. 그리고 각 가이드에는 다양한 종류의 하드웨어 설정에 대한 별도의 가이드가 있습니다(예: 훈련을 위한 단일 GPU vs 다중 GPU 또는 추론을 위한 CPU vs GPU). 
+ +![perf_overview](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/perf_overview.png) + +이 문서는 사용자의 상황에 유용할 수 있는 방법들에 대한 개요 및 시작점 역할을 합니다. + +## 훈련 [[training]] + +효율적인 트랜스포머 모델 훈련에는 GPU나 TPU와 같은 가속기가 필요합니다. 가장 일반적인 경우는 단일 GPU만 사용하는 경우지만, 다중 GPU 및 CPU 훈련에 대한 섹션도 있습니다(곧 더 많은 내용이 추가될 예정). + + + + 참고: 단일 GPU 섹션에서 소개된 대부분의 전략(예: 혼합 정밀도 훈련 또는 그라디언트 누적)은 일반적인 모델 훈련에도 적용되므로, 다중 GPU나 CPU 훈련과 같은 섹션을 살펴보기 전에 꼭 참고하시길 바랍니다. + + + +### 단일 GPU [[single-gpu]] + +단일 GPU에서 대규모 모델을 훈련하는 것은 어려울 수 있지만, 이를 가능하게 하는 여러 가지 도구와 방법이 있습니다. 이 섹션에서는 혼합 정밀도 훈련, 그라디언트 누적 및 체크포인팅, 효율적인 옵티마이저, 최적의 배치 크기를 결정하기 위한 전략 등에 대해 논의합니다. + +[단일 GPU 훈련 섹션으로 이동](perf_train_gpu_one) + +### 다중 GPU [[multigpu]] + +단일 GPU에서 훈련하는 것이 너무 느리거나 대규모 모델에 적합하지 않은 경우도 있습니다. 다중 GPU 설정으로 전환하는 것은 논리적인 단계이지만, 여러 GPU에서 한 번에 훈련하려면 각 GPU마다 모델의 전체 사본을 둘지, 혹은 모델 자체도 여러 GPU에 분산하여 둘지 등 새로운 결정을 내려야 합니다. 이 섹션에서는 데이터, 텐서 및 파이프라인 병렬화에 대해 살펴봅니다. + +[다중 GPU 훈련 섹션으로 이동](perf_train_gpu_many) + +### CPU [[cpu]] + + +[CPU 훈련 섹션으로 이동](perf_train_cpu) + + +### TPU [[tpu]] + +[_곧 제공될 예정_](perf_train_tpu) + +### 특수한 하드웨어 [[specialized-hardware]] + +[_곧 제공될 예정_](perf_train_special) + +## 추론 [[inference]] + +제품 및 서비스 환경에서 대규모 모델을 효율적으로 추론하는 것은 모델을 훈련하는 것만큼 어려울 수 있습니다. 이어지는 섹션에서는 CPU 및 단일/다중 GPU 설정에서 추론을 진행하는 단계를 살펴봅니다. + +### CPU [[cpu]] + +[CPU 추론 섹션으로 이동](perf_infer_cpu) + +### 단일 GPU [[single-gpu]] + +[단일 GPU 추론 섹션으로 이동](perf_infer_gpu_one) + +### 다중 GPU [[multigpu]] + +[다중 GPU 추론 섹션으로 이동](perf_infer_gpu_many) + +### 특수한 하드웨어 [[specialized-hardware]] + +[_곧 제공될 예정_](perf_infer_special) + +## 하드웨어 [[hardware]] + +하드웨어 섹션에서는 자신만의 딥러닝 장비를 구축할 때 유용한 팁과 요령을 살펴볼 수 있습니다. + +[하드웨어 섹션으로 이동](perf_hardware) + + +## 기여하기 [[contribute]] + +이 문서는 완성되지 않은 상태이며, 추가해야 할 내용이나 수정 사항이 많이 있습니다. 따라서 추가하거나 수정할 내용이 있으면 주저하지 말고 PR을 열어 주시거나, 자세한 내용을 논의하기 위해 Issue를 시작해 주시기 바랍니다. + +A가 B보다 좋다고 하는 기여를 할 때는, 재현 가능한 벤치마크와/또는 해당 정보의 출처 링크를 포함해주세요(당신으로부터의 직접적인 정보가 아닌 경우). \ No newline at end of file diff --git a/docs/source/ko/perplexity.md b/docs/source/ko/perplexity.md new file mode 100644 index 00000000000000..9de84a5f289b94 --- /dev/null +++ b/docs/source/ko/perplexity.md @@ -0,0 +1,135 @@ + + +# 고정 길이 모델의 펄플렉서티(Perplexity)[[perplexity-of-fixedlength-models]] + +[[open-in-colab]] + +펄플렉서티(Perplexity, PPL)는 가장 일반적인 언어 모델 평가지표 중 하나입니다. +자세히 알아보기 전에 이 평가지표는 고전적인 언어 모델(자기회귀 또는 인과적 언어 모델이라고도 함)에만 적용되며 BERT와 같은 마스킹된 언어 모델에는 잘 적용하지 않습니다 (BERT는 [summary of the models](../en/model_summary) 문서를 참고하세요). + +펄플렉서티는 시퀀스의 음의 로그 우도(negative log-likelihood, NLL) 값의 평균에 지수(exponentiate)를 취한 값으로 정의됩니다. +토큰화된 시퀀스 \\(X = (x_0, x_1, \dots, x_t)\\) 가 있을 때, \\(X\\) 의 펄플렉서티는 아래 수식과 같이 구할 수 있습니다. + +$$\text{PPL}(X) = \exp \left\{ {-\frac{1}{t}\sum_i^t \log p_\theta (x_i|x_{ + +그러나 모델의 근사치를 구할 때는 일반적으로 모델이 처리할 수 있는 토큰 수에 제한이 있습니다. +예를 들어, 가장 큰 버전의 [GPT-2](model_doc/gpt2)는 토큰의 길이가 1024로 고정되어 있습니다. +따라서 \\(t\\) 가 1024보다 큰 경우에 \\(p_\theta(x_t|x_{ + +이 방법은 각 부분의 펄플렉서티를 한 번의 포워드 패스로 계산할 수 있어 빠르지만 일반적으로 더 높은(더 나쁜) PPL을 산출합니다. +왜냐하면 대부분의 예측 단계에서 모델의 컨텍스트가 적기 때문입니다. + +대신, 고정 길이 모델의 PPL은 슬라이딩 윈도우 전략으로 평가해야 합니다. +이 전략에는 컨텍스트 윈도우을 반복적으로 슬라이딩해 모델이 각 예측을 수행할 때 더 많은 컨텍스트를 갖도록 하는 작업이 포함됩니다. + +Sliding window PPL taking advantage of all available context + +이는 시퀀스 확률의 실제 분해에 더 가까운 근사치이며 일반적으로 더 유리한 점수를 산출합니다. +단점은 말뭉치의 각 토큰에 대해 별도의 포워드 패스가 필요하다는 것입니다. +현실적으로 좋은 절충안은 한 번에 한 토큰씩 슬라이딩하는 것이 아니라 더 큰 간격으로 컨텍스트를 이동하는 스트라이드가 적용된 슬라이딩 윈도우을 사용하는 것입니다. +이렇게 하면 계산을 훨씬 더 빠르게 진행하면서도 모델에 각 단계에서 예측을 수행할 수 있는 긴 컨텍스트를 제공할 수 있습니다. 
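+
+정리하면, 토큰화된 시퀀스의 로그 우도 분해와 펄플렉서티의 정의는 아래와 같으며, 위에서 설명한 고정 길이 접근법들은 각 조건부 확률을 계산할 때 모델에 제공되는 문맥 \\(x_{<i}\\) 를 얼마나 잘 근사하느냐에서만 차이가 납니다:
+
+$$\log p_\theta(X) = \sum_{i=1}^{t} \log p_\theta (x_i \mid x_{<i}), \qquad \text{PPL}(X) = \exp \left\{ -\frac{1}{t}\sum_{i=1}^{t} \log p_\theta (x_i \mid x_{<i}) \right\}$$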
+ +## 예제: 🤗 Transformers에서 GPT-2로 펄플렉서티(perplexity) 계산하기[[example-calculating-perplexity-with-gpt2-in-transformers]] + +이제 GPT-2로 위의 과정을 시연해 보겠습니다. + +```python +from transformers import GPT2LMHeadModel, GPT2TokenizerFast + +device = "cuda" +model_id = "openai-community/gpt2-large" +model = GPT2LMHeadModel.from_pretrained(model_id).to(device) +tokenizer = GPT2TokenizerFast.from_pretrained(model_id) +``` + +WikiText-2 데이터 세트를 가져오고 몇 가지 슬라이딩 윈도우 전략을 사용해 펄플렉서티를 계산해보겠습니다. +이 데이터 세트는 크기가 작고 포워드 패스 한 번만 수행하기 때문에 전체 데이터 세트를 메모리에 가져오고 인코딩할 수 있습니다. + +```python +from datasets import load_dataset + +test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test") +encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt") +``` + +🤗 Transformers를 사용하면 모델의 `labels`로 `input_ids`를 전달해 각 토큰에 대한 평균 음의 우도 값을 손실로 반환할 수 있습니다. +하지만 슬라이딩 윈도우 방식을 사용하면 각 반복마다 모델에 전달하는 토큰이 겹칩니다. +컨텍스트로 처리하는 토큰에 대한 로그 우도 값이 손실에 포함되는 것을 원하지 않기 때문에 이러한 토큰의 `input_ids`를 `-100`으로 설정하여 무시할 수 있습니다. + +다음은 스트라이드(stride)를 `512`로 사용한 예시입니다. +즉, 모델이 한 토큰의 조건부 우도 값을 계산할 때 컨텍스트에 최소한 512개의 토큰이 포함되어있다는 의미입니다 (해당 토큰 앞에 512개의 토큰이 있는 경우). + +```python +import torch +from tqdm import tqdm + +max_length = model.config.n_positions +stride = 512 +seq_len = encodings.input_ids.size(1) + +nlls = [] +prev_end_loc = 0 +for begin_loc in tqdm(range(0, seq_len, stride)): + end_loc = min(begin_loc + max_length, seq_len) + trg_len = end_loc - prev_end_loc # 마지막 루프의 스트라이드 값과 다를 수 있음 + input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device) + target_ids = input_ids.clone() + target_ids[:, :-trg_len] = -100 + + with torch.no_grad(): + outputs = model(input_ids, labels=target_ids) + + # 손실은 모든 유효한 레이블에 대한 평균값을 구하는 교차 엔트로피(cross entropy)로 계산됩니다. + # 나이브 베이지안 모델은 내부적으로 레이블을 왼쪽으로 1개씩 밀기 때문에, (타켓 - 1)개 만큼의 레이블에 대해 손실을 계산합니다. + neg_log_likelihood = outputs.loss + + nlls.append(neg_log_likelihood) + + prev_end_loc = end_loc + if end_loc == seq_len: + break + +ppl = torch.exp(torch.stack(nlls).mean()) +``` + +스트라이드를 최대 입력 길이와 동일하게 설정하면 위에서 설명한 차선책인 비슬라이딩 윈도우 전략과 동일합니다. +일반적으로 스트라이드가 작을수록 모델이 각 예측을 할 때 더 많은 컨텍스트를 볼 수 있게 되어 펄플렉서티 값이 좋아집니다. + +위의 계산을 토큰이 겹치지 않도록 `stride = 1024`로 설정하면 PPL은 `19.44`로 GPT-2 논문에서 보고된 `19.93`과 거의 동일합니다. +`stride = 512`로 슬라이딩 윈도우 전략을 사용하면 PPL은 `16.45`로 떨어집니다. +이는 더 좋은 점수일 뿐만 아니라 시퀀스 확률의 실제 자동 회귀 분해에 더 가까운 방식으로 계산됩니다. \ No newline at end of file diff --git a/docs/source/ko/philosophy.md b/docs/source/ko/philosophy.md new file mode 100644 index 00000000000000..e303709a11b833 --- /dev/null +++ b/docs/source/ko/philosophy.md @@ -0,0 +1,66 @@ + + +# 이념과 목표 [[philosophy]] + +🤗 Transformers는 다음과 같은 목적으로 만들어진 독자적인 라이브러리입니다: + +- 대규모 Transformers 모델을 사용하거나 연구하거나 확장하려는 기계 학습 연구원 및 교육자를 위한 것입니다. +- 모델을 미세 조정하거나 제작용으로 사용하고자 하는 실전 개발자를 위한 것입니다. +- 특정 기계 학습 작업을 해결하기 위해 사전훈련된 모델을 다운로드하고 사용하기만 하려는 엔지니어를 위한 것입니다. + +이 라이브러리는 두 가지 주요 목표를 가지고 설계되었습니다: + +1. 사용하기 쉽고 빠르게 만드는 것: + +- 학습해야 할 사용자 대상 추상화의 수를 제한했습니다. 실제로 거의 추상화가 없으며, 각 모델을 사용하기 위해 필요한 세 가지 표준 클래스인 [configuration](main_classes/configuration), [models](main_classes/model) 및 전처리 클래스인 ([tokenizer](main_classes/tokenizer)는 NLP용, [image processor](main_classes/image_processor)는 비전용, [feature extractor](main_classes/feature_extractor)는 오디오용, [processor](main_classes/processors)는 멀티모달 입력용)만 사용합니다. +- 이러한 클래스는 공통적인 `from_pretrained()` 메서드를 사용하여 미리 훈련된 인스턴스에서 간단하고 통일된 방식으로 초기화할 수 있습니다. 이 메소드는 미리 훈련된 체크포인트에서 관련 클래스 인스턴스와 관련 데이터(구성의 하이퍼파라미터, 토크나이저의 어휘, 모델의 가중치)를 (필요한 경우) 다운로드하고 캐시하며 가져옵니다. 체크포인트는 [Hugging Face Hub](https://huggingface.co/models)에서 제공되거나 사용자 자체의 저장된 체크포인트에서 제공됩니다. 
+- 이 세 가지 기본 클래스 위에 라이브러리는 [`pipeline`] API를 제공하여 주어진 작업에 대해 모델을 빠르게 추론하는 데 사용하고, [`Trainer`]를 제공하여 PyTorch 모델을 빠르게 훈련하거나 미세 조정할 수 있도록 합니다(모든 TensorFlow 모델은 `Keras.fit`과 호환됩니다). +- 결과적으로, 이 라이브러리는 신경망을 구축하기 위한 모듈식 도구 상자가 아닙니다. 라이브러리를 확장하거나 구축하려면 일반적인 Python, PyTorch, TensorFlow, Keras 모듈을 사용하고 라이브러리의 기본 클래스를 상속하여 모델 로딩 및 저장과 같은 기능을 재사용하면 됩니다. 모델에 대한 코딩 철학에 대해 더 자세히 알고 싶다면 [Repeat Yourself](https://huggingface.co/blog/transformers-design-philosophy) 블로그 글을 확인해보세요. + +2. 원래 모델과 가능한 한 근접한 성능을 제공하는 최신 모델을 제공하는 것: + +- 각 아키텍처에 대해 공식 저자가 제공한 결과를 재현하는 적어도 한 가지 예제를 제공합니다. +- 코드는 원래 코드와 가능한 한 유사하게 유지되므로 PyTorch 코드는 TensorFlow 코드로 변환되어 *pytorchic*하지 않을 수 있고, 그 반대의 경우도 마찬가지입니다. + +기타 목표 몇 가지: + +- 모델의 내부를 가능한 일관되게 노출시키기: + + - 전체 은닉 상태와 어텐션 가중치에 대한 액세스를 단일 API를 사용하여 제공합니다. + - 전처리 클래스 및 기본 모델 API는 모델 간에 쉽게 전환할 수 있도록 표준화되어 있습니다. + +- 미세 조정 및 모델 탐색을 위한 유망한 도구들을 주관적으로 선택하기: + + - 미세 조정을 위해 어휘 및 임베딩에 새로운 토큰을 간단하고 일관된 방식으로 추가하는 방법을 제공합니다. + - Transformer 헤드를 마스킹하고 가지치기하는 간단한 방법을 제공합니다. + +- PyTorch, TensorFlow 2.0 및 Flax 간에 쉽게 전환할 수 있도록 하여 하나의 프레임워크로 훈련하고 다른 프레임워크로 추론할 수 있게 합니다. + +## 주요 개념 [[main-concepts]] + +이 라이브러리는 각 모델에 대해 세 가지 유형의 클래스를 기반으로 구축되었습니다: + +- **모델 클래스**는 라이브러리에서 제공하는 사전 훈련된 가중치와 함께 작동하는 PyTorch 모델([torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)), Keras 모델([tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model)), JAX/Flax 모델([flax.linen.Module](https://flax.readthedocs.io/en/latest/api_reference/flax.linen/module.html))일 수 있습니다. +- **구성 클래스**는 모델을 구축하는 데 필요한 하이퍼파라미터(예: 레이어 수 및 은닉 크기)를 저장합니다. 구성 클래스를 직접 인스턴스화할 필요는 없습니다. 특히, 수정 없이 고 사전 학습된 모델을 사용하는 경우 모델을 생성하면 모델의 일부인 구성을 자동으로 인스턴스화됩니다. +- **전처리 클래스**는 원시 데이터를 모델이 수용하는 형식으로 변환합니다. [Tokenizer](main_classes/tokenizer)는 각 모델의 어휘를 저장하고, 문자열을 토큰 임베딩 인덱스 리스트로 인코딩하고 디코딩하기 위한 메소드를 제공합니다. [Image processors](main_classes/image_processor)는 비전 입력을 전처리하고, [feature extractors](main_classes/feature_extractor)는 오디오 입력을 전처리하며, [processor](main_classes/processors)는 멀티모달 입력을 처리합니다. + +모든 이러한 클래스는 사전 훈련된 인스턴스에서 인스턴스화하고 로컬로 저장하며, 세 가지 메소드를 사용하여 Hub에서 공유할 수 있습니다: + +- `from_pretrained()` 메소드를 사용하면 라이브러리 자체에서 제공하는 사전 훈련된 버전(지원되는 모델은 [Model Hub](https://huggingface.co/models)에서 찾을 수 있음)이나 사용자가 로컬로 저장한 경우(또는 서버에 저장한 경우)의 모델, 구성 및 전처리 클래스를 인스턴스화할 수 있습니다. +- `save_pretrained()` 메소드를 사용하면 모델, 구성 및 전처리 클래스를 로컬로 저장하여 `from_pretrained()`를 사용하여 다시 가져올 수 있습니다. +- `push_to_hub()` 메소드를 사용하면 모델, 구성 및 전처리 클래스를 Hub에 공유하여 모두에게 쉽게 접근할 수 있습니다. + diff --git a/docs/source/ko/pipeline_tutorial.md b/docs/source/ko/pipeline_tutorial.md new file mode 100644 index 00000000000000..2f166fc6939f32 --- /dev/null +++ b/docs/source/ko/pipeline_tutorial.md @@ -0,0 +1,243 @@ + + +# 추론을 위한 Pipeline[[pipelines-for-inference]] + +[`pipeline`]을 사용하면 언어, 컴퓨터 비전, 오디오 및 멀티모달 태스크에 대한 추론을 위해 [Hub](https://huggingface.co/models)의 어떤 모델이든 쉽게 사용할 수 있습니다. 특정 분야에 대한 경험이 없거나, 모델을 이루는 코드가 익숙하지 않은 경우에도 [`pipeline`]을 사용해서 추론할 수 있어요! 이 튜토리얼에서는 다음을 배워보겠습니다. + +* 추론을 위해 [`pipeline`]을 사용하는 방법 +* 특정 토크나이저 또는 모델을 사용하는 방법 +* 언어, 컴퓨터 비전, 오디오 및 멀티모달 태스크에서 [`pipeline`]을 사용하는 방법 + + + +지원하는 모든 태스크와 쓸 수 있는 매개변수를 담은 목록은 [`pipeline`] 설명서를 참고해주세요. + + + +## Pipeline 사용하기[[pipeline-usage]] + +각 태스크마다 고유의 [`pipeline`]이 있지만, 개별 파이프라인을 담고있는 추상화된 [`pipeline`]를 사용하는 것이 일반적으로 더 간단합니다. [`pipeline`]은 태스크에 알맞게 추론이 가능한 기본 모델과 전처리 클래스를 자동으로 로드합니다. + +1. 먼저 [`pipeline`]을 생성하고 태스크를 지정하세요. + +```py +>>> from transformers import pipeline + +>>> generator = pipeline(task="automatic-speech-recognition") +``` + +2. 그리고 [`pipeline`]에 입력을 넣어주세요. 
+ +```py +>>> generator("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +{'text': 'I HAVE A DREAM BUT ONE DAY THIS NATION WILL RISE UP LIVE UP THE TRUE MEANING OF ITS TREES'} +``` + +기대했던 결과가 아닌가요? Hub에서 [가장 많이 다운로드된 자동 음성 인식 모델](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=downloads)로 더 나은 결과를 얻을 수 있는지 확인해보세요. +다음은 [openai/whisper-large](https://huggingface.co/openai/whisper-large)로 시도해보겠습니다. + +```py +>>> generator = pipeline(model="openai/whisper-large") +>>> generator("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'} +``` + +훨씬 더 나아졌군요! +Hub의 모델들은 여러 다양한 언어와 전문분야를 아우르기 때문에 꼭 자신의 언어나 분야에 특화된 모델을 찾아보시기 바랍니다. +브라우저를 벗어날 필요없이 Hub에서 직접 모델의 출력을 확인하고 다른 모델과 비교해서 자신의 상황에 더 적합한지, 애매한 입력을 더 잘 처리하는지도 확인할 수 있습니다. +만약 상황에 알맞는 모델을 없다면 언제나 직접 [훈련](training)시킬 수 있습니다! + +입력이 여러 개 있는 경우, 리스트 형태로 전달할 수 있습니다. + +```py +generator( + [ + "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac", + "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac", + ] +) +``` + +전체 데이터세트을 순회하거나 웹서버에 올려두어 추론에 사용하고 싶다면, 각 상세 페이지를 참조하세요. + +[데이터세트에서 Pipeline 사용하기](#using-pipelines-on-a-dataset) + +[웹서버에서 Pipeline 사용하기](./pipeline_webserver) + +## 매개변수[[parameters]] + +[`pipeline`]은 많은 매개변수를 지원합니다. 특정 태스크용인 것도 있고, 범용인 것도 있습니다. +일반적으로 원하는 위치에 어디든 매개변수를 넣을 수 있습니다. + +```py +generator(model="openai/whisper-large", my_parameter=1) +out = generate(...) # This will use `my_parameter=1`. +out = generate(..., my_parameter=2) # This will override and use `my_parameter=2`. +out = generate(...) # This will go back to using `my_parameter=1`. +``` + +중요한 3가지 매개변수를 살펴보겠습니다. + +### 기기(device)[[device]] + +`device=n`처럼 기기를 지정하면 파이프라인이 자동으로 해당 기기에 모델을 배치합니다. +파이토치에서나 텐서플로우에서도 모두 작동합니다. + +```py +generator(model="openai/whisper-large", device=0) +``` + +모델이 GPU 하나에 돌아가기 버겁다면, `device_map="auto"`를 지정해서 🤗 [Accelerate](https://huggingface.co/docs/accelerate)가 모델 가중치를 어떻게 로드하고 저장할지 자동으로 결정하도록 할 수 있습니다. + +```py +#!pip install accelerate +generator(model="openai/whisper-large", device_map="auto") +``` + +### 배치 사이즈[[batch-size]] + +기본적으로 파이프라인은 [여기](https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching)에 나온 이유로 추론을 일괄 처리하지 않습니다. 간단히 설명하자면 일괄 처리가 반드시 더 빠르지 않고 오히려 더 느려질 수도 있기 때문입니다. + +하지만 자신의 상황에 적합하다면, 이렇게 사용하세요. + +```py +generator(model="openai/whisper-large", device=0, batch_size=2) +audio_filenames = [f"audio_{i}.flac" for i in range(10)] +texts = generator(audio_filenames) +``` + +파이프라인 위 제공된 10개의 오디오 파일을 추가로 처리하는 코드 없이 (일괄 처리에 보다 효과적인 GPU 위) 모델에 2개씩 전달합니다. +출력은 일괄 처리하지 않았을 때와 똑같아야 합니다. 파이프라인에서 속도를 더 낼 수도 있는 방법 중 하나일 뿐입니다. + +파이프라인은 일괄 처리의 복잡한 부분을 줄여주기도 합니다. (예를 들어 긴 오디오 파일처럼) 여러 부분으로 나눠야 모델이 처리할 수 있는 것을 [*chunk batching*](./main_classes/pipelines#pipeline-chunk-batching)이라고 하는데, 파이프라인을 사용하면 자동으로 나눠줍니다. + +### 특정 태스크용 매개변수[[task-specific-parameters]] + +각 태스크마다 구현할 때 유연성과 옵션을 제공하기 위해 태스크용 매개변수가 있습니다. +예를 들어 [`transformers.AutomaticSpeechRecognitionPipeline.__call__`] 메서드에는 동영상의 자막을 넣을 때 유용할 것 같은 `return_timestamps` 매개변수가 있습니다. + +```py +>>> # Not using whisper, as it cannot provide timestamps. 
+>>> generator = pipeline(model="facebook/wav2vec2-large-960h-lv60-self", return_timestamps="word") +>>> generator("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +{'text': 'I HAVE A DREAM BUT ONE DAY THIS NATION WILL RISE UP AND LIVE OUT THE TRUE MEANING OF ITS CREED', 'chunks': [{'text': 'I', 'timestamp': (1.22, 1.24)}, {'text': 'HAVE', 'timestamp': (1.42, 1.58)}, {'text': 'A', 'timestamp': (1.66, 1.68)}, {'text': 'DREAM', 'timestamp': (1.76, 2.14)}, {'text': 'BUT', 'timestamp': (3.68, 3.8)}, {'text': 'ONE', 'timestamp': (3.94, 4.06)}, {'text': 'DAY', 'timestamp': (4.16, 4.3)}, {'text': 'THIS', 'timestamp': (6.36, 6.54)}, {'text': 'NATION', 'timestamp': (6.68, 7.1)}, {'text': 'WILL', 'timestamp': (7.32, 7.56)}, {'text': 'RISE', 'timestamp': (7.8, 8.26)}, {'text': 'UP', 'timestamp': (8.38, 8.48)}, {'text': 'AND', 'timestamp': (10.08, 10.18)}, {'text': 'LIVE', 'timestamp': (10.26, 10.48)}, {'text': 'OUT', 'timestamp': (10.58, 10.7)}, {'text': 'THE', 'timestamp': (10.82, 10.9)}, {'text': 'TRUE', 'timestamp': (10.98, 11.18)}, {'text': 'MEANING', 'timestamp': (11.26, 11.58)}, {'text': 'OF', 'timestamp': (11.66, 11.7)}, {'text': 'ITS', 'timestamp': (11.76, 11.88)}, {'text': 'CREED', 'timestamp': (12.0, 12.38)}]} +``` + +보시다시피 모델이 텍스트를 추론할 뿐만 아니라 각 단어를 말한 시점까지도 출력했습니다. + +태스크마다 다양한 매개변수를 가지고 있는데요. 원하는 태스크의 API를 참조해서 바꿔볼 수 있는 여러 매개변수를 살펴보세요! +지금까지 다뤄본 [`~transformers.AutomaticSpeechRecognitionPipeline`]에는 `chunk_length_s` 매개변수가 있습니다. 영화나 1시간 분량의 동영상의 자막 작업을 할 때처럼, 일반적으로 모델이 자체적으로 처리할 수 없는 매우 긴 오디오 파일을 처리할 때 유용하죠. + + +도움이 될 만한 매개변수를 찾지 못했다면 언제든지 [요청](https://github.com/huggingface/transformers/issues/new?assignees=&labels=feature&template=feature-request.yml)해주세요! + + +## 데이터세트에서 Pipeline 사용하기[[using-pipelines-on-a-dataset]] + +파이프라인은 대규모 데이터세트에서도 추론 작업을 할 수 있습니다. 이때 이터레이터를 사용하는 걸 추천드립니다. + +```py +def data(): + for i in range(1000): + yield f"My example {i}" + + +pipe = pipe(model="openai-community/gpt2", device=0) +generated_characters = 0 +for out in pipe(data()): + generated_characters += len(out["generated_text"]) +``` + +이터레이터 `data()`는 각 결과를 호출마다 생성하고, 파이프라인은 입력이 순회할 수 있는 자료구조임을 자동으로 인식하여 GPU에서 기존 데이터가 처리되는 동안 새로운 데이터를 가져오기 시작합니다.(이때 내부적으로 [DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)를 사용해요.) 이 과정은 전체 데이터세트를 메모리에 적재하지 않고도 GPU에 최대한 빠르게 새로운 작업을 공급할 수 있기 때문에 중요합니다. + +그리고 일괄 처리가 더 빠를 수 있기 때문에, `batch_size` 매개변수를 조정해봐도 좋아요. + +데이터세트를 순회하는 가장 간단한 방법은 🤗 [Datasets](https://github.com/huggingface/datasets/)를 활용하는 것인데요. + +```py +# KeyDataset is a util that will just output the item we're interested in. +from transformers.pipelines.pt_utils import KeyDataset + +pipe = pipeline(model="hf-internal-testing/tiny-random-wav2vec2", device=0) +dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation[:10]") + +for out in pipe(KeyDataset(dataset["audio"])): + print(out) +``` + + +## 웹서버에서 Pipeline 사용하기[[using-pipelines-for-a-webserver]] + + +추론 엔진을 만드는 과정은 따로 페이지를 작성할만한 복잡한 주제입니다. + + +[Link](./pipeline_webserver) + +## 비전 Pipeline[[vision-pipeline]] + +비전 태스크를 위해 [`pipeline`]을 사용하는 일은 거의 동일합니다. + +태스크를 지정하고 이미지를 분류기에 전달하면 됩니다. 이미지는 인터넷 링크 또는 로컬 경로의 형태로 전달해주세요. 예를 들어 아래에 표시된 고양이는 어떤 종인가요? + +![pipeline-cat-chonk](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg) + +```py +>>> from transformers import pipeline + +>>> vision_classifier = pipeline(model="google/vit-base-patch16-224") +>>> preds = vision_classifier( +... 
images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +... ) +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds] +>>> preds +[{'score': 0.4335, 'label': 'lynx, catamount'}, {'score': 0.0348, 'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor'}, {'score': 0.0324, 'label': 'snow leopard, ounce, Panthera uncia'}, {'score': 0.0239, 'label': 'Egyptian cat'}, {'score': 0.0229, 'label': 'tiger cat'}] +``` + +### 텍스트 Pipeline[[text-pipeline]] + +NLP 태스크를 위해 [`pipeline`]을 사용하는 일도 거의 동일합니다. + +```py +>>> from transformers import pipeline + +>>> # This model is a `zero-shot-classification` model. +>>> # It will classify text, except you are free to choose any label you might imagine +>>> classifier = pipeline(model="facebook/bart-large-mnli") +>>> classifier( +... "I have a problem with my iphone that needs to be resolved asap!!", +... candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"], +... ) +{'sequence': 'I have a problem with my iphone that needs to be resolved asap!!', 'labels': ['urgent', 'phone', 'computer', 'not urgent', 'tablet'], 'scores': [0.504, 0.479, 0.013, 0.003, 0.002]} +``` + +### 멀티모달 Pipeline[[multimodal-pipeline]] + +[`pipeline`]은 여러 모달리티(역주: 오디오, 비디오, 텍스트와 같은 데이터 형태)를 지원합니다. 예시로 시각적 질의응답(VQA; Visual Question Answering) 태스크는 텍스트와 이미지를 모두 사용합니다. 그 어떤 이미지 링크나 묻고 싶은 질문도 자유롭게 전달할 수 있습니다. 이미지는 URL 또는 로컬 경로의 형태로 전달해주세요. + +예를 들어 이 [거래명세서 사진](https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/invoice.png)에서 거래명세서 번호를 묻고 싶다면, + +```py +>>> from transformers import pipeline + +>>> vqa = pipeline(model="impira/layoutlm-document-qa") +>>> vqa( +... image="https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/invoice.png", +... question="What is the invoice number?", +... ) +[{'score': 0.42514941096305847, 'answer': 'us-001', 'start': 16, 'end': 16}] +``` diff --git a/docs/source/ko/pipeline_webserver.md b/docs/source/ko/pipeline_webserver.md new file mode 100644 index 00000000000000..b7d5366c57c4ef --- /dev/null +++ b/docs/source/ko/pipeline_webserver.md @@ -0,0 +1,144 @@ + + +# 웹 서버를 위한 파이프라인 사용하기[[using_pipelines_for_a_webserver]] + + +추론 엔진을 만드는 것은 복잡한 주제이며, "최선의" 솔루션은 문제 공간에 따라 달라질 가능성이 높습니다. CPU 또는 GPU를 사용하는지에 따라 다르고 낮은 지연 시간을 원하는지, 높은 처리량을 원하는지, 다양한 모델을 지원할 수 있길 원하는지, 하나의 특정 모델을 고도로 최적화하길 원하는지 등에 따라 달라집니다. 이 주제를 해결하는 방법에는 여러 가지가 있으므로, 이 장에서 제시하는 것은 처음 시도해 보기에 좋은 출발점일 수는 있지만, 이 장을 읽는 여러분이 필요로 하는 최적의 솔루션은 아닐 수 있습니다. + + +핵심적으로 이해해야 할 점은 [dataset](pipeline_tutorial#using-pipelines-on-a-dataset)를 다룰 때와 마찬가지로 반복자를 사용 가능하다는 것입니다. 왜냐하면, 웹 서버는 기본적으로 요청을 기다리고 들어오는 대로 처리하는 시스템이기 때문입니다. + +보통 웹 서버는 다양한 요청을 동시에 다루기 위해 매우 다중화된 구조(멀티 스레딩, 비동기 등)를 지니고 있습니다. 반면에, 파이프라인(대부분 파이프라인 안에 있는 모델)은 병렬처리에 그다지 좋지 않습니다. 왜냐하면 파이프라인은 많은 RAM을 차지하기 때문입니다. 따라서, 파이프라인이 실행 중이거나 계산 집약적인 작업 중일 때 모든 사용 가능한 리소스를 제공하는 것이 가장 좋습니다. + +이 문제를 우리는 웹 서버가 요청을 받고 보내는 가벼운 부하를 처리하고, 실제 작업을 처리하는 단일 스레드를 갖는 방법으로 해결할 것입니다. 이 예제는 `starlette` 라이브러리를 사용합니다. +실제 프레임워크는 중요하지 않지만, 다른 프레임워크를 사용하는 경우 동일한 효과를 보기 위해선 코드를 조정하거나 변경해야 할 수 있습니다. 
+ +`server.py`를 생성하세요: + +```py +from starlette.applications import Starlette +from starlette.responses import JSONResponse +from starlette.routing import Route +from transformers import pipeline +import asyncio + + +async def homepage(request): + payload = await request.body() + string = payload.decode("utf-8") + response_q = asyncio.Queue() + await request.app.model_queue.put((string, response_q)) + output = await response_q.get() + return JSONResponse(output) + + +async def server_loop(q): + pipe = pipeline(model="google-bert/bert-base-uncased") + while True: + (string, response_q) = await q.get() + out = pipe(string) + await response_q.put(out) + + +app = Starlette( + routes=[ + Route("/", homepage, methods=["POST"]), + ], +) + + +@app.on_event("startup") +async def startup_event(): + q = asyncio.Queue() + app.model_queue = q + asyncio.create_task(server_loop(q)) +``` + +이제 다음 명령어로 실행시킬 수 있습니다: + +```bash +uvicorn server:app +``` + +이제 쿼리를 날려볼 수 있습니다: + +```bash +curl -X POST -d "test [MASK]" http://localhost:8000/ +#[{"score":0.7742936015129089,"token":1012,"token_str":".","sequence":"test."},...] +``` + +자, 이제 웹 서버를 만드는 방법에 대한 좋은 개념을 알게 되었습니다! + +중요한 점은 모델을 **한 번만** 가져온다는 것입니다. 따라서 웹 서버에는 모델의 사본이 없습니다. 이런 방식은 불필요한 RAM이 사용되지 않습니다. 그런 다음 큐 메커니즘을 사용하면, 다음과 같은 +동적 배치를 사용하기 위해 추론 전 단계에 몇 개의 항목을 축적하는 것과 같은 멋진 작업을 할 수 있습니다: + + +코드는 의도적으로 가독성을 위해 의사 코드처럼 작성되었습니다! +아래 코드를 작동시키기 전에 시스템 자원이 충분한지 확인하세요! + + +```py +(string, rq) = await q.get() +strings = [] +queues = [] +while True: + try: + (string, rq) = await asyncio.wait_for(q.get(), timeout=0.001) # 1ms + except asyncio.exceptions.TimeoutError: + break + strings.append(string) + queues.append(rq) +strings +outs = pipe(strings, batch_size=len(strings)) +for rq, out in zip(queues, outs): + await rq.put(out) +``` + +다시 말씀 드리자면, 제안된 코드는 가독성을 위해 최적화되었으며, 최상의 코드는 아닙니다. +첫째, 배치 크기 제한이 없으며 이는 일반적으로 좋은 방식이 아닙니다. +둘째, 모든 큐 가져오기에서 타임아웃이 재설정되므로 추론을 실행하기 전에 1ms보다 훨씬 오래 기다릴 수 있습니다(첫 번째 요청을 그만큼 지연시킴). + +단일 1ms 길이의 데드라인을 두는 편이 더 좋습니다. + +이 방식을 사용하면 큐가 비어 있어도 항상 1ms를 기다리게 될 것입니다. +큐에 아무것도 없을 때 추론을 원하는 경우에는 최선의 방법이 아닐 수 있습니다. +하지만 배치 작업이 사용례에 따라 정말로 중요하다면 의미가 있을 수도 있습니다. +다시 말하지만, 최상의 솔루션은 없습니다. + +## 고려해야 할 몇 가지 사항[[few_things_you_might want_to_consider]] + +### 에러 확인[[error_checking]] + +프로덕션 환경에서는 문제가 발생할 여지가 많습니다. +메모리가 모자라거나, 공간이 부족하거나, 모델을 가져오는 데에 실패하거나, 쿼리가 잘못되었거나, 쿼리는 정확해도 모델 설정이 잘못되어 실행에 실패하는 등등 많은 경우가 존재합니다. + +일반적으로 서버가 사용자에게 오류를 출력하는 것이 좋으므로 +오류를 표시하기 위해 `try...except` 문을 많이 추가하는 것이 좋습니다. +하지만 보안 상황에 따라 모든 오류를 표시하는 것은 보안상 위험할 수도 있다는 점을 명심해야합니다. + +### 서킷 브레이킹[[circuit_breaking]] + +웹 서버는 일반적으로 서킷 브레이킹을 수행할 때 더 나은 상황에 직면합니다. +즉, 이는 서버가 쿼리를 무기한 기다리는 대신 과부하 상태일 때 적절한 오류를 반환하는 것을 의미합니다. +서버가 매우 오랜 시간 동안 대기하거나 적당한 시간이 지난 후에 504 에러를 반환하는 대신 503 에러를 빠르게 반환하게 하는 것입니다. + +제안된 코드에는 단일 큐가 있으므로 구현하기가 비교적 쉽습니다. +큐 크기를 확인하는 것은 웹 서버가 과부하 상항 하에 있을 때 에러를 반환하기 위한 가장 기초적인 작업입니다. + +### 메인 쓰레드 차단[[blocking_the_main_thread]] + +현재 PyTorch는 비동기 처리를 지원하지 않으며, 실행 중에는 메인 스레드가 차단됩니다. +따라서 PyTorch를 별도의 스레드/프로세스에서 실행하도록 강제하는 것이 좋습니다. +여기서는 이 작업이 수행되지 않았습니다. 왜냐하면 코드가 훨씬 더 복잡하기 때문입니다(주로 스레드, 비동기 처리, 큐가 서로 잘 맞지 않기 때문입니다). +하지만 궁극적으로는 같은 작업을 수행하는 것입니다. + +단일 항목의 추론이 오래 걸린다면 (> 1초), 메인 쓰레드를 차단하는 것은 중요할 수 있습니다. 왜냐하면 이 경우 추론 중 모든 쿼리는 오류를 받기 전에 1초를 기다려야 하기 때문입니다. + +### 동적 배치[[dynamic_batching]] + +일반적으로, 배치 처리가 1개 항목을 한 번에 전달하는 것에 비해 반드시 성능 향상이 있는 것은 아닙니다(자세한 내용은 [`batching details`](./main_classes/pipelines#pipeline-batching)을 참고하세요). +하지만 올바른 설정에서 사용하면 매우 효과적일 수 있습니다. +API에는 기본적으로 속도 저하의 가능성이 매우 높기 때문에 동적 배치 처리가 없습니다. 
+하지만 매우 큰 모델인 BLOOM 추론의 경우 동적 배치 처리는 모든 사람에게 적절한 경험을 제공하는 데 **필수**입니다. diff --git a/docs/source/ko/pr_checks.md b/docs/source/ko/pr_checks.md new file mode 100644 index 00000000000000..1d155cd1fb9ddb --- /dev/null +++ b/docs/source/ko/pr_checks.md @@ -0,0 +1,200 @@ + + +# Pull Request에 대한 검사 [[checks-on-a-pull-request]] + +🤗 Transformers에서 Pull Request를 열 때, 기존에 있는 것을 망가뜨리지 않는지 확인하기 위해 상당한 수의 검사가 실행됩니다. 이러한 검사는 다음과 같은 네 가지 유형으로 구성됩니다: +- 일반적인 테스트 +- 문서 빌드 +- 코드 및 문서 스타일 +- 일반 저장소 일관성 + +이 문서에서는 이러한 다양한 검사와 그 이유를 설명하고, PR에서 하나 이상의 검사가 실패한 경우 로컬에서 어떻게 디버그하는지 알아보겠습니다. + +참고로, 이러한 검사를 사용하려면 개발 설치가 필요합니다: + +```bash +pip install transformers[dev] +``` + +또는 Transformers 저장소 내에 편집 가능한 설치가 필요합니다: + +```bash +pip install -e .[dev] +``` + +Transformers의 선택적 종속성 수가 많이 늘어났기 때문에 개발 설치를 실패할 수도 있습니다. 개발 설치가 실패하는 경우, 작업 중인 Deep Learning 프레임워크 (PyTorch, TensorFlow 및/또는 Flax)를 설치하고 다음 명령을 실행하세요. + +```bash +pip install transformers[quality] +``` + +편집 가능한 설치의 경우는 다음 명령을 실행하세요. + +```bash +pip install -e .[quality] +``` + + +## 테스트 [[tests]] + +`ci/circleci: run_tests_`로 시작하는 모든 작업은 Transformers 테스트 모음의 일부를 실행합니다. 이러한 작업은 특정 환경에서 일부 라이브러리에 중점을 둡니다. 예를 들어 `ci/circleci: run_tests_pipelines_tf`는 TensorFlow만 설치된 환경에서 파이프라인 테스트를 실행합니다. + +테스트 모듈에서 실제로 변경 사항이 없을 때 테스트를 실행하지 않기 위해, 테스트 모음의 일부만 실행됩니다. 라이브러리의 변경 전후에 대한 차이를 확인하기 위해 유틸리티가 실행되고, 해당 차이에 영향을 받는 테스트가 선택됩니다. 이 유틸리티는 로컬에서 다음과 같이 실행할 수 있습니다: + +```bash +python utils/tests_fetcher.py +``` + +Transformers 저장소의 최상단에서 실행합니다. 이 유틸리티는 다음과 같은 작업을 수행합니다: + +1. 변경 사항이 있는 파일마다 변경 사항이 코드인지 주석 또는 문서 문자열인지 확인합니다. 실제 코드 변경이 있는 파일만 유지됩니다. +2. 소스 코드 파일의 각 파일에 대해 재귀적으로 영향을 주는 모든 파일을 제공하는 내부 맵을 작성합니다. 모듈 B가 모듈 A를 가져오면 모듈 A는 모듈 B에 영향을 줍니다. 재귀적인 영향에는 각 모듈이 이전 모듈을 가져오는 모듈 체인이 필요합니다. +3. 단계 1에서 수집한 파일에 이 맵을 적용하여 PR에 영향을 받는 모델 파일 목록을 얻습니다. +4. 각 파일을 해당하는 테스트 파일에 매핑하고 실행할 테스트 목록을 가져옵니다. + +로컬에서 스크립트를 실행하면 단계 1, 3 및 4의 결과를 출력하여 실행되는 테스트를 알 수 있습니다. 스크립트는 또한 `test_list.txt`라는 파일을 생성하여 실행할 테스트 목록을 포함하며, 다음 명령으로 해당 테스트를 로컬에서 실행할 수 있습니다: + +```bash +python -m pytest -n 8 --dist=loadfile -rA -s $(cat test_list.txt) +``` + +잘못된 사항이 누락되었을 경우, 전체 테스트 모음도 매일 실행됩니다. + +## 문서 빌드 [[documentation-build]] + +`build_pr_documentation` 작업은 문서를 빌드하고 미리 보기를 생성하여 PR이 병합된 후 모든 것이 제대로 보이는지 확인합니다. 로봇은 PR에 문서 미리보기 링크를 추가합니다. PR에서 만든 변경 사항은 자동으로 미리보기에 업데이트됩니다. 문서 빌드에 실패한 경우 **세부 정보**를 클릭하여 어디에서 문제가 발생했는지 확인할 수 있습니다. 오류는 주로 `toctree`에 누락된 파일과 같이 간단한 오류입니다. + +로컬에서 문서를 빌드하거나 미리 볼 경우, docs 폴더의 [`README.md`](https://github.com/huggingface/transformers/tree/main/docs)를 참조하세요. + +## 코드 및 문서 스타일 [[code-and-documentation-style]] + +`black`과 `ruff`를 사용하여 모든 소스 파일, 예제 및 테스트에 코드 형식을 적용합니다. 또한, `utils/style_doc.py`에서 문서 문자열과 `rst` 파일의 형식, 그리고 Transformers의 `__init__.py` 파일에서 실행되는 지연된 임포트의 순서에 대한 사용자 정의 도구가 있습니다. 이 모든 것은 다음을 실행함으로써 실행할 수 있습니다: + +```bash +make style +``` + +CI는 이러한 사항이 `ci/circleci: check_code_quality` 검사 내에서 적용되었는지 확인합니다. 또한 `ruff`도 실행되며, 정의되지 않은 변수나 사용되지 않은 변수를 발견하면 경고합니다. 이 검사를 로컬에서 실행하려면 다음을 사용하세요: + +```bash +make quality +``` + +이 작업은 많은 시간이 소요될 수 있으므로 현재 브랜치에서 수정한 파일에 대해서만 동일한 작업을 실행하려면 다음을 실행하세요. + +```bash +make fixup +``` + +이 명령은 현재 브랜치에서 수정한 파일에 대한 모든 추가적인 검사도 실행합니다. 이제 이들을 살펴보겠습니다. + +## 저장소 일관성 [[repository-consistency]] + +이는 PR이 저장소를 정상적인 상태로 유지하는지 확인하는 모든 테스트를 모은 것이며, `ci/circleci: check_repository_consistency` 검사에서 수행됩니다. 다음을 실행함으로써 로컬에서 이 검사를 실행할 수 있습니다. + +```bash +make repo-consistency +``` + +이 검사는 다음을 확인합니다. 
+ +- init에 추가된 모든 객체가 문서화되었는지 (`utils/check_repo.py`에서 수행) +- `__init__.py` 파일의 두 섹션에 동일한 내용이 있는지 (`utils/check_inits.py`에서 수행) +- 다른 모듈에서 복사된 코드가 원본과 일치하는지 (`utils/check_copies.py`에서 수행) +- 모든 구성 클래스에 docstring에 언급된 유효한 체크포인트가 적어도 하나 있는지 (`utils/check_config_docstrings.py`에서 수행) +- 모든 구성 클래스가 해당하는 모델링 파일에서 사용되는 속성만 포함하고 있는지 (`utils/check_config_attributes.py`에서 수행) +- README와 문서 인덱스의 번역이 메인 README와 동일한 모델 목록을 가지고 있는지 (`utils/check_copies.py`에서 수행) +- 문서의 자동 생성된 테이블이 최신 상태인지 (`utils/check_table.py`에서 수행) +- 라이브러리에는 선택적 종속성이 설치되지 않았더라도 모든 객체가 사용 가능한지 (`utils/check_dummies.py`에서 수행) + +이러한 검사가 실패하는 경우, 처음 두 가지 항목은 수동으로 수정해야 하며, 나머지 네 가지 항목은 다음 명령을 실행하여 자동으로 수정할 수 있습니다. + +```bash +make fix-copies +``` + +추가적인 검사는 새로운 모델을 추가하는 PR에 대한 것으로, 주로 다음과 같습니다: + +- 추가된 모든 모델이 Auto-mapping에 있는지 (`utils/check_repo.py`에서 수행) + +- 모든 모델이 올바르게 테스트되었는지 (`utils/check_repo.py`에서 수행) + + + +### 복사본 확인 [[check-copies]] + +Transformers 라이브러리는 모델 코드에 대해 매우 완고하며, 각 모델은 다른 모델에 의존하지 않고 완전히 단일 파일로 구현되어야 합니다. 이렇게 하기 위해 특정 모델의 코드 복사본이 원본과 일관된 상태로 유지되는지 확인하는 메커니즘을 추가했습니다. 따라서 버그 수정이 필요한 경우 다른 모델에 영향을 주는 모든 모델을 볼 수 있으며 수정을 적용할지 수정된 사본을 삭제할지 선택할 수 있습니다. + + + +파일이 다른 파일의 완전한 사본인 경우 해당 파일을 `utils/check_copies.py`의 `FULL_COPIES` 상수에 등록해야 합니다. + + + +이 메커니즘은 `# Copied from xxx` 형식의 주석을 기반으로 합니다. `xxx`에는 아래에 복사되는 클래스 또는 함수의 전체 경로가 포함되어야 합니다. 예를 들어 `RobertaSelfOutput`은 `BertSelfOutput` 클래스의 복사본입니다. 따라서 [여기](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L289)에서 주석이 있습니다: + + +```py +# Copied from transformers.models.bert.modeling_bert.BertSelfOutput +``` + +클래스 전체에 수정을 적용하는 대신에 복사본과 관련있는 메서드에 적용할 수도 있습니다. 예를 들어 [여기](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L598)에서 `RobertaPreTrainedModel._init_weights`가 `BertPreTrainedModel`의 동일한 메서드에서 복사된 것을 볼 수 있으며 해당 주석이 있습니다: + +```py +# Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights +``` + +복사본이 이름만 다른 경우가 있습니다: 예를 들어 `RobertaAttention`에서 `BertSelfAttention` 대신 `RobertaSelfAttention`을 사용하지만 그 외에는 코드가 완전히 동일합니다: 이 때 `# Copied from`은 `Copied from xxx with foo->bar`와 같은 간단한 문자열 대체를 지원합니다. 이는 모든 `foo` 인스턴스를 `bar`로 바꿔서 코드를 복사합니다. [여기](https://github.com/huggingface/transformers/blob/2bd7a27a671fd1d98059124024f580f8f5c0f3b5/src/transformers/models/roberta/modeling_roberta.py#L304C1-L304C86)에서 어떻게 사용되는지 볼 수 있습니다: + +```py +# Copied from transformers.models.bert.modeling_bert.BertAttention with Bert->Roberta +``` + +화살표 주변에는 공백이 없어야 합니다(공백이 대체 패턴의 일부인 경우는 예외입니다). + +대체 패턴을 쉼표로 구분하여 여러 패턴을 추가할 수 있습니다. 예를 들어 `CamemberForMaskedLM`은 두 가지 대체 사항을 가진 `RobertaForMaskedLM`의 복사본입니다: `Roberta`를 `Camembert`로 대체하고 `ROBERTA`를 `CAMEMBERT`로 대체합니다. [여기](https://github.com/huggingface/transformers/blob/15082a9dc6950ecae63a0d3e5060b2fc7f15050a/src/transformers/models/camembert/modeling_camembert.py#L929)에서 이것이 주석으로 어떻게 구현되었는지 확인할 수 있습니다: + +```py +# Copied from transformers.models.roberta.modeling_roberta.RobertaForMaskedLM with Roberta->Camembert, ROBERTA->CAMEMBERT +``` + +순서가 중요한 경우(이전 수정과 충돌할 수 있는 경우) 수정은 왼쪽에서 오른쪽으로 실행됩니다. + + + +새 변경이 서식을 변경하는 경우(짧은 이름을 매우 긴 이름으로 바꾸는 경우) 자동 서식 지정기를 적용한 후 복사본이 검사됩니다. + + + +패턴의 대소문자가 다른 경우(대문자와 소문자가 혼용된 대체 양식) `all-casing` 옵션을 추가하는 방법도 있습니다. 
[여기](https://github.com/huggingface/transformers/blob/15082a9dc6950ecae63a0d3e5060b2fc7f15050a/src/transformers/models/mobilebert/modeling_mobilebert.py#L1237)에서 `MobileBertForSequenceClassification`에서 사용된 예시를 볼 수 있습니다: + +```py +# Copied from transformers.models.bert.modeling_bert.BertForSequenceClassification with Bert->MobileBert all-casing +``` + +이 경우, 코드는 다음과 같이 복사됩니다: +- `MobileBert`에서 `Bert`로(예: `MobileBertModel`을 init에서 사용할 때) +- `mobilebert`에서 `bert`로(예: `self.mobilebert`를 정의할 때) +- `MOBILEBERT`에서 `BERT`로(`MOBILEBERT_INPUTS_DOCSTRING` 상수에서) diff --git a/docs/source/ko/preprocessing.md b/docs/source/ko/preprocessing.md new file mode 100644 index 00000000000000..4466b63585723d --- /dev/null +++ b/docs/source/ko/preprocessing.md @@ -0,0 +1,539 @@ + + +# 전처리[[preprocess]] + +[[open-in-colab]] + +모델을 훈련하려면 데이터 세트를 모델에 맞는 입력 형식으로 전처리해야 합니다. 텍스트, 이미지 또는 오디오인지 관계없이 데이터를 텐서 배치로 변환하고 조립할 필요가 있습니다. 🤗 Transformers는 모델에 대한 데이터를 준비하는 데 도움이 되는 일련의 전처리 클래스를 제공합니다. 이 튜토리얼에서는 다음 내용을 배울 수 있습니다: + +* 텍스트는 [Tokenizer](./main_classes/tokenizer)를 사용하여 토큰 시퀀스로 변환하고 토큰의 숫자 표현을 만든 후 텐서로 조립합니다. +* 음성 및 오디오는 [Feature extractor](./main_classes/feature_extractor)를 사용하여 오디오 파형에서 시퀀스 특성을 파악하여 텐서로 변환합니다. +* 이미지 입력은 [ImageProcessor](./main_classes/image)을 사용하여 이미지를 텐서로 변환합니다. +* 멀티모달 입력은 [Processor](./main_classes/processors)을 사용하여 토크나이저와 특성 추출기 또는 이미지 프로세서를 결합합니다. + + + +`AutoProcessor`는 **언제나** 작동하여 토크나이저, 이미지 프로세서, 특성 추출기 또는 프로세서 등 사용 중인 모델에 맞는 클래스를 자동으로 선택합니다. + + + +시작하기 전에 🤗 Datasets를 설치하여 실험에 사용할 데이터를 불러올 수 있습니다: + +```bash +pip install datasets +``` + +## 자연어처리[[natural-language-processing]] + + + +텍스트 데이터를 전처리하기 위한 기본 도구는 [tokenizer](main_classes/tokenizer)입니다. 토크나이저는 일련의 규칙에 따라 텍스트를 *토큰*으로 나눕니다. 토큰은 숫자로 변환되고 텐서는 모델 입력이 됩니다. 모델에 필요한 추가 입력은 토크나이저에 의해 추가됩니다. + + + +사전훈련된 모델을 사용할 계획이라면 모델과 함께 사전훈련된 토크나이저를 사용하는 것이 중요합니다. 이렇게 하면 텍스트가 사전훈련 말뭉치와 동일한 방식으로 분할되고 사전훈련 중에 동일한 해당 토큰-인덱스 쌍(일반적으로 *vocab*이라고 함)을 사용합니다. + + + +시작하려면 [`AutoTokenizer.from_pretrained`] 메소드를 사용하여 사전훈련된 토크나이저를 불러오세요. 모델과 함께 사전훈련된 *vocab*을 다운로드합니다: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") +``` + +그 다음으로 텍스트를 토크나이저에 넣어주세요: + +```py +>>> encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.") +>>> print(encoded_input) +{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102], + 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]} +``` + +토크나이저는 세 가지 중요한 항목을 포함한 딕셔너리를 반환합니다: + +* [input_ids](glossary#input-ids)는 문장의 각 토큰에 해당하는 인덱스입니다. +* [attention_mask](glossary#attention-mask)는 토큰을 처리해야 하는지 여부를 나타냅니다. +* [token_type_ids](glossary#token-type-ids)는 두 개 이상의 시퀀스가 있을 때 토큰이 속한 시퀀스를 식별합니다. + +`input_ids`를 디코딩하여 입력을 반환합니다: + +```py +>>> tokenizer.decode(encoded_input["input_ids"]) +'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]' +``` + +토크나이저가 두 개의 특수한 토큰(분류 토큰 `CLS`와 분할 토큰 `SEP`)을 문장에 추가했습니다. +모든 모델에 특수한 토큰이 필요한 것은 아니지만, 필요하다면 토크나이저가 자동으로 추가합니다. + +전처리할 문장이 여러 개 있는 경우에는 리스트로 토크나이저에 전달합니다: + +```py +>>> batch_sentences = [ +... "But what about second breakfast?", +... "Don't think he knows about second breakfast, Pip.", +... "What about elevensies?", +... 
] +>>> encoded_inputs = tokenizer(batch_sentences) +>>> print(encoded_inputs) +{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102], + [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], + [101, 1327, 1164, 5450, 23434, 136, 102]], + 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0]], + 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1, 1]]} +``` + +### 패딩[[pad]] + +모델 입력인 텐서는 모양이 균일해야 하지만, 문장의 길이가 항상 같지는 않기 때문에 문제가 될 수 있습니다. 패딩은 짧은 문장에 특수한 *패딩 토큰*을 추가하여 텐서를 직사각형 모양이 되도록 하는 전략입니다. + +`padding` 매개변수를 `True`로 설정하여 배치 내의 짧은 시퀀스를 가장 긴 시퀀스에 맞춰 패딩합니다. + +```py +>>> batch_sentences = [ +... "But what about second breakfast?", +... "Don't think he knows about second breakfast, Pip.", +... "What about elevensies?", +... ] +>>> encoded_input = tokenizer(batch_sentences, padding=True) +>>> print(encoded_input) +{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], + [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], + [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], + 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], + 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], + [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]} +``` + +길이가 짧은 첫 문장과 세 번째 문장이 이제 `0`으로 채워졌습니다. + +### 잘라내기[[truncation]] + +한편, 때로는 시퀀스가 모델에서 처리하기에 너무 길 수도 있습니다. 이 경우, 시퀀스를 더 짧게 줄일 필요가 있습니다. + +모델에서 허용하는 최대 길이로 시퀀스를 자르려면 `truncation` 매개변수를 `True`로 설정하세요: + +```py +>>> batch_sentences = [ +... "But what about second breakfast?", +... "Don't think he knows about second breakfast, Pip.", +... "What about elevensies?", +... ] +>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True) +>>> print(encoded_input) +{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], + [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], + [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], + 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], + 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], + [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]} +``` + + + +다양한 패딩과 잘라내기 인수에 대해 더 알아보려면 [패딩과 잘라내기](./pad_truncation) 개념 가이드를 확인해보세요. + + + +### 텐서 만들기[[build-tensors]] + +마지막으로, 토크나이저가 모델에 공급되는 실제 텐서를 반환하도록 합니다. + +`return_tensors` 매개변수를 PyTorch의 경우 `pt`, TensorFlow의 경우 `tf`로 설정하세요: + + + + +```py +>>> batch_sentences = [ +... "But what about second breakfast?", +... "Don't think he knows about second breakfast, Pip.", +... "What about elevensies?", +... 
] +>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt") +>>> print(encoded_input) +{'input_ids': tensor([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], + [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], + [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]]), + 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), + 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], + [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])} +``` + + +```py +>>> batch_sentences = [ +... "But what about second breakfast?", +... "Don't think he knows about second breakfast, Pip.", +... "What about elevensies?", +... ] +>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf") +>>> print(encoded_input) +{'input_ids': , + 'token_type_ids': , + 'attention_mask': } +``` + + + +## 오디오[[audio]] + +오디오 작업은 모델에 맞는 데이터 세트를 준비하기 위해 [특성 추출기](main_classes/feature_extractor)가 필요합니다. 특성 추출기는 원시 오디오 데이터에서 특성를 추출하고 이를 텐서로 변환하는 것이 목적입니다. + +오디오 데이터 세트에 특성 추출기를 사용하는 방법을 보기 위해 [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) 데이터 세트를 가져오세요. (데이터 세트를 가져오는 방법은 🤗 [데이터 세트 튜토리얼](https://huggingface.co/docs/datasets/load_hub)에서 자세히 설명하고 있습니다.) + +```py +>>> from datasets import load_dataset, Audio + +>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train") +``` + +`audio` 열의 첫 번째 요소에 접근하여 입력을 살펴보세요. `audio` 열을 호출하면 오디오 파일을 자동으로 가져오고 리샘플링합니다. + +```py +>>> dataset[0]["audio"] +{'array': array([ 0. , 0.00024414, -0.00024414, ..., -0.00024414, + 0. , 0. ], dtype=float32), + 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav', + 'sampling_rate': 8000} +``` + +이렇게 하면 세 가지 항목이 반환됩니다: + +* `array`는 1D 배열로 가져와서 (필요한 경우) 리샘플링된 음성 신호입니다. +* `path`는 오디오 파일의 위치를 가리킵니다. +* `sampling_rate`는 음성 신호에서 초당 측정되는 데이터 포인트 수를 나타냅니다. + +이 튜토리얼에서는 [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base) 모델을 사용합니다. 모델 카드를 보면 Wav2Vec2가 16kHz 샘플링된 음성 오디오를 기반으로 사전훈련된 것을 알 수 있습니다. +모델을 사전훈련하는 데 사용된 데이터 세트의 샘플링 레이트와 오디오 데이터의 샘플링 레이트가 일치해야 합니다. 데이터의 샘플링 레이트가 다르면 데이터를 리샘플링해야 합니다. + +1. 🤗 Datasets의 [`~datasets.Dataset.cast_column`] 메소드를 사용하여 샘플링 레이트를 16kHz로 업샘플링하세요: + +```py +>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000)) +``` + +2. 오디오 파일을 리샘플링하기 위해 `audio` 열을 다시 호출합니다: + +```py +>>> dataset[0]["audio"] +{'array': array([ 2.3443763e-05, 2.1729663e-04, 2.2145823e-04, ..., + 3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32), + 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav', + 'sampling_rate': 16000} +``` + +다음으로, 입력을 정규화하고 패딩할 특성 추출기를 가져오세요. 텍스트 데이터의 경우, 더 짧은 시퀀스에 대해 `0`이 추가됩니다. 오디오 데이터에도 같은 개념이 적용됩니다. +특성 추출기는 배열에 `0`(묵음으로 해석)을 추가합니다. + +[`AutoFeatureExtractor.from_pretrained`]를 사용하여 특성 추출기를 가져오세요: + +```py +>>> from transformers import AutoFeatureExtractor + +>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base") +``` + +오디오 `array`를 특성 추출기에 전달하세요. 또한, 발생할 수 있는 조용한 오류(silent errors)를 더 잘 디버깅할 수 있도록 특성 추출기에 `sampling_rate` 인수를 추가하는 것을 권장합니다. 
+ +```py +>>> audio_input = [dataset[0]["audio"]["array"]] +>>> feature_extractor(audio_input, sampling_rate=16000) +{'input_values': [array([ 3.8106556e-04, 2.7506407e-03, 2.8015103e-03, ..., + 5.6335266e-04, 4.6588284e-06, -1.7142107e-04], dtype=float32)]} +``` + +토크나이저와 마찬가지로 배치 내에서 가변적인 시퀀스를 처리하기 위해 패딩 또는 잘라내기를 적용할 수 있습니다. 이 두 개의 오디오 샘플의 시퀀스 길이를 확인해보세요: + +```py +>>> dataset[0]["audio"]["array"].shape +(173398,) + +>>> dataset[1]["audio"]["array"].shape +(106496,) +``` + +오디오 샘플의 길이가 동일하도록 데이터 세트를 전처리하는 함수를 만드세요. 최대 샘플 길이를 지정하면 특성 추출기가 해당 길이에 맞춰 시퀀스를 패딩하거나 잘라냅니다: + +```py +>>> def preprocess_function(examples): +... audio_arrays = [x["array"] for x in examples["audio"]] +... inputs = feature_extractor( +... audio_arrays, +... sampling_rate=16000, +... padding=True, +... max_length=100000, +... truncation=True, +... ) +... return inputs +``` + +`preprocess_function`을 데이터 세트의 처음 예시 몇 개에 적용해보세요: + +```py +>>> processed_dataset = preprocess_function(dataset[:5]) +``` + +이제 샘플 길이가 모두 같고 지정된 최대 길이에 맞게 되었습니다. 드디어 전처리된 데이터 세트를 모델에 전달할 수 있습니다! + +```py +>>> processed_dataset["input_values"][0].shape +(100000,) + +>>> processed_dataset["input_values"][1].shape +(100000,) +``` + +## 컴퓨터 비전[[computer-vision]] + +컴퓨터 비전 작업의 경우, 모델에 대한 데이터 세트를 준비하기 위해 [이미지 프로세서](main_classes/image_processor)가 필요합니다. +이미지 전처리는 이미지를 모델이 예상하는 입력으로 변환하는 여러 단계로 이루어집니다. +이러한 단계에는 크기 조정, 정규화, 색상 채널 보정, 이미지의 텐서 변환 등이 포함됩니다. + + + +이미지 전처리는 이미지 증강 기법을 몇 가지 적용한 뒤에 할 수도 있습니다. +이미지 전처리 및 이미지 증강은 모두 이미지 데이터를 변형하지만, 서로 다른 목적을 가지고 있습니다: + +* 이미지 증강은 과적합(over-fitting)을 방지하고 모델의 견고함(resiliency)을 높이는 데 도움이 되는 방식으로 이미지를 수정합니다. +밝기와 색상 조정, 자르기, 회전, 크기 조정, 확대/축소 등 다양한 방법으로 데이터를 증강할 수 있습니다. +그러나 증강으로 이미지의 의미가 바뀌지 않도록 주의해야 합니다. +* 이미지 전처리는 이미지가 모델이 예상하는 입력 형식과 일치하도록 보장합니다. +컴퓨터 비전 모델을 미세 조정할 때 이미지는 모델이 초기에 훈련될 때와 정확히 같은 방식으로 전처리되어야 합니다. + +이미지 증강에는 원하는 라이브러리를 무엇이든 사용할 수 있습니다. 이미지 전처리에는 모델과 연결된 `ImageProcessor`를 사용합니다. + + + +[food101](https://huggingface.co/datasets/food101) 데이터 세트를 가져와서 컴퓨터 비전 데이터 세트에서 이미지 프로세서를 어떻게 사용하는지 알아보세요. +데이터 세트를 불러오는 방법은 🤗 [데이터 세트 튜토리얼](https://huggingface.co/docs/datasets/load_hub)을 참고하세요. + + + +데이터 세트가 상당히 크기 때문에 🤗 Datasets의 `split` 매개변수를 사용하여 훈련 세트에서 작은 샘플만 가져오세요! + + + +```py +>>> from datasets import load_dataset + +>>> dataset = load_dataset("food101", split="train[:100]") +``` + +다음으로, 🤗 Datasets의 [`image`](https://huggingface.co/docs/datasets/package_reference/main_classes?highlight=image#datasets.Image)로 이미지를 확인해보세요: + +```py +>>> dataset[0]["image"] +``` + +
+ +
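+
+이미지와 함께 클래스 레이블도 확인하고 싶다면 `label` 열을 사용할 수 있습니다. 아래는 food101 데이터 세트가 `label`이라는 [`~datasets.ClassLabel`] 열을 제공한다는 점을 이용한 간단한 예시입니다:
+
+```py
+>>> labels = dataset.features["label"]
+>>> labels.int2str(dataset[0]["label"])
+```
+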
+ +[`AutoImageProcessor.from_pretrained`]로 이미지 프로세서를 가져오세요: + +```py +>>> from transformers import AutoImageProcessor + +>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224") +``` + +먼저 이미지 증강 단계를 추가해 봅시다. 아무 라이브러리나 사용해도 괜찮지만, 이번 튜토리얼에서는 torchvision의 [`transforms`](https://pytorch.org/vision/stable/transforms.html) 모듈을 사용하겠습니다. +다른 데이터 증강 라이브러리를 사용해보고 싶다면, [Albumentations](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_albumentations.ipynb) 또는 [Kornia notebooks](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_kornia.ipynb)에서 어떻게 사용하는지 배울 수 있습니다. + +1. [`Compose`](https://pytorch.org/vision/master/generated/torchvision.transforms.Compose.html)로 [`RandomResizedCrop`](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html)와 [`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html) 등 변환을 몇 가지 연결하세요. +참고로 크기 조정에 필요한 이미지의 크기 요구사항은 `image_processor`에서 가져올 수 있습니다. +일부 모델은 정확한 높이와 너비를 요구하지만, 제일 짧은 변의 길이(`shortest_edge`)만 정의된 모델도 있습니다. + +```py +>>> from torchvision.transforms import RandomResizedCrop, ColorJitter, Compose + +>>> size = ( +... image_processor.size["shortest_edge"] +... if "shortest_edge" in image_processor.size +... else (image_processor.size["height"], image_processor.size["width"]) +... ) + +>>> _transforms = Compose([RandomResizedCrop(size), ColorJitter(brightness=0.5, hue=0.5)]) +``` + +2. 모델은 입력으로 [`pixel_values`](model_doc/visionencoderdecoder#transformers.VisionEncoderDecoderModel.forward.pixel_values)를 받습니다. +`ImageProcessor`는 이미지 정규화 및 적절한 텐서 생성을 처리할 수 있습니다. +배치 이미지에 대한 이미지 증강 및 이미지 전처리를 결합하고 `pixel_values`를 생성하는 함수를 만듭니다: + +```py +>>> def transforms(examples): +... images = [_transforms(img.convert("RGB")) for img in examples["image"]] +... examples["pixel_values"] = image_processor(images, do_resize=False, return_tensors="pt")["pixel_values"] +... return examples +``` + + + +위의 예에서는 이미지 증강 중에 이미지 크기를 조정했기 때문에 `do_resize=False`로 설정하고, 해당 `image_processor`에서 `size` 속성을 활용했습니다. +이미지 증강 중에 이미지 크기를 조정하지 않은 경우 이 매개변수를 생략하세요. +기본적으로는 `ImageProcessor`가 크기 조정을 처리합니다. + +증강 변환 과정에서 이미지를 정규화하려면 `image_processor.image_mean` 및 `image_processor.image_std` 값을 사용하세요. + + + +3. 🤗 Datasets의 [`set_transform`](https://huggingface.co/docs/datasets/process#format-transform)를 사용하여 실시간으로 변환을 적용합니다: + +```py +>>> dataset.set_transform(transforms) +``` + +4. 이제 이미지에 접근하면 이미지 프로세서가 `pixel_values`를 추가한 것을 알 수 있습니다. +드디어 처리된 데이터 세트를 모델에 전달할 수 있습니다! + +```py +>>> dataset[0].keys() +``` + +다음은 변형이 적용된 후의 이미지입니다. 이미지가 무작위로 잘려나갔고 색상 속성이 다릅니다. + +```py +>>> import numpy as np +>>> import matplotlib.pyplot as plt + +>>> img = dataset[0]["pixel_values"] +>>> plt.imshow(img.permute(1, 2, 0)) +``` + +
+ +
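+
+변환이 적용된 데이터 세트는 PyTorch의 [`DataLoader`](https://pytorch.org/docs/stable/data.html)에 바로 넣어 배치 단위로 모델에 전달할 수도 있습니다. 아래는 위의 `transforms`와 `set_transform` 설정을 그대로 사용하고, 각 배치에서 `pixel_values`와 `label`만 모은다고 가정한 간단한 예시입니다. 출력 크기는 사용한 체크포인트의 이미지 크기 설정에 따라 달라질 수 있습니다:
+
+```py
+>>> import torch
+>>> from torch.utils.data import DataLoader
+
+>>> def collate_images(batch):
+...     # 변환된 예시에서 pixel_values와 label만 모아 배치 텐서로 만듭니다
+...     pixel_values = torch.stack([example["pixel_values"] for example in batch])
+...     labels = torch.tensor([example["label"] for example in batch])
+...     return {"pixel_values": pixel_values, "labels": labels}
+
+>>> dataloader = DataLoader(dataset, batch_size=4, collate_fn=collate_images)
+>>> batch = next(iter(dataloader))
+>>> batch["pixel_values"].shape
+torch.Size([4, 3, 224, 224])
+```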
+ + + +`ImageProcessor`는 객체 감지, 시맨틱 세그멘테이션(semantic segmentation), 인스턴스 세그멘테이션(instance segmentation), 파놉틱 세그멘테이션(panoptic segmentation)과 같은 작업에 대한 후처리 방법을 제공합니다. +이러한 방법은 모델의 원시 출력을 경계 상자나 세그멘테이션 맵과 같은 의미 있는 예측으로 변환해줍니다. + + + +### 패딩[[pad]] + +예를 들어, [DETR](./model_doc/detr)와 같은 경우에는 모델이 훈련할 때 크기 조정 증강을 적용합니다. +이로 인해 배치 내 이미지 크기가 달라질 수 있습니다. +[`DetrImageProcessor`]의 [`DetrImageProcessor.pad`]를 사용하고 사용자 정의 `collate_fn`을 정의해서 배치 이미지를 처리할 수 있습니다. + +```py +>>> def collate_fn(batch): +... pixel_values = [item["pixel_values"] for item in batch] +... encoding = image_processor.pad(pixel_values, return_tensors="pt") +... labels = [item["labels"] for item in batch] +... batch = {} +... batch["pixel_values"] = encoding["pixel_values"] +... batch["pixel_mask"] = encoding["pixel_mask"] +... batch["labels"] = labels +... return batch +``` + +## 멀티모달[[multimodal]] + +멀티모달 입력이 필요한 작업의 경우, 모델에 데이터 세트를 준비하기 위한 [프로세서](main_classes/processors)가 필요합니다. +프로세서는 토크나이저와 특성 추출기와 같은 두 가지 처리 객체를 결합합니다. + +[LJ Speech](https://huggingface.co/datasets/lj_speech) 데이터 세트를 가져와서 자동 음성 인식(ASR)을 위한 프로세서를 사용하는 방법을 확인하세요. +(데이터 세트를 가져오는 방법에 대한 자세한 내용은 🤗 [데이터 세트 튜토리얼](https://huggingface.co/docs/datasets/load_hub)에서 볼 수 있습니다.) + +```py +>>> from datasets import load_dataset + +>>> lj_speech = load_dataset("lj_speech", split="train") +``` + +자동 음성 인식(ASR)에서는 `audio`와 `text`에만 집중하면 되므로, 다른 열들은 제거할 수 있습니다: + +```py +>>> lj_speech = lj_speech.map(remove_columns=["file", "id", "normalized_text"]) +``` + +이제 `audio`와 `text`열을 살펴보세요: + +```py +>>> lj_speech[0]["audio"] +{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ..., + 7.3242188e-04, 2.1362305e-04, 6.1035156e-05], dtype=float32), + 'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav', + 'sampling_rate': 22050} + +>>> lj_speech[0]["text"] +'Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition' +``` + +기존에 사전훈련된 모델에서 사용된 데이터 세트와 새로운 오디오 데이터 세트의 샘플링 레이트를 일치시키기 위해 오디오 데이터 세트의 샘플링 레이트를 [리샘플링](preprocessing#audio)해야 합니다! + +```py +>>> lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000)) +``` + +[`AutoProcessor.from_pretrained`]로 프로세서를 가져오세요: + +```py +>>> from transformers import AutoProcessor + +>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h") +``` + +1. `array`에 들어 있는 오디오 데이터를 `input_values`로 변환하고 `text`를 토큰화하여 `labels`로 변환하는 함수를 만듭니다. +모델의 입력은 다음과 같습니다: + +```py +>>> def prepare_dataset(example): +... audio = example["audio"] + +... example.update(processor(audio=audio["array"], text=example["text"], sampling_rate=16000)) + +... return example +``` + +2. 샘플을 `prepare_dataset` 함수에 적용하세요: + +```py +>>> prepare_dataset(lj_speech[0]) +``` + +이제 프로세서가 `input_values`와 `labels`를 추가하고, 샘플링 레이트도 올바르게 16kHz로 다운샘플링했습니다. +드디어 처리된 데이터 세트를 모델에 전달할 수 있습니다! diff --git a/docs/source/ko/quicktour.md b/docs/source/ko/quicktour.md new file mode 100644 index 00000000000000..c92279fa916bae --- /dev/null +++ b/docs/source/ko/quicktour.md @@ -0,0 +1,557 @@ + + +# 둘러보기 [[quick-tour]] + +[[open-in-colab]] + +🤗 Transformers를 시작해보세요! 개발해본 적이 없더라도 쉽게 읽을 수 있도록 쓰인 이 글은 [`pipeline`](./main_classes/pipelines)을 사용하여 추론하고, 사전학습된 모델과 전처리기를 [AutoClass](./model_doc/auto)로 로드하고, PyTorch 또는 TensorFlow로 모델을 빠르게 학습시키는 방법을 소개해 드릴 것입니다. 본 가이드에서 소개되는 개념을 (특히 초보자의 관점으로) 더 친절하게 접하고 싶다면, 튜토리얼이나 [코스](https://huggingface.co/course/chapter1/1)를 참조하기를 권장합니다. 
+ +시작하기 전에 필요한 라이브러리가 모두 설치되어 있는지 확인하세요: + +```bash +!pip install transformers datasets +``` + +또한 선호하는 머신 러닝 프레임워크를 설치해야 합니다: + + + + +```bash +pip install torch +``` + + + +```bash +pip install tensorflow +``` + + + +## 파이프라인 [[pipeline]] + + + +[`pipeline`](./main_classes/pipelines)은 사전 훈련된 모델로 추론하기에 가장 쉽고 빠른 방법입니다. [`pipeline`]은 여러 모달리티에서 다양한 과업을 쉽게 처리할 수 있으며, 아래 표에 표시된 몇 가지 과업을 기본적으로 지원합니다: + + + +사용 가능한 작업의 전체 목록은 [Pipelines API 참조](./main_classes/pipelines)를 확인하세요. + + + +| **태스크** | **설명** | **모달리티** | **파이프라인 ID** | +|-----------------|----------------------------------------------------------------------|------------------|-----------------------------------------------| +| 텍스트 분류 | 텍스트에 알맞은 레이블 붙이기 | 자연어 처리(NLP) | pipeline(task="sentiment-analysis") | +| 텍스트 생성 | 주어진 문자열 입력과 이어지는 텍스트 생성하기 | 자연어 처리(NLP) | pipeline(task="text-generation") | +| 개체명 인식 | 문자열의 각 토큰마다 알맞은 레이블 붙이기 (인물, 조직, 장소 등등) | 자연어 처리(NLP) | pipeline(task="ner") | +| 질의응답 | 주어진 문맥과 질문에 따라 올바른 대답하기 | 자연어 처리(NLP) | pipeline(task="question-answering") | +| 빈칸 채우기 | 문자열의 빈칸에 알맞은 토큰 맞추기 | 자연어 처리(NLP) | pipeline(task="fill-mask") | +| 요약 | 텍스트나 문서를 요약하기 | 자연어 처리(NLP) | pipeline(task="summarization") | +| 번역 | 텍스트를 한 언어에서 다른 언어로 번역하기 | 자연어 처리(NLP) | pipeline(task="translation") | +| 이미지 분류 | 이미지에 알맞은 레이블 붙이기 | 컴퓨터 비전(CV) | pipeline(task="image-classification") | +| 이미지 분할 | 이미지의 픽셀마다 레이블 붙이기(시맨틱, 파놉틱 및 인스턴스 분할 포함) | 컴퓨터 비전(CV) | pipeline(task="image-segmentation") | +| 객체 탐지 | 이미지 속 객체의 경계 상자를 그리고 클래스를 예측하기 | 컴퓨터 비전(CV) | pipeline(task="object-detection") | +| 오디오 분류 | 오디오 파일에 알맞은 레이블 붙이기 | 오디오 | pipeline(task="audio-classification") | +| 자동 음성 인식 | 오디오 파일 속 음성을 텍스트로 바꾸기 | 오디오 | pipeline(task="automatic-speech-recognition") | +| 시각 질의응답 | 주어진 이미지와 질문에 대해 올바르게 대답하기 | 멀티모달 | pipeline(task="vqa") | +| 문서 질의응답 | 주어진 문서와 질문에 대해 올바르게 대답하기 | 멀티모달 | pipeline(task="document-question-answering") | +| 이미지 캡션 달기 | 주어진 이미지의 캡션 생성하기 | 멀티모달 | pipeline(task="image-to-text") | + +먼저 [`pipeline`]의 인스턴스를 생성하고 사용할 작업을 지정합니다. 이 가이드에서는 감정 분석을 위해 [`pipeline`]을 사용하는 예제를 보여드리겠습니다: + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline("sentiment-analysis") +``` + +[`pipeline`]은 감정 분석을 위한 [사전 훈련된 모델](https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english)과 토크나이저를 자동으로 다운로드하고 캐시합니다. 이제 `classifier`를 대상 텍스트에 사용할 수 있습니다: + +```py +>>> classifier("We are very happy to show you the 🤗 Transformers library.") +[{'label': 'POSITIVE', 'score': 0.9998}] +``` + +만약 입력이 여러 개 있는 경우, 입력을 리스트로 [`pipeline`]에 전달하여, 사전 훈련된 모델의 출력을 딕셔너리로 이루어진 리스트 형태로 받을 수 있습니다: + +```py +>>> results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."]) +>>> for result in results: +... print(f"label: {result['label']}, with score: {round(result['score'], 4)}") +label: POSITIVE, with score: 0.9998 +label: NEGATIVE, with score: 0.5309 +``` + +[`pipeline`]은 주어진 과업에 관계없이 데이터셋 전부를 순회할 수도 있습니다. 이 예제에서는 자동 음성 인식을 과업으로 선택해 보겠습니다: + +```py +>>> import torch +>>> from transformers import pipeline + +>>> speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h") +``` + +데이터셋을 로드할 차례입니다. 
(자세한 내용은 🤗 Datasets [시작하기](https://huggingface.co/docs/datasets/quickstart#audio)을 참조하세요) 여기에서는 [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) 데이터셋을 로드하겠습니다: + +```py +>>> from datasets import load_dataset, Audio + +>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train") # doctest: +IGNORE_RESULT +``` + +데이터셋의 샘플링 레이트가 기존 모델인 [`facebook/wav2vec2-base-960h`](https://huggingface.co/facebook/wav2vec2-base-960h)의 훈련 당시 샘플링 레이트와 일치하는지 확인해야 합니다: + +```py +>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=speech_recognizer.feature_extractor.sampling_rate)) +``` + +`"audio"` 열을 호출하면 자동으로 오디오 파일을 가져와서 리샘플링합니다. 첫 4개 샘플에서 원시 웨이브폼 배열을 추출하고 파이프라인에 리스트로 전달하세요: + +```py +>>> result = speech_recognizer(dataset[:4]["audio"]) +>>> print([d["text"] for d in result]) +['I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT', "FONDERING HOW I'D SET UP A JOIN TO HELL T WITH MY WIFE AND WHERE THE AP MIGHT BE", "I I'D LIKE TOY SET UP A JOINT ACCOUNT WITH MY PARTNER I'M NOT SEEING THE OPTION TO DO IT ON THE APSO I CALLED IN TO GET SOME HELP CAN I JUST DO IT OVER THE PHONE WITH YOU AND GIVE YOU THE INFORMATION OR SHOULD I DO IT IN THE AP AN I'M MISSING SOMETHING UQUETTE HAD PREFERRED TO JUST DO IT OVER THE PHONE OF POSSIBLE THINGS", 'HOW DO I FURN A JOINA COUT'] +``` + +음성이나 비전과 같이 입력이 큰 대규모 데이터셋의 경우, 모든 입력을 메모리에 로드하려면 리스트 대신 제너레이터 형태로 전달해야 합니다. 자세한 내용은 [Pipelines API 참조](./main_classes/pipelines)를 확인하세요. + +### 파이프라인에서 다른 모델과 토크나이저 사용하기 [[use-another-model-and-tokenizer-in-the-pipeline]] + +[`pipeline`]은 [Hub](https://huggingface.co/models)의 모든 모델을 사용할 수 있기 때문에, [`pipeline`]을 다른 용도에 맞게 쉽게 수정할 수 있습니다. 예를 들어, 프랑스어 텍스트를 처리할 수 있는 모델을 사용하기 위해선 Hub의 태그를 사용하여 적절한 모델을 필터링하면 됩니다. 필터링된 결과의 상위 항목으로는 프랑스어 텍스트에 사용할 수 있는 다국어 [BERT 모델](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment)이 반환됩니다: + +```py +>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" +``` + + + +[`AutoModelForSequenceClassification`]과 [`AutoTokenizer`]를 사용하여 사전 훈련된 모델과 관련된 토크나이저를 로드하세요 (다음 섹션에서 [`AutoClass`]에 대해 더 자세히 알아보겠습니다): + +```py +>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification + +>>> model = AutoModelForSequenceClassification.from_pretrained(model_name) +>>> tokenizer = AutoTokenizer.from_pretrained(model_name) +``` + + +[`TFAutoModelForSequenceClassification`]과 [`AutoTokenizer`]를 사용하여 사전 훈련된 모델과 관련된 토크나이저를 로드하세요 (다음 섹션에서 [`TFAutoClass`]에 대해 더 자세히 알아보겠습니다): + +```py +>>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification + +>>> model = TFAutoModelForSequenceClassification.from_pretrained(model_name) +>>> tokenizer = AutoTokenizer.from_pretrained(model_name) +``` + + + +[`pipeline`]에서 모델과 토크나이저를 지정하면, 이제 `classifier`를 프랑스어 텍스트에 적용할 수 있습니다: + +```py +>>> classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer) +>>> classifier("Nous sommes très heureux de vous présenter la bibliothèque 🤗 Transformers.") +[{'label': '5 stars', 'score': 0.7273}] +``` + +마땅한 모델을 찾을 수 없는 경우 데이터를 기반으로 사전 훈련된 모델을 미세조정해야 합니다. 미세조정 방법에 대한 자세한 내용은 [미세조정 튜토리얼](./training)을 참조하세요. 사전 훈련된 모델을 미세조정한 후에는 모델을 Hub의 커뮤니티와 공유하여 머신러닝 민주화에 기여해주세요! 🤗 + +## AutoClass [[autoclass]] + + + +[`AutoModelForSequenceClassification`]과 [`AutoTokenizer`] 클래스는 위에서 다룬 [`pipeline`]의 기능을 구현하는 데 사용됩니다. [AutoClass](./model_doc/auto)는 사전 훈련된 모델의 아키텍처를 이름이나 경로에서 자동으로 가져오는 '바로가기'입니다. 과업에 적합한 `AutoClass`를 선택하고 해당 전처리 클래스를 선택하기만 하면 됩니다. + +이전 섹션의 예제로 돌아가서 [`pipeline`]의 결과를 `AutoClass`를 활용해 복제하는 방법을 살펴보겠습니다. 
+ +### AutoTokenizer [[autotokenizer]] + +토크나이저는 텍스트를 모델의 입력으로 사용하기 위해 숫자 배열 형태로 전처리하는 역할을 담당합니다. 토큰화 과정에는 단어를 어디에서 끊을지, 어느 수준까지 나눌지와 같은 여러 규칙들이 있습니다 (토큰화에 대한 자세한 내용은 [토크나이저 요약](./tokenizer_summary)을 참조하세요). 가장 중요한 점은 모델이 사전 훈련된 모델과 동일한 토큰화 규칙을 사용하도록 동일한 모델 이름으로 토크나이저를 인스턴스화해야 한다는 것입니다. + +[`AutoTokenizer`]로 토크나이저를 로드하세요: + +```py +>>> from transformers import AutoTokenizer + +>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" +>>> tokenizer = AutoTokenizer.from_pretrained(model_name) +``` + +텍스트를 토크나이저에 전달하세요: + +```py +>>> encoding = tokenizer("We are very happy to show you the 🤗 Transformers library.") +>>> print(encoding) +{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102], + 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]} +``` + +토크나이저는 다음을 포함한 딕셔너리를 반환합니다: + +* [input_ids](./glossary#input-ids): 토큰의 숫자 표현. +* [attention_mask](.glossary#attention-mask): 어떤 토큰에 주의를 기울여야 하는지를 나타냅니다. + +토크나이저는 입력을 리스트 형태로도 받을 수 있으며, 텍스트를 패딩하고 잘라내어 일정한 길이의 묶음을 반환할 수도 있습니다: + + + + +```py +>>> pt_batch = tokenizer( +... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."], +... padding=True, +... truncation=True, +... max_length=512, +... return_tensors="pt", +... ) +``` + + + +```py +>>> tf_batch = tokenizer( +... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."], +... padding=True, +... truncation=True, +... max_length=512, +... return_tensors="tf", +... ) +``` + + + + + +[전처리](./preprocessing) 튜토리얼을 참조하시면 토큰화에 대한 자세한 설명과 함께 이미지, 오디오와 멀티모달 입력을 전처리하기 위한 [`AutoImageProcessor`]와 [`AutoFeatureExtractor`], [`AutoProcessor`]의 사용방법도 알 수 있습니다. + + + +### AutoModel [[automodel]] + + + +🤗 Transformers는 사전 훈련된 인스턴스를 간단하고 통합된 방법으로 로드할 수 있습니다. 즉, [`AutoTokenizer`]처럼 [`AutoModel`]을 로드할 수 있습니다. 유일한 차이점은 과업에 알맞은 [`AutoModel`]을 선택해야 한다는 점입니다. 텍스트 (또는 시퀀스) 분류의 경우 [`AutoModelForSequenceClassification`]을 로드해야 합니다: + +```py +>>> from transformers import AutoModelForSequenceClassification + +>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" +>>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name) +``` + + + +[`AutoModel`] 클래스에서 지원하는 과업에 대해서는 [과업 요약](./task_summary)을 참조하세요. + + + +이제 전처리된 입력 묶음을 직접 모델에 전달해야 합니다. 아래처럼 `**`를 앞에 붙여 딕셔너리를 풀어주면 됩니다: + +```py +>>> pt_outputs = pt_model(**pt_batch) +``` + +모델의 최종 활성화 함수 출력은 `logits` 속성에 담겨있습니다. `logits`에 softmax 함수를 적용하여 확률을 얻을 수 있습니다: + +```py +>>> from torch import nn + +>>> pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1) +>>> print(pt_predictions) +tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725], + [0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=) +``` + + +🤗 Transformers는 사전 훈련된 인스턴스를 간단하고 통합된 방법으로 로드할 수 있습니다. 즉, [`AutoTokenizer`]처럼 [`TFAutoModel`]을 로드할 수 있습니다. 유일한 차이점은 과업에 알맞은 [`TFAutoModel`]을 선택해야 한다는 점입니다. 텍스트 (또는 시퀀스) 분류의 경우 [`TFAutoModelForSequenceClassification`]을 로드해야 합니다: + +```py +>>> from transformers import TFAutoModelForSequenceClassification + +>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" +>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name) +``` + + + +[`AutoModel`] 클래스에서 지원하는 과업에 대해서는 [과업 요약](./task_summary)을 참조하세요. + + + +이제 전처리된 입력 묶음을 직접 모델에 전달해야 합니다. 아래처럼 그대로 텐서를 전달하면 됩니다: + +```py +>>> tf_outputs = tf_model(tf_batch) +``` + +모델의 최종 활성화 함수 출력은 `logits` 속성에 담겨있습니다. 
`logits`에 softmax 함수를 적용하여 확률을 얻을 수 있습니다: + +```py +>>> import tensorflow as tf + +>>> tf_predictions = tf.nn.softmax(tf_outputs.logits, axis=-1) +>>> tf_predictions # doctest: +IGNORE_RESULT +``` + + + + + +모든 🤗 Transformers 모델(PyTorch 또는 TensorFlow)은 (softmax와 같은) 최종 활성화 함수 *이전에* 텐서를 출력합니다. 왜냐하면 최종 활성화 함수의 출력은 종종 손실 함수 출력과 결합되기 때문입니다. 모델 출력은 특수한 데이터 클래스이므로 IDE에서 자동 완성됩니다. 모델 출력은 튜플이나 딕셔너리처럼 동작하며 (정수, 슬라이스 또는 문자열로 인덱싱 가능), None인 속성은 무시됩니다. + + + +### 모델 저장하기 [[save-a-model]] + + + +미세조정된 모델을 토크나이저와 함께 저장하려면 [`PreTrainedModel.save_pretrained`]를 사용하세요: + +```py +>>> pt_save_directory = "./pt_save_pretrained" +>>> tokenizer.save_pretrained(pt_save_directory) # doctest: +IGNORE_RESULT +>>> pt_model.save_pretrained(pt_save_directory) +``` + +모델을 다시 사용하려면 [`PreTrainedModel.from_pretrained`]로 모델을 다시 로드하세요: + +```py +>>> pt_model = AutoModelForSequenceClassification.from_pretrained("./pt_save_pretrained") +``` + + +미세조정된 모델을 토크나이저와 함께 저장하려면 [`TFPreTrainedModel.save_pretrained`]를 사용하세요: + +```py +>>> tf_save_directory = "./tf_save_pretrained" +>>> tokenizer.save_pretrained(tf_save_directory) # doctest: +IGNORE_RESULT +>>> tf_model.save_pretrained(tf_save_directory) +``` + +모델을 다시 사용하려면 [`TFPreTrainedModel.from_pretrained`]로 모델을 다시 로드하세요: + +```py +>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained("./tf_save_pretrained") +``` + + + +🤗 Transformers의 멋진 기능 중 하나는 모델을 PyTorch 또는 TensorFlow 모델로 저장해뒀다가 다른 프레임워크로 다시 로드할 수 있는 점입니다. `from_pt` 또는 `from_tf` 매개변수를 사용하여 모델을 한 프레임워크에서 다른 프레임워크로 변환할 수 있습니다: + + + + +```py +>>> from transformers import AutoModel + +>>> tokenizer = AutoTokenizer.from_pretrained(tf_save_directory) +>>> pt_model = AutoModelForSequenceClassification.from_pretrained(tf_save_directory, from_tf=True) +``` + + + +```py +>>> from transformers import TFAutoModel + +>>> tokenizer = AutoTokenizer.from_pretrained(pt_save_directory) +>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained(pt_save_directory, from_pt=True) +``` + + + +## 커스텀 모델 구축하기 [[custom-model-builds]] + +모델의 구성 클래스를 수정하여 모델의 구조를 바꿀 수 있습니다. (은닉층이나 어텐션 헤드의 수와 같은) 모델의 속성은 구성에서 지정되기 때문입니다. 커스텀 구성 클래스로 모델을 만들면 처음부터 시작해야 합니다. 모델 속성은 무작위로 초기화되므로 의미 있는 결과를 얻으려면 먼저 모델을 훈련시켜야 합니다. + +먼저 [`AutoConfig`]를 가져오고 수정하고 싶은 사전학습된 모델을 로드하세요. [`AutoConfig.from_pretrained`] 내부에서 (어텐션 헤드 수와 같이) 변경하려는 속성를 지정할 수 있습니다: + +```py +>>> from transformers import AutoConfig + +>>> my_config = AutoConfig.from_pretrained("distilbert/distilbert-base-uncased", n_heads=12) +``` + + + +[`AutoModel.from_config`]를 사용하여 바꾼 구성대로 모델을 생성하세요: + +```py +>>> from transformers import AutoModel + +>>> my_model = AutoModel.from_config(my_config) +``` + + +[`TFAutoModel.from_config`]를 사용하여 바꾼 구성대로 모델을 생성하세요: + +```py +>>> from transformers import TFAutoModel + +>>> my_model = TFAutoModel.from_config(my_config) +``` + + + +커스텀 구성에 대한 자세한 내용은 [커스텀 아키텍처 만들기](./create_a_model) 가이드를 확인하세요. + +## Trainer - PyTorch에 최적화된 훈련 루프 [[trainer-a-pytorch-optimized-training-loop]] + +모든 모델은 [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)이므로 일반적인 훈련 루프에서 사용할 수 있습니다. 직접 훈련 루프를 작성할 수도 있지만, 🤗 Transformers는 PyTorch를 위한 [`Trainer`] 클래스를 제공합니다. 이 클래스에는 기본 훈련 루프가 포함되어 있으며 분산 훈련, 혼합 정밀도 등과 같은 기능을 추가로 제공합니다. + +과업에 따라 다르지만 일반적으로 [`Trainer`]에 다음 매개변수를 전달합니다: + +1. 
[`PreTrainedModel`] 또는 [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)로 시작합니다: + + ```py + >>> from transformers import AutoModelForSequenceClassification + + >>> model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") + ``` + +2. [`TrainingArguments`]는 학습률, 배치 크기, 훈련할 에포크 수와 같은 모델 하이퍼파라미터를 포함합니다. 훈련 인자를 지정하지 않으면 기본값이 사용됩니다: + + ```py + >>> from transformers import TrainingArguments + + >>> training_args = TrainingArguments( + ... output_dir="path/to/save/folder/", + ... learning_rate=2e-5, + ... per_device_train_batch_size=8, + ... per_device_eval_batch_size=8, + ... num_train_epochs=2, + ... ) + ``` + +3. 토크나이저, 이미지 프로세서, 특징 추출기(feature extractor) 또는 프로세서와 전처리 클래스를 로드하세요: + + ```py + >>> from transformers import AutoTokenizer + + >>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") + ``` + +4. 데이터셋을 로드하세요: + + ```py + >>> from datasets import load_dataset + + >>> dataset = load_dataset("rotten_tomatoes") # doctest: +IGNORE_RESULT + ``` + +5. 데이터셋을 토큰화하는 함수를 생성하세요: + + ```py + >>> def tokenize_dataset(dataset): + ... return tokenizer(dataset["text"]) + ``` + + 그리고 [`~datasets.Dataset.map`]로 데이터셋 전체에 적용하세요: + + ```py + >>> dataset = dataset.map(tokenize_dataset, batched=True) + ``` + +6. [`DataCollatorWithPadding`]을 사용하여 데이터셋의 표본 묶음을 만드세요: + + ```py + >>> from transformers import DataCollatorWithPadding + + >>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer) + ``` + +이제 위의 모든 클래스를 [`Trainer`]로 모으세요: + +```py +>>> from transformers import Trainer + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=dataset["train"], +... eval_dataset=dataset["test"], +... tokenizer=tokenizer, +... data_collator=data_collator, +... ) # doctest: +SKIP +``` + +준비가 되었으면 [`~Trainer.train`]을 호출하여 훈련을 시작하세요: + +```py +>>> trainer.train() # doctest: +SKIP +``` + + + +번역이나 요약과 같이 시퀀스-시퀀스 모델을 사용하는 과업에는 [`Seq2SeqTrainer`] 및 [`Seq2SeqTrainingArguments`] 클래스를 사용하세요. + + + +[`Trainer`] 내의 메서드를 서브클래스화하여 훈련 루프를 바꿀 수도 있습니다. 이러면 손실 함수, 옵티마이저, 스케줄러와 같은 기능 또한 바꿀 수 있게 됩니다. 변경 가능한 메소드에 대해서는 [`Trainer`] 문서를 참고하세요. + +훈련 루프를 수정하는 다른 방법은 [Callbacks](./main_classes/callbacks)를 사용하는 것입니다. Callbacks로 다른 라이브러리와 통합하고, 훈련 루프를 체크하여 진행 상황을 보고받거나, 훈련을 조기에 중단할 수 있습니다. Callbacks은 훈련 루프 자체를 바꾸지는 않습니다. 손실 함수와 같은 것을 바꾸려면 [`Trainer`]를 서브클래스화해야 합니다. + +## TensorFlow로 훈련시키기 [[train-with-tensorflow]] + +모든 모델은 [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model)이므로 [Keras](https://keras.io/) API를 통해 TensorFlow에서 훈련시킬 수 있습니다. 🤗 Transformers는 데이터셋을 쉽게 `tf.data.Dataset` 형태로 쉽게 로드할 수 있는 [`~TFPreTrainedModel.prepare_tf_dataset`] 메소드를 제공하기 때문에, Keras의 [`compile`](https://keras.io/api/models/model_training_apis/#compile-method) 및 [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) 메소드로 바로 훈련을 시작할 수 있습니다. + +1. [`TFPreTrainedModel`] 또는 [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model)로 시작합니다: + + ```py + >>> from transformers import TFAutoModelForSequenceClassification + + >>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") + ``` + +2. 토크나이저, 이미지 프로세서, 특징 추출기(feature extractor) 또는 프로세서와 같은 전처리 클래스를 로드하세요: + + ```py + >>> from transformers import AutoTokenizer + + >>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") + ``` + +3. 데이터셋을 토큰화하는 함수를 생성하세요: + + ```py + >>> def tokenize_dataset(dataset): + ... 
return tokenizer(dataset["text"]) # doctest: +SKIP + ``` + +4. [`~datasets.Dataset.map`]을 사용하여 전체 데이터셋에 토큰화 함수를 적용하고, 데이터셋과 토크나이저를 [`~TFPreTrainedModel.prepare_tf_dataset`]에 전달하세요. 배치 크기를 변경하거나 데이터셋을 섞을 수도 있습니다: + + ```py + >>> dataset = dataset.map(tokenize_dataset) # doctest: +SKIP + >>> tf_dataset = model.prepare_tf_dataset( + ... dataset["train"], batch_size=16, shuffle=True, tokenizer=tokenizer + ... ) # doctest: +SKIP + ``` + +5. 준비되었으면 `compile` 및 `fit`를 호출하여 훈련을 시작하세요. 🤗 Transformers의 모든 모델은 과업과 관련된 기본 손실 함수를 가지고 있으므로 명시적으로 지정하지 않아도 됩니다: + + ```py + >>> from tensorflow.keras.optimizers import Adam + + >>> model.compile(optimizer=Adam(3e-5)) # No loss argument! + >>> model.fit(tf_dataset) # doctest: +SKIP + ``` + +## 다음 단계는 무엇인가요? [[whats-next]] + +🤗 Transformers 둘러보기를 모두 읽으셨다면, 가이드를 살펴보고 더 구체적인 것을 수행하는 방법을 알아보세요. 이를테면 커스텀 모델 구축하는 방법, 과업에 알맞게 모델을 미세조정하는 방법, 스크립트로 모델 훈련하는 방법 등이 있습니다. 🤗 Transformers 핵심 개념에 대해 더 알아보려면 커피 한 잔 들고 개념 가이드를 살펴보세요! \ No newline at end of file diff --git a/docs/source/ko/quicktour.mdx b/docs/source/ko/quicktour.mdx deleted file mode 100644 index 0ce4ec42a283cf..00000000000000 --- a/docs/source/ko/quicktour.mdx +++ /dev/null @@ -1,536 +0,0 @@ - - -# 둘러보기[[quick-tour]] - -[[open-in-colab]] -🤗 Transformer를 시작해봐요! 둘러보기는 개발자와 일반 사용자 모두를 위해 쓰여졌습니다. [`pipeline`]으로 추론하는 방법, [AutoClass](./model_doc/auto)로 사전학습된 모델과 전처리기를 적재하는 방법과 PyTorch 또는 TensorFlow로 신속하게 모델을 훈련시키는 방법을 보여줍니다. 기본을 배우고 싶다면 튜토리얼이나 [course](https://huggingface.co/course/chapter1/1)에서 여기 소개된 개념에 대한 자세한 설명을 확인하시길 권장합니다. - -시작하기 전에 필요한 라이브러리가 모두 설치되어 있는지 확인하고, - -```bash -!pip install transformers datasets -``` - -좋아하는 머신러닝 프레임워크도 설치해야 합니다. - - - -```bash -pip install torch -``` - - -```bash -pip install tensorflow -``` - - - -## Pipeline (파이프라인) - - - -[`pipeline`]은 사전학습된 모델을 사용해 추론할 때 제일 쉬운 방법입니다. 여러 모달리티의 수많은 태스크에 [`pipeline`]을 즉시 사용할 수 있습니다. 지원하는 태스크의 예시는 아래 표를 참고하세요. - -| **태스크** | **설명** | **모달리티** | **파이프라인 ID** | -|----------------|---------------------------------------------------------------------|------------------|-----------------------------------------------| -| 텍스트 분류 | 텍스트에 알맞은 라벨 붙이기 | 자연어 처리(NLP) | pipeline(task="sentiment-analysis") | -| 텍스트 생성 | 주어진 문자열 입력과 이어지는 텍스트 생성하기 | 자연어 처리(NLP) | pipeline(task="text-generation") | -| 개체명 인식 | 문자열의 각 토큰마다 알맞은 라벨 붙이기 (인물, 조직, 장소 등등) | 자연어 처리(NLP) | pipeline(task="ner") | -| 질의응답 | 주어진 문맥과 질문에 따라 올바른 대답하기 | 자연어 처리(NLP) | pipeline(task="question-answering") | -| 빈칸 채우기 | 문자열의 빈칸에 알맞은 토큰 맞추기 | 자연어 처리(NLP) | pipeline(task="fill-mask") | -| 요약 | 텍스트나 문서를 요약하기 | 자연어 처리(NLP) | pipeline(task="summarization") | -| 번역 | 텍스트를 한 언어에서 다른 언어로 번역하기 | 자연어 처리(NLP) | pipeline(task="translation") | -| 이미지 분류 | 이미지에 알맞은 라벨 붙이기 | 컴퓨터 비전(CV) | pipeline(task="image-classification") | -| 이미지 분할 | 이미지의 픽셀마다 라벨 붙이기(시맨틱, 파놉틱 및 인스턴스 분할 포함) | 컴퓨터 비전(CV) | pipeline(task="image-segmentation") | -| 객체 탐지 | 이미지 속 객체의 경계 상자를 그리고 클래스를 예측하기 | 컴퓨터 비전(CV) | pipeline(task="object-detection") | -| 오디오 분류 | 오디오 파일에 알맞은 라벨 붙이기 | 오디오 | pipeline(task="audio-classification") | -| 자동 음성 인식 | 오디오 파일 속 음성을 텍스트로 바꾸기 | 오디오 | pipeline(task="automatic-speech-recognition") | -| 시각 질의응답 | 주어진 이미지와 이미지에 대한 질문에 따라 올바르게 대답하기 | 멀티모달 | pipeline(task="vqa") | - -먼저 [`pipeline`]의 인스턴스를 만들어 적용할 태스크를 고르세요. 위 태스크들은 모두 [`pipeline`]을 사용할 수 있고, 지원하는 태스크의 전체 목록을 보려면 [pipeline API 레퍼런스](./main_classes/pipelines)를 확인해주세요. 간단한 예시로 감정 분석 태스크에 [`pipeline`]를 적용해 보겠습니다. 
- -```py ->>> from transformers import pipeline - ->>> classifier = pipeline("sentiment-analysis") -``` - -[`pipeline`]은 기본 [사전학습된 모델(영어)](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)와 감정 분석을 하기 위한 tokenizer를 다운로드하고 캐시해놓습니다. 이제 원하는 텍스트에 `classifier`를 사용할 수 있습니다. - -```py ->>> classifier("We are very happy to show you the 🤗 Transformers library.") -[{'label': 'POSITIVE', 'score': 0.9998}] -``` - -입력이 여러 개라면, 입력을 [`pipeline`]에 리스트로 전달해서 딕셔너리로 된 리스트를 받을 수 있습니다. - -```py ->>> results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."]) ->>> for result in results: -... print(f"label: {result['label']}, with score: {round(result['score'], 4)}") -label: POSITIVE, with score: 0.9998 -label: NEGATIVE, with score: 0.5309 -``` - -[`pipeline`]은 특정 태스크용 데이터셋를 전부 순회할 수도 있습니다. 자동 음성 인식 태스크에 적용해 보겠습니다. - -```py ->>> import torch ->>> from transformers import pipeline - ->>> speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h") -``` - -이제 순회할 오디오 데이터셋를 적재하겠습니다. (자세한 내용은 🤗 Datasets [시작하기](https://huggingface.co/docs/datasets/quickstart#audio)를 참고해주세요) [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) 데이터셋로 해볼까요? - -```py ->>> from datasets import load_dataset, Audio - ->>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train") # doctest: +IGNORE_RESULT -``` - -데이터셋의 샘플링 레이트가 [`facebook/wav2vec2-base-960h`](https://huggingface.co/facebook/wav2vec2-base-960h)의 훈련 당시 샘플링 레이트와 일치해야만 합니다. - -```py ->>> dataset = dataset.cast_column("audio", Audio(sampling_rate=speech_recognizer.feature_extractor.sampling_rate)) -``` - -오디오 파일은 `"audio"` 열을 호출할 때 자동으로 적재되고 다시 샘플링됩니다. -처음 4개 샘플에서 음성을 추출하여 파이프라인에 리스트 형태로 전달해보겠습니다. - -```py ->>> result = speech_recognizer(dataset[:4]["audio"]) ->>> print([d["text"] for d in result]) -['I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT', "FODING HOW I'D SET UP A JOIN TO HET WITH MY WIFE AND WHERE THE AP MIGHT BE", "I I'D LIKE TOY SET UP A JOINT ACCOUNT WITH MY PARTNER I'M NOT SEEING THE OPTION TO DO IT ON THE AP SO I CALLED IN TO GET SOME HELP CAN I JUST DO IT OVER THE PHONE WITH YOU AND GIVE YOU THE INFORMATION OR SHOULD I DO IT IN THE AP AND I'M MISSING SOMETHING UQUETTE HAD PREFERRED TO JUST DO IT OVER THE PHONE OF POSSIBLE THINGS", 'HOW DO I THURN A JOIN A COUNT'] -``` - -(음성이나 비전처럼) 입력이 큰 대규모 데이터셋의 경우, 메모리에 적재시키기 위해 리스트 대신 제너레이터로 입력을 모두 전달할 수 있습니다. 자세한 내용은 [pipeline API 레퍼런스](./main_classes/pipelines)를 확인해주세요. - -### 파이프라인에서 다른 모델이나 tokenizer 사용하는 방법[[use-another-model-and-tokenizer-in-the-pipeline]] - -[`pipeline`]은 [Hub](https://huggingface.co/models) 속 모든 모델을 사용할 수 있어, 얼마든지 [`pipeline`]을 사용하고 싶은대로 바꿀 수 있습니다. 예를 들어 프랑스어 텍스트를 다룰 수 있는 모델을 만드려면, Hub의 태그로 적절한 모델을 찾아보세요. 상위 검색 결과로 뜬 감정 분석을 위해 파인튜닝된 다국어 [BERT 모델](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment)이 프랑스어를 지원하는군요. - -```py ->>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" -``` - - - -[`AutoModelForSequenceClassification`]과 [`AutoTokenizer`]로 사전학습된 모델과 함께 연관된 토크나이저를 불러옵니다. (`AutoClass`에 대한 내용은 다음 섹션에서 살펴보겠습니다) - -```py ->>> from transformers import AutoTokenizer, AutoModelForSequenceClassification - ->>> model = AutoModelForSequenceClassification.from_pretrained(model_name) ->>> tokenizer = AutoTokenizer.from_pretrained(model_name) -``` - - -[`TFAutoModelForSequenceClassification`]과 [`AutoTokenizer`]로 사전학습된 모델과 함께 연관된 토크나이저를 불러옵니다. 
(`TFAutoClass`에 대한 내용은 다음 섹션에서 살펴보겠습니다) - -```py ->>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification - ->>> model = TFAutoModelForSequenceClassification.from_pretrained(model_name) ->>> tokenizer = AutoTokenizer.from_pretrained(model_name) -``` - - - -[`pipeline`]에서 사용할 모델과 토크나이저를 입력하면 이제 (감정 분석기인) `classifier`를 프랑스어 텍스트에 적용할 수 있습니다. - -```py ->>> classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer) ->>> classifier("Nous sommes très heureux de vous présenter la bibliothèque 🤗 Transformers.") -[{'label': '5 stars', 'score': 0.7273}] -``` - -하고싶은 것에 적용할 마땅한 모델이 없다면, 가진 데이터로 사전학습된 모델을 파인튜닝해야 합니다. 자세한 방법은 [파인튜닝 튜토리얼](./training)을 참고해주세요. 사전학습된 모델의 파인튜닝을 마치셨으면, 누구나 머신러닝을 할 수 있도록 [공유](./model_sharing)하는 것을 고려해주세요. 🤗 - -## AutoClass - - - -내부적으로 들어가면 위에서 사용했던 [`pipeline`]은 [`AutoModelForSequenceClassification`]과 [`AutoTokenizer`] 클래스로 작동합니다. [AutoClass](./model_doc/auto)란 이름이나 경로를 받으면 그에 알맞는 사전학습된 모델을 가져오는 '바로가기'라고 볼 수 있는데요. 원하는 태스크와 전처리에 적합한 `AutoClass`를 고르기만 하면 됩니다. - -전에 사용했던 예시로 돌아가서 `AutoClass`로 [`pipeline`]과 동일한 결과를 얻을 수 있는 방법을 알아보겠습니다. - -### AutoTokenizer - -토큰나이저는 전처리를 담당하며, 텍스트를 모델이 받을 숫자 배열로 바꿉니다. 토큰화 과정에는 단어를 어디에서 끊을지, 얼만큼 나눌지 등을 포함한 여러 규칙이 있습니다. 자세한 내용은 [토크나이저 요약](./tokenizer_summary)를 확인해주세요. 제일 중요한 점은 모델이 훈련됐을 때와 동일한 토큰화 규칙을 쓰도록 동일한 모델 이름으로 토크나이저 인스턴스를 만들어야 합니다. - -[`AutoTokenizer`]로 토크나이저를 불러오고, - -```py ->>> from transformers import AutoTokenizer - ->>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" ->>> tokenizer = AutoTokenizer.from_pretrained(model_name) -``` - -토크나이저에 텍스트를 제공하세요. - -```py ->>> encoding = tokenizer("We are very happy to show you the 🤗 Transformers library.") ->>> print(encoding) -{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102], - 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], - 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]} -``` - -그러면 다음을 포함한 딕셔너리가 반환됩니다. - -* [input_ids](./glossary#input-ids): 숫자로 표현된 토큰들 -* [attention_mask](.glossary#attention-mask): 주시할 토큰들 - -토크나이저는 입력을 리스트로도 받을 수 있으며, 텍스트를 패드하거나 잘라내어 균일한 길이의 배치를 반환할 수도 있습니다. - - - -```py ->>> pt_batch = tokenizer( -... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."], -... padding=True, -... truncation=True, -... max_length=512, -... return_tensors="pt", -... ) -``` - - -```py ->>> tf_batch = tokenizer( -... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."], -... padding=True, -... truncation=True, -... max_length=512, -... return_tensors="tf", -... ) -``` - - - - - -[전처리](./preprocessing) 튜토리얼을 보시면 토큰화에 대한 자세한 설명과 함께 이미지, 오디오와 멀티모달 입력을 전처리하기 위한 [`AutoFeatureExtractor`]과 [`AutoProcessor`]의 사용방법도 알 수 있습니다. - - - -### AutoModel - - - -🤗 Transformers로 사전학습된 인스턴스를 간단하고 통일된 방식으로 불러올 수 있습니다. 이러면 [`AutoTokenizer`]처럼 [`AutoModel`]도 불러올 수 있게 됩니다. 유일한 차이점은 태스크에 적합한 [`AutoModel`]을 선택해야 한다는 점입니다. 텍스트(또는 시퀀스) 분류의 경우 [`AutoModelForSequenceClassification`]을 불러와야 합니다. - -```py ->>> from transformers import AutoModelForSequenceClassification - ->>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" ->>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name) -``` - - - -[`AutoModel`] 클래스에서 지원하는 태스크들은 [태스크 정리](./task_summary) 문서를 참고해주세요. - - - -이제 전처리된 입력 배치를 모델로 직접 보내야 합니다. 아래처럼 `**`를 앞에 붙여 딕셔너리를 풀어주기만 하면 됩니다. - -```py ->>> pt_outputs = pt_model(**pt_batch) -``` - -모델의 activation 결과는 `logits` 속성에 담겨있습니다. `logits`에 Softmax 함수를 적용해서 확률 형태로 받으세요. 
- -```py ->>> from torch import nn - ->>> pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1) ->>> print(pt_predictions) -tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725], - [0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=) -``` - - -🤗 Transformers는 사전학습된 인스턴스를 간단하고 통일된 방식으로 불러올 수 있습니다. 이러면 [`AutoTokenizer`]처럼 [`TFAutoModel`]도 불러올 수 있게 됩니다. 유일한 차이점은 태스크에 적합한 [`TFAutoModel`]를 선택해야 한다는 점입니다. 텍스트(또는 시퀀스) 분류의 경우 [`TFAutoModelForSequenceClassification`]을 불러와야 합니다. - -```py ->>> from transformers import TFAutoModelForSequenceClassification - ->>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" ->>> tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name) -``` - - - -[`AutoModel`] 클래스에서 지원하는 태스크들은 [태스크 정리](./task_summary) 문서를 참고해주세요. - - - -이제 전처리된 입력 배치를 모델로 직접 보내야 합니다. 딕셔너리의 키를 텐서에 직접 넣어주기만 하면 됩니다. - -```py ->>> tf_outputs = tf_model(tf_batch) -``` - -모델의 activation 결과는 `logits` 속성에 담겨있습니다. `logits`에 Softmax 함수를 적용해서 확률 형태로 받으세요. - -```py ->>> import tensorflow as tf - ->>> tf_predictions = tf.nn.softmax(tf_outputs.logits, axis=-1) ->>> tf_predictions # doctest: +IGNORE_RESULT -``` - - - - - -모든 (PyTorch 또는 TensorFlow) 🤗 Transformers 모델은 (softmax 등의) 최종 activation 함수 *이전에* 텐서를 내놓습니다. 왜냐하면 최종 activation 함수를 종종 loss 함수와 동일시하기 때문입니다. 모델 출력은 특수 데이터 클래스이므로 해당 속성은 IDE에서 자동으로 완성됩니다. 모델 출력은 튜플 또는 (정수, 슬라이스 또는 문자열로 인덱싱하는) 딕셔너리 형태로 주어지고 이런 경우 None인 속성은 무시됩니다. - - - -### 모델 저장하기[[save-a-model]] - - - -모델을 파인튜닝한 뒤에는 [`PreTrainedModel.save_pretrained`]로 모델을 토크나이저와 함께 저장할 수 있습니다. - -```py ->>> pt_save_directory = "./pt_save_pretrained" ->>> tokenizer.save_pretrained(pt_save_directory) # doctest: +IGNORE_RESULT ->>> pt_model.save_pretrained(pt_save_directory) -``` - -모델을 다시 사용할 때는 [`PreTrainedModel.from_pretrained`]로 다시 불러오면 됩니다. - -```py ->>> pt_model = AutoModelForSequenceClassification.from_pretrained("./pt_save_pretrained") -``` - - -모델을 파인튜닝한 뒤에는 [`TFPreTrainedModel.save_pretrained`]로 모델을 토크나이저와 함께 저장할 수 있습니다. - -```py ->>> tf_save_directory = "./tf_save_pretrained" ->>> tokenizer.save_pretrained(tf_save_directory) # doctest: +IGNORE_RESULT ->>> tf_model.save_pretrained(tf_save_directory) -``` - -모델을 다시 사용할 때는 [`TFPreTrainedModel.from_pretrained`]로 다시 불러오면 됩니다. - -```py ->>> tf_model = TFAutoModelForSequenceClassification.from_pretrained("./tf_save_pretrained") -``` - - - -🤗 Transformers 기능 중 특히 재미있는 한 가지는 모델을 저장하고 PyTorch나 TensorFlow 모델로 다시 불러올 수 있는 기능입니다. 'from_pt' 또는 'from_tf' 매개변수를 사용해 모델을 기존과 다른 프레임워크로 변환시킬 수 있습니다. - - - -```py ->>> from transformers import AutoModel - ->>> tokenizer = AutoTokenizer.from_pretrained(tf_save_directory) ->>> pt_model = AutoModelForSequenceClassification.from_pretrained(tf_save_directory, from_tf=True) -``` - - -```py ->>> from transformers import TFAutoModel - ->>> tokenizer = AutoTokenizer.from_pretrained(pt_save_directory) ->>> tf_model = TFAutoModelForSequenceClassification.from_pretrained(pt_save_directory, from_pt=True) -``` - - - -## 커스텀 모델 구축하기[[custom-model-builds]] - -모델의 구성 클래스를 수정하여 모델의 구조를 바꿀 수 있습니다. 은닉층, 어텐션 헤드 수와 같은 모델의 속성을 구성에서 지정합니다. 커스텀 구성 클래스에서 모델을 만들면 처음부터 시작해야 합니다. 모델 속성은 랜덤하게 초기화되므로 의미 있는 결과를 얻으려면 먼저 모델을 훈련시킬 필요가 있습니다. - -먼저 [`AutoConfig`]를 임포트하고, 수정하고 싶은 사전학습된 모델을 불러오세요. [`AutoConfig.from_pretrained`]에서 어텐션 헤드 수 같은 속성을 변경할 수 있습니다. - -```py ->>> from transformers import AutoConfig - ->>> my_config = AutoConfig.from_pretrained("distilbert-base-uncased", n_heads=12) -``` - - - -[`AutoModel.from_config`]를 사용하여 커스텀 구성대로 모델을 생성합니다. 
- -```py ->>> from transformers import AutoModel - ->>> my_model = AutoModel.from_config(my_config) -``` - - -[`TFAutoModel.from_config`]를 사용하여 커스텀 구성대로 모델을 생성합니다. - -```py ->>> from transformers import TFAutoModel - ->>> my_model = TFAutoModel.from_config(my_config) -``` - - - -커스텀 구성을 작성하는 방법에 대한 자세한 내용은 [커스텀 아키텍처 만들기](./create_a_model) 가이드를 참고하세요. - -## Trainer - PyTorch에 최적화된 훈련 반복 루프[[trainer-a-pytorch-optimized-training-loop]] - -모든 모델은 [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)이어서 대다수의 훈련 반복 루프에 사용할 수 있습니다. 사용자가 직접 훈련 반복 루프를 작성해도 되지만, 🤗 Transformers는 PyTorch용 [`Trainer`] 클래스를 제공합니다. 기본적인 훈련 반폭 루프가 포함되어 있고, 분산 훈련이나 혼합 정밀도 등의 추가 기능도 있습니다. - -태스크에 따라 다르지만, 일반적으로 다음 매개변수를 [`Trainer`]에 전달할 것입니다. - -1. [`PreTrainedModel`] 또는 [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)로 시작합니다. - - ```py - >>> from transformers import AutoModelForSequenceClassification - - >>> model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") - ``` - -2. [`TrainingArguments`]로 학습률, 배치 크기나 훈련할 epoch 수와 같이 모델의 하이퍼파라미터를 조정합니다. 기본값은 훈련 인수를 전혀 지정하지 않은 경우 사용됩니다. - - ```py - >>> from transformers import TrainingArguments - - >>> training_args = TrainingArguments( - ... output_dir="path/to/save/folder/", - ... learning_rate=2e-5, - ... per_device_train_batch_size=8, - ... per_device_eval_batch_size=8, - ... num_train_epochs=2, - ... ) - ``` - -3. 토크나이저, 특징추출기(feature extractor), 전처리기(processor) 클래스 등으로 전처리를 수행합니다. - - ```py - >>> from transformers import AutoTokenizer - - >>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") - ``` - -4. 데이터셋를 적재합니다. - - ```py - >>> from datasets import load_dataset - - >>> dataset = load_dataset("rotten_tomatoes") # doctest: +IGNORE_RESULT - ``` - -5. 데이터셋을 토큰화하는 함수를 만들고 [`~datasets.Dataset.map`]으로 전체 데이터셋에 적용시킵니다. - - ```py - >>> def tokenize_dataset(dataset): - ... return tokenizer(dataset["text"]) - - - >>> dataset = dataset.map(tokenize_dataset, batched=True) - ``` - -6. [`DataCollatorWithPadding`]로 데이터셋으로부터 표본으로 삼을 배치를 만듭니다. - - ```py - >>> from transformers import DataCollatorWithPadding - - >>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer) - ``` - -이제 위의 모든 클래스를 [`Trainer`]로 모으세요. - -```py ->>> from transformers import Trainer - ->>> trainer = Trainer( -... model=model, -... args=training_args, -... train_dataset=dataset["train"], -... eval_dataset=dataset["test"], -... tokenizer=tokenizer, -... data_collator=data_collator, -... ) # doctest: +SKIP -``` - -준비되었으면 [`~Trainer.train`]으로 훈련을 시작하세요. - -```py ->>> trainer.train() # doctest: +SKIP -``` - - - -sequence-to-sequence 모델을 사용하는 (번역이나 요약 같은) 태스크의 경우 [`Seq2SeqTrainer`]와 [`Seq2SeqTrainingArguments`] 클래스를 대신 사용하시기 바랍니다. - - - -[`Trainer`] 내부의 메서드를 구현 상속(subclassing)해서 훈련 반복 루프를 개조할 수도 있습니다. 이러면 loss 함수, optimizer, scheduler 등의 기능도 개조할 수 있습니다. 어떤 메서드를 구현 상속할 수 있는지 알아보려면 [`Trainer`]를 참고하세요. - -훈련 반복 루프를 개조하는 다른 방법은 [Callbacks](./main_classes/callbacks)를 사용하는 것입니다. Callbacks로 다른 라이브러리와 통합하고, 훈련 반복 루프를 수시로 체크하여 진행 상황을 보고받거나, 훈련을 조기에 중단할 수 있습니다. Callbacks은 훈련 반복 루프 자체를 전혀 수정하지 않습니다. 만약 loss 함수 등을 개조하고 싶다면 [`Trainer`]를 구현 상속해야만 합니다. - -## TensorFlow로 훈련시키기[[train-with-tensorflow]] - -모든 모델은 [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model)이어서 [Keras](https://keras.io/) API를 통해 TensorFlow에서 훈련시킬 수 있습니다. 
🤗 Transformers에서 데이터셋를 `tf.data.Dataset` 형태로 쉽게 적재할 수 있는 [`~TFPreTrainedModel.prepare_tf_dataset`] 메서드를 제공하기 때문에, Keras의 [`compile`](https://keras.io/api/models/model_training_apis/#compile-method) 및 [`fit`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) 메서드로 즉시 훈련을 시작할 수 있습니다. - -1. [`TFPreTrainedModel`] 또는 [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model)로 시작합니다. - - ```py - >>> from transformers import TFAutoModelForSequenceClassification - - >>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") - ``` - -2. 토크나이저, 특징추출기(feature extractor), 전처리기(processor) 클래스 등으로 전처리를 수행합니다. - - ```py - >>> from transformers import AutoTokenizer - - >>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") - ``` - -3. 데이터셋을 토큰화하는 함수를 만듭니다. - - ```py - >>> def tokenize_dataset(dataset): - ... return tokenizer(dataset["text"]) # doctest: +SKIP - ``` - -4. [`~datasets.Dataset.map`]으로 전체 데이터셋에 위 함수를 적용시킨 다음, 데이터셋과 토크나이저를 [`~TFPreTrainedModel.prepare_tf_dataset`]로 전달합니다. 배치 크기를 변경해보거나 데이터셋를 섞어봐도 좋습니다. - - ```py - >>> dataset = dataset.map(tokenize_dataset) # doctest: +SKIP - >>> tf_dataset = model.prepare_tf_dataset( - ... dataset, batch_size=16, shuffle=True, tokenizer=tokenizer - ... ) # doctest: +SKIP - ``` - -5. 준비되었으면 `compile`과 `fit`으로 훈련을 시작하세요. - - ```py - >>> from tensorflow.keras.optimizers import Adam - - >>> model.compile(optimizer=Adam(3e-5)) - >>> model.fit(dataset) # doctest: +SKIP - ``` - -## 이제 무얼 하면 될까요?[[whats-next]] - -🤗 Transformers 둘러보기를 모두 읽으셨다면, 가이드를 통해 특정 기술을 배울 수 있어요. 예를 들어 커스텀 모델을 작성하는 방법, 태스크용 모델을 파인튜닝하는 방법, 스크립트로 모델을 훈련시키는 방법 등이 있습니다. 🤗 Transformers의 핵심 개념에 대해 자세히 알아보려면 커피 한 잔을 마신 뒤 개념 가이드를 살펴보셔도 좋습니다! diff --git a/docs/source/ko/run_scripts.md b/docs/source/ko/run_scripts.md new file mode 100644 index 00000000000000..715a949dde4280 --- /dev/null +++ b/docs/source/ko/run_scripts.md @@ -0,0 +1,375 @@ + + +# 스크립트로 실행하기[[train-with-a-script]] + +🤗 Transformers 노트북과 함께 [PyTorch](https://github.com/huggingface/transformers/tree/main/examples/pytorch), [TensorFlow](https://github.com/huggingface/transformers/tree/main/examples/tensorflow), 또는 [JAX/Flax](https://github.com/huggingface/transformers/tree/main/examples/flax)를 사용해 특정 태스크에 대한 모델을 훈련하는 방법을 보여주는 예제 스크립트도 있습니다. + +또한 [연구 프로젝트](https://github.com/huggingface/transformers/tree/main/examples/research_projects) 및 [레거시 예제](https://github.com/huggingface/transformers/tree/main/examples/legacy)에서 대부분 커뮤니티에서 제공한 스크립트를 찾을 수 있습니다. +이러한 스크립트는 적극적으로 유지 관리되지 않으며 최신 버전의 라이브러리와 호환되지 않을 가능성이 높은 특정 버전의 🤗 Transformers를 필요로 합니다. + +예제 스크립트가 모든 문제에서 바로 작동하는 것은 아니며, 해결하려는 문제에 맞게 스크립트를 변경해야 할 수도 있습니다. +이를 위해 대부분의 스크립트에는 데이터 전처리 방법이 나와있어 필요에 따라 수정할 수 있습니다. + +예제 스크립트에 구현하고 싶은 기능이 있으면 pull request를 제출하기 전에 [포럼](https://discuss.huggingface.co/) 또는 [이슈](https://github.com/huggingface/transformers/issues)에서 논의해 주세요. +버그 수정은 환영하지만 가독성을 희생하면서까지 더 많은 기능을 추가하는 pull request는 병합(merge)하지 않을 가능성이 높습니다. + +이 가이드에서는 [PyTorch](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization) 및 [TensorFlow](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/summarization)에서 요약 훈련하는 + 스크립트 예제를 실행하는 방법을 설명합니다. +특별한 설명이 없는 한 모든 예제는 두 프레임워크 모두에서 작동할 것으로 예상됩니다. + +## 설정하기[[setup]] + +최신 버전의 예제 스크립트를 성공적으로 실행하려면 새 가상 환경에서 **소스로부터 🤗 Transformers를 설치**해야 합니다: + +```bash +git clone https://github.com/huggingface/transformers +cd transformers +pip install . +``` + +이전 버전의 예제 스크립트를 보려면 아래 토글을 클릭하세요: + +
+ 이전 버전의 🤗 Transformers 예제 + +
+ +그리고 다음과 같이 복제(clone)해온 🤗 Transformers 버전을 특정 버전(예: v3.5.1)으로 전환하세요: + +```bash +git checkout tags/v3.5.1 +``` + +올바른 라이브러리 버전을 설정한 후 원하는 예제 폴더로 이동하여 예제별로 라이브러리에 대한 요구 사항(requirements)을 설치합니다: + +```bash +pip install -r requirements.txt +``` + +## 스크립트 실행하기[[run-a-script]] + + + +예제 스크립트는 🤗 [Datasets](https://huggingface.co/docs/datasets/) 라이브러리에서 데이터 세트를 다운로드하고 전처리합니다. +그런 다음 스크립트는 요약 기능을 지원하는 아키텍처에서 [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer)를 사용하여 데이터 세트를 미세 조정합니다. +다음 예는 [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail) 데이터 세트에서 [T5-small](https://huggingface.co/google-t5/t5-small)을 미세 조정합니다. +T5 모델은 훈련 방식에 따라 추가 `source_prefix` 인수가 필요하며, 이 프롬프트는 요약 작업임을 T5에 알려줍니다. + +```bash +python examples/pytorch/summarization/run_summarization.py \ + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --overwrite_output_dir \ + --predict_with_generate +``` + + +예제 스크립트는 🤗 [Datasets](https://huggingface.co/docs/datasets/) 라이브러리에서 데이터 세트를 다운로드하고 전처리합니다. +그런 다음 스크립트는 요약 기능을 지원하는 아키텍처에서 Keras를 사용하여 데이터 세트를 미세 조정합니다. +다음 예는 [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail) 데이터 세트에서 [T5-small](https://huggingface.co/google-t5/t5-small)을 미세 조정합니다. +T5 모델은 훈련 방식에 따라 추가 `source_prefix` 인수가 필요하며, 이 프롬프트는 요약 작업임을 T5에 알려줍니다. +```bash +python examples/tensorflow/summarization/run_summarization.py \ + --model_name_or_path google-t5/t5-small \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 16 \ + --num_train_epochs 3 \ + --do_train \ + --do_eval +``` + + + +## 혼합 정밀도(mixed precision)로 분산 훈련하기[[distributed-training-and-mixed-precision]] + +[Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) 클래스는 분산 훈련과 혼합 정밀도(mixed precision)를 지원하므로 스크립트에서도 사용할 수 있습니다. +이 두 가지 기능을 모두 활성화하려면 다음 두 가지를 설정해야 합니다: + +- `fp16` 인수를 추가해 혼합 정밀도(mixed precision)를 활성화합니다. +- `nproc_per_node` 인수를 추가해 사용할 GPU 개수를 설정합니다. + +```bash +torchrun \ + --nproc_per_node 8 pytorch/summarization/run_summarization.py \ + --fp16 \ + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --overwrite_output_dir \ + --predict_with_generate +``` + +TensorFlow 스크립트는 분산 훈련을 위해 [`MirroredStrategy`](https://www.tensorflow.org/guide/distributed_training#mirroredstrategy)를 활용하며, 훈련 스크립트에 인수를 추가할 필요가 없습니다. +다중 GPU 환경이라면, TensorFlow 스크립트는 기본적으로 여러 개의 GPU를 사용합니다. + +## TPU 위에서 스크립트 실행하기[[run-a-script-on-a-tpu]] + + + +Tensor Processing Units (TPUs)는 성능을 가속화하기 위해 특별히 설계되었습니다. +PyTorch는 [XLA](https://www.tensorflow.org/xla) 딥러닝 컴파일러와 함께 TPU를 지원합니다(자세한 내용은 [여기](https://github.com/pytorch/xla/blob/master/README.md) 참조). +TPU를 사용하려면 `xla_spawn.py` 스크립트를 실행하고 `num_cores` 인수를 사용하여 사용하려는 TPU 코어 수를 설정합니다. 
+ +```bash +python xla_spawn.py --num_cores 8 \ + summarization/run_summarization.py \ + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --overwrite_output_dir \ + --predict_with_generate +``` + + +Tensor Processing Units (TPUs)는 성능을 가속화하기 위해 특별히 설계되었습니다. +TensorFlow 스크립트는 TPU를 훈련에 사용하기 위해 [`TPUStrategy`](https://www.tensorflow.org/guide/distributed_training#tpustrategy)를 활용합니다. +TPU를 사용하려면 TPU 리소스의 이름을 `tpu` 인수에 전달합니다. + +```bash +python run_summarization.py \ + --tpu name_of_tpu_resource \ + --model_name_or_path google-t5/t5-small \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 16 \ + --num_train_epochs 3 \ + --do_train \ + --do_eval +``` + + + +## 🤗 Accelerate로 스크립트 실행하기[[run-a-script-with-accelerate]] + +🤗 [Accelerate](https://huggingface.co/docs/accelerate)는 PyTorch 훈련 과정에 대한 완전한 가시성을 유지하면서 여러 유형의 설정(CPU 전용, 다중 GPU, TPU)에서 모델을 훈련할 수 있는 통합 방법을 제공하는 PyTorch 전용 라이브러리입니다. +🤗 Accelerate가 설치되어 있는지 확인하세요: + +> 참고: Accelerate는 빠르게 개발 중이므로 스크립트를 실행하려면 accelerate를 설치해야 합니다. +```bash +pip install git+https://github.com/huggingface/accelerate +``` + +`run_summarization.py` 스크립트 대신 `run_summarization_no_trainer.py` 스크립트를 사용해야 합니다. +🤗 Accelerate 클래스가 지원되는 스크립트는 폴더에 `task_no_trainer.py` 파일이 있습니다. +다음 명령을 실행하여 구성 파일을 생성하고 저장합니다: +```bash +accelerate config +``` + +설정을 테스트하여 올바르게 구성되었는지 확인합니다: + +```bash +accelerate test +``` + +이제 훈련을 시작할 준비가 되었습니다: + +```bash +accelerate launch run_summarization_no_trainer.py \ + --model_name_or_path google-t5/t5-small \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir ~/tmp/tst-summarization +``` + +## 사용자 정의 데이터 세트 사용하기[[use-a-custom-dataset]] + +요약 스크립트는 사용자 지정 데이터 세트가 CSV 또는 JSON 파일인 경우 지원합니다. +사용자 지정 데이터 세트를 사용하는 경우에는 몇 가지 추가 인수를 지정해야 합니다: + +- `train_file`과 `validation_file`은 훈련 및 검증 파일의 경로를 지정합니다. +- `text_column`은 요약할 입력 텍스트입니다. +- `summary_column`은 출력할 대상 텍스트입니다. + +사용자 지정 데이터 세트를 사용하는 요약 스크립트는 다음과 같습니다: + +```bash +python examples/pytorch/summarization/run_summarization.py \ + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --train_file path_to_csv_or_jsonlines_file \ + --validation_file path_to_csv_or_jsonlines_file \ + --text_column text_column_name \ + --summary_column summary_column_name \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --overwrite_output_dir \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --predict_with_generate +``` + +## 스크립트 테스트하기[[test-a-script]] + +전체 데이터 세트를 대상으로 훈련을 완료하는데 꽤 오랜 시간이 걸리기 때문에, 작은 데이터 세트에서 모든 것이 예상대로 실행되는지 확인하는 것이 좋습니다. 
+ +다음 인수를 사용하여 데이터 세트를 최대 샘플 수로 잘라냅니다: +- `max_train_samples` +- `max_eval_samples` +- `max_predict_samples` + +```bash +python examples/pytorch/summarization/run_summarization.py \ + --model_name_or_path google-t5/t5-small \ + --max_train_samples 50 \ + --max_eval_samples 50 \ + --max_predict_samples 50 \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --overwrite_output_dir \ + --predict_with_generate +``` + +모든 예제 스크립트가 `max_predict_samples` 인수를 지원하지는 않습니다. +스크립트가 이 인수를 지원하는지 확실하지 않은 경우 `-h` 인수를 추가하여 확인하세요: + +```bash +examples/pytorch/summarization/run_summarization.py -h +``` + +## 체크포인트(checkpoint)에서 훈련 이어서 하기[[resume-training-from-checkpoint]] + +또 다른 유용한 옵션은 이전 체크포인트에서 훈련을 재개하는 것입니다. +이렇게 하면 훈련이 중단되더라도 처음부터 다시 시작하지 않고 중단한 부분부터 다시 시작할 수 있습니다. +체크포인트에서 훈련을 재개하는 방법에는 두 가지가 있습니다. + +첫 번째는 `output_dir previous_output_dir` 인수를 사용하여 `output_dir`에 저장된 최신 체크포인트부터 훈련을 재개하는 방법입니다. +이 경우 `overwrite_output_dir`을 제거해야 합니다: +```bash +python examples/pytorch/summarization/run_summarization.py + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --output_dir previous_output_dir \ + --predict_with_generate +``` + +두 번째는 `resume_from_checkpoint path_to_specific_checkpoint` 인수를 사용하여 특정 체크포인트 폴더에서 훈련을 재개하는 방법입니다. + +```bash +python examples/pytorch/summarization/run_summarization.py + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --overwrite_output_dir \ + --resume_from_checkpoint path_to_specific_checkpoint \ + --predict_with_generate +``` + +## 모델 공유하기[[share-your-model]] + +모든 스크립트는 최종 모델을 [Model Hub](https://huggingface.co/models)에 업로드할 수 있습니다. +시작하기 전에 Hugging Face에 로그인했는지 확인하세요: +```bash +huggingface-cli login +``` + +그런 다음 스크립트에 `push_to_hub` 인수를 추가합니다. +이 인수는 Hugging Face 사용자 이름과 `output_dir`에 지정된 폴더 이름으로 저장소를 생성합니다. + +저장소에 특정 이름을 지정하려면 `push_to_hub_model_id` 인수를 사용하여 추가합니다. +저장소는 네임스페이스 아래에 자동으로 나열됩니다. +다음 예는 특정 저장소 이름으로 모델을 업로드하는 방법입니다: + +```bash +python examples/pytorch/summarization/run_summarization.py + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --push_to_hub \ + --push_to_hub_model_id finetuned-t5-cnn_dailymail \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --overwrite_output_dir \ + --predict_with_generate +``` \ No newline at end of file diff --git a/docs/source/ko/sagemaker.md b/docs/source/ko/sagemaker.md new file mode 100644 index 00000000000000..18aafc28a1612d --- /dev/null +++ b/docs/source/ko/sagemaker.md @@ -0,0 +1,28 @@ + + +# Amazon SageMaker에서 학습 실행하기[[run-training-on-amazon-sagemaker]] + +문서가 [hf.co/docs/sagemaker](https://huggingface.co/docs/sagemaker)로 이동되었습니다. 이 페이지는 `transformers` 5.0 에서 삭제될 예정입니다. 
+ +### 목차[[table-of-content]] + +- [Train Hugging Face models on Amazon SageMaker with the SageMaker Python SDK](https://huggingface.co/docs/sagemaker/train) +- [Deploy Hugging Face models to Amazon SageMaker with the SageMaker Python SDK](https://huggingface.co/docs/sagemaker/inference) diff --git a/docs/source/ko/serialization.md b/docs/source/ko/serialization.md new file mode 100644 index 00000000000000..2e521e2b7b4af8 --- /dev/null +++ b/docs/source/ko/serialization.md @@ -0,0 +1,181 @@ + + +# ONNX로 내보내기 [[export-to-onnx]] + +🤗 Transformers 모델을 제품 환경에서 배포하기 위해서는 모델을 직렬화된 형식으로 내보내고 특정 런타임과 하드웨어에서 로드하고 실행할 수 있으면 유용합니다. + +🤗 Optimum은 Transformers의 확장으로, PyTorch 또는 TensorFlow에서 모델을 ONNX와 TFLite와 같은 직렬화된 형식으로 내보낼 수 있도록 하는 `exporters` 모듈을 통해 제공됩니다. 🤗 Optimum은 또한 성능 최적화 도구 세트를 제공하여 특정 하드웨어에서 모델을 훈련하고 실행할 때 최대 효율성을 달성할 수 있습니다. + +이 안내서는 🤗 Optimum을 사용하여 🤗 Transformers 모델을 ONNX로 내보내는 방법을 보여줍니다. TFLite로 모델을 내보내는 안내서는 [TFLite로 내보내기 페이지](tflite)를 참조하세요. + +## ONNX로 내보내기 [[export-to-onnx]] + +[ONNX (Open Neural Network eXchange)](http://onnx.ai)는 PyTorch와 TensorFlow를 포함한 다양한 프레임워크에서 심층 학습 모델을 나타내는 데 사용되는 공통 연산자 세트와 공통 파일 형식을 정의하는 오픈 표준입니다. 모델이 ONNX 형식으로 내보내지면 이러한 연산자를 사용하여 신경망을 통해 데이터가 흐르는 흐름을 나타내는 계산 그래프(일반적으로 _중간 표현_이라고 함)가 구성됩니다. + +표준화된 연산자와 데이터 유형을 가진 그래프를 노출함으로써, ONNX는 프레임워크 간에 쉽게 전환할 수 있습니다. 예를 들어, PyTorch에서 훈련된 모델을 ONNX 형식으로 내보내고 TensorFlow에서 가져올 수 있습니다(그 반대도 가능합니다). + +ONNX 형식으로 내보낸 모델은 다음과 같이 사용할 수 있습니다: +- [그래프 최적화](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization) 및 [양자화](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/quantization)와 같은 기법을 사용하여 추론을 위해 최적화됩니다. +- ONNX Runtime을 통해 실행할 수 있습니다. [`ORTModelForXXX` 클래스들](https://huggingface.co/docs/optimum/onnxruntime/package_reference/modeling_ort)을 통해 동일한 `AutoModel` API를 따릅니다. 이 API는 🤗 Transformers에서 사용하는 것과 동일합니다. +- [최적화된 추론 파이프라인](https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/pipelines)을 사용할 수 있습니다. 이는 🤗 Transformers의 [`pipeline`] 함수와 동일한 API를 가지고 있습니다. + +🤗 Optimum은 구성 객체를 활용하여 ONNX 내보내기를 지원합니다. 이러한 구성 객체는 여러 모델 아키텍처에 대해 미리 준비되어 있으며 다른 아키텍처에 쉽게 확장할 수 있도록 설계되었습니다. + +미리 준비된 구성 목록은 [🤗 Optimum 문서](https://huggingface.co/docs/optimum/exporters/onnx/overview)를 참조하세요. + +🤗 Transformers 모델을 ONNX로 내보내는 두 가지 방법이 있습니다. 여기에서 두 가지 방법을 모두 보여줍니다: + +- 🤗 Optimum을 사용하여 CLI로 내보내기 +- `optimum.onnxruntime`을 사용하여 🤗 Optimum으로 ONNX로 내보내기 + +### CLI를 사용하여 🤗 Transformers 모델을 ONNX로 내보내기 [[exporting-a-transformers-model-to-onnx-with-cli]] + +🤗 Transformers 모델을 ONNX로 내보내려면 먼저 추가 종속성을 설치하세요: + +```bash +pip install optimum[exporters] +``` + +사용 가능한 모든 인수를 확인하려면 [🤗 Optimum 문서](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli)를 참조하거나 명령줄에서 도움말을 보세요. + +```bash +optimum-cli export onnx --help +``` + +예를 들어, 🤗 Hub에서 `distilbert/distilbert-base-uncased-distilled-squad`와 같은 모델의 체크포인트를 내보내려면 다음 명령을 실행하세요: + +```bash +optimum-cli export onnx --model distilbert/distilbert-base-uncased-distilled-squad distilbert_base_uncased_squad_onnx/ +``` + +위와 같이 진행 상황을 나타내는 로그가 표시되고 결과인 `model.onnx`가 저장된 위치가 표시됩니다. + +```bash +Validating ONNX model distilbert_base_uncased_squad_onnx/model.onnx... 
+ -[✓] ONNX model output names match reference model (start_logits, end_logits) + - Validating ONNX Model output "start_logits": + -[✓] (2, 16) matches (2, 16) + -[✓] all values close (atol: 0.0001) + - Validating ONNX Model output "end_logits": + -[✓] (2, 16) matches (2, 16) + -[✓] all values close (atol: 0.0001) +The ONNX export succeeded and the exported model was saved at: distilbert_base_uncased_squad_onnx +``` + +위의 예제는 🤗 Hub에서 체크포인트를 내보내는 것을 설명합니다. 로컬 모델을 내보낼 때에는 모델의 가중치와 토크나이저 파일을 동일한 디렉토리(`local_path`)에 저장했는지 확인하세요. CLI를 사용할 때에는 🤗 Hub의 체크포인트 이름 대신 `model` 인수에 `local_path`를 전달하고 `--task` 인수를 제공하세요. 지원되는 작업의 목록은 [🤗 Optimum 문서](https://huggingface.co/docs/optimum/exporters/task_manager)를 참조하세요. `task` 인수가 제공되지 않으면 작업에 특화된 헤드 없이 모델 아키텍처로 기본 설정됩니다. + +```bash +optimum-cli export onnx --model local_path --task question-answering distilbert_base_uncased_squad_onnx/ +``` + +그 결과로 생성된 `model.onnx` 파일은 ONNX 표준을 지원하는 많은 [가속기](https://onnx.ai/supported-tools.html#deployModel) 중 하나에서 실행할 수 있습니다. 예를 들어, [ONNX Runtime](https://onnxruntime.ai/)을 사용하여 모델을 로드하고 실행할 수 있습니다: + +```python +>>> from transformers import AutoTokenizer +>>> from optimum.onnxruntime import ORTModelForQuestionAnswering + +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert_base_uncased_squad_onnx") +>>> model = ORTModelForQuestionAnswering.from_pretrained("distilbert_base_uncased_squad_onnx") +>>> inputs = tokenizer("What am I using?", "Using DistilBERT with ONNX Runtime!", return_tensors="pt") +>>> outputs = model(**inputs) +``` + +Hub의 TensorFlow 체크포인트에 대해서도 동일한 프로세스가 적용됩니다. 예를 들어, [Keras organization](https://huggingface.co/keras-io)에서 순수한 TensorFlow 체크포인트를 내보내는 방법은 다음과 같습니다: + +```bash +optimum-cli export onnx --model keras-io/transformers-qa distilbert_base_cased_squad_onnx/ +``` + +### `optimum.onnxruntime`을 사용하여 🤗 Transformers 모델을 ONNX로 내보내기 [[exporting-a-transformers-model-to-onnx-with-optimumonnxruntime]] + +CLI 대신에 `optimum.onnxruntime`을 사용하여 프로그래밍 방식으로 🤗 Transformers 모델을 ONNX로 내보낼 수도 있습니다. 다음과 같이 진행하세요: + +```python +>>> from optimum.onnxruntime import ORTModelForSequenceClassification +>>> from transformers import AutoTokenizer + +>>> model_checkpoint = "distilbert_base_uncased_squad" +>>> save_directory = "onnx/" + +>>> # Load a model from transformers and export it to ONNX +>>> ort_model = ORTModelForSequenceClassification.from_pretrained(model_checkpoint, export=True) +>>> tokenizer = AutoTokenizer.from_pretrained(model_checkpoint) + +>>> # Save the onnx model and tokenizer +>>> ort_model.save_pretrained(save_directory) +>>> tokenizer.save_pretrained(save_directory) +``` + +### 지원되지 않는 아키텍처의 모델 내보내기 [[exporting-a-model-for-an-unsupported-architecture]] + +현재 내보낼 수 없는 모델을 지원하기 위해 기여하려면, 먼저 [`optimum.exporters.onnx`](https://huggingface.co/docs/optimum/exporters/onnx/overview)에서 지원되는지 확인한 후 지원되지 않는 경우에는 [🤗 Optimum에 기여](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute)하세요. + +### `transformers.onnx`를 사용하여 모델 내보내기 [[exporting-a-model-with-transformersonnx]] + + + +`tranformers.onnx`는 더 이상 유지되지 않습니다. 위에서 설명한 대로 🤗 Optimum을 사용하여 모델을 내보내세요. 이 섹션은 향후 버전에서 제거될 예정입니다. + + + +🤗 Transformers 모델을 ONNX로 내보내려면 추가 종속성을 설치하세요: + +```bash +pip install transformers[onnx] +``` + +`transformers.onnx` 패키지를 Python 모듈로 사용하여 준비된 구성을 사용하여 체크포인트를 내보냅니다: + +```bash +python -m transformers.onnx --model=distilbert/distilbert-base-uncased onnx/ +``` + +이렇게 하면 `--model` 인수에 정의된 체크포인트의 ONNX 그래프가 내보내집니다. 🤗 Hub에서 제공하는 체크포인트나 로컬에 저장된 체크포인트를 전달할 수 있습니다. 
결과로 생성된 `model.onnx` 파일은 ONNX 표준을 지원하는 많은 가속기 중 하나에서 실행할 수 있습니다. 예를 들어, 다음과 같이 ONNX Runtime을 사용하여 모델을 로드하고 실행할 수 있습니다: + +```python +>>> from transformers import AutoTokenizer +>>> from onnxruntime import InferenceSession + +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") +>>> session = InferenceSession("onnx/model.onnx") +>>> # ONNX Runtime expects NumPy arrays as input +>>> inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np") +>>> outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs)) +``` + +필요한 출력 이름(예: `["last_hidden_state"]`)은 각 모델의 ONNX 구성을 확인하여 얻을 수 있습니다. 예를 들어, DistilBERT의 경우 다음과 같습니다: + +```python +>>> from transformers.models.distilbert import DistilBertConfig, DistilBertOnnxConfig + +>>> config = DistilBertConfig() +>>> onnx_config = DistilBertOnnxConfig(config) +>>> print(list(onnx_config.outputs.keys())) +["last_hidden_state"] +``` + +Hub의 TensorFlow 체크포인트에 대해서도 동일한 프로세스가 적용됩니다. 예를 들어, 다음과 같이 순수한 TensorFlow 체크포인트를 내보냅니다: + +```bash +python -m transformers.onnx --model=keras-io/transformers-qa onnx/ +``` + +로컬에 저장된 모델을 내보내려면 모델의 가중치 파일과 토크나이저 파일을 동일한 디렉토리에 저장한 다음, transformers.onnx 패키지의 --model 인수를 원하는 디렉토리로 지정하여 ONNX로 내보냅니다: + +```bash +python -m transformers.onnx --model=local-pt-checkpoint onnx/ +``` \ No newline at end of file diff --git a/docs/source/ko/task_summary.md b/docs/source/ko/task_summary.md new file mode 100644 index 00000000000000..a0e60c60924b99 --- /dev/null +++ b/docs/source/ko/task_summary.md @@ -0,0 +1,341 @@ + + +# 🤗 Transformers로 할 수 있는 것[[what__transformers_can_do]] + +🤗 Transformers는 자연어처리(NLP), 컴퓨터 비전, 오디오 및 음성 처리 작업에 대한 사전훈련된 최첨단 모델 라이브러리입니다. +이 라이브러리는 트랜스포머 모델뿐만 아니라 컴퓨터 비전 작업을 위한 현대적인 합성곱 신경망과 같은 트랜스포머가 아닌 모델도 포함하고 있습니다. + +스마트폰, 앱, 텔레비전과 같은 오늘날 가장 인기 있는 소비자 제품을 살펴보면, 딥러닝 기술이 그 뒤에 사용되고 있을 확률이 높습니다. +스마트폰으로 촬영한 사진에서 배경 객체를 제거하고 싶다면 어떻게 할까요? 이는 파놉틱 세그멘테이션 작업의 예입니다(아직 이게 무엇인지 모른다면, 다음 섹션에서 설명하겠습니다!). + +이 페이지는 다양한 음성 및 오디오, 컴퓨터 비전, NLP 작업을 🤗 Transformers 라이브러리를 활용하여 다루는 간단한 예제를 3줄의 코드로 제공합니다. + +## 오디오[[audio]] + + +음성 및 오디오 처리 작업은 다른 모달리티와 약간 다릅니다. 이는 주로 오디오가 연속적인 신호로 입력되기 때문입니다. +텍스트와 달리 원본 오디오 파형(waveform)은 문장이 단어로 나눠지는 것처럼 깔끔하게 이산적인 묶음으로 나눌 수 없습니다. +이를 극복하기 위해 원본 오디오 신호는 일정한 간격으로 샘플링됩니다. 해당 간격 내에서 더 많은 샘플을 취할 경우 샘플링률이 높아지며, 오디오는 원본 오디오 소스에 더 가까워집니다. + +과거의 접근 방식은 오디오에서 유용한 특징을 추출하기 위해 오디오를 전처리하는 것이었습니다. +하지만 현재는 원본 오디오 파형을 특성 인코더에 직접 넣어서 오디오 표현(representation)을 추출하는 것이 더 일반적입니다. +이렇게 하면 전처리 단계가 단순해지고 모델이 가장 중요한 특징을 학습할 수 있습니다. + +### 오디오 분류[[audio_classification]] + + +오디오 분류는 오디오 데이터에 미리 정의된 클래스 집합의 레이블을 지정하는 작업입니다. 이는 많은 구체적인 응용 프로그램을 포함한 넓은 범주입니다. + +일부 예시는 다음과 같습니다: + +* 음향 장면 분류: 오디오에 장면 레이블("사무실", "해변", "경기장")을 지정합니다. +* 음향 이벤트 감지: 오디오에 소리 이벤트 레이블("차 경적", "고래 울음소리", "유리 파손")을 지정합니다. +* 태깅: 여러 가지 소리(새 지저귐, 회의에서의 화자 식별)가 포함된 오디오에 레이블을 지정합니다. +* 음악 분류: 음악에 장르 레이블("메탈", "힙합", "컨트리")을 지정합니다. + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline(task="audio-classification", model="superb/hubert-base-superb-er") +>>> preds = classifier("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds] +>>> preds +[{'score': 0.4532, 'label': 'hap'}, + {'score': 0.3622, 'label': 'sad'}, + {'score': 0.0943, 'label': 'neu'}, + {'score': 0.0903, 'label': 'ang'}] +``` + +### 자동 음성 인식[[automatic_speech_recognition]] + + +자동 음성 인식(ASR)은 음성을 텍스트로 변환하는 작업입니다. +음성은 인간의 자연스러운 의사소통 형태이기 때문에 ASR은 가장 일반적인 오디오 작업 중 하나입니다. 
+오늘날 ASR 시스템은 스피커, 전화 및 자동차와 같은 "스마트" 기술 제품에 내장되어 있습니다. +우리는 가상 비서에게 음악 재생, 알림 설정 및 날씨 정보를 요청할 수 있습니다. + +하지만 트랜스포머 아키텍처가 해결하는 데 도움을 준 핵심 도전 과제 중 하나는 양이 데이터 양이 적은 언어(low-resource language)에 대한 것입니다. 대량의 음성 데이터로 사전 훈련한 후 데이터 양이 적은 언어에서 레이블이 지정된 음성 데이터 1시간만으로 모델을 미세 조정하면 이전의 100배 많은 레이블이 지정된 데이터로 훈련된 ASR 시스템보다 훨씬 더 높은 품질의 결과를 얻을 수 있습니다. +```py +>>> from transformers import pipeline + +>>> transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-small") +>>> transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'} +``` + +## 컴퓨터 비전[[computer_vision]] + +컴퓨터 비전 작업 중 가장 초기의 성공적인 작업 중 하나는 [합성곱 신경망(CNN)](glossary#convolution)을 사용하여 우편번호 숫자 이미지를 인식하는 것이었습니다. 이미지는 픽셀로 구성되어 있으며 각 픽셀은 숫자 값으로 표현됩니다. 이로써 이미지를 픽셀 값의 행렬로 나타내는 것이 쉬워집니다. 특정한 픽셀 값의 조합은 이미지의 색상을 의미합니다. + +컴퓨터 비전 작업은 일반적으로 다음 두 가지 방법으로 접근 가능합니다: + +1. 합성곱을 사용하여 이미지의 낮은 수준 특징에서 높은 수준의 추상적인 요소까지 계층적으로 학습합니다. + +2. 이미지를 패치로 나누고 트랜스포머를 사용하여 점진적으로 각 이미지 패치가 서로 어떠한 방식으로 연관되어 이미지를 형성하는지 학습합니다. `CNN`에서 선호하는 상향식 접근법과는 달리, 이 방식은 흐릿한 이미지로 초안을 그리고 점진적으로 선명한 이미지로 만들어가는 것과 유사합니다. + +### 이미지 분류[[image_classification]] + + +이미지 분류는 한 개의 전체 이미지에 미리 정의된 클래스 집합의 레이블을 지정하는 작업입니다. + +대부분의 분류 작업과 마찬가지로, 이미지 분류에는 다양한 실용적인 용도가 있으며, 일부 예시는 다음과 같습니다: + + +* 의료: 질병을 감지하거나 환자 건강을 모니터링하기 위해 의료 이미지에 레이블을 지정합니다. +* 환경: 위성 이미지를 분류하여 산림 벌채를 감시하고 야생 지역 관리를 위한 정보를 제공하거나 산불을 감지합니다. +* 농업: 작물 이미지를 분류하여 식물 건강을 확인하거나 위성 이미지를 분류하여 토지 이용 관찰에 사용합니다. +* 생태학: 동물이나 식물 종 이미지를 분류하여 야생 동물 개체군을 조사하거나 멸종 위기에 처한 종을 추적합니다. + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline(task="image-classification") +>>> preds = classifier( +... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +... ) +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds] +>>> print(*preds, sep="\n") +{'score': 0.4335, 'label': 'lynx, catamount'} +{'score': 0.0348, 'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor'} +{'score': 0.0324, 'label': 'snow leopard, ounce, Panthera uncia'} +{'score': 0.0239, 'label': 'Egyptian cat'} +{'score': 0.0229, 'label': 'tiger cat'} +``` + +### 객체 탐지[[object_detection]] + + +이미지 분류와 달리 객체 탐지는 이미지 내에서 여러 객체를 식별하고 바운딩 박스로 정의된 객체의 위치를 파악합니다. + +객체 탐지의 몇 가지 응용 예시는 다음과 같습니다: + +* 자율 주행 차량: 다른 차량, 보행자 및 신호등과 같은 일상적인 교통 객체를 감지합니다. +* 원격 감지: 재난 모니터링, 도시 계획 및 기상 예측 등을 수행합니다. +* 결함 탐지: 건물의 균열이나 구조적 손상, 제조 결함 등을 탐지합니다. + + +```py +>>> from transformers import pipeline + +>>> detector = pipeline(task="object-detection") +>>> preds = detector( +... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +... ) +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"], "box": pred["box"]} for pred in preds] +>>> preds +[{'score': 0.9865, + 'label': 'cat', + 'box': {'xmin': 178, 'ymin': 154, 'xmax': 882, 'ymax': 598}}] +``` + +### 이미지 분할[[image_segmentation]] + + +이미지 분할은 픽셀 차원의 작업으로, 이미지 내의 모든 픽셀을 클래스에 할당합니다. 이는 객체 탐지와 다릅니다. 객체 탐지는 바운딩 박스를 사용하여 이미지 내의 객체를 레이블링하고 예측하는 반면, 분할은 더 세분화된 작업입니다. 분할은 픽셀 수준에서 객체를 감지할 수 있습니다. + +이미지 분할에는 여러 유형이 있습니다: + +* 인스턴스 분할: 개체의 클래스를 레이블링하는 것 외에도, 개체의 각 구분된 인스턴스에도 레이블을 지정합니다 ("개-1", "개-2" 등). +* 파놉틱 분할: 의미적 분할과 인스턴스 분할의 조합입니다. 각 픽셀을 의미적 클래스로 레이블링하는 **동시에** 개체의 각각 구분된 인스턴스로도 레이블을 지정합니다. + +분할 작업은 자율 주행 차량에서 유용하며, 주변 환경의 픽셀 수준 지도를 생성하여 보행자와 다른 차량 주변에서 안전하게 탐색할 수 있습니다. 또한 의료 영상에서도 유용합니다. 
분할 작업이 픽셀 수준에서 객체를 감지할 수 있기 때문에 비정상적인 세포나 장기의 특징을 식별하는 데 도움이 될 수 있습니다. 이미지 분할은 의류 가상 시착이나 카메라를 통해 실제 세계에 가상 개체를 덧씌워 증강 현실 경험을 만드는 등 전자 상거래 분야에서도 사용될 수 있습니다. + +```py +>>> from transformers import pipeline + +>>> segmenter = pipeline(task="image-segmentation") +>>> preds = segmenter( +... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +... ) +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds] +>>> print(*preds, sep="\n") +{'score': 0.9879, 'label': 'LABEL_184'} +{'score': 0.9973, 'label': 'snow'} +{'score': 0.9972, 'label': 'cat'} +``` + +### 깊이 추정[[depth_estimation]] + +깊이 추정은 카메라로부터 이미지 내부의 각 픽셀의 거리를 예측합니다. 이 컴퓨터 비전 작업은 특히 장면 이해와 재구성에 중요합니다. 예를 들어, 자율 주행 차량은 보행자, 교통 표지판 및 다른 차량과 같은 객체와의 거리를 이해하여 장애물과 충돌을 피해야 합니다. 깊이 정보는 또한 2D 이미지에서 3D 표현을 구성하는 데 도움이 되며 생물학적 구조나 건물의 고품질 3D 표현을 생성하는 데 사용될 수 있습니다. + +깊이 추정에는 두 가지 접근 방식이 있습니다: + +* 스테레오: 약간 다른 각도에서 촬영된 동일한 이미지 두 장을 비교하여 깊이를 추정합니다. +* 단안: 단일 이미지에서 깊이를 추정합니다. + + +```py +>>> from transformers import pipeline + +>>> depth_estimator = pipeline(task="depth-estimation") +>>> preds = depth_estimator( +... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +... ) +``` + +## 자연어처리[[natural_language_processing]] + +텍스트는 인간이 의사 소통하는 자연스러운 방식 중 하나이기 때문에 자연어처리 역시 가장 일반적인 작업 유형 중 하나입니다. 모델이 인식하는 형식으로 텍스트를 변환하려면 토큰화해야 합니다. 이는 텍스트 시퀀스를 개별 단어 또는 하위 단어(토큰)로 분할한 다음 이러한 토큰을 숫자로 변환하는 것을 의미합니다. 결과적으로 텍스트 시퀀스를 숫자 시퀀스로 표현할 수 있으며, 숫자 시퀀스를 다양한 자연어처리 작업을 해결하기 위한 모델에 입력할 수 있습니다! + +### 텍스트 분류[[text_classification]] + +다른 모달리티에서의 분류 작업과 마찬가지로 텍스트 분류는 미리 정의된 클래스 집합에서 텍스트 시퀀스(문장 수준, 단락 또는 문서 등)에 레이블을 지정합니다. 텍스트 분류에는 다양한 실용적인 응용 사례가 있으며, 일부 예시는 다음과 같습니다: + +* 감성 분석: 텍스트를 `긍정` 또는 `부정`과 같은 어떤 극성에 따라 레이블링하여 정치, 금융, 마케팅과 같은 분야에서 의사 결정에 정보를 제공하고 지원할 수 있습니다. +* 콘텐츠 분류: 텍스트를 주제에 따라 레이블링(날씨, 스포츠, 금융 등)하여 뉴스 및 소셜 미디어 피드에서 정보를 구성하고 필터링하는 데 도움이 될 수 있습니다. + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline(task="sentiment-analysis") +>>> preds = classifier("Hugging Face is the best thing since sliced bread!") +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds] +>>> preds +[{'score': 0.9991, 'label': 'POSITIVE'}] +``` + +### 토큰 분류[[token_classification]] + +모든 자연어처리 작업에서는 텍스트가 개별 단어나 하위 단어로 분리되어 전처리됩니다. 분리된 단어를 [토큰](/glossary#token)이라고 합니다. 토큰 분류는 각 토큰에 미리 정의된 클래스 집합의 레이블을 할당합니다. + +토큰 분류의 두 가지 일반적인 유형은 다음과 같습니다: + +* 개체명 인식 (NER): 토큰을 조직, 인물, 위치 또는 날짜와 같은 개체 범주에 따라 레이블링합니다. NER은 특히 유전체학적인 환경에서 유전자, 단백질 및 약물 이름에 레이블을 지정하는 데 널리 사용됩니다. +* 품사 태깅 (POS): 명사, 동사, 형용사와 같은 품사에 따라 토큰에 레이블을 할당합니다. POS는 번역 시스템이 동일한 단어가 문법적으로 어떻게 다른지 이해하는 데 도움이 됩니다 (명사로 사용되는 "bank(은행)"과 동사로 사용되는 "bank(예금을 예치하다)"과 같은 경우). + + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline(task="ner") +>>> preds = classifier("Hugging Face is a French company based in New York City.") +>>> preds = [ +... { +... "entity": pred["entity"], +... "score": round(pred["score"], 4), +... "index": pred["index"], +... "word": pred["word"], +... "start": pred["start"], +... "end": pred["end"], +... } +... for pred in preds +... 
] +>>> print(*preds, sep="\n") +{'entity': 'I-ORG', 'score': 0.9968, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2} +{'entity': 'I-ORG', 'score': 0.9293, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7} +{'entity': 'I-ORG', 'score': 0.9763, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12} +{'entity': 'I-MISC', 'score': 0.9983, 'index': 6, 'word': 'French', 'start': 18, 'end': 24} +{'entity': 'I-LOC', 'score': 0.999, 'index': 10, 'word': 'New', 'start': 42, 'end': 45} +{'entity': 'I-LOC', 'score': 0.9987, 'index': 11, 'word': 'York', 'start': 46, 'end': 50} +{'entity': 'I-LOC', 'score': 0.9992, 'index': 12, 'word': 'City', 'start': 51, 'end': 55} +``` + +### 질의응답[[question_answering]] + +질의응답은 또 하나의 토큰 차원의 작업으로, 문맥이 있을 때(개방형 도메인)와 문맥이 없을 때(폐쇄형 도메인) 질문에 대한 답변을 반환합니다. 이 작업은 가상 비서에게 식당이 영업 중인지와 같은 질문을 할 때마다 발생할 수 있습니다. 고객 지원 또는 기술 지원을 제공하거나 검색 엔진이 요청한 정보를 검색하는 데 도움을 줄 수 있습니다. + +질문 답변에는 일반적으로 두 가지 유형이 있습니다: + +* 추출형: 질문과 문맥이 주어졌을 때, 모델이 주어진 문맥의 일부에서 가져온 텍스트의 범위를 답변으로 합니다. +* 생성형: 질문과 문맥이 주어졌을 때, 주어진 문맥을 통해 답변을 생성합니다. 이 접근 방식은 [`QuestionAnsweringPipeline`] 대신 [`Text2TextGenerationPipeline`]을 통해 처리됩니다. + +```py +>>> from transformers import pipeline + +>>> question_answerer = pipeline(task="question-answering") +>>> preds = question_answerer( +... question="What is the name of the repository?", +... context="The name of the repository is huggingface/transformers", +... ) +>>> print( +... f"score: {round(preds['score'], 4)}, start: {preds['start']}, end: {preds['end']}, answer: {preds['answer']}" +... ) +score: 0.9327, start: 30, end: 54, answer: huggingface/transformers +``` + +### 요약[[summarization]] + +요약은 원본 문서의 의미를 최대한 보존하면서 긴 문서를 짧은 문서로 만드는 작업입니다. 요약은 `sequence-to-sequence` 작업입니다. 입력보다 짧은 텍스트 시퀀스를 출력합니다. 요약 작업은 독자가 장문 문서들의 주요 포인트를 빠르게 이해하는 데 도움을 줄 수 있습니다. 입법안, 법률 및 금융 문서, 특허 및 과학 논문은 요약 작업이 독자의 시간을 절약하고 독서 보조 도구로 사용될 수 있는 몇 가지 예시입니다. + +질문 답변과 마찬가지로 요약에는 두 가지 유형이 있습니다: + +* 추출형: 원본 텍스트에서 가장 중요한 문장을 식별하고 추출합니다. +* 생성형: 원본 텍스트에서 목표 요약을 생성합니다. 입력 문서에 없는 새로운 단어를 포함할 수도 있습니다. [`SummarizationPipeline`]은 생성형 접근 방식을 사용합니다. + +```py +>>> from transformers import pipeline + +>>> summarizer = pipeline(task="summarization") +>>> summarizer( +... "In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles." +... ) +[{'summary_text': ' The Transformer is the first sequence transduction model based entirely on attention . It replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention . For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers .'}] +``` + +### 번역[[translation]] + +번역은 한 언어로 된 텍스트 시퀀스를 다른 언어로 변환하는 작업입니다. 이는 서로 다른 배경을 가진 사람들이 서로 소통하는 데 도움을 주는 중요한 역할을 합니다. 더 넓은 대중에게 콘텐츠를 번역하여 전달하거나, 새로운 언어를 배우는 데 도움이 되는 학습 도구가 될 수도 있습니다. 요약과 마찬가지로, 번역은 `sequence-to-sequence` 작업입니다. 즉, 모델은 입력 시퀀스를 받아서 출력이 되는 목표 시퀀스를 반환합니다. + +초기의 번역 모델은 대부분 단일 언어로 이루어져 있었지만, 최근에는 많은 언어 쌍 간에 번역을 수행할 수 있는 다중 언어 모델에 대한 관심이 높아지고 있습니다. 
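+
+바로 아래의 기본 예제는 T5로 영어를 프랑스어로 번역합니다. 위에서 언급한 다중 언어 모델을 직접 사용해 보고 싶다면, 예를 들어 NLLB 계열 체크포인트처럼 원본 언어와 목표 언어 코드를 함께 지정하는 방식을 쓸 수 있습니다. 아래 코드는 이러한 사용 방식을 보여주는 간단한 스케치이며, 체크포인트 이름과 언어 코드는 예시로 든 것입니다:
+
+```py
+>>> from transformers import pipeline
+
+>>> # 다중 언어 모델은 일반적으로 원본/목표 언어 코드를 함께 지정합니다 (예시 체크포인트와 언어 코드)
+>>> translator = pipeline(
+...     task="translation", model="facebook/nllb-200-distilled-600M", src_lang="eng_Latn", tgt_lang="kor_Hang"
+... )
+>>> translator("Hugging Face is a community-based open-source platform for machine learning.")  # doctest: +SKIP
+```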
+ +```py +>>> from transformers import pipeline + +>>> text = "translate English to French: Hugging Face is a community-based open-source platform for machine learning." +>>> translator = pipeline(task="translation", model="google-t5/t5-small") +>>> translator(text) +[{'translation_text': "Hugging Face est une tribune communautaire de l'apprentissage des machines."}] +``` + +### 언어 모델링[[language_modeling]] + +언어 모델링은 텍스트 시퀀스에서 단어를 예측하는 작업입니다. 사전 훈련된 언어 모델은 많은 다른 하위 작업에 따라 미세 조정될 수 있기 때문에 매우 인기 있는 자연어처리 작업이 되었습니다. 최근에는 제로 샷(zero-shot) 또는 퓨 샷(few-shot) 학습이 가능한 대규모 언어 모델(Large Language Models, LLM)에 대한 많은 관심이 발생하고 있습니다. 이는 모델이 명시적으로 훈련되지 않은 작업도 해결할 수 있다는 것을 의미합니다! 언어 모델은 유창하고 설득력 있는 텍스트를 생성하는 데 사용될 수 있지만, 텍스트가 항상 정확하지는 않을 수 있으므로 주의가 필요합니다. + +언어 모델링에는 두 가지 유형이 있습니다: + +* 인과적 언어 모델링: 이 모델의 목적은 시퀀스에서 다음 토큰을 예측하는 것이며, 미래 토큰이 마스킹 됩니다. + ```py + >>> from transformers import pipeline + + >>> prompt = "Hugging Face is a community-based open-source platform for machine learning." + >>> generator = pipeline(task="text-generation") + >>> generator(prompt) # doctest: +SKIP + ``` + +* 마스킹된 언어 모델링: 이 모델의 목적은 시퀀스 내의 마스킹된 토큰을 예측하는 것이며, 시퀀스 내의 모든 토큰에 대한 접근이 제공됩니다. + + ```py + >>> text = "Hugging Face is a community-based open-source for machine learning." + >>> fill_mask = pipeline(task="fill-mask") + >>> preds = fill_mask(text, top_k=1) + >>> preds = [ + ... { + ... "score": round(pred["score"], 4), + ... "token": pred["token"], + ... "token_str": pred["token_str"], + ... "sequence": pred["sequence"], + ... } + ... for pred in preds + ... ] + >>> preds + [{'score': 0.2236, + 'token': 1761, + 'token_str': ' platform', + 'sequence': 'Hugging Face is a community-based open-source platform for machine learning.'}] + ``` + +이 페이지를 통해 각 모달리티의 다양한 작업 유형과 각 작업의 실용적 중요성에 대해 추가적인 배경 정보를 얻으셨기를 바랍니다. 다음 [섹션](tasks_explained)에서는 🤗 Transformer가 이러한 작업을 해결하는 **방법**에 대해 알아보실 수 있습니다. \ No newline at end of file diff --git a/docs/source/ko/tasks/asr.md b/docs/source/ko/tasks/asr.md new file mode 100644 index 00000000000000..47a568ecf02bb4 --- /dev/null +++ b/docs/source/ko/tasks/asr.md @@ -0,0 +1,380 @@ + + +# 자동 음성 인식[[automatic-speech-recognition]] + +[[open-in-colab]] + + + +자동 음성 인식(Automatic Speech Recognition, ASR)은 음성 신호를 텍스트로 변환하여 음성 입력 시퀀스를 텍스트 출력에 매핑합니다. +Siri와 Alexa와 같은 가상 어시스턴트는 ASR 모델을 사용하여 일상적으로 사용자를 돕고 있으며, 회의 중 라이브 캡션 및 메모 작성과 같은 유용한 사용자 친화적 응용 프로그램도 많이 있습니다. + +이 가이드에서 소개할 내용은 아래와 같습니다: + +1. [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) 데이터 세트에서 [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base)를 미세 조정하여 오디오를 텍스트로 변환합니다. +2. 미세 조정한 모델을 추론에 사용합니다. + + +이 튜토리얼에서 설명하는 작업은 다음 모델 아키텍처에 의해 지원됩니다: + + + +[Data2VecAudio](../model_doc/data2vec-audio), [Hubert](../model_doc/hubert), [M-CTC-T](../model_doc/mctct), [SEW](../model_doc/sew), [SEW-D](../model_doc/sew-d), [UniSpeech](../model_doc/unispeech), [UniSpeechSat](../model_doc/unispeech-sat), [Wav2Vec2](../model_doc/wav2vec2), [Wav2Vec2-Conformer](../model_doc/wav2vec2-conformer), [WavLM](../model_doc/wavlm) + + + + + +시작하기 전에 필요한 모든 라이브러리가 설치되어 있는지 확인하세요: + +```bash +pip install transformers datasets evaluate jiwer +``` + +Hugging Face 계정에 로그인하면 모델을 업로드하고 커뮤니티에 공유할 수 있습니다. 토큰을 입력하여 로그인하세요. + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## MInDS-14 데이터 세트 가져오기[[load-minds-14-dataset]] + +먼저, 🤗 Datasets 라이브러리에서 [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) 데이터 세트의 일부분을 가져오세요. +이렇게 하면 전체 데이터 세트에 대한 훈련에 시간을 들이기 전에 모든 것이 작동하는지 실험하고 검증할 수 있습니다. 
+ +```py +>>> from datasets import load_dataset, Audio + +>>> minds = load_dataset("PolyAI/minds14", name="en-US", split="train[:100]") +``` + +[`~Dataset.train_test_split`] 메소드를 사용하여 데이터 세트의 `train`을 훈련 세트와 테스트 세트로 나누세요: + +```py +>>> minds = minds.train_test_split(test_size=0.2) +``` + +그리고 데이터 세트를 확인하세요: + +```py +>>> minds +DatasetDict({ + train: Dataset({ + features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'], + num_rows: 16 + }) + test: Dataset({ + features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'], + num_rows: 4 + }) +}) +``` + +데이터 세트에는 `lang_id`와 `english_transcription`과 같은 유용한 정보가 많이 포함되어 있지만, 이 가이드에서는 `audio`와 `transcription`에 초점을 맞출 것입니다. 다른 열은 [`~datasets.Dataset.remove_columns`] 메소드를 사용하여 제거하세요: + +```py +>>> minds = minds.remove_columns(["english_transcription", "intent_class", "lang_id"]) +``` + +예시를 다시 한번 확인해보세요: + +```py +>>> minds["train"][0] +{'audio': {'array': array([-0.00024414, 0. , 0. , ..., 0.00024414, + 0.00024414, 0.00024414], dtype=float32), + 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav', + 'sampling_rate': 8000}, + 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav', + 'transcription': "hi I'm trying to use the banking app on my phone and currently my checking and savings account balance is not refreshing"} +``` + +두 개의 필드가 있습니다: + +- `audio`: 오디오 파일을 가져오고 리샘플링하기 위해 호출해야 하는 음성 신호의 1차원 `array(배열)` +- `transcription`: 목표 텍스트 + +## 전처리[[preprocess]] + +다음으로 오디오 신호를 처리하기 위한 Wav2Vec2 프로세서를 가져옵니다: + +```py +>>> from transformers import AutoProcessor + +>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base") +``` + +MInDS-14 데이터 세트의 샘플링 레이트는 8000kHz이므로([데이터 세트 카드](https://huggingface.co/datasets/PolyAI/minds14)에서 확인), 사전 훈련된 Wav2Vec2 모델을 사용하려면 데이터 세트를 16000kHz로 리샘플링해야 합니다: + +```py +>>> minds = minds.cast_column("audio", Audio(sampling_rate=16_000)) +>>> minds["train"][0] +{'audio': {'array': array([-2.38064706e-04, -1.58618059e-04, -5.43987835e-06, ..., + 2.78103951e-04, 2.38446111e-04, 1.18740834e-04], dtype=float32), + 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav', + 'sampling_rate': 16000}, + 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav', + 'transcription': "hi I'm trying to use the banking app on my phone and currently my checking and savings account balance is not refreshing"} +``` + +위의 'transcription'에서 볼 수 있듯이 텍스트는 대문자와 소문자가 섞여 있습니다. Wav2Vec2 토크나이저는 대문자 문자에 대해서만 훈련되어 있으므로 텍스트가 토크나이저의 어휘와 일치하는지 확인해야 합니다: + +```py +>>> def uppercase(example): +... return {"transcription": example["transcription"].upper()} + + +>>> minds = minds.map(uppercase) +``` + +이제 다음 작업을 수행할 전처리 함수를 만들어보겠습니다: + +1. `audio` 열을 호출하여 오디오 파일을 가져오고 리샘플링합니다. +2. 오디오 파일에서 `input_values`를 추출하고 프로세서로 `transcription` 열을 토큰화합니다. + +```py +>>> def prepare_dataset(batch): +... audio = batch["audio"] +... batch = processor(audio["array"], sampling_rate=audio["sampling_rate"], text=batch["transcription"]) +... batch["input_length"] = len(batch["input_values"][0]) +... 
return batch +``` + +전체 데이터 세트에 전처리 함수를 적용하려면 🤗 Datasets [`~datasets.Dataset.map`] 함수를 사용하세요. `num_proc` 매개변수를 사용하여 프로세스 수를 늘리면 `map`의 속도를 높일 수 있습니다. [`~datasets.Dataset.remove_columns`] 메소드를 사용하여 필요하지 않은 열을 제거하세요: + +```py +>>> encoded_minds = minds.map(prepare_dataset, remove_columns=minds.column_names["train"], num_proc=4) +``` + +🤗 Transformers에는 자동 음성 인식용 데이터 콜레이터가 없으므로 예제 배치를 생성하려면 [`DataCollatorWithPadding`]을 조정해야 합니다. 이렇게 하면 데이터 콜레이터는 텍스트와 레이블을 배치에서 가장 긴 요소의 길이에 동적으로 패딩하여 길이를 균일하게 합니다. `tokenizer` 함수에서 `padding=True`를 설정하여 텍스트를 패딩할 수 있지만, 동적 패딩이 더 효율적입니다. + +다른 데이터 콜레이터와 달리 이 특정 데이터 콜레이터는 `input_values`와 `labels`에 대해 다른 패딩 방법을 적용해야 합니다. + +```py +>>> import torch + +>>> from dataclasses import dataclass, field +>>> from typing import Any, Dict, List, Optional, Union + + +>>> @dataclass +... class DataCollatorCTCWithPadding: +... processor: AutoProcessor +... padding: Union[bool, str] = "longest" + +... def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]: +... # 입력과 레이블을 분할합니다 +... # 길이가 다르고, 각각 다른 패딩 방법을 사용해야 하기 때문입니다 +... input_features = [{"input_values": feature["input_values"][0]} for feature in features] +... label_features = [{"input_ids": feature["labels"]} for feature in features] + +... batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt") + +... labels_batch = self.processor.pad(labels=label_features, padding=self.padding, return_tensors="pt") + +... # 패딩에 대해 손실을 적용하지 않도록 -100으로 대체합니다 +... labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100) + +... batch["labels"] = labels + +... return batch +``` + +이제 `DataCollatorForCTCWithPadding`을 인스턴스화합니다: + +```py +>>> data_collator = DataCollatorCTCWithPadding(processor=processor, padding="longest") +``` + +## 평가하기[[evaluate]] + +훈련 중에 평가 지표를 포함하면 모델의 성능을 평가하는 데 도움이 되는 경우가 많습니다. 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) 라이브러리를 사용하면 평가 방법을 빠르게 불러올 수 있습니다. +이 작업에서는 [단어 오류율(Word Error Rate, WER)](https://huggingface.co/spaces/evaluate-metric/wer) 평가 지표를 가져옵니다. +(평가 지표를 불러오고 계산하는 방법은 🤗 Evaluate [둘러보기](https://huggingface.co/docs/evaluate/a_quick_tour)를 참조하세요): + +```py +>>> import evaluate + +>>> wer = evaluate.load("wer") +``` + +그런 다음 예측값과 레이블을 [`~evaluate.EvaluationModule.compute`]에 전달하여 WER을 계산하는 함수를 만듭니다: + +```py +>>> import numpy as np + + +>>> def compute_metrics(pred): +... pred_logits = pred.predictions +... pred_ids = np.argmax(pred_logits, axis=-1) + +... pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id + +... pred_str = processor.batch_decode(pred_ids) +... label_str = processor.batch_decode(pred.label_ids, group_tokens=False) + +... wer = wer.compute(predictions=pred_str, references=label_str) + +... return {"wer": wer} +``` + +이제 `compute_metrics` 함수를 사용할 준비가 되었으며, 훈련을 설정할 때 이 함수로 되돌아올 것입니다. + +## 훈련하기[[train]] + + + + + +[`Trainer`]로 모델을 미세 조정하는 것이 익숙하지 않다면, [여기](../training#train-with-pytorch-trainer)에서 기본 튜토리얼을 확인해보세요! + + + +이제 모델 훈련을 시작할 준비가 되었습니다! [`AutoModelForCTC`]로 Wav2Vec2를 가져오세요. `ctc_loss_reduction` 매개변수로 CTC 손실에 적용할 축소(reduction) 방법을 지정하세요. 기본값인 합계 대신 평균을 사용하는 것이 더 좋은 경우가 많습니다: + +```py +>>> from transformers import AutoModelForCTC, TrainingArguments, Trainer + +>>> model = AutoModelForCTC.from_pretrained( +... "facebook/wav2vec2-base", +... ctc_loss_reduction="mean", +... pad_token_id=processor.tokenizer.pad_token_id, +... ) +``` + +이제 세 단계만 남았습니다: + +1. [`TrainingArguments`]에서 훈련 하이퍼파라미터를 정의하세요. `output_dir`은 모델을 저장할 경로를 지정하는 유일한 필수 매개변수입니다. 
`push_to_hub=True`를 설정하여 모델을 Hub에 업로드 할 수 있습니다(모델을 업로드하려면 Hugging Face에 로그인해야 합니다). [`Trainer`]는 각 에폭마다 WER을 평가하고 훈련 체크포인트를 저장합니다. +2. 모델, 데이터 세트, 토크나이저, 데이터 콜레이터, `compute_metrics` 함수와 함께 [`Trainer`]에 훈련 인수를 전달하세요. +3. [`~Trainer.train`]을 호출하여 모델을 미세 조정하세요. + +```py +>>> training_args = TrainingArguments( +... output_dir="my_awesome_asr_mind_model", +... per_device_train_batch_size=8, +... gradient_accumulation_steps=2, +... learning_rate=1e-5, +... warmup_steps=500, +... max_steps=2000, +... gradient_checkpointing=True, +... fp16=True, +... group_by_length=True, +... evaluation_strategy="steps", +... per_device_eval_batch_size=8, +... save_steps=1000, +... eval_steps=1000, +... logging_steps=25, +... load_best_model_at_end=True, +... metric_for_best_model="wer", +... greater_is_better=False, +... push_to_hub=True, +... ) + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=encoded_minds["train"], +... eval_dataset=encoded_minds["test"], +... tokenizer=processor.feature_extractor, +... data_collator=data_collator, +... compute_metrics=compute_metrics, +... ) + +>>> trainer.train() +``` + +훈련이 완료되면 모두가 모델을 사용할 수 있도록 [`~transformers.Trainer.push_to_hub`] 메소드를 사용하여 모델을 Hub에 공유하세요: + +```py +>>> trainer.push_to_hub() +``` + + + + + +자동 음성 인식을 위해 모델을 미세 조정하는 더 자세한 예제는 영어 자동 음성 인식을 위한 [블로그 포스트](https://huggingface.co/blog/fine-tune-wav2vec2-english)와 다국어 자동 음성 인식을 위한 [포스트](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2)를 참조하세요. + + + +## 추론하기[[inference]] + +좋아요, 이제 모델을 미세 조정했으니 추론에 사용할 수 있습니다! + +추론에 사용할 오디오 파일을 가져오세요. 필요한 경우 오디오 파일의 샘플링 비율을 모델의 샘플링 레이트에 맞게 리샘플링하는 것을 잊지 마세요! + +```py +>>> from datasets import load_dataset, Audio + +>>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train") +>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000)) +>>> sampling_rate = dataset.features["audio"].sampling_rate +>>> audio_file = dataset[0]["audio"]["path"] +``` + +추론을 위해 미세 조정된 모델을 시험해보는 가장 간단한 방법은 [`pipeline`]을 사용하는 것입니다. 모델을 사용하여 자동 음성 인식을 위한 `pipeline`을 인스턴스화하고 오디오 파일을 전달하세요: + +```py +>>> from transformers import pipeline + +>>> transcriber = pipeline("automatic-speech-recognition", model="stevhliu/my_awesome_asr_minds_model") +>>> transcriber(audio_file) +{'text': 'I WOUD LIKE O SET UP JOINT ACOUNT WTH Y PARTNER'} +``` + + + +텍스트로 변환된 결과가 꽤 괜찮지만 더 좋을 수도 있습니다! 더 나은 결과를 얻으려면 더 많은 예제로 모델을 미세 조정하세요! + + + +`pipeline`의 결과를 수동으로 재현할 수도 있습니다: + + + +오디오 파일과 텍스트를 전처리하고 PyTorch 텐서로 `input`을 반환할 프로세서를 가져오세요: + +```py +>>> from transformers import AutoProcessor + +>>> processor = AutoProcessor.from_pretrained("stevhliu/my_awesome_asr_mind_model") +>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt") +``` + +입력을 모델에 전달하고 로짓을 반환하세요: + +```py +>>> from transformers import AutoModelForCTC + +>>> model = AutoModelForCTC.from_pretrained("stevhliu/my_awesome_asr_mind_model") +>>> with torch.no_grad(): +... 
logits = model(**inputs).logits +``` + +가장 높은 확률의 `input_ids`를 예측하고, 프로세서를 사용하여 예측된 `input_ids`를 다시 텍스트로 디코딩하세요: + +```py +>>> import torch + +>>> predicted_ids = torch.argmax(logits, dim=-1) +>>> transcription = processor.batch_decode(predicted_ids) +>>> transcription +['I WOUL LIKE O SET UP JOINT ACOUNT WTH Y PARTNER'] +``` + + \ No newline at end of file diff --git a/docs/source/ko/tasks/audio_classification.md b/docs/source/ko/tasks/audio_classification.md new file mode 100644 index 00000000000000..7e1094815fd429 --- /dev/null +++ b/docs/source/ko/tasks/audio_classification.md @@ -0,0 +1,329 @@ + + +# 오디오 분류[[audio_classification]] + +[[open-in-colab]] + + + +오디오 분류는 텍스트와 마찬가지로 입력 데이터에 클래스 레이블 출력을 할당합니다. 유일한 차이점은 텍스트 입력 대신 원시 오디오 파형이 있다는 것입니다. 오디오 분류의 실제 적용 분야에는 화자의 의도 파악, 언어 분류, 소리로 동물 종을 식별하는 것 등이 있습니다. + +이 문서에서 방법을 알아보겠습니다: + +1. [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) 데이터 세트를 [Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base)로 미세 조정하여 화자의 의도를 분류합니다. +2. 추론에 미세 조정된 모델을 사용하세요. + + +이 튜토리얼에서 설명하는 작업은 아래의 모델 아키텍처에서 지원됩니다: + + + +[Audio Spectrogram Transformer](../model_doc/audio-spectrogram-transformer), [Data2VecAudio](../model_doc/data2vec-audio), [Hubert](../model_doc/hubert), [SEW](../model_doc/sew), [SEW-D](../model_doc/sew-d), [UniSpeech](../model_doc/unispeech), [UniSpeechSat](../model_doc/unispeech-sat), [Wav2Vec2](../model_doc/wav2vec2), [Wav2Vec2-Conformer](../model_doc/wav2vec2-conformer), [WavLM](../model_doc/wavlm), [Whisper](../model_doc/whisper) + + + + + +시작하기 전에 필요한 라이브러리가 모두 설치되어 있는지 확인하세요: + +```bash +pip install transformers datasets evaluate +``` + +모델을 업로드하고 커뮤니티와 공유할 수 있도록 허깅페이스 계정에 로그인하는 것이 좋습니다. 메시지가 표시되면 토큰을 입력하여 로그인합니다: + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## MInDS-14 데이터셋 불러오기[[load_minds_14_dataset]] + +먼저 🤗 Datasets 라이브러리에서 MinDS-14 데이터 세트를 가져옵니다: + +```py +>>> from datasets import load_dataset, Audio + +>>> minds = load_dataset("PolyAI/minds14", name="en-US", split="train") +``` + +데이터 세트의 `train` 분할을 [`~datasets.Dataset.train_test_split`] 메소드를 사용하여 더 작은 훈련 및 테스트 집합으로 분할합니다. 이렇게 하면 전체 데이터 세트에 더 많은 시간을 소비하기 전에 모든 것이 작동하는지 실험하고 확인할 수 있습니다. + +```py +>>> minds = minds.train_test_split(test_size=0.2) +``` + +이제 데이터 집합을 살펴볼게요: + +```py +>>> minds +DatasetDict({ + train: Dataset({ + features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'], + num_rows: 450 + }) + test: Dataset({ + features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'], + num_rows: 113 + }) +}) +``` + +데이터 세트에는 `lang_id` 및 `english_transcription`과 같은 유용한 정보가 많이 포함되어 있지만 이 가이드에서는 `audio` 및 `intent_class`에 중점을 둘 것입니다. 다른 열은 [`~datasets.Dataset.remove_columns`] 메소드를 사용하여 제거합니다: + +```py +>>> minds = minds.remove_columns(["path", "transcription", "english_transcription", "lang_id"]) +``` + +예시를 살펴보겠습니다: + +```py +>>> minds["train"][0] +{'audio': {'array': array([ 0. , 0. , 0. , ..., -0.00048828, + -0.00024414, -0.00024414], dtype=float32), + 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602b9a5fbb1e6d0fbce91f52.wav', + 'sampling_rate': 8000}, + 'intent_class': 2} +``` + +두 개의 필드가 있습니다: + +- `audio`: 오디오 파일을 가져오고 리샘플링하기 위해 호출해야 하는 음성 신호의 1차원 `배열`입니다. +- `intent_class`: 화자의 의도에 대한 클래스 ID를 나타냅니다. 
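+
+`audio` 필드가 실제로 어떤 형태인지 확인하고 싶다면, 배열과 샘플링 속도를 꺼내 노트북에서 직접 들어볼 수 있습니다. 아래는 이를 위한 간단한 예시로, IPython이 설치된 노트북 환경을 가정합니다:
+
+```py
+>>> from IPython.display import Audio
+
+>>> sample = minds["train"][0]["audio"]
+>>> # 배열 길이와 샘플링 속도를 확인합니다
+>>> sample["array"].shape, sample["sampling_rate"]  # doctest: +SKIP
+>>> # 노트북 환경이라면 오디오 플레이어로 직접 들어볼 수 있습니다
+>>> Audio(sample["array"], rate=sample["sampling_rate"])  # doctest: +SKIP
+```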
+ +모델이 레이블 ID에서 레이블 이름을 쉽게 가져올 수 있도록 레이블 이름을 정수로 매핑하는 사전을 만들거나 그 반대로 매핑하는 사전을 만듭니다: + +```py +>>> labels = minds["train"].features["intent_class"].names +>>> label2id, id2label = dict(), dict() +>>> for i, label in enumerate(labels): +... label2id[label] = str(i) +... id2label[str(i)] = label +``` + +이제 레이블 ID를 레이블 이름으로 변환할 수 있습니다: + +```py +>>> id2label[str(2)] +'app_error' +``` + +## 전처리[[preprocess]] + +다음 단계는 오디오 신호를 처리하기 위해 Wav2Vec2 특징 추출기를 가져오는 것입니다: + +```py +>>> from transformers import AutoFeatureExtractor + +>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base") +``` + +MinDS-14 데이터 세트의 샘플링 속도는 8000khz이므로(이 정보는 [데이터세트 카드](https://huggingface.co/datasets/PolyAI/minds14)에서 확인할 수 있습니다), 사전 훈련된 Wav2Vec2 모델을 사용하려면 데이터 세트를 16000kHz로 리샘플링해야 합니다: + +```py +>>> minds = minds.cast_column("audio", Audio(sampling_rate=16_000)) +>>> minds["train"][0] +{'audio': {'array': array([ 2.2098757e-05, 4.6582241e-05, -2.2803260e-05, ..., + -2.8419291e-04, -2.3305941e-04, -1.1425107e-04], dtype=float32), + 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602b9a5fbb1e6d0fbce91f52.wav', + 'sampling_rate': 16000}, + 'intent_class': 2} +``` + +이제 전처리 함수를 만듭니다: + +1. 가져올 `오디오` 열을 호출하고 필요한 경우 오디오 파일을 리샘플링합니다. +2. 오디오 파일의 샘플링 속도가 모델에 사전 훈련된 오디오 데이터의 샘플링 속도와 일치하는지 확인합니다. 이 정보는 Wav2Vec2 [모델 카드](https://huggingface.co/facebook/wav2vec2-base)에서 확인할 수 있습니다. +3. 긴 입력이 잘리지 않고 일괄 처리되도록 최대 입력 길이를 설정합니다. + +```py +>>> def preprocess_function(examples): +... audio_arrays = [x["array"] for x in examples["audio"]] +... inputs = feature_extractor( +... audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=True +... ) +... return inputs +``` + +전체 데이터 세트에 전처리 기능을 적용하려면 🤗 Datasets [`~datasets.Dataset.map`] 함수를 사용합니다. `batched=True`를 설정하여 데이터 집합의 여러 요소를 한 번에 처리하면 `map`의 속도를 높일 수 있습니다. 필요하지 않은 열을 제거하고 `intent_class`의 이름을 모델이 예상하는 이름인 `label`로 변경합니다: + +```py +>>> encoded_minds = minds.map(preprocess_function, remove_columns="audio", batched=True) +>>> encoded_minds = encoded_minds.rename_column("intent_class", "label") +``` + +## 평가하기[[evaluate]] + +훈련 중에 메트릭을 포함하면 모델의 성능을 평가하는 데 도움이 되는 경우가 많습니다. 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) 라이브러리를 사용하여 평가 방법을 빠르게 가져올 수 있습니다. 이 작업에서는 [accuracy(정확도)](https://huggingface.co/spaces/evaluate-metric/accuracy) 메트릭을 가져옵니다(메트릭을 가져오고 계산하는 방법에 대한 자세한 내용은 🤗 Evalutate [빠른 둘러보기](https://huggingface.co/docs/evaluate/a_quick_tour) 참조하세요): + +```py +>>> import evaluate + +>>> accuracy = evaluate.load("accuracy") +``` + +그런 다음 예측과 레이블을 [`~evaluate.EvaluationModule.compute`]에 전달하여 정확도를 계산하는 함수를 만듭니다: + +```py +>>> import numpy as np + + +>>> def compute_metrics(eval_pred): +... predictions = np.argmax(eval_pred.predictions, axis=1) +... return accuracy.compute(predictions=predictions, references=eval_pred.label_ids) +``` + +이제 `compute_metrics` 함수를 사용할 준비가 되었으며, 트레이닝을 설정할 때 이 함수를 사용합니다. + +## 훈련[[train]] + + + + + +[`Trainer`]로 모델을 미세 조정하는 데 익숙하지 않다면 기본 튜토리얼 [여기](../training#train-with-pytorch-trainer)을 살펴보세요! + + + +이제 모델 훈련을 시작할 준비가 되었습니다! [`AutoModelForAudioClassification`]을 이용해서 Wav2Vec2를 불러옵니다. 예상되는 레이블 수와 레이블 매핑을 지정합니다: + +```py +>>> from transformers import AutoModelForAudioClassification, TrainingArguments, Trainer + +>>> num_labels = len(id2label) +>>> model = AutoModelForAudioClassification.from_pretrained( +... "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label +... 
) +``` + +이제 세 단계만 남았습니다: + +1. 훈련 하이퍼파라미터를 [`TrainingArguments`]에 정의합니다. 유일한 필수 매개변수는 모델을 저장할 위치를 지정하는 `output_dir`입니다. `push_to_hub = True`를 설정하여 이 모델을 허브로 푸시합니다(모델을 업로드하려면 허깅 페이스에 로그인해야 합니다). 각 에폭이 끝날 때마다 [`Trainer`]가 정확도를 평가하고 훈련 체크포인트를 저장합니다. +2. 모델, 데이터 세트, 토크나이저, 데이터 콜레이터, `compute_metrics` 함수와 함께 훈련 인자를 [`Trainer`]에 전달합니다. +3. [`~Trainer.train`]을 호출하여 모델을 미세 조정합니다. + + +```py +>>> training_args = TrainingArguments( +... output_dir="my_awesome_mind_model", +... evaluation_strategy="epoch", +... save_strategy="epoch", +... learning_rate=3e-5, +... per_device_train_batch_size=32, +... gradient_accumulation_steps=4, +... per_device_eval_batch_size=32, +... num_train_epochs=10, +... warmup_ratio=0.1, +... logging_steps=10, +... load_best_model_at_end=True, +... metric_for_best_model="accuracy", +... push_to_hub=True, +... ) + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=encoded_minds["train"], +... eval_dataset=encoded_minds["test"], +... tokenizer=feature_extractor, +... compute_metrics=compute_metrics, +... ) + +>>> trainer.train() +``` + +훈련이 완료되면 모든 사람이 모델을 사용할 수 있도록 [`~transformers.Trainer.push_to_hub`] 메소드를 사용하여 모델을 허브에 공유하세요: + +```py +>>> trainer.push_to_hub() +``` + + + + + +For a more in-depth example of how to finetune a model for audio classification, take a look at the corresponding [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb). + + + +## 추론[[inference]] + +이제 모델을 미세 조정했으니 추론에 사용할 수 있습니다! + +추론을 실행할 오디오 파일을 가져옵니다. 필요한 경우 오디오 파일의 샘플링 속도를 모델의 샘플링 속도와 일치하도록 리샘플링하는 것을 잊지 마세요! + +```py +>>> from datasets import load_dataset, Audio + +>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train") +>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000)) +>>> sampling_rate = dataset.features["audio"].sampling_rate +>>> audio_file = dataset[0]["audio"]["path"] +``` + +추론을 위해 미세 조정한 모델을 시험해 보는 가장 간단한 방법은 [`pipeline`]에서 사용하는 것입니다. 모델을 사용하여 오디오 분류를 위한 `pipeline`을 인스턴스화하고 오디오 파일을 전달합니다: + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline("audio-classification", model="stevhliu/my_awesome_minds_model") +>>> classifier(audio_file) +[ + {'score': 0.09766869246959686, 'label': 'cash_deposit'}, + {'score': 0.07998877018690109, 'label': 'app_error'}, + {'score': 0.0781070664525032, 'label': 'joint_account'}, + {'score': 0.07667109370231628, 'label': 'pay_bill'}, + {'score': 0.0755252093076706, 'label': 'balance'} +] +``` + +원하는 경우 `pipeline`의 결과를 수동으로 복제할 수도 있습니다: + + + +특징 추출기를 가져와서 오디오 파일을 전처리하고 `입력`을 PyTorch 텐서로 반환합니다: + +```py +>>> from transformers import AutoFeatureExtractor + +>>> feature_extractor = AutoFeatureExtractor.from_pretrained("stevhliu/my_awesome_minds_model") +>>> inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt") +``` + +모델에 입력을 전달하고 로짓을 반환합니다: + +```py +>>> from transformers import AutoModelForAudioClassification + +>>> model = AutoModelForAudioClassification.from_pretrained("stevhliu/my_awesome_minds_model") +>>> with torch.no_grad(): +... 
logits = model(**inputs).logits +``` + +확률이 가장 높은 클래스를 가져온 다음 모델의 `id2label` 매핑을 사용하여 이를 레이블로 변환합니다: + +```py +>>> import torch + +>>> predicted_class_ids = torch.argmax(logits).item() +>>> predicted_label = model.config.id2label[predicted_class_ids] +>>> predicted_label +'cash_deposit' +``` + + \ No newline at end of file diff --git a/docs/source/ko/tasks/document_question_answering.md b/docs/source/ko/tasks/document_question_answering.md new file mode 100644 index 00000000000000..b9e98f3bf67235 --- /dev/null +++ b/docs/source/ko/tasks/document_question_answering.md @@ -0,0 +1,482 @@ + + +# 문서 질의 응답(Document Question Answering) [[document_question_answering]] + +[[open-in-colab]] + +문서 시각적 질의 응답(Document Visual Question Answering)이라고도 하는 +문서 질의 응답(Document Question Answering)은 문서 이미지에 대한 질문에 답변을 주는 태스크입니다. +이 태스크를 지원하는 모델의 입력은 일반적으로 이미지와 질문의 조합이고, 출력은 자연어로 된 답변입니다. 이러한 모델은 텍스트, 단어의 위치(바운딩 박스), 이미지 등 다양한 모달리티를 활용합니다. + +이 가이드는 다음 내용을 설명합니다: + +- [DocVQA dataset](https://huggingface.co/datasets/nielsr/docvqa_1200_examples_donut)을 사용해 [LayoutLMv2](../model_doc/layoutlmv2) 미세 조정하기 +- 추론을 위해 미세 조정된 모델을 사용하기 + + + +이 튜토리얼에서 설명하는 태스크는 다음과 같은 모델 아키텍처에서 지원됩니다: + + + +[LayoutLM](../model_doc/layoutlm), [LayoutLMv2](../model_doc/layoutlmv2), [LayoutLMv3](../model_doc/layoutlmv3) + + + + + +LayoutLMv2는 토큰의 마지막 은닉층 위에 질의 응답 헤드를 추가해 답변의 시작 토큰과 끝 토큰의 위치를 예측함으로써 문서 질의 응답 태스크를 해결합니다. 즉, 문맥이 주어졌을 때 질문에 답하는 정보를 추출하는 추출형 질의 응답(Extractive question answering)으로 문제를 처리합니다. +문맥은 OCR 엔진의 출력에서 가져오며, 여기서는 Google의 Tesseract를 사용합니다. + +시작하기 전에 필요한 라이브러리가 모두 설치되어 있는지 확인하세요. LayoutLMv2는 detectron2, torchvision 및 테서랙트를 필요로 합니다. + +```bash +pip install -q transformers datasets +``` + +```bash +pip install 'git+https://github.com/facebookresearch/detectron2.git' +pip install torchvision +``` + +```bash +sudo apt install tesseract-ocr +pip install -q pytesseract +``` + +필요한 라이브러리들을 모두 설치한 후 런타임을 다시 시작합니다. + +커뮤니티에 당신의 모델을 공유하는 것을 권장합니다. Hugging Face 계정에 로그인해서 모델을 🤗 Hub에 업로드하세요. +프롬프트가 실행되면, 로그인을 위해 토큰을 입력하세요: + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +몇 가지 전역 변수를 정의해 보겠습니다. + +```py +>>> model_checkpoint = "microsoft/layoutlmv2-base-uncased" +>>> batch_size = 4 +``` + +## 데이터 불러오기 [[load-the-data]] + +이 가이드에서는 🤗 Hub에서 찾을 수 있는 전처리된 DocVQA의 작은 샘플을 사용합니다. +DocVQA의 전체 데이터 세트를 사용하고 싶다면, [DocVQA homepage](https://rrc.cvc.uab.es/?ch=17)에 가입 후 다운로드 할 수 있습니다. 전체 데이터 세트를 다운로드 했다면, 이 가이드를 계속 진행하기 위해 [🤗 dataset에 파일을 가져오는 방법](https://huggingface.co/docs/datasets/loading#local-and-remote-files)을 확인하세요. + +```py +>>> from datasets import load_dataset + +>>> dataset = load_dataset("nielsr/docvqa_1200_examples") +>>> dataset +DatasetDict({ + train: Dataset({ + features: ['id', 'image', 'query', 'answers', 'words', 'bounding_boxes', 'answer'], + num_rows: 1000 + }) + test: Dataset({ + features: ['id', 'image', 'query', 'answers', 'words', 'bounding_boxes', 'answer'], + num_rows: 200 + }) +}) +``` + +보시다시피, 데이터 세트는 이미 훈련 세트와 테스트 세트로 나누어져 있습니다. 무작위로 예제를 살펴보면서 특성을 확인해보세요. + +```py +>>> dataset["train"].features +``` + +각 필드가 나타내는 내용은 다음과 같습니다: +* `id`: 예제의 id +* `image`: 문서 이미지를 포함하는 PIL.Image.Image 객체 +* `query`: 질문 문자열 - 여러 언어의 자연어로 된 질문 +* `answers`: 사람이 주석을 단 정답 리스트 +* `words` and `bounding_boxes`: OCR의 결과값들이며 이 가이드에서는 사용하지 않을 예정 +* `answer`: 다른 모델과 일치하는 답변이며 이 가이드에서는 사용하지 않을 예정 + +영어로 된 질문만 남기고 다른 모델에 대한 예측을 포함하는 `answer` 특성을 삭제하겠습니다. +그리고 주석 작성자가 제공한 데이터 세트에서 첫 번째 답변을 가져옵니다. 또는 무작위로 샘플을 추출할 수도 있습니다. 
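+
+이 가이드의 나머지 부분은 바로 아래 코드처럼 첫 번째 답변(`example["answers"][0]`)을 사용합니다. 무작위로 답변을 추출하고 싶다면, 아래 코드의 두 번째 `map` 호출을 예를 들어 다음과 같이 바꿀 수 있습니다. 이는 가정을 담은 간단한 대안 스케치입니다:
+
+```py
+>>> import random
+
+>>> # 첫 번째 답변 대신, 주석 작성자가 제공한 답변 목록에서 하나를 무작위로 선택합니다
+>>> updated_dataset = updated_dataset.map(
+...     lambda example: {"answer": random.choice(example["answers"])}, remove_columns=["answer", "answers"]
+... )
+```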
+ +```py +>>> updated_dataset = dataset.map(lambda example: {"question": example["query"]["en"]}, remove_columns=["query"]) +>>> updated_dataset = updated_dataset.map( +... lambda example: {"answer": example["answers"][0]}, remove_columns=["answer", "answers"] +... ) +``` + +이 가이드에서 사용하는 LayoutLMv2 체크포인트는 `max_position_embeddings = 512`로 훈련되었습니다(이 정보는 [체크포인트의 `config.json` 파일](https://huggingface.co/microsoft/layoutlmv2-base-uncased/blob/main/config.json#L18)에서 확인할 수 있습니다). +바로 예제를 잘라낼 수도 있지만, 긴 문서의 끝에 답변이 있어 잘리는 상황을 피하기 위해 여기서는 임베딩이 512보다 길어질 가능성이 있는 몇 가지 예제를 제거하겠습니다. +데이터 세트에 있는 대부분의 문서가 긴 경우 슬라이딩 윈도우 방법을 사용할 수 있습니다 - 자세한 내용을 확인하고 싶으면 이 [노트북](https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb)을 확인하세요. + +```py +>>> updated_dataset = updated_dataset.filter(lambda x: len(x["words"]) + len(x["question"].split()) < 512) +``` + +이 시점에서 이 데이터 세트의 OCR 특성도 제거해 보겠습니다. OCR 특성은 다른 모델을 미세 조정하기 위한 것으로, 이 가이드에서 사용하는 모델의 입력 요구 사항과 일치하지 않기 때문에 이 특성을 사용하기 위해서는 일부 처리가 필요합니다. +대신, 원본 데이터에 [`LayoutLMv2Processor`]를 사용하여 OCR 및 토큰화를 모두 수행할 수 있습니다. +이렇게 하면 모델이 요구하는 입력을 얻을 수 있습니다. +이미지를 수동으로 처리하려면, [`LayoutLMv2` model documentation](../model_doc/layoutlmv2)에서 모델이 요구하는 입력 포맷을 확인해보세요. + +```py +>>> updated_dataset = updated_dataset.remove_columns("words") +>>> updated_dataset = updated_dataset.remove_columns("bounding_boxes") +``` + +마지막으로, 데이터 탐색을 완료하기 위해 이미지 예시를 살펴봅시다. + +```py +>>> updated_dataset["train"][11]["image"] +``` + +
+ DocVQA Image Example +
+ +## 데이터 전처리 [[preprocess-the-data]] + + +문서 질의 응답 태스크는 멀티모달 태스크이며, 각 모달리티의 입력이 모델의 요구에 맞게 전처리 되었는지 확인해야 합니다. +이미지 데이터를 처리할 수 있는 이미지 프로세서와 텍스트 데이터를 인코딩할 수 있는 토크나이저를 결합한 [`LayoutLMv2Processor`]를 가져오는 것부터 시작해 보겠습니다. + +```py +>>> from transformers import AutoProcessor + +>>> processor = AutoProcessor.from_pretrained(model_checkpoint) +``` + +### 문서 이미지 전처리 [[preprocessing-document-images]] + +먼저, 프로세서의 `image_processor`를 사용해 모델에 대한 문서 이미지를 준비해 보겠습니다. +기본값으로, 이미지 프로세서는 이미지 크기를 224x224로 조정하고 색상 채널의 순서가 올바른지 확인한 후 단어와 정규화된 바운딩 박스를 얻기 위해 테서랙트를 사용해 OCR를 적용합니다. +이 튜토리얼에서 우리가 필요한 것과 기본값은 완전히 동일합니다. 이미지 배치에 기본 이미지 처리를 적용하고 OCR의 결과를 변환하는 함수를 작성합니다. + +```py +>>> image_processor = processor.image_processor + + +>>> def get_ocr_words_and_boxes(examples): +... images = [image.convert("RGB") for image in examples["image"]] +... encoded_inputs = image_processor(images) + +... examples["image"] = encoded_inputs.pixel_values +... examples["words"] = encoded_inputs.words +... examples["boxes"] = encoded_inputs.boxes + +... return examples +``` + +이 전처리를 데이터 세트 전체에 빠르게 적용하려면 [`~datasets.Dataset.map`]를 사용하세요. + +```py +>>> dataset_with_ocr = updated_dataset.map(get_ocr_words_and_boxes, batched=True, batch_size=2) +``` + +### 텍스트 데이터 전처리 [[preprocessing-text-data]] + +이미지에 OCR을 적용했으면 데이터 세트의 텍스트 부분을 모델에 맞게 인코딩해야 합니다. +이 인코딩에는 이전 단계에서 가져온 단어와 박스를 토큰 수준의 `input_ids`, `attention_mask`, `token_type_ids` 및 `bbox`로 변환하는 작업이 포함됩니다. +텍스트를 전처리하려면 프로세서의 `tokenizer`가 필요합니다. + +```py +>>> tokenizer = processor.tokenizer +``` + +위에서 언급한 전처리 외에도 모델을 위해 레이블을 추가해야 합니다. 🤗 Transformers의 `xxxForQuestionAnswering` 모델의 경우, 레이블은 `start_positions`와 `end_positions`로 구성되며 어떤 토큰이 답변의 시작과 끝에 있는지를 나타냅니다. + +레이블 추가를 위해서, 먼저 더 큰 리스트(단어 리스트)에서 하위 리스트(단어로 분할된 답변)을 찾을 수 있는 헬퍼 함수를 정의합니다. + +이 함수는 `words_list`와 `answer_list`, 이렇게 두 리스트를 입력으로 받습니다. +그런 다음 `words_list`를 반복하여 `words_list`의 현재 단어(words_list[i])가 `answer_list`의 첫 번째 단어(answer_list[0])와 같은지, +현재 단어에서 시작해 `answer_list`와 같은 길이만큼의 `words_list`의 하위 리스트가 `answer_list`와 일치하는지 확인합니다. +이 조건이 참이라면 일치하는 항목을 발견했음을 의미하며, 함수는 일치 항목, 시작 인덱스(idx) 및 종료 인덱스(idx + len(answer_list) - 1)를 기록합니다. 일치하는 항목이 두 개 이상 발견되면 함수는 첫 번째 항목만 반환합니다. 일치하는 항목이 없다면 함수는 (`None`, 0, 0)을 반환합니다. + +```py +>>> def subfinder(words_list, answer_list): +... matches = [] +... start_indices = [] +... end_indices = [] +... for idx, i in enumerate(range(len(words_list))): +... if words_list[i] == answer_list[0] and words_list[i : i + len(answer_list)] == answer_list: +... matches.append(answer_list) +... start_indices.append(idx) +... end_indices.append(idx + len(answer_list) - 1) +... if matches: +... return matches[0], start_indices[0], end_indices[0] +... else: +... return None, 0, 0 +``` + +이 함수가 어떻게 정답의 위치를 찾는지 설명하기 위해 다음 예제에서 함수를 사용해 보겠습니다: + +```py +>>> example = dataset_with_ocr["train"][1] +>>> words = [word.lower() for word in example["words"]] +>>> match, word_idx_start, word_idx_end = subfinder(words, example["answer"].lower().split()) +>>> print("Question: ", example["question"]) +>>> print("Words:", words) +>>> print("Answer: ", example["answer"]) +>>> print("start_index", word_idx_start) +>>> print("end_index", word_idx_end) +Question: Who is in cc in this letter? 
+Words: ['wie', 'baw', 'brown', '&', 'williamson', 'tobacco', 'corporation', 'research', '&', 'development', 'internal', 'correspondence', 'to:', 'r.', 'h.', 'honeycutt', 'ce:', 't.f.', 'riehl', 'from:', '.', 'c.j.', 'cook', 'date:', 'may', '8,', '1995', 'subject:', 'review', 'of', 'existing', 'brainstorming', 'ideas/483', 'the', 'major', 'function', 'of', 'the', 'product', 'innovation', 'graup', 'is', 'to', 'develop', 'marketable', 'nove!', 'products', 'that', 'would', 'be', 'profitable', 'to', 'manufacture', 'and', 'sell.', 'novel', 'is', 'defined', 'as:', 'of', 'a', 'new', 'kind,', 'or', 'different', 'from', 'anything', 'seen', 'or', 'known', 'before.', 'innovation', 'is', 'defined', 'as:', 'something', 'new', 'or', 'different', 'introduced;', 'act', 'of', 'innovating;', 'introduction', 'of', 'new', 'things', 'or', 'methods.', 'the', 'products', 'may', 'incorporate', 'the', 'latest', 'technologies,', 'materials', 'and', 'know-how', 'available', 'to', 'give', 'then', 'a', 'unique', 'taste', 'or', 'look.', 'the', 'first', 'task', 'of', 'the', 'product', 'innovation', 'group', 'was', 'to', 'assemble,', 'review', 'and', 'categorize', 'a', 'list', 'of', 'existing', 'brainstorming', 'ideas.', 'ideas', 'were', 'grouped', 'into', 'two', 'major', 'categories', 'labeled', 'appearance', 'and', 'taste/aroma.', 'these', 'categories', 'are', 'used', 'for', 'novel', 'products', 'that', 'may', 'differ', 'from', 'a', 'visual', 'and/or', 'taste/aroma', 'point', 'of', 'view', 'compared', 'to', 'canventional', 'cigarettes.', 'other', 'categories', 'include', 'a', 'combination', 'of', 'the', 'above,', 'filters,', 'packaging', 'and', 'brand', 'extensions.', 'appearance', 'this', 'category', 'is', 'used', 'for', 'novel', 'cigarette', 'constructions', 'that', 'yield', 'visually', 'different', 'products', 'with', 'minimal', 'changes', 'in', 'smoke', 'chemistry', 'two', 'cigarettes', 'in', 'cne.', 'emulti-plug', 'te', 'build', 'yaur', 'awn', 'cigarette.', 'eswitchable', 'menthol', 'or', 'non', 'menthol', 'cigarette.', '*cigarettes', 'with', 'interspaced', 'perforations', 'to', 'enable', 'smoker', 'to', 'separate', 'unburned', 'section', 'for', 'future', 'smoking.', '«short', 'cigarette,', 'tobacco', 'section', '30', 'mm.', '«extremely', 'fast', 'buming', 'cigarette.', '«novel', 'cigarette', 'constructions', 'that', 'permit', 'a', 'significant', 'reduction', 'iretobacco', 'weight', 'while', 'maintaining', 'smoking', 'mechanics', 'and', 'visual', 'characteristics.', 'higher', 'basis', 'weight', 'paper:', 'potential', 'reduction', 'in', 'tobacco', 'weight.', '«more', 'rigid', 'tobacco', 'column;', 'stiffing', 'agent', 'for', 'tobacco;', 'e.g.', 'starch', '*colored', 'tow', 'and', 'cigarette', 'papers;', 'seasonal', 'promotions,', 'e.g.', 'pastel', 'colored', 'cigarettes', 'for', 'easter', 'or', 'in', 'an', 'ebony', 'and', 'ivory', 'brand', 'containing', 'a', 'mixture', 'of', 'all', 'black', '(black', 'paper', 'and', 'tow)', 'and', 'ail', 'white', 'cigarettes.', '499150498'] +Answer: T.F. Riehl +start_index 17 +end_index 18 +``` + +한편, 위 예제가 인코딩되면 다음과 같이 표시됩니다: + +```py +>>> encoding = tokenizer(example["question"], example["words"], example["boxes"]) +>>> tokenizer.decode(encoding["input_ids"]) +[CLS] who is in cc in this letter? [SEP] wie baw brown & williamson tobacco corporation research & development ... +``` + +이제 인코딩된 입력에서 정답의 위치를 찾아야 합니다. +* `token_type_ids`는 어떤 토큰이 질문에 속하는지, 그리고 어떤 토큰이 문서의 단어에 포함되는지를 알려줍니다. +* `tokenizer.cls_token_id` 입력의 시작 부분에 있는 특수 토큰을 찾는 데 도움을 줍니다. 
+* `word_ids`는 원본 `words`에서 찾은 답변을 전체 인코딩된 입력의 동일한 답과 일치시키고 인코딩된 입력에서 답변의 시작/끝 위치를 결정합니다. + +위 내용들을 염두에 두고 데이터 세트 예제의 배치를 인코딩하는 함수를 만들어 보겠습니다: + +```py +>>> def encode_dataset(examples, max_length=512): +... questions = examples["question"] +... words = examples["words"] +... boxes = examples["boxes"] +... answers = examples["answer"] + +... # 예제 배치를 인코딩하고 start_positions와 end_positions를 초기화합니다 +... encoding = tokenizer(questions, words, boxes, max_length=max_length, padding="max_length", truncation=True) +... start_positions = [] +... end_positions = [] + +... # 배치의 예제를 반복합니다 +... for i in range(len(questions)): +... cls_index = encoding["input_ids"][i].index(tokenizer.cls_token_id) + +... # 예제의 words에서 답변의 위치를 찾습니다 +... words_example = [word.lower() for word in words[i]] +... answer = answers[i] +... match, word_idx_start, word_idx_end = subfinder(words_example, answer.lower().split()) + +... if match: +... # 일치하는 항목을 발견하면, `token_type_ids`를 사용해 인코딩에서 단어가 시작하는 위치를 찾습니다 +... token_type_ids = encoding["token_type_ids"][i] +... token_start_index = 0 +... while token_type_ids[token_start_index] != 1: +... token_start_index += 1 + +... token_end_index = len(encoding["input_ids"][i]) - 1 +... while token_type_ids[token_end_index] != 1: +... token_end_index -= 1 + +... word_ids = encoding.word_ids(i)[token_start_index : token_end_index + 1] +... start_position = cls_index +... end_position = cls_index + +... # words의 답변 위치와 일치할 때까지 word_ids를 반복하고 `token_start_index`를 늘립니다 +... # 일치하면 `token_start_index`를 인코딩에서 답변의 `start_position`으로 저장합니다 +... for id in word_ids: +... if id == word_idx_start: +... start_position = token_start_index +... else: +... token_start_index += 1 + +... # 비슷하게, 끝에서 시작해 `word_ids`를 반복하며 답변의 `end_position`을 찾습니다 +... for id in word_ids[::-1]: +... if id == word_idx_end: +... end_position = token_end_index +... else: +... token_end_index -= 1 + +... start_positions.append(start_position) +... end_positions.append(end_position) + +... else: +... start_positions.append(cls_index) +... end_positions.append(cls_index) + +... encoding["image"] = examples["image"] +... encoding["start_positions"] = start_positions +... encoding["end_positions"] = end_positions + +... return encoding +``` + +이제 이 전처리 함수가 있으니 전체 데이터 세트를 인코딩할 수 있습니다: + +```py +>>> encoded_train_dataset = dataset_with_ocr["train"].map( +... encode_dataset, batched=True, batch_size=2, remove_columns=dataset_with_ocr["train"].column_names +... ) +>>> encoded_test_dataset = dataset_with_ocr["test"].map( +... encode_dataset, batched=True, batch_size=2, remove_columns=dataset_with_ocr["test"].column_names +... ) +``` + +인코딩된 데이터 세트의 특성이 어떻게 생겼는지 확인해 보겠습니다: + +```py +>>> encoded_train_dataset.features +{'image': Sequence(feature=Sequence(feature=Sequence(feature=Value(dtype='uint8', id=None), length=-1, id=None), length=-1, id=None), length=-1, id=None), + 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), + 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), + 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), + 'bbox': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None), + 'start_positions': Value(dtype='int64', id=None), + 'end_positions': Value(dtype='int64', id=None)} +``` + +## 평가 [[evaluation]] + +문서 질의 응답을 평가하려면 상당한 양의 후처리가 필요합니다. 시간이 너무 많이 걸리지 않도록 이 가이드에서는 평가 단계를 생략합니다. +[`Trainer`]가 훈련 과정에서 평가 손실(evaluation loss)을 계속 계산하기 때문에 모델의 성능을 대략적으로 알 수 있습니다. 
+추출적(Extractive) 질의 응답은 보통 F1/exact match 방법을 사용해 평가됩니다. +직접 구현해보고 싶으시다면, Hugging Face course의 [Question Answering chapter](https://huggingface.co/course/chapter7/7?fw=pt#postprocessing)을 참고하세요. + +## 훈련 [[train]] + +축하합니다! 이 가이드의 가장 어려운 부분을 성공적으로 처리했으니 이제 나만의 모델을 훈련할 준비가 되었습니다. +훈련은 다음과 같은 단계로 이루어져 있습니다: +* 전처리에서의 동일한 체크포인트를 사용하기 위해 [`AutoModelForDocumentQuestionAnswering`]으로 모델을 가져옵니다. +* [`TrainingArguments`]로 훈련 하이퍼파라미터를 정합니다. +* 예제를 배치 처리하는 함수를 정의합니다. 여기서는 [`DefaultDataCollator`]가 적당합니다. +* 모델, 데이터 세트, 데이터 콜레이터(Data collator)와 함께 [`Trainer`]에 훈련 인수들을 전달합니다. +* [`~Trainer.train`]을 호출해서 모델을 미세 조정합니다. + +```py +>>> from transformers import AutoModelForDocumentQuestionAnswering + +>>> model = AutoModelForDocumentQuestionAnswering.from_pretrained(model_checkpoint) +``` + +[`TrainingArguments`]에서 `output_dir`을 사용하여 모델을 저장할 위치를 지정하고, 적절한 하이퍼파라미터를 설정합니다. +모델을 커뮤니티와 공유하려면 `push_to_hub`를 `True`로 설정하세요 (모델을 업로드하려면 Hugging Face에 로그인해야 합니다). +이 경우 `output_dir`은 모델의 체크포인트를 푸시할 레포지토리의 이름이 됩니다. + +```py +>>> from transformers import TrainingArguments + +>>> # 본인의 레포지토리 ID로 바꾸세요 +>>> repo_id = "MariaK/layoutlmv2-base-uncased_finetuned_docvqa" + +>>> training_args = TrainingArguments( +... output_dir=repo_id, +... per_device_train_batch_size=4, +... num_train_epochs=20, +... save_steps=200, +... logging_steps=50, +... evaluation_strategy="steps", +... learning_rate=5e-5, +... save_total_limit=2, +... remove_unused_columns=False, +... push_to_hub=True, +... ) +``` + +간단한 데이터 콜레이터를 정의하여 예제를 함께 배치합니다. + +```py +>>> from transformers import DefaultDataCollator + +>>> data_collator = DefaultDataCollator() +``` + +마지막으로, 모든 것을 한 곳에 모아 [`~Trainer.train`]을 호출합니다: + +```py +>>> from transformers import Trainer + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... data_collator=data_collator, +... train_dataset=encoded_train_dataset, +... eval_dataset=encoded_test_dataset, +... tokenizer=processor, +... ) + +>>> trainer.train() +``` + +최종 모델을 🤗 Hub에 추가하려면, 모델 카드를 생성하고 `push_to_hub`를 호출합니다: + +```py +>>> trainer.create_model_card() +>>> trainer.push_to_hub() +``` + +## 추론 [[inference]] + +이제 LayoutLMv2 모델을 미세 조정하고 🤗 Hub에 업로드했으니 추론에도 사용할 수 있습니다. +추론을 위해 미세 조정된 모델을 사용해 보는 가장 간단한 방법은 [`Pipeline`]을 사용하는 것 입니다. + +예를 들어 보겠습니다: +```py +>>> example = dataset["test"][2] +>>> question = example["query"]["en"] +>>> image = example["image"] +>>> print(question) +>>> print(example["answers"]) +'Who is ‘presiding’ TRRF GENERAL SESSION (PART 1)?' +['TRRF Vice President', 'lee a. waller'] +``` + +그 다음, 모델로 문서 질의 응답을 하기 위해 파이프라인을 인스턴스화하고 이미지 + 질문 조합을 전달합니다. + +```py +>>> from transformers import pipeline + +>>> qa_pipeline = pipeline("document-question-answering", model="MariaK/layoutlmv2-base-uncased_finetuned_docvqa") +>>> qa_pipeline(image, question) +[{'score': 0.9949808120727539, + 'answer': 'Lee A. Waller', + 'start': 55, + 'end': 57}] +``` + +원한다면 파이프라인의 결과를 수동으로 복제할 수도 있습니다: +1. 이미지와 질문을 가져와 모델의 프로세서를 사용해 모델에 맞게 준비합니다. +2. 모델을 통해 결과 또는 전처리를 전달합니다. +3. 모델은 어떤 토큰이 답변의 시작에 있는지, 어떤 토큰이 답변이 끝에 있는지를 나타내는 `start_logits`와 `end_logits`를 반환합니다. 둘 다 (batch_size, sequence_length) 형태를 갖습니다. +4. `start_logits`와 `end_logits`의 마지막 차원을 최대로 만드는 값을 찾아 예상 `start_idx`와 `end_idx`를 얻습니다. +5. 토크나이저로 답변을 디코딩합니다. 
+ +```py +>>> import torch +>>> from transformers import AutoProcessor +>>> from transformers import AutoModelForDocumentQuestionAnswering + +>>> processor = AutoProcessor.from_pretrained("MariaK/layoutlmv2-base-uncased_finetuned_docvqa") +>>> model = AutoModelForDocumentQuestionAnswering.from_pretrained("MariaK/layoutlmv2-base-uncased_finetuned_docvqa") + +>>> with torch.no_grad(): +... encoding = processor(image.convert("RGB"), question, return_tensors="pt") +... outputs = model(**encoding) +... start_logits = outputs.start_logits +... end_logits = outputs.end_logits +... predicted_start_idx = start_logits.argmax(-1).item() +... predicted_end_idx = end_logits.argmax(-1).item() + +>>> processor.tokenizer.decode(encoding.input_ids.squeeze()[predicted_start_idx : predicted_end_idx + 1]) +'lee a. waller' +``` \ No newline at end of file diff --git a/docs/source/ko/tasks/image_captioning.md b/docs/source/ko/tasks/image_captioning.md new file mode 100644 index 00000000000000..c5139649a9185b --- /dev/null +++ b/docs/source/ko/tasks/image_captioning.md @@ -0,0 +1,281 @@ + + + +# 이미지 캡셔닝[[image-captioning]] + +[[open-in-colab]] + +이미지 캡셔닝(Image captioning)은 주어진 이미지에 대한 캡션을 예측하는 작업입니다. +이미지 캡셔닝은 시각 장애인이 다양한 상황을 탐색하는 데 도움을 줄 수 있도록 시각 장애인을 보조하는 등 실생활에서 흔히 활용됩니다. +따라서 이미지 캡셔닝은 이미지를 설명함으로써 사람들의 콘텐츠 접근성을 개선하는 데 도움이 됩니다. + +이 가이드에서는 소개할 내용은 아래와 같습니다: + +* 이미지 캡셔닝 모델을 파인튜닝합니다. +* 파인튜닝된 모델을 추론에 사용합니다. + +시작하기 전에 필요한 모든 라이브러리가 설치되어 있는지 확인하세요: + +```bash +pip install transformers datasets evaluate -q +pip install jiwer -q +``` + +Hugging Face 계정에 로그인하면 모델을 업로드하고 커뮤니티에 공유할 수 있습니다. +토큰을 입력하여 로그인하세요. + + +```python +from huggingface_hub import notebook_login + +notebook_login() +``` + +## 포켓몬 BLIP 캡션 데이터세트 가져오기[[load-the-pokmon-blip-captions-dataset]] + +{이미지-캡션} 쌍으로 구성된 데이터세트를 가져오려면 🤗 Dataset 라이브러리를 사용합니다. +PyTorch에서 자신만의 이미지 캡션 데이터세트를 만들려면 [이 노트북](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/GIT/Fine_tune_GIT_on_an_image_captioning_dataset.ipynb)을 참조하세요. + + +```python +from datasets import load_dataset + +ds = load_dataset("lambdalabs/pokemon-blip-captions") +ds +``` +```bash +DatasetDict({ + train: Dataset({ + features: ['image', 'text'], + num_rows: 833 + }) +}) +``` + +이 데이터세트는 `image`와 `text`라는 두 특성을 가지고 있습니다. + + + +많은 이미지 캡션 데이터세트에는 이미지당 여러 개의 캡션이 포함되어 있습니다. +이러한 경우, 일반적으로 학습 중에 사용 가능한 캡션 중에서 무작위로 샘플을 추출합니다. + + + +[`~datasets.Dataset.train_test_split`] 메소드를 사용하여 데이터세트의 학습 분할을 학습 및 테스트 세트로 나눕니다: + + +```python +ds = ds["train"].train_test_split(test_size=0.1) +train_ds = ds["train"] +test_ds = ds["test"] +``` + +학습 세트의 샘플 몇 개를 시각화해 봅시다. +Let's visualize a couple of samples from the training set. + + +```python +from textwrap import wrap +import matplotlib.pyplot as plt +import numpy as np + + +def plot_images(images, captions): + plt.figure(figsize=(20, 20)) + for i in range(len(images)): + ax = plt.subplot(1, len(images), i + 1) + caption = captions[i] + caption = "\n".join(wrap(caption, 12)) + plt.title(caption) + plt.imshow(images[i]) + plt.axis("off") + + +sample_images_to_visualize = [np.array(train_ds[i]["image"]) for i in range(5)] +sample_captions = [train_ds[i]["text"] for i in range(5)] +plot_images(sample_images_to_visualize, sample_captions) +``` + +
+ Sample training images +
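+
+참고로 앞선 팁에서 언급했듯이 이미지당 캡션이 여러 개인 데이터세트를 다룰 때는, 배치 변환 단계에서 캡션 하나를 무작위로 골라 사용하는 방식이 일반적입니다. 이 가이드의 포켓몬 데이터세트에는 필요하지 않지만, `text` 필드가 캡션 리스트라고 가정한 간단한 스케치는 다음과 같습니다:
+
+```python
+import random
+
+
+def sample_one_caption(example_batch):
+    # 이미지당 캡션이 여러 개(리스트)라고 가정하고, 배치마다 캡션 하나를 무작위로 고릅니다
+    example_batch["text"] = [
+        random.choice(captions) if isinstance(captions, (list, tuple)) else captions
+        for captions in example_batch["text"]
+    ]
+    return example_batch
+```
+
+이런 함수를 뒤에 나올 전처리(`transforms`) 앞에 적용하면 학습할 때마다 이미지별로 다른 캡션이 사용됩니다.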
+ +## 데이터세트 전처리[[preprocess-the-dataset]] + +데이터세트에는 이미지와 텍스트라는 두 가지 양식이 있기 때문에, 전처리 파이프라인에서 이미지와 캡션을 모두 전처리합니다. + +전처리 작업을 위해, 파인튜닝하려는 모델에 연결된 프로세서 클래스를 가져옵니다. + +```python +from transformers import AutoProcessor + +checkpoint = "microsoft/git-base" +processor = AutoProcessor.from_pretrained(checkpoint) +``` + +프로세서는 내부적으로 크기 조정 및 픽셀 크기 조정을 포함한 이미지 전처리를 수행하고 캡션을 토큰화합니다. + +```python +def transforms(example_batch): + images = [x for x in example_batch["image"]] + captions = [x for x in example_batch["text"]] + inputs = processor(images=images, text=captions, padding="max_length") + inputs.update({"labels": inputs["input_ids"]}) + return inputs + + +train_ds.set_transform(transforms) +test_ds.set_transform(transforms) +``` + +데이터세트가 준비되었으니 이제 파인튜닝을 위해 모델을 설정할 수 있습니다. + +## 기본 모델 가져오기[[load-a-base-model]] + +["microsoft/git-base"](https://huggingface.co/microsoft/git-base)를 [`AutoModelForCausalLM`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForCausalLM) 객체로 가져옵니다. + + +```python +from transformers import AutoModelForCausalLM + +model = AutoModelForCausalLM.from_pretrained(checkpoint) +``` + +## 평가[[evaluate]] + +이미지 캡션 모델은 일반적으로 [Rouge 점수](https://huggingface.co/spaces/evaluate-metric/rouge) 또는 [단어 오류율(Word Error Rate)](https://huggingface.co/spaces/evaluate-metric/wer)로 평가합니다. +이 가이드에서는 단어 오류율(WER)을 사용합니다. + +이를 위해 🤗 Evaluate 라이브러리를 사용합니다. +WER의 잠재적 제한 사항 및 기타 문제점은 [이 가이드](https://huggingface.co/spaces/evaluate-metric/wer)를 참조하세요. + + +```python +from evaluate import load +import torch + +wer = load("wer") + + +def compute_metrics(eval_pred): + logits, labels = eval_pred + predicted = logits.argmax(-1) + decoded_labels = processor.batch_decode(labels, skip_special_tokens=True) + decoded_predictions = processor.batch_decode(predicted, skip_special_tokens=True) + wer_score = wer.compute(predictions=decoded_predictions, references=decoded_labels) + return {"wer_score": wer_score} +``` + +## 학습![[train!]] + +이제 모델 파인튜닝을 시작할 준비가 되었습니다. 이를 위해 🤗 [`Trainer`]를 사용합니다. + +먼저, [`TrainingArguments`]를 사용하여 학습 인수를 정의합니다. + + +```python +from transformers import TrainingArguments, Trainer + +model_name = checkpoint.split("/")[1] + +training_args = TrainingArguments( + output_dir=f"{model_name}-pokemon", + learning_rate=5e-5, + num_train_epochs=50, + fp16=True, + per_device_train_batch_size=32, + per_device_eval_batch_size=32, + gradient_accumulation_steps=2, + save_total_limit=3, + evaluation_strategy="steps", + eval_steps=50, + save_strategy="steps", + save_steps=50, + logging_steps=50, + remove_unused_columns=False, + push_to_hub=True, + label_names=["labels"], + load_best_model_at_end=True, +) +``` + +학습 인수를 데이터세트, 모델과 함께 🤗 Trainer에 전달합니다. + +```python +trainer = Trainer( + model=model, + args=training_args, + train_dataset=train_ds, + eval_dataset=test_ds, + compute_metrics=compute_metrics, +) +``` + +학습을 시작하려면 [`Trainer`] 객체에서 [`~Trainer.train`]을 호출하기만 하면 됩니다. + +```python +trainer.train() +``` + +학습이 진행되면서 학습 손실이 원활하게 감소하는 것을 볼 수 있습니다. + +학습이 완료되면 모든 사람이 모델을 사용할 수 있도록 [`~Trainer.push_to_hub`] 메소드를 사용하여 모델을 허브에 공유하세요: + + +```python +trainer.push_to_hub() +``` + +## 추론[[inference]] + +`test_ds`에서 샘플 이미지를 가져와 모델을 테스트합니다. + + +```python +from PIL import Image +import requests + +url = "https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/pokemon.png" +image = Image.open(requests.get(url, stream=True).raw) +image +``` + +
+ Test image +
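+
+학습을 진행한 세션을 그대로 이어서 쓰고 있다면 앞서 만든 `model`과 `processor`를 바로 사용하면 됩니다. 새 세션에서 추론한다면, 아래처럼 허브에 푸시한 체크포인트에서 모델을 다시 불러올 수 있습니다(리포지토리 이름은 예시이므로 본인의 허브 ID로 바꾸세요):
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoProcessor
+
+# 리포지토리 이름은 예시입니다. push_to_hub로 올린 본인의 체크포인트로 바꾸세요.
+model = AutoModelForCausalLM.from_pretrained("your-username/git-base-pokemon")
+model = model.to("cuda" if torch.cuda.is_available() else "cpu")
+
+# 전처리에 사용한 것과 동일한 프로세서를 불러옵니다
+processor = AutoProcessor.from_pretrained("microsoft/git-base")
+```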
+ +모델에 사용할 이미지를 준비합니다. + +```python +device = "cuda" if torch.cuda.is_available() else "cpu" + +inputs = processor(images=image, return_tensors="pt").to(device) +pixel_values = inputs.pixel_values +``` + +[`generate`]를 호출하고 예측을 디코딩합니다. + +```python +generated_ids = model.generate(pixel_values=pixel_values, max_length=50) +generated_caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0] +print(generated_caption) +``` +```bash +a drawing of a pink and blue pokemon +``` + +파인튜닝된 모델이 꽤 괜찮은 캡션을 생성한 것 같습니다! diff --git a/docs/source/ko/tasks/image_classification.md b/docs/source/ko/tasks/image_classification.md new file mode 100644 index 00000000000000..031e01ea5c5a83 --- /dev/null +++ b/docs/source/ko/tasks/image_classification.md @@ -0,0 +1,546 @@ + + +# 이미지 분류[[image-classification]] + +[[open-in-colab]] + + + +이미지 분류는 이미지에 레이블 또는 클래스를 할당합니다. 텍스트 또는 오디오 분류와 달리 입력은 +이미지를 구성하는 픽셀 값입니다. 이미지 분류에는 자연재해 후 피해 감지, 농작물 건강 모니터링, 의료 이미지에서 질병의 징후 검사 지원 등 +다양한 응용 사례가 있습니다. + +이 가이드에서는 다음을 설명합니다: + +1. [Food-101](https://huggingface.co/datasets/food101) 데이터 세트에서 [ViT](model_doc/vit)를 미세 조정하여 이미지에서 식품 항목을 분류합니다. +2. 추론을 위해 미세 조정 모델을 사용합니다. + + +이 튜토리얼에서 설명하는 작업은 다음 모델 아키텍처에 의해 지원됩니다: + + + +[BEiT](../model_doc/beit), [BiT](../model_doc/bit), [ConvNeXT](../model_doc/convnext), [ConvNeXTV2](../model_doc/convnextv2), [CvT](../model_doc/cvt), [Data2VecVision](../model_doc/data2vec-vision), [DeiT](../model_doc/deit), [DiNAT](../model_doc/dinat), [EfficientFormer](../model_doc/efficientformer), [EfficientNet](../model_doc/efficientnet), [FocalNet](../model_doc/focalnet), [ImageGPT](../model_doc/imagegpt), [LeViT](../model_doc/levit), [MobileNetV1](../model_doc/mobilenet_v1), [MobileNetV2](../model_doc/mobilenet_v2), [MobileViT](../model_doc/mobilevit), [NAT](../model_doc/nat), [Perceiver](../model_doc/perceiver), [PoolFormer](../model_doc/poolformer), [RegNet](../model_doc/regnet), [ResNet](../model_doc/resnet), [SegFormer](../model_doc/segformer), [Swin Transformer](../model_doc/swin), [Swin Transformer V2](../model_doc/swinv2), [VAN](../model_doc/van), [ViT](../model_doc/vit), [ViT Hybrid](../model_doc/vit_hybrid), [ViTMSN](../model_doc/vit_msn) + + + + +시작하기 전에, 필요한 모든 라이브러리가 설치되어 있는지 확인하세요: + +```bash +pip install transformers datasets evaluate +``` + +Hugging Face 계정에 로그인하여 모델을 업로드하고 커뮤니티에 공유하는 것을 권장합니다. 메시지가 표시되면, 토큰을 입력하여 로그인하세요: + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## Food-101 데이터 세트 가져오기[[load-food101-dataset]] + +🤗 Datasets 라이브러리에서 Food-101 데이터 세트의 더 작은 부분 집합을 가져오는 것으로 시작합니다. 이렇게 하면 전체 데이터 세트에 대한 +훈련에 많은 시간을 할애하기 전에 실험을 통해 모든 것이 제대로 작동하는지 확인할 수 있습니다. + +```py +>>> from datasets import load_dataset + +>>> food = load_dataset("food101", split="train[:5000]") +``` + +데이터 세트의 `train`을 [`~datasets.Dataset.train_test_split`] 메소드를 사용하여 훈련 및 테스트 세트로 분할하세요: + +```py +>>> food = food.train_test_split(test_size=0.2) +``` + +그리고 예시를 살펴보세요: + +```py +>>> food["train"][0] +{'image': , + 'label': 79} +``` + +데이터 세트의 각 예제에는 두 개의 필드가 있습니다: + +- `image`: 식품 항목의 PIL 이미지 +- `label`: 식품 항목의 레이블 클래스 + +모델이 레이블 ID에서 레이블 이름을 쉽게 가져올 수 있도록 +레이블 이름을 정수로 매핑하고, 정수를 레이블 이름으로 매핑하는 사전을 만드세요: + +```py +>>> labels = food["train"].features["label"].names +>>> label2id, id2label = dict(), dict() +>>> for i, label in enumerate(labels): +... label2id[label] = str(i) +... 
id2label[str(i)] = label +``` + +이제 레이블 ID를 레이블 이름으로 변환할 수 있습니다: + +```py +>>> id2label[str(79)] +'prime_rib' +``` + +## 전처리[[preprocess]] + +다음 단계는 이미지를 텐서로 처리하기 위해 ViT 이미지 프로세서를 가져오는 것입니다: + +```py +>>> from transformers import AutoImageProcessor + +>>> checkpoint = "google/vit-base-patch16-224-in21k" +>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint) +``` + + + +이미지에 몇 가지 이미지 변환을 적용하여 과적합에 대해 모델을 더 견고하게 만듭니다. 여기서 Torchvision의 [`transforms`](https://pytorch.org/vision/stable/transforms.html) 모듈을 사용하지만, 원하는 이미지 라이브러리를 사용할 수도 있습니다. + +이미지의 임의 부분을 크롭하고 크기를 조정한 다음, 이미지 평균과 표준 편차로 정규화하세요: + +```py +>>> from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor + +>>> normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std) +>>> size = ( +... image_processor.size["shortest_edge"] +... if "shortest_edge" in image_processor.size +... else (image_processor.size["height"], image_processor.size["width"]) +... ) +>>> _transforms = Compose([RandomResizedCrop(size), ToTensor(), normalize]) +``` + +그런 다음 전처리 함수를 만들어 변환을 적용하고 이미지의 `pixel_values`(모델에 대한 입력)를 반환하세요: + +```py +>>> def transforms(examples): +... examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]] +... del examples["image"] +... return examples +``` + +전체 데이터 세트에 전처리 기능을 적용하려면 🤗 Datasets [`~datasets.Dataset.with_transform`]을 사용합니다. 데이터 세트의 요소를 가져올 때 변환이 즉시 적용됩니다: + +```py +>>> food = food.with_transform(transforms) +``` + +이제 [`DefaultDataCollator`]를 사용하여 예제 배치를 만듭니다. 🤗 Transformers의 다른 데이터 콜레이터와 달리, `DefaultDataCollator`는 패딩과 같은 추가적인 전처리를 적용하지 않습니다. + +```py +>>> from transformers import DefaultDataCollator + +>>> data_collator = DefaultDataCollator() +``` + + + + + + + +과적합을 방지하고 모델을 보다 견고하게 만들기 위해 데이터 세트의 훈련 부분에 데이터 증강을 추가합니다. +여기서 Keras 전처리 레이어로 훈련 데이터에 대한 변환(데이터 증강 포함)과 +검증 데이터에 대한 변환(중앙 크로핑, 크기 조정, 정규화만)을 정의합니다. +`tf.image` 또는 다른 원하는 라이브러리를 사용할 수 있습니다. + +```py +>>> from tensorflow import keras +>>> from tensorflow.keras import layers + +>>> size = (image_processor.size["height"], image_processor.size["width"]) + +>>> train_data_augmentation = keras.Sequential( +... [ +... layers.RandomCrop(size[0], size[1]), +... layers.Rescaling(scale=1.0 / 127.5, offset=-1), +... layers.RandomFlip("horizontal"), +... layers.RandomRotation(factor=0.02), +... layers.RandomZoom(height_factor=0.2, width_factor=0.2), +... ], +... name="train_data_augmentation", +... ) + +>>> val_data_augmentation = keras.Sequential( +... [ +... layers.CenterCrop(size[0], size[1]), +... layers.Rescaling(scale=1.0 / 127.5, offset=-1), +... ], +... name="val_data_augmentation", +... ) +``` + +다음으로 한 번에 하나의 이미지가 아니라 이미지 배치에 적절한 변환을 적용하는 함수를 만듭니다. + +```py +>>> import numpy as np +>>> import tensorflow as tf +>>> from PIL import Image + + +>>> def convert_to_tf_tensor(image: Image): +... np_image = np.array(image) +... tf_image = tf.convert_to_tensor(np_image) +... # `expand_dims()` is used to add a batch dimension since +... # the TF augmentation layers operates on batched inputs. +... return tf.expand_dims(tf_image, 0) + + +>>> def preprocess_train(example_batch): +... """Apply train_transforms across a batch.""" +... images = [ +... train_data_augmentation(convert_to_tf_tensor(image.convert("RGB"))) for image in example_batch["image"] +... ] +... example_batch["pixel_values"] = [tf.transpose(tf.squeeze(image)) for image in images] +... return example_batch + + +... def preprocess_val(example_batch): +... """Apply val_transforms across a batch.""" +... 
images = [ +... val_data_augmentation(convert_to_tf_tensor(image.convert("RGB"))) for image in example_batch["image"] +... ] +... example_batch["pixel_values"] = [tf.transpose(tf.squeeze(image)) for image in images] +... return example_batch +``` + +🤗 Datasets [`~datasets.Dataset.set_transform`]를 사용하여 즉시 변환을 적용하세요: + +```py +food["train"].set_transform(preprocess_train) +food["test"].set_transform(preprocess_val) +``` + +최종 전처리 단계로 `DefaultDataCollator`를 사용하여 예제 배치를 만듭니다. 🤗 Transformers의 다른 데이터 콜레이터와 달리 +`DefaultDataCollator`는 패딩과 같은 추가 전처리를 적용하지 않습니다. + +```py +>>> from transformers import DefaultDataCollator + +>>> data_collator = DefaultDataCollator(return_tensors="tf") +``` + + + +## 평가[[evaluate]] + +훈련 중에 평가 지표를 포함하면 모델의 성능을 평가하는 데 도움이 되는 경우가 많습니다. +🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) 라이브러리로 평가 방법을 빠르게 가져올 수 있습니다. 이 작업에서는 +[accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) 평가 지표를 가져옵니다. (🤗 Evaluate [빠른 둘러보기](https://huggingface.co/docs/evaluate/a_quick_tour)를 참조하여 평가 지표를 가져오고 계산하는 방법에 대해 자세히 알아보세요): + +```py +>>> import evaluate + +>>> accuracy = evaluate.load("accuracy") +``` + +그런 다음 예측과 레이블을 [`~evaluate.EvaluationModule.compute`]에 전달하여 정확도를 계산하는 함수를 만듭니다: + +```py +>>> import numpy as np + + +>>> def compute_metrics(eval_pred): +... predictions, labels = eval_pred +... predictions = np.argmax(predictions, axis=1) +... return accuracy.compute(predictions=predictions, references=labels) +``` + +이제 `compute_metrics` 함수를 사용할 준비가 되었으며, 훈련을 설정하면 이 함수로 되돌아올 것입니다. + +## 훈련[[train]] + + + + + +[`Trainer`]를 사용하여 모델을 미세 조정하는 방법에 익숙하지 않은 경우, [여기](../training#train-with-pytorch-trainer)에서 기본 튜토리얼을 확인하세요! + + + +이제 모델을 훈련시킬 준비가 되었습니다! [`AutoModelForImageClassification`]로 ViT를 가져옵니다. 예상되는 레이블 수, 레이블 매핑 및 레이블 수를 지정하세요: + +```py +>>> from transformers import AutoModelForImageClassification, TrainingArguments, Trainer + +>>> model = AutoModelForImageClassification.from_pretrained( +... checkpoint, +... num_labels=len(labels), +... id2label=id2label, +... label2id=label2id, +... ) +``` + +이제 세 단계만 거치면 끝입니다: + +1. [`TrainingArguments`]에서 훈련 하이퍼파라미터를 정의하세요. `image` 열이 삭제되기 때문에 미사용 열을 제거하지 않는 것이 중요합니다. `image` 열이 없으면 `pixel_values`을 생성할 수 없습니다. 이 동작을 방지하려면 `remove_unused_columns=False`로 설정하세요! 다른 유일한 필수 매개변수는 모델 저장 위치를 지정하는 `output_dir`입니다. `push_to_hub=True`로 설정하면 이 모델을 허브에 푸시합니다(모델을 업로드하려면 Hugging Face에 로그인해야 합니다). 각 에폭이 끝날 때마다, [`Trainer`]가 정확도를 평가하고 훈련 체크포인트를 저장합니다. +2. [`Trainer`]에 모델, 데이터 세트, 토크나이저, 데이터 콜레이터 및 `compute_metrics` 함수와 함께 훈련 인수를 전달하세요. +3. [`~Trainer.train`]을 호출하여 모델을 미세 조정하세요. + +```py +>>> training_args = TrainingArguments( +... output_dir="my_awesome_food_model", +... remove_unused_columns=False, +... evaluation_strategy="epoch", +... save_strategy="epoch", +... learning_rate=5e-5, +... per_device_train_batch_size=16, +... gradient_accumulation_steps=4, +... per_device_eval_batch_size=16, +... num_train_epochs=3, +... warmup_ratio=0.1, +... logging_steps=10, +... load_best_model_at_end=True, +... metric_for_best_model="accuracy", +... push_to_hub=True, +... ) + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... data_collator=data_collator, +... train_dataset=food["train"], +... eval_dataset=food["test"], +... tokenizer=image_processor, +... compute_metrics=compute_metrics, +... 
) + +>>> trainer.train() +``` + +훈련이 완료되면, 모든 사람이 모델을 사용할 수 있도록 [`~transformers.Trainer.push_to_hub`] 메소드로 모델을 허브에 공유하세요: + +```py +>>> trainer.push_to_hub() +``` + + + + + + + + +Keras를 사용하여 모델을 미세 조정하는 방법에 익숙하지 않은 경우, 먼저 [기본 튜토리얼](./training#train-a-tensorflow-model-with-keras)을 확인하세요! + + + +TensorFlow에서 모델을 미세 조정하려면 다음 단계를 따르세요: +1. 훈련 하이퍼파라미터를 정의하고 옵티마이저와 학습률 스케쥴을 설정합니다. +2. 사전 훈련된 모델을 인스턴스화합니다. +3. 🤗 Dataset을 `tf.data.Dataset`으로 변환합니다. +4. 모델을 컴파일합니다. +5. 콜백을 추가하고 훈련을 수행하기 위해 `fit()` 메소드를 사용합니다. +6. 커뮤니티와 공유하기 위해 모델을 🤗 Hub에 업로드합니다. + +하이퍼파라미터, 옵티마이저 및 학습률 스케쥴을 정의하는 것으로 시작합니다: + +```py +>>> from transformers import create_optimizer + +>>> batch_size = 16 +>>> num_epochs = 5 +>>> num_train_steps = len(food["train"]) * num_epochs +>>> learning_rate = 3e-5 +>>> weight_decay_rate = 0.01 + +>>> optimizer, lr_schedule = create_optimizer( +... init_lr=learning_rate, +... num_train_steps=num_train_steps, +... weight_decay_rate=weight_decay_rate, +... num_warmup_steps=0, +... ) +``` + +그런 다음 레이블 매핑과 함께 [`TFAuto ModelForImageClassification`]으로 ViT를 가져옵니다: + +```py +>>> from transformers import TFAutoModelForImageClassification + +>>> model = TFAutoModelForImageClassification.from_pretrained( +... checkpoint, +... id2label=id2label, +... label2id=label2id, +... ) +``` + +데이터 세트를 [`~datasets.Dataset.to_tf_dataset`]와 `data_collator`를 사용하여 `tf.data.Dataset` 형식으로 변환하세요: + +```py +>>> # converting our train dataset to tf.data.Dataset +>>> tf_train_dataset = food["train"].to_tf_dataset( +... columns="pixel_values", label_cols="label", shuffle=True, batch_size=batch_size, collate_fn=data_collator +... ) + +>>> # converting our test dataset to tf.data.Dataset +>>> tf_eval_dataset = food["test"].to_tf_dataset( +... columns="pixel_values", label_cols="label", shuffle=True, batch_size=batch_size, collate_fn=data_collator +... ) +``` + +`compile()`를 사용하여 훈련 모델을 구성하세요: + +```py +>>> from tensorflow.keras.losses import SparseCategoricalCrossentropy + +>>> loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True) +>>> model.compile(optimizer=optimizer, loss=loss) +``` + +예측에서 정확도를 계산하고 모델을 🤗 Hub로 푸시하려면 [Keras callbacks](../main_classes/keras_callbacks)를 사용하세요. +`compute_metrics` 함수를 [KerasMetricCallback](../main_classes/keras_callbacks#transformers.KerasMetricCallback)에 전달하고, +[PushToHubCallback](../main_classes/keras_callbacks#transformers.PushToHubCallback)을 사용하여 모델을 업로드합니다: + +```py +>>> from transformers.keras_callbacks import KerasMetricCallback, PushToHubCallback + +>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_eval_dataset) +>>> push_to_hub_callback = PushToHubCallback( +... output_dir="food_classifier", +... tokenizer=image_processor, +... save_strategy="no", +... ) +>>> callbacks = [metric_callback, push_to_hub_callback] +``` + +이제 모델을 훈련할 준비가 되었습니다! 
훈련 및 검증 데이터 세트, 에폭 수와 함께 `fit()`을 호출하고, +콜백을 사용하여 모델을 미세 조정합니다: + +```py +>>> model.fit(tf_train_dataset, validation_data=tf_eval_dataset, epochs=num_epochs, callbacks=callbacks) +Epoch 1/5 +250/250 [==============================] - 313s 1s/step - loss: 2.5623 - val_loss: 1.4161 - accuracy: 0.9290 +Epoch 2/5 +250/250 [==============================] - 265s 1s/step - loss: 0.9181 - val_loss: 0.6808 - accuracy: 0.9690 +Epoch 3/5 +250/250 [==============================] - 252s 1s/step - loss: 0.3910 - val_loss: 0.4303 - accuracy: 0.9820 +Epoch 4/5 +250/250 [==============================] - 251s 1s/step - loss: 0.2028 - val_loss: 0.3191 - accuracy: 0.9900 +Epoch 5/5 +250/250 [==============================] - 238s 949ms/step - loss: 0.1232 - val_loss: 0.3259 - accuracy: 0.9890 +``` + +축하합니다! 모델을 미세 조정하고 🤗 Hub에 공유했습니다. 이제 추론에 사용할 수 있습니다! + + + + + + +이미지 분류를 위한 모델을 미세 조정하는 자세한 예제는 다음 [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb)을 참조하세요. + + + +## 추론[[inference]] + +좋아요, 이제 모델을 미세 조정했으니 추론에 사용할 수 있습니다! + +추론을 수행하고자 하는 이미지를 가져와봅시다: + +```py +>>> ds = load_dataset("food101", split="validation[:10]") +>>> image = ds["image"][0] +``` + +
+ image of beignets +
+ +미세 조정 모델로 추론을 시도하는 가장 간단한 방법은 [`pipeline`]을 사용하는 것입니다. 모델로 이미지 분류를 위한 `pipeline`을 인스턴스화하고 이미지를 전달합니다: + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline("image-classification", model="my_awesome_food_model") +>>> classifier(image) +[{'score': 0.31856709718704224, 'label': 'beignets'}, + {'score': 0.015232225880026817, 'label': 'bruschetta'}, + {'score': 0.01519392803311348, 'label': 'chicken_wings'}, + {'score': 0.013022331520915031, 'label': 'pork_chop'}, + {'score': 0.012728818692266941, 'label': 'prime_rib'}] +``` + +원한다면, `pipeline`의 결과를 수동으로 복제할 수도 있습니다: + + + +이미지를 전처리하기 위해 이미지 프로세서를 가져오고 `input`을 PyTorch 텐서로 반환합니다: + +```py +>>> from transformers import AutoImageProcessor +>>> import torch + +>>> image_processor = AutoImageProcessor.from_pretrained("my_awesome_food_model") +>>> inputs = image_processor(image, return_tensors="pt") +``` + +입력을 모델에 전달하고 logits을 반환합니다: + +```py +>>> from transformers import AutoModelForImageClassification + +>>> model = AutoModelForImageClassification.from_pretrained("my_awesome_food_model") +>>> with torch.no_grad(): +... logits = model(**inputs).logits +``` + +확률이 가장 높은 예측 레이블을 가져오고, 모델의 `id2label` 매핑을 사용하여 레이블로 변환합니다: + +```py +>>> predicted_label = logits.argmax(-1).item() +>>> model.config.id2label[predicted_label] +'beignets' +``` + + + + + +이미지를 전처리하기 위해 이미지 프로세서를 가져오고 `input`을 TensorFlow 텐서로 반환합니다: + +```py +>>> from transformers import AutoImageProcessor + +>>> image_processor = AutoImageProcessor.from_pretrained("MariaK/food_classifier") +>>> inputs = image_processor(image, return_tensors="tf") +``` + +입력을 모델에 전달하고 logits을 반환합니다: + +```py +>>> from transformers import TFAutoModelForImageClassification + +>>> model = TFAutoModelForImageClassification.from_pretrained("MariaK/food_classifier") +>>> logits = model(**inputs).logits +``` + +확률이 가장 높은 예측 레이블을 가져오고, 모델의 `id2label` 매핑을 사용하여 레이블로 변환합니다: + +```py +>>> predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0]) +>>> model.config.id2label[predicted_class_id] +'beignets' +``` + + + diff --git a/docs/source/ko/tasks/language_modeling.md b/docs/source/ko/tasks/language_modeling.md new file mode 100644 index 00000000000000..ee1d11c1d09daf --- /dev/null +++ b/docs/source/ko/tasks/language_modeling.md @@ -0,0 +1,417 @@ + + +# 인과 언어 모델링[[causal-language-modeling]] + +[[open-in-colab]] + +언어 모델링은 인과적 언어 모델링과 마스크드 언어 모델링, 두 가지 유형으로 나뉩니다. 이 가이드에서는 인과적 언어 모델링을 설명합니다. +인과 언어 모델은 텍스트 생성에 자주 사용됩니다. 또 창의적인 방향으로 응용할 수 있습니다. +직접 사용하며 재미있는 탐구를 해보거나, Copilot 또는 CodeParrot와 같은 지능형 코딩 어시스턴트의 기반이 되기도 합니다. + + + +인과 언어 모델링은 토큰 시퀀스에서 다음 토큰을 예측하며, 모델은 왼쪽의 토큰에만 접근할 수 있습니다. +이는 모델이 미래의 토큰을 볼 수 없다는 것을 의미합니다. 인과 언어 모델의 예로 GPT-2가 있죠. + +이 가이드에서는 다음 작업을 수행하는 방법을 안내합니다: + +1. [DistilGPT2](https://huggingface.co/distilbert/distilgpt2) 모델을 [ELI5](https://huggingface.co/datasets/eli5) 데이터 세트의 [r/askscience](https://www.reddit.com/r/askscience/) 하위 집합으로 미세 조정 +2. 미세 조정된 모델을 추론에 사용 + + +이 안내서의 단계와 동일한 방법으로 인과 언어 모델링을 위해 다른 아키텍처를 미세 조정할 수 있습니다. 
+다음 아키텍처 중 하나를 선택하세요: + + +[BART](../model_doc/bart), [BERT](../model_doc/bert), [Bert Generation](../model_doc/bert-generation), [BigBird](../model_doc/big_bird), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [BioGpt](../model_doc/biogpt), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [BLOOM](../model_doc/bloom), [CamemBERT](../model_doc/camembert), [CodeGen](../model_doc/codegen), [CPM-Ant](../model_doc/cpmant), [CTRL](../model_doc/ctrl), [Data2VecText](../model_doc/data2vec-text), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [GIT](../model_doc/git), [GPT-Sw3](../model_doc/gpt-sw3), [OpenAI GPT-2](../model_doc/gpt2), [GPTBigCode](../model_doc/gpt_bigcode), [GPT Neo](../model_doc/gpt_neo), [GPT NeoX](../model_doc/gpt_neox), [GPT NeoX Japanese](../model_doc/gpt_neox_japanese), [GPT-J](../model_doc/gptj), [LLaMA](../model_doc/llama), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [MVP](../model_doc/mvp), [OpenLlama](../model_doc/open-llama), [OpenAI GPT](../model_doc/openai-gpt), [OPT](../model_doc/opt), [Pegasus](../model_doc/pegasus), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [QDQBert](../model_doc/qdqbert), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [RWKV](../model_doc/rwkv), [Speech2Text2](../model_doc/speech_to_text_2), [Transformer-XL](../model_doc/transfo-xl), [TrOCR](../model_doc/trocr), [XGLM](../model_doc/xglm), [XLM](../model_doc/xlm), [XLM-ProphetNet](../model_doc/xlm-prophetnet), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod) + + + + + + +시작하기 전에 필요한 라이브러리가 모두 설치되어 있는지 확인하세요: + +```bash +pip install transformers datasets evaluate +``` + +커뮤니티에 모델을 업로드하고 공유하기 위해 Hugging Face 계정에 로그인하는 것을 권장합니다. 알림이 표시되면 토큰을 입력하여 로그인하세요: + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## ELI5 데이터 세트 불러오기[[load-eli5-dataset]] + +먼저, 🤗 Datasets 라이브러리에서 r/askscience의 작은 하위 집합인 ELI5 데이터 세트를 불러옵니다. +이를 통해 전체 데이터 세트에서 학습하는 데 더 많은 시간을 투자하기 전에, 실험해봄으로써 모든 것이 작동하는지 확인할 수 있습니다. + +```py +>>> from datasets import load_dataset + +>>> eli5 = load_dataset("eli5", split="train_asks[:5000]") +``` + +데이터 세트의 `train_asks` 분할을 [`~datasets.Dataset.train_test_split`] 메소드를 사용하여 학습 및 테스트 세트로 분할합니다: + +```py +>>> eli5 = eli5.train_test_split(test_size=0.2) +``` + +그런 다음 예제를 살펴보세요: + +```py +>>> eli5["train"][0] +{'answers': {'a_id': ['c3d1aib', 'c3d4lya'], + 'score': [6, 3], + 'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. 
If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.", + "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]}, + 'answers_urls': {'url': []}, + 'document': '', + 'q_id': 'nyxfp', + 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?', + 'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']}, + 'subreddit': 'askscience', + 'title': 'Few questions about this space walk photograph.', + 'title_urls': {'url': []}} +``` + +많아 보일 수 있지만, 실제로는 `text` 필드만 중요합니다. 언어 모델링 작업의 장점은 레이블이 필요하지 않다는 것입니다. 다음 단어 *자체가* 레이블입니다. (이렇게 레이블을 제공하지 않아도 되는 학습을 비지도 학습이라고 일컫습니다) + +## 전처리[[preprocess]] + + + +다음 단계는 `text` 필드를 전처리하기 위해 DistilGPT2 토크나이저를 불러오는 것입니다. + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2") +``` + +위의 예제에서 알 수 있듯이, `text` 필드는 `answers` 아래에 중첩되어 있습니다. 따라서 [`flatten`](https://huggingface.co/docs/datasets/process#flatten) 메소드를 사용하여 중첩 구조에서 `text` 하위 필드를 추출해야 합니다. + +```py +>>> eli5 = eli5.flatten() +>>> eli5["train"][0] +{'answers.a_id': ['c3d1aib', 'c3d4lya'], + 'answers.score': [6, 3], + 'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.", + "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"], + 'answers_urls.url': [], + 'document': '', + 'q_id': 'nyxfp', + 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?', + 'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'], + 'subreddit': 'askscience', + 'title': 'Few questions about this space walk photograph.', + 'title_urls.url': []} +``` + +각 하위 필드는 이제 `answers` 접두사를 가진 별도의 열로 나뉘었으며, `text` 필드는 이제 리스트입니다. 각 문장을 개별적으로 토큰화하는 대신, 먼저 리스트를 문자열로 변환하여 한꺼번에 토큰화할 수 있습니다. + +다음은 문자열 리스트를 결합하고 결과를 토큰화하는 첫 번째 전처리 함수입니다: + +```py +>>> def preprocess_function(examples): +... return tokenizer([" ".join(x) for x in examples["answers.text"]]) +``` + +이 전처리 함수를 전체 데이터 세트에 적용하려면 🤗 Datasets [`~datasets.Dataset.map`] 메소드를 사용하세요. `batched=True`로 설정하여 데이터셋의 여러 요소를 한 번에 처리하고, `num_proc`를 증가시켜 프로세스 수를 늘릴 수 있습니다. 필요 없는 열은 제거하세요: + +```py +>>> tokenized_eli5 = eli5.map( +... 
preprocess_function, +... batched=True, +... num_proc=4, +... remove_columns=eli5["train"].column_names, +... ) +``` + +이제 데이터 세트는 시퀀스가 토큰화됐지만, 일부 시퀀스는 모델의 최대 입력 길이보다 길 수 있습니다. + +이제 두 번째 전처리 함수를 사용하여 +- 모든 시퀀스를 연결하고, +- `block_size`로 정의된 길이로 연결된 시퀀스를 여러 개의 짧은 묶음으로 나눕니다. 이 값은 최대 입력 길이와 GPU RAM을 고려해 충분히 짧아야 합니다. + +```py +>>> block_size = 128 + + +>>> def group_texts(examples): +... # Concatenate all texts. +... concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()} +... total_length = len(concatenated_examples[list(examples.keys())[0]]) +... # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can +... # customize this part to your needs. +... if total_length >= block_size: +... total_length = (total_length // block_size) * block_size +... # Split by chunks of block_size. +... result = { +... k: [t[i : i + block_size] for i in range(0, total_length, block_size)] +... for k, t in concatenated_examples.items() +... } +... result["labels"] = result["input_ids"].copy() +... return result +``` + +전체 데이터 세트에 `group_texts` 함수를 적용하세요: + +```py +>>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4) +``` + +그런 다음 [`DataCollatorForLanguageModeling`]을 사용하여 예제의 배치를 만듭니다. 데이터 세트 전체를 최대 길이로 패딩하는 것보다, 취합 단계에서 각 배치의 최대 길이로 문장을 *동적으로 패딩*하는 것이 더 효율적입니다. + + + +패딩 토큰으로 종결 토큰을 사용하고 `mlm=False`로 설정하세요. 이렇게 하면 입력을 오른쪽으로 한 칸씩 시프트한 값을 레이블로 사용합니다: + +```py +>>> from transformers import DataCollatorForLanguageModeling + +>>> tokenizer.pad_token = tokenizer.eos_token +>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False) +``` + + + +패딩 토큰으로 종결 토큰을 사용하고 `mlm=False`로 설정하세요. 이렇게 하면 입력을 오른쪽으로 한 칸씩 시프트한 값을 레이블로 사용합니다: + +```py +>>> from transformers import DataCollatorForLanguageModeling + +>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf") +``` + + + + + +## 훈련[[train]] + + + + + +[`Trainer`]를 사용하여 모델을 미세 조정하는 방법을 잘 모르신다면 [기본 튜토리얼](../training#train-with-pytorch-trainer)을 확인해보세요! + + + +이제 모델을 훈련하기 준비가 되었습니다! [`AutoModelForCausalLM`]를 사용하여 DistilGPT2를 불러옵니다: + +```py +>>> from transformers import AutoModelForCausalLM, TrainingArguments, Trainer + +>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2") +``` + +여기까지 진행하면 세 단계만 남았습니다: + +1. [`TrainingArguments`]에서 훈련 하이퍼파라미터를 정의하세요. `output_dir`은 유일한 필수 매개변수로, 모델을 저장할 위치를 지정합니다. (먼저 Hugging Face에 로그인 필수) `push_to_hub=True`로 설정하여 이 모델을 허브에 업로드할 수 있습니다. +2. 훈련 인수를 [`Trainer`]에 모델, 데이터 세트 및 데이터 콜레이터와 함께 전달하세요. +3. [`~Trainer.train`]을 호출하여 모델을 미세 조정하세요. + +```py +>>> training_args = TrainingArguments( +... output_dir="my_awesome_eli5_clm-model", +... evaluation_strategy="epoch", +... learning_rate=2e-5, +... weight_decay=0.01, +... push_to_hub=True, +... ) + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=lm_dataset["train"], +... eval_dataset=lm_dataset["test"], +... data_collator=data_collator, +... ) + +>>> trainer.train() +``` + +훈련이 완료되면 [`~transformers.Trainer.evaluate`] 메소드를 사용하여 모델을 평가하고 퍼플렉서티를 얻을 수 있습니다: + +```py +>>> import math + +>>> eval_results = trainer.evaluate() +>>> print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}") +Perplexity: 49.61 +``` + +그런 다음 [`~transformers.Trainer.push_to_hub`] 메소드를 사용하여 모델을 허브에 공유하세요. 이렇게 하면 누구나 모델을 사용할 수 있습니다: + +```py +>>> trainer.push_to_hub() +``` + + + + +Keras를 사용하여 모델을 미세 조정하는 방법에 익숙하지 않다면 [기본 튜토리얼](../training#train-a-tensorflow-model-with-keras)을 확인해보세요! 
+ + +TensorFlow에서 모델을 미세 조정하려면, 먼저 옵티마이저 함수, 학습률 스케줄 및 일부 훈련 하이퍼파라미터를 설정하세요: + +```py +>>> from transformers import create_optimizer, AdamWeightDecay + +>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01) +``` + +그런 다음 [`TFAutoModelForCausalLM`]를 사용하여 DistilGPT2를 불러옵니다: + +```py +>>> from transformers import TFAutoModelForCausalLM + +>>> model = TFAutoModelForCausalLM.from_pretrained("distilbert/distilgpt2") +``` + +[`~transformers.TFPreTrainedModel.prepare_tf_dataset`]을 사용하여 데이터 세트를 `tf.data.Dataset` 형식으로 변환하세요: + +```py +>>> tf_train_set = model.prepare_tf_dataset( +... lm_dataset["train"], +... shuffle=True, +... batch_size=16, +... collate_fn=data_collator, +... ) + +>>> tf_test_set = model.prepare_tf_dataset( +... lm_dataset["test"], +... shuffle=False, +... batch_size=16, +... collate_fn=data_collator, +... ) +``` + +[`compile`](https://keras.io/api/models/model_training_apis/#compile-method)을 사용하여 모델을 훈련하기 위해 구성하세요. Transformers 모델은 모두 기본적인 작업 관련 손실 함수를 가지고 있으므로, 원한다면 별도로 지정하지 않아도 됩니다: + +```py +>>> import tensorflow as tf + +>>> model.compile(optimizer=optimizer) # 별도로 loss 인자를 넣지 않았어요! +``` + +[`~transformers.PushToHubCallback`]에서 모델과 토크나이저를 업로드할 위치를 지정할 수 있습니다: + +```py +>>> from transformers.keras_callbacks import PushToHubCallback + +>>> callback = PushToHubCallback( +... output_dir="my_awesome_eli5_clm-model", +... tokenizer=tokenizer, +... ) +``` + +마지막으로, 모델을 훈련하기 위해 [`fit`](https://keras.io/api/models/model_training_apis/#fit-method)을 호출하세요. 훈련 데이터 세트, 검증 데이터 세트, 에폭 수 및 콜백을 전달하세요: + +```py +>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback]) +``` + +훈련이 완료되면 모델이 자동으로 허브에 업로드되어 모두가 사용할 수 있습니다! + + + + + +인과 언어 모델링을 위해 모델을 미세 조정하는 더 자세한 예제는 해당하는 [PyTorch 노트북](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb) 또는 [TensorFlow 노트북](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb)을 참조하세요. + + + +## 추론[[inference]] + +좋아요, 이제 모델을 미세 조정했으므로 추론에 사용할 수 있습니다! + +생성할 텍스트를 위한 프롬프트를 만들어보세요: + +```py +>>> prompt = "Somatic hypermutation allows the immune system to" +``` + +추론을 위해 미세 조정된 모델을 간단히 사용하는 가장 간단한 방법은 [`pipeline`]에서 사용하는 것입니다. 모델과 함께 텍스트 생성을 위한 `pipeline`을 인스턴스화하고 텍스트를 전달하세요: + +```py +>>> from transformers import pipeline + +>>> generator = pipeline("text-generation", model="my_awesome_eli5_clm-model") +>>> generator(prompt) +[{'generated_text': "Somatic hypermutation allows the immune system to be able to effectively reverse the damage caused by an infection.\n\n\nThe damage caused by an infection is caused by the immune system's ability to perform its own self-correcting tasks."}] +``` + + + +텍스트를 토큰화하고 `input_ids`를 PyTorch 텐서로 반환하세요: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model") +>>> inputs = tokenizer(prompt, return_tensors="pt").input_ids +``` + +[`~transformers.generation_utils.GenerationMixin.generate`] 메소드를 사용하여 텍스트를 생성하세요. 생성을 제어하는 다양한 텍스트 생성 전략과 매개변수에 대한 자세한 내용은 [텍스트 생성 전략](../generation_strategies) 페이지를 확인하세요. 
+ +```py +>>> from transformers import AutoModelForCausalLM + +>>> model = AutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model") +>>> outputs = model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95) +``` + +생성된 토큰 ID를 다시 텍스트로 디코딩하세요: + +```py +>>> tokenizer.batch_decode(outputs, skip_special_tokens=True) +["Somatic hypermutation allows the immune system to react to drugs with the ability to adapt to a different environmental situation. In other words, a system of 'hypermutation' can help the immune system to adapt to a different environmental situation or in some cases even a single life. In contrast, researchers at the University of Massachusetts-Boston have found that 'hypermutation' is much stronger in mice than in humans but can be found in humans, and that it's not completely unknown to the immune system. A study on how the immune system"] +``` + + +텍스트를 토큰화하고 `input_ids`를 TensorFlow 텐서로 반환하세요: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model") +>>> inputs = tokenizer(prompt, return_tensors="tf").input_ids +``` + +[`~transformers.generation_tf_utils.TFGenerationMixin.generate`] 메소드를 사용하여 요약을 생성하세요. 생성을 제어하는 다양한 텍스트 생성 전략과 매개변수에 대한 자세한 내용은 [텍스트 생성 전략](../generation_strategies) 페이지를 확인하세요. + +```py +>>> from transformers import TFAutoModelForCausalLM + +>>> model = TFAutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model") +>>> outputs = model.generate(input_ids=inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95) +``` + +생성된 토큰 ID를 다시 텍스트로 디코딩하세요: + +```py +>>> tokenizer.batch_decode(outputs, skip_special_tokens=True) +['Somatic hypermutation allows the immune system to detect the presence of other viruses as they become more prevalent. Therefore, researchers have identified a high proportion of human viruses. The proportion of virus-associated viruses in our study increases with age. Therefore, we propose a simple algorithm to detect the presence of these new viruses in our samples as a sign of improved immunity. A first study based on this algorithm, which will be published in Science on Friday, aims to show that this finding could translate into the development of a better vaccine that is more effective for'] +``` + + diff --git a/docs/source/ko/tasks/masked_language_modeling.md b/docs/source/ko/tasks/masked_language_modeling.md new file mode 100644 index 00000000000000..3aafdf1cb9eebe --- /dev/null +++ b/docs/source/ko/tasks/masked_language_modeling.md @@ -0,0 +1,448 @@ + + +# 마스킹된 언어 모델링(Masked language modeling)[[masked-language-modeling]] + +[[open-in-colab]] + + + +마스킹된 언어 모델링은 시퀀스에서 마스킹된 토큰을 예측하며, 모델은 양방향으로 토큰에 액세스할 수 있습니다. +즉, 모델은 토큰의 왼쪽과 오른쪽 양쪽에서 접근할 수 있습니다. +마스킹된 언어 모델링은 전체 시퀀스에 대한 문맥적 이해가 필요한 작업에 적합하며, BERT가 그 예에 해당합니다. + +이번 가이드에서 다룰 내용은 다음과 같습니다: + +1. [ELI5](https://huggingface.co/datasets/eli5) 데이터 세트에서 [r/askscience](https://www.reddit.com/r/askscience/) 부분을 사용해 [DistilRoBERTa](https://huggingface.co/distilbert/distilroberta-base) 모델을 미세 조정합니다. +2. 추론 시에 직접 미세 조정한 모델을 사용합니다. + + +이번 가이드에서처럼 다른 아키텍처를 미세 조정해 마스킹된 언어 모델링을 할 수 있습니다. 
+ +다음 아키텍쳐 중 하나를 선택하세요: + + + +[ALBERT](../model_doc/albert), [BART](../model_doc/bart), [BERT](../model_doc/bert), [BigBird](../model_doc/big_bird), [CamemBERT](../model_doc/camembert), [ConvBERT](../model_doc/convbert), [Data2VecText](../model_doc/data2vec-text), [DeBERTa](../model_doc/deberta), [DeBERTa-v2](../model_doc/deberta-v2), [DistilBERT](../model_doc/distilbert), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [ESM](../model_doc/esm), [FlauBERT](../model_doc/flaubert), [FNet](../model_doc/fnet), [Funnel Transformer](../model_doc/funnel), [I-BERT](../model_doc/ibert), [LayoutLM](../model_doc/layoutlm), [Longformer](../model_doc/longformer), [LUKE](../model_doc/luke), [mBART](../model_doc/mbart), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [MobileBERT](../model_doc/mobilebert), [MPNet](../model_doc/mpnet), [MVP](../model_doc/mvp), [Nezha](../model_doc/nezha), [Nyströmformer](../model_doc/nystromformer), [Perceiver](../model_doc/perceiver), [QDQBert](../model_doc/qdqbert), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [SqueezeBERT](../model_doc/squeezebert), [TAPAS](../model_doc/tapas), [Wav2Vec2](../model_doc/wav2vec2), [XLM](../model_doc/xlm), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [X-MOD](../model_doc/xmod), [YOSO](../model_doc/yoso) + + + + + +시작하기 전에 필요한 라이브러리가 모두 설치되어 있는지 확인하세요: + +```bash +pip install transformers datasets evaluate +``` + +Hugging Face 계정에 로그인하여 모델을 업로드하고 커뮤니티와의 공유를 권장합니다. 메시지가 표시되면(When prompted) 토큰을 입력하여 로그인합니다: + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## ELI5 데이터 세트 가져오기[[load-eli5-dataset]] + +먼저 🤗 Datasets 라이브러리에서 ELI5 데이터 세트의 r/askscience 중 일부만 가져옵니다. +이렇게 하면 전체 데이터 세트 학습에 더 많은 시간을 할애하기 전에 모든 것이 작동하는지 실험하고 확인할 수 있습니다. + +```py +>>> from datasets import load_dataset + +>>> eli5 = load_dataset("eli5", split="train_asks[:5000]") +``` + +데이터 세트의 `train_asks`를 [`~datasets.Dataset.train_test_split`] 메소드를 사용해 훈련 데이터와 테스트 데이터로 분할합니다: + +```py +>>> eli5 = eli5.train_test_split(test_size=0.2) +``` + +그리고 아래 예시를 살펴보세요: + +```py +>>> eli5["train"][0] +{'answers': {'a_id': ['c3d1aib', 'c3d4lya'], + 'score': [6, 3], + 'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.", + "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]}, + 'answers_urls': {'url': []}, + 'document': '', + 'q_id': 'nyxfp', + 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? 
And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?', + 'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']}, + 'subreddit': 'askscience', + 'title': 'Few questions about this space walk photograph.', + 'title_urls': {'url': []}} +``` + +많아 보일 수 있지만 실제로는 `text` 필드에만 집중하면 됩나다. +언어 모델링 작업의 멋진 점은 (비지도 학습으로) *다음 단어가 레이블*이기 때문에 레이블이 따로 필요하지 않습니다. + +## 전처리[[preprocess]] + + + +마스킹된 언어 모델링을 위해, 다음 단계로 DistilRoBERTa 토크나이저를 가져와서 `text` 하위 필드를 처리합니다: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilroberta-base") +``` + +위의 예제에서와 마찬가지로, `text` 필드는 `answers` 안에 중첩되어 있습니다. +따라서 중첩된 구조에서 [`flatten`](https://huggingface.co/docs/datasets/process#flatten) 메소드를 사용하여 `text` 하위 필드를 추출합니다: + +```py +>>> eli5 = eli5.flatten() +>>> eli5["train"][0] +{'answers.a_id': ['c3d1aib', 'c3d4lya'], + 'answers.score': [6, 3], + 'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.", + "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"], + 'answers_urls.url': [], + 'document': '', + 'q_id': 'nyxfp', + 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?', + 'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'], + 'subreddit': 'askscience', + 'title': 'Few questions about this space walk photograph.', + 'title_urls.url': []} +``` + +이제 각 하위 필드는 `answers` 접두사(prefix)로 표시된 대로 별도의 열이 되고, `text` 필드는 이제 리스트가 되었습니다. +각 문장을 개별적으로 토큰화하는 대신 리스트를 문자열로 변환하여 한번에 토큰화할 수 있습니다. + +다음은 각 예제에 대해 문자열로 이루어진 리스트를 `join`하고 결과를 토큰화하는 첫 번째 전처리 함수입니다: + +```py +>>> def preprocess_function(examples): +... return tokenizer([" ".join(x) for x in examples["answers.text"]]) +``` + +이 전처리 함수를 전체 데이터 세트에 적용하기 위해 🤗 Datasets [`~datasets.Dataset.map`] 메소드를 사용합니다. +데이터 세트의 여러 요소를 한 번에 처리하도록 `batched=True`를 설정하고 `num_proc`로 처리 횟수를 늘리면 `map` 함수의 속도를 높일 수 있습니다. +필요하지 않은 열은 제거합니다: + +```py +>>> tokenized_eli5 = eli5.map( +... preprocess_function, +... batched=True, +... num_proc=4, +... remove_columns=eli5["train"].column_names, +... ) +``` + +이 데이터 세트에는 토큰 시퀀스가 포함되어 있지만 이 중 일부는 모델의 최대 입력 길이보다 깁니다. + +이제 두 번째 전처리 함수를 사용해 +- 모든 시퀀스를 연결하고 +- 연결된 시퀀스를 정의한 `block_size` 보다 더 짧은 덩어리로 분할하는데, 이 덩어리는 모델의 최대 입력 길이보다 짧고 GPU RAM이 수용할 수 있는 길이여야 합니다. + + +```py +>>> block_size = 128 + + +>>> def group_texts(examples): +... # Concatenate all texts. +... concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()} +... total_length = len(concatenated_examples[list(examples.keys())[0]]) +... 
# We drop the small remainder, we could add padding if the model supported it instead of this drop, you can +... # customize this part to your needs. +... if total_length >= block_size: +... total_length = (total_length // block_size) * block_size +... # Split by chunks of block_size. +... result = { +... k: [t[i : i + block_size] for i in range(0, total_length, block_size)] +... for k, t in concatenated_examples.items() +... } +... result["labels"] = result["input_ids"].copy() +... return result +``` + +전체 데이터 세트에 `group_texts` 함수를 적용합니다: + +```py +>>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4) +``` + +이제 [`DataCollatorForLanguageModeling`]을 사용하여 데이터 예제의 배치를 생성합니다. +데이터 세트 전체를 최대 길이로 패딩하는 것보다 collation 단계에서 매 배치안에서의 최대 길이로 문장을 *동적으로 패딩*하는 것이 더 효율적입니다. + + + + +시퀀스 끝 토큰을 패딩 토큰으로 사용하고 데이터를 반복할 때마다 토큰을 무작위로 마스킹하도록 `mlm_-probability`를 지정합니다: + +```py +>>> from transformers import DataCollatorForLanguageModeling + +>>> tokenizer.pad_token = tokenizer.eos_token +>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15) +``` + + + +시퀀스 끝 토큰을 패딩 토큰으로 사용하고 데이터를 반복할 때마다 토큰을 무작위로 마스킹하도록 `mlm_-probability`를 지정합니다: + +```py +>>> from transformers import DataCollatorForLanguageModeling + +>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf") +``` + + + +## 훈련[[train]] + + + + + +[`Trainer`]로 모델을 미세 조정하는 데 익숙하지 않다면 기본 튜토리얼 [여기](../training#train-with-pytorch-trainer)를 살펴보세요! + + +이제 모델 훈련을 시작할 준비가 되었습니다! [`AutoModelForMaskedLM`]를 사용해 DistilRoBERTa 모델을 가져옵니다: + +```py +>>> from transformers import AutoModelForMaskedLM + +>>> model = AutoModelForMaskedLM.from_pretrained("distilbert/distilroberta-base") +``` + +이제 세 단계가 남았습니다: + +1. [`TrainingArguments`]의 훈련 하이퍼파라미터를 정의합니다. 모델 저장 위치를 지정하는 `output_dir`은 유일한 필수 파라미터입니다. `push_to_hub=True`를 설정하여 이 모델을 Hub에 업로드합니다 (모델을 업로드하려면 Hugging Face에 로그인해야 합니다). +2. 모델, 데이터 세트 및 데이터 콜레이터(collator)와 함께 훈련 인수를 [`Trainer`]에 전달합니다. +3. [`~Trainer.train`]을 호출하여 모델을 미세 조정합니다. + +```py +>>> training_args = TrainingArguments( +... output_dir="my_awesome_eli5_mlm_model", +... evaluation_strategy="epoch", +... learning_rate=2e-5, +... num_train_epochs=3, +... weight_decay=0.01, +... push_to_hub=True, +... ) + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=lm_dataset["train"], +... eval_dataset=lm_dataset["test"], +... data_collator=data_collator, +... ) + +>>> trainer.train() +``` + +훈련이 완료되면 [`~transformers.Trainer.evaluate`] 메소드를 사용하여 펄플렉서티(perplexity)를 계산하고 모델을 평가합니다: + +```py +>>> import math + +>>> eval_results = trainer.evaluate() +>>> print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}") +Perplexity: 8.76 +``` + +그리고 [`~transformers.Trainer.push_to_hub`] 메소드를 사용해 다른 사람들이 사용할 수 있도록, Hub로 모델을 업로드합니다. + +```py +>>> trainer.push_to_hub() +``` + + + + +Keras로 모델을 미세 조정하는 데 익숙하지 않다면 기본 튜토리얼 [여기](../training#train-a-tensorflow-model-with-keras)를 살펴보세요! 
+ + +TensorFlow로 모델을 미세 조정하기 위해서는 옵티마이저(optimizer) 함수 설정, 학습률(learning rate) 스케쥴링, 훈련 하이퍼파라미터 설정부터 시작하세요: + +```py +>>> from transformers import create_optimizer, AdamWeightDecay + +>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01) +``` + +다음으로 [`TFAutoModelForMaskedLM`]를 사용해 DistilRoBERTa 모델을 가져옵니다: + +```py +>>> from transformers import TFAutoModelForMaskedLM + +>>> model = TFAutoModelForMaskedLM.from_pretrained("distilbert/distilroberta-base") +``` + +[`~transformers.TFPreTrainedModel.prepare_tf_dataset`] 메소드를 사용해 데이터 세트를 `tf.data.Dataset` 형식으로 변환하세요: + +```py +>>> tf_train_set = model.prepare_tf_dataset( +... lm_dataset["train"], +... shuffle=True, +... batch_size=16, +... collate_fn=data_collator, +... ) + +>>> tf_test_set = model.prepare_tf_dataset( +... lm_dataset["test"], +... shuffle=False, +... batch_size=16, +... collate_fn=data_collator, +... ) +``` + +[`compile`](https://keras.io/api/models/model_training_apis/#compile-method) 메소드를 통해 모델 훈련을 구성합니다: + +```py +>>> import tensorflow as tf + +>>> model.compile(optimizer=optimizer) +``` + +이는 업로드할 모델과 토크나이저의 위치를 [`~transformers.PushToHubCallback`]에 지정하여 수행할 수 있습니다: + +```py +>>> from transformers.keras_callbacks import PushToHubCallback + +>>> callback = PushToHubCallback( +... output_dir="my_awesome_eli5_mlm_model", +... tokenizer=tokenizer, +... ) +``` + +드디어 모델을 훈련할 준비가 되었습니다! +모델을 미세 조정할 때 훈련 및 검증 데이터 세트, 에포크 수, 콜백이 포함된 [`fit`](https://keras.io/api/models/model_training_apis/#fit-method)을 호출합니다: + +```py +>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback]) +``` + +훈련이 완료되면, 자동으로 Hub로 업로드되어 누구나 사용할 수 있습니다! + + + + + +마스킹된 언어 모델링을 위해 모델을 미세 조정하는 방법에 대한 보다 심층적인 예제는 +[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb) +또는 [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb)을 참조하세요. + + +## 추론[[inference]] + +지금까지 모델 미세 조정을 잘 했으니, 추론에 사용할 수 있습니다! + +모델이 빈칸을 채울 텍스트를 스페셜 토큰(special token)인 `` 토큰으로 표시합니다: + + +```py +>>> text = "The Milky Way is a galaxy." +``` +추론을 위해 미세 조정한 모델을 테스트하는 가장 간단한 방법은 [`pipeline`]에서 사용하는 것입니다. +`fill-mask`태스크로 `pipeline`을 인스턴스화하고 텍스트를 전달합니다. +`top_k` 매개변수를 사용하여 반환하는 예측의 수를 지정할 수 있습니다: + +```py +>>> from transformers import pipeline + +>>> mask_filler = pipeline("fill-mask", "stevhliu/my_awesome_eli5_mlm_model") +>>> mask_filler(text, top_k=3) +[{'score': 0.5150994658470154, + 'token': 21300, + 'token_str': ' spiral', + 'sequence': 'The Milky Way is a spiral galaxy.'}, + {'score': 0.07087188959121704, + 'token': 2232, + 'token_str': ' massive', + 'sequence': 'The Milky Way is a massive galaxy.'}, + {'score': 0.06434620916843414, + 'token': 650, + 'token_str': ' small', + 'sequence': 'The Milky Way is a small galaxy.'}] +``` + + + +텍스트를 토큰화하고 `input_ids`를 PyTorch 텐서 형태로 반환합니다. 
+또한, `` 토큰의 위치를 지정해야 합니다: +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_mlm_model") +>>> inputs = tokenizer(text, return_tensors="pt") +>>> mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1] +``` + +모델에 `inputs`를 입력하고, 마스킹된 토큰의 `logits`를 반환합니다: + +```py +>>> from transformers import AutoModelForMaskedLM + +>>> model = AutoModelForMaskedLM.from_pretrained("stevhliu/my_awesome_eli5_mlm_model") +>>> logits = model(**inputs).logits +>>> mask_token_logits = logits[0, mask_token_index, :] +``` + +그런 다음 가장 높은 확률은 가진 마스크 토큰 3개를 반환하고, 출력합니다: +```py +>>> top_3_tokens = torch.topk(mask_token_logits, 3, dim=1).indices[0].tolist() + +>>> for token in top_3_tokens: +... print(text.replace(tokenizer.mask_token, tokenizer.decode([token]))) +The Milky Way is a spiral galaxy. +The Milky Way is a massive galaxy. +The Milky Way is a small galaxy. +``` + + +텍스트를 토큰화하고 `input_ids`를 TensorFlow 텐서 형태로 반환합니다. +또한, `` 토큰의 위치를 지정해야 합니다: +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_mlm_model") +>>> inputs = tokenizer(text, return_tensors="tf") +>>> mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1] +``` + +모델에 `inputs`를 입력하고, 마스킹된 토큰의 `logits`를 반환합니다: + +```py +>>> from transformers import TFAutoModelForMaskedLM + +>>> model = TFAutoModelForMaskedLM.from_pretrained("stevhliu/my_awesome_eli5_mlm_model") +>>> logits = model(**inputs).logits +>>> mask_token_logits = logits[0, mask_token_index, :] +``` + +그런 다음 가장 높은 확률은 가진 마스크 토큰 3개를 반환하고, 출력합니다: +```py +>>> top_3_tokens = tf.math.top_k(mask_token_logits, 3).indices.numpy() + +>>> for token in top_3_tokens: +... print(text.replace(tokenizer.mask_token, tokenizer.decode([token]))) +The Milky Way is a spiral galaxy. +The Milky Way is a massive galaxy. +The Milky Way is a small galaxy. +``` + + diff --git a/docs/source/ko/tasks/monocular_depth_estimation.md b/docs/source/ko/tasks/monocular_depth_estimation.md new file mode 100644 index 00000000000000..e02dd5466b7d54 --- /dev/null +++ b/docs/source/ko/tasks/monocular_depth_estimation.md @@ -0,0 +1,149 @@ + + +# 단일 영상 기반 깊이 추정[[depth-estimation-pipeline]] + +단일 영상 기반 깊이 추정은 한 장면의 단일 이미지에서 장면의 깊이 정보를 예측하는 컴퓨터 비전 작업입니다. +즉, 단일 카메라 시점의 장면에 있는 물체의 거리를 예측하는 과정입니다. + +단일 영상 기반 깊이 추정은 3D 재구성, 증강 현실, 자율 주행, 로봇 공학 등 다양한 분야에서 응용됩니다. +조명 조건, 가려짐, 텍스처와 같은 요소의 영향을 받을 수 있는 장면 내 물체와 해당 깊이 정보 간의 복잡한 관계를 모델이 이해해야 하므로 까다로운 작업입니다. + + + +이 튜토리얼에서 다루는 작업은 다음 모델 아키텍처에서 지원됩니다: + + + +[DPT](../model_doc/dpt), [GLPN](../model_doc/glpn) + + + + + +이번 가이드에서 배울 내용은 다음과 같습니다: + +* 깊이 추정 파이프라인 만들기 +* 직접 깊이 추정 추론하기 + +시작하기 전에, 필요한 모든 라이브러리가 설치되어 있는지 확인하세요: + +```bash +pip install -q transformers +``` + +## 깊이 추정 파이프라인[[depth-estimation-inference-by-hand]] + +깊이 추정을 추론하는 가장 간단한 방법은 해당 기능을 제공하는 [`pipeline`]을 사용하는 것입니다. +[Hugging Face Hub 체크포인트](https://huggingface.co/models?pipeline_tag=depth-estimation&sort=downloads)에서 파이프라인을 초기화합니다: + +```py +>>> from transformers import pipeline + +>>> checkpoint = "vinvino02/glpn-nyu" +>>> depth_estimator = pipeline("depth-estimation", model=checkpoint) +``` + + +다음으로, 분석할 이미지를 한 장 선택하세요: + +```py +>>> from PIL import Image +>>> import requests + +>>> url = "https://unsplash.com/photos/HwBAsSbPBDU/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MzR8fGNhciUyMGluJTIwdGhlJTIwc3RyZWV0fGVufDB8MHx8fDE2Nzg5MDEwODg&force=true&w=640" +>>> image = Image.open(requests.get(url, stream=True).raw) +>>> image +``` + +
+ Photo of a busy street +
+ +이미지를 파이프라인으로 전달합니다. + +```py +>>> predictions = depth_estimator(image) +``` + +파이프라인은 두 개의 항목을 가지는 딕셔너리를 반환합니다. +첫 번째는 `predicted_depth`로 각 픽셀의 깊이를 미터로 표현한 값을 가지는 텐서입니다. +두 번째는 `depth`로 깊이 추정 결과를 시각화하는 PIL 이미지입니다. + +이제 시각화한 결과를 살펴보겠습니다: + +```py +>>> predictions["depth"] +``` + +
+ Depth estimation visualization +
+ +## 직접 깊이 추정 추론하기[[depth-estimation-inference-by-hand]] + +이제 깊이 추정 파이프라인 사용법을 살펴보았으니 동일한 결과를 복제하는 방법을 살펴보겠습니다. +[Hugging Face Hub 체크포인트](https://huggingface.co/models?pipeline_tag=depth-estimation&sort=downloads)에서 모델과 관련 프로세서를 가져오는 것부터 시작합니다. +여기서 이전에 사용한 체크포인트와 동일한 것을 사용합니다: + +```py +>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation + +>>> checkpoint = "vinvino02/glpn-nyu" + +>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint) +>>> model = AutoModelForDepthEstimation.from_pretrained(checkpoint) +``` + +필요한 이미지 변환을 처리하는 `image_processor`를 사용하여 모델에 대한 이미지 입력을 준비합니다. +`image_processor`는 크기 조정 및 정규화 등 필요한 이미지 변환을 처리합니다: + +```py +>>> pixel_values = image_processor(image, return_tensors="pt").pixel_values +``` + +준비한 입력을 모델로 전달합니다: + +```py +>>> import torch + +>>> with torch.no_grad(): +... outputs = model(pixel_values) +... predicted_depth = outputs.predicted_depth +``` + +결과를 시각화합니다: + +```py +>>> import numpy as np + +>>> # 원본 사이즈로 복원 +>>> prediction = torch.nn.functional.interpolate( +... predicted_depth.unsqueeze(1), +... size=image.size[::-1], +... mode="bicubic", +... align_corners=False, +... ).squeeze() +>>> output = prediction.numpy() + +>>> formatted = (output * 255 / np.max(output)).astype("uint8") +>>> depth = Image.fromarray(formatted) +>>> depth +``` + +
+ Depth estimation visualization +
diff --git a/docs/source/ko/tasks/multiple_choice.md b/docs/source/ko/tasks/multiple_choice.md new file mode 100644 index 00000000000000..4e02f7fabe504f --- /dev/null +++ b/docs/source/ko/tasks/multiple_choice.md @@ -0,0 +1,465 @@ + + +# 객관식 문제[[multiple-choice]] + +[[open-in-colab]] + +객관식 과제는 문맥과 함께 여러 개의 후보 답변이 제공되고 모델이 정답을 선택하도록 학습된다는 점을 제외하면 질의응답과 유사합니다. + +진행하는 방법은 아래와 같습니다: + +1. [SWAG](https://huggingface.co/datasets/swag) 데이터 세트의 'regular' 구성으로 [BERT](https://huggingface.co/google-bert/bert-base-uncased)를 미세 조정하여 여러 옵션과 일부 컨텍스트가 주어졌을 때 가장 적합한 답을 선택합니다. +2. 추론에 미세 조정된 모델을 사용합니다. + + +이 튜토리얼에서 설명하는 작업은 다음 모델 아키텍처에서 지원됩니다: + + + +[ALBERT](../model_doc/albert), [BERT](../model_doc/bert), [BigBird](../model_doc/big_bird), [CamemBERT](../model_doc/camembert), [CANINE](../model_doc/canine), [ConvBERT](../model_doc/convbert), [Data2VecText](../model_doc/data2vec-text), [DeBERTa-v2](../model_doc/deberta-v2), [DistilBERT](../model_doc/distilbert), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [ErnieM](../model_doc/ernie_m), [FlauBERT](../model_doc/flaubert), [FNet](../model_doc/fnet), [Funnel Transformer](../model_doc/funnel), [I-BERT](../model_doc/ibert), [Longformer](../model_doc/longformer), [LUKE](../model_doc/luke), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [MobileBERT](../model_doc/mobilebert), [MPNet](../model_doc/mpnet), [Nezha](../model_doc/nezha), [Nyströmformer](../model_doc/nystromformer), [QDQBert](../model_doc/qdqbert), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [SqueezeBERT](../model_doc/squeezebert), [XLM](../model_doc/xlm), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod), [YOSO](../model_doc/yoso) + + + + + +시작하기 전에 필요한 라이브러리가 모두 설치되어 있는지 확인하세요: + +```bash +pip install transformers datasets evaluate +``` + +모델을 업로드하고 커뮤니티와 공유할 수 있도록 허깅페이스 계정에 로그인하는 것이 좋습니다. 메시지가 표시되면 토큰을 입력하여 로그인합니다: + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## SWAG 데이터 세트 가져오기[[load-swag-dataset]] + +먼저 🤗 Datasets 라이브러리에서 SWAG 데이터셋의 '일반' 구성을 가져옵니다: + +```py +>>> from datasets import load_dataset + +>>> swag = load_dataset("swag", "regular") +``` + +이제 데이터를 살펴봅니다: + +```py +>>> swag["train"][0] +{'ending0': 'passes by walking down the street playing their instruments.', + 'ending1': 'has heard approaching them.', + 'ending2': "arrives and they're outside dancing and asleep.", + 'ending3': 'turns the lead singer watches the performance.', + 'fold-ind': '3416', + 'gold-source': 'gold', + 'label': 0, + 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.', + 'sent2': 'A drum line', + 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line', + 'video-id': 'anetv_jkn6uvmqwh4'} +``` + +여기에는 많은 필드가 있는 것처럼 보이지만 실제로는 매우 간단합니다: + +- `sent1` 및 `sent2`: 이 필드는 문장이 어떻게 시작되는지 보여주며, 이 두 필드를 합치면 `시작 구절(startphrase)` 필드가 됩니다. +- `종료 구절(ending)`: 문장이 어떻게 끝날 수 있는지에 대한 가능한 종료 구절를 제시하지만 그 중 하나만 정답입니다. +- `레이블(label)`: 올바른 문장 종료 구절을 식별합니다. + +## 전처리[[preprocess]] + +다음 단계는 문장의 시작과 네 가지 가능한 구절을 처리하기 위해 BERT 토크나이저를 불러옵니다: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +``` + +생성하려는 전처리 함수는 다음과 같아야 합니다: + +1. 
`sent1` 필드를 네 개 복사한 다음 각각을 `sent2`와 결합하여 문장이 시작되는 방식을 재현합니다. +2. `sent2`를 네 가지 가능한 문장 구절 각각과 결합합니다. +3. 이 두 목록을 토큰화할 수 있도록 평탄화(flatten)하고, 각 예제에 해당하는 `input_ids`, `attention_mask` 및 `labels` 필드를 갖도록 다차원화(unflatten) 합니다. + +```py +>>> ending_names = ["ending0", "ending1", "ending2", "ending3"] + + +>>> def preprocess_function(examples): +... first_sentences = [[context] * 4 for context in examples["sent1"]] +... question_headers = examples["sent2"] +... second_sentences = [ +... [f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers) +... ] + +... first_sentences = sum(first_sentences, []) +... second_sentences = sum(second_sentences, []) + +... tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True) +... return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()} +``` + +전체 데이터 집합에 전처리 기능을 적용하려면 🤗 Datasets [`~datasets.Dataset.map`] 메소드를 사용합니다. `batched=True`를 설정하여 데이터 집합의 여러 요소를 한 번에 처리하면 `map` 함수의 속도를 높일 수 있습니다: + +```py +tokenized_swag = swag.map(preprocess_function, batched=True) +``` + +🤗 Transformers에는 객관식용 데이터 콜레이터가 없으므로 예제 배치를 만들려면 [`DataCollatorWithPadding`]을 조정해야 합니다. 데이터 정렬 중에 전체 데이터 집합을 최대 길이로 패딩하는 대신 배치 중 가장 긴 길이로 문장을 *동적 패딩*하는 것이 더 효율적입니다. + +`DataCollatorForMultipleChoice`는 모든 모델 입력을 평탄화하고 패딩을 적용하며 그 결과를 결과를 다차원화합니다: + + + +```py +>>> from dataclasses import dataclass +>>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy +>>> from typing import Optional, Union +>>> import torch + + +>>> @dataclass +... class DataCollatorForMultipleChoice: +... """ +... Data collator that will dynamically pad the inputs for multiple choice received. +... """ + +... tokenizer: PreTrainedTokenizerBase +... padding: Union[bool, str, PaddingStrategy] = True +... max_length: Optional[int] = None +... pad_to_multiple_of: Optional[int] = None + +... def __call__(self, features): +... label_name = "label" if "label" in features[0].keys() else "labels" +... labels = [feature.pop(label_name) for feature in features] +... batch_size = len(features) +... num_choices = len(features[0]["input_ids"]) +... flattened_features = [ +... [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features +... ] +... flattened_features = sum(flattened_features, []) + +... batch = self.tokenizer.pad( +... flattened_features, +... padding=self.padding, +... max_length=self.max_length, +... pad_to_multiple_of=self.pad_to_multiple_of, +... return_tensors="pt", +... ) + +... batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()} +... batch["labels"] = torch.tensor(labels, dtype=torch.int64) +... return batch +``` + + +```py +>>> from dataclasses import dataclass +>>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy +>>> from typing import Optional, Union +>>> import tensorflow as tf + + +>>> @dataclass +... class DataCollatorForMultipleChoice: +... """ +... Data collator that will dynamically pad the inputs for multiple choice received. +... """ + +... tokenizer: PreTrainedTokenizerBase +... padding: Union[bool, str, PaddingStrategy] = True +... max_length: Optional[int] = None +... pad_to_multiple_of: Optional[int] = None + +... def __call__(self, features): +... label_name = "label" if "label" in features[0].keys() else "labels" +... labels = [feature.pop(label_name) for feature in features] +... batch_size = len(features) +... num_choices = len(features[0]["input_ids"]) +... 
flattened_features = [ +... [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features +... ] +... flattened_features = sum(flattened_features, []) + +... batch = self.tokenizer.pad( +... flattened_features, +... padding=self.padding, +... max_length=self.max_length, +... pad_to_multiple_of=self.pad_to_multiple_of, +... return_tensors="tf", +... ) + +... batch = {k: tf.reshape(v, (batch_size, num_choices, -1)) for k, v in batch.items()} +... batch["labels"] = tf.convert_to_tensor(labels, dtype=tf.int64) +... return batch +``` + + + +## 평가 하기[[evaluate]] + +훈련 중에 메트릭을 포함하면 모델의 성능을 평가하는 데 도움이 되는 경우가 많습니다. 🤗[Evaluate](https://huggingface.co/docs/evaluate/index) 라이브러리를 사용하여 평가 방법을 빠르게 가져올 수 있습니다. 이 작업에서는 [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) 지표를 가져옵니다(🤗 Evaluate [둘러보기](https://huggingface.co/docs/evaluate/a_quick_tour)를 참조하여 지표를 가져오고 계산하는 방법에 대해 자세히 알아보세요): + +```py +>>> import evaluate + +>>> accuracy = evaluate.load("accuracy") +``` + +그리고 예측과 레이블을 [`~evaluate.EvaluationModule.compute`]에 전달하여 정확도를 계산하는 함수를 만듭니다: + +```py +>>> import numpy as np + + +>>> def compute_metrics(eval_pred): +... predictions, labels = eval_pred +... predictions = np.argmax(predictions, axis=1) +... return accuracy.compute(predictions=predictions, references=labels) +``` + +이제 `compute_metrics` 함수를 사용할 준비가 되었으며, 훈련을 설정할 때 이 함수로 돌아가게 됩니다. + +## 훈련 하기[[train]] + + + + + +[`Trainer`]로 모델을 미세 조정하는 데 익숙하지 않다면 기본 튜토리얼 [여기](../training#train-with-pytorch-trainer)를 살펴보세요! + + + +이제 모델 훈련을 시작할 준비가 되었습니다! [`AutoModelForMultipleChoice`]로 BERT를 로드합니다: + +```py +>>> from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer + +>>> model = AutoModelForMultipleChoice.from_pretrained("google-bert/bert-base-uncased") +``` + +이제 세 단계만 남았습니다: + +1. 훈련 하이퍼파라미터를 [`TrainingArguments`]에 정의합니다. 유일한 필수 매개변수는 모델을 저장할 위치를 지정하는 `output_dir`입니다. `push_to_hub=True`를 설정하여 이 모델을 허브에 푸시합니다(모델을 업로드하려면 허깅 페이스에 로그인해야 합니다). 각 에폭이 끝날 때마다 [`Trainer`]가 정확도를 평가하고 훈련 체크포인트를 저장합니다. +2. 모델, 데이터 세트, 토크나이저, 데이터 콜레이터, `compute_metrics` 함수와 함께 훈련 인자를 [`Trainer`]에 전달합니다. +3. [`~Trainer.train`]을 사용하여 모델을 미세 조정합니다. + +```py +>>> training_args = TrainingArguments( +... output_dir="my_awesome_swag_model", +... evaluation_strategy="epoch", +... save_strategy="epoch", +... load_best_model_at_end=True, +... learning_rate=5e-5, +... per_device_train_batch_size=16, +... per_device_eval_batch_size=16, +... num_train_epochs=3, +... weight_decay=0.01, +... push_to_hub=True, +... ) + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=tokenized_swag["train"], +... eval_dataset=tokenized_swag["validation"], +... tokenizer=tokenizer, +... data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer), +... compute_metrics=compute_metrics, +... ) + +>>> trainer.train() +``` + +훈련이 완료되면 모든 사람이 모델을 사용할 수 있도록 [`~transformers.Trainer.push_to_hub`] 메소드를 사용하여 모델을 허브에 공유하세요: + +```py +>>> trainer.push_to_hub() +``` + + + + +Keras로 모델을 미세 조정하는 데 익숙하지 않다면 기본 튜토리얼 [여기](../training#train-a-tensorflow-model-with-keras)를 살펴보시기 바랍니다! 
+ + +TensorFlow에서 모델을 미세 조정하려면 최적화 함수, 학습률 스케쥴 및 몇 가지 학습 하이퍼파라미터를 설정하는 것부터 시작하세요: + +```py +>>> from transformers import create_optimizer + +>>> batch_size = 16 +>>> num_train_epochs = 2 +>>> total_train_steps = (len(tokenized_swag["train"]) // batch_size) * num_train_epochs +>>> optimizer, schedule = create_optimizer(init_lr=5e-5, num_warmup_steps=0, num_train_steps=total_train_steps) +``` + +그리고 [`TFAutoModelForMultipleChoice`]로 BERT를 가져올 수 있습니다: + +```py +>>> from transformers import TFAutoModelForMultipleChoice + +>>> model = TFAutoModelForMultipleChoice.from_pretrained("google-bert/bert-base-uncased") +``` + +[`~transformers.TFPreTrainedModel.prepare_tf_dataset`]을 사용하여 데이터 세트를 `tf.data.Dataset` 형식으로 변환합니다: + +```py +>>> data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer) +>>> tf_train_set = model.prepare_tf_dataset( +... tokenized_swag["train"], +... shuffle=True, +... batch_size=batch_size, +... collate_fn=data_collator, +... ) + +>>> tf_validation_set = model.prepare_tf_dataset( +... tokenized_swag["validation"], +... shuffle=False, +... batch_size=batch_size, +... collate_fn=data_collator, +... ) +``` + +[`compile`](https://keras.io/api/models/model_training_apis/#compile-method)을 사용하여 훈련 모델을 구성합니다: + +```py +>>> model.compile(optimizer=optimizer) +``` + +훈련을 시작하기 전에 설정해야 할 마지막 두 가지는 예측의 정확도를 계산하고 모델을 허브로 푸시하는 방법을 제공하는 것입니다. 이 두 가지 작업은 모두 [Keras 콜백](../main_classes/keras_callbacks)을 사용하여 수행할 수 있습니다. + +`compute_metrics`함수를 [`~transformers.KerasMetricCallback`]에 전달하세요: + +```py +>>> from transformers.keras_callbacks import KerasMetricCallback + +>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set) +``` + +모델과 토크나이저를 업로드할 위치를 [`~transformers.PushToHubCallback`]에서 지정하세요: + +```py +>>> from transformers.keras_callbacks import PushToHubCallback + +>>> push_to_hub_callback = PushToHubCallback( +... output_dir="my_awesome_model", +... tokenizer=tokenizer, +... ) +``` + +그리고 콜백을 함께 묶습니다: + +```py +>>> callbacks = [metric_callback, push_to_hub_callback] +``` + +이제 모델 훈련을 시작합니다! 훈련 및 검증 데이터 세트, 에폭 수, 콜백을 사용하여 [`fit`](https://keras.io/api/models/model_training_apis/#fit-method)을 호출하고 모델을 미세 조정합니다: + +```py +>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=2, callbacks=callbacks) +``` + +훈련이 완료되면 모델이 자동으로 허브에 업로드되어 누구나 사용할 수 있습니다! + + + + + + +객관식 모델을 미세 조정하는 방법에 대한 보다 심층적인 예는 아래 문서를 참조하세요. +[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb) +또는 [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multiple_choice-tf.ipynb). + + + +## 추론 하기[[inference]] + +이제 모델을 미세 조정했으니 추론에 사용할 수 있습니다! + +텍스트와 두 개의 후보 답안을 작성합니다: + +```py +>>> prompt = "France has a bread law, Le Décret Pain, with strict rules on what is allowed in a traditional baguette." +>>> candidate1 = "The law does not apply to croissants and brioche." +>>> candidate2 = "The law applies to baguettes." +``` + + + +각 프롬프트와 후보 답변 쌍을 토큰화하여 PyTorch 텐서를 반환합니다. 
또한 `labels`을 생성해야 합니다: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_swag_model") +>>> inputs = tokenizer([[prompt, candidate1], [prompt, candidate2]], return_tensors="pt", padding=True) +>>> labels = torch.tensor(0).unsqueeze(0) +``` + +입력과 레이블을 모델에 전달하고 `logits`을 반환합니다: + +```py +>>> from transformers import AutoModelForMultipleChoice + +>>> model = AutoModelForMultipleChoice.from_pretrained("my_awesome_swag_model") +>>> outputs = model(**{k: v.unsqueeze(0) for k, v in inputs.items()}, labels=labels) +>>> logits = outputs.logits +``` + +가장 높은 확률을 가진 클래스를 가져옵니다: + +```py +>>> predicted_class = logits.argmax().item() +>>> predicted_class +'0' +``` + + +각 프롬프트와 후보 답안 쌍을 토큰화하여 텐서플로 텐서를 반환합니다: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_swag_model") +>>> inputs = tokenizer([[prompt, candidate1], [prompt, candidate2]], return_tensors="tf", padding=True) +``` + +모델에 입력을 전달하고 `logits`를 반환합니다: + +```py +>>> from transformers import TFAutoModelForMultipleChoice + +>>> model = TFAutoModelForMultipleChoice.from_pretrained("my_awesome_swag_model") +>>> inputs = {k: tf.expand_dims(v, 0) for k, v in inputs.items()} +>>> outputs = model(inputs) +>>> logits = outputs.logits +``` + +가장 높은 확률을 가진 클래스를 가져옵니다: + +```py +>>> predicted_class = int(tf.math.argmax(logits, axis=-1)[0]) +>>> predicted_class +'0' +``` + + diff --git a/docs/source/ko/tasks/object_detection.md b/docs/source/ko/tasks/object_detection.md new file mode 100644 index 00000000000000..0076bba6f8441f --- /dev/null +++ b/docs/source/ko/tasks/object_detection.md @@ -0,0 +1,588 @@ + + +# 객체 탐지 [[object-detection]] + +[[open-in-colab]] + +객체 탐지는 이미지에서 인스턴스(예: 사람, 건물 또는 자동차)를 감지하는 컴퓨터 비전 작업입니다. 객체 탐지 모델은 이미지를 입력으로 받고 탐지된 바운딩 박스의 좌표와 관련된 레이블을 출력합니다. +하나의 이미지에는 여러 객체가 있을 수 있으며 각각은 자체적인 바운딩 박스와 레이블을 가질 수 있습니다(예: 차와 건물이 있는 이미지). +또한 각 객체는 이미지의 다른 부분에 존재할 수 있습니다(예: 이미지에 여러 대의 차가 있을 수 있음). +이 작업은 보행자, 도로 표지판, 신호등과 같은 것들을 감지하는 자율 주행에 일반적으로 사용됩니다. +다른 응용 분야로는 이미지 내 객체 수 계산 및 이미지 검색 등이 있습니다. + +이 가이드에서 다음을 배울 것입니다: + + 1. 합성곱 백본(인풋 데이터의 특성을 추출하는 합성곱 네트워크)과 인코더-디코더 트랜스포머 모델을 결합한 [DETR](https://huggingface.co/docs/transformers/model_doc/detr) 모델을 [CPPE-5](https://huggingface.co/datasets/cppe-5) 데이터 세트에 대해 미세조정 하기 + 2. 미세조정 한 모델을 추론에 사용하기. + + +이 튜토리얼의 태스크는 다음 모델 아키텍처에서 지원됩니다: + + + +[Conditional DETR](../model_doc/conditional_detr), [Deformable DETR](../model_doc/deformable_detr), [DETA](../model_doc/deta), [DETR](../model_doc/detr), [Table Transformer](../model_doc/table-transformer), [YOLOS](../model_doc/yolos) + + + + + +시작하기 전에 필요한 모든 라이브러리가 설치되어 있는지 확인하세요: +```bash +pip install -q datasets transformers evaluate timm albumentations +``` + +허깅페이스 허브에서 데이터 세트를 가져오기 위한 🤗 Datasets과 모델을 학습하기 위한 🤗 Transformers, 데이터를 증강하기 위한 `albumentations`를 사용합니다. +DETR 모델의 합성곱 백본을 가져오기 위해서는 현재 `timm`이 필요합니다. + +커뮤니티에 모델을 업로드하고 공유할 수 있도록 Hugging Face 계정에 로그인하는 것을 권장합니다. 프롬프트가 나타나면 토큰을 입력하여 로그인하세요: + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## CPPE-5 데이터 세트 가져오기 [[load-the-CPPE-5-dataset]] + +[CPPE-5](https://huggingface.co/datasets/cppe-5) 데이터 세트는 COVID-19 대유행 상황에서 의료 전문인력 보호 장비(PPE)를 식별하는 어노테이션이 포함된 이미지를 담고 있습니다. 
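+
+데이터를 내려받기 전에, 아래와 같이 데이터 세트 빌더의 메타데이터만 먼저 확인해 볼 수도 있습니다. 어떤 카테고리 레이블이 들어 있는지 미리 살펴보기 위한 참고용 예시일 뿐이며, 이어지는 본문 코드에 꼭 필요한 단계는 아닙니다:
+
+```py
+>>> from datasets import load_dataset_builder
+
+>>> # 데이터를 내려받지 않고 데이터 세트의 메타데이터만 확인합니다
+>>> builder = load_dataset_builder("cppe-5")
+>>> # 이 가이드 뒷부분에서 설명하는 5개 카테고리가 반환됩니다
+>>> builder.info.features["objects"].feature["category"].names
+['Coverall', 'Face_Shield', 'Gloves', 'Goggles', 'Mask']
+```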
+ +데이터 세트를 가져오세요: + +```py +>>> from datasets import load_dataset + +>>> cppe5 = load_dataset("cppe-5") +>>> cppe5 +DatasetDict({ + train: Dataset({ + features: ['image_id', 'image', 'width', 'height', 'objects'], + num_rows: 1000 + }) + test: Dataset({ + features: ['image_id', 'image', 'width', 'height', 'objects'], + num_rows: 29 + }) +}) +``` + +이 데이터 세트는 학습 세트 이미지 1,000개와 테스트 세트 이미지 29개를 갖고 있습니다. + +데이터에 익숙해지기 위해, 예시가 어떻게 구성되어 있는지 살펴보세요. + +```py +>>> cppe5["train"][0] +{'image_id': 15, + 'image': , + 'width': 943, + 'height': 663, + 'objects': {'id': [114, 115, 116, 117], + 'area': [3796, 1596, 152768, 81002], + 'bbox': [[302.0, 109.0, 73.0, 52.0], + [810.0, 100.0, 57.0, 28.0], + [160.0, 31.0, 248.0, 616.0], + [741.0, 68.0, 202.0, 401.0]], + 'category': [4, 4, 0, 0]}} +``` + +데이터 세트에 있는 예시는 다음의 영역을 가지고 있습니다: + +- `image_id`: 예시 이미지 id +- `image`: 이미지를 포함하는 `PIL.Image.Image` 객체 +- `width`: 이미지의 너비 +- `height`: 이미지의 높이 +- `objects`: 이미지 안의 객체들의 바운딩 박스 메타데이터를 포함하는 딕셔너리: + - `id`: 어노테이션 id + - `area`: 바운딩 박스의 면적 + - `bbox`: 객체의 바운딩 박스 ([COCO 포맷](https://albumentations.ai/docs/getting_started/bounding_boxes_augmentation/#coco)으로) + - `category`: 객체의 카테고리, 가능한 값으로는 `Coverall (0)`, `Face_Shield (1)`, `Gloves (2)`, `Goggles (3)` 및 `Mask (4)` 가 포함됩니다. + +`bbox` 필드가 DETR 모델이 요구하는 COCO 형식을 따른다는 것을 알 수 있습니다. +그러나 `objects` 내부의 필드 그룹은 DETR이 요구하는 어노테이션 형식과 다릅니다. 따라서 이 데이터를 학습에 사용하기 전에 전처리를 적용해야 합니다. + +데이터를 더 잘 이해하기 위해서 데이터 세트에서 한 가지 예시를 시각화하세요. + +```py +>>> import numpy as np +>>> import os +>>> from PIL import Image, ImageDraw + +>>> image = cppe5["train"][0]["image"] +>>> annotations = cppe5["train"][0]["objects"] +>>> draw = ImageDraw.Draw(image) + +>>> categories = cppe5["train"].features["objects"].feature["category"].names + +>>> id2label = {index: x for index, x in enumerate(categories, start=0)} +>>> label2id = {v: k for k, v in id2label.items()} + +>>> for i in range(len(annotations["id"])): +... box = annotations["bbox"][i - 1] +... class_idx = annotations["category"][i - 1] +... x, y, w, h = tuple(box) +... draw.rectangle((x, y, x + w, y + h), outline="red", width=1) +... draw.text((x, y), id2label[class_idx], fill="white") + +>>> image +``` + +
+ CPPE-5 Image Example +
+ +바운딩 박스와 연결된 레이블을 시각화하려면 데이터 세트의 메타 데이터, 특히 `category` 필드에서 레이블을 가져와야 합니다. +또한 레이블 ID를 레이블 클래스에 매핑하는 `id2label`과 반대로 매핑하는 `label2id` 딕셔너리를 만들어야 합니다. +모델을 설정할 때 이러한 매핑을 사용할 수 있습니다. 이러한 매핑은 허깅페이스 허브에서 모델을 공유했을 때 다른 사람들이 재사용할 수 있습니다. + +데이터를 더 잘 이해하기 위한 최종 단계로, 잠재적인 문제를 찾아보세요. +객체 감지를 위한 데이터 세트에서 자주 발생하는 문제 중 하나는 바운딩 박스가 이미지의 가장자리를 넘어가는 것입니다. +이러한 바운딩 박스를 "넘어가는 것(run away)"은 훈련 중에 오류를 발생시킬 수 있기에 이 단계에서 처리해야 합니다. +이 데이터 세트에도 같은 문제가 있는 몇 가지 예가 있습니다. 이 가이드에서는 간단하게하기 위해 데이터에서 이러한 이미지를 제거합니다. + +```py +>>> remove_idx = [590, 821, 822, 875, 876, 878, 879] +>>> keep = [i for i in range(len(cppe5["train"])) if i not in remove_idx] +>>> cppe5["train"] = cppe5["train"].select(keep) +``` + +## 데이터 전처리하기 [[preprocess-the-data]] + +모델을 미세 조정 하려면, 미리 학습된 모델에서 사용한 전처리 방식과 정확하게 일치하도록 사용할 데이터를 전처리해야 합니다. +[`AutoImageProcessor`]는 이미지 데이터를 처리하여 DETR 모델이 학습에 사용할 수 있는 `pixel_values`, `pixel_mask`, 그리고 `labels`를 생성하는 작업을 담당합니다. +이 이미지 프로세서에는 걱정하지 않아도 되는 몇 가지 속성이 있습니다: + +- `image_mean = [0.485, 0.456, 0.406 ]` +- `image_std = [0.229, 0.224, 0.225]` + + +이 값들은 모델 사전 훈련 중 이미지를 정규화하는 데 사용되는 평균과 표준 편차입니다. +이 값들은 추론 또는 사전 훈련된 이미지 모델을 세밀하게 조정할 때 복제해야 하는 중요한 값입니다. + +사전 훈련된 모델과 동일한 체크포인트에서 이미지 프로세서를 인스턴스화합니다. + +```py +>>> from transformers import AutoImageProcessor + +>>> checkpoint = "facebook/detr-resnet-50" +>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint) +``` + +`image_processor`에 이미지를 전달하기 전에, 데이터 세트에 두 가지 전처리를 적용해야 합니다: + +- 이미지 증강 +- DETR 모델의 요구에 맞게 어노테이션을 다시 포맷팅 + +첫째로, 모델이 학습 데이터에 과적합 되지 않도록 데이터 증강 라이브러리 중 아무거나 사용하여 변환을 적용할 수 있습니다. 여기에서는 [Albumentations](https://albumentations.ai/docs/) 라이브러리를 사용합니다... +이 라이브러리는 변환을 이미지에 적용하고 바운딩 박스를 적절하게 업데이트하도록 보장합니다. +🤗 Datasets 라이브러리 문서에는 [객체 탐지를 위해 이미지를 보강하는 방법에 대한 자세한 가이드](https://huggingface.co/docs/datasets/object_detection)가 있으며, +이 예제와 정확히 동일한 데이터 세트를 사용합니다. 여기서는 각 이미지를 (480, 480) 크기로 조정하고, 좌우로 뒤집고, 밝기를 높이는 동일한 접근법을 적용합니다: + + +```py +>>> import albumentations +>>> import numpy as np +>>> import torch + +>>> transform = albumentations.Compose( +... [ +... albumentations.Resize(480, 480), +... albumentations.HorizontalFlip(p=1.0), +... albumentations.RandomBrightnessContrast(p=1.0), +... ], +... bbox_params=albumentations.BboxParams(format="coco", label_fields=["category"]), +... ) +``` + +이미지 프로세서는 어노테이션이 다음과 같은 형식일 것으로 예상합니다: `{'image_id': int, 'annotations': List[Dict]}`, 여기서 각 딕셔너리는 COCO 객체 어노테이션입니다. 단일 예제에 대해 어노테이션의 형식을 다시 지정하는 함수를 추가해 보겠습니다: + +```py +>>> def formatted_anns(image_id, category, area, bbox): +... annotations = [] +... for i in range(0, len(category)): +... new_ann = { +... "image_id": image_id, +... "category_id": category[i], +... "isCrowd": 0, +... "area": area[i], +... "bbox": list(bbox[i]), +... } +... annotations.append(new_ann) + +... return annotations +``` + +이제 이미지와 어노테이션 전처리 변환을 결합하여 예제 배치에 사용할 수 있습니다: + +```py +>>> # transforming a batch +>>> def transform_aug_ann(examples): +... image_ids = examples["image_id"] +... images, bboxes, area, categories = [], [], [], [] +... for image, objects in zip(examples["image"], examples["objects"]): +... image = np.array(image.convert("RGB"))[:, :, ::-1] +... out = transform(image=image, bboxes=objects["bbox"], category=objects["category"]) + +... area.append(objects["area"]) +... images.append(out["image"]) +... bboxes.append(out["bboxes"]) +... categories.append(out["category"]) + +... targets = [ +... {"image_id": id_, "annotations": formatted_anns(id_, cat_, ar_, box_)} +... for id_, cat_, ar_, box_ in zip(image_ids, categories, area, bboxes) +... ] + +... 
return image_processor(images=images, annotations=targets, return_tensors="pt") +``` + +이전 단계에서 만든 전처리 함수를 🤗 Datasets의 [`~datasets.Dataset.with_transform`] 메소드를 사용하여 데이터 세트 전체에 적용합니다. +이 메소드는 데이터 세트의 요소를 가져올 때마다 전처리 함수를 적용합니다. + +이 시점에서는 전처리 후 데이터 세트에서 예시 하나를 가져와서 변환 후 모양이 어떻게 되는지 확인해 볼 수 있습니다. +이때, `pixel_values` 텐서, `pixel_mask` 텐서, 그리고 `labels`로 구성된 텐서가 있어야 합니다. + +```py +>>> cppe5["train"] = cppe5["train"].with_transform(transform_aug_ann) +>>> cppe5["train"][15] +{'pixel_values': tensor([[[ 0.9132, 0.9132, 0.9132, ..., -1.9809, -1.9809, -1.9809], + [ 0.9132, 0.9132, 0.9132, ..., -1.9809, -1.9809, -1.9809], + [ 0.9132, 0.9132, 0.9132, ..., -1.9638, -1.9638, -1.9638], + ..., + [-1.5699, -1.5699, -1.5699, ..., -1.9980, -1.9980, -1.9980], + [-1.5528, -1.5528, -1.5528, ..., -1.9980, -1.9809, -1.9809], + [-1.5528, -1.5528, -1.5528, ..., -1.9980, -1.9809, -1.9809]], + + [[ 1.3081, 1.3081, 1.3081, ..., -1.8431, -1.8431, -1.8431], + [ 1.3081, 1.3081, 1.3081, ..., -1.8431, -1.8431, -1.8431], + [ 1.3081, 1.3081, 1.3081, ..., -1.8256, -1.8256, -1.8256], + ..., + [-1.3179, -1.3179, -1.3179, ..., -1.8606, -1.8606, -1.8606], + [-1.3004, -1.3004, -1.3004, ..., -1.8606, -1.8431, -1.8431], + [-1.3004, -1.3004, -1.3004, ..., -1.8606, -1.8431, -1.8431]], + + [[ 1.4200, 1.4200, 1.4200, ..., -1.6476, -1.6476, -1.6476], + [ 1.4200, 1.4200, 1.4200, ..., -1.6476, -1.6476, -1.6476], + [ 1.4200, 1.4200, 1.4200, ..., -1.6302, -1.6302, -1.6302], + ..., + [-1.0201, -1.0201, -1.0201, ..., -1.5604, -1.5604, -1.5604], + [-1.0027, -1.0027, -1.0027, ..., -1.5604, -1.5430, -1.5430], + [-1.0027, -1.0027, -1.0027, ..., -1.5604, -1.5430, -1.5430]]]), + 'pixel_mask': tensor([[1, 1, 1, ..., 1, 1, 1], + [1, 1, 1, ..., 1, 1, 1], + [1, 1, 1, ..., 1, 1, 1], + ..., + [1, 1, 1, ..., 1, 1, 1], + [1, 1, 1, ..., 1, 1, 1], + [1, 1, 1, ..., 1, 1, 1]]), + 'labels': {'size': tensor([800, 800]), 'image_id': tensor([756]), 'class_labels': tensor([4]), 'boxes': tensor([[0.7340, 0.6986, 0.3414, 0.5944]]), 'area': tensor([519544.4375]), 'iscrowd': tensor([0]), 'orig_size': tensor([480, 480])}} +``` + +각각의 이미지를 성공적으로 증강하고 이미지의 어노테이션을 준비했습니다. +그러나 전처리는 아직 끝나지 않았습니다. 마지막 단계로, 이미지를 배치로 만들 사용자 정의 `collate_fn`을 생성합니다. +해당 배치에서 가장 큰 이미지에 이미지(현재 `pixel_values` 인)를 패드하고, 실제 픽셀(1)과 패딩(0)을 나타내기 위해 그에 해당하는 새로운 `pixel_mask`를 생성해야 합니다. + +```py +>>> def collate_fn(batch): +... pixel_values = [item["pixel_values"] for item in batch] +... encoding = image_processor.pad(pixel_values, return_tensors="pt") +... labels = [item["labels"] for item in batch] +... batch = {} +... batch["pixel_values"] = encoding["pixel_values"] +... batch["pixel_mask"] = encoding["pixel_mask"] +... batch["labels"] = labels +... return batch +``` + +## DETR 모델 학습시키기 [[training-the-DETR-model]] + +이전 섹션에서 대부분의 작업을 수행하여 이제 모델을 학습할 준비가 되었습니다! +이 데이터 세트의 이미지는 리사이즈 후에도 여전히 용량이 크기 때문에, 이 모델을 미세 조정 하려면 적어도 하나의 GPU가 필요합니다. + +학습은 다음의 단계를 수행합니다: + +1. [`AutoModelForObjectDetection`]을 사용하여 전처리와 동일한 체크포인트를 사용하여 모델을 가져옵니다. +2. [`TrainingArguments`]에서 학습 하이퍼파라미터를 정의합니다. +3. 모델, 데이터 세트, 이미지 프로세서 및 데이터 콜레이터와 함께 [`Trainer`]에 훈련 인수를 전달합니다. +4. [`~Trainer.train`]를 호출하여 모델을 미세 조정 합니다. + +전처리에 사용한 체크포인트와 동일한 체크포인트에서 모델을 가져올 때, 데이터 세트의 메타데이터에서 만든 `label2id`와 `id2label` 매핑을 전달해야 합니다. +또한, `ignore_mismatched_sizes=True`를 지정하여 기존 분류 헤드(모델에서 분류에 사용되는 마지막 레이어)를 새 분류 헤드로 대체합니다. + +```py +>>> from transformers import AutoModelForObjectDetection + +>>> model = AutoModelForObjectDetection.from_pretrained( +... checkpoint, +... id2label=id2label, +... label2id=label2id, +... ignore_mismatched_sizes=True, +... 
) +``` + +[`TrainingArguments`]에서 `output_dir`을 사용하여 모델을 저장할 위치를 지정한 다음, 필요에 따라 하이퍼파라미터를 구성하세요. +사용하지 않는 열을 제거하지 않도록 주의해야 합니다. 만약 `remove_unused_columns`가 `True`일 경우 이미지 열이 삭제됩니다. +이미지 열이 없는 경우 `pixel_values`를 생성할 수 없기 때문에 `remove_unused_columns`를 `False`로 설정해야 합니다. +모델을 Hub에 업로드하여 공유하려면 `push_to_hub`를 `True`로 설정하십시오(허깅페이스에 로그인하여 모델을 업로드해야 합니다). + + +```py +>>> from transformers import TrainingArguments + +>>> training_args = TrainingArguments( +... output_dir="detr-resnet-50_finetuned_cppe5", +... per_device_train_batch_size=8, +... num_train_epochs=10, +... fp16=True, +... save_steps=200, +... logging_steps=50, +... learning_rate=1e-5, +... weight_decay=1e-4, +... save_total_limit=2, +... remove_unused_columns=False, +... push_to_hub=True, +... ) +``` + +마지막으로 `model`, `training_args`, `collate_fn`, `image_processor`와 데이터 세트(`cppe5`)를 모두 가져온 후, [`~transformers.Trainer.train`]를 호출합니다. + +```py +>>> from transformers import Trainer + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... data_collator=collate_fn, +... train_dataset=cppe5["train"], +... tokenizer=image_processor, +... ) + +>>> trainer.train() +``` + +`training_args`에서 `push_to_hub`를 `True`로 설정한 경우, 학습 체크포인트는 허깅페이스 허브에 업로드됩니다. +학습 완료 후, [`~transformers.Trainer.push_to_hub`] 메소드를 호출하여 최종 모델을 허깅페이스 허브에 업로드합니다. + +```py +>>> trainer.push_to_hub() +``` + +## 평가하기 [[evaluate]] + +객체 탐지 모델은 일반적으로 일련의 COCO-스타일 지표로 평가됩니다. +기존에 구현된 평가 지표 중 하나를 사용할 수도 있지만, 여기에서는 허깅페이스 허브에 푸시한 최종 모델을 평가하는 데 `torchvision`에서 제공하는 평가 지표를 사용합니다. + +`torchvision` 평가자(evaluator)를 사용하려면 실측값인 COCO 데이터 세트를 준비해야 합니다. +COCO 데이터 세트를 빌드하는 API는 데이터를 특정 형식으로 저장해야 하므로, 먼저 이미지와 어노테이션을 디스크에 저장해야 합니다. +학습을 위해 데이터를 준비할 때와 마찬가지로, cppe5["test"]에서의 어노테이션은 포맷을 맞춰야 합니다. 그러나 이미지는 그대로 유지해야 합니다. + +평가 단계는 약간의 작업이 필요하지만, 크게 세 가지 주요 단계로 나눌 수 있습니다. +먼저, `cppe5["test"]` 세트를 준비합니다: 어노테이션을 포맷에 맞게 만들고 데이터를 디스크에 저장합니다. + +```py +>>> import json + + +>>> # format annotations the same as for training, no need for data augmentation +>>> def val_formatted_anns(image_id, objects): +... annotations = [] +... for i in range(0, len(objects["id"])): +... new_ann = { +... "id": objects["id"][i], +... "category_id": objects["category"][i], +... "iscrowd": 0, +... "image_id": image_id, +... "area": objects["area"][i], +... "bbox": objects["bbox"][i], +... } +... annotations.append(new_ann) + +... return annotations + + +>>> # Save images and annotations into the files torchvision.datasets.CocoDetection expects +>>> def save_cppe5_annotation_file_images(cppe5): +... output_json = {} +... path_output_cppe5 = f"{os.getcwd()}/cppe5/" + +... if not os.path.exists(path_output_cppe5): +... os.makedirs(path_output_cppe5) + +... path_anno = os.path.join(path_output_cppe5, "cppe5_ann.json") +... categories_json = [{"supercategory": "none", "id": id, "name": id2label[id]} for id in id2label] +... output_json["images"] = [] +... output_json["annotations"] = [] +... for example in cppe5: +... ann = val_formatted_anns(example["image_id"], example["objects"]) +... output_json["images"].append( +... { +... "id": example["image_id"], +... "width": example["image"].width, +... "height": example["image"].height, +... "file_name": f"{example['image_id']}.png", +... } +... ) +... output_json["annotations"].extend(ann) +... output_json["categories"] = categories_json + +... with open(path_anno, "w") as file: +... json.dump(output_json, file, ensure_ascii=False, indent=4) + +... for im, img_id in zip(cppe5["image"], cppe5["image_id"]): +... path_img = os.path.join(path_output_cppe5, f"{img_id}.png") +... 
im.save(path_img) + +... return path_output_cppe5, path_anno +``` + +다음으로, `cocoevaluator`와 함께 사용할 수 있는 `CocoDetection` 클래스의 인스턴스를 준비합니다. + +```py +>>> import torchvision + + +>>> class CocoDetection(torchvision.datasets.CocoDetection): +... def __init__(self, img_folder, image_processor, ann_file): +... super().__init__(img_folder, ann_file) +... self.image_processor = image_processor + +... def __getitem__(self, idx): +... # read in PIL image and target in COCO format +... img, target = super(CocoDetection, self).__getitem__(idx) + +... # preprocess image and target: converting target to DETR format, +... # resizing + normalization of both image and target) +... image_id = self.ids[idx] +... target = {"image_id": image_id, "annotations": target} +... encoding = self.image_processor(images=img, annotations=target, return_tensors="pt") +... pixel_values = encoding["pixel_values"].squeeze() # remove batch dimension +... target = encoding["labels"][0] # remove batch dimension + +... return {"pixel_values": pixel_values, "labels": target} + + +>>> im_processor = AutoImageProcessor.from_pretrained("devonho/detr-resnet-50_finetuned_cppe5") + +>>> path_output_cppe5, path_anno = save_cppe5_annotation_file_images(cppe5["test"]) +>>> test_ds_coco_format = CocoDetection(path_output_cppe5, im_processor, path_anno) +``` + +마지막으로, 평가 지표를 가져와서 평가를 실행합니다. + +```py +>>> import evaluate +>>> from tqdm import tqdm + +>>> model = AutoModelForObjectDetection.from_pretrained("devonho/detr-resnet-50_finetuned_cppe5") +>>> module = evaluate.load("ybelkada/cocoevaluate", coco=test_ds_coco_format.coco) +>>> val_dataloader = torch.utils.data.DataLoader( +... test_ds_coco_format, batch_size=8, shuffle=False, num_workers=4, collate_fn=collate_fn +... ) + +>>> with torch.no_grad(): +... for idx, batch in enumerate(tqdm(val_dataloader)): +... pixel_values = batch["pixel_values"] +... pixel_mask = batch["pixel_mask"] + +... labels = [ +... {k: v for k, v in t.items()} for t in batch["labels"] +... ] # these are in DETR format, resized + normalized + +... # forward pass +... outputs = model(pixel_values=pixel_values, pixel_mask=pixel_mask) + +... orig_target_sizes = torch.stack([target["orig_size"] for target in labels], dim=0) +... results = im_processor.post_process(outputs, orig_target_sizes) # convert outputs of model to Pascal VOC format (xmin, ymin, xmax, ymax) + +... module.add(prediction=results, reference=labels) +... del batch + +>>> results = module.compute() +>>> print(results) +Accumulating evaluation results... +DONE (t=0.08s). 
+IoU metric: bbox + Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.352 + Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.681 + Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.292 + Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.168 + Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.208 + Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.429 + Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.274 + Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.484 + Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.501 + Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.191 + Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.323 + Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.590 +``` + +이러한 결과는 [`~transformers.TrainingArguments`]의 하이퍼파라미터를 조정하여 더욱 개선될 수 있습니다. 한번 시도해 보세요! + +## 추론하기 [[inference]] + +DETR 모델을 미세 조정 및 평가하고, 허깅페이스 허브에 업로드 했으므로 추론에 사용할 수 있습니다. + +미세 조정된 모델을 추론에 사용하는 가장 간단한 방법은 [`pipeline`]에서 모델을 사용하는 것입니다. +모델과 함께 객체 탐지를 위한 파이프라인을 인스턴스화하고, 이미지를 전달하세요: + +```py +>>> from transformers import pipeline +>>> import requests + +>>> url = "https://i.imgur.com/2lnWoly.jpg" +>>> image = Image.open(requests.get(url, stream=True).raw) + +>>> obj_detector = pipeline("object-detection", model="devonho/detr-resnet-50_finetuned_cppe5") +>>> obj_detector(image) +``` + +만약 원한다면 수동으로 `pipeline`의 결과를 재현할 수 있습니다: + +```py +>>> image_processor = AutoImageProcessor.from_pretrained("devonho/detr-resnet-50_finetuned_cppe5") +>>> model = AutoModelForObjectDetection.from_pretrained("devonho/detr-resnet-50_finetuned_cppe5") + +>>> with torch.no_grad(): +... inputs = image_processor(images=image, return_tensors="pt") +... outputs = model(**inputs) +... target_sizes = torch.tensor([image.size[::-1]]) +... results = image_processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[0] + +>>> for score, label, box in zip(results["scores"], results["labels"], results["boxes"]): +... box = [round(i, 2) for i in box.tolist()] +... print( +... f"Detected {model.config.id2label[label.item()]} with confidence " +... f"{round(score.item(), 3)} at location {box}" +... ) +Detected Coverall with confidence 0.566 at location [1215.32, 147.38, 4401.81, 3227.08] +Detected Mask with confidence 0.584 at location [2449.06, 823.19, 3256.43, 1413.9] +``` + +결과를 시각화하겠습니다: +```py +>>> draw = ImageDraw.Draw(image) + +>>> for score, label, box in zip(results["scores"], results["labels"], results["boxes"]): +... box = [round(i, 2) for i in box.tolist()] +... x, y, x2, y2 = tuple(box) +... draw.rectangle((x, y, x2, y2), outline="red", width=1) +... draw.text((x, y), model.config.id2label[label.item()], fill="white") + +>>> image +``` + +
+ Object detection result on a new image +
diff --git a/docs/source/ko/tasks/question_answering.md b/docs/source/ko/tasks/question_answering.md new file mode 100644 index 00000000000000..9539b9a403030e --- /dev/null +++ b/docs/source/ko/tasks/question_answering.md @@ -0,0 +1,428 @@ + + +# 질의 응답(Question Answering)[[question-answering]] + +[[open-in-colab]] + + + +질의 응답 태스크는 주어진 질문에 대한 답변을 제공합니다. Alexa, Siri 또는 Google과 같은 가상 비서에게 날씨가 어떤지 물어본 적이 있다면 질의 응답 모델을 사용해본 적이 있을 것입니다. 질의 응답 태스크에는 일반적으로 두 가지 유형이 있습니다. + +- 추출적(Extractive) 질의 응답: 주어진 문맥에서 답변을 추출합니다. +- 생성적(Abstractive) 질의 응답: 문맥에서 질문에 올바르게 답하는 답변을 생성합니다. + +이 가이드는 다음과 같은 방법들을 보여줍니다. + +1. 추출적 질의 응답을 하기 위해 [SQuAD](https://huggingface.co/datasets/squad) 데이터 세트에서 [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) 미세 조정하기 +2. 추론에 미세 조정된 모델 사용하기 + + +이 튜토리얼에서 설명하는 태스크는 다음과 같은 모델 아키텍처에서 지원됩니다. + + + +[ALBERT](../model_doc/albert), [BART](../model_doc/bart), [BERT](../model_doc/bert), [BigBird](../model_doc/big_bird), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [BLOOM](../model_doc/bloom), [CamemBERT](../model_doc/camembert), [CANINE](../model_doc/canine), [ConvBERT](../model_doc/convbert), [Data2VecText](../model_doc/data2vec-text), [DeBERTa](../model_doc/deberta), [DeBERTa-v2](../model_doc/deberta-v2), [DistilBERT](../model_doc/distilbert), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [ErnieM](../model_doc/ernie_m), [FlauBERT](../model_doc/flaubert), [FNet](../model_doc/fnet), [Funnel Transformer](../model_doc/funnel), [GPT-J](../model_doc/gptj), [I-BERT](../model_doc/ibert), [LayoutLMv2](../model_doc/layoutlmv2), [LayoutLMv3](../model_doc/layoutlmv3), [LED](../model_doc/led), [LiLT](../model_doc/lilt), [Longformer](../model_doc/longformer), [LUKE](../model_doc/luke), [LXMERT](../model_doc/lxmert), [MarkupLM](../model_doc/markuplm), [mBART](../model_doc/mbart), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [MobileBERT](../model_doc/mobilebert), [MPNet](../model_doc/mpnet), [MVP](../model_doc/mvp), [Nezha](../model_doc/nezha), [Nyströmformer](../model_doc/nystromformer), [OPT](../model_doc/opt), [QDQBert](../model_doc/qdqbert), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [Splinter](../model_doc/splinter), [SqueezeBERT](../model_doc/squeezebert), [XLM](../model_doc/xlm), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod), [YOSO](../model_doc/yoso) + + + + + + +시작하기 전에, 필요한 라이브러리가 모두 설치되어 있는지 확인하세요: + +```bash +pip install transformers datasets evaluate +``` + +여러분의 모델을 업로드하고 커뮤니티에 공유할 수 있도록 Hugging Face 계정에 로그인하는 것이 좋습니다. 메시지가 표시되면 토큰을 입력해서 로그인합니다: + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## SQuAD 데이터 세트 가져오기[[load-squad-dataset]] + +먼저 🤗 Datasets 라이브러리에서 SQuAD 데이터 세트의 일부를 가져옵니다. 이렇게 하면 전체 데이터 세트로 훈련하며 더 많은 시간을 할애하기 전에 모든 것이 잘 작동하는지 실험하고 확인할 수 있습니다. 
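+
+아래 코드는 🤗 Datasets의 분할 슬라이싱 문법(`train[:5000]`)으로 훈련 분할의 앞 5,000개 예제만 가져옵니다. 참고로 같은 문법으로 개수 대신 비율을 지정할 수도 있습니다. 다음은 예시용 스케치입니다:
+
+```py
+>>> from datasets import load_dataset
+
+>>> # 훈련 분할의 앞 5%만 가져오는 예시
+>>> squad_small = load_dataset("squad", split="train[:5%]")
+```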
+ +```py +>>> from datasets import load_dataset + +>>> squad = load_dataset("squad", split="train[:5000]") +``` + +데이터 세트의 분할된 `train`을 [`~datasets.Dataset.train_test_split`] 메소드를 사용해 훈련 데이터 세트와 테스트 데이터 세트로 나누어줍니다: + +```py +>>> squad = squad.train_test_split(test_size=0.2) +``` + +그리고나서 예시로 데이터를 하나 살펴봅니다: + +```py +>>> squad["train"][0] +{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']}, + 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', + 'id': '5733be284776f41900661182', + 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', + 'title': 'University_of_Notre_Dame' +} +``` + +이 중에서 몇 가지 중요한 항목이 있습니다: + +- `answers`: 답안 토큰의 시작 위치와 답안 텍스트 +- `context`: 모델이 답을 추출하는데 필요한 배경 지식 +- `question`: 모델이 답해야 하는 질문 + +## 전처리[[preprocess]] + + + +다음 단계에서는 `question` 및 `context` 항목을 처리하기 위해 DistilBERT 토크나이저를 가져옵니다: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") +``` + +질의 응답 태스크와 관련해서 특히 유의해야할 몇 가지 전처리 단계가 있습니다: + +1. 데이터 세트의 일부 예제에는 모델의 최대 입력 길이를 초과하는 매우 긴 `context`가 있을 수 있습니다. 긴 시퀀스를 다루기 위해서는, `truncation="only_second"`로 설정해 `context`만 잘라내면 됩니다. +2. 그 다음, `return_offset_mapping=True`로 설정해 답변의 시작과 종료 위치를 원래의 `context`에 매핑합니다. +3. 매핑을 완료하면, 이제 답변에서 시작 토큰과 종료 토큰을 찾을 수 있습니다. 오프셋의 어느 부분이 `question`과 `context`에 해당하는지 찾을 수 있도록 [`~tokenizers.Encoding.sequence_ids`] 메소드를 사용하세요. + +다음은 `answer`의 시작 토큰과 종료 토큰을 잘라내서 `context`에 매핑하는 함수를 만드는 방법입니다: + +```py +>>> def preprocess_function(examples): +... questions = [q.strip() for q in examples["question"]] +... inputs = tokenizer( +... questions, +... examples["context"], +... max_length=384, +... truncation="only_second", +... return_offsets_mapping=True, +... padding="max_length", +... ) + +... offset_mapping = inputs.pop("offset_mapping") +... answers = examples["answers"] +... start_positions = [] +... end_positions = [] + +... for i, offset in enumerate(offset_mapping): +... answer = answers[i] +... start_char = answer["answer_start"][0] +... end_char = answer["answer_start"][0] + len(answer["text"][0]) +... sequence_ids = inputs.sequence_ids(i) + +... # Find the start and end of the context +... idx = 0 +... while sequence_ids[idx] != 1: +... idx += 1 +... context_start = idx +... while sequence_ids[idx] == 1: +... idx += 1 +... context_end = idx - 1 + +... # If the answer is not fully inside the context, label it (0, 0) +... if offset[context_start][0] > end_char or offset[context_end][1] < start_char: +... start_positions.append(0) +... end_positions.append(0) +... else: +... # Otherwise it's the start and end token positions +... idx = context_start +... while idx <= context_end and offset[idx][0] <= start_char: +... idx += 1 +... start_positions.append(idx - 1) + +... idx = context_end +... while idx >= context_start and offset[idx][1] >= end_char: +... 
idx -= 1 +... end_positions.append(idx + 1) + +... inputs["start_positions"] = start_positions +... inputs["end_positions"] = end_positions +... return inputs +``` + +모든 데이터 세트에 전처리를 적용하려면, 🤗 Datasets [`~datasets.Dataset.map`] 함수를 사용하세요. `batched=True`로 설정해 데이터 세트의 여러 요소들을 한 번에 처리하면 `map` 함수의 속도를 빠르게 할 수 있습니다. 필요하지 않은 열은 모두 제거합니다: + +```py +>>> tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names) +``` + +이제 [`DefaultDataCollator`]를 이용해 예시 배치를 생성합니다. 🤗 Transformers의 다른 데이터 콜레이터(data collator)와 달리, [`DefaultDataCollator`]는 패딩과 같은 추가 전처리를 적용하지 않습니다: + + + +```py +>>> from transformers import DefaultDataCollator + +>>> data_collator = DefaultDataCollator() +``` + + +```py +>>> from transformers import DefaultDataCollator + +>>> data_collator = DefaultDataCollator(return_tensors="tf") +``` + + + +## 훈련[[train]] + + + + + +[`Trainer`]를 이용해 모델을 미세 조정하는 것에 익숙하지 않다면, [여기](../training#train-with-pytorch-trainer)에서 기초 튜토리얼을 살펴보세요! + + + +이제 모델 훈련을 시작할 준비가 되었습니다! [`AutoModelForQuestionAnswering`]으로 DistilBERT를 가져옵니다: + +```py +>>> from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer + +>>> model = AutoModelForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased") +``` + +이제 세 단계만 남았습니다: + +1. [`TrainingArguments`]에서 훈련 하이퍼파라미터를 정합니다. 꼭 필요한 매개변수는 모델을 저장할 위치를 지정하는 `output_dir` 입니다. `push_to_hub=True`로 설정해서 이 모델을 Hub로 푸시합니다 (모델을 업로드하려면 Hugging Face에 로그인해야 합니다). +2. 모델, 데이터 세트, 토크나이저, 데이터 콜레이터와 함께 [`Trainer`]에 훈련 인수들을 전달합니다. +3. [`~Trainer.train`]을 호출해서 모델을 미세 조정합니다. + +```py +>>> training_args = TrainingArguments( +... output_dir="my_awesome_qa_model", +... evaluation_strategy="epoch", +... learning_rate=2e-5, +... per_device_train_batch_size=16, +... per_device_eval_batch_size=16, +... num_train_epochs=3, +... weight_decay=0.01, +... push_to_hub=True, +... ) + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=tokenized_squad["train"], +... eval_dataset=tokenized_squad["test"], +... tokenizer=tokenizer, +... data_collator=data_collator, +... ) + +>>> trainer.train() +``` + +훈련이 완료되면, [`~transformers.Trainer.push_to_hub`] 매소드를 사용해 모델을 Hub에 공유해서 모든 사람들이 사용할 수 있게 공유해주세요: + +```py +>>> trainer.push_to_hub() +``` + + + + +Keras로 모델을 미세 조정하는 것에 익숙하지 않다면, [여기](../training#train-a-tensorflow-model-with-keras)에서 기초 튜토리얼을 살펴보세요! + + +TensorFlow를 이용한 모델을 미세 조정하려면 옵티마이저 함수, 학습률 스케쥴 및 몇 가지 훈련 하이퍼파라미터를 설정하는 것부터 시작해야합니다: + +```py +>>> from transformers import create_optimizer + +>>> batch_size = 16 +>>> num_epochs = 2 +>>> total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs +>>> optimizer, schedule = create_optimizer( +... init_lr=2e-5, +... num_warmup_steps=0, +... num_train_steps=total_train_steps, +... ) +``` + +그 다음 [`TFAutoModelForQuestionAnswering`]으로 DistilBERT를 가져옵니다: + +```py +>>> from transformers import TFAutoModelForQuestionAnswering + +>>> model = TFAutoModelForQuestionAnswering("distilbert/distilbert-base-uncased") +``` + +[`~transformers.TFPreTrainedModel.prepare_tf_dataset`]을 사용해서 데이터 세트를 `tf.data.Dataset` 형식으로 변환합니다: + +```py +>>> tf_train_set = model.prepare_tf_dataset( +... tokenized_squad["train"], +... shuffle=True, +... batch_size=16, +... collate_fn=data_collator, +... ) + +>>> tf_validation_set = model.prepare_tf_dataset( +... tokenized_squad["test"], +... shuffle=False, +... batch_size=16, +... collate_fn=data_collator, +... 
) +``` + +[`compile`](https://keras.io/api/models/model_training_apis/#compile-method)로 훈련할 모델을 설정합니다: + +```py +>>> import tensorflow as tf + +>>> model.compile(optimizer=optimizer) +``` + +마지막으로 모델을 Hub로 푸시할 방법을 설정합니다. [`~transformers.PushToHubCallback`]에서 모델과 토크나이저를 푸시할 경로를 설정합니다: + +```py +>>> from transformers.keras_callbacks import PushToHubCallback + +>>> callback = PushToHubCallback( +... output_dir="my_awesome_qa_model", +... tokenizer=tokenizer, +... ) +``` + +드디어 모델 훈련을 시작할 준비가 되었습니다! 훈련 데이터 세트와 평가 데이터 세트, 에폭 수, 콜백을 설정한 후 [`fit`](https://keras.io/api/models/model_training_apis/#fit-method)을 이용해 모델을 미세 조정합니다: + +```py +>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=[callback]) +``` +훈련이 완료되면 모델이 자동으로 Hub에 업로드되어 누구나 사용할 수 있습니다! + + + + + +질의 응답을 위해 모델을 미세 조정하는 방법에 대한 더 자세한 예시는 [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb) 또는 [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering-tf.ipynb)을 참조하세요. + + + +## 평가[[evaluate]] + +질의 응답을 평가하려면 상당한 양의 후처리가 필요합니다. 시간이 너무 많이 걸리지 않도록 이 가이드에서는 평가 단계를 생략합니다. [`Trainer`]는 훈련 과정에서 평가 손실(evaluation loss)을 계속 계산하기 때문에 모델의 성능을 대략적으로 알 수 있습니다. + +시간에 여유가 있고 질의 응답 모델을 평가하는 방법에 관심이 있다면 🤗 Hugging Face Course의 [Question answering](https://huggingface.co/course/chapter7/7?fw=pt#postprocessing) 챕터를 살펴보세요! + +## 추론[[inference]] + +이제 모델을 미세 조정했으니 추론에 사용할 수 있습니다! + +질문과 모델이 예측하기 원하는 문맥(context)를 생각해보세요: + +```py +>>> question = "How many programming languages does BLOOM support?" +>>> context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages." +``` + +추론을 위해 미세 조정한 모델을 테스트하는 가장 쉬운 방법은 [`pipeline`]을 사용하는 것 입니다. 모델을 사용해 질의 응답을 하기 위해서 `pipeline`을 인스턴스화하고 텍스트를 입력합니다: + +```py +>>> from transformers import pipeline + +>>> question_answerer = pipeline("question-answering", model="my_awesome_qa_model") +>>> question_answerer(question=question, context=context) +{'score': 0.2058267742395401, + 'start': 10, + 'end': 95, + 'answer': '176 billion parameters and can generate text in 46 languages natural languages and 13'} +``` + +원한다면 `pipeline`의 결과를 직접 복제할 수도 있습니다: + + + +텍스트를 토큰화해서 PyTorch 텐서를 반환합니다: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model") +>>> inputs = tokenizer(question, context, return_tensors="pt") +``` + +모델에 입력을 전달하고 `logits`을 반환합니다: + +```py +>>> from transformers import AutoModelForQuestionAnswering + +>>> model = AutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model") +>>> with torch.no_grad(): +... 
outputs = model(**inputs) +``` + +모델의 출력에서 시작 및 종료 위치가 어딘지 가장 높은 확률을 얻습니다: + +```py +>>> answer_start_index = outputs.start_logits.argmax() +>>> answer_end_index = outputs.end_logits.argmax() +``` + +예측된 토큰을 해독해서 답을 얻습니다: + +```py +>>> predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1] +>>> tokenizer.decode(predict_answer_tokens) +'176 billion parameters and can generate text in 46 languages natural languages and 13' +``` + + +텍스트를 토큰화해서 TensorFlow 텐서를 반환합니다: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model") +>>> inputs = tokenizer(question, text, return_tensors="tf") +``` + +모델에 입력을 전달하고 `logits`을 반환합니다: + +```py +>>> from transformers import TFAutoModelForQuestionAnswering + +>>> model = TFAutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model") +>>> outputs = model(**inputs) +``` + +모델의 출력에서 시작 및 종료 위치가 어딘지 가장 높은 확률을 얻습니다: + +```py +>>> answer_start_index = int(tf.math.argmax(outputs.start_logits, axis=-1)[0]) +>>> answer_end_index = int(tf.math.argmax(outputs.end_logits, axis=-1)[0]) +``` + +예측된 토큰을 해독해서 답을 얻습니다: + +```py +>>> predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1] +>>> tokenizer.decode(predict_answer_tokens) +'176 billion parameters and can generate text in 46 languages natural languages and 13' +``` + + diff --git a/docs/source/ko/tasks/semantic_segmentation.md b/docs/source/ko/tasks/semantic_segmentation.md new file mode 100644 index 00000000000000..4b6109d692bf10 --- /dev/null +++ b/docs/source/ko/tasks/semantic_segmentation.md @@ -0,0 +1,591 @@ + + +# 의미적 분할(Semantic segmentation)[[semantic-segmentation]] + +[[open-in-colab]] + + + +의미적 분할(semantic segmentation)은 이미지의 각 픽셀에 레이블 또는 클래스를 할당합니다. 분할(segmentation)에는 여러 종류가 있으며, 의미적 분할의 경우 동일한 물체의 고유 인스턴스를 구분하지 않습니다. 두 물체 모두 동일한 레이블이 지정됩니다(예시로, "car-1" 과 "car-2" 대신 "car"로 지정합니다). +실생활에서 흔히 볼 수 있는 의미적 분할의 적용 사례로는 보행자와 중요한 교통 정보를 식별하는 자율 주행 자동차 학습, 의료 이미지의 세포와 이상 징후 식별, 그리고 위성 이미지의 환경 변화 모니터링등이 있습니다. + +이번 가이드에서 배울 내용은 다음과 같습니다: + +1. [SceneParse150](https://huggingface.co/datasets/scene_parse_150) 데이터 세트를 이용해 [SegFormer](https://huggingface.co/docs/transformers/main/en/model_doc/segformer#segformer) 미세 조정하기. +2. 미세 조정된 모델을 추론에 사용하기. + + +이 튜토리얼에서 설명하는 작업은 다음 모델 아키텍처에서 지원됩니다: + + + +[BEiT](../model_doc/beit), [Data2VecVision](../model_doc/data2vec-vision), [DPT](../model_doc/dpt), [MobileNetV2](../model_doc/mobilenet_v2), [MobileViT](../model_doc/mobilevit), [MobileViTV2](../model_doc/mobilevitv2), [SegFormer](../model_doc/segformer), [UPerNet](../model_doc/upernet) + + + + + +시작하기 전에 필요한 모든 라이브러리가 설치되었는지 확인하세요: + +```bash +pip install -q datasets transformers evaluate +``` +커뮤니티에 모델을 업로드하고 공유할 수 있도록 Hugging Face 계정에 로그인하는 것을 권장합니다. 프롬프트가 나타나면 토큰을 입력하여 로그인하세요: + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## SceneParse150 데이터 세트 불러오기[[load-sceneparse150-dataset]] + +🤗 Datasets 라이브러리에서 SceneParse150 데이터 세트의 더 작은 부분 집합을 가져오는 것으로 시작합니다. 이렇게 하면 데이터 세트 전체에 대한 훈련에 많은 시간을 할애하기 전에 실험을 통해 모든 것이 제대로 작동하는지 확인할 수 있습니다. 
+ +```py +>>> from datasets import load_dataset + +>>> ds = load_dataset("scene_parse_150", split="train[:50]") +``` + +데이터 세트의 `train`을 [`~datasets.Dataset.train_test_split`] 메소드를 사용하여 훈련 및 테스트 세트로 분할하세요: + +```py +>>> ds = ds.train_test_split(test_size=0.2) +>>> train_ds = ds["train"] +>>> test_ds = ds["test"] +``` + +그리고 예시를 살펴보세요: + +```py +>>> train_ds[0] +{'image': , + 'annotation': , + 'scene_category': 368} +``` + +- `image`: 장면의 PIL 이미지입니다. +- `annotation`: 분할 지도(segmentation map)의 PIL 이미지입니다. 모델의 타겟이기도 합니다. +- `scene_category`: "주방" 또는 "사무실"과 같이 이미지 장면을 설명하는 카테고리 ID입니다. 이 가이드에서는 둘 다 PIL 이미지인 `image`와 `annotation`만을 사용합니다. + +나중에 모델을 설정할 때 유용하게 사용할 수 있도록 레이블 ID를 레이블 클래스에 매핑하는 사전도 만들고 싶을 것입니다. Hub에서 매핑을 다운로드하고 `id2label` 및 `label2id` 사전을 만드세요: + +```py +>>> import json +>>> from huggingface_hub import cached_download, hf_hub_url + +>>> repo_id = "huggingface/label-files" +>>> filename = "ade20k-id2label.json" +>>> id2label = json.load(open(cached_download(hf_hub_url(repo_id, filename, repo_type="dataset")), "r")) +>>> id2label = {int(k): v for k, v in id2label.items()} +>>> label2id = {v: k for k, v in id2label.items()} +>>> num_labels = len(id2label) +``` + +## 전처리하기[[preprocess] + +다음 단계는 모델에 사용할 이미지와 주석을 준비하기 위해 SegFormer 이미지 프로세서를 불러오는 것입니다. 우리가 사용하는 데이터 세트와 같은 일부 데이터 세트는 배경 클래스로 제로 인덱스를 사용합니다. 하지만 배경 클래스는 150개의 클래스에 실제로는 포함되지 않기 때문에 `reduce_labels=True` 를 설정해 모든 레이블에서 배경 클래스를 제거해야 합니다. 제로 인덱스는 `255`로 대체되므로 SegFormer의 손실 함수에서 무시됩니다: + +```py +>>> from transformers import AutoImageProcessor + +>>> checkpoint = "nvidia/mit-b0" +>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint, reduce_labels=True) +``` + + + + +이미지 데이터 세트에 데이터 증강을 적용하여 과적합에 대해 모델을 보다 강건하게 만드는 것이 일반적입니다. 이 가이드에서는 [torchvision](https://pytorch.org/vision/stable/index.html)의 [`ColorJitter`](https://pytorch.org/vision/stable/generated/torchvision.transforms.ColorJitter.html)를 사용하여 이미지의 색상 속성을 임의로 변경합니다. 하지만, 자신이 원하는 이미지 라이브러리를 사용할 수도 있습니다. + +```py +>>> from torchvision.transforms import ColorJitter + +>>> jitter = ColorJitter(brightness=0.25, contrast=0.25, saturation=0.25, hue=0.1) +``` + +이제 모델에 사용할 이미지와 주석을 준비하기 위해 두 개의 전처리 함수를 만듭니다. 이 함수들은 이미지를 `pixel_values`로, 주석을 `labels`로 변환합니다. 훈련 세트의 경우 이미지 프로세서에 이미지를 제공하기 전에 `jitter`를 적용합니다. 테스트 세트의 경우 이미지 프로세서는 `images`를 자르고 정규화하며, 테스트 중에는 데이터 증강이 적용되지 않으므로 `labels`만 자릅니다. + +```py +>>> def train_transforms(example_batch): +... images = [jitter(x) for x in example_batch["image"]] +... labels = [x for x in example_batch["annotation"]] +... inputs = image_processor(images, labels) +... return inputs + + +>>> def val_transforms(example_batch): +... images = [x for x in example_batch["image"]] +... labels = [x for x in example_batch["annotation"]] +... inputs = image_processor(images, labels) +... return inputs +``` + +모든 데이터 세트에 `jitter`를 적용하려면, 🤗 Datasets [`~datasets.Dataset.set_transform`] 함수를 사용하세요. 즉시 변환이 적용되기 때문에 더 빠르고 디스크 공간을 덜 차지합니다: + +```py +>>> train_ds.set_transform(train_transforms) +>>> test_ds.set_transform(val_transforms) +``` + + + + + + + +이미지 데이터 세트에 데이터 증강을 적용하여 과적합에 대해 모델을 보다 강건하게 만드는 것이 일반적입니다. 이 가이드에서는 [`tf.image`](https://www.tensorflow.org/api_docs/python/tf/image)를 사용하여 이미지의 색상 속성을 임의로 변경합니다. 하지만, 자신이 원하는 이미지 라이브러리를 사용할 수도 있습니다. + +별개의 두 변환 함수를 정의합니다: +- 이미지 증강을 포함하는 학습 데이터 변환 +- 🤗 Transformers의 컴퓨터 비전 모델은 채널 우선 레이아웃을 기대하기 때문에, 이미지만 바꾸는 검증 데이터 변환 + +```py +>>> import tensorflow as tf + + +>>> def aug_transforms(image): +... image = tf.keras.utils.img_to_array(image) +... image = tf.image.random_brightness(image, 0.25) +... 
image = tf.image.random_contrast(image, 0.5, 2.0) +... image = tf.image.random_saturation(image, 0.75, 1.25) +... image = tf.image.random_hue(image, 0.1) +... image = tf.transpose(image, (2, 0, 1)) +... return image + + +>>> def transforms(image): +... image = tf.keras.utils.img_to_array(image) +... image = tf.transpose(image, (2, 0, 1)) +... return image +``` + +그런 다음 모델을 위해 두 개의 전처리 함수를 만들어 이미지 및 주석 배치를 준비합니다. 이 함수들은 이미지 변환을 적용하고 이전에 로드한 `image_processor`를 사용하여 이미지를 `pixel_values`로, 주석을 `label`로 변환합니다. `ImageProcessor` 는 이미지의 크기 조정과 정규화도 처리합니다. + +```py +>>> def train_transforms(example_batch): +... images = [aug_transforms(x.convert("RGB")) for x in example_batch["image"]] +... labels = [x for x in example_batch["annotation"]] +... inputs = image_processor(images, labels) +... return inputs + + +>>> def val_transforms(example_batch): +... images = [transforms(x.convert("RGB")) for x in example_batch["image"]] +... labels = [x for x in example_batch["annotation"]] +... inputs = image_processor(images, labels) +... return inputs +``` + +전체 데이터 집합에 전처리 변환을 적용하려면 🤗 Datasets [`~datasets.Dataset.set_transform`] 함수를 사용하세요. +즉시 변환이 적용되기 때문에 더 빠르고 디스크 공간을 덜 차지합니다: + +```py +>>> train_ds.set_transform(train_transforms) +>>> test_ds.set_transform(val_transforms) +``` + + + +## 평가하기[[evaluate]] + +훈련 중에 메트릭을 포함하면 모델의 성능을 평가하는 데 도움이 되는 경우가 많습니다. 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) 라이브러리를 사용하여 평가 방법을 빠르게 로드할 수 있습니다. 이 태스크에서는 [mean Intersection over Union](https://huggingface.co/spaces/evaluate-metric/accuracy) (IoU) 메트릭을 로드하세요 (메트릭을 로드하고 계산하는 방법에 대해 자세히 알아보려면 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour)를 살펴보세요). + +```py +>>> import evaluate + +>>> metric = evaluate.load("mean_iou") +``` + +그런 다음 메트릭을 [`~evaluate.EvaluationModule.compute`]하는 함수를 만듭니다. 예측을 먼저 로짓으로 변환한 다음, 레이블의 크기에 맞게 모양을 다시 지정해야 [`~evaluate.EvaluationModule.compute`]를 호출할 수 있습니다: + + + + +```py +>>> import numpy as np +>>> import torch +>>> from torch import nn + +>>> def compute_metrics(eval_pred): +... with torch.no_grad(): +... logits, labels = eval_pred +... logits_tensor = torch.from_numpy(logits) +... logits_tensor = nn.functional.interpolate( +... logits_tensor, +... size=labels.shape[-2:], +... mode="bilinear", +... align_corners=False, +... ).argmax(dim=1) + +... pred_labels = logits_tensor.detach().cpu().numpy() +... metrics = metric.compute( +... predictions=pred_labels, +... references=labels, +... num_labels=num_labels, +... ignore_index=255, +... reduce_labels=False, +... ) +... for key, value in metrics.items(): +... if isinstance(value, np.ndarray): +... metrics[key] = value.tolist() +... return metrics +``` + + + + + + + + +```py +>>> def compute_metrics(eval_pred): +... logits, labels = eval_pred +... logits = tf.transpose(logits, perm=[0, 2, 3, 1]) +... logits_resized = tf.image.resize( +... logits, +... size=tf.shape(labels)[1:], +... method="bilinear", +... ) + +... pred_labels = tf.argmax(logits_resized, axis=-1) +... metrics = metric.compute( +... predictions=pred_labels, +... references=labels, +... num_labels=num_labels, +... ignore_index=-1, +... reduce_labels=image_processor.do_reduce_labels, +... ) + +... per_category_accuracy = metrics.pop("per_category_accuracy").tolist() +... per_category_iou = metrics.pop("per_category_iou").tolist() + +... metrics.update({f"accuracy_{id2label[i]}": v for i, v in enumerate(per_category_accuracy)}) +... metrics.update({f"iou_{id2label[i]}": v for i, v in enumerate(per_category_iou)}) +... 
return {"val_" + k: v for k, v in metrics.items()} +``` + + + + +이제 `compute_metrics` 함수를 사용할 준비가 되었습니다. 트레이닝을 설정할 때 이 함수로 돌아가게 됩니다. + +## 학습하기[[train]] + + + + +만약 [`Trainer`]를 사용해 모델을 미세 조정하는 것에 익숙하지 않다면, [여기](../training#finetune-with-trainer)에서 기본 튜토리얼을 살펴보세요! + + + +이제 모델 학습을 시작할 준비가 되었습니다! [`AutoModelForSemanticSegmentation`]로 SegFormer를 불러오고, 모델에 레이블 ID와 레이블 클래스 간의 매핑을 전달합니다: + +```py +>>> from transformers import AutoModelForSemanticSegmentation, TrainingArguments, Trainer + +>>> model = AutoModelForSemanticSegmentation.from_pretrained(checkpoint, id2label=id2label, label2id=label2id) +``` + +이제 세 단계만 남았습니다: + +1. 학습 하이퍼파라미터를 [`TrainingArguments`]에 정의합니다. `image` 열이 삭제되기 때문에 사용하지 않는 열을 제거하지 않는 것이 중요합니다. `image` 열이 없으면 `pixel_values`을 생성할 수 없습니다. 이런 경우를 방지하려면 `remove_unused_columns=False`로 설정하세요! 유일하게 필요한 다른 매개변수는 모델을 저장할 위치를 지정하는 `output_dir`입니다. `push_to_hub=True`를 설정하여 이 모델을 Hub에 푸시합니다(모델을 업로드하려면 Hugging Face에 로그인해야 합니다). 각 에포크가 끝날 때마다 [`Trainer`]가 IoU 메트릭을 평가하고 학습 체크포인트를 저장합니다. +2. 모델, 데이터 세트, 토크나이저, 데이터 콜레이터, `compute_metrics` 함수와 함께 학습 인자를 [`Trainer`]에 전달하세요. +3. 모델을 미세 조정하기 위해 [`~Trainer.train`]를 호출하세요. + +```py +>>> training_args = TrainingArguments( +... output_dir="segformer-b0-scene-parse-150", +... learning_rate=6e-5, +... num_train_epochs=50, +... per_device_train_batch_size=2, +... per_device_eval_batch_size=2, +... save_total_limit=3, +... evaluation_strategy="steps", +... save_strategy="steps", +... save_steps=20, +... eval_steps=20, +... logging_steps=1, +... eval_accumulation_steps=5, +... remove_unused_columns=False, +... push_to_hub=True, +... ) + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=train_ds, +... eval_dataset=test_ds, +... compute_metrics=compute_metrics, +... ) + +>>> trainer.train() +``` +학습이 완료되면, 누구나 모델을 사용할 수 있도록 [`~transformers.Trainer.push_to_hub`] 메서드를 사용해 Hub에 모델을 공유하세요: + +```py +>>> trainer.push_to_hub() +``` + + + + + + + +Keras로 모델을 미세 조정하는 데 익숙하지 않은 경우, 먼저 [기본 튜토리얼](../training#train-a-tensorflow-model-with-keras)을 확인해보세요! + + + +TensorFlow에서 모델을 미세 조정하려면 다음 단계를 따르세요: +1. 학습 하이퍼파라미터를 정의하고 옵티마이저와 학습률 스케쥴러를 설정하세요. +2. 사전 학습된 모델을 인스턴스화하세요. +3. 🤗 Dataset을 `tf.data.Dataset`로 변환하세요. +4. 모델을 컴파일하세요. +5. 콜백을 추가하여 메트릭을 계산하고 🤗 Hub에 모델을 업로드하세요. +6. `fit()` 메서드를 사용하여 훈련을 실행하세요. + +하이퍼파라미터, 옵티마이저, 학습률 스케쥴러를 정의하는 것으로 시작하세요: + +```py +>>> from transformers import create_optimizer + +>>> batch_size = 2 +>>> num_epochs = 50 +>>> num_train_steps = len(train_ds) * num_epochs +>>> learning_rate = 6e-5 +>>> weight_decay_rate = 0.01 + +>>> optimizer, lr_schedule = create_optimizer( +... init_lr=learning_rate, +... num_train_steps=num_train_steps, +... weight_decay_rate=weight_decay_rate, +... num_warmup_steps=0, +... ) +``` + +그런 다음 레이블 매핑과 함께 [`TFAutoModelForSemanticSegmentation`]을 사용하여 SegFormer를 불러오고 옵티마이저로 컴파일합니다. 트랜스포머 모델은 모두 디폴트로 태스크 관련 손실 함수가 있으므로 원치 않으면 지정할 필요가 없습니다: + +```py +>>> from transformers import TFAutoModelForSemanticSegmentation + +>>> model = TFAutoModelForSemanticSegmentation.from_pretrained( +... checkpoint, +... id2label=id2label, +... label2id=label2id, +... ) +>>> model.compile(optimizer=optimizer) # 손실 함수 인자가 없습니다! +``` + +[`~datasets.Dataset.to_tf_dataset`] 와 [`DefaultDataCollator`]를 사용해 데이터 세트를 `tf.data.Dataset` 포맷으로 변환하세요: + +```py +>>> from transformers import DefaultDataCollator + +>>> data_collator = DefaultDataCollator(return_tensors="tf") + +>>> tf_train_dataset = train_ds.to_tf_dataset( +... columns=["pixel_values", "label"], +... shuffle=True, +... batch_size=batch_size, +... 
collate_fn=data_collator, +... ) + +>>> tf_eval_dataset = test_ds.to_tf_dataset( +... columns=["pixel_values", "label"], +... shuffle=True, +... batch_size=batch_size, +... collate_fn=data_collator, +... ) +``` + +예측으로 정확도를 계산하고 모델을 🤗 Hub로 푸시하려면 [Keras callbacks](../main_classes/keras_callbacks)를 사용하세요. `compute_metrics` 함수를 [`KerasMetricCallback`]에 전달하고, 모델 업로드를 위해 [`PushToHubCallback`]를 사용하세요: + +```py +>>> from transformers.keras_callbacks import KerasMetricCallback, PushToHubCallback + +>>> metric_callback = KerasMetricCallback( +... metric_fn=compute_metrics, eval_dataset=tf_eval_dataset, batch_size=batch_size, label_cols=["labels"] +... ) + +>>> push_to_hub_callback = PushToHubCallback(output_dir="scene_segmentation", tokenizer=image_processor) + +>>> callbacks = [metric_callback, push_to_hub_callback] +``` + +이제 모델을 훈련할 준비가 되었습니다! 훈련 및 검증 데이터 세트, 에포크 수와 함께 `fit()`을 호출하고, 콜백을 사용하여 모델을 미세 조정합니다: + +```py +>>> model.fit( +... tf_train_dataset, +... validation_data=tf_eval_dataset, +... callbacks=callbacks, +... epochs=num_epochs, +... ) +``` + +축하합니다! 모델을 미세 조정하고 🤗 Hub에 공유했습니다. 이제 추론에 사용할 수 있습니다! + + + + + +## 추론하기[[inference]] + +이제 모델을 미세 조정했으니 추론에 사용할 수 있습니다! + +추론할 이미지를 로드하세요: + +```py +>>> image = ds[0]["image"] +>>> image +``` + +
+ Image of bedroom +
+ + + + +추론을 위해 미세 조정한 모델을 시험해 보는 가장 간단한 방법은 [`pipeline`]에서 사용하는 것입니다. 모델을 사용하여 이미지 분할을 위한 `pipeline`을 인스턴스화하고 이미지를 전달합니다: + +```py +>>> from transformers import pipeline + +>>> segmenter = pipeline("image-segmentation", model="my_awesome_seg_model") +>>> segmenter(image) +[{'score': None, + 'label': 'wall', + 'mask': }, + {'score': None, + 'label': 'sky', + 'mask': }, + {'score': None, + 'label': 'floor', + 'mask': }, + {'score': None, + 'label': 'ceiling', + 'mask': }, + {'score': None, + 'label': 'bed ', + 'mask': }, + {'score': None, + 'label': 'windowpane', + 'mask': }, + {'score': None, + 'label': 'cabinet', + 'mask': }, + {'score': None, + 'label': 'chair', + 'mask': }, + {'score': None, + 'label': 'armchair', + 'mask': }] +``` +원하는 경우 `pipeline`의 결과를 수동으로 복제할 수도 있습니다. 이미지 프로세서로 이미지를 처리하고 `pixel_values`을 GPU에 배치합니다: + +```py +>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # 가능하다면 GPU를 사용하고, 그렇지 않다면 CPU를 사용하세요 +>>> encoding = image_processor(image, return_tensors="pt") +>>> pixel_values = encoding.pixel_values.to(device) +``` + +모델에 입력을 전달하고 `logits`를 반환합니다: + +```py +>>> outputs = model(pixel_values=pixel_values) +>>> logits = outputs.logits.cpu() +``` +그런 다음 로짓의 크기를 원본 이미지 크기로 다시 조정합니다: + +```py +>>> upsampled_logits = nn.functional.interpolate( +... logits, +... size=image.size[::-1], +... mode="bilinear", +... align_corners=False, +... ) + +>>> pred_seg = upsampled_logits.argmax(dim=1)[0] +``` + + + + + + +이미지 프로세서를 로드하여 이미지를 전처리하고 입력을 TensorFlow 텐서로 반환합니다: + +```py +>>> from transformers import AutoImageProcessor + +>>> image_processor = AutoImageProcessor.from_pretrained("MariaK/scene_segmentation") +>>> inputs = image_processor(image, return_tensors="tf") +``` + +모델에 입력을 전달하고 `logits`를 반환합니다: + +```py +>>> from transformers import TFAutoModelForSemanticSegmentation + +>>> model = TFAutoModelForSemanticSegmentation.from_pretrained("MariaK/scene_segmentation") +>>> logits = model(**inputs).logits +``` + +그런 다음 로그를 원본 이미지 크기로 재조정하고 클래스 차원에 argmax를 적용합니다: + +```py +>>> logits = tf.transpose(logits, [0, 2, 3, 1]) + +>>> upsampled_logits = tf.image.resize( +... logits, +... # `image.size`가 너비와 높이를 반환하기 때문에 `image`의 모양을 반전시킵니다 +... image.size[::-1], +... ) + +>>> pred_seg = tf.math.argmax(upsampled_logits, axis=-1)[0] +``` + + + + +결과를 시각화하려면 [dataset color palette](https://github.com/tensorflow/models/blob/3f1ca33afe3c1631b733ea7e40c294273b9e406d/research/deeplab/utils/get_dataset_colormap.py#L51)를 각 클래스를 RGB 값에 매핑하는 `ade_palette()`로 로드합니다. 그런 다음 이미지와 예측된 분할 지도(segmentation map)을 결합하여 구성할 수 있습니다: + +```py +>>> import matplotlib.pyplot as plt +>>> import numpy as np + +>>> color_seg = np.zeros((pred_seg.shape[0], pred_seg.shape[1], 3), dtype=np.uint8) +>>> palette = np.array(ade_palette()) +>>> for label, color in enumerate(palette): +... color_seg[pred_seg == label, :] = color +>>> color_seg = color_seg[..., ::-1] # BGR로 변환 + +>>> img = np.array(image) * 0.5 + color_seg * 0.5 # 분할 지도으로 이미지 구성 +>>> img = img.astype(np.uint8) + +>>> plt.figure(figsize=(15, 10)) +>>> plt.imshow(img) +>>> plt.show() +``` + +
+ Image of bedroom overlaid with segmentation map +
diff --git a/docs/source/ko/tasks/sequence_classification.md b/docs/source/ko/tasks/sequence_classification.md new file mode 100644 index 00000000000000..a1a5da50e9f614 --- /dev/null +++ b/docs/source/ko/tasks/sequence_classification.md @@ -0,0 +1,395 @@ + + +# 텍스트 분류[[text-classification]] + +[[open-in-colab]] + + + +텍스트 분류는 자연어 처리의 일종으로, 텍스트에 레이블 또는 클래스를 지정하는 작업입니다. 많은 대기업이 다양한 실용적인 응용 분야에서 텍스트 분류를 운영하고 있습니다. 가장 인기 있는 텍스트 분류 형태 중 하나는 감성 분석으로, 텍스트 시퀀스에 🙂 긍정, 🙁 부정 또는 😐 중립과 같은 레이블을 지정합니다. + +이 가이드에서 학습할 내용은: + +1. [IMDb](https://huggingface.co/datasets/imdb) 데이터셋에서 [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased)를 파인 튜닝하여 영화 리뷰가 긍정적인지 부정적인지 판단합니다. +2. 추론을 위해 파인 튜닝 모델을 사용합니다. + + +이 튜토리얼에서 설명하는 작업은 다음 모델 아키텍처에 의해 지원됩니다: + + + +[ALBERT](../model_doc/albert), [BART](../model_doc/bart), [BERT](../model_doc/bert), [BigBird](../model_doc/big_bird), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [BLOOM](../model_doc/bloom), [CamemBERT](../model_doc/camembert), [CANINE](../model_doc/canine), [ConvBERT](../model_doc/convbert), [CTRL](../model_doc/ctrl), [Data2VecText](../model_doc/data2vec-text), [DeBERTa](../model_doc/deberta), [DeBERTa-v2](../model_doc/deberta-v2), [DistilBERT](../model_doc/distilbert), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [ErnieM](../model_doc/ernie_m), [ESM](../model_doc/esm), [FlauBERT](../model_doc/flaubert), [FNet](../model_doc/fnet), [Funnel Transformer](../model_doc/funnel), [GPT-Sw3](../model_doc/gpt-sw3), [OpenAI GPT-2](../model_doc/gpt2), [GPT Neo](../model_doc/gpt_neo), [GPT-J](../model_doc/gptj), [I-BERT](../model_doc/ibert), [LayoutLM](../model_doc/layoutlm), [LayoutLMv2](../model_doc/layoutlmv2), [LayoutLMv3](../model_doc/layoutlmv3), [LED](../model_doc/led), [LiLT](../model_doc/lilt), [LLaMA](../model_doc/llama), [Longformer](../model_doc/longformer), [LUKE](../model_doc/luke), [MarkupLM](../model_doc/markuplm), [mBART](../model_doc/mbart), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [MobileBERT](../model_doc/mobilebert), [MPNet](../model_doc/mpnet), [MVP](../model_doc/mvp), [Nezha](../model_doc/nezha), [Nyströmformer](../model_doc/nystromformer), [OpenAI GPT](../model_doc/openai-gpt), [OPT](../model_doc/opt), [Perceiver](../model_doc/perceiver), [PLBart](../model_doc/plbart), [QDQBert](../model_doc/qdqbert), [Reformer](../model_doc/reformer), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [SqueezeBERT](../model_doc/squeezebert), [TAPAS](../model_doc/tapas), [Transformer-XL](../model_doc/transfo-xl), [XLM](../model_doc/xlm), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod), [YOSO](../model_doc/yoso) + + + + + + +시작하기 전에, 필요한 모든 라이브러리가 설치되어 있는지 확인하세요: + +```bash +pip install transformers datasets evaluate +``` + +Hugging Face 계정에 로그인하여 모델을 업로드하고 커뮤니티에 공유하는 것을 권장합니다. 메시지가 표시되면, 토큰을 입력하여 로그인하세요: + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## IMDb 데이터셋 가져오기[[load-imdb-dataset]] + +먼저 🤗 Datasets 라이브러리에서 IMDb 데이터셋을 가져옵니다: + +```py +>>> from datasets import load_dataset + +>>> imdb = load_dataset("imdb") +``` + +그런 다음 예시를 살펴봅시다: + +```py +>>> imdb["test"][0] +{ + "label": 0, + "text": "I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. 
I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It's really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it's rubbish as they have to always say \"Gene Roddenberry's Earth...\" otherwise people would not continue watching. Roddenberry's ashes must be turning in their orbit as this dull, cheap, poorly edited (watching it without advert breaks really brings this home) trudging Trabant of a show lumbers into space. Spoiler. So, kill off a main character. And then bring him back as another actor. Jeeez! Dallas all over again.", +} +``` + +이 데이터셋에는 두 가지 필드가 있습니다: + +- `text`: 영화 리뷰 텍스트 +- `label`: `0`은 부정적인 리뷰, `1`은 긍정적인 리뷰를 나타냅니다. + +## 전처리[[preprocess]] + +다음 단계는 DistilBERT 토크나이저를 가져와서 `text` 필드를 전처리하는 것입니다: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") +``` + +`text`를 토큰화하고 시퀀스가 DistilBERT의 최대 입력 길이보다 길지 않도록 자르기 위한 전처리 함수를 생성하세요: + +```py +>>> def preprocess_function(examples): +... return tokenizer(examples["text"], truncation=True) +``` + +전체 데이터셋에 전처리 함수를 적용하려면, 🤗 Datasets [`~datasets.Dataset.map`] 함수를 사용하세요. 데이터셋의 여러 요소를 한 번에 처리하기 위해 `batched=True`로 설정함으로써 데이터셋 `map`를 더 빠르게 처리할 수 있습니다: + +```py +tokenized_imdb = imdb.map(preprocess_function, batched=True) +``` + +이제 [`DataCollatorWithPadding`]를 사용하여 예제 배치를 만들어봅시다. 데이터셋 전체를 최대 길이로 패딩하는 대신, *동적 패딩*을 사용하여 배치에서 가장 긴 길이에 맞게 문장을 패딩하는 것이 효율적입니다. + + + +```py +>>> from transformers import DataCollatorWithPadding + +>>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer) +``` + + +```py +>>> from transformers import DataCollatorWithPadding + +>>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf") +``` + + + +## 평가하기[[evaluate]] + +훈련 중 모델의 성능을 평가하기 위해 메트릭을 포함하는 것이 유용합니다. 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) 라이브러리를 사용하여 빠르게 평가 방법을 로드할 수 있습니다. 이 작업에서는 [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) 메트릭을 가져옵니다. (메트릭을 가져오고 계산하는 방법에 대해서는 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour)를 참조하세요): + +```py +>>> import evaluate + +>>> accuracy = evaluate.load("accuracy") +``` + +그런 다음 `compute_metrics` 함수를 만들어서 예측과 레이블을 계산하여 정확도를 계산하도록 [`~evaluate.EvaluationModule.compute`]를 호출합니다: + +```py +>>> import numpy as np + + +>>> def compute_metrics(eval_pred): +... predictions, labels = eval_pred +... predictions = np.argmax(predictions, axis=1) +... return accuracy.compute(predictions=predictions, references=labels) +``` + +이제 `compute_metrics` 함수는 준비되었고, 훈련 과정을 설정할 때 다시 살펴볼 예정입니다. 
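+
+원한다면, 학습을 시작하기 전에 임의로 만든 값으로 `compute_metrics` 함수가 의도대로 동작하는지 간단히 확인해 볼 수 있습니다. 아래의 로짓과 레이블은 설명을 위해 임의로 만든 예시입니다:
+
+```py
+>>> import numpy as np
+
+>>> dummy_logits = np.array([[-1.2, 2.3], [0.8, -0.5]])  # 임의로 만든 로짓 (배치 크기 2, 클래스 2개)
+>>> dummy_labels = np.array([1, 0])  # 임의로 만든 정답 레이블
+>>> compute_metrics((dummy_logits, dummy_labels))
+{'accuracy': 1.0}
+```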
+ +## 훈련[[train]] + +모델을 훈련하기 전에, `id2label`와 `label2id`를 사용하여 예상되는 id와 레이블의 맵을 생성하세요: + +```py +>>> id2label = {0: "NEGATIVE", 1: "POSITIVE"} +>>> label2id = {"NEGATIVE": 0, "POSITIVE": 1} +``` + + + + + +[`Trainer`]를 사용하여 모델을 파인 튜닝하는 방법에 익숙하지 않은 경우, [여기](../training#train-with-pytorch-trainer)의 기본 튜토리얼을 확인하세요! + + + +이제 모델을 훈련시킬 준비가 되었습니다! [`AutoModelForSequenceClassification`]로 DistilBERT를 가쳐오고 예상되는 레이블 수와 레이블 매핑을 지정하세요: + +```py +>>> from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer + +>>> model = AutoModelForSequenceClassification.from_pretrained( +... "distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id +... ) +``` + +이제 세 단계만 거치면 끝입니다: + +1. [`TrainingArguments`]에서 하이퍼파라미터를 정의하세요. `output_dir`는 모델을 저장할 위치를 지정하는 유일한 파라미터입니다. 이 모델을 Hub에 업로드하기 위해 `push_to_hub=True`를 설정합니다. (모델을 업로드하기 위해 Hugging Face에 로그인해야합니다.) 각 에폭이 끝날 때마다, [`Trainer`]는 정확도를 평가하고 훈련 체크포인트를 저장합니다. +2. [`Trainer`]에 훈련 인수와 모델, 데이터셋, 토크나이저, 데이터 수집기 및 `compute_metrics` 함수를 전달하세요. +3. [`~Trainer.train`]를 호출하여 모델은 파인 튜닝하세요. + +```py +>>> training_args = TrainingArguments( +... output_dir="my_awesome_model", +... learning_rate=2e-5, +... per_device_train_batch_size=16, +... per_device_eval_batch_size=16, +... num_train_epochs=2, +... weight_decay=0.01, +... evaluation_strategy="epoch", +... save_strategy="epoch", +... load_best_model_at_end=True, +... push_to_hub=True, +... ) + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=tokenized_imdb["train"], +... eval_dataset=tokenized_imdb["test"], +... tokenizer=tokenizer, +... data_collator=data_collator, +... compute_metrics=compute_metrics, +... ) + +>>> trainer.train() +``` + + + +[`Trainer`]는 `tokenizer`를 전달하면 기본적으로 동적 매핑을 적용합니다. 이 경우, 명시적으로 데이터 수집기를 지정할 필요가 없습니다. + + + +훈련이 완료되면, [`~transformers.Trainer.push_to_hub`] 메소드를 사용하여 모델을 Hub에 공유할 수 있습니다. + +```py +>>> trainer.push_to_hub() +``` + + + + +Keras를 사용하여 모델을 파인 튜닝하는 방법에 익숙하지 않은 경우, [여기](../training#train-a-tensorflow-model-with-keras)의 기본 튜토리얼을 확인하세요! + + +TensorFlow에서 모델을 파인 튜닝하려면, 먼저 옵티마이저 함수와 학습률 스케쥴, 그리고 일부 훈련 하이퍼파라미터를 설정해야 합니다: + +```py +>>> from transformers import create_optimizer +>>> import tensorflow as tf + +>>> batch_size = 16 +>>> num_epochs = 5 +>>> batches_per_epoch = len(tokenized_imdb["train"]) // batch_size +>>> total_train_steps = int(batches_per_epoch * num_epochs) +>>> optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps) +``` + +그런 다음 [`TFAutoModelForSequenceClassification`]을 사용하여 DistilBERT를 로드하고, 예상되는 레이블 수와 레이블 매핑을 로드할 수 있습니다: + +```py +>>> from transformers import TFAutoModelForSequenceClassification + +>>> model = TFAutoModelForSequenceClassification.from_pretrained( +... "distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id +... ) +``` + +[`~transformers.TFPreTrainedModel.prepare_tf_dataset`]을 사용하여 데이터셋을 `tf.data.Dataset` 형식으로 변환합니다: + +```py +>>> tf_train_set = model.prepare_tf_dataset( +... tokenized_imdb["train"], +... shuffle=True, +... batch_size=16, +... collate_fn=data_collator, +... ) + +>>> tf_validation_set = model.prepare_tf_dataset( +... tokenized_imdb["test"], +... shuffle=False, +... batch_size=16, +... collate_fn=data_collator, +... 
) +``` + +[`compile`](https://keras.io/api/models/model_training_apis/#compile-method)를 사용하여 훈련할 모델을 구성합니다: + +```py +>>> import tensorflow as tf + +>>> model.compile(optimizer=optimizer) +``` + +훈련을 시작하기 전에 설정해야할 마지막 두 가지는 예측에서 정확도를 계산하고, 모델을 Hub에 업로드할 방법을 제공하는 것입니다. 모두 [Keras callbacks](../main_classes/keras_callbacks)를 사용하여 수행됩니다. + +[`~transformers.KerasMetricCallback`]에 `compute_metrics`를 전달하여 정확도를 높입니다. + +```py +>>> from transformers.keras_callbacks import KerasMetricCallback + +>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set) +``` + +[`~transformers.PushToHubCallback`]에서 모델과 토크나이저를 업로드할 위치를 지정합니다: + +```py +>>> from transformers.keras_callbacks import PushToHubCallback + +>>> push_to_hub_callback = PushToHubCallback( +... output_dir="my_awesome_model", +... tokenizer=tokenizer, +... ) +``` + +그런 다음 콜백을 함께 묶습니다: + +```py +>>> callbacks = [metric_callback, push_to_hub_callback] +``` + +드디어, 모델 훈련을 시작할 준비가 되었습니다! [`fit`](https://keras.io/api/models/model_training_apis/#fit-method)에 훈련 데이터셋, 검증 데이터셋, 에폭의 수 및 콜백을 전달하여 파인 튜닝합니다: + +```py +>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=callbacks) +``` + +훈련이 완료되면, 모델이 자동으로 Hub에 업로드되어 모든 사람이 사용할 수 있습니다! + + + + + +텍스트 분류를 위한 모델을 파인 튜닝하는 자세한 예제는 다음 [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb) 또는 [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification-tf.ipynb)를 참조하세요. + + + +## 추론[[inference]] + +좋아요, 이제 모델을 파인 튜닝했으니 추론에 사용할 수 있습니다! + +추론을 수행하고자 하는 텍스트를 가져와봅시다: + +```py +>>> text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three." +``` + +파인 튜닝된 모델로 추론을 시도하는 가장 간단한 방법은 [`pipeline`]를 사용하는 것입니다. 모델로 감정 분석을 위한 `pipeline`을 인스턴스화하고, 텍스트를 전달해보세요: + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline("sentiment-analysis", model="stevhliu/my_awesome_model") +>>> classifier(text) +[{'label': 'POSITIVE', 'score': 0.9994940757751465}] +``` + +원한다면, `pipeline`의 결과를 수동으로 복제할 수도 있습니다. + + + +텍스트를 토큰화하고 PyTorch 텐서를 반환합니다. + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model") +>>> inputs = tokenizer(text, return_tensors="pt") +``` + +입력을 모델에 전달하고 `logits`을 반환합니다: + +```py +>>> from transformers import AutoModelForSequenceClassification + +>>> model = AutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model") +>>> with torch.no_grad(): +... 
logits = model(**inputs).logits +``` + +가장 높은 확률을 가진 클래스를 모델의 `id2label` 매핑을 사용하여 텍스트 레이블로 변환합니다: + +```py +>>> predicted_class_id = logits.argmax().item() +>>> model.config.id2label[predicted_class_id] +'POSITIVE' +``` + + +텍스트를 토큰화하고 TensorFlow 텐서를 반환합니다: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model") +>>> inputs = tokenizer(text, return_tensors="tf") +``` + +입력값을 모델에 전달하고 `logits`을 반환합니다: + +```py +>>> from transformers import TFAutoModelForSequenceClassification + +>>> model = TFAutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model") +>>> logits = model(**inputs).logits +``` + +가장 높은 확률을 가진 클래스를 모델의 `id2label` 매핑을 사용하여 텍스트 레이블로 변환합니다: + +```py +>>> predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0]) +>>> model.config.id2label[predicted_class_id] +'POSITIVE' +``` + + diff --git a/docs/source/ko/tasks/summarization.md b/docs/source/ko/tasks/summarization.md new file mode 100644 index 00000000000000..43eae25d79f0aa --- /dev/null +++ b/docs/source/ko/tasks/summarization.md @@ -0,0 +1,418 @@ + + +# 요약[[summarization]] + +[[open-in-colab]] + + + +요약은 문서나 기사에서 중요한 정보를 모두 포함하되 짧게 만드는 일입니다. +번역과 마찬가지로, 시퀀스-투-시퀀스 문제로 구성할 수 있는 대표적인 작업 중 하나입니다. +요약에는 아래와 같이 유형이 있습니다: + +- 추출(Extractive) 요약: 문서에서 가장 관련성 높은 정보를 추출합니다. +- 생성(Abstractive) 요약: 가장 관련성 높은 정보를 포착해내는 새로운 텍스트를 생성합니다. + +이 가이드에서 소개할 내용은 아래와 같습니다: + +1. 생성 요약을 위한 [BillSum](https://huggingface.co/datasets/billsum) 데이터셋 중 캘리포니아 주 법안 하위 집합으로 [T5](https://huggingface.co/google-t5/t5-small)를 파인튜닝합니다. +2. 파인튜닝된 모델을 사용하여 추론합니다. + + +이 튜토리얼에서 설명하는 작업은 다음 모델 아키텍처에서 지원됩니다: + + + +[BART](../model_doc/bart), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [Encoder decoder](../model_doc/encoder-decoder), [FairSeq Machine-Translation](../model_doc/fsmt), [GPTSAN-japanese](../model_doc/gptsan-japanese), [LED](../model_doc/led), [LongT5](../model_doc/longt5), [M2M100](../model_doc/m2m_100), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [NLLB](../model_doc/nllb), [NLLB-MOE](../model_doc/nllb-moe), [Pegasus](../model_doc/pegasus), [PEGASUS-X](../model_doc/pegasus_x), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [SwitchTransformers](../model_doc/switch_transformers), [T5](../model_doc/t5), [XLM-ProphetNet](../model_doc/xlm-prophetnet) + + + + + +시작하기 전에 필요한 라이브러리가 모두 설치되어 있는지 확인하세요: + +```bash +pip install transformers datasets evaluate rouge_score +``` + +Hugging Face 계정에 로그인하면 모델을 업로드하고 커뮤니티에 공유할 수 있습니다. +토큰을 입력하여 로그인하세요. + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## BillSum 데이터셋 가져오기[[load-billsum-dataset]] + +🤗 Datasets 라이브러리에서 BillSum 데이터셋의 작은 버전인 캘리포니아 주 법안 하위 집합을 가져오세요: + +```py +>>> from datasets import load_dataset + +>>> billsum = load_dataset("billsum", split="ca_test") +``` + +[`~datasets.Dataset.train_test_split`] 메소드로 데이터셋을 학습용와 테스트용으로 나누세요: + +```py +>>> billsum = billsum.train_test_split(test_size=0.2) +``` + +그런 다음 예시를 하나 살펴보세요: + +```py +>>> billsum["train"][0] +{'summary': 'Existing law authorizes state agencies to enter into contracts for the acquisition of goods or services upon approval by the Department of General Services. 
Existing law sets forth various requirements and prohibitions for those contracts, including, but not limited to, a prohibition on entering into contracts for the acquisition of goods or services of $100,000 or more with a contractor that discriminates between spouses and domestic partners or same-sex and different-sex couples in the provision of benefits. Existing law provides that a contract entered into in violation of those requirements and prohibitions is void and authorizes the state or any person acting on behalf of the state to bring a civil action seeking a determination that a contract is in violation and therefore void. Under existing law, a willful violation of those requirements and prohibitions is a misdemeanor.\nThis bill would also prohibit a state agency from entering into contracts for the acquisition of goods or services of $100,000 or more with a contractor that discriminates between employees on the basis of gender identity in the provision of benefits, as specified. By expanding the scope of a crime, this bill would impose a state-mandated local program.\nThe California Constitution requires the state to reimburse local agencies and school districts for certain costs mandated by the state. Statutory provisions establish procedures for making that reimbursement.\nThis bill would provide that no reimbursement is required by this act for a specified reason.', + 'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 10295.35 is added to the Public Contract Code, to read:\n10295.35.\n(a) (1) Notwithstanding any other law, a state agency shall not enter into any contract for the acquisition of goods or services in the amount of one hundred thousand dollars ($100,000) or more with a contractor that, in the provision of benefits, discriminates between employees on the basis of an employee’s or dependent’s actual or perceived gender identity, including, but not limited to, the employee’s or dependent’s identification as transgender.\n(2) For purposes of this section, “contract” includes contracts with a cumulative amount of one hundred thousand dollars ($100,000) or more per contractor in each fiscal year.\n(3) For purposes of this section, an employee health plan is discriminatory if the plan is not consistent with Section 1365.5 of the Health and Safety Code and Section 10140 of the Insurance Code.\n(4) The requirements of this section shall apply only to those portions of a contractor’s operations that occur under any of the following conditions:\n(A) Within the state.\n(B) On real property outside the state if the property is owned by the state or if the state has a right to occupy the property, and if the contractor’s presence at that location is connected to a contract with the state.\n(C) Elsewhere in the United States where work related to a state contract is being performed.\n(b) Contractors shall treat as confidential, to the maximum extent allowed by law or by the requirement of the contractor’s insurance provider, any request by an employee or applicant for employment benefits or any documentation of eligibility for benefits submitted by an employee or applicant for employment.\n(c) After taking all reasonable measures to find a contractor that complies with this section, as determined by the state agency, the requirements of this section may be waived under any of the following circumstances:\n(1) There is only one prospective contractor willing to enter into a specific contract with the state agency.\n(2) The contract is 
necessary to respond to an emergency, as determined by the state agency, that endangers the public health, welfare, or safety, or the contract is necessary for the provision of essential services, and no entity that complies with the requirements of this section capable of responding to the emergency is immediately available.\n(3) The requirements of this section violate, or are inconsistent with, the terms or conditions of a grant, subvention, or agreement, if the agency has made a good faith attempt to change the terms or conditions of any grant, subvention, or agreement to authorize application of this section.\n(4) The contractor is providing wholesale or bulk water, power, or natural gas, the conveyance or transmission of the same, or ancillary services, as required for ensuring reliable services in accordance with good utility practice, if the purchase of the same cannot practically be accomplished through the standard competitive bidding procedures and the contractor is not providing direct retail services to end users.\n(d) (1) A contractor shall not be deemed to discriminate in the provision of benefits if the contractor, in providing the benefits, pays the actual costs incurred in obtaining the benefit.\n(2) If a contractor is unable to provide a certain benefit, despite taking reasonable measures to do so, the contractor shall not be deemed to discriminate in the provision of benefits.\n(e) (1) Every contract subject to this chapter shall contain a statement by which the contractor certifies that the contractor is in compliance with this section.\n(2) The department or other contracting agency shall enforce this section pursuant to its existing enforcement powers.\n(3) (A) If a contractor falsely certifies that it is in compliance with this section, the contract with that contractor shall be subject to Article 9 (commencing with Section 10420), unless, within a time period specified by the department or other contracting agency, the contractor provides to the department or agency proof that it has complied, or is in the process of complying, with this section.\n(B) The application of the remedies or penalties contained in Article 9 (commencing with Section 10420) to a contract subject to this chapter shall not preclude the application of any existing remedies otherwise available to the department or other contracting agency under its existing enforcement powers.\n(f) Nothing in this section is intended to regulate the contracting practices of any local jurisdiction.\n(g) This section shall be construed so as not to conflict with applicable federal laws, rules, or regulations. In the event that a court or agency of competent jurisdiction holds that federal law, rule, or regulation invalidates any clause, sentence, paragraph, or section of this code or the application thereof to any person or circumstances, it is the intent of the state that the court or agency sever that clause, sentence, paragraph, or section so that the remainder of this section shall remain in effect.\nSEC. 2.\nSection 10295.35 of the Public Contract Code shall not be construed to create any new enforcement authority or responsibility in the Department of General Services or any other contracting agency.\nSEC. 
3.\nNo reimbursement is required by this act pursuant to Section 6 of Article XIII\u2009B of the California Constitution because the only costs that may be incurred by a local agency or school district will be incurred because this act creates a new crime or infraction, eliminates a crime or infraction, or changes the penalty for a crime or infraction, within the meaning of Section 17556 of the Government Code, or changes the definition of a crime within the meaning of Section 6 of Article XIII\u2009B of the California Constitution.', + 'title': 'An act to add Section 10295.35 to the Public Contract Code, relating to public contracts.'} +``` + +여기서 다음 두 개의 필드를 사용하게 됩니다: + +- `text`: 모델의 입력이 될 법안 텍스트입니다. +- `summary`: `text`의 간략한 버전으로 모델의 타겟이 됩니다. + +## 전처리[[preprocess]] + +다음으로 `text`와 `summary`를 처리하기 위한 T5 토크나이저를 가져옵니다: + +```py +>>> from transformers import AutoTokenizer + +>>> checkpoint = "google-t5/t5-small" +>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint) +``` + +생성하려는 전처리 함수는 아래 조건을 만족해야 합니다: + +1. 입력 앞에 프롬프트를 붙여 T5가 요약 작업임을 인식할 수 있도록 합니다. 여러 NLP 작업을 수행할 수 있는 일부 모델은 특정 작업에 대한 프롬프트가 필요합니다. +2. 레이블을 토큰화할 때 `text_target` 인수를 사용합니다. +3. `max_length` 매개변수로 설정된 최대 길이를 넘지 않도록 긴 시퀀스를 잘라냅니다. + +```py +>>> prefix = "summarize: " + + +>>> def preprocess_function(examples): +... inputs = [prefix + doc for doc in examples["text"]] +... model_inputs = tokenizer(inputs, max_length=1024, truncation=True) + +... labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True) + +... model_inputs["labels"] = labels["input_ids"] +... return model_inputs +``` + +전체 데이터셋에 전처리 함수를 적용하려면 🤗 Datasets의 [`~datasets.Dataset.map`] 메소드를 사용하세요. +`batched=True`로 설정하여 데이터셋의 여러 요소를 한 번에 처리하면 `map` 함수의 속도를 높일 수 있습니다. + +```py +>>> tokenized_billsum = billsum.map(preprocess_function, batched=True) +``` + +이제 [`DataCollatorForSeq2Seq`]를 사용하여 예제 배치를 만드세요. +전체 데이터셋을 최대 길이로 패딩하는 것보다 배치마다 가장 긴 문장 길이에 맞춰 *동적 패딩*하는 것이 더 효율적입니다. + + + +```py +>>> from transformers import DataCollatorForSeq2Seq + +>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint) +``` + + +```py +>>> from transformers import DataCollatorForSeq2Seq + +>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint, return_tensors="tf") +``` + + + +## 평가[[evaluate]] + +학습 중에 평가 지표를 포함하면 모델의 성능을 평가하는 데 도움이 되는 경우가 많습니다. +🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) 라이브러리를 사용하면 평가 방법을 빠르게 불러올 수 있습니다. +이 작업에서는 [ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge) 평가 지표를 가져옵니다. +(평가 지표를 불러오고 계산하는 방법은 🤗 Evaluate [둘러보기](https://huggingface.co/docs/evaluate/a_quick_tour)를 참조하세요.) + +```py +>>> import evaluate + +>>> rouge = evaluate.load("rouge") +``` + +그런 다음 예측값과 레이블을 [`~evaluate.EvaluationModule.compute`]에 전달하여 ROUGE 지표를 계산하는 함수를 만듭니다: + +```py +>>> import numpy as np + + +>>> def compute_metrics(eval_pred): +... predictions, labels = eval_pred +... decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True) +... labels = np.where(labels != -100, labels, tokenizer.pad_token_id) +... decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True) + +... result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True) + +... prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions] +... result["gen_len"] = np.mean(prediction_lens) + +... return {k: round(v, 4) for k, v in result.items()} +``` + +이제 `compute_metrics` 함수를 사용할 준비가 되었으며, 학습을 설정할 때 이 함수로 되돌아올 것입니다. 
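+
+참고로, ROUGE 점수가 어떤 형태로 반환되는지 미리 확인하고 싶다면, 위에서 불러온 `rouge` 지표를 임의의 문장 쌍으로 직접 호출해 볼 수 있습니다. 아래 문장은 설명을 위해 임의로 만든 예시이며, 예측과 참조가 완전히 일치하므로 모든 점수가 1.0이 됩니다:
+
+```py
+>>> rouge.compute(
+...     predictions=["the bill would prohibit state agencies from entering into certain contracts"],
+...     references=["the bill would prohibit state agencies from entering into certain contracts"],
+...     use_stemmer=True,
+... )
+{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}
+```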
+ +## 학습[[train]] + + + + + +모델을 [`Trainer`]로 파인튜닝 하는 것이 익숙하지 않다면, [여기](../training#train-with-pytorch-trainer)에서 기본 튜토리얼을 확인해보세요! + + + +이제 모델 학습을 시작할 준비가 되었습니다! [`AutoModelForSeq2SeqLM`]로 T5를 가져오세요: + +```py +>>> from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer + +>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint) +``` + +이제 세 단계만 남았습니다: + +1. [`Seq2SeqTrainingArguments`]에서 학습 하이퍼파라미터를 정의하세요. +유일한 필수 매개변수는 모델을 저장할 위치를 지정하는 `output_dir`입니다. +`push_to_hub=True`를 설정하여 이 모델을 Hub에 푸시할 수 있습니다(모델을 업로드하려면 Hugging Face에 로그인해야 합니다.) +[`Trainer`]는 각 에폭이 끝날 때마다 ROUGE 지표를 평가하고 학습 체크포인트를 저장합니다. +2. 모델, 데이터셋, 토크나이저, 데이터 콜레이터 및 `compute_metrics` 함수와 함께 학습 인수를 [`Seq2SeqTrainer`]에 전달하세요. +3. [`~Trainer.train`]을 호출하여 모델을 파인튜닝하세요. + +```py +>>> training_args = Seq2SeqTrainingArguments( +... output_dir="my_awesome_billsum_model", +... evaluation_strategy="epoch", +... learning_rate=2e-5, +... per_device_train_batch_size=16, +... per_device_eval_batch_size=16, +... weight_decay=0.01, +... save_total_limit=3, +... num_train_epochs=4, +... predict_with_generate=True, +... fp16=True, +... push_to_hub=True, +... ) + +>>> trainer = Seq2SeqTrainer( +... model=model, +... args=training_args, +... train_dataset=tokenized_billsum["train"], +... eval_dataset=tokenized_billsum["test"], +... tokenizer=tokenizer, +... data_collator=data_collator, +... compute_metrics=compute_metrics, +... ) + +>>> trainer.train() +``` + +학습이 완료되면, 누구나 모델을 사용할 수 있도록 [`~transformers.Trainer.push_to_hub`] 메소드로 Hub에 공유합니다: + +```py +>>> trainer.push_to_hub() +``` + + + + +Keras로 모델 파인튜닝을 하는 것이 익숙하지 않다면, [여기](../training#train-a-tensorflow-model-with-keras)에서 기본적인 튜토리얼을 확인하세요! + + +TensorFlow에서 모델을 파인튜닝하려면, 먼저 옵티마이저, 학습률 스케줄 그리고 몇 가지 학습 하이퍼파라미터를 설정하세요: + +```py +>>> from transformers import create_optimizer, AdamWeightDecay + +>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01) +``` + +그런 다음 [`TFAutoModelForSeq2SeqLM`]을 사용하여 T5를 가져오세요: + +```py +>>> from transformers import TFAutoModelForSeq2SeqLM + +>>> model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint) +``` + +[`~transformers.TFPreTrainedModel.prepare_tf_dataset`]을 사용하여 데이터셋을 `tf.data.Dataset` 형식으로 변환하세요: + +```py +>>> tf_train_set = model.prepare_tf_dataset( +... tokenized_billsum["train"], +... shuffle=True, +... batch_size=16, +... collate_fn=data_collator, +... ) + +>>> tf_test_set = model.prepare_tf_dataset( +... tokenized_billsum["test"], +... shuffle=False, +... batch_size=16, +... collate_fn=data_collator, +... ) +``` + +[`compile`](https://keras.io/api/models/model_training_apis/#compile-method)을 사용하여 모델을 학습할 수 있도록 구성하세요: + +```py +>>> import tensorflow as tf + +>>> model.compile(optimizer=optimizer) +``` + +학습을 시작하기 전에 설정해야 할 마지막 두 가지는 예측에서 ROUGE 점수를 계산하고 모델을 Hub에 푸시하는 방법을 제공하는 것입니다. +두 작업 모두 [Keras callbacks](../main_classes/keras_callbacks)으로 수행할 수 있습니다. + +[`~transformers.KerasMetricCallback`]에 `compute_metrics` 함수를 전달하세요: + +```py +>>> from transformers.keras_callbacks import KerasMetricCallback + +>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set) +``` + +[`~transformers.PushToHubCallback`]에서 모델과 토크나이저를 푸시할 위치를 지정하세요: + +```py +>>> from transformers.keras_callbacks import PushToHubCallback + +>>> push_to_hub_callback = PushToHubCallback( +... output_dir="my_awesome_billsum_model", +... tokenizer=tokenizer, +... ) +``` + +그런 다음 콜백을 번들로 묶어줍니다: + +```py +>>> callbacks = [metric_callback, push_to_hub_callback] +``` + +드디어 모델 학습을 시작할 준비가 되었습니다! 
+학습 및 검증 데이터셋, 에폭 수 및 콜백과 함께 [`fit`](https://keras.io/api/models/model_training_apis/#fit-method)을 호출하여 모델을 파인튜닝하세요. + +```py +>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=callbacks) +``` + +학습이 완료되면 모델이 자동으로 Hub에 업로드되어 누구나 사용할 수 있게 됩니다! + + + + + +요약을 위해 모델을 파인튜닝하는 방법에 대한 더 자세한 예제를 보려면 [PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization.ipynb) +또는 [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/summarization-tf.ipynb)을 참고하세요. + + + +## 추론[[inference]] + +좋아요, 이제 모델을 파인튜닝했으니 추론에 사용할 수 있습니다! + +요약할 텍스트를 작성해보세요. T5의 경우 작업에 따라 입력 앞에 접두사를 붙여야 합니다. 요약의 경우, 아래와 같은 접두사를 입력 앞에 붙여야 합니다: + +```py +>>> text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes." +``` + +추론을 위해 파인튜닝한 모델을 시험해 보는 가장 간단한 방법은 [`pipeline`]에서 사용하는 것입니다. +모델을 사용하여 요약을 수행할 [`pipeline`]을 인스턴스화하고 텍스트를 전달하세요: + +```py +>>> from transformers import pipeline + +>>> summarizer = pipeline("summarization", model="stevhliu/my_awesome_billsum_model") +>>> summarizer(text) +[{"summary_text": "The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country."}] +``` + +원한다면 수동으로 다음과 같은 작업을 수행하여 [`pipeline`]의 결과와 동일한 결과를 얻을 수 있습니다: + + + + +텍스트를 토크나이즈하고 `input_ids`를 PyTorch 텐서로 반환합니다: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_billsum_model") +>>> inputs = tokenizer(text, return_tensors="pt").input_ids +``` + +요약문을 생성하려면 [`~transformers.generation_utils.GenerationMixin.generate`] 메소드를 사용하세요. +텍스트 생성에 대한 다양한 전략과 생성을 제어하기 위한 매개변수에 대한 자세한 내용은 [텍스트 생성](../main_classes/text_generation) API를 참조하세요. + +```py +>>> from transformers import AutoModelForSeq2SeqLM + +>>> model = AutoModelForSeq2SeqLM.from_pretrained("stevhliu/my_awesome_billsum_model") +>>> outputs = model.generate(inputs, max_new_tokens=100, do_sample=False) +``` + +생성된 토큰 ID를 텍스트로 디코딩합니다: + +```py +>>> tokenizer.decode(outputs[0], skip_special_tokens=True) +'the inflation reduction act lowers prescription drug costs, health care costs, and energy costs. it's the most aggressive action on tackling the climate crisis in american history. it will ask the ultra-wealthy and corporations to pay their fair share.' +``` + + +텍스트를 토크나이즈하고 `input_ids`를 TensorFlow 텐서로 반환합니다: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_billsum_model") +>>> inputs = tokenizer(text, return_tensors="tf").input_ids +``` + +요약문을 생성하려면 [`~transformers.generation_tf_utils.TFGenerationMixin.generate`] 메소드를 사용하세요. +텍스트 생성에 대한 다양한 전략과 생성을 제어하기 위한 매개변수에 대한 자세한 내용은 [텍스트 생성](../main_classes/text_generation) API를 참조하세요. 
+ +```py +>>> from transformers import TFAutoModelForSeq2SeqLM + +>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("stevhliu/my_awesome_billsum_model") +>>> outputs = model.generate(inputs, max_new_tokens=100, do_sample=False) +``` + +생성된 토큰 ID를 텍스트로 디코딩합니다: + +```py +>>> tokenizer.decode(outputs[0], skip_special_tokens=True) +'the inflation reduction act lowers prescription drug costs, health care costs, and energy costs. it's the most aggressive action on tackling the climate crisis in american history. it will ask the ultra-wealthy and corporations to pay their fair share.' +``` + + diff --git a/docs/source/ko/tasks/token_classification.md b/docs/source/ko/tasks/token_classification.md new file mode 100644 index 00000000000000..1e49d79a0d7235 --- /dev/null +++ b/docs/source/ko/tasks/token_classification.md @@ -0,0 +1,560 @@ + + +# 토큰 분류[[token-classification]] + +[[open-in-colab]] + + + +토큰 분류는 문장의 개별 토큰에 레이블을 할당합니다. 가장 일반적인 토큰 분류 작업 중 하나는 개체명 인식(Named Entity Recognition, NER)입니다. 개체명 인식은 문장에서 사람, 위치 또는 조직과 같은 각 개체의 레이블을 찾으려고 시도합니다. + +이 가이드에서 학습할 내용은: + +1. [WNUT 17](https://huggingface.co/datasets/wnut_17) 데이터 세트에서 [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased)를 파인 튜닝하여 새로운 개체를 탐지합니다. +2. 추론을 위해 파인 튜닝 모델을 사용합니다. + + +이 튜토리얼에서 설명하는 작업은 다음 모델 아키텍처에 의해 지원됩니다: + + + +[ALBERT](../model_doc/albert), [BERT](../model_doc/bert), [BigBird](../model_doc/big_bird), [BioGpt](../model_doc/biogpt), [BLOOM](../model_doc/bloom), [CamemBERT](../model_doc/camembert), [CANINE](../model_doc/canine), [ConvBERT](../model_doc/convbert), [Data2VecText](../model_doc/data2vec-text), [DeBERTa](../model_doc/deberta), [DeBERTa-v2](../model_doc/deberta-v2), [DistilBERT](../model_doc/distilbert), [ELECTRA](../model_doc/electra), [ERNIE](../model_doc/ernie), [ErnieM](../model_doc/ernie_m), [ESM](../model_doc/esm), [FlauBERT](../model_doc/flaubert), [FNet](../model_doc/fnet), [Funnel Transformer](../model_doc/funnel), [GPT-Sw3](../model_doc/gpt-sw3), [OpenAI GPT-2](../model_doc/gpt2), [GPTBigCode](../model_doc/gpt_bigcode), [I-BERT](../model_doc/ibert), [LayoutLM](../model_doc/layoutlm), [LayoutLMv2](../model_doc/layoutlmv2), [LayoutLMv3](../model_doc/layoutlmv3), [LiLT](../model_doc/lilt), [Longformer](../model_doc/longformer), [LUKE](../model_doc/luke), [MarkupLM](../model_doc/markuplm), [MEGA](../model_doc/mega), [Megatron-BERT](../model_doc/megatron-bert), [MobileBERT](../model_doc/mobilebert), [MPNet](../model_doc/mpnet), [Nezha](../model_doc/nezha), [Nyströmformer](../model_doc/nystromformer), [QDQBert](../model_doc/qdqbert), [RemBERT](../model_doc/rembert), [RoBERTa](../model_doc/roberta), [RoBERTa-PreLayerNorm](../model_doc/roberta-prelayernorm), [RoCBert](../model_doc/roc_bert), [RoFormer](../model_doc/roformer), [SqueezeBERT](../model_doc/squeezebert), [XLM](../model_doc/xlm), [XLM-RoBERTa](../model_doc/xlm-roberta), [XLM-RoBERTa-XL](../model_doc/xlm-roberta-xl), [XLNet](../model_doc/xlnet), [X-MOD](../model_doc/xmod), [YOSO](../model_doc/yoso) + + + + + +시작하기 전에, 필요한 모든 라이브러리가 설치되어 있는지 확인하세요: + +```bash +pip install transformers datasets evaluate seqeval +``` + +Hugging Face 계정에 로그인하여 모델을 업로드하고 커뮤니티에 공유하는 것을 권장합니다. 
메시지가 표시되면, 토큰을 입력하여 로그인하세요: + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## WNUT 17 데이터 세트 가져오기[[load-wnut-17-dataset]] + +먼저 🤗 Datasets 라이브러리에서 WNUT 17 데이터 세트를 가져옵니다: + +```py +>>> from datasets import load_dataset + +>>> wnut = load_dataset("wnut_17") +``` + +다음 예제를 살펴보세요: + +```py +>>> wnut["train"][0] +{'id': '0', + 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0], + 'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.'] +} +``` + +`ner_tags`의 각 숫자는 개체를 나타냅니다. 숫자를 레이블 이름으로 변환하여 개체가 무엇인지 확인합니다: + +```py +>>> label_list = wnut["train"].features[f"ner_tags"].feature.names +>>> label_list +[ + "O", + "B-corporation", + "I-corporation", + "B-creative-work", + "I-creative-work", + "B-group", + "I-group", + "B-location", + "I-location", + "B-person", + "I-person", + "B-product", + "I-product", +] +``` + +각 `ner_tag`의 앞에 붙은 문자는 개체의 토큰 위치를 나타냅니다: + +- `B-`는 개체의 시작을 나타냅니다. +- `I-`는 토큰이 동일한 개체 내부에 포함되어 있음을 나타냅니다(예를 들어 `State` 토큰은 `Empire State Building`와 같은 개체의 일부입니다). +- `0`는 토큰이 어떤 개체에도 해당하지 않음을 나타냅니다. + +## 전처리[[preprocess]] + + + +다음으로 `tokens` 필드를 전처리하기 위해 DistilBERT 토크나이저를 가져옵니다: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") +``` + +위의 예제 `tokens` 필드를 보면 입력이 이미 토큰화된 것처럼 보입니다. 그러나 실제로 입력은 아직 토큰화되지 않았으므로 단어를 하위 단어로 토큰화하기 위해 `is_split_into_words=True`를 설정해야 합니다. 예제로 확인합니다: + +```py +>>> example = wnut["train"][0] +>>> tokenized_input = tokenizer(example["tokens"], is_split_into_words=True) +>>> tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"]) +>>> tokens +['[CLS]', '@', 'paul', '##walk', 'it', "'", 's', 'the', 'view', 'from', 'where', 'i', "'", 'm', 'living', 'for', 'two', 'weeks', '.', 'empire', 'state', 'building', '=', 'es', '##b', '.', 'pretty', 'bad', 'storm', 'here', 'last', 'evening', '.', '[SEP]'] +``` + +그러나 이로 인해 `[CLS]`과 `[SEP]`라는 특수 토큰이 추가되고, 하위 단어 토큰화로 인해 입력과 레이블 간에 불일치가 발생합니다. 하나의 레이블에 해당하는 단일 단어는 이제 두 개의 하위 단어로 분할될 수 있습니다. 토큰과 레이블을 다음과 같이 재정렬해야 합니다: + +1. [`word_ids`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.BatchEncoding.word_ids) 메소드로 모든 토큰을 해당 단어에 매핑합니다. +2. 특수 토큰 `[CLS]`와 `[SEP]`에 `-100` 레이블을 할당하여, PyTorch 손실 함수가 해당 토큰을 무시하도록 합니다. +3. 주어진 단어의 첫 번째 토큰에만 레이블을 지정합니다. 같은 단어의 다른 하위 토큰에 `-100`을 할당합니다. + +다음은 토큰과 레이블을 재정렬하고 DistilBERT의 최대 입력 길이보다 길지 않도록 시퀀스를 잘라내는 함수를 만드는 방법입니다: + +```py +>>> def tokenize_and_align_labels(examples): +... tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True) + +... labels = [] +... for i, label in enumerate(examples[f"ner_tags"]): +... word_ids = tokenized_inputs.word_ids(batch_index=i) # Map tokens to their respective word. +... previous_word_idx = None +... label_ids = [] +... for word_idx in word_ids: # Set the special tokens to -100. +... if word_idx is None: +... label_ids.append(-100) +... elif word_idx != previous_word_idx: # Only label the first token of a given word. +... label_ids.append(label[word_idx]) +... else: +... label_ids.append(-100) +... previous_word_idx = word_idx +... labels.append(label_ids) + +... tokenized_inputs["labels"] = labels +... return tokenized_inputs +``` + +전체 데이터 세트에 전처리 함수를 적용하려면, 🤗 Datasets [`~datasets.Dataset.map`] 함수를 사용하세요. 
`batched=True`로 설정하여 데이터 세트의 여러 요소를 한 번에 처리하면 `map` 함수의 속도를 높일 수 있습니다: +```py +>>> tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True) +``` + +이제 [`DataCollatorWithPadding`]를 사용하여 예제 배치를 만들어봅시다. 데이터 세트 전체를 최대 길이로 패딩하는 대신, *동적 패딩*을 사용하여 배치에서 가장 긴 길이에 맞게 문장을 패딩하는 것이 효율적입니다. + + + +```py +>>> from transformers import DataCollatorForTokenClassification + +>>> data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer) +``` + + +```py +>>> from transformers import DataCollatorForTokenClassification + +>>> data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer, return_tensors="tf") +``` + + + +## 평가[[evaluation]] + +훈련 중 모델의 성능을 평가하기 위해 평가 지표를 포함하는 것이 유용합니다. 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) 라이브러리를 사용하여 빠르게 평가 방법을 가져올 수 있습니다. 이 작업에서는 [seqeval](https://huggingface.co/spaces/evaluate-metric/seqeval) 평가 지표를 가져옵니다. (평가 지표를 가져오고 계산하는 방법에 대해서는 🤗 Evaluate [빠른 둘러보기](https://huggingface.co/docs/evaluate/a_quick_tour)를 참조하세요). Seqeval은 실제로 정밀도, 재현률, F1 및 정확도와 같은 여러 점수를 산출합니다. + +```py +>>> import evaluate + +>>> seqeval = evaluate.load("seqeval") +``` + +먼저 NER 레이블을 가져온 다음, [`~evaluate.EvaluationModule.compute`]에 실제 예측과 실제 레이블을 전달하여 점수를 계산하는 함수를 만듭니다: + +```py +>>> import numpy as np + +>>> labels = [label_list[i] for i in example[f"ner_tags"]] + + +>>> def compute_metrics(p): +... predictions, labels = p +... predictions = np.argmax(predictions, axis=2) + +... true_predictions = [ +... [label_list[p] for (p, l) in zip(prediction, label) if l != -100] +... for prediction, label in zip(predictions, labels) +... ] +... true_labels = [ +... [label_list[l] for (p, l) in zip(prediction, label) if l != -100] +... for prediction, label in zip(predictions, labels) +... ] + +... results = seqeval.compute(predictions=true_predictions, references=true_labels) +... return { +... "precision": results["overall_precision"], +... "recall": results["overall_recall"], +... "f1": results["overall_f1"], +... "accuracy": results["overall_accuracy"], +... } +``` + +이제 `compute_metrics` 함수를 사용할 준비가 되었으며, 훈련을 설정하면 이 함수로 되돌아올 것입니다. + +## 훈련[[train]] + +모델을 훈련하기 전에, `id2label`와 `label2id`를 사용하여 예상되는 id와 레이블의 맵을 생성하세요: + +```py +>>> id2label = { +... 0: "O", +... 1: "B-corporation", +... 2: "I-corporation", +... 3: "B-creative-work", +... 4: "I-creative-work", +... 5: "B-group", +... 6: "I-group", +... 7: "B-location", +... 8: "I-location", +... 9: "B-person", +... 10: "I-person", +... 11: "B-product", +... 12: "I-product", +... } +>>> label2id = { +... "O": 0, +... "B-corporation": 1, +... "I-corporation": 2, +... "B-creative-work": 3, +... "I-creative-work": 4, +... "B-group": 5, +... "I-group": 6, +... "B-location": 7, +... "I-location": 8, +... "B-person": 9, +... "I-person": 10, +... "B-product": 11, +... "I-product": 12, +... } +``` + + + + + +[`Trainer`]를 사용하여 모델을 파인 튜닝하는 방법에 익숙하지 않은 경우, [여기](../training#train-with-pytorch-trainer)에서 기본 튜토리얼을 확인하세요! + + + +이제 모델을 훈련시킬 준비가 되었습니다! [`AutoModelForSequenceClassification`]로 DistilBERT를 가져오고 예상되는 레이블 수와 레이블 매핑을 지정하세요: + +```py +>>> from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer + +>>> model = AutoModelForTokenClassification.from_pretrained( +... "distilbert/distilbert-base-uncased", num_labels=13, id2label=id2label, label2id=label2id +... ) +``` + +이제 세 단계만 거치면 끝입니다: + +1. [`TrainingArguments`]에서 하이퍼파라미터를 정의하세요. `output_dir`는 모델을 저장할 위치를 지정하는 유일한 매개변수입니다. 이 모델을 허브에 업로드하기 위해 `push_to_hub=True`를 설정합니다(모델을 업로드하기 위해 Hugging Face에 로그인해야합니다.) 
각 에폭이 끝날 때마다, [`Trainer`]는 seqeval 점수를 평가하고 훈련 체크포인트를 저장합니다. +2. [`Trainer`]에 훈련 인수와 모델, 데이터 세트, 토크나이저, 데이터 콜레이터 및 `compute_metrics` 함수를 전달하세요. +3. [`~Trainer.train`]를 호출하여 모델을 파인 튜닝하세요. + +```py +>>> training_args = TrainingArguments( +... output_dir="my_awesome_wnut_model", +... learning_rate=2e-5, +... per_device_train_batch_size=16, +... per_device_eval_batch_size=16, +... num_train_epochs=2, +... weight_decay=0.01, +... evaluation_strategy="epoch", +... save_strategy="epoch", +... load_best_model_at_end=True, +... push_to_hub=True, +... ) + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=tokenized_wnut["train"], +... eval_dataset=tokenized_wnut["test"], +... tokenizer=tokenizer, +... data_collator=data_collator, +... compute_metrics=compute_metrics, +... ) + +>>> trainer.train() +``` + +훈련이 완료되면, [`~transformers.Trainer.push_to_hub`] 메소드를 사용하여 모델을 허브에 공유할 수 있습니다. + +```py +>>> trainer.push_to_hub() +``` + + + + +Keras를 사용하여 모델을 파인 튜닝하는 방법에 익숙하지 않은 경우, [여기](../training#train-a-tensorflow-model-with-keras)의 기본 튜토리얼을 확인하세요! + + +TensorFlow에서 모델을 파인 튜닝하려면, 먼저 옵티마이저 함수와 학습률 스케쥴, 그리고 일부 훈련 하이퍼파라미터를 설정해야 합니다: + +```py +>>> from transformers import create_optimizer + +>>> batch_size = 16 +>>> num_train_epochs = 3 +>>> num_train_steps = (len(tokenized_wnut["train"]) // batch_size) * num_train_epochs +>>> optimizer, lr_schedule = create_optimizer( +... init_lr=2e-5, +... num_train_steps=num_train_steps, +... weight_decay_rate=0.01, +... num_warmup_steps=0, +... ) +``` + +그런 다음 [`TFAutoModelForSequenceClassification`]을 사용하여 DistilBERT를 가져오고, 예상되는 레이블 수와 레이블 매핑을 지정합니다: + +```py +>>> from transformers import TFAutoModelForTokenClassification + +>>> model = TFAutoModelForTokenClassification.from_pretrained( +... "distilbert/distilbert-base-uncased", num_labels=13, id2label=id2label, label2id=label2id +... ) +``` + +[`~transformers.TFPreTrainedModel.prepare_tf_dataset`]을 사용하여 데이터 세트를 `tf.data.Dataset` 형식으로 변환합니다: + +```py +>>> tf_train_set = model.prepare_tf_dataset( +... tokenized_wnut["train"], +... shuffle=True, +... batch_size=16, +... collate_fn=data_collator, +... ) + +>>> tf_validation_set = model.prepare_tf_dataset( +... tokenized_wnut["validation"], +... shuffle=False, +... batch_size=16, +... collate_fn=data_collator, +... ) +``` + +[`compile`](https://keras.io/api/models/model_training_apis/#compile-method)를 사용하여 훈련할 모델을 구성합니다: + +```py +>>> import tensorflow as tf + +>>> model.compile(optimizer=optimizer) +``` + +훈련을 시작하기 전에 설정해야할 마지막 두 가지는 예측에서 seqeval 점수를 계산하고, 모델을 허브에 업로드할 방법을 제공하는 것입니다. 모두 [Keras callbacks](../main_classes/keras_callbacks)를 사용하여 수행됩니다. + +[`~transformers.KerasMetricCallback`]에 `compute_metrics` 함수를 전달하세요: + +```py +>>> from transformers.keras_callbacks import KerasMetricCallback + +>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set) +``` + +[`~transformers.PushToHubCallback`]에서 모델과 토크나이저를 업로드할 위치를 지정합니다: + +```py +>>> from transformers.keras_callbacks import PushToHubCallback + +>>> push_to_hub_callback = PushToHubCallback( +... output_dir="my_awesome_wnut_model", +... tokenizer=tokenizer, +... ) +``` + +그런 다음 콜백을 함께 묶습니다: + +```py +>>> callbacks = [metric_callback, push_to_hub_callback] +``` + +드디어, 모델 훈련을 시작할 준비가 되었습니다! 
[`fit`](https://keras.io/api/models/model_training_apis/#fit-method)에 훈련 데이터 세트, 검증 데이터 세트, 에폭의 수 및 콜백을 전달하여 파인 튜닝합니다: + +```py +>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=callbacks) +``` + +훈련이 완료되면, 모델이 자동으로 허브에 업로드되어 누구나 사용할 수 있습니다! + + + + + +토큰 분류를 위한 모델을 파인 튜닝하는 자세한 예제는 다음 +[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb) +또는 [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification-tf.ipynb)를 참조하세요. + + + +## 추론[[inference]] + +좋아요, 이제 모델을 파인 튜닝했으니 추론에 사용할 수 있습니다! + +추론을 수행하고자 하는 텍스트를 가져와봅시다: + +```py +>>> text = "The Golden State Warriors are an American professional basketball team based in San Francisco." +``` + +파인 튜닝된 모델로 추론을 시도하는 가장 간단한 방법은 [`pipeline`]를 사용하는 것입니다. 모델로 NER의 `pipeline`을 인스턴스화하고, 텍스트를 전달해보세요: + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline("ner", model="stevhliu/my_awesome_wnut_model") +>>> classifier(text) +[{'entity': 'B-location', + 'score': 0.42658573, + 'index': 2, + 'word': 'golden', + 'start': 4, + 'end': 10}, + {'entity': 'I-location', + 'score': 0.35856336, + 'index': 3, + 'word': 'state', + 'start': 11, + 'end': 16}, + {'entity': 'B-group', + 'score': 0.3064001, + 'index': 4, + 'word': 'warriors', + 'start': 17, + 'end': 25}, + {'entity': 'B-location', + 'score': 0.65523505, + 'index': 13, + 'word': 'san', + 'start': 80, + 'end': 83}, + {'entity': 'B-location', + 'score': 0.4668663, + 'index': 14, + 'word': 'francisco', + 'start': 84, + 'end': 93}] +``` + +원한다면, `pipeline`의 결과를 수동으로 복제할 수도 있습니다: + + + +텍스트를 토큰화하고 PyTorch 텐서를 반환합니다: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_wnut_model") +>>> inputs = tokenizer(text, return_tensors="pt") +``` + +입력을 모델에 전달하고 `logits`을 반환합니다: + +```py +>>> from transformers import AutoModelForTokenClassification + +>>> model = AutoModelForTokenClassification.from_pretrained("stevhliu/my_awesome_wnut_model") +>>> with torch.no_grad(): +... 
logits = model(**inputs).logits +``` + +가장 높은 확률을 가진 클래스를 모델의 `id2label` 매핑을 사용하여 텍스트 레이블로 변환합니다: + +```py +>>> predictions = torch.argmax(logits, dim=2) +>>> predicted_token_class = [model.config.id2label[t.item()] for t in predictions[0]] +>>> predicted_token_class +['O', + 'O', + 'B-location', + 'I-location', + 'B-group', + 'O', + 'O', + 'O', + 'O', + 'O', + 'O', + 'O', + 'O', + 'B-location', + 'B-location', + 'O', + 'O'] +``` + + +텍스트를 토큰화하고 TensorFlow 텐서를 반환합니다: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_wnut_model") +>>> inputs = tokenizer(text, return_tensors="tf") +``` + +입력값을 모델에 전달하고 `logits`을 반환합니다: + +```py +>>> from transformers import TFAutoModelForTokenClassification + +>>> model = TFAutoModelForTokenClassification.from_pretrained("stevhliu/my_awesome_wnut_model") +>>> logits = model(**inputs).logits +``` + +가장 높은 확률을 가진 클래스를 모델의 `id2label` 매핑을 사용하여 텍스트 레이블로 변환합니다: + +```py +>>> predicted_token_class_ids = tf.math.argmax(logits, axis=-1) +>>> predicted_token_class = [model.config.id2label[t] for t in predicted_token_class_ids[0].numpy().tolist()] +>>> predicted_token_class +['O', + 'O', + 'B-location', + 'I-location', + 'B-group', + 'O', + 'O', + 'O', + 'O', + 'O', + 'O', + 'O', + 'O', + 'B-location', + 'B-location', + 'O', + 'O'] +``` + + diff --git a/docs/source/ko/tasks/translation.md b/docs/source/ko/tasks/translation.md new file mode 100644 index 00000000000000..6de275f7d04c80 --- /dev/null +++ b/docs/source/ko/tasks/translation.md @@ -0,0 +1,409 @@ + + +# 번역[[translation]] + +[[open-in-colab]] + + + +번역은 한 언어로 된 시퀀스를 다른 언어로 변환합니다. 번역이나 요약은 입력을 받아 일련의 출력을 반환하는 강력한 프레임워크인 시퀀스-투-시퀀스 문제로 구성할 수 있는 대표적인 태스크입니다. 번역 시스템은 일반적으로 다른 언어로 된 텍스트 간의 번역에 사용되지만, 음성 간의 통역이나 텍스트-음성 또는 음성-텍스트와 같은 조합에도 사용될 수 있습니다. + +이 가이드에서 학습할 내용은: + +1. 영어 텍스트를 프랑스어로 번역하기 위해 [T5](https://huggingface.co/google-t5/t5-small) 모델을 OPUS Books 데이터세트의 영어-프랑스어 하위 집합으로 파인튜닝하는 방법과 +2. 파인튜닝된 모델을 추론에 사용하는 방법입니다. + + +이 태스크 가이드는 아래 모델 아키텍처에도 응용할 수 있습니다. + + + +[BART](../model_doc/bart), [BigBird-Pegasus](../model_doc/bigbird_pegasus), [Blenderbot](../model_doc/blenderbot), [BlenderbotSmall](../model_doc/blenderbot-small), [Encoder decoder](../model_doc/encoder-decoder), [FairSeq Machine-Translation](../model_doc/fsmt), [GPTSAN-japanese](../model_doc/gptsan-japanese), [LED](../model_doc/led), [LongT5](../model_doc/longt5), [M2M100](../model_doc/m2m_100), [Marian](../model_doc/marian), [mBART](../model_doc/mbart), [MT5](../model_doc/mt5), [MVP](../model_doc/mvp), [NLLB](../model_doc/nllb), [NLLB-MOE](../model_doc/nllb-moe), [Pegasus](../model_doc/pegasus), [PEGASUS-X](../model_doc/pegasus_x), [PLBart](../model_doc/plbart), [ProphetNet](../model_doc/prophetnet), [SwitchTransformers](../model_doc/switch_transformers), [T5](../model_doc/t5), [XLM-ProphetNet](../model_doc/xlm-prophetnet) + + + + + +시작하기 전에 필요한 라이브러리가 모두 설치되어 있는지 확인하세요: + +```bash +pip install transformers datasets evaluate sacrebleu +``` + +모델을 업로드하고 커뮤니티와 공유할 수 있도록 Hugging Face 계정에 로그인하는 것이 좋습니다. 새로운 창이 표시되면 토큰을 입력하여 로그인하세요. + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## OPUS Books 데이터세트 가져오기[[load-opus-books-dataset]] + +먼저 🤗 Datasets 라이브러리에서 [OPUS Books](https://huggingface.co/datasets/opus_books) 데이터세트의 영어-프랑스어 하위 집합을 가져오세요. + +```py +>>> from datasets import load_dataset + +>>> books = load_dataset("opus_books", "en-fr") +``` + +데이터세트를 [`~datasets.Dataset.train_test_split`] 메서드를 사용하여 훈련 및 테스트 데이터로 분할하세요. 
+ +```py +>>> books = books["train"].train_test_split(test_size=0.2) +``` + +훈련 데이터에서 예시를 살펴볼까요? + +```py +>>> books["train"][0] +{'id': '90560', + 'translation': {'en': 'But this lofty plateau measured only a few fathoms, and soon we reentered Our Element.', + 'fr': 'Mais ce plateau élevé ne mesurait que quelques toises, et bientôt nous fûmes rentrés dans notre élément.'}} +``` + +반환된 딕셔너리의 `translation` 키가 텍스트의 영어, 프랑스어 버전을 포함하고 있는 것을 볼 수 있습니다. + +## 전처리[[preprocess]] + + + +다음 단계로 영어-프랑스어 쌍을 처리하기 위해 T5 토크나이저를 가져오세요. + +```py +>>> from transformers import AutoTokenizer + +>>> checkpoint = "google-t5/t5-small" +>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint) +``` + +만들 전처리 함수는 아래 요구사항을 충족해야 합니다: + +1. T5가 번역 태스크임을 인지할 수 있도록 입력 앞에 프롬프트를 추가하세요. 여러 NLP 태스크를 할 수 있는 모델 중 일부는 이렇게 태스크 프롬프트를 미리 줘야합니다. +2. 원어(영어)과 번역어(프랑스어)를 별도로 토큰화하세요. 영어 어휘로 사전 학습된 토크나이저로 프랑스어 텍스트를 토큰화할 수는 없기 때문입니다. +3. `max_length` 매개변수로 설정한 최대 길이보다 길지 않도록 시퀀스를 truncate하세요. + +```py +>>> source_lang = "en" +>>> target_lang = "fr" +>>> prefix = "translate English to French: " + + +>>> def preprocess_function(examples): +... inputs = [prefix + example[source_lang] for example in examples["translation"]] +... targets = [example[target_lang] for example in examples["translation"]] +... model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True) +... return model_inputs +``` + +전체 데이터세트에 전처리 함수를 적용하려면 🤗 Datasets의 [`~datasets.Dataset.map`] 메서드를 사용하세요. `map` 함수의 속도를 높이려면 `batched=True`를 설정하여 데이터세트의 여러 요소를 한 번에 처리하는 방법이 있습니다. + +```py +>>> tokenized_books = books.map(preprocess_function, batched=True) +``` + +이제 [`DataCollatorForSeq2Seq`]를 사용하여 예제 배치를 생성합니다. 데이터세트의 최대 길이로 전부를 padding하는 대신, 데이터 정렬 중 각 배치의 최대 길이로 문장을 *동적으로 padding*하는 것이 더 효율적입니다. + + + +```py +>>> from transformers import DataCollatorForSeq2Seq + +>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint) +``` + + + +```py +>>> from transformers import DataCollatorForSeq2Seq + +>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint, return_tensors="tf") +``` + + + +## 평가[[evalulate]] + +훈련 중에 메트릭을 포함하면 모델의 성능을 평가하는 데 도움이 됩니다. 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) 라이브러리로 평가 방법(evaluation method)을 빠르게 가져올 수 있습니다. 현재 태스크에 적합한 SacreBLEU 메트릭을 가져오세요. (메트릭을 가져오고 계산하는 방법에 대해 자세히 알아보려면 🤗 Evaluate [둘러보기](https://huggingface.co/docs/evaluate/a_quick_tour)를 참조하세요): + +```py +>>> import evaluate + +>>> metric = evaluate.load("sacrebleu") +``` + +그런 다음 [`~evaluate.EvaluationModule.compute`]에 예측값과 레이블을 전달하여 SacreBLEU 점수를 계산하는 함수를 생성하세요: + +```py +>>> import numpy as np + + +>>> def postprocess_text(preds, labels): +... preds = [pred.strip() for pred in preds] +... labels = [[label.strip()] for label in labels] + +... return preds, labels + + +>>> def compute_metrics(eval_preds): +... preds, labels = eval_preds +... if isinstance(preds, tuple): +... preds = preds[0] +... decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True) + +... labels = np.where(labels != -100, labels, tokenizer.pad_token_id) +... decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True) + +... decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels) + +... result = metric.compute(predictions=decoded_preds, references=decoded_labels) +... result = {"bleu": result["score"]} + +... prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds] +... result["gen_len"] = np.mean(prediction_lens) +... 
result = {k: round(v, 4) for k, v in result.items()}
+... return result
+```
+
+이제 `compute_metrics` 함수는 준비되었고, 훈련 과정을 설정할 때 다시 살펴볼 예정입니다.
+
+## 훈련[[train]]
+
+[`Trainer`]로 모델을 파인튜닝하는 방법에 익숙하지 않다면 [여기](../training#train-with-pytorch-trainer)에서 기본 튜토리얼을 살펴보시기 바랍니다!
+
+모델을 훈련시킬 준비가 되었군요! [`AutoModelForSeq2SeqLM`]으로 T5를 로드하세요:
+
+```py
+>>> from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
+
+>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
+```
+
+이제 세 단계만 거치면 끝입니다:
+
+1. [`Seq2SeqTrainingArguments`]에서 훈련 하이퍼파라미터를 정의하세요. 유일한 필수 매개변수는 모델을 저장할 위치인 `output_dir`입니다. 모델을 Hub에 푸시하기 위해 `push_to_hub=True`로 설정하세요. (모델을 업로드하려면 Hugging Face에 로그인해야 합니다.) [`Trainer`]는 에폭이 끝날 때마다 SacreBLEU 메트릭을 평가하고 훈련 체크포인트를 저장합니다.
+2. [`Seq2SeqTrainer`]에 훈련 인수를 전달하세요. 모델, 데이터 세트, 토크나이저, 데이터 콜레이터 및 `compute_metrics` 함수도 함께 전달해야 합니다.
+3. [`~Trainer.train`]을 호출하여 모델을 파인튜닝하세요.
+
+```py
+>>> training_args = Seq2SeqTrainingArguments(
+... output_dir="my_awesome_opus_books_model",
+... evaluation_strategy="epoch",
+... learning_rate=2e-5,
+... per_device_train_batch_size=16,
+... per_device_eval_batch_size=16,
+... weight_decay=0.01,
+... save_total_limit=3,
+... num_train_epochs=2,
+... predict_with_generate=True,
+... fp16=True,
+... push_to_hub=True,
+... )
+
+>>> trainer = Seq2SeqTrainer(
+... model=model,
+... args=training_args,
+... train_dataset=tokenized_books["train"],
+... eval_dataset=tokenized_books["test"],
+... tokenizer=tokenizer,
+... data_collator=data_collator,
+... compute_metrics=compute_metrics,
+... )
+
+>>> trainer.train()
+```
+
+학습이 완료되면 [`~transformers.Trainer.push_to_hub`] 메서드로 모델을 Hub에 공유하세요. 이러면 누구나 모델을 사용할 수 있게 됩니다:
+
+```py
+>>> trainer.push_to_hub()
+```
+
+Keras로 모델을 파인튜닝하는 방법이 익숙하지 않다면, [여기](../training#train-a-tensorflow-model-with-keras)에서 기본 튜토리얼을 살펴보시기 바랍니다!
+
+TensorFlow에서 모델을 파인튜닝하려면 우선 optimizer 함수, 학습률 스케줄 등의 훈련 하이퍼파라미터를 설정하세요:
+
+```py
+>>> from transformers import AdamWeightDecay
+
+>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
+```
+
+이제 [`TFAutoModelForSeq2SeqLM`]로 T5를 가져오세요:
+
+```py
+>>> from transformers import TFAutoModelForSeq2SeqLM
+
+>>> model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)
+```
+
+[`~transformers.TFPreTrainedModel.prepare_tf_dataset`]로 데이터 세트를 `tf.data.Dataset` 형식으로 변환하세요:
+
+```py
+>>> tf_train_set = model.prepare_tf_dataset(
+... tokenized_books["train"],
+... shuffle=True,
+... batch_size=16,
+... collate_fn=data_collator,
+... )
+
+>>> tf_test_set = model.prepare_tf_dataset(
+... tokenized_books["test"],
+... shuffle=False,
+... batch_size=16,
+... collate_fn=data_collator,
+... )
+```
+
+훈련하기 위해 [`compile`](https://keras.io/api/models/model_training_apis/#compile-method) 메서드로 모델을 구성하세요:
+
+```py
+>>> import tensorflow as tf
+
+>>> model.compile(optimizer=optimizer)
+```
+
+훈련을 시작하기 전에 예측값으로부터 SacreBLEU 메트릭을 계산하는 방법과 모델을 Hub에 업로드하는 방법 두 가지를 미리 설정해둬야 합니다. 둘 다 [Keras callbacks](../main_classes/keras_callbacks)로 구현하세요.
+
+[`~transformers.KerasMetricCallback`]에 `compute_metrics` 함수를 전달하세요.
+
+```py
+>>> from transformers.keras_callbacks import KerasMetricCallback
+
+>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_test_set)
+```
+
+모델과 토크나이저를 업로드할 위치를 [`~transformers.PushToHubCallback`]에서 지정하세요:
+
+```py
+>>> from transformers.keras_callbacks import PushToHubCallback
+
+>>> push_to_hub_callback = PushToHubCallback(
+... output_dir="my_awesome_opus_books_model",
+... tokenizer=tokenizer,
+...
) +``` + +이제 콜백들을 한데로 묶어주세요: + +```py +>>> callbacks = [metric_callback, push_to_hub_callback] +``` + +드디어 모델을 훈련시킬 모든 준비를 마쳤군요! 이제 훈련 및 검증 데이터 세트에 [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) 메서드를 에폭 수와 만들어둔 콜백과 함께 호출하여 모델을 파인튜닝하세요: + +```py +>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=callbacks) +``` + +학습이 완료되면 모델이 자동으로 Hub에 업로드되고, 누구나 사용할 수 있게 됩니다! + + + + + +번역을 위해 모델을 파인튜닝하는 방법에 대한 보다 자세한 예제는 해당 [PyTorch 노트북](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation.ipynb) 또는 [TensorFlow 노트북](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/translation-tf.ipynb)을 참조하세요. + + + +## 추론[[inference]] + +좋아요, 이제 모델을 파인튜닝했으니 추론에 사용할 수 있습니다! + +다른 언어로 번역하고 싶은 텍스트를 써보세요. T5의 경우 원하는 태스크를 입력의 접두사로 추가해야 합니다. 예를 들어 영어에서 프랑스어로 번역하는 경우, 아래와 같은 접두사가 추가됩니다: + +```py +>>> text = "translate English to French: Legumes share resources with nitrogen-fixing bacteria." +``` + +파인튜닝된 모델로 추론하기에 제일 간단한 방법은 [`pipeline`]을 사용하는 것입니다. 해당 모델로 번역 `pipeline`을 만든 뒤, 텍스트를 전달하세요: + +```py +>>> from transformers import pipeline + +>>> translator = pipeline("translation", model="my_awesome_opus_books_model") +>>> translator(text) +[{'translation_text': 'Legumes partagent des ressources avec des bactéries azotantes.'}] +``` + +원한다면 `pipeline`의 결과를 직접 복제할 수도 있습니다: + + + +텍스트를 토큰화하고 `input_ids`를 PyTorch 텐서로 반환하세요: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_opus_books_model") +>>> inputs = tokenizer(text, return_tensors="pt").input_ids +``` + +[`~transformers.generation_utils.GenerationMixin.generate`] 메서드로 번역을 생성하세요. 다양한 텍스트 생성 전략 및 생성을 제어하기 위한 매개변수에 대한 자세한 내용은 [Text Generation](../main_classes/text_generation) API를 살펴보시기 바랍니다. + +```py +>>> from transformers import AutoModelForSeq2SeqLM + +>>> model = AutoModelForSeq2SeqLM.from_pretrained("my_awesome_opus_books_model") +>>> outputs = model.generate(inputs, max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95) +``` + +생성된 토큰 ID들을 다시 텍스트로 디코딩하세요: + +```py +>>> tokenizer.decode(outputs[0], skip_special_tokens=True) +'Les lignées partagent des ressources avec des bactéries enfixant l'azote.' +``` + + +텍스트를 토큰화하고 `input_ids`를 TensorFlow 텐서로 반환하세요: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_opus_books_model") +>>> inputs = tokenizer(text, return_tensors="tf").input_ids +``` + +[`~transformers.generation_tf_utils.TFGenerationMixin.generate`] 메서드로 번역을 생성하세요. 다양한 텍스트 생성 전략 및 생성을 제어하기 위한 매개변수에 대한 자세한 내용은 [Text Generation](../main_classes/text_generation) API를 살펴보시기 바랍니다. + +```py +>>> from transformers import TFAutoModelForSeq2SeqLM + +>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("my_awesome_opus_books_model") +>>> outputs = model.generate(inputs, max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95) +``` + +생성된 토큰 ID들을 다시 텍스트로 디코딩하세요: + +```py +>>> tokenizer.decode(outputs[0], skip_special_tokens=True) +'Les lugumes partagent les ressources avec des bactéries fixatrices d'azote.' +``` + + diff --git a/docs/source/ko/tasks/video_classification.md b/docs/source/ko/tasks/video_classification.md new file mode 100644 index 00000000000000..01dbb0757b6608 --- /dev/null +++ b/docs/source/ko/tasks/video_classification.md @@ -0,0 +1,498 @@ + + +# 영상 분류 [[video-classification]] + +[[open-in-colab]] + + +영상 분류는 영상 전체에 레이블 또는 클래스를 지정하는 작업입니다. 각 영상에는 하나의 클래스가 있을 것으로 예상됩니다. 
영상 분류 모델은 영상을 입력으로 받아 어느 클래스에 속하는지에 대한 예측을 반환합니다. 이러한 모델은 영상이 어떤 내용인지 분류하는 데 사용될 수 있습니다. 영상 분류의 실제 응용 예는 피트니스 앱에서 유용한 동작 / 운동 인식 서비스가 있습니다. 이는 또한 시각 장애인이 이동할 때 보조하는데 사용될 수 있습니다 + +이 가이드에서는 다음을 수행하는 방법을 보여줍니다: + +1. [UCF101](https://www.crcv.ucf.edu/data/UCF101.php) 데이터 세트의 하위 집합을 통해 [VideoMAE](https://huggingface.co/docs/transformers/main/en/model_doc/videomae) 모델을 미세 조정하기. +2. 미세 조정한 모델을 추론에 사용하기. + + + +이 튜토리얼에서 설명하는 작업은 다음 모델 아키텍처에서 지원됩니다: + + + +[TimeSformer](../model_doc/timesformer), [VideoMAE](../model_doc/videomae) + + + + + + +시작하기 전에 필요한 모든 라이브러리가 설치되었는지 확인하세요: +```bash +pip install -q pytorchvideo transformers evaluate +``` + +영상을 처리하고 준비하기 위해 [PyTorchVideo](https://pytorchvideo.org/)(이하 `pytorchvideo`)를 사용합니다. + +커뮤니티에 모델을 업로드하고 공유할 수 있도록 Hugging Face 계정에 로그인하는 것을 권장합니다. 프롬프트가 나타나면 토큰을 입력하여 로그인하세요: + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## UCF101 데이터셋 불러오기 [[load-ufc101-dataset]] + +[UCF-101](https://www.crcv.ucf.edu/data/UCF101.php) 데이터 세트의 하위 집합(subset)을 불러오는 것으로 시작할 수 있습니다. 전체 데이터 세트를 학습하는데 더 많은 시간을 할애하기 전에 데이터의 하위 집합을 불러와 모든 것이 잘 작동하는지 실험하고 확인할 수 있습니다. + +```py +>>> from huggingface_hub import hf_hub_download + +>>> hf_dataset_identifier = "sayakpaul/ucf101-subset" +>>> filename = "UCF101_subset.tar.gz" +>>> file_path = hf_hub_download(repo_id=hf_dataset_identifier, filename=filename, repo_type="dataset") +``` + +데이터 세트의 하위 집합이 다운로드 되면, 압축된 파일의 압축을 해제해야 합니다: +```py +>>> import tarfile + +>>> with tarfile.open(file_path) as t: +... t.extractall(".") +``` + +전체 데이터 세트는 다음과 같이 구성되어 있습니다. + +```bash +UCF101_subset/ + train/ + BandMarching/ + video_1.mp4 + video_2.mp4 + ... + Archery + video_1.mp4 + video_2.mp4 + ... + ... + val/ + BandMarching/ + video_1.mp4 + video_2.mp4 + ... + Archery + video_1.mp4 + video_2.mp4 + ... + ... + test/ + BandMarching/ + video_1.mp4 + video_2.mp4 + ... + Archery + video_1.mp4 + video_2.mp4 + ... + ... +``` + + +정렬된 영상의 경로는 다음과 같습니다: + +```bash +... +'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g07_c04.avi', +'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g07_c06.avi', +'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi', +'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g09_c02.avi', +'UCF101_subset/train/ApplyEyeMakeup/v_ApplyEyeMakeup_g09_c06.avi' +... +``` + +동일한 그룹/장면에 속하는 영상 클립은 파일 경로에서 `g`로 표시되어 있습니다. 예를 들면, `v_ApplyEyeMakeup_g07_c04.avi`와 `v_ApplyEyeMakeup_g07_c06.avi` 이 있습니다. 이 둘은 같은 그룹입니다. + +검증 및 평가 데이터 분할을 할 때, [데이터 누출(data leakage)](https://www.kaggle.com/code/alexisbcook/data-leakage)을 방지하기 위해 동일한 그룹 / 장면의 영상 클립을 사용하지 않아야 합니다. 이 튜토리얼에서 사용하는 하위 집합은 이러한 정보를 고려하고 있습니다. + +그 다음으로, 데이터 세트에 존재하는 라벨을 추출합니다. 또한, 모델을 초기화할 때 도움이 될 딕셔너리(dictionary data type)를 생성합니다. + +* `label2id`: 클래스 이름을 정수에 매핑합니다. +* `id2label`: 정수를 클래스 이름에 매핑합니다. + +```py +>>> class_labels = sorted({str(path).split("/")[2] for path in all_video_file_paths}) +>>> label2id = {label: i for i, label in enumerate(class_labels)} +>>> id2label = {i: label for label, i in label2id.items()} + +>>> print(f"Unique classes: {list(label2id.keys())}.") + +# Unique classes: ['ApplyEyeMakeup', 'ApplyLipstick', 'Archery', 'BabyCrawling', 'BalanceBeam', 'BandMarching', 'BaseballPitch', 'Basketball', 'BasketballDunk', 'BenchPress']. +``` + +이 데이터 세트에는 총 10개의 고유한 클래스가 있습니다. 각 클래스마다 30개의 영상이 훈련 세트에 있습니다 + +## 미세 조정하기 위해 모델 가져오기 [[load-a-model-to-fine-tune]] + +사전 훈련된 체크포인트와 체크포인트에 연관된 이미지 프로세서를 사용하여 영상 분류 모델을 인스턴스화합니다. 모델의 인코더에는 미리 학습된 매개변수가 제공되며, 분류 헤드(데이터를 분류하는 마지막 레이어)는 무작위로 초기화됩니다. 
데이터 세트의 전처리 파이프라인을 작성할 때는 이미지 프로세서가 유용합니다. + +```py +>>> from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification + +>>> model_ckpt = "MCG-NJU/videomae-base" +>>> image_processor = VideoMAEImageProcessor.from_pretrained(model_ckpt) +>>> model = VideoMAEForVideoClassification.from_pretrained( +... model_ckpt, +... label2id=label2id, +... id2label=id2label, +... ignore_mismatched_sizes=True, # provide this in case you're planning to fine-tune an already fine-tuned checkpoint +... ) +``` + +모델을 가져오는 동안, 다음과 같은 경고를 마주칠 수 있습니다: + +```bash +Some weights of the model checkpoint at MCG-NJU/videomae-base were not used when initializing VideoMAEForVideoClassification: [..., 'decoder.decoder_layers.1.attention.output.dense.bias', 'decoder.decoder_layers.2.attention.attention.key.weight'] +- This IS expected if you are initializing VideoMAEForVideoClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). +- This IS NOT expected if you are initializing VideoMAEForVideoClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). +Some weights of VideoMAEForVideoClassification were not initialized from the model checkpoint at MCG-NJU/videomae-base and are newly initialized: ['classifier.bias', 'classifier.weight'] +You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. +``` + + +위 경고는 우리가 일부 가중치(예: `classifier` 층의 가중치와 편향)를 버리고 새로운 `classifier` 층의 가중치와 편향을 무작위로 초기화하고 있다는 것을 알려줍니다. 이 경우에는 미리 학습된 가중치가 없는 새로운 헤드를 추가하고 있으므로, 라이브러리가 모델을 추론에 사용하기 전에 미세 조정하라고 경고를 보내는 것은 당연합니다. 그리고 이제 우리는 이 모델을 미세 조정할 예정입니다. + +**참고** 이 [체크포인트](https://huggingface.co/MCG-NJU/videomae-base-finetuned-kinetics)는 도메인이 많이 중첩된 유사한 다운스트림 작업에 대해 미세 조정하여 얻은 체크포인트이므로 이 작업에서 더 나은 성능을 보일 수 있습니다. `MCG-NJU/videomae-base-finetuned-kinetics` 데이터 세트를 미세 조정하여 얻은 [체크포인트](https://huggingface.co/sayakpaul/videomae-base-finetuned-kinetics-finetuned-ucf101-subset)도 있습니다. + +## 훈련을 위한 데이터 세트 준비하기[[prepare-the-datasets-for-training]] + +영상 전처리를 위해 [PyTorchVideo 라이브러리](https://pytorchvideo.org/)를 활용할 것입니다. 필요한 종속성을 가져오는 것으로 시작하세요. + +```py +>>> import pytorchvideo.data + +>>> from pytorchvideo.transforms import ( +... ApplyTransformToKey, +... Normalize, +... RandomShortSideScale, +... RemoveKey, +... ShortSideScale, +... UniformTemporalSubsample, +... ) + +>>> from torchvision.transforms import ( +... Compose, +... Lambda, +... RandomCrop, +... RandomHorizontalFlip, +... Resize, +... ) +``` + +학습 데이터 세트 변환에는 '균일한 시간 샘플링(uniform temporal subsampling)', '픽셀 정규화(pixel normalization)', '랜덤 잘라내기(random cropping)' 및 '랜덤 수평 뒤집기(random horizontal flipping)'의 조합을 사용합니다. 검증 및 평가 데이터 세트 변환에는 '랜덤 잘라내기'와 '랜덤 뒤집기'를 제외한 동일한 변환 체인을 유지합니다. 이러한 변환에 대해 자세히 알아보려면 [PyTorchVideo 공식 문서](https://pytorchvideo.org)를 확인하세요. + +사전 훈련된 모델과 관련된 이미지 프로세서를 사용하여 다음 정보를 얻을 수 있습니다: + +* 영상 프레임 픽셀을 정규화하는 데 사용되는 이미지 평균과 표준 편차 +* 영상 프레임이 조정될 공간 해상도 + + +먼저, 몇 가지 상수를 정의합니다. + +```py +>>> mean = image_processor.image_mean +>>> std = image_processor.image_std +>>> if "shortest_edge" in image_processor.size: +... height = width = image_processor.size["shortest_edge"] +>>> else: +... height = image_processor.size["height"] +... 
width = image_processor.size["width"] +>>> resize_to = (height, width) + +>>> num_frames_to_sample = model.config.num_frames +>>> sample_rate = 4 +>>> fps = 30 +>>> clip_duration = num_frames_to_sample * sample_rate / fps +``` + +이제 데이터 세트에 특화된 전처리(transform)과 데이터 세트 자체를 정의합니다. 먼저 훈련 데이터 세트로 시작합니다: + +```py +>>> train_transform = Compose( +... [ +... ApplyTransformToKey( +... key="video", +... transform=Compose( +... [ +... UniformTemporalSubsample(num_frames_to_sample), +... Lambda(lambda x: x / 255.0), +... Normalize(mean, std), +... RandomShortSideScale(min_size=256, max_size=320), +... RandomCrop(resize_to), +... RandomHorizontalFlip(p=0.5), +... ] +... ), +... ), +... ] +... ) + +>>> train_dataset = pytorchvideo.data.Ucf101( +... data_path=os.path.join(dataset_root_path, "train"), +... clip_sampler=pytorchvideo.data.make_clip_sampler("random", clip_duration), +... decode_audio=False, +... transform=train_transform, +... ) +``` + +같은 방식의 작업 흐름을 검증과 평가 세트에도 적용할 수 있습니다. + +```py +>>> val_transform = Compose( +... [ +... ApplyTransformToKey( +... key="video", +... transform=Compose( +... [ +... UniformTemporalSubsample(num_frames_to_sample), +... Lambda(lambda x: x / 255.0), +... Normalize(mean, std), +... Resize(resize_to), +... ] +... ), +... ), +... ] +... ) + +>>> val_dataset = pytorchvideo.data.Ucf101( +... data_path=os.path.join(dataset_root_path, "val"), +... clip_sampler=pytorchvideo.data.make_clip_sampler("uniform", clip_duration), +... decode_audio=False, +... transform=val_transform, +... ) + +>>> test_dataset = pytorchvideo.data.Ucf101( +... data_path=os.path.join(dataset_root_path, "test"), +... clip_sampler=pytorchvideo.data.make_clip_sampler("uniform", clip_duration), +... decode_audio=False, +... transform=val_transform, +... ) +``` + + +**참고**: 위의 데이터 세트의 파이프라인은 [공식 파이토치 예제](https://pytorchvideo.org/docs/tutorial_classification#dataset)에서 가져온 것입니다. 우리는 UCF-101 데이터셋에 맞게 [`pytorchvideo.data.Ucf101()`](https://pytorchvideo.readthedocs.io/en/latest/api/data/data.html#pytorchvideo.data.Ucf101) 함수를 사용하고 있습니다. 내부적으로 이 함수는 [`pytorchvideo.data.labeled_video_dataset.LabeledVideoDataset`](https://pytorchvideo.readthedocs.io/en/latest/api/data/data.html#pytorchvideo.data.LabeledVideoDataset) 객체를 반환합니다. `LabeledVideoDataset` 클래스는 PyTorchVideo 데이터셋에서 모든 영상 관련 작업의 기본 클래스입니다. 따라서 PyTorchVideo에서 미리 제공하지 않는 사용자 지정 데이터 세트를 사용하려면, 이 클래스를 적절하게 확장하면 됩니다. 더 자세한 사항이 알고 싶다면 `data` API [문서](https://pytorchvideo.readthedocs.io/en/latest/api/data/data.html) 를 참고하세요. 또한 위의 예시와 유사한 구조를 갖는 데이터 세트를 사용하고 있다면, `pytorchvideo.data.Ucf101()` 함수를 사용하는 데 문제가 없을 것입니다. + +데이터 세트에 영상의 개수를 알기 위해 `num_videos` 인수에 접근할 수 있습니다. + +```py +>>> print(train_dataset.num_videos, val_dataset.num_videos, test_dataset.num_videos) +# (300, 30, 75) +``` + +## 더 나은 디버깅을 위해 전처리 영상 시각화하기[[visualize-the-preprocessed-video-for-better-debugging]] + +```py +>>> import imageio +>>> import numpy as np +>>> from IPython.display import Image + +>>> def unnormalize_img(img): +... """Un-normalizes the image pixels.""" +... img = (img * std) + mean +... img = (img * 255).astype("uint8") +... return img.clip(0, 255) + +>>> def create_gif(video_tensor, filename="sample.gif"): +... """Prepares a GIF from a video tensor. +... +... The video tensor is expected to have the following shape: +... (num_frames, num_channels, height, width). +... """ +... frames = [] +... for video_frame in video_tensor: +... frame_unnormalized = unnormalize_img(video_frame.permute(1, 2, 0).numpy()) +... frames.append(frame_unnormalized) +... 
kargs = {"duration": 0.25} +... imageio.mimsave(filename, frames, "GIF", **kargs) +... return filename + +>>> def display_gif(video_tensor, gif_name="sample.gif"): +... """Prepares and displays a GIF from a video tensor.""" +... video_tensor = video_tensor.permute(1, 0, 2, 3) +... gif_filename = create_gif(video_tensor, gif_name) +... return Image(filename=gif_filename) + +>>> sample_video = next(iter(train_dataset)) +>>> video_tensor = sample_video["video"] +>>> display_gif(video_tensor) +``` + +
+ Person playing basketball +
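+
+훈련 단계로 넘어가기 전에, 전처리 파이프라인이 기대한 형태의 샘플을 만들어내는지 간단히 확인해 볼 수 있습니다. 아래는 위에서 만든 `sample_video`와 `id2label`을 그대로 사용한다고 가정한 확인용 스케치이며, 반환되는 키 구성과 자료형은 `pytorchvideo` 버전에 따라 조금 다를 수 있습니다:
+
+```py
+>>> sorted(sample_video.keys())  # 'video'와 'label' 키가 포함되어 있는지 확인합니다
+>>> sample_video["video"].shape  # (채널 수, 프레임 수, 높이, 너비) 형태의 텐서를 기대할 수 있습니다
+>>> id2label[sample_video["label"]]  # 정수 레이블을 클래스 이름으로 변환해 봅니다
+```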
+ +## 모델 훈련하기[[train-the-model]] + +🤗 Transformers의 [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer)를 사용하여 모델을 훈련시켜보세요. `Trainer`를 인스턴스화하려면 훈련 설정과 평가 지표를 정의해야 합니다. 가장 중요한 것은 [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments)입니다. 이 클래스는 훈련을 구성하는 모든 속성을 포함하며, 훈련 중 체크포인트를 저장할 출력 폴더 이름을 필요로 합니다. 또한 🤗 Hub의 모델 저장소의 모든 정보를 동기화하는 데 도움이 됩니다. + +대부분의 훈련 인수는 따로 설명할 필요는 없습니다. 하지만 여기에서 중요한 인수는 `remove_unused_columns=False` 입니다. 이 인자는 모델의 호출 함수에서 사용되지 않는 모든 속성 열(columns)을 삭제합니다. 기본값은 일반적으로 True입니다. 이는 사용되지 않는 기능 열을 삭제하는 것이 이상적이며, 입력을 모델의 호출 함수로 풀기(unpack)가 쉬워지기 때문입니다. 하지만 이 경우에는 `pixel_values`(모델의 입력으로 필수적인 키)를 생성하기 위해 사용되지 않는 기능('video'가 특히 그렇습니다)이 필요합니다. 따라서 remove_unused_columns을 False로 설정해야 합니다. + +```py +>>> from transformers import TrainingArguments, Trainer + +>>> model_name = model_ckpt.split("/")[-1] +>>> new_model_name = f"{model_name}-finetuned-ucf101-subset" +>>> num_epochs = 4 + +>>> args = TrainingArguments( +... new_model_name, +... remove_unused_columns=False, +... evaluation_strategy="epoch", +... save_strategy="epoch", +... learning_rate=5e-5, +... per_device_train_batch_size=batch_size, +... per_device_eval_batch_size=batch_size, +... warmup_ratio=0.1, +... logging_steps=10, +... load_best_model_at_end=True, +... metric_for_best_model="accuracy", +... push_to_hub=True, +... max_steps=(train_dataset.num_videos // batch_size) * num_epochs, +... ) +``` + +`pytorchvideo.data.Ucf101()` 함수로 반환되는 데이터 세트는 `__len__` 메소드가 이식되어 있지 않습니다. 따라서, `TrainingArguments`를 인스턴스화할 때 `max_steps`를 정의해야 합니다. + +다음으로, 평가지표를 불러오고, 예측값에서 평가지표를 계산할 함수를 정의합니다. 필요한 전처리 작업은 예측된 로짓(logits)에 argmax 값을 취하는 것뿐입니다: + +```py +import evaluate + +metric = evaluate.load("accuracy") + + +def compute_metrics(eval_pred): + predictions = np.argmax(eval_pred.predictions, axis=1) + return metric.compute(predictions=predictions, references=eval_pred.label_ids) +``` + +**평가에 대한 참고사항**: + +[VideoMAE 논문](https://arxiv.org/abs/2203.12602)에서 저자는 다음과 같은 평가 전략을 사용합니다. 테스트 영상에서 여러 클립을 선택하고 그 클립에 다양한 크롭을 적용하여 집계 점수를 보고합니다. 그러나 이번 튜토리얼에서는 간단함과 간결함을 위해 해당 전략을 고려하지 않습니다. + +또한, 예제를 묶어서 배치를 형성하는 `collate_fn`을 정의해야합니다. 각 배치는 `pixel_values`와 `labels`라는 2개의 키로 구성됩니다. + +```py +>>> def collate_fn(examples): +... # permute to (num_frames, num_channels, height, width) +... pixel_values = torch.stack( +... [example["video"].permute(1, 0, 2, 3) for example in examples] +... ) +... labels = torch.tensor([example["label"] for example in examples]) +... return {"pixel_values": pixel_values, "labels": labels} +``` + +그런 다음 이 모든 것을 데이터 세트와 함께 `Trainer`에 전달하기만 하면 됩니다: + +```py +>>> trainer = Trainer( +... model, +... args, +... train_dataset=train_dataset, +... eval_dataset=val_dataset, +... tokenizer=image_processor, +... compute_metrics=compute_metrics, +... data_collator=collate_fn, +... ) +``` + +데이터를 이미 처리했는데도 불구하고 `image_processor`를 토크나이저 인수로 넣은 이유는 JSON으로 저장되는 이미지 프로세서 구성 파일이 Hub의 저장소에 업로드되도록 하기 위함입니다. + +`train` 메소드를 호출하여 모델을 미세 조정하세요: + +```py +>>> train_results = trainer.train() +``` + +학습이 완료되면, 모델을 [`~transformers.Trainer.push_to_hub`] 메소드를 사용하여 허브에 공유하여 누구나 모델을 사용할 수 있도록 합니다: +```py +>>> trainer.push_to_hub() +``` + +## 추론하기[[inference]] + +좋습니다. 이제 미세 조정된 모델을 추론하는 데 사용할 수 있습니다. + +추론에 사용할 영상을 불러오세요: +```py +>>> sample_test_video = next(iter(test_dataset)) +``` + +
+ Teams playing basketball +
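+
+파이프라인의 예측과 비교할 수 있도록, 앞서 만든 `id2label` 매핑으로 이 테스트 클립의 정답 레이블을 먼저 확인해 둘 수도 있습니다. 아래는 위의 `sample_test_video`를 그대로 사용한다고 가정한 간단한 예시입니다:
+
+```py
+>>> print("정답 레이블:", id2label[sample_test_video["label"]])
+```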
+ +미세 조정된 모델을 추론에 사용하는 가장 간단한 방법은 [`pipeline`](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.VideoClassificationPipeline)에서 모델을 사용하는 것입니다. 모델로 영상 분류를 하기 위해 `pipeline`을 인스턴스화하고 영상을 전달하세요: + +```py +>>> from transformers import pipeline + +>>> video_cls = pipeline(model="my_awesome_video_cls_model") +>>> video_cls("https://huggingface.co/datasets/sayakpaul/ucf101-subset/resolve/main/v_BasketballDunk_g14_c06.avi") +[{'score': 0.9272987842559814, 'label': 'BasketballDunk'}, + {'score': 0.017777055501937866, 'label': 'BabyCrawling'}, + {'score': 0.01663011871278286, 'label': 'BalanceBeam'}, + {'score': 0.009560945443809032, 'label': 'BandMarching'}, + {'score': 0.0068979403004050255, 'label': 'BaseballPitch'}] +``` + +만약 원한다면 수동으로 `pipeline`의 결과를 재현할 수 있습니다: + + +```py +>>> def run_inference(model, video): +... # (num_frames, num_channels, height, width) +... perumuted_sample_test_video = video.permute(1, 0, 2, 3) +... inputs = { +... "pixel_values": perumuted_sample_test_video.unsqueeze(0), +... "labels": torch.tensor( +... [sample_test_video["label"]] +... ), # this can be skipped if you don't have labels available. +... } + +... device = torch.device("cuda" if torch.cuda.is_available() else "cpu") +... inputs = {k: v.to(device) for k, v in inputs.items()} +... model = model.to(device) + +... # forward pass +... with torch.no_grad(): +... outputs = model(**inputs) +... logits = outputs.logits + +... return logits +``` + +모델에 입력값을 넣고 `logits`을 반환받으세요: + +```py +>>> logits = run_inference(trained_model, sample_test_video["video"]) +``` + +`logits`을 디코딩하면, 우리는 다음 결과를 얻을 수 있습니다: + +```py +>>> predicted_class_idx = logits.argmax(-1).item() +>>> print("Predicted class:", model.config.id2label[predicted_class_idx]) +# Predicted class: BasketballDunk +``` diff --git a/docs/source/ko/tasks/visual_question_answering.md b/docs/source/ko/tasks/visual_question_answering.md new file mode 100644 index 00000000000000..f8560b14f9b8a1 --- /dev/null +++ b/docs/source/ko/tasks/visual_question_answering.md @@ -0,0 +1,375 @@ + + +# 시각적 질의응답 (Visual Question Answering) + +[[open-in-colab]] + +시각적 질의응답(VQA)은 이미지를 기반으로 개방형 질문에 대응하는 작업입니다. 이 작업을 지원하는 모델의 입력은 대부분 이미지와 질문의 조합이며, 출력은 자연어로 된 답변입니다. + +VQA의 주요 사용 사례는 다음과 같습니다: +* 시각 장애인을 위한 접근성 애플리케이션을 구축할 수 있습니다. +* 교육: 강의나 교과서에 나온 시각 자료에 대한 질문에 답할 수 있습니다. 또한 체험형 전시와 유적 등에서도 VQA를 활용할 수 있습니다. +* 고객 서비스 및 전자상거래: VQA는 사용자가 제품에 대해 질문할 수 있게 함으로써 사용자 경험을 향상시킬 수 있습니다. +* 이미지 검색: VQA 모델을 사용하여 원하는 특성을 가진 이미지를 검색할 수 있습니다. 예를 들어 사용자는 "강아지가 있어?"라고 물어봐서 주어진 이미지 묶음에서 강아지가 있는 모든 이미지를 받아볼 수 있습니다. + +이 가이드에서 학습할 내용은 다음과 같습니다: + +- VQA 모델 중 하나인 [ViLT](../../en/model_doc/vilt)를 [`Graphcore/vqa` 데이터셋](https://huggingface.co/datasets/Graphcore/vqa) 에서 미세조정하는 방법 +- 미세조정된 ViLT 모델로 추론하는 방법 +- BLIP-2 같은 생성 모델로 제로샷 VQA 추론을 실행하는 방법 + +## ViLT 미세 조정 [[finetuning-vilt]] + +ViLT는 Vision Transformer (ViT) 내에 텍스트 임베딩을 포함하여 비전/자연어 사전훈련(VLP; Vision-and-Language Pretraining)을 위한 기본 디자인을 제공합니다. +ViLT 모델은 비전 트랜스포머(ViT)에 텍스트 임베딩을 넣어 비전/언어 사전훈련(VLP; Vision-and-Language Pre-training)을 위한 기본적인 디자인을 갖췄습니다. 이 모델은 여러 다운스트림 작업에 사용할 수 있습니다. VQA 태스크에서는 (`[CLS]` 토큰의 최종 은닉 상태 위에 선형 레이어인) 분류 헤더가 있으며 무작위로 초기화됩니다. +따라서 여기에서 시각적 질의응답은 **분류 문제**로 취급됩니다. + +최근의 BLIP, BLIP-2, InstructBLIP와 같은 모델들은 VQA를 생성형 작업으로 간주합니다. 가이드의 후반부에서는 이런 모델들을 사용하여 제로샷 VQA 추론을 하는 방법에 대해 설명하겠습니다. + +시작하기 전 필요한 모든 라이브러리를 설치했는지 확인하세요. + +```bash +pip install -q transformers datasets +``` + +커뮤니티에 모델을 공유하는 것을 권장 드립니다. Hugging Face 계정에 로그인하여 🤗 Hub에 업로드할 수 있습니다. 
+메시지가 나타나면 로그인할 토큰을 입력하세요: + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +모델 체크포인트를 전역 변수로 선언하세요. + +```py +>>> model_checkpoint = "dandelin/vilt-b32-mlm" +``` + +## 데이터 가져오기 [[load-the-data]] + +이 가이드에서는 `Graphcore/vqa` 데이터세트의 작은 샘플을 사용합니다. 전체 데이터세트는 [🤗 Hub](https://huggingface.co/datasets/Graphcore/vqa) 에서 확인할 수 있습니다. + +[`Graphcore/vqa` 데이터세트](https://huggingface.co/datasets/Graphcore/vqa) 의 대안으로 공식 [VQA 데이터세트 페이지](https://visualqa.org/download.html) 에서 동일한 데이터를 수동으로 다운로드할 수 있습니다. 직접 공수한 데이터로 튜토리얼을 따르고 싶다면 [이미지 데이터세트 만들기](https://huggingface.co/docs/datasets/image_dataset#loading-script) 라는 +🤗 Datasets 문서를 참조하세요. + +검증 데이터의 첫 200개 항목을 불러와 데이터세트의 특성을 확인해 보겠습니다: + +```python +>>> from datasets import load_dataset + +>>> dataset = load_dataset("Graphcore/vqa", split="validation[:200]") +>>> dataset +Dataset({ + features: ['question', 'question_type', 'question_id', 'image_id', 'answer_type', 'label'], + num_rows: 200 +}) +``` + +예제를 하나 뽑아 데이터세트의 특성을 이해해 보겠습니다. + +```py +>>> dataset[0] +{'question': 'Where is he looking?', + 'question_type': 'none of the above', + 'question_id': 262148000, + 'image_id': '/root/.cache/huggingface/datasets/downloads/extracted/ca733e0e000fb2d7a09fbcc94dbfe7b5a30750681d0e965f8e0a23b1c2f98c75/val2014/COCO_val2014_000000262148.jpg', + 'answer_type': 'other', + 'label': {'ids': ['at table', 'down', 'skateboard', 'table'], + 'weights': [0.30000001192092896, + 1.0, + 0.30000001192092896, + 0.30000001192092896]}} +``` + +데이터세트에는 다음과 같은 특성이 포함되어 있습니다: +* `question`: 이미지에 대한 질문 +* `image_id`: 질문과 관련된 이미지의 경로 +* `label`: 데이터의 레이블 (annotations) + +나머지 특성들은 필요하지 않기 때문에 삭제해도 됩니다: + +```py +>>> dataset = dataset.remove_columns(['question_type', 'question_id', 'answer_type']) +``` + +보시다시피 `label` 특성은 같은 질문마다 답변이 여러 개 있을 수 있습니다. 모두 다른 데이터 라벨러들로부터 수집되었기 때문인데요. 질문의 답변은 주관적일 수 있습니다. 이 경우 질문은 "그는 어디를 보고 있나요?" 였지만, 어떤 사람들은 "아래"로 레이블을 달았고, 다른 사람들은 "테이블" 또는 "스케이트보드" 등으로 주석을 달았습니다. + +아래의 이미지를 보고 어떤 답변을 선택할 것인지 생각해 보세요: + +```python +>>> from PIL import Image + +>>> image = Image.open(dataset[0]['image_id']) +>>> image +``` + +
+ VQA Image Example +
+ +질문과 답변의 모호성으로 인해 이러한 데이터세트는 여러 개의 답변이 가능하므로 다중 레이블 분류 문제로 처리됩니다. 게다가, 원핫(one-hot) 인코딩 벡터를 생성하기보다는 레이블에서 특정 답변이 나타나는 횟수를 기반으로 소프트 인코딩을 생성합니다. + +위의 예시에서 "아래"라는 답변이 다른 답변보다 훨씬 더 자주 선택되었기 때문에 데이터세트에서 `weight`라고 불리는 점수로 1.0을 가지며, 나머지 답변들은 1.0 미만의 점수를 가집니다. + +적절한 분류 헤더로 모델을 나중에 인스턴스화하기 위해 레이블을 정수로 매핑한 딕셔너리 하나, 반대로 정수를 레이블로 매핑한 딕셔너리 하나 총 2개의 딕셔너리를 생성하세요: + +```py +>>> import itertools + +>>> labels = [item['ids'] for item in dataset['label']] +>>> flattened_labels = list(itertools.chain(*labels)) +>>> unique_labels = list(set(flattened_labels)) + +>>> label2id = {label: idx for idx, label in enumerate(unique_labels)} +>>> id2label = {idx: label for label, idx in label2id.items()} +``` + +이제 매핑이 완료되었으므로 문자열 답변을 해당 id로 교체하고, 데이터세트의 더 편리한 후처리를 위해 편평화 할 수 있습니다. + +```python +>>> def replace_ids(inputs): +... inputs["label"]["ids"] = [label2id[x] for x in inputs["label"]["ids"]] +... return inputs + + +>>> dataset = dataset.map(replace_ids) +>>> flat_dataset = dataset.flatten() +>>> flat_dataset.features +{'question': Value(dtype='string', id=None), + 'image_id': Value(dtype='string', id=None), + 'label.ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), + 'label.weights': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None)} +``` + +## 데이터 전처리 [[preprocessing-data]] + +다음 단계는 모델을 위해 이미지와 텍스트 데이터를 준비하기 위해 ViLT 프로세서를 가져오는 것입니다. +[`ViltProcessor`]는 BERT 토크나이저와 ViLT 이미지 프로세서를 편리하게 하나의 프로세서로 묶습니다: + +```py +>>> from transformers import ViltProcessor + +>>> processor = ViltProcessor.from_pretrained(model_checkpoint) +``` + +데이터를 전처리하려면 이미지와 질문을 [`ViltProcessor`]로 인코딩해야 합니다. 프로세서는 [`BertTokenizerFast`]로 텍스트를 토크나이즈하고 텍스트 데이터를 위해 `input_ids`, `attention_mask` 및 `token_type_ids`를 생성합니다. +이미지는 [`ViltImageProcessor`]로 이미지를 크기 조정하고 정규화하며, `pixel_values`와 `pixel_mask`를 생성합니다. + +이런 전처리 단계는 모두 내부에서 이루어지므로, `processor`를 호출하기만 하면 됩니다. 하지만 아직 타겟 레이블이 완성되지 않았습니다. 타겟의 표현에서 각 요소는 가능한 답변(레이블)에 해당합니다. 정확한 답변의 요소는 해당 점수(weight)를 유지시키고 나머지 요소는 0으로 설정해야 합니다. + +아래 함수가 위에서 설명한대로 이미지와 질문에 `processor`를 적용하고 레이블을 형식에 맞춥니다: + +```py +>>> import torch + +>>> def preprocess_data(examples): +... image_paths = examples['image_id'] +... images = [Image.open(image_path) for image_path in image_paths] +... texts = examples['question'] + +... encoding = processor(images, texts, padding="max_length", truncation=True, return_tensors="pt") + +... for k, v in encoding.items(): +... encoding[k] = v.squeeze() + +... targets = [] + +... for labels, scores in zip(examples['label.ids'], examples['label.weights']): +... target = torch.zeros(len(id2label)) + +... for label, score in zip(labels, scores): +... target[label] = score + +... targets.append(target) + +... encoding["labels"] = targets + +... return encoding +``` + +전체 데이터세트에 전처리 함수를 적용하려면 🤗 Datasets의 [`~datasets.map`] 함수를 사용하십시오. `batched=True`를 설정하여 데이터세트의 여러 요소를 한 번에 처리함으로써 `map`을 더 빠르게 할 수 있습니다. 이 시점에서 필요하지 않은 열은 제거하세요. + +```py +>>> processed_dataset = flat_dataset.map(preprocess_data, batched=True, remove_columns=['question','question_type', 'question_id', 'image_id', 'answer_type', 'label.ids', 'label.weights']) +>>> processed_dataset +Dataset({ + features: ['input_ids', 'token_type_ids', 'attention_mask', 'pixel_values', 'pixel_mask', 'labels'], + num_rows: 200 +}) +``` + +마지막 단계로, [`DefaultDataCollator`]를 사용하여 예제로 쓸 배치를 생성하세요: + +```py +>>> from transformers import DefaultDataCollator + +>>> data_collator = DefaultDataCollator() +``` + +## 모델 훈련 [[train-the-model]] + +이제 모델을 훈련하기 위해 준비되었습니다! 
[`ViltForQuestionAnswering`]으로 ViLT를 가져올 차례입니다. 레이블의 수와 레이블 매핑을 지정하세요: + +```py +>>> from transformers import ViltForQuestionAnswering + +>>> model = ViltForQuestionAnswering.from_pretrained(model_checkpoint, num_labels=len(id2label), id2label=id2label, label2id=label2id) +``` + +이 시점에서는 다음 세 단계만 남았습니다: + +1. [`TrainingArguments`]에서 훈련 하이퍼파라미터를 정의하세요: + +```py +>>> from transformers import TrainingArguments + +>>> repo_id = "MariaK/vilt_finetuned_200" + +>>> training_args = TrainingArguments( +... output_dir=repo_id, +... per_device_train_batch_size=4, +... num_train_epochs=20, +... save_steps=200, +... logging_steps=50, +... learning_rate=5e-5, +... save_total_limit=2, +... remove_unused_columns=False, +... push_to_hub=True, +... ) +``` + +2. 모델, 데이터세트, 프로세서, 데이터 콜레이터와 함께 훈련 인수를 [`Trainer`]에 전달하세요: + +```py +>>> from transformers import Trainer + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... data_collator=data_collator, +... train_dataset=processed_dataset, +... tokenizer=processor, +... ) +``` + +3. [`~Trainer.train`]을 호출하여 모델을 미세 조정하세요: + +```py +>>> trainer.train() +``` + +훈련이 완료되면, [`~Trainer.push_to_hub`] 메소드를 사용하여 🤗 Hub에 모델을 공유하세요: + +```py +>>> trainer.push_to_hub() +``` + +## 추론 [[inference]] + +ViLT 모델을 미세 조정하고 🤗 Hub에 업로드했다면 추론에 사용할 수 있습니다. 미세 조정된 모델을 추론에 사용해보는 가장 간단한 방법은 [`Pipeline`]에서 사용하는 것입니다. + +```py +>>> from transformers import pipeline + +>>> pipe = pipeline("visual-question-answering", model="MariaK/vilt_finetuned_200") +``` + +이 가이드의 모델은 200개의 예제에서만 훈련되었으므로 그다지 많은 것을 기대할 수는 없습니다. 데이터세트의 첫 번째 예제를 사용하여 추론 결과를 설명해보겠습니다: + +```py +>>> example = dataset[0] +>>> image = Image.open(example['image_id']) +>>> question = example['question'] +>>> print(question) +>>> pipe(image, question, top_k=1) +"Where is he looking?" +[{'score': 0.5498199462890625, 'answer': 'down'}] +``` + +비록 확신은 별로 없지만, 모델은 실제로 무언가를 배웠습니다. 더 많은 예제와 더 긴 훈련 기간이 주어진다면 분명 더 나은 결과를 얻을 수 있을 것입니다! + +원한다면 파이프라인의 결과를 수동으로 복제할 수도 있습니다: +1. 이미지와 질문을 가져와서 프로세서를 사용하여 모델에 준비합니다. +2. 전처리된 결과를 모델에 전달합니다. +3. 로짓에서 가장 가능성 있는 답변의 id를 가져와서 `id2label`에서 실제 답변을 찾습니다. + +```py +>>> processor = ViltProcessor.from_pretrained("MariaK/vilt_finetuned_200") + +>>> image = Image.open(example['image_id']) +>>> question = example['question'] + +>>> # prepare inputs +>>> inputs = processor(image, question, return_tensors="pt") + +>>> model = ViltForQuestionAnswering.from_pretrained("MariaK/vilt_finetuned_200") + +>>> # forward pass +>>> with torch.no_grad(): +... outputs = model(**inputs) + +>>> logits = outputs.logits +>>> idx = logits.argmax(-1).item() +>>> print("Predicted answer:", model.config.id2label[idx]) +Predicted answer: down +``` + +## 제로샷 VQA [[zeroshot-vqa]] + +이전 모델은 VQA를 분류 문제로 처리했습니다. BLIP, BLIP-2 및 InstructBLIP와 같은 최근의 모델은 VQA를 생성 작업으로 접근합니다. [BLIP-2](../../en/model_doc/blip-2)를 예로 들어 보겠습니다. 이 모델은 사전훈련된 비전 인코더와 LLM의 모든 조합을 사용할 수 있는 새로운 비전-자연어 사전 학습 패러다임을 도입했습니다. ([BLIP-2 블로그 포스트](https://huggingface.co/blog/blip-2)를 통해 더 자세히 알아볼 수 있어요) +이를 통해 시각적 질의응답을 포함한 여러 비전-자연어 작업에서 SOTA를 달성할 수 있었습니다. + +이 모델을 어떻게 VQA에 사용할 수 있는지 설명해 보겠습니다. 먼저 모델을 가져와 보겠습니다. 여기서 GPU가 사용 가능한 경우 모델을 명시적으로 GPU로 전송할 것입니다. 
이전에는 훈련할 때 쓰지 않은 이유는 [`Trainer`]가 이 부분을 자동으로 처리하기 때문입니다: + +```py +>>> from transformers import AutoProcessor, Blip2ForConditionalGeneration +>>> import torch + +>>> processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b") +>>> model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16) +>>> device = "cuda" if torch.cuda.is_available() else "cpu" +>>> model.to(device) +``` + +모델은 이미지와 텍스트를 입력으로 받으므로, VQA 데이터세트의 첫 번째 예제에서와 동일한 이미지/질문 쌍을 사용해 보겠습니다: + +```py +>>> example = dataset[0] +>>> image = Image.open(example['image_id']) +>>> question = example['question'] +``` + +BLIP-2를 시각적 질의응답 작업에 사용하려면 텍스트 프롬프트가 `Question: {} Answer:` 형식을 따라야 합니다. + +```py +>>> prompt = f"Question: {question} Answer:" +``` + +이제 모델의 프로세서로 이미지/프롬프트를 전처리하고, 처리된 입력을 모델을 통해 전달하고, 출력을 디코드해야 합니다: + +```py +>>> inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16) + +>>> generated_ids = model.generate(**inputs, max_new_tokens=10) +>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip() +>>> print(generated_text) +"He is looking at the crowd" +``` + +보시다시피 모델은 군중을 인식하고, 얼굴의 방향(아래쪽을 보고 있음)을 인식했지만, 군중이 스케이터 뒤에 있다는 사실을 놓쳤습니다. 그러나 사람이 직접 라벨링한 데이터셋을 얻을 수 없는 경우에, 이 접근법은 빠르게 유용한 결과를 생성할 수 있습니다. diff --git a/docs/source/ko/tasks/zero_shot_image_classification.md b/docs/source/ko/tasks/zero_shot_image_classification.md new file mode 100644 index 00000000000000..f824de93b86522 --- /dev/null +++ b/docs/source/ko/tasks/zero_shot_image_classification.md @@ -0,0 +1,144 @@ + + +# 제로샷(zero-shot) 이미지 분류[[zeroshot-image-classification]] + +[[open-in-colab]] + +제로샷(zero-shot) 이미지 분류는 특정 카테고리의 예시가 포함된 데이터를 학습되지 않은 모델을 사용해 이미지 분류를 수행하는 작업입니다. + +일반적으로 이미지 분류를 위해서는 레이블이 달린 특정 이미지 데이터로 모델 학습이 필요하며, 이 모델은 특정 이미지의 특징을 레이블에 "매핑"하는 방법을 학습합니다. +새로운 레이블이 있는 분류 작업에 이러한 모델을 사용해야 하는 경우에는, 모델을 "재보정"하기 위해 미세 조정이 필요합니다. + +이와 대조적으로, 제로샷 또는 개방형 어휘(open vocabulary) 이미지 분류 모델은 일반적으로 대규모 이미지 데이터와 해당 설명에 대해 학습된 멀티모달(multimodal) 모델입니다. +이러한 모델은 제로샷 이미지 분류를 포함한 많은 다운스트림 작업에 사용할 수 있는 정렬된(aligned) 비전 언어 표현을 학습합니다. + +이는 이미지 분류에 대한 보다 유연한 접근 방식으로, 추가 학습 데이터 없이 새로운 레이블이나 학습하지 못한 카테고리에 대해 모델을 일반화할 수 있습니다. +또한, 사용자가 대상 개체에 대한 자유 형식의 텍스트 설명으로 이미지를 검색할 수 있습니다. + +이번 가이드에서 배울 내용은 다음과 같습니다: + +* 제로샷 이미지 분류 파이프라인 만들기 +* 직접 제로샷 이미지 분류 모델 추론 실행하기 + +시작하기 전에 필요한 라이브러리가 모두 설치되어 있는지 확인하세요: + +```bash +pip install -q transformers +``` + +## 제로샷(zero-shot) 이미지 분류 파이프라인[[zeroshot-image-classification-pipeline]] + +[`pipeline`]을 활용하면 가장 간단하게 제로샷 이미지 분류를 지원하는 모델로 추론해볼 수 있습니다. +[Hugging Face Hub에 업로드된 체크포인트](https://huggingface.co/models?pipeline_tag=zero-shot-image-classification&sort=downloads)에서 파이프라인을 인스턴스화합니다. + +```python +>>> from transformers import pipeline + +>>> checkpoint = "openai/clip-vit-large-patch14" +>>> detector = pipeline(model=checkpoint, task="zero-shot-image-classification") +``` + +다음으로, 분류하고 싶은 이미지를 선택하세요. + +```py +>>> from PIL import Image +>>> import requests + +>>> url = "https://unsplash.com/photos/g8oS8-82DxI/download?ixid=MnwxMjA3fDB8MXx0b3BpY3x8SnBnNktpZGwtSGt8fHx8fDJ8fDE2NzgxMDYwODc&force=true&w=640" +>>> image = Image.open(requests.get(url, stream=True).raw) + +>>> image +``` + +
+ Photo of an owl +
+
+이미지와 해당 이미지의 후보 레이블인 `candidate_labels`를 파이프라인으로 전달합니다.
+여기서는 이미지를 직접 전달하지만, 컴퓨터에 저장된 이미지의 경로나 url로 전달할 수도 있습니다.
+`candidate_labels`는 이 예시처럼 간단한 단어일 수도 있고 좀 더 설명적인 단어일 수도 있습니다.
+
+```py
+>>> predictions = detector(image, candidate_labels=["fox", "bear", "seagull", "owl"])
+>>> predictions
+[{'score': 0.9996670484542847, 'label': 'owl'},
+ {'score': 0.000199399160919711, 'label': 'seagull'},
+ {'score': 7.392891711788252e-05, 'label': 'fox'},
+ {'score': 5.96074532950297e-05, 'label': 'bear'}]
+```
+
+## 직접 제로샷(zero-shot) 이미지 분류하기[[zeroshot-image-classification-by-hand]]
+
+이제 제로샷 이미지 분류 파이프라인 사용 방법을 살펴보았으니, 직접 실행하는 방법을 살펴보겠습니다.
+
+[Hugging Face Hub에 업로드된 체크포인트](https://huggingface.co/models?pipeline_tag=zero-shot-image-classification&sort=downloads)에서 모델과 프로세서를 가져오는 것으로 시작합니다.
+여기서는 이전과 동일한 체크포인트를 사용하겠습니다:
+
+```py
+>>> from transformers import AutoProcessor, AutoModelForZeroShotImageClassification
+
+>>> model = AutoModelForZeroShotImageClassification.from_pretrained(checkpoint)
+>>> processor = AutoProcessor.from_pretrained(checkpoint)
+```
+
+다른 이미지를 사용해 보겠습니다.
+
+```py
+>>> from PIL import Image
+>>> import requests
+
+>>> url = "https://unsplash.com/photos/xBRQfR2bqNI/download?ixid=MnwxMjA3fDB8MXxhbGx8fHx8fHx8fHwxNjc4Mzg4ODEx&force=true&w=640"
+>>> image = Image.open(requests.get(url, stream=True).raw)
+
+>>> image
+```
+
+ Photo of a car +
+ +프로세서를 사용해 모델의 입력을 준비합니다. +프로세서는 모델의 입력으로 사용하기 위해 이미지 크기를 변환하고 정규화하는 이미지 프로세서와 텍스트 입력을 처리하는 토크나이저로 구성됩니다. + +```py +>>> candidate_labels = ["tree", "car", "bike", "cat"] +>>> inputs = processor(images=image, text=candidate_labels, return_tensors="pt", padding=True) +``` + +모델에 입력을 전달하고, 결과를 후처리합니다: + +```py +>>> import torch + +>>> with torch.no_grad(): +... outputs = model(**inputs) + +>>> logits = outputs.logits_per_image[0] +>>> probs = logits.softmax(dim=-1).numpy() +>>> scores = probs.tolist() + +>>> result = [ +... {"score": score, "label": candidate_label} +... for score, candidate_label in sorted(zip(probs, candidate_labels), key=lambda x: -x[0]) +... ] + +>>> result +[{'score': 0.998572, 'label': 'car'}, + {'score': 0.0010570387, 'label': 'bike'}, + {'score': 0.0003393686, 'label': 'tree'}, + {'score': 3.1572064e-05, 'label': 'cat'}] +``` \ No newline at end of file diff --git a/docs/source/ko/tasks/zero_shot_object_detection.md b/docs/source/ko/tasks/zero_shot_object_detection.md new file mode 100644 index 00000000000000..8e9b52e8c7a20f --- /dev/null +++ b/docs/source/ko/tasks/zero_shot_object_detection.md @@ -0,0 +1,307 @@ + + +# 제로샷(zero-shot) 객체 탐지[[zeroshot-object-detection]] + +[[open-in-colab]] + +일반적으로 [객체 탐지](object_detection)에 사용되는 모델을 학습하기 위해서는 레이블이 지정된 이미지 데이터 세트가 필요합니다. +그리고 학습 데이터에 존재하는 클래스(레이블)만 탐지할 수 있다는 한계점이 있습니다. + +다른 방식을 사용하는 [OWL-ViT](../model_doc/owlvit) 모델로 제로샷 객체 탐지가 가능합니다. +OWL-ViT는 개방형 어휘(open-vocabulary) 객체 탐지기입니다. +즉, 레이블이 지정된 데이터 세트에 미세 조정하지 않고 자유 텍스트 쿼리를 기반으로 이미지에서 객체를 탐지할 수 있습니다. + +OWL-ViT 모델은 멀티 모달 표현을 활용해 개방형 어휘 탐지(open-vocabulary detection)를 수행합니다. +[CLIP](../model_doc/clip) 모델에 경량화(lightweight)된 객체 분류와 지역화(localization) 헤드를 결합합니다. +개방형 어휘 탐지는 CLIP의 텍스트 인코더로 free-text 쿼리를 임베딩하고, 객체 분류와 지역화 헤드의 입력으로 사용합니다. +이미지와 해당 텍스트 설명을 연결하면 ViT가 이미지 패치(image patches)를 입력으로 처리합니다. +OWL-ViT 모델의 저자들은 CLIP 모델을 처음부터 학습(scratch learning)한 후에, bipartite matching loss를 사용하여 표준 객체 인식 데이터셋으로 OWL-ViT 모델을 미세 조정했습니다. + +이 접근 방식을 사용하면 모델은 레이블이 지정된 데이터 세트에 대한 사전 학습 없이도 텍스트 설명을 기반으로 객체를 탐지할 수 있습니다. + +이번 가이드에서는 OWL-ViT 모델의 사용법을 다룰 것입니다: +- 텍스트 프롬프트 기반 객체 탐지 +- 일괄 객체 탐지 +- 이미지 가이드 객체 탐지 + +시작하기 전에 필요한 라이브러리가 모두 설치되어 있는지 확인하세요: +```bash +pip install -q transformers +``` + +## 제로샷(zero-shot) 객체 탐지 파이프라인[[zeroshot-object-detection-pipeline]] + +[`pipeline`]을 활용하면 가장 간단하게 OWL-ViT 모델을 추론해볼 수 있습니다. +[Hugging Face Hub에 업로드된 체크포인트](https://huggingface.co/models?pipeline_tag=zero-shot-image-classification&sort=downloads)에서 제로샷(zero-shot) 객체 탐지용 파이프라인을 인스턴스화합니다: + +```python +>>> from transformers import pipeline + +>>> checkpoint = "google/owlvit-base-patch32" +>>> detector = pipeline(model=checkpoint, task="zero-shot-object-detection") +``` + +다음으로, 객체를 탐지하고 싶은 이미지를 선택하세요. +여기서는 [NASA](https://www.nasa.gov/multimedia/imagegallery/index.html) Great Images 데이터 세트의 일부인 우주비행사 에일린 콜린스(Eileen Collins) 사진을 사용하겠습니다. + +```py +>>> import skimage +>>> import numpy as np +>>> from PIL import Image + +>>> image = skimage.data.astronaut() +>>> image = Image.fromarray(np.uint8(image)).convert("RGB") + +>>> image +``` + +
+ Astronaut Eileen Collins +
+ +이미지와 해당 이미지의 후보 레이블을 파이프라인으로 전달합니다. +여기서는 이미지를 직접 전달하지만, 컴퓨터에 저장된 이미지의 경로나 url로 전달할 수도 있습니다. +candidate_labels는 이 예시처럼 간단한 단어일 수도 있고 좀 더 설명적인 단어일 수도 있습니다. +또한, 이미지를 검색(query)하려는 모든 항목에 대한 텍스트 설명도 전달합니다. + +```py +>>> predictions = detector( +... image, +... candidate_labels=["human face", "rocket", "nasa badge", "star-spangled banner"], +... ) +>>> predictions +[{'score': 0.3571370542049408, + 'label': 'human face', + 'box': {'xmin': 180, 'ymin': 71, 'xmax': 271, 'ymax': 178}}, + {'score': 0.28099656105041504, + 'label': 'nasa badge', + 'box': {'xmin': 129, 'ymin': 348, 'xmax': 206, 'ymax': 427}}, + {'score': 0.2110239565372467, + 'label': 'rocket', + 'box': {'xmin': 350, 'ymin': -1, 'xmax': 468, 'ymax': 288}}, + {'score': 0.13790413737297058, + 'label': 'star-spangled banner', + 'box': {'xmin': 1, 'ymin': 1, 'xmax': 105, 'ymax': 509}}, + {'score': 0.11950037628412247, + 'label': 'nasa badge', + 'box': {'xmin': 277, 'ymin': 338, 'xmax': 327, 'ymax': 380}}, + {'score': 0.10649408400058746, + 'label': 'rocket', + 'box': {'xmin': 358, 'ymin': 64, 'xmax': 424, 'ymax': 280}}] +``` + +이제 예측값을 시각화해봅시다: + +```py +>>> from PIL import ImageDraw + +>>> draw = ImageDraw.Draw(image) + +>>> for prediction in predictions: +... box = prediction["box"] +... label = prediction["label"] +... score = prediction["score"] + +... xmin, ymin, xmax, ymax = box.values() +... draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1) +... draw.text((xmin, ymin), f"{label}: {round(score,2)}", fill="white") + +>>> image +``` + +
+ Visualized predictions on NASA image +
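+
+위의 출력처럼 점수가 낮은 중복 예측이 함께 반환될 수 있습니다. 필요하다면 아래와 같이 점수 임계값으로 예측을 걸러낸 뒤 시각화할 수 있습니다. 임계값 0.3은 임의로 정한 값이므로 상황에 맞게 조정하세요:
+
+```py
+>>> score_threshold = 0.3  # 임의로 정한 임계값
+>>> confident_predictions = [p for p in predictions if p["score"] >= score_threshold]
+>>> [(p["label"], round(p["score"], 2)) for p in confident_predictions]
+[('human face', 0.36)]
+```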
+ +## 텍스트 프롬프트 기반 객체 탐지[[textprompted-zeroshot-object-detection-by-hand]] + +제로샷 객체 탐지 파이프라인 사용법에 대해 살펴보았으니, 이제 동일한 결과를 복제해보겠습니다. + +[Hugging Face Hub에 업로드된 체크포인트](https://huggingface.co/models?other=owlvit)에서 관련 모델과 프로세서를 가져오는 것으로 시작합니다. +여기서는 이전과 동일한 체크포인트를 사용하겠습니다: + +```py +>>> from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection + +>>> model = AutoModelForZeroShotObjectDetection.from_pretrained(checkpoint) +>>> processor = AutoProcessor.from_pretrained(checkpoint) +``` + +다른 이미지를 사용해 보겠습니다: + +```py +>>> import requests + +>>> url = "https://unsplash.com/photos/oj0zeY2Ltk4/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MTR8fHBpY25pY3xlbnwwfHx8fDE2Nzc0OTE1NDk&force=true&w=640" +>>> im = Image.open(requests.get(url, stream=True).raw) +>>> im +``` + +
+ Beach photo +
+ +프로세서를 사용해 모델의 입력을 준비합니다. +프로세서는 모델의 입력으로 사용하기 위해 이미지 크기를 변환하고 정규화하는 이미지 프로세서와 텍스트 입력을 처리하는 [`CLIPTokenizer`]로 구성됩니다. + +```py +>>> text_queries = ["hat", "book", "sunglasses", "camera"] +>>> inputs = processor(text=text_queries, images=im, return_tensors="pt") +``` + +모델에 입력을 전달하고 결과를 후처리 및 시각화합니다. +이미지 프로세서가 모델에 이미지를 입력하기 전에 이미지 크기를 조정했기 때문에, [`~OwlViTImageProcessor.post_process_object_detection`] 메소드를 사용해 +예측값의 바운딩 박스(bounding box)가 원본 이미지의 좌표와 상대적으로 동일한지 확인해야 합니다. + +```py +>>> import torch + +>>> with torch.no_grad(): +... outputs = model(**inputs) +... target_sizes = torch.tensor([im.size[::-1]]) +... results = processor.post_process_object_detection(outputs, threshold=0.1, target_sizes=target_sizes)[0] + +>>> draw = ImageDraw.Draw(im) + +>>> scores = results["scores"].tolist() +>>> labels = results["labels"].tolist() +>>> boxes = results["boxes"].tolist() + +>>> for box, score, label in zip(boxes, scores, labels): +... xmin, ymin, xmax, ymax = box +... draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1) +... draw.text((xmin, ymin), f"{text_queries[label]}: {round(score,2)}", fill="white") + +>>> im +``` + +
+ Beach photo with detected objects +
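+
+후처리된 결과는 시각화 외의 용도로도 활용할 수 있습니다. 예를 들어 아래는 탐지된 영역을 잘라내어 저장하는 간단한 스케치입니다. 위에서 `im`에 이미 박스를 그렸으므로, 깨끗한 결과가 필요하다면 원본 이미지를 다시 불러오는 것이 좋습니다:
+
+```py
+>>> clean_im = Image.open(requests.get(url, stream=True).raw)  # 박스가 그려지지 않은 원본을 다시 가져옵니다
+
+>>> for i, (box, label) in enumerate(zip(boxes, labels)):
+...     xmin, ymin, xmax, ymax = [int(v) for v in box]
+...     clean_im.crop((xmin, ymin, xmax, ymax)).save(f"{text_queries[label]}_{i}.png")
+```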
+ +## 일괄 처리[[batch-processing]] + +여러 이미지와 텍스트 쿼리를 전달하여 여러 이미지에서 서로 다른(또는 동일한) 객체를 검색할 수 있습니다. +일괄 처리를 위해서 텍스트 쿼리는 이중 리스트로, 이미지는 PIL 이미지, PyTorch 텐서, 또는 NumPy 배열로 이루어진 리스트로 프로세서에 전달해야 합니다. + +```py +>>> images = [image, im] +>>> text_queries = [ +... ["human face", "rocket", "nasa badge", "star-spangled banner"], +... ["hat", "book", "sunglasses", "camera"], +... ] +>>> inputs = processor(text=text_queries, images=images, return_tensors="pt") +``` + +이전에는 후처리를 위해 단일 이미지의 크기를 텐서로 전달했지만, 튜플을 전달할 수 있고, 여러 이미지를 처리하는 경우에는 튜플로 이루어진 리스트를 전달할 수도 있습니다. +아래 두 예제에 대한 예측을 생성하고, 두 번째 이미지(`image_idx = 1`)를 시각화해 보겠습니다. + +```py +>>> with torch.no_grad(): +... outputs = model(**inputs) +... target_sizes = [x.size[::-1] for x in images] +... results = processor.post_process_object_detection(outputs, threshold=0.1, target_sizes=target_sizes) + +>>> image_idx = 1 +>>> draw = ImageDraw.Draw(images[image_idx]) + +>>> scores = results[image_idx]["scores"].tolist() +>>> labels = results[image_idx]["labels"].tolist() +>>> boxes = results[image_idx]["boxes"].tolist() + +>>> for box, score, label in zip(boxes, scores, labels): +... xmin, ymin, xmax, ymax = box +... draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1) +... draw.text((xmin, ymin), f"{text_queries[image_idx][label]}: {round(score,2)}", fill="white") + +>>> images[image_idx] +``` + +
+ Beach photo with detected objects +
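+
+여러 이미지를 한 번에 처리한 경우 `results`는 이미지별 딕셔너리로 이루어진 리스트입니다. 예를 들어 아래처럼 각 이미지에서 어떤 쿼리가 몇 번 탐지되었는지 요약해 볼 수 있습니다(출력은 실행 결과에 따라 달라질 수 있습니다):
+
+```py
+>>> from collections import Counter
+
+>>> for queries, image_results in zip(text_queries, results):
+...     detected = Counter(queries[label] for label in image_results["labels"].tolist())
+...     print(dict(detected))
+```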
+ +## 이미지 가이드 객체 탐지[[imageguided-object-detection]] + +텍스트 쿼리를 이용한 제로샷 객체 탐지 외에도 OWL-ViT 모델은 이미지 가이드 객체 탐지 기능을 제공합니다. +이미지를 쿼리로 사용해 대상 이미지에서 유사한 객체를 찾을 수 있다는 의미입니다. +텍스트 쿼리와 달리 하나의 예제 이미지에서만 가능합니다. + +소파에 고양이 두 마리가 있는 이미지를 대상 이미지(target image)로, 고양이 한 마리가 있는 이미지를 쿼리로 사용해보겠습니다: + +```py +>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" +>>> image_target = Image.open(requests.get(url, stream=True).raw) + +>>> query_url = "http://images.cocodataset.org/val2017/000000524280.jpg" +>>> query_image = Image.open(requests.get(query_url, stream=True).raw) +``` + +다음 이미지를 살펴보겠습니다: + +```py +>>> import matplotlib.pyplot as plt + +>>> fig, ax = plt.subplots(1, 2) +>>> ax[0].imshow(image_target) +>>> ax[1].imshow(query_image) +``` + +
+ Cats +
+ +전처리 단계에서 텍스트 쿼리 대신에 `query_images`를 사용합니다: + +```py +>>> inputs = processor(images=image_target, query_images=query_image, return_tensors="pt") +``` + +예측의 경우, 모델에 입력을 전달하는 대신 [`~OwlViTForObjectDetection.image_guided_detection`]에 전달합니다. +레이블이 없다는 점을 제외하면 이전과 동일합니다. +이전과 동일하게 이미지를 시각화합니다. + +```py +>>> with torch.no_grad(): +... outputs = model.image_guided_detection(**inputs) +... target_sizes = torch.tensor([image_target.size[::-1]]) +... results = processor.post_process_image_guided_detection(outputs=outputs, target_sizes=target_sizes)[0] + +>>> draw = ImageDraw.Draw(image_target) + +>>> scores = results["scores"].tolist() +>>> boxes = results["boxes"].tolist() + +>>> for box, score, label in zip(boxes, scores, labels): +... xmin, ymin, xmax, ymax = box +... draw.rectangle((xmin, ymin, xmax, ymax), outline="white", width=4) + +>>> image_target +``` + +
+ Cats with bounding boxes +
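+
+이미지 가이드 객체 탐지에는 클래스 레이블이 없으므로, 예측을 살펴보려면 점수와 바운딩 박스 좌표를 직접 확인하면 됩니다. 아래는 위에서 만든 `boxes`와 `scores` 리스트를 그대로 사용한다고 가정한 예시입니다:
+
+```py
+>>> for box, score in zip(boxes, scores):
+...     print(round(score, 3), [round(v, 1) for v in box])
+```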
+ +OWL-ViT 모델을 추론하고 싶다면 아래 데모를 확인하세요: + + diff --git a/docs/source/ko/tasks_explained.md b/docs/source/ko/tasks_explained.md new file mode 100644 index 00000000000000..78c90849bb89bf --- /dev/null +++ b/docs/source/ko/tasks_explained.md @@ -0,0 +1,295 @@ + + +# 🤗 Transformers로 작업을 해결하는 방법[[how-transformers-solve-tasks]] + +[🤗 Transformers로 할 수 있는 작업](task_summary)에서 자연어 처리(NLP), 음성 및 오디오, 컴퓨터 비전 작업 등의 중요한 응용을 배웠습니다. 이 페이지에서는 모델이 이러한 작업을 어떻게 해결하는지 자세히 살펴보고 내부에서 어떤 일이 일어나는지 설명합니다. 주어진 작업을 해결하는 많은 방법이 있으며, 일부 모델은 특정 기술을 구현하거나 심지어 새로운 방식으로 작업에 접근할 수도 있지만, Transformer 모델의 경우 일반적인 아이디어는 동일합니다. 유연한 아키텍처 덕분에 대부분의 모델은 인코더, 디코더 또는 인코더-디코더 구조의 변형입니다. Transformer 모델뿐만 아니라 우리의 라이브러리에는 오늘날 컴퓨터 비전 작업에 사용되는 몇 가지 합성곱 신경망(CNNs)도 있습니다. 또한, 우리는 현대 CNN의 작동 방식에 대해 설명할 것입니다. + +작업이 어떻게 해결되는지 설명하기 위해, 유용한 예측을 출력하고자 모델 내부에서 어떤 일이 일어나는지 살펴봅니다. + +- 오디오 분류 및 자동 음성 인식(ASR)을 위한 [Wav2Vec2](model_doc/wav2vec2) +- 이미지 분류를 위한 [Vision Transformer (ViT)](model_doc/vit) 및 [ConvNeXT](model_doc/convnext) +- 객체 탐지를 위한 [DETR](model_doc/detr) +- 이미지 분할을 위한 [Mask2Former](model_doc/mask2former) +- 깊이 추정을 위한 [GLPN](model_doc/glpn) +- 인코더를 사용하는 텍스트 분류, 토큰 분류 및 질의응답과 같은 NLP 작업을 위한 [BERT](model_doc/bert) +- 디코더를 사용하는 텍스트 생성과 같은 NLP 작업을 위한 [GPT2](model_doc/gpt2) +- 인코더-디코더를 사용하는 요약 및 번역과 같은 NLP 작업을 위한 [BART](model_doc/bart) + + + +더 나아가기 전에, 기존 Transformer 아키텍처에 대한 기본적인 지식을 숙지하는 것이 좋습니다. 인코더, 디코더 및 어텐션의 작동 방식을 알면 다양한 Transformer 모델이 어떻게 작동하는지 이해하는 데 도움이 됩니다. 시작 단계거나 복습이 필요한 경우, 더 많은 정보를 위해 [코스](https://huggingface.co/course/chapter1/4?fw=pt)를 확인하세요! + + + +## 음성 및 오디오[[speech-and-audio]] + +[Wav2Vec2](model_doc/wav2vec2)는 레이블이 지정되지 않은 음성 데이터에 대해 사전훈련된 모델로, 오디오 분류 및 자동 음성 인식을 위해 레이블이 지정된 데이터로 미세 조정합니다. + +
+ +
+ +이 모델에는 4가지 주요 구성 요소가 있습니다: + +1. *특징 인코더(feature encoder)*는 원시 오디오 파형(raw audio waveform)을 가져와서 제로 평균 및 단위 분산으로 표준화하고, 각각 20ms 길이의 특징 벡터의 시퀀스로 변환합니다. + +2. 오디오 파형은 본질적으로 연속적이기 때문에, 텍스트 시퀀스를 단어로 나누는 것과 같이 분할할 수 없습니다. 그래서 *양자화 모듈(quantization module)*로 전달되는 특징 벡터는 이산형 음성 단위를 학습하기 위한 것입니다. 음성 단위는 *코드북(codebook)*(어휘집이라고 생각할 수 있습니다)이라는 코드단어(codewords) 콜렉션에서 선택됩니다. 코드북에서 연속적인 오디오 입력을 가장 잘 나타내는 벡터 또는 음성 단위가 선택되어 모델을 통과합니다. + +3. 특징 벡터의 절반은 무작위로 마스크가 적용되며, 마스크된 특징 벡터는 *상대적 위치 임베딩*을 추가하는 Transformer 인코더인 *문맥 네트워크(context network)*로 전달됩니다. + +4. 문맥 네트워크의 사전훈련 목표는 *대조적 작업(contrastive task)*입니다. 모델은 잘못된 예측 시퀀스에서 마스크된 예측의 실제 양자화된 음성 표현을 예측하며, 모델이 가장 유사한 컨텍스트 벡터와 양자화된 음성 단위(타겟 레이블)를 찾도록 권장합니다. + +이제 wav2vec2가 사전훈련되었으므로, 오디오 분류 또는 자동 음성 인식을 위해 데이터에 맞춰 미세 조정할 수 있습니다! + +### 오디오 분류[[audio-classification]] + +사전훈련된 모델을 오디오 분류에 사용하려면, 기본 Wav2Vec2 모델 상단에 시퀀스 분류 헤드를 추가하면 됩니다. 분류 헤드는 인코더의 은닉 상태(hidden states)를 받는 선형 레이어입니다. 은닉 상태는 각각 길이가 다른 오디오 프레임에서 학습된 특징을 나타냅니다. 고정 길이의 벡터 하나를 만들기 위해, 은닉 상태는 먼저 풀링되고, 클래스 레이블에 대한 로짓으로 변환됩니다. 가장 가능성이 높은 클래스를 찾기 위해 로짓과 타겟 사이의 교차 엔트로피 손실이 계산됩니다. + +오디오 분류에 직접 도전할 준비가 되셨나요? 완전한 [오디오 분류 가이드](tasks/audio_classification)를 확인하여 Wav2Vec2를 미세 조정하고 추론에 사용하는 방법을 학습하세요! + +### 자동 음성 인식[[automatic-speech-recognition]] + +사전훈련된 모델을 자동 음성 인식에 사용하려면, [연결주의적 시간 분류(CTC, Connectionist Temporal Classification)](glossary#connectionist-temporal-classification-ctc)를 위해 기본 Wav2Vec2 모델 상단에 언어 모델링 헤드를 추가합니다. 언어 모델링 헤드는 인코더의 은닉 상태를 받아서 로짓으로 변환합니다. 각 로짓은 토큰 클래스(토큰 수는 작업의 어휘에서 나타납니다)를 나타냅니다. CTC 손실은 텍스트로 디코딩된 토큰에서 가장 가능성이 높은 토큰 시퀀스를 찾기 위해 로짓과 타겟 사이에서 계산됩니다. + +자동 음성 인식에 직접 도전할 준비가 되셨나요? 완전한 [자동 음성 인식 가이드](tasks/asr)를 확인하여 Wav2Vec2를 미세 조정하고 추론에 사용하는 방법을 학습하세요! + +## 컴퓨터 비전[[computer-vision]] + +컴퓨터 비전 작업에 접근하는 2가지 방법이 있습니다: + +1. 이미지를 패치 시퀀스로 분리하고 Transformer로 병렬 처리합니다. +2. [ConvNeXT](model_doc/convnext)와 같은 현대 CNN을 사용합니다. 이는 합성곱 레이어를 기반으로 하지만 현대 네트워크 설계를 적용합니다. + + + +세 번째 방법은 Transformer와 합성곱(예를 들어, [Convolutional Vision Transformer](model_doc/cvt) 또는 [LeViT](model_doc/levit))을 결합하는 것입니다. 우리는 살펴볼 두 가지 방법만 결합하기 때문에 여기서 이 방법을 다루지 않습니다. + + + +ViT와 ConvNeXT는 일반적으로 이미지 분류에서 사용되지만, 물체 감지, 분할, 깊이 추정과 같은 다른 비전 작업에는 각각 DETR, Mask2Former, GLPN이 더 적합하므로 이러한 모델을 살펴보겠습니다. + +### 이미지 분류[[image-classification]] + +ViT와 ConvNeXT 모두 이미지 분류에 사용될 수 있지만, ViT는 어텐션 메커니즘을, ConvNeXT는 합성곱을 사용하는 것이 주된 차이입니다. + +#### Transformer[[transformer]] + +[ViT](model_doc/vit)은 합성곱을 전적으로 순수 Transformer 아키텍처로 대체합니다. 기존 Transformer에 익숙하다면, ViT를 이해하는 방법의 대부분을 이미 파악했다고 볼 수 있습니다. + +
+ +ViT가 도입한 주요 변경 사항은 이미지가 Transformer로 어떻게 전달되는지에 있습니다: + +1. 이미지는 서로 중첩되지 않는 정사각형 패치로 분할되고, 각 패치는 벡터 또는 *패치 임베딩(patch embedding)*으로 변환됩니다. 패치 임베딩은 적절한 입력 차원을 만드는 2D 합성곱 계층에서 생성됩니다(기본 Transformer의 경우 각 패치의 임베딩마다 768개의 값이 필요합니다). 224x224 픽셀 이미지가 있다면, 16x16 이미지 패치 196개로 분할할 수 있습니다. 텍스트가 단어로 토큰화되는 것처럼, 이미지도 패치 시퀀스로 "토큰화"됩니다. + +2. *학습 가능한 임베딩(learnable embedding)*(특수한 `[CLS]` 토큰)이 BERT와 같이 패치 임베딩의 시작 부분에 추가됩니다. `[CLS]` 토큰의 마지막 은닉 상태는 부착된 분류 헤드의 입력으로 사용되고, 다른 출력은 무시됩니다. 이 토큰은 모델이 이미지의 표현을 인코딩하는 방법을 학습하는 데 도움이 됩니다. + +3. 패치와 학습 가능한 임베딩에 마지막으로 추가할 것은 *위치 임베딩*입니다. 왜냐하면 모델은 이미지 패치의 순서를 모르기 때문입니다. 위치 임베딩도 학습 가능하며, 패치 임베딩과 동일한 크기를 가집니다. 최종적으로, 모든 임베딩이 Transformer 인코더에 전달됩니다. + +4. `[CLS]` 토큰을 포함한 출력은 다층 퍼셉트론 헤드(MLP)에 전달됩니다. ViT의 사전훈련 목표는 단순히 분류입니다. 다른 분류 헤드와 같이, MLP 헤드는 출력을 클래스 레이블에 대해 로짓으로 변환하고 교차 엔트로피 손실을 계산하여 가장 가능성이 높은 클래스를 찾습니다. + +이미지 분류에 직접 도전할 준비가 되셨나요? 완전한 [이미지 분류 가이드](tasks/image_classification)를 확인하여 ViT를 미세 조정하고 추론에 사용하는 방법을 학습하세요! + +#### CNN[[cnn]] + + + +이 섹션에서는 합성곱에 대해 간략하게 설명합니다. 그러나 이미지의 모양과 크기가 어떻게 변화하는지에 대한 사전 이해가 있다면 도움이 될 것입니다. 합성곱에 익숙하지 않은 경우, fastai book의 [합성곱 신경망 챕터](https://github.com/fastai/fastbook/blob/master/13_convolutions.ipynb)를 확인하세요! + + + +[ConvNeXT](model_doc/convnext)는 성능을 높이기 위해 새로운 현대 네트워크 설계를 적용한 CNN 구조입니다. 그러나 합성곱은 여전히 모델의 핵심입니다. 높은 수준의 관점에서 볼 때, [합성곱](glossary#convolution)은 작은 행렬(*커널*)에 이미지 픽셀의 작은 윈도우를 곱하는 연산입니다. 이는 특정 텍스쳐(texture)이나 선의 곡률과 같은 일부 특징을 계산합니다. 그러고 다음 픽셀 윈도우로 넘어가는데, 여기서 합성곱이 이동하는 거리를 *보폭(stride)*이라고 합니다. + +
+ +패딩이나 보폭이 없는 기본 합성곱, 딥러닝을 위한 합성곱 연산 가이드 + +이 출력을 다른 합성곱 레이어에 전달할 수 있으며, 각 연속적인 레이어를 통해 네트워크는 핫도그나 로켓과 같이 더 복잡하고 추상적인 것을 학습합니다. 합성곱 레이어 사이에 풀링 레이어를 추가하여 차원을 줄이고 특징의 위치 변화에 대해 모델을 더 견고하게 만드는 것이 일반적입니다. + +
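+합성곱과 풀링이 특징 맵의 크기를 어떻게 바꾸는지는 아래의 간단한 PyTorch 스케치로 확인할 수 있습니다. 입력 크기, 채널 수, 커널 크기는 설명을 위해 임의로 가정한 값입니다:
+
+```py
+import torch
+from torch import nn
+
+# 가정: 224x224 RGB 이미지 1장을 입력으로 사용합니다
+image = torch.randn(1, 3, 224, 224)
+
+# 3x3 커널, 보폭 1, 패딩이 없는 기본 합성곱
+conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1)
+features = conv(image)
+print(features.shape)  # torch.Size([1, 16, 222, 222])
+
+# 풀링 레이어는 공간 차원을 절반으로 줄여 위치 변화에 더 견고하게 만듭니다
+pooled = nn.MaxPool2d(kernel_size=2)(features)
+print(pooled.shape)  # torch.Size([1, 16, 111, 111])
+```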
+ +ConvNeXT는 CNN을 5가지 방식으로 현대화합니다: + +1. 각 단계의 블록 수를 변경하고 더 큰 보폭과 그에 대응하는 커널 크기로 이미지를 "패치화(patchify)"합니다. 겹치지 않는 슬라이딩 윈도우는 ViT가 이미지를 패치로 분할하는 방법과 유사하게 이 패치화 전략을 만듭니다. + +2. *병목(bottleneck)* 레이어는 채널 수를 줄였다가 다시 복원합니다. 왜냐하면 1x1 합성곱을 수행하는 것이 더 빠르고, 깊이를 늘릴 수 있기 때문입니다. 역 병목(inverted bottlenect)은 채널 수를 확장하고 축소함으로써 그 반대로 수행하므로, 메모리 효율이 더 높습니다. + +3. 병목 레이어의 일반적인 3x3 합성곱 레이어를 각 입력 채널에 개별적으로 합성곱을 적용한 다음 마지막에 쌓는 *깊이별 합성곱(depthwise convolution)*으로 대체합니다. 이는 네트워크 폭이 넓혀 성능이 향상됩니다. + +4. ViT는 어텐션 메커니즘 덕분에 한 번에 더 많은 이미지를 볼 수 있는 전역 수신 필드를 가지고 있습니다. ConvNeXT는 커널 크기를 7x7로 늘려 이 효과를 재현하려고 시도합니다. + +5. 또한 ConvNeXT는 Transformer 모델을 모방하는 몇 가지 레이어 설계를 변경합니다. 활성화 및 정규화 레이어가 더 적고, 활성화 함수가 ReLU 대신 GELU로 전환되고, BatchNorm 대신 LayerNorm을 사용합니다. + +합성곱 블록의 출력은 분류 헤드로 전달되며, 분류 헤드는 출력을 로짓으로 변환하고 교차 엔트로피 손실을 계산하여 가장 가능성이 높은 레이블을 찾습니다. + +### 객체 탐지[[object-detection]] + +[DETR](model_doc/detr), *DEtection TRansformer*는 CNN과 Transformer 인코더-디코더를 결합한 종단간(end-to-end) 객체 탐지 모델입니다. + +
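+세부 구조를 보기 전에, 사전훈련된 DETR 체크포인트로 객체 탐지를 수행하는 간단한 스케치를 먼저 확인해 보겠습니다. 체크포인트 이름과 이미지 경로는 설명을 위한 가정입니다:
+
+```py
+from transformers import pipeline
+
+# 가정: COCO로 학습된 DETR 체크포인트를 사용합니다
+detector = pipeline("object-detection", model="facebook/detr-resnet-50")
+
+# "image.jpg"는 예시 경로입니다. 바운딩 박스 좌표와 클래스 레이블 목록이 반환됩니다
+print(detector("image.jpg"))
+```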
+ +1. 사전훈련된 CNN *백본(backbone)*은 픽셀 값으로 나타낸 이미지를 가져와 저해상도 특징 맵을 만듭니다. 특징 맵에 대해 1x1 합성곱을 적용하여 차원을 줄이고, 고수준 이미지 표현을 가진 새로운 특징 맵을 생성합니다. Transformer는 시퀀스 모델이기 때문에 특징 맵을 위치 임베딩과 결합된 특징 벡터의 시퀀스로 평탄화합니다. + +2. 특징 벡터는 어텐션 레이어를 사용하여 이미지 표현을 학습하는 인코더에 전달됩니다. 다음으로, 인코더의 은닉 상태는 디코더에서 *객체 쿼리*와 결합됩니다. 객체 쿼리는 이미지의 다른 영역에 초점을 맞춘 학습된 임베딩으로 학습되고, 각 어텐션 레이어를 진행하면서 갱신됩니다. 디코더의 은닉 상태는 각 객체 쿼리에 대한 바운딩 박스 좌표와 클래스 레이블을 예측하는 순방향 네트워크에 전달되며, 객체가 없는 경우 `no object`가 출력됩니다. + + DETR은 각 객체 쿼리를 병렬로 디코딩하여 *N* 개의 최종 예측을 출력합니다. 여기서 *N*은 쿼리 수입니다. 한 번에 하나의 요소를 예측하는 일반적인 자기회귀 모델과 달리, 객체 탐지는 한 번에 *N* 개의 예측을 수행하는 집합 예측 작업(`바운딩 박스`, `클래스 레이블`)입니다. + +3. DETR은 훈련 중 *이분 매칭 손실(bipartite matching loss)*을 사용하여 고정된 수의 예측과 고정된 실제 정답 레이블(ground truth labels) 세트를 비교합니다. *N*개의 레이블 세트에 실제 정답 레이블보다 적은 경우, `no object` 클래스로 패딩됩니다. 이 손실 함수는 DETR이 예측과 실제 정답 레이블 간 1:1 대응을 찾도록 권장합니다. 바운딩 박스 또는 클래스 레이블 중 하나라도 잘못된 경우, 손실이 발생합니다. 마찬가지로, 존재하지 않는 객체를 예측하는 경우, 패널티를 받습니다. 이로 인해 DETR은 이미지에서 눈에 잘 띄는 물체 하나에 집중하는 대신, 다른 객체를 찾도록 권장됩니다. + +객체 탐지 헤드가 DETR 상단에 추가되어 클래스 레이블과 바운딩 박스의 좌표를 찾습니다. 객체 탐지 헤드에는 두 가지 구성 요소가 있습니다: 디코더 은닉 상태를 클래스 레이블의 로짓으로 변환하는 선형 레이어 및 바운딩 박스를 예측하는 MLP + +객체 탐지에 직접 도전할 준비가 되셨나요? 완전한 [객체 탐지 가이드](tasks/object_detection)를 확인하여 DETR을 미세 조정하고 추론에 사용하는 방법을 학습하세요! + +### 이미지 분할[[image-segmentation]] + +[Mask2Former](model_doc/mask2former)는 모든 유형의 이미지 분할 작업을 해결하는 범용 아키텍처입니다. 전통적인 분할 모델은 일반적으로 시멘틱(semantic) 또는 파놉틱(panoptic) 분할과 같은 이미지 분할의 특정 하위 작업에 맞춰 조정됩니다. Mask2Former는 모든 작업을 *마스크 분류* 문제로 구성합니다. 마스크 분류는 픽셀을 *N*개 세그먼트로 그룹화하고, 주어진 이미지에 대해 *N*개의 마스크와 그에 대응하는 클래스 레이블을 예측합니다. 이 섹션에서 Mask2Former의 작동 방법을 설명한 다음, 마지막에 SegFormer를 미세 조정해볼 수 있습니다. + +
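+작동 방식을 설명하기에 앞서, 이미지 분할 추론은 예를 들어 다음과 같이 수행할 수 있습니다. 아래 체크포인트 이름은 설명을 위한 가정이므로, 실제로는 Hub에서 사용 가능한 Mask2Former 체크포인트로 바꿔야 할 수 있습니다:
+
+```py
+from transformers import pipeline
+
+# 가정: COCO 파놉틱 분할용 Mask2Former 체크포인트 이름입니다
+segmenter = pipeline("image-segmentation", model="facebook/mask2former-swin-base-coco-panoptic")
+
+# "image.jpg"는 예시 경로입니다. 각 세그먼트의 클래스 레이블과 마스크가 반환됩니다
+for segment in segmenter("image.jpg"):
+    print(segment["label"])
+```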
+ +Mask2Former에는 3가지 주요 구성 요소가 있습니다: + +1. [Swin](model_doc/swin) 백본이 이미지를 받아 3개의 연속된 3x3 합성곱에서 저해상도 이미지 특징 맵을 생성합니다. + +2. 특징 맵은 *픽셀 디코더*에 전달됩니다. 이 디코더는 저해상도 특징을 고해상도 픽셀 임베딩으로 점진적으로 업샘플링합니다. 픽셀 디코더는 실제로 원본 이미지의 1/32, 1/16, 1/8 해상도의 다중 스케일 특징(저해상도 및 고해상도 특징 모두 포함)을 생성합니다. + +3. 이러한 서로 다른 크기의 특징 맵은 고해상도 특징에서 작은 객체를 포착하기 위해 한 번에 하나의 Transformer 디코더 레이어에 연속적으로 공급됩니다. Mask2Former의 핵심은 디코더의 *마스크 어텐션* 메커니즘입니다. 전체 이미지를 참조할 수 있는 크로스 어텐션(cross-attention)과 달리, 마스크 어텐션은 이미지의 특정 영역에만 집중합니다. 이는 이미지의 지역적 특징만으로 모델이 충분히 학습할 수 있기 때문에 더 빠르고 성능이 우수합니다. + +4. [DETR](tasks_explained#object-detection)과 같이, Mask2Former는 학습된 객체 쿼리를 사용하고 이를 픽셀 디코더에서의 이미지 특징과 결합하여 예측 집합(`클래스 레이블`, `마스크 예측`)을 생성합니다. 디코더의 은닉 상태는 선형 레이어로 전달되어 클래스 레이블에 대한 로짓으로 변환됩니다. 로짓과 클래스 레이블 사이의 교차 엔트로피 손실을 계산하여 가장 가능성이 높은 것을 찾습니다. + + 마스크 예측은 픽셀 임베딩과 최종 디코더 은닉 상태를 결합하여 생성됩니다. 시그모이드 교차 엔트로피 및 Dice 손실은 로짓과 실제 정답 마스크(ground truth mask) 사이에서 계산되어 가장 가능성이 높은 마스크를 찾습니다. + +이미지 분할에 직접 도전할 준비가 되셨나요? 완전한 [이미지 분할 가이드](tasks/semantic_segmentation)를 확인하여 SegFormer를 미세 조정하고 추론에 사용하는 방법을 학습하세요! + +### 깊이 추정[[depth-estimation]] + +[GLPN](model_doc/glpn), *Global-Local Path Network*는 [SegFormer](model_doc/segformer) 인코더와 경량 디코더를 결합한 깊이 추정을 위한 Transformer입니다. + +
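+아래는 GLPN 체크포인트로 깊이 추정 추론을 수행하는 간단한 스케치입니다. 체크포인트 이름과 이미지 경로는 설명을 위한 가정입니다:
+
+```py
+from transformers import pipeline
+
+# 가정: NYU Depth V2로 학습된 GLPN 체크포인트를 사용합니다
+depth_estimator = pipeline("depth-estimation", model="vinvino02/glpn-nyu")
+
+# "image.jpg"는 예시 경로입니다. 결과에는 픽셀별 깊이 값을 담은 텐서가 포함됩니다
+outputs = depth_estimator("image.jpg")
+print(outputs["predicted_depth"].shape)
+```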
+ +1. ViT와 같이, 이미지는 패치 시퀀스로 분할되지만, 이미지 패치가 더 작다는 점이 다릅니다. 이는 세그멘테이션이나 깊이 추정과 같은 밀도 예측 작업에 더 적합합니다. 이미지 패치는 패치 임베딩으로 변환되어(패치 임베딩이 생성되는 방법은 [이미지 분류](#image-classification) 섹션을 참조하세요), 인코더로 전달됩니다. + +2. 인코더는 패치 임베딩을 받아, 여러 인코더 블록에 전달합니다. 각 블록은 어텐션 및 Mix-FFN 레이어로 구성됩니다. 후자의 목적은 위치 정보를 제공하는 것입니다. 각 인코더 블록의 끝에는 계층적 표현을 생성하기 위한 *패치 병합(patch merging)* 레이어가 있습니다. 각 인접한 패치 그룹의 특징은 연결되고, 연결된 특징에 선형 레이어가 적용되어 패치 수를 1/4의 해상도로 줄입니다. 이는 다음 인코더 블록의 입력이 되며, 이러한 전체 프로세스는 1/8, 1/16, 1/32 해상도의 이미지 특징을 가질 때까지 반복됩니다. + +3. 경량 디코더는 인코더에서 마지막 특징 맵(1/32 크기)을 가져와 1/16 크기로 업샘플링합니다. 여기서, 특징은 *선택적 특징 융합(SFF, Selective Feature Fusion)* 모듈로 전달됩니다. 이 모듈은 각 특징에 대해 어텐션 맵에서 로컬 및 전역 특징을 선택하고 결합한 다음, 1/8로 업샘플링합니다. 이 프로세스는 디코딩된 특성이 원본 이미지와 동일한 크기가 될 때까지 반복됩니다. 출력은 두 개의 합성곱 레이어를 거친 다음, 시그모이드 활성화가 적용되어 각 픽셀의 깊이를 예측합니다. + +## 자연어처리[[natural-language-processing]] + +Transformer는 초기에 기계 번역을 위해 설계되었고, 그 이후로는 사실상 모든 NLP 작업을 해결하기 위한 기본 아키텍처가 되었습니다. 어떤 작업은 Transformer의 인코더 구조에 적합하며, 다른 작업은 디코더에 더 적합합니다. 또 다른 작업은 Transformer의 인코더-디코더 구조를 모두 활용합니다. + +### 텍스트 분류[[text-classification]] + +[BERT](model_doc/bert)는 인코더 전용 모델이며, 텍스트의 풍부한 표현을 학습하기 위해 양방향의 단어에 주목함으로써 심층 양방향성(deep bidirectionality)을 효과적으로 구현한 최초의 모델입니다. + +1. BERT는 [WordPiece](tokenizer_summary#wordpiece) 토큰화를 사용하여 문장의 토큰 임베딩을 생성합니다. 단일 문장과 한 쌍의 문장을 구분하기 위해 특수한 `[SEP]` 토큰이 추가됩니다. 모든 텍스트 시퀀스의 시작 부분에는 특수한 `[CLS]` 토큰이 추가됩니다. `[CLS]` 토큰이 있는 최종 출력은 분류 작업을 위한 분류 헤드로 입력에 사용됩니다. BERT는 또한 한 쌍의 문장에서 각 토큰이 첫 번째 문장인지 두 번째 문장에 속하는지 나타내는 세그먼트 임베딩(segment embedding)을 추가합니다. + +2. BERT는 마스크드 언어 모델링과 다음 문장 예측, 두 가지 목적으로 사전훈련됩니다. 마스크드 언어 모델링에서는 입력 토큰의 일부가 무작위로 마스킹되고, 모델은 이를 예측해야 합니다. 이는 모델이 모든 단어를 보고 다음 단어를 "예측"할 수 있는 양방향성 문제를 해결합니다. 예측된 마스크 토큰의 최종 은닉 상태는 어휘에 대한 소프트맥스가 있는 순방향 네트워크로 전달되어 마스크된 단어를 예측합니다. + + 두 번째 사전훈련 대상은 다음 문장 예측입니다. 모델은 문장 B가 문장 A 다음에 오는지 예측해야 합니다. 문장 B가 다음 문장인 경우와 무작위 문장인 경우 각각 50%의 확률로 발생합니다. 다음 문장인지 아닌지에 대한 예측은 두 개의 클래스(`IsNext` 및 `NotNext`)에 대한 소프트맥스가 있는 순방향 네트워크로 전달됩니다. + +3. 입력 임베딩은 여러 인코더 레이어를 거쳐서 최종 은닉 상태를 출력합니다. + +사전훈련된 모델을 텍스트 분류에 사용하려면, 기본 BERT 모델 상단에 시퀀스 분류 헤드를 추가합니다. 시퀀스 분류 헤드는 최종 은닉 상태를 받는 선형 레이어이며, 로짓으로 변환하기 위해 선형 변환을 수행합니다. 교차 엔트로피 손실은 로짓과 타겟 간에 계산되어 가장 가능성이 높은 레이블을 찾습니다. + +텍스트 분류에 직접 도전할 준비가 되셨나요? 완전한 [텍스트 분류 가이드](tasks/sequence_classification)를 확인하여 DistilBERT를 미세 조정하고 추론에 사용하는 방법을 학습하세요! + +### 토큰 분류[[token-classification]] + +개체명 인식(Named Entity Recognition, NER)과 같은 토큰 분류 작업에 BERT를 사용하려면, 기본 BERT 모델 상단에 토큰 분류 헤드를 추가합니다. 토큰 분류 헤드는 최종 은닉 상태를 받는 선형 레이어이며, 로짓으로 변환하기 위해 선형 변환을 수행합니다. 교차 엔트로피 손실은 로짓과 각 토큰 간에 계산되어 가장 가능성이 높은 레이블을 찾습니다. + +토큰 분류에 직접 도전할 준비가 되셨나요? 완전한 [토큰 분류 가이드](tasks/token_classification)를 확인하여 DistilBERT를 미세 조정하고 추론에 사용하는 방법을 학습하세요! + +### 질의응답[[question-answering]] + +질의응답에 BERT를 사용하려면, 기본 BERT 모델 위에 스팬(span) 분류 헤드를 추가합니다. 이 선형 레이어는 최종 은닉 상태를 받고, 답변에 대응하는 `스팬`의 시작과 끝 로그를 계산하기 위해 선형 변환을 수행합니다. 교차 엔트로피 손실은 로짓과 각 레이블 위치 간에 계산되어 답변에 대응하는 가장 가능성이 높은 텍스트의 스팬을 찾습니다. + +질의응답에 직접 도전할 준비가 되셨나요? 완전한 [질의응답 가이드](tasks/question_answering)를 확인하여 DistilBERT를 미세 조정하고 추론에 사용하는 방법을 학습하세요! + + + +💡 사전훈련된 BERT를 다양한 작업에 사용하는 것이 얼마나 쉬운지 주목하세요. 사전훈련된 모델에 특정 헤드를 추가하기만 하면 은닉 상태를 원하는 출력으로 조작할 수 있습니다! + + + +### 텍스트 생성[[text-generation]] + +[GPT-2](model_doc/gpt2)는 대량의 텍스트에 대해 사전훈련된 디코딩 전용 모델입니다. 프롬프트를 주어지면 설득력 있는 (항상 사실은 아니지만!) 텍스트를 생성하고 명시적으로 훈련되지 않았음에도 불구하고 질의응답과 같은 다른 NLP 작업을 완수할 수 있습니다. + +
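+예를 들어, 사전훈련된 GPT-2로 텍스트를 생성하는 과정은 다음과 같이 스케치할 수 있습니다. 프롬프트와 생성 길이는 임의의 예시입니다:
+
+```py
+from transformers import pipeline
+
+generator = pipeline("text-generation", model="openai-community/gpt2")
+
+# 프롬프트 뒤에 올 토큰을 한 번에 하나씩 예측하여 텍스트를 이어 씁니다
+print(generator("Hugging Face is a", max_new_tokens=20)[0]["generated_text"])
+```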
+ +1. GPT-2는 단어를 토큰화하고 토큰 임베딩을 생성하기 위해 [바이트 페어 인코딩(BPE, byte pair encoding)](tokenizer_summary#bytepair-encoding-bpe)을 사용합니다. 위치 인코딩은 시퀀스에서 각 토큰의 위치를 나타내기 위해 토큰 임베딩에 추가됩니다. 입력 임베딩은 여러 디코더 블록을 거쳐 일부 최종 은닉 상태를 출력합니다. 각 디코더 블록 내에서 GPT-2는 *마스크드 셀프 어텐션(masked self-attention)* 레이어를 사용합니다. 이는 GPT-2가 이후 토큰(future tokens)에 주의를 기울일 수 없도록 합니다. 왼쪽에 있는 토큰에만 주의를 기울일 수 있습니다. 마스크드 셀프 어텐션에서는 어텐션 마스크를 사용하여 이후 토큰에 대한 점수(score)를 `0`으로 설정하기 때문에 BERT의 [`mask`] 토큰과 다릅니다. + +2. 디코더의 출력은 언어 모델링 헤드에 전달되며, 언어 모델링 헤드는 은닉 상태를 로짓으로 선형 변환을 수행합니다. 레이블은 시퀀스의 다음 토큰으로, 로짓을 오른쪽으로 하나씩 이동하여 생성됩니다. 교차 엔트로피 손실은 이동된 로짓과 레이블 간에 계산되어 가장 가능성이 높은 다음 토큰을 출력합니다. + +GPT-2의 사전훈련 목적은 전적으로 [인과적 언어 모델링](glossary#causal-language-modeling)에 기반하여, 시퀀스에서 다음 단어를 예측하는 것입니다. 이는 GPT-2가 텍스트 생성에 관련된 작업에 특히 우수하도록 합니다. + +텍스트 생성에 직접 도전할 준비가 되셨나요? 완전한 [인과적 언어 모델링 가이드](tasks/language_modeling#causal-language-modeling)를 확인하여 DistilGPT-2를 미세 조정하고 추론에 사용하는 방법을 학습하세요! + + + +텍스트 생성에 대한 자세한 내용은 [텍스트 생성 전략](generation_strategies) 가이드를 확인하세요! + + + +### 요약[[summarization]] + +[BART](model_doc/bart) 및 [T5](model_doc/t5)와 같은 인코더-디코더 모델은 요약 작업의 시퀀스-투-시퀀스 패턴을 위해 설계되었습니다. 이 섹션에서 BART의 작동 방법을 설명한 다음, 마지막에 T5를 미세 조정해볼 수 있습니다. + +
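+내부 동작을 설명하기 전에, 미세 조정된 BART 체크포인트로 요약을 수행하는 간단한 스케치는 다음과 같습니다. 체크포인트 이름과 입력 텍스트는 설명을 위한 가정입니다:
+
+```py
+from transformers import pipeline
+
+# 가정: CNN/DailyMail 데이터셋으로 미세 조정된 BART 체크포인트를 사용합니다
+summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
+
+article = "..."  # 요약할 긴 영어 기사 텍스트(예시 입력)
+print(summarizer(article, max_length=60)[0]["summary_text"])
+```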
+ +1. BART의 인코더 아키텍처는 BERT와 매우 유사하며 텍스트의 토큰 및 위치 임베딩을 받습니다. BART는 입력을 변형시키고 디코더로 재구성하여 사전훈련됩니다. 특정 변형 기법이 있는 다른 인코더와는 달리, BART는 모든 유형의 변형을 적용할 수 있습니다. 그러나 *text infilling* 변형 기법이 가장 잘 작동합니다. Text Infiling에서는 여러 텍스트 스팬을 **단일** [`mask`] 토큰으로 대체합니다. 이는 모델이 마스크된 토큰을 예측해야 하고, 모델에 누락된 토큰의 수를 예측하도록 가르치기 때문에 중요합니다. 입력 임베딩과 마스크된 스팬이 인코더를 거쳐 최종 은닉 상태를 출력하지만, BERT와 달리 BART는 마지막에 단어를 예측하는 순방향 네트워크를 추가하지 않습니다. + +2. 인코더의 출력은 디코더로 전달되며, 디코더는 인코더의 출력에서 마스크 토큰과 변형되지 않은 토큰을 예측해야 합니다. 이는 디코더가 원본 텍스트를 복원하는 데 도움이 되는 추가적인 문맥을 얻도록 합니다. 디코더의 출력은 언어 모델링 헤드에 전달되며, 언어 모델링 헤드는 은닉 상태를 로짓으로 선형 변환을 수행합니다. 교차 엔트로피 손실은 로짓과 토큰이 오른쪽으로 이동된 레이블 간에 계산됩니다. + +요약에 직접 도전할 준비가 되셨나요? 완전한 [요약 가이드](tasks/summarization)를 확인하여 T5를 미세 조정하고 추론에 사용하는 방법을 학습하세요! + + + +텍스트 생성에 대한 자세한 내용은 [텍스트 생성 전략](generation_strategies) 가이드를 확인하세요! + + + +### 번역[[translation]] + +번역은 시퀀스-투-시퀀스 작업의 또 다른 예로, [BART](model_doc/bart) 또는 [T5](model_doc/t5)와 같은 인코더-디코더 모델을 사용할 수 있습니다. 이 섹션에서 BART의 작동 방법을 설명한 다음, 마지막에 T5를 미세 조정해볼 수 있습니다. + +BART는 원천 언어를 타겟 언어로 디코딩할 수 있는 입력에 매핑하기 위해 무작위로 초기화된 별도의 인코더를 추가하여 번역에 적용합니다. 이 새로운 인코더의 임베딩은 원본 단어 임베딩 대신 사전훈련된 인코더로 전달됩니다. 원천 인코더는 모델 출력의 교차 엔트로피 손실로부터 원천 인코더, 위치 임베딩, 입력 임베딩을 갱신하여 훈련됩니다. 첫 번째 단계에서는 모델 파라미터가 고정되고, 두 번째 단계에서는 모든 모델 파라미터가 함께 훈련됩니다. + +BART는 이후 번역을 위해 다양한 언어로 사전훈련된 다국어 버전의 mBART로 확장되었습니다. + +번역에 직접 도전할 준비가 되셨나요? 완전한 [번역 가이드](tasks/summarization)를 확인하여 T5를 미세 조정하고 추론에 사용하는 방법을 학습하세요! + + + +텍스트 생성에 대한 자세한 내용은 [텍스트 생성 전략](generation_strategies) 가이드를 확인하세요! + + \ No newline at end of file diff --git a/docs/source/ko/testing.md b/docs/source/ko/testing.md new file mode 100644 index 00000000000000..aad22c00feea4d --- /dev/null +++ b/docs/source/ko/testing.md @@ -0,0 +1,1278 @@ + + +# 테스트[[testing]] + + +먼저 🤗 Transformers 모델이 어떻게 테스트되는지 살펴보고, 새로운 테스트를 작성 및 기존 테스트를 개선하는 방법을 알아봅시다. + +이 저장소에는 2개의 테스트 스위트가 있습니다: + +1. `tests` - 일반 API에 대한 테스트 +2. `examples` - API의 일부가 아닌 다양한 응용 프로그램에 대한 테스트 + +## Transformers 테스트 방법[[how-transformers-are-tested]] + +1. PR이 제출되면 9개의 CircleCi 작업으로 테스트가 진행됩니다. 해당 PR에 대해 새로운 커밋이 생성될 때마다 테스트는 다시 진행됩니다. 이 작업들은 + 이 [config 파일](https://github.com/huggingface/transformers/tree/main/.circleci/config.yml)에 정의되어 있으므로 필요하다면 + 사용자의 로컬 환경에서 동일하게 재현해 볼 수 있습니다. + + 이 CI 작업은 `@slow` 테스트를 실행하지 않습니다. + +2. [github actions](https://github.com/huggingface/transformers/actions)에 의해 실행되는 작업은 3개입니다: + + - [torch hub integration](https://github.com/huggingface/transformers/tree/main/.github/workflows/github-torch-hub.yml): + torch hub integration이 작동하는지 확인합니다. + + - [self-hosted (push)](https://github.com/huggingface/transformers/tree/main/.github/workflows/self-push.yml): `main` 브랜치에서 커밋이 업데이트된 경우에만 GPU를 이용한 빠른 테스트를 실행합니다. + 이는 `src`, `tests`, `.github` 폴더 중 하나에 코드가 업데이트된 경우에만 실행됩니다. + (model card, notebook, 기타 등등을 추가한 경우 실행되지 않도록 하기 위해서입니다) + + - [self-hosted runner](https://github.com/huggingface/transformers/tree/main/.github/workflows/self-scheduled.yml): `tests` 및 `examples`에서 + GPU를 이용한 일반 테스트, 느린 테스트를 실행합니다. + + +```bash +RUN_SLOW=1 pytest tests/ +RUN_SLOW=1 pytest examples/ +``` + + 결과는 [여기](https://github.com/huggingface/transformers/actions)에서 확인할 수 있습니다. + + +## 테스트 실행[[running-tests]] + + + + + +### 실행할 테스트 선택[[choosing-which-tests-to-run]] + +이 문서는 테스트를 실행하는 다양한 방법에 대해 자세히 설명합니다. +모든 내용을 읽은 후에도, 더 자세한 내용이 필요하다면 [여기](https://docs.pytest.org/en/latest/usage.html)에서 확인할 수 있습니다. + +다음은 가장 유용한 테스트 실행 방법 몇 가지입니다. 
+ +모두 실행: + +```console +pytest +``` + +또는: + +```bash +make test +``` + +후자는 다음과 같이 정의됩니다: + +```bash +python -m pytest -n auto --dist=loadfile -s -v ./tests/ +``` + +위의 명령어는 pytest에게 아래의 내용을 전달합니다: + +- 사용 가능한 CPU 코어 수만큼 테스트 프로세스를 실행합니다. (RAM이 충분하지 않다면, 테스트 프로세스 수가 너무 많을 수 있습니다!) +- 동일한 파일의 모든 테스트는 동일한 테스트 프로세스에서 실행되어야 합니다. +- 출력을 캡처하지 않습니다. +- 자세한 모드로 실행합니다. + + + +### 모든 테스트 목록 가져오기[[getting-the-list-of-all-tests]] + +테스트 스위트의 모든 테스트: + +```bash +pytest --collect-only -q +``` + +지정된 테스트 파일의 모든 테스트: + +```bash +pytest tests/test_optimization.py --collect-only -q +``` + +### 특정 테스트 모듈 실행[[run-a-specific-test-module]] + +개별 테스트 모듈 실행하기: + +```bash +pytest tests/utils/test_logging.py +``` + +### 특정 테스트 실행[[run-specific-tests]] + +대부분의 테스트 내부에서는 unittest가 사용됩니다. 따라서 특정 하위 테스트를 실행하려면 해당 테스트를 포함하는 unittest 클래스의 이름을 알아야 합니다. +예를 들어 다음과 같을 수 있습니다: + +```bash +pytest tests/test_optimization.py::OptimizationTest::test_adam_w +``` + +위의 명령어의 의미는 다음과 같습니다: + +- `tests/test_optimization.py` - 테스트가 있는 파일 +- `OptimizationTest` - 클래스의 이름 +- `test_adam_w` - 특정 테스트 함수의 이름 + +파일에 여러 클래스가 포함된 경우, 특정 클래스의 테스트만 실행할 수도 있습니다. 예를 들어 다음과 같습니다: + +```bash +pytest tests/test_optimization.py::OptimizationTest +``` + +이 명령어는 해당 클래스 내부의 모든 테스트를 실행합니다. + +앞에서 언급한 것처럼 `OptimizationTest` 클래스에 포함된 테스트를 확인할 수 있습니다. + +```bash +pytest tests/test_optimization.py::OptimizationTest --collect-only -q +``` + +키워드 표현식을 사용하여 테스트를 실행할 수도 있습니다. + +`adam`이라는 이름을 포함하는 테스트만 실행하려면 다음과 같습니다: + +```bash +pytest -k adam tests/test_optimization.py +``` + +논리 연산자 `and`와 `or`를 사용하여 모든 키워드가 일치해야 하는지 또는 어느 하나가 일치해야 하는지를 나타낼 수 있습니다. +`not`은 부정할 때 사용할 수 있습니다. + +`adam`이라는 이름을 포함하지 않는 모든 테스트를 실행하려면 다음과 같습니다: + +```bash +pytest -k "not adam" tests/test_optimization.py +``` + +두 가지 패턴을 하나로 결합할 수도 있습니다: + +```bash +pytest -k "ada and not adam" tests/test_optimization.py +``` + +예를 들어 `test_adafactor`와 `test_adam_w`를 모두 실행하려면 다음을 사용할 수 있습니다: + +```bash +pytest -k "test_adam_w or test_adam_w" tests/test_optimization.py +``` + +여기서 `or`를 사용하는 것에 유의하세요. 두 키워드 중 하나가 일치하도록 하기 위한 목적으로 사용하기 때문입니다. + +두 패턴이 모두 포함되어야 하는 테스트만 실행하려면, `and`를 사용해야 합니다: + +```bash +pytest -k "test and ada" tests/test_optimization.py +``` + +### `accelerate` 테스트 실행[[run-`accelerate`-tests]] + +모델에서 `accelerate` 테스트를 실행해야 할 때가 있습니다. 이를 위해서는 명령어에 `-m accelerate_tests`를 추가하면 됩니다. +예를 들어, `OPT`에서 이러한 테스트를 실행하려면 다음과 같습니다: +```bash +RUN_SLOW=1 pytest -m accelerate_tests tests/models/opt/test_modeling_opt.py +``` + +### 문서 테스트 실행[[run-documentation-tests]] + +예시 문서가 올바른지 테스트하려면 `doctests`가 통과하는지 확인해야 합니다. 
+예를 들어, [`WhisperModel.forward`'s docstring](https://github.com/huggingface/transformers/blob/main/src/transformers/models/whisper/modeling_whisper.py#L1017-L1035)를 사용해 봅시다: + +```python +r""" +Returns: + +Example: + ```python + >>> import torch + >>> from transformers import WhisperModel, WhisperFeatureExtractor + >>> from datasets import load_dataset + + >>> model = WhisperModel.from_pretrained("openai/whisper-base") + >>> feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base") + >>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") + >>> inputs = feature_extractor(ds[0]["audio"]["array"], return_tensors="pt") + >>> input_features = inputs.input_features + >>> decoder_input_ids = torch.tensor([[1, 1]]) * model.config.decoder_start_token_id + >>> last_hidden_state = model(input_features, decoder_input_ids=decoder_input_ids).last_hidden_state + >>> list(last_hidden_state.shape) + [1, 2, 512] + ```""" + +``` + +원하는 파일의 모든 docstring 예제를 자동으로 테스트하려면 다음 명령을 실행하면 됩니다: +```bash +pytest --doctest-modules +``` +파일의 확장자가 markdown인 경우 `--doctest-glob="*.md"` 인수를 추가해야 합니다. + +### 수정된 테스트만 실행[[run-only-modified-tests]] + +수정된 파일 또는 현재 브랜치 (Git 기준)와 관련된 테스트를 실행하려면 [pytest-picked](https://github.com/anapaulagomes/pytest-picked)을 사용할 수 있습니다. +이는 변경한 내용이 테스트에 영향을 주지 않았는지 빠르게 확인할 수 있는 좋은 방법입니다. + +```bash +pip install pytest-picked +``` + +```bash +pytest --picked +``` + +수정되었지만, 아직 커밋되지 않은 모든 파일 및 폴더에서 테스트가 실행됩니다. + +### 소스 수정 시 실패한 테스트 자동 재실행[[automatically-rerun-failed-tests-on-source-modification]] + +[pytest-xdist](https://github.com/pytest-dev/pytest-xdist)는 모든 실패한 테스트를 감지하고, +파일을 수정한 후에 파일을 계속 재실행하여 테스트가 성공할 때까지 기다리는 매우 유용한 기능을 제공합니다. +따라서 수정한 내용을 확인한 후 pytest를 다시 시작할 필요가 없습니다. +모든 테스트가 통과될 때까지 이 과정을 반복한 후 다시 전체 실행이 이루어집니다. + +```bash +pip install pytest-xdist +``` + +재귀적 모드의 사용: `pytest -f` 또는 `pytest --looponfail` + +파일의 변경 사항은 `looponfailroots` 루트 디렉터리와 해당 내용을 (재귀적으로) 확인하여 감지됩니다. +이 값의 기본값이 작동하지 않는 경우, +`setup.cfg`의 설정 옵션을 변경하여 프로젝트에서 변경할 수 있습니다: + +```ini +[tool:pytest] +looponfailroots = transformers tests +``` + +또는 `pytest.ini`/`tox.ini`` 파일: + +```ini +[pytest] +looponfailroots = transformers tests +``` + +이렇게 하면 ini-file의 디렉터리를 기준으로 상대적으로 지정된 각 디렉터리에서 파일 변경 사항만 찾게 됩니다. + + +이 기능을 대체할 수 있는 구현 방법인 [pytest-watch](https://github.com/joeyespo/pytest-watch)도 있습니다. + + +### 특정 테스트 모듈 건너뛰기[[skip-a-test-module]] + +모든 테스트 모듈을 실행하되 특정 모듈을 제외하려면, 실행할 테스트 목록을 명시적으로 지정할 수 있습니다. +예를 들어, `test_modeling_*.py` 테스트를 제외한 모든 테스트를 실행하려면 다음을 사용할 수 있습니다: + +```bash +pytest *ls -1 tests/*py | grep -v test_modeling* +``` + +### 상태 초기화[[clearing state]] + +CI 빌드 및 (속도에 대한) 격리가 중요한 경우, 캐시를 지워야 합니다: + +```bash +pytest --cache-clear tests +``` + +### 테스트를 병렬로 실행[[running-tests-in-parallel]] + +이전에 언급한 것처럼 `make test`는 테스트를 병렬로 실행하기 위해 +`pytest-xdist` 플러그인(`-n X` 인수, 예를 들어 `-n 2`를 사용하여 2개의 병렬 작업 실행)을 통해 실행됩니다. + +`pytest-xdist`의 `--dist=` 옵션을 사용하여 테스트를 어떻게 그룹화할지 제어할 수 있습니다. +`--dist=loadfile`은 하나의 파일에 있는 테스트를 동일한 프로세스로 그룹화합니다. + +실행된 테스트의 순서가 다르고 예측할 수 없기 때문에, `pytest-xdist`로 테스트 스위트를 실행하면 실패가 발생할 수 있습니다 (검출되지 않은 결합된 테스트가 있는 경우). +이 경우 [pytest-replay](https://github.com/ESSS/pytest-replay)를 사용하면 동일한 순서로 테스트를 다시 실행해서 +실패하는 시퀀스를 최소화하는 데에 도움이 됩니다. + +### 테스트 순서와 반복[[test-order-and-repetition]] + +잠재적인 종속성 및 상태 관련 버그(tear down)를 감지하기 위해 +테스트를 여러 번, 연속으로, 무작위로 또는 세트로 반복하는 것이 좋습니다. +그리고 직접적인 여러 번의 반복은 DL의 무작위성에 의해 발견되는 일부 문제를 감지하는 데에도 유용합니다. 
+ + +#### 테스트를 반복[[repeat-tests]] + +- [pytest-flakefinder](https://github.com/dropbox/pytest-flakefinder): + +```bash +pip install pytest-flakefinder +``` + +모든 테스트를 여러 번 실행합니다(기본값은 50번): + +```bash +pytest --flake-finder --flake-runs=5 tests/test_failing_test.py +``` + + + +이 플러그인은 `pytest-xdist`의 `-n` 플래그와 함께 작동하지 않습니다. + + + + + +`pytest-repeat`라는 또 다른 플러그인도 있지만 `unittest`와 함께 작동하지 않습니다. + + + +#### 테스트를 임의의 순서로 실행[[run-tests-in-a-random-order]] + +```bash +pip install pytest-random-order +``` + +중요: `pytest-random-order`가 설치되면 테스트가 자동으로 임의의 순서로 섞입니다. +구성 변경이나 커맨드 라인 옵션이 필요하지 않습니다. + +앞서 설명한 것처럼 이를 통해 한 테스트의 상태가 다른 테스트의 상태에 영향을 미치는 결합된 테스트를 감지할 수 있습니다. +`pytest-random-order`가 설치되면 해당 세션에서 사용된 랜덤 시드가 출력되며 예를 들어 다음과 같습니다: + +```bash +pytest tests +[...] +Using --random-order-bucket=module +Using --random-order-seed=573663 +``` + +따라서 특정 시퀀스가 실패하는 경우에는 정확한 시드를 추가하여 재현할 수 있습니다. 예를 들어 다음과 같습니다: + +```bash +pytest --random-order-seed=573663 +[...] +Using --random-order-bucket=module +Using --random-order-seed=573663 +``` + +정확히 동일한 테스트 목록(또는 목록이 없음)을 사용하는 경우에만 정확한 순서를 재현합니다. +목록을 수동으로 좁히기 시작하면 더 이상 시드에 의존할 수 없고 실패했던 정확한 순서로 수동으로 목록을 나열해야합니다. 그리고 `--random-order-bucket=none`을 사용하여 pytest에게 순서를 임의로 설정하지 않도록 알려야 합니다. +예를 들어 다음과 같습니다: + +```bash +pytest --random-order-bucket=none tests/test_a.py tests/test_c.py tests/test_b.py +``` + +모든 테스트에 대해 섞기를 비활성화하려면 다음과 같습니다: + +```bash +pytest --random-order-bucket=none +``` + +기본적으로 `--random-order-bucket=module`이 내재되어 있으므로, 모듈 수준에서 파일을 섞습니다. +또한 `class`, `package`, `global` 및 `none` 수준에서도 섞을 수 있습니다. +자세한 내용은 해당 [문서](https://github.com/jbasko/pytest-random-order)를 참조하세요. + +또 다른 무작위화의 대안은 [`pytest-randomly`](https://github.com/pytest-dev/pytest-randomly)입니다. +이 모듈은 매우 유사한 기능/인터페이스를 가지고 있지만, `pytest-random-order`에 있는 버킷 모드를 사용할 수는 없습니다. +설치 후에는 자동으로 적용되는 문제도 동일하게 가집니다. + +### 외관과 느낌을 변경[[look-and-feel-variations] + +#### pytest-sugar 사용[[pytest-sugar]] + +[pytest-sugar](https://github.com/Frozenball/pytest-sugar)는 테스트가 보여지는 형태를 개선하고, +진행 상황 바를 추가하며, 실패한 테스트와 검증을 즉시 표시하는 플러그인입니다. 설치하면 자동으로 활성화됩니다. + +```bash +pip install pytest-sugar +``` + +pytest-sugar 없이 테스트를 실행하려면 다음과 같습니다: + +```bash +pytest -p no:sugar +``` + +또는 제거하세요. + + + +#### 각 하위 테스트 이름과 진행 상황 보고[[report-each-sub-test-name-and-its-progress]] + +`pytest`를 통해 단일 또는 그룹의 테스트를 실행하는 경우(`pip install pytest-pspec` 이후): + +```bash +pytest --pspec tests/test_optimization.py +``` + +#### 실패한 테스트 즉시 표시[[instantly-shows-failed-tests]] + +[pytest-instafail](https://github.com/pytest-dev/pytest-instafail)은 테스트 세션의 끝까지 기다리지 않고 +실패 및 오류를 즉시 표시합니다. + +```bash +pip install pytest-instafail +``` + +```bash +pytest --instafail +``` + +### GPU 사용 여부[[to-GPU-or-not-to-GPU]] + +GPU가 활성화된 환경에서, CPU 전용 모드로 테스트하려면 `CUDA_VISIBLE_DEVICES=""`를 추가합니다: + +```bash +CUDA_VISIBLE_DEVICES="" pytest tests/utils/test_logging.py +``` + +또는 다중 GPU가 있는 경우 `pytest`에서 사용할 GPU를 지정할 수도 있습니다. +예를 들어, GPU `0` 및 `1`이 있는 경우 다음을 실행할 수 있습니다: + +```bash +CUDA_VISIBLE_DEVICES="1" pytest tests/utils/test_logging.py +``` + +이렇게 하면 다른 GPU에서 다른 작업을 실행하려는 경우 유용합니다. + +일부 테스트는 반드시 CPU 전용으로 실행해야 하며, 일부는 CPU 또는 GPU 또는 TPU에서 실행해야 하고, 일부는 여러 GPU에서 실행해야 합니다. +다음 스킵 데코레이터는 테스트의 요구 사항을 CPU/GPU/TPU별로 설정하는 데 사용됩니다: + +- `require_torch` - 이 테스트는 torch에서만 실행됩니다. +- `require_torch_gpu` - `require_torch`에 추가로 적어도 1개의 GPU가 필요합니다. +- `require_torch_multi_gpu` - `require_torch`에 추가로 적어도 2개의 GPU가 필요합니다. +- `require_torch_non_multi_gpu` - `require_torch`에 추가로 0개 또는 1개의 GPU가 필요합니다. +- `require_torch_up_to_2_gpus` - `require_torch`에 추가로 0개, 1개 또는 2개의 GPU가 필요합니다. 
+- `require_torch_tpu` - `require_torch`에 추가로 적어도 1개의 TPU가 필요합니다. + +GPU 요구 사항을 표로 정리하면 아래와 같습니디ㅏ: + + +| n gpus | decorator | +|--------+--------------------------------| +| `>= 0` | `@require_torch` | +| `>= 1` | `@require_torch_gpu` | +| `>= 2` | `@require_torch_multi_gpu` | +| `< 2` | `@require_torch_non_multi_gpu` | +| `< 3` | `@require_torch_up_to_2_gpus` | + + +예를 들어, 2개 이상의 GPU가 있고 pytorch가 설치되어 있을 때에만 실행되어야 하는 테스트는 다음과 같습니다: + +```python no-style +@require_torch_multi_gpu +def test_example_with_multi_gpu(): +``` + +`tensorflow`가 필요한 경우 `require_tf` 데코레이터를 사용합니다. 예를 들어 다음과 같습니다: + +```python no-style +@require_tf +def test_tf_thing_with_tensorflow(): +``` + +이러한 데코레이터는 중첩될 수 있습니다. +예를 들어, 느린 테스트로 진행되고 pytorch에서 적어도 하나의 GPU가 필요한 경우 다음과 같이 설정할 수 있습니다: + +```python no-style +@require_torch_gpu +@slow +def test_example_slow_on_gpu(): +``` + +`@parametrized`와 같은 일부 데코레이터는 테스트 이름을 다시 작성하기 때문에 `@require_*` 스킵 데코레이터는 올바르게 작동하려면 항상 맨 마지막에 나열되어야 합니다. +다음은 올바른 사용 예입니다: + +```python no-style +@parameterized.expand(...) +@require_torch_multi_gpu +def test_integration_foo(): +``` + +`@pytest.mark.parametrize`에는 이러한 순서 문제는 없으므로 처음 혹은 마지막에 위치시킬 수 있고 이러한 경우에도 잘 작동할 것입니다. +하지만 unittest가 아닌 경우에만 작동합니다. + +테스트 내부에서 다음을 사용할 수 있습니다: + +- 사용 가능한 GPU 수: + +```python +from transformers.testing_utils import get_gpu_count + +n_gpu = get_gpu_count() #torch와 tf와 함께 작동 +``` + +### 분산 훈련[[distributed-training]] + +`pytest`는 분산 훈련을 직접적으로 다루지 못합니다. +이를 시도하면 하위 프로세스가 올바른 작업을 수행하지 않고 `pytest`라고 생각하기에 테스트 스위트를 반복해서 실행하게 됩니다. +그러나 일반 프로세스를 생성한 다음 여러 워커를 생성하고 IO 파이프를 관리하도록 하면 동작합니다. + +다음은 사용 가능한 테스트입니다: + +- [test_trainer_distributed.py](https://github.com/huggingface/transformers/tree/main/tests/trainer/test_trainer_distributed.py) +- [test_deepspeed.py](https://github.com/huggingface/transformers/tree/main/tests/deepspeed/test_deepspeed.py) + +실행 지점으로 바로 이동하려면, 해당 테스트에서 `execute_subprocess_async` 호출을 검색하세요. + +이러한 테스트를 실행하려면 적어도 2개의 GPU가 필요합니다. + +```bash +CUDA_VISIBLE_DEVICES=0,1 RUN_SLOW=1 pytest -sv tests/test_trainer_distributed.py +``` + +### 출력 캡처[[output-capture]] + +테스트 실행 중 `stdout` 및 `stderr`로 전송된 모든 출력이 캡처됩니다. +테스트나 설정 메소드가 실패하면 캡처된 출력은 일반적으로 실패 추적 정보와 함께 표시됩니다. + +출력 캡처를 비활성화하고 `stdout` 및 `stderr`를 정상적으로 받으려면 `-s` 또는 `--capture=no`를 사용하세요: + +```bash +pytest -s tests/utils/test_logging.py +``` + +테스트 결과를 JUnit 형식의 출력으로 보내려면 다음을 사용하세요: + +```bash +py.test tests --junitxml=result.xml +``` + +### 색상 조절[[color-control]] + +색상이 없게 하려면 다음과 같이 설정하세요(예를 들어 흰색 배경에 노란색 글씨는 가독성이 좋지 않습니다): + +```bash +pytest --color=no tests/utils/test_logging.py +``` + +### online pastebin service에 테스트 보고서 전송[[sending test report to online pastebin service]] + +각 테스트 실패에 대한 URL을 만듭니다: + +```bash +pytest --pastebin=failed tests/utils/test_logging.py +``` + +이렇게 하면 각 실패에 대한 URL을 제공하는 remote Paste service에 테스트 실행 정보를 제출합니다. +일반적인 테스트를 선택할 수도 있고 혹은 특정 실패만 보내려면 `-x`와 같이 추가할 수도 있습니다. + +전체 테스트 세션 로그에 대한 URL을 생성합니다: + +```bash +pytest --pastebin=all tests/utils/test_logging.py +``` + +## 테스트 작성[[writing-tests]] + +🤗 transformers 테스트는 대부분 `unittest`를 기반으로 하지만, +`pytest`에서 실행되므로 대부분의 경우 두 시스템의 기능을 사용할 수 있습니다. + +지원되는 기능에 대해 [여기](https://docs.pytest.org/en/stable/unittest.html)에서 확인할 수 있지만, +기억해야 할 중요한 점은 대부분의 `pytest` fixture가 작동하지 않는다는 것입니다. +파라미터화도 작동하지 않지만, 우리는 비슷한 방식으로 작동하는 `parameterized` 모듈을 사용합니다. + + +### 매개변수화[[parametrization]] + +동일한 테스트를 다른 인수로 여러 번 실행해야 하는 경우가 종종 있습니다. +테스트 내에서 이 작업을 수행할 수 있지만, 그렇게 하면 하나의 인수 세트에 대해 테스트를 실행할 수 없습니다. 
+ +```python +# test_this1.py +import unittest +from parameterized import parameterized + + +class TestMathUnitTest(unittest.TestCase): + @parameterized.expand( + [ + ("negative", -1.5, -2.0), + ("integer", 1, 1.0), + ("large fraction", 1.6, 1), + ] + ) + def test_floor(self, name, input, expected): + assert_equal(math.floor(input), expected) +``` + +이제 기본적으로 이 테스트는 `test_floor`의 마지막 3개 인수가 +매개변수 목록의 해당 인수에 할당되는 것으로 3번 실행될 것입니다. + +그리고 `negative` 및 `integer` 매개변수 집합만 실행하려면 다음과 같이 실행할 수 있습니다: + +```bash +pytest -k "negative and integer" tests/test_mytest.py +``` + +또는 `negative` 하위 테스트를 제외한 모든 서브 테스트를 다음과 같이 실행할 수 있습니다: + +```bash +pytest -k "not negative" tests/test_mytest.py +``` + +앞에서 언급한 `-k` 필터를 사용하는 것 외에도, +각 서브 테스트의 정확한 이름을 확인한 후에 일부 혹은 전체 서브 테스트를 실행할 수 있습니다. + +```bash +pytest test_this1.py --collect-only -q +``` + +그리고 다음의 내용을 확인할 수 있을 것입니다: + +```bash +test_this1.py::TestMathUnitTest::test_floor_0_negative +test_this1.py::TestMathUnitTest::test_floor_1_integer +test_this1.py::TestMathUnitTest::test_floor_2_large_fraction +``` + +2개의 특정한 서브 테스트만 실행할 수도 있습니다: + +```bash +pytest test_this1.py::TestMathUnitTest::test_floor_0_negative test_this1.py::TestMathUnitTest::test_floor_1_integer +``` + +`transformers`의 개발자 종속성에 이미 있는 [parameterized](https://pypi.org/project/parameterized/) 모듈은 +`unittests`와 `pytest` 테스트 모두에서 작동합니다. + +그러나 테스트가 `unittest`가 아닌 경우 `pytest.mark.parametrize`를 사용할 수 있습니다(이미 있는 일부 테스트에서 사용되는 경우도 있습니다. +주로 `examples` 하위에 있습니다). + +다음은 `pytest`의 `parametrize` 마커를 사용한 동일한 예입니다: + +```python +# test_this2.py +import pytest + + +@pytest.mark.parametrize( + "name, input, expected", + [ + ("negative", -1.5, -2.0), + ("integer", 1, 1.0), + ("large fraction", 1.6, 1), + ], +) +def test_floor(name, input, expected): + assert_equal(math.floor(input), expected) +``` + +`parameterized`와 마찬가지로 `pytest.mark.parametrize`를 사용하면 +`-k` 필터가 작동하지 않는 경우에도 실행할 서브 테스트를 정확하게 지정할 수 있습니다. +단, 이 매개변수화 함수는 서브 테스트의 이름 집합을 약간 다르게 생성합니다. 다음과 같은 모습입니다: + +```bash +pytest test_this2.py --collect-only -q +``` + +그리고 다음의 내용을 확인할 수 있을 것입니다: + +```bash +test_this2.py::test_floor[integer-1-1.0] +test_this2.py::test_floor[negative--1.5--2.0] +test_this2.py::test_floor[large fraction-1.6-1] +``` + +특정한 테스트에 대해서만 실행할 수도 있습니다: + +```bash +pytest test_this2.py::test_floor[negative--1.5--2.0] test_this2.py::test_floor[integer-1-1.0] +``` + +이전의 예시와 같이 실행할 수 있습니다. + + + +### 파일 및 디렉터리[[files-and-directories]] + +테스트에서 종종 현재 테스트 파일과 관련된 상대적인 위치를 알아야 하는 경우가 있습니다. +테스트가 여러 디렉터리에서 호출되거나 깊이가 다른 하위 디렉터리에 있을 수 있기 때문에 그 위치를 아는 것은 간단하지 않습니다. +`transformers.test_utils.TestCasePlus`라는 헬퍼 클래스는 모든 기본 경로를 처리하고 간단한 액세서를 제공하여 이 문제를 해결합니다: + + +- `pathlib` 객체(완전히 정해진 경로) + + - `test_file_path` - 현재 테스트 파일 경로 (예: `__file__`) + - test_file_dir` - 현재 테스트 파일이 포함된 디렉터리 + - tests_dir` - `tests` 테스트 스위트의 디렉터리 + - examples_dir` - `examples` 테스트 스위트의 디렉터리 + - repo_root_dir` - 저장소 디렉터리 + - src_dir` - `src`의 디렉터리(예: `transformers` 하위 디렉터리가 있는 곳) + +- 문자열로 변환된 경로---위와 동일하지만, `pathlib` 객체가 아닌 문자열로 경로를 반환합니다: + + - `test_file_path_str` + - `test_file_dir_str` + - `tests_dir_str` + - `examples_dir_str` + - `repo_root_dir_str` + - `src_dir_str` + +위의 내용을 사용하려면 테스트가 'transformers.test_utils.TestCasePlus'의 서브클래스에 있는지 확인해야 합니다. 
+예를 들어 다음과 같습니다: + +```python +from transformers.testing_utils import TestCasePlus + + +class PathExampleTest(TestCasePlus): + def test_something_involving_local_locations(self): + data_dir = self.tests_dir / "fixtures/tests_samples/wmt_en_ro" +``` + +만약 `pathlib`를 통해 경로를 조작할 필요가 없거나 경로를 문자열로만 필요로 하는 경우에는 `pathlib` 객체에 `str()`을 호출하거나 `_str`로 끝나는 접근자를 사용할 수 있습니다. +예를 들어 다음과 같습니다: + +```python +from transformers.testing_utils import TestCasePlus + + +class PathExampleTest(TestCasePlus): + def test_something_involving_stringified_locations(self): + examples_dir = self.examples_dir_str +``` + +### 임시 파일 및 디렉터리[[temporary-files-and-directories]] + +고유한 임시 파일 및 디렉터리를 사용하는 것은 병렬 테스트 실행에 있어 필수적입니다. +이렇게 함으로써 테스트들이 서로의 데이터를 덮어쓰지 않게 할 수 있습니다. 또한 우리는 생성된 테스트의 종료 단계에서 이러한 임시 파일 및 디렉터리를 제거하고 싶습니다. +따라서 이러한 요구 사항을 충족시켜주는 `tempfile`과 같은 패키지를 사용하는 것이 중요합니다. + +그러나 테스트를 디버깅할 때는 임시 파일이나 디렉터리에 들어가는 내용을 확인할 수 있어야 하며, +재실행되는 각 테스트마다 임시 파일이나 디렉터리의 경로에 대해 무작위 값이 아닌 정확한 값을 알고 싶을 것입니다. + +`transformers.test_utils.TestCasePlus`라는 도우미 클래스는 이러한 목적에 가장 적합합니다. +이 클래스는 `unittest.TestCase`의 하위 클래스이므로, 우리는 이것을 테스트 모듈에서 쉽게 상속할 수 있습니다. + +다음은 해당 클래스를 사용하는 예시입니다: + +```python +from transformers.testing_utils import TestCasePlus + + +class ExamplesTests(TestCasePlus): + def test_whatever(self): + tmp_dir = self.get_auto_remove_tmp_dir() +``` + +이 코드는 고유한 임시 디렉터리를 생성하고 `tmp_dir`을 해당 위치로 설정합니다. + +- 고유한 임시 디렉터리를 생성합니다: + +```python +def test_whatever(self): + tmp_dir = self.get_auto_remove_tmp_dir() +``` + +`tmp_dir`에는 생성된 임시 디렉터리의 경로가 포함됩니다. +이는 테스트의 종료 단계에서 자동으로 제거됩니다. + +- 선택한 경로로 임시 디렉터리 생성 후에 테스트 시작 전에 비어 있는 상태인지 확인하고, 테스트 후에는 비우지 마세요. + +```python +def test_whatever(self): + tmp_dir = self.get_auto_remove_tmp_dir("./xxx") +``` + +이것은 디버깅할 때 특정 디렉터리를 모니터링하고, +그 디렉터리에 이전에 실행된 테스트가 데이터를 남기지 않도록 하는 데에 유용합니다. + +- `before` 및 `after` 인수를 직접 오버라이딩하여 기본 동작을 변경할 수 있으며 +다음 중 하나의 동작으로 이어집니다: + + - `before=True`: 테스트 시작 시 임시 디렉터리가 항상 지워집니다. + - `before=False`: 임시 디렉터리가 이미 존재하는 경우 기존 파일은 그대로 남습니다. + - `after=True`: 테스트 종료 시 임시 디렉터리가 항상 삭제됩니다. + - `after=False`: 테스트 종료 시 임시 디렉터리가 항상 그대로 유지됩니다. + + + +`rm -r`에 해당하는 명령을 안전하게 실행하기 위해, +명시적인 `tmp_dir`을 사용하는 경우 프로젝트 저장소 체크 아웃의 하위 디렉터리만 허용됩니다. +따라서 실수로 `/tmp`가 아닌 중요한 파일 시스템의 일부가 삭제되지 않도록 항상 `./`로 시작하는 경로를 전달해야 합니다. + + + + + +각 테스트는 여러 개의 임시 디렉터리를 등록할 수 있으며, +별도로 요청하지 않는 한 모두 자동으로 제거됩니다. + + + +### 임시 sys.path 오버라이드[[temporary-sys.path-override]] + +`sys.path`를 다른 테스트로 임시로 오버라이드하기 위해 예를 들어 `ExtendSysPath` 컨텍스트 관리자를 사용할 수 있습니다. +예를 들어 다음과 같습니다: + + +```python +import os +from transformers.testing_utils import ExtendSysPath + +bindir = os.path.abspath(os.path.dirname(__file__)) +with ExtendSysPath(f"{bindir}/.."): + from test_trainer import TrainerIntegrationCommon # noqa +``` + +### 테스트 건너뛰기[[skipping-tests]] + +이것은 버그가 발견되어 새로운 테스트가 작성되었지만 아직 그 버그가 수정되지 않은 경우에 유용합니다. +이 테스트를 주 저장소에 커밋하려면 `make test` 중에 건너뛰도록 해야 합니다. + +방법: + +- **skip**은 테스트가 일부 조건이 충족될 경우에만 통과될 것으로 예상되고, 그렇지 않으면 pytest가 전체 테스트를 건너뛰어야 함을 의미합니다. +일반적인 예로는 Windows가 아닌 플랫폼에서 Windows 전용 테스트를 건너뛰거나 +외부 리소스(예를 들어 데이터베이스)에 의존하는 테스트를 건너뛰는 것이 있습니다. + +- **xfail**은 테스트가 특정한 이유로 인해 실패할 것으로 예상하는 것을 의미합니다. +일반적인 예로는 아직 구현되지 않은 기능이나 아직 수정되지 않은 버그의 테스트가 있습니다. +`xfail`로 표시된 테스트가 예상대로 실패하지 않고 통과된 경우, 이것은 xpass이며 테스트 결과 요약에 기록됩니다. + +두 가지 중요한 차이점 중 하나는 `skip`은 테스트를 실행하지 않지만 `xfail`은 실행한다는 것입니다. +따라서 오류가 있는 코드가 일부 테스트에 영향을 미칠 수 있는 경우 `xfail`을 사용하지 마세요. 
+ +#### 구현[[implementation]] + +- 전체 테스트를 무조건 건너뛰려면 다음과 같이 할 수 있습니다: + +```python no-style +@unittest.skip("this bug needs to be fixed") +def test_feature_x(): +``` + +또는 pytest를 통해: + +```python no-style +@pytest.mark.skip(reason="this bug needs to be fixed") +``` + +또는 `xfail` 방식으로: + +```python no-style +@pytest.mark.xfail +def test_feature_x(): +``` + +- 테스트 내부에서 내부 확인에 따라 테스트를 건너뛰는 방법은 다음과 같습니다: + +```python +def test_feature_x(): + if not has_something(): + pytest.skip("unsupported configuration") +``` + +또는 모듈 전체: + +```python +import pytest + +if not pytest.config.getoption("--custom-flag"): + pytest.skip("--custom-flag is missing, skipping tests", allow_module_level=True) +``` + +또는 `xfail` 방식으로: + +```python +def test_feature_x(): + pytest.xfail("expected to fail until bug XYZ is fixed") +``` + +- import가 missing된 모듈이 있을 때 그 모듈의 모든 테스트를 건너뛰는 방법: + +```python +docutils = pytest.importorskip("docutils", minversion="0.3") +``` + +- 조건에 따라 테스트를 건너뛰는 방법: + +```python no-style +@pytest.mark.skipif(sys.version_info < (3,6), reason="requires python3.6 or higher") +def test_feature_x(): +``` + +또는: + +```python no-style +@unittest.skipIf(torch_device == "cpu", "Can't do half precision") +def test_feature_x(): +``` + +또는 모듈 전체를 건너뛰는 방법: + +```python no-style +@pytest.mark.skipif(sys.platform == 'win32', reason="does not run on windows") +class TestClass(): + def test_feature_x(self): +``` + +보다 자세한 예제 및 방법은 [여기](https://docs.pytest.org/en/latest/skipping.html)에서 확인할 수 있습니다. + +### 느린 테스트[[slow-tests]] + +테스트 라이브러리는 지속적으로 확장되고 있으며, 일부 테스트는 실행하는 데 몇 분이 걸립니다. +그리고 우리에게는 테스트 스위트가 CI를 통해 완료되기까지 한 시간을 기다릴 여유가 없습니다. +따라서 필수 테스트를 위한 일부 예외를 제외하고 느린 테스트는 다음과 같이 표시해야 합니다. + +```python no-style +from transformers.testing_utils import slow +@slow +def test_integration_foo(): +``` + +`@slow`로 표시된 테스트를 실행하려면 `RUN_SLOW=1` 환경 변수를 설정하세요. 예를 들어 다음과 같습니다: + +```bash +RUN_SLOW=1 pytest tests +``` + +`@parameterized`와 같은 몇 가지 데코레이터는 테스트 이름을 다시 작성합니다. +그러므로 `@slow`와 나머지 건너뛰기 데코레이터 `@require_*`가 올바르게 작동되려면 마지막에 나열되어야 합니다. 다음은 올바른 사용 예입니다. + +```python no-style +@parameterized.expand(...) +@slow +def test_integration_foo(): +``` + +이 문서의 초반부에 설명된 것처럼 느린 테스트는 PR의 CI 확인이 아닌 예약된 일정 기반으로 실행됩니다. +따라서 PR 제출 중에 일부 문제를 놓친 채로 병합될 수 있습니다. +이러한 문제들은 다음번의 예정된 CI 작업 중에 감지됩니다. +하지만 PR을 제출하기 전에 자신의 컴퓨터에서 느린 테스트를 실행하는 것 또한 중요합니다. + +느린 테스트로 표시해야 하는지 여부를 결정하는 대략적인 결정 기준은 다음과 같습니다. + +만약 테스트가 라이브러리의 내부 구성 요소 중 하나에 집중되어 있다면(예: 모델링 파일, 토큰화 파일, 파이프라인), +해당 테스트를 느린 테스트 스위트에서 실행해야 합니다. +만약 라이브러리의 다른 측면(예: 문서 또는 예제)에 집중되어 있다면, +해당 테스트를 느린 테스트 스위트에서 실행해야 합니다. 그리고 이 접근 방식을 보완하기 위해 예외를 만들어야 합니다. + +- 무거운 가중치 세트나 50MB보다 큰 데이터셋을 다운로드해야 하는 모든 테스트(예: 모델 통합 테스트, 토크나이저 통합 테스트, 파이프라인 통합 테스트)를 + 느린 테스트로 설정해야 합니다. + 새로운 모델을 추가하는 경우 통합 테스트용으로 무작위 가중치로 작은 버전을 만들어 허브에 업로드해야 합니다. + 이 내용은 아래 단락에서 설명됩니다. +- 특별히 빠르게 실행되도록 최적화되지 않은 학습을 수행해야 하는 테스트는 느린 테스트로 설정해야 합니다. +- 느리지 않아야 할 테스트 중 일부가 극도로 느린 경우 + 예외를 도입하고 이를 `@slow`로 설정할 수 있습니다. + 대용량 파일을 디스크에 저장하고 불러오는 자동 모델링 테스트는 `@slow`으로 표시된 테스트의 좋은 예입니다. +- CI에서 1초 이내에 테스트가 완료되는 경우(다운로드 포함)에는 느린 테스트가 아니어야 합니다. + +느린 테스트가 아닌 경우에는 다양한 내부를 완전히 커버하면서 빠르게 유지되어야 합니다. +예를 들어, 무작위 가중치를 사용하여 특별히 생성된 작은 모델로 테스트하면 상당한 커버리지를 얻을 수 있습니다. +이러한 모델은 최소한의 레이어 수(예: 2), 어휘 크기(예: 1000) 등의 요소만 가집니다. 그런 다음 `@slow` 테스트는 대형 느린 모델을 사용하여 정성적인 테스트를 수행할 수 있습니다. +이러한 작은 모델을 사용하는 방법을 확인하려면 다음과 같이 *tiny* 모델을 찾아보세요. 
+ +```bash +grep tiny tests examples +``` + +다음은 작은 모델[stas/tiny-wmt19-en-de](https://huggingface.co/stas/tiny-wmt19-en-de)을 만든 +[script](https://github.com/huggingface/transformers/tree/main/scripts/fsmt/fsmt-make-tiny-model.py) 예시입니다. +특정 모델의 아키텍처에 맞게 쉽게 조정할 수 있습니다. + +예를 들어 대용량 모델을 다운로드하는 경우 런타임을 잘못 측정하기 쉽지만, +로컬에서 테스트하면 다운로드한 파일이 캐시되어 다운로드 시간이 측정되지 않습니다. +대신 CI 로그의 실행 속도 보고서를 확인하세요(`pytest --durations=0 tests`의 출력). + +이 보고서는 느린 이상값으로 표시되지 않거나 빠르게 다시 작성해야 하는 느린 이상값을 찾는 데도 유용합니다. +CI에서 테스트 스위트가 느려지기 시작하면 이 보고서의 맨 위 목록에 가장 느린 테스트가 표시됩니다. + + + +### stdout/stderr 출력 테스트[[testing-the-stdout/stderr-output]] + +`stdout` 및/또는 `stderr`로 쓰는 함수를 테스트하려면 `pytest`의 [capsys 시스템](https://docs.pytest.org/en/latest/capture.html)을 사용하여 해당 스트림에 액세스할 수 있습니다. +다음과 같이 수행할 수 있습니다. + +```python +import sys + + +def print_to_stdout(s): + print(s) + + +def print_to_stderr(s): + sys.stderr.write(s) + + +def test_result_and_stdout(capsys): + msg = "Hello" + print_to_stdout(msg) + print_to_stderr(msg) + out, err = capsys.readouterr() # 캡처된 출력 스트림 사용 + # 선택 사항: 캡처된 스트림 재생성 + sys.stdout.write(out) + sys.stderr.write(err) + # 테스트: + assert msg in out + assert msg in err +``` + +그리고, 물론 대부분의 경우에는 `stderr`는 예외의 일부로 제공됩니다. +그러므로 해당 경우에는 try/except를 사용해야 합니다. + +```python +def raise_exception(msg): + raise ValueError(msg) + + +def test_something_exception(): + msg = "Not a good value" + error = "" + try: + raise_exception(msg) + except Exception as e: + error = str(e) + assert msg in error, f"{msg} is in the exception:\n{error}" +``` + +`stdout`를 캡처하는 또 다른 방법은 `contextlib.redirect_stdout`를 사용하는 것입니다. + +```python +from io import StringIO +from contextlib import redirect_stdout + + +def print_to_stdout(s): + print(s) + + +def test_result_and_stdout(): + msg = "Hello" + buffer = StringIO() + with redirect_stdout(buffer): + print_to_stdout(msg) + out = buffer.getvalue() + # 선택 사항: 캡처된 스트림 재생성 + sys.stdout.write(out) + # 테스트: + assert msg in out +``` + +`stdout` 캡처에 관련된 중요한 문제 중 하나는 보통 `print`에서 이전에 인쇄된 내용을 재설정하는 `\r` 문자가 포함될 수 있다는 것입니다. +`pytest`에서는 문제가 없지만 `pytest -s`에서는 이러한 문자가 버퍼에 포함되므로 +`-s`가 있거나 없는 상태에서 태스트를 수행할 수 있으려면 캡처된 출력에 대해 추가적인 정리가 필요합니다. +이 경우에는 `re.sub(r'~.*\r', '', buf, 0, re.M)`을 사용할 수 있습니다. + +하지만 도우미 컨텍스트 관리자 래퍼를 사용하면 +출력에 `\r`이 포함되어 있는지의 여부에 관계없이 모든 것을 자동으로 처리하므로 편리합니다. + +```python +from transformers.testing_utils import CaptureStdout + +with CaptureStdout() as cs: + function_that_writes_to_stdout() +print(cs.out) +``` + +다음은 전체 테스트 예제입니다. + +```python +from transformers.testing_utils import CaptureStdout + +msg = "Secret message\r" +final = "Hello World" +with CaptureStdout() as cs: + print(msg + final) +assert cs.out == final + "\n", f"captured: {cs.out}, expecting {final}" +``` + +`stderr`를 캡처하고 싶다면, 대신 `CaptureStderr` 클래스를 사용하세요. + +```python +from transformers.testing_utils import CaptureStderr + +with CaptureStderr() as cs: + function_that_writes_to_stderr() +print(cs.err) +``` + +두 스트림을 동시에 캡처해야 한다면, 부모 `CaptureStd` 클래스를 사용하세요. + +```python +from transformers.testing_utils import CaptureStd + +with CaptureStd() as cs: + function_that_writes_to_stdout_and_stderr() +print(cs.err, cs.out) +``` + +또한, 테스트의 디버깅을 지원하기 위해 +이러한 컨텍스트 관리자는 기본적으로 컨텍스트에서 종료할 때 캡처된 스트림을 자동으로 다시 실행합니다. + + +### 로거 스트림 캡처[[capturing-logger-stream]] + +로거 출력을 검증해야 하는 경우 `CaptureLogger`를 사용할 수 있습니다. 
+ +```python +from transformers import logging +from transformers.testing_utils import CaptureLogger + +msg = "Testing 1, 2, 3" +logging.set_verbosity_info() +logger = logging.get_logger("transformers.models.bart.tokenization_bart") +with CaptureLogger(logger) as cl: + logger.info(msg) +assert cl.out, msg + "\n" +``` + +### 환경 변수를 이용하여 테스트[[testing-with-environment-variables]] + +특정 테스트의 환경 변수 영향을 검증하려면 +`transformers.testing_utils.mockenv`라는 도우미 데코레이터를 사용할 수 있습니다. + +```python +from transformers.testing_utils import mockenv + + +class HfArgumentParserTest(unittest.TestCase): + @mockenv(TRANSFORMERS_VERBOSITY="error") + def test_env_override(self): + env_level_str = os.getenv("TRANSFORMERS_VERBOSITY", None) +``` + +일부 경우에는 외부 프로그램을 호출해야할 수도 있는데, 이 때에는 여러 개의 로컬 경로를 포함하는 `os.environ`에서 `PYTHONPATH`의 설정이 필요합니다. +헬퍼 클래스 `transformers.test_utils.TestCasePlus`가 도움이 됩니다: + +```python +from transformers.testing_utils import TestCasePlus + + +class EnvExampleTest(TestCasePlus): + def test_external_prog(self): + env = self.get_env() + # 이제 `env`를 사용하여 외부 프로그램 호출 +``` + +테스트 파일이 `tests` 테스트 스위트 또는 `examples`에 있는지에 따라 +`env[PYTHONPATH]`가 두 디렉터리 중 하나를 포함하도록 설정되며, +현재 저장소에 대해 테스트가 수행되도록 `src` 디렉터리도 포함됩니다. +테스트 호출 이전에 설정된 경우에는 `env[PYTHONPATH]`를 그대로 사용합니다. + +이 헬퍼 메소드는 `os.environ` 객체의 사본을 생성하므로 원본은 그대로 유지됩니다. + + +### 재현 가능한 결과 얻기[[getting-reproducible-results]] + +일부 상황에서 테스트에서 임의성을 제거하여 동일하게 재현 가능한 결과를 얻고 싶을 수 있습니다. +이를 위해서는 다음과 같이 시드를 고정해야 합니다. + +```python +seed = 42 + +# 파이썬 RNG +import random + +random.seed(seed) + +# 파이토치 RNG +import torch + +torch.manual_seed(seed) +torch.backends.cudnn.deterministic = True +if torch.cuda.is_available(): + torch.cuda.manual_seed_all(seed) + +# 넘파이 RNG +import numpy as np + +np.random.seed(seed) + +# 텐서플로 RNG +tf.random.set_seed(seed) +``` + +### 테스트 디버깅[[debugging tests]] + +경고가 있는 곳에서 디버거를 시작하려면 다음을 수행하세요. + +```bash +pytest tests/utils/test_logging.py -W error::UserWarning --pdb +``` + +## Github Actions 워크플로우 작업 처리[[working-with-github-actions-workflows]] + +셀프 푸시 워크플로우 CI 작업을 트리거하려면, 다음을 수행해야 합니다. + +1. `transformers` 원본에서 새 브랜치를 만듭니다(포크가 아닙니다!). +2. 브랜치 이름은 `ci_` 또는 `ci-`로 시작해야 합니다(`main`도 트리거하지만 `main`에서는 PR을 할 수 없습니다). + 또한 특정 경로에 대해서만 트리거되므로 이 문서가 작성된 후에 변경된 내용은 + [여기](https://github.com/huggingface/transformers/blob/main/.github/workflows/self-push.yml)의 *push:*에서 확인할 수 있습니다. +3. 이 브랜치에서 PR을 생성합니다 +4. 그런 다음 [여기](https://github.com/huggingface/transformers/actions/workflows/self-push.yml)에서 작업이 나타나는지 확인할 수 있습니다. + 백로그가 있는 경우, 바로 실행되지 않을 수도 있습니다. + + + + +## 실험적인 CI 기능 테스트[[testing-Experimental-CI-Features]] + +CI 기능을 테스트하는 것은 일반 CI 작동에 방해가 될 수 있기 때문에 잠재적으로 문제가 발생할 수 있습니다. +따라서 새로운 CI 기능을 추가하는 경우 다음과 같이 수행해야 합니다. + +1. 테스트해야 할 내용을 테스트하는 새로운 전용 작업을 생성합니다. +2. 새로운 작업은 항상 성공해야만 녹색 ✓를 받을 수 있습니다(아래에 자세한 내용이 있습니다). +3. 다양한 PR 유형에 대한 확인을 위해 + (사용자 포크 브랜치, 포크되지 않은 브랜치, github.com UI 직접 파일 편집에서 생성된 브랜치, 강제 푸시 등 PR의 유형은 아주 다양합니다.) + 며칠 동안 실험 작업의 로그를 모니터링하면서 실행해봅니다. + (의도적으로 항상 녹색을 표시하므로 작업 전체가 녹색은 아니라는 점에 유의합니다.) +4. 모든 것이 안정적인지 확인한 후, 새로운 변경 사항을 기존 작업에 병합합니다. + +이렇게 하면 CI 기능 자체에 대한 실험이 일반 작업 흐름에 방해가 되지 않습니다. + +그러나 새로운 CI 기능이 개발 중인 동안, 항상 성공하도록 할 수 있는 방법은 무엇일까요? + +TravisCI와 같은 일부 CI는 `ignore-step-failure`를 지원하며 전체 작업을 성공한 것으로 보고하지만, +현재 우리가 사용하는 CircleCI와 Github Actions는 이를 지원하지 않습니다. + +따라서 다음과 같은 해결책을 사용할 수 있습니다. + +1. bash 스크립트에서 가능한 많은 오류를 억제하기 위해 실행 명령의 시작 부분에 `set +euo pipefail`을 추가합니다. +2. 마지막 명령은 반드시 성공해야 합니다. `echo "done"` 또는 `true`를 사용하면 됩니다. + +예시는 다음과 같습니다. 
+ +```yaml +- run: + name: run CI experiment + command: | + set +euo pipefail + echo "setting run-all-despite-any-errors-mode" + this_command_will_fail + echo "but bash continues to run" + # emulate another failure + false + # but the last command must be a success + echo "during experiment do not remove: reporting success to CI, even if there were failures" +``` + +간단한 명령의 경우 다음과 같이 수행할 수도 있습니다. + +```bash +cmd_that_may_fail || true +``` + +결과에 만족한 후에는 물론, 실험적인 단계 또는 작업을 일반 작업의 나머지 부분과 통합하면서 +`set +euo pipefail` 또는 기타 추가한 요소를 제거하여 +실험 작업이 일반 CI 작동에 방해되지 않도록 해야 합니다. + +이 전반적인 과정은 실험 단계가 PR의 전반적인 상태에 영향을 주지 않고 실패하도록 +`allow-failure`와 같은 기능을 설정할 수 있다면 훨씬 더 쉬웠을 것입니다. +그러나 앞에서 언급한 바와 같이 CircleCI와 Github Actions는 현재 이러한 기능들 지원하지 않습니다. + +이 기능의 지원을 위한 투표에 참여하고 CI 관련 스레드들에서 이러한 상황을 확인할 수도 있습니다. + +- [Github Actions:](https://github.com/actions/toolkit/issues/399) +- [CircleCI:](https://ideas.circleci.com/ideas/CCI-I-344) diff --git a/docs/source/ko/tf_xla.md b/docs/source/ko/tf_xla.md new file mode 100644 index 00000000000000..0b47d6fbad89d6 --- /dev/null +++ b/docs/source/ko/tf_xla.md @@ -0,0 +1,174 @@ + + +# TensorFlow 모델을 위한 XLA 통합 [[xla-integration-for-tensorflow-models]] + +[[open-in-colab]] + +XLA(Accelerated Linear Algebra)는 TensorFlow 모델의 실행 시간을 가속화하기 위한 컴파일러입니다. [공식 문서](https://www.tensorflow.org/xla)에 따르면 다음과 같습니다: + +XLA(Accelerated Linear Algebra)는 선형 대수를 위한 도메인 특화 컴파일러로, TensorFlow 모델을 소스 코드 변경 없이 가속화할 수 있습니다. + +TensorFlow에서 XLA를 사용하는 것은 간단합니다. XLA는 `tensorflow` 라이브러리 내에 패키지로 제공되며, [`tf.function`](https://www.tensorflow.org/guide/intro_to_graphs)과 같은 그래프 생성 함수에서 `jit_compile` 인수를 사용하여 활성화할 수 있습니다. `fit()` 및 `predict()`와 같은 Keras 메소드를 사용하는 경우, `jit_compile` 인수를 `model.compile()`에 전달하여 XLA를 간단하게 활성화할 수 있습니다. 그러나 XLA는 이러한 메소드에 국한되지 않고 임의의 `tf.function`을 가속화하는 데에도 사용할 수 있습니다. + +🤗 Transformers에서는 [GPT2](https://huggingface.co/docs/transformers/model_doc/gpt2), [T5](https://huggingface.co/docs/transformers/model_doc/t5), [OPT](https://huggingface.co/docs/transformers/model_doc/opt)와 같은 모델의 텍스트 생성, 그리고 [Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)와 같은 모델의 음성 처리를 포함하여 여러 TensorFlow 메소드가 XLA와 호환되도록 다시 작성되었습니다. + +정확한 속도 향상은 모델에 따라 다르지만, 🤗 Transformers 내의 TensorFlow 텍스트 생성 모델의 경우 최대 100배의 속도 향상을 확인했습니다. 이 문서에서는 이러한 모델에 대해 XLA를 사용하여 최대 성능을 얻는 방법을 설명합니다. 또한 XLA 통합의 벤치마크 및 디자인 철학에 대한 추가 자료 링크도 제공할 것입니다. + +## XLA를 사용하여 TF 함수 실행하기 [[running-tf-functions-with-xla]] + +TensorFlow에서 다음과 같은 모델을 고려해 봅시다: + +```py +import tensorflow as tf + +model = tf.keras.Sequential( + [tf.keras.layers.Dense(10, input_shape=(10,), activation="relu"), tf.keras.layers.Dense(5, activation="softmax")] +) +``` + +위 모델은 차원이 `(10, )`인 입력을 받습니다. 다음과 같이 모델을 사용하여 순전파를 실행할 수 있습니다: + +```py +# 모델에 대한 임의의 입력을 생성합니다. +batch_size = 16 +input_vector_dim = 10 +random_inputs = tf.random.normal((batch_size, input_vector_dim)) + +# 순전파를 실행합니다. +_ = model(random_inputs) +``` + +XLA로 컴파일된 함수로 순전파를 실행하려면 다음과 같이 해야 합니다: + +```py +xla_fn = tf.function(model, jit_compile=True) +_ = xla_fn(random_inputs) +``` + +`model`의 기본 `call()` 함수는 XLA 그래프를 컴파일하는 데 사용됩니다. 그러나 다른 모델 함수를 XLA로 컴파일하려면 다음과 같이 할 수도 있습니다: + +```py +my_xla_fn = tf.function(model.my_xla_fn, jit_compile=True) +``` + +## 🤗 Transformers에서 XLA를 사용하여 TF 텍스트 생성 모델 실행하기 [[running-a-tf-text-generation-model-with-xla-from-transformers]] + +🤗 Transformers에서 XLA로 가속화된 생성을 활성화하려면 최신 버전의 `transformers`가 설치되어 있어야 합니다. 
다음과 같이 설치할 수 있습니다: + +```bash +pip install transformers --upgrade +``` + +그리고 다음 코드를 실행할 수 있습니다: + +```py +import tensorflow as tf +from transformers import AutoTokenizer, TFAutoModelForCausalLM + +# 최소 버전의 Transformers가 설치되어 있지 않다면 오류가 발생합니다. +from transformers.utils import check_min_version + +check_min_version("4.21.0") + + +tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2", padding_side="left", pad_token="") +model = TFAutoModelForCausalLM.from_pretrained("openai-community/gpt2") +input_string = ["TensorFlow is"] + +# XLA 생성 함수를 만들기 위한 한 줄 +xla_generate = tf.function(model.generate, jit_compile=True) + +tokenized_input = tokenizer(input_string, return_tensors="tf") +generated_tokens = xla_generate(**tokenized_input, num_beams=2) + +decoded_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True) +print(f"Generated -- {decoded_text}") +# Generated -- TensorFlow is an open-source, open-source, distributed-source application # framework for the +``` + +알 수 있듯이, `generate()`에서 XLA를 활성화하는 것은 단 한 줄의 코드입니다. 코드의 나머지 부분은 변경되지 않습니다. 그러나 위 코드 스니펫에서는 XLA에 특정한 몇 가지 주의할 점이 있습니다. XLA가 가져다줄 속도 향상을 실현하기 위해서는 이를 알고 있어야 합니다. 다음 섹션에서 이에 대해 논의합니다. + +## 주의할 점 [[gotchas-to-be-aware-of]] + +XLA 활성화 함수(`xla_generate()`와 같은)를 처음 실행할 때 내부적으로 계산 그래프를 추론하려고 하며, 이는 시간이 소요됩니다. 이 과정은 [“추적(tracing)”](https://www.tensorflow.org/guide/intro_to_graphs#when_is_a_function_tracing)이라고 알려져 있습니다. + +생성 시간이 빠르지 않다는 것을 알 수 있을 것입니다. `xla_generate()`(또는 다른 XLA 활성화 함수)의 연속 호출은 함수에 전달된 입력이 초기에 구축된 계산 그래프와 동일한 형태를 따른다면, 계산 그래프를 추론할 필요가 없습니다. 이는 입력 형태가 고정된 모달리티(예: 이미지)에는 문제가 되지 않지만, 가변 입력 형태 모달리티(예: 텍스트)를 사용할 때 주의해야 합니다. + +`xla_generate()`가 항상 동일한 입력 형태로 동작하도록 하려면, 토크나이저를 호출할 때 `padding` 인수를 지정할 수 있습니다. + +```py +import tensorflow as tf +from transformers import AutoTokenizer, TFAutoModelForCausalLM + +tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2", padding_side="left", pad_token="") +model = TFAutoModelForCausalLM.from_pretrained("openai-community/gpt2") +input_string = ["TensorFlow is"] + +xla_generate = tf.function(model.generate, jit_compile=True) + +# 여기서, padding 옵션이 있는 토크나이저를 호출합니다. +tokenized_input = tokenizer(input_string, pad_to_multiple_of=8, padding=True, return_tensors="tf") + +generated_tokens = xla_generate(**tokenized_input, num_beams=2) +decoded_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True) +print(f"Generated -- {decoded_text}") +``` + +이렇게 하면 `xla_generate()`에 대한 입력이 항상 추적된 형태로 전달되어 생성 시간이 가속화됩니다. 다음 코드로 이를 확인할 수 있습니다: + +```py +import time +import tensorflow as tf +from transformers import AutoTokenizer, TFAutoModelForCausalLM + +tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2", padding_side="left", pad_token="") +model = TFAutoModelForCausalLM.from_pretrained("openai-community/gpt2") + +xla_generate = tf.function(model.generate, jit_compile=True) + +for input_string in ["TensorFlow is", "TensorFlow is a", "TFLite is a"]: + tokenized_input = tokenizer(input_string, pad_to_multiple_of=8, padding=True, return_tensors="tf") + start = time.time_ns() + generated_tokens = xla_generate(**tokenized_input, num_beams=2) + end = time.time_ns() + print(f"Execution time -- {(end - start) / 1e6:.1f} ms\n") +``` + +Tesla T4 GPU에서는 다음과 같은 출력을 예상할 수 있습니다: + +```bash +Execution time -- 30819.6 ms + +Execution time -- 79.0 ms + +Execution time -- 78.9 ms +``` +`xla_generate()`의 첫 번째 호출은 추적 때문에 시간이 오래 걸리지만, 연속 호출은 몇 배나 빠릅니다. 생성 옵션에 대한 어떤 변경이든 다시 추적을 유발하므로 생성 시간이 느려질 수 있음을 명심하세요. + +이 문서에서는 🤗 Transformers에서 제공하는 모든 텍스트 생성 옵션을 다루지 않았습니다. 
고급 사용 사례에 대해 문서를 참조하시기 바랍니다. + +## 추가 자료 [[additional-resources]] + +여기에 🤗 Transformers와 XLA에 대해 더 자세히 알고 싶은 경우 도움이 될 수 있는 몇 가지 추가 자료를 제공합니다. + +* [이 Colab 노트북](https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/91_tf_xla_generate.ipynb)은 XLA와 호환되는 인코더-디코더([T5](https://huggingface.co/docs/transformers/model_doc/t5)와 같은) 및 디코더 전용([GPT2](https://huggingface.co/docs/transformers/model_doc/gpt2)와 같은) 텍스트 생성 모델을 실험해 볼 수 있는 대화형 데모를 제공합니다. +* [이 블로그 글](https://huggingface.co/blog/tf-xla-generate)은 TensorFlow에서 XLA에 대한 친절한 소개와 함께 XLA와 호환되는 모델의 비교 벤치마크에 대한 개요를 제공합니다. +* [이 블로그 글](https://blog.tensorflow.org/2022/11/how-hugging-face-improved-text-generation-performance-with-xla.html)은 🤗 Transformers의 TensorFlow 모델에 XLA 지원을 추가하는 것에 대한 디자인 철학을 논의합니다. +* XLA와 TensorFlow 그래프에 대해 더 자세히 알고 싶은 경우 추천하는 글: + * [XLA: 기계 학습을 위한 최적화 컴파일러](https://www.tensorflow.org/xla) + * [그래프 및 tf.function 소개](https://www.tensorflow.org/guide/intro_to_graphs) + * [tf.function으로 성능 향상하기](https://www.tensorflow.org/guide/function) \ No newline at end of file diff --git a/docs/source/ko/tflite.md b/docs/source/ko/tflite.md new file mode 100644 index 00000000000000..464106a6b7c261 --- /dev/null +++ b/docs/source/ko/tflite.md @@ -0,0 +1,62 @@ + + +# TFLite로 내보내기[[export-to-tflite]] + +[TensorFlow Lite](https://www.tensorflow.org/lite/guide)는 자원이 제한된 휴대폰, 임베디드 시스템, 사물인터넷(IoT) 기기에서 +기계학습 모델을 배포하기 위한 경량 프레임워크입니다. +TFLite는 연산 능력, 메모리, 전력 소비가 제한된 기기에서 모델을 효율적으로 최적화하고 실행하기 위해 +설계되었습니다. +TensorFlow Lite 모델은 `.tflite` 파일 확장자로 식별되는 특수하고 효율적인 휴대용 포맷으로 표현됩니다. + +🤗 Optimum은 `exporters.tflite` 모듈로 🤗 Transformers 모델을 TFLite로 내보내는 기능을 제공합니다. +지원되는 모델 아키텍처 목록은 [🤗 Optimum 문서](https://huggingface.co/docs/optimum/exporters/tflite/overview)를 참고하세요. + +모델을 TFLite로 내보내려면, 필요한 종속성을 설치하세요: + +```bash +pip install optimum[exporters-tf] +``` + +모든 사용 가능한 인수를 확인하려면, [🤗 Optimum 문서](https://huggingface.co/docs/optimum/main/en/exporters/tflite/usage_guides/export_a_model)를 참고하거나 +터미널에서 도움말을 살펴보세요: + +```bash +optimum-cli export tflite --help +``` + +예를 들어 🤗 Hub에서의 `google-bert/bert-base-uncased` 모델 체크포인트를 내보내려면, 다음 명령을 실행하세요: + +```bash +optimum-cli export tflite --model google-bert/bert-base-uncased --sequence_length 128 bert_tflite/ +``` + +다음과 같이 진행 상황을 나타내는 로그와 결과물인 `model.tflite`가 저장된 위치를 보여주는 로그가 표시됩니다: + +```bash +Validating TFLite model... + -[✓] TFLite model output names match reference model (logits) + - Validating TFLite Model output "logits": + -[✓] (1, 128, 30522) matches (1, 128, 30522) + -[x] values not close enough, max diff: 5.817413330078125e-05 (atol: 1e-05) +The TensorFlow Lite export succeeded with the warning: The maximum absolute difference between the output of the reference model and the TFLite exported model is not within the set tolerance 1e-05: +- logits: max diff = 5.817413330078125e-05. + The exported model was saved at: bert_tflite + ``` + +위 예제는 🤗 Hub에서의 체크포인트를 내보내는 방법을 보여줍니다. +로컬 모델을 내보낸다면, 먼저 모델 가중치와 토크나이저 파일이 모두 같은 디렉터리( `local_path` )에 저장됐는지 확인하세요. +CLI를 사용할 때, 🤗 Hub에서의 체크포인트 이름 대신 `model` 인수에 `local_path`를 전달하면 됩니다. \ No newline at end of file diff --git a/docs/source/ko/tokenizer_summary.md b/docs/source/ko/tokenizer_summary.md new file mode 100644 index 00000000000000..0a4ece29a476d9 --- /dev/null +++ b/docs/source/ko/tokenizer_summary.md @@ -0,0 +1,253 @@ + + +# 토크나이저 요약[[summary-of-the-tokenizers]] + +[[open-in-colab]] + +이 페이지에서는 토큰화에 대해 자세히 살펴보겠습니다. + + + +[데이터 전처리하기 튜토리얼](preprocessing)에서 살펴본 것처럼, 텍스트를 토큰화하는 것은 텍스트를 단어 또는 서브워드로 분할하고 룩업 테이블을 통해 id로 변환하는 과정입니다. 
+단어 또는 서브워드를 id로 변환하는 것은 간단하기 때문에 이번 문서에서는 텍스트를 단어 또는 서브워드로 쪼개는 것(즉, 텍스트를 토큰화하는 것)에 중점을 두겠습니다. +구체적으로, 🤗 Transformers에서 사용되는 세 가지 주요 토큰화 유형인 [Byte-Pair Encoding (BPE)](#byte-pair-encoding), [WordPiece](#wordpiece), [SentencePiece](#sentencepiece)를 살펴보고 어떤 모델에서 어떤 토큰화 유형을 사용하는지 예시를 보여드리겠습니다. + +각 모델 페이지에 연결된 토크나이저의 문서를 보면 사전 훈련 모델에서 어떤 토크나이저를 사용했는지 알 수 있습니다. +예를 들어, [`BertTokenizer`]를 보면 이 모델이 [WordPiece](#wordpiece)를 사용하는 것을 알 수 있습니다. + +## 개요[[introduction]] + +텍스트를 작은 묶음(chunk)으로 쪼개는 것은 보기보다 어려운 작업이며, 여러 가지 방법이 있습니다. +예를 들어, `"Don't you love 🤗 Transformers? We sure do."` 라는 문장을 살펴보도록 하겠습니다. + + + +위 문장을 토큰화하는 간단한 방법은 공백을 기준으로 쪼개는 것입니다. +토큰화된 결과는 다음과 같습니다: + +``` +["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."] +``` +이는 첫 번째 결과로는 합리적이지만, `"Transformers?"`와 `"do."`토큰을 보면 각각 `"Transformer"`와 `"do"`에 구두점이 붙어있는 것을 확인할 수 있습니다. +구두점을 고려해야 모델이 단어의 다른 표현과 그 뒤에 올 수 있는 모든 가능한 구두점을 학습할 필요가 없습니다. 그렇지 않으면 모델이 학습해야 하는 표현의 수가 폭발적으로 증가하게 됩니다. + +구두점을 고려한 토큰화 결과는 다음과 같습니다: + +``` +["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."] +``` + +이전보다 나아졌습니다. 하지만, `"Don't"`의 토큰화 결과도 수정이 필요합니다. +`"Don't"`는 `"do not"`의 줄임말이기 때문에 `["Do", "n't"]`로 토큰화되는 것이 좋습니다. +여기서부터 복잡해지기 시작합니다. 그리고 이 점이 각 모델마다 고유한 토큰화 유형이 존재하는 이유 중 하나입니다. +텍스트를 토큰화하는 데 적용하는 규칙에 따라 동일한 텍스트에 대해 토큰화된 결과가 달라집니다. +사전 훈련된 모델은 훈련 데이터를 토큰화하는 데 사용된 것과 동일한 규칙으로 토큰화된 입력을 제공해야만 제대로 작동합니다. + +[spaCy](https://spacy.io/)와 [Moses](http://www.statmt.org/moses/?n=Development.GetStarted)는 유명한 규칙 기반 토크나이저입니다. 예제에 *spaCy*와 *Moses* 를 적용한 결과는 다음과 같습니다: + +``` +["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."] +``` + +보시다시피 공백 및 구두점 토큰화와 규칙 기반 토큰화가 사용됩니다. +공백 및 구두점, 규칙 기반 토큰화은 모두 단어 문장을 단어로 쪼개는 단어 토큰화에 해당합니다. +이 토큰화 방법은 텍스트를 더 작은 묶음(chunk)로 분할하는 가장 직관적인 방법이지만, 대규모 텍스트 말뭉치에 대해서는 문제가 발생할 수 있습니다. +이 경우 공백 및 구두점 토큰화는 일반적으로 매우 큰 어휘(사용된 모든 고유 단어와 토큰 집합)을 생성합니다. +*예를 들어*, [Transformer XL](model_doc/transformerxl)은 공백 및 구두점 토큰화를 사용해 어휘(vocabulary) 크기가 267,735입니다! + +어휘 크기가 크면 모델에 입력 및 출력 레이어로 엄청난 임베딩 행렬이 필요하므로 메모리와 시간 복잡성이 모두 증가합니다. +일반적으로 트랜스포머 모델은 어휘 크기가 50,000개를 넘는 경우가 드물며, 특히 단일 언어에 대해서만 사전 훈련된 경우에는 더욱 그렇습니다. +단순한 공백과 구두점 토큰화가 만족스럽지 않다면 단순히 문자를 토큰화하면 어떨까요? + + + +문자 토큰화는 아주 간단하고 메모리와 시간 복잡도를 크게 줄일 수 있지만, 모델이 의미 있는 입력 표현을 학습하기에는 훨씬 더 어렵습니다. + +*예를 들어*, 문자 `"t"`에 대한 의미 있는 문맥 독립적 표현을 배우는 것 보다 단어 `"today"`에 대한 의미 있는 문맥 독립적 표현을 배우는 것이 훨씬 더 어렵습니다. +문자 토큰화는 종종 성능 저하를 동반하기 때문에 두 가지 장점을 모두 얻기 위해 트랜스포머 모델은 **서브워드** 토큰화라고 하는 단어 수준과 문자 수준 토큰화의 하이브리드를 사용합니다. + +## 서브워드 토큰화[[subword-tokenization]] + + + +서브워드 토큰화 알고리즘은 자주 사용되는 단어는 더 작은 하위 단어로 쪼개고, 드문 단어는 의미 있는 하위 단어로 분해되어야 한다는 원칙에 따라 작동합니다. +예를 들어 `"annoyingly"`는 드문 단어로 간주되어 `"annoying"`과 `"ly"`로 분해될 수 있습니다. +`"annoyingly"`가 `"annoying"`과 `"ly"`의 합성어인 반면, `"annoying"`과 `"ly"` 둘 다 독립적인 서브워드로 자주 등장합니다. +이는 터키어와 같은 응집성 언어에서 특히 유용하며, 서브워드를 묶어 임의로 긴 복합 단어를 만들 수 있습니다. + +서브워드 토큰화를 사용하면 모델이 의미 있는 문맥 독립적 표현을 학습하면서 합리적인 어휘 크기를 가질 수 있습니다. +또한, 서브워드 토큰화를 통해 모델은 이전에 본 적이 없는 단어를 알려진 서브워드로 분해하여 처리할 수 있습니다. + +예를 들어, [`~transformers.BertTokenizer`]는 `"I have a new GPU!"` 라는 문장을 아래와 같이 토큰화합니다: + +```py +>>> from transformers import BertTokenizer + +>>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> tokenizer.tokenize("I have a new GPU!") +["i", "have", "a", "new", "gp", "##u", "!"] +``` + +대소문자가 없는 모델을 사용해 문장의 시작이 소문자로 표기되었습니다. +단어 `["i", "have", "a", "new"]`는 토크나이저의 어휘에 속하지만, `"gpu"`는 속하지 않는 것을 확인할 수 있습니다. +결과적으로 토크나이저는 `"gpu"`를 알려진 두 개의 서브워드로 쪼갭니다: `["gp" and "##u"]`. 
+`"##"`은 토큰의 나머지 부분이 공백 없이 이전 토큰에 연결되어야(attach) 함을 의미합니다(토큰화 디코딩 또는 역전을 위해). + +또 다른 예로, [`~transformers.XLNetTokenizer`]는 이전에 예시 문장을 다음과 같이 토큰화합니다: +```py +>>> from transformers import XLNetTokenizer + +>>> tokenizer = XLNetTokenizer.from_pretrained("xlnet/xlnet-base-cased") +>>> tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.") +["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?", "▁We", "▁sure", "▁do", "."] +``` + +`"▁"`가 가지는 의미는 [SentencePiece](#sentencepiece)에서 다시 살펴보도록 하겠습니다. +보다시피 `"Transformers"` 라는 드문 단어는 서브워드 `"Transform"`와 `"ers"`로 쪼개집니다. + +이제 다양한 하위 단어 토큰화 알고리즘이 어떻게 작동하는지 살펴보겠습니다. +이러한 토큰화 알고리즘은 일반적으로 해당 모델이 학습되는 말뭉치에 대해 수행되는 어떤 형태의 학습에 의존한다는 점에 유의하세요. + + + +### 바이트 페어 인코딩 (Byte-Pair Encoding, BPE)[[bytepair-encoding-bpe]] + +바이트 페어 인코딩(BPE)은 [Neural Machine Translation of Rare Words with Subword Units (Sennrich et +al., 2015)](https://arxiv.org/abs/1508.07909) 에서 소개되었습니다. +BPE는 훈련 데이터를 단어로 분할하는 사전 토크나이저(pre-tokenizer)에 의존합니다. +사전 토큰화(Pretokenization)에는 [GPT-2](model_doc/gpt2), [Roberta](model_doc/roberta)와 같은 간단한 공백 토큰화가 있습니다. +복잡한 사전 토큰화에는 규칙 기반 토큰화가 해당하는데, 훈련 말뭉치에서 각 단어의 빈도를 계산하기 위해 사용합니다. +[XLM](model_doc/xlm), 대부분의 언어에서 Moses를 사용하는 [FlauBERT](model_doc/flaubert), Spacy와 ftfy를 사용하는 [GPT](model_doc/gpt)가 해당합니다. + + +사전 토큰화 이후에, 고유 단어 집합가 생성되고 훈련 데이터에서 각 단어가 등장하는 빈도가 결정됩니다. +다음으로, BPE는 고유 단어 집합에 나타나는 모든 기호로 구성된 기본 어휘를 생성하고 기본 어휘의 두 기호에서 새로운 기호를 형성하는 병합 규칙을 학습합니다. +어휘가 원하는 어휘 크기에 도달할 때까지 위의 과정을 반복합니다. +어휘 크기는 토크나이저를 훈련시키기 전에 정의해야 하는 하이퍼파라미터라는 점을 유의하세요. + +예를 들어, 사전 토큰화 후 빈도를 포함한 다음과 같은 어휘 집합이 결정되었다고 가정해 보겠습니다: + +``` +("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5) +``` + +결과적으로 기본 어휘는 `["b", "g", "h", "n", "p", "s", "u"]` 이고, 각 단어를 기본 어휘에 속하는 기호로 쪼개면 아래와 같습니다: + +``` +("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5) +``` + +그런 다음 BPE는 가능한 각 기호 쌍의 빈도를 계산하여 가장 자주 발생하는 기호 쌍을 선택합니다. +위의 예시에서 `"h"` 뒤에 오는 `"u"`는 _10 + 5 = 15_ 번 등장합니다. (`"hug"`에서 10번, `"hugs"`에서 5번 등장) + +하지만, 가장 등장 빈도가 높은 기호 쌍은 `"u"` 뒤에 오는 `"g"`입니다. _10 + 5 + 5 = 20_ 으로 총 20번 등장합니다. +따라서 토크나이저가 병합하는 가장 첫 번째 쌍은 `"u"` 뒤에 오는 `"g"`입니다. `"ug"`가 어휘에 추가되어 어휘는 다음과 같습니다: + +``` +("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5) +``` + +BPE는 다음으로 가장 많이 등장하는 기호 쌍을 식별합니다. +`"u"` 뒤에 오는 `"n"`은 16번 등장해 `"un"` 으로 병합되어 어휘에 추가됩니다. +그 다음으로 빈도수가 놓은 기호 쌍은 `"h"` 뒤에 오는 `"ug"`로 15번 등장합니다. +다시 한 번 `"hug"`로 병합되어 어휘에 추가됩니다. + +현재 단계에서 어휘는 `["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]` 이고, 고유 단어 집합은 다음과 같습니다: + +``` +("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5) +``` + +이 시점에서 바이트 페어 인코딩 훈련이 중단된다고 가정하면, 훈련된 병합 규칙은 새로운 단어에 적용됩니다(기본 어휘에 포함된 기호가 새로운 단어에 포함되지 않는 한). +예를 들어, 단어 `"bug"`는 `["b", "ug"]`로 토큰화되지만, `"m"`이 기본 어휘에 없기 때문에 `"mug"`는 `["", "ug"]`로 토큰화될 것입니다. +훈련 데이터에는 단일 문자가 최소한 한 번 등장하기 때문에 일반적으로 `"m"`과 같은 단일 문자는 `""` 기호로 대체되지 않지만, 이모티콘과 같은 특별한 문자인 경우에는 대체될 수 있습니다. + +이전에 언급했듯이 어휘 크기(즉 기본 어휘 크기 + 병합 횟수)는 선택해야하는 하이퍼파라미터입니다. +예를 들어 [GPT](model_doc/gpt)의 기본 어휘 크기는 478, 40,000번의 병합 이후에 훈련을 종료하기 때문에 어휘 크기가 40,478입니다. + +#### 바이트 수준 BPE (Byte-level BPE)[[bytelevel-bpe]] + +가능한 모든 기본 문자를 포함하는 기본 어휘의 크기는 굉장히 커질 수 있습니다. (예: 모든 유니코드 문자를 기본 문자로 간주하는 경우) +더 나은 기본 어휘를 갖도록 [GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)는 기본 어휘로 바이트(bytes)를 사용합니다. +이 방식은 모든 기본 문자가 어휘에 포함되도록 하면서 기본 어휘의 크기를 256으로 제한합니다. +구두점을 다루는 추가적인 규칙을 사용해 GPT2 토크나이저는 모든 텍스트를 기호 없이 토큰화할 수 있습니다. 
+[GPT-2](model_doc/gpt)의 어휘 크기는 50,257로 256 바이트 크기의 기본 토큰, 특별한 end-of-text 토큰과 50,000번의 병합으로 학습한 기호로 구성됩니다. + + + +### 워드피스 (WordPiece)[[wordpiece]] + +워드피스는 [BERT](model_doc/bert), [DistilBERT](model_doc/distilbert), [Electra](model_doc/electra)에 사용된 서브워드 토큰화 알고리즘입니다. +이 알고리즘은 [Japanese and Korean Voice Search (Schuster et al., 2012)](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf)에서 소개되었고, BPE와 굉장히 유사합니다. +워드피스는 훈련 데이터에 등장하는 모든 문자로 기본 어휘를 초기화한 후, 주어진 병합 규칙에 따라 점진적으로 학습합니다. +BPE와는 대조적으로 워드피스는 가장 빈도수가 높은 기호 쌍을 선택하지 않고, 어휘에 추가되었을 때 훈련 데이터의 우도가 최대화되는 쌍을 선택합니다. + +정확히 무슨 의미일까요? +이전 예시를 참조하면, 훈련 데이터의 우도 값을 최대화하는 것은 모든 기호 쌍 중에서 첫 번째 기호와 두 번째 기호의 확률로 나눈 확률이 가장 큰 기호 쌍을 찾는 것과 동일합니다. +예를 들어 `"ug"`의 확률이 `"u"`와 `"g"` 각각으로 쪼개졌을 때 보다 높아야 `"u"` 뒤에 오는 `"g"`는 병합될 것입니다. +직관적으로 워드피스는 두 기호를 병합하여 _잃는_ 것을 평가하여 그만한 _가치_가 있는지 확인한다는 점에서 BPE와 약간 다릅니다. + + + +### 유니그램 (Unigram)[[unigram]] + +유니그램은 [Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Kudo, 2018)](https://arxiv.org/pdf/1804.10959.pdf)에서 제안된 서브워드 토큰화 알고리즘입니다. +BPE나 워드피스와 달리 유니그램은 기본 어휘를 많은 수의 기호로 초기화한 후 각 기호를 점진적으로 줄여 더 작은 어휘를 얻습니다. +예를 들어 기본 어휘는 모든 사전 토큰화된 단어와 가장 일반적인 하위 문자열에 해당할 수 있습니다. +유니그램은 transformers 모델에서 직접적으로 사용되지는 않지만, [SentencePiece](#sentencepiece)와 함께 사용됩니다. + +각 훈련 단계에서 유니그램 알고리즘은 현재 어휘와 유니그램 언어 모델이 주어졌을 때 훈련 데이터에 대한 손실(흔히 로그 우도로 정의됨)을 정의합니다. +그런 다음 어휘의 각 기호에 대해 알고리즘은 해당 기호를 어휘에서 제거할 경우 전체 손실이 얼마나 증가할지 계산합니다. +이후에 유니그램은 손실 증가율이 가장 낮은 기호의 p(보통 10% 또는 20%) 퍼센트를 제거합니다. (제거되는 기호는 훈련 데이터에 대한 전체 손실에 가장 작은 영향을 미칩니다.) +어휘가 원하는 크기에 도달할 때까지 이 과정을 반복합니다. +유니그램 알고리즘은 항상 기본 문자를 포함해 어떤 단어라도 토큰화할 수 있습니다. +유니그램이 병합 규칙에 기반하지 않기 떄문에 (BPE나 워드피스와는 대조적으로), 해당 알고리즘은 훈련 이후에 새로운 텍스트를 토큰화하는데 여러 가지 방법이 있습니다. + +예를 들어, 훈련된 유니그램 토큰화가 다음과 같은 어휘를 가진다면: + +``` +["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"], +``` + +`"hugs"`는 두 가지로 토큰화할 수 있습니다. `["hug", "s"]`와 `["h", "ug", "s"]` 또는 `["h", "u", "g", "s"]`. + +그렇다면 어떤 토큰화 방법을 선택해야 할까요? +유니그램은 어휘를 저장하는 것 외에도 훈련 말뭉치에 각 토큰의 확률을 저장하여 훈련 후 가능한 각 토큰화의 확률을 계산할 수 있도록 합니다. +이 알고리즘은 단순히 실제로 가장 가능성이 높은 토큰화를 선택하지만, 확률에 따라 가능한 토큰화를 샘플링할 수 있는 가능성도 제공합니다. +이러한 확률은 토크나이저가 학습한 손실에 의해 정의됩니다. + +단어로 구성된 훈련 데이터를 \\(x_{1}, \dots, x_{N}\\)라 하고, 단어 \\(x_{i}\\)에 대한 가능한 모든 토큰화 결과를 \\(S(x_{i})\\)라 한다면, 전체 손실은 다음과 같이 정의됩니다: + +$$\mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )$$ + + + + + +### 센텐스피스 (SentencePiece)[[sentencepiece]] + +지금까지 다룬 토큰화 알고리즘은 동일한 문제를 가집니다: 입력 텍스트는 공백을 사용하여 단어를 구분한다고 가정합니다. +하지만, 모든 언어에서 단어를 구분하기 위해 공백을 사용하지 않습니다. +한가지 가능한 해결방안은 특정 언어에 특화된 사전 토크나이저를 사용하는 것입니다. 예를 들어 [XLM](model_doc/xlm)은 특정 중국어, 일본어, 태국어 사전 토크나이저를 사용합니다. +이 문제를 일반적인 방법으로 해결하기 위해, [SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo et al., 2018)](https://arxiv.org/pdf/1808.06226.pdf)는 입력을 스트림으로 처리해 공백를 하나의 문자로 사용합니다. +이후에 BPE 또는 유니그램 알고리즘을 사용해 적절한 어휘를 구성합니다. + +[`XLNetTokenizer`]는 센텐스피스를 사용하기 때문에, 위에서 다룬 예시에서 어휘에 `"▁"`가 포함되어있습니다. +모든 토큰을 합친 후 `"▁"`을 공백으로 대체하면 되기 때문에 센텐스피스로 토큰화된 결과는 디코딩하기 수월합니다. + +transformers에서 제공하는 센텐스피스 토크나이저를 사용하는 모든 모델은 유니그램과 함께 사용됩니다. +[ALBERT](model_doc/albert), [XLNet](model_doc/xlnet), [Marian](model_doc/marian), [T5](model_doc/t5) 모델이 센텐스피스 토크나이저를 사용합니다. 
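참고로, 앞서 바이트 페어 인코딩 절에서 단계별로 살펴본 병합 규칙 학습 과정을 아주 단순화하면 다음과 같은 형태로 스케치해 볼 수 있습니다. 실제 라이브러리 구현과는 다른, 설명을 위한 예시라는 점에 유의하세요:

```py
from collections import Counter

# 본문의 바이트 페어 인코딩 절에서 사용한 예시 말뭉치 (단어, 등장 빈도)
corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

# 각 단어를 기본 어휘의 기호(문자) 시퀀스로 분할합니다.
splits = {word: list(word) for word in corpus}


def most_frequent_pair(splits, corpus):
    """인접한 기호 쌍의 빈도를 세어 가장 자주 등장하는 쌍을 반환합니다."""
    pair_freqs = Counter()
    for word, freq in corpus.items():
        symbols = splits[word]
        for pair in zip(symbols, symbols[1:]):
            pair_freqs[pair] += freq
    return pair_freqs.most_common(1)[0][0]


def merge_pair(pair, splits):
    """선택된 쌍을 하나의 새로운 기호로 병합합니다."""
    first, second = pair
    for symbols in splits.values():
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == first and symbols[i + 1] == second:
                symbols[i : i + 2] = [first + second]
            else:
                i += 1


# 병합 횟수(즉, 최종 어휘 크기)는 미리 정해야 하는 하이퍼파라미터입니다.
for _ in range(3):
    pair = most_frequent_pair(splits, corpus)
    merge_pair(pair, splits)
    print(pair, splits)
```

세 번의 병합이 끝나면 본문 설명과 동일하게 `"ug"`, `"un"`, `"hug"`가 차례로 어휘에 추가되고, `"hugs"`는 `["hug", "s"]`로 분할됩니다.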
\ No newline at end of file diff --git a/docs/source/ko/torchscript.md b/docs/source/ko/torchscript.md new file mode 100644 index 00000000000000..28e198c5ec9306 --- /dev/null +++ b/docs/source/ko/torchscript.md @@ -0,0 +1,189 @@ + + +# TorchScript로 내보내기[[export-to-torchscript]] + + + +TorchScript를 활용한 실험은 아직 초기 단계로, 가변적인 입력 크기 모델들을 통해 그 기능성을 계속 탐구하고 있습니다. +이 기능은 저희가 관심을 두고 있는 분야 중 하나이며, +앞으로 출시될 버전에서 더 많은 코드 예제, 더 유연한 구현, 그리고 Python 기반 코드와 컴파일된 TorchScript를 비교하는 벤치마크를 등을 통해 분석을 심화할 예정입니다. + + + +[TorchScript 문서](https://pytorch.org/docs/stable/jit.html)에서는 이렇게 말합니다. + +> TorchScript는 PyTorch 코드에서 직렬화 및 최적화 가능한 모델을 생성하는 방법입니다. + +[JIT과 TRACE](https://pytorch.org/docs/stable/jit.html)는 개발자가 모델을 내보내서 효율 지향적인 C++ 프로그램과 같은 다른 프로그램에서 재사용할 수 있도록 하는 PyTorch 모듈입니다. + +PyTorch 기반 Python 프로그램과 다른 환경에서 모델을 재사용할 수 있도록, 🤗 Transformers 모델을 TorchScript로 내보낼 수 있는 인터페이스를 제공합니다. +이 문서에서는 TorchScript를 사용하여 모델을 내보내고 사용하는 방법을 설명합니다. + +모델을 내보내려면 두 가지가 필요합니다: + +- `torchscript` 플래그로 모델 인스턴스화 +- 더미 입력을 사용한 순전파(forward pass) + +이 필수 조건들은 아래에 자세히 설명된 것처럼 개발자들이 주의해야 할 여러 사항들을 의미합니다. + +## TorchScript 플래그와 묶인 가중치(tied weights)[[torchscript-flag-and-tied-weights]] + +`torchscript` 플래그가 필요한 이유는 대부분의 🤗 Transformers 언어 모델에서 `Embedding` 레이어와 `Decoding` 레이어 간의 묶인 가중치(tied weights)가 존재하기 때문입니다. +TorchScript는 묶인 가중치를 가진 모델을 내보낼 수 없으므로, 미리 가중치를 풀고 복제해야 합니다. + +`torchscript` 플래그로 인스턴스화된 모델은 `Embedding` 레이어와 `Decoding` 레이어가 분리되어 있으므로 이후에 훈련해서는 안 됩니다. +훈련을 하게 되면 두 레이어 간 동기화가 해제되어 예상치 못한 결과가 발생할 수 있습니다. + +언어 모델 헤드를 갖지 않은 모델은 가중치가 묶여 있지 않아서 이 문제가 발생하지 않습니다. +이러한 모델들은 `torchscript` 플래그 없이 안전하게 내보낼 수 있습니다. + +## 더미 입력과 표준 길이[[dummy-inputs-and-standard-lengths]] + +더미 입력(dummy inputs)은 모델의 순전파(forward pass)에 사용됩니다. +입력 값이 레이어를 통해 전파되는 동안, PyTorch는 각 텐서에서 실행된 다른 연산을 추적합니다. +이러한 기록된 연산은 모델의 *추적(trace)*을 생성하는 데 사용됩니다. + +추적은 입력의 차원을 기준으로 생성됩니다. +따라서 더미 입력의 차원에 제한되어, 다른 시퀀스 길이나 배치 크기에서는 작동하지 않습니다. +다른 크기로 시도할 경우 다음과 같은 오류가 발생합니다: + +``` +`The expanded size of the tensor (3) must match the existing size (7) at non-singleton dimension 2` +``` +추론 중 모델에 공급될 가장 큰 입력만큼 큰 더미 입력 크기로 모델을 추적하는 것이 좋습니다. +패딩은 누락된 값을 채우는 데 도움이 될 수 있습니다. +그러나 모델이 더 큰 입력 크기로 추적되기 때문에, 행렬의 차원이 커지고 계산량이 많아집니다. + +다양한 시퀀스 길이 모델을 내보낼 때는 각 입력에 대해 수행되는 총 연산 횟수에 주의하고 성능을 주의 깊게 확인하세요. + +## Python에서 TorchScript 사용하기[[using-torchscript-in-python]] + +이 섹션에서는 모델을 저장하고 가져오는 방법, 추적을 사용하여 추론하는 방법을 보여줍니다. + +### 모델 저장하기[[saving-a-model]] + +`BertModel`을 TorchScript로 내보내려면 `BertConfig` 클래스에서 `BertModel`을 인스턴스화한 다음, `traced_bert.pt`라는 파일명으로 디스크에 저장하면 됩니다. + +```python +from transformers import BertModel, BertTokenizer, BertConfig +import torch + +enc = BertTokenizer.from_pretrained("google-bert/bert-base-uncased") + +# 입력 텍스트 토큰화하기 +text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]" +tokenized_text = enc.tokenize(text) + +# 입력 토큰 중 하나를 마스킹하기 +masked_index = 8 +tokenized_text[masked_index] = "[MASK]" +indexed_tokens = enc.convert_tokens_to_ids(tokenized_text) +segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1] + +# 더미 입력 만들기 +tokens_tensor = torch.tensor([indexed_tokens]) +segments_tensors = torch.tensor([segments_ids]) +dummy_input = [tokens_tensor, segments_tensors] + +# torchscript 플래그로 모델 초기화하기 +# 이 모델은 LM 헤드가 없으므로 필요하지 않지만, 플래그를 True로 설정합니다. +config = BertConfig( + vocab_size_or_config_json_file=32000, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + intermediate_size=3072, + torchscript=True, +) + +# 모델을 인스턴트화하기 +model = BertModel(config) + +# 모델을 평가 모드로 두어야 합니다. 
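# (드롭아웃과 같이 훈련에서만 활성화되는 동작이 추적(trace) 결과에 영향을 주지 않도록 하기 위함입니다.)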
+model.eval() + +# 만약 *from_pretrained*를 사용하여 모델을 인스턴스화하는 경우, TorchScript 플래그를 쉽게 설정할 수 있습니다 +model = BertModel.from_pretrained("google-bert/bert-base-uncased", torchscript=True) + +# 추적 생성하기 +traced_model = torch.jit.trace(model, [tokens_tensor, segments_tensors]) +torch.jit.save(traced_model, "traced_bert.pt") +``` + +### 모델 가져오기[[loading-a-model]] + +이제 이전에 저장한 `BertModel`, 즉 `traced_bert.pt`를 디스크에서 가져오고, 이전에 초기화한 `dummy_input`에서 사용할 수 있습니다. + +```python +loaded_model = torch.jit.load("traced_bert.pt") +loaded_model.eval() + +all_encoder_layers, pooled_output = loaded_model(*dummy_input) +``` + +### 추적된 모델을 사용하여 추론하기[[using-a-traced-model-for-inference]] + +`__call__` 이중 언더스코어(dunder) 메소드를 사용하여 추론에 추적된 모델을 사용하세요: + +```python +traced_model(tokens_tensor, segments_tensors) +``` + +## Neuron SDK로 Hugging Face TorchScript 모델을 AWS에 배포하기[[deploy-hugging-face-torchscript-models-to-aws-with-the-neuron-sdk]] + +AWS가 클라우드에서 저비용, 고성능 머신 러닝 추론을 위한 [Amazon EC2 Inf1](https://aws.amazon.com/ec2/instance-types/inf1/) 인스턴스 제품군을 출시했습니다. +Inf1 인스턴스는 딥러닝 추론 워크로드에 특화된 맞춤 하드웨어 가속기인 AWS Inferentia 칩으로 구동됩니다. +[AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/#)은 Inferentia를 위한 SDK로, Inf1에 배포하기 위한 transformers 모델 추적 및 최적화를 지원합니다. +Neuron SDK는 다음과 같은 기능을 제공합니다: + +1. 코드 한 줄만 변경하면 클라우드 추론를 위해 TorchScript 모델을 추적하고 최적화할 수 있는 쉬운 API +2. 즉시 사용 가능한 성능 최적화로 [비용 효율 향상](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/benchmark/>) +3. [PyTorch](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/bert_tutorial/tutorial_pretrained_bert.html) 또는 [TensorFlow](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/tensorflow/huggingface_bert/huggingface_bert.html)로 구축된 Hugging Face transformers 모델 지원 + +### 시사점[[implications]] + +[BERT (Bidirectional Encoder Representations from Transformers)](https://huggingface.co/docs/transformers/main/model_doc/bert) 아키텍처 또는 그 변형인 [distilBERT](https://huggingface.co/docs/transformers/main/model_doc/distilbert) 및 [roBERTa](https://huggingface.co/docs/transformers/main/model_doc/roberta)를 기반으로 한 Transformers 모델은 추출 기반 질의응답, 시퀀스 분류 및 토큰 분류와 같은 비생성 작업 시 Inf1에서 최상의 성능을 보입니다. +그러나 텍스트 생성 작업도 [AWS Neuron MarianMT 튜토리얼](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/transformers-marianmt.html)을 따라 Inf1에서 실행되도록 조정할 수 있습니다. + +Inferentia에서 바로 변환할 수 있는 모델에 대한 자세한 정보는 Neuron 문서의 [Model Architecture Fit](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/models/models-inferentia.html#models-inferentia) 섹션에서 확인할 수 있습니다. + +### 종속성[[dependencies]] + +AWS Neuron을 사용하여 모델을 변환하려면 [Neuron SDK 환경](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/pytorch-neuron/index.html#installation-guide)이 필요합니다. + 이는 [AWS Deep Learning AMI](https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-inferentia-launching.html)에 미리 구성되어 있습니다. + +### AWS Neuron으로 모델 변환하기[[converting-a-model-for-aws-neuron]] + +`BertModel`을 추적하려면, [Python에서 TorchScript 사용하기](torchscript#using-torchscript-in-python)에서와 동일한 코드를 사용해서 AWS NEURON용 모델을 변환합니다. +`torch.neuron` 프레임워크 익스텐션을 가져와 Python API를 통해 Neuron SDK의 구성 요소에 접근합니다: + +```python +from transformers import BertModel, BertTokenizer, BertConfig +import torch +import torch.neuron +``` + +다음 줄만 수정하면 됩니다: + +```diff +- torch.jit.trace(model, [tokens_tensor, segments_tensors]) ++ torch.neuron.trace(model, [token_tensor, segments_tensors]) +``` + +이로써 Neuron SDK가 모델을 추적하고 Inf1 인스턴스에 최적화할 수 있게 됩니다. 
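예를 들어, 앞 절에서 만든 모델과 더미 입력을 그대로 사용한다고 가정하면 전체 흐름은 대략 다음과 같은 스케치가 됩니다. Inf1 인스턴스와 Neuron SDK 환경이 준비되어 있다고 가정하며, 파일 이름(`neuron_bert.pt`)은 임의로 정한 예시입니다:

```python
import torch
import torch.neuron

# 앞서 만든 model, tokens_tensor, segments_tensors를 그대로 사용한다고 가정합니다.
neuron_model = torch.neuron.trace(model, [tokens_tensor, segments_tensors])

# 추적된 모델은 일반 TorchScript 모델과 마찬가지로 저장하고 다시 가져올 수 있습니다.
torch.jit.save(neuron_model, "neuron_bert.pt")
loaded_model = torch.jit.load("neuron_bert.pt")
```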
+ +AWS Neuron SDK의 기능, 도구, 예제 튜토리얼 및 최신 업데이트에 대해 자세히 알아보려면 [AWS NeuronSDK 문서](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html)를 참조하세요. diff --git a/docs/source/ko/training.md b/docs/source/ko/training.md new file mode 100644 index 00000000000000..fa6d56bdc36696 --- /dev/null +++ b/docs/source/ko/training.md @@ -0,0 +1,428 @@ + + +# 사전 학습된 모델 미세 튜닝하기[[finetune-a-pretrained-model]] + +[[open-in-colab]] + +사전 학습된 모델을 사용하면 상당한 이점이 있습니다. 계산 비용과 탄소발자국을 줄이고, 처음부터 모델을 학습시킬 필요 없이 최신 모델을 사용할 수 있습니다. 🤗 Transformers는 다양한 작업을 위해 사전 학습된 수천 개의 모델에 액세스할 수 있습니다. 사전 학습된 모델을 사용하는 경우, 자신의 작업과 관련된 데이터셋을 사용해 학습합니다. 이것은 미세 튜닝이라고 하는 매우 강력한 훈련 기법입니다. 이 튜토리얼에서는 당신이 선택한 딥러닝 프레임워크로 사전 학습된 모델을 미세 튜닝합니다: + +* 🤗 Transformers로 사전 학습된 모델 미세 튜닝하기 [`Trainer`]. +* Keras를 사용하여 TensorFlow에서 사전 학습된 모델을 미세 튜닝하기. +* 기본 PyTorch에서 사전 학습된 모델을 미세 튜닝하기. + + + +## 데이터셋 준비[[prepare-a-dataset]] + + + +사전 학습된 모델을 미세 튜닝하기 위해서 데이터셋을 다운로드하고 훈련할 수 있도록 준비하세요. 이전 튜토리얼에서 훈련을 위해 데이터를 처리하는 방법을 보여드렸는데, 지금이 배울 걸 되짚을 기회입니다! + +먼저 [Yelp 리뷰](https://huggingface.co/datasets/yelp_review_full) 데이터 세트를 로드합니다: + +```py +>>> from datasets import load_dataset + +>>> dataset = load_dataset("yelp_review_full") +>>> dataset["train"][100] +{'label': 0, + 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I\'ve worked at more than one location. I expect bad days, bad moods, and the occasional mistake. But I have yet to have a decent experience at this store. It will remain a place I avoid unless someone in my party needs to avoid illness from low blood sugar. Perhaps I should go back to the racially biased service of Steak n Shake instead!'} +``` + +텍스트를 처리하고 서로 다른 길이의 시퀀스 패딩 및 잘라내기 전략을 포함하려면 토크나이저가 필요합니다. 데이터셋을 한 번에 처리하려면 🤗 Dataset [`map`](https://huggingface.co/docs/datasets/process#map) 메서드를 사용하여 전체 데이터셋에 전처리 함수를 적용하세요: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") + + +>>> def tokenize_function(examples): +... return tokenizer(examples["text"], padding="max_length", truncation=True) + + +>>> tokenized_datasets = dataset.map(tokenize_function, batched=True) +``` + +필요한 경우 미세 튜닝을 위해 데이터셋의 작은 부분 집합을 만들어 미세 튜닝 작업 시간을 줄일 수 있습니다: + +```py +>>> small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000)) +>>> small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000)) +``` + + + +## Train + +여기서부터는 사용하려는 프레임워크에 해당하는 섹션을 따라야 합니다. 오른쪽 사이드바의 링크를 사용하여 원하는 프레임워크로 이동할 수 있으며, 특정 프레임워크의 모든 콘텐츠를 숨기려면 해당 프레임워크 블록의 오른쪽 상단에 있는 버튼을 사용하면 됩니다! 
+ + + + + +## 파이토치 Trainer로 훈련하기[[train-with-pytorch-trainer]] + +🤗 Transformers는 🤗 Transformers 모델 훈련에 최적화된 [`Trainer`] 클래스를 제공하여 훈련 루프를 직접 작성하지 않고도 쉽게 훈련을 시작할 수 있습니다. [`Trainer`] API는 로깅(logging), 경사 누적(gradient accumulation), 혼합 정밀도(mixed precision) 등 다양한 훈련 옵션과 기능을 지원합니다. + +먼저 모델을 가져오고 예상되는 레이블 수를 지정합니다. Yelp 리뷰 [데이터셋 카드](https://huggingface.co/datasets/yelp_review_full#data-fields)에서 5개의 레이블이 있음을 알 수 있습니다: + +```py +>>> from transformers import AutoModelForSequenceClassification + +>>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) +``` + + + +사전 훈련된 가중치 중 일부가 사용되지 않고 일부 가중치가 무작위로 표시된다는 경고가 표시됩니다. +걱정마세요. 이것은 올바른 동작입니다! 사전 학습된 BERT 모델의 헤드는 폐기되고 무작위로 초기화된 분류 헤드로 대체됩니다. 이제 사전 학습된 모델의 지식으로 시퀀스 분류 작업을 위한 새로운 모델 헤드를 미세 튜닝 합니다. + + + +### 하이퍼파라미터 훈련[[training-hyperparameters]] + +다음으로 정할 수 있는 모든 하이퍼파라미터와 다양한 훈련 옵션을 활성화하기 위한 플래그를 포함하는 [`TrainingArguments`] 클래스를 생성합니다. + +이 튜토리얼에서는 기본 훈련 [하이퍼파라미터](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments)로 시작하지만, 자유롭게 실험하여 여러분들에게 맞는 최적의 설정을 찾을 수 있습니다. + +훈련에서 체크포인트(checkpoints)를 저장할 위치를 지정합니다: + +```py +>>> from transformers import TrainingArguments + +>>> training_args = TrainingArguments(output_dir="test_trainer") +``` + +### 평가 하기[[evaluate]] + +[`Trainer`]는 훈련 중에 모델 성능을 자동으로 평가하지 않습니다. 평가 지표를 계산하고 보고할 함수를 [`Trainer`]에 전달해야 합니다. +[🤗 Evaluate](https://huggingface.co/docs/evaluate/index) 라이브러리는 [`evaluate.load`](https://huggingface.co/spaces/evaluate-metric/accuracy) 함수로 로드할 수 있는 간단한 [`accuracy`]함수를 제공합니다 (자세한 내용은 [둘러보기](https://huggingface.co/docs/evaluate/a_quick_tour)를 참조하세요): + +```py +>>> import numpy as np +>>> import evaluate + +>>> metric = evaluate.load("accuracy") +``` + +`metric`에서 [`~evaluate.compute`]를 호출하여 예측의 정확도를 계산합니다. 예측을 `compute`에 전달하기 전에 예측을 로짓으로 변환해야 합니다(모든 🤗 Transformers 모델은 로짓으로 반환한다는 점을 기억하세요): + +```py +>>> def compute_metrics(eval_pred): +... logits, labels = eval_pred +... predictions = np.argmax(logits, axis=-1) +... return metric.compute(predictions=predictions, references=labels) +``` + +미세 튜닝 중에 평가 지표를 모니터링하려면 훈련 인수에 `evaluation_strategy` 파라미터를 지정하여 각 에폭이 끝날 때 평가 지표를 확인할 수 있습니다: + +```py +>>> from transformers import TrainingArguments, Trainer + +>>> training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch") +``` + +### 훈련 하기[[trainer]] + +모델, 훈련 인수, 훈련 및 테스트 데이터셋, 평가 함수가 포함된 [`Trainer`] 객체를 만듭니다: + +```py +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=small_train_dataset, +... eval_dataset=small_eval_dataset, +... compute_metrics=compute_metrics, +... ) +``` + +그리고 [`~transformers.Trainer.train`]을 호출하여 모델을 미세 튜닝합니다: + +```py +>>> trainer.train() +``` + + + + + + +## Keras로 텐서플로우 모델 훈련하기[[train-a-tensorflow-model-with-keras]] + +Keras API를 사용하여 텐서플로우에서 🤗 Transformers 모델을 훈련할 수도 있습니다! + +### Keras용 데이터 로드[[loading-data-for-keras]] + +Keras API로 🤗 Transformers 모델을 학습시키려면 데이터셋을 Keras가 이해할 수 있는 형식으로 변환해야 합니다. +데이터 세트가 작은 경우, 전체를 NumPy 배열로 변환하여 Keras로 전달하면 됩니다. +더 복잡한 작업을 수행하기 전에 먼저 이 작업을 시도해 보겠습니다. + +먼저 데이터 세트를 로드합니다. [GLUE 벤치마크](https://huggingface.co/datasets/glue)의 CoLA 데이터 세트를 사용하겠습니다. +간단한 바이너리 텍스트 분류 작업이므로 지금은 훈련 데이터 분할만 사용합니다. + +```py +from datasets import load_dataset + +dataset = load_dataset("glue", "cola") +dataset = dataset["train"] # Just take the training split for now +``` + +다음으로 토크나이저를 로드하고 데이터를 NumPy 배열로 토큰화합니다. 레이블은 이미 0과 1로 된 리스트이기 때문에 토큰화하지 않고 바로 NumPy 배열로 변환할 수 있습니다! 
+ +```py +from transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") +tokenized_data = tokenizer(dataset["sentence"], return_tensors="np", padding=True) +# Tokenizer returns a BatchEncoding, but we convert that to a dict for Keras +tokenized_data = dict(tokenized_data) + +labels = np.array(dataset["label"]) # Label is already an array of 0 and 1 +``` + +마지막으로 모델을 로드, [`compile`](https://keras.io/api/models/model_training_apis/#compile-method), [`fit`](https://keras.io/api/models/model_training_apis/#fit-method)합니다: + +```py +from transformers import TFAutoModelForSequenceClassification +from tensorflow.keras.optimizers import Adam + +# Load and compile our model +model = TFAutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased") +# Lower learning rates are often better for fine-tuning transformers +model.compile(optimizer=Adam(3e-5)) + +model.fit(tokenized_data, labels) +``` + + + +모델을 `compile()`할 때 손실 인수를 모델에 전달할 필요가 없습니다! +이 인수를 비워두면 허깅 페이스 모델은 작업과 모델 아키텍처에 적합한 손실을 자동으로 선택합니다. +원한다면 언제든지 직접 손실을 지정하여 이를 재정의할 수 있습니다! + + + +이 접근 방식은 소규모 데이터 집합에서는 잘 작동하지만, 대규모 데이터 집합에서는 문제가 될 수 있습니다. 왜 그럴까요? +토큰화된 배열과 레이블을 메모리에 완전히 로드하고 NumPy는 "들쭉날쭉한" 배열을 처리하지 않기 때문에, +모든 토큰화된 샘플을 전체 데이터셋에서 가장 긴 샘플의 길이만큼 패딩해야 합니다. 이렇게 하면 배열이 훨씬 더 커지고 이 패딩 토큰으로 인해 학습 속도도 느려집니다! + +### 데이터를 tf.data.Dataset으로 로드하기[[loading-data-as-a-tfdatadataset]] + +학습 속도가 느려지는 것을 피하려면 데이터를 `tf.data.Dataset`으로 로드할 수 있습니다. 원한다면 직접 +`tf.data` 파이프라인을 직접 작성할 수도 있지만, 이 작업을 간편하게 수행하는 수 있는 두 가지 방법이 있습니다: + +- [`~TFPreTrainedModel.prepare_tf_dataset`]: 대부분의 경우 이 방법을 권장합니다. 모델의 메서드이기 때문에 모델을 검사하여 모델 입력으로 사용할 수 있는 열을 자동으로 파악하고 +나머지는 버려서 더 단순하고 성능이 좋은 데이터 집합을 만들 수 있습니다. +- [`~datasets.Dataset.to_tf_dataset`]: 이 방법은 좀 더 낮은 수준이며, 포함할 '열'과 '레이블'을 정확히 지정하여 +데이터셋을 생성하는 방법을 정확히 제어하고 싶을 때 유용하며, 포함할 'columns'과 'label_cols'을 정확히 지정할 수 있습니다. + +[`~TFPreTrainedModel.prepare_tf_dataset`]을 사용하려면 먼저 다음 코드 샘플과 같이 토크나이저 출력을 데이터 세트에 열로 추가해야 합니다: + +```py +def tokenize_dataset(data): + # Keys of the returned dictionary will be added to the dataset as columns + return tokenizer(data["text"]) + + +dataset = dataset.map(tokenize_dataset) +``` + +허깅 페이스 데이터셋은 기본적으로 디스크에 저장되므로 메모리 사용량을 늘리지 않는다는 점을 기억하세요! +열이 추가되면 데이터셋에서 배치를 스트리밍하고 각 배치에 패딩을 추가할 수 있으므로 전체 데이터셋에 패딩을 추가하는 것보다 패딩 토큰의 수를 크게 줄일 수 있습니다. + + +```py +>>> tf_dataset = model.prepare_tf_dataset(dataset, batch_size=16, shuffle=True, tokenizer=tokenizer) +``` + +위의 코드 샘플에서는 배치가 로드될 때 올바르게 패딩할 수 있도록 `prepare_tf_dataset`에 토크나이저를 전달해야 합니다. +데이터셋의 모든 샘플 길이가 같고 패딩이 필요하지 않은 경우 이 인수를 건너뛸 수 있습니다. +샘플을 채우는 것보다 더 복잡한 작업(예: 마스킹된 언어의 토큰 손상 모델링)을 수행하기 위해 토큰을 손상시켜야 하는 경우, +`collate_fn` 인수를 사용하여 샘플 목록을 배치로 변환하고 원하는 전처리를 적용할 함수를 전달할 수 있습니다. +[예시](https://github.com/huggingface/transformers/tree/main/examples) 또는 +[노트북](https://huggingface.co/docs/transformers/notebooks)을 참조하여 이 접근 방식이 실제로 작동하는 모습을 확인하세요. + +`tf.data.Dataset`을 생성한 후에는 이전과 마찬가지로 모델을 컴파일하고 훈련(fit)할 수 있습니다: + +```py +model.compile(optimizer=Adam(3e-5)) + +model.fit(tf_dataset) +``` + + + + + + +## 기본 파이토치로 훈련하기[[train-in-native-pytorch]] + + + + + +[`Trainer`]는 훈련 루프를 처리하며 한 줄의 코드로 모델을 미세 조정할 수 있습니다. 직접 훈련 루프를 작성하는 것을 선호하는 사용자의 경우, 기본 PyTorch에서 🤗 Transformers 모델을 미세 조정할 수도 있습니다. + +이 시점에서 노트북을 다시 시작하거나 다음 코드를 실행해 메모리를 확보해야 할 수 있습니다: + +```py +del model +del trainer +torch.cuda.empty_cache() +``` + +다음으로, '토큰화된 데이터셋'을 수동으로 후처리하여 훈련련에 사용할 수 있도록 준비합니다. + +1. 모델이 원시 텍스트를 입력으로 허용하지 않으므로 `text` 열을 제거합니다: + + ```py + >>> tokenized_datasets = tokenized_datasets.remove_columns(["text"]) + ``` + +2. 
모델에서 인수의 이름이 `labels`로 지정될 것으로 예상하므로 `label` 열의 이름을 `labels`로 변경합니다: + + ```py + >>> tokenized_datasets = tokenized_datasets.rename_column("label", "labels") + ``` + +3. 데이터셋의 형식을 List 대신 PyTorch 텐서를 반환하도록 설정합니다: + + ```py + >>> tokenized_datasets.set_format("torch") + ``` + +그리고 앞서 표시된 대로 데이터셋의 더 작은 하위 집합을 생성하여 미세 조정 속도를 높입니다: + +```py +>>> small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000)) +>>> small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000)) +``` + +### DataLoader[[dataloader]] + +훈련 및 테스트 데이터셋에 대한 'DataLoader'를 생성하여 데이터 배치를 반복할 수 있습니다: + +```py +>>> from torch.utils.data import DataLoader + +>>> train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8) +>>> eval_dataloader = DataLoader(small_eval_dataset, batch_size=8) +``` + +예측을 위한 레이블 개수를 사용하여 모델을 로드합니다: + +```py +>>> from transformers import AutoModelForSequenceClassification + +>>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) +``` + +### 옵티마이저 및 학습 속도 스케줄러[[optimizer-and-learning-rate-scheduler]] + +옵티마이저와 학습 속도 스케줄러를 생성하여 모델을 미세 조정합니다. 파이토치에서 제공하는 [`AdamW`](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html) 옵티마이저를 사용해 보겠습니다: + +```py +>>> from torch.optim import AdamW + +>>> optimizer = AdamW(model.parameters(), lr=5e-5) +``` + +[`Trainer`]에서 기본 학습 속도 스케줄러를 생성합니다: + +```py +>>> from transformers import get_scheduler + +>>> num_epochs = 3 +>>> num_training_steps = num_epochs * len(train_dataloader) +>>> lr_scheduler = get_scheduler( +... name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps +... ) +``` + +마지막으로, GPU에 액세스할 수 있는 경우 'device'를 지정하여 GPU를 사용하도록 합니다. 그렇지 않으면 CPU에서 훈련하며 몇 분이 아닌 몇 시간이 걸릴 수 있습니다. + +```py +>>> import torch + +>>> device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu") +>>> model.to(device) +``` + + + +[Colaboratory](https://colab.research.google.com/) 또는 [SageMaker StudioLab](https://studiolab.sagemaker.aws/)과 같은 호스팅 노트북이 없는 경우 클라우드 GPU에 무료로 액세스할 수 있습니다. + + + +이제 훈련할 준비가 되었습니다! 🥳 + +### 훈련 루프[[training-loop]] + +훈련 진행 상황을 추적하려면 [tqdm](https://tqdm.github.io/) 라이브러리를 사용하여 트레이닝 단계 수에 진행률 표시줄을 추가하세요: + +```py +>>> from tqdm.auto import tqdm + +>>> progress_bar = tqdm(range(num_training_steps)) + +>>> model.train() +>>> for epoch in range(num_epochs): +... for batch in train_dataloader: +... batch = {k: v.to(device) for k, v in batch.items()} +... outputs = model(**batch) +... loss = outputs.loss +... loss.backward() + +... optimizer.step() +... lr_scheduler.step() +... optimizer.zero_grad() +... progress_bar.update(1) +``` + +### 평가 하기[[evaluate]] + +[`Trainer`]에 평가 함수를 추가한 방법과 마찬가지로, 훈련 루프를 직접 작성할 때도 동일한 작업을 수행해야 합니다. 하지만 이번에는 각 에포크가 끝날 때마다 평가지표를 계산하여 보고하는 대신, [`~evaluate.add_batch`]를 사용하여 모든 배치를 누적하고 맨 마지막에 평가지표를 계산합니다. + +```py +>>> import evaluate + +>>> metric = evaluate.load("accuracy") +>>> model.eval() +>>> for batch in eval_dataloader: +... batch = {k: v.to(device) for k, v in batch.items()} +... with torch.no_grad(): +... outputs = model(**batch) + +... logits = outputs.logits +... predictions = torch.argmax(logits, dim=-1) +... metric.add_batch(predictions=predictions, references=batch["labels"]) + +>>> metric.compute() +``` + + + + + +## 추가 자료[[additional-resources]] + +더 많은 미세 튜닝 예제는 다음을 참조하세요: + +- [🤗 Trnasformers 예제](https://github.com/huggingface/transformers/tree/main/examples)에는 PyTorch 및 텐서플로우에서 일반적인 NLP 작업을 훈련할 수 있는 스크립트가 포함되어 있습니다. 
+ +- [🤗 Transformers 노트북](notebooks)에는 PyTorch 및 텐서플로우에서 특정 작업을 위해 모델을 미세 튜닝하는 방법에 대한 다양한 노트북이 포함되어 있습니다. diff --git a/docs/source/ko/transformers_agents.md b/docs/source/ko/transformers_agents.md new file mode 100644 index 00000000000000..eeb00761e9a777 --- /dev/null +++ b/docs/source/ko/transformers_agents.md @@ -0,0 +1,328 @@ + + +# Transformers Agent [[transformers-agent]] + + + +Transformers Agent는 실험 중인 API로 언제든지 변경될 수 있습니다. +API 또는 기반 모델이 변경되기 쉽기 때문에 에이전트가 반환하는 결과도 달라질 수 있습니다. + + + +Transformers 버전 4.29.0.에서 *도구*와 *에이전트*라는 컨셉을 도입했습니다. [이 colab](https://colab.research.google.com/drive/1c7MHD-T1forUPGcC_jlwsIptOzpG3hSj)에서 사용해볼 수 있습니다. + +간단히 말하면, Agent는 트랜스포머 위에 자연어 API를 제공합니다. +엄선된 도구 세트를 정의하고, 자연어를 해석하여 이러한 도구를 사용할 수 있는 에이전트를 설계했습니다. +이 API는 확장이 가능하도록 설계 되었습니다. +주요 도구를 선별해두었지만, 커뮤니티에서 개발한 모든 도구를 사용할 수 있도록 시스템을 쉽게 확장할 수 있는 방법도 보여드리겠습니다. + +몇 가지 예를 통해 새로운 API로 무엇을 할 수 있는지 살펴보겠습니다. +이 API는 특히 멀티모달 작업에서 강력하므로 이미지를 생성하고 텍스트를 소리내어 읽어보겠습니다. + +```py +agent.run("Caption the following image", image=image) +``` + +| **Input** | **Output** | +|-----------------------------------------------------------------------------------------------------------------------------|-----------------------------------| +| | A beaver is swimming in the water | + +--- + +```py +agent.run("Read the following text out loud", text=text) +``` +| **Input** | **Output** | +|-------------------------------------------------------------------------------------------------------------------------|----------------------------------------------| +| A beaver is swimming in the water | + +--- + +```py +agent.run( + "In the following `document`, where will the TRRF Scientific Advisory Council Meeting take place?", + document=document, +) +``` +| **Input** | **Output** | +|-----------------------------------------------------------------------------------------------------------------------------|----------------| +| | ballroom foyer | + +## 바로 시작하기 [[quickstart]] + +`agent.run`을 사용하려면 먼저 대규모 언어 모델(LLM)인 에이전트를 인스턴스화해야 합니다. +저희는 openAI 모델뿐만 아니라 BigCode 및 OpenAssistant의 오픈소스 대체 모델도 지원합니다. +openAI 모델의 성능이 더 우수하지만(단, openAI API 키가 필요하므로 무료로 사용할 수 없음), +Hugging Face는 BigCode와 OpenAssistant 모델의 엔드포인트에 대한 무료 액세스를 제공하고 있습니다. + +우선 모든 기본 종속성을 설치하려면 `agents`를 추가로 설치하세요. +```bash +pip install transformers[agents] +``` + +openAI 모델을 사용하려면 `openai` 종속성을 설치한 후 [`OpenAiAgent`]를 인스턴스화합니다: + +```bash +pip install openai +``` + + +```py +from transformers import OpenAiAgent + +agent = OpenAiAgent(model="text-davinci-003", api_key="") +``` + +BigCode 또는 OpenAssistant를 사용하려면 먼저 로그인하여 Inference API에 액세스하세요: + +```py +from huggingface_hub import login + +login("") +``` + +그런 다음 에이전트를 인스턴스화합니다. + +```py +from transformers import HfAgent + +# Starcoder +agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder") +# StarcoderBase +# agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoderbase") +# OpenAssistant +# agent = HfAgent(url_endpoint="https://api-inference.huggingface.co/models/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5") +``` + +현재 Hugging Face에서 무료로 제공하는 추론 API를 사용하고 있습니다. +이 모델에 대한 자체 추론 엔드포인트가 있는 경우(또는 다른 엔드포인트가 있는 경우) 위의 URL을 해당 URL 엔드포인트로 바꿀 수 있습니다. + + + +StarCoder와 OpenAssistant는 무료로 사용할 수 있으며 간단한 작업에서 놀라울 정도로 잘 작동합니다. +그러나 더 복잡한 프롬프트를 처리할 때는 체크포인트가 잘 작동하지 않습니다. +이러한 문제가 발생하면 OpenAI 모델을 사용해 보시기 바랍니다. 아쉽게도 오픈소스는 아니지만 현재로서는 더 나은 성능을 제공합니다. + + + +이제 준비가 완료되었습니다! 이제 자유롭게 사용할 수 있는 두 가지 API에 대해 자세히 알아보겠습니다. 
+ +### 단일 실행 (run) [[single-execution-(run)]] + +단일 실행 방법은 에이전트의 [`~Agent.run`] 메소드를 사용하는 경우입니다: + +```py +agent.run("Draw me a picture of rivers and lakes.") +``` + + + +수행하려는 작업에 적합한 도구를 자동으로 선택하여 적절하게 실행합니다. +동일한 명령어에서 하나 또는 여러 개의 작업을 수행할 수 있습니다 +(다만, 명령어가 복잡할수록 에이전트가 실패할 가능성이 높아집니다). + +```py +agent.run("Draw me a picture of the sea then transform the picture to add an island") +``` + + + +
+ + +모든 [`~Agent.run`] 작업은 독립적이므로 다른 작업으로 여러 번 연속해서 실행할 수 있습니다. + +`agent`는 큰 언어 모델일 뿐이므로 프롬프트에 약간의 변화를 주면 완전히 다른 결과가 나올 수 있다는 점에 유의하세요. +수행하려는 작업을 최대한 명확하게 설명하는 것이 중요합니다. +좋은 프롬프트를 작성하는 방법은 [여기](custom_tools#writing-good-user-inputs)에서 자세히 확인할 수 있습니다. + +여러 실행에 걸쳐 상태를 유지하거나 텍스트가 아닌 개체를 에이전트에게 전달하려는 경우에는 에이전트가 사용할 변수를 지정할 수 있습니다. +예를 들어 강과 호수의 첫 번째 이미지를 생성한 뒤, +모델이 해당 그림에 섬을 추가하도록 다음과 같이 요청할 수 있습니다: + +```python +picture = agent.run("Generate a picture of rivers and lakes.") +updated_picture = agent.run("Transform the image in `picture` to add an island to it.", picture=picture) +``` + + + +이 방법은 모델이 요청을 이해하지 못하고 도구를 혼합할 때 유용할 수 있습니다. 예를 들면 다음과 같습니다: + +```py +agent.run("Draw me the picture of a capybara swimming in the sea") +``` + +여기서 모델은 두 가지 방식으로 해석할 수 있습니다: +- `text-to-image`이 바다에서 헤엄치는 카피바라를 생성하도록 합니다. +- 또는 `text-to-image`이 카피바라를 생성한 다음 `image-transformation` 도구를 사용하여 바다에서 헤엄치도록 합니다. + +첫 번째 시나리오를 강제로 실행하려면 프롬프트를 인수로 전달하여 실행할 수 있습니다: + +```py +agent.run("Draw me a picture of the `prompt`", prompt="a capybara swimming in the sea") +``` + + + + +### 대화 기반 실행 (chat) [[chat-based-execution-(chat)]] + +에이전트는 [`~Agent.chat`] 메소드를 사용하는 대화 기반 접근 방식도 있습니다: + +```py +agent.chat("Generate a picture of rivers and lakes") +``` + + + +```py +agent.chat("Transform the picture so that there is a rock in there") +``` + + + +
+ +이 방식은 여러 명령어에 걸쳐 상태를 유지하고자 할 때 흥미로운 접근 방식입니다. +실험용으로 더 좋지만 복잡한 명령어보다는 +단일 명령어([`~Agent.run`] 메소드가 더 잘 처리하는 명령어)에 훨씬 더 잘 작동하는 경향이 있습니다. + +이 메소드는 텍스트가 아닌 유형이나 특정 프롬프트를 전달하려는 경우 인수를 받을 수도 있습니다. + +### ⚠️ 원격 실행 [[remote-execution]] + +데모 목적과 모든 설정에서 사용할 수 있도록 +에이전트가 접근할 수 있는 몇 가지 기본 도구에 대한 원격 실행기를 만들었습니다. +이러한 도구는 [inference endpoints](https://huggingface.co/inference-endpoints)를 사용하여 만들어졌습니다. +원격 실행기 도구를 직접 설정하는 방법을 보려면 [사용자 정의 도구 가이드](./custom_tools)를 읽어보시기 바랍니다. + +원격 도구로 실행하려면 [`~Agent.run`] 또는 [`~Agent.chat`] 중 하나에 `remote=True`를 지정하기만 하면 됩니다. + +예를 들어 다음 명령은 많은 RAM이나 GPU 없이도 모든 장치에서 효율적으로 실행할 수 있습니다: + +```py +agent.run("Draw me a picture of rivers and lakes", remote=True) +``` + +[`~Agent.chat`]도 마찬가지입니다: + +```py +agent.chat("Draw me a picture of rivers and lakes", remote=True) +``` + +### 여기서 무슨 일이 일어나는 거죠? 도구란 무엇이고, 에이전트란 무엇인가요? [[whats-happening-here-what-are-tools-and-what-are-agents]] + + + +#### 에이전트 [[agents]] + +여기서 "에이전트"는 대규모 언어 모델이며, 특정 도구 모음에 접근할 수 있도록 프롬프트하고 있습니다. + +LLM은 작은 코드 샘플을 생성하는 데 상당히 능숙하므로, +이 장점을 활용해 도구 모음을 사용하여 작업을 수행하는 작은 코드 샘플을 제공하라는 메시지를 표시합니다. +그런 다음 에이전트에게 제공하는 작업과 제공하는 도구에 대한 설명으로 이 프롬프트가 완료됩니다. +이렇게 하면 사용 중인 도구들의 문서에 접근할 수 있으며, 해당 도구들의 입력과 출력을 예상하고, 관련된 코드를 생성할 수 있습니다. + +#### 도구 [[tools]] + +도구는 매우 간단합니다. 이름과 설명이 있는 단일 기능으로 구성되어 있습니다. +그런 다음 이러한 도구의 설명을 사용하여 상담원에게 프롬프트를 표시합니다. +이 프롬프트를 통해 상담원에게 쿼리에서 요청된 작업을 수행하기 위해 도구를 활용하는 방법을 보여줍니다. + +에이전트가 매우 원자적인 도구를 사용하여 더 나은 코드를 작성하기 때문에 파이프라인이 아닌 완전히 새로운 도구를 사용합니다. +파이프라인은 더 많이 리팩터링되며 종종 여러 작업을 하나로 결합합니다. +도구는 하나의 매우 간단한 작업에만 집중하도록 되어 있습니다. + +#### 코드 실행?! [[code-execution]] + +그런 다음 이 코드는 도구와 함께 전달된 입력 세트에 대해 작은 Python 인터프리터를 사용하여 실행됩니다. +"임의 코드 실행이라니!"이라고 비명을 지르는 소리가 들리겠지만, 그렇지 않은 이유를 설명하겠습니다. + +호출할 수 있는 함수는 제공한 도구와 인쇄 기능뿐이므로 이미 실행할 수 있는 기능이 제한되어 있습니다. +Hugging Face 도구로 제한되어 있다면 안전할 것입니다. + +그리고 어트리뷰트 조회나 가져오기를 허용하지 않으므로 +(어차피 작은 함수 집합에 입/출력을 전달할 때는 필요하지 않아야 합니다) +가장 명백한 공격(어차피 LLM에 출력하라는 메시지를 표시해야 합니다)은 문제가 되지 않습니다. +매우 안전하게 하고 싶다면 추가 인수 return_code=True를 사용하여 run() 메소드를 실행하면 됩니다. +이 경우 에이전트가 실행할 코드를 반환하고 실행할지 여부를 결정할 수 있습니다. + +불법적인 연산을 수행하려고 하거나 에이전트가 생성한 코드에 일반적인 파이썬 오류가 있는 경우 +실행이 중지됩니다. + +### 엄선된 도구 모음 [[a-curated-set-of-tools]] + +저희는 이러한 에이전트들의 역량을 강화할 수 있는 일련의 도구를 확인하고 있습니다. +다음은 연동된 도구의 최신 목록입니다: + +- **문서 질문 답변**: 이미지 형식의 문서(예: PDF)가 주어지면 이 문서에 대한 질문에 답변합니다. ([Donut](./model_doc/donut)) +- **텍스트 질문 답변**: 긴 텍스트와 질문이 주어지면 텍스트에서 질문에 답변합니다. ([Flan-T5](./model_doc/flan-t5)) +- **무조건 이미지 캡셔닝**: 이미지에 캡션을 답니다! ([BLIP](./model_doc/blip)) +- **이미지 질문 답변**: 이미지가 주어지면 이 이미지에 대한 질문에 답변하기. ([VILT](./model_doc/vilt)) +- **이미지 분할**: 이미지와 프롬프트가 주어지면 해당 프롬프트의 분할 마스크를 출력합니다. ([CLIPSeg](./model_doc/clipseg)) +- **음성을 텍스트로 변환**: 사람이 말하는 오디오 녹음이 주어지면 음성을 텍스트로 변환합니다. ([Whisper](./model_doc/whisper)) +- **텍스트 음성 변환**: 텍스트를 음성으로 변환합니다. ([SpeechT5](./model_doc/speecht5)) +- **제로 샷(zero-shot) 텍스트 분류**: 텍스트와 레이블 목록이 주어지면 텍스트와 가장 관련 있는 레이블을 식별합니다. ([BART](./model_doc/bart)) +- **텍스트 요약**: 긴 텍스트를 한 문장 또는 몇 문장으로 요약합니다. ([BART](./model_doc/bart)) +- **번역**: 텍스트를 지정된 언어로 번역합니다. ([NLLB](./model_doc/nllb)) + +이러한 도구는 트랜스포머에 통합되어 있으며, 예를 들어 수동으로도 사용할 수 있습니다: + +```py +from transformers import load_tool + +tool = load_tool("text-to-speech") +audio = tool("This is a text to speech tool") +``` + +### 사용자 정의 도구 [[custom-tools]] + +엄선된 도구 세트도 있지만, 이 구현이 제공하는 가장 큰 가치는 사용자 지정 도구를 빠르게 만들고 공유할 수 있다는 점입니다. + +도구의 코드를 Hugging Face Space나 모델 저장소에 푸시하면 에이전트에게 직접 도구를 활용할 수 있습니다. [`huggingface-tools` organization](https://huggingface.co/huggingface-tools)에 몇 가지 **트랜스포머에 구애받지 않는** 툴을 추가했습니다: + +- **텍스트 다운로더**: 웹 URL에서 텍스트를 다운로드합니다. 
+- **텍스트 이미지 변환**: 프롬프트에 따라 이미지를 생성하여 안정적인 확산을 활용합니다. +- **이미지 변환**: 초기 이미지와 프롬프트가 주어진 이미지를 수정하고, 안정적인 확산을 활용하는 지시 픽셀 2 픽셀을 활용합니다. +- **텍스트 비디오 변환**: 프롬프트에 따라 작은 비디오를 생성하며, damo-vilab을 활용합니다. + +저희가 처음부터 사용하고 있는 텍스트-이미지 변환 도구는 [*huggingface-tools/text-to-image*](https://huggingface.co/spaces/huggingface-tools/text-to-image)에 있는 원격 도구입니다! 저희는 이 도구와 다른 조직에 이러한 도구를 계속 출시하여 이 구현을 더욱 강화할 것입니다. + +에이전트는 기본적으로 [`huggingface-tools`](https://huggingface.co/huggingface-tools)에 있는 도구에 접근할 수 있습니다. +[다음 가이드](custom_tools)에서 도구를 작성하고 공유하는 방법과 Hub에 있는 사용자 지정 도구를 활용하는 방법에 대해 설명합니다. + +### 코드 생성[[code-generation]] + +지금까지 에이전트를 사용하여 작업을 수행하는 방법을 보여드렸습니다. 하지만 에이전트는 매우 제한된 Python 인터프리터를 사용하여 실행할 코드만 생성하고 있습니다. 다른 설정에서 생성된 코드를 사용하려는 경우 에이전트에게 도구 정의 및 정확한 가져오기와 함께 코드를 반환하라는 메시지를 표시할 수 있습니다. + +예를 들어 다음 명령어는 +```python +agent.run("Draw me a picture of rivers and lakes", return_code=True) +``` + +다음 코드를 반환합니다. + +```python +from transformers import load_tool + +image_generator = load_tool("huggingface-tools/text-to-image") + +image = image_generator(prompt="rivers and lakes") +``` + +이 코드는 직접 수정하고 실행할 수 있습니다. \ No newline at end of file diff --git a/docs/source/ko/troubleshooting.md b/docs/source/ko/troubleshooting.md new file mode 100644 index 00000000000000..263d693c23da65 --- /dev/null +++ b/docs/source/ko/troubleshooting.md @@ -0,0 +1,198 @@ + + +# 문제 해결[[troubleshoot]] + +때때로 오류가 발생할 수 있지만, 저희가 도와드리겠습니다! 이 가이드는 현재까지 확인된 가장 일반적인 문제 몇 가지와 그것들을 해결하는 방법에 대해 다룹니다. 그러나 이 가이드는 모든 🤗 Transformers 문제를 포괄적으로 다루고 있지 않습니다. 문제 해결에 더 많은 도움을 받으려면 다음을 시도해보세요: + + + +1. [포럼](https://discuss.huggingface.co/)에서 도움을 요청하세요. [Beginners](https://discuss.huggingface.co/c/beginners/5) 또는 [🤗 Transformers](https://discuss.huggingface.co/c/transformers/9)와 같은 특정 카테고리에 질문을 게시할 수 있습니다. 재현 가능한 코드와 함께 잘 서술된 포럼 게시물을 작성하여 여러분의 문제가 해결될 가능성을 극대화하세요! + + + +2. 라이브러리와 관련된 버그이면 🤗 Transformers 저장소에서 [이슈](https://github.com/huggingface/transformers/issues/new/choose)를 생성하세요. 버그에 대해 설명하는 정보를 가능한 많이 포함하려고 노력하여, 무엇이 잘못 되었는지와 어떻게 수정할 수 있는지 더 잘 파악할 수 있도록 도와주세요. + +3. 이전 버전의 🤗 Transformers을 사용하는 경우 중요한 변경 사항이 버전 사이에 도입되었기 때문에 [마이그레이션](migration) 가이드를 확인하세요. + +문제 해결 및 도움 매뉴얼에 대한 자세한 내용은 Hugging Face 강좌의 [8장](https://huggingface.co/course/chapter8/1?fw=pt)을 참조하세요. + + +## 방화벽 환경[[firewalled-environments]] + +클라우드 및 내부망(intranet) 설정의 일부 GPU 인스턴스는 외부 연결에 대한 방화벽으로 차단되어 연결 오류가 발생할 수 있습니다. 스크립트가 모델 가중치나 데이터를 다운로드하려고 할 때, 다운로드가 중단되고 다음 메시지와 함께 시간 초과됩니다: + +``` +ValueError: Connection error, and we cannot find the requested files in the cached path. +Please try again or make sure your Internet connection is on. +``` + +이 경우에는 연결 오류를 피하기 위해 🤗 Transformers를 [오프라인 모드](installation#offline-mode)로 실행해야 합니다. + +## CUDA 메모리 부족(CUDA out of memory)[[cuda-out-of-memory]] + +수백만 개의 매개변수로 대규모 모델을 훈련하는 것은 적절한 하드웨어 없이 어려울 수 있습니다. GPU 메모리가 부족한 경우 발생할 수 있는 일반적인 오류는 다음과 같습니다: + +``` +CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 11.17 GiB total capacity; 9.70 GiB already allocated; 179.81 MiB free; 9.85 GiB reserved in total by PyTorch) +``` + +다음은 메모리 사용을 줄이기 위해 시도해 볼 수 있는 몇 가지 잠재적인 해결책입니다: + +- [`TrainingArguments`]의 [`per_device_train_batch_size`](main_classes/trainer#transformers.TrainingArguments.per_device_train_batch_size) 값을 줄이세요. +- [`TrainingArguments`]의 [`gradient_accumulation_steps`](main_classes/trainer#transformers.TrainingArguments.gradient_accumulation_steps)은 전체 배치 크기를 효과적으로 늘리세요. + + + +메모리 절약 기술에 대한 자세한 내용은 성능 [가이드](performance)를 참조하세요. 
+ + + +## 저장된 TensorFlow 모델을 가져올 수 없습니다(Unable to load a saved TensorFlow model)[[unable-to-load-a-saved-uensorFlow-model]] + +TensorFlow의 [model.save](https://www.tensorflow.org/tutorials/keras/save_and_load#save_the_entire_model) 메소드는 아키텍처, 가중치, 훈련 구성 등 전체 모델을 단일 파일에 저장합니다. 그러나 모델 파일을 다시 가져올 때 🤗 Transformers는 모델 파일에 있는 모든 TensorFlow 관련 객체를 가져오지 않을 수 있기 때문에 오류가 발생할 수 있습니다. TensorFlow 모델 저장 및 가져오기 문제를 피하려면 다음을 권장합니다: + +- 모델 가중치를 `h5` 파일 확장자로 [`model.save_weights`](https://www.tensorflow.org/tutorials/keras/save_and_load#save_the_entire_model)로 저장한 다음 [`~TFPreTrainedModel.from_pretrained`]로 모델을 다시 가져옵니다: + +```py +>>> from transformers import TFPreTrainedModel +>>> from tensorflow import keras + +>>> model.save_weights("some_folder/tf_model.h5") +>>> model = TFPreTrainedModel.from_pretrained("some_folder") +``` + +- 모델을 [`~TFPretrainedModel.save_pretrained`]로 저장하고 [`~TFPreTrainedModel.from_pretrained`]로 다시 가져옵니다: + +```py +>>> from transformers import TFPreTrainedModel + +>>> model.save_pretrained("path_to/model") +>>> model = TFPreTrainedModel.from_pretrained("path_to/model") +``` + +## ImportError[[importerror]] + +특히 최신 모델인 경우 만날 수 있는 다른 일반적인 오류는 `ImportError`입니다: + +``` +ImportError: cannot import name 'ImageGPTImageProcessor' from 'transformers' (unknown location) +``` + +이러한 오류 유형의 경우 최신 모델에 액세스할 수 있도록 최신 버전의 🤗 Transformers가 설치되어 있는지 확인하세요: + +```bash +pip install transformers --upgrade +``` + +## CUDA error: device-side assert triggered[[cuda-error-deviceside-assert-triggered]] + +때때로 장치 코드 오류에 대한 일반적인 CUDA 오류가 발생할 수 있습니다. + +``` +RuntimeError: CUDA error: device-side assert triggered +``` + +더 자세한 오류 메시지를 얻으려면 우선 코드를 CPU에서 실행합니다. 다음 환경 변수를 코드의 시작 부분에 추가하여 CPU로 전환하세요: + +```py +>>> import os + +>>> os.environ["CUDA_VISIBLE_DEVICES"] = "" +``` + +또 다른 옵션은 GPU에서 더 나은 역추적(traceback)을 얻는 것입니다. 다음 환경 변수를 코드의 시작 부분에 추가하여 역추적이 오류가 발생한 소스를 가리키도록 하세요: + +```py +>>> import os + +>>> os.environ["CUDA_LAUNCH_BLOCKING"] = "1" +``` + +## 패딩 토큰이 마스킹되지 않은 경우 잘못된 출력(Incorrect output when padding tokens aren't masked)[[incorrect-output-when-padding-tokens-arent-masked]] + +경우에 따라 `input_ids`에 패딩 토큰이 포함된 경우 `hidden_state` 출력이 올바르지 않을 수 있습니다. 데모를 위해 모델과 토크나이저를 가져오세요. 모델의 `pad_token_id`에 액세스하여 해당 값을 확인할 수 있습니다. 일부 모델의 경우 `pad_token_id`가 `None`일 수 있지만 언제든지 수동으로 설정할 수 있습니다. + +```py +>>> from transformers import AutoModelForSequenceClassification +>>> import torch + +>>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-uncased") +>>> model.config.pad_token_id +0 +``` + +다음 예제는 패딩 토큰을 마스킹하지 않은 출력을 보여줍니다: + +```py +>>> input_ids = torch.tensor([[7592, 2057, 2097, 2393, 9611, 2115], [7592, 0, 0, 0, 0, 0]]) +>>> output = model(input_ids) +>>> print(output.logits) +tensor([[ 0.0082, -0.2307], + [ 0.1317, -0.1683]], grad_fn=) +``` + +다음은 두 번째 시퀀스의 실제 출력입니다: + +```py +>>> input_ids = torch.tensor([[7592]]) +>>> output = model(input_ids) +>>> print(output.logits) +tensor([[-0.1008, -0.4061]], grad_fn=) +``` + +대부분의 경우 모델에 `attention_mask`를 제공하여 패딩 토큰을 무시해야 이러한 조용한 오류를 방지할 수 있습니다. 이제 두 번째 시퀀스의 출력이 실제 출력과 일치합니다: + + + +일반적으로 토크나이저는 특정 토크나이저의 기본 값을 기준으로 사용자에 대한 'attention_mask'를 만듭니다. + + + +```py +>>> attention_mask = torch.tensor([[1, 1, 1, 1, 1, 1], [1, 0, 0, 0, 0, 0]]) +>>> output = model(input_ids, attention_mask=attention_mask) +>>> print(output.logits) +tensor([[ 0.0082, -0.2307], + [-0.1008, -0.4061]], grad_fn=) +``` + +🤗 Transformers는 패딩 토큰이 제공된 경우 패딩 토큰을 마스킹하기 위한 `attention_mask`를 자동으로 생성하지 않습니다. 그 이유는 다음과 같습니다: + +- 일부 모델에는 패딩 토큰이 없습니다. 
+- 일부 사용 사례의 경우 사용자가 모델이 패딩 토큰을 관리하기를 원합니다. + +## ValueError: 이 유형의 AutoModel에 대해 인식할 수 없는 XYZ 구성 클래스(ValueError: Unrecognized configuration class XYZ for this kind of AutoModel)[[valueerror-unrecognized-configuration-class-xyz-for-this-kind-of-automodel]] + +일반적으로, 사전 학습된 모델의 인스턴스를 가져오기 위해 [`AutoModel`] 클래스를 사용하는 것이 좋습니다. +이 클래스는 구성에 따라 주어진 체크포인트에서 올바른 아키텍처를 자동으로 추론하고 가져올 수 있습니다. +모델을 체크포인트에서 가져올 때 이 `ValueError`가 발생하면, 이는 Auto 클래스가 주어진 체크포인트의 구성에서 +가져오려는 모델 유형과 매핑을 찾을 수 없다는 것을 의미합니다. 가장 흔하게 발생하는 경우는 +체크포인트가 주어진 태스크를 지원하지 않을 때입니다. +예를 들어, 다음 예제에서 질의응답에 대한 GPT2가 없기 때문에 오류가 발생합니다: + +```py +>>> from transformers import AutoProcessor, AutoModelForQuestionAnswering + +>>> processor = AutoProcessor.from_pretrained("openai-community/gpt2-medium") +>>> model = AutoModelForQuestionAnswering.from_pretrained("openai-community/gpt2-medium") +ValueError: Unrecognized configuration class for this kind of AutoModel: AutoModelForQuestionAnswering. +Model type should be one of AlbertConfig, BartConfig, BertConfig, BigBirdConfig, BigBirdPegasusConfig, BloomConfig, ... +``` diff --git a/docs/source/ms/_toctree.yml b/docs/source/ms/_toctree.yml new file mode 100644 index 00000000000000..0ec1ee59ad8914 --- /dev/null +++ b/docs/source/ms/_toctree.yml @@ -0,0 +1,688 @@ +- sections: + - local: index + title: 🤗 Transformers + - local: quicktour + title: Lawatan cepat + - local: installation + title: Pemasangan + title: Mulakan +- sections: + - local: pipeline_tutorial + title: Jalankan inferens dengan saluran paip + - local: autoclass_tutorial + title: Tulis kod mudah alih dengan AutoClass + - local: preprocessing + title: Praproses data + - local: training + title: Perhalusi model yang telah dilatih + - local: run_scripts + title: Latih dengan skrip + - local: accelerate + title: Sediakan latihan yang diedarkan dengan 🤗 Accelerate + - local: model_sharing + title: Kongsi model anda + - local: transformers_agents + title: Ejen + title: Tutorials +- sections: + - sections: + - local: tasks/sequence_classification + title: Klasifikasi teks + - local: tasks/token_classification + title: Klasifikasi token + - local: tasks/question_answering + title: Soalan menjawab + - local: tasks/language_modeling + title: Pemodelan bahasa sebab-akibat + - local: tasks/masked_language_modeling + title: Pemodelan bahasa Masked + - local: tasks/translation + title: Terjemahan + - local: tasks/summarization + title: Rumusan + - local: tasks/multiple_choice + title: Pilihan + title: Natural Language Processing + isExpanded: false + - sections: + - local: tasks/audio_classification + title: Klasifikasi audio + - local: tasks/asr + title: Pengecaman pertuturan automatik + title: Audio + isExpanded: false + - sections: + - local: tasks/image_classification + title: Klasifikasi imej + - local: tasks/semantic_segmentation + title: Segmentasi semantik + - local: tasks/video_classification + title: Klasifikasi video + - local: tasks/object_detection + title: Pengesanan objek + - local: tasks/zero_shot_object_detection + title: Pengesanan objek Zero-Shot + - local: tasks/zero_shot_image_classification + title: Klasifikasi imej tangkapan Zero-Shot + - local: tasks/monocular_depth_estimation + title: Anggaran kedalaman + title: Visi komputer + isExpanded: false + - sections: + - local: tasks/image_captioning + title: Kapsyen imej + - local: tasks/document_question_answering + title: Menjawab Soalan Dokumen + - local: tasks/text-to-speech + title: Teks kepada ucapan + title: Multimodal + isExpanded: false + title: Panduan Tugasan +- sections: + 
- local: fast_tokenizers + title: Gunakan tokenizer cepat dari 🤗 Tokenizers + - local: multilingual + title: Jalankan inferens dengan model berbilang bahasa + - local: generation_strategies + title: Sesuaikan strategi penjanaan teks + - local: create_a_model + title: Gunakan API khusus model + - local: custom_models + title: Kongsi model tersuai + - local: sagemaker + title: Jalankan latihan di Amazon SageMaker + - local: serialization + title: Eksport ke ONNX + - local: torchscript + title: Eksport ke TorchScript + - local: benchmarks + title: Penanda aras + - local: Buku nota dengan contoh + title: Notebooks with examples + - local: Sumber komuniti + title: Community resources + - local: Sumber komuniti + title: Custom Tools and Prompts + - local: Alat dan Gesaan Tersuai + title: Selesaikan masalah + title: Panduan Developer +- sections: + - local: performance + title: Gambaran keseluruhan + - local: perf_train_gpu_one + title: Latihan pada satu GPU + - local: perf_train_gpu_many + title: Latihan pada banyak GPU + - local: perf_train_cpu + title: Latihan mengenai CPU + - local: perf_train_cpu_many + title: Latihan pada banyak CPU + - local: perf_train_tpu + title: Latihan mengenai TPU + - local: perf_train_tpu_tf + title: Latihan tentang TPU dengan TensorFlow + - local: perf_train_special + title: Latihan mengenai Perkakasan Khusus + - local: perf_infer_cpu + title: Inferens pada CPU + - local: perf_infer_gpu_one + title: Inferens pada satu GPU + - local: perf_infer_gpu_many + title: Inferens pada banyak GPUs + - local: perf_infer_special + title: Inferens pada Perkakasan Khusus + - local: perf_hardware + title: Perkakasan tersuai untuk latihan + - local: big_models + title: Menghidupkan model besar + - local: debugging + title: Penyahpepijatan + - local: hpo_train + title: Carian Hiperparameter menggunakan API Pelatih + - local: tf_xla + title: Penyepaduan XLA untuk Model TensorFlow + title: Prestasi dan kebolehskalaan +- sections: + - local: contributing + title: Bagaimana untuk menyumbang kepada transformer? + - local: add_new_model + title: Bagaimana untuk menambah model pada 🤗 Transformers? + - local: add_tensorflow_model + title: Bagaimana untuk menukar model Transformers kepada TensorFlow? + - local: add_new_pipeline + title: Bagaimana untuk menambah saluran paip ke 🤗 Transformers? 
+ - local: testing + title: Ujian + - local: pr_checks + title: Menyemak Permintaan Tarik + title: Sumbangkan + +- sections: + - local: philosophy + title: Falsafah + - local: glossary + title: Glosari + - local: task_summary + title: Apa 🤗 Transformers boleh buat + - local: tasks_explained + title: Bagaimana 🤗 Transformers menyelesaikan tugasan + - local: model_summary + title: Keluarga model Transformer + - local: tokenizer_summary + title: Ringkasan tokenizer + - local: attention + title: Mekanisme perhatian + - local: pad_truncation + title: Padding dan pemotongan + - local: bertology + title: BERTology + - local: perplexity + title: Kekeliruan model panjang tetap + - local: pipeline_webserver + title: Saluran paip untuk inferens pelayan web + title: Panduan konsep +- sections: + - sections: + - local: main_classes/agent + title: Ejen dan Alat + - local: model_doc/auto + title: Kelas Auto + - local: main_classes/callback + title: Panggilan balik + - local: main_classes/configuration + title: Configuration + - local: main_classes/data_collator + title: Data Collator + - local: main_classes/keras_callbacks + title: Keras callbacks + - local: main_classes/logging + title: Logging + - local: main_classes/model + title: Models + - local: main_classes/text_generation + title: Text Generation + - local: main_classes/onnx + title: ONNX + - local: main_classes/optimizer_schedules + title: Optimization + - local: main_classes/output + title: Model outputs + - local: main_classes/pipelines + title: Pipelines + - local: main_classes/processors + title: Processors + - local: main_classes/quantization + title: Quantization + - local: main_classes/tokenizer + title: Tokenizer + - local: main_classes/trainer + title: Trainer + - local: main_classes/deepspeed + title: DeepSpeed Integration + - local: main_classes/feature_extractor + title: Feature Extractor + - local: main_classes/image_processor + title: Image Processor + title: Main Classes + - sections: + - isExpanded: false + sections: + - local: model_doc/albert + title: ALBERT + - local: model_doc/bart + title: BART + - local: model_doc/barthez + title: BARThez + - local: model_doc/bartpho + title: BARTpho + - local: model_doc/bert + title: BERT + - local: model_doc/bert-generation + title: BertGeneration + - local: model_doc/bert-japanese + title: BertJapanese + - local: model_doc/bertweet + title: Bertweet + - local: model_doc/big_bird + title: BigBird + - local: model_doc/bigbird_pegasus + title: BigBirdPegasus + - local: model_doc/biogpt + title: BioGpt + - local: model_doc/blenderbot + title: Blenderbot + - local: model_doc/blenderbot-small + title: Blenderbot Small + - local: model_doc/bloom + title: BLOOM + - local: model_doc/bort + title: BORT + - local: model_doc/byt5 + title: ByT5 + - local: model_doc/camembert + title: CamemBERT + - local: model_doc/canine + title: CANINE + - local: model_doc/codegen + title: CodeGen + - local: model_doc/convbert + title: ConvBERT + - local: model_doc/cpm + title: CPM + - local: model_doc/cpmant + title: CPMANT + - local: model_doc/ctrl + title: CTRL + - local: model_doc/deberta + title: DeBERTa + - local: model_doc/deberta-v2 + title: DeBERTa-v2 + - local: model_doc/dialogpt + title: DialoGPT + - local: model_doc/distilbert + title: DistilBERT + - local: model_doc/dpr + title: DPR + - local: model_doc/electra + title: ELECTRA + - local: model_doc/encoder-decoder + title: Encoder Decoder Models + - local: model_doc/ernie + title: ERNIE + - local: model_doc/ernie_m + title: ErnieM + - local: 
model_doc/esm + title: ESM + - local: model_doc/flan-t5 + title: FLAN-T5 + - local: model_doc/flan-ul2 + title: FLAN-UL2 + - local: model_doc/flaubert + title: FlauBERT + - local: model_doc/fnet + title: FNet + - local: model_doc/fsmt + title: FSMT + - local: model_doc/funnel + title: Funnel Transformer + - local: model_doc/openai-gpt + title: GPT + - local: model_doc/gpt_neo + title: GPT Neo + - local: model_doc/gpt_neox + title: GPT NeoX + - local: model_doc/gpt_neox_japanese + title: GPT NeoX Japanese + - local: model_doc/gptj + title: GPT-J + - local: model_doc/gpt2 + title: GPT2 + - local: model_doc/gpt_bigcode + title: GPTBigCode + - local: model_doc/gptsan-japanese + title: GPTSAN Japanese + - local: model_doc/gpt-sw3 + title: GPTSw3 + - local: model_doc/herbert + title: HerBERT + - local: model_doc/ibert + title: I-BERT + - local: model_doc/jukebox + title: Jukebox + - local: model_doc/led + title: LED + - local: model_doc/llama + title: LLaMA + - local: model_doc/longformer + title: Longformer + - local: model_doc/longt5 + title: LongT5 + - local: model_doc/luke + title: LUKE + - local: model_doc/m2m_100 + title: M2M100 + - local: model_doc/marian + title: MarianMT + - local: model_doc/markuplm + title: MarkupLM + - local: model_doc/mbart + title: MBart and MBart-50 + - local: model_doc/mega + title: MEGA + - local: model_doc/megatron-bert + title: MegatronBERT + - local: model_doc/megatron_gpt2 + title: MegatronGPT2 + - local: model_doc/mluke + title: mLUKE + - local: model_doc/mobilebert + title: MobileBERT + - local: model_doc/mpnet + title: MPNet + - local: model_doc/mt5 + title: MT5 + - local: model_doc/mvp + title: MVP + - local: model_doc/nezha + title: NEZHA + - local: model_doc/nllb + title: NLLB + - local: model_doc/nllb-moe + title: NLLB-MoE + - local: model_doc/nystromformer + title: Nyströmformer + - local: model_doc/open-llama + title: Open-Llama + - local: model_doc/opt + title: OPT + - local: model_doc/pegasus + title: Pegasus + - local: model_doc/pegasus_x + title: PEGASUS-X + - local: model_doc/phobert + title: PhoBERT + - local: model_doc/plbart + title: PLBart + - local: model_doc/prophetnet + title: ProphetNet + - local: model_doc/qdqbert + title: QDQBert + - local: model_doc/rag + title: RAG + - local: model_doc/realm + title: REALM + - local: model_doc/reformer + title: Reformer + - local: model_doc/rembert + title: RemBERT + - local: model_doc/retribert + title: RetriBERT + - local: model_doc/roberta + title: RoBERTa + - local: model_doc/roberta-prelayernorm + title: RoBERTa-PreLayerNorm + - local: model_doc/roc_bert + title: RoCBert + - local: model_doc/roformer + title: RoFormer + - local: model_doc/rwkv + title: RWKV + - local: model_doc/splinter + title: Splinter + - local: model_doc/squeezebert + title: SqueezeBERT + - local: model_doc/switch_transformers + title: SwitchTransformers + - local: model_doc/t5 + title: T5 + - local: model_doc/t5v1.1 + title: T5v1.1 + - local: model_doc/tapex + title: TAPEX + - local: model_doc/transfo-xl + title: Transformer XL + - local: model_doc/ul2 + title: UL2 + - local: model_doc/xmod + title: X-MOD + - local: model_doc/xglm + title: XGLM + - local: model_doc/xlm + title: XLM + - local: model_doc/xlm-prophetnet + title: XLM-ProphetNet + - local: model_doc/xlm-roberta + title: XLM-RoBERTa + - local: model_doc/xlm-roberta-xl + title: XLM-RoBERTa-XL + - local: model_doc/xlm-v + title: XLM-V + - local: model_doc/xlnet + title: XLNet + - local: model_doc/yoso + title: YOSO + title: Text models + - isExpanded: false + 
sections: + - local: model_doc/beit + title: BEiT + - local: model_doc/bit + title: BiT + - local: model_doc/conditional_detr + title: Conditional DETR + - local: model_doc/convnext + title: ConvNeXT + - local: model_doc/convnextv2 + title: ConvNeXTV2 + - local: model_doc/cvt + title: CvT + - local: model_doc/deformable_detr + title: Deformable DETR + - local: model_doc/deit + title: DeiT + - local: model_doc/deta + title: DETA + - local: model_doc/detr + title: DETR + - local: model_doc/dinat + title: DiNAT + - local: model_doc/dit + title: DiT + - local: model_doc/dpt + title: DPT + - local: model_doc/efficientformer + title: EfficientFormer + - local: model_doc/efficientnet + title: EfficientNet + - local: model_doc/focalnet + title: FocalNet + - local: model_doc/glpn + title: GLPN + - local: model_doc/imagegpt + title: ImageGPT + - local: model_doc/levit + title: LeViT + - local: model_doc/mask2former + title: Mask2Former + - local: model_doc/maskformer + title: MaskFormer + - local: model_doc/mobilenet_v1 + title: MobileNetV1 + - local: model_doc/mobilenet_v2 + title: MobileNetV2 + - local: model_doc/mobilevit + title: MobileViT + - local: model_doc/nat + title: NAT + - local: model_doc/poolformer + title: PoolFormer + - local: model_doc/regnet + title: RegNet + - local: model_doc/resnet + title: ResNet + - local: model_doc/segformer + title: SegFormer + - local: model_doc/swiftformer + title: SwiftFormer + - local: model_doc/swin + title: Swin Transformer + - local: model_doc/swinv2 + title: Swin Transformer V2 + - local: model_doc/swin2sr + title: Swin2SR + - local: model_doc/table-transformer + title: Table Transformer + - local: model_doc/timesformer + title: TimeSformer + - local: model_doc/upernet + title: UperNet + - local: model_doc/van + title: VAN + - local: model_doc/videomae + title: VideoMAE + - local: model_doc/vit + title: Vision Transformer (ViT) + - local: model_doc/vit_hybrid + title: ViT Hybrid + - local: model_doc/vit_mae + title: ViTMAE + - local: model_doc/vit_msn + title: ViTMSN + - local: model_doc/yolos + title: YOLOS + title: Vision models + - isExpanded: false + sections: + - local: model_doc/audio-spectrogram-transformer + title: Audio Spectrogram Transformer + - local: model_doc/clap + title: CLAP + - local: model_doc/hubert + title: Hubert + - local: model_doc/mctct + title: MCTCT + - local: model_doc/sew + title: SEW + - local: model_doc/sew-d + title: SEW-D + - local: model_doc/speech_to_text + title: Speech2Text + - local: model_doc/speech_to_text_2 + title: Speech2Text2 + - local: model_doc/speecht5 + title: SpeechT5 + - local: model_doc/unispeech + title: UniSpeech + - local: model_doc/unispeech-sat + title: UniSpeech-SAT + - local: model_doc/wav2vec2 + title: Wav2Vec2 + - local: model_doc/wav2vec2-conformer + title: Wav2Vec2-Conformer + - local: model_doc/wav2vec2_phoneme + title: Wav2Vec2Phoneme + - local: model_doc/wavlm + title: WavLM + - local: model_doc/whisper + title: Whisper + - local: model_doc/xls_r + title: XLS-R + - local: model_doc/xlsr_wav2vec2 + title: XLSR-Wav2Vec2 + title: Audio models + - isExpanded: false + sections: + - local: model_doc/align + title: ALIGN + - local: model_doc/altclip + title: AltCLIP + - local: model_doc/blip + title: BLIP + - local: model_doc/blip-2 + title: BLIP-2 + - local: model_doc/bridgetower + title: BridgeTower + - local: model_doc/chinese_clip + title: Chinese-CLIP + - local: model_doc/clip + title: CLIP + - local: model_doc/clipseg + title: CLIPSeg + - local: model_doc/data2vec + title: Data2Vec + - 
local: model_doc/deplot + title: DePlot + - local: model_doc/donut + title: Donut + - local: model_doc/flava + title: FLAVA + - local: model_doc/git + title: GIT + - local: model_doc/groupvit + title: GroupViT + - local: model_doc/layoutlm + title: LayoutLM + - local: model_doc/layoutlmv2 + title: LayoutLMV2 + - local: model_doc/layoutlmv3 + title: LayoutLMV3 + - local: model_doc/layoutxlm + title: LayoutXLM + - local: model_doc/lilt + title: LiLT + - local: model_doc/lxmert + title: LXMERT + - local: model_doc/matcha + title: MatCha + - local: model_doc/mgp-str + title: MGP-STR + - local: model_doc/oneformer + title: OneFormer + - local: model_doc/owlvit + title: OWL-ViT + - local: model_doc/perceiver + title: Perceiver + - local: model_doc/pix2struct + title: Pix2Struct + - local: model_doc/sam + title: Segment Anything + - local: model_doc/speech-encoder-decoder + title: Speech Encoder Decoder Models + - local: model_doc/tapas + title: TAPAS + - local: model_doc/trocr + title: TrOCR + - local: model_doc/tvlt + title: TVLT + - local: model_doc/vilt + title: ViLT + - local: model_doc/vision-encoder-decoder + title: Vision Encoder Decoder Models + - local: model_doc/vision-text-dual-encoder + title: Vision Text Dual Encoder + - local: model_doc/visual_bert + title: VisualBERT + - local: model_doc/xclip + title: X-CLIP + title: Multimodal models + - isExpanded: false + sections: + - local: model_doc/decision_transformer + title: Decision Transformer + - local: model_doc/trajectory_transformer + title: Trajectory Transformer + title: Reinforcement learning models + - isExpanded: false + sections: + - local: model_doc/informer + title: Informer + - local: model_doc/time_series_transformer + title: Time Series Transformer + title: Time series models + - isExpanded: false + sections: + - local: model_doc/graphormer + title: Graphormer + title: Graph models + title: Models + - sections: + - local: internal/modeling_utils + title: Custom Layers and Utilities + - local: internal/pipelines_utils + title: Utilities for pipelines + - local: internal/tokenization_utils + title: Utilities for Tokenizers + - local: internal/trainer_utils + title: Utilities for Trainer + - local: internal/generation_utils + title: Utilities for Generation + - local: internal/image_processing_utils + title: Utilities for Image Processors + - local: internal/audio_utils + title: Utilities for Audio processing + - local: internal/file_utils + title: General Utilities + - local: internal/time_series_utils + title: Utilities for Time Series + title: Internal Helpers + title: API diff --git a/docs/source/en/index.mdx b/docs/source/ms/index.md similarity index 84% rename from docs/source/en/index.mdx rename to docs/source/ms/index.md index 66e79b92c5f09d..f51c43c9bd01a6 100644 --- a/docs/source/en/index.mdx +++ b/docs/source/ms/index.md @@ -1,57 +1,63 @@ # 🤗 Transformers -State-of-the-art Machine Learning for [PyTorch](https://pytorch.org/), [TensorFlow](https://www.tensorflow.org/), and [JAX](https://jax.readthedocs.io/en/latest/). +Pembelajaran Mesin terkini untuk [PyTorch](https://pytorch.org/), [TensorFlow](https://www.tensorflow.org/), dan [JAX](https://jax.readthedocs.io/en/latest/). -🤗 Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch. 
These models support common tasks in different modalities, such as: +🤗 Transformers menyediakan API dan alatan untuk memuat turun dan melatih model pra-latihan terkini dengan mudah. Menggunakan model terlatih boleh mengurangkan kos pengiraan anda, jejak karbon dan menjimatkan masa serta sumber yang diperlukan untuk melatih model dari awal. Model ini menyokong tugas biasa dalam modaliti yang berbeza, seperti: -📝 **Natural Language Processing**: text classification, named entity recognition, question answering, language modeling, summarization, translation, multiple choice, and text generation.
-🖼️ **Computer Vision**: image classification, object detection, and segmentation.
-🗣️ **Audio**: automatic speech recognition and audio classification.
-🐙 **Multimodal**: table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering. +📝 **Natural Language Processing**: klasifikasi teks, pengecaman entiti bernama, menjawab soalan, pemodelan bahasa, ringkasan, terjemahan, pilihan berganda dan penjanaan teks.
+🖼️ **Computer Vision**: pengelasan imej, pengesanan objek dan pembahagian.
+🗣️ **Audio**: pengecaman pertuturan automatik dan klasifikasi audio.
+🐙 **Multimodal**: jawapan soalan jadual, pengecaman aksara optik, pengekstrakan maklumat daripada dokumen yang diimbas, klasifikasi video dan jawapan soalan visual. -🤗 Transformers support framework interoperability between PyTorch, TensorFlow, and JAX. This provides the flexibility to use a different framework at each stage of a model's life; train a model in three lines of code in one framework, and load it for inference in another. Models can also be exported to a format like ONNX and TorchScript for deployment in production environments. +🤗 Transformer menyokong kebolehoperasian rangka kerja antara PyTorch, TensorFlow, and JAX. Ini memberikan fleksibiliti untuk menggunakan rangka kerja yang berbeza pada setiap peringkat kehidupan model; latih model dalam tiga baris kod dalam satu rangka kerja, dan muatkannya untuk inferens dalam rangka kerja yang lain. Model juga boleh dieksport ke format seperti ONNX. -Join the growing community on the [Hub](https://huggingface.co/models), [forum](https://discuss.huggingface.co/), or [Discord](https://discord.com/invite/JfAtkvEtRb) today! +Sertai komuniti yang semakin berkembang di [Hub](https://huggingface.co/models), [forum](https://discuss.huggingface.co/), atau [Discord](https://discord.com/invite/JfAtkvEtRb) hari ini! -## If you are looking for custom support from the Hugging Face team +## Jika anda sedang mencari sokongan tersuai daripada pasukan Hugging Face HuggingFace Expert Acceleration Program -## Contents +## Kandungan -The documentation is organized into five sections: +Dokumentasi disusun kepada lima bahagian: -- **GET STARTED** provides a quick tour of the library and installation instructions to get up and running. -- **TUTORIALS** are a great place to start if you're a beginner. This section will help you gain the basic skills you need to start using the library. -- **HOW-TO GUIDES** show you how to achieve a specific goal, like finetuning a pretrained model for language modeling or how to write and share a custom model. -- **CONCEPTUAL GUIDES** offers more discussion and explanation of the underlying concepts and ideas behind models, tasks, and the design philosophy of 🤗 Transformers. -- **API** describes all classes and functions: +- **MULAKAN** menyediakan lawatan pantas ke perpustakaan dan arahan pemasangan untuk bangun dan berjalan. +- **TUTORIAL** ialah tempat yang bagus untuk bermula jika anda seorang pemula. Bahagian ini akan membantu anda memperoleh kemahiran asas yang anda perlukan untuk mula menggunakan perpustakaan. +- **PANDUAN CARA-CARA** menunjukkan kepada anda cara untuk mencapai matlamat tertentu, seperti memperhalusi model terlatih untuk pemodelan bahasa atau cara menulis dan berkongsi model tersuai. +- **PANDUAN KONSEP** menawarkan lebih banyak perbincangan dan penjelasan tentang konsep dan idea asas di sebalik model, tugasan dan falsafah reka bentuk 🤗 Transformers. +- **API** menerangkan semua kelas dan fungsi: - - **MAIN CLASSES** details the most important classes like configuration, model, tokenizer, and pipeline. - - **MODELS** details the classes and functions related to each model implemented in the library. - - **INTERNAL HELPERS** details utility classes and functions used internally. + - **KELAS UTAMA** memperincikan kelas yang paling penting seperti konfigurasi, model, tokenizer dan saluran paip. + - **MODEL** memperincikan kelas dan fungsi yang berkaitan dengan setiap model yang dilaksanakan dalam perpustakaan. 
+ - **PEMBANTU DALAMAN** memperincikan kelas utiliti dan fungsi yang digunakan secara dalaman. -### Supported models +### Model yang disokong - + 1. **[ALBERT](model_doc/albert)** (from Google Research and the Toyota Technological Institute at Chicago) released with the paper [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. +1. **[ALIGN](model_doc/align)** (from Google Research) released with the paper [Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision](https://arxiv.org/abs/2102.05918) by Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. 1. **[AltCLIP](model_doc/altclip)** (from BAAI) released with the paper [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) by Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell. 1. **[Audio Spectrogram Transformer](model_doc/audio-spectrogram-transformer)** (from MIT) released with the paper [AST: Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) by Yuan Gong, Yu-An Chung, James Glass. +1. **[Autoformer](model_doc/autoformer)** (from Tsinghua University) released with the paper [Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting](https://arxiv.org/abs/2106.13008) by Haixu Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long. 1. **[BART](model_doc/bart)** (from Facebook) released with the paper [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461) by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov and Luke Zettlemoyer. 1. **[BARThez](model_doc/barthez)** (from École polytechnique) released with the paper [BARThez: a Skilled Pretrained French Sequence-to-Sequence Model](https://arxiv.org/abs/2010.12321) by Moussa Kamal Eddine, Antoine J.-P. Tixier, Michalis Vazirgiannis. 1. **[BARTpho](model_doc/bartpho)** (from VinAI Research) released with the paper [BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese](https://arxiv.org/abs/2109.09701) by Nguyen Luong Tran, Duong Minh Le and Dat Quoc Nguyen. @@ -70,18 +76,21 @@ The documentation is organized into five sections: 1. **[BLOOM](model_doc/bloom)** (from BigScience workshop) released by the [BigScience Workshop](https://bigscience.huggingface.co/). 1. **[BORT](model_doc/bort)** (from Alexa) released with the paper [Optimal Subarchitecture Extraction For BERT](https://arxiv.org/abs/2010.10499) by Adrian de Wynter and Daniel J. Perry. 1. **[BridgeTower](model_doc/bridgetower)** (from Harbin Institute of Technology/Microsoft Research Asia/Intel Labs) released with the paper [BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning](https://arxiv.org/abs/2206.08657) by Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan. +1. **[Bros](model_doc/bros)** (from NAVER) released with the paper [BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents](https://arxiv.org/abs/2108.04539) by Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park. 1. 
**[ByT5](model_doc/byt5)** (from Google Research) released with the paper [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel. 1. **[CamemBERT](model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot. 1. **[CANINE](model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. 1. **[Chinese-CLIP](model_doc/chinese_clip)** (from OFA-Sys) released with the paper [Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese](https://arxiv.org/abs/2211.01335) by An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou. -1. **[CLAP](model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation]https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. +1. **[CLAP](model_doc/clap)** (from LAION-AI) released with the paper [Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687) by Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov. 1. **[CLIP](model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. 1. **[CLIPSeg](model_doc/clipseg)** (from University of Göttingen) released with the paper [Image Segmentation Using Text and Image Prompts](https://arxiv.org/abs/2112.10003) by Timo Lüddecke and Alexander Ecker. 1. **[CodeGen](model_doc/codegen)** (from Salesforce) released with the paper [A Conversational Paradigm for Program Synthesis](https://arxiv.org/abs/2203.13474) by Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong. 1. **[Conditional DETR](model_doc/conditional_detr)** (from Microsoft Research Asia) released with the paper [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang. 1. **[ConvBERT](model_doc/convbert)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan. 1. **[ConvNeXT](model_doc/convnext)** (from Facebook AI) released with the paper [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie. +1. 
**[ConvNeXTV2](model_doc/convnextv2)** (from Facebook AI) released with the paper [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie. 1. **[CPM](model_doc/cpm)** (from Tsinghua University) released with the paper [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun. +1. **[CPM-Ant](model_doc/cpmant)** (from OpenBMB) released by the [OpenBMB](https://www.openbmb.org/). 1. **[CTRL](model_doc/ctrl)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher. 1. **[CvT](model_doc/cvt)** (from Microsoft) released with the paper [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang. 1. **[Data2Vec](model_doc/data2vec)** (from Facebook) released with the paper [Data2Vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli. @@ -90,6 +99,7 @@ The documentation is organized into five sections: 1. **[Decision Transformer](model_doc/decision_transformer)** (from Berkeley/Facebook/Google) released with the paper [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345) by Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch. 1. **[Deformable DETR](model_doc/deformable_detr)** (from SenseTime Research) released with the paper [Deformable DETR: Deformable Transformers for End-to-End Object Detection](https://arxiv.org/abs/2010.04159) by Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai. 1. **[DeiT](model_doc/deit)** (from Facebook) released with the paper [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877) by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou. +1. **[DePlot](model_doc/deplot)** (from Google AI) released with the paper [DePlot: One-shot visual language reasoning by plot-to-table translation](https://arxiv.org/abs/2212.10505) by Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, Yasemin Altun. 1. **[DETA](model_doc/deta)** (from The University of Texas at Austin) released with the paper [NMS Strikes Back](https://arxiv.org/abs/2212.06137) by Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl. 1. **[DETR](model_doc/detr)** (from Facebook) released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko. 1. 
**[DialoGPT](model_doc/dialogpt)** (from Microsoft Research) released with the paper [DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation](https://arxiv.org/abs/1911.00536) by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan. @@ -100,30 +110,36 @@ The documentation is organized into five sections: 1. **[DPR](model_doc/dpr)** (from Facebook) released with the paper [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 1. **[DPT](master/model_doc/dpt)** (from Intel Labs) released with the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun. 1. **[EfficientFormer](model_doc/efficientformer)** (from Snap Research) released with the paper [EfficientFormer: Vision Transformers at MobileNetSpeed](https://arxiv.org/abs/2206.01191) by Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren. +1. **[EfficientNet](model_doc/efficientnet)** (from Google Brain) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan, Quoc V. Le. 1. **[ELECTRA](model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. 1. **[EncoderDecoder](model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn. 1. **[ERNIE](model_doc/ernie)** (from Baidu) released with the paper [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223) by Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu. 1. **[ErnieM](model_doc/ernie_m)** (from Baidu) released with the paper [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) by Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang. 1. **[ESM](model_doc/esm)** (from Meta AI) are transformer protein language models. **ESM-1b** was released with the paper [Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences](https://www.pnas.org/content/118/15/e2016239118) by Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. **ESM-1v** was released with the paper [Language models enable zero-shot prediction of the effects of mutations on protein function](https://doi.org/10.1101/2021.07.09.450648) by Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu and Alexander Rives. 
**ESM-2 and ESMFold** were released with the paper [Language models of protein sequences at the scale of evolution enable accurate structure prediction](https://doi.org/10.1101/2022.07.20.500902) by Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, Alexander Rives. 1. **[FLAN-T5](model_doc/flan-t5)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei +1. **[FLAN-UL2](model_doc/flan-ul2)** (from Google AI) released in the repository [google-research/t5x](https://github.com/google-research/t5x/blob/main/docs/models.md#flan-ul2-checkpoints) by Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei 1. **[FlauBERT](model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab. 1. **[FLAVA](model_doc/flava)** (from Facebook AI) released with the paper [FLAVA: A Foundational Language And Vision Alignment Model](https://arxiv.org/abs/2112.04482) by Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. 1. **[FNet](model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. +1. **[FocalNet](model_doc/focalnet)** (from Microsoft Research) released with the paper [Focal Modulation Networks](https://arxiv.org/abs/2203.11926) by Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao. 1. **[Funnel Transformer](model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le. 1. **[GIT](model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang. 1. **[GLPN](model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim. -1. 
**[GPT](model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. +1. **[GPT](model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://openai.com/research/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. 1. **[GPT Neo](model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. 1. **[GPT NeoX](model_doc/gpt_neox)** (from EleutherAI) released with the paper [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745) by Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach 1. **[GPT NeoX Japanese](model_doc/gpt_neox_japanese)** (from ABEJA) released by Shinya Otani, Takayoshi Makabe, Anuj Arora, and Kyo Hattori. -1. **[GPT-2](model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**. +1. **[GPT-2](model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://openai.com/research/better-language-models/) by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. 1. **[GPT-J](model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki. 1. **[GPT-Sw3](model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren. +1. **[GPTBigCode](model_doc/gpt_bigcode)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra. +1. **[GPTSAN-japanese](model_doc/gptsan-japanese)** released in the repository [tanreinama/GPTSAN](https://github.com/tanreinama/GPTSAN/blob/main/report/model.md) by Toshiyuki Sakamoto(tanreinama). 1. 
**[Graphormer](model_doc/graphormer)** (from Microsoft) released with the paper [Do Transformers Really Perform Bad for Graph Representation?](https://arxiv.org/abs/2106.05234) by Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu. 1. **[GroupViT](model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang. 1. **[Hubert](model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed. 1. **[I-BERT](model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer. 1. **[ImageGPT](model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. +1. **[Informer](model_doc/informer)** (from Beihang University, UC Berkeley, Rutgers University, SEDD Company) released with the paper [Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting](https://arxiv.org/abs/2012.07436) by Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 1. **[Jukebox](model_doc/jukebox)** (from OpenAI) released with the paper [Jukebox: A Generative Model for Music](https://arxiv.org/pdf/2005.00341.pdf) by Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, Ilya Sutskever. 1. **[LayoutLM](model_doc/layoutlm)** (from Microsoft Research Asia) released with the paper [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318) by Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou. 1. **[LayoutLMv2](model_doc/layoutlmv2)** (from Microsoft Research Asia) released with the paper [LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding](https://arxiv.org/abs/2012.14740) by Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou. @@ -132,6 +148,7 @@ The documentation is organized into five sections: 1. **[LED](model_doc/led)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan. 1. **[LeViT](model_doc/levit)** (from Meta AI) released with the paper [LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference](https://arxiv.org/abs/2104.01136) by Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, Matthijs Douze. 1. **[LiLT](model_doc/lilt)** (from South China University of Technology) released with the paper [LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding](https://arxiv.org/abs/2202.13669) by Jiapeng Wang, Lianwen Jin, Kai Ding. +1. 
**[LLaMA](model_doc/llama)** (from The FAIR team of Meta AI) released with the paper [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971) by Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. 1. **[Longformer](model_doc/longformer)** (from AllenAI) released with the paper [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan. 1. **[LongT5](model_doc/longt5)** (from Google AI) released with the paper [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/abs/2112.07916) by Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni, Yun-Hsuan Sung, Yinfei Yang. 1. **[LUKE](model_doc/luke)** (from Studio Ousia) released with the paper [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://arxiv.org/abs/2010.01057) by Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto. @@ -142,10 +159,13 @@ The documentation is organized into five sections: 1. **[MarkupLM](model_doc/markuplm)** (from Microsoft Research Asia) released with the paper [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei. 1. **[Mask2Former](model_doc/mask2former)** (from FAIR and UIUC) released with the paper [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) by Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar. 1. **[MaskFormer](model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov. +1. **[MatCha](model_doc/matcha)** (from Google AI) released with the paper [MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering](https://arxiv.org/abs/2212.09662) by Fangyu Liu, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Yasemin Altun, Nigel Collier, Julian Martin Eisenschlos. 1. **[mBART](model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. 1. **[mBART-50](model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan. +1. **[MEGA](model_doc/mega)** (from Meta/USC/CMU/SJTU) released with the paper [Mega: Moving Average Equipped Gated Attention](https://arxiv.org/abs/2209.10655) by Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. 1. **[Megatron-BERT](model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. 1. 
**[Megatron-GPT2](model_doc/megatron_gpt2)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. +1. **[MGP-STR](model_doc/mgp-str)** (from Alibaba Research) released with the paper [Multi-Granularity Prediction for Scene Text Recognition](https://arxiv.org/abs/2209.03592) by Peng Wang, Cheng Da, and Cong Yao. 1. **[mLUKE](model_doc/mluke)** (from Studio Ousia) released with the paper [mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models](https://arxiv.org/abs/2110.08151) by Ryokan Ri, Ikuya Yamada, and Yoshimasa Tsuruoka. 1. **[MobileBERT](model_doc/mobilebert)** (from CMU/Google Brain) released with the paper [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. 1. **[MobileNetV1](model_doc/mobilenet_v1)** (from Google Inc.) released with the paper [MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861) by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam. @@ -157,14 +177,17 @@ The documentation is organized into five sections: 1. **[NAT](model_doc/nat)** (from SHI Labs) released with the paper [Neighborhood Attention Transformer](https://arxiv.org/abs/2204.07143) by Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi. 1. **[Nezha](model_doc/nezha)** (from Huawei Noah’s Ark Lab) released with the paper [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204) by Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen and Qun Liu. 1. **[NLLB](model_doc/nllb)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team. +1. **[NLLB-MOE](model_doc/nllb-moe)** (from Meta) released with the paper [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) by the NLLB team. 1. **[Nyströmformer](model_doc/nystromformer)** (from the University of Wisconsin - Madison) released with the paper [Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention](https://arxiv.org/abs/2102.03902) by Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh. 1. **[OneFormer](model_doc/oneformer)** (from SHI Labs) released with the paper [OneFormer: One Transformer to Rule Universal Image Segmentation](https://arxiv.org/abs/2211.06220) by Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi. +1. **[OpenLlama](model_doc/open-llama)** (from [s-JoL](https://huggingface.co/s-JoL)) released on GitHub (now removed). 1. **[OPT](master/model_doc/opt)** (from Meta AI) released with the paper [OPT: Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) by Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen et al. 1. 
**[OWL-ViT](model_doc/owlvit)** (from Google AI) released with the paper [Simple Open-Vocabulary Object Detection with Vision Transformers](https://arxiv.org/abs/2205.06230) by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. 1. **[Pegasus](model_doc/pegasus)** (from Google) released with the paper [PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization](https://arxiv.org/abs/1912.08777) by Jingqing Zhang, Yao Zhao, Mohammad Saleh and Peter J. Liu. 1. **[PEGASUS-X](model_doc/pegasus_x)** (from Google) released with the paper [Investigating Efficiently Extending Transformers for Long Input Summarization](https://arxiv.org/abs/2208.04347) by Jason Phang, Yao Zhao, and Peter J. Liu. 1. **[Perceiver IO](model_doc/perceiver)** (from Deepmind) released with the paper [Perceiver IO: A General Architecture for Structured Inputs & Outputs](https://arxiv.org/abs/2107.14795) by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira. 1. **[PhoBERT](model_doc/phobert)** (from VinAI Research) released with the paper [PhoBERT: Pre-trained language models for Vietnamese](https://www.aclweb.org/anthology/2020.findings-emnlp.92/) by Dat Quoc Nguyen and Anh Tuan Nguyen. +1. **[Pix2Struct](model_doc/pix2struct)** (from Google) released with the paper [Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding](https://arxiv.org/abs/2210.03347) by Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova. 1. **[PLBart](model_doc/plbart)** (from UCLA NLP) released with the paper [Unified Pre-training for Program Understanding and Generation](https://arxiv.org/abs/2103.06333) by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. 1. **[PoolFormer](model_doc/poolformer)** (from Sea AI Labs) released with the paper [MetaFormer is Actually What You Need for Vision](https://arxiv.org/abs/2111.11418) by Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng. 1. **[ProphetNet](model_doc/prophetnet)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. @@ -179,7 +202,9 @@ The documentation is organized into five sections: 1. **[RoBERTa-PreLayerNorm](model_doc/roberta-prelayernorm)** (from Facebook) released with the paper [fairseq: A Fast, Extensible Toolkit for Sequence Modeling](https://arxiv.org/abs/1904.01038) by Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli. 1. **[RoCBert](model_doc/roc_bert)** (from WeChatAI) released with the paper [RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining](https://aclanthology.org/2022.acl-long.65.pdf) by HuiSu, WeiweiShi, XiaoyuShen, XiaoZhou, TuoJi, JiaruiFang, JieZhou. 1. 
**[RoFormer](model_doc/roformer)** (from ZhuiyiTechnology), released together with the paper [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) by Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu. +1. **[RWKV](model_doc/rwkv)** (from Bo Peng), released on [this repo](https://github.com/BlinkDL/RWKV-LM) by Bo Peng. 1. **[SegFormer](model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo. +1. **[Segment Anything](model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick. 1. **[SEW](model_doc/sew)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi. 1. **[SEW-D](model_doc/sew_d)** (from ASAPP) released with the paper [Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition](https://arxiv.org/abs/2109.06870) by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi. 1. **[SpeechT5](model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei. @@ -187,6 +212,7 @@ The documentation is organized into five sections: 1. **[SpeechToTextTransformer2](model_doc/speech_to_text_2)** (from Facebook), released together with the paper [Large-Scale Self- and Semi-Supervised Learning for Speech Translation](https://arxiv.org/abs/2104.06678) by Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau. 1. **[Splinter](model_doc/splinter)** (from Tel Aviv University), released together with the paper [Few-Shot Question Answering by Pretraining Span Selection](https://arxiv.org/abs/2101.00438) by Ori Ram, Yuval Kirstain, Jonathan Berant, Amir Globerson, Omer Levy. 1. **[SqueezeBERT](model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer. +1. **[SwiftFormer](model_doc/swiftformer)** (from MBZUAI) released with the paper [SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications](https://arxiv.org/abs/2303.15446) by Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan. 1. **[Swin Transformer](model_doc/swin)** (from Microsoft) released with the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030) by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo. 1. 
**[Swin Transformer V2](model_doc/swinv2)** (from Microsoft) released with the paper [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo. 1. **[Swin2SR](model_doc/swin2sr)** (from University of Würzburg) released with the paper [Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration](https://arxiv.org/abs/2209.11345) by Marcos V. Conde, Ui-Jin Choi, Maxime Burchi, Radu Timofte. @@ -202,6 +228,7 @@ The documentation is organized into five sections: 1. **[Transformer-XL](model_doc/transfo-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. 1. **[TrOCR](model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei. 1. **[TVLT](model_doc/tvlt)** (from UNC Chapel Hill) released with the paper [TVLT: Textless Vision-Language Transformer](https://arxiv.org/abs/2209.14156) by Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal. +1. **[TVP](model_doc/tvp)** (from Intel) released with the paper [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding. 1. **[UL2](model_doc/ul2)** (from Google Research) released with the paper [Unifying Language Learning Paradigms](https://arxiv.org/abs/2205.05131v1) by Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler 1. **[UniSpeech](model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang. 1. **[UniSpeechSat](model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu. @@ -234,19 +261,21 @@ The documentation is organized into five sections: 1. **[YOSO](model_doc/yoso)** (from the University of Wisconsin - Madison) released with the paper [You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling](https://arxiv.org/abs/2111.09714) by Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh. -### Supported frameworks +### Rangka kerja yang disokong -The table below represents the current support in the library for each of those models, whether they have a Python -tokenizer (called "slow"). A "fast" tokenizer backed by the 🤗 Tokenizers library, whether they have support in Jax (via -Flax), PyTorch, and/or TensorFlow. +Jadual di bawah mewakili sokongan semasa dalam perpustakaan untuk setiap model tersebut, sama ada model tersebut mempunyai Python +tokenizer (dipanggil ""lambat""). 
Tokenizer ""pantas"" yang disokong oleh perpustakaan Tokenizers 🤗, sama ada mereka mempunyai sokongan dalam Jax (melalui +Flax), PyTorch, dan/atau TensorFlow. - + | Model | Tokenizer slow | Tokenizer fast | PyTorch support | TensorFlow support | Flax Support | |:-----------------------------:|:--------------:|:--------------:|:---------------:|:------------------:|:------------:| | ALBERT | ✅ | ✅ | ✅ | ✅ | ✅ | +| ALIGN | ❌ | ❌ | ✅ | ❌ | ❌ | | AltCLIP | ❌ | ❌ | ✅ | ❌ | ❌ | | Audio Spectrogram Transformer | ❌ | ❌ | ✅ | ❌ | ❌ | +| Autoformer | ❌ | ❌ | ✅ | ❌ | ❌ | | BART | ✅ | ✅ | ✅ | ✅ | ✅ | | BEiT | ❌ | ❌ | ✅ | ❌ | ✅ | | BERT | ✅ | ✅ | ✅ | ✅ | ✅ | @@ -257,10 +286,11 @@ Flax), PyTorch, and/or TensorFlow. | BiT | ❌ | ❌ | ✅ | ❌ | ❌ | | Blenderbot | ✅ | ✅ | ✅ | ✅ | ✅ | | BlenderbotSmall | ✅ | ✅ | ✅ | ✅ | ✅ | -| BLIP | ❌ | ❌ | ✅ | ❌ | ❌ | +| BLIP | ❌ | ❌ | ✅ | ✅ | ❌ | | BLIP-2 | ❌ | ❌ | ✅ | ❌ | ❌ | | BLOOM | ❌ | ✅ | ✅ | ❌ | ❌ | | BridgeTower | ❌ | ❌ | ✅ | ❌ | ❌ | +| Bros | ✅ | ✅ | ✅ | ❌ | ❌ | | CamemBERT | ✅ | ✅ | ✅ | ✅ | ❌ | | CANINE | ✅ | ❌ | ✅ | ❌ | ❌ | | Chinese-CLIP | ❌ | ❌ | ✅ | ❌ | ❌ | @@ -271,6 +301,8 @@ Flax), PyTorch, and/or TensorFlow. | Conditional DETR | ❌ | ❌ | ✅ | ❌ | ❌ | | ConvBERT | ✅ | ✅ | ✅ | ✅ | ❌ | | ConvNeXT | ❌ | ❌ | ✅ | ✅ | ❌ | +| ConvNeXTV2 | ❌ | ❌ | ✅ | ❌ | ❌ | +| CPM-Ant | ✅ | ❌ | ✅ | ❌ | ❌ | | CTRL | ✅ | ❌ | ✅ | ✅ | ❌ | | CvT | ❌ | ❌ | ✅ | ✅ | ❌ | | Data2VecAudio | ❌ | ❌ | ✅ | ❌ | ❌ | @@ -288,7 +320,8 @@ Flax), PyTorch, and/or TensorFlow. | DonutSwin | ❌ | ❌ | ✅ | ❌ | ❌ | | DPR | ✅ | ✅ | ✅ | ✅ | ❌ | | DPT | ❌ | ❌ | ✅ | ❌ | ❌ | -| EfficientFormer | ❌ | ❌ | ✅ | ❌ | ❌ | +| EfficientFormer | ❌ | ❌ | ✅ | ✅ | ❌ | +| EfficientNet | ❌ | ❌ | ✅ | ❌ | ❌ | | ELECTRA | ✅ | ✅ | ✅ | ✅ | ✅ | | Encoder decoder | ❌ | ❌ | ✅ | ✅ | ✅ | | ERNIE | ❌ | ❌ | ✅ | ❌ | ❌ | @@ -298,6 +331,7 @@ Flax), PyTorch, and/or TensorFlow. | FlauBERT | ✅ | ❌ | ✅ | ✅ | ❌ | | FLAVA | ❌ | ❌ | ✅ | ❌ | ❌ | | FNet | ✅ | ✅ | ✅ | ❌ | ❌ | +| FocalNet | ❌ | ❌ | ✅ | ❌ | ❌ | | Funnel Transformer | ✅ | ✅ | ✅ | ✅ | ❌ | | GIT | ❌ | ❌ | ✅ | ❌ | ❌ | | GLPN | ❌ | ❌ | ✅ | ❌ | ❌ | @@ -306,11 +340,14 @@ Flax), PyTorch, and/or TensorFlow. | GPT NeoX Japanese | ✅ | ❌ | ✅ | ❌ | ❌ | | GPT-J | ❌ | ❌ | ✅ | ✅ | ✅ | | GPT-Sw3 | ✅ | ✅ | ✅ | ✅ | ✅ | +| GPTBigCode | ❌ | ❌ | ✅ | ❌ | ❌ | +| GPTSAN-japanese | ✅ | ❌ | ✅ | ❌ | ❌ | | Graphormer | ❌ | ❌ | ✅ | ❌ | ❌ | | GroupViT | ❌ | ❌ | ✅ | ✅ | ❌ | | Hubert | ❌ | ❌ | ✅ | ✅ | ❌ | | I-BERT | ❌ | ❌ | ✅ | ❌ | ❌ | | ImageGPT | ❌ | ❌ | ✅ | ❌ | ❌ | +| Informer | ❌ | ❌ | ✅ | ❌ | ❌ | | Jukebox | ✅ | ❌ | ✅ | ❌ | ❌ | | LayoutLM | ✅ | ✅ | ✅ | ✅ | ❌ | | LayoutLMv2 | ✅ | ✅ | ✅ | ❌ | ❌ | @@ -318,6 +355,7 @@ Flax), PyTorch, and/or TensorFlow. | LED | ✅ | ✅ | ✅ | ✅ | ❌ | | LeViT | ❌ | ❌ | ✅ | ❌ | ❌ | | LiLT | ❌ | ❌ | ✅ | ❌ | ❌ | +| LLaMA | ✅ | ✅ | ✅ | ❌ | ❌ | | Longformer | ✅ | ✅ | ✅ | ✅ | ❌ | | LongT5 | ❌ | ❌ | ✅ | ❌ | ✅ | | LUKE | ✅ | ❌ | ✅ | ❌ | ❌ | @@ -330,7 +368,9 @@ Flax), PyTorch, and/or TensorFlow. | MaskFormer | ❌ | ❌ | ✅ | ❌ | ❌ | | MaskFormerSwin | ❌ | ❌ | ❌ | ❌ | ❌ | | mBART | ✅ | ✅ | ✅ | ✅ | ✅ | +| MEGA | ❌ | ❌ | ✅ | ❌ | ❌ | | Megatron-BERT | ❌ | ❌ | ✅ | ❌ | ❌ | +| MGP-STR | ✅ | ❌ | ✅ | ❌ | ❌ | | MobileBERT | ✅ | ✅ | ✅ | ✅ | ❌ | | MobileNetV1 | ❌ | ❌ | ✅ | ❌ | ❌ | | MobileNetV2 | ❌ | ❌ | ✅ | ❌ | ❌ | @@ -340,15 +380,18 @@ Flax), PyTorch, and/or TensorFlow. 
| MVP | ✅ | ✅ | ✅ | ❌ | ❌ | | NAT | ❌ | ❌ | ✅ | ❌ | ❌ | | Nezha | ❌ | ❌ | ✅ | ❌ | ❌ | +| NLLB-MOE | ❌ | ❌ | ✅ | ❌ | ❌ | | Nyströmformer | ❌ | ❌ | ✅ | ❌ | ❌ | | OneFormer | ❌ | ❌ | ✅ | ❌ | ❌ | | OpenAI GPT | ✅ | ✅ | ✅ | ✅ | ❌ | | OpenAI GPT-2 | ✅ | ✅ | ✅ | ✅ | ✅ | +| OpenLlama | ❌ | ❌ | ✅ | ❌ | ❌ | | OPT | ❌ | ❌ | ✅ | ✅ | ✅ | | OWL-ViT | ❌ | ❌ | ✅ | ❌ | ❌ | | Pegasus | ✅ | ✅ | ✅ | ✅ | ✅ | | PEGASUS-X | ❌ | ❌ | ✅ | ❌ | ❌ | | Perceiver | ✅ | ❌ | ✅ | ❌ | ❌ | +| Pix2Struct | ❌ | ❌ | ✅ | ❌ | ❌ | | PLBart | ✅ | ❌ | ✅ | ❌ | ❌ | | PoolFormer | ❌ | ❌ | ✅ | ❌ | ❌ | | ProphetNet | ✅ | ❌ | ✅ | ❌ | ❌ | @@ -356,14 +399,16 @@ Flax), PyTorch, and/or TensorFlow. | RAG | ✅ | ❌ | ✅ | ✅ | ❌ | | REALM | ✅ | ✅ | ✅ | ❌ | ❌ | | Reformer | ✅ | ✅ | ✅ | ❌ | ❌ | -| RegNet | ❌ | ❌ | ✅ | ✅ | ❌ | +| RegNet | ❌ | ❌ | ✅ | ✅ | ✅ | | RemBERT | ✅ | ✅ | ✅ | ✅ | ❌ | -| ResNet | ❌ | ❌ | ✅ | ✅ | ❌ | +| ResNet | ❌ | ❌ | ✅ | ✅ | ✅ | | RetriBERT | ✅ | ✅ | ✅ | ❌ | ❌ | | RoBERTa | ✅ | ✅ | ✅ | ✅ | ✅ | | RoBERTa-PreLayerNorm | ❌ | ❌ | ✅ | ✅ | ✅ | | RoCBert | ✅ | ❌ | ✅ | ❌ | ❌ | | RoFormer | ✅ | ✅ | ✅ | ✅ | ✅ | +| RWKV | ❌ | ❌ | ✅ | ❌ | ❌ | +| SAM | ❌ | ❌ | ✅ | ✅ | ❌ | | SegFormer | ❌ | ❌ | ✅ | ✅ | ❌ | | SEW | ❌ | ❌ | ✅ | ❌ | ❌ | | SEW-D | ❌ | ❌ | ✅ | ❌ | ❌ | @@ -373,6 +418,7 @@ Flax), PyTorch, and/or TensorFlow. | SpeechT5 | ✅ | ❌ | ✅ | ❌ | ❌ | | Splinter | ✅ | ✅ | ✅ | ❌ | ❌ | | SqueezeBERT | ✅ | ✅ | ✅ | ❌ | ❌ | +| SwiftFormer | ❌ | ❌ | ✅ | ❌ | ❌ | | Swin Transformer | ❌ | ❌ | ✅ | ✅ | ❌ | | Swin Transformer V2 | ❌ | ❌ | ✅ | ❌ | ❌ | | Swin2SR | ❌ | ❌ | ✅ | ❌ | ❌ | @@ -386,6 +432,7 @@ Flax), PyTorch, and/or TensorFlow. | Transformer-XL | ✅ | ❌ | ✅ | ✅ | ❌ | | TrOCR | ❌ | ❌ | ✅ | ❌ | ❌ | | TVLT | ❌ | ❌ | ✅ | ❌ | ❌ | +| TVP | ❌ | ❌ | ✅ | ❌ | ❌ | | UniSpeech | ❌ | ❌ | ✅ | ❌ | ❌ | | UniSpeechSat | ❌ | ❌ | ✅ | ❌ | ❌ | | UPerNet | ❌ | ❌ | ✅ | ❌ | ❌ | @@ -393,7 +440,7 @@ Flax), PyTorch, and/or TensorFlow. | VideoMAE | ❌ | ❌ | ✅ | ❌ | ❌ | | ViLT | ❌ | ❌ | ✅ | ❌ | ❌ | | Vision Encoder decoder | ❌ | ❌ | ✅ | ✅ | ✅ | -| VisionTextDualEncoder | ❌ | ❌ | ✅ | ❌ | ✅ | +| VisionTextDualEncoder | ❌ | ❌ | ✅ | ✅ | ✅ | | VisualBERT | ❌ | ❌ | ✅ | ❌ | ❌ | | ViT | ❌ | ❌ | ✅ | ✅ | ✅ | | ViT Hybrid | ❌ | ❌ | ✅ | ❌ | ❌ | @@ -402,7 +449,7 @@ Flax), PyTorch, and/or TensorFlow. | Wav2Vec2 | ✅ | ❌ | ✅ | ✅ | ✅ | | Wav2Vec2-Conformer | ❌ | ❌ | ✅ | ❌ | ❌ | | WavLM | ❌ | ❌ | ✅ | ❌ | ❌ | -| Whisper | ✅ | ❌ | ✅ | ✅ | ❌ | +| Whisper | ✅ | ✅ | ✅ | ✅ | ✅ | | X-CLIP | ❌ | ❌ | ✅ | ❌ | ❌ | | X-MOD | ❌ | ❌ | ✅ | ❌ | ❌ | | XGLM | ✅ | ✅ | ✅ | ✅ | ✅ | @@ -414,4 +461,4 @@ Flax), PyTorch, and/or TensorFlow. | YOLOS | ❌ | ❌ | ✅ | ❌ | ❌ | | YOSO | ❌ | ❌ | ✅ | ❌ | ❌ | - + diff --git a/docs/source/pt/_config.py b/docs/source/pt/_config.py index cd76263e9a5cb2..a6d75853f57219 100644 --- a/docs/source/pt/_config.py +++ b/docs/source/pt/_config.py @@ -10,5 +10,5 @@ black_avoid_patterns = { "{processor_class}": "FakeProcessorClass", "{model_class}": "FakeModelClass", - "{object_class}": "FakeObjectClass", + "{object_class}": "FakeObjectClass", } diff --git a/docs/source/pt/accelerate.mdx b/docs/source/pt/accelerate.md similarity index 96% rename from docs/source/pt/accelerate.mdx rename to docs/source/pt/accelerate.md index 59dbd96a83b26a..a4e346a2b4873f 100644 --- a/docs/source/pt/accelerate.mdx +++ b/docs/source/pt/accelerate.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Treinamento distribuído com o 🤗 Accelerate diff --git a/docs/source/pt/converting_tensorflow_models.mdx b/docs/source/pt/converting_tensorflow_models.md similarity index 90% rename from docs/source/pt/converting_tensorflow_models.mdx rename to docs/source/pt/converting_tensorflow_models.md index db7be687c38509..190c1aec5b22bf 100644 --- a/docs/source/pt/converting_tensorflow_models.mdx +++ b/docs/source/pt/converting_tensorflow_models.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Convertendo checkpoints do TensorFlow para Pytorch @@ -96,29 +100,15 @@ transformers-cli convert --model_type gpt \ Aqui está um exemplo do processo de conversão para um modelo OpenAI GPT-2 pré-treinado (consulte [aqui](https://github.com/openai/gpt-2)) ```bash -export OPENAI_GPT2_CHECKPOINT_PATH=/path/to/gpt2/pretrained/weights +export OPENAI_GPT2_CHECKPOINT_PATH=/path/to/openai-community/gpt2/pretrained/weights -transformers-cli convert --model_type gpt2 \ +transformers-cli convert --model_type openai-community/gpt2 \ --tf_checkpoint $OPENAI_GPT2_CHECKPOINT_PATH \ --pytorch_dump_output $PYTORCH_DUMP_OUTPUT \ [--config OPENAI_GPT2_CONFIG] \ [--finetuning_task_name OPENAI_GPT2_FINETUNED_TASK] ``` -## Transformer-XL - -Aqui está um exemplo do processo de conversão para um modelo Transformer-XL pré-treinado (consulte [aqui](https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-modelos-sota)) - -```bash -export TRANSFO_XL_CHECKPOINT_FOLDER_PATH=/path/to/transfo/xl/checkpoint - -transformers-cli convert --model_type transfo_xl \ - --tf_checkpoint $TRANSFO_XL_CHECKPOINT_FOLDER_PATH \ - --pytorch_dump_output $PYTORCH_DUMP_OUTPUT \ - [--config TRANSFO_XL_CONFIG] \ - [--finetuning_task_name TRANSFO_XL_FINETUNED_TASK] -``` - ## XLNet Aqui está um exemplo do processo de conversão para um modelo XLNet pré-treinado: diff --git a/docs/source/pt/create_a_model.mdx b/docs/source/pt/create_a_model.md similarity index 93% rename from docs/source/pt/create_a_model.mdx rename to docs/source/pt/create_a_model.md index bde2b1b18770cc..dd71963236f4fa 100644 --- a/docs/source/pt/create_a_model.mdx +++ b/docs/source/pt/create_a_model.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. 
+ --> # Criar uma arquitetura customizada @@ -82,7 +86,7 @@ DistilBertConfig { Atributos de um modelo pré-treinado podem ser modificados na função [`~PretrainedConfig.from_pretrained`]: ```py ->>> my_config = DistilBertConfig.from_pretrained("distilbert-base-uncased", activation="relu", attention_dropout=0.4) +>>> my_config = DistilBertConfig.from_pretrained("distilbert/distilbert-base-uncased", activation="relu", attention_dropout=0.4) ``` Uma vez que você está satisfeito com as configurações do seu modelo, você consegue salvar elas com [`~PretrainedConfig.save_pretrained`]. Seu arquivo de configurações está salvo como um arquivo JSON no diretório especificado: @@ -105,7 +109,7 @@ Você pode também salvar seu arquivo de configurações como um dicionário ou ## Modelo -O próximo passo é criar um [model](main_classes/models). O modelo - também vagamente referido como arquitetura - define o que cada camada está fazendo e quais operações estão acontecendo. Atributos como `num_hidden_layers` das configurações são utilizados para definir a arquitetura. Todo modelo compartilha a classe base [`PreTrainedModel`] e alguns métodos em comum como redimensionar o tamanho dos embeddings de entrada e podar as 'self-attention heads'. Além disso, todos os modelos também são subclasses de [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html), [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) ou [`flax.linen.Module`](https://flax.readthedocs.io/en/latest/flax.linen.html#module). Isso significa que os modelos são compatíveis com cada respectivo uso de framework. +O próximo passo é criar um [model](main_classes/models). O modelo - também vagamente referido como arquitetura - define o que cada camada está fazendo e quais operações estão acontecendo. Atributos como `num_hidden_layers` das configurações são utilizados para definir a arquitetura. Todo modelo compartilha a classe base [`PreTrainedModel`] e alguns métodos em comum como redimensionar o tamanho dos embeddings de entrada e podar as 'self-attention heads'. Além disso, todos os modelos também são subclasses de [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html), [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) ou [`flax.linen.Module`](https://flax.readthedocs.io/en/latest/api_reference/flax.linen/module.html). Isso significa que os modelos são compatíveis com cada respectivo uso de framework. @@ -123,13 +127,13 @@ Isso cria um modelo com valores aleatórios ao invés de pré-treinar os pesos. Criar um modelo pré-treinado com [`~PreTrainedModel.from_pretrained`]: ```py ->>> model = DistilBertModel.from_pretrained("distilbert-base-uncased") +>>> model = DistilBertModel.from_pretrained("distilbert/distilbert-base-uncased") ``` Quando você carregar os pesos pré-treinados, a configuração padrão do modelo é automaticamente carregada se o modelo é provido pelo 🤗 Transformers. No entanto, você ainda consegue mudar - alguns ou todos - os atributos padrões de configuração do modelo com os seus próprio atributos, se você preferir: ```py ->>> model = DistilBertModel.from_pretrained("distilbert-base-uncased", config=my_config) +>>> model = DistilBertModel.from_pretrained("distilbert/distilbert-base-uncased", config=my_config) ``` @@ -147,13 +151,13 @@ Isso cria um modelo com valores aleatórios ao invés de pré-treinar os pesos. 
Criar um modelo pré-treinado com [`~TFPreTrainedModel.from_pretrained`]: ```py ->>> tf_model = TFDistilBertModel.from_pretrained("distilbert-base-uncased") +>>> tf_model = TFDistilBertModel.from_pretrained("distilbert/distilbert-base-uncased") ``` Quando você carregar os pesos pré-treinados, a configuração padrão do modelo é automaticamente carregada se o modelo é provido pelo 🤗 Transformers. No entanto, você ainda consegue mudar - alguns ou todos - os atributos padrões de configuração do modelo com os seus próprio atributos, se você preferir: ```py ->>> tf_model = TFDistilBertModel.from_pretrained("distilbert-base-uncased", config=my_config) +>>> tf_model = TFDistilBertModel.from_pretrained("distilbert/distilbert-base-uncased", config=my_config) ``` @@ -169,7 +173,7 @@ Por exemplo, [`DistilBertForSequenceClassification`] é um modelo DistilBERT bas ```py >>> from transformers import DistilBertForSequenceClassification ->>> model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased") +>>> model = DistilBertForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` Reutilize facilmente esse ponto de parada para outra tarefe mudando para uma head de modelo diferente. Para uma tarefe de responder questões, você usaria a head do modelo [`DistilBertForQuestionAnswering`]. A head de responder questões é similar com a de classificação de sequências exceto o fato de que ela é uma camada no topo dos estados das saídas ocultas. @@ -177,7 +181,7 @@ Reutilize facilmente esse ponto de parada para outra tarefe mudando para uma hea ```py >>> from transformers import DistilBertForQuestionAnswering ->>> model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased") +>>> model = DistilBertForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased") ``` @@ -186,7 +190,7 @@ Por exemplo, [`TFDistilBertForSequenceClassification`] é um modelo DistilBERT b ```py >>> from transformers import TFDistilBertForSequenceClassification ->>> tf_model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased") +>>> tf_model = TFDistilBertForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") ``` Reutilize facilmente esse ponto de parada para outra tarefe mudando para uma head de modelo diferente. Para uma tarefe de responder questões, você usaria a head do modelo [`TFDistilBertForQuestionAnswering`]. A head de responder questões é similar com a de classificação de sequências exceto o fato de que ela é uma camada no topo dos estados das saídas ocultas. @@ -194,7 +198,7 @@ Reutilize facilmente esse ponto de parada para outra tarefe mudando para uma hea ```py >>> from transformers import TFDistilBertForQuestionAnswering ->>> tf_model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased") +>>> tf_model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased") ```
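As a purely illustrative sketch (an assumption about usage, not something the guide above spells out), the question-answering head loaded above could be exercised like this in PyTorch; because `distilbert/distilbert-base-uncased` ships no fine-tuned QA head, the predicted span is arbitrary until the model is fine-tuned:

```py
>>> import torch
>>> from transformers import AutoTokenizer, DistilBertForQuestionAnswering

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
>>> model = DistilBertForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased")

>>> # Encode the question together with its context, then pick the most likely answer span.
>>> # The QA head is freshly initialized here, so the span is meaningless before fine-tuning.
>>> inputs = tokenizer("Who maintains the library?", "The library is maintained by Hugging Face.", return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)
>>> start = int(outputs.start_logits.argmax())
>>> end = int(outputs.end_logits.argmax())
>>> answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])
```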
@@ -227,7 +231,7 @@ Se você treinou seu prórpio tokenizer, você pode criar um a partir do seu arq ```py >>> from transformers import DistilBertTokenizer ->>> slow_tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased") +>>> slow_tokenizer = DistilBertTokenizer.from_pretrained("distilbert/distilbert-base-uncased") ``` Criando um 'fast tokenizer' com a classe [`DistilBertTokenizerFast`]: @@ -235,7 +239,7 @@ Criando um 'fast tokenizer' com a classe [`DistilBertTokenizerFast`]: ```py >>> from transformers import DistilBertTokenizerFast ->>> fast_tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased") +>>> fast_tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert/distilbert-base-uncased") ``` diff --git a/docs/source/pt/custom_models.mdx b/docs/source/pt/custom_models.md similarity index 98% rename from docs/source/pt/custom_models.mdx rename to docs/source/pt/custom_models.md index 59484dcc35eb7b..70c56913a38356 100644 --- a/docs/source/pt/custom_models.mdx +++ b/docs/source/pt/custom_models.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Compartilhando modelos customizados diff --git a/docs/source/pt/fast_tokenizers.mdx b/docs/source/pt/fast_tokenizers.md similarity index 94% rename from docs/source/pt/fast_tokenizers.mdx rename to docs/source/pt/fast_tokenizers.md index aff9afb31f2bb3..ea1da8a61571f1 100644 --- a/docs/source/pt/fast_tokenizers.mdx +++ b/docs/source/pt/fast_tokenizers.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Usando os Tokenizers do 🤗 Tokenizers diff --git a/docs/source/pt/index.mdx b/docs/source/pt/index.md similarity index 97% rename from docs/source/pt/index.mdx rename to docs/source/pt/index.md index ff841d91f4952e..18dbcbc06b8048 100644 --- a/docs/source/pt/index.mdx +++ b/docs/source/pt/index.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. 
+ --> # 🤗 Transformers @@ -34,7 +38,7 @@ Cada arquitetura 🤗 Transformers é definida em um módulo individual do Pytho ## Se você estiver procurando suporte do time da Hugging Face, acesse - HuggingFace Expert Acceleration Program + HuggingFace Expert Acceleration Program ## Conteúdo @@ -76,6 +80,7 @@ Atualmente a biblioteca contém implementações do PyTorch, TensorFlow e JAX, p 1. **[CamemBERT](model_doc/camembert)** (from Inria/Facebook/Sorbonne) released with the paper [CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894) by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot. 1. **[CANINE](model_doc/canine)** (from Google Research) released with the paper [CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation](https://arxiv.org/abs/2103.06874) by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. 1. **[ConvNeXT](model_doc/convnext)** (from Facebook AI) released with the paper [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie. +1. **[ConvNeXTV2](model_doc/convnextv2)** (from Facebook AI) released with the paper [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie. 1. **[CLIP](model_doc/clip)** (from OpenAI) released with the paper [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020) by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. 1. **[ConvBERT](model_doc/convbert)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan. 1. **[CPM](model_doc/cpm)** (from Tsinghua University) released with the paper [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun. @@ -91,16 +96,18 @@ Atualmente a biblioteca contém implementações do PyTorch, TensorFlow e JAX, p 1. **[DistilBERT](model_doc/distilbert)** (from HuggingFace), released together with the paper [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108) by Victor Sanh, Lysandre Debut and Thomas Wolf. The same method has been applied to compress GPT2 into [DistilGPT2](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), RoBERTa into [DistilRoBERTa](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation), Multilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation) and a German version of DistilBERT. 1. 
**[DPR](model_doc/dpr)** (from Facebook) released with the paper [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906) by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 1. **[DPT](master/model_doc/dpt)** (from Intel Labs) released with the paper [Vision Transformers for Dense Prediction](https://arxiv.org/abs/2103.13413) by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun. +1. **[EfficientNet](model_doc/efficientnet)** (from Google Research) released with the paper [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) by Mingxing Tan and Quoc V. Le. 1. **[EncoderDecoder](model_doc/encoder-decoder)** (from Google Research) released with the paper [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461) by Sascha Rothe, Shashi Narayan, Aliaksei Severyn. 1. **[ELECTRA](model_doc/electra)** (from Google Research/Stanford University) released with the paper [ELECTRA: Pre-training text encoders as discriminators rather than generators](https://arxiv.org/abs/2003.10555) by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning. 1. **[FlauBERT](model_doc/flaubert)** (from CNRS) released with the paper [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) by Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab. 1. **[FNet](model_doc/fnet)** (from Google Research) released with the paper [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) by James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. 1. **[Funnel Transformer](model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le. 1. **[GLPN](model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim. -1. **[GPT](model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. -1. **[GPT-2](model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**. +1. **[GPT](model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://openai.com/research/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. +1. **[GPT-2](model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://openai.com/research/better-language-models/) by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. 1. 
**[GPT-J](model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki. 1. **[GPT Neo](model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. +1. **[GPTSAN-japanese](model_doc/gptsan-japanese)** released in the repository [tanreinama/GPTSAN](https://github.com/tanreinama/GPTSAN/blob/main/report/model.md) by Toshiyuki Sakamoto(tanreinama). 1. **[Hubert](model_doc/hubert)** (from Facebook) released with the paper [HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units](https://arxiv.org/abs/2106.07447) by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed. 1. **[I-BERT](model_doc/ibert)** (from Berkeley) released with the paper [I-BERT: Integer-only BERT Quantization](https://arxiv.org/abs/2101.01321) by Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer. 1. **[ImageGPT](model_doc/imagegpt)** (from OpenAI) released with the paper [Generative Pretraining from Pixels](https://openai.com/blog/image-gpt/) by Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever. @@ -247,9 +254,9 @@ disso, são diferenciados pelo suporte em diferentes frameworks: JAX (por meio d | RAG | ✅ | ❌ | ✅ | ✅ | ❌ | | Realm | ✅ | ✅ | ✅ | ❌ | ❌ | | Reformer | ✅ | ✅ | ✅ | ❌ | ❌ | -| RegNet | ❌ | ❌ | ✅ | ❌ | ❌ | +| RegNet | ❌ | ❌ | ✅ | ✅ | ✅ | | RemBERT | ✅ | ✅ | ✅ | ✅ | ❌ | -| ResNet | ❌ | ❌ | ✅ | ❌ | ❌ | +| ResNet | ❌ | ❌ | ✅ | ❌ | ✅ | | RetriBERT | ✅ | ✅ | ✅ | ❌ | ❌ | | RoBERTa | ✅ | ✅ | ✅ | ✅ | ✅ | | RoFormer | ✅ | ✅ | ✅ | ✅ | ✅ | diff --git a/docs/source/pt/installation.mdx b/docs/source/pt/installation.md similarity index 96% rename from docs/source/pt/installation.mdx rename to docs/source/pt/installation.md index 2325cc74afe2d9..7eeefd883d6ec3 100644 --- a/docs/source/pt/installation.mdx +++ b/docs/source/pt/installation.md @@ -12,6 +12,10 @@ distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Guia de Instalação @@ -140,10 +144,10 @@ O ambiente de Python que foi criado para a instalação do 🤗 Transformers enc ## Instalação usando o Conda -É possível instalar o 🤗 Transformers a partir do canal conda `huggingface` com o seguinte comando: +É possível instalar o 🤗 Transformers a partir do canal conda `conda-forge` com o seguinte comando: ```bash -conda install -c huggingface transformers +conda install conda-forge::transformers ``` ## Configuração do Cachê @@ -162,7 +166,7 @@ No Windows, este diretório pré-definido é dado por `C:\Users\username\.cache\ O 🤗 Transformers usará as variáveis de ambiente do shell `PYTORCH_TRANSFORMERS_CACHE` ou `PYTORCH_PRETRAINED_BERT_CACHE` se estiver vindo de uma versão anterior da biblioteca que tenha configurado essas variáveis de ambiente, a menos que você especifique a variável de ambiente do shell `TRANSFORMERS_CACHE`. 
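As a small, hypothetical aside (not something the installation guide itself shows), the `TRANSFORMERS_CACHE` variable mentioned above can also be set from Python, provided it happens before `transformers` is imported:

```py
import os

# Assumed example path; set the variable before importing transformers,
# since the default cache location is resolved at import time.
os.environ["TRANSFORMERS_CACHE"] = "/path/to/my/cache"

from transformers import AutoModel

# Weights are now downloaded into /path/to/my/cache instead of the default cache directory.
model = AutoModel.from_pretrained("distilbert/distilbert-base-uncased")
```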
- + @@ -181,14 +185,14 @@ Você pode adicionar o [🤗 Datasets](https://huggingface.co/docs/datasets/) ao Segue um exemplo de execução do programa numa rede padrão com firewall para instâncias externas, usando o seguinte comando: ```bash -python examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --dataset_name wmt16 --dataset_config ro-en ... +python examples/pytorch/translation/run_translation.py --model_name_or_path google-t5/t5-small --dataset_name wmt16 --dataset_config ro-en ... ``` Execute esse mesmo programa numa instância offline com o seguinte comando: ```bash HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \ -python examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --dataset_name wmt16 --dataset_config ro-en ... +python examples/pytorch/translation/run_translation.py --model_name_or_path google-t5/t5-small --dataset_name wmt16 --dataset_config ro-en ... ``` O script agora deve ser executado sem travar ou expirar, pois procurará apenas por arquivos locais. diff --git a/docs/source/pt/multilingual.mdx b/docs/source/pt/multilingual.md similarity index 80% rename from docs/source/pt/multilingual.mdx rename to docs/source/pt/multilingual.md index 4db9b54dab34fe..5515c6a922a701 100644 --- a/docs/source/pt/multilingual.mdx +++ b/docs/source/pt/multilingual.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Modelos multilinguísticos para inferência @@ -16,7 +20,7 @@ specific language governing permissions and limitations under the License. Existem vários modelos multilinguísticos no 🤗 Transformers e seus usos para inferência diferem dos modelos monolíngues. No entanto, nem *todos* os usos dos modelos multilíngues são tão diferentes. -Alguns modelos, como o [bert-base-multilingual-uncased](https://huggingface.co/bert-base-multilingual-uncased), +Alguns modelos, como o [google-bert/bert-base-multilingual-uncased](https://huggingface.co/google-bert/bert-base-multilingual-uncased), podem ser usados como se fossem monolíngues. Este guia irá te ajudar a usar modelos multilíngues cujo uso difere para o propósito de inferência. @@ -30,25 +34,25 @@ checkpoints que usam de language embeddings e os que não. Os seguintes modelos de XLM usam language embeddings para especificar a linguagem utilizada para a inferência. 
-- `xlm-mlm-ende-1024` (Masked language modeling, English-German) -- `xlm-mlm-enfr-1024` (Masked language modeling, English-French) -- `xlm-mlm-enro-1024` (Masked language modeling, English-Romanian) -- `xlm-mlm-xnli15-1024` (Masked language modeling, XNLI languages) -- `xlm-mlm-tlm-xnli15-1024` (Masked language modeling + translation, XNLI languages) -- `xlm-clm-enfr-1024` (Causal language modeling, English-French) -- `xlm-clm-ende-1024` (Causal language modeling, English-German) +- `FacebookAI/xlm-mlm-ende-1024` (Masked language modeling, English-German) +- `FacebookAI/xlm-mlm-enfr-1024` (Masked language modeling, English-French) +- `FacebookAI/xlm-mlm-enro-1024` (Masked language modeling, English-Romanian) +- `FacebookAI/xlm-mlm-xnli15-1024` (Masked language modeling, XNLI languages) +- `FacebookAI/xlm-mlm-tlm-xnli15-1024` (Masked language modeling + translation, XNLI languages) +- `FacebookAI/xlm-clm-enfr-1024` (Causal language modeling, English-French) +- `FacebookAI/xlm-clm-ende-1024` (Causal language modeling, English-German) Os language embeddings são representados por um tensor de mesma dimensão que os `input_ids` passados ao modelo. Os valores destes tensores dependem do idioma utilizado e se identificam pelos atributos `lang2id` e `id2lang` do tokenizador. -Neste exemplo, carregamos o checkpoint `xlm-clm-enfr-1024`(Causal language modeling, English-French): +Neste exemplo, carregamos o checkpoint `FacebookAI/xlm-clm-enfr-1024`(Causal language modeling, English-French): ```py >>> import torch >>> from transformers import XLMTokenizer, XLMWithLMHeadModel ->>> tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024") ->>> model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024") +>>> tokenizer = XLMTokenizer.from_pretrained("FacebookAI/xlm-clm-enfr-1024") +>>> model = XLMWithLMHeadModel.from_pretrained("FacebookAI/xlm-clm-enfr-1024") ``` O atributo `lang2id` do tokenizador mostra os idiomas deste modelo e seus ids: @@ -88,8 +92,8 @@ O script [run_generation.py](https://github.com/huggingface/transformers/tree/ma Os seguintes modelos XLM não requerem o uso de language embeddings durante a inferência: -- `xlm-mlm-17-1280` (Modelagem de linguagem com máscara, 17 idiomas) -- `xlm-mlm-100-1280` (Modelagem de linguagem com máscara, 100 idiomas) +- `FacebookAI/xlm-mlm-17-1280` (Modelagem de linguagem com máscara, 17 idiomas) +- `FacebookAI/xlm-mlm-100-1280` (Modelagem de linguagem com máscara, 100 idiomas) Estes modelos são utilizados para representações genéricas de frase diferentemente dos checkpoints XLM anteriores. @@ -97,8 +101,8 @@ Estes modelos são utilizados para representações genéricas de frase diferent Os seguintes modelos do BERT podem ser utilizados para tarefas multilinguísticas: -- `bert-base-multilingual-uncased` (Modelagem de linguagem com máscara + Previsão de frases, 102 idiomas) -- `bert-base-multilingual-cased` (Modelagem de linguagem com máscara + Previsão de frases, 104 idiomas) +- `google-bert/bert-base-multilingual-uncased` (Modelagem de linguagem com máscara + Previsão de frases, 102 idiomas) +- `google-bert/bert-base-multilingual-cased` (Modelagem de linguagem com máscara + Previsão de frases, 104 idiomas) Estes modelos não requerem language embeddings durante a inferência. Devem identificar a linguagem a partir do contexto e realizar a inferência em sequência. @@ -107,8 +111,8 @@ do contexto e realizar a inferência em sequência. 
Os seguintes modelos do XLM-RoBERTa podem ser utilizados para tarefas multilinguísticas: -- `xlm-roberta-base` (Modelagem de linguagem com máscara, 100 idiomas) -- `xlm-roberta-large` Modelagem de linguagem com máscara, 100 idiomas) +- `FacebookAI/xlm-roberta-base` (Modelagem de linguagem com máscara, 100 idiomas) +- `FacebookAI/xlm-roberta-large` Modelagem de linguagem com máscara, 100 idiomas) O XLM-RoBERTa foi treinado com 2,5 TB de dados do CommonCrawl recém-criados e testados em 100 idiomas. Proporciona fortes vantagens sobre os modelos multilinguísticos publicados anteriormente como o mBERT e o XLM em tarefas diff --git a/docs/source/pt/pipeline_tutorial.mdx b/docs/source/pt/pipeline_tutorial.md similarity index 93% rename from docs/source/pt/pipeline_tutorial.mdx rename to docs/source/pt/pipeline_tutorial.md index 2991bcecde4f89..9c0cb3567e72e3 100644 --- a/docs/source/pt/pipeline_tutorial.mdx +++ b/docs/source/pt/pipeline_tutorial.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Pipelines para inferência @@ -75,14 +79,14 @@ Por exemplo, se quiser gerar mais de uma saída, defina-a no parâmetro `num_ret O [`pipeline`] aceita qualquer modelo do [Model Hub](https://huggingface.co/models). Há rótulos adicionais no Model Hub que te permitem filtrar pelo modelo que gostaria de usar para sua tarefa. Uma vez que tiver escolhido o modelo apropriado, -carregue-o com as classes `AutoModelFor` e [`AutoTokenizer'] correspondentes. Por exemplo, carregue a classe [`AutoModelForCausalLM`] +carregue-o com as classes `AutoModelFor` e [`AutoTokenizer`] correspondentes. Por exemplo, carregue a classe [`AutoModelForCausalLM`] para uma tarefa de modelagem de linguagem causal: ```py >>> from transformers import AutoTokenizer, AutoModelForCausalLM ->>> tokenizer = AutoTokenizer.from_pretrained("distilgpt2") ->>> model = AutoModelForCausalLM.from_pretrained("distilgpt2") +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2") +>>> model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2") ``` Crie uma [`pipeline`] para a sua tarefa e especifíque o modelo e o tokenizador que foram carregados: @@ -105,7 +109,7 @@ Passe seu texto de entrada ao [`pipeline`] para gerar algum texto: A flexibilidade do [`pipeline`] significa que também pode-se extender às tarefas de áudio. La flexibilidad de [`pipeline`] significa que también se puede extender a tareas de audio. -Por exemplo, classifiquemos a emoção de um breve fragmento do famoso discurso de John F. Kennedy /home/rzimmerdev/dev/transformers/docs/source/pt/pipeline_tutorial.mdx +Por exemplo, classifiquemos a emoção de um breve fragmento do famoso discurso de John F. 
Kennedy /home/rzimmerdev/dev/transformers/docs/source/pt/pipeline_tutorial.md Encontre um modelo de [audio classification](https://huggingface.co/models?pipeline_tag=audio-classification) para reconhecimento de emoções no Model Hub e carregue-o usando o [`pipeline`]: diff --git a/docs/source/pt/quicktour.mdx b/docs/source/pt/quicktour.md similarity index 93% rename from docs/source/pt/quicktour.mdx rename to docs/source/pt/quicktour.md index 3c00a64b6652e8..d34480ee23a880 100644 --- a/docs/source/pt/quicktour.mdx +++ b/docs/source/pt/quicktour.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Tour rápido @@ -83,7 +87,7 @@ Importe [`pipeline`] e especifique a tarefa que deseja completar: >>> classifier = pipeline("sentiment-analysis") ``` -A pipeline baixa and armazena um [modelo pré-treinado](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) padrão e tokenizer para análise sentimental. Agora você pode usar `classifier` no texto alvo: +A pipeline baixa and armazena um [modelo pré-treinado](https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english) padrão e tokenizer para análise sentimental. Agora você pode usar `classifier` no texto alvo: ```py >>> classifier("We are very happy to show you the 🤗 Transformers library.") @@ -115,7 +119,7 @@ Crie uma [`pipeline`] com a tarefa que deseja resolver e o modelo que deseja usa >>> speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h") ``` -A seguir, carregue uma base de dados (confira a 🤗 [Iniciação em Datasets](https://huggingface.co/docs/datasets/quickstart.html) para mais detalhes) que você gostaria de iterar sobre. Por exemplo, vamos carregar o dataset [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14): +A seguir, carregue uma base de dados (confira a 🤗 [Iniciação em Datasets](https://huggingface.co/docs/datasets/quickstart) para mais detalhes) que você gostaria de iterar sobre. Por exemplo, vamos carregar o dataset [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14): ```py >>> from datasets import load_dataset, Audio @@ -224,6 +228,7 @@ Assim como o [`pipeline`], o tokenizer aceitará uma lista de entradas. Além di + ```py >>> pt_batch = tokenizer( ... ["We are very happy to show you the 🤗 transformers library.", "We hope you don't hate it."], @@ -235,6 +240,7 @@ Assim como o [`pipeline`], o tokenizer aceitará uma lista de entradas. Além di ``` + ```py >>> tf_batch = tokenizer( ... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."], @@ -325,7 +331,7 @@ Todos os modelos de 🤗 Transformers (PyTorch ou TensorFlow) geram tensores *an
-Os modelos são um standard [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) ou um [`tf.keras.Model`](https: //www.tensorflow.org/api_docs/python/tf/keras/Model) para que você possa usá-los em seu loop de treinamento habitual. No entanto, para facilitar as coisas, 🤗 Transformers fornece uma classe [`Trainer`] para PyTorch que adiciona funcionalidade para treinamento distribuído, precisão mista e muito mais. Para o TensorFlow, você pode usar o método `fit` de [Keras](https://keras.io/). Consulte o [tutorial de treinamento](./training) para obter mais detalhes. +Os modelos são um standard [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) ou um [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) para que você possa usá-los em seu loop de treinamento habitual. No entanto, para facilitar as coisas, 🤗 Transformers fornece uma classe [`Trainer`] para PyTorch que adiciona funcionalidade para treinamento distribuído, precisão mista e muito mais. Para o TensorFlow, você pode usar o método `fit` de [Keras](https://keras.io/). Consulte o [tutorial de treinamento](./training) para obter mais detalhes. @@ -373,6 +379,7 @@ Um recurso particularmente interessante dos 🤗 Transformers é a capacidade de + ```py >>> from transformers import AutoModel @@ -381,6 +388,7 @@ Um recurso particularmente interessante dos 🤗 Transformers é a capacidade de ``` + ```py >>> from transformers import TFAutoModel diff --git a/docs/source/pt/run_scripts.mdx b/docs/source/pt/run_scripts.md similarity index 93% rename from docs/source/pt/run_scripts.mdx rename to docs/source/pt/run_scripts.md index e91c4fc87d2d42..a64ad72f1dbc61 100644 --- a/docs/source/pt/run_scripts.mdx +++ b/docs/source/pt/run_scripts.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Treinamento a partir de um script @@ -84,11 +88,11 @@ pip install -r requirements.txt -O script de exemplo baixa e pré-processa um conjunto de dados da biblioteca 🤗 [Datasets](https://huggingface.co/docs/datasets/). Em seguida, o script ajusta um conjunto de dados com o [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) em uma arquitetura que oferece suporte à sumarização. O exemplo a seguir mostra como ajustar [T5-small](https://huggingface.co/t5-small) no conjunto de dados [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail). O modelo T5 requer um argumento `source_prefix` adicional devido à forma como foi treinado. Este prompt informa ao T5 que esta é uma tarefa de sumarização. +O script de exemplo baixa e pré-processa um conjunto de dados da biblioteca 🤗 [Datasets](https://huggingface.co/docs/datasets/). Em seguida, o script ajusta um conjunto de dados com o [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) em uma arquitetura que oferece suporte à sumarização. O exemplo a seguir mostra como ajustar [T5-small](https://huggingface.co/google-t5/t5-small) no conjunto de dados [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail). 
O modelo T5 requer um argumento `source_prefix` adicional devido à forma como foi treinado. Este prompt informa ao T5 que esta é uma tarefa de sumarização. ```bash python examples/pytorch/summarization/run_summarization.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ @@ -102,11 +106,11 @@ python examples/pytorch/summarization/run_summarization.py \ ``` -Este outro script de exemplo baixa e pré-processa um conjunto de dados da biblioteca 🤗 [Datasets](https://huggingface.co/docs/datasets/). Em seguida, o script ajusta um conjunto de dados usando Keras em uma arquitetura que oferece suporte à sumarização. O exemplo a seguir mostra como ajustar [T5-small](https://huggingface.co/t5-small) no conjunto de dados [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail). O modelo T5 requer um argumento `source_prefix` adicional devido à forma como foi treinado. Este prompt informa ao T5 que esta é uma tarefa de sumarização. +Este outro script de exemplo baixa e pré-processa um conjunto de dados da biblioteca 🤗 [Datasets](https://huggingface.co/docs/datasets/). Em seguida, o script ajusta um conjunto de dados usando Keras em uma arquitetura que oferece suporte à sumarização. O exemplo a seguir mostra como ajustar [T5-small](https://huggingface.co/google-t5/t5-small) no conjunto de dados [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail). O modelo T5 requer um argumento `source_prefix` adicional devido à forma como foi treinado. Este prompt informa ao T5 que esta é uma tarefa de sumarização. ```bash python examples/tensorflow/summarization/run_summarization.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --dataset_name cnn_dailymail \ --dataset_config "3.0.0" \ --output_dir /tmp/tst-summarization \ @@ -127,10 +131,10 @@ O [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) ofere - Defina o número de GPUs a serem usadas com o argumento `nproc_per_node`. 
```bash -python -m torch.distributed.launch \ +torchrun \ --nproc_per_node 8 pytorch/summarization/run_summarization.py \ --fp16 \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ @@ -154,7 +158,7 @@ As Unidades de Processamento de Tensor (TPUs) são projetadas especificamente pa ```bash python xla_spawn.py --num_cores 8 \ summarization/run_summarization.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ @@ -174,7 +178,7 @@ As Unidades de Processamento de Tensor (TPUs) são projetadas especificamente pa ```bash python run_summarization.py \ --tpu name_of_tpu_resource \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --dataset_name cnn_dailymail \ --dataset_config "3.0.0" \ --output_dir /tmp/tst-summarization \ @@ -213,7 +217,7 @@ Agora você está pronto para iniciar o treinamento: ```bash accelerate launch run_summarization_no_trainer.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --dataset_name cnn_dailymail \ --dataset_config "3.0.0" \ --source_prefix "summarize: " \ @@ -232,7 +236,7 @@ Um script para sumarização usando um conjunto de dados customizado ficaria ass ```bash python examples/pytorch/summarization/run_summarization.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --train_file path_to_csv_or_jsonlines_file \ @@ -257,7 +261,7 @@ Geralmente, é uma boa ideia executar seu script em um número menor de exemplos ```bash python examples/pytorch/summarization/run_summarization.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --max_train_samples 50 \ --max_eval_samples 50 \ --max_predict_samples 50 \ @@ -287,7 +291,7 @@ O primeiro método usa o argumento `output_dir previous_output_dir` para retomar ```bash python examples/pytorch/summarization/run_summarization.py - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ @@ -304,7 +308,7 @@ O segundo método usa o argumento `resume_from_checkpoint path_to_specific_check ```bash python examples/pytorch/summarization/run_summarization.py - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ @@ -334,7 +338,7 @@ O exemplo a seguir mostra como fazer upload de um modelo com um nome de reposit ```bash python examples/pytorch/summarization/run_summarization.py - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ diff --git a/docs/source/pt/serialization.mdx b/docs/source/pt/serialization.md similarity index 94% rename from docs/source/pt/serialization.mdx rename to docs/source/pt/serialization.md index 2a01640be467fb..9e390f07bde41d 100644 --- a/docs/source/pt/serialization.mdx +++ b/docs/source/pt/serialization.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
+ +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Exportando modelos para ONNX @@ -60,6 +64,7 @@ As configurações prontas incluem as seguintes arquiteturas: - Conditional DETR - ConvBERT - ConvNeXT +- ConvNeXTV2 - Data2VecText - Data2VecVision - DeBERTa @@ -141,7 +146,7 @@ optional arguments: A exportação de um checkpoint usando uma configuração pronta pode ser feita da seguinte forma: ```bash -python -m transformers.onnx --model=distilbert-base-uncased onnx/ +python -m transformers.onnx --model=distilbert/distilbert-base-uncased onnx/ ``` Você deve ver os seguintes logs: @@ -156,7 +161,7 @@ All good, model saved at: onnx/model.onnx ``` Isso exporta um grafo ONNX do ponto de verificação definido pelo argumento `--model`. Nisso -Por exemplo, é `distilbert-base-uncased`, mas pode ser qualquer checkpoint no Hugging +Por exemplo, é `distilbert/distilbert-base-uncased`, mas pode ser qualquer checkpoint no Hugging Face Hub ou um armazenado localmente. O arquivo `model.onnx` resultante pode ser executado em um dos [muitos @@ -168,7 +173,7 @@ Tempo de execução](https://onnxruntime.ai/) da seguinte forma: >>> from transformers import AutoTokenizer >>> from onnxruntime import InferenceSession ->>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") >>> session = InferenceSession("onnx/model.onnx") >>> # ONNX Runtime expects NumPy arrays as input >>> inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np") @@ -202,8 +207,8 @@ arquivos tokenizer armazenados em um diretório. Por exemplo, podemos carregar e >>> from transformers import AutoTokenizer, AutoModelForSequenceClassification >>> # Load tokenizer and PyTorch weights form the Hub ->>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") ->>> pt_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") +>>> pt_model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") >>> # Save to disk >>> tokenizer.save_pretrained("local-pt-checkpoint") >>> pt_model.save_pretrained("local-pt-checkpoint") @@ -220,8 +225,8 @@ python -m transformers.onnx --model=local-pt-checkpoint onnx/ >>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification >>> # Load tokenizer and TensorFlow weights from the Hub ->>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") ->>> tf_model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") +>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") >>> # Save to disk >>> tokenizer.save_pretrained("local-tf-checkpoint") >>> tf_model.save_pretrained("local-tf-checkpoint") @@ -266,7 +271,7 @@ pacote `transformers.onnx`. 
Por exemplo, para exportar um modelo de classificaç escolher um modelo ajustado no Hub e executar: ```bash -python -m transformers.onnx --model=distilbert-base-uncased-finetuned-sst-2-english \ +python -m transformers.onnx --model=distilbert/distilbert-base-uncased-finetuned-sst-2-english \ --feature=sequence-classification onnx/ ``` @@ -282,7 +287,7 @@ All good, model saved at: onnx/model.onnx ``` Observe que, neste caso, os nomes de saída do modelo ajustado são `logits` -em vez do `last_hidden_state` que vimos com o checkpoint `distilbert-base-uncased` +em vez do `last_hidden_state` que vimos com o checkpoint `distilbert/distilbert-base-uncased` mais cedo. Isso é esperado, pois o modelo ajustado (fine-tuned) possui uma cabeça de classificação de sequência. @@ -374,7 +379,7 @@ configuração do modelo base da seguinte forma: ```python >>> from transformers import AutoConfig ->>> config = AutoConfig.from_pretrained("distilbert-base-uncased") +>>> config = AutoConfig.from_pretrained("distilbert/distilbert-base-uncased") >>> onnx_config = DistilBertOnnxConfig(config) ``` @@ -405,7 +410,7 @@ de classificação, poderíamos usar: ```python >>> from transformers import AutoConfig ->>> config = AutoConfig.from_pretrained("distilbert-base-uncased") +>>> config = AutoConfig.from_pretrained("distilbert/distilbert-base-uncased") >>> onnx_config_for_seq_clf = DistilBertOnnxConfig(config, task="sequence-classification") >>> print(onnx_config_for_seq_clf.outputs) OrderedDict([('logits', {0: 'batch'})]) @@ -432,7 +437,7 @@ e o caminho para salvar o arquivo exportado: >>> from transformers import AutoTokenizer, AutoModel >>> onnx_path = Path("model.onnx") ->>> model_ckpt = "distilbert-base-uncased" +>>> model_ckpt = "distilbert/distilbert-base-uncased" >>> base_model = AutoModel.from_pretrained(model_ckpt) >>> tokenizer = AutoTokenizer.from_pretrained(model_ckpt) diff --git a/docs/source/pt/tasks/sequence_classification.mdx b/docs/source/pt/tasks/sequence_classification.md similarity index 88% rename from docs/source/pt/tasks/sequence_classification.mdx rename to docs/source/pt/tasks/sequence_classification.md index 7c443e700d4edd..e7776894f874cb 100644 --- a/docs/source/pt/tasks/sequence_classification.mdx +++ b/docs/source/pt/tasks/sequence_classification.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Classificação de texto @@ -16,7 +20,7 @@ specific language governing permissions and limitations under the License. A classificação de texto é uma tarefa comum de NLP que atribui um rótulo ou classe a um texto. Existem muitas aplicações práticas de classificação de texto amplamente utilizadas em produção por algumas das maiores empresas da atualidade. Uma das formas mais populares de classificação de texto é a análise de sentimento, que atribui um rótulo como positivo, negativo ou neutro a um texto. 
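For a quick feel of the task before any fine-tuning, here is a minimal, illustrative sketch of sentiment analysis at inference time, assuming an already fine-tuned checkpoint such as `distilbert/distilbert-base-uncased-finetuned-sst-2-english`; the rest of this guide trains its own model on IMDb instead:

```py
>>> from transformers import pipeline

>>> # Assumed checkpoint, used only to illustrate the positive/negative labels described above.
>>> classifier = pipeline("sentiment-analysis", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
>>> classifier("This movie was absolutely wonderful!")  # returns a POSITIVE or NEGATIVE label with a score
```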
-Este guia mostrará como realizar o fine-tuning do [DistilBERT](https://huggingface.co/distilbert-base-uncased) no conjunto de dados [IMDb](https://huggingface.co/datasets/imdb) para determinar se a crítica de filme é positiva ou negativa. +Este guia mostrará como realizar o fine-tuning do [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) no conjunto de dados [IMDb](https://huggingface.co/datasets/imdb) para determinar se a crítica de filme é positiva ou negativa. @@ -56,7 +60,7 @@ Carregue o tokenizador do DistilBERT para processar o campo `text`: ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") ``` Crie uma função de pré-processamento para tokenizar o campo `text` e truncar as sequências para que não sejam maiores que o comprimento máximo de entrada do DistilBERT: @@ -66,7 +70,7 @@ Crie uma função de pré-processamento para tokenizar o campo `text` e truncar ... return tokenizer(examples["text"], truncation=True) ``` -Use a função [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) do 🤗 Datasets para aplicar a função de pré-processamento em todo o conjunto de dados. Você pode acelerar a função `map` definindo `batched=True` para processar vários elementos do conjunto de dados de uma só vez: +Use a função [`map`](https://huggingface.co/docs/datasets/process#map) do 🤗 Datasets para aplicar a função de pré-processamento em todo o conjunto de dados. Você pode acelerar a função `map` definindo `batched=True` para processar vários elementos do conjunto de dados de uma só vez: ```py tokenized_imdb = imdb.map(preprocess_function, batched=True) @@ -100,7 +104,7 @@ Carregue o DistilBERT com [`AutoModelForSequenceClassification`] junto com o nú ```py >>> from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer ->>> model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2) +>>> model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=2) ``` @@ -144,7 +148,7 @@ O [`Trainer`] aplicará o preenchimento dinâmico por padrão quando você defin -Para executar o fine-tuning de um modelo no TensorFlow, comece convertendo seu conjunto de dados para o formato `tf.data.Dataset` com [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Nessa execução você deverá especificar as entradas e rótulos (no parâmetro `columns`), se deseja embaralhar o conjunto de dados, o tamanho do batch e o data collator: +Para executar o fine-tuning de um modelo no TensorFlow, comece convertendo seu conjunto de dados para o formato `tf.data.Dataset` com [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.to_tf_dataset). 
Nessa execução você deverá especificar as entradas e rótulos (no parâmetro `columns`), se deseja embaralhar o conjunto de dados, o tamanho do batch e o data collator: ```py >>> tf_train_set = tokenized_imdb["train"].to_tf_dataset( @@ -186,7 +190,7 @@ Carregue o DistilBERT com [`TFAutoModelForSequenceClassification`] junto com o n ```py >>> from transformers import TFAutoModelForSequenceClassification ->>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2) +>>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=2) ``` Configure o modelo para treinamento com o método [`compile`](https://keras.io/api/models/model_training_apis/#compile-method): diff --git a/docs/source/pt/tasks/token_classification.mdx b/docs/source/pt/tasks/token_classification.md similarity index 90% rename from docs/source/pt/tasks/token_classification.mdx rename to docs/source/pt/tasks/token_classification.md index 780080a60dd325..3465680dcc2046 100644 --- a/docs/source/pt/tasks/token_classification.mdx +++ b/docs/source/pt/tasks/token_classification.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Classificação de tokens @@ -16,7 +20,7 @@ specific language governing permissions and limitations under the License. A classificação de tokens atribui um rótulo a tokens individuais em uma frase. Uma das tarefas de classificação de tokens mais comuns é o Reconhecimento de Entidade Nomeada, também chamada de NER (sigla em inglês para Named Entity Recognition). O NER tenta encontrar um rótulo para cada entidade em uma frase, como uma pessoa, local ou organização. -Este guia mostrará como realizar o fine-tuning do [DistilBERT](https://huggingface.co/distilbert-base-uncased) no conjunto de dados [WNUT 17](https://huggingface.co/datasets/wnut_17) para detectar novas entidades. +Este guia mostrará como realizar o fine-tuning do [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) no conjunto de dados [WNUT 17](https://huggingface.co/datasets/wnut_17) para detectar novas entidades. @@ -81,7 +85,7 @@ Carregue o tokenizer do DistilBERT para processar os `tokens`: ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") ``` Como a entrada já foi dividida em palavras, defina `is_split_into_words=True` para tokenizar as palavras em subpalavras: @@ -124,7 +128,7 @@ Aqui está como você pode criar uma função para realinhar os tokens e rótulo ... return tokenized_inputs ``` -Use a função [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) do 🤗 Datasets para tokenizar e alinhar os rótulos em todo o conjunto de dados. 
Você pode acelerar a função `map` configurando `batched=True` para processar vários elementos do conjunto de dados de uma só vez: +Use a função [`map`](https://huggingface.co/docs/datasets/process#map) do 🤗 Datasets para tokenizar e alinhar os rótulos em todo o conjunto de dados. Você pode acelerar a função `map` configurando `batched=True` para processar vários elementos do conjunto de dados de uma só vez: ```py >>> tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True) @@ -158,7 +162,7 @@ Carregue o DistilBERT com o [`AutoModelForTokenClassification`] junto com o núm ```py >>> from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer ->>> model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=14) +>>> model = AutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=14) ``` @@ -197,7 +201,7 @@ Nesse ponto, restam apenas três passos: ``` -Para executar o fine-tuning de um modelo no TensorFlow, comece convertendo seu conjunto de dados para o formato `tf.data.Dataset` com [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Nessa execução você deverá especificar as entradas e rótulos (no parâmetro `columns`), se deseja embaralhar o conjunto de dados, o tamanho do batch e o data collator: +Para executar o fine-tuning de um modelo no TensorFlow, comece convertendo seu conjunto de dados para o formato `tf.data.Dataset` com [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.to_tf_dataset). Nessa execução você deverá especificar as entradas e rótulos (no parâmetro `columns`), se deseja embaralhar o conjunto de dados, o tamanho do batch e o data collator: ```py >>> tf_train_set = tokenized_wnut["train"].to_tf_dataset( @@ -242,7 +246,7 @@ Carregue o DistilBERT com o [`TFAutoModelForTokenClassification`] junto com o n ```py >>> from transformers import TFAutoModelForTokenClassification ->>> model = TFAutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=2) +>>> model = TFAutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=2) ``` Configure o modelo para treinamento com o método [`compile`](https://keras.io/api/models/model_training_apis/#compile-method): diff --git a/docs/source/pt/training.mdx b/docs/source/pt/training.md similarity index 95% rename from docs/source/pt/training.mdx rename to docs/source/pt/training.md index bf59c14528f9b1..49f57dead24233 100644 --- a/docs/source/pt/training.mdx +++ b/docs/source/pt/training.md @@ -8,6 +8,10 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # Fine-tuning de um modelo pré-treinado @@ -48,13 +52,13 @@ Comece carregando o dataset [Yelp Reviews](https://huggingface.co/datasets/yelp_ Como já sabe, é necessário ter um tokenizador para processar o texto e incluir uma estratégia de padding e truncamento, para manejar qualquer tamanho varíavel de sequência. 
Para processar o seu dataset em apenas um passo, utilize o método de -🤗 Datasets [`map`](https://huggingface.co/docs/datasets/process.html#map) para aplicar uma função de preprocessamento sobre +🤗 Datasets [`map`](https://huggingface.co/docs/datasets/process#map) para aplicar uma função de preprocessamento sobre todo o dataset. ```py >>> from transformers import AutoTokenizer ->>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") >>> def tokenize_function(examples): @@ -89,7 +93,7 @@ sabemos ter 5 labels usamos o seguinte código: ```py >>> from transformers import AutoModelForSequenceClassification ->>> model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5) +>>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) ``` @@ -122,7 +126,7 @@ Especifique onde salvar os checkpoints do treinamento: O [`Trainer`] não avalia automaticamente o rendimento do modelo durante o treinamento. Será necessário passar ao [`Trainer`] uma função para calcular e fazer um diagnóstico sobre as métricas. A biblioteca 🤗 Datasets proporciona uma função de [`accuracy`](https://huggingface.co/metrics/accuracy) simples que pode ser carregada com a função -`load_metric` (ver este [tutorial](https://huggingface.co/docs/datasets/metrics.html) para mais informações): +`load_metric` (ver este [tutorial](https://huggingface.co/docs/datasets/metrics) para mais informações): ```py >>> import numpy as np @@ -199,13 +203,13 @@ Assegure-se de especificar os `return_tensors` para retornar os tensores do Tens Em seguida, converta os datasets tokenizados em datasets do TensorFlow com o método -[`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). +[`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.to_tf_dataset). Especifique suas entradas em `columns` e seu rótulo em `label_cols`: ```py >>> tf_train_dataset = small_train_dataset.to_tf_dataset( ... columns=["attention_mask", "input_ids", "token_type_ids"], -... label_cols=["labels"], +... label_cols="labels", ... shuffle=True, ... collate_fn=data_collator, ... batch_size=8, @@ -213,7 +217,7 @@ Especifique suas entradas em `columns` e seu rótulo em `label_cols`: >>> tf_validation_dataset = small_eval_dataset.to_tf_dataset( ... columns=["attention_mask", "input_ids", "token_type_ids"], -... label_cols=["labels"], +... label_cols="labels", ... shuffle=False, ... collate_fn=data_collator, ... 
batch_size=8, @@ -228,7 +232,7 @@ Carregue um modelo do TensorFlow com o número esperado de rótulos: >>> import tensorflow as tf >>> from transformers import TFAutoModelForSequenceClassification ->>> model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5) +>>> model = TFAutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) ``` A seguir, compile e ajuste o fine-tuning a seu modelo com [`fit`](https://keras.io/api/models/model_training_apis/) como @@ -307,7 +311,7 @@ Carregue seu modelo com o número de labels esperados: ```py >>> from transformers import AutoModelForSequenceClassification ->>> model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5) +>>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) ``` ### Otimização e configuração do Learning Rate @@ -381,7 +385,7 @@ uma barra de progresso sobre o número de passos percorridos no treinamento atua Da mesma forma que é necessário adicionar uma função de avaliação ao [`Trainer`], é necessário fazer o mesmo quando escrevendo o próprio ciclo de treinamento. Contudo, em vez de calcular e retornar a métrica final de cada época, -você deverá adicionar todos os batches com [`add_batch`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=add_batch#datasets.Metric.add_batch) +você deverá adicionar todos os batches com [`add_batch`](https://huggingface.co/docs/datasets/package_reference/main_classes?highlight=add_batch#datasets.Metric.add_batch) e calcular a métrica apenas no final. ```py diff --git a/docs/source/te/_toctree.yml b/docs/source/te/_toctree.yml new file mode 100644 index 00000000000000..5e6b45eb472f8d --- /dev/null +++ b/docs/source/te/_toctree.yml @@ -0,0 +1,6 @@ +- sections: + - local: index + title: 🤗 Transformers + - local: quicktour + title: త్వరిత పర్యటన + title: ప్రారంభించడానికి diff --git a/docs/source/te/index.md b/docs/source/te/index.md new file mode 100644 index 00000000000000..3e23f8f5eb1392 --- /dev/null +++ b/docs/source/te/index.md @@ -0,0 +1,298 @@ + + +[పైటోర్చ్](https://pytorch.org/), [టెన్సర్‌ఫ్లో](https://www.tensorflow.org/), మరియు [జాక్స్](https://jax.readthedocs.io/en/latest/) కోసం స్థితి-కలాన యంత్ర అభ్యాసం. + +🤗 ట్రాన్స్ఫార్మర్స్ అభివృద్ధిస్తున్నది API మరియు ఉపకరణాలు, పూర్వ-చేతన మోడల్లను సులభంగా డౌన్లోడ్ మరియు శిక్షణ చేయడానికి అవసరమైన సమయం, వనరులు, మరియు వస్తువులను నుంచి మోడల్ను శీర్షికం నుంచి ప్రశిక్షించడం వరకు దేవాయనం చేస్తుంది. ఈ మోడల్లు విభిన్న మోడాలిటీలలో సాధారణ పనులకు మద్దతు చేస్తాయి, వంటివి: + +📝 **ప్రాకృతిక భాష ప్రక్రియ**: వచన వర్గీకరణ, పేరుల యొక్క యెంటిటీ గుర్తువు, ప్రశ్న సంవాద, భాషా రచన, సంక్షేపణ, అనువాదం, అనేక ప్రకారాలు, మరియు వచన సృష్టి.
+🖼️ **కంప్యూటర్ విజన్**: చిత్రం వర్గీకరణ, వస్తువుల గుర్తింపు, మరియు విభజన.<br>
+🗣️ **ఆడియో**: స్వయంచాలక ప్రసంగ గుర్తింపు, ఆడియో వర్గీకరణ.<br>
+🐙 **బహుమూలిక**: టేబుల్ ప్రశ్న సమాధానం, ఆప్టికల్ క్యారెక్టర్ రికగ్నిషన్, స్కాన్ చేసిన డాక్యుమెంట్ల నుండి సమాచారం వెలికితీత, వీడియో వర్గీకరణ, మరియు దృశ్య ప్రశ్న సమాధానం.
+
+🤗 ట్రాన్స్‌ఫార్మర్స్ PyTorch, TensorFlow మరియు JAX మధ్య ఫ్రేమ్‌వర్క్ పరస్పర అనుకూలతకు మద్దతు ఇస్తుంది. దీని వల్ల మోడల్ జీవిత చక్రంలోని ప్రతి దశలో వేరే ఫ్రేమ్‌వర్క్‌ను ఉపయోగించే సౌలభ్యం లభిస్తుంది; ఒక ఫ్రేమ్‌వర్క్‌లో మూడు లైన్ల కోడ్‌తో మోడల్‌కు శిక్షణ ఇచ్చి, మరో ఫ్రేమ్‌వర్క్‌లో ఇన్‌ఫరెన్స్ కోసం లోడ్ చేయవచ్చు. మోడల్లను ప్రొడక్షన్ వాతావరణాలలో వాడుకోవడానికి ONNX మరియు TorchScript వంటి ఫార్మాట్లకు కూడా ఎగుమతి చేయవచ్చు.
+
+[హబ్](https://huggingface.co/models), [ఫోరం](https://discuss.huggingface.co/), లేదా [డిస్కార్డ్](https://discord.com/invite/JfAtkvEtRb) లో పెరుగుతున్న ఈ సముదాయంలో చేరండి!
+
+## మీరు హగ్గింగ్ ఫేస్ టీమ్ నుండి అనుకూల మద్దతు కోసం చూస్తున్నట్లయితే
+
+HuggingFace Expert Acceleration Program
+
+## విషయాలు
+
+డాక్యుమెంటేషన్ ఐదు విభాగాలుగా నిర్వహించబడింది:
+
+- **ప్రారంభించండి** లైబ్రరీ యొక్క శీఘ్ర పర్యటన మరియు రన్నింగ్ కోసం ఇన్‌స్టాలేషన్ సూచనలను అందిస్తుంది.
+- **ట్యుటోరియల్స్** మీరు అనుభవశూన్యుడు అయితే ప్రారంభించడానికి గొప్ప ప్రదేశం. మీరు లైబ్రరీని ఉపయోగించడం ప్రారంభించడానికి అవసరమైన ప్రాథమిక నైపుణ్యాలను పొందడానికి ఈ విభాగం మీకు సహాయం చేస్తుంది.
+- **హౌ-టు-గైడ్‌లు** లాంగ్వేజ్ మోడలింగ్ కోసం ప్రిట్రైన్డ్ మోడల్‌ని ఫైన్‌ట్యూన్ చేయడం లేదా కస్టమ్ మోడల్‌ను ఎలా వ్రాయాలి మరియు షేర్ చేయాలి వంటి నిర్దిష్ట లక్ష్యాన్ని ఎలా సాధించాలో మీకు చూపుతాయి.
+- **కాన్సెప్చువల్ గైడ్స్** మోడల్‌లు, టాస్క్‌లు మరియు 🤗 ట్రాన్స్‌ఫార్మర్ల డిజైన్ ఫిలాసఫీ వెనుక ఉన్న అంతర్లీన భావనలు మరియు ఆలోచనల గురించి మరింత చర్చ మరియు వివరణను అందిస్తుంది.
+- **API** అన్ని తరగతులు మరియు విధులను వివరిస్తుంది:
+
+  - **ప్రధాన తరగతులు** కాన్ఫిగరేషన్, మోడల్, టోకెనైజర్ మరియు పైప్‌లైన్ వంటి అత్యంత ముఖ్యమైన తరగతులను వివరిస్తుంది.
+  - **మోడల్స్** లైబ్రరీలో అమలు చేయబడిన ప్రతి మోడల్‌కు సంబంధించిన తరగతులు మరియు విధులను వివరిస్తుంది.
+  - **అంతర్గత సహాయకులు** అంతర్గతంగా ఉపయోగించే యుటిలిటీ క్లాస్‌లు మరియు ఫంక్షన్‌ల వివరాలు.
+
+## మద్దతు ఉన్న నమూనాలు మరియు ఫ్రేమ్‌వర్క్‌లు
+
+దిగువ పట్టిక ఆ ప్రతి మోడల్‌కు లైబ్రరీలో ప్రస్తుతం ఉన్న మద్దతును సూచిస్తుంది: ప్రతి మోడల్‌కు పైథాన్ టోకెనైజర్ ("స్లో" అని పిలుస్తారు) ఉందా, అలాగే Jax (Flax ద్వారా), PyTorch మరియు/లేదా TensorFlow లో మద్దతు ఉందా అనేది ఇది చూపుతుంది.
+ + + +| Model | PyTorch support | TensorFlow support | Flax Support | +|:------------------------------------------------------------------------:|:---------------:|:------------------:|:------------:| +| [ALBERT](model_doc/albert) | ✅ | ✅ | ✅ | +| [ALIGN](model_doc/align) | ✅ | ❌ | ❌ | +| [AltCLIP](model_doc/altclip) | ✅ | ❌ | ❌ | +| [Audio Spectrogram Transformer](model_doc/audio-spectrogram-transformer) | ✅ | ❌ | ❌ | +| [Autoformer](model_doc/autoformer) | ✅ | ❌ | ❌ | +| [Bark](model_doc/bark) | ✅ | ❌ | ❌ | +| [BART](model_doc/bart) | ✅ | ✅ | ✅ | +| [BARThez](model_doc/barthez) | ✅ | ✅ | ✅ | +| [BARTpho](model_doc/bartpho) | ✅ | ✅ | ✅ | +| [BEiT](model_doc/beit) | ✅ | ❌ | ✅ | +| [BERT](model_doc/bert) | ✅ | ✅ | ✅ | +| [Bert Generation](model_doc/bert-generation) | ✅ | ❌ | ❌ | +| [BertJapanese](model_doc/bert-japanese) | ✅ | ✅ | ✅ | +| [BERTweet](model_doc/bertweet) | ✅ | ✅ | ✅ | +| [BigBird](model_doc/big_bird) | ✅ | ❌ | ✅ | +| [BigBird-Pegasus](model_doc/bigbird_pegasus) | ✅ | ❌ | ❌ | +| [BioGpt](model_doc/biogpt) | ✅ | ❌ | ❌ | +| [BiT](model_doc/bit) | ✅ | ❌ | ❌ | +| [Blenderbot](model_doc/blenderbot) | ✅ | ✅ | ✅ | +| [BlenderbotSmall](model_doc/blenderbot-small) | ✅ | ✅ | ✅ | +| [BLIP](model_doc/blip) | ✅ | ✅ | ❌ | +| [BLIP-2](model_doc/blip-2) | ✅ | ❌ | ❌ | +| [BLOOM](model_doc/bloom) | ✅ | ❌ | ✅ | +| [BORT](model_doc/bort) | ✅ | ✅ | ✅ | +| [BridgeTower](model_doc/bridgetower) | ✅ | ❌ | ❌ | +| [BROS](model_doc/bros) | ✅ | ❌ | ❌ | +| [ByT5](model_doc/byt5) | ✅ | ✅ | ✅ | +| [CamemBERT](model_doc/camembert) | ✅ | ✅ | ❌ | +| [CANINE](model_doc/canine) | ✅ | ❌ | ❌ | +| [Chinese-CLIP](model_doc/chinese_clip) | ✅ | ❌ | ❌ | +| [CLAP](model_doc/clap) | ✅ | ❌ | ❌ | +| [CLIP](model_doc/clip) | ✅ | ✅ | ✅ | +| [CLIPSeg](model_doc/clipseg) | ✅ | ❌ | ❌ | +| [CodeGen](model_doc/codegen) | ✅ | ❌ | ❌ | +| [CodeLlama](model_doc/code_llama) | ✅ | ❌ | ❌ | +| [Conditional DETR](model_doc/conditional_detr) | ✅ | ❌ | ❌ | +| [ConvBERT](model_doc/convbert) | ✅ | ✅ | ❌ | +| [ConvNeXT](model_doc/convnext) | ✅ | ✅ | ❌ | +| [ConvNeXTV2](model_doc/convnextv2) | ✅ | ❌ | ❌ | +| [CPM](model_doc/cpm) | ✅ | ✅ | ✅ | +| [CPM-Ant](model_doc/cpmant) | ✅ | ❌ | ❌ | +| [CTRL](model_doc/ctrl) | ✅ | ✅ | ❌ | +| [CvT](model_doc/cvt) | ✅ | ✅ | ❌ | +| [Data2VecAudio](model_doc/data2vec) | ✅ | ❌ | ❌ | +| [Data2VecText](model_doc/data2vec) | ✅ | ❌ | ❌ | +| [Data2VecVision](model_doc/data2vec) | ✅ | ✅ | ❌ | +| [DeBERTa](model_doc/deberta) | ✅ | ✅ | ❌ | +| [DeBERTa-v2](model_doc/deberta-v2) | ✅ | ✅ | ❌ | +| [Decision Transformer](model_doc/decision_transformer) | ✅ | ❌ | ❌ | +| [Deformable DETR](model_doc/deformable_detr) | ✅ | ❌ | ❌ | +| [DeiT](model_doc/deit) | ✅ | ✅ | ❌ | +| [DePlot](model_doc/deplot) | ✅ | ❌ | ❌ | +| [DETA](model_doc/deta) | ✅ | ❌ | ❌ | +| [DETR](model_doc/detr) | ✅ | ❌ | ❌ | +| [DialoGPT](model_doc/dialogpt) | ✅ | ✅ | ✅ | +| [DiNAT](model_doc/dinat) | ✅ | ❌ | ❌ | +| [DINOv2](model_doc/dinov2) | ✅ | ❌ | ❌ | +| [DistilBERT](model_doc/distilbert) | ✅ | ✅ | ✅ | +| [DiT](model_doc/dit) | ✅ | ❌ | ✅ | +| [DonutSwin](model_doc/donut) | ✅ | ❌ | ❌ | +| [DPR](model_doc/dpr) | ✅ | ✅ | ❌ | +| [DPT](model_doc/dpt) | ✅ | ❌ | ❌ | +| [EfficientFormer](model_doc/efficientformer) | ✅ | ✅ | ❌ | +| [EfficientNet](model_doc/efficientnet) | ✅ | ❌ | ❌ | +| [ELECTRA](model_doc/electra) | ✅ | ✅ | ✅ | +| [EnCodec](model_doc/encodec) | ✅ | ❌ | ❌ | +| [Encoder decoder](model_doc/encoder-decoder) | ✅ | ✅ | ✅ | +| [ERNIE](model_doc/ernie) | ✅ | ❌ | ❌ | +| [ErnieM](model_doc/ernie_m) | ✅ | ❌ | ❌ | +| [ESM](model_doc/esm) | ✅ | ✅ | ❌ | +| 
[FairSeq Machine-Translation](model_doc/fsmt) | ✅ | ❌ | ❌ | +| [Falcon](model_doc/falcon) | ✅ | ❌ | ❌ | +| [FLAN-T5](model_doc/flan-t5) | ✅ | ✅ | ✅ | +| [FLAN-UL2](model_doc/flan-ul2) | ✅ | ✅ | ✅ | +| [FlauBERT](model_doc/flaubert) | ✅ | ✅ | ❌ | +| [FLAVA](model_doc/flava) | ✅ | ❌ | ❌ | +| [FNet](model_doc/fnet) | ✅ | ❌ | ❌ | +| [FocalNet](model_doc/focalnet) | ✅ | ❌ | ❌ | +| [Funnel Transformer](model_doc/funnel) | ✅ | ✅ | ❌ | +| [GIT](model_doc/git) | ✅ | ❌ | ❌ | +| [GLPN](model_doc/glpn) | ✅ | ❌ | ❌ | +| [GPT Neo](model_doc/gpt_neo) | ✅ | ❌ | ✅ | +| [GPT NeoX](model_doc/gpt_neox) | ✅ | ❌ | ❌ | +| [GPT NeoX Japanese](model_doc/gpt_neox_japanese) | ✅ | ❌ | ❌ | +| [GPT-J](model_doc/gptj) | ✅ | ✅ | ✅ | +| [GPT-Sw3](model_doc/gpt-sw3) | ✅ | ✅ | ✅ | +| [GPTBigCode](model_doc/gpt_bigcode) | ✅ | ❌ | ❌ | +| [GPTSAN-japanese](model_doc/gptsan-japanese) | ✅ | ❌ | ❌ | +| [Graphormer](model_doc/graphormer) | ✅ | ❌ | ❌ | +| [GroupViT](model_doc/groupvit) | ✅ | ✅ | ❌ | +| [HerBERT](model_doc/herbert) | ✅ | ✅ | ✅ | +| [Hubert](model_doc/hubert) | ✅ | ✅ | ❌ | +| [I-BERT](model_doc/ibert) | ✅ | ❌ | ❌ | +| [IDEFICS](model_doc/idefics) | ✅ | ❌ | ❌ | +| [ImageGPT](model_doc/imagegpt) | ✅ | ❌ | ❌ | +| [Informer](model_doc/informer) | ✅ | ❌ | ❌ | +| [InstructBLIP](model_doc/instructblip) | ✅ | ❌ | ❌ | +| [Jukebox](model_doc/jukebox) | ✅ | ❌ | ❌ | +| [LayoutLM](model_doc/layoutlm) | ✅ | ✅ | ❌ | +| [LayoutLMv2](model_doc/layoutlmv2) | ✅ | ❌ | ❌ | +| [LayoutLMv3](model_doc/layoutlmv3) | ✅ | ✅ | ❌ | +| [LayoutXLM](model_doc/layoutxlm) | ✅ | ❌ | ❌ | +| [LED](model_doc/led) | ✅ | ✅ | ❌ | +| [LeViT](model_doc/levit) | ✅ | ❌ | ❌ | +| [LiLT](model_doc/lilt) | ✅ | ❌ | ❌ | +| [LLaMA](model_doc/llama) | ✅ | ❌ | ❌ | +| [Llama2](model_doc/llama2) | ✅ | ❌ | ❌ | +| [Longformer](model_doc/longformer) | ✅ | ✅ | ❌ | +| [LongT5](model_doc/longt5) | ✅ | ❌ | ✅ | +| [LUKE](model_doc/luke) | ✅ | ❌ | ❌ | +| [LXMERT](model_doc/lxmert) | ✅ | ✅ | ❌ | +| [M-CTC-T](model_doc/mctct) | ✅ | ❌ | ❌ | +| [M2M100](model_doc/m2m_100) | ✅ | ❌ | ❌ | +| [Marian](model_doc/marian) | ✅ | ✅ | ✅ | +| [MarkupLM](model_doc/markuplm) | ✅ | ❌ | ❌ | +| [Mask2Former](model_doc/mask2former) | ✅ | ❌ | ❌ | +| [MaskFormer](model_doc/maskformer) | ✅ | ❌ | ❌ | +| [MatCha](model_doc/matcha) | ✅ | ❌ | ❌ | +| [mBART](model_doc/mbart) | ✅ | ✅ | ✅ | +| [mBART-50](model_doc/mbart50) | ✅ | ✅ | ✅ | +| [MEGA](model_doc/mega) | ✅ | ❌ | ❌ | +| [Megatron-BERT](model_doc/megatron-bert) | ✅ | ❌ | ❌ | +| [Megatron-GPT2](model_doc/megatron_gpt2) | ✅ | ✅ | ✅ | +| [MGP-STR](model_doc/mgp-str) | ✅ | ❌ | ❌ | +| [Mistral](model_doc/mistral) | ✅ | ❌ | ❌ | +| [mLUKE](model_doc/mluke) | ✅ | ❌ | ❌ | +| [MMS](model_doc/mms) | ✅ | ✅ | ✅ | +| [MobileBERT](model_doc/mobilebert) | ✅ | ✅ | ❌ | +| [MobileNetV1](model_doc/mobilenet_v1) | ✅ | ❌ | ❌ | +| [MobileNetV2](model_doc/mobilenet_v2) | ✅ | ❌ | ❌ | +| [MobileViT](model_doc/mobilevit) | ✅ | ✅ | ❌ | +| [MobileViTV2](model_doc/mobilevitv2) | ✅ | ❌ | ❌ | +| [MPNet](model_doc/mpnet) | ✅ | ✅ | ❌ | +| [MPT](model_doc/mpt) | ✅ | ❌ | ❌ | +| [MRA](model_doc/mra) | ✅ | ❌ | ❌ | +| [MT5](model_doc/mt5) | ✅ | ✅ | ✅ | +| [MusicGen](model_doc/musicgen) | ✅ | ❌ | ❌ | +| [MVP](model_doc/mvp) | ✅ | ❌ | ❌ | +| [NAT](model_doc/nat) | ✅ | ❌ | ❌ | +| [Nezha](model_doc/nezha) | ✅ | ❌ | ❌ | +| [NLLB](model_doc/nllb) | ✅ | ❌ | ❌ | +| [NLLB-MOE](model_doc/nllb-moe) | ✅ | ❌ | ❌ | +| [Nougat](model_doc/nougat) | ✅ | ✅ | ✅ | +| [Nyströmformer](model_doc/nystromformer) | ✅ | ❌ | ❌ | +| [OneFormer](model_doc/oneformer) | ✅ | ❌ | ❌ | +| [OpenAI GPT](model_doc/openai-gpt) | ✅ 
| ✅ | ❌ | +| [OpenAI GPT-2](model_doc/gpt2) | ✅ | ✅ | ✅ | +| [OpenLlama](model_doc/open-llama) | ✅ | ❌ | ❌ | +| [OPT](model_doc/opt) | ✅ | ✅ | ✅ | +| [OWL-ViT](model_doc/owlvit) | ✅ | ❌ | ❌ | +| [Pegasus](model_doc/pegasus) | ✅ | ✅ | ✅ | +| [PEGASUS-X](model_doc/pegasus_x) | ✅ | ❌ | ❌ | +| [Perceiver](model_doc/perceiver) | ✅ | ❌ | ❌ | +| [Persimmon](model_doc/persimmon) | ✅ | ❌ | ❌ | +| [PhoBERT](model_doc/phobert) | ✅ | ✅ | ✅ | +| [Pix2Struct](model_doc/pix2struct) | ✅ | ❌ | ❌ | +| [PLBart](model_doc/plbart) | ✅ | ❌ | ❌ | +| [PoolFormer](model_doc/poolformer) | ✅ | ❌ | ❌ | +| [Pop2Piano](model_doc/pop2piano) | ✅ | ❌ | ❌ | +| [ProphetNet](model_doc/prophetnet) | ✅ | ❌ | ❌ | +| [PVT](model_doc/pvt) | ✅ | ❌ | ❌ | +| [QDQBert](model_doc/qdqbert) | ✅ | ❌ | ❌ | +| [RAG](model_doc/rag) | ✅ | ✅ | ❌ | +| [REALM](model_doc/realm) | ✅ | ❌ | ❌ | +| [Reformer](model_doc/reformer) | ✅ | ❌ | ❌ | +| [RegNet](model_doc/regnet) | ✅ | ✅ | ✅ | +| [RemBERT](model_doc/rembert) | ✅ | ✅ | ❌ | +| [ResNet](model_doc/resnet) | ✅ | ✅ | ✅ | +| [RetriBERT](model_doc/retribert) | ✅ | ❌ | ❌ | +| [RoBERTa](model_doc/roberta) | ✅ | ✅ | ✅ | +| [RoBERTa-PreLayerNorm](model_doc/roberta-prelayernorm) | ✅ | ✅ | ✅ | +| [RoCBert](model_doc/roc_bert) | ✅ | ❌ | ❌ | +| [RoFormer](model_doc/roformer) | ✅ | ✅ | ✅ | +| [RWKV](model_doc/rwkv) | ✅ | ❌ | ❌ | +| [SAM](model_doc/sam) | ✅ | ✅ | ❌ | +| [SegFormer](model_doc/segformer) | ✅ | ✅ | ❌ | +| [SEW](model_doc/sew) | ✅ | ❌ | ❌ | +| [SEW-D](model_doc/sew-d) | ✅ | ❌ | ❌ | +| [Speech Encoder decoder](model_doc/speech-encoder-decoder) | ✅ | ❌ | ✅ | +| [Speech2Text](model_doc/speech_to_text) | ✅ | ✅ | ❌ | +| [SpeechT5](model_doc/speecht5) | ✅ | ❌ | ❌ | +| [Splinter](model_doc/splinter) | ✅ | ❌ | ❌ | +| [SqueezeBERT](model_doc/squeezebert) | ✅ | ❌ | ❌ | +| [SwiftFormer](model_doc/swiftformer) | ✅ | ❌ | ❌ | +| [Swin Transformer](model_doc/swin) | ✅ | ✅ | ❌ | +| [Swin Transformer V2](model_doc/swinv2) | ✅ | ❌ | ❌ | +| [Swin2SR](model_doc/swin2sr) | ✅ | ❌ | ❌ | +| [SwitchTransformers](model_doc/switch_transformers) | ✅ | ❌ | ❌ | +| [T5](model_doc/t5) | ✅ | ✅ | ✅ | +| [T5v1.1](model_doc/t5v1.1) | ✅ | ✅ | ✅ | +| [Table Transformer](model_doc/table-transformer) | ✅ | ❌ | ❌ | +| [TAPAS](model_doc/tapas) | ✅ | ✅ | ❌ | +| [TAPEX](model_doc/tapex) | ✅ | ✅ | ✅ | +| [Time Series Transformer](model_doc/time_series_transformer) | ✅ | ❌ | ❌ | +| [TimeSformer](model_doc/timesformer) | ✅ | ❌ | ❌ | +| [Trajectory Transformer](model_doc/trajectory_transformer) | ✅ | ❌ | ❌ | +| [Transformer-XL](model_doc/transfo-xl) | ✅ | ✅ | ❌ | +| [TrOCR](model_doc/trocr) | ✅ | ❌ | ❌ | +| [TVLT](model_doc/tvlt) | ✅ | ❌ | ❌ | +| [UL2](model_doc/ul2) | ✅ | ✅ | ✅ | +| [UMT5](model_doc/umt5) | ✅ | ❌ | ❌ | +| [UniSpeech](model_doc/unispeech) | ✅ | ❌ | ❌ | +| [UniSpeechSat](model_doc/unispeech-sat) | ✅ | ❌ | ❌ | +| [UPerNet](model_doc/upernet) | ✅ | ❌ | ❌ | +| [VAN](model_doc/van) | ✅ | ❌ | ❌ | +| [VideoMAE](model_doc/videomae) | ✅ | ❌ | ❌ | +| [ViLT](model_doc/vilt) | ✅ | ❌ | ❌ | +| [Vision Encoder decoder](model_doc/vision-encoder-decoder) | ✅ | ✅ | ✅ | +| [VisionTextDualEncoder](model_doc/vision-text-dual-encoder) | ✅ | ✅ | ✅ | +| [VisualBERT](model_doc/visual_bert) | ✅ | ❌ | ❌ | +| [ViT](model_doc/vit) | ✅ | ✅ | ✅ | +| [ViT Hybrid](model_doc/vit_hybrid) | ✅ | ❌ | ❌ | +| [VitDet](model_doc/vitdet) | ✅ | ❌ | ❌ | +| [ViTMAE](model_doc/vit_mae) | ✅ | ✅ | ❌ | +| [ViTMatte](model_doc/vitmatte) | ✅ | ❌ | ❌ | +| [ViTMSN](model_doc/vit_msn) | ✅ | ❌ | ❌ | +| [VITS](model_doc/vits) | ✅ | ❌ | ❌ | +| [ViViT](model_doc/vivit) | ✅ | ❌ | ❌ | 
+| [Wav2Vec2](model_doc/wav2vec2) | ✅ | ✅ | ✅ | +| [Wav2Vec2-Conformer](model_doc/wav2vec2-conformer) | ✅ | ❌ | ❌ | +| [Wav2Vec2Phoneme](model_doc/wav2vec2_phoneme) | ✅ | ✅ | ✅ | +| [WavLM](model_doc/wavlm) | ✅ | ❌ | ❌ | +| [Whisper](model_doc/whisper) | ✅ | ✅ | ✅ | +| [X-CLIP](model_doc/xclip) | ✅ | ❌ | ❌ | +| [X-MOD](model_doc/xmod) | ✅ | ❌ | ❌ | +| [XGLM](model_doc/xglm) | ✅ | ✅ | ✅ | +| [XLM](model_doc/xlm) | ✅ | ✅ | ❌ | +| [XLM-ProphetNet](model_doc/xlm-prophetnet) | ✅ | ❌ | ❌ | +| [XLM-RoBERTa](model_doc/xlm-roberta) | ✅ | ✅ | ✅ | +| [XLM-RoBERTa-XL](model_doc/xlm-roberta-xl) | ✅ | ❌ | ❌ | +| [XLM-V](model_doc/xlm-v) | ✅ | ✅ | ✅ | +| [XLNet](model_doc/xlnet) | ✅ | ✅ | ❌ | +| [XLS-R](model_doc/xls_r) | ✅ | ✅ | ✅ | +| [XLSR-Wav2Vec2](model_doc/xlsr_wav2vec2) | ✅ | ✅ | ✅ | +| [YOLOS](model_doc/yolos) | ✅ | ❌ | ❌ | +| [YOSO](model_doc/yoso) | ✅ | ❌ | ❌ | + + diff --git a/docs/source/te/quicktour.md b/docs/source/te/quicktour.md new file mode 100644 index 00000000000000..75efa841128605 --- /dev/null +++ b/docs/source/te/quicktour.md @@ -0,0 +1,557 @@ + + +# శీఘ్ర పర్యటన + +[[ఓపెన్-ఇన్-కోలాబ్]] + +🤗 ట్రాన్స్‌ఫార్మర్‌లతో లేచి పరుగెత్తండి! మీరు డెవలపర్ అయినా లేదా రోజువారీ వినియోగదారు అయినా, ఈ శీఘ్ర పర్యటన మీకు ప్రారంభించడానికి సహాయం చేస్తుంది మరియు [`pipeline`] అనుమితి కోసం ఎలా ఉపయోగించాలో మీకు చూపుతుంది, [AutoClass](./model_doc/auto) తో ప్రీట్రైన్డ్ మోడల్ మరియు ప్రిప్రాసెసర్/ ఆటో, మరియు PyTorch లేదా TensorFlowతో మోడల్‌కు త్వరగా శిక్షణ ఇవ్వండి. మీరు ఒక అనుభవశూన్యుడు అయితే, ఇక్కడ పరిచయం చేయబడిన భావనల గురించి మరింత లోతైన వివరణల కోసం మా ట్యుటోరియల్స్ లేదా [course](https://huggingface.co/course/chapter1/1)ని తనిఖీ చేయమని మేము సిఫార్సు చేస్తున్నాము. + +మీరు ప్రారంభించడానికి ముందు, మీరు అవసరమైన అన్ని లైబ్రరీలను ఇన్‌స్టాల్ చేశారని నిర్ధారించుకోండి: + +```bash +!pip install transformers datasets +``` + +మీరు మీ ప్రాధాన్య యంత్ర అభ్యాస ఫ్రేమ్‌వర్క్‌ను కూడా ఇన్‌స్టాల్ చేయాలి: + + + + +```bash +pip install torch +``` + + + +```bash +pip install tensorflow +``` + + + +## పైప్‌లైన్ + + + +[`pipeline`] అనుమితి కోసం ముందుగా శిక్షణ పొందిన నమూనాను ఉపయోగించడానికి సులభమైన మరియు వేగవంతమైన మార్గం. మీరు వివిధ పద్ధతులలో అనేక పనుల కోసం [`pipeline`] వెలుపల ఉపయోగించవచ్చు, వాటిలో కొన్ని క్రింది పట్టికలో చూపబడ్డాయి: + + + + +అందుబాటులో ఉన్న పనుల పూర్తి జాబితా కోసం, [పైప్‌లైన్ API సూచన](./main_classes/pipelines)ని తనిఖీ చేయండి. 
+ + + +Here is the translation in Telugu: + +| **పని** | **వివరణ** | **మోడాలిటీ** | **పైప్‌లైన్ ఐడెంటిఫైయర్** | +|------------------------------|--------------------------------------------------------------------------------------------------------|-----------------|------------------------------------------| +| వచన వర్గీకరణు | కొన్ని వచనాల అంతా ఒక లేబుల్‌ను కొడి | NLP | pipeline(task=“sentiment-analysis”) | +| వచన సృష్టి | ప్రమ్పుటం కలిగినంత వచనం సృష్టించండి | NLP | pipeline(task=“text-generation”) | +| సంక్షేపణ | వచనం లేదా పత్రం కొరకు సంక్షేపణ తయారుచేసండి | NLP | pipeline(task=“summarization”) | +| చిత్రం వర్గీకరణు | చిత్రంలో ఒక లేబుల్‌ను కొడి | కంప్యూటర్ విషయం | pipeline(task=“image-classification”) | +| చిత్రం విభజన | ఒక చిత్రంలో ప్రతి వ్యక్తిగత పిక్సల్‌ను ఒక లేబుల్‌గా నమోదు చేయండి (సెమాంటిక్, పానొప్టిక్, మరియు ఇన్స్టన్స్ విభజనలను మద్దతు చేస్తుంది) | కంప్యూటర్ విషయం | pipeline(task=“image-segmentation”) | +| వస్త్రం గుర్తువు | ఒక చిత్రంలో పదాల యొక్క బౌండింగ్ బాక్స్‌లను మరియు వస్త్రాల వర్గాలను అంచనా చేయండి | కంప్యూటర్ విషయం | pipeline(task=“object-detection”) | +| ఆడియో గుర్తువు | కొన్ని ఆడియో డేటానికి ఒక లేబుల్‌ను కొడి | ఆడియో | pipeline(task=“audio-classification”) | +| స్వయంచలన ప్రసంగ గుర్తువు | ప్రసంగాన్ని వచనంగా వర్ణించండి | ఆడియో | pipeline(task=“automatic-speech-recognition”) | +| దృశ్య ప్రశ్న సంవాదం | వచనం మరియు ప్రశ్నను నమోదు చేసిన చిత్రంతో ప్రశ్నకు సమాధానం ఇవ్వండి | బహుమూలిక | pipeline(task=“vqa”) | +| పత్రం ప్రశ్న సంవాదం | ప్రశ్నను పత్రం లేదా డాక్యుమెంట్‌తో సమాధానం ఇవ్వండి | బహుమూలిక | pipeline(task="document-question-answering") | +| చిత్రం వ్రాసాయింగ్ | కొన్ని చిత్రానికి పిటియార్లను సృష్టించండి | బహుమూలిక | pipeline(task="image-to-text") | + + +[`pipeline`] యొక్క ఉదాహరణను సృష్టించడం ద్వారా మరియు మీరు దానిని ఉపయోగించాలనుకుంటున్న పనిని పేర్కొనడం ద్వారా ప్రారంభించండి. ఈ గైడ్‌లో, మీరు సెంటిమెంట్ విశ్లేషణ కోసం [`pipeline`]ని ఉదాహరణగా ఉపయోగిస్తారు: + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline("sentiment-analysis") +``` + +సెంటిమెంట్ విశ్లేషణ కోసం [`pipeline`] డిఫాల్ట్ [ప్రీట్రైన్డ్ మోడల్](https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english) మరియు టోకెనైజర్‌ని డౌన్‌లోడ్ చేస్తుంది మరియు కాష్ చేస్తుంది. ఇప్పుడు మీరు మీ లక్ష్య వచనంలో `classifier`ని ఉపయోగించవచ్చు: + +```py +>>> classifier("We are very happy to show you the 🤗 Transformers library.") +[{'label': 'POSITIVE', 'score': 0.9998}] +``` + +మీరు ఒకటి కంటే ఎక్కువ ఇన్‌పుట్‌లను కలిగి ఉంటే, నిఘంటువుల జాబితాను అందించడానికి మీ ఇన్‌పుట్‌లను జాబితాగా [`pipeline`]కి పంపండి: + +```py +>>> results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."]) +>>> for result in results: +... print(f"label: {result['label']}, with score: {round(result['score'], 4)}") +label: POSITIVE, with score: 0.9998 +label: NEGATIVE, with score: 0.5309 +``` + +[`pipeline`] మీకు నచ్చిన ఏదైనా పని కోసం మొత్తం డేటాసెట్‌ను కూడా పునరావృతం చేయగలదు. ఈ ఉదాహరణ కోసం, స్వయంచాలక ప్రసంగ గుర్తింపును మన పనిగా ఎంచుకుందాం: + +```py +>>> import torch +>>> from transformers import pipeline + +>>> speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h") +``` + +మీరు మళ్లీ మళ్లీ చెప్పాలనుకుంటున్న ఆడియో డేటాసెట్‌ను లోడ్ చేయండి (మరిన్ని వివరాల కోసం 🤗 డేటాసెట్‌లు [త్వరిత ప్రారంభం](https://huggingface.co/docs/datasets/quickstart#audio) చూడండి. 
ఉదాహరణకు, [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) డేటాసెట్‌ను లోడ్ చేయండి: + +```py +>>> from datasets import load_dataset, Audio + +>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train") # doctest: +IGNORE_RESULT +``` + +డేటాసెట్ యొక్క నమూనా రేటు నమూనాతో సరిపోలుతుందని మీరు నిర్ధారించుకోవాలి +రేటు [`facebook/wav2vec2-base-960h`](https://huggingface.co/facebook/wav2vec2-base-960h) దీనిపై శిక్షణ పొందింది: + +```py +>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=speech_recognizer.feature_extractor.sampling_rate)) +``` + +`"ఆడియో"` కాలమ్‌కి కాల్ చేస్తున్నప్పుడు ఆడియో ఫైల్‌లు స్వయంచాలకంగా లోడ్ చేయబడతాయి మరియు మళ్లీ నమూనా చేయబడతాయి. +మొదటి 4 నమూనాల నుండి ముడి వేవ్‌ఫార్మ్ శ్రేణులను సంగ్రహించి, పైప్‌లైన్‌కు జాబితాగా పాస్ చేయండి: + +```py +>>> result = speech_recognizer(dataset[:4]["audio"]) +>>> print([d["text"] for d in result]) +['I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT', "FONDERING HOW I'D SET UP A JOIN TO HELL T WITH MY WIFE AND WHERE THE AP MIGHT BE", "I I'D LIKE TOY SET UP A JOINT ACCOUNT WITH MY PARTNER I'M NOT SEEING THE OPTION TO DO IT ON THE APSO I CALLED IN TO GET SOME HELP CAN I JUST DO IT OVER THE PHONE WITH YOU AND GIVE YOU THE INFORMATION OR SHOULD I DO IT IN THE AP AN I'M MISSING SOMETHING UQUETTE HAD PREFERRED TO JUST DO IT OVER THE PHONE OF POSSIBLE THINGS", 'HOW DO I FURN A JOINA COUT'] +``` + +ఇన్‌పుట్‌లు పెద్దగా ఉన్న పెద్ద డేటాసెట్‌ల కోసం (స్పీచ్ లేదా విజన్ వంటివి), మెమరీలోని అన్ని ఇన్‌పుట్‌లను లోడ్ చేయడానికి మీరు జాబితాకు బదులుగా జెనరేటర్‌ను పాస్ చేయాలనుకుంటున్నారు. మరింత సమాచారం కోసం [పైప్‌లైన్ API సూచన](./main_classes/pipelines)ని చూడండి. + +### పైప్‌లైన్‌లో మరొక మోడల్ మరియు టోకెనైజర్‌ని ఉపయోగించండి + +[`pipeline`] [Hub](https://huggingface.co/models) నుండి ఏదైనా మోడల్‌ను కలిగి ఉంటుంది, దీని వలన ఇతర వినియోగ-కేసుల కోసం [`pipeline`]ని సులభంగా స్వీకరించవచ్చు. ఉదాహరణకు, మీరు ఫ్రెంచ్ టెక్స్ట్‌ను హ్యాండిల్ చేయగల మోడల్ కావాలనుకుంటే, తగిన మోడల్ కోసం ఫిల్టర్ చేయడానికి హబ్‌లోని ట్యాగ్‌లను ఉపయోగించండి. 
అగ్ర ఫిల్టర్ చేసిన ఫలితం మీరు ఫ్రెంచ్ టెక్స్ట్ కోసం ఉపయోగించగల సెంటిమెంట్ విశ్లేషణ కోసం ఫైన్‌ట్యూన్ చేయబడిన బహుభాషా [BERT మోడల్](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment)ని అందిస్తుంది: + +```py +>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" +``` + + + +ముందుగా శిక్షణ పొందిన మోడల్‌ను లోడ్ చేయడానికి [`AutoModelForSequenceClassification`] మరియు [`AutoTokenizer`]ని ఉపయోగించండి మరియు దాని అనుబంధిత టోకెనైజర్ (తదుపరి విభాగంలో `AutoClass`పై మరిన్ని): + +```py +>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification + +>>> model = AutoModelForSequenceClassification.from_pretrained(model_name) +>>> tokenizer = AutoTokenizer.from_pretrained(model_name) +``` + + +ముందుగా శిక్షణ పొందిన మోడల్‌ను లోడ్ చేయడానికి [`TFAutoModelForSequenceClassification`] మరియు [`AutoTokenizer`]ని ఉపయోగించండి మరియు దాని అనుబంధిత టోకెనైజర్ (తదుపరి విభాగంలో `TFAutoClass`పై మరిన్ని): + +```py +>>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification + +>>> model = TFAutoModelForSequenceClassification.from_pretrained(model_name) +>>> tokenizer = AutoTokenizer.from_pretrained(model_name) +``` + + + +[`pipeline`]లో మోడల్ మరియు టోకెనైజర్‌ను పేర్కొనండి మరియు ఇప్పుడు మీరు ఫ్రెంచ్ టెక్స్ట్‌పై `క్లాసిఫైయర్`ని వర్తింపజేయవచ్చు: + +```py +>>> classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer) +>>> classifier("Nous sommes très heureux de vous présenter la bibliothèque 🤗 Transformers.") +[{'label': '5 stars', 'score': 0.7273}] +``` + +మీరు మీ వినియోగ-కేస్ కోసం మోడల్‌ను కనుగొనలేకపోతే, మీరు మీ డేటాపై ముందుగా శిక్షణ పొందిన మోడల్‌ను చక్కగా మార్చాలి. ఎలాగో తెలుసుకోవడానికి మా [ఫైన్‌ట్యూనింగ్ ట్యుటోరియల్](./training)ని చూడండి. చివరగా, మీరు మీ ప్రీట్రైన్డ్ మోడల్‌ని ఫైన్‌ట్యూన్ చేసిన తర్వాత, దయచేసి అందరి కోసం మెషిన్ లెర్నింగ్‌ని డెమోక్రటైజ్ చేయడానికి హబ్‌లోని సంఘంతో మోడల్‌ను [షేరింగ్](./model_sharing) పరిగణించండి! 🤗 + +## AutoClass + + + +హుడ్ కింద, మీరు పైన ఉపయోగించిన [`pipeline`]కి శక్తిని అందించడానికి [`AutoModelForSequenceClassification`] మరియు [`AutoTokenizer`] తరగతులు కలిసి పని చేస్తాయి. ఒక [AutoClass](./model_doc/auto) అనేది ముందుగా శిక్షణ పొందిన మోడల్ యొక్క ఆర్కిటెక్చర్‌ను దాని పేరు లేదా మార్గం నుండి స్వయంచాలకంగా తిరిగి పొందే సత్వరమార్గం. మీరు మీ టాస్క్ కోసం తగిన `ఆటోక్లాస్`ని మాత్రమే ఎంచుకోవాలి మరియు ఇది అనుబంధిత ప్రీప్రాసెసింగ్ క్లాస్. + +మునుపటి విభాగం నుండి ఉదాహరణకి తిరిగి వెళ్లి, [`pipeline`] ఫలితాలను ప్రతిబింబించడానికి మీరు `ఆటోక్లాస్`ని ఎలా ఉపయోగించవచ్చో చూద్దాం. + +### AutoTokenizer + +ఒక మోడల్‌కు ఇన్‌పుట్‌లుగా సంఖ్యల శ్రేణిలో వచనాన్ని ప్రీప్రాసెసింగ్ చేయడానికి టోకెనైజర్ బాధ్యత వహిస్తుంది. పదాన్ని ఎలా విభజించాలి మరియు ఏ స్థాయిలో పదాలను విభజించాలి ([tokenizer సారాంశం](./tokenizer_summary)లో టోకనైజేషన్ గురించి మరింత తెలుసుకోండి) సహా టోకనైజేషన్ ప్రక్రియను నియంత్రించే అనేక నియమాలు ఉన్నాయి. గుర్తుంచుకోవలసిన ముఖ్యమైన విషయం ఏమిటంటే, మీరు మోడల్‌కు ముందే శిక్షణ పొందిన అదే టోకనైజేషన్ నియమాలను ఉపయోగిస్తున్నారని నిర్ధారించుకోవడానికి మీరు అదే మోడల్ పేరుతో టోకెనైజర్‌ను తక్షణం చేయాలి. 
+ +[`AutoTokenizer`]తో టోకెనైజర్‌ను లోడ్ చేయండి: + +```py +>>> from transformers import AutoTokenizer + +>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" +>>> tokenizer = AutoTokenizer.from_pretrained(model_name) +``` + +మీ వచనాన్ని టోకెనైజర్‌కు పంపండి: + +```py +>>> encoding = tokenizer("We are very happy to show you the 🤗 Transformers library.") +>>> print(encoding) +{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102], + 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]} +``` + +టోకెనైజర్ వీటిని కలిగి ఉన్న నిఘంటువుని అందిస్తుంది: + +* [input_ids](./glossary#input-ids): మీ టోకెన్‌ల సంఖ్యాపరమైన ప్రాతినిధ్యం. +* [అటెన్షన్_మాస్క్](./glossary#attention-mask): ఏ టోకెన్‌లకు హాజరు కావాలో సూచిస్తుంది. + +ఒక టోకెనైజర్ ఇన్‌పుట్‌ల జాబితాను కూడా ఆమోదించగలదు మరియు ఏకరీతి పొడవుతో బ్యాచ్‌ను తిరిగి ఇవ్వడానికి టెక్స్ట్‌ను ప్యాడ్ చేసి కత్తిరించవచ్చు: + + + + +```py +>>> pt_batch = tokenizer( +... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."], +... padding=True, +... truncation=True, +... max_length=512, +... return_tensors="pt", +... ) +``` + + + +```py +>>> tf_batch = tokenizer( +... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."], +... padding=True, +... truncation=True, +... max_length=512, +... return_tensors="tf", +... ) +``` + + + + + +టోకనైజేషన్ గురించి మరిన్ని వివరాల కోసం [ప్రీప్రాసెస్](./preprocessing) ట్యుటోరియల్‌ని చూడండి మరియు ఇమేజ్, ఆడియో మరియు మల్టీమోడల్ ఇన్‌పుట్‌లను ప్రీప్రాసెస్ చేయడానికి [`AutoImageProcessor`], [`AutoFeatureExtractor`] మరియు [`AutoProcessor`] ఎలా ఉపయోగించాలి. + + + +### AutoModel + + + +🤗 ట్రాన్స్‌ఫార్మర్లు ప్రీట్రైన్డ్ ఇన్‌స్టాన్స్‌లను లోడ్ చేయడానికి సులభమైన మరియు ఏకీకృత మార్గాన్ని అందిస్తాయి. దీని అర్థం మీరు [`AutoTokenizer`]ని లోడ్ చేసినట్లుగా [`AutoModel`]ని లోడ్ చేయవచ్చు. టాస్క్ కోసం సరైన [`AutoModel`]ని ఎంచుకోవడం మాత్రమే తేడా. టెక్స్ట్ (లేదా సీక్వెన్స్) వర్గీకరణ కోసం, మీరు [`AutoModelForSequenceClassification`]ని లోడ్ చేయాలి: + + +```py +>>> from transformers import AutoModelForSequenceClassification + +>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" +>>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name) +``` + + + +[`AutoModel`] క్లాస్ ద్వారా సపోర్ట్ చేసే టాస్క్‌ల కోసం [టాస్క్ సారాంశం](./task_summary)ని చూడండి. + + + +ఇప్పుడు మీ ప్రీప్రాసెస్ చేయబడిన బ్యాచ్ ఇన్‌పుట్‌లను నేరుగా మోడల్‌కి పంపండి. మీరు `**`ని జోడించడం ద్వారా నిఘంటువుని అన్‌ప్యాక్ చేయాలి: + +```py +>>> pt_outputs = pt_model(**pt_batch) +``` + +మోడల్ తుది యాక్టివేషన్‌లను `logits` లక్షణంలో అవుట్‌పుట్ చేస్తుంది. సంభావ్యతలను తిరిగి పొందడానికి సాఫ్ట్‌మాక్స్ ఫంక్షన్‌ను `logits` కు వర్తింపజేయండి: + + +```py +>>> from torch import nn + +>>> pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1) +>>> print(pt_predictions) +tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725], + [0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=) +``` + + + +🤗 ట్రాన్స్‌ఫార్మర్లు ప్రీట్రైన్డ్ ఇన్‌స్టాన్స్‌లను లోడ్ చేయడానికి సులభమైన మరియు ఏకీకృత మార్గాన్ని అందిస్తాయి. మీరు [`AutoTokenizer`]ని లోడ్ చేసినట్లుగా మీరు [`TFAutoModel`]ని లోడ్ చేయవచ్చని దీని అర్థం. టాస్క్ కోసం సరైన [`TFAutoModel`]ని ఎంచుకోవడం మాత్రమే తేడా. 
టెక్స్ట్ (లేదా సీక్వెన్స్) వర్గీకరణ కోసం, మీరు [`TFAutoModelForSequenceClassification`]ని లోడ్ చేయాలి: + +```py +>>> from transformers import TFAutoModelForSequenceClassification + +>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" +>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name) +``` + + + +[`AutoModel`] క్లాస్ ద్వారా సపోర్ట్ చేసే టాస్క్‌ల కోసం [టాస్క్ సారాంశం](./task_summary)ని చూడండి. + + + +ఇప్పుడు మీ ప్రీప్రాసెస్ చేయబడిన బ్యాచ్ ఇన్‌పుట్‌లను నేరుగా మోడల్‌కి పంపండి. మీరు టెన్సర్‌లను ఇలా పాస్ చేయవచ్చు: + +```py +>>> tf_outputs = tf_model(tf_batch) +``` + +మోడల్ తుది యాక్టివేషన్‌లను `logits` లక్షణంలో అవుట్‌పుట్ చేస్తుంది. సంభావ్యతలను తిరిగి పొందడానికి సాఫ్ట్‌మాక్స్ ఫంక్షన్‌ను `logits`కు వర్తింపజేయండి: + +```py +>>> import tensorflow as tf + +>>> tf_predictions = tf.nn.softmax(tf_outputs.logits, axis=-1) +>>> tf_predictions # doctest: +IGNORE_RESULT +``` + + + + + +అన్ని 🤗 ట్రాన్స్‌ఫార్మర్స్ మోడల్‌లు (PyTorch లేదా TensorFlow) తుది యాక్టివేషన్‌కు *ముందు* టెన్సర్‌లను అవుట్‌పుట్ చేస్తాయి +ఫంక్షన్ (softmax వంటిది) ఎందుకంటే చివరి యాక్టివేషన్ ఫంక్షన్ తరచుగా నష్టంతో కలిసిపోతుంది. మోడల్ అవుట్‌పుట్‌లు ప్రత్యేక డేటాక్లాస్‌లు కాబట్టి వాటి లక్షణాలు IDEలో స్వయంచాలకంగా పూర్తి చేయబడతాయి. మోడల్ అవుట్‌పుట్‌లు టుపుల్ లేదా డిక్షనరీ లాగా ప్రవర్తిస్తాయి (మీరు పూర్ణాంకం, స్లైస్ లేదా స్ట్రింగ్‌తో ఇండెక్స్ చేయవచ్చు) ఈ సందర్భంలో, ఏదీ లేని గుణాలు విస్మరించబడతాయి. + + + +### మోడల్‌ను సేవ్ చేయండి + + + +మీ మోడల్ చక్కగా ట్యూన్ చేయబడిన తర్వాత, మీరు దానిని [`PreTrainedModel.save_pretrained`]ని ఉపయోగించి దాని టోకెనైజర్‌తో సేవ్ చేయవచ్చు: + +```py +>>> pt_save_directory = "./pt_save_pretrained" +>>> tokenizer.save_pretrained(pt_save_directory) # doctest: +IGNORE_RESULT +>>> pt_model.save_pretrained(pt_save_directory) +``` + +మీరు మోడల్‌ని మళ్లీ ఉపయోగించడానికి సిద్ధంగా ఉన్నప్పుడు, దాన్ని [`PreTrainedModel.from_pretrained`]తో రీలోడ్ చేయండి: + +```py +>>> pt_model = AutoModelForSequenceClassification.from_pretrained("./pt_save_pretrained") +``` + + +మీ మోడల్ చక్కగా ట్యూన్ చేయబడిన తర్వాత, మీరు దానిని [`TFPreTrainedModel.save_pretrained`]ని ఉపయోగించి దాని టోకెనైజర్‌తో సేవ్ చేయవచ్చు: + +```py +>>> tf_save_directory = "./tf_save_pretrained" +>>> tokenizer.save_pretrained(tf_save_directory) # doctest: +IGNORE_RESULT +>>> tf_model.save_pretrained(tf_save_directory) +``` +మీరు మోడల్‌ని మళ్లీ ఉపయోగించడానికి సిద్ధంగా ఉన్నప్పుడు, దాన్ని [`TFPreTrainedModel.from_pretrained`]తో రీలోడ్ చేయండి: + +```py +>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained("./tf_save_pretrained") +``` + + + +ఒక ప్రత్యేకించి అద్భుతమైన 🤗 ట్రాన్స్‌ఫార్మర్స్ ఫీచర్ మోడల్‌ను సేవ్ చేయగల సామర్థ్యం మరియు దానిని PyTorch లేదా TensorFlow మోడల్‌గా రీలోడ్ చేయగలదు. `from_pt` లేదా `from_tf` పరామితి మోడల్‌ను ఒక ఫ్రేమ్‌వర్క్ నుండి మరొక ఫ్రేమ్‌వర్క్‌కి మార్చగలదు: + + + + +```py +>>> from transformers import AutoModel + +>>> tokenizer = AutoTokenizer.from_pretrained(tf_save_directory) +>>> pt_model = AutoModelForSequenceClassification.from_pretrained(tf_save_directory, from_tf=True) +``` + + + +```py +>>> from transformers import TFAutoModel + +>>> tokenizer = AutoTokenizer.from_pretrained(pt_save_directory) +>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained(pt_save_directory, from_pt=True) +``` + + + +## కస్టమ్ మోడల్ బిల్డ్స్ +మోడల్ ఎలా నిర్మించబడుతుందో మార్చడానికి మీరు మోడల్ కాన్ఫిగరేషన్ క్లాస్‌ని సవరించవచ్చు. దాచిన లేయర్‌లు లేదా అటెన్షన్ హెడ్‌ల సంఖ్య వంటి మోడల్ లక్షణాలను కాన్ఫిగరేషన్ నిర్దేశిస్తుంది. 
మీరు కస్టమ్ కాన్ఫిగరేషన్ క్లాస్ నుండి మోడల్‌ను ప్రారంభించినప్పుడు మీరు మొదటి నుండి ప్రారంభిస్తారు. మోడల్ అట్రిబ్యూట్‌లు యాదృచ్ఛికంగా ప్రారంభించబడ్డాయి మరియు అర్థవంతమైన ఫలితాలను పొందడానికి మీరు మోడల్‌ను ఉపయోగించే ముందు దానికి శిక్షణ ఇవ్వాలి. + +[`AutoConfig`]ని దిగుమతి చేయడం ద్వారా ప్రారంభించండి, ఆపై మీరు సవరించాలనుకుంటున్న ప్రీట్రైన్డ్ మోడల్‌ను లోడ్ చేయండి. [`AutoConfig.from_pretrained`]లో, మీరు అటెన్షన్ హెడ్‌ల సంఖ్య వంటి మీరు మార్చాలనుకుంటున్న లక్షణాన్ని పేర్కొనవచ్చు: + +```py +>>> from transformers import AutoConfig + +>>> my_config = AutoConfig.from_pretrained("distilbert/distilbert-base-uncased", n_heads=12) +``` + + + +[`AutoModel.from_config`]తో మీ అనుకూల కాన్ఫిగరేషన్ నుండి మోడల్‌ను సృష్టించండి: + +```py +>>> from transformers import AutoModel + +>>> my_model = AutoModel.from_config(my_config) +``` + + +[`TFAutoModel.from_config`]తో మీ అనుకూల కాన్ఫిగరేషన్ నుండి మోడల్‌ను సృష్టించండి: + +```py +>>> from transformers import TFAutoModel + +>>> my_model = TFAutoModel.from_config(my_config) +``` + + + +అనుకూల కాన్ఫిగరేషన్‌లను రూపొందించడం గురించి మరింత సమాచారం కోసం [కస్టమ్ ఆర్కిటెక్చర్‌ని సృష్టించండి](./create_a_model) గైడ్‌ను చూడండి. + +## శిక్షకుడు - పైటార్చ్ ఆప్టిమైజ్ చేసిన శిక్షణ లూప్ + +అన్ని మోడల్‌లు ప్రామాణికమైన [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) కాబట్టి మీరు వాటిని ఏదైనా సాధారణ శిక్షణ లూప్‌లో ఉపయోగించవచ్చు. మీరు మీ స్వంత శిక్షణ లూప్‌ను వ్రాయగలిగినప్పటికీ, 🤗 ట్రాన్స్‌ఫార్మర్లు PyTorch కోసం [`ట్రైనర్`] తరగతిని అందజేస్తాయి, ఇందులో ప్రాథమిక శిక్షణ లూప్ ఉంటుంది మరియు పంపిణీ చేయబడిన శిక్షణ, మిశ్రమ ఖచ్చితత్వం మరియు మరిన్ని వంటి ఫీచర్‌ల కోసం అదనపు కార్యాచరణను జోడిస్తుంది. + +మీ విధిని బట్టి, మీరు సాధారణంగా కింది పారామితులను [`ట్రైనర్`]కి పంపుతారు: + +1. మీరు [`PreTrainedModel`] లేదా [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)తో ప్రారంభిస్తారు: + ```py + >>> from transformers import AutoModelForSequenceClassification + + >>> model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") + ``` + +2. [`TrainingArguments`] మీరు నేర్చుకునే రేటు, బ్యాచ్ పరిమాణం మరియు శిక్షణ పొందవలసిన యుగాల సంఖ్య వంటి మార్చగల మోడల్ హైపర్‌పారామీటర్‌లను కలిగి ఉంది. మీరు ఎలాంటి శిక్షణా వాదనలను పేర్కొనకుంటే డిఫాల్ట్ విలువలు ఉపయోగించబడతాయి: + + ```py + >>> from transformers import TrainingArguments + + >>> training_args = TrainingArguments( + ... output_dir="path/to/save/folder/", + ... learning_rate=2e-5, + ... per_device_train_batch_size=8, + ... per_device_eval_batch_size=8, + ... num_train_epochs=2, + ... ) + ``` + +3. టోకెనైజర్, ఇమేజ్ ప్రాసెసర్, ఫీచర్ ఎక్స్‌ట్రాక్టర్ లేదా ప్రాసెసర్ వంటి ప్రీప్రాసెసింగ్ క్లాస్‌ని లోడ్ చేయండి: + ```py + >>> from transformers import AutoTokenizer + + >>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") + ``` + +4. డేటాసెట్‌ను లోడ్ చేయండి: + + ```py + >>> from datasets import load_dataset + + >>> dataset = load_dataset("rotten_tomatoes") # doctest: +IGNORE_RESULT + ``` + +5. డేటాసెట్‌ను టోకనైజ్ చేయడానికి ఒక ఫంక్షన్‌ను సృష్టించండి: + + ```py + >>> def tokenize_dataset(dataset): + ... return tokenizer(dataset["text"]) + ``` + + ఆపై దానిని [`~datasets.Dataset.map`]తో మొత్తం డేటాసెట్‌లో వర్తింపజేయండి: + + ```py + >>> dataset = dataset.map(tokenize_dataset, batched=True) + ``` + +6. 
మీ డేటాసెట్ నుండి ఉదాహరణల సమూహాన్ని సృష్టించడానికి [`DataCollatorWithPadding`]: + + ```py + >>> from transformers import DataCollatorWithPadding + + >>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer) + ``` + +ఇప్పుడు ఈ తరగతులన్నింటినీ [`Trainer`]లో సేకరించండి: + +```py +>>> from transformers import Trainer + +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=dataset["train"], +... eval_dataset=dataset["test"], +... tokenizer=tokenizer, +... data_collator=data_collator, +... ) # doctest: +SKIP +``` + +మీరు సిద్ధంగా ఉన్నప్పుడు, శిక్షణను ప్రారంభించడానికి [`~Trainer.train`]కి కాల్ చేయండి: + +```py +>>> trainer.train() # doctest: +SKIP +``` + + + +సీక్వెన్స్-టు-సీక్వెన్స్ మోడల్‌ని ఉపయోగించే - అనువాదం లేదా సారాంశం వంటి పనుల కోసం, బదులుగా [`Seq2SeqTrainer`] మరియు [`Seq2SeqTrainingArguments`] తరగతులను ఉపయోగించండి. + + + +మీరు [`Trainer`] లోపల ఉన్న పద్ధతులను ఉపవర్గీకరించడం ద్వారా శిక్షణ లూప్ ప్రవర్తనను అనుకూలీకరించవచ్చు. ఇది లాస్ ఫంక్షన్, ఆప్టిమైజర్ మరియు షెడ్యూలర్ వంటి లక్షణాలను అనుకూలీకరించడానికి మిమ్మల్ని అనుమతిస్తుంది. ఉపవర్గీకరించబడే పద్ధతుల కోసం [`Trainer`] సూచనను పరిశీలించండి. + +శిక్షణ లూప్‌ను అనుకూలీకరించడానికి మరొక మార్గం [కాల్‌బ్యాక్‌లు](./main_classes/callbacks). మీరు ఇతర లైబ్రరీలతో అనుసంధానం చేయడానికి కాల్‌బ్యాక్‌లను ఉపయోగించవచ్చు మరియు పురోగతిపై నివేదించడానికి శిక్షణ లూప్‌ను తనిఖీ చేయవచ్చు లేదా శిక్షణను ముందుగానే ఆపవచ్చు. శిక్షణ లూప్‌లోనే కాల్‌బ్యాక్‌లు దేనినీ సవరించవు. లాస్ ఫంక్షన్ వంటివాటిని అనుకూలీకరించడానికి, మీరు బదులుగా [`Trainer`]ని ఉపవర్గం చేయాలి. + +## TensorFlowతో శిక్షణ పొందండి + +అన్ని మోడల్‌లు ప్రామాణికమైన [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) కాబట్టి వాటిని [Keras]తో TensorFlowలో శిక్షణ పొందవచ్చు(https: //keras.io/) API. 🤗 ట్రాన్స్‌ఫార్మర్‌లు మీ డేటాసెట్‌ని సులభంగా `tf.data.Dataset`గా లోడ్ చేయడానికి [`~TFPreTrainedModel.prepare_tf_dataset`] పద్ధతిని అందజేస్తుంది కాబట్టి మీరు వెంటనే Keras' [`compile`](https://keras.io/api/models/model_training_apis/#compile-method) మరియు [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) పద్ధతులు. + +1. మీరు [`TFPreTrainedModel`] లేదా [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model)తో ప్రారంభిస్తారు: + ```py + >>> from transformers import TFAutoModelForSequenceClassification + + >>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") + ``` + +2. టోకెనైజర్, ఇమేజ్ ప్రాసెసర్, ఫీచర్ ఎక్స్‌ట్రాక్టర్ లేదా ప్రాసెసర్ వంటి ప్రీప్రాసెసింగ్ క్లాస్‌ని లోడ్ చేయండి: + + ```py + >>> from transformers import AutoTokenizer + + >>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") + ``` + +3. డేటాసెట్‌ను టోకనైజ్ చేయడానికి ఒక ఫంక్షన్‌ను సృష్టించండి: + + ```py + >>> def tokenize_dataset(dataset): + ... return tokenizer(dataset["text"]) # doctest: +SKIP + ``` + +4. [`~datasets.Dataset.map`]తో మొత్తం డేటాసెట్‌పై టోకెనైజర్‌ని వర్తింపజేయి, ఆపై డేటాసెట్ మరియు టోకెనైజర్‌ను [`~TFPreTrainedModel.prepare_tf_dataset`]కి పంపండి. మీరు కావాలనుకుంటే బ్యాచ్ పరిమాణాన్ని కూడా మార్చవచ్చు మరియు డేటాసెట్‌ను ఇక్కడ షఫుల్ చేయవచ్చు: + ```py + >>> dataset = dataset.map(tokenize_dataset) # doctest: +SKIP + >>> tf_dataset = model.prepare_tf_dataset( + ... dataset["train"], batch_size=16, shuffle=True, tokenizer=tokenizer + ... ) # doctest: +SKIP + ``` + +5. మీరు సిద్ధంగా ఉన్నప్పుడు, శిక్షణను ప్రారంభించడానికి మీరు `కంపైల్` మరియు `ఫిట్`కి కాల్ చేయవచ్చు. 
ట్రాన్స్‌ఫార్మర్స్ మోడల్స్ అన్నీ డిఫాల్ట్ టాస్క్-సంబంధిత లాస్ ఫంక్షన్‌ని కలిగి ఉన్నాయని గుర్తుంచుకోండి, కాబట్టి మీరు కోరుకునే వరకు మీరు ఒకదానిని పేర్కొనవలసిన అవసరం లేదు: + + ```py + >>> from tensorflow.keras.optimizers import Adam + + >>> model.compile(optimizer=Adam(3e-5)) # No loss argument! + >>> model.fit(tf_dataset) # doctest: +SKIP + ``` + +## తరవాత ఏంటి? + +ఇప్పుడు మీరు 🤗 ట్రాన్స్‌ఫార్మర్స్ త్వరిత పర్యటనను పూర్తి చేసారు, మా గైడ్‌లను తనిఖీ చేయండి మరియు అనుకూల మోడల్‌ను వ్రాయడం, టాస్క్ కోసం మోడల్‌ను చక్కగా తీర్చిదిద్దడం మరియు స్క్రిప్ట్‌తో మోడల్‌కు శిక్షణ ఇవ్వడం వంటి మరింత నిర్దిష్టమైన పనులను ఎలా చేయాలో తెలుసుకోండి. 🤗 ట్రాన్స్‌ఫార్మర్స్ కోర్ కాన్సెప్ట్‌ల గురించి మరింత తెలుసుకోవడానికి మీకు ఆసక్తి ఉంటే, ఒక కప్పు కాఫీ తాగి, మా కాన్సెప్టువల్ గైడ్‌లను చూడండి! diff --git a/docs/source/tr/_toctree.yml b/docs/source/tr/_toctree.yml new file mode 100644 index 00000000000000..8401da6e4eb0ae --- /dev/null +++ b/docs/source/tr/_toctree.yml @@ -0,0 +1,4 @@ +- sections: + - local: index + title: 🤗 Transformers + title: Get started \ No newline at end of file diff --git a/docs/source/tr/index.md b/docs/source/tr/index.md new file mode 100644 index 00000000000000..1b2c665e169d80 --- /dev/null +++ b/docs/source/tr/index.md @@ -0,0 +1,295 @@ + + +# 🤗 Transformers + +[PyTorch](https://pytorch.org/), [TensorFlow](https://www.tensorflow.org/) ve [JAX](https://jax.readthedocs.io/en/latest/) için son teknoloji makine öğrenimi. + +🤗 Transformers, güncel önceden eğitilmiş (pretrained) modelleri indirmenizi ve eğitmenizi kolaylaştıran API'ler ve araçlar sunar. Önceden eğitilmiş modeller kullanarak, hesaplama maliyetlerinizi ve karbon ayak izinizi azaltabilir, ve sıfırdan bir modeli eğitmek için gereken zaman ve kaynaklardan tasarruf edebilirsiniz. Bu modeller farklı modalitelerde ortak görevleri destekler. Örneğin: + +📝 **Doğal Dil İşleme**: metin sınıflandırma, adlandırılmış varlık tanıma, soru cevaplama, dil modelleme, özetleme, çeviri, çoktan seçmeli ve metin oluşturma.
+🖼️ **Bilgisayarlı Görü**: görüntü sınıflandırma, nesne tespiti ve bölümleme (segmentation).
+🗣️ **Ses**: otomatik konuşma tanıma ve ses sınıflandırma.
+🐙 **Çoklu Model**: tablo soru cevaplama, optik karakter tanıma, taranmış belgelerden bilgi çıkarma, video sınıflandırma ve görsel soru cevaplama. + +🤗 Transformers, PyTorch, TensorFlow ve JAX arasında çerçeve (framework) uyumluluğu sağlar. Bu, bir modelin yaşam döngüsünün her aşamasında farklı bir çerçeve kullanma esnekliği sunar; bir çerçevede üç satır kodla bir modeli eğitebilir ve başka bir çerçevede tahminleme için kullanabilirsiniz. Modeller ayrıca üretim ortamlarında kullanılmak üzere ONNX ve TorchScript gibi bir formata aktarılabilir. + +Büyüyen topluluğa [Hub](https://huggingface.co/models), [Forum](https://discuss.huggingface.co/) veya [Discord](https://discord.com/invite/JfAtkvEtRb) üzerinden katılabilirsiniz! + +## Hugging Face ekibinden özel destek arıyorsanız + + + HuggingFace Uzman Hızlandırma Programı + + +## İçindekiler + +Dokümantasyon, beş bölüme ayrılmıştır: + +- **BAŞLARKEN**, kütüphanenin hızlı bir turunu ve çalışmaya başlamak için kurulum talimatlarını sağlar. +- **ÖĞRETİCİLER**, başlangıç yapmak için harika bir yerdir. Bu bölüm, kütüphane kullanmaya başlamak için ihtiyacınız olan temel becerileri kazanmanıza yardımcı olacaktır. +- **NASIL YAPILIR KILAVUZLARI**, önceden eğitilmiş bir modele dil modellemesi için ince ayar (fine-tuning) yapmak veya özel bir model yazmak, ve paylaşmak gibi belirli bir hedefe nasıl ulaşılacağını gösterir. +- **KAVRAMSAL REHBERLER**, modellerin, görevlerin ve 🤗 Transformers tasarım felsefesinin temel kavramları ve fikirleri hakkında daha fazla tartışma ve açıklama sunar. +- **API** tüm sınıfları (class) ve fonksiyonları (functions) açıklar: + + - **ANA SINIFLAR**, yapılandırma, model, tokenizer ve pipeline gibi en önemli sınıfları (classes) ayrıntılandırır. + - **MODELLER**, kütüphanede kullanılan her modelle ilgili sınıfları ve fonksiyonları detaylı olarak inceler. + - **DAHİLİ YARDIMCILAR**, kullanılan yardımcı sınıfları ve fonksiyonları detaylı olarak inceler. + +## Desteklenen Modeller ve Çerçeveler + +Aşağıdaki tablo, her bir model için kütüphanede yer alan mevcut desteği temsil etmektedir. Her bir model için bir Python tokenizer'ına ("slow" olarak adlandırılır) sahip olup olmadıkları, 🤗 Tokenizers kütüphanesi tarafından desteklenen hızlı bir tokenizer'a sahip olup olmadıkları, Jax (Flax aracılığıyla), PyTorch ve/veya TensorFlow'da destek olup olmadıklarını göstermektedir. 
+ + + +| Model | PyTorch support | TensorFlow support | Flax Support | +|:------------------------------------------------------------------------:|:---------------:|:------------------:|:------------:| +| [ALBERT](model_doc/albert) | ✅ | ✅ | ✅ | +| [ALIGN](model_doc/align) | ✅ | ❌ | ❌ | +| [AltCLIP](model_doc/altclip) | ✅ | ❌ | ❌ | +| [Audio Spectrogram Transformer](model_doc/audio-spectrogram-transformer) | ✅ | ❌ | ❌ | +| [Autoformer](model_doc/autoformer) | ✅ | ❌ | ❌ | +| [Bark](model_doc/bark) | ✅ | ❌ | ❌ | +| [BART](model_doc/bart) | ✅ | ✅ | ✅ | +| [BARThez](model_doc/barthez) | ✅ | ✅ | ✅ | +| [BARTpho](model_doc/bartpho) | ✅ | ✅ | ✅ | +| [BEiT](model_doc/beit) | ✅ | ❌ | ✅ | +| [BERT](model_doc/bert) | ✅ | ✅ | ✅ | +| [Bert Generation](model_doc/bert-generation) | ✅ | ❌ | ❌ | +| [BertJapanese](model_doc/bert-japanese) | ✅ | ✅ | ✅ | +| [BERTweet](model_doc/bertweet) | ✅ | ✅ | ✅ | +| [BigBird](model_doc/big_bird) | ✅ | ❌ | ✅ | +| [BigBird-Pegasus](model_doc/bigbird_pegasus) | ✅ | ❌ | ❌ | +| [BioGpt](model_doc/biogpt) | ✅ | ❌ | ❌ | +| [BiT](model_doc/bit) | ✅ | ❌ | ❌ | +| [Blenderbot](model_doc/blenderbot) | ✅ | ✅ | ✅ | +| [BlenderbotSmall](model_doc/blenderbot-small) | ✅ | ✅ | ✅ | +| [BLIP](model_doc/blip) | ✅ | ✅ | ❌ | +| [BLIP-2](model_doc/blip-2) | ✅ | ❌ | ❌ | +| [BLOOM](model_doc/bloom) | ✅ | ❌ | ✅ | +| [BORT](model_doc/bort) | ✅ | ✅ | ✅ | +| [BridgeTower](model_doc/bridgetower) | ✅ | ❌ | ❌ | +| [BROS](model_doc/bros) | ✅ | ❌ | ❌ | +| [ByT5](model_doc/byt5) | ✅ | ✅ | ✅ | +| [CamemBERT](model_doc/camembert) | ✅ | ✅ | ❌ | +| [CANINE](model_doc/canine) | ✅ | ❌ | ❌ | +| [Chinese-CLIP](model_doc/chinese_clip) | ✅ | ❌ | ❌ | +| [CLAP](model_doc/clap) | ✅ | ❌ | ❌ | +| [CLIP](model_doc/clip) | ✅ | ✅ | ✅ | +| [CLIPSeg](model_doc/clipseg) | ✅ | ❌ | ❌ | +| [CodeGen](model_doc/codegen) | ✅ | ❌ | ❌ | +| [CodeLlama](model_doc/code_llama) | ✅ | ❌ | ❌ | +| [Conditional DETR](model_doc/conditional_detr) | ✅ | ❌ | ❌ | +| [ConvBERT](model_doc/convbert) | ✅ | ✅ | ❌ | +| [ConvNeXT](model_doc/convnext) | ✅ | ✅ | ❌ | +| [ConvNeXTV2](model_doc/convnextv2) | ✅ | ❌ | ❌ | +| [CPM](model_doc/cpm) | ✅ | ✅ | ✅ | +| [CPM-Ant](model_doc/cpmant) | ✅ | ❌ | ❌ | +| [CTRL](model_doc/ctrl) | ✅ | ✅ | ❌ | +| [CvT](model_doc/cvt) | ✅ | ✅ | ❌ | +| [Data2VecAudio](model_doc/data2vec) | ✅ | ❌ | ❌ | +| [Data2VecText](model_doc/data2vec) | ✅ | ❌ | ❌ | +| [Data2VecVision](model_doc/data2vec) | ✅ | ✅ | ❌ | +| [DeBERTa](model_doc/deberta) | ✅ | ✅ | ❌ | +| [DeBERTa-v2](model_doc/deberta-v2) | ✅ | ✅ | ❌ | +| [Decision Transformer](model_doc/decision_transformer) | ✅ | ❌ | ❌ | +| [Deformable DETR](model_doc/deformable_detr) | ✅ | ❌ | ❌ | +| [DeiT](model_doc/deit) | ✅ | ✅ | ❌ | +| [DePlot](model_doc/deplot) | ✅ | ❌ | ❌ | +| [DETA](model_doc/deta) | ✅ | ❌ | ❌ | +| [DETR](model_doc/detr) | ✅ | ❌ | ❌ | +| [DialoGPT](model_doc/dialogpt) | ✅ | ✅ | ✅ | +| [DiNAT](model_doc/dinat) | ✅ | ❌ | ❌ | +| [DINOv2](model_doc/dinov2) | ✅ | ❌ | ❌ | +| [DistilBERT](model_doc/distilbert) | ✅ | ✅ | ✅ | +| [DiT](model_doc/dit) | ✅ | ❌ | ✅ | +| [DonutSwin](model_doc/donut) | ✅ | ❌ | ❌ | +| [DPR](model_doc/dpr) | ✅ | ✅ | ❌ | +| [DPT](model_doc/dpt) | ✅ | ❌ | ❌ | +| [EfficientFormer](model_doc/efficientformer) | ✅ | ✅ | ❌ | +| [EfficientNet](model_doc/efficientnet) | ✅ | ❌ | ❌ | +| [ELECTRA](model_doc/electra) | ✅ | ✅ | ✅ | +| [EnCodec](model_doc/encodec) | ✅ | ❌ | ❌ | +| [Encoder decoder](model_doc/encoder-decoder) | ✅ | ✅ | ✅ | +| [ERNIE](model_doc/ernie) | ✅ | ❌ | ❌ | +| [ErnieM](model_doc/ernie_m) | ✅ | ❌ | ❌ | +| [ESM](model_doc/esm) | ✅ | ✅ | ❌ | +| 
[FairSeq Machine-Translation](model_doc/fsmt) | ✅ | ❌ | ❌ | +| [Falcon](model_doc/falcon) | ✅ | ❌ | ❌ | +| [FLAN-T5](model_doc/flan-t5) | ✅ | ✅ | ✅ | +| [FLAN-UL2](model_doc/flan-ul2) | ✅ | ✅ | ✅ | +| [FlauBERT](model_doc/flaubert) | ✅ | ✅ | ❌ | +| [FLAVA](model_doc/flava) | ✅ | ❌ | ❌ | +| [FNet](model_doc/fnet) | ✅ | ❌ | ❌ | +| [FocalNet](model_doc/focalnet) | ✅ | ❌ | ❌ | +| [Funnel Transformer](model_doc/funnel) | ✅ | ✅ | ❌ | +| [Fuyu](model_doc/fuyu) | ✅ | ❌ | ❌ | +| [GIT](model_doc/git) | ✅ | ❌ | ❌ | +| [GLPN](model_doc/glpn) | ✅ | ❌ | ❌ | +| [GPT Neo](model_doc/gpt_neo) | ✅ | ❌ | ✅ | +| [GPT NeoX](model_doc/gpt_neox) | ✅ | ❌ | ❌ | +| [GPT NeoX Japanese](model_doc/gpt_neox_japanese) | ✅ | ❌ | ❌ | +| [GPT-J](model_doc/gptj) | ✅ | ✅ | ✅ | +| [GPT-Sw3](model_doc/gpt-sw3) | ✅ | ✅ | ✅ | +| [GPTBigCode](model_doc/gpt_bigcode) | ✅ | ❌ | ❌ | +| [GPTSAN-japanese](model_doc/gptsan-japanese) | ✅ | ❌ | ❌ | +| [Graphormer](model_doc/graphormer) | ✅ | ❌ | ❌ | +| [GroupViT](model_doc/groupvit) | ✅ | ✅ | ❌ | +| [HerBERT](model_doc/herbert) | ✅ | ✅ | ✅ | +| [Hubert](model_doc/hubert) | ✅ | ✅ | ❌ | +| [I-BERT](model_doc/ibert) | ✅ | ❌ | ❌ | +| [IDEFICS](model_doc/idefics) | ✅ | ❌ | ❌ | +| [ImageGPT](model_doc/imagegpt) | ✅ | ❌ | ❌ | +| [Informer](model_doc/informer) | ✅ | ❌ | ❌ | +| [InstructBLIP](model_doc/instructblip) | ✅ | ❌ | ❌ | +| [Jukebox](model_doc/jukebox) | ✅ | ❌ | ❌ | +| [LayoutLM](model_doc/layoutlm) | ✅ | ✅ | ❌ | +| [LayoutLMv2](model_doc/layoutlmv2) | ✅ | ❌ | ❌ | +| [LayoutLMv3](model_doc/layoutlmv3) | ✅ | ✅ | ❌ | +| [LayoutXLM](model_doc/layoutxlm) | ✅ | ❌ | ❌ | +| [LED](model_doc/led) | ✅ | ✅ | ❌ | +| [LeViT](model_doc/levit) | ✅ | ❌ | ❌ | +| [LiLT](model_doc/lilt) | ✅ | ❌ | ❌ | +| [LLaMA](model_doc/llama) | ✅ | ❌ | ❌ | +| [Llama2](model_doc/llama2) | ✅ | ❌ | ❌ | +| [Longformer](model_doc/longformer) | ✅ | ✅ | ❌ | +| [LongT5](model_doc/longt5) | ✅ | ❌ | ✅ | +| [LUKE](model_doc/luke) | ✅ | ❌ | ❌ | +| [LXMERT](model_doc/lxmert) | ✅ | ✅ | ❌ | +| [M-CTC-T](model_doc/mctct) | ✅ | ❌ | ❌ | +| [M2M100](model_doc/m2m_100) | ✅ | ❌ | ❌ | +| [Marian](model_doc/marian) | ✅ | ✅ | ✅ | +| [MarkupLM](model_doc/markuplm) | ✅ | ❌ | ❌ | +| [Mask2Former](model_doc/mask2former) | ✅ | ❌ | ❌ | +| [MaskFormer](model_doc/maskformer) | ✅ | ❌ | ❌ | +| [MatCha](model_doc/matcha) | ✅ | ❌ | ❌ | +| [mBART](model_doc/mbart) | ✅ | ✅ | ✅ | +| [mBART-50](model_doc/mbart50) | ✅ | ✅ | ✅ | +| [MEGA](model_doc/mega) | ✅ | ❌ | ❌ | +| [Megatron-BERT](model_doc/megatron-bert) | ✅ | ❌ | ❌ | +| [Megatron-GPT2](model_doc/megatron_gpt2) | ✅ | ✅ | ✅ | +| [MGP-STR](model_doc/mgp-str) | ✅ | ❌ | ❌ | +| [Mistral](model_doc/mistral) | ✅ | ❌ | ❌ | +| [mLUKE](model_doc/mluke) | ✅ | ❌ | ❌ | +| [MMS](model_doc/mms) | ✅ | ✅ | ✅ | +| [MobileBERT](model_doc/mobilebert) | ✅ | ✅ | ❌ | +| [MobileNetV1](model_doc/mobilenet_v1) | ✅ | ❌ | ❌ | +| [MobileNetV2](model_doc/mobilenet_v2) | ✅ | ❌ | ❌ | +| [MobileViT](model_doc/mobilevit) | ✅ | ✅ | ❌ | +| [MobileViTV2](model_doc/mobilevitv2) | ✅ | ❌ | ❌ | +| [MPNet](model_doc/mpnet) | ✅ | ✅ | ❌ | +| [MPT](model_doc/mpt) | ✅ | ❌ | ❌ | +| [MRA](model_doc/mra) | ✅ | ❌ | ❌ | +| [MT5](model_doc/mt5) | ✅ | ✅ | ✅ | +| [MusicGen](model_doc/musicgen) | ✅ | ❌ | ❌ | +| [MVP](model_doc/mvp) | ✅ | ❌ | ❌ | +| [NAT](model_doc/nat) | ✅ | ❌ | ❌ | +| [Nezha](model_doc/nezha) | ✅ | ❌ | ❌ | +| [NLLB](model_doc/nllb) | ✅ | ❌ | ❌ | +| [NLLB-MOE](model_doc/nllb-moe) | ✅ | ❌ | ❌ | +| [Nougat](model_doc/nougat) | ✅ | ✅ | ✅ | +| [Nyströmformer](model_doc/nystromformer) | ✅ | ❌ | ❌ | +| [OneFormer](model_doc/oneformer) | ✅ | ❌ | ❌ | 
+| [OpenAI GPT](model_doc/openai-gpt) | ✅ | ✅ | ❌ | +| [OpenAI GPT-2](model_doc/gpt2) | ✅ | ✅ | ✅ | +| [OpenLlama](model_doc/open-llama) | ✅ | ❌ | ❌ | +| [OPT](model_doc/opt) | ✅ | ✅ | ✅ | +| [OWL-ViT](model_doc/owlvit) | ✅ | ❌ | ❌ | +| [OWLv2](model_doc/owlv2) | ✅ | ❌ | ❌ | +| [Pegasus](model_doc/pegasus) | ✅ | ✅ | ✅ | +| [PEGASUS-X](model_doc/pegasus_x) | ✅ | ❌ | ❌ | +| [Perceiver](model_doc/perceiver) | ✅ | ❌ | ❌ | +| [Persimmon](model_doc/persimmon) | ✅ | ❌ | ❌ | +| [PhoBERT](model_doc/phobert) | ✅ | ✅ | ✅ | +| [Pix2Struct](model_doc/pix2struct) | ✅ | ❌ | ❌ | +| [PLBart](model_doc/plbart) | ✅ | ❌ | ❌ | +| [PoolFormer](model_doc/poolformer) | ✅ | ❌ | ❌ | +| [Pop2Piano](model_doc/pop2piano) | ✅ | ❌ | ❌ | +| [ProphetNet](model_doc/prophetnet) | ✅ | ❌ | ❌ | +| [PVT](model_doc/pvt) | ✅ | ❌ | ❌ | +| [QDQBert](model_doc/qdqbert) | ✅ | ❌ | ❌ | +| [RAG](model_doc/rag) | ✅ | ✅ | ❌ | +| [REALM](model_doc/realm) | ✅ | ❌ | ❌ | +| [Reformer](model_doc/reformer) | ✅ | ❌ | ❌ | +| [RegNet](model_doc/regnet) | ✅ | ✅ | ✅ | +| [RemBERT](model_doc/rembert) | ✅ | ✅ | ❌ | +| [ResNet](model_doc/resnet) | ✅ | ✅ | ✅ | +| [RetriBERT](model_doc/retribert) | ✅ | ❌ | ❌ | +| [RoBERTa](model_doc/roberta) | ✅ | ✅ | ✅ | +| [RoBERTa-PreLayerNorm](model_doc/roberta-prelayernorm) | ✅ | ✅ | ✅ | +| [RoCBert](model_doc/roc_bert) | ✅ | ❌ | ❌ | +| [RoFormer](model_doc/roformer) | ✅ | ✅ | ✅ | +| [RWKV](model_doc/rwkv) | ✅ | ❌ | ❌ | +| [SAM](model_doc/sam) | ✅ | ✅ | ❌ | +| [SeamlessM4T](model_doc/seamless_m4t) | ✅ | ❌ | ❌ | +| [SegFormer](model_doc/segformer) | ✅ | ✅ | ❌ | +| [SEW](model_doc/sew) | ✅ | ❌ | ❌ | +| [SEW-D](model_doc/sew-d) | ✅ | ❌ | ❌ | +| [Speech Encoder decoder](model_doc/speech-encoder-decoder) | ✅ | ❌ | ✅ | +| [Speech2Text](model_doc/speech_to_text) | ✅ | ✅ | ❌ | +| [SpeechT5](model_doc/speecht5) | ✅ | ❌ | ❌ | +| [Splinter](model_doc/splinter) | ✅ | ❌ | ❌ | +| [SqueezeBERT](model_doc/squeezebert) | ✅ | ❌ | ❌ | +| [SwiftFormer](model_doc/swiftformer) | ✅ | ❌ | ❌ | +| [Swin Transformer](model_doc/swin) | ✅ | ✅ | ❌ | +| [Swin Transformer V2](model_doc/swinv2) | ✅ | ❌ | ❌ | +| [Swin2SR](model_doc/swin2sr) | ✅ | ❌ | ❌ | +| [SwitchTransformers](model_doc/switch_transformers) | ✅ | ❌ | ❌ | +| [T5](model_doc/t5) | ✅ | ✅ | ✅ | +| [T5v1.1](model_doc/t5v1.1) | ✅ | ✅ | ✅ | +| [Table Transformer](model_doc/table-transformer) | ✅ | ❌ | ❌ | +| [TAPAS](model_doc/tapas) | ✅ | ✅ | ❌ | +| [TAPEX](model_doc/tapex) | ✅ | ✅ | ✅ | +| [Time Series Transformer](model_doc/time_series_transformer) | ✅ | ❌ | ❌ | +| [TimeSformer](model_doc/timesformer) | ✅ | ❌ | ❌ | +| [Trajectory Transformer](model_doc/trajectory_transformer) | ✅ | ❌ | ❌ | +| [Transformer-XL](model_doc/transfo-xl) | ✅ | ✅ | ❌ | +| [TrOCR](model_doc/trocr) | ✅ | ❌ | ❌ | +| [TVLT](model_doc/tvlt) | ✅ | ❌ | ❌ | +| [UL2](model_doc/ul2) | ✅ | ✅ | ✅ | +| [UMT5](model_doc/umt5) | ✅ | ❌ | ❌ | +| [UniSpeech](model_doc/unispeech) | ✅ | ❌ | ❌ | +| [UniSpeechSat](model_doc/unispeech-sat) | ✅ | ❌ | ❌ | +| [UPerNet](model_doc/upernet) | ✅ | ❌ | ❌ | +| [VAN](model_doc/van) | ✅ | ❌ | ❌ | +| [VideoMAE](model_doc/videomae) | ✅ | ❌ | ❌ | +| [ViLT](model_doc/vilt) | ✅ | ❌ | ❌ | +| [Vision Encoder decoder](model_doc/vision-encoder-decoder) | ✅ | ✅ | ✅ | +| [VisionTextDualEncoder](model_doc/vision-text-dual-encoder) | ✅ | ✅ | ✅ | +| [VisualBERT](model_doc/visual_bert) | ✅ | ❌ | ❌ | +| [ViT](model_doc/vit) | ✅ | ✅ | ✅ | +| [ViT Hybrid](model_doc/vit_hybrid) | ✅ | ❌ | ❌ | +| [VitDet](model_doc/vitdet) | ✅ | ❌ | ❌ | +| [ViTMAE](model_doc/vit_mae) | ✅ | ✅ | ❌ | +| [ViTMatte](model_doc/vitmatte) | 
✅ | ❌ | ❌ | +| [ViTMSN](model_doc/vit_msn) | ✅ | ❌ | ❌ | +| [VITS](model_doc/vits) | ✅ | ❌ | ❌ | +| [ViViT](model_doc/vivit) | ✅ | ❌ | ❌ | +| [Wav2Vec2](model_doc/wav2vec2) | ✅ | ✅ | ✅ | +| [Wav2Vec2-Conformer](model_doc/wav2vec2-conformer) | ✅ | ❌ | ❌ | +| [Wav2Vec2Phoneme](model_doc/wav2vec2_phoneme) | ✅ | ✅ | ✅ | +| [WavLM](model_doc/wavlm) | ✅ | ❌ | ❌ | +| [Whisper](model_doc/whisper) | ✅ | ✅ | ✅ | +| [X-CLIP](model_doc/xclip) | ✅ | ❌ | ❌ | +| [X-MOD](model_doc/xmod) | ✅ | ❌ | ❌ | +| [XGLM](model_doc/xglm) | ✅ | ✅ | ✅ | +| [XLM](model_doc/xlm) | ✅ | ✅ | ❌ | +| [XLM-ProphetNet](model_doc/xlm-prophetnet) | ✅ | ❌ | ❌ | +| [XLM-RoBERTa](model_doc/xlm-roberta) | ✅ | ✅ | ✅ | +| [XLM-RoBERTa-XL](model_doc/xlm-roberta-xl) | ✅ | ❌ | ❌ | +| [XLM-V](model_doc/xlm-v) | ✅ | ✅ | ✅ | +| [XLNet](model_doc/xlnet) | ✅ | ✅ | ❌ | +| [XLS-R](model_doc/xls_r) | ✅ | ✅ | ✅ | +| [XLSR-Wav2Vec2](model_doc/xlsr_wav2vec2) | ✅ | ✅ | ✅ | +| [YOLOS](model_doc/yolos) | ✅ | ❌ | ❌ | +| [YOSO](model_doc/yoso) | ✅ | ❌ | ❌ | + + diff --git a/docs/source/zh/_toctree.yml b/docs/source/zh/_toctree.yml index fe91eabf06f581..dd3eb7c3afc121 100644 --- a/docs/source/zh/_toctree.yml +++ b/docs/source/zh/_toctree.yml @@ -1,6 +1,137 @@ - sections: - local: index - title: 🤗 Transformers简介 + title: 🤗 Transformers 简介 - local: quicktour title: 快速上手 - title: 开始使用 \ No newline at end of file + - local: installation + title: 安装 + title: 开始使用 +- sections: + - local: pipeline_tutorial + title: 使用pipelines进行推理 + - local: autoclass_tutorial + title: 使用AutoClass编写可移植的代码 + - local: preprocessing + title: 预处理数据 + - local: training + title: 微调预训练模型 + - local: run_scripts + title: 通过脚本训练模型 + - local: accelerate + title: 使用🤗Accelerate进行分布式训练 + - local: peft + title: 使用🤗 PEFT加载和训练adapters + - local: model_sharing + title: 分享您的模型 + - local: transformers_agents + title: agents教程 + - local: llm_tutorial + title: 使用LLMs进行生成 + title: 教程 +- sections: + - local: fast_tokenizers + title: 使用 🤗 Tokenizers 中的分词器 + - local: multilingual + title: 使用多语言模型进行推理 + - local: create_a_model + title: 使用特定于模型的 API + - local: custom_models + title: 共享自定义模型 + - local: serialization + title: 导出为 ONNX + - local: tflite + title: 导出为 TFLite + title: 开发者指南 +- sections: + - local: performance + title: 综述 + - sections: + - local: perf_hardware + title: 用于训练的定制硬件 + - local: hpo_train + title: 使用Trainer API 进行超参数搜索 + title: 高效训练技术 + - local: big_models + title: 实例化大模型 + - local: debugging + title: 问题定位及解决 + - local: tf_xla + title: TensorFlow模型的XLA集成 + - local: perf_torch_compile + title: 使用 `torch.compile()` 优化推理 + title: 性能和可扩展性 +- sections: + - local: contributing + title: 如何为 🤗 Transformers 做贡献? 
+ title: 贡献 +- sections: + - local: task_summary + title: 🤗Transformers能做什么 + - local: tokenizer_summary + title: 分词器的摘要 + title: 概念指南 +- sections: + - sections: + - local: main_classes/agent + title: Agents和工具 + - local: main_classes/callback + title: Callbacks + - local: main_classes/configuration + title: Configuration + - local: main_classes/data_collator + title: Data Collator + - local: main_classes/keras_callbacks + title: Keras callbacks + - local: main_classes/logging + title: Logging + - local: main_classes/model + title: 模型 + - local: main_classes/text_generation + title: 文本生成 + - local: main_classes/onnx + title: ONNX + - local: main_classes/optimizer_schedules + title: Optimization + - local: main_classes/output + title: 模型输出 + - local: main_classes/pipelines + title: Pipelines + - local: main_classes/processors + title: Processors + - local: main_classes/quantization + title: Quantization + - local: main_classes/tokenizer + title: Tokenizer + - local: main_classes/trainer + title: Trainer + - local: main_classes/deepspeed + title: DeepSpeed集成 + - local: main_classes/feature_extractor + title: Feature Extractor + - local: main_classes/image_processor + title: Image Processor + title: 主要类 + - sections: + - local: internal/modeling_utils + title: 自定义层和工具 + - local: internal/pipelines_utils + title: pipelines工具 + - local: internal/tokenization_utils + title: Tokenizers工具 + - local: internal/trainer_utils + title: 训练器工具 + - local: internal/generation_utils + title: 生成工具 + - local: internal/image_processing_utils + title: 图像处理工具 + - local: internal/audio_utils + title: 音频处理工具 + - local: internal/file_utils + title: 通用工具 + - local: internal/time_series_utils + title: 时序数据工具 + title: 内部辅助工具 + title: 应用程序接口 (API) + + + diff --git a/docs/source/zh/accelerate.md b/docs/source/zh/accelerate.md new file mode 100644 index 00000000000000..12c626199477dd --- /dev/null +++ b/docs/source/zh/accelerate.md @@ -0,0 +1,132 @@ + + +# 🤗 加速分布式训练 + +随着模型变得越来越大,并行性已经成为在有限硬件上训练更大模型和加速训练速度的策略,增加了数个数量级。在Hugging Face,我们创建了[🤗 加速](https://huggingface.co/docs/accelerate)库,以帮助用户在任何类型的分布式设置上轻松训练🤗 Transformers模型,无论是在一台机器上的多个GPU还是在多个机器上的多个GPU。在本教程中,了解如何自定义您的原生PyTorch训练循环,以启用分布式环境中的训练。 + +## 设置 + +通过安装🤗 加速开始: + +```bash +pip install accelerate +``` + +然后导入并创建[`~accelerate.Accelerator`]对象。[`~accelerate.Accelerator`]将自动检测您的分布式设置类型,并初始化所有必要的训练组件。您不需要显式地将模型放在设备上。 + +```py +>>> from accelerate import Accelerator + +>>> accelerator = Accelerator() +``` + +## 准备加速 + +下一步是将所有相关的训练对象传递给[`~accelerate.Accelerator.prepare`]方法。这包括您的训练和评估DataLoader、一个模型和一个优化器: + +```py +>>> train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare( +... train_dataloader, eval_dataloader, model, optimizer +... ) +``` + +## 反向传播 + +最后一步是用🤗 加速的[`~accelerate.Accelerator.backward`]方法替换训练循环中的典型`loss.backward()`: + +```py +>>> for epoch in range(num_epochs): +... for batch in train_dataloader: +... outputs = model(**batch) +... loss = outputs.loss +... accelerator.backward(loss) + +... optimizer.step() +... lr_scheduler.step() +... optimizer.zero_grad() +... progress_bar.update(1) +``` + +如您在下面的代码中所见,您只需要添加四行额外的代码到您的训练循环中即可启用分布式训练! 
+ +```diff ++ from accelerate import Accelerator + from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler + ++ accelerator = Accelerator() + + model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2) + optimizer = AdamW(model.parameters(), lr=3e-5) + +- device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu") +- model.to(device) + ++ train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare( ++ train_dataloader, eval_dataloader, model, optimizer ++ ) + + num_epochs = 3 + num_training_steps = num_epochs * len(train_dataloader) + lr_scheduler = get_scheduler( + "linear", + optimizer=optimizer, + num_warmup_steps=0, + num_training_steps=num_training_steps + ) + + progress_bar = tqdm(range(num_training_steps)) + + model.train() + for epoch in range(num_epochs): + for batch in train_dataloader: +- batch = {k: v.to(device) for k, v in batch.items()} + outputs = model(**batch) + loss = outputs.loss +- loss.backward() ++ accelerator.backward(loss) + + optimizer.step() + lr_scheduler.step() + optimizer.zero_grad() + progress_bar.update(1) +``` + +## 训练 + +在添加了相关代码行后,可以在脚本或笔记本(如Colaboratory)中启动训练。 + +### 用脚本训练 + +如果您从脚本中运行训练,请运行以下命令以创建和保存配置文件: + +```bash +accelerate config +``` + +然后使用以下命令启动训练: + +```bash +accelerate launch train.py +``` + +### 用笔记本训练 + +🤗 加速还可以在笔记本中运行,如果您计划使用Colaboratory的TPU,则可在其中运行。将负责训练的所有代码包装在一个函数中,并将其传递给[`~accelerate.notebook_launcher`]: + +```py +>>> from accelerate import notebook_launcher + +>>> notebook_launcher(training_function) +``` + +有关🤗 加速及其丰富功能的更多信息,请参阅[文档](https://huggingface.co/docs/accelerate)。 \ No newline at end of file diff --git a/docs/source/zh/autoclass_tutorial.md b/docs/source/zh/autoclass_tutorial.md new file mode 100644 index 00000000000000..7205aa0872d161 --- /dev/null +++ b/docs/source/zh/autoclass_tutorial.md @@ -0,0 +1,149 @@ + + +# 使用AutoClass加载预训练实例 + +由于存在许多不同的Transformer架构,因此为您的checkpoint创建一个可用架构可能会具有挑战性。通过`AutoClass`可以自动推断并从给定的checkpoint加载正确的架构, 这也是🤗 Transformers易于使用、简单且灵活核心规则的重要一部分。`from_pretrained()`方法允许您快速加载任何架构的预训练模型,因此您不必花费时间和精力从头开始训练模型。生成这种与checkpoint无关的代码意味着,如果您的代码适用于一个checkpoint,它将适用于另一个checkpoint - 只要它们是为了类似的任务进行训练的 - 即使架构不同。 + + + +请记住,架构指的是模型的结构,而checkpoints是给定架构的权重。例如,[BERT](https://huggingface.co/google-bert/bert-base-uncased)是一种架构,而`google-bert/bert-base-uncased`是一个checkpoint。模型是一个通用术语,可以指代架构或checkpoint。 + + + + +在这个教程中,学习如何: + +* 加载预训练的分词器(`tokenizer`) +* 加载预训练的图像处理器(`image processor`) +* 加载预训练的特征提取器(`feature extractor`) +* 加载预训练的处理器(`processor`) +* 加载预训练的模型。 + + +## AutoTokenizer + +几乎所有的NLP任务都以`tokenizer`开始。`tokenizer`将您的输入转换为模型可以处理的格式。 + +使用[`AutoTokenizer.from_pretrained`]加载`tokenizer`: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") +``` + +然后按照如下方式对输入进行分词: + +```py +>>> sequence = "In a hole in the ground there lived a hobbit." 
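+>>> # 注:分词器会返回一个包含 input_ids、token_type_ids 和 attention_mask 的字典(见下方输出)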
+>>> print(tokenizer(sequence)) +{'input_ids': [101, 1999, 1037, 4920, 1999, 1996, 2598, 2045, 2973, 1037, 7570, 10322, 4183, 1012, 102], + 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]} +``` + +## AutoImageProcessor + +对于视觉任务,`image processor`将图像处理成正确的输入格式。 + +```py +>>> from transformers import AutoImageProcessor + +>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224") +``` + + +## AutoFeatureExtractor + +对于音频任务,`feature extractor`将音频信号处理成正确的输入格式。 + +使用[`AutoFeatureExtractor.from_pretrained`]加载`feature extractor`: + +```py +>>> from transformers import AutoFeatureExtractor + +>>> feature_extractor = AutoFeatureExtractor.from_pretrained( +... "ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition" +... ) +``` + +## AutoProcessor + +多模态任务需要一种`processor`,将两种类型的预处理工具结合起来。例如,[LayoutLMV2](model_doc/layoutlmv2)模型需要一个`image processo`来处理图像和一个`tokenizer`来处理文本;`processor`将两者结合起来。 + +使用[`AutoProcessor.from_pretrained`]加载`processor`: + + +```py +>>> from transformers import AutoProcessor + +>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv2-base-uncased") +``` + +## AutoModel + + + + +最后,`AutoModelFor`类让你可以加载给定任务的预训练模型(参见[这里](model_doc/auto)获取可用任务的完整列表)。例如,使用[`AutoModelForSequenceClassification.from_pretrained`]加载用于序列分类的模型: + +```py +>>> from transformers import AutoModelForSequenceClassification + +>>> model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") +``` + +轻松地重复使用相同的checkpoint来为不同任务加载模型架构: + + +```py +>>> from transformers import AutoModelForTokenClassification + +>>> model = AutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased") +``` + + + +对于PyTorch模型,`from_pretrained()`方法使用`torch.load()`,它内部使用已知是不安全的`pickle`。一般来说,永远不要加载来自不可信来源或可能被篡改的模型。对于托管在Hugging Face Hub上的公共模型,这种安全风险在一定程度上得到了缓解,因为每次提交都会进行[恶意软件扫描](https://huggingface.co/docs/hub/security-malware)。请参阅[Hub文档](https://huggingface.co/docs/hub/security)以了解最佳实践,例如使用GPG进行[签名提交验证](https://huggingface.co/docs/hub/security-gpg#signing-commits-with-gpg)。 + +TensorFlow和Flax的checkpoints不受影响,并且可以在PyTorch架构中使用`from_tf`和`from_flax`关键字参数,通过`from_pretrained`方法进行加载,来绕过此问题。 + + + +一般来说,我们建议使用`AutoTokenizer`类和`AutoModelFor`类来加载预训练的模型实例。这样可以确保每次加载正确的架构。在下一个[教程](preprocessing)中,学习如何使用新加载的`tokenizer`, `image processor`, `feature extractor`和`processor`对数据集进行预处理以进行微调。 + + + +最后,`TFAutoModelFor`类允许您加载给定任务的预训练模型(请参阅[这里](model_doc/auto)获取可用任务的完整列表)。例如,使用[`TFAutoModelForSequenceClassification.from_pretrained`]加载用于序列分类的模型: + +```py +>>> from transformers import TFAutoModelForSequenceClassification + +>>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") +``` + +轻松地重复使用相同的checkpoint来为不同任务加载模型架构: + +```py +>>> from transformers import TFAutoModelForTokenClassification + +>>> model = TFAutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-uncased") +``` +一般来说,我们推荐使用`AutoTokenizer`类和`TFAutoModelFor`类来加载模型的预训练实例。这样可以确保每次加载正确的架构。在下一个[教程](preprocessing)中,学习如何使用新加载的`tokenizer`, `image processor`, `feature extractor`和`processor`对数据集进行预处理以进行微调。 + + + diff --git a/docs/source/zh/big_models.md b/docs/source/zh/big_models.md new file mode 100644 index 00000000000000..2215c706618206 --- /dev/null +++ b/docs/source/zh/big_models.md @@ -0,0 +1,123 @@ + + +# 实例化大型模型 + +当你想使用一个非常大的预训练模型时,一个挑战是尽量减少对内存的使用。通常从PyTorch开始的工作流程如下: + +1. 用随机权重创建你的模型。 +2. 加载你的预训练权重。 +3. 
将这些预训练权重放入你的随机模型中。 + +步骤1和2都需要完整版本的模型在内存中,这在大多数情况下不是问题,但如果你的模型开始达到几个GB的大小,这两个副本可能会让你超出内存的限制。更糟糕的是,如果你使用`torch.distributed`来启动分布式训练,每个进程都会加载预训练模型并将这两个副本存储在内存中。 + + + +请注意,随机创建的模型使用“空”张量进行初始化,这些张量占用内存空间但不填充它(因此随机值是给定时间内该内存块中的任何内容)。在第3步之后,对未初始化的权重执行适合模型/参数种类的随机初始化(例如正态分布),以尽可能提高速度! + + + +在本指南中,我们将探讨 Transformers 提供的解决方案来处理这个问题。请注意,这是一个积极开发的领域,因此这里解释的API在将来可能会略有变化。 + +## 分片checkpoints + +自4.18.0版本起,占用空间超过10GB的模型检查点将自动分成较小的片段。在使用`model.save_pretrained(save_dir)`时,您最终会得到几个部分`checkpoints`(每个的大小都小于10GB)以及一个索引,该索引将参数名称映射到存储它们的文件。 + +您可以使用`max_shard_size`参数来控制分片之前的最大大小。为了示例的目的,我们将使用具有较小分片大小的普通大小的模型:让我们以传统的BERT模型为例。 + + +```py +from transformers import AutoModel + +model = AutoModel.from_pretrained("google-bert/bert-base-cased") +``` + +如果您使用 [`PreTrainedModel.save_pretrained`](模型预训练保存) 进行保存,您将得到一个新的文件夹,其中包含两个文件:模型的配置和权重: + +```py +>>> import os +>>> import tempfile + +>>> with tempfile.TemporaryDirectory() as tmp_dir: +... model.save_pretrained(tmp_dir) +... print(sorted(os.listdir(tmp_dir))) +['config.json', 'pytorch_model.bin'] +``` + +现在让我们使用最大分片大小为200MB: + +```py +>>> with tempfile.TemporaryDirectory() as tmp_dir: +... model.save_pretrained(tmp_dir, max_shard_size="200MB") +... print(sorted(os.listdir(tmp_dir))) +['config.json', 'pytorch_model-00001-of-00003.bin', 'pytorch_model-00002-of-00003.bin', 'pytorch_model-00003-of-00003.bin', 'pytorch_model.bin.index.json'] +``` + +在模型配置文件最上方,我们可以看到三个不同的权重文件,以及一个`index.json`索引文件。这样的`checkpoint`可以使用[`~PreTrainedModel.from_pretrained`]方法完全重新加载: + +```py +>>> with tempfile.TemporaryDirectory() as tmp_dir: +... model.save_pretrained(tmp_dir, max_shard_size="200MB") +... new_model = AutoModel.from_pretrained(tmp_dir) +``` + +对于大型模型来说,这样做的主要优点是在上述工作流程的步骤2中,每个`checkpoint`的分片在前一个分片之后加载,从而将内存中的内存使用限制在模型大小加上最大分片的大小。 + +在后台,索引文件用于确定`checkpoint`中包含哪些键以及相应的权重存储在哪里。我们可以像加载任何json一样加载该索引,并获得一个字典: + +```py +>>> import json + +>>> with tempfile.TemporaryDirectory() as tmp_dir: +... model.save_pretrained(tmp_dir, max_shard_size="200MB") +... with open(os.path.join(tmp_dir, "pytorch_model.bin.index.json"), "r") as f: +... index = json.load(f) + +>>> print(index.keys()) +dict_keys(['metadata', 'weight_map']) +``` + +目前元数据仅包括模型的总大小。我们计划在将来添加其他信息: +```py +>>> index["metadata"] +{'total_size': 433245184} +``` + +权重映射是该索引的主要部分,它将每个参数的名称(通常在PyTorch模型的`state_dict`中找到)映射到存储该参数的文件: + +```py +>>> index["weight_map"] +{'embeddings.LayerNorm.bias': 'pytorch_model-00001-of-00003.bin', + 'embeddings.LayerNorm.weight': 'pytorch_model-00001-of-00003.bin', + ... +``` + +如果您想直接在模型内部加载这样的分片`checkpoint`,而不使用 [`PreTrainedModel.from_pretrained`](就像您会为完整`checkpoint`执行 `model.load_state_dict()` 一样),您应该使用 [`modeling_utils.load_sharded_checkpoint`]: + + +```py +>>> from transformers.modeling_utils import load_sharded_checkpoint + +>>> with tempfile.TemporaryDirectory() as tmp_dir: +... model.save_pretrained(tmp_dir, max_shard_size="200MB") +... 
load_sharded_checkpoint(model, tmp_dir) +``` + +## 低内存加载 + +分片`checkpoints`在上述工作流的第2步中降低了内存使用,但为了在低内存环境中使用该模型,我们建议使用基于 Accelerate 库的工具。 + +请阅读以下指南以获取更多信息:[使用 Accelerate 进行大模型加载](./main_classes/model#large-model-loading) diff --git a/docs/source/zh/contributing.md b/docs/source/zh/contributing.md new file mode 100644 index 00000000000000..f430e8a85f16cd --- /dev/null +++ b/docs/source/zh/contributing.md @@ -0,0 +1,331 @@ + + +# 为 🤗 Transformers 做贡献 + +欢迎所有人为 🤗 Transformers 做出贡献,我们重视每个人的贡献。代码贡献并不是帮助社区的唯一途径。回答问题、帮助他人和改进文档也非常有价值。 + +宣传 🤗 Transformers 也会帮助我们!比如在博客文章里介绍一下这个库是如何帮助你完成了很棒的项目,每次它帮助你时都在 Twitter 上大声宣传,或者给这个代码仓库点⭐️来表示感谢。 + +无论你选择以哪种方式做出贡献,请注意并尊重我们的[行为准则](https://github.com/huggingface/transformers/blob/main/CODE_OF_CONDUCT.md)。 + +**本指南的灵感来源于 [scikit-learn贡献指南](https://github.com/scikit-learn/scikit-learn/blob/main/CONTRIBUTING.md) ,它令人印象深刻.** + +## 做贡献的方法 + +有多种方法可以为 🤗 Transformers 做贡献: + +* 修复现有代码中尚未解决的问题。 +* 提交与 bug 或所需新功能相关的 issue。 +* 实现新的模型。 +* 为示例或文档做贡献。 + +如果你不知道从哪里开始,有一个特别的 [Good First Issue](https://github.com/huggingface/transformers/contribute) 列表。它会列出一些适合初学者的开放的 issues,并帮助你开始为开源项目做贡献。只需要在你想要处理的 issue 下发表评论就行。 + +如果想要稍微更有挑战性的内容,你也可以查看 [Good Second Issue](https://github.com/huggingface/transformers/labels/Good%20Second%20Issue) 列表。总的来说,如果你觉得自己知道该怎么做,就去做吧,我们会帮助你达到目标的!🚀 + +> 所有的贡献对社区来说都同样宝贵。🥰 + +## 修复尚未解决的问题 + +如果你发现现有代码中存在问题,并且已经想到了解决方法,请随时[开始贡献](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md/#create-a-pull-request) 并创建一个 Pull Request! + +## 提交与 bug 相关的 issue 或功能请求 + +在提交与错误相关的 issue 或功能请求时,请尽量遵循下面的指南。这能让我们更容易迅速回复你,并提供良好的反馈意见。 + +### 你发现了 bug 吗? + +🤗 Transformers 之所以强大可靠,要感谢用户报告了他们遇到的问题。 + +在提出issue之前,请你**确认该 bug 尚未被报告**(使用 GitHub 的 Issues 下面的搜索栏)。issue 也应该是与库本身的 bug 有关,而不是与你的代码有关。如果不确定 bug 是在你的代码中还是在库中,请先在[论坛](https://discuss.huggingface.co/)中询问。这有助于我们更快地解决与库相关的问题。 + +一旦你确认该 bug 尚未被报告,请在你的 issue 中包含以下信息,以便我们快速解决: + +* 使用的**操作系统类型和版本**,以及 **Python**、**PyTorch** 和 **TensorFlow** 的版本。 +* 一个简短、独立的代码片段,可以让我们在不到30秒内重现这个问题。 +* 如果发生异常,请提供*完整的* traceback。 +* 附上你认为可能有帮助的任何其他附加信息,如屏幕截图。 + +想要自动获取操作系统和软件版本,请运行以下命令: + +```bash +transformers-cli env +``` + +你也可以从代码仓库的根目录下运行相同的命令: + +```bash +python src/transformers/commands/transformers_cli.py env +``` + +### 你想要新功能吗? + +如果你希望在 🤗 Transformers 中看到新功能,请提出一个 issue 并包含以下内容: + +1. 这个新功能的*动机*是什么呢?是因为使用这个库时遇到了问题或者感到了某种不满吗?是因为你的项目需要这个功能吗?或者是你自己开发了某项内容,并且认为它可能会对社区有所帮助? + + 不管是什么,我们都很想听! + +2. 请尽可能详细地描述你想要的功能。你告诉我们的越多,我们就能更好地帮助你。 +3. 请提供一个*代码片段*,演示该功能的使用方法。 +4. 如果这个功能与某篇论文相关,请包含链接。 + +如果你描述得足够清晰,那么在你创建 issue 时,我们已经完成了80%的工作。 + +我们已经添加了[模板](https://github.com/huggingface/transformers/tree/main/templates),可能有助于你提出 issue。 + +## 你想要实现一个新模型吗? + +我们会持续发布新模型,如果你想要实现一个新模型,请提供以下信息: + +* 模型的简要描述和论文链接。 +* 如果实现是开源的,请提供实现的链接。 +* 如果模型权重可用,请提供模型权重的链接。 + +如果你想亲自贡献模型,请告诉我们。让我们帮你把它添加到 🤗 Transformers! + +我们已经添加了[详细的指南和模板](https://github.com/huggingface/transformers/tree/main/templates)来帮助你添加新模型。我们还有一个更技术性的指南,告诉你[如何将模型添加到 🤗 Transformers](https://huggingface.co/docs/transformers/add_new_model)。 + +## 你想要添加文档吗? + +我们始终在寻求改进文档,使其更清晰准确。请告诉我们如何改进文档,比如拼写错误以及任何缺失、不清楚或不准确的内容。我们非常乐意进行修改,如果你有兴趣,我们也可以帮助你做出贡献! 
+ +有关如何生成、构建和编写文档的更多详细信息,请查看文档 [README](https://github.com/huggingface/transformers/tree/main/docs)。 + +## 创建 Pull Request + +在开始编写任何代码之前,我们强烈建议你先搜索现有的 PR(Pull Request) 或 issue,以确保没有其他人已经在做同样的事情。如果你不确定,提出 issue 来获取反馈意见是一个好办法。 + +要为 🤗 Transformers 做贡献,你需要基本的 `git` 使用技能。虽然 `git` 不是一个很容易使用的工具,但它提供了非常全面的手册,在命令行中输入 `git --help` 并享受吧!如果你更喜欢书籍,[Pro Git](https://git-scm.com/book/en/v2)是一本很好的参考书。 + +要为 🤗 Transformers 做贡献,你需要 **[Python 3.8](https://github.com/huggingface/transformers/blob/main/setup.py#L426)** 或更高版本。请按照以下步骤开始贡献: + +1. 点击[仓库](https://github.com/huggingface/transformers)页面上的 **[Fork](https://github.com/huggingface/transformers/fork)** 按钮,这会在你的 GitHub 账号下拷贝一份代码。 + +2. 把派生仓库克隆到本地磁盘,并将基础仓库添加为远程仓库: + + ```bash + git clone git@github.com:/transformers.git + cd transformers + git remote add upstream https://github.com/huggingface/transformers.git + ``` + +3. 创建一个新的分支来保存你的更改: + + ```bash + git checkout -b a-descriptive-name-for-my-changes + ``` + + 🚨 **不要**在 `main` 分支工作! + +4. 在虚拟环境中运行以下命令来设置开发环境: + + ```bash + pip install -e ".[dev]" + ``` + + 如果在虚拟环境中已经安装了 🤗 Transformers,请先使用 `pip uninstall transformers` 卸载它,然后再用 `-e` 参数以可编辑模式重新安装。 + + 根据你的操作系统,以及 Transformers 的可选依赖项数量的增加,可能会在执行此命令时出现失败。如果出现这种情况,请确保已经安装了你想使用的深度学习框架(PyTorch, TensorFlow 和 Flax),然后执行以下操作: + + ```bash + pip install -e ".[quality]" + ``` + + 大多数情况下,这些应该够用了。 + +5. 在你的分支上开发相关功能。 + + 在编写代码时,请确保测试套件通过。用下面的方式运行受你的更改影响的测试: + + ```bash + pytest tests/.py + ``` + + 想了解更多关于测试的信息,请阅读[测试](https://huggingface.co/docs/transformers/testing)指南。 + + 🤗 Transformers 使用 `black` 和 `ruff` 来保持代码风格的一致性。进行更改后,使用以下命令自动执行格式更正和代码验证: + + ```bash + make fixup + ``` + + 它已经被优化为仅适用于你创建的 PR 所修改过的文件。 + + 如果想要逐个运行检查,可以使用以下命令: + + ```bash + make style + ``` + + 🤗 Transformers 还使用了 `ruff` 和一些自定义脚本来检查编码错误。虽然质量管理是通过 CI 进行的,但你也可以使用以下命令来运行相同的检查: + + ```bash + make quality + ``` + + 最后,我们有许多脚本来确保在添加新模型时不会忘记更新某些文件。你可以使用以下命令运行这些脚本: + + ```bash + make repo-consistency + ``` + + 想要了解有关这些检查及如何解决相关问题的更多信息,请阅读 [检查 Pull Request](https://huggingface.co/docs/transformers/pr_checks) 指南。 + + 如果你修改了 `docs/source` 目录下的文档,请确保文档仍然能够被构建。这个检查也会在你创建 PR 时在 CI 中运行。如果要进行本地检查,请确保安装了文档构建工具: + + ```bash + pip install ".[docs]" + ``` + + 在仓库的根目录下运行以下命令: + + ```bash + doc-builder build transformers docs/source/en --build_dir ~/tmp/test-build + ``` + + 这将会在 `~/tmp/test-build` 文件夹中构建文档,你可以使用自己喜欢的编辑器查看生成的 Markdown 文件。当你创建 PR 时,也可以在GitHub上预览文档。 + + 当你对修改满意后,使用 `git add` 把修改的文件添加到暂存区,然后使用 `git commit` 在本地记录你的更改: + + ```bash + git add modified_file.py + git commit + ``` + + 请记得写一个[好的提交信息](https://chris.beams.io/posts/git-commit/)来清晰地传达你所做的更改! + + 为了保持你的代码副本与原始仓库的最新状态一致,在你创建 PR *之前*或者在管理员要求的情况下,把你的分支在 `upstream/branch` 上进行 rebase: + + ```bash + git fetch upstream + git rebase upstream/main + ``` + + 把你的更改推送到你的分支: + + ```bash + git push -u origin a-descriptive-name-for-my-changes + ``` + + 如果你已经创建了一个 PR,你需要使用 `--force` 参数进行强制推送。如果 PR 还没有被创建,你可以正常推送你的更改。 + +6. 现在你可以转到 GitHub 上你的账号下的派生仓库,点击 **Pull Request** 来创建一个 PR。 请确保勾选我们 [checklist](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md/#pull-request-checklist) 下的所有项目。准备好这些后,可以将你的更改发送给项目管理员进行审查。 + +7. 如果管理员要求你进行更改,别气馁,我们的核心贡献者也会经历相同的事情!请在你的本地分支上进行工作,并将更改推送到派生仓库,以便于每个人都可以在 PR 中看到你的更改。这样它们会自动出现在 PR 中。 + +### Pull request 的检查清单 + +☐ Pull request 的标题应该总结你的贡献内容。
+☐ 如果你的 Pull request 解决了一个 issue,请在 Pull request 描述中提及该 issue 的编号,以确保它们被关联起来(这样查看 issue 的人就知道你正在处理它)。
+☐ 如果是正在进行中的工作,请在标题前加上 [WIP]。这有助于避免重复工作和区分哪些 PR 可以合并。
+☐ 确保你的更改能够通过现有的测试。
+☐ 如果添加了新功能,请同时添加对应的测试。
+ - 如果添加一个新模型,请使用 `ModelTester.all_model_classes = (MyModel, MyModelWithLMHead,...)` 来触发通用测试。 + - 如果你正在添加新的 `@slow` 测试,请确保通过以下检查:`RUN_SLOW=1 python -m pytest tests/models/my_new_model/test_my_new_model.py` + - 如果你正在添加一个新的分词器,请编写测试并确保通过以下检查:`RUN_SLOW=1 python -m pytest tests/models/{your_model_name}/test_tokenization_{your_model_name}.py` + - CircleCI 不会运行时间较长的测试,但 GitHub Actions 每晚会运行所有测试!
+ +☐ 所有公共 method 必须具有信息文档(比如 [`modeling_bert.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py))。
+☐ 由于代码仓库的体积正在迅速增长,请避免添加图像、视频和其他非文本文件,它们会增加仓库的负担。请使用 [`hf-internal-testing`](https://huggingface.co/hf-internal-testing) 等 Hub 仓库来托管这些文件,并通过 URL 引用它们。我们建议将与文档相关的图片放置在以下仓库中:[huggingface/documentation-images](https://huggingface.co/datasets/huggingface/documentation-images)。你可以在这个数据集仓库上创建一个 PR,并请求 Hugging Face 成员进行合并。 + +要了解更多有关在 Pull request 上运行的检查的信息,请查看我们的 [检查 Pull Request](https://huggingface.co/docs/transformers/pr_checks) 指南。 + +### 测试 + +包含了广泛的测试套件来测试库的行为和一些示例。库测试可以在 [tests](https://github.com/huggingface/transformers/tree/main/tests) 文件夹中找到,示例测试可以在 [examples](https://github.com/huggingface/transformers/tree/main/examples) 文件夹中找到。 + +我们喜欢使用 `pytest` 和 `pytest-xdist`,因为它运行更快。在仓库的根目录,指定一个*子文件夹的路径或测试文件*来运行测试: + +```bash +python -m pytest -n auto --dist=loadfile -s -v ./tests/models/my_new_model +``` + +同样地,在 `examples` 目录,指定一个*子文件夹的路径或测试文件* 来运行测试。例如,以下命令会测试 PyTorch `examples` 目录中的文本分类子文件夹: + +```bash +pip install -r examples/xxx/requirements.txt # 仅在第一次需要 +python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/text-classification +``` + +实际上这就是我们的 `make test` 和 `make test-examples` 命令的实现方式(不包括 `pip install`)! + +你也可以指定一个较小的测试集来仅测试特定功能。 + +默认情况下,会跳过时间较长的测试,但你可以将 `RUN_SLOW` 环境变量设置为 `yes` 来运行它们。这将下载以 GB 为单位的模型文件,所以确保你有足够的磁盘空间、良好的网络连接和足够的耐心! + + + +记得指定一个*子文件夹的路径或测试文件*来运行测试。否则你将会运行 `tests` 或 `examples` 文件夹中的所有测试,它会花费很长时间! + + + +```bash +RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./tests/models/my_new_model +RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v ./examples/pytorch/text-classification +``` + +和时间较长的测试一样,还有其他环境变量在测试过程中,在默认情况下是未启用的: +- `RUN_CUSTOM_TOKENIZERS`: 启用自定义分词器的测试。 +- `RUN_PT_FLAX_CROSS_TESTS`: 启用 PyTorch + Flax 整合的测试。 +- `RUN_PT_TF_CROSS_TESTS`: 启用 TensorFlow + PyTorch 整合的测试。 + +更多环境变量和额外信息可以在 [testing_utils.py](src/transformers/testing_utils.py) 中找到。 + +🤗 Transformers 只是使用 `pytest` 作为测试运行程序,但测试套件本身没用任何与 `pytest` 相关的功能。 + +这意味着完全支持 `unittest` 。以下是如何使用 `unittest` 运行测试的方法: + +```bash +python -m unittest discover -s tests -t . -v +python -m unittest discover -s examples -t examples -v +``` + +### 风格指南 + +🤗 Transformers 的文档遵循 [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html)。请查看我们的 [文档编写指南](https://github.com/huggingface/transformers/tree/main/docs#writing-documentation---specification) 来获取更多信息。 + +### 在 Windows 上开发 + +在 Windows 上(除非你正在使用 [Windows Subsystem for Linux](https://learn.microsoft.com/en-us/windows/wsl/) 或 WSL),你需要配置 git 将 Windows 的 `CRLF` 行结束符转换为 Linux 的 `LF` 行结束符: + +```bash +git config core.autocrlf input +``` + +在 Windows 上有一种方法可以运行 `make` 命令,那就是使用 MSYS2: + +1. [下载 MSYS2](https://www.msys2.org/),假设已经安装在 `C:\msys64`。 +2. 从命令行打开 `C:\msys64\msys2.exe` (可以在 **开始** 菜单中找到)。 +3. 在 shell 中运行: `pacman -Syu` ,并使用 `pacman -S make` 安装 `make`。 +4. 把 `C:\msys64\usr\bin` 添加到你的 PATH 环境变量中。 + +现在你可以在任何终端(PowerShell、cmd.exe 等)中使用 `make` 命令了! 🎉 + +### 将派生仓库与上游主仓库(Hugging Face 仓库)同步 + +更新派生仓库的主分支时,请按照以下步骤操作。这是为了避免向每个上游 PR 添加参考注释,同时避免向参与这些 PR 的开发人员发送不必要的通知。 + +1. 可以的话,请避免使用派生仓库上的分支和 PR 来与上游进行同步,而是直接合并到派生仓库的主分支。 +2. 
如果确实需要一个 PR,在检查你的分支后,请按照以下步骤操作: + + ```bash + git checkout -b your-branch-for-syncing + git pull --squash --no-commit upstream main + git commit -m '' + git push --set-upstream origin your-branch-for-syncing + ``` diff --git a/docs/source/zh/create_a_model.md b/docs/source/zh/create_a_model.md new file mode 100644 index 00000000000000..fd07497e7abf3a --- /dev/null +++ b/docs/source/zh/create_a_model.md @@ -0,0 +1,389 @@ + + +# 创建自定义架构 + +[`AutoClass`](model_doc/auto) 自动推断模型架构并下载预训练的配置和权重。一般来说,我们建议使用 `AutoClass` 生成与检查点(checkpoint)无关的代码。希望对特定模型参数有更多控制的用户,可以仅从几个基类创建自定义的 🤗 Transformers 模型。这对于任何有兴趣学习、训练或试验 🤗 Transformers 模型的人可能特别有用。通过本指南,深入了解如何不通过 `AutoClass` 创建自定义模型。了解如何: + +- 加载并自定义模型配置。 +- 创建模型架构。 +- 为文本创建慢速和快速分词器。 +- 为视觉任务创建图像处理器。 +- 为音频任务创建特征提取器。 +- 为多模态任务创建处理器。 + +## 配置 + +[配置](main_classes/configuration) 涉及到模型的具体属性。每个模型配置都有不同的属性;例如,所有 NLP 模型都共享 `hidden_size`、`num_attention_heads`、 `num_hidden_layers` 和 `vocab_size` 属性。这些属性用于指定构建模型时的注意力头数量或隐藏层层数。 + +访问 [`DistilBertConfig`] 以更近一步了解 [DistilBERT](model_doc/distilbert),检查它的属性: + +```py +>>> from transformers import DistilBertConfig + +>>> config = DistilBertConfig() +>>> print(config) +DistilBertConfig { + "activation": "gelu", + "attention_dropout": 0.1, + "dim": 768, + "dropout": 0.1, + "hidden_dim": 3072, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "model_type": "distilbert", + "n_heads": 12, + "n_layers": 6, + "pad_token_id": 0, + "qa_dropout": 0.1, + "seq_classif_dropout": 0.2, + "sinusoidal_pos_embds": false, + "transformers_version": "4.16.2", + "vocab_size": 30522 +} +``` + +[`DistilBertConfig`] 显示了构建基础 [`DistilBertModel`] 所使用的所有默认属性。所有属性都可以进行自定义,为实验创造了空间。例如,您可以将默认模型自定义为: + +- 使用 `activation` 参数尝试不同的激活函数。 +- 使用 `attention_dropout` 参数为 attention probabilities 使用更高的 dropout ratio。 + +```py +>>> my_config = DistilBertConfig(activation="relu", attention_dropout=0.4) +>>> print(my_config) +DistilBertConfig { + "activation": "relu", + "attention_dropout": 0.4, + "dim": 768, + "dropout": 0.1, + "hidden_dim": 3072, + "initializer_range": 0.02, + "max_position_embeddings": 512, + "model_type": "distilbert", + "n_heads": 12, + "n_layers": 6, + "pad_token_id": 0, + "qa_dropout": 0.1, + "seq_classif_dropout": 0.2, + "sinusoidal_pos_embds": false, + "transformers_version": "4.16.2", + "vocab_size": 30522 +} +``` + +预训练模型的属性可以在 [`~PretrainedConfig.from_pretrained`] 函数中进行修改: + +```py +>>> my_config = DistilBertConfig.from_pretrained("distilbert/distilbert-base-uncased", activation="relu", attention_dropout=0.4) +``` + +当你对模型配置满意时,可以使用 [`~PretrainedConfig.save_pretrained`] 来保存配置。你的配置文件将以 JSON 文件的形式存储在指定的保存目录中: + +```py +>>> my_config.save_pretrained(save_directory="./your_model_save_path") +``` + +要重用配置文件,请使用 [`~PretrainedConfig.from_pretrained`] 进行加载: + +```py +>>> my_config = DistilBertConfig.from_pretrained("./your_model_save_path/config.json") +``` + + + +你还可以将配置文件保存为字典,甚至只保存自定义配置属性与默认配置属性之间的差异!有关更多详细信息,请参阅 [配置](main_classes/configuration) 文档。 + + + +## 模型 + +接下来,创建一个[模型](main_classes/models)。模型,也可泛指架构,定义了每一层网络的行为以及进行的操作。配置中的 `num_hidden_layers` 等属性用于定义架构。每个模型都共享基类 [`PreTrainedModel`] 和一些常用方法,例如调整输入嵌入的大小和修剪自注意力头。此外,所有模型都是 [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html)、[`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) 或 [`flax.linen.Module`](https://flax.readthedocs.io/en/latest/api_reference/flax.linen/module.html) 的子类。这意味着模型与各自框架的用法兼容。 + + + +将自定义配置属性加载到模型中: + +```py +>>> from transformers import DistilBertModel + +>>> my_config = 
DistilBertConfig.from_pretrained("./your_model_save_path/config.json") +>>> model = DistilBertModel(my_config) +``` + +这段代码创建了一个具有随机参数而不是预训练权重的模型。在训练该模型之前,您还无法将该模型用于任何用途。训练是一项昂贵且耗时的过程。通常来说,最好使用预训练模型来更快地获得更好的结果,同时仅使用训练所需资源的一小部分。 + +使用 [`~PreTrainedModel.from_pretrained`] 创建预训练模型: + +```py +>>> model = DistilBertModel.from_pretrained("distilbert/distilbert-base-uncased") +``` + +当加载预训练权重时,如果模型是由 🤗 Transformers 提供的,将自动加载默认模型配置。然而,如果你愿意,仍然可以将默认模型配置的某些或者所有属性替换成你自己的配置: + +```py +>>> model = DistilBertModel.from_pretrained("distilbert/distilbert-base-uncased", config=my_config) +``` + + +将自定义配置属性加载到模型中: + +```py +>>> from transformers import TFDistilBertModel + +>>> my_config = DistilBertConfig.from_pretrained("./your_model_save_path/my_config.json") +>>> tf_model = TFDistilBertModel(my_config) +``` + +这段代码创建了一个具有随机参数而不是预训练权重的模型。在训练该模型之前,您还无法将该模型用于任何用途。训练是一项昂贵且耗时的过程。通常来说,最好使用预训练模型来更快地获得更好的结果,同时仅使用训练所需资源的一小部分。 + +使用 [`~TFPreTrainedModel.from_pretrained`] 创建预训练模型: + +```py +>>> tf_model = TFDistilBertModel.from_pretrained("distilbert/distilbert-base-uncased") +``` + +当加载预训练权重时,如果模型是由 🤗 Transformers 提供的,将自动加载默认模型配置。然而,如果你愿意,仍然可以将默认模型配置的某些或者所有属性替换成自己的配置: + +```py +>>> tf_model = TFDistilBertModel.from_pretrained("distilbert/distilbert-base-uncased", config=my_config) +``` + + + +### 模型头(Model heads) + +此时,你已经有了一个输出*隐藏状态*的基础 DistilBERT 模型。隐藏状态作为输入传递到模型头以生成最终输出。🤗 Transformers 为每个任务提供不同的模型头,只要模型支持该任务(即,您不能使用 DistilBERT 来执行像翻译这样的序列到序列任务)。 + + + +例如,[`DistilBertForSequenceClassification`] 是一个带有序列分类头(sequence classification head)的基础 DistilBERT 模型。序列分类头是池化输出之上的线性层。 + +```py +>>> from transformers import DistilBertForSequenceClassification + +>>> model = DistilBertForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") +``` + +通过切换到不同的模型头,可以轻松地将此检查点重复用于其他任务。对于问答任务,你可以使用 [`DistilBertForQuestionAnswering`] 模型头。问答头(question answering head)与序列分类头类似,不同点在于它是隐藏状态输出之上的线性层。 + +```py +>>> from transformers import DistilBertForQuestionAnswering + +>>> model = DistilBertForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased") +``` + + +例如,[`TFDistilBertForSequenceClassification`] 是一个带有序列分类头(sequence classification head)的基础 DistilBERT 模型。序列分类头是池化输出之上的线性层。 + +```py +>>> from transformers import TFDistilBertForSequenceClassification + +>>> tf_model = TFDistilBertForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") +``` + +通过切换到不同的模型头,可以轻松地将此检查点重复用于其他任务。对于问答任务,你可以使用 [`TFDistilBertForQuestionAnswering`] 模型头。问答头(question answering head)与序列分类头类似,不同点在于它是隐藏状态输出之上的线性层。 + +```py +>>> from transformers import TFDistilBertForQuestionAnswering + +>>> tf_model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased") +``` + + + +## 分词器 + +在将模型用于文本数据之前,你需要的最后一个基类是 [tokenizer](main_classes/tokenizer),它用于将原始文本转换为张量。🤗 Transformers 支持两种类型的分词器: + +- [`PreTrainedTokenizer`]:分词器的Python实现 +- [`PreTrainedTokenizerFast`]:来自我们基于 Rust 的 [🤗 Tokenizer](https://huggingface.co/docs/tokenizers/python/latest/) 库的分词器。因为其使用了 Rust 实现,这种分词器类型的速度要快得多,尤其是在批量分词(batch tokenization)的时候。快速分词器还提供其他的方法,例如*偏移映射(offset mapping)*,它将标记(token)映射到其原始单词或字符。 + +这两种分词器都支持常用的方法,如编码和解码、添加新标记以及管理特殊标记。 + + + +并非每个模型都支持快速分词器。参照这张 [表格](index#supported-frameworks) 查看模型是否支持快速分词器。 + + + +如果您训练了自己的分词器,则可以从*词表*文件创建一个分词器: + +```py +>>> from transformers import DistilBertTokenizer + +>>> my_tokenizer = DistilBertTokenizer(vocab_file="my_vocab_file.txt", do_lower_case=False, padding_side="left") +``` + +请务必记住,自定义分词器生成的词表与预训练模型分词器生成的词表是不同的。如果使用预训练模型,则需要使用预训练模型的词表,否则输入将没有意义。 使用 
[`DistilBertTokenizer`] 类创建具有预训练模型词表的分词器: + +```py +>>> from transformers import DistilBertTokenizer + +>>> slow_tokenizer = DistilBertTokenizer.from_pretrained("distilbert/distilbert-base-uncased") +``` + +使用 [`DistilBertTokenizerFast`] 类创建快速分词器: + +```py +>>> from transformers import DistilBertTokenizerFast + +>>> fast_tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert/distilbert-base-uncased") +``` + + + +默认情况下,[`AutoTokenizer`] 将尝试加载快速标记生成器。你可以通过在 `from_pretrained` 中设置 `use_fast=False` 以禁用此行为。 + + + +## 图像处理器 + +图像处理器用于处理视觉输入。它继承自 [`~image_processing_utils.ImageProcessingMixin`] 基类。 + +要使用它,需要创建一个与你使用的模型关联的图像处理器。例如,如果你使用 [ViT](model_doc/vit) 进行图像分类,可以创建一个默认的 [`ViTImageProcessor`]: + +```py +>>> from transformers import ViTImageProcessor + +>>> vit_extractor = ViTImageProcessor() +>>> print(vit_extractor) +ViTImageProcessor { + "do_normalize": true, + "do_resize": true, + "image_processor_type": "ViTImageProcessor", + "image_mean": [ + 0.5, + 0.5, + 0.5 + ], + "image_std": [ + 0.5, + 0.5, + 0.5 + ], + "resample": 2, + "size": 224 +} +``` + + + +如果您不需要进行任何自定义,只需使用 `from_pretrained` 方法加载模型的默认图像处理器参数。 + + + +修改任何 [`ViTImageProcessor`] 参数以创建自定义图像处理器: + +```py +>>> from transformers import ViTImageProcessor + +>>> my_vit_extractor = ViTImageProcessor(resample="PIL.Image.BOX", do_normalize=False, image_mean=[0.3, 0.3, 0.3]) +>>> print(my_vit_extractor) +ViTImageProcessor { + "do_normalize": false, + "do_resize": true, + "image_processor_type": "ViTImageProcessor", + "image_mean": [ + 0.3, + 0.3, + 0.3 + ], + "image_std": [ + 0.5, + 0.5, + 0.5 + ], + "resample": "PIL.Image.BOX", + "size": 224 +} +``` + +## 特征提取器 + +特征提取器用于处理音频输入。它继承自 [`~feature_extraction_utils.FeatureExtractionMixin`] 基类,亦可继承 [`SequenceFeatureExtractor`] 类来处理音频输入。 + +要使用它,创建一个与你使用的模型关联的特征提取器。例如,如果你使用 [Wav2Vec2](model_doc/wav2vec2) 进行音频分类,可以创建一个默认的 [`Wav2Vec2FeatureExtractor`]: + +```py +>>> from transformers import Wav2Vec2FeatureExtractor + +>>> w2v2_extractor = Wav2Vec2FeatureExtractor() +>>> print(w2v2_extractor) +Wav2Vec2FeatureExtractor { + "do_normalize": true, + "feature_extractor_type": "Wav2Vec2FeatureExtractor", + "feature_size": 1, + "padding_side": "right", + "padding_value": 0.0, + "return_attention_mask": false, + "sampling_rate": 16000 +} +``` + + + +如果您不需要进行任何自定义,只需使用 `from_pretrained` 方法加载模型的默认特征提取器参数。 + + + +修改任何 [`Wav2Vec2FeatureExtractor`] 参数以创建自定义特征提取器: + +```py +>>> from transformers import Wav2Vec2FeatureExtractor + +>>> w2v2_extractor = Wav2Vec2FeatureExtractor(sampling_rate=8000, do_normalize=False) +>>> print(w2v2_extractor) +Wav2Vec2FeatureExtractor { + "do_normalize": false, + "feature_extractor_type": "Wav2Vec2FeatureExtractor", + "feature_size": 1, + "padding_side": "right", + "padding_value": 0.0, + "return_attention_mask": false, + "sampling_rate": 8000 +} +``` + + +## 处理器 + +对于支持多模式任务的模型,🤗 Transformers 提供了一个处理器类,可以方便地将特征提取器和分词器等处理类包装到单个对象中。例如,让我们使用 [`Wav2Vec2Processor`] 来执行自动语音识别任务 (ASR)。 ASR 将音频转录为文本,因此您将需要一个特征提取器和一个分词器。 + +创建一个特征提取器来处理音频输入: + +```py +>>> from transformers import Wav2Vec2FeatureExtractor + +>>> feature_extractor = Wav2Vec2FeatureExtractor(padding_value=1.0, do_normalize=True) +``` + +创建一个分词器来处理文本输入: + +```py +>>> from transformers import Wav2Vec2CTCTokenizer + +>>> tokenizer = Wav2Vec2CTCTokenizer(vocab_file="my_vocab_file.txt") +``` + +将特征提取器和分词器合并到 [`Wav2Vec2Processor`] 中: + +```py +>>> from transformers import Wav2Vec2Processor + +>>> processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer) +``` + +通过两个基类 - 配置类和模型类 - 
以及一个附加的预处理类(分词器、图像处理器、特征提取器或处理器),你可以创建 🤗 Transformers 支持的任何模型。 每个基类都是可配置的,允许你使用所需的特定属性。 你可以轻松设置模型进行训练或修改现有的预训练模型进行微调。 diff --git a/docs/source/zh/custom_models.md b/docs/source/zh/custom_models.md new file mode 100644 index 00000000000000..2603c394128552 --- /dev/null +++ b/docs/source/zh/custom_models.md @@ -0,0 +1,305 @@ + + +# 共享自定义模型 + +🤗 Transformers 库设计得易于扩展。每个模型的代码都在仓库给定的子文件夹中,没有进行抽象,因此你可以轻松复制模型代码文件并根据需要进行调整。 + +如果你要编写全新的模型,从头开始可能更容易。在本教程中,我们将向你展示如何编写自定义模型及其配置,以便可以在 Transformers 中使用它;以及如何与社区共享它(及其依赖的代码),以便任何人都可以使用,即使它不在 🤗 Transformers 库中。 + +我们将以 ResNet 模型为例,通过将 [timm 库](https://github.com/rwightman/pytorch-image-models) 的 ResNet 类封装到 [`PreTrainedModel`] 中来进行说明。 + +## 编写自定义配置 + +在深入研究模型之前,让我们首先编写其配置。模型的配置是一个对象,其中包含构建模型所需的所有信息。我们将在下一节中看到,模型只能接受一个 `config` 来进行初始化,因此我们很需要使该对象尽可能完整。 + +我们将采用一些我们可能想要调整的 ResNet 类的参数举例。不同的配置将为我们提供不同类型可能的 ResNet 模型。在确认其中一些参数的有效性后,我们只需存储这些参数。 + +```python +from transformers import PretrainedConfig +from typing import List + + +class ResnetConfig(PretrainedConfig): + model_type = "resnet" + + def __init__( + self, + block_type="bottleneck", + layers: List[int] = [3, 4, 6, 3], + num_classes: int = 1000, + input_channels: int = 3, + cardinality: int = 1, + base_width: int = 64, + stem_width: int = 64, + stem_type: str = "", + avg_down: bool = False, + **kwargs, + ): + if block_type not in ["basic", "bottleneck"]: + raise ValueError(f"`block_type` must be 'basic' or bottleneck', got {block_type}.") + if stem_type not in ["", "deep", "deep-tiered"]: + raise ValueError(f"`stem_type` must be '', 'deep' or 'deep-tiered', got {stem_type}.") + + self.block_type = block_type + self.layers = layers + self.num_classes = num_classes + self.input_channels = input_channels + self.cardinality = cardinality + self.base_width = base_width + self.stem_width = stem_width + self.stem_type = stem_type + self.avg_down = avg_down + super().__init__(**kwargs) +``` + +编写自定义配置时需要记住的三个重要事项如下: +- 必须继承自 `PretrainedConfig`, +- `PretrainedConfig` 的 `__init__` 方法必须接受任何 kwargs, +- 这些 `kwargs` 需要传递给超类的 `__init__` 方法。 + +继承是为了确保你获得来自 🤗 Transformers 库的所有功能,而另外两个约束源于 `PretrainedConfig` 的字段比你设置的字段多。在使用 `from_pretrained` 方法重新加载配置时,这些字段需要被你的配置接受,然后传递给超类。 + +为你的配置定义 `model_type`(此处为 `model_type="resnet"`)不是必须的,除非你想使用自动类注册你的模型(请参阅最后一节)。 + +做完这些以后,就可以像使用库里任何其他模型配置一样,轻松地创建和保存配置。以下代码展示了如何创建并保存 resnet50d 配置: + +```py +resnet50d_config = ResnetConfig(block_type="bottleneck", stem_width=32, stem_type="deep", avg_down=True) +resnet50d_config.save_pretrained("custom-resnet") +``` + +这行代码将在 `custom-resnet` 文件夹内保存一个名为 `config.json` 的文件。然后,你可以使用 `from_pretrained` 方法重新加载配置: + +```py +resnet50d_config = ResnetConfig.from_pretrained("custom-resnet") +``` + +你还可以使用 [`PretrainedConfig`] 类的任何其他方法,例如 [`~PretrainedConfig.push_to_hub`],直接将配置上传到 Hub。 + +## 编写自定义模型 + +有了 ResNet 配置后,就可以继续编写模型了。实际上,我们将编写两个模型:一个模型用于从一批图像中提取隐藏特征(类似于 [`BertModel`]),另一个模型适用于图像分类(类似于 [`BertForSequenceClassification`])。 + +正如之前提到的,我们只会编写一个松散的模型包装,以使示例保持简洁。在编写此类之前,只需要建立起块类型(block types)与实际块类(block classes)之间的映射。然后,通过将所有内容传递给ResNet类,从配置中定义模型: + +```py +from transformers import PreTrainedModel +from timm.models.resnet import BasicBlock, Bottleneck, ResNet +from .configuration_resnet import ResnetConfig + + +BLOCK_MAPPING = {"basic": BasicBlock, "bottleneck": Bottleneck} + + +class ResnetModel(PreTrainedModel): + config_class = ResnetConfig + + def __init__(self, config): + super().__init__(config) + block_layer = BLOCK_MAPPING[config.block_type] + self.model = ResNet( + block_layer, + config.layers, + num_classes=config.num_classes, + 
in_chans=config.input_channels, + cardinality=config.cardinality, + base_width=config.base_width, + stem_width=config.stem_width, + stem_type=config.stem_type, + avg_down=config.avg_down, + ) + + def forward(self, tensor): + return self.model.forward_features(tensor) +``` + +对用于进行图像分类的模型,我们只需更改前向方法: + +```py +import torch + + +class ResnetModelForImageClassification(PreTrainedModel): + config_class = ResnetConfig + + def __init__(self, config): + super().__init__(config) + block_layer = BLOCK_MAPPING[config.block_type] + self.model = ResNet( + block_layer, + config.layers, + num_classes=config.num_classes, + in_chans=config.input_channels, + cardinality=config.cardinality, + base_width=config.base_width, + stem_width=config.stem_width, + stem_type=config.stem_type, + avg_down=config.avg_down, + ) + + def forward(self, tensor, labels=None): + logits = self.model(tensor) + if labels is not None: + loss = torch.nn.cross_entropy(logits, labels) + return {"loss": loss, "logits": logits} + return {"logits": logits} +``` + +在这两种情况下,请注意我们如何继承 `PreTrainedModel` 并使用 `config` 调用了超类的初始化(有点像编写常规的torch.nn.Module)。设置 `config_class` 的那行代码不是必须的,除非你想使用自动类注册你的模型(请参阅最后一节)。 + + + +如果你的模型与库中的某个模型非常相似,你可以重用与该模型相同的配置。 + + + +你可以让模型返回任何你想要的内容,但是像我们为 `ResnetModelForImageClassification` 做的那样返回一个字典,并在传递标签时包含loss,可以使你的模型能够在 [`Trainer`] 类中直接使用。只要你计划使用自己的训练循环或其他库进行训练,也可以使用其他输出格式。 + +现在我们已经有了模型类,让我们创建一个: + +```py +resnet50d = ResnetModelForImageClassification(resnet50d_config) +``` + +同样的,你可以使用 [`PreTrainedModel`] 的任何方法,比如 [`~PreTrainedModel.save_pretrained`] 或者 [`~PreTrainedModel.push_to_hub`]。我们将在下一节中使用第二种方法,并了解如何如何使用我们的模型的代码推送模型权重。但首先,让我们在模型内加载一些预训练权重。 + +在你自己的用例中,你可能会在自己的数据上训练自定义模型。为了快速完成本教程,我们将使用 resnet50d 的预训练版本。由于我们的模型只是它的包装,转移这些权重将会很容易: + +```py +import timm + +pretrained_model = timm.create_model("resnet50d", pretrained=True) +resnet50d.model.load_state_dict(pretrained_model.state_dict()) +``` + +现在让我们看看,如何确保在执行 [`~PreTrainedModel.save_pretrained`] 或 [`~PreTrainedModel.push_to_hub`] 时,模型的代码被保存。 + +## 将代码发送到 Hub + + + +此 API 是实验性的,未来的发布中可能会有一些轻微的不兼容更改。 + + + +首先,确保你的模型在一个 `.py` 文件中完全定义。只要所有文件都位于同一目录中,它就可以依赖于某些其他文件的相对导入(目前我们还不为子模块支持此功能)。对于我们的示例,我们将在当前工作目录中名为 `resnet_model` 的文件夹中定义一个 `modeling_resnet.py` 文件和一个 `configuration_resnet.py` 文件。 配置文件包含 `ResnetConfig` 的代码,模型文件包含 `ResnetModel` 和 `ResnetModelForImageClassification` 的代码。 + +``` +. 
+└── resnet_model + ├── __init__.py + ├── configuration_resnet.py + └── modeling_resnet.py +``` + +`__init__.py` 可以为空,它的存在只是为了让 Python 检测到 `resnet_model` 可以用作模块。 + + + +如果从库中复制模型文件,你需要将文件顶部的所有相对导入替换为从 `transformers` 包中的导入。 + + + +请注意,你可以重用(或子类化)现有的配置/模型。 + +要与社区共享您的模型,请参照以下步骤:首先从新创建的文件中导入ResNet模型和配置: + +```py +from resnet_model.configuration_resnet import ResnetConfig +from resnet_model.modeling_resnet import ResnetModel, ResnetModelForImageClassification +``` + +接下来,你需要告诉库,当使用 `save_pretrained` 方法时,你希望复制这些对象的代码文件,并将它们正确注册到给定的 Auto 类(特别是对于模型),只需要运行以下代码: + +```py +ResnetConfig.register_for_auto_class() +ResnetModel.register_for_auto_class("AutoModel") +ResnetModelForImageClassification.register_for_auto_class("AutoModelForImageClassification") +``` + +请注意,对于配置(只有一个自动类 [`AutoConfig`]),不需要指定自动类,但对于模型来说情况不同。 你的自定义模型可能适用于许多不同的任务,因此你必须指定哪一个自动类适合你的模型。 + +接下来,让我们像之前一样创建配置和模型: + +```py +resnet50d_config = ResnetConfig(block_type="bottleneck", stem_width=32, stem_type="deep", avg_down=True) +resnet50d = ResnetModelForImageClassification(resnet50d_config) + +pretrained_model = timm.create_model("resnet50d", pretrained=True) +resnet50d.model.load_state_dict(pretrained_model.state_dict()) +``` + +现在要将模型推送到集线器,请确保你已登录。你看可以在终端中运行以下命令: + +```bash +huggingface-cli login +``` + +或者在笔记本中运行以下代码: + +```py +from huggingface_hub import notebook_login + +notebook_login() +``` + +然后,可以这样将模型推送到自己的命名空间(或你所属的组织): + +```py +resnet50d.push_to_hub("custom-resnet50d") +``` + +除了模型权重和 JSON 格式的配置外,这行代码也会复制 `custom-resnet50d` 文件夹内的模型以及配置的 `.py` 文件并将结果上传至 Hub。你可以在此[模型仓库](https://huggingface.co/sgugger/custom-resnet50d)中查看结果。 + +有关推推送至 Hub 方法的更多信息,请参阅[共享教程](model_sharing)。 + +## 使用带有自定义代码的模型 + +可以使用自动类(auto-classes)和 `from_pretrained` 方法,使用模型仓库里带有自定义代码的配置、模型或分词器文件。所有上传到 Hub 的文件和代码都会进行恶意软件扫描(有关更多信息,请参阅 [Hub 安全](https://huggingface.co/docs/hub/security#malware-scanning) 文档), 但你仍应查看模型代码和作者,以避免在你的计算机上执行恶意代码。 设置 `trust_remote_code=True` 以使用带有自定义代码的模型: + +```py +from transformers import AutoModelForImageClassification + +model = AutoModelForImageClassification.from_pretrained("sgugger/custom-resnet50d", trust_remote_code=True) +``` + +我们强烈建议为 `revision` 参数传递提交哈希(commit hash),以确保模型的作者没有使用一些恶意的代码行更新了代码(除非您完全信任模型的作者)。 + +```py +commit_hash = "ed94a7c6247d8aedce4647f00f20de6875b5b292" +model = AutoModelForImageClassification.from_pretrained( + "sgugger/custom-resnet50d", trust_remote_code=True, revision=commit_hash +) +``` + +在 Hub 上浏览模型仓库的提交历史时,有一个按钮可以轻松复制任何提交的提交哈希。 + +## 将自定义代码的模型注册到自动类 + +如果你在编写一个扩展 🤗 Transformers 的库,你可能想要扩展自动类以包含您自己的模型。这与将代码推送到 Hub 不同,因为用户需要导入你的库才能获取自定义模型(与从 Hub 自动下载模型代码相反)。 + +只要你的配置 `model_type` 属性与现有模型类型不同,并且你的模型类有正确的 `config_class` 属性,你可以像这样将它们添加到自动类中: + +```py +from transformers import AutoConfig, AutoModel, AutoModelForImageClassification + +AutoConfig.register("resnet", ResnetConfig) +AutoModel.register(ResnetConfig, ResnetModel) +AutoModelForImageClassification.register(ResnetConfig, ResnetModelForImageClassification) +``` + +请注意,将自定义配置注册到 [`AutoConfig`] 时,使用的第一个参数需要与自定义配置的 `model_type` 匹配;而将自定义模型注册到任何自动模型类时,使用的第一个参数需要与 `config_class` 匹配。 diff --git a/docs/source/zh/debugging.md b/docs/source/zh/debugging.md new file mode 100644 index 00000000000000..77746a694fce36 --- /dev/null +++ b/docs/source/zh/debugging.md @@ -0,0 +1,308 @@ + + +# 调试 + +## 多GPU网络问题调试 + +当使用`DistributedDataParallel`和多个GPU进行训练或推理时,如果遇到进程和(或)节点之间的互联问题,您可以使用以下脚本来诊断网络问题。 + +```bash +wget https://raw.githubusercontent.com/huggingface/transformers/main/scripts/distributed/torch-distributed-gpu-test.py +``` + +例如,要测试两个GPU之间的互联,请执行以下操作: + 
+```bash +python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py +``` + +如果两个进程能够相互通信并分配GPU内存,它们各自将打印出 "OK" 状态。 + +对于更多的GPU或节点,可以根据脚本中的参数进行调整。 + +在诊断脚本内部,您将找到更多详细信息,甚至有关如何在SLURM环境中运行它的说明。 + +另一种级别的调试是添加 `NCCL_DEBUG=INFO` 环境变量,如下所示: + + +```bash +NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py +``` + +这将产生大量与NCCL相关的调试信息,如果发现有问题报告,您可以在线搜索以获取相关信息。或者,如果您不确定如何解释输出,可以在`issue`中分享日志文件。 + + +## 下溢和上溢检测 + + + +目前,此功能仅适用于PyTorch。 + + + + + +对于多GPU训练,它需要使用DDP(`torch.distributed.launch`)。 + + + + + +此功能可以与任何基于`nn.Module`的模型一起使用。 + + + +如果您开始发现`loss=NaN`或模型因激活值或权重中的`inf`或`nan`而出现一些异常行为,就需要发现第一个下溢或上溢发生的地方以及导致它的原因。幸运的是,您可以通过激活一个特殊模块来自动进行检测。 + +如果您正在使用[`Trainer`],只需把以下内容: + + +```bash +--debug underflow_overflow +``` + +添加到常规命令行参数中,或在创建[`TrainingArguments`]对象时传递 `debug="underflow_overflow"`。 + +如果您正在使用自己的训练循环或其他Trainer,您可以通过以下方式实现相同的功能: + +```python +from transformers.debug_utils import DebugUnderflowOverflow + +debug_overflow = DebugUnderflowOverflow(model) +``` + +[`debug_utils.DebugUnderflowOverflow`] 将`hooks`插入模型,紧跟在每次前向调用之后,进而测试输入和输出变量,以及相应模块的权重。一旦在激活值或权重的至少一个元素中检测到`inf`或`nan`,程序将执行`assert`并打印报告,就像这样(这是在`google/mt5-small`下使用fp16混合精度捕获的): + +``` +Detected inf/nan during batch_number=0 +Last 21 forward frames: +abs min abs max metadata + encoder.block.1.layer.1.DenseReluDense.dropout Dropout +0.00e+00 2.57e+02 input[0] +0.00e+00 2.85e+02 output +[...] + encoder.block.2.layer.0 T5LayerSelfAttention +6.78e-04 3.15e+03 input[0] +2.65e-04 3.42e+03 output[0] + None output[1] +2.25e-01 1.00e+04 output[2] + encoder.block.2.layer.1.layer_norm T5LayerNorm +8.69e-02 4.18e-01 weight +2.65e-04 3.42e+03 input[0] +1.79e-06 4.65e+00 output + encoder.block.2.layer.1.DenseReluDense.wi_0 Linear +2.17e-07 4.50e+00 weight +1.79e-06 4.65e+00 input[0] +2.68e-06 3.70e+01 output + encoder.block.2.layer.1.DenseReluDense.wi_1 Linear +8.08e-07 2.66e+01 weight +1.79e-06 4.65e+00 input[0] +1.27e-04 2.37e+02 output + encoder.block.2.layer.1.DenseReluDense.dropout Dropout +0.00e+00 8.76e+03 input[0] +0.00e+00 9.74e+03 output + encoder.block.2.layer.1.DenseReluDense.wo Linear +1.01e-06 6.44e+00 weight +0.00e+00 9.74e+03 input[0] +3.18e-04 6.27e+04 output + encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense +1.79e-06 4.65e+00 input[0] +3.18e-04 6.27e+04 output + encoder.block.2.layer.1.dropout Dropout +3.18e-04 6.27e+04 input[0] +0.00e+00 inf output +``` + +由于篇幅原因,示例输出中间的部分已经被缩减。 + +第二列显示了绝对最大元素的值,因此,如果您仔细查看最后`frame`,输入和输出都在`1e4`的范围内。因此,在使用fp16混合精度进行训练时,最后一步发生了溢出(因为在`fp16`下,在`inf`之前的最大数字是`64e3`)。为了避免在`fp16`下发生溢出,激活值必须保持低于`1e4`,因为`1e4 * 1e4 = 1e8`,因此任何具有大激活值的矩阵乘法都会导致数值溢出。 + +在跟踪的开始处,您可以发现问题发生在哪个批次(这里的`Detected inf/nan during batch_number=0`表示问题发生在第一个批次)。 + +每个报告的`frame`都以声明相应模块的层信息为开头,说明这一`frame`是为哪个模块报告的。如果只看这个`frame`: + +``` + encoder.block.2.layer.1.layer_norm T5LayerNorm +8.69e-02 4.18e-01 weight +2.65e-04 3.42e+03 input[0] +1.79e-06 4.65e+00 output +``` + +在这里,`encoder.block.2.layer.1.layer_norm` 表示它是编码器的第二个块中第一层的`layer norm`。而 `forward` 的具体调用是 `T5LayerNorm`。 + +让我们看看该报告的最后几个`frame`: + +``` +Detected inf/nan during batch_number=0 +Last 21 forward frames: +abs min abs max metadata +[...] 
+ encoder.block.2.layer.1.DenseReluDense.wi_0 Linear +2.17e-07 4.50e+00 weight +1.79e-06 4.65e+00 input[0] +2.68e-06 3.70e+01 output + encoder.block.2.layer.1.DenseReluDense.wi_1 Linear +8.08e-07 2.66e+01 weight +1.79e-06 4.65e+00 input[0] +1.27e-04 2.37e+02 output + encoder.block.2.layer.1.DenseReluDense.wo Linear +1.01e-06 6.44e+00 weight +0.00e+00 9.74e+03 input[0] +3.18e-04 6.27e+04 output + encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense +1.79e-06 4.65e+00 input[0] +3.18e-04 6.27e+04 output + encoder.block.2.layer.1.dropout Dropout +3.18e-04 6.27e+04 input[0] +0.00e+00 inf output +``` + +最后一个`frame`报告了`Dropout.forward`函数,第一个条目是唯一的输入,第二个条目是唯一的输出。您可以看到,它是从`DenseReluDense`类内的属性`dropout`中调用的。我们可以看到它发生在第2个块的第1层,也就是在第一个批次期间。最后,绝对最大的输入元素值为`6.27e+04`,输出也是`inf`。 + +您可以在这里看到,`T5DenseGatedGeluDense.forward`产生了输出激活值,其绝对最大值约为62.7K,非常接近fp16的上限64K。在下一个`frame`中,我们有`Dropout`对权重进行重新归一化,之后将某些元素归零,将绝对最大值推到了64K以上,导致溢出(`inf`)。 + +正如你所看到的,我们需要查看前面的`frame`, 从那里fp16数字开始变得非常大。 + +让我们将报告与`models/t5/modeling_t5.py`中的代码匹配: + +```python +class T5DenseGatedGeluDense(nn.Module): + def __init__(self, config): + super().__init__() + self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=False) + self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=False) + self.wo = nn.Linear(config.d_ff, config.d_model, bias=False) + self.dropout = nn.Dropout(config.dropout_rate) + self.gelu_act = ACT2FN["gelu_new"] + + def forward(self, hidden_states): + hidden_gelu = self.gelu_act(self.wi_0(hidden_states)) + hidden_linear = self.wi_1(hidden_states) + hidden_states = hidden_gelu * hidden_linear + hidden_states = self.dropout(hidden_states) + hidden_states = self.wo(hidden_states) + return hidden_states +``` + +现在很容易看到`dropout`调用,以及所有之前的调用。 + +由于检测是在前向`hook`中进行的,这些报告将立即在每个`forward`返回后打印出来。 + +回到完整的报告,要采取措施并解决问题,我们需要往回看几个`frame`,在那里数字开始上升,并且最有可能切换到fp32模式以便在乘法或求和时数字不会溢出。当然,可能还有其他解决方案。例如,如果启用了`amp`,我们可以在将原始`forward`移到`helper wrapper`中后,暂时关闭它,如下所示: + +```python +def _forward(self, hidden_states): + hidden_gelu = self.gelu_act(self.wi_0(hidden_states)) + hidden_linear = self.wi_1(hidden_states) + hidden_states = hidden_gelu * hidden_linear + hidden_states = self.dropout(hidden_states) + hidden_states = self.wo(hidden_states) + return hidden_states + + +import torch + + +def forward(self, hidden_states): + if torch.is_autocast_enabled(): + with torch.cuda.amp.autocast(enabled=False): + return self._forward(hidden_states) + else: + return self._forward(hidden_states) +``` + +由于自动检测器仅报告完整`frame`的输入和输出,一旦知道在哪里查找,您可能还希望分析特定`forward`函数的中间阶段。在这种情况下,您可以使用`detect_overflow`辅助函数将检测器放到希望的位置,例如: + +```python +from debug_utils import detect_overflow + + +class T5LayerFF(nn.Module): + [...] 
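+    # [...] 处省略了该类的其余代码;作为示例,下面的 forward 中手动插入了两处 detect_overflow 检测点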
+ + def forward(self, hidden_states): + forwarded_states = self.layer_norm(hidden_states) + detect_overflow(forwarded_states, "after layer_norm") + forwarded_states = self.DenseReluDense(forwarded_states) + detect_overflow(forwarded_states, "after DenseReluDense") + return hidden_states + self.dropout(forwarded_states) +``` + +可以看到,我们添加了2个检测器,现在我们可以跟踪是否在`forwarded_states`中间的某个地方检测到了`inf`或`nan`。 + +实际上,检测器已经报告了这些,因为上面示例中的每个调用都是一个`nn.Module`,但假设如果您有一些本地的直接计算,这就是您将如何执行的方式。 + +此外,如果您在自己的代码中实例化调试器,您可以调整从其默认打印的`frame`数,例如: + +```python +from transformers.debug_utils import DebugUnderflowOverflow + +debug_overflow = DebugUnderflowOverflow(model, max_frames_to_save=100) +``` + +### 特定批次的绝对最小值和最大值跟踪 + +当关闭下溢/上溢检测功能, 同样的调试类可以用于批处理跟踪。 + +假设您想要监视给定批次的每个`forward`调用的所有成分的绝对最小值和最大值,并且仅对批次1和3执行此操作,您可以这样实例化这个类: + +```python +debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3]) +``` + +现在,完整的批次1和3将以与下溢/上溢检测器相同的格式进行跟踪。 + +批次从0开始计数。 + +如果您知道程序在某个批次编号之后开始出现问题,那么您可以直接快进到该区域。以下是一个截取的配置示例输出: + +``` + *** Starting batch number=1 *** +abs min abs max metadata + shared Embedding +1.01e-06 7.92e+02 weight +0.00e+00 2.47e+04 input[0] +5.36e-05 7.92e+02 output +[...] + decoder.dropout Dropout +1.60e-07 2.27e+01 input[0] +0.00e+00 2.52e+01 output + decoder T5Stack + not a tensor output + lm_head Linear +1.01e-06 7.92e+02 weight +0.00e+00 1.11e+00 input[0] +6.06e-02 8.39e+01 output + T5ForConditionalGeneration + not a tensor output + + *** Starting batch number=3 *** +abs min abs max metadata + shared Embedding +1.01e-06 7.92e+02 weight +0.00e+00 2.78e+04 input[0] +5.36e-05 7.92e+02 output +[...] +``` + +在这里,您将获得大量的`frame`被`dump` - 与您的模型中的前向调用一样多,它有可能符合也可能不符合您的要求,但有时对于调试目的来说,它可能比正常的调试器更容易使用。例如,如果问题开始发生在批次号150上,您可以`dump`批次149和150的跟踪,并比较数字开始发散的地方。 + +你还可以使用以下命令指定停止训练的批次号: + +```python +debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3], abort_after_batch_num=3) +``` diff --git a/docs/source/zh/fast_tokenizers.md b/docs/source/zh/fast_tokenizers.md new file mode 100644 index 00000000000000..dd311c3791cc70 --- /dev/null +++ b/docs/source/zh/fast_tokenizers.md @@ -0,0 +1,67 @@ + + +# 使用 🤗 Tokenizers 中的分词器 + +[`PreTrainedTokenizerFast`] 依赖于 [🤗 Tokenizers](https://huggingface.co/docs/tokenizers) 库。从 🤗 Tokenizers 库获得的分词器可以被轻松地加载到 🤗 Transformers 中。 + +在了解具体内容之前,让我们先用几行代码创建一个虚拟的分词器: + +```python +>>> from tokenizers import Tokenizer +>>> from tokenizers.models import BPE +>>> from tokenizers.trainers import BpeTrainer +>>> from tokenizers.pre_tokenizers import Whitespace + +>>> tokenizer = Tokenizer(BPE(unk_token="[UNK]")) +>>> trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]) + +>>> tokenizer.pre_tokenizer = Whitespace() +>>> files = [...] 
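+>>> # 注:files 应为用于训练分词器的纯文本语料文件路径列表,此处的 [...] 仅为占位符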
+>>> tokenizer.train(files, trainer) +``` + +现在,我们拥有了一个针对我们定义的文件进行训练的分词器。我们可以在当前运行时中继续使用它,或者将其保存到一个 JSON 文件以供将来重复使用。 + +## 直接从分词器对象加载 + +让我们看看如何利用 🤗 Transformers 库中的这个分词器对象。[`PreTrainedTokenizerFast`] 类允许通过接受已实例化的 *tokenizer* 对象作为参数,进行轻松实例化: + +```python +>>> from transformers import PreTrainedTokenizerFast + +>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer) +``` + +现在可以使用这个对象,使用 🤗 Transformers 分词器共享的所有方法!前往[分词器页面](main_classes/tokenizer)了解更多信息。 + +## 从 JSON 文件加载 + +为了从 JSON 文件中加载分词器,让我们先保存我们的分词器: + +```python +>>> tokenizer.save("tokenizer.json") +``` + +我们保存此文件的路径可以通过 `tokenizer_file` 参数传递给 [`PreTrainedTokenizerFast`] 初始化方法: + +```python +>>> from transformers import PreTrainedTokenizerFast + +>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json") +``` + +现在可以使用这个对象,使用 🤗 Transformers 分词器共享的所有方法!前往[分词器页面](main_classes/tokenizer)了解更多信息。 diff --git a/docs/source/zh/hpo_train.md b/docs/source/zh/hpo_train.md new file mode 100644 index 00000000000000..182940c359bb44 --- /dev/null +++ b/docs/source/zh/hpo_train.md @@ -0,0 +1,139 @@ + + +# 使用Trainer API进行超参数搜索 + +🤗 Transformers库提供了一个优化过的[`Trainer`]类,用于训练🤗 Transformers模型,相比于手动编写自己的训练循环,这更容易开始训练。[`Trainer`]提供了超参数搜索的API。本文档展示了如何在示例中启用它。 + + +## 超参数搜索后端 + +[`Trainer`] 目前支持四种超参数搜索后端:[optuna](https://optuna.org/),[sigopt](https://sigopt.com/),[raytune](https://docs.ray.io/en/latest/tune/index.html),[wandb](https://wandb.ai/site/sweeps) + +在使用它们之前,您应该先安装它们作为超参数搜索后端。 + +```bash +pip install optuna/sigopt/wandb/ray[tune] +``` + +## 如何在示例中启用超参数搜索 + +定义超参数搜索空间,不同的后端需要不同的格式。 + +对于sigopt,请参阅sigopt [object_parameter](https://docs.sigopt.com/ai-module-api-references/api_reference/objects/object_parameter),它类似于以下内容: + +```py +>>> def sigopt_hp_space(trial): +... return [ +... {"bounds": {"min": 1e-6, "max": 1e-4}, "name": "learning_rate", "type": "double"}, +... { +... "categorical_values": ["16", "32", "64", "128"], +... "name": "per_device_train_batch_size", +... "type": "categorical", +... }, +... ] +``` + +对于optuna,请参阅optuna [object_parameter](https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/002_configurations.html#sphx-glr-tutorial-10-key-features-002-configurations-py),它类似于以下内容: + +```py +>>> def optuna_hp_space(trial): +... return { +... "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True), +... "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [16, 32, 64, 128]), +... } +``` + +Optuna提供了多目标HPO。您可以在`hyperparameter_search`中传递`direction`参数,并定义自己的`compute_objective`以返回多个目标值。在`hyperparameter_search`中将返回Pareto Front(`List[BestRun]`),您应该参考[test_trainer](https://github.com/huggingface/transformers/blob/main/tests/trainer/test_trainer.py)中的测试用例`TrainerHyperParameterMultiObjectOptunaIntegrationTest`。它类似于以下内容: + +```py +>>> best_trials = trainer.hyperparameter_search( +... direction=["minimize", "maximize"], +... backend="optuna", +... hp_space=optuna_hp_space, +... n_trials=20, +... compute_objective=compute_objective, +... ) +``` + +对于raytune,可以参考raytune的[object_parameter](https://docs.ray.io/en/latest/tune/api/search_space.html),它类似于以下内容: + +```py +>>> def ray_hp_space(trial): +... return { +... "learning_rate": tune.loguniform(1e-6, 1e-4), +... "per_device_train_batch_size": tune.choice([16, 32, 64, 128]), +... } +``` + +对于wandb,可以参考wandb的[object_parameter](https://docs.wandb.ai/guides/sweeps/configuration),它类似于以下内容: + +```py +>>> def wandb_hp_space(trial): +... return { +... "method": "random", +... 
"metric": {"name": "objective", "goal": "minimize"}, +... "parameters": { +... "learning_rate": {"distribution": "uniform", "min": 1e-6, "max": 1e-4}, +... "per_device_train_batch_size": {"values": [16, 32, 64, 128]}, +... }, +... } +``` + +定义一个`model_init`函数并将其传递给[Trainer],作为示例: + +```py +>>> def model_init(trial): +... return AutoModelForSequenceClassification.from_pretrained( +... model_args.model_name_or_path, +... from_tf=bool(".ckpt" in model_args.model_name_or_path), +... config=config, +... cache_dir=model_args.cache_dir, +... revision=model_args.model_revision, +... use_auth_token=True if model_args.use_auth_token else None, +... ) +``` + +使用你的`model_init`函数、训练参数、训练和测试数据集以及评估函数创建一个[`Trainer`]。 + +```py +>>> trainer = Trainer( +... model=None, +... args=training_args, +... train_dataset=small_train_dataset, +... eval_dataset=small_eval_dataset, +... compute_metrics=compute_metrics, +... tokenizer=tokenizer, +... model_init=model_init, +... data_collator=data_collator, +... ) +``` + +调用超参数搜索,获取最佳试验参数,后端可以是`"optuna"`/`"sigopt"`/`"wandb"`/`"ray"`。方向可以是`"minimize"`或`"maximize"`,表示是否优化更大或更低的目标。 + +您可以定义自己的compute_objective函数,如果没有定义,将调用默认的compute_objective,并将评估指标(如f1)之和作为目标值返回。 + +```py +>>> best_trial = trainer.hyperparameter_search( +... direction="maximize", +... backend="optuna", +... hp_space=optuna_hp_space, +... n_trials=20, +... compute_objective=compute_objective, +... ) +``` + +## 针对DDP微调的超参数搜索 +目前,Optuna和Sigopt已启用针对DDP的超参数搜索。只有rank-zero进程会进行超参数搜索并将参数传递给其他进程。 \ No newline at end of file diff --git a/docs/source/zh/index.mdx b/docs/source/zh/index.md similarity index 95% rename from docs/source/zh/index.mdx rename to docs/source/zh/index.md index 4d69d590c6922f..549d6e6371f54b 100644 --- a/docs/source/zh/index.mdx +++ b/docs/source/zh/index.md @@ -8,24 +8,28 @@ http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be +rendered properly in your Markdown viewer. + --> # 🤗 Transformers简介 -为[PyTorch](https://pytorch.org/), [TensorFlow](https://www.tensorflow.org/)和[JAX](https://jax.readthedocs.io/en/latest/)打造的先进的机器学习工具. +为 [PyTorch](https://pytorch.org/)、[TensorFlow](https://www.tensorflow.org/) 和 [JAX](https://jax.readthedocs.io/en/latest/) 打造的先进的机器学习工具. -🤗 Transformers 提供了可以轻松地下载并且训练先进的预训练模型的API和工具. 使用预训练模型可以减少计算消耗和碳排放, 并且节省从头训练所需要的时间和资源. 这些模型支持不同模态中的常见任务,比如: +🤗 Transformers 提供了可以轻松地下载并且训练先进的预训练模型的 API 和工具。使用预训练模型可以减少计算消耗和碳排放,并且节省从头训练所需要的时间和资源。这些模型支持不同模态中的常见任务,比如: -📝 **自然语言处理**: 文本分类, 命名实体识别, 问答, 语言建模, 摘要, 翻译, 多项选择和文本生成.
-🖼️ **机器视觉**: 图像分类, 目标检测和语义分割.
-🗣️ **音频**: 自动语音识别和音频分类.
-🐙 **多模态**: 表格问答, 光学字符识别, 从扫描文档提取信息, 视频分类和视觉问答. +📝 **自然语言处理**:文本分类、命名实体识别、问答、语言建模、摘要、翻译、多项选择和文本生成。
+🖼️ **机器视觉**:图像分类、目标检测和语义分割。
+🗣️ **音频**:自动语音识别和音频分类。
+🐙 **多模态**:表格问答、光学字符识别、从扫描文档提取信息、视频分类和视觉问答。 -🤗 Transformers支持在PyTorch, TensorFlow和JAX上的互操作性. 这给在模型的每个阶段使用不同的框架带来了灵活性; 在一个框架中使用几行代码训练一个模型, 然后在另一个框架中加载它并进行推理. 模型也可以被导出为ONNX和TorchScript格式, 用于在生产环境中部署. +🤗 Transformers 支持在 PyTorch、TensorFlow 和 JAX 上的互操作性. 这给在模型的每个阶段使用不同的框架带来了灵活性;在一个框架中使用几行代码训练一个模型,然后在另一个框架中加载它并进行推理。模型也可以被导出为 ONNX 和 TorchScript 格式,用于在生产环境中部署。 -马上加入在[Hub](https://huggingface.co/models), [forum](https://discuss.huggingface.co/), 或者[Discord](https://discord.com/invite/JfAtkvEtRb)上正在快速发展的社区吧! +马上加入在 [Hub](https://huggingface.co/models)、[论坛](https://discuss.huggingface.co/) 或者 [Discord](https://discord.com/invite/JfAtkvEtRb) 上正在快速发展的社区吧! -## 如果你需要来自Hugging Face团队的个性化支持 +## 如果你需要来自 Hugging Face 团队的个性化支持 HuggingFace Expert Acceleration Program @@ -35,15 +39,15 @@ specific language governing permissions and limitations under the License. 这篇文档被组织为以下5个章节: -- **开始使用** 包含了库的快速上手和安装说明, 便于配置和运行. -- **教程** 是一个初学者开始的好地方. 本章节将帮助你获得你会用到的使用这个库的基本技能. -- **操作指南** 向你展示如何实现一个特定目标, 比如为语言建模微调一个预训练模型或者如何创造并分享个性化模型. -- **概念指南** 对🤗 Transformers的模型, 任务和设计理念背后的基本概念和思想做了更多的讨论和解释. -- **API介绍** 描述了所有的类和函数: +- **开始使用** 包含了库的快速上手和安装说明,便于配置和运行。 +- **教程** 是一个初学者开始的好地方。本章节将帮助你获得你会用到的使用这个库的基本技能。 +- **操作指南** 向你展示如何实现一个特定目标,比如为语言建模微调一个预训练模型或者如何创造并分享个性化模型。 +- **概念指南** 对 🤗 Transformers 的模型,任务和设计理念背后的基本概念和思想做了更多的讨论和解释。 +- **API 介绍** 描述了所有的类和函数: - - **MAIN CLASSES** 详述了配置(configuration)、模型(model)、分词器(tokenizer)和流水线(pipeline)这几个最重要的类. - - **MODELS** 详述了在这个库中和每个模型实现有关的类和函数. - - **INTERNAL HELPERS** 详述了内部使用的工具类和函数. + - **MAIN CLASSES** 详述了配置(configuration)、模型(model)、分词器(tokenizer)和流水线(pipeline)这几个最重要的类。 + - **MODELS** 详述了在这个库中和每个模型实现有关的类和函数。 + - **INTERNAL HELPERS** 详述了内部使用的工具类和函数。 ### 支持的模型 @@ -78,6 +82,7 @@ specific language governing permissions and limitations under the License. 1. **[Conditional DETR](model_doc/conditional_detr)** (from Microsoft Research Asia) released with the paper [Conditional DETR for Fast Training Convergence](https://arxiv.org/abs/2108.06152) by Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang. 1. **[ConvBERT](model_doc/convbert)** (from YituTech) released with the paper [ConvBERT: Improving BERT with Span-based Dynamic Convolution](https://arxiv.org/abs/2008.02496) by Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan. 1. **[ConvNeXT](model_doc/convnext)** (from Facebook AI) released with the paper [A ConvNet for the 2020s](https://arxiv.org/abs/2201.03545) by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie. +1. **[ConvNeXTV2](model_doc/convnextv2)** (from Facebook AI) released with the paper [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://arxiv.org/abs/2301.00808) by Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie. 1. **[CPM](model_doc/cpm)** (from Tsinghua University) released with the paper [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://arxiv.org/abs/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin, Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun. 1. 
**[CTRL](model_doc/ctrl)** (from Salesforce) released with the paper [CTRL: A Conditional Transformer Language Model for Controllable Generation](https://arxiv.org/abs/1909.05858) by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher. 1. **[CvT](model_doc/cvt)** (from Microsoft) released with the paper [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) by Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang. @@ -106,11 +111,11 @@ specific language governing permissions and limitations under the License. 1. **[Funnel Transformer](model_doc/funnel)** (from CMU/Google Brain) released with the paper [Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing](https://arxiv.org/abs/2006.03236) by Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le. 1. **[GIT](model_doc/git)** (from Microsoft Research) released with the paper [GIT: A Generative Image-to-text Transformer for Vision and Language](https://arxiv.org/abs/2205.14100) by Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang. 1. **[GLPN](model_doc/glpn)** (from KAIST) released with the paper [Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth](https://arxiv.org/abs/2201.07436) by Doyeon Kim, Woonghyun Ga, Pyungwhan Ahn, Donggyu Joo, Sehwan Chun, Junmo Kim. -1. **[GPT](model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://blog.openai.com/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. +1. **[GPT](model_doc/openai-gpt)** (from OpenAI) released with the paper [Improving Language Understanding by Generative Pre-Training](https://openai.com/research/language-unsupervised/) by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. 1. **[GPT Neo](model_doc/gpt_neo)** (from EleutherAI) released in the repository [EleutherAI/gpt-neo](https://github.com/EleutherAI/gpt-neo) by Sid Black, Stella Biderman, Leo Gao, Phil Wang and Connor Leahy. 1. **[GPT NeoX](model_doc/gpt_neox)** (from EleutherAI) released with the paper [GPT-NeoX-20B: An Open-Source Autoregressive Language Model](https://arxiv.org/abs/2204.06745) by Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach 1. **[GPT NeoX Japanese](model_doc/gpt_neox_japanese)** (from ABEJA) released by Shinya Otani, Takayoshi Makabe, Anuj Arora, and Kyo Hattori. -1. **[GPT-2](model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**. +1. **[GPT-2](model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://openai.com/research/better-language-models/) by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever. 1. **[GPT-J](model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki. 1. 
**[GPT-Sw3](model_doc/gpt-sw3)** (from AI-Sweden) released with the paper [Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.376.pdf) by Ariel Ekgren, Amaru Cuba Gyllensten, Evangelia Gogoulou, Alice Heiman, Severine Verlinden, Joey Öhman, Fredrik Carlsson, Magnus Sahlgren. 1. **[GroupViT](model_doc/groupvit)** (from UCSD, NVIDIA) released with the paper [GroupViT: Semantic Segmentation Emerges from Text Supervision](https://arxiv.org/abs/2202.11094) by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang. @@ -224,8 +229,8 @@ specific language governing permissions and limitations under the License. ### 支持的框架 -下表展示了库中对每个模型的支持情况, 是否具有Python分词器 (表中的"Tokenizer slow"). 是否具有由🤗 Tokenizers库支持的快速分词器(表中的"Tokenizer fast"), 是否支持Jax (通过 -Flax), PyTorch, 和/或者 TensorFlow. +下表展示了库中对每个模型的支持情况,如是否具有 Python 分词器(表中的“Tokenizer slow”)、是否具有由 🤗 Tokenizers 库支持的快速分词器(表中的“Tokenizer fast”)、是否支持 Jax(通过 +Flax)、PyTorch 与 TensorFlow。 @@ -335,7 +340,7 @@ Flax), PyTorch, 和/或者 TensorFlow. | RAG | ✅ | ❌ | ✅ | ✅ | ❌ | | REALM | ✅ | ✅ | ✅ | ❌ | ❌ | | Reformer | ✅ | ✅ | ✅ | ❌ | ❌ | -| RegNet | ❌ | ❌ | ✅ | ✅ | ❌ | +| RegNet | ❌ | ❌ | ✅ | ✅ | ✅ | | RemBERT | ✅ | ✅ | ✅ | ✅ | ❌ | | ResNet | ❌ | ❌ | ✅ | ✅ | ❌ | | RetriBERT | ✅ | ✅ | ✅ | ❌ | ❌ | @@ -390,4 +395,4 @@ Flax), PyTorch, 和/或者 TensorFlow. | YOLOS | ❌ | ❌ | ✅ | ❌ | ❌ | | YOSO | ❌ | ❌ | ✅ | ❌ | ❌ | - \ No newline at end of file + diff --git a/docs/source/zh/installation.md b/docs/source/zh/installation.md new file mode 100644 index 00000000000000..91e09dc904bd7e --- /dev/null +++ b/docs/source/zh/installation.md @@ -0,0 +1,256 @@ + + +# 安装 + +为你正在使用的深度学习框架安装 🤗 Transformers、设置缓存,并选择性配置 🤗 Transformers 以离线运行。 + +🤗 Transformers 已在 Python 3.6+、PyTorch 1.1.0+、TensorFlow 2.0+ 以及 Flax 上进行测试。针对你使用的深度学习框架,请参照以下安装说明进行安装: + +* [PyTorch](https://pytorch.org/get-started/locally/) 安装说明。 +* [TensorFlow 2.0](https://www.tensorflow.org/install/pip) 安装说明。 +* [Flax](https://flax.readthedocs.io/en/latest/) 安装说明。 + +## 使用 pip 安装 + +你应该使用 [虚拟环境](https://docs.python.org/3/library/venv.html) 安装 🤗 Transformers。如果你不熟悉 Python 虚拟环境,请查看此 [教程](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/)。使用虚拟环境,你可以轻松管理不同项目,避免不同依赖项之间的兼容性问题。 + +首先,在项目目录中创建虚拟环境: + +```bash +python -m venv .env +``` + +在 Linux 和 MacOs 系统中激活虚拟环境: + +```bash +source .env/bin/activate +``` +在 Windows 系统中激活虚拟环境: + +```bash +.env/Scripts/activate +``` + +现在你可以使用以下命令安装 🤗 Transformers: + +```bash +pip install transformers +``` + +若仅需 CPU 支持,可以使用单行命令方便地安装 🤗 Transformers 和深度学习库。例如,使用以下命令安装 🤗 Transformers 和 PyTorch: + +```bash +pip install 'transformers[torch]' +``` + +🤗 Transformers 和 TensorFlow 2.0: + +```bash +pip install 'transformers[tf-cpu]' +``` + + + +M1 / ARM用户 + +在安装 TensorFlow 2.0 前,你需要安装以下库: +```bash +brew install cmake +brew install pkg-config +``` + + + +🤗 Transformers 和 Flax: + +```bash +pip install 'transformers[flax]' +``` + +最后,运行以下命令以检查 🤗 Transformers 是否已被正确安装。该命令将下载一个预训练模型: + +```bash +python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('we love you'))" +``` + +然后打印标签以及分数: + +```bash +[{'label': 'POSITIVE', 'score': 0.9998704791069031}] +``` + +## 源码安装 + +使用以下命令从源码安装 🤗 Transformers: + +```bash +pip install git+https://github.com/huggingface/transformers +``` + +此命令下载的是最新的前沿 `main` 版本而不是最新的 `stable` 版本。`main` 版本适用于跟最新开发保持一致。例如,上次正式版发布带来的 bug 被修复了,但新版本尚未被推出。但是,这也说明 `main` 版本并不一定总是稳定的。我们努力保持 `main` 
版本的可操作性,大多数问题通常在几个小时或一天以内就能被解决。如果你遇到问题,请提个 [Issue](https://github.com/huggingface/transformers/issues) 以便我们能更快修复。 + +运行以下命令以检查 🤗 Transformers 是否已被正确安装: + +```bash +python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I love you'))" +``` + +## 可编辑安装 + +如果你有下列需求,需要进行可编辑安装: + +* 使用源码的 `main` 版本。 +* 为 🤗 Transformers 贡献代码,需要测试代码中的更改。 + +使用以下命令克隆仓库并安装 🤗 Transformers: + +```bash +git clone https://github.com/huggingface/transformers.git +cd transformers +pip install -e . +``` + +这些命令将会链接你克隆的仓库以及你的 Python 库路径。现在,Python 不仅会在正常的库路径中搜索库,也会在你克隆到的文件夹中进行查找。例如,如果你的 Python 包通常本应安装在 `~/anaconda3/envs/main/lib/python3.7/site-packages/` 目录中,在这种情况下 Python 也会搜索你克隆到的文件夹:`~/transformers/`。 + + + +如果你想继续使用这个库,必须保留 `transformers` 文件夹。 + + + +现在,你可以使用以下命令,将你克隆的 🤗 Transformers 库轻松更新至最新版本: + +```bash +cd ~/transformers/ +git pull +``` + +你的 Python 环境将在下次运行时找到 `main` 版本的 🤗 Transformers。 + +## 使用 conda 安装 + +从 conda 的 `conda-forge` 频道安装: + +```bash +conda install conda-forge::transformers +``` + +## 缓存设置 + +预训练模型会被下载并本地缓存到 `~/.cache/huggingface/hub`。这是由环境变量 `TRANSFORMERS_CACHE` 指定的默认目录。在 Windows 上,默认目录为 `C:\Users\username\.cache\huggingface\hub`。你可以按照不同优先级改变下述环境变量,以指定不同的缓存目录。 + +1. 环境变量(默认): `HUGGINGFACE_HUB_CACHE` 或 `TRANSFORMERS_CACHE`。 +2. 环境变量 `HF_HOME`。 +3. 环境变量 `XDG_CACHE_HOME` + `/huggingface`。 + + + +除非你明确指定了环境变量 `TRANSFORMERS_CACHE`,🤗 Transformers 将可能会使用较早版本设置的环境变量 `PYTORCH_TRANSFORMERS_CACHE` 或 `PYTORCH_PRETRAINED_BERT_CACHE`。 + + + +## 离线模式 + +🤗 Transformers 可以仅使用本地文件在防火墙或离线环境中运行。设置环境变量 `TRANSFORMERS_OFFLINE=1` 以启用该行为。 + + + +通过设置环境变量 `HF_DATASETS_OFFLINE=1` 将 [🤗 Datasets](https://huggingface.co/docs/datasets/) 添加至你的离线训练工作流程中。 + + + +例如,你通常会使用以下命令对外部实例进行防火墙保护的的普通网络上运行程序: + +```bash +python examples/pytorch/translation/run_translation.py --model_name_or_path google-t5/t5-small --dataset_name wmt16 --dataset_config ro-en ... +``` + +在离线环境中运行相同的程序: + +```bash +HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \ +python examples/pytorch/translation/run_translation.py --model_name_or_path google-t5/t5-small --dataset_name wmt16 --dataset_config ro-en ... +``` + +现在脚本可以应该正常运行,而无需挂起或等待超时,因为它知道只应查找本地文件。 + +### 获取离线时使用的模型和分词器 + +另一种离线时使用 🤗 Transformers 的方法是预先下载好文件,然后在需要离线使用时指向它们的离线路径。有三种实现的方法: + +* 单击 [Model Hub](https://huggingface.co/models) 用户界面上的 ↓ 图标下载文件。 + + ![下载图标](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/download-icon.png) + +* 使用 [`PreTrainedModel.from_pretrained`] 和 [`PreTrainedModel.save_pretrained`] 工作流程: + + 1. 预先使用 [`PreTrainedModel.from_pretrained`] 下载文件: + + ```py + >>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM + + >>> tokenizer = AutoTokenizer.from_pretrained("bigscience/T0_3B") + >>> model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0_3B") + ``` + + 2. 使用 [`PreTrainedModel.save_pretrained`] 将文件保存至指定目录: + + ```py + >>> tokenizer.save_pretrained("./your/path/bigscience_t0") + >>> model.save_pretrained("./your/path/bigscience_t0") + ``` + + 3. 现在,你可以在离线时从指定目录使用 [`PreTrainedModel.from_pretrained`] 重新加载你的文件: + + ```py + >>> tokenizer = AutoTokenizer.from_pretrained("./your/path/bigscience_t0") + >>> model = AutoModel.from_pretrained("./your/path/bigscience_t0") + ``` + +* 使用代码用 [huggingface_hub](https://github.com/huggingface/huggingface_hub/tree/main/src/huggingface_hub) 库下载文件: + + 1. 在你的虚拟环境中安装 `huggingface_hub` 库: + + ```bash + python -m pip install huggingface_hub + ``` + + 2. 
使用 [`hf_hub_download`](https://huggingface.co/docs/hub/adding-a-library#download-files-from-the-hub) 函数将文件下载到指定路径。例如,以下命令将 `config.json` 文件从 [T0](https://huggingface.co/bigscience/T0_3B) 模型下载至你想要的路径: + + ```py + >>> from huggingface_hub import hf_hub_download + + >>> hf_hub_download(repo_id="bigscience/T0_3B", filename="config.json", cache_dir="./your/path/bigscience_t0") + ``` + +下载完文件并在本地缓存后,指定其本地路径以加载和使用该模型: + +```py +>>> from transformers import AutoConfig + +>>> config = AutoConfig.from_pretrained("./your/path/bigscience_t0/config.json") +``` + + + +请参阅 [如何从 Hub 下载文件](https://huggingface.co/docs/hub/how-to-downstream) 部分,获取有关下载存储在 Hub 上文件的更多详细信息。 + + diff --git a/docs/source/zh/internal/audio_utils.md b/docs/source/zh/internal/audio_utils.md new file mode 100644 index 00000000000000..17fc430f984287 --- /dev/null +++ b/docs/source/zh/internal/audio_utils.md @@ -0,0 +1,40 @@ + + +# `FeatureExtractors`的工具 + +此页面列出了音频 [`FeatureExtractor`] 可以使用的所有实用函数,以便使用常见的算法(如 *Short Time Fourier Transform* 或 *log mel spectrogram*)从原始音频中计算特殊特征。 + +其中大多数仅在您研究库中音频processors的代码时有用。 + + +## 音频转换 + +[[autodoc]] audio_utils.hertz_to_mel + +[[autodoc]] audio_utils.mel_to_hertz + +[[autodoc]] audio_utils.mel_filter_bank + +[[autodoc]] audio_utils.optimal_fft_length + +[[autodoc]] audio_utils.window_function + +[[autodoc]] audio_utils.spectrogram + +[[autodoc]] audio_utils.power_to_db + +[[autodoc]] audio_utils.amplitude_to_db diff --git a/docs/source/zh/internal/file_utils.md b/docs/source/zh/internal/file_utils.md new file mode 100644 index 00000000000000..ba4b4902814a65 --- /dev/null +++ b/docs/source/zh/internal/file_utils.md @@ -0,0 +1,50 @@ + + +# 通用工具 + +此页面列出了在`utils.py`文件中找到的所有Transformers通用实用函数。 + +其中大多数仅在您研究库中的通用代码时才有用。 + + +## Enums和namedtuples(命名元组) + +[[autodoc]] utils.ExplicitEnum + +[[autodoc]] utils.PaddingStrategy + +[[autodoc]] utils.TensorType + +## 特殊的装饰函数 + +[[autodoc]] utils.add_start_docstrings + +[[autodoc]] utils.add_start_docstrings_to_model_forward + +[[autodoc]] utils.add_end_docstrings + +[[autodoc]] utils.add_code_sample_docstrings + +[[autodoc]] utils.replace_return_docstrings + +## 特殊的属性 + +[[autodoc]] utils.cached_property + +## 其他实用程序 + +[[autodoc]] utils._LazyModule diff --git a/docs/source/zh/internal/generation_utils.md b/docs/source/zh/internal/generation_utils.md new file mode 100644 index 00000000000000..34e9bf2f787ef1 --- /dev/null +++ b/docs/source/zh/internal/generation_utils.md @@ -0,0 +1,352 @@ + + +# 用于生成的工具 + +此页面列出了所有由 [`~generation.GenerationMixin.generate`], +[`~generation.GenerationMixin.greedy_search`], +[`~generation.GenerationMixin.contrastive_search`], +[`~generation.GenerationMixin.sample`], +[`~generation.GenerationMixin.beam_search`], +[`~generation.GenerationMixin.beam_sample`], +[`~generation.GenerationMixin.group_beam_search`], 和 +[`~generation.GenerationMixin.constrained_beam_search`]使用的实用函数。 + +其中大多数仅在您研究库中生成方法的代码时才有用。 + +## 生成输出 + +[`~generation.GenerationMixin.generate`] 的输出是 [`~utils.ModelOutput`] 的一个子类的实例。这个输出是一种包含 [`~generation.GenerationMixin.generate`] 返回的所有信息数据结构,但也可以作为元组或字典使用。 +这里是一个例子: + + +```python +from transformers import GPT2Tokenizer, GPT2LMHeadModel + +tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2") +model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2") + +inputs = tokenizer("Hello, my dog is cute and ", return_tensors="pt") +generation_output = model.generate(**inputs, return_dict_in_generate=True, output_scores=True) +``` + +`generation_output` 的对象是 [`~generation.GenerateDecoderOnlyOutput`] 
的一个实例,从该类的文档中我们可以看到,这意味着它具有以下属性: + +- `sequences`: 生成的tokens序列 +- `scores`(可选): 每个生成步骤的语言建模头的预测分数 +- `hidden_states`(可选): 每个生成步骤模型的hidden states +- `attentions`(可选): 每个生成步骤模型的注意力权重 + +在这里,由于我们传递了 `output_scores=True`,我们具有 `scores` 属性。但我们没有 `hidden_states` 和 `attentions`,因为没有传递 `output_hidden_states=True` 或 `output_attentions=True`。 + +您可以像通常一样访问每个属性,如果该属性未被模型返回,则将获得 `None`。例如,在这里 `generation_output.scores` 是语言建模头的所有生成预测分数,而 `generation_output.attentions` 为 `None`。 + +当我们将 `generation_output` 对象用作元组时,它只保留非 `None` 值的属性。例如,在这里它有两个元素,`loss` 然后是 `logits`,所以 + + +```python +generation_output[:2] +``` + +将返回元组`(generation_output.sequences, generation_output.scores)`。 + +当我们将`generation_output`对象用作字典时,它只保留非`None`的属性。例如,它有两个键,分别是`sequences`和`scores`。 + +我们在此记录所有输出类型。 + + +### PyTorch + +[[autodoc]] generation.GenerateDecoderOnlyOutput + +[[autodoc]] generation.GenerateEncoderDecoderOutput + +[[autodoc]] generation.GenerateBeamDecoderOnlyOutput + +[[autodoc]] generation.GenerateBeamEncoderDecoderOutput + +### TensorFlow + +[[autodoc]] generation.TFGreedySearchEncoderDecoderOutput + +[[autodoc]] generation.TFGreedySearchDecoderOnlyOutput + +[[autodoc]] generation.TFSampleEncoderDecoderOutput + +[[autodoc]] generation.TFSampleDecoderOnlyOutput + +[[autodoc]] generation.TFBeamSearchEncoderDecoderOutput + +[[autodoc]] generation.TFBeamSearchDecoderOnlyOutput + +[[autodoc]] generation.TFBeamSampleEncoderDecoderOutput + +[[autodoc]] generation.TFBeamSampleDecoderOnlyOutput + +[[autodoc]] generation.TFContrastiveSearchEncoderDecoderOutput + +[[autodoc]] generation.TFContrastiveSearchDecoderOnlyOutput + +### FLAX + +[[autodoc]] generation.FlaxSampleOutput + +[[autodoc]] generation.FlaxGreedySearchOutput + +[[autodoc]] generation.FlaxBeamSearchOutput + +## LogitsProcessor + +[`LogitsProcessor`] 可以用于修改语言模型头的预测分数以进行生成 + + +### PyTorch + +[[autodoc]] AlternatingCodebooksLogitsProcessor + - __call__ + +[[autodoc]] ClassifierFreeGuidanceLogitsProcessor + - __call__ + +[[autodoc]] EncoderNoRepeatNGramLogitsProcessor + - __call__ + +[[autodoc]] EncoderRepetitionPenaltyLogitsProcessor + - __call__ + +[[autodoc]] EpsilonLogitsWarper + - __call__ + +[[autodoc]] EtaLogitsWarper + - __call__ + +[[autodoc]] ExponentialDecayLengthPenalty + - __call__ + +[[autodoc]] ForcedBOSTokenLogitsProcessor + - __call__ + +[[autodoc]] ForcedEOSTokenLogitsProcessor + - __call__ + +[[autodoc]] ForceTokensLogitsProcessor + - __call__ + +[[autodoc]] HammingDiversityLogitsProcessor + - __call__ + +[[autodoc]] InfNanRemoveLogitsProcessor + - __call__ + +[[autodoc]] LogitNormalization + - __call__ + +[[autodoc]] LogitsProcessor + - __call__ + +[[autodoc]] LogitsProcessorList + - __call__ + +[[autodoc]] LogitsWarper + - __call__ + +[[autodoc]] MinLengthLogitsProcessor + - __call__ + +[[autodoc]] MinNewTokensLengthLogitsProcessor + - __call__ + +[[autodoc]] NoBadWordsLogitsProcessor + - __call__ + +[[autodoc]] NoRepeatNGramLogitsProcessor + - __call__ + +[[autodoc]] PrefixConstrainedLogitsProcessor + - __call__ + +[[autodoc]] RepetitionPenaltyLogitsProcessor + - __call__ + +[[autodoc]] SequenceBiasLogitsProcessor + - __call__ + +[[autodoc]] SuppressTokensAtBeginLogitsProcessor + - __call__ + +[[autodoc]] SuppressTokensLogitsProcessor + - __call__ + +[[autodoc]] TemperatureLogitsWarper + - __call__ + +[[autodoc]] TopKLogitsWarper + - __call__ + +[[autodoc]] TopPLogitsWarper + - __call__ + +[[autodoc]] TypicalLogitsWarper + - __call__ + +[[autodoc]] UnbatchedClassifierFreeGuidanceLogitsProcessor + - __call__ + +[[autodoc]] 
WhisperTimeStampLogitsProcessor + - __call__ + +### TensorFlow + +[[autodoc]] TFForcedBOSTokenLogitsProcessor + - __call__ + +[[autodoc]] TFForcedEOSTokenLogitsProcessor + - __call__ + +[[autodoc]] TFForceTokensLogitsProcessor + - __call__ + +[[autodoc]] TFLogitsProcessor + - __call__ + +[[autodoc]] TFLogitsProcessorList + - __call__ + +[[autodoc]] TFLogitsWarper + - __call__ + +[[autodoc]] TFMinLengthLogitsProcessor + - __call__ + +[[autodoc]] TFNoBadWordsLogitsProcessor + - __call__ + +[[autodoc]] TFNoRepeatNGramLogitsProcessor + - __call__ + +[[autodoc]] TFRepetitionPenaltyLogitsProcessor + - __call__ + +[[autodoc]] TFSuppressTokensAtBeginLogitsProcessor + - __call__ + +[[autodoc]] TFSuppressTokensLogitsProcessor + - __call__ + +[[autodoc]] TFTemperatureLogitsWarper + - __call__ + +[[autodoc]] TFTopKLogitsWarper + - __call__ + +[[autodoc]] TFTopPLogitsWarper + - __call__ + +### FLAX + +[[autodoc]] FlaxForcedBOSTokenLogitsProcessor + - __call__ + +[[autodoc]] FlaxForcedEOSTokenLogitsProcessor + - __call__ + +[[autodoc]] FlaxForceTokensLogitsProcessor + - __call__ + +[[autodoc]] FlaxLogitsProcessor + - __call__ + +[[autodoc]] FlaxLogitsProcessorList + - __call__ + +[[autodoc]] FlaxLogitsWarper + - __call__ + +[[autodoc]] FlaxMinLengthLogitsProcessor + - __call__ + +[[autodoc]] FlaxSuppressTokensAtBeginLogitsProcessor + - __call__ + +[[autodoc]] FlaxSuppressTokensLogitsProcessor + - __call__ + +[[autodoc]] FlaxTemperatureLogitsWarper + - __call__ + +[[autodoc]] FlaxTopKLogitsWarper + - __call__ + +[[autodoc]] FlaxTopPLogitsWarper + - __call__ + +[[autodoc]] FlaxWhisperTimeStampLogitsProcessor + - __call__ + +## StoppingCriteria + +可以使用[`StoppingCriteria`]来更改停止生成的时间(除了EOS token以外的方法)。请注意,这仅适用于我们的PyTorch实现。 + + +[[autodoc]] StoppingCriteria + - __call__ + +[[autodoc]] StoppingCriteriaList + - __call__ + +[[autodoc]] MaxLengthCriteria + - __call__ + +[[autodoc]] MaxTimeCriteria + - __call__ + +## Constraints + +可以使用[`Constraint`]来强制生成结果包含输出中的特定tokens或序列。请注意,这仅适用于我们的PyTorch实现。 + +[[autodoc]] Constraint + +[[autodoc]] PhrasalConstraint + +[[autodoc]] DisjunctiveConstraint + +[[autodoc]] ConstraintListState + +## BeamSearch + +[[autodoc]] BeamScorer + - process + - finalize + +[[autodoc]] BeamSearchScorer + - process + - finalize + +[[autodoc]] ConstrainedBeamSearchScorer + - process + - finalize + +## Utilities + +[[autodoc]] top_k_top_p_filtering + +[[autodoc]] tf_top_k_top_p_filtering + +## Streamers + +[[autodoc]] TextStreamer + +[[autodoc]] TextIteratorStreamer diff --git a/docs/source/zh/internal/image_processing_utils.md b/docs/source/zh/internal/image_processing_utils.md new file mode 100644 index 00000000000000..b3c784fa345237 --- /dev/null +++ b/docs/source/zh/internal/image_processing_utils.md @@ -0,0 +1,48 @@ + + +# Image Processors的工具 + +此页面列出了image processors使用的所有实用函数功能,主要是用于处理图像的功能变换。 + +其中大多数仅在您研究库中image processors的代码时有用。 + + +## 图像转换 + +[[autodoc]] image_transforms.center_crop + +[[autodoc]] image_transforms.center_to_corners_format + +[[autodoc]] image_transforms.corners_to_center_format + +[[autodoc]] image_transforms.id_to_rgb + +[[autodoc]] image_transforms.normalize + +[[autodoc]] image_transforms.pad + +[[autodoc]] image_transforms.rgb_to_id + +[[autodoc]] image_transforms.rescale + +[[autodoc]] image_transforms.resize + +[[autodoc]] image_transforms.to_pil_image + +## ImageProcessingMixin + +[[autodoc]] image_processing_utils.ImageProcessingMixin diff --git a/docs/source/zh/internal/modeling_utils.md b/docs/source/zh/internal/modeling_utils.md new file mode 100644 index 
00000000000000..93341b323e836b --- /dev/null +++ b/docs/source/zh/internal/modeling_utils.md @@ -0,0 +1,83 @@ + + +# 自定义层和工具 + +此页面列出了库使用的所有自定义层,以及它为模型提供的实用函数。 + +其中大多数只有在您研究库中模型的代码时才有用。 + + +## Pytorch自定义模块 + +[[autodoc]] pytorch_utils.Conv1D + +[[autodoc]] modeling_utils.PoolerStartLogits + - forward + +[[autodoc]] modeling_utils.PoolerEndLogits + - forward + +[[autodoc]] modeling_utils.PoolerAnswerClass + - forward + +[[autodoc]] modeling_utils.SquadHeadOutput + +[[autodoc]] modeling_utils.SQuADHead + - forward + +[[autodoc]] modeling_utils.SequenceSummary + - forward + +## PyTorch帮助函数 + +[[autodoc]] pytorch_utils.apply_chunking_to_forward + +[[autodoc]] pytorch_utils.find_pruneable_heads_and_indices + +[[autodoc]] pytorch_utils.prune_layer + +[[autodoc]] pytorch_utils.prune_conv1d_layer + +[[autodoc]] pytorch_utils.prune_linear_layer + +## TensorFlow自定义层 + +[[autodoc]] modeling_tf_utils.TFConv1D + +[[autodoc]] modeling_tf_utils.TFSequenceSummary + +## TensorFlow loss 函数 + +[[autodoc]] modeling_tf_utils.TFCausalLanguageModelingLoss + +[[autodoc]] modeling_tf_utils.TFMaskedLanguageModelingLoss + +[[autodoc]] modeling_tf_utils.TFMultipleChoiceLoss + +[[autodoc]] modeling_tf_utils.TFQuestionAnsweringLoss + +[[autodoc]] modeling_tf_utils.TFSequenceClassificationLoss + +[[autodoc]] modeling_tf_utils.TFTokenClassificationLoss + +## TensorFlow帮助函数 + +[[autodoc]] modeling_tf_utils.get_initializer + +[[autodoc]] modeling_tf_utils.keras_serializable + +[[autodoc]] modeling_tf_utils.shape_list diff --git a/docs/source/zh/internal/pipelines_utils.md b/docs/source/zh/internal/pipelines_utils.md new file mode 100644 index 00000000000000..30fdb8cd1d4006 --- /dev/null +++ b/docs/source/zh/internal/pipelines_utils.md @@ -0,0 +1,45 @@ + + +# pipelines的工具 + + +此页面列出了库为pipelines提供的所有实用程序功能。 + +其中大多数只有在您研究库中模型的代码时才有用。 + + +## 参数处理 + +[[autodoc]] pipelines.ArgumentHandler + +[[autodoc]] pipelines.ZeroShotClassificationArgumentHandler + +[[autodoc]] pipelines.QuestionAnsweringArgumentHandler + +## 数据格式 + +[[autodoc]] pipelines.PipelineDataFormat + +[[autodoc]] pipelines.CsvPipelineDataFormat + +[[autodoc]] pipelines.JsonPipelineDataFormat + +[[autodoc]] pipelines.PipedPipelineDataFormat + +## 实用函数 + +[[autodoc]] pipelines.PipelineException diff --git a/docs/source/zh/internal/time_series_utils.md b/docs/source/zh/internal/time_series_utils.md new file mode 100644 index 00000000000000..4b9093fbf478c5 --- /dev/null +++ b/docs/source/zh/internal/time_series_utils.md @@ -0,0 +1,31 @@ + + +# 时间序列工具 + + +此页面列出了可用于时间序列类模型的所有实用函数和类。 + +其中大多数仅在您研究时间序列模型的代码,或希望添加到分布输出类集合时有用。 + + +## 输出分布 + +[[autodoc]] time_series_utils.NormalOutput + +[[autodoc]] time_series_utils.StudentTOutput + +[[autodoc]] time_series_utils.NegativeBinomialOutput diff --git a/docs/source/zh/internal/tokenization_utils.md b/docs/source/zh/internal/tokenization_utils.md new file mode 100644 index 00000000000000..9f216131c122ef --- /dev/null +++ b/docs/source/zh/internal/tokenization_utils.md @@ -0,0 +1,43 @@ + + +# Tokenizers的工具 + +并保留格式:此页面列出了tokenizers使用的所有实用函数,主要是类 +[`~tokenization_utils_base.PreTrained TokenizerBase`] 实现了常用方法之间的 +[`PreTrained Tokenizer`] 和 [`PreTrained TokenizerFast`] 以及混合类 +[`~tokenization_utils_base.SpecialTokens Mixin`]。 + +其中大多数只有在您研究库中tokenizers的代码时才有用。 + + +## PreTrainedTokenizerBase + +[[autodoc]] tokenization_utils_base.PreTrainedTokenizerBase + - __call__ + - all + +## SpecialTokensMixin + +[[autodoc]] tokenization_utils_base.SpecialTokensMixin + +## Enums和namedtuples(命名元组) + +[[autodoc]] 
tokenization_utils_base.TruncationStrategy + +[[autodoc]] tokenization_utils_base.CharSpan + +[[autodoc]] tokenization_utils_base.TokenSpan diff --git a/docs/source/zh/internal/trainer_utils.md b/docs/source/zh/internal/trainer_utils.md new file mode 100644 index 00000000000000..fc28ba623c9d60 --- /dev/null +++ b/docs/source/zh/internal/trainer_utils.md @@ -0,0 +1,50 @@ + + +# Trainer的工具 + +此页面列出了 [`Trainer`] 使用的所有实用函数。 + +其中大多数仅在您研究库中Trainer的代码时有用。 + + +## 工具 + +[[autodoc]] EvalPrediction + +[[autodoc]] IntervalStrategy + +[[autodoc]] enable_full_determinism + +[[autodoc]] set_seed + +[[autodoc]] torch_distributed_zero_first + +## Callbacks内部机制 + +[[autodoc]] trainer_callback.CallbackHandler + +## 分布式评估 + +[[autodoc]] trainer_pt_utils.DistributedTensorGatherer + +## Trainer参数解析 + +[[autodoc]] HfArgumentParser + +## Debug工具 + +[[autodoc]] debug_utils.DebugUnderflowOverflow diff --git a/docs/source/zh/llm_tutorial.md b/docs/source/zh/llm_tutorial.md new file mode 100644 index 00000000000000..47a6742c89745a --- /dev/null +++ b/docs/source/zh/llm_tutorial.md @@ -0,0 +1,269 @@ + + + +## 使用LLMs进行生成 + +[[open-in-colab]] + +LLMs,即大语言模型,是文本生成背后的关键组成部分。简单来说,它们包含经过大规模预训练的transformer模型,用于根据给定的输入文本预测下一个词(或更准确地说,下一个`token`)。由于它们一次只预测一个`token`,因此除了调用模型之外,您需要执行更复杂的操作来生成新的句子——您需要进行自回归生成。 + +自回归生成是在给定一些初始输入,通过迭代调用模型及其自身的生成输出来生成文本的推理过程,。在🤗 Transformers中,这由[`~generation.GenerationMixin.generate`]方法处理,所有具有生成能力的模型都可以使用该方法。 + +本教程将向您展示如何: + +* 使用LLM生成文本 +* 避免常见的陷阱 +* 帮助您充分利用LLM下一步指导 + +在开始之前,请确保已安装所有必要的库: + + +```bash +pip install transformers bitsandbytes>=0.39.0 -q +``` + + +## 生成文本 + +一个用于[因果语言建模](tasks/language_modeling)训练的语言模型,将文本`tokens`序列作为输入,并返回下一个`token`的概率分布。 + + + +
+ +
"LLM的前向传递"
+
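+
+作为补充,下面给出一个简化的示意代码,展示一次前向传递如何得到下一个 `token` 的概率分布(这里为了便于演示,假设使用较小的 `openai-community/gpt2` 模型;对任何因果语言模型,做法都相同):
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
+model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
+
+inputs = tokenizer("A list of colors: red,", return_tensors="pt")
+with torch.no_grad():
+    logits = model(**inputs).logits  # shape: (batch_size, sequence_length, vocab_size)
+
+# The probability distribution over the next token comes from the logits at the last position
+next_token_probs = torch.softmax(logits[:, -1, :], dim=-1)
+```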
+ +使用LLM进行自回归生成的一个关键方面是如何从这个概率分布中选择下一个`token`。这个步骤可以随意进行,只要最终得到下一个迭代的`token`。这意味着可以简单的从概率分布中选择最可能的`token`,也可以复杂的在对结果分布进行采样之前应用多种变换,这取决于你的需求。 + + +
+ +
"自回归生成迭代地从概率分布中选择下一个token以生成文本"
+
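+
+如果想更直观地理解“选择下一个 `token` 并拼接回输入”这一迭代过程,可以参考下面这个手写的贪婪解码循环示意(仅用于说明原理,省略了注意力掩码、KV 缓存等细节;实际使用时请直接调用 [`~generation.GenerationMixin.generate`]。这里同样假设使用 `openai-community/gpt2`):
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
+model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
+input_ids = tokenizer("A list of colors: red,", return_tensors="pt").input_ids
+
+for _ in range(10):  # generate at most 10 new tokens
+    with torch.no_grad():
+        logits = model(input_ids=input_ids).logits
+    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy: pick the most likely token
+    input_ids = torch.cat([input_ids, next_token], dim=-1)  # append it and feed everything back in
+    if next_token.item() == tokenizer.eos_token_id:  # stop once the model emits the EOS token
+        break
+
+print(tokenizer.decode(input_ids[0]))
+```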
+ +上述过程是迭代重复的,直到达到某个停止条件。理想情况下,停止条件由模型决定,该模型应学会在何时输出一个结束序列(`EOS`)标记。如果不是这种情况,生成将在达到某个预定义的最大长度时停止。 + +正确设置`token`选择步骤和停止条件对于让你的模型按照预期的方式执行任务至关重要。这就是为什么我们为每个模型都有一个[~generation.GenerationConfig]文件,它包含一个效果不错的默认生成参数配置,并与您模型一起加载。 + +让我们谈谈代码! + + + +如果您对基本的LLM使用感兴趣,我们高级的[`Pipeline`](pipeline_tutorial)接口是一个很好的起点。然而,LLMs通常需要像`quantization`和`token选择步骤的精细控制`等高级功能,这最好通过[`~generation.GenerationMixin.generate`]来完成。使用LLM进行自回归生成也是资源密集型的操作,应该在GPU上执行以获得足够的吞吐量。 + + + +首先,您需要加载模型。 + +```py +>>> from transformers import AutoModelForCausalLM + +>>> model = AutoModelForCausalLM.from_pretrained( +... "mistralai/Mistral-7B-v0.1", device_map="auto", load_in_4bit=True +... ) +``` + +您将会注意到在`from_pretrained`调用中的两个标志: + +- `device_map`确保模型被移动到您的GPU(s)上 +- `load_in_4bit`应用[4位动态量化](main_classes/quantization)来极大地减少资源需求 + +还有其他方式来初始化一个模型,但这是一个开始使用LLM很好的起点。 + +接下来,你需要使用一个[tokenizer](tokenizer_summary)来预处理你的文本输入。 + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left") +>>> model_inputs = tokenizer(["A list of colors: red, blue"], return_tensors="pt").to("cuda") +``` + +`model_inputs`变量保存着分词后的文本输入以及注意力掩码。尽管[`~generation.GenerationMixin.generate`]在未传递注意力掩码时会尽其所能推断出注意力掩码,但建议尽可能传递它以获得最佳结果。 + +在对输入进行分词后,可以调用[`~generation.GenerationMixin.generate`]方法来返回生成的`tokens`。生成的`tokens`应该在打印之前转换为文本。 + +```py +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'A list of colors: red, blue, green, yellow, orange, purple, pink,' +``` + +最后,您不需要一次处理一个序列!您可以批量输入,这将在小延迟和低内存成本下显著提高吞吐量。您只需要确保正确地填充您的输入(详见下文)。 + +```py +>>> tokenizer.pad_token = tokenizer.eos_token # Most LLMs don't have a pad token by default +>>> model_inputs = tokenizer( +... ["A list of colors: red, blue", "Portugal is"], return_tensors="pt", padding=True +... ).to("cuda") +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True) +['A list of colors: red, blue, green, yellow, orange, purple, pink,', +'Portugal is a country in southwestern Europe, on the Iber'] +``` + +就是这样!在几行代码中,您就可以利用LLM的强大功能。 + + +## 常见陷阱 + +有许多[生成策略](generation_strategies),有时默认值可能不适合您的用例。如果您的输出与您期望的结果不匹配,我们已经创建了一个最常见的陷阱列表以及如何避免它们。 + +```py +>>> from transformers import AutoModelForCausalLM, AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1") +>>> tokenizer.pad_token = tokenizer.eos_token # Most LLMs don't have a pad token by default +>>> model = AutoModelForCausalLM.from_pretrained( +... "mistralai/Mistral-7B-v0.1", device_map="auto", load_in_4bit=True +... 
) +``` + +### 生成的输出太短/太长 + +如果在[`~generation.GenerationConfig`]文件中没有指定,`generate`默认返回20个tokens。我们强烈建议在您的`generate`调用中手动设置`max_new_tokens`以控制它可以返回的最大新tokens数量。请注意,LLMs(更准确地说,仅[解码器模型](https://huggingface.co/learn/nlp-course/chapter1/6?fw=pt))也将输入提示作为输出的一部分返回。 + +```py +>>> model_inputs = tokenizer(["A sequence of numbers: 1, 2"], return_tensors="pt").to("cuda") + +>>> # By default, the output will contain up to 20 tokens +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'A sequence of numbers: 1, 2, 3, 4, 5' + +>>> # Setting `max_new_tokens` allows you to control the maximum length +>>> generated_ids = model.generate(**model_inputs, max_new_tokens=50) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'A sequence of numbers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,' +``` + +### 错误的生成模式 + +默认情况下,除非在[`~generation.GenerationConfig`]文件中指定,否则`generate`会在每个迭代中选择最可能的token(贪婪解码)。对于您的任务,这可能是不理想的;像聊天机器人或写作文章这样的创造性任务受益于采样。另一方面,像音频转录或翻译这样的基于输入的任务受益于贪婪解码。通过将`do_sample=True`启用采样,您可以在这篇[博客文章](https://huggingface.co/blog/how-to-generate)中了解更多关于这个话题的信息。 + +```py +>>> # Set seed or reproducibility -- you don't need this unless you want full reproducibility +>>> from transformers import set_seed +>>> set_seed(42) + +>>> model_inputs = tokenizer(["I am a cat."], return_tensors="pt").to("cuda") + +>>> # LLM + greedy decoding = repetitive, boring output +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'I am a cat. I am a cat. I am a cat. I am a cat' + +>>> # With sampling, the output becomes more creative! +>>> generated_ids = model.generate(**model_inputs, do_sample=True) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'I am a cat. Specifically, I am an indoor-only cat. I' +``` + +### 错误的填充位置 + +LLMs是[仅解码器](https://huggingface.co/learn/nlp-course/chapter1/6?fw=pt)架构,意味着它们会持续迭代您的输入提示。如果您的输入长度不相同,则需要对它们进行填充。由于LLMs没有接受过从`pad tokens`继续训练,因此您的输入需要左填充。确保在生成时不要忘记传递注意力掩码! + +```py +>>> # The tokenizer initialized above has right-padding active by default: the 1st sequence, +>>> # which is shorter, has padding on the right side. Generation fails to capture the logic. +>>> model_inputs = tokenizer( +... ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt" +... ).to("cuda") +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'1, 2, 33333333333' + +>>> # With left-padding, it works as expected! +>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left") +>>> tokenizer.pad_token = tokenizer.eos_token # Most LLMs don't have a pad token by default +>>> model_inputs = tokenizer( +... ["1, 2, 3", "A, B, C, D, E"], padding=True, return_tensors="pt" +... ).to("cuda") +>>> generated_ids = model.generate(**model_inputs) +>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] +'1, 2, 3, 4, 5, 6,' +``` + +### 错误的提示 + +一些模型和任务期望某种输入提示格式才能正常工作。当未应用此格式时,您将获得悄然的性能下降:模型能工作,但不如预期提示那样好。有关提示的更多信息,包括哪些模型和任务需要小心,可在[指南](tasks/prompting)中找到。让我们看一个使用[聊天模板](chat_templating)的聊天LLM示例: + +```python +>>> tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha") +>>> model = AutoModelForCausalLM.from_pretrained( +... "HuggingFaceH4/zephyr-7b-alpha", device_map="auto", load_in_4bit=True +... ) +>>> set_seed(0) +>>> prompt = """How many helicopters can a human eat in one sitting? 
Reply as a thug.""" +>>> model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda") +>>> input_length = model_inputs.input_ids.shape[1] +>>> generated_ids = model.generate(**model_inputs, max_new_tokens=20) +>>> print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0]) +"I'm not a thug, but i can tell you that a human cannot eat" +>>> # Oh no, it did not follow our instruction to reply as a thug! Let's see what happens when we write +>>> # a better prompt and use the right template for this model (through `tokenizer.apply_chat_template`) + +>>> set_seed(0) +>>> messages = [ +... { +... "role": "system", +... "content": "You are a friendly chatbot who always responds in the style of a thug", +... }, +... {"role": "user", "content": "How many helicopters can a human eat in one sitting?"}, +... ] +>>> model_inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda") +>>> input_length = model_inputs.shape[1] +>>> generated_ids = model.generate(model_inputs, do_sample=True, max_new_tokens=20) +>>> print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0]) +'None, you thug. How bout you try to focus on more useful questions?' +>>> # As we can see, it followed a proper thug style 😎 +``` + +## 更多资源 + +虽然自回归生成过程相对简单,但要充分利用LLM可能是一个具有挑战性的任务,因为很多组件复杂且密切关联。以下是帮助您深入了解LLM使用和理解的下一步: + +### 高级生成用法 + +1. [指南](generation_strategies),介绍如何控制不同的生成方法、如何设置生成配置文件以及如何进行输出流式传输; +2. [指南](chat_templating),介绍聊天LLMs的提示模板; +3. [指南](tasks/prompting),介绍如何充分利用提示设计; +4. API参考文档,包括[`~generation.GenerationConfig`]、[`~generation.GenerationMixin.generate`]和[与生成相关的类](internal/generation_utils)。 + +### LLM排行榜 + +1. [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), 侧重于开源模型的质量; +2. [Open LLM-Perf Leaderboard](https://huggingface.co/spaces/optimum/llm-perf-leaderboard), 侧重于LLM的吞吐量. + +### 延迟、吞吐量和内存利用率 + +1. [指南](llm_tutorial_optimization),如何优化LLMs以提高速度和内存利用; +2. [指南](main_classes/quantization), 关于`quantization`,如bitsandbytes和autogptq的指南,教您如何大幅降低内存需求。 + +### 相关库 + +1. [`text-generation-inference`](https://github.com/huggingface/text-generation-inference), 一个面向生产的LLM服务器; +2. 
[`optimum`](https://github.com/huggingface/optimum), 一个🤗 Transformers的扩展,优化特定硬件设备的性能 diff --git a/docs/source/zh/main_classes/agent.md b/docs/source/zh/main_classes/agent.md new file mode 100644 index 00000000000000..8c7cc53fefd6a4 --- /dev/null +++ b/docs/source/zh/main_classes/agent.md @@ -0,0 +1,101 @@ + + +# Agents和工具 + + + +Transformers Agents是一个实验性的API,它随时可能发生变化。由于API或底层模型容易发生变化,因此由agents返回的结果可能会有所不同。 + + + + +要了解更多关于agents和工具的信息,请确保阅读[介绍指南](../transformers_agents)。此页面包含底层类的API文档。 + + +## Agents + +我们提供三种类型的agents:[`HfAgent`]使用开源模型的推理端点,[`LocalAgent`]使用您在本地选择的模型,[`OpenAiAgent`]使用OpenAI封闭模型。 + + +### HfAgent + +[[autodoc]] HfAgent + +### LocalAgent + +[[autodoc]] LocalAgent + +### OpenAiAgent + +[[autodoc]] OpenAiAgent + +### AzureOpenAiAgent + +[[autodoc]] AzureOpenAiAgent + +### Agent + +[[autodoc]] Agent + - chat + - run + - prepare_for_new_chat + +## 工具 + +### load_tool + +[[autodoc]] load_tool + +### Tool + +[[autodoc]] Tool + +### PipelineTool + +[[autodoc]] PipelineTool + +### RemoteTool + +[[autodoc]] RemoteTool + +### launch_gradio_demo + +[[autodoc]] launch_gradio_demo + +## Agent类型 + +Agents可以处理工具之间任何类型的对象;工具是多模态的,可以接受和返回文本、图像、音频、视频等类型。为了增加工具之间的兼容性,以及正确地在ipython(jupyter、colab、ipython notebooks等)中呈现这些返回值,我们实现了这些类型的包装类。 + +被包装的对象应该继续按照最初的行为方式运作;文本对象应该仍然像字符串一样运作,图像对象应该仍然像`PIL.Image`一样运作。 + +这些类型有三个特定目的: + +- 对类型调用 `to_raw` 应该返回底层对象 +- 对类型调用 `to_string` 应该将对象作为字符串返回:在`AgentText`的情况下可能是字符串,但在其他情况下可能是对象序列化版本的路径 +- 在ipython内核中显示它应该正确显示对象 + +### AgentText + +[[autodoc]] transformers.tools.agent_types.AgentText + +### AgentImage + +[[autodoc]] transformers.tools.agent_types.AgentImage + +### AgentAudio + +[[autodoc]] transformers.tools.agent_types.AgentAudio diff --git a/docs/source/zh/main_classes/callback.md b/docs/source/zh/main_classes/callback.md new file mode 100644 index 00000000000000..be05c37aec9e73 --- /dev/null +++ b/docs/source/zh/main_classes/callback.md @@ -0,0 +1,125 @@ + + +# Callbacks + + +Callbacks可以用来自定义PyTorch [Trainer]中训练循环行为的对象(此功能尚未在TensorFlow中实现),该对象可以检查训练循环状态(用于进度报告、在TensorBoard或其他ML平台上记录日志等),并做出决策(例如提前停止)。 + +Callbacks是“只读”的代码片段,除了它们返回的[TrainerControl]对象外,它们不能更改训练循环中的任何内容。对于需要更改训练循环的自定义,您应该继承[Trainer]并重载您需要的方法(有关示例,请参见[trainer](trainer))。 + +默认情况下,`TrainingArguments.report_to` 设置为"all",然后[Trainer]将使用以下callbacks。 + + +- [`DefaultFlowCallback`],它处理默认的日志记录、保存和评估行为 +- [`PrinterCallback`] 或 [`ProgressCallback`],用于显示进度和打印日志(如果通过[`TrainingArguments`]停用tqdm,则使用第一个函数;否则使用第二个)。 +- [`~integrations.TensorBoardCallback`],如果TensorBoard可访问(通过PyTorch版本 >= 1.4 或者 tensorboardX)。 +- [`~integrations.WandbCallback`],如果安装了[wandb](https://www.wandb.com/)。 +- [`~integrations.CometCallback`],如果安装了[comet_ml](https://www.comet.ml/site/)。 +- [`~integrations.MLflowCallback`],如果安装了[mlflow](https://www.mlflow.org/)。 +- [`~integrations.NeptuneCallback`],如果安装了[neptune](https://neptune.ai/)。 +- [`~integrations.AzureMLCallback`],如果安装了[azureml-sdk](https://pypi.org/project/azureml-sdk/)。 +- [`~integrations.CodeCarbonCallback`],如果安装了[codecarbon](https://pypi.org/project/codecarbon/)。 +- [`~integrations.ClearMLCallback`],如果安装了[clearml](https://github.com/allegroai/clearml)。 +- [`~integrations.DagsHubCallback`],如果安装了[dagshub](https://dagshub.com/)。 +- [`~integrations.FlyteCallback`],如果安装了[flyte](https://flyte.org/)。 +- [`~integrations.DVCLiveCallback`],如果安装了[dvclive](https://dvc.org/doc/dvclive)。 + +如果安装了一个软件包,但您不希望使用相关的集成,您可以将 `TrainingArguments.report_to` 更改为仅包含您想要使用的集成的列表(例如 `["azure_ml", "wandb"]`)。 + 
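+
+例如,下面是一个示意性的写法(`output_dir` 等其他参数请按你的实际训练脚本填写):
+
+```python
+from transformers import TrainingArguments
+
+# Only report to Weights & Biases, even if other integrations are installed
+training_args = TrainingArguments(output_dir="output_dir", report_to=["wandb"])
+```
+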
+实现callbacks的主要类是[`TrainerCallback`]。它获取用于实例化[`Trainer`]的[`TrainingArguments`],可以通过[`TrainerState`]访问该Trainer的内部状态,并可以通过[`TrainerControl`]对训练循环执行一些操作。 + + +## 可用的Callbacks + +这里是库里可用[`TrainerCallback`]的列表: + +[[autodoc]] integrations.CometCallback + - setup + +[[autodoc]] DefaultFlowCallback + +[[autodoc]] PrinterCallback + +[[autodoc]] ProgressCallback + +[[autodoc]] EarlyStoppingCallback + +[[autodoc]] integrations.TensorBoardCallback + +[[autodoc]] integrations.WandbCallback + - setup + +[[autodoc]] integrations.MLflowCallback + - setup + +[[autodoc]] integrations.AzureMLCallback + +[[autodoc]] integrations.CodeCarbonCallback + +[[autodoc]] integrations.NeptuneCallback + +[[autodoc]] integrations.ClearMLCallback + +[[autodoc]] integrations.DagsHubCallback + +[[autodoc]] integrations.FlyteCallback + +[[autodoc]] integrations.DVCLiveCallback + - setup + +## TrainerCallback + +[[autodoc]] TrainerCallback + +以下是如何使用PyTorch注册自定义callback的示例: + +[`Trainer`]: + +```python +class MyCallback(TrainerCallback): + "A callback that prints a message at the beginning of training" + + def on_train_begin(self, args, state, control, **kwargs): + print("Starting training") + + +trainer = Trainer( + model, + args, + train_dataset=train_dataset, + eval_dataset=eval_dataset, + callbacks=[MyCallback], # We can either pass the callback class this way or an instance of it (MyCallback()) +) +``` + +注册callback的另一种方式是调用 `trainer.add_callback()`,如下所示: + + +```python +trainer = Trainer(...) +trainer.add_callback(MyCallback) +# Alternatively, we can pass an instance of the callback class +trainer.add_callback(MyCallback()) +``` + +## TrainerState + +[[autodoc]] TrainerState + +## TrainerControl + +[[autodoc]] TrainerControl diff --git a/docs/source/zh/main_classes/configuration.md b/docs/source/zh/main_classes/configuration.md new file mode 100644 index 00000000000000..755a3170419bf4 --- /dev/null +++ b/docs/source/zh/main_classes/configuration.md @@ -0,0 +1,28 @@ + + +# Configuration + +基类[`PretrainedConfig`]实现了从本地文件或目录加载/保存配置的常见方法,或下载库提供的预训练模型配置(从HuggingFace的AWS S3库中下载)。 + +每个派生的配置类都实现了特定于模型的属性。所有配置类中共同存在的属性有:`hidden_size`、`num_attention_heads` 和 `num_hidden_layers`。文本模型进一步添加了 `vocab_size`。 + + +## PretrainedConfig + +[[autodoc]] PretrainedConfig + - push_to_hub + - all diff --git a/docs/source/zh/main_classes/data_collator.md b/docs/source/zh/main_classes/data_collator.md new file mode 100644 index 00000000000000..d947b53ea14cb2 --- /dev/null +++ b/docs/source/zh/main_classes/data_collator.md @@ -0,0 +1,65 @@ + + +# Data Collator + +Data collators是一个对象,通过使用数据集元素列表作为输入来形成一个批次。这些元素与 `train_dataset` 或 `eval_dataset` 的元素类型相同。 + +为了能够构建批次,Data collators可能会应用一些预处理(比如填充)。其中一些(比如[`DataCollatorForLanguageModeling`])还会在形成的批次上应用一些随机数据增强(比如随机掩码)。 + +在[示例脚本](../examples)或[示例notebooks](../notebooks)中可以找到使用的示例。 + + +## Default data collator + +[[autodoc]] data.data_collator.default_data_collator + +## DefaultDataCollator + +[[autodoc]] data.data_collator.DefaultDataCollator + +## DataCollatorWithPadding + +[[autodoc]] data.data_collator.DataCollatorWithPadding + +## DataCollatorForTokenClassification + +[[autodoc]] data.data_collator.DataCollatorForTokenClassification + +## DataCollatorForSeq2Seq + +[[autodoc]] data.data_collator.DataCollatorForSeq2Seq + +## DataCollatorForLanguageModeling + +[[autodoc]] data.data_collator.DataCollatorForLanguageModeling + - numpy_mask_tokens + - tf_mask_tokens + - torch_mask_tokens + +## DataCollatorForWholeWordMask + +[[autodoc]] data.data_collator.DataCollatorForWholeWordMask + - 
numpy_mask_tokens + - tf_mask_tokens + - torch_mask_tokens + +## DataCollatorForPermutationLanguageModeling + +[[autodoc]] data.data_collator.DataCollatorForPermutationLanguageModeling + - numpy_mask_tokens + - tf_mask_tokens + - torch_mask_tokens diff --git a/docs/source/zh/main_classes/deepspeed.md b/docs/source/zh/main_classes/deepspeed.md new file mode 100644 index 00000000000000..75a0a13df75e24 --- /dev/null +++ b/docs/source/zh/main_classes/deepspeed.md @@ -0,0 +1,2100 @@ + + +# DeepSpeed集成 + +[DeepSpeed](https://github.com/microsoft/DeepSpeed)实现了[ZeRO论文](https://arxiv.org/abs/1910.02054)中描述的所有内容。目前,它提供对以下功能的全面支持: + +1. 优化器状态分区(ZeRO stage 1) +2. 梯度分区(ZeRO stage 2) +3. 参数分区(ZeRO stage 3) +4. 自定义混合精度训练处理 +5. 一系列基于CUDA扩展的快速优化器 +6. ZeRO-Offload 到 CPU 和 NVMe + +ZeRO-Offload有其自己的专门论文:[ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840)。而NVMe支持在论文[ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)中进行了描述。 + +DeepSpeed ZeRO-2主要用于训练,因为它的特性对推理没有用处。 + +DeepSpeed ZeRO-3也可以用于推理,因为它允许将单个GPU无法加载的大模型加载到多个GPU上。 + +🤗 Transformers通过以下两种方式集成了[DeepSpeed](https://github.com/microsoft/DeepSpeed): + +1. 通过[`Trainer`]集成核心的DeepSpeed功能。这是一种“为您完成一切”式的集成 - 您只需提供自定义配置文件或使用我们的模板配置文件。本文档的大部分内容都集中在这个功能上。 +2. 如果您不使用[`Trainer`]并希望在自己的Trainer中集成DeepSpeed,那么像`from_pretrained`和`from_config`这样的核心功能函数将包括ZeRO stage 3及以上的DeepSpeed的基础部分,如`zero.Init`。要利用此功能,请阅读有关[非Trainer DeepSpeed集成](#nontrainer-deepspeed-integration)的文档。 + +集成的内容: + +训练: + +1. DeepSpeed ZeRO训练支持完整的ZeRO stages 1、2和3,以及ZeRO-Infinity(CPU和NVMe offload)。 + +推理: + +1. DeepSpeed ZeRO推理支持ZeRO stage 3和ZeRO-Infinity。它使用与训练相同的ZeRO协议,但不使用优化器和学习率调度器,只有stage 3与推理相关。更多详细信息请参阅:[zero-inference](#zero-inference)。 + +此外还有DeepSpeed推理 - 这是一种完全不同的技术,它使用张量并行而不是ZeRO(即将推出)。 + + +
+ + +## Trainer DeepSpeed 集成 + + + + +### 安装 + +通过pypi安装库: + + +```bash +pip install deepspeed +``` + +或通过 `transformers` 的 `extras`安装: + +```bash +pip install transformers[deepspeed] +``` + +或在 [DeepSpeed 的 GitHub 页面](https://github.com/microsoft/deepspeed#installation) 和 +[高级安装](https://www.deepspeed.ai/tutorials/advanced-install/) 中查找更多详细信息。 + +如果构建过程中仍然遇到问题,请首先确保阅读 [CUDA 扩展安装注意事项](trainer#cuda-extension-installation-notes)。 + +如果您没有预先构建扩展而是在运行时构建它们,而且您尝试了以上所有解决方案都无效,下一步可以尝试在安装之前预先构建扩展。 + +进行 DeepSpeed 的本地构建: + + +```bash +git clone https://github.com/microsoft/DeepSpeed/ +cd DeepSpeed +rm -rf build +TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \ +--global-option="build_ext" --global-option="-j8" --no-cache -v \ +--disable-pip-version-check 2>&1 | tee build.log +``` + +如果您打算使用 NVMe offload,您还需要在上述说明中添加 `DS_BUILD_AIO=1`(并且还需要在系统范围内安装 *libaio-dev*)。 + +编辑 `TORCH_CUDA_ARCH_LIST` 以插入您打算使用的 GPU 卡的架构代码。假设您的所有卡都是相同的,您可以通过以下方式获取架构: + +```bash +CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())" +``` + +因此,如果您得到 `8, 6`,则使用 `TORCH_CUDA_ARCH_LIST="8.6"`。如果您有多个不同的卡,您可以像这样列出所有卡 `TORCH_CUDA_ARCH_LIST="6.1;8.6"`。 + +如果您需要在多台机器上使用相同的设置,请创建一个二进制 wheel: + + +```bash +git clone https://github.com/microsoft/DeepSpeed/ +cd DeepSpeed +rm -rf build +TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 \ +python setup.py build_ext -j8 bdist_wheel +``` + +它将生成类似于 `dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl` 的文件,现在您可以在本地或任何其他机器上安装它,如 `pip install deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl`。 + +再次提醒确保调整 `TORCH_CUDA_ARCH_LIST` 以匹配目标架构。 + +您可以在[这里](https://developer.nvidia.com/cuda-gpus)找到完整的 NVIDIA GPU 列表及其对应的 **计算能力**(与此上下文中的架构相同)。 + +您可以使用以下命令检查 PyTorch 构建时使用的架构: + + +```bash +python -c "import torch; print(torch.cuda.get_arch_list())" +``` + +以下是如何查找已安装 GPU 中的一张卡的架构。例如,对于 GPU 0: + +```bash +CUDA_VISIBLE_DEVICES=0 python -c "import torch; \ +print(torch.cuda.get_device_properties(torch.device('cuda')))" +``` + +如果输出结果如下: + +```bash +_CudaDeviceProperties(name='GeForce RTX 3090', major=8, minor=6, total_memory=24268MB, multi_processor_count=82) +``` + +然后您就知道这张卡的架构是 `8.6`。 + +您也可以完全省略 `TORCH_CUDA_ARCH_LIST`,然后构建程序将自动查询构建所在的 GPU 的架构。这可能与目标机器上的 GPU 不匹配,因此最好明确指定所需的架构。 + +如果尝试了所有建议的方法仍然遇到构建问题,请继续在 [Deepspeed](https://github.com/microsoft/DeepSpeed/issues)的 GitHub Issue 上提交问题。 + + + + +### 多GPU启用 + +为了启用DeepSpeed 集成,调整 [`Trainer`] 的命令行参数,添加一个新的参数 `--deepspeed ds_config.json`,其中 `ds_config.json` 是 DeepSpeed 配置文件,如文档 [这里](https://www.deepspeed.ai/docs/config-json/) 所述。文件命名由您决定。 +建议使用 DeepSpeed 的 `add_config_arguments` 程序将必要的命令行参数添加到您的代码中。 +有关更多信息,请参阅 [DeepSpeed 的参数解析](https://deepspeed.readthedocs.io/en/latest/initialize.html#argument-parsing) 文档。 + +在这里,您可以使用您喜欢的启动器。您可以继续使用 PyTorch 启动器: + + +```bash +torch.distributed.run --nproc_per_node=2 your_program.py --deepspeed ds_config.json +``` + +或使用由 `deepspeed` 提供的启动器: + + +```bash +deepspeed --num_gpus=2 your_program.py --deepspeed ds_config.json +``` + + +正如您所见,这两个启动器的参数不同,但对于大多数需求,任何一个都可以满足工作需求。有关如何配置各个节点和 GPU 的完整详细信息,请查看 [此处](https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node)。 + +当您使用 `deepspeed` 启动器并且希望使用所有可用的 GPU 时,您可以简单地省略 `--num_gpus` 标志。 + +以下是在 DeepSpeed 中启用使用所有可用 GPU情况下, 运行 `run_translation.py` 的示例: + + +```bash +deepspeed examples/pytorch/translation/run_translation.py \ +--deepspeed tests/deepspeed/ds_config_zero3.json \ +--model_name_or_path google-t5/t5-small --per_device_train_batch_size 1 \ +--output_dir output_dir 
--overwrite_output_dir --fp16 \ +--do_train --max_train_samples 500 --num_train_epochs 1 \ +--dataset_name wmt16 --dataset_config "ro-en" \ +--source_lang en --target_lang ro +``` + +请注意,在 DeepSpeed 文档中,您可能会看到 `--deepspeed --deepspeed_config ds_config.json` - 即两个与 DeepSpeed 相关的参数,但为简单起见,并且因为已经有很多参数要处理,我们将两者合并为一个单一参数。 + +有关一些实际使用示例,请参阅 [此帖](https://github.com/huggingface/transformers/issues/8771#issuecomment-759248400)。 + + + + + +### 单GPU启用 + +要使用一张 GPU 启用 DeepSpeed,调整 [`Trainer`] 的命令行参数如下: + + +```bash +deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \ +--deepspeed tests/deepspeed/ds_config_zero2.json \ +--model_name_or_path google-t5/t5-small --per_device_train_batch_size 1 \ +--output_dir output_dir --overwrite_output_dir --fp16 \ +--do_train --max_train_samples 500 --num_train_epochs 1 \ +--dataset_name wmt16 --dataset_config "ro-en" \ +--source_lang en --target_lang ro +``` + +这与多 GPU 的情况几乎相同,但在这里我们通过 `--num_gpus=1` 明确告诉 DeepSpeed 仅使用一张 GPU。默认情况下,DeepSpeed 启用给定节点上可以看到的所有 GPU。如果您一开始只有一张 GPU,那么您不需要这个参数。以下 [文档](https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node) 讨论了启动器的选项。 + +为什么要在仅使用一张 GPU 的情况下使用 DeepSpeed 呢? + +1. 它具有 ZeRO-offload 功能,可以将一些计算和内存委托给主机的 CPU 和 内存,从而为模型的需求保留更多 GPU 资源 - 例如更大的批处理大小,或启用正常情况下无法容纳的非常大模型。 +2. 它提供了智能的 GPU 内存管理系统,最小化内存碎片,这再次允许您容纳更大的模型和数据批次。 + +虽然接下来我们将详细讨论配置,但在单个 GPU 上通过 DeepSpeed 实现巨大性能提升的关键是在配置文件中至少有以下配置: + + +```json +{ + "zero_optimization": { + "stage": 2, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "allgather_partitions": true, + "allgather_bucket_size": 2e8, + "reduce_scatter": true, + "reduce_bucket_size": 2e8, + "overlap_comm": true, + "contiguous_gradients": true + } +} +``` + +这会启用`optimizer offload `和一些其他重要功能。您可以尝试不同的buffer大小,有关详细信息,请参见下面的讨论。 + +关于这种启用类型的实际使用示例,请参阅 [此帖](https://github.com/huggingface/transformers/issues/8771#issuecomment-759176685)。 + +您还可以尝试使用本文后面进一步解释的支持`CPU 和 NVMe offload`功能的ZeRO-3 。 + + + + +注意: + +- 如果您需要在特定的 GPU 上运行,而不是 GPU 0,则无法使用 `CUDA_VISIBLE_DEVICES` 来限制可用 GPU 的可见范围。相反,您必须使用以下语法: + + ```bash + deepspeed --include localhost:1 examples/pytorch/translation/run_translation.py ... + ``` + + 在这个例子中,我们告诉 DeepSpeed 使用 GPU 1(第二个 GPU)。 + + + + + +### 多节点启用 + +这一部分的信息不仅适用于 DeepSpeed 集成,也适用于任何多节点程序。但 DeepSpeed 提供了一个比其他启动器更易于使用的 `deepspeed` 启动器,除非您在 SLURM 环境中。 + +在本节,让我们假设您有两个节点,每个节点有 8 张 GPU。您可以通过 `ssh hostname1` 访问第一个节点,通过 `ssh hostname2` 访问第二个节点,两者必须能够在本地通过 ssh 无密码方式相互访问。当然,您需要将这些主机(节点)名称重命名为您实际使用的主机名称。 + + +#### torch.distributed.run启动器 + + +例如,要使用 `torch.distributed.run`,您可以执行以下操作: + +```bash +python -m torch.distributed.run --nproc_per_node=8 --nnode=2 --node_rank=0 --master_addr=hostname1 \ +--master_port=9901 your_program.py --deepspeed ds_config.json +``` + +您必须 ssh 到每个节点,并在每个节点上运行相同的命令!不用担心,启动器会等待两个节点同步完成。 + +有关更多信息,请参阅 [torchrun](https://pytorch.org/docs/stable/elastic/run.html)。顺便说一下,这也是替代了几个 PyTorch 版本前的 `torch.distributed.launch` 的启动器。 + + +#### deepspeed启动器 + +要改用 `deepspeed` 启动器,首先需要创建一个 `hostfile` 文件: + +``` +hostname1 slots=8 +hostname2 slots=8 +``` +然后,您可以这样启动: + +```bash +deepspeed --num_gpus 8 --num_nodes 2 --hostfile hostfile --master_addr hostname1 --master_port=9901 \ +your_program.py --deepspeed ds_config.json +``` + +与 `torch.distributed.run` 启动器不同,`deepspeed` 将自动在两个节点上启动此命令! 
+ +更多信息,请参阅[资源配置(多节点)](https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node)。 + + +#### 在 SLURM 环境中启动 + +在 SLURM 环境中,可以采用以下方法。以下是一个 SLURM 脚本 `launch.slurm`,您需要根据您的具体 SLURM 环境进行调整。 + +```bash +#SBATCH --job-name=test-nodes # name +#SBATCH --nodes=2 # nodes +#SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node! +#SBATCH --cpus-per-task=10 # number of cores per tasks +#SBATCH --gres=gpu:8 # number of gpus +#SBATCH --time 20:00:00 # maximum execution time (HH:MM:SS) +#SBATCH --output=%x-%j.out # output file name + +export GPUS_PER_NODE=8 +export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1) +export MASTER_PORT=9901 + +srun --jobid $SLURM_JOBID bash -c 'python -m torch.distributed.run \ + --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES --node_rank $SLURM_PROCID \ + --master_addr $MASTER_ADDR --master_port $MASTER_PORT \ +your_program.py --deepspeed ds_config.json' +``` + +剩下的就是运行它: + +```bash +sbatch launch.slurm +``` + +`srun` 将负责在所有节点上同时启动程序。 + + +#### 使用非共享文件系统 + +默认情况下,DeepSpeed 假定多节点环境使用共享存储。如果不是这种情况,每个节点只能看到本地文件系统,你需要调整配置文件,包含一个 [`checkpoint` 部分](https://www.deepspeed.ai/docs/config-json/#checkpoint-options)并设置如下选项: + +```json +{ + "checkpoint": { + "use_node_local_storage": true + } +} +``` + +或者,你还可以使用 [`Trainer`] 的 `--save_on_each_node` 参数,上述配置将自动添加。 + + + + +### 在Notebooks启用 + +在将`notebook cells`作为脚本运行的情况下,问题在于没有正常的 `deepspeed` 启动器可依赖,因此在某些设置下,我们必须仿真运行它。 + +如果您只使用一个 GPU,以下是如何调整notebook中的训练代码以使用 DeepSpeed。 + +```python +# DeepSpeed requires a distributed environment even when only one process is used. +# This emulates a launcher in the notebook +import os + +os.environ["MASTER_ADDR"] = "localhost" +os.environ["MASTER_PORT"] = "9994" # modify if RuntimeError: Address already in use +os.environ["RANK"] = "0" +os.environ["LOCAL_RANK"] = "0" +os.environ["WORLD_SIZE"] = "1" + +# Now proceed as normal, plus pass the deepspeed config file +training_args = TrainingArguments(..., deepspeed="ds_config_zero3.json") +trainer = Trainer(...) 
+trainer.train() +``` + +注意:`...` 代表您传递给函数的正常参数。 + +如果要使用多于一个 GPU,您必须在 DeepSpeed 中使用多进程环境。也就是说,您必须使用专门的启动器来实现这一目的,而不能通过仿真本节开头呈现的分布式环境来完成。 + +如果想要在notebook中动态创建配置文件并保存在当前目录,您可以在一个专用的cell中使用: + +```python no-style +%%bash +cat <<'EOT' > ds_config_zero3.json +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + + "optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } + }, + + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } + }, + + "zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "offload_param": { + "device": "cpu", + "pin_memory": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": "auto", + "stage3_prefetch_bucket_size": "auto", + "stage3_param_persistence_threshold": "auto", + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_16bit_weights_on_model_save": true + }, + + "gradient_accumulation_steps": "auto", + "gradient_clipping": "auto", + "steps_per_print": 2000, + "train_batch_size": "auto", + "train_micro_batch_size_per_gpu": "auto", + "wall_clock_breakdown": false +} +EOT +``` + +如果训练脚本在一个普通文件中而不是在notebook cells中,您可以通过笔记本中的 shell 正常启动 `deepspeed`。例如,要使用 `run_translation.py`,您可以这样启动: + +```python no-style +!git clone https://github.com/huggingface/transformers +!cd transformers; deepspeed examples/pytorch/translation/run_translation.py ... +``` + +或者使用 `%%bash` 魔术命令,您可以编写多行代码,用于运行 shell 程序: + +```python no-style +%%bash + +git clone https://github.com/huggingface/transformers +cd transformers +deepspeed examples/pytorch/translation/run_translation.py ... +``` + +在这种情况下,您不需要本节开头呈现的任何代码。 + +注意:虽然 `%%bash` 魔术命令很方便,但目前它会缓冲输出,因此在进程完成之前您看不到日志。 + + + + +### 配置 + +有关可以在 DeepSpeed 配置文件中使用的完整配置选项的详细指南,请参阅[以下文档](https://www.deepspeed.ai/docs/config-json/)。 + +您可以在 [DeepSpeedExamples 仓库](https://github.com/microsoft/DeepSpeedExamples)中找到解决各种实际需求的数十个 DeepSpeed 配置示例。 + +```bash +git clone https://github.com/microsoft/DeepSpeedExamples +cd DeepSpeedExamples +find . -name '*json' +``` + +延续上面的代码,假设您要配置 Lamb 优化器。那么您可以通过以下方式在示例的 `.json` 文件中进行搜索: + +```bash +grep -i Lamb $(find . 
-name '*json') +``` + +还可以在[主仓](https://github.com/microsoft/DeepSpeed)中找到更多示例。 + +在使用 DeepSpeed 时,您总是需要提供一个 DeepSpeed 配置文件,但是一些配置参数必须通过命令行进行配置。您将在本指南的剩余章节找到这些细微差别。 + +为了了解 DeepSpeed 配置文件,这里有一个激活 ZeRO stage 2 功能的示例,包括优化器状态的 CPU offload,使用 `AdamW` 优化器和 `WarmupLR` 调度器,并且如果传递了 `--fp16` 参数将启用混合精度训练: + +```json +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + + "optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } + }, + + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } + }, + + "zero_optimization": { + "stage": 2, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "allgather_partitions": true, + "allgather_bucket_size": 2e8, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 2e8, + "contiguous_gradients": true + }, + + "gradient_accumulation_steps": "auto", + "gradient_clipping": "auto", + "train_batch_size": "auto", + "train_micro_batch_size_per_gpu": "auto", +} +``` + +当您执行程序时,DeepSpeed 将把它从 [`Trainer`] 收到的配置日志输出到console,因此您可以看到传递给它的最终配置。 + + + + + +### 传递配置 + +正如本文档讨论的那样,通常将 DeepSpeed 配置作为指向 JSON 文件的路径传递,但如果您没有使用命令行界面配置训练,而是通过 [`TrainingArguments`] 实例化 [`Trainer`],那么对于 `deepspeed` 参数,你可以传递一个嵌套的 `dict`。这使您能够即时创建配置,而无需在将其传递给 [`TrainingArguments`] 之前将其写入文件系统。 + +总结起来,您可以这样做: + +```python +TrainingArguments(..., deepspeed="/path/to/ds_config.json") +``` + +或者: + +```python +ds_config_dict = dict(scheduler=scheduler_params, optimizer=optimizer_params) +TrainingArguments(..., deepspeed=ds_config_dict) +``` + + + +### 共享配置 + + + + +这一部分是必读的。 + + + +一些配置值对于 [`Trainer`] 和 DeepSpeed 正常运行都是必需的,因此,为了防止定义冲突及导致的难以检测的错误,我们选择通过 [`Trainer`] 命令行参数配置这些值。 + +此外,一些配置值是基于模型的配置自动派生的,因此,与其记住手动调整多个值,最好让 [`Trainer`] 为您做大部分配置。 + +因此,在本指南的其余部分,您将找到一个特殊的配置值:`auto`,当设置时将自动将参数替换为正确或最有效的值。请随意选择忽略此建议或显式设置该值,在这种情况下,请务必确保 [`Trainer`] 参数和 DeepSpeed 配置保持一致。例如,您是否使用相同的学习率、批量大小或梯度累积设置?如果这些不匹配,训练可能以非常难以检测的方式失败。请重视该警告。 + +还有一些参数是仅适用于 DeepSpeed 的,并且这些参数必须手动设置以适应您的需求。 + +在您自己的程序中,如果您想要作为主动修改 DeepSpeed 配置并以此配置 [`TrainingArguments`],您还可以使用以下方法。步骤如下: + +1. 创建或加载要用作主配置的 DeepSpeed 配置 +2. 
根据这些参数值创建 [`TrainingArguments`] 对象 + +请注意,一些值,比如 `scheduler.params.total_num_steps`,是在 [`Trainer`] 的 `train` 过程中计算的,但当然您也可以自己计算这些值。 + + + + +### ZeRO + +[Zero Redundancy Optimizer (ZeRO)](https://www.deepspeed.ai/tutorials/zero/) 是 DeepSpeed 的工作核心。它支持3个不同级别(stages)的优化。Stage 1 对于扩展性来说不是很有趣,因此本文档重点关注Stage 2和Stage 3。Stage 3通过最新的 ZeRO-Infinity 进一步改进。你可以在 DeepSpeed 文档中找到更详细的信息。 + +配置文件的 `zero_optimization` 部分是最重要的部分([文档](https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training)),因为在这里您定义了要启用哪些 ZeRO stages 以及如何配置它们。您可以在 DeepSpeed 文档中找到每个参数的解释。 + +这一部分必须通过 DeepSpeed 配置文件单独配置 - [`Trainer`] 不提供相应的命令行参数。 + +注意:目前 DeepSpeed 不验证参数名称,因此如果您拼错了任何参数,它将使用拼写错误的参数的默认设置。您可以观察 DeepSpeed 引擎启动日志消息,看看它将使用哪些值。 + + + +#### ZeRO-2 配置 + +以下是 ZeRO stage 2 的配置示例: + +```json +{ + "zero_optimization": { + "stage": 2, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "allgather_partitions": true, + "allgather_bucket_size": 5e8, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 5e8, + "contiguous_gradients": true + } +} +``` + +**性能调优:** + +- 启用 `offload_optimizer` 应该减少 GPU 内存使用(需要 `"stage": 2`)。 +- `"overlap_comm": true` 通过增加 GPU 内存使用来降低all-reduce 的延迟。 `overlap_comm` 使用了 `allgather_bucket_size` 和 `reduce_bucket_size` 值的4.5倍。因此,如果它们设置为 `5e8`,这将需要一个9GB的内存占用(`5e8 x 2Bytes x 2 x 4.5`)。因此,如果您的 GPU 内存为8GB或更小,为了避免出现OOM错误,您需要将这些参数减小到约 `2e8`,这将需要3.6GB。如果您的 GPU 容量更大,当您开始遇到OOM时,你可能也需要这样做。 +- 当减小这些buffers时,您以更慢的通信速度来换取更多的 GPU 内存。buffers大小越小,通信速度越慢,GPU 可用于其他任务的内存就越多。因此,如果更大的批处理大小很重要,那么稍微减慢训练时间可能是一个很好的权衡。 + +此外,`deepspeed==0.4.4` 添加了一个新选项 `round_robin_gradients`,您可以通过以下方式启用: + +```json +{ + "zero_optimization": { + "round_robin_gradients": true + } +} +``` +这是一个用于 CPU offloading 的stage 2优化,通过细粒度梯度分区在 ranks 之间并行复制到 CPU 内存,从而实现了性能的提升。性能优势随着梯度累积步骤(在优化器步骤之间进行更多复制)或 GPU 数量(增加并行性)增加而增加。 + + + +#### ZeRO-3 配置 + +以下是 ZeRO stage 3的配置示例: + +```json +{ + "zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "offload_param": { + "device": "cpu", + "pin_memory": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": "auto", + "stage3_prefetch_bucket_size": "auto", + "stage3_param_persistence_threshold": "auto", + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_16bit_weights_on_model_save": true + } +} +``` + +如果您因为你的模型或激活值超过 GPU 内存而遇到OOM问题,并且您有未使用的 CPU 内存,可以通股票使用 `"device": "cpu"` 将优化器状态和参数卸载到 CPU 内存中,来解决这个限制。如果您不想卸载到 CPU 内存,可以在 `device` 条目中使用 `none` 代替 `cpu`。将优化器状态卸载到 NVMe 上会在后面进一步讨论。 + +通过将 `pin_memory` 设置为 `true` 启用固定内存。此功能会以减少可用于其他进程的内存为代价来提高吞吐量。固定内存被分配给特定请求它的进程,通常比普通 CPU 内存访问速度更快。 + +**性能调优:** + +- `stage3_max_live_parameters`: `1e9` +- `stage3_max_reuse_distance`: `1e9` + +如果遇到OOM问题,请减小 `stage3_max_live_parameters` 和 `stage3_max_reuse_distance`。它们对性能的影响应该很小,除非您正在进行激活值checkpointing。`1e9` 大约会消耗 ~2GB。内存由 `stage3_max_live_parameters` 和 `stage3_max_reuse_distance` 共享,所以它不是叠加的,而是总共2GB。 + +`stage3_max_live_parameters` 是在任何给定时间要在 GPU 上保留多少个完整参数的上限。"reuse distance" 是我们用来确定参数在将来何时会再次使用的度量标准,我们使用 `stage3_max_reuse_distance` 来决定是丢弃参数还是保留参数。如果一个参数在不久的将来(小于 `stage3_max_reuse_distance`)将被再次使用,那么我们将其保留以减少通信开销。这在启用激活值checkpoing时非常有用,其中我们以单层粒度进行前向重计算和反向传播,并希望在反向传播期间保留前向重计算中的参数。 + +以下配置值取决于模型的隐藏大小: + +- `reduce_bucket_size`: `hidden_size*hidden_size` +- `stage3_prefetch_bucket_size`: `0.9 * hidden_size * hidden_size` +- `stage3_param_persistence_threshold`: `10 * hidden_size` + +因此,将这些值设置为 `auto`,[`Trainer`] 将自动分配推荐的参数值。当然,如果您愿意,也可以显式设置这些值。 
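+
+如果您确实想显式设置这些值,下面是一个简单的示意(仅按照上面列出的公式计算,并假设模型配置中可以访问 `hidden_size` 属性;某些模型的配置里该属性可能映射自 `d_model` 等其他名称):
+
+```python
+from transformers import AutoConfig
+
+config = AutoConfig.from_pretrained("google-t5/t5-small")
+# 假设:配置暴露了 hidden_size(例如 T5 通过 attribute_map 将其映射到 d_model)
+hidden_size = config.hidden_size
+
+# 按上文给出的推荐公式计算各个 bucket/threshold 值
+zero3_values = {
+    "reduce_bucket_size": hidden_size * hidden_size,
+    "stage3_prefetch_bucket_size": int(0.9 * hidden_size * hidden_size),
+    "stage3_param_persistence_threshold": 10 * hidden_size,
+}
+print(zero3_values)
+```
+
+计算得到的数值可以直接替换上面 `zero_optimization` 配置中对应的 `auto` 字段。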
+ +`stage3_gather_16bit_weights_on_model_save` 在模型保存时启用模型的 fp16 权重整合。对于大模型和多个 GPU,无论是在内存还是速度方面,这都是一项昂贵的操作。目前如果计划恢复训练,这是必需的。请注意未来的更新可能会删除此限制并让使用更加灵活。 + +如果您从 ZeRO-2 配置迁移,请注意 `allgather_partitions`、`allgather_bucket_size` 和 `reduce_scatter` 配置参数在 ZeRO-3 中不被使用。如果保留这些配置文件,它们将被忽略。 + +- `sub_group_size`: `1e9` + +`sub_group_size` 控制在优化器步骤期间更新参数的粒度。参数被分组到大小为 `sub_group_size` 的桶中,每个桶逐个更新。在 ZeRO-Infinity 中与 NVMe offload一起使用时,`sub_group_size` 控制了在优化器步骤期间在 NVMe 和 CPU 内存之间移动模型状态的粒度。这可以防止非常大的模型耗尽 CPU 内存。 + +当不使用 NVMe offload时,可以将 `sub_group_size` 保留为其默认值 *1e9*。在以下情况下,您可能需要更改其默认值: + +1. 在优化器步骤中遇到OOM:减小 `sub_group_size` 以减少临时buffers的内存利用 +2. 优化器步骤花费很长时间:增加 `sub_group_size` 以提高由于增加的数据buffers而导致的带宽利用率。 + + +#### ZeRO-0 配置 + +请注意,我们将 Stage 0 和 1 放在最后,因为它们很少使用。 + +Stage 0 禁用了所有类型的分片,只是将 DeepSpeed 作为 DDP 使用。您可以通过以下方式启用: + +```json +{ + "zero_optimization": { + "stage": 0 + } +} +``` + +这将实质上禁用 ZeRO,而无需更改其他任何内容。 + + +#### ZeRO-1 配置 + + +Stage 1 等同于 Stage 2 减去梯度分片。您可以尝试使用以下配置,仅对优化器状态进行分片,以稍微加速: + + +```json +{ + "zero_optimization": { + "stage": 1 + } +} +``` + + + + + +### NVMe 支持 + +ZeRO-Infinity 通过使用 NVMe 内存扩展 GPU 和 CPU 内存,从而允许训练非常大的模型。由于智能分区和平铺算法,在offload期间每个 GPU 需要发送和接收非常小量的数据,因此 NVMe 被证明适用于训练过程中提供更大的总内存池。ZeRO-Infinity 需要启用 ZeRO-3。 + +以下配置示例启用 NVMe 来offload优化器状态和参数: + +```json +{ + "zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "nvme", + "nvme_path": "/local_nvme", + "pin_memory": true, + "buffer_count": 4, + "fast_init": false + }, + "offload_param": { + "device": "nvme", + "nvme_path": "/local_nvme", + "pin_memory": true, + "buffer_count": 5, + "buffer_size": 1e8, + "max_in_cpu": 1e9 + }, + "aio": { + "block_size": 262144, + "queue_depth": 32, + "thread_count": 1, + "single_submit": false, + "overlap_events": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": "auto", + "stage3_prefetch_bucket_size": "auto", + "stage3_param_persistence_threshold": "auto", + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_16bit_weights_on_model_save": true + }, +} +``` + +您可以选择将优化器状态和参数都卸载到 NVMe,也可以只选择其中一个,或者都不选择。例如,如果您有大量的 CPU 内存可用,只卸载到 CPU 内存训练速度会更快(提示:"device": "cpu")。 + +这是有关卸载 [优化器状态](https://www.deepspeed.ai/docs/config-json/#optimizer-offloading) 和 [参数](https://www.deepspeed.ai/docs/config-json/#parameter-offloading) 的完整文档。 + +确保您的 `nvme_path` 实际上是一个 NVMe,因为它与普通硬盘或 SSD 一起工作,但速度会慢得多。快速可扩展的训练是根据现代 NVMe 传输速度设计的(截至本文撰写时,可以达到 ~3.5GB/s 读取,~3GB/s 写入的峰值速度)。 + +为了找出最佳的 `aio` 配置块,您必须在目标设置上运行一个基准测试,具体操作请参见[说明](https://github.com/microsoft/DeepSpeed/issues/998)。 + + + + + +#### ZeRO-2 和 ZeRO-3 性能对比 + +如果其他一切都配置相同,ZeRO-3 可能比 ZeRO-2 慢,因为前者除了 ZeRO-2 的操作外,还必须收集模型权重。如果 ZeRO-2 满足您的需求,而且您不需要扩展到几个 GPU 以上,那么您可以选择继续使用它。重要的是要理解,ZeRO-3 以速度为代价实现了更高的可扩展性。 + +可以调整 ZeRO-3 配置使其性能接近 ZeRO-2: + +- 将 `stage3_param_persistence_threshold` 设置为一个非常大的数字 - 大于最大的参数,例如 `6 * hidden_size * hidden_size`。这将保留参数在 GPU 上。 +- 关闭 `offload_params`,因为 ZeRO-2 没有这个选项。 + +即使不更改 `stage3_param_persistence_threshold`,仅将 `offload_params` 关闭,性能可能会显著提高。当然,这些更改将影响您可以训练的模型的大小。因此,这些更改可根据需求帮助您在可扩展性和速度之间进行权衡。 + + + + + +#### ZeRO-2 示例 + +这是一个完整的 ZeRO-2 自动配置文件 `ds_config_zero2.json`: + +```json +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + + "optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } + }, + + "scheduler": { + "type": "WarmupLR", + "params": { + 
"warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } + }, + + "zero_optimization": { + "stage": 2, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "allgather_partitions": true, + "allgather_bucket_size": 2e8, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 2e8, + "contiguous_gradients": true + }, + + "gradient_accumulation_steps": "auto", + "gradient_clipping": "auto", + "steps_per_print": 2000, + "train_batch_size": "auto", + "train_micro_batch_size_per_gpu": "auto", + "wall_clock_breakdown": false +} +``` + +这是一个完整的手动设置的启用所有功能的 ZeRO-2 配置文件。主要是为了让您看到典型的参数值是什么样的,但我们强烈建议使用其中包含多个 `auto` 设置的配置文件。 + +```json +{ + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + + "optimizer": { + "type": "AdamW", + "params": { + "lr": 3e-5, + "betas": [0.8, 0.999], + "eps": 1e-8, + "weight_decay": 3e-7 + } + }, + + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": 0, + "warmup_max_lr": 3e-5, + "warmup_num_steps": 500 + } + }, + + "zero_optimization": { + "stage": 2, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "allgather_partitions": true, + "allgather_bucket_size": 2e8, + "overlap_comm": true, + "reduce_scatter": true, + "reduce_bucket_size": 2e8, + "contiguous_gradients": true + }, + + "steps_per_print": 2000, + "wall_clock_breakdown": false +} +``` + + + +#### ZeRO-3 示例 + +这是一个完整的 ZeRO-3 自动配置文件 `ds_config_zero3.json`: + +```json +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + + "optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } + }, + + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } + }, + + "zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "offload_param": { + "device": "cpu", + "pin_memory": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": "auto", + "stage3_prefetch_bucket_size": "auto", + "stage3_param_persistence_threshold": "auto", + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_16bit_weights_on_model_save": true + }, + + "gradient_accumulation_steps": "auto", + "gradient_clipping": "auto", + "steps_per_print": 2000, + "train_batch_size": "auto", + "train_micro_batch_size_per_gpu": "auto", + "wall_clock_breakdown": false +} +``` + +这是一个完整的 手动设置的启用所有功能的ZeRO-3 配置文件。主要是为了让您看到典型的参数值是什么样的,但我们强烈建议使用其中包含多个 `auto` 设置的配置文件。 + +```json +{ + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + }, + + "optimizer": { + "type": "AdamW", + "params": { + "lr": 3e-5, + "betas": [0.8, 0.999], + "eps": 1e-8, + "weight_decay": 3e-7 + } + }, + + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": 0, + "warmup_max_lr": 3e-5, + "warmup_num_steps": 500 + } + }, + + "zero_optimization": { + "stage": 3, + "offload_optimizer": { + "device": "cpu", + "pin_memory": true + }, + "offload_param": { + "device": "cpu", + "pin_memory": true + }, + "overlap_comm": true, + "contiguous_gradients": true, + "sub_group_size": 1e9, + "reduce_bucket_size": 1e6, + 
"stage3_prefetch_bucket_size": 0.94e6, + "stage3_param_persistence_threshold": 1e4, + "stage3_max_live_parameters": 1e9, + "stage3_max_reuse_distance": 1e9, + "stage3_gather_16bit_weights_on_model_save": true + }, + + "steps_per_print": 2000, + "wall_clock_breakdown": false +} +``` + +#### 如何选择最佳性能的ZeRO Stage和 offloads + +了解了这些不同stages后,现在您需要决定使用哪个stage。本节将尝试回答这个问题。 + +通常,以下规则适用: + +- 速度方面(左边比右边快) + + stage 0(DDP) > stage 1 > stage 2 > stage 2 + offload > stage 3 > stage3 + offload + +- GPU内存使用方面(右边比左边更节省GPU内存) + + stage 0(DDP) < stage 1 < stage 2 < stage 2 + offload < stage 3 < stage 3 + offload + +所以,当您希望在尽量使用较少数量的GPU的同时获得最快的执行速度时,可以按照以下步骤进行。我们从最快的方法开始,如果遇到GPU内存溢出,然后切换到下一个速度较慢但使用的GPU内存更少的方法。以此类推。 + +首先,将批量大小设置为1(您始终可以使用梯度累积来获得任何所需的有效批量大小)。 + + +1. 启用 `--gradient_checkpointing 1`(HF Trainer)或直接 `model.gradient_checkpointing_enable()` - 如果发生OOM(Out of Memory),则执行以下步骤。 +2. 首先尝试 ZeRO stage 2。如果发生OOM,则执行以下步骤。 +3. 尝试 ZeRO stage 2 + `offload_optimizer` - 如果发生OOM,则执行以下步骤。 +4. 切换到 ZeRO stage 3 - 如果发生OOM,则执行以下步骤。 +5. 启用 `offload_param` 到 `cpu` - 如果发生OOM,则执行以下步骤。 +6. 启用 `offload_optimizer` 到 `cpu` - 如果发生OOM,则执行以下步骤。 +7. 如果仍然无法适应批量大小为1,请首先检查各种默认值并尽可能降低它们。例如,如果使用 `generate` 并且不使用宽搜索束,将其缩小,因为它会占用大量内存。 +8. 绝对要使用混合半精度而非fp32 - 在Ampere及更高的GPU上使用bf16,在旧的GPU体系结构上使用fp16。 +9. 如果仍然发生OOM,可以添加更多硬件或启用ZeRO-Infinity - 即切换 `offload_param` 和 `offload_optimizer` 到 `nvme`。您需要确保它是非常快的NVMe。作为趣闻,我曾经能够在一个小型GPU上使用BLOOM-176B进行推理,使用了ZeRO-Infinity,尽管速度非常慢。但它奏效了! + +当然,您也可以按相反的顺序进行这些步骤,从最节省GPU内存的配置开始,然后逐步反向进行,或者尝试进行二分法。 + +一旦您的批量大小为1不会导致OOM,就测量您的有效吞吐量。 + +接下来尝试将批量大小增加到尽可能大,因为批量大小越大,GPU的效率越高,特别是在它们乘法运算的矩阵很大时。 + +现在性能优化游戏开始了。您可以关闭一些offload特性,或者降低ZeRO stage,并增加/减少批量大小,再次测量有效吞吐量。反复尝试,直到满意为止。 + +不要花费太多时间,但如果您即将开始一个为期3个月的训练 - 请花几天时间找到吞吐量方面最有效的设置。这样您的训练成本将最低,而且您会更快地完成训练。在当前快节奏的机器学习世界中,如果您花费一个额外的月份来训练某样东西,你很可能会错过一个黄金机会。当然,这只是我分享的一种观察,我并不是在催促你。在开始训练BLOOM-176B之前,我花了2天时间进行这个过程,成功将吞吐量从90 TFLOPs提高到150 TFLOPs!这一努力为我们节省了一个多月的训练时间。 + +这些注释主要是为训练模式编写的,但它们在推理中也应该大部分适用。例如,在推理中,Gradient Checkpointing 是无用的,因为它只在训练过程中有用。此外,我们发现,如果你正在进行多GPU推理并且不使用 [DeepSpeed-Inference](https://www.deepspeed.ai/tutorials/inference-tutorial/),[Accelerate](https://huggingface.co/blog/bloom-inference-pytorch-scripts) 应该提供更优越的性能。 + +其他与性能相关的快速注释: +- 如果您从头开始训练某个模型,请尽量确保张量的形状可以被16整除(例如隐藏层大小)。对于批量大小,至少尝试可被2整除。如果您想从GPU中挤取更高性能,还有一些硬件特定的[wave和tile量化](https://developer.nvidia.com/blog/optimizing-gpu-performance-tensor-cores/)的可整除性。 + + + +### Activation Checkpointing 或 Gradient Checkpointing + +Activation Checkpointing和Gradient Checkpointing是指相同方法的两个不同术语。这确实让人感到困惑,但事实就是这样。 + +Gradient Checkpointing允许通过牺牲速度来换取GPU内存,这要么使您能够克服GPU内存溢出,要么增加批量大小来获得更好的性能。 + +HF Transformers 模型对DeepSpeed的Activation Checkpointing一无所知,因此如果尝试在DeepSpeed配置文件中启用该功能,什么都不会发生。 + +因此,您有两种方法可以利用这个非常有益的功能: + +1. 如果您想使用 HF Transformers 模型,你可以使用 `model.gradient_checkpointing_enable()` 或在 HF Trainer 中使用 `--gradient_checkpointing`,它会自动为您启用这个功能。在这里使用了 `torch.utils.checkpoint`。 +2. 
如果您编写自己的模型并希望使用DeepSpeed的Activation Checkpointing,可以使用[规定的API](https://deepspeed.readthedocs.io/en/latest/activation-checkpointing.html)。您还可以使用 HF Transformers 的模型代码,将 `torch.utils.checkpoint` 替换为 DeepSpeed 的API。后者更灵活,因为它允许您将前向激活值卸载到CPU内存,而不是重新计算它们。 + + +### Optimizer 和 Scheduler + +只要你不启用 `offload_optimizer`,您可以混合使用DeepSpeed和HuggingFace的调度器和优化器,但有一个例外,即不要使用HuggingFace调度器和DeepSpeed优化器的组合: + + +| Combos | HF Scheduler | DS Scheduler | +|:-------------|:-------------|:-------------| +| HF Optimizer | Yes | Yes | +| DS Optimizer | No | Yes | + +在启用 `offload_optimizer` 的情况下,可以使用非DeepSpeed优化器,只要该优化器具有CPU和GPU的实现(除了LAMB)。 + + + +#### Optimizer + +DeepSpeed的主要优化器包括Adam、AdamW、OneBitAdam和Lamb。这些优化器已经与ZeRO进行了彻底的测试,因此建议使用它们。然而,也可以导入`torch`中的其他优化器。完整的文档在[这里](https://www.deepspeed.ai/docs/config-json/#optimizer-parameters)。 + +如果在配置文件中不配置`optimizer`条目,[`Trainer`] 将自动将其设置为 `AdamW`,并使用提供的值或以下命令行参数的默认值:`--learning_rate`、`--adam_beta1`、`--adam_beta2`、`--adam_epsilon` 和 `--weight_decay`。 + +以下是`AdamW` 的自动配置示例: + +```json +{ + "optimizer": { + "type": "AdamW", + "params": { + "lr": "auto", + "betas": "auto", + "eps": "auto", + "weight_decay": "auto" + } + } +} +``` + +请注意,命令行参数将设置配置文件中的值。这是为了有一个明确的值来源,并避免在不同地方设置学习率等值时难以找到的错误。命令行参数配置高于其他。被覆盖的值包括: + +- `lr` 的值为 `--learning_rate` +- `betas` 的值为 `--adam_beta1 --adam_beta2` +- `eps` 的值为 `--adam_epsilon` +- `weight_decay` 的值为 `--weight_decay` + +因此,请记住在命令行上调整共享的超参数。 + +您也可以显式地设置这些值: + +```json +{ + "optimizer": { + "type": "AdamW", + "params": { + "lr": 0.001, + "betas": [0.8, 0.999], + "eps": 1e-8, + "weight_decay": 3e-7 + } + } +} +``` + +但在这种情况下,您需要自己同步[`Trainer`]命令行参数和DeepSpeed配置。 + +如果您想使用上面未列出的其他优化器,您将不得不将其添加到顶层配置中。 + +```json +{ + "zero_allow_untested_optimizer": true +} +``` + +类似于 `AdamW`,您可以配置其他官方支持的优化器。只是记住这些可能有不同的配置值。例如,对于Adam,您可能需要将 `weight_decay` 设置在 `0.01` 左右。 + +此外,当与DeepSpeed的CPU Adam优化器一起使用时,offload的效果最好。如果您想在offload时使用不同的优化器,自 `deepspeed==0.8.3` 起,您还需要添加: + + +```json +{ + "zero_force_ds_cpu_optimizer": false +} +``` +到顶层配置中。 + + + + + +#### Scheduler + +DeepSpeed支持`LRRangeTest`、`OneCycle`、`WarmupLR`和`WarmupDecayLR`学习率调度器。完整文档在[这里](https://www.deepspeed.ai/docs/config-json/#scheduler-parameters)。 + +以下是🤗 Transformers 和 DeepSpeed 之间的调度器重叠部分: + +- 通过 `--lr_scheduler_type constant_with_warmup` 实现 `WarmupLR` +- 通过 `--lr_scheduler_type linear` 实现 `WarmupDecayLR`。这也是 `--lr_scheduler_type` 的默认值,因此,如果不配置调度器,这将是默认配置的调度器。 + +如果在配置文件中不配置 `scheduler` 条目,[`Trainer`] 将使用 `--lr_scheduler_type`、`--learning_rate` 和 `--warmup_steps` 或 `--warmup_ratio` 的值来配置其🤗 Transformers 版本。 + +以下是 `WarmupLR` 的自动配置示例: + +```json +{ + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": "auto", + "warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } + } +} +``` + +由于使用了 *"auto"*,[`Trainer`] 的参数将在配置文件中设置正确的值。这是为了有一个明确的值来源,并避免在不同地方设置学习率等值时难以找到的错误。命令行配置高于其他。被设置的值包括: + +- `warmup_min_lr` 的值为 `0`。 +- `warmup_max_lr` 的值为 `--learning_rate`。 +- `warmup_num_steps` 的值为 `--warmup_steps`(如果提供)。否则,将使用 `--warmup_ratio` 乘以训练步骤的数量,并四舍五入。 +- `total_num_steps` 的值为 `--max_steps` 或者如果没有提供,将在运行时根据环境、数据集的大小和其他命令行参数(对于 `WarmupDecayLR` 来说需要)自动推导。 + +当然,您可以接管任何或所有的配置值,并自行设置这些值: + +```json +{ + "scheduler": { + "type": "WarmupLR", + "params": { + "warmup_min_lr": 0, + "warmup_max_lr": 0.001, + "warmup_num_steps": 1000 + } + } +} +``` + +但在这种情况下,您需要自己同步[`Trainer`]命令行参数和DeepSpeed配置。 + +例如,对于 `WarmupDecayLR`,您可以使用以下条目: + +```json +{ + "scheduler": { + "type": "WarmupDecayLR", + "params": { + "last_batch_iteration": -1, + "total_num_steps": "auto", + "warmup_min_lr": "auto", + 
"warmup_max_lr": "auto", + "warmup_num_steps": "auto" + } + } +} +``` + +然后,`total_num_steps`、`warmup_max_lr`、`warmup_num_steps` 和 `total_num_steps` 将在加载时设置。 + + + + +### fp32精度 + +DeepSpeed支持完整的fp32和fp16混合精度。 + +由于fp16混合精度具有更小的内存需求和更快的速度,唯一不使用它的时候是当您使用的模型在这种训练模式下表现不佳时。通常,当模型没有在fp16混合精度下进行预训练时(例如,bf16预训练模型经常出现这种情况),会出现这种情况。这样的模型可能会发生溢出或下溢,导致 `NaN` 损失。如果是这种情况,那么您将希望使用完整的fp32模式,通过显式禁用默认启用的fp16混合精度模式: + +```json +{ + "fp16": { + "enabled": false, + } +} +``` + +如果您使用基于Ampere架构的GPU,PyTorch版本1.7及更高版本将自动切换到使用更高效的tf32格式进行一些操作,但结果仍将以fp32格式呈现。有关详细信息和基准测试,请参见[TensorFloat-32(TF32) on Ampere devices](https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices)。如果出于某种原因您不希望使用它,该文档包括有关如何禁用此自动转换的说明。 + +在🤗 Trainer中,你可以使用 `--tf32` 来启用它,或使用 `--tf32 0` 或 `--no_tf32` 来禁用它。默认情况下,使用PyTorch的默认设置。 + + + + + +### 自动混合精度 + +您可以使用自动混合精度,可以选择使用类似 PyTorch AMP 的方式,也可以选择使用类似 Apex 的方式: + +### fp16 + +要配置PyTorch AMP-like 的 fp16(float16) 模式,请设置: + +```json +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + } +} +``` + +并且,[`Trainer`]将根据`args.fp16_backend`的值自动启用或禁用它。其余的配置值由您决定。 + +当传递`--fp16 --fp16_backend amp`或`--fp16_full_eval`命令行参数时,此模式将被启用。 + +您也可以显式地启用/禁用此模式: + +```json +{ + "fp16": { + "enabled": true, + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + } +} +``` + +但是之后您需要自己同步[`Trainer`]命令行参数和DeepSpeed配置。 + +以下是[相关文档](https://www.deepspeed.ai/docs/config-json/#fp16-training-options) + + +### bf16 + +如果需要使用bfloat16而不是fp16,那么可以使用以下配置部分: + +```json +{ + "bf16": { + "enabled": "auto" + } +} +``` + +bf16具有与fp32相同的动态范围,因此不需要损失缩放。 + +当传递`--bf16`或`--bf16_full_eval`命令行参数时,启用此模式。 + +您还可以显式地启用/禁用此模式: + +```json +{ + "bf16": { + "enabled": true + } +} +``` + + + +在`deepspeed==0.6.0`版本中,bf16支持是新的实验性功能。 + +如果您启用了bf16来进行[梯度累积](#gradient-accumulation),您需要意识到它会以bf16累积梯度,这可能不是您想要的,因为这种格式的低精度可能会导致lossy accumulation。 + +修复这个问题的工作正在努力进行,同时提供了使用更高精度的`dtype`(fp16或fp32)的选项。 + + + + +### NCCL集合 + +在训练过程中,有两种数据类型:`dtype`和用于通信收集操作的`dtype`,如各种归约和收集/分散操作。 + +所有的gather/scatter操作都是在数据相同的`dtype`中执行的,所以如果您正在使用bf16的训练模式,那么它将在bf16中进行gather操作 - gather操作是非损失性的。 + +各种reduce操作可能会是非常损失性的,例如当梯度在多个gpu上平均时,如果通信是在fp16或bf16中进行的,那么结果可能是有损失性的 - 因为当在一个低精度中添加多个数字时,结果可能不是精确的。更糟糕的是,bf16比fp16具有更低的精度。通常,当平均梯度时,损失最小,这些梯度通常非常小。因此,对于半精度训练,默认情况下,fp16被用作reduction操作的默认值。但是,您可以完全控制这个功能,如果你选择的话,您可以添加一个小的开销,并确保reductions将使用fp32作为累积数据类型,只有当结果准备好时,它才会降级到您在训练中使用的半精度`dtype`。 + +要覆盖默认设置,您只需添加一个新的配置条目: + +```json +{ + "communication_data_type": "fp32" +} +``` + +根据这个信息,有效的值包括"fp16"、"bfp16"和"fp32"。 + +注意:在stage zero 3中,bf16通信数据类型存在一个bug,该问题已在`deepspeed==0.8.1`版本中得到修复。 + + +### apex + +配置apex AMP-like模式: + +```json +"amp": { + "enabled": "auto", + "opt_level": "auto" +} +``` + +并且,[`Trainer`]将根据`args.fp16_backend`和`args.fp16_opt_level`的值自动配置它。 + +当传递`--fp16 --fp16_backend apex --fp16_opt_level 01`命令行参数时,此模式将被启用。 + +您还可以显式配置此模式: + +```json +{ + "amp": { + "enabled": true, + "opt_level": "O1" + } +} +``` + +但是,您需要自己同步[`Trainer`]命令行参数和DeepSpeed配置。 + +这里是[文档](https://www.deepspeed.ai/docs/config-json/#automatic-mixed-precision-amp-training-options) + + + + +### Batch Size + +配置batch size可以使用如下参数: + +```json +{ + "train_batch_size": "auto", + "train_micro_batch_size_per_gpu": "auto" +} +``` + +并且,[`Trainer`]将自动将`train_micro_batch_size_per_gpu`设置为`args.per_device_train_batch_size`的值,并将`train_batch_size`设置为`args.world_size * args.per_device_train_batch_size * args.gradient_accumulation_steps`。 + 
+您也可以显式设置这些值: + +```json +{ + "train_batch_size": 12, + "train_micro_batch_size_per_gpu": 4 +} +``` + +但是,您需要自己同步[`Trainer`]命令行参数和DeepSpeed配置。 + + + + +### Gradient Accumulation + +配置gradient accumulation设置如下: + +```json +{ + "gradient_accumulation_steps": "auto" +} +``` + +并且,[`Trainer`]将自动将其设置为`args.gradient_accumulation_steps`的值。 + +您也可以显式设置这个值: + +```json +{ + "gradient_accumulation_steps": 3 +} +``` + +但是,您需要自己同步[`Trainer`]命令行参数和DeepSpeed配置。 + + + + +### Gradient Clipping + +配置gradient clipping如下: + +```json +{ + "gradient_clipping": "auto" +} +``` + +并且,[`Trainer`]将自动将其设置为`args.max_grad_norm`的值。 + +您也可以显式设置这个值: + +```json +{ + "gradient_clipping": 1.0 +} +``` + +但是,您需要自己同步[`Trainer`]命令行参数和DeepSpeed配置。 + + + + + +### 获取模型权重 + +只要您继续使用DeepSpeed进行训练和恢复,您就不需要担心任何事情。DeepSpeed在其自定义检查点优化器文件中存储fp32主权重,这些文件是`global_step*/*optim_states.pt`(这是glob模式),并保存在正常的checkpoint下。 + +**FP16权重:** + +当模型保存在ZeRO-2下时,您最终会得到一个包含模型权重的普通`pytorch_model.bin`文件,但它们只是权重的fp16版本。 + +在ZeRO-3下,事情要复杂得多,因为模型权重分布在多个GPU上,因此需要`"stage3_gather_16bit_weights_on_model_save": true`才能让`Trainer`保存fp16版本的权重。如果这个设置是`False`,`pytorch_model.bin`将不会被创建。这是因为默认情况下,DeepSpeed的`state_dict`包含一个占位符而不是实际的权重。如果我们保存这个`state_dict`,就无法再加载它了。 + + +```json +{ + "zero_optimization": { + "stage3_gather_16bit_weights_on_model_save": true + } +} +``` + +**FP32权重:** + +虽然fp16权重适合恢复训练,但如果您完成了模型的微调并希望将其上传到[models hub](https://huggingface.co/models)或传递给其他人,您很可能想要获取fp32权重。这最好不要在训练期间完成,因为这需要大量内存,因此最好在训练完成后离线进行。但是,如果需要并且有充足的空闲CPU内存,可以在相同的训练脚本中完成。以下部分将讨论这两种方法。 + +**实时FP32权重恢复:** + +如果您的模型很大,并且在训练结束时几乎没有剩余的空闲CPU内存,这种方法可能不起作用。 + +如果您至少保存了一个检查点,并且想要使用最新的一个,可以按照以下步骤操作: + +```python +from transformers.trainer_utils import get_last_checkpoint +from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint + +checkpoint_dir = get_last_checkpoint(trainer.args.output_dir) +fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir) +``` + +如果您在使用`--load_best_model_at_end`类:*~transformers.TrainingArguments*参数(用于跟踪最佳 +检查点),那么你可以首先显式地保存最终模型,然后再执行相同的操作: + +```python +from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint + +checkpoint_dir = os.path.join(trainer.args.output_dir, "checkpoint-final") +trainer.deepspeed.save_checkpoint(checkpoint_dir) +fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir) +``` + + + +注意,一旦运行了`load_state_dict_from_zero_checkpoint`,该模型将不再可以在相同的应用程序的DeepSpeed上下文中使用。也就是说,您需要重新初始化deepspeed引擎,因为`model.load_state_dict(state_dict)`会从其中移除所有的DeepSpeed相关点。所以您只能训练结束时这样做。 + + + +当然,您不必使用类:*~transformers.Trainer*,您可以根据你的需求调整上面的示例。 + +如果您出于某种原因想要更多的优化,您也可以提取权重的fp32 `state_dict`并按照以下示例进行操作: + +```python +from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint + +state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu +model = model.cpu() +model.load_state_dict(state_dict) +``` + +**离线FP32权重恢复:** + +DeepSpeed会创建一个特殊的转换脚本`zero_to_fp32.py`,并将其放置在checkpoint文件夹的顶层。使用此脚本,您可以在任何时候提取权重。该脚本是独立的,您不再需要配置文件或`Trainer`来执行提取操作。 + +假设您的checkpoint文件夹如下所示: + +```bash +$ ls -l output_dir/checkpoint-1/ +-rw-rw-r-- 1 stas stas 1.4K Mar 27 20:42 config.json +drwxrwxr-x 2 stas stas 4.0K Mar 25 19:52 global_step1/ +-rw-rw-r-- 1 stas stas 12 Mar 27 13:16 latest +-rw-rw-r-- 1 stas stas 827K Mar 27 20:42 optimizer.pt +-rw-rw-r-- 1 stas stas 231M Mar 27 20:42 pytorch_model.bin +-rw-rw-r-- 1 stas stas 623 Mar 27 20:42 scheduler.pt +-rw-rw-r-- 1 stas stas 1.8K Mar 27 20:42 special_tokens_map.json +-rw-rw-r-- 1 stas stas 774K Mar 27 20:42 spiece.model 
+-rw-rw-r-- 1 stas stas 1.9K Mar 27 20:42 tokenizer_config.json +-rw-rw-r-- 1 stas stas 339 Mar 27 20:42 trainer_state.json +-rw-rw-r-- 1 stas stas 2.3K Mar 27 20:42 training_args.bin +-rwxrw-r-- 1 stas stas 5.5K Mar 27 13:16 zero_to_fp32.py* +``` + +在这个例子中,只有一个DeepSpeed检查点子文件夹*global_step1*。因此,要重构fp32权重,只需运行: + +```bash +python zero_to_fp32.py . pytorch_model.bin +``` + +这就是它。`pytorch_model.bin`现在将包含从多个GPUs合并的完整的fp32模型权重。 + +该脚本将自动能够处理ZeRO-2或ZeRO-3 checkpoint。 + +`python zero_to_fp32.py -h`将为您提供使用细节。 + +该脚本将通过文件`latest`的内容自动发现deepspeed子文件夹,在当前示例中,它将包含`global_step1`。 + +注意:目前该脚本需要2倍于最终fp32模型权重的通用内存。 + + +### ZeRO-3 和 Infinity Nuances + +ZeRO-3与ZeRO-2有很大的不同,主要是因为它的参数分片功能。 + +ZeRO-Infinity进一步扩展了ZeRO-3,以支持NVMe内存和其他速度和可扩展性改进。 + +尽管所有努力都是为了在不需要对模型进行任何特殊更改的情况下就能正常运行,但在某些情况下,您可能需要以下信息。 + + +#### 构建大模型 + +DeepSpeed/ZeRO-3可以处理参数量达到数万亿的模型,这些模型可能无法适应现有的内存。在这种情况下,如果您还是希望初始化更快地发生,可以使用*deepspeed.zero.Init()*上下文管理器(也是一个函数装饰器)来初始化模型,如下所示: + +```python +from transformers import T5ForConditionalGeneration, T5Config +import deepspeed + +with deepspeed.zero.Init(): + config = T5Config.from_pretrained("google-t5/t5-small") + model = T5ForConditionalGeneration(config) +``` + +如您所见,这会为您随机初始化一个模型。 + +如果您想使用预训练模型,`model_class.from_pretrained`将在`is_deepspeed_zero3_enabled()`返回`True`的情况下激活此功能,目前这是通过传递的DeepSpeed配置文件中的ZeRO-3配置部分设置的。因此,在调用`from_pretrained`之前,您必须创建**TrainingArguments**对象。以下是可能的顺序示例: + +```python +from transformers import AutoModel, Trainer, TrainingArguments + +training_args = TrainingArguments(..., deepspeed=ds_config) +model = AutoModel.from_pretrained("google-t5/t5-small") +trainer = Trainer(model=model, args=training_args, ...) +``` + +如果您使用的是官方示例脚本,并且命令行参数中包含`--deepspeed ds_config.json`且启用了ZeRO-3配置,那么一切都已经为您准备好了,因为这是示例脚本的编写方式。 + +注意:如果模型的fp16权重无法适应单个GPU的内存,则必须使用此功能。 + +有关此方法和其他相关功能的完整详细信息,请参阅[构建大模型](https://deepspeed.readthedocs.io/en/latest/zero3.html#constructing-massive-models)。 + +此外,在加载fp16预训练模型时,您希望`from_pretrained`使用`torch_dtype=torch.float16`。详情请参见[from_pretrained-torch-dtype](#from_pretrained-torch-dtype)。 + + +#### 参数收集 + +在多个GPU上使用ZeRO-3时,没有一个GPU拥有所有参数,除非它是当前执行层的参数。因此,如果您需要一次访问所有层的所有参数,有一个特定的方法可以实现。 +您可能不需要它,但如果您需要,请参考[参数收集](https://deepspeed.readthedocs.io/en/latest/zero3.html#manual-parameter-coordination)。 + +然而,我们在多个地方确实使用了它,其中一个例子是在`from_pretrained`中加载预训练模型权重。我们一次加载一层,然后立即将其分区到所有参与的GPU上,因为对于非常大的模型,无法在一个GPU上一次性加载并将其分布到多个GPU上,因为内存限制。 + +此外,在ZeRO-3下,如果您编写自己的代码并遇到看起来像这样的模型参数权重: + +```python +tensor([1.0], device="cuda:0", dtype=torch.float16, requires_grad=True) +``` + +强调`tensor([1.])`,或者如果您遇到一个错误,它说参数的大小是`1`,而不是某个更大的多维形状,这意味着参数被划分了,你看到的是一个ZeRO-3占位符。 + + + + + + +### ZeRO 推理 + +"ZeRO 推断" 使用与 "ZeRO-3 训练" 相同的配置。您只需要去掉优化器和调度器部分。实际上,如果您希望与训练共享相同的配置文件,您可以将它们保留在配置文件中,它们只会被忽略。 + +您只需要传递通常的[`TrainingArguments`]参数。例如: + +```bash +deepspeed --num_gpus=2 your_program.py --do_eval --deepspeed ds_config.json +``` + +唯一的重要事情是您需要使用ZeRO-3配置,因为ZeRO-2对于推理没有任何优势,因为只有ZeRO-3才对参数进行分片,而ZeRO-1则对梯度和优化器状态进行分片。 + +以下是在DeepSpeed下运行`run_translation.py`启用所有可用GPU的示例: + +```bash +deepspeed examples/pytorch/translation/run_translation.py \ +--deepspeed tests/deepspeed/ds_config_zero3.json \ +--model_name_or_path google-t5/t5-small --output_dir output_dir \ +--do_eval --max_eval_samples 50 --warmup_steps 50 \ +--max_source_length 128 --val_max_target_length 128 \ +--overwrite_output_dir --per_device_eval_batch_size 4 \ +--predict_with_generate --dataset_config "ro-en" --fp16 \ +--source_lang en --target_lang ro --dataset_name wmt16 \ +--source_prefix "translate English to Romanian: " +``` + 
+由于在推理阶段,优化器状态和梯度不需要额外的大量内存,您应该能够将更大的批次和/或序列长度放到相同的硬件上。 + +此外,DeepSpeed目前正在开发一个名为Deepspeed-Inference的相关产品,它与ZeRO技术无关,而是使用张量并行来扩展无法适应单个GPU的模型。这是一个正在进行的工作,一旦该产品完成,我们将提供集成。 + + +### 内存要求 + +由于 DeepSpeed ZeRO 可以将内存卸载到 CPU(和 NVMe),该框架提供了一些工具,允许根据使用的 GPU 数量告知将需要多少 CPU 和 GPU 内存。 + +让我们估计在单个GPU上微调"bigscience/T0_3B"所需的内存: + +```bash +$ python -c 'from transformers import AutoModel; \ +from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \ +model = AutoModel.from_pretrained("bigscience/T0_3B"); \ +estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)' +[...] +Estimated memory needed for params, optim states and gradients for a: +HW: Setup with 1 node, 1 GPU per node. +SW: Model with 2783M total params, 65M largest layer params. + per CPU | per GPU | Options + 70.00GB | 0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1 + 70.00GB | 0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0 + 62.23GB | 5.43GB | offload_param=none, offload_optimizer=cpu , zero_init=1 + 62.23GB | 5.43GB | offload_param=none, offload_optimizer=cpu , zero_init=0 + 0.37GB | 46.91GB | offload_param=none, offload_optimizer=none, zero_init=1 + 15.56GB | 46.91GB | offload_param=none, offload_optimizer=none, zero_init=0 +``` + +因此,您可以将模型拟合在单个80GB的GPU上,不进行CPU offload,或者使用微小的8GB GPU,但需要约60GB的CPU内存。(请注意,这仅是参数、优化器状态和梯度所需的内存 - 您还需要为CUDA内核、激活值和临时变量分配更多的内存。) + +然后,这是成本与速度的权衡。购买/租用较小的 GPU(或较少的 GPU,因为您可以使用多个 GPU 进行 Deepspeed ZeRO)。但这样会更慢,因此即使您不关心完成某项任务的速度,减速也直接影响 GPU 使用的持续时间,从而导致更大的成本。因此,请进行实验并比较哪种方法效果最好。 + +如果您有足够的GPU内存,请确保禁用CPU/NVMe卸载,因为这会使所有操作更快。 + +例如,让我们重复相同的操作,使用2个GPU: + +```bash +$ python -c 'from transformers import AutoModel; \ +from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \ +model = AutoModel.from_pretrained("bigscience/T0_3B"); \ +estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=2, num_nodes=1)' +[...] +Estimated memory needed for params, optim states and gradients for a: +HW: Setup with 1 node, 2 GPUs per node. +SW: Model with 2783M total params, 65M largest layer params. + per CPU | per GPU | Options + 70.00GB | 0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1 + 70.00GB | 0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0 + 62.23GB | 2.84GB | offload_param=none, offload_optimizer=cpu , zero_init=1 + 62.23GB | 2.84GB | offload_param=none, offload_optimizer=cpu , zero_init=0 + 0.74GB | 23.58GB | offload_param=none, offload_optimizer=none, zero_init=1 + 31.11GB | 23.58GB | offload_param=none, offload_optimizer=none, zero_init=0 + +``` + +所以,您需要2个32GB或更高的GPU,且不进行CPU卸载。 + +如需了解更多信息,请参阅[内存估算器](https://deepspeed.readthedocs.io/en/latest/memory.html)。 + + + +### 归档Issues + +请按照以下步骤提交问题,以便我们能够迅速找到问题并帮助您解除工作阻塞。 + +在您的报告中,请始终包括以下内容: + +1. 完整的Deepspeed配置文件 +2. 如果使用了[`Trainer`],则包括命令行参数;如果自己编写了Trainer设置,则包括[`TrainingArguments`]参数。请不要导出[`TrainingArguments`],因为它有几十个与问题无关的条目。 +3. 输出: + + ```bash + python -c 'import torch; print(f"torch: {torch.__version__}")' + python -c 'import transformers; print(f"transformers: {transformers.__version__}")' + python -c 'import deepspeed; print(f"deepspeed: {deepspeed.__version__}")' + ``` + +4. 如果可能,请包含一个Google Colab notebook链接,我们可以使用它来重现问题。您可以使用这个[notebook](https://github.com/stas00/porting/blob/master/transformers/deepspeed/DeepSpeed_on_colab_CLI.ipynb)作为起点。 +5. 除非不可能,否则请始终使用标准数据集,而不是自定义数据集。 +6. 
如果可能,尝试使用现有[示例](https://github.com/huggingface/transformers/tree/main/examples/pytorch)之一来重现问题。 + +需要考虑的因素: + +- Deepspeed通常不是问题的原因。 + + 一些已提交的问题被证明与Deepspeed无关。也就是说,一旦将Deepspeed从设置中移除,问题仍然存在。 + + 因此,如果问题明显与DeepSpeed相关,例如您可以看到有一个异常并且可以看到DeepSpeed模块涉及其中,请先重新测试没有DeepSpeed的设置。只有当问题仍然存在时,才向Deepspeed提供所有必需的细节。 + +- 如果您明确问题是在Deepspeed核心中而不是集成部分,请直接向[Deepspeed](https://github.com/microsoft/DeepSpeed/)提交问题。如果您不确定,请不要担心,无论使用哪个issue跟踪问题都可以,一旦您发布问题,我们会弄清楚并将其重定向到另一个issue跟踪(如果需要的话)。 + + + +### Troubleshooting + +#### 启动时`deepspeed`进程被终止,没有回溯 + +如果启动时`deepspeed`进程被终止,没有回溯,这通常意味着程序尝试分配的CPU内存超过了系统的限制或进程被允许分配的内存,操作系统内核杀死了该进程。这是因为您的配置文件很可能将`offload_optimizer`或`offload_param`或两者都配置为卸载到`cpu`。如果您有NVMe,可以尝试在ZeRO-3下卸载到NVMe。这里是如何[估计特定模型所需的内存](https://deepspeed.readthedocs.io/en/latest/memory.html)。 + +#### 训练和/或评估/预测loss为`NaN` + +这种情况通常发生在使用bf16混合精度模式预训练的模型试图在fp16(带或不带混合精度)下使用时。大多数在TPU上训练的模型以及由谷歌发布的模型都属于这个类别(例如,几乎所有基于t5的模型)。在这种情况下,解决方案是要么使用fp32,要么在支持的情况下使用bf16(如TPU、Ampere GPU或更新的版本)。 + +另一个问题可能与使用fp16有关。当您配置此部分时: + +```json +{ + "fp16": { + "enabled": "auto", + "loss_scale": 0, + "loss_scale_window": 1000, + "initial_scale_power": 16, + "hysteresis": 2, + "min_loss_scale": 1 + } +} +``` + +并且您在日志中看到Deepspeed报告`OVERFLOW`如下 + +``` +0%| | 0/189 [00:00=4.28`开始,如果没有明确指定`synced_gpus`,检测到这些条件后它将自动设置为`True`。但如果您需要覆盖`synced_gpus`的值,仍然可以这样做。 + + + +## 测试 DeepSpeed 集成 + +如果您提交了一个涉及DeepSpeed集成的PR,请注意我们的CircleCI PR CI设置没有GPU,因此我们只在另一个CI夜间运行需要GPU的测试。因此,如果您在PR中获得绿色的CI报告,并不意味着DeepSpeed测试通过。 + +要运行DeepSpeed测试,请至少运行以下命令: + +```bash +RUN_SLOW=1 pytest tests/deepspeed/test_deepspeed.py +``` + +如果你更改了任何模型或PyTorch示例代码,请同时运行多模型测试。以下将运行所有DeepSpeed测试: + +```bash +RUN_SLOW=1 pytest tests/deepspeed +``` + +## 主要的DeepSpeed资源 + +- [项目GitHub](https://github.com/microsoft/deepspeed) +- [使用文档](https://www.deepspeed.ai/getting-started/) +- [API文档](https://deepspeed.readthedocs.io/en/latest/index.html) +- [博客文章](https://www.microsoft.com/en-us/research/search/?q=deepspeed) + +论文: + +- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054) +- [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840) +- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857) + +最后,请记住,HuggingFace [`Trainer`]仅集成了DeepSpeed,因此如果您在使用DeepSpeed时遇到任何问题或疑问,请在[DeepSpeed GitHub](https://github.com/microsoft/DeepSpeed/issues)上提交一个issue。 diff --git a/docs/source/zh/main_classes/feature_extractor.md b/docs/source/zh/main_classes/feature_extractor.md new file mode 100644 index 00000000000000..f3b65ebf9d66cc --- /dev/null +++ b/docs/source/zh/main_classes/feature_extractor.md @@ -0,0 +1,39 @@ + + +# Feature Extractor + +Feature Extractor负责为音频或视觉模型准备输入特征。这包括从序列中提取特征,例如,对音频文件进行预处理以生成Log-Mel频谱特征,以及从图像中提取特征,例如,裁剪图像文件,同时还包括填充、归一化和转换为NumPy、PyTorch和TensorFlow张量。 + + +## FeatureExtractionMixin + +[[autodoc]] feature_extraction_utils.FeatureExtractionMixin + - from_pretrained + - save_pretrained + +## SequenceFeatureExtractor + +[[autodoc]] SequenceFeatureExtractor + - pad + +## BatchFeature + +[[autodoc]] BatchFeature + +## ImageFeatureExtractionMixin + +[[autodoc]] image_utils.ImageFeatureExtractionMixin diff --git a/docs/source/zh/main_classes/image_processor.md b/docs/source/zh/main_classes/image_processor.md new file mode 100644 index 00000000000000..035afa55348a26 --- /dev/null +++ b/docs/source/zh/main_classes/image_processor.md @@ -0,0 +1,34 @@ + + +# Image Processor + +Image 
processor负责为视觉模型准备输入特征并后期处理处理它们的输出。这包括诸如调整大小、归一化和转换为PyTorch、TensorFlow、Flax和NumPy张量等转换。它还可能包括特定于模型的后期处理,例如将logits转换为分割掩码。 + + +## ImageProcessingMixin + +[[autodoc]] image_processing_utils.ImageProcessingMixin + - from_pretrained + - save_pretrained + +## BatchFeature + +[[autodoc]] BatchFeature + +## BaseImageProcessor + +[[autodoc]] image_processing_utils.BaseImageProcessor diff --git a/docs/source/zh/main_classes/keras_callbacks.md b/docs/source/zh/main_classes/keras_callbacks.md new file mode 100644 index 00000000000000..1eea2eb998162c --- /dev/null +++ b/docs/source/zh/main_classes/keras_callbacks.md @@ -0,0 +1,27 @@ + + +# Keras callbacks + +在Keras中训练Transformers模型时,有一些库特定的callbacks函数可用于自动执行常见任务: + +## KerasMetricCallback + +[[autodoc]] KerasMetricCallback + +## PushToHubCallback + +[[autodoc]] PushToHubCallback diff --git a/docs/source/zh/main_classes/logging.md b/docs/source/zh/main_classes/logging.md new file mode 100644 index 00000000000000..6116e4962e6a33 --- /dev/null +++ b/docs/source/zh/main_classes/logging.md @@ -0,0 +1,107 @@ + + +# Logging + +🤗 Transformers拥有一个集中式的日志系统,因此您可以轻松设置库输出的日志详细程度。 + +当前库的默认日志详细程度为`WARNING`。 + +要更改日志详细程度,只需使用其中一个直接的setter。例如,以下是如何将日志详细程度更改为INFO级别的方法: + +```python +import transformers + +transformers.logging.set_verbosity_info() +``` + +您还可以使用环境变量`TRANSFORMERS_VERBOSITY`来覆盖默认的日志详细程度。您可以将其设置为以下级别之一:`debug`、`info`、`warning`、`error`、`critical`。例如: + +```bash +TRANSFORMERS_VERBOSITY=error ./myprogram.py +``` + +此外,通过将环境变量`TRANSFORMERS_NO_ADVISORY_WARNINGS`设置为`true`(如*1*),可以禁用一些`warnings`。这将禁用[`logger.warning_advice`]记录的任何警告。例如: + +```bash +TRANSFORMERS_NO_ADVISORY_WARNINGS=1 ./myprogram.py +``` + +以下是如何在您自己的模块或脚本中使用与库相同的logger的示例: + +```python +from transformers.utils import logging + +logging.set_verbosity_info() +logger = logging.get_logger("transformers") +logger.info("INFO") +logger.warning("WARN") +``` + + +此日志模块的所有方法都在下面进行了记录,主要的方法包括 [`logging.get_verbosity`] 用于获取logger当前输出日志详细程度的级别和 [`logging.set_verbosity`] 用于将详细程度设置为您选择的级别。按照顺序(从最不详细到最详细),这些级别(及其相应的整数值)为: + +- `transformers.logging.CRITICAL` 或 `transformers.logging.FATAL`(整数值,50):仅报告最关键的errors。 +- `transformers.logging.ERROR`(整数值,40):仅报告errors。 +- `transformers.logging.WARNING` 或 `transformers.logging.WARN`(整数值,30):仅报告error和warnings。这是库使用的默认级别。 +- `transformers.logging.INFO`(整数值,20):报告error、warnings和基本信息。 +- `transformers.logging.DEBUG`(整数值,10):报告所有信息。 + +默认情况下,将在模型下载期间显示`tqdm`进度条。[`logging.disable_progress_bar`] 和 [`logging.enable_progress_bar`] 可用于禁止或启用此行为。 + +## `logging` vs `warnings` + +Python有两个经常一起使用的日志系统:如上所述的`logging`,和对特定buckets中的警告进行进一步分类的`warnings`,例如,`FutureWarning`用于输出已经被弃用的功能或路径,`DeprecationWarning`用于指示即将被弃用的内容。 + +我们在`transformers`库中同时使用这两个系统。我们利用并调整了`logging`的`captureWarning`方法,以便通过上面的详细程度setters来管理这些警告消息。 + +对于库的开发人员,这意味着什么呢?我们应该遵循以下启发法则: +- 库的开发人员和依赖于`transformers`的库应优先使用`warnings` +- `logging`应该用于在日常项目中经常使用它的用户 + +以下是`captureWarnings`方法的参考。 + +[[autodoc]] logging.captureWarnings + +## Base setters + +[[autodoc]] logging.set_verbosity_error + +[[autodoc]] logging.set_verbosity_warning + +[[autodoc]] logging.set_verbosity_info + +[[autodoc]] logging.set_verbosity_debug + +## Other functions + +[[autodoc]] logging.get_verbosity + +[[autodoc]] logging.set_verbosity + +[[autodoc]] logging.get_logger + +[[autodoc]] logging.enable_default_handler + +[[autodoc]] logging.disable_default_handler + +[[autodoc]] logging.enable_explicit_format + +[[autodoc]] logging.reset_format + +[[autodoc]] logging.enable_progress_bar + +[[autodoc]] logging.disable_progress_bar diff --git 
a/docs/source/zh/main_classes/model.md b/docs/source/zh/main_classes/model.md new file mode 100644 index 00000000000000..6c0ee3e2b2c097 --- /dev/null +++ b/docs/source/zh/main_classes/model.md @@ -0,0 +1,137 @@ + + +# 模型 + +基类 [`PreTrainedModel`]、[`TFPreTrainedModel`] 和 [`FlaxPreTrainedModel`] 实现了从本地文件或目录加载/保存模型的常用方法,或者从库上提供的预训练模型配置(从 HuggingFace 的 AWS S3 存储库下载)加载模型。 + +[`PreTrainedModel`] 和 [`TFPreTrainedModel`] 还实现了一些所有模型共有的方法: + +- 在向量词嵌入增加新词汇时调整输入标记(token)的大小 +- 对模型的注意力头进行修剪。 + +其他的通用方法在 [`~modeling_utils.ModuleUtilsMixin`](用于 PyTorch 模型)和 [`~modeling_tf_utils.TFModuleUtilsMixin`](用于 TensorFlow 模型)中定义;文本生成方面的方法则定义在 [`~generation.GenerationMixin`](用于 PyTorch 模型)、[`~generation.TFGenerationMixin`](用于 TensorFlow 模型)和 [`~generation.FlaxGenerationMixin`](用于 Flax/JAX 模型)中。 + +## PreTrainedModel + +[[autodoc]] PreTrainedModel + - push_to_hub + - all + + + +### 大模型加载 + +在 Transformers 4.20.0 中,[`~PreTrainedModel.from_pretrained`] 方法已重新设计,以适应使用 [Accelerate](https://huggingface.co/docs/accelerate/big_modeling) 加载大型模型的场景。这需要您使用的 Accelerate 和 PyTorch 版本满足: Accelerate >= 0.9.0, PyTorch >= 1.9.0。除了创建完整模型,然后在其中加载预训练权重(这会占用两倍于模型大小的内存空间,一个用于随机初始化模型,一个用于预训练权重),我们提供了一种选项,将模型创建为空壳,然后只有在加载预训练权重时才实例化其参数。 + +您可以使用 `low_cpu_mem_usage=True` 激活此选项。首先,在 Meta 设备上创建模型(带有空权重),然后将状态字典加载到其中(在分片检查点的情况下逐片加载)。这样,最大使用的内存占用仅为模型的完整大小。 + +```python +from transformers import AutoModelForSeq2SeqLM + +t0pp = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp", low_cpu_mem_usage=True) +``` + +此外,如果内存不足以放下加载整个模型(目前仅适用于推理),您可以直接将模型放置在不同的设备上。使用 `device_map="auto"`,Accelerate 将确定将每一层放置在哪个设备上,以最大化使用最快的设备(GPU),并将其余部分卸载到 CPU,甚至硬盘上(如果您没有足够的 GPU 内存 或 CPU 内存)。即使模型分布在几个设备上,它也将像您通常期望的那样运行。 + +在传递 `device_map` 时,`low_cpu_mem_usage` 会自动设置为 `True`,因此您不需要指定它: + +```python +from transformers import AutoModelForSeq2SeqLM + +t0pp = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp", device_map="auto") +``` + +您可以通过 `hf_device_map` 属性来查看模型是如何在设备上分割的: + +```python +t0pp.hf_device_map +{'shared': 0, + 'decoder.embed_tokens': 0, + 'encoder': 0, + 'decoder.block.0': 0, + 'decoder.block.1': 1, + 'decoder.block.2': 1, + 'decoder.block.3': 1, + 'decoder.block.4': 1, + 'decoder.block.5': 1, + 'decoder.block.6': 1, + 'decoder.block.7': 1, + 'decoder.block.8': 1, + 'decoder.block.9': 1, + 'decoder.block.10': 1, + 'decoder.block.11': 1, + 'decoder.block.12': 1, + 'decoder.block.13': 1, + 'decoder.block.14': 1, + 'decoder.block.15': 1, + 'decoder.block.16': 1, + 'decoder.block.17': 1, + 'decoder.block.18': 1, + 'decoder.block.19': 1, + 'decoder.block.20': 1, + 'decoder.block.21': 1, + 'decoder.block.22': 'cpu', + 'decoder.block.23': 'cpu', + 'decoder.final_layer_norm': 'cpu', + 'decoder.dropout': 'cpu', + 'lm_head': 'cpu'} +``` + +您还可以按照相同的格式(一个层名称到设备的映射关系的字典)编写自己的设备映射规则。它应该将模型的所有参数映射到给定的设备上,如果该层的所有子模块都在同一设备上,您不必详细说明其中所有子模块的位置。例如,以下设备映射对于 T0pp 将正常工作(只要您有 GPU 内存): + +```python +device_map = {"shared": 0, "encoder": 0, "decoder": 1, "lm_head": 1} +``` + +另一种减少模型内存影响的方法是以较低精度的 dtype(例如 `torch.float16`)实例化它,或者使用下面介绍的直接量化技术。 + +### 模型实例化 dtype + +在 PyTorch 下,模型通常以 `torch.float32` 格式实例化。如果尝试加载权重为 fp16 的模型,这可能会导致问题,因为它将需要两倍的内存。为了克服此限制,您可以使用 `torch_dtype` 参数显式传递所需的 `dtype`: + +```python +model = T5ForConditionalGeneration.from_pretrained("t5", torch_dtype=torch.float16) +``` +或者,如果您希望模型始终以最优的内存模式加载,则可以使用特殊值 `"auto"`,然后 `dtype` 将自动从模型的权重中推导出: +```python +model = T5ForConditionalGeneration.from_pretrained("t5", torch_dtype="auto") +``` + +也可以通过以下方式告知从头开始实例化的模型要使用哪种 `dtype`: + +```python +config = T5Config.from_pretrained("t5") +model = 
AutoModel.from_config(config) +``` + +由于 PyTorch 的设计,此功能仅适用于浮点类型。 + + +## ModuleUtilsMixin + +[[autodoc]] modeling_utils.ModuleUtilsMixin + +TFPreTrainedModel +[[autodoc]] TFPreTrainedModel + - push_to_hub + - all + +## TFModelUtilsMixin +[[autodoc]] modeling_tf_utils.TFModelUtilsMixin + +FlaxPreTrainedModel +[[autodoc]] FlaxPreTrainedModel + - push_to_hub + - all + +## 推送到 Hub +[[autodoc]] utils.PushToHubMixin + +## 分片检查点 +[[autodoc]] modeling_utils.load_sharded_checkpoint diff --git a/docs/source/zh/main_classes/onnx.md b/docs/source/zh/main_classes/onnx.md new file mode 100644 index 00000000000000..d35dd1c4bf8e6a --- /dev/null +++ b/docs/source/zh/main_classes/onnx.md @@ -0,0 +1,50 @@ + + +# 导出 🤗 Transformers 模型到 ONNX + +🤗 Transformers提供了一个`transformers.onnx`包,通过利用配置对象,您可以将模型checkpoints转换为ONNX图。 + +有关更多详细信息,请参阅导出 🤗 Transformers 模型的[指南](../serialization)。 + +## ONNX Configurations + +我们提供了三个抽象类,取决于您希望导出的模型架构类型: + +* 基于编码器的模型继承 [`~onnx.config.OnnxConfig`] +* 基于解码器的模型继承 [`~onnx.config.OnnxConfigWithPast`] +* 编码器-解码器模型继承 [`~onnx.config.OnnxSeq2SeqConfigWithPast`] + +### OnnxConfig + +[[autodoc]] onnx.config.OnnxConfig + +### OnnxConfigWithPast + +[[autodoc]] onnx.config.OnnxConfigWithPast + +### OnnxSeq2SeqConfigWithPast + +[[autodoc]] onnx.config.OnnxSeq2SeqConfigWithPast + +## ONNX Features + +每个ONNX配置与一组 _特性_ 相关联,使您能够为不同类型的拓扑结构或任务导出模型。 + +### FeaturesManager + +[[autodoc]] onnx.features.FeaturesManager + diff --git a/docs/source/zh/main_classes/optimizer_schedules.md b/docs/source/zh/main_classes/optimizer_schedules.md new file mode 100644 index 00000000000000..63e5438d77b6c4 --- /dev/null +++ b/docs/source/zh/main_classes/optimizer_schedules.md @@ -0,0 +1,77 @@ + + +# Optimization + +`.optimization` 模块提供了: + +- 一个带有固定权重衰减的优化器,可用于微调模型 +- 继承自 `_LRSchedule` 多个调度器: +- 一个梯度累积类,用于累积多个批次的梯度 + +## AdamW (PyTorch) + +[[autodoc]] AdamW + +## AdaFactor (PyTorch) + +[[autodoc]] Adafactor + +## AdamWeightDecay (TensorFlow) + +[[autodoc]] AdamWeightDecay + +[[autodoc]] create_optimizer + +## Schedules + +### Learning Rate Schedules (Pytorch) + +[[autodoc]] SchedulerType + +[[autodoc]] get_scheduler + +[[autodoc]] get_constant_schedule + +[[autodoc]] get_constant_schedule_with_warmup + + + +[[autodoc]] get_cosine_schedule_with_warmup + + + +[[autodoc]] get_cosine_with_hard_restarts_schedule_with_warmup + + + +[[autodoc]] get_linear_schedule_with_warmup + + + +[[autodoc]] get_polynomial_decay_schedule_with_warmup + +[[autodoc]] get_inverse_sqrt_schedule + +### Warmup (TensorFlow) + +[[autodoc]] WarmUp + +## Gradient Strategies + +### GradientAccumulator (TensorFlow) + +[[autodoc]] GradientAccumulator diff --git a/docs/source/zh/main_classes/output.md b/docs/source/zh/main_classes/output.md new file mode 100644 index 00000000000000..f4d5c3c6941d51 --- /dev/null +++ b/docs/source/zh/main_classes/output.md @@ -0,0 +1,309 @@ + + +# 模型输出 + +所有模型的输出都是 [`~utils.ModelOutput`] 的子类的实例。这些是包含模型返回的所有信息的数据结构,但也可以用作元组或字典。 + +让我们看一个例子: + +```python +from transformers import BertTokenizer, BertForSequenceClassification +import torch + +tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased") +model = BertForSequenceClassification.from_pretrained("google-bert/bert-base-uncased") + +inputs = tokenizer("Hello, my dog is cute", return_tensors="pt") +labels = torch.tensor([1]).unsqueeze(0) # Batch size 1 +outputs = model(**inputs, labels=labels) +``` + +`outputs` 对象是 [`~modeling_outputs.SequenceClassifierOutput`],如下面该类的文档中所示,它表示它有一个可选的 `loss`,一个 `logits`,一个可选的 `hidden_states` 和一个可选的 `attentions` 属性。在这里,我们有 
`loss`,因为我们传递了 `labels`,但我们没有 `hidden_states` 和 `attentions`,因为我们没有传递 `output_hidden_states=True` 或 `output_attentions=True`。 + + + +当传递 `output_hidden_states=True` 时,您可能希望 `outputs.hidden_states[-1]` 与 `outputs.last_hidden_states` 完全匹配。然而,这并不总是成立。一些模型在返回最后的 hidden state时对其应用归一化或其他后续处理。 + + + + +您可以像往常一样访问每个属性,如果模型未返回该属性,您将得到 `None`。在这里,例如,`outputs.loss` 是模型计算的损失,而 `outputs.attentions` 是 `None`。 + +当将我们的 `outputs` 对象视为元组时,它仅考虑那些没有 `None` 值的属性。例如这里它有两个元素,`loss` 和 `logits`,所以 + +```python +outputs[:2] +``` + +将返回元组 `(outputs.loss, outputs.logits)`。 + +将我们的 `outputs` 对象视为字典时,它仅考虑那些没有 `None` 值的属性。例如在这里它有两个键,分别是 `loss` 和 `logits`。 + +我们在这里记录了被多个类型模型使用的通用模型输出。特定输出类型在其相应的模型页面上有文档。 + +## ModelOutput + +[[autodoc]] utils.ModelOutput + - to_tuple + +## BaseModelOutput + +[[autodoc]] modeling_outputs.BaseModelOutput + +## BaseModelOutputWithPooling + +[[autodoc]] modeling_outputs.BaseModelOutputWithPooling + +## BaseModelOutputWithCrossAttentions + +[[autodoc]] modeling_outputs.BaseModelOutputWithCrossAttentions + +## BaseModelOutputWithPoolingAndCrossAttentions + +[[autodoc]] modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions + +## BaseModelOutputWithPast + +[[autodoc]] modeling_outputs.BaseModelOutputWithPast + +## BaseModelOutputWithPastAndCrossAttentions + +[[autodoc]] modeling_outputs.BaseModelOutputWithPastAndCrossAttentions + +## Seq2SeqModelOutput + +[[autodoc]] modeling_outputs.Seq2SeqModelOutput + +## CausalLMOutput + +[[autodoc]] modeling_outputs.CausalLMOutput + +## CausalLMOutputWithCrossAttentions + +[[autodoc]] modeling_outputs.CausalLMOutputWithCrossAttentions + +## CausalLMOutputWithPast + +[[autodoc]] modeling_outputs.CausalLMOutputWithPast + +## MaskedLMOutput + +[[autodoc]] modeling_outputs.MaskedLMOutput + +## Seq2SeqLMOutput + +[[autodoc]] modeling_outputs.Seq2SeqLMOutput + +## NextSentencePredictorOutput + +[[autodoc]] modeling_outputs.NextSentencePredictorOutput + +## SequenceClassifierOutput + +[[autodoc]] modeling_outputs.SequenceClassifierOutput + +## Seq2SeqSequenceClassifierOutput + +[[autodoc]] modeling_outputs.Seq2SeqSequenceClassifierOutput + +## MultipleChoiceModelOutput + +[[autodoc]] modeling_outputs.MultipleChoiceModelOutput + +## TokenClassifierOutput + +[[autodoc]] modeling_outputs.TokenClassifierOutput + +## QuestionAnsweringModelOutput + +[[autodoc]] modeling_outputs.QuestionAnsweringModelOutput + +## Seq2SeqQuestionAnsweringModelOutput + +[[autodoc]] modeling_outputs.Seq2SeqQuestionAnsweringModelOutput + +## Seq2SeqSpectrogramOutput + +[[autodoc]] modeling_outputs.Seq2SeqSpectrogramOutput + +## SemanticSegmenterOutput + +[[autodoc]] modeling_outputs.SemanticSegmenterOutput + +## ImageClassifierOutput + +[[autodoc]] modeling_outputs.ImageClassifierOutput + +## ImageClassifierOutputWithNoAttention + +[[autodoc]] modeling_outputs.ImageClassifierOutputWithNoAttention + +## DepthEstimatorOutput + +[[autodoc]] modeling_outputs.DepthEstimatorOutput + +## Wav2Vec2BaseModelOutput + +[[autodoc]] modeling_outputs.Wav2Vec2BaseModelOutput + +## XVectorOutput + +[[autodoc]] modeling_outputs.XVectorOutput + +## Seq2SeqTSModelOutput + +[[autodoc]] modeling_outputs.Seq2SeqTSModelOutput + +## Seq2SeqTSPredictionOutput + +[[autodoc]] modeling_outputs.Seq2SeqTSPredictionOutput + +## SampleTSPredictionOutput + +[[autodoc]] modeling_outputs.SampleTSPredictionOutput + +## TFBaseModelOutput + +[[autodoc]] modeling_tf_outputs.TFBaseModelOutput + +## TFBaseModelOutputWithPooling + +[[autodoc]] modeling_tf_outputs.TFBaseModelOutputWithPooling + +## 
TFBaseModelOutputWithPoolingAndCrossAttentions + +[[autodoc]] modeling_tf_outputs.TFBaseModelOutputWithPoolingAndCrossAttentions + +## TFBaseModelOutputWithPast + +[[autodoc]] modeling_tf_outputs.TFBaseModelOutputWithPast + +## TFBaseModelOutputWithPastAndCrossAttentions + +[[autodoc]] modeling_tf_outputs.TFBaseModelOutputWithPastAndCrossAttentions + +## TFSeq2SeqModelOutput + +[[autodoc]] modeling_tf_outputs.TFSeq2SeqModelOutput + +## TFCausalLMOutput + +[[autodoc]] modeling_tf_outputs.TFCausalLMOutput + +## TFCausalLMOutputWithCrossAttentions + +[[autodoc]] modeling_tf_outputs.TFCausalLMOutputWithCrossAttentions + +## TFCausalLMOutputWithPast + +[[autodoc]] modeling_tf_outputs.TFCausalLMOutputWithPast + +## TFMaskedLMOutput + +[[autodoc]] modeling_tf_outputs.TFMaskedLMOutput + +## TFSeq2SeqLMOutput + +[[autodoc]] modeling_tf_outputs.TFSeq2SeqLMOutput + +## TFNextSentencePredictorOutput + +[[autodoc]] modeling_tf_outputs.TFNextSentencePredictorOutput + +## TFSequenceClassifierOutput + +[[autodoc]] modeling_tf_outputs.TFSequenceClassifierOutput + +## TFSeq2SeqSequenceClassifierOutput + +[[autodoc]] modeling_tf_outputs.TFSeq2SeqSequenceClassifierOutput + +## TFMultipleChoiceModelOutput + +[[autodoc]] modeling_tf_outputs.TFMultipleChoiceModelOutput + +## TFTokenClassifierOutput + +[[autodoc]] modeling_tf_outputs.TFTokenClassifierOutput + +## TFQuestionAnsweringModelOutput + +[[autodoc]] modeling_tf_outputs.TFQuestionAnsweringModelOutput + +## TFSeq2SeqQuestionAnsweringModelOutput + +[[autodoc]] modeling_tf_outputs.TFSeq2SeqQuestionAnsweringModelOutput + +## FlaxBaseModelOutput + +[[autodoc]] modeling_flax_outputs.FlaxBaseModelOutput + +## FlaxBaseModelOutputWithPast + +[[autodoc]] modeling_flax_outputs.FlaxBaseModelOutputWithPast + +## FlaxBaseModelOutputWithPooling + +[[autodoc]] modeling_flax_outputs.FlaxBaseModelOutputWithPooling + +## FlaxBaseModelOutputWithPastAndCrossAttentions + +[[autodoc]] modeling_flax_outputs.FlaxBaseModelOutputWithPastAndCrossAttentions + +## FlaxSeq2SeqModelOutput + +[[autodoc]] modeling_flax_outputs.FlaxSeq2SeqModelOutput + +## FlaxCausalLMOutputWithCrossAttentions + +[[autodoc]] modeling_flax_outputs.FlaxCausalLMOutputWithCrossAttentions + +## FlaxMaskedLMOutput + +[[autodoc]] modeling_flax_outputs.FlaxMaskedLMOutput + +## FlaxSeq2SeqLMOutput + +[[autodoc]] modeling_flax_outputs.FlaxSeq2SeqLMOutput + +## FlaxNextSentencePredictorOutput + +[[autodoc]] modeling_flax_outputs.FlaxNextSentencePredictorOutput + +## FlaxSequenceClassifierOutput + +[[autodoc]] modeling_flax_outputs.FlaxSequenceClassifierOutput + +## FlaxSeq2SeqSequenceClassifierOutput + +[[autodoc]] modeling_flax_outputs.FlaxSeq2SeqSequenceClassifierOutput + +## FlaxMultipleChoiceModelOutput + +[[autodoc]] modeling_flax_outputs.FlaxMultipleChoiceModelOutput + +## FlaxTokenClassifierOutput + +[[autodoc]] modeling_flax_outputs.FlaxTokenClassifierOutput + +## FlaxQuestionAnsweringModelOutput + +[[autodoc]] modeling_flax_outputs.FlaxQuestionAnsweringModelOutput + +## FlaxSeq2SeqQuestionAnsweringModelOutput + +[[autodoc]] modeling_flax_outputs.FlaxSeq2SeqQuestionAnsweringModelOutput diff --git a/docs/source/zh/main_classes/pipelines.md b/docs/source/zh/main_classes/pipelines.md new file mode 100644 index 00000000000000..3cef40478c39a9 --- /dev/null +++ b/docs/source/zh/main_classes/pipelines.md @@ -0,0 +1,480 @@ + + +# Pipelines + +pipelines是使用模型进行推理的一种简单方法。这些pipelines是抽象了库中大部分复杂代码的对象,提供了一个专用于多个任务的简单API,包括专名识别、掩码语言建模、情感分析、特征提取和问答等。请参阅[任务摘要](../task_summary)以获取使用示例。 + +有两种pipelines抽象类需要注意: + +- 
[`pipeline`],它是封装所有其他pipelines的最强大的对象。 +- 针对特定任务pipelines,适用于[音频](#audio)、[计算机视觉](#computer-vision)、[自然语言处理](#natural-language-processing)和[多模态](#multimodal)任务。 + +## pipeline抽象类 + +*pipeline*抽象类是对所有其他可用pipeline的封装。它可以像任何其他pipeline一样实例化,但进一步提供额外的便利性。 + +简单调用一个项目: + + +```python +>>> pipe = pipeline("text-classification") +>>> pipe("This restaurant is awesome") +[{'label': 'POSITIVE', 'score': 0.9998743534088135}] +``` + +如果您想使用 [hub](https://huggingface.co) 上的特定模型,可以忽略任务,如果hub上的模型已经定义了该任务: + +```python +>>> pipe = pipeline(model="FacebookAI/roberta-large-mnli") +>>> pipe("This restaurant is awesome") +[{'label': 'NEUTRAL', 'score': 0.7313136458396912}] +``` + +要在多个项目上调用pipeline,可以使用*列表*调用它。 + +```python +>>> pipe = pipeline("text-classification") +>>> pipe(["This restaurant is awesome", "This restaurant is awful"]) +[{'label': 'POSITIVE', 'score': 0.9998743534088135}, + {'label': 'NEGATIVE', 'score': 0.9996669292449951}] +``` + +为了遍历整个数据集,建议直接使用 `dataset`。这意味着您不需要一次性分配整个数据集,也不需要自己进行批处理。这应该与GPU上的自定义循环一样快。如果不是,请随时提出issue。 + +```python +import datasets +from transformers import pipeline +from transformers.pipelines.pt_utils import KeyDataset +from tqdm.auto import tqdm + +pipe = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h", device=0) +dataset = datasets.load_dataset("superb", name="asr", split="test") + +# KeyDataset (only *pt*) will simply return the item in the dict returned by the dataset item +# as we're not interested in the *target* part of the dataset. For sentence pair use KeyPairDataset +for out in tqdm(pipe(KeyDataset(dataset, "file"))): + print(out) + # {"text": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND"} + # {"text": ....} + # .... +``` + +为了方便使用,也可以使用生成器: + + +```python +from transformers import pipeline + +pipe = pipeline("text-classification") + + +def data(): + while True: + # This could come from a dataset, a database, a queue or HTTP request + # in a server + # Caveat: because this is iterative, you cannot use `num_workers > 1` variable + # to use multiple threads to preprocess data. You can still have 1 thread that + # does the preprocessing while the main runs the big inference + yield "This is a test" + + +for out in pipe(data()): + print(out) + # {"text": "NUMBER TEN FRESH NELLY IS WAITING ON YOU GOOD NIGHT HUSBAND"} + # {"text": ....} + # .... 
+``` + +[[autodoc]] pipeline + +## Pipeline batching + +所有pipeline都可以使用批处理。这将在pipeline使用其流处理功能时起作用(即传递列表或 `Dataset` 或 `generator` 时)。 + +```python +from transformers import pipeline +from transformers.pipelines.pt_utils import KeyDataset +import datasets + +dataset = datasets.load_dataset("imdb", name="plain_text", split="unsupervised") +pipe = pipeline("text-classification", device=0) +for out in pipe(KeyDataset(dataset, "text"), batch_size=8, truncation="only_first"): + print(out) + # [{'label': 'POSITIVE', 'score': 0.9998743534088135}] + # Exactly the same output as before, but the content are passed + # as batches to the model +``` + + + +然而,这并不自动意味着性能提升。它可能是一个10倍的加速或5倍的减速,具体取决于硬件、数据和实际使用的模型。 + +主要是加速的示例: + + + +```python +from transformers import pipeline +from torch.utils.data import Dataset +from tqdm.auto import tqdm + +pipe = pipeline("text-classification", device=0) + + +class MyDataset(Dataset): + def __len__(self): + return 5000 + + def __getitem__(self, i): + return "This is a test" + + +dataset = MyDataset() + +for batch_size in [1, 8, 64, 256]: + print("-" * 30) + print(f"Streaming batch_size={batch_size}") + for out in tqdm(pipe(dataset, batch_size=batch_size), total=len(dataset)): + pass +``` + +``` +# On GTX 970 +------------------------------ +Streaming no batching +100%|██████████████████████████████████████████████████████████████████████| 5000/5000 [00:26<00:00, 187.52it/s] +------------------------------ +Streaming batch_size=8 +100%|█████████████████████████████████████████████████████████████████████| 5000/5000 [00:04<00:00, 1205.95it/s] +------------------------------ +Streaming batch_size=64 +100%|█████████████████████████████████████████████████████████████████████| 5000/5000 [00:02<00:00, 2478.24it/s] +------------------------------ +Streaming batch_size=256 +100%|█████████████████████████████████████████████████████████████████████| 5000/5000 [00:01<00:00, 2554.43it/s] +(diminishing returns, saturated the GPU) +``` + +主要是减速的示例: + +```python +class MyDataset(Dataset): + def __len__(self): + return 5000 + + def __getitem__(self, i): + if i % 64 == 0: + n = 100 + else: + n = 1 + return "This is a test" * n +``` + +与其他句子相比,这是一个非常长的句子。在这种情况下,**整个**批次将需要400个tokens的长度,因此整个批次将是 [64, 400] 而不是 [64, 4],从而导致较大的减速。更糟糕的是,在更大的批次上,程序会崩溃。 + +``` +------------------------------ +Streaming no batching +100%|█████████████████████████████████████████████████████████████████████| 1000/1000 [00:05<00:00, 183.69it/s] +------------------------------ +Streaming batch_size=8 +100%|█████████████████████████████████████████████████████████████████████| 1000/1000 [00:03<00:00, 265.74it/s] +------------------------------ +Streaming batch_size=64 +100%|██████████████████████████████████████████████████████████████████████| 1000/1000 [00:26<00:00, 37.80it/s] +------------------------------ +Streaming batch_size=256 + 0%| | 0/1000 [00:00 + for out in tqdm(pipe(dataset, batch_size=256), total=len(dataset)): +.... + q = q / math.sqrt(dim_per_head) # (bs, n_heads, q_length, dim_per_head) +RuntimeError: CUDA out of memory. 
Tried to allocate 376.00 MiB (GPU 0; 3.95 GiB total capacity; 1.72 GiB already allocated; 354.88 MiB free; 2.46 GiB reserved in total by PyTorch) +``` + +对于这个问题,没有好的(通用)解决方案,效果可能因您的用例而异。经验法则如下: + +对于用户,一个经验法则是: + +- **使用硬件测量负载性能。测量、测量、再测量。真实的数字是唯一的方法。** +- 如果受到延迟的限制(进行推理的实时产品),不要进行批处理。 +- 如果使用CPU,不要进行批处理。 +- 如果您在GPU上处理的是吞吐量(您希望在大量静态数据上运行模型),则: + - 如果对序列长度的大小没有概念("自然"数据),默认情况下不要进行批处理,进行测试并尝试逐渐添加,添加OOM检查以在失败时恢复(如果您不能控制序列长度,它将在某些时候失败)。 + - 如果您的序列长度非常规律,那么批处理更有可能非常有趣,进行测试并推动它,直到出现OOM。 + - GPU越大,批处理越有可能变得更有趣 +- 一旦启用批处理,确保能够很好地处理OOM。 + +## Pipeline chunk batching + +`zero-shot-classification` 和 `question-answering` 在某种意义上稍微特殊,因为单个输入可能会导致模型的多次前向传递。在正常情况下,这将导致 `batch_size` 参数的问题。 + +为了规避这个问题,这两个pipeline都有点特殊,它们是 `ChunkPipeline` 而不是常规的 `Pipeline`。简而言之: + + +```python +preprocessed = pipe.preprocess(inputs) +model_outputs = pipe.forward(preprocessed) +outputs = pipe.postprocess(model_outputs) +``` + +现在变成: + + +```python +all_model_outputs = [] +for preprocessed in pipe.preprocess(inputs): + model_outputs = pipe.forward(preprocessed) + all_model_outputs.append(model_outputs) +outputs = pipe.postprocess(all_model_outputs) +``` + +这对您的代码应该是非常直观的,因为pipeline的使用方式是相同的。 + +这是一个简化的视图,因为Pipeline可以自动处理批次!这意味着您不必担心您的输入实际上会触发多少次前向传递,您可以独立于输入优化 `batch_size`。前面部分的注意事项仍然适用。 + +## Pipeline自定义 + +如果您想要重载特定的pipeline。 + +请随时为您手头的任务创建一个issue,Pipeline的目标是易于使用并支持大多数情况,因此 `transformers` 可能支持您的用例。 + +如果您想简单地尝试一下,可以: + +- 继承您选择的pipeline + +```python +class MyPipeline(TextClassificationPipeline): + def postprocess(): + # Your code goes here + scores = scores * 100 + # And here + + +my_pipeline = MyPipeline(model=model, tokenizer=tokenizer, ...) +# or if you use *pipeline* function, then: +my_pipeline = pipeline(model="xxxx", pipeline_class=MyPipeline) +``` + +这样就可以让您编写所有想要的自定义代码。 + + +## 实现一个pipeline + +[实现一个新的pipeline](../add_new_pipeline) + +## 音频 + +可用于音频任务的pipeline包括以下几种。 + +### AudioClassificationPipeline + +[[autodoc]] AudioClassificationPipeline + - __call__ + - all + +### AutomaticSpeechRecognitionPipeline + +[[autodoc]] AutomaticSpeechRecognitionPipeline + - __call__ + - all + +### TextToAudioPipeline + +[[autodoc]] TextToAudioPipeline + - __call__ + - all + + +### ZeroShotAudioClassificationPipeline + +[[autodoc]] ZeroShotAudioClassificationPipeline + - __call__ + - all + +## 计算机视觉 + +可用于计算机视觉任务的pipeline包括以下几种。 + +### DepthEstimationPipeline +[[autodoc]] DepthEstimationPipeline + - __call__ + - all + +### ImageClassificationPipeline + +[[autodoc]] ImageClassificationPipeline + - __call__ + - all + +### ImageSegmentationPipeline + +[[autodoc]] ImageSegmentationPipeline + - __call__ + - all + +### ImageToImagePipeline + +[[autodoc]] ImageToImagePipeline + - __call__ + - all + +### ObjectDetectionPipeline + +[[autodoc]] ObjectDetectionPipeline + - __call__ + - all + +### VideoClassificationPipeline + +[[autodoc]] VideoClassificationPipeline + - __call__ + - all + +### ZeroShotImageClassificationPipeline + +[[autodoc]] ZeroShotImageClassificationPipeline + - __call__ + - all + +### ZeroShotObjectDetectionPipeline + +[[autodoc]] ZeroShotObjectDetectionPipeline + - __call__ + - all + +## 自然语言处理 + +可用于自然语言处理任务的pipeline包括以下几种。 + +### ConversationalPipeline + +[[autodoc]] Conversation + +[[autodoc]] ConversationalPipeline + - __call__ + - all + +### FillMaskPipeline + +[[autodoc]] FillMaskPipeline + - __call__ + - all + +### NerPipeline + +[[autodoc]] NerPipeline + +See [`TokenClassificationPipeline`] for all details. 
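+
+下面给出一个简要的用法示意:未指定模型时,`pipeline("ner")` 会自动加载该任务的默认模型;其中的输入句子和输出注释仅作示意,实际结果取决于所用的模型。
+
+```python
+from transformers import pipeline
+
+# 示意:构建一个 NER pipeline,aggregation_strategy="simple" 会把同一实体的子词合并
+ner = pipeline("ner", aggregation_strategy="simple")
+print(ner("My name is Wolfgang and I live in Berlin"))
+# 预期返回包含 entity_group / score / word / start / end 等字段的字典列表(输出仅为示意)
+```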
+ +### QuestionAnsweringPipeline + +[[autodoc]] QuestionAnsweringPipeline + - __call__ + - all + +### SummarizationPipeline + +[[autodoc]] SummarizationPipeline + - __call__ + - all + +### TableQuestionAnsweringPipeline + +[[autodoc]] TableQuestionAnsweringPipeline + - __call__ + +### TextClassificationPipeline + +[[autodoc]] TextClassificationPipeline + - __call__ + - all + +### TextGenerationPipeline + +[[autodoc]] TextGenerationPipeline + - __call__ + - all + +### Text2TextGenerationPipeline + +[[autodoc]] Text2TextGenerationPipeline + - __call__ + - all + +### TokenClassificationPipeline + +[[autodoc]] TokenClassificationPipeline + - __call__ + - all + +### TranslationPipeline + +[[autodoc]] TranslationPipeline + - __call__ + - all + +### ZeroShotClassificationPipeline + +[[autodoc]] ZeroShotClassificationPipeline + - __call__ + - all + +## 多模态 + +可用于多模态任务的pipeline包括以下几种。 + +### DocumentQuestionAnsweringPipeline + +[[autodoc]] DocumentQuestionAnsweringPipeline + - __call__ + - all + +### FeatureExtractionPipeline + +[[autodoc]] FeatureExtractionPipeline + - __call__ + - all + +### ImageFeatureExtractionPipeline + +[[autodoc]] ImageFeatureExtractionPipeline + - __call__ + - all + +### ImageToTextPipeline + +[[autodoc]] ImageToTextPipeline + - __call__ + - all + +### MaskGenerationPipeline + +[[autodoc]] MaskGenerationPipeline + - __call__ + - all + +### VisualQuestionAnsweringPipeline + +[[autodoc]] VisualQuestionAnsweringPipeline + - __call__ + - all + +## Parent class: `Pipeline` + +[[autodoc]] Pipeline diff --git a/docs/source/zh/main_classes/processors.md b/docs/source/zh/main_classes/processors.md new file mode 100644 index 00000000000000..60167e317adf90 --- /dev/null +++ b/docs/source/zh/main_classes/processors.md @@ -0,0 +1,146 @@ + + +# Processors + +在 Transformers 库中,processors可以有两种不同的含义: +- 为多模态模型,例如[Wav2Vec2](../model_doc/wav2vec2)(语音和文本)或[CLIP](../model_doc/clip)(文本和视觉)预处理输入的对象 +- 在库的旧版本中用于预处理GLUE或SQUAD数据的已弃用对象。 + +## 多模态processors + +任何多模态模型都需要一个对象来编码或解码将多个模态(包括文本、视觉和音频)组合在一起的数据。这由称为processors的对象处理,这些processors将两个或多个处理对象组合在一起,例如tokenizers(用于文本模态),image processors(用于视觉)和feature extractors(用于音频)。 + +这些processors继承自以下实现保存和加载功能的基类: + + +[[autodoc]] ProcessorMixin + +## 已弃用的processors + +所有processor都遵循与 [`~data.processors.utils.DataProcessor`] 相同的架构。processor返回一个 [`~data.processors.utils.InputExample`] 列表。这些 [`~data.processors.utils.InputExample`] 可以转换为 [`~data.processors.utils.InputFeatures`] 以供输送到模型。 + +[[autodoc]] data.processors.utils.DataProcessor + +[[autodoc]] data.processors.utils.InputExample + +[[autodoc]] data.processors.utils.InputFeatures + +## GLUE + +[General Language Understanding Evaluation (GLUE)](https://gluebenchmark.com/) 是一个基准测试,评估模型在各种现有的自然语言理解任务上的性能。它与论文 [GLUE: A multi-task benchmark and analysis platform for natural language understanding](https://openreview.net/pdf?id=rJ4km2R5t7) 一同发布。 + +该库为以下任务提供了总共10个processor:MRPC、MNLI、MNLI(mismatched)、CoLA、SST2、STSB、QQP、QNLI、RTE 和 WNLI。 + +这些processor是: + +- [`~data.processors.utils.MrpcProcessor`] +- [`~data.processors.utils.MnliProcessor`] +- [`~data.processors.utils.MnliMismatchedProcessor`] +- [`~data.processors.utils.Sst2Processor`] +- [`~data.processors.utils.StsbProcessor`] +- [`~data.processors.utils.QqpProcessor`] +- [`~data.processors.utils.QnliProcessor`] +- [`~data.processors.utils.RteProcessor`] +- [`~data.processors.utils.WnliProcessor`] + +此外,还可以使用以下方法从数据文件加载值并将其转换为 [`~data.processors.utils.InputExample`] 列表。 + +[[autodoc]] data.processors.glue.glue_convert_examples_to_features + + +## XNLI + 
+[跨语言NLI语料库(XNLI)](https://www.nyu.edu/projects/bowman/xnli/) 是一个评估跨语言文本表示质量的基准测试。XNLI是一个基于[*MultiNLI*](http://www.nyu.edu/projects/bowman/multinli/)的众包数据集:”文本对“被标记为包含15种不同语言(包括英语等高资源语言和斯瓦希里语等低资源语言)的文本蕴涵注释。 + +它与论文 [XNLI: Evaluating Cross-lingual Sentence Representations](https://arxiv.org/abs/1809.05053) 一同发布。 + +该库提供了加载XNLI数据的processor: + +- [`~data.processors.utils.XnliProcessor`] + +请注意,由于测试集上有“gold”标签,因此评估是在测试集上进行的。 + +使用这些processor的示例在 [run_xnli.py](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification/run_xnli.py) 脚本中提供。 + + +## SQuAD + +[斯坦福问答数据集(SQuAD)](https://rajpurkar.github.io/SQuAD-explorer//) 是一个评估模型在问答上性能的基准测试。有两个版本,v1.1 和 v2.0。第一个版本(v1.1)与论文 [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://arxiv.org/abs/1606.05250) 一同发布。第二个版本(v2.0)与论文 [Know What You Don't Know: Unanswerable Questions for SQuAD](https://arxiv.org/abs/1806.03822) 一同发布。 + +该库为两个版本各自提供了一个processor: + +### Processors + +这两个processor是: + +- [`~data.processors.utils.SquadV1Processor`] +- [`~data.processors.utils.SquadV2Processor`] + +它们都继承自抽象类 [`~data.processors.utils.SquadProcessor`]。 + +[[autodoc]] data.processors.squad.SquadProcessor + - all + +此外,可以使用以下方法将 SQuAD 示例转换为可用作模型输入的 [`~data.processors.utils.SquadFeatures`]。 + +[[autodoc]] data.processors.squad.squad_convert_examples_to_features + + +这些processor以及前面提到的方法可以与包含数据的文件以及tensorflow_datasets包一起使用。下面给出了示例。 + + +### Example使用 + +以下是使用processor以及使用数据文件的转换方法的示例: + +```python +# Loading a V2 processor +processor = SquadV2Processor() +examples = processor.get_dev_examples(squad_v2_data_dir) + +# Loading a V1 processor +processor = SquadV1Processor() +examples = processor.get_dev_examples(squad_v1_data_dir) + +features = squad_convert_examples_to_features( + examples=examples, + tokenizer=tokenizer, + max_seq_length=max_seq_length, + doc_stride=args.doc_stride, + max_query_length=max_query_length, + is_training=not evaluate, +) +``` + +使用 *tensorflow_datasets* 就像使用数据文件一样简单: + +```python +# tensorflow_datasets only handle Squad V1. 
+tfds_examples = tfds.load("squad") +examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate) + +features = squad_convert_examples_to_features( + examples=examples, + tokenizer=tokenizer, + max_seq_length=max_seq_length, + doc_stride=args.doc_stride, + max_query_length=max_query_length, + is_training=not evaluate, +) +``` + +另一个使用这些processor的示例在 [run_squad.py](https://github.com/huggingface/transformers/tree/main/examples/legacy/question-answering/run_squad.py) 脚本中提供。 \ No newline at end of file diff --git a/docs/source/zh/main_classes/quantization.md b/docs/source/zh/main_classes/quantization.md new file mode 100644 index 00000000000000..3c7e4d9212a1d0 --- /dev/null +++ b/docs/source/zh/main_classes/quantization.md @@ -0,0 +1,572 @@ + + +# 量化 🤗 Transformers 模型 + +## AWQ集成 + +AWQ方法已经在[*AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration*论文](https://arxiv.org/abs/2306.00978)中引入。通过AWQ,您可以以4位精度运行模型,同时保留其原始性能(即没有性能降级),并具有比下面介绍的其他量化方法更出色的吞吐量 - 达到与纯`float16`推理相似的吞吐量。 + +我们现在支持使用任何AWQ模型进行推理,这意味着任何人都可以加载和使用在Hub上推送或本地保存的AWQ权重。请注意,使用AWQ需要访问NVIDIA GPU。目前不支持CPU推理。 + + +### 量化一个模型 + +我们建议用户查看生态系统中不同的现有工具,以使用AWQ算法对其模型进行量化,例如: + +- [`llm-awq`](https://github.com/mit-han-lab/llm-awq),来自MIT Han Lab +- [`autoawq`](https://github.com/casper-hansen/AutoAWQ),来自[`casper-hansen`](https://github.com/casper-hansen) +- Intel neural compressor,来自Intel - 通过[`optimum-intel`](https://huggingface.co/docs/optimum/main/en/intel/optimization_inc)使用 + +生态系统中可能存在许多其他工具,请随时提出PR将它们添加到列表中。 +目前与🤗 Transformers的集成仅适用于使用`autoawq`和`llm-awq`量化后的模型。大多数使用`auto-awq`量化的模型可以在🤗 Hub的[`TheBloke`](https://huggingface.co/TheBloke)命名空间下找到,要使用`llm-awq`对模型进行量化,请参阅[`llm-awq`](https://github.com/mit-han-lab/llm-awq/)的示例文件夹中的[`convert_to_hf.py`](https://github.com/mit-han-lab/llm-awq/blob/main/examples/convert_to_hf.py)脚本。 + + +### 加载一个量化的模型 + +您可以使用`from_pretrained`方法从Hub加载一个量化模型。通过检查模型配置文件(`configuration.json`)中是否存在`quantization_config`属性,来进行确认推送的权重是量化的。您可以通过检查字段`quantization_config.quant_method`来确认模型是否以AWQ格式进行量化,该字段应该设置为`"awq"`。请注意,为了性能原因,默认情况下加载模型将设置其他权重为`float16`。如果您想更改这种设置,可以通过将`torch_dtype`参数设置为`torch.float32`或`torch.bfloat16`。在下面的部分中,您可以找到一些示例片段和notebook。 + + +## 示例使用 + +首先,您需要安装[`autoawq`](https://github.com/casper-hansen/AutoAWQ)库 + +```bash +pip install autoawq +``` + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer + +model_id = "TheBloke/zephyr-7B-alpha-AWQ" +model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0") +``` + +如果您首先将模型加载到CPU上,请确保在使用之前将其移动到GPU设备上。 + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer + +model_id = "TheBloke/zephyr-7B-alpha-AWQ" +model = AutoModelForCausalLM.from_pretrained(model_id).to("cuda:0") +``` + +### 结合 AWQ 和 Flash Attention + +您可以将AWQ量化与Flash Attention结合起来,得到一个既被量化又更快速的模型。只需使用`from_pretrained`加载模型,并传递`attn_implementation="flash_attention_2"`参数。 + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer + +model = AutoModelForCausalLM.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ", attn_implementation="flash_attention_2", device_map="cuda:0") +``` + +### 基准测试 + +我们使用[`optimum-benchmark`](https://github.com/huggingface/optimum-benchmark)库进行了一些速度、吞吐量和延迟基准测试。 + +请注意,在编写本文档部分时,可用的量化方法包括:`awq`、`gptq`和`bitsandbytes`。 + 
+基准测试在一台NVIDIA-A100实例上运行,使用[`TheBloke/Mistral-7B-v0.1-AWQ`](https://huggingface.co/TheBloke/Mistral-7B-v0.1-AWQ)作为AWQ模型,[`TheBloke/Mistral-7B-v0.1-GPTQ`](https://huggingface.co/TheBloke/Mistral-7B-v0.1-GPTQ)作为GPTQ模型。我们还将其与`bitsandbytes`量化模型和`float16`模型进行了对比。以下是一些结果示例: + + +
+ +
+ +
+ +
+ +
+ +
+ +
+ +
+ +你可以在[此链接](https://github.com/huggingface/optimum-benchmark/tree/main/examples/running-mistrals)中找到完整的结果以及包版本。 + +从结果来看,AWQ量化方法是推理、文本生成中最快的量化方法,并且在文本生成的峰值内存方面属于最低。然而,对于每批数据,AWQ似乎有最大的前向延迟。 + + +### Google colab 演示 + +查看如何在[Google Colab演示](https://colab.research.google.com/drive/1HzZH89yAXJaZgwJDhQj9LqSBux932BvY)中使用此集成! + + +### AwqConfig + +[[autodoc]] AwqConfig + +## `AutoGPTQ` 集成 + +🤗 Transformers已经整合了`optimum` API,用于对语言模型执行GPTQ量化。您可以以8、4、3甚至2位加载和量化您的模型,而性能无明显下降,并且推理速度更快!这受到大多数GPU硬件的支持。 + +要了解更多关于量化模型的信息,请查看: +- [GPTQ](https://arxiv.org/pdf/2210.17323.pdf)论文 +- `optimum`关于GPTQ量化的[指南](https://huggingface.co/docs/optimum/llm_quantization/usage_guides/quantization) +- 用作后端的[`AutoGPTQ`](https://github.com/PanQiWei/AutoGPTQ)库 + + +### 要求 + +为了运行下面的代码,您需要安装: + +- 安装最新版本的 `AutoGPTQ` 库 +`pip install auto-gptq` + +- 从源代码安装最新版本的`optimum` +`pip install git+https://github.com/huggingface/optimum.git` + +- 从源代码安装最新版本的`transformers` +`pip install git+https://github.com/huggingface/transformers.git` + +- 安装最新版本的`accelerate`库: +`pip install --upgrade accelerate` + +请注意,目前GPTQ集成仅支持文本模型,对于视觉、语音或多模态模型可能会遇到预期以外结果。 + +### 加载和量化模型 + +GPTQ是一种在使用量化模型之前需要进行权重校准的量化方法。如果您想从头开始对transformers模型进行量化,生成量化模型可能需要一些时间(在Google Colab上对`facebook/opt-350m`模型量化约为5分钟)。 + +因此,有两种不同的情况下您可能想使用GPTQ量化模型。第一种情况是加载已经由其他用户在Hub上量化的模型,第二种情况是从头开始对您的模型进行量化并保存或推送到Hub,以便其他用户也可以使用它。 + + +#### GPTQ 配置 + +为了加载和量化一个模型,您需要创建一个[`GPTQConfig`]。您需要传递`bits`的数量,一个用于校准量化的`dataset`,以及模型的`tokenizer`以准备数据集。 + +```python +model_id = "facebook/opt-125m" +tokenizer = AutoTokenizer.from_pretrained(model_id) +gptq_config = GPTQConfig(bits=4, dataset = "c4", tokenizer=tokenizer) +``` + +请注意,您可以将自己的数据集以字符串列表形式传递到模型。然而,强烈建议您使用GPTQ论文中提供的数据集。 + + +```python +dataset = ["auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."] +quantization = GPTQConfig(bits=4, dataset = dataset, tokenizer=tokenizer) +``` + +#### 量化 + +您可以通过使用`from_pretrained`并设置`quantization_config`来对模型进行量化。 + +```python +from transformers import AutoModelForCausalLM +model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=gptq_config) + +``` + +请注意,您需要一个GPU来量化模型。我们将模型放在cpu中,并将模块来回移动到gpu中,以便对其进行量化。 + +如果您想在使用 CPU 卸载的同时最大化 GPU 使用率,您可以设置 `device_map = "auto"`。 + + +```python +from transformers import AutoModelForCausalLM +model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=gptq_config) +``` + +请注意,不支持磁盘卸载。此外,如果由于数据集而内存不足,您可能需要在`from_pretrained`中设置`max_memory`。查看这个[指南](https://huggingface.co/docs/accelerate/usage_guides/big_modeling#designing-a-device-map)以了解有关`device_map`和`max_memory`的更多信息。 + + + +目前,GPTQ量化仅适用于文本模型。此外,量化过程可能会花费很多时间,具体取决于硬件性能(175B模型在NVIDIA A100上需要4小时)。请在Hub上检查是否有模型的GPTQ量化版本。如果没有,您可以在GitHub上提交需求。 + + +### 推送量化模型到 🤗 Hub + +您可以使用`push_to_hub`将量化模型像任何模型一样推送到Hub。量化配置将与模型一起保存和推送。 + +```python +quantized_model.push_to_hub("opt-125m-gptq") +tokenizer.push_to_hub("opt-125m-gptq") +``` + +如果您想在本地计算机上保存量化模型,您也可以使用`save_pretrained`来完成: + +```python +quantized_model.save_pretrained("opt-125m-gptq") +tokenizer.save_pretrained("opt-125m-gptq") +``` + +请注意,如果您量化模型时想使用`device_map`,请确保在保存之前将整个模型移动到您的GPU或CPU之一。 + +```python +quantized_model.to("cpu") +quantized_model.save_pretrained("opt-125m-gptq") +``` + +### 从 🤗 Hub 加载一个量化模型 + +您可以使用`from_pretrained`从Hub加载量化模型。 +请确保推送权重是量化的,检查模型配置对象中是否存在`quantization_config`属性。 + + +```python +from transformers import AutoModelForCausalLM +model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq") +``` + 
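+加载完成后,可以按上文所述确认配置对象中确实带有 `quantization_config` 属性。以下两行为示意代码,接上面加载得到的 `model`,输出内容取决于具体模型:
+
+```python
+# 已量化的checkpoint通常会在模型配置中携带 quantization_config;这里用 getattr 做保守检查
+print(getattr(model.config, "quantization_config", None))
+```
+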
+如果您想更快地加载模型,并且不需要分配比实际需要内存更多的内存,量化模型也使用`device_map`参数。确保您已安装`accelerate`库。 + +```python +from transformers import AutoModelForCausalLM +model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto") +``` + +### Exllama内核加快推理速度 + +保留格式:对于 4 位模型,您可以使用 exllama 内核来提高推理速度。默认情况下,它处于启用状态。您可以通过在 [`GPTQConfig`] 中传递 `use_exllama` 来更改此配置。这将覆盖存储在配置中的量化配置。请注意,您只能覆盖与内核相关的属性。此外,如果您想使用 exllama 内核,整个模型需要全部部署在 gpus 上。此外,您可以使用 版本 > 0.4.2 的 Auto-GPTQ 并传递 `device_map` = "cpu" 来执行 CPU 推理。对于 CPU 推理,您必须在 `GPTQConfig` 中传递 `use_exllama = False`。 + +```py +import torch +gptq_config = GPTQConfig(bits=4) +model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config=gptq_config) +``` + +随着 exllamav2 内核的发布,与 exllama 内核相比,您可以获得更快的推理速度。您只需在 [`GPTQConfig`] 中传递 `exllama_config={"version": 2}`: + +```py +import torch +gptq_config = GPTQConfig(bits=4, exllama_config={"version":2}) +model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config = gptq_config) +``` + +请注意,目前仅支持 4 位模型。此外,如果您正在使用 peft 对量化模型进行微调,建议禁用 exllama 内核。 + +您可以在此找到这些内核的基准测试 [这里](https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark) + + +#### 微调一个量化模型 + +在Hugging Face生态系统的官方支持下,您可以使用GPTQ进行量化后的模型进行微调。 +请查看`peft`库了解更多详情。 + +### 示例演示 + +请查看 Google Colab [notebook](https://colab.research.google.com/drive/1_TIrmuKOFhuRRiTWN94ilkUFu6ZX4ceb?usp=sharing),了解如何使用GPTQ量化您的模型以及如何使用peft微调量化模型。 + +### GPTQConfig + +[[autodoc]] GPTQConfig + + +## `bitsandbytes` 集成 + +🤗 Transformers 与 `bitsandbytes` 上最常用的模块紧密集成。您可以使用几行代码以 8 位精度加载您的模型。 +自bitsandbytes的0.37.0版本发布以来,大多数GPU硬件都支持这一点。 + +在[LLM.int8()](https://arxiv.org/abs/2208.07339)论文中了解更多关于量化方法的信息,或者在[博客文章](https://huggingface.co/blog/hf-bitsandbytes-integration)中了解关于合作的更多信息。 + +自其“0.39.0”版本发布以来,您可以使用FP4数据类型,通过4位量化加载任何支持“device_map”的模型。 + +如果您想量化自己的 pytorch 模型,请查看 🤗 Accelerate 的[文档](https://huggingface.co/docs/accelerate/main/en/usage_guides/quantization)。 + +以下是您可以使用“bitsandbytes”集成完成的事情 + +### 通用用法 + +只要您的模型支持使用 🤗 Accelerate 进行加载并包含 `torch.nn.Linear` 层,您可以在调用 [`~PreTrainedModel.from_pretrained`] 方法时使用 `load_in_8bit` 或 `load_in_4bit` 参数来量化模型。这也应该适用于任何模态。 + +```python +from transformers import AutoModelForCausalLM + +model_8bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_8bit=True) +model_4bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_4bit=True) +``` + +默认情况下,所有其他模块(例如 `torch.nn.LayerNorm`)将被转换为 `torch.float16` 类型。但如果您想更改它们的 `dtype`,可以重载 `torch_dtype` 参数: + +```python +>>> import torch +>>> from transformers import AutoModelForCausalLM + +>>> model_8bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_8bit=True, torch_dtype=torch.float32) +>>> model_8bit.model.decoder.layers[-1].final_layer_norm.weight.dtype +torch.float32 +``` + + +### FP4 量化 + +#### 要求 + +确保在运行以下代码段之前已完成以下要求: + +- 最新版本 `bitsandbytes` 库 +`pip install bitsandbytes>=0.39.0` + +- 安装最新版本 `accelerate` +`pip install --upgrade accelerate` + +- 安装最新版本 `transformers` +`pip install --upgrade transformers` + +#### 提示和最佳实践 + + +- **高级用法:** 请参考 [此 Google Colab notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf) 以获取 4 位量化高级用法和所有可选选项。 + +- **使用 `batch_size=1` 实现更快的推理:** 自 `bitsandbytes` 的 `0.40.0` 版本以来,设置 `batch_size=1`,您可以从快速推理中受益。请查看 [这些发布说明](https://github.com/TimDettmers/bitsandbytes/releases/tag/0.40.0) ,并确保使用大于 `0.40.0` 的版本以直接利用此功能。 + +- **训练:** 根据 [QLoRA 
论文](https://arxiv.org/abs/2305.14314),对于4位基模型训练(使用 LoRA 适配器),应使用 `bnb_4bit_quant_type='nf4'`。 + +- **推理:** 对于推理,`bnb_4bit_quant_type` 对性能影响不大。但是为了与模型的权重保持一致,请确保使用相同的 `bnb_4bit_compute_dtype` 和 `torch_dtype` 参数。 + +#### 加载 4 位量化的大模型 + +在调用 `.from_pretrained` 方法时使用 `load_in_4bit=True`,可以将您的内存使用量减少到大约原来的 1/4。 + +```python +# pip install transformers accelerate bitsandbytes +from transformers import AutoModelForCausalLM, AutoTokenizer + +model_id = "bigscience/bloom-1b7" + +tokenizer = AutoTokenizer.from_pretrained(model_id) +model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_4bit=True) +``` + + + +需要注意的是,一旦模型以 4 位量化方式加载,就无法将量化后的权重推送到 Hub 上。此外,您不能训练 4 位量化权重,因为目前尚不支持此功能。但是,您可以使用 4 位量化模型来训练额外参数,这将在下一部分中介绍。 + + + +### 加载 8 位量化的大模型 + +您可以通过在调用 `.from_pretrained` 方法时使用 `load_in_8bit=True` 参数,将内存需求大致减半来加载模型 + + +```python +# pip install transformers accelerate bitsandbytes +from transformers import AutoModelForCausalLM, AutoTokenizer + +model_id = "bigscience/bloom-1b7" + +tokenizer = AutoTokenizer.from_pretrained(model_id) +model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True) +``` + +然后,像通常使用 `PreTrainedModel` 一样使用您的模型。 + +您可以使用 `get_memory_footprint` 方法检查模型的内存占用。 + + +```python +print(model.get_memory_footprint()) +``` + +通过这种集成,我们能够在较小的设备上加载大模型并运行它们而没有任何问题。 + + + +需要注意的是,一旦模型以 8 位量化方式加载,除了使用最新的 `transformers` 和 `bitsandbytes` 之外,目前尚无法将量化后的权重推送到 Hub 上。此外,您不能训练 8 位量化权重,因为目前尚不支持此功能。但是,您可以使用 8 位量化模型来训练额外参数,这将在下一部分中介绍。 + +注意,`device_map` 是可选的,但设置 `device_map = 'auto'` 更适合用于推理,因为它将更有效地调度可用资源上的模型。 + + + + +#### 高级用例 + +在这里,我们将介绍使用 FP4 量化的一些高级用例。 + +##### 更改计算数据类型 + +计算数据类型用于改变在进行计算时使用的数据类型。例如,hidden states可以是 `float32`,但为了加速,计算时可以被设置为 `bf16`。默认情况下,计算数据类型被设置为 `float32`。 + + +```python +import torch +from transformers import BitsAndBytesConfig + +quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16) +``` + +#### 使用 NF4(普通浮点数 4)数据类型 + +您还可以使用 NF4 数据类型,这是一种针对使用正态分布初始化的权重而适应的新型 4 位数据类型。要运行: + +```python +from transformers import BitsAndBytesConfig + +nf4_config = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_quant_type="nf4", +) + +model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config) +``` + +#### 使用嵌套量化进行更高效的内存推理 + +我们还建议用户使用嵌套量化技术。从我们的经验观察来看,这种方法在不增加额外性能的情况下节省更多内存。这使得 llama-13b 模型能够在具有 1024 个序列长度、1 个批次大小和 4 个梯度累积步骤的 NVIDIA-T4 16GB 上进行 fine-tuning。 + +```python +from transformers import BitsAndBytesConfig + +double_quant_config = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_use_double_quant=True, +) + +model_double_quant = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=double_quant_config) +``` + +### 将量化模型推送到🤗 Hub + +您可以使用 `push_to_hub` 方法将量化模型推送到 Hub 上。这将首先推送量化配置文件,然后推送量化模型权重。 +请确保使用 `bitsandbytes>0.37.2`(在撰写本文时,我们使用的是 `bitsandbytes==0.38.0.post1`)才能使用此功能。 + + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer + +model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m", device_map="auto", load_in_8bit=True) +tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m") + +model.push_to_hub("bloom-560m-8bit") +``` + + + +对大模型,强烈鼓励将 8 位量化模型推送到 Hub 上,以便让社区能够从内存占用减少和加载中受益,例如在 Google Colab 上加载大模型。 + + + +### 从🤗 Hub加载量化模型 + +您可以使用 `from_pretrained` 方法从 Hub 加载量化模型。请确保推送的权重是量化的,检查模型配置对象中是否存在 `quantization_config` 属性。 + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer + +model = AutoModelForCausalLM.from_pretrained("{your_username}/bloom-560m-8bit", device_map="auto") 
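+
+# 示意:加载后即可像普通模型一样进行生成推理(仓库名沿用上面的占位符,需替换为实际模型)
+tokenizer = AutoTokenizer.from_pretrained("{your_username}/bloom-560m-8bit")
+inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
+outputs = model.generate(**inputs, max_new_tokens=20)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))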
+``` + +请注意,在这种情况下,您不需要指定 `load_in_8bit=True` 参数,但需要确保 `bitsandbytes` 和 `accelerate` 已安装。 +情注意,`device_map` 是可选的,但设置 `device_map = 'auto'` 更适合用于推理,因为它将更有效地调度可用资源上的模型。 + +### 高级用例 + +本节面向希望探索除了加载和运行 8 位模型之外还能做什么的进阶用户。 + +#### 在 `cpu` 和 `gpu` 之间卸载 + +此高级用例之一是能够加载模型并将权重分派到 `CPU` 和 `GPU` 之间。请注意,将在 CPU 上分派的权重 **不会** 转换为 8 位,因此会保留为 `float32`。此功能适用于想要适应非常大的模型并将模型分派到 GPU 和 CPU 之间的用户。 + +首先,从 `transformers` 中加载一个 [`BitsAndBytesConfig`],并将属性 `llm_int8_enable_fp32_cpu_offload` 设置为 `True`: + + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig + +quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True) +``` + +假设您想加载 `bigscience/bloom-1b7` 模型,您的 GPU显存仅足够容纳除了`lm_head`外的整个模型。因此,您可以按照以下方式编写自定义的 device_map: + +```python +device_map = { + "transformer.word_embeddings": 0, + "transformer.word_embeddings_layernorm": 0, + "lm_head": "cpu", + "transformer.h": 0, + "transformer.ln_f": 0, +} +``` + +然后如下加载模型: + +```python +model_8bit = AutoModelForCausalLM.from_pretrained( + "bigscience/bloom-1b7", + device_map=device_map, + quantization_config=quantization_config, +) +``` + +这就是全部内容!享受您的模型吧! + +#### 使用`llm_int8_threshold` + +您可以使用 `llm_int8_threshold` 参数来更改异常值的阈值。“异常值”是一个大于特定阈值的`hidden state`值。 +这对应于`LLM.int8()`论文中描述的异常检测的异常阈值。任何高于此阈值的`hidden state`值都将被视为异常值,对这些值的操作将在 fp16 中完成。值通常是正态分布的,也就是说,大多数值在 [-3.5, 3.5] 范围内,但有一些额外的系统异常值,对于大模型来说,它们的分布非常不同。这些异常值通常在区间 [-60, -6] 或 [6, 60] 内。Int8 量化对于幅度为 ~5 的值效果很好,但超出这个范围,性能就会明显下降。一个好的默认阈值是 6,但对于更不稳定的模型(小模型、微调)可能需要更低的阈值。 +这个参数会影响模型的推理速度。我们建议尝试这个参数,以找到最适合您的用例的参数。 + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig + +model_id = "bigscience/bloom-1b7" + +quantization_config = BitsAndBytesConfig( + llm_int8_threshold=10, +) + +model_8bit = AutoModelForCausalLM.from_pretrained( + model_id, + device_map=device_map, + quantization_config=quantization_config, +) +tokenizer = AutoTokenizer.from_pretrained(model_id) +``` + +#### 跳过某些模块的转换 + +一些模型有几个需要保持未转换状态以确保稳定性的模块。例如,Jukebox 模型有几个 `lm_head` 模块需要跳过。使用 `llm_int8_skip_modules` 参数进行相应操作。 + + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig + +model_id = "bigscience/bloom-1b7" + +quantization_config = BitsAndBytesConfig( + llm_int8_skip_modules=["lm_head"], +) + +model_8bit = AutoModelForCausalLM.from_pretrained( + model_id, + device_map=device_map, + quantization_config=quantization_config, +) +tokenizer = AutoTokenizer.from_pretrained(model_id) +``` + +#### 微调已加载为8位精度的模型 + +借助Hugging Face生态系统中适配器(adapters)的官方支持,您可以在8位精度下微调模型。这使得可以在单个Google Colab中微调大模型,例如`flan-t5-large`或`facebook/opt-6.7b`。请查看[`peft`](https://github.com/huggingface/peft)库了解更多详情。 + +注意,加载模型进行训练时无需传递`device_map`。它将自动将您的模型加载到GPU上。如果需要,您可以将设备映射为特定设备(例如`cuda:0`、`0`、`torch.device('cuda:0')`)。请注意,`device_map=auto`仅应用于推理。 + + +### BitsAndBytesConfig + +[[autodoc]] BitsAndBytesConfig + + +## 使用 🤗 `optimum` 进行量化 + +请查看[Optimum 文档](https://huggingface.co/docs/optimum/index)以了解更多关于`optimum`支持的量化方法,并查看这些方法是否适用于您的用例。 + diff --git a/docs/source/zh/main_classes/text_generation.md b/docs/source/zh/main_classes/text_generation.md new file mode 100644 index 00000000000000..773228832f2272 --- /dev/null +++ b/docs/source/zh/main_classes/text_generation.md @@ -0,0 +1,58 @@ + + +# Generation + +每个框架都在它们各自的 `GenerationMixin` 类中实现了文本生成的 `generate` 方法: + +- PyTorch [`~generation.GenerationMixin.generate`] 在 [`~generation.GenerationMixin`] 中实现。 +- TensorFlow [`~generation.TFGenerationMixin.generate`] 在 [`~generation.TFGenerationMixin`] 中实现。 +- 
Flax/JAX [`~generation.FlaxGenerationMixin.generate`] 在 [`~generation.FlaxGenerationMixin`] 中实现。 + +无论您选择哪个框架,都可以使用 [`~generation.GenerationConfig`] 类实例对 generate 方法进行参数化。有关生成方法的控制参数的完整列表,请参阅此类。 + +要了解如何检查模型的生成配置、默认值是什么、如何临时更改参数以及如何创建和保存自定义生成配置,请参阅 [文本生成策略指南](../generation_strategies)。该指南还解释了如何使用相关功能,如token流。 + +## GenerationConfig + +[[autodoc]] generation.GenerationConfig + - from_pretrained + - from_model_config + - save_pretrained + +## GenerationMixin + +[[autodoc]] generation.GenerationMixin + - generate + - compute_transition_scores + - greedy_search + - sample + - beam_search + - beam_sample + - contrastive_search + - group_beam_search + - constrained_beam_search + +## TFGenerationMixin + +[[autodoc]] generation.TFGenerationMixin + - generate + - compute_transition_scores + +## FlaxGenerationMixin + +[[autodoc]] generation.FlaxGenerationMixin + - generate diff --git a/docs/source/zh/main_classes/tokenizer.md b/docs/source/zh/main_classes/tokenizer.md new file mode 100644 index 00000000000000..f89fc20b53d1d9 --- /dev/null +++ b/docs/source/zh/main_classes/tokenizer.md @@ -0,0 +1,65 @@ + + +# Tokenizer + +tokenizer负责准备输入以供模型使用。该库包含所有模型的tokenizer。大多数tokenizer都有两种版本:一个是完全的 Python 实现,另一个是基于 Rust 库 [🤗 Tokenizers](https://github.com/huggingface/tokenizers) 的“Fast”实现。"Fast" 实现允许: + +1. 在批量分词时显著提速 +2. 在原始字符串(字符和单词)和token空间之间进行映射的其他方法(例如,获取包含给定字符的token的索引或与给定token对应的字符范围)。 + +基类 [PreTrainedTokenizer] 和 [PreTrained TokenizerFast] 实现了在模型输入中编码字符串输入的常用方法(见下文),并从本地文件或目录或从库提供的预训练的 tokenizer(从 HuggingFace 的 AWS S3 存储库下载)实例化/保存 python 和“Fast” tokenizer。它们都依赖于包含常用方法的 [`~tokenization_utils_base.PreTrainedTokenizerBase`]和[`~tokenization_utils_base.SpecialTokensMixin`]。 + +因此,[`PreTrainedTokenizer`] 和 [`PreTrainedTokenizerFast`] 实现了使用所有tokenizers的主要方法: + +- 分词(将字符串拆分为子词标记字符串),将tokens字符串转换为id并转换回来,以及编码/解码(即标记化并转换为整数)。 +- 以独立于底层结构(BPE、SentencePiece……)的方式向词汇表中添加新tokens。 +- 管理特殊tokens(如mask、句首等):添加它们,将它们分配给tokenizer中的属性以便于访问,并确保它们在标记过程中不会被分割。 + +[`BatchEncoding`] 包含 [`~tokenization_utils_base.PreTrainedTokenizerBase`] 的编码方法(`__call__`、`encode_plus` 和 `batch_encode_plus`)的输出,并且是从 Python 字典派生的。当tokenizer是纯 Python tokenizer时,此类的行为就像标准的 Python 字典一样,并保存这些方法计算的各种模型输入(`input_ids`、`attention_mask` 等)。当分词器是“Fast”分词器时(即由 HuggingFace 的 [tokenizers 库](https://github.com/huggingface/tokenizers) 支持),此类还提供了几种高级对齐方法,可用于在原始字符串(字符和单词)与token空间之间进行映射(例如,获取包含给定字符的token的索引或与给定token对应的字符范围)。 + + +## PreTrainedTokenizer + +[[autodoc]] PreTrainedTokenizer + - __call__ + - add_tokens + - add_special_tokens + - apply_chat_template + - batch_decode + - decode + - encode + - push_to_hub + - all + +## PreTrainedTokenizerFast + +[`PreTrainedTokenizerFast`] 依赖于 [tokenizers](https://huggingface.co/docs/tokenizers) 库。可以非常简单地将从 🤗 tokenizers 库获取的tokenizers加载到 🤗 transformers 中。查看 [使用 🤗 tokenizers 的分词器](../fast_tokenizers) 页面以了解如何执行此操作。 + +[[autodoc]] PreTrainedTokenizerFast + - __call__ + - add_tokens + - add_special_tokens + - apply_chat_template + - batch_decode + - decode + - encode + - push_to_hub + - all + +## BatchEncoding + +[[autodoc]] BatchEncoding diff --git a/docs/source/zh/main_classes/trainer.md b/docs/source/zh/main_classes/trainer.md new file mode 100644 index 00000000000000..cb0262140cb22d --- /dev/null +++ b/docs/source/zh/main_classes/trainer.md @@ -0,0 +1,665 @@ + + +# Trainer + +[`Trainer`] 类提供了一个 PyTorch 的 API,用于处理大多数标准用例的全功能训练。它在大多数[示例脚本](https://github.com/huggingface/transformers/tree/main/examples)中被使用。 + + + +如果你想要使用自回归技术在文本数据集上微调像 Llama-2 或 Mistral 这样的语言模型,考虑使用 [`trl`](https://github.com/huggingface/trl) 的 
[`~trl.SFTTrainer`]。[`~trl.SFTTrainer`] 封装了 [`Trainer`],专门针对这个特定任务进行了优化,并支持序列打包、LoRA、量化和 DeepSpeed,以有效扩展到任何模型大小。另一方面,[`Trainer`] 是一个更通用的选项,适用于更广泛的任务。 + + + +在实例化你的 [`Trainer`] 之前,创建一个 [`TrainingArguments`],以便在训练期间访问所有定制点。 + +这个 API 支持在多个 GPU/TPU 上进行分布式训练,支持 [NVIDIA Apex](https://github.com/NVIDIA/apex) 的混合精度和 PyTorch 的原生 AMP。 + +[`Trainer`] 包含基本的训练循环,支持上述功能。如果需要自定义训练,你可以继承 `Trainer` 并覆盖以下方法: + +- **get_train_dataloader** -- 创建训练 DataLoader。 +- **get_eval_dataloader** -- 创建评估 DataLoader。 +- **get_test_dataloader** -- 创建测试 DataLoader。 +- **log** -- 记录观察训练的各种对象的信息。 +- **create_optimizer_and_scheduler** -- 如果它们没有在初始化时传递,请设置优化器和学习率调度器。请注意,你还可以单独继承或覆盖 `create_optimizer` 和 `create_scheduler` 方法。 +- **create_optimizer** -- 如果在初始化时没有传递,则设置优化器。 +- **create_scheduler** -- 如果在初始化时没有传递,则设置学习率调度器。 +- **compute_loss** - 计算单批训练输入的损失。 +- **training_step** -- 执行一步训练。 +- **prediction_step** -- 执行一步评估/测试。 +- **evaluate** -- 运行评估循环并返回指标。 +- **predict** -- 返回在测试集上的预测(如果有标签,则包括指标)。 + + + +[`Trainer`] 类被优化用于 🤗 Transformers 模型,并在你在其他模型上使用时可能会有一些令人惊讶的结果。当在你自己的模型上使用时,请确保: + +- 你的模型始终返回元组或 [`~utils.ModelOutput`] 的子类。 +- 如果提供了 `labels` 参数,你的模型可以计算损失,并且损失作为元组的第一个元素返回(如果你的模型返回元组)。 +- 你的模型可以接受多个标签参数(在 [`TrainingArguments`] 中使用 `label_names` 将它们的名称指示给 [`Trainer`]),但它们中没有一个应该被命名为 `"label"`。 + + + +以下是如何自定义 [`Trainer`] 以使用加权损失的示例(在训练集不平衡时很有用): + +```python +from torch import nn +from transformers import Trainer + + +class CustomTrainer(Trainer): + def compute_loss(self, model, inputs, return_outputs=False): + labels = inputs.pop("labels") + # forward pass + outputs = model(**inputs) + logits = outputs.get("logits") + # compute custom loss (suppose one has 3 labels with different weights) + loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0, 3.0], device=model.device)) + loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1)) + return (loss, outputs) if return_outputs else loss +``` + +在 PyTorch [`Trainer`] 中自定义训练循环行为的另一种方法是使用 [callbacks](callback),这些回调可以检查训练循环状态(用于进度报告、在 TensorBoard 或其他 ML 平台上记录日志等)并做出决策(比如提前停止)。 + + +## Trainer + +[[autodoc]] Trainer - all + +## Seq2SeqTrainer + +[[autodoc]] Seq2SeqTrainer - evaluate - predict + +## TrainingArguments + +[[autodoc]] TrainingArguments - all + +## Seq2SeqTrainingArguments + +[[autodoc]] Seq2SeqTrainingArguments - all + +## Checkpoints + +默认情况下,[`Trainer`] 会将所有checkpoints保存在你使用的 [`TrainingArguments`] 中设置的 `output_dir` 中。这些checkpoints将位于名为 `checkpoint-xxx` 的子文件夹中,xxx 是训练的步骤。 + +从checkpoints恢复训练可以通过调用 [`Trainer.train`] 时使用以下任一方式进行: + +- `resume_from_checkpoint=True`,这将从最新的checkpoint恢复训练。 +- `resume_from_checkpoint=checkpoint_dir`,这将从指定目录中的特定checkpoint恢复训练。 + +此外,当使用 `push_to_hub=True` 时,你可以轻松将checkpoints保存在 Model Hub 中。默认情况下,保存在训练中间过程的checkpoints中的所有模型都保存在不同的提交中,但不包括优化器状态。你可以根据需要调整 [`TrainingArguments`] 的 `hub-strategy` 值: + +- `"checkpoint"`: 最新的checkpoint也被推送到一个名为 last-checkpoint 的子文件夹中,让你可以通过 `trainer.train(resume_from_checkpoint="output_dir/last-checkpoint")` 轻松恢复训练。 +- `"all_checkpoints"`: 所有checkpoints都像它们出现在输出文件夹中一样被推送(因此你将在最终存储库中的每个文件夹中获得一个checkpoint文件夹)。 + +## Logging + +默认情况下,[`Trainer`] 将对主进程使用 `logging.INFO`,对副本(如果有的话)使用 `logging.WARNING`。 + +可以通过 [`TrainingArguments`] 的参数覆盖这些默认设置,使用其中的 5 个 `logging` 级别: + +- `log_level` - 用于主进程 +- `log_level_replica` - 用于副本 + +此外,如果 [`TrainingArguments`] 的 `log_on_each_node` 设置为 `False`,则只有主节点将使用其主进程的日志级别设置,所有其他节点将使用副本的日志级别设置。 + +请注意,[`Trainer`] 将在其 [`Trainer.__init__`] 中分别为每个节点设置 `transformers` 的日志级别。因此,如果在创建 [`Trainer`] 对象之前要调用其他 `transformers` 功能,可能需要更早地设置这一点(请参见下面的示例)。 + +以下是如何在应用程序中使用的示例: + 
+```python +[...] +logger = logging.getLogger(__name__) + +# Setup logging +logging.basicConfig( + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%m/%d/%Y %H:%M:%S", + handlers=[logging.StreamHandler(sys.stdout)], +) + +# set the main code and the modules it uses to the same log-level according to the node +log_level = training_args.get_process_log_level() +logger.setLevel(log_level) +datasets.utils.logging.set_verbosity(log_level) +transformers.utils.logging.set_verbosity(log_level) + +trainer = Trainer(...) +``` + +然后,如果你只想在主节点上看到警告,并且所有其他节点不打印任何可能重复的警告,可以这样运行: + +```bash +my_app.py ... --log_level warning --log_level_replica error +``` + +在多节点环境中,如果你也不希望每个节点的主进程的日志重复输出,你需要将上面的代码更改为: + +```bash +my_app.py ... --log_level warning --log_level_replica error --log_on_each_node 0 +``` + +然后,只有第一个节点的主进程将以 "warning" 级别记录日志,主节点上的所有其他进程和其他节点上的所有进程将以 "error" 级别记录日志。 + +如果你希望应用程序尽可能”安静“,可以执行以下操作: + + +```bash +my_app.py ... --log_level error --log_level_replica error --log_on_each_node 0 +``` + +(如果在多节点环境,添加 `--log_on_each_node 0`) + + +## 随机性 + +当从 [`Trainer`] 生成的checkpoint恢复训练时,程序会尽一切努力将 _python_、_numpy_ 和 _pytorch_ 的 RNG(随机数生成器)状态恢复为保存检查点时的状态,这样可以使“停止和恢复”式训练尽可能接近“非停止式”训练。 + +然而,由于各种默认的非确定性 PyTorch 设置,这可能无法完全实现。如果你想要完全确定性,请参阅[控制随机源](https://pytorch.org/docs/stable/notes/randomness)。正如文档中所解释的那样,使事物变得确定的一些设置(例如 `torch.backends.cudnn.deterministic`)可能会减慢速度,因此不能默认执行,但如果需要,你可以自行启用这些设置。 + + +## 特定GPU选择 + +让我们讨论一下如何告诉你的程序应该使用哪些 GPU 以及使用的顺序。 + +当使用 [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) 且仅使用部分 GPU 时,你只需指定要使用的 GPU 数量。例如,如果你有 4 个 GPU,但只想使用前 2 个,可以执行以下操作: + + +```bash +python -m torch.distributed.launch --nproc_per_node=2 trainer-program.py ... +``` + +如果你安装了 [`accelerate`](https://github.com/huggingface/accelerate) 或 [`deepspeed`](https://github.com/microsoft/DeepSpeed),你还可以通过以下任一方法实现相同的效果: + + +```bash +accelerate launch --num_processes 2 trainer-program.py ... +``` + +```bash +deepspeed --num_gpus 2 trainer-program.py ... +``` + +你不需要使用 Accelerate 或 [Deepspeed 集成](Deepspeed) 功能来使用这些启动器。 + +到目前为止,你已经能够告诉程序要使用多少个 GPU。现在让我们讨论如何选择特定的 GPU 并控制它们的顺序。 + +以下环境变量可帮助你控制使用哪些 GPU 以及它们的顺序。 + + +**`CUDA_VISIBLE_DEVICES`** + +如果你有多个 GPU,想要仅使用其中的一个或几个 GPU,请将环境变量 `CUDA_VISIBLE_DEVICES` 设置为要使用的 GPU 列表。 + +例如,假设你有 4 个 GPU:0、1、2 和 3。要仅在物理 GPU 0 和 2 上运行,你可以执行以下操作: + + +```bash +CUDA_VISIBLE_DEVICES=0,2 python -m torch.distributed.launch trainer-program.py ... +``` + +现在,PyTorch 将只看到 2 个 GPU,其中你的物理 GPU 0 和 2 分别映射到 `cuda:0` 和 `cuda:1`。 + +你甚至可以改变它们的顺序: + + +```bash +CUDA_VISIBLE_DEVICES=2,0 python -m torch.distributed.launch trainer-program.py ... +``` + +这里,你的物理 GPU 0 和 2 分别映射到 `cuda:1` 和 `cuda:0`。 + +上面的例子都是针对 `DistributedDataParallel` 使用模式的,但同样的方法也适用于 [`DataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html): + + +```bash +CUDA_VISIBLE_DEVICES=2,0 python trainer-program.py ... +``` + +为了模拟没有 GPU 的环境,只需将此环境变量设置为空值,如下所示: + +```bash +CUDA_VISIBLE_DEVICES= python trainer-program.py ... +``` + +与任何环境变量一样,你当然可以将其export到环境变量而不是将其添加到命令行,如下所示: + + +```bash +export CUDA_VISIBLE_DEVICES=0,2 +python -m torch.distributed.launch trainer-program.py ... +``` + +这种方法可能会令人困惑,因为你可能会忘记之前设置了环境变量,进而不明白为什么会使用错误的 GPU。因此,在同一命令行中仅为特定运行设置环境变量是一种常见做法,正如本节大多数示例所示。 + + +**`CUDA_DEVICE_ORDER`** + +还有一个额外的环境变量 `CUDA_DEVICE_ORDER`,用于控制物理设备的排序方式。有两个选择: + +1. 按 PCIe 总线 ID 排序(与 nvidia-smi 的顺序相匹配)- 这是默认选项。 + + +```bash +export CUDA_DEVICE_ORDER=PCI_BUS_ID +``` + +2. 
按 GPU 计算能力排序。 + +```bash +export CUDA_DEVICE_ORDER=FASTEST_FIRST +``` + +大多数情况下,你不需要关心这个环境变量,但如果你的设置不均匀,那么这将非常有用,例如,您的旧 GPU 和新 GPU 物理上安装在一起,但让速度较慢的旧卡排在运行的第一位。解决这个问题的一种方法是交换卡的位置。但如果不能交换卡(例如,如果设备的散热受到影响),那么设置 `CUDA_DEVICE_ORDER=FASTEST_FIRST` 将始终将较新、更快的卡放在第一位。但这可能会有点混乱,因为 `nvidia-smi` 仍然会按照 PCIe 顺序报告它们。 + +交换卡的顺序的另一种方法是使用: + + +```bash +export CUDA_VISIBLE_DEVICES=1,0 +``` + +在此示例中,我们只使用了 2 个 GPU,但是当然,对于计算机上有的任何数量的 GPU,都适用相同的方法。 + +此外,如果你设置了这个环境变量,最好将其设置在 `~/.bashrc` 文件或其他启动配置文件中,然后就可以忘记它了。 + + +## Trainer集成 + +[`Trainer`] 已经被扩展,以支持可能显著提高训练时间并适应更大模型的库。 + +目前,它支持第三方解决方案 [DeepSpeed](https://github.com/microsoft/DeepSpeed) 和 [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html),它们实现了论文 [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, by Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He](https://arxiv.org/abs/1910.02054) 的部分内容。 + +截至撰写本文,此提供的支持是新的且实验性的。尽管我们欢迎围绕 DeepSpeed 和 PyTorch FSDP 的issues,但我们不再支持 FairScale 集成,因为它已经集成到了 PyTorch 主线(参见 [PyTorch FSDP 集成](#pytorch-fully-sharded-data-parallel))。 + + + + +### CUDA拓展安装注意事项 + + +撰写时,Deepspeed 需要在使用之前编译 CUDA C++ 代码。 + +虽然所有安装问题都应通过 [Deepspeed](https://github.com/microsoft/DeepSpeed/issues) 的 GitHub Issues处理,但在构建依赖CUDA 扩展的任何 PyTorch 扩展时,可能会遇到一些常见问题。 + +因此,如果在执行以下操作时遇到与 CUDA 相关的构建问题: + + +```bash +pip install deepspeed +``` + +请首先阅读以下说明。 + +在这些说明中,我们提供了在 `pytorch` 使用 CUDA `10.2` 构建时应采取的操作示例。如果你的情况有所不同,请记得将版本号调整为您所需的版本。 + + +#### 可能的问题 #1 + +尽管 PyTorch 自带了其自己的 CUDA 工具包,但要构建这两个项目,你必须在整个系统上安装相同版本的 CUDA。 + +例如,如果你在 Python 环境中使用 `cudatoolkit==10.2` 安装了 `pytorch`,你还需要在整个系统上安装 CUDA `10.2`。 + +确切的位置可能因系统而异,但在许多 Unix 系统上,`/usr/local/cuda-10.2` 是最常见的位置。当 CUDA 正确设置并添加到 `PATH` 环境变量时,可以通过执行以下命令找到安装位置: + + +```bash +which nvcc +``` + +如果你尚未在整个系统上安装 CUDA,请首先安装。你可以使用你喜欢的搜索引擎查找说明。例如,如果你使用的是 Ubuntu,你可能想搜索:[ubuntu cuda 10.2 install](https://www.google.com/search?q=ubuntu+cuda+10.2+install)。 + + +#### 可能的问题 #2 + +另一个可能的常见问题是你可能在整个系统上安装了多个 CUDA 工具包。例如,你可能有: + + +```bash +/usr/local/cuda-10.2 +/usr/local/cuda-11.0 +``` + +在这种情况下,你需要确保 `PATH` 和 `LD_LIBRARY_PATH` 环境变量包含所需 CUDA 版本的正确路径。通常,软件包安装程序将设置这些变量以包含最新安装的版本。如果遇到构建失败的问题,且是因为在整个系统安装但软件仍找不到正确的 CUDA 版本,这意味着你需要调整这两个环境变量。 + +首先,你以查看它们的内容: + + +```bash +echo $PATH +echo $LD_LIBRARY_PATH +``` + +因此,您可以了解其中的内容。 + +`LD_LIBRARY_PATH` 可能是空的。 + +`PATH` 列出了可以找到可执行文件的位置,而 `LD_LIBRARY_PATH` 用于查找共享库。在这两种情况下,较早的条目优先于较后的条目。 `:` 用于分隔多个条目。 + +现在,为了告诉构建程序在哪里找到特定的 CUDA 工具包,请插入所需的路径,让其首先列出: + + +```bash +export PATH=/usr/local/cuda-10.2/bin:$PATH +export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH +``` + +请注意,我们没有覆盖现有值,而是在前面添加新的值。 + +当然,根据需要调整版本号和完整路径。检查你分配的目录是否实际存在。`lib64` 子目录是各种 CUDA `.so` 对象(如 `libcudart.so`)的位置,这个名字可能在你的系统中是不同的,如果是,请调整以反映实际情况。 + + +#### 可能的问题 #3 + +一些较旧的 CUDA 版本可能会拒绝使用更新的编译器。例如,你可能有 `gcc-9`,但 CUDA 可能需要 `gcc-7`。 + +有各种方法可以解决这个问题。 + +如果你可以安装最新的 CUDA 工具包,通常它应该支持更新的编译器。 + +或者,你可以在已经拥有的编译器版本之外安装较低版本,或者你可能已经安装了它但它不是默认的编译器,因此构建系统无法找到它。如果你已经安装了 `gcc-7` 但构建系统找不到它,以下操作可能会解决问题: + + +```bash +sudo ln -s /usr/bin/gcc-7 /usr/local/cuda-10.2/bin/gcc +sudo ln -s /usr/bin/g++-7 /usr/local/cuda-10.2/bin/g++ +``` + +这里,我们正在从 `/usr/local/cuda-10.2/bin/gcc` 创建到 `gcc-7` 的软链接,由于 `/usr/local/cuda-10.2/bin/` 应该在 `PATH` 环境变量中(参见前一个问题的解决方案),它应该能够找到 `gcc-7`(和 `g++7`),然后构建将成功。 + +与往常一样,请确保编辑示例中的路径以匹配你的情况。 + + + +### PyTorch完全分片数据并行(FSDP) + +为了加速在更大批次大小上训练庞大模型,我们可以使用完全分片的数据并行模型。这种数据并行范例通过对优化器状态、梯度和参数进行分片,实现了在更多数据和更大模型上的训练。要了解更多信息以及其优势,请查看[完全分片的数据并行博客](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/)。我们已经集成了最新的PyTorch完全分片的数据并行(FSDP)训练功能。您只需通过配置启用它。 + 
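+作为参考,下面是一个直接通过 `TrainingArguments` 启用 FSDP 的最小示意,与后文介绍的 `--fsdp` / `--fsdp_config` 命令行参数等价(`output_dir` 等取值为假设,具体字段含义以下文说明为准):
+
+```python
+from transformers import TrainingArguments
+
+# 最小示意:full_shard 并自动递归包装各层;其余 FSDP 选项可通过 fsdp_config(dict 或 json 文件路径)传入,详见下文
+training_args = TrainingArguments(
+    output_dir="outputs",
+    fsdp="full_shard auto_wrap",
+)
+```
+
+随后像往常一样把 `training_args` 传给 [`Trainer`],并使用分布式启动器(例如 `torch.distributed.launch` 或 `accelerate launch`)运行训练脚本。
+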
+**FSDP支持所需的PyTorch版本**: PyTorch Nightly(或者如果你在发布后阅读这个,使用1.12.0版本,因为带有激活的FSDP的模型保存仅在最近的修复中可用。 + + +**用法**: + +- 如果你尚未使用过分布式启动器,确保你已经添加了它 `-m torch.distributed.launch --nproc_per_node=NUMBER_OF_GPUS_YOU_HAVE`。 + +- **分片策略**: + - FULL_SHARD:在数据并行线程/GPU之间,对优化器状态、梯度和模型参数进行分片。 + 为此,请在命令行参数中添加 `--fsdp full_shard`。 + - SHARD_GRAD_OP:在数据并行线程/GPU之间对优化器状态和梯度进行分片。 + 为此,请在命令行参数中添加 `--fsdp shard_grad_op`。 + - NO_SHARD:不进行分片。为此,请在命令行参数中添加 `--fsdp no_shard`。 +- 要将参数和梯度卸载到CPU,添加 `--fsdp "full_shard offload"` 或 `--fsdp "shard_grad_op offload"` 到命令行参数中。 +- 要使用 `default_auto_wrap_policy` 自动递归地用FSDP包装层,请添加 `--fsdp "full_shard auto_wrap"` 或 `--fsdp "shard_grad_op auto_wrap"` 到命令行参数中。 +- 要同时启用CPU卸载和自动包装层工具,请添加 `--fsdp "full_shard offload auto_wrap"` 或 `--fsdp "shard_grad_op offload auto_wrap"` 到命令行参数中。 +- 其余的FSDP配置通过 `--fsdp_config ` 传递。它可以是FSDP json配置文件的位置(例如,`fsdp_config.json`)或已加载的json文件作为 `dict`。 + - 如果启用了自动包装,您可以使用基于transformer的自动包装策略或基于大小的自动包装策略。 + - 对于基于transformer的自动包装策略,建议在配置文件中指定 `fsdp_transformer_layer_cls_to_wrap`。如果未指定,则默认值为 `model._no_split_modules`(如果可用)。这将指定要包装的transformer层类名(区分大小写),例如 [`BertLayer`]、[`GPTJBlock`]、[`T5Block`] 等。这很重要,因为共享权重的子模块(例如,embedding层)不应最终出现在不同的FSDP包装单元中。使用此策略,每个包装的块将包含多头注意力和后面的几个MLP层。剩余的层,包括共享的embedding层,都将被方便地包装在同一个最外层的FSDP单元中。因此,对于基于transformer的模型,请使用这个方法。 + - 对于基于大小的自动包装策略,请在配置文件中添加 `fsdp_min_num_params`。它指定了FSDP进行自动包装的最小参数数量。 + - 可以在配置文件中指定 `fsdp_backward_prefetch`。它控制何时预取下一组参数。`backward_pre` 和 `backward_pos` 是可用的选项。有关更多信息,请参阅 `torch.distributed.fsdp.fully_sharded_data_parallel.BackwardPrefetch` + - 可以在配置文件中指定 `fsdp_forward_prefetch`。它控制何时预取下一组参数。如果是`"True"`,在执行前向传递时,FSDP明确地预取下一次即将发生的全局聚集。 + - 可以在配置文件中指定 `limit_all_gathers`。如果是`"True"`,FSDP明确地同步CPU线程,以防止太多的进行中的全局聚集。 + - 可以在配置文件中指定 `activation_checkpointing`。如果是`"True"`,FSDP activation checkpoint是一种通过清除某些层的激活值并在反向传递期间重新计算它们来减少内存使用的技术。实际上,这以更多的计算时间为代价减少了内存使用。 + + +**需要注意几个注意事项** +- 它与 `generate` 不兼容,因此与所有seq2seq/clm脚本(翻译/摘要/clm等)中的 `--predict_with_generate` 不兼容。请参阅issue[#21667](https://github.com/huggingface/transformers/issues/21667)。 + + +### PyTorch/XLA 完全分片数据并行 + +对于所有TPU用户,有个好消息!PyTorch/XLA现在支持FSDP。所有最新的完全分片数据并行(FSDP)训练都受支持。有关更多信息,请参阅[在云端TPU上使用FSDP扩展PyTorch模型](https://pytorch.org/blog/scaling-pytorch-models-on-cloud-tpus-with-fsdp/)和[PyTorch/XLA FSDP的实现](https://github.com/pytorch/xla/tree/master/torch_xla/distributed/fsdp)。使用它只需通过配置启用。 + +**需要的 PyTorch/XLA 版本以支持 FSDP**:>=2.0 + +**用法**: + +传递 `--fsdp "full shard"`,同时对 `--fsdp_config ` 进行以下更改: +- `xla` 应设置为 `True` 以启用 PyTorch/XLA FSDP。 +- `xla_fsdp_settings` 的值是一个字典,存储 XLA FSDP 封装参数。完整的选项列表,请参见[此处](https://github.com/pytorch/xla/blob/master/torch_xla/distributed/fsdp/xla_fully_sharded_data_parallel.py)。 +- `xla_fsdp_grad_ckpt`。当 `True` 时,在每个嵌套的 XLA FSDP 封装层上使用梯度checkpoint。该设置只能在将 xla 标志设置为 true,并通过 `fsdp_min_num_params` 或 `fsdp_transformer_layer_cls_to_wrap` 指定自动包装策略时使用。 +- 您可以使用基于transformer的自动包装策略或基于大小的自动包装策略。 + - 对于基于transformer的自动包装策略,建议在配置文件中指定 `fsdp_transformer_layer_cls_to_wrap`。如果未指定,默认值为 `model._no_split_modules`(如果可用)。这指定了要包装的transformer层类名列表(区分大小写),例如 [`BertLayer`]、[`GPTJBlock`]、[`T5Block`] 等。这很重要,因为共享权重的子模块(例如,embedding层)不应最终出现在不同的FSDP包装单元中。使用此策略,每个包装的块将包含多头注意力和后面的几个MLP层。剩余的层,包括共享的embedding层,都将被方便地包装在同一个最外层的FSDP单元中。因此,对于基于transformer的模型,请使用这个方法。 + - 对于基于大小的自动包装策略,请在配置文件中添加 `fsdp_min_num_params`。它指定了自动包装的 FSDP 的最小参数数量。 + + +### 在 Mac 上使用 Trainer 进行加速的 PyTorch 训练 + +随着 PyTorch v1.12 版本的发布,开发人员和研究人员可以利用 Apple Silicon GPU 进行显著更快的模型训练。这使得可以在 Mac 上本地执行原型设计和微调等机器学习工作流程。Apple 的 Metal Performance Shaders(MPS)作为 PyTorch 的后端实现了这一点,并且可以通过新的 `"mps"` 设备来使用。 +这将在 MPS 
图形框架上映射计算图和神经图元,并使用 MPS 提供的优化内核。更多信息,请参阅官方文档 [Introducing Accelerated PyTorch Training on Mac](https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/) 和 [MPS BACKEND](https://pytorch.org/docs/stable/notes/mps.html)。 + + + + +我们强烈建议在你的 MacOS 机器上安装 PyTorch >= 1.13(在撰写本文时为最新版本)。对于基于 transformer 的模型, 它提供与模型正确性和性能改进相关的重大修复。有关更多详细信息,请参阅[pytorch/pytorch#82707](https://github.com/pytorch/pytorch/issues/82707)。 + + + +**使用 Apple Silicon 芯片进行训练和推理的好处** + +1. 使用户能够在本地训练更大的网络或批量数据。 +2. 由于统一内存架构,减少数据检索延迟,并为 GPU 提供对完整内存存储的直接访问。从而提高端到端性能。 +3. 降低与基于云的开发或需要额外本地 GPU 的成本。 + +**先决条件**:要安装带有 mps 支持的 torch,请按照这篇精彩的 Medium 文章操作 [GPU-Acceleration Comes to PyTorch on M1 Macs](https://medium.com/towards-data-science/gpu-acceleration-comes-to-pytorch-on-m1-macs-195c399efcc1)。 + +**用法**: +如果可用,`mps` 设备将默认使用,类似于使用 `cuda` 设备的方式。因此,用户无需采取任何操作。例如,您可以使用以下命令在 Apple Silicon GPU 上运行官方的 Glue 文本分类任务(从根文件夹运行): + +```bash +export TASK_NAME=mrpc + +python examples/pytorch/text-classification/run_glue.py \ + --model_name_or_path google-bert/bert-base-cased \ + --task_name $TASK_NAME \ + --do_train \ + --do_eval \ + --max_seq_length 128 \ + --per_device_train_batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3 \ + --output_dir /tmp/$TASK_NAME/ \ + --overwrite_output_dir +``` + +**需要注意的一些注意事项** + +1. 一些 PyTorch 操作尚未在 mps 中实现,将引发错误。解决此问题的一种方法是设置环境变量 `PYTORCH_ENABLE_MPS_FALLBACK=1`,它将把这些操作回退到 CPU 进行。然而,它仍然会抛出 UserWarning 信息。 +2. 分布式设置 `gloo` 和 `nccl` 在 `mps` 设备上不起作用。这意味着当前只能使用 `mps` 设备类型的单个 GPU。 + +最后,请记住,🤗 `Trainer` 仅集成了 MPS 后端,因此如果你在使用 MPS 后端时遇到任何问题或有疑问,请在 [PyTorch GitHub](https://github.com/pytorch/pytorch/issues) 上提交问题。 + + +## 通过 Accelerate Launcher 使用 Trainer + +Accelerate 现在支持 Trainer。用户可以期待以下内容: +- 他们可以继续使用 Trainer 的迭代,如 FSDP、DeepSpeed 等,而无需做任何更改。 +- 现在可以在 Trainer 中使用 Accelerate Launcher(建议使用)。 + +通过 Accelerate Launcher 使用 Trainer 的步骤: +1. 确保已安装 🤗 Accelerate,无论如何,如果没有它,你无法使用 `Trainer`。如果没有,请执行 `pip install accelerate`。你可能还需要更新 Accelerate 的版本:`pip install accelerate --upgrade`。 +2. 运行 `accelerate config` 并填写问题。以下是一些加速配置的示例: + + a. DDP 多节点多 GPU 配置: + + ```yaml + compute_environment: LOCAL_MACHINE + distributed_type: MULTI_GPU + downcast_bf16: 'no' + gpu_ids: all + machine_rank: 0 #change rank as per the node + main_process_ip: 192.168.20.1 + main_process_port: 9898 + main_training_function: main + mixed_precision: fp16 + num_machines: 2 + num_processes: 8 + rdzv_backend: static + same_network: true + tpu_env: [] + tpu_use_cluster: false + tpu_use_sudo: false + use_cpu: false + ``` + + b. FSDP 配置: + + ```yaml + compute_environment: LOCAL_MACHINE + distributed_type: FSDP + downcast_bf16: 'no' + fsdp_config: + fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP + fsdp_backward_prefetch_policy: BACKWARD_PRE + fsdp_forward_prefetch: true + fsdp_offload_params: false + fsdp_sharding_strategy: 1 + fsdp_state_dict_type: FULL_STATE_DICT + fsdp_sync_module_states: true + fsdp_transformer_layer_cls_to_wrap: BertLayer + fsdp_use_orig_params: true + machine_rank: 0 + main_training_function: main + mixed_precision: bf16 + num_machines: 1 + num_processes: 2 + rdzv_backend: static + same_network: true + tpu_env: [] + tpu_use_cluster: false + tpu_use_sudo: false + use_cpu: false + ``` + + c. 
指向文件的 DeepSpeed 配置: + + ```yaml + compute_environment: LOCAL_MACHINE + deepspeed_config: + deepspeed_config_file: /home/user/configs/ds_zero3_config.json + zero3_init_flag: true + distributed_type: DEEPSPEED + downcast_bf16: 'no' + machine_rank: 0 + main_training_function: main + num_machines: 1 + num_processes: 4 + rdzv_backend: static + same_network: true + tpu_env: [] + tpu_use_cluster: false + tpu_use_sudo: false + use_cpu: false + ``` + + d. 使用 accelerate 插件的 DeepSpeed 配置: + + ```yaml + compute_environment: LOCAL_MACHINE + deepspeed_config: + gradient_accumulation_steps: 1 + gradient_clipping: 0.7 + offload_optimizer_device: cpu + offload_param_device: cpu + zero3_init_flag: true + zero_stage: 2 + distributed_type: DEEPSPEED + downcast_bf16: 'no' + machine_rank: 0 + main_training_function: main + mixed_precision: bf16 + num_machines: 1 + num_processes: 4 + rdzv_backend: static + same_network: true + tpu_env: [] + tpu_use_cluster: false + tpu_use_sudo: false + use_cpu: false + ``` + +3. 使用accelerate配置文件参数或启动器参数以外的参数运行Trainer脚本。以下是一个使用上述FSDP配置从accelerate启动器运行`run_glue.py`的示例。 + +```bash +cd transformers + +accelerate launch \ +./examples/pytorch/text-classification/run_glue.py \ +--model_name_or_path google-bert/bert-base-cased \ +--task_name $TASK_NAME \ +--do_train \ +--do_eval \ +--max_seq_length 128 \ +--per_device_train_batch_size 16 \ +--learning_rate 5e-5 \ +--num_train_epochs 3 \ +--output_dir /tmp/$TASK_NAME/ \ +--overwrite_output_dir +``` + +4. 你也可以直接使用`accelerate launch`的cmd参数。上面的示例将映射到: + +```bash +cd transformers + +accelerate launch --num_processes=2 \ +--use_fsdp \ +--mixed_precision=bf16 \ +--fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP \ +--fsdp_transformer_layer_cls_to_wrap="BertLayer" \ +--fsdp_sharding_strategy=1 \ +--fsdp_state_dict_type=FULL_STATE_DICT \ +./examples/pytorch/text-classification/run_glue.py +--model_name_or_path google-bert/bert-base-cased \ +--task_name $TASK_NAME \ +--do_train \ +--do_eval \ +--max_seq_length 128 \ +--per_device_train_batch_size 16 \ +--learning_rate 5e-5 \ +--num_train_epochs 3 \ +--output_dir /tmp/$TASK_NAME/ \ +--overwrite_output_dir +``` + +有关更多信息,请参阅 🤗 Accelerate CLI 指南:[启动您的 🤗 Accelerate 脚本](https://huggingface.co/docs/accelerate/basic_tutorials/launch)。 + +已移动的部分: + +[ DeepSpeed | Installation | Deployment with multiple GPUs | Deployment with one GPU | Deployment in Notebooks | Configuration | Passing Configuration | Shared Configuration | ZeRO | ZeRO-2 Config | ZeRO-3 Config | NVMe Support | ZeRO-2 vs ZeRO-3 Performance | ZeRO-2 Example | ZeRO-3 Example | Optimizer | Scheduler | fp32 Precision | Automatic Mixed Precision | Batch Size | Gradient Accumulation | Gradient Clipping | Getting The Model Weights Out] + + +## 通过 NEFTune 提升微调性能 + +NEFTune 是一种提升聊天模型性能的技术,由 Jain 等人在论文“NEFTune: Noisy Embeddings Improve Instruction Finetuning” 中引入。该技术在训练过程中向embedding向量添加噪音。根据论文摘要: + +> 使用 Alpaca 对 LLaMA-2-7B 进行标准微调,可以在 AlpacaEval 上达到 29.79%,而使用带有噪音embedding的情况下,性能提高至 64.69%。NEFTune 还在modern instruction数据集上大大优于基线。Evol-Instruct 训练的模型表现提高了 10%,ShareGPT 提高了 8%,OpenPlatypus 提高了 8%。即使像 LLaMA-2-Chat 这样通过 RLHF 进一步细化的强大模型,通过 NEFTune 的额外训练也能受益。 + +
+ +
+ +要在 `Trainer` 中使用它,只需在创建 `TrainingArguments` 实例时传递 `neftune_noise_alpha`。请注意,为了避免任何意外行为,NEFTune在训练后被禁止,以此恢复原始的embedding层。 + +```python +from transformers import Trainer, TrainingArguments + +args = TrainingArguments(..., neftune_noise_alpha=0.1) +trainer = Trainer(..., args=args) + +... + +trainer.train() +``` diff --git a/docs/source/zh/model_sharing.md b/docs/source/zh/model_sharing.md new file mode 100644 index 00000000000000..e28a000c11535e --- /dev/null +++ b/docs/source/zh/model_sharing.md @@ -0,0 +1,238 @@ + + +# 分享模型 + +最后两个教程展示了如何使用PyTorch、Keras和 🤗 Accelerate进行分布式设置来微调模型。下一步是将您的模型与社区分享!在Hugging Face,我们相信公开分享知识和资源,能实现人工智能的普及化,让每个人都能受益。我们鼓励您将您的模型与社区分享,以帮助他人节省时间和精力。 + +在本教程中,您将学习两种在[Model Hub](https://huggingface.co/models)上共享训练好的或微调的模型的方法: + +- 通过编程将文件推送到Hub。 +- 使用Web界面将文件拖放到Hub。 + + + + + +要与社区共享模型,您需要在[huggingface.co](https://huggingface.co/join)上拥有一个帐户。您还可以加入现有的组织或创建一个新的组织。 + + + +## 仓库功能 + +Model Hub上的每个仓库都像是一个典型的GitHub仓库。我们的仓库提供版本控制、提交历史记录以及可视化差异的能力。 + +Model Hub的内置版本控制基于git和[git-lfs](https://git-lfs.github.com/)。换句话说,您可以将一个模型视为一个仓库,从而实现更好的访问控制和可扩展性。版本控制允许使用*修订*方法来固定特定版本的模型,可以使用提交哈希值、标签或分支来标记。 + +因此,您可以通过`revision`参数加载特定的模型版本: + +```py +>>> model = AutoModel.from_pretrained( +... "julien-c/EsperBERTo-small", revision="v2.0.1" # tag name, or branch name, or commit hash +... ) +``` + +文件也可以轻松地在仓库中编辑,您可以查看提交历史记录以及差异: +![vis_diff](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vis_diff.png) + +## 设置 + +在将模型共享到Hub之前,您需要拥有Hugging Face的凭证。如果您有访问终端的权限,请在安装🤗 Transformers的虚拟环境中运行以下命令。这将在您的Hugging Face缓存文件夹(默认为`~/.cache/`)中存储您的`access token`: + + +```bash +huggingface-cli login +``` + +如果您正在使用像Jupyter或Colaboratory这样的`notebook`,请确保您已安装了[`huggingface_hub`](https://huggingface.co/docs/hub/adding-a-library)库。该库允许您以编程方式与Hub进行交互。 + +```bash +pip install huggingface_hub +``` +然后使用`notebook_login`登录到Hub,并按照[这里](https://huggingface.co/settings/token)的链接生成一个token进行登录: + + +```py +>>> from huggingface_hub import notebook_login + +>>> notebook_login() +``` + +## 转换模型适用于所有框架 + +为确保您的模型可以被使用不同框架的人使用,我们建议您将PyTorch和TensorFlow `checkpoints`都转换并上传。如果您跳过此步骤,用户仍然可以从其他框架加载您的模型,但速度会变慢,因为🤗 Transformers需要实时转换`checkpoints`。 + +为另一个框架转换`checkpoints`很容易。确保您已安装PyTorch和TensorFlow(请参阅[此处](installation)的安装说明),然后在其他框架中找到适合您任务的特定模型。 + + + + +指定`from_tf=True`将checkpoint从TensorFlow转换为PyTorch。 + +```py +>>> pt_model = DistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_tf=True) +>>> pt_model.save_pretrained("path/to/awesome-name-you-picked") +``` + + + +指定`from_pt=True`将checkpoint从PyTorch转换为TensorFlow。 + +```py +>>> tf_model = TFDistilBertForSequenceClassification.from_pretrained("path/to/awesome-name-you-picked", from_pt=True) +``` + +然后,您可以使用新的checkpoint保存您的新TensorFlow模型: + +```py +>>> tf_model.save_pretrained("path/to/awesome-name-you-picked") +``` + + + +如果模型在Flax中可用,您还可以将PyTorch checkpoint转换为Flax: + +```py +>>> flax_model = FlaxDistilBertForSequenceClassification.from_pretrained( +... "path/to/awesome-name-you-picked", from_pt=True +... ) +``` + + + +## 在训练过程中推送模型 + + + + + +将模型分享到Hub就像添加一个额外的参数或回调函数一样简单。请记住,在[微调教程](training)中,`TrainingArguments`类是您指定超参数和附加训练选项的地方。其中一项训练选项包括直接将模型推送到Hub的能力。在您的`TrainingArguments`中设置`push_to_hub=True`: + + +```py +>>> training_args = TrainingArguments(output_dir="my-awesome-model", push_to_hub=True) +``` + +像往常一样将您的训练参数传递给[`Trainer`]: + +```py +>>> trainer = Trainer( +... model=model, +... args=training_args, +... train_dataset=small_train_dataset, +... eval_dataset=small_eval_dataset, +... 
compute_metrics=compute_metrics, +... ) +``` + +在您微调完模型后,在[`Trainer`]上调用[`~transformers.Trainer.push_to_hub`]将训练好的模型推送到Hub。🤗 Transformers甚至会自动将训练超参数、训练结果和框架版本添加到你的模型卡片中! + +```py +>>> trainer.push_to_hub() +``` + + + +使用[`PushToHubCallback`]将模型分享到Hub。在[`PushToHubCallback`]函数中,添加以下内容: + +- 一个用于存储模型的输出目录。 +- 一个tokenizer。 +- `hub_model_id`,即您的Hub用户名和模型名称。 + + +```py +>>> from transformers import PushToHubCallback + +>>> push_to_hub_callback = PushToHubCallback( +... output_dir="./your_model_save_path", tokenizer=tokenizer, hub_model_id="your-username/my-awesome-model" +... ) +``` + +将回调函数添加到 [`fit`](https://keras.io/api/models/model_training_apis/)中,然后🤗 Transformers 会将训练好的模型推送到 Hub: + +```py +>>> model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3, callbacks=push_to_hub_callback) +``` + + + +## 使用`push_to_hub`功能 + +您可以直接在您的模型上调用`push_to_hub`来将其上传到Hub。 + +在`push_to_hub`中指定你的模型名称: + +```py +>>> pt_model.push_to_hub("my-awesome-model") +``` + +这会在您的用户名下创建一个名为`my-awesome-model`的仓库。用户现在可以使用`from_pretrained`函数加载您的模型: + +```py +>>> from transformers import AutoModel + +>>> model = AutoModel.from_pretrained("your_username/my-awesome-model") +``` + +如果您属于一个组织,并希望将您的模型推送到组织名称下,只需将其添加到`repo_id`中: + +```py +>>> pt_model.push_to_hub("my-awesome-org/my-awesome-model") +``` + +`push_to_hub`函数还可以用于向模型仓库添加其他文件。例如,向模型仓库中添加一个`tokenizer`: + +```py +>>> tokenizer.push_to_hub("my-awesome-model") +``` + +或者,您可能希望将您的微调后的PyTorch模型的TensorFlow版本添加进去: + +```py +>>> tf_model.push_to_hub("my-awesome-model") +``` +现在,当您导航到您的Hugging Face个人资料时,您应该看到您新创建的模型仓库。点击**文件**选项卡将显示您已上传到仓库的所有文件。 + +有关如何创建和上传文件到仓库的更多详细信息,请参考Hub文档[这里](https://huggingface.co/docs/hub/how-to-upstream)。 + + +## 使用Web界面上传 + +喜欢无代码方法的用户可以通过Hugging Face的Web界面上传模型。访问[huggingface.co/new](https://huggingface.co/new)创建一个新的仓库: + +![new_model_repo](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/new_model_repo.png) + +从这里开始,添加一些关于您的模型的信息: + +- 选择仓库的**所有者**。这可以是您本人或者您所属的任何组织。 +- 为您的项目选择一个名称,该名称也将成为仓库的名称。 +- 选择您的模型是公开还是私有。 +- 指定您的模型的许可证使用情况。 + +现在点击**文件**选项卡,然后点击**添加文件**按钮将一个新文件上传到你的仓库。接着拖放一个文件进行上传,并添加提交信息。 + +![upload_file](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/upload_file.png) + +## 添加模型卡片 + +为了确保用户了解您的模型的能力、限制、潜在偏差和伦理考虑,请在仓库中添加一个模型卡片。模型卡片在`README.md`文件中定义。你可以通过以下方式添加模型卡片: + +* 手动创建并上传一个`README.md`文件。 +* 在你的模型仓库中点击**编辑模型卡片**按钮。 + +可以参考DistilBert的[模型卡片](https://huggingface.co/distilbert/distilbert-base-uncased)来了解模型卡片应该包含的信息类型。有关您可以在`README.md`文件中控制的更多选项的细节,例如模型的碳足迹或小部件示例,请参考文档[这里](https://huggingface.co/docs/hub/models-cards)。 \ No newline at end of file diff --git a/docs/source/zh/multilingual.md b/docs/source/zh/multilingual.md new file mode 100644 index 00000000000000..9c27bd5f335ba0 --- /dev/null +++ b/docs/source/zh/multilingual.md @@ -0,0 +1,178 @@ + + +# 用于推理的多语言模型 + +[[open-in-colab]] + +🤗 Transformers 中有多种多语言模型,它们的推理用法与单语言模型不同。但是,并非*所有*的多语言模型用法都不同。一些模型,例如 [google-bert/bert-base-multilingual-uncased](https://huggingface.co/google-bert/bert-base-multilingual-uncased) 就可以像单语言模型一样使用。本指南将向您展示如何使用不同用途的多语言模型进行推理。 + +## XLM + +XLM 有十个不同的检查点,其中只有一个是单语言的。剩下的九个检查点可以归为两类:使用语言嵌入的检查点和不使用语言嵌入的检查点。 + +### 带有语言嵌入的 XLM + +以下 XLM 模型使用语言嵌入来指定推理中使用的语言: + +- `FacebookAI/xlm-mlm-ende-1024` (掩码语言建模,英语-德语) +- `FacebookAI/xlm-mlm-enfr-1024` (掩码语言建模,英语-法语) +- `FacebookAI/xlm-mlm-enro-1024` (掩码语言建模,英语-罗马尼亚语) +- `FacebookAI/xlm-mlm-xnli15-1024` (掩码语言建模,XNLI 数据集语言) +- `FacebookAI/xlm-mlm-tlm-xnli15-1024` (掩码语言建模+翻译,XNLI 数据集语言) +- `FacebookAI/xlm-clm-enfr-1024` (因果语言建模,英语-法语) +- 
`FacebookAI/xlm-clm-ende-1024` (因果语言建模,英语-德语) + +语言嵌入被表示一个张量,其形状与传递给模型的 `input_ids` 相同。这些张量中的值取决于所使用的语言,并由分词器的 `lang2id` 和 `id2lang` 属性识别。 + +在此示例中,加载 `FacebookAI/xlm-clm-enfr-1024` 检查点(因果语言建模,英语-法语): + +```py +>>> import torch +>>> from transformers import XLMTokenizer, XLMWithLMHeadModel + +>>> tokenizer = XLMTokenizer.from_pretrained("FacebookAI/xlm-clm-enfr-1024") +>>> model = XLMWithLMHeadModel.from_pretrained("FacebookAI/xlm-clm-enfr-1024") +``` + +分词器的 `lang2id` 属性显示了该模型的语言及其对应的id: + +```py +>>> print(tokenizer.lang2id) +{'en': 0, 'fr': 1} +``` + +接下来,创建一个示例输入: + +```py +>>> input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")]) # batch size 为 1 +``` + +将语言 id 设置为 `"en"` 并用其定义语言嵌入。语言嵌入是一个用 `0` 填充的张量,这个张量应该与 `input_ids` 大小相同。 + +```py +>>> language_id = tokenizer.lang2id["en"] # 0 +>>> langs = torch.tensor([language_id] * input_ids.shape[1]) # torch.tensor([0, 0, 0, ..., 0]) + +>>> # 我们将其 reshape 为 (batch_size, sequence_length) 大小 +>>> langs = langs.view(1, -1) # 现在的形状是 [1, sequence_length] (我们的 batch size 为 1) +``` + +现在,你可以将 `input_ids` 和语言嵌入传递给模型: + +```py +>>> outputs = model(input_ids, langs=langs) +``` + +[run_generation.py](https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-generation/run_generation.py) 脚本可以使用 `xlm-clm` 检查点生成带有语言嵌入的文本。 + +### 不带语言嵌入的 XLM + +以下 XLM 模型在推理时不需要语言嵌入: + +- `FacebookAI/xlm-mlm-17-1280` (掩码语言建模,支持 17 种语言) +- `FacebookAI/xlm-mlm-100-1280` (掩码语言建模,支持 100 种语言) + +与之前的 XLM 检查点不同,这些模型用于通用句子表示。 + +## BERT + +以下 BERT 模型可用于多语言任务: + +- `google-bert/bert-base-multilingual-uncased` (掩码语言建模 + 下一句预测,支持 102 种语言) +- `google-bert/bert-base-multilingual-cased` (掩码语言建模 + 下一句预测,支持 104 种语言) + +这些模型在推理时不需要语言嵌入。它们应该能够从上下文中识别语言并进行相应的推理。 + +## XLM-RoBERTa + +以下 XLM-RoBERTa 模型可用于多语言任务: + +- `FacebookAI/xlm-roberta-base` (掩码语言建模,支持 100 种语言) +- `FacebookAI/xlm-roberta-large` (掩码语言建模,支持 100 种语言) + +XLM-RoBERTa 使用 100 种语言的 2.5TB 新创建和清理的 CommonCrawl 数据进行了训练。与之前发布的 mBERT 或 XLM 等多语言模型相比,它在分类、序列标记和问答等下游任务上提供了更强大的优势。 + +## M2M100 + +以下 M2M100 模型可用于多语言翻译: + +- `facebook/m2m100_418M` (翻译) +- `facebook/m2m100_1.2B` (翻译) + +在此示例中,加载 `facebook/m2m100_418M` 检查点以将中文翻译为英文。你可以在分词器中设置源语言: + +```py +>>> from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer + +>>> en_text = "Do not meddle in the affairs of wizards, for they are subtle and quick to anger." +>>> chinese_text = "不要插手巫師的事務, 因為他們是微妙的, 很快就會發怒." + +>>> tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M", src_lang="zh") +>>> model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M") +``` + +对文本进行分词: + +```py +>>> encoded_zh = tokenizer(chinese_text, return_tensors="pt") +``` + +M2M100 强制将目标语言 id 作为第一个生成的标记,以进行到目标语言的翻译。在 `generate` 方法中将 `forced_bos_token_id` 设置为 `en` 以翻译成英语: + +```py +>>> generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en")) +>>> tokenizer.batch_decode(generated_tokens, skip_special_tokens=True) +'Do not interfere with the matters of the witches, because they are delicate and will soon be angry.' 
+``` + +## MBart + +以下 MBart 模型可用于多语言翻译: + +- `facebook/mbart-large-50-one-to-many-mmt` (一对多多语言机器翻译,支持 50 种语言) +- `facebook/mbart-large-50-many-to-many-mmt` (多对多多语言机器翻译,支持 50 种语言) +- `facebook/mbart-large-50-many-to-one-mmt` (多对一多语言机器翻译,支持 50 种语言) +- `facebook/mbart-large-50` (多语言翻译,支持 50 种语言) +- `facebook/mbart-large-cc25` + +在此示例中,加载 `facebook/mbart-large-50-many-to-many-mmt` 检查点以将芬兰语翻译为英语。 你可以在分词器中设置源语言: + +```py +>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM + +>>> en_text = "Do not meddle in the affairs of wizards, for they are subtle and quick to anger." +>>> fi_text = "Älä sekaannu velhojen asioihin, sillä ne ovat hienovaraisia ja nopeasti vihaisia." + +>>> tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt", src_lang="fi_FI") +>>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50-many-to-many-mmt") +``` + +对文本进行分词: + +```py +>>> encoded_en = tokenizer(en_text, return_tensors="pt") +``` + +MBart 强制将目标语言 id 作为第一个生成的标记,以进行到目标语言的翻译。在 `generate` 方法中将 `forced_bos_token_id` 设置为 `en` 以翻译成英语: + +```py +>>> generated_tokens = model.generate(**encoded_en, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"]) +>>> tokenizer.batch_decode(generated_tokens, skip_special_tokens=True) +"Don't interfere with the wizard's affairs, because they are subtle, will soon get angry." +``` + +如果你使用的是 `facebook/mbart-large-50-many-to-one-mmt` 检查点,则无需强制目标语言 id 作为第一个生成的令牌,否则用法是相同的。 diff --git a/docs/source/zh/peft.md b/docs/source/zh/peft.md new file mode 100644 index 00000000000000..4241a15c00eabf --- /dev/null +++ b/docs/source/zh/peft.md @@ -0,0 +1,215 @@ + + +# 使用 🤗 PEFT 加载adapters + +[[open-in-colab]] + +[参数高效微调(PEFT)方法](https://huggingface.co/blog/peft)在微调过程中冻结预训练模型的参数,并在其顶部添加少量可训练参数(adapters)。adapters被训练以学习特定任务的信息。这种方法已被证明非常节省内存,同时具有较低的计算使用量,同时产生与完全微调模型相当的结果。 + +使用PEFT训练的adapters通常比完整模型小一个数量级,使其方便共享、存储和加载。 + +
+
+与完整尺寸的模型权重(约为700MB)相比,存储在Hub上的OPTForCausalLM模型的adapter权重仅为~6MB。
+
+ +如果您对学习更多关于🤗 PEFT库感兴趣,请查看[文档](https://huggingface.co/docs/peft/index)。 + + +## 设置 + +首先安装 🤗 PEFT: + +```bash +pip install peft +``` + +如果你想尝试全新的特性,你可能会有兴趣从源代码安装这个库: + +```bash +pip install git+https://github.com/huggingface/peft.git +``` +## 支持的 PEFT 模型 + +Transformers原生支持一些PEFT方法,这意味着你可以加载本地存储或在Hub上的adapter权重,并使用几行代码轻松运行或训练它们。以下是受支持的方法: + +- [Low Rank Adapters](https://huggingface.co/docs/peft/conceptual_guides/lora) +- [IA3](https://huggingface.co/docs/peft/conceptual_guides/ia3) +- [AdaLoRA](https://arxiv.org/abs/2303.10512) + +如果你想使用其他PEFT方法,例如提示学习或提示微调,或者关于通用的 🤗 PEFT库,请参阅[文档](https://huggingface.co/docs/peft/index)。 + +## 加载 PEFT adapter + +要从huggingface的Transformers库中加载并使用PEFTadapter模型,请确保Hub仓库或本地目录包含一个`adapter_config.json`文件和adapter权重,如上例所示。然后,您可以使用`AutoModelFor`类加载PEFT adapter模型。例如,要为因果语言建模加载一个PEFT adapter模型: + +1. 指定PEFT模型id +2. 将其传递给[`AutoModelForCausalLM`]类 + +```py +from transformers import AutoModelForCausalLM, AutoTokenizer + +peft_model_id = "ybelkada/opt-350m-lora" +model = AutoModelForCausalLM.from_pretrained(peft_model_id) +``` + + + +你可以使用`AutoModelFor`类或基础模型类(如`OPTForCausalLM`或`LlamaForCausalLM`)来加载一个PEFT adapter。 + + + + +您也可以通过`load_adapter`方法来加载 PEFT adapter。 + +```py +from transformers import AutoModelForCausalLM, AutoTokenizer + +model_id = "facebook/opt-350m" +peft_model_id = "ybelkada/opt-350m-lora" + +model = AutoModelForCausalLM.from_pretrained(model_id) +model.load_adapter(peft_model_id) +``` + +## 基于8bit或4bit进行加载 + +`bitsandbytes`集成支持8bit和4bit精度数据类型,这对于加载大模型非常有用,因为它可以节省内存(请参阅`bitsandbytes`[指南](./quantization#bitsandbytes-integration)以了解更多信息)。要有效地将模型分配到您的硬件,请在[`~PreTrainedModel.from_pretrained`]中添加`load_in_8bit`或`load_in_4bit`参数,并将`device_map="auto"`设置为: + +```py +from transformers import AutoModelForCausalLM, AutoTokenizer + +peft_model_id = "ybelkada/opt-350m-lora" +model = AutoModelForCausalLM.from_pretrained(peft_model_id, device_map="auto", load_in_8bit=True) +``` + +## 添加新的adapter + +你可以使用[`~peft.PeftModel.add_adapter`]方法为一个已有adapter的模型添加一个新的adapter,只要新adapter的类型与当前adapter相同即可。例如,如果你有一个附加到模型上的LoRA adapter: + +```py +from transformers import AutoModelForCausalLM, OPTForCausalLM, AutoTokenizer +from peft import PeftConfig + +model_id = "facebook/opt-350m" +model = AutoModelForCausalLM.from_pretrained(model_id) + +lora_config = LoraConfig( + target_modules=["q_proj", "k_proj"], + init_lora_weights=False +) + +model.add_adapter(lora_config, adapter_name="adapter_1") +``` + + +添加一个新的adapter: + +```py +# attach new adapter with same config +model.add_adapter(lora_config, adapter_name="adapter_2") +``` +现在您可以使用[`~peft.PeftModel.set_adapter`]来设置要使用的adapter。 + +```py +# use adapter_1 +model.set_adapter("adapter_1") +output = model.generate(**inputs) +print(tokenizer.decode(output_disabled[0], skip_special_tokens=True)) + +# use adapter_2 +model.set_adapter("adapter_2") +output_enabled = model.generate(**inputs) +print(tokenizer.decode(output_enabled[0], skip_special_tokens=True)) +``` + +## 启用和禁用adapters +一旦您将adapter添加到模型中,您可以启用或禁用adapter模块。要启用adapter模块: + + +```py +from transformers import AutoModelForCausalLM, OPTForCausalLM, AutoTokenizer +from peft import PeftConfig + +model_id = "facebook/opt-350m" +adapter_model_id = "ybelkada/opt-350m-lora" +tokenizer = AutoTokenizer.from_pretrained(model_id) +text = "Hello" +inputs = tokenizer(text, return_tensors="pt") + +model = AutoModelForCausalLM.from_pretrained(model_id) +peft_config = PeftConfig.from_pretrained(adapter_model_id) + +# to initiate with random weights +peft_config.init_lora_weights = False + 
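+# 注:init_lora_weights=False 会随机初始化 LoRA 权重,这里仅用于演示启用/禁用 adapter 时输出的差异;
+# 正式训练时通常保留默认初始化。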
+model.add_adapter(peft_config) +model.enable_adapters() +output = model.generate(**inputs) +``` +要禁用adapter模块: + +```py +model.disable_adapters() +output = model.generate(**inputs) +``` +## 训练一个 PEFT adapter + +PEFT适配器受[`Trainer`]类支持,因此您可以为您的特定用例训练适配器。它只需要添加几行代码即可。例如,要训练一个LoRA adapter: + + + + +如果你不熟悉如何使用[`Trainer`]微调模型,请查看[微调预训练模型](training)教程。 + + + +1. 使用任务类型和超参数定义adapter配置(参见[`~peft.LoraConfig`]以了解超参数的详细信息)。 + +```py +from peft import LoraConfig + +peft_config = LoraConfig( + lora_alpha=16, + lora_dropout=0.1, + r=64, + bias="none", + task_type="CAUSAL_LM", +) +``` + +2. 将adapter添加到模型中。 + +```py +model.add_adapter(peft_config) +``` + +3. 现在可以将模型传递给[`Trainer`]了! + +```py +trainer = Trainer(model=model, ...) +trainer.train() +``` + +要保存训练好的adapter并重新加载它: + +```py +model.save_pretrained(save_dir) +model = AutoModelForCausalLM.from_pretrained(save_dir) +``` + + diff --git a/docs/source/zh/perf_hardware.md b/docs/source/zh/perf_hardware.md new file mode 100644 index 00000000000000..95a09eaab4e103 --- /dev/null +++ b/docs/source/zh/perf_hardware.md @@ -0,0 +1,156 @@ + + + +# 训练用的定制硬件 + +您用来运行模型训练和推断的硬件可能会对性能产生重大影响。要深入了解 GPU,务必查看 Tim Dettmer 出色的[博文](https://timdettmers.com/2020/09/07/which-gpu-for-deep-learning/)。 + +让我们来看一些关于 GPU 配置的实用建议。 + +## GPU +当你训练更大的模型时,基本上有三种选择: + +- 更大的 GPU +- 更多的 GPU +- 更多的 CPU 和 NVMe(通过[DeepSpeed-Infinity](main_classes/deepspeed#nvme-support)实现) + +让我们从只有一块GPU的情况开始。 + +### 供电和散热 + +如果您购买了昂贵的高端GPU,请确保为其提供正确的供电和足够的散热。 + +**供电**: + +一些高端消费者级GPU卡具有2个,有时甚至3个PCI-E-8针电源插口。请确保将与插口数量相同的独立12V PCI-E-8针线缆插入卡中。不要使用同一根线缆两端的2个分叉(也称为pigtail cable)。也就是说,如果您的GPU上有2个插口,您需要使用2条PCI-E-8针线缆连接电源和卡,而不是使用一条末端有2个PCI-E-8针连接器的线缆!否则,您无法充分发挥卡的性能。 + +每个PCI-E-8针电源线缆需要插入电源侧的12V轨上,并且可以提供最多150W的功率。 + +其他一些卡可能使用PCI-E-12针连接器,这些连接器可以提供最多500-600W的功率。 + +低端卡可能使用6针连接器,这些连接器可提供最多75W的功率。 + +此外,您需要选择具有稳定电压的高端电源。一些质量较低的电源可能无法为卡提供所需的稳定电压以发挥其最大性能。 + +当然,电源还需要有足够的未使用的瓦数来为卡供电。 + +**散热**: + +当GPU过热时,它将开始降频,不会提供完整的性能。如果温度过高,可能会缩短GPU的使用寿命。 + +当GPU负载很重时,很难确定最佳温度是多少,但任何低于+80度的温度都是好的,越低越好,也许在70-75度之间是一个非常好的范围。降频可能从大约84-90度开始。但是除了降频外,持续的高温可能会缩短GPU的使用寿命。 + +接下来让我们看一下拥有多个GPU时最重要的方面之一:连接。 + +### 多GPU连接 + +如果您使用多个GPU,则卡之间的互连方式可能会对总训练时间产生巨大影响。如果GPU位于同一物理节点上,您可以运行以下代码: + +```bash +nvidia-smi topo -m +``` + +它将告诉您GPU如何互连。在具有双GPU并通过NVLink连接的机器上,您最有可能看到类似以下内容: + +``` + GPU0 GPU1 CPU Affinity NUMA Affinity +GPU0 X NV2 0-23 N/A +GPU1 NV2 X 0-23 N/A +``` + +在不同的机器上,如果没有NVLink,我们可能会看到: +``` + GPU0 GPU1 CPU Affinity NUMA Affinity +GPU0 X PHB 0-11 N/A +GPU1 PHB X 0-11 N/A +``` + +这个报告包括了这个输出: + +``` + X = Self + SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) + NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node + PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) + PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) + PIX = Connection traversing at most a single PCIe bridge + NV# = Connection traversing a bonded set of # NVLinks +``` + +因此,第一个报告`NV2`告诉我们GPU通过2个NVLink互连,而第二个报告`PHB`展示了典型的消费者级PCIe+Bridge设置。 + +检查你的设置中具有哪种连接类型。其中一些会使卡之间的通信更快(例如NVLink),而其他则较慢(例如PHB)。 + +根据使用的扩展解决方案的类型,连接速度可能会产生重大或较小的影响。如果GPU很少需要同步,就像在DDP中一样,那么较慢的连接的影响将不那么显著。如果GPU经常需要相互发送消息,就像在ZeRO-DP中一样,那么更快的连接对于实现更快的训练变得非常重要。 + + +#### NVlink + +[NVLink](https://en.wikipedia.org/wiki/NVLink)是由Nvidia开发的一种基于线缆的串行多通道近程通信链接。 + +每个新一代提供更快的带宽,例如在[Nvidia Ampere GA102 
GPU架构](https://www.nvidia.com/content/dam/en-zz/Solutions/geforce/ampere/pdf/NVIDIA-ampere-GA102-GPU-Architecture-Whitepaper-V1.pdf)中有这样的引述: + +> Third-Generation NVLink® +> GA102 GPUs utilize NVIDIA’s third-generation NVLink interface, which includes four x4 links, +> with each link providing 14.0625 GB/sec bandwidth in each direction between two GPUs. Four +> links provide 56.25 GB/sec bandwidth in each direction, and 112.5 GB/sec total bandwidth +> between two GPUs. Two RTX 3090 GPUs can be connected together for SLI using NVLink. +> (Note that 3-Way and 4-Way SLI configurations are not supported.) + +所以,在`nvidia-smi topo -m`输出的`NVX`报告中获取到的更高的`X`值意味着更好的性能。生成的结果将取决于您的GPU架构。 + +让我们比较在小样本wikitext上训练gpt2语言模型的执行结果。 + +结果是: + + +| NVlink | Time | +| ----- | ---: | +| Y | 101s | +| N | 131s | + + +可以看到,NVLink使训练速度提高了约23%。在第二个基准测试中,我们使用`NCCL_P2P_DISABLE=1`告诉GPU不要使用NVLink。 + +这里是完整的基准测试代码和输出: + +```bash +# DDP w/ NVLink + +rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun \ +--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \ +--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \ +--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 + +{'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69} + +# DDP w/o NVLink + +rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun \ +--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-community/gpt2 \ +--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train +--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 + +{'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69} +``` + +硬件: 2x TITAN RTX 24GB each + NVlink with 2 NVLinks (`NV2` in `nvidia-smi topo -m`) +软件: `pytorch-1.8-to-be` + `cuda-11.0` / `transformers==4.3.0.dev0` diff --git a/docs/source/zh/perf_torch_compile.md b/docs/source/zh/perf_torch_compile.md new file mode 100644 index 00000000000000..b28dc9567c9174 --- /dev/null +++ b/docs/source/zh/perf_torch_compile.md @@ -0,0 +1,362 @@ + + +# 使用 torch.compile() 优化推理 + +本指南旨在为使用[`torch.compile()`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html)在[🤗 Transformers中的计算机视觉模型](https://huggingface.co/models?pipeline_tag=image-classification&library=transformers&sort=trending)中引入的推理速度提升提供一个基准。 + + +## torch.compile 的优势 + +根据模型和GPU的不同,`torch.compile()`在推理过程中可以提高多达30%的速度。要使用`torch.compile()`,只需安装2.0及以上版本的`torch`即可。 + +编译模型需要时间,因此如果您只需要编译一次模型而不是每次推理都编译,那么它非常有用。 +要编译您选择的任何计算机视觉模型,请按照以下方式调用`torch.compile()`: + + +```diff +from transformers import AutoModelForImageClassification + +model = AutoModelForImageClassification.from_pretrained(MODEL_ID).to("cuda") ++ model = torch.compile(model) +``` + +`compile()` 提供了多种编译模式,它们在编译时间和推理开销上有所不同。`max-autotune` 比 `reduce-overhead` 需要更长的时间,但会得到更快的推理速度。默认模式在编译时最快,但在推理时间上与 `reduce-overhead` 相比效率较低。在本指南中,我们使用了默认模式。您可以在[这里](https://pytorch.org/get-started/pytorch-2.0/#user-experience)了解更多信息。 + +我们在 PyTorch 2.0.1 版本上使用不同的计算机视觉模型、任务、硬件类型和数据批量大小对 `torch.compile` 进行了基准测试。 + +## 基准测试代码 + +以下是每个任务的基准测试代码。我们在推理之前”预热“GPU,并取300次推理的平均值,每次使用相同的图像。 + +### 使用 ViT 进行图像分类 + +```python +import torch +from PIL import Image +import requests +import numpy as np +from transformers import AutoImageProcessor, AutoModelForImageClassification + +url = 'http://images.cocodataset.org/val2017/000000039769.jpg' +image = Image.open(requests.get(url, stream=True).raw) + +processor = 
AutoImageProcessor.from_pretrained("google/vit-base-patch16-224") +model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224").to("cuda") +model = torch.compile(model) + +processed_input = processor(image, return_tensors='pt').to(device="cuda") + +with torch.no_grad(): + _ = model(**processed_input) + +``` + +#### 使用 DETR 进行目标检测 + +```python +from transformers import AutoImageProcessor, AutoModelForObjectDetection + +processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50") +model = AutoModelForObjectDetection.from_pretrained("facebook/detr-resnet-50").to("cuda") +model = torch.compile(model) + +texts = ["a photo of a cat", "a photo of a dog"] +inputs = processor(text=texts, images=image, return_tensors="pt").to("cuda") + +with torch.no_grad(): + _ = model(**inputs) +``` + +#### 使用 Segformer 进行图像分割 + +```python +from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation + +processor = SegformerImageProcessor.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512") +model = SegformerForSemanticSegmentation.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512").to("cuda") +model = torch.compile(model) +seg_inputs = processor(images=image, return_tensors="pt").to("cuda") + +with torch.no_grad(): + _ = model(**seg_inputs) +``` + +以下是我们进行基准测试的模型列表。 + +**图像分类** +- [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) +- [microsoft/beit-base-patch16-224-pt22k-ft22k](https://huggingface.co/microsoft/beit-base-patch16-224-pt22k-ft22k) +- [facebook/convnext-large-224](https://huggingface.co/facebook/convnext-large-224) +- [microsoft/resnet-50](https://huggingface.co/) + +**图像分割** +- [nvidia/segformer-b0-finetuned-ade-512-512](https://huggingface.co/nvidia/segformer-b0-finetuned-ade-512-512) +- [facebook/mask2former-swin-tiny-coco-panoptic](https://huggingface.co/facebook/mask2former-swin-tiny-coco-panoptic) +- [facebook/maskformer-swin-base-ade](https://huggingface.co/facebook/maskformer-swin-base-ade) +- [google/deeplabv3_mobilenet_v2_1.0_513](https://huggingface.co/google/deeplabv3_mobilenet_v2_1.0_513) + +**目标检测** +- [google/owlvit-base-patch32](https://huggingface.co/google/owlvit-base-patch32) +- [facebook/detr-resnet-101](https://huggingface.co/facebook/detr-resnet-101) +- [microsoft/conditional-detr-resnet-50](https://huggingface.co/microsoft/conditional-detr-resnet-50) + + 下面是使用和不使用`torch.compile()`的推理持续时间可视化,以及每个模型在不同硬件和数据批量大小下的改进百分比。 + + +
+ + +![Duration Comparison on V100 with Batch Size of 1](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/torch_compile/v100_1_duration.png) + +![Percentage Improvement on T4 with Batch Size of 4](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/torch_compile/T4_4_percentage.png) + +下面可以找到每个模型使用和不使用`compile()`的推理时间(毫秒)。请注意,OwlViT在大批量大小下会导致内存溢出。 + +### A100 (batch size: 1) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 9.325 | 7.584 | +| Image Segmentation/Segformer | 11.759 | 10.500 | +| Object Detection/OwlViT | 24.978 | 18.420 | +| Image Classification/BeiT | 11.282 | 8.448 | +| Object Detection/DETR | 34.619 | 19.040 | +| Image Classification/ConvNeXT | 10.410 | 10.208 | +| Image Classification/ResNet | 6.531 | 4.124 | +| Image Segmentation/Mask2former | 60.188 | 49.117 | +| Image Segmentation/Maskformer | 75.764 | 59.487 | +| Image Segmentation/MobileNet | 8.583 | 3.974 | +| Object Detection/Resnet-101 | 36.276 | 18.197 | +| Object Detection/Conditional-DETR | 31.219 | 17.993 | + + +### A100 (batch size: 4) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 14.832 | 14.499 | +| Image Segmentation/Segformer | 18.838 | 16.476 | +| Image Classification/BeiT | 13.205 | 13.048 | +| Object Detection/DETR | 48.657 | 32.418| +| Image Classification/ConvNeXT | 22.940 | 21.631 | +| Image Classification/ResNet | 6.657 | 4.268 | +| Image Segmentation/Mask2former | 74.277 | 61.781 | +| Image Segmentation/Maskformer | 180.700 | 159.116 | +| Image Segmentation/MobileNet | 14.174 | 8.515 | +| Object Detection/Resnet-101 | 68.101 | 44.998 | +| Object Detection/Conditional-DETR | 56.470 | 35.552 | + +### A100 (batch size: 16) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 40.944 | 40.010 | +| Image Segmentation/Segformer | 37.005 | 31.144 | +| Image Classification/BeiT | 41.854 | 41.048 | +| Object Detection/DETR | 164.382 | 161.902 | +| Image Classification/ConvNeXT | 82.258 | 75.561 | +| Image Classification/ResNet | 7.018 | 5.024 | +| Image Segmentation/Mask2former | 178.945 | 154.814 | +| Image Segmentation/Maskformer | 638.570 | 579.826 | +| Image Segmentation/MobileNet | 51.693 | 30.310 | +| Object Detection/Resnet-101 | 232.887 | 155.021 | +| Object Detection/Conditional-DETR | 180.491 | 124.032 | + +### V100 (batch size: 1) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 10.495 | 6.00 | +| Image Segmentation/Segformer | 13.321 | 5.862 | +| Object Detection/OwlViT | 25.769 | 22.395 | +| Image Classification/BeiT | 11.347 | 7.234 | +| Object Detection/DETR | 33.951 | 19.388 | +| Image Classification/ConvNeXT | 11.623 | 10.412 | +| Image Classification/ResNet | 6.484 | 3.820 | +| Image Segmentation/Mask2former | 64.640 | 49.873 | +| Image Segmentation/Maskformer | 95.532 | 72.207 | +| Image Segmentation/MobileNet | 9.217 | 4.753 | +| Object Detection/Resnet-101 | 52.818 | 28.367 | +| Object Detection/Conditional-DETR | 39.512 | 20.816 | + +### V100 (batch size: 4) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 15.181 | 14.501 | +| Image Segmentation/Segformer | 16.787 | 16.188 | +| Image Classification/BeiT | 15.171 | 14.753 | +| Object Detection/DETR | 88.529 | 64.195 | +| Image Classification/ConvNeXT | 29.574 | 27.085 | +| Image Classification/ResNet | 6.109 | 4.731 | +| Image Segmentation/Mask2former | 90.402 | 76.926 | +| Image Segmentation/Maskformer | 234.261 | 205.456 | +| Image Segmentation/MobileNet | 24.623 | 14.816 | +| Object Detection/Resnet-101 | 134.672 | 101.304 | +| Object Detection/Conditional-DETR | 97.464 | 69.739 | + +### V100 (batch size: 16) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 52.209 | 51.633 | +| Image Segmentation/Segformer | 61.013 | 55.499 | +| Image Classification/BeiT | 53.938 | 53.581 | +| Object Detection/DETR | OOM | OOM | +| Image Classification/ConvNeXT | 109.682 | 100.771 | +| Image Classification/ResNet | 14.857 | 12.089 | +| Image Segmentation/Mask2former | 249.605 | 222.801 | +| Image Segmentation/Maskformer | 831.142 | 743.645 | +| Image Segmentation/MobileNet | 93.129 | 55.365 | +| Object Detection/Resnet-101 | 482.425 | 361.843 | +| Object Detection/Conditional-DETR | 344.661 | 255.298 | + +### T4 (batch size: 1) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 16.520 | 15.786 | +| Image Segmentation/Segformer | 16.116 | 14.205 | +| Object Detection/OwlViT | 53.634 | 51.105 | +| Image Classification/BeiT | 16.464 | 15.710 | +| Object Detection/DETR | 73.100 | 53.99 | +| Image Classification/ConvNeXT | 32.932 | 30.845 | +| Image Classification/ResNet | 6.031 | 4.321 | +| Image Segmentation/Mask2former | 79.192 | 66.815 | +| Image Segmentation/Maskformer | 200.026 | 188.268 | +| Image Segmentation/MobileNet | 18.908 | 11.997 | +| Object Detection/Resnet-101 | 106.622 | 82.566 | +| Object Detection/Conditional-DETR | 77.594 | 56.984 | + +### T4 (batch size: 4) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 43.653 | 43.626 | +| Image Segmentation/Segformer | 45.327 | 42.445 | +| Image Classification/BeiT | 52.007 | 51.354 | +| Object Detection/DETR | 277.850 | 268.003 | +| Image Classification/ConvNeXT | 119.259 | 105.580 | +| Image Classification/ResNet | 13.039 | 11.388 | +| Image Segmentation/Mask2former | 201.540 | 184.670 | +| Image Segmentation/Maskformer | 764.052 | 711.280 | +| Image Segmentation/MobileNet | 74.289 | 48.677 | +| Object Detection/Resnet-101 | 421.859 | 357.614 | +| Object Detection/Conditional-DETR | 289.002 | 226.945 | + +### T4 (batch size: 16) + +| **Task/Model** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:| +| Image Classification/ViT | 163.914 | 160.907 | +| Image Segmentation/Segformer | 192.412 | 163.620 | +| Image Classification/BeiT | 188.978 | 187.976 | +| Object Detection/DETR | OOM | OOM | +| Image Classification/ConvNeXT | 422.886 | 388.078 | +| Image Classification/ResNet | 44.114 | 37.604 | +| Image Segmentation/Mask2former | 756.337 | 695.291 | +| Image Segmentation/Maskformer | 2842.940 | 2656.88 | +| Image Segmentation/MobileNet | 299.003 | 201.942 | +| Object Detection/Resnet-101 | 1619.505 | 1262.758 | +| Object Detection/Conditional-DETR | 1137.513 | 897.390| + +## PyTorch Nightly +我们还在 PyTorch Nightly 版本(2.1.0dev)上进行了基准测试,可以在[这里](https://download.pytorch.org/whl/nightly/cu118)找到 Nightly 版本的安装包,并观察到了未编译和编译模型的延迟性能改善。 + +### A100 + +| **Task/Model** | **Batch Size** | **torch 2.0 - no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:|:---:| +| Image Classification/BeiT | Unbatched | 12.462 | 6.954 | +| Image Classification/BeiT | 4 | 14.109 | 12.851 | +| Image Classification/BeiT | 16 | 42.179 | 42.147 | +| Object Detection/DETR | Unbatched | 30.484 | 15.221 | +| Object Detection/DETR | 4 | 46.816 | 30.942 | +| Object Detection/DETR | 16 | 163.749 | 163.706 | + +### T4 + +| **Task/Model** | **Batch Size** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:|:---:| +| Image Classification/BeiT | Unbatched | 14.408 | 14.052 | +| Image Classification/BeiT | 4 | 47.381 | 46.604 | +| Image Classification/BeiT | 16 | 42.179 | 42.147 | +| Object Detection/DETR | Unbatched | 68.382 | 53.481 | +| Object Detection/DETR | 4 | 269.615 | 204.785 | +| Object Detection/DETR | 16 | OOM | OOM | + +### V100 + +| **Task/Model** | **Batch Size** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:|:---:| +| Image Classification/BeiT | Unbatched | 13.477 | 7.926 | +| Image Classification/BeiT | 4 | 15.103 | 14.378 | +| Image Classification/BeiT | 16 | 52.517 | 51.691 | +| Object Detection/DETR | Unbatched | 28.706 | 19.077 | +| Object Detection/DETR | 4 | 88.402 | 62.949| +| Object Detection/DETR | 16 | OOM | OOM | + + +## 降低开销 +我们在 PyTorch Nightly 版本中为 A100 和 T4 进行了 `reduce-overhead` 编译模式的性能基准测试。 + +### A100 + +| **Task/Model** | **Batch Size** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:|:---:| +| Image Classification/ConvNeXT | Unbatched | 11.758 | 7.335 | +| Image Classification/ConvNeXT | 4 | 23.171 | 21.490 | +| Image Classification/ResNet | Unbatched | 7.435 | 3.801 | +| Image Classification/ResNet | 4 | 7.261 | 2.187 | +| Object Detection/Conditional-DETR | Unbatched | 32.823 | 11.627 | +| Object Detection/Conditional-DETR | 4 | 50.622 | 33.831 | +| Image Segmentation/MobileNet | Unbatched | 9.869 | 4.244 | +| Image Segmentation/MobileNet | 4 | 14.385 | 7.946 | + + +### T4 + +| **Task/Model** | **Batch Size** | **torch 2.0 -
no compile** | **torch 2.0 -
compile** | +|:---:|:---:|:---:|:---:| +| Image Classification/ConvNeXT | Unbatched | 32.137 | 31.84 | +| Image Classification/ConvNeXT | 4 | 120.944 | 110.209 | +| Image Classification/ResNet | Unbatched | 9.761 | 7.698 | +| Image Classification/ResNet | 4 | 15.215 | 13.871 | +| Object Detection/Conditional-DETR | Unbatched | 72.150 | 57.660 | +| Object Detection/Conditional-DETR | 4 | 301.494 | 247.543 | +| Image Segmentation/MobileNet | Unbatched | 22.266 | 19.339 | +| Image Segmentation/MobileNet | 4 | 78.311 | 50.983 | + + diff --git a/docs/source/zh/performance.md b/docs/source/zh/performance.md new file mode 100644 index 00000000000000..afe41c8fdd1473 --- /dev/null +++ b/docs/source/zh/performance.md @@ -0,0 +1,63 @@ + + +# 性能与可扩展性 + +训练大型transformer模型并将其部署到生产环境会面临各种挑战。 +在训练过程中,模型可能需要比可用的GPU内存更多的资源,或者表现出较慢的训练速度。在部署阶段,模型可能在生产环境中难以处理所需的吞吐量。 + +本文档旨在帮助您克服这些挑战,并找到适合您使用场景的最佳设置。教程分为训练和推理部分,因为每个部分都有不同的挑战和解决方案。在每个部分中,您将找到针对不同硬件配置的单独指南,例如单GPU与多GPU用于训练或CPU与GPU用于推理。 + +将此文档作为您的起点,进一步导航到与您的情况匹配的方法。 + +## 训练 + +高效训练大型transformer模型需要使用加速器硬件,如GPU或TPU。最常见的情况是您只有一个GPU。您应用于单个GPU上提高训练效率的方法可以扩展到其他设置,如多个GPU。然而,也有一些特定于多GPU或CPU训练的技术。我们在单独的部分中介绍它们。 + +* [在单个GPU上进行高效训练的方法和工具](perf_train_gpu_one):从这里开始学习常见的方法,可以帮助优化GPU内存利用率、加快训练速度或两者兼备。 +* [多GPU训练部分](perf_train_gpu_many):探索此部分以了解适用于多GPU设置的进一步优化方法,例如数据并行、张量并行和流水线并行。 +* [CPU训练部分](perf_train_cpu):了解在CPU上的混合精度训练。 +* [在多个CPU上进行高效训练](perf_train_cpu_many):了解分布式CPU训练。 +* [使用TensorFlow在TPU上进行训练](perf_train_tpu_tf):如果您对TPU还不熟悉,请参考此部分,了解有关在TPU上进行训练和使用XLA的建议性介绍。 +* [自定义硬件进行训练](perf_hardware):在构建自己的深度学习机器时查找技巧和窍门。 +* [使用Trainer API进行超参数搜索](hpo_train) + + +## 推理 + +在生产环境中对大型模型进行高效推理可能与训练它们一样具有挑战性。在接下来的部分中,我们将详细介绍如何在CPU和单/多GPU设置上进行推理的步骤。 + +* [在单个CPU上进行推理](perf_infer_cpu) +* [在单个GPU上进行推理](perf_infer_gpu_one) +* [多GPU推理](perf_infer_gpu_one) +* [TensorFlow模型的XLA集成](tf_xla) + +## 训练和推理 + +在这里,您将找到适用于训练模型或使用它进行推理的技巧、窍门和技巧。 + +* [实例化大型模型](big_models) +* [解决性能问题](debugging) + +## 贡献 + +这份文档还远远没有完成,还有很多需要添加的内容,所以如果你有补充或更正的内容,请毫不犹豫地提交一个PR(Pull Request),或者如果你不确定,可以创建一个Issue,我们可以在那里讨论细节。 + +在做出贡献时,如果A比B更好,请尽量包含可重复的基准测试和(或)该信息来源的链接(除非它直接来自您)。 diff --git a/docs/source/zh/pipeline_tutorial.md b/docs/source/zh/pipeline_tutorial.md new file mode 100644 index 00000000000000..568f8bb63603c2 --- /dev/null +++ b/docs/source/zh/pipeline_tutorial.md @@ -0,0 +1,308 @@ + + +# 推理pipeline + +[`pipeline`] 让使用[Hub](https://huggingface.co/models)上的任何模型进行任何语言、计算机视觉、语音以及多模态任务的推理变得非常简单。即使您对特定的模态没有经验,或者不熟悉模型的源码,您仍然可以使用[`pipeline`]进行推理!本教程将教您: + +- 如何使用[`pipeline`] 进行推理。 +- 如何使用特定的`tokenizer`(分词器)或模型。 +- 如何使用[`pipeline`] 进行音频、视觉和多模态任务的推理。 + + + +请查看[`pipeline`]文档以获取已支持的任务和可用参数的完整列表。 + + + +## Pipeline使用 + +虽然每个任务都有一个关联的[`pipeline`],但使用通用的抽象的[`pipeline`]更加简单,其中包含所有特定任务的`pipelines`。[`pipeline`]会自动加载一个默认模型和一个能够进行任务推理的预处理类。让我们以使用[`pipeline`]进行自动语音识别(ASR)或语音转文本为例。 + +1. 首先,创建一个[`pipeline`]并指定推理任务: + +```py +>>> from transformers import pipeline + +>>> transcriber = pipeline(task="automatic-speech-recognition") +``` + +2. 
将您的输入传递给[`pipeline`]。对于语音识别,这通常是一个音频输入文件: + + +```py +>>> transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +{'text': 'I HAVE A DREAM BUT ONE DAY THIS NATION WILL RISE UP LIVE UP THE TRUE MEANING OF ITS TREES'} +``` + +您没有得到您期望的结果?可以在Hub上查看一些[最受欢迎的自动语音识别模型](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=trending) +,看看是否可以获得更好的转录。 + +让我们尝试来自 OpenAI 的[Whisper large-v2](https://huggingface.co/openai/whisper-large) 模型。Whisperb比Wav2Vec2晚2年发布,使用接近10倍的数据进行了训练。因此,它在大多数下游基准测试上击败了Wav2Vec2。 +它还具有预测标点和大小写的附加优势,而Wav2Vec2则无法实现这些功能。 + +让我们在这里尝试一下,看看它的表现如何: + + +```py +>>> transcriber = pipeline(model="openai/whisper-large-v2") +>>> transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'} +``` + +现在这个结果看起来更准确了!要进行深入的Wav2Vec2与Whisper比较,请参阅[音频变换器课程](https://huggingface.co/learn/audio-course/chapter5/asr_models)。 +我们鼓励您在 Hub 上查看不同语言的模型,以及专业领域的模型等。您可以在Hub上直接查看并比较模型的结果,以确定是否适合或处理边缘情况是否比其他模型更好。如果您没有找到适用于您的用例的模型,您始终可以[训练](training)自己的模型! + +如果您有多个输入,您可以将输入作为列表传递: + + +```py +transcriber( + [ + "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac", + "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac", + ] +) +``` + +`Pipelines`非常适合用于测试,因为从一个模型切换到另一个模型非常琐碎;但是,还有一些方法可以将它们优化后用于大型工作负载而不仅仅是测试。请查看以下指南,深入探讨如何迭代整个数据集或在Web服务器中使用`Pipelines`: +* [在数据集上使用流水线](#using-pipelines-on-a-dataset) +* [在Web服务器中使用流水线](./pipeline_webserver) + + +## 参数 + +[`pipeline`] 支持许多参数;有些是适用于特定任务的,而有些适用于所有`pipeline`。通常情况下,您可以在任何地方指定对应参数: + + +```py +transcriber = pipeline(model="openai/whisper-large-v2", my_parameter=1) + +out = transcriber(...) # This will use `my_parameter=1`. +out = transcriber(..., my_parameter=2) # This will override and use `my_parameter=2`. +out = transcriber(...) # This will go back to using `my_parameter=1`. +``` + +让我们查看其中的三个重要参数: + + +### 设备 + +如果您使用 `device=n`,`pipeline`会自动将模型放在指定的设备上。无论您使用PyTorch还是Tensorflow,这都可以工作。 + + +```py +transcriber = pipeline(model="openai/whisper-large-v2", device=0) +``` + +如果模型对于单个GPU来说过于庞大,并且您正在使用PyTorch,您可以设置 `device_map="auto"` 以自动确定如何加载和存储模型权重。使用 `device_map` 参数需要安装🤗 [Accelerate](https://huggingface.co/docs/accelerate) 软件包: + + +```bash +pip install --upgrade accelerate +``` + +以下代码会自动在各个设备上加载和存储模型权重: + + +```py +transcriber = pipeline(model="openai/whisper-large-v2", device_map="auto") +``` + +请注意,如果传递了 `device_map="auto"`,在实例化您的 `pipeline` 时不需要添加 `device=device` 参数,否则可能会遇到一些意外的状况! 
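+下面的小示例沿用上文的 Whisper 检查点(仅作演示),说明这两个参数应当二选一:
+
+```py
+from transformers import pipeline
+
+# 要么让 🤗 Accelerate 自动分配设备……
+transcriber = pipeline(model="openai/whisper-large-v2", device_map="auto")
+
+# ……要么手动指定单个设备;不要同时传入 device 和 device_map="auto"
+# transcriber = pipeline(model="openai/whisper-large-v2", device=0)
+```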
+ +### 批量大小 + +默认情况下,`pipelines`不会进行批量推理,原因在[这里](https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching)详细解释。因为批处理不一定更快,实际上在某些情况下可能会更慢。 + +但如果在您的用例中起作用,您可以使用: + + +```py +transcriber = pipeline(model="openai/whisper-large-v2", device=0, batch_size=2) +audio_filenames = [f"https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/{i}.flac" for i in range(1, 5)] +texts = transcriber(audio_filenames) +``` + +以上代码会在提供的4个音频文件上运行`pipeline`,它会将它们以2个一组的批次传递给模型(模型在GPU上,此时批处理更有可能有所帮助),而您无需编写额外的代码。输出应始终与没有批处理时收到的结果相一致。它只是一种帮助您更快地使用`pipeline`的方式。 + +`pipeline`也可以减轻一些批处理的复杂性,因为对于某些`pipeline`,需要将单个项目(如长音频文件)分成多个部分以供模型处理。`pipeline`为您执行这种[*chunk batching*](./main_classes/pipelines#pipeline-chunk-batching)。 + +### 任务特定参数 + +所有任务都提供了特定于任务的参数,这些参数提供额外的灵活性和选择,以帮助您完成工作。 +例如,[`transformers.AutomaticSpeechRecognitionPipeline.__call__`] 方法具有一个 `return_timestamps` 参数,对于字幕视频似乎很有帮助: + +```py +>>> transcriber = pipeline(model="openai/whisper-large-v2", return_timestamps=True) +>>> transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.', 'chunks': [{'timestamp': (0.0, 11.88), 'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its'}, {'timestamp': (11.88, 12.38), 'text': ' creed.'}]} +``` + +正如您所看到的,模型推断出了文本,还输出了各个句子发音的**时间**。 + +每个任务都有许多可用的参数,因此请查看每个任务的API参考,以了解您可以进行哪些调整!例如,[`~transformers.AutomaticSpeechRecognitionPipeline`] 具有 `chunk_length_s` 参数,对于处理非常长的音频文件(例如,为整部电影或长达一小时的视频配字幕)非常有帮助,这通常是模型无法单独处理的: + +```python +>>> transcriber = pipeline(model="openai/whisper-large-v2", chunk_length_s=30, return_timestamps=True) +>>> transcriber("https://huggingface.co/datasets/sanchit-gandhi/librispeech_long/resolve/main/audio.wav") +{'text': " Chapter 16. I might have told you of the beginning of this liaison in a few lines, but I wanted you to see every step by which we came. I, too, agree to whatever Marguerite wished, Marguerite to be unable to live apart from me. It was the day after the evening... +``` + +如果您找不到一个真正有帮助的参数,欢迎[提出请求](https://github.com/huggingface/transformers/issues/new?assignees=&labels=feature&template=feature-request.yml)! + +## 在数据集上使用pipelines + +`pipelines`也可以对大型数据集进行推理。我们建议使用迭代器来完成这一任务,这是最简单的方法: + + +```py +def data(): + for i in range(1000): + yield f"My example {i}" + + +pipe = pipeline(model="openai-community/gpt2", device=0) +generated_characters = 0 +for out in pipe(data()): + generated_characters += len(out[0]["generated_text"]) +``` + +迭代器 `data()` 会产生每个结果,`pipelines`会自动识别输入为可迭代对象,并在GPU上处理数据的同时开始获取数据(在底层使用[DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader))。这一点非常重要,因为您不必为整个数据集分配内存,可以尽可能快地将数据传送到GPU。 + +由于批处理可以加速处理,因此在这里尝试调整 `batch_size` 参数可能会很有用。 + +迭代数据集的最简单方法就是从🤗 [Datasets](https://github.com/huggingface/datasets/) 中加载数据集: + + +```py +# KeyDataset is a util that will just output the item we're interested in. +from transformers.pipelines.pt_utils import KeyDataset +from datasets import load_dataset + +pipe = pipeline(model="hf-internal-testing/tiny-random-wav2vec2", device=0) +dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation[:10]") + +for out in pipe(KeyDataset(dataset, "audio")): + print(out) +``` + + +## 在Web服务器上使用pipelines + + +创建推理引擎是一个复杂的主题,值得有自己的页面。 + + +[链接](./pipeline_webserver) + +## 视觉流水线 + +对于视觉任务,使用[`pipeline`] 几乎是相同的。 + +指定您的任务并将图像传递给分类器。图像可以是链接、本地路径或base64编码的图像。例如,下面显示的是哪种品种的猫? 
+ +![pipeline-cat-chonk](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg) +```py +>>> from transformers import pipeline + +>>> vision_classifier = pipeline(model="google/vit-base-patch16-224") +>>> preds = vision_classifier( +... images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +... ) +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds] +>>> preds +[{'score': 0.4335, 'label': 'lynx, catamount'}, {'score': 0.0348, 'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor'}, {'score': 0.0324, 'label': 'snow leopard, ounce, Panthera uncia'}, {'score': 0.0239, 'label': 'Egyptian cat'}, {'score': 0.0229, 'label': 'tiger cat'}] +``` + +## 文本流水线 + +对于NLP任务,使用[`pipeline`] 几乎是相同的。 + + +```py +>>> from transformers import pipeline + +>>> # This model is a `zero-shot-classification` model. +>>> # It will classify text, except you are free to choose any label you might imagine +>>> classifier = pipeline(model="facebook/bart-large-mnli") +>>> classifier( +... "I have a problem with my iphone that needs to be resolved asap!!", +... candidate_labels=["urgent", "not urgent", "phone", "tablet", "computer"], +... ) +{'sequence': 'I have a problem with my iphone that needs to be resolved asap!!', 'labels': ['urgent', 'phone', 'computer', 'not urgent', 'tablet'], 'scores': [0.504, 0.479, 0.013, 0.003, 0.002]} +``` + +## 多模态流水线 + +[`pipeline`] 支持多个模态。例如,视觉问题回答(VQA)任务结合了文本和图像。请随意使用您喜欢的任何图像链接和您想要问关于该图像的问题。图像可以是URL或图像的本地路径。 + +例如,如果您使用这个[invoice image](https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/invoice.png): + + +```py +>>> from transformers import pipeline + +>>> vqa = pipeline(model="impira/layoutlm-document-qa") +>>> vqa( +... image="https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/invoice.png", +... question="What is the invoice number?", +... ) +[{'score': 0.42515, 'answer': 'us-001', 'start': 16, 'end': 16}] +``` + + + +要运行上面的示例,除了🤗 Transformers之外,您需要安装[`pytesseract`](https://pypi.org/project/pytesseract/)。 + + +```bash +sudo apt install -y tesseract-ocr +pip install pytesseract +``` + + + +## 在大模型上使用🤗 `accelerate`和`pipeline`: + +您可以轻松地使用🤗 `accelerate`在大模型上运行 `pipeline`!首先确保您已经使用 `pip install accelerate` 安装了 `accelerate`。 + +首先使用 `device_map="auto"` 加载您的模型!我们将在示例中使用 `facebook/opt-1.3b`。 + + +```py +# pip install accelerate +import torch +from transformers import pipeline + +pipe = pipeline(model="facebook/opt-1.3b", torch_dtype=torch.bfloat16, device_map="auto") +output = pipe("This is a cool example!", do_sample=True, top_p=0.95) +``` + +如果安装 `bitsandbytes` 并添加参数 `load_in_8bit=True`,您还可以传递8位加载的模型。 + + +```py +# pip install accelerate bitsandbytes +import torch +from transformers import pipeline + +pipe = pipeline(model="facebook/opt-1.3b", device_map="auto", model_kwargs={"load_in_8bit": True}) +output = pipe("This is a cool example!", do_sample=True, top_p=0.95) +``` + +请注意,您可以将`checkpoint `替换为任何支持大模型加载的Hugging Face模型,比如BLOOM! 
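+例如,下面把检查点换成 BLOOM 家族中较小的 `bigscience/bloom-1b7`(模型选择仅为示意,请根据显存情况调整):
+
+```py
+# pip install accelerate bitsandbytes
+import torch
+from transformers import pipeline
+
+pipe = pipeline(model="bigscience/bloom-1b7", device_map="auto", model_kwargs={"load_in_8bit": True})
+output = pipe("This is a cool example!", do_sample=True, top_p=0.95)
+```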
+ diff --git a/docs/source/zh/preprocessing.md b/docs/source/zh/preprocessing.md new file mode 100644 index 00000000000000..b90c89b36d1567 --- /dev/null +++ b/docs/source/zh/preprocessing.md @@ -0,0 +1,541 @@ + + +# 预处理 + +[[open-in-colab]] + +在您可以在数据集上训练模型之前,数据需要被预处理为期望的模型输入格式。无论您的数据是文本、图像还是音频,它们都需要被转换并组合成批量的张量。🤗 Transformers 提供了一组预处理类来帮助准备数据以供模型使用。在本教程中,您将了解以下内容: + +* 对于文本,使用[分词器](./main_classes/tokenizer)(`Tokenizer`)将文本转换为一系列标记(`tokens`),并创建`tokens`的数字表示,将它们组合成张量。 +* 对于语音和音频,使用[特征提取器](./main_classes/feature_extractor)(`Feature extractor`)从音频波形中提取顺序特征并将其转换为张量。 +* 图像输入使用[图像处理器](./main_classes/image)(`ImageProcessor`)将图像转换为张量。 +* 多模态输入,使用[处理器](./main_classes/processors)(`Processor`)结合了`Tokenizer`和`ImageProcessor`或`Processor`。 + + + +`AutoProcessor` **始终**有效的自动选择适用于您使用的模型的正确`class`,无论您使用的是`Tokenizer`、`ImageProcessor`、`Feature extractor`还是`Processor`。 + + + +在开始之前,请安装🤗 Datasets,以便您可以加载一些数据集来进行实验: + + +```bash +pip install datasets +``` + +## 自然语言处理 + + + +处理文本数据的主要工具是[Tokenizer](main_classes/tokenizer)。`Tokenizer`根据一组规则将文本拆分为`tokens`。然后将这些`tokens`转换为数字,然后转换为张量,成为模型的输入。模型所需的任何附加输入都由`Tokenizer`添加。 + + + +如果您计划使用预训练模型,重要的是使用与之关联的预训练`Tokenizer`。这确保文本的拆分方式与预训练语料库相同,并在预训练期间使用相同的标记-索引的对应关系(通常称为*词汇表*-`vocab`)。 + + + +开始使用[`AutoTokenizer.from_pretrained`]方法加载一个预训练`tokenizer`。这将下载模型预训练的`vocab`: + + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") +``` + +然后将您的文本传递给`tokenizer`: + + +```py +>>> encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.") +>>> print(encoded_input) +{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102], + 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]} +``` + +`tokenizer`返回一个包含三个重要对象的字典: + +* [input_ids](glossary#input-ids) 是与句子中每个`token`对应的索引。 +* [attention_mask](glossary#attention-mask) 指示是否应该关注一个`token`。 +* [token_type_ids](glossary#token-type-ids) 在存在多个序列时标识一个`token`属于哪个序列。 + +通过解码 `input_ids` 来返回您的输入: + + +```py +>>> tokenizer.decode(encoded_input["input_ids"]) +'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]' +``` + +如您所见,`tokenizer`向句子中添加了两个特殊`token` - `CLS` 和 `SEP`(分类器和分隔符)。并非所有模型都需要特殊`token`,但如果需要,`tokenizer`会自动为您添加。 + +如果有多个句子需要预处理,将它们作为列表传递给`tokenizer`: + + +```py +>>> batch_sentences = [ +... "But what about second breakfast?", +... "Don't think he knows about second breakfast, Pip.", +... "What about elevensies?", +... ] +>>> encoded_inputs = tokenizer(batch_sentences) +>>> print(encoded_inputs) +{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102], + [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], + [101, 1327, 1164, 5450, 23434, 136, 102]], + 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0]], + 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1, 1]]} +``` + +### 填充 + +句子的长度并不总是相同,这可能会成为一个问题,因为模型输入的张量需要具有统一的形状。填充是一种策略,通过在较短的句子中添加一个特殊的`padding token`,以确保张量是矩形的。 + +将 `padding` 参数设置为 `True`,以使批次中较短的序列填充到与最长序列相匹配的长度: + +```py +>>> batch_sentences = [ +... "But what about second breakfast?", +... "Don't think he knows about second breakfast, Pip.", +... "What about elevensies?", +... 
] +>>> encoded_input = tokenizer(batch_sentences, padding=True) +>>> print(encoded_input) +{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], + [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], + [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], + 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], + 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], + [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]} +``` + +第一句和第三句因为较短,通过`0`进行填充,。 + +### 截断 + +另一方面,有时候一个序列可能对模型来说太长了。在这种情况下,您需要将序列截断为更短的长度。 + +将 `truncation` 参数设置为 `True`,以将序列截断为模型接受的最大长度: + + +```py +>>> batch_sentences = [ +... "But what about second breakfast?", +... "Don't think he knows about second breakfast, Pip.", +... "What about elevensies?", +... ] +>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True) +>>> print(encoded_input) +{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], + [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], + [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], + 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], + 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], + [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]} +``` + + + +查看[填充和截断](./pad_truncation)概念指南,了解更多有关填充和截断参数的信息。 + + + +### 构建张量 + +最后,`tokenizer`可以返回实际输入到模型的张量。 + +将 `return_tensors` 参数设置为 `pt`(对于PyTorch)或 `tf`(对于TensorFlow): + + + + + +```py +>>> batch_sentences = [ +... "But what about second breakfast?", +... "Don't think he knows about second breakfast, Pip.", +... "What about elevensies?", +... ] +>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt") +>>> print(encoded_input) +{'input_ids': tensor([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], + [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], + [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]]), + 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), + 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], + [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])} +``` + + + +```py +>>> batch_sentences = [ +... "But what about second breakfast?", +... "Don't think he knows about second breakfast, Pip.", +... "What about elevensies?", +... 
] +>>> encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="tf") +>>> print(encoded_input) +{'input_ids': , + 'token_type_ids': , + 'attention_mask': } +``` + + + +## 音频 + +对于音频任务,您需要[feature extractor](main_classes/feature_extractor)来准备您的数据集以供模型使用。`feature extractor`旨在从原始音频数据中提取特征,并将它们转换为张量。 + +加载[MInDS-14](https://huggingface.co/datasets/PolyAI/minds14)数据集(有关如何加载数据集的更多详细信息,请参阅🤗 [Datasets教程](https://huggingface.co/docs/datasets/load_hub))以了解如何在音频数据集中使用`feature extractor`: + + +```py +>>> from datasets import load_dataset, Audio + +>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train") +``` + +访问 `audio` 列的第一个元素以查看输入。调用 `audio` 列会自动加载和重新采样音频文件: + +```py +>>> dataset[0]["audio"] +{'array': array([ 0. , 0.00024414, -0.00024414, ..., -0.00024414, + 0. , 0. ], dtype=float32), + 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav', + 'sampling_rate': 8000} +``` + +这会返回三个对象: + +* `array` 是加载的语音信号 - 并在必要时重新采为`1D array`。 +* `path` 指向音频文件的位置。 +* `sampling_rate` 是每秒测量的语音信号数据点数量。 + +对于本教程,您将使用[Wav2Vec2](https://huggingface.co/facebook/wav2vec2-base)模型。查看模型卡片,您将了解到Wav2Vec2是在16kHz采样的语音音频数据上预训练的。重要的是,您的音频数据的采样率要与用于预训练模型的数据集的采样率匹配。如果您的数据的采样率不同,那么您需要对数据进行重新采样。 + +1. 使用🤗 Datasets的[`~datasets.Dataset.cast_column`]方法将采样率提升到16kHz: + +```py +>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000)) +``` + +2. 再次调用 `audio` 列以重新采样音频文件: + + +```py +>>> dataset[0]["audio"] +{'array': array([ 2.3443763e-05, 2.1729663e-04, 2.2145823e-04, ..., + 3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32), + 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav', + 'sampling_rate': 16000} +``` + +接下来,加载一个`feature extractor`以对输入进行标准化和填充。当填充文本数据时,会为较短的序列添加 `0`。相同的理念适用于音频数据。`feature extractor`添加 `0` - 被解释为静音 - 到`array` 。 + +使用 [`AutoFeatureExtractor.from_pretrained`] 加载`feature extractor`: + + +```py +>>> from transformers import AutoFeatureExtractor + +>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base") +``` + +将音频 `array` 传递给`feature extractor`。我们还建议在`feature extractor`中添加 `sampling_rate` 参数,以更好地调试可能发生的静音错误: + + +```py +>>> audio_input = [dataset[0]["audio"]["array"]] +>>> feature_extractor(audio_input, sampling_rate=16000) +{'input_values': [array([ 3.8106556e-04, 2.7506407e-03, 2.8015103e-03, ..., + 5.6335266e-04, 4.6588284e-06, -1.7142107e-04], dtype=float32)]} +``` + +就像`tokenizer`一样,您可以应用填充或截断来处理批次中的可变序列。请查看这两个音频样本的序列长度: + + +```py +>>> dataset[0]["audio"]["array"].shape +(173398,) + +>>> dataset[1]["audio"]["array"].shape +(106496,) +``` + +创建一个函数来预处理数据集,以使音频样本具有相同的长度。通过指定最大样本长度,`feature extractor`将填充或截断序列以使其匹配: + + +```py +>>> def preprocess_function(examples): +... audio_arrays = [x["array"] for x in examples["audio"]] +... inputs = feature_extractor( +... audio_arrays, +... sampling_rate=16000, +... padding=True, +... max_length=100000, +... truncation=True, +... ) +... return inputs +``` + +将`preprocess_function`应用于数据集中的前几个示例: + + +```py +>>> processed_dataset = preprocess_function(dataset[:5]) +``` + +现在样本长度是相同的,并且与指定的最大长度匹配。您现在可以将经过处理的数据集传递给模型了! 
+ + +```py +>>> processed_dataset["input_values"][0].shape +(100000,) + +>>> processed_dataset["input_values"][1].shape +(100000,) +``` + +## 计算机视觉 + +对于计算机视觉任务,您需要一个[ image processor](main_classes/image_processor)来准备数据集以供模型使用。图像预处理包括多个步骤将图像转换为模型期望输入的格式。这些步骤包括但不限于调整大小、标准化、颜色通道校正以及将图像转换为张量。 + + + +图像预处理通常遵循某种形式的图像增强。图像预处理和图像增强都会改变图像数据,但它们有不同的目的: + +* 图像增强可以帮助防止过拟合并增加模型的鲁棒性。您可以在数据增强方面充分发挥创造性 - 调整亮度和颜色、裁剪、旋转、调整大小、缩放等。但要注意不要改变图像的含义。 +* 图像预处理确保图像与模型预期的输入格式匹配。在微调计算机视觉模型时,必须对图像进行与模型训练时相同的预处理。 + +您可以使用任何您喜欢的图像增强库。对于图像预处理,请使用与模型相关联的`ImageProcessor`。 + + + +加载[food101](https://huggingface.co/datasets/food101)数据集(有关如何加载数据集的更多详细信息,请参阅🤗 [Datasets教程](https://huggingface.co/docs/datasets/load_hub))以了解如何在计算机视觉数据集中使用图像处理器: + + + +因为数据集相当大,请使用🤗 Datasets的`split`参数加载训练集中的少量样本! + + + + +```py +>>> from datasets import load_dataset + +>>> dataset = load_dataset("food101", split="train[:100]") +``` + +接下来,使用🤗 Datasets的[`Image`](https://huggingface.co/docs/datasets/package_reference/main_classes?highlight=image#datasets.Image)功能查看图像: + + +```py +>>> dataset[0]["image"] +``` + +
+ +使用 [`AutoImageProcessor.from_pretrained`] 加载`image processor`: + +```py +>>> from transformers import AutoImageProcessor + +>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224") +``` + +首先,让我们进行图像增强。您可以使用任何您喜欢的库,但在本教程中,我们将使用torchvision的[`transforms`](https://pytorch.org/vision/stable/transforms.html)模块。如果您有兴趣使用其他数据增强库,请参阅[Albumentations](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_albumentations.ipynb)或[Kornia notebooks](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_kornia.ipynb)中的示例。 + +1. 在这里,我们使用[`Compose`](https://pytorch.org/vision/master/generated/torchvision.transforms.Compose.html)将[`RandomResizedCrop`](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html)和 [`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html)变换连接在一起。请注意,对于调整大小,我们可以从`image_processor`中获取图像尺寸要求。对于一些模型,精确的高度和宽度需要被定义,对于其他模型只需定义`shortest_edge`。 + + +```py +>>> from torchvision.transforms import RandomResizedCrop, ColorJitter, Compose + +>>> size = ( +... image_processor.size["shortest_edge"] +... if "shortest_edge" in image_processor.size +... else (image_processor.size["height"], image_processor.size["width"]) +... ) + +>>> _transforms = Compose([RandomResizedCrop(size), ColorJitter(brightness=0.5, hue=0.5)]) +``` + +2. 模型接受 [`pixel_values`](model_doc/visionencoderdecoder#transformers.VisionEncoderDecoderModel.forward.pixel_values) 作为输入。`ImageProcessor` 可以进行图像的标准化,并生成适当的张量。创建一个函数,将图像增强和图像预处理步骤组合起来处理批量图像,并生成 `pixel_values`: + + +```py +>>> def transforms(examples): +... images = [_transforms(img.convert("RGB")) for img in examples["image"]] +... examples["pixel_values"] = image_processor(images, do_resize=False, return_tensors="pt")["pixel_values"] +... return examples +``` + + + +在上面的示例中,我们设置`do_resize=False`,因为我们已经在图像增强转换中调整了图像的大小,并利用了适当的`image_processor`的`size`属性。如果您在图像增强期间不调整图像的大小,请将此参数排除在外。默认情况下`ImageProcessor`将处理调整大小。 + +如果希望将图像标准化步骤为图像增强的一部分,请使用`image_processor.image_mean`和`image_processor.image_std`。 + + + +3. 然后使用🤗 Datasets的[`set_transform`](https://huggingface.co/docs/datasets/process#format-transform)在运行时应用这些变换: + + +```py +>>> dataset.set_transform(transforms) +``` + +4. 现在,当您访问图像时,您将注意到`image processor`已添加了 `pixel_values`。您现在可以将经过处理的数据集传递给模型了! + + +```py +>>> dataset[0].keys() +``` + +这是在应用变换后的图像样子。图像已被随机裁剪,并其颜色属性发生了变化。 + + +```py +>>> import numpy as np +>>> import matplotlib.pyplot as plt + +>>> img = dataset[0]["pixel_values"] +>>> plt.imshow(img.permute(1, 2, 0)) +``` + +
+ + + +对于诸如目标检测、语义分割、实例分割和全景分割等任务,`ImageProcessor`提供了训练后处理方法。这些方法将模型的原始输出转换为有意义的预测,如边界框或分割地图。 + + + +### 填充 + +在某些情况下,例如,在微调[DETR](./model_doc/detr)时,模型在训练时应用了尺度增强。这可能导致批处理中的图像大小不同。您可以使用[`DetrImageProcessor.pad`]来指定自定义的`collate_fn`将图像批处理在一起。 + +```py +>>> def collate_fn(batch): +... pixel_values = [item["pixel_values"] for item in batch] +... encoding = image_processor.pad(pixel_values, return_tensors="pt") +... labels = [item["labels"] for item in batch] +... batch = {} +... batch["pixel_values"] = encoding["pixel_values"] +... batch["pixel_mask"] = encoding["pixel_mask"] +... batch["labels"] = labels +... return batch +``` + +## 多模态 + +对于涉及多模态输入的任务,您需要[processor](main_classes/processors)来为模型准备数据集。`processor`将两个处理对象-例如`tokenizer`和`feature extractor`-组合在一起。 + +加载[LJ Speech](https://huggingface.co/datasets/lj_speech)数据集(有关如何加载数据集的更多详细信息,请参阅🤗 [Datasets 教程](https://huggingface.co/docs/datasets/load_hub))以了解如何使用`processor`进行自动语音识别(ASR): + + +```py +>>> from datasets import load_dataset + +>>> lj_speech = load_dataset("lj_speech", split="train") +``` + +对于ASR(自动语音识别),主要关注`audio`和`text`,因此可以删除其他列: + + +```py +>>> lj_speech = lj_speech.map(remove_columns=["file", "id", "normalized_text"]) +``` + +现在查看`audio`和`text`列: + +```py +>>> lj_speech[0]["audio"] +{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ..., + 7.3242188e-04, 2.1362305e-04, 6.1035156e-05], dtype=float32), + 'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav', + 'sampling_rate': 22050} + +>>> lj_speech[0]["text"] +'Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition' +``` + +请记住,您应始终[重新采样](preprocessing#audio)音频数据集的采样率,以匹配用于预训练模型数据集的采样率! + + +```py +>>> lj_speech = lj_speech.cast_column("audio", Audio(sampling_rate=16_000)) +``` + +使用[`AutoProcessor.from_pretrained`]加载一个`processor`: + + +```py +>>> from transformers import AutoProcessor + +>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h") +``` + +1. 创建一个函数,用于将包含在 `array` 中的音频数据处理为 `input_values`,并将 `text` 标记为 `labels`。这些将是输入模型的数据: + +```py +>>> def prepare_dataset(example): +... audio = example["audio"] + +... example.update(processor(audio=audio["array"], text=example["text"], sampling_rate=16000)) + +... return example +``` + +2. 将 `prepare_dataset` 函数应用于一个示例: + +```py +>>> prepare_dataset(lj_speech[0]) +``` + +`processor`现在已经添加了 `input_values` 和 `labels`,并且采样率也正确降低为为16kHz。现在可以将处理后的数据集传递给模型! 
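+如果想把同样的预处理应用到整个数据集,可以借助 🤗 Datasets 的 `map` 方法(下面仅是一个示意,是否移除原始列取决于后续训练需要哪些字段):
+
+```py
+>>> processed_lj_speech = lj_speech.map(prepare_dataset, remove_columns=["audio", "text"])
+```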
diff --git a/docs/source/zh/quicktour.md b/docs/source/zh/quicktour.md new file mode 100644 index 00000000000000..c23a38ab5f0004 --- /dev/null +++ b/docs/source/zh/quicktour.md @@ -0,0 +1,547 @@ + + +# 快速上手 + +[[open-in-colab]] + +快来使用 🤗 Transformers 吧!无论你是开发人员还是日常用户,这篇快速上手教程都将帮助你入门并且向你展示如何使用 [`pipeline`] 进行推理,使用 [AutoClass](./model_doc/auto) 加载一个预训练模型和预处理器,以及使用 PyTorch 或 TensorFlow 快速训练一个模型。如果你是一个初学者,我们建议你接下来查看我们的教程或者[课程](https://huggingface.co/course/chapter1/1),来更深入地了解在这里介绍到的概念。 + +在开始之前,确保你已经安装了所有必要的库: + +```bash +!pip install transformers datasets +``` + +你还需要安装喜欢的机器学习框架: + + + + +```bash +pip install torch +``` + + + +```bash +pip install tensorflow +``` + + + +## Pipeline + + + +使用 [`pipeline`] 是利用预训练模型进行推理的最简单的方式。你能够将 [`pipeline`] 开箱即用地用于跨不同模态的多种任务。来看看它支持的任务列表: + +| **任务** | **描述** | **模态** | **Pipeline** | +|------------------------------|-----------------------------------|-----------------|-----------------------------------------------| +| 文本分类 | 为给定的文本序列分配一个标签 | NLP | pipeline(task="sentiment-analysis") | +| 文本生成 | 根据给定的提示生成文本 | NLP | pipeline(task="text-generation") | +| 命名实体识别 | 为序列里的每个 token 分配一个标签(人, 组织, 地址等等) | NLP | pipeline(task="ner") | +| 问答系统 | 通过给定的上下文和问题, 在文本中提取答案 | NLP | pipeline(task="question-answering") | +| 掩盖填充 | 预测出正确的在序列中被掩盖的token | NLP | pipeline(task="fill-mask") | +| 文本摘要 | 为文本序列或文档生成总结 | NLP | pipeline(task="summarization") | +| 文本翻译 | 将文本从一种语言翻译为另一种语言 | NLP | pipeline(task="translation") | +| 图像分类 | 为图像分配一个标签 | Computer vision | pipeline(task="image-classification") | +| 图像分割 | 为图像中每个独立的像素分配标签(支持语义、全景和实例分割) | Computer vision | pipeline(task="image-segmentation") | +| 目标检测 | 预测图像中目标对象的边界框和类别 | Computer vision | pipeline(task="object-detection") | +| 音频分类 | 给音频文件分配一个标签 | Audio | pipeline(task="audio-classification") | +| 自动语音识别 | 将音频文件中的语音提取为文本 | Audio | pipeline(task="automatic-speech-recognition") | +| 视觉问答 | 给定一个图像和一个问题,正确地回答有关图像的问题 | Multimodal | pipeline(task="vqa") | + +创建一个 [`pipeline`] 实例并且指定你想要将它用于的任务,就可以开始了。你可以将 [`pipeline`] 用于任何一个上面提到的任务,如果想知道支持的任务的完整列表,可以查阅 [pipeline API 参考](./main_classes/pipelines)。不过, 在这篇教程中,你将把 [`pipeline`] 用在一个情感分析示例上: + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline("sentiment-analysis") +``` + +[`pipeline`] 会下载并缓存一个用于情感分析的默认的[预训练模型](https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english)和分词器。现在你可以在目标文本上使用 `classifier` 了: + +```py +>>> classifier("We are very happy to show you the 🤗 Transformers library.") +[{'label': 'POSITIVE', 'score': 0.9998}] +``` + +如果你有不止一个输入,可以把所有输入放入一个列表然后传给[`pipeline`],它将会返回一个字典列表: + +```py +>>> results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."]) +>>> for result in results: +... 
print(f"label: {result['label']}, with score: {round(result['score'], 4)}") +label: POSITIVE, with score: 0.9998 +label: NEGATIVE, with score: 0.5309 +``` + +[`pipeline`] 也可以为任何你喜欢的任务遍历整个数据集。在下面这个示例中,让我们选择自动语音识别作为我们的任务: + +```py +>>> import torch +>>> from transformers import pipeline + +>>> speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h") +``` + +加载一个你想遍历的音频数据集(查阅 🤗 Datasets [快速开始](https://huggingface.co/docs/datasets/quickstart#audio) 获得更多信息)。比如,加载 [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) 数据集: + +```py +>>> from datasets import load_dataset, Audio + +>>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train") # doctest: +IGNORE_RESULT +``` + +你需要确保数据集中的音频的采样率与 [`facebook/wav2vec2-base-960h`](https://huggingface.co/facebook/wav2vec2-base-960h) 训练用到的音频的采样率一致: + +```py +>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=speech_recognizer.feature_extractor.sampling_rate)) +``` + +当调用 `"audio"` 列时, 音频文件将会自动加载并重采样。 +从前四个样本中提取原始波形数组,将它作为列表传给 pipeline: + +```py +>>> result = speech_recognizer(dataset[:4]["audio"]) +>>> print([d["text"] for d in result]) +['I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT', "FODING HOW I'D SET UP A JOIN TO HET WITH MY WIFE AND WHERE THE AP MIGHT BE", "I I'D LIKE TOY SET UP A JOINT ACCOUNT WITH MY PARTNER I'M NOT SEEING THE OPTION TO DO IT ON THE AP SO I CALLED IN TO GET SOME HELP CAN I JUST DO IT OVER THE PHONE WITH YOU AND GIVE YOU THE INFORMATION OR SHOULD I DO IT IN THE AP AND I'M MISSING SOMETHING UQUETTE HAD PREFERRED TO JUST DO IT OVER THE PHONE OF POSSIBLE THINGS", 'HOW DO I THURN A JOIN A COUNT'] +``` + +对于输入非常庞大的大型数据集(比如语音或视觉),你会想到使用一个生成器,而不是一个将所有输入都加载进内存的列表。查阅 [pipeline API 参考](./main_classes/pipelines) 来获取更多信息。 + +### 在 pipeline 中使用另一个模型和分词器 + +[`pipeline`] 可以容纳 [Hub](https://huggingface.co/models) 中的任何模型,这让 [`pipeline`] 更容易适用于其他用例。比如,你想要一个能够处理法语文本的模型,就可以使用 Hub 上的标记来筛选出合适的模型。靠前的筛选结果会返回一个为情感分析微调的多语言的 [BERT 模型](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment),你可以将它用于法语文本: + +```py +>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" +``` + + + +使用 [`AutoModelForSequenceClassification`] 和 [`AutoTokenizer`] 来加载预训练模型和它关联的分词器(更多信息可以参考下一节的 `AutoClass`): + +```py +>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification + +>>> model = AutoModelForSequenceClassification.from_pretrained(model_name) +>>> tokenizer = AutoTokenizer.from_pretrained(model_name) +``` + + +使用 [`TFAutoModelForSequenceClassification`] 和 [`AutoTokenizer`] 来加载预训练模型和它关联的分词器(更多信息可以参考下一节的 `TFAutoClass`): + +```py +>>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification + +>>> model = TFAutoModelForSequenceClassification.from_pretrained(model_name) +>>> tokenizer = AutoTokenizer.from_pretrained(model_name) +``` + + + +在 [`pipeline`] 中指定模型和分词器,现在你就可以在法语文本上使用 `classifier` 了: + +```py +>>> classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer) +>>> classifier("Nous sommes très heureux de vous présenter la bibliothèque 🤗 Transformers.") +[{'label': '5 stars', 'score': 0.7273}] +``` + +如果你没有找到适合你的模型,就需要在你的数据上微调一个预训练模型了。查看 [微调教程](./training) 来学习怎样进行微调。最后,微调完模型后,考虑一下在 Hub 上与社区 [分享](./model_sharing) 这个模型,把机器学习普及到每一个人! 
🤗 + +## AutoClass + + + +在幕后,是由 [`AutoModelForSequenceClassification`] 和 [`AutoTokenizer`] 一起支持你在上面用到的 [`pipeline`]。[AutoClass](./model_doc/auto) 是一个能够通过预训练模型的名称或路径自动查找其架构的快捷方式。你只需要为你的任务选择合适的 `AutoClass` 和它关联的预处理类。 + +让我们回过头来看上一节的示例,看看怎样使用 `AutoClass` 来重现使用 [`pipeline`] 的结果。 + +### AutoTokenizer + +分词器负责预处理文本,将文本转换为用于输入模型的数字数组。有多个用来管理分词过程的规则,包括如何拆分单词和在什么样的级别上拆分单词(在 [分词器总结](./tokenizer_summary) 学习更多关于分词的信息)。要记住最重要的是你需要实例化的分词器要与模型的名称相同, 来确保和模型训练时使用相同的分词规则。 + +使用 [`AutoTokenizer`] 加载一个分词器: + +```py +>>> from transformers import AutoTokenizer + +>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" +>>> tokenizer = AutoTokenizer.from_pretrained(model_name) +``` + +将文本传入分词器: + +```py +>>> encoding = tokenizer("We are very happy to show you the 🤗 Transformers library.") +>>> print(encoding) +{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102], + 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]} +``` + +分词器返回了含有如下内容的字典: + +* [input_ids](./glossary#input-ids):用数字表示的 token。 +* [attention_mask](.glossary#attention-mask):应该关注哪些 token 的指示。 + +分词器也可以接受列表作为输入,并填充和截断文本,返回具有统一长度的批次: + + + + +```py +>>> pt_batch = tokenizer( +... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."], +... padding=True, +... truncation=True, +... max_length=512, +... return_tensors="pt", +... ) +``` + + + +```py +>>> tf_batch = tokenizer( +... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."], +... padding=True, +... truncation=True, +... max_length=512, +... return_tensors="tf", +... ) +``` + + + + + +查阅[预处理](./preprocessing)教程来获得有关分词的更详细的信息,以及如何使用 [`AutoFeatureExtractor`] 和 [`AutoProcessor`] 来处理图像,音频,还有多模式输入。 + + + +### AutoModel + + + +🤗 Transformers 提供了一种简单统一的方式来加载预训练的实例. 这表示你可以像加载 [`AutoTokenizer`] 一样加载 [`AutoModel`]。唯一不同的地方是为你的任务选择正确的[`AutoModel`]。对于文本(或序列)分类,你应该加载[`AutoModelForSequenceClassification`]: + +```py +>>> from transformers import AutoModelForSequenceClassification + +>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" +>>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name) +``` + + + +通过 [任务摘要](./task_summary) 查找 [`AutoModel`] 支持的任务. + + + +现在可以把预处理好的输入批次直接送进模型。你只需要通过 `**` 来解包字典: + +```py +>>> pt_outputs = pt_model(**pt_batch) +``` + +模型在 `logits` 属性输出最终的激活结果. 在 `logits` 上应用 softmax 函数来查询概率: + +```py +>>> from torch import nn + +>>> pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1) +>>> print(pt_predictions) +tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725], + [0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=) +``` + + +🤗 Transformers 提供了一种简单统一的方式来加载预训练的实例。这表示你可以像加载 [`AutoTokenizer`] 一样加载 [`TFAutoModel`]。唯一不同的地方是为你的任务选择正确的 [`TFAutoModel`],对于文本(或序列)分类,你应该加载 [`TFAutoModelForSequenceClassification`]: + +```py +>>> from transformers import TFAutoModelForSequenceClassification + +>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" +>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name) +``` + + + +通过 [任务摘要](./task_summary) 查找 [`AutoModel`] 支持的任务. 
+ + + +现在通过直接将字典的键传给张量,将预处理的输入批次传给模型。 + +```py +>>> tf_outputs = tf_model(tf_batch) +``` + +模型在 `logits` 属性输出最终的激活结果。在 `logits` 上应用softmax函数来查询概率: + +```py +>>> import tensorflow as tf + +>>> tf_predictions = tf.nn.softmax(tf_outputs.logits, axis=-1) +>>> tf_predictions # doctest: +IGNORE_RESULT +``` + + + + + +所有 🤗 Transformers 模型(PyTorch 或 TensorFlow)在最终的激活函数(比如 softmax)*之前* 输出张量, +因为最终的激活函数常常与 loss 融合。模型的输出是特殊的数据类,所以它们的属性可以在 IDE 中被自动补全。模型的输出就像一个元组或字典(你可以通过整数、切片或字符串来索引它),在这种情况下,为 None 的属性会被忽略。 + + + +### 保存模型 + + + +当你的模型微调完成,你就可以使用 [`PreTrainedModel.save_pretrained`] 把它和它的分词器保存下来: + +```py +>>> pt_save_directory = "./pt_save_pretrained" +>>> tokenizer.save_pretrained(pt_save_directory) # doctest: +IGNORE_RESULT +>>> pt_model.save_pretrained(pt_save_directory) +``` + +当你准备再次使用这个模型时,就可以使用 [`PreTrainedModel.from_pretrained`] 加载它了: + +```py +>>> pt_model = AutoModelForSequenceClassification.from_pretrained("./pt_save_pretrained") +``` + + +当你的模型微调完成,你就可以使用 [`TFPreTrainedModel.save_pretrained`] 把它和它的分词器保存下来: + +```py +>>> tf_save_directory = "./tf_save_pretrained" +>>> tokenizer.save_pretrained(tf_save_directory) # doctest: +IGNORE_RESULT +>>> tf_model.save_pretrained(tf_save_directory) +``` + +当你准备再次使用这个模型时,就可以使用 [`TFPreTrainedModel.from_pretrained`] 加载它了: + +```py +>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained("./tf_save_pretrained") +``` + + + +🤗 Transformers 有一个特别酷的功能,它能够保存一个模型,并且将它加载为 PyTorch 或 TensorFlow 模型。`from_pt` 或 `from_tf` 参数可以将模型从一个框架转换为另一个框架: + + + + +```py +>>> from transformers import AutoModel + +>>> tokenizer = AutoTokenizer.from_pretrained(tf_save_directory) +>>> pt_model = AutoModelForSequenceClassification.from_pretrained(tf_save_directory, from_tf=True) +``` + + + +```py +>>> from transformers import TFAutoModel + +>>> tokenizer = AutoTokenizer.from_pretrained(pt_save_directory) +>>> tf_model = TFAutoModelForSequenceClassification.from_pretrained(pt_save_directory, from_pt=True) +``` + + + +## 自定义模型构建 + +你可以修改模型的配置类来改变模型的构建方式。配置指明了模型的属性,比如隐藏层或者注意力头的数量。当你从自定义的配置类初始化模型时,你就开始自定义模型构建了。模型属性是随机初始化的,你需要先训练模型,然后才能得到有意义的结果。 + +通过导入 [`AutoConfig`] 来开始,之后加载你想修改的预训练模型。在 [`AutoConfig.from_pretrained`] 中,你能够指定想要修改的属性,比如注意力头的数量: + +```py +>>> from transformers import AutoConfig + +>>> my_config = AutoConfig.from_pretrained("distilbert/distilbert-base-uncased", n_heads=12) +``` + + + +使用 [`AutoModel.from_config`] 根据你的自定义配置创建一个模型: + +```py +>>> from transformers import AutoModel + +>>> my_model = AutoModel.from_config(my_config) +``` + + +使用 [`TFAutoModel.from_config`] 根据你的自定义配置创建一个模型: + +```py +>>> from transformers import TFAutoModel + +>>> my_model = TFAutoModel.from_config(my_config) +``` + + + +查阅 [创建一个自定义结构](./create_a_model) 指南获取更多关于构建自定义配置的信息。 + +## Trainer - PyTorch 优化训练循环 + +所有的模型都是标准的 [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module),所以你可以在任何典型的训练模型中使用它们。当你编写自己的训练循环时,🤗 Transformers 为 PyTorch 提供了一个 [`Trainer`] 类,它包含了基础的训练循环并且为诸如分布式训练,混合精度等特性增加了额外的功能。 + +取决于你的任务, 你通常可以传递以下的参数给 [`Trainer`]: + +1. [`PreTrainedModel`] 或者 [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module): + + ```py + >>> from transformers import AutoModelForSequenceClassification + + >>> model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") + ``` + +2. [`TrainingArguments`] 含有你可以修改的模型超参数,比如学习率,批次大小和训练时的迭代次数。如果你没有指定训练参数,那么它会使用默认值: + + ```py + >>> from transformers import TrainingArguments + + >>> training_args = TrainingArguments( + ... output_dir="path/to/save/folder/", + ... 
learning_rate=2e-5, +   ...     per_device_train_batch_size=8, +   ...     per_device_eval_batch_size=8, +   ...     num_train_epochs=2, +   ... ) +   ``` + +3. 一个预处理类,比如分词器,特征提取器或者处理器: + +   ```py +   >>> from transformers import AutoTokenizer + +   >>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") +   ``` + +4. 加载一个数据集: + +   ```py +   >>> from datasets import load_dataset + +   >>> dataset = load_dataset("rotten_tomatoes")  # doctest: +IGNORE_RESULT +   ``` + +5. 创建一个给数据集分词的函数,并且使用 [`~datasets.Dataset.map`] 应用到整个数据集: + +   ```py +   >>> def tokenize_dataset(dataset): +   ...     return tokenizer(dataset["text"]) + +   >>> dataset = dataset.map(tokenize_dataset, batched=True) +   ``` + +6. 用来从数据集中创建批次的 [`DataCollatorWithPadding`]: + +   ```py +   >>> from transformers import DataCollatorWithPadding + +   >>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer) +   ``` + +现在把所有的类传给 [`Trainer`]: + +```py +>>> from transformers import Trainer + +>>> trainer = Trainer( +...     model=model, +...     args=training_args, +...     train_dataset=dataset["train"], +...     eval_dataset=dataset["test"], +...     tokenizer=tokenizer, +...     data_collator=data_collator, +... )  # doctest: +SKIP +``` + +一切准备就绪后,调用 [`~Trainer.train`] 进行训练: + +```py +>>> trainer.train()  # doctest: +SKIP +``` + + + +对于像翻译或摘要这些使用序列到序列模型的任务,用 [`Seq2SeqTrainer`] 和 [`Seq2SeqTrainingArguments`] 来替代。 + + + +你可以通过子类化 [`Trainer`] 中的方法来自定义训练循环。这样你就可以自定义像损失函数,优化器和调度器这样的特性。查阅 [`Trainer`] 参考手册了解哪些方法能够被子类化。 + +另一个自定义训练循环的方式是通过[回调](./main_classes/callbacks)。你可以使用回调来与其他库集成,查看训练循环来报告进度或提前结束训练。回调不会修改训练循环。如果想自定义损失函数等,就需要子类化 [`Trainer`] 了。 + +## 使用 Tensorflow 训练 + +所有模型都是标准的 [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model),所以你可以通过 [Keras](https://keras.io/) API 实现在 Tensorflow 中训练。🤗 Transformers 提供了 [`~TFPreTrainedModel.prepare_tf_dataset`] 方法来轻松地将数据集加载为 `tf.data.Dataset`,这样你就可以使用 Keras 的 [`compile`](https://keras.io/api/models/model_training_apis/#compile-method) 和 [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) 方法马上开始训练。 + +1. 使用 [`TFPreTrainedModel`] 或者 [`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) 来开始: + +   ```py +   >>> from transformers import TFAutoModelForSequenceClassification + +   >>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased") +   ``` + +2. 一个预处理类,比如分词器,特征提取器或者处理器: + +   ```py +   >>> from transformers import AutoTokenizer + +   >>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") +   ``` + +3. 创建一个给数据集分词的函数: + +   ```py +   >>> def tokenize_dataset(dataset): +   ...     return tokenizer(dataset["text"])  # doctest: +SKIP +   ``` + +4. 使用 [`~datasets.Dataset.map`] 将分词器应用到整个数据集,之后将数据集和分词器传给 [`~TFPreTrainedModel.prepare_tf_dataset`]。如果你需要的话,也可以在这里改变批次大小和是否打乱数据集: + +   ```py +   >>> dataset = dataset.map(tokenize_dataset)  # doctest: +SKIP +   >>> tf_dataset = model.prepare_tf_dataset( +   ...     dataset, batch_size=16, shuffle=True, tokenizer=tokenizer +   ... )  # doctest: +SKIP +   ``` + +5. 一切准备就绪后,调用 `compile` 和 `fit` 开始训练: + +   ```py +   >>> from tensorflow.keras.optimizers import Adam + +   >>> model.compile(optimizer=Adam(3e-5)) +   >>> model.fit(tf_dataset)  # doctest: +SKIP +   ``` + +## 接下来做什么? + +现在你已经完成了 🤗 Transformers 的快速上手教程,来看看我们的指南并且学习如何做一些更具体的事情,比如写一个自定义模型,为某个任务微调一个模型以及如何使用脚本来训练模型。如果你有兴趣了解更多 🤗 Transformers 的核心章节,那就喝杯咖啡然后来看看我们的概念指南吧!
diff --git a/docs/source/zh/quicktour.mdx b/docs/source/zh/quicktour.mdx deleted file mode 100644 index a9125136ced75f..00000000000000 --- a/docs/source/zh/quicktour.mdx +++ /dev/null @@ -1,538 +0,0 @@ - - -# 快速上手 - -[[open-in-colab]] - -快来使用 🤗 Transformers 吧! 无论你是开发人员还是日常用户, 这篇快速上手教程都将帮助你入门并且向你展示如何使用[`pipeline`]进行推理, 使用[AutoClass](./model_doc/auto)加载一个预训练模型和预处理器, 以及使用PyTorch或TensorFlow快速训练一个模型. 如果你是一个初学者, 我们建议你接下来查看我们的教程或者[课程](https://huggingface.co/course/chapter1/1), 来更深入地了解在这里介绍到的概念. - -在开始之前, 确保你已经安装了所有必要的库: - -```bash -!pip install transformers datasets -``` - -你还需要安装喜欢的机器学习框架: - - - -```bash -pip install torch -``` - - -```bash -pip install tensorflow -``` - - - -## Pipeline - - - -使用[`pipeline`]是利用预训练模型进行推理的最简单的方式. 你能够将[`pipeline`]开箱即用地用于跨不同模态的多种任务. 来看看它支持的任务列表: - -| **任务** | **描述** | **模态** | **Pipeline** | -|------------------------------|--------------------------------------------------------------------------------------------------------------|-----------------|-----------------------------------------------| -| 文本分类 | 为给定的文本序列分配一个标签 | NLP | pipeline(task="sentiment-analysis") | -| 文本生成 | 根据给定的提示生成文本 | NLP | pipeline(task="text-generation") | -| 命名实体识别 | 为序列里的每个token分配一个标签(人, 组织, 地址等等) | NLP | pipeline(task="ner") | -| 问答系统 | 通过给定的上下文和问题, 在文本中提取答案 | NLP | pipeline(task="question-answering") | -| 掩盖填充 | 预测出正确的在序列中被掩盖的token | NLP | pipeline(task="fill-mask") | -| 文本摘要 | 为文本序列或文档生成总结 | NLP | pipeline(task="summarization") | -| 文本翻译 | 将文本从一种语言翻译为另一种语言 | NLP | pipeline(task="translation") | -| 图像分类 | 为图像分配一个标签 | Computer vision | pipeline(task="image-classification") | -| 图像分割 | 为图像中每个独立的像素分配标签(支持语义、全景和实例分割) | Computer vision | pipeline(task="image-segmentation") | -| 目标检测 | 预测图像中目标对象的边界框和类别 | Computer vision | pipeline(task="object-detection") | -| 音频分类 | 给音频文件分配一个标签 | Audio | pipeline(task="audio-classification") | -| 自动语音识别 | 将音频文件中的语音提取为文本 | Audio | pipeline(task="automatic-speech-recognition") | -| 视觉问答 | 给定一个图像和一个问题,正确地回答有关图像的问题 | Multimodal | pipeline(task="vqa") | - -创建一个[`pipeline`]实例并且指定你想要将它用于的任务, 就可以开始了. 你可以将[`pipeline`]用于任何一个上面提到的任务, 如果想知道支持的任务的完整列表, 可以查阅[pipeline API 参考](./main_classes/pipelines). 不过, 在这篇教程中, 你将把 [`pipeline`]用在一个情感分析示例上: - -```py ->>> from transformers import pipeline - ->>> classifier = pipeline("sentiment-analysis") -``` - -[`pipeline`] 会下载并缓存一个用于情感分析的默认的[预训练模型](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)和分词器. 现在你可以在目标文本上使用 `classifier`了: - -```py ->>> classifier("We are very happy to show you the 🤗 Transformers library.") -[{'label': 'POSITIVE', 'score': 0.9998}] -``` - -如果你有不止一个输入, 可以把所有输入放入一个列表然后传给[`pipeline`], 它将会返回一个字典列表: - -```py ->>> results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."]) ->>> for result in results: -... print(f"label: {result['label']}, with score: {round(result['score'], 4)}") -label: POSITIVE, with score: 0.9998 -label: NEGATIVE, with score: 0.5309 -``` - -[`pipeline`] 也可以为任何你喜欢的任务遍历整个数据集. 在下面这个示例中, 让我们选择自动语音识别作为我们的任务: - -```py ->>> import torch ->>> from transformers import pipeline - ->>> speech_recognizer = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h") -``` - -加载一个你想遍历的音频数据集 (查阅 🤗 Datasets [快速开始](https://huggingface.co/docs/datasets/quickstart#audio) 获得更多信息). 
比如, 加载 [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) 数据集: - -```py ->>> from datasets import load_dataset, Audio - ->>> dataset = load_dataset("PolyAI/minds14", name="en-US", split="train") # doctest: +IGNORE_RESULT -``` - -你需要确保数据集中的音频的采样率与 [`facebook/wav2vec2-base-960h`](https://huggingface.co/facebook/wav2vec2-base-960h) 训练用到的音频的采样率一致: - -```py ->>> dataset = dataset.cast_column("audio", Audio(sampling_rate=speech_recognizer.feature_extractor.sampling_rate)) -``` - -当调用`"audio"` column时, 音频文件将会自动加载并重采样. -从前四个样本中提取原始波形数组, 将它作为列表传给pipeline: - -```py ->>> result = speech_recognizer(dataset[:4]["audio"]) ->>> print([d["text"] for d in result]) -['I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER HOW DO I PROCEED WITH DOING THAT', "FODING HOW I'D SET UP A JOIN TO HET WITH MY WIFE AND WHERE THE AP MIGHT BE", "I I'D LIKE TOY SET UP A JOINT ACCOUNT WITH MY PARTNER I'M NOT SEEING THE OPTION TO DO IT ON THE AP SO I CALLED IN TO GET SOME HELP CAN I JUST DO IT OVER THE PHONE WITH YOU AND GIVE YOU THE INFORMATION OR SHOULD I DO IT IN THE AP AND I'M MISSING SOMETHING UQUETTE HAD PREFERRED TO JUST DO IT OVER THE PHONE OF POSSIBLE THINGS", 'HOW DO I THURN A JOIN A COUNT'] -``` - -对于输入非常庞大的大型数据集 (比如语音或视觉), 你会想到使用一个生成器, 而不是一个将所有输入都加载进内存的列表. 查阅 [pipeline API 参考](./main_classes/pipelines) 来获取更多信息. - -### 在pipeline中使用另一个模型和分词器 - -[`pipeline`]可以容纳[Hub](https://huggingface.co/models)中的任何模型, 这让[`pipeline`]更容易适用于其他用例. 比如, 你想要一个能够处理法语文本的模型, 就可以使用Hub上的标记来筛选出合适的模型. 靠前的筛选结果会返回一个为情感分析微调的多语言的 [BERT 模型](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment), 你可以将它用于法语文本: - -```py ->>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" -``` - - - -使用 [`AutoModelForSequenceClassification`]和[`AutoTokenizer`]来加载预训练模型和它关联的分词器 (更多信息可以参考下一节的 `AutoClass`): - -```py ->>> from transformers import AutoTokenizer, AutoModelForSequenceClassification - ->>> model = AutoModelForSequenceClassification.from_pretrained(model_name) ->>> tokenizer = AutoTokenizer.from_pretrained(model_name) -``` - - -使用 [`TFAutoModelForSequenceClassification`]和[`AutoTokenizer`] 来加载预训练模型和它关联的分词器 (更多信息可以参考下一节的 `TFAutoClass`): - -```py ->>> from transformers import AutoTokenizer, TFAutoModelForSequenceClassification - ->>> model = TFAutoModelForSequenceClassification.from_pretrained(model_name) ->>> tokenizer = AutoTokenizer.from_pretrained(model_name) -``` - - - -在[`pipeline`]中指定模型和分词器, 现在你就可以在法语文本上使用 `classifier`了: - -```py ->>> classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer) ->>> classifier("Nous sommes très heureux de vous présenter la bibliothèque 🤗 Transformers.") -[{'label': '5 stars', 'score': 0.7273}] -``` - -如果你没有找到适合你的模型, 就需要在你的数据上微调一个预训练模型了. 查看[微调教程](./training) 来学习怎样进行微调. 最后, 微调完模型后, 考虑一下在Hub上与社区 [分享](./model_sharing) 这个模型, 把机器学习普及到每一个人! 🤗 - -## AutoClass - - - -在幕后, 是由[`AutoModelForSequenceClassification`]和[`AutoTokenizer`]一起支持你在上面用到的[`pipeline`]. [AutoClass](./model_doc/auto) 是一个能够通过预训练模型的名称或路径自动查找其架构的快捷方式. 你只需要为你的任务选择合适的 `AutoClass` 和它关联的预处理类. - -让我们回过头来看上一节的示例, 看看怎样使用 `AutoClass` 来重现使用[`pipeline`]的结果. - -### AutoTokenizer - -分词器负责预处理文本, 将文本转换为用于输入模型的数字数组. 有多个用来管理分词过程的规则, 包括如何拆分单词和在什么样的级别上拆分单词 (在 [分词器总结](./tokenizer_summary)学习更多关于分词的信息). 要记住最重要的是你需要实例化的分词器要与模型的名称相同, 来确保和模型训练时使用相同的分词规则. 
- -使用[`AutoTokenizer`]加载一个分词器: - -```py ->>> from transformers import AutoTokenizer - ->>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" ->>> tokenizer = AutoTokenizer.from_pretrained(model_name) -``` - -将文本传入分词器: - -```py ->>> encoding = tokenizer("We are very happy to show you the 🤗 Transformers library.") ->>> print(encoding) -{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102], - 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], - 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]} -``` - -分词器返回了含有如下内容的字典: - -* [input_ids](./glossary#input-ids): 用数字表示的token. -* [attention_mask](.glossary#attention-mask): 应该关注哪些token的指示. - -分词器也可以接受列表作为输入, 并填充和截断文本, 返回具有统一长度的批次: - - - -```py ->>> pt_batch = tokenizer( -... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."], -... padding=True, -... truncation=True, -... max_length=512, -... return_tensors="pt", -... ) -``` - - -```py ->>> tf_batch = tokenizer( -... ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."], -... padding=True, -... truncation=True, -... max_length=512, -... return_tensors="tf", -... ) -``` - - - - - -查阅[预处理](./preprocessing)教程来获得有关分词的更详细的信息, 以及如何使用[`AutoFeatureExtractor`]和[`AutoProcessor`]来处理图像, 音频, 还有多模式输入. - - - -### AutoModel - - - -🤗 Transformers 提供了一种简单统一的方式来加载预训练的实例. 这表示你可以像加载[`AutoTokenizer`]一样加载[`AutoModel`]. 唯一不同的地方是为你的任务选择正确的[`AutoModel`]. 对于文本 (或序列) 分类, 你应该加载[`AutoModelForSequenceClassification`]: - -```py ->>> from transformers import AutoModelForSequenceClassification - ->>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" ->>> pt_model = AutoModelForSequenceClassification.from_pretrained(model_name) -``` - - - -通过[任务摘要](./task_summary)查找[`AutoModel`]支持的任务. - - - -现在可以把预处理好的输入批次直接送进模型. 你只需要添加`**`来解包字典: - -```py ->>> pt_outputs = pt_model(**pt_batch) -``` - -模型在`logits`属性输出最终的激活结果. 在 `logits`上应用softmax函数来查询概率: - -```py ->>> from torch import nn - ->>> pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1) ->>> print(pt_predictions) -tensor([[0.0021, 0.0018, 0.0115, 0.2121, 0.7725], - [0.2084, 0.1826, 0.1969, 0.1755, 0.2365]], grad_fn=) -``` - - -🤗 Transformers 提供了一种简单统一的方式来加载预训练的实例. 这表示你可以像加载[`AutoTokenizer`]一样加载[`TFAutoModel`]. 唯一不同的地方是为你的任务选择正确的[`TFAutoModel`], 对于文本 (或序列) 分类, 你应该加载[`TFAutoModelForSequenceClassification`]: - -```py ->>> from transformers import TFAutoModelForSequenceClassification - ->>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment" ->>> tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name) -``` - - - -通过[任务摘要](./task_summary)查找[`AutoModel`]支持的任务. - - - -现在通过直接将字典的键传给张量,将预处理的输入批次传给模型. - -```py ->>> tf_outputs = tf_model(tf_batch) -``` - -模型在`logits`属性输出最终的激活结果. 在 `logits`上应用softmax函数来查询概率: - -```py ->>> import tensorflow as tf - ->>> tf_predictions = tf.nn.softmax(tf_outputs.logits, axis=-1) ->>> tf_predictions # doctest: +IGNORE_RESULT -``` - - - - - -所有 🤗 Transformers 模型 (PyTorch 或 TensorFlow) 在最终的激活函数(比如softmax)*之前* 输出张量, -因为最终的激活函数常常与loss融合. 模型的输出是特殊的数据类, 所以它们的属性可以在IDE中被自动补全. 模型的输出就像一个元组或字典 (你可以通过整数、切片或字符串来索引它), 在这种情况下, 为None的属性会被忽略. 
- - - -### 保存模型 - - - -当你的模型微调完成, 你就可以使用[`PreTrainedModel.save_pretrained`]把它和它的分词器保存下来: - -```py ->>> pt_save_directory = "./pt_save_pretrained" ->>> tokenizer.save_pretrained(pt_save_directory) # doctest: +IGNORE_RESULT ->>> pt_model.save_pretrained(pt_save_directory) -``` - -当你准备再次使用这个模型时, 就可以使用[`PreTrainedModel.from_pretrained`]加载它了: - -```py ->>> pt_model = AutoModelForSequenceClassification.from_pretrained("./pt_save_pretrained") -``` - - -当你的模型微调完成, 你就可以使用[`TFPreTrainedModel.save_pretrained`]把它和它的分词器保存下来: - -```py ->>> tf_save_directory = "./tf_save_pretrained" ->>> tokenizer.save_pretrained(tf_save_directory) # doctest: +IGNORE_RESULT ->>> tf_model.save_pretrained(tf_save_directory) -``` - -当你准备再次使用这个模型时, 就可以使用[`TFPreTrainedModel.from_pretrained`]加载它了: - -```py ->>> tf_model = TFAutoModelForSequenceClassification.from_pretrained("./tf_save_pretrained") -``` - - - -🤗 Transformers有一个特别酷的功能, 它能够保存一个模型, 并且将它加载为PyTorch或TensorFlow模型. `from_pt`或`from_tf`参数可以将模型从一个框架转换为另一个框架: - - - -```py ->>> from transformers import AutoModel - ->>> tokenizer = AutoTokenizer.from_pretrained(tf_save_directory) ->>> pt_model = AutoModelForSequenceClassification.from_pretrained(tf_save_directory, from_tf=True) -``` - - -```py ->>> from transformers import TFAutoModel - ->>> tokenizer = AutoTokenizer.from_pretrained(pt_save_directory) ->>> tf_model = TFAutoModelForSequenceClassification.from_pretrained(pt_save_directory, from_pt=True) -``` - - - -## 自定义模型构建 - -你可以修改模型的配置类来改变模型的构建方式. 配置指明了模型的属性, 比如隐藏层或者注意力头的数量. 当你从自定义的配置类初始化模型时, 你就开始自定义模型构建了. 模型属性是随机初始化的, 你需要先训练模型, 然后才能得到有意义的结果. - -通过导入[`AutoConfig`]来开始, 之后加载你想修改的预训练模型. 在[`AutoConfig.from_pretrained`]中, 你能够指定想要修改的属性, 比如注意力头的数量: - -```py ->>> from transformers import AutoConfig - ->>> my_config = AutoConfig.from_pretrained("distilbert-base-uncased", n_heads=12) -``` - - - -使用[`AutoModel.from_config`]根据你的自定义配置创建一个模型: - -```py ->>> from transformers import AutoModel - ->>> my_model = AutoModel.from_config(my_config) -``` - - -使用[`TFAutoModel.from_config`]根据你的自定义配置创建一个模型: - -```py ->>> from transformers import TFAutoModel - ->>> my_model = TFAutoModel.from_config(my_config) -``` - - - -查阅[创建一个自定义结构](./create_a_model)指南获取更多关于构建自定义配置的信息. - -## Trainer - PyTorch优化训练循环 - -所有的模型都是标准的[`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module), 所以你可以在任何典型的训练模型中使用它们. 当你编写自己的训练循环时W, 🤗 Transformers为PyTorch提供了一个[`Trainer`]类, 它包含了基础的训练循环并且为诸如分布式训练, 混合精度等特性增加了额外的功能. - -取决于你的任务, 你通常可以传递以下的参数给[`Trainer`]: - -1. [`PreTrainedModel`]或者[`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module): - - ```py - >>> from transformers import AutoModelForSequenceClassification - - >>> model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") - ``` - -2. [`TrainingArguments`]含有你可以修改的模型超参数, 比如学习率, 批次大小和训练时的迭代次数. 如果你没有指定训练参数, 那么它会使用默认值: - - ```py - >>> from transformers import TrainingArguments - - >>> training_args = TrainingArguments( - ... output_dir="path/to/save/folder/", - ... learning_rate=2e-5, - ... per_device_train_batch_size=8, - ... per_device_eval_batch_size=8, - ... num_train_epochs=2, - ... ) - ``` - -3. 一个预处理类, 比如分词器, 特征提取器或者处理器: - - ```py - >>> from transformers import AutoTokenizer - - >>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") - ``` - -4. 加载一个数据集: - - ```py - >>> from datasets import load_dataset - - >>> dataset = load_dataset("rotten_tomatoes") # doctest: +IGNORE_RESULT - ``` - -5. 
创建一个给数据集分词的函数, 并且使用[`~datasets.Dataset.map`]应用到整个数据集: - - ```py - >>> def tokenize_dataset(dataset): - ... return tokenizer(dataset["text"]) - - - >>> dataset = dataset.map(tokenize_dataset, batched=True) - ``` - -6. 用来从数据集中创建批次的[`DataCollatorWithPadding`]: - - ```py - >>> from transformers import DataCollatorWithPadding - - >>> data_collator = DataCollatorWithPadding(tokenizer=tokenizer) - ``` - -现在把所有的类传给[`Trainer`]: - -```py ->>> from transformers import Trainer - ->>> trainer = Trainer( -... model=model, -... args=training_args, -... train_dataset=dataset["train"], -... eval_dataset=dataset["test"], -... tokenizer=tokenizer, -... data_collator=data_collator, -... ) # doctest: +SKIP -``` - -一切准备就绪后, 调用[`~Trainer.train`]进行训练: - -```py ->>> trainer.train() # doctest: +SKIP -``` - - - -对于像翻译或摘要这些使用序列到序列模型的任务, 用[`Seq2SeqTrainer`]和[`Seq2SeqTrainingArguments`]来替代. - - - -你可以通过子类化[`Trainer`]中的方法来自定义训练循环. 这样你就可以自定义像损失函数, 优化器和调度器这样的特性. 查阅[`Trainer`]参考手册了解哪些方法能够被子类化. - -另一个自定义训练循环的方式是通过[回调](./main_classes/callbacks). 你可以使用回调来与其他库集成, 查看训练循环来报告进度或提前结束训练. 回调不会修改训练循环. 如果想自定义损失函数等, 就需要子类化[`Trainer`]了. - -## 使用Tensorflow训练 - -所有模型都是标准的[`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model), 所以你可以通过[Keras](https://keras.io/) API实现在Tensorflow中训练. 🤗 Transformers提供了[`~TFPreTrainedModel.prepare_tf_dataset`]方法来轻松地将数据集加载为`tf.data.Dataset`, 这样你就可以使用Keras的[`compile`](https://keras.io/api/models/model_training_apis/#compile-method)和[`fit`](https://keras.io/api/models/model_training_apis/#fit-method)方法马上开始训练. - -1. 使用[`TFPreTrainedModel`]或者[`tf.keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model)来开始: - - ```py - >>> from transformers import TFAutoModelForSequenceClassification - - >>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") - ``` - -2. 一个预处理类, 比如分词器, 特征提取器或者处理器: - - ```py - >>> from transformers import AutoTokenizer - - >>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") - ``` - -3. 创建一个给数据集分词的函数 - - ```py - >>> def tokenize_dataset(dataset): - ... return tokenizer(dataset["text"]) # doctest: +SKIP - ``` - -4. 使用[`~datasets.Dataset.map`]将分词器应用到整个数据集, 之后将数据集和分词器传给[`~TFPreTrainedModel.prepare_tf_dataset`]. 如果你需要的话, 也可以在这里改变批次大小和是否打乱数据集: - - ```py - >>> dataset = dataset.map(tokenize_dataset) # doctest: +SKIP - >>> tf_dataset = model.prepare_tf_dataset( - ... dataset, batch_size=16, shuffle=True, tokenizer=tokenizer - ... ) # doctest: +SKIP - ``` - -5. 一切准备就绪后, 调用`compile`和`fit`开始训练: - - ```py - >>> from tensorflow.keras.optimizers import Adam - - >>> model.compile(optimizer=Adam(3e-5)) - >>> model.fit(dataset) # doctest: +SKIP - ``` - -## 接下来做什么? - -现在你已经完成了 🤗 Transformers 的快速上手教程, 来看看我们的指南并且学习如何做一些更具体的事情, 比如写一个自定义模型, 为某个任务微调一个模型以及如何使用脚本来训练模型. 如果你有兴趣了解更多 🤗 Transformers 的核心章节, 那就喝杯咖啡然后来看看我们的概念指南吧! 
diff --git a/docs/source/zh/run_scripts.md b/docs/source/zh/run_scripts.md new file mode 100644 index 00000000000000..b6e9c8ea6a2d89 --- /dev/null +++ b/docs/source/zh/run_scripts.md @@ -0,0 +1,359 @@ + + +# 使用脚本进行训练 + +除了 🤗 Transformers [notebooks](./notebooks/README),还有示例脚本演示了如何使用[PyTorch](https://github.com/huggingface/transformers/tree/main/examples/pytorch)、[TensorFlow](https://github.com/huggingface/transformers/tree/main/examples/tensorflow)或[JAX/Flax](https://github.com/huggingface/transformers/tree/main/examples/flax)训练模型以解决特定任务。 + +您还可以在这些示例中找到我们在[研究项目](https://github.com/huggingface/transformers/tree/main/examples/research_projects)和[遗留示例](https://github.com/huggingface/transformers/tree/main/examples/legacy)中使用过的脚本,这些脚本主要是由社区贡献的。这些脚本已不再被积极维护,需要使用特定版本的 🤗 Transformers,可能与库的最新版本不兼容。 + +示例脚本可能无法在初始配置下直接解决每个问题,您可能需要根据要解决的问题调整脚本。为了帮助您,大多数脚本都完全暴露了数据预处理的方式,允许您根据需要对其进行编辑。 + +如果您想在示例脚本中实现任何功能,请在[论坛](https://discuss.huggingface.co/)或[issue](https://github.com/huggingface/transformers/issues)上讨论,然后再提交Pull Request。虽然我们欢迎修复错误,但不太可能合并添加更多功能的Pull Request,因为这会降低可读性。 + +本指南将向您展示如何在[PyTorch](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization)和[TensorFlow](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/summarization)中运行示例摘要训练脚本。除非另有说明,否则所有示例都可以在两个框架中工作。 + +## 设置 + +要成功运行示例脚本的最新版本,您必须在新虚拟环境中**从源代码安装 🤗 Transformers**: + +```bash +git clone https://github.com/huggingface/transformers +cd transformers +pip install . +``` + +对于旧版本的示例脚本,请点击下面的切换按钮: + +
+ 老版本🤗 Transformers示例 + +
+ +然后切换您clone的 🤗 Transformers 仓到特定的版本,例如v3.5.1: + +```bash +git checkout tags/v3.5.1 +``` + +在安装了正确的库版本后,进入您选择的版本的`example`文件夹并安装例子要求的环境: + +```bash +pip install -r requirements.txt +``` + +## 运行脚本 + + + + +示例脚本从🤗 [Datasets](https://huggingface.co/docs/datasets/)库下载并预处理数据集。然后,脚本通过[Trainer](https://huggingface.co/docs/transformers/main_classes/trainer)使用支持摘要任务的架构对数据集进行微调。以下示例展示了如何在[CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail)数据集上微调[T5-small](https://huggingface.co/google-t5/t5-small)。由于T5模型的训练方式,它需要一个额外的`source_prefix`参数。这个提示让T5知道这是一个摘要任务。 + +```bash +python examples/pytorch/summarization/run_summarization.py \ + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --overwrite_output_dir \ + --predict_with_generate +``` + + + +示例脚本从 🤗 [Datasets](https://huggingface.co/docs/datasets/) 库下载并预处理数据集。然后,脚本使用 Keras 在支持摘要的架构上微调数据集。以下示例展示了如何在 [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail) 数据集上微调 [T5-small](https://huggingface.co/google-t5/t5-small)。T5 模型由于训练方式需要额外的 `source_prefix` 参数。这个提示让 T5 知道这是一个摘要任务。 + +```bash +python examples/tensorflow/summarization/run_summarization.py \ + --model_name_or_path google-t5/t5-small \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 16 \ + --num_train_epochs 3 \ + --do_train \ + --do_eval +``` + + + +## 分布式训练和混合精度 + +[Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) 支持分布式训练和混合精度,这意味着你也可以在脚本中使用它。要启用这两个功能,可以做如下设置: + +- 添加 `fp16` 参数以启用混合精度。 +- 使用 `nproc_per_node` 参数设置使用的GPU数量。 + + +```bash +torchrun \ + --nproc_per_node 8 pytorch/summarization/run_summarization.py \ + --fp16 \ + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --overwrite_output_dir \ + --predict_with_generate +``` + +TensorFlow脚本使用[`MirroredStrategy`](https://www.tensorflow.org/guide/distributed_training#mirroredstrategy)进行分布式训练,您无需在训练脚本中添加任何其他参数。如果可用,TensorFlow脚本将默认使用多个GPU。 + +## 在TPU上运行脚本 + + + + +张量处理单元(TPUs)是专门设计用于加速性能的。PyTorch使用[XLA](https://www.tensorflow.org/xla)深度学习编译器支持TPU(更多细节请参见[这里](https://github.com/pytorch/xla/blob/master/README.md))。要使用TPU,请启动`xla_spawn.py`脚本并使用`num_cores`参数设置要使用的TPU核心数量。 + +```bash +python xla_spawn.py --num_cores 8 \ + summarization/run_summarization.py \ + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --overwrite_output_dir \ + --predict_with_generate +``` + + + +张量处理单元(TPUs)是专门设计用于加速性能的。TensorFlow脚本使用[`TPUStrategy`](https://www.tensorflow.org/guide/distributed_training#tpustrategy)在TPU上进行训练。要使用TPU,请将TPU资源的名称传递给`tpu`参数。 + +```bash +python run_summarization.py \ + --tpu name_of_tpu_resource \ + --model_name_or_path google-t5/t5-small \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size 8 \ + --per_device_eval_batch_size 16 \ + 
--num_train_epochs 3 \ + --do_train \ + --do_eval +``` + + + +## 基于🤗 Accelerate运行脚本 + +🤗 [Accelerate](https://huggingface.co/docs/accelerate) 是一个仅支持 PyTorch 的库,它提供了一种统一的方法来在不同类型的设置(仅 CPU、多个 GPU、多个TPU)上训练模型,同时保持对 PyTorch 训练循环的完全可见性。如果你还没有安装 🤗 Accelerate,请确保你已经安装了它: + +> 注意:由于 Accelerate 正在快速发展,因此必须安装 git 版本的 accelerate 来运行脚本。 + +```bash +pip install git+https://github.com/huggingface/accelerate +``` + +你需要使用`run_summarization_no_trainer.py`脚本,而不是`run_summarization.py`脚本。🤗 Accelerate支持的脚本需要在文件夹中有一个`task_no_trainer.py`文件。首先运行以下命令以创建并保存配置文件: + +```bash +accelerate config +``` +检测您的设置以确保配置正确: + +```bash +accelerate test +``` + +现在您可以开始训练模型了: + +```bash +accelerate launch run_summarization_no_trainer.py \ + --model_name_or_path google-t5/t5-small \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir ~/tmp/tst-summarization +``` + +## 使用自定义数据集 + +摘要脚本支持自定义数据集,只要它们是CSV或JSON Line文件。当你使用自己的数据集时,需要指定一些额外的参数: +- `train_file` 和 `validation_file` 分别指定您的训练和验证文件的路径。 +- `text_column` 是输入要进行摘要的文本。 +- `summary_column` 是目标输出的文本。 + +使用自定义数据集的摘要脚本看起来是这样的: + + +```bash +python examples/pytorch/summarization/run_summarization.py \ + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --train_file path_to_csv_or_jsonlines_file \ + --validation_file path_to_csv_or_jsonlines_file \ + --text_column text_column_name \ + --summary_column summary_column_name \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --overwrite_output_dir \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --predict_with_generate +``` + +## 测试脚本 + +通常,在提交整个数据集之前,最好先在较少的数据集示例上运行脚本,以确保一切按预期工作,因为完整数据集的处理可能需要花费几个小时的时间。使用以下参数将数据集截断为最大样本数: + +- `max_train_samples` +- `max_eval_samples` +- `max_predict_samples` + + +```bash +python examples/pytorch/summarization/run_summarization.py \ + --model_name_or_path google-t5/t5-small \ + --max_train_samples 50 \ + --max_eval_samples 50 \ + --max_predict_samples 50 \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --overwrite_output_dir \ + --predict_with_generate +``` + +并非所有示例脚本都支持`max_predict_samples`参数。如果您不确定您的脚本是否支持此参数,请添加`-h`参数进行检查: + +```bash +examples/pytorch/summarization/run_summarization.py -h +``` + +## 从checkpoint恢复训练 + +另一个有用的选项是从之前的checkpoint恢复训练。这将确保在训练中断时,您可以从之前停止的地方继续进行,而无需重新开始。有两种方法可以从checkpoint恢复训练。 + +第一种方法使用`output_dir previous_output_dir`参数从存储在`output_dir`中的最新的checkpoint恢复训练。在这种情况下,您应该删除`overwrite_output_dir`: + +```bash +python examples/pytorch/summarization/run_summarization.py + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --output_dir previous_output_dir \ + --predict_with_generate +``` + +第二种方法使用`resume_from_checkpoint path_to_specific_checkpoint`参数从特定的checkpoint文件夹恢复训练。 + + +```bash +python examples/pytorch/summarization/run_summarization.py + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + 
--overwrite_output_dir \ + --resume_from_checkpoint path_to_specific_checkpoint \ + --predict_with_generate +``` + +## 分享模型 + +所有脚本都可以将您的最终模型上传到[Model Hub](https://huggingface.co/models)。在开始之前,请确保您已登录Hugging Face: + +```bash +huggingface-cli login +``` + +然后,在脚本中添加`push_to_hub`参数。这个参数会创建一个带有您Hugging Face用户名和`output_dir`中指定的文件夹名称的仓库。 + +为了给您的仓库指定一个特定的名称,使用`push_to_hub_model_id`参数来添加它。该仓库将自动列出在您的命名空间下。 + +以下示例展示了如何上传具有特定仓库名称的模型: + + +```bash +python examples/pytorch/summarization/run_summarization.py + --model_name_or_path google-t5/t5-small \ + --do_train \ + --do_eval \ + --dataset_name cnn_dailymail \ + --dataset_config "3.0.0" \ + --source_prefix "summarize: " \ + --push_to_hub \ + --push_to_hub_model_id finetuned-t5-cnn_dailymail \ + --output_dir /tmp/tst-summarization \ + --per_device_train_batch_size=4 \ + --per_device_eval_batch_size=4 \ + --overwrite_output_dir \ + --predict_with_generate +``` \ No newline at end of file diff --git a/docs/source/zh/serialization.md b/docs/source/zh/serialization.md new file mode 100644 index 00000000000000..b9cc74e5849d63 --- /dev/null +++ b/docs/source/zh/serialization.md @@ -0,0 +1,181 @@ + + +# 导出为 ONNX + +在生产环境中部署 🤗 Transformers 模型通常需要或者能够受益于,将模型导出为可在专门的运行时和硬件上加载和执行的序列化格式。 + +🤗 Optimum 是 Transformers 的扩展,可以通过其 `exporters` 模块将模型从 PyTorch 或 TensorFlow 导出为 ONNX 及 TFLite 等序列化格式。🤗 Optimum 还提供了一套性能优化工具,可以在目标硬件上以最高效率训练和运行模型。 + +本指南演示了如何使用 🤗 Optimum 将 🤗 Transformers 模型导出为 ONNX。有关将模型导出为 TFLite 的指南,请参考 [导出为 TFLite 页面](tflite)。 + +## 导出为 ONNX + +[ONNX (Open Neural Network eXchange 开放神经网络交换)](http://onnx.ai) 是一个开放的标准,它定义了一组通用的运算符和一种通用的文件格式,用于表示包括 PyTorch 和 TensorFlow 在内的各种框架中的深度学习模型。当一个模型被导出为 ONNX时,这些运算符被用于构建计算图(通常被称为*中间表示*),该图表示数据在神经网络中的流动。 + +通过公开具有标准化运算符和数据类型的图,ONNX使得模型能够轻松在不同深度学习框架间切换。例如,在 PyTorch 中训练的模型可以被导出为 ONNX,然后再导入到 TensorFlow(反之亦然)。 + +导出为 ONNX 后,模型可以: +- 通过 [图优化(graph optimization)](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization) 和 [量化(quantization)](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/quantization) 等技术进行推理优化。 +- 通过 [`ORTModelForXXX` 类](https://huggingface.co/docs/optimum/onnxruntime/package_reference/modeling_ort) 使用 ONNX Runtime 运行,它同样遵循你熟悉的 Transformers 中的 `AutoModel` API。 +- 使用 [优化推理流水线(pipeline)](https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/pipelines) 运行,其 API 与 🤗 Transformers 中的 [`pipeline`] 函数相同。 + +🤗 Optimum 通过利用配置对象提供对 ONNX 导出的支持。多种模型架构已经有现成的配置对象,并且配置对象也被设计得易于扩展以适用于其他架构。 + +现有的配置列表请参考 [🤗 Optimum 文档](https://huggingface.co/docs/optimum/exporters/onnx/overview)。 + +有两种方式可以将 🤗 Transformers 模型导出为 ONNX,这里我们展示这两种方法: + +- 使用 🤗 Optimum 的 CLI(命令行)导出。 +- 使用 🤗 Optimum 的 `optimum.onnxruntime` 模块导出。 + +### 使用 CLI 将 🤗 Transformers 模型导出为 ONNX + +要将 🤗 Transformers 模型导出为 ONNX,首先需要安装额外的依赖项: + +```bash +pip install optimum[exporters] +``` + +请参阅 [🤗 Optimum 文档](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/export_a_model#exporting-a-model-to-onnx-using-the-cli) 以查看所有可用参数,或者在命令行中查看帮助: + +```bash +optimum-cli export onnx --help +``` + +运行以下命令,以从 🤗 Hub 导出模型的检查点(checkpoint),以 `distilbert/distilbert-base-uncased-distilled-squad` 为例: + +```bash +optimum-cli export onnx --model distilbert/distilbert-base-uncased-distilled-squad distilbert_base_uncased_squad_onnx/ +``` + +你应该能在日志中看到导出进度以及生成的 `model.onnx` 文件的保存位置,如下所示: + +```bash +Validating ONNX model distilbert_base_uncased_squad_onnx/model.onnx... 
+ -[✓] ONNX model output names match reference model (start_logits, end_logits) + - Validating ONNX Model output "start_logits": + -[✓] (2, 16) matches (2, 16) + -[✓] all values close (atol: 0.0001) + - Validating ONNX Model output "end_logits": + -[✓] (2, 16) matches (2, 16) + -[✓] all values close (atol: 0.0001) +The ONNX export succeeded and the exported model was saved at: distilbert_base_uncased_squad_onnx +``` + +上面的示例说明了从 🤗 Hub 导出检查点的过程。导出本地模型时,首先需要确保将模型的权重和分词器文件保存在同一目录(`local_path`)中。在使用 CLI 时,将 `local_path` 传递给 `model` 参数,而不是 🤗 Hub 上的检查点名称,并提供 `--task` 参数。你可以在 [🤗 Optimum 文档](https://huggingface.co/docs/optimum/exporters/task_manager)中查看支持的任务列表。如果未提供 `task` 参数,将默认导出不带特定任务头的模型架构。 + +```bash +optimum-cli export onnx --model local_path --task question-answering distilbert_base_uncased_squad_onnx/ +``` + +生成的 `model.onnx` 文件可以在支持 ONNX 标准的 [许多加速引擎(accelerators)](https://onnx.ai/supported-tools.html#deployModel) 之一上运行。例如,我们可以使用 [ONNX Runtime](https://onnxruntime.ai/) 加载和运行模型,如下所示: + +```python +>>> from transformers import AutoTokenizer +>>> from optimum.onnxruntime import ORTModelForQuestionAnswering + +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert_base_uncased_squad_onnx") +>>> model = ORTModelForQuestionAnswering.from_pretrained("distilbert_base_uncased_squad_onnx") +>>> inputs = tokenizer("What am I using?", "Using DistilBERT with ONNX Runtime!", return_tensors="pt") +>>> outputs = model(**inputs) +``` + +从 Hub 导出 TensorFlow 检查点的过程也一样。例如,以下是从 [Keras 组织](https://huggingface.co/keras-io) 导出纯 TensorFlow 检查点的命令: + +```bash +optimum-cli export onnx --model keras-io/transformers-qa distilbert_base_cased_squad_onnx/ +``` + +### 使用 `optimum.onnxruntime` 将 🤗 Transformers 模型导出为 ONNX + +除了 CLI 之外,你还可以使用代码将 🤗 Transformers 模型导出为 ONNX,如下所示: + +```python +>>> from optimum.onnxruntime import ORTModelForSequenceClassification +>>> from transformers import AutoTokenizer + +>>> model_checkpoint = "distilbert_base_uncased_squad" +>>> save_directory = "onnx/" + +>>> # 从 transformers 加载模型并将其导出为 ONNX +>>> ort_model = ORTModelForSequenceClassification.from_pretrained(model_checkpoint, export=True) +>>> tokenizer = AutoTokenizer.from_pretrained(model_checkpoint) + +>>> # 保存 onnx 模型以及分词器 +>>> ort_model.save_pretrained(save_directory) +>>> tokenizer.save_pretrained(save_directory) +``` + +### 导出尚未支持的架构的模型 + +如果你想要为当前无法导出的模型添加支持,请先检查 [`optimum.exporters.onnx`](https://huggingface.co/docs/optimum/exporters/onnx/overview) 是否支持该模型,如果不支持,你可以 [直接为 🤗 Optimum 贡献代码](https://huggingface.co/docs/optimum/exporters/onnx/usage_guides/contribute)。 + +### 使用 `transformers.onnx` 导出模型 + + + +`tranformers.onnx` 不再进行维护,请如上所述,使用 🤗 Optimum 导出模型。这部分内容将在未来版本中删除。 + + + +要使用 `tranformers.onnx` 将 🤗 Transformers 模型导出为 ONNX,请安装额外的依赖项: + +```bash +pip install transformers[onnx] +``` + +将 `transformers.onnx` 包作为 Python 模块使用,以使用现成的配置导出检查点: + +```bash +python -m transformers.onnx --model=distilbert/distilbert-base-uncased onnx/ +``` + +以上代码将导出由 `--model` 参数定义的检查点的 ONNX 图。传入任何 🤗 Hub 上或者存储与本地的检查点。生成的 `model.onnx` 文件可以在支持 ONNX 标准的众多加速引擎上运行。例如,使用 ONNX Runtime 加载并运行模型,如下所示: + +```python +>>> from transformers import AutoTokenizer +>>> from onnxruntime import InferenceSession + +>>> tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") +>>> session = InferenceSession("onnx/model.onnx") +>>> # ONNX Runtime expects NumPy arrays as input +>>> inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np") +>>> outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs)) +``` + 
+可以通过查看每个模型的 ONNX 配置来获取所需的输出名(例如 `["last_hidden_state"]`)。例如,对于 DistilBERT,可以用以下代码获取输出名称: + +```python +>>> from transformers.models.distilbert import DistilBertConfig, DistilBertOnnxConfig + +>>> config = DistilBertConfig() +>>> onnx_config = DistilBertOnnxConfig(config) +>>> print(list(onnx_config.outputs.keys())) +["last_hidden_state"] +``` + +从 Hub 导出 TensorFlow 检查点的过程也一样。导出纯 TensorFlow 检查点的示例代码如下: + +```bash +python -m transformers.onnx --model=keras-io/transformers-qa onnx/ +``` + +要导出本地存储的模型,请将模型的权重和分词器文件保存在同一目录中(例如 `local-pt-checkpoint`),然后通过将 `transformers.onnx` 包的 `--model` 参数指向该目录,将其导出为 ONNX: + +```bash +python -m transformers.onnx --model=local-pt-checkpoint onnx/ +``` \ No newline at end of file diff --git a/docs/source/zh/task_summary.md b/docs/source/zh/task_summary.md new file mode 100644 index 00000000000000..8d088bfa71b2d0 --- /dev/null +++ b/docs/source/zh/task_summary.md @@ -0,0 +1,347 @@ + + +# 🤗 Transformers 能做什么 + +🤗 Transformers是一个用于自然语言处理(NLP)、计算机视觉和音频和语音处理任务的预训练模型库。该库不仅包含Transformer模型,还包括用于计算机视觉任务的现代卷积网络等非Transformer模型。如果您看看今天最受欢迎的一些消费产品,比如智能手机、应用程序和电视,很可能背后都有某种深度学习技术的支持。想要从您智能手机拍摄的照片中删除背景对象吗?这里是一个全景分割任务的例子(如果您还不了解这是什么意思,我们将在以下部分进行描述!)。 + +本页面提供了使用🤗 Transformers库仅用三行代码解决不同的语音和音频、计算机视觉和NLP任务的概述! + + +## 音频 +音频和语音处理任务与其他模态略有不同,主要是因为音频作为输入是一个连续的信号。与文本不同,原始音频波形不能像句子可以被划分为单词那样被整齐地分割成离散的块。为了解决这个问题,通常在固定的时间间隔内对原始音频信号进行采样。如果在每个时间间隔内采样更多样本,采样率就会更高,音频更接近原始音频源。 + +以前的方法是预处理音频以从中提取有用的特征。现在更常见的做法是直接将原始音频波形输入到特征编码器中,以提取音频表示。这样可以简化预处理步骤,并允许模型学习最重要的特征。 + +### 音频分类 + +音频分类是一项将音频数据从预定义的类别集合中进行标记的任务。这是一个广泛的类别,具有许多具体的应用,其中一些包括: + +* 声学场景分类:使用场景标签("办公室"、"海滩"、"体育场")对音频进行标记。 +* 声学事件检测:使用声音事件标签("汽车喇叭声"、"鲸鱼叫声"、"玻璃破碎声")对音频进行标记。 +* 标记:对包含多种声音的音频进行标记(鸟鸣、会议中的说话人识别)。 +* 音乐分类:使用流派标签("金属"、"嘻哈"、"乡村")对音乐进行标记。 + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline(task="audio-classification", model="superb/hubert-base-superb-er") +>>> preds = classifier("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds] +>>> preds +[{'score': 0.4532, 'label': 'hap'}, + {'score': 0.3622, 'label': 'sad'}, + {'score': 0.0943, 'label': 'neu'}, + {'score': 0.0903, 'label': 'ang'}] +``` + +### 自动语音识别 + +自动语音识别(ASR)将语音转录为文本。这是最常见的音频任务之一,部分原因是因为语音是人类交流的自然形式。如今,ASR系统嵌入在智能技术产品中,如扬声器、电话和汽车。我们可以要求虚拟助手播放音乐、设置提醒和告诉我们天气。 + +但是,Transformer架构帮助解决的一个关键挑战是低资源语言。通过在大量语音数据上进行预训练,仅在一个低资源语言的一小时标记语音数据上进行微调,仍然可以产生与以前在100倍更多标记数据上训练的ASR系统相比高质量的结果。 + +```py +>>> from transformers import pipeline + +>>> transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-small") +>>> transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac") +{'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'} +``` + +## 计算机视觉 + +计算机视觉任务中最早成功之一是使用卷积神经网络([CNN](glossary#convolution))识别邮政编码数字图像。图像由像素组成,每个像素都有一个数值。这使得将图像表示为像素值矩阵变得容易。每个像素值组合描述了图像的颜色。 + +计算机视觉任务可以通过以下两种通用方式解决: + +1. 使用卷积来学习图像的层次特征,从低级特征到高级抽象特征。 +2. 将图像分成块,并使用Transformer逐步学习每个图像块如何相互关联以形成图像。与CNN偏好的自底向上方法不同,这种方法有点像从一个模糊的图像开始,然后逐渐将其聚焦清晰。 + +### 图像分类 + +图像分类将整个图像从预定义的类别集合中进行标记。像大多数分类任务一样,图像分类有许多实际用例,其中一些包括: + +* 医疗保健:标记医学图像以检测疾病或监测患者健康状况 +* 环境:标记卫星图像以监测森林砍伐、提供野外管理信息或检测野火 +* 农业:标记农作物图像以监测植物健康或用于土地使用监测的卫星图像 +* 生态学:标记动物或植物物种的图像以监测野生动物种群或跟踪濒危物种 + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline(task="image-classification") +>>> preds = classifier( +... 
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +... ) +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds] +>>> print(*preds, sep="\n") +{'score': 0.4335, 'label': 'lynx, catamount'} +{'score': 0.0348, 'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor'} +{'score': 0.0324, 'label': 'snow leopard, ounce, Panthera uncia'} +{'score': 0.0239, 'label': 'Egyptian cat'} +{'score': 0.0229, 'label': 'tiger cat'} +``` + +### 目标检测 + +与图像分类不同,目标检测在图像中识别多个对象以及这些对象在图像中的位置(由边界框定义)。目标检测的一些示例应用包括: + +* 自动驾驶车辆:检测日常交通对象,如其他车辆、行人和红绿灯 +* 遥感:灾害监测、城市规划和天气预报 +* 缺陷检测:检测建筑物中的裂缝或结构损坏,以及制造业产品缺陷 + + +```py +>>> from transformers import pipeline + +>>> detector = pipeline(task="object-detection") +>>> preds = detector( +... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +... ) +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"], "box": pred["box"]} for pred in preds] +>>> preds +[{'score': 0.9865, + 'label': 'cat', + 'box': {'xmin': 178, 'ymin': 154, 'xmax': 882, 'ymax': 598}}] +``` + +### 图像分割 + +图像分割是一项像素级任务,将图像中的每个像素分配给一个类别。它与使用边界框标记和预测图像中的对象的目标检测不同,因为分割更加精细。分割可以在像素级别检测对象。有几种类型的图像分割: + +* 实例分割:除了标记对象的类别外,还标记每个对象的不同实例(“dog-1”,“dog-2”) +* 全景分割:语义分割和实例分割的组合; 它使用语义类为每个像素标记并标记每个对象的不同实例 + +分割任务对于自动驾驶车辆很有帮助,可以创建周围世界的像素级地图,以便它们可以在行人和其他车辆周围安全导航。它还适用于医学成像,其中任务的更精细粒度可以帮助识别异常细胞或器官特征。图像分割也可以用于电子商务,通过您的相机在现实世界中覆盖物体来虚拟试穿衣服或创建增强现实体验。 + +```py +>>> from transformers import pipeline + +>>> segmenter = pipeline(task="image-segmentation") +>>> preds = segmenter( +... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +... ) +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds] +>>> print(*preds, sep="\n") +{'score': 0.9879, 'label': 'LABEL_184'} +{'score': 0.9973, 'label': 'snow'} +{'score': 0.9972, 'label': 'cat'} +``` + +### 深度估计 + +深度估计预测图像中每个像素到相机的距离。这个计算机视觉任务对于场景理解和重建尤为重要。例如,在自动驾驶汽车中,车辆需要了解行人、交通标志和其他车辆等物体的距离,以避免障碍物和碰撞。深度信息还有助于从2D图像构建3D表示,并可用于创建生物结构或建筑物的高质量3D表示。 + +有两种方法可以进行深度估计: + +* stereo(立体):通过比较同一图像的两个略微不同角度的图像来估计深度 +* monocular(单目):从单个图像中估计深度 + + +```py +>>> from transformers import pipeline + +>>> depth_estimator = pipeline(task="depth-estimation") +>>> preds = depth_estimator( +... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg" +... ) +``` + +## 自然语言处理 + +NLP任务是最常见的类型之一,因为文本是我们进行交流的自然方式。为了让文本变成模型识别的格式,需要对其进行分词。这意味着将一段文本分成单独的单词或子词(`tokens`),然后将这些`tokens`转换为数字。因此,可以将一段文本表示为一系列数字,一旦有了一系列的数字,就可以将其输入到模型中以解决各种NLP任务! 
+ +### 文本分类 + +像任何模态的分类任务一样,文本分类将一段文本(可以是句子级别、段落或文档)从预定义的类别集合中进行标记。文本分类有许多实际应用,其中一些包括: + +* 情感分析:根据某些极性(如`积极`或`消极`)对文本进行标记,可以支持政治、金融和营销等领域的决策制定 +* 内容分类:根据某些主题对文本进行标记,有助于组织和过滤新闻和社交媒体提要中的信息(`天气`、`体育`、`金融`等) + + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline(task="sentiment-analysis") +>>> preds = classifier("Hugging Face is the best thing since sliced bread!") +>>> preds = [{"score": round(pred["score"], 4), "label": pred["label"]} for pred in preds] +>>> preds +[{'score': 0.9991, 'label': 'POSITIVE'}] +``` + +### Token分类 + +在任何NLP任务中,文本都经过预处理,将文本序列分成单个单词或子词。这些被称为[tokens](/glossary#token)。Token分类将每个`token`分配一个来自预定义类别集的标签。 + +两种常见的Token分类是: + +* 命名实体识别(NER):根据实体类别(如组织、人员、位置或日期)对`token`进行标记。NER在生物医学设置中特别受欢迎,可以标记基因、蛋白质和药物名称。 +* 词性标注(POS):根据其词性(如名词、动词或形容词)对标记进行标记。POS对于帮助翻译系统了解两个相同的单词如何在语法上不同很有用(作为名词的银行与作为动词的银行)。 + +```py +>>> from transformers import pipeline + +>>> classifier = pipeline(task="ner") +>>> preds = classifier("Hugging Face is a French company based in New York City.") +>>> preds = [ +... { +... "entity": pred["entity"], +... "score": round(pred["score"], 4), +... "index": pred["index"], +... "word": pred["word"], +... "start": pred["start"], +... "end": pred["end"], +... } +... for pred in preds +... ] +>>> print(*preds, sep="\n") +{'entity': 'I-ORG', 'score': 0.9968, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2} +{'entity': 'I-ORG', 'score': 0.9293, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7} +{'entity': 'I-ORG', 'score': 0.9763, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12} +{'entity': 'I-MISC', 'score': 0.9983, 'index': 6, 'word': 'French', 'start': 18, 'end': 24} +{'entity': 'I-LOC', 'score': 0.999, 'index': 10, 'word': 'New', 'start': 42, 'end': 45} +{'entity': 'I-LOC', 'score': 0.9987, 'index': 11, 'word': 'York', 'start': 46, 'end': 50} +{'entity': 'I-LOC', 'score': 0.9992, 'index': 12, 'word': 'City', 'start': 51, 'end': 55} +``` + +### 问答 + +问答是另一个`token-level`的任务,返回一个问题的答案,有时带有上下文(开放领域),有时不带上下文(封闭领域)。每当我们向虚拟助手提出问题时,例如询问一家餐厅是否营业,就会发生这种情况。它还可以提供客户或技术支持,并帮助搜索引擎检索您要求的相关信息。 + +有两种常见的问答类型: + +* 提取式:给定一个问题和一些上下文,答案是从模型必须提取的上下文中的一段文本跨度。 +* 抽象式:给定一个问题和一些上下文,答案从上下文中生成;这种方法由[`Text2TextGenerationPipeline`]处理,而不是下面显示的[`QuestionAnsweringPipeline`]。 + + +```py +>>> from transformers import pipeline + +>>> question_answerer = pipeline(task="question-answering") +>>> preds = question_answerer( +... question="What is the name of the repository?", +... context="The name of the repository is huggingface/transformers", +... ) +>>> print( +... f"score: {round(preds['score'], 4)}, start: {preds['start']}, end: {preds['end']}, answer: {preds['answer']}" +... ) +score: 0.9327, start: 30, end: 54, answer: huggingface/transformers +``` + +### 摘要 + +摘要从较长的文本中创建一个较短的版本,同时尽可能保留原始文档的大部分含义。摘要是一个序列到序列的任务;它输出比输入更短的文本序列。有许多长篇文档可以进行摘要,以帮助读者快速了解主要要点。法案、法律和财务文件、专利和科学论文等文档可以摘要,以节省读者的时间并作为阅读辅助工具。 + +像问答一样,摘要有两种类型: + +* 提取式:从原始文本中识别和提取最重要的句子 +* 抽象式:从原始文本生成目标摘要(可能包括不在输入文档中的新单词);[`SummarizationPipeline`]使用抽象方法。 + + +```py +>>> from transformers import pipeline + +>>> summarizer = pipeline(task="summarization") +>>> summarizer( +... "In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. 
On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles." +... ) +[{'summary_text': ' The Transformer is the first sequence transduction model based entirely on attention . It replaces the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention . For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers .'}] +``` + +### 翻译 + +翻译将一种语言的文本序列转换为另一种语言。它对于帮助来自不同背景的人们相互交流、帮助翻译内容以吸引更广泛的受众,甚至成为学习工具以帮助人们学习一门新语言都非常重要。除了摘要之外,翻译也是一个序列到序列的任务,意味着模型接收输入序列并返回目标输出序列。 + +在早期,翻译模型大多是单语的,但最近,越来越多的人对可以在多种语言之间进行翻译的多语言模型感兴趣。 + +```py +>>> from transformers import pipeline + +>>> text = "translate English to French: Hugging Face is a community-based open-source platform for machine learning." +>>> translator = pipeline(task="translation", model="google-t5/t5-small") +>>> translator(text) +[{'translation_text': "Hugging Face est une tribune communautaire de l'apprentissage des machines."}] +``` + +### 语言模型 + +语言模型是一种预测文本序列中单词的任务。它已成为一种非常流行的NLP任务,因为预训练的语言模型可以在微调后用于许多其他下游任务。最近,人们对大型语言模型(LLMs)表现出了极大的兴趣,这些模型展示了`zero-shot learning`或`few-shot learning`的能力。这意味着模型可以解决它未被明确训练过的任务!语言模型可用于生成流畅和令人信服的文本,但需要小心,因为文本可能并不总是准确的。 + +有两种类型的语言模型: + +* causal:模型的目标是预测序列中的下一个`token`,而未来的`tokens`被遮盖。 + + + ```py + >>> from transformers import pipeline + + >>> prompt = "Hugging Face is a community-based open-source platform for machine learning." + >>> generator = pipeline(task="text-generation") + >>> generator(prompt) # doctest: +SKIP + ``` + +* masked:模型的目标是预测序列中被遮蔽的`token`,同时具有对序列中所有`tokens`的完全访问权限。 + + + ```py + >>> text = "Hugging Face is a community-based open-source <mask> for machine learning." + >>> fill_mask = pipeline(task="fill-mask") + >>> preds = fill_mask(text, top_k=1) + >>> preds = [ + ... { + ... "score": round(pred["score"], 4), + ... "token": pred["token"], + ... "token_str": pred["token_str"], + ... "sequence": pred["sequence"], + ... } + ... for pred in preds + ... ] + >>> preds + [{'score': 0.2236, + 'token': 1761, + 'token_str': ' platform', + 'sequence': 'Hugging Face is a community-based open-source platform for machine learning.'}] + ``` + +## 多模态 + +多模态任务要求模型处理多种数据模态(文本、图像、音频、视频)以解决特定问题。图像描述是一个多模态任务的例子,其中模型将图像作为输入并输出描述图像或图像某些属性的文本序列。 + +虽然多模态模型处理不同的数据类型或模态,但内部预处理步骤帮助模型将所有数据类型转换为`embeddings`(向量或数字列表,包含有关数据的有意义信息)。对于像图像描述这样的任务,模型学习图像嵌入和文本嵌入之间的关系。 + +### 文档问答 + +文档问答是从文档中回答自然语言问题的任务。与`token-level`问答任务不同,文档问答将包含问题的文档的图像作为输入,并返回答案。文档问答可用于解析结构化文档并从中提取关键信息。在下面的例子中,可以从收据中提取总金额和找零金额。 + +```py +>>> from transformers import pipeline +>>> from PIL import Image +>>> import requests + +>>> url = "https://datasets-server.huggingface.co/assets/hf-internal-testing/example-documents/--/hf-internal-testing--example-documents/test/2/image/image.jpg" +>>> image = Image.open(requests.get(url, stream=True).raw) + +>>> doc_question_answerer = pipeline("document-question-answering", model="magorshunov/layoutlm-invoices") +>>> preds = doc_question_answerer( +... question="What is the total amount?", +... image=image, +... 
) +>>> preds +[{'score': 0.8531, 'answer': '17,000', 'start': 4, 'end': 4}] +``` + +希望这个页面为您提供了一些有关每种模态中所有类型任务的背景信息以及每个任务的实际重要性。在[下一节](tasks_explained)中,您将了解Transformers如何解决这些任务。 diff --git a/docs/source/zh/tf_xla.md b/docs/source/zh/tf_xla.md new file mode 100644 index 00000000000000..2e5b444d876c0a --- /dev/null +++ b/docs/source/zh/tf_xla.md @@ -0,0 +1,179 @@ + + +# 用于 TensorFlow 模型的 XLA 集成 + +[[open-in-colab]] + +加速线性代数,也称为XLA,是一个用于加速TensorFlow模型运行时间的编译器。从[官方文档](https://www.tensorflow.org/xla)中可以看到: + +XLA(加速线性代数)是一种针对线性代数的特定领域编译器,可以在可能不需要更改源代码的情况下加速TensorFlow模型。 + +在TensorFlow中使用XLA非常简单——它包含在`tensorflow`库中,并且可以使用任何图创建函数中的`jit_compile`参数来触发,例如[`tf.function`](https://www.tensorflow.org/guide/intro_to_graphs)。在使用Keras方法如`fit()`和`predict()`时,只需将`jit_compile`参数传递给`model.compile()`即可启用XLA。然而,XLA不仅限于这些方法 - 它还可以用于加速任何任意的`tf.function`。 + +在🤗 Transformers中,几个TensorFlow方法已经被重写为与XLA兼容,包括[GPT2](https://huggingface.co/docs/transformers/model_doc/gpt2)、[T5](https://huggingface.co/docs/transformers/model_doc/t5)和[OPT](https://huggingface.co/docs/transformers/model_doc/opt)等文本生成模型,以及[Whisper](https://huggingface.co/docs/transformers/model_doc/whisper)等语音处理模型。 + +虽然确切的加速倍数很大程度上取决于模型,但对于🤗 Transformers中的TensorFlow文本生成模型,我们注意到速度提高了约100倍。本文档将解释如何在这些模型上使用XLA获得最大的性能。如果您有兴趣了解更多关于基准测试和我们在XLA集成背后的设计哲学的信息,我们还将提供额外的资源链接。 + + +## 使用 XLA 运行 TensorFlow 函数 + +让我们考虑以下TensorFlow 中的模型: + +```py +import tensorflow as tf + +model = tf.keras.Sequential( + [tf.keras.layers.Dense(10, input_shape=(10,), activation="relu"), tf.keras.layers.Dense(5, activation="softmax")] +) +``` + +上述模型接受维度为 `(10,)` 的输入。我们可以像下面这样使用模型进行前向传播: + +```py +# Generate random inputs for the model. +batch_size = 16 +input_vector_dim = 10 +random_inputs = tf.random.normal((batch_size, input_vector_dim)) + +# Run a forward pass. +_ = model(random_inputs) +``` + +为了使用 XLA 编译的函数运行前向传播,我们需要执行以下操作: + +```py +xla_fn = tf.function(model, jit_compile=True) +_ = xla_fn(random_inputs) +``` + +`model`的默认`call()`函数用于编译XLA图。但如果你想将其他模型函数编译成XLA,也是可以的,如下所示: + +```py +my_xla_fn = tf.function(model.my_xla_fn, jit_compile=True) +``` + +## 在🤗 Transformers库中使用XLA运行TensorFlow文本生成模型 + +要在🤗 Transformers中启用XLA加速生成,您需要安装最新版本的`transformers`。您可以通过运行以下命令来安装它: + +```bash +pip install transformers --upgrade +``` + +然后您可以运行以下代码: + +```py +import tensorflow as tf +from transformers import AutoTokenizer, TFAutoModelForCausalLM + +# Will error if the minimal version of Transformers is not installed. 
+from transformers.utils import check_min_version + +check_min_version("4.21.0") + + +tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2", padding_side="left", pad_token="</s>") +model = TFAutoModelForCausalLM.from_pretrained("openai-community/gpt2") +input_string = ["TensorFlow is"] + +# One line to create an XLA generation function +xla_generate = tf.function(model.generate, jit_compile=True) + +tokenized_input = tokenizer(input_string, return_tensors="tf") +generated_tokens = xla_generate(**tokenized_input, num_beams=2) + +decoded_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True) +print(f"Generated -- {decoded_text}") +# Generated -- TensorFlow is an open-source, open-source, distributed-source application # framework for the +``` + +正如您所注意到的,在`generate()`上启用XLA只需要一行代码。其余部分代码保持不变。然而,上面的代码片段中有一些与XLA相关的注意事项。您需要了解这些注意事项,以充分利用XLA可能带来的性能提升。我们将在下面的部分讨论这些内容。 + +## 需要关注的注意事项 + +当您首次执行启用XLA的函数(如上面的`xla_generate()`)时,它将在内部尝试推断计算图,这是一个耗时的过程。这个过程被称为[“tracing”](https://www.tensorflow.org/guide/intro_to_graphs#when_is_a_function_tracing)。 + +您可能会注意到生成时间并不快。连续调用`xla_generate()`(或任何其他启用了XLA的函数)不需要再次推断计算图,只要函数的输入与最初构建计算图时的形状相匹配。对于具有固定输入形状的模态(例如图像),这不是问题,但如果您正在处理具有可变输入形状的模态(例如文本),则必须注意。 + +为了确保`xla_generate()`始终使用相同的输入形状,您可以在调用`tokenizer`时指定`padding`参数。 + +```py +import tensorflow as tf +from transformers import AutoTokenizer, TFAutoModelForCausalLM + +tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2", padding_side="left", pad_token="</s>") +model = TFAutoModelForCausalLM.from_pretrained("openai-community/gpt2") +input_string = ["TensorFlow is"] + +xla_generate = tf.function(model.generate, jit_compile=True) + +# Here, we call the tokenizer with padding options. +tokenized_input = tokenizer(input_string, pad_to_multiple_of=8, padding=True, return_tensors="tf") + +generated_tokens = xla_generate(**tokenized_input, num_beams=2) +decoded_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True) +print(f"Generated -- {decoded_text}") +``` + +通过这种方式,您可以确保`xla_generate()`的输入始终具有它跟踪的形状,从而加速生成时间。您可以使用以下代码来验证这一点: + +```py +import time +import tensorflow as tf +from transformers import AutoTokenizer, TFAutoModelForCausalLM + +tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2", padding_side="left", pad_token="</s>") +model = TFAutoModelForCausalLM.from_pretrained("openai-community/gpt2") + +xla_generate = tf.function(model.generate, jit_compile=True) + +for input_string in ["TensorFlow is", "TensorFlow is a", "TFLite is a"]: + tokenized_input = tokenizer(input_string, pad_to_multiple_of=8, padding=True, return_tensors="tf") + start = time.time_ns() + generated_tokens = xla_generate(**tokenized_input, num_beams=2) + end = time.time_ns() + print(f"Execution time -- {(end - start) / 1e6:.1f} ms\n") +``` + +在Tesla T4 GPU上,您可以期望如下的输出: + +```bash +Execution time -- 30819.6 ms + +Execution time -- 79.0 ms + +Execution time -- 78.9 ms +``` + +第一次调用`xla_generate()`会因为`tracing`而耗时,但后续的调用会快得多。请注意,任何时候对生成选项的更改都会触发重新`tracing`,从而导致生成时间减慢。 + +在本文档中,我们没有涵盖🤗 Transformers提供的所有文本生成选项。我们鼓励您阅读文档以了解高级用例。 + +## 附加资源 + +如果您想深入了解如何在🤗 Transformers和其他库中使用XLA,以下是一些附加资源: + +* [这个Colab Notebook](https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/91_tf_xla_generate.ipynb) 提供了一个互动演示,让您可以尝试使用XLA兼容的编码器-解码器(例如[T5](https://huggingface.co/docs/transformers/model_doc/t5))和仅解码器(例如[GPT2](https://huggingface.co/docs/transformers/model_doc/gpt2))文本生成模型。 + +* [这篇博客文章](https://huggingface.co/blog/tf-xla-generate) 提供了XLA兼容模型的比较基准概述,以及关于在TensorFlow中使用XLA的友好介绍。 + +* 
[这篇博客文章](https://blog.tensorflow.org/2022/11/how-hugging-face-improved-text-generation-performance-with-xla.html) 讨论了我们在🤗 Transformers中为TensorFlow模型添加XLA支持的设计理念。 + +* 推荐用于更多学习XLA和TensorFlow图的资源: + * [XLA:面向机器学习的优化编译器](https://www.tensorflow.org/xla) + * [图和tf.function简介](https://www.tensorflow.org/guide/intro_to_graphs) + * [使用tf.function获得更好的性能](https://www.tensorflow.org/guide/function) \ No newline at end of file diff --git a/docs/source/zh/tflite.md b/docs/source/zh/tflite.md new file mode 100644 index 00000000000000..f0280156def431 --- /dev/null +++ b/docs/source/zh/tflite.md @@ -0,0 +1,54 @@ + + +# 导出为 TFLite + +[TensorFlow Lite](https://www.tensorflow.org/lite/guide) 是一个轻量级框架,用于资源受限的设备上,如手机、嵌入式系统和物联网(IoT)设备,部署机器学习模型。TFLite 旨在在计算能力、内存和功耗有限的设备上优化和高效运行模型。模型以一种特殊的高效可移植格式表示,其文件扩展名为 `.tflite`。 + +🤗 Optimum 通过 `exporters.tflite` 模块提供将 🤗 Transformers 模型导出至 TFLite 格式的功能。请参考 [🤗 Optimum 文档](https://huggingface.co/docs/optimum/exporters/tflite/overview) 以获取支持的模型架构列表。 + +要将模型导出为 TFLite 格式,请安装所需的依赖项: + +```bash +pip install optimum[exporters-tf] +``` + +请参阅 [🤗 Optimum 文档](https://huggingface.co/docs/optimum/main/en/exporters/tflite/usage_guides/export_a_model) 以查看所有可用参数,或者在命令行中查看帮助: + +```bash +optimum-cli export tflite --help +``` + +运行以下命令,以从 🤗 Hub 导出模型的检查点(checkpoint),以 `google-bert/bert-base-uncased` 为例: + +```bash +optimum-cli export tflite --model google-bert/bert-base-uncased --sequence_length 128 bert_tflite/ +``` + +你应该能在日志中看到导出进度以及生成的 `model.tflite` 文件的保存位置,如下所示: + +```bash +Validating TFLite model... + -[✓] TFLite model output names match reference model (logits) + - Validating TFLite Model output "logits": + -[✓] (1, 128, 30522) matches (1, 128, 30522) + -[x] values not close enough, max diff: 5.817413330078125e-05 (atol: 1e-05) +The TensorFlow Lite export succeeded with the warning: The maximum absolute difference between the output of the reference model and the TFLite exported model is not within the set tolerance 1e-05: +- logits: max diff = 5.817413330078125e-05. + The exported model was saved at: bert_tflite +``` + +上面的示例说明了从 🤗 Hub 导出检查点的过程。导出本地模型时,首先需要确保将模型的权重和分词器文件保存在同一目录(`local_path`)中。在使用 CLI(命令行)时,将 `local_path` 传递给 `model` 参数,而不是 🤗 Hub 上的检查点名称。 \ No newline at end of file diff --git a/docs/source/zh/tokenizer_summary.md b/docs/source/zh/tokenizer_summary.md new file mode 100644 index 00000000000000..c349154f961218 --- /dev/null +++ b/docs/source/zh/tokenizer_summary.md @@ -0,0 +1,234 @@ + + +# 分词器的摘要 +[[open-in-colab]] + +在这个页面,我们来仔细研究分词的知识。 + + +正如我们在[the preprocessing tutorial](preprocessing)所看到的那样,对文本进行分词就是将一段文本分割成很多单词或者子单词, +这些单词或者子单词然后会通过一个查询表格被转换到id,将单词或者子单词转换到id是很直截了当的,也就是一个简单的映射, +所以这么来看,我们主要关注将一段文本分割成很多单词或者很多子单词(像:对一段文本进行分词),更加准确的来说,我们将关注 +在🤗 Transformers内用到的三种主要类型的分词器:[Byte-Pair Encoding (BPE)](#byte-pair-encoding), [WordPiece](#wordpiece), +and [SentencePiece](#sentencepiece),并且给出了示例,哪个模型用到了哪种类型的分词器。 + +注意到在每个模型的主页,你可以查看文档上相关的分词器,就可以知道预训练模型使用了哪种类型的分词器。 +举个例子,如果我们查看[`BertTokenizer`],我们就能看到模型使用了[WordPiece](#wordpiece)。 + +## 介绍 +将一段文本分词到小块是一个比它看起来更加困难的任务,并且有很多方式来实现分词,举个例子,让我们看看这个句子 +`"Don't you love 🤗 Transformers? 
We sure do."` + + + +对这段文本分词的一个简单方式,就是使用空格来分词,得到的结果是: + +``` +["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."] +``` + +上面的分词是一个明智的开始,但是如果我们查看token `"Transformers?"` 和 `"do."`,我们可以观察到标点符号附在单词`"Transformer"` +和 `"do"`的后面,这并不是最理想的情况。我们应该将标点符号考虑进来,这样一个模型就没必要学习一个单词和每个可能跟在后面的 +标点符号的不同的组合,这么组合的话,模型需要学习的组合的数量会急剧上升。将标点符号也考虑进来,对范例文本进行分词的结果就是: + +``` +["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."] +``` + +分词的结果更好了,然而,这么做也是不好的,分词怎么处理单词`"Don't"`,`"Don't"`的含义是`"do not"`,所以这么分词`["Do", "n't"]` +会更好。现在开始事情就开始变得复杂起来了,部分的原因是每个模型都有它自己的分词类型。依赖于我们应用在文本分词上的规则, +相同的文本会产生不同的分词输出。用在训练数据上的分词规则,被用来对输入做分词操作,一个预训练模型才会正确的执行。 + +[spaCy](https://spacy.io/) and [Moses](http://www.statmt.org/moses/?n=Development.GetStarted) 是两个受欢迎的基于规则的 +分词器。将这两个分词器应用在示例文本上,*spaCy* 和 *Moses*会输出类似下面的结果: + +``` +["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."] +``` + +可见上面的分词使用到了空格和标点符号的分词方式,以及基于规则的分词方式。空格和标点符号分词以及基于规则的分词都是单词分词的例子。 +不那么严格的来说,单词分词的定义就是将句子分割到很多单词。然而将文本分割到更小的块是符合直觉的,当处理大型文本语料库时,上面的 +分词方法会导致很多问题。在这种情况下,空格和标点符号分词通常会产生一个非常大的词典(使用到的所有不重复的单词和tokens的集合)。 +像:[Transformer XL](model_doc/transformerxl)使用空格和标点符号分词,结果会产生一个大小是267,735的词典! + +这么大的一个词典容量,迫使模型有着一个巨大的embedding矩阵,以及巨大的输入和输出层,这会增加内存使用量,也会提高时间复杂度。通常 +情况下,transformers模型几乎没有词典容量大于50,000的,特别是只在一种语言上预训练的模型。 + +所以如果简单的空格和标点符号分词让人不满意,为什么不简单的对字符分词? + + + +尽管字符分词是非常简单的,并且能极大的减少内存使用,降低时间复杂度,但是这样做会让模型很难学到有意义的输入表达。像: +比起学到单词`"today"`的一个有意义的上下文独立的表达,学到字母`"t"`的一个有意义的上下文独立的表达是相当困难的。因此, +字符分词经常会伴随着性能的下降。所以为了获得最好的结果,transformers模型在单词级别分词和字符级别分词之间使用了一个折中的方案 +被称作**子词**分词。 + +## 子词分词 + + + +子词分词算法依赖这样的原则:频繁使用的单词不应该被分割成更小的子词,但是很少使用的单词应该被分解到有意义的子词。举个例子: +`"annoyingly"`能被看作一个很少使用的单词,能被分解成`"annoying"`和`"ly"`。`"annoying"`和`"ly"`作为独立地子词,出现 +的次数都很频繁,而且与此同时单词`"annoyingly"`的含义可以通过组合`"annoying"`和`"ly"`的含义来获得。在粘合和胶水语言上, +像Turkish语言,这么做是相当有用的,在这样的语言里,通过线性组合子词,大多数情况下你能形成任意长的复杂的单词。 + +子词分词允许模型有一个合理的词典大小,而且能学到有意义的上下文独立地表达。除此以外,子词分词可以让模型处理以前从来没见过的单词, +方式是通过分解这些单词到已知的子词,举个例子:[`~transformers.BertTokenizer`]对句子`"I have a new GPU!"`分词的结果如下: + +```py +>>> from transformers import BertTokenizer + +>>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased") +>>> tokenizer.tokenize("I have a new GPU!") +["i", "have", "a", "new", "gp", "##u", "!"] +``` + +因为我们正在考虑不区分大小写的模型,句子首先被转换成小写字母形式。我们可以见到单词`["i", "have", "a", "new"]`在分词器 +的词典内,但是这个单词`"gpu"`不在词典内。所以,分词器将`"gpu"`分割成已知的子词`["gp" and "##u"]`。`"##"`意味着剩下的 +token应该附着在前面那个token的后面,不带空格的附着(分词的解码或者反向)。 + +另外一个例子,[`~transformers.XLNetTokenizer`]对前面的文本例子分词结果如下: + +```py +>>> from transformers import XLNetTokenizer + +>>> tokenizer = XLNetTokenizer.from_pretrained("xlnet/xlnet-base-cased") +>>> tokenizer.tokenize("Don't you love 🤗 Transformers? 
We sure do.") +["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?", "▁We", "▁sure", "▁do", "."] +``` + +当我们查看[SentencePiece](#sentencepiece)时会回过头来解释这些`"▁"`符号的含义。正如你能见到的,很少使用的单词 +`"Transformers"`能被分割到更加频繁使用的子词`"Transform"`和`"ers"`。 + +现在让我们来看看不同的子词分割算法是怎么工作的,注意到所有的这些分词算法依赖于某些训练的方式,这些训练通常在语料库上完成, +相应的模型也是在这个语料库上训练的。 + + + +### Byte-Pair Encoding (BPE) + +Byte-Pair Encoding (BPE)来自于[Neural Machine Translation of Rare Words with Subword Units (Sennrich et +al., 2015)](https://arxiv.org/abs/1508.07909)。BPE依赖于一个预分词器,这个预分词器会将训练数据分割成单词。预分词可以是简单的 +空格分词,像::[GPT-2](model_doc/gpt2),[RoBERTa](model_doc/roberta)。更加先进的预分词方式包括了基于规则的分词,像: [XLM](model_doc/xlm),[FlauBERT](model_doc/flaubert),FlauBERT在大多数语言使用了Moses,或者[GPT](model_doc/gpt),GPT +使用了Spacy和ftfy,统计了训练语料库中每个单词的频次。 + +在预分词以后,生成了单词的集合,也确定了训练数据中每个单词出现的频次。下一步,BPE产生了一个基础词典,包含了集合中所有的符号, +BPE学习融合的规则-组合基础词典中的两个符号来形成一个新的符号。BPE会一直学习直到词典的大小满足了期望的词典大小的要求。注意到 +期望的词典大小是一个超参数,在训练这个分词器以前就需要人为指定。 + +举个例子,让我们假设在预分词以后,下面的单词集合以及他们的频次都已经确定好了: + +``` +("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5) +``` + +所以,基础的词典是`["b", "g", "h", "n", "p", "s", "u"]`。将所有单词分割成基础词典内的符号,就可以获得: + +``` +("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5) +``` +BPE接着会统计每个可能的符号对的频次,然后挑出出现最频繁的的符号对,在上面的例子中,`"h"`跟了`"u"`出现了10 + 5 = 15次 +(10次是出现了10次`"hug"`,5次是出现了5次`"hugs"`)。然而,最频繁的符号对是`"u"`后面跟了个`"g"`,总共出现了10 + 5 + 5 += 20次。因此,分词器学到的第一个融合规则是组合所有的`"u"`后面跟了个`"g"`符号。下一步,`"ug"`被加入到了词典内。单词的集合 +就变成了: + +``` +("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5) +``` + +BPE接着会统计出下一个最普遍的出现频次最大的符号对。也就是`"u"`后面跟了个`"n"`,出现了16次。`"u"`,`"n"`被融合成了`"un"`。 +也被加入到了词典中,再下一个出现频次最大的符号对是`"h"`后面跟了个`"ug"`,出现了15次。又一次这个符号对被融合成了`"hug"`, +也被加入到了词典中。 + +在当前这步,词典是`["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]`,我们的单词集合则是: + +``` +("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5) +``` + +假设,the Byte-Pair Encoding在这个时候停止训练,学到的融合规则并应用到其他新的单词上(只要这些新单词不包括不在基础词典内的符号 +就行)。举个例子,单词`"bug"`会被分词到`["b", "ug"]`,但是`"mug"`会被分词到`["", "ug"]`,因为符号`"m"`不在基础词典内。 +通常来看的话,单个字母像`"m"`不会被`""`符号替换掉,因为训练数据通常包括了每个字母,每个字母至少出现了一次,但是在特殊的符号 +中也可能发生像emojis。 + +就像之前提到的那样,词典的大小,举个例子,基础词典的大小 + 融合的数量,是一个需要配置的超参数。举个例子:[GPT](model_doc/gpt) +的词典大小是40,478,因为GPT有着478个基础词典内的字符,在40,000次融合以后选择了停止训练。 + +#### Byte-level BPE + +一个包含了所有可能的基础字符的基础字典可能会非常大,如果考虑将所有的unicode字符作为基础字符。为了拥有一个更好的基础词典,[GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)使用了字节 +作为基础词典,这是一个非常聪明的技巧,迫使基础词典是256大小,而且确保了所有基础字符包含在这个词典内。使用了其他的规则 +来处理标点符号,这个GPT2的分词器能对每个文本进行分词,不需要使用到符号。[GPT-2](model_doc/gpt)有一个大小是50,257 +的词典,对应到256字节的基础tokens,一个特殊的文本结束token,这些符号经过了50,000次融合学习。 + + + +### WordPiece + +WordPiece是子词分词算法,被用在[BERT](model_doc/bert),[DistilBERT](model_doc/distilbert),和[Electra](model_doc/electra)。 +这个算法发布在[Japanese and Korean +Voice Search (Schuster et al., 2012)](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf) +和BPE非常相似。WordPiece首先初始化一个词典,这个词典包含了出现在训练数据中的每个字符,然后递进的学习一个给定数量的融合规则。和BPE相比较, +WordPiece不会选择出现频次最大的符号对,而是选择了加入到字典以后能最大化训练数据似然值的符号对。 + +所以这到底意味着什么?参考前面的例子,最大化训练数据的似然值,等价于找到一个符号对,它们的概率除以这个符号对中第一个符号的概率, +接着除以第二个符号的概率,在所有的符号对中商最大。像:如果`"ug"`的概率除以`"u"`除以`"g"`的概率的商,比其他任何符号对更大, +这个时候才能融合`"u"`和`"g"`。直觉上,WordPiece,和BPE有点点不同,WordPiece是评估融合两个符号会失去的量,来确保这么做是值得的。 + + + +### Unigram + +Unigram是一个子词分词器算法,介绍见[Subword Regularization: Improving Neural Network Translation +Models with Multiple Subword Candidates (Kudo, 
2018)](https://arxiv.org/pdf/1804.10959.pdf)。和BPE或者WordPiece相比较 +,Unigram使用大量的符号来初始化它的基础字典,然后逐渐的精简每个符号来获得一个更小的词典。举例来看基础词典能够对应所有的预分词 +的单词以及最常见的子字符串。Unigram没有直接用在任何transformers的任何模型中,但是和[SentencePiece](#sentencepiece)一起联合使用。 + +在每个训练的步骤,Unigram算法在当前词典的训练数据上定义了一个损失函数(经常定义为log似然函数的),还定义了一个unigram语言模型。 +然后,对词典内的每个符号,算法会计算如果这个符号从词典内移除,总的损失会升高多少。Unigram然后会移除百分之p的符号,这些符号的loss +升高是最低的(p通常是10%或者20%),像:这些在训练数据上对总的损失影响最小的符号。重复这个过程,直到词典已经达到了期望的大小。 +为了任何单词都能被分词,Unigram算法总是保留基础的字符。 + +因为Unigram不是基于融合规则(和BPE以及WordPiece相比较),在训练以后算法有几种方式来分词,如果一个训练好的Unigram分词器 +的词典是这个: + +``` +["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"], +``` +`"hugs"`可以被分词成`["hug", "s"]`, `["h", "ug", "s"]`或者`["h", "u", "g", "s"]`。所以选择哪一个呢?Unigram在保存 +词典的时候还会保存训练语料库内每个token的概率,所以在训练以后可以计算每个可能的分词结果的概率。实际上算法简单的选择概率 +最大的那个分词结果,但是也会提供概率来根据分词结果的概率来采样一个可能的分词结果。 + +分词器在损失函数上训练,这些损失函数定义了这些概率。假设训练数据包含了这些单词 $x_{1}$, $\dots$, $x_{N}$,一个单词$x_{i}$ +的所有可能的分词结果的集合定义为$S(x_{i})$,然后总的损失就可以定义为: + +$$\mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )$$ + + + +### SentencePiece +目前为止描述的所有分词算法都有相同的问题:它们都假设输入的文本使用空格来分开单词。然而,不是所有的语言都使用空格来分开单词。 +一个可能的解决方案是使用某种语言特定的预分词器。像:[XLM](model_doc/xlm)使用了一个特定的中文、日语和Thai的预分词器。 +为了更加广泛的解决这个问题,[SentencePiece: A simple and language independent subword tokenizer and +detokenizer for Neural Text Processing (Kudo et al., 2018)](https://arxiv.org/pdf/1808.06226.pdf) +将输入文本看作一个原始的输入流,因此使用的符合集合中也包括了空格。SentencePiece然后会使用BPE或者unigram算法来产生合适的 +词典。 + +举例来说,[`XLNetTokenizer`]使用了SentencePiece,这也是为什么上面的例子中`"▁"`符号包含在词典内。SentencePiece解码是非常容易的,因为所有的tokens能被concatenate起来,然后将`"▁"`替换成空格。 + +库内所有使用了SentencePiece的transformers模型,会和unigram组合起来使用,像:使用了SentencePiece的模型是[ALBERT](model_doc/albert), +[XLNet](model_doc/xlnet),[Marian](model_doc/marian),和[T5](model_doc/t5)。 diff --git a/docs/source/zh/training.md b/docs/source/zh/training.md new file mode 100644 index 00000000000000..773c58181c31e9 --- /dev/null +++ b/docs/source/zh/training.md @@ -0,0 +1,407 @@ + + +# 微调预训练模型 + +[[open-in-colab]] + +使用预训练模型有许多显著的好处。它降低了计算成本,减少了碳排放,同时允许您使用最先进的模型,而无需从头开始训练一个。🤗 Transformers 提供了涉及各种任务的成千上万的预训练模型。当您使用预训练模型时,您需要在与任务相关的数据集上训练该模型。这种操作被称为微调,是一种非常强大的训练技术。在本教程中,您将使用您选择的深度学习框架来微调一个预训练模型: + +* 使用 🤗 Transformers 的 [`Trainer`] 来微调预训练模型。 +* 在 TensorFlow 中使用 Keras 来微调预训练模型。 +* 在原生 PyTorch 中微调预训练模型。 + + + +## 准备数据集 + + + +在您进行预训练模型微调之前,需要下载一个数据集并为训练做好准备。之前的教程向您展示了如何处理训练数据,现在您有机会将这些技能付诸实践! + +首先,加载[Yelp评论](https://huggingface.co/datasets/yelp_review_full)数据集: + +```py +>>> from datasets import load_dataset + +>>> dataset = load_dataset("yelp_review_full") +>>> dataset["train"][100] +{'label': 0, + 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. 
She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I\'ve worked at more than one location. I expect bad days, bad moods, and the occasional mistake. But I have yet to have a decent experience at this store. It will remain a place I avoid unless someone in my party needs to avoid illness from low blood sugar. Perhaps I should go back to the racially biased service of Steak n Shake instead!'} +``` + +正如您现在所知,您需要一个`tokenizer`来处理文本,包括填充和截断操作以处理可变的序列长度。如果要一次性处理您的数据集,可以使用 🤗 Datasets 的 [`map`](https://huggingface.co/docs/datasets/process#map) 方法,将预处理函数应用于整个数据集: + +```py +>>> from transformers import AutoTokenizer + +>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") + + +>>> def tokenize_function(examples): +... return tokenizer(examples["text"], padding="max_length", truncation=True) + + +>>> tokenized_datasets = dataset.map(tokenize_function, batched=True) +``` +如果愿意的话,您可以从完整数据集提取一个较小子集来进行微调,以减少训练所需的时间: + +```py +>>> small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000)) +>>> small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000)) +``` + + + +## 训练 + +此时,您应该根据您训练所用的框架来选择对应的教程章节。您可以使用右侧的链接跳转到您想要的章节 - 如果您想隐藏某个框架对应的所有教程内容,只需使用右上角的按钮! + + + + + + +## 使用 PyTorch Trainer 进行训练 + +🤗 Transformers 提供了一个专为训练 🤗 Transformers 模型而优化的 [`Trainer`] 类,使您无需手动编写自己的训练循环步骤而更轻松地开始训练模型。[`Trainer`] API 支持各种训练选项和功能,如日志记录、梯度累积和混合精度。 + +首先加载您的模型并指定期望的标签数量。根据 Yelp Review [数据集卡片](https://huggingface.co/datasets/yelp_review_full#data-fields),您知道有五个标签: + + +```py +>>> from transformers import AutoModelForSequenceClassification + +>>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) +``` + + + +您将会看到一个警告,提到一些预训练权重未被使用,以及一些权重被随机初始化。不用担心,这是完全正常的!BERT 模型的预训练`head`被丢弃,并替换为一个随机初始化的分类`head`。您将在您的序列分类任务上微调这个新模型`head`,将预训练模型的知识转移给它。 + + + +### 训练超参数 + +接下来,创建一个 [`TrainingArguments`] 类,其中包含您可以调整的所有超参数以及用于激活不同训练选项的标志。对于本教程,您可以从默认的训练[超参数](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments)开始,但随时可以尝试不同的设置以找到最佳设置。 + +指定保存训练检查点的位置: + +```py +>>> from transformers import TrainingArguments + +>>> training_args = TrainingArguments(output_dir="test_trainer") +``` + +### 评估 + +[`Trainer`] 在训练过程中不会自动评估模型性能。您需要向 [`Trainer`] 传递一个函数来计算和展示指标。[🤗 Evaluate](https://huggingface.co/docs/evaluate/index) 库提供了一个简单的 [`accuracy`](https://huggingface.co/spaces/evaluate-metric/accuracy) 函数,您可以使用 [`evaluate.load`] 函数加载它(有关更多信息,请参阅此[快速入门](https://huggingface.co/docs/evaluate/a_quick_tour)): + +```py +>>> import numpy as np +>>> import evaluate + +>>> metric = evaluate.load("accuracy") +``` +在 `metric` 上调用 [`~evaluate.compute`] 来计算您的预测的准确性。在将预测传递给 `compute` 之前,您需要将预测转换为`logits`(请记住,所有 🤗 Transformers 模型都返回对`logits`): + +```py +>>> def compute_metrics(eval_pred): +... logits, labels = eval_pred +... predictions = np.argmax(logits, axis=-1) +... return metric.compute(predictions=predictions, references=labels) +``` + +如果您希望在微调过程中监视评估指标,请在您的训练参数中指定 `evaluation_strategy` 参数,以在每个`epoch`结束时展示评估指标: + +```py +>>> from transformers import TrainingArguments, Trainer + +>>> training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch") +``` + +### 训练器 + +创建一个包含您的模型、训练参数、训练和测试数据集以及评估函数的 [`Trainer`] 对象: + + +```py +>>> trainer = Trainer( +... model=model, +... args=training_args, +... 
train_dataset=small_train_dataset, +... eval_dataset=small_eval_dataset, +... compute_metrics=compute_metrics, +... ) +``` +然后调用[`~transformers.Trainer.train`]以微调模型: + +```py +>>> trainer.train() +``` + + + + + + +## 使用keras训练TensorFlow模型 + +您也可以使用 Keras API 在 TensorFlow 中训练 🤗 Transformers 模型! + +### 加载用于 Keras 的数据 + +当您希望使用 Keras API 训练 🤗 Transformers 模型时,您需要将您的数据集转换为 Keras 可理解的格式。如果您的数据集很小,您可以将整个数据集转换为NumPy数组并传递给 Keras。在进行更复杂的操作之前,让我们先尝试这种方法。 + +首先,加载一个数据集。我们将使用 [GLUE benchmark](https://huggingface.co/datasets/glue) 中的 CoLA 数据集,因为它是一个简单的二元文本分类任务。现在只使用训练数据集。 + + +```py +from datasets import load_dataset + +dataset = load_dataset("glue", "cola") +dataset = dataset["train"] # Just take the training split for now +``` +接下来,加载一个`tokenizer`并将数据标记为 NumPy 数组。请注意,标签已经是由 0 和 1 组成的`list`,因此我们可以直接将其转换为 NumPy 数组而无需进行分词处理! + +```py +from transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") +tokenized_data = tokenizer(dataset["sentence"], return_tensors="np", padding=True) +# Tokenizer returns a BatchEncoding, but we convert that to a dict for Keras +tokenized_data = dict(tokenized_data) + +labels = np.array(dataset["label"]) # Label is already an array of 0 and 1 +``` +最后,加载、[`compile`](https://keras.io/api/models/model_training_apis/#compile-method) 和 [`fit`](https://keras.io/api/models/model_training_apis/#fit-method) 模型。请注意,Transformers 模型都有一个默认的与任务相关的损失函数,因此除非您希望自定义,否则无需指定一个损失函数: + +```py +from transformers import TFAutoModelForSequenceClassification +from tensorflow.keras.optimizers import Adam + +# Load and compile our model +model = TFAutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased") +# Lower learning rates are often better for fine-tuning transformers +model.compile(optimizer=Adam(3e-5)) # No loss argument! + +model.fit(tokenized_data, labels) +``` + + + +当您使用 `compile()` 编译模型时,无需传递损失参数!如果不指定损失参数,Hugging Face 模型会自动选择适合其任务和模型架构的损失函数。如果需要,您始终可以自己指定损失函数以覆盖默认配置。 + + + +这种方法对于较小的数据集效果很好,但对于较大的数据集,您可能会发现它开始变得有问题。为什么呢?因为分词后的数组和标签必须完全加载到内存中,而且由于 NumPy 无法处理“不规则”数组,因此每个分词后的样本长度都必须被填充到数据集中最长样本的长度。这将使您的数组变得更大,而所有这些`padding tokens`也会减慢训练速度! + + +### 将数据加载为 tf.data.Dataset + +如果您想避免训练速度减慢,可以将数据加载为 `tf.data.Dataset`。虽然您可以自己编写自己的 `tf.data` 流水线,但我们有两种方便的方法来实现这一点: + +- [`~TFPreTrainedModel.prepare_tf_dataset`]:这是我们在大多数情况下推荐的方法。因为它是模型上的一个方法,它可以检查模型以自动确定哪些列可用作模型输入,并丢弃其他列以创建一个更简单、性能更好的数据集。 +- [`~datasets.Dataset.to_tf_dataset`]:这个方法更低级,但当您希望完全控制数据集的创建方式时非常有用,可以通过指定要包括的确切 `columns` 和 `label_cols` 来实现。 + +在使用 [`~TFPreTrainedModel.prepare_tf_dataset`] 之前,您需要将`tokenizer`的输出添加到数据集作为列,如下面的代码示例所示: + +```py +def tokenize_dataset(data): + # Keys of the returned dictionary will be added to the dataset as columns + return tokenizer(data["text"]) + + +dataset = dataset.map(tokenize_dataset) +``` +请记住,默认情况下,Hugging Face 数据集存储在硬盘上,因此这不会增加您的内存使用!一旦列已经添加,您可以从数据集中流式的传输批次数据,并为每个批次添加`padding tokens`,这与为整个数据集添加`padding tokens`相比,大大减少了`padding tokens`的数量。 + +```py +>>> tf_dataset = model.prepare_tf_dataset(dataset["train"], batch_size=16, shuffle=True, tokenizer=tokenizer) +``` +请注意,在上面的代码示例中,您需要将`tokenizer`传递给`prepare_tf_dataset`,以便它可以在加载批次时正确填充它们。如果数据集中的所有样本都具有相同的长度而且不需要填充,您可以跳过此参数。如果需要执行比填充样本更复杂的操作(例如,用于掩码语言模型的`tokens` 替换),则可以使用 `collate_fn` 参数,而不是传递一个函数来将样本列表转换为批次并应用任何所需的预处理。请查看我们的[示例](https://github.com/huggingface/transformers/tree/main/examples)或[笔记](https://huggingface.co/docs/transformers/notebooks)以了解此方法的实际操作。 + +一旦创建了 `tf.data.Dataset`,您可以像以前一样编译和训练模型: + +```py +model.compile(optimizer=Adam(3e-5)) # No loss argument! 
+ +model.fit(tf_dataset) +``` + + + + + + +## 在原生 PyTorch 中训练 + + + + + +[`Trainer`] 负责训练循环,允许您在一行代码中微调模型。对于喜欢编写自己训练循环的用户,您也可以在原生 PyTorch 中微调 🤗 Transformers 模型。 + +现在,您可能需要重新启动您的`notebook`,或执行以下代码以释放一些内存: + +```py +del model +del trainer +torch.cuda.empty_cache() +``` + +接下来,手动处理 `tokenized_dataset` 以准备进行训练。 + +1. 移除 text 列,因为模型不接受原始文本作为输入: + + ```py + >>> tokenized_datasets = tokenized_datasets.remove_columns(["text"]) + ``` + +2. 将 label 列重命名为 labels,因为模型期望参数的名称为 labels: + + ```py + >>> tokenized_datasets = tokenized_datasets.rename_column("label", "labels") + ``` + +3. 设置数据集的格式以返回 PyTorch 张量而不是`lists`: + + ```py + >>> tokenized_datasets.set_format("torch") + ``` + +接着,创建一个先前展示的数据集的较小子集,以加速微调过程 + +```py +>>> small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000)) +>>> small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000)) +``` + +### DataLoader + +您的训练和测试数据集创建一个`DataLoader`类,以便可以迭代处理数据批次 + +```py +>>> from torch.utils.data import DataLoader + +>>> train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8) +>>> eval_dataloader = DataLoader(small_eval_dataset, batch_size=8) +``` + +加载您的模型,并指定期望的标签数量: + +```py +>>> from transformers import AutoModelForSequenceClassification + +>>> model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5) +``` + +### Optimizer and learning rate scheduler + +创建一个`optimizer`和`learning rate scheduler`以进行模型微调。让我们使用 PyTorch 中的 [AdamW](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html) 优化器: + +```py +>>> from torch.optim import AdamW + +>>> optimizer = AdamW(model.parameters(), lr=5e-5) +``` + +创建来自 [`Trainer`] 的默认`learning rate scheduler`: + + +```py +>>> from transformers import get_scheduler + +>>> num_epochs = 3 +>>> num_training_steps = num_epochs * len(train_dataloader) +>>> lr_scheduler = get_scheduler( +... name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps +... ) +``` + +最后,指定 `device` 以使用 GPU(如果有的话)。否则,使用 CPU 进行训练可能需要几个小时,而不是几分钟。 + + +```py +>>> import torch + +>>> device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu") +>>> model.to(device) +``` + + + +如果没有 GPU,可以通过notebook平台如 [Colaboratory](https://colab.research.google.com/) 或 [SageMaker StudioLab](https://studiolab.sagemaker.aws/) 来免费获得云端GPU使用。 + + + +现在您已经准备好训练了!🥳 + +### 训练循环 + +为了跟踪训练进度,使用 [tqdm](https://tqdm.github.io/) 库来添加一个进度条,显示训练步数的进展: + +```py +>>> from tqdm.auto import tqdm + +>>> progress_bar = tqdm(range(num_training_steps)) + +>>> model.train() +>>> for epoch in range(num_epochs): +... for batch in train_dataloader: +... batch = {k: v.to(device) for k, v in batch.items()} +... outputs = model(**batch) +... loss = outputs.loss +... loss.backward() + +... optimizer.step() +... lr_scheduler.step() +... optimizer.zero_grad() +... progress_bar.update(1) +``` + +### 评估 + +就像您在 [`Trainer`] 中添加了一个评估函数一样,当您编写自己的训练循环时,您需要做同样的事情。但与在每个`epoch`结束时计算和展示指标不同,这一次您将使用 [`~evaluate.add_batch`] 累积所有批次,并在最后计算指标。 + +```py +>>> import evaluate + +>>> metric = evaluate.load("accuracy") +>>> model.eval() +>>> for batch in eval_dataloader: +... batch = {k: v.to(device) for k, v in batch.items()} +... with torch.no_grad(): +... outputs = model(**batch) + +... logits = outputs.logits +... predictions = torch.argmax(logits, dim=-1) +... 
metric.add_batch(predictions=predictions, references=batch["labels"]) + +>>> metric.compute() +``` + + + + + +## 附加资源 + +更多微调例子可参考如下链接: + +- [🤗 Transformers 示例](https://github.com/huggingface/transformers/tree/main/examples) 包含用于在 PyTorch 和 TensorFlow 中训练常见自然语言处理任务的脚本。 + +- [🤗 Transformers 笔记](notebooks) 包含针对特定任务在 PyTorch 和 TensorFlow 中微调模型的各种`notebook`。 \ No newline at end of file diff --git a/docs/source/zh/transformers_agents.md b/docs/source/zh/transformers_agents.md new file mode 100644 index 00000000000000..a3e601fbedcb0d --- /dev/null +++ b/docs/source/zh/transformers_agents.md @@ -0,0 +1,285 @@ + + +# Transformers Agents + + + +`Transformers Agents`是一个实验性的随时可能发生变化的API。由于API或底层模型可能发生变化,`agents`返回的结果也会有所不同。 + + + +Transformers版本`v4.29.0`基于`tools`和`agents`概念构建。您可以在[此Colab链接](https://colab.research.google.com/drive/1c7MHD-T1forUPGcC_jlwsIptOzpG3hSj)中进行测试。 + +简而言之,它在`Transformers`之上提供了一个自然语言API:我们定义了一组经过筛选的`tools`,并设计了一个`agents`来解读自然语言并使用这些工具。它具有很强的可扩展性;我们筛选了一些相关的`tools`,但我们将向您展示如何通过社区开发的`tool`轻松地扩展系统。 + +让我们从一些可以通过这个新API实现的示例开始。在处理多模态任务时它尤其强大,因此让我们快速试着生成图像并大声朗读文本。 + + +```py +agent.run("Caption the following image", image=image) +``` + +| **输入** | **输出** | +|-----------------------------------------------------------------------------------------------------------------------------|-----------------------------------| +| | A beaver is swimming in the water | + +--- + +```py +agent.run("Read the following text out loud", text=text) +``` +| **输入** | **输出** | +|-----------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| A beaver is swimming in the water | + +--- + +```py +agent.run( + "In the following `document`, where will the TRRF Scientific Advisory Council Meeting take place?", + document=document, +) +``` +| **输入** | **输出** | +|-----------------------------------------------------------------------------------------------------------------------------|----------------| +| | ballroom foyer | + +## 快速入门 + +要使用 `agent.run`,您需要实例化一个`agent`,它是一个大型语言模型(LLM)。我们支持OpenAI模型以及来自BigCode和OpenAssistant的开源替代方案。OpenAI模型性能更好(但需要您拥有OpenAI API密钥,因此无法免费使用),Hugging Face为BigCode和OpenAssistant模型提供了免费访问端点。 + +一开始请安装`agents`附加模块,以安装所有默认依赖项。 + +```bash +pip install transformers[agents] +``` + +要使用OpenAI模型,您可以在安装`openai`依赖项后实例化一个`OpenAiAgent`: + +```bash +pip install openai +``` + + +```py +from transformers import OpenAiAgent + +agent = OpenAiAgent(model="text-davinci-003", api_key="") +``` + +要使用BigCode或OpenAssistant,请首先登录以访问Inference API: + +```py +from huggingface_hub import login + +login("") +``` + +然后,实例化`agent`: + +```py +from transformers import HfAgent + +# Starcoder +agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder") +# StarcoderBase +# agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoderbase") +# OpenAssistant +# agent = HfAgent(url_endpoint="https://api-inference.huggingface.co/models/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5") +``` + +此示例使用了目前Hugging Face免费提供的推理API。如果你有自己的推理端点用于此模型(或其他模型),你可以用你的URL替换上面的URL。 + + + +StarCoder和OpenAssistant可以免费使用,并且在简单任务上表现出色。然而,当处理更复杂的提示时就不再有效。如果你遇到这样的问题,我们建议尝试使用OpenAI模型,尽管遗憾的是它不是开源的,但它在目前情况下表现更好。 + + + +现在,您已经可以开始使用了!让我们深入了解您现在可以使用的两个API。 + +### 单次执行(run) + +单次执行方法是使用`agent`的 `~Agent.run`: + +```py +agent.run("Draw me a picture of rivers and lakes.") +``` + + + 
+它会自动选择适合您要执行的任务的`tool`(或`tools`),并以适当的方式运行它们。它可以在同一指令中执行一个或多个任务(尽管您的指令越复杂,`agent`失败的可能性就越大)。 + + +```py +agent.run("Draw me a picture of the sea then transform the picture to add an island") +``` + + + +
+ +每个 [`~Agent.run`] 操作都是独立的,因此您可以多次连续运行 [`~Agent.run`]并执行不同的任务。 + +请注意,您的 `agent` 只是一个大型语言模型,因此您略有变化的提示可能会产生完全不同的结果。重要的是尽可能清晰地解释您要执行的任务。我们在[这里](../en/custom_tools#writing-good-user-inputs)更深入地讨论了如何编写良好的提示。 + +如果您想在多次执行之间保持同一状态或向`agent`传递非文本对象,可以通过指定`agent`要使用的变量来实现。例如,您可以生成有关河流和湖泊的第一幅图像,并要求模型通过执行以下操作向该图片添加一个岛屿: + +```python +picture = agent.run("Generate a picture of rivers and lakes.") +updated_picture = agent.run("Transform the image in `picture` to add an island to it.", picture=picture) +``` + + + +当模型无法理解您的请求和库中的工具时,这可能会有所帮助。例如: + +```py +agent.run("Draw me the picture of a capybara swimming in the sea") +``` + +在这种情况下,模型可以以两种方式理解您的请求: +- 使用`text-to-image` 生成在大海中游泳的大水獭 +- 或者,使用`text-to-image`生成大水獭,然后使用`image-transformation`工具使其在大海中游泳 + +如果您想强制使用第一种情景,可以通过将提示作为参数传递给它来实现: + + +```py +agent.run("Draw me a picture of the `prompt`", prompt="a capybara swimming in the sea") +``` + + + + +### 基于交流的执行 (chat) + +基于交流的执行(chat)方式是使用 [`~Agent.chat`]: + +```py +agent.chat("Generate a picture of rivers and lakes") +``` + + + +```py +agent.chat("Transform the picture so that there is a rock in there") +``` + + + +
+ +当您希望在不同指令之间保持同一状态时,这会是一个有趣的方法。它更适合用于单个指令,而不是复杂的多步指令(`~Agent.run` 方法更适合处理这种情况)。 + +这种方法也可以接受参数,以便您可以传递非文本类型或特定提示。 + +### ⚠️ 远程执行 + +出于演示目的以便适用于所有设置,我们为发布版本的少数默认工具创建了远程执行器。这些工具是使用推理终端(inference endpoints)创建的。 + +目前我们已将其关闭,但为了了解如何自行设置远程执行器工具,我们建议阅读[自定义工具指南](./custom_tools)。 + +### 这里发生了什么?什么是`tools`,什么是`agents`? + + + + + +#### Agents + +这里的`Agents`是一个大型语言模型,我们通过提示它以访问特定的工具集。 + +大型语言模型在生成小代码示例方面表现出色,因此这个API利用这一特点,通过提示LLM生成一个使用`tools`集合的小代码示例。然后,根据您给`Agents`的任务和`tools`的描述来完成此提示。这种方式让它能够访问工具的文档,特别是它们的期望输入和输出,以生成相关的代码。 + +#### Tools + +`Tools`非常简单:它们是有名称和描述的单个函数。然后,我们使用这些`tools`的描述来提示代理。通过提示,我们向`agent`展示如何使用`tool`来执行查询语言中请求的操作。 + +这是使用全新`tools`而不是`pipelines`,因为`agent`编写的代码更好,具有非常原子化的`tools`。`pipelines`经常被重构,并且通常将多个任务合并为一个。`tools`旨在专注于一个非常简单的任务。 + +#### 代码执行? + +然后,这段代码基于`tools`的输入被我们的小型Python解释器执行。我们听到你在后面大声呼喊“任意代码执行!”,但让我们解释为什么情况并非如此。 + +只能您提供的`tools`和打印函数可以被执行,因此您已经受到了执行的限制。如果仅限于 Hugging Face 工具,那么您应该是安全的。 + +然后,我们不允许任何属性查找或导入(无论如何都不需要将输入/输出传递给一小组函数),因此所有最明显的攻击(并且您需要提示LLM无论如何输出它们)不应该是一个问题。如果你想超级安全,你可以使用附加参数 return_code=True 执行 run() 方法,在这种情况下,`agent`将只返回要执行的代码,你可以决定是否执行。 + +如果`agent`生成的代码存在任何尝试执行非法操作的行为,或者代码中出现了常规Python错误,执行将停止。 + + +### 一组经过精心筛选的`tools` + +我们确定了一组可以赋予这些`agent`强大能力的`tools`。以下是我们在`transformers`中集成的`tools`的更新列表: + +- **文档问答**:给定一个图像格式的文档(例如PDF),回答该文档上的问题([Donut](../en/model_doc/donut)) +- **文本问答**:给定一段长文本和一个问题,回答文本中的问题([Flan-T5](../en/model_doc/flan-t5)) +- **无条件图像字幕**:为图像添加字幕!([BLIP](../en/model_doc/blip)) +- **图像问答**:给定一张图像,回答该图像上的问题([VILT](../en/model_doc/vilt)) +- **图像分割**:给定一张图像和一个提示,输出该提示的分割掩模([CLIPSeg](../en/model_doc/clipseg)) +- **语音转文本**:给定一个人说话的音频录音,将演讲内容转录为文本([Whisper](../en/model_doc/whisper)) +- **文本转语音**:将文本转换为语音([SpeechT5](../en/model_doc/speecht5)) +- **Zero-Shot文本分类**:给定一个文本和一个标签列表,确定文本最符合哪个标签([BART](../en/model_doc/bart)) +- **文本摘要**:总结长文本为一两句话([BART](../en/model_doc/bart)) +- **翻译**:将文本翻译为指定语言([NLLB](../en/model_doc/nllb)) + +这些`tools`已在transformers中集成,并且也可以手动使用,例如: + +```py +from transformers import load_tool + +tool = load_tool("text-to-speech") +audio = tool("This is a text to speech tool") +``` + +### 自定义工具 + +尽管我们确定了一组经过筛选的`tools`,但我们坚信,此实现提供的主要价值在于能够快速创建和共享自定义`tool`。 + +通过将工具的代码上传到Hugging Face空间或模型repository,您可以直接通过`agent`使用`tools`。我们已经添加了一些**与transformers无关**的`tools`到[`huggingface-tools`组织](https://huggingface.co/huggingface-tools)中: + +- **文本下载器**:从Web URL下载文本 +- **文本到图像**:根据提示生成图像,利用`stable diffusion` +- **图像转换**:根据初始图像和提示修改图像,利用`instruct pix2pix stable diffusion` +- **文本到视频**:根据提示生成小视频,利用`damo-vilab` + +从一开始就一直在使用的文本到图像`tool`是一个远程`tool `,位于[*huggingface-tools/text-to-image*](https://huggingface.co/spaces/huggingface-tools/text-to-image)!我们将继续在此组织和其他组织上发布此类`tool`,以进一步增强此实现。 + +`agents`默认可以访问存储在[`huggingface-tools`](https://huggingface.co/huggingface-tools)上的`tools`。我们将在后续指南中解释如何编写和共享自定义`tools`,以及如何利用Hub上存在的任何自定义`tools`。 + +### 代码生成 + +到目前为止,我们已经展示了如何使用`agents`来为您执行操作。但是,`agents`仅使用非常受限Python解释器执行的代码。如果您希望在不同的环境中使用生成的代码,可以提示`agents`返回代码,以及`tools`的定义和准确的导入信息。 + +例如,以下指令 + +```python +agent.run("Draw me a picture of rivers and lakes", return_code=True) +``` + +返回以下代码 + +```python +from transformers import load_tool + +image_generator = load_tool("huggingface-tools/text-to-image") + +image = image_generator(prompt="rivers and lakes") +``` + +然后你就可以调整并执行代码 \ No newline at end of file diff --git a/examples/README.md b/examples/README.md index 0a5ec752d3920f..a38b4576b35fd3 100644 --- a/examples/README.md +++ b/examples/README.md @@ -19,7 +19,7 @@ We host a wide range of example scripts for multiple learning frameworks. 
Simply We also have some [research projects](https://github.com/huggingface/transformers/tree/main/examples/research_projects), as well as some [legacy examples](https://github.com/huggingface/transformers/tree/main/examples/legacy). Note that unlike the main examples these are not actively maintained, and may require specific older versions of dependencies in order to run. -While we strive to present as many use cases as possible, the example scripts are just that - examples. It is expected that they won't work out-of-the box on your specific problem and that you will be required to change a few lines of code to adapt them to your needs. To help you with that, most of the examples fully expose the preprocessing of the data, allowing you to tweak and edit them as required. +While we strive to present as many use cases as possible, the example scripts are just that - examples. It is expected that they won't work out-of-the-box on your specific problem and that you will be required to change a few lines of code to adapt them to your needs. To help you with that, most of the examples fully expose the preprocessing of the data, allowing you to tweak and edit them as required. Please discuss on the [forum](https://discuss.huggingface.co/) or in an [issue](https://github.com/huggingface/transformers/issues) a feature you would like to implement in an example before submitting a PR; we welcome bug fixes, but since we want to keep the examples as simple as possible it's unlikely that we will merge a pull request adding more functionality at the cost of readability. @@ -94,3 +94,41 @@ Alternatively, you can switch your cloned 🤗 Transformers to a specific versio git checkout tags/v3.5.1 ``` and run the example command as usual afterward. + +## Running the Examples on Remote Hardware with Auto-Setup + +[run_on_remote.py](./run_on_remote.py) is a script that launches any example on remote self-hosted hardware, +with automatic hardware and environment setup. It uses [Runhouse](https://github.com/run-house/runhouse) to launch +on self-hosted hardware (e.g. in your own cloud account or on-premise cluster) but there are other options +for running remotely as well. You can easily customize the example used, command line arguments, dependencies, +and type of compute hardware, and then run the script to automatically launch the example. + +You can refer to +[hardware setup](https://runhouse-docs.readthedocs-hosted.com/en/latest/api/python/cluster.html#hardware-setup) +for more information about hardware and dependency setup with Runhouse, or this +[Colab tutorial](https://colab.research.google.com/drive/1sh_aNQzJX5BKAdNeXthTNGxKz7sM9VPc) for a more in-depth +walkthrough. + +You can run the script with the following commands: + +```bash +# First install runhouse: +pip install runhouse + +# For an on-demand V100 with whichever cloud provider you have configured: +python run_on_remote.py \ + --example pytorch/text-generation/run_generation.py \ + --model_type=openai-community/gpt2 \ + --model_name_or_path=openai-community/gpt2 \ + --prompt "I am a language model and" + +# For byo (bring your own) cluster: +python run_on_remote.py --host --user --key_path \ + --example + +# For on-demand instances +python run_on_remote.py --instance --provider \ + --example +``` + +You can also adapt the script to your own needs. 
diff --git a/examples/flax/_tests_requirements.txt b/examples/flax/_tests_requirements.txt index f1e0fb2d90712f..f83c1910a11379 100644 --- a/examples/flax/_tests_requirements.txt +++ b/examples/flax/_tests_requirements.txt @@ -1,8 +1,10 @@ datasets >= 1.1.3 -pytest +pytest<8.0.1 conllu nltk rouge-score seqeval tensorboard -evaluate >= 0.2.0 \ No newline at end of file +evaluate >= 0.2.0 +torch +accelerate \ No newline at end of file diff --git a/examples/flax/conftest.py b/examples/flax/conftest.py index 131c6af92c44cc..4cf2e46ef07393 100644 --- a/examples/flax/conftest.py +++ b/examples/flax/conftest.py @@ -21,7 +21,7 @@ # allow having multiple repository checkouts and not needing to remember to rerun -# 'pip install -e .[dev]' when switching between checkouts and running tests. +# `pip install -e '.[dev]'` when switching between checkouts and running tests. git_repo_path = abspath(join(dirname(dirname(dirname(__file__))), "src")) sys.path.insert(1, git_repo_path) diff --git a/examples/flax/image-captioning/README.md b/examples/flax/image-captioning/README.md index 0faf56124bc2d0..dd2b420639258f 100644 --- a/examples/flax/image-captioning/README.md +++ b/examples/flax/image-captioning/README.md @@ -1,7 +1,7 @@ # Image Captioning (vision-encoder-text-decoder model) training example The following example showcases how to finetune a vision-encoder-text-decoder model for image captioning -using the JAX/Flax backend, leveraging 🤗 Transformers library's [FlaxVisionEncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/visionencoderdecoder#transformers.FlaxVisionEncoderDecoderModel). +using the JAX/Flax backend, leveraging 🤗 Transformers library's [FlaxVisionEncoderDecoderModel](https://huggingface.co/docs/transformers/model_doc/vision-encoder-decoder#transformers.FlaxVisionEncoderDecoderModel). JAX/Flax allows you to trace pure functions and compile them into efficient, fused accelerator code on both GPU and TPU. Models written in JAX/Flax are **immutable** and updated in a purely functional @@ -10,7 +10,7 @@ way which enables simple and efficient model parallelism. `run_image_captioning_flax.py` is a lightweight example of how to download and preprocess a dataset from the 🤗 Datasets library or use your own files (jsonlines or csv), then fine-tune one of the architectures above on it. -For custom datasets in `jsonlines` format please see: https://huggingface.co/docs/datasets/loading_datasets.html#json-files and you also will find examples of these below. +For custom datasets in `jsonlines` format please see: https://huggingface.co/docs/datasets/loading_datasets#json-files and you also will find examples of these below. 
### Download COCO dataset (2017) This example uses COCO dataset (2017) through a custom dataset script, which requires users to manually download the @@ -34,7 +34,7 @@ Next, we create a [FlaxVisionEncoderDecoderModel](https://huggingface.co/docs/tr python3 create_model_from_encoder_decoder_models.py \ --output_dir model \ --encoder_model_name_or_path google/vit-base-patch16-224-in21k \ - --decoder_model_name_or_path gpt2 + --decoder_model_name_or_path openai-community/gpt2 ``` ### Train the model diff --git a/examples/flax/image-captioning/create_model_from_encoder_decoder_models.py b/examples/flax/image-captioning/create_model_from_encoder_decoder_models.py index c5ce0e4ce133c4..0ebd1464883874 100644 --- a/examples/flax/image-captioning/create_model_from_encoder_decoder_models.py +++ b/examples/flax/image-captioning/create_model_from_encoder_decoder_models.py @@ -37,7 +37,7 @@ class ModelArguments: encoder_model_name_or_path: str = field( metadata={ "help": ( - "The encoder model checkpoint for weights initialization." + "The encoder model checkpoint for weights initialization. " "Don't set if you want to train an encoder model from scratch." ) }, @@ -45,7 +45,7 @@ class ModelArguments: decoder_model_name_or_path: str = field( metadata={ "help": ( - "The decoder model checkpoint for weights initialization." + "The decoder model checkpoint for weights initialization. " "Don't set if you want to train a decoder model from scratch." ) }, diff --git a/examples/flax/image-captioning/run_image_captioning_flax.py b/examples/flax/image-captioning/run_image_captioning_flax.py index 66bd7290758108..652bb3be45474f 100644 --- a/examples/flax/image-captioning/run_image_captioning_flax.py +++ b/examples/flax/image-captioning/run_image_captioning_flax.py @@ -22,6 +22,7 @@ import os import sys import time +import warnings from dataclasses import asdict, dataclass, field from enum import Enum from functools import partial @@ -53,7 +54,7 @@ HfArgumentParser, is_tensorboard_available, ) -from transformers.utils import get_full_repo_name, is_offline_mode, send_example_telemetry +from transformers.utils import is_offline_mode, send_example_telemetry logger = logging.getLogger(__name__) @@ -182,12 +183,28 @@ class ModelArguments: ) }, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -239,7 +256,7 @@ class DataTrainingArguments: metadata={ "help": ( "The maximum total sequence length for validation target text after tokenization. Sequences longer " - "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`." + "than this will be truncated, sequences shorter will be padded. 
Will default to `max_target_length`. " "This argument is also used to override the `max_length` param of `model.generate`, which is used " "during evaluation." ) @@ -364,7 +381,7 @@ def write_metric(summary_writer, metrics, train_time, step, metric_key_prefix="t def create_learning_rate_fn( train_ds_size: int, train_batch_size: int, num_train_epochs: int, num_warmup_steps: int, learning_rate: float -) -> Callable[[int], jnp.array]: +) -> Callable[[int], jnp.ndarray]: """Returns a linear warmup, linear_decay learning rate function.""" steps_per_epoch = train_ds_size // train_batch_size num_train_steps = steps_per_epoch * num_train_epochs @@ -389,6 +406,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_image_captioning", model_args, data_args, framework="flax") @@ -400,7 +426,7 @@ def main(): and not training_args.overwrite_output_dir ): raise ValueError( - f"Output directory ({training_args.output_dir}) already exists and is not empty." + f"Output directory ({training_args.output_dir}) already exists and is not empty. " "Use --overwrite_output_dir to overcome." ) @@ -424,14 +450,14 @@ def main(): # Handle the repository creation if training_args.push_to_hub: - if training_args.hub_model_id is None: - repo_name = get_full_repo_name( - Path(training_args.output_dir).absolute().name, token=training_args.hub_token - ) - else: - repo_name = training_args.hub_model_id - create_repo(repo_name, exist_ok=True, token=training_args.hub_token) - repo = Repository(training_args.output_dir, clone_from=repo_name, token=training_args.hub_token) + # Retrieve of infer repo_name + repo_name = training_args.hub_model_id + if repo_name is None: + repo_name = Path(training_args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=training_args.hub_token).repo_id + # Clone repo locally + repo = Repository(training_args.output_dir, clone_from=repo_id, token=training_args.hub_token) # Get the datasets: you can either provide your own CSV/JSON training and evaluation files (see below) # or just provide the name of one of the public datasets available on the hub at https://huggingface.co/datasets/ @@ -448,7 +474,7 @@ def main(): cache_dir=model_args.cache_dir, keep_in_memory=False, data_dir=data_args.data_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: data_files = {} @@ -465,28 +491,31 @@ def main(): extension, data_files=data_files, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. 
# Load pretrained model and tokenizer model = FlaxVisionEncoderDecoderModel.from_pretrained( model_args.model_name_or_path, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype), - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) image_processor = AutoImageProcessor.from_pretrained( model_args.model_name_or_path, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) tokenizer = AutoTokenizer.from_pretrained( model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) tokenizer.pad_token = tokenizer.convert_ids_to_tokens(model.config.pad_token_id) @@ -659,7 +688,7 @@ def preprocess_fn(examples, max_target_length, check_image=True): eval_batch_size = int(training_args.per_device_eval_batch_size) * jax.device_count() if training_args.block_size % train_batch_size > 0 or training_args.block_size % eval_batch_size > 0: raise ValueError( - "`training_args.block_size` needs to be a multiple of the global train/eval batch size." + "`training_args.block_size` needs to be a multiple of the global train/eval batch size. " f"Got {training_args.block_size}, {train_batch_size} and {eval_batch_size} respectively instead." ) @@ -824,7 +853,7 @@ def blockwise_data_loader( yield batch # Metric - metric = evaluate.load("rouge") + metric = evaluate.load("rouge", cache_dir=model_args.cache_dir) def postprocess_text(preds, labels): preds = [pred.strip() for pred in preds] @@ -892,14 +921,12 @@ def decay_mask_fn(params): flat_params = traverse_util.flatten_dict(params) # find out all LayerNorm parameters layer_norm_candidates = ["layernorm", "layer_norm", "ln"] - layer_norm_named_params = set( - [ - layer[-2:] - for layer_norm_name in layer_norm_candidates - for layer in flat_params.keys() - if layer_norm_name in "".join(layer).lower() - ] - ) + layer_norm_named_params = { + layer[-2:] + for layer_norm_name in layer_norm_candidates + for layer in flat_params.keys() + if layer_norm_name in "".join(layer).lower() + } flat_mask = {path: (path[-1] != "bias" and path[-2:] not in layer_norm_named_params) for path in flat_params} return traverse_util.unflatten_dict(flat_mask) diff --git a/examples/flax/language-modeling/README.md b/examples/flax/language-modeling/README.md index 5346904d84c688..cb8671147ff98c 100644 --- a/examples/flax/language-modeling/README.md +++ b/examples/flax/language-modeling/README.md @@ -28,7 +28,7 @@ way which enables simple and efficient model parallelism. In the following, we demonstrate how to train a bi-directional transformer model using masked language modeling objective as introduced in [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805). More specifically, we demonstrate how JAX/Flax can be leveraged -to pre-train [**`roberta-base`**](https://huggingface.co/roberta-base) +to pre-train [**`FacebookAI/roberta-base`**](https://huggingface.co/FacebookAI/roberta-base) in Norwegian on a single TPUv3-8 pod. The example script uses the 🤗 Datasets library. You can easily customize them to your needs if you need extra processing on your datasets. 
@@ -76,13 +76,13 @@ tokenizer.save("./norwegian-roberta-base/tokenizer.json") ### Create configuration Next, we create the model's configuration file. This is as simple -as loading and storing [`**roberta-base**`](https://huggingface.co/roberta-base) +as loading and storing [`**FacebookAI/roberta-base**`](https://huggingface.co/FacebookAI/roberta-base) in the local model folder: ```python from transformers import RobertaConfig -config = RobertaConfig.from_pretrained("roberta-base", vocab_size=50265) +config = RobertaConfig.from_pretrained("FacebookAI/roberta-base", vocab_size=50265) config.save_pretrained("./norwegian-roberta-base") ``` @@ -129,8 +129,8 @@ look at [this](https://colab.research.google.com/github/huggingface/notebooks/bl In the following, we demonstrate how to train an auto-regressive causal transformer model in JAX/Flax. -More specifically, we pretrain a randomly initialized [**`gpt2`**](https://huggingface.co/gpt2) model in Norwegian on a single TPUv3-8. -to pre-train 124M [**`gpt2`**](https://huggingface.co/gpt2) +More specifically, we pretrain a randomly initialized [**`openai-community/gpt2`**](https://huggingface.co/openai-community/gpt2) model in Norwegian on a single TPUv3-8. +to pre-train 124M [**`openai-community/gpt2`**](https://huggingface.co/openai-community/gpt2) in Norwegian on a single TPUv3-8 pod. The example script uses the 🤗 Datasets library. You can easily customize them to your needs if you need extra processing on your datasets. @@ -179,13 +179,13 @@ tokenizer.save("./norwegian-gpt2/tokenizer.json") ### Create configuration Next, we create the model's configuration file. This is as simple -as loading and storing [`**gpt2**`](https://huggingface.co/gpt2) +as loading and storing [`**openai-community/gpt2**`](https://huggingface.co/openai-community/gpt2) in the local model folder: ```python from transformers import GPT2Config -config = GPT2Config.from_pretrained("gpt2", resid_pdrop=0.0, embd_pdrop=0.0, attn_pdrop=0.0, vocab_size=50257) +config = GPT2Config.from_pretrained("openai-community/gpt2", resid_pdrop=0.0, embd_pdrop=0.0, attn_pdrop=0.0, vocab_size=50257) config.save_pretrained("./norwegian-gpt2") ``` @@ -199,7 +199,7 @@ Finally, we can run the example script to pretrain the model: ```bash python run_clm_flax.py \ --output_dir="./norwegian-gpt2" \ - --model_type="gpt2" \ + --model_type="openai-community/gpt2" \ --config_name="./norwegian-gpt2" \ --tokenizer_name="./norwegian-gpt2" \ --dataset_name="oscar" \ @@ -449,7 +449,7 @@ are 8 TPU cores on 4 chips (each chips has 2 cores), while "8 GPU" are 8 GPU chi For comparison one can run the same pre-training with PyTorch/XLA on TPU. To set up PyTorch/XLA on Cloud TPU VMs, please refer to [this](https://cloud.google.com/tpu/docs/pytorch-xla-ug-tpu-vm) guide. -Having created the tokenzier and configuration in `norwegian-roberta-base`, we create the following symbolic links: +Having created the tokenizer and configuration in `norwegian-roberta-base`, we create the following symbolic links: ```bash ln -s ~/transformers/examples/pytorch/language-modeling/run_mlm.py ./ @@ -499,7 +499,7 @@ python3 xla_spawn.py --num_cores ${NUM_TPUS} run_mlm.py --output_dir="./runs" \ For comparison you can run the same pre-training with PyTorch on GPU. Note that we have to make use of `gradient_accumulation` because the maximum batch size that fits on a single V100 GPU is 32 instead of 128. 
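Concretely, matching the batch size of 128 used above while only fitting 32 examples per step on a single V100 means accumulating gradients over four steps; a quick sanity check of that arithmetic:

```python
per_device_train_batch_size = 32      # largest batch that fits on a single V100
target_effective_batch_size = 128     # batch size referenced above
gradient_accumulation_steps = target_effective_batch_size // per_device_train_batch_size
assert gradient_accumulation_steps == 4
```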
-Having created the tokenzier and configuration in `norwegian-roberta-base`, we create the following symbolic links: +Having created the tokenizer and configuration in `norwegian-roberta-base`, we create the following symbolic links: ```bash ln -s ~/transformers/examples/pytorch/language-modeling/run_mlm.py ./ diff --git a/examples/flax/language-modeling/run_bart_dlm_flax.py b/examples/flax/language-modeling/run_bart_dlm_flax.py index 0a97bffd9304b0..f5369299a6d4c9 100644 --- a/examples/flax/language-modeling/run_bart_dlm_flax.py +++ b/examples/flax/language-modeling/run_bart_dlm_flax.py @@ -26,6 +26,7 @@ import os import sys import time +import warnings from dataclasses import asdict, dataclass, field from enum import Enum from itertools import chain @@ -59,7 +60,7 @@ set_seed, ) from transformers.models.bart.modeling_flax_bart import shift_tokens_right -from transformers.utils import get_full_repo_name, send_example_telemetry +from transformers.utils import send_example_telemetry MODEL_CONFIG_CLASSES = list(FLAX_MODEL_FOR_MASKED_LM_MAPPING.keys()) @@ -138,7 +139,7 @@ class ModelArguments: default=None, metadata={ "help": ( - "The model checkpoint for weights initialization.Don't set if you want to train a model from scratch." + "The model checkpoint for weights initialization. Don't set if you want to train a model from scratch." ) }, ) @@ -168,15 +169,21 @@ class ModelArguments: ) }, ) - use_auth_token: bool = field( - default=False, + token: str = field( + default=None, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." ) }, ) + use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) @dataclass @@ -242,10 +249,12 @@ def __post_init__(self): else: if self.train_file is not None: extension = self.train_file.split(".")[-1] - assert extension in ["csv", "json", "txt"], "`train_file` should be a csv, a json or a txt file." + if extension not in ["csv", "json", "txt"]: + raise ValueError("train_file` should be a csv, json or text file.") if self.validation_file is not None: extension = self.validation_file.split(".")[-1] - assert extension in ["csv", "json", "txt"], "`validation_file` should be a csv, a json or a txt file." 
+ if extension not in ["csv", "json", "txt"]: + raise ValueError("`validation_file` should be a csv, json or text file.") @flax.struct.dataclass @@ -319,15 +328,13 @@ def permute_sentences(self, input_ids): sentence_ends = np.argwhere(end_sentence_mask) sentence_ends[:, 1] += 1 example_has_multiple_sentences, num_sentences = np.unique(sentence_ends[:, 0], return_counts=True) - num_sentences_map = {sent_idx: count for sent_idx, count in zip(example_has_multiple_sentences, num_sentences)} + num_sentences_map = dict(zip(example_has_multiple_sentences, num_sentences)) num_to_permute = np.ceil(num_sentences * self.permute_sentence_ratio).astype(int) - num_to_permute_map = { - sent_idx: count for sent_idx, count in zip(example_has_multiple_sentences, num_to_permute) - } + num_to_permute_map = dict(zip(example_has_multiple_sentences, num_to_permute)) sentence_ends = np.split(sentence_ends[:, 1], np.unique(sentence_ends[:, 0], return_index=True)[1][1:]) - sentence_ends_map = {sent_idx: count for sent_idx, count in zip(example_has_multiple_sentences, sentence_ends)} + sentence_ends_map = dict(zip(example_has_multiple_sentences, sentence_ends)) for i in range(input_ids.shape[0]): if i not in example_has_multiple_sentences: @@ -463,6 +470,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_bart_dlm", model_args, data_args, framework="flax") @@ -474,7 +490,7 @@ def main(): and not training_args.overwrite_output_dir ): raise ValueError( - f"Output directory ({training_args.output_dir}) already exists and is not empty." + f"Output directory ({training_args.output_dir}) already exists and is not empty. " "Use --overwrite_output_dir to overcome." 
) @@ -496,14 +512,14 @@ def main(): # Handle the repository creation if training_args.push_to_hub: - if training_args.hub_model_id is None: - repo_name = get_full_repo_name( - Path(training_args.output_dir).absolute().name, token=training_args.hub_token - ) - else: - repo_name = training_args.hub_model_id - create_repo(repo_name, exist_ok=True, token=training_args.hub_token) - repo = Repository(training_args.output_dir, clone_from=repo_name, token=training_args.hub_token) + # Retrieve of infer repo_name + repo_name = training_args.hub_model_id + if repo_name is None: + repo_name = Path(training_args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=training_args.hub_token).repo_id + # Clone repo locally + repo = Repository(training_args.output_dir, clone_from=repo_id, token=training_args.hub_token) # Get the datasets: you can either provide your own CSV/JSON/TXT training and evaluation files (see below) # or just provide the name of one of the public datasets available on the hub at https://huggingface.co/datasets/ @@ -517,7 +533,8 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) if "validation" not in datasets.keys(): @@ -526,29 +543,33 @@ def main(): data_args.dataset_config_name, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) datasets["train"] = load_dataset( data_args.dataset_name, data_args.dataset_config_name, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) else: data_files = {} if data_args.train_file is not None: data_files["train"] = data_args.train_file + extension = data_args.train_file.split(".")[-1] if data_args.validation_file is not None: data_files["validation"] = data_args.validation_file - extension = data_args.train_file.split(".")[-1] + extension = data_args.validation_file.split(".")[-1] if extension == "txt": extension = "text" datasets = load_dataset( extension, data_files=data_files, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) if "validation" not in datasets.keys(): @@ -557,17 +578,19 @@ def main(): data_files=data_files, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) datasets["train"] = load_dataset( extension, data_files=data_files, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. 
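The push_to_hub refactor above (dropping `get_full_repo_name` in favour of `create_repo(...).repo_id`) follows the same pattern in every script. A standalone sketch of it, with placeholder values standing in for the `training_args` fields:

```python
from pathlib import Path

from huggingface_hub import Repository, create_repo

output_dir = "./norwegian-bart-base"   # placeholder for training_args.output_dir
hub_model_id = None                    # placeholder for training_args.hub_model_id
hub_token = None                       # placeholder for training_args.hub_token

# Retrieve or infer the repo name from the output directory.
repo_name = hub_model_id if hub_model_id is not None else Path(output_dir).absolute().name
# Create the repo (no-op if it already exists) and keep the fully-qualified repo_id it returns.
repo_id = create_repo(repo_name, exist_ok=True, token=hub_token).repo_id
# Clone the repo locally into the output directory.
repo = Repository(output_dir, clone_from=repo_id, token=hub_token)
```

Using the returned `repo_id` rather than the bare folder name avoids guessing the namespace, which is what `get_full_repo_name` used to do.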
# Load pretrained model and tokenizer @@ -576,18 +599,18 @@ def main(): model_args.tokenizer_name, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) elif model_args.model_name_or_path: tokenizer = AutoTokenizer.from_pretrained( model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name." ) @@ -596,13 +619,13 @@ def main(): model_args.config_name, cache_dir=model_args.cache_dir, vocab_size=len(tokenizer), - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) elif model_args.model_name_or_path: config = BartConfig.from_pretrained( model_args.model_name_or_path, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: config = CONFIG_MAPPING[model_args.model_type]() @@ -671,7 +694,7 @@ def group_texts(examples): # might be slower to preprocess. # # To speed up this part, we use multiprocessing. See the documentation of the map method for more information: - # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map + # https://huggingface.co/docs/datasets/process#map tokenized_datasets = tokenized_datasets.map( group_texts, batched=True, @@ -707,7 +730,7 @@ def group_texts(examples): config=config, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype), - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: config.vocab_size = len(tokenizer) @@ -756,14 +779,12 @@ def decay_mask_fn(params): flat_params = traverse_util.flatten_dict(params) # find out all LayerNorm parameters layer_norm_candidates = ["layernorm", "layer_norm", "ln"] - layer_norm_named_params = set( - [ - layer[-2:] - for layer_norm_name in layer_norm_candidates - for layer in flat_params.keys() - if layer_norm_name in "".join(layer).lower() - ] - ) + layer_norm_named_params = { + layer[-2:] + for layer_norm_name in layer_norm_candidates + for layer in flat_params.keys() + if layer_norm_name in "".join(layer).lower() + } flat_mask = {path: (path[-1] != "bias" and path[-2:] not in layer_norm_named_params) for path in flat_params} return traverse_util.unflatten_dict(flat_mask) diff --git a/examples/flax/language-modeling/run_clm_flax.py b/examples/flax/language-modeling/run_clm_flax.py index 607c9bb1ee7c88..5ef17be35322c5 100755 --- a/examples/flax/language-modeling/run_clm_flax.py +++ b/examples/flax/language-modeling/run_clm_flax.py @@ -27,6 +27,7 @@ import os import sys import time +import warnings from dataclasses import asdict, dataclass, field from enum import Enum from itertools import chain @@ -58,7 +59,7 @@ set_seed, ) from transformers.testing_utils import CaptureLogger -from transformers.utils import get_full_repo_name, send_example_telemetry +from transformers.utils import send_example_telemetry logger = logging.getLogger(__name__) @@ -139,7 +140,7 @@ class ModelArguments: default=None, metadata={ "help": ( - "The model checkpoint for weights initialization.Don't set if you want to train a 
model from scratch." + "The model checkpoint for weights initialization. Don't set if you want to train a model from scratch." ) }, ) @@ -169,12 +170,28 @@ class ModelArguments: ) }, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -251,10 +268,12 @@ def __post_init__(self): else: if self.train_file is not None: extension = self.train_file.split(".")[-1] - assert extension in ["csv", "json", "txt"], "`train_file` should be a csv, a json or a txt file." + if extension not in ["csv", "json", "txt"]: + raise ValueError("train_file` should be a csv, json or text file.") if self.validation_file is not None: extension = self.validation_file.split(".")[-1] - assert extension in ["csv", "json", "txt"], "`validation_file` should be a csv, a json or a txt file." + if extension not in ["csv", "json", "txt"]: + raise ValueError("`validation_file` should be a csv, json or text file.") class TrainState(train_state.TrainState): @@ -307,7 +326,7 @@ def write_eval_metric(summary_writer, eval_metrics, step): def create_learning_rate_fn( train_ds_size: int, train_batch_size: int, num_train_epochs: int, num_warmup_steps: int, learning_rate: float -) -> Callable[[int], jnp.array]: +) -> Callable[[int], jnp.ndarray]: """Returns a linear warmup, linear_decay learning rate function.""" steps_per_epoch = train_ds_size // train_batch_size num_train_steps = steps_per_epoch * num_train_epochs @@ -332,6 +351,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_clm", model_args, data_args, framework="flax") @@ -343,7 +371,7 @@ def main(): and not training_args.overwrite_output_dir ): raise ValueError( - f"Output directory ({training_args.output_dir}) already exists and is not empty." + f"Output directory ({training_args.output_dir}) already exists and is not empty. " "Use --overwrite_output_dir to overcome." 
) @@ -370,14 +398,14 @@ def main(): # Handle the repository creation if training_args.push_to_hub: - if training_args.hub_model_id is None: - repo_name = get_full_repo_name( - Path(training_args.output_dir).absolute().name, token=training_args.hub_token - ) - else: - repo_name = training_args.hub_model_id - create_repo(repo_name, exist_ok=True, token=training_args.hub_token) - repo = Repository(training_args.output_dir, clone_from=repo_name, token=training_args.hub_token) + # Retrieve of infer repo_name + repo_name = training_args.hub_model_id + if repo_name is None: + repo_name = Path(training_args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=training_args.hub_token).repo_id + # Clone repo locally + repo = Repository(training_args.output_dir, clone_from=repo_id, token=training_args.hub_token) # Get the datasets: you can either provide your own CSV/JSON/TXT training and evaluation files (see below) # or just provide the name of one of the public datasets available on the hub at https://huggingface.co/datasets/ @@ -395,7 +423,8 @@ def main(): data_args.dataset_config_name, cache_dir=model_args.cache_dir, keep_in_memory=False, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) if "validation" not in dataset.keys(): @@ -404,23 +433,26 @@ def main(): data_args.dataset_config_name, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) dataset["train"] = load_dataset( data_args.dataset_name, data_args.dataset_config_name, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) else: data_files = {} dataset_args = {} if data_args.train_file is not None: data_files["train"] = data_args.train_file + extension = data_args.train_file.split(".")[-1] if data_args.validation_file is not None: data_files["validation"] = data_args.validation_file - extension = data_args.train_file.split(".")[-1] + extension = data_args.validation_file.split(".")[-1] if extension == "txt": extension = "text" dataset_args["keep_linebreaks"] = data_args.keep_linebreaks @@ -429,7 +461,8 @@ def main(): data_files=data_files, cache_dir=model_args.cache_dir, **dataset_args, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) if "validation" not in dataset.keys(): @@ -439,7 +472,8 @@ def main(): split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, **dataset_args, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) dataset["train"] = load_dataset( extension, @@ -447,10 +481,11 @@ def main(): split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, **dataset_args, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. 
+ # https://huggingface.co/docs/datasets/loading_datasets. # Load pretrained model and tokenizer @@ -461,13 +496,15 @@ def main(): config = AutoConfig.from_pretrained( model_args.config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) elif model_args.model_name_or_path: config = AutoConfig.from_pretrained( model_args.model_name_or_path, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) else: config = CONFIG_MAPPING[model_args.model_type]() @@ -478,18 +515,20 @@ def main(): model_args.tokenizer_name, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) elif model_args.model_name_or_path: tokenizer = AutoTokenizer.from_pretrained( model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) else: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name." ) @@ -499,13 +538,15 @@ def main(): config=config, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype), - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) else: model = FlaxAutoModelForCausalLM.from_config( config, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype), + trust_remote_code=model_args.trust_remote_code, ) # Preprocessing the datasets. @@ -543,13 +584,13 @@ def tokenize_function(examples): if block_size > config.max_position_embeddings: logger.warning( f"The tokenizer picked seems to have a very large `model_max_length` ({tokenizer.model_max_length}). " - "Picking 1024 instead. You can change that default value by passing --block_size xxx." + f"Using block_size={min(1024, config.max_position_embeddings)} instead. You can change that default value by passing --block_size xxx." ) - block_size = 1024 + block_size = min(1024, config.max_position_embeddings) else: if data_args.block_size > tokenizer.model_max_length: logger.warning( - f"The block_size passed ({data_args.block_size}) is larger than the maximum length for the model" + f"The block_size passed ({data_args.block_size}) is larger than the maximum length for the model " f"({tokenizer.model_max_length}). Using block_size={tokenizer.model_max_length}." ) block_size = min(data_args.block_size, tokenizer.model_max_length) @@ -576,7 +617,7 @@ def group_texts(examples): # to preprocess. # # To speed up this part, we use multiprocessing. 
See the documentation of the map method for more information: - # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map + # https://huggingface.co/docs/datasets/process#map lm_datasets = tokenized_datasets.map( group_texts, @@ -648,14 +689,12 @@ def decay_mask_fn(params): flat_params = traverse_util.flatten_dict(params) # find out all LayerNorm parameters layer_norm_candidates = ["layernorm", "layer_norm", "ln"] - layer_norm_named_params = set( - [ - layer[-2:] - for layer_norm_name in layer_norm_candidates - for layer in flat_params.keys() - if layer_norm_name in "".join(layer).lower() - ] - ) + layer_norm_named_params = { + layer[-2:] + for layer_norm_name in layer_norm_candidates + for layer in flat_params.keys() + if layer_norm_name in "".join(layer).lower() + } flat_mask = {path: (path[-1] != "bias" and path[-2:] not in layer_norm_named_params) for path in flat_params} return traverse_util.unflatten_dict(flat_mask) diff --git a/examples/flax/language-modeling/run_mlm_flax.py b/examples/flax/language-modeling/run_mlm_flax.py index 6a06533b14e419..86415578138aea 100755 --- a/examples/flax/language-modeling/run_mlm_flax.py +++ b/examples/flax/language-modeling/run_mlm_flax.py @@ -26,6 +26,7 @@ import os import sys import time +import warnings from dataclasses import asdict, dataclass, field from enum import Enum from itertools import chain @@ -59,7 +60,7 @@ is_tensorboard_available, set_seed, ) -from transformers.utils import get_full_repo_name, send_example_telemetry +from transformers.utils import send_example_telemetry MODEL_CONFIG_CLASSES = list(FLAX_MODEL_FOR_MASKED_LM_MAPPING.keys()) @@ -144,7 +145,7 @@ class ModelArguments: default=None, metadata={ "help": ( - "The model checkpoint for weights initialization.Don't set if you want to train a model from scratch." + "The model checkpoint for weights initialization. Don't set if you want to train a model from scratch." ) }, ) @@ -174,12 +175,28 @@ class ModelArguments: ) }, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -377,6 +394,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. 
The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_mlm", model_args, data_args, framework="flax") @@ -388,7 +414,7 @@ def main(): and not training_args.overwrite_output_dir ): raise ValueError( - f"Output directory ({training_args.output_dir}) already exists and is not empty." + f"Output directory ({training_args.output_dir}) already exists and is not empty. " "Use --overwrite_output_dir to overcome." ) @@ -410,14 +436,14 @@ def main(): # Handle the repository creation if training_args.push_to_hub: - if training_args.hub_model_id is None: - repo_name = get_full_repo_name( - Path(training_args.output_dir).absolute().name, token=training_args.hub_token - ) - else: - repo_name = training_args.hub_model_id - create_repo(repo_name, exist_ok=True, token=training_args.hub_token) - repo = Repository(training_args.output_dir, clone_from=repo_name, token=training_args.hub_token) + # Retrieve of infer repo_name + repo_name = training_args.hub_model_id + if repo_name is None: + repo_name = Path(training_args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=training_args.hub_token).repo_id + # Clone repo locally + repo = Repository(training_args.output_dir, clone_from=repo_id, token=training_args.hub_token) # Get the datasets: you can either provide your own CSV/JSON/TXT training and evaluation files (see below) # or just provide the name of one of the public datasets available on the hub at https://huggingface.co/datasets/ @@ -434,7 +460,8 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) if "validation" not in datasets.keys(): @@ -443,29 +470,33 @@ def main(): data_args.dataset_config_name, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) datasets["train"] = load_dataset( data_args.dataset_name, data_args.dataset_config_name, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) else: data_files = {} if data_args.train_file is not None: data_files["train"] = data_args.train_file + extension = data_args.train_file.split(".")[-1] if data_args.validation_file is not None: data_files["validation"] = data_args.validation_file - extension = data_args.train_file.split(".")[-1] + extension = data_args.validation_file.split(".")[-1] if extension == "txt": extension = "text" datasets = load_dataset( extension, data_files=data_files, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) if "validation" not in datasets.keys(): @@ -474,17 +505,19 @@ def main(): data_files=data_files, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) datasets["train"] = load_dataset( extension, data_files=data_files, 
split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # Load pretrained model and tokenizer @@ -495,13 +528,15 @@ def main(): config = AutoConfig.from_pretrained( model_args.config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) elif model_args.model_name_or_path: config = AutoConfig.from_pretrained( model_args.model_name_or_path, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) else: config = CONFIG_MAPPING[model_args.model_type]() @@ -512,18 +547,20 @@ def main(): model_args.tokenizer_name, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) elif model_args.model_name_or_path: tokenizer = AutoTokenizer.from_pretrained( model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) else: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name." ) @@ -598,7 +635,7 @@ def group_texts(examples): # might be slower to preprocess. # # To speed up this part, we use multiprocessing. 
See the documentation of the map method for more information: - # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map + # https://huggingface.co/docs/datasets/process#map tokenized_datasets = tokenized_datasets.map( group_texts, batched=True, @@ -638,13 +675,15 @@ def group_texts(examples): config=config, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype), - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) else: model = FlaxAutoModelForMaskedLM.from_config( config, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype), + trust_remote_code=model_args.trust_remote_code, ) if training_args.gradient_checkpointing: @@ -679,14 +718,12 @@ def decay_mask_fn(params): flat_params = traverse_util.flatten_dict(params) # find out all LayerNorm parameters layer_norm_candidates = ["layernorm", "layer_norm", "ln"] - layer_norm_named_params = set( - [ - layer[-2:] - for layer_norm_name in layer_norm_candidates - for layer in flat_params.keys() - if layer_norm_name in "".join(layer).lower() - ] - ) + layer_norm_named_params = { + layer[-2:] + for layer_norm_name in layer_norm_candidates + for layer in flat_params.keys() + if layer_norm_name in "".join(layer).lower() + } flat_mask = {path: (path[-1] != "bias" and path[-2:] not in layer_norm_named_params) for path in flat_params} return traverse_util.unflatten_dict(flat_mask) diff --git a/examples/flax/language-modeling/run_t5_mlm_flax.py b/examples/flax/language-modeling/run_t5_mlm_flax.py index 814d68a88e3716..fa6a5742236ca5 100755 --- a/examples/flax/language-modeling/run_t5_mlm_flax.py +++ b/examples/flax/language-modeling/run_t5_mlm_flax.py @@ -25,6 +25,7 @@ import os import sys import time +import warnings from dataclasses import asdict, dataclass, field # You can also adapt this script on your own masked language modeling task. Pointers for this are left as comments. @@ -59,7 +60,7 @@ set_seed, ) from transformers.models.t5.modeling_flax_t5 import shift_tokens_right -from transformers.utils import get_full_repo_name, send_example_telemetry +from transformers.utils import send_example_telemetry MODEL_CONFIG_CLASSES = list(FLAX_MODEL_FOR_MASKED_LM_MAPPING.keys()) @@ -138,7 +139,7 @@ class ModelArguments: default=None, metadata={ "help": ( - "The model checkpoint for weights initialization.Don't set if you want to train a model from scratch." + "The model checkpoint for weights initialization. Don't set if you want to train a model from scratch." ) }, ) @@ -168,15 +169,21 @@ class ModelArguments: ) }, ) - use_auth_token: bool = field( - default=False, + token: str = field( + default=None, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." ) }, ) + use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) @dataclass @@ -418,13 +425,14 @@ def random_spans_noise_mask(self, length): orig_length = length num_noise_tokens = int(np.round(length * self.noise_density)) + num_nonnoise_tokens = length - num_noise_tokens # avoid degeneracy by ensuring positive numbers of noise and nonnoise tokens. 
num_noise_tokens = min(max(num_noise_tokens, 1), length - 1) - num_noise_spans = int(np.round(num_noise_tokens / self.mean_noise_span_length)) + # num_noise_tokens should be less than num_noise_tokens and num_nonnoise_tokens + num_noise_spans = int(np.round(min(num_noise_tokens, num_nonnoise_tokens) / self.mean_noise_span_length)) # avoid degeneracy by ensuring positive number of noise spans num_noise_spans = max(num_noise_spans, 1) - num_nonnoise_tokens = length - num_noise_tokens # pick the lengths of the noise spans and the non-noise spans def _random_segmentation(num_items, num_segments): @@ -503,6 +511,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_t5_mlm", model_args, data_args, framework="flax") @@ -514,7 +531,7 @@ def main(): and not training_args.overwrite_output_dir ): raise ValueError( - f"Output directory ({training_args.output_dir}) already exists and is not empty." + f"Output directory ({training_args.output_dir}) already exists and is not empty. " "Use --overwrite_output_dir to overcome." ) @@ -536,14 +553,14 @@ def main(): # Handle the repository creation if training_args.push_to_hub: - if training_args.hub_model_id is None: - repo_name = get_full_repo_name( - Path(training_args.output_dir).absolute().name, token=training_args.hub_token - ) - else: - repo_name = training_args.hub_model_id - create_repo(repo_name, exist_ok=True, token=training_args.hub_token) - repo = Repository(training_args.output_dir, clone_from=repo_name, token=training_args.hub_token) + # Retrieve of infer repo_name + repo_name = training_args.hub_model_id + if repo_name is None: + repo_name = Path(training_args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=training_args.hub_token).repo_id + # Clone repo locally + repo = Repository(training_args.output_dir, clone_from=repo_id, token=training_args.hub_token) # Get the datasets: you can either provide your own CSV/JSON/TXT training and evaluation files (see below) # or just provide the name of one of the public datasets available on the hub at https://huggingface.co/datasets/ @@ -557,7 +574,8 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) if "validation" not in datasets.keys(): @@ -566,29 +584,33 @@ def main(): data_args.dataset_config_name, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) datasets["train"] = load_dataset( data_args.dataset_name, data_args.dataset_config_name, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, - 
use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) else: data_files = {} if data_args.train_file is not None: data_files["train"] = data_args.train_file + extension = data_args.train_file.split(".")[-1] if data_args.validation_file is not None: data_files["validation"] = data_args.validation_file - extension = data_args.train_file.split(".")[-1] + extension = data_args.validation_file.split(".")[-1] if extension == "txt": extension = "text" datasets = load_dataset( extension, data_files=data_files, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) if "validation" not in datasets.keys(): @@ -597,17 +619,19 @@ def main(): data_files=data_files, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) datasets["train"] = load_dataset( extension, data_files=data_files, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + num_proc=data_args.preprocessing_num_workers, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # Load pretrained model and tokenizer @@ -616,18 +640,18 @@ def main(): model_args.tokenizer_name, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) elif model_args.model_name_or_path: tokenizer = AutoTokenizer.from_pretrained( model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name." ) @@ -636,13 +660,13 @@ def main(): model_args.config_name, cache_dir=model_args.cache_dir, vocab_size=len(tokenizer), - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) elif model_args.model_name_or_path: config = T5Config.from_pretrained( model_args.model_name_or_path, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: config = CONFIG_MAPPING[model_args.model_type]() @@ -701,7 +725,7 @@ def group_texts(examples): # might be slower to preprocess. # # To speed up this part, we use multiprocessing. 
See the documentation of the map method for more information: - # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map + # https://huggingface.co/docs/datasets/process#map tokenized_datasets = tokenized_datasets.map( group_texts, batched=True, @@ -737,7 +761,7 @@ def group_texts(examples): config=config, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype), - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: config.vocab_size = len(tokenizer) @@ -791,14 +815,12 @@ def decay_mask_fn(params): flat_params = traverse_util.flatten_dict(params) # find out all LayerNorm parameters layer_norm_candidates = ["layernorm", "layer_norm", "ln"] - layer_norm_named_params = set( - [ - layer[-2:] - for layer_norm_name in layer_norm_candidates - for layer in flat_params.keys() - if layer_norm_name in "".join(layer).lower() - ] - ) + layer_norm_named_params = { + layer[-2:] + for layer_norm_name in layer_norm_candidates + for layer in flat_params.keys() + if layer_norm_name in "".join(layer).lower() + } flat_mask = {path: (path[-1] != "bias" and path[-2:] not in layer_norm_named_params) for path in flat_params} return traverse_util.unflatten_dict(flat_mask) diff --git a/examples/flax/question-answering/README.md b/examples/flax/question-answering/README.md index 822342a99e2168..2f6caa984d4bc1 100644 --- a/examples/flax/question-answering/README.md +++ b/examples/flax/question-answering/README.md @@ -29,7 +29,7 @@ The following example fine-tunes BERT on SQuAD: ```bash python run_qa.py \ - --model_name_or_path bert-base-uncased \ + --model_name_or_path google-bert/bert-base-uncased \ --dataset_name squad \ --do_train \ --do_eval \ @@ -67,7 +67,7 @@ Here is an example training on 4 TITAN RTX GPUs and Bert Whole Word Masking unca ```bash export CUDA_VISIBLE_DEVICES=0,1,2,3 python run_qa.py \ ---model_name_or_path bert-large-uncased-whole-word-masking \ +--model_name_or_path google-bert/bert-large-uncased-whole-word-masking \ --dataset_name squad \ --do_train \ --do_eval \ diff --git a/examples/flax/question-answering/run_qa.py b/examples/flax/question-answering/run_qa.py index 628b9b81b286c0..7f31321837a88f 100644 --- a/examples/flax/question-answering/run_qa.py +++ b/examples/flax/question-answering/run_qa.py @@ -25,6 +25,7 @@ import random import sys import time +import warnings from dataclasses import asdict, dataclass, field from enum import Enum from pathlib import Path @@ -55,13 +56,13 @@ PreTrainedTokenizerFast, is_tensorboard_available, ) -from transformers.utils import check_min_version, get_full_repo_name, send_example_telemetry +from transformers.utils import check_min_version, send_example_telemetry logger = logging.getLogger(__name__) # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") Array = Any Dataset = datasets.arrow_dataset.Dataset @@ -155,12 +156,28 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." 
+ ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -333,14 +350,12 @@ def decay_mask_fn(params): flat_params = traverse_util.flatten_dict(params) # find out all LayerNorm parameters layer_norm_candidates = ["layernorm", "layer_norm", "ln"] - layer_norm_named_params = set( - [ - layer[-2:] - for layer_norm_name in layer_norm_candidates - for layer in flat_params.keys() - if layer_norm_name in "".join(layer).lower() - ] - ) + layer_norm_named_params = { + layer[-2:] + for layer_norm_name in layer_norm_candidates + for layer in flat_params.keys() + if layer_norm_name in "".join(layer).lower() + } flat_mask = {path: (path[-1] != "bias" and path[-2:] not in layer_norm_named_params) for path in flat_params} return traverse_util.unflatten_dict(flat_mask) @@ -374,7 +389,7 @@ def cross_entropy_loss(logits, labels): # region Create learning rate function def create_learning_rate_fn( train_ds_size: int, train_batch_size: int, num_train_epochs: int, num_warmup_steps: int, learning_rate: float -) -> Callable[[int], jnp.array]: +) -> Callable[[int], jnp.ndarray]: """Returns a linear warmup, linear_decay learning rate function.""" steps_per_epoch = train_ds_size // train_batch_size num_train_steps = steps_per_epoch * num_train_epochs @@ -440,6 +455,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. 
send_example_telemetry("run_qa", model_args, data_args, framework="flax") @@ -464,14 +488,14 @@ def main(): # Handle the repository creation if training_args.push_to_hub: - if training_args.hub_model_id is None: - repo_name = get_full_repo_name( - Path(training_args.output_dir).absolute().name, token=training_args.hub_token - ) - else: - repo_name = training_args.hub_model_id - create_repo(repo_name, exist_ok=True, token=training_args.hub_token) - repo = Repository(training_args.output_dir, clone_from=repo_name, token=training_args.hub_token) + # Retrieve of infer repo_name + repo_name = training_args.hub_model_id + if repo_name is None: + repo_name = Path(training_args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=training_args.hub_token).repo_id + # Clone repo locally + repo = Repository(training_args.output_dir, clone_from=repo_id, token=training_args.hub_token) # region Load Data # Get the datasets: you can either provide your own CSV/JSON/TXT training and evaluation files (see below) @@ -489,7 +513,7 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: # Loading the dataset from local csv or json file. @@ -509,10 +533,10 @@ def main(): data_files=data_files, field="data", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # endregion # region Load pretrained model and tokenizer @@ -522,14 +546,16 @@ def main(): model_args.config_name if model_args.config_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) tokenizer = AutoTokenizer.from_pretrained( model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=True, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) # endregion @@ -559,7 +585,7 @@ def main(): if data_args.max_seq_length > tokenizer.model_max_length: logger.warning( - f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the" + f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the " f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}." 
) max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length) @@ -642,13 +668,13 @@ def prepare_train_features(examples): return tokenized_examples - processed_raw_datasets = dict() + processed_raw_datasets = {} if training_args.do_train: if "train" not in raw_datasets: raise ValueError("--do_train requires a train dataset") train_dataset = raw_datasets["train"] if data_args.max_train_samples is not None: - # We will select sample from whole data if agument is specified + # We will select sample from whole data if argument is specified max_train_samples = min(len(train_dataset), data_args.max_train_samples) train_dataset = train_dataset.select(range(max_train_samples)) # Create train feature from dataset @@ -781,7 +807,9 @@ def post_processing_function(examples, features, predictions, stage="eval"): references = [{"id": ex["id"], "answers": ex[answer_column_name]} for ex in examples] return EvalPrediction(predictions=formatted_predictions, label_ids=references) - metric = evaluate.load("squad_v2" if data_args.version_2_with_negative else "squad") + metric = evaluate.load( + "squad_v2" if data_args.version_2_with_negative else "squad", cache_dir=model_args.cache_dir + ) def compute_metrics(p: EvalPrediction): return metric.compute(predictions=p.predictions, references=p.label_ids) @@ -876,7 +904,8 @@ def write_eval_metric(summary_writer, eval_metrics, step): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype), ) diff --git a/examples/flax/speech-recognition/README.md b/examples/flax/speech-recognition/README.md new file mode 100644 index 00000000000000..943c98761aa660 --- /dev/null +++ b/examples/flax/speech-recognition/README.md @@ -0,0 +1,68 @@ + + +# Automatic Speech Recognition - Flax Examples + +## Sequence to Sequence + +The script [`run_flax_speech_recognition_seq2seq.py`](https://github.com/huggingface/transformers/blob/main/examples/flax/speech-recognition/run_flax_speech_recognition_seq2seq.py) +can be used to fine-tune any [Flax Speech Sequence-to-Sequence Model](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.FlaxAutoModelForSpeechSeq2Seq) +for automatic speech recognition on one of the [official speech recognition datasets](https://huggingface.co/datasets?task_ids=task_ids:automatic-speech-recognition) +or a custom dataset. This includes the Whisper model from OpenAI, or a warm-started Speech-Encoder-Decoder Model, +an example for which is included below. + +### Whisper Model + +We can load all components of the Whisper model directly from the pretrained checkpoint, including the pretrained model +weights, feature extractor and tokenizer. We simply have to specify the id of fine-tuning dataset and the necessary +training hyperparameters. + +The following example shows how to fine-tune the [Whisper small](https://huggingface.co/openai/whisper-small) checkpoint +on the Hindi subset of the [Common Voice 13](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0) dataset. +Note that before running this script you must accept the dataset's [terms of use](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0) +and register your Hugging Face Hub token on your device by running `huggingface-hub login`. 
+ +```bash +python run_flax_speech_recognition_seq2seq.py \ + --model_name_or_path="openai/whisper-small" \ + --dataset_name="mozilla-foundation/common_voice_13_0" \ + --dataset_config_name="hi" \ + --language="hindi" \ + --train_split_name="train+validation" \ + --eval_split_name="test" \ + --output_dir="./whisper-small-hi-flax" \ + --per_device_train_batch_size="16" \ + --per_device_eval_batch_size="16" \ + --num_train_epochs="10" \ + --learning_rate="1e-4" \ + --warmup_steps="500" \ + --logging_steps="25" \ + --generation_max_length="40" \ + --preprocessing_num_workers="32" \ + --dataloader_num_workers="32" \ + --max_duration_in_seconds="30" \ + --text_column_name="sentence" \ + --overwrite_output_dir \ + --do_train \ + --do_eval \ + --predict_with_generate \ + --push_to_hub \ + --use_auth_token +``` + +On a TPU v4-8, training should take approximately 25 minutes, with a final cross-entropy loss of 0.02 and word error +rate of **34%**. See the checkpoint [sanchit-gandhi/whisper-small-hi-flax](https://huggingface.co/sanchit-gandhi/whisper-small-hi-flax) +for an example training run. diff --git a/examples/flax/speech-recognition/requirements.txt b/examples/flax/speech-recognition/requirements.txt new file mode 100644 index 00000000000000..b68b236ad76c2b --- /dev/null +++ b/examples/flax/speech-recognition/requirements.txt @@ -0,0 +1,8 @@ +datasets[audio]>=2.14.0 +jax>=0.3.6 +jaxlib>=0.3.6 +flax>=0.4.1 +optax>=0.0.8 +torch>=1.9.0 +jiwer +evaluate diff --git a/examples/flax/speech-recognition/run_flax_speech_recognition_seq2seq.py b/examples/flax/speech-recognition/run_flax_speech_recognition_seq2seq.py new file mode 100644 index 00000000000000..0c6efdf7fca292 --- /dev/null +++ b/examples/flax/speech-recognition/run_flax_speech_recognition_seq2seq.py @@ -0,0 +1,859 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2023 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Fine-tuning the Flax library models for sequence to sequence speech recognition. +""" +# You can also adapt this script on your own sequence to sequence task. Pointers for this are left as comments. 
+ +import logging +import os +import sys +import time +from dataclasses import field +from functools import partial +from pathlib import Path +from typing import Any, Callable, Dict, List, Optional, Union + +import datasets +import evaluate +import flax +import jax +import jax.numpy as jnp +import numpy as np +import optax +from datasets import DatasetDict, load_dataset +from flax import jax_utils, traverse_util +from flax.jax_utils import pad_shard_unpad, unreplicate +from flax.training import train_state +from flax.training.common_utils import get_metrics, onehot, shard, shard_prng_key +from huggingface_hub import Repository, create_repo +from torch.utils.data import DataLoader +from tqdm import tqdm + +import transformers +from transformers import ( + AutoConfig, + AutoFeatureExtractor, + AutoProcessor, + AutoTokenizer, + FlaxAutoModelForSpeechSeq2Seq, + HfArgumentParser, + Seq2SeqTrainingArguments, + is_tensorboard_available, +) +from transformers.file_utils import get_full_repo_name +from transformers.utils import check_min_version, send_example_telemetry +from transformers.utils.versions import require_version + + +# Will error if the minimal version of Transformers is not installed. Remove at your own risk. +check_min_version("4.38.0.dev0") + +require_version("datasets>=2.14.0", "To fix: pip install -r examples/flax/speech-recognition/requirements.txt") + +logger = logging.getLogger(__name__) + + +@flax.struct.dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. + """ + + model_name_or_path: str = field( + metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"} + ) + config_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} + ) + tokenizer_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} + ) + feature_extractor_name: Optional[str] = field( + default=None, metadata={"help": "feature extractor name or path if not the same as model_name"} + ) + cache_dir: Optional[str] = field( + default=None, + metadata={"help": "Where to store the pretrained models downloaded from huggingface.co"}, + ) + use_fast_tokenizer: bool = field( + default=True, + metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."}, + ) + model_revision: str = field( + default="main", + metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, + ) + use_auth_token: bool = field( + default=False, + metadata={ + "help": "Will use the token generated when running `transformers-cli login` (necessary to use this script " + "with private models)." + }, + ) + dtype: Optional[str] = field( + default="float32", + metadata={ + "help": ( + "Floating-point format in which the model weights should be initialized and trained. Choose one of" + " `[float32, float16, bfloat16]`." + ) + }, + ) + num_beams: Optional[int] = field( + default=None, + metadata={ + "help": ( + "Number of beams to use for evaluation. This argument will be passed to `model.generate`, " + "which is used during evaluation." + ) + }, + ) + + +@flax.struct.dataclass +class DataTrainingArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. 
+ """ + + dataset_name: str = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + dataset_config_name: Optional[str] = field( + default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."} + ) + text_column: Optional[str] = field( + default=None, + metadata={"help": "The name of the column in the datasets containing the full texts (for summarization)."}, + ) + dataset_cache_dir: Optional[str] = field( + default=None, metadata={"help": "Path to cache directory for saving and loading datasets"} + ) + overwrite_cache: bool = field( + default=False, metadata={"help": "Overwrite the cached training and evaluation sets"} + ) + preprocessing_num_workers: Optional[int] = field( + default=None, + metadata={"help": "The number of processes to use for the preprocessing."}, + ) + max_train_samples: Optional[int] = field( + default=None, + metadata={ + "help": "For debugging purposes or quicker training, truncate the number of training examples to this " + "value if set." + }, + ) + max_eval_samples: Optional[int] = field( + default=None, + metadata={ + "help": "For debugging purposes or quicker training, truncate the number of evaluation examples to this " + "value if set." + }, + ) + audio_column_name: str = field( + default="audio", + metadata={"help": "The name of the dataset column containing the audio data. Defaults to 'audio'"}, + ) + text_column_name: str = field( + default="text", + metadata={"help": "The name of the dataset column containing the text data. Defaults to 'text'"}, + ) + max_duration_in_seconds: float = field( + default=20.0, + metadata={"help": "Filter audio files that are longer than `max_duration_in_seconds` seconds"}, + ) + min_duration_in_seconds: float = field( + default=0.0, + metadata={"help": "Filter audio files that are shorter than `min_duration_in_seconds` seconds"}, + ) + max_label_length: float = field( + default=128, + metadata={"help": "Truncate transcriptions that are longer `max_eval_length` tokens."}, + ) + pad_input_to_multiple_of: Optional[int] = field( + default=None, + metadata={ + "help": "If set will pad the input sequence to a multiple of the provided value. " + "This is important to avoid triggering recompilations on TPU. If unspecified, will default to padding the inputs to max length." + }, + ) + pad_target_to_multiple_of: Optional[int] = field( + default=None, + metadata={ + "help": "If set will pad the target sequence to a multiple of the provided value. " + "This is important to avoid triggering recompilations on TPU. If unspecified, will default to padding the targets to max length." + }, + ) + preprocessing_only: bool = field( + default=False, + metadata={ + "help": "Whether to only do data preprocessing and skip training. " + "This is especially useful when data preprocessing errors out in distributed training due to timeout. " + "In this case, one should run the preprocessing in a non-distributed setup with `preprocessing_only=True` " + "so that the cached datasets can consequently be loaded in distributed training" + }, + ) + train_split_name: str = field( + default="train", + metadata={ + "help": "The name of the training data set split to use (via the datasets library). Defaults to 'train'" + }, + ) + eval_split_name: str = field( + default="validation", + metadata={ + "help": "The name of the evaluation data set split to use (via the datasets library). 
Defaults to 'validation'" + }, + ) + do_lower_case: bool = field( + default=True, + metadata={"help": "Whether the target text should be lower cased."}, + ) + language: str = field( + default=None, + metadata={ + "help": ( + "Language for multilingual fine-tuning. This argument should be set for multilingual fine-tuning " + "only. For English speech recognition, it should be set to `None`." + ) + }, + ) + task: str = field( + default="transcribe", + metadata={"help": "Task, either `transcribe` for speech recognition or `translate` for speech translation."}, + ) + + +def shift_tokens_right(label_ids: np.array, decoder_start_token_id: int) -> np.ndarray: + """ + Shift label ids one token to the right. + """ + shifted_label_ids = np.zeros_like(label_ids) + shifted_label_ids[:, 1:] = label_ids[:, :-1] + shifted_label_ids[:, 0] = decoder_start_token_id + + return shifted_label_ids + + +@flax.struct.dataclass +class FlaxDataCollatorSpeechSeq2SeqWithPadding: + """ + Data collator that will dynamically pad the inputs received. + Args: + processor ([`Wav2Vec2Processor`]) + The processor used for proccessing the data. + decoder_start_token_id (:obj: `int`) + The begin-of-sentence of the decoder. + input_padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`): + Select a strategy to pad the returned input sequences (according to the model's padding side and padding index) + among: + * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single + sequence if provided). + * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the + maximum acceptable input length for the model if that argument is not provided. + * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of + different lengths). + target_padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`): + Select a strategy to pad the returned target sequences (according to the model's padding side and padding index). + See above for details. + max_input_length (:obj:`float`, `optional`): + Maximum length of the ``input_values`` of the returned list and optionally padding length (see above). + max_target_length (:obj:`int`, `optional`): + Maximum length of the ``labels`` of the returned list and optionally padding length (see above). + pad_input_to_multiple_of (:obj:`int`, `optional`): + If set will pad the input sequence to a multiple of the provided value. + This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= + 7.5 (Volta). + pad_target_to_multiple_of (:obj:`int`, `optional`): + If set will pad the target sequence to a multiple of the provided value. + This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= + 7.5 (Volta). 
+ """ + + processor: Any + decoder_start_token_id: int + input_padding: Union[bool, str] = "longest" + target_padding: Union[bool, str] = "max_length" + max_input_length: Optional[float] = None + max_target_length: Optional[int] = None + pad_input_to_multiple_of: Optional[int] = None + pad_target_to_multiple_of: Optional[int] = None + + def __call__(self, features: List[Dict[str, Union[List[int], np.ndarray]]]) -> Dict[str, np.ndarray]: + # split inputs and labels since they have to be of different lengths and need + # different padding methods + model_input_name = self.processor.model_input_names[0] + + # dataloader returns a list of features which we convert to a dict + input_features = {model_input_name: [feature[model_input_name] for feature in features]} + label_features = {"input_ids": [feature["labels"] for feature in features]} + + # reformat list to dict and set to pytorch format + batch = self.processor.feature_extractor.pad( + input_features, + max_length=self.max_input_length, + padding=self.input_padding, + pad_to_multiple_of=self.pad_input_to_multiple_of, + return_tensors="np", + ) + + labels_batch = self.processor.tokenizer.pad( + label_features, + max_length=self.max_target_length, + padding=self.target_padding, + pad_to_multiple_of=self.pad_target_to_multiple_of, + return_tensors="np", + ) + + # if bos token is appended in previous tokenization step, + # cut bos token here as it's append later anyways + labels = labels_batch["input_ids"] + if (labels[:, 0] == self.decoder_start_token_id).all().item(): + labels = labels[:, 1:] + labels_batch.attention_mask = labels_batch.attention_mask[:, 1:] + + decoder_input_ids = shift_tokens_right(labels, self.decoder_start_token_id) + + # replace padding with -100 to ignore correctly when computing the loss + labels = np.ma.array(labels, mask=np.not_equal(labels_batch.attention_mask, 1)) + labels = labels.filled(fill_value=-100) + + batch["labels"] = labels + batch["decoder_input_ids"] = decoder_input_ids + + return batch + + +class TrainState(train_state.TrainState): + dropout_rng: jnp.ndarray + + def replicate(self): + return jax_utils.replicate(self).replace(dropout_rng=shard_prng_key(self.dropout_rng)) + + +def write_metric(summary_writer, train_metrics, eval_metrics, train_time, step): + summary_writer.scalar("train_time", train_time, step) + + train_metrics = get_metrics(train_metrics) + for key, vals in train_metrics.items(): + tag = f"train_{key}" + for i, val in enumerate(vals): + summary_writer.scalar(tag, val, step - len(vals) + i + 1) + + for metric_name, value in eval_metrics.items(): + summary_writer.scalar(f"eval_{metric_name}", value, step) + + +def create_learning_rate_fn( + num_train_steps: int, num_warmup_steps: int, learning_rate: float +) -> Callable[[int], jnp.ndarray]: + """Returns a linear warmup, linear_decay learning rate function.""" + warmup_fn = optax.linear_schedule(init_value=0.0, end_value=learning_rate, transition_steps=num_warmup_steps) + decay_fn = optax.linear_schedule( + init_value=learning_rate, end_value=0, transition_steps=num_train_steps - num_warmup_steps + ) + schedule_fn = optax.join_schedules(schedules=[warmup_fn, decay_fn], boundaries=[num_warmup_steps]) + return schedule_fn + + +def main(): + # 1. Parse input arguments + # See all possible arguments in src/transformers/training_args.py + # or by passing the --help flag to this script. + # We now keep distinct sets of args, for a cleaner separation of concerns. 
+ parser = HfArgumentParser((ModelArguments, DataTrainingArguments, Seq2SeqTrainingArguments)) + + if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): + # If we pass only one argument to the script and it's the path to a json file, + # let's parse it to get our arguments. + model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1])) + else: + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The + # information sent is the one passed as arguments along with your JAX/Flax versions. + send_example_telemetry("run_speech_recognition_seq2seq", model_args, data_args, framework="flax") + + # 2. Setup logging + # Make one log on every process with the configuration for debugging. + logging.basicConfig( + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%m/%d/%Y %H:%M:%S", + handlers=[logging.StreamHandler(sys.stdout)], + ) + # Set the verbosity to info of the Transformers logger. + # We only want one process per machine to log things on the screen. + logger.setLevel(logging.INFO if jax.process_index() == 0 else logging.ERROR) + if jax.process_index() == 0: + datasets.utils.logging.set_verbosity_warning() + transformers.utils.logging.set_verbosity_info() + else: + datasets.utils.logging.set_verbosity_error() + transformers.utils.logging.set_verbosity_error() + + logger.info("Training/evaluation parameters %s", training_args) + + # Check the output dir is valid + if ( + os.path.exists(training_args.output_dir) + and os.listdir(training_args.output_dir) + and training_args.do_train + and not training_args.overwrite_output_dir + ): + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use `--overwrite_output_dir` to overcome." + ) + + # Handle the repository creation + if training_args.push_to_hub: + if training_args.hub_model_id is None: + repo_name = get_full_repo_name( + Path(training_args.output_dir).absolute().name, token=training_args.hub_token + ) + else: + repo_name = training_args.hub_model_id + create_repo(repo_name, exist_ok=True, token=training_args.hub_token) + repo = Repository(training_args.output_dir, clone_from=repo_name, token=training_args.hub_token) + + # 3. Load dataset + raw_datasets = DatasetDict() + + if training_args.do_train: + raw_datasets["train"] = load_dataset( + data_args.dataset_name, + data_args.dataset_config_name, + split=data_args.train_split_name, + cache_dir=data_args.dataset_cache_dir, + num_proc=data_args.preprocessing_num_workers, + token=True if model_args.use_auth_token else None, + ) + + if training_args.do_eval: + raw_datasets["eval"] = load_dataset( + data_args.dataset_name, + data_args.dataset_config_name, + split=data_args.eval_split_name, + cache_dir=data_args.dataset_cache_dir, + num_proc=data_args.preprocessing_num_workers, + token=True if model_args.use_auth_token else None, + ) + + if not training_args.do_train and not training_args.do_eval: + raise ValueError( + "Cannot not train and not do evaluation. At least one of training or evaluation has to be performed." + ) + + if data_args.audio_column_name not in next(iter(raw_datasets.values())).column_names: + raise ValueError( + f"--audio_column_name '{data_args.audio_column_name}' not found in dataset '{data_args.dataset_name}'. 
" + "Make sure to set `--audio_column_name` to the correct audio column - one of " + f"{', '.join(next(iter(raw_datasets.values())).column_names)}." + ) + + if data_args.text_column_name not in next(iter(raw_datasets.values())).column_names: + raise ValueError( + f"--text_column_name {data_args.text_column_name} not found in dataset '{data_args.dataset_name}'. " + "Make sure to set `--text_column_name` to the correct text column - one of " + f"{', '.join(next(iter(raw_datasets.values())).column_names)}." + ) + + # 5. Load pretrained model, tokenizer, and feature extractor + config = AutoConfig.from_pretrained( + model_args.config_name if model_args.config_name else model_args.model_name_or_path, + cache_dir=model_args.cache_dir, + revision=model_args.model_revision, + token=True if model_args.use_auth_token else None, + ) + feature_extractor = AutoFeatureExtractor.from_pretrained( + model_args.feature_extractor_name if model_args.feature_extractor_name else model_args.model_name_or_path, + cache_dir=model_args.cache_dir, + revision=model_args.model_revision, + token=True if model_args.use_auth_token else None, + ) + tokenizer = AutoTokenizer.from_pretrained( + model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, + cache_dir=model_args.cache_dir, + use_fast=model_args.use_fast_tokenizer, + revision=model_args.model_revision, + token=True if model_args.use_auth_token else None, + ) + + model = FlaxAutoModelForSpeechSeq2Seq.from_pretrained( + model_args.model_name_or_path, + config=config, + dtype=getattr(jnp, model_args.dtype), + cache_dir=model_args.cache_dir, + revision=model_args.model_revision, + token=True if model_args.use_auth_token else None, + ) + + if model.config.decoder_start_token_id is None: + raise ValueError("Make sure that `config.decoder_start_token_id` is correctly defined") + + # 6. Resample speech dataset: `datasets` takes care of automatically loading and resampling the audio, + # so we just need to set the correct target sampling rate. + raw_datasets = raw_datasets.cast_column( + data_args.audio_column_name, datasets.features.Audio(sampling_rate=feature_extractor.sampling_rate) + ) + + # 7. Preprocessing the datasets. + # We need to read the audio files as arrays and tokenize the targets. + max_input_length = int(data_args.max_duration_in_seconds * feature_extractor.sampling_rate) + min_input_length = int(data_args.min_duration_in_seconds * feature_extractor.sampling_rate) + max_label_length = ( + data_args.max_label_length if data_args.max_label_length is not None else model.config.max_length + ) + pad_input_to_multiple_of = data_args.pad_input_to_multiple_of + pad_target_to_multiple_of = data_args.pad_target_to_multiple_of + audio_column_name = data_args.audio_column_name + num_workers = data_args.preprocessing_num_workers + text_column_name = data_args.text_column_name + model_input_name = feature_extractor.model_input_names[0] + do_lower_case = data_args.do_lower_case + + if training_args.do_train and data_args.max_train_samples is not None: + raw_datasets["train"] = raw_datasets["train"].select(range(data_args.max_train_samples)) + + if training_args.do_eval and data_args.max_eval_samples is not None: + raw_datasets["eval"] = raw_datasets["eval"].select(range(data_args.max_eval_samples)) + + if data_args.language is not None: + # We only need to set the task id when the language is specified (i.e. 
in a multilingual setting) + tokenizer.set_prefix_tokens(language=data_args.language, task=data_args.task) + + def prepare_dataset(batch): + # process audio + sample = batch[audio_column_name] + inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"]) + # process audio length + batch[model_input_name] = inputs.get(model_input_name)[0] + batch["input_length"] = len(sample["array"]) + + # process targets + input_str = batch[text_column_name].lower() if do_lower_case else batch[text_column_name] + batch["labels"] = tokenizer(input_str).input_ids + return batch + + vectorized_datasets = raw_datasets.map( + prepare_dataset, + remove_columns=next(iter(raw_datasets.values())).column_names, + num_proc=num_workers, + desc="preprocess train and eval dataset", + ) + + # filter training data with inputs longer than max_input_length + def is_audio_in_length_range(length): + return min_input_length < length < max_input_length + + vectorized_datasets = vectorized_datasets.filter( + is_audio_in_length_range, + num_proc=num_workers, + input_columns=["input_length"], + ) + + # for large datasets it is advised to run the preprocessing on a + # single machine first with `args.preprocessing_only` since there will mostly likely + # be a timeout when running the script in distributed mode. + # In a second step `args.preprocessing_only` can then be set to `False` to load the + # cached dataset + if data_args.preprocessing_only: + cache = {k: v.cache_files for k, v in vectorized_datasets.items()} + logger.info(f"Data preprocessing finished. Files cached at {cache}.") + return + + # 8. Load Metric + metric = evaluate.load("wer", cache_dir=model_args.cache_dir) + + def compute_metrics(preds, labels): + # replace padded labels by the padding token + for idx in range(len(labels)): + labels[idx][labels[idx] == -100] = tokenizer.pad_token_id + + pred_str = tokenizer.batch_decode(preds, skip_special_tokens=True) + # we do not want to group tokens when computing the metrics + label_str = tokenizer.batch_decode(labels, skip_special_tokens=True) + + wer = metric.compute(predictions=pred_str, references=label_str) + return {"wer": wer} + + # 9. Save feature extractor, tokenizer and config + feature_extractor.save_pretrained(training_args.output_dir) + tokenizer.save_pretrained(training_args.output_dir) + config.save_pretrained(training_args.output_dir) + + processor = AutoProcessor.from_pretrained(training_args.output_dir) + + data_collator = FlaxDataCollatorSpeechSeq2SeqWithPadding( + processor=processor, + decoder_start_token_id=model.config.decoder_start_token_id, + input_padding="longest", + target_padding="longest", + max_target_length=max_label_length, + pad_input_to_multiple_of=pad_input_to_multiple_of, + pad_target_to_multiple_of=pad_target_to_multiple_of if pad_target_to_multiple_of else max_label_length, + ) + + # Enable tensorboard only on the master node + has_tensorboard = is_tensorboard_available() + if has_tensorboard and jax.process_index() == 0: + try: + from flax.metrics.tensorboard import SummaryWriter + + summary_writer = SummaryWriter(log_dir=Path(training_args.output_dir)) + except ImportError as ie: + has_tensorboard = False + logger.warning( + f"Unable to display metrics through TensorBoard because some package are not installed: {ie}" + ) + else: + logger.warning( + "Unable to display metrics through TensorBoard because the package is not installed: " + "Please run pip install tensorboard to enable." 
+ ) + + # Initialize our training + rng = jax.random.PRNGKey(training_args.seed) + rng, dropout_rng = jax.random.split(rng) + + # Store some constant + num_epochs = int(training_args.num_train_epochs) + train_batch_size = int(training_args.per_device_train_batch_size) * jax.device_count() + per_device_eval_batch_size = int(training_args.per_device_eval_batch_size) + eval_batch_size = per_device_eval_batch_size * jax.device_count() + steps_per_epoch = len(vectorized_datasets["train"]) // train_batch_size + total_train_steps = steps_per_epoch * num_epochs + + # Create learning rate schedule + linear_decay_lr_schedule_fn = create_learning_rate_fn( + total_train_steps, + training_args.warmup_steps, + training_args.learning_rate, + ) + + # We use Optax's "masking" functionality to not apply weight decay + # to bias and LayerNorm scale parameters. decay_mask_fn returns a + # mask boolean with the same structure as the parameters. + # The mask is True for parameters that should be decayed. + def decay_mask_fn(params): + flat_params = traverse_util.flatten_dict(params) + # find out all LayerNorm parameters + layer_norm_candidates = ["layer_norm", "self_attn_layer_norm", "final_layer_norm", "encoder_attn_layer_norm"] + layer_norm_named_params = { + layer[-2:] + for layer_norm_name in layer_norm_candidates + for layer in flat_params.keys() + if layer_norm_name in "".join(layer).lower() + } + flat_mask = {path: (path[-1] != "bias" and path[-2:] not in layer_norm_named_params) for path in flat_params} + return traverse_util.unflatten_dict(flat_mask) + + # create adam optimizer + adamw = optax.adamw( + learning_rate=linear_decay_lr_schedule_fn, + b1=training_args.adam_beta1, + b2=training_args.adam_beta2, + eps=training_args.adam_epsilon, + weight_decay=training_args.weight_decay, + mask=decay_mask_fn, + ) + + # Setup train state + state = TrainState.create(apply_fn=model.__call__, params=model.params, tx=adamw, dropout_rng=dropout_rng) + + # label smoothed cross entropy + def loss_fn(logits, labels, label_smoothing_factor=0.0): + """ + The label smoothing implementation is adapted from Flax's official example: + https://github.com/google/flax/blob/87a211135c6a377c8f29048a1cac3840e38b9da4/examples/wmt/train.py#L104 + """ + vocab_size = logits.shape[-1] + confidence = 1.0 - label_smoothing_factor + low_confidence = (1.0 - confidence) / (vocab_size - 1) + normalizing_constant = -( + confidence * jnp.log(confidence) + (vocab_size - 1) * low_confidence * jnp.log(low_confidence + 1e-20) + ) + soft_labels = onehot(labels, vocab_size, on_value=confidence, off_value=low_confidence) + + loss = optax.softmax_cross_entropy(logits, soft_labels) + loss = loss - normalizing_constant + + # ignore padded tokens from loss, i.e. 
where labels are not set to -100 + padding_mask = labels >= 0 + loss = loss * padding_mask + loss = loss.sum() + num_labels = padding_mask.sum() + return loss, num_labels + + # Define gradient update step fn + def train_step(state, batch, label_smoothing_factor=0.0): + dropout_rng, new_dropout_rng = jax.random.split(state.dropout_rng) + + def compute_loss(params): + labels = batch.pop("labels") + logits = state.apply_fn(**batch, params=params, dropout_rng=dropout_rng, train=True)[0] + loss, num_labels = loss_fn(logits, labels, label_smoothing_factor) + return loss, num_labels + + grad_fn = jax.value_and_grad(compute_loss, has_aux=True) + (loss, num_labels), grad = grad_fn(state.params) + num_labels = jax.lax.psum(num_labels, "batch") + + # true loss = total loss / total samples + loss = jax.lax.psum(loss, "batch") + loss = jax.tree_util.tree_map(lambda x: x / num_labels, loss) + + # true grad = total grad / total samples + grad = jax.lax.psum(grad, "batch") + grad = jax.tree_util.tree_map(lambda x: x / num_labels, grad) + new_state = state.apply_gradients(grads=grad, dropout_rng=new_dropout_rng) + + metrics = {"loss": loss, "learning_rate": linear_decay_lr_schedule_fn(state.step)} + return new_state, metrics + + # Define eval fn + def eval_step(params, batch, label_smoothing_factor=0.0): + labels = batch.pop("labels") + logits = model(**batch, params=params, train=False)[0] + + loss, num_labels = loss_fn(logits, labels, label_smoothing_factor) + num_labels = jax.lax.psum(num_labels, "batch") + + # true loss = total loss / total samples + loss = jax.lax.psum(loss, "batch") + loss = jax.tree_util.tree_map(lambda x: x / num_labels, loss) + + metrics = {"loss": loss} + return metrics + + # Define generation function + num_beams = model_args.num_beams if model_args.num_beams is not None else model.config.num_beams + gen_kwargs = {"max_length": max_label_length, "num_beams": num_beams} + + def generate_step(params, batch): + model.params = params + output_ids = model.generate(batch[model_input_name], attention_mask=batch.get("attention_mask"), **gen_kwargs) + return output_ids.sequences + + # Create parallel version of the train and eval step + p_train_step = jax.pmap( + partial(train_step, label_smoothing_factor=training_args.label_smoothing_factor), "batch", donate_argnums=(0,) + ) + p_eval_step = jax.pmap(partial(eval_step, label_smoothing_factor=training_args.label_smoothing_factor), "batch") + p_generate_step = jax.pmap(generate_step, "batch") + + # Replicate the train state on each device + state = state.replicate() + + logger.info("***** Running training *****") + logger.info(f" Num examples = {len(vectorized_datasets['train'])}") + logger.info(f" Num Epochs = {num_epochs}") + logger.info(f" Instantaneous batch size per device = {training_args.per_device_train_batch_size}") + logger.info(f" Total train batch size (w. parallel & distributed) = {train_batch_size}") + logger.info(f" Total optimization steps = {total_train_steps}") + + train_time = 0 + epochs = tqdm(range(num_epochs), desc=f"Epoch ... 
(1/{num_epochs})", position=0) + for epoch in epochs: + # ======================== Training ================================ + train_start = time.time() + + train_metrics = [] + + # Generate an epoch by shuffling sampling indices from the train dataset and create a data loader + vectorized_datasets["train"] = vectorized_datasets["train"].shuffle(training_args.seed) + train_loader = DataLoader( + vectorized_datasets["train"], + batch_size=train_batch_size, + drop_last=True, + collate_fn=data_collator, + num_workers=training_args.dataloader_num_workers, + ) + # train + for batch in tqdm(train_loader, desc="Training...", position=1, leave=False): + batch = shard(batch.data) + state, train_metric = p_train_step(state, batch) + train_metrics.append(train_metric) + + train_time += time.time() - train_start + + train_metric = unreplicate(train_metric) + + epochs.write( + f"Epoch... ({epoch + 1}/{num_epochs} | Loss: {train_metric['loss']}, Learning Rate:" + f" {train_metric['learning_rate']})" + ) + + # ======================== Evaluating ============================== + eval_metrics = [] + eval_preds = [] + eval_labels = [] + + eval_loader = DataLoader( + vectorized_datasets["eval"], + batch_size=eval_batch_size, + drop_last=False, + collate_fn=data_collator, + num_workers=training_args.dataloader_num_workers, + ) + for batch in tqdm(eval_loader, desc="Evaluating...", position=2, leave=False): + # Model forward + labels = batch["labels"] + + metrics = pad_shard_unpad(p_eval_step, static_return=True)( + state.params, batch.data, min_device_batch=per_device_eval_batch_size + ) + eval_metrics.append(metrics) + + # generation + if training_args.predict_with_generate: + generated_ids = pad_shard_unpad(p_generate_step)(state.params, batch.data) + eval_preds.extend(jax.device_get(generated_ids.reshape(-1, gen_kwargs["max_length"]))) + eval_labels.extend(labels) + + # normalize eval metrics + eval_metrics = get_metrics(eval_metrics) + eval_metrics = jax.tree_util.tree_map(jnp.mean, eval_metrics) + + # compute WER metric + wer_desc = "" + if training_args.predict_with_generate: + wer_metric = compute_metrics(eval_preds, eval_labels) + eval_metrics.update(wer_metric) + wer_desc = " ".join([f"Eval {key}: {value} |" for key, value in wer_metric.items()]) + + # Print metrics and update progress bar + desc = f"Epoch... ({epoch + 1}/{num_epochs} | Eval Loss: {eval_metrics['loss']} | {wer_desc})" + epochs.write(desc) + epochs.desc = desc + + # Save metrics + if has_tensorboard and jax.process_index() == 0: + cur_step = epoch * (len(vectorized_datasets["train"]) // train_batch_size) + write_metric(summary_writer, train_metrics, eval_metrics, train_time, cur_step) + + # save checkpoint after each epoch and push checkpoint to the hub + if jax.process_index() == 0: + params = jax.device_get(jax.tree_util.tree_map(lambda x: x[0], state.params)) + model.save_pretrained(training_args.output_dir, params=params) + tokenizer.save_pretrained(training_args.output_dir) + if training_args.push_to_hub: + repo.push_to_hub(commit_message=f"Saving weights and logs of epoch {epoch}", blocking=False) + + +if __name__ == "__main__": + main() diff --git a/examples/flax/summarization/README.md b/examples/flax/summarization/README.md index bbe231f31a569f..c94b048ec88b42 100644 --- a/examples/flax/summarization/README.md +++ b/examples/flax/summarization/README.md @@ -9,7 +9,7 @@ way which enables simple and efficient model parallelism. 
`run_summarization_flax.py` is a lightweight example of how to download and preprocess a dataset from the 🤗 Datasets library or use your own files (jsonlines or csv), then fine-tune one of the architectures above on it. -For custom datasets in `jsonlines` format please see: https://huggingface.co/docs/datasets/loading_datasets.html#json-files and you also will find examples of these below. +For custom datasets in `jsonlines` format please see: https://huggingface.co/docs/datasets/loading_datasets#json-files and you also will find examples of these below. ### Train the model Next we can run the example script to train the model: diff --git a/examples/flax/summarization/run_summarization_flax.py b/examples/flax/summarization/run_summarization_flax.py index feda69592070f0..1a72cc65225c51 100644 --- a/examples/flax/summarization/run_summarization_flax.py +++ b/examples/flax/summarization/run_summarization_flax.py @@ -24,6 +24,7 @@ import os import sys import time +import warnings from dataclasses import asdict, dataclass, field from enum import Enum from functools import partial @@ -56,7 +57,7 @@ HfArgumentParser, is_tensorboard_available, ) -from transformers.utils import get_full_repo_name, is_offline_mode, send_example_telemetry +from transformers.utils import is_offline_mode, send_example_telemetry logger = logging.getLogger(__name__) @@ -158,7 +159,7 @@ class ModelArguments: default=None, metadata={ "help": ( - "The model checkpoint for weights initialization.Don't set if you want to train a model from scratch." + "The model checkpoint for weights initialization. Don't set if you want to train a model from scratch." ) }, ) @@ -188,12 +189,28 @@ class ModelArguments: ) }, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -251,7 +268,7 @@ class DataTrainingArguments: metadata={ "help": ( "The maximum total sequence length for validation target text after tokenization. Sequences longer " - "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`." + "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`. " "This argument is also used to override the `max_length` param of `model.generate`, which is used " "during evaluation." ) @@ -295,7 +312,7 @@ class DataTrainingArguments: default=False, metadata={"help": "Whether to use generate to calculate generative metrics (ROUGE, BLEU)."} ) num_beams: Optional[int] = field( - default=None, + default=1, metadata={ "help": ( "Number of beams to use for evaluation. 
This argument will be passed to `model.generate`, " @@ -308,8 +325,13 @@ class DataTrainingArguments: ) def __post_init__(self): - if self.dataset_name is None and self.train_file is None and self.validation_file is None: - raise ValueError("Need either a dataset name or a training/validation file.") + if ( + self.dataset_name is None + and self.train_file is None + and self.validation_file is None + and self.test_file is None + ): + raise ValueError("Need either a dataset name or a training, validation, or test file.") else: if self.train_file is not None: extension = self.train_file.split(".")[-1] @@ -317,6 +339,9 @@ def __post_init__(self): if self.validation_file is not None: extension = self.validation_file.split(".")[-1] assert extension in ["csv", "json"], "`validation_file` should be a csv or a json file." + if self.test_file is not None: + extension = self.test_file.split(".")[-1] + assert extension in ["csv", "json"], "`test_file` should be a csv or a json file." if self.val_max_target_length is None: self.val_max_target_length = self.max_target_length @@ -384,7 +409,7 @@ def write_metric(summary_writer, train_metrics, eval_metrics, train_time, step): def create_learning_rate_fn( train_ds_size: int, train_batch_size: int, num_train_epochs: int, num_warmup_steps: int, learning_rate: float -) -> Callable[[int], jnp.array]: +) -> Callable[[int], jnp.ndarray]: """Returns a linear warmup, linear_decay learning rate function.""" steps_per_epoch = train_ds_size // train_batch_size num_train_steps = steps_per_epoch * num_train_epochs @@ -409,6 +434,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_summarization", model_args, data_args, framework="flax") @@ -420,7 +454,7 @@ def main(): and not training_args.overwrite_output_dir ): raise ValueError( - f"Output directory ({training_args.output_dir}) already exists and is not empty." + f"Output directory ({training_args.output_dir}) already exists and is not empty. " "Use --overwrite_output_dir to overcome." 
) @@ -444,14 +478,14 @@ def main(): # Handle the repository creation if training_args.push_to_hub: - if training_args.hub_model_id is None: - repo_name = get_full_repo_name( - Path(training_args.output_dir).absolute().name, token=training_args.hub_token - ) - else: - repo_name = training_args.hub_model_id - create_repo(repo_name, exist_ok=True, token=training_args.hub_token) - repo = Repository(training_args.output_dir, clone_from=repo_name, token=training_args.hub_token) + # Retrieve or infer repo_name + repo_name = training_args.hub_model_id + if repo_name is None: + repo_name = Path(training_args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=training_args.hub_token).repo_id + # Clone repo locally + repo = Repository(training_args.output_dir, clone_from=repo_id, token=training_args.hub_token) # Get the datasets: you can either provide your own CSV/JSON training and evaluation files (see below) # or just provide the name of one of the public datasets available on the hub at https://huggingface.co/datasets/ @@ -467,7 +501,7 @@ def main(): data_args.dataset_config_name, cache_dir=model_args.cache_dir, keep_in_memory=False, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: data_files = {} @@ -484,10 +518,10 @@ def main(): extension, data_files=data_files, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # Load pretrained model and tokenizer @@ -495,13 +529,15 @@ def main(): config = AutoConfig.from_pretrained( model_args.config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) elif model_args.model_name_or_path: config = AutoConfig.from_pretrained( model_args.model_name_or_path, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) else: config = CONFIG_MAPPING[model_args.model_type]() @@ -512,18 +548,20 @@ def main(): model_args.tokenizer_name, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) elif model_args.model_name_or_path: tokenizer = AutoTokenizer.from_pretrained( model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) else: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name."
) @@ -533,13 +571,15 @@ def main(): config=config, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype), - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) else: model = FlaxAutoModelForSeq2SeqLM.from_config( config, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype), + trust_remote_code=model_args.trust_remote_code, ) if training_args.gradient_checkpointing: @@ -553,10 +593,16 @@ def main(): # Preprocessing the datasets. # We need to tokenize inputs and targets. if training_args.do_train: + if "train" not in dataset: + raise ValueError("--do_train requires a train dataset") column_names = dataset["train"].column_names elif training_args.do_eval: + if "validation" not in dataset: + raise ValueError("--do_eval requires a validation dataset") column_names = dataset["validation"].column_names elif training_args.do_predict: + if "test" not in dataset: + raise ValueError("--do_predict requires a test dataset") column_names = dataset["test"].column_names else: logger.info("There is nothing to do. Please pass `do_train`, `do_eval` and/or `do_predict`.") @@ -620,8 +666,6 @@ def preprocess_function(examples): return model_inputs if training_args.do_train: - if "train" not in dataset: - raise ValueError("--do_train requires a train dataset") train_dataset = dataset["train"] if data_args.max_train_samples is not None: max_train_samples = min(len(train_dataset), data_args.max_train_samples) @@ -637,8 +681,6 @@ def preprocess_function(examples): if training_args.do_eval: max_target_length = data_args.val_max_target_length - if "validation" not in dataset: - raise ValueError("--do_eval requires a validation dataset") eval_dataset = dataset["validation"] if data_args.max_eval_samples is not None: max_eval_samples = min(len(eval_dataset), data_args.max_eval_samples) @@ -654,8 +696,6 @@ def preprocess_function(examples): if training_args.do_predict: max_target_length = data_args.val_max_target_length - if "test" not in dataset: - raise ValueError("--do_predict requires a test dataset") predict_dataset = dataset["test"] if data_args.max_predict_samples is not None: max_predict_samples = min(len(predict_dataset), data_args.max_predict_samples) @@ -670,7 +710,7 @@ def preprocess_function(examples): ) # Metric - metric = evaluate.load("rouge") + metric = evaluate.load("rouge", cache_dir=model_args.cache_dir) def postprocess_text(preds, labels): preds = [pred.strip() for pred in preds] @@ -742,14 +782,12 @@ def decay_mask_fn(params): flat_params = traverse_util.flatten_dict(params) # find out all LayerNorm parameters layer_norm_candidates = ["layernorm", "layer_norm", "ln"] - layer_norm_named_params = set( - [ - layer[-2:] - for layer_norm_name in layer_norm_candidates - for layer in flat_params.keys() - if layer_norm_name in "".join(layer).lower() - ] - ) + layer_norm_named_params = { + layer[-2:] + for layer_norm_name in layer_norm_candidates + for layer in flat_params.keys() + if layer_norm_name in "".join(layer).lower() + } flat_mask = {path: (path[-1] != "bias" and path[-2:] not in layer_norm_named_params) for path in flat_params} return traverse_util.unflatten_dict(flat_mask) diff --git a/examples/flax/test_flax_examples.py b/examples/flax/test_flax_examples.py index 2fc2dcc16adc0c..9fc424c1a7532c 100644 --- a/examples/flax/test_flax_examples.py +++ b/examples/flax/test_flax_examples.py @@ -32,6 +32,7 @@ "summarization", "token-classification", "question-answering", + "speech-recognition", ] ] 
sys.path.extend(SRC_DIRS) @@ -41,6 +42,7 @@ import run_clm_flax import run_flax_glue import run_flax_ner + import run_flax_speech_recognition_seq2seq import run_mlm_flax import run_qa import run_summarization_flax @@ -76,7 +78,7 @@ def test_run_glue(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" run_glue.py - --model_name_or_path distilbert-base-uncased + --model_name_or_path distilbert/distilbert-base-uncased --output_dir {tmp_dir} --train_file ./tests/fixtures/tests_samples/MRPC/train.csv --validation_file ./tests/fixtures/tests_samples/MRPC/dev.csv @@ -99,7 +101,7 @@ def test_run_clm(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" run_clm_flax.py - --model_name_or_path distilgpt2 + --model_name_or_path distilbert/distilgpt2 --train_file ./tests/fixtures/sample_text.txt --validation_file ./tests/fixtures/sample_text.txt --do_train @@ -123,7 +125,7 @@ def test_run_summarization(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" run_summarization.py - --model_name_or_path t5-small + --model_name_or_path google-t5/t5-small --train_file tests/fixtures/tests_samples/xsum/sample.json --validation_file tests/fixtures/tests_samples/xsum/sample.json --test_file tests/fixtures/tests_samples/xsum/sample.json @@ -153,7 +155,7 @@ def test_run_mlm(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" run_mlm.py - --model_name_or_path distilroberta-base + --model_name_or_path distilbert/distilroberta-base --train_file ./tests/fixtures/sample_text.txt --validation_file ./tests/fixtures/sample_text.txt --output_dir {tmp_dir} @@ -177,7 +179,7 @@ def test_run_t5_mlm(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" run_t5_mlm_flax.py - --model_name_or_path t5-small + --model_name_or_path google-t5/t5-small --train_file ./tests/fixtures/sample_text.txt --validation_file ./tests/fixtures/sample_text.txt --do_train @@ -204,7 +206,7 @@ def test_run_ner(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" run_flax_ner.py - --model_name_or_path bert-base-uncased + --model_name_or_path google-bert/bert-base-uncased --train_file tests/fixtures/tests_samples/conll/sample.json --validation_file tests/fixtures/tests_samples/conll/sample.json --output_dir {tmp_dir} @@ -231,7 +233,7 @@ def test_run_qa(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" run_qa.py - --model_name_or_path bert-base-uncased + --model_name_or_path google-bert/bert-base-uncased --version_2_with_negative --train_file tests/fixtures/tests_samples/SQUAD/sample.json --validation_file tests/fixtures/tests_samples/SQUAD/sample.json @@ -252,3 +254,32 @@ def test_run_qa(self): result = get_results(tmp_dir) self.assertGreaterEqual(result["eval_f1"], 30) self.assertGreaterEqual(result["eval_exact"], 30) + + @slow + def test_run_flax_speech_recognition_seq2seq(self): + tmp_dir = self.get_auto_remove_tmp_dir() + testargs = f""" + run_flax_speech_recognition_seq2seq.py + --model_name_or_path openai/whisper-tiny.en + --dataset_name hf-internal-testing/librispeech_asr_dummy + --dataset_config clean + --train_split_name validation + --eval_split_name validation + --output_dir {tmp_dir} + --overwrite_output_dir + --num_train_epochs=2 + --max_train_samples 10 + --max_eval_samples 10 + --warmup_steps=8 + --do_train + --do_eval + --learning_rate=2e-4 + --per_device_train_batch_size=2 + --per_device_eval_batch_size=1 + --predict_with_generate + """.split() + + with patch.object(sys, "argv", testargs): + run_flax_speech_recognition_seq2seq.main() + result = 
get_results(tmp_dir, split="eval") + self.assertLessEqual(result["eval_wer"], 0.05) diff --git a/examples/flax/text-classification/README.md b/examples/flax/text-classification/README.md index 8d43ab7725a241..65e50a075b78d5 100644 --- a/examples/flax/text-classification/README.md +++ b/examples/flax/text-classification/README.md @@ -31,7 +31,7 @@ GLUE is made up of a total of 9 different tasks. Here is how to run the script o export TASK_NAME=mrpc python run_flax_glue.py \ - --model_name_or_path bert-base-cased \ + --model_name_or_path google-bert/bert-base-cased \ --task_name ${TASK_NAME} \ --max_seq_length 128 \ --learning_rate 2e-5 \ diff --git a/examples/flax/text-classification/run_flax_glue.py b/examples/flax/text-classification/run_flax_glue.py index c47ea90d392a3d..7821308b9d0c7a 100755 --- a/examples/flax/text-classification/run_flax_glue.py +++ b/examples/flax/text-classification/run_flax_glue.py @@ -21,6 +21,7 @@ import random import sys import time +import warnings from dataclasses import dataclass, field from pathlib import Path from typing import Any, Callable, Dict, Optional, Tuple @@ -49,12 +50,12 @@ TrainingArguments, is_tensorboard_available, ) -from transformers.utils import check_min_version, get_full_repo_name, send_example_telemetry +from transformers.utils import check_min_version, send_example_telemetry logger = logging.getLogger(__name__) # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") Array = Any Dataset = datasets.arrow_dataset.Dataset @@ -101,12 +102,28 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -195,7 +212,7 @@ def __post_init__(self): if self.validation_file is not None: extension = self.validation_file.split(".")[-1] assert extension in ["csv", "json"], "`validation_file` should be a csv or a json file." 
- self.task_name = self.task_name.lower() if type(self.task_name) == str else self.task_name + self.task_name = self.task_name.lower() if isinstance(self.task_name, str) else self.task_name def create_train_state( @@ -229,14 +246,12 @@ def decay_mask_fn(params): flat_params = traverse_util.flatten_dict(params) # find out all LayerNorm parameters layer_norm_candidates = ["layernorm", "layer_norm", "ln"] - layer_norm_named_params = set( - [ - layer[-2:] - for layer_norm_name in layer_norm_candidates - for layer in flat_params.keys() - if layer_norm_name in "".join(layer).lower() - ] - ) + layer_norm_named_params = { + layer[-2:] + for layer_norm_name in layer_norm_candidates + for layer in flat_params.keys() + if layer_norm_name in "".join(layer).lower() + } flat_mask = {path: (path[-1] != "bias" and path[-2:] not in layer_norm_named_params) for path in flat_params} return traverse_util.unflatten_dict(flat_mask) @@ -273,7 +288,7 @@ def cross_entropy_loss(logits, labels): def create_learning_rate_fn( train_ds_size: int, train_batch_size: int, num_train_epochs: int, num_warmup_steps: int, learning_rate: float -) -> Callable[[int], jnp.array]: +) -> Callable[[int], jnp.ndarray]: """Returns a linear warmup, linear_decay learning rate function.""" steps_per_epoch = train_ds_size // train_batch_size num_train_steps = steps_per_epoch * num_train_epochs @@ -323,6 +338,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_glue", model_args, data_args, framework="flax") @@ -344,14 +368,14 @@ def main(): # Handle the repository creation if training_args.push_to_hub: - if training_args.hub_model_id is None: - repo_name = get_full_repo_name( - Path(training_args.output_dir).absolute().name, token=training_args.hub_token - ) - else: - repo_name = training_args.hub_model_id - create_repo(repo_name, exist_ok=True, token=training_args.hub_token) - repo = Repository(training_args.output_dir, clone_from=repo_name, token=training_args.hub_token) + # Retrieve or infer repo_name + repo_name = training_args.hub_model_id + if repo_name is None: + repo_name = Path(training_args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=training_args.hub_token).repo_id + # Clone repo locally + repo = Repository(training_args.output_dir, clone_from=repo_id, token=training_args.hub_token) # Get the datasets: you can either provide your own CSV/JSON training and evaluation files (see below) # or specify a GLUE benchmark task (the dataset will be downloaded automatically from the datasets Hub). @@ -370,7 +394,7 @@ def main(): raw_datasets = load_dataset( "glue", data_args.task_name, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: # Loading the dataset from local csv or json file.
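For reference, the repository-creation change applied in the hunk above (and mirrored across the other Flax examples in this patch) can be read as the following self-contained sketch. It is illustrative only: the placeholder variables stand in for the parsed `training_args`, and it assumes `create_repo` and `Repository` from `huggingface_hub`, as already imported by these scripts:

```python
from pathlib import Path

from huggingface_hub import Repository, create_repo

# Placeholder values standing in for the parsed training arguments.
output_dir = "./glue-finetuned"  # training_args.output_dir
hub_model_id = None              # training_args.hub_model_id (None means "infer from output_dir")
hub_token = None                 # training_args.hub_token (None falls back to the locally cached token)

# Retrieve or infer the repo name, then let `create_repo` resolve the fully qualified repo_id
# (e.g. "username/glue-finetuned"), which `Repository` uses to clone the remote locally.
repo_name = hub_model_id if hub_model_id is not None else Path(output_dir).absolute().name
repo_id = create_repo(repo_name, exist_ok=True, token=hub_token).repo_id
repo = Repository(output_dir, clone_from=repo_id, token=hub_token)
```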
@@ -383,10 +407,10 @@ def main(): raw_datasets = load_dataset( extension, data_files=data_files, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # See more about loading any type of standard or custom dataset at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # Labels if data_args.task_name is not None: @@ -403,7 +427,7 @@ def main(): num_labels = 1 else: # A useful fast method: - # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.unique + # https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.unique label_list = raw_datasets["train"].unique("label") label_list.sort() # Let's sort it for determinism num_labels = len(label_list) @@ -413,17 +437,20 @@ def main(): model_args.model_name_or_path, num_labels=num_labels, finetuning_task=data_args.task_name, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) tokenizer = AutoTokenizer.from_pretrained( model_args.model_name_or_path, use_fast=not model_args.use_slow_tokenizer, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) model = FlaxAutoModelForSequenceClassification.from_pretrained( model_args.model_name_or_path, config=config, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) # Preprocessing the datasets @@ -449,7 +476,7 @@ def main(): ): # Some have all caps in their config, some don't. label_name_to_id = {k.lower(): v for k, v in model.config.label2id.items()} - if list(sorted(label_name_to_id.keys())) == list(sorted(label_list)): + if sorted(label_name_to_id.keys()) == sorted(label_list): logger.info( f"The configuration of the model provided the following label correspondence: {label_name_to_id}. " "Using it!" @@ -458,7 +485,7 @@ def main(): else: logger.warning( "Your model seems to have been trained with labels, but they don't match the dataset: ", - f"model labels: {list(sorted(label_name_to_id.keys()))}, dataset labels: {list(sorted(label_list))}." + f"model labels: {sorted(label_name_to_id.keys())}, dataset labels: {sorted(label_list)}." 
"\nIgnoring the model labels as a result.", ) elif data_args.task_name is None: @@ -572,9 +599,9 @@ def eval_step(state, batch): p_eval_step = jax.pmap(eval_step, axis_name="batch") if data_args.task_name is not None: - metric = evaluate.load("glue", data_args.task_name) + metric = evaluate.load("glue", data_args.task_name, cache_dir=model_args.cache_dir) else: - metric = evaluate.load("accuracy") + metric = evaluate.load("accuracy", cache_dir=model_args.cache_dir) logger.info(f"===== Starting training ({num_epochs} epochs) =====") train_time = 0 diff --git a/examples/flax/token-classification/README.md b/examples/flax/token-classification/README.md index 915cf6ae20ff93..1f8175072148bb 100644 --- a/examples/flax/token-classification/README.md +++ b/examples/flax/token-classification/README.md @@ -25,7 +25,7 @@ The following example fine-tunes BERT on CoNLL-2003: ```bash python run_flax_ner.py \ - --model_name_or_path bert-base-cased \ + --model_name_or_path google-bert/bert-base-cased \ --dataset_name conll2003 \ --max_seq_length 128 \ --learning_rate 2e-5 \ diff --git a/examples/flax/token-classification/run_flax_ner.py b/examples/flax/token-classification/run_flax_ner.py index c7509433d95796..ac3eb31e8b82fa 100644 --- a/examples/flax/token-classification/run_flax_ner.py +++ b/examples/flax/token-classification/run_flax_ner.py @@ -21,6 +21,7 @@ import random import sys import time +import warnings from dataclasses import asdict, dataclass, field from enum import Enum from itertools import chain @@ -49,13 +50,13 @@ HfArgumentParser, is_tensorboard_available, ) -from transformers.utils import check_min_version, get_full_repo_name, send_example_telemetry +from transformers.utils import check_min_version, send_example_telemetry from transformers.utils.versions import require_version logger = logging.getLogger(__name__) # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/token-classification/requirements.txt") @@ -149,12 +150,28 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." 
) }, ) @@ -290,14 +307,12 @@ def decay_mask_fn(params): flat_params = traverse_util.flatten_dict(params) # find out all LayerNorm parameters layer_norm_candidates = ["layernorm", "layer_norm", "ln"] - layer_norm_named_params = set( - [ - layer[-2:] - for layer_norm_name in layer_norm_candidates - for layer in flat_params.keys() - if layer_norm_name in "".join(layer).lower() - ] - ) + layer_norm_named_params = { + layer[-2:] + for layer_norm_name in layer_norm_candidates + for layer in flat_params.keys() + if layer_norm_name in "".join(layer).lower() + } flat_mask = {path: (path[-1] != "bias" and path[-2:] not in layer_norm_named_params) for path in flat_params} return traverse_util.unflatten_dict(flat_mask) @@ -325,7 +340,7 @@ def cross_entropy_loss(logits, labels): def create_learning_rate_fn( train_ds_size: int, train_batch_size: int, num_train_epochs: int, num_warmup_steps: int, learning_rate: float -) -> Callable[[int], jnp.array]: +) -> Callable[[int], jnp.ndarray]: """Returns a linear warmup, linear_decay learning rate function.""" steps_per_epoch = train_ds_size // train_batch_size num_train_steps = steps_per_epoch * num_train_epochs @@ -379,6 +394,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_ner", model_args, data_args, framework="flax") @@ -400,14 +424,14 @@ def main(): # Handle the repository creation if training_args.push_to_hub: - if training_args.hub_model_id is None: - repo_name = get_full_repo_name( - Path(training_args.output_dir).absolute().name, token=training_args.hub_token - ) - else: - repo_name = training_args.hub_model_id - create_repo(repo_name, exist_ok=True, token=training_args.hub_token) - repo = Repository(training_args.output_dir, clone_from=repo_name, token=training_args.hub_token) + # Retrieve of infer repo_name + repo_name = training_args.hub_model_id + if repo_name is None: + repo_name = Path(training_args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=training_args.hub_token).repo_id + # Clone repo locally + repo = Repository(training_args.output_dir, clone_from=repo_id, token=training_args.hub_token) # Get the datasets: you can either provide your own CSV/JSON/TXT training and evaluation files (see below) # or just provide the name of one of the public datasets for token classification task available on the hub at https://huggingface.co/datasets/ @@ -424,7 +448,7 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: # Loading the dataset from local csv or json file. 
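The `decay_mask_fn` rewrite above (a set comprehension instead of `set([...])`) is purely stylistic, but the function is worth isolating: it tells optax to apply weight decay only to non-bias, non-LayerNorm parameters. A small self-contained sketch of the same masking logic on a toy parameter tree (the toy parameter names are illustrative, not taken from a real model):

```python
import jax.numpy as jnp
import optax
from flax import traverse_util

# Toy parameter tree with a dense layer and a LayerNorm, standing in for a real Flax model.
params = {
    "dense": {"kernel": jnp.ones((4, 4)), "bias": jnp.zeros(4)},
    "layer_norm": {"scale": jnp.ones(4), "bias": jnp.zeros(4)},
}


def decay_mask_fn(params):
    flat_params = traverse_util.flatten_dict(params)
    # Collect the name suffixes that belong to LayerNorm-like modules.
    layer_norm_candidates = ["layernorm", "layer_norm", "ln"]
    layer_norm_named_params = {
        layer[-2:]
        for layer_norm_name in layer_norm_candidates
        for layer in flat_params.keys()
        if layer_norm_name in "".join(layer).lower()
    }
    # Decay everything except biases and LayerNorm parameters.
    flat_mask = {path: (path[-1] != "bias" and path[-2:] not in layer_norm_named_params) for path in flat_params}
    return traverse_util.unflatten_dict(flat_mask)


# The mask is handed to the optimizer; decay is applied only where the mask is True.
tx = optax.adamw(learning_rate=2e-5, weight_decay=0.01, mask=decay_mask_fn)
state = tx.init(params)
```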
@@ -438,10 +462,10 @@ def main(): extension, data_files=data_files, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # See more about loading any type of standard or custom dataset at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. if raw_datasets["train"] is not None: column_names = raw_datasets["train"].column_names @@ -492,7 +516,8 @@ def get_label_list(labels): finetuning_task=data_args.task_name, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) tokenizer_name_or_path = model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path if config.model_type in {"gpt2", "roberta"}: @@ -500,7 +525,8 @@ def get_label_list(labels): tokenizer_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, add_prefix_space=True, ) else: @@ -508,14 +534,16 @@ def get_label_list(labels): tokenizer_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) model = FlaxAutoModelForTokenClassification.from_pretrained( model_args.model_name_or_path, config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) # Preprocessing the datasets @@ -648,7 +676,7 @@ def eval_step(state, batch): p_eval_step = jax.pmap(eval_step, axis_name="batch") - metric = evaluate.load("seqeval") + metric = evaluate.load("seqeval", cache_dir=model_args.cache_dir) def get_labels(y_pred, y_true): # Transform predictions and references tensos to numpy arrays diff --git a/examples/flax/vision/requirements.txt b/examples/flax/vision/requirements.txt index cf1859d7549477..539ffdc6fa9f74 100644 --- a/examples/flax/vision/requirements.txt +++ b/examples/flax/vision/requirements.txt @@ -3,6 +3,6 @@ jaxlib>=0.1.59 flax>=0.3.5 optax>=0.0.8 -f https://download.pytorch.org/whl/torch_stable.html -torch==1.9.0+cpu +torch==1.11.0+cpu -f https://download.pytorch.org/whl/torch_stable.html -torchvision==0.10.0+cpu \ No newline at end of file +torchvision==0.12.0+cpu diff --git a/examples/flax/vision/run_image_classification.py b/examples/flax/vision/run_image_classification.py index 6a88f0f8d67b28..364ac7dd2d0931 100644 --- a/examples/flax/vision/run_image_classification.py +++ b/examples/flax/vision/run_image_classification.py @@ -24,6 +24,7 @@ import os import sys import time +import warnings from dataclasses import asdict, dataclass, field from enum import Enum from pathlib import Path @@ -54,7 +55,7 @@ is_tensorboard_available, set_seed, ) -from transformers.utils import get_full_repo_name, send_example_telemetry +from transformers.utils import send_example_telemetry logger = logging.getLogger(__name__) @@ -136,7 +137,7 @@ class ModelArguments: default=None, metadata={ "help": ( - "The model checkpoint for weights initialization.Don't set if you want to train a model from scratch." + "The model checkpoint for weights initialization. 
Don't set if you want to train a model from scratch." ) }, ) @@ -159,12 +160,28 @@ class ModelArguments: ) }, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -232,7 +249,7 @@ def write_metric(summary_writer, train_metrics, eval_metrics, train_time, step): def create_learning_rate_fn( train_ds_size: int, train_batch_size: int, num_train_epochs: int, num_warmup_steps: int, learning_rate: float -) -> Callable[[int], jnp.array]: +) -> Callable[[int], jnp.ndarray]: """Returns a linear warmup, linear_decay learning rate function.""" steps_per_epoch = train_ds_size // train_batch_size num_train_steps = steps_per_epoch * num_train_epochs @@ -257,6 +274,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_image_classification", model_args, data_args, framework="flax") @@ -268,7 +294,7 @@ def main(): and not training_args.overwrite_output_dir ): raise ValueError( - f"Output directory ({training_args.output_dir}) already exists and is not empty." + f"Output directory ({training_args.output_dir}) already exists and is not empty. " "Use --overwrite_output_dir to overcome." 
) @@ -293,18 +319,18 @@ def main(): # Handle the repository creation if training_args.push_to_hub: - if training_args.hub_model_id is None: - repo_name = get_full_repo_name( - Path(training_args.output_dir).absolute().name, token=training_args.hub_token - ) - else: - repo_name = training_args.hub_model_id - create_repo(repo_name, exist_ok=True, token=training_args.hub_token) - repo = Repository(training_args.output_dir, clone_from=repo_name, token=training_args.hub_token) + # Retrieve of infer repo_name + repo_name = training_args.hub_model_id + if repo_name is None: + repo_name = Path(training_args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=training_args.hub_token).repo_id + # Clone repo locally + repo = Repository(training_args.output_dir, clone_from=repo_id, token=training_args.hub_token) # Initialize datasets and pre-processing transforms # We use torchvision here for faster pre-processing - # Note that here we are using some default pre-processing, for maximum accuray + # Note that here we are using some default pre-processing, for maximum accuracy # one should tune this part and carefully select what transformations to use. normalize = transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]) train_dataset = torchvision.datasets.ImageFolder( @@ -338,7 +364,8 @@ def main(): num_labels=len(train_dataset.classes), image_size=data_args.image_size, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) elif model_args.model_name_or_path: config = AutoConfig.from_pretrained( @@ -346,7 +373,8 @@ def main(): num_labels=len(train_dataset.classes), image_size=data_args.image_size, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) else: config = CONFIG_MAPPING[model_args.model_type]() @@ -358,13 +386,15 @@ def main(): config=config, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype), - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) else: model = FlaxAutoModelForImageClassification.from_config( config, seed=training_args.seed, dtype=getattr(jnp, model_args.dtype), + trust_remote_code=model_args.trust_remote_code, ) # Store some constant diff --git a/examples/pytorch/benchmarking/README.md b/examples/legacy/benchmarking/README.md similarity index 59% rename from examples/pytorch/benchmarking/README.md rename to examples/legacy/benchmarking/README.md index 7099ed9f6b3d3d..03e174770d1077 100644 --- a/examples/pytorch/benchmarking/README.md +++ b/examples/legacy/benchmarking/README.md @@ -22,5 +22,5 @@ If you would like to list benchmark results on your favorite models of the [mode | Benchmark description | Results | Environment info | Author | |:----------|:-------------|:-------------|------:| -| PyTorch Benchmark on inference for `bert-base-cased` |[memory](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/inference_memory.csv) | [env](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/env.csv) | [Partick von Platen](https://github.com/patrickvonplaten) | -| PyTorch Benchmark on inference for `bert-base-cased` |[time](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/inference_time.csv) | 
[env](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/env.csv) | [Partick von Platen](https://github.com/patrickvonplaten) | +| PyTorch Benchmark on inference for `google-bert/bert-base-cased` |[memory](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/inference_memory.csv) | [env](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/env.csv) | [Partick von Platen](https://github.com/patrickvonplaten) | +| PyTorch Benchmark on inference for `google-bert/bert-base-cased` |[time](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/inference_time.csv) | [env](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/env.csv) | [Partick von Platen](https://github.com/patrickvonplaten) | diff --git a/examples/pytorch/benchmarking/plot_csv_file.py b/examples/legacy/benchmarking/plot_csv_file.py similarity index 96% rename from examples/pytorch/benchmarking/plot_csv_file.py rename to examples/legacy/benchmarking/plot_csv_file.py index 1a0ae735d8c671..9a9ad9c670470e 100644 --- a/examples/pytorch/benchmarking/plot_csv_file.py +++ b/examples/legacy/benchmarking/plot_csv_file.py @@ -83,7 +83,7 @@ def can_convert_to_float(string): class Plot: def __init__(self, args): self.args = args - self.result_dict = defaultdict(lambda: dict(bsz=[], seq_len=[], result={})) + self.result_dict = defaultdict(lambda: {"bsz": [], "seq_len": [], "result": {}}) with open(self.args.csv_file, newline="") as csv_file: reader = csv.DictReader(csv_file) @@ -116,8 +116,8 @@ def plot(self): axis.set_major_formatter(ScalarFormatter()) for model_name_idx, model_name in enumerate(self.result_dict.keys()): - batch_sizes = sorted(list(set(self.result_dict[model_name]["bsz"]))) - sequence_lengths = sorted(list(set(self.result_dict[model_name]["seq_len"]))) + batch_sizes = sorted(set(self.result_dict[model_name]["bsz"])) + sequence_lengths = sorted(set(self.result_dict[model_name]["seq_len"])) results = self.result_dict[model_name]["result"] (x_axis_array, inner_loop_array) = ( diff --git a/examples/pytorch/benchmarking/requirements.txt b/examples/legacy/benchmarking/requirements.txt similarity index 100% rename from examples/pytorch/benchmarking/requirements.txt rename to examples/legacy/benchmarking/requirements.txt diff --git a/examples/pytorch/benchmarking/run_benchmark.py b/examples/legacy/benchmarking/run_benchmark.py old mode 100755 new mode 100644 similarity index 100% rename from examples/pytorch/benchmarking/run_benchmark.py rename to examples/legacy/benchmarking/run_benchmark.py diff --git a/examples/legacy/multiple_choice/utils_multiple_choice.py b/examples/legacy/multiple_choice/utils_multiple_choice.py index 9ffaa7971b5624..e3bbc72884f31c 100644 --- a/examples/legacy/multiple_choice/utils_multiple_choice.py +++ b/examples/legacy/multiple_choice/utils_multiple_choice.py @@ -379,7 +379,7 @@ def get_test_examples(self, data_dir): """See base class.""" logger.info("LOOKING AT {} dev".format(data_dir)) raise ValueError( - "For swag testing, the input file does not contain a label column. It can not be tested in current code" + "For swag testing, the input file does not contain a label column. It can not be tested in current code " "setting!" ) return self._create_examples(self._read_csv(os.path.join(data_dir, "test.csv")), "test") @@ -541,7 +541,7 @@ def convert_examples_to_features( if "num_truncated_tokens" in inputs and inputs["num_truncated_tokens"] > 0: logger.info( "Attention! 
you are cropping tokens (swag task is ok). " - "If you are training ARC and RACE and you are poping question + options," + "If you are training ARC and RACE and you are poping question + options, " "you need to try to use a bigger max seq length!" ) diff --git a/examples/legacy/pytorch-lightning/lightning_base.py b/examples/legacy/pytorch-lightning/lightning_base.py index f246ecab0dd01b..640828bacd3401 100644 --- a/examples/legacy/pytorch-lightning/lightning_base.py +++ b/examples/legacy/pytorch-lightning/lightning_base.py @@ -313,7 +313,7 @@ def add_generic_args(parser, root_dir) -> None: type=str, default="O2", help=( - "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. " "See details at https://nvidia.github.io/apex/amp.html" ), ) diff --git a/examples/legacy/pytorch-lightning/requirements.txt b/examples/legacy/pytorch-lightning/requirements.txt index b3ed7cbc82ceb1..a6f2d6dce5a9d5 100644 --- a/examples/legacy/pytorch-lightning/requirements.txt +++ b/examples/legacy/pytorch-lightning/requirements.txt @@ -14,7 +14,7 @@ nltk pandas datasets >= 1.1.3 fire -pytest +pytest<8.0.1 conllu sentencepiece != 0.1.92 protobuf diff --git a/examples/legacy/pytorch-lightning/run_glue.py b/examples/legacy/pytorch-lightning/run_glue.py index aa2349f2809fd4..681f633fcd6d2b 100644 --- a/examples/legacy/pytorch-lightning/run_glue.py +++ b/examples/legacy/pytorch-lightning/run_glue.py @@ -23,7 +23,7 @@ class GLUETransformer(BaseTransformer): mode = "sequence-classification" def __init__(self, hparams): - if type(hparams) == dict: + if isinstance(hparams, dict): hparams = Namespace(**hparams) hparams.glue_output_mode = glue_output_modes[hparams.task] num_labels = glue_tasks_num_labels[hparams.task] @@ -124,7 +124,7 @@ def _eval_end(self, outputs) -> tuple: results = {**{"val_loss": val_loss_mean}, **compute_metrics(self.hparams.task, preds, out_label_ids)} - ret = {k: v for k, v in results.items()} + ret = dict(results.items()) ret["log"] = results return ret, preds_list, out_label_list @@ -192,7 +192,7 @@ def main(): # Optionally, predict on dev set and write to output_dir if args.do_predict: - checkpoints = list(sorted(glob.glob(os.path.join(args.output_dir, "checkpoint-epoch=*.ckpt"), recursive=True))) + checkpoints = sorted(glob.glob(os.path.join(args.output_dir, "checkpoint-epoch=*.ckpt"), recursive=True)) model = model.load_from_checkpoint(checkpoints[-1]) return trainer.test(model) diff --git a/examples/legacy/pytorch-lightning/run_ner.py b/examples/legacy/pytorch-lightning/run_ner.py index 3bcbdfee03b114..fc6f812275ea2c 100644 --- a/examples/legacy/pytorch-lightning/run_ner.py +++ b/examples/legacy/pytorch-lightning/run_ner.py @@ -25,7 +25,7 @@ class NERTransformer(BaseTransformer): mode = "token-classification" def __init__(self, hparams): - if type(hparams) == dict: + if isinstance(hparams, dict): hparams = Namespace(**hparams) module = import_module("tasks") try: @@ -122,7 +122,7 @@ def _eval_end(self, outputs): preds = np.argmax(preds, axis=2) out_label_ids = np.concatenate([x["target"] for x in outputs], axis=0) - label_map = {i: label for i, label in enumerate(self.labels)} + label_map = dict(enumerate(self.labels)) out_label_list = [[] for _ in range(out_label_ids.shape[0])] preds_list = [[] for _ in range(out_label_ids.shape[0])] @@ -140,7 +140,7 @@ def _eval_end(self, outputs): "f1": f1_score(out_label_list, preds_list), } - ret = {k: v for k, v in results.items()} + ret = 
dict(results.items()) ret["log"] = results return ret, preds_list, out_label_list @@ -211,6 +211,6 @@ def add_model_specific_args(parser, root_dir): # pl use this default format to create a checkpoint: # https://github.com/PyTorchLightning/pytorch-lightning/blob/master\ # /pytorch_lightning/callbacks/model_checkpoint.py#L322 - checkpoints = list(sorted(glob.glob(os.path.join(args.output_dir, "checkpoint-epoch=*.ckpt"), recursive=True))) + checkpoints = sorted(glob.glob(os.path.join(args.output_dir, "checkpoint-epoch=*.ckpt"), recursive=True)) model = model.load_from_checkpoint(checkpoints[-1]) trainer.test(model) diff --git a/examples/legacy/question-answering/README.md b/examples/legacy/question-answering/README.md index 494ae4ffd7eebf..339837c94f5d86 100644 --- a/examples/legacy/question-answering/README.md +++ b/examples/legacy/question-answering/README.md @@ -1,7 +1,7 @@ #### Fine-tuning BERT on SQuAD1.0 with relative position embeddings The following examples show how to fine-tune BERT models with different relative position embeddings. The BERT model -`bert-base-uncased` was pretrained with default absolute position embeddings. We provide the following pretrained +`google-bert/bert-base-uncased` was pretrained with default absolute position embeddings. We provide the following pretrained models which were pre-trained on the same training data (BooksCorpus and English Wikipedia) as in the BERT model training, but with different relative position embeddings. @@ -10,7 +10,7 @@ Shaw et al., [Self-Attention with Relative Position Representations](https://arx * `zhiheng-huang/bert-base-uncased-embedding-relative-key-query`, trained from scratch with relative embedding method 4 in Huang et al. [Improve Transformer Models with Better Relative Position Embeddings](https://arxiv.org/abs/2009.13658) * `zhiheng-huang/bert-large-uncased-whole-word-masking-embedding-relative-key-query`, fine-tuned from model -`bert-large-uncased-whole-word-masking` with 3 additional epochs with relative embedding method 4 in Huang et al. +`google-bert/bert-large-uncased-whole-word-masking` with 3 additional epochs with relative embedding method 4 in Huang et al. [Improve Transformer Models with Better Relative Position Embeddings](https://arxiv.org/abs/2009.13658) @@ -18,7 +18,7 @@ in Huang et al. [Improve Transformer Models with Better Relative Position Embedd ```bash export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \ +torchrun --nproc_per_node=8 ./examples/question-answering/run_squad.py \ --model_name_or_path zhiheng-huang/bert-base-uncased-embedding-relative-key-query \ --dataset_name squad \ --do_train \ @@ -46,7 +46,7 @@ gpu training leads to the f1 score of 90.71. ```bash export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \ +torchrun --nproc_per_node=8 ./examples/question-answering/run_squad.py \ --model_name_or_path zhiheng-huang/bert-large-uncased-whole-word-masking-embedding-relative-key-query \ --dataset_name squad \ --do_train \ @@ -61,15 +61,15 @@ python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answer --gradient_accumulation_steps 3 ``` Training with the above command leads to the f1 score of 93.52, which is slightly better than the f1 score of 93.15 for -`bert-large-uncased-whole-word-masking`. +`google-bert/bert-large-uncased-whole-word-masking`. 
#### Distributed training Here is an example using distributed training on 8 V100 GPUs and Bert Whole Word Masking uncased model to reach a F1 > 93 on SQuAD1.1: ```bash -python -m torch.distributed.launch --nproc_per_node=8 ./examples/question-answering/run_squad.py \ - --model_name_or_path bert-large-uncased-whole-word-masking \ +torchrun --nproc_per_node=8 ./examples/question-answering/run_squad.py \ + --model_name_or_path google-bert/bert-large-uncased-whole-word-masking \ --dataset_name squad \ --do_train \ --do_eval \ @@ -90,7 +90,7 @@ exact_match = 86.91 ``` This fine-tuned model is available as a checkpoint under the reference -[`bert-large-uncased-whole-word-masking-finetuned-squad`](https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad). +[`google-bert/bert-large-uncased-whole-word-masking-finetuned-squad`](https://huggingface.co/google-bert/bert-large-uncased-whole-word-masking-finetuned-squad). ## Results diff --git a/examples/legacy/question-answering/run_squad.py b/examples/legacy/question-answering/run_squad.py index d966b3f02f0315..999752485b9109 100644 --- a/examples/legacy/question-answering/run_squad.py +++ b/examples/legacy/question-answering/run_squad.py @@ -148,7 +148,7 @@ def train(args, train_dataset, model, tokenizer): # Check if continuing training from a checkpoint if os.path.exists(args.model_name_or_path): try: - # set global_step to gobal_step of last saved checkpoint from model path + # set global_step to global_step of last saved checkpoint from model path checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0] global_step = int(checkpoint_suffix) epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps) @@ -166,7 +166,7 @@ def train(args, train_dataset, model, tokenizer): train_iterator = trange( epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0] ) - # Added here for reproductibility + # Added here for reproducibility set_seed(args) for _ in train_iterator: @@ -663,7 +663,7 @@ def main(): type=str, default="O1", help=( - "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. 
" "See details at https://nvidia.github.io/apex/amp.html" ), ) @@ -705,7 +705,7 @@ def main(): if args.local_rank == -1 or args.no_cuda: device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count() - else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs torch.cuda.set_device(args.local_rank) device = torch.device("cuda", args.local_rank) torch.distributed.init_process_group(backend="nccl") @@ -810,10 +810,10 @@ def main(): logger.info("Loading checkpoints saved during training for evaluation") checkpoints = [args.output_dir] if args.eval_all_checkpoints: - checkpoints = list( + checkpoints = [ os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True)) - ) + ] else: logger.info("Loading checkpoint %s for evaluation", args.model_name_or_path) @@ -830,7 +830,7 @@ def main(): # Evaluate result = evaluate(args, model, tokenizer, prefix=global_step) - result = dict((k + ("_{}".format(global_step) if global_step else ""), v) for k, v in result.items()) + result = {k + ("_{}".format(global_step) if global_step else ""): v for k, v in result.items()} results.update(result) logger.info("Results: {}".format(results)) diff --git a/examples/legacy/run_camembert.py b/examples/legacy/run_camembert.py index 9651570b39e1e8..67e04babe1043e 100755 --- a/examples/legacy/run_camembert.py +++ b/examples/legacy/run_camembert.py @@ -39,8 +39,8 @@ def fill_mask(masked_input, model, tokenizer, topk=5): return topk_filled_outputs -tokenizer = CamembertTokenizer.from_pretrained("camembert-base") -model = CamembertForMaskedLM.from_pretrained("camembert-base") +tokenizer = CamembertTokenizer.from_pretrained("almanach/camembert-base") +model = CamembertForMaskedLM.from_pretrained("almanach/camembert-base") model.eval() masked_input = "Le camembert est :)" diff --git a/examples/legacy/run_language_modeling.py b/examples/legacy/run_language_modeling.py index 59490f710e1338..b1576586562c4a 100755 --- a/examples/legacy/run_language_modeling.py +++ b/examples/legacy/run_language_modeling.py @@ -149,7 +149,7 @@ class DataTrainingArguments: default=-1, metadata={ "help": ( - "Optional input sequence length after tokenization." + "Optional input sequence length after tokenization. " "The training dataset will be truncated in block of this size for training." "Default to the model max input length for single sentence inputs (take into account special tokens)." ) @@ -283,7 +283,7 @@ def main(): if config.model_type in ["bert", "roberta", "distilbert", "camembert"] and not data_args.mlm: raise ValueError( - "BERT and RoBERTa-like models do not have LM heads but masked LM heads. They must be run using the" + "BERT and RoBERTa-like models do not have LM heads but masked LM heads. They must be run using the " "--mlm flag (masked language modeling)." 
) diff --git a/examples/legacy/run_openai_gpt.py b/examples/legacy/run_openai_gpt.py index 1f02570f8f514a..d0c21aba27eaca 100755 --- a/examples/legacy/run_openai_gpt.py +++ b/examples/legacy/run_openai_gpt.py @@ -20,7 +20,7 @@ This script with default values fine-tunes and evaluate a pretrained OpenAI GPT on the RocStories dataset: python run_openai_gpt.py \ - --model_name openai-gpt \ + --model_name openai-community/openai-gpt \ --do_train \ --do_eval \ --train_dataset "$ROC_STORIES_DIR/cloze_test_val__spring2016 - cloze_test_ALL_val.csv" \ @@ -104,7 +104,7 @@ def pre_process_datasets(encoded_datasets, input_len, cap_length, start_token, d def main(): parser = argparse.ArgumentParser() - parser.add_argument("--model_name", type=str, default="openai-gpt", help="pretrained model name") + parser.add_argument("--model_name", type=str, default="openai-community/openai-gpt", help="pretrained model name") parser.add_argument("--do_train", action="store_true", help="Whether to run training.") parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.") parser.add_argument( @@ -189,7 +189,7 @@ def tokenize_and_encode(obj): return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(obj)) elif isinstance(obj, int): return obj - return list(tokenize_and_encode(o) for o in obj) + return [tokenize_and_encode(o) for o in obj] logger.info("Encoding dataset...") train_dataset = load_rocstories_dataset(args.train_dataset) diff --git a/examples/legacy/run_swag.py b/examples/legacy/run_swag.py index 5cac1567243c3e..66d77a1742b22a 100755 --- a/examples/legacy/run_swag.py +++ b/examples/legacy/run_swag.py @@ -338,7 +338,7 @@ def train(args, train_dataset, model, tokenizer): tr_loss, logging_loss = 0.0, 0.0 model.zero_grad() train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) - set_seed(args) # Added here for reproductibility + set_seed(args) # Added here for reproducibility for _ in train_iterator: epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) for step, batch in enumerate(epoch_iterator): @@ -538,7 +538,7 @@ def main(): default=1, help="Number of updates steps to accumulate before performing a backward/update pass.", ) - parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight deay if we apply some.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") parser.add_argument( @@ -579,7 +579,7 @@ def main(): type=str, default="O1", help=( - "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. 
" "See details at https://nvidia.github.io/apex/amp.html" ), ) @@ -612,7 +612,7 @@ def main(): if args.local_rank == -1 or args.no_cuda: device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count() - else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs torch.cuda.set_device(args.local_rank) device = torch.device("cuda", args.local_rank) torch.distributed.init_process_group(backend="nccl") @@ -696,9 +696,9 @@ def main(): checkpoints = [args.model_name_or_path] if args.eval_all_checkpoints: - checkpoints = list( + checkpoints = [ os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True)) - ) + ] logger.info("Evaluate the following checkpoints: %s", checkpoints) @@ -712,7 +712,7 @@ def main(): # Evaluate result = evaluate(args, model, tokenizer, prefix=global_step) - result = dict((k + ("_{}".format(global_step) if global_step else ""), v) for k, v in result.items()) + result = {k + ("_{}".format(global_step) if global_step else ""): v for k, v in result.items()} results.update(result) logger.info("Results: {}".format(results)) diff --git a/examples/legacy/run_transfo_xl.py b/examples/legacy/run_transfo_xl.py index 7ee941150852e1..1c48974f39c77a 100755 --- a/examples/legacy/run_transfo_xl.py +++ b/examples/legacy/run_transfo_xl.py @@ -40,7 +40,7 @@ def main(): parser = argparse.ArgumentParser(description="PyTorch Transformer Language Model") - parser.add_argument("--model_name", type=str, default="transfo-xl-wt103", help="pretrained model name") + parser.add_argument("--model_name", type=str, default="transfo-xl/transfo-xl-wt103", help="pretrained model name") parser.add_argument( "--split", type=str, default="test", choices=["all", "valid", "test"], help="which split to evaluate" ) diff --git a/examples/legacy/seq2seq/README.md b/examples/legacy/seq2seq/README.md index 5a3c2dbd3506be..f574ccabda2c4a 100644 --- a/examples/legacy/seq2seq/README.md +++ b/examples/legacy/seq2seq/README.md @@ -140,7 +140,7 @@ python finetune_trainer.py --help For multi-gpu training use `torch.distributed.launch`, e.g. with 2 gpus: ```bash -python -m torch.distributed.launch --nproc_per_node=2 finetune_trainer.py ... +torchrun --nproc_per_node=2 finetune_trainer.py ... ``` **At the moment, `Seq2SeqTrainer` does not support *with teacher* distillation.** @@ -170,7 +170,7 @@ If 'translation' is in your task name, the computed metric will be BLEU. Otherwi For t5, you need to specify --task translation_{src}_to_{tgt} as follows: ```bash export DATA_DIR=wmt_en_ro -./run_eval.py t5-base \ +./run_eval.py google-t5/t5-base \ $DATA_DIR/val.source t5_val_generations.txt \ --reference_path $DATA_DIR/val.target \ --score_path enro_bleu.json \ @@ -214,7 +214,7 @@ because it uses SortishSampler to minimize padding. You can also use it on 1 GPU `{type_path}.source` and `{type_path}.target`. Run `./run_distributed_eval.py --help` for all clargs. 
```bash -python -m torch.distributed.launch --nproc_per_node=8 run_distributed_eval.py \ +torchrun --nproc_per_node=8 run_distributed_eval.py \ --model_name sshleifer/distilbart-large-xsum-12-3 \ --save_dir xsum_generations \ --data_dir xsum \ @@ -228,7 +228,7 @@ Contributions that implement this command for other distributed hardware setups When using `run_eval.py`, the following features can be useful: * if you running the script multiple times and want to make it easier to track what arguments produced that output, use `--dump-args`. Along with the results it will also dump any custom params that were passed to the script. For example if you used: `--num_beams 8 --early_stopping true`, the output will be: - ``` + ```json {'bleu': 26.887, 'n_obs': 10, 'runtime': 1, 'seconds_per_sample': 0.1, 'num_beams': 8, 'early_stopping': True} ``` @@ -236,13 +236,13 @@ When using `run_eval.py`, the following features can be useful: If using `--dump-args --info`, the output will be: - ``` + ```json {'bleu': 26.887, 'n_obs': 10, 'runtime': 1, 'seconds_per_sample': 0.1, 'num_beams': 8, 'early_stopping': True, 'info': '2020-09-13 18:44:43'} ``` If using `--dump-args --info "pair:en-ru chkpt=best`, the output will be: - ``` + ```json {'bleu': 26.887, 'n_obs': 10, 'runtime': 1, 'seconds_per_sample': 0.1, 'num_beams': 8, 'early_stopping': True, 'info': 'pair=en-ru chkpt=best'} ``` @@ -321,7 +321,7 @@ For example, ./save_len_file.py Helsinki-NLP/opus-mt-en-ro wmt_en_ro ./dynamic_bs_example.sh --max_tokens_per_batch=2000 --output_dir benchmark_dynamic_bs ``` -splits `wmt_en_ro/train` into 11,197 uneven lengthed batches and can finish 1 epoch in 8 minutes on a v100. +splits `wmt_en_ro/train` into 11,197 uneven length batches and can finish 1 epoch in 8 minutes on a v100. 
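For context on the `--dump-args`/`--info` snippets above (now fenced as `json`): they are the score dict produced by `run_eval.py`, merged with the extra generation arguments and the optional info string before being written to `--score_path`. A rough sketch of that merge, using hypothetical literal values in place of the parsed arguments:

```python
import json
from datetime import datetime

# Metrics as returned by the generation helper (same shape as the JSON snippets above).
scores = {"bleu": 26.887, "n_obs": 10, "runtime": 1, "seconds_per_sample": 0.1}

# Extra generation arguments captured because --dump-args was passed.
scores.update({"num_beams": 8, "early_stopping": True})

# --info stores the supplied string, or a timestamp when the flag is given without a value.
custom_info = "pair=en-ru chkpt=best"
scores["info"] = custom_info or datetime.now().strftime("%Y-%m-%d %H:%M:%S")

# Written to the path given via --score_path (metrics.json by default).
with open("metrics.json", "w") as f:
    json.dump(scores, f, indent=2)
```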
For comparison, ```bash diff --git a/examples/legacy/seq2seq/old_test_datasets.py b/examples/legacy/seq2seq/old_test_datasets.py index 0b907b1ed9fbb6..be108f7645f8a9 100644 --- a/examples/legacy/seq2seq/old_test_datasets.py +++ b/examples/legacy/seq2seq/old_test_datasets.py @@ -28,7 +28,7 @@ from utils import FAIRSEQ_AVAILABLE, DistributedSortishSampler, LegacySeq2SeqDataset, Seq2SeqDataset -BERT_BASE_CASED = "bert-base-cased" +BERT_BASE_CASED = "google-bert/bert-base-cased" PEGASUS_XSUM = "google/pegasus-xsum" ARTICLES = [" Sam ate lunch today.", "Sams lunch ingredients."] SUMMARIES = ["A very interesting story about what I ate for lunch.", "Avocado, celery, turkey, coffee"] diff --git a/examples/legacy/seq2seq/pack_dataset.py b/examples/legacy/seq2seq/pack_dataset.py index 8b069e452a7177..5c13c74f412df6 100755 --- a/examples/legacy/seq2seq/pack_dataset.py +++ b/examples/legacy/seq2seq/pack_dataset.py @@ -74,7 +74,7 @@ def pack_data_dir(tok, data_dir: Path, max_tokens, save_path): def packer_cli(): parser = argparse.ArgumentParser() - parser.add_argument("--tok_name", type=str, help="like facebook/bart-large-cnn,t5-base, etc.") + parser.add_argument("--tok_name", type=str, help="like facebook/bart-large-cnn,google-t5/t5-base, etc.") parser.add_argument("--max_seq_len", type=int, default=128) parser.add_argument("--data_dir", type=str) parser.add_argument("--save_path", type=str) diff --git a/examples/legacy/seq2seq/requirements.txt b/examples/legacy/seq2seq/requirements.txt index e40aef17932017..434f647adea299 100644 --- a/examples/legacy/seq2seq/requirements.txt +++ b/examples/legacy/seq2seq/requirements.txt @@ -14,7 +14,7 @@ nltk pandas datasets >= 1.1.3 fire -pytest +pytest<8.0.1 conllu sentencepiece != 0.1.92 protobuf diff --git a/examples/legacy/seq2seq/run_distributed_eval.py b/examples/legacy/seq2seq/run_distributed_eval.py index 655807ba172ee0..40a946f81c5e15 100755 --- a/examples/legacy/seq2seq/run_distributed_eval.py +++ b/examples/legacy/seq2seq/run_distributed_eval.py @@ -111,7 +111,7 @@ def eval_data_dir( if num_return_sequences > 1: preds = chunks(preds, num_return_sequences) # batch size chunks, each of size num_return_seq for i, pred in enumerate(preds): - results.append(dict(pred=pred, id=ids[i].item())) + results.append({"pred": pred, "id": ids[i].item()}) save_json(results, save_path) return results, sampler.num_replicas @@ -124,7 +124,7 @@ def run_generate(): parser.add_argument( "--model_name", type=str, - help="like facebook/bart-large-cnn,t5-base, etc.", + help="like facebook/bart-large-cnn,google-t5/t5-base, etc.", default="sshleifer/distilbart-xsum-12-3", ) parser.add_argument("--save_dir", type=str, help="where to save", default="tmp_gen") @@ -154,7 +154,7 @@ def run_generate(): parser.add_argument("--src_lang", type=str, default=None, required=False) parser.add_argument("--tgt_lang", type=str, default=None, required=False) parser.add_argument( - "--prefix", type=str, required=False, default=None, help="will be added to the begininng of src examples" + "--prefix", type=str, required=False, default=None, help="will be added to the beginning of src examples" ) parser.add_argument("--fp16", action="store_true") parser.add_argument("--debug", action="store_true") @@ -232,7 +232,7 @@ def combine_partial_results(partial_results) -> List: records = [] for partial_result in partial_results: records.extend(partial_result) - records = list(sorted(records, key=lambda x: x["id"])) + records = sorted(records, key=lambda x: x["id"]) preds = [x["pred"] for x in records] return 
preds diff --git a/examples/legacy/seq2seq/run_eval.py b/examples/legacy/seq2seq/run_eval.py index a8aa8e7ef95d23..f69e5d51264c78 100755 --- a/examples/legacy/seq2seq/run_eval.py +++ b/examples/legacy/seq2seq/run_eval.py @@ -76,7 +76,7 @@ def generate_summaries_or_translations( fout.close() runtime = int(time.time() - start_time) # seconds n_obs = len(examples) - return dict(n_obs=n_obs, runtime=runtime, seconds_per_sample=round(runtime / n_obs, 4)) + return {"n_obs": n_obs, "runtime": runtime, "seconds_per_sample": round(runtime / n_obs, 4)} def datetime_now(): @@ -100,14 +100,14 @@ def run_generate(verbose=True): """ parser = argparse.ArgumentParser() - parser.add_argument("model_name", type=str, help="like facebook/bart-large-cnn,t5-base, etc.") + parser.add_argument("model_name", type=str, help="like facebook/bart-large-cnn,google-t5/t5-base, etc.") parser.add_argument("input_path", type=str, help="like cnn_dm/test.source") parser.add_argument("save_path", type=str, help="where to save summaries") parser.add_argument("--reference_path", type=str, required=False, help="like cnn_dm/test.target") parser.add_argument("--score_path", type=str, required=False, default="metrics.json", help="where to save metrics") parser.add_argument("--device", type=str, required=False, default=DEFAULT_DEVICE, help="cuda, cuda:1, cpu etc.") parser.add_argument( - "--prefix", type=str, required=False, default=None, help="will be added to the begininng of src examples" + "--prefix", type=str, required=False, default=None, help="will be added to the beginning of src examples" ) parser.add_argument("--task", type=str, default="summarization", help="used for task_specific_params + metrics") parser.add_argument("--bs", type=int, default=8, required=False, help="batch size") diff --git a/examples/legacy/seq2seq/run_eval_search.py b/examples/legacy/seq2seq/run_eval_search.py index c72f038fc50ab2..9b5debfb2795ee 100755 --- a/examples/legacy/seq2seq/run_eval_search.py +++ b/examples/legacy/seq2seq/run_eval_search.py @@ -34,9 +34,9 @@ def parse_search_arg(search): groups = search.split() - entries = {k: vs for k, vs in (g.split("=") for g in groups)} + entries = dict((g.split("=") for g in groups)) entry_names = list(entries.keys()) - sets = [list(f"--{k} {v}" for v in vs.split(":")) for k, vs in entries.items()] + sets = [[f"--{k} {v}" for v in vs.split(":")] for k, vs in entries.items()] matrix = [list(x) for x in itertools.product(*sets)] return matrix, entry_names @@ -105,7 +105,7 @@ def run_search(): col_widths = {col: len(str(col)) for col in col_names} results = [] for r in matrix: - hparams = {k: v for k, v in (x.replace("--", "").split() for x in r)} + hparams = dict((x.replace("--", "").split() for x in r)) args_exp = " ".join(r).split() args_exp.extend(["--bs", str(args.bs)]) # in case we need to reduce its size due to CUDA OOM sys.argv = args_normal + args_exp diff --git a/examples/legacy/seq2seq/seq2seq_trainer.py b/examples/legacy/seq2seq/seq2seq_trainer.py index dbf12725f2db07..bb219fd2bcb94d 100644 --- a/examples/legacy/seq2seq/seq2seq_trainer.py +++ b/examples/legacy/seq2seq/seq2seq_trainer.py @@ -19,7 +19,6 @@ from torch.utils.data import DistributedSampler, RandomSampler from transformers import PreTrainedModel, Trainer, logging -from transformers.integrations import is_fairscale_available from transformers.models.fsmt.configuration_fsmt import FSMTConfig from transformers.optimization import ( Adafactor, @@ -36,10 +35,6 @@ from transformers.utils import is_torch_tpu_available -if 
is_fairscale_available(): - from fairscale.optim import OSS - - logger = logging.get_logger(__name__) arg_to_scheduler = { @@ -70,7 +65,7 @@ def __init__(self, config=None, data_args=None, *args, **kwargs): if self.args.label_smoothing != 0 or (self.data_args is not None and self.data_args.ignore_pad_token_for_loss): assert self.config.pad_token_id is not None, ( - "Make sure that `config.pad_token_id` is correcly defined when ignoring `pad_token` for loss" + "Make sure that `config.pad_token_id` is correctly defined when ignoring `pad_token` for loss" " calculation or doing label smoothing." ) @@ -118,14 +113,7 @@ def create_optimizer_and_scheduler(self, num_training_steps: int): "eps": self.args.adam_epsilon, } optimizer_kwargs["lr"] = self.args.learning_rate - if self.sharded_ddp: - self.optimizer = OSS( - params=optimizer_grouped_parameters, - optim=optimizer_cls, - **optimizer_kwargs, - ) - else: - self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs) + self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs) if self.lr_scheduler is None: self.lr_scheduler = self._get_lr_scheduler(num_training_steps) diff --git a/examples/legacy/seq2seq/seq2seq_training_args.py b/examples/legacy/seq2seq/seq2seq_training_args.py index 1583acd36fc4b7..9da1c69262a8c0 100644 --- a/examples/legacy/seq2seq/seq2seq_training_args.py +++ b/examples/legacy/seq2seq/seq2seq_training_args.py @@ -31,7 +31,7 @@ class Seq2SeqTrainingArguments(TrainingArguments): label_smoothing (:obj:`float`, `optional`, defaults to 0): The label smoothing epsilon to apply (if not zero). sortish_sampler (:obj:`bool`, `optional`, defaults to :obj:`False`): - Whether to SortishSamler or not. It sorts the inputs according to lenghts in-order to minimizing the padding size. + Whether to SortishSampler or not. It sorts the inputs according to lengths in-order to minimizing the padding size. predict_with_generate (:obj:`bool`, `optional`, defaults to :obj:`False`): Whether to use generate to calculate generative metrics (ROUGE, BLEU). 
""" @@ -39,7 +39,7 @@ class Seq2SeqTrainingArguments(TrainingArguments): label_smoothing: Optional[float] = field( default=0.0, metadata={"help": "The label smoothing epsilon to apply (if not zero)."} ) - sortish_sampler: bool = field(default=False, metadata={"help": "Whether to SortishSamler or not."}) + sortish_sampler: bool = field(default=False, metadata={"help": "Whether to SortishSampler or not."}) predict_with_generate: bool = field( default=False, metadata={"help": "Whether to use generate to calculate generative metrics (ROUGE, BLEU)."} ) diff --git a/examples/legacy/seq2seq/test_data/test_data b/examples/legacy/seq2seq/test_data/test_data deleted file mode 120000 index 9eee112ad74163..00000000000000 --- a/examples/legacy/seq2seq/test_data/test_data +++ /dev/null @@ -1 +0,0 @@ -seq2seq/test_data \ No newline at end of file diff --git a/examples/legacy/seq2seq/utils.py b/examples/legacy/seq2seq/utils.py index 2655165cf11adf..d7cd84dedb287d 100644 --- a/examples/legacy/seq2seq/utils.py +++ b/examples/legacy/seq2seq/utils.py @@ -456,7 +456,7 @@ def pickle_save(obj, path): def flatten_list(summary_ids: List[List]): - return [x for x in itertools.chain.from_iterable(summary_ids)] + return list(itertools.chain.from_iterable(summary_ids)) def save_git_info(folder_path: str) -> None: diff --git a/examples/legacy/text-classification/run_tf_text_classification.py b/examples/legacy/text-classification/run_tf_text_classification.py deleted file mode 100755 index 1f845db04c0448..00000000000000 --- a/examples/legacy/text-classification/run_tf_text_classification.py +++ /dev/null @@ -1,313 +0,0 @@ -#!/usr/bin/env python -# coding=utf-8 -# Copyright 2020 The HuggingFace Team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-""" Fine-tuning the library models for sequence classification.""" - - -import logging -import os -from dataclasses import dataclass, field -from typing import Dict, Optional - -import datasets -import numpy as np -import tensorflow as tf - -from transformers import ( - AutoConfig, - AutoTokenizer, - EvalPrediction, - HfArgumentParser, - PreTrainedTokenizer, - TFAutoModelForSequenceClassification, - TFTrainer, - TFTrainingArguments, -) -from transformers.utils import logging as hf_logging - - -hf_logging.set_verbosity_info() -hf_logging.enable_default_handler() -hf_logging.enable_explicit_format() - - -def get_tfds( - train_file: str, - eval_file: str, - test_file: str, - tokenizer: PreTrainedTokenizer, - label_column_id: int, - max_seq_length: Optional[int] = None, -): - files = {} - - if train_file is not None: - files[datasets.Split.TRAIN] = [train_file] - if eval_file is not None: - files[datasets.Split.VALIDATION] = [eval_file] - if test_file is not None: - files[datasets.Split.TEST] = [test_file] - - ds = datasets.load_dataset("csv", data_files=files) - features_name = list(ds[list(files.keys())[0]].features.keys()) - label_name = features_name.pop(label_column_id) - label_list = list(set(ds[list(files.keys())[0]][label_name])) - label2id = {label: i for i, label in enumerate(label_list)} - input_names = tokenizer.model_input_names - transformed_ds = {} - - if len(features_name) == 1: - for k in files.keys(): - transformed_ds[k] = ds[k].map( - lambda example: tokenizer.batch_encode_plus( - example[features_name[0]], truncation=True, max_length=max_seq_length, padding="max_length" - ), - batched=True, - ) - elif len(features_name) == 2: - for k in files.keys(): - transformed_ds[k] = ds[k].map( - lambda example: tokenizer.batch_encode_plus( - (example[features_name[0]], example[features_name[1]]), - truncation=True, - max_length=max_seq_length, - padding="max_length", - ), - batched=True, - ) - - def gen_train(): - for ex in transformed_ds[datasets.Split.TRAIN]: - d = {k: v for k, v in ex.items() if k in input_names} - label = label2id[ex[label_name]] - yield (d, label) - - def gen_val(): - for ex in transformed_ds[datasets.Split.VALIDATION]: - d = {k: v for k, v in ex.items() if k in input_names} - label = label2id[ex[label_name]] - yield (d, label) - - def gen_test(): - for ex in transformed_ds[datasets.Split.TEST]: - d = {k: v for k, v in ex.items() if k in input_names} - label = label2id[ex[label_name]] - yield (d, label) - - train_ds = ( - tf.data.Dataset.from_generator( - gen_train, - ({k: tf.int32 for k in input_names}, tf.int64), - ({k: tf.TensorShape([None]) for k in input_names}, tf.TensorShape([])), - ) - if datasets.Split.TRAIN in transformed_ds - else None - ) - - if train_ds is not None: - train_ds = train_ds.apply(tf.data.experimental.assert_cardinality(len(ds[datasets.Split.TRAIN]))) - - val_ds = ( - tf.data.Dataset.from_generator( - gen_val, - ({k: tf.int32 for k in input_names}, tf.int64), - ({k: tf.TensorShape([None]) for k in input_names}, tf.TensorShape([])), - ) - if datasets.Split.VALIDATION in transformed_ds - else None - ) - - if val_ds is not None: - val_ds = val_ds.apply(tf.data.experimental.assert_cardinality(len(ds[datasets.Split.VALIDATION]))) - - test_ds = ( - tf.data.Dataset.from_generator( - gen_test, - ({k: tf.int32 for k in input_names}, tf.int64), - ({k: tf.TensorShape([None]) for k in input_names}, tf.TensorShape([])), - ) - if datasets.Split.TEST in transformed_ds - else None - ) - - if test_ds is not None: - test_ds = 
test_ds.apply(tf.data.experimental.assert_cardinality(len(ds[datasets.Split.TEST]))) - - return train_ds, val_ds, test_ds, label2id - - -logger = logging.getLogger(__name__) - - -@dataclass -class DataTrainingArguments: - """ - Arguments pertaining to what data we are going to input our model for training and eval. - - Using `HfArgumentParser` we can turn this class - into argparse arguments to be able to specify them on - the command line. - """ - - label_column_id: int = field(metadata={"help": "Which column contains the label"}) - train_file: str = field(default=None, metadata={"help": "The path of the training file"}) - dev_file: Optional[str] = field(default=None, metadata={"help": "The path of the development file"}) - test_file: Optional[str] = field(default=None, metadata={"help": "The path of the test file"}) - max_seq_length: int = field( - default=128, - metadata={ - "help": ( - "The maximum total input sequence length after tokenization. Sequences longer " - "than this will be truncated, sequences shorter will be padded." - ) - }, - ) - overwrite_cache: bool = field( - default=False, metadata={"help": "Overwrite the cached training and evaluation sets"} - ) - - -@dataclass -class ModelArguments: - """ - Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. - """ - - model_name_or_path: str = field( - metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"} - ) - config_name: Optional[str] = field( - default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} - ) - tokenizer_name: Optional[str] = field( - default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} - ) - use_fast: bool = field(default=False, metadata={"help": "Set this flag to use fast tokenization."}) - # If you want to tweak more attributes on your tokenizer, you should do it in a distinct script, - # or just modify its tokenizer_config.json. - cache_dir: Optional[str] = field( - default=None, - metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"}, - ) - - -def main(): - # See all possible arguments in src/transformers/training_args.py - # or by passing the --help flag to this script. - # We now keep distinct sets of args, for a cleaner separation of concerns. - parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TFTrainingArguments)) - model_args, data_args, training_args = parser.parse_args_into_dataclasses() - - if ( - os.path.exists(training_args.output_dir) - and os.listdir(training_args.output_dir) - and training_args.do_train - and not training_args.overwrite_output_dir - ): - raise ValueError( - f"Output directory ({training_args.output_dir}) already exists and is not empty. Use" - " --overwrite_output_dir to overcome." - ) - - # Setup logging - logging.basicConfig( - format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", - datefmt="%m/%d/%Y %H:%M:%S", - level=logging.INFO, - ) - logger.info( - f"n_replicas: {training_args.n_replicas}, distributed training: {bool(training_args.n_replicas > 1)}, " - f"16-bits training: {training_args.fp16}" - ) - logger.info(f"Training/evaluation parameters {training_args}") - - # Load pretrained model and tokenizer - # - # Distributed training: - # The .from_pretrained methods guarantee that only one local process can concurrently - # download model & vocab. 
- - tokenizer = AutoTokenizer.from_pretrained( - model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, - cache_dir=model_args.cache_dir, - ) - - train_dataset, eval_dataset, test_ds, label2id = get_tfds( - train_file=data_args.train_file, - eval_file=data_args.dev_file, - test_file=data_args.test_file, - tokenizer=tokenizer, - label_column_id=data_args.label_column_id, - max_seq_length=data_args.max_seq_length, - ) - - config = AutoConfig.from_pretrained( - model_args.config_name if model_args.config_name else model_args.model_name_or_path, - num_labels=len(label2id), - label2id=label2id, - id2label={id: label for label, id in label2id.items()}, - finetuning_task="text-classification", - cache_dir=model_args.cache_dir, - ) - - with training_args.strategy.scope(): - model = TFAutoModelForSequenceClassification.from_pretrained( - model_args.model_name_or_path, - from_pt=bool(".bin" in model_args.model_name_or_path), - config=config, - cache_dir=model_args.cache_dir, - ) - - def compute_metrics(p: EvalPrediction) -> Dict: - preds = np.argmax(p.predictions, axis=1) - - return {"acc": (preds == p.label_ids).mean()} - - # Initialize our Trainer - trainer = TFTrainer( - model=model, - args=training_args, - train_dataset=train_dataset, - eval_dataset=eval_dataset, - compute_metrics=compute_metrics, - ) - - # Training - if training_args.do_train: - trainer.train() - trainer.save_model() - tokenizer.save_pretrained(training_args.output_dir) - - # Evaluation - results = {} - if training_args.do_eval: - logger.info("*** Evaluate ***") - result = trainer.evaluate() - output_eval_file = os.path.join(training_args.output_dir, "eval_results.txt") - - with open(output_eval_file, "w") as writer: - logger.info("***** Eval results *****") - - for key, value in result.items(): - logger.info(f" {key} = {value}") - writer.write(f"{key} = {value}\n") - - results.update(result) - - return results - - -if __name__ == "__main__": - main() diff --git a/examples/legacy/token-classification/README.md b/examples/legacy/token-classification/README.md index c2fa6eec7282b2..fbf17f84d2d7ee 100644 --- a/examples/legacy/token-classification/README.md +++ b/examples/legacy/token-classification/README.md @@ -34,7 +34,7 @@ Let's define some variables that we need for further pre-processing steps and tr ```bash export MAX_LENGTH=128 -export BERT_MODEL=bert-base-multilingual-cased +export BERT_MODEL=google-bert/bert-base-multilingual-cased ``` Run the pre-processing script on training, dev and test datasets: @@ -92,7 +92,7 @@ Instead of passing all parameters via commandline arguments, the `run_ner.py` sc { "data_dir": ".", "labels": "./labels.txt", - "model_name_or_path": "bert-base-multilingual-cased", + "model_name_or_path": "google-bert/bert-base-multilingual-cased", "output_dir": "germeval-model", "max_seq_length": 128, "num_train_epochs": 3, @@ -222,7 +222,7 @@ Let's define some variables that we need for further pre-processing steps: ```bash export MAX_LENGTH=128 -export BERT_MODEL=bert-large-cased +export BERT_MODEL=google-bert/bert-large-cased ``` Here we use the English BERT large model for fine-tuning. 
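For reference, a minimal sketch of loading the renamed checkpoint shown above (network access to the Hugging Face Hub assumed; the legacy un-prefixed `bert-large-cased` name should still resolve to the same repository):

```python
from transformers import AutoTokenizer

# The namespaced ID used in the README above; the legacy "bert-large-cased"
# spelling should point at the same Hub repository.
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-large-cased")
print(tokenizer.tokenize("Fine-tuning for token classification"))
```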
@@ -250,7 +250,7 @@ This configuration file looks like: { "data_dir": "./data_wnut_17", "labels": "./data_wnut_17/labels.txt", - "model_name_or_path": "bert-large-cased", + "model_name_or_path": "google-bert/bert-large-cased", "output_dir": "wnut-17-model-1", "max_seq_length": 128, "num_train_epochs": 3, diff --git a/examples/legacy/token-classification/run_ner.py b/examples/legacy/token-classification/run_ner.py index 212ea986b4245b..c571d44a1203c5 100644 --- a/examples/legacy/token-classification/run_ner.py +++ b/examples/legacy/token-classification/run_ner.py @@ -158,7 +158,7 @@ def main(): # Prepare CONLL-2003 task labels = token_classification_task.get_labels(data_args.labels) - label_map: Dict[int, str] = {i: label for i, label in enumerate(labels)} + label_map: Dict[int, str] = dict(enumerate(labels)) num_labels = len(labels) # Load pretrained model and tokenizer diff --git a/examples/legacy/token-classification/run_tf_ner.py b/examples/legacy/token-classification/run_tf_ner.py deleted file mode 100755 index df4770a70fa44d..00000000000000 --- a/examples/legacy/token-classification/run_tf_ner.py +++ /dev/null @@ -1,310 +0,0 @@ -#!/usr/bin/env python -# coding=utf-8 -# Copyright 2018 The HuggingFace Inc. team. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -""" Fine-tuning the library models for named entity recognition.""" - - -import logging -import os -from dataclasses import dataclass, field -from importlib import import_module -from typing import Dict, List, Optional, Tuple - -import numpy as np -from seqeval.metrics import classification_report, f1_score, precision_score, recall_score -from utils_ner import Split, TFTokenClassificationDataset, TokenClassificationTask - -from transformers import ( - AutoConfig, - AutoTokenizer, - EvalPrediction, - HfArgumentParser, - TFAutoModelForTokenClassification, - TFTrainer, - TFTrainingArguments, -) -from transformers.utils import logging as hf_logging - - -hf_logging.set_verbosity_info() -hf_logging.enable_default_handler() -hf_logging.enable_explicit_format() - - -logger = logging.getLogger(__name__) - - -@dataclass -class ModelArguments: - """ - Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. - """ - - model_name_or_path: str = field( - metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"} - ) - config_name: Optional[str] = field( - default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} - ) - task_type: Optional[str] = field( - default="NER", metadata={"help": "Task type to fine tune in training (e.g. NER, POS, etc)"} - ) - tokenizer_name: Optional[str] = field( - default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} - ) - use_fast: bool = field(default=False, metadata={"help": "Set this flag to use fast tokenization."}) - # If you want to tweak more attributes on your tokenizer, you should do it in a distinct script, - # or just modify its tokenizer_config.json. 
- cache_dir: Optional[str] = field( - default=None, - metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"}, - ) - - -@dataclass -class DataTrainingArguments: - """ - Arguments pertaining to what data we are going to input our model for training and eval. - """ - - data_dir: str = field( - metadata={"help": "The input data dir. Should contain the .txt files for a CoNLL-2003-formatted task."} - ) - labels: Optional[str] = field( - metadata={"help": "Path to a file containing all labels. If not specified, CoNLL-2003 labels are used."} - ) - max_seq_length: int = field( - default=128, - metadata={ - "help": ( - "The maximum total input sequence length after tokenization. Sequences longer " - "than this will be truncated, sequences shorter will be padded." - ) - }, - ) - overwrite_cache: bool = field( - default=False, metadata={"help": "Overwrite the cached training and evaluation sets"} - ) - - -def main(): - # See all possible arguments in src/transformers/training_args.py - # or by passing the --help flag to this script. - # We now keep distinct sets of args, for a cleaner separation of concerns. - parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TFTrainingArguments)) - model_args, data_args, training_args = parser.parse_args_into_dataclasses() - - if ( - os.path.exists(training_args.output_dir) - and os.listdir(training_args.output_dir) - and training_args.do_train - and not training_args.overwrite_output_dir - ): - raise ValueError( - f"Output directory ({training_args.output_dir}) already exists and is not empty. Use" - " --overwrite_output_dir to overcome." - ) - - module = import_module("tasks") - - try: - token_classification_task_clazz = getattr(module, model_args.task_type) - token_classification_task: TokenClassificationTask = token_classification_task_clazz() - except AttributeError: - raise ValueError( - f"Task {model_args.task_type} needs to be defined as a TokenClassificationTask subclass in {module}. " - f"Available tasks classes are: {TokenClassificationTask.__subclasses__()}" - ) - - # Setup logging - logging.basicConfig( - format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", - datefmt="%m/%d/%Y %H:%M:%S", - level=logging.INFO, - ) - logger.info( - "n_replicas: %s, distributed training: %s, 16-bits training: %s", - training_args.n_replicas, - bool(training_args.n_replicas > 1), - training_args.fp16, - ) - logger.info("Training/evaluation parameters %s", training_args) - - # Prepare Token Classification task - labels = token_classification_task.get_labels(data_args.labels) - label_map: Dict[int, str] = {i: label for i, label in enumerate(labels)} - num_labels = len(labels) - - # Load pretrained model and tokenizer - # - # Distributed training: - # The .from_pretrained methods guarantee that only one local process can concurrently - # download model & vocab. 
- - config = AutoConfig.from_pretrained( - model_args.config_name if model_args.config_name else model_args.model_name_or_path, - num_labels=num_labels, - id2label=label_map, - label2id={label: i for i, label in enumerate(labels)}, - cache_dir=model_args.cache_dir, - ) - tokenizer = AutoTokenizer.from_pretrained( - model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, - cache_dir=model_args.cache_dir, - use_fast=model_args.use_fast, - ) - - with training_args.strategy.scope(): - model = TFAutoModelForTokenClassification.from_pretrained( - model_args.model_name_or_path, - from_pt=bool(".bin" in model_args.model_name_or_path), - config=config, - cache_dir=model_args.cache_dir, - ) - - # Get datasets - train_dataset = ( - TFTokenClassificationDataset( - token_classification_task=token_classification_task, - data_dir=data_args.data_dir, - tokenizer=tokenizer, - labels=labels, - model_type=config.model_type, - max_seq_length=data_args.max_seq_length, - overwrite_cache=data_args.overwrite_cache, - mode=Split.train, - ) - if training_args.do_train - else None - ) - eval_dataset = ( - TFTokenClassificationDataset( - token_classification_task=token_classification_task, - data_dir=data_args.data_dir, - tokenizer=tokenizer, - labels=labels, - model_type=config.model_type, - max_seq_length=data_args.max_seq_length, - overwrite_cache=data_args.overwrite_cache, - mode=Split.dev, - ) - if training_args.do_eval - else None - ) - - def align_predictions(predictions: np.ndarray, label_ids: np.ndarray) -> Tuple[List[int], List[int]]: - preds = np.argmax(predictions, axis=2) - batch_size, seq_len = preds.shape - out_label_list = [[] for _ in range(batch_size)] - preds_list = [[] for _ in range(batch_size)] - - for i in range(batch_size): - for j in range(seq_len): - if label_ids[i, j] != -100: - out_label_list[i].append(label_map[label_ids[i][j]]) - preds_list[i].append(label_map[preds[i][j]]) - - return preds_list, out_label_list - - def compute_metrics(p: EvalPrediction) -> Dict: - preds_list, out_label_list = align_predictions(p.predictions, p.label_ids) - - return { - "precision": precision_score(out_label_list, preds_list), - "recall": recall_score(out_label_list, preds_list), - "f1": f1_score(out_label_list, preds_list), - } - - # Initialize our Trainer - trainer = TFTrainer( - model=model, - args=training_args, - train_dataset=train_dataset.get_dataset() if train_dataset else None, - eval_dataset=eval_dataset.get_dataset() if eval_dataset else None, - compute_metrics=compute_metrics, - ) - - # Training - if training_args.do_train: - trainer.train() - trainer.save_model() - tokenizer.save_pretrained(training_args.output_dir) - - # Evaluation - results = {} - if training_args.do_eval: - logger.info("*** Evaluate ***") - - result = trainer.evaluate() - output_eval_file = os.path.join(training_args.output_dir, "eval_results.txt") - - with open(output_eval_file, "w") as writer: - logger.info("***** Eval results *****") - - for key, value in result.items(): - logger.info(" %s = %s", key, value) - writer.write("%s = %s\n" % (key, value)) - - results.update(result) - - # Predict - if training_args.do_predict: - test_dataset = TFTokenClassificationDataset( - token_classification_task=token_classification_task, - data_dir=data_args.data_dir, - tokenizer=tokenizer, - labels=labels, - model_type=config.model_type, - max_seq_length=data_args.max_seq_length, - overwrite_cache=data_args.overwrite_cache, - mode=Split.test, - ) - - predictions, label_ids, metrics = 
trainer.predict(test_dataset.get_dataset()) - preds_list, labels_list = align_predictions(predictions, label_ids) - report = classification_report(labels_list, preds_list) - - logger.info("\n%s", report) - - output_test_results_file = os.path.join(training_args.output_dir, "test_results.txt") - - with open(output_test_results_file, "w") as writer: - writer.write("%s\n" % report) - - # Save predictions - output_test_predictions_file = os.path.join(training_args.output_dir, "test_predictions.txt") - - with open(output_test_predictions_file, "w") as writer: - with open(os.path.join(data_args.data_dir, "test.txt"), "r") as f: - example_id = 0 - - for line in f: - if line.startswith("-DOCSTART-") or line == "" or line == "\n": - writer.write(line) - - if not preds_list[example_id]: - example_id += 1 - elif preds_list[example_id]: - output_line = line.split()[0] + " " + preds_list[example_id].pop(0) + "\n" - - writer.write(output_line) - else: - logger.warning("Maximum sequence length exceeded: No prediction for '%s'.", line.split()[0]) - - return results - - -if __name__ == "__main__": - main() diff --git a/examples/legacy/token-classification/utils_ner.py b/examples/legacy/token-classification/utils_ner.py index 2b54c7c4a49159..e7e3a157e30516 100644 --- a/examples/legacy/token-classification/utils_ner.py +++ b/examples/legacy/token-classification/utils_ner.py @@ -113,7 +113,7 @@ def convert_examples_to_features( for word, label in zip(example.words, example.labels): word_tokens = tokenizer.tokenize(word) - # bert-base-multilingual-cased sometimes output "nothing ([]) when calling tokenize with just a space. + # google-bert/bert-base-multilingual-cased sometimes output "nothing ([]) when calling tokenize with just a space. if len(word_tokens) > 0: tokens.extend(word_tokens) # Use the real label id for the first token of the word, and padding ids for the remaining tokens diff --git a/examples/pytorch/README.md b/examples/pytorch/README.md index aa669932475caa..63a56a06e8d5a4 100644 --- a/examples/pytorch/README.md +++ b/examples/pytorch/README.md @@ -53,7 +53,7 @@ Coming soon! Most examples are equipped with a mechanism to truncate the number of dataset samples to the desired length. This is useful for debugging purposes, for example to quickly check that all stages of the programs can complete, before running the same setup on the full dataset which may take hours to complete. For example here is how to truncate all three splits to just 50 samples each: -``` +```bash examples/pytorch/token-classification/run_ner.py \ --max_train_samples 50 \ --max_eval_samples 50 \ @@ -62,7 +62,7 @@ examples/pytorch/token-classification/run_ner.py \ ``` Most example scripts should have the first two command line arguments and some have the third one. 
You can quickly check if a given example supports any of these by passing a `-h` option, e.g.: -``` +```bash examples/pytorch/token-classification/run_ner.py -h ``` @@ -98,7 +98,7 @@ the [Trainer API](https://huggingface.co/transformers/main_classes/trainer.html) use the following command: ```bash -python -m torch.distributed.launch \ +torchrun \ --nproc_per_node number_of_gpu_you_have path_to_script.py \ --all_arguments_of_the_script ``` @@ -107,9 +107,9 @@ As an example, here is how you would fine-tune the BERT large model (with whole classification MNLI task using the `run_glue` script, with 8 GPUs: ```bash -python -m torch.distributed.launch \ +torchrun \ --nproc_per_node 8 pytorch/text-classification/run_glue.py \ - --model_name_or_path bert-large-uncased-whole-word-masking \ + --model_name_or_path google-bert/bert-large-uncased-whole-word-masking \ --task_name mnli \ --do_train \ --do_eval \ @@ -153,7 +153,7 @@ classification MNLI task using the `run_glue` script, with 8 TPUs (from this fol ```bash python xla_spawn.py --num_cores 8 \ text-classification/run_glue.py \ - --model_name_or_path bert-large-uncased-whole-word-masking \ + --model_name_or_path google-bert/bert-large-uncased-whole-word-masking \ --task_name mnli \ --do_train \ --do_eval \ @@ -201,6 +201,7 @@ You can easily log and monitor your runs code. The following are currently suppo * [Comet ML](https://www.comet.ml/docs/python-sdk/huggingface/) * [Neptune](https://docs.neptune.ai/integrations-and-supported-tools/model-training/hugging-face) * [ClearML](https://clear.ml/docs/latest/docs/getting_started/ds/ds_first_steps) +* [DVCLive](https://dvc.org/doc/dvclive/ml-frameworks/huggingface) ### Weights & Biases @@ -223,9 +224,9 @@ import wandb wandb.login() ``` -To enable logging to W&B, include `"wandb"` in the `report_to` of your `TrainingArguments` or script. Or just pass along `--report_to all` if you have `wandb` installed. +To enable logging to W&B, include `"wandb"` in the `report_to` of your `TrainingArguments` or script. Or just pass along `--report_to all` if you have `wandb` installed. -Whenever you use `Trainer` or `TFTrainer` classes, your losses, evaluation metrics, model topology and gradients (for `Trainer` only) will automatically be logged. +Whenever you use the `Trainer` class, your losses, evaluation metrics, model topology and gradients will automatically be logged. Advanced configuration is possible by setting environment variables: @@ -262,13 +263,13 @@ First, install the Neptune client library. You can do it with either `pip` or `c `pip`: ```bash -pip install neptune-client +pip install neptune ``` `conda`: ```bash -conda install -c conda-forge neptune-client +conda install -c conda-forge neptune ``` Next, in your model training script, import `NeptuneCallback`: @@ -281,10 +282,10 @@ To enable Neptune logging, in your `TrainingArguments`, set the `report_to` argu ```python training_args = TrainingArguments( - "quick-training-distilbert-mrpc", + "quick-training-distilbert-mrpc", evaluation_strategy="steps", - eval_steps = 20, - report_to = "neptune", + eval_steps=20, + report_to="neptune", ) trainer = Trainer( @@ -294,6 +295,8 @@ trainer = Trainer( ) ``` +**Note:** This method requires saving your Neptune credentials as environment variables (see the bottom of the section).
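A minimal sketch of that credential setup, using the `NEPTUNE_API_TOKEN` and `NEPTUNE_PROJECT` variables described at the end of this section (the values below are placeholders, not real credentials):

```python
import os

# Placeholders only -- substitute your own API token and workspace/project name.
# The same variables can instead be exported in the shell before launching training.
os.environ["NEPTUNE_API_TOKEN"] = "<your-neptune-api-token>"
os.environ["NEPTUNE_PROJECT"] = "workspace-name/project-name"
```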
+ Alternatively, for more logging options, create a Neptune callback: ```python @@ -318,7 +321,7 @@ neptune_callback = NeptuneCallback( Pass the callback to the Trainer: ```python -training_args = TrainingArguments(..., report_to = None) +training_args = TrainingArguments(..., report_to=None) trainer = Trainer( model, training_args, @@ -336,7 +339,7 @@ Now, when you start the training with `trainer.train()`, your metadata will be l | `NEPTUNE_API_TOKEN` | Your Neptune API token. To find and copy it, click your Neptune avatar and select **Get your API token**. | | `NEPTUNE_PROJECT` | The full name of your Neptune project (`workspace-name/project-name`). To find and copy it, head to **project settings** → **Properties**. | -For detailed instructions and examples, see the [Neptune docs](https://docs.neptune.ai/integrations-and-supported-tools/model-training/hugging-face). +For detailed instructions and examples, see the [Neptune docs](https://docs.neptune.ai/integrations/transformers/). ### ClearML @@ -373,4 +376,4 @@ Advanced configuration is possible by setting environment variables: | CLEARML_PROJECT | Name of the project in ClearML. (default: `"HuggingFace Transformers"`) | | CLEARML_TASK | Name of the task in ClearML. (default: `"Trainer"`) | -Additional configuration options are available through generic [clearml environment variables](https://clear.ml/docs/latest/docs/configs/env_vars). \ No newline at end of file +Additional configuration options are available through generic [clearml environment variables](https://clear.ml/docs/latest/docs/configs/env_vars). diff --git a/examples/pytorch/_tests_requirements.txt b/examples/pytorch/_tests_requirements.txt index 979890f4b79c38..d58e2def9830d6 100644 --- a/examples/pytorch/_tests_requirements.txt +++ b/examples/pytorch/_tests_requirements.txt @@ -15,11 +15,13 @@ nltk pandas datasets >= 1.13.3 fire -pytest +pytest<8.0.1 conllu sentencepiece != 0.1.92 protobuf +torch torchvision +torchaudio jiwer librosa evaluate >= 0.2.0 diff --git a/examples/pytorch/audio-classification/run_audio_classification.py b/examples/pytorch/audio-classification/run_audio_classification.py index 20ddec4acb9ee1..70fcd1433fd212 100644 --- a/examples/pytorch/audio-classification/run_audio_classification.py +++ b/examples/pytorch/audio-classification/run_audio_classification.py @@ -45,7 +45,7 @@ logger = logging.getLogger(__name__) # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=1.14.0", "To fix: pip install -r examples/pytorch/audio-classification/requirements.txt") @@ -152,12 +152,28 @@ class ModelArguments: attention_mask: bool = field( default=True, metadata={"help": "Whether to generate an attention mask in the feature extractor."} ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." 
+ "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -173,14 +189,14 @@ def __post_init__(self): if not self.freeze_feature_extractor and self.freeze_feature_encoder: warnings.warn( "The argument `--freeze_feature_extractor` is deprecated and " - "will be removed in a future version. Use `--freeze_feature_encoder`" + "will be removed in a future version. Use `--freeze_feature_encoder` " "instead. Setting `freeze_feature_encoder==True`.", FutureWarning, ) if self.freeze_feature_extractor and not self.freeze_feature_encoder: raise ValueError( "The argument `--freeze_feature_extractor` is deprecated and " - "should not be used in combination with `--freeze_feature_encoder`." + "should not be used in combination with `--freeze_feature_encoder`. " "Only make use of `--freeze_feature_encoder`." ) @@ -198,6 +214,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_audio_classification", model_args, data_args) @@ -209,6 +234,10 @@ def main(): handlers=[logging.StreamHandler(sys.stdout)], ) + if training_args.should_log: + # The default of training_args.log_level is passive, so we set log level at info here to have that default. 
+ transformers.utils.logging.set_verbosity_info() + log_level = training_args.get_process_log_level() logger.setLevel(log_level) transformers.utils.logging.set_verbosity(log_level) @@ -217,8 +246,8 @@ def main(): # Log on each process the small summary: logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu} " - + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " + + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}" ) logger.info(f"Training/evaluation parameters {training_args}") @@ -246,13 +275,13 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, split=data_args.train_split_name, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) raw_datasets["eval"] = load_dataset( data_args.dataset_name, data_args.dataset_config_name, split=data_args.eval_split_name, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) if data_args.audio_column_name not in raw_datasets["train"].column_names: @@ -276,7 +305,8 @@ def main(): return_attention_mask=model_args.attention_mask, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) # `datasets` takes care of automatically loading and resampling the audio, @@ -285,38 +315,41 @@ def main(): data_args.audio_column_name, datasets.features.Audio(sampling_rate=feature_extractor.sampling_rate) ) + model_input_name = feature_extractor.model_input_names[0] + def train_transforms(batch): """Apply train_transforms across a batch.""" - output_batch = {"input_values": []} + subsampled_wavs = [] for audio in batch[data_args.audio_column_name]: wav = random_subsample( audio["array"], max_length=data_args.max_length_seconds, sample_rate=feature_extractor.sampling_rate ) - output_batch["input_values"].append(wav) - output_batch["labels"] = [label for label in batch[data_args.label_column_name]] + subsampled_wavs.append(wav) + inputs = feature_extractor(subsampled_wavs, sampling_rate=feature_extractor.sampling_rate) + output_batch = {model_input_name: inputs.get(model_input_name)} + output_batch["labels"] = list(batch[data_args.label_column_name]) return output_batch def val_transforms(batch): """Apply val_transforms across a batch.""" - output_batch = {"input_values": []} - for audio in batch[data_args.audio_column_name]: - wav = audio["array"] - output_batch["input_values"].append(wav) - output_batch["labels"] = [label for label in batch[data_args.label_column_name]] + wavs = [audio["array"] for audio in batch[data_args.audio_column_name]] + inputs = feature_extractor(wavs, sampling_rate=feature_extractor.sampling_rate) + output_batch = {model_input_name: inputs.get(model_input_name)} + output_batch["labels"] = list(batch[data_args.label_column_name]) return output_batch # Prepare label mappings. # We'll include these in the model's config to get human readable labels in the Inference API. 
labels = raw_datasets["train"].features[data_args.label_column_name].names - label2id, id2label = dict(), dict() + label2id, id2label = {}, {} for i, label in enumerate(labels): label2id[label] = str(i) id2label[str(i)] = label # Load the accuracy metric from the datasets package - metric = evaluate.load("accuracy") + metric = evaluate.load("accuracy", cache_dir=model_args.cache_dir) # Define our compute_metrics function. It takes an `EvalPrediction` object (a namedtuple with # `predictions` and `label_ids` fields) and has to return a dictionary string to float. @@ -333,7 +366,8 @@ def compute_metrics(eval_pred): finetuning_task="audio-classification", cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) model = AutoModelForAudioClassification.from_pretrained( model_args.model_name_or_path, @@ -341,7 +375,8 @@ def compute_metrics(eval_pred): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ignore_mismatched_sizes=model_args.ignore_mismatched_sizes, ) diff --git a/examples/pytorch/conftest.py b/examples/pytorch/conftest.py index e85e5afb0200bd..70b4d4c12bc488 100644 --- a/examples/pytorch/conftest.py +++ b/examples/pytorch/conftest.py @@ -21,7 +21,7 @@ # allow having multiple repository checkouts and not needing to remember to rerun -# 'pip install -e .[dev]' when switching between checkouts and running tests. +# `pip install -e '.[dev]'` when switching between checkouts and running tests. git_repo_path = abspath(join(dirname(dirname(dirname(__file__))), "src")) sys.path.insert(1, git_repo_path) diff --git a/examples/pytorch/contrastive-image-text/README.md b/examples/pytorch/contrastive-image-text/README.md index f22f2c82dce2dd..c39f17a138a632 100644 --- a/examples/pytorch/contrastive-image-text/README.md +++ b/examples/pytorch/contrastive-image-text/README.md @@ -64,10 +64,10 @@ from transformers import ( ) model = VisionTextDualEncoderModel.from_vision_text_pretrained( - "openai/clip-vit-base-patch32", "roberta-base" + "openai/clip-vit-base-patch32", "FacebookAI/roberta-base" ) -tokenizer = AutoTokenizer.from_pretrained("roberta-base") +tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base") image_processor = AutoImageProcessor.from_pretrained("openai/clip-vit-base-patch32") processor = VisionTextDualEncoderProcessor(image_processor, tokenizer) diff --git a/examples/pytorch/contrastive-image-text/run_clip.py b/examples/pytorch/contrastive-image-text/run_clip.py index 4669a9b93d87bb..f1830fb4c9e28e 100644 --- a/examples/pytorch/contrastive-image-text/run_clip.py +++ b/examples/pytorch/contrastive-image-text/run_clip.py @@ -26,6 +26,7 @@ import logging import os import sys +import warnings from dataclasses import dataclass, field from typing import Optional @@ -54,7 +55,7 @@ logger = logging.getLogger(__name__) # Will error if the minimal version of Transformers is not installed. Remove at your own risks. 
-check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/contrastive-image-text/requirements.txt") @@ -86,12 +87,28 @@ class ModelArguments: default=True, metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -235,6 +252,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_clip", model_args, data_args) @@ -246,6 +272,10 @@ def main(): handlers=[logging.StreamHandler(sys.stdout)], ) + if training_args.should_log: + # The default of training_args.log_level is passive, so we set log level at info here to have that default. + transformers.utils.logging.set_verbosity_info() + log_level = training_args.get_process_log_level() logger.setLevel(log_level) transformers.utils.logging.set_verbosity(log_level) @@ -254,12 +284,12 @@ def main(): # Log on each process the small summary: logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" - + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " + + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}" ) logger.info(f"Training/evaluation parameters {training_args}") - # 3. Detecting last checkpoint and eventualy continue from last checkpoint + # 3. 
Detecting last checkpoint and eventually continue from last checkpoint last_checkpoint = None if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: last_checkpoint = get_last_checkpoint(training_args.output_dir) @@ -290,7 +320,7 @@ def main(): cache_dir=model_args.cache_dir, keep_in_memory=False, data_dir=data_args.data_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: data_files = {} @@ -307,23 +337,31 @@ def main(): extension, data_files=data_files, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # 5. Load pretrained model, tokenizer, and image processor if model_args.tokenizer_name: tokenizer = AutoTokenizer.from_pretrained( - model_args.tokenizer_name, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer + model_args.tokenizer_name, + cache_dir=model_args.cache_dir, + use_fast=model_args.use_fast_tokenizer, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) elif model_args.model_name_or_path: tokenizer = AutoTokenizer.from_pretrained( - model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer + model_args.model_name_or_path, + cache_dir=model_args.cache_dir, + use_fast=model_args.use_fast_tokenizer, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) else: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name." ) @@ -332,14 +370,16 @@ def main(): model_args.image_processor_name or model_args.model_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) model = AutoModel.from_pretrained( model_args.model_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) config = model.config @@ -397,7 +437,7 @@ def _freeze_params(module): # Preprocessing the datasets. # We need to tokenize input captions and transform the images. def tokenize_captions(examples): - captions = [caption for caption in examples[caption_column]] + captions = list(examples[caption_column]) text_inputs = tokenizer(captions, max_length=data_args.max_seq_length, padding="max_length", truncation=True) examples["input_ids"] = text_inputs.input_ids examples["attention_mask"] = text_inputs.attention_mask @@ -488,7 +528,7 @@ def filter_corrupt_images(examples): # Transform images on the fly as doing it on the whole dataset takes too much time. test_dataset.set_transform(transform_images) - # 8. Initalize our trainer + # 8. 
Initialize our trainer trainer = Trainer( model=model, args=training_args, @@ -506,6 +546,8 @@ def filter_corrupt_images(examples): checkpoint = last_checkpoint train_result = trainer.train(resume_from_checkpoint=checkpoint) trainer.save_model() + tokenizer.save_pretrained(training_args.output_dir) + image_processor.save_pretrained(training_args.output_dir) trainer.log_metrics("train", train_result.metrics) trainer.save_metrics("train", train_result.metrics) trainer.save_state() @@ -517,7 +559,11 @@ def filter_corrupt_images(examples): trainer.save_metrics("eval", metrics) # 11. Write Training Stats and push to hub. - kwargs = {"finetuned_from": model_args.model_name_or_path, "tasks": "contrastive-image-text-modeling"} + finetuned_from = model_args.model_name_or_path + # If from a local directory, don't set `finetuned_from` as this is required to be a valid repo. id on the Hub. + if os.path.isdir(finetuned_from): + finetuned_from = None + kwargs = {"finetuned_from": finetuned_from, "tasks": "contrastive-image-text-modeling"} if data_args.dataset_name is not None: kwargs["dataset_tags"] = data_args.dataset_name if data_args.dataset_config_name is not None: diff --git a/examples/pytorch/image-classification/README.md b/examples/pytorch/image-classification/README.md index 04b4748774ddf7..112cc51764a38e 100644 --- a/examples/pytorch/image-classification/README.md +++ b/examples/pytorch/image-classification/README.md @@ -41,6 +41,7 @@ python run_image_classification.py \ --dataset_name beans \ --output_dir ./beans_outputs/ \ --remove_unused_columns False \ + --label_column_name labels \ --do_train \ --do_eval \ --push_to_hub \ @@ -113,10 +114,10 @@ from datasets import load_dataset # example 1: local folder dataset = load_dataset("imagefolder", data_dir="path_to_your_folder") -# example 2: local files (suppoted formats are tar, gzip, zip, xz, rar, zstd) +# example 2: local files (supported formats are tar, gzip, zip, xz, rar, zstd) dataset = load_dataset("imagefolder", data_files="path_to_zip_file") -# example 3: remote files (suppoted formats are tar, gzip, zip, xz, rar, zstd) +# example 3: remote files (supported formats are tar, gzip, zip, xz, rar, zstd) dataset = load_dataset("imagefolder", data_files="https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip") # example 4: providing several splits @@ -197,7 +198,7 @@ accelerate test that will check everything is ready for training. 
Finally, you can launch training with ```bash -accelerate launch run_image_classification_trainer.py +accelerate launch run_image_classification_no_trainer.py --image_column_name img ``` This command is the same and will work for: diff --git a/examples/pytorch/image-classification/requirements.txt b/examples/pytorch/image-classification/requirements.txt index 5a5ba7012679be..4926040789832b 100644 --- a/examples/pytorch/image-classification/requirements.txt +++ b/examples/pytorch/image-classification/requirements.txt @@ -1,5 +1,5 @@ accelerate>=0.12.0 torch>=1.5.0 torchvision>=0.6.0 -datasets>=1.17.0 +datasets>=2.14.0 evaluate \ No newline at end of file diff --git a/examples/pytorch/image-classification/run_image_classification.py b/examples/pytorch/image-classification/run_image_classification.py old mode 100644 new mode 100755 index 78979e41553e0d..94ed62e0df09f1 --- a/examples/pytorch/image-classification/run_image_classification.py +++ b/examples/pytorch/image-classification/run_image_classification.py @@ -16,6 +16,7 @@ import logging import os import sys +import warnings from dataclasses import dataclass, field from typing import Optional @@ -27,6 +28,7 @@ from torchvision.transforms import ( CenterCrop, Compose, + Lambda, Normalize, RandomHorizontalFlip, RandomResizedCrop, @@ -55,9 +57,9 @@ logger = logging.getLogger(__name__) # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") -require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/image-classification/requirements.txt") +require_version("datasets>=2.14.0", "To fix: pip install -r examples/pytorch/image-classification/requirements.txt") MODEL_CONFIG_CLASSES = list(MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING.keys()) MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES) @@ -109,6 +111,14 @@ class DataTrainingArguments: ) }, ) + image_column_name: str = field( + default="image", + metadata={"help": "The name of the dataset column containing the image data. Defaults to 'image'."}, + ) + label_column_name: str = field( + default="label", + metadata={"help": "The name of the dataset column containing the labels. Defaults to 'label'."}, + ) def __post_init__(self): if self.dataset_name is None and (self.train_dir is None and self.validation_dir is None): @@ -142,12 +152,28 @@ class ModelArguments: metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) image_processor_name: str = field(default=None, metadata={"help": "Name or path of preprocessor config."}) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. 
This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -157,12 +183,6 @@ class ModelArguments: ) -def collate_fn(examples): - pixel_values = torch.stack([example["pixel_values"] for example in examples]) - labels = torch.tensor([example["labels"] for example in examples]) - return {"pixel_values": pixel_values, "labels": labels} - - def main(): # See all possible arguments in src/transformers/training_args.py # or by passing the --help flag to this script. @@ -176,6 +196,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_image_classification", model_args, data_args) @@ -187,6 +216,10 @@ def main(): handlers=[logging.StreamHandler(sys.stdout)], ) + if training_args.should_log: + # The default of training_args.log_level is passive, so we set log level at info here to have that default. + transformers.utils.logging.set_verbosity_info() + log_level = training_args.get_process_log_level() logger.setLevel(log_level) transformers.utils.logging.set_verbosity(log_level) @@ -195,8 +228,8 @@ def main(): # Log on each process the small summary: logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" - + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " + + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}" ) logger.info(f"Training/evaluation parameters {training_args}") @@ -224,8 +257,7 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir, - task="image-classification", - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: data_files = {} @@ -237,9 +269,27 @@ def main(): "imagefolder", data_files=data_files, cache_dir=model_args.cache_dir, - task="image-classification", ) + dataset_column_names = dataset["train"].column_names if "train" in dataset else dataset["validation"].column_names + if data_args.image_column_name not in dataset_column_names: + raise ValueError( + f"--image_column_name {data_args.image_column_name} not found in dataset '{data_args.dataset_name}'. " + "Make sure to set `--image_column_name` to the correct audio column - one of " + f"{', '.join(dataset_column_names)}." + ) + if data_args.label_column_name not in dataset_column_names: + raise ValueError( + f"--label_column_name {data_args.label_column_name} not found in dataset '{data_args.dataset_name}'. " + "Make sure to set `--label_column_name` to the correct text column - one of " + f"{', '.join(dataset_column_names)}." 
+ ) + + def collate_fn(examples): + pixel_values = torch.stack([example["pixel_values"] for example in examples]) + labels = torch.tensor([example[data_args.label_column_name] for example in examples]) + return {"pixel_values": pixel_values, "labels": labels} + # If we don't have a validation split, split off a percentage of train as validation. data_args.train_val_split = None if "validation" in dataset.keys() else data_args.train_val_split if isinstance(data_args.train_val_split, float) and data_args.train_val_split > 0.0: @@ -249,14 +299,14 @@ def main(): # Prepare label mappings. # We'll include these in the model's config to get human readable labels in the Inference API. - labels = dataset["train"].features["labels"].names - label2id, id2label = dict(), dict() + labels = dataset["train"].features[data_args.label_column_name].names + label2id, id2label = {}, {} for i, label in enumerate(labels): label2id[label] = str(i) id2label[str(i)] = label # Load the accuracy metric from the datasets package - metric = evaluate.load("accuracy") + metric = evaluate.load("accuracy", cache_dir=model_args.cache_dir) # Define our compute_metrics function. It takes an `EvalPrediction` object (a namedtuple with a # predictions and label_ids field) and has to return a dictionary string to float. @@ -272,7 +322,8 @@ def compute_metrics(p): finetuning_task="image-classification", cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) model = AutoModelForImageClassification.from_pretrained( model_args.model_name_or_path, @@ -280,14 +331,16 @@ def compute_metrics(p): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ignore_mismatched_sizes=model_args.ignore_mismatched_sizes, ) image_processor = AutoImageProcessor.from_pretrained( model_args.image_processor_name or model_args.model_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) # Define torchvision transforms to be applied to each image. 
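As a quick sanity check of the collation logic added above, a self-contained sketch with a hand-built batch (the `labels` column name and tensor shapes are illustrative stand-ins for `data_args.label_column_name` and real pixel values):

```python
import torch

label_column_name = "labels"  # stand-in for data_args.label_column_name

def collate_fn(examples):
    # Stack image tensors and gather labels from the configurable label column.
    pixel_values = torch.stack([example["pixel_values"] for example in examples])
    labels = torch.tensor([example[label_column_name] for example in examples])
    return {"pixel_values": pixel_values, "labels": labels}

batch = collate_fn(
    [
        {"pixel_values": torch.zeros(3, 224, 224), "labels": 0},
        {"pixel_values": torch.ones(3, 224, 224), "labels": 2},
    ]
)
print(batch["pixel_values"].shape)  # torch.Size([2, 3, 224, 224])
print(batch["labels"])              # tensor([0, 2])
```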
@@ -295,7 +348,11 @@ def compute_metrics(p): size = image_processor.size["shortest_edge"] else: size = (image_processor.size["height"], image_processor.size["width"]) - normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std) + normalize = ( + Normalize(mean=image_processor.image_mean, std=image_processor.image_std) + if hasattr(image_processor, "image_mean") and hasattr(image_processor, "image_std") + else Lambda(lambda x: x) + ) _train_transforms = Compose( [ RandomResizedCrop(size), @@ -316,13 +373,15 @@ def compute_metrics(p): def train_transforms(example_batch): """Apply _train_transforms across a batch.""" example_batch["pixel_values"] = [ - _train_transforms(pil_img.convert("RGB")) for pil_img in example_batch["image"] + _train_transforms(pil_img.convert("RGB")) for pil_img in example_batch[data_args.image_column_name] ] return example_batch def val_transforms(example_batch): """Apply _val_transforms across a batch.""" - example_batch["pixel_values"] = [_val_transforms(pil_img.convert("RGB")) for pil_img in example_batch["image"]] + example_batch["pixel_values"] = [ + _val_transforms(pil_img.convert("RGB")) for pil_img in example_batch[data_args.image_column_name] + ] return example_batch if training_args.do_train: @@ -345,7 +404,7 @@ def val_transforms(example_batch): # Set the validation transforms dataset["validation"].set_transform(val_transforms) - # Initalize our trainer + # Initialize our trainer trainer = Trainer( model=model, args=training_args, diff --git a/examples/pytorch/image-classification/run_image_classification_no_trainer.py b/examples/pytorch/image-classification/run_image_classification_no_trainer.py index 3ba79d630e76a3..8a49afce414c38 100644 --- a/examples/pytorch/image-classification/run_image_classification_no_trainer.py +++ b/examples/pytorch/image-classification/run_image_classification_no_trainer.py @@ -32,6 +32,7 @@ from torchvision.transforms import ( CenterCrop, Compose, + Lambda, Normalize, RandomHorizontalFlip, RandomResizedCrop, @@ -42,12 +43,12 @@ import transformers from transformers import AutoConfig, AutoImageProcessor, AutoModelForImageClassification, SchedulerType, get_scheduler -from transformers.utils import check_min_version, get_full_repo_name, send_example_telemetry +from transformers.utils import check_min_version, send_example_telemetry from transformers.utils.versions import require_version # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") logger = get_logger(__name__) @@ -146,6 +147,16 @@ def parse_args(): "--hub_model_id", type=str, help="The name of the repository to keep in sync with the local `output_dir`." ) parser.add_argument("--hub_token", type=str, help="The token to use to push to the Model Hub.") + parser.add_argument( + "--trust_remote_code", + type=bool, + default=False, + help=( + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." + ), + ) parser.add_argument( "--checkpointing_steps", type=str, @@ -169,7 +180,7 @@ def parse_args(): default="all", help=( 'The integration to report the results and logs to. Supported platforms are `"tensorboard"`,' - ' `"wandb"`, `"comet_ml"` and `"clearml"`. Use `"all"` (default) to report to all integrations.' 
+ ' `"wandb"`, `"comet_ml"` and `"clearml"`. Use `"all"` (default) to report to all integrations. ' "Only applicable when `--with_tracking` is passed." ), ) @@ -178,6 +189,18 @@ def parse_args(): action="store_true", help="Whether or not to enable to load a pretrained model whose head dimensions are different.", ) + parser.add_argument( + "--image_column_name", + type=str, + default="image", + help="The name of the dataset column containing the image data. Defaults to 'image'.", + ) + parser.add_argument( + "--label_column_name", + type=str, + default="label", + help="The name of the dataset column containing the labels. Defaults to 'label'.", + ) args = parser.parse_args() # Sanity checks @@ -210,7 +233,7 @@ def main(): if args.with_tracking: accelerator_log_kwargs["log_with"] = args.report_to - accelerator_log_kwargs["logging_dir"] = args.output_dir + accelerator_log_kwargs["project_dir"] = args.output_dir accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps, **accelerator_log_kwargs) @@ -236,12 +259,14 @@ def main(): # Handle the repository creation if accelerator.is_main_process: if args.push_to_hub: - if args.hub_model_id is None: - repo_name = get_full_repo_name(Path(args.output_dir).name, token=args.hub_token) - else: - repo_name = args.hub_model_id - create_repo(repo_name, exist_ok=True, token=args.hub_token) - repo = Repository(args.output_dir, clone_from=repo_name, token=args.hub_token) + # Retrieve of infer repo_name + repo_name = args.hub_model_id + if repo_name is None: + repo_name = Path(args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=args.hub_token).repo_id + # Clone repo locally + repo = Repository(args.output_dir, clone_from=repo_id, token=args.hub_token) with open(os.path.join(args.output_dir, ".gitignore"), "w+") as gitignore: if "step_*" not in gitignore: @@ -259,7 +284,7 @@ def main(): # download the dataset. if args.dataset_name is not None: # Downloading and loading a dataset from the hub. - dataset = load_dataset(args.dataset_name, task="image-classification") + dataset = load_dataset(args.dataset_name) else: data_files = {} if args.train_dir is not None: @@ -269,12 +294,24 @@ def main(): dataset = load_dataset( "imagefolder", data_files=data_files, - cache_dir=args.cache_dir, - task="image-classification", ) # See more about loading custom images at # https://huggingface.co/docs/datasets/v2.0.0/en/image_process#imagefolder. + dataset_column_names = dataset["train"].column_names if "train" in dataset else dataset["validation"].column_names + if args.image_column_name not in dataset_column_names: + raise ValueError( + f"--image_column_name {args.image_column_name} not found in dataset '{args.dataset_name}'. " + "Make sure to set `--image_column_name` to the correct audio column - one of " + f"{', '.join(dataset_column_names)}." + ) + if args.label_column_name not in dataset_column_names: + raise ValueError( + f"--label_column_name {args.label_column_name} not found in dataset '{args.dataset_name}'. " + "Make sure to set `--label_column_name` to the correct text column - one of " + f"{', '.join(dataset_column_names)}." + ) + # If we don't have a validation split, split off a percentage of train as validation. args.train_val_split = None if "validation" in dataset.keys() else args.train_val_split if isinstance(args.train_val_split, float) and args.train_val_split > 0.0: @@ -284,7 +321,7 @@ def main(): # Prepare label mappings. 
# We'll include these in the model's config to get human readable labels in the Inference API. - labels = dataset["train"].features["labels"].names + labels = dataset["train"].features[args.label_column_name].names label2id = {label: str(i) for i, label in enumerate(labels)} id2label = {str(i): label for i, label in enumerate(labels)} @@ -298,13 +335,18 @@ def main(): i2label=id2label, label2id=label2id, finetuning_task="image-classification", + trust_remote_code=args.trust_remote_code, + ) + image_processor = AutoImageProcessor.from_pretrained( + args.model_name_or_path, + trust_remote_code=args.trust_remote_code, ) - image_processor = AutoImageProcessor.from_pretrained(args.model_name_or_path) model = AutoModelForImageClassification.from_pretrained( args.model_name_or_path, from_tf=bool(".ckpt" in args.model_name_or_path), config=config, ignore_mismatched_sizes=args.ignore_mismatched_sizes, + trust_remote_code=args.trust_remote_code, ) # Preprocessing the datasets @@ -314,7 +356,11 @@ def main(): size = image_processor.size["shortest_edge"] else: size = (image_processor.size["height"], image_processor.size["width"]) - normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std) + normalize = ( + Normalize(mean=image_processor.image_mean, std=image_processor.image_std) + if hasattr(image_processor, "image_mean") and hasattr(image_processor, "image_std") + else Lambda(lambda x: x) + ) train_transforms = Compose( [ RandomResizedCrop(size), @@ -334,12 +380,16 @@ def main(): def preprocess_train(example_batch): """Apply _train_transforms across a batch.""" - example_batch["pixel_values"] = [train_transforms(image.convert("RGB")) for image in example_batch["image"]] + example_batch["pixel_values"] = [ + train_transforms(image.convert("RGB")) for image in example_batch[args.image_column_name] + ] return example_batch def preprocess_val(example_batch): """Apply _val_transforms across a batch.""" - example_batch["pixel_values"] = [val_transforms(image.convert("RGB")) for image in example_batch["image"]] + example_batch["pixel_values"] = [ + val_transforms(image.convert("RGB")) for image in example_batch[args.image_column_name] + ] return example_batch with accelerator.main_process_first(): @@ -355,7 +405,7 @@ def preprocess_val(example_batch): # DataLoaders creation: def collate_fn(examples): pixel_values = torch.stack([example["pixel_values"] for example in examples]) - labels = torch.tensor([example["labels"] for example in examples]) + labels = torch.tensor([example[args.label_column_name] for example in examples]) return {"pixel_values": pixel_values, "labels": labels} train_dataloader = DataLoader( @@ -388,8 +438,10 @@ def collate_fn(examples): lr_scheduler = get_scheduler( name=args.lr_scheduler_type, optimizer=optimizer, - num_warmup_steps=args.num_warmup_steps * args.gradient_accumulation_steps, - num_training_steps=args.max_train_steps * args.gradient_accumulation_steps, + num_warmup_steps=args.num_warmup_steps * accelerator.num_processes, + num_training_steps=args.max_train_steps + if overrode_max_train_steps + else args.max_train_steps * accelerator.num_processes, ) # Prepare everything with our `accelerator`. 
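For readers skimming the hunk above: the `no_trainer` scripts now scale the scheduler's step counts by the number of processes. A minimal, self-contained sketch of that bookkeeping follows (not part of the patch; the process count, dataloader length and CLI-style values below are assumed purely for illustration):

```python
import math

import torch
from transformers import get_scheduler

# Assumed stand-ins for accelerator.num_processes and the script's CLI arguments.
num_processes = 2
gradient_accumulation_steps = 4
num_warmup_steps = 100
num_train_epochs = 3
dataloader_length = 1000  # length of the per-process training dataloader

num_update_steps_per_epoch = math.ceil(dataloader_length / gradient_accumulation_steps)
max_train_steps = None  # i.e. --max_train_steps was not passed on the command line
overrode_max_train_steps = max_train_steps is None
if overrode_max_train_steps:
    max_train_steps = num_train_epochs * num_update_steps_per_epoch

model = torch.nn.Linear(4, 4)  # toy model, only needed so an optimizer can be built
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=num_warmup_steps * num_processes,
    num_training_steps=max_train_steps
    if overrode_max_train_steps
    else max_train_steps * num_processes,
)
```

When `max_train_steps` was derived from the epoch count, the scripts recompute it after `accelerator.prepare(...)` shrinks the dataloader, which is why only a user-supplied value gets multiplied by the process count in the scheduler call.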
@@ -437,36 +489,45 @@ def collate_fn(examples): # Potentially load in the weights and states from a previous save if args.resume_from_checkpoint: if args.resume_from_checkpoint is not None or args.resume_from_checkpoint != "": - accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}") - accelerator.load_state(args.resume_from_checkpoint) + checkpoint_path = args.resume_from_checkpoint path = os.path.basename(args.resume_from_checkpoint) else: # Get the most recent checkpoint dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()] dirs.sort(key=os.path.getctime) path = dirs[-1] # Sorts folders by date modified, most recent checkpoint is the last + checkpoint_path = path + path = os.path.basename(checkpoint_path) + + accelerator.print(f"Resumed from checkpoint: {checkpoint_path}") + accelerator.load_state(checkpoint_path) # Extract `epoch_{i}` or `step_{i}` training_difference = os.path.splitext(path)[0] if "epoch" in training_difference: starting_epoch = int(training_difference.replace("epoch_", "")) + 1 resume_step = None + completed_steps = starting_epoch * num_update_steps_per_epoch else: - resume_step = int(training_difference.replace("step_", "")) + # need to multiply `gradient_accumulation_steps` to reflect real steps + resume_step = int(training_difference.replace("step_", "")) * args.gradient_accumulation_steps starting_epoch = resume_step // len(train_dataloader) + completed_steps = resume_step // args.gradient_accumulation_steps resume_step -= starting_epoch * len(train_dataloader) + # update the progress_bar if load from checkpoint + progress_bar.update(completed_steps) + for epoch in range(starting_epoch, args.num_train_epochs): model.train() if args.with_tracking: total_loss = 0 - for step, batch in enumerate(train_dataloader): - # We need to skip steps until we reach the resumed step - if args.resume_from_checkpoint and epoch == starting_epoch: - if resume_step is not None and step < resume_step: - completed_steps += 1 - continue - + if args.resume_from_checkpoint and epoch == starting_epoch and resume_step is not None: + # We skip the first `n` batches in the dataloader when resuming from a checkpoint + active_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step) + else: + active_dataloader = train_dataloader + for step, batch in enumerate(active_dataloader): with accelerator.accumulate(model): outputs = model(**batch) loss = outputs.loss @@ -485,7 +546,7 @@ def collate_fn(examples): if isinstance(checkpointing_steps, int): if completed_steps % checkpointing_steps == 0: - output_dir = f"step_{completed_steps }" + output_dir = f"step_{completed_steps}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) diff --git a/examples/pytorch/image-pretraining/README.md b/examples/pytorch/image-pretraining/README.md index 814f160a34915c..65bb863f38b6ce 100644 --- a/examples/pytorch/image-pretraining/README.md +++ b/examples/pytorch/image-pretraining/README.md @@ -25,7 +25,7 @@ NOTE: If you encounter problems/have suggestions for improvement, open an issue ## SimMIM -The `run_mim.py` script can be used to pre-train any Transformer-based vision model in the library (concretly, any model supported by the `AutoModelForMaskedImageModeling` API) for masked image modeling as proposed in [SimMIM: A Simple Framework for Masked Image Modeling](https://arxiv.org/abs/2111.09886) using PyTorch. 
+The `run_mim.py` script can be used to pre-train any Transformer-based vision model in the library (concretely, any model supported by the `AutoModelForMaskedImageModeling` API) for masked image modeling as proposed in [SimMIM: A Simple Framework for Masked Image Modeling](https://arxiv.org/abs/2111.09886) using PyTorch. drawing diff --git a/examples/pytorch/image-pretraining/run_mae.py b/examples/pytorch/image-pretraining/run_mae.py index f3448a77532a68..95e28a5b6025fd 100644 --- a/examples/pytorch/image-pretraining/run_mae.py +++ b/examples/pytorch/image-pretraining/run_mae.py @@ -16,6 +16,7 @@ import logging import os import sys +import warnings from dataclasses import dataclass, field from typing import Optional @@ -43,7 +44,7 @@ logger = logging.getLogger(__name__) # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/image-pretraining/requirements.txt") @@ -91,7 +92,7 @@ class DataTrainingArguments: ) def __post_init__(self): - data_files = dict() + data_files = {} if self.train_dir is not None: data_files["train"] = self.train_dir if self.validation_dir is not None: @@ -109,7 +110,7 @@ class ModelArguments: default=None, metadata={ "help": ( - "The model checkpoint for weights initialization.Don't set if you want to train a model from scratch." + "The model checkpoint for weights initialization. Don't set if you want to train a model from scratch." ) }, ) @@ -133,15 +134,21 @@ class ModelArguments: metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) image_processor_name: str = field(default=None, metadata={"help": "Name or path of preprocessor config."}) - use_auth_token: bool = field( - default=False, + token: str = field( + default=None, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." ) }, ) + use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) mask_ratio: float = field( default=0.75, metadata={"help": "The ratio of the number of masked tokens in the input sequence."} ) @@ -175,6 +182,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_mae", model_args, data_args) @@ -186,6 +202,10 @@ def main(): handlers=[logging.StreamHandler(sys.stdout)], ) + if training_args.should_log: + # The default of training_args.log_level is passive, so we set log level at info here to have that default. 
+ transformers.utils.logging.set_verbosity_info() + log_level = training_args.get_process_log_level() logger.setLevel(log_level) transformers.utils.logging.set_verbosity(log_level) @@ -194,8 +214,8 @@ def main(): # Log on each process the small summary: logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" - + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " + + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}" ) logger.info(f"Training/evaluation parameters {training_args}") @@ -220,7 +240,7 @@ def main(): data_args.dataset_config_name, data_files=data_args.data_files, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # If we don't have a validation split, split off a percentage of train as validation. @@ -238,7 +258,7 @@ def main(): config_kwargs = { "cache_dir": model_args.cache_dir, "revision": model_args.model_revision, - "use_auth_token": True if model_args.use_auth_token else None, + "token": model_args.token, } if model_args.config_name: config = ViTMAEConfig.from_pretrained(model_args.config_name, **config_kwargs) @@ -276,7 +296,7 @@ def main(): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: logger.info("Training new model from scratch") diff --git a/examples/pytorch/image-pretraining/run_mim.py b/examples/pytorch/image-pretraining/run_mim.py index a906088ed5c91d..35857b99b9e471 100644 --- a/examples/pytorch/image-pretraining/run_mim.py +++ b/examples/pytorch/image-pretraining/run_mim.py @@ -16,6 +16,7 @@ import logging import os import sys +import warnings from dataclasses import dataclass, field from typing import Optional @@ -48,7 +49,7 @@ logger = logging.getLogger(__name__) # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/image-pretraining/requirements.txt") @@ -104,7 +105,7 @@ class DataTrainingArguments: ) def __post_init__(self): - data_files = dict() + data_files = {} if self.train_dir is not None: data_files["train"] = self.train_dir if self.validation_dir is not None: @@ -153,12 +154,28 @@ class ModelArguments: metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) image_processor_name: str = field(default=None, metadata={"help": "Name or path of preprocessor config."}) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." 
+ "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -239,6 +256,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_mim", model_args, data_args) @@ -250,6 +276,10 @@ def main(): handlers=[logging.StreamHandler(sys.stdout)], ) + if training_args.should_log: + # The default of training_args.log_level is passive, so we set log level at info here to have that default. + transformers.utils.logging.set_verbosity_info() + log_level = training_args.get_process_log_level() logger.setLevel(log_level) transformers.utils.logging.set_verbosity(log_level) @@ -258,8 +288,8 @@ def main(): # Log on each process the small summary: logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" - + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " + + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}" ) logger.info(f"Training/evaluation parameters {training_args}") @@ -284,7 +314,7 @@ def main(): data_args.dataset_config_name, data_files=data_args.data_files, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # If we don't have a validation split, split off a percentage of train as validation. 
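The `use_auth_token` → `token` migration above repeats across every script touched by this patch. As a compact, standalone illustration of the shim (the dataclass below is a simplified stand-in, not the scripts' actual `ModelArguments`):

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelArguments:  # simplified stand-in for the example scripts' argument dataclass
    token: Optional[str] = None
    use_auth_token: Optional[bool] = None


def resolve_token(model_args: ModelArguments) -> ModelArguments:
    """Fold the deprecated `use_auth_token` flag into `token`, mirroring the hunks above."""
    if model_args.use_auth_token is not None:
        warnings.warn(
            "The `use_auth_token` argument is deprecated and will be removed in v4.34. "
            "Please use `token` instead.",
            FutureWarning,
        )
        if model_args.token is not None:
            raise ValueError(
                "`token` and `use_auth_token` are both specified. Please set only the argument `token`."
            )
        model_args.token = model_args.use_auth_token
    return model_args
```

Calling `resolve_token(ModelArguments(use_auth_token=True))` emits the `FutureWarning` and returns arguments with `token=True`, which the scripts then forward to `from_pretrained` and `load_dataset`.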
@@ -301,7 +331,8 @@ def main(): config_kwargs = { "cache_dir": model_args.cache_dir, "revision": model_args.model_revision, - "use_auth_token": True if model_args.use_auth_token else None, + "token": model_args.token, + "trust_remote_code": model_args.trust_remote_code, } if model_args.config_name_or_path: config = AutoConfig.from_pretrained(model_args.config_name_or_path, **config_kwargs) @@ -353,11 +384,12 @@ def main(): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) else: logger.info("Training new model from scratch") - model = AutoModelForMaskedImageModeling.from_config(config) + model = AutoModelForMaskedImageModeling.from_config(config, trust_remote_code=model_args.trust_remote_code) if training_args.do_train: column_names = ds["train"].column_names diff --git a/examples/pytorch/image-pretraining/run_mim_no_trainer.py b/examples/pytorch/image-pretraining/run_mim_no_trainer.py new file mode 100644 index 00000000000000..dc9ee9f27b1499 --- /dev/null +++ b/examples/pytorch/image-pretraining/run_mim_no_trainer.py @@ -0,0 +1,810 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2023 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and + +import argparse +import logging +import math +import os +import warnings +from pathlib import Path + +import datasets +import numpy as np +import torch +from accelerate import Accelerator, DistributedType +from accelerate.utils import set_seed +from datasets import load_dataset +from huggingface_hub import Repository, create_repo +from torch.utils.data import DataLoader +from torchvision.transforms import Compose, Lambda, Normalize, RandomHorizontalFlip, RandomResizedCrop, ToTensor +from tqdm.auto import tqdm + +import transformers +from transformers import ( + CONFIG_MAPPING, + IMAGE_PROCESSOR_MAPPING, + MODEL_FOR_MASKED_IMAGE_MODELING_MAPPING, + AutoConfig, + AutoImageProcessor, + AutoModelForMaskedImageModeling, + SchedulerType, + get_scheduler, +) +from transformers.utils import check_min_version, send_example_telemetry +from transformers.utils.versions import require_version + + +""" Pre-training a 🤗 Transformers model for simple masked image modeling (SimMIM) +without using HuggingFace Trainer. +Any model supported by the AutoModelForMaskedImageModeling API can be used. +""" + +logger = logging.getLogger(__name__) + +# Will error if the minimal version of Transformers is not installed. Remove at your own risks. 
+check_min_version("4.38.0.dev0") + +require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/image-pretraining/requirements.txt") + +MODEL_CONFIG_CLASSES = list(MODEL_FOR_MASKED_IMAGE_MODELING_MAPPING.keys()) +MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES) + + +def parse_args(): + parser = argparse.ArgumentParser( + description="Finetune a transformers model on a simple Masked Image Modeling task" + ) + parser.add_argument( + "--dataset_name", + type=str, + default="cifar10", + help="Name of a dataset from the datasets package", + ) + parser.add_argument( + "--dataset_config_name", + type=str, + default=None, + help="The configuration name of the dataset to use (via the datasets library).", + ) + parser.add_argument( + "--image_column_name", + type=str, + default=None, + help="The column name of the images in the files. If not set, will try to use 'image' or 'img'.", + ) + parser.add_argument( + "--train_dir", + type=str, + default=None, + help="A folder containing the training data.", + ) + parser.add_argument( + "--validation_dir", + type=None, + default=None, + help="A folder containing the validation data.", + ) + parser.add_argument( + "--train_val_split", + type=float, + default=0.15, + help="Percent to split off of train for validation.", + ) + parser.add_argument( + "--mask_patch_size", + type=int, + default=32, + help="The size of the square patches to use for masking.", + ) + parser.add_argument( + "--mask_ratio", + type=float, + default=0.6, + help="Percentage of patches to mask.", + ) + parser.add_argument( + "--max_train_samples", + type=int, + default=None, + help=( + "For debugging purposes or quicker training, truncate the number of training examples to this " + "value if set." + ), + ) + parser.add_argument( + "--max_eval_samples", + type=int, + default=None, + help=( + "For debugging purposes or quicker training, truncate the number of evaluation examples to this " + "value if set." + ), + ) + parser.add_argument( + "--model_name_or_path", + type=str, + default=None, + help=( + "The model checkpoint for weights initialization. Can be a local path to a pytorch_model.bin or a " + "checkpoint identifier on the hub. " + "Don't set if you want to train a model from scratch." + ), + ) + parser.add_argument( + "--model_type", + type=str, + default=None, + help="If training from scratch, pass a model type from the list: " + ", ".join(MODEL_TYPES), + ) + parser.add_argument( + "--config_name_or_path", + type=str, + default=None, + help="Pretrained config name or path if not the same as model_name", + ) + parser.add_argument( + "--config_overrides", + type=str, + default=None, + help=( + "Override some existing default config settings when a model is trained from scratch. 
Example: " + "n_embd=10,resid_pdrop=0.2,scale_attn_weights=false,summary_type=cls_index" + ), + ) + parser.add_argument( + "--cache_dir", + type=str, + default=None, + help="Where do you want to store (cache) the pretrained models/datasets downloaded from the hub", + ) + parser.add_argument( + "--model_revision", + type=str, + default="main", + help="The specific model version to use (can be a branch name, tag name or commit id).", + ) + parser.add_argument( + "--gradient_accumulation_steps", + type=int, + default=1, + help="Number of updates steps to accumulate before performing a backward/update pass.", + ) + parser.add_argument( + "--image_processor_name", + type=str, + default=None, + help="Name or path of preprocessor config.", + ) + parser.add_argument( + "--token", + type=str, + default=None, + help=( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ), + ) + parser.add_argument( + "--use_auth_token", + type=bool, + default=None, + help="The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + ) + parser.add_argument( + "--trust_remote_code", + type=bool, + default=False, + help=( + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." + ), + ) + parser.add_argument( + "--image_size", + type=int, + default=None, + help="The size (resolution) of each image. If not specified, will use `image_size` of the configuration.", + ) + parser.add_argument( + "--patch_size", + type=int, + default=None, + help="The size (resolution) of each patch. If not specified, will use `patch_size` of the configuration.", + ) + parser.add_argument( + "--encoder_stride", + type=int, + default=None, + help={"help": "Stride to use for the encoder."}, + ) + parser.add_argument( + "--push_to_hub", + action="store_true", + help="Whether or not to push the model to the Hub.", + ) + parser.add_argument( + "--with_tracking", + action="store_true", + help="Whether to enable experiment trackers for logging.", + ) + parser.add_argument( + "--report_to", + type=str, + default="all", + help=( + 'The integration to report the results and logs to. Supported platforms are `"tensorboard"`,' + ' `"wandb"`, `"comet_ml"` and `"clearml"`. Use `"all"` (default) to report to all integrations. ' + "Only applicable when `--with_tracking` is passed." + ), + ) + parser.add_argument( + "--seed", + type=int, + default=None, + help="A seed for reproducible training.", + ) + parser.add_argument( + "--per_device_train_batch_size", + type=int, + default=8, + help="Batch size (per device) for the training dataloader.", + ) + parser.add_argument( + "--learning_rate", + type=float, + default=5e-5, + help="The initial learning rate for [`AdamW`] optimizer.", + ) + parser.add_argument( + "--weight_decay", + type=float, + default=0.0, + help="Weight decay to use.", + ) + parser.add_argument( + "--num_train_epochs", + type=float, + default=3.0, + help="Total number of training epochs to perform (if not an integer, will perform the decimal part percents of the last epoch before stopping training).", + ) + parser.add_argument( + "--max_train_steps", + type=int, + default=None, + help="Total number of training steps to perform. 
If provided, overrides num_train_epochs.", + ) + parser.add_argument( + "--lr_scheduler_type", + type=SchedulerType, + default="linear", + help="The scheduler type to use.", + choices=["linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup"], + ) + parser.add_argument( + "--num_warmup_steps", + type=int, + default=0, + help="Number of steps for the warmup in the lr scheduler.", + ) + parser.add_argument( + "--checkpointing_steps", + type=str, + default=None, + help="Whether the various states should be saved at the end of every n steps, or 'epoch' for each epoch.", + ) + parser.add_argument( + "--resume_from_checkpoint", + type=str, + default=None, + help="If the training should continue from a checkpoint folder.", + ) + parser.add_argument( + "--per_device_eval_batch_size", + type=int, + default=8, + help="Batch size (per device) for the evaluation dataloader.", + ) + parser.add_argument( + "--output_dir", + type=str, + default=None, + help="Where to store the final model.", + ) + args = parser.parse_args() + + # Sanity checks + data_files = {} + if args.train_dir is not None: + data_files["train"] = args.train_dir + if args.validation_dir is not None: + data_files["val"] = args.validation_dir + args.data_files = data_files if data_files else None + + if args.push_to_hub: + assert args.output_dir is not None, "Need an `output_dir` to create a repo when `--push_to_hub` is passed." + + return args + + +class MaskGenerator: + """ + A class to generate boolean masks for the pretraining task. + + A mask is a 1D tensor of shape (model_patch_size**2,) where the value is either 0 or 1, + where 1 indicates "masked". + """ + + def __init__(self, input_size=192, mask_patch_size=32, model_patch_size=4, mask_ratio=0.6): + self.input_size = input_size + self.mask_patch_size = mask_patch_size + self.model_patch_size = model_patch_size + self.mask_ratio = mask_ratio + + if self.input_size % self.mask_patch_size != 0: + raise ValueError("Input size must be divisible by mask patch size") + if self.mask_patch_size % self.model_patch_size != 0: + raise ValueError("Mask patch size must be divisible by model patch size") + + self.rand_size = self.input_size // self.mask_patch_size + self.scale = self.mask_patch_size // self.model_patch_size + + self.token_count = self.rand_size**2 + self.mask_count = int(np.ceil(self.token_count * self.mask_ratio)) + + def __call__(self): + mask_idx = np.random.permutation(self.token_count)[: self.mask_count] + mask = np.zeros(self.token_count, dtype=int) + mask[mask_idx] = 1 + + mask = mask.reshape((self.rand_size, self.rand_size)) + mask = mask.repeat(self.scale, axis=0).repeat(self.scale, axis=1) + + return torch.tensor(mask.flatten()) + + +def collate_fn(examples): + pixel_values = torch.stack([example["pixel_values"] for example in examples]) + mask = torch.stack([example["mask"] for example in examples]) + return {"pixel_values": pixel_values, "bool_masked_pos": mask} + + +def main(): + args = parse_args() + + if args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + args.token = args.use_auth_token + + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. 
The + # information sent is the one passed as arguments along with your Python/PyTorch versions. + send_example_telemetry("run_mim_no_trainer", args) + + # Initialize the accelerator. We will let the accelerator handle device placement for us in this example. + # If we're using tracking, we also need to initialize it here and it will by default pick up all supported trackers + # in the environment + accelerator_log_kwargs = {} + + if args.with_tracking: + accelerator_log_kwargs["log_with"] = args.report_to + accelerator_log_kwargs["project_dir"] = args.output_dir + + accelerator = Accelerator( + gradient_accumulation_steps=args.gradient_accumulation_steps, + **accelerator_log_kwargs, + ) + + # Make one log on every process with the configuration for debugging. + logging.basicConfig( + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%m/%d/%Y %H:%M:%S", + level=logging.INFO, + ) + logger.info(accelerator.state) + if accelerator.is_local_main_process: + datasets.utils.logging.set_verbosity_warning() + transformers.utils.logging.set_verbosity_info() + else: + datasets.utils.logging.set_verbosity_error() + transformers.utils.logging.set_verbosity_error() + + # If passed along, set the training seed now. + if args.seed is not None: + set_seed(args.seed) + + # Handle the repository creation + if accelerator.is_main_process: + if args.push_to_hub: + # Retrieve of infer repo_name + repo_name = args.hub_model_id + if repo_name is None: + repo_name = Path(args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=args.hub_token).repo_id + # Clone repo locally + repo = Repository(args.output_dir, clone_from=repo_id, token=args.hub_token) + + with open(os.path.join(args.output_dir, ".gitignore"), "w+") as gitignore: + if "step_*" not in gitignore: + gitignore.write("step_*\n") + if "epoch_*" not in gitignore: + gitignore.write("epoch_*\n") + elif args.output_dir is not None: + os.makedirs(args.output_dir, exist_ok=True) + accelerator.wait_for_everyone() + + # Initialize our dataset. + ds = load_dataset( + args.dataset_name, + args.dataset_config_name, + data_files=args.data_files, + cache_dir=args.cache_dir, + token=args.token, + ) + + # If we don't have a validation split, split off a percentage of train as validation. + args.train_val_split = None if "validation" in ds.keys() else args.train_val_split + if isinstance(args.train_val_split, float) and args.train_val_split > 0.0: + split = ds["train"].train_test_split(args.train_val_split) + ds["train"] = split["train"] + ds["validation"] = split["test"] + + # Create config + # Distributed training: + # The .from_pretrained methods guarantee that only one local process can concurrently + # download model & vocab. 
+ config_kwargs = { + "cache_dir": args.cache_dir, + "revision": args.model_revision, + "token": args.token, + "trust_remote_code": args.trust_remote_code, + } + if args.config_name_or_path: + config = AutoConfig.from_pretrained(args.config_name_or_path, **config_kwargs) + elif args.model_name_or_path: + config = AutoConfig.from_pretrained(args.model_name_or_path, **config_kwargs) + else: + config = CONFIG_MAPPING[args.model_type]() + logger.warning("You are instantiating a new config instance from scratch.") + if args.config_overrides is not None: + logger.info(f"Overriding config: {args.config_overrides}") + config.update_from_string(args.config_overrides) + logger.info(f"New config: {config}") + + # make sure the decoder_type is "simmim" (only relevant for BEiT) + if hasattr(config, "decoder_type"): + config.decoder_type = "simmim" + + # adapt config + args.image_size = args.image_size if args.image_size is not None else config.image_size + args.patch_size = args.patch_size if args.patch_size is not None else config.patch_size + args.encoder_stride = args.encoder_stride if args.encoder_stride is not None else config.encoder_stride + + config.update( + { + "image_size": args.image_size, + "patch_size": args.patch_size, + "encoder_stride": args.encoder_stride, + } + ) + + # create image processor + if args.image_processor_name: + image_processor = AutoImageProcessor.from_pretrained(args.image_processor_name, **config_kwargs) + elif args.model_name_or_path: + image_processor = AutoImageProcessor.from_pretrained(args.model_name_or_path, **config_kwargs) + else: + IMAGE_PROCESSOR_TYPES = { + conf.model_type: image_processor_class for conf, image_processor_class in IMAGE_PROCESSOR_MAPPING.items() + } + image_processor = IMAGE_PROCESSOR_TYPES[args.model_type]() + + # create model + if args.model_name_or_path: + model = AutoModelForMaskedImageModeling.from_pretrained( + args.model_name_or_path, + from_tf=bool(".ckpt" in args.model_name_or_path), + config=config, + cache_dir=args.cache_dir, + revision=args.model_revision, + token=args.token, + trust_remote_code=args.trust_remote_code, + ) + else: + logger.info("Training new model from scratch") + model = AutoModelForMaskedImageModeling.from_config( + config, + token=args.token, + trust_remote_code=args.trust_remote_code, + ) + + column_names = ds["train"].column_names + + if args.image_column_name is not None: + image_column_name = args.image_column_name + elif "image" in column_names: + image_column_name = "image" + elif "img" in column_names: + image_column_name = "img" + else: + image_column_name = column_names[0] + + # transformations as done in original SimMIM paper + # source: https://github.com/microsoft/SimMIM/blob/main/data/data_simmim.py + transforms = Compose( + [ + Lambda(lambda img: img.convert("RGB")), + RandomResizedCrop(args.image_size, scale=(0.67, 1.0), ratio=(3.0 / 4.0, 4.0 / 3.0)), + RandomHorizontalFlip(), + ToTensor(), + Normalize(mean=image_processor.image_mean, std=image_processor.image_std), + ] + ) + + # create mask generator + mask_generator = MaskGenerator( + input_size=args.image_size, + mask_patch_size=args.mask_patch_size, + model_patch_size=args.patch_size, + mask_ratio=args.mask_ratio, + ) + + def preprocess_images(examples): + """Preprocess a batch of images by applying transforms + creating a corresponding mask, indicating + which patches to mask.""" + + examples["pixel_values"] = [transforms(image) for image in examples[image_column_name]] + examples["mask"] = [mask_generator() for i in 
range(len(examples[image_column_name]))] + + return examples + + if args.max_train_samples is not None: + ds["train"] = ds["train"].shuffle(seed=args.seed).select(range(args.max_train_samples)) + # Set the training transforms + ds["train"].set_transform(preprocess_images) + + if args.max_eval_samples is not None: + ds["validation"] = ds["validation"].shuffle(seed=args.seed).select(range(args.max_eval_samples)) + # Set the validation transforms + ds["validation"].set_transform(preprocess_images) + + # DataLoaders creation: + train_dataloader = DataLoader( + ds["train"], + shuffle=True, + collate_fn=collate_fn, + batch_size=args.per_device_train_batch_size, + ) + eval_dataloader = DataLoader( + ds["validation"], + collate_fn=collate_fn, + batch_size=args.per_device_eval_batch_size, + ) + + # Optimizer + # Split weights in two groups, one with weight decay and the other not. + no_decay = ["bias", "LayerNorm.weight"] + optimizer_grouped_parameters = [ + { + "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], + "weight_decay": args.weight_decay, + }, + { + "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], + "weight_decay": 0.0, + }, + ] + optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=args.learning_rate) + + # Note -> the training dataloader needs to be prepared before we grab his length below (cause its length will be + # shorter in multiprocess) + + # Scheduler and math around the number of training steps. + overrode_max_train_steps = False + num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps) + if args.max_train_steps is None: + args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch + overrode_max_train_steps = True + + lr_scheduler = get_scheduler( + name=args.lr_scheduler_type, + optimizer=optimizer, + num_warmup_steps=args.num_warmup_steps * accelerator.num_processes, + num_training_steps=args.max_train_steps + if overrode_max_train_steps + else args.max_train_steps * accelerator.num_processes, + ) + + # Prepare everything with our `accelerator`. + model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare( + model, + optimizer, + train_dataloader, + eval_dataloader, + lr_scheduler, + ) + + # On TPU, the tie weights in our model have been disconnected, so we need to restore the ties. + if accelerator.distributed_type == DistributedType.TPU: + model.tie_weights() + + # We need to recalculate our total training steps as the size of the training dataloader may have changed. + num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps) + if overrode_max_train_steps: + args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch + # Afterwards we recalculate our number of training epochs + args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch) + + # Figure out how many steps we should save the Accelerator states + checkpointing_steps = args.checkpointing_steps + if checkpointing_steps is not None and checkpointing_steps.isdigit(): + checkpointing_steps = int(checkpointing_steps) + + # We need to initialize the trackers we use, and also store our configuration. + # The trackers initializes automatically on the main process. 
+ if args.with_tracking: + experiment_config = vars(args) + # TensorBoard cannot log Enums, need the raw value + experiment_config["lr_scheduler_type"] = experiment_config["lr_scheduler_type"].value + accelerator.init_trackers("mim_no_trainer", experiment_config) + + # Train! + total_batch_size = args.per_device_train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps + + logger.info("***** Running training *****") + logger.info(f" Num examples = {len(ds['train'])}") + logger.info(f" Num Epochs = {args.num_train_epochs}") + logger.info(f" Instantaneous batch size per device = {args.per_device_train_batch_size}") + logger.info(f" Total train batch size (w. parallel, distributed & accumulation) = {total_batch_size}") + logger.info(f" Gradient Accumulation steps = {args.gradient_accumulation_steps}") + logger.info(f" Total optimization steps = {args.max_train_steps}") + # Only show the progress bar once on each machine. + progress_bar = tqdm(range(int(args.max_train_steps)), disable=not accelerator.is_local_main_process) + completed_steps = 0 + starting_epoch = 0 + + # Potentially load in the weights and states from a previous save + if args.resume_from_checkpoint: + if args.resume_from_checkpoint is not None or args.resume_from_checkpoint != "": + checkpoint_path = args.resume_from_checkpoint + path = os.path.basename(args.resume_from_checkpoint) + else: + # Get the most recent checkpoint + dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()] + dirs.sort(key=os.path.getctime) + path = dirs[-1] # Sorts folders by date modified, most recent checkpoint is the last + checkpoint_path = path + path = os.path.basename(checkpoint_path) + + accelerator.print(f"Resumed from checkpoint: {checkpoint_path}") + accelerator.load_state(checkpoint_path) + # Extract `epoch_{i}` or `step_{i}` + training_difference = os.path.splitext(path)[0] + + if "epoch" in training_difference: + starting_epoch = int(training_difference.replace("epoch_", "")) + 1 + resume_step = None + completed_steps = starting_epoch * num_update_steps_per_epoch + else: + # need to multiply `gradient_accumulation_steps` to reflect real steps + resume_step = int(training_difference.replace("step_", "")) * args.gradient_accumulation_steps + starting_epoch = resume_step // len(train_dataloader) + completed_steps = resume_step // args.gradient_accumulation_steps + resume_step -= starting_epoch * len(train_dataloader) + + # update the progress_bar if load from checkpoint + progress_bar.update(completed_steps) + + for epoch in range(starting_epoch, args.num_train_epochs): + model.train() + if args.with_tracking: + total_loss = 0 + if args.resume_from_checkpoint and epoch == starting_epoch and resume_step is not None: + # We skip the first `n` batches in the dataloader when resuming from a checkpoint + active_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step) + else: + active_dataloader = train_dataloader + for step, batch in enumerate(active_dataloader): + with accelerator.accumulate(model): + outputs = model(**batch) + loss = outputs.loss + # We keep track of the loss at each epoch + if args.with_tracking: + total_loss += loss.detach().float() + accelerator.backward(loss) + optimizer.step() + lr_scheduler.step() + optimizer.zero_grad() + + # Checks if the accelerator has performed an optimization step behind the scenes + if accelerator.sync_gradients: + progress_bar.update(1) + completed_steps += 1 + + if isinstance(checkpointing_steps, int): + if completed_steps % checkpointing_steps == 
0: + output_dir = f"step_{completed_steps}" + if args.output_dir is not None: + output_dir = os.path.join(args.output_dir, output_dir) + accelerator.save_state(output_dir) + + if completed_steps >= args.max_train_steps: + break + + model.eval() + losses = [] + for step, batch in enumerate(eval_dataloader): + with torch.no_grad(): + outputs = model(**batch) + + loss = outputs.loss + losses.append(accelerator.gather_for_metrics(loss.repeat(args.per_device_eval_batch_size))) + + losses = torch.cat(losses) + eval_loss = torch.mean(losses) + + logger.info(f"epoch {epoch}: eval_loss: {eval_loss}") + + if args.with_tracking: + accelerator.log( + { + "eval_loss": eval_loss, + "train_loss": total_loss.item() / len(train_dataloader), + "epoch": epoch, + "step": completed_steps, + }, + step=completed_steps, + ) + + if args.push_to_hub and epoch < args.num_train_epochs - 1: + accelerator.wait_for_everyone() + unwrapped_model = accelerator.unwrap_model(model) + unwrapped_model.save_pretrained( + args.output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save + ) + if accelerator.is_main_process: + image_processor.save_pretrained(args.output_dir) + repo.push_to_hub( + commit_message=f"Training in progress epoch {epoch}", blocking=False, auto_lfs_prune=True + ) + + if args.checkpointing_steps == "epoch": + output_dir = f"epoch_{epoch}" + if args.output_dir is not None: + output_dir = os.path.join(args.output_dir, output_dir) + accelerator.save_state(output_dir) + + if args.with_tracking: + accelerator.end_training() + + if args.output_dir is not None: + accelerator.wait_for_everyone() + unwrapped_model = accelerator.unwrap_model(model) + unwrapped_model.save_pretrained( + args.output_dir, is_main_process=accelerator.is_main_process, save_function=accelerator.save + ) + if accelerator.is_main_process: + image_processor.save_pretrained(args.output_dir) + if args.push_to_hub: + repo.push_to_hub(commit_message="End of training", auto_lfs_prune=True) + + +if __name__ == "__main__": + main() diff --git a/examples/pytorch/language-modeling/README.md b/examples/pytorch/language-modeling/README.md index ff504b535747a9..23c0bc2c79aeb4 100644 --- a/examples/pytorch/language-modeling/README.md +++ b/examples/pytorch/language-modeling/README.md @@ -36,7 +36,7 @@ the tokenization). The loss here is that of causal language modeling. ```bash python run_clm.py \ - --model_name_or_path gpt2 \ + --model_name_or_path openai-community/gpt2 \ --dataset_name wikitext \ --dataset_config_name wikitext-2-raw-v1 \ --per_device_train_batch_size 8 \ @@ -53,7 +53,7 @@ To run on your own training and validation files, use the following command: ```bash python run_clm.py \ - --model_name_or_path gpt2 \ + --model_name_or_path openai-community/gpt2 \ --train_file path_to_train_file \ --validation_file path_to_validation_file \ --per_device_train_batch_size 8 \ @@ -69,7 +69,7 @@ This uses the built in HuggingFace `Trainer` for training. If you want to use a python run_clm_no_trainer.py \ --dataset_name wikitext \ --dataset_config_name wikitext-2-raw-v1 \ - --model_name_or_path gpt2 \ + --model_name_or_path openai-community/gpt2 \ --output_dir /tmp/test-clm ``` @@ -84,7 +84,7 @@ converge slightly slower (over-fitting takes more epochs). 
```bash python run_mlm.py \ - --model_name_or_path roberta-base \ + --model_name_or_path FacebookAI/roberta-base \ --dataset_name wikitext \ --dataset_config_name wikitext-2-raw-v1 \ --per_device_train_batch_size 8 \ @@ -98,7 +98,7 @@ To run on your own training and validation files, use the following command: ```bash python run_mlm.py \ - --model_name_or_path roberta-base \ + --model_name_or_path FacebookAI/roberta-base \ --train_file path_to_train_file \ --validation_file path_to_validation_file \ --per_device_train_batch_size 8 \ @@ -117,7 +117,7 @@ This uses the built in HuggingFace `Trainer` for training. If you want to use a python run_mlm_no_trainer.py \ --dataset_name wikitext \ --dataset_config_name wikitext-2-raw-v1 \ - --model_name_or_path roberta-base \ + --model_name_or_path FacebookAI/roberta-base \ --output_dir /tmp/test-mlm ``` @@ -144,7 +144,7 @@ Here is how to fine-tune XLNet on wikitext-2: ```bash python run_plm.py \ - --model_name_or_path=xlnet-base-cased \ + --model_name_or_path=xlnet/xlnet-base-cased \ --dataset_name wikitext \ --dataset_config_name wikitext-2-raw-v1 \ --per_device_train_batch_size 8 \ @@ -158,7 +158,7 @@ To fine-tune it on your own training and validation file, run: ```bash python run_plm.py \ - --model_name_or_path=xlnet-base-cased \ + --model_name_or_path=xlnet/xlnet-base-cased \ --train_file path_to_train_file \ --validation_file path_to_validation_file \ --per_device_train_batch_size 8 \ @@ -178,13 +178,17 @@ sure all your batches have the same length. To use the streaming dataset mode which can be very useful for large datasets, add `--streaming` to the command line. This is currently supported by `run_mlm.py` and `run_clm.py`. +## Low Cpu Memory Usage + +To use low cpu memory mode which can be very useful for LLM, add `--low_cpu_mem_usage` to the command line. This is currently supported by `run_clm.py`,`run_mlm.py`, `run_plm.py`,`run_mlm_no_trainer.py` and `run_clm_no_trainer.py`. + ## Creating a model on the fly When training a model from scratch, configuration values may be overridden with the help of `--config_overrides`: ```bash -python run_clm.py --model_type gpt2 --tokenizer_name gpt2 \ --config_overrides="n_embd=1024,n_head=16,n_layer=48,n_positions=102" \ +python run_clm.py --model_type openai-community/gpt2 --tokenizer_name openai-community/gpt2 \ --config_overrides="n_embd=1024,n_head=16,n_layer=48,n_positions=102" \ [...] ``` diff --git a/examples/pytorch/language-modeling/run_clm.py b/examples/pytorch/language-modeling/run_clm.py index ae01b7614e63ba..7c3d69692311b7 100755 --- a/examples/pytorch/language-modeling/run_clm.py +++ b/examples/pytorch/language-modeling/run_clm.py @@ -25,6 +25,7 @@ import math import os import sys +import warnings from dataclasses import dataclass, field from itertools import chain from typing import Optional @@ -55,7 +56,7 @@ # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/language-modeling/requirements.txt") @@ -76,7 +77,7 @@ class ModelArguments: default=None, metadata={ "help": ( - "The model checkpoint for weights initialization.Don't set if you want to train a model from scratch." + "The model checkpoint for weights initialization. Don't set if you want to train a model from scratch." 
) }, ) @@ -111,12 +112,28 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -130,6 +147,15 @@ class ModelArguments: "choices": ["auto", "bfloat16", "float16", "float32"], }, ) + low_cpu_mem_usage: bool = field( + default=False, + metadata={ + "help": ( + "It is an option to create the model as an empty shell, then only materialize its parameters when the pretrained weights are loaded. " + "set True will benefit LLM loading time and RAM consumption." + ) + }, + ) def __post_init__(self): if self.config_overrides is not None and (self.config_name is not None or self.model_name_or_path is not None): @@ -229,6 +255,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_clm", model_args, data_args) @@ -240,6 +275,10 @@ def main(): handlers=[logging.StreamHandler(sys.stdout)], ) + if training_args.should_log: + # The default of training_args.log_level is passive, so we set log level at info here to have that default. 
+ transformers.utils.logging.set_verbosity_info() + log_level = training_args.get_process_log_level() logger.setLevel(log_level) datasets.utils.logging.set_verbosity(log_level) @@ -249,8 +288,8 @@ def main(): # Log on each process the small summary: logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" - + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " + + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}" ) logger.info(f"Training/evaluation parameters {training_args}") @@ -287,7 +326,7 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, streaming=data_args.streaming, ) if "validation" not in raw_datasets.keys(): @@ -296,7 +335,7 @@ def main(): data_args.dataset_config_name, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, streaming=data_args.streaming, ) raw_datasets["train"] = load_dataset( @@ -304,7 +343,7 @@ def main(): data_args.dataset_config_name, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, streaming=data_args.streaming, ) else: @@ -326,7 +365,7 @@ def main(): extension, data_files=data_files, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, **dataset_args, ) # If no validation data is there, validation_split_percentage will be used to divide the dataset. @@ -336,7 +375,7 @@ def main(): data_files=data_files, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, **dataset_args, ) raw_datasets["train"] = load_dataset( @@ -344,12 +383,12 @@ def main(): data_files=data_files, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, **dataset_args, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. 
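As a side note on the surrounding context lines: when the dataset has no validation split, `run_clm.py` carves one out of the training split by percentage. A small standalone sketch of that pattern (the dataset name mirrors the README commands earlier in this patch, and the 5% figure is illustrative):

```python
from datasets import load_dataset

validation_split_percentage = 5  # illustrative; the script exposes this as a CLI argument
raw_datasets = {}
raw_datasets["validation"] = load_dataset(
    "wikitext", "wikitext-2-raw-v1", split=f"train[:{validation_split_percentage}%]"
)
raw_datasets["train"] = load_dataset(
    "wikitext", "wikitext-2-raw-v1", split=f"train[{validation_split_percentage}%:]"
)
print({split: len(ds) for split, ds in raw_datasets.items()})
```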
# Load pretrained model and tokenizer # @@ -360,7 +399,8 @@ def main(): config_kwargs = { "cache_dir": model_args.cache_dir, "revision": model_args.model_revision, - "use_auth_token": True if model_args.use_auth_token else None, + "token": model_args.token, + "trust_remote_code": model_args.trust_remote_code, } if model_args.config_name: config = AutoConfig.from_pretrained(model_args.config_name, **config_kwargs) @@ -378,7 +418,8 @@ def main(): "cache_dir": model_args.cache_dir, "use_fast": model_args.use_fast_tokenizer, "revision": model_args.model_revision, - "use_auth_token": True if model_args.use_auth_token else None, + "token": model_args.token, + "trust_remote_code": model_args.trust_remote_code, } if model_args.tokenizer_name: tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs) @@ -386,7 +427,7 @@ def main(): tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs) else: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name." ) @@ -402,12 +443,14 @@ def main(): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, torch_dtype=torch_dtype, + low_cpu_mem_usage=model_args.low_cpu_mem_usage, ) else: - model = AutoModelForCausalLM.from_config(config) - n_params = sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values()) + model = AutoModelForCausalLM.from_config(config, trust_remote_code=model_args.trust_remote_code) + n_params = sum({p.data_ptr(): p.numel() for p in model.parameters()}.values()) logger.info(f"Training new model from scratch - Total size={n_params/2**20:.2f}M params") # We resize the embeddings only when necessary to avoid index errors. If you are creating a model from scratch @@ -454,20 +497,27 @@ def tokenize_function(examples): batched=True, remove_columns=column_names, ) + if hasattr(config, "max_position_embeddings"): + max_pos_embeddings = config.max_position_embeddings + else: + # Define a default value if the attribute is missing in the config. + max_pos_embeddings = 1024 if data_args.block_size is None: block_size = tokenizer.model_max_length - if block_size > 1024: + if block_size > max_pos_embeddings: logger.warning( - "The chosen tokenizer supports a `model_max_length` that is longer than the default `block_size` value" - " of 1024. If you would like to use a longer `block_size` up to `tokenizer.model_max_length` you can" - " override this default with `--block_size xxx`." + f"The tokenizer picked seems to have a very large `model_max_length` ({tokenizer.model_max_length}). " + f"Using block_size={min(1024, max_pos_embeddings)} instead. You can change that default value by passing --block_size xxx." ) - block_size = 1024 + if max_pos_embeddings > 0: + block_size = min(1024, max_pos_embeddings) + else: + block_size = 1024 else: if data_args.block_size > tokenizer.model_max_length: logger.warning( - f"The block_size passed ({data_args.block_size}) is larger than the maximum length for the model" + f"The block_size passed ({data_args.block_size}) is larger than the maximum length for the model " f"({tokenizer.model_max_length}). 
Using block_size={tokenizer.model_max_length}." ) block_size = min(data_args.block_size, tokenizer.model_max_length) @@ -477,10 +527,9 @@ def group_texts(examples): # Concatenate all texts. concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()} total_length = len(concatenated_examples[list(examples.keys())[0]]) - # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can - # customize this part to your needs. - if total_length >= block_size: - total_length = (total_length // block_size) * block_size + # We drop the small remainder, and if the total_length < block_size we exclude this batch and return an empty dict. + # We could add padding if the model supported it instead of this drop, you can customize this part to your needs. + total_length = (total_length // block_size) * block_size # Split by chunks of max_len. result = { k: [t[i : i + block_size] for i in range(0, total_length, block_size)] @@ -494,7 +543,7 @@ def group_texts(examples): # to preprocess. # # To speed up this part, we use multiprocessing. See the documentation of the map method for more information: - # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map + # https://huggingface.co/docs/datasets/process#map with training_args.main_process_first(desc="grouping texts together"): if not data_args.streaming: @@ -534,7 +583,7 @@ def preprocess_logits_for_metrics(logits, labels): logits = logits[0] return logits.argmax(dim=-1) - metric = evaluate.load("accuracy") + metric = evaluate.load("accuracy", cache_dir=model_args.cache_dir) def compute_metrics(eval_preds): preds, labels = eval_preds diff --git a/examples/pytorch/language-modeling/run_clm_no_trainer.py b/examples/pytorch/language-modeling/run_clm_no_trainer.py index 998ca60b2700ad..e223aa8fe5da22 100755 --- a/examples/pytorch/language-modeling/run_clm_no_trainer.py +++ b/examples/pytorch/language-modeling/run_clm_no_trainer.py @@ -52,12 +52,12 @@ default_data_collator, get_scheduler, ) -from transformers.utils import check_min_version, get_full_repo_name, send_example_telemetry +from transformers.utils import check_min_version, send_example_telemetry from transformers.utils.versions import require_version # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") logger = get_logger(__name__) @@ -82,10 +82,10 @@ def parse_args(): help="The configuration name of the dataset to use (via the datasets library).", ) parser.add_argument( - "--train_file", type=str, default=None, help="A csv or a json file containing the training data." + "--train_file", type=str, default=None, help="A csv, txt or a json file containing the training data." ) parser.add_argument( - "--validation_file", type=str, default=None, help="A csv or a json file containing the validation data." + "--validation_file", type=str, default=None, help="A csv, txt or a json file containing the validation data." ) parser.add_argument( "--validation_split_percentage", @@ -193,6 +193,16 @@ def parse_args(): "--hub_model_id", type=str, help="The name of the repository to keep in sync with the local `output_dir`." ) parser.add_argument("--hub_token", type=str, help="The token to use to push to the Model Hub.") + parser.add_argument( + "--trust_remote_code", + type=bool, + default=False, + help=( + "Whether or not to allow for custom models defined on the Hub in their own modeling files. 
This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." + ), + ) parser.add_argument( "--checkpointing_steps", type=str, @@ -216,10 +226,18 @@ def parse_args(): default="all", help=( 'The integration to report the results and logs to. Supported platforms are `"tensorboard"`,' - ' `"wandb"`, `"comet_ml"` and `"clearml"`. Use `"all"` (default) to report to all integrations.' + ' `"wandb"`, `"comet_ml"` and `"clearml"`. Use `"all"` (default) to report to all integrations. ' "Only applicable when `--with_tracking` is passed." ), ) + parser.add_argument( + "--low_cpu_mem_usage", + action="store_true", + help=( + "It is an option to create the model as an empty shell, then only materialize its parameters when the pretrained weights are loaded. " + "If passed, LLM loading time and RAM consumption will be benefited." + ), + ) args = parser.parse_args() # Sanity checks @@ -228,13 +246,16 @@ def parse_args(): else: if args.train_file is not None: extension = args.train_file.split(".")[-1] - assert extension in ["csv", "json", "txt"], "`train_file` should be a csv, json or txt file." + if extension not in ["csv", "json", "txt"]: + raise ValueError("`train_file` should be a csv, json or txt file.") if args.validation_file is not None: extension = args.validation_file.split(".")[-1] - assert extension in ["csv", "json", "txt"], "`validation_file` should be a csv, json or txt file." + if extension not in ["csv", "json", "txt"]: + raise ValueError("`validation_file` should be a csv, json or txt file.") if args.push_to_hub: - assert args.output_dir is not None, "Need an `output_dir` to create a repo when `--push_to_hub` is passed." + if args.output_dir is None: + raise ValueError("Need an `output_dir` to create a repo when `--push_to_hub` is passed.") return args @@ -253,7 +274,7 @@ def main(): if args.with_tracking: accelerator_log_kwargs["log_with"] = args.report_to - accelerator_log_kwargs["logging_dir"] = args.output_dir + accelerator_log_kwargs["project_dir"] = args.output_dir accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps, **accelerator_log_kwargs) @@ -278,12 +299,14 @@ def main(): # Handle the repository creation if accelerator.is_main_process: if args.push_to_hub: - if args.hub_model_id is None: - repo_name = get_full_repo_name(Path(args.output_dir).name, token=args.hub_token) - else: - repo_name = args.hub_model_id - create_repo(repo_name, exist_ok=True, token=args.hub_token) - repo = Repository(args.output_dir, clone_from=repo_name, token=args.hub_token) + # Retrieve of infer repo_name + repo_name = args.hub_model_id + if repo_name is None: + repo_name = Path(args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=args.hub_token).repo_id + # Clone repo locally + repo = Repository(args.output_dir, clone_from=repo_id, token=args.hub_token) with open(os.path.join(args.output_dir, ".gitignore"), "w+") as gitignore: if "step_*" not in gitignore: @@ -322,9 +345,10 @@ def main(): dataset_args = {} if args.train_file is not None: data_files["train"] = args.train_file + extension = args.train_file.split(".")[-1] if args.validation_file is not None: data_files["validation"] = args.validation_file - extension = args.train_file.split(".")[-1] + extension = args.validation_file.split(".")[-1] if extension == "txt": extension = "text" dataset_args["keep_linebreaks"] = not 
args.no_keep_linebreaks @@ -345,27 +369,37 @@ def main(): ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # Load pretrained model and tokenizer # # In distributed training, the .from_pretrained methods guarantee that only one local process can concurrently # download model & vocab. if args.config_name: - config = AutoConfig.from_pretrained(args.config_name) + config = AutoConfig.from_pretrained( + args.config_name, + trust_remote_code=args.trust_remote_code, + ) elif args.model_name_or_path: - config = AutoConfig.from_pretrained(args.model_name_or_path) + config = AutoConfig.from_pretrained( + args.model_name_or_path, + trust_remote_code=args.trust_remote_code, + ) else: config = CONFIG_MAPPING[args.model_type]() logger.warning("You are instantiating a new config instance from scratch.") if args.tokenizer_name: - tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, use_fast=not args.use_slow_tokenizer) + tokenizer = AutoTokenizer.from_pretrained( + args.tokenizer_name, use_fast=not args.use_slow_tokenizer, trust_remote_code=args.trust_remote_code + ) elif args.model_name_or_path: - tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, use_fast=not args.use_slow_tokenizer) + tokenizer = AutoTokenizer.from_pretrained( + args.model_name_or_path, use_fast=not args.use_slow_tokenizer, trust_remote_code=args.trust_remote_code + ) else: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name." ) @@ -374,10 +408,12 @@ def main(): args.model_name_or_path, from_tf=bool(".ckpt" in args.model_name_or_path), config=config, + low_cpu_mem_usage=args.low_cpu_mem_usage, + trust_remote_code=args.trust_remote_code, ) else: logger.info("Training new model from scratch") - model = AutoModelForCausalLM.from_config(config) + model = AutoModelForCausalLM.from_config(config, trust_remote_code=args.trust_remote_code) # We resize the embeddings only when necessary to avoid index errors. If you are creating a model from scratch # on a small vocab and want a smaller embedding size, remove this test. @@ -405,17 +441,16 @@ def tokenize_function(examples): if args.block_size is None: block_size = tokenizer.model_max_length - if block_size > 1024: + if block_size > config.max_position_embeddings: logger.warning( - "The chosen tokenizer supports a `model_max_length` that is longer than the default `block_size` value" - " of 1024. If you would like to use a longer `block_size` up to `tokenizer.model_max_length` you can" - " override this default with `--block_size xxx`." + f"The tokenizer picked seems to have a very large `model_max_length` ({tokenizer.model_max_length}). " + f"Using block_size={min(1024, config.max_position_embeddings)} instead. You can change that default value by passing --block_size xxx." 
) - block_size = 1024 + block_size = min(1024, config.max_position_embeddings) else: if args.block_size > tokenizer.model_max_length: logger.warning( - f"The block_size passed ({args.block_size}) is larger than the maximum length for the model" + f"The block_size passed ({args.block_size}) is larger than the maximum length for the model " f"({tokenizer.model_max_length}). Using block_size={tokenizer.model_max_length}." ) block_size = min(args.block_size, tokenizer.model_max_length) @@ -425,10 +460,9 @@ def group_texts(examples): # Concatenate all texts. concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()} total_length = len(concatenated_examples[list(examples.keys())[0]]) - # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can - # customize this part to your needs. - if total_length >= block_size: - total_length = (total_length // block_size) * block_size + # We drop the small remainder, and if the total_length < block_size we exclude this batch and return an empty dict. + # We could add padding if the model supported it instead of this drop, you can customize this part to your needs. + total_length = (total_length // block_size) * block_size # Split by chunks of max_len. result = { k: [t[i : i + block_size] for i in range(0, total_length, block_size)] @@ -442,7 +476,7 @@ def group_texts(examples): # to preprocess. # # To speed up this part, we use multiprocessing. See the documentation of the map method for more information: - # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map + # https://huggingface.co/docs/datasets/process#map with accelerator.main_process_first(): lm_datasets = tokenized_datasets.map( @@ -493,8 +527,10 @@ def group_texts(examples): lr_scheduler = get_scheduler( name=args.lr_scheduler_type, optimizer=optimizer, - num_warmup_steps=args.num_warmup_steps * args.gradient_accumulation_steps, - num_training_steps=args.max_train_steps * args.gradient_accumulation_steps, + num_warmup_steps=args.num_warmup_steps * accelerator.num_processes, + num_training_steps=args.max_train_steps + if overrode_max_train_steps + else args.max_train_steps * accelerator.num_processes, ) # Prepare everything with our `accelerator`. 
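The block-size hunks above (in both the Trainer and no-trainer CLM variants) stop hard-coding 1024 and instead clamp against the model's `max_position_embeddings`, falling back to 1024 when the config does not expose that attribute. A self-contained sketch of the selection logic with assumed toy values and a dummy config object standing in for a real `PretrainedConfig`:

```python
class DummyConfig:  # stand-in for a transformers PretrainedConfig
    max_position_embeddings = 2048


config = DummyConfig()
tokenizer_model_max_length = 1_000_000_000_000  # the "very large" sentinel some fast tokenizers report
requested_block_size = None                     # --block_size was not passed

# Fall back to 1024 when the config does not expose max_position_embeddings.
max_pos_embeddings = getattr(config, "max_position_embeddings", 1024)

if requested_block_size is None:
    block_size = tokenizer_model_max_length
    if block_size > max_pos_embeddings:
        # Clamp to whichever is smaller: 1024 or the model's position limit.
        block_size = min(1024, max_pos_embeddings) if max_pos_embeddings > 0 else 1024
else:
    # A user-supplied block size is still capped by the tokenizer's limit.
    block_size = min(requested_block_size, tokenizer_model_max_length)

print(block_size)  # 1024 with these assumed values
```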
@@ -544,43 +580,45 @@ def group_texts(examples): # Potentially load in the weights and states from a previous save if args.resume_from_checkpoint: if args.resume_from_checkpoint is not None or args.resume_from_checkpoint != "": - accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}") - accelerator.load_state(args.resume_from_checkpoint) + checkpoint_path = args.resume_from_checkpoint path = os.path.basename(args.resume_from_checkpoint) else: # Get the most recent checkpoint dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()] dirs.sort(key=os.path.getctime) path = dirs[-1] # Sorts folders by date modified, most recent checkpoint is the last + checkpoint_path = path + path = os.path.basename(checkpoint_path) + + accelerator.print(f"Resumed from checkpoint: {checkpoint_path}") + accelerator.load_state(checkpoint_path) # Extract `epoch_{i}` or `step_{i}` training_difference = os.path.splitext(path)[0] if "epoch" in training_difference: starting_epoch = int(training_difference.replace("epoch_", "")) + 1 resume_step = None + completed_steps = starting_epoch * num_update_steps_per_epoch else: # need to multiply `gradient_accumulation_steps` to reflect real steps resume_step = int(training_difference.replace("step_", "")) * args.gradient_accumulation_steps starting_epoch = resume_step // len(train_dataloader) + completed_steps = resume_step // args.gradient_accumulation_steps resume_step -= starting_epoch * len(train_dataloader) # update the progress_bar if load from checkpoint - progress_bar.update(starting_epoch * num_update_steps_per_epoch) - completed_steps = starting_epoch * num_update_steps_per_epoch + progress_bar.update(completed_steps) for epoch in range(starting_epoch, args.num_train_epochs): model.train() if args.with_tracking: total_loss = 0 - for step, batch in enumerate(train_dataloader): - # We need to skip steps until we reach the resumed step - if args.resume_from_checkpoint and epoch == starting_epoch: - if resume_step is not None and step < resume_step: - if step % args.gradient_accumulation_steps == 0: - progress_bar.update(1) - completed_steps += 1 - continue - + if args.resume_from_checkpoint and epoch == starting_epoch and resume_step is not None: + # We skip the first `n` batches in the dataloader when resuming from a checkpoint + active_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step) + else: + active_dataloader = train_dataloader + for step, batch in enumerate(active_dataloader): with accelerator.accumulate(model): outputs = model(**batch) loss = outputs.loss @@ -599,7 +637,7 @@ def group_texts(examples): if isinstance(checkpointing_steps, int): if completed_steps % checkpointing_steps == 0: - output_dir = f"step_{completed_steps }" + output_dir = f"step_{completed_steps}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) diff --git a/examples/pytorch/language-modeling/run_mlm.py b/examples/pytorch/language-modeling/run_mlm.py index f44b0e3a01e7de..27f8e0a069d454 100755 --- a/examples/pytorch/language-modeling/run_mlm.py +++ b/examples/pytorch/language-modeling/run_mlm.py @@ -25,6 +25,7 @@ import math import os import sys +import warnings from dataclasses import dataclass, field from itertools import chain from typing import Optional @@ -53,7 +54,7 @@ # Will error if the minimal version of Transformers is not installed. Remove at your own risks. 
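The resume-from-checkpoint rework shown above converts a saved `step_{i}` directory name into raw dataloader batches to skip and into the number of optimizer steps already completed, then hands the batch count to `accelerator.skip_first_batches`. A standalone sketch of that arithmetic with assumed toy numbers:

```python
# Assumed toy values illustrating the resume bookkeeping in the hunks above.
gradient_accumulation_steps = 4
batches_per_epoch = 1000           # len(train_dataloader)
checkpoint_name = "step_300"       # folder previously written by accelerator.save_state

training_difference = checkpoint_name  # os.path.splitext(...)[0] in the script

if "epoch" in training_difference:
    starting_epoch = int(training_difference.replace("epoch_", "")) + 1
    resume_step = None  # the epoch branch derives completed_steps from update steps per epoch
else:
    # `step_{i}` counts optimizer steps, so multiply to recover raw dataloader batches.
    resume_step = int(training_difference.replace("step_", "")) * gradient_accumulation_steps  # 1200
    starting_epoch = resume_step // batches_per_epoch                                          # 1
    completed_steps = resume_step // gradient_accumulation_steps                               # 300
    resume_step -= starting_epoch * batches_per_epoch                                          # 200 batches into epoch 1

print(starting_epoch, completed_steps, resume_step)  # 1 300 200
# accelerator.skip_first_batches(train_dataloader, resume_step) then fast-forwards
# those 200 batches instead of iterating and discarding them one by one.
```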
-check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/language-modeling/requirements.txt") @@ -107,12 +108,37 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." + ) + }, + ) + low_cpu_mem_usage: bool = field( + default=False, + metadata={ + "help": ( + "It is an option to create the model as an empty shell, then only materialize its parameters when the pretrained weights are loaded. " + "set True will benefit LLM loading time and RAM consumption." ) }, ) @@ -229,6 +255,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_mlm", model_args, data_args) @@ -240,6 +275,10 @@ def main(): handlers=[logging.StreamHandler(sys.stdout)], ) + if training_args.should_log: + # The default of training_args.log_level is passive, so we set log level at info here to have that default. 
+ transformers.utils.logging.set_verbosity_info() + log_level = training_args.get_process_log_level() logger.setLevel(log_level) datasets.utils.logging.set_verbosity(log_level) @@ -249,8 +288,8 @@ def main(): # Log on each process the small summary: logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" - + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " + + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}" ) # Set the verbosity to info of the Transformers logger (on main process only): logger.info(f"Training/evaluation parameters {training_args}") @@ -288,7 +327,7 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, streaming=data_args.streaming, ) if "validation" not in raw_datasets.keys(): @@ -297,7 +336,7 @@ def main(): data_args.dataset_config_name, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, streaming=data_args.streaming, ) raw_datasets["train"] = load_dataset( @@ -305,7 +344,7 @@ def main(): data_args.dataset_config_name, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, streaming=data_args.streaming, ) else: @@ -322,7 +361,7 @@ def main(): extension, data_files=data_files, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # If no validation data is there, validation_split_percentage will be used to divide the dataset. @@ -332,18 +371,18 @@ def main(): data_files=data_files, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) raw_datasets["train"] = load_dataset( extension, data_files=data_files, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. 
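run_mlm.py gets the same `should_log` guard as the other scripts, so the `passive` default log level resolves to info on processes that are supposed to log. A standalone sketch of that logging setup, with a toy `TrainingArguments` instance standing in for the parsed command line:

```python
import logging
import sys

import datasets
import transformers
from transformers import TrainingArguments

logger = logging.getLogger(__name__)
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)

# Toy arguments; the scripts parse these from the command line.
training_args = TrainingArguments(output_dir="/tmp/example-run")

if training_args.should_log:
    # log_level defaults to "passive", so force the usual info default where logging is wanted.
    transformers.utils.logging.set_verbosity_info()

log_level = training_args.get_process_log_level()
logger.setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)
logger.warning("logging configured at level %s", log_level)
```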
# Load pretrained model and tokenizer # @@ -353,7 +392,8 @@ def main(): config_kwargs = { "cache_dir": model_args.cache_dir, "revision": model_args.model_revision, - "use_auth_token": True if model_args.use_auth_token else None, + "token": model_args.token, + "trust_remote_code": model_args.trust_remote_code, } if model_args.config_name: config = AutoConfig.from_pretrained(model_args.config_name, **config_kwargs) @@ -371,7 +411,8 @@ def main(): "cache_dir": model_args.cache_dir, "use_fast": model_args.use_fast_tokenizer, "revision": model_args.model_revision, - "use_auth_token": True if model_args.use_auth_token else None, + "token": model_args.token, + "trust_remote_code": model_args.trust_remote_code, } if model_args.tokenizer_name: tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs) @@ -379,7 +420,7 @@ def main(): tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs) else: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name." ) @@ -390,11 +431,13 @@ def main(): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, + low_cpu_mem_usage=model_args.low_cpu_mem_usage, ) else: logger.info("Training new model from scratch") - model = AutoModelForMaskedLM.from_config(config) + model = AutoModelForMaskedLM.from_config(config, trust_remote_code=model_args.trust_remote_code) # We resize the embeddings only when necessary to avoid index errors. If you are creating a model from scratch # on a small vocab and want a smaller embedding size, remove this test. @@ -422,7 +465,7 @@ def main(): else: if data_args.max_seq_length > tokenizer.model_max_length: logger.warning( - f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the" + f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the " f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}." ) max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length) @@ -492,10 +535,9 @@ def group_texts(examples): # Concatenate all texts. concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()} total_length = len(concatenated_examples[list(examples.keys())[0]]) - # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can - # customize this part to your needs. - if total_length >= max_seq_length: - total_length = (total_length // max_seq_length) * max_seq_length + # We drop the small remainder, and if the total_length < max_seq_length we exclude this batch and return an empty dict. + # We could add padding if the model supported it instead of this drop, you can customize this part to your needs. + total_length = (total_length // max_seq_length) * max_seq_length # Split by chunks of max_len. result = { k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)] @@ -508,7 +550,7 @@ def group_texts(examples): # might be slower to preprocess. # # To speed up this part, we use multiprocessing. 
See the documentation of the map method for more information: - # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map + # https://huggingface.co/docs/datasets/process#map with training_args.main_process_first(desc="grouping texts together"): if not data_args.streaming: @@ -548,7 +590,7 @@ def preprocess_logits_for_metrics(logits, labels): logits = logits[0] return logits.argmax(dim=-1) - metric = evaluate.load("accuracy") + metric = evaluate.load("accuracy", cache_dir=model_args.cache_dir) def compute_metrics(eval_preds): preds, labels = eval_preds diff --git a/examples/pytorch/language-modeling/run_mlm_no_trainer.py b/examples/pytorch/language-modeling/run_mlm_no_trainer.py index ee469e48890e74..80c46d4cce31ae 100755 --- a/examples/pytorch/language-modeling/run_mlm_no_trainer.py +++ b/examples/pytorch/language-modeling/run_mlm_no_trainer.py @@ -52,12 +52,12 @@ SchedulerType, get_scheduler, ) -from transformers.utils import check_min_version, get_full_repo_name, send_example_telemetry +from transformers.utils import check_min_version, send_example_telemetry from transformers.utils.versions import require_version # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") logger = get_logger(__name__) require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/language-modeling/requirements.txt") @@ -200,6 +200,16 @@ def parse_args(): "--hub_model_id", type=str, help="The name of the repository to keep in sync with the local `output_dir`." ) parser.add_argument("--hub_token", type=str, help="The token to use to push to the Model Hub.") + parser.add_argument( + "--trust_remote_code", + type=bool, + default=False, + help=( + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." + ), + ) parser.add_argument( "--checkpointing_steps", type=str, @@ -223,10 +233,18 @@ def parse_args(): default="all", help=( 'The integration to report the results and logs to. Supported platforms are `"tensorboard"`,' - ' `"wandb"`, `"comet_ml"` and `"clearml"`. Use `"all"` (default) to report to all integrations.' + ' `"wandb"`, `"comet_ml"` and `"clearml"`. Use `"all"` (default) to report to all integrations. ' "Only applicable when `--with_tracking` is passed." ), ) + parser.add_argument( + "--low_cpu_mem_usage", + action="store_true", + help=( + "It is an option to create the model as an empty shell, then only materialize its parameters when the pretrained weights are loaded. " + "If passed, LLM loading time and RAM consumption will be benefited." + ), + ) args = parser.parse_args() # Sanity checks @@ -243,7 +261,8 @@ def parse_args(): raise ValueError("`validation_file` should be a csv, json or txt file.") if args.push_to_hub: - assert args.output_dir is not None, "Need an `output_dir` to create a repo when `--push_to_hub` is passed." 
+ if args.output_dir is None: + raise ValueError("Need an `output_dir` to create a repo when `--push_to_hub` is passed.") return args @@ -262,7 +281,7 @@ def main(): if args.with_tracking: accelerator_log_kwargs["log_with"] = args.report_to - accelerator_log_kwargs["logging_dir"] = args.output_dir + accelerator_log_kwargs["project_dir"] = args.output_dir accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps, **accelerator_log_kwargs) @@ -287,12 +306,14 @@ def main(): # Handle the repository creation if accelerator.is_main_process: if args.push_to_hub: - if args.hub_model_id is None: - repo_name = get_full_repo_name(Path(args.output_dir).name, token=args.hub_token) - else: - repo_name = args.hub_model_id - create_repo(repo_name, exist_ok=True, token=args.hub_token) - repo = Repository(args.output_dir, clone_from=repo_name, token=args.hub_token) + # Retrieve of infer repo_name + repo_name = args.hub_model_id + if repo_name is None: + repo_name = Path(args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=args.hub_token).repo_id + # Clone repo locally + repo = Repository(args.output_dir, clone_from=repo_id, token=args.hub_token) with open(os.path.join(args.output_dir, ".gitignore"), "w+") as gitignore: if "step_*" not in gitignore: @@ -330,9 +351,10 @@ def main(): data_files = {} if args.train_file is not None: data_files["train"] = args.train_file + extension = args.train_file.split(".")[-1] if args.validation_file is not None: data_files["validation"] = args.validation_file - extension = args.train_file.split(".")[-1] + extension = args.validation_file.split(".")[-1] if extension == "txt": extension = "text" raw_datasets = load_dataset(extension, data_files=data_files) @@ -350,27 +372,31 @@ def main(): ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # Load pretrained model and tokenizer # # In distributed training, the .from_pretrained methods guarantee that only one local process can concurrently # download model & vocab. if args.config_name: - config = AutoConfig.from_pretrained(args.config_name) + config = AutoConfig.from_pretrained(args.config_name, trust_remote_code=args.trust_remote_code) elif args.model_name_or_path: - config = AutoConfig.from_pretrained(args.model_name_or_path) + config = AutoConfig.from_pretrained(args.model_name_or_path, trust_remote_code=args.trust_remote_code) else: config = CONFIG_MAPPING[args.model_type]() logger.warning("You are instantiating a new config instance from scratch.") if args.tokenizer_name: - tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, use_fast=not args.use_slow_tokenizer) + tokenizer = AutoTokenizer.from_pretrained( + args.tokenizer_name, use_fast=not args.use_slow_tokenizer, trust_remote_code=args.trust_remote_code + ) elif args.model_name_or_path: - tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, use_fast=not args.use_slow_tokenizer) + tokenizer = AutoTokenizer.from_pretrained( + args.model_name_or_path, use_fast=not args.use_slow_tokenizer, trust_remote_code=args.trust_remote_code + ) else: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. 
" "You can do it from another script, save it, and load it from here, using --tokenizer_name." ) @@ -379,10 +405,12 @@ def main(): args.model_name_or_path, from_tf=bool(".ckpt" in args.model_name_or_path), config=config, + low_cpu_mem_usage=args.low_cpu_mem_usage, + trust_remote_code=args.trust_remote_code, ) else: logger.info("Training new model from scratch") - model = AutoModelForMaskedLM.from_config(config) + model = AutoModelForMaskedLM.from_config(config, trust_remote_code=args.trust_remote_code) # We resize the embeddings only when necessary to avoid index errors. If you are creating a model from scratch # on a small vocab and want a smaller embedding size, remove this test. @@ -407,7 +435,7 @@ def main(): else: if args.max_seq_length > tokenizer.model_max_length: logger.warning( - f"The max_seq_length passed ({args.max_seq_length}) is larger than the maximum length for the" + f"The max_seq_length passed ({args.max_seq_length}) is larger than the maximum length for the " f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}." ) max_seq_length = min(args.max_seq_length, tokenizer.model_max_length) @@ -463,10 +491,9 @@ def group_texts(examples): # Concatenate all texts. concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()} total_length = len(concatenated_examples[list(examples.keys())[0]]) - # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can - # customize this part to your needs. - if total_length >= max_seq_length: - total_length = (total_length // max_seq_length) * max_seq_length + # We drop the small remainder, and if the total_length < max_seq_length we exclude this batch and return an empty dict. + # We could add padding if the model supported it instead of this drop, you can customize this part to your needs. + total_length = (total_length // max_seq_length) * max_seq_length # Split by chunks of max_len. result = { k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)] @@ -479,7 +506,7 @@ def group_texts(examples): # might be slower to preprocess. # # To speed up this part, we use multiprocessing. See the documentation of the map method for more information: - # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map + # https://huggingface.co/docs/datasets/process#map with accelerator.main_process_first(): tokenized_datasets = tokenized_datasets.map( @@ -537,8 +564,10 @@ def group_texts(examples): lr_scheduler = get_scheduler( name=args.lr_scheduler_type, optimizer=optimizer, - num_warmup_steps=args.num_warmup_steps * args.gradient_accumulation_steps, - num_training_steps=args.max_train_steps * args.gradient_accumulation_steps, + num_warmup_steps=args.num_warmup_steps * accelerator.num_processes, + num_training_steps=args.max_train_steps + if overrode_max_train_steps + else args.max_train_steps * accelerator.num_processes, ) # Prepare everything with our `accelerator`. 
@@ -588,43 +617,45 @@ def group_texts(examples): # Potentially load in the weights and states from a previous save if args.resume_from_checkpoint: if args.resume_from_checkpoint is not None or args.resume_from_checkpoint != "": - accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}") - accelerator.load_state(args.resume_from_checkpoint) + checkpoint_path = args.resume_from_checkpoint path = os.path.basename(args.resume_from_checkpoint) else: # Get the most recent checkpoint dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()] dirs.sort(key=os.path.getctime) path = dirs[-1] # Sorts folders by date modified, most recent checkpoint is the last + checkpoint_path = path + path = os.path.basename(checkpoint_path) + + accelerator.print(f"Resumed from checkpoint: {checkpoint_path}") + accelerator.load_state(checkpoint_path) # Extract `epoch_{i}` or `step_{i}` training_difference = os.path.splitext(path)[0] if "epoch" in training_difference: starting_epoch = int(training_difference.replace("epoch_", "")) + 1 resume_step = None + completed_steps = starting_epoch * num_update_steps_per_epoch else: # need to multiply `gradient_accumulation_steps` to reflect real steps resume_step = int(training_difference.replace("step_", "")) * args.gradient_accumulation_steps starting_epoch = resume_step // len(train_dataloader) + completed_steps = resume_step // args.gradient_accumulation_steps resume_step -= starting_epoch * len(train_dataloader) # update the progress_bar if load from checkpoint - progress_bar.update(starting_epoch * num_update_steps_per_epoch) - completed_steps = starting_epoch * num_update_steps_per_epoch + progress_bar.update(completed_steps) for epoch in range(starting_epoch, args.num_train_epochs): model.train() if args.with_tracking: total_loss = 0 - for step, batch in enumerate(train_dataloader): - # We need to skip steps until we reach the resumed step - if args.resume_from_checkpoint and epoch == starting_epoch: - if resume_step is not None and step < resume_step: - if step % args.gradient_accumulation_steps == 0: - progress_bar.update(1) - completed_steps += 1 - continue - + if args.resume_from_checkpoint and epoch == starting_epoch and resume_step is not None: + # We skip the first `n` batches in the dataloader when resuming from a checkpoint + active_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step) + else: + active_dataloader = train_dataloader + for step, batch in enumerate(active_dataloader): with accelerator.accumulate(model): outputs = model(**batch) loss = outputs.loss @@ -643,7 +674,7 @@ def group_texts(examples): if isinstance(checkpointing_steps, int): if completed_steps % checkpointing_steps == 0: - output_dir = f"step_{completed_steps }" + output_dir = f"step_{completed_steps}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) @@ -667,7 +698,7 @@ def group_texts(examples): except OverflowError: perplexity = float("inf") - logger.info(f"epoch {epoch}: perplexity: {perplexity}") + logger.info(f"epoch {epoch}: perplexity: {perplexity} eval_loss: {eval_loss}") if args.with_tracking: accelerator.log( diff --git a/examples/pytorch/language-modeling/run_plm.py b/examples/pytorch/language-modeling/run_plm.py index 867527a2c5ec23..1a744083b18a94 100755 --- a/examples/pytorch/language-modeling/run_plm.py +++ b/examples/pytorch/language-modeling/run_plm.py @@ -22,6 +22,7 @@ import math import os import sys +import warnings from dataclasses import dataclass, 
field from itertools import chain from typing import Optional @@ -47,7 +48,7 @@ # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/language-modeling/requirements.txt") @@ -64,7 +65,7 @@ class ModelArguments: default=None, metadata={ "help": ( - "The model checkpoint for weights initialization.Don't set if you want to train a model from scratch." + "The model checkpoint for weights initialization. Don't set if you want to train a model from scratch." ) }, ) @@ -95,12 +96,27 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + low_cpu_mem_usage: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "It is an option to create the model as an empty shell, then only materialize its parameters when the pretrained weights are loaded. " + "set True will benefit LLM loading time and RAM consumption." ) }, ) @@ -220,6 +236,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_plm", model_args, data_args) @@ -231,6 +256,10 @@ def main(): handlers=[logging.StreamHandler(sys.stdout)], ) + if training_args.should_log: + # The default of training_args.log_level is passive, so we set log level at info here to have that default. 
+ transformers.utils.logging.set_verbosity_info() + log_level = training_args.get_process_log_level() logger.setLevel(log_level) datasets.utils.logging.set_verbosity(log_level) @@ -240,8 +269,8 @@ def main(): # Log on each process the small summary: logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" - + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " + + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}" ) logger.info(f"Training/evaluation parameters {training_args}") @@ -278,7 +307,7 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) if "validation" not in raw_datasets.keys(): raw_datasets["validation"] = load_dataset( @@ -286,22 +315,23 @@ def main(): data_args.dataset_config_name, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) raw_datasets["train"] = load_dataset( data_args.dataset_name, data_args.dataset_config_name, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: data_files = {} if data_args.train_file is not None: data_files["train"] = data_args.train_file + extension = data_args.train_file.split(".")[-1] if data_args.validation_file is not None: data_files["validation"] = data_args.validation_file - extension = data_args.train_file.split(".")[-1] + extension = data_args.validation_file.split(".")[-1] if extension == "txt": extension = "text" raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir) @@ -312,18 +342,18 @@ def main(): data_files=data_files, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) raw_datasets["train"] = load_dataset( extension, data_files=data_files, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. 
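run_plm.py receives the same deprecation shim as the other scripts: a `FutureWarning` when `--use_auth_token` is still passed, an error if both flags are set, and a silent re-mapping onto `token`. A minimal standalone sketch of that pattern; the reduced dataclass below is an illustrative stand-in, not the script's full `ModelArguments`:

```python
import warnings
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelArguments:  # reduced stand-in for illustration
    token: Optional[str] = field(default=None)
    use_auth_token: Optional[bool] = field(default=None)


model_args = ModelArguments(use_auth_token=True)

if model_args.use_auth_token is not None:
    warnings.warn(
        "The `use_auth_token` argument is deprecated and will be removed in v4.34. "
        "Please use `token` instead.",
        FutureWarning,
    )
    if model_args.token is not None:
        raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.")
    # Fall back to the legacy value so downstream calls keep working.
    model_args.token = model_args.use_auth_token
```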
# Load pretrained model and tokenizer # @@ -333,7 +363,7 @@ def main(): config_kwargs = { "cache_dir": model_args.cache_dir, "revision": model_args.model_revision, - "use_auth_token": True if model_args.use_auth_token else None, + "token": model_args.token, } if model_args.config_name: config = AutoConfig.from_pretrained(model_args.config_name, **config_kwargs) @@ -351,7 +381,7 @@ def main(): "cache_dir": model_args.cache_dir, "use_fast": model_args.use_fast_tokenizer, "revision": model_args.model_revision, - "use_auth_token": True if model_args.use_auth_token else None, + "token": model_args.token, } if model_args.tokenizer_name: tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs) @@ -359,7 +389,7 @@ def main(): tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs) else: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name." ) @@ -370,7 +400,8 @@ def main(): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + low_cpu_mem_usage=model_args.low_cpu_mem_usage, ) else: logger.info("Training new model from scratch") @@ -392,7 +423,7 @@ def main(): if data_args.max_seq_length > tokenizer.model_max_length: logger.warning( - f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the" + f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the " f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}." ) max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length) @@ -436,10 +467,9 @@ def group_texts(examples): # Concatenate all texts. concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()} total_length = len(concatenated_examples[list(examples.keys())[0]]) - # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can - # customize this part to your needs. - if total_length >= max_seq_length: - total_length = (total_length // max_seq_length) * max_seq_length + # We drop the small remainder, and if the total_length < max_seq_length we exclude this batch and return an empty dict. + # We could add padding if the model supported it instead of this drop, you can customize this part to your needs. + total_length = (total_length // max_seq_length) * max_seq_length # Split by chunks of max_len. result = { k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)] @@ -452,7 +482,7 @@ def group_texts(examples): # might be slower to preprocess. # # To speed up this part, we use multiprocessing. 
See the documentation of the map method for more information: - # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map + # https://huggingface.co/docs/datasets/process#map with training_args.main_process_first(desc="grouping texts together"): tokenized_datasets = tokenized_datasets.map( diff --git a/examples/pytorch/multiple-choice/README.md b/examples/pytorch/multiple-choice/README.md index 735d1f5f33a017..118234002c88a3 100644 --- a/examples/pytorch/multiple-choice/README.md +++ b/examples/pytorch/multiple-choice/README.md @@ -22,13 +22,13 @@ limitations under the License. ```bash python examples/multiple-choice/run_swag.py \ ---model_name_or_path roberta-base \ +--model_name_or_path FacebookAI/roberta-base \ --do_train \ --do_eval \ --learning_rate 5e-5 \ --num_train_epochs 3 \ --output_dir /tmp/swag_base \ ---per_gpu_eval_batch_size=16 \ +--per_device_eval_batch_size=16 \ --per_device_train_batch_size=16 \ --overwrite_output ``` @@ -62,7 +62,7 @@ then export DATASET_NAME=swag python run_swag_no_trainer.py \ - --model_name_or_path bert-base-cased \ + --model_name_or_path google-bert/bert-base-cased \ --dataset_name $DATASET_NAME \ --max_seq_length 128 \ --per_device_train_batch_size 32 \ @@ -89,7 +89,7 @@ that will check everything is ready for training. Finally, you can launch traini export DATASET_NAME=swag accelerate launch run_swag_no_trainer.py \ - --model_name_or_path bert-base-cased \ + --model_name_or_path google-bert/bert-base-cased \ --dataset_name $DATASET_NAME \ --max_seq_length 128 \ --per_device_train_batch_size 32 \ diff --git a/examples/pytorch/multiple-choice/run_swag.py b/examples/pytorch/multiple-choice/run_swag.py index a69171766af5bd..6c61bc77cdec5e 100755 --- a/examples/pytorch/multiple-choice/run_swag.py +++ b/examples/pytorch/multiple-choice/run_swag.py @@ -21,6 +21,7 @@ import logging import os import sys +import warnings from dataclasses import dataclass, field from itertools import chain from typing import Optional, Union @@ -47,7 +48,7 @@ # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") logger = logging.getLogger(__name__) @@ -79,12 +80,28 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." 
) }, ) @@ -225,6 +242,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_swag", model_args, data_args) @@ -235,6 +261,11 @@ def main(): datefmt="%m/%d/%Y %H:%M:%S", handlers=[logging.StreamHandler(sys.stdout)], ) + + if training_args.should_log: + # The default of training_args.log_level is passive, so we set log level at info here to have that default. + transformers.utils.logging.set_verbosity_info() + log_level = training_args.get_process_log_level() logger.setLevel(log_level) datasets.utils.logging.set_verbosity(log_level) @@ -244,8 +275,8 @@ def main(): # Log on each process the small summary: logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" - + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " + + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}" ) logger.info(f"Training/evaluation parameters {training_args}") @@ -280,14 +311,15 @@ def main(): data_files = {} if data_args.train_file is not None: data_files["train"] = data_args.train_file + extension = data_args.train_file.split(".")[-1] if data_args.validation_file is not None: data_files["validation"] = data_args.validation_file - extension = data_args.train_file.split(".")[-1] + extension = data_args.validation_file.split(".")[-1] raw_datasets = load_dataset( extension, data_files=data_files, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: # Downloading and loading the swag dataset from the hub. @@ -295,10 +327,10 @@ def main(): "swag", "regular", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. 
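The data-files hunk above fixes an inference bug: the file extension used to be read from `train_file` even when only `validation_file` was supplied, which broke evaluation-only runs. A minimal sketch of the corrected inference, with hypothetical file names:

```python
# Hypothetical paths; only a validation file is provided in this run.
train_file = None
validation_file = "dev.json"

data_files = {}
extension = None
if train_file is not None:
    data_files["train"] = train_file
    extension = train_file.split(".")[-1]
if validation_file is not None:
    data_files["validation"] = validation_file
    # Previously this read train_file and raised AttributeError when it was None.
    extension = validation_file.split(".")[-1]

print(extension, data_files)  # json {'validation': 'dev.json'}
```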
# Load pretrained model and tokenizer @@ -309,14 +341,16 @@ def main(): model_args.config_name if model_args.config_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) tokenizer = AutoTokenizer.from_pretrained( model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) model = AutoModelForMultipleChoice.from_pretrained( model_args.model_name_or_path, @@ -324,7 +358,8 @@ def main(): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) # When using your own dataset or a different dataset from swag, you will probably need to change this. @@ -344,7 +379,7 @@ def main(): else: if data_args.max_seq_length > tokenizer.model_max_length: logger.warning( - f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the" + f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the " f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}." ) max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length) @@ -457,14 +492,14 @@ def compute_metrics(eval_predictions): trainer.log_metrics("eval", metrics) trainer.save_metrics("eval", metrics) - kwargs = dict( - finetuned_from=model_args.model_name_or_path, - tasks="multiple-choice", - dataset_tags="swag", - dataset_args="regular", - dataset="SWAG", - language="en", - ) + kwargs = { + "finetuned_from": model_args.model_name_or_path, + "tasks": "multiple-choice", + "dataset_tags": "swag", + "dataset_args": "regular", + "dataset": "SWAG", + "language": "en", + } if training_args.push_to_hub: trainer.push_to_hub(**kwargs) diff --git a/examples/pytorch/multiple-choice/run_swag_no_trainer.py b/examples/pytorch/multiple-choice/run_swag_no_trainer.py index b0bcc567551cc7..dc2778929623c2 100755 --- a/examples/pytorch/multiple-choice/run_swag_no_trainer.py +++ b/examples/pytorch/multiple-choice/run_swag_no_trainer.py @@ -52,11 +52,11 @@ default_data_collator, get_scheduler, ) -from transformers.utils import PaddingStrategy, check_min_version, get_full_repo_name, send_example_telemetry +from transformers.utils import PaddingStrategy, check_min_version, send_example_telemetry # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") logger = get_logger(__name__) # You should update this to your particular problem to have better documentation of `model_type` @@ -90,7 +90,7 @@ def parse_args(): default=128, help=( "The maximum total input sequence length after tokenization. Sequences longer than this will be truncated," - " sequences shorter will be padded if `--pad_to_max_lengh` is passed." + " sequences shorter will be padded if `--pad_to_max_length` is passed." 
), ) parser.add_argument( @@ -182,6 +182,16 @@ def parse_args(): "--hub_model_id", type=str, help="The name of the repository to keep in sync with the local `output_dir`." ) parser.add_argument("--hub_token", type=str, help="The token to use to push to the Model Hub.") + parser.add_argument( + "--trust_remote_code", + type=bool, + default=False, + help=( + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." + ), + ) parser.add_argument( "--checkpointing_steps", type=str, @@ -205,7 +215,7 @@ def parse_args(): default="all", help=( 'The integration to report the results and logs to. Supported platforms are `"tensorboard"`,' - ' `"wandb"`, `"comet_ml"` and `"clearml"`. Use `"all"` (default) to report to all integrations.' + ' `"wandb"`, `"comet_ml"` and `"clearml"`. Use `"all"` (default) to report to all integrations. ' "Only applicable when `--with_tracking` is passed." ), ) @@ -288,7 +298,7 @@ def main(): if args.with_tracking: accelerator_log_kwargs["log_with"] = args.report_to - accelerator_log_kwargs["logging_dir"] = args.output_dir + accelerator_log_kwargs["project_dir"] = args.output_dir accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps, **accelerator_log_kwargs) @@ -313,12 +323,14 @@ def main(): # Handle the repository creation if accelerator.is_main_process: if args.push_to_hub: - if args.hub_model_id is None: - repo_name = get_full_repo_name(Path(args.output_dir).name, token=args.hub_token) - else: - repo_name = args.hub_model_id - create_repo(repo_name, exist_ok=True, token=args.hub_token) - repo = Repository(args.output_dir, clone_from=repo_name, token=args.hub_token) + # Retrieve of infer repo_name + repo_name = args.hub_model_id + if repo_name is None: + repo_name = Path(args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=args.hub_token).repo_id + # Clone repo locally + repo = Repository(args.output_dir, clone_from=repo_id, token=args.hub_token) with open(os.path.join(args.output_dir, ".gitignore"), "w+") as gitignore: if "step_*" not in gitignore: @@ -345,16 +357,17 @@ def main(): data_files = {} if args.train_file is not None: data_files["train"] = args.train_file + extension = args.train_file.split(".")[-1] if args.validation_file is not None: data_files["validation"] = args.validation_file - extension = args.train_file.split(".")[-1] + extension = args.validation_file.split(".")[-1] raw_datasets = load_dataset(extension, data_files=data_files) # Trim a number of training examples if args.debug: for split in raw_datasets.keys(): raw_datasets[split] = raw_datasets[split].select(range(100)) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. if raw_datasets["train"] is not None: column_names = raw_datasets["train"].column_names @@ -372,20 +385,24 @@ def main(): # In distributed training, the .from_pretrained methods guarantee that only one local process can concurrently # download model & vocab. 
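The Hub hunk above replaces the removed `get_full_repo_name` helper: the repository name is inferred from `--output_dir` when `--hub_model_id` is absent, and the fully-qualified `repo_id` now comes back from `create_repo` itself. A hedged sketch of that flow with placeholder values; actually running it assumes a valid Hub login or token:

```python
from pathlib import Path

from huggingface_hub import Repository, create_repo

# Placeholder values standing in for the parsed CLI arguments.
output_dir = "/tmp/swag-output"
hub_model_id = None  # --hub_model_id was not passed
hub_token = None     # falls back to the cached `huggingface-cli login` token

repo_name = hub_model_id
if repo_name is None:
    # Infer the repo name from the output directory's basename.
    repo_name = Path(output_dir).absolute().name

# create_repo returns a RepoUrl whose repo_id is already namespaced, e.g. "user/swag-output".
repo_id = create_repo(repo_name, exist_ok=True, token=hub_token).repo_id
repo = Repository(output_dir, clone_from=repo_id, token=hub_token)
```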
if args.config_name: - config = AutoConfig.from_pretrained(args.model_name_or_path) + config = AutoConfig.from_pretrained(args.model_name_or_path, trust_remote_code=args.trust_remote_code) elif args.model_name_or_path: - config = AutoConfig.from_pretrained(args.model_name_or_path) + config = AutoConfig.from_pretrained(args.model_name_or_path, trust_remote_code=args.trust_remote_code) else: config = CONFIG_MAPPING[args.model_type]() logger.warning("You are instantiating a new config instance from scratch.") if args.tokenizer_name: - tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, use_fast=not args.use_slow_tokenizer) + tokenizer = AutoTokenizer.from_pretrained( + args.tokenizer_name, use_fast=not args.use_slow_tokenizer, trust_remote_code=args.trust_remote_code + ) elif args.model_name_or_path: - tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, use_fast=not args.use_slow_tokenizer) + tokenizer = AutoTokenizer.from_pretrained( + args.model_name_or_path, use_fast=not args.use_slow_tokenizer, trust_remote_code=args.trust_remote_code + ) else: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name." ) @@ -394,10 +411,11 @@ def main(): args.model_name_or_path, from_tf=bool(".ckpt" in args.model_name_or_path), config=config, + trust_remote_code=args.trust_remote_code, ) else: logger.info("Training new model from scratch") - model = AutoModelForMultipleChoice.from_config(config) + model = AutoModelForMultipleChoice.from_config(config, trust_remote_code=args.trust_remote_code) # We resize the embeddings only when necessary to avoid index errors. If you are creating a model from scratch # on a small vocab and want a smaller embedding size, remove this test. @@ -493,8 +511,10 @@ def preprocess_function(examples): lr_scheduler = get_scheduler( name=args.lr_scheduler_type, optimizer=optimizer, - num_warmup_steps=args.num_warmup_steps * args.gradient_accumulation_steps, - num_training_steps=args.max_train_steps * args.gradient_accumulation_steps, + num_warmup_steps=args.num_warmup_steps * accelerator.num_processes, + num_training_steps=args.max_train_steps + if overrode_max_train_steps + else args.max_train_steps * accelerator.num_processes, ) # Prepare everything with our `accelerator`. 
@@ -543,36 +563,45 @@ def preprocess_function(examples): # Potentially load in the weights and states from a previous save if args.resume_from_checkpoint: if args.resume_from_checkpoint is not None or args.resume_from_checkpoint != "": - accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}") - accelerator.load_state(args.resume_from_checkpoint) + checkpoint_path = args.resume_from_checkpoint path = os.path.basename(args.resume_from_checkpoint) else: # Get the most recent checkpoint dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()] dirs.sort(key=os.path.getctime) path = dirs[-1] # Sorts folders by date modified, most recent checkpoint is the last + checkpoint_path = path + path = os.path.basename(checkpoint_path) + + accelerator.print(f"Resumed from checkpoint: {checkpoint_path}") + accelerator.load_state(checkpoint_path) # Extract `epoch_{i}` or `step_{i}` training_difference = os.path.splitext(path)[0] if "epoch" in training_difference: starting_epoch = int(training_difference.replace("epoch_", "")) + 1 resume_step = None + completed_steps = starting_epoch * num_update_steps_per_epoch else: - resume_step = int(training_difference.replace("step_", "")) + # need to multiply by `gradient_accumulation_steps` to reflect real steps + resume_step = int(training_difference.replace("step_", "")) * args.gradient_accumulation_steps starting_epoch = resume_step // len(train_dataloader) + completed_steps = resume_step // args.gradient_accumulation_steps resume_step -= starting_epoch * len(train_dataloader) + # update the progress_bar if loading from a checkpoint + progress_bar.update(completed_steps) + for epoch in range(starting_epoch, args.num_train_epochs): model.train() if args.with_tracking: total_loss = 0 - for step, batch in enumerate(train_dataloader): - # We need to skip steps until we reach the resumed step - if args.resume_from_checkpoint and epoch == starting_epoch: - if resume_step is not None and step < resume_step: - completed_steps += 1 - continue - + if args.resume_from_checkpoint and epoch == starting_epoch and resume_step is not None: + # We skip the first `n` batches in the dataloader when resuming from a checkpoint + active_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step) + else: + active_dataloader = train_dataloader + for step, batch in enumerate(active_dataloader): with accelerator.accumulate(model): outputs = model(**batch) loss = outputs.loss @@ -591,7 +620,7 @@ def preprocess_function(examples): if isinstance(checkpointing_steps, int): if completed_steps % checkpointing_steps == 0: - output_dir = f"step_{completed_steps }" + output_dir = f"step_{completed_steps}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) diff --git a/examples/pytorch/test_xla_examples.py b/examples/pytorch/old_test_xla_examples.py similarity index 97% rename from examples/pytorch/test_xla_examples.py rename to examples/pytorch/old_test_xla_examples.py index 4a29ce3beea64a..2f24035d72377b 100644 --- a/examples/pytorch/test_xla_examples.py +++ b/examples/pytorch/old_test_xla_examples.py @@ -54,7 +54,7 @@ def test_run_glue(self): ./examples/pytorch/text-classification/run_glue.py --num_cores=8 ./examples/pytorch/text-classification/run_glue.py - --model_name_or_path distilbert-base-uncased + --model_name_or_path distilbert/distilbert-base-uncased --output_dir {tmp_dir} --overwrite_output_dir --train_file ./tests/fixtures/tests_samples/MRPC/train.csv diff --git
a/examples/pytorch/question-answering/README.md b/examples/pytorch/question-answering/README.md index 6b86a4effa9508..9fac0b30385093 100644 --- a/examples/pytorch/question-answering/README.md +++ b/examples/pytorch/question-answering/README.md @@ -40,7 +40,7 @@ on a single tesla V100 16GB. ```bash python run_qa.py \ - --model_name_or_path bert-base-uncased \ + --model_name_or_path google-bert/bert-base-uncased \ --dataset_name squad \ --do_train \ --do_eval \ @@ -67,7 +67,7 @@ The [`run_qa_beam_search.py`](https://github.com/huggingface/transformers/blob/m ```bash python run_qa_beam_search.py \ - --model_name_or_path xlnet-large-cased \ + --model_name_or_path xlnet/xlnet-large-cased \ --dataset_name squad \ --do_train \ --do_eval \ @@ -87,7 +87,7 @@ python run_qa_beam_search.py \ export SQUAD_DIR=/path/to/SQUAD python run_qa_beam_search.py \ - --model_name_or_path xlnet-large-cased \ + --model_name_or_path xlnet/xlnet-large-cased \ --dataset_name squad_v2 \ --do_train \ --do_eval \ @@ -111,7 +111,7 @@ This example code fine-tunes T5 on the SQuAD2.0 dataset. ```bash python run_seq2seq_qa.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --dataset_name squad_v2 \ --context_column context \ --question_column question \ @@ -143,7 +143,7 @@ then ```bash python run_qa_no_trainer.py \ - --model_name_or_path bert-base-uncased \ + --model_name_or_path google-bert/bert-base-uncased \ --dataset_name squad \ --max_seq_length 384 \ --doc_stride 128 \ @@ -166,7 +166,7 @@ that will check everything is ready for training. Finally, you can launch traini ```bash accelerate launch run_qa_no_trainer.py \ - --model_name_or_path bert-base-uncased \ + --model_name_or_path google-bert/bert-base-uncased \ --dataset_name squad \ --max_seq_length 384 \ --doc_stride 128 \ diff --git a/examples/pytorch/question-answering/run_qa.py b/examples/pytorch/question-answering/run_qa.py index dfbfe244e206f8..021c18b84d3e70 100755 --- a/examples/pytorch/question-answering/run_qa.py +++ b/examples/pytorch/question-answering/run_qa.py @@ -21,6 +21,7 @@ import logging import os import sys +import warnings from dataclasses import dataclass, field from typing import Optional @@ -49,7 +50,7 @@ # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/question-answering/requirements.txt") @@ -79,12 +80,28 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. 
This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -227,6 +244,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_qa", model_args, data_args) @@ -238,6 +264,10 @@ def main(): handlers=[logging.StreamHandler(sys.stdout)], ) + if training_args.should_log: + # The default of training_args.log_level is passive, so we set log level at info here to have that default. + transformers.utils.logging.set_verbosity_info() + log_level = training_args.get_process_log_level() logger.setLevel(log_level) datasets.utils.logging.set_verbosity(log_level) @@ -247,8 +277,8 @@ def main(): # Log on each process the small summary: logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" - + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " + + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}" ) logger.info(f"Training/evaluation parameters {training_args}") @@ -285,7 +315,7 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: data_files = {} @@ -304,10 +334,10 @@ def main(): data_files=data_files, field="data", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. 
# Load pretrained model and tokenizer # @@ -318,14 +348,16 @@ def main(): model_args.config_name if model_args.config_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) tokenizer = AutoTokenizer.from_pretrained( model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=True, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) model = AutoModelForQuestionAnswering.from_pretrained( model_args.model_name_or_path, @@ -333,7 +365,8 @@ def main(): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) # Tokenizer check: this script requires a fast tokenizer. @@ -345,7 +378,7 @@ def main(): ) # Preprocessing the datasets. - # Preprocessing is slighlty different for training and evaluation. + # Preprocessing is slightly different for training and evaluation. if training_args.do_train: column_names = raw_datasets["train"].column_names elif training_args.do_eval: @@ -361,7 +394,7 @@ def main(): if data_args.max_seq_length > tokenizer.model_max_length: logger.warning( - f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the" + f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the " f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}." ) max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length) @@ -586,15 +619,17 @@ def post_processing_function(examples, features, predictions, stage="eval"): # Format the result to the format the metric expects. 
if data_args.version_2_with_negative: formatted_predictions = [ - {"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in predictions.items() + {"id": str(k), "prediction_text": v, "no_answer_probability": 0.0} for k, v in predictions.items() ] else: - formatted_predictions = [{"id": k, "prediction_text": v} for k, v in predictions.items()] + formatted_predictions = [{"id": str(k), "prediction_text": v} for k, v in predictions.items()] - references = [{"id": ex["id"], "answers": ex[answer_column_name]} for ex in examples] + references = [{"id": str(ex["id"]), "answers": ex[answer_column_name]} for ex in examples] return EvalPrediction(predictions=formatted_predictions, label_ids=references) - metric = evaluate.load("squad_v2" if data_args.version_2_with_negative else "squad") + metric = evaluate.load( + "squad_v2" if data_args.version_2_with_negative else "squad", cache_dir=model_args.cache_dir + ) def compute_metrics(p: EvalPrediction): return metric.compute(predictions=p.predictions, references=p.label_ids) diff --git a/examples/pytorch/question-answering/run_qa_beam_search.py b/examples/pytorch/question-answering/run_qa_beam_search.py index 4d2f5ef51d99c7..96c3b7cb6e3af9 100755 --- a/examples/pytorch/question-answering/run_qa_beam_search.py +++ b/examples/pytorch/question-answering/run_qa_beam_search.py @@ -21,6 +21,7 @@ import logging import os import sys +import warnings from dataclasses import dataclass, field from typing import Optional @@ -48,7 +49,7 @@ # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/question-answering/requirements.txt") @@ -78,15 +79,21 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) - use_auth_token: bool = field( - default=False, + token: str = field( + default=None, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." ) }, ) + use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) @dataclass @@ -226,6 +233,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. 
send_example_telemetry("run_qa_beam_search", model_args, data_args) @@ -236,6 +252,11 @@ def main(): datefmt="%m/%d/%Y %H:%M:%S", handlers=[logging.StreamHandler(sys.stdout)], ) + + if training_args.should_log: + # The default of training_args.log_level is passive, so we set log level at info here to have that default. + transformers.utils.logging.set_verbosity_info() + log_level = training_args.get_process_log_level() logger.setLevel(log_level) datasets.utils.logging.set_verbosity(log_level) @@ -245,8 +266,8 @@ def main(): # Log on each process the small summary: logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" - + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " + + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}" ) logger.info(f"Training/evaluation parameters {training_args}") @@ -283,7 +304,7 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: data_files = {} @@ -301,10 +322,10 @@ def main(): data_files=data_files, field="data", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # Load pretrained model and tokenizer # @@ -315,13 +336,13 @@ def main(): model_args.config_name if model_args.config_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) tokenizer = XLNetTokenizerFast.from_pretrained( model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) model = XLNetForQuestionAnswering.from_pretrained( model_args.model_name_or_path, @@ -329,11 +350,11 @@ def main(): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # Preprocessing the datasets. - # Preprocessing is slighlty different for training and evaluation. + # Preprocessing is slightly different for training and evaluation. if training_args.do_train: column_names = raw_datasets["train"].column_names elif training_args.do_eval: @@ -349,7 +370,7 @@ def main(): if data_args.max_seq_length > tokenizer.model_max_length: logger.warning( - f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the" + f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the " f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}." 
) max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length) @@ -626,7 +647,9 @@ def post_processing_function(examples, features, predictions, stage="eval"): references = [{"id": ex["id"], "answers": ex[answer_column_name]} for ex in examples] return EvalPrediction(predictions=formatted_predictions, label_ids=references) - metric = evaluate.load("squad_v2" if data_args.version_2_with_negative else "squad") + metric = evaluate.load( + "squad_v2" if data_args.version_2_with_negative else "squad", cache_dir=model_args.cache_dir + ) def compute_metrics(p: EvalPrediction): return metric.compute(predictions=p.predictions, references=p.label_ids) diff --git a/examples/pytorch/question-answering/run_qa_beam_search_no_trainer.py b/examples/pytorch/question-answering/run_qa_beam_search_no_trainer.py index 9372de3298f2c0..48c923740d6755 100644 --- a/examples/pytorch/question-answering/run_qa_beam_search_no_trainer.py +++ b/examples/pytorch/question-answering/run_qa_beam_search_no_trainer.py @@ -51,12 +51,12 @@ default_data_collator, get_scheduler, ) -from transformers.utils import check_min_version, get_full_repo_name, send_example_telemetry +from transformers.utils import check_min_version, send_example_telemetry from transformers.utils.versions import require_version # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/question-answering/requirements.txt") @@ -119,7 +119,7 @@ def parse_args(): default=384, help=( "The maximum total input sequence length after tokenization. Sequences longer than this will be truncated," - " sequences shorter will be padded if `--pad_to_max_lengh` is passed." + " sequences shorter will be padded if `--pad_to_max_length` is passed." 
), ) parser.add_argument( @@ -303,7 +303,7 @@ def main(): if args.with_tracking: accelerator_log_kwargs["log_with"] = args.report_to - accelerator_log_kwargs["logging_dir"] = args.output_dir + accelerator_log_kwargs["project_dir"] = args.output_dir accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps, **accelerator_log_kwargs) @@ -328,12 +328,14 @@ def main(): # Handle the repository creation if accelerator.is_main_process: if args.push_to_hub: - if args.hub_model_id is None: - repo_name = get_full_repo_name(Path(args.output_dir).name, token=args.hub_token) - else: - repo_name = args.hub_model_id - create_repo(repo_name, exist_ok=True, token=args.hub_token) - repo = Repository(args.output_dir, clone_from=repo_name, token=args.hub_token) + # Retrieve or infer repo_name + repo_name = args.hub_model_id + if repo_name is None: + repo_name = Path(args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=args.hub_token).repo_id + # Clone repo locally + repo = Repository(args.output_dir, clone_from=repo_id, token=args.hub_token) with open(os.path.join(args.output_dir, ".gitignore"), "w+") as gitignore: if "step_*" not in gitignore: @@ -360,14 +362,16 @@ def main(): data_files = {} if args.train_file is not None: data_files["train"] = args.train_file + extension = args.train_file.split(".")[-1] if args.validation_file is not None: data_files["validation"] = args.validation_file + extension = args.validation_file.split(".")[-1] if args.test_file is not None: data_files["test"] = args.test_file - extension = args.train_file.split(".")[-1] + extension = args.test_file.split(".")[-1] raw_datasets = load_dataset(extension, data_files=data_files, field="data") # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # Load pretrained model and tokenizer # @@ -381,7 +385,7 @@ def main(): ) # Preprocessing the datasets. - # Preprocessing is slighlty different for training and evaluation. + # Preprocessing is slightly different for training and evaluation. column_names = raw_datasets["train"].column_names question_column_name = "question" if "question" in column_names else column_names[0] @@ -393,7 +397,7 @@ def main(): if args.max_seq_length > tokenizer.model_max_length: logger.warning( - f"The max_seq_length passed ({args.max_seq_length}) is larger than the maximum length for the" + f"The max_seq_length passed ({args.max_seq_length}) is larger than the maximum length for the " f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}."
) @@ -504,7 +508,7 @@ def prepare_train_features(examples): raise ValueError("--do_train requires a train dataset") train_dataset = raw_datasets["train"] if args.max_train_samples is not None: - # We will select sample from whole data if agument is specified + # We will select sample from whole data if argument is specified train_dataset = train_dataset.select(range(args.max_train_samples)) # Create train feature from dataset with accelerator.main_process_first(): @@ -748,8 +752,10 @@ def create_and_fill_np_array(start_or_end_logits, dataset, max_len): lr_scheduler = get_scheduler( name=args.lr_scheduler_type, optimizer=optimizer, - num_warmup_steps=args.num_warmup_steps * args.gradient_accumulation_steps, - num_training_steps=args.max_train_steps * args.gradient_accumulation_steps, + num_warmup_steps=args.num_warmup_steps * accelerator.num_processes, + num_training_steps=args.max_train_steps + if overrode_max_train_steps + else args.max_train_steps * accelerator.num_processes, ) # Prepare everything with our `accelerator`. @@ -795,36 +801,45 @@ def create_and_fill_np_array(start_or_end_logits, dataset, max_len): # Potentially load in the weights and states from a previous save if args.resume_from_checkpoint: if args.resume_from_checkpoint is not None or args.resume_from_checkpoint != "": - accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}") - accelerator.load_state(args.resume_from_checkpoint) + checkpoint_path = args.resume_from_checkpoint path = os.path.basename(args.resume_from_checkpoint) else: # Get the most recent checkpoint dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()] dirs.sort(key=os.path.getctime) path = dirs[-1] # Sorts folders by date modified, most recent checkpoint is the last + checkpoint_path = path + path = os.path.basename(checkpoint_path) + + accelerator.print(f"Resumed from checkpoint: {checkpoint_path}") + accelerator.load_state(checkpoint_path) # Extract `epoch_{i}` or `step_{i}` training_difference = os.path.splitext(path)[0] if "epoch" in training_difference: starting_epoch = int(training_difference.replace("epoch_", "")) + 1 resume_step = None + completed_steps = starting_epoch * num_update_steps_per_epoch else: - resume_step = int(training_difference.replace("step_", "")) + # need to multiply by `gradient_accumulation_steps` to reflect real steps + resume_step = int(training_difference.replace("step_", "")) * args.gradient_accumulation_steps starting_epoch = resume_step // len(train_dataloader) + completed_steps = resume_step // args.gradient_accumulation_steps resume_step -= starting_epoch * len(train_dataloader) + # update the progress_bar if loading from a checkpoint + progress_bar.update(completed_steps) + for epoch in range(starting_epoch, args.num_train_epochs): model.train() if args.with_tracking: total_loss = 0 - for step, batch in enumerate(train_dataloader): - # We need to skip steps until we reach the resumed step - if args.resume_from_checkpoint and epoch == starting_epoch: - if resume_step is not None and step < resume_step: - completed_steps += 1 - continue - + if args.resume_from_checkpoint and epoch == starting_epoch and resume_step is not None: + # We skip the first `n` batches in the dataloader when resuming from a checkpoint + active_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step) + else: + active_dataloader = train_dataloader + for step, batch in enumerate(active_dataloader): with accelerator.accumulate(model): outputs = model(**batch) loss = outputs.loss @@ -862,7 +877,7 @@ def
create_and_fill_np_array(start_or_end_logits, dataset, max_len): commit_message=f"Training in progress epoch {epoch}", blocking=False, auto_lfs_prune=True ) - # intialize all lists to collect the batches + # initialize all lists to collect the batches all_start_top_log_probs = [] all_start_top_index = [] all_end_top_log_probs = [] @@ -921,7 +936,7 @@ def create_and_fill_np_array(start_or_end_logits, dataset, max_len): logger.info(f"Evaluation metrics: {eval_metric}") if args.do_predict: - # intialize all lists to collect the batches + # initialize all lists to collect the batches all_start_top_log_probs = [] all_start_top_index = [] diff --git a/examples/pytorch/question-answering/run_qa_no_trainer.py b/examples/pytorch/question-answering/run_qa_no_trainer.py index 6cbea37151da8e..a72f70b08aa179 100755 --- a/examples/pytorch/question-answering/run_qa_no_trainer.py +++ b/examples/pytorch/question-answering/run_qa_no_trainer.py @@ -52,12 +52,12 @@ default_data_collator, get_scheduler, ) -from transformers.utils import check_min_version, get_full_repo_name, send_example_telemetry +from transformers.utils import check_min_version, send_example_telemetry from transformers.utils.versions import require_version # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/question-answering/requirements.txt") @@ -123,7 +123,7 @@ def parse_args(): default=384, help=( "The maximum total input sequence length after tokenization. Sequences longer than this will be truncated," - " sequences shorter will be padded if `--pad_to_max_lengh` is passed." + " sequences shorter will be padded if `--pad_to_max_length` is passed." ), ) parser.add_argument( @@ -273,6 +273,16 @@ def parse_args(): "--hub_model_id", type=str, help="The name of the repository to keep in sync with the local `output_dir`." ) parser.add_argument("--hub_token", type=str, help="The token to use to push to the Model Hub.") + parser.add_argument( + "--trust_remote_code", + type=bool, + default=False, + help=( + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." + ), + ) parser.add_argument( "--checkpointing_steps", type=str, @@ -296,7 +306,7 @@ def parse_args(): default="all", help=( 'The integration to report the results and logs to. Supported platforms are `"tensorboard"`,' - ' `"wandb"`, `"comet_ml"` and `"clearml"`. Use `"all"` (default) to report to all integrations.' + ' `"wandb"`, `"comet_ml"` and `"clearml"`. Use `"all"` (default) to report to all integrations. ' "Only applicable when `--with_tracking` is passed." 
), ) @@ -341,7 +351,7 @@ def main(): if args.with_tracking: accelerator_log_kwargs["log_with"] = args.report_to - accelerator_log_kwargs["logging_dir"] = args.output_dir + accelerator_log_kwargs["project_dir"] = args.output_dir accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps, **accelerator_log_kwargs) @@ -366,12 +376,14 @@ def main(): # Handle the repository creation if accelerator.is_main_process: if args.push_to_hub: - if args.hub_model_id is None: - repo_name = get_full_repo_name(Path(args.output_dir).name, token=args.hub_token) - else: - repo_name = args.hub_model_id - create_repo(repo_name, exist_ok=True, token=args.hub_token) - repo = Repository(args.output_dir, clone_from=repo_name, token=args.hub_token) + # Retrieve or infer repo_name + repo_name = args.hub_model_id + if repo_name is None: + repo_name = Path(args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=args.hub_token).repo_id + # Clone repo locally + repo = Repository(args.output_dir, clone_from=repo_id, token=args.hub_token) with open(os.path.join(args.output_dir, ".gitignore"), "w+") as gitignore: if "step_*" not in gitignore: @@ -398,14 +410,16 @@ def main(): data_files = {} if args.train_file is not None: data_files["train"] = args.train_file + extension = args.train_file.split(".")[-1] if args.validation_file is not None: data_files["validation"] = args.validation_file + extension = args.validation_file.split(".")[-1] if args.test_file is not None: data_files["test"] = args.test_file - extension = args.train_file.split(".")[-1] + extension = args.test_file.split(".")[-1] raw_datasets = load_dataset(extension, data_files=data_files, field="data") # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # Load pretrained model and tokenizer # @@ -413,20 +427,24 @@ def main(): # download model & vocab. if args.config_name: - config = AutoConfig.from_pretrained(args.config_name) + config = AutoConfig.from_pretrained(args.config_name, trust_remote_code=args.trust_remote_code) elif args.model_name_or_path: - config = AutoConfig.from_pretrained(args.model_name_or_path) + config = AutoConfig.from_pretrained(args.model_name_or_path, trust_remote_code=args.trust_remote_code) else: config = CONFIG_MAPPING[args.model_type]() logger.warning("You are instantiating a new config instance from scratch.") if args.tokenizer_name: - tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, use_fast=True) + tokenizer = AutoTokenizer.from_pretrained( + args.tokenizer_name, use_fast=True, trust_remote_code=args.trust_remote_code + ) elif args.model_name_or_path: - tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, use_fast=True) + tokenizer = AutoTokenizer.from_pretrained( + args.model_name_or_path, use_fast=True, trust_remote_code=args.trust_remote_code + ) else: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name."
) @@ -435,13 +453,14 @@ def main(): args.model_name_or_path, from_tf=bool(".ckpt" in args.model_name_or_path), config=config, + trust_remote_code=args.trust_remote_code, ) else: logger.info("Training new model from scratch") - model = AutoModelForQuestionAnswering.from_config(config) + model = AutoModelForQuestionAnswering.from_config(config, trust_remote_code=args.trust_remote_code) # Preprocessing the datasets. - # Preprocessing is slighlty different for training and evaluation. + # Preprocessing is slightly different for training and evaluation. column_names = raw_datasets["train"].column_names @@ -454,7 +473,7 @@ def main(): if args.max_seq_length > tokenizer.model_max_length: logger.warning( - f"The max_seq_length passed ({args.max_seq_length}) is larger than the maximum length for the" + f"The max_seq_length passed ({args.max_seq_length}) is larger than the maximum length for the " f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}." ) @@ -542,7 +561,7 @@ def prepare_train_features(examples): raise ValueError("--do_train requires a train dataset") train_dataset = raw_datasets["train"] if args.max_train_samples is not None: - # We will select sample from whole data if agument is specified + # We will select sample from whole data if argument is specified train_dataset = train_dataset.select(range(args.max_train_samples)) # Create train feature from dataset @@ -763,8 +782,10 @@ def create_and_fill_np_array(start_or_end_logits, dataset, max_len): lr_scheduler = get_scheduler( name=args.lr_scheduler_type, optimizer=optimizer, - num_warmup_steps=args.num_warmup_steps * args.gradient_accumulation_steps, - num_training_steps=args.max_train_steps * args.gradient_accumulation_steps, + num_warmup_steps=args.num_warmup_steps * accelerator.num_processes, + num_training_steps=args.max_train_steps + if overrode_max_train_steps + else args.max_train_steps * accelerator.num_processes, ) # Prepare everything with our `accelerator`. 
@@ -811,36 +832,45 @@ def create_and_fill_np_array(start_or_end_logits, dataset, max_len): # Potentially load in the weights and states from a previous save if args.resume_from_checkpoint: if args.resume_from_checkpoint is not None or args.resume_from_checkpoint != "": - accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}") - accelerator.load_state(args.resume_from_checkpoint) + checkpoint_path = args.resume_from_checkpoint path = os.path.basename(args.resume_from_checkpoint) else: # Get the most recent checkpoint dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()] dirs.sort(key=os.path.getctime) path = dirs[-1] # Sorts folders by date modified, most recent checkpoint is the last + checkpoint_path = path + path = os.path.basename(checkpoint_path) + + accelerator.print(f"Resumed from checkpoint: {checkpoint_path}") + accelerator.load_state(checkpoint_path) # Extract `epoch_{i}` or `step_{i}` training_difference = os.path.splitext(path)[0] if "epoch" in training_difference: starting_epoch = int(training_difference.replace("epoch_", "")) + 1 resume_step = None + completed_steps = starting_epoch * num_update_steps_per_epoch else: - resume_step = int(training_difference.replace("step_", "")) + # need to multiply by `gradient_accumulation_steps` to reflect real steps + resume_step = int(training_difference.replace("step_", "")) * args.gradient_accumulation_steps starting_epoch = resume_step // len(train_dataloader) + completed_steps = resume_step // args.gradient_accumulation_steps resume_step -= starting_epoch * len(train_dataloader) + # update the progress_bar if loading from a checkpoint + progress_bar.update(completed_steps) + for epoch in range(starting_epoch, args.num_train_epochs): model.train() if args.with_tracking: total_loss = 0 - for step, batch in enumerate(train_dataloader): - # We need to skip steps until we reach the resumed step - if args.resume_from_checkpoint and epoch == starting_epoch: - if resume_step is not None and step < resume_step: - completed_steps += 1 - continue - + if args.resume_from_checkpoint and epoch == starting_epoch and resume_step is not None: + # We skip the first `n` batches in the dataloader when resuming from a checkpoint + active_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step) + else: + active_dataloader = train_dataloader + for step, batch in enumerate(active_dataloader): with accelerator.accumulate(model): outputs = model(**batch) loss = outputs.loss @@ -860,7 +890,7 @@ def create_and_fill_np_array(start_or_end_logits, dataset, max_len): if isinstance(checkpointing_steps, int): if completed_steps % checkpointing_steps == 0: - output_dir = f"step_{completed_steps }" + output_dir = f"step_{completed_steps}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) diff --git a/examples/pytorch/question-answering/run_seq2seq_qa.py b/examples/pytorch/question-answering/run_seq2seq_qa.py index 5fe5c1bddc6cf9..8916e721e56add 100644 --- a/examples/pytorch/question-answering/run_seq2seq_qa.py +++ b/examples/pytorch/question-answering/run_seq2seq_qa.py @@ -21,11 +21,13 @@ import logging import os import sys +import warnings from dataclasses import dataclass, field from typing import List, Optional, Tuple import datasets import evaluate +import numpy as np from datasets import load_dataset from trainer_seq2seq_qa import QuestionAnsweringSeq2SeqTrainer @@ -45,7 +47,7 @@ # Will error if the minimal version of Transformers is not installed.
Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/question-answering/requirements.txt") @@ -79,12 +81,28 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -153,7 +171,7 @@ class DataTrainingArguments: metadata={ "help": ( "The maximum total sequence length for validation target text after tokenization. Sequences longer " - "than this will be truncated, sequences shorter will be padded. Will default to `max_answer_length`." + "than this will be truncated, sequences shorter will be padded. Will default to `max_answer_length`. " "This argument is also used to override the ``max_length`` param of ``model.generate``, which is used " "during ``evaluate`` and ``predict``." ) @@ -272,6 +290,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_seq2seq_qa", model_args, data_args) @@ -283,6 +310,10 @@ def main(): handlers=[logging.StreamHandler(sys.stdout)], ) + if training_args.should_log: + # The default of training_args.log_level is passive, so we set log level at info here to have that default. 
+ transformers.utils.logging.set_verbosity_info() + log_level = training_args.get_process_log_level() logger.setLevel(log_level) datasets.utils.logging.set_verbosity(log_level) @@ -292,8 +323,8 @@ def main(): # Log on each process the small summary: logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" - + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " + + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}" ) logger.info(f"Training/evaluation parameters {training_args}") @@ -330,7 +361,7 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: data_files = {} @@ -348,10 +379,10 @@ def main(): data_files=data_files, field="data", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # Load pretrained model and tokenizer # @@ -362,14 +393,16 @@ def main(): model_args.config_name if model_args.config_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) tokenizer = AutoTokenizer.from_pretrained( model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) model = AutoModelForSeq2SeqLM.from_pretrained( model_args.model_name_or_path, @@ -377,7 +410,8 @@ def main(): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) # We resize the embeddings only when necessary to avoid index errors. If you are creating a model from scratch @@ -434,13 +468,13 @@ def main(): if training_args.label_smoothing_factor > 0 and not hasattr(model, "prepare_decoder_input_ids_from_labels"): logger.warning( - "label_smoothing is enabled but the `prepare_decoder_input_ids_from_labels` method is not defined for" + "label_smoothing is enabled but the `prepare_decoder_input_ids_from_labels` method is not defined for " f"`{model.__class__.__name__}`. This will lead to loss being calculated twice and will take up more memory" ) if data_args.max_seq_length > tokenizer.model_max_length: logger.warning( - f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the" + f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the " f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}." 
) max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length) @@ -525,7 +559,7 @@ def preprocess_validation_function(examples): raise ValueError("--do_train requires a train dataset") train_dataset = raw_datasets["train"] if data_args.max_train_samples is not None: - # We will select sample from whole data if agument is specified + # We will select sample from whole data if argument is specified max_train_samples = min(len(train_dataset), data_args.max_train_samples) train_dataset = train_dataset.select(range(max_train_samples)) # Create train feature from dataset @@ -597,7 +631,9 @@ def preprocess_validation_function(examples): pad_to_multiple_of=8 if training_args.fp16 else None, ) - metric = evaluate.load("squad_v2" if data_args.version_2_with_negative else "squad") + metric = evaluate.load( + "squad_v2" if data_args.version_2_with_negative else "squad", cache_dir=model_args.cache_dir + ) def compute_metrics(p: EvalPrediction): return metric.compute(predictions=p.predictions, references=p.label_ids) @@ -610,6 +646,8 @@ def post_processing_function( preds = outputs.predictions if isinstance(preds, tuple): preds = preds[0] + # Replace -100s used for padding as we can't decode them + preds = np.where(preds != -100, preds, tokenizer.pad_token_id) decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True) # Build a map example to its corresponding features. diff --git a/examples/pytorch/question-answering/trainer_seq2seq_qa.py b/examples/pytorch/question-answering/trainer_seq2seq_qa.py index 6abb41b33feb8c..bdf82bda9f3678 100644 --- a/examples/pytorch/question-answering/trainer_seq2seq_qa.py +++ b/examples/pytorch/question-answering/trainer_seq2seq_qa.py @@ -46,12 +46,13 @@ def evaluate( **gen_kwargs, ) -> Dict[str, float]: gen_kwargs = gen_kwargs.copy() - gen_kwargs["max_length"] = ( - gen_kwargs["max_length"] if gen_kwargs.get("max_length") is not None else self.args.generation_max_length - ) - gen_kwargs["num_beams"] = ( - gen_kwargs["num_beams"] if gen_kwargs.get("num_beams") is not None else self.args.generation_num_beams - ) + + # Use legacy argument setting if a) the option is not explicitly passed; and b) the argument is set in the + # training args + if gen_kwargs.get("max_length") is None and self.args.generation_max_length is not None: + gen_kwargs["max_length"] = self.args.generation_max_length + if gen_kwargs.get("num_beams") is None and self.args.generation_num_beams is not None: + gen_kwargs["num_beams"] = self.args.generation_num_beams self._gen_kwargs = gen_kwargs eval_dataset = self.eval_dataset if eval_dataset is None else eval_dataset diff --git a/examples/pytorch/semantic-segmentation/run_semantic_segmentation.py b/examples/pytorch/semantic-segmentation/run_semantic_segmentation.py index b1583aca1f0cac..8c78d6435c91d6 100644 --- a/examples/pytorch/semantic-segmentation/run_semantic_segmentation.py +++ b/examples/pytorch/semantic-segmentation/run_semantic_segmentation.py @@ -18,6 +18,7 @@ import os import random import sys +import warnings from dataclasses import dataclass, field from typing import Optional @@ -51,7 +52,7 @@ logger = logging.getLogger(__name__) # Will error if the minimal version of Transformers is not installed. Remove at your own risks. 
-check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=2.0.0", "To fix: pip install -r examples/pytorch/semantic-segmentation/requirements.txt") @@ -241,12 +242,28 @@ class ModelArguments: metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) image_processor_name: str = field(default=None, metadata={"help": "Name or path of preprocessor config."}) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -265,6 +282,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_semantic_segmentation", model_args, data_args) @@ -276,6 +302,10 @@ def main(): handlers=[logging.StreamHandler(sys.stdout)], ) + if training_args.should_log: + # The default of training_args.log_level is passive, so we set log level at info here to have that default. + transformers.utils.logging.set_verbosity_info() + log_level = training_args.get_process_log_level() logger.setLevel(log_level) transformers.utils.logging.set_verbosity(log_level) @@ -284,8 +314,8 @@ def main(): # Log on each process the small summary: logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" - + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " + + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}" ) logger.info(f"Training/evaluation parameters {training_args}") @@ -336,7 +366,7 @@ def main(): label2id = {v: str(k) for k, v in id2label.items()} # Load the mean IoU metric from the datasets package - metric = evaluate.load("mean_iou") + metric = evaluate.load("mean_iou", cache_dir=model_args.cache_dir) # Define our compute_metrics function. 
It takes an `EvalPrediction` object (a namedtuple with a # predictions and label_ids field) and has to return a dictionary string to float. @@ -375,7 +405,8 @@ def compute_metrics(eval_pred): id2label=id2label, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) model = AutoModelForSemanticSegmentation.from_pretrained( model_args.model_name_or_path, @@ -383,13 +414,15 @@ def compute_metrics(eval_pred): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) image_processor = AutoImageProcessor.from_pretrained( model_args.image_processor_name or model_args.model_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) # Define torchvision transforms to be applied to each image + target. @@ -430,7 +463,7 @@ def preprocess_train(example_batch): pixel_values.append(image) labels.append(target) - encoding = dict() + encoding = {} encoding["pixel_values"] = torch.stack(pixel_values) encoding["labels"] = torch.stack(labels) @@ -444,7 +477,7 @@ def preprocess_val(example_batch): pixel_values.append(image) labels.append(target) - encoding = dict() + encoding = {} encoding["pixel_values"] = torch.stack(pixel_values) encoding["labels"] = torch.stack(labels) @@ -470,7 +503,7 @@ def preprocess_val(example_batch): # Set the validation transforms dataset["validation"].set_transform(preprocess_val) - # Initalize our trainer + # Initialize our trainer trainer = Trainer( model=model, args=training_args, diff --git a/examples/pytorch/semantic-segmentation/run_semantic_segmentation_no_trainer.py b/examples/pytorch/semantic-segmentation/run_semantic_segmentation_no_trainer.py index 68919e0cc5c57c..b80f6b71ec062b 100644 --- a/examples/pytorch/semantic-segmentation/run_semantic_segmentation_no_trainer.py +++ b/examples/pytorch/semantic-segmentation/run_semantic_segmentation_no_trainer.py @@ -45,12 +45,12 @@ default_data_collator, get_scheduler, ) -from transformers.utils import check_min_version, get_full_repo_name, send_example_telemetry +from transformers.utils import check_min_version, send_example_telemetry from transformers.utils.versions import require_version # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") logger = get_logger(__name__) @@ -273,6 +273,16 @@ def parse_args(): "--hub_model_id", type=str, help="The name of the repository to keep in sync with the local `output_dir`." ) parser.add_argument("--hub_token", type=str, help="The token to use to push to the Model Hub.") + parser.add_argument( + "--trust_remote_code", + type=bool, + default=False, + help=( + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." + ), + ) parser.add_argument( "--checkpointing_steps", type=str, @@ -297,7 +307,7 @@ def parse_args(): default="all", help=( 'The integration to report the results and logs to. 
Supported platforms are `"tensorboard"`,' - ' `"wandb"`, `"comet_ml"` and `"clearml"`. Use `"all"` (default) to report to all integrations.' + ' `"wandb"`, `"comet_ml"` and `"clearml"`. Use `"all"` (default) to report to all integrations. ' "Only applicable when `--with_tracking` is passed." ), ) @@ -330,7 +340,7 @@ def main(): if args.with_tracking: accelerator_log_kwargs["log_with"] = args.report_to - accelerator_log_kwargs["logging_dir"] = args.output_dir + accelerator_log_kwargs["project_dir"] = args.output_dir accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps, **accelerator_log_kwargs) @@ -350,12 +360,14 @@ def main(): # Handle the repository creation if accelerator.is_main_process: if args.push_to_hub: - if args.hub_model_id is None: - repo_name = get_full_repo_name(Path(args.output_dir).name, token=args.hub_token) - else: - repo_name = args.hub_model_id - create_repo(repo_name, exist_ok=True, token=args.hub_token) - repo = Repository(args.output_dir, clone_from=repo_name, token=args.hub_token) + # Retrieve of infer repo_name + repo_name = args.hub_model_id + if repo_name is None: + repo_name = Path(args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=args.hub_token).repo_id + # Clone repo locally + repo = Repository(args.output_dir, clone_from=repo_id, token=args.hub_token) with open(os.path.join(args.output_dir, ".gitignore"), "w+") as gitignore: if "step_*" not in gitignore: @@ -398,9 +410,15 @@ def main(): label2id = {v: k for k, v in id2label.items()} # Load pretrained model and image processor - config = AutoConfig.from_pretrained(args.model_name_or_path, id2label=id2label, label2id=label2id) - image_processor = AutoImageProcessor.from_pretrained(args.model_name_or_path) - model = AutoModelForSemanticSegmentation.from_pretrained(args.model_name_or_path, config=config) + config = AutoConfig.from_pretrained( + args.model_name_or_path, id2label=id2label, label2id=label2id, trust_remote_code=args.trust_remote_code + ) + image_processor = AutoImageProcessor.from_pretrained( + args.model_name_or_path, trust_remote_code=args.trust_remote_code + ) + model = AutoModelForSemanticSegmentation.from_pretrained( + args.model_name_or_path, config=config, trust_remote_code=args.trust_remote_code + ) # Preprocessing the datasets # Define torchvision transforms to be applied to each image + target. @@ -441,7 +459,7 @@ def preprocess_train(example_batch): pixel_values.append(image) labels.append(target) - encoding = dict() + encoding = {} encoding["pixel_values"] = torch.stack(pixel_values) encoding["labels"] = torch.stack(labels) @@ -455,7 +473,7 @@ def preprocess_val(example_batch): pixel_values.append(image) labels.append(target) - encoding = dict() + encoding = {} encoding["pixel_values"] = torch.stack(pixel_values) encoding["labels"] = torch.stack(labels) @@ -495,8 +513,10 @@ def preprocess_val(example_batch): lr_scheduler = get_scheduler( name=args.lr_scheduler_type, optimizer=optimizer, - num_warmup_steps=args.num_warmup_steps * args.gradient_accumulation_steps, - num_training_steps=args.max_train_steps * args.gradient_accumulation_steps, + num_warmup_steps=args.num_warmup_steps * accelerator.num_processes, + num_training_steps=args.max_train_steps + if overrode_max_train_steps + else args.max_train_steps * accelerator.num_processes, ) # Prepare everything with our `accelerator`. 
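A minimal, standalone sketch of the `create_repo` + `Repository` pattern that the hub-upload hunks above switch to (replacing the removed `get_full_repo_name` helper) — the `output_dir`, `hub_model_id`, and `hub_token` values here are placeholders, not taken from the scripts:

```python
from pathlib import Path

from huggingface_hub import Repository, create_repo

output_dir = "output"   # hypothetical local training directory
hub_model_id = None     # e.g. "username/my-model"; None means "infer from output_dir"
hub_token = None        # falls back to the token stored by `huggingface-cli login`

# Retrieve or infer the repo name, as the updated scripts do
repo_name = hub_model_id or Path(output_dir).absolute().name

# Create the repo (no-op if it already exists) and keep its fully qualified id
repo_id = create_repo(repo_name, exist_ok=True, token=hub_token).repo_id

# Clone it locally so checkpoints can be pushed during training
repo = Repository(output_dir, clone_from=repo_id, token=hub_token)
```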
@@ -512,7 +532,7 @@ def preprocess_val(example_batch): args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch) # Instantiate metric - metric = evaluate.load("mean_iou") + metric = evaluate.load("mean_iou", cache_dir=args.cache_dir) # We need to initialize the trackers we use, and also store our configuration. # The trackers initializes automatically on the main process. @@ -540,36 +560,45 @@ def preprocess_val(example_batch): # Potentially load in the weights and states from a previous save if args.resume_from_checkpoint: if args.resume_from_checkpoint is not None or args.resume_from_checkpoint != "": - accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}") - accelerator.load_state(args.resume_from_checkpoint) + checkpoint_path = args.resume_from_checkpoint path = os.path.basename(args.resume_from_checkpoint) else: # Get the most recent checkpoint dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()] dirs.sort(key=os.path.getctime) path = dirs[-1] # Sorts folders by date modified, most recent checkpoint is the last + checkpoint_path = path + path = os.path.basename(checkpoint_path) + + accelerator.print(f"Resumed from checkpoint: {checkpoint_path}") + accelerator.load_state(checkpoint_path) # Extract `epoch_{i}` or `step_{i}` training_difference = os.path.splitext(path)[0] if "epoch" in training_difference: starting_epoch = int(training_difference.replace("epoch_", "")) + 1 resume_step = None + completed_steps = starting_epoch * num_update_steps_per_epoch else: - resume_step = int(training_difference.replace("step_", "")) + # need to multiply `gradient_accumulation_steps` to reflect real steps + resume_step = int(training_difference.replace("step_", "")) * args.gradient_accumulation_steps starting_epoch = resume_step // len(train_dataloader) + completed_steps = resume_step // args.gradient_accumulation_steps resume_step -= starting_epoch * len(train_dataloader) + # update the progress_bar if load from checkpoint + progress_bar.update(completed_steps) + for epoch in range(starting_epoch, args.num_train_epochs): + model.train() if args.with_tracking: total_loss = 0 - model.train() - for step, batch in enumerate(train_dataloader): - # We need to skip steps until we reach the resumed step - if args.resume_from_checkpoint and epoch == starting_epoch: - if resume_step is not None and step < resume_step: - completed_steps += 1 - continue - + if args.resume_from_checkpoint and epoch == starting_epoch and resume_step is not None: + # We skip the first `n` batches in the dataloader when resuming from a checkpoint + active_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step) + else: + active_dataloader = train_dataloader + for step, batch in enumerate(active_dataloader): with accelerator.accumulate(model): outputs = model(**batch) loss = outputs.loss @@ -588,7 +617,7 @@ def preprocess_val(example_batch): if isinstance(checkpointing_steps, int): if completed_steps % checkpointing_steps == 0: - output_dir = f"step_{completed_steps }" + output_dir = f"step_{completed_steps}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) @@ -682,7 +711,9 @@ def preprocess_val(example_batch): if args.push_to_hub: repo.push_to_hub(commit_message="End of training", auto_lfs_prune=True) - all_results = {f"eval_{k}": v for k, v in eval_metrics.items()} + all_results = { + f"eval_{k}": v.tolist() if isinstance(v, np.ndarray) else v for k, v in eval_metrics.items() + } 
with open(os.path.join(args.output_dir, "all_results.json"), "w") as f: json.dump(all_results, f) diff --git a/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py b/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py index 603202e696cf9c..6bde6d2b7d0f12 100755 --- a/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py +++ b/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py @@ -43,7 +43,7 @@ set_seed, ) from transformers.models.wav2vec2.modeling_wav2vec2 import _compute_mask_indices, _sample_negative_indices -from transformers.utils import get_full_repo_name, send_example_telemetry +from transformers.utils import send_example_telemetry logger = get_logger(__name__) @@ -418,12 +418,14 @@ def main(): # Handle the repository creation if accelerator.is_main_process: if args.push_to_hub and not args.preprocessing_only: - if args.hub_model_id is None: - repo_name = get_full_repo_name(Path(args.output_dir).name, token=args.hub_token) - else: - repo_name = args.hub_model_id - create_repo(repo_name, exist_ok=True, token=args.hub_token) - repo = Repository(args.output_dir, clone_from=repo_name, token=args.hub_token) + # Retrieve of infer repo_name + repo_name = args.hub_model_id + if repo_name is None: + repo_name = Path(args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=args.hub_token).repo_id + # Clone repo locally + repo = Repository(args.output_dir, clone_from=repo_id, token=args.hub_token) elif args.output_dir is not None: os.makedirs(args.output_dir, exist_ok=True) accelerator.wait_for_everyone() diff --git a/examples/pytorch/speech-recognition/README.md b/examples/pytorch/speech-recognition/README.md index cf5a05c017839f..8dbfcafe3405f9 100644 --- a/examples/pytorch/speech-recognition/README.md +++ b/examples/pytorch/speech-recognition/README.md @@ -26,6 +26,10 @@ limitations under the License. - [Librispeech](#librispeech-ctc) - [Common Voice](#common-voice-ctc) - [Multilingual Librispeech](#multilingual-librispeech-ctc) +- [Automatic Speech Recognition with CTC and Adapter Layers](#connectionist-temporal-classification-with-adapters) + - [Massive Multilingual Speech (MMS)](#mms-model) + - [Examples](#examples-ctc-adapter) + - [Common Voice](#common-voice-ctc-adapter) - [Automatic Speech Recognition with Sequence-to-Sequence](#sequence-to-sequence) - [Whisper Model](#whisper-model) - [Speech-Encoder-Decoder Model](#warm-started-speech-encoder-decoder-model) @@ -96,7 +100,7 @@ of **0.35**. The following command shows how to fine-tune [XLSR-Wav2Vec2](https://huggingface.co/transformers/main/model_doc/xlsr_wav2vec2.html) on [Common Voice](https://huggingface.co/datasets/common_voice) using 8 GPUs in half-precision. ```bash -python -m torch.distributed.launch \ +torchrun \ --nproc_per_node 8 run_speech_recognition_ctc.py \ --dataset_name="common_voice" \ --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \ @@ -130,7 +134,7 @@ of **0.36**. ### Multi GPU CTC with Dataset Streaming -The following command shows how to use [Dataset Streaming mode](https://huggingface.co/docs/datasets/dataset_streaming.html) +The following command shows how to use [Dataset Streaming mode](https://huggingface.co/docs/datasets/dataset_streaming) to fine-tune [XLS-R](https://huggingface.co/transformers/main/model_doc/xls_r.html) on [Common Voice](https://huggingface.co/datasets/common_voice) using 4 GPUs in half-precision. 
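Before the launch command below, a rough sketch of what streaming mode means on the `datasets` side — the dataset id, config, and buffer size are illustrative placeholders rather than values taken from this patch:

```python
from datasets import load_dataset

# Streaming returns an IterableDataset: samples are downloaded on the fly
# instead of materializing the full dataset on disk first.
raw_train = load_dataset("common_voice", "tr", split="train", streaming=True)

# Shuffling operates on a rolling buffer; this is what `--shuffle_buffer_size` controls.
raw_train = raw_train.shuffle(seed=42, buffer_size=500)

# Iterate lazily over the first few examples
for i, sample in enumerate(raw_train):
    print(sample["sentence"])
    if i == 2:
        break
```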
@@ -143,7 +147,7 @@ However, the `--shuffle_buffer_size` argument controls how many examples we can ```bash -**python -m torch.distributed.launch \ +**torchrun \ --nproc_per_node 4 run_speech_recognition_ctc_streaming.py \ --dataset_name="common_voice" \ --model_name_or_path="facebook/wav2vec2-xls-r-300m" \ @@ -243,6 +247,111 @@ they can serve as a baseline to improve upon. | [Multilingual Librispeech](https://huggingface.co/datasets/multilingual_librispeech)| `"german"` | [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) | 0.13 | - | 1 GPU Titan 24 GB RAM | 15h04 | [here](https://huggingface.co/patrickvonplaten/wav2vec2-xlsr-53-300m-mls-german-ft) | [run.sh](https://huggingface.co/patrickvonplaten/wav2vec2-xlsr-53-300m-mls-german-ft/blob/main/run.sh) | | [Multilingual Librispeech](https://huggingface.co/datasets/multilingual_librispeech)| `"german"` | [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) | 0.15 | - | 1 GPU Titan 24 GB RAM | 15h04 | [here](https://huggingface.co/patrickvonplaten/wav2vec2-300m-mls-german-ft) | [run.sh](https://huggingface.co/patrickvonplaten/wav2vec2-300m-mls-german-ft/blob/main/run.sh) | +## Connectionist Temporal Classification With Adapters + +The script [`run_speech_recognition_ctc_adapter.py`](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_ctc_adapter.py) can be used to fine-tune adapter layers for [Wav2Vec2-like models like MMS](https://huggingface.co/docs/transformers/main/en/model_doc/mms) for automatic speech recognition. + +### MMS Model + +The [Massive Multilingual Speech (MMS) model](https://huggingface.co/facebook/mms-1b-all) has been pre-trained and fine-tuned +on 1000+ languages. The model makes use of adapter attention layers to fine-tune only a small part +of the model on a specific language. The model already comes with fine-tuned adapter layers for 1000+ languages and +can be used for inference for 1000+ languages out of the box. + +However, for improved performance or more specific use cases one can re-initialize the adapter weights, freeze all +other weights and fine-tune them on a specific dataset as shown in the [example below](#examples-ctc-adapter). + +Note that the adapter weights include low dimensional linear layers for every attention block as well as the final language +model head layers. + +### Examples CTC Adapter + +In the following we will look at how one can fine-tune adapter weights for any of the +[MMS CTC checkpoints](https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&other=mms&sort=downloads) in less than 1 hour. + +#### Common Voice CTC Adapter + +As in the examples [above](#examples-ctc), we fine-tune on Common Voice's 6 dataset in Turkish as an example. +Contrary to [`run_speech_recognition_ctc.py`](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py) before there is a `--target_language` which has to be defined to state for which +language or concept the adapter layers shall be trained. The adapter weights will then +accordingly be called `adapter.{/wav In the script [`run_speech_recognition_seq2seq`], we load the warm-started model, feature extractor, and tokenizer, process a speech recognition dataset, and subsequently make use of the [`Seq2SeqTrainer`](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Seq2SeqTrainer) to train our system. 
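One common way to assemble such a warm-started speech-encoder-decoder checkpoint before handing it to the script is sketched below; the encoder and decoder checkpoint ids are illustrative, not prescribed by this patch:

```python
from transformers import AutoFeatureExtractor, AutoTokenizer, SpeechEncoderDecoderModel

# Warm-start: speech encoder from Wav2Vec2, text decoder from BART (illustrative checkpoints)
model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
    "facebook/wav2vec2-large-lv60", "facebook/bart-large"
)

# The decoder needs to know where generation starts and how padding is handled
model.config.decoder_start_token_id = model.decoder.config.bos_token_id
model.config.pad_token_id = model.decoder.config.pad_token_id

# Processor pieces are loaded separately, as in the example script
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-large-lv60")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")

model.save_pretrained("./warm-started-speech2text")
feature_extractor.save_pretrained("./warm-started-speech2text")
tokenizer.save_pretrained("./warm-started-speech2text")
```

Saving everything into one local directory is what makes the `--model_name_or_path="./"` usage in the fine-tuning command further down work.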
-Note that it is important to align the target transcriptions with the decoder's vocabulary. For example, the [`Librispeech`](https://huggingface.co/datasets/librispeech_asr) dataset only contains captilized letters in the transcriptions, +Note that it is important to align the target transcriptions with the decoder's vocabulary. For example, the [`Librispeech`](https://huggingface.co/datasets/librispeech_asr) dataset only contains capitalized letters in the transcriptions, whereas BART was pretrained mostly on normalized text. Thus, it is recommended to add the argument `--do_lower_case` to the fine-tuning script when using a warm-started `SpeechEncoderDecoderModel`. The model is fine-tuned on the standard cross-entropy language modeling @@ -463,7 +572,7 @@ cross-entropy loss of **0.405** and word error rate of **0.0728**. The following command shows how to fine-tune [XLSR-Wav2Vec2](https://huggingface.co/transformers/main/model_doc/xlsr_wav2vec2.html) on [Common Voice](https://huggingface.co/datasets/common_voice) using 8 GPUs in half-precision. ```bash -python -m torch.distributed.launch \ +torchrun \ --nproc_per_node 8 run_speech_recognition_seq2seq.py \ --dataset_name="librispeech_asr" \ --model_name_or_path="./" \ diff --git a/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py b/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py index c6cd82b436fb70..b4a026a23a9ebc 100755 --- a/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py +++ b/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py @@ -51,7 +51,7 @@ # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=1.18.0", "To fix: pip install -r examples/pytorch/speech-recognition/requirements.txt") @@ -104,8 +104,8 @@ class ModelArguments: default=0.05, metadata={ "help": ( - "Probability of each feature vector along the time axis to be chosen as the start of the vector" - "span to be masked. Approximately ``mask_time_prob * sequence_length // mask_time_length`` feature" + "Probability of each feature vector along the time axis to be chosen as the start of the vector " + "span to be masked. Approximately ``mask_time_prob * sequence_length // mask_time_length`` feature " "vectors will be masked along the time axis." ) }, @@ -132,6 +132,20 @@ class ModelArguments: ctc_loss_reduction: Optional[str] = field( default="mean", metadata={"help": "The way the ctc loss should be reduced. Should be one of 'mean' or 'sum'."} ) + ctc_zero_infinity: Optional[bool] = field( + default=False, + metadata={ + "help": "Whether to zero infinite losses and the associated gradients of `torch.nn.CTCLoss`. Infinite losses mainly" + " occur when the inputs are too short to be aligned to the targets." + }, + ) + add_adapter: Optional[bool] = field( + default=False, + metadata={ + "help": "Whether a convolutional attention network should be stacked on top of the Wav2Vec2Bert Encoder. Can be very" + "useful to downsample the output length." + }, + ) @dataclass @@ -229,12 +243,28 @@ class DataTrainingArguments: ) }, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." 
+ ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "If :obj:`True`, will use the token generated when running" - ":obj:`huggingface-cli login` as HTTP bearer authorization for remote files." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -293,11 +323,14 @@ class DataCollatorCTCWithPadding: padding: Union[bool, str] = "longest" pad_to_multiple_of: Optional[int] = None pad_to_multiple_of_labels: Optional[int] = None + feature_extractor_input_name: Optional[str] = "input_values" def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]: - # split inputs and labels since they have to be of different lenghts and need + # split inputs and labels since they have to be of different lengths and need # different padding methods - input_features = [{"input_values": feature["input_values"]} for feature in features] + input_features = [ + {self.feature_extractor_input_name: feature[self.feature_extractor_input_name]} for feature in features + ] label_features = [{"input_ids": feature["labels"]} for feature in features] batch = self.processor.pad( @@ -349,7 +382,7 @@ def extract_all_chars(batch): lambda vocab_1, vocab_2: set(vocab_1["vocab"][0]) | set(vocab_2["vocab"][0]), vocabs.values() ) - vocab_dict = {v: k for k, v in enumerate(sorted(list(vocab_set)))} + vocab_dict = {v: k for k, v in enumerate(sorted(vocab_set))} # replace white space with delimiter token if word_delimiter_token is not None: @@ -379,6 +412,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if data_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if data_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + data_args.token = data_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. 
send_example_telemetry("run_speech_recognition_ctc", model_args, data_args) @@ -408,8 +450,8 @@ def main(): # Log on each process the small summary: logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" - f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}" ) # Set the verbosity to info of the Transformers logger (on main process only): if is_main_process(training_args.local_rank): @@ -427,7 +469,7 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, split=data_args.train_split_name, - use_auth_token=data_args.use_auth_token, + token=data_args.token, ) if data_args.audio_column_name not in raw_datasets["train"].column_names: @@ -452,7 +494,7 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, split=data_args.eval_split_name, - use_auth_token=data_args.use_auth_token, + token=data_args.token, ) if data_args.max_eval_samples is not None: @@ -490,7 +532,10 @@ def remove_special_characters(batch): # the tokenizer # load config config = AutoConfig.from_pretrained( - model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_auth_token=data_args.use_auth_token + model_args.model_name_or_path, + cache_dir=model_args.cache_dir, + token=data_args.token, + trust_remote_code=data_args.trust_remote_code, ) # 4. Next, if no tokenizer file is defined, @@ -546,11 +591,15 @@ def remove_special_characters(batch): # load feature_extractor and tokenizer tokenizer = AutoTokenizer.from_pretrained( tokenizer_name_or_path, - use_auth_token=data_args.use_auth_token, + token=data_args.token, + trust_remote_code=data_args.trust_remote_code, **tokenizer_kwargs, ) feature_extractor = AutoFeatureExtractor.from_pretrained( - model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_auth_token=data_args.use_auth_token + model_args.model_name_or_path, + cache_dir=model_args.cache_dir, + token=data_args.token, + trust_remote_code=data_args.trust_remote_code, ) # adapt config @@ -567,9 +616,11 @@ def remove_special_characters(batch): "gradient_checkpointing": training_args.gradient_checkpointing, "layerdrop": model_args.layerdrop, "ctc_loss_reduction": model_args.ctc_loss_reduction, + "ctc_zero_infinity": model_args.ctc_zero_infinity, "pad_token_id": tokenizer.pad_token_id, "vocab_size": len(tokenizer), "activation_dropout": model_args.activation_dropout, + "add_adapter": model_args.add_adapter, } ) @@ -578,7 +629,8 @@ def remove_special_characters(batch): model_args.model_name_or_path, cache_dir=model_args.cache_dir, config=config, - use_auth_token=data_args.use_auth_token, + token=data_args.token, + trust_remote_code=data_args.trust_remote_code, ) # freeze encoder @@ -602,6 +654,7 @@ def remove_special_characters(batch): min_input_length = data_args.min_duration_in_seconds * feature_extractor.sampling_rate audio_column_name = data_args.audio_column_name num_workers = data_args.preprocessing_num_workers + feature_extractor_input_name = feature_extractor.model_input_names[0] # `phoneme_language` is only relevant if the model is fine-tuned on phoneme classification phoneme_language = data_args.phoneme_language @@ -613,8 +666,9 @@ def prepare_dataset(batch): sample = batch[audio_column_name] inputs = 
feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"]) - batch["input_values"] = inputs.input_values[0] - batch["input_length"] = len(batch["input_values"]) + batch[feature_extractor_input_name] = getattr(inputs, feature_extractor_input_name)[0] + # take length of raw audio waveform + batch["input_length"] = len(sample["array"].squeeze()) # encode targets additional_kwargs = {} @@ -647,7 +701,7 @@ def is_audio_in_length_range(length): # instantiate a data collator and the trainer # Define evaluation metrics during training, *i.e.* word error rate, character error rate - eval_metrics = {metric: evaluate.load(metric) for metric in data_args.eval_metrics} + eval_metrics = {metric: evaluate.load(metric, cache_dir=model_args.cache_dir) for metric in data_args.eval_metrics} # for large datasets it is advised to run the preprocessing on a # single machine first with ``args.preprocessing_only`` since there will mostly likely @@ -673,11 +727,14 @@ def compute_metrics(pred): return metrics # Now save everything to be able to create a single processor later - if is_main_process(training_args.local_rank): - # save feature extractor, tokenizer and config - feature_extractor.save_pretrained(training_args.output_dir) - tokenizer.save_pretrained(training_args.output_dir) - config.save_pretrained(training_args.output_dir) + # make sure all processes wait until data is saved + with training_args.main_process_first(): + # only the main process saves them + if is_main_process(training_args.local_rank): + # save feature extractor, tokenizer and config + feature_extractor.save_pretrained(training_args.output_dir) + tokenizer.save_pretrained(training_args.output_dir) + config.save_pretrained(training_args.output_dir) try: processor = AutoProcessor.from_pretrained(training_args.output_dir) @@ -692,7 +749,9 @@ def compute_metrics(pred): processor = Wav2Vec2Processor.from_pretrained(training_args.output_dir) # Instantiate custom data collator - data_collator = DataCollatorCTCWithPadding(processor=processor) + data_collator = DataCollatorCTCWithPadding( + processor=processor, feature_extractor_input_name=feature_extractor_input_name + ) # Initialize Trainer trainer = Trainer( @@ -702,7 +761,7 @@ def compute_metrics(pred): compute_metrics=compute_metrics, train_dataset=vectorized_datasets["train"] if training_args.do_train else None, eval_dataset=vectorized_datasets["eval"] if training_args.do_eval else None, - tokenizer=feature_extractor, + tokenizer=processor, ) # 8. Finally, we can start training diff --git a/examples/pytorch/speech-recognition/run_speech_recognition_ctc_adapter.py b/examples/pytorch/speech-recognition/run_speech_recognition_ctc_adapter.py new file mode 100755 index 00000000000000..b998596bc9cd0f --- /dev/null +++ b/examples/pytorch/speech-recognition/run_speech_recognition_ctc_adapter.py @@ -0,0 +1,836 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2023 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +""" Fine-tuning a 🤗 Transformers CTC adapter model for automatic speech recognition""" + +import functools +import json +import logging +import os +import re +import sys +import warnings +from dataclasses import dataclass, field +from typing import Dict, List, Optional, Union + +import datasets +import evaluate +import numpy as np +import torch +from datasets import DatasetDict, load_dataset +from safetensors.torch import save_file as safe_save_file + +import transformers +from transformers import ( + AutoConfig, + AutoFeatureExtractor, + AutoModelForCTC, + AutoProcessor, + AutoTokenizer, + HfArgumentParser, + Trainer, + TrainingArguments, + Wav2Vec2Processor, + set_seed, +) +from transformers.models.wav2vec2.modeling_wav2vec2 import WAV2VEC2_ADAPTER_SAFE_FILE +from transformers.trainer_utils import get_last_checkpoint, is_main_process +from transformers.utils import check_min_version, send_example_telemetry +from transformers.utils.versions import require_version + + +# Will error if the minimal version of Transformers is not installed. Remove at your own risks. +check_min_version("4.38.0.dev0") + +require_version("datasets>=1.18.0", "To fix: pip install -r examples/pytorch/speech-recognition/requirements.txt") + + +logger = logging.getLogger(__name__) + + +def list_field(default=None, metadata=None): + return field(default_factory=lambda: default, metadata=metadata) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. + """ + + model_name_or_path: str = field( + metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"} + ) + tokenizer_name_or_path: Optional[str] = field( + default=None, + metadata={"help": "Path to pretrained tokenizer or tokenizer identifier from huggingface.co/models"}, + ) + cache_dir: Optional[str] = field( + default=None, + metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"}, + ) + final_dropout: float = field( + default=0.0, + metadata={"help": "The dropout probability for the final projection layer."}, + ) + mask_time_prob: float = field( + default=0.05, + metadata={ + "help": ( + "Probability of each feature vector along the time axis to be chosen as the start of the vector " + "span to be masked. Approximately ``mask_time_prob * sequence_length // mask_time_length`` feature " + "vectors will be masked along the time axis." + ) + }, + ) + mask_time_length: int = field( + default=10, + metadata={"help": "Length of vector span to mask along the time axis."}, + ) + mask_feature_prob: float = field( + default=0.0, + metadata={ + "help": ( + "Probability of each feature vector along the feature axis to be chosen as the start of the vectorspan" + " to be masked. Approximately ``mask_feature_prob * sequence_length // mask_feature_length`` feature" + " bins will be masked along the time axis." + ) + }, + ) + mask_feature_length: int = field( + default=10, + metadata={"help": "Length of vector span to mask along the feature axis."}, + ) + layerdrop: float = field(default=0.0, metadata={"help": "The LayerDrop probability."}) + ctc_loss_reduction: Optional[str] = field( + default="mean", metadata={"help": "The way the ctc loss should be reduced. Should be one of 'mean' or 'sum'."} + ) + adapter_attn_dim: int = field( + default=16, + metadata={ + "help": "The hidden dimension of the adapter layers that will be randomly initialized and trained. 
The higher the dimension, the more capacity is given to the adapter weights. Note that only the adapter weights are fine-tuned." + }, + ) + + +@dataclass +class DataTrainingArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. + + Using `HfArgumentParser` we can turn this class + into argparse arguments to be able to specify them on + the command line. + """ + + dataset_name: str = field( + metadata={"help": "The configuration name of the dataset to use (via the datasets library)."} + ) + target_language: Optional[str] = field( + metadata={ + "help": ( + "The target language on which the adapter attention layers" + " should be trained on in ISO 693-3 code, e.g. `tur` for Turkish" + " Wav2Vec2's MMS ISO codes can be looked up here: https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html" + " If you are not training the adapter layers on a language, simply choose" + " another acronym that fits your data." + ) + }, + ) + dataset_config_name: str = field( + default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."} + ) + train_split_name: str = field( + default="train+validation", + metadata={ + "help": ( + "The name of the training data set split to use (via the datasets library). Defaults to " + "'train+validation'" + ) + }, + ) + eval_split_name: str = field( + default="test", + metadata={ + "help": "The name of the evaluation data set split to use (via the datasets library). Defaults to 'test'" + }, + ) + audio_column_name: str = field( + default="audio", + metadata={"help": "The name of the dataset column containing the audio data. Defaults to 'audio'"}, + ) + text_column_name: str = field( + default="text", + metadata={"help": "The name of the dataset column containing the text data. Defaults to 'text'"}, + ) + overwrite_cache: bool = field( + default=False, metadata={"help": "Overwrite the cached preprocessed datasets or not."} + ) + preprocessing_num_workers: Optional[int] = field( + default=None, + metadata={"help": "The number of processes to use for the preprocessing."}, + ) + max_train_samples: Optional[int] = field( + default=None, + metadata={ + "help": ( + "For debugging purposes or quicker training, truncate the number of training examples to this " + "value if set." + ) + }, + ) + max_eval_samples: Optional[int] = field( + default=None, + metadata={ + "help": ( + "For debugging purposes or quicker training, truncate the number of validation examples to this " + "value if set." + ) + }, + ) + chars_to_ignore: Optional[List[str]] = list_field( + default=None, + metadata={"help": "A list of characters to remove from the transcripts."}, + ) + eval_metrics: List[str] = list_field( + default=["wer"], + metadata={"help": "A list of metrics the model should be evaluated on. E.g. `'wer cer'`"}, + ) + max_duration_in_seconds: float = field( + default=20.0, + metadata={ + "help": ( + "Filter audio files that are longer than `max_duration_in_seconds` seconds to" + " 'max_duration_in_seconds`" + ) + }, + ) + min_duration_in_seconds: float = field( + default=0.0, metadata={"help": "Filter audio files that are shorter than `min_duration_in_seconds` seconds"} + ) + preprocessing_only: bool = field( + default=False, + metadata={ + "help": ( + "Whether to only do data preprocessing and skip training. This is especially useful when data" + " preprocessing errors out in distributed training due to timeout. 
In this case, one should run the" + " preprocessing in a non-distributed setup with `preprocessing_only=True` so that the cached datasets" + " can consequently be loaded in distributed training" + ) + }, + ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) + use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( + default=False, + metadata={ + "help": ( + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." + ) + }, + ) + unk_token: str = field( + default="[UNK]", + metadata={"help": "The unk token for the tokenizer"}, + ) + pad_token: str = field( + default="[PAD]", + metadata={"help": "The padding token for the tokenizer"}, + ) + word_delimiter_token: str = field( + default="|", + metadata={"help": "The word delimiter token for the tokenizer"}, + ) + overwrite_lang_vocab: bool = field( + default=False, + metadata={"help": ("If :obj:`True`, will overwrite existing `target_language` vocabulary of tokenizer.")}, + ) + + +@dataclass +class DataCollatorCTCWithPadding: + """ + Data collator that will dynamically pad the inputs received. + Args: + processor (:class:`~transformers.AutoProcessor`) + The processor used for proccessing the data. + padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`): + Select a strategy to pad the returned sequences (according to the model's padding side and padding index) + among: + * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single + sequence if provided). + * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the + maximum acceptable input length for the model if that argument is not provided. + * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of + different lengths). + max_length (:obj:`int`, `optional`): + Maximum length of the ``input_values`` of the returned list and optionally padding length (see above). + max_length_labels (:obj:`int`, `optional`): + Maximum length of the ``labels`` returned list and optionally padding length (see above). + pad_to_multiple_of (:obj:`int`, `optional`): + If set will pad the sequence to a multiple of the provided value. + This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= + 7.5 (Volta). 
+ """ + + processor: AutoProcessor + padding: Union[bool, str] = "longest" + pad_to_multiple_of: Optional[int] = None + pad_to_multiple_of_labels: Optional[int] = None + + def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]: + # split inputs and labels since they have to be of different lengths and need + # different padding methods + input_features = [{"input_values": feature["input_values"]} for feature in features] + label_features = [{"input_ids": feature["labels"]} for feature in features] + + batch = self.processor.pad( + input_features, + padding=self.padding, + pad_to_multiple_of=self.pad_to_multiple_of, + return_tensors="pt", + ) + + labels_batch = self.processor.pad( + labels=label_features, + padding=self.padding, + pad_to_multiple_of=self.pad_to_multiple_of_labels, + return_tensors="pt", + ) + + # replace padding with -100 to ignore loss correctly + labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100) + + batch["labels"] = labels + if "attention_mask" in batch: + batch["attention_mask"] = batch["attention_mask"].to(torch.long) + + return batch + + +def create_vocabulary_from_data( + datasets: DatasetDict, + word_delimiter_token: Optional[str] = None, + unk_token: Optional[str] = None, + pad_token: Optional[str] = None, +): + # Given training and test labels create vocabulary + def extract_all_chars(batch): + all_text = " ".join(batch["target_text"]) + vocab = list(set(all_text)) + return {"vocab": [vocab], "all_text": [all_text]} + + vocabs = datasets.map( + extract_all_chars, + batched=True, + batch_size=-1, + keep_in_memory=True, + remove_columns=datasets["train"].column_names, + ) + + # take union of all unique characters in each dataset + vocab_set = functools.reduce( + lambda vocab_1, vocab_2: set(vocab_1["vocab"][0]) | set(vocab_2["vocab"][0]), vocabs.values() + ) + + vocab_dict = {v: k for k, v in enumerate(sorted(vocab_set))} + + # replace white space with delimiter token + if word_delimiter_token is not None: + vocab_dict[word_delimiter_token] = vocab_dict[" "] + del vocab_dict[" "] + + # add unk and pad token + if unk_token is not None: + vocab_dict[unk_token] = len(vocab_dict) + + if pad_token is not None: + vocab_dict[pad_token] = len(vocab_dict) + + return vocab_dict + + +def main(): + # See all possible arguments in src/transformers/training_args.py + # or by passing the --help flag to this script. + # We now keep distinct sets of args, for a cleaner separation of concerns. + + parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments)) + if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): + # If we pass only one argument to the script and it's the path to a json file, + # let's parse it to get our arguments. + model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1])) + else: + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + if data_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if data_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + data_args.token = data_args.use_auth_token + + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. 
The + # information sent is the one passed as arguments along with your Python/PyTorch versions. + send_example_telemetry("run_speech_recognition_ctc_adapter", model_args, data_args) + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + # Setup logging + logging.basicConfig( + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%m/%d/%Y %H:%M:%S", + handlers=[logging.StreamHandler(sys.stdout)], + ) + logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN) + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}" + ) + # Set the verbosity to info of the Transformers logger (on main process only): + if is_main_process(training_args.local_rank): + transformers.utils.logging.set_verbosity_info() + logger.info("Training/evaluation parameters %s", training_args) + + # Set seed before initializing model. + set_seed(training_args.seed) + + # 1. First, let's load the dataset + raw_datasets = DatasetDict() + + if training_args.do_train: + raw_datasets["train"] = load_dataset( + data_args.dataset_name, + data_args.dataset_config_name, + split=data_args.train_split_name, + token=data_args.token, + ) + + if data_args.audio_column_name not in raw_datasets["train"].column_names: + raise ValueError( + f"--audio_column_name '{data_args.audio_column_name}' not found in dataset '{data_args.dataset_name}'." + " Make sure to set `--audio_column_name` to the correct audio column - one of" + f" {', '.join(raw_datasets['train'].column_names)}." + ) + + if data_args.text_column_name not in raw_datasets["train"].column_names: + raise ValueError( + f"--text_column_name {data_args.text_column_name} not found in dataset '{data_args.dataset_name}'. " + "Make sure to set `--text_column_name` to the correct text column - one of " + f"{', '.join(raw_datasets['train'].column_names)}." + ) + + if data_args.max_train_samples is not None: + raw_datasets["train"] = raw_datasets["train"].select(range(data_args.max_train_samples)) + + if training_args.do_eval: + raw_datasets["eval"] = load_dataset( + data_args.dataset_name, + data_args.dataset_config_name, + split=data_args.eval_split_name, + token=data_args.token, + ) + + if data_args.max_eval_samples is not None: + raw_datasets["eval"] = raw_datasets["eval"].select(range(data_args.max_eval_samples)) + + # 2. We remove some special characters from the datasets + # that make training complicated and do not help in transcribing the speech + # E.g. 
characters, such as `,` and `.` do not really have an acoustic characteristic + # that could be easily picked up by the model + chars_to_ignore_regex = ( + f'[{"".join(data_args.chars_to_ignore)}]' if data_args.chars_to_ignore is not None else None + ) + text_column_name = data_args.text_column_name + + def remove_special_characters(batch): + if chars_to_ignore_regex is not None: + batch["target_text"] = re.sub(chars_to_ignore_regex, "", batch[text_column_name]).lower() + " " + else: + batch["target_text"] = batch[text_column_name].lower() + " " + return batch + + with training_args.main_process_first(desc="dataset map special characters removal"): + raw_datasets = raw_datasets.map( + remove_special_characters, + remove_columns=[text_column_name], + desc="remove special characters from datasets", + ) + + # save special tokens for tokenizer + word_delimiter_token = data_args.word_delimiter_token + unk_token = data_args.unk_token + pad_token = data_args.pad_token + + # 3. Next, let's load the config as we might need it to create + # the tokenizer + # load config + config = AutoConfig.from_pretrained( + model_args.model_name_or_path, + cache_dir=model_args.cache_dir, + token=data_args.token, + trust_remote_code=data_args.trust_remote_code, + ) + + # 4. Next, if no tokenizer file is defined, + # we create the vocabulary of the model by extracting all unique characters from + # the training and evaluation datasets + # We need to make sure that only first rank saves vocabulary + # make sure all processes wait until vocab is created + tokenizer_name_or_path = model_args.tokenizer_name_or_path + tokenizer_kwargs = {} + + vocab_dict = {} + if tokenizer_name_or_path is not None: + # load vocabulary of other adapter languages so that new language can be appended + tokenizer = AutoTokenizer.from_pretrained( + tokenizer_name_or_path, + token=data_args.token, + trust_remote_code=data_args.trust_remote_code, + ) + vocab_dict = tokenizer.vocab.copy() + if tokenizer.target_lang is None: + raise ValueError("Make sure to load a multi-lingual tokenizer with a set target language.") + + if data_args.target_language in tokenizer.vocab and not data_args.overwrite_lang_vocab: + logger.info( + "Adapter language already exists." + " Skipping vocabulary creating. 
If you want to create a new vocabulary" + f" for {data_args.target_language} make sure to add '--overwrite_lang_vocab'" + ) + else: + tokenizer_name_or_path = None + + if tokenizer_name_or_path is None: + # save vocab in training output dir + tokenizer_name_or_path = training_args.output_dir + + vocab_file = os.path.join(tokenizer_name_or_path, "vocab.json") + + with training_args.main_process_first(): + if training_args.overwrite_output_dir and os.path.isfile(vocab_file): + try: + os.remove(vocab_file) + except OSError: + # in shared file-systems it might be the case that + # two processes try to delete the vocab file at the same time + pass + + with training_args.main_process_first(desc="dataset map vocabulary creation"): + if not os.path.isfile(vocab_file): + os.makedirs(tokenizer_name_or_path, exist_ok=True) + lang_dict = create_vocabulary_from_data( + raw_datasets, + word_delimiter_token=word_delimiter_token, + unk_token=unk_token, + pad_token=pad_token, + ) + + # if we are doing adapter language training, save + # vocab with adapter language + if data_args.target_language is not None: + vocab_dict[data_args.target_language] = lang_dict + + # save vocab dict to be loaded into tokenizer + with open(vocab_file, "w") as file: + json.dump(vocab_dict, file) + + # if tokenizer has just been created + # it is defined by `tokenizer_class` if present in config else by `model_type` + tokenizer_kwargs = { + "config": config if config.tokenizer_class is not None else None, + "tokenizer_type": config.model_type if config.tokenizer_class is None else None, + "unk_token": unk_token, + "pad_token": pad_token, + "word_delimiter_token": word_delimiter_token, + "target_lang": data_args.target_language, + } + + # 5. Now we can instantiate the feature extractor, tokenizer and model + # Note for distributed training, the .from_pretrained methods guarantee that only + # one local process can concurrently download model & vocab.
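For context on the per-language vocabulary and adapter files the new script produces, here is a rough sketch of how a trained MMS-style adapter is typically used at inference time — the checkpoint id and language code are illustrative, and `set_target_lang`/`load_adapter` are the MMS helpers documented in Transformers rather than anything defined in this patch:

```python
from transformers import AutoProcessor, Wav2Vec2ForCTC

model_id = "facebook/mms-1b-all"  # illustrative MMS checkpoint shipping many adapters
target_lang = "tur"               # ISO 693-3 code, matching `--target_language` above

processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Switch the tokenizer to the vocabulary stored under vocab[target_lang] in vocab.json
processor.tokenizer.set_target_lang(target_lang)
# Load the matching adapter.<target_lang>.safetensors weights (attention adapters + lm head)
model.load_adapter(target_lang)
```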
+ + # load feature_extractor and tokenizer + tokenizer = AutoTokenizer.from_pretrained( + tokenizer_name_or_path, + token=data_args.token, + trust_remote_code=data_args.trust_remote_code, + **tokenizer_kwargs, + ) + feature_extractor = AutoFeatureExtractor.from_pretrained( + model_args.model_name_or_path, + cache_dir=model_args.cache_dir, + token=data_args.token, + trust_remote_code=data_args.trust_remote_code, + ) + + # adapt config + config.update( + { + "final_dropout": model_args.final_dropout, + "mask_time_prob": model_args.mask_time_prob, + "mask_time_length": model_args.mask_time_length, + "mask_feature_prob": model_args.mask_feature_prob, + "mask_feature_length": model_args.mask_feature_length, + "gradient_checkpointing": training_args.gradient_checkpointing, + "layerdrop": model_args.layerdrop, + "ctc_loss_reduction": model_args.ctc_loss_reduction, + "pad_token_id": tokenizer.pad_token_id, + "vocab_size": len(tokenizer), + "adapter_attn_dim": model_args.adapter_attn_dim, + } + ) + + # create model + model = AutoModelForCTC.from_pretrained( + model_args.model_name_or_path, + cache_dir=model_args.cache_dir, + config=config, + token=data_args.token, + trust_remote_code=data_args.trust_remote_code, + ignore_mismatched_sizes=True, + ) + + # if attn adapter is defined, freeze all non-adapter weights + if model.config.adapter_attn_dim is not None: + model.init_adapter_layers() + # first we freeze the whole base model + model.freeze_base_model() + + # next we unfreeze all adapter layers + adapter_weights = model._get_adapters() + for param in adapter_weights.values(): + param.requires_grad = True + + # 6. Now we preprocess the datasets including loading the audio, resampling and normalization + # Thankfully, `datasets` takes care of automatically loading and resampling the audio, + # so that we just need to set the correct target sampling rate and normalize the input + # via the `feature_extractor` + + # make sure that dataset decodes audio with correct sampling rate + dataset_sampling_rate = next(iter(raw_datasets.values())).features[data_args.audio_column_name].sampling_rate + if dataset_sampling_rate != feature_extractor.sampling_rate: + raw_datasets = raw_datasets.cast_column( + data_args.audio_column_name, datasets.features.Audio(sampling_rate=feature_extractor.sampling_rate) + ) + + # derive max & min input length for sample rate & max duration + max_input_length = data_args.max_duration_in_seconds * feature_extractor.sampling_rate + min_input_length = data_args.min_duration_in_seconds * feature_extractor.sampling_rate + audio_column_name = data_args.audio_column_name + num_workers = data_args.preprocessing_num_workers + + # Preprocessing the datasets. + # We need to read the audio files as arrays and tokenize the targets. 
+ def prepare_dataset(batch): + # load audio + sample = batch[audio_column_name] + + inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"]) + batch["input_values"] = inputs.input_values[0] + batch["input_length"] = len(batch["input_values"]) + + # encode targets + batch["labels"] = tokenizer(batch["target_text"]).input_ids + return batch + + with training_args.main_process_first(desc="dataset map preprocessing"): + vectorized_datasets = raw_datasets.map( + prepare_dataset, + remove_columns=next(iter(raw_datasets.values())).column_names, + num_proc=num_workers, + desc="preprocess datasets", + ) + + def is_audio_in_length_range(length): + return length > min_input_length and length < max_input_length + + # filter data that is shorter than min_input_length + vectorized_datasets = vectorized_datasets.filter( + is_audio_in_length_range, + num_proc=num_workers, + input_columns=["input_length"], + ) + + # 7. Next, we can prepare the training. + # Let's use word error rate (WER) as our evaluation metric, + # instantiate a data collator and the trainer + + # Define evaluation metrics during training, *i.e.* word error rate, character error rate + eval_metrics = {metric: evaluate.load(metric, cache_dir=model_args.cache_dir) for metric in data_args.eval_metrics} + + # for large datasets it is advised to run the preprocessing on a + # single machine first with ``args.preprocessing_only`` since there will mostly likely + # be a timeout when running the script in distributed mode. + # In a second step ``args.preprocessing_only`` can then be set to `False` to load the + # cached dataset + if data_args.preprocessing_only: + logger.info(f"Data preprocessing finished. Files cached at {vectorized_datasets.cache_files}") + return + + def compute_metrics(pred): + pred_logits = pred.predictions + pred_ids = np.argmax(pred_logits, axis=-1) + + pred.label_ids[pred.label_ids == -100] = tokenizer.pad_token_id + + pred_str = tokenizer.batch_decode(pred_ids) + # we do not want to group tokens when computing the metrics + label_str = tokenizer.batch_decode(pred.label_ids, group_tokens=False) + + metrics = {k: v.compute(predictions=pred_str, references=label_str) for k, v in eval_metrics.items()} + + return metrics + + # Now save everything to be able to create a single processor later + # make sure all processes wait until data is saved + with training_args.main_process_first(): + # only the main process saves them + if is_main_process(training_args.local_rank): + # save feature extractor, tokenizer and config + feature_extractor.save_pretrained(training_args.output_dir) + tokenizer.save_pretrained(training_args.output_dir) + config.save_pretrained(training_args.output_dir) + + try: + processor = AutoProcessor.from_pretrained(training_args.output_dir) + except (OSError, KeyError): + warnings.warn( + "Loading a processor from a feature extractor config that does not" + " include a `processor_class` attribute is deprecated and will be removed in v5. 
Please add the following " + " attribute to your `preprocessor_config.json` file to suppress this warning: " + " `'processor_class': 'Wav2Vec2Processor'`", + FutureWarning, + ) + processor = Wav2Vec2Processor.from_pretrained(training_args.output_dir) + + # Instantiate custom data collator + data_collator = DataCollatorCTCWithPadding(processor=processor) + + # Initialize Trainer + trainer = Trainer( + model=model, + data_collator=data_collator, + args=training_args, + compute_metrics=compute_metrics, + train_dataset=vectorized_datasets["train"] if training_args.do_train else None, + eval_dataset=vectorized_datasets["eval"] if training_args.do_eval else None, + tokenizer=processor, + ) + + # 8. Finally, we can start training + + # Training + if training_args.do_train: + # use last checkpoint if exist + if last_checkpoint is not None: + checkpoint = last_checkpoint + elif os.path.isdir(model_args.model_name_or_path): + checkpoint = model_args.model_name_or_path + else: + checkpoint = None + + train_result = trainer.train(resume_from_checkpoint=checkpoint) + trainer.save_model() + + metrics = train_result.metrics + max_train_samples = ( + data_args.max_train_samples + if data_args.max_train_samples is not None + else len(vectorized_datasets["train"]) + ) + metrics["train_samples"] = min(max_train_samples, len(vectorized_datasets["train"])) + + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluation + results = {} + if training_args.do_eval: + logger.info("*** Evaluate ***") + metrics = trainer.evaluate() + max_eval_samples = ( + data_args.max_eval_samples if data_args.max_eval_samples is not None else len(vectorized_datasets["eval"]) + ) + metrics["eval_samples"] = min(max_eval_samples, len(vectorized_datasets["eval"])) + + trainer.log_metrics("eval", metrics) + trainer.save_metrics("eval", metrics) + + # Write model card and (optionally) push to hub + config_name = data_args.dataset_config_name if data_args.dataset_config_name is not None else "na" + kwargs = { + "finetuned_from": model_args.model_name_or_path, + "tasks": "automatic-speech-recognition", + "tags": ["automatic-speech-recognition", data_args.dataset_name, "mms"], + "dataset_args": ( + f"Config: {config_name}, Training split: {data_args.train_split_name}, Eval split:" + f" {data_args.eval_split_name}" + ), + "dataset": f"{data_args.dataset_name.upper()} - {config_name.upper()}", + } + if "common_voice" in data_args.dataset_name: + kwargs["language"] = config_name + + # make sure that adapter weights are saved seperately + adapter_file = WAV2VEC2_ADAPTER_SAFE_FILE.format(data_args.target_language) + adapter_file = os.path.join(training_args.output_dir, adapter_file) + logger.info(f"Saving adapter weights under {adapter_file}...") + safe_save_file(model._get_adapters(), adapter_file, metadata={"format": "pt"}) + + if training_args.push_to_hub: + trainer.push_to_hub(**kwargs) + else: + trainer.create_model_card(**kwargs) + + return results + + +if __name__ == "__main__": + main() diff --git a/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py b/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py index c841e99df21557..0e6bc6b4c234fa 100755 --- a/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py +++ b/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py @@ -22,6 +22,7 @@ import logging import os import sys +import warnings from dataclasses import dataclass, field from typing import Any, Dict, List, 
Optional, Union @@ -48,7 +49,7 @@ # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=1.18.0", "To fix: pip install -r examples/pytorch/speech-recognition/requirements.txt") @@ -85,12 +86,28 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -113,6 +130,12 @@ class ModelArguments: suppress_tokens: List[int] = field( default=None, metadata={"help": "A list of tokens that will be suppressed at generation."} ) + apply_spec_augment: bool = field( + default=False, + metadata={ + "help": "Whether to apply *SpecAugment* data augmentation to the input features. This is currently only relevant for Wav2Vec2, HuBERT, WavLM and Whisper models." + }, + ) @dataclass @@ -127,10 +150,6 @@ class DataTrainingArguments: dataset_config_name: Optional[str] = field( default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."} ) - text_column: Optional[str] = field( - default=None, - metadata={"help": "The name of the column in the datasets containing the full texts (for summarization)."}, - ) overwrite_cache: bool = field( default=False, metadata={"help": "Overwrite the cached training and evaluation sets"} ) @@ -227,10 +246,13 @@ class DataCollatorSpeechSeq2SeqWithPadding: The processor used for processing the data. decoder_start_token_id (`int`) The begin-of-sentence of the decoder. + forward_attention_mask (`bool`) + Whether to return attention_mask. """ processor: Any decoder_start_token_id: int + forward_attention_mask: bool def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]: # split inputs and labels since they have to be of different lengths and need @@ -241,6 +263,9 @@ def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt") + if self.forward_attention_mask: + batch["attention_mask"] = torch.LongTensor([feature["attention_mask"] for feature in features]) + labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt") # replace padding with -100 to ignore loss correctly @@ -270,6 +295,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. 
Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_speech_recognition_seq2seq", model_args, data_args) @@ -291,8 +325,8 @@ def main(): # Log on each process the small summary: logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" - f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}" ) logger.info(f"Training/evaluation parameters {training_args}") @@ -328,7 +362,7 @@ def main(): data_args.dataset_config_name, split=data_args.train_split_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) if training_args.do_eval: @@ -337,7 +371,7 @@ def main(): data_args.dataset_config_name, split=data_args.eval_split_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) if data_args.audio_column_name not in next(iter(raw_datasets.values())).column_names: @@ -362,30 +396,38 @@ def main(): model_args.config_name if model_args.config_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) config.update({"forced_decoder_ids": model_args.forced_decoder_ids, "suppress_tokens": model_args.suppress_tokens}) + # SpecAugment for whisper models + if getattr(config, "model_type", None) == "whisper": + config.update({"apply_spec_augment": model_args.apply_spec_augment}) + feature_extractor = AutoFeatureExtractor.from_pretrained( model_args.feature_extractor_name if model_args.feature_extractor_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) tokenizer = AutoTokenizer.from_pretrained( model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) model = AutoModelForSpeechSeq2Seq.from_pretrained( model_args.model_name_or_path, config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) if model.config.decoder_start_token_id is None: @@ -418,6 +460,12 @@ def main(): text_column_name = data_args.text_column_name model_input_name = feature_extractor.model_input_names[0] do_lower_case = 
data_args.do_lower_case + # if SpecAugment is used for whisper models, return attention_mask to guide the mask along time axis + forward_attention_mask = ( + getattr(config, "model_type", None) == "whisper" + and getattr(config, "apply_spec_augment", False) + and getattr(config, "mask_time_prob", 0) > 0 + ) if data_args.max_train_samples is not None: raw_datasets["train"] = raw_datasets["train"].select(range(data_args.max_train_samples)) @@ -428,10 +476,14 @@ def main(): def prepare_dataset(batch): # process audio sample = batch[audio_column_name] - inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"]) + inputs = feature_extractor( + sample["array"], sampling_rate=sample["sampling_rate"], return_attention_mask=forward_attention_mask + ) # process audio length batch[model_input_name] = inputs.get(model_input_name)[0] batch["input_length"] = len(sample["array"]) + if forward_attention_mask: + batch["attention_mask"] = inputs.get("attention_mask")[0] # process targets input_str = batch[text_column_name].lower() if do_lower_case else batch[text_column_name] @@ -468,7 +520,7 @@ def is_audio_in_length_range(length): return # 8. Load Metric - metric = evaluate.load("wer") + metric = evaluate.load("wer", cache_dir=model_args.cache_dir) def compute_metrics(pred): pred_ids = pred.predictions @@ -484,11 +536,14 @@ def compute_metrics(pred): return {"wer": wer} # 9. Create a single speech processor - if is_main_process(training_args.local_rank): - # save feature extractor, tokenizer and config - feature_extractor.save_pretrained(training_args.output_dir) - tokenizer.save_pretrained(training_args.output_dir) - config.save_pretrained(training_args.output_dir) + # make sure all processes wait until data is saved + with training_args.main_process_first(): + # only the main process saves them + if is_main_process(training_args.local_rank): + # save feature extractor, tokenizer and config + feature_extractor.save_pretrained(training_args.output_dir) + tokenizer.save_pretrained(training_args.output_dir) + config.save_pretrained(training_args.output_dir) processor = AutoProcessor.from_pretrained(training_args.output_dir) @@ -496,6 +551,7 @@ def compute_metrics(pred): data_collator = DataCollatorSpeechSeq2SeqWithPadding( processor=processor, decoder_start_token_id=model.config.decoder_start_token_id, + forward_attention_mask=forward_attention_mask, ) # 11. Initialize Trainer diff --git a/examples/pytorch/summarization/README.md b/examples/pytorch/summarization/README.md index db7f8f4061a5c9..93c0bbccef6c06 100644 --- a/examples/pytorch/summarization/README.md +++ b/examples/pytorch/summarization/README.md @@ -33,7 +33,7 @@ For the old `finetune_trainer.py` and related utils, see [`examples/legacy/seq2s `run_summarization.py` is a lightweight example of how to download and preprocess a dataset from the [🤗 Datasets](https://github.com/huggingface/datasets) library or use your own files (jsonlines or csv), then fine-tune one of the architectures above on it. -For custom datasets in `jsonlines` format please see: https://huggingface.co/docs/datasets/loading_datasets.html#json-files +For custom datasets in `jsonlines` format please see: https://huggingface.co/docs/datasets/loading_datasets#json-files and you also will find examples of these below. ## With Trainer @@ -41,7 +41,7 @@ and you also will find examples of these below. 
Here is an example on a summarization task: ```bash python examples/pytorch/summarization/run_summarization.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --dataset_name cnn_dailymail \ @@ -54,9 +54,9 @@ python examples/pytorch/summarization/run_summarization.py \ --predict_with_generate ``` -Only T5 models `t5-small`, `t5-base`, `t5-large`, `t5-3b` and `t5-11b` must use an additional argument: `--source_prefix "summarize: "`. +Only T5 models `google-t5/t5-small`, `google-t5/t5-base`, `google-t5/t5-large`, `google-t5/t5-3b` and `google-t5/t5-11b` must use an additional argument: `--source_prefix "summarize: "`. -We used CNN/DailyMail dataset in this example as `t5-small` was trained on it and one can get good scores even when pre-training with a very small sample. +We used CNN/DailyMail dataset in this example as `google-t5/t5-small` was trained on it and one can get good scores even when pre-training with a very small sample. Extreme Summarization (XSum) Dataset is another commonly used dataset for the task of summarization. To use it replace `--dataset_name cnn_dailymail --dataset_config "3.0.0"` with `--dataset_name xsum`. @@ -65,7 +65,7 @@ And here is how you would use it on your own files, after adjusting the values f ```bash python examples/pytorch/summarization/run_summarization.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --train_file path_to_csv_or_jsonlines_file \ @@ -156,7 +156,7 @@ then ```bash python run_summarization_no_trainer.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --dataset_name cnn_dailymail \ --dataset_config "3.0.0" \ --source_prefix "summarize: " \ @@ -179,7 +179,7 @@ that will check everything is ready for training. Finally, you can launch traini ```bash accelerate launch run_summarization_no_trainer.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --dataset_name cnn_dailymail \ --dataset_config "3.0.0" \ --source_prefix "summarize: " \ diff --git a/examples/pytorch/summarization/run_summarization.py b/examples/pytorch/summarization/run_summarization.py index b682f89ce5b829..793917264a7648 100755 --- a/examples/pytorch/summarization/run_summarization.py +++ b/examples/pytorch/summarization/run_summarization.py @@ -21,6 +21,7 @@ import logging import os import sys +import warnings from dataclasses import dataclass, field from typing import Optional @@ -52,7 +53,7 @@ # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/summarization/requirements.txt") @@ -99,12 +100,28 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." 
+ }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -188,7 +205,7 @@ class DataTrainingArguments: metadata={ "help": ( "The maximum total sequence length for validation target text after tokenization. Sequences longer " - "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`." + "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`. " "This argument is also used to override the ``max_length`` param of ``model.generate``, which is used " "during ``evaluate`` and ``predict``." ) @@ -232,7 +249,7 @@ class DataTrainingArguments: }, ) num_beams: Optional[int] = field( - default=None, + default=1, metadata={ "help": ( "Number of beams to use for evaluation. This argument will be passed to ``model.generate``, " @@ -247,14 +264,14 @@ class DataTrainingArguments: }, ) source_prefix: Optional[str] = field( - default="", metadata={"help": "A prefix to add before every source text (useful for T5 models)."} + default=None, metadata={"help": "A prefix to add before every source text (useful for T5 models)."} ) forced_bos_token: Optional[str] = field( default=None, metadata={ "help": ( - "The token to force as the first generated token after the decoder_start_token_id." + "The token to force as the first generated token after the decoder_start_token_id. " "Useful for multilingual models like mBART where the first generated token" "needs to be the target language token (Usually it is the target language token)" ) @@ -262,8 +279,13 @@ class DataTrainingArguments: ) def __post_init__(self): - if self.dataset_name is None and self.train_file is None and self.validation_file is None: - raise ValueError("Need either a dataset name or a training/validation file.") + if ( + self.dataset_name is None + and self.train_file is None + and self.validation_file is None + and self.test_file is None + ): + raise ValueError("Need either a dataset name or a training, validation, or test file.") else: if self.train_file is not None: extension = self.train_file.split(".")[-1] @@ -271,6 +293,9 @@ def __post_init__(self): if self.validation_file is not None: extension = self.validation_file.split(".")[-1] assert extension in ["csv", "json"], "`validation_file` should be a csv or a json file." + if self.test_file is not None: + extension = self.test_file.split(".")[-1] + assert extension in ["csv", "json"], "`test_file` should be a csv or a json file." if self.val_max_target_length is None: self.val_max_target_length = self.max_target_length @@ -304,6 +329,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. 
Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_summarization", model_args, data_args) @@ -314,6 +348,11 @@ def main(): datefmt="%m/%d/%Y %H:%M:%S", handlers=[logging.StreamHandler(sys.stdout)], ) + + if training_args.should_log: + # The default of training_args.log_level is passive, so we set log level at info here to have that default. + transformers.utils.logging.set_verbosity_info() + log_level = training_args.get_process_log_level() logger.setLevel(log_level) datasets.utils.logging.set_verbosity(log_level) @@ -323,17 +362,17 @@ def main(): # Log on each process the small summary: logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" - + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " + + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}" ) logger.info(f"Training/evaluation parameters {training_args}") if data_args.source_prefix is None and model_args.model_name_or_path in [ - "t5-small", - "t5-base", - "t5-large", - "t5-3b", - "t5-11b", + "google-t5/t5-small", + "google-t5/t5-base", + "google-t5/t5-large", + "google-t5/t5-3b", + "google-t5/t5-11b", ]: logger.warning( "You're running a t5 model but didn't provide a source prefix, which is the expected, e.g. with " @@ -373,7 +412,7 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: data_files = {} @@ -390,10 +429,10 @@ def main(): extension, data_files=data_files, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # Load pretrained model and tokenizer # @@ -404,14 +443,16 @@ def main(): model_args.config_name if model_args.config_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) tokenizer = AutoTokenizer.from_pretrained( model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) model = AutoModelForSeq2SeqLM.from_pretrained( model_args.model_name_or_path, @@ -419,7 +460,8 @@ def main(): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) # We resize the embeddings only when necessary to avoid index errors. 
If you are creating a model from scratch @@ -462,10 +504,16 @@ def main(): # Preprocessing the datasets. # We need to tokenize inputs and targets. if training_args.do_train: + if "train" not in raw_datasets: + raise ValueError("--do_train requires a train dataset") column_names = raw_datasets["train"].column_names elif training_args.do_eval: + if "validation" not in raw_datasets: + raise ValueError("--do_eval requires a validation dataset") column_names = raw_datasets["validation"].column_names elif training_args.do_predict: + if "test" not in raw_datasets: + raise ValueError("--do_predict requires a test dataset") column_names = raw_datasets["test"].column_names else: logger.info("There is nothing to do. Please pass `do_train`, `do_eval` and/or `do_predict`.") @@ -511,7 +559,7 @@ def main(): if training_args.label_smoothing_factor > 0 and not hasattr(model, "prepare_decoder_input_ids_from_labels"): logger.warning( - "label_smoothing is enabled but the `prepare_decoder_input_ids_from_labels` method is not defined for" + "label_smoothing is enabled but the `prepare_decoder_input_ids_from_labels` method is not defined for " f"`{model.__class__.__name__}`. This will lead to loss being calculated twice and will take up more memory" ) @@ -541,8 +589,6 @@ def preprocess_function(examples): return model_inputs if training_args.do_train: - if "train" not in raw_datasets: - raise ValueError("--do_train requires a train dataset") train_dataset = raw_datasets["train"] if data_args.max_train_samples is not None: max_train_samples = min(len(train_dataset), data_args.max_train_samples) @@ -559,8 +605,6 @@ def preprocess_function(examples): if training_args.do_eval: max_target_length = data_args.val_max_target_length - if "validation" not in raw_datasets: - raise ValueError("--do_eval requires a validation dataset") eval_dataset = raw_datasets["validation"] if data_args.max_eval_samples is not None: max_eval_samples = min(len(eval_dataset), data_args.max_eval_samples) @@ -577,8 +621,6 @@ def preprocess_function(examples): if training_args.do_predict: max_target_length = data_args.val_max_target_length - if "test" not in raw_datasets: - raise ValueError("--do_predict requires a test dataset") predict_dataset = raw_datasets["test"] if data_args.max_predict_samples is not None: max_predict_samples = min(len(predict_dataset), data_args.max_predict_samples) @@ -603,7 +645,7 @@ def preprocess_function(examples): ) # Metric - metric = evaluate.load("rouge") + metric = evaluate.load("rouge", cache_dir=model_args.cache_dir) def postprocess_text(preds, labels): preds = [pred.strip() for pred in preds] @@ -619,10 +661,10 @@ def compute_metrics(eval_preds): preds, labels = eval_preds if isinstance(preds, tuple): preds = preds[0] + # Replace -100s used for padding as we can't decode them + preds = np.where(preds != -100, preds, tokenizer.pad_token_id) decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True) - if data_args.ignore_pad_token_for_loss: - # Replace -100 in the labels as we can't decode them. 
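As context for the `-100` hunks above and the matching change to prediction decoding further down, here is a minimal sketch of the pattern they converge on (the helper name `decode_for_metrics` is hypothetical; it assumes numpy arrays of token ids and a seq2seq tokenizer):

```python
import numpy as np

def decode_for_metrics(preds, labels, tokenizer):
    # -100 is only the loss-masking sentinel, not a real vocabulary id, so map it
    # back to the pad token in both arrays before calling batch_decode
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    return decoded_preds, decoded_labels
```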
- labels = np.where(labels != -100, labels, tokenizer.pad_token_id) + labels = np.where(labels != -100, labels, tokenizer.pad_token_id) decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True) # Some simple post-processing @@ -634,6 +676,16 @@ def compute_metrics(eval_preds): result["gen_len"] = np.mean(prediction_lens) return result + # Override the decoding parameters of Seq2SeqTrainer + training_args.generation_max_length = ( + training_args.generation_max_length + if training_args.generation_max_length is not None + else data_args.val_max_target_length + ) + training_args.generation_num_beams = ( + data_args.num_beams if data_args.num_beams is not None else training_args.generation_num_beams + ) + # Initialize our Trainer trainer = Seq2SeqTrainer( model=model, @@ -667,15 +719,15 @@ def compute_metrics(eval_preds): # Evaluation results = {} - max_length = ( - training_args.generation_max_length - if training_args.generation_max_length is not None - else data_args.val_max_target_length - ) - num_beams = data_args.num_beams if data_args.num_beams is not None else training_args.generation_num_beams if training_args.do_eval: logger.info("*** Evaluate ***") - metrics = trainer.evaluate(max_length=max_length, num_beams=num_beams, metric_key_prefix="eval") + if isinstance(eval_dataset, dict): + metrics = {} + for eval_ds_name, eval_ds in eval_dataset.items(): + dataset_metrics = trainer.evaluate(eval_dataset=eval_ds, metric_key_prefix=f"eval_{eval_ds_name}") + metrics.update(dataset_metrics) + else: + metrics = trainer.evaluate(metric_key_prefix="eval") max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset) metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset)) @@ -685,9 +737,7 @@ def compute_metrics(eval_preds): if training_args.do_predict: logger.info("*** Predict ***") - predict_results = trainer.predict( - predict_dataset, metric_key_prefix="predict", max_length=max_length, num_beams=num_beams - ) + predict_results = trainer.predict(predict_dataset, metric_key_prefix="predict") metrics = predict_results.metrics max_predict_samples = ( data_args.max_predict_samples if data_args.max_predict_samples is not None else len(predict_dataset) @@ -699,8 +749,10 @@ def compute_metrics(eval_preds): if trainer.is_world_process_zero(): if training_args.predict_with_generate: + predictions = predict_results.predictions + predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id) predictions = tokenizer.batch_decode( - predict_results.predictions, skip_special_tokens=True, clean_up_tokenization_spaces=True + predictions, skip_special_tokens=True, clean_up_tokenization_spaces=True ) predictions = [pred.strip() for pred in predictions] output_prediction_file = os.path.join(training_args.output_dir, "generated_predictions.txt") diff --git a/examples/pytorch/summarization/run_summarization_no_trainer.py b/examples/pytorch/summarization/run_summarization_no_trainer.py index 8f669be72c5831..1cd9f3865df377 100644 --- a/examples/pytorch/summarization/run_summarization_no_trainer.py +++ b/examples/pytorch/summarization/run_summarization_no_trainer.py @@ -51,12 +51,12 @@ SchedulerType, get_scheduler, ) -from transformers.utils import check_min_version, get_full_repo_name, is_offline_mode, send_example_telemetry +from transformers.utils import check_min_version, is_offline_mode, send_example_telemetry from transformers.utils.versions import require_version # Will error if the minimal version of Transformers is 
not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") logger = get_logger(__name__) require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/summarization/requirements.txt") @@ -146,7 +146,7 @@ def parse_args(): default=128, help=( "The maximum total sequence length for target text after " - "tokenization. Sequences longer than this will be truncated, sequences shorter will be padded." + "tokenization. Sequences longer than this will be truncated, sequences shorter will be padded. " "during ``evaluate`` and ``predict``." ), ) @@ -161,15 +161,6 @@ def parse_args(): "param of ``model.generate``, which is used during ``evaluate`` and ``predict``." ), ) - parser.add_argument( - "--max_length", - type=int, - default=128, - help=( - "The maximum total input sequence length after tokenization. Sequences longer than this will be truncated," - " sequences shorter will be padded if `--pad_to_max_lengh` is passed." - ), - ) parser.add_argument( "--num_beams", type=int, @@ -275,6 +266,16 @@ def parse_args(): "--hub_model_id", type=str, help="The name of the repository to keep in sync with the local `output_dir`." ) parser.add_argument("--hub_token", type=str, help="The token to use to push to the Model Hub.") + parser.add_argument( + "--trust_remote_code", + type=bool, + default=False, + help=( + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." + ), + ) parser.add_argument( "--checkpointing_steps", type=str, @@ -298,7 +299,7 @@ def parse_args(): default="all", help=( 'The integration to report the results and logs to. Supported platforms are `"tensorboard"`,' - ' `"wandb"`, `"comet_ml"` and `"clearml"`. Use `"all"` (default) to report to all integrations.' + ' `"wandb"`, `"comet_ml"` and `"clearml"`. Use `"all"` (default) to report to all integrations. ' "Only applicable when `--with_tracking` is passed." ), ) @@ -334,15 +335,15 @@ def main(): if args.with_tracking: accelerator_log_kwargs["log_with"] = args.report_to - accelerator_log_kwargs["logging_dir"] = args.output_dir + accelerator_log_kwargs["project_dir"] = args.output_dir accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps, **accelerator_log_kwargs) if args.source_prefix is None and args.model_name_or_path in [ - "t5-small", - "t5-base", - "t5-large", - "t5-3b", - "t5-11b", + "google-t5/t5-small", + "google-t5/t5-base", + "google-t5/t5-large", + "google-t5/t5-3b", + "google-t5/t5-11b", ]: logger.warning( "You're running a t5 model but didn't provide a source prefix, which is the expected, e.g. 
with " @@ -369,12 +370,14 @@ def main(): # Handle the repository creation if accelerator.is_main_process: if args.push_to_hub: - if args.hub_model_id is None: - repo_name = get_full_repo_name(Path(args.output_dir).name, token=args.hub_token) - else: - repo_name = args.hub_model_id - create_repo(repo_name, exist_ok=True, token=args.hub_token) - repo = Repository(args.output_dir, clone_from=repo_name, token=args.hub_token) + # Retrieve of infer repo_name + repo_name = args.hub_model_id + if repo_name is None: + repo_name = Path(args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=args.hub_token).repo_id + # Clone repo locally + repo = Repository(args.output_dir, clone_from=repo_id, token=args.hub_token) with open(os.path.join(args.output_dir, ".gitignore"), "w+") as gitignore: if "step_*" not in gitignore: @@ -401,32 +404,37 @@ def main(): data_files = {} if args.train_file is not None: data_files["train"] = args.train_file + extension = args.train_file.split(".")[-1] if args.validation_file is not None: data_files["validation"] = args.validation_file - extension = args.train_file.split(".")[-1] + extension = args.validation_file.split(".")[-1] raw_datasets = load_dataset(extension, data_files=data_files) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # Load pretrained model and tokenizer # # In distributed training, the .from_pretrained methods guarantee that only one local process can concurrently # download model & vocab. if args.config_name: - config = AutoConfig.from_pretrained(args.config_name) + config = AutoConfig.from_pretrained(args.config_name, trust_remote_code=args.trust_remote_code) elif args.model_name_or_path: - config = AutoConfig.from_pretrained(args.model_name_or_path) + config = AutoConfig.from_pretrained(args.model_name_or_path, trust_remote_code=args.trust_remote_code) else: config = CONFIG_MAPPING[args.model_type]() logger.warning("You are instantiating a new config instance from scratch.") if args.tokenizer_name: - tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, use_fast=not args.use_slow_tokenizer) + tokenizer = AutoTokenizer.from_pretrained( + args.tokenizer_name, use_fast=not args.use_slow_tokenizer, trust_remote_code=args.trust_remote_code + ) elif args.model_name_or_path: - tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, use_fast=not args.use_slow_tokenizer) + tokenizer = AutoTokenizer.from_pretrained( + args.model_name_or_path, use_fast=not args.use_slow_tokenizer, trust_remote_code=args.trust_remote_code + ) else: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name." ) @@ -435,10 +443,11 @@ def main(): args.model_name_or_path, from_tf=bool(".ckpt" in args.model_name_or_path), config=config, + trust_remote_code=args.trust_remote_code, ) else: logger.info("Training new model from scratch") - model = AutoModelForSeq2SeqLM.from_config(config) + model = AutoModelForSeq2SeqLM.from_config(config, trust_remote_code=args.trust_remote_code) # We resize the embeddings only when necessary to avoid index errors. 
If you are creating a model from scratch # on a small vocab and want a smaller embedding size, remove this test. @@ -473,6 +482,9 @@ def main(): f"--summary_column' value '{args.summary_column}' needs to be one of: {', '.join(column_names)}" ) + if args.val_max_target_length is None: + args.val_max_target_length = args.max_target_length + # Temporarily set max_target_length for training. max_target_length = args.max_target_length padding = "max_length" if args.pad_to_max_length else False @@ -497,7 +509,7 @@ def preprocess_function(examples): return model_inputs with accelerator.main_process_first(): - processed_datasets = raw_datasets.map( + train_dataset = raw_datasets["train"].map( preprocess_function, batched=True, num_proc=args.preprocessing_num_workers, @@ -506,8 +518,16 @@ def preprocess_function(examples): desc="Running tokenizer on dataset", ) - train_dataset = processed_datasets["train"] - eval_dataset = processed_datasets["validation"] + # Temporarily set max_target_length for validation. + max_target_length = args.val_max_target_length + eval_dataset = raw_datasets["validation"].map( + preprocess_function, + batched=True, + num_proc=args.preprocessing_num_workers, + remove_columns=column_names, + load_from_cache_file=not args.overwrite_cache, + desc="Running tokenizer on dataset", + ) # Log a few random samples from the training set: for index in random.sample(range(len(train_dataset)), 1): @@ -561,8 +581,10 @@ def postprocess_text(preds, labels): lr_scheduler = get_scheduler( name=args.lr_scheduler_type, optimizer=optimizer, - num_warmup_steps=args.num_warmup_steps * args.gradient_accumulation_steps, - num_training_steps=args.max_train_steps * args.gradient_accumulation_steps, + num_warmup_steps=args.num_warmup_steps * accelerator.num_processes, + num_training_steps=args.max_train_steps + if overrode_max_train_steps + else args.max_train_steps * accelerator.num_processes, ) # Prepare everything with our `accelerator`. 
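The scheduler hunk above multiplies the warmup and total steps by `accelerator.num_processes` because a scheduler passed through `accelerator.prepare` is stepped `num_processes` times per optimizer update. A simplified sketch of that setup (it assumes the script's `args`, `model` and `optimizer` already exist, and omits the special case for an overridden `max_train_steps`):

```python
from accelerate import Accelerator
from transformers import get_scheduler

accelerator = Accelerator(gradient_accumulation_steps=args.gradient_accumulation_steps)
lr_scheduler = get_scheduler(
    name=args.lr_scheduler_type,
    optimizer=optimizer,
    # scale by the number of processes so the effective schedule still spans max_train_steps
    num_warmup_steps=args.num_warmup_steps * accelerator.num_processes,
    num_training_steps=args.max_train_steps * accelerator.num_processes,
)
model, optimizer, lr_scheduler = accelerator.prepare(model, optimizer, lr_scheduler)
```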
@@ -610,36 +632,45 @@ def postprocess_text(preds, labels): # Potentially load in the weights and states from a previous save if args.resume_from_checkpoint: if args.resume_from_checkpoint is not None or args.resume_from_checkpoint != "": - accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}") - accelerator.load_state(args.resume_from_checkpoint) + checkpoint_path = args.resume_from_checkpoint path = os.path.basename(args.resume_from_checkpoint) else: # Get the most recent checkpoint dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()] dirs.sort(key=os.path.getctime) path = dirs[-1] # Sorts folders by date modified, most recent checkpoint is the last + checkpoint_path = path + path = os.path.basename(checkpoint_path) + + accelerator.print(f"Resumed from checkpoint: {checkpoint_path}") + accelerator.load_state(checkpoint_path) # Extract `epoch_{i}` or `step_{i}` training_difference = os.path.splitext(path)[0] if "epoch" in training_difference: starting_epoch = int(training_difference.replace("epoch_", "")) + 1 resume_step = None + completed_steps = starting_epoch * num_update_steps_per_epoch else: - resume_step = int(training_difference.replace("step_", "")) + # need to multiply `gradient_accumulation_steps` to reflect real steps + resume_step = int(training_difference.replace("step_", "")) * args.gradient_accumulation_steps starting_epoch = resume_step // len(train_dataloader) + completed_steps = resume_step // args.gradient_accumulation_steps resume_step -= starting_epoch * len(train_dataloader) + # update the progress_bar if load from checkpoint + progress_bar.update(completed_steps) + for epoch in range(starting_epoch, args.num_train_epochs): model.train() if args.with_tracking: total_loss = 0 - for step, batch in enumerate(train_dataloader): - # We need to skip steps until we reach the resumed step - if args.resume_from_checkpoint and epoch == starting_epoch: - if resume_step is not None and step < resume_step: - completed_steps += 1 - continue - + if args.resume_from_checkpoint and epoch == starting_epoch and resume_step is not None: + # We skip the first `n` batches in the dataloader when resuming from a checkpoint + active_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step) + else: + active_dataloader = train_dataloader + for step, batch in enumerate(active_dataloader): with accelerator.accumulate(model): outputs = model(**batch) loss = outputs.loss @@ -658,7 +689,7 @@ def postprocess_text(preds, labels): if isinstance(checkpointing_steps, int): if completed_steps % checkpointing_steps == 0: - output_dir = f"step_{completed_steps }" + output_dir = f"step_{completed_steps}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) @@ -667,11 +698,9 @@ def postprocess_text(preds, labels): break model.eval() - if args.val_max_target_length is None: - args.val_max_target_length = args.max_target_length gen_kwargs = { - "max_length": args.val_max_target_length if args is not None else config.max_length, + "max_length": args.val_max_target_length, "num_beams": args.num_beams, } for step, batch in enumerate(eval_dataloader): diff --git a/examples/pytorch/test_accelerate_examples.py b/examples/pytorch/test_accelerate_examples.py index d88a2ead64b4ae..918167635e854b 100644 --- a/examples/pytorch/test_accelerate_examples.py +++ b/examples/pytorch/test_accelerate_examples.py @@ -21,13 +21,18 @@ import shutil import sys import tempfile +import unittest from unittest import mock 
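For the checkpoint-resume rework in the hunk above, a compressed sketch of the pattern, assuming `accelerator`, `train_dataloader`, `args`, and `step_{N}`-style checkpoint directories as used by the script:

```python
import os

accelerator.load_state(checkpoint_path)
training_difference = os.path.splitext(os.path.basename(checkpoint_path))[0]
# checkpoint names count optimizer updates, so convert back to raw dataloader batches
resume_step = int(training_difference.replace("step_", "")) * args.gradient_accumulation_steps
starting_epoch = resume_step // len(train_dataloader)
resume_step -= starting_epoch * len(train_dataloader)
# fast-forward through the already-seen batches instead of looping over them manually
active_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step)
```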
-import torch from accelerate.utils import write_basic_config -from transformers.testing_utils import TestCasePlus, get_gpu_count, run_command, slow, torch_device -from transformers.utils import is_apex_available +from transformers.testing_utils import ( + TestCasePlus, + backend_device_count, + run_command, + slow, + torch_device, +) logging.basicConfig(level=logging.DEBUG) @@ -53,11 +58,6 @@ def get_results(output_dir): return results -def is_cuda_and_apex_available(): - is_using_cuda = torch.cuda.is_available() and torch_device == "cuda" - return is_using_cuda and is_apex_available() - - stream_handler = logging.StreamHandler(sys.stdout) logger.addHandler(stream_handler) @@ -75,12 +75,12 @@ def setUpClass(cls): def tearDownClass(cls): shutil.rmtree(cls.tmpdir) - @mock.patch.dict(os.environ, {"WANDB_MODE": "offline"}) + @mock.patch.dict(os.environ, {"WANDB_MODE": "offline", "DVCLIVE_TEST": "true"}) def test_run_glue_no_trainer(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" {self.examples_dir}/pytorch/text-classification/run_glue_no_trainer.py - --model_name_or_path distilbert-base-uncased + --model_name_or_path distilbert/distilbert-base-uncased --output_dir {tmp_dir} --train_file ./tests/fixtures/tests_samples/MRPC/train.csv --validation_file ./tests/fixtures/tests_samples/MRPC/dev.csv @@ -88,25 +88,24 @@ def test_run_glue_no_trainer(self): --per_device_eval_batch_size=1 --learning_rate=1e-4 --seed=42 + --num_warmup_steps=2 --checkpointing_steps epoch --with_tracking """.split() - if is_cuda_and_apex_available(): - testargs.append("--fp16") - run_command(self._launch_args + testargs) result = get_results(tmp_dir) self.assertGreaterEqual(result["eval_accuracy"], 0.75) self.assertTrue(os.path.exists(os.path.join(tmp_dir, "epoch_0"))) self.assertTrue(os.path.exists(os.path.join(tmp_dir, "glue_no_trainer"))) - @mock.patch.dict(os.environ, {"WANDB_MODE": "offline"}) + @unittest.skip("Zach is working on this.") + @mock.patch.dict(os.environ, {"WANDB_MODE": "offline", "DVCLIVE_TEST": "true"}) def test_run_clm_no_trainer(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" {self.examples_dir}/pytorch/language-modeling/run_clm_no_trainer.py - --model_name_or_path distilgpt2 + --model_name_or_path distilbert/distilgpt2 --train_file ./tests/fixtures/sample_text.txt --validation_file ./tests/fixtures/sample_text.txt --block_size 128 @@ -118,7 +117,7 @@ def test_run_clm_no_trainer(self): --with_tracking """.split() - if torch.cuda.device_count() > 1: + if backend_device_count(torch_device) > 1: # Skipping because there are not enough batches to train the model + would need a drop_last to work. 
return @@ -128,12 +127,13 @@ def test_run_clm_no_trainer(self): self.assertTrue(os.path.exists(os.path.join(tmp_dir, "epoch_0"))) self.assertTrue(os.path.exists(os.path.join(tmp_dir, "clm_no_trainer"))) - @mock.patch.dict(os.environ, {"WANDB_MODE": "offline"}) + @unittest.skip("Zach is working on this.") + @mock.patch.dict(os.environ, {"WANDB_MODE": "offline", "DVCLIVE_TEST": "true"}) def test_run_mlm_no_trainer(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" {self.examples_dir}/pytorch/language-modeling/run_mlm_no_trainer.py - --model_name_or_path distilroberta-base + --model_name_or_path distilbert/distilroberta-base --train_file ./tests/fixtures/sample_text.txt --validation_file ./tests/fixtures/sample_text.txt --output_dir {tmp_dir} @@ -148,15 +148,15 @@ def test_run_mlm_no_trainer(self): self.assertTrue(os.path.exists(os.path.join(tmp_dir, "epoch_0"))) self.assertTrue(os.path.exists(os.path.join(tmp_dir, "mlm_no_trainer"))) - @mock.patch.dict(os.environ, {"WANDB_MODE": "offline"}) + @mock.patch.dict(os.environ, {"WANDB_MODE": "offline", "DVCLIVE_TEST": "true"}) def test_run_ner_no_trainer(self): # with so little data distributed training needs more epochs to get the score on par with 0/1 gpu - epochs = 7 if get_gpu_count() > 1 else 2 + epochs = 7 if backend_device_count(torch_device) > 1 else 2 tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" {self.examples_dir}/pytorch/token-classification/run_ner_no_trainer.py - --model_name_or_path bert-base-uncased + --model_name_or_path google-bert/bert-base-uncased --train_file tests/fixtures/tests_samples/conll/sample.json --validation_file tests/fixtures/tests_samples/conll/sample.json --output_dir {tmp_dir} @@ -172,16 +172,16 @@ def test_run_ner_no_trainer(self): run_command(self._launch_args + testargs) result = get_results(tmp_dir) self.assertGreaterEqual(result["eval_accuracy"], 0.75) - self.assertLess(result["train_loss"], 0.5) + self.assertLess(result["train_loss"], 0.6) self.assertTrue(os.path.exists(os.path.join(tmp_dir, "epoch_0"))) self.assertTrue(os.path.exists(os.path.join(tmp_dir, "ner_no_trainer"))) - @mock.patch.dict(os.environ, {"WANDB_MODE": "offline"}) + @mock.patch.dict(os.environ, {"WANDB_MODE": "offline", "DVCLIVE_TEST": "true"}) def test_run_squad_no_trainer(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" {self.examples_dir}/pytorch/question-answering/run_qa_no_trainer.py - --model_name_or_path bert-base-uncased + --model_name_or_path google-bert/bert-base-uncased --version_2_with_negative --train_file tests/fixtures/tests_samples/SQUAD/sample.json --validation_file tests/fixtures/tests_samples/SQUAD/sample.json @@ -204,12 +204,12 @@ def test_run_squad_no_trainer(self): self.assertTrue(os.path.exists(os.path.join(tmp_dir, "epoch_0"))) self.assertTrue(os.path.exists(os.path.join(tmp_dir, "qa_no_trainer"))) - @mock.patch.dict(os.environ, {"WANDB_MODE": "offline"}) + @mock.patch.dict(os.environ, {"WANDB_MODE": "offline", "DVCLIVE_TEST": "true"}) def test_run_swag_no_trainer(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" {self.examples_dir}/pytorch/multiple-choice/run_swag_no_trainer.py - --model_name_or_path bert-base-uncased + --model_name_or_path google-bert/bert-base-uncased --train_file tests/fixtures/tests_samples/swag/sample.json --validation_file tests/fixtures/tests_samples/swag/sample.json --output_dir {tmp_dir} @@ -227,12 +227,12 @@ def test_run_swag_no_trainer(self): self.assertTrue(os.path.exists(os.path.join(tmp_dir, "swag_no_trainer"))) @slow - 
@mock.patch.dict(os.environ, {"WANDB_MODE": "offline"}) + @mock.patch.dict(os.environ, {"WANDB_MODE": "offline", "DVCLIVE_TEST": "true"}) def test_run_summarization_no_trainer(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" {self.examples_dir}/pytorch/summarization/run_summarization_no_trainer.py - --model_name_or_path t5-small + --model_name_or_path google-t5/t5-small --train_file tests/fixtures/tests_samples/xsum/sample.json --validation_file tests/fixtures/tests_samples/xsum/sample.json --output_dir {tmp_dir} @@ -255,7 +255,7 @@ def test_run_summarization_no_trainer(self): self.assertTrue(os.path.exists(os.path.join(tmp_dir, "summarization_no_trainer"))) @slow - @mock.patch.dict(os.environ, {"WANDB_MODE": "offline"}) + @mock.patch.dict(os.environ, {"WANDB_MODE": "offline", "DVCLIVE_TEST": "true"}) def test_run_translation_no_trainer(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" @@ -268,6 +268,7 @@ def test_run_translation_no_trainer(self): --output_dir {tmp_dir} --max_train_steps=50 --num_warmup_steps=8 + --num_beams=6 --learning_rate=3e-3 --per_device_train_batch_size=2 --per_device_eval_batch_size=1 @@ -305,7 +306,7 @@ def test_run_semantic_segmentation_no_trainer(self): result = get_results(tmp_dir) self.assertGreaterEqual(result["eval_overall_accuracy"], 0.10) - @mock.patch.dict(os.environ, {"WANDB_MODE": "offline"}) + @mock.patch.dict(os.environ, {"WANDB_MODE": "offline", "DVCLIVE_TEST": "true"}) def test_run_image_classification_no_trainer(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" @@ -321,14 +322,12 @@ def test_run_image_classification_no_trainer(self): --output_dir {tmp_dir} --with_tracking --checkpointing_steps 1 + --label_column_name labels """.split() - if is_cuda_and_apex_available(): - testargs.append("--fp16") - run_command(self._launch_args + testargs) result = get_results(tmp_dir) # The base model scores a 25% - self.assertGreaterEqual(result["eval_accuracy"], 0.6) + self.assertGreaterEqual(result["eval_accuracy"], 0.4) self.assertTrue(os.path.exists(os.path.join(tmp_dir, "step_1"))) self.assertTrue(os.path.exists(os.path.join(tmp_dir, "image_classification_no_trainer"))) diff --git a/examples/pytorch/test_pytorch_examples.py b/examples/pytorch/test_pytorch_examples.py index f4682b8933e7e5..1d4f8db9259087 100644 --- a/examples/pytorch/test_pytorch_examples.py +++ b/examples/pytorch/test_pytorch_examples.py @@ -14,18 +14,21 @@ # limitations under the License. 
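The test changes in this area swap the CUDA/Apex-specific checks for the device-agnostic helpers in `transformers.testing_utils`; a small sketch of the resulting guard pattern (argument values are illustrative only):

```python
from transformers.testing_utils import (
    backend_device_count,
    is_torch_fp16_available_on_device,
    torch_device,
)

testargs = ["run_glue.py", "--per_device_train_batch_size", "2"]
# only request mixed precision when the current backend actually supports fp16
if is_torch_fp16_available_on_device(torch_device):
    testargs.append("--fp16")

# with tiny fixtures, distributed runs need more epochs to reach the same score
epochs = 7 if backend_device_count(torch_device) > 1 else 2
testargs.append(f"--num_train_epochs={epochs}")
```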
-import argparse import json import logging import os import sys from unittest.mock import patch -import torch - from transformers import ViTMAEForPreTraining, Wav2Vec2ForPreTraining -from transformers.testing_utils import CaptureLogger, TestCasePlus, get_gpu_count, slow, torch_device -from transformers.utils import is_apex_available +from transformers.testing_utils import ( + CaptureLogger, + TestCasePlus, + backend_device_count, + is_torch_fp16_available_on_device, + slow, + torch_device, +) SRC_DIRS = [ @@ -63,6 +66,7 @@ import run_semantic_segmentation import run_seq2seq_qa as run_squad_seq2seq import run_speech_recognition_ctc + import run_speech_recognition_ctc_adapter import run_speech_recognition_seq2seq import run_summarization import run_swag @@ -75,13 +79,6 @@ logger = logging.getLogger() -def get_setup_file(): - parser = argparse.ArgumentParser() - parser.add_argument("-f") - args = parser.parse_args() - return args.f - - def get_results(output_dir): results = {} path = os.path.join(output_dir, "all_results.json") @@ -93,11 +90,6 @@ def get_results(output_dir): return results -def is_cuda_and_apex_available(): - is_using_cuda = torch.cuda.is_available() and torch_device == "cuda" - return is_using_cuda and is_apex_available() - - stream_handler = logging.StreamHandler(sys.stdout) logger.addHandler(stream_handler) @@ -107,7 +99,7 @@ def test_run_glue(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" run_glue.py - --model_name_or_path distilbert-base-uncased + --model_name_or_path distilbert/distilbert-base-uncased --output_dir {tmp_dir} --overwrite_output_dir --train_file ./tests/fixtures/tests_samples/MRPC/train.csv @@ -123,7 +115,7 @@ def test_run_glue(self): --max_seq_length=128 """.split() - if is_cuda_and_apex_available(): + if is_torch_fp16_available_on_device(torch_device): testargs.append("--fp16") with patch.object(sys, "argv", testargs): @@ -135,7 +127,7 @@ def test_run_clm(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" run_clm.py - --model_name_or_path distilgpt2 + --model_name_or_path distilbert/distilgpt2 --train_file ./tests/fixtures/sample_text.txt --validation_file ./tests/fixtures/sample_text.txt --do_train @@ -148,12 +140,12 @@ def test_run_clm(self): --overwrite_output_dir """.split() - if torch.cuda.device_count() > 1: + if backend_device_count(torch_device) > 1: # Skipping because there are not enough batches to train the model + would need a drop_last to work. 
return - if torch_device != "cuda": - testargs.append("--no_cuda") + if torch_device == "cpu": + testargs.append("--use_cpu") with patch.object(sys, "argv", testargs): run_clm.main() @@ -168,14 +160,14 @@ def test_run_clm_config_overrides(self): testargs = f""" run_clm.py --model_type gpt2 - --tokenizer_name gpt2 + --tokenizer_name openai-community/gpt2 --train_file ./tests/fixtures/sample_text.txt --output_dir {tmp_dir} --config_overrides n_embd=10,n_head=2 """.split() - if torch_device != "cuda": - testargs.append("--no_cuda") + if torch_device == "cpu": + testargs.append("--use_cpu") logger = run_clm.logger with patch.object(sys, "argv", testargs): @@ -189,7 +181,7 @@ def test_run_mlm(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" run_mlm.py - --model_name_or_path distilroberta-base + --model_name_or_path distilbert/distilroberta-base --train_file ./tests/fixtures/sample_text.txt --validation_file ./tests/fixtures/sample_text.txt --output_dir {tmp_dir} @@ -200,8 +192,8 @@ def test_run_mlm(self): --num_train_epochs=1 """.split() - if torch_device != "cuda": - testargs.append("--no_cuda") + if torch_device == "cpu": + testargs.append("--use_cpu") with patch.object(sys, "argv", testargs): run_mlm.main() @@ -210,12 +202,12 @@ def test_run_mlm(self): def test_run_ner(self): # with so little data distributed training needs more epochs to get the score on par with 0/1 gpu - epochs = 7 if get_gpu_count() > 1 else 2 + epochs = 7 if backend_device_count(torch_device) > 1 else 2 tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" run_ner.py - --model_name_or_path bert-base-uncased + --model_name_or_path google-bert/bert-base-uncased --train_file tests/fixtures/tests_samples/conll/sample.json --validation_file tests/fixtures/tests_samples/conll/sample.json --output_dir {tmp_dir} @@ -230,8 +222,8 @@ def test_run_ner(self): --seed 7 """.split() - if torch_device != "cuda": - testargs.append("--no_cuda") + if torch_device == "cpu": + testargs.append("--use_cpu") with patch.object(sys, "argv", testargs): run_ner.main() @@ -243,7 +235,7 @@ def test_run_squad(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" run_qa.py - --model_name_or_path bert-base-uncased + --model_name_or_path google-bert/bert-base-uncased --version_2_with_negative --train_file tests/fixtures/tests_samples/SQUAD/sample.json --validation_file tests/fixtures/tests_samples/SQUAD/sample.json @@ -268,7 +260,7 @@ def test_run_squad_seq2seq(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" run_seq2seq_qa.py - --model_name_or_path t5-small + --model_name_or_path google-t5/t5-small --context_column context --question_column question --answer_column answers @@ -297,7 +289,7 @@ def test_run_swag(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" run_swag.py - --model_name_or_path bert-base-uncased + --model_name_or_path google-bert/bert-base-uncased --train_file tests/fixtures/tests_samples/swag/sample.json --validation_file tests/fixtures/tests_samples/swag/sample.json --output_dir {tmp_dir} @@ -319,7 +311,7 @@ def test_run_swag(self): def test_generation(self): testargs = ["run_generation.py", "--prompt=Hello", "--length=10", "--seed=42"] - if is_cuda_and_apex_available(): + if is_torch_fp16_available_on_device(torch_device): testargs.append("--fp16") model_type, model_name = ( @@ -335,7 +327,7 @@ def test_run_summarization(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" run_summarization.py - --model_name_or_path t5-small + --model_name_or_path google-t5/t5-small 
--train_file tests/fixtures/tests_samples/xsum/sample.json --validation_file tests/fixtures/tests_samples/xsum/sample.json --output_dir {tmp_dir} @@ -406,9 +398,10 @@ def test_run_image_classification(self): --max_steps 10 --train_val_split 0.1 --seed 42 + --label_column_name labels """.split() - if is_cuda_and_apex_available(): + if is_torch_fp16_available_on_device(torch_device): testargs.append("--fp16") with patch.object(sys, "argv", testargs): @@ -438,7 +431,7 @@ def test_run_speech_recognition_ctc(self): --seed 42 """.split() - if is_cuda_and_apex_available(): + if is_torch_fp16_available_on_device(torch_device): testargs.append("--fp16") with patch.object(sys, "argv", testargs): @@ -446,6 +439,38 @@ def test_run_speech_recognition_ctc(self): result = get_results(tmp_dir) self.assertLess(result["eval_loss"], result["train_loss"]) + def test_run_speech_recognition_ctc_adapter(self): + tmp_dir = self.get_auto_remove_tmp_dir() + testargs = f""" + run_speech_recognition_ctc_adapter.py + --output_dir {tmp_dir} + --model_name_or_path hf-internal-testing/tiny-random-wav2vec2 + --dataset_name hf-internal-testing/librispeech_asr_dummy + --dataset_config_name clean + --train_split_name validation + --eval_split_name validation + --do_train + --do_eval + --learning_rate 1e-4 + --per_device_train_batch_size 2 + --per_device_eval_batch_size 1 + --remove_unused_columns False + --overwrite_output_dir True + --preprocessing_num_workers 16 + --max_steps 10 + --target_language tur + --seed 42 + """.split() + + if is_torch_fp16_available_on_device(torch_device): + testargs.append("--fp16") + + with patch.object(sys, "argv", testargs): + run_speech_recognition_ctc_adapter.main() + result = get_results(tmp_dir) + self.assertTrue(os.path.isfile(os.path.join(tmp_dir, "./adapter.tur.safetensors"))) + self.assertLess(result["eval_loss"], result["train_loss"]) + def test_run_speech_recognition_seq2seq(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" @@ -468,7 +493,7 @@ def test_run_speech_recognition_seq2seq(self): --seed 42 """.split() - if is_cuda_and_apex_available(): + if is_torch_fp16_available_on_device(torch_device): testargs.append("--fp16") with patch.object(sys, "argv", testargs): @@ -500,7 +525,7 @@ def test_run_audio_classification(self): --seed 42 """.split() - if is_cuda_and_apex_available(): + if is_torch_fp16_available_on_device(torch_device): testargs.append("--fp16") with patch.object(sys, "argv", testargs): @@ -526,9 +551,6 @@ def test_run_wav2vec2_pretraining(self): --seed 42 """.split() - if is_cuda_and_apex_available(): - testargs.append("--fp16") - with patch.object(sys, "argv", testargs): run_wav2vec2_pretraining_no_trainer.main() model = Wav2Vec2ForPreTraining.from_pretrained(tmp_dir) @@ -554,7 +576,7 @@ def test_run_vit_mae_pretraining(self): --seed 42 """.split() - if is_cuda_and_apex_available(): + if is_torch_fp16_available_on_device(torch_device): testargs.append("--fp16") with patch.object(sys, "argv", testargs): @@ -579,7 +601,7 @@ def test_run_semantic_segmentation(self): --seed 32 """.split() - if is_cuda_and_apex_available(): + if is_torch_fp16_available_on_device(torch_device): testargs.append("--fp16") with patch.object(sys, "argv", testargs): diff --git a/examples/pytorch/text-classification/README.md b/examples/pytorch/text-classification/README.md index 1bc01b416b74c2..6eae65e7c4bc51 100644 --- a/examples/pytorch/text-classification/README.md +++ b/examples/pytorch/text-classification/README.md @@ -31,7 +31,7 @@ GLUE is made up of a total of 9 different 
tasks. Here is how to run the script o export TASK_NAME=mrpc python run_glue.py \ - --model_name_or_path bert-base-cased \ + --model_name_or_path google-bert/bert-base-cased \ --task_name $TASK_NAME \ --do_train \ --do_eval \ @@ -68,7 +68,7 @@ The following example fine-tunes BERT on the `imdb` dataset hosted on our [hub]( ```bash python run_glue.py \ - --model_name_or_path bert-base-cased \ + --model_name_or_path google-bert/bert-base-cased \ --dataset_name imdb \ --do_train \ --do_predict \ @@ -81,6 +81,55 @@ python run_glue.py \ > If your model classification head dimensions do not fit the number of labels in the dataset, you can specify `--ignore_mismatched_sizes` to adapt it. +## Text classification +As an alternative, we can use the script [`run_classification.py`](./run_classification.py) to fine-tune models on a single/multi-label classification task. + +The following example fine-tunes BERT on the `en` subset of [`amazon_reviews_multi`](https://huggingface.co/datasets/amazon_reviews_multi) dataset. +We can specify the metric, the label column and aso choose which text columns to use jointly for classification. +```bash +dataset="amazon_reviews_multi" +subset="en" +python run_classification.py \ + --model_name_or_path google-bert/bert-base-uncased \ + --dataset_name ${dataset} \ + --dataset_config_name ${subset} \ + --shuffle_train_dataset \ + --metric_name accuracy \ + --text_column_name "review_title,review_body,product_category" \ + --text_column_delimiter "\n" \ + --label_column_name stars \ + --do_train \ + --do_eval \ + --max_seq_length 512 \ + --per_device_train_batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 1 \ + --output_dir /tmp/${dataset}_${subset}/ +``` +Training for 1 epoch results in acc of around 0.5958 for review_body only and 0.659 for title+body+category. + +The following is a multi-label classification example. It fine-tunes BERT on the `reuters21578` dataset hosted on our [hub](https://huggingface.co/datasets/reuters21578): +```bash +dataset="reuters21578" +subset="ModApte" +python run_classification.py \ + --model_name_or_path google-bert/bert-base-uncased \ + --dataset_name ${dataset} \ + --dataset_config_name ${subset} \ + --shuffle_train_dataset \ + --remove_splits "unused" \ + --metric_name f1 \ + --text_column_name text \ + --label_column_name topics \ + --do_train \ + --do_eval \ + --max_seq_length 512 \ + --per_device_train_batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 15 \ + --output_dir /tmp/${dataset}_${subset}/ +``` + It results in a Micro F1 score of around 0.82 without any text and label filtering. Note that you have to explicitly remove the "unused" split from the dataset, since it is not used for classification. ### Mixed precision training @@ -126,7 +175,7 @@ then export TASK_NAME=mrpc python run_glue_no_trainer.py \ - --model_name_or_path bert-base-cased \ + --model_name_or_path google-bert/bert-base-cased \ --task_name $TASK_NAME \ --max_length 128 \ --per_device_train_batch_size 32 \ @@ -153,7 +202,7 @@ that will check everything is ready for training. Finally, you can launch traini export TASK_NAME=mrpc accelerate launch run_glue_no_trainer.py \ - --model_name_or_path bert-base-cased \ + --model_name_or_path google-bert/bert-base-cased \ --task_name $TASK_NAME \ --max_length 128 \ --per_device_train_batch_size 32 \ @@ -183,7 +232,7 @@ This example code fine-tunes mBERT (multi-lingual BERT) on the XNLI dataset. 
It ```bash python run_xnli.py \ - --model_name_or_path bert-base-multilingual-cased \ + --model_name_or_path google-bert/bert-base-multilingual-cased \ --language de \ --train_language en \ --do_train \ diff --git a/examples/pytorch/text-classification/run_classification.py b/examples/pytorch/text-classification/run_classification.py new file mode 100755 index 00000000000000..ceb16f14ec3368 --- /dev/null +++ b/examples/pytorch/text-classification/run_classification.py @@ -0,0 +1,763 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2020 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Finetuning the library models for text classification.""" +# You can also adapt this script on your own text classification task. Pointers for this are left as comments. + +import logging +import os +import random +import sys +import warnings +from dataclasses import dataclass, field +from typing import List, Optional + +import datasets +import evaluate +import numpy as np +from datasets import Value, load_dataset + +import transformers +from transformers import ( + AutoConfig, + AutoModelForSequenceClassification, + AutoTokenizer, + DataCollatorWithPadding, + EvalPrediction, + HfArgumentParser, + Trainer, + TrainingArguments, + default_data_collator, + set_seed, +) +from transformers.trainer_utils import get_last_checkpoint +from transformers.utils import check_min_version, send_example_telemetry +from transformers.utils.versions import require_version + + +# Will error if the minimal version of Transformers is not installed. Remove at your own risks. +check_min_version("4.38.0.dev0") + +require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/text-classification/requirements.txt") + + +logger = logging.getLogger(__name__) + + +@dataclass +class DataTrainingArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. + + Using `HfArgumentParser` we can turn this class + into argparse arguments to be able to specify them on + the command line. + """ + + dataset_name: Optional[str] = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + dataset_config_name: Optional[str] = field( + default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."} + ) + do_regression: bool = field( + default=None, + metadata={ + "help": "Whether to do regression instead of classification. If None, will be inferred from the dataset." + }, + ) + text_column_names: Optional[str] = field( + default=None, + metadata={ + "help": ( + "The name of the text column in the input dataset or a CSV/JSON file. " + 'If not specified, will use the "sentence" column for single/multi-label classification task.' 
+ ) + }, + ) + text_column_delimiter: Optional[str] = field( + default=" ", metadata={"help": "THe delimiter to use to join text columns into a single sentence."} + ) + train_split_name: Optional[str] = field( + default=None, + metadata={ + "help": 'The name of the train split in the input dataset. If not specified, will use the "train" split when do_train is enabled' + }, + ) + validation_split_name: Optional[str] = field( + default=None, + metadata={ + "help": 'The name of the validation split in the input dataset. If not specified, will use the "validation" split when do_eval is enabled' + }, + ) + test_split_name: Optional[str] = field( + default=None, + metadata={ + "help": 'The name of the test split in the input dataset. If not specified, will use the "test" split when do_predict is enabled' + }, + ) + remove_splits: Optional[str] = field( + default=None, + metadata={"help": "The splits to remove from the dataset. Multiple splits should be separated by commas."}, + ) + remove_columns: Optional[str] = field( + default=None, + metadata={"help": "The columns to remove from the dataset. Multiple columns should be separated by commas."}, + ) + label_column_name: Optional[str] = field( + default=None, + metadata={ + "help": ( + "The name of the label column in the input dataset or a CSV/JSON file. " + 'If not specified, will use the "label" column for single/multi-label classification task' + ) + }, + ) + max_seq_length: int = field( + default=128, + metadata={ + "help": ( + "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + ) + }, + ) + overwrite_cache: bool = field( + default=False, metadata={"help": "Overwrite the cached preprocessed datasets or not."} + ) + pad_to_max_length: bool = field( + default=True, + metadata={ + "help": ( + "Whether to pad all samples to `max_seq_length`. " + "If False, will pad the samples dynamically when batching to the maximum length in the batch." + ) + }, + ) + shuffle_train_dataset: bool = field( + default=False, metadata={"help": "Whether to shuffle the train dataset or not."} + ) + shuffle_seed: int = field( + default=42, metadata={"help": "Random seed that will be used to shuffle the train dataset."} + ) + max_train_samples: Optional[int] = field( + default=None, + metadata={ + "help": ( + "For debugging purposes or quicker training, truncate the number of training examples to this " + "value if set." + ) + }, + ) + max_eval_samples: Optional[int] = field( + default=None, + metadata={ + "help": ( + "For debugging purposes or quicker training, truncate the number of evaluation examples to this " + "value if set." + ) + }, + ) + max_predict_samples: Optional[int] = field( + default=None, + metadata={ + "help": ( + "For debugging purposes or quicker training, truncate the number of prediction examples to this " + "value if set." 
+ ) + }, + ) + metric_name: Optional[str] = field(default=None, metadata={"help": "The metric to use for evaluation."}) + train_file: Optional[str] = field( + default=None, metadata={"help": "A csv or a json file containing the training data."} + ) + validation_file: Optional[str] = field( + default=None, metadata={"help": "A csv or a json file containing the validation data."} + ) + test_file: Optional[str] = field(default=None, metadata={"help": "A csv or a json file containing the test data."}) + + def __post_init__(self): + if self.dataset_name is None: + if self.train_file is None or self.validation_file is None: + raise ValueError(" training/validation file or a dataset name.") + + train_extension = self.train_file.split(".")[-1] + assert train_extension in ["csv", "json"], "`train_file` should be a csv or a json file." + validation_extension = self.validation_file.split(".")[-1] + assert ( + validation_extension == train_extension + ), "`validation_file` should have the same extension (csv or json) as `train_file`." + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune from. + """ + + model_name_or_path: str = field( + metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"} + ) + config_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} + ) + tokenizer_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} + ) + cache_dir: Optional[str] = field( + default=None, + metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"}, + ) + use_fast_tokenizer: bool = field( + default=True, + metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."}, + ) + model_revision: str = field( + default="main", + metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, + ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) + use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( + default=False, + metadata={ + "help": ( + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." 
+ ) + }, + ) + ignore_mismatched_sizes: bool = field( + default=False, + metadata={"help": "Will enable to load a pretrained model whose head dimensions are different."}, + ) + + +def get_label_list(raw_dataset, split="train") -> List[str]: + """Get the list of labels from a multi-label dataset""" + + if isinstance(raw_dataset[split]["label"][0], list): + label_list = [label for sample in raw_dataset[split]["label"] for label in sample] + label_list = list(set(label_list)) + else: + label_list = raw_dataset[split].unique("label") + # we will treat the label list as a list of string instead of int, consistent with model.config.label2id + label_list = [str(label) for label in label_list] + return label_list + + +def main(): + # See all possible arguments in src/transformers/training_args.py + # or by passing the --help flag to this script. + # We now keep distinct sets of args, for a cleaner separation of concerns. + + parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments)) + if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): + # If we pass only one argument to the script and it's the path to a json file, + # let's parse it to get our arguments. + model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1])) + else: + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The + # information sent is the one passed as arguments along with your Python/PyTorch versions. + send_example_telemetry("run_classification", model_args, data_args) + + # Setup logging + logging.basicConfig( + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%m/%d/%Y %H:%M:%S", + handlers=[logging.StreamHandler(sys.stdout)], + ) + + if training_args.should_log: + # The default of training_args.log_level is passive, so we set log level at info here to have that default. + transformers.utils.logging.set_verbosity_info() + + log_level = training_args.get_process_log_level() + logger.setLevel(log_level) + datasets.utils.logging.set_verbosity(log_level) + transformers.utils.logging.set_verbosity(log_level) + transformers.utils.logging.enable_default_handler() + transformers.utils.logging.enable_explicit_format() + + # Log on each process the small summary: + logger.warning( + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " + + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}" + ) + logger.info(f"Training/evaluation parameters {training_args}") + + # Detecting last checkpoint. + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + last_checkpoint = get_last_checkpoint(training_args.output_dir) + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." 
+ ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + # Set seed before initializing model. + set_seed(training_args.seed) + + # Get the datasets: you can either provide your own CSV/JSON training and evaluation files, or specify a dataset name + # to load from huggingface/datasets. In ether case, you can specify a the key of the column(s) containing the text and + # the key of the column containing the label. If multiple columns are specified for the text, they will be joined together + # for the actual text value. + # In distributed training, the load_dataset function guarantee that only one local process can concurrently + # download the dataset. + if data_args.dataset_name is not None: + # Downloading and loading a dataset from the hub. + raw_datasets = load_dataset( + data_args.dataset_name, + data_args.dataset_config_name, + cache_dir=model_args.cache_dir, + token=model_args.token, + ) + # Try print some info about the dataset + logger.info(f"Dataset loaded: {raw_datasets}") + logger.info(raw_datasets) + else: + # Loading a dataset from your local files. + # CSV/JSON training and evaluation files are needed. + data_files = {"train": data_args.train_file, "validation": data_args.validation_file} + + # Get the test dataset: you can provide your own CSV/JSON test file + if training_args.do_predict: + if data_args.test_file is not None: + train_extension = data_args.train_file.split(".")[-1] + test_extension = data_args.test_file.split(".")[-1] + assert ( + test_extension == train_extension + ), "`test_file` should have the same extension (csv or json) as `train_file`." + data_files["test"] = data_args.test_file + else: + raise ValueError("Need either a dataset name or a test file for `do_predict`.") + + for key in data_files.keys(): + logger.info(f"load a local file for {key}: {data_files[key]}") + + if data_args.train_file.endswith(".csv"): + # Loading a dataset from local csv files + raw_datasets = load_dataset( + "csv", + data_files=data_files, + cache_dir=model_args.cache_dir, + token=model_args.token, + ) + else: + # Loading a dataset from local json files + raw_datasets = load_dataset( + "json", + data_files=data_files, + cache_dir=model_args.cache_dir, + token=model_args.token, + ) + + # See more about loading any type of standard or custom dataset at + # https://huggingface.co/docs/datasets/loading_datasets. 
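As a side note on the loading branch just above: for local files the new script simply picks between the `csv` and `json` builders of `datasets`. A minimal, self-contained sketch of that idea (the file names here are hypothetical, not part of the patch):

```python
# Minimal sketch of the CSV/JSON branching above; file names are placeholders.
from datasets import load_dataset

data_files = {"train": "train.csv", "validation": "validation.csv"}  # or .json files

# Choose the builder from the training file's extension, as the script does.
builder = "csv" if data_files["train"].endswith(".csv") else "json"
raw_datasets = load_dataset(builder, data_files=data_files)
print(raw_datasets)  # DatasetDict with "train" and "validation" splits
```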
+ + if data_args.remove_splits is not None: + for split in data_args.remove_splits.split(","): + logger.info(f"removing split {split}") + raw_datasets.pop(split) + + if data_args.train_split_name is not None: + logger.info(f"using {data_args.train_split_name} as train set") + raw_datasets["train"] = raw_datasets[data_args.train_split_name] + raw_datasets.pop(data_args.train_split_name) + + if data_args.validation_split_name is not None: + logger.info(f"using {data_args.validation_split_name} as validation set") + raw_datasets["validation"] = raw_datasets[data_args.validation_split_name] + raw_datasets.pop(data_args.validation_split_name) + + if data_args.test_split_name is not None: + logger.info(f"using {data_args.test_split_name} as test set") + raw_datasets["test"] = raw_datasets[data_args.test_split_name] + raw_datasets.pop(data_args.test_split_name) + + if data_args.remove_columns is not None: + for split in raw_datasets.keys(): + for column in data_args.remove_columns.split(","): + logger.info(f"removing column {column} from split {split}") + raw_datasets[split].remove_columns(column) + + if data_args.label_column_name is not None and data_args.label_column_name != "label": + for key in raw_datasets.keys(): + raw_datasets[key] = raw_datasets[key].rename_column(data_args.label_column_name, "label") + + # Trying to have good defaults here, don't hesitate to tweak to your needs. + + is_regression = ( + raw_datasets["train"].features["label"].dtype in ["float32", "float64"] + if data_args.do_regression is None + else data_args.do_regression + ) + + is_multi_label = False + if is_regression: + label_list = None + num_labels = 1 + # regession requires float as label type, let's cast it if needed + for split in raw_datasets.keys(): + if raw_datasets[split].features["label"].dtype not in ["float32", "float64"]: + logger.warning( + f"Label type for {split} set to float32, was {raw_datasets[split].features['label'].dtype}" + ) + features = raw_datasets[split].features + features.update({"label": Value("float32")}) + try: + raw_datasets[split] = raw_datasets[split].cast(features) + except TypeError as error: + logger.error( + f"Unable to cast {split} set to float32, please check the labels are correct, or maybe try with --do_regression=False" + ) + raise error + + else: # classification + if raw_datasets["train"].features["label"].dtype == "list": # multi-label classification + is_multi_label = True + logger.info("Label type is list, doing multi-label classification") + # Trying to find the number of labels in a multi-label classification task + # We have to deal with common cases that labels appear in the training set but not in the validation/test set. + # So we build the label list from the union of labels in train/val/test. 
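The comments above motivate collecting labels from every split so that classes that only occur in validation or test are not silently dropped; the script's own implementation follows right after this sketch. For illustration only, the same union over a toy `DatasetDict` looks like this:

```python
# Illustrative only: union of multi-label annotations across splits (toy data).
from datasets import Dataset, DatasetDict

raw = DatasetDict(
    {
        "train": Dataset.from_dict({"label": [["earn", "acq"], ["earn"]]}),
        "test": Dataset.from_dict({"label": [["grain"], ["earn", "grain"]]}),
    }
)

label_set = {str(lbl) for sample in raw["train"]["label"] for lbl in sample}
for split in ("validation", "test"):
    if split in raw:
        extra = {str(lbl) for sample in raw[split]["label"] for lbl in sample} - label_set
        if extra:
            print(f"Labels {extra} appear only in {split}; adding them to the label list")
        label_set |= extra

label_list = sorted(label_set)
print(label_list)  # ['acq', 'earn', 'grain']
```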
+ label_list = get_label_list(raw_datasets, split="train") + for split in ["validation", "test"]: + if split in raw_datasets: + val_or_test_labels = get_label_list(raw_datasets, split=split) + diff = set(val_or_test_labels).difference(set(label_list)) + if len(diff) > 0: + # add the labels that appear in val/test but not in train, throw a warning + logger.warning( + f"Labels {diff} in {split} set but not in training set, adding them to the label list" + ) + label_list += list(diff) + # if label is -1, we throw a warning and remove it from the label list + for label in label_list: + if label == -1: + logger.warning("Label -1 found in label list, removing it.") + label_list.remove(label) + + label_list.sort() + num_labels = len(label_list) + if num_labels <= 1: + raise ValueError("You need more than one label to do classification.") + + # Load pretrained model and tokenizer + # In distributed training, the .from_pretrained methods guarantee that only one local process can concurrently + # download model & vocab. + config = AutoConfig.from_pretrained( + model_args.config_name if model_args.config_name else model_args.model_name_or_path, + num_labels=num_labels, + finetuning_task="text-classification", + cache_dir=model_args.cache_dir, + revision=model_args.model_revision, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, + ) + + if is_regression: + config.problem_type = "regression" + logger.info("setting problem type to regression") + elif is_multi_label: + config.problem_type = "multi_label_classification" + logger.info("setting problem type to multi label classification") + else: + config.problem_type = "single_label_classification" + logger.info("setting problem type to single label classification") + + tokenizer = AutoTokenizer.from_pretrained( + model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, + cache_dir=model_args.cache_dir, + use_fast=model_args.use_fast_tokenizer, + revision=model_args.model_revision, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, + ) + model = AutoModelForSequenceClassification.from_pretrained( + model_args.model_name_or_path, + from_tf=bool(".ckpt" in model_args.model_name_or_path), + config=config, + cache_dir=model_args.cache_dir, + revision=model_args.model_revision, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, + ignore_mismatched_sizes=model_args.ignore_mismatched_sizes, + ) + + # Padding strategy + if data_args.pad_to_max_length: + padding = "max_length" + else: + # We will pad later, dynamically at batch creation, to the max sequence length in each batch + padding = False + + # for training ,we will update the config with label infos, + # if do_train is not set, we will use the label infos in the config + if training_args.do_train and not is_regression: # classification, training + label_to_id = {v: i for i, v in enumerate(label_list)} + # update config with label infos + if model.config.label2id != label_to_id: + logger.warning( + "The label2id key in the model config.json is not equal to the label2id key of this " + "run. You can ignore this if you are doing finetuning." 
+ ) + model.config.label2id = label_to_id + model.config.id2label = {id: label for label, id in label_to_id.items()} + elif not is_regression: # classification, but not training + logger.info("using label infos in the model config") + logger.info("label2id: {}".format(model.config.label2id)) + label_to_id = model.config.label2id + else: # regression + label_to_id = None + + if data_args.max_seq_length > tokenizer.model_max_length: + logger.warning( + f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the " + f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}." + ) + max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length) + + def multi_labels_to_ids(labels: List[str]) -> List[float]: + ids = [0.0] * len(label_to_id) # BCELoss requires float as target type + for label in labels: + ids[label_to_id[label]] = 1.0 + return ids + + def preprocess_function(examples): + if data_args.text_column_names is not None: + text_column_names = data_args.text_column_names.split(",") + # join together text columns into "sentence" column + examples["sentence"] = examples[text_column_names[0]] + for column in text_column_names[1:]: + for i in range(len(examples[column])): + examples["sentence"][i] += data_args.text_column_delimiter + examples[column][i] + # Tokenize the texts + result = tokenizer(examples["sentence"], padding=padding, max_length=max_seq_length, truncation=True) + if label_to_id is not None and "label" in examples: + if is_multi_label: + result["label"] = [multi_labels_to_ids(l) for l in examples["label"]] + else: + result["label"] = [(label_to_id[str(l)] if l != -1 else -1) for l in examples["label"]] + return result + + # Running the preprocessing pipeline on all the datasets + with training_args.main_process_first(desc="dataset map pre-processing"): + raw_datasets = raw_datasets.map( + preprocess_function, + batched=True, + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on dataset", + ) + + if training_args.do_train: + if "train" not in raw_datasets: + raise ValueError("--do_train requires a train dataset.") + train_dataset = raw_datasets["train"] + if data_args.shuffle_train_dataset: + logger.info("Shuffling the training dataset") + train_dataset = train_dataset.shuffle(seed=data_args.shuffle_seed) + if data_args.max_train_samples is not None: + max_train_samples = min(len(train_dataset), data_args.max_train_samples) + train_dataset = train_dataset.select(range(max_train_samples)) + + if training_args.do_eval: + if "validation" not in raw_datasets and "validation_matched" not in raw_datasets: + if "test" not in raw_datasets and "test_matched" not in raw_datasets: + raise ValueError("--do_eval requires a validation or test dataset if validation is not defined.") + else: + logger.warning("Validation dataset not found. 
Falling back to test dataset for validation.") + eval_dataset = raw_datasets["test"] + else: + eval_dataset = raw_datasets["validation"] + + if data_args.max_eval_samples is not None: + max_eval_samples = min(len(eval_dataset), data_args.max_eval_samples) + eval_dataset = eval_dataset.select(range(max_eval_samples)) + + if training_args.do_predict or data_args.test_file is not None: + if "test" not in raw_datasets: + raise ValueError("--do_predict requires a test dataset") + predict_dataset = raw_datasets["test"] + # remove label column if it exists + if data_args.max_predict_samples is not None: + max_predict_samples = min(len(predict_dataset), data_args.max_predict_samples) + predict_dataset = predict_dataset.select(range(max_predict_samples)) + + # Log a few random samples from the training set: + if training_args.do_train: + for index in random.sample(range(len(train_dataset)), 3): + logger.info(f"Sample {index} of the training set: {train_dataset[index]}.") + + if data_args.metric_name is not None: + metric = ( + evaluate.load(data_args.metric_name, config_name="multilabel", cache_dir=model_args.cache_dir) + if is_multi_label + else evaluate.load(data_args.metric_name, cache_dir=model_args.cache_dir) + ) + logger.info(f"Using metric {data_args.metric_name} for evaluation.") + else: + if is_regression: + metric = evaluate.load("mse", cache_dir=model_args.cache_dir) + logger.info("Using mean squared error (mse) as regression score, you can use --metric_name to overwrite.") + else: + if is_multi_label: + metric = evaluate.load("f1", config_name="multilabel", cache_dir=model_args.cache_dir) + logger.info( + "Using multilabel F1 for multi-label classification task, you can use --metric_name to overwrite." + ) + else: + metric = evaluate.load("accuracy", cache_dir=model_args.cache_dir) + logger.info("Using accuracy as classification score, you can use --metric_name to overwrite.") + + def compute_metrics(p: EvalPrediction): + preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions + if is_regression: + preds = np.squeeze(preds) + result = metric.compute(predictions=preds, references=p.label_ids) + elif is_multi_label: + preds = np.array([np.where(p > 0, 1, 0) for p in preds]) # convert logits to multi-hot encoding + # Micro F1 is commonly used in multi-label classification + result = metric.compute(predictions=preds, references=p.label_ids, average="micro") + else: + preds = np.argmax(preds, axis=1) + result = metric.compute(predictions=preds, references=p.label_ids) + if len(result) > 1: + result["combined_score"] = np.mean(list(result.values())).item() + return result + + # Data collator will default to DataCollatorWithPadding when the tokenizer is passed to Trainer, so we change it if + # we already did the padding. 
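The comment above describes the collator choice that the code right below implements: no collator is needed when everything was already padded to `max_seq_length`, otherwise padding happens per batch (rounded up to a multiple of 8 under fp16). A hedged, standalone illustration of dynamic padding (the tokenizer checkpoint is just an example taken from the READMEs in this patch):

```python
# Standalone illustration of dynamic per-batch padding with DataCollatorWithPadding.
from transformers import AutoTokenizer, DataCollatorWithPadding, default_data_collator

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")  # example checkpoint

pad_to_max_length = False
fp16 = True

if pad_to_max_length:
    data_collator = default_data_collator  # samples were already padded during preprocessing
elif fp16:
    data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8)  # tensor-core friendly
else:
    data_collator = None  # Trainer then falls back to DataCollatorWithPadding itself

features = [tokenizer("a short sentence"), tokenizer("a noticeably longer example sentence")]
batch = data_collator(features)
print(batch["input_ids"].shape)  # sequence length padded up to a multiple of 8
```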
+ if data_args.pad_to_max_length: + data_collator = default_data_collator + elif training_args.fp16: + data_collator = DataCollatorWithPadding(tokenizer, pad_to_multiple_of=8) + else: + data_collator = None + + # Initialize our Trainer + trainer = Trainer( + model=model, + args=training_args, + train_dataset=train_dataset if training_args.do_train else None, + eval_dataset=eval_dataset if training_args.do_eval else None, + compute_metrics=compute_metrics, + tokenizer=tokenizer, + data_collator=data_collator, + ) + + # Training + if training_args.do_train: + checkpoint = None + if training_args.resume_from_checkpoint is not None: + checkpoint = training_args.resume_from_checkpoint + elif last_checkpoint is not None: + checkpoint = last_checkpoint + train_result = trainer.train(resume_from_checkpoint=checkpoint) + metrics = train_result.metrics + max_train_samples = ( + data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset) + ) + metrics["train_samples"] = min(max_train_samples, len(train_dataset)) + trainer.save_model() # Saves the tokenizer too for easy upload + trainer.log_metrics("train", metrics) + trainer.save_metrics("train", metrics) + trainer.save_state() + + # Evaluation + if training_args.do_eval: + logger.info("*** Evaluate ***") + metrics = trainer.evaluate(eval_dataset=eval_dataset) + max_eval_samples = data_args.max_eval_samples if data_args.max_eval_samples is not None else len(eval_dataset) + metrics["eval_samples"] = min(max_eval_samples, len(eval_dataset)) + trainer.log_metrics("eval", metrics) + trainer.save_metrics("eval", metrics) + + if training_args.do_predict: + logger.info("*** Predict ***") + # Removing the `label` columns if exists because it might contains -1 and Trainer won't like that. + if "label" in predict_dataset.features: + predict_dataset = predict_dataset.remove_columns("label") + predictions = trainer.predict(predict_dataset, metric_key_prefix="predict").predictions + if is_regression: + predictions = np.squeeze(predictions) + elif is_multi_label: + # Convert logits to multi-hot encoding. We compare the logits to 0 instead of 0.5, because the sigmoid is not applied. 
+ # You can also pass `preprocess_logits_for_metrics=lambda logits, labels: nn.functional.sigmoid(logits)` to the Trainer + # and set p > 0.5 below (less efficient in this case) + predictions = np.array([np.where(p > 0, 1, 0) for p in predictions]) + else: + predictions = np.argmax(predictions, axis=1) + output_predict_file = os.path.join(training_args.output_dir, "predict_results.txt") + if trainer.is_world_process_zero(): + with open(output_predict_file, "w") as writer: + logger.info("***** Predict results *****") + writer.write("index\tprediction\n") + for index, item in enumerate(predictions): + if is_regression: + writer.write(f"{index}\t{item:3.3f}\n") + elif is_multi_label: + # recover from multi-hot encoding + item = [label_list[i] for i in range(len(item)) if item[i] == 1] + writer.write(f"{index}\t{item}\n") + else: + item = label_list[item] + writer.write(f"{index}\t{item}\n") + logger.info("Predict results saved at {}".format(output_predict_file)) + kwargs = {"finetuned_from": model_args.model_name_or_path, "tasks": "text-classification"} + + if training_args.push_to_hub: + trainer.push_to_hub(**kwargs) + else: + trainer.create_model_card(**kwargs) + + +def _mp_fn(index): + # For xla_spawn (TPUs) + main() + + +if __name__ == "__main__": + main() diff --git a/examples/pytorch/text-classification/run_glue.py b/examples/pytorch/text-classification/run_glue.py index 1e7ab534551192..5b268e4ae162e4 100755 --- a/examples/pytorch/text-classification/run_glue.py +++ b/examples/pytorch/text-classification/run_glue.py @@ -20,6 +20,7 @@ import os import random import sys +import warnings from dataclasses import dataclass, field from typing import Optional @@ -48,7 +49,7 @@ # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/text-classification/requirements.txt") @@ -188,12 +189,28 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -216,6 +233,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. 
Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_glue", model_args, data_args) @@ -227,6 +253,10 @@ def main(): handlers=[logging.StreamHandler(sys.stdout)], ) + if training_args.should_log: + # The default of training_args.log_level is passive, so we set log level at info here to have that default. + transformers.utils.logging.set_verbosity_info() + log_level = training_args.get_process_log_level() logger.setLevel(log_level) datasets.utils.logging.set_verbosity(log_level) @@ -236,8 +266,8 @@ def main(): # Log on each process the small summary: logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" - + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " + + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}" ) logger.info(f"Training/evaluation parameters {training_args}") @@ -277,7 +307,7 @@ def main(): "glue", data_args.task_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) elif data_args.dataset_name is not None: # Downloading and loading a dataset from the hub. @@ -285,7 +315,7 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: # Loading a dataset from your local files. @@ -314,7 +344,7 @@ def main(): "csv", data_files=data_files, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: # Loading a dataset from local json files @@ -322,10 +352,10 @@ def main(): "json", data_files=data_files, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # See more about loading any type of standard or custom dataset at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. 
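The `use_auth_token` to `token` migration above is repeated in every example script touched by this patch. As a generic sketch of that shim (the dataclass here is illustrative and much smaller than the real `ModelArguments`):

```python
# Illustrative deprecation shim mirroring the pattern added to the example scripts.
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelArguments:
    token: Optional[str] = None            # new argument
    use_auth_token: Optional[bool] = None  # deprecated alias


def resolve_token(model_args: ModelArguments) -> ModelArguments:
    if model_args.use_auth_token is not None:
        warnings.warn(
            "The `use_auth_token` argument is deprecated and will be removed in v4.34. "
            "Please use `token` instead.",
            FutureWarning,
        )
        if model_args.token is not None:
            raise ValueError("`token` and `use_auth_token` are both specified. Please set only `token`.")
        model_args.token = model_args.use_auth_token
    return model_args


resolve_token(ModelArguments(use_auth_token=True))  # emits a FutureWarning, as the scripts do
```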
# Labels if data_args.task_name is not None: @@ -342,7 +372,7 @@ def main(): num_labels = 1 else: # A useful fast method: - # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.unique + # https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.unique label_list = raw_datasets["train"].unique("label") label_list.sort() # Let's sort it for determinism num_labels = len(label_list) @@ -357,14 +387,16 @@ def main(): finetuning_task=data_args.task_name, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) tokenizer = AutoTokenizer.from_pretrained( model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) model = AutoModelForSequenceClassification.from_pretrained( model_args.model_name_or_path, @@ -372,7 +404,8 @@ def main(): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ignore_mismatched_sizes=model_args.ignore_mismatched_sizes, ) @@ -406,12 +439,12 @@ def main(): ): # Some have all caps in their config, some don't. label_name_to_id = {k.lower(): v for k, v in model.config.label2id.items()} - if list(sorted(label_name_to_id.keys())) == list(sorted(label_list)): + if sorted(label_name_to_id.keys()) == sorted(label_list): label_to_id = {i: int(label_name_to_id[label_list[i]]) for i in range(num_labels)} else: logger.warning( "Your model seems to have been trained with labels, but they don't match the dataset: ", - f"model labels: {list(sorted(label_name_to_id.keys()))}, dataset labels: {list(sorted(label_list))}." + f"model labels: {sorted(label_name_to_id.keys())}, dataset labels: {sorted(label_list)}." "\nIgnoring the model labels as a result.", ) elif data_args.task_name is None and not is_regression: @@ -426,7 +459,7 @@ def main(): if data_args.max_seq_length > tokenizer.model_max_length: logger.warning( - f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the" + f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the " f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}." ) max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length) @@ -481,24 +514,21 @@ def preprocess_function(examples): # Get the metric function if data_args.task_name is not None: - metric = evaluate.load("glue", data_args.task_name) + metric = evaluate.load("glue", data_args.task_name, cache_dir=model_args.cache_dir) + elif is_regression: + metric = evaluate.load("mse", cache_dir=model_args.cache_dir) else: - metric = evaluate.load("accuracy") + metric = evaluate.load("accuracy", cache_dir=model_args.cache_dir) # You can define your custom compute_metrics function. It takes an `EvalPrediction` object (a namedtuple with a # predictions and label_ids field) and has to return a dictionary string to float. 
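The `compute_metrics` that follows is now a single `evaluate`-based path shared by GLUE tasks, regression, and plain accuracy, instead of three hand-written branches. A toy invocation of the same flow (dummy logits and labels, no model involved):

```python
# Toy run of the consolidated metric flow: load once, compute once, add combined_score if needed.
import numpy as np
import evaluate

is_regression = False
metric = evaluate.load("mse") if is_regression else evaluate.load("accuracy")

logits = np.array([[0.1, 2.0], [1.5, -0.3], [0.2, 0.4]])  # dummy model outputs
label_ids = np.array([1, 0, 1])

preds = np.squeeze(logits) if is_regression else np.argmax(logits, axis=1)
result = metric.compute(predictions=preds, references=label_ids)
if len(result) > 1:  # e.g. GLUE MRPC returns both accuracy and F1
    result["combined_score"] = np.mean(list(result.values())).item()
print(result)  # {'accuracy': 1.0} for this toy batch
```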
def compute_metrics(p: EvalPrediction): preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions preds = np.squeeze(preds) if is_regression else np.argmax(preds, axis=1) - if data_args.task_name is not None: - result = metric.compute(predictions=preds, references=p.label_ids) - if len(result) > 1: - result["combined_score"] = np.mean(list(result.values())).item() - return result - elif is_regression: - return {"mse": ((preds - p.label_ids) ** 2).mean().item()} - else: - return {"accuracy": (preds == p.label_ids).astype(np.float32).mean().item()} + result = metric.compute(predictions=preds, references=p.label_ids) + if len(result) > 1: + result["combined_score"] = np.mean(list(result.values())).item() + return result # Data collator will default to DataCollatorWithPadding when the tokenizer is passed to Trainer, so we change it if # we already did the padding. diff --git a/examples/pytorch/text-classification/run_glue_no_trainer.py b/examples/pytorch/text-classification/run_glue_no_trainer.py index 03de2cf6b553f4..77d937ef7fd344 100644 --- a/examples/pytorch/text-classification/run_glue_no_trainer.py +++ b/examples/pytorch/text-classification/run_glue_no_trainer.py @@ -43,12 +43,12 @@ default_data_collator, get_scheduler, ) -from transformers.utils import check_min_version, get_full_repo_name, send_example_telemetry +from transformers.utils import check_min_version, send_example_telemetry from transformers.utils.versions import require_version # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") logger = get_logger(__name__) @@ -156,6 +156,16 @@ def parse_args(): "--hub_model_id", type=str, help="The name of the repository to keep in sync with the local `output_dir`." ) parser.add_argument("--hub_token", type=str, help="The token to use to push to the Model Hub.") + parser.add_argument( + "--trust_remote_code", + type=bool, + default=False, + help=( + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." + ), + ) parser.add_argument( "--checkpointing_steps", type=str, @@ -179,7 +189,7 @@ def parse_args(): default="all", help=( 'The integration to report the results and logs to. Supported platforms are `"tensorboard"`,' - ' `"wandb"`, `"comet_ml"` and `"clearml"`. Use `"all"` (default) to report to all integrations.' + ' `"wandb"`, `"comet_ml"` and `"clearml"`. Use `"all"` (default) to report to all integrations. ' "Only applicable when `--with_tracking` is passed." ), ) @@ -217,7 +227,7 @@ def main(): # If we're using tracking, we also need to initialize it here and it will by default pick up all supported trackers # in the environment accelerator = ( - Accelerator(log_with=args.report_to, logging_dir=args.output_dir) if args.with_tracking else Accelerator() + Accelerator(log_with=args.report_to, project_dir=args.output_dir) if args.with_tracking else Accelerator() ) # Make one log on every process with the configuration for debugging. 
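One detail worth calling out in the no-trainer hunk above: the `Accelerator` is now built with `project_dir` instead of `logging_dir`. A hedged sketch of that construction (the flag names mirror the script, the output directory is hypothetical):

```python
# Sketch of the updated Accelerator construction used when tracking is enabled.
from accelerate import Accelerator

with_tracking = True
report_to = "all"        # "tensorboard", "wandb", "comet_ml", "clearml" or "all"
output_dir = "./output"  # hypothetical

accelerator = Accelerator(log_with=report_to, project_dir=output_dir) if with_tracking else Accelerator()
print(accelerator.device)
```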
logging.basicConfig( @@ -240,12 +250,14 @@ def main(): # Handle the repository creation if accelerator.is_main_process: if args.push_to_hub: - if args.hub_model_id is None: - repo_name = get_full_repo_name(Path(args.output_dir).name, token=args.hub_token) - else: - repo_name = args.hub_model_id - create_repo(repo_name, exist_ok=True, token=args.hub_token) - repo = Repository(args.output_dir, clone_from=repo_name, token=args.hub_token) + # Retrieve of infer repo_name + repo_name = args.hub_model_id + if repo_name is None: + repo_name = Path(args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=args.hub_token).repo_id + # Clone repo locally + repo = Repository(args.output_dir, clone_from=repo_id, token=args.hub_token) with open(os.path.join(args.output_dir, ".gitignore"), "w+") as gitignore: if "step_*" not in gitignore: @@ -281,7 +293,7 @@ def main(): extension = (args.train_file if args.train_file is not None else args.validation_file).split(".")[-1] raw_datasets = load_dataset(extension, data_files=data_files) # See more about loading any type of standard or custom dataset at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # Labels if args.task_name is not None: @@ -307,13 +319,21 @@ def main(): # # In distributed training, the .from_pretrained methods guarantee that only one local process can concurrently # download model & vocab. - config = AutoConfig.from_pretrained(args.model_name_or_path, num_labels=num_labels, finetuning_task=args.task_name) - tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, use_fast=not args.use_slow_tokenizer) + config = AutoConfig.from_pretrained( + args.model_name_or_path, + num_labels=num_labels, + finetuning_task=args.task_name, + trust_remote_code=args.trust_remote_code, + ) + tokenizer = AutoTokenizer.from_pretrained( + args.model_name_or_path, use_fast=not args.use_slow_tokenizer, trust_remote_code=args.trust_remote_code + ) model = AutoModelForSequenceClassification.from_pretrained( args.model_name_or_path, from_tf=bool(".ckpt" in args.model_name_or_path), config=config, ignore_mismatched_sizes=args.ignore_mismatched_sizes, + trust_remote_code=args.trust_remote_code, ) # Preprocessing the datasets @@ -339,7 +359,7 @@ def main(): ): # Some have all caps in their config, some don't. label_name_to_id = {k.lower(): v for k, v in model.config.label2id.items()} - if list(sorted(label_name_to_id.keys())) == list(sorted(label_list)): + if sorted(label_name_to_id.keys()) == sorted(label_list): logger.info( f"The configuration of the model provided the following label correspondence: {label_name_to_id}. " "Using it!" @@ -348,7 +368,7 @@ def main(): else: logger.warning( "Your model seems to have been trained with labels, but they don't match the dataset: ", - f"model labels: {list(sorted(label_name_to_id.keys()))}, dataset labels: {list(sorted(label_list))}." + f"model labels: {sorted(label_name_to_id.keys())}, dataset labels: {sorted(label_list)}." 
"\nIgnoring the model labels as a result.", ) elif args.task_name is None and not is_regression: @@ -487,35 +507,45 @@ def preprocess_function(examples): # Potentially load in the weights and states from a previous save if args.resume_from_checkpoint: if args.resume_from_checkpoint is not None or args.resume_from_checkpoint != "": - accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}") - accelerator.load_state(args.resume_from_checkpoint) + checkpoint_path = args.resume_from_checkpoint path = os.path.basename(args.resume_from_checkpoint) else: # Get the most recent checkpoint dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()] dirs.sort(key=os.path.getctime) path = dirs[-1] # Sorts folders by date modified, most recent checkpoint is the last + checkpoint_path = path + path = os.path.basename(checkpoint_path) + + accelerator.print(f"Resumed from checkpoint: {checkpoint_path}") + accelerator.load_state(checkpoint_path) # Extract `epoch_{i}` or `step_{i}` training_difference = os.path.splitext(path)[0] if "epoch" in training_difference: starting_epoch = int(training_difference.replace("epoch_", "")) + 1 resume_step = None + completed_steps = starting_epoch * num_update_steps_per_epoch else: - resume_step = int(training_difference.replace("step_", "")) + # need to multiply `gradient_accumulation_steps` to reflect real steps + resume_step = int(training_difference.replace("step_", "")) * args.gradient_accumulation_steps starting_epoch = resume_step // len(train_dataloader) + completed_steps = resume_step // args.gradient_accumulation_steps resume_step -= starting_epoch * len(train_dataloader) + # update the progress_bar if load from checkpoint + progress_bar.update(completed_steps) + for epoch in range(starting_epoch, args.num_train_epochs): model.train() if args.with_tracking: total_loss = 0 - for step, batch in enumerate(train_dataloader): - # We need to skip steps until we reach the resumed step - if args.resume_from_checkpoint and epoch == starting_epoch: - if resume_step is not None and step < resume_step: - completed_steps += 1 - continue + if args.resume_from_checkpoint and epoch == starting_epoch and resume_step is not None: + # We skip the first `n` batches in the dataloader when resuming from a checkpoint + active_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step) + else: + active_dataloader = train_dataloader + for step, batch in enumerate(active_dataloader): outputs = model(**batch) loss = outputs.loss # We keep track of the loss at each epoch @@ -532,7 +562,7 @@ def preprocess_function(examples): if isinstance(checkpointing_steps, int): if completed_steps % checkpointing_steps == 0: - output_dir = f"step_{completed_steps }" + output_dir = f"step_{completed_steps}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) diff --git a/examples/pytorch/text-classification/run_xnli.py b/examples/pytorch/text-classification/run_xnli.py index 20b38a37cf5bc7..1f61239794a70f 100755 --- a/examples/pytorch/text-classification/run_xnli.py +++ b/examples/pytorch/text-classification/run_xnli.py @@ -21,6 +21,7 @@ import os import random import sys +import warnings from dataclasses import dataclass, field from typing import Optional @@ -48,7 +49,7 @@ # Will error if the minimal version of Transformers is not installed. Remove at your own risks. 
-check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/text-classification/requirements.txt") @@ -152,12 +153,28 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -175,6 +192,15 @@ def main(): parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments)) model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_xnli", model_args) @@ -186,6 +212,10 @@ def main(): handlers=[logging.StreamHandler(sys.stdout)], ) + if training_args.should_log: + # The default of training_args.log_level is passive, so we set log level at info here to have that default. 
+ transformers.utils.logging.set_verbosity_info() + log_level = training_args.get_process_log_level() logger.setLevel(log_level) datasets.utils.logging.set_verbosity(log_level) @@ -195,8 +225,8 @@ def main(): # Log on each process the small summary: logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" - + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " + + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}" ) logger.info(f"Training/evaluation parameters {training_args}") @@ -228,7 +258,7 @@ def main(): model_args.language, split="train", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: train_dataset = load_dataset( @@ -236,7 +266,7 @@ def main(): model_args.train_language, split="train", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) label_list = train_dataset.features["label"].names @@ -246,7 +276,7 @@ def main(): model_args.language, split="validation", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) label_list = eval_dataset.features["label"].names @@ -256,7 +286,7 @@ def main(): model_args.language, split="test", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) label_list = predict_dataset.features["label"].names @@ -269,10 +299,13 @@ def main(): config = AutoConfig.from_pretrained( model_args.config_name if model_args.config_name else model_args.model_name_or_path, num_labels=num_labels, + id2label={str(i): label for i, label in enumerate(label_list)}, + label2id={label: i for i, label in enumerate(label_list)}, finetuning_task="xnli", cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) tokenizer = AutoTokenizer.from_pretrained( model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, @@ -280,7 +313,8 @@ def main(): cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) model = AutoModelForSequenceClassification.from_pretrained( model_args.model_name_or_path, @@ -288,7 +322,8 @@ def main(): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ignore_mismatched_sizes=model_args.ignore_mismatched_sizes, ) @@ -350,7 +385,7 @@ def preprocess_function(examples): ) # Get the metric function - metric = evaluate.load("xnli") + metric = evaluate.load("xnli", cache_dir=model_args.cache_dir) # You can define your custom compute_metrics function. It takes an `EvalPrediction` object (a namedtuple with a # predictions and label_ids field) and has to return a dictionary string to float. 
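The `run_xnli.py` hunk above now records the label mapping explicitly in the config rather than relying on defaults. A brief sketch of that pattern with XNLI's three classes (the checkpoint below is only the example used elsewhere in this patch; token and cache handling are omitted):

```python
# Sketch: writing an explicit label mapping into the config before fine-tuning.
from transformers import AutoConfig

label_list = ["entailment", "neutral", "contradiction"]  # XNLI's three classes

config = AutoConfig.from_pretrained(
    "google-bert/bert-base-multilingual-cased",  # example checkpoint from the README above
    num_labels=len(label_list),
    id2label={str(i): label for i, label in enumerate(label_list)},
    label2id={label: i for i, label in enumerate(label_list)},
    finetuning_task="xnli",
)
print(config.id2label)
```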
diff --git a/examples/pytorch/text-generation/README.md b/examples/pytorch/text-generation/README.md index 2177c45c3b884a..e619c25e162d52 100644 --- a/examples/pytorch/text-generation/README.md +++ b/examples/pytorch/text-generation/README.md @@ -18,7 +18,7 @@ limitations under the License. Based on the script [`run_generation.py`](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-generation/run_generation.py). -Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, Transformer-XL, XLNet, CTRL. +Conditional text generation using the auto-regressive models of the library: GPT, GPT-2, GPT-J, Transformer-XL, XLNet, CTRL, BLOOM, LLAMA, OPT. A similar script is used for our official demo [Write With Transfomer](https://transformer.huggingface.co), where you can try out the different models available in the library. @@ -26,6 +26,6 @@ Example usage: ```bash python run_generation.py \ - --model_type=gpt2 \ - --model_name_or_path=gpt2 + --model_type=openai-community/gpt2 \ + --model_name_or_path=openai-community/gpt2 ``` diff --git a/examples/pytorch/text-generation/requirements.txt b/examples/pytorch/text-generation/requirements.txt index 0ef50f181f64c4..324a8cfb1c29e0 100644 --- a/examples/pytorch/text-generation/requirements.txt +++ b/examples/pytorch/text-generation/requirements.txt @@ -1,3 +1,4 @@ +accelerate >= 0.21.0 sentencepiece != 0.1.92 protobuf torch >= 1.3 diff --git a/examples/pytorch/text-generation/run_generation.py b/examples/pytorch/text-generation/run_generation.py index 9b4b09fc96874b..557b75572c9997 100755 --- a/examples/pytorch/text-generation/run_generation.py +++ b/examples/pytorch/text-generation/run_generation.py @@ -19,18 +19,29 @@ import argparse +import inspect import logging +from typing import Tuple -import numpy as np import torch +from accelerate import PartialState +from accelerate.utils import set_seed from transformers import ( + AutoTokenizer, + BloomForCausalLM, + BloomTokenizerFast, CTRLLMHeadModel, CTRLTokenizer, + GenerationMixin, GPT2LMHeadModel, GPT2Tokenizer, + GPTJForCausalLM, + LlamaForCausalLM, + LlamaTokenizer, OpenAIGPTLMHeadModel, OpenAIGPTTokenizer, + OPTForCausalLM, TransfoXLLMHeadModel, TransfoXLTokenizer, XLMTokenizer, @@ -38,6 +49,7 @@ XLNetLMHeadModel, XLNetTokenizer, ) +from transformers.modeling_outputs import CausalLMOutputWithPast logging.basicConfig( @@ -56,6 +68,10 @@ "xlnet": (XLNetLMHeadModel, XLNetTokenizer), "transfo-xl": (TransfoXLLMHeadModel, TransfoXLTokenizer), "xlm": (XLMWithLMHeadModel, XLMTokenizer), + "gptj": (GPTJForCausalLM, AutoTokenizer), + "bloom": (BloomForCausalLM, BloomTokenizerFast), + "llama": (LlamaForCausalLM, LlamaTokenizer), + "opt": (OPTForCausalLM, GPT2Tokenizer), } # Padding text to help Transformer-XL and XLNet with short prompts as proposed by Aman Rusia @@ -73,13 +89,6 @@ with people, even a bishop, begging for his blessing. 
""" -def set_seed(args): - np.random.seed(args.seed) - torch.manual_seed(args.seed) - if args.n_gpu > 0: - torch.cuda.manual_seed_all(args.seed) - - # # Functions to prepare models' input # @@ -151,6 +160,129 @@ def adjust_length_to_model(length, max_sequence_length): return length +def sparse_model_config(model_config): + embedding_size = None + if hasattr(model_config, "hidden_size"): + embedding_size = model_config.hidden_size + elif hasattr(model_config, "n_embed"): + embedding_size = model_config.n_embed + elif hasattr(model_config, "n_embd"): + embedding_size = model_config.n_embd + + num_head = None + if hasattr(model_config, "num_attention_heads"): + num_head = model_config.num_attention_heads + elif hasattr(model_config, "n_head"): + num_head = model_config.n_head + + if embedding_size is None or num_head is None or num_head == 0: + raise ValueError("Check the model config") + + num_embedding_size_per_head = int(embedding_size / num_head) + if hasattr(model_config, "n_layer"): + num_layer = model_config.n_layer + elif hasattr(model_config, "num_hidden_layers"): + num_layer = model_config.num_hidden_layers + else: + raise ValueError("Number of hidden layers couldn't be determined from the model config") + + return num_layer, num_head, num_embedding_size_per_head + + +def generate_past_key_values(model, batch_size, seq_len): + num_block_layers, num_attention_heads, num_embedding_size_per_head = sparse_model_config(model.config) + if model.config.model_type == "bloom": + past_key_values = tuple( + ( + torch.empty(int(num_attention_heads * batch_size), num_embedding_size_per_head, seq_len) + .to(model.dtype) + .to(model.device), + torch.empty(int(num_attention_heads * batch_size), seq_len, num_embedding_size_per_head) + .to(model.dtype) + .to(model.device), + ) + for _ in range(num_block_layers) + ) + else: + past_key_values = tuple( + ( + torch.empty(batch_size, num_attention_heads, seq_len, num_embedding_size_per_head) + .to(model.dtype) + .to(model.device), + torch.empty(batch_size, num_attention_heads, seq_len, num_embedding_size_per_head) + .to(model.dtype) + .to(model.device), + ) + for _ in range(num_block_layers) + ) + return past_key_values + + +def prepare_jit_inputs(inputs, model, tokenizer): + batch_size = len(inputs) + dummy_input = tokenizer.batch_encode_plus(inputs, return_tensors="pt") + dummy_input = dummy_input.to(model.device) + if model.config.use_cache: + dummy_input["past_key_values"] = generate_past_key_values(model, batch_size, 1) + dummy_input["attention_mask"] = torch.cat( + [ + torch.zeros(dummy_input["attention_mask"].shape[0], 1) + .to(dummy_input["attention_mask"].dtype) + .to(model.device), + dummy_input["attention_mask"], + ], + -1, + ) + return dummy_input + + +class _ModelFallbackWrapper(GenerationMixin): + __slots__ = ("_optimized", "_default") + + def __init__(self, optimized, default): + self._optimized = optimized + self._default = default + + def __call__(self, *args, **kwargs): + if kwargs["past_key_values"] is None and self._default.config.use_cache: + kwargs["past_key_values"] = generate_past_key_values(self._default, kwargs["input_ids"].shape[0], 0) + kwargs.pop("position_ids", None) + for k in list(kwargs.keys()): + if kwargs[k] is None or isinstance(kwargs[k], bool): + kwargs.pop(k) + outputs = self._optimized(**kwargs) + lm_logits = outputs[0] + past_key_values = outputs[1] + fixed_output = CausalLMOutputWithPast( + loss=None, + logits=lm_logits, + past_key_values=past_key_values, + hidden_states=None, + attentions=None, + ) + return 
fixed_output + + def __getattr__(self, item): + return getattr(self._default, item) + + def prepare_inputs_for_generation( + self, input_ids, past_key_values=None, inputs_embeds=None, use_cache=None, **kwargs + ): + return self._default.prepare_inputs_for_generation( + input_ids, past_key_values=past_key_values, inputs_embeds=inputs_embeds, use_cache=use_cache, **kwargs + ) + + def _reorder_cache( + self, past_key_values: Tuple[Tuple[torch.Tensor]], beam_idx: torch.Tensor + ) -> Tuple[Tuple[torch.Tensor]]: + """ + This function is used to re-order the `past_key_values` cache if [`~PretrainedModel.beam_search`] or + [`~PretrainedModel.beam_sample`] is called. This is required to match `past_key_values` with the correct + beam_idx at every generation step. + """ + return self._default._reorder_cache(past_key_values, beam_idx) + + def main(): parser = argparse.ArgumentParser() parser.add_argument( @@ -189,21 +321,27 @@ def main(): parser.add_argument("--xlm_language", type=str, default="", help="Optional language when used with the XLM model.") parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") - parser.add_argument("--no_cuda", action="store_true", help="Avoid using CUDA when available") + parser.add_argument( + "--use_cpu", + action="store_true", + help="Whether or not to use cpu. If set to False, " "we will use gpu/npu or mps device if available", + ) parser.add_argument("--num_return_sequences", type=int, default=1, help="The number of samples to generate.") parser.add_argument( "--fp16", action="store_true", help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit", ) + parser.add_argument("--jit", action="store_true", help="Whether or not to use jit trace to accelerate inference") args = parser.parse_args() - args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") - args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count() + # Initialize the distributed state. + distributed_state = PartialState(cpu=args.use_cpu) - logger.warning(f"device: {args.device}, n_gpu: {args.n_gpu}, 16-bits training: {args.fp16}") + logger.warning(f"device: {distributed_state.device}, 16-bits inference: {args.fp16}") - set_seed(args) + if args.seed is not None: + set_seed(args.seed) # Initialize the model and tokenizer try: @@ -213,13 +351,17 @@ def main(): raise KeyError("the model {} you specified is not supported. 
You are welcome to add it and open a PR :)") tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + if tokenizer.pad_token is None: + tokenizer.pad_token = tokenizer.eos_token model = model_class.from_pretrained(args.model_name_or_path) - model.to(args.device) + + # Set the model to the right device + model.to(distributed_state.device) if args.fp16: model.half() - - args.length = adjust_length_to_model(args.length, max_sequence_length=model.config.max_position_embeddings) + max_seq_length = getattr(model.config, "max_position_embeddings", 0) + args.length = adjust_length_to_model(args.length, max_sequence_length=max_seq_length) logger.info(args) prompt_text = args.prompt if args.prompt else input("Model prompt >>> ") @@ -241,13 +383,30 @@ def main(): else: prefix = args.prefix if args.prefix else args.padding_text encoded_prompt = tokenizer.encode(prefix + prompt_text, add_special_tokens=False, return_tensors="pt") - encoded_prompt = encoded_prompt.to(args.device) + encoded_prompt = encoded_prompt.to(distributed_state.device) if encoded_prompt.size()[-1] == 0: input_ids = None else: input_ids = encoded_prompt + if args.jit: + jit_input_texts = ["enable jit"] + jit_inputs = prepare_jit_inputs(jit_input_texts, model, tokenizer) + torch._C._jit_set_texpr_fuser_enabled(False) + model.config.return_dict = False + if hasattr(model, "forward"): + sig = inspect.signature(model.forward) + else: + sig = inspect.signature(model.__call__) + jit_inputs = tuple(jit_inputs[key] for key in sig.parameters if jit_inputs.get(key, None) is not None) + traced_model = torch.jit.trace(model, jit_inputs, strict=False) + traced_model = torch.jit.freeze(traced_model.eval()) + traced_model(*jit_inputs) + traced_model(*jit_inputs) + + model = _ModelFallbackWrapper(traced_model, model) + output_sequences = model.generate( input_ids=input_ids, max_length=args.length + len(encoded_prompt[0]), diff --git a/examples/pytorch/text-generation/run_generation_contrastive_search.py b/examples/pytorch/text-generation/run_generation_contrastive_search.py index 117f063a6dd9a8..a48529fb30dd4b 100755 --- a/examples/pytorch/text-generation/run_generation_contrastive_search.py +++ b/examples/pytorch/text-generation/run_generation_contrastive_search.py @@ -16,15 +16,15 @@ """ The examples of running contrastive search on the auto-APIs; Running this example: -python run_generation_contrastive_search.py --model_name_or_path=gpt2-large --penalty_alpha=0.6 --k=4 --length=256 +python run_generation_contrastive_search.py --model_name_or_path=openai-community/gpt2-large --penalty_alpha=0.6 --k=4 --length=256 """ import argparse import logging -import numpy as np -import torch +from accelerate import PartialState +from accelerate.utils import set_seed from transformers import AutoModelForCausalLM, AutoTokenizer @@ -37,13 +37,6 @@ logger = logging.getLogger(__name__) -def set_seed(args): - np.random.seed(args.seed) - torch.manual_seed(args.seed) - if args.n_gpu > 0: - torch.cuda.manual_seed_all(args.seed) - - def main(): parser = argparse.ArgumentParser() parser.add_argument( @@ -73,7 +66,11 @@ def main(): parser.add_argument("--xlm_language", type=str, default="", help="Optional language when used with the XLM model.") parser.add_argument("--seed", type=int, default=42, help="random seed for initialization") - parser.add_argument("--no_cuda", action="store_true", help="Avoid using CUDA when available") + parser.add_argument( + "--use_cpu", + action="store_true", + help="Whether or not to use cpu. 
If set to False, " "we will use gpu/npu or mps device if available", + ) parser.add_argument( "--fp16", action="store_true", @@ -81,12 +78,13 @@ def main(): ) args = parser.parse_args() - args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") - args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count() + # Initialize the distributed state. + distributed_state = PartialState(cpu=args.use_cpu) - logger.warning(f"device: {args.device}, n_gpu: {args.n_gpu}, 16-bits training: {args.fp16}") + logger.warning(f"device: {distributed_state.device}, 16-bits inference: {args.fp16}") - set_seed(args) + if args.seed is not None: + set_seed(args.seed) # Initialize the model and tokenizer tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) @@ -94,7 +92,8 @@ def main(): # tokenizer = GPT2Tokenizer.from_pretrained(args.model_name_or_path) # model = OPTForCausalLM.from_pretrained(args.model_name_or_path) - model.to(args.device) + # Set the model to the right device + model.to(distributed_state.device) if args.fp16: model.half() @@ -103,7 +102,7 @@ def main(): prompt_text = args.prompt if args.prompt else input("Model prompt >>> ") inputs = tokenizer(prompt_text, return_tensors="pt", add_special_tokens=False) - inputs = {key: value.to(args.device) for key, value in inputs.items()} + inputs = {key: value.to(distributed_state.device) for key, value in inputs.items()} output_sequences = model.generate( **inputs, diff --git a/examples/pytorch/token-classification/README.md b/examples/pytorch/token-classification/README.md index 496722cf6b9a14..568e5242fee3ff 100644 --- a/examples/pytorch/token-classification/README.md +++ b/examples/pytorch/token-classification/README.md @@ -29,7 +29,7 @@ The following example fine-tunes BERT on CoNLL-2003: ```bash python run_ner.py \ - --model_name_or_path bert-base-uncased \ + --model_name_or_path google-bert/bert-base-uncased \ --dataset_name conll2003 \ --output_dir /tmp/test-ner \ --do_train \ @@ -42,7 +42,7 @@ To run on your own training and validation files, use the following command: ```bash python run_ner.py \ - --model_name_or_path bert-base-uncased \ + --model_name_or_path google-bert/bert-base-uncased \ --train_file path_to_train_file \ --validation_file path_to_validation_file \ --output_dir /tmp/test-ner \ @@ -84,7 +84,7 @@ then export TASK_NAME=ner python run_ner_no_trainer.py \ - --model_name_or_path bert-base-cased \ + --model_name_or_path google-bert/bert-base-cased \ --dataset_name conll2003 \ --task_name $TASK_NAME \ --max_length 128 \ @@ -112,7 +112,7 @@ that will check everything is ready for training. Finally, you can launch traini export TASK_NAME=ner accelerate launch run_ner_no_trainer.py \ - --model_name_or_path bert-base-cased \ + --model_name_or_path google-bert/bert-base-cased \ --dataset_name conll2003 \ --task_name $TASK_NAME \ --max_length 128 \ diff --git a/examples/pytorch/token-classification/run_ner.py b/examples/pytorch/token-classification/run_ner.py index 065880e7e26e78..fe9c6224e8033f 100755 --- a/examples/pytorch/token-classification/run_ner.py +++ b/examples/pytorch/token-classification/run_ner.py @@ -22,6 +22,7 @@ import logging import os import sys +import warnings from dataclasses import dataclass, field from typing import Optional @@ -49,7 +50,7 @@ # Will error if the minimal version of Transformers is not installed. Remove at your own risks. 
-check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/token-classification/requirements.txt") @@ -79,12 +80,28 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -217,6 +234,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_ner", model_args, data_args) @@ -228,6 +254,10 @@ def main(): handlers=[logging.StreamHandler(sys.stdout)], ) + if training_args.should_log: + # The default of training_args.log_level is passive, so we set log level at info here to have that default. 
+ transformers.utils.logging.set_verbosity_info() + log_level = training_args.get_process_log_level() logger.setLevel(log_level) datasets.utils.logging.set_verbosity(log_level) @@ -237,8 +267,8 @@ def main(): # Log on each process the small summary: logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" - + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " + + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}" ) logger.info(f"Training/evaluation parameters {training_args}") @@ -275,20 +305,22 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: data_files = {} if data_args.train_file is not None: data_files["train"] = data_args.train_file + extension = data_args.train_file.split(".")[-1] if data_args.validation_file is not None: data_files["validation"] = data_args.validation_file + extension = data_args.validation_file.split(".")[-1] if data_args.test_file is not None: data_files["test"] = data_args.test_file - extension = data_args.train_file.split(".")[-1] + extension = data_args.test_file.split(".")[-1] raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. if training_args.do_train: column_names = raw_datasets["train"].column_names @@ -344,7 +376,8 @@ def get_label_list(labels): finetuning_task=data_args.task_name, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) tokenizer_name_or_path = model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path @@ -354,7 +387,8 @@ def get_label_list(labels): cache_dir=model_args.cache_dir, use_fast=True, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, add_prefix_space=True, ) else: @@ -363,7 +397,8 @@ def get_label_list(labels): cache_dir=model_args.cache_dir, use_fast=True, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) model = AutoModelForTokenClassification.from_pretrained( @@ -372,7 +407,8 @@ def get_label_list(labels): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ignore_mismatched_sizes=model_args.ignore_mismatched_sizes, ) @@ -386,7 +422,7 @@ def get_label_list(labels): # Model has labels -> use them. 
if model.config.label2id != PretrainedConfig(num_labels=num_labels).label2id: - if list(sorted(model.config.label2id.keys())) == list(sorted(label_list)): + if sorted(model.config.label2id.keys()) == sorted(label_list): # Reorganize `label_list` to match the ordering of the model. if labels_are_int: label_to_id = {i: int(model.config.label2id[l]) for i, l in enumerate(label_list)} @@ -397,13 +433,13 @@ def get_label_list(labels): else: logger.warning( "Your model seems to have been trained with labels, but they don't match the dataset: ", - f"model labels: {list(sorted(model.config.label2id.keys()))}, dataset labels:" - f" {list(sorted(label_list))}.\nIgnoring the model labels as a result.", + f"model labels: {sorted(model.config.label2id.keys())}, dataset labels:" + f" {sorted(label_list)}.\nIgnoring the model labels as a result.", ) # Set the correspondences label/ID inside the model config model.config.label2id = {l: i for i, l in enumerate(label_list)} - model.config.id2label = {i: l for i, l in enumerate(label_list)} + model.config.id2label = dict(enumerate(label_list)) # Map that sends B-Xxx label to its I-Xxx counterpart b_to_i_label = [] @@ -505,7 +541,7 @@ def tokenize_and_align_labels(examples): data_collator = DataCollatorForTokenClassification(tokenizer, pad_to_multiple_of=8 if training_args.fp16 else None) # Metrics - metric = evaluate.load("seqeval") + metric = evaluate.load("seqeval", cache_dir=model_args.cache_dir) def compute_metrics(p): predictions, labels = p diff --git a/examples/pytorch/token-classification/run_ner_no_trainer.py b/examples/pytorch/token-classification/run_ner_no_trainer.py index ad630472234ea4..804f2eef16f9f1 100755 --- a/examples/pytorch/token-classification/run_ner_no_trainer.py +++ b/examples/pytorch/token-classification/run_ner_no_trainer.py @@ -28,6 +28,7 @@ import datasets import evaluate +import numpy as np import torch from accelerate import Accelerator from accelerate.logging import get_logger @@ -50,12 +51,12 @@ default_data_collator, get_scheduler, ) -from transformers.utils import check_min_version, get_full_repo_name, send_example_telemetry +from transformers.utils import check_min_version, send_example_telemetry from transformers.utils.versions import require_version # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") logger = get_logger(__name__) require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/token-classification/requirements.txt") @@ -209,6 +210,16 @@ def parse_args(): "--hub_model_id", type=str, help="The name of the repository to keep in sync with the local `output_dir`." ) parser.add_argument("--hub_token", type=str, help="The token to use to push to the Model Hub.") + parser.add_argument( + "--trust_remote_code", + type=bool, + default=False, + help=( + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." + ), + ) parser.add_argument( "--checkpointing_steps", type=str, @@ -232,7 +243,7 @@ def parse_args(): default="all", help=( 'The integration to report the results and logs to. Supported platforms are `"tensorboard"`,' - ' `"wandb"`, `"comet_ml"` and `"clearml"`. Use `"all"` (default) to report to all integrations.' + ' `"wandb"`, `"comet_ml"` and `"clearml"`. 
Use `"all"` (default) to report to all integrations. ' "Only applicable when `--with_tracking` is passed." ), ) @@ -271,7 +282,7 @@ def main(): # If we're using tracking, we also need to initialize it here and it will by default pick up all supported trackers # in the environment accelerator = ( - Accelerator(log_with=args.report_to, logging_dir=args.output_dir) if args.with_tracking else Accelerator() + Accelerator(log_with=args.report_to, project_dir=args.output_dir) if args.with_tracking else Accelerator() ) # Make one log on every process with the configuration for debugging. logging.basicConfig( @@ -294,12 +305,14 @@ def main(): # Handle the repository creation if accelerator.is_main_process: if args.push_to_hub: - if args.hub_model_id is None: - repo_name = get_full_repo_name(Path(args.output_dir).name, token=args.hub_token) - else: - repo_name = args.hub_model_id - create_repo(repo_name, exist_ok=True, token=args.hub_token) - repo = Repository(args.output_dir, clone_from=repo_name, token=args.hub_token) + # Retrieve of infer repo_name + repo_name = args.hub_model_id + if repo_name is None: + repo_name = Path(args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=args.hub_token).repo_id + # Clone repo locally + repo = Repository(args.output_dir, clone_from=repo_id, token=args.hub_token) with open(os.path.join(args.output_dir, ".gitignore"), "w+") as gitignore: if "step_*" not in gitignore: @@ -326,16 +339,17 @@ def main(): data_files = {} if args.train_file is not None: data_files["train"] = args.train_file + extension = args.train_file.split(".")[-1] if args.validation_file is not None: data_files["validation"] = args.validation_file - extension = args.train_file.split(".")[-1] + extension = args.validation_file.split(".")[-1] raw_datasets = load_dataset(extension, data_files=data_files) # Trim a number of training examples if args.debug: for split in raw_datasets.keys(): raw_datasets[split] = raw_datasets[split].select(range(100)) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. if raw_datasets["train"] is not None: column_names = raw_datasets["train"].column_names @@ -385,9 +399,13 @@ def get_label_list(labels): # In distributed training, the .from_pretrained methods guarantee that only one local process can concurrently # download model & vocab. if args.config_name: - config = AutoConfig.from_pretrained(args.config_name, num_labels=num_labels) + config = AutoConfig.from_pretrained( + args.config_name, num_labels=num_labels, trust_remote_code=args.trust_remote_code + ) elif args.model_name_or_path: - config = AutoConfig.from_pretrained(args.model_name_or_path, num_labels=num_labels) + config = AutoConfig.from_pretrained( + args.model_name_or_path, num_labels=num_labels, trust_remote_code=args.trust_remote_code + ) else: config = CONFIG_MAPPING[args.model_type]() logger.warning("You are instantiating a new config instance from scratch.") @@ -395,14 +413,18 @@ def get_label_list(labels): tokenizer_name_or_path = args.tokenizer_name if args.tokenizer_name else args.model_name_or_path if not tokenizer_name_or_path: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. 
" "You can do it from another script, save it, and load it from here, using --tokenizer_name." ) if config.model_type in {"bloom", "gpt2", "roberta"}: - tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path, use_fast=True, add_prefix_space=True) + tokenizer = AutoTokenizer.from_pretrained( + tokenizer_name_or_path, use_fast=True, add_prefix_space=True, trust_remote_code=args.trust_remote_code + ) else: - tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path, use_fast=True) + tokenizer = AutoTokenizer.from_pretrained( + tokenizer_name_or_path, use_fast=True, trust_remote_code=args.trust_remote_code + ) if args.model_name_or_path: model = AutoModelForTokenClassification.from_pretrained( @@ -410,10 +432,11 @@ def get_label_list(labels): from_tf=bool(".ckpt" in args.model_name_or_path), config=config, ignore_mismatched_sizes=args.ignore_mismatched_sizes, + trust_remote_code=args.trust_remote_code, ) else: logger.info("Training new model from scratch") - model = AutoModelForTokenClassification.from_config(config) + model = AutoModelForTokenClassification.from_config(config, trust_remote_code=args.trust_remote_code) # We resize the embeddings only when necessary to avoid index errors. If you are creating a model from scratch # on a small vocab and want a smaller embedding size, remove this test. @@ -425,7 +448,7 @@ def get_label_list(labels): # Model has labels -> use them. if model.config.label2id != PretrainedConfig(num_labels=num_labels).label2id: - if list(sorted(model.config.label2id.keys())) == list(sorted(label_list)): + if sorted(model.config.label2id.keys()) == sorted(label_list): # Reorganize `label_list` to match the ordering of the model. if labels_are_int: label_to_id = {i: int(model.config.label2id[l]) for i, l in enumerate(label_list)} @@ -436,13 +459,13 @@ def get_label_list(labels): else: logger.warning( "Your model seems to have been trained with labels, but they don't match the dataset: ", - f"model labels: {list(sorted(model.config.label2id.keys()))}, dataset labels:" - f" {list(sorted(label_list))}.\nIgnoring the model labels as a result.", + f"model labels: {sorted(model.config.label2id.keys())}, dataset labels:" + f" {sorted(label_list)}.\nIgnoring the model labels as a result.", ) # Set the correspondences label/ID inside the model config model.config.label2id = {l: i for i, l in enumerate(label_list)} - model.config.id2label = {i: l for i, l in enumerate(label_list)} + model.config.id2label = dict(enumerate(label_list)) # Map that sends B-Xxx label to its I-Xxx counterpart b_to_i_label = [] @@ -645,35 +668,45 @@ def compute_metrics(): # Potentially load in the weights and states from a previous save if args.resume_from_checkpoint: if args.resume_from_checkpoint is not None or args.resume_from_checkpoint != "": - accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}") - accelerator.load_state(args.resume_from_checkpoint) + checkpoint_path = args.resume_from_checkpoint path = os.path.basename(args.resume_from_checkpoint) else: # Get the most recent checkpoint dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()] dirs.sort(key=os.path.getctime) path = dirs[-1] # Sorts folders by date modified, most recent checkpoint is the last + checkpoint_path = path + path = os.path.basename(checkpoint_path) + + accelerator.print(f"Resumed from checkpoint: {checkpoint_path}") + accelerator.load_state(checkpoint_path) # Extract `epoch_{i}` or `step_{i}` training_difference = os.path.splitext(path)[0] if "epoch" in training_difference: 
starting_epoch = int(training_difference.replace("epoch_", "")) + 1 resume_step = None + completed_steps = starting_epoch * num_update_steps_per_epoch else: - resume_step = int(training_difference.replace("step_", "")) + # need to multiply `gradient_accumulation_steps` to reflect real steps + resume_step = int(training_difference.replace("step_", "")) * args.gradient_accumulation_steps starting_epoch = resume_step // len(train_dataloader) + completed_steps = resume_step // args.gradient_accumulation_steps resume_step -= starting_epoch * len(train_dataloader) + # update the progress_bar if load from checkpoint + progress_bar.update(completed_steps) + for epoch in range(starting_epoch, args.num_train_epochs): model.train() if args.with_tracking: total_loss = 0 - for step, batch in enumerate(train_dataloader): - # We need to skip steps until we reach the resumed step - if args.resume_from_checkpoint and epoch == starting_epoch: - if resume_step is not None and step < resume_step: - completed_steps += 1 - continue + if args.resume_from_checkpoint and epoch == starting_epoch and resume_step is not None: + # We skip the first `n` batches in the dataloader when resuming from a checkpoint + active_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step) + else: + active_dataloader = train_dataloader + for step, batch in enumerate(active_dataloader): outputs = model(**batch) loss = outputs.loss # We keep track of the loss at each epoch @@ -690,7 +723,7 @@ def compute_metrics(): if isinstance(checkpointing_steps, int): if completed_steps % checkpointing_steps == 0: - output_dir = f"step_{completed_steps }" + output_dir = f"step_{completed_steps}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) @@ -771,6 +804,12 @@ def compute_metrics(): if args.with_tracking: all_results.update({"train_loss": total_loss.item() / len(train_dataloader)}) with open(os.path.join(args.output_dir, "all_results.json"), "w") as f: + # Convert all float64 & int64 type numbers to float & int for json serialization + for key, value in all_results.items(): + if isinstance(value, np.float64): + all_results[key] = float(value) + elif isinstance(value, np.int64): + all_results[key] = int(value) json.dump(all_results, f) diff --git a/examples/pytorch/translation/README.md b/examples/pytorch/translation/README.md index 0593d577a01fdb..74ca16ccb0bf63 100644 --- a/examples/pytorch/translation/README.md +++ b/examples/pytorch/translation/README.md @@ -33,7 +33,7 @@ For the old `finetune_trainer.py` and related utils, see [`examples/legacy/seq2s `run_translation.py` is a lightweight examples of how to download and preprocess a dataset from the [🤗 Datasets](https://github.com/huggingface/datasets) library or use your own files (jsonlines or csv), then fine-tune one of the architectures above on it. -For custom datasets in `jsonlines` format please see: https://huggingface.co/docs/datasets/loading_datasets.html#json-files +For custom datasets in `jsonlines` format please see: https://huggingface.co/docs/datasets/loading_datasets#json-files and you also will find examples of these below. @@ -59,11 +59,11 @@ python examples/pytorch/translation/run_translation.py \ MBart and some T5 models require special handling. -T5 models `t5-small`, `t5-base`, `t5-large`, `t5-3b` and `t5-11b` must use an additional argument: `--source_prefix "translate {source_lang} to {target_lang}"`. 
For example: +T5 models `google-t5/t5-small`, `google-t5/t5-base`, `google-t5/t5-large`, `google-t5/t5-3b` and `google-t5/t5-11b` must use an additional argument: `--source_prefix "translate {source_lang} to {target_lang}"`. For example: ```bash python examples/pytorch/translation/run_translation.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --source_lang en \ @@ -105,7 +105,7 @@ values for the arguments `--train_file`, `--validation_file` to match your setup ```bash python examples/pytorch/translation/run_translation.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --source_lang en \ @@ -134,7 +134,7 @@ If you want to use a pre-processed dataset that leads to high BLEU scores, but f ```bash python examples/pytorch/translation/run_translation.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --source_lang en \ diff --git a/examples/pytorch/translation/run_translation.py b/examples/pytorch/translation/run_translation.py index cd82c779e8724d..f2718c1122acae 100755 --- a/examples/pytorch/translation/run_translation.py +++ b/examples/pytorch/translation/run_translation.py @@ -21,6 +21,7 @@ import logging import os import sys +import warnings from dataclasses import dataclass, field from typing import Optional @@ -52,7 +53,7 @@ # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/translation/requirements.txt") @@ -89,12 +90,28 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -156,7 +173,7 @@ class DataTrainingArguments: metadata={ "help": ( "The maximum total sequence length for validation target text after tokenization. Sequences longer " - "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`." + "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`. " "This argument is also used to override the ``max_length`` param of ``model.generate``, which is used " "during ``evaluate`` and ``predict``." ) @@ -200,7 +217,7 @@ class DataTrainingArguments: }, ) num_beams: Optional[int] = field( - default=None, + default=1, metadata={ "help": ( "Number of beams to use for evaluation. 
This argument will be passed to ``model.generate``, " @@ -261,6 +278,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_translation", model_args, data_args) @@ -272,6 +298,10 @@ def main(): handlers=[logging.StreamHandler(sys.stdout)], ) + if training_args.should_log: + # The default of training_args.log_level is passive, so we set log level at info here to have that default. + transformers.utils.logging.set_verbosity_info() + log_level = training_args.get_process_log_level() logger.setLevel(log_level) datasets.utils.logging.set_verbosity(log_level) @@ -281,17 +311,17 @@ def main(): # Log on each process the small summary: logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" - + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " + + f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}" ) logger.info(f"Training/evaluation parameters {training_args}") if data_args.source_prefix is None and model_args.model_name_or_path in [ - "t5-small", - "t5-base", - "t5-large", - "t5-3b", - "t5-11b", + "google-t5/t5-small", + "google-t5/t5-base", + "google-t5/t5-large", + "google-t5/t5-3b", + "google-t5/t5-11b", ]: logger.warning( "You're running a t5 model but didn't provide a source prefix, which is expected, e.g. with " @@ -331,7 +361,7 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: data_files = {} @@ -344,14 +374,18 @@ def main(): if data_args.test_file is not None: data_files["test"] = data_args.test_file extension = data_args.test_file.split(".")[-1] + if extension == "jsonl": + builder_name = "json" # the "json" builder reads both .json and .jsonl files + else: + builder_name = extension # e.g. "parquet" raw_datasets = load_dataset( - extension, + builder_name, data_files=data_files, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading. 
# Load pretrained model and tokenizer # @@ -362,14 +396,16 @@ def main(): model_args.config_name if model_args.config_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) tokenizer = AutoTokenizer.from_pretrained( model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) model = AutoModelForSeq2SeqLM.from_pretrained( model_args.model_name_or_path, @@ -377,7 +413,8 @@ def main(): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) # We resize the embeddings only when necessary to avoid index errors. If you are creating a model from scratch @@ -438,7 +475,7 @@ def main(): if training_args.label_smoothing_factor > 0 and not hasattr(model, "prepare_decoder_input_ids_from_labels"): logger.warning( - "label_smoothing is enabled but the `prepare_decoder_input_ids_from_labels` method is not defined for" + "label_smoothing is enabled but the `prepare_decoder_input_ids_from_labels` method is not defined for " f"`{model.__class__.__name__}`. This will lead to loss being calculated twice and will take up more memory" ) @@ -527,7 +564,7 @@ def preprocess_function(examples): ) # Metric - metric = evaluate.load("sacrebleu") + metric = evaluate.load("sacrebleu", cache_dir=model_args.cache_dir) def postprocess_text(preds, labels): preds = [pred.strip() for pred in preds] @@ -539,10 +576,10 @@ def compute_metrics(eval_preds): preds, labels = eval_preds if isinstance(preds, tuple): preds = preds[0] + # Replace -100s used for padding as we can't decode them + preds = np.where(preds != -100, preds, tokenizer.pad_token_id) decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True) - if data_args.ignore_pad_token_for_loss: - # Replace -100 in the labels as we can't decode them. 
- labels = np.where(labels != -100, labels, tokenizer.pad_token_id) + labels = np.where(labels != -100, labels, tokenizer.pad_token_id) decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True) # Some simple post-processing @@ -622,8 +659,10 @@ def compute_metrics(eval_preds): if trainer.is_world_process_zero(): if training_args.predict_with_generate: + predictions = predict_results.predictions + predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id) predictions = tokenizer.batch_decode( - predict_results.predictions, skip_special_tokens=True, clean_up_tokenization_spaces=True + predictions, skip_special_tokens=True, clean_up_tokenization_spaces=True ) predictions = [pred.strip() for pred in predictions] output_prediction_file = os.path.join(training_args.output_dir, "generated_predictions.txt") diff --git a/examples/pytorch/translation/run_translation_no_trainer.py b/examples/pytorch/translation/run_translation_no_trainer.py index a853d531edb861..205129e0346514 100644 --- a/examples/pytorch/translation/run_translation_no_trainer.py +++ b/examples/pytorch/translation/run_translation_no_trainer.py @@ -52,12 +52,12 @@ default_data_collator, get_scheduler, ) -from transformers.utils import check_min_version, get_full_repo_name, send_example_telemetry +from transformers.utils import check_min_version, send_example_telemetry from transformers.utils.versions import require_version # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") logger = get_logger(__name__) require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/translation/requirements.txt") @@ -118,7 +118,7 @@ def parse_args(): default=128, help=( "The maximum total sequence length for target text after " - "tokenization. Sequences longer than this will be truncated, sequences shorter will be padded." + "tokenization. Sequences longer than this will be truncated, sequences shorter will be padded " "during ``evaluate`` and ``predict``." ), ) @@ -139,7 +139,7 @@ def parse_args(): default=False, help=( "Whether to pad all samples to model maximum sentence " - "length. If False, will pad the samples dynamically when batching to the maximum length in the batch. More" + "length. If False, will pad the samples dynamically when batching to the maximum length in the batch. More " "efficient on GPU but very bad for TPU." ), ) @@ -175,7 +175,7 @@ def parse_args(): default=128, help=( "The maximum total input sequence length after tokenization. Sequences longer than this will be truncated," - " sequences shorter will be padded if `--pad_to_max_lengh` is passed." + " sequences shorter will be padded if `--pad_to_max_length` is passed." ), ) parser.add_argument( @@ -257,6 +257,16 @@ def parse_args(): "--hub_model_id", type=str, help="The name of the repository to keep in sync with the local `output_dir`." ) parser.add_argument("--hub_token", type=str, help="The token to use to push to the Model Hub.") + parser.add_argument( + "--trust_remote_code", + type=bool, + default=False, + help=( + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." 
+ ), + ) parser.add_argument( "--checkpointing_steps", type=str, @@ -280,7 +290,7 @@ def parse_args(): default="all", help=( 'The integration to report the results and logs to. Supported platforms are `"tensorboard"`,' - ' `"wandb"`, `"comet_ml"` and `"clearml"`. Use `"all"` (default) to report to all integrations.' + ' `"wandb"`, `"comet_ml"` and `"clearml"`. Use `"all"` (default) to report to all integrations. ' "Only applicable when `--with_tracking` is passed." ), ) @@ -316,7 +326,7 @@ def main(): # If we're using tracking, we also need to initialize it here and it will by default pick up all supported trackers # in the environment accelerator = ( - Accelerator(log_with=args.report_to, logging_dir=args.output_dir) if args.with_tracking else Accelerator() + Accelerator(log_with=args.report_to, project_dir=args.output_dir) if args.with_tracking else Accelerator() ) # Make one log on every process with the configuration for debugging. @@ -340,12 +350,14 @@ def main(): # Handle the repository creation if accelerator.is_main_process: if args.push_to_hub: - if args.hub_model_id is None: - repo_name = get_full_repo_name(Path(args.output_dir).name, token=args.hub_token) - else: - repo_name = args.hub_model_id - create_repo(repo_name, exist_ok=True, token=args.hub_token) - repo = Repository(args.output_dir, clone_from=repo_name, token=args.hub_token) + # Retrieve of infer repo_name + repo_name = args.hub_model_id + if repo_name is None: + repo_name = Path(args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=args.hub_token).repo_id + # Clone repo locally + repo = Repository(args.output_dir, clone_from=repo_id, token=args.hub_token) with open(os.path.join(args.output_dir, ".gitignore"), "w+") as gitignore: if "step_*" not in gitignore: @@ -372,32 +384,37 @@ def main(): data_files = {} if args.train_file is not None: data_files["train"] = args.train_file + extension = args.train_file.split(".")[-1] if args.validation_file is not None: data_files["validation"] = args.validation_file - extension = args.train_file.split(".")[-1] + extension = args.validation_file.split(".")[-1] raw_datasets = load_dataset(extension, data_files=data_files) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # Load pretrained model and tokenizer # # In distributed training, the .from_pretrained methods guarantee that only one local process can concurrently # download model & vocab. 
if args.config_name: - config = AutoConfig.from_pretrained(args.config_name) + config = AutoConfig.from_pretrained(args.config_name, trust_remote_code=args.trust_remote_code) elif args.model_name_or_path: - config = AutoConfig.from_pretrained(args.model_name_or_path) + config = AutoConfig.from_pretrained(args.model_name_or_path, trust_remote_code=args.trust_remote_code) else: config = CONFIG_MAPPING[args.model_type]() logger.warning("You are instantiating a new config instance from scratch.") if args.tokenizer_name: - tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, use_fast=not args.use_slow_tokenizer) + tokenizer = AutoTokenizer.from_pretrained( + args.tokenizer_name, use_fast=not args.use_slow_tokenizer, trust_remote_code=args.trust_remote_code + ) elif args.model_name_or_path: - tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, use_fast=not args.use_slow_tokenizer) + tokenizer = AutoTokenizer.from_pretrained( + args.model_name_or_path, use_fast=not args.use_slow_tokenizer, trust_remote_code=args.trust_remote_code + ) else: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name." ) @@ -406,10 +423,11 @@ def main(): args.model_name_or_path, from_tf=bool(".ckpt" in args.model_name_or_path), config=config, + trust_remote_code=args.trust_remote_code, ) else: logger.info("Training new model from scratch") - model = AutoModelForSeq2SeqLM.from_config(config) + model = AutoModelForSeq2SeqLM.from_config(config, trust_remote_code=args.trust_remote_code) # We resize the embeddings only when necessary to avoid index errors. If you are creating a model from scratch # on a small vocab and want a smaller embedding size, remove this test. 
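The hunks above thread the new `--trust_remote_code` flag through every `AutoConfig`/`AutoTokenizer`/`AutoModel*` call in the no-trainer scripts. A short sketch of the resulting loading path, under the same caveat as the flag's help text (only pass `True` for repositories whose code you have read); the checkpoint id below is just an example taken from the translation README, not a requirement of the scripts:

```python
from transformers import AutoConfig, AutoModelForSeq2SeqLM, AutoTokenizer

trust_remote_code = False  # default; set True only for Hub repos whose code you trust
model_name_or_path = "google-t5/t5-small"  # example checkpoint from the translation README

config = AutoConfig.from_pretrained(model_name_or_path, trust_remote_code=trust_remote_code)
tokenizer = AutoTokenizer.from_pretrained(
    model_name_or_path, use_fast=True, trust_remote_code=trust_remote_code
)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name_or_path, config=config, trust_remote_code=trust_remote_code
)
```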
@@ -593,42 +611,45 @@ def postprocess_text(preds, labels): # Potentially load in the weights and states from a previous save if args.resume_from_checkpoint: if args.resume_from_checkpoint is not None or args.resume_from_checkpoint != "": - accelerator.print(f"Resumed from checkpoint: {args.resume_from_checkpoint}") - accelerator.load_state(args.resume_from_checkpoint) + checkpoint_path = args.resume_from_checkpoint path = os.path.basename(args.resume_from_checkpoint) else: # Get the most recent checkpoint dirs = [f.name for f in os.scandir(os.getcwd()) if f.is_dir()] dirs.sort(key=os.path.getctime) path = dirs[-1] # Sorts folders by date modified, most recent checkpoint is the last + checkpoint_path = path + path = os.path.basename(checkpoint_path) + + accelerator.print(f"Resumed from checkpoint: {checkpoint_path}") + accelerator.load_state(checkpoint_path) # Extract `epoch_{i}` or `step_{i}` training_difference = os.path.splitext(path)[0] if "epoch" in training_difference: starting_epoch = int(training_difference.replace("epoch_", "")) + 1 resume_step = None + completed_steps = starting_epoch * num_update_steps_per_epoch else: # need to multiply `gradient_accumulation_steps` to reflect real steps resume_step = int(training_difference.replace("step_", "")) * args.gradient_accumulation_steps starting_epoch = resume_step // len(train_dataloader) + completed_steps = resume_step // args.gradient_accumulation_steps resume_step -= starting_epoch * len(train_dataloader) # update the progress_bar if load from checkpoint - progress_bar.update(starting_epoch * num_update_steps_per_epoch) - completed_steps = starting_epoch * num_update_steps_per_epoch + progress_bar.update(completed_steps) for epoch in range(starting_epoch, args.num_train_epochs): model.train() if args.with_tracking: total_loss = 0 - for step, batch in enumerate(train_dataloader): - # We need to skip steps until we reach the resumed step - if args.resume_from_checkpoint and epoch == starting_epoch: - if resume_step is not None and step < resume_step: - if step % args.gradient_accumulation_steps == 0: - progress_bar.update(1) - completed_steps += 1 - continue + if args.resume_from_checkpoint and epoch == starting_epoch and resume_step is not None: + # We skip the first `n` batches in the dataloader when resuming from a checkpoint + active_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step) + else: + active_dataloader = train_dataloader + for step, batch in enumerate(active_dataloader): outputs = model(**batch) loss = outputs.loss # We keep track of the loss at each epoch @@ -645,7 +666,7 @@ def postprocess_text(preds, labels): if isinstance(checkpointing_steps, int): if completed_steps % checkpointing_steps == 0: - output_dir = f"step_{completed_steps }" + output_dir = f"step_{completed_steps}" if args.output_dir is not None: output_dir = os.path.join(args.output_dir, output_dir) accelerator.save_state(output_dir) diff --git a/examples/research_projects/README.md b/examples/research_projects/README.md index 32d7fee0453c50..b2f5d431f25b50 100644 --- a/examples/research_projects/README.md +++ b/examples/research_projects/README.md @@ -20,7 +20,7 @@ This folder contains various research projects using 🤗 Transformers. They are version of 🤗 Transformers that is indicated in the requirements file of each folder. Updating them to the most recent version of the library will require some work. To use any of them, just run the command -``` +```bash pip install -r requirements.txt ``` inside the folder of your choice. 
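Both no-trainer scripts also replace the old skip-batches-by-hand loop with `Accelerator.skip_first_batches` and compute `completed_steps` up front, so the progress bar is correct after resuming. A condensed sketch of that resume logic under stated assumptions: the checkpoint name and the dummy dataloader are stand-ins, and the real scripts call `accelerator.load_state` on the checkpoint before this point:

```python
import os

from accelerate import Accelerator
from torch.utils.data import DataLoader

accelerator = Accelerator()
train_dataloader = accelerator.prepare(DataLoader(list(range(64)), batch_size=8))
gradient_accumulation_steps = 2
num_update_steps_per_epoch = len(train_dataloader) // gradient_accumulation_steps

checkpoint_path = "step_2"  # stand-in for the directory picked from --resume_from_checkpoint
training_difference = os.path.splitext(os.path.basename(checkpoint_path))[0]

if "epoch" in training_difference:
    starting_epoch = int(training_difference.replace("epoch_", "")) + 1
    resume_step = None
    completed_steps = starting_epoch * num_update_steps_per_epoch
else:
    # `step_{i}` counts optimizer steps, so convert back into dataloader batches
    resume_step = int(training_difference.replace("step_", "")) * gradient_accumulation_steps
    starting_epoch = resume_step // len(train_dataloader)
    completed_steps = resume_step // gradient_accumulation_steps
    resume_step -= starting_epoch * len(train_dataloader)

num_train_epochs = 1
for epoch in range(starting_epoch, num_train_epochs):
    if resume_step is not None and epoch == starting_epoch:
        # skip the batches already consumed before the checkpoint was written
        active_dataloader = accelerator.skip_first_batches(train_dataloader, resume_step)
    else:
        active_dataloader = train_dataloader
    for step, batch in enumerate(active_dataloader):
        pass  # forward / backward / optimizer.step() would go here
```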
diff --git a/examples/research_projects/bert-loses-patience/README.md b/examples/research_projects/bert-loses-patience/README.md index d1e5baa92e90bb..b405e8a9488750 100755 --- a/examples/research_projects/bert-loses-patience/README.md +++ b/examples/research_projects/bert-loses-patience/README.md @@ -15,7 +15,7 @@ export TASK_NAME=MRPC python ./run_glue_with_pabee.py \ --model_type albert \ - --model_name_or_path bert-base-uncased/albert-base-v2 \ + --model_name_or_path google-bert/bert-base-uncased/albert/albert-base-v2 \ --task_name $TASK_NAME \ --do_train \ --do_eval \ diff --git a/examples/research_projects/bert-loses-patience/pabee/modeling_pabee_albert.py b/examples/research_projects/bert-loses-patience/pabee/modeling_pabee_albert.py index 5e17352dc19b54..6881bf8d184e8c 100644 --- a/examples/research_projects/bert-loses-patience/pabee/modeling_pabee_albert.py +++ b/examples/research_projects/bert-loses-patience/pabee/modeling_pabee_albert.py @@ -253,7 +253,7 @@ def forward( Returns: :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`~transformers.AlbertConfig`) and inputs: - loss: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: + loss (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``: Classification (or regression if config.num_labels==1) loss. logits ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)`` Classification (or regression if config.num_labels==1) scores (before SoftMax). @@ -276,8 +276,8 @@ def forward( from torch import nn import torch - tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2') - model = AlbertForSequenceClassificationWithPabee.from_pretrained('albert-base-v2') + tokenizer = AlbertTokenizer.from_pretrained('albert/albert-base-v2') + model = AlbertForSequenceClassificationWithPabee.from_pretrained('albert/albert-base-v2') input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0) # Batch size 1 labels = torch.tensor([1]).unsqueeze(0) # Batch size 1 outputs = model(input_ids, labels=labels) diff --git a/examples/research_projects/bert-loses-patience/pabee/modeling_pabee_bert.py b/examples/research_projects/bert-loses-patience/pabee/modeling_pabee_bert.py index b32f47d0c30020..dfa78585a64489 100644 --- a/examples/research_projects/bert-loses-patience/pabee/modeling_pabee_bert.py +++ b/examples/research_projects/bert-loses-patience/pabee/modeling_pabee_bert.py @@ -300,8 +300,8 @@ def forward( from torch import nn import torch - tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') - model = BertForSequenceClassificationWithPabee.from_pretrained('bert-base-uncased') + tokenizer = BertTokenizer.from_pretrained('google-bert/bert-base-uncased') + model = BertForSequenceClassificationWithPabee.from_pretrained('google-bert/bert-base-uncased') input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1 labels = torch.tensor([1]).unsqueeze(0) # Batch size 1 diff --git a/examples/research_projects/bert-loses-patience/run_glue_with_pabee.py b/examples/research_projects/bert-loses-patience/run_glue_with_pabee.py index aad680f201c520..847148d557bec1 100755 --- a/examples/research_projects/bert-loses-patience/run_glue_with_pabee.py +++ b/examples/research_projects/bert-loses-patience/run_glue_with_pabee.py @@ -148,7 +148,7 @@ def train(args, train_dataset, model, tokenizer): steps_trained_in_current_epoch = 0 # Check if continuing 
training from a checkpoint if os.path.exists(args.model_name_or_path): - # set global_step to gobal_step of last saved checkpoint from model path + # set global_step to global_step of last saved checkpoint from model path global_step = int(args.model_name_or_path.split("-")[-1].split("/")[0]) epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps) steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps) @@ -169,7 +169,7 @@ def train(args, train_dataset, model, tokenizer): desc="Epoch", disable=args.local_rank not in [-1, 0], ) - set_seed(args) # Added here for reproductibility + set_seed(args) # Added here for reproducibility for _ in train_iterator: epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) for step, batch in enumerate(epoch_iterator): @@ -575,7 +575,7 @@ def main(): type=str, default="O1", help=( - "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. " "See details at https://nvidia.github.io/apex/amp.html" ), ) @@ -614,7 +614,7 @@ def main(): if args.local_rank == -1 or args.no_cuda: device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") args.n_gpu = torch.cuda.device_count() - else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs torch.cuda.set_device(args.local_rank) device = torch.device("cuda", args.local_rank) torch.distributed.init_process_group(backend="nccl") @@ -727,9 +727,9 @@ def main(): tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) checkpoints = [args.output_dir] if args.eval_all_checkpoints: - checkpoints = list( + checkpoints = [ os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True)) - ) + ] logger.info("Evaluate the following checkpoints: %s", checkpoints) @@ -743,7 +743,7 @@ def main(): print(f"Evaluation for checkpoint {prefix}") for patience in patience_list: result = evaluate(args, model, tokenizer, prefix=prefix, patience=patience) - result = dict((k + "_{}".format(global_step), v) for k, v in result.items()) + result = {k + "_{}".format(global_step): v for k, v in result.items()} results.update(result) return results diff --git a/examples/research_projects/bert-loses-patience/test_run_glue_with_pabee.py b/examples/research_projects/bert-loses-patience/test_run_glue_with_pabee.py index 6a084d0741d5f5..5516924f0f2fb7 100644 --- a/examples/research_projects/bert-loses-patience/test_run_glue_with_pabee.py +++ b/examples/research_projects/bert-loses-patience/test_run_glue_with_pabee.py @@ -29,7 +29,7 @@ def test_run_glue(self): testargs = f""" run_glue_with_pabee.py --model_type albert - --model_name_or_path albert-base-v2 + --model_name_or_path albert/albert-base-v2 --data_dir ./tests/fixtures/tests_samples/MRPC/ --output_dir {tmp_dir} --overwrite_output_dir diff --git a/examples/research_projects/bertabs/README.md b/examples/research_projects/bertabs/README.md index d5e6bbbaa28699..7109c0fb72be1b 100644 --- a/examples/research_projects/bertabs/README.md +++ b/examples/research_projects/bertabs/README.md @@ -8,7 +8,7 @@ The model is loaded with the pre-trained weights for the abstractive summarizati ## Setup -``` +```bash git clone 
https://github.com/huggingface/transformers && cd transformers pip install . pip install nltk py-rouge diff --git a/examples/research_projects/bertabs/convert_bertabs_original_pytorch_checkpoint.py b/examples/research_projects/bertabs/convert_bertabs_original_pytorch_checkpoint.py index 53ba3829b15030..b6f5d1775150cf 100644 --- a/examples/research_projects/bertabs/convert_bertabs_original_pytorch_checkpoint.py +++ b/examples/research_projects/bertabs/convert_bertabs_original_pytorch_checkpoint.py @@ -107,7 +107,7 @@ def convert_bertabs_checkpoints(path_to_checkpoints, dump_path): # ---------------------------------- logging.info("Make sure that the models' outputs are identical") - tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") + tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased") # prepare the model inputs encoder_input_ids = tokenizer.encode("This is sample éàalj'-.") diff --git a/examples/research_projects/bertabs/modeling_bertabs.py b/examples/research_projects/bertabs/modeling_bertabs.py index 33e216f4a08117..2ebce466561393 100644 --- a/examples/research_projects/bertabs/modeling_bertabs.py +++ b/examples/research_projects/bertabs/modeling_bertabs.py @@ -54,7 +54,7 @@ def __init__(self, args, checkpoint=None, bert_extractive_checkpoint=None): load_bert_pretrained_extractive = True if bert_extractive_checkpoint else False if load_bert_pretrained_extractive: self.bert.model.load_state_dict( - dict([(n[11:], p) for n, p in bert_extractive_checkpoint.items() if n.startswith("bert.model")]), + {n[11:]: p for n, p in bert_extractive_checkpoint.items() if n.startswith("bert.model")}, strict=True, ) @@ -128,7 +128,7 @@ class Bert(nn.Module): def __init__(self): super().__init__() - config = BertConfig.from_pretrained("bert-base-uncased") + config = BertConfig.from_pretrained("google-bert/bert-base-uncased") self.model = BertModel(config) def forward(self, input_ids, attention_mask=None, token_type_ids=None, **kwargs): diff --git a/examples/research_projects/bertabs/run_summarization.py b/examples/research_projects/bertabs/run_summarization.py index 82ef8ab39ea9b7..1f969f117baaf2 100644 --- a/examples/research_projects/bertabs/run_summarization.py +++ b/examples/research_projects/bertabs/run_summarization.py @@ -29,7 +29,7 @@ def evaluate(args): - tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True) + tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased", do_lower_case=True) model = BertAbs.from_pretrained("remi/bertabs-finetuned-extractive-abstractive-summarization") model.to(args.device) model.eval() diff --git a/examples/research_projects/bertology/run_bertology.py b/examples/research_projects/bertology/run_bertology.py index 030573d87f3532..4cb046066c768b 100644 --- a/examples/research_projects/bertology/run_bertology.py +++ b/examples/research_projects/bertology/run_bertology.py @@ -218,9 +218,9 @@ def prune_heads(args, model, eval_dataloader, head_mask): original_time = datetime.now() - before_time original_num_params = sum(p.numel() for p in model.parameters()) - heads_to_prune = dict( - (layer, (1 - head_mask[layer].long()).nonzero().squeeze().tolist()) for layer in range(len(head_mask)) - ) + heads_to_prune = { + layer: (1 - head_mask[layer].long()).nonzero().squeeze().tolist() for layer in range(len(head_mask)) + } assert sum(len(h) for h in heads_to_prune.values()) == (1 - head_mask.long()).sum().item() model.prune_heads(heads_to_prune) diff --git 
a/examples/research_projects/bertology/run_prune_gpt.py b/examples/research_projects/bertology/run_prune_gpt.py index 68cece6e997ad2..fa7484a787b6c2 100644 --- a/examples/research_projects/bertology/run_prune_gpt.py +++ b/examples/research_projects/bertology/run_prune_gpt.py @@ -194,9 +194,9 @@ def prune_heads(args, model, eval_dataloader, head_mask): original_time = datetime.now() - before_time original_num_params = sum(p.numel() for p in model.parameters()) - heads_to_prune = dict( - (layer, (1 - head_mask[layer].long()).nonzero().squeeze().tolist()) for layer in range(len(head_mask)) - ) + heads_to_prune = { + layer: (1 - head_mask[layer].long()).nonzero().squeeze().tolist() for layer in range(len(head_mask)) + } for k, v in heads_to_prune.items(): if isinstance(v, int): diff --git a/examples/research_projects/codeparrot/README.md b/examples/research_projects/codeparrot/README.md index 6c57c4350fbc02..f0af3d144f781a 100644 --- a/examples/research_projects/codeparrot/README.md +++ b/examples/research_projects/codeparrot/README.md @@ -50,7 +50,7 @@ The raw dataset contains many duplicates. We deduplicated and filtered the datas - fraction of alphanumeric characters < 0.25 - containing the word "auto-generated" or similar in the first 5 lines - filtering with a probability of 0.7 of files with a mention of "test file" or "configuration file" or similar in the first 5 lines -- filtering with a probability of 0.7 of files with high occurence of the keywords "test " or "config" +- filtering with a probability of 0.7 of files with high occurrence of the keywords "test " or "config" - filtering with a probability of 0.7 of files without a mention of the keywords `def` , `for`, `while` and `class` - filtering files that use the assignment operator `=` less than 5 times - filtering files with ratio between number of characters and number of tokens after tokenization < 1.5 (the average ratio is 3.6) @@ -79,7 +79,7 @@ python scripts/pretokenizing.py \ Before training a new model for code we create a new tokenizer that is efficient at code tokenization. To train the tokenizer you can run the following command: ```bash python scripts/bpe_training.py \ - --base_tokenizer gpt2 \ + --base_tokenizer openai-community/gpt2 \ --dataset_name codeparrot/codeparrot-clean-train ``` @@ -90,12 +90,12 @@ The models are randomly initialized and trained from scratch. To initialize a ne ```bash python scripts/initialize_model.py \ ---config_name gpt2-large \ +--config_name openai-community/gpt2-large \ --tokenizer_name codeparrot/codeparrot \ --model_name codeparrot \ --push_to_hub True ``` -This will initialize a new model with the architecture and configuration of `gpt2-large` and use the tokenizer to appropriately size the input embeddings. Finally, the initilaized model is pushed the hub. +This will initialize a new model with the architecture and configuration of `openai-community/gpt2-large` and use the tokenizer to appropriately size the input embeddings. Finally, the initilaized model is pushed the hub. We can either pass the name of a text dataset or a pretokenized dataset which speeds up training a bit. Now that the tokenizer and model are also ready we can start training the model. The main training script is built with `accelerate` to scale across a wide range of platforms and infrastructure scales. 
We train two models with [110M](https://huggingface.co/codeparrot/codeparrot-small/) and [1.5B](https://huggingface.co/codeparrot/codeparrot/) parameters for 25-30B tokens on a 16xA100 (40GB) machine which takes 1 day and 1 week, respectively. diff --git a/examples/research_projects/codeparrot/scripts/arguments.py b/examples/research_projects/codeparrot/scripts/arguments.py index 4def9ac3b854ec..5fee05eb04c50a 100644 --- a/examples/research_projects/codeparrot/scripts/arguments.py +++ b/examples/research_projects/codeparrot/scripts/arguments.py @@ -172,7 +172,7 @@ class TokenizerTrainingArguments: """ base_tokenizer: Optional[str] = field( - default="gpt2", metadata={"help": "Base tokenizer to build new tokenizer from."} + default="openai-community/gpt2", metadata={"help": "Base tokenizer to build new tokenizer from."} ) dataset_name: Optional[str] = field( default="transformersbook/codeparrot-train", metadata={"help": "Dataset to train tokenizer on."} @@ -211,7 +211,7 @@ class InitializationArguments: """ config_name: Optional[str] = field( - default="gpt2-large", metadata={"help": "Configuration to use for model initialization."} + default="openai-community/gpt2-large", metadata={"help": "Configuration to use for model initialization."} ) tokenizer_name: Optional[str] = field( default="codeparrot/codeparrot", metadata={"help": "Tokenizer attached to model."} diff --git a/examples/research_projects/codeparrot/scripts/codeparrot_training.py b/examples/research_projects/codeparrot/scripts/codeparrot_training.py index 2510e02c94700d..16f6077f2415c8 100644 --- a/examples/research_projects/codeparrot/scripts/codeparrot_training.py +++ b/examples/research_projects/codeparrot/scripts/codeparrot_training.py @@ -7,6 +7,7 @@ import datasets import torch from accelerate import Accelerator, DistributedType +from accelerate.utils import ProjectConfiguration from arguments import TrainingArguments from datasets import load_dataset from huggingface_hub import Repository @@ -195,7 +196,8 @@ def evaluate(args): args = parser.parse_args() # Accelerator -accelerator = Accelerator(log_with=["wandb", "tensorboard"], logging_dir=f"{args.save_dir}/log") +config = ProjectConfiguration(project_dir=args.save_dir, logging_dir="log") +accelerator = Accelerator(log_with=["wandb", "tensorboard"], project_config=config) acc_state = {str(k): str(v) for k, v in accelerator.state.__dict__.items()} args = Namespace(**vars(args), **acc_state) diff --git a/examples/research_projects/codeparrot/scripts/human_eval.py b/examples/research_projects/codeparrot/scripts/human_eval.py index 157079881d5f73..ef217a597e3385 100644 --- a/examples/research_projects/codeparrot/scripts/human_eval.py +++ b/examples/research_projects/codeparrot/scripts/human_eval.py @@ -60,7 +60,7 @@ def __call__(self, input_ids, scores, **kwargs): decoded_generations = self.tokenizer.batch_decode(input_ids[:, self.start_length :]) done = [] for decoded_generation in decoded_generations: - done.append(any([stop_string in decoded_generation for stop_string in self.eof_strings])) + done.append(any(stop_string in decoded_generation for stop_string in self.eof_strings)) return all(done) diff --git a/examples/research_projects/codeparrot/scripts/minhash_deduplication.py b/examples/research_projects/codeparrot/scripts/minhash_deduplication.py index 195a9dc8096b21..f1984711278a10 100644 --- a/examples/research_projects/codeparrot/scripts/minhash_deduplication.py +++ b/examples/research_projects/codeparrot/scripts/minhash_deduplication.py @@ -29,7 +29,7 @@ def 
get_min_hash(tokens: List[str]) -> Optional[MinHash]: def get_tokens(code: str) -> Set[str]: """Tokenize a code snippet.""" - return set([t for t in NON_ALPHA.split(code) if len(t.strip()) > 0]) + return {t for t in NON_ALPHA.split(code) if len(t.strip()) > 0} class DuplicationIndex: @@ -243,7 +243,7 @@ def deduplicate_dataset( >>> ds_dedup, duplicate_clusters = deduplicate_dataset(ds, jaccard_threshold=0.85) """ duplicate_clusters = make_duplicate_clusters(dataset, jaccard_threshold) - duplicate_indices = set(x["base_index"] for cluster in duplicate_clusters for x in cluster) + duplicate_indices = {x["base_index"] for cluster in duplicate_clusters for x in cluster} extreme_dict = {} extremes_clusters = find_extremes(duplicate_clusters, dataset, jaccard_threshold) for extremes in extremes_clusters: diff --git a/examples/research_projects/codeparrot/scripts/preprocessing.py b/examples/research_projects/codeparrot/scripts/preprocessing.py index 07540d0b628433..d9cac5abfd8e19 100644 --- a/examples/research_projects/codeparrot/scripts/preprocessing.py +++ b/examples/research_projects/codeparrot/scripts/preprocessing.py @@ -1,5 +1,4 @@ import gzip -import hashlib import json import multiprocessing import os @@ -11,6 +10,7 @@ import numpy as np from arguments import PreprocessingArguments from datasets import load_dataset +from huggingface_hub.utils import insecure_hashlib from minhash_deduplication import deduplicate_dataset from transformers import AutoTokenizer, HfArgumentParser @@ -21,7 +21,7 @@ def get_hash(example): """Get hash of content field.""" - return {"hash": hashlib.md5(re.sub(PATTERN, "", example["content"]).encode("utf-8")).hexdigest()} + return {"hash": insecure_hashlib.md5(re.sub(PATTERN, "", example["content"]).encode("utf-8")).hexdigest()} def line_stats(example): @@ -60,7 +60,7 @@ def is_autogenerated(example, scan_width=5): def is_config_or_test(example, scan_width=5, coeff=0.05): """Check if file is a configuration file or a unit test by : 1- looking for keywords in the first few lines of the file. - 2- counting number of occurence of the words 'config' and 'test' with respect to number of lines. + 2- counting number of occurrence of the words 'config' and 'test' with respect to number of lines. 
""" keywords = ["unit tests", "test file", "configuration file"] @@ -114,7 +114,7 @@ def char_token_ratio(example): def preprocess(example): """Chain all preprocessing steps into one function to not fill cache.""" - results = dict() + results = {} results.update(get_hash(example)) results.update(line_stats(example)) results.update(alpha_stats(example)) diff --git a/examples/research_projects/codeparrot/scripts/pretokenizing.py b/examples/research_projects/codeparrot/scripts/pretokenizing.py index 5eb793d10d959c..7cac8f511918d1 100644 --- a/examples/research_projects/codeparrot/scripts/pretokenizing.py +++ b/examples/research_projects/codeparrot/scripts/pretokenizing.py @@ -8,7 +8,7 @@ def tokenize(example): - output = dict() + output = {} output["input_ids"] = tokenizer(example["content"], truncation=False)["input_ids"] output["ratio_char_token"] = len(example["content"]) / len(output["input_ids"]) return output diff --git a/examples/research_projects/decision_transformer/requirements.txt b/examples/research_projects/decision_transformer/requirements.txt index 112141e172dd24..d832b76ec04bde 100644 --- a/examples/research_projects/decision_transformer/requirements.txt +++ b/examples/research_projects/decision_transformer/requirements.txt @@ -1,5 +1,5 @@ absl-py==1.0.0 -aiohttp==3.8.1 +aiohttp==3.8.5 aiosignal==1.2.0 alembic==1.7.7 appdirs==1.4.4 @@ -20,7 +20,7 @@ boto3==1.16.34 botocore==1.19.63 Brotli==1.0.9 cachetools==5.0.0 -certifi==2022.12.7 +certifi==2023.7.22 cffi==1.15.0 chardet==4.0.0 charset-normalizer==2.0.12 @@ -34,11 +34,11 @@ cmd2==2.4.0 codecarbon==1.2.0 colorlog==6.6.0 cookiecutter==2.1.1 -cryptography==39.0.1 +cryptography==42.0.0 csvw==2.0.0 cycler==0.11.0 Cython==0.29.28 -dash==2.3.0 +dash==2.15.0 dash-bootstrap-components==1.0.3 dash-core-components==2.0.0 dash-html-components==2.0.0 @@ -57,17 +57,17 @@ fasteners==0.17.3 filelock==3.6.0 fire==0.4.0 flake8==4.0.1 -Flask==2.0.3 +Flask==2.3.2 Flask-Compress==1.11 flatbuffers==2.0 flax==0.4.0 -fonttools==4.31.1 +fonttools==4.43.0 frozenlist==1.3.0 fsspec==2022.2.0 fugashi==1.1.2 gast==0.5.3 gitdb==4.0.9 -GitPython==3.1.30 +GitPython==3.1.32 glfw==2.5.1 google-auth==2.6.2 google-auth-oauthlib==0.4.6 @@ -92,7 +92,7 @@ itsdangerous==2.1.1 jax==0.3.4 jaxlib==0.3.2 jedi==0.18.1 -Jinja2==2.11.3 +Jinja2==3.1.3 jinja2-time==0.2.0 jmespath==0.10.0 joblib==1.2.0 @@ -133,7 +133,7 @@ pbr==5.8.1 pexpect==4.8.0 phonemizer==3.0.1 pickleshare==0.7.5 -Pillow==9.3.0 +Pillow==10.0.1 Pint==0.16.1 plac==1.3.4 platformdirs==2.5.1 @@ -157,7 +157,7 @@ pycodestyle==2.8.0 pycparser==2.21 pyctcdecode==0.3.0 pyflakes==2.4.0 -Pygments==2.11.2 +Pygments==2.15.0 pygtrie==2.4.2 pynvml==11.4.1 pyOpenSSL==22.0.0 @@ -175,9 +175,9 @@ pytz==2022.1 pytz-deprecation-shim==0.1.0.post0 PyYAML==6.0 ray==1.11.0 -redis==4.1.4 +redis==4.5.4 regex==2022.3.15 -requests==2.27.1 +requests==2.31.0 requests-oauthlib==1.3.1 resampy==0.2.2 responses==0.18.0 @@ -229,11 +229,11 @@ tzlocal==4.1 unidic==1.1.0 unidic-lite==1.0.8 uritemplate==4.1.1 -urllib3==1.26.9 +urllib3==1.26.18 wasabi==0.9.0 wcwidth==0.2.5 websocket-client==1.3.1 -Werkzeug==2.2.3 +Werkzeug==3.0.1 wrapt==1.14.0 xxhash==3.0.0 yarl==1.7.2 diff --git a/examples/research_projects/deebert/README.md b/examples/research_projects/deebert/README.md index 30c871e1a594fc..08a087dc03ebaf 100644 --- a/examples/research_projects/deebert/README.md +++ b/examples/research_projects/deebert/README.md @@ -34,7 +34,7 @@ This is for evaluating fine-tuned DeeBERT models, given a number of different ea ## Citation Please cite our 
paper if you find the resource useful: -``` +```bibtex @inproceedings{xin-etal-2020-deebert, title = "{D}ee{BERT}: Dynamic Early Exiting for Accelerating {BERT} Inference", author = "Xin, Ji and diff --git a/examples/research_projects/deebert/run_glue_deebert.py b/examples/research_projects/deebert/run_glue_deebert.py index f86390375ff754..6ca28ab5bc07bc 100644 --- a/examples/research_projects/deebert/run_glue_deebert.py +++ b/examples/research_projects/deebert/run_glue_deebert.py @@ -162,7 +162,7 @@ def train(args, train_dataset, model, tokenizer, train_highway=False): tr_loss, logging_loss = 0.0, 0.0 model.zero_grad() train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) - set_seed(args) # Added here for reproductibility (even between python 2 and 3) + set_seed(args) # Added here for reproducibility (even between python 2 and 3) for _ in train_iterator: epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) for step, batch in enumerate(epoch_iterator): @@ -491,7 +491,7 @@ def main(): help="Number of updates steps to accumulate before performing a backward/update pass.", ) parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") - parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight deay if we apply some.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") parser.add_argument( @@ -532,7 +532,7 @@ def main(): type=str, default="O1", help=( - "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. 
" "See details at https://nvidia.github.io/apex/amp.html" ), ) @@ -566,7 +566,7 @@ def main(): if args.local_rank == -1 or args.no_cuda: device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") args.n_gpu = torch.cuda.device_count() - else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs torch.cuda.set_device(args.local_rank) device = torch.device("cuda", args.local_rank) torch.distributed.init_process_group(backend="nccl") @@ -685,9 +685,9 @@ def main(): tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) checkpoints = [args.output_dir] if args.eval_all_checkpoints: - checkpoints = list( + checkpoints = [ os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True)) - ) + ] logger.info("Evaluate the following checkpoints: %s", checkpoints) for checkpoint in checkpoints: @@ -725,7 +725,7 @@ def main(): for i in range(model.num_layers): info_str += " {:.2f}".format(100 * each_layer_results[i]) logger.info(info_str) - result = dict((k + "_{}".format(global_step), v) for k, v in result.items()) + result = {k + "_{}".format(global_step): v for k, v in result.items()} results.update(result) return results diff --git a/examples/research_projects/deebert/src/modeling_highway_bert.py b/examples/research_projects/deebert/src/modeling_highway_bert.py index 2a881decbbd529..b866ef0869c758 100644 --- a/examples/research_projects/deebert/src/modeling_highway_bert.py +++ b/examples/research_projects/deebert/src/modeling_highway_bert.py @@ -32,7 +32,7 @@ def __init__(self, config): self.early_exit_entropy = [-1 for _ in range(config.num_hidden_layers)] def set_early_exit_entropy(self, x): - if (type(x) is float) or (type(x) is int): + if isinstance(x, (float, int)): for i in range(len(self.early_exit_entropy)): self.early_exit_entropy[i] = x else: @@ -232,9 +232,7 @@ def forward( outputs = ( sequence_output, pooled_output, - ) + encoder_outputs[ - 1: - ] # add hidden_states and attentions if they are here + ) + encoder_outputs[1:] # add hidden_states and attentions if they are here return outputs # sequence_output, pooled_output, (hidden_states), (attentions), highway exits diff --git a/examples/research_projects/deebert/test_glue_deebert.py b/examples/research_projects/deebert/test_glue_deebert.py index 775c4d70b6523e..7a5f059c8cedff 100644 --- a/examples/research_projects/deebert/test_glue_deebert.py +++ b/examples/research_projects/deebert/test_glue_deebert.py @@ -48,7 +48,7 @@ def run_and_check(self, args): def test_glue_deebert_train(self): train_args = """ --model_type roberta - --model_name_or_path roberta-base + --model_name_or_path FacebookAI/roberta-base --task_name MRPC --do_train --do_eval @@ -61,7 +61,7 @@ def test_glue_deebert_train(self): --num_train_epochs 3 --overwrite_output_dir --seed 42 - --output_dir ./examples/deebert/saved_models/roberta-base/MRPC/two_stage + --output_dir ./examples/deebert/saved_models/FacebookAI/roberta-base/MRPC/two_stage --plot_data_dir ./examples/deebert/results/ --save_steps 0 --overwrite_cache @@ -71,12 +71,12 @@ def test_glue_deebert_train(self): eval_args = """ --model_type roberta - --model_name_or_path ./examples/deebert/saved_models/roberta-base/MRPC/two_stage + --model_name_or_path ./examples/deebert/saved_models/FacebookAI/roberta-base/MRPC/two_stage --task_name MRPC --do_eval --do_lower_case --data_dir 
./tests/fixtures/tests_samples/MRPC/ - --output_dir ./examples/deebert/saved_models/roberta-base/MRPC/two_stage + --output_dir ./examples/deebert/saved_models/FacebookAI/roberta-base/MRPC/two_stage --plot_data_dir ./examples/deebert/results/ --max_seq_length 128 --eval_each_highway @@ -88,12 +88,12 @@ def test_glue_deebert_train(self): entropy_eval_args = """ --model_type roberta - --model_name_or_path ./examples/deebert/saved_models/roberta-base/MRPC/two_stage + --model_name_or_path ./examples/deebert/saved_models/FacebookAI/roberta-base/MRPC/two_stage --task_name MRPC --do_eval --do_lower_case --data_dir ./tests/fixtures/tests_samples/MRPC/ - --output_dir ./examples/deebert/saved_models/roberta-base/MRPC/two_stage + --output_dir ./examples/deebert/saved_models/FacebookAI/roberta-base/MRPC/two_stage --plot_data_dir ./examples/deebert/results/ --max_seq_length 128 --early_exit_entropy 0.1 diff --git a/examples/research_projects/distillation/README.md b/examples/research_projects/distillation/README.md index 36b45f79889f0f..594e953f99d76d 100644 --- a/examples/research_projects/distillation/README.md +++ b/examples/research_projects/distillation/README.md @@ -183,7 +183,7 @@ Happy distillation! If you find the resource useful, you should cite the following paper: -``` +```bibtex @inproceedings{sanh2019distilbert, title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter}, author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas}, diff --git a/examples/research_projects/distillation/grouped_batch_sampler.py b/examples/research_projects/distillation/grouped_batch_sampler.py index 83addc371f2e21..a068f7e09e6a8e 100644 --- a/examples/research_projects/distillation/grouped_batch_sampler.py +++ b/examples/research_projects/distillation/grouped_batch_sampler.py @@ -27,7 +27,7 @@ def _quantize(x, bins): bins = copy.deepcopy(bins) bins = sorted(bins) - quantized = list(map(lambda y: bisect.bisect_right(bins, y), x)) + quantized = [bisect.bisect_right(bins, y) for y in x] return quantized diff --git a/examples/research_projects/distillation/requirements.txt b/examples/research_projects/distillation/requirements.txt index 80ee9335e6f67b..3e4f807c07d3f8 100644 --- a/examples/research_projects/distillation/requirements.txt +++ b/examples/research_projects/distillation/requirements.txt @@ -1,6 +1,6 @@ transformers -gitpython==3.1.30 +gitpython==3.1.32 tensorboard>=1.14.0 tensorboardX==1.8 psutil==5.6.6 diff --git a/examples/research_projects/distillation/run_squad_w_distillation.py b/examples/research_projects/distillation/run_squad_w_distillation.py index aba91995da0c3f..523d9bedb89261 100644 --- a/examples/research_projects/distillation/run_squad_w_distillation.py +++ b/examples/research_projects/distillation/run_squad_w_distillation.py @@ -165,7 +165,7 @@ def train(args, train_dataset, model, tokenizer, teacher=None): # Check if continuing training from a checkpoint if os.path.exists(args.model_name_or_path): try: - # set global_step to gobal_step of last saved checkpoint from model path + # set global_step to global_step of last saved checkpoint from model path checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0] global_step = int(checkpoint_suffix) epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps) @@ -183,7 +183,7 @@ def train(args, train_dataset, model, tokenizer, teacher=None): train_iterator = trange( epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not 
in [-1, 0] ) - # Added here for reproductibility + # Added here for reproducibility set_seed(args) for _ in train_iterator: @@ -696,7 +696,7 @@ def main(): type=str, default="O1", help=( - "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. " "See details at https://nvidia.github.io/apex/amp.html" ), ) @@ -731,7 +731,7 @@ def main(): if args.local_rank == -1 or args.no_cuda: device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count() - else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs torch.cuda.set_device(args.local_rank) device = torch.device("cuda", args.local_rank) torch.distributed.init_process_group(backend="nccl") @@ -850,9 +850,9 @@ def main(): logger.info("Loading checkpoints saved during training for evaluation") checkpoints = [args.output_dir] if args.eval_all_checkpoints: - checkpoints = list( + checkpoints = [ os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True)) - ) + ] logger.info("Evaluate the following checkpoints: %s", checkpoints) @@ -865,7 +865,7 @@ def main(): # Evaluate result = evaluate(args, model, tokenizer, prefix=global_step) - result = dict((k + ("_{}".format(global_step) if global_step else ""), v) for k, v in result.items()) + result = {k + ("_{}".format(global_step) if global_step else ""): v for k, v in result.items()} results.update(result) logger.info("Results: {}".format(results)) diff --git a/examples/research_projects/distillation/train.py b/examples/research_projects/distillation/train.py index bb35a1df853943..1acb527220e1c5 100644 --- a/examples/research_projects/distillation/train.py +++ b/examples/research_projects/distillation/train.py @@ -208,7 +208,7 @@ def main(): type=str, default="O1", help=( - "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. 
" "See details at https://nvidia.github.io/apex/amp.html" ), ) diff --git a/examples/research_projects/fsner/src/fsner/tokenizer_utils.py b/examples/research_projects/fsner/src/fsner/tokenizer_utils.py index bc5f6650ccd9f5..b281ae6cfb8961 100644 --- a/examples/research_projects/fsner/src/fsner/tokenizer_utils.py +++ b/examples/research_projects/fsner/src/fsner/tokenizer_utils.py @@ -17,7 +17,7 @@ def tokenize(self, x): `transformers.tokenization_utils_base.BatchEncoding` dict with additional keys and values for start_token_id, end_token_id and sizes of example lists for each entity type """ - if isinstance(x, list) and all([isinstance(_x, list) for _x in x]): + if isinstance(x, list) and all(isinstance(_x, list) for _x in x): d = None for l in x: t = self.tokenizer( @@ -37,7 +37,7 @@ def tokenize(self, x): d["start_token_id"] = torch.tensor(self.tokenizer.convert_tokens_to_ids("[E]")) d["end_token_id"] = torch.tensor(self.tokenizer.convert_tokens_to_ids("[/E]")) - elif isinstance(x, list) and all([isinstance(_x, str) for _x in x]): + elif isinstance(x, list) and all(isinstance(_x, str) for _x in x): d = self.tokenizer( x, padding="max_length", diff --git a/examples/research_projects/information-gain-filtration/README.md b/examples/research_projects/information-gain-filtration/README.md index bf95cb8ea81423..f685a512509f0d 100644 --- a/examples/research_projects/information-gain-filtration/README.md +++ b/examples/research_projects/information-gain-filtration/README.md @@ -64,7 +64,7 @@ To fine-tune a transformer model with IGF on a language modeling task, use the f ```python python run_clm_igf.py\ ---model_name_or_path "gpt2" \ +--model_name_or_path "openai-community/gpt2" \ --data_file="data/tokenized_stories_train_wikitext103" \ --igf_data_file="data/IGF_values" \ --context_len 32 \ @@ -84,7 +84,7 @@ python run_clm_igf.py\ If you find the resource useful, please cite the following paper -``` +```bibtex @inproceedings{antonello-etal-2021-selecting, title = "Selecting Informative Contexts Improves Language Model Fine-tuning", author = "Antonello, Richard and Beckage, Nicole and Turek, Javier and Huth, Alexander", diff --git a/examples/research_projects/information-gain-filtration/igf/igf.py b/examples/research_projects/information-gain-filtration/igf/igf.py index 6861467a33592a..4c5aefd9584e16 100644 --- a/examples/research_projects/information-gain-filtration/igf/igf.py +++ b/examples/research_projects/information-gain-filtration/igf/igf.py @@ -69,9 +69,9 @@ def compute_perplexity(model, test_data, context_len): return perplexity -def load_gpt2(model_name="gpt2"): +def load_gpt2(model_name="openai-community/gpt2"): """ - load original gpt2 and save off for quicker loading + load original openai-community/gpt2 and save off for quicker loading Args: model_name: GPT-2 diff --git a/examples/research_projects/information-gain-filtration/run_clm_igf.py b/examples/research_projects/information-gain-filtration/run_clm_igf.py index c1584a2f89adc1..74973309c4e16b 100644 --- a/examples/research_projects/information-gain-filtration/run_clm_igf.py +++ b/examples/research_projects/information-gain-filtration/run_clm_igf.py @@ -84,7 +84,7 @@ def generate_n_pairs( device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") # load pretrained model - model = load_gpt2("gpt2").to(device) + model = load_gpt2("openai-community/gpt2").to(device) print("computing perplexity on objective set") orig_perp = compute_perplexity(model, objective_set, context_len).item() print("perplexity on objective 
set:", orig_perp) @@ -121,7 +121,7 @@ def training_secondary_learner( set_seed(42) # Load pre-trained model - model = GPT2LMHeadModel.from_pretrained("gpt2") + model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2") # Initialize secondary learner to use embedding weights of model secondary_learner = SecondaryLearner(model) @@ -153,7 +153,7 @@ def finetune( recopy_model=recopy_gpt2, secondary_learner=None, eval_interval=10, - finetuned_model_name="gpt2_finetuned.pt", + finetuned_model_name="openai-community/gpt2_finetuned.pt", ): """ fine-tune with IGF if secondary_learner is not None, else standard fine-tuning @@ -346,7 +346,10 @@ def main(): ) parser.add_argument( - "--batch_size", default=16, type=int, help="batch size of training data of language model(gpt2) " + "--batch_size", + default=16, + type=int, + help="batch size of training data of language model(openai-community/gpt2) ", ) parser.add_argument( @@ -354,7 +357,7 @@ def main(): default=10, type=int, help=( - "decay the selectivity of our secondary learner filter from" + "decay the selectivity of our secondary learner filter from " "1 standard deviation above average to 1 below average after 10 batches" ), ) @@ -383,7 +386,9 @@ def main(): ), ) - parser.add_argument("--finetuned_model_name", default="gpt2_finetuned.pt", type=str, help="finetuned_model_name") + parser.add_argument( + "--finetuned_model_name", default="openai-community/gpt2_finetuned.pt", type=str, help="finetuned_model_name" + ) parser.add_argument( "--recopy_model", @@ -416,16 +421,16 @@ def main(): igf_model_path="igf_model.pt", ) - # load pretrained gpt2 model - model = GPT2LMHeadModel.from_pretrained("gpt2") + # load pretrained openai-community/gpt2 model + model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2") set_seed(42) - # Generate train and test data to train and evaluate gpt2 model + # Generate train and test data to train and evaluate openai-community/gpt2 model train_dataset, test_dataset = generate_datasets( context_len=32, file="data/tokenized_stories_train_wikitext103.jbl", number=100, min_len=1026, trim=True ) - # fine-tuning of the gpt2 model using igf (Information Gain Filtration) + # fine-tuning of the openai-community/gpt2 model using igf (Information Gain Filtration) finetune( model, train_dataset, @@ -437,7 +442,7 @@ def main(): recopy_model=recopy_gpt2, secondary_learner=secondary_learner, eval_interval=10, - finetuned_model_name="gpt2_finetuned.pt", + finetuned_model_name="openai-community/gpt2_finetuned.pt", ) diff --git a/examples/research_projects/jax-projects/README.md b/examples/research_projects/jax-projects/README.md index 66bb6c61a376e6..88d8d7f9eba926 100644 --- a/examples/research_projects/jax-projects/README.md +++ b/examples/research_projects/jax-projects/README.md @@ -159,13 +159,13 @@ to be used, but that everybody in team is on the same page on what type of model To give an example, a well-defined project would be the following: - task: summarization -- model: [t5-small](https://huggingface.co/t5-small) +- model: [google-t5/t5-small](https://huggingface.co/google-t5/t5-small) - dataset: [CNN/Daily mail](https://huggingface.co/datasets/cnn_dailymail) - training script: [run_summarization_flax.py](https://github.com/huggingface/transformers/blob/main/examples/flax/summarization/run_summarization_flax.py) - outcome: t5 model that can summarize news -- work flow: adapt `run_summarization_flax.py` to work with `t5-small`. +- work flow: adapt `run_summarization_flax.py` to work with `google-t5/t5-small`. 
-This example is a very easy and not the most interesting project since a `t5-small` +This example is a very easy and not the most interesting project since a `google-t5/t5-small` summarization model exists already for CNN/Daily mail and pretty much no code has to be written. A well-defined project does not need to have the dataset be part of @@ -227,7 +227,7 @@ the forum and making use of the [🤗 hub](http://huggingface.co/) to have a ver control for your models and training logs. - When debugging, it is important that the debugging cycle is kept as short as possible to be able to effectively debug. *E.g.* if there is a problem with your training script, -you should run it with just a couple of hundreds of examples and not the whole dataset script. This can be done by either making use of [datasets streaming](https://huggingface.co/docs/datasets/master/dataset_streaming.html?highlight=streaming) or by selecting just the first +you should run it with just a couple of hundreds of examples and not the whole dataset script. This can be done by either making use of [datasets streaming](https://huggingface.co/docs/datasets/master/dataset_streaming?highlight=streaming) or by selecting just the first X number of data samples after loading: ```python @@ -311,7 +311,7 @@ library from source to profit from the most current additions during the communi Simply run the following steps: -``` +```bash $ cd ~/ $ git clone https://github.com/huggingface/datasets.git $ cd datasets @@ -335,7 +335,7 @@ dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', str dummy_input = next(iter(dataset))["text"] -tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base") +tokenizer = RobertaTokenizerFast.from_pretrained("FacebookAI/roberta-base") input_ids = tokenizer(dummy_input, return_tensors="np").input_ids[:, :10] model = FlaxRobertaModel.from_pretrained("julien-c/dummy-unknown") @@ -389,13 +389,13 @@ source ~//bin/activate Next you should install JAX's TPU version on TPU by running the following command: -``` +```bash $ pip install requests ``` and then: -``` +```bash $ pip install "jax[tpu]>=0.2.16" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html ``` @@ -468,7 +468,7 @@ library from source to profit from the most current additions during the communi Simply run the following steps: -``` +```bash $ cd ~/ $ git clone https://github.com/huggingface/datasets.git $ cd datasets @@ -492,7 +492,7 @@ dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', str dummy_input = next(iter(dataset))["text"] -tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base") +tokenizer = RobertaTokenizerFast.from_pretrained("FacebookAI/roberta-base") input_ids = tokenizer(dummy_input, return_tensors="np").input_ids[:, :10] model = FlaxRobertaModel.from_pretrained("julien-c/dummy-unknown") @@ -518,7 +518,7 @@ be available in a couple of days. 
- [BigBird](https://github.com/huggingface/transformers/blob/main/src/transformers/models/big_bird/modeling_flax_big_bird.py) - [CLIP](https://github.com/huggingface/transformers/blob/main/src/transformers/models/clip/modeling_flax_clip.py) - [ELECTRA](https://github.com/huggingface/transformers/blob/main/src/transformers/models/electra/modeling_flax_electra.py) -- [GPT2](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_flax_gpt2.py) +- [GPT2](https://github.com/huggingface/transformers/blob/main/src/transformers/models/openai-community/gpt2/modeling_flax_gpt2.py) - [(TODO) MBART](https://github.com/huggingface/transformers/blob/main/src/transformers/models/mbart/modeling_flax_mbart.py) - [RoBERTa](https://github.com/huggingface/transformers/blob/main/src/transformers/models/roberta/modeling_flax_roberta.py) - [T5](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_flax_t5.py) @@ -568,7 +568,7 @@ class ModelPyTorch: Instantiating an object `model_pytorch` of the class `ModelPyTorch` would actually allocate memory for the model weights and attach them to the attributes `self.key_proj`, `self.value_proj`, `self.query_proj`, and `self.logits.proj`. We could access the weights via: -``` +```python key_projection_matrix = model_pytorch.key_proj.weight.data ``` @@ -729,7 +729,7 @@ Let's use the base `FlaxRobertaModel` without any heads as an example. from transformers import FlaxRobertaModel, RobertaTokenizerFast import jax -tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base") +tokenizer = RobertaTokenizerFast.from_pretrained("FacebookAI/roberta-base") inputs = tokenizer("JAX/Flax is amazing ", padding="max_length", max_length=128, return_tensors="np") model = FlaxRobertaModel.from_pretrained("julien-c/dummy-unknown") @@ -1011,7 +1011,7 @@ and run the following commands in a Python shell to save a config. ```python from transformers import RobertaConfig -config = RobertaConfig.from_pretrained("roberta-base") +config = RobertaConfig.from_pretrained("FacebookAI/roberta-base") config.save_pretrained("./") ``` @@ -1117,7 +1117,7 @@ params = model.init(key2, x) bytes_output = serialization.to_bytes(params) -repo = Repository("flax-model", clone_from="flax-community/flax-model-dummy", use_auth_token=True) +repo = Repository("flax-model", clone_from="flax-community/flax-model-dummy", token=True) with repo.commit("My cool Flax model :)"): with open("flax_model.msgpack", "wb") as f: f.write(bytes_output) @@ -1153,7 +1153,7 @@ In the following, we will describe how to do so using a standard console, but yo 2. Once you've installed the google cloud sdk, you should set your account by running the following command. Make sure that `` corresponds to the gmail address you used to sign up for this event. ```bash -$ gcloud config set account +$ gcloud config set account ``` 3. Let's also make sure the correct project is set in case your email is used for multiple gcloud projects: @@ -1193,12 +1193,12 @@ All the widgets are open sourced in the `huggingface_hub` [repo](https://github. **NLP** * **Conversational:** To have the best conversations!. [Example](https://huggingface.co/microsoft/DialoGPT-large?). * **Feature Extraction:** Retrieve the input embeddings. [Example](https://huggingface.co/sentence-transformers/distilbert-base-nli-mean-tokens?text=test). -* **Fill Mask:** Predict potential words for a mask token. [Example](https://huggingface.co/bert-base-uncased?). 
-* **Question Answering:** Given a context and a question, predict the answer. [Example](https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad). +* **Fill Mask:** Predict potential words for a mask token. [Example](https://huggingface.co/google-bert/bert-base-uncased?). +* **Question Answering:** Given a context and a question, predict the answer. [Example](https://huggingface.co/google-bert/bert-large-uncased-whole-word-masking-finetuned-squad). * **Sentence Simmilarity:** Predict how similar a set of sentences are. Useful for Sentence Transformers. * **Summarization:** Given a text, output a summary of it. [Example](https://huggingface.co/sshleifer/distilbart-cnn-12-6). * **Table Question Answering:** Given a table and a question, predict the answer. [Example](https://huggingface.co/google/tapas-base-finetuned-wtq). -* **Text Generation:** Generate text based on a prompt. [Example](https://huggingface.co/gpt2) +* **Text Generation:** Generate text based on a prompt. [Example](https://huggingface.co/openai-community/gpt2) * **Token Classification:** Useful for tasks such as Named Entity Recognition and Part of Speech. [Example](https://huggingface.co/dslim/bert-base-NER). * **Zero-Shot Classification:** Too cool to explain with words. Here is an [example](https://huggingface.co/typeform/distilbert-base-uncased-mnli) * ([WIP](https://github.com/huggingface/huggingface_hub/issues/99)) **Table to Text Generation**. @@ -1224,25 +1224,25 @@ Sometimes you might be using different libraries or a very specific application A common use case is how to load files you have in your model repository in the Hub from the Streamlit demo. The `huggingface_hub` library is here to help you! -``` +```bash pip install huggingface_hub ``` Here is an example downloading (and caching!) a specific file directly from the Hub -``` +```python from huggingface_hub import hf_hub_download filepath = hf_hub_download("flax-community/roberta-base-als", "flax_model.msgpack"); ``` In many cases you will want to download the full repository. Here is an example downloading all the files from a repo. You can even specify specific revisions! -``` +```python from huggingface_hub import snapshot_download local_path = snapshot_download("flax-community/roberta-base-als"); ``` Note that if you're using 🤗 Transformers library, you can quickly load the model and tokenizer as follows -``` +```python from transformers import AutoTokenizer, AutoModelForMaskedLM tokenizer = AutoTokenizer.from_pretrained("REPO_ID") diff --git a/examples/research_projects/jax-projects/big_bird/README.md b/examples/research_projects/jax-projects/big_bird/README.md index e8ef274bbe07cd..42586e49580ebb 100644 --- a/examples/research_projects/jax-projects/big_bird/README.md +++ b/examples/research_projects/jax-projects/big_bird/README.md @@ -57,4 +57,4 @@ wget https://huggingface.co/datasets/vasudevgupta/natural-questions-validation/r python3 evaluate.py ``` -You can find our checkpoint on HuggingFace Hub ([see this](https://huggingface.co/vasudevgupta/flax-bigbird-natural-questions)). In case you are interested in PyTorch BigBird fine-tuning, you can refer to [this repositary](https://github.com/thevasudevgupta/bigbird). +You can find our checkpoint on HuggingFace Hub ([see this](https://huggingface.co/vasudevgupta/flax-bigbird-natural-questions)). In case you are interested in PyTorch BigBird fine-tuning, you can refer to [this repository](https://github.com/thevasudevgupta/bigbird). 
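A quick aside for reviewers: many hunks in this patch (for example in `run_bertology.py`, `run_prune_gpt.py`, `minhash_deduplication.py`, and `evaluate.py` just below) apply the same mechanical refactor, replacing `dict(...)` and `set([...])` calls wrapped around generator expressions with literal comprehensions. A minimal, self-contained sketch of the idiom, using invented data rather than anything taken from this patch, is:

```python
# Invented example data, used only to illustrate the comprehension refactor above.
results = {"acc": 0.91, "f1": 0.88}
global_step = 500

# Old style (removed by the patch): a dict() call around a generator expression.
old = dict((k + "_{}".format(global_step), v) for k, v in results.items())

# New style (added by the patch): a dict comprehension; sets use {item for ...} analogously.
new = {k + "_{}".format(global_step): v for k, v in results.items()}

assert old == new  # the rewrite is purely stylistic; behavior is unchanged
```

The same reasoning applies to the `set([...])` → `{...}` and `any([...])` → `any(...)` changes: they drop an intermediate list or constructor call (and let `any`/`all` short-circuit) without changing results.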
diff --git a/examples/research_projects/jax-projects/big_bird/bigbird_flax.py b/examples/research_projects/jax-projects/big_bird/bigbird_flax.py index ac37cbc8600a2f..af5e11c83a6ad2 100644 --- a/examples/research_projects/jax-projects/big_bird/bigbird_flax.py +++ b/examples/research_projects/jax-projects/big_bird/bigbird_flax.py @@ -247,9 +247,12 @@ def train(self, state, tr_dataset, val_dataset): lr = self.scheduler_fn(state_step - 1) eval_loss = self.evaluate(state, val_dataset) - logging_dict = dict( - step=state_step.item(), eval_loss=eval_loss.item(), tr_loss=tr_loss, lr=lr.item() - ) + logging_dict = { + "step": state_step.item(), + "eval_loss": eval_loss.item(), + "tr_loss": tr_loss, + "lr": lr.item(), + } tqdm.write(str(logging_dict)) self.logger.log(logging_dict, commit=True) diff --git a/examples/research_projects/jax-projects/big_bird/evaluate.py b/examples/research_projects/jax-projects/big_bird/evaluate.py index 32ca5172a5f25c..04e9e01ca237bd 100644 --- a/examples/research_projects/jax-projects/big_bird/evaluate.py +++ b/examples/research_projects/jax-projects/big_bird/evaluate.py @@ -144,9 +144,9 @@ def evaluate(example): predictions = expand_to_aliases(example["output"]) # some preprocessing to both prediction and answer - answers = set(["".join(a.split()) for a in answers]) - predictions = set(["".join(p.split()) for p in predictions]) - predictions = set([s for s in predictions if s not in ["``", "''", "`", "'"]]) + answers = {"".join(a.split()) for a in answers} + predictions = {"".join(p.split()) for p in predictions} + predictions = {s for s in predictions if s not in ["``", "''", "`", "'"]} # if there is a common element, it's a exact match example["match"] = len(list(answers & predictions)) > 0 diff --git a/examples/research_projects/jax-projects/big_bird/prepare_natural_questions.py b/examples/research_projects/jax-projects/big_bird/prepare_natural_questions.py index 22dc3e455024c0..ebbb184ccb6b6b 100644 --- a/examples/research_projects/jax-projects/big_bird/prepare_natural_questions.py +++ b/examples/research_projects/jax-projects/big_bird/prepare_natural_questions.py @@ -50,7 +50,7 @@ def choose_first(answer, is_long_answer=False): answer["remove_it"] = False cols = ["start_token", "end_token", "start_byte", "end_byte", "text"] - if not all([isinstance(answer[k], list) for k in cols]): + if not all(isinstance(answer[k], list) for k in cols): raise ValueError("Issue in ID", example["id"]) return answer @@ -314,12 +314,12 @@ def save_to_disk(hf_data, file_name): data = data["train" if PROCESS_TRAIN == "true" else "validation"] - fn_kwargs = dict( - tokenizer=tokenizer, - doc_stride=DOC_STRIDE, - max_length=MAX_LENGTH, - assertion=False, - ) + fn_kwargs = { + "tokenizer": tokenizer, + "doc_stride": DOC_STRIDE, + "max_length": MAX_LENGTH, + "assertion": False, + } data = data.map(prepare_inputs, fn_kwargs=fn_kwargs) data = data.remove_columns(["annotations", "document", "id", "question"]) print(data) diff --git a/examples/research_projects/jax-projects/dataset-streaming/README.md b/examples/research_projects/jax-projects/dataset-streaming/README.md index 416eee06af33d6..bdb6629e509c6f 100644 --- a/examples/research_projects/jax-projects/dataset-streaming/README.md +++ b/examples/research_projects/jax-projects/dataset-streaming/README.md @@ -23,7 +23,7 @@ JAX/Flax allows you to trace pure functions and compile them into efficient, fus Models written in JAX/Flax are **immutable** and updated in a purely functional way which enables simple and efficient model parallelism. 
-All of the following examples make use of [dataset streaming](https://huggingface.co/docs/datasets/master/dataset_streaming.html), therefore allowing to train models on massive datasets\ +All of the following examples make use of [dataset streaming](https://huggingface.co/docs/datasets/master/dataset_streaming), therefore allowing to train models on massive datasets\ without ever having to download the full dataset. ## Masked language modeling @@ -31,7 +31,7 @@ without ever having to download the full dataset. In the following, we demonstrate how to train a bi-directional transformer model using masked language modeling objective as introduced in [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805). More specifically, we demonstrate how JAX/Flax and dataset streaming can be leveraged -to pre-train [**`roberta-base`**](https://huggingface.co/roberta-base) +to pre-train [**`FacebookAI/roberta-base`**](https://huggingface.co/FacebookAI/roberta-base) in English on a single TPUv3-8 pod for 10000 update steps. The example script uses the 🤗 Datasets library. You can easily customize them to your needs if you need extra processing on your datasets. @@ -42,20 +42,20 @@ Here we call the model `"english-roberta-base-dummy"`, but you can change the mo You can do this either directly on [huggingface.co](https://huggingface.co/new) (assuming that you are logged in) or via the command line: -``` +```bash huggingface-cli repo create english-roberta-base-dummy ``` Next we clone the model repository to add the tokenizer and model files. -``` +```bash git clone https://huggingface.co//english-roberta-base-dummy ``` To ensure that all tensorboard traces will be uploaded correctly, we need to track them. You can run the following command inside your model repo to do so. -``` +```bash cd english-roberta-base-dummy git lfs track "*tfevents*" ``` @@ -80,8 +80,8 @@ from transformers import RobertaTokenizerFast, RobertaConfig model_dir = "./english-roberta-base-dummy" -tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base") -config = RobertaConfig.from_pretrained("roberta-base") +tokenizer = RobertaTokenizerFast.from_pretrained("FacebookAI/roberta-base") +config = RobertaConfig.from_pretrained("FacebookAI/roberta-base") tokenizer.save_pretrained(model_dir) config.save_pretrained(model_dir) diff --git a/examples/research_projects/jax-projects/dataset-streaming/run_mlm_flax_stream.py b/examples/research_projects/jax-projects/dataset-streaming/run_mlm_flax_stream.py index 3c5bdb7b44507c..fbb165ba42c14e 100755 --- a/examples/research_projects/jax-projects/dataset-streaming/run_mlm_flax_stream.py +++ b/examples/research_projects/jax-projects/dataset-streaming/run_mlm_flax_stream.py @@ -76,7 +76,7 @@ class ModelArguments: default=None, metadata={ "help": ( - "The model checkpoint for weights initialization.Don't set if you want to train a model from scratch." + "The model checkpoint for weights initialization. Don't set if you want to train a model from scratch." ) }, ) @@ -341,7 +341,7 @@ def write_eval_metric(summary_writer, eval_metrics, step): and not training_args.overwrite_output_dir ): raise ValueError( - f"Output directory ({training_args.output_dir}) already exists and is not empty." + f"Output directory ({training_args.output_dir}) already exists and is not empty. " "Use --overwrite_output_dir to overcome." 
) @@ -399,7 +399,7 @@ def write_eval_metric(summary_writer, eval_metrics, step): ) else: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name." ) diff --git a/examples/research_projects/jax-projects/hybrid_clip/README.md b/examples/research_projects/jax-projects/hybrid_clip/README.md index 282d5c813b7da4..72d3db1935895f 100644 --- a/examples/research_projects/jax-projects/hybrid_clip/README.md +++ b/examples/research_projects/jax-projects/hybrid_clip/README.md @@ -32,7 +32,7 @@ Models written in JAX/Flax are **immutable** and updated in a purely functional way which enables simple and efficient model parallelism. In this example we will use the vision model from [CLIP](https://huggingface.co/models?filter=clip) -as the image encoder and [`roberta-base`](https://huggingface.co/roberta-base) as the text encoder. +as the image encoder and [`FacebookAI/roberta-base`](https://huggingface.co/FacebookAI/roberta-base) as the text encoder. Note that one can also use the [ViT](https://huggingface.co/models?filter=vit) model as image encoder and any other BERT or ROBERTa model as text encoder. To train the model on languages other than English one should choose a text encoder trained on the desired language and a image-text dataset in that language. One such dataset is [WIT](https://github.com/google-research-datasets/wit). @@ -43,17 +43,17 @@ Here we call the model `"clip-roberta-base"`, but you can change the model name You can do this either directly on [huggingface.co](https://huggingface.co/new) (assuming that you are logged in) or via the command line: -``` +```bash huggingface-cli repo create clip-roberta-base ``` Next we clone the model repository to add the tokenizer and model files. -``` +```bash git clone https://huggingface.co//clip-roberta-base ``` To ensure that all tensorboard traces will be uploaded correctly, we need to track them. You can run the following command inside your model repo to do so. -``` +```bash cd clip-roberta-base git lfs track "*tfevents*" ``` @@ -76,7 +76,7 @@ Here is an example of how to load the model using pre-trained text and vision mo ```python from modeling_hybrid_clip import FlaxHybridCLIP -model = FlaxHybridCLIP.from_text_vision_pretrained("bert-base-uncased", "openai/clip-vit-base-patch32") +model = FlaxHybridCLIP.from_text_vision_pretrained("google-bert/bert-base-uncased", "openai/clip-vit-base-patch32") # save the model model.save_pretrained("bert-clip") @@ -89,7 +89,7 @@ If the checkpoints are in PyTorch then one could pass `text_from_pt=True` and `v PyTorch checkpoints convert them to flax and load the model. 
```python -model = FlaxHybridCLIP.from_text_vision_pretrained("bert-base-uncased", "openai/clip-vit-base-patch32", text_from_pt=True, vision_from_pt=True) +model = FlaxHybridCLIP.from_text_vision_pretrained("google-bert/bert-base-uncased", "openai/clip-vit-base-patch32", text_from_pt=True, vision_from_pt=True) ``` This loads both the text and vision encoders using pre-trained weights, the projection layers are randomly @@ -154,9 +154,9 @@ Next we can run the example script to train the model: ```bash python run_hybrid_clip.py \ --output_dir ${MODEL_DIR} \ - --text_model_name_or_path="roberta-base" \ + --text_model_name_or_path="FacebookAI/roberta-base" \ --vision_model_name_or_path="openai/clip-vit-base-patch32" \ - --tokenizer_name="roberta-base" \ + --tokenizer_name="FacebookAI/roberta-base" \ --train_file="coco_dataset/train_dataset.json" \ --validation_file="coco_dataset/validation_dataset.json" \ --do_train --do_eval \ diff --git a/examples/research_projects/jax-projects/hybrid_clip/modeling_hybrid_clip.py b/examples/research_projects/jax-projects/hybrid_clip/modeling_hybrid_clip.py index e60f07bdd06325..08cb3bd0b3412e 100644 --- a/examples/research_projects/jax-projects/hybrid_clip/modeling_hybrid_clip.py +++ b/examples/research_projects/jax-projects/hybrid_clip/modeling_hybrid_clip.py @@ -314,8 +314,6 @@ def from_text_vision_pretrained( Information necessary to initiate the text model. Can be either: - A string, the `model id` of a pretrained model hosted inside a model repo on huggingface.co. - Valid model ids can be located at the root-level, like ``bert-base-uncased``, or namespaced under - a user or organization name, like ``dbmdz/bert-base-german-cased``. - A path to a `directory` containing model weights saved using :func:`~transformers.FlaxPreTrainedModel.save_pretrained`, e.g., ``./my_model_directory/``. - A path or url to a `PyTorch checkpoint folder` (e.g, ``./pt_model``). In @@ -327,8 +325,6 @@ def from_text_vision_pretrained( Information necessary to initiate the vision model. Can be either: - A string, the `model id` of a pretrained model hosted inside a model repo on huggingface.co. - Valid model ids can be located at the root-level, like ``bert-base-uncased``, or namespaced under - a user or organization name, like ``dbmdz/bert-base-german-cased``. - A path to a `directory` containing model weights saved using :func:`~transformers.FlaxPreTrainedModel.save_pretrained`, e.g., ``./my_model_directory/``. - A path or url to a `PyTorch checkpoint folder` (e.g, ``./pt_model``). In @@ -354,7 +350,7 @@ def from_text_vision_pretrained( >>> from transformers import FlaxHybridCLIP >>> # initialize a model from pretrained BERT and CLIP models. Note that the projection layers will be randomly initialized. 
>>> # If using CLIP's vision model the vision projection layer will be initialized using pre-trained weights - >>> model = FlaxHybridCLIP.from_text_vision_pretrained('bert-base-uncased', 'openai/clip-vit-base-patch32') + >>> model = FlaxHybridCLIP.from_text_vision_pretrained('google-bert/bert-base-uncased', 'openai/clip-vit-base-patch32') >>> # saving model after fine-tuning >>> model.save_pretrained("./bert-clip") >>> # load fine-tuned model diff --git a/examples/research_projects/jax-projects/hybrid_clip/run_hybrid_clip.py b/examples/research_projects/jax-projects/hybrid_clip/run_hybrid_clip.py index f54641408f80a2..f954f70ee48b60 100644 --- a/examples/research_projects/jax-projects/hybrid_clip/run_hybrid_clip.py +++ b/examples/research_projects/jax-projects/hybrid_clip/run_hybrid_clip.py @@ -78,7 +78,7 @@ class ModelArguments: text_model_name_or_path: str = field( metadata={ "help": ( - "The text model checkpoint for weights initialization." + "The text model checkpoint for weights initialization. " "Don't set if you want to train a model from scratch." ) }, @@ -86,7 +86,7 @@ class ModelArguments: vision_model_name_or_path: str = field( metadata={ "help": ( - "The vision model checkpoint for weights initialization." + "The vision model checkpoint for weights initialization. " "Don't set if you want to train a model from scratch." ) }, @@ -283,7 +283,7 @@ def write_metric(summary_writer, train_metrics, eval_metrics, train_time, step): def create_learning_rate_fn( train_ds_size: int, train_batch_size: int, num_train_epochs: int, num_warmup_steps: int, learning_rate: float -) -> Callable[[int], jnp.array]: +) -> Callable[[int], jnp.ndarray]: """Returns a linear warmup, linear_decay learning rate function.""" steps_per_epoch = train_ds_size // train_batch_size num_train_steps = steps_per_epoch * num_train_epochs @@ -311,7 +311,7 @@ def main(): and not training_args.overwrite_output_dir ): raise ValueError( - f"Output directory ({training_args.output_dir}) already exists and is not empty." + f"Output directory ({training_args.output_dir}) already exists and is not empty. " "Use --overwrite_output_dir to overcome." ) @@ -341,7 +341,7 @@ def main(): ) else: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name." ) diff --git a/examples/research_projects/jax-projects/model_parallel/README.md b/examples/research_projects/jax-projects/model_parallel/README.md index b63b93862db06f..393c9e89375085 100644 --- a/examples/research_projects/jax-projects/model_parallel/README.md +++ b/examples/research_projects/jax-projects/model_parallel/README.md @@ -27,7 +27,7 @@ To adapt the script for other models, we need to also change the `ParitionSpec` TODO: Add more explantion. -Before training, let's prepare our model first. To be able to shard the model, the sharded dimention needs to be a multiple of devices it'll be sharded on. But GPTNeo's vocab size is 50257, so we need to resize the embeddings accordingly. +Before training, let's prepare our model first. To be able to shard the model, the sharded dimension needs to be a multiple of devices it'll be sharded on. But GPTNeo's vocab size is 50257, so we need to resize the embeddings accordingly. 
```python from transformers import FlaxGPTNeoForCausalLM, GPTNeoConfig @@ -54,7 +54,7 @@ model.save_pretrained("gpt-neo-1.3B") ```bash python run_clm_mp.py \ --model_name_or_path gpt-neo-1.3B \ - --tokenizer_name gpt2 \ + --tokenizer_name openai-community/gpt2 \ --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \ --do_train --do_eval \ --block_size 1024 \ diff --git a/examples/research_projects/jax-projects/model_parallel/partitions.py b/examples/research_projects/jax-projects/model_parallel/partitions.py index e32ec97e42b491..86e54ad6702779 100644 --- a/examples/research_projects/jax-projects/model_parallel/partitions.py +++ b/examples/research_projects/jax-projects/model_parallel/partitions.py @@ -34,7 +34,7 @@ def _match(qs, ks): """Return True if regexes in qs match any window of strings in tuple ks.""" # compile regexes and force complete match - qts = tuple(map(lambda x: re.compile(x + "$"), qs)) + qts = tuple((re.compile(x + "$") for x in qs)) for i in range(len(ks) - len(qs) + 1): matches = [x.match(y) for x, y in zip(qts, ks[i:])] if matches and all(matches): diff --git a/examples/research_projects/jax-projects/model_parallel/run_clm_mp.py b/examples/research_projects/jax-projects/model_parallel/run_clm_mp.py index 7103b5a28111ff..a72e5cff861c8b 100644 --- a/examples/research_projects/jax-projects/model_parallel/run_clm_mp.py +++ b/examples/research_projects/jax-projects/model_parallel/run_clm_mp.py @@ -70,7 +70,7 @@ class ModelArguments: default=None, metadata={ "help": ( - "The model checkpoint for weights initialization.Don't set if you want to train a model from scratch." + "The model checkpoint for weights initialization. Don't set if you want to train a model from scratch." ) }, ) @@ -214,7 +214,7 @@ def write_eval_metric(summary_writer, eval_metrics, step): def create_learning_rate_fn( train_ds_size: int, train_batch_size: int, num_train_epochs: int, num_warmup_steps: int, learning_rate: float -) -> Callable[[int], jnp.array]: +) -> Callable[[int], jnp.ndarray]: """Returns a linear warmup, linear_decay learning rate function.""" steps_per_epoch = train_ds_size // train_batch_size num_train_steps = steps_per_epoch * num_train_epochs @@ -246,7 +246,7 @@ def main(): and not training_args.overwrite_output_dir ): raise ValueError( - f"Output directory ({training_args.output_dir}) already exists and is not empty." + f"Output directory ({training_args.output_dir}) already exists and is not empty. " "Use --overwrite_output_dir to overcome." ) @@ -297,14 +297,15 @@ def main(): data_files = {} if data_args.train_file is not None: data_files["train"] = data_args.train_file + extension = data_args.train_file.split(".")[-1] if data_args.validation_file is not None: data_files["validation"] = data_args.validation_file - extension = data_args.train_file.split(".")[-1] + extension = data_args.validation_file.split(".")[-1] if extension == "txt": extension = "text" dataset = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # Load pretrained config and tokenizer if model_args.config_name: @@ -325,7 +326,7 @@ def main(): ) else: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. 
This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name." ) @@ -362,13 +363,13 @@ def tokenize_function(examples): if block_size > config.max_position_embeddings: logger.warning( f"The tokenizer picked seems to have a very large `model_max_length` ({tokenizer.model_max_length}). " - "Picking 1024 instead. You can change that default value by passing --block_size xxx." + f"Using block_size={min(1024, config.max_position_embeddings)} instead. You can change that default value by passing --block_size xxx." ) - block_size = 1024 + block_size = min(1024, config.max_position_embeddings) else: if data_args.block_size > tokenizer.model_max_length: logger.warning( - f"The block_size passed ({data_args.block_size}) is larger than the maximum length for the model" + f"The block_size passed ({data_args.block_size}) is larger than the maximum length for the model " f"({tokenizer.model_max_length}). Using block_size={tokenizer.model_max_length}." ) block_size = min(data_args.block_size, tokenizer.model_max_length) @@ -395,7 +396,7 @@ def group_texts(examples): # to preprocess. # # To speed up this part, we use multiprocessing. See the documentation of the map method for more information: - # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map + # https://huggingface.co/docs/datasets/process#map lm_datasets = tokenized_datasets.map( group_texts, diff --git a/examples/research_projects/jax-projects/wav2vec2/README.md b/examples/research_projects/jax-projects/wav2vec2/README.md index 3b1b74743085a2..5f8e14f47c590c 100644 --- a/examples/research_projects/jax-projects/wav2vec2/README.md +++ b/examples/research_projects/jax-projects/wav2vec2/README.md @@ -10,7 +10,7 @@ way which enables simple and efficient model parallelism. `run_wav2vec2_pretrain_flax.py` is a lightweight example of how to download and preprocess a dataset from the 🤗 Datasets library or use your own files (jsonlines or csv), then pretrain the wav2vec2 architectures above on it. -For custom datasets in `jsonlines` format please see: [the Datasets documentation](https://huggingface.co/docs/datasets/loading_datasets.html#json-files) and you also will find examples of these below. +For custom datasets in `jsonlines` format please see: [the Datasets documentation](https://huggingface.co/docs/datasets/loading_datasets#json-files) and you also will find examples of these below. Let's start by creating a model repository to save the trained model and logs. Here we call the model `"wav2vec2-base-robust"`, but you can change the model name as you like. @@ -18,20 +18,20 @@ Here we call the model `"wav2vec2-base-robust"`, but you can change the model na You can do this either directly on [huggingface.co](https://huggingface.co/new) (assuming that you are logged in) or via the command line: -``` +```bash huggingface-cli repo create wav2vec2-base-robust ``` Next we clone the model repository to add the tokenizer and model files. -``` +```bash git clone https://huggingface.co//wav2vec2-base-robust ``` To ensure that all tensorboard traces will be uploaded correctly, we need to track them. You can run the following command inside your model repo to do so. 
-``` +```bash cd wav2vec2-base-robust git lfs track "*tfevents*" ``` diff --git a/examples/research_projects/layoutlmv3/run_funsd_cord.py b/examples/research_projects/layoutlmv3/run_funsd_cord.py index 04e4498a1a1500..ad83fbdef9dec4 100644 --- a/examples/research_projects/layoutlmv3/run_funsd_cord.py +++ b/examples/research_projects/layoutlmv3/run_funsd_cord.py @@ -250,7 +250,7 @@ def main(): "nielsr/funsd-layoutlmv3", data_args.dataset_config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=True if model_args.use_auth_token else None, ) elif data_args.dataset_name == "cord": # Downloading and loading a dataset from the hub. @@ -258,7 +258,7 @@ def main(): "nielsr/cord-layoutlmv3", data_args.dataset_config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=True if model_args.use_auth_token else None, ) else: raise ValueError("This script only supports either FUNSD or CORD out-of-the-box.") @@ -294,11 +294,11 @@ def get_label_list(labels): if isinstance(features[label_column_name].feature, ClassLabel): label_list = features[label_column_name].feature.names # No need to convert the labels since they are already ints. - id2label = {k: v for k, v in enumerate(label_list)} + id2label = dict(enumerate(label_list)) label2id = {v: k for k, v in enumerate(label_list)} else: label_list = get_label_list(datasets["train"][label_column_name]) - id2label = {k: v for k, v in enumerate(label_list)} + id2label = dict(enumerate(label_list)) label2id = {v: k for k, v in enumerate(label_list)} num_labels = len(label_list) @@ -313,7 +313,7 @@ def get_label_list(labels): finetuning_task=data_args.task_name, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=True if model_args.use_auth_token else None, ) processor = AutoProcessor.from_pretrained( @@ -321,7 +321,7 @@ def get_label_list(labels): cache_dir=model_args.cache_dir, use_fast=True, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=True if model_args.use_auth_token else None, add_prefix_space=True, apply_ocr=False, ) @@ -332,7 +332,7 @@ def get_label_list(labels): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=True if model_args.use_auth_token else None, ) # Set the correspondences label/ID inside the model config diff --git a/examples/research_projects/longform-qa/eli5_app.py b/examples/research_projects/longform-qa/eli5_app.py index 1bcb6fd20d25fc..6b1b15cc9cbba3 100644 --- a/examples/research_projects/longform-qa/eli5_app.py +++ b/examples/research_projects/longform-qa/eli5_app.py @@ -36,7 +36,7 @@ def load_models(): _ = s2s_model.eval() else: s2s_tokenizer, s2s_model = make_qa_s2s_model( - model_name="t5-small", from_file="seq2seq_models/eli5_t5_model_1024_4.pth", device="cuda:0" + model_name="google-t5/t5-small", from_file="seq2seq_models/eli5_t5_model_1024_4.pth", device="cuda:0" ) return (qar_tokenizer, qar_model, s2s_tokenizer, s2s_model) @@ -158,9 +158,7 @@ def answer_question( -""" % ( - header_html, -) +""" % (header_html,) st.sidebar.markdown( header_full, unsafe_allow_html=True, diff --git a/examples/research_projects/longform-qa/eli5_utils.py b/examples/research_projects/longform-qa/eli5_utils.py index db4eae66041be4..d4b235fdbaab26 100644 --- 
a/examples/research_projects/longform-qa/eli5_utils.py +++ b/examples/research_projects/longform-qa/eli5_utils.py @@ -78,7 +78,7 @@ def query_es_index(question, es_client, index_name="english_wiki_kilt_snippets_1 ) hits = response["hits"]["hits"] support_doc = "<P> " + " <P> ".join([hit["_source"]["passage_text"] for hit in hits]) - res_list = [dict([(k, hit["_source"][k]) for k in hit["_source"] if k != "passage_text"]) for hit in hits] + res_list = [{k: hit["_source"][k] for k in hit["_source"] if k != "passage_text"} for hit in hits] for r, hit in zip(res_list, hits): r["passage_id"] = hit["_id"] r["score"] = hit["_score"] @@ -601,7 +601,7 @@ def make_qa_dense_index( fp = np.memmap(index_name, dtype=dtype, mode="w+", shape=(passages_dset.num_rows, 128)) n_batches = math.ceil(passages_dset.num_rows / batch_size) for i in range(n_batches): - passages = [p for p in passages_dset[i * batch_size : (i + 1) * batch_size]["passage_text"]] + passages = list(passages_dset[i * batch_size : (i + 1) * batch_size]["passage_text"]) reps = embed_passages_for_retrieval(passages, tokenizer, qa_embedder, max_length, device) fp[i * batch_size : (i + 1) * batch_size] = reps if i % 50 == 0: @@ -634,7 +634,7 @@ def query_qa_dense_index( D, I = wiki_index.search(q_rep, 2 * n_results) res_passages = [wiki_passages[int(i)] for i in I[0]] support_doc = "<P> " + " <P> ".join([p["passage_text"] for p in res_passages]) - res_list = [dict([(k, p[k]) for k in wiki_passages.column_names]) for p in res_passages] + res_list = [{k: p[k] for k in wiki_passages.column_names} for p in res_passages] res_list = [res for res in res_list if len(res["passage_text"].split()) > min_length][:n_results] for r, sc in zip(res_list, D[0]): r["score"] = float(sc) @@ -650,7 +650,7 @@ def batch_query_qa_dense_index(questions, qa_embedder, tokenizer, wiki_passages, ] all_res_lists = [] for res_passages, dl in zip(res_passages_lst, D): - res_list = [dict([(k, p[k]) for k in wiki_passages.column_names]) for p in res_passages] + res_list = [{k: p[k] for k in wiki_passages.column_names} for p in res_passages] for r, sc in zip(res_list, dl): r["score"] = float(sc) all_res_lists += [res_list[:]] @@ -663,7 +663,7 @@ def query_qa_dense_index_nn(passage, qa_embedder, tokenizer, wiki_passages, wiki D, I = wiki_index.search(a_rep, 2 * n_results) res_passages = [wiki_passages[int(i)] for i in I[0]] support_doc = "<P> " + " <P> ".join([p["passage_text"] for p in res_passages]) - res_list = [dict([(k, p[k]) for k in wiki_passages.column_names]) for p in res_passages] + res_list = [{k: p[k] for k in wiki_passages.column_names} for p in res_passages] res_list = [res for res in res_list if len(res["passage_text"].split()) > min_length][:n_results] for r, sc, i in zip(res_list, D[0], I[0]): r["passage_id"] = int(i) @@ -680,7 +680,7 @@ def batch_query_qa_dense_index_nn(passages, qa_embedder, tokenizer, wiki_passage ] all_res_lists = [] for res_passages, dl, il in zip(res_passages_lst, D, I): - res_list = [dict([(k, p[k]) for k in wiki_passages.column_names]) for p in res_passages] + res_list = [{k: p[k] for k in wiki_passages.column_names} for p in res_passages] for r, sc, i in zip(res_list, dl, il): r["passage_id"] = int(i) r["score"] = float(sc) diff --git a/examples/research_projects/luke/run_luke_ner_no_trainer.py b/examples/research_projects/luke/run_luke_ner_no_trainer.py index 4c5227d2c7e011..cac487b059d71f 100644 --- a/examples/research_projects/luke/run_luke_ner_no_trainer.py +++ b/examples/research_projects/luke/run_luke_ner_no_trainer.py @@ -29,7 +29,7 @@ import torch from accelerate import Accelerator, DistributedDataParallelKwargs from datasets import ClassLabel, load_dataset, load_metric -from huggingface_hub import Repository +from huggingface_hub import Repository, create_repo from luke_utils import DataCollatorForLukeTokenClassification, is_punctuation, padding_tensor from torch.utils.data import DataLoader from tqdm.auto import tqdm @@ -45,7 +45,6 @@ get_scheduler, set_seed, ) -from transformers.file_utils import get_full_repo_name from transformers.utils.versions import require_version @@ -258,11 +257,14 @@ def main(): # Handle the repository creation if accelerator.is_main_process: if args.push_to_hub: - if args.hub_model_id is None: - repo_name = get_full_repo_name(Path(args.output_dir).name, token=args.hub_token) - else: - repo_name = args.hub_model_id - repo = Repository(args.output_dir, clone_from=repo_name) + # Retrieve of infer repo_name repo_name = args.hub_model_id + if repo_name is None: + repo_name = Path(args.output_dir).absolute().name + # Create repo and retrieve repo_id + repo_id = create_repo(repo_name, exist_ok=True, token=args.hub_token).repo_id + # Clone repo locally + repo = Repository(args.output_dir, clone_from=repo_id, token=args.hub_token) elif args.output_dir is not None: os.makedirs(args.output_dir, exist_ok=True) accelerator.wait_for_everyone() @@ -283,16 +285,17 @@ def main(): data_files = {} if args.train_file is not None: data_files["train"] = args.train_file + extension = args.train_file.split(".")[-1] if args.validation_file is not None: data_files["validation"] = args.validation_file - extension = args.train_file.split(".")[-1] + extension = args.validation_file.split(".")[-1] raw_datasets = load_dataset(extension, data_files=data_files) # Trim a number of training examples if args.debug: for split in raw_datasets.keys(): raw_datasets[split] = raw_datasets[split].select(range(100)) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets.
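The repository-creation pattern the LUKE script switches to above can be exercised on its own. Below is a minimal sketch built directly on `huggingface_hub`; the output directory is hypothetical and a valid Hub token (or a cached login) is assumed.

```python
from pathlib import Path

from huggingface_hub import Repository, create_repo

output_dir = "./english-roberta-base-dummy"  # hypothetical local output directory

# Infer the repo name from the output directory, as the patched script does.
repo_name = Path(output_dir).absolute().name

# Create the repo on the Hub (no-op if it already exists) and get its full id.
repo_id = create_repo(repo_name, exist_ok=True, token=None).repo_id

# Clone it locally so checkpoints can be pushed during training.
repo = Repository(output_dir, clone_from=repo_id, token=None)
```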
if raw_datasets["train"] is not None: column_names = raw_datasets["train"].column_names @@ -355,7 +358,7 @@ def get_label_list(labels): tokenizer_name_or_path = args.tokenizer_name if args.tokenizer_name else args.model_name_or_path if not tokenizer_name_or_path: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name." ) @@ -610,7 +613,7 @@ def get_luke_labels(outputs, ner_tags, original_entity_spans): predicted_sequence = [label_list[0]] * len(true_tags) for _, span, label in sorted(predictions, key=lambda o: o[0], reverse=True): - if all([o == label_list[0] for o in predicted_sequence[span[0] : span[1]]]): + if all(o == label_list[0] for o in predicted_sequence[span[0] : span[1]]): predicted_sequence[span[0]] = label if span[1] - span[0] > 1: predicted_sequence[span[0] + 1 : span[1]] = [label] * (span[1] - span[0] - 1) diff --git a/examples/research_projects/lxmert/extracting_data.py b/examples/research_projects/lxmert/extracting_data.py index 9c445be336f553..6b1342c9b11f93 100644 --- a/examples/research_projects/lxmert/extracting_data.py +++ b/examples/research_projects/lxmert/extracting_data.py @@ -61,7 +61,7 @@ def __init__(self, argv=sys.argv[1:]): assert outputfile is not None and not os.path.isfile(outputfile), f"{outputfile}" if subset_list is not None: with open(os.path.realpath(subset_list)) as f: - self.subset_list = set(map(lambda x: self._vqa_file_split()[0], tryload(f))) + self.subset_list = {self._vqa_file_split()[0] for x in tryload(f)} else: self.subset_list = None diff --git a/examples/research_projects/lxmert/modeling_frcnn.py b/examples/research_projects/lxmert/modeling_frcnn.py index 08758b1d3cac06..499de532070c91 100644 --- a/examples/research_projects/lxmert/modeling_frcnn.py +++ b/examples/research_projects/lxmert/modeling_frcnn.py @@ -554,8 +554,8 @@ def __init__( assert thresholds[0] > 0 thresholds.insert(0, -float("inf")) thresholds.append(float("inf")) - assert all([low <= high for (low, high) in zip(thresholds[:-1], thresholds[1:])]) - assert all([label_i in [-1, 0, 1] for label_i in labels]) + assert all(low <= high for (low, high) in zip(thresholds[:-1], thresholds[1:])) + assert all(label_i in [-1, 0, 1] for label_i in labels) assert len(labels) == len(thresholds) - 1 self.thresholds = thresholds self.labels = labels @@ -1095,7 +1095,7 @@ def forward(self, feature_maps, boxes): Returns: A tensor of shape(N*B, Channels, output_size, output_size) """ - x = [v for v in feature_maps.values()] + x = list(feature_maps.values()) num_level_assignments = len(self.level_poolers) assert len(x) == num_level_assignments and len(boxes) == x[0].size(0) @@ -1706,9 +1706,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs): elif os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path): archive_file = pretrained_model_name_or_path elif os.path.isfile(pretrained_model_name_or_path + ".index"): - assert ( - from_tf - ), "We found a TensorFlow checkpoint at {}, please set from_tf to True to load from this checkpoint".format( + assert from_tf, "We found a TensorFlow checkpoint at {}, please set from_tf to True to load from this checkpoint".format( pretrained_model_name_or_path + ".index" ) archive_file = pretrained_model_name_or_path + ".index" diff --git 
a/examples/research_projects/lxmert/requirements.txt b/examples/research_projects/lxmert/requirements.txt index 0d483b6d18923d..55e1bdb845c367 100644 --- a/examples/research_projects/lxmert/requirements.txt +++ b/examples/research_projects/lxmert/requirements.txt @@ -4,7 +4,7 @@ async-generator==1.10 attrs==20.2.0 backcall==0.2.0 CacheControl==0.12.6 -certifi==2022.12.7 +certifi==2023.7.22 cffi==1.14.2 chardet==3.0.4 click==7.1.2 @@ -75,7 +75,7 @@ pyzmq==19.0.2 qtconsole==4.7.7 QtPy==1.9.0 regex==2020.7.14 -requests==2.22.0 +requests==2.31.0 retrying==1.3.3 sacremoses==0.0.43 Send2Trash==1.5.0 @@ -86,11 +86,11 @@ testpath==0.4.4 tokenizers==0.8.1rc2 torch==1.6.0 torchvision==0.7.0 -tornado==6.0.4 +tornado==6.3.3 tqdm==4.48.2 traitlets git+https://github.com/huggingface/transformers.git -urllib3==1.26.5 +urllib3==1.26.18 wcwidth==0.2.5 webencodings==0.5.1 wget==3.2 diff --git a/examples/research_projects/lxmert/utils.py b/examples/research_projects/lxmert/utils.py index 2fc6ea2062efd2..c75f523a08eae7 100644 --- a/examples/research_projects/lxmert/utils.py +++ b/examples/research_projects/lxmert/utils.py @@ -28,7 +28,6 @@ from collections import OrderedDict from contextlib import contextmanager from functools import partial -from hashlib import sha256 from io import BytesIO from pathlib import Path from urllib.parse import urlparse @@ -39,6 +38,7 @@ import requests import wget from filelock import FileLock +from huggingface_hub.utils import insecure_hashlib from PIL import Image from tqdm.auto import tqdm from yaml import Loader, dump, load @@ -402,12 +402,12 @@ def _resumable_file_manager(): def url_to_filename(url, etag=None): url_bytes = url.encode("utf-8") - url_hash = sha256(url_bytes) + url_hash = insecure_hashlib.sha256(url_bytes) filename = url_hash.hexdigest() if etag: etag_bytes = etag.encode("utf-8") - etag_hash = sha256(etag_bytes) + etag_hash = insecure_hashlib.sha256(etag_bytes) filename += "." + etag_hash.hexdigest() if url.endswith(".h5"): diff --git a/examples/research_projects/mlm_wwm/README.md b/examples/research_projects/mlm_wwm/README.md index 9426be7c27be1f..bf5aa9410826ed 100644 --- a/examples/research_projects/mlm_wwm/README.md +++ b/examples/research_projects/mlm_wwm/README.md @@ -32,7 +32,7 @@ to that word). This technique has been refined for Chinese in [this paper](https To fine-tune a model using whole word masking, use the following script: ```bash python run_mlm_wwm.py \ - --model_name_or_path roberta-base \ + --model_name_or_path FacebookAI/roberta-base \ --dataset_name wikitext \ --dataset_config_name wikitext-2-raw-v1 \ --do_train \ @@ -83,7 +83,7 @@ export VALIDATION_REF_FILE=/path/to/validation/chinese_ref/file export OUTPUT_DIR=/tmp/test-mlm-wwm python run_mlm_wwm.py \ - --model_name_or_path roberta-base \ + --model_name_or_path FacebookAI/roberta-base \ --train_file $TRAIN_FILE \ --validation_file $VALIDATION_FILE \ --train_ref_file $TRAIN_REF_FILE \ @@ -95,4 +95,4 @@ python run_mlm_wwm.py \ **Note1:** On TPU, you should the flag `--pad_to_max_length` to make sure all your batches have the same length. -**Note2:** And if you have any questions or something goes wrong when runing this code, don't hesitate to pin @wlhgtc. +**Note2:** And if you have any questions or something goes wrong when running this code, don't hesitate to pin @wlhgtc. 
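As a companion to the whole-word-masking README above, the following minimal sketch shows what whole-word masking does at the batch level using `DataCollatorForWholeWordMask`; the sentence and masking probability are illustrative, and the masked positions are random by design.

```python
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

# WordPiece splits "unbelievably" into several sub-tokens; whole-word masking
# masks all pieces of a word together instead of masking sub-tokens independently.
encoding = tokenizer("unbelievably long words are masked as whole words")
batch = collator([{"input_ids": encoding["input_ids"]}])

print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0].tolist()))
print(batch["labels"][0])  # original ids at masked positions, -100 elsewhere
```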
diff --git a/examples/research_projects/mlm_wwm/run_mlm_wwm.py b/examples/research_projects/mlm_wwm/run_mlm_wwm.py index f14ad5adfeff16..629026bdb20a63 100644 --- a/examples/research_projects/mlm_wwm/run_mlm_wwm.py +++ b/examples/research_projects/mlm_wwm/run_mlm_wwm.py @@ -62,7 +62,7 @@ class ModelArguments: default=None, metadata={ "help": ( - "The model checkpoint for weights initialization.Don't set if you want to train a model from scratch." + "The model checkpoint for weights initialization. Don't set if you want to train a model from scratch." ) }, ) @@ -271,14 +271,15 @@ def main(): data_files = {} if data_args.train_file is not None: data_files["train"] = data_args.train_file + extension = data_args.train_file.split(".")[-1] if data_args.validation_file is not None: data_files["validation"] = data_args.validation_file - extension = data_args.train_file.split(".")[-1] + extension = data_args.validation_file.split(".")[-1] if extension == "txt": extension = "text" datasets = load_dataset(extension, data_files=data_files) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # Load pretrained model and tokenizer # @@ -314,7 +315,7 @@ def main(): tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs) else: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name." ) @@ -325,7 +326,7 @@ def main(): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=True if model_args.use_auth_token else None, ) else: logger.info("Training new model from scratch") diff --git a/examples/research_projects/mm-imdb/README.md b/examples/research_projects/mm-imdb/README.md index 7cfc2a7487ba71..68b2f15159ec23 100644 --- a/examples/research_projects/mm-imdb/README.md +++ b/examples/research_projects/mm-imdb/README.md @@ -6,11 +6,11 @@ Based on the script [`run_mmimdb.py`](https://github.com/huggingface/transformer ### Training on MM-IMDb -``` +```bash python run_mmimdb.py \ --data_dir /path/to/mmimdb/dataset/ \ --model_type bert \ - --model_name_or_path bert-base-uncased \ + --model_name_or_path google-bert/bert-base-uncased \ --output_dir /path/to/save/dir/ \ --do_train \ --do_eval \ diff --git a/examples/research_projects/mm-imdb/run_mmimdb.py b/examples/research_projects/mm-imdb/run_mmimdb.py index 23b2a65e5c96ac..c863857c41cbd4 100644 --- a/examples/research_projects/mm-imdb/run_mmimdb.py +++ b/examples/research_projects/mm-imdb/run_mmimdb.py @@ -134,7 +134,7 @@ def train(args, train_dataset, model, tokenizer, criterion): best_f1, n_no_improve = 0, 0 model.zero_grad() train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) - set_seed(args) # Added here for reproductibility + set_seed(args) # Added here for reproducibility for _ in train_iterator: epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) for step, batch in enumerate(epoch_iterator): @@ -384,7 +384,7 @@ def main(): help="Number of updates steps to accumulate before performing a backward/update 
pass.", ) parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") - parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight deay if we apply some.") + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") parser.add_argument( @@ -426,7 +426,7 @@ def main(): type=str, default="O1", help=( - "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. " "See details at https://nvidia.github.io/apex/amp.html" ), ) @@ -460,7 +460,7 @@ def main(): if args.local_rank == -1 or args.no_cuda: device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") args.n_gpu = 0 if args.no_cuda else torch.cuda.device_count() - else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs + else: # Initializes the distributed backend which will take care of synchronizing nodes/GPUs torch.cuda.set_device(args.local_rank) device = torch.device("cuda", args.local_rank) torch.distributed.init_process_group(backend="nccl") @@ -554,9 +554,9 @@ def main(): if args.do_eval and args.local_rank in [-1, 0]: checkpoints = [args.output_dir] if args.eval_all_checkpoints: - checkpoints = list( + checkpoints = [ os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True)) - ) + ] logger.info("Evaluate the following checkpoints: %s", checkpoints) for checkpoint in checkpoints: @@ -566,7 +566,7 @@ def main(): model.load_state_dict(torch.load(checkpoint)) model.to(args.device) result = evaluate(args, model, tokenizer, criterion, prefix=prefix) - result = dict((k + "_{}".format(global_step), v) for k, v in result.items()) + result = {k + "_{}".format(global_step): v for k, v in result.items()} results.update(result) return results diff --git a/examples/research_projects/movement-pruning/README.md b/examples/research_projects/movement-pruning/README.md index 76c660187472a3..575ec1a9b49287 100644 --- a/examples/research_projects/movement-pruning/README.md +++ b/examples/research_projects/movement-pruning/README.md @@ -61,7 +61,7 @@ python examples/movement-pruning/masked_run_squad.py \ --predict_file dev-v1.1.json \ --do_train --do_eval --do_lower_case \ --model_type masked_bert \ - --model_name_or_path bert-base-uncased \ + --model_name_or_path google-bert/bert-base-uncased \ --per_gpu_train_batch_size 16 \ --warmup_steps 5400 \ --num_train_epochs 10 \ @@ -84,7 +84,7 @@ python examples/movement-pruning/masked_run_squad.py \ --predict_file dev-v1.1.json \ --do_train --do_eval --do_lower_case \ --model_type masked_bert \ - --model_name_or_path bert-base-uncased \ + --model_name_or_path google-bert/bert-base-uncased \ --per_gpu_train_batch_size 16 \ --warmup_steps 5400 \ --num_train_epochs 10 \ @@ -104,7 +104,7 @@ python examples/movement-pruning/masked_run_squad.py \ --predict_file dev-v1.1.json \ --do_train --do_eval --do_lower_case \ --model_type masked_bert \ - --model_name_or_path bert-base-uncased \ + --model_name_or_path google-bert/bert-base-uncased \ --per_gpu_train_batch_size 16 \ --warmup_steps 5400 \ --num_train_epochs 10 \ @@ -124,7 +124,7 @@ python examples/movement-pruning/masked_run_squad.py \ --predict_file 
dev-v1.1.json \ --do_train --do_eval --do_lower_case \ --model_type masked_bert \ - --model_name_or_path bert-base-uncased \ + --model_name_or_path google-bert/bert-base-uncased \ --per_gpu_train_batch_size 16 \ --warmup_steps 5400 \ --num_train_epochs 10 \ @@ -173,7 +173,7 @@ In particular, hardware manufacturers are announcing devices that will speedup i If you find this resource useful, please consider citing the following paper: -``` +```bibtex @article{sanh2020movement, title={Movement Pruning: Adaptive Sparsity by Fine-Tuning}, author={Victor Sanh and Thomas Wolf and Alexander M. Rush}, diff --git a/examples/research_projects/movement-pruning/bertarize.py b/examples/research_projects/movement-pruning/bertarize.py index 0c9cc63571d7c1..da7534f4a6f985 100644 --- a/examples/research_projects/movement-pruning/bertarize.py +++ b/examples/research_projects/movement-pruning/bertarize.py @@ -112,8 +112,8 @@ def main(args): type=float, required=False, help=( - "For `magnitude` and `topK`, it is the level of remaining weights (in %) in the fine-pruned model." - "For `sigmoied_threshold`, it is the threshold \tau against which the (sigmoied) scores are compared." + "For `magnitude` and `topK`, it is the level of remaining weights (in %) in the fine-pruned model. " + "For `sigmoied_threshold`, it is the threshold \tau against which the (sigmoied) scores are compared. " "Not needed for `l0`" ), ) diff --git a/examples/research_projects/movement-pruning/counts_parameters.py b/examples/research_projects/movement-pruning/counts_parameters.py index 17ddb029f89780..89ce40baa7c2a8 100644 --- a/examples/research_projects/movement-pruning/counts_parameters.py +++ b/examples/research_projects/movement-pruning/counts_parameters.py @@ -79,8 +79,8 @@ def main(args): type=float, required=False, help=( - "For `topK`, it is the level of remaining weights (in %) in the fine-pruned model." - "For `sigmoied_threshold`, it is the threshold \tau against which the (sigmoied) scores are compared." + "For `topK`, it is the level of remaining weights (in %) in the fine-pruned model. " + "For `sigmoied_threshold`, it is the threshold \tau against which the (sigmoied) scores are compared. 
" "Not needed for `l0`" ), ) diff --git a/examples/research_projects/movement-pruning/emmental/modeling_bert_masked.py b/examples/research_projects/movement-pruning/emmental/modeling_bert_masked.py index d404bf49aaa62d..f47395bb000b69 100644 --- a/examples/research_projects/movement-pruning/emmental/modeling_bert_masked.py +++ b/examples/research_projects/movement-pruning/emmental/modeling_bert_masked.py @@ -652,9 +652,7 @@ def forward( outputs = ( sequence_output, pooled_output, - ) + encoder_outputs[ - 1: - ] # add hidden_states and attentions if they are here + ) + encoder_outputs[1:] # add hidden_states and attentions if they are here return outputs # sequence_output, pooled_output, (hidden_states), (attentions) diff --git a/examples/research_projects/movement-pruning/masked_run_glue.py b/examples/research_projects/movement-pruning/masked_run_glue.py index 4ce56e524f714b..e2090c431e3d95 100644 --- a/examples/research_projects/movement-pruning/masked_run_glue.py +++ b/examples/research_projects/movement-pruning/masked_run_glue.py @@ -311,8 +311,7 @@ def train(args, train_dataset, model, tokenizer, teacher=None): tr_loss += loss.item() if (step + 1) % args.gradient_accumulation_steps == 0 or ( # last step in epoch but step is always smaller than gradient_accumulation_steps - len(epoch_iterator) <= args.gradient_accumulation_steps - and (step + 1) == len(epoch_iterator) + len(epoch_iterator) <= args.gradient_accumulation_steps and (step + 1) == len(epoch_iterator) ): if args.fp16: nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm) @@ -671,7 +670,7 @@ def main(): default=1, type=int, help=( - "Run `initial_warmup` * `warmup_steps` steps of threshold warmup during which threshold stays" + "Run `initial_warmup` * `warmup_steps` steps of threshold warmup during which threshold stays " "at its `initial_threshold` value (sparsity schedule)." ), ) @@ -680,7 +679,7 @@ def main(): default=2, type=int, help=( - "Run `final_warmup` * `warmup_steps` steps of threshold cool-down during which threshold stays" + "Run `final_warmup` * `warmup_steps` steps of threshold cool-down during which threshold stays " "at its final_threshold value (sparsity schedule)." ), ) @@ -799,7 +798,7 @@ def main(): type=str, default="O1", help=( - "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. 
" "See details at https://nvidia.github.io/apex/amp.html" ), ) @@ -941,9 +940,9 @@ def main(): tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case) checkpoints = [args.output_dir] if args.eval_all_checkpoints: - checkpoints = list( + checkpoints = [ os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True)) - ) + ] logger.info("Evaluate the following checkpoints: %s", checkpoints) for checkpoint in checkpoints: @@ -953,7 +952,7 @@ def main(): model = model_class.from_pretrained(checkpoint) model.to(args.device) result = evaluate(args, model, tokenizer, prefix=prefix) - result = dict((k + "_{}".format(global_step), v) for k, v in result.items()) + result = {k + "_{}".format(global_step): v for k, v in result.items()} results.update(result) return results diff --git a/examples/research_projects/movement-pruning/masked_run_squad.py b/examples/research_projects/movement-pruning/masked_run_squad.py index a516bb8d585ddd..14d92dde4e4825 100644 --- a/examples/research_projects/movement-pruning/masked_run_squad.py +++ b/examples/research_projects/movement-pruning/masked_run_squad.py @@ -789,7 +789,7 @@ def main(): default=1, type=int, help=( - "Run `initial_warmup` * `warmup_steps` steps of threshold warmup during which threshold stays" + "Run `initial_warmup` * `warmup_steps` steps of threshold warmup during which threshold stays " "at its `initial_threshold` value (sparsity schedule)." ), ) @@ -798,7 +798,7 @@ def main(): default=2, type=int, help=( - "Run `final_warmup` * `warmup_steps` steps of threshold cool-down during which threshold stays" + "Run `final_warmup` * `warmup_steps` steps of threshold cool-down during which threshold stays " "at its final_threshold value (sparsity schedule)." ), ) @@ -946,7 +946,7 @@ def main(): type=str, default="O1", help=( - "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. 
" "See details at https://nvidia.github.io/apex/amp.html" ), ) @@ -1109,10 +1109,10 @@ def main(): logger.info("Loading checkpoints saved during training for evaluation") checkpoints = [args.output_dir] if args.eval_all_checkpoints: - checkpoints = list( + checkpoints = [ os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True)) - ) + ] else: logger.info("Loading checkpoint %s for evaluation", args.model_name_or_path) @@ -1129,7 +1129,7 @@ def main(): # Evaluate result = evaluate(args, model, tokenizer, prefix=global_step) - result = dict((k + ("_{}".format(global_step) if global_step else ""), v) for k, v in result.items()) + result = {k + ("_{}".format(global_step) if global_step else ""): v for k, v in result.items()} results.update(result) logger.info("Results: {}".format(results)) diff --git a/examples/research_projects/onnx/summarization/bart_onnx/reduce_onnx_size.py b/examples/research_projects/onnx/summarization/bart_onnx/reduce_onnx_size.py index d327cdb2841d9c..1df20e4504da2c 100644 --- a/examples/research_projects/onnx/summarization/bart_onnx/reduce_onnx_size.py +++ b/examples/research_projects/onnx/summarization/bart_onnx/reduce_onnx_size.py @@ -42,8 +42,8 @@ def _graph_replace_input_with(graph_proto, name, new_name): def _remove_dup_initializers_from_model(model, model_without_ext, ind_to_replace): - inits_with_data = [i for i in model.graph.initializer] - inits = [i for i in model_without_ext.graph.initializer] + inits_with_data = list(model.graph.initializer) + inits = list(model_without_ext.graph.initializer) for i, ref_i in ind_to_replace: assert inits_with_data[i].name == inits[i].name assert inits_with_data[ref_i].name == inits[ref_i].name @@ -69,7 +69,7 @@ def remove_dup_initializers(onnx_file_path): model = onnx.load(os.path.join(model_file_folder, model_file_name)) - inits = [i for i in model.graph.initializer] + inits = list(model.graph.initializer) dup_set = set() dup_map = {} diff --git a/examples/research_projects/performer/README.md b/examples/research_projects/performer/README.md index 42cb6fa358f95f..fa847268b0c8b3 100644 --- a/examples/research_projects/performer/README.md +++ b/examples/research_projects/performer/README.md @@ -10,8 +10,8 @@ Paper authors: Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyo ## Examples -`sanity_script.sh` will launch performer fine-tuning from the bert-base-cased checkpoint on the Simple Wikipedia dataset (a small, easy-language English Wikipedia) from `datasets`. -`full_script.sh` will launch performer fine-tuning from the bert-large-cased checkpoint on the English Wikipedia dataset from `datasets`. +`sanity_script.sh` will launch performer fine-tuning from the google-bert/bert-base-cased checkpoint on the Simple Wikipedia dataset (a small, easy-language English Wikipedia) from `datasets`. +`full_script.sh` will launch performer fine-tuning from the google-bert/bert-large-cased checkpoint on the English Wikipedia dataset from `datasets`. Here are a few key arguments: - Remove the `--performer` argument to use a standard Bert model. 
diff --git a/examples/research_projects/performer/run_mlm_performer.py b/examples/research_projects/performer/run_mlm_performer.py index 1547ead421fd6f..4261d9c184b7a7 100644 --- a/examples/research_projects/performer/run_mlm_performer.py +++ b/examples/research_projects/performer/run_mlm_performer.py @@ -99,7 +99,7 @@ class ModelArguments: default=None, metadata={ "help": ( - "The model checkpoint for weights initialization.Don't set if you want to train a model from scratch." + "The model checkpoint for weights initialization. Don't set if you want to train a model from scratch." ) }, ) @@ -466,7 +466,7 @@ def generate_batch_splits(samples_idx: np.ndarray, batch_size: int) -> np.ndarra and not training_args.overwrite_output_dir ): raise ValueError( - f"Output directory ({training_args.output_dir}) already exists and is not empty." + f"Output directory ({training_args.output_dir}) already exists and is not empty. " "Use --overwrite_output_dir to overcome." ) @@ -517,14 +517,15 @@ def generate_batch_splits(samples_idx: np.ndarray, batch_size: int) -> np.ndarra data_files = {} if data_args.train_file is not None: data_files["train"] = data_args.train_file + extension = data_args.train_file.split(".")[-1] if data_args.validation_file is not None: data_files["validation"] = data_args.validation_file - extension = data_args.train_file.split(".")[-1] + extension = data_args.validation_file.split(".")[-1] if extension == "txt": extension = "text" datasets = load_dataset(extension, data_files=data_files) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # Load pretrained model and tokenizer @@ -558,7 +559,7 @@ def generate_batch_splits(samples_idx: np.ndarray, batch_size: int) -> np.ndarra ) else: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name." 
) diff --git a/examples/research_projects/pplm/run_pplm.py b/examples/research_projects/pplm/run_pplm.py index 54784b944c71cf..cc49b7fa83c4c3 100644 --- a/examples/research_projects/pplm/run_pplm.py +++ b/examples/research_projects/pplm/run_pplm.py @@ -61,7 +61,7 @@ "embed_size": 1024, "class_vocab": {"non_clickbait": 0, "clickbait": 1}, "default_class": 1, - "pretrained_model": "gpt2-medium", + "pretrained_model": "openai-community/gpt2-medium", }, "sentiment": { "url": "https://s3.amazonaws.com/models.huggingface.co/bert/pplm/discriminators/SST_classifier_head.pt", @@ -69,7 +69,7 @@ "embed_size": 1024, "class_vocab": {"very_positive": 2, "very_negative": 3}, "default_class": 3, - "pretrained_model": "gpt2-medium", + "pretrained_model": "openai-community/gpt2-medium", }, } @@ -127,11 +127,9 @@ def perturb_past( _, _, _, curr_length, _ = past[0].shape if curr_length > window_length and window_length > 0: - ones_key_val_shape = tuple(past[0].shape[:-2]) + tuple([window_length]) + tuple(past[0].shape[-1:]) + ones_key_val_shape = tuple(past[0].shape[:-2]) + (window_length,) + tuple(past[0].shape[-1:]) - zeros_key_val_shape = ( - tuple(past[0].shape[:-2]) + tuple([curr_length - window_length]) + tuple(past[0].shape[-1:]) - ) + zeros_key_val_shape = tuple(past[0].shape[:-2]) + (curr_length - window_length,) + tuple(past[0].shape[-1:]) ones_mask = torch.ones(ones_key_val_shape) ones_mask = decay_mask * ones_mask.permute(0, 1, 2, 4, 3) @@ -587,7 +585,7 @@ def set_generic_model_params(discrim_weights, discrim_meta): def run_pplm_example( - pretrained_model="gpt2-medium", + pretrained_model="openai-community/gpt2-medium", cond_text="", uncond=False, num_samples=1, @@ -740,7 +738,7 @@ def run_pplm_example( "--pretrained_model", "-M", type=str, - default="gpt2-medium", + default="openai-community/gpt2-medium", help="pretrained model name or path to local checkpoint", ) parser.add_argument("--cond_text", type=str, default="The lake", help="Prefix texts to condition on") diff --git a/examples/research_projects/pplm/run_pplm_discrim_train.py b/examples/research_projects/pplm/run_pplm_discrim_train.py index d53b557d1af031..43ec5823e37764 100644 --- a/examples/research_projects/pplm/run_pplm_discrim_train.py +++ b/examples/research_projects/pplm/run_pplm_discrim_train.py @@ -45,7 +45,7 @@ class Discriminator(nn.Module): """Transformer encoder followed by a Classification Head""" - def __init__(self, class_size, pretrained_model="gpt2-medium", cached_mode=False, device="cpu"): + def __init__(self, class_size, pretrained_model="openai-community/gpt2-medium", cached_mode=False, device="cpu"): super().__init__() self.tokenizer = GPT2Tokenizer.from_pretrained(pretrained_model) self.encoder = GPT2LMHeadModel.from_pretrained(pretrained_model) @@ -218,7 +218,7 @@ def get_cached_data_loader(dataset, batch_size, discriminator, shuffle=False, de def train_discriminator( dataset, dataset_fp=None, - pretrained_model="gpt2-medium", + pretrained_model="openai-community/gpt2-medium", epochs=10, batch_size=64, log_interval=10, @@ -490,8 +490,8 @@ def train_discriminator( default="SST", choices=("SST", "clickbait", "toxic", "generic"), help=( - "dataset to train the discriminator on." - "In case of generic, the dataset is expected" + "dataset to train the discriminator on. " + "In case of generic, the dataset is expected " "to be a TSBV file with structure: class \\t text" ), ) @@ -502,7 +502,10 @@ def train_discriminator( help="File path of the dataset to use. 
Needed only in case of generic datadset", ) parser.add_argument( - "--pretrained_model", type=str, default="gpt2-medium", help="Pretrained model to use as encoder" + "--pretrained_model", + type=str, + default="openai-community/gpt2-medium", + help="Pretrained model to use as encoder", ) parser.add_argument("--epochs", type=int, default=10, metavar="N", help="Number of training epochs") parser.add_argument( diff --git a/examples/research_projects/quantization-qdqbert/README.md b/examples/research_projects/quantization-qdqbert/README.md index fe69819cc5be80..2cc2d5e5f98c71 100644 --- a/examples/research_projects/quantization-qdqbert/README.md +++ b/examples/research_projects/quantization-qdqbert/README.md @@ -30,17 +30,17 @@ Required: ## Setup the environment with Dockerfile Under the directory of `transformers/`, build the docker image: -``` +```bash docker build . -f examples/research_projects/quantization-qdqbert/Dockerfile -t bert_quantization:latest ``` Run the docker: -``` +```bash docker run --gpus all --privileged --rm -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 bert_quantization:latest ``` In the container: -``` +```bash cd transformers/examples/research_projects/quantization-qdqbert/ ``` @@ -48,21 +48,21 @@ cd transformers/examples/research_projects/quantization-qdqbert/ Calibrate the pretrained model and finetune with quantization awared: -``` +```bash python3 run_quant_qa.py \ - --model_name_or_path bert-base-uncased \ + --model_name_or_path google-bert/bert-base-uncased \ --dataset_name squad \ --max_seq_length 128 \ --doc_stride 32 \ - --output_dir calib/bert-base-uncased \ + --output_dir calib/google-bert/bert-base-uncased \ --do_calib \ --calibrator percentile \ --percentile 99.99 ``` -``` +```bash python3 run_quant_qa.py \ - --model_name_or_path calib/bert-base-uncased \ + --model_name_or_path calib/google-bert/bert-base-uncased \ --dataset_name squad \ --do_train \ --do_eval \ @@ -71,8 +71,8 @@ python3 run_quant_qa.py \ --num_train_epochs 2 \ --max_seq_length 128 \ --doc_stride 32 \ - --output_dir finetuned_int8/bert-base-uncased \ - --tokenizer_name bert-base-uncased \ + --output_dir finetuned_int8/google-bert/bert-base-uncased \ + --tokenizer_name google-bert/bert-base-uncased \ --save_steps 0 ``` @@ -80,16 +80,16 @@ python3 run_quant_qa.py \ To export the QAT model finetuned above: -``` +```bash python3 run_quant_qa.py \ - --model_name_or_path finetuned_int8/bert-base-uncased \ + --model_name_or_path finetuned_int8/google-bert/bert-base-uncased \ --output_dir ./ \ --save_onnx \ --per_device_eval_batch_size 1 \ --max_seq_length 128 \ --doc_stride 32 \ --dataset_name squad \ - --tokenizer_name bert-base-uncased + --tokenizer_name google-bert/bert-base-uncased ``` Use `--recalibrate-weights` to calibrate the weight ranges according to the quantizer axis. Use `--quant-per-tensor` for per tensor quantization (default is per channel). 
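To make the per-tensor vs. per-channel note above concrete, here is a small sketch in plain PyTorch (deliberately not the `pytorch-quantization` API) of how the two INT8 scaling schemes differ for a linear weight; the tensor shape is arbitrary.

```python
import torch

weight = torch.randn(4, 8)  # hypothetical [out_features, in_features] linear weight

# Per-tensor: a single scale for the whole tensor.
scale_per_tensor = weight.abs().max() / 127.0

# Per-channel: one scale per output channel (the quantizer axis).
scale_per_channel = weight.abs().amax(dim=1, keepdim=True) / 127.0

q_per_tensor = torch.clamp((weight / scale_per_tensor).round(), -127, 127).to(torch.int8)
q_per_channel = torch.clamp((weight / scale_per_channel).round(), -127, 127).to(torch.int8)

# Per-channel scales track each row's own range, so rows with small weights
# lose less precision than they would under a single global scale.
print(scale_per_tensor.item(), scale_per_channel.squeeze().tolist())
print(q_per_tensor.dtype, q_per_channel.shape)
```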
@@ -97,19 +97,19 @@ Recalibrating will affect the accuracy of the model, but the change should be mi ### Benchmark the INT8 QAT ONNX model inference with TensorRT using dummy input -``` +```bash trtexec --onnx=model.onnx --explicitBatch --workspace=16384 --int8 --shapes=input_ids:64x128,attention_mask:64x128,token_type_ids:64x128 --verbose ``` ### Benchmark the INT8 QAT ONNX model inference with [ONNX Runtime-TRT](https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html) using dummy input -``` +```bash python3 ort-infer-benchmark.py ``` ### Evaluate the INT8 QAT ONNX model inference with TensorRT -``` +```bash python3 evaluate-hf-trt-qa.py \ --onnx_model_path=./model.onnx \ --output_dir ./ \ @@ -117,7 +117,7 @@ python3 evaluate-hf-trt-qa.py \ --max_seq_length 128 \ --doc_stride 32 \ --dataset_name squad \ - --tokenizer_name bert-base-uncased \ + --tokenizer_name google-bert/bert-base-uncased \ --int8 \ --seed 42 ``` @@ -126,16 +126,16 @@ python3 evaluate-hf-trt-qa.py \ Finetune a fp32 precision model with [transformers/examples/pytorch/question-answering/](../../pytorch/question-answering/): -``` +```bash python3 ../../pytorch/question-answering/run_qa.py \ - --model_name_or_path bert-base-uncased \ + --model_name_or_path google-bert/bert-base-uncased \ --dataset_name squad \ --per_device_train_batch_size 12 \ --learning_rate 3e-5 \ --num_train_epochs 2 \ --max_seq_length 128 \ --doc_stride 32 \ - --output_dir ./finetuned_fp32/bert-base-uncased \ + --output_dir ./finetuned_fp32/google-bert/bert-base-uncased \ --save_steps 0 \ --do_train \ --do_eval @@ -145,15 +145,15 @@ python3 ../../pytorch/question-answering/run_qa.py \ ### PTQ by calibrating and evaluating the finetuned FP32 model above: -``` +```bash python3 run_quant_qa.py \ - --model_name_or_path ./finetuned_fp32/bert-base-uncased \ + --model_name_or_path ./finetuned_fp32/google-bert/bert-base-uncased \ --dataset_name squad \ --calibrator percentile \ --percentile 99.99 \ --max_seq_length 128 \ --doc_stride 32 \ - --output_dir ./calib/bert-base-uncased \ + --output_dir ./calib/google-bert/bert-base-uncased \ --save_steps 0 \ --do_calib \ --do_eval @@ -161,21 +161,21 @@ python3 run_quant_qa.py \ ### Export the INT8 PTQ model to ONNX -``` +```bash python3 run_quant_qa.py \ - --model_name_or_path ./calib/bert-base-uncased \ + --model_name_or_path ./calib/google-bert/bert-base-uncased \ --output_dir ./ \ --save_onnx \ --per_device_eval_batch_size 1 \ --max_seq_length 128 \ --doc_stride 32 \ --dataset_name squad \ - --tokenizer_name bert-base-uncased + --tokenizer_name google-bert/bert-base-uncased ``` ### Evaluate the INT8 PTQ ONNX model inference with TensorRT -``` +```bash python3 evaluate-hf-trt-qa.py \ --onnx_model_path=./model.onnx \ --output_dir ./ \ @@ -183,7 +183,7 @@ python3 evaluate-hf-trt-qa.py \ --max_seq_length 128 \ --doc_stride 32 \ --dataset_name squad \ - --tokenizer_name bert-base-uncased \ + --tokenizer_name google-bert/bert-base-uncased \ --int8 \ --seed 42 ``` diff --git a/examples/research_projects/quantization-qdqbert/evaluate-hf-trt-qa.py b/examples/research_projects/quantization-qdqbert/evaluate-hf-trt-qa.py index 814f95d0ab8f79..677b9c7860ab77 100755 --- a/examples/research_projects/quantization-qdqbert/evaluate-hf-trt-qa.py +++ b/examples/research_projects/quantization-qdqbert/evaluate-hf-trt-qa.py @@ -153,7 +153,7 @@ tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, use_fast=True) else: raise ValueError( - "You are instantiating a new tokenizer from scratch. 
This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name." ) @@ -272,10 +272,10 @@ def model_infer(inputs, context, d_inputs, h_output0, h_output1, d_output0, d_ou else: raise ValueError("Evaluation requires a dataset name") # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at -# https://huggingface.co/docs/datasets/loading_datasets.html. +# https://huggingface.co/docs/datasets/loading_datasets. # Preprocessing the datasets. -# Preprocessing is slighlty different for training and evaluation. +# Preprocessing is slightly different for training and evaluation. column_names = raw_datasets["validation"].column_names @@ -288,7 +288,7 @@ def model_infer(inputs, context, d_inputs, h_output0, h_output1, d_output0, d_ou if args.max_seq_length > tokenizer.model_max_length: logger.warning( - f"The max_seq_length passed ({args.max_seq_length}) is larger than the maximum length for the" + f"The max_seq_length passed ({args.max_seq_length}) is larger than the maximum length for the " f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}." ) diff --git a/examples/research_projects/quantization-qdqbert/quant_trainer.py b/examples/research_projects/quantization-qdqbert/quant_trainer.py index 9360cc01ba7fa0..09bac19e921a89 100755 --- a/examples/research_projects/quantization-qdqbert/quant_trainer.py +++ b/examples/research_projects/quantization-qdqbert/quant_trainer.py @@ -41,8 +41,8 @@ def add_arguments(parser): group.add_argument("--quant-disable", action="store_true", help="disable all quantizers") group.add_argument("--quant-disable-embeddings", action="store_true", help="disable all embeddings quantizers") group.add_argument("--quant-disable-keyword", type=str, nargs="+", help="disable quantizers by keyword") - group.add_argument("--quant-disable-layer-module", type=str, help="disable quantizers by keyword under layer.\d+.") - group.add_argument("--quant-enable-layer-module", type=str, help="enable quantizers by keyword under layer.\d+.") + group.add_argument("--quant-disable-layer-module", type=str, help="disable quantizers by keyword under layer.") + group.add_argument("--quant-enable-layer-module", type=str, help="enable quantizers by keyword under layer") group.add_argument("--calibrator", default="max", help="which quantization range calibrator to use") group.add_argument("--percentile", default=None, type=float, help="percentile for PercentileCalibrator") group.add_argument("--fuse-qkv", action="store_true", help="use the same scale factor for qkv") @@ -94,10 +94,10 @@ def configure_model(model, args, calib=False, eval=False): set_quantizer_by_name(model, args.quant_disable_keyword, _disabled=True) if args.quant_disable_layer_module: - set_quantizer_by_name(model, ["layer.\d+." + args.quant_disable_layer_module], _disabled=True) + set_quantizer_by_name(model, [r"layer.\d+." + args.quant_disable_layer_module], _disabled=True) if args.quant_enable_layer_module: - set_quantizer_by_name(model, ["layer.\d+." + args.quant_enable_layer_module], _disabled=False) + set_quantizer_by_name(model, [r"layer.\d+." 
+ args.quant_enable_layer_module], _disabled=False) if args.recalibrate_weights: recalibrate_weights(model) @@ -239,7 +239,7 @@ def print_model_summary(model, name_width=25, line_width=180, ignore=None): continue if type(mod) in ignore: continue - if [True for s in ignore if type(s) is str and s in name]: + if [True for s in ignore if isinstance(s, str) and s in name]: continue act_str = f"Act:{input_q.extra_repr()}" wgt_str = f"Wgt:{weight_q.extra_repr()}" diff --git a/examples/research_projects/quantization-qdqbert/run_quant_qa.py b/examples/research_projects/quantization-qdqbert/run_quant_qa.py index ba5dfe4c090736..770a36525b5caa 100755 --- a/examples/research_projects/quantization-qdqbert/run_quant_qa.py +++ b/examples/research_projects/quantization-qdqbert/run_quant_qa.py @@ -308,7 +308,7 @@ def main(): extension = data_args.test_file.split(".")[-1] raw_datasets = load_dataset(extension, data_files=data_files, field="data", cache_dir=model_args.cache_dir) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # set default quantization parameters before building model quant_trainer.set_default_quantizers(quant_trainer_args) @@ -322,14 +322,14 @@ def main(): model_args.config_name if model_args.config_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=True if model_args.use_auth_token else None, ) tokenizer = AutoTokenizer.from_pretrained( model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=True, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=True if model_args.use_auth_token else None, ) model = QDQBertForQuestionAnswering.from_pretrained( model_args.model_name_or_path, @@ -337,7 +337,7 @@ def main(): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=True if model_args.use_auth_token else None, ) # Tokenizer check: this script requires a fast tokenizer. @@ -349,7 +349,7 @@ def main(): ) # Preprocessing the datasets. - # Preprocessing is slighlty different for training and evaluation. + # Preprocessing is slightly different for training and evaluation. if training_args.do_train or model_args.do_calib: column_names = raw_datasets["train"].column_names elif training_args.do_eval or model_args.save_onnx: @@ -365,7 +365,7 @@ def main(): if data_args.max_seq_length > tokenizer.model_max_length: logger.warning( - f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the" + f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the " f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}." 
) max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length) @@ -448,7 +448,7 @@ def prepare_train_features(examples): raise ValueError("--do_train requires a train dataset") train_dataset = raw_datasets["train"] if data_args.max_train_samples is not None: - # We will select sample from whole data if agument is specified + # We will select sample from whole data if argument is specified max_train_samples = min(len(train_dataset), data_args.max_train_samples) train_dataset = train_dataset.select(range(max_train_samples)) # Create train feature from dataset diff --git a/examples/research_projects/rag-end2end-retriever/finetune_rag.py b/examples/research_projects/rag-end2end-retriever/finetune_rag.py index 8d0ba293b12b90..9bc2e5db6d5d10 100644 --- a/examples/research_projects/rag-end2end-retriever/finetune_rag.py +++ b/examples/research_projects/rag-end2end-retriever/finetune_rag.py @@ -164,11 +164,11 @@ def __init__(self, hparams, **kwargs): self.step_count = 0 self.metrics = defaultdict(list) - self.dataset_kwargs: dict = dict( - data_dir=self.hparams.data_dir, - max_source_length=self.hparams.max_source_length, - prefix=prefix or "", - ) + self.dataset_kwargs: dict = { + "data_dir": self.hparams.data_dir, + "max_source_length": self.hparams.max_source_length, + "prefix": prefix or "", + } n_observations_per_split = { "train": self.hparams.n_train, "val": self.hparams.n_val, @@ -360,7 +360,7 @@ def training_step(self, batch, batch_idx) -> Dict: loss_tensors = self._step(batch) - logs = {name: loss for name, loss in zip(self.loss_names, loss_tensors)} + logs = dict(zip(self.loss_names, loss_tensors)) # tokens per batch tgt_pad_token_id = ( self.tokenizer.generator.pad_token_id @@ -434,7 +434,7 @@ def _generative_step(self, batch: dict) -> dict: target: List[str] = self.ids_to_clean_text(batch["decoder_input_ids"]) # print(preds,target) loss_tensors = self._step(batch) - base_metrics = {name: loss for name, loss in zip(self.loss_names, loss_tensors)} + base_metrics = dict(zip(self.loss_names, loss_tensors)) gen_metrics: Dict = self.calc_generative_metrics(preds, target) summ_len = np.mean(lmap(len, generated_ids)) @@ -680,7 +680,7 @@ def add_ray_specific_args(parser): type=int, default=1, help=( - "The number of retrieval actors to use when Ray is selected" + "The number of retrieval actors to use when Ray is selected " "for the distributed retriever. Has no effect when " "distributed_retriever is set to pytorch." ), @@ -719,7 +719,7 @@ def main(args=None, model=None) -> GenerativeQAModule: ray.init(address=args.ray_address, namespace="rag") except (ConnectionError, ValueError): logger.warning( - "Connection to Ray cluster failed. Make sure a Ray" + "Connection to Ray cluster failed. Make sure a Ray " "cluster is running by either using Ray's cluster " "launcher (`ray up`) or by manually starting Ray on " "each node via `ray start --head` for the head node " diff --git a/examples/research_projects/rag-end2end-retriever/lightning_base.py b/examples/research_projects/rag-end2end-retriever/lightning_base.py index b9f8c6e3d7b5c0..276f2f791b9eba 100644 --- a/examples/research_projects/rag-end2end-retriever/lightning_base.py +++ b/examples/research_projects/rag-end2end-retriever/lightning_base.py @@ -333,7 +333,7 @@ def add_generic_args(parser, root_dir) -> None: type=str, default="O2", help=( - "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. 
" "See details at https://nvidia.github.io/apex/amp.html" ), ) diff --git a/examples/research_projects/rag-end2end-retriever/use_own_knowledge_dataset.py b/examples/research_projects/rag-end2end-retriever/use_own_knowledge_dataset.py index e0aa86a3a65ba9..20e0ea2d3cc2a2 100644 --- a/examples/research_projects/rag-end2end-retriever/use_own_knowledge_dataset.py +++ b/examples/research_projects/rag-end2end-retriever/use_own_knowledge_dataset.py @@ -65,7 +65,7 @@ def main( "csv", data_files=[rag_example_args.csv_path], split="train", delimiter="\t", column_names=["title", "text"] ) - # More info about loading csv files in the documentation: https://huggingface.co/docs/datasets/loading_datasets.html?highlight=csv#csv-files + # More info about loading csv files in the documentation: https://huggingface.co/docs/datasets/loading_datasets?highlight=csv#csv-files # Then split the documents into passages of 100 words dataset = dataset.map(split_documents, batched=True, num_proc=processing_args.num_proc) diff --git a/examples/research_projects/rag-end2end-retriever/utils_rag.py b/examples/research_projects/rag-end2end-retriever/utils_rag.py index 7bf5d7e35e9e98..ec98c1d782e0ea 100644 --- a/examples/research_projects/rag-end2end-retriever/utils_rag.py +++ b/examples/research_projects/rag-end2end-retriever/utils_rag.py @@ -137,7 +137,7 @@ def collate_fn(self, batch) -> Dict[str, torch.Tensor]: def flatten_list(summary_ids: List[List]): - return [x for x in itertools.chain.from_iterable(summary_ids)] + return list(itertools.chain.from_iterable(summary_ids)) def save_git_info(folder_path: str) -> None: diff --git a/examples/research_projects/rag/README.md b/examples/research_projects/rag/README.md index 36c4a47841e560..7fbaea84b93782 100644 --- a/examples/research_projects/rag/README.md +++ b/examples/research_projects/rag/README.md @@ -17,7 +17,7 @@ Read more about RAG at https://arxiv.org/abs/2005.11401. # Finetuning -Our finetuning logic is based on scripts from [`examples/seq2seq`](https://github.com/huggingface/transformers/tree/main/examples/seq2seq). We accept training data in the same format as specified there - we expect a directory consisting of 6 text files: +Our finetuning logic is based on scripts from [`examples/legacy/seq2seq`](https://github.com/huggingface/transformers/tree/main/examples/legacy/seq2seq). We accept training data in the same format as specified there - we expect a directory consisting of 6 text files: ```bash train.source train.target @@ -45,7 +45,7 @@ We publish two `base` models which can serve as a starting point for finetuning The `base` models initialize the question encoder with [`facebook/dpr-question_encoder-single-nq-base`](https://huggingface.co/facebook/dpr-question_encoder-single-nq-base) and the generator with [`facebook/bart-large`](https://huggingface.co/facebook/bart-large). 
If you would like to initialize finetuning with a base model using different question encoder and generator architectures, you can build it with a consolidation script, e.g.: -``` +```bash python examples/research_projects/rag/consolidate_rag_checkpoint.py \ --model_type rag_sequence \ --generator_name_or_path facebook/bart-large-cnn \ diff --git a/examples/research_projects/rag/finetune_rag.py b/examples/research_projects/rag/finetune_rag.py index f5cef614e2d9f3..7f4778d7d71eeb 100644 --- a/examples/research_projects/rag/finetune_rag.py +++ b/examples/research_projects/rag/finetune_rag.py @@ -162,11 +162,11 @@ def __init__(self, hparams, **kwargs): self.step_count = 0 self.metrics = defaultdict(list) - self.dataset_kwargs: dict = dict( - data_dir=self.hparams.data_dir, - max_source_length=self.hparams.max_source_length, - prefix=prefix or "", - ) + self.dataset_kwargs: dict = { + "data_dir": self.hparams.data_dir, + "max_source_length": self.hparams.max_source_length, + "prefix": prefix or "", + } n_observations_per_split = { "train": self.hparams.n_train, "val": self.hparams.n_val, @@ -321,7 +321,7 @@ def _generative_step(self, batch: dict) -> dict: preds: List[str] = self.ids_to_clean_text(generated_ids) target: List[str] = self.ids_to_clean_text(batch["decoder_input_ids"]) loss_tensors = self._step(batch) - base_metrics = {name: loss for name, loss in zip(self.loss_names, loss_tensors)} + base_metrics = dict(zip(self.loss_names, loss_tensors)) gen_metrics: Dict = self.calc_generative_metrics(preds, target) summ_len = np.mean(lmap(len, generated_ids)) @@ -525,7 +525,7 @@ def add_ray_specific_args(parser): type=int, default=1, help=( - "The number of retrieval actors to use when Ray is selected" + "The number of retrieval actors to use when Ray is selected " "for the distributed retriever. Has no effect when " "distributed_retriever is set to pytorch." ), @@ -552,7 +552,7 @@ def main(args=None, model=None) -> GenerativeQAModule: ray.init(address=args.ray_address, namespace="rag") except (ConnectionError, ValueError): logger.warning( - "Connection to Ray cluster failed. Make sure a Ray" + "Connection to Ray cluster failed. Make sure a Ray " "cluster is running by either using Ray's cluster " "launcher (`ray up`) or by manually starting Ray on " "each node via `ray start --head` for the head node " diff --git a/examples/research_projects/rag/lightning_base.py b/examples/research_projects/rag/lightning_base.py index e78a7582395875..12099bc3aa106e 100644 --- a/examples/research_projects/rag/lightning_base.py +++ b/examples/research_projects/rag/lightning_base.py @@ -322,7 +322,7 @@ def add_generic_args(parser, root_dir) -> None: type=str, default="O2", help=( - "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. 
" "See details at https://nvidia.github.io/apex/amp.html" ), ) diff --git a/examples/research_projects/rag/requirements.txt b/examples/research_projects/rag/requirements.txt index fdeb5567d24d55..5988d38de9e903 100644 --- a/examples/research_projects/rag/requirements.txt +++ b/examples/research_projects/rag/requirements.txt @@ -3,6 +3,6 @@ datasets >= 1.0.1 psutil >= 5.7.0 torch >= 1.4.0 ray >= 1.10.0 -pytorch-lightning >= 1.5.10 +pytorch-lightning >= 1.5.10, <=1.6.0 transformers GitPython \ No newline at end of file diff --git a/examples/research_projects/rag/use_own_knowledge_dataset.py b/examples/research_projects/rag/use_own_knowledge_dataset.py index 84d7c854975f11..d2ab6d07d5cc36 100644 --- a/examples/research_projects/rag/use_own_knowledge_dataset.py +++ b/examples/research_projects/rag/use_own_knowledge_dataset.py @@ -73,7 +73,7 @@ def main( "csv", data_files=[rag_example_args.csv_path], split="train", delimiter="\t", column_names=["title", "text"] ) - # More info about loading csv files in the documentation: https://huggingface.co/docs/datasets/loading_datasets.html?highlight=csv#csv-files + # More info about loading csv files in the documentation: https://huggingface.co/docs/datasets/loading_datasets?highlight=csv#csv-files # Then split the documents into passages of 100 words dataset = dataset.map(split_documents, batched=True, num_proc=processing_args.num_proc) diff --git a/examples/research_projects/rag/utils_rag.py b/examples/research_projects/rag/utils_rag.py index 7bf5d7e35e9e98..ec98c1d782e0ea 100644 --- a/examples/research_projects/rag/utils_rag.py +++ b/examples/research_projects/rag/utils_rag.py @@ -137,7 +137,7 @@ def collate_fn(self, batch) -> Dict[str, torch.Tensor]: def flatten_list(summary_ids: List[List]): - return [x for x in itertools.chain.from_iterable(summary_ids)] + return list(itertools.chain.from_iterable(summary_ids)) def save_git_info(folder_path: str) -> None: diff --git a/examples/research_projects/robust-speech-event/README.md b/examples/research_projects/robust-speech-event/README.md index fd1a42c7d4bb58..5c7bf42a00445a 100644 --- a/examples/research_projects/robust-speech-event/README.md +++ b/examples/research_projects/robust-speech-event/README.md @@ -3,7 +3,7 @@ Welcome to the robust speech recognition challenge 🎙️ ! The goal of this event is to build **robust**, **real-world** speech recognition (ASR) systems in as many languages as possible 🌏🌍🌎. -If necessary and available, free access to a V100S 32 GB GPU will kindly be provided by the [OVHcloud team]( https://www.ovhcloud.com/) 🚀. +If necessary and available, free access to a V100S 32 GB GPU will kindly be provided by the [OVHcloud team](https://www.ovhcloud.com/) 🚀. This document summarizes all the relevant information required for the speech community event 📋. To sign-up, please see [this forum post](https://discuss.huggingface.co/t/open-to-the-community-robust-speech-recognition-challenge/13614) 🤗. Please make sure to: @@ -112,7 +112,7 @@ Hugging Face Hub for additional audio data, for example by selecting the categor ["speech-processing"](https://huggingface.co/datasets?task_categories=task_categories:speech-processing&sort=downloads). All datasets that are available on the Hub can be downloaded via the 🤗 Datasets library in the same way Common Voice is downloaded. 
If one wants to combine multiple datasets for training, it might make sense to take a look at -the [`interleave_datasets`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=interleave#datasets.interleave_datasets) function. +the [`interleave_datasets`](https://huggingface.co/docs/datasets/package_reference/main_classes?highlight=interleave#datasets.interleave_datasets) function. In addition, participants can also make use of their audio data. Here, please make sure that you **are allowed to use the audio data**. E.g., if audio data is taken from media platforms, such as YouTube, it should be verified that the media platform and the owner of the data have given her/his approval to use the audio @@ -216,7 +216,7 @@ library from source to profit from the most current additions during the communi Simply run the following steps: -``` +```bash $ cd ~/ $ git clone https://github.com/huggingface/datasets.git $ cd datasets diff --git a/examples/research_projects/robust-speech-event/eval.py b/examples/research_projects/robust-speech-event/eval.py index a8acca1825d7da..b6c89a6d49fac9 100755 --- a/examples/research_projects/robust-speech-event/eval.py +++ b/examples/research_projects/robust-speech-event/eval.py @@ -65,7 +65,7 @@ def normalize_text(text: str) -> str: def main(args): # load dataset - dataset = load_dataset(args.dataset, args.config, split=args.split, use_auth_token=True) + dataset = load_dataset(args.dataset, args.config, split=args.split, token=True) # for testing: only process the first two examples as a test # dataset = dataset.select(range(10)) diff --git a/examples/research_projects/robust-speech-event/run_speech_recognition_ctc_bnb.py b/examples/research_projects/robust-speech-event/run_speech_recognition_ctc_bnb.py index aaacc79cebd7a6..ebf33eb01df537 100755 --- a/examples/research_projects/robust-speech-event/run_speech_recognition_ctc_bnb.py +++ b/examples/research_projects/robust-speech-event/run_speech_recognition_ctc_bnb.py @@ -104,8 +104,8 @@ class ModelArguments: default=0.05, metadata={ "help": ( - "Probability of each feature vector along the time axis to be chosen as the start of the vector" - "span to be masked. Approximately ``mask_time_prob * sequence_length // mask_time_length`` feature" + "Probability of each feature vector along the time axis to be chosen as the start of the vector " + "span to be masked. Approximately ``mask_time_prob * sequence_length // mask_time_length`` feature " "vectors will be masked along the time axis." 
) }, @@ -292,7 +292,7 @@ class DataCollatorCTCWithPadding: pad_to_multiple_of_labels: Optional[int] = None def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]: - # split inputs and labels since they have to be of different lenghts and need + # split inputs and labels since they have to be of different lengths and need # different padding methods input_features = [{"input_values": feature["input_values"]} for feature in features] label_features = [{"input_ids": feature["labels"]} for feature in features] @@ -344,7 +344,7 @@ def extract_all_chars(batch): lambda vocab_1, vocab_2: set(vocab_1["vocab"][0]) | set(vocab_2["vocab"][0]), vocabs.values() ) - vocab_dict = {v: k for k, v in enumerate(sorted(list(vocab_set)))} + vocab_dict = {v: k for k, v in enumerate(sorted(vocab_set))} # replace white space with delimiter token if word_delimiter_token is not None: @@ -399,7 +399,7 @@ def main(): # Log on each process the small summary: logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" ) # Set the verbosity to info of the Transformers logger (on main process only): @@ -418,7 +418,7 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, split=data_args.train_split_name, - use_auth_token=data_args.use_auth_token, + token=data_args.use_auth_token, ) if data_args.audio_column_name not in raw_datasets["train"].column_names: @@ -443,7 +443,7 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, split=data_args.eval_split_name, - use_auth_token=data_args.use_auth_token, + token=data_args.use_auth_token, ) if data_args.max_eval_samples is not None: @@ -481,7 +481,7 @@ def remove_special_characters(batch): # the tokenizer # load config config = AutoConfig.from_pretrained( - model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_auth_token=data_args.use_auth_token + model_args.model_name_or_path, cache_dir=model_args.cache_dir, token=data_args.use_auth_token ) # 4. 
Next, if no tokenizer file is defined, @@ -532,11 +532,11 @@ def remove_special_characters(batch): # load feature_extractor and tokenizer tokenizer = AutoTokenizer.from_pretrained( tokenizer_name_or_path, - use_auth_token=data_args.use_auth_token, + token=data_args.use_auth_token, **tokenizer_kwargs, ) feature_extractor = AutoFeatureExtractor.from_pretrained( - model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_auth_token=data_args.use_auth_token + model_args.model_name_or_path, cache_dir=model_args.cache_dir, token=data_args.use_auth_token ) # adapt config @@ -564,7 +564,7 @@ def remove_special_characters(batch): model_args.model_name_or_path, cache_dir=model_args.cache_dir, config=config, - use_auth_token=data_args.use_auth_token, + token=data_args.use_auth_token, ) # freeze encoder diff --git a/examples/research_projects/robust-speech-event/run_speech_recognition_ctc_streaming.py b/examples/research_projects/robust-speech-event/run_speech_recognition_ctc_streaming.py index 54338f15988154..8a8eda851bb4e5 100644 --- a/examples/research_projects/robust-speech-event/run_speech_recognition_ctc_streaming.py +++ b/examples/research_projects/robust-speech-event/run_speech_recognition_ctc_streaming.py @@ -103,8 +103,8 @@ class ModelArguments: default=0.05, metadata={ "help": ( - "Probability of each feature vector along the time axis to be chosen as the start of the vector" - "span to be masked. Approximately ``mask_time_prob * sequence_length // mask_time_length`` feature" + "Probability of each feature vector along the time axis to be chosen as the start of the vector " + "span to be masked. Approximately ``mask_time_prob * sequence_length // mask_time_length`` feature " "vectors will be masked along the time axis." ) }, @@ -284,7 +284,7 @@ class DataCollatorCTCWithPadding: pad_to_multiple_of_labels: Optional[int] = None def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]: - # split inputs and labels since they have to be of different lenghts and need + # split inputs and labels since they have to be of different lengths and need # different padding methods input_features = [] label_features = [] @@ -354,7 +354,7 @@ def main(): # Log on each process the small summary: logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" ) # Set the verbosity to info of the Transformers logger (on main process only): @@ -395,7 +395,7 @@ def load_streaming_dataset(split, sampling_rate, **kwargs): # so we just need to set the correct target sampling rate and normalize the input # via the `feature_extractor` feature_extractor = AutoFeatureExtractor.from_pretrained( - model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_auth_token=data_args.use_auth_token + model_args.model_name_or_path, cache_dir=model_args.cache_dir, token=data_args.use_auth_token ) if training_args.do_train: @@ -403,7 +403,7 @@ def load_streaming_dataset(split, sampling_rate, **kwargs): path=data_args.dataset_name, name=data_args.dataset_config_name, split=data_args.train_split_name, - use_auth_token=data_args.use_auth_token, + token=data_args.use_auth_token, streaming=True, sampling_rate=feature_extractor.sampling_rate, ) @@ -431,7 +431,7 @@ def load_streaming_dataset(split, 
sampling_rate, **kwargs): path=data_args.dataset_name, name=data_args.dataset_config_name, split=data_args.eval_split_name, - use_auth_token=data_args.use_auth_token, + token=data_args.use_auth_token, streaming=True, sampling_rate=feature_extractor.sampling_rate, ) @@ -465,7 +465,7 @@ def remove_special_characters(batch): # 3. Next, let's load the config as we might need it to create # the tokenizer config = AutoConfig.from_pretrained( - model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_auth_token=data_args.use_auth_token + model_args.model_name_or_path, cache_dir=model_args.cache_dir, token=data_args.use_auth_token ) # 4. Now we can instantiate the tokenizer and model @@ -481,7 +481,7 @@ def remove_special_characters(batch): tokenizer = AutoTokenizer.from_pretrained( tokenizer_name_or_path, config=config, - use_auth_token=data_args.use_auth_token, + token=data_args.use_auth_token, ) # adapt config @@ -509,7 +509,7 @@ def remove_special_characters(batch): model_args.model_name_or_path, cache_dir=model_args.cache_dir, config=config, - use_auth_token=data_args.use_auth_token, + token=data_args.use_auth_token, ) # freeze encoder diff --git a/examples/research_projects/seq2seq-distillation/README.md b/examples/research_projects/seq2seq-distillation/README.md index 930e5b8fc98398..ab79a652ed38c3 100644 --- a/examples/research_projects/seq2seq-distillation/README.md +++ b/examples/research_projects/seq2seq-distillation/README.md @@ -239,7 +239,7 @@ For example, ./save_len_file.py Helsinki-NLP/opus-mt-en-ro wmt_en_ro ./dynamic_bs_example.sh --max_tokens_per_batch=2000 --output_dir benchmark_dynamic_bs ``` -splits `wmt_en_ro/train` into 11,197 uneven lengthed batches and can finish 1 epoch in 8 minutes on a v100. +splits `wmt_en_ro/train` into 11,197 uneven length batches and can finish 1 epoch in 8 minutes on a v100. 
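The `--max_tokens_per_batch` behaviour described above (11,197 uneven length batches for `wmt_en_ro/train`) comes from packing examples against a token budget instead of a fixed batch size: short sequences travel in large batches, long ones in small batches, so padding waste stays roughly constant. A simplified sketch of that packing logic (a stand-in for illustration, not the example's actual batch sampler):

```python
from typing import List

def pack_by_token_budget(lengths: List[int], max_tokens_per_batch: int = 2000) -> List[List[int]]:
    """Greedily group example indices so len(batch) * longest_in_batch <= budget."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    batches, current, current_max = [], [], 0
    for idx in order:
        new_max = max(current_max, lengths[idx])
        if current and (len(current) + 1) * new_max > max_tokens_per_batch:
            batches.append(current)          # budget exceeded: close the batch
            current, new_max = [], lengths[idx]
        current.append(idx)
        current_max = new_max
    if current:
        batches.append(current)
    return batches

# Long sentences end up in small batches, short ones in large batches.
print([len(b) for b in pack_by_token_budget([512, 480, 60, 55, 50, 12, 10, 9, 8])])  # -> [3, 6]
```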
For comparison, ```bash diff --git a/examples/research_projects/seq2seq-distillation/_test_seq2seq_examples.py b/examples/research_projects/seq2seq-distillation/_test_seq2seq_examples.py index b1c84ad9b8bdc3..454951ed3888a0 100644 --- a/examples/research_projects/seq2seq-distillation/_test_seq2seq_examples.py +++ b/examples/research_projects/seq2seq-distillation/_test_seq2seq_examples.py @@ -145,18 +145,18 @@ def test_hub_configs(self): assert not failures, f"The following models could not be loaded through AutoConfig: {failures}" def test_distill_no_teacher(self): - updates = dict(student_encoder_layers=2, student_decoder_layers=1, no_teacher=True) + updates = {"student_encoder_layers": 2, "student_decoder_layers": 1, "no_teacher": True} self._test_distiller_cli(updates) def test_distill_checkpointing_with_teacher(self): - updates = dict( - student_encoder_layers=2, - student_decoder_layers=1, - max_epochs=4, - val_check_interval=0.25, - alpha_hid=2.0, - model_name_or_path="IGNORE_THIS_IT_DOESNT_GET_USED", - ) + updates = { + "student_encoder_layers": 2, + "student_decoder_layers": 1, + "max_epochs": 4, + "val_check_interval": 0.25, + "alpha_hid": 2.0, + "model_name_or_path": "IGNORE_THIS_IT_DOESNT_GET_USED", + } model = self._test_distiller_cli(updates, check_contents=False) ckpts = list(Path(model.output_dir).glob("*.ckpt")) @@ -193,19 +193,19 @@ def test_loss_fn(self): self.assertEqual(nll_loss, model_computed_loss) def test_distill_mbart(self): - updates = dict( - student_encoder_layers=2, - student_decoder_layers=1, - num_train_epochs=4, - val_check_interval=0.25, - alpha_hid=2.0, - task="translation", - model_name_or_path="IGNORE_THIS_IT_DOESNT_GET_USED", - tokenizer_name=MBART_TINY, - teacher=MBART_TINY, - src_lang="en_XX", - tgt_lang="ro_RO", - ) + updates = { + "student_encoder_layers": 2, + "student_decoder_layers": 1, + "num_train_epochs": 4, + "val_check_interval": 0.25, + "alpha_hid": 2.0, + "task": "translation", + "model_name_or_path": "IGNORE_THIS_IT_DOESNT_GET_USED", + "tokenizer_name": MBART_TINY, + "teacher": MBART_TINY, + "src_lang": "en_XX", + "tgt_lang": "ro_RO", + } model = self._test_distiller_cli(updates, check_contents=False) assert model.model.config.model_type == "mbart" @@ -217,39 +217,39 @@ def test_distill_mbart(self): self.assertEqual(len(transformer_ckpts), 2) def test_distill_t5(self): - updates = dict( - student_encoder_layers=1, - student_decoder_layers=1, - alpha_hid=2.0, - teacher=T5_TINY, - model_name_or_path=T5_TINY, - tokenizer_name=T5_TINY, - ) + updates = { + "student_encoder_layers": 1, + "student_decoder_layers": 1, + "alpha_hid": 2.0, + "teacher": T5_TINY, + "model_name_or_path": T5_TINY, + "tokenizer_name": T5_TINY, + } self._test_distiller_cli(updates) def test_distill_different_base_models(self): - updates = dict( - teacher=T5_TINY, - student=T5_TINIER, - model_name_or_path=T5_TINIER, - tokenizer_name=T5_TINIER, - ) + updates = { + "teacher": T5_TINY, + "student": T5_TINIER, + "model_name_or_path": T5_TINIER, + "tokenizer_name": T5_TINIER, + } self._test_distiller_cli(updates) def _test_distiller_cli(self, updates, check_contents=True): - default_updates = dict( - label_smoothing=0.0, - early_stopping_patience=-1, - train_batch_size=1, - eval_batch_size=2, - max_epochs=2, - alpha_mlm=0.2, - alpha_ce=0.8, - do_predict=True, - model_name_or_path="sshleifer/tinier_bart", - teacher=CHEAP_ARGS["model_name_or_path"], - val_check_interval=0.5, - ) + default_updates = { + "label_smoothing": 0.0, + "early_stopping_patience": -1, + 
"train_batch_size": 1, + "eval_batch_size": 2, + "max_epochs": 2, + "alpha_mlm": 0.2, + "alpha_ce": 0.8, + "do_predict": True, + "model_name_or_path": "sshleifer/tinier_bart", + "teacher": CHEAP_ARGS["model_name_or_path"], + "val_check_interval": 0.5, + } default_updates.update(updates) args_d: dict = CHEAP_ARGS.copy() tmp_dir = make_test_data_dir(tmp_dir=self.get_auto_remove_tmp_dir()) diff --git a/examples/research_projects/seq2seq-distillation/_test_seq2seq_examples_multi_gpu.py b/examples/research_projects/seq2seq-distillation/_test_seq2seq_examples_multi_gpu.py index bb06ec8e659fa7..9eeb3b30d39986 100644 --- a/examples/research_projects/seq2seq-distillation/_test_seq2seq_examples_multi_gpu.py +++ b/examples/research_projects/seq2seq-distillation/_test_seq2seq_examples_multi_gpu.py @@ -98,29 +98,29 @@ def setUpClass(cls): @require_torch_multi_gpu def test_multi_gpu(self): - updates = dict( - no_teacher=True, - freeze_encoder=True, - gpus=2, - overwrite_output_dir=True, - sortish_sampler=True, - ) + updates = { + "no_teacher": True, + "freeze_encoder": True, + "gpus": 2, + "overwrite_output_dir": True, + "sortish_sampler": True, + } self._test_distiller_cli_fork(updates, check_contents=False) def _test_distiller_cli_fork(self, updates, check_contents=True): - default_updates = dict( - label_smoothing=0.0, - early_stopping_patience=-1, - train_batch_size=1, - eval_batch_size=2, - max_epochs=2, - alpha_mlm=0.2, - alpha_ce=0.8, - do_predict=True, - model_name_or_path="sshleifer/tinier_bart", - teacher=CHEAP_ARGS["model_name_or_path"], - val_check_interval=0.5, - ) + default_updates = { + "label_smoothing": 0.0, + "early_stopping_patience": -1, + "train_batch_size": 1, + "eval_batch_size": 2, + "max_epochs": 2, + "alpha_mlm": 0.2, + "alpha_ce": 0.8, + "do_predict": True, + "model_name_or_path": "sshleifer/tinier_bart", + "teacher": CHEAP_ARGS["model_name_or_path"], + "val_check_interval": 0.5, + } default_updates.update(updates) args_d: dict = CHEAP_ARGS.copy() tmp_dir = make_test_data_dir(tmp_dir=self.get_auto_remove_tmp_dir()) diff --git a/examples/research_projects/seq2seq-distillation/finetune.py b/examples/research_projects/seq2seq-distillation/finetune.py index 77f02bef135ed3..ff889af81e36a6 100755 --- a/examples/research_projects/seq2seq-distillation/finetune.py +++ b/examples/research_projects/seq2seq-distillation/finetune.py @@ -74,11 +74,11 @@ def __init__(self, hparams, **kwargs): self.model_type = self.config.model_type self.vocab_size = self.config.tgt_vocab_size if self.model_type == "fsmt" else self.config.vocab_size - self.dataset_kwargs: dict = dict( - data_dir=self.hparams.data_dir, - max_source_length=self.hparams.max_source_length, - prefix=self.model.config.prefix or "", - ) + self.dataset_kwargs: dict = { + "data_dir": self.hparams.data_dir, + "max_source_length": self.hparams.max_source_length, + "prefix": self.model.config.prefix or "", + } n_observations_per_split = { "train": self.hparams.n_train, "val": self.hparams.n_val, @@ -170,7 +170,7 @@ def pad(self) -> int: def training_step(self, batch, batch_idx) -> Dict: loss_tensors = self._step(batch) - logs = {name: loss for name, loss in zip(self.loss_names, loss_tensors)} + logs = dict(zip(self.loss_names, loss_tensors)) # tokens per batch logs["tpb"] = batch["input_ids"].ne(self.pad).sum() + batch["labels"].ne(self.pad).sum() logs["bs"] = batch["input_ids"].shape[0] @@ -225,7 +225,7 @@ def _generative_step(self, batch: dict) -> dict: preds: List[str] = self.ids_to_clean_text(generated_ids) target: List[str] = 
self.ids_to_clean_text(batch["labels"]) loss_tensors = self._step(batch) - base_metrics = {name: loss for name, loss in zip(self.loss_names, loss_tensors)} + base_metrics = dict(zip(self.loss_names, loss_tensors)) rouge: Dict = self.calc_generative_metrics(preds, target) summ_len = np.mean(lmap(len, generated_ids)) base_metrics.update(gen_time=gen_time, gen_len=summ_len, preds=preds, target=target, **rouge) @@ -433,7 +433,7 @@ def main(args, model=None) -> SummarizationModule: return model model.hparams.test_checkpoint = "" - checkpoints = list(sorted(glob.glob(os.path.join(args.output_dir, "*.ckpt"), recursive=True))) + checkpoints = sorted(glob.glob(os.path.join(args.output_dir, "*.ckpt"), recursive=True)) if checkpoints: model.hparams.test_checkpoint = checkpoints[-1] trainer.resume_from_checkpoint = checkpoints[-1] diff --git a/examples/research_projects/seq2seq-distillation/lightning_base.py b/examples/research_projects/seq2seq-distillation/lightning_base.py index f246ecab0dd01b..640828bacd3401 100644 --- a/examples/research_projects/seq2seq-distillation/lightning_base.py +++ b/examples/research_projects/seq2seq-distillation/lightning_base.py @@ -313,7 +313,7 @@ def add_generic_args(parser, root_dir) -> None: type=str, default="O2", help=( - "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']." + "For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. " "See details at https://nvidia.github.io/apex/amp.html" ), ) diff --git a/examples/research_projects/seq2seq-distillation/make_student.py b/examples/research_projects/seq2seq-distillation/make_student.py index c1efc1b497abba..83e014bf481e81 100644 --- a/examples/research_projects/seq2seq-distillation/make_student.py +++ b/examples/research_projects/seq2seq-distillation/make_student.py @@ -171,11 +171,11 @@ def create_student_by_copying_alternating_layers( logger.info( f"Copied encoder layers {e_layers_to_copy} and decoder layers {d_layers_to_copy}. 
Saving them to {save_path}" ) - student.config.init_metadata = dict( - teacher_type=teacher.config.model_type, - copied_encoder_layers=e_layers_to_copy, - copied_decoder_layers=d_layers_to_copy, - ) + student.config.init_metadata = { + "teacher_type": teacher.config.model_type, + "copied_encoder_layers": e_layers_to_copy, + "copied_decoder_layers": d_layers_to_copy, + } student.save_pretrained(save_path) # Save information about copying for easier reproducibility diff --git a/examples/research_projects/seq2seq-distillation/run_eval.py b/examples/research_projects/seq2seq-distillation/run_eval.py index 3f685884e8e893..54ad6c6fb6b637 100755 --- a/examples/research_projects/seq2seq-distillation/run_eval.py +++ b/examples/research_projects/seq2seq-distillation/run_eval.py @@ -63,7 +63,7 @@ def generate_summaries_or_translations( fout.close() runtime = int(time.time() - start_time) # seconds n_obs = len(examples) - return dict(n_obs=n_obs, runtime=runtime, seconds_per_sample=round(runtime / n_obs, 4)) + return {"n_obs": n_obs, "runtime": runtime, "seconds_per_sample": round(runtime / n_obs, 4)} def datetime_now(): @@ -94,7 +94,7 @@ def run_generate(verbose=True): parser.add_argument("--score_path", type=str, required=False, default="metrics.json", help="where to save metrics") parser.add_argument("--device", type=str, required=False, default=DEFAULT_DEVICE, help="cuda, cuda:1, cpu etc.") parser.add_argument( - "--prefix", type=str, required=False, default=None, help="will be added to the begininng of src examples" + "--prefix", type=str, required=False, default=None, help="will be added to the beginning of src examples" ) parser.add_argument("--task", type=str, default="summarization", help="used for task_specific_params + metrics") parser.add_argument("--bs", type=int, default=8, required=False, help="batch size") diff --git a/examples/research_projects/seq2seq-distillation/utils.py b/examples/research_projects/seq2seq-distillation/utils.py index f1a8cef8508ccd..de666e0c249002 100644 --- a/examples/research_projects/seq2seq-distillation/utils.py +++ b/examples/research_projects/seq2seq-distillation/utils.py @@ -437,7 +437,7 @@ def pickle_save(obj, path): def flatten_list(summary_ids: List[List]): - return [x for x in itertools.chain.from_iterable(summary_ids)] + return list(itertools.chain.from_iterable(summary_ids)) def save_git_info(folder_path: str) -> None: diff --git a/examples/research_projects/tapex/run_tabfact_with_tapex.py b/examples/research_projects/tapex/run_tabfact_with_tapex.py index 23d094f8992a63..5dcec10a084c5f 100644 --- a/examples/research_projects/tapex/run_tabfact_with_tapex.py +++ b/examples/research_projects/tapex/run_tabfact_with_tapex.py @@ -277,7 +277,7 @@ def main(): # Loading a dataset from local json files raw_datasets = load_dataset("json", data_files=data_files, cache_dir=model_args.cache_dir) # See more about loading any type of standard or custom dataset at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. 
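# For reference, the `data_files` mapping that the "json" loader above expects is just
# split name -> local file; the paths below are placeholders, not part of this script:
#     data_files = {"train": "train.json", "validation": "dev.json", "test": "test.json"}
#     raw_datasets = load_dataset("json", data_files=data_files)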
# Labels label_list = raw_datasets["train"].features["label"].names @@ -292,7 +292,7 @@ def main(): num_labels=num_labels, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=True if model_args.use_auth_token else None, ) # load tapex tokenizer tokenizer = TapexTokenizer.from_pretrained( @@ -300,7 +300,7 @@ def main(): cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=True if model_args.use_auth_token else None, add_prefix_space=True, ) model = BartForSequenceClassification.from_pretrained( @@ -309,7 +309,7 @@ def main(): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=True if model_args.use_auth_token else None, ) # Padding strategy @@ -325,7 +325,7 @@ def main(): if data_args.max_seq_length > tokenizer.model_max_length: logger.warning( - f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the" + f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the " f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}." ) max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length) diff --git a/examples/research_projects/tapex/run_wikisql_with_tapex.py b/examples/research_projects/tapex/run_wikisql_with_tapex.py index a5717d245cb6c9..81e940a77c882c 100644 --- a/examples/research_projects/tapex/run_wikisql_with_tapex.py +++ b/examples/research_projects/tapex/run_wikisql_with_tapex.py @@ -170,7 +170,7 @@ class DataTrainingArguments: metadata={ "help": ( "The maximum total sequence length for validation target text after tokenization. Sequences longer " - "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`." + "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`. " "This argument is also used to override the ``max_length`` param of ``model.generate``, which is used " "during ``evaluate`` and ``predict``." ) @@ -317,7 +317,7 @@ def main(): datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. 
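# `token` supersedes the deprecated `use_auth_token` keyword in `from_pretrained`:
# both accept a string access token or `True`, in which case the token saved by
# `huggingface-cli login` is reused, e.g. (checkpoint name is only an example):
#     config = AutoConfig.from_pretrained("microsoft/tapex-base", token=True)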
# Load pretrained model and tokenizer # @@ -329,7 +329,7 @@ def main(): model_args.config_name if model_args.config_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=True if model_args.use_auth_token else None, ) # IMPORTANT: the initial BART model's decoding is penalized by no_repeat_ngram_size, and thus @@ -344,7 +344,7 @@ def main(): cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=True if model_args.use_auth_token else None, add_prefix_space=True, ) @@ -355,7 +355,7 @@ def main(): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=True if model_args.use_auth_token else None, ) if model.config.decoder_start_token_id is None: @@ -379,7 +379,7 @@ def main(): if training_args.label_smoothing_factor > 0 and not hasattr(model, "prepare_decoder_input_ids_from_labels"): logger.warning( - "label_smoothing is enabled but the `prepare_decoder_input_ids_from_labels` method is not defined for" + "label_smoothing is enabled but the `prepare_decoder_input_ids_from_labels` method is not defined for " f"`{model.__class__.__name__}`. This will lead to loss being calculated twice and will take up more memory" ) diff --git a/examples/research_projects/tapex/run_wikitablequestions_with_tapex.py b/examples/research_projects/tapex/run_wikitablequestions_with_tapex.py index 901e921f26a694..55350025cb3bb4 100644 --- a/examples/research_projects/tapex/run_wikitablequestions_with_tapex.py +++ b/examples/research_projects/tapex/run_wikitablequestions_with_tapex.py @@ -168,7 +168,7 @@ class DataTrainingArguments: metadata={ "help": ( "The maximum total sequence length for validation target text after tokenization. Sequences longer " - "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`." + "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`. " "This argument is also used to override the ``max_length`` param of ``model.generate``, which is used " "during ``evaluate`` and ``predict``." ) @@ -315,7 +315,7 @@ def main(): datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. 
# Load pretrained model and tokenizer # @@ -327,7 +327,7 @@ def main(): model_args.config_name if model_args.config_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=True if model_args.use_auth_token else None, ) # IMPORTANT: the initial BART model's decoding is penalized by no_repeat_ngram_size, and thus @@ -342,7 +342,7 @@ def main(): cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=True if model_args.use_auth_token else None, add_prefix_space=True, ) @@ -353,7 +353,7 @@ def main(): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=True if model_args.use_auth_token else None, ) if model.config.decoder_start_token_id is None: @@ -377,7 +377,7 @@ def main(): if training_args.label_smoothing_factor > 0 and not hasattr(model, "prepare_decoder_input_ids_from_labels"): logger.warning( - "label_smoothing is enabled but the `prepare_decoder_input_ids_from_labels` method is not defined for" + "label_smoothing is enabled but the `prepare_decoder_input_ids_from_labels` method is not defined for " f"`{model.__class__.__name__}`. This will lead to loss being calculated twice and will take up more memory" ) diff --git a/examples/research_projects/tapex/wikisql_utils.py b/examples/research_projects/tapex/wikisql_utils.py index 3028e81ad481fc..110b14e02fb8e0 100644 --- a/examples/research_projects/tapex/wikisql_utils.py +++ b/examples/research_projects/tapex/wikisql_utils.py @@ -30,7 +30,7 @@ def _split_thousands(delimiter, value): split = value.split(delimiter) - return len(split) > 1 and any(map(lambda x: len(x) == 3, split)) + return len(split) > 1 and any((len(x) == 3 for x in split)) def convert_to_float(value): @@ -123,7 +123,7 @@ class _Condition: def _normalize_for_match(x): - return [t for t in _TOKENIZER.findall(x.lower())] + return list(_TOKENIZER.findall(x.lower())) def _compare(operator, src, tgt): diff --git a/examples/research_projects/visual_bert/extracting_data.py b/examples/research_projects/visual_bert/extracting_data.py index 9c445be336f553..6b1342c9b11f93 100644 --- a/examples/research_projects/visual_bert/extracting_data.py +++ b/examples/research_projects/visual_bert/extracting_data.py @@ -61,7 +61,7 @@ def __init__(self, argv=sys.argv[1:]): assert outputfile is not None and not os.path.isfile(outputfile), f"{outputfile}" if subset_list is not None: with open(os.path.realpath(subset_list)) as f: - self.subset_list = set(map(lambda x: self._vqa_file_split()[0], tryload(f))) + self.subset_list = {self._vqa_file_split()[0] for x in tryload(f)} else: self.subset_list = None diff --git a/examples/research_projects/visual_bert/modeling_frcnn.py b/examples/research_projects/visual_bert/modeling_frcnn.py index 08758b1d3cac06..499de532070c91 100644 --- a/examples/research_projects/visual_bert/modeling_frcnn.py +++ b/examples/research_projects/visual_bert/modeling_frcnn.py @@ -554,8 +554,8 @@ def __init__( assert thresholds[0] > 0 thresholds.insert(0, -float("inf")) thresholds.append(float("inf")) - assert all([low <= high for (low, high) in zip(thresholds[:-1], thresholds[1:])]) - assert all([label_i in [-1, 0, 1] for label_i in labels]) + assert all(low <= high for (low, high) in zip(thresholds[:-1], thresholds[1:])) + assert 
all(label_i in [-1, 0, 1] for label_i in labels) assert len(labels) == len(thresholds) - 1 self.thresholds = thresholds self.labels = labels @@ -1095,7 +1095,7 @@ def forward(self, feature_maps, boxes): Returns: A tensor of shape(N*B, Channels, output_size, output_size) """ - x = [v for v in feature_maps.values()] + x = list(feature_maps.values()) num_level_assignments = len(self.level_poolers) assert len(x) == num_level_assignments and len(boxes) == x[0].size(0) @@ -1706,9 +1706,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs): elif os.path.isfile(pretrained_model_name_or_path) or is_remote_url(pretrained_model_name_or_path): archive_file = pretrained_model_name_or_path elif os.path.isfile(pretrained_model_name_or_path + ".index"): - assert ( - from_tf - ), "We found a TensorFlow checkpoint at {}, please set from_tf to True to load from this checkpoint".format( + assert from_tf, "We found a TensorFlow checkpoint at {}, please set from_tf to True to load from this checkpoint".format( pretrained_model_name_or_path + ".index" ) archive_file = pretrained_model_name_or_path + ".index" diff --git a/examples/research_projects/visual_bert/requirements.txt b/examples/research_projects/visual_bert/requirements.txt index 0d483b6d18923d..55e1bdb845c367 100644 --- a/examples/research_projects/visual_bert/requirements.txt +++ b/examples/research_projects/visual_bert/requirements.txt @@ -4,7 +4,7 @@ async-generator==1.10 attrs==20.2.0 backcall==0.2.0 CacheControl==0.12.6 -certifi==2022.12.7 +certifi==2023.7.22 cffi==1.14.2 chardet==3.0.4 click==7.1.2 @@ -75,7 +75,7 @@ pyzmq==19.0.2 qtconsole==4.7.7 QtPy==1.9.0 regex==2020.7.14 -requests==2.22.0 +requests==2.31.0 retrying==1.3.3 sacremoses==0.0.43 Send2Trash==1.5.0 @@ -86,11 +86,11 @@ testpath==0.4.4 tokenizers==0.8.1rc2 torch==1.6.0 torchvision==0.7.0 -tornado==6.0.4 +tornado==6.3.3 tqdm==4.48.2 traitlets git+https://github.com/huggingface/transformers.git -urllib3==1.26.5 +urllib3==1.26.18 wcwidth==0.2.5 webencodings==0.5.1 wget==3.2 diff --git a/examples/research_projects/visual_bert/utils.py b/examples/research_projects/visual_bert/utils.py index 2fc6ea2062efd2..c75f523a08eae7 100644 --- a/examples/research_projects/visual_bert/utils.py +++ b/examples/research_projects/visual_bert/utils.py @@ -28,7 +28,6 @@ from collections import OrderedDict from contextlib import contextmanager from functools import partial -from hashlib import sha256 from io import BytesIO from pathlib import Path from urllib.parse import urlparse @@ -39,6 +38,7 @@ import requests import wget from filelock import FileLock +from huggingface_hub.utils import insecure_hashlib from PIL import Image from tqdm.auto import tqdm from yaml import Loader, dump, load @@ -402,12 +402,12 @@ def _resumable_file_manager(): def url_to_filename(url, etag=None): url_bytes = url.encode("utf-8") - url_hash = sha256(url_bytes) + url_hash = insecure_hashlib.sha256(url_bytes) filename = url_hash.hexdigest() if etag: etag_bytes = etag.encode("utf-8") - etag_hash = sha256(etag_bytes) + etag_hash = insecure_hashlib.sha256(etag_bytes) filename += "." 
+ etag_hash.hexdigest() if url.endswith(".h5"): diff --git a/examples/research_projects/vqgan-clip/README.md b/examples/research_projects/vqgan-clip/README.md index aef95093542208..a74bf9209b0a9a 100644 --- a/examples/research_projects/vqgan-clip/README.md +++ b/examples/research_projects/vqgan-clip/README.md @@ -21,7 +21,7 @@ To install locally: In the root of the repo run: -``` +```bash conda create -n vqganclip python=3.8 conda activate vqganclip git-lfs install @@ -30,7 +30,7 @@ pip install -r requirements.txt ``` ### Generate new images -``` +```python from VQGAN_CLIP import VQGAN_CLIP vqgan_clip = VQGAN_CLIP() vqgan_clip.generate("a picture of a smiling woman") @@ -41,7 +41,7 @@ To get a test image, run `git clone https://huggingface.co/datasets/erwann/vqgan-clip-pic test_images` To edit: -``` +```python from VQGAN_CLIP import VQGAN_CLIP vqgan_clip = VQGAN_CLIP() diff --git a/examples/research_projects/vqgan-clip/VQGAN_CLIP.py b/examples/research_projects/vqgan-clip/VQGAN_CLIP.py index b5a23c15b2b1a9..1bfbc4cd5c36f3 100644 --- a/examples/research_projects/vqgan-clip/VQGAN_CLIP.py +++ b/examples/research_projects/vqgan-clip/VQGAN_CLIP.py @@ -99,7 +99,7 @@ def make_animation(self, input_path=None, output_path=None, total_duration=5, ex output_path = "./animation.gif" if input_path is None: input_path = self.save_path - paths = list(sorted(glob(input_path + "/*"))) + paths = sorted(glob(input_path + "/*")) if not len(paths): raise ValueError( "No images found in save path, aborting (did you pass save_intermediate=True to the generate" @@ -178,7 +178,7 @@ def _init_logging(self, positive_prompts, negative_prompts, image_path): wandb.init(reinit=True, project="face-editor") wandb.config.update({"Positive Prompts": positive_prompts}) wandb.config.update({"Negative Prompts": negative_prompts}) - wandb.config.update(dict(lr=self.lr, iterations=self.iterations)) + wandb.config.update({"lr": self.lr, "iterations": self.iterations}) if image_path: image = Image.open(image_path) image = image.resize((256, 256)) diff --git a/examples/research_projects/vqgan-clip/loaders.py b/examples/research_projects/vqgan-clip/loaders.py index e8650f72128456..88513bcb69180d 100644 --- a/examples/research_projects/vqgan-clip/loaders.py +++ b/examples/research_projects/vqgan-clip/loaders.py @@ -47,7 +47,7 @@ def get_obj_from_str(string, reload=False): def instantiate_from_config(config): if "target" not in config: raise KeyError("Expected key `target` to instantiate.") - return get_obj_from_str(config["target"])(**config.get("params", dict())) + return get_obj_from_str(config["target"])(**config.get("params", {})) def load_model_from_config(config, sd, gpu=True, eval_mode=True): diff --git a/examples/research_projects/wav2vec2/FINE_TUNE_XLSR_WAV2VEC2.md b/examples/research_projects/wav2vec2/FINE_TUNE_XLSR_WAV2VEC2.md index d8a4e110873015..52553532fe08ab 100644 --- a/examples/research_projects/wav2vec2/FINE_TUNE_XLSR_WAV2VEC2.md +++ b/examples/research_projects/wav2vec2/FINE_TUNE_XLSR_WAV2VEC2.md @@ -138,20 +138,20 @@ For bigger datasets, we recommend to train Wav2Vec2 locally instead of in a goog First, you need to clone the `transformers` repo with: -``` +```bash $ git clone https://github.com/huggingface/transformers.git ``` Second, head over to the `examples/research_projects/wav2vec2` directory, where the `run_common_voice.py` script is located. -``` +```bash $ cd transformers/examples/research_projects/wav2vec2 ``` Third, install the required packages. 
The packages are listed in the `requirements.txt` file and can be installed with -``` +```bash $ pip install -r requirements.txt ``` @@ -259,7 +259,7 @@ Then and add the following files that fully define a XLSR-Wav2Vec2 checkpoint in - `pytorch_model.bin` Having added the above files, you should run the following to push files to your model repository. -``` +```bash git add . && git commit -m "Add model files" && git push ``` diff --git a/examples/research_projects/wav2vec2/README.md b/examples/research_projects/wav2vec2/README.md index 1dcd8dcc283538..cc667d6567ff95 100644 --- a/examples/research_projects/wav2vec2/README.md +++ b/examples/research_projects/wav2vec2/README.md @@ -134,7 +134,7 @@ which helps with capping GPU memory usage. To learn how to deploy Deepspeed Integration please refer to [this guide](https://huggingface.co/transformers/main/main_classes/deepspeed.html#deepspeed-trainer-integration). But to get started quickly all you need is to install: -``` +```bash pip install deepspeed ``` and then use the default configuration files in this directory: @@ -148,7 +148,7 @@ Here are examples of how you can use DeepSpeed: ZeRO-2: -``` +```bash PYTHONPATH=../../../src deepspeed --num_gpus 2 \ run_asr.py \ --output_dir=output_dir --num_train_epochs=2 --per_device_train_batch_size=2 \ @@ -162,7 +162,7 @@ run_asr.py \ ``` For ZeRO-2 with more than 1 gpu you need to use (which is already in the example configuration file): -``` +```json "zero_optimization": { ... "find_unused_parameters": true, @@ -172,7 +172,7 @@ For ZeRO-2 with more than 1 gpu you need to use (which is already in the example ZeRO-3: -``` +```bash PYTHONPATH=../../../src deepspeed --num_gpus 2 \ run_asr.py \ --output_dir=output_dir --num_train_epochs=2 --per_device_train_batch_size=2 \ @@ -192,7 +192,7 @@ It is recommended to pre-train Wav2Vec2 with Trainer + Deepspeed (please refer t Here is an example of how you can use DeepSpeed ZeRO-2 to pretrain a small Wav2Vec2 model: -``` +```bash PYTHONPATH=../../../src deepspeed --num_gpus 4 run_pretrain.py \ --output_dir="./wav2vec2-base-libri-100h" \ --num_train_epochs="3" \ @@ -238,7 +238,7 @@ Output directory will contain 0000.txt and 0001.txt. 
Each file will have format #### Run command -``` +```bash python alignment.py \ --model_name="arijitx/wav2vec2-xls-r-300m-bengali" \ --wav_dir="./wavs" diff --git a/examples/research_projects/wav2vec2/run_asr.py b/examples/research_projects/wav2vec2/run_asr.py index 15d2f12c7ddb56..6535e3485d177e 100755 --- a/examples/research_projects/wav2vec2/run_asr.py +++ b/examples/research_projects/wav2vec2/run_asr.py @@ -254,7 +254,7 @@ class DataCollatorCTCWithPadding: pad_to_multiple_of_labels: Optional[int] = None def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]: - # split inputs and labels since they have to be of different lenghts and need + # split inputs and labels since they have to be of different lengths and need # different padding methods input_features = [{"input_values": feature["input_values"]} for feature in features] label_features = [{"input_ids": feature["labels"]} for feature in features] @@ -365,7 +365,7 @@ def main(): target_sr = processor.feature_extractor.sampling_rate if data_args.target_feature_extractor_sampling_rate else None vocabulary_chars_str = "".join(t for t in processor.tokenizer.get_vocab().keys() if len(t) == 1) vocabulary_text_cleaner = re.compile( # remove characters not in vocabulary - f"[^\s{re.escape(vocabulary_chars_str)}]", # allow space in addition to chars in vocabulary + rf"[^\s{re.escape(vocabulary_chars_str)}]", # allow space in addition to chars in vocabulary flags=re.IGNORECASE if processor.tokenizer.do_lower_case else 0, ) text_updates = [] diff --git a/examples/research_projects/wav2vec2/run_common_voice.py b/examples/research_projects/wav2vec2/run_common_voice.py index 01a877a8092ecf..a7f57960d89f2c 100644 --- a/examples/research_projects/wav2vec2/run_common_voice.py +++ b/examples/research_projects/wav2vec2/run_common_voice.py @@ -69,19 +69,19 @@ class ModelArguments: hidden_dropout: Optional[float] = field( default=0.1, metadata={ - "help": "The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler." + "help": "The dropout probability for all fully connected layers in the embeddings, encoder, and pooler." }, ) feat_proj_dropout: Optional[float] = field( default=0.1, - metadata={"help": "The dropout probabilitiy for all 1D convolutional layers in feature extractor."}, + metadata={"help": "The dropout probability for all 1D convolutional layers in feature extractor."}, ) mask_time_prob: Optional[float] = field( default=0.05, metadata={ "help": ( - "Propability of each feature vector along the time axis to be chosen as the start of the vector" - "span to be masked. Approximately ``mask_time_prob * sequence_length // mask_time_length`` feature" + "Propability of each feature vector along the time axis to be chosen as the start of the vector " + "span to be masked. Approximately ``mask_time_prob * sequence_length // mask_time_length`` feature " "vectors will be masked along the time axis. This is only relevant if ``apply_spec_augment is True``." 
) }, @@ -173,7 +173,7 @@ class DataCollatorCTCWithPadding: pad_to_multiple_of_labels: Optional[int] = None def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]: - # split inputs and labels since they have to be of different lenghts and need + # split inputs and labels since they have to be of different lengths and need # different padding methods input_features = [{"input_values": feature["input_values"]} for feature in features] label_features = [{"input_ids": feature["labels"]} for feature in features] diff --git a/examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py b/examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py index 8f181409d6d7e6..d44145f3e0c12f 100644 --- a/examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py +++ b/examples/research_projects/wav2vec2/test_wav2vec2_deepspeed.py @@ -32,7 +32,7 @@ from parameterized import parameterized # noqa from transformers import TrainingArguments, is_torch_available # noqa -from transformers.deepspeed import is_deepspeed_available # noqa +from transformers.integrations.deepspeed import is_deepspeed_available # noqa from transformers.file_utils import WEIGHTS_NAME # noqa from transformers.testing_utils import ( # noqa CaptureLogger, @@ -51,7 +51,7 @@ set_seed(42) -models = dict(base="patrickvonplaten/wav2vec2_tiny_random", robust="patrickvonplaten/wav2vec2_tiny_random_robust") +models = {"base": "patrickvonplaten/wav2vec2_tiny_random", "robust": "patrickvonplaten/wav2vec2_tiny_random_robust"} ZERO2 = "zero2" ZERO3 = "zero3" diff --git a/examples/research_projects/xtreme-s/run_xtreme_s.py b/examples/research_projects/xtreme-s/run_xtreme_s.py index 38ed3376ecc0ba..e01ccbf4488dff 100644 --- a/examples/research_projects/xtreme-s/run_xtreme_s.py +++ b/examples/research_projects/xtreme-s/run_xtreme_s.py @@ -116,8 +116,8 @@ class ModelArguments: default=0.05, metadata={ "help": ( - "Probability of each feature vector along the time axis to be chosen as the start of the vector" - "span to be masked. Approximately ``mask_time_prob * sequence_length // mask_time_length`` feature" + "Probability of each feature vector along the time axis to be chosen as the start of the vector " + "span to be masked. Approximately ``mask_time_prob * sequence_length // mask_time_length`` feature " "vectors will be masked along the time axis." 
) }, @@ -335,7 +335,7 @@ class SpeechDataCollatorWithPadding: pad_to_multiple_of_labels: Optional[int] = None def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]: - # split inputs and labels since they have to be of different lenghts and need + # split inputs and labels since they have to be of different lengths and need # different padding methods input_features = [{"input_values": feature["input_values"]} for feature in features] @@ -400,7 +400,7 @@ def extract_all_chars(batch): | (set(vocabs["predict"]["vocab"][0]) if "predict" in vocabs else set()) ) - vocab_dict = {v: k for k, v in enumerate(sorted(list(vocab_set)))} + vocab_dict = {v: k for k, v in enumerate(sorted(vocab_set))} # replace white space with delimiter token if word_delimiter_token is not None: @@ -455,7 +455,7 @@ def main(): # Log on each process the small summary: logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" + f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, " f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" ) # Set the verbosity to info of the Transformers logger (on main process only): @@ -502,7 +502,7 @@ def main(): data_args.dataset_name, config_name, split=data_args.train_split_name, - use_auth_token=data_args.use_auth_token, + token=data_args.use_auth_token, cache_dir=model_args.cache_dir, ) @@ -528,7 +528,7 @@ def main(): data_args.dataset_name, config_name, split=data_args.eval_split_name, - use_auth_token=data_args.use_auth_token, + token=data_args.use_auth_token, cache_dir=model_args.cache_dir, ) @@ -540,7 +540,7 @@ def main(): data_args.dataset_name, config_name, split=data_args.predict_split_name, - use_auth_token=data_args.use_auth_token, + token=data_args.use_auth_token, cache_dir=model_args.cache_dir, ) @@ -595,7 +595,7 @@ def remove_special_characters(batch): # 3. 
Next, let's load the config as we might need it to create # the tokenizer config = AutoConfig.from_pretrained( - model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_auth_token=data_args.use_auth_token + model_args.model_name_or_path, cache_dir=model_args.cache_dir, token=data_args.use_auth_token ) if is_text_target: @@ -651,11 +651,11 @@ def remove_special_characters(batch): if is_text_target: tokenizer = AutoTokenizer.from_pretrained( tokenizer_name_or_path, - use_auth_token=data_args.use_auth_token, + token=data_args.use_auth_token, **tokenizer_kwargs, ) feature_extractor = AutoFeatureExtractor.from_pretrained( - model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_auth_token=data_args.use_auth_token + model_args.model_name_or_path, cache_dir=model_args.cache_dir, token=data_args.use_auth_token ) # adapt config @@ -694,14 +694,14 @@ def remove_special_characters(batch): model_args.model_name_or_path, cache_dir=model_args.cache_dir, config=config, - use_auth_token=data_args.use_auth_token, + token=data_args.use_auth_token, ) elif config.is_encoder_decoder: model = AutoModelForSpeechSeq2Seq.from_pretrained( model_args.model_name_or_path, cache_dir=model_args.cache_dir, config=config, - use_auth_token=data_args.use_auth_token, + token=data_args.use_auth_token, ) if model.config.decoder_start_token_id is None: raise ValueError("Make sure that `config.decoder_start_token_id` is correctly defined") @@ -710,7 +710,7 @@ def remove_special_characters(batch): model_args.model_name_or_path, cache_dir=model_args.cache_dir, config=config, - use_auth_token=data_args.use_auth_token, + token=data_args.use_auth_token, ) # freeze encoder diff --git a/examples/research_projects/zero-shot-distillation/README.md b/examples/research_projects/zero-shot-distillation/README.md index cbc33071f0c9b4..14b6a8ea07f7ae 100644 --- a/examples/research_projects/zero-shot-distillation/README.md +++ b/examples/research_projects/zero-shot-distillation/README.md @@ -21,7 +21,7 @@ classification performance to the original zero-shot model A teacher NLI model can be distilled to a more efficient student model by running [`distill_classifier.py`](https://github.com/huggingface/transformers/blob/main/examples/research_projects/zero-shot-distillation/distill_classifier.py): -``` +```bash python distill_classifier.py \ --data_file \ --class_names_file \ diff --git a/examples/research_projects/zero-shot-distillation/distill_classifier.py b/examples/research_projects/zero-shot-distillation/distill_classifier.py index 16d52214376eed..56181208477767 100644 --- a/examples/research_projects/zero-shot-distillation/distill_classifier.py +++ b/examples/research_projects/zero-shot-distillation/distill_classifier.py @@ -41,7 +41,7 @@ class TeacherModelArguments: default="This example is {}.", metadata={ "help": ( - "Template used to turn class names into mock hypotheses for teacher NLI model. Must include {{}}" + "Template used to turn class names into mock hypotheses for teacher NLI model. Must include {{}} " "where class name is inserted." ) }, @@ -53,7 +53,7 @@ class TeacherModelArguments: default=False, metadata={ "help": ( - "Allow multiple classes to be true rather than forcing them to sum to 1 (sometimes called" + "Allow multiple classes to be true rather than forcing them to sum to 1 (sometimes called " "multi-class multi-label classification)." 
) }, @@ -98,7 +98,7 @@ class DistillTrainingArguments(TrainingArguments): default=True, metadata={ "help": ( - "Whether to evaluate the agreement of the final student predictions and the teacher predictions" + "Whether to evaluate the agreement of the final student predictions and the teacher predictions " "after training." ) }, @@ -107,7 +107,7 @@ class DistillTrainingArguments(TrainingArguments): default=0, metadata={ "help": ( - "Limit the total amount of checkpoints." + "Limit the total amount of checkpoints. " "Deletes the older checkpoints in the output_dir. Default is 0 (no checkpoints)." ) }, @@ -303,7 +303,7 @@ def main(): student_args.student_name_or_path, num_labels=len(class_names) ) tokenizer = AutoTokenizer.from_pretrained(student_args.student_name_or_path, use_fast=data_args.use_fast_tokenizer) - model.config.id2label = {i: label for i, label in enumerate(class_names)} + model.config.id2label = dict(enumerate(class_names)) model.config.label2id = {label: i for i, label in enumerate(class_names)} # 4. train student on teacher predictions diff --git a/examples/run_on_remote.py b/examples/run_on_remote.py new file mode 100644 index 00000000000000..46f87065d761a9 --- /dev/null +++ b/examples/run_on_remote.py @@ -0,0 +1,71 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2021 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse +import shlex + +import runhouse as rh + + +if __name__ == "__main__": + # Refer to https://runhouse-docs.readthedocs-hosted.com/en/latest/api/python/cluster.html#hardware-setup for cloud access + # setup instructions, if using on-demand hardware + + # If user passes --user --host --key_path , fill them in as BYO cluster + # If user passes --instance --provider , fill them in as on-demand cluster + # Throw an error if user passes both BYO and on-demand cluster args + # Otherwise, use default values + parser = argparse.ArgumentParser() + parser.add_argument("--user", type=str, default="ubuntu") + parser.add_argument("--host", type=str, default="localhost") + parser.add_argument("--key_path", type=str, default=None) + parser.add_argument("--instance", type=str, default="V100:1") + parser.add_argument("--provider", type=str, default="cheapest") + parser.add_argument("--use_spot", type=bool, default=False) + parser.add_argument("--example", type=str, default="pytorch/text-generation/run_generation.py") + args, unknown = parser.parse_known_args() + if args.host != "localhost": + if args.instance != "V100:1" or args.provider != "cheapest": + raise ValueError("Cannot specify both BYO and on-demand cluster args") + cluster = rh.cluster( + name="rh-cluster", ips=[args.host], ssh_creds={"ssh_user": args.user, "ssh_private_key": args.key_path} + ) + else: + cluster = rh.cluster( + name="rh-cluster", instance_type=args.instance, provider=args.provider, use_spot=args.use_spot + ) + example_dir = args.example.rsplit("/", 1)[0] + + # Set up remote environment + cluster.install_packages(["pip:./"]) # Installs transformers from local source + # Note transformers is copied into the home directory on the remote machine, so we can install from there + cluster.run([f"pip install -r transformers/examples/{example_dir}/requirements.txt"]) + cluster.run(["pip install torch --upgrade --extra-index-url https://download.pytorch.org/whl/cu117"]) + + # Run example. You can bypass the CLI wrapper and paste your own code here. + cluster.run([f'python transformers/examples/{args.example} {" ".join(shlex.quote(arg) for arg in unknown)}']) + + # Alternatively, we can just import and run a training function (especially if there's no wrapper CLI): + # from my_script... import train + # reqs = ['pip:./', 'torch', 'datasets', 'accelerate', 'evaluate', 'tqdm', 'scipy', 'scikit-learn', 'tensorboard'] + # launch_train_gpu = rh.function(fn=train, + # system=gpu, + # reqs=reqs, + # name='train_bert_glue') + # + # We can pass in arguments just like we would to a function: + # launch_train_gpu(num_epochs = 3, lr = 2e-5, seed = 42, batch_size = 16 + # stream_logs=True) diff --git a/examples/tensorflow/README.md b/examples/tensorflow/README.md index 7936e3d4650950..2c4115b369f75a 100644 --- a/examples/tensorflow/README.md +++ b/examples/tensorflow/README.md @@ -15,7 +15,7 @@ limitations under the License. # Examples -This folder contains actively maintained examples of use of 🤗 Transformers organized into different ML tasks. All examples in this folder are **TensorFlow** examples, and are written using native Keras rather than classes like `TFTrainer`, which we now consider deprecated. If you've previously only used 🤗 Transformers via `TFTrainer`, we highly recommend taking a look at the new style - we think it's a big improvement! +This folder contains actively maintained examples of the use of 🤗 Transformers organized into different ML tasks. 
All examples in this folder are **TensorFlow** examples and are written using native Keras. If you've previously only used 🤗 Transformers via `TFTrainer`, we highly recommend taking a look at the new style - we think it's a big improvement! In addition, all scripts here now support the [🤗 Datasets](https://github.com/huggingface/datasets) library - you can grab entire datasets just by changing one command-line argument! @@ -32,13 +32,13 @@ Here is the list of all our examples: | Task | Example datasets | |---|---| | [**`language-modeling`**](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/language-modeling) | WikiText-2 -| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/multiple-choice) | SWAG +| [**`multiple-choice`**](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/multiple-choice) | SWAG | [**`question-answering`**](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/question-answering) | SQuAD -| [**`summarization`**](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/summarization) | XSum +| [**`summarization`**](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/summarization) | XSum | [**`text-classification`**](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/text-classification) | GLUE | [**`token-classification`**](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/token-classification) | CoNLL NER | [**`translation`**](https://github.com/huggingface/transformers/tree/main/examples/tensorflow/translation) | WMT ## Coming soon -- **Colab notebooks** to easily run through these scripts! +- **Colab notebooks** to easily run through these scripts! 
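[Editor's illustration, not part of the patch] The updated TensorFlow README above says the examples are written with native Keras and pull datasets through 🤗 Datasets. As a minimal sketch of that "new style", the snippet below shows the typical flow (load a dataset, tokenize it, convert it with `prepare_tf_dataset`, then call `model.fit`). The checkpoint and dataset names here are illustrative placeholders chosen by the editor, not values taken from the patch.

```python
# Minimal sketch of the native-Keras workflow the TensorFlow examples follow.
# "distilbert-base-uncased" and GLUE/CoLA are placeholder choices for illustration.
from datasets import load_dataset
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

dataset = load_dataset("glue", "cola")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True)

dataset = dataset.map(tokenize, batched=True)

# prepare_tf_dataset handles padding/collation and returns a tf.data.Dataset
tf_train = model.prepare_tf_dataset(
    dataset["train"], batch_size=16, shuffle=True, tokenizer=tokenizer
)

# As the example scripts note, Transformers TF models compute their task loss
# internally when labels are passed and no loss is given to compile().
model.compile(optimizer="adam")
model.fit(tf_train, epochs=1)
```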
diff --git a/examples/tensorflow/_tests_requirements.txt b/examples/tensorflow/_tests_requirements.txt index 837ce6d0d16dbf..6971795ce4ea19 100644 --- a/examples/tensorflow/_tests_requirements.txt +++ b/examples/tensorflow/_tests_requirements.txt @@ -1,10 +1,10 @@ -tensorflow<2.11 +tensorflow<2.16 +keras<2.16 tensorboard scikit-learn seqeval psutil sacrebleu >= 1.4.12 -git+https://github.com/huggingface/accelerate@main#egg=accelerate rouge-score tensorflow_datasets matplotlib @@ -16,7 +16,7 @@ nltk pandas datasets >= 1.13.3 fire -pytest +pytest<8.0.1 conllu sentencepiece != 0.1.92 protobuf diff --git a/examples/tensorflow/benchmarking/README.md b/examples/tensorflow/benchmarking/README.md index 7099ed9f6b3d3d..03e174770d1077 100644 --- a/examples/tensorflow/benchmarking/README.md +++ b/examples/tensorflow/benchmarking/README.md @@ -22,5 +22,5 @@ If you would like to list benchmark results on your favorite models of the [mode | Benchmark description | Results | Environment info | Author | |:----------|:-------------|:-------------|------:| -| PyTorch Benchmark on inference for `bert-base-cased` |[memory](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/inference_memory.csv) | [env](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/env.csv) | [Partick von Platen](https://github.com/patrickvonplaten) | -| PyTorch Benchmark on inference for `bert-base-cased` |[time](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/inference_time.csv) | [env](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/env.csv) | [Partick von Platen](https://github.com/patrickvonplaten) | +| PyTorch Benchmark on inference for `google-bert/bert-base-cased` |[memory](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/inference_memory.csv) | [env](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/env.csv) | [Partick von Platen](https://github.com/patrickvonplaten) | +| PyTorch Benchmark on inference for `google-bert/bert-base-cased` |[time](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/inference_time.csv) | [env](https://github.com/patrickvonplaten/files_to_link_to/blob/master/bert_benchmark/env.csv) | [Partick von Platen](https://github.com/patrickvonplaten) | diff --git a/examples/tensorflow/benchmarking/plot_csv_file.py b/examples/tensorflow/benchmarking/plot_csv_file.py index 1a0ae735d8c671..9a9ad9c670470e 100644 --- a/examples/tensorflow/benchmarking/plot_csv_file.py +++ b/examples/tensorflow/benchmarking/plot_csv_file.py @@ -83,7 +83,7 @@ def can_convert_to_float(string): class Plot: def __init__(self, args): self.args = args - self.result_dict = defaultdict(lambda: dict(bsz=[], seq_len=[], result={})) + self.result_dict = defaultdict(lambda: {"bsz": [], "seq_len": [], "result": {}}) with open(self.args.csv_file, newline="") as csv_file: reader = csv.DictReader(csv_file) @@ -116,8 +116,8 @@ def plot(self): axis.set_major_formatter(ScalarFormatter()) for model_name_idx, model_name in enumerate(self.result_dict.keys()): - batch_sizes = sorted(list(set(self.result_dict[model_name]["bsz"]))) - sequence_lengths = sorted(list(set(self.result_dict[model_name]["seq_len"]))) + batch_sizes = sorted(set(self.result_dict[model_name]["bsz"])) + sequence_lengths = sorted(set(self.result_dict[model_name]["seq_len"])) results = self.result_dict[model_name]["result"] (x_axis_array, inner_loop_array) = ( diff --git 
a/examples/tensorflow/contrastive-image-text/README.md b/examples/tensorflow/contrastive-image-text/README.md new file mode 100644 index 00000000000000..29d9b897734cb2 --- /dev/null +++ b/examples/tensorflow/contrastive-image-text/README.md @@ -0,0 +1,81 @@ + + +# TFVisionTextDualEncoder and CLIP model training examples + +The following example showcases how to train a CLIP-like vision-text dual encoder model +using a pre-trained vision and text encoder. + +Such a model can be used for natural language image search and potentially zero-shot image classification. +The model is inspired by [CLIP](https://openai.com/blog/clip/), introduced by Alec Radford et al. +The idea is to train a vision encoder and a text encoder jointly to project the representation of images and their +captions into the same embedding space, such that the caption embeddings are located near the embeddings +of the images they describe. + +### Download COCO dataset (2017) +This example uses COCO dataset (2017) through a custom dataset script, which requires users to manually download the +COCO dataset before training. + +```bash +mkdir data +cd data +wget http://images.cocodataset.org/zips/train2017.zip +wget http://images.cocodataset.org/zips/val2017.zip +wget http://images.cocodataset.org/zips/test2017.zip +wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip +wget http://images.cocodataset.org/annotations/image_info_test2017.zip +cd .. +``` + +Having downloaded COCO dataset manually you should be able to load with the `ydshieh/coc_dataset_script` dataset loading script: + +```py +import os +import datasets + +COCO_DIR = os.path.join(os.getcwd(), "data") +ds = datasets.load_dataset("ydshieh/coco_dataset_script", "2017", data_dir=COCO_DIR) +``` + +### Create a model from a vision encoder model and a text encoder model +We can either load a CLIP-like vision-text dual encoder model from an existing dual encoder model, or +by using a pre-trained vision encoder model and a pre-trained text encoder model. + +If you wish to load an existing dual encoder model, please use the `--model_name_or_path` argument. If +you want to use separate pre-trained vision and text models, please use the +`--vision_model_name_or_path` and `--text_model_name_or_path` arguments instead. 
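[Editor's illustration, not part of the patch] To make the second option above concrete, here is a short sketch of building the dual encoder directly in Python with `TFVisionTextDualEncoderModel.from_vision_text_pretrained`, using the same two checkpoints as the training command in the next section. Pairing the CLIP image processor with the RoBERTa tokenizer in a `VisionTextDualEncoderProcessor` is an assumed convenience step, shown only for completeness.

```py
# Illustrative sketch: build a vision-text dual encoder from two pre-trained
# checkpoints, mirroring --vision_model_name_or_path / --text_model_name_or_path.
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    TFVisionTextDualEncoderModel,
    VisionTextDualEncoderProcessor,
)

model = TFVisionTextDualEncoderModel.from_vision_text_pretrained(
    "openai/clip-vit-base-patch32", "FacebookAI/roberta-base"
)

# Assumed pairing: the vision encoder's image processor with the text encoder's tokenizer.
image_processor = AutoImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")
processor = VisionTextDualEncoderProcessor(image_processor, tokenizer)

# Save both so they can be reloaded (or fine-tuned) from one directory.
model.save_pretrained("clip-roberta")
processor.save_pretrained("clip-roberta")
```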
+ +### Train the model +Finally, we can run the example script to train the model: + +```bash +python examples/tensorflow/contrastive-image-text/run_clip.py \ + --output_dir ./clip-roberta-finetuned \ + --vision_model_name_or_path openai/clip-vit-base-patch32 \ + --text_model_name_or_path FacebookAI/roberta-base \ + --data_dir $PWD/data \ + --dataset_name ydshieh/coco_dataset_script \ + --dataset_config_name=2017 \ + --image_column image_path \ + --caption_column caption \ + --remove_unused_columns=False \ + --do_train --do_eval \ + --per_device_train_batch_size="64" \ + --per_device_eval_batch_size="64" \ + --learning_rate="5e-5" --warmup_steps="0" --weight_decay 0.1 \ + --overwrite_output_dir \ + --push_to_hub +``` diff --git a/examples/tensorflow/contrastive-image-text/requirements.txt b/examples/tensorflow/contrastive-image-text/requirements.txt new file mode 100644 index 00000000000000..ef4bf188bff203 --- /dev/null +++ b/examples/tensorflow/contrastive-image-text/requirements.txt @@ -0,0 +1,2 @@ +tensorflow>=2.6.0 +datasets>=1.8.0 \ No newline at end of file diff --git a/examples/tensorflow/contrastive-image-text/run_clip.py b/examples/tensorflow/contrastive-image-text/run_clip.py new file mode 100644 index 00000000000000..341565d357f67a --- /dev/null +++ b/examples/tensorflow/contrastive-image-text/run_clip.py @@ -0,0 +1,626 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2023 The HuggingFace Team All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Training a CLIP like dual encoder models using text and vision encoders in the library. + +The script can be used to train CLIP like models for languages other than English by using +a text encoder pre-trained in the desired language. Currently this script supports the following vision +and text models: +Vision models: ViT(https://huggingface.co/models?filter=vit), CLIP (https://huggingface.co/models?filter=clip) +Text models: BERT, ROBERTa (https://huggingface.co/models?filter=fill-mask) +""" + +import logging +import os +import sys +import warnings +from dataclasses import dataclass, field +from typing import Optional + +import tensorflow as tf +from datasets import load_dataset +from PIL import Image + +import transformers +from transformers import ( + AutoImageProcessor, + AutoTokenizer, + HfArgumentParser, + PushToHubCallback, + TFAutoModel, + TFTrainingArguments, + TFVisionTextDualEncoderModel, + create_optimizer, +) +from transformers.utils import check_min_version, send_example_telemetry +from transformers.utils.versions import require_version + + +logger = logging.getLogger(__name__) + +# Will error if the minimal version of Transformers is not installed. Remove at your own risks. +check_min_version("4.38.0.dev0") + +require_version( + "datasets>=1.8.0", "To fix: pip install -r examples/tensorflow/contrastive-image-text/requirements.txt" +) + + +@dataclass +class ModelArguments: + """ + Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch. 
+ """ + + model_name_or_path: str = field( + metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}, default=None + ) + vision_model_name_or_path: str = field( + metadata={"help": "Path to pretrained image model or model identifier from huggingface.co/models"}, + default=None, + ) + text_model_name_or_path: str = field( + metadata={"help": "Path to pretrained text model or model identifier from huggingface.co/models"}, default=None + ) + config_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"} + ) + tokenizer_name: Optional[str] = field( + default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"} + ) + image_processor_name: str = field(default=None, metadata={"help": "Name or path of preprocessor config."}) + cache_dir: Optional[str] = field( + default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"} + ) + model_revision: str = field( + default="main", + metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, + ) + use_fast_tokenizer: bool = field( + default=True, + metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."}, + ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) + use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( + default=False, + metadata={ + "help": ( + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." + ) + }, + ) + freeze_vision_model: bool = field( + default=False, metadata={"help": "Whether to freeze the vision model parameters or not."} + ) + freeze_text_model: bool = field( + default=False, metadata={"help": "Whether to freeze the text model parameters or not."} + ) + + +@dataclass +class DataTrainingArguments: + """ + Arguments pertaining to what data we are going to input our model for training and eval. 
+ """ + + dataset_name: Optional[str] = field( + default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."} + ) + dataset_config_name: Optional[str] = field( + default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."} + ) + data_dir: Optional[str] = field(default=None, metadata={"help": "The data directory containing input files."}) + image_column: Optional[str] = field( + default="image_path", + metadata={"help": "The name of the column in the datasets containing the full image file paths."}, + ) + caption_column: Optional[str] = field( + default="caption", + metadata={"help": "The name of the column in the datasets containing the image captions."}, + ) + train_file: Optional[str] = field( + default=None, metadata={"help": "The input training data file (a jsonlines file)."} + ) + validation_file: Optional[str] = field( + default=None, + metadata={"help": "An optional input evaluation data file (a jsonlines file)."}, + ) + test_file: Optional[str] = field( + default=None, + metadata={"help": "An optional input testing data file (a jsonlines file)."}, + ) + max_seq_length: Optional[int] = field( + default=128, + metadata={ + "help": ( + "The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded." + ) + }, + ) + max_train_samples: Optional[int] = field( + default=None, + metadata={ + "help": ( + "For debugging purposes or quicker training, truncate the number of training examples to this " + "value if set." + ) + }, + ) + max_eval_samples: Optional[int] = field( + default=None, + metadata={ + "help": ( + "For debugging purposes or quicker training, truncate the number of evaluation examples to this " + "value if set." + ) + }, + ) + overwrite_cache: bool = field( + default=False, metadata={"help": "Overwrite the cached training and evaluation sets"} + ) + preprocessing_num_workers: Optional[int] = field( + default=None, + metadata={"help": "The number of processes to use for the preprocessing."}, + ) + + def __post_init__(self): + if self.dataset_name is None and self.train_file is None and self.validation_file is None: + raise ValueError("Need either a dataset name or a training/validation file.") + else: + if self.train_file is not None: + extension = self.train_file.split(".")[-1] + assert extension in ["csv", "json"], "`train_file` should be a csv or a json file." + if self.validation_file is not None: + extension = self.validation_file.split(".")[-1] + assert extension in ["csv", "json"], "`validation_file` should be a csv or a json file." + if self.validation_file is not None: + extension = self.validation_file.split(".")[-1] + assert extension == "json", "`validation_file` should be a json file." + + +dataset_name_mapping = { + "image_caption_dataset.py": ("image_path", "caption"), +} + + +def crop_to_square(image): + height, width = tf.shape(image)[0], tf.shape(image)[1] + if height > width: + image = tf.image.crop_to_bounding_box(image, (height - width) // 2, 0, width, width) + elif width > height: + image = tf.image.crop_to_bounding_box(image, 0, (width - height) // 2, height, height) + return image + + +def load_as_tf_dataset(dataset, image_column, image_size, mean, std, batch_size, shuffle): + dataset = dataset.with_format("tensorflow")[:] # Load the dataset as tensor slices, but not the images yet! 
+ tf_dataset = tf.data.Dataset.from_tensor_slices(dataset) + + def load_image(sample): + image_path = sample[image_column] + image = tf.io.read_file(image_path) + image = tf.image.decode_image(image, channels=3, expand_animations=False) + image = crop_to_square(image) + image = tf.image.resize(image, [image_size, image_size], method="bicubic", antialias=True) + image = image / 255.0 + image = (image - mean) / std + image = tf.transpose(image, perm=[2, 0, 1]) # Convert to channels-first + sample["pixel_values"] = image + del sample[image_column] + return sample + + if shuffle: + tf_dataset = tf_dataset.shuffle(len(tf_dataset)) + tf_dataset = tf_dataset.map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE) + tf_dataset = tf_dataset.batch(batch_size, drop_remainder=shuffle) + tf_dataset = tf_dataset.prefetch(tf.data.experimental.AUTOTUNE) + + return tf_dataset + + +def main(): + # 1. Parse input arguments + # See all possible arguments in src/transformers/training_args.py + # or by passing the --help flag to this script. + # We now keep distinct sets of args, for a cleaner separation of concerns. + + parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TFTrainingArguments)) + if len(sys.argv) == 2 and sys.argv[1].endswith(".json"): + # If we pass only one argument to the script and it's the path to a json file, + # let's parse it to get our arguments. + model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1])) + else: + model_args, data_args, training_args = parser.parse_args_into_dataclasses() + + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + + if model_args.model_name_or_path is not None: + if model_args.vision_model_name_or_path is not None or model_args.text_model_name_or_path is not None: + raise ValueError( + "If using model_name_or_path, you cannot specify separate image/text model paths as well!" + ) + + if model_args.vision_model_name_or_path is not None or model_args.text_model_name_or_path is not None: + if model_args.model_name_or_path is not None: + raise ValueError( + "If using separate image/text model paths, you cannot specify model_name_or_path as well!" + ) + if not (model_args.vision_model_name_or_path is not None and model_args.text_model_name_or_path is not None): + raise ValueError( + "If using separate image/text model paths, you must specify both vision_model_name_or_path " + "and text_model_name_or_path!" + ) + + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The + # information sent is the one passed as arguments along with your Python/TensorFlow versions. + send_example_telemetry("run_clip", model_args, data_args, framework="tensorflow") + + # 2. Setup logging + logging.basicConfig( + format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", + datefmt="%m/%d/%Y %H:%M:%S", + handlers=[logging.StreamHandler(sys.stdout)], + ) + + # The default of training_args.log_level is passive, so we set log level at info here to have that default. 
+ transformers.utils.logging.set_verbosity_info() + + log_level = training_args.get_process_log_level() + logger.setLevel(log_level) + transformers.utils.logging.set_verbosity(log_level) + transformers.utils.logging.enable_default_handler() + transformers.utils.logging.enable_explicit_format() + + # Log on each process the small summary: + logger.info(f"Training/evaluation parameters {training_args}") + + # 3. Detecting last checkpoint and eventually continue from last checkpoint + last_checkpoint = None + if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir: + if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0: + raise ValueError( + f"Output directory ({training_args.output_dir}) already exists and is not empty. " + "Use --overwrite_output_dir to overcome." + ) + elif last_checkpoint is not None and training_args.resume_from_checkpoint is None: + logger.info( + f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change " + "the `--output_dir` or add `--overwrite_output_dir` to train from scratch." + ) + + # 4. Load dataset + # Get the datasets: you can either provide your own CSV/JSON training and evaluation files (see below) + # or just provide the name of one of the public datasets available on the hub at https://huggingface.co/datasets/ + # (the dataset will be downloaded automatically from the datasets Hub). + # + # For CSV/JSON files this script will use the first column for the full image path and the second column for the + # captions (unless you specify column names for this with the `image_column` and `caption_column` arguments). + # + if data_args.dataset_name is not None: + # Downloading and loading a dataset from the hub. + dataset = load_dataset( + data_args.dataset_name, + data_args.dataset_config_name, + cache_dir=model_args.cache_dir, + keep_in_memory=False, + data_dir=data_args.data_dir, + token=model_args.token, + ) + else: + data_files = {} + if data_args.train_file is not None: + data_files["train"] = data_args.train_file + extension = data_args.train_file.split(".")[-1] + if data_args.validation_file is not None: + data_files["validation"] = data_args.validation_file + extension = data_args.validation_file.split(".")[-1] + if data_args.test_file is not None: + data_files["test"] = data_args.test_file + extension = data_args.test_file.split(".")[-1] + dataset = load_dataset( + extension, + data_files=data_files, + cache_dir=model_args.cache_dir, + token=model_args.token, + ) + # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at + # https://huggingface.co/docs/datasets/loading_datasets. + + # 5. 
Load pretrained model, tokenizer, and image processor + if model_args.tokenizer_name: + tokenizer = AutoTokenizer.from_pretrained( + model_args.tokenizer_name, + cache_dir=model_args.cache_dir, + use_fast=model_args.use_fast_tokenizer, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, + ) + elif model_args.model_name_or_path: + tokenizer = AutoTokenizer.from_pretrained( + model_args.model_name_or_path, + cache_dir=model_args.cache_dir, + use_fast=model_args.use_fast_tokenizer, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, + ) + elif model_args.text_model_name_or_path: + tokenizer = AutoTokenizer.from_pretrained( + model_args.text_model_name_or_path, + cache_dir=model_args.cache_dir, + use_fast=model_args.use_fast_tokenizer, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, + ) + else: + raise ValueError( + "You are instantiating a new tokenizer from scratch. This is not supported by this script. " + "You can do it from another script, save it, and load it from here, using --tokenizer_name." + ) + + if model_args.model_name_or_path: + # Load image_processor, in this script we only use this to get the mean and std for normalization. + image_processor = AutoImageProcessor.from_pretrained( + model_args.image_processor_name or model_args.model_name_or_path, + cache_dir=model_args.cache_dir, + revision=model_args.model_revision, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, + ) + with training_args.strategy.scope(): + model = TFAutoModel.from_pretrained( + model_args.model_name_or_path, + cache_dir=model_args.cache_dir, + revision=model_args.model_revision, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, + ) + else: + # Load image_processor, in this script we only use this to get the mean and std for normalization. + image_processor = AutoImageProcessor.from_pretrained( + model_args.image_processor_name or model_args.vision_model_name_or_path, + cache_dir=model_args.cache_dir, + revision=model_args.model_revision, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, + ) + with training_args.strategy.scope(): + model = TFVisionTextDualEncoderModel.from_vision_text_pretrained( + vision_model_name_or_path=model_args.vision_model_name_or_path, + text_model_name_or_path=model_args.text_model_name_or_path, + cache_dir=model_args.cache_dir, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, + ) + config = model.config + + if model_args.freeze_vision_model: + model.vision_model.trainable = False + + if model_args.freeze_text_model: + model.text_model.trainable = False + + # Preprocessing the datasets. + # We need to tokenize inputs and targets. + if training_args.do_train: + column_names = dataset["train"].column_names + elif training_args.do_eval: + column_names = dataset["validation"].column_names + elif training_args.do_predict: + column_names = dataset["test"].column_names + else: + logger.info("There is nothing to do. Please pass `do_train`, `do_eval` and/or `do_predict`.") + return + + # 6. Get the column names for input/target. 
+ dataset_columns = dataset_name_mapping.get(data_args.dataset_name, None) + if data_args.image_column is None: + image_column = dataset_columns[0] if dataset_columns is not None else column_names[0] + else: + image_column = data_args.image_column + if image_column not in column_names: + raise ValueError( + f"--image_column' value '{data_args.image_column}' needs to be one of: {', '.join(column_names)}" + ) + if data_args.caption_column is None: + caption_column = dataset_columns[1] if dataset_columns is not None else column_names[1] + else: + caption_column = data_args.caption_column + if caption_column not in column_names: + raise ValueError( + f"--caption_column' value '{data_args.caption_column}' needs to be one of: {', '.join(column_names)}" + ) + + # # 7. Preprocessing the datasets. + + # We need to tokenize input captions and transform the images. + def tokenize_captions(examples): + captions = list(examples[caption_column]) + text_inputs = tokenizer(captions, max_length=data_args.max_seq_length, padding="max_length", truncation=True) + examples["input_ids"] = text_inputs.input_ids + examples["attention_mask"] = text_inputs.attention_mask + return examples + + def filter_corrupt_images(examples): + """remove problematic images""" + valid_images = [] + for image_file in examples[image_column]: + try: + Image.open(image_file) + valid_images.append(True) + except Exception: + valid_images.append(False) + return valid_images + + if training_args.do_train: + if "train" not in dataset: + raise ValueError("--do_train requires a train dataset") + train_dataset = dataset["train"] + if data_args.max_train_samples is not None: + max_train_samples = min(len(train_dataset), data_args.max_train_samples) + train_dataset = train_dataset.select(range(max_train_samples)) + + train_dataset = train_dataset.filter( + filter_corrupt_images, batched=True, num_proc=data_args.preprocessing_num_workers + ) + train_dataset = train_dataset.map( + function=tokenize_captions, + batched=True, + remove_columns=[col for col in column_names if col != image_column], + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on train dataset", + ) + + tf_train_dataset = load_as_tf_dataset( + dataset=train_dataset, + batch_size=training_args.per_device_train_batch_size, + image_column=image_column, + image_size=config.vision_config.image_size, + mean=image_processor.image_mean, + std=image_processor.image_std, + shuffle=True, + ) + + if training_args.do_eval: + if "validation" not in dataset: + raise ValueError("--do_eval requires a train validation") + eval_dataset = dataset["validation"] + if data_args.max_eval_samples is not None: + max_eval_samples = min(len(eval_dataset), data_args.max_eval_samples) + eval_dataset = eval_dataset.select(range(max_eval_samples)) + + eval_dataset = eval_dataset.filter( + filter_corrupt_images, batched=True, num_proc=data_args.preprocessing_num_workers + ) + eval_dataset = eval_dataset.map( + function=tokenize_captions, + batched=True, + num_proc=data_args.preprocessing_num_workers, + remove_columns=[col for col in column_names if col != image_column], + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on validation dataset", + ) + + tf_eval_dataset = load_as_tf_dataset( + dataset=eval_dataset, + batch_size=training_args.per_device_eval_batch_size, + image_column=image_column, + image_size=config.vision_config.image_size, + mean=image_processor.image_mean, + std=image_processor.image_std, 
+ shuffle=False, + ) + + # 8. Preparing push_to_hub and model card + push_to_hub_model_id = training_args.push_to_hub_model_id + if model_args.model_name_or_path is not None: + model_name = model_args.model_name_or_path.split("/")[-1] + else: + vision_name = model_args.vision_model_name_or_path.split("/")[-1] + text_name = model_args.text_model_name_or_path.split("/")[-1] + model_name = f"{vision_name}-{text_name}" + if not push_to_hub_model_id: + if data_args.dataset_name is not None: + push_to_hub_model_id = f"{model_name}-finetuned-{data_args.dataset_name}" + else: + push_to_hub_model_id = f"{model_name}-finetuned-contrastive-image-text-modeling" + + model_card_kwargs = {"finetuned_from": model_args.model_name_or_path, "tasks": "contrastive-image-text-modeling"} + if data_args.dataset_name is not None: + model_card_kwargs["dataset_tags"] = data_args.dataset_name + if data_args.dataset_config_name is not None: + model_card_kwargs["dataset_args"] = data_args.dataset_config_name + model_card_kwargs["dataset"] = f"{data_args.dataset_name} {data_args.dataset_config_name}" + else: + model_card_kwargs["dataset"] = data_args.dataset_name + + if training_args.push_to_hub: + callbacks = [ + PushToHubCallback( + output_dir=training_args.output_dir, + hub_model_id=push_to_hub_model_id, + hub_token=training_args.push_to_hub_token, + tokenizer=tokenizer, + **model_card_kwargs, + ) + ] + else: + callbacks = [] + + # # 9. Training + if training_args.do_train: + num_train_steps = int(len(tf_train_dataset) * int(training_args.num_train_epochs)) + if training_args.warmup_steps > 0: + num_warmup_steps = training_args.warmup_steps + elif training_args.warmup_ratio > 0: + num_warmup_steps = int(num_train_steps * training_args.warmup_ratio) + else: + num_warmup_steps = 0 + optimizer, lr_schedule = create_optimizer( + init_lr=training_args.learning_rate, + num_train_steps=num_train_steps, + num_warmup_steps=num_warmup_steps, + adam_beta1=training_args.adam_beta1, + adam_beta2=training_args.adam_beta2, + adam_epsilon=training_args.adam_epsilon, + weight_decay_rate=training_args.weight_decay, + adam_global_clipnorm=training_args.max_grad_norm, + ) + # Transformers models compute the right loss for their task by default when labels are passed, and will + # use this for training unless you specify your own loss function in compile(). + model.compile(optimizer=optimizer, jit_compile=training_args.xla) + + if not training_args.do_eval: + tf_eval_dataset = None + model.fit( + tf_train_dataset, + validation_data=tf_eval_dataset, + epochs=int(training_args.num_train_epochs), + callbacks=callbacks, + ) + + # # 10. 
Evaluation + + if training_args.do_eval and not training_args.do_train: + model.evaluate(tf_eval_dataset) + + +if __name__ == "__main__": + main() diff --git a/examples/tensorflow/image-classification/README.md b/examples/tensorflow/image-classification/README.md index 28da5e894e1782..96979330ddc5b5 100644 --- a/examples/tensorflow/image-classification/README.md +++ b/examples/tensorflow/image-classification/README.md @@ -107,10 +107,10 @@ from datasets import load_dataset # example 1: local folder dataset = load_dataset("imagefolder", data_dir="path_to_your_folder") -# example 2: local files (suppoted formats are tar, gzip, zip, xz, rar, zstd) +# example 2: local files (supported formats are tar, gzip, zip, xz, rar, zstd) dataset = load_dataset("imagefolder", data_files="path_to_zip_file") -# example 3: remote files (suppoted formats are tar, gzip, zip, xz, rar, zstd) +# example 3: remote files (supported formats are tar, gzip, zip, xz, rar, zstd) dataset = load_dataset("imagefolder", data_files="https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip") # example 4: providing several splits diff --git a/examples/tensorflow/image-classification/run_image_classification.py b/examples/tensorflow/image-classification/run_image_classification.py index d9fcc8daafa62d..a4f322932130b5 100644 --- a/examples/tensorflow/image-classification/run_image_classification.py +++ b/examples/tensorflow/image-classification/run_image_classification.py @@ -23,6 +23,7 @@ import logging import os import sys +import warnings from dataclasses import dataclass, field from typing import Optional @@ -34,7 +35,7 @@ import transformers from transformers import ( - MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING, + TF_MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING, AutoConfig, AutoImageProcessor, DefaultDataCollator, @@ -46,6 +47,7 @@ set_seed, ) from transformers.keras_callbacks import KerasMetricCallback +from transformers.modeling_tf_utils import keras from transformers.trainer_utils import get_last_checkpoint, is_main_process from transformers.utils import check_min_version, send_example_telemetry from transformers.utils.versions import require_version @@ -54,11 +56,11 @@ logger = logging.getLogger(__name__) # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.24.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/image-classification/requirements.txt") -MODEL_CONFIG_CLASSES = list(MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING.keys()) +MODEL_CONFIG_CLASSES = list(TF_MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING.keys()) MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES) @@ -157,12 +159,28 @@ class ModelArguments: metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) image_processor_name: str = field(default=None, metadata={"help": "Name or path of preprocessor config."}) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." 
+ }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -226,6 +244,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + if not (training_args.do_train or training_args.do_eval or training_args.do_predict): exit("Must specify at least one of --do_train, --do_eval or --do_predict!") @@ -262,11 +289,6 @@ def main(): transformers.utils.logging.set_verbosity_info() transformers.utils.logging.enable_default_handler() transformers.utils.logging.enable_explicit_format() - # Log on each process the small summary: - logger.warning( - f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}" - + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}" - ) logger.info(f"Training/evaluation parameters {training_args}") # region Dataset and labels @@ -280,7 +302,7 @@ def main(): data_args.dataset_config_name, cache_dir=model_args.cache_dir, task="image-classification", - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: data_files = {} @@ -295,12 +317,12 @@ def main(): task="image-classification", ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # Prepare label mappings. # We'll include these in the model's config to get human readable labels in the Inference API. labels = dataset["train"].features["labels"].names - label2id, id2label = dict(), dict() + label2id, id2label = {}, {} for i, label in enumerate(labels): label2id[label] = str(i) id2label[str(i)] = label @@ -314,13 +336,15 @@ def main(): finetuning_task="image-classification", cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) image_processor = AutoImageProcessor.from_pretrained( model_args.image_processor_name or model_args.model_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) # If we don't have a validation split, split off a percentage of train as validation. 
@@ -340,7 +364,7 @@ def main(): def _train_transforms(image): img_size = image_size - image = tf.keras.utils.img_to_array(image) + image = keras.utils.img_to_array(image) image = random_resized_crop(image, size=img_size) image = tf.image.random_flip_left_right(image) image /= 255.0 @@ -349,7 +373,7 @@ def _train_transforms(image): return image def _val_transforms(image): - image = tf.keras.utils.img_to_array(image) + image = keras.utils.img_to_array(image) image = tf.image.resize(image, size=image_size) # image = np.array(image) # FIXME - use tf.image function image = center_crop(image, size=image_size) @@ -417,7 +441,7 @@ def val_transforms(example_batch): collate_fn = DefaultDataCollator(return_tensors="np") # Load the accuracy metric from the datasets package - metric = evaluate.load("accuracy") + metric = evaluate.load("accuracy", cache_dir=model_args.cache_dir) # Define our compute_metrics function. It takes an `EvalPrediction` object (a namedtuple with a # predictions and label_ids field) and has to return a dictionary string to float. @@ -440,7 +464,8 @@ def compute_metrics(p): from_pt=bool(".bin" in model_path), cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ignore_mismatched_sizes=model_args.ignore_mismatched_sizes, ) num_replicas = training_args.strategy.num_replicas_in_sync @@ -502,6 +527,8 @@ def compute_metrics(p): collate_fn=collate_fn, ).with_options(dataset_options) + # Transformers models compute the right loss for their task by default when labels are passed, and will + # use this for training unless you specify your own loss function in compile(). model.compile(optimizer=optimizer, jit_compile=training_args.xla, metrics=["accuracy"]) push_to_hub_model_id = training_args.push_to_hub_model_id @@ -548,6 +575,7 @@ def compute_metrics(p): logging.info(f"{metric_name}: {value:.3f}") if training_args.output_dir is not None: + os.makedirs(training_args.output_dir, exist_ok=True) with open(os.path.join(training_args.output_dir, "all_results.json"), "w") as f: f.write(json.dumps(eval_metrics)) diff --git a/examples/tensorflow/language-modeling-tpu/README.md b/examples/tensorflow/language-modeling-tpu/README.md new file mode 100644 index 00000000000000..25381f86d093af --- /dev/null +++ b/examples/tensorflow/language-modeling-tpu/README.md @@ -0,0 +1,110 @@ +# Training a masked language model end-to-end from scratch on TPUs + +In this example, we're going to demonstrate how to train a TensorFlow model from 🤗 Transformers from scratch. If you're interested in some background theory on training Hugging Face models with TensorFlow on TPU, please check out our +[tutorial doc](https://huggingface.co/docs/transformers/main/perf_train_tpu_tf) on this topic! +If you're interested in smaller-scale TPU training from a pre-trained checkpoint, you can also check out the [TPU fine-tuning example](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/tpu_training-tf.ipynb). + +This example will demonstrate pre-training language models at the 100M-1B parameter scale, similar to BERT or GPT-2. More concretely, we will show how to train a [RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta) (base model) from scratch on the [WikiText dataset (v1)](https://huggingface.co/datasets/wikitext). 
+ +We've tried to ensure that all the practices we show you here are scalable, though - with relatively few changes, the code could be scaled up to much larger models. + +Google's gargantuan [PaLM model](https://arxiv.org/abs/2204.02311), with +over 500B parameters, is a good example of how far you can go with pure TPU training, though gathering the dataset and the budget to train at that scale is not an easy task! + +### Table of contents + +- [Setting up a TPU-VM](#setting-up-a-tpu-vm) +- [Training a tokenizer](#training-a-tokenizer) +- [Preparing the dataset](#preparing-the-dataset) +- [Training the model](#training-the-model) +- [Inference](#inference) + +## Setting up a TPU-VM + +Since this example focuses on using TPUs, the first step is to set up access to TPU hardware. For this example, we chose to use a TPU v3-8 VM. Follow [this guide](https://cloud.google.com/tpu/docs/run-calculation-tensorflow) to quickly create a TPU VM with TensorFlow pre-installed. + +> 💡 **Note**: You don't need TPU-enabled hardware for tokenizer training and TFRecord shard preparation. + +## Training a tokenizer + +To train a language model from scratch, the first step is to tokenize text. In most Hugging Face examples, we begin from a pre-trained model and use its tokenizer. However, in this example, we're going to train a tokenizer from scratch as well. The script for this is `train_unigram.py`. An example command is: + +```bash +python train_unigram.py --batch_size 1000 --vocab_size 25000 --export_to_hub +``` + +The script will automatically load the `train` split of the WikiText dataset and train a [Unigram tokenizer](https://huggingface.co/course/chapter6/7?fw=pt) on it. + +> 💡 **Note**: In order for `export_to_hub` to work, you must authenticate yourself with the `huggingface-cli`. Run `huggingface-cli login` and follow the on-screen instructions. + +## Preparing the dataset + +The next step is to prepare the dataset. This consists of loading a text dataset from the Hugging Face Hub, tokenizing it and grouping it into chunks of a fixed length ready for training. The script for this is `prepare_tfrecord_shards.py`. + +The reason we create TFRecord output files from this step is that these files work well with [`tf.data` pipelines](https://www.tensorflow.org/guide/data_performance). This makes them very suitable for scalable TPU training - the dataset can easily be sharded and read in parallel just by tweaking a few parameters in the pipeline. An example command is: + +```bash +python prepare_tfrecord_shards.py \ + --tokenizer_name_or_path tf-tpu/unigram-tokenizer-wikitext \ + --shard_size 5000 \ + --split test \ + --max_length 128 \ + --output_dir gs://tf-tpu-training-resources +``` + +**Notes**: + +* While running the above script, you need to specify the `split` accordingly. The example command above will only process the `test` split of the dataset. +* If you append `gs://` to your `output_dir`, the TFRecord shards will be directly serialized to a Google Cloud Storage (GCS) bucket. Ensure that you have already [created the GCS bucket](https://cloud.google.com/storage/docs). +* If you're using a TPU node, you must stream data from a GCS bucket. Otherwise, if you're using a TPU VM, you can store the data locally. You may need to [attach](https://cloud.google.com/tpu/docs/setup-persistent-disk) persistent storage to the VM. +* Additional CLI arguments are also supported. We encourage you to run `python prepare_tfrecord_shards.py -h` to learn more about them.
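To make the shard format concrete, here is a minimal sketch of how the resulting records can be read back with `tf.data`; it mirrors the `decode_fn` used by `run_mlm.py` later in this example and assumes the `--max_length 128` and `--output_dir`/`--split` values from the command above.

```python
import tensorflow as tf

max_length = 128  # must match the --max_length used when the shards were written


def decode_fn(example):
    # Each record holds one tokenized, fixed-length chunk as two int64 features.
    features = {
        "input_ids": tf.io.FixedLenFeature(dtype=tf.int64, shape=(max_length,)),
        "attention_mask": tf.io.FixedLenFeature(dtype=tf.int64, shape=(max_length,)),
    }
    return tf.io.parse_single_example(example, features)


# Shards can live locally or in a `gs://` bucket; tf.io.gfile handles both transparently.
files = tf.io.gfile.glob("gs://tf-tpu-training-resources/test/*.tfrecord")
dataset = tf.data.TFRecordDataset(files).map(decode_fn, num_parallel_calls=tf.data.AUTOTUNE)
```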
+ +## Training the model + +Once that's done, the model is ready for training. By default, training takes place on TPU, but you can use the `--no_tpu` flag to train on CPU for testing purposes. An example command is: + +```bash +python3 run_mlm.py \ + --train_dataset gs://tf-tpu-training-resources/train/ \ + --eval_dataset gs://tf-tpu-training-resources/validation/ \ + --tokenizer tf-tpu/unigram-tokenizer-wikitext \ + --output_dir trained_model +``` + +If you had specified a `hub_model_id` while launching training, then your model will be pushed to a model repository on the Hugging Face Hub. You can find such an example repository here: +[tf-tpu/roberta-base-epochs-500-no-wd](https://huggingface.co/tf-tpu/roberta-base-epochs-500-no-wd). + +## Inference + +Once the model is trained, you can use 🤗 Pipelines to perform inference: + +```python +from transformers import pipeline + +model_id = "tf-tpu/roberta-base-epochs-500-no-wd" +unmasker = pipeline("fill-mask", model=model_id, framework="tf") +unmasker("Goal of my life is to [MASK].") + +[{'score': 0.1003185287117958, + 'token': 52, + 'token_str': 'be', + 'sequence': 'Goal of my life is to be.'}, + {'score': 0.032648514956235886, + 'token': 5, + 'token_str': '', + 'sequence': 'Goal of my life is to .'}, + {'score': 0.02152673341333866, + 'token': 138, + 'token_str': 'work', + 'sequence': 'Goal of my life is to work.'}, + {'score': 0.019547373056411743, + 'token': 984, + 'token_str': 'act', + 'sequence': 'Goal of my life is to act.'}, + {'score': 0.01939118467271328, + 'token': 73, + 'token_str': 'have', + 'sequence': 'Goal of my life is to have.'}] +``` + +You can also try out inference using the [Inference Widget](https://huggingface.co/tf-tpu/roberta-base-epochs-500-no-wd?text=Goal+of+my+life+is+to+%5BMASK%5D.) from the model page. \ No newline at end of file diff --git a/examples/tensorflow/language-modeling-tpu/prepare_tfrecord_shards.py b/examples/tensorflow/language-modeling-tpu/prepare_tfrecord_shards.py new file mode 100644 index 00000000000000..a8bb7d37929f61 --- /dev/null +++ b/examples/tensorflow/language-modeling-tpu/prepare_tfrecord_shards.py @@ -0,0 +1,181 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2023 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Script for preparing TFRecord shards for pre-tokenized examples.""" + +import argparse +import logging +import os + +import datasets +import tensorflow as tf + +from transformers import AutoTokenizer + + +logger = logging.getLogger(__name__) + + +def parse_args(): + parser = argparse.ArgumentParser( + description="Prepare TFRecord shards from pre-tokenized samples of the wikitext dataset." + ) + parser.add_argument( + "--dataset_name", + type=str, + default="wikitext", + help="Name of the training. Explore datasets at: hf.co/datasets.", + ) + parser.add_argument( + "--dataset_config", type=str, default="wikitext-103-raw-v1", help="Configuration name of the dataset." 
+ ) + parser.add_argument( + "--tokenizer_name_or_path", + type=str, + default="sayakpaul/unigram-tokenizer-wikitext", + help="Tokenizer identifier. Can be a local filepath or a Hub identifier.", + ) + parser.add_argument( + "--shard_size", + type=int, + default=1000, + help="Number of entries to go in a single shard.", + ) + parser.add_argument("--split", type=str, default="train", choices=["train", "test", "validation"]) + parser.add_argument( + "--limit", + default=None, + type=int, + help="Limit the number of shards (used for debugging).", + ) + parser.add_argument( + "--max_length", + type=int, + default=512, + help="Maximum sequence length. For training on TPUs, it helps to have a maximum" + " sequence length that is a multiple of 8.", + ) + parser.add_argument( + "--output_dir", + default="tf-tpu", + type=str, + help="Output directory where the TFRecord shards will be saved. If the" + " path is appended with `gs://` ('gs://tf-tpu', for example) then the TFRecord" + " shards will be directly saved to a Google Cloud Storage bucket.", + ) + + args = parser.parse_args() + return args + + +def tokenize_function(tokenizer): + def fn(examples): + return tokenizer(examples["text"]) + + return fn + + +def get_serialized_examples(tokenized_data): + records = [] + for i in range(len(tokenized_data["input_ids"])): + features = { + "input_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=tokenized_data["input_ids"][i])), + "attention_mask": tf.train.Feature( + int64_list=tf.train.Int64List(value=tokenized_data["attention_mask"][i]) + ), + } + features = tf.train.Features(feature=features) + example = tf.train.Example(features=features) + record_bytes = example.SerializeToString() + records.append(record_bytes) + return records + + +def main(args): + dataset = datasets.load_dataset(args.dataset_name, args.dataset_config, split=args.split) + + if args.limit is not None: + max_samples = min(len(dataset), args.limit) + dataset = dataset.select(range(max_samples)) + print(f"Limiting the dataset to {args.limit} entries.") + + tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name_or_path) + + # Handle output directory creation. + # For serializing into a Google Cloud Storage Bucket, one needs to first + # create a bucket. + if "gs" not in args.output_dir: + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir) + split_dir = os.path.join(args.output_dir, args.split) + if not os.path.exists(split_dir): + os.makedirs(split_dir) + else: + split_dir = os.path.join(args.output_dir, args.split) + + # Tokenize the whole dataset at once. + tokenize_fn = tokenize_function(tokenizer) + dataset_tokenized = dataset.map(tokenize_fn, batched=True, num_proc=4, remove_columns=["text"]) + + # We need to concatenate all our texts together, and then split the result + # into chunks of a fixed size, which we will call block_size. To do this, we + # will use the map method again, with the option batched=True. When we use batched=True, + # the function we pass to map() will be passed multiple inputs at once, allowing us + # to group them into more or fewer examples than we had in the input. + # This allows us to create our new fixed-length samples. The advantage of this + # method is that we don't lose a whole lot of content from the dataset compared to the + # case where we simply tokenize with a pre-defined max_length. + + def group_texts(examples): + # Concatenate all texts. 
+ concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()} + total_length = len(concatenated_examples[list(examples.keys())[0]]) + # We drop the small remainder, though you could add padding instead if the model supports it + # In this, as in all things, we advise you to follow your heart 🫀 + total_length = (total_length // args.max_length) * args.max_length + # Split by chunks of max_len. + result = { + k: [t[i : i + args.max_length] for i in range(0, total_length, args.max_length)] + for k, t in concatenated_examples.items() + } + return result + + grouped_dataset = dataset_tokenized.map(group_texts, batched=True, batch_size=1000, num_proc=4) + + shard_count = 0 + total_records = 0 + for shard in range(0, len(grouped_dataset), args.shard_size): + dataset_snapshot = grouped_dataset[shard : shard + args.shard_size] + records_containing = len(dataset_snapshot["input_ids"]) + filename = os.path.join(split_dir, f"dataset-{shard_count}-{records_containing}.tfrecord") + serialized_examples = get_serialized_examples(dataset_snapshot) + + with tf.io.TFRecordWriter(filename) as out_file: + for i in range(len(serialized_examples)): + example = serialized_examples[i] + out_file.write(example) + print("Wrote file {} containing {} records".format(filename, records_containing)) + + shard_count += 1 + total_records += records_containing + + with open(f"split-{args.split}-records-count.txt", "w") as f: + print(f"Total {args.split} records: {total_records}", file=f) + + +if __name__ == "__main__": + args = parse_args() + main(args) diff --git a/examples/tensorflow/language-modeling-tpu/requirements.txt b/examples/tensorflow/language-modeling-tpu/requirements.txt new file mode 100644 index 00000000000000..60bbe767a21427 --- /dev/null +++ b/examples/tensorflow/language-modeling-tpu/requirements.txt @@ -0,0 +1,3 @@ +transformers==4.26.1 +datasets==2.9.0 +tokenizers==0.13.2 diff --git a/examples/tensorflow/language-modeling-tpu/run_mlm.py b/examples/tensorflow/language-modeling-tpu/run_mlm.py new file mode 100644 index 00000000000000..7ed111ab12712b --- /dev/null +++ b/examples/tensorflow/language-modeling-tpu/run_mlm.py @@ -0,0 +1,323 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2023 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Script for training a masked language model on TPU.""" + +import argparse +import logging +import os +import re + +import tensorflow as tf +from packaging.version import parse + +from transformers import ( + AutoConfig, + AutoTokenizer, + DataCollatorForLanguageModeling, + PushToHubCallback, + TFAutoModelForMaskedLM, + create_optimizer, +) + + +try: + import tf_keras as keras +except (ModuleNotFoundError, ImportError): + import keras + + if parse(keras.__version__).major > 2: + raise ValueError( + "Your currently installed version of Keras is Keras 3, but this is not yet supported in " + "Transformers. Please install the backwards-compatible tf-keras package with " + "`pip install tf-keras`." 
+ ) + + +logger = logging.getLogger(__name__) + +AUTO = tf.data.AUTOTUNE + + +def parse_args(): + parser = argparse.ArgumentParser(description="Train a masked language model on TPU.") + parser.add_argument( + "--pretrained_model_config", + type=str, + default="FacebookAI/roberta-base", + help="The model config to use. Note that we don't copy the model's weights, only the config!", + ) + parser.add_argument( + "--tokenizer", + type=str, + default="unigram-tokenizer-wikitext", + help="The name of the tokenizer to load. We use the pretrained tokenizer to initialize the model's vocab size.", + ) + + parser.add_argument( + "--per_replica_batch_size", + type=int, + default=8, + help="Batch size per TPU core.", + ) + + parser.add_argument( + "--no_tpu", + action="store_true", + help="If set, run on CPU and don't try to initialize a TPU. Useful for debugging on non-TPU instances.", + ) + + parser.add_argument( + "--tpu_name", + type=str, + help="Name of TPU resource to initialize. Should be blank on Colab, and 'local' on TPU VMs.", + default="local", + ) + + parser.add_argument( + "--tpu_zone", + type=str, + help="Google cloud zone that TPU resource is located in. Only used for non-Colab TPU nodes.", + ) + + parser.add_argument( + "--gcp_project", type=str, help="Google cloud project name. Only used for non-Colab TPU nodes." + ) + + parser.add_argument( + "--bfloat16", + action="store_true", + help="Use mixed-precision bfloat16 for training. This is the recommended lower-precision format for TPU.", + ) + + parser.add_argument( + "--train_dataset", + type=str, + help="Path to training dataset to load. If the path begins with `gs://`" + " then the dataset will be loaded from a Google Cloud Storage bucket.", + ) + + parser.add_argument( + "--shuffle_buffer_size", + type=int, + default=2**18, # Default corresponds to a 1GB buffer for seq_len 512 + help="Size of the shuffle buffer (in samples)", + ) + + parser.add_argument( + "--eval_dataset", + type=str, + help="Path to evaluation dataset to load. If the path begins with `gs://`" + " then the dataset will be loaded from a Google Cloud Storage bucket.", + ) + + parser.add_argument( + "--num_epochs", + type=int, + default=1, + help="Number of epochs to train for.", + ) + + parser.add_argument( + "--learning_rate", + type=float, + default=1e-4, + help="Learning rate to use for training.", + ) + + parser.add_argument( + "--weight_decay_rate", + type=float, + default=1e-3, + help="Weight decay rate to use for training.", + ) + + parser.add_argument( + "--max_length", + type=int, + default=512, + help="Maximum length of tokenized sequences. Should match the setting used in prepare_tfrecord_shards.py", + ) + + parser.add_argument( + "--mlm_probability", + type=float, + default=0.15, + help="Fraction of tokens to mask during training.", + ) + + parser.add_argument("--output_dir", type=str, required=True, help="Path to save model checkpoints to.") + parser.add_argument("--hub_model_id", type=str, help="Model ID to upload to on the Hugging Face Hub.") + + args = parser.parse_args() + return args + + +def initialize_tpu(args): + try: + if args.tpu_name: + tpu = tf.distribute.cluster_resolver.TPUClusterResolver( + args.tpu_name, zone=args.tpu_zone, project=args.gcp_project + ) + else: + tpu = tf.distribute.cluster_resolver.TPUClusterResolver() + except ValueError: + raise RuntimeError( + "Couldn't connect to TPU! Most likely you need to specify --tpu_name, --tpu_zone, or " + "--gcp_project. When running on a TPU VM, use --tpu_name local." 
+ ) + + tf.config.experimental_connect_to_cluster(tpu) + tf.tpu.experimental.initialize_tpu_system(tpu) + + return tpu + + +def count_samples(file_list): + num_samples = 0 + for file in file_list: + filename = file.split("/")[-1] + sample_count = re.search(r"-\d+-(\d+)\.tfrecord", filename).group(1) + sample_count = int(sample_count) + num_samples += sample_count + + return num_samples + + +def prepare_dataset(records, decode_fn, mask_fn, batch_size, shuffle, shuffle_buffer_size=None): + num_samples = count_samples(records) + dataset = tf.data.Dataset.from_tensor_slices(records) + if shuffle: + dataset = dataset.shuffle(len(dataset)) + dataset = tf.data.TFRecordDataset(dataset, num_parallel_reads=AUTO) + # TF can't infer the total sample count because it doesn't read all the records yet, so we assert it here + dataset = dataset.apply(tf.data.experimental.assert_cardinality(num_samples)) + dataset = dataset.map(decode_fn, num_parallel_calls=AUTO) + if shuffle: + assert shuffle_buffer_size is not None + dataset = dataset.shuffle(args.shuffle_buffer_size) + dataset = dataset.batch(batch_size, drop_remainder=True) + dataset = dataset.map(mask_fn, num_parallel_calls=AUTO) + dataset = dataset.prefetch(AUTO) + return dataset + + +def main(args): + if not args.no_tpu: + tpu = initialize_tpu(args) + strategy = tf.distribute.TPUStrategy(tpu) + else: + strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0") + + if args.bfloat16: + keras.mixed_precision.set_global_policy("mixed_bfloat16") + + tokenizer = AutoTokenizer.from_pretrained(args.tokenizer) + config = AutoConfig.from_pretrained(args.pretrained_model_config) + config.vocab_size = tokenizer.vocab_size + + training_records = tf.io.gfile.glob(os.path.join(args.train_dataset, "*.tfrecord")) + if not training_records: + raise ValueError(f"No .tfrecord files found in {args.train_dataset}.") + eval_records = tf.io.gfile.glob(os.path.join(args.eval_dataset, "*.tfrecord")) + if not eval_records: + raise ValueError(f"No .tfrecord files found in {args.eval_dataset}.") + + num_train_samples = count_samples(training_records) + + steps_per_epoch = num_train_samples // (args.per_replica_batch_size * strategy.num_replicas_in_sync) + total_train_steps = steps_per_epoch * args.num_epochs + + with strategy.scope(): + model = TFAutoModelForMaskedLM.from_config(config) + model(model.dummy_inputs) # Pass some dummy inputs through the model to ensure all the weights are built + optimizer, schedule = create_optimizer( + num_train_steps=total_train_steps, + num_warmup_steps=total_train_steps // 20, + init_lr=args.learning_rate, + weight_decay_rate=args.weight_decay_rate, + ) + + # Transformers models compute the right loss for their task by default when labels are passed, and will + # use this for training unless you specify your own loss function in compile(). + model.compile(optimizer=optimizer, metrics=["accuracy"]) + + def decode_fn(example): + features = { + "input_ids": tf.io.FixedLenFeature(dtype=tf.int64, shape=(args.max_length,)), + "attention_mask": tf.io.FixedLenFeature(dtype=tf.int64, shape=(args.max_length,)), + } + return tf.io.parse_single_example(example, features) + + # Many of the data collators in Transformers are TF-compilable when return_tensors == "tf", so we can + # use their methods in our data pipeline. 
+ data_collator = DataCollatorForLanguageModeling( + tokenizer=tokenizer, mlm_probability=args.mlm_probability, mlm=True, return_tensors="tf" + ) + + def mask_with_collator(batch): + # TF really needs an isin() function + special_tokens_mask = ( + ~tf.cast(batch["attention_mask"], tf.bool) + | (batch["input_ids"] == tokenizer.cls_token_id) + | (batch["input_ids"] == tokenizer.sep_token_id) + ) + batch["input_ids"], batch["labels"] = data_collator.tf_mask_tokens( + batch["input_ids"], + vocab_size=len(tokenizer), + mask_token_id=tokenizer.mask_token_id, + special_tokens_mask=special_tokens_mask, + ) + return batch + + batch_size = args.per_replica_batch_size * strategy.num_replicas_in_sync + + train_dataset = prepare_dataset( + training_records, + decode_fn=decode_fn, + mask_fn=mask_with_collator, + batch_size=batch_size, + shuffle=True, + shuffle_buffer_size=args.shuffle_buffer_size, + ) + + eval_dataset = prepare_dataset( + eval_records, + decode_fn=decode_fn, + mask_fn=mask_with_collator, + batch_size=batch_size, + shuffle=False, + ) + + callbacks = [] + if args.hub_model_id: + callbacks.append( + PushToHubCallback(output_dir=args.output_dir, hub_model_id=args.hub_model_id, tokenizer=tokenizer) + ) + + model.fit( + train_dataset, + validation_data=eval_dataset, + epochs=args.num_epochs, + callbacks=callbacks, + ) + + model.save_pretrained(args.output_dir) + + +if __name__ == "__main__": + args = parse_args() + main(args) diff --git a/examples/tensorflow/language-modeling-tpu/train_unigram.py b/examples/tensorflow/language-modeling-tpu/train_unigram.py new file mode 100644 index 00000000000000..a71cac45759cb6 --- /dev/null +++ b/examples/tensorflow/language-modeling-tpu/train_unigram.py @@ -0,0 +1,119 @@ +#!/usr/bin/env python +# coding=utf-8 +# Copyright 2023 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Script for training a Unigram tokenizer.""" + +import argparse +import logging + +import datasets +from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers, processors +from tokenizers.models import Unigram +from tokenizers.trainers import UnigramTrainer + +from transformers import AlbertTokenizerFast + + +logger = logging.getLogger(__name__) + + +def parse_args(): + parser = argparse.ArgumentParser(description="Train a unigram tokenizer on the wikitext dataset.") + parser.add_argument( + "--dataset_name", + type=str, + default="wikitext", + help="Name of the training. Explore datasets at: hf.co/datasets.", + ) + parser.add_argument( + "--dataset_config", type=str, default="wikitext-103-raw-v1", help="Configuration name of the dataset." 
+ ) + parser.add_argument( + "--batch_size", + type=int, + default=1000, + help="Batch size during training.", + ) + parser.add_argument( + "--vocab_size", + type=int, + default=10048, + help="Size of the desired vocabulary.", + ) + parser.add_argument( + "--limit", + default=None, + type=int, + help="Limit the number of shards (used for debugging).", + ) + parser.add_argument( + "--export_to_hub", + action="store_true", + ) + + args = parser.parse_args() + return args + + +def main(args): + dataset = datasets.load_dataset(args.dataset_name, args.dataset_config, split="train") + + if args.limit is not None: + max_train_samples = min(len(dataset), args.limit) + dataset = dataset.select(range(max_train_samples)) + logger.info(f"Limiting the dataset to {args.limit} entries.") + + def batch_iterator(): + for i in range(0, len(dataset), args.batch_size): + yield dataset[i : i + args.batch_size]["text"] + + # Prepare the tokenizer. + tokenizer = Tokenizer(Unigram()) + tokenizer.normalizer = normalizers.Sequence([normalizers.Replace("``", '"'), normalizers.Replace("''", '"')]) + tokenizer.pre_tokenizer = pre_tokenizers.Metaspace() + + # Prepare the trainer. + trainer = UnigramTrainer( + unk_token="", + special_tokens=["[CLS]", "[SEP]", "", "", "[MASK]"], + vocab_size=args.vocab_size, + ) + + logger.info("Training the tokenizer.") + tokenizer.train_from_iterator(batch_iterator(), trainer=trainer) + logger.info("Tokenizer training complete!") + + cls_token_id = tokenizer.token_to_id("[CLS]") + sep_token_id = tokenizer.token_to_id("[SEP]") + tokenizer.post_processor = processors.TemplateProcessing( + single="[CLS]:0 $A:0 [SEP]:0", + pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1", + special_tokens=[ + ("[CLS]", cls_token_id), + ("[SEP]", sep_token_id), + ], + ) + tokenizer.decoder = decoders.Metaspace() + + if args.export_to_hub: + logger.info("Exporting the trained tokenizer to Hub.") + new_tokenizer = AlbertTokenizerFast(tokenizer_object=tokenizer) + new_tokenizer.push_to_hub("unigram-tokenizer-dataset") + + +if __name__ == "__main__": + args = parse_args() + main(args) diff --git a/examples/tensorflow/language-modeling/README.md b/examples/tensorflow/language-modeling/README.md index b96217c1f5da6d..ed4f507d4e82ce 100644 --- a/examples/tensorflow/language-modeling/README.md +++ b/examples/tensorflow/language-modeling/README.md @@ -41,18 +41,18 @@ can also be used by passing the name of the TPU resource with the `--tpu` argume This script trains a masked language model. ### Example command -``` +```bash python run_mlm.py \ ---model_name_or_path distilbert-base-cased \ +--model_name_or_path distilbert/distilbert-base-cased \ --output_dir output \ --dataset_name wikitext \ --dataset_config_name wikitext-103-raw-v1 ``` When using a custom dataset, the validation file can be separately passed as an input argument. Otherwise some split (customizable) of training data is used as validation. -``` +```bash python run_mlm.py \ ---model_name_or_path distilbert-base-cased \ +--model_name_or_path distilbert/distilbert-base-cased \ --output_dir output \ --train_file train_file_path ``` @@ -62,9 +62,9 @@ python run_mlm.py \ This script trains a causal language model. 
### Example command -``` +```bash python run_clm.py \ ---model_name_or_path distilgpt2 \ +--model_name_or_path distilbert/distilgpt2 \ --output_dir output \ --dataset_name wikitext \ --dataset_config_name wikitext-103-raw-v1 @@ -72,9 +72,9 @@ python run_clm.py \ When using a custom dataset, the validation file can be separately passed as an input argument. Otherwise some split (customizable) of training data is used as validation. -``` +```bash python run_clm.py \ ---model_name_or_path distilgpt2 \ +--model_name_or_path distilbert/distilgpt2 \ --output_dir output \ --train_file train_file_path ``` diff --git a/examples/tensorflow/language-modeling/run_clm.py b/examples/tensorflow/language-modeling/run_clm.py index 51087123b5644f..5c941016d57d75 100755 --- a/examples/tensorflow/language-modeling/run_clm.py +++ b/examples/tensorflow/language-modeling/run_clm.py @@ -30,6 +30,7 @@ import os import random import sys +import warnings from dataclasses import dataclass, field from itertools import chain from pathlib import Path @@ -77,7 +78,7 @@ class ModelArguments: default=None, metadata={ "help": ( - "The model checkpoint for weights initialization.Don't set if you want to train a model from scratch." + "The model checkpoint for weights initialization. Don't set if you want to train a model from scratch." ) }, ) @@ -112,12 +113,28 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -220,6 +237,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. 
send_example_telemetry("run_clm", model_args, data_args, framework="tensorflow") @@ -287,7 +313,7 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) if "validation" not in raw_datasets.keys(): raw_datasets["validation"] = load_dataset( @@ -295,14 +321,14 @@ def main(): data_args.dataset_config_name, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) raw_datasets["train"] = load_dataset( data_args.dataset_name, data_args.dataset_config_name, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: data_files = {} @@ -323,7 +349,7 @@ def main(): extension, data_files=data_files, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, **dataset_args, ) # If no validation data is there, validation_split_percentage will be used to divide the dataset. @@ -333,7 +359,7 @@ def main(): data_files=data_files, split=f"train[:{data_args.validation_split_percentage}%]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, **dataset_args, ) raw_datasets["train"] = load_dataset( @@ -341,11 +367,11 @@ def main(): data_files=data_files, split=f"train[{data_args.validation_split_percentage}%:]", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, **dataset_args, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # endregion # region Load pretrained model and tokenizer @@ -353,20 +379,30 @@ def main(): # In distributed training, the .from_pretrained methods guarantee that only one local process can concurrently # download model & vocab. if model_args.config_name: - config = AutoConfig.from_pretrained(model_args.config_name) + config = AutoConfig.from_pretrained( + model_args.config_name, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, + ) elif model_args.model_name_or_path: - config = AutoConfig.from_pretrained(model_args.model_name_or_path) + config = AutoConfig.from_pretrained( + model_args.model_name_or_path, token=model_args.token, trust_remote_code=model_args.trust_remote_code + ) else: config = CONFIG_MAPPING[model_args.model_type]() logger.warning("You are instantiating a new config instance from scratch.") if model_args.tokenizer_name: - tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name) + tokenizer = AutoTokenizer.from_pretrained( + model_args.tokenizer_name, token=model_args.token, trust_remote_code=model_args.trust_remote_code + ) elif model_args.model_name_or_path: - tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained( + model_args.model_name_or_path, token=model_args.token, trust_remote_code=model_args.trust_remote_code + ) else: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. 
This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name." ) # endregion @@ -390,16 +426,16 @@ def tokenize_function(examples): if data_args.block_size is None: block_size = tokenizer.model_max_length - if block_size > 1024: + if block_size > config.max_position_embeddings: logger.warning( f"The tokenizer picked seems to have a very large `model_max_length` ({tokenizer.model_max_length}). " - "Picking 1024 instead. You can change that default value by passing --block_size xxx." + f"Using block_size={min(1024, config.max_position_embeddings)} instead. You can change that default value by passing --block_size xxx." ) - block_size = 1024 + block_size = min(1024, config.max_position_embeddings) else: if data_args.block_size > tokenizer.model_max_length: logger.warning( - f"The block_size passed ({data_args.block_size}) is larger than the maximum length for the model" + f"The block_size passed ({data_args.block_size}) is larger than the maximum length for the model " f"({tokenizer.model_max_length}). Using block_size={tokenizer.model_max_length}." ) block_size = min(data_args.block_size, tokenizer.model_max_length) @@ -426,7 +462,7 @@ def group_texts(examples): # to preprocess. # # To speed up this part, we use multiprocessing. See the documentation of the map method for more information: - # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map + # https://huggingface.co/docs/datasets/process#map lm_datasets = tokenized_datasets.map( group_texts, @@ -466,16 +502,33 @@ def group_texts(examples): with training_args.strategy.scope(): # region Prepare model if checkpoint is not None: - model = TFAutoModelForCausalLM.from_pretrained(checkpoint, config=config) + model = TFAutoModelForCausalLM.from_pretrained( + checkpoint, config=config, token=model_args.token, trust_remote_code=model_args.trust_remote_code + ) elif model_args.model_name_or_path: - model = TFAutoModelForCausalLM.from_pretrained(model_args.model_name_or_path, config=config) + model = TFAutoModelForCausalLM.from_pretrained( + model_args.model_name_or_path, + config=config, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, + ) else: logger.info("Training new model from scratch") - model = TFAutoModelForCausalLM.from_config(config) + model = TFAutoModelForCausalLM.from_config( + config, token=model_args.token, trust_remote_code=model_args.trust_remote_code + ) # We resize the embeddings only when necessary to avoid index errors. If you are creating a model from scratch # on a small vocab and want a smaller embedding size, remove this test. - embedding_size = model.get_input_embeddings().weight.shape[0] + embeddings = model.get_input_embeddings() + + # Matt: This is a temporary workaround as we transition our models to exclusively using Keras embeddings. + # As soon as the transition is complete, all embeddings should be keras.Embeddings layers, and + # the weights will always be in embeddings.embeddings. 
+ if hasattr(embeddings, "embeddings"): + embedding_size = embeddings.embeddings.shape[0] + else: + embedding_size = embeddings.weight.shape[0] if len(tokenizer) > embedding_size: model.resize_token_embeddings(len(tokenizer)) # endregion @@ -529,7 +582,8 @@ def group_texts(examples): adam_global_clipnorm=training_args.max_grad_norm, ) - # no user-specified loss = will use the model internal loss + # Transformers models compute the right loss for their task by default when labels are passed, and will + # use this for training unless you specify your own loss function in compile(). model.compile(optimizer=optimizer, jit_compile=training_args.xla) # endregion @@ -555,9 +609,8 @@ def group_texts(examples): callbacks = [ PushToHubCallback( output_dir=training_args.output_dir, - model_id=push_to_hub_model_id, - organization=training_args.push_to_hub_organization, - token=training_args.push_to_hub_token, + hub_model_id=push_to_hub_model_id, + hub_token=training_args.push_to_hub_token, tokenizer=tokenizer, **model_card_kwargs, ) @@ -600,7 +653,7 @@ def group_texts(examples): if training_args.output_dir is not None: output_eval_file = os.path.join(training_args.output_dir, "all_results.json") - results_dict = dict() + results_dict = {} results_dict["train_loss"] = train_loss results_dict["train_perplexity"] = train_perplexity results_dict["eval_loss"] = validation_loss diff --git a/examples/tensorflow/language-modeling/run_mlm.py b/examples/tensorflow/language-modeling/run_mlm.py index f7812b611bce6a..b14648b2c9cc73 100755 --- a/examples/tensorflow/language-modeling/run_mlm.py +++ b/examples/tensorflow/language-modeling/run_mlm.py @@ -28,6 +28,7 @@ import os import random import sys +import warnings from dataclasses import dataclass, field from itertools import chain from pathlib import Path @@ -75,7 +76,7 @@ class ModelArguments: default=None, metadata={ "help": ( - "The model checkpoint for weights initialization.Don't set if you want to train a model from scratch." + "The model checkpoint for weights initialization. Don't set if you want to train a model from scratch." ) }, ) @@ -110,12 +111,28 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -226,6 +243,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. 
Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_mlm", model_args, data_args, framework="tensorflow") @@ -296,38 +322,39 @@ def main(): raw_datasets = load_dataset( data_args.dataset_name, data_args.dataset_config_name, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) if "validation" not in raw_datasets.keys(): raw_datasets["validation"] = load_dataset( data_args.dataset_name, data_args.dataset_config_name, split=f"train[:{data_args.validation_split_percentage}%]", - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) raw_datasets["train"] = load_dataset( data_args.dataset_name, data_args.dataset_config_name, split=f"train[{data_args.validation_split_percentage}%:]", - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: data_files = {} if data_args.train_file is not None: data_files["train"] = data_args.train_file + extension = data_args.train_file.split(".")[-1] if data_args.validation_file is not None: data_files["validation"] = data_args.validation_file - extension = data_args.train_file.split(".")[-1] + extension = data_args.validation_file.split(".")[-1] if extension == "txt": extension = "text" raw_datasets = load_dataset( extension, data_files=data_files, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # endregion # region Load pretrained model and tokenizer @@ -335,22 +362,32 @@ def main(): # In distributed training, the .from_pretrained methods guarantee that only one local process can concurrently # download model & vocab. 
if checkpoint is not None: - config = AutoConfig.from_pretrained(checkpoint) + config = AutoConfig.from_pretrained( + checkpoint, token=model_args.token, trust_remote_code=model_args.trust_remote_code + ) elif model_args.config_name: - config = AutoConfig.from_pretrained(model_args.config_name) + config = AutoConfig.from_pretrained( + model_args.config_name, token=model_args.token, trust_remote_code=model_args.trust_remote_code + ) elif model_args.model_name_or_path: - config = AutoConfig.from_pretrained(model_args.model_name_or_path) + config = AutoConfig.from_pretrained( + model_args.model_name_or_path, token=model_args.token, trust_remote_code=model_args.trust_remote_code + ) else: config = CONFIG_MAPPING[model_args.model_type]() logger.warning("You are instantiating a new config instance from scratch.") if model_args.tokenizer_name: - tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name) + tokenizer = AutoTokenizer.from_pretrained( + model_args.tokenizer_name, token=model_args.token, trust_remote_code=model_args.trust_remote_code + ) elif model_args.model_name_or_path: - tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) + tokenizer = AutoTokenizer.from_pretrained( + model_args.model_name_or_path, token=model_args.token, trust_remote_code=model_args.trust_remote_code + ) else: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name." ) # endregion @@ -371,7 +408,7 @@ def main(): else: if data_args.max_seq_length > tokenizer.model_max_length: logger.warning( - f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the" + f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the " f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}." ) max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length) @@ -441,7 +478,7 @@ def group_texts(examples): # might be slower to preprocess. # # To speed up this part, we use multiprocessing. See the documentation of the map method for more information: - # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map + # https://huggingface.co/docs/datasets/process#map tokenized_datasets = tokenized_datasets.map( group_texts, @@ -482,16 +519,33 @@ def group_texts(examples): with training_args.strategy.scope(): # region Prepare model if checkpoint is not None: - model = TFAutoModelForMaskedLM.from_pretrained(checkpoint, config=config) + model = TFAutoModelForMaskedLM.from_pretrained( + checkpoint, config=config, token=model_args.token, trust_remote_code=model_args.trust_remote_code + ) elif model_args.model_name_or_path: - model = TFAutoModelForMaskedLM.from_pretrained(model_args.model_name_or_path, config=config) + model = TFAutoModelForMaskedLM.from_pretrained( + model_args.model_name_or_path, + config=config, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, + ) else: logger.info("Training new model from scratch") - model = TFAutoModelForMaskedLM.from_config(config) + model = TFAutoModelForMaskedLM.from_config( + config, token=model_args.token, trust_remote_code=model_args.trust_remote_code + ) # We resize the embeddings only when necessary to avoid index errors. 
If you are creating a model from scratch # on a small vocab and want a smaller embedding size, remove this test. - embedding_size = model.get_input_embeddings().weight.shape[0] + embeddings = model.get_input_embeddings() + + # Matt: This is a temporary workaround as we transition our models to exclusively using Keras embeddings. + # As soon as the transition is complete, all embeddings should be keras.Embeddings layers, and + # the weights will always be in embeddings.embeddings. + if hasattr(embeddings, "embeddings"): + embedding_size = embeddings.embeddings.shape[0] + else: + embedding_size = embeddings.weight.shape[0] if len(tokenizer) > embedding_size: model.resize_token_embeddings(len(tokenizer)) # endregion @@ -551,8 +605,9 @@ def group_texts(examples): adam_global_clipnorm=training_args.max_grad_norm, ) - # no user-specified loss = will use the model internal loss - model.compile(optimizer=optimizer, jit_compile=training_args.xla, run_eagerly=True) + # Transformers models compute the right loss for their task by default when labels are passed, and will + # use this for training unless you specify your own loss function in compile(). + model.compile(optimizer=optimizer, jit_compile=training_args.xla) # endregion # region Preparing push_to_hub and model card @@ -577,9 +632,8 @@ def group_texts(examples): callbacks = [ PushToHubCallback( output_dir=training_args.output_dir, - model_id=push_to_hub_model_id, - organization=training_args.push_to_hub_organization, - token=training_args.push_to_hub_token, + hub_model_id=push_to_hub_model_id, + hub_token=training_args.push_to_hub_token, tokenizer=tokenizer, **model_card_kwargs, ) @@ -623,7 +677,7 @@ def group_texts(examples): if training_args.output_dir is not None: output_eval_file = os.path.join(training_args.output_dir, "all_results.json") - results_dict = dict() + results_dict = {} results_dict["train_loss"] = train_loss results_dict["train_perplexity"] = train_perplexity results_dict["eval_loss"] = validation_loss diff --git a/examples/tensorflow/multiple-choice/README.md b/examples/tensorflow/multiple-choice/README.md index 01e33fb62dbe23..a7f499963ec678 100644 --- a/examples/tensorflow/multiple-choice/README.md +++ b/examples/tensorflow/multiple-choice/README.md @@ -36,7 +36,7 @@ README, but for more information you can see the 'Input Datasets' section of ### Example command ```bash python run_swag.py \ - --model_name_or_path distilbert-base-cased \ + --model_name_or_path distilbert/distilbert-base-cased \ --output_dir output \ --do_eval \ --do_train diff --git a/examples/tensorflow/multiple-choice/run_swag.py b/examples/tensorflow/multiple-choice/run_swag.py index f5bc179a96ac11..d84279a30e6dc7 100644 --- a/examples/tensorflow/multiple-choice/run_swag.py +++ b/examples/tensorflow/multiple-choice/run_swag.py @@ -22,6 +22,7 @@ import logging import os import sys +import warnings from dataclasses import dataclass, field from itertools import chain from pathlib import Path @@ -50,7 +51,7 @@ # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") logger = logging.getLogger(__name__) @@ -146,12 +147,28 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. 
If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -239,6 +256,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_swag", model_args, data_args, framework="tensorflow") @@ -294,14 +320,15 @@ def main(): data_files = {} if data_args.train_file is not None: data_files["train"] = data_args.train_file + extension = data_args.train_file.split(".")[-1] if data_args.validation_file is not None: data_files["validation"] = data_args.validation_file - extension = data_args.train_file.split(".")[-1] + extension = data_args.validation_file.split(".")[-1] raw_datasets = load_dataset( extension, data_files=data_files, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: # Downloading and loading the swag dataset from the hub. @@ -309,10 +336,10 @@ def main(): "swag", "regular", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # When using your own dataset or a different dataset from swag, you will probably need to change this. 
ending_names = [f"ending{i}" for i in range(4)] @@ -335,14 +362,16 @@ def main(): config_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) tokenizer = AutoTokenizer.from_pretrained( model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) # endregion @@ -358,7 +387,7 @@ def main(): else: if data_args.max_seq_length > tokenizer.model_max_length: logger.warning( - f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the" + f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the " f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}." ) max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length) @@ -387,13 +416,12 @@ def preprocess_function(examples): if data_args.max_train_samples is not None: max_train_samples = min(len(train_dataset), data_args.max_train_samples) train_dataset = train_dataset.select(range(max_train_samples)) - with training_args.main_process_first(desc="train dataset map pre-processing"): - train_dataset = train_dataset.map( - preprocess_function, - batched=True, - num_proc=data_args.preprocessing_num_workers, - load_from_cache_file=not data_args.overwrite_cache, - ) + train_dataset = train_dataset.map( + preprocess_function, + batched=True, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) if training_args.do_eval: if "validation" not in raw_datasets: @@ -402,13 +430,12 @@ def preprocess_function(examples): if data_args.max_eval_samples is not None: max_eval_samples = min(len(eval_dataset), data_args.max_eval_samples) eval_dataset = eval_dataset.select(range(max_eval_samples)) - with training_args.main_process_first(desc="validation dataset map pre-processing"): - eval_dataset = eval_dataset.map( - preprocess_function, - batched=True, - num_proc=data_args.preprocessing_num_workers, - load_from_cache_file=not data_args.overwrite_cache, - ) + eval_dataset = eval_dataset.map( + preprocess_function, + batched=True, + num_proc=data_args.preprocessing_num_workers, + load_from_cache_file=not data_args.overwrite_cache, + ) if data_args.pad_to_max_length: data_collator = DefaultDataCollator(return_tensors="np") @@ -428,7 +455,8 @@ def preprocess_function(examples): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) num_replicas = training_args.strategy.num_replicas_in_sync @@ -455,6 +483,8 @@ def preprocess_function(examples): ) else: optimizer = None + # Transformers models compute the right loss for their task by default when labels are passed, and will + # use this for training unless you specify your own loss function in compile(). 
model.compile(optimizer=optimizer, metrics=["accuracy"], jit_compile=training_args.xla) # endregion @@ -470,9 +500,8 @@ def preprocess_function(examples): callbacks = [ PushToHubCallback( output_dir=training_args.output_dir, - model_id=push_to_hub_model_id, - organization=training_args.push_to_hub_organization, - token=training_args.push_to_hub_token, + hub_model_id=push_to_hub_model_id, + hub_token=training_args.push_to_hub_token, tokenizer=tokenizer, **model_card_kwargs, ) diff --git a/examples/tensorflow/question-answering/README.md b/examples/tensorflow/question-answering/README.md index b7c0443b1b079e..41cc8b7ef30c69 100644 --- a/examples/tensorflow/question-answering/README.md +++ b/examples/tensorflow/question-answering/README.md @@ -45,9 +45,9 @@ README, but for more information you can see the 'Input Datasets' section of [this document](https://www.tensorflow.org/guide/tpu). ### Example command -``` +```bash python run_qa.py \ ---model_name_or_path distilbert-base-cased \ +--model_name_or_path distilbert/distilbert-base-cased \ --output_dir output \ --dataset_name squad \ --do_train \ diff --git a/examples/tensorflow/question-answering/run_qa.py b/examples/tensorflow/question-answering/run_qa.py index 1c3acd34aedd9d..8d5116d72ffaac 100755 --- a/examples/tensorflow/question-answering/run_qa.py +++ b/examples/tensorflow/question-answering/run_qa.py @@ -22,6 +22,7 @@ import logging import os import sys +import warnings from dataclasses import dataclass, field from pathlib import Path from typing import Optional @@ -29,6 +30,7 @@ import evaluate import tensorflow as tf from datasets import load_dataset +from packaging.version import parse from utils_qa import postprocess_qa_predictions import transformers @@ -47,8 +49,21 @@ from transformers.utils import CONFIG_NAME, TF2_WEIGHTS_NAME, check_min_version, send_example_telemetry +try: + import tf_keras as keras +except (ModuleNotFoundError, ImportError): + import keras + + if parse(keras.__version__).major > 2: + raise ValueError( + "Your currently installed version of Keras is Keras 3, but this is not yet supported in " + "Transformers. Please install the backwards-compatible tf-keras package with " + "`pip install tf-keras`." + ) + + # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") logger = logging.getLogger(__name__) @@ -77,12 +92,28 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." 
) }, ) @@ -216,7 +247,7 @@ def __post_init__(self): # region Helper classes -class SavePretrainedCallback(tf.keras.callbacks.Callback): +class SavePretrainedCallback(keras.callbacks.Callback): # Hugging Face models have a save_pretrained() method that saves both the weights and the necessary # metadata to allow them to be loaded as a pretrained model in future. This is a simple Keras callback # that saves the model with this method after each epoch. @@ -245,6 +276,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_qa", model_args, data_args, framework="tensorflow") @@ -304,7 +344,7 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: data_files = {} @@ -323,10 +363,10 @@ def main(): data_files=data_files, field="data", cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. # endregion # region Load pretrained model and tokenizer @@ -338,14 +378,16 @@ def main(): model_args.config_name if model_args.config_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) tokenizer = AutoTokenizer.from_pretrained( model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=True, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) # endregion @@ -375,7 +417,7 @@ def main(): if data_args.max_seq_length > tokenizer.model_max_length: logger.warning( - f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the" + f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the " f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}." 
) max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length) @@ -464,13 +506,13 @@ def prepare_train_features(examples): return tokenized_examples - processed_datasets = dict() + processed_datasets = {} if training_args.do_train: if "train" not in datasets: raise ValueError("--do_train requires a train dataset") train_dataset = datasets["train"] if data_args.max_train_samples is not None: - # We will select sample from whole data if agument is specified + # We will select sample from whole data if argument is specified max_train_samples = min(len(train_dataset), data_args.max_train_samples) train_dataset = train_dataset.select(range(max_train_samples)) # Create train feature from dataset @@ -603,7 +645,9 @@ def post_processing_function(examples, features, predictions, stage="eval"): references = [{"id": ex["id"], "answers": ex[answer_column_name]} for ex in examples] return EvalPrediction(predictions=formatted_predictions, label_ids=references) - metric = evaluate.load("squad_v2" if data_args.version_2_with_negative else "squad") + metric = evaluate.load( + "squad_v2" if data_args.version_2_with_negative else "squad", cache_dir=model_args.cache_dir + ) def compute_metrics(p: EvalPrediction): return metric.compute(predictions=p.predictions, references=p.label_ids) @@ -625,7 +669,8 @@ def compute_metrics(p: EvalPrediction): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) if training_args.do_train: training_dataset = model.prepare_tf_dataset( @@ -656,7 +701,8 @@ def compute_metrics(p: EvalPrediction): adam_global_clipnorm=training_args.max_grad_norm, ) - # no user-specified loss = will use the model internal loss + # Transformers models compute the right loss for their task by default when labels are passed, and will + # use this for training unless you specify your own loss function in compile(). model.compile(optimizer=optimizer, jit_compile=training_args.xla, metrics=["accuracy"]) else: @@ -709,9 +755,8 @@ def compute_metrics(p: EvalPrediction): callbacks = [ PushToHubCallback( output_dir=training_args.output_dir, - model_id=push_to_hub_model_id, - organization=training_args.push_to_hub_organization, - token=training_args.push_to_hub_token, + hub_model_id=push_to_hub_model_id, + hub_token=training_args.push_to_hub_token, tokenizer=tokenizer, **model_card_kwargs, ) diff --git a/examples/tensorflow/summarization/run_summarization.py b/examples/tensorflow/summarization/run_summarization.py index 61ee9c2ba6d37f..d4430227860a9f 100644 --- a/examples/tensorflow/summarization/run_summarization.py +++ b/examples/tensorflow/summarization/run_summarization.py @@ -22,6 +22,7 @@ import logging import os import sys +import warnings from dataclasses import dataclass, field from typing import Optional @@ -53,7 +54,7 @@ # region Checking dependencies # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/summarization/requirements.txt") @@ -99,12 +100,28 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. 
If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -177,7 +194,7 @@ class DataTrainingArguments: metadata={ "help": ( "The maximum total sequence length for validation target text after tokenization. Sequences longer " - "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`." + "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`. " "This argument is also used to override the ``max_length`` param of ``model.generate``, which is used " "during ``evaluate`` and ``predict``." ) @@ -221,7 +238,7 @@ class DataTrainingArguments: }, ) num_beams: Optional[int] = field( - default=None, + default=1, metadata={ "help": ( "Number of beams to use for evaluation. This argument will be passed to ``model.generate``, " @@ -287,6 +304,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_summarization", model_args, data_args, framework="tensorflow") @@ -308,11 +334,11 @@ def main(): # region T5 special-casing if data_args.source_prefix is None and model_args.model_name_or_path in [ - "t5-small", - "t5-base", - "t5-large", - "t5-3b", - "t5-11b", + "google-t5/t5-small", + "google-t5/t5-base", + "google-t5/t5-large", + "google-t5/t5-3b", + "google-t5/t5-11b", ]: logger.warning( "You're running a t5 model but didn't provide a source prefix, which is the expected, e.g. with " @@ -355,7 +381,7 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: data_files = {} @@ -372,10 +398,10 @@ def main(): extension, data_files=data_files, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. 
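# Editor's note: the block below is an illustrative, standalone sketch added for this review;
# it is not part of the patch. It condenses the deprecation shim these scripts add in main():
# `--use_auth_token` still works but emits a FutureWarning and is copied into the new `token`
# field, while passing both arguments is rejected. `DemoModelArguments` and `resolve_token`
# are hypothetical stand-ins for the dataclasses and inline logic used in the scripts.
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class DemoModelArguments:
    token: Optional[str] = None
    use_auth_token: Optional[bool] = None


def resolve_token(model_args: DemoModelArguments) -> DemoModelArguments:
    if model_args.use_auth_token is not None:
        warnings.warn(
            "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.",
            FutureWarning,
        )
        if model_args.token is not None:
            raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.")
        model_args.token = model_args.use_auth_token
    return model_args


resolve_token(DemoModelArguments(use_auth_token=True))  # warns, then copies the value into `token`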
# endregion # region Load model config and tokenizer @@ -388,14 +414,16 @@ def main(): model_args.config_name if model_args.config_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) tokenizer = AutoTokenizer.from_pretrained( model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) prefix = data_args.source_prefix if data_args.source_prefix is not None else "" @@ -460,15 +488,14 @@ def preprocess_function(examples): if data_args.max_train_samples is not None: max_train_samples = min(len(train_dataset), data_args.max_train_samples) train_dataset = train_dataset.select(range(max_train_samples)) - with training_args.main_process_first(desc="train dataset map pre-processing"): - train_dataset = train_dataset.map( - preprocess_function, - batched=True, - num_proc=data_args.preprocessing_num_workers, - remove_columns=column_names, - load_from_cache_file=not data_args.overwrite_cache, - desc="Running tokenizer on train dataset", - ) + train_dataset = train_dataset.map( + preprocess_function, + batched=True, + num_proc=data_args.preprocessing_num_workers, + remove_columns=column_names, + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on train dataset", + ) else: train_dataset = None @@ -480,15 +507,14 @@ def preprocess_function(examples): if data_args.max_eval_samples is not None: max_eval_samples = min(len(eval_dataset), data_args.max_eval_samples) eval_dataset = eval_dataset.select(range(max_eval_samples)) - with training_args.main_process_first(desc="validation dataset map pre-processing"): - eval_dataset = eval_dataset.map( - preprocess_function, - batched=True, - num_proc=data_args.preprocessing_num_workers, - remove_columns=column_names, - load_from_cache_file=not data_args.overwrite_cache, - desc="Running tokenizer on validation dataset", - ) + eval_dataset = eval_dataset.map( + preprocess_function, + batched=True, + num_proc=data_args.preprocessing_num_workers, + remove_columns=column_names, + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on validation dataset", + ) else: eval_dataset = None # endregion @@ -513,12 +539,21 @@ def postprocess_text(preds, labels): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) # We resize the embeddings only when necessary to avoid index errors. If you are creating a model from scratch # on a small vocab and want a smaller embedding size, remove this test. - embedding_size = model.get_input_embeddings().weight.shape[0] + embeddings = model.get_input_embeddings() + + # Matt: This is a temporary workaround as we transition our models to exclusively using Keras embeddings. + # As soon as the transition is complete, all embeddings should be keras.Embeddings layers, and + # the weights will always be in embeddings.embeddings. 
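# Editor's note: the block below is an illustrative, standalone sketch added for this review;
# it is not part of the patch. It wraps the workaround described in the comment above into a
# hypothetical helper: during the transition to Keras embedding layers, the vocabulary size
# lives either in `embeddings.embeddings` (a Keras Embedding weight) or in the legacy
# `embeddings.weight` attribute.
def demo_embedding_size(model) -> int:
    embeddings = model.get_input_embeddings()
    if hasattr(embeddings, "embeddings"):
        return int(embeddings.embeddings.shape[0])
    return int(embeddings.weight.shape[0])


# Typical use, mirroring the scripts: resize only when the tokenizer outgrew the embedding table.
# if len(tokenizer) > demo_embedding_size(model):
#     model.resize_token_embeddings(len(tokenizer))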
+ if hasattr(embeddings, "embeddings"): + embedding_size = embeddings.embeddings.shape[0] + else: + embedding_size = embeddings.weight.shape[0] if len(tokenizer) > embedding_size: model.resize_token_embeddings(len(tokenizer)) # endregion @@ -592,7 +627,7 @@ def postprocess_text(preds, labels): # region Metric and KerasMetricCallback if training_args.do_eval: - metric = evaluate.load("rouge") + metric = evaluate.load("rouge", cache_dir=model_args.cache_dir) if data_args.val_max_target_length is None: data_args.val_max_target_length = data_args.max_target_length @@ -657,9 +692,8 @@ def compute_metrics(preds): callbacks.append( PushToHubCallback( output_dir=training_args.output_dir, - model_id=push_to_hub_model_id, - organization=training_args.push_to_hub_organization, - token=training_args.push_to_hub_token, + hub_model_id=push_to_hub_model_id, + hub_token=training_args.push_to_hub_token, tokenizer=tokenizer, **model_card_kwargs, ) @@ -667,6 +701,8 @@ def compute_metrics(preds): # endregion # region Training + # Transformers models compute the right loss for their task by default when labels are passed, and will + # use this for training unless you specify your own loss function in compile(). model.compile(optimizer=optimizer, jit_compile=training_args.xla) eval_metrics = None if training_args.do_train: diff --git a/examples/tensorflow/test_tensorflow_examples.py b/examples/tensorflow/test_tensorflow_examples.py index 956209baade456..914ea767d0f08e 100644 --- a/examples/tensorflow/test_tensorflow_examples.py +++ b/examples/tensorflow/test_tensorflow_examples.py @@ -23,6 +23,20 @@ from unittest.mock import patch import tensorflow as tf +from packaging.version import parse + + +try: + import tf_keras as keras +except (ModuleNotFoundError, ImportError): + import keras + + if parse(keras.__version__).major > 2: + raise ValueError( + "Your currently installed version of Keras is Keras 3, but this is not yet supported in " + "Transformers. Please install the backwards-compatible tf-keras package with " + "`pip install tf-keras`." 
+ ) from transformers.testing_utils import TestCasePlus, get_gpu_count, slow @@ -93,7 +107,7 @@ def test_run_text_classification(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" run_text_classification.py - --model_name_or_path distilbert-base-uncased + --model_name_or_path distilbert/distilbert-base-uncased --output_dir {tmp_dir} --overwrite_output_dir --train_file ./tests/fixtures/tests_samples/MRPC/train.csv @@ -115,7 +129,7 @@ def test_run_text_classification(self): with patch.object(sys, "argv", testargs): run_text_classification.main() # Reset the mixed precision policy so we don't break other tests - tf.keras.mixed_precision.set_global_policy("float32") + keras.mixed_precision.set_global_policy("float32") result = get_results(tmp_dir) self.assertGreaterEqual(result["eval_accuracy"], 0.75) @@ -123,7 +137,7 @@ def test_run_clm(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" run_clm.py - --model_name_or_path distilgpt2 + --model_name_or_path distilbert/distilgpt2 --train_file ./tests/fixtures/sample_text.txt --validation_file ./tests/fixtures/sample_text.txt --do_train @@ -149,7 +163,7 @@ def test_run_mlm(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" run_mlm.py - --model_name_or_path distilroberta-base + --model_name_or_path distilbert/distilroberta-base --train_file ./tests/fixtures/sample_text.txt --validation_file ./tests/fixtures/sample_text.txt --max_seq_length 64 @@ -174,7 +188,7 @@ def test_run_ner(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" run_ner.py - --model_name_or_path bert-base-uncased + --model_name_or_path google-bert/bert-base-uncased --train_file tests/fixtures/tests_samples/conll/sample.json --validation_file tests/fixtures/tests_samples/conll/sample.json --output_dir {tmp_dir} @@ -198,7 +212,7 @@ def test_run_squad(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" run_qa.py - --model_name_or_path bert-base-uncased + --model_name_or_path google-bert/bert-base-uncased --version_2_with_negative --train_file tests/fixtures/tests_samples/SQUAD/sample.json --validation_file tests/fixtures/tests_samples/SQUAD/sample.json @@ -223,7 +237,7 @@ def test_run_swag(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" run_swag.py - --model_name_or_path bert-base-uncased + --model_name_or_path google-bert/bert-base-uncased --train_file tests/fixtures/tests_samples/swag/sample.json --validation_file tests/fixtures/tests_samples/swag/sample.json --output_dir {tmp_dir} @@ -247,7 +261,7 @@ def test_run_summarization(self): tmp_dir = self.get_auto_remove_tmp_dir() testargs = f""" run_summarization.py - --model_name_or_path t5-small + --model_name_or_path google-t5/t5-small --train_file tests/fixtures/tests_samples/xsum/sample.json --validation_file tests/fixtures/tests_samples/xsum/sample.json --output_dir {tmp_dir} diff --git a/examples/tensorflow/text-classification/README.md b/examples/tensorflow/text-classification/README.md index 898cfa70145b26..b8bc0b367c4d82 100644 --- a/examples/tensorflow/text-classification/README.md +++ b/examples/tensorflow/text-classification/README.md @@ -36,7 +36,7 @@ may not always be what you want, especially if you have more than two fields! 
Here is a snippet of a valid input JSON file, though note that your texts can be much longer than these, and are not constrained (despite the field name) to being single grammatical sentences: -``` +```json {"sentence1": "COVID-19 vaccine updates: How is the rollout proceeding?", "label": "news"} {"sentence1": "Manchester United celebrates Europa League success", "label": "sports"} ``` @@ -69,9 +69,9 @@ README, but for more information you can see the 'Input Datasets' section of [this document](https://www.tensorflow.org/guide/tpu). ### Example command -``` +```bash python run_text_classification.py \ ---model_name_or_path distilbert-base-cased \ +--model_name_or_path distilbert/distilbert-base-cased \ --train_file training_data.json \ --validation_file validation_data.json \ --output_dir output/ \ @@ -101,9 +101,9 @@ README, but for more information you can see the 'Input Datasets' section of [this document](https://www.tensorflow.org/guide/tpu). ### Example command -``` +```bash python run_glue.py \ ---model_name_or_path distilbert-base-cased \ +--model_name_or_path distilbert/distilbert-base-cased \ --task_name mnli \ --do_train \ --do_eval \ diff --git a/examples/tensorflow/text-classification/run_glue.py b/examples/tensorflow/text-classification/run_glue.py index bf03901011fa4b..5ce564850a08a3 100644 --- a/examples/tensorflow/text-classification/run_glue.py +++ b/examples/tensorflow/text-classification/run_glue.py @@ -20,6 +20,7 @@ import logging import os import sys +import warnings from dataclasses import dataclass, field from typing import Optional @@ -47,7 +48,7 @@ # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") task_to_keys = { "cola": ("sentence", None), @@ -164,12 +165,28 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -192,6 +209,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. 
The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_glue", model_args, data_args, framework="tensorflow") @@ -242,10 +268,10 @@ def main(): "glue", data_args.task_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # See more about loading any type of standard or custom dataset at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. is_regression = data_args.task_name == "stsb" if not is_regression: @@ -284,14 +310,16 @@ def main(): finetuning_task=data_args.task_name, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) tokenizer = AutoTokenizer.from_pretrained( model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) # endregion @@ -310,12 +338,12 @@ def main(): if config.label2id != PretrainedConfig(num_labels=num_labels).label2id and not is_regression: # Some have all caps in their config, some don't. label_name_to_id = {k.lower(): v for k, v in config.label2id.items()} - if list(sorted(label_name_to_id.keys())) == list(sorted(label_list)): + if sorted(label_name_to_id.keys()) == sorted(label_list): label_to_id = {i: int(label_name_to_id[label_list[i]]) for i in range(num_labels)} else: logger.warning( "Your model seems to have been trained with labels, but they don't match the dataset: ", - f"model labels: {list(sorted(label_name_to_id.keys()))}, dataset labels: {list(sorted(label_list))}." + f"model labels: {sorted(label_name_to_id.keys())}, dataset labels: {sorted(label_list)}." "\nIgnoring the model labels as a result.", ) label_to_id = {label: i for i, label in enumerate(label_list)} @@ -328,7 +356,7 @@ def main(): if data_args.max_seq_length > tokenizer.model_max_length: logger.warning( - f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the" + f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the " f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}." 
) max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length) @@ -351,7 +379,7 @@ def preprocess_function(examples): # endregion # region Metric function - metric = evaluate.load("glue", data_args.task_name) + metric = evaluate.load("glue", data_args.task_name, cache_dir=model_args.cache_dir) def compute_metrics(preds, label_ids): preds = preds["logits"] @@ -374,7 +402,8 @@ def compute_metrics(preds, label_ids): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) # endregion @@ -383,7 +412,7 @@ def compute_metrics(preds, label_ids): dataset_options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.OFF num_replicas = training_args.strategy.num_replicas_in_sync - tf_data = dict() + tf_data = {} max_samples = { "train": data_args.max_train_samples, "validation": data_args.max_eval_samples, @@ -453,6 +482,8 @@ def compute_metrics(preds, label_ids): metrics = [] else: metrics = ["accuracy"] + # Transformers models compute the right loss for their task by default when labels are passed, and will + # use this for training unless you specify your own loss function in compile(). model.compile(optimizer=optimizer, metrics=metrics, jit_compile=training_args.xla) # endregion @@ -469,9 +500,8 @@ def compute_metrics(preds, label_ids): callbacks = [ PushToHubCallback( output_dir=training_args.output_dir, - model_id=push_to_hub_model_id, - organization=training_args.push_to_hub_organization, - token=training_args.push_to_hub_token, + hub_model_id=push_to_hub_model_id, + hub_token=training_args.push_to_hub_token, tokenizer=tokenizer, **model_card_kwargs, ) diff --git a/examples/tensorflow/text-classification/run_text_classification.py b/examples/tensorflow/text-classification/run_text_classification.py index 0cf1972e937fb8..bfa2c63dba60a6 100644 --- a/examples/tensorflow/text-classification/run_text_classification.py +++ b/examples/tensorflow/text-classification/run_text_classification.py @@ -20,12 +20,14 @@ import logging import os import sys +import warnings from dataclasses import dataclass, field from pathlib import Path from typing import Optional import numpy as np from datasets import load_dataset +from packaging.version import parse from transformers import ( AutoConfig, @@ -45,11 +47,24 @@ import tensorflow as tf # noqa: E402 +try: + import tf_keras as keras +except (ModuleNotFoundError, ImportError): + import keras + + if parse(keras.__version__).major > 2: + raise ValueError( + "Your currently installed version of Keras is Keras 3, but this is not yet supported in " + "Transformers. Please install the backwards-compatible tf-keras package with " + "`pip install tf-keras`." + ) + + logger = logging.getLogger(__name__) # region Helper classes -class SavePretrainedCallback(tf.keras.callbacks.Callback): +class SavePretrainedCallback(keras.callbacks.Callback): # Hugging Face models have a save_pretrained() method that saves both the weights and the necessary # metadata to allow them to be loaded as a pretrained model in future. This is a simple Keras callback # that saves the model with this method after each epoch. @@ -100,7 +115,7 @@ class DataTrainingArguments: metadata={ "help": ( "Whether to pad all samples to `max_seq_length`. " - "If False, will pad the samples dynamically when batching to the maximum length in the batch." 
+ "If False, will pad the samples dynamically when batching to the maximum length in the batch. " "Data will always be padded when using TPUs." ) }, @@ -170,12 +185,28 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -198,6 +229,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_text_classification", model_args, data_args, framework="tensorflow") @@ -258,13 +298,13 @@ def main(): "csv", data_files=data_files, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: # Loading a dataset from local json files datasets = load_dataset("json", data_files=data_files, cache_dir=model_args.cache_dir) # See more about loading any type of standard or custom dataset at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. 
# endregion # region Label preprocessing @@ -301,20 +341,23 @@ def main(): num_labels=num_labels, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) else: config = AutoConfig.from_pretrained( config_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) tokenizer = AutoTokenizer.from_pretrained( model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) # endregion @@ -334,7 +377,7 @@ def main(): if data_args.max_seq_length > tokenizer.model_max_length: logger.warning( - f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the" + f"The max_seq_length passed ({data_args.max_seq_length}) is larger than the maximum length for the " f"model ({tokenizer.model_max_length}). Using max_seq_length={tokenizer.model_max_length}." ) max_seq_length = min(data_args.max_seq_length, tokenizer.model_max_length) @@ -343,13 +386,13 @@ def main(): if "train" in datasets: if not is_regression and config.label2id != PretrainedConfig(num_labels=num_labels).label2id: label_name_to_id = config.label2id - if list(sorted(label_name_to_id.keys())) == list(sorted(label_list)): + if sorted(label_name_to_id.keys()) == sorted(label_list): label_to_id = label_name_to_id # Use the model's labels else: logger.warning( "Your model seems to have been trained with labels, but they don't match the dataset: ", - f"model labels: {list(sorted(label_name_to_id.keys()))}, dataset labels:" - f" {list(sorted(label_list))}.\nIgnoring the model labels as a result.", + f"model labels: {sorted(label_name_to_id.keys())}, dataset labels:" + f" {sorted(label_list)}.\nIgnoring the model labels as a result.", ) label_to_id = {v: i for i, v in enumerate(label_list)} elif not is_regression: @@ -402,7 +445,8 @@ def preprocess_function(examples): config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) # endregion @@ -411,7 +455,7 @@ def preprocess_function(examples): dataset_options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.OFF num_replicas = training_args.strategy.num_replicas_in_sync - tf_data = dict() + tf_data = {} max_samples = { "train": data_args.max_train_samples, "validation": data_args.max_val_samples, @@ -487,6 +531,8 @@ def preprocess_function(examples): metrics = [] else: metrics = ["accuracy"] + # Transformers models compute the right loss for their task by default when labels are passed, and will + # use this for training unless you specify your own loss function in compile(). 
model.compile(optimizer=optimizer, metrics=metrics) # endregion @@ -502,9 +548,8 @@ def preprocess_function(examples): callbacks = [ PushToHubCallback( output_dir=training_args.output_dir, - model_id=push_to_hub_model_id, - organization=training_args.push_to_hub_organization, - token=training_args.push_to_hub_token, + hub_model_id=push_to_hub_model_id, + hub_token=training_args.push_to_hub_token, tokenizer=tokenizer, **model_card_kwargs, ) diff --git a/examples/tensorflow/token-classification/README.md b/examples/tensorflow/token-classification/README.md index 0e5ec84528f8f2..6c8a15c00e813a 100644 --- a/examples/tensorflow/token-classification/README.md +++ b/examples/tensorflow/token-classification/README.md @@ -27,7 +27,7 @@ The following example fine-tunes BERT on CoNLL-2003: ```bash python run_ner.py \ - --model_name_or_path bert-base-uncased \ + --model_name_or_path google-bert/bert-base-uncased \ --dataset_name conll2003 \ --output_dir /tmp/test-ner ``` @@ -36,7 +36,7 @@ To run on your own training and validation files, use the following command: ```bash python run_ner.py \ - --model_name_or_path bert-base-uncased \ + --model_name_or_path google-bert/bert-base-uncased \ --train_file path_to_train_file \ --validation_file path_to_validation_file \ --output_dir /tmp/test-ner diff --git a/examples/tensorflow/token-classification/run_ner.py b/examples/tensorflow/token-classification/run_ner.py index 7b90938f02d7ae..db8aa7af42ed8a 100644 --- a/examples/tensorflow/token-classification/run_ner.py +++ b/examples/tensorflow/token-classification/run_ner.py @@ -14,14 +14,14 @@ # See the License for the specific language governing permissions and # limitations under the License. """ -Fine-tuning a 🤗 Transformers model on token classification tasks (NER, POS, CHUNKS) relying on the accelerate library -without using a Trainer. +Fine-tuning a 🤗 Transformers model on token classification tasks (NER, POS, CHUNKS) """ import json import logging import os import random +import warnings from dataclasses import dataclass, field from typing import Optional @@ -76,12 +76,28 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -197,6 +213,15 @@ def main(): parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TFTrainingArguments)) model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. 
Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_ner", model_args, data_args, framework="tensorflow") @@ -229,22 +254,23 @@ def main(): raw_datasets = load_dataset( data_args.dataset_name, data_args.dataset_config_name, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: data_files = {} if data_args.train_file is not None: data_files["train"] = data_args.train_file + extension = data_args.train_file.split(".")[-1] if data_args.validation_file is not None: data_files["validation"] = data_args.validation_file - extension = data_args.train_file.split(".")[-1] + extension = data_args.validation_file.split(".")[-1] raw_datasets = load_dataset( extension, data_files=data_files, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading_datasets. if raw_datasets["train"] is not None: column_names = raw_datasets["train"].column_names @@ -292,9 +318,19 @@ def get_label_list(labels): # In distributed training, the .from_pretrained methods guarantee that only one local process can concurrently # download model & vocab. if model_args.config_name: - config = AutoConfig.from_pretrained(model_args.config_name, num_labels=num_labels) + config = AutoConfig.from_pretrained( + model_args.config_name, + num_labels=num_labels, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, + ) elif model_args.model_name_or_path: - config = AutoConfig.from_pretrained(model_args.model_name_or_path, num_labels=num_labels) + config = AutoConfig.from_pretrained( + model_args.model_name_or_path, + num_labels=num_labels, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, + ) else: config = CONFIG_MAPPING[model_args.model_type]() logger.warning("You are instantiating a new config instance from scratch.") @@ -302,14 +338,25 @@ def get_label_list(labels): tokenizer_name_or_path = model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path if not tokenizer_name_or_path: raise ValueError( - "You are instantiating a new tokenizer from scratch. This is not supported by this script." + "You are instantiating a new tokenizer from scratch. This is not supported by this script. " "You can do it from another script, save it, and load it from here, using --tokenizer_name." 
) if config.model_type in {"gpt2", "roberta"}: - tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path, use_fast=True, add_prefix_space=True) + tokenizer = AutoTokenizer.from_pretrained( + tokenizer_name_or_path, + use_fast=True, + add_prefix_space=True, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, + ) else: - tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path, use_fast=True) + tokenizer = AutoTokenizer.from_pretrained( + tokenizer_name_or_path, + use_fast=True, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, + ) # endregion # region Preprocessing the raw datasets @@ -380,14 +427,26 @@ def tokenize_and_align_labels(examples): model = TFAutoModelForTokenClassification.from_pretrained( model_args.model_name_or_path, config=config, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) else: logger.info("Training new model from scratch") - model = TFAutoModelForTokenClassification.from_config(config) + model = TFAutoModelForTokenClassification.from_config( + config, token=model_args.token, trust_remote_code=model_args.trust_remote_code + ) # We resize the embeddings only when necessary to avoid index errors. If you are creating a model from scratch # on a small vocab and want a smaller embedding size, remove this test. - embedding_size = model.get_input_embeddings().weight.shape[0] + embeddings = model.get_input_embeddings() + + # Matt: This is a temporary workaround as we transition our models to exclusively using Keras embeddings. + # As soon as the transition is complete, all embeddings should be keras.Embeddings layers, and + # the weights will always be in embeddings.embeddings. + if hasattr(embeddings, "embeddings"): + embedding_size = embeddings.embeddings.shape[0] + else: + embedding_size = embeddings.weight.shape[0] if len(tokenizer) > embedding_size: model.resize_token_embeddings(len(tokenizer)) # endregion @@ -447,12 +506,13 @@ def tokenize_and_align_labels(examples): weight_decay_rate=training_args.weight_decay, adam_global_clipnorm=training_args.max_grad_norm, ) - + # Transformers models compute the right loss for their task by default when labels are passed, and will + # use this for training unless you specify your own loss function in compile(). model.compile(optimizer=optimizer, jit_compile=training_args.xla) # endregion # Metrics - metric = evaluate.load("seqeval") + metric = evaluate.load("seqeval", cache_dir=model_args.cache_dir) def get_labels(y_pred, y_true): # Transform predictions and references tensos to numpy arrays @@ -512,9 +572,8 @@ def compute_metrics(): callbacks = [ PushToHubCallback( output_dir=training_args.output_dir, - model_id=push_to_hub_model_id, - organization=training_args.push_to_hub_organization, - token=training_args.push_to_hub_token, + hub_model_id=push_to_hub_model_id, + hub_token=training_args.push_to_hub_token, tokenizer=tokenizer, **model_card_kwargs, ) diff --git a/examples/tensorflow/translation/README.md b/examples/tensorflow/translation/README.md index df5ee9c1ae36ba..bbe6e27e9c78a4 100644 --- a/examples/tensorflow/translation/README.md +++ b/examples/tensorflow/translation/README.md @@ -29,11 +29,11 @@ can also be used by passing the name of the TPU resource with the `--tpu` argume MBart and some T5 models require special handling. -T5 models `t5-small`, `t5-base`, `t5-large`, `t5-3b` and `t5-11b` must use an additional argument: `--source_prefix "translate {source_lang} to {target_lang}"`. 
For example: +T5 models `google-t5/t5-small`, `google-t5/t5-base`, `google-t5/t5-large`, `google-t5/t5-3b` and `google-t5/t5-11b` must use an additional argument: `--source_prefix "translate {source_lang} to {target_lang}"`. For example: ```bash python run_translation.py \ - --model_name_or_path t5-small \ + --model_name_or_path google-t5/t5-small \ --do_train \ --do_eval \ --source_lang en \ diff --git a/examples/tensorflow/translation/run_translation.py b/examples/tensorflow/translation/run_translation.py index 09c0b8a9ea7ed0..e54fa17c79f585 100644 --- a/examples/tensorflow/translation/run_translation.py +++ b/examples/tensorflow/translation/run_translation.py @@ -22,6 +22,7 @@ import logging import os import sys +import warnings from dataclasses import dataclass, field from typing import Optional @@ -56,7 +57,7 @@ # region Dependencies and constants # Will error if the minimal version of Transformers is not installed. Remove at your own risks. -check_min_version("4.27.0.dev0") +check_min_version("4.38.0.dev0") require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/summarization/requirements.txt") @@ -93,12 +94,28 @@ class ModelArguments: default="main", metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."}, ) + token: str = field( + default=None, + metadata={ + "help": ( + "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token " + "generated when running `huggingface-cli login` (stored in `~/.huggingface`)." + ) + }, + ) use_auth_token: bool = field( + default=None, + metadata={ + "help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead." + }, + ) + trust_remote_code: bool = field( default=False, metadata={ "help": ( - "Will use the token generated when running `huggingface-cli login` (necessary to use this script " - "with private models)." + "Whether or not to allow for custom models defined on the Hub in their own modeling files. This option " + "should only be set to `True` for repositories you trust and in which you have read the code, as it will " + "execute code present on the Hub on your local machine." ) }, ) @@ -165,7 +182,7 @@ class DataTrainingArguments: metadata={ "help": ( "The maximum total sequence length for validation target text after tokenization. Sequences longer " - "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`." + "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`. " "This argument is also used to override the ``max_length`` param of ``model.generate``, which is used " "during ``evaluate`` and ``predict``." ) @@ -209,7 +226,7 @@ class DataTrainingArguments: }, ) num_beams: Optional[int] = field( - default=None, + default=1, metadata={ "help": ( "Number of beams to use for evaluation. This argument will be passed to ``model.generate``, " @@ -268,6 +285,15 @@ def main(): else: model_args, data_args, training_args = parser.parse_args_into_dataclasses() + if model_args.use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.", + FutureWarning, + ) + if model_args.token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + model_args.token = model_args.use_auth_token + # Sending telemetry. 
Tracking the example usage helps us better allocate resources to maintain them. The # information sent is the one passed as arguments along with your Python/PyTorch versions. send_example_telemetry("run_translation", model_args, data_args, framework="tensorflow") @@ -322,7 +348,7 @@ def main(): data_args.dataset_name, data_args.dataset_config_name, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) else: data_files = {} @@ -336,10 +362,10 @@ def main(): extension, data_files=data_files, cache_dir=model_args.cache_dir, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, ) # See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at - # https://huggingface.co/docs/datasets/loading_datasets.html. + # https://huggingface.co/docs/datasets/loading # endregion # region Load model config and tokenizer @@ -352,14 +378,16 @@ def main(): model_args.config_name if model_args.config_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) tokenizer = AutoTokenizer.from_pretrained( model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_fast=model_args.use_fast_tokenizer, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) prefix = data_args.source_prefix if data_args.source_prefix is not None else "" @@ -426,15 +454,14 @@ def preprocess_function(examples): if data_args.max_train_samples is not None: max_train_samples = min(len(train_dataset), data_args.max_train_samples) train_dataset = train_dataset.select(range(max_train_samples)) - with training_args.main_process_first(desc="train dataset map pre-processing"): - train_dataset = train_dataset.map( - preprocess_function, - batched=True, - num_proc=data_args.preprocessing_num_workers, - remove_columns=column_names, - load_from_cache_file=not data_args.overwrite_cache, - desc="Running tokenizer on train dataset", - ) + train_dataset = train_dataset.map( + preprocess_function, + batched=True, + num_proc=data_args.preprocessing_num_workers, + remove_columns=column_names, + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on train dataset", + ) else: train_dataset = None @@ -446,15 +473,14 @@ def preprocess_function(examples): if data_args.max_eval_samples is not None: max_eval_samples = min(len(eval_dataset), data_args.max_eval_samples) eval_dataset = eval_dataset.select(range(max_eval_samples)) - with training_args.main_process_first(desc="validation dataset map pre-processing"): - eval_dataset = eval_dataset.map( - preprocess_function, - batched=True, - num_proc=data_args.preprocessing_num_workers, - remove_columns=column_names, - load_from_cache_file=not data_args.overwrite_cache, - desc="Running tokenizer on validation dataset", - ) + eval_dataset = eval_dataset.map( + preprocess_function, + batched=True, + num_proc=data_args.preprocessing_num_workers, + remove_columns=column_names, + load_from_cache_file=not data_args.overwrite_cache, + desc="Running tokenizer on validation dataset", + ) else: eval_dataset = None # endregion @@ -466,14 +492,24 @@ def preprocess_function(examples): 
config=config, cache_dir=model_args.cache_dir, revision=model_args.model_revision, - use_auth_token=True if model_args.use_auth_token else None, + token=model_args.token, + trust_remote_code=model_args.trust_remote_code, ) # We resize the embeddings only when necessary to avoid index errors. If you are creating a model from scratch # on a small vocab and want a smaller embedding size, remove this test. - embedding_size = model.get_input_embeddings().weight.shape[0] + embeddings = model.get_input_embeddings() + + # Matt: This is a temporary workaround as we transition our models to exclusively using Keras embeddings. + # As soon as the transition is complete, all embeddings should be keras.Embeddings layers, and + # the weights will always be in embeddings.embeddings. + if hasattr(embeddings, "embeddings"): + embedding_size = embeddings.embeddings.shape[0] + else: + embedding_size = embeddings.weight.shape[0] if len(tokenizer) > embedding_size: model.resize_token_embeddings(len(tokenizer)) + if isinstance(tokenizer, tuple(MULTILINGUAL_TOKENIZERS)): model.config.forced_bos_token_id = forced_bos_token_id # endregion @@ -553,7 +589,7 @@ def preprocess_function(examples): # region Metric and postprocessing if training_args.do_eval: - metric = evaluate.load("sacrebleu") + metric = evaluate.load("sacrebleu", cache_dir=model_args.cache_dir) if data_args.val_max_target_length is None: data_args.val_max_target_length = data_args.max_target_length @@ -624,9 +660,8 @@ def compute_metrics(preds): callbacks.append( PushToHubCallback( output_dir=training_args.output_dir, - model_id=push_to_hub_model_id, - organization=training_args.push_to_hub_organization, - token=training_args.push_to_hub_token, + hub_model_id=push_to_hub_model_id, + hub_token=training_args.push_to_hub_token, tokenizer=tokenizer, **model_card_kwargs, ) @@ -635,6 +670,8 @@ def compute_metrics(preds): # region Training eval_metrics = None + # Transformers models compute the right loss for their task by default when labels are passed, and will + # use this for training unless you specify your own loss function in compile(). model.compile(optimizer=optimizer, jit_compile=training_args.xla) if training_args.do_train: diff --git a/hubconf.py b/hubconf.py index 6c60cd4213d5c4..412cb27f6380df 100644 --- a/hubconf.py +++ b/hubconf.py @@ -15,6 +15,7 @@ import os import sys + SRC_DIR = os.path.join(os.path.dirname(__file__), "src") sys.path.append(SRC_DIR) @@ -40,12 +41,12 @@ def config(*args, **kwargs): # Using torch.hub ! import torch - config = torch.hub.load('huggingface/transformers', 'config', 'bert-base-uncased') # Download configuration from huggingface.co and cache. + config = torch.hub.load('huggingface/transformers', 'config', 'google-bert/bert-base-uncased') # Download configuration from huggingface.co and cache. config = torch.hub.load('huggingface/transformers', 'config', './test/bert_saved_model/') # E.g. 
config (or model) was saved using `save_pretrained('./test/saved_model/')` config = torch.hub.load('huggingface/transformers', 'config', './test/bert_saved_model/my_configuration.json') - config = torch.hub.load('huggingface/transformers', 'config', 'bert-base-uncased', output_attentions=True, foo=False) + config = torch.hub.load('huggingface/transformers', 'config', 'google-bert/bert-base-uncased', output_attentions=True, foo=False) assert config.output_attentions == True - config, unused_kwargs = torch.hub.load('huggingface/transformers', 'config', 'bert-base-uncased', output_attentions=True, foo=False, return_unused_kwargs=True) + config, unused_kwargs = torch.hub.load('huggingface/transformers', 'config', 'google-bert/bert-base-uncased', output_attentions=True, foo=False, return_unused_kwargs=True) assert config.output_attentions == True assert unused_kwargs == {'foo': False} @@ -60,7 +61,7 @@ def tokenizer(*args, **kwargs): # Using torch.hub ! import torch - tokenizer = torch.hub.load('huggingface/transformers', 'tokenizer', 'bert-base-uncased') # Download vocabulary from huggingface.co and cache. + tokenizer = torch.hub.load('huggingface/transformers', 'tokenizer', 'google-bert/bert-base-uncased') # Download vocabulary from huggingface.co and cache. tokenizer = torch.hub.load('huggingface/transformers', 'tokenizer', './test/bert_saved_model/') # E.g. tokenizer was saved using `save_pretrained('./test/saved_model/')` """ @@ -74,9 +75,9 @@ def model(*args, **kwargs): # Using torch.hub ! import torch - model = torch.hub.load('huggingface/transformers', 'model', 'bert-base-uncased') # Download model and configuration from huggingface.co and cache. + model = torch.hub.load('huggingface/transformers', 'model', 'google-bert/bert-base-uncased') # Download model and configuration from huggingface.co and cache. model = torch.hub.load('huggingface/transformers', 'model', './test/bert_model/') # E.g. model was saved using `save_pretrained('./test/saved_model/')` - model = torch.hub.load('huggingface/transformers', 'model', 'bert-base-uncased', output_attentions=True) # Update configuration during loading + model = torch.hub.load('huggingface/transformers', 'model', 'google-bert/bert-base-uncased', output_attentions=True) # Update configuration during loading assert model.config.output_attentions == True # Loading from a TF checkpoint file instead of a PyTorch model (slower) config = AutoConfig.from_pretrained('./tf_model/bert_tf_model_config.json') @@ -93,9 +94,9 @@ def modelForCausalLM(*args, **kwargs): # Using torch.hub ! import torch - model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', 'gpt2') # Download model and configuration from huggingface.co and cache. + model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', 'openai-community/gpt2') # Download model and configuration from huggingface.co and cache. model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', './test/saved_model/') # E.g. 
model was saved using `save_pretrained('./test/saved_model/')` - model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', 'gpt2', output_attentions=True) # Update configuration during loading + model = torch.hub.load('huggingface/transformers', 'modelForCausalLM', 'openai-community/gpt2', output_attentions=True) # Update configuration during loading assert model.config.output_attentions == True # Loading from a TF checkpoint file instead of a PyTorch model (slower) config = AutoConfig.from_pretrained('./tf_model/gpt_tf_model_config.json') @@ -111,9 +112,9 @@ def modelForMaskedLM(*args, **kwargs): # Using torch.hub ! import torch - model = torch.hub.load('huggingface/transformers', 'modelForMaskedLM', 'bert-base-uncased') # Download model and configuration from huggingface.co and cache. + model = torch.hub.load('huggingface/transformers', 'modelForMaskedLM', 'google-bert/bert-base-uncased') # Download model and configuration from huggingface.co and cache. model = torch.hub.load('huggingface/transformers', 'modelForMaskedLM', './test/bert_model/') # E.g. model was saved using `save_pretrained('./test/saved_model/')` - model = torch.hub.load('huggingface/transformers', 'modelForMaskedLM', 'bert-base-uncased', output_attentions=True) # Update configuration during loading + model = torch.hub.load('huggingface/transformers', 'modelForMaskedLM', 'google-bert/bert-base-uncased', output_attentions=True) # Update configuration during loading assert model.config.output_attentions == True # Loading from a TF checkpoint file instead of a PyTorch model (slower) config = AutoConfig.from_pretrained('./tf_model/bert_tf_model_config.json') @@ -130,9 +131,9 @@ def modelForSequenceClassification(*args, **kwargs): # Using torch.hub ! import torch - model = torch.hub.load('huggingface/transformers', 'modelForSequenceClassification', 'bert-base-uncased') # Download model and configuration from huggingface.co and cache. + model = torch.hub.load('huggingface/transformers', 'modelForSequenceClassification', 'google-bert/bert-base-uncased') # Download model and configuration from huggingface.co and cache. model = torch.hub.load('huggingface/transformers', 'modelForSequenceClassification', './test/bert_model/') # E.g. model was saved using `save_pretrained('./test/saved_model/')` - model = torch.hub.load('huggingface/transformers', 'modelForSequenceClassification', 'bert-base-uncased', output_attentions=True) # Update configuration during loading + model = torch.hub.load('huggingface/transformers', 'modelForSequenceClassification', 'google-bert/bert-base-uncased', output_attentions=True) # Update configuration during loading assert model.config.output_attentions == True # Loading from a TF checkpoint file instead of a PyTorch model (slower) config = AutoConfig.from_pretrained('./tf_model/bert_tf_model_config.json') @@ -149,9 +150,9 @@ def modelForQuestionAnswering(*args, **kwargs): # Using torch.hub ! import torch - model = torch.hub.load('huggingface/transformers', 'modelForQuestionAnswering', 'bert-base-uncased') # Download model and configuration from huggingface.co and cache. + model = torch.hub.load('huggingface/transformers', 'modelForQuestionAnswering', 'google-bert/bert-base-uncased') # Download model and configuration from huggingface.co and cache. model = torch.hub.load('huggingface/transformers', 'modelForQuestionAnswering', './test/bert_model/') # E.g. 
model was saved using `save_pretrained('./test/saved_model/')` - model = torch.hub.load('huggingface/transformers', 'modelForQuestionAnswering', 'bert-base-uncased', output_attentions=True) # Update configuration during loading + model = torch.hub.load('huggingface/transformers', 'modelForQuestionAnswering', 'google-bert/bert-base-uncased', output_attentions=True) # Update configuration during loading assert model.config.output_attentions == True # Loading from a TF checkpoint file instead of a PyTorch model (slower) config = AutoConfig.from_pretrained('./tf_model/bert_tf_model_config.json') diff --git a/notebooks/README.md b/notebooks/README.md index 97f804eb6d935b..e701ca0a8887de 100644 --- a/notebooks/README.md +++ b/notebooks/README.md @@ -19,7 +19,7 @@ limitations under the License. You can find here a list of the official notebooks provided by Hugging Face. Also, we would like to list here interesting content created by the community. -If you wrote some notebook(s) leveraging 🤗 Transformers and would like be listed here, please open a +If you wrote some notebook(s) leveraging 🤗 Transformers and would like to be listed here, please open a Pull Request so it can be included under the Community notebooks. @@ -80,19 +80,27 @@ You can open any page of the documentation as a notebook in Colab (there is a bu | [How to fine-tune a speech recognition model in any language](https://github.com/huggingface/notebooks/blob/main/examples/multi_lingual_speech_recognition.ipynb)| Show how to preprocess the data and fine-tune a multi-lingually pretrained speech model on Common Voice | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/multi_lingual_speech_recognition.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/multi_lingual_speech_recognition.ipynb)| | [How to fine-tune a model on audio classification](https://github.com/huggingface/notebooks/blob/main/examples/audio_classification.ipynb)| Show how to preprocess the data and fine-tune a pretrained Speech model on Keyword Spotting | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb)| -#### Other modalities[[pytorch-other]] +#### Biological Sequences[[pytorch-bio]] | Notebook | Description | | | |:----------|:----------------------------------------------------------------------------------------|:-------------|------:| | [How to fine-tune a pre-trained protein model](https://github.com/huggingface/notebooks/blob/main/examples/protein_language_modeling.ipynb) | See how to tokenize proteins and fine-tune a large pre-trained protein "language" model | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/protein_language_modeling.ipynb) | [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/protein_language_modeling.ipynb) | | [How to generate protein 
folds](https://github.com/huggingface/notebooks/blob/main/examples/protein_folding.ipynb) | See how to go from protein sequence to a full protein model and PDB file | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/protein_folding.ipynb) | [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/protein_folding.ipynb) | +| [How to fine-tune a Nucleotide Transformer model](https://github.com/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling.ipynb) | See how to tokenize DNA and fine-tune a large pre-trained DNA "language" model | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling.ipynb) | [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling.ipynb) | +| [Fine-tune a Nucleotide Transformer model with LoRA](https://github.com/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling_with_peft.ipynb) | Train even larger DNA models in a memory-efficient way | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling_with_peft.ipynb) | [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/nucleotide_transformer_dna_sequence_modelling_with_peft.ipynb) | + + +#### Other modalities[[pytorch-other]] + +| Notebook | Description | | | +|:----------|:----------------------------------------------------------------------------------------|:-------------|------:| | [Probabilistic Time Series Forecasting](https://github.com/huggingface/notebooks/blob/main/examples/time-series-transformers.ipynb) | See how to train Time Series Transformer on a custom dataset | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/time-series-transformers.ipynb) | [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/time-series-transformers.ipynb) | #### Utility notebooks[[pytorch-utility]] | Notebook | Description | | | |:----------|:-------------|:-------------|------:| -| [How to export model to ONNX](https://github.com/huggingface/notebooks/blob/main/examples/onnx-export.ipynb)| Highlight how to export and run inference workloads through ONNX | +| [How to export model to ONNX](https://github.com/huggingface/notebooks/blob/main/examples/onnx-export.ipynb)| Highlight how to export and run inference workloads through ONNX | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/onnx-export.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/onnx-export.ipynb)| | [How to use 
Benchmarks](https://github.com/huggingface/notebooks/blob/main/examples/benchmark.ipynb)| How to benchmark models with transformers | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/benchmark.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/benchmark.ipynb)| ### TensorFlow Examples @@ -118,7 +126,7 @@ You can open any page of the documentation as a notebook in Colab (there is a bu | [How to fine-tune a model on image classification](https://github.com/huggingface/notebooks/blob/main/examples/image_classification-tf.ipynb) | Show how to preprocess the data and fine-tune any pretrained Vision model on Image Classification | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification-tf.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/image_classification-tf.ipynb)| | [How to fine-tune a SegFormer model on semantic segmentation](https://github.com/huggingface/notebooks/blob/main/examples/semantic_segmentation-tf.ipynb) | Show how to preprocess the data and fine-tune a pretrained SegFormer model on Semantic Segmentation | [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/semantic_segmentation-tf.ipynb)| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/main/examples/semantic_segmentation-tf.ipynb)| -#### Other modalities[[tensorflow-other]] +#### Biological Sequences[[tensorflow-bio]] | Notebook | Description | | | |:----------|:-------------|:-------------|------:| diff --git a/pyproject.toml b/pyproject.toml index 26fa9e0bb092fc..d66b89769c2cb1 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,11 +1,7 @@ -[tool.black] -line-length = 119 -target-version = ['py37'] - [tool.ruff] # Never enforce `E501` (line length violations). -ignore = ["E501", "E741", "W605"] -select = ["E", "F", "I", "W"] +ignore = ["C901", "E501", "E741", "F402", "F823" ] +select = ["C", "E", "F", "I", "W"] line-length = 119 # Ignore import violations in all `__init__.py` files. @@ -17,3 +13,24 @@ line-length = 119 [tool.ruff.isort] lines-after-imports = 2 known-first-party = ["transformers"] + +[tool.ruff.format] +# Like Black, use double quotes for strings. +quote-style = "double" + +# Like Black, indent with spaces, rather than tabs. +indent-style = "space" + +# Like Black, respect magic trailing commas. +skip-magic-trailing-comma = false + +# Like Black, automatically detect the appropriate line ending. 
+line-ending = "auto" + +[tool.pytest.ini_options] +doctest_optionflags="NUMBER NORMALIZE_WHITESPACE ELLIPSIS" +doctest_glob="**/*.md" +markers = [ + "flash_attn_test: marks tests related to flash attention (deselect with '-m \"not flash_attn_test\"')", + "bitsandbytes: select (or deselect with `not`) bitsandbytes integration tests", +] \ No newline at end of file diff --git a/scripts/benchmark/trainer-benchmark.py b/scripts/benchmark/trainer-benchmark.py index 903b4e0dd6d500..9eab3f638d7f21 100755 --- a/scripts/benchmark/trainer-benchmark.py +++ b/scripts/benchmark/trainer-benchmark.py @@ -54,7 +54,7 @@ # # CUDA_VISIBLE_DEVICES=0 python ./scripts/benchmark/trainer-benchmark.py \ # --base-cmd \ -# ' examples/pytorch/translation/run_translation.py --model_name_or_path t5-small \ +# ' examples/pytorch/translation/run_translation.py --model_name_or_path google-t5/t5-small \ # --output_dir output_dir --do_train --label_smoothing 0.1 --logging_strategy no \ # --save_strategy no --per_device_train_batch_size 32 --max_source_length 512 \ # --max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \ diff --git a/scripts/check_tokenizers.py b/scripts/check_tokenizers.py index cfd0a7f3a1defc..ea0d0bc21850ba 100644 --- a/scripts/check_tokenizers.py +++ b/scripts/check_tokenizers.py @@ -1,10 +1,12 @@ from collections import Counter + import datasets + import transformers from transformers.convert_slow_tokenizer import SLOW_TO_FAST_CONVERTERS - from transformers.utils import logging + logging.set_verbosity_info() TOKENIZER_CLASSES = { @@ -101,8 +103,8 @@ def check_details(line, spm_ids, tok_ids, slow, fast): except Exception: pass - ok_start = fast.decode(spm_ids[:first]) - ok_end = fast.decode(spm_ids[last:]) + fast.decode(spm_ids[:first]) + fast.decode(spm_ids[last:]) wrong = fast.decode(spm_ids[first:last]) print() print(wrong) diff --git a/scripts/fsmt/fsmt-make-super-tiny-model.py b/scripts/fsmt/fsmt-make-super-tiny-model.py index 4a6b8e0c1b4cc3..a70f40ee6ca4d6 100755 --- a/scripts/fsmt/fsmt-make-super-tiny-model.py +++ b/scripts/fsmt/fsmt-make-super-tiny-model.py @@ -24,18 +24,19 @@ # # It will be used then as "stas/tiny-wmt19-en-ru" -from pathlib import Path import json import tempfile +from pathlib import Path -from transformers import FSMTTokenizer, FSMTConfig, FSMTForConditionalGeneration +from transformers import FSMTConfig, FSMTForConditionalGeneration, FSMTTokenizer from transformers.models.fsmt.tokenization_fsmt import VOCAB_FILES_NAMES + mname_tiny = "tiny-wmt19-en-ru" # Build -# borrowed from a test +# borrowed from a test vocab = [ "l", "o", "w", "e", "r", "s", "t", "i", "d", "n", "w", "r", "t", "lo", "low", "er", "low", "lowest", "newer", "wider", "", ] vocab_tokens = dict(zip(vocab, range(len(vocab)))) merges = ["l o 123", "lo w 1456", "e r 1789", ""] @@ -57,7 +58,7 @@ tgt_vocab_file=tgt_vocab_file, merges_file=merges_file, ) - + config = FSMTConfig( langs=['ru', 'en'], src_vocab_size=1000, tgt_vocab_size=1000, diff --git a/scripts/fsmt/fsmt-make-tiny-model.py b/scripts/fsmt/fsmt-make-tiny-model.py index 431942c05ddbcc..b737cc61cea3c0 100755 --- a/scripts/fsmt/fsmt-make-tiny-model.py +++ b/scripts/fsmt/fsmt-make-tiny-model.py @@ -27,16 +27,18 @@ # It will be used then as "stas/tiny-wmt19-en-de" # Build -from transformers import FSMTTokenizer, FSMTConfig, FSMTForConditionalGeneration +from transformers import FSMTConfig, FSMTForConditionalGeneration, FSMTTokenizer + + mname = "facebook/wmt19-en-de" tokenizer = FSMTTokenizer.from_pretrained(mname) # get the correct vocab 
sizes, etc. from the master model config = FSMTConfig.from_pretrained(mname) -config.update(dict( - d_model=4, - encoder_layers=1, decoder_layers=1, - encoder_ffn_dim=4, decoder_ffn_dim=4, - encoder_attention_heads=1, decoder_attention_heads=1)) +config.update({ + "d_model": 4, + "encoder_layers": 1, "decoder_layers": 1, + "encoder_ffn_dim": 4, "decoder_ffn_dim": 4, + "encoder_attention_heads": 1, "decoder_attention_heads": 1}) tiny_model = FSMTForConditionalGeneration(config) print(f"num of params {tiny_model.num_parameters()}") diff --git a/scripts/fsmt/gen-card-allenai-wmt16.py b/scripts/fsmt/gen-card-allenai-wmt16.py index b910cb05b1bbe6..1b5fe1cda8b2ca 100755 --- a/scripts/fsmt/gen-card-allenai-wmt16.py +++ b/scripts/fsmt/gen-card-allenai-wmt16.py @@ -19,6 +19,7 @@ import os from pathlib import Path + def write_model_card(model_card_dir, src_lang, tgt_lang, model_name): texts = { diff --git a/scripts/fsmt/gen-card-allenai-wmt19.py b/scripts/fsmt/gen-card-allenai-wmt19.py index df0f5851c82eed..b7d727ff2a149f 100755 --- a/scripts/fsmt/gen-card-allenai-wmt19.py +++ b/scripts/fsmt/gen-card-allenai-wmt19.py @@ -19,6 +19,7 @@ import os from pathlib import Path + def write_model_card(model_card_dir, src_lang, tgt_lang, model_name): texts = { diff --git a/scripts/fsmt/gen-card-facebook-wmt19.py b/scripts/fsmt/gen-card-facebook-wmt19.py index e75406b261dcb1..58df676cbc9493 100755 --- a/scripts/fsmt/gen-card-facebook-wmt19.py +++ b/scripts/fsmt/gen-card-facebook-wmt19.py @@ -19,6 +19,7 @@ import os from pathlib import Path + def write_model_card(model_card_dir, src_lang, tgt_lang): texts = { @@ -39,7 +40,7 @@ def write_model_card(model_card_dir, src_lang, tgt_lang): readme = f""" --- -language: +language: - {src_lang} - {tgt_lang} thumbnail: diff --git a/scripts/pegasus/build_test_sample_spm_no_bos.py b/scripts/pegasus/build_test_sample_spm_no_bos.py index 324db02ef7101b..f223304a771789 100755 --- a/scripts/pegasus/build_test_sample_spm_no_bos.py +++ b/scripts/pegasus/build_test_sample_spm_no_bos.py @@ -13,15 +13,16 @@ # See the License for the specific language governing permissions and # limitations under the License. -# this script builds a small sample spm file tests/fixtures/test_sentencepiece_no_bos.model, with features needed by pegasus +# this script builds a small sample spm file tests/fixtures/test_sentencepiece_no_bos.model, with features needed by pegasus # 1. pip install sentencepiece -# +# # 2. wget https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt # 3. build import sentencepiece as spm + # pegasus: # 1. no bos # 2. eos_id is 1 diff --git a/scripts/stale.py b/scripts/stale.py index 88d7efbd3b29e4..bf7c6670c431e1 100644 --- a/scripts/stale.py +++ b/scripts/stale.py @@ -15,9 +15,10 @@ Script to close stale issue. Taken in part from the AllenNLP repository. https://github.com/allenai/allennlp. 
""" -from datetime import datetime as dt import os +from datetime import datetime as dt +import github.GithubException from github import Github @@ -36,30 +37,37 @@ def main(): repo = g.get_repo("huggingface/transformers") open_issues = repo.get_issues(state="open") - for issue in open_issues: - comments = sorted([comment for comment in issue.get_comments()], key=lambda i: i.created_at, reverse=True) + for i, issue in enumerate(open_issues): + print(i, issue) + comments = sorted(list(issue.get_comments()), key=lambda i: i.created_at, reverse=True) last_comment = comments[0] if len(comments) > 0 else None if ( last_comment is not None and last_comment.user.login == "github-actions[bot]" - and (dt.utcnow() - issue.updated_at).days > 7 - and (dt.utcnow() - issue.created_at).days >= 30 + and (dt.utcnow() - issue.updated_at.replace(tzinfo=None)).days > 7 + and (dt.utcnow() - issue.created_at.replace(tzinfo=None)).days >= 30 and not any(label.name.lower() in LABELS_TO_EXEMPT for label in issue.get_labels()) ): # print(f"Would close issue {issue.number} since it has been 7 days of inactivity since bot mention.") - issue.edit(state="closed") + try: + issue.edit(state="closed") + except github.GithubException as e: + print("Couldn't close the issue:", repr(e)) elif ( - (dt.utcnow() - issue.updated_at).days > 23 - and (dt.utcnow() - issue.created_at).days >= 30 + (dt.utcnow() - issue.updated_at.replace(tzinfo=None)).days > 23 + and (dt.utcnow() - issue.created_at.replace(tzinfo=None)).days >= 30 and not any(label.name.lower() in LABELS_TO_EXEMPT for label in issue.get_labels()) ): # print(f"Would add stale comment to {issue.number}") - issue.create_comment( - "This issue has been automatically marked as stale because it has not had " - "recent activity. If you think this still needs to be addressed " - "please comment on this thread.\n\nPlease note that issues that do not follow the " - "[contributing guidelines](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md) " - "are likely to be ignored." - ) + try: + issue.create_comment( + "This issue has been automatically marked as stale because it has not had " + "recent activity. If you think this still needs to be addressed " + "please comment on this thread.\n\nPlease note that issues that do not follow the " + "[contributing guidelines](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md) " + "are likely to be ignored." + ) + except github.GithubException as e: + print("Couldn't create comment:", repr(e)) if __name__ == "__main__": diff --git a/scripts/tatoeba/README.md b/scripts/tatoeba/README.md index 7c492ec4f46e2e..b142039b246ee6 100644 --- a/scripts/tatoeba/README.md +++ b/scripts/tatoeba/README.md @@ -23,7 +23,7 @@ pip install pandas GitPython wget ``` Get required metadata -``` +```bash curl https://cdn-datasets.huggingface.co/language_codes/language-codes-3b2.csv > language-codes-3b2.csv curl https://cdn-datasets.huggingface.co/language_codes/iso-639-3.csv > iso-639-3.csv ``` @@ -54,7 +54,7 @@ To upload all converted models, 1. Install [git-lfs](https://git-lfs.github.com/). -2. Login to `transformers-cli` +2. 
Login to `huggingface-cli` ```bash huggingface-cli login diff --git a/setup.cfg b/setup.cfg deleted file mode 100644 index 9a56ddc2fc65ac..00000000000000 --- a/setup.cfg +++ /dev/null @@ -1,17 +0,0 @@ -[isort] -default_section = FIRSTPARTY -ensure_newline_before_comments = True -force_grid_wrap = 0 -include_trailing_comma = True -known_first_party = transformers -line_length = 119 -lines_after_imports = 2 -multi_line_output = 3 -use_parentheses = True - -[flake8] -ignore = E203, E501, E741, W503, W605 -max-line-length = 119 - -[tool:pytest] -doctest_optionflags=NUMBER NORMALIZE_WHITESPACE ELLIPSIS \ No newline at end of file diff --git a/setup.py b/setup.py index 17cc262a2c4a75..224f36c4a98f00 100644 --- a/setup.py +++ b/setup.py @@ -17,39 +17,39 @@ To create the package for pypi. -1. Run `make pre-release` (or `make pre-patch` for a patch release) then run `make fix-copies` to fix the index of the - documentation. +1. Create the release branch named: v-release, for example v4.19-release. For a patch release checkout the + current release branch. If releasing on a special branch, copy the updated README.md on the main branch for your the commit you will make for the post-release and run `make fix-copies` on the main branch as well. -2. Run Tests for Amazon Sagemaker. The documentation is located in `./tests/sagemaker/README.md`, otherwise @philschmid. +2. Run `make pre-release` (or `make pre-patch` for a patch release) and commit these changes with the message: + "Release: " and push. -3. Unpin specific versions from setup.py that use a git install. +3. Go back to the main branch and run `make post-release` then `make fix-copies`. Commit these changes with the + message "v.dev.0" and push to main. -4. Checkout the release branch (v-release, for example v4.19-release), and commit these changes with the - message: "Release: " and push. +# If you were just cutting the branch in preparation for a release, you can stop here for now. -5. Wait for the tests on main to be completed and be green (otherwise revert and fix bugs) +4. Wait for the tests on the release branch to be completed and be green (otherwise revert and fix bugs) -6. Add a tag in git to mark the release: "git tag v -m 'Adds tag v for pypi' " +5. On the release branch, add a tag in git to mark the release: "git tag v -m 'Adds tag v for pypi' " Push the tag to git: git push --tags origin v-release -7. Build both the sources and the wheel. Do not change anything in setup.py between +6. Build both the sources and the wheel. Do not change anything in setup.py between creating the wheel and the source distribution (obviously). - For the wheel, run: "python setup.py bdist_wheel" in the top level directory. - (this will build a wheel for the python version you use to build it). + Run `make build-release`. This will build the release and do some sanity checks for you. If this ends with an error + message, you need to fix things before going further. - For the sources, run: "python setup.py sdist" You should now have a /dist directory with both .whl and .tar.gz source versions. -8. Check that everything looks correct by uploading the package to the pypi test server: +7. Check that everything looks correct by uploading the package to the pypi test server: - twine upload dist/* -r pypitest + twine upload dist/* -r testpypi (pypi suggest using twine as other methods upload files via plaintext.) 
You may have to specify the repository url, use the following command then: - twine upload dist/* -r pypitest --repository-url=https://test.pypi.org/legacy/ + twine upload dist/* -r testpypi --repository-url=https://test.pypi.org/legacy/ Check that you can install it in a virtualenv by running: pip install -i https://testpypi.python.org/pypi transformers @@ -57,23 +57,22 @@ Check you can run the following commands: python -c "from transformers import pipeline; classifier = pipeline('text-classification'); print(classifier('What a nice release'))" python -c "from transformers import *" + python utils/check_build.py --check_lib -9. Upload the final version to actual pypi: - twine upload dist/* -r pypi + If making a patch release, double check the bug you are patching is indeed resolved. -10. Copy the release notes from RELEASE.md to the tag in github once everything is looking hunky-dory. +8. Upload the final version to actual pypi: + twine upload dist/* -r pypi -11. Run `make post-release` then run `make fix-copies`. If you were on a branch for the release, - you need to go back to main before executing this. +9. Copy the release notes from RELEASE.md to the tag in github once everything is looking hunky-dory. """ import os import re import shutil -from distutils.core import Command from pathlib import Path -from setuptools import find_packages, setup +from setuptools import Command, find_packages, setup # Remove stale transformers.egg-info directory to avoid https://github.com/pypa/pip/issues/5466 @@ -96,66 +95,70 @@ # 1. all dependencies should be listed here with their version requirements if any # 2. once modified, run: `make deps_table_update` to update src/transformers/dependency_versions_table.py _deps = [ - "Pillow", - "accelerate>=0.10.0", + "Pillow>=10.0.1,<=15.0", + "accelerate>=0.21.0", + "av==9.2.0", # Latest version of PyAV (10.0.0) has issues with audio stream. "beautifulsoup4", - "black~=23.1", "codecarbon==1.2.0", "cookiecutter==1.7.3", "dataclasses", "datasets!=2.5.0", "decord==0.6.0", - "deepspeed>=0.6.5", + "deepspeed>=0.9.3", + "diffusers", "dill<0.3.5", "evaluate>=0.2.0", - "fairscale>0.3", "faiss-cpu", "fastapi", "filelock", - "flax>=0.4.1", + "flax>=0.4.1,<=0.7.0", + "fsspec<2023.10.0", "ftfy", "fugashi>=1.0", "GitPython<3.1.19", "hf-doc-builder>=0.3.0", - "huggingface-hub>=0.11.0,<1.0", + "huggingface-hub>=0.19.3,<1.0", "importlib_metadata", "ipadic>=1.0.0,<2.0", "isort>=5.5.4", - "jax>=0.2.8,!=0.3.2,<=0.3.6", - "jaxlib>=0.1.65,<=0.3.6", + "jax>=0.4.1,<=0.4.13", + "jaxlib>=0.4.1,<=0.4.13", "jieba", "kenlm", + # Keras pin - this is to make sure Keras 3 doesn't destroy us. Remove or change when we have proper support. 
+ "keras<2.16", "keras-nlp>=0.3.1", "librosa", "nltk", - "natten>=0.14.4", + "natten>=0.14.6,<0.15.0", "numpy>=1.17", "onnxconverter-common", "onnxruntime-tools>=1.4.2", "onnxruntime>=1.4.0", + "opencv-python", "optuna", - "optax>=0.0.8", + "optax>=0.0.8,<=0.1.4", "packaging>=20.0", "parameterized", "phonemizer", - "protobuf<=3.20.2", + "protobuf", "psutil", "pyyaml>=5.1", "pydantic", - "pytest", + "pytest>=7.2.0,<8.0.0", "pytest-timeout", "pytest-xdist", - "python>=3.7.0", - "ray[tune]", + "python>=3.8.0", + "ray[tune]>=2.7.0", "regex!=2019.12.17", "requests", - "rhoknp>=1.1.0", + "rhoknp>=1.1.0,<1.3.1", "rjieba", "rouge-score!=0.0.7,!=0.0.8,!=0.1,!=0.1.1", - "ruff>=0.0.241", + "ruff==0.1.5", "sacrebleu>=1.4.12,<2.0.0", "sacremoses", - "safetensors>=0.2.1", + "safetensors>=0.4.1", "sagemaker>=2.31.0", "scikit-learn", "sentencepiece>=0.1.91,!=0.1.92", @@ -163,20 +166,23 @@ "starlette", "sudachipy>=0.6.6", "sudachidict_core>=20220729", - "tensorflow-cpu>=2.4,<2.12", - "tensorflow>=2.4,<2.12", - "tensorflow-text", + "tensorboard", + # TensorFlow pin. When changing this value, update examples/tensorflow/_tests_requirements.txt accordingly + "tensorflow-cpu>=2.6,<2.16", + "tensorflow>=2.6,<2.16", + "tensorflow-text<2.16", "tf2onnx", "timeout-decorator", "timm", - "tokenizers>=0.11.1,!=0.11.3,<0.14", - "torch>=1.7,!=1.12.0", + "tokenizers>=0.14,<0.19", + "torch", "torchaudio", "torchvision", "pyctcdecode>=0.4.0", "tqdm>=4.27", "unidic>=1.0.2", "unidic_lite>=1.0.7", + "urllib3<2.0.0", "uvicorn", ] @@ -244,6 +250,7 @@ def run(self): with open(target, "w", encoding="utf-8", newline="\n") as f: f.write("\n".join(content)) + extras = {} extras["ja"] = deps_list("fugashi", "ipadic", "unidic_lite", "unidic", "sudachipy", "sudachidict_core", "rhoknp") @@ -252,7 +259,7 @@ def run(self): extras["tf"] = deps_list("tensorflow", "onnxconverter-common", "tf2onnx", "tensorflow-text", "keras-nlp") extras["tf-cpu"] = deps_list("tensorflow-cpu", "onnxconverter-common", "tf2onnx", "tensorflow-text", "keras-nlp") -extras["torch"] = deps_list("torch") +extras["torch"] = deps_list("torch", "accelerate") extras["accelerate"] = deps_list("accelerate") if os.name == "nt": # windows @@ -270,7 +277,6 @@ def run(self): extras["sagemaker"] = deps_list("sagemaker") extras["deepspeed"] = deps_list("deepspeed") + extras["accelerate"] -extras["fairscale"] = deps_list("fairscale") extras["optuna"] = deps_list("optuna") extras["ray"] = deps_list("ray[tune]") extras["sigopt"] = deps_list("sigopt") @@ -289,7 +295,7 @@ def run(self): extras["torch-vision"] = deps_list("torchvision") + extras["vision"] extras["natten"] = deps_list("natten") extras["codecarbon"] = deps_list("codecarbon") -extras["video"] = deps_list("decord") +extras["video"] = deps_list("decord", "av") extras["sentencepiece"] = deps_list("sentencepiece", "protobuf") extras["testing"] = ( @@ -303,7 +309,7 @@ def run(self): "dill", "evaluate", "pytest-timeout", - "black", + "ruff", "sacrebleu", "rouge-score", "nltk", @@ -312,8 +318,9 @@ def run(self): "protobuf", # Can be removed once we can unpin protobuf "sacremoses", "rjieba", - "safetensors", "beautifulsoup4", + "tensorboard", + "pydantic", ) + extras["retrieval"] + extras["modelcreation"] @@ -321,7 +328,7 @@ def run(self): extras["deepspeed-testing"] = extras["deepspeed"] + extras["testing"] + extras["optuna"] + extras["sentencepiece"] -extras["quality"] = deps_list("black", "datasets", "isort", "ruff", "GitPython", "hf-doc-builder") +extras["quality"] = deps_list("datasets", "isort", "ruff", "GitPython", 
"hf-doc-builder", "urllib3") extras["all"] = ( extras["tf"] @@ -401,9 +408,12 @@ def run(self): "tqdm", ) +extras["agents"] = deps_list( + "diffusers", "accelerate", "datasets", "torch", "sentencepiece", "opencv-python", "Pillow" +) + # when modifying the following list, make sure to update src/transformers/dependency_versions_check.py install_requires = [ - deps["importlib_metadata"] + ";python_version<'3.8'", # importlib_metadata for Python versions that don't have it deps["filelock"], # filesystem locks, e.g., to prevent parallel downloads deps["huggingface-hub"], deps["numpy"], @@ -412,12 +422,13 @@ def run(self): deps["regex"], # for OpenAI GPT deps["requests"], # for downloading models over HTTPS deps["tokenizers"], + deps["safetensors"], deps["tqdm"], # progress bars in model download and training scripts ] setup( name="transformers", - version="4.27.0.dev0", # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots) + version="4.38.0.dev0", # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots) author="The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)", author_email="transformers@huggingface.co", description="State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow", @@ -429,12 +440,12 @@ def run(self): package_dir={"": "src"}, packages=find_packages("src"), include_package_data=True, - package_data={"transformers": ["*.cu", "*.cpp", "*.cuh", "*.h", "*.pyx"]}, + package_data={"": ["**/*.cu", "**/*.cpp", "**/*.cuh", "**/*.h", "**/*.pyx"]}, zip_safe=False, extras_require=extras, entry_points={"console_scripts": ["transformers-cli=transformers.commands.transformers_cli:main"]}, - python_requires=">=3.7.0", - install_requires=install_requires, + python_requires=">=3.8.0", + install_requires=list(install_requires), classifiers=[ "Development Status :: 5 - Production/Stable", "Intended Audience :: Developers", @@ -443,7 +454,6 @@ def run(self): "License :: OSI Approved :: Apache Software License", "Operating System :: OS Independent", "Programming Language :: Python :: 3", - "Programming Language :: Python :: 3.7", "Programming Language :: Python :: 3.8", "Programming Language :: Python :: 3.9", "Programming Language :: Python :: 3.10", diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py index 8e1b779a116fda..84a66458022730 100644 --- a/src/transformers/__init__.py +++ b/src/transformers/__init__.py @@ -18,7 +18,7 @@ # to defer the actual importing for when the objects are requested. This way `import transformers` provides the names # in the namespace without actually importing anything (and especially none of the backends). 
-__version__ = "4.27.0.dev0" +__version__ = "4.38.0.dev0" from typing import TYPE_CHECKING @@ -28,8 +28,13 @@ OptionalDependencyNotAvailable, _LazyModule, is_bitsandbytes_available, + is_essentia_available, is_flax_available, + is_g2p_en_available, is_keras_nlp_available, + is_librosa_available, + is_pretty_midi_available, + is_scipy_available, is_sentencepiece_available, is_speech_available, is_tensorflow_text_available, @@ -90,18 +95,21 @@ "data.metrics": [], "data.processors": [], "debug_utils": [], + "deepspeed": [], "dependency_versions_check": [], "dependency_versions_table": [], "dynamic_module_utils": [], "feature_extraction_sequence_utils": ["SequenceFeatureExtractor"], "feature_extraction_utils": ["BatchFeature", "FeatureExtractionMixin"], "file_utils": [], - "generation": ["GenerationConfig"], + "generation": ["GenerationConfig", "TextIteratorStreamer", "TextStreamer"], "hf_argparser": ["HfArgumentParser"], + "hyperparameter_search": [], "image_transforms": [], "integrations": [ "is_clearml_available", "is_comet_available", + "is_dvclive_available", "is_neptune_available", "is_optuna_available", "is_ray_available", @@ -123,6 +131,13 @@ "models": [], # Models "models.albert": ["ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "AlbertConfig"], + "models.align": [ + "ALIGN_PRETRAINED_CONFIG_ARCHIVE_MAP", + "AlignConfig", + "AlignProcessor", + "AlignTextConfig", + "AlignVisionConfig", + ], "models.altclip": [ "ALTCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP", "AltCLIPConfig", @@ -133,6 +148,7 @@ "models.audio_spectrogram_transformer": [ "AUDIO_SPECTROGRAM_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "ASTConfig", + "ASTFeatureExtractor", ], "models.auto": [ "ALL_PRETRAINED_CONFIG_ARCHIVE_MAP", @@ -148,6 +164,17 @@ "AutoProcessor", "AutoTokenizer", ], + "models.autoformer": [ + "AUTOFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", + "AutoformerConfig", + ], + "models.bark": [ + "BarkCoarseConfig", + "BarkConfig", + "BarkFineConfig", + "BarkProcessor", + "BarkSemanticConfig", + ], "models.bart": ["BartConfig", "BartTokenizer"], "models.barthez": [], "models.bartpho": [], @@ -160,16 +187,28 @@ "WordpieceTokenizer", ], "models.bert_generation": ["BertGenerationConfig"], - "models.bert_japanese": ["BertJapaneseTokenizer", "CharacterTokenizer", "MecabTokenizer"], + "models.bert_japanese": [ + "BertJapaneseTokenizer", + "CharacterTokenizer", + "MecabTokenizer", + ], "models.bertweet": ["BertweetTokenizer"], "models.big_bird": ["BIG_BIRD_PRETRAINED_CONFIG_ARCHIVE_MAP", "BigBirdConfig"], "models.bigbird_pegasus": [ "BIGBIRD_PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP", "BigBirdPegasusConfig", ], - "models.biogpt": ["BIOGPT_PRETRAINED_CONFIG_ARCHIVE_MAP", "BioGptConfig", "BioGptTokenizer"], + "models.biogpt": [ + "BIOGPT_PRETRAINED_CONFIG_ARCHIVE_MAP", + "BioGptConfig", + "BioGptTokenizer", + ], "models.bit": ["BIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "BitConfig"], - "models.blenderbot": ["BLENDERBOT_PRETRAINED_CONFIG_ARCHIVE_MAP", "BlenderbotConfig", "BlenderbotTokenizer"], + "models.blenderbot": [ + "BLENDERBOT_PRETRAINED_CONFIG_ARCHIVE_MAP", + "BlenderbotConfig", + "BlenderbotTokenizer", + ], "models.blenderbot_small": [ "BLENDERBOT_SMALL_PRETRAINED_CONFIG_ARCHIVE_MAP", "BlenderbotSmallConfig", @@ -190,7 +229,6 @@ "Blip2VisionConfig", ], "models.bloom": ["BLOOM_PRETRAINED_CONFIG_ARCHIVE_MAP", "BloomConfig"], - "models.bort": [], "models.bridgetower": [ "BRIDGETOWER_PRETRAINED_CONFIG_ARCHIVE_MAP", "BridgeTowerConfig", @@ -198,9 +236,18 @@ "BridgeTowerTextConfig", "BridgeTowerVisionConfig", ], + "models.bros": [ + 
"BROS_PRETRAINED_CONFIG_ARCHIVE_MAP", + "BrosConfig", + "BrosProcessor", + ], "models.byt5": ["ByT5Tokenizer"], "models.camembert": ["CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "CamembertConfig"], - "models.canine": ["CANINE_PRETRAINED_CONFIG_ARCHIVE_MAP", "CanineConfig", "CanineTokenizer"], + "models.canine": [ + "CANINE_PRETRAINED_CONFIG_ARCHIVE_MAP", + "CanineConfig", + "CanineTokenizer", + ], "models.chinese_clip": [ "CHINESE_CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP", "ChineseCLIPConfig", @@ -230,12 +277,46 @@ "CLIPSegTextConfig", "CLIPSegVisionConfig", ], - "models.codegen": ["CODEGEN_PRETRAINED_CONFIG_ARCHIVE_MAP", "CodeGenConfig", "CodeGenTokenizer"], - "models.conditional_detr": ["CONDITIONAL_DETR_PRETRAINED_CONFIG_ARCHIVE_MAP", "ConditionalDetrConfig"], - "models.convbert": ["CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "ConvBertConfig", "ConvBertTokenizer"], + "models.clvp": [ + "CLVP_PRETRAINED_CONFIG_ARCHIVE_MAP", + "ClvpConfig", + "ClvpDecoderConfig", + "ClvpEncoderConfig", + "ClvpFeatureExtractor", + "ClvpProcessor", + "ClvpTokenizer", + ], + "models.code_llama": [], + "models.codegen": [ + "CODEGEN_PRETRAINED_CONFIG_ARCHIVE_MAP", + "CodeGenConfig", + "CodeGenTokenizer", + ], + "models.conditional_detr": [ + "CONDITIONAL_DETR_PRETRAINED_CONFIG_ARCHIVE_MAP", + "ConditionalDetrConfig", + ], + "models.convbert": [ + "CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", + "ConvBertConfig", + "ConvBertTokenizer", + ], "models.convnext": ["CONVNEXT_PRETRAINED_CONFIG_ARCHIVE_MAP", "ConvNextConfig"], + "models.convnextv2": [ + "CONVNEXTV2_PRETRAINED_CONFIG_ARCHIVE_MAP", + "ConvNextV2Config", + ], "models.cpm": [], - "models.ctrl": ["CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP", "CTRLConfig", "CTRLTokenizer"], + "models.cpmant": [ + "CPMANT_PRETRAINED_CONFIG_ARCHIVE_MAP", + "CpmAntConfig", + "CpmAntTokenizer", + ], + "models.ctrl": [ + "CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP", + "CTRLConfig", + "CTRLTokenizer", + ], "models.cvt": ["CVT_PRETRAINED_CONFIG_ARCHIVE_MAP", "CvtConfig"], "models.data2vec": [ "DATA2VEC_TEXT_PRETRAINED_CONFIG_ARCHIVE_MAP", @@ -244,18 +325,71 @@ "Data2VecTextConfig", "Data2VecVisionConfig", ], - "models.deberta": ["DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP", "DebertaConfig", "DebertaTokenizer"], - "models.deberta_v2": ["DEBERTA_V2_PRETRAINED_CONFIG_ARCHIVE_MAP", "DebertaV2Config"], - "models.decision_transformer": ["DECISION_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "DecisionTransformerConfig"], - "models.deformable_detr": ["DEFORMABLE_DETR_PRETRAINED_CONFIG_ARCHIVE_MAP", "DeformableDetrConfig"], + "models.deberta": [ + "DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP", + "DebertaConfig", + "DebertaTokenizer", + ], + "models.deberta_v2": [ + "DEBERTA_V2_PRETRAINED_CONFIG_ARCHIVE_MAP", + "DebertaV2Config", + ], + "models.decision_transformer": [ + "DECISION_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", + "DecisionTransformerConfig", + ], + "models.deformable_detr": [ + "DEFORMABLE_DETR_PRETRAINED_CONFIG_ARCHIVE_MAP", + "DeformableDetrConfig", + ], "models.deit": ["DEIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "DeiTConfig"], + "models.deprecated": [], + "models.deprecated.bort": [], + "models.deprecated.mctct": [ + "MCTCT_PRETRAINED_CONFIG_ARCHIVE_MAP", + "MCTCTConfig", + "MCTCTFeatureExtractor", + "MCTCTProcessor", + ], + "models.deprecated.mmbt": ["MMBTConfig"], + "models.deprecated.open_llama": [ + "OPEN_LLAMA_PRETRAINED_CONFIG_ARCHIVE_MAP", + "OpenLlamaConfig", + ], + "models.deprecated.retribert": [ + "RETRIBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", + "RetriBertConfig", + "RetriBertTokenizer", + ], + 
"models.deprecated.tapex": ["TapexTokenizer"], + "models.deprecated.trajectory_transformer": [ + "TRAJECTORY_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", + "TrajectoryTransformerConfig", + ], + "models.deprecated.transfo_xl": [ + "TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP", + "TransfoXLConfig", + "TransfoXLCorpus", + "TransfoXLTokenizer", + ], + "models.deprecated.van": ["VAN_PRETRAINED_CONFIG_ARCHIVE_MAP", "VanConfig"], + "models.depth_anything": ["DEPTH_ANYTHING_PRETRAINED_CONFIG_ARCHIVE_MAP", "DepthAnythingConfig"], "models.deta": ["DETA_PRETRAINED_CONFIG_ARCHIVE_MAP", "DetaConfig"], "models.detr": ["DETR_PRETRAINED_CONFIG_ARCHIVE_MAP", "DetrConfig"], "models.dialogpt": [], "models.dinat": ["DINAT_PRETRAINED_CONFIG_ARCHIVE_MAP", "DinatConfig"], - "models.distilbert": ["DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "DistilBertConfig", "DistilBertTokenizer"], + "models.dinov2": ["DINOV2_PRETRAINED_CONFIG_ARCHIVE_MAP", "Dinov2Config"], + "models.distilbert": [ + "DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", + "DistilBertConfig", + "DistilBertTokenizer", + ], "models.dit": [], - "models.donut": ["DONUT_SWIN_PRETRAINED_CONFIG_ARCHIVE_MAP", "DonutProcessor", "DonutSwinConfig"], + "models.donut": [ + "DONUT_SWIN_PRETRAINED_CONFIG_ARCHIVE_MAP", + "DonutProcessor", + "DonutSwinConfig", + ], "models.dpr": [ "DPR_PRETRAINED_CONFIG_ARCHIVE_MAP", "DPRConfig", @@ -265,8 +399,24 @@ "DPRReaderTokenizer", ], "models.dpt": ["DPT_PRETRAINED_CONFIG_ARCHIVE_MAP", "DPTConfig"], - "models.efficientformer": ["EFFICIENTFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "EfficientFormerConfig"], - "models.electra": ["ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP", "ElectraConfig", "ElectraTokenizer"], + "models.efficientformer": [ + "EFFICIENTFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", + "EfficientFormerConfig", + ], + "models.efficientnet": [ + "EFFICIENTNET_PRETRAINED_CONFIG_ARCHIVE_MAP", + "EfficientNetConfig", + ], + "models.electra": [ + "ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP", + "ElectraConfig", + "ElectraTokenizer", + ], + "models.encodec": [ + "ENCODEC_PRETRAINED_CONFIG_ARCHIVE_MAP", + "EncodecConfig", + "EncodecFeatureExtractor", + ], "models.encoder_decoder": ["EncoderDecoderConfig"], "models.ernie": [ "ERNIE_PRETRAINED_CONFIG_ARCHIVE_MAP", @@ -274,6 +424,16 @@ ], "models.ernie_m": ["ERNIE_M_PRETRAINED_CONFIG_ARCHIVE_MAP", "ErnieMConfig"], "models.esm": ["ESM_PRETRAINED_CONFIG_ARCHIVE_MAP", "EsmConfig", "EsmTokenizer"], + "models.falcon": ["FALCON_PRETRAINED_CONFIG_ARCHIVE_MAP", "FalconConfig"], + "models.fastspeech2_conformer": [ + "FASTSPEECH2_CONFORMER_HIFIGAN_PRETRAINED_CONFIG_ARCHIVE_MAP", + "FASTSPEECH2_CONFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", + "FASTSPEECH2_CONFORMER_WITH_HIFIGAN_PRETRAINED_CONFIG_ARCHIVE_MAP", + "FastSpeech2ConformerConfig", + "FastSpeech2ConformerHifiGanConfig", + "FastSpeech2ConformerTokenizer", + "FastSpeech2ConformerWithHifiGanConfig", + ], "models.flaubert": ["FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "FlaubertConfig", "FlaubertTokenizer"], "models.flava": [ "FLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP", @@ -284,17 +444,51 @@ "FlavaTextConfig", ], "models.fnet": ["FNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "FNetConfig"], - "models.fsmt": ["FSMT_PRETRAINED_CONFIG_ARCHIVE_MAP", "FSMTConfig", "FSMTTokenizer"], - "models.funnel": ["FUNNEL_PRETRAINED_CONFIG_ARCHIVE_MAP", "FunnelConfig", "FunnelTokenizer"], - "models.git": ["GIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "GitConfig", "GitProcessor", "GitVisionConfig"], + "models.focalnet": ["FOCALNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "FocalNetConfig"], + "models.fsmt": [ + 
"FSMT_PRETRAINED_CONFIG_ARCHIVE_MAP", + "FSMTConfig", + "FSMTTokenizer", + ], + "models.funnel": [ + "FUNNEL_PRETRAINED_CONFIG_ARCHIVE_MAP", + "FunnelConfig", + "FunnelTokenizer", + ], + "models.fuyu": ["FUYU_PRETRAINED_CONFIG_ARCHIVE_MAP", "FuyuConfig"], + "models.git": [ + "GIT_PRETRAINED_CONFIG_ARCHIVE_MAP", + "GitConfig", + "GitProcessor", + "GitVisionConfig", + ], "models.glpn": ["GLPN_PRETRAINED_CONFIG_ARCHIVE_MAP", "GLPNConfig"], - "models.gpt2": ["GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPT2Config", "GPT2Tokenizer"], + "models.gpt2": [ + "GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP", + "GPT2Config", + "GPT2Tokenizer", + ], + "models.gpt_bigcode": [ + "GPT_BIGCODE_PRETRAINED_CONFIG_ARCHIVE_MAP", + "GPTBigCodeConfig", + ], "models.gpt_neo": ["GPT_NEO_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPTNeoConfig"], "models.gpt_neox": ["GPT_NEOX_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPTNeoXConfig"], - "models.gpt_neox_japanese": ["GPT_NEOX_JAPANESE_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPTNeoXJapaneseConfig"], + "models.gpt_neox_japanese": [ + "GPT_NEOX_JAPANESE_PRETRAINED_CONFIG_ARCHIVE_MAP", + "GPTNeoXJapaneseConfig", + ], "models.gpt_sw3": [], "models.gptj": ["GPTJ_PRETRAINED_CONFIG_ARCHIVE_MAP", "GPTJConfig"], - "models.graphormer": ["GRAPHORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "GraphormerConfig"], + "models.gptsan_japanese": [ + "GPTSAN_JAPANESE_PRETRAINED_CONFIG_ARCHIVE_MAP", + "GPTSanJapaneseConfig", + "GPTSanJapaneseTokenizer", + ], + "models.graphormer": [ + "GRAPHORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", + "GraphormerConfig", + ], "models.groupvit": [ "GROUPVIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "GroupViTConfig", @@ -304,7 +498,19 @@ "models.herbert": ["HerbertTokenizer"], "models.hubert": ["HUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "HubertConfig"], "models.ibert": ["IBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "IBertConfig"], + "models.idefics": [ + "IDEFICS_PRETRAINED_CONFIG_ARCHIVE_MAP", + "IdeficsConfig", + ], "models.imagegpt": ["IMAGEGPT_PRETRAINED_CONFIG_ARCHIVE_MAP", "ImageGPTConfig"], + "models.informer": ["INFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "InformerConfig"], + "models.instructblip": [ + "INSTRUCTBLIP_PRETRAINED_CONFIG_ARCHIVE_MAP", + "InstructBlipConfig", + "InstructBlipProcessor", + "InstructBlipQFormerConfig", + "InstructBlipVisionConfig", + ], "models.jukebox": [ "JUKEBOX_PRETRAINED_CONFIG_ARCHIVE_MAP", "JukeboxConfig", @@ -312,7 +518,16 @@ "JukeboxTokenizer", "JukeboxVQVAEConfig", ], - "models.layoutlm": ["LAYOUTLM_PRETRAINED_CONFIG_ARCHIVE_MAP", "LayoutLMConfig", "LayoutLMTokenizer"], + "models.kosmos2": [ + "KOSMOS2_PRETRAINED_CONFIG_ARCHIVE_MAP", + "Kosmos2Config", + "Kosmos2Processor", + ], + "models.layoutlm": [ + "LAYOUTLM_PRETRAINED_CONFIG_ARCHIVE_MAP", + "LayoutLMConfig", + "LayoutLMTokenizer", + ], "models.layoutlmv2": [ "LAYOUTLMV2_PRETRAINED_CONFIG_ARCHIVE_MAP", "LayoutLMv2Config", @@ -333,10 +548,27 @@ "models.led": ["LED_PRETRAINED_CONFIG_ARCHIVE_MAP", "LEDConfig", "LEDTokenizer"], "models.levit": ["LEVIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "LevitConfig"], "models.lilt": ["LILT_PRETRAINED_CONFIG_ARCHIVE_MAP", "LiltConfig"], - "models.longformer": ["LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "LongformerConfig", "LongformerTokenizer"], + "models.llama": ["LLAMA_PRETRAINED_CONFIG_ARCHIVE_MAP", "LlamaConfig"], + "models.llava": [ + "LLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP", + "LlavaConfig", + ], + "models.longformer": [ + "LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", + "LongformerConfig", + "LongformerTokenizer", + ], "models.longt5": ["LONGT5_PRETRAINED_CONFIG_ARCHIVE_MAP", "LongT5Config"], - "models.luke": 
["LUKE_PRETRAINED_CONFIG_ARCHIVE_MAP", "LukeConfig", "LukeTokenizer"], - "models.lxmert": ["LXMERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "LxmertConfig", "LxmertTokenizer"], + "models.luke": [ + "LUKE_PRETRAINED_CONFIG_ARCHIVE_MAP", + "LukeConfig", + "LukeTokenizer", + ], + "models.lxmert": [ + "LXMERT_PRETRAINED_CONFIG_ARCHIVE_MAP", + "LxmertConfig", + "LxmertTokenizer", + ], "models.m2m_100": ["M2M_100_PRETRAINED_CONFIG_ARCHIVE_MAP", "M2M100Config"], "models.marian": ["MarianConfig"], "models.markuplm": [ @@ -350,31 +582,87 @@ "MASK2FORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "Mask2FormerConfig", ], - "models.maskformer": ["MASKFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "MaskFormerConfig", "MaskFormerSwinConfig"], + "models.maskformer": [ + "MASKFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", + "MaskFormerConfig", + "MaskFormerSwinConfig", + ], "models.mbart": ["MBartConfig"], "models.mbart50": [], - "models.mctct": ["MCTCT_PRETRAINED_CONFIG_ARCHIVE_MAP", "MCTCTConfig", "MCTCTProcessor"], - "models.megatron_bert": ["MEGATRON_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "MegatronBertConfig"], + "models.mega": ["MEGA_PRETRAINED_CONFIG_ARCHIVE_MAP", "MegaConfig"], + "models.megatron_bert": [ + "MEGATRON_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP", + "MegatronBertConfig", + ], "models.megatron_gpt2": [], + "models.mgp_str": [ + "MGP_STR_PRETRAINED_CONFIG_ARCHIVE_MAP", + "MgpstrConfig", + "MgpstrProcessor", + "MgpstrTokenizer", + ], + "models.mistral": ["MISTRAL_PRETRAINED_CONFIG_ARCHIVE_MAP", "MistralConfig"], + "models.mixtral": ["MIXTRAL_PRETRAINED_CONFIG_ARCHIVE_MAP", "MixtralConfig"], "models.mluke": [], - "models.mmbt": ["MMBTConfig"], - "models.mobilebert": ["MOBILEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "MobileBertConfig", "MobileBertTokenizer"], - "models.mobilenet_v1": ["MOBILENET_V1_PRETRAINED_CONFIG_ARCHIVE_MAP", "MobileNetV1Config"], - "models.mobilenet_v2": ["MOBILENET_V2_PRETRAINED_CONFIG_ARCHIVE_MAP", "MobileNetV2Config"], + "models.mobilebert": [ + "MOBILEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", + "MobileBertConfig", + "MobileBertTokenizer", + ], + "models.mobilenet_v1": [ + "MOBILENET_V1_PRETRAINED_CONFIG_ARCHIVE_MAP", + "MobileNetV1Config", + ], + "models.mobilenet_v2": [ + "MOBILENET_V2_PRETRAINED_CONFIG_ARCHIVE_MAP", + "MobileNetV2Config", + ], "models.mobilevit": ["MOBILEVIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "MobileViTConfig"], - "models.mpnet": ["MPNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "MPNetConfig", "MPNetTokenizer"], + "models.mobilevitv2": [ + "MOBILEVITV2_PRETRAINED_CONFIG_ARCHIVE_MAP", + "MobileViTV2Config", + ], + "models.mpnet": [ + "MPNET_PRETRAINED_CONFIG_ARCHIVE_MAP", + "MPNetConfig", + "MPNetTokenizer", + ], + "models.mpt": ["MPT_PRETRAINED_CONFIG_ARCHIVE_MAP", "MptConfig"], + "models.mra": ["MRA_PRETRAINED_CONFIG_ARCHIVE_MAP", "MraConfig"], "models.mt5": ["MT5Config"], + "models.musicgen": [ + "MUSICGEN_PRETRAINED_CONFIG_ARCHIVE_MAP", + "MusicgenConfig", + "MusicgenDecoderConfig", + ], "models.mvp": ["MvpConfig", "MvpTokenizer"], "models.nat": ["NAT_PRETRAINED_CONFIG_ARCHIVE_MAP", "NatConfig"], "models.nezha": ["NEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP", "NezhaConfig"], "models.nllb": [], + "models.nllb_moe": ["NLLB_MOE_PRETRAINED_CONFIG_ARCHIVE_MAP", "NllbMoeConfig"], + "models.nougat": ["NougatProcessor"], "models.nystromformer": [ "NYSTROMFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "NystromformerConfig", ], - "models.oneformer": ["ONEFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "OneFormerConfig", "OneFormerProcessor"], - "models.openai": ["OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP", "OpenAIGPTConfig", 
"OpenAIGPTTokenizer"], + "models.oneformer": [ + "ONEFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", + "OneFormerConfig", + "OneFormerProcessor", + ], + "models.openai": [ + "OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP", + "OpenAIGPTConfig", + "OpenAIGPTTokenizer", + ], "models.opt": ["OPTConfig"], + "models.owlv2": [ + "OWLV2_PRETRAINED_CONFIG_ARCHIVE_MAP", + "Owlv2Config", + "Owlv2Processor", + "Owlv2TextConfig", + "Owlv2VisionConfig", + ], "models.owlvit": [ "OWLVIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "OwlViTConfig", @@ -382,32 +670,116 @@ "OwlViTTextConfig", "OwlViTVisionConfig", ], - "models.pegasus": ["PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP", "PegasusConfig", "PegasusTokenizer"], + "models.patchtsmixer": [ + "PATCHTSMIXER_PRETRAINED_CONFIG_ARCHIVE_MAP", + "PatchTSMixerConfig", + ], + "models.patchtst": ["PATCHTST_PRETRAINED_CONFIG_ARCHIVE_MAP", "PatchTSTConfig"], + "models.pegasus": [ + "PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP", + "PegasusConfig", + "PegasusTokenizer", + ], "models.pegasus_x": ["PEGASUS_X_PRETRAINED_CONFIG_ARCHIVE_MAP", "PegasusXConfig"], - "models.perceiver": ["PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP", "PerceiverConfig", "PerceiverTokenizer"], + "models.perceiver": [ + "PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP", + "PerceiverConfig", + "PerceiverTokenizer", + ], + "models.persimmon": ["PERSIMMON_PRETRAINED_CONFIG_ARCHIVE_MAP", "PersimmonConfig"], + "models.phi": ["PHI_PRETRAINED_CONFIG_ARCHIVE_MAP", "PhiConfig"], "models.phobert": ["PhobertTokenizer"], + "models.pix2struct": [ + "PIX2STRUCT_PRETRAINED_CONFIG_ARCHIVE_MAP", + "Pix2StructConfig", + "Pix2StructProcessor", + "Pix2StructTextConfig", + "Pix2StructVisionConfig", + ], "models.plbart": ["PLBART_PRETRAINED_CONFIG_ARCHIVE_MAP", "PLBartConfig"], - "models.poolformer": ["POOLFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "PoolFormerConfig"], - "models.prophetnet": ["PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "ProphetNetConfig", "ProphetNetTokenizer"], + "models.poolformer": [ + "POOLFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", + "PoolFormerConfig", + ], + "models.pop2piano": [ + "POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP", + "Pop2PianoConfig", + ], + "models.prophetnet": [ + "PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP", + "ProphetNetConfig", + "ProphetNetTokenizer", + ], + "models.pvt": ["PVT_PRETRAINED_CONFIG_ARCHIVE_MAP", "PvtConfig"], "models.qdqbert": ["QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "QDQBertConfig"], + "models.qwen2": [ + "QWEN2_PRETRAINED_CONFIG_ARCHIVE_MAP", + "Qwen2Config", + "Qwen2Tokenizer", + ], "models.rag": ["RagConfig", "RagRetriever", "RagTokenizer"], - "models.realm": ["REALM_PRETRAINED_CONFIG_ARCHIVE_MAP", "RealmConfig", "RealmTokenizer"], + "models.realm": [ + "REALM_PRETRAINED_CONFIG_ARCHIVE_MAP", + "RealmConfig", + "RealmTokenizer", + ], "models.reformer": ["REFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "ReformerConfig"], "models.regnet": ["REGNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "RegNetConfig"], "models.rembert": ["REMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "RemBertConfig"], "models.resnet": ["RESNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "ResNetConfig"], - "models.retribert": ["RETRIBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "RetriBertConfig", "RetriBertTokenizer"], - "models.roberta": ["ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP", "RobertaConfig", "RobertaTokenizer"], - "models.roberta_prelayernorm": ["ROBERTA_PRELAYERNORM_PRETRAINED_CONFIG_ARCHIVE_MAP", "RobertaPreLayerNormConfig"], - "models.roc_bert": ["ROC_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "RoCBertConfig", "RoCBertTokenizer"], - "models.roformer": ["ROFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", 
"RoFormerConfig", "RoFormerTokenizer"], + "models.roberta": [ + "ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP", + "RobertaConfig", + "RobertaTokenizer", + ], + "models.roberta_prelayernorm": [ + "ROBERTA_PRELAYERNORM_PRETRAINED_CONFIG_ARCHIVE_MAP", + "RobertaPreLayerNormConfig", + ], + "models.roc_bert": [ + "ROC_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP", + "RoCBertConfig", + "RoCBertTokenizer", + ], + "models.roformer": [ + "ROFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", + "RoFormerConfig", + "RoFormerTokenizer", + ], + "models.rwkv": ["RWKV_PRETRAINED_CONFIG_ARCHIVE_MAP", "RwkvConfig"], + "models.sam": [ + "SAM_PRETRAINED_CONFIG_ARCHIVE_MAP", + "SamConfig", + "SamMaskDecoderConfig", + "SamProcessor", + "SamPromptEncoderConfig", + "SamVisionConfig", + ], + "models.seamless_m4t": [ + "SEAMLESS_M4T_PRETRAINED_CONFIG_ARCHIVE_MAP", + "SeamlessM4TConfig", + "SeamlessM4TFeatureExtractor", + "SeamlessM4TProcessor", + ], + "models.seamless_m4t_v2": [ + "SEAMLESS_M4T_V2_PRETRAINED_CONFIG_ARCHIVE_MAP", + "SeamlessM4Tv2Config", + ], "models.segformer": ["SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "SegformerConfig"], "models.sew": ["SEW_PRETRAINED_CONFIG_ARCHIVE_MAP", "SEWConfig"], "models.sew_d": ["SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP", "SEWDConfig"], + "models.siglip": [ + "SIGLIP_PRETRAINED_CONFIG_ARCHIVE_MAP", + "SiglipConfig", + "SiglipProcessor", + "SiglipTextConfig", + "SiglipVisionConfig", + ], "models.speech_encoder_decoder": ["SpeechEncoderDecoderConfig"], "models.speech_to_text": [ "SPEECH_TO_TEXT_PRETRAINED_CONFIG_ARCHIVE_MAP", "Speech2TextConfig", + "Speech2TextFeatureExtractor", "Speech2TextProcessor", ], "models.speech_to_text_2": [ @@ -420,34 +792,51 @@ "SPEECHT5_PRETRAINED_CONFIG_ARCHIVE_MAP", "SPEECHT5_PRETRAINED_HIFIGAN_CONFIG_ARCHIVE_MAP", "SpeechT5Config", + "SpeechT5FeatureExtractor", "SpeechT5HifiGanConfig", "SpeechT5Processor", ], - "models.splinter": ["SPLINTER_PRETRAINED_CONFIG_ARCHIVE_MAP", "SplinterConfig", "SplinterTokenizer"], - "models.squeezebert": ["SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "SqueezeBertConfig", "SqueezeBertTokenizer"], + "models.splinter": [ + "SPLINTER_PRETRAINED_CONFIG_ARCHIVE_MAP", + "SplinterConfig", + "SplinterTokenizer", + ], + "models.squeezebert": [ + "SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP", + "SqueezeBertConfig", + "SqueezeBertTokenizer", + ], + "models.stablelm": ["STABLELM_PRETRAINED_CONFIG_ARCHIVE_MAP", "StableLmConfig"], + "models.swiftformer": [ + "SWIFTFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", + "SwiftFormerConfig", + ], "models.swin": ["SWIN_PRETRAINED_CONFIG_ARCHIVE_MAP", "SwinConfig"], "models.swin2sr": ["SWIN2SR_PRETRAINED_CONFIG_ARCHIVE_MAP", "Swin2SRConfig"], "models.swinv2": ["SWINV2_PRETRAINED_CONFIG_ARCHIVE_MAP", "Swinv2Config"], - "models.switch_transformers": ["SWITCH_TRANSFORMERS_PRETRAINED_CONFIG_ARCHIVE_MAP", "SwitchTransformersConfig"], + "models.switch_transformers": [ + "SWITCH_TRANSFORMERS_PRETRAINED_CONFIG_ARCHIVE_MAP", + "SwitchTransformersConfig", + ], "models.t5": ["T5_PRETRAINED_CONFIG_ARCHIVE_MAP", "T5Config"], - "models.table_transformer": ["TABLE_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "TableTransformerConfig"], - "models.tapas": ["TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP", "TapasConfig", "TapasTokenizer"], - "models.tapex": ["TapexTokenizer"], + "models.table_transformer": [ + "TABLE_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", + "TableTransformerConfig", + ], + "models.tapas": [ + "TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP", + "TapasConfig", + "TapasTokenizer", + ], "models.time_series_transformer": [ 
"TIME_SERIES_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "TimeSeriesTransformerConfig", ], - "models.timesformer": ["TIMESFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "TimesformerConfig"], - "models.trajectory_transformer": [ - "TRAJECTORY_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", - "TrajectoryTransformerConfig", - ], - "models.transfo_xl": [ - "TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP", - "TransfoXLConfig", - "TransfoXLCorpus", - "TransfoXLTokenizer", + "models.timesformer": [ + "TIMESFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", + "TimesformerConfig", ], + "models.timm_backbone": ["TimmBackboneConfig"], "models.trocr": [ "TROCR_PRETRAINED_CONFIG_ARCHIVE_MAP", "TrOCRConfig", @@ -456,8 +845,15 @@ "models.tvlt": [ "TVLT_PRETRAINED_CONFIG_ARCHIVE_MAP", "TvltConfig", + "TvltFeatureExtractor", "TvltProcessor", ], + "models.tvp": [ + "TVP_PRETRAINED_CONFIG_ARCHIVE_MAP", + "TvpConfig", + "TvpProcessor", + ], + "models.umt5": ["UMT5Config"], "models.unispeech": [ "UNISPEECH_PRETRAINED_CONFIG_ARCHIVE_MAP", "UniSpeechConfig", @@ -466,8 +862,12 @@ "UNISPEECH_SAT_PRETRAINED_CONFIG_ARCHIVE_MAP", "UniSpeechSatConfig", ], + "models.univnet": [ + "UNIVNET_PRETRAINED_CONFIG_ARCHIVE_MAP", + "UnivNetConfig", + "UnivNetFeatureExtractor", + ], "models.upernet": ["UperNetConfig"], - "models.van": ["VAN_PRETRAINED_CONFIG_ARCHIVE_MAP", "VanConfig"], "models.videomae": ["VIDEOMAE_PRETRAINED_CONFIG_ARCHIVE_MAP", "VideoMAEConfig"], "models.vilt": [ "VILT_PRETRAINED_CONFIG_ARCHIVE_MAP", @@ -476,13 +876,37 @@ "ViltImageProcessor", "ViltProcessor", ], + "models.vipllava": [ + "VIPLLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP", + "VipLlavaConfig", + ], "models.vision_encoder_decoder": ["VisionEncoderDecoderConfig"], - "models.vision_text_dual_encoder": ["VisionTextDualEncoderConfig", "VisionTextDualEncoderProcessor"], - "models.visual_bert": ["VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP", "VisualBertConfig"], + "models.vision_text_dual_encoder": [ + "VisionTextDualEncoderConfig", + "VisionTextDualEncoderProcessor", + ], + "models.visual_bert": [ + "VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP", + "VisualBertConfig", + ], "models.vit": ["VIT_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViTConfig"], - "models.vit_hybrid": ["VIT_HYBRID_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViTHybridConfig"], + "models.vit_hybrid": [ + "VIT_HYBRID_PRETRAINED_CONFIG_ARCHIVE_MAP", + "ViTHybridConfig", + ], "models.vit_mae": ["VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViTMAEConfig"], "models.vit_msn": ["VIT_MSN_PRETRAINED_CONFIG_ARCHIVE_MAP", "ViTMSNConfig"], + "models.vitdet": ["VITDET_PRETRAINED_CONFIG_ARCHIVE_MAP", "VitDetConfig"], + "models.vitmatte": ["VITMATTE_PRETRAINED_CONFIG_ARCHIVE_MAP", "VitMatteConfig"], + "models.vits": [ + "VITS_PRETRAINED_CONFIG_ARCHIVE_MAP", + "VitsConfig", + "VitsTokenizer", + ], + "models.vivit": [ + "VIVIT_PRETRAINED_CONFIG_ARCHIVE_MAP", + "VivitConfig", + ], "models.wav2vec2": [ "WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP", "Wav2Vec2Config", @@ -491,6 +915,11 @@ "Wav2Vec2Processor", "Wav2Vec2Tokenizer", ], + "models.wav2vec2_bert": [ + "WAV2VEC2_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP", + "Wav2Vec2BertConfig", + "Wav2Vec2BertProcessor", + ], "models.wav2vec2_conformer": [ "WAV2VEC2_CONFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "Wav2Vec2ConformerConfig", @@ -517,9 +946,18 @@ ], "models.xglm": ["XGLM_PRETRAINED_CONFIG_ARCHIVE_MAP", "XGLMConfig"], "models.xlm": ["XLM_PRETRAINED_CONFIG_ARCHIVE_MAP", "XLMConfig", "XLMTokenizer"], - "models.xlm_prophetnet": ["XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "XLMProphetNetConfig"], - "models.xlm_roberta": 
["XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP", "XLMRobertaConfig"], - "models.xlm_roberta_xl": ["XLM_ROBERTA_XL_PRETRAINED_CONFIG_ARCHIVE_MAP", "XLMRobertaXLConfig"], + "models.xlm_prophetnet": [ + "XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP", + "XLMProphetNetConfig", + ], + "models.xlm_roberta": [ + "XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP", + "XLMRobertaConfig", + ], + "models.xlm_roberta_xl": [ + "XLM_ROBERTA_XL_PRETRAINED_CONFIG_ARCHIVE_MAP", + "XLMRobertaXLConfig", + ], "models.xlnet": ["XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP", "XLNetConfig"], "models.xmod": ["XMOD_PRETRAINED_CONFIG_ARCHIVE_MAP", "XmodConfig"], "models.yolos": ["YOLOS_PRETRAINED_CONFIG_ARCHIVE_MAP", "YolosConfig"], @@ -536,9 +974,12 @@ "FeatureExtractionPipeline", "FillMaskPipeline", "ImageClassificationPipeline", + "ImageFeatureExtractionPipeline", "ImageSegmentationPipeline", + "ImageToImagePipeline", "ImageToTextPipeline", "JsonPipelineDataFormat", + "MaskGenerationPipeline", "NerPipeline", "ObjectDetectionPipeline", "PipedPipelineDataFormat", @@ -550,16 +991,19 @@ "Text2TextGenerationPipeline", "TextClassificationPipeline", "TextGenerationPipeline", + "TextToAudioPipeline", "TokenClassificationPipeline", "TranslationPipeline", "VideoClassificationPipeline", "VisualQuestionAnsweringPipeline", + "ZeroShotAudioClassificationPipeline", "ZeroShotClassificationPipeline", "ZeroShotImageClassificationPipeline", "ZeroShotObjectDetectionPipeline", "pipeline", ], "processing_utils": ["ProcessorMixin"], + "quantizers": [], "testing_utils": [], "tokenization_utils": ["PreTrainedTokenizer"], "tokenization_utils_base": [ @@ -570,6 +1014,18 @@ "SpecialTokensMixin", "TokenSpan", ], + "tools": [ + "Agent", + "AzureOpenAiAgent", + "HfAgent", + "LocalAgent", + "OpenAiAgent", + "PipelineTool", + "RemoteTool", + "Tool", + "launch_gradio_demo", + "load_tool", + ], "trainer_callback": [ "DefaultFlowCallback", "EarlyStoppingCallback", @@ -579,7 +1035,13 @@ "TrainerControl", "TrainerState", ], - "trainer_utils": ["EvalPrediction", "IntervalStrategy", "SchedulerType", "enable_full_determinism", "set_seed"], + "trainer_utils": [ + "EvalPrediction", + "IntervalStrategy", + "SchedulerType", + "enable_full_determinism", + "set_seed", + ], "training_args": ["TrainingArguments"], "training_args_seq2seq": ["Seq2SeqTrainingArguments"], "training_args_tf": ["TFTrainingArguments"], @@ -618,13 +1080,14 @@ "is_tokenizers_available", "is_torch_available", "is_torch_neuroncore_available", + "is_torch_npu_available", "is_torch_tpu_available", "is_torchvision_available", + "is_torch_xpu_available", "is_vision_available", "logging", ], - "utils.bitsandbytes": [], - "utils.quantization_config": ["BitsAndBytesConfig"], + "utils.quantization_config": ["AqlmConfig", "AwqConfig", "BitsAndBytesConfig", "GPTQConfig"], } # sentencepiece-backed objects @@ -644,12 +1107,14 @@ _import_structure["models.bert_generation"].append("BertGenerationTokenizer") _import_structure["models.big_bird"].append("BigBirdTokenizer") _import_structure["models.camembert"].append("CamembertTokenizer") + _import_structure["models.code_llama"].append("CodeLlamaTokenizer") _import_structure["models.cpm"].append("CpmTokenizer") _import_structure["models.deberta_v2"].append("DebertaV2Tokenizer") _import_structure["models.ernie_m"].append("ErnieMTokenizer") _import_structure["models.fnet"].append("FNetTokenizer") _import_structure["models.gpt_sw3"].append("GPTSw3Tokenizer") _import_structure["models.layoutxlm"].append("LayoutXLMTokenizer") + 
_import_structure["models.llama"].append("LlamaTokenizer") _import_structure["models.m2m_100"].append("M2M100Tokenizer") _import_structure["models.marian"].append("MarianTokenizer") _import_structure["models.mbart"].append("MBartTokenizer") @@ -661,6 +1126,8 @@ _import_structure["models.plbart"].append("PLBartTokenizer") _import_structure["models.reformer"].append("ReformerTokenizer") _import_structure["models.rembert"].append("RemBertTokenizer") + _import_structure["models.seamless_m4t"].append("SeamlessM4TTokenizer") + _import_structure["models.siglip"].append("SiglipTokenizer") _import_structure["models.speech_to_text"].append("Speech2TextTokenizer") _import_structure["models.speecht5"].append("SpeechT5Tokenizer") _import_structure["models.t5"].append("T5Tokenizer") @@ -691,14 +1158,20 @@ _import_structure["models.bloom"].append("BloomTokenizerFast") _import_structure["models.camembert"].append("CamembertTokenizerFast") _import_structure["models.clip"].append("CLIPTokenizerFast") + _import_structure["models.code_llama"].append("CodeLlamaTokenizerFast") _import_structure["models.codegen"].append("CodeGenTokenizerFast") _import_structure["models.convbert"].append("ConvBertTokenizerFast") _import_structure["models.cpm"].append("CpmTokenizerFast") _import_structure["models.deberta"].append("DebertaTokenizerFast") _import_structure["models.deberta_v2"].append("DebertaV2TokenizerFast") + _import_structure["models.deprecated.retribert"].append("RetriBertTokenizerFast") _import_structure["models.distilbert"].append("DistilBertTokenizerFast") _import_structure["models.dpr"].extend( - ["DPRContextEncoderTokenizerFast", "DPRQuestionEncoderTokenizerFast", "DPRReaderTokenizerFast"] + [ + "DPRContextEncoderTokenizerFast", + "DPRQuestionEncoderTokenizerFast", + "DPRReaderTokenizerFast", + ] ) _import_structure["models.electra"].append("ElectraTokenizerFast") _import_structure["models.fnet"].append("FNetTokenizerFast") @@ -712,6 +1185,7 @@ _import_structure["models.layoutlmv3"].append("LayoutLMv3TokenizerFast") _import_structure["models.layoutxlm"].append("LayoutXLMTokenizerFast") _import_structure["models.led"].append("LEDTokenizerFast") + _import_structure["models.llama"].append("LlamaTokenizerFast") _import_structure["models.longformer"].append("LongformerTokenizerFast") _import_structure["models.lxmert"].append("LxmertTokenizerFast") _import_structure["models.markuplm"].append("MarkupLMTokenizerFast") @@ -722,17 +1196,20 @@ _import_structure["models.mt5"].append("MT5TokenizerFast") _import_structure["models.mvp"].append("MvpTokenizerFast") _import_structure["models.nllb"].append("NllbTokenizerFast") + _import_structure["models.nougat"].append("NougatTokenizerFast") _import_structure["models.openai"].append("OpenAIGPTTokenizerFast") _import_structure["models.pegasus"].append("PegasusTokenizerFast") + _import_structure["models.qwen2"].append("Qwen2TokenizerFast") _import_structure["models.realm"].append("RealmTokenizerFast") _import_structure["models.reformer"].append("ReformerTokenizerFast") _import_structure["models.rembert"].append("RemBertTokenizerFast") - _import_structure["models.retribert"].append("RetriBertTokenizerFast") _import_structure["models.roberta"].append("RobertaTokenizerFast") _import_structure["models.roformer"].append("RoFormerTokenizerFast") + _import_structure["models.seamless_m4t"].append("SeamlessM4TTokenizerFast") _import_structure["models.splinter"].append("SplinterTokenizerFast") _import_structure["models.squeezebert"].append("SqueezeBertTokenizerFast") 
_import_structure["models.t5"].append("T5TokenizerFast") + _import_structure["models.whisper"].append("WhisperTokenizerFast") _import_structure["models.xglm"].append("XGLMTokenizerFast") _import_structure["models.xlm_roberta"].append("XLMRobertaTokenizerFast") _import_structure["models.xlnet"].append("XLNetTokenizerFast") @@ -749,24 +1226,10 @@ name for name in dir(dummy_sentencepiece_and_tokenizers_objects) if not name.startswith("_") ] else: - _import_structure["convert_slow_tokenizer"] = ["SLOW_TO_FAST_CONVERTERS", "convert_slow_tokenizer"] - -# Speech-specific objects -try: - if not is_speech_available(): - raise OptionalDependencyNotAvailable() -except OptionalDependencyNotAvailable: - from .utils import dummy_speech_objects - - _import_structure["utils.dummy_speech_objects"] = [ - name for name in dir(dummy_speech_objects) if not name.startswith("_") + _import_structure["convert_slow_tokenizer"] = [ + "SLOW_TO_FAST_CONVERTERS", + "convert_slow_tokenizer", ] -else: - _import_structure["models.audio_spectrogram_transformer"].append("ASTFeatureExtractor") - _import_structure["models.mctct"].append("MCTCTFeatureExtractor") - _import_structure["models.speech_to_text"].append("Speech2TextFeatureExtractor") - _import_structure["models.speecht5"].append("SpeechT5FeatureExtractor") - _import_structure["models.tvlt"].append("TvltFeatureExtractor") # Tensorflow-text-specific objects try: @@ -826,8 +1289,11 @@ _import_structure["models.donut"].extend(["DonutFeatureExtractor", "DonutImageProcessor"]) _import_structure["models.dpt"].extend(["DPTFeatureExtractor", "DPTImageProcessor"]) _import_structure["models.efficientformer"].append("EfficientFormerImageProcessor") + _import_structure["models.efficientnet"].append("EfficientNetImageProcessor") _import_structure["models.flava"].extend(["FlavaFeatureExtractor", "FlavaImageProcessor", "FlavaProcessor"]) + _import_structure["models.fuyu"].extend(["FuyuImageProcessor", "FuyuProcessor"]) _import_structure["models.glpn"].extend(["GLPNFeatureExtractor", "GLPNImageProcessor"]) + _import_structure["models.idefics"].extend(["IdeficsImageProcessor"]) _import_structure["models.imagegpt"].extend(["ImageGPTFeatureExtractor", "ImageGPTImageProcessor"]) _import_structure["models.layoutlmv2"].extend(["LayoutLMv2FeatureExtractor", "LayoutLMv2ImageProcessor"]) _import_structure["models.layoutlmv3"].extend(["LayoutLMv3FeatureExtractor", "LayoutLMv3ImageProcessor"]) @@ -837,65 +1303,28 @@ _import_structure["models.mobilenet_v1"].extend(["MobileNetV1FeatureExtractor", "MobileNetV1ImageProcessor"]) _import_structure["models.mobilenet_v2"].extend(["MobileNetV2FeatureExtractor", "MobileNetV2ImageProcessor"]) _import_structure["models.mobilevit"].extend(["MobileViTFeatureExtractor", "MobileViTImageProcessor"]) + _import_structure["models.nougat"].append("NougatImageProcessor") _import_structure["models.oneformer"].extend(["OneFormerImageProcessor"]) + _import_structure["models.owlv2"].append("Owlv2ImageProcessor") _import_structure["models.owlvit"].extend(["OwlViTFeatureExtractor", "OwlViTImageProcessor"]) _import_structure["models.perceiver"].extend(["PerceiverFeatureExtractor", "PerceiverImageProcessor"]) + _import_structure["models.pix2struct"].extend(["Pix2StructImageProcessor"]) _import_structure["models.poolformer"].extend(["PoolFormerFeatureExtractor", "PoolFormerImageProcessor"]) + _import_structure["models.pvt"].extend(["PvtImageProcessor"]) + _import_structure["models.sam"].extend(["SamImageProcessor"]) 
_import_structure["models.segformer"].extend(["SegformerFeatureExtractor", "SegformerImageProcessor"]) + _import_structure["models.siglip"].append("SiglipImageProcessor") _import_structure["models.swin2sr"].append("Swin2SRImageProcessor") _import_structure["models.tvlt"].append("TvltImageProcessor") + _import_structure["models.tvp"].append("TvpImageProcessor") _import_structure["models.videomae"].extend(["VideoMAEFeatureExtractor", "VideoMAEImageProcessor"]) _import_structure["models.vilt"].extend(["ViltFeatureExtractor", "ViltImageProcessor", "ViltProcessor"]) _import_structure["models.vit"].extend(["ViTFeatureExtractor", "ViTImageProcessor"]) _import_structure["models.vit_hybrid"].extend(["ViTHybridImageProcessor"]) + _import_structure["models.vitmatte"].append("VitMatteImageProcessor") + _import_structure["models.vivit"].append("VivitImageProcessor") _import_structure["models.yolos"].extend(["YolosFeatureExtractor", "YolosImageProcessor"]) -# Timm-backed objects -try: - if not (is_timm_available() and is_vision_available()): - raise OptionalDependencyNotAvailable() -except OptionalDependencyNotAvailable: - from .utils import dummy_timm_and_vision_objects - - _import_structure["utils.dummy_timm_and_vision_objects"] = [ - name for name in dir(dummy_timm_and_vision_objects) if not name.startswith("_") - ] -else: - _import_structure["models.deformable_detr"].extend( - [ - "DEFORMABLE_DETR_PRETRAINED_MODEL_ARCHIVE_LIST", - "DeformableDetrForObjectDetection", - "DeformableDetrModel", - "DeformableDetrPreTrainedModel", - ] - ) - _import_structure["models.detr"].extend( - [ - "DETR_PRETRAINED_MODEL_ARCHIVE_LIST", - "DetrForObjectDetection", - "DetrForSegmentation", - "DetrModel", - "DetrPreTrainedModel", - ] - ) - _import_structure["models.table_transformer"].extend( - [ - "TABLE_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST", - "TableTransformerForObjectDetection", - "TableTransformerModel", - "TableTransformerPreTrainedModel", - ] - ) - _import_structure["models.conditional_detr"].extend( - [ - "CONDITIONAL_DETR_PRETRAINED_MODEL_ARCHIVE_LIST", - "ConditionalDetrForObjectDetection", - "ConditionalDetrForSegmentation", - "ConditionalDetrModel", - "ConditionalDetrPreTrainedModel", - ] - ) - # PyTorch-backed objects try: @@ -909,6 +1338,7 @@ _import_structure["activations"] = [] _import_structure["benchmark.benchmark"] = ["PyTorchBenchmark"] _import_structure["benchmark.benchmark_args"] = ["PyTorchBenchmarkArguments"] + _import_structure["cache_utils"] = ["Cache", "DynamicCache", "SinkCache", "StaticCache"] _import_structure["data.datasets"] = [ "GlueDataset", "GlueDataTrainingArguments", @@ -920,20 +1350,28 @@ "TextDataset", "TextDatasetForNextSentencePrediction", ] - _import_structure["deepspeed"] = [] _import_structure["generation"].extend( [ + "AlternatingCodebooksLogitsProcessor", "BeamScorer", "BeamSearchScorer", + "ClassifierFreeGuidanceLogitsProcessor", "ConstrainedBeamSearchScorer", "Constraint", "ConstraintListState", "DisjunctiveConstraint", + "EncoderNoRepeatNGramLogitsProcessor", + "EncoderRepetitionPenaltyLogitsProcessor", + "EpsilonLogitsWarper", + "EtaLogitsWarper", + "ExponentialDecayLengthPenalty", "ForcedBOSTokenLogitsProcessor", "ForcedEOSTokenLogitsProcessor", + "ForceTokensLogitsProcessor", "GenerationMixin", "HammingDiversityLogitsProcessor", "InfNanRemoveLogitsProcessor", + "LogitNormalization", "LogitsProcessor", "LogitsProcessorList", "LogitsWarper", @@ -946,12 +1384,17 @@ "PhrasalConstraint", "PrefixConstrainedLogitsProcessor", "RepetitionPenaltyLogitsProcessor", + 
"SequenceBiasLogitsProcessor", "StoppingCriteria", "StoppingCriteriaList", + "SuppressTokensAtBeginLogitsProcessor", + "SuppressTokensLogitsProcessor", "TemperatureLogitsWarper", "TopKLogitsWarper", "TopPLogitsWarper", "TypicalLogitsWarper", + "UnbatchedClassifierFreeGuidanceLogitsProcessor", + "WhisperTimeStampLogitsProcessor", "top_k_top_p_filtering", ] ) @@ -960,6 +1403,7 @@ _import_structure["modeling_utils"] = ["PreTrainedModel"] # PyTorch models structure + _import_structure["models.albert"].extend( [ "ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -974,6 +1418,16 @@ "load_tf_weights_in_albert", ] ) + + _import_structure["models.align"].extend( + [ + "ALIGN_PRETRAINED_MODEL_ARCHIVE_LIST", + "AlignModel", + "AlignPreTrainedModel", + "AlignTextModel", + "AlignVisionModel", + ] + ) _import_structure["models.altclip"].extend( [ "ALTCLIP_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -994,6 +1448,7 @@ _import_structure["models.auto"].extend( [ "MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING", + "MODEL_FOR_AUDIO_FRAME_CLASSIFICATION_MAPPING", "MODEL_FOR_AUDIO_XVECTOR_MAPPING", "MODEL_FOR_BACKBONE_MAPPING", "MODEL_FOR_CAUSAL_IMAGE_MODELING_MAPPING", @@ -1003,9 +1458,11 @@ "MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING", "MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING", "MODEL_FOR_IMAGE_SEGMENTATION_MAPPING", + "MODEL_FOR_IMAGE_TO_IMAGE_MAPPING", "MODEL_FOR_INSTANCE_SEGMENTATION_MAPPING", "MODEL_FOR_MASKED_IMAGE_MODELING_MAPPING", "MODEL_FOR_MASKED_LM_MAPPING", + "MODEL_FOR_MASK_GENERATION_MAPPING", "MODEL_FOR_MULTIPLE_CHOICE_MAPPING", "MODEL_FOR_NEXT_SENTENCE_PREDICTION_MAPPING", "MODEL_FOR_OBJECT_DETECTION_MAPPING", @@ -1016,11 +1473,17 @@ "MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING", "MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING", "MODEL_FOR_TABLE_QUESTION_ANSWERING_MAPPING", + "MODEL_FOR_TEXT_ENCODING_MAPPING", + "MODEL_FOR_TEXT_TO_SPECTROGRAM_MAPPING", + "MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING", + "MODEL_FOR_TIME_SERIES_CLASSIFICATION_MAPPING", + "MODEL_FOR_TIME_SERIES_REGRESSION_MAPPING", "MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING", "MODEL_FOR_UNIVERSAL_SEGMENTATION_MAPPING", "MODEL_FOR_VIDEO_CLASSIFICATION_MAPPING", "MODEL_FOR_VISION_2_SEQ_MAPPING", "MODEL_FOR_VISUAL_QUESTION_ANSWERING_MAPPING", + "MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING", "MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING", "MODEL_MAPPING", "MODEL_WITH_LM_HEAD_MAPPING", @@ -1035,9 +1498,11 @@ "AutoModelForDocumentQuestionAnswering", "AutoModelForImageClassification", "AutoModelForImageSegmentation", + "AutoModelForImageToImage", "AutoModelForInstanceSegmentation", "AutoModelForMaskedImageModeling", "AutoModelForMaskedLM", + "AutoModelForMaskGeneration", "AutoModelForMultipleChoice", "AutoModelForNextSentencePrediction", "AutoModelForObjectDetection", @@ -1048,15 +1513,38 @@ "AutoModelForSequenceClassification", "AutoModelForSpeechSeq2Seq", "AutoModelForTableQuestionAnswering", + "AutoModelForTextEncoding", + "AutoModelForTextToSpectrogram", + "AutoModelForTextToWaveform", "AutoModelForTokenClassification", "AutoModelForUniversalSegmentation", "AutoModelForVideoClassification", "AutoModelForVision2Seq", "AutoModelForVisualQuestionAnswering", + "AutoModelForZeroShotImageClassification", "AutoModelForZeroShotObjectDetection", "AutoModelWithLMHead", ] ) + _import_structure["models.autoformer"].extend( + [ + "AUTOFORMER_PRETRAINED_MODEL_ARCHIVE_LIST", + "AutoformerForPrediction", + "AutoformerModel", + "AutoformerPreTrainedModel", + ] + ) + _import_structure["models.bark"].extend( + [ + "BARK_PRETRAINED_MODEL_ARCHIVE_LIST", + "BarkCausalModel", + "BarkCoarseModel", + 
"BarkFineModel", + "BarkModel", + "BarkPreTrainedModel", + "BarkSemanticModel", + ] + ) _import_structure["models.bart"].extend( [ "BART_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -1066,12 +1554,14 @@ "BartForSequenceClassification", "BartModel", "BartPretrainedModel", + "BartPreTrainedModel", "PretrainedBartModel", ] ) _import_structure["models.beit"].extend( [ "BEIT_PRETRAINED_MODEL_ARCHIVE_LIST", + "BeitBackbone", "BeitForImageClassification", "BeitForMaskedImageModeling", "BeitForSemanticSegmentation", @@ -1135,6 +1625,8 @@ [ "BIOGPT_PRETRAINED_MODEL_ARCHIVE_LIST", "BioGptForCausalLM", + "BioGptForSequenceClassification", + "BioGptForTokenClassification", "BioGptModel", "BioGptPreTrainedModel", ] @@ -1182,6 +1674,7 @@ [ "BLIP_2_PRETRAINED_MODEL_ARCHIVE_LIST", "Blip2ForConditionalGeneration", + "Blip2Model", "Blip2PreTrainedModel", "Blip2QFormerModel", "Blip2VisionModel", @@ -1201,12 +1694,24 @@ _import_structure["models.bridgetower"].extend( [ "BRIDGETOWER_PRETRAINED_MODEL_ARCHIVE_LIST", + "BridgeTowerForContrastiveLearning", "BridgeTowerForImageAndTextRetrieval", "BridgeTowerForMaskedLM", "BridgeTowerModel", "BridgeTowerPreTrainedModel", ] ) + _import_structure["models.bros"].extend( + [ + "BROS_PRETRAINED_MODEL_ARCHIVE_LIST", + "BrosForTokenClassification", + "BrosModel", + "BrosPreTrainedModel", + "BrosProcessor", + "BrosSpadeEEForTokenClassification", + "BrosSpadeELForTokenClassification", + ] + ) _import_structure["models.camembert"].extend( [ "CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -1257,6 +1762,7 @@ _import_structure["models.clip"].extend( [ "CLIP_PRETRAINED_MODEL_ARCHIVE_LIST", + "CLIPForImageClassification", "CLIPModel", "CLIPPreTrainedModel", "CLIPTextModel", @@ -1275,6 +1781,17 @@ "CLIPSegVisionModel", ] ) + _import_structure["models.clvp"].extend( + [ + "CLVP_PRETRAINED_MODEL_ARCHIVE_LIST", + "ClvpDecoder", + "ClvpEncoder", + "ClvpForCausalLM", + "ClvpModel", + "ClvpModelForConditionalGeneration", + "ClvpPreTrainedModel", + ] + ) _import_structure["models.codegen"].extend( [ "CODEGEN_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -1283,6 +1800,15 @@ "CodeGenPreTrainedModel", ] ) + _import_structure["models.conditional_detr"].extend( + [ + "CONDITIONAL_DETR_PRETRAINED_MODEL_ARCHIVE_LIST", + "ConditionalDetrForObjectDetection", + "ConditionalDetrForSegmentation", + "ConditionalDetrModel", + "ConditionalDetrPreTrainedModel", + ] + ) _import_structure["models.convbert"].extend( [ "CONVBERT_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -1306,6 +1832,23 @@ "ConvNextPreTrainedModel", ] ) + _import_structure["models.convnextv2"].extend( + [ + "CONVNEXTV2_PRETRAINED_MODEL_ARCHIVE_LIST", + "ConvNextV2Backbone", + "ConvNextV2ForImageClassification", + "ConvNextV2Model", + "ConvNextV2PreTrainedModel", + ] + ) + _import_structure["models.cpmant"].extend( + [ + "CPMANT_PRETRAINED_MODEL_ARCHIVE_LIST", + "CpmAntForCausalLM", + "CpmAntModel", + "CpmAntPreTrainedModel", + ] + ) _import_structure["models.ctrl"].extend( [ "CTRL_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -1380,6 +1923,14 @@ "DecisionTransformerPreTrainedModel", ] ) + _import_structure["models.deformable_detr"].extend( + [ + "DEFORMABLE_DETR_PRETRAINED_MODEL_ARCHIVE_LIST", + "DeformableDetrForObjectDetection", + "DeformableDetrModel", + "DeformableDetrPreTrainedModel", + ] + ) _import_structure["models.deit"].extend( [ "DEIT_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -1390,6 +1941,63 @@ "DeiTPreTrainedModel", ] ) + _import_structure["models.deprecated.mctct"].extend( + [ + "MCTCT_PRETRAINED_MODEL_ARCHIVE_LIST", + "MCTCTForCTC", + "MCTCTModel", + "MCTCTPreTrainedModel", 
+ ] + ) + _import_structure["models.deprecated.mmbt"].extend(["MMBTForClassification", "MMBTModel", "ModalEmbeddings"]) + _import_structure["models.deprecated.open_llama"].extend( + [ + "OpenLlamaForCausalLM", + "OpenLlamaForSequenceClassification", + "OpenLlamaModel", + "OpenLlamaPreTrainedModel", + ] + ) + _import_structure["models.deprecated.retribert"].extend( + [ + "RETRIBERT_PRETRAINED_MODEL_ARCHIVE_LIST", + "RetriBertModel", + "RetriBertPreTrainedModel", + ] + ) + _import_structure["models.deprecated.trajectory_transformer"].extend( + [ + "TRAJECTORY_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST", + "TrajectoryTransformerModel", + "TrajectoryTransformerPreTrainedModel", + ] + ) + _import_structure["models.deprecated.transfo_xl"].extend( + [ + "TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST", + "AdaptiveEmbedding", + "TransfoXLForSequenceClassification", + "TransfoXLLMHeadModel", + "TransfoXLModel", + "TransfoXLPreTrainedModel", + "load_tf_weights_in_transfo_xl", + ] + ) + _import_structure["models.deprecated.van"].extend( + [ + "VAN_PRETRAINED_MODEL_ARCHIVE_LIST", + "VanForImageClassification", + "VanModel", + "VanPreTrainedModel", + ] + ) + _import_structure["models.depth_anything"].extend( + [ + "DEPTH_ANYTHING_PRETRAINED_MODEL_ARCHIVE_LIST", + "DepthAnythingForDepthEstimation", + "DepthAnythingPreTrainedModel", + ] + ) _import_structure["models.deta"].extend( [ "DETA_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -1398,6 +2006,15 @@ "DetaPreTrainedModel", ] ) + _import_structure["models.detr"].extend( + [ + "DETR_PRETRAINED_MODEL_ARCHIVE_LIST", + "DetrForObjectDetection", + "DetrForSegmentation", + "DetrModel", + "DetrPreTrainedModel", + ] + ) _import_structure["models.dinat"].extend( [ "DINAT_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -1407,6 +2024,15 @@ "DinatPreTrainedModel", ] ) + _import_structure["models.dinov2"].extend( + [ + "DINOV2_PRETRAINED_MODEL_ARCHIVE_LIST", + "Dinov2Backbone", + "Dinov2ForImageClassification", + "Dinov2Model", + "Dinov2PreTrainedModel", + ] + ) _import_structure["models.distilbert"].extend( [ "DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -1458,6 +2084,14 @@ "EfficientFormerPreTrainedModel", ] ) + _import_structure["models.efficientnet"].extend( + [ + "EFFICIENTNET_PRETRAINED_MODEL_ARCHIVE_LIST", + "EfficientNetForImageClassification", + "EfficientNetModel", + "EfficientNetPreTrainedModel", + ] + ) _import_structure["models.electra"].extend( [ "ELECTRA_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -1473,6 +2107,13 @@ "load_tf_weights_in_electra", ] ) + _import_structure["models.encodec"].extend( + [ + "ENCODEC_PRETRAINED_MODEL_ARCHIVE_LIST", + "EncodecModel", + "EncodecPreTrainedModel", + ] + ) _import_structure["models.encoder_decoder"].append("EncoderDecoderModel") _import_structure["models.ernie"].extend( [ @@ -1513,6 +2154,26 @@ "EsmPreTrainedModel", ] ) + _import_structure["models.falcon"].extend( + [ + "FALCON_PRETRAINED_MODEL_ARCHIVE_LIST", + "FalconForCausalLM", + "FalconForQuestionAnswering", + "FalconForSequenceClassification", + "FalconForTokenClassification", + "FalconModel", + "FalconPreTrainedModel", + ] + ) + _import_structure["models.fastspeech2_conformer"].extend( + [ + "FASTSPEECH2_CONFORMER_PRETRAINED_MODEL_ARCHIVE_LIST", + "FastSpeech2ConformerHifiGan", + "FastSpeech2ConformerModel", + "FastSpeech2ConformerPreTrainedModel", + "FastSpeech2ConformerWithHifiGan", + ] + ) _import_structure["models.flaubert"].extend( [ "FLAUBERT_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -1553,6 +2214,16 @@ "FNetPreTrainedModel", ] ) + _import_structure["models.focalnet"].extend( + [ + 
"FOCALNET_PRETRAINED_MODEL_ARCHIVE_LIST", + "FocalNetBackbone", + "FocalNetForImageClassification", + "FocalNetForMaskedImageModeling", + "FocalNetModel", + "FocalNetPreTrainedModel", + ] + ) _import_structure["models.fsmt"].extend(["FSMTForConditionalGeneration", "FSMTModel", "PretrainedFSMTModel"]) _import_structure["models.funnel"].extend( [ @@ -1569,6 +2240,7 @@ "load_tf_weights_in_funnel", ] ) + _import_structure["models.fuyu"].extend(["FuyuForCausalLM", "FuyuPreTrainedModel"]) _import_structure["models.git"].extend( [ "GIT_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -1590,6 +2262,7 @@ [ "GPT2_PRETRAINED_MODEL_ARCHIVE_LIST", "GPT2DoubleHeadsModel", + "GPT2ForQuestionAnswering", "GPT2ForSequenceClassification", "GPT2ForTokenClassification", "GPT2LMHeadModel", @@ -1598,11 +2271,23 @@ "load_tf_weights_in_gpt2", ] ) + _import_structure["models.gpt_bigcode"].extend( + [ + "GPT_BIGCODE_PRETRAINED_MODEL_ARCHIVE_LIST", + "GPTBigCodeForCausalLM", + "GPTBigCodeForSequenceClassification", + "GPTBigCodeForTokenClassification", + "GPTBigCodeModel", + "GPTBigCodePreTrainedModel", + ] + ) _import_structure["models.gpt_neo"].extend( [ "GPT_NEO_PRETRAINED_MODEL_ARCHIVE_LIST", "GPTNeoForCausalLM", + "GPTNeoForQuestionAnswering", "GPTNeoForSequenceClassification", + "GPTNeoForTokenClassification", "GPTNeoModel", "GPTNeoPreTrainedModel", "load_tf_weights_in_gpt_neo", @@ -1612,6 +2297,9 @@ [ "GPT_NEOX_PRETRAINED_MODEL_ARCHIVE_LIST", "GPTNeoXForCausalLM", + "GPTNeoXForQuestionAnswering", + "GPTNeoXForSequenceClassification", + "GPTNeoXForTokenClassification", "GPTNeoXLayer", "GPTNeoXModel", "GPTNeoXPreTrainedModel", @@ -1636,6 +2324,14 @@ "GPTJPreTrainedModel", ] ) + _import_structure["models.gptsan_japanese"].extend( + [ + "GPTSAN_JAPANESE_PRETRAINED_MODEL_ARCHIVE_LIST", + "GPTSanJapaneseForConditionalGeneration", + "GPTSanJapaneseModel", + "GPTSanJapanesePreTrainedModel", + ] + ) _import_structure["models.graphormer"].extend( [ "GRAPHORMER_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -1674,6 +2370,15 @@ "IBertPreTrainedModel", ] ) + _import_structure["models.idefics"].extend( + [ + "IDEFICS_PRETRAINED_MODEL_ARCHIVE_LIST", + "IdeficsForVisionText2Text", + "IdeficsModel", + "IdeficsPreTrainedModel", + "IdeficsProcessor", + ] + ) _import_structure["models.imagegpt"].extend( [ "IMAGEGPT_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -1684,6 +2389,23 @@ "load_tf_weights_in_imagegpt", ] ) + _import_structure["models.informer"].extend( + [ + "INFORMER_PRETRAINED_MODEL_ARCHIVE_LIST", + "InformerForPrediction", + "InformerModel", + "InformerPreTrainedModel", + ] + ) + _import_structure["models.instructblip"].extend( + [ + "INSTRUCTBLIP_PRETRAINED_MODEL_ARCHIVE_LIST", + "InstructBlipForConditionalGeneration", + "InstructBlipPreTrainedModel", + "InstructBlipQFormerModel", + "InstructBlipVisionModel", + ] + ) _import_structure["models.jukebox"].extend( [ "JUKEBOX_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -1693,6 +2415,14 @@ "JukeboxVQVAE", ] ) + _import_structure["models.kosmos2"].extend( + [ + "KOSMOS2_PRETRAINED_MODEL_ARCHIVE_LIST", + "Kosmos2ForConditionalGeneration", + "Kosmos2Model", + "Kosmos2PreTrainedModel", + ] + ) _import_structure["models.layoutlm"].extend( [ "LAYOUTLM_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -1753,6 +2483,23 @@ "LiltPreTrainedModel", ] ) + _import_structure["models.llama"].extend( + [ + "LlamaForCausalLM", + "LlamaForQuestionAnswering", + "LlamaForSequenceClassification", + "LlamaModel", + "LlamaPreTrainedModel", + ] + ) + _import_structure["models.llava"].extend( + [ + "LLAVA_PRETRAINED_MODEL_ARCHIVE_LIST", + 
"LlavaForConditionalGeneration", + "LlavaPreTrainedModel", + "LlavaProcessor", + ] + ) _import_structure["models.longformer"].extend( [ "LONGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -1847,12 +2594,17 @@ "MBartPreTrainedModel", ] ) - _import_structure["models.mctct"].extend( + _import_structure["models.mega"].extend( [ - "MCTCT_PRETRAINED_MODEL_ARCHIVE_LIST", - "MCTCTForCTC", - "MCTCTModel", - "MCTCTPreTrainedModel", + "MEGA_PRETRAINED_MODEL_ARCHIVE_LIST", + "MegaForCausalLM", + "MegaForMaskedLM", + "MegaForMultipleChoice", + "MegaForQuestionAnswering", + "MegaForSequenceClassification", + "MegaForTokenClassification", + "MegaModel", + "MegaPreTrainedModel", ] ) _import_structure["models.megatron_bert"].extend( @@ -1870,7 +2622,25 @@ "MegatronBertPreTrainedModel", ] ) - _import_structure["models.mmbt"].extend(["MMBTForClassification", "MMBTModel", "ModalEmbeddings"]) + _import_structure["models.mgp_str"].extend( + [ + "MGP_STR_PRETRAINED_MODEL_ARCHIVE_LIST", + "MgpstrForSceneTextRecognition", + "MgpstrModel", + "MgpstrPreTrainedModel", + ] + ) + _import_structure["models.mistral"].extend( + [ + "MistralForCausalLM", + "MistralForSequenceClassification", + "MistralModel", + "MistralPreTrainedModel", + ] + ) + _import_structure["models.mixtral"].extend( + ["MixtralForCausalLM", "MixtralForSequenceClassification", "MixtralModel", "MixtralPreTrainedModel"] + ) _import_structure["models.mobilebert"].extend( [ "MOBILEBERT_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -1915,6 +2685,15 @@ "MobileViTPreTrainedModel", ] ) + _import_structure["models.mobilevitv2"].extend( + [ + "MOBILEVITV2_PRETRAINED_MODEL_ARCHIVE_LIST", + "MobileViTV2ForImageClassification", + "MobileViTV2ForSemanticSegmentation", + "MobileViTV2Model", + "MobileViTV2PreTrainedModel", + ] + ) _import_structure["models.mpnet"].extend( [ "MPNET_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -1928,8 +2707,49 @@ "MPNetPreTrainedModel", ] ) + _import_structure["models.mpt"].extend( + [ + "MPT_PRETRAINED_MODEL_ARCHIVE_LIST", + "MptForCausalLM", + "MptForQuestionAnswering", + "MptForSequenceClassification", + "MptForTokenClassification", + "MptModel", + "MptPreTrainedModel", + ] + ) + _import_structure["models.mra"].extend( + [ + "MRA_PRETRAINED_MODEL_ARCHIVE_LIST", + "MraForMaskedLM", + "MraForMultipleChoice", + "MraForQuestionAnswering", + "MraForSequenceClassification", + "MraForTokenClassification", + "MraModel", + "MraPreTrainedModel", + ] + ) _import_structure["models.mt5"].extend( - ["MT5EncoderModel", "MT5ForConditionalGeneration", "MT5Model", "MT5PreTrainedModel"] + [ + "MT5EncoderModel", + "MT5ForConditionalGeneration", + "MT5ForQuestionAnswering", + "MT5ForSequenceClassification", + "MT5ForTokenClassification", + "MT5Model", + "MT5PreTrainedModel", + ] + ) + _import_structure["models.musicgen"].extend( + [ + "MUSICGEN_PRETRAINED_MODEL_ARCHIVE_LIST", + "MusicgenForCausalLM", + "MusicgenForConditionalGeneration", + "MusicgenModel", + "MusicgenPreTrainedModel", + "MusicgenProcessor", + ] ) _import_structure["models.mvp"].extend( [ @@ -1965,6 +2785,16 @@ "NezhaPreTrainedModel", ] ) + _import_structure["models.nllb_moe"].extend( + [ + "NLLB_MOE_PRETRAINED_MODEL_ARCHIVE_LIST", + "NllbMoeForConditionalGeneration", + "NllbMoeModel", + "NllbMoePreTrainedModel", + "NllbMoeSparseMLP", + "NllbMoeTop2Router", + ] + ) _import_structure["models.nystromformer"].extend( [ "NYSTROMFORMER_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -2007,6 +2837,16 @@ "OPTPreTrainedModel", ] ) + _import_structure["models.owlv2"].extend( + [ + "OWLV2_PRETRAINED_MODEL_ARCHIVE_LIST", + 
"Owlv2ForObjectDetection", + "Owlv2Model", + "Owlv2PreTrainedModel", + "Owlv2TextModel", + "Owlv2VisionModel", + ] + ) _import_structure["models.owlvit"].extend( [ "OWLVIT_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -2017,8 +2857,35 @@ "OwlViTVisionModel", ] ) + _import_structure["models.patchtsmixer"].extend( + [ + "PATCHTSMIXER_PRETRAINED_MODEL_ARCHIVE_LIST", + "PatchTSMixerForPrediction", + "PatchTSMixerForPretraining", + "PatchTSMixerForRegression", + "PatchTSMixerForTimeSeriesClassification", + "PatchTSMixerModel", + "PatchTSMixerPreTrainedModel", + ] + ) + _import_structure["models.patchtst"].extend( + [ + "PATCHTST_PRETRAINED_MODEL_ARCHIVE_LIST", + "PatchTSTForClassification", + "PatchTSTForPrediction", + "PatchTSTForPretraining", + "PatchTSTForRegression", + "PatchTSTModel", + "PatchTSTPreTrainedModel", + ] + ) _import_structure["models.pegasus"].extend( - ["PegasusForCausalLM", "PegasusForConditionalGeneration", "PegasusModel", "PegasusPreTrainedModel"] + [ + "PegasusForCausalLM", + "PegasusForConditionalGeneration", + "PegasusModel", + "PegasusPreTrainedModel", + ] ) _import_structure["models.pegasus_x"].extend( [ @@ -2043,6 +2910,33 @@ "PerceiverPreTrainedModel", ] ) + _import_structure["models.persimmon"].extend( + [ + "PersimmonForCausalLM", + "PersimmonForSequenceClassification", + "PersimmonModel", + "PersimmonPreTrainedModel", + ] + ) + _import_structure["models.phi"].extend( + [ + "PHI_PRETRAINED_MODEL_ARCHIVE_LIST", + "PhiForCausalLM", + "PhiForSequenceClassification", + "PhiForTokenClassification", + "PhiModel", + "PhiPreTrainedModel", + ] + ) + _import_structure["models.pix2struct"].extend( + [ + "PIX2STRUCT_PRETRAINED_MODEL_ARCHIVE_LIST", + "Pix2StructForConditionalGeneration", + "Pix2StructPreTrainedModel", + "Pix2StructTextModel", + "Pix2StructVisionModel", + ] + ) _import_structure["models.plbart"].extend( [ "PLBART_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -2061,6 +2955,13 @@ "PoolFormerPreTrainedModel", ] ) + _import_structure["models.pop2piano"].extend( + [ + "POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST", + "Pop2PianoForConditionalGeneration", + "Pop2PianoPreTrainedModel", + ] + ) _import_structure["models.prophetnet"].extend( [ "PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -2072,6 +2973,14 @@ "ProphetNetPreTrainedModel", ] ) + _import_structure["models.pvt"].extend( + [ + "PVT_PRETRAINED_MODEL_ARCHIVE_LIST", + "PvtForImageClassification", + "PvtModel", + "PvtPreTrainedModel", + ] + ) _import_structure["models.qdqbert"].extend( [ "QDQBERT_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -2088,8 +2997,21 @@ "load_tf_weights_in_qdqbert", ] ) + _import_structure["models.qwen2"].extend( + [ + "Qwen2ForCausalLM", + "Qwen2ForSequenceClassification", + "Qwen2Model", + "Qwen2PreTrainedModel", + ] + ) _import_structure["models.rag"].extend( - ["RagModel", "RagPreTrainedModel", "RagSequenceForGeneration", "RagTokenForGeneration"] + [ + "RagModel", + "RagPreTrainedModel", + "RagSequenceForGeneration", + "RagTokenForGeneration", + ] ) _import_structure["models.realm"].extend( [ @@ -2149,9 +3071,6 @@ "ResNetPreTrainedModel", ] ) - _import_structure["models.retribert"].extend( - ["RETRIBERT_PRETRAINED_MODEL_ARCHIVE_LIST", "RetriBertModel", "RetriBertPreTrainedModel"] - ) _import_structure["models.roberta"].extend( [ "ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -2209,6 +3128,47 @@ "load_tf_weights_in_roformer", ] ) + _import_structure["models.rwkv"].extend( + [ + "RWKV_PRETRAINED_MODEL_ARCHIVE_LIST", + "RwkvForCausalLM", + "RwkvModel", + "RwkvPreTrainedModel", + ] + ) + _import_structure["models.sam"].extend( + 
[ + "SAM_PRETRAINED_MODEL_ARCHIVE_LIST", + "SamModel", + "SamPreTrainedModel", + ] + ) + _import_structure["models.seamless_m4t"].extend( + [ + "SEAMLESS_M4T_PRETRAINED_MODEL_ARCHIVE_LIST", + "SeamlessM4TCodeHifiGan", + "SeamlessM4TForSpeechToSpeech", + "SeamlessM4TForSpeechToText", + "SeamlessM4TForTextToSpeech", + "SeamlessM4TForTextToText", + "SeamlessM4THifiGan", + "SeamlessM4TModel", + "SeamlessM4TPreTrainedModel", + "SeamlessM4TTextToUnitForConditionalGeneration", + "SeamlessM4TTextToUnitModel", + ] + ) + _import_structure["models.seamless_m4t_v2"].extend( + [ + "SEAMLESS_M4T_V2_PRETRAINED_MODEL_ARCHIVE_LIST", + "SeamlessM4Tv2ForSpeechToSpeech", + "SeamlessM4Tv2ForSpeechToText", + "SeamlessM4Tv2ForTextToSpeech", + "SeamlessM4Tv2ForTextToText", + "SeamlessM4Tv2Model", + "SeamlessM4Tv2PreTrainedModel", + ] + ) _import_structure["models.segformer"].extend( [ "SEGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -2238,6 +3198,16 @@ "SEWDPreTrainedModel", ] ) + _import_structure["models.siglip"].extend( + [ + "SIGLIP_PRETRAINED_MODEL_ARCHIVE_LIST", + "SiglipForImageClassification", + "SiglipModel", + "SiglipPreTrainedModel", + "SiglipTextModel", + "SiglipVisionModel", + ] + ) _import_structure["models.speech_encoder_decoder"].extend(["SpeechEncoderDecoderModel"]) _import_structure["models.speech_to_text"].extend( [ @@ -2282,6 +3252,22 @@ "SqueezeBertPreTrainedModel", ] ) + _import_structure["models.stablelm"].extend( + [ + "StableLmForCausalLM", + "StableLmForSequenceClassification", + "StableLmModel", + "StableLmPreTrainedModel", + ] + ) + _import_structure["models.swiftformer"].extend( + [ + "SWIFTFORMER_PRETRAINED_MODEL_ARCHIVE_LIST", + "SwiftFormerForImageClassification", + "SwiftFormerModel", + "SwiftFormerPreTrainedModel", + ] + ) _import_structure["models.swin"].extend( [ "SWIN_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -2303,6 +3289,7 @@ _import_structure["models.swinv2"].extend( [ "SWINV2_PRETRAINED_MODEL_ARCHIVE_LIST", + "Swinv2Backbone", "Swinv2ForImageClassification", "Swinv2ForMaskedImageModeling", "Swinv2Model", @@ -2325,11 +3312,22 @@ "T5_PRETRAINED_MODEL_ARCHIVE_LIST", "T5EncoderModel", "T5ForConditionalGeneration", + "T5ForQuestionAnswering", + "T5ForSequenceClassification", + "T5ForTokenClassification", "T5Model", "T5PreTrainedModel", "load_tf_weights_in_t5", ] ) + _import_structure["models.table_transformer"].extend( + [ + "TABLE_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST", + "TableTransformerForObjectDetection", + "TableTransformerModel", + "TableTransformerPreTrainedModel", + ] + ) _import_structure["models.tapas"].extend( [ "TAPAS_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -2357,27 +3355,14 @@ "TimesformerPreTrainedModel", ] ) - _import_structure["models.trajectory_transformer"].extend( - [ - "TRAJECTORY_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST", - "TrajectoryTransformerModel", - "TrajectoryTransformerPreTrainedModel", - ] - ) - _import_structure["models.transfo_xl"].extend( + _import_structure["models.timm_backbone"].extend(["TimmBackbone"]) + _import_structure["models.trocr"].extend( [ - "TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST", - "AdaptiveEmbedding", - "TransfoXLForSequenceClassification", - "TransfoXLLMHeadModel", - "TransfoXLModel", - "TransfoXLPreTrainedModel", - "load_tf_weights_in_transfo_xl", + "TROCR_PRETRAINED_MODEL_ARCHIVE_LIST", + "TrOCRForCausalLM", + "TrOCRPreTrainedModel", ] ) - _import_structure["models.trocr"].extend( - ["TROCR_PRETRAINED_MODEL_ARCHIVE_LIST", "TrOCRForCausalLM", "TrOCRPreTrainedModel"] - ) _import_structure["models.tvlt"].extend( [ 
"TVLT_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -2387,6 +3372,25 @@ "TvltPreTrainedModel", ] ) + _import_structure["models.tvp"].extend( + [ + "TVP_PRETRAINED_MODEL_ARCHIVE_LIST", + "TvpForVideoGrounding", + "TvpModel", + "TvpPreTrainedModel", + ] + ) + _import_structure["models.umt5"].extend( + [ + "UMT5EncoderModel", + "UMT5ForConditionalGeneration", + "UMT5ForQuestionAnswering", + "UMT5ForSequenceClassification", + "UMT5ForTokenClassification", + "UMT5Model", + "UMT5PreTrainedModel", + ] + ) _import_structure["models.unispeech"].extend( [ "UNISPEECH_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -2409,18 +3413,16 @@ "UniSpeechSatPreTrainedModel", ] ) - _import_structure["models.upernet"].extend( + _import_structure["models.univnet"].extend( [ - "UperNetForSemanticSegmentation", - "UperNetPreTrainedModel", + "UNIVNET_PRETRAINED_MODEL_ARCHIVE_LIST", + "UnivNetModel", ] ) - _import_structure["models.van"].extend( + _import_structure["models.upernet"].extend( [ - "VAN_PRETRAINED_MODEL_ARCHIVE_LIST", - "VanForImageClassification", - "VanModel", - "VanPreTrainedModel", + "UperNetForSemanticSegmentation", + "UperNetPreTrainedModel", ] ) _import_structure["models.videomae"].extend( @@ -2445,6 +3447,13 @@ "ViltPreTrainedModel", ] ) + _import_structure["models.vipllava"].extend( + [ + "VIPLLAVA_PRETRAINED_MODEL_ARCHIVE_LIST", + "VipLlavaForConditionalGeneration", + "VipLlavaPreTrainedModel", + ] + ) _import_structure["models.vision_encoder_decoder"].extend(["VisionEncoderDecoderModel"]) _import_structure["models.vision_text_dual_encoder"].extend(["VisionTextDualEncoderModel"]) _import_structure["models.visual_bert"].extend( @@ -2494,6 +3503,36 @@ "ViTMSNPreTrainedModel", ] ) + _import_structure["models.vitdet"].extend( + [ + "VITDET_PRETRAINED_MODEL_ARCHIVE_LIST", + "VitDetBackbone", + "VitDetModel", + "VitDetPreTrainedModel", + ] + ) + _import_structure["models.vitmatte"].extend( + [ + "VITMATTE_PRETRAINED_MODEL_ARCHIVE_LIST", + "VitMatteForImageMatting", + "VitMattePreTrainedModel", + ] + ) + _import_structure["models.vits"].extend( + [ + "VITS_PRETRAINED_MODEL_ARCHIVE_LIST", + "VitsModel", + "VitsPreTrainedModel", + ] + ) + _import_structure["models.vivit"].extend( + [ + "VIVIT_PRETRAINED_MODEL_ARCHIVE_LIST", + "VivitForVideoClassification", + "VivitModel", + "VivitPreTrainedModel", + ] + ) _import_structure["models.wav2vec2"].extend( [ "WAV_2_VEC_2_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -2507,6 +3546,17 @@ "Wav2Vec2PreTrainedModel", ] ) + _import_structure["models.wav2vec2_bert"].extend( + [ + "WAV2VEC2_BERT_PRETRAINED_MODEL_ARCHIVE_LIST", + "Wav2Vec2BertForAudioFrameClassification", + "Wav2Vec2BertForCTC", + "Wav2Vec2BertForSequenceClassification", + "Wav2Vec2BertForXVector", + "Wav2Vec2BertModel", + "Wav2Vec2BertPreTrainedModel", + ] + ) _import_structure["models.wav2vec2_conformer"].extend( [ "WAV2VEC2_CONFORMER_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -2533,6 +3583,8 @@ _import_structure["models.whisper"].extend( [ "WHISPER_PRETRAINED_MODEL_ARCHIVE_LIST", + "WhisperForAudioClassification", + "WhisperForCausalLM", "WhisperForConditionalGeneration", "WhisperModel", "WhisperPreTrainedModel", @@ -2665,8 +3717,13 @@ "get_polynomial_decay_schedule_with_warmup", "get_scheduler", ] - _import_structure["pytorch_utils"] = ["Conv1D", "apply_chunking_to_forward", "prune_layer"] + _import_structure["pytorch_utils"] = [ + "Conv1D", + "apply_chunking_to_forward", + "prune_layer", + ] _import_structure["sagemaker"] = [] + _import_structure["time_series_utils"] = [] _import_structure["trainer"] = ["Trainer"] 
_import_structure["trainer_pt_utils"] = ["torch_distributed_zero_first"] _import_structure["trainer_seq2seq"] = ["Seq2SeqTrainer"] @@ -2687,6 +3744,7 @@ [ "TFForcedBOSTokenLogitsProcessor", "TFForcedEOSTokenLogitsProcessor", + "TFForceTokensLogitsProcessor", "TFGenerationMixin", "TFLogitsProcessor", "TFLogitsProcessorList", @@ -2695,6 +3753,8 @@ "TFNoBadWordsLogitsProcessor", "TFNoRepeatNGramLogitsProcessor", "TFRepetitionPenaltyLogitsProcessor", + "TFSuppressTokensAtBeginLogitsProcessor", + "TFSuppressTokensLogitsProcessor", "TFTemperatureLogitsWarper", "TFTopKLogitsWarper", "TFTopPLogitsWarper", @@ -2727,11 +3787,13 @@ ) _import_structure["models.auto"].extend( [ + "TF_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING", "TF_MODEL_FOR_CAUSAL_LM_MAPPING", "TF_MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING", "TF_MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING", "TF_MODEL_FOR_MASKED_IMAGE_MODELING_MAPPING", "TF_MODEL_FOR_MASKED_LM_MAPPING", + "TF_MODEL_FOR_MASK_GENERATION_MAPPING", "TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING", "TF_MODEL_FOR_NEXT_SENTENCE_PREDICTION_MAPPING", "TF_MODEL_FOR_PRETRAINING_MAPPING", @@ -2741,15 +3803,20 @@ "TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING", "TF_MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING", "TF_MODEL_FOR_TABLE_QUESTION_ANSWERING_MAPPING", + "TF_MODEL_FOR_TEXT_ENCODING_MAPPING", "TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING", "TF_MODEL_FOR_VISION_2_SEQ_MAPPING", + "TF_MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING", "TF_MODEL_MAPPING", "TF_MODEL_WITH_LM_HEAD_MAPPING", "TFAutoModel", + "TFAutoModelForAudioClassification", "TFAutoModelForCausalLM", "TFAutoModelForDocumentQuestionAnswering", "TFAutoModelForImageClassification", + "TFAutoModelForMaskedImageModeling", "TFAutoModelForMaskedLM", + "TFAutoModelForMaskGeneration", "TFAutoModelForMultipleChoice", "TFAutoModelForNextSentencePrediction", "TFAutoModelForPreTraining", @@ -2759,13 +3826,20 @@ "TFAutoModelForSequenceClassification", "TFAutoModelForSpeechSeq2Seq", "TFAutoModelForTableQuestionAnswering", + "TFAutoModelForTextEncoding", "TFAutoModelForTokenClassification", "TFAutoModelForVision2Seq", + "TFAutoModelForZeroShotImageClassification", "TFAutoModelWithLMHead", ] ) _import_structure["models.bart"].extend( - ["TFBartForConditionalGeneration", "TFBartForSequenceClassification", "TFBartModel", "TFBartPretrainedModel"] + [ + "TFBartForConditionalGeneration", + "TFBartForSequenceClassification", + "TFBartModel", + "TFBartPretrainedModel", + ] ) _import_structure["models.bert"].extend( [ @@ -2785,10 +3859,30 @@ ] ) _import_structure["models.blenderbot"].extend( - ["TFBlenderbotForConditionalGeneration", "TFBlenderbotModel", "TFBlenderbotPreTrainedModel"] + [ + "TFBlenderbotForConditionalGeneration", + "TFBlenderbotModel", + "TFBlenderbotPreTrainedModel", + ] ) _import_structure["models.blenderbot_small"].extend( - ["TFBlenderbotSmallForConditionalGeneration", "TFBlenderbotSmallModel", "TFBlenderbotSmallPreTrainedModel"] + [ + "TFBlenderbotSmallForConditionalGeneration", + "TFBlenderbotSmallModel", + "TFBlenderbotSmallPreTrainedModel", + ] + ) + _import_structure["models.blip"].extend( + [ + "TF_BLIP_PRETRAINED_MODEL_ARCHIVE_LIST", + "TFBlipForConditionalGeneration", + "TFBlipForImageTextRetrieval", + "TFBlipForQuestionAnswering", + "TFBlipModel", + "TFBlipPreTrainedModel", + "TFBlipTextModel", + "TFBlipVisionModel", + ] ) _import_structure["models.camembert"].extend( [ @@ -2832,6 +3926,13 @@ "TFConvNextPreTrainedModel", ] ) + _import_structure["models.convnextv2"].extend( + [ + "TFConvNextV2ForImageClassification", + "TFConvNextV2Model", 
+ "TFConvNextV2PreTrainedModel", + ] + ) _import_structure["models.ctrl"].extend( [ "TF_CTRL_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -2872,6 +3973,7 @@ [ "TF_DEBERTA_V2_PRETRAINED_MODEL_ARCHIVE_LIST", "TFDebertaV2ForMaskedLM", + "TFDebertaV2ForMultipleChoice", "TFDebertaV2ForQuestionAnswering", "TFDebertaV2ForSequenceClassification", "TFDebertaV2ForTokenClassification", @@ -2889,6 +3991,17 @@ "TFDeiTPreTrainedModel", ] ) + _import_structure["models.deprecated.transfo_xl"].extend( + [ + "TF_TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST", + "TFAdaptiveEmbedding", + "TFTransfoXLForSequenceClassification", + "TFTransfoXLLMHeadModel", + "TFTransfoXLMainLayer", + "TFTransfoXLModel", + "TFTransfoXLPreTrainedModel", + ] + ) _import_structure["models.distilbert"].extend( [ "TF_DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -2915,6 +4028,15 @@ "TFDPRReader", ] ) + _import_structure["models.efficientformer"].extend( + [ + "TF_EFFICIENTFORMER_PRETRAINED_MODEL_ARCHIVE_LIST", + "TFEfficientFormerForImageClassification", + "TFEfficientFormerForImageClassificationWithTeacher", + "TFEfficientFormerModel", + "TFEfficientFormerPreTrainedModel", + ] + ) _import_structure["models.electra"].extend( [ "TF_ELECTRA_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -3109,7 +4231,11 @@ ] ) _import_structure["models.pegasus"].extend( - ["TFPegasusForConditionalGeneration", "TFPegasusModel", "TFPegasusPreTrainedModel"] + [ + "TFPegasusForConditionalGeneration", + "TFPegasusModel", + "TFPegasusPreTrainedModel", + ] ) _import_structure["models.rag"].extend( [ @@ -3191,6 +4317,13 @@ "TFRoFormerPreTrainedModel", ] ) + _import_structure["models.sam"].extend( + [ + "TF_SAM_PRETRAINED_MODEL_ARCHIVE_LIST", + "TFSamModel", + "TFSamPreTrainedModel", + ] + ) _import_structure["models.segformer"].extend( [ "TF_SEGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST", @@ -3237,18 +4370,8 @@ "TFTapasPreTrainedModel", ] ) - _import_structure["models.transfo_xl"].extend( - [ - "TF_TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST", - "TFAdaptiveEmbedding", - "TFTransfoXLForSequenceClassification", - "TFTransfoXLLMHeadModel", - "TFTransfoXLMainLayer", - "TFTransfoXLModel", - "TFTransfoXLPreTrainedModel", - ] - ) _import_structure["models.vision_encoder_decoder"].extend(["TFVisionEncoderDecoderModel"]) + _import_structure["models.vision_text_dual_encoder"].extend(["TFVisionTextDualEncoderModel"]) _import_structure["models.vit"].extend( [ "TFViTForImageClassification", @@ -3267,6 +4390,7 @@ [ "TF_WAV_2_VEC_2_PRETRAINED_MODEL_ARCHIVE_LIST", "TFWav2Vec2ForCTC", + "TFWav2Vec2ForSequenceClassification", "TFWav2Vec2Model", "TFWav2Vec2PreTrainedModel", ] @@ -3326,9 +4450,38 @@ "TFXLNetPreTrainedModel", ] ) - _import_structure["optimization_tf"] = ["AdamWeightDecay", "GradientAccumulator", "WarmUp", "create_optimizer"] + _import_structure["optimization_tf"] = [ + "AdamWeightDecay", + "GradientAccumulator", + "WarmUp", + "create_optimizer", + ] _import_structure["tf_utils"] = [] - _import_structure["trainer_tf"] = ["TFTrainer"] + + +try: + if not ( + is_librosa_available() + and is_essentia_available() + and is_scipy_available() + and is_torch_available() + and is_pretty_midi_available() + ): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + from .utils import ( + dummy_essentia_and_librosa_and_pretty_midi_and_scipy_and_torch_objects, + ) + + _import_structure["utils.dummy_essentia_and_librosa_and_pretty_midi_and_scipy_and_torch_objects"] = [ + name + for name in dir(dummy_essentia_and_librosa_and_pretty_midi_and_scipy_and_torch_objects) + if not 
name.startswith("_") + ] +else: + _import_structure["models.pop2piano"].append("Pop2PianoFeatureExtractor") + _import_structure["models.pop2piano"].append("Pop2PianoTokenizer") + _import_structure["models.pop2piano"].append("Pop2PianoProcessor") # FLAX-backed objects @@ -3346,14 +4499,18 @@ [ "FlaxForcedBOSTokenLogitsProcessor", "FlaxForcedEOSTokenLogitsProcessor", + "FlaxForceTokensLogitsProcessor", "FlaxGenerationMixin", "FlaxLogitsProcessor", "FlaxLogitsProcessorList", "FlaxLogitsWarper", "FlaxMinLengthLogitsProcessor", "FlaxTemperatureLogitsWarper", + "FlaxSuppressTokensAtBeginLogitsProcessor", + "FlaxSuppressTokensLogitsProcessor", "FlaxTopKLogitsWarper", "FlaxTopPLogitsWarper", + "FlaxWhisperTimeStampLogitsProcessor", ] ) _import_structure["generation_flax_utils"] = [] @@ -3373,6 +4530,7 @@ ) _import_structure["models.auto"].extend( [ + "FLAX_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING", "FLAX_MODEL_FOR_CAUSAL_LM_MAPPING", "FLAX_MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING", "FLAX_MODEL_FOR_MASKED_LM_MAPPING", @@ -3382,6 +4540,7 @@ "FLAX_MODEL_FOR_QUESTION_ANSWERING_MAPPING", "FLAX_MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING", "FLAX_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING", + "FLAX_MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING", "FLAX_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING", "FLAX_MODEL_FOR_VISION_2_SEQ_MAPPING", "FLAX_MODEL_MAPPING", @@ -3395,6 +4554,7 @@ "FlaxAutoModelForQuestionAnswering", "FlaxAutoModelForSeq2SeqLM", "FlaxAutoModelForSequenceClassification", + "FlaxAutoModelForSpeechSeq2Seq", "FlaxAutoModelForTokenClassification", "FlaxAutoModelForVision2Seq", ] @@ -3450,7 +4610,11 @@ ] ) _import_structure["models.blenderbot"].extend( - ["FlaxBlenderbotForConditionalGeneration", "FlaxBlenderbotModel", "FlaxBlenderbotPreTrainedModel"] + [ + "FlaxBlenderbotForConditionalGeneration", + "FlaxBlenderbotModel", + "FlaxBlenderbotPreTrainedModel", + ] ) _import_structure["models.blenderbot_small"].extend( [ @@ -3459,12 +4623,20 @@ "FlaxBlenderbotSmallPreTrainedModel", ] ) + _import_structure["models.bloom"].extend( + [ + "FlaxBloomForCausalLM", + "FlaxBloomModel", + "FlaxBloomPreTrainedModel", + ] + ) _import_structure["models.clip"].extend( [ "FlaxCLIPModel", "FlaxCLIPPreTrainedModel", "FlaxCLIPTextModel", "FlaxCLIPTextPreTrainedModel", + "FlaxCLIPTextModelWithProjection", "FlaxCLIPVisionModel", "FlaxCLIPVisionPreTrainedModel", ] @@ -3499,8 +4671,13 @@ ["FlaxGPTNeoForCausalLM", "FlaxGPTNeoModel", "FlaxGPTNeoPreTrainedModel"] ) _import_structure["models.gptj"].extend(["FlaxGPTJForCausalLM", "FlaxGPTJModel", "FlaxGPTJPreTrainedModel"]) + _import_structure["models.llama"].extend(["FlaxLlamaForCausalLM", "FlaxLlamaModel", "FlaxLlamaPreTrainedModel"]) _import_structure["models.longt5"].extend( - ["FlaxLongT5ForConditionalGeneration", "FlaxLongT5Model", "FlaxLongT5PreTrainedModel"] + [ + "FlaxLongT5ForConditionalGeneration", + "FlaxLongT5Model", + "FlaxLongT5PreTrainedModel", + ] ) _import_structure["models.marian"].extend( [ @@ -3518,6 +4695,13 @@ "FlaxMBartPreTrainedModel", ] ) + _import_structure["models.mistral"].extend( + [ + "FlaxMistralForCausalLM", + "FlaxMistralModel", + "FlaxMistralPreTrainedModel", + ] + ) _import_structure["models.mt5"].extend(["FlaxMT5EncoderModel", "FlaxMT5ForConditionalGeneration", "FlaxMT5Model"]) _import_structure["models.opt"].extend( [ @@ -3533,6 +4717,20 @@ "FlaxPegasusPreTrainedModel", ] ) + _import_structure["models.regnet"].extend( + [ + "FlaxRegNetForImageClassification", + "FlaxRegNetModel", + "FlaxRegNetPreTrainedModel", + ] + ) + _import_structure["models.resnet"].extend( + 
[ + "FlaxResNetForImageClassification", + "FlaxResNetModel", + "FlaxResNetPreTrainedModel", + ] + ) _import_structure["models.roberta"].extend( [ "FlaxRobertaForCausalLM", @@ -3570,13 +4768,31 @@ ) _import_structure["models.speech_encoder_decoder"].append("FlaxSpeechEncoderDecoderModel") _import_structure["models.t5"].extend( - ["FlaxT5EncoderModel", "FlaxT5ForConditionalGeneration", "FlaxT5Model", "FlaxT5PreTrainedModel"] + [ + "FlaxT5EncoderModel", + "FlaxT5ForConditionalGeneration", + "FlaxT5Model", + "FlaxT5PreTrainedModel", + ] ) _import_structure["models.vision_encoder_decoder"].append("FlaxVisionEncoderDecoderModel") _import_structure["models.vision_text_dual_encoder"].extend(["FlaxVisionTextDualEncoderModel"]) _import_structure["models.vit"].extend(["FlaxViTForImageClassification", "FlaxViTModel", "FlaxViTPreTrainedModel"]) _import_structure["models.wav2vec2"].extend( - ["FlaxWav2Vec2ForCTC", "FlaxWav2Vec2ForPreTraining", "FlaxWav2Vec2Model", "FlaxWav2Vec2PreTrainedModel"] + [ + "FlaxWav2Vec2ForCTC", + "FlaxWav2Vec2ForPreTraining", + "FlaxWav2Vec2Model", + "FlaxWav2Vec2PreTrainedModel", + ] + ) + _import_structure["models.whisper"].extend( + [ + "FlaxWhisperForConditionalGeneration", + "FlaxWhisperModel", + "FlaxWhisperPreTrainedModel", + "FlaxWhisperForAudioClassification", + ] ) _import_structure["models.xglm"].extend( [ @@ -3644,13 +4860,14 @@ from .feature_extraction_utils import BatchFeature, FeatureExtractionMixin # Generation - from .generation import GenerationConfig + from .generation import GenerationConfig, TextIteratorStreamer, TextStreamer from .hf_argparser import HfArgumentParser # Integrations from .integrations import ( is_clearml_available, is_comet_available, + is_dvclive_available, is_neptune_available, is_optuna_available, is_ray_available, @@ -3674,6 +4891,13 @@ load_tf2_weights_in_pytorch_model, ) from .models.albert import ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, AlbertConfig + from .models.align import ( + ALIGN_PRETRAINED_CONFIG_ARCHIVE_MAP, + AlignConfig, + AlignProcessor, + AlignTextConfig, + AlignVisionConfig, + ) from .models.altclip import ( ALTCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP, AltCLIPConfig, @@ -3684,6 +4908,7 @@ from .models.audio_spectrogram_transformer import ( AUDIO_SPECTROGRAM_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, ASTConfig, + ASTFeatureExtractor, ) from .models.auto import ( ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, @@ -3699,6 +4924,17 @@ AutoProcessor, AutoTokenizer, ) + from .models.autoformer import ( + AUTOFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, + AutoformerConfig, + ) + from .models.bark import ( + BarkCoarseConfig, + BarkConfig, + BarkFineConfig, + BarkProcessor, + BarkSemanticConfig, + ) from .models.bart import BartConfig, BartTokenizer from .models.beit import BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP, BeitConfig from .models.bert import ( @@ -3709,13 +4945,28 @@ WordpieceTokenizer, ) from .models.bert_generation import BertGenerationConfig - from .models.bert_japanese import BertJapaneseTokenizer, CharacterTokenizer, MecabTokenizer + from .models.bert_japanese import ( + BertJapaneseTokenizer, + CharacterTokenizer, + MecabTokenizer, + ) from .models.bertweet import BertweetTokenizer from .models.big_bird import BIG_BIRD_PRETRAINED_CONFIG_ARCHIVE_MAP, BigBirdConfig - from .models.bigbird_pegasus import BIGBIRD_PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP, BigBirdPegasusConfig - from .models.biogpt import BIOGPT_PRETRAINED_CONFIG_ARCHIVE_MAP, BioGptConfig, BioGptTokenizer + from .models.bigbird_pegasus import ( + 
BIGBIRD_PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP, + BigBirdPegasusConfig, + ) + from .models.biogpt import ( + BIOGPT_PRETRAINED_CONFIG_ARCHIVE_MAP, + BioGptConfig, + BioGptTokenizer, + ) from .models.bit import BIT_PRETRAINED_CONFIG_ARCHIVE_MAP, BitConfig - from .models.blenderbot import BLENDERBOT_PRETRAINED_CONFIG_ARCHIVE_MAP, BlenderbotConfig, BlenderbotTokenizer + from .models.blenderbot import ( + BLENDERBOT_PRETRAINED_CONFIG_ARCHIVE_MAP, + BlenderbotConfig, + BlenderbotTokenizer, + ) from .models.blenderbot_small import ( BLENDERBOT_SMALL_PRETRAINED_CONFIG_ARCHIVE_MAP, BlenderbotSmallConfig, @@ -3743,9 +4994,21 @@ BridgeTowerTextConfig, BridgeTowerVisionConfig, ) + from .models.bros import ( + BROS_PRETRAINED_CONFIG_ARCHIVE_MAP, + BrosConfig, + BrosProcessor, + ) from .models.byt5 import ByT5Tokenizer - from .models.camembert import CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, CamembertConfig - from .models.canine import CANINE_PRETRAINED_CONFIG_ARCHIVE_MAP, CanineConfig, CanineTokenizer + from .models.camembert import ( + CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, + CamembertConfig, + ) + from .models.canine import ( + CANINE_PRETRAINED_CONFIG_ARCHIVE_MAP, + CanineConfig, + CanineTokenizer, + ) from .models.chinese_clip import ( CHINESE_CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP, ChineseCLIPConfig, @@ -3775,11 +5038,44 @@ CLIPSegTextConfig, CLIPSegVisionConfig, ) - from .models.codegen import CODEGEN_PRETRAINED_CONFIG_ARCHIVE_MAP, CodeGenConfig, CodeGenTokenizer - from .models.conditional_detr import CONDITIONAL_DETR_PRETRAINED_CONFIG_ARCHIVE_MAP, ConditionalDetrConfig - from .models.convbert import CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, ConvBertConfig, ConvBertTokenizer + from .models.clvp import ( + CLVP_PRETRAINED_CONFIG_ARCHIVE_MAP, + ClvpConfig, + ClvpDecoderConfig, + ClvpEncoderConfig, + ClvpFeatureExtractor, + ClvpProcessor, + ClvpTokenizer, + ) + from .models.codegen import ( + CODEGEN_PRETRAINED_CONFIG_ARCHIVE_MAP, + CodeGenConfig, + CodeGenTokenizer, + ) + from .models.conditional_detr import ( + CONDITIONAL_DETR_PRETRAINED_CONFIG_ARCHIVE_MAP, + ConditionalDetrConfig, + ) + from .models.convbert import ( + CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, + ConvBertConfig, + ConvBertTokenizer, + ) from .models.convnext import CONVNEXT_PRETRAINED_CONFIG_ARCHIVE_MAP, ConvNextConfig - from .models.ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig, CTRLTokenizer + from .models.convnextv2 import ( + CONVNEXTV2_PRETRAINED_CONFIG_ARCHIVE_MAP, + ConvNextV2Config, + ) + from .models.cpmant import ( + CPMANT_PRETRAINED_CONFIG_ARCHIVE_MAP, + CpmAntConfig, + CpmAntTokenizer, + ) + from .models.ctrl import ( + CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, + CTRLConfig, + CTRLTokenizer, + ) from .models.cvt import CVT_PRETRAINED_CONFIG_ARCHIVE_MAP, CvtConfig from .models.data2vec import ( DATA2VEC_TEXT_PRETRAINED_CONFIG_ARCHIVE_MAP, @@ -3788,19 +5084,67 @@ Data2VecTextConfig, Data2VecVisionConfig, ) - from .models.deberta import DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaConfig, DebertaTokenizer - from .models.deberta_v2 import DEBERTA_V2_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaV2Config + from .models.deberta import ( + DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, + DebertaConfig, + DebertaTokenizer, + ) + from .models.deberta_v2 import ( + DEBERTA_V2_PRETRAINED_CONFIG_ARCHIVE_MAP, + DebertaV2Config, + ) from .models.decision_transformer import ( DECISION_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, DecisionTransformerConfig, ) - from .models.deformable_detr import DEFORMABLE_DETR_PRETRAINED_CONFIG_ARCHIVE_MAP, 
DeformableDetrConfig + from .models.deformable_detr import ( + DEFORMABLE_DETR_PRETRAINED_CONFIG_ARCHIVE_MAP, + DeformableDetrConfig, + ) from .models.deit import DEIT_PRETRAINED_CONFIG_ARCHIVE_MAP, DeiTConfig + from .models.deprecated.mctct import ( + MCTCT_PRETRAINED_CONFIG_ARCHIVE_MAP, + MCTCTConfig, + MCTCTFeatureExtractor, + MCTCTProcessor, + ) + from .models.deprecated.mmbt import MMBTConfig + from .models.deprecated.open_llama import ( + OPEN_LLAMA_PRETRAINED_CONFIG_ARCHIVE_MAP, + OpenLlamaConfig, + ) + from .models.deprecated.retribert import ( + RETRIBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, + RetriBertConfig, + RetriBertTokenizer, + ) + from .models.deprecated.tapex import TapexTokenizer + from .models.deprecated.trajectory_transformer import ( + TRAJECTORY_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, + TrajectoryTransformerConfig, + ) + from .models.deprecated.transfo_xl import ( + TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, + TransfoXLConfig, + TransfoXLCorpus, + TransfoXLTokenizer, + ) + from .models.deprecated.van import VAN_PRETRAINED_CONFIG_ARCHIVE_MAP, VanConfig + from .models.depth_anything import DEPTH_ANYTHING_PRETRAINED_CONFIG_ARCHIVE_MAP, DepthAnythingConfig from .models.deta import DETA_PRETRAINED_CONFIG_ARCHIVE_MAP, DetaConfig from .models.detr import DETR_PRETRAINED_CONFIG_ARCHIVE_MAP, DetrConfig from .models.dinat import DINAT_PRETRAINED_CONFIG_ARCHIVE_MAP, DinatConfig - from .models.distilbert import DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, DistilBertConfig, DistilBertTokenizer - from .models.donut import DONUT_SWIN_PRETRAINED_CONFIG_ARCHIVE_MAP, DonutProcessor, DonutSwinConfig + from .models.dinov2 import DINOV2_PRETRAINED_CONFIG_ARCHIVE_MAP, Dinov2Config + from .models.distilbert import ( + DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, + DistilBertConfig, + DistilBertTokenizer, + ) + from .models.donut import ( + DONUT_SWIN_PRETRAINED_CONFIG_ARCHIVE_MAP, + DonutProcessor, + DonutSwinConfig, + ) from .models.dpr import ( DPR_PRETRAINED_CONFIG_ARCHIVE_MAP, DPRConfig, @@ -3810,12 +5154,38 @@ DPRReaderTokenizer, ) from .models.dpt import DPT_PRETRAINED_CONFIG_ARCHIVE_MAP, DPTConfig - from .models.efficientformer import EFFICIENTFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, EfficientFormerConfig - from .models.electra import ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, ElectraConfig, ElectraTokenizer + from .models.efficientformer import ( + EFFICIENTFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, + EfficientFormerConfig, + ) + from .models.efficientnet import ( + EFFICIENTNET_PRETRAINED_CONFIG_ARCHIVE_MAP, + EfficientNetConfig, + ) + from .models.electra import ( + ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP, + ElectraConfig, + ElectraTokenizer, + ) + from .models.encodec import ( + ENCODEC_PRETRAINED_CONFIG_ARCHIVE_MAP, + EncodecConfig, + EncodecFeatureExtractor, + ) from .models.encoder_decoder import EncoderDecoderConfig from .models.ernie import ERNIE_PRETRAINED_CONFIG_ARCHIVE_MAP, ErnieConfig from .models.ernie_m import ERNIE_M_PRETRAINED_CONFIG_ARCHIVE_MAP, ErnieMConfig from .models.esm import ESM_PRETRAINED_CONFIG_ARCHIVE_MAP, EsmConfig, EsmTokenizer + from .models.falcon import FALCON_PRETRAINED_CONFIG_ARCHIVE_MAP, FalconConfig + from .models.fastspeech2_conformer import ( + FASTSPEECH2_CONFORMER_HIFIGAN_PRETRAINED_CONFIG_ARCHIVE_MAP, + FASTSPEECH2_CONFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, + FASTSPEECH2_CONFORMER_WITH_HIFIGAN_PRETRAINED_CONFIG_ARCHIVE_MAP, + FastSpeech2ConformerConfig, + FastSpeech2ConformerHifiGanConfig, + FastSpeech2ConformerTokenizer, + FastSpeech2ConformerWithHifiGanConfig, + ) 
from .models.flaubert import FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, FlaubertConfig, FlaubertTokenizer from .models.flava import ( FLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP, @@ -3826,16 +5196,50 @@ FlavaTextConfig, ) from .models.fnet import FNET_PRETRAINED_CONFIG_ARCHIVE_MAP, FNetConfig - from .models.fsmt import FSMT_PRETRAINED_CONFIG_ARCHIVE_MAP, FSMTConfig, FSMTTokenizer - from .models.funnel import FUNNEL_PRETRAINED_CONFIG_ARCHIVE_MAP, FunnelConfig, FunnelTokenizer - from .models.git import GIT_PRETRAINED_CONFIG_ARCHIVE_MAP, GitConfig, GitProcessor, GitVisionConfig + from .models.focalnet import FOCALNET_PRETRAINED_CONFIG_ARCHIVE_MAP, FocalNetConfig + from .models.fsmt import ( + FSMT_PRETRAINED_CONFIG_ARCHIVE_MAP, + FSMTConfig, + FSMTTokenizer, + ) + from .models.funnel import ( + FUNNEL_PRETRAINED_CONFIG_ARCHIVE_MAP, + FunnelConfig, + FunnelTokenizer, + ) + from .models.fuyu import FUYU_PRETRAINED_CONFIG_ARCHIVE_MAP, FuyuConfig + from .models.git import ( + GIT_PRETRAINED_CONFIG_ARCHIVE_MAP, + GitConfig, + GitProcessor, + GitVisionConfig, + ) from .models.glpn import GLPN_PRETRAINED_CONFIG_ARCHIVE_MAP, GLPNConfig - from .models.gpt2 import GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, GPT2Config, GPT2Tokenizer + from .models.gpt2 import ( + GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP, + GPT2Config, + GPT2Tokenizer, + ) + from .models.gpt_bigcode import ( + GPT_BIGCODE_PRETRAINED_CONFIG_ARCHIVE_MAP, + GPTBigCodeConfig, + ) from .models.gpt_neo import GPT_NEO_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTNeoConfig from .models.gpt_neox import GPT_NEOX_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTNeoXConfig - from .models.gpt_neox_japanese import GPT_NEOX_JAPANESE_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTNeoXJapaneseConfig + from .models.gpt_neox_japanese import ( + GPT_NEOX_JAPANESE_PRETRAINED_CONFIG_ARCHIVE_MAP, + GPTNeoXJapaneseConfig, + ) from .models.gptj import GPTJ_PRETRAINED_CONFIG_ARCHIVE_MAP, GPTJConfig - from .models.graphormer import GRAPHORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, GraphormerConfig + from .models.gptsan_japanese import ( + GPTSAN_JAPANESE_PRETRAINED_CONFIG_ARCHIVE_MAP, + GPTSanJapaneseConfig, + GPTSanJapaneseTokenizer, + ) + from .models.graphormer import ( + GRAPHORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, + GraphormerConfig, + ) from .models.groupvit import ( GROUPVIT_PRETRAINED_CONFIG_ARCHIVE_MAP, GroupViTConfig, @@ -3845,7 +5249,19 @@ from .models.herbert import HerbertTokenizer from .models.hubert import HUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, HubertConfig from .models.ibert import IBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, IBertConfig + from .models.idefics import ( + IDEFICS_PRETRAINED_CONFIG_ARCHIVE_MAP, + IdeficsConfig, + ) from .models.imagegpt import IMAGEGPT_PRETRAINED_CONFIG_ARCHIVE_MAP, ImageGPTConfig + from .models.informer import INFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, InformerConfig + from .models.instructblip import ( + INSTRUCTBLIP_PRETRAINED_CONFIG_ARCHIVE_MAP, + InstructBlipConfig, + InstructBlipProcessor, + InstructBlipQFormerConfig, + InstructBlipVisionConfig, + ) from .models.jukebox import ( JUKEBOX_PRETRAINED_CONFIG_ARCHIVE_MAP, JukeboxConfig, @@ -3853,7 +5269,16 @@ JukeboxTokenizer, JukeboxVQVAEConfig, ) - from .models.layoutlm import LAYOUTLM_PRETRAINED_CONFIG_ARCHIVE_MAP, LayoutLMConfig, LayoutLMTokenizer + from .models.kosmos2 import ( + KOSMOS2_PRETRAINED_CONFIG_ARCHIVE_MAP, + Kosmos2Config, + Kosmos2Processor, + ) + from .models.layoutlm import ( + LAYOUTLM_PRETRAINED_CONFIG_ARCHIVE_MAP, + LayoutLMConfig, + LayoutLMTokenizer, + ) from .models.layoutlmv2 import ( 
LAYOUTLMV2_PRETRAINED_CONFIG_ARCHIVE_MAP, LayoutLMv2Config, @@ -3874,10 +5299,27 @@ from .models.led import LED_PRETRAINED_CONFIG_ARCHIVE_MAP, LEDConfig, LEDTokenizer from .models.levit import LEVIT_PRETRAINED_CONFIG_ARCHIVE_MAP, LevitConfig from .models.lilt import LILT_PRETRAINED_CONFIG_ARCHIVE_MAP, LiltConfig - from .models.longformer import LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, LongformerConfig, LongformerTokenizer + from .models.llama import LLAMA_PRETRAINED_CONFIG_ARCHIVE_MAP, LlamaConfig + from .models.llava import ( + LLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP, + LlavaConfig, + ) + from .models.longformer import ( + LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, + LongformerConfig, + LongformerTokenizer, + ) from .models.longt5 import LONGT5_PRETRAINED_CONFIG_ARCHIVE_MAP, LongT5Config - from .models.luke import LUKE_PRETRAINED_CONFIG_ARCHIVE_MAP, LukeConfig, LukeTokenizer - from .models.lxmert import LXMERT_PRETRAINED_CONFIG_ARCHIVE_MAP, LxmertConfig, LxmertTokenizer + from .models.luke import ( + LUKE_PRETRAINED_CONFIG_ARCHIVE_MAP, + LukeConfig, + LukeTokenizer, + ) + from .models.lxmert import ( + LXMERT_PRETRAINED_CONFIG_ARCHIVE_MAP, + LxmertConfig, + LxmertTokenizer, + ) from .models.m2m_100 import M2M_100_PRETRAINED_CONFIG_ARCHIVE_MAP, M2M100Config from .models.marian import MarianConfig from .models.markuplm import ( @@ -3887,25 +5329,90 @@ MarkupLMProcessor, MarkupLMTokenizer, ) - from .models.mask2former import MASK2FORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, Mask2FormerConfig - from .models.maskformer import MASKFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, MaskFormerConfig, MaskFormerSwinConfig + from .models.mask2former import ( + MASK2FORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, + Mask2FormerConfig, + ) + from .models.maskformer import ( + MASKFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, + MaskFormerConfig, + MaskFormerSwinConfig, + ) from .models.mbart import MBartConfig - from .models.mctct import MCTCT_PRETRAINED_CONFIG_ARCHIVE_MAP, MCTCTConfig, MCTCTProcessor - from .models.megatron_bert import MEGATRON_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, MegatronBertConfig - from .models.mmbt import MMBTConfig - from .models.mobilebert import MOBILEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, MobileBertConfig, MobileBertTokenizer - from .models.mobilenet_v1 import MOBILENET_V1_PRETRAINED_CONFIG_ARCHIVE_MAP, MobileNetV1Config - from .models.mobilenet_v2 import MOBILENET_V2_PRETRAINED_CONFIG_ARCHIVE_MAP, MobileNetV2Config - from .models.mobilevit import MOBILEVIT_PRETRAINED_CONFIG_ARCHIVE_MAP, MobileViTConfig - from .models.mpnet import MPNET_PRETRAINED_CONFIG_ARCHIVE_MAP, MPNetConfig, MPNetTokenizer + from .models.mega import MEGA_PRETRAINED_CONFIG_ARCHIVE_MAP, MegaConfig + from .models.megatron_bert import ( + MEGATRON_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, + MegatronBertConfig, + ) + from .models.mgp_str import ( + MGP_STR_PRETRAINED_CONFIG_ARCHIVE_MAP, + MgpstrConfig, + MgpstrProcessor, + MgpstrTokenizer, + ) + from .models.mistral import MISTRAL_PRETRAINED_CONFIG_ARCHIVE_MAP, MistralConfig + from .models.mixtral import MIXTRAL_PRETRAINED_CONFIG_ARCHIVE_MAP, MixtralConfig + from .models.mobilebert import ( + MOBILEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, + MobileBertConfig, + MobileBertTokenizer, + ) + from .models.mobilenet_v1 import ( + MOBILENET_V1_PRETRAINED_CONFIG_ARCHIVE_MAP, + MobileNetV1Config, + ) + from .models.mobilenet_v2 import ( + MOBILENET_V2_PRETRAINED_CONFIG_ARCHIVE_MAP, + MobileNetV2Config, + ) + from .models.mobilevit import ( + MOBILEVIT_PRETRAINED_CONFIG_ARCHIVE_MAP, + MobileViTConfig, + ) + from 
.models.mobilevitv2 import ( + MOBILEVITV2_PRETRAINED_CONFIG_ARCHIVE_MAP, + MobileViTV2Config, + ) + from .models.mpnet import ( + MPNET_PRETRAINED_CONFIG_ARCHIVE_MAP, + MPNetConfig, + MPNetTokenizer, + ) + from .models.mpt import MPT_PRETRAINED_CONFIG_ARCHIVE_MAP, MptConfig + from .models.mra import MRA_PRETRAINED_CONFIG_ARCHIVE_MAP, MraConfig from .models.mt5 import MT5Config + from .models.musicgen import ( + MUSICGEN_PRETRAINED_CONFIG_ARCHIVE_MAP, + MusicgenConfig, + MusicgenDecoderConfig, + ) from .models.mvp import MvpConfig, MvpTokenizer from .models.nat import NAT_PRETRAINED_CONFIG_ARCHIVE_MAP, NatConfig from .models.nezha import NEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP, NezhaConfig - from .models.nystromformer import NYSTROMFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, NystromformerConfig - from .models.oneformer import ONEFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, OneFormerConfig, OneFormerProcessor - from .models.openai import OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenAIGPTConfig, OpenAIGPTTokenizer + from .models.nllb_moe import NLLB_MOE_PRETRAINED_CONFIG_ARCHIVE_MAP, NllbMoeConfig + from .models.nougat import NougatProcessor + from .models.nystromformer import ( + NYSTROMFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, + NystromformerConfig, + ) + from .models.oneformer import ( + ONEFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, + OneFormerConfig, + OneFormerProcessor, + ) + from .models.openai import ( + OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP, + OpenAIGPTConfig, + OpenAIGPTTokenizer, + ) from .models.opt import OPTConfig + from .models.owlv2 import ( + OWLV2_PRETRAINED_CONFIG_ARCHIVE_MAP, + Owlv2Config, + Owlv2Processor, + Owlv2TextConfig, + Owlv2VisionConfig, + ) from .models.owlvit import ( OWLVIT_PRETRAINED_CONFIG_ARCHIVE_MAP, OwlViTConfig, @@ -3913,35 +5420,121 @@ OwlViTTextConfig, OwlViTVisionConfig, ) - from .models.pegasus import PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP, PegasusConfig, PegasusTokenizer - from .models.pegasus_x import PEGASUS_X_PRETRAINED_CONFIG_ARCHIVE_MAP, PegasusXConfig - from .models.perceiver import PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP, PerceiverConfig, PerceiverTokenizer + from .models.patchtsmixer import ( + PATCHTSMIXER_PRETRAINED_CONFIG_ARCHIVE_MAP, + PatchTSMixerConfig, + ) + from .models.patchtst import PATCHTST_PRETRAINED_CONFIG_ARCHIVE_MAP, PatchTSTConfig + from .models.pegasus import ( + PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP, + PegasusConfig, + PegasusTokenizer, + ) + from .models.pegasus_x import ( + PEGASUS_X_PRETRAINED_CONFIG_ARCHIVE_MAP, + PegasusXConfig, + ) + from .models.perceiver import ( + PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP, + PerceiverConfig, + PerceiverTokenizer, + ) + from .models.persimmon import ( + PERSIMMON_PRETRAINED_CONFIG_ARCHIVE_MAP, + PersimmonConfig, + ) + from .models.phi import PHI_PRETRAINED_CONFIG_ARCHIVE_MAP, PhiConfig from .models.phobert import PhobertTokenizer + from .models.pix2struct import ( + PIX2STRUCT_PRETRAINED_CONFIG_ARCHIVE_MAP, + Pix2StructConfig, + Pix2StructProcessor, + Pix2StructTextConfig, + Pix2StructVisionConfig, + ) from .models.plbart import PLBART_PRETRAINED_CONFIG_ARCHIVE_MAP, PLBartConfig - from .models.poolformer import POOLFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, PoolFormerConfig - from .models.prophetnet import PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, ProphetNetConfig, ProphetNetTokenizer + from .models.poolformer import ( + POOLFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, + PoolFormerConfig, + ) + from .models.pop2piano import ( + POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP, + Pop2PianoConfig, + ) + from .models.prophetnet 
import ( + PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, + ProphetNetConfig, + ProphetNetTokenizer, + ) + from .models.pvt import PVT_PRETRAINED_CONFIG_ARCHIVE_MAP, PvtConfig from .models.qdqbert import QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, QDQBertConfig + from .models.qwen2 import QWEN2_PRETRAINED_CONFIG_ARCHIVE_MAP, Qwen2Config, Qwen2Tokenizer from .models.rag import RagConfig, RagRetriever, RagTokenizer - from .models.realm import REALM_PRETRAINED_CONFIG_ARCHIVE_MAP, RealmConfig, RealmTokenizer + from .models.realm import ( + REALM_PRETRAINED_CONFIG_ARCHIVE_MAP, + RealmConfig, + RealmTokenizer, + ) from .models.reformer import REFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, ReformerConfig from .models.regnet import REGNET_PRETRAINED_CONFIG_ARCHIVE_MAP, RegNetConfig from .models.rembert import REMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, RemBertConfig from .models.resnet import RESNET_PRETRAINED_CONFIG_ARCHIVE_MAP, ResNetConfig - from .models.retribert import RETRIBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, RetriBertConfig, RetriBertTokenizer - from .models.roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig, RobertaTokenizer + from .models.roberta import ( + ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, + RobertaConfig, + RobertaTokenizer, + ) from .models.roberta_prelayernorm import ( ROBERTA_PRELAYERNORM_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaPreLayerNormConfig, ) - from .models.roc_bert import ROC_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, RoCBertConfig, RoCBertTokenizer - from .models.roformer import ROFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, RoFormerConfig, RoFormerTokenizer - from .models.segformer import SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, SegformerConfig + from .models.roc_bert import ( + ROC_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, + RoCBertConfig, + RoCBertTokenizer, + ) + from .models.roformer import ( + ROFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, + RoFormerConfig, + RoFormerTokenizer, + ) + from .models.rwkv import RWKV_PRETRAINED_CONFIG_ARCHIVE_MAP, RwkvConfig + from .models.sam import ( + SAM_PRETRAINED_CONFIG_ARCHIVE_MAP, + SamConfig, + SamMaskDecoderConfig, + SamProcessor, + SamPromptEncoderConfig, + SamVisionConfig, + ) + from .models.seamless_m4t import ( + SEAMLESS_M4T_PRETRAINED_CONFIG_ARCHIVE_MAP, + SeamlessM4TConfig, + SeamlessM4TFeatureExtractor, + SeamlessM4TProcessor, + ) + from .models.seamless_m4t_v2 import ( + SEAMLESS_M4T_V2_PRETRAINED_CONFIG_ARCHIVE_MAP, + SeamlessM4Tv2Config, + ) + from .models.segformer import ( + SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, + SegformerConfig, + ) from .models.sew import SEW_PRETRAINED_CONFIG_ARCHIVE_MAP, SEWConfig from .models.sew_d import SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP, SEWDConfig + from .models.siglip import ( + SIGLIP_PRETRAINED_CONFIG_ARCHIVE_MAP, + SiglipConfig, + SiglipProcessor, + SiglipTextConfig, + SiglipVisionConfig, + ) from .models.speech_encoder_decoder import SpeechEncoderDecoderConfig from .models.speech_to_text import ( SPEECH_TO_TEXT_PRETRAINED_CONFIG_ARCHIVE_MAP, Speech2TextConfig, + Speech2TextFeatureExtractor, Speech2TextProcessor, ) from .models.speech_to_text_2 import ( @@ -3954,40 +5547,82 @@ SPEECHT5_PRETRAINED_CONFIG_ARCHIVE_MAP, SPEECHT5_PRETRAINED_HIFIGAN_CONFIG_ARCHIVE_MAP, SpeechT5Config, + SpeechT5FeatureExtractor, SpeechT5HifiGanConfig, SpeechT5Processor, ) - from .models.splinter import SPLINTER_PRETRAINED_CONFIG_ARCHIVE_MAP, SplinterConfig, SplinterTokenizer - from .models.squeezebert import SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, SqueezeBertConfig, SqueezeBertTokenizer + from .models.splinter import ( + 
SPLINTER_PRETRAINED_CONFIG_ARCHIVE_MAP, + SplinterConfig, + SplinterTokenizer, + ) + from .models.squeezebert import ( + SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, + SqueezeBertConfig, + SqueezeBertTokenizer, + ) + from .models.stablelm import STABLELM_PRETRAINED_CONFIG_ARCHIVE_MAP, StableLmConfig + from .models.swiftformer import ( + SWIFTFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, + SwiftFormerConfig, + ) from .models.swin import SWIN_PRETRAINED_CONFIG_ARCHIVE_MAP, SwinConfig from .models.swin2sr import SWIN2SR_PRETRAINED_CONFIG_ARCHIVE_MAP, Swin2SRConfig from .models.swinv2 import SWINV2_PRETRAINED_CONFIG_ARCHIVE_MAP, Swinv2Config - from .models.switch_transformers import SWITCH_TRANSFORMERS_PRETRAINED_CONFIG_ARCHIVE_MAP, SwitchTransformersConfig + from .models.switch_transformers import ( + SWITCH_TRANSFORMERS_PRETRAINED_CONFIG_ARCHIVE_MAP, + SwitchTransformersConfig, + ) from .models.t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config - from .models.table_transformer import TABLE_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, TableTransformerConfig - from .models.tapas import TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP, TapasConfig, TapasTokenizer - from .models.tapex import TapexTokenizer + from .models.table_transformer import ( + TABLE_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, + TableTransformerConfig, + ) + from .models.tapas import ( + TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP, + TapasConfig, + TapasTokenizer, + ) from .models.time_series_transformer import ( TIME_SERIES_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, TimeSeriesTransformerConfig, ) - from .models.timesformer import TIMESFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, TimesformerConfig - from .models.trajectory_transformer import ( - TRAJECTORY_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, - TrajectoryTransformerConfig, - ) - from .models.transfo_xl import ( - TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, - TransfoXLConfig, - TransfoXLCorpus, - TransfoXLTokenizer, + from .models.timesformer import ( + TIMESFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, + TimesformerConfig, + ) + from .models.timm_backbone import TimmBackboneConfig + from .models.trocr import ( + TROCR_PRETRAINED_CONFIG_ARCHIVE_MAP, + TrOCRConfig, + TrOCRProcessor, + ) + from .models.tvlt import ( + TVLT_PRETRAINED_CONFIG_ARCHIVE_MAP, + TvltConfig, + TvltFeatureExtractor, + TvltProcessor, + ) + from .models.tvp import ( + TVP_PRETRAINED_CONFIG_ARCHIVE_MAP, + TvpConfig, + TvpProcessor, + ) + from .models.umt5 import UMT5Config + from .models.unispeech import ( + UNISPEECH_PRETRAINED_CONFIG_ARCHIVE_MAP, + UniSpeechConfig, + ) + from .models.unispeech_sat import ( + UNISPEECH_SAT_PRETRAINED_CONFIG_ARCHIVE_MAP, + UniSpeechSatConfig, + ) + from .models.univnet import ( + UNIVNET_PRETRAINED_CONFIG_ARCHIVE_MAP, + UnivNetConfig, + UnivNetFeatureExtractor, ) - from .models.trocr import TROCR_PRETRAINED_CONFIG_ARCHIVE_MAP, TrOCRConfig, TrOCRProcessor - from .models.tvlt import TVLT_PRETRAINED_CONFIG_ARCHIVE_MAP, TvltConfig, TvltProcessor - from .models.unispeech import UNISPEECH_PRETRAINED_CONFIG_ARCHIVE_MAP, UniSpeechConfig - from .models.unispeech_sat import UNISPEECH_SAT_PRETRAINED_CONFIG_ARCHIVE_MAP, UniSpeechSatConfig from .models.upernet import UperNetConfig - from .models.van import VAN_PRETRAINED_CONFIG_ARCHIVE_MAP, VanConfig from .models.videomae import VIDEOMAE_PRETRAINED_CONFIG_ARCHIVE_MAP, VideoMAEConfig from .models.vilt import ( VILT_PRETRAINED_CONFIG_ARCHIVE_MAP, @@ -3996,13 +5631,34 @@ ViltImageProcessor, ViltProcessor, ) + from .models.vipllava import ( + 
VIPLLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP, + VipLlavaConfig, + ) from .models.vision_encoder_decoder import VisionEncoderDecoderConfig - from .models.vision_text_dual_encoder import VisionTextDualEncoderConfig, VisionTextDualEncoderProcessor - from .models.visual_bert import VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, VisualBertConfig + from .models.vision_text_dual_encoder import ( + VisionTextDualEncoderConfig, + VisionTextDualEncoderProcessor, + ) + from .models.visual_bert import ( + VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, + VisualBertConfig, + ) from .models.vit import VIT_PRETRAINED_CONFIG_ARCHIVE_MAP, ViTConfig - from .models.vit_hybrid import VIT_HYBRID_PRETRAINED_CONFIG_ARCHIVE_MAP, ViTHybridConfig + from .models.vit_hybrid import ( + VIT_HYBRID_PRETRAINED_CONFIG_ARCHIVE_MAP, + ViTHybridConfig, + ) from .models.vit_mae import VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP, ViTMAEConfig from .models.vit_msn import VIT_MSN_PRETRAINED_CONFIG_ARCHIVE_MAP, ViTMSNConfig + from .models.vitdet import VITDET_PRETRAINED_CONFIG_ARCHIVE_MAP, VitDetConfig + from .models.vitmatte import VITMATTE_PRETRAINED_CONFIG_ARCHIVE_MAP, VitMatteConfig + from .models.vits import ( + VITS_PRETRAINED_CONFIG_ARCHIVE_MAP, + VitsConfig, + VitsTokenizer, + ) + from .models.vivit import VIVIT_PRETRAINED_CONFIG_ARCHIVE_MAP, VivitConfig from .models.wav2vec2 import ( WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP, Wav2Vec2Config, @@ -4011,7 +5667,15 @@ Wav2Vec2Processor, Wav2Vec2Tokenizer, ) - from .models.wav2vec2_conformer import WAV2VEC2_CONFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, Wav2Vec2ConformerConfig + from .models.wav2vec2_bert import ( + WAV2VEC2_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, + Wav2Vec2BertConfig, + Wav2Vec2BertProcessor, + ) + from .models.wav2vec2_conformer import ( + WAV2VEC2_CONFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, + Wav2Vec2ConformerConfig, + ) from .models.wav2vec2_phoneme import Wav2Vec2PhonemeCTCTokenizer from .models.wav2vec2_with_lm import Wav2Vec2ProcessorWithLM from .models.wavlm import WAVLM_PRETRAINED_CONFIG_ARCHIVE_MAP, WavLMConfig @@ -4031,9 +5695,18 @@ ) from .models.xglm import XGLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XGLMConfig from .models.xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig, XLMTokenizer - from .models.xlm_prophetnet import XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMProphetNetConfig - from .models.xlm_roberta import XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMRobertaConfig - from .models.xlm_roberta_xl import XLM_ROBERTA_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMRobertaXLConfig + from .models.xlm_prophetnet import ( + XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, + XLMProphetNetConfig, + ) + from .models.xlm_roberta import ( + XLM_ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, + XLMRobertaConfig, + ) + from .models.xlm_roberta_xl import ( + XLM_ROBERTA_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, + XLMRobertaXLConfig, + ) from .models.xlnet import XLNET_PRETRAINED_CONFIG_ARCHIVE_MAP, XLNetConfig from .models.xmod import XMOD_PRETRAINED_CONFIG_ARCHIVE_MAP, XmodConfig from .models.yolos import YOLOS_PRETRAINED_CONFIG_ARCHIVE_MAP, YolosConfig @@ -4051,9 +5724,12 @@ FeatureExtractionPipeline, FillMaskPipeline, ImageClassificationPipeline, + ImageFeatureExtractionPipeline, ImageSegmentationPipeline, + ImageToImagePipeline, ImageToTextPipeline, JsonPipelineDataFormat, + MaskGenerationPipeline, NerPipeline, ObjectDetectionPipeline, PipedPipelineDataFormat, @@ -4065,10 +5741,12 @@ Text2TextGenerationPipeline, TextClassificationPipeline, TextGenerationPipeline, + TextToAudioPipeline, 
TokenClassificationPipeline, TranslationPipeline, VideoClassificationPipeline, VisualQuestionAnsweringPipeline, + ZeroShotAudioClassificationPipeline, ZeroShotClassificationPipeline, ZeroShotImageClassificationPipeline, ZeroShotObjectDetectionPipeline, @@ -4087,6 +5765,20 @@ TokenSpan, ) + # Tools + from .tools import ( + Agent, + AzureOpenAiAgent, + HfAgent, + LocalAgent, + OpenAiAgent, + PipelineTool, + RemoteTool, + Tool, + launch_gradio_demo, + load_tool, + ) + # Trainer from .trainer_callback import ( DefaultFlowCallback, @@ -4097,7 +5789,13 @@ TrainerControl, TrainerState, ) - from .trainer_utils import EvalPrediction, IntervalStrategy, SchedulerType, enable_full_determinism, set_seed + from .trainer_utils import ( + EvalPrediction, + IntervalStrategy, + SchedulerType, + enable_full_determinism, + set_seed, + ) from .training_args import TrainingArguments from .training_args_seq2seq import Seq2SeqTrainingArguments from .training_args_tf import TFTrainingArguments @@ -4138,14 +5836,16 @@ is_tokenizers_available, is_torch_available, is_torch_neuroncore_available, + is_torch_npu_available, is_torch_tpu_available, + is_torch_xpu_available, is_torchvision_available, is_vision_available, logging, ) # bitsandbytes config - from .utils.quantization_config import BitsAndBytesConfig + from .utils.quantization_config import AqlmConfig, AwqConfig, BitsAndBytesConfig, GPTQConfig try: if not is_sentencepiece_available(): @@ -4159,12 +5859,14 @@ from .models.bert_generation import BertGenerationTokenizer from .models.big_bird import BigBirdTokenizer from .models.camembert import CamembertTokenizer + from .models.code_llama import CodeLlamaTokenizer from .models.cpm import CpmTokenizer from .models.deberta_v2 import DebertaV2Tokenizer from .models.ernie_m import ErnieMTokenizer from .models.fnet import FNetTokenizer from .models.gpt_sw3 import GPTSw3Tokenizer from .models.layoutxlm import LayoutXLMTokenizer + from .models.llama import LlamaTokenizer from .models.m2m_100 import M2M100Tokenizer from .models.marian import MarianTokenizer from .models.mbart import MBart50Tokenizer, MBartTokenizer @@ -4175,6 +5877,8 @@ from .models.plbart import PLBartTokenizer from .models.reformer import ReformerTokenizer from .models.rembert import RemBertTokenizer + from .models.seamless_m4t import SeamlessM4TTokenizer + from .models.siglip import SiglipTokenizer from .models.speech_to_text import Speech2TextTokenizer from .models.speecht5 import SpeechT5Tokenizer from .models.t5 import T5Tokenizer @@ -4200,13 +5904,19 @@ from .models.bloom import BloomTokenizerFast from .models.camembert import CamembertTokenizerFast from .models.clip import CLIPTokenizerFast + from .models.code_llama import CodeLlamaTokenizerFast from .models.codegen import CodeGenTokenizerFast from .models.convbert import ConvBertTokenizerFast from .models.cpm import CpmTokenizerFast from .models.deberta import DebertaTokenizerFast from .models.deberta_v2 import DebertaV2TokenizerFast + from .models.deprecated.retribert import RetriBertTokenizerFast from .models.distilbert import DistilBertTokenizerFast - from .models.dpr import DPRContextEncoderTokenizerFast, DPRQuestionEncoderTokenizerFast, DPRReaderTokenizerFast + from .models.dpr import ( + DPRContextEncoderTokenizerFast, + DPRQuestionEncoderTokenizerFast, + DPRReaderTokenizerFast, + ) from .models.electra import ElectraTokenizerFast from .models.fnet import FNetTokenizerFast from .models.funnel import FunnelTokenizerFast @@ -4219,6 +5929,7 @@ from .models.layoutlmv3 import 
LayoutLMv3TokenizerFast from .models.layoutxlm import LayoutXLMTokenizerFast from .models.led import LEDTokenizerFast + from .models.llama import LlamaTokenizerFast from .models.longformer import LongformerTokenizerFast from .models.lxmert import LxmertTokenizerFast from .models.markuplm import MarkupLMTokenizerFast @@ -4229,17 +5940,20 @@ from .models.mt5 import MT5TokenizerFast from .models.mvp import MvpTokenizerFast from .models.nllb import NllbTokenizerFast + from .models.nougat import NougatTokenizerFast from .models.openai import OpenAIGPTTokenizerFast from .models.pegasus import PegasusTokenizerFast + from .models.qwen2 import Qwen2TokenizerFast from .models.realm import RealmTokenizerFast from .models.reformer import ReformerTokenizerFast from .models.rembert import RemBertTokenizerFast - from .models.retribert import RetriBertTokenizerFast from .models.roberta import RobertaTokenizerFast from .models.roformer import RoFormerTokenizerFast + from .models.seamless_m4t import SeamlessM4TTokenizerFast from .models.splinter import SplinterTokenizerFast from .models.squeezebert import SqueezeBertTokenizerFast from .models.t5 import T5TokenizerFast + from .models.whisper import WhisperTokenizerFast from .models.xglm import XGLMTokenizerFast from .models.xlm_roberta import XLMRobertaTokenizerFast from .models.xlnet import XLNetTokenizerFast @@ -4251,19 +5965,10 @@ except OptionalDependencyNotAvailable: from .utils.dummies_sentencepiece_and_tokenizers_objects import * else: - from .convert_slow_tokenizer import SLOW_TO_FAST_CONVERTERS, convert_slow_tokenizer - - try: - if not is_speech_available(): - raise OptionalDependencyNotAvailable() - except OptionalDependencyNotAvailable: - from .utils.dummy_speech_objects import * - else: - from .models.audio_spectrogram_transformer import ASTFeatureExtractor - from .models.mctct import MCTCTFeatureExtractor - from .models.speech_to_text import Speech2TextFeatureExtractor - from .models.speecht5 import SpeechT5FeatureExtractor - from .models.tvlt import TvltFeatureExtractor + from .convert_slow_tokenizer import ( + SLOW_TO_FAST_CONVERTERS, + convert_slow_tokenizer, + ) try: if not is_tensorflow_text_available(): @@ -4293,75 +5998,85 @@ from .models.bit import BitImageProcessor from .models.blip import BlipImageProcessor from .models.bridgetower import BridgeTowerImageProcessor - from .models.chinese_clip import ChineseCLIPFeatureExtractor, ChineseCLIPImageProcessor + from .models.chinese_clip import ( + ChineseCLIPFeatureExtractor, + ChineseCLIPImageProcessor, + ) from .models.clip import CLIPFeatureExtractor, CLIPImageProcessor - from .models.conditional_detr import ConditionalDetrFeatureExtractor, ConditionalDetrImageProcessor + from .models.conditional_detr import ( + ConditionalDetrFeatureExtractor, + ConditionalDetrImageProcessor, + ) from .models.convnext import ConvNextFeatureExtractor, ConvNextImageProcessor - from .models.deformable_detr import DeformableDetrFeatureExtractor, DeformableDetrImageProcessor + from .models.deformable_detr import ( + DeformableDetrFeatureExtractor, + DeformableDetrImageProcessor, + ) from .models.deit import DeiTFeatureExtractor, DeiTImageProcessor from .models.deta import DetaImageProcessor from .models.detr import DetrFeatureExtractor, DetrImageProcessor from .models.donut import DonutFeatureExtractor, DonutImageProcessor from .models.dpt import DPTFeatureExtractor, DPTImageProcessor from .models.efficientformer import EfficientFormerImageProcessor - from .models.flava import FlavaFeatureExtractor, 
FlavaImageProcessor, FlavaProcessor + from .models.efficientnet import EfficientNetImageProcessor + from .models.flava import ( + FlavaFeatureExtractor, + FlavaImageProcessor, + FlavaProcessor, + ) + from .models.fuyu import FuyuImageProcessor, FuyuProcessor from .models.glpn import GLPNFeatureExtractor, GLPNImageProcessor + from .models.idefics import IdeficsImageProcessor from .models.imagegpt import ImageGPTFeatureExtractor, ImageGPTImageProcessor - from .models.layoutlmv2 import LayoutLMv2FeatureExtractor, LayoutLMv2ImageProcessor - from .models.layoutlmv3 import LayoutLMv3FeatureExtractor, LayoutLMv3ImageProcessor + from .models.layoutlmv2 import ( + LayoutLMv2FeatureExtractor, + LayoutLMv2ImageProcessor, + ) + from .models.layoutlmv3 import ( + LayoutLMv3FeatureExtractor, + LayoutLMv3ImageProcessor, + ) from .models.levit import LevitFeatureExtractor, LevitImageProcessor from .models.mask2former import Mask2FormerImageProcessor - from .models.maskformer import MaskFormerFeatureExtractor, MaskFormerImageProcessor - from .models.mobilenet_v1 import MobileNetV1FeatureExtractor, MobileNetV1ImageProcessor - from .models.mobilenet_v2 import MobileNetV2FeatureExtractor, MobileNetV2ImageProcessor + from .models.maskformer import ( + MaskFormerFeatureExtractor, + MaskFormerImageProcessor, + ) + from .models.mobilenet_v1 import ( + MobileNetV1FeatureExtractor, + MobileNetV1ImageProcessor, + ) + from .models.mobilenet_v2 import ( + MobileNetV2FeatureExtractor, + MobileNetV2ImageProcessor, + ) from .models.mobilevit import MobileViTFeatureExtractor, MobileViTImageProcessor + from .models.nougat import NougatImageProcessor from .models.oneformer import OneFormerImageProcessor + from .models.owlv2 import Owlv2ImageProcessor from .models.owlvit import OwlViTFeatureExtractor, OwlViTImageProcessor from .models.perceiver import PerceiverFeatureExtractor, PerceiverImageProcessor - from .models.poolformer import PoolFormerFeatureExtractor, PoolFormerImageProcessor + from .models.pix2struct import Pix2StructImageProcessor + from .models.poolformer import ( + PoolFormerFeatureExtractor, + PoolFormerImageProcessor, + ) + from .models.pvt import PvtImageProcessor + from .models.sam import SamImageProcessor from .models.segformer import SegformerFeatureExtractor, SegformerImageProcessor + from .models.siglip import SiglipImageProcessor from .models.swin2sr import Swin2SRImageProcessor from .models.tvlt import TvltImageProcessor + from .models.tvp import TvpImageProcessor from .models.videomae import VideoMAEFeatureExtractor, VideoMAEImageProcessor from .models.vilt import ViltFeatureExtractor, ViltImageProcessor, ViltProcessor from .models.vit import ViTFeatureExtractor, ViTImageProcessor from .models.vit_hybrid import ViTHybridImageProcessor + from .models.vitmatte import VitMatteImageProcessor + from .models.vivit import VivitImageProcessor from .models.yolos import YolosFeatureExtractor, YolosImageProcessor # Modeling - try: - if not (is_timm_available() and is_vision_available()): - raise OptionalDependencyNotAvailable() - except OptionalDependencyNotAvailable: - from .utils.dummy_timm_and_vision_objects import * - else: - from .models.conditional_detr import ( - CONDITIONAL_DETR_PRETRAINED_MODEL_ARCHIVE_LIST, - ConditionalDetrForObjectDetection, - ConditionalDetrForSegmentation, - ConditionalDetrModel, - ConditionalDetrPreTrainedModel, - ) - from .models.deformable_detr import ( - DEFORMABLE_DETR_PRETRAINED_MODEL_ARCHIVE_LIST, - DeformableDetrForObjectDetection, - DeformableDetrModel, - 
DeformableDetrPreTrainedModel, - ) - from .models.detr import ( - DETR_PRETRAINED_MODEL_ARCHIVE_LIST, - DetrForObjectDetection, - DetrForSegmentation, - DetrModel, - DetrPreTrainedModel, - ) - from .models.table_transformer import ( - TABLE_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST, - TableTransformerForObjectDetection, - TableTransformerModel, - TableTransformerPreTrainedModel, - ) - try: if not is_torch_available(): raise OptionalDependencyNotAvailable() @@ -4371,6 +6086,7 @@ # Benchmarks from .benchmark.benchmark import PyTorchBenchmark from .benchmark.benchmark_args import PyTorchBenchmarkArguments + from .cache_utils import Cache, DynamicCache, SinkCache, StaticCache from .data.datasets import ( GlueDataset, GlueDataTrainingArguments, @@ -4383,17 +6099,26 @@ TextDatasetForNextSentencePrediction, ) from .generation import ( + AlternatingCodebooksLogitsProcessor, BeamScorer, BeamSearchScorer, + ClassifierFreeGuidanceLogitsProcessor, ConstrainedBeamSearchScorer, Constraint, ConstraintListState, DisjunctiveConstraint, + EncoderNoRepeatNGramLogitsProcessor, + EncoderRepetitionPenaltyLogitsProcessor, + EpsilonLogitsWarper, + EtaLogitsWarper, + ExponentialDecayLengthPenalty, ForcedBOSTokenLogitsProcessor, ForcedEOSTokenLogitsProcessor, + ForceTokensLogitsProcessor, GenerationMixin, HammingDiversityLogitsProcessor, InfNanRemoveLogitsProcessor, + LogitNormalization, LogitsProcessor, LogitsProcessorList, LogitsWarper, @@ -4406,17 +6131,20 @@ PhrasalConstraint, PrefixConstrainedLogitsProcessor, RepetitionPenaltyLogitsProcessor, + SequenceBiasLogitsProcessor, StoppingCriteria, StoppingCriteriaList, + SuppressTokensAtBeginLogitsProcessor, + SuppressTokensLogitsProcessor, TemperatureLogitsWarper, TopKLogitsWarper, TopPLogitsWarper, TypicalLogitsWarper, + UnbatchedClassifierFreeGuidanceLogitsProcessor, + WhisperTimeStampLogitsProcessor, top_k_top_p_filtering, ) from .modeling_utils import PreTrainedModel - - # PyTorch model imports from .models.albert import ( ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST, AlbertForMaskedLM, @@ -4429,6 +6157,13 @@ AlbertPreTrainedModel, load_tf_weights_in_albert, ) + from .models.align import ( + ALIGN_PRETRAINED_MODEL_ARCHIVE_LIST, + AlignModel, + AlignPreTrainedModel, + AlignTextModel, + AlignVisionModel, + ) from .models.altclip import ( ALTCLIP_PRETRAINED_MODEL_ARCHIVE_LIST, AltCLIPModel, @@ -4444,6 +6179,7 @@ ) from .models.auto import ( MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING, + MODEL_FOR_AUDIO_FRAME_CLASSIFICATION_MAPPING, MODEL_FOR_AUDIO_XVECTOR_MAPPING, MODEL_FOR_BACKBONE_MAPPING, MODEL_FOR_CAUSAL_IMAGE_MODELING_MAPPING, @@ -4453,7 +6189,9 @@ MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING, MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING, MODEL_FOR_IMAGE_SEGMENTATION_MAPPING, + MODEL_FOR_IMAGE_TO_IMAGE_MAPPING, MODEL_FOR_INSTANCE_SEGMENTATION_MAPPING, + MODEL_FOR_MASK_GENERATION_MAPPING, MODEL_FOR_MASKED_IMAGE_MODELING_MAPPING, MODEL_FOR_MASKED_LM_MAPPING, MODEL_FOR_MULTIPLE_CHOICE_MAPPING, @@ -4466,11 +6204,17 @@ MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING, MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING, MODEL_FOR_TABLE_QUESTION_ANSWERING_MAPPING, + MODEL_FOR_TEXT_ENCODING_MAPPING, + MODEL_FOR_TEXT_TO_SPECTROGRAM_MAPPING, + MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING, + MODEL_FOR_TIME_SERIES_CLASSIFICATION_MAPPING, + MODEL_FOR_TIME_SERIES_REGRESSION_MAPPING, MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING, MODEL_FOR_UNIVERSAL_SEGMENTATION_MAPPING, MODEL_FOR_VIDEO_CLASSIFICATION_MAPPING, MODEL_FOR_VISION_2_SEQ_MAPPING, MODEL_FOR_VISUAL_QUESTION_ANSWERING_MAPPING, + 
MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING, MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING, MODEL_MAPPING, MODEL_WITH_LM_HEAD_MAPPING, @@ -4485,9 +6229,11 @@ AutoModelForDocumentQuestionAnswering, AutoModelForImageClassification, AutoModelForImageSegmentation, + AutoModelForImageToImage, AutoModelForInstanceSegmentation, AutoModelForMaskedImageModeling, AutoModelForMaskedLM, + AutoModelForMaskGeneration, AutoModelForMultipleChoice, AutoModelForNextSentencePrediction, AutoModelForObjectDetection, @@ -4498,14 +6244,33 @@ AutoModelForSequenceClassification, AutoModelForSpeechSeq2Seq, AutoModelForTableQuestionAnswering, + AutoModelForTextEncoding, + AutoModelForTextToSpectrogram, + AutoModelForTextToWaveform, AutoModelForTokenClassification, AutoModelForUniversalSegmentation, AutoModelForVideoClassification, AutoModelForVision2Seq, AutoModelForVisualQuestionAnswering, + AutoModelForZeroShotImageClassification, AutoModelForZeroShotObjectDetection, AutoModelWithLMHead, ) + from .models.autoformer import ( + AUTOFORMER_PRETRAINED_MODEL_ARCHIVE_LIST, + AutoformerForPrediction, + AutoformerModel, + AutoformerPreTrainedModel, + ) + from .models.bark import ( + BARK_PRETRAINED_MODEL_ARCHIVE_LIST, + BarkCausalModel, + BarkCoarseModel, + BarkFineModel, + BarkModel, + BarkPreTrainedModel, + BarkSemanticModel, + ) from .models.bart import ( BART_PRETRAINED_MODEL_ARCHIVE_LIST, BartForCausalLM, @@ -4513,11 +6278,13 @@ BartForQuestionAnswering, BartForSequenceClassification, BartModel, + BartPreTrainedModel, BartPretrainedModel, PretrainedBartModel, ) from .models.beit import ( BEIT_PRETRAINED_MODEL_ARCHIVE_LIST, + BeitBackbone, BeitForImageClassification, BeitForMaskedImageModeling, BeitForSemanticSegmentation, @@ -4571,6 +6338,8 @@ from .models.biogpt import ( BIOGPT_PRETRAINED_MODEL_ARCHIVE_LIST, BioGptForCausalLM, + BioGptForSequenceClassification, + BioGptForTokenClassification, BioGptModel, BioGptPreTrainedModel, ) @@ -4608,6 +6377,7 @@ from .models.blip_2 import ( BLIP_2_PRETRAINED_MODEL_ARCHIVE_LIST, Blip2ForConditionalGeneration, + Blip2Model, Blip2PreTrainedModel, Blip2QFormerModel, Blip2VisionModel, @@ -4623,11 +6393,21 @@ ) from .models.bridgetower import ( BRIDGETOWER_PRETRAINED_MODEL_ARCHIVE_LIST, + BridgeTowerForContrastiveLearning, BridgeTowerForImageAndTextRetrieval, BridgeTowerForMaskedLM, BridgeTowerModel, BridgeTowerPreTrainedModel, ) + from .models.bros import ( + BROS_PRETRAINED_MODEL_ARCHIVE_LIST, + BrosForTokenClassification, + BrosModel, + BrosPreTrainedModel, + BrosProcessor, + BrosSpadeEEForTokenClassification, + BrosSpadeELForTokenClassification, + ) from .models.camembert import ( CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST, CamembertForCausalLM, @@ -4669,6 +6449,7 @@ ) from .models.clip import ( CLIP_PRETRAINED_MODEL_ARCHIVE_LIST, + CLIPForImageClassification, CLIPModel, CLIPPreTrainedModel, CLIPTextModel, @@ -4684,12 +6465,28 @@ CLIPSegTextModel, CLIPSegVisionModel, ) + from .models.clvp import ( + CLVP_PRETRAINED_MODEL_ARCHIVE_LIST, + ClvpDecoder, + ClvpEncoder, + ClvpForCausalLM, + ClvpModel, + ClvpModelForConditionalGeneration, + ClvpPreTrainedModel, + ) from .models.codegen import ( CODEGEN_PRETRAINED_MODEL_ARCHIVE_LIST, CodeGenForCausalLM, CodeGenModel, CodeGenPreTrainedModel, ) + from .models.conditional_detr import ( + CONDITIONAL_DETR_PRETRAINED_MODEL_ARCHIVE_LIST, + ConditionalDetrForObjectDetection, + ConditionalDetrForSegmentation, + ConditionalDetrModel, + ConditionalDetrPreTrainedModel, + ) from .models.convbert import ( CONVBERT_PRETRAINED_MODEL_ARCHIVE_LIST, 
ConvBertForMaskedLM, @@ -4709,6 +6506,19 @@ ConvNextModel, ConvNextPreTrainedModel, ) + from .models.convnextv2 import ( + CONVNEXTV2_PRETRAINED_MODEL_ARCHIVE_LIST, + ConvNextV2Backbone, + ConvNextV2ForImageClassification, + ConvNextV2Model, + ConvNextV2PreTrainedModel, + ) + from .models.cpmant import ( + CPMANT_PRETRAINED_MODEL_ARCHIVE_LIST, + CpmAntForCausalLM, + CpmAntModel, + CpmAntPreTrainedModel, + ) from .models.ctrl import ( CTRL_PRETRAINED_MODEL_ARCHIVE_LIST, CTRLForSequenceClassification, @@ -4771,6 +6581,12 @@ DecisionTransformerModel, DecisionTransformerPreTrainedModel, ) + from .models.deformable_detr import ( + DEFORMABLE_DETR_PRETRAINED_MODEL_ARCHIVE_LIST, + DeformableDetrForObjectDetection, + DeformableDetrModel, + DeformableDetrPreTrainedModel, + ) from .models.deit import ( DEIT_PRETRAINED_MODEL_ARCHIVE_LIST, DeiTForImageClassification, @@ -4779,12 +6595,66 @@ DeiTModel, DeiTPreTrainedModel, ) + from .models.deprecated.mctct import ( + MCTCT_PRETRAINED_MODEL_ARCHIVE_LIST, + MCTCTForCTC, + MCTCTModel, + MCTCTPreTrainedModel, + ) + from .models.deprecated.mmbt import ( + MMBTForClassification, + MMBTModel, + ModalEmbeddings, + ) + from .models.deprecated.open_llama import ( + OpenLlamaForCausalLM, + OpenLlamaForSequenceClassification, + OpenLlamaModel, + OpenLlamaPreTrainedModel, + ) + from .models.deprecated.retribert import ( + RETRIBERT_PRETRAINED_MODEL_ARCHIVE_LIST, + RetriBertModel, + RetriBertPreTrainedModel, + ) + from .models.deprecated.trajectory_transformer import ( + TRAJECTORY_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST, + TrajectoryTransformerModel, + TrajectoryTransformerPreTrainedModel, + ) + from .models.deprecated.transfo_xl import ( + TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST, + AdaptiveEmbedding, + TransfoXLForSequenceClassification, + TransfoXLLMHeadModel, + TransfoXLModel, + TransfoXLPreTrainedModel, + load_tf_weights_in_transfo_xl, + ) + from .models.deprecated.van import ( + VAN_PRETRAINED_MODEL_ARCHIVE_LIST, + VanForImageClassification, + VanModel, + VanPreTrainedModel, + ) + from .models.depth_anything import ( + DEPTH_ANYTHING_PRETRAINED_MODEL_ARCHIVE_LIST, + DepthAnythingForDepthEstimation, + DepthAnythingPreTrainedModel, + ) from .models.deta import ( DETA_PRETRAINED_MODEL_ARCHIVE_LIST, DetaForObjectDetection, DetaModel, DetaPreTrainedModel, ) + from .models.detr import ( + DETR_PRETRAINED_MODEL_ARCHIVE_LIST, + DetrForObjectDetection, + DetrForSegmentation, + DetrModel, + DetrPreTrainedModel, + ) from .models.dinat import ( DINAT_PRETRAINED_MODEL_ARCHIVE_LIST, DinatBackbone, @@ -4792,6 +6662,13 @@ DinatModel, DinatPreTrainedModel, ) + from .models.dinov2 import ( + DINOV2_PRETRAINED_MODEL_ARCHIVE_LIST, + Dinov2Backbone, + Dinov2ForImageClassification, + Dinov2Model, + Dinov2PreTrainedModel, + ) from .models.distilbert import ( DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST, DistilBertForMaskedLM, @@ -4802,7 +6679,11 @@ DistilBertModel, DistilBertPreTrainedModel, ) - from .models.donut import DONUT_SWIN_PRETRAINED_MODEL_ARCHIVE_LIST, DonutSwinModel, DonutSwinPreTrainedModel + from .models.donut import ( + DONUT_SWIN_PRETRAINED_MODEL_ARCHIVE_LIST, + DonutSwinModel, + DonutSwinPreTrainedModel, + ) from .models.dpr import ( DPR_CONTEXT_ENCODER_PRETRAINED_MODEL_ARCHIVE_LIST, DPR_QUESTION_ENCODER_PRETRAINED_MODEL_ARCHIVE_LIST, @@ -4829,6 +6710,12 @@ EfficientFormerModel, EfficientFormerPreTrainedModel, ) + from .models.efficientnet import ( + EFFICIENTNET_PRETRAINED_MODEL_ARCHIVE_LIST, + EfficientNetForImageClassification, + EfficientNetModel, + 
EfficientNetPreTrainedModel, + ) from .models.electra import ( ELECTRA_PRETRAINED_MODEL_ARCHIVE_LIST, ElectraForCausalLM, @@ -4842,6 +6729,11 @@ ElectraPreTrainedModel, load_tf_weights_in_electra, ) + from .models.encodec import ( + ENCODEC_PRETRAINED_MODEL_ARCHIVE_LIST, + EncodecModel, + EncodecPreTrainedModel, + ) from .models.encoder_decoder import EncoderDecoderModel from .models.ernie import ( ERNIE_PRETRAINED_MODEL_ARCHIVE_LIST, @@ -4876,6 +6768,22 @@ EsmModel, EsmPreTrainedModel, ) + from .models.falcon import ( + FALCON_PRETRAINED_MODEL_ARCHIVE_LIST, + FalconForCausalLM, + FalconForQuestionAnswering, + FalconForSequenceClassification, + FalconForTokenClassification, + FalconModel, + FalconPreTrainedModel, + ) + from .models.fastspeech2_conformer import ( + FASTSPEECH2_CONFORMER_PRETRAINED_MODEL_ARCHIVE_LIST, + FastSpeech2ConformerHifiGan, + FastSpeech2ConformerModel, + FastSpeech2ConformerPreTrainedModel, + FastSpeech2ConformerWithHifiGan, + ) from .models.flaubert import ( FLAUBERT_PRETRAINED_MODEL_ARCHIVE_LIST, FlaubertForMultipleChoice, @@ -4910,7 +6818,19 @@ FNetModel, FNetPreTrainedModel, ) - from .models.fsmt import FSMTForConditionalGeneration, FSMTModel, PretrainedFSMTModel + from .models.focalnet import ( + FOCALNET_PRETRAINED_MODEL_ARCHIVE_LIST, + FocalNetBackbone, + FocalNetForImageClassification, + FocalNetForMaskedImageModeling, + FocalNetModel, + FocalNetPreTrainedModel, + ) + from .models.fsmt import ( + FSMTForConditionalGeneration, + FSMTModel, + PretrainedFSMTModel, + ) from .models.funnel import ( FUNNEL_PRETRAINED_MODEL_ARCHIVE_LIST, FunnelBaseModel, @@ -4924,6 +6844,10 @@ FunnelPreTrainedModel, load_tf_weights_in_funnel, ) + from .models.fuyu import ( + FuyuForCausalLM, + FuyuPreTrainedModel, + ) from .models.git import ( GIT_PRETRAINED_MODEL_ARCHIVE_LIST, GitForCausalLM, @@ -4940,6 +6864,7 @@ from .models.gpt2 import ( GPT2_PRETRAINED_MODEL_ARCHIVE_LIST, GPT2DoubleHeadsModel, + GPT2ForQuestionAnswering, GPT2ForSequenceClassification, GPT2ForTokenClassification, GPT2LMHeadModel, @@ -4947,10 +6872,20 @@ GPT2PreTrainedModel, load_tf_weights_in_gpt2, ) + from .models.gpt_bigcode import ( + GPT_BIGCODE_PRETRAINED_MODEL_ARCHIVE_LIST, + GPTBigCodeForCausalLM, + GPTBigCodeForSequenceClassification, + GPTBigCodeForTokenClassification, + GPTBigCodeModel, + GPTBigCodePreTrainedModel, + ) from .models.gpt_neo import ( GPT_NEO_PRETRAINED_MODEL_ARCHIVE_LIST, GPTNeoForCausalLM, + GPTNeoForQuestionAnswering, GPTNeoForSequenceClassification, + GPTNeoForTokenClassification, GPTNeoModel, GPTNeoPreTrainedModel, load_tf_weights_in_gpt_neo, @@ -4958,6 +6893,9 @@ from .models.gpt_neox import ( GPT_NEOX_PRETRAINED_MODEL_ARCHIVE_LIST, GPTNeoXForCausalLM, + GPTNeoXForQuestionAnswering, + GPTNeoXForSequenceClassification, + GPTNeoXForTokenClassification, GPTNeoXLayer, GPTNeoXModel, GPTNeoXPreTrainedModel, @@ -4977,6 +6915,12 @@ GPTJModel, GPTJPreTrainedModel, ) + from .models.gptsan_japanese import ( + GPTSAN_JAPANESE_PRETRAINED_MODEL_ARCHIVE_LIST, + GPTSanJapaneseForConditionalGeneration, + GPTSanJapaneseModel, + GPTSanJapanesePreTrainedModel, + ) from .models.graphormer import ( GRAPHORMER_PRETRAINED_MODEL_ARCHIVE_LIST, GraphormerForGraphClassification, @@ -5007,6 +6951,13 @@ IBertModel, IBertPreTrainedModel, ) + from .models.idefics import ( + IDEFICS_PRETRAINED_MODEL_ARCHIVE_LIST, + IdeficsForVisionText2Text, + IdeficsModel, + IdeficsPreTrainedModel, + IdeficsProcessor, + ) from .models.imagegpt import ( IMAGEGPT_PRETRAINED_MODEL_ARCHIVE_LIST, ImageGPTForCausalImageModeling, @@ 
-5015,6 +6966,19 @@ ImageGPTPreTrainedModel, load_tf_weights_in_imagegpt, ) + from .models.informer import ( + INFORMER_PRETRAINED_MODEL_ARCHIVE_LIST, + InformerForPrediction, + InformerModel, + InformerPreTrainedModel, + ) + from .models.instructblip import ( + INSTRUCTBLIP_PRETRAINED_MODEL_ARCHIVE_LIST, + InstructBlipForConditionalGeneration, + InstructBlipPreTrainedModel, + InstructBlipQFormerModel, + InstructBlipVisionModel, + ) from .models.jukebox import ( JUKEBOX_PRETRAINED_MODEL_ARCHIVE_LIST, JukeboxModel, @@ -5022,6 +6986,12 @@ JukeboxPrior, JukeboxVQVAE, ) + from .models.kosmos2 import ( + KOSMOS2_PRETRAINED_MODEL_ARCHIVE_LIST, + Kosmos2ForConditionalGeneration, + Kosmos2Model, + Kosmos2PreTrainedModel, + ) from .models.layoutlm import ( LAYOUTLM_PRETRAINED_MODEL_ARCHIVE_LIST, LayoutLMForMaskedLM, @@ -5070,6 +7040,19 @@ LiltModel, LiltPreTrainedModel, ) + from .models.llama import ( + LlamaForCausalLM, + LlamaForQuestionAnswering, + LlamaForSequenceClassification, + LlamaModel, + LlamaPreTrainedModel, + ) + from .models.llava import ( + LLAVA_PRETRAINED_MODEL_ARCHIVE_LIST, + LlavaForConditionalGeneration, + LlavaPreTrainedModel, + LlavaProcessor, + ) from .models.longformer import ( LONGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST, LongformerForMaskedLM, @@ -5146,7 +7129,17 @@ MBartModel, MBartPreTrainedModel, ) - from .models.mctct import MCTCT_PRETRAINED_MODEL_ARCHIVE_LIST, MCTCTForCTC, MCTCTModel, MCTCTPreTrainedModel + from .models.mega import ( + MEGA_PRETRAINED_MODEL_ARCHIVE_LIST, + MegaForCausalLM, + MegaForMaskedLM, + MegaForMultipleChoice, + MegaForQuestionAnswering, + MegaForSequenceClassification, + MegaForTokenClassification, + MegaModel, + MegaPreTrainedModel, + ) from .models.megatron_bert import ( MEGATRON_BERT_PRETRAINED_MODEL_ARCHIVE_LIST, MegatronBertForCausalLM, @@ -5160,7 +7153,24 @@ MegatronBertModel, MegatronBertPreTrainedModel, ) - from .models.mmbt import MMBTForClassification, MMBTModel, ModalEmbeddings + from .models.mgp_str import ( + MGP_STR_PRETRAINED_MODEL_ARCHIVE_LIST, + MgpstrForSceneTextRecognition, + MgpstrModel, + MgpstrPreTrainedModel, + ) + from .models.mistral import ( + MistralForCausalLM, + MistralForSequenceClassification, + MistralModel, + MistralPreTrainedModel, + ) + from .models.mixtral import ( + MixtralForCausalLM, + MixtralForSequenceClassification, + MixtralModel, + MixtralPreTrainedModel, + ) from .models.mobilebert import ( MOBILEBERT_PRETRAINED_MODEL_ARCHIVE_LIST, MobileBertForMaskedLM, @@ -5197,6 +7207,13 @@ MobileViTModel, MobileViTPreTrainedModel, ) + from .models.mobilevitv2 import ( + MOBILEVITV2_PRETRAINED_MODEL_ARCHIVE_LIST, + MobileViTV2ForImageClassification, + MobileViTV2ForSemanticSegmentation, + MobileViTV2Model, + MobileViTV2PreTrainedModel, + ) from .models.mpnet import ( MPNET_PRETRAINED_MODEL_ARCHIVE_LIST, MPNetForMaskedLM, @@ -5208,7 +7225,42 @@ MPNetModel, MPNetPreTrainedModel, ) - from .models.mt5 import MT5EncoderModel, MT5ForConditionalGeneration, MT5Model, MT5PreTrainedModel + from .models.mpt import ( + MPT_PRETRAINED_MODEL_ARCHIVE_LIST, + MptForCausalLM, + MptForQuestionAnswering, + MptForSequenceClassification, + MptForTokenClassification, + MptModel, + MptPreTrainedModel, + ) + from .models.mra import ( + MRA_PRETRAINED_MODEL_ARCHIVE_LIST, + MraForMaskedLM, + MraForMultipleChoice, + MraForQuestionAnswering, + MraForSequenceClassification, + MraForTokenClassification, + MraModel, + MraPreTrainedModel, + ) + from .models.mt5 import ( + MT5EncoderModel, + MT5ForConditionalGeneration, + MT5ForQuestionAnswering, + 
MT5ForSequenceClassification, + MT5ForTokenClassification, + MT5Model, + MT5PreTrainedModel, + ) + from .models.musicgen import ( + MUSICGEN_PRETRAINED_MODEL_ARCHIVE_LIST, + MusicgenForCausalLM, + MusicgenForConditionalGeneration, + MusicgenModel, + MusicgenPreTrainedModel, + MusicgenProcessor, + ) from .models.mvp import ( MVP_PRETRAINED_MODEL_ARCHIVE_LIST, MvpForCausalLM, @@ -5237,6 +7289,14 @@ NezhaModel, NezhaPreTrainedModel, ) + from .models.nllb_moe import ( + NLLB_MOE_PRETRAINED_MODEL_ARCHIVE_LIST, + NllbMoeForConditionalGeneration, + NllbMoeModel, + NllbMoePreTrainedModel, + NllbMoeSparseMLP, + NllbMoeTop2Router, + ) from .models.nystromformer import ( NYSTROMFORMER_PRETRAINED_MODEL_ARCHIVE_LIST, NystromformerForMaskedLM, @@ -5271,6 +7331,14 @@ OPTModel, OPTPreTrainedModel, ) + from .models.owlv2 import ( + OWLV2_PRETRAINED_MODEL_ARCHIVE_LIST, + Owlv2ForObjectDetection, + Owlv2Model, + Owlv2PreTrainedModel, + Owlv2TextModel, + Owlv2VisionModel, + ) from .models.owlvit import ( OWLVIT_PRETRAINED_MODEL_ARCHIVE_LIST, OwlViTForObjectDetection, @@ -5279,6 +7347,24 @@ OwlViTTextModel, OwlViTVisionModel, ) + from .models.patchtsmixer import ( + PATCHTSMIXER_PRETRAINED_MODEL_ARCHIVE_LIST, + PatchTSMixerForPrediction, + PatchTSMixerForPretraining, + PatchTSMixerForRegression, + PatchTSMixerForTimeSeriesClassification, + PatchTSMixerModel, + PatchTSMixerPreTrainedModel, + ) + from .models.patchtst import ( + PATCHTST_PRETRAINED_MODEL_ARCHIVE_LIST, + PatchTSTForClassification, + PatchTSTForPrediction, + PatchTSTForPretraining, + PatchTSTForRegression, + PatchTSTModel, + PatchTSTPreTrainedModel, + ) from .models.pegasus import ( PegasusForCausalLM, PegasusForConditionalGeneration, @@ -5304,6 +7390,27 @@ PerceiverModel, PerceiverPreTrainedModel, ) + from .models.persimmon import ( + PersimmonForCausalLM, + PersimmonForSequenceClassification, + PersimmonModel, + PersimmonPreTrainedModel, + ) + from .models.phi import ( + PHI_PRETRAINED_MODEL_ARCHIVE_LIST, + PhiForCausalLM, + PhiForSequenceClassification, + PhiForTokenClassification, + PhiModel, + PhiPreTrainedModel, + ) + from .models.pix2struct import ( + PIX2STRUCT_PRETRAINED_MODEL_ARCHIVE_LIST, + Pix2StructForConditionalGeneration, + Pix2StructPreTrainedModel, + Pix2StructTextModel, + Pix2StructVisionModel, + ) from .models.plbart import ( PLBART_PRETRAINED_MODEL_ARCHIVE_LIST, PLBartForCausalLM, @@ -5318,6 +7425,11 @@ PoolFormerModel, PoolFormerPreTrainedModel, ) + from .models.pop2piano import ( + POP2PIANO_PRETRAINED_MODEL_ARCHIVE_LIST, + Pop2PianoForConditionalGeneration, + Pop2PianoPreTrainedModel, + ) from .models.prophetnet import ( PROPHETNET_PRETRAINED_MODEL_ARCHIVE_LIST, ProphetNetDecoder, @@ -5327,6 +7439,12 @@ ProphetNetModel, ProphetNetPreTrainedModel, ) + from .models.pvt import ( + PVT_PRETRAINED_MODEL_ARCHIVE_LIST, + PvtForImageClassification, + PvtModel, + PvtPreTrainedModel, + ) from .models.qdqbert import ( QDQBERT_PRETRAINED_MODEL_ARCHIVE_LIST, QDQBertForMaskedLM, @@ -5341,7 +7459,18 @@ QDQBertPreTrainedModel, load_tf_weights_in_qdqbert, ) - from .models.rag import RagModel, RagPreTrainedModel, RagSequenceForGeneration, RagTokenForGeneration + from .models.qwen2 import ( + Qwen2ForCausalLM, + Qwen2ForSequenceClassification, + Qwen2Model, + Qwen2PreTrainedModel, + ) + from .models.rag import ( + RagModel, + RagPreTrainedModel, + RagSequenceForGeneration, + RagTokenForGeneration, + ) from .models.realm import ( REALM_PRETRAINED_MODEL_ARCHIVE_LIST, RealmEmbedder, @@ -5390,7 +7519,6 @@ ResNetModel, ResNetPreTrainedModel, ) - 
from .models.retribert import RETRIBERT_PRETRAINED_MODEL_ARCHIVE_LIST, RetriBertModel, RetriBertPreTrainedModel from .models.roberta import ( ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST, RobertaForCausalLM, @@ -5440,6 +7568,41 @@ RoFormerPreTrainedModel, load_tf_weights_in_roformer, ) + from .models.rwkv import ( + RWKV_PRETRAINED_MODEL_ARCHIVE_LIST, + RwkvForCausalLM, + RwkvModel, + RwkvPreTrainedModel, + ) + from .models.sam import ( + SAM_PRETRAINED_MODEL_ARCHIVE_LIST, + SamModel, + SamPreTrainedModel, + ) + + # PyTorch model imports + from .models.seamless_m4t import ( + SEAMLESS_M4T_PRETRAINED_MODEL_ARCHIVE_LIST, + SeamlessM4TCodeHifiGan, + SeamlessM4TForSpeechToSpeech, + SeamlessM4TForSpeechToText, + SeamlessM4TForTextToSpeech, + SeamlessM4TForTextToText, + SeamlessM4THifiGan, + SeamlessM4TModel, + SeamlessM4TPreTrainedModel, + SeamlessM4TTextToUnitForConditionalGeneration, + SeamlessM4TTextToUnitModel, + ) + from .models.seamless_m4t_v2 import ( + SEAMLESS_M4T_V2_PRETRAINED_MODEL_ARCHIVE_LIST, + SeamlessM4Tv2ForSpeechToSpeech, + SeamlessM4Tv2ForSpeechToText, + SeamlessM4Tv2ForTextToSpeech, + SeamlessM4Tv2ForTextToText, + SeamlessM4Tv2Model, + SeamlessM4Tv2PreTrainedModel, + ) from .models.segformer import ( SEGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST, SegformerDecodeHead, @@ -5463,6 +7626,14 @@ SEWDModel, SEWDPreTrainedModel, ) + from .models.siglip import ( + SIGLIP_PRETRAINED_MODEL_ARCHIVE_LIST, + SiglipForImageClassification, + SiglipModel, + SiglipPreTrainedModel, + SiglipTextModel, + SiglipVisionModel, + ) from .models.speech_encoder_decoder import SpeechEncoderDecoderModel from .models.speech_to_text import ( SPEECH_TO_TEXT_PRETRAINED_MODEL_ARCHIVE_LIST, @@ -5470,7 +7641,10 @@ Speech2TextModel, Speech2TextPreTrainedModel, ) - from .models.speech_to_text_2 import Speech2Text2ForCausalLM, Speech2Text2PreTrainedModel + from .models.speech_to_text_2 import ( + Speech2Text2ForCausalLM, + Speech2Text2PreTrainedModel, + ) from .models.speecht5 import ( SPEECHT5_PRETRAINED_MODEL_ARCHIVE_LIST, SpeechT5ForSpeechToSpeech, @@ -5499,6 +7673,18 @@ SqueezeBertModule, SqueezeBertPreTrainedModel, ) + from .models.stablelm import ( + StableLmForCausalLM, + StableLmForSequenceClassification, + StableLmModel, + StableLmPreTrainedModel, + ) + from .models.swiftformer import ( + SWIFTFORMER_PRETRAINED_MODEL_ARCHIVE_LIST, + SwiftFormerForImageClassification, + SwiftFormerModel, + SwiftFormerPreTrainedModel, + ) from .models.swin import ( SWIN_PRETRAINED_MODEL_ARCHIVE_LIST, SwinBackbone, @@ -5515,6 +7701,7 @@ ) from .models.swinv2 import ( SWINV2_PRETRAINED_MODEL_ARCHIVE_LIST, + Swinv2Backbone, Swinv2ForImageClassification, Swinv2ForMaskedImageModeling, Swinv2Model, @@ -5533,10 +7720,19 @@ T5_PRETRAINED_MODEL_ARCHIVE_LIST, T5EncoderModel, T5ForConditionalGeneration, + T5ForQuestionAnswering, + T5ForSequenceClassification, + T5ForTokenClassification, T5Model, T5PreTrainedModel, load_tf_weights_in_t5, ) + from .models.table_transformer import ( + TABLE_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST, + TableTransformerForObjectDetection, + TableTransformerModel, + TableTransformerPreTrainedModel, + ) from .models.tapas import ( TAPAS_PRETRAINED_MODEL_ARCHIVE_LIST, TapasForMaskedLM, @@ -5558,21 +7754,12 @@ TimesformerModel, TimesformerPreTrainedModel, ) - from .models.trajectory_transformer import ( - TRAJECTORY_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST, - TrajectoryTransformerModel, - TrajectoryTransformerPreTrainedModel, - ) - from .models.transfo_xl import ( - TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST, - 
AdaptiveEmbedding, - TransfoXLForSequenceClassification, - TransfoXLLMHeadModel, - TransfoXLModel, - TransfoXLPreTrainedModel, - load_tf_weights_in_transfo_xl, + from .models.timm_backbone import TimmBackbone + from .models.trocr import ( + TROCR_PRETRAINED_MODEL_ARCHIVE_LIST, + TrOCRForCausalLM, + TrOCRPreTrainedModel, ) - from .models.trocr import TROCR_PRETRAINED_MODEL_ARCHIVE_LIST, TrOCRForCausalLM, TrOCRPreTrainedModel from .models.tvlt import ( TVLT_PRETRAINED_MODEL_ARCHIVE_LIST, TvltForAudioVisualClassification, @@ -5580,6 +7767,21 @@ TvltModel, TvltPreTrainedModel, ) + from .models.tvp import ( + TVP_PRETRAINED_MODEL_ARCHIVE_LIST, + TvpForVideoGrounding, + TvpModel, + TvpPreTrainedModel, + ) + from .models.umt5 import ( + UMT5EncoderModel, + UMT5ForConditionalGeneration, + UMT5ForQuestionAnswering, + UMT5ForSequenceClassification, + UMT5ForTokenClassification, + UMT5Model, + UMT5PreTrainedModel, + ) from .models.unispeech import ( UNISPEECH_PRETRAINED_MODEL_ARCHIVE_LIST, UniSpeechForCTC, @@ -5598,12 +7800,10 @@ UniSpeechSatModel, UniSpeechSatPreTrainedModel, ) - from .models.upernet import UperNetForSemanticSegmentation, UperNetPreTrainedModel - from .models.van import ( - VAN_PRETRAINED_MODEL_ARCHIVE_LIST, - VanForImageClassification, - VanModel, - VanPreTrainedModel, + from .models.univnet import UNIVNET_PRETRAINED_MODEL_ARCHIVE_LIST, UnivNetModel + from .models.upernet import ( + UperNetForSemanticSegmentation, + UperNetPreTrainedModel, ) from .models.videomae import ( VIDEOMAE_PRETRAINED_MODEL_ARCHIVE_LIST, @@ -5623,6 +7823,11 @@ ViltModel, ViltPreTrainedModel, ) + from .models.vipllava import ( + VIPLLAVA_PRETRAINED_MODEL_ARCHIVE_LIST, + VipLlavaForConditionalGeneration, + VipLlavaPreTrainedModel, + ) from .models.vision_encoder_decoder import VisionEncoderDecoderModel from .models.vision_text_dual_encoder import VisionTextDualEncoderModel from .models.visual_bert import ( @@ -5662,6 +7867,28 @@ ViTMSNModel, ViTMSNPreTrainedModel, ) + from .models.vitdet import ( + VITDET_PRETRAINED_MODEL_ARCHIVE_LIST, + VitDetBackbone, + VitDetModel, + VitDetPreTrainedModel, + ) + from .models.vitmatte import ( + VITMATTE_PRETRAINED_MODEL_ARCHIVE_LIST, + VitMatteForImageMatting, + VitMattePreTrainedModel, + ) + from .models.vits import ( + VITS_PRETRAINED_MODEL_ARCHIVE_LIST, + VitsModel, + VitsPreTrainedModel, + ) + from .models.vivit import ( + VIVIT_PRETRAINED_MODEL_ARCHIVE_LIST, + VivitForVideoClassification, + VivitModel, + VivitPreTrainedModel, + ) from .models.wav2vec2 import ( WAV_2_VEC_2_PRETRAINED_MODEL_ARCHIVE_LIST, Wav2Vec2ForAudioFrameClassification, @@ -5673,6 +7900,15 @@ Wav2Vec2Model, Wav2Vec2PreTrainedModel, ) + from .models.wav2vec2_bert import ( + WAV2VEC2_BERT_PRETRAINED_MODEL_ARCHIVE_LIST, + Wav2Vec2BertForAudioFrameClassification, + Wav2Vec2BertForCTC, + Wav2Vec2BertForSequenceClassification, + Wav2Vec2BertForXVector, + Wav2Vec2BertModel, + Wav2Vec2BertPreTrainedModel, + ) from .models.wav2vec2_conformer import ( WAV2VEC2_CONFORMER_PRETRAINED_MODEL_ARCHIVE_LIST, Wav2Vec2ConformerForAudioFrameClassification, @@ -5694,6 +7930,8 @@ ) from .models.whisper import ( WHISPER_PRETRAINED_MODEL_ARCHIVE_LIST, + WhisperForAudioClassification, + WhisperForCausalLM, WhisperForConditionalGeneration, WhisperModel, WhisperPreTrainedModel, @@ -5705,7 +7943,12 @@ XCLIPTextModel, XCLIPVisionModel, ) - from .models.xglm import XGLM_PRETRAINED_MODEL_ARCHIVE_LIST, XGLMForCausalLM, XGLMModel, XGLMPreTrainedModel + from .models.xglm import ( + XGLM_PRETRAINED_MODEL_ARCHIVE_LIST, + 
XGLMForCausalLM, + XGLMModel, + XGLMPreTrainedModel, + ) from .models.xlm import ( XLM_PRETRAINED_MODEL_ARCHIVE_LIST, XLMForMultipleChoice, @@ -5825,6 +8068,7 @@ from .generation import ( TFForcedBOSTokenLogitsProcessor, TFForcedEOSTokenLogitsProcessor, + TFForceTokensLogitsProcessor, TFGenerationMixin, TFLogitsProcessor, TFLogitsProcessorList, @@ -5833,23 +8077,20 @@ TFNoBadWordsLogitsProcessor, TFNoRepeatNGramLogitsProcessor, TFRepetitionPenaltyLogitsProcessor, + TFSuppressTokensAtBeginLogitsProcessor, + TFSuppressTokensLogitsProcessor, TFTemperatureLogitsWarper, TFTopKLogitsWarper, TFTopPLogitsWarper, tf_top_k_top_p_filtering, ) from .keras_callbacks import KerasMetricCallback, PushToHubCallback - from .modeling_tf_layoutlm import ( - TF_LAYOUTLM_PRETRAINED_MODEL_ARCHIVE_LIST, - TFLayoutLMForMaskedLM, - TFLayoutLMForQuestionAnswering, - TFLayoutLMForSequenceClassification, - TFLayoutLMForTokenClassification, - TFLayoutLMMainLayer, - TFLayoutLMModel, - TFLayoutLMPreTrainedModel, + from .modeling_tf_utils import ( + TFPreTrainedModel, + TFSequenceSummary, + TFSharedEmbeddings, + shape_list, ) - from .modeling_tf_utils import TFPreTrainedModel, TFSequenceSummary, TFSharedEmbeddings, shape_list # TensorFlow model imports from .models.albert import ( @@ -5865,9 +8106,11 @@ TFAlbertPreTrainedModel, ) from .models.auto import ( + TF_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING, TF_MODEL_FOR_CAUSAL_LM_MAPPING, TF_MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING, TF_MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING, + TF_MODEL_FOR_MASK_GENERATION_MAPPING, TF_MODEL_FOR_MASKED_IMAGE_MODELING_MAPPING, TF_MODEL_FOR_MASKED_LM_MAPPING, TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING, @@ -5879,15 +8122,20 @@ TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING, TF_MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING, TF_MODEL_FOR_TABLE_QUESTION_ANSWERING_MAPPING, + TF_MODEL_FOR_TEXT_ENCODING_MAPPING, TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING, TF_MODEL_FOR_VISION_2_SEQ_MAPPING, + TF_MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING, TF_MODEL_MAPPING, TF_MODEL_WITH_LM_HEAD_MAPPING, TFAutoModel, + TFAutoModelForAudioClassification, TFAutoModelForCausalLM, TFAutoModelForDocumentQuestionAnswering, TFAutoModelForImageClassification, + TFAutoModelForMaskedImageModeling, TFAutoModelForMaskedLM, + TFAutoModelForMaskGeneration, TFAutoModelForMultipleChoice, TFAutoModelForNextSentencePrediction, TFAutoModelForPreTraining, @@ -5897,8 +8145,10 @@ TFAutoModelForSequenceClassification, TFAutoModelForSpeechSeq2Seq, TFAutoModelForTableQuestionAnswering, + TFAutoModelForTextEncoding, TFAutoModelForTokenClassification, TFAutoModelForVision2Seq, + TFAutoModelForZeroShotImageClassification, TFAutoModelWithLMHead, ) from .models.bart import ( @@ -5932,6 +8182,16 @@ TFBlenderbotSmallModel, TFBlenderbotSmallPreTrainedModel, ) + from .models.blip import ( + TF_BLIP_PRETRAINED_MODEL_ARCHIVE_LIST, + TFBlipForConditionalGeneration, + TFBlipForImageTextRetrieval, + TFBlipForQuestionAnswering, + TFBlipModel, + TFBlipPreTrainedModel, + TFBlipTextModel, + TFBlipVisionModel, + ) from .models.camembert import ( TF_CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST, TFCamembertForCausalLM, @@ -5961,7 +8221,16 @@ TFConvBertModel, TFConvBertPreTrainedModel, ) - from .models.convnext import TFConvNextForImageClassification, TFConvNextModel, TFConvNextPreTrainedModel + from .models.convnext import ( + TFConvNextForImageClassification, + TFConvNextModel, + TFConvNextPreTrainedModel, + ) + from .models.convnextv2 import ( + TFConvNextV2ForImageClassification, + TFConvNextV2Model, + TFConvNextV2PreTrainedModel, 
+ ) from .models.ctrl import ( TF_CTRL_PRETRAINED_MODEL_ARCHIVE_LIST, TFCTRLForSequenceClassification, @@ -5993,6 +8262,7 @@ from .models.deberta_v2 import ( TF_DEBERTA_V2_PRETRAINED_MODEL_ARCHIVE_LIST, TFDebertaV2ForMaskedLM, + TFDebertaV2ForMultipleChoice, TFDebertaV2ForQuestionAnswering, TFDebertaV2ForSequenceClassification, TFDebertaV2ForTokenClassification, @@ -6007,6 +8277,15 @@ TFDeiTModel, TFDeiTPreTrainedModel, ) + from .models.deprecated.transfo_xl import ( + TF_TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST, + TFAdaptiveEmbedding, + TFTransfoXLForSequenceClassification, + TFTransfoXLLMHeadModel, + TFTransfoXLMainLayer, + TFTransfoXLModel, + TFTransfoXLPreTrainedModel, + ) from .models.distilbert import ( TF_DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST, TFDistilBertForMaskedLM, @@ -6029,6 +8308,13 @@ TFDPRQuestionEncoder, TFDPRReader, ) + from .models.efficientformer import ( + TF_EFFICIENTFORMER_PRETRAINED_MODEL_ARCHIVE_LIST, + TFEfficientFormerForImageClassification, + TFEfficientFormerForImageClassificationWithTeacher, + TFEfficientFormerModel, + TFEfficientFormerPreTrainedModel, + ) from .models.electra import ( TF_ELECTRA_PRETRAINED_MODEL_ARCHIVE_LIST, TFElectraForMaskedLM, @@ -6100,6 +8386,16 @@ TFHubertModel, TFHubertPreTrainedModel, ) + from .models.layoutlm import ( + TF_LAYOUTLM_PRETRAINED_MODEL_ARCHIVE_LIST, + TFLayoutLMForMaskedLM, + TFLayoutLMForQuestionAnswering, + TFLayoutLMForSequenceClassification, + TFLayoutLMForTokenClassification, + TFLayoutLMMainLayer, + TFLayoutLMModel, + TFLayoutLMPreTrainedModel, + ) from .models.layoutlmv3 import ( TF_LAYOUTLMV3_PRETRAINED_MODEL_ARCHIVE_LIST, TFLayoutLMv3ForQuestionAnswering, @@ -6108,7 +8404,11 @@ TFLayoutLMv3Model, TFLayoutLMv3PreTrainedModel, ) - from .models.led import TFLEDForConditionalGeneration, TFLEDModel, TFLEDPreTrainedModel + from .models.led import ( + TFLEDForConditionalGeneration, + TFLEDModel, + TFLEDPreTrainedModel, + ) from .models.longformer import ( TF_LONGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST, TFLongformerForMaskedLM, @@ -6128,11 +8428,18 @@ TFLxmertPreTrainedModel, TFLxmertVisualFeatureEncoder, ) - from .models.marian import TFMarianModel, TFMarianMTModel, TFMarianPreTrainedModel - from .models.mbart import TFMBartForConditionalGeneration, TFMBartModel, TFMBartPreTrainedModel + from .models.marian import ( + TFMarianModel, + TFMarianMTModel, + TFMarianPreTrainedModel, + ) + from .models.mbart import ( + TFMBartForConditionalGeneration, + TFMBartModel, + TFMBartPreTrainedModel, + ) from .models.mobilebert import ( TF_MOBILEBERT_PRETRAINED_MODEL_ARCHIVE_LIST, - TF_MOBILEVIT_PRETRAINED_MODEL_ARCHIVE_LIST, TFMobileBertForMaskedLM, TFMobileBertForMultipleChoice, TFMobileBertForNextSentencePrediction, @@ -6143,6 +8450,9 @@ TFMobileBertMainLayer, TFMobileBertModel, TFMobileBertPreTrainedModel, + ) + from .models.mobilevit import ( + TF_MOBILEVIT_PRETRAINED_MODEL_ARCHIVE_LIST, TFMobileViTForImageClassification, TFMobileViTForSemanticSegmentation, TFMobileViTModel, @@ -6159,7 +8469,11 @@ TFMPNetModel, TFMPNetPreTrainedModel, ) - from .models.mt5 import TFMT5EncoderModel, TFMT5ForConditionalGeneration, TFMT5Model + from .models.mt5 import ( + TFMT5EncoderModel, + TFMT5ForConditionalGeneration, + TFMT5Model, + ) from .models.openai import ( TF_OPENAI_GPT_PRETRAINED_MODEL_ARCHIVE_LIST, TFOpenAIGPTDoubleHeadsModel, @@ -6170,8 +8484,17 @@ TFOpenAIGPTPreTrainedModel, ) from .models.opt import TFOPTForCausalLM, TFOPTModel, TFOPTPreTrainedModel - from .models.pegasus import TFPegasusForConditionalGeneration, TFPegasusModel, 
TFPegasusPreTrainedModel - from .models.rag import TFRagModel, TFRagPreTrainedModel, TFRagSequenceForGeneration, TFRagTokenForGeneration + from .models.pegasus import ( + TFPegasusForConditionalGeneration, + TFPegasusModel, + TFPegasusPreTrainedModel, + ) + from .models.rag import ( + TFRagModel, + TFRagPreTrainedModel, + TFRagSequenceForGeneration, + TFRagTokenForGeneration, + ) from .models.regnet import ( TF_REGNET_PRETRAINED_MODEL_ARCHIVE_LIST, TFRegNetForImageClassification, @@ -6232,6 +8555,11 @@ TFRoFormerModel, TFRoFormerPreTrainedModel, ) + from .models.sam import ( + TF_SAM_PRETRAINED_MODEL_ARCHIVE_LIST, + TFSamModel, + TFSamPreTrainedModel, + ) from .models.segformer import ( TF_SEGFORMER_PRETRAINED_MODEL_ARCHIVE_LIST, TFSegformerDecodeHead, @@ -6268,21 +8596,22 @@ TFTapasModel, TFTapasPreTrainedModel, ) - from .models.transfo_xl import ( - TF_TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST, - TFAdaptiveEmbedding, - TFTransfoXLForSequenceClassification, - TFTransfoXLLMHeadModel, - TFTransfoXLMainLayer, - TFTransfoXLModel, - TFTransfoXLPreTrainedModel, - ) from .models.vision_encoder_decoder import TFVisionEncoderDecoderModel - from .models.vit import TFViTForImageClassification, TFViTModel, TFViTPreTrainedModel - from .models.vit_mae import TFViTMAEForPreTraining, TFViTMAEModel, TFViTMAEPreTrainedModel + from .models.vision_text_dual_encoder import TFVisionTextDualEncoderModel + from .models.vit import ( + TFViTForImageClassification, + TFViTModel, + TFViTPreTrainedModel, + ) + from .models.vit_mae import ( + TFViTMAEForPreTraining, + TFViTMAEModel, + TFViTMAEPreTrainedModel, + ) from .models.wav2vec2 import ( TF_WAV_2_VEC_2_PRETRAINED_MODEL_ARCHIVE_LIST, TFWav2Vec2ForCTC, + TFWav2Vec2ForSequenceClassification, TFWav2Vec2Model, TFWav2Vec2PreTrainedModel, ) @@ -6333,10 +8662,30 @@ ) # Optimization - from .optimization_tf import AdamWeightDecay, GradientAccumulator, WarmUp, create_optimizer + from .optimization_tf import ( + AdamWeightDecay, + GradientAccumulator, + WarmUp, + create_optimizer, + ) - # Trainer - from .trainer_tf import TFTrainer + try: + if not ( + is_librosa_available() + and is_essentia_available() + and is_scipy_available() + and is_torch_available() + and is_pretty_midi_available() + ): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + from .utils.dummy_essentia_and_librosa_and_pretty_midi_and_scipy_and_torch_objects import * + else: + from .models.pop2piano import ( + Pop2PianoFeatureExtractor, + Pop2PianoProcessor, + Pop2PianoTokenizer, + ) try: if not is_flax_available(): @@ -6349,14 +8698,18 @@ from .generation import ( FlaxForcedBOSTokenLogitsProcessor, FlaxForcedEOSTokenLogitsProcessor, + FlaxForceTokensLogitsProcessor, FlaxGenerationMixin, FlaxLogitsProcessor, FlaxLogitsProcessorList, FlaxLogitsWarper, FlaxMinLengthLogitsProcessor, + FlaxSuppressTokensAtBeginLogitsProcessor, + FlaxSuppressTokensLogitsProcessor, FlaxTemperatureLogitsWarper, FlaxTopKLogitsWarper, FlaxTopPLogitsWarper, + FlaxWhisperTimeStampLogitsProcessor, ) from .modeling_flax_utils import FlaxPreTrainedModel @@ -6372,6 +8725,7 @@ FlaxAlbertPreTrainedModel, ) from .models.auto import ( + FLAX_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING, FLAX_MODEL_FOR_CAUSAL_LM_MAPPING, FLAX_MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING, FLAX_MODEL_FOR_MASKED_LM_MAPPING, @@ -6381,6 +8735,7 @@ FLAX_MODEL_FOR_QUESTION_ANSWERING_MAPPING, FLAX_MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING, FLAX_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING, + FLAX_MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING, 
FLAX_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING, FLAX_MODEL_FOR_VISION_2_SEQ_MAPPING, FLAX_MODEL_MAPPING, @@ -6394,6 +8749,7 @@ FlaxAutoModelForQuestionAnswering, FlaxAutoModelForSeq2SeqLM, FlaxAutoModelForSequenceClassification, + FlaxAutoModelForSpeechSeq2Seq, FlaxAutoModelForTokenClassification, FlaxAutoModelForVision2Seq, ) @@ -6445,10 +8801,16 @@ FlaxBlenderbotSmallModel, FlaxBlenderbotSmallPreTrainedModel, ) + from .models.bloom import ( + FlaxBloomForCausalLM, + FlaxBloomModel, + FlaxBloomPreTrainedModel, + ) from .models.clip import ( FlaxCLIPModel, FlaxCLIPPreTrainedModel, FlaxCLIPTextModel, + FlaxCLIPTextModelWithProjection, FlaxCLIPTextPreTrainedModel, FlaxCLIPVisionModel, FlaxCLIPVisionPreTrainedModel, @@ -6474,11 +8836,36 @@ FlaxElectraPreTrainedModel, ) from .models.encoder_decoder import FlaxEncoderDecoderModel - from .models.gpt2 import FlaxGPT2LMHeadModel, FlaxGPT2Model, FlaxGPT2PreTrainedModel - from .models.gpt_neo import FlaxGPTNeoForCausalLM, FlaxGPTNeoModel, FlaxGPTNeoPreTrainedModel - from .models.gptj import FlaxGPTJForCausalLM, FlaxGPTJModel, FlaxGPTJPreTrainedModel - from .models.longt5 import FlaxLongT5ForConditionalGeneration, FlaxLongT5Model, FlaxLongT5PreTrainedModel - from .models.marian import FlaxMarianModel, FlaxMarianMTModel, FlaxMarianPreTrainedModel + from .models.gpt2 import ( + FlaxGPT2LMHeadModel, + FlaxGPT2Model, + FlaxGPT2PreTrainedModel, + ) + from .models.gpt_neo import ( + FlaxGPTNeoForCausalLM, + FlaxGPTNeoModel, + FlaxGPTNeoPreTrainedModel, + ) + from .models.gptj import ( + FlaxGPTJForCausalLM, + FlaxGPTJModel, + FlaxGPTJPreTrainedModel, + ) + from .models.llama import ( + FlaxLlamaForCausalLM, + FlaxLlamaModel, + FlaxLlamaPreTrainedModel, + ) + from .models.longt5 import ( + FlaxLongT5ForConditionalGeneration, + FlaxLongT5Model, + FlaxLongT5PreTrainedModel, + ) + from .models.marian import ( + FlaxMarianModel, + FlaxMarianMTModel, + FlaxMarianPreTrainedModel, + ) from .models.mbart import ( FlaxMBartForConditionalGeneration, FlaxMBartForQuestionAnswering, @@ -6486,9 +8873,32 @@ FlaxMBartModel, FlaxMBartPreTrainedModel, ) - from .models.mt5 import FlaxMT5EncoderModel, FlaxMT5ForConditionalGeneration, FlaxMT5Model + from .models.mistral import ( + FlaxMistralForCausalLM, + FlaxMistralModel, + FlaxMistralPreTrainedModel, + ) + from .models.mt5 import ( + FlaxMT5EncoderModel, + FlaxMT5ForConditionalGeneration, + FlaxMT5Model, + ) from .models.opt import FlaxOPTForCausalLM, FlaxOPTModel, FlaxOPTPreTrainedModel - from .models.pegasus import FlaxPegasusForConditionalGeneration, FlaxPegasusModel, FlaxPegasusPreTrainedModel + from .models.pegasus import ( + FlaxPegasusForConditionalGeneration, + FlaxPegasusModel, + FlaxPegasusPreTrainedModel, + ) + from .models.regnet import ( + FlaxRegNetForImageClassification, + FlaxRegNetModel, + FlaxRegNetPreTrainedModel, + ) + from .models.resnet import ( + FlaxResNetForImageClassification, + FlaxResNetModel, + FlaxResNetPreTrainedModel, + ) from .models.roberta import ( FlaxRobertaForCausalLM, FlaxRobertaForMaskedLM, @@ -6519,17 +8929,36 @@ FlaxRoFormerPreTrainedModel, ) from .models.speech_encoder_decoder import FlaxSpeechEncoderDecoderModel - from .models.t5 import FlaxT5EncoderModel, FlaxT5ForConditionalGeneration, FlaxT5Model, FlaxT5PreTrainedModel + from .models.t5 import ( + FlaxT5EncoderModel, + FlaxT5ForConditionalGeneration, + FlaxT5Model, + FlaxT5PreTrainedModel, + ) from .models.vision_encoder_decoder import FlaxVisionEncoderDecoderModel from .models.vision_text_dual_encoder import 
FlaxVisionTextDualEncoderModel - from .models.vit import FlaxViTForImageClassification, FlaxViTModel, FlaxViTPreTrainedModel + from .models.vit import ( + FlaxViTForImageClassification, + FlaxViTModel, + FlaxViTPreTrainedModel, + ) from .models.wav2vec2 import ( FlaxWav2Vec2ForCTC, FlaxWav2Vec2ForPreTraining, FlaxWav2Vec2Model, FlaxWav2Vec2PreTrainedModel, ) - from .models.xglm import FlaxXGLMForCausalLM, FlaxXGLMModel, FlaxXGLMPreTrainedModel + from .models.whisper import ( + FlaxWhisperForAudioClassification, + FlaxWhisperForConditionalGeneration, + FlaxWhisperModel, + FlaxWhisperPreTrainedModel, + ) + from .models.xglm import ( + FlaxXGLMForCausalLM, + FlaxXGLMModel, + FlaxXGLMPreTrainedModel, + ) from .models.xlm_roberta import ( FLAX_XLM_ROBERTA_PRETRAINED_MODEL_ARCHIVE_LIST, FlaxXLMRobertaForCausalLM, diff --git a/src/transformers/activations.py b/src/transformers/activations.py index 436d2b95fe62a7..22f5fe9b1bc2f4 100644 --- a/src/transformers/activations.py +++ b/src/transformers/activations.py @@ -13,6 +13,7 @@ # limitations under the License. import math +import warnings from collections import OrderedDict import torch @@ -121,17 +122,28 @@ def forward(self, x: Tensor) -> Tensor: return torch.clip(gelu(x), self.min, self.max) -class SiLUActivation(nn.Module): +class AccurateGELUActivation(nn.Module): """ - See Gaussian Error Linear Units (Hendrycks et al., https://arxiv.org/abs/1606.08415) where the SiLU (Sigmoid Linear - Unit) was originally introduced and coined, and see Sigmoid-Weighted Linear Units for Neural Network Function - Approximation in Reinforcement Learning (Elfwing et al., https://arxiv.org/abs/1702.03118) and Swish: a Self-Gated - Activation Function (Ramachandran et al., https://arxiv.org/abs/1710.05941v1) where the SiLU was experimented with - later. + Applies GELU approximation that is faster than default and more accurate than QuickGELU. See: + https://github.com/hendrycks/GELUs + + Implemented along with MEGA (Moving Average Equipped Gated Attention) """ + def __init__(self): + super().__init__() + self.precomputed_constant = math.sqrt(2 / math.pi) + def forward(self, input: Tensor) -> Tensor: - return nn.functional.silu(input) + return 0.5 * input * (1 + torch.tanh(self.precomputed_constant * (input + 0.044715 * torch.pow(input, 3)))) + + +class SiLUActivation(nn.SiLU): + def __init__(self, *args, **kwargs): + warnings.warn( + "The SiLUActivation class has been deprecated and will be removed in v4.39. Please use nn.SiLU instead.", + ) + super().__init__(*args, **kwargs) class MishActivation(nn.Module): @@ -163,6 +175,30 @@ def forward(self, input: Tensor) -> Tensor: return input +class LaplaceActivation(nn.Module): + """ + Applies elementwise activation based on Laplace function, introduced in MEGA as an attention activation. 
See + https://arxiv.org/abs/2209.10655 + + Inspired by squared relu, but with bounded range and gradient for better stability + """ + + def forward(self, input, mu=0.707107, sigma=0.282095): + input = (input - mu).div(sigma * math.sqrt(2.0)) + return 0.5 * (1.0 + torch.erf(input)) + + +class ReLUSquaredActivation(nn.Module): + """ + Applies the relu^2 activation introduced in https://arxiv.org/abs/2109.08668v2 + """ + + def forward(self, input): + relu_applied = nn.functional.relu(input) + squared = torch.square(relu_applied) + return squared + + class ClassInstantier(OrderedDict): def __getitem__(self, key): content = super().__getitem__(key) @@ -177,14 +213,18 @@ def __getitem__(self, key): "gelu_new": NewGELUActivation, "gelu_python": (GELUActivation, {"use_gelu_python": True}), "gelu_pytorch_tanh": PytorchGELUTanh, + "gelu_accurate": AccurateGELUActivation, + "laplace": LaplaceActivation, + "leaky_relu": nn.LeakyReLU, "linear": LinearActivation, "mish": MishActivation, "quick_gelu": QuickGELUActivation, "relu": nn.ReLU, + "relu2": ReLUSquaredActivation, "relu6": nn.ReLU6, "sigmoid": nn.Sigmoid, - "silu": SiLUActivation, - "swish": SiLUActivation, + "silu": nn.SiLU, + "swish": nn.SiLU, "tanh": nn.Tanh, } ACT2FN = ClassInstantier(ACT2CLS) diff --git a/src/transformers/activations_tf.py b/src/transformers/activations_tf.py index 4fcb1493e437bc..d12b73ea45176f 100644 --- a/src/transformers/activations_tf.py +++ b/src/transformers/activations_tf.py @@ -15,7 +15,20 @@ import math import tensorflow as tf -from packaging import version +from packaging.version import parse + + +try: + import tf_keras as keras +except (ModuleNotFoundError, ImportError): + import keras + + if parse(keras.__version__).major > 2: + raise ValueError( + "Your currently installed version of Keras is Keras 3, but this is not yet supported in " + "Transformers. Please install the backwards-compatible tf-keras package with " + "`pip install tf-keras`." + ) def _gelu(x): @@ -99,12 +112,12 @@ def glu(x, axis=-1): return a * tf.math.sigmoid(b) -if version.parse(tf.version.VERSION) >= version.parse("2.4"): +if parse(tf.version.VERSION) >= parse("2.4"): def approximate_gelu_wrap(x): - return tf.keras.activations.gelu(x, approximate=True) + return keras.activations.gelu(x, approximate=True) - gelu = tf.keras.activations.gelu + gelu = keras.activations.gelu gelu_new = approximate_gelu_wrap else: gelu = _gelu @@ -119,11 +132,11 @@ def approximate_gelu_wrap(x): "glu": glu, "mish": mish, "quick_gelu": quick_gelu, - "relu": tf.keras.activations.relu, - "sigmoid": tf.keras.activations.sigmoid, - "silu": tf.keras.activations.swish, - "swish": tf.keras.activations.swish, - "tanh": tf.keras.activations.tanh, + "relu": keras.activations.relu, + "sigmoid": keras.activations.sigmoid, + "silu": keras.activations.swish, + "swish": keras.activations.swish, + "tanh": keras.activations.tanh, } diff --git a/src/transformers/audio_utils.py b/src/transformers/audio_utils.py index 73bc041d6961d8..5819f0723fb658 100644 --- a/src/transformers/audio_utils.py +++ b/src/transformers/audio_utils.py @@ -1,5 +1,5 @@ # coding=utf-8 -# Copyright 2023 The HuggingFace Inc. team. +# Copyright 2023 The HuggingFace Inc. team and the librosa & torchaudio authors. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -13,239 +13,596 @@ # See the License for the specific language governing permissions and # limitations under the License. 
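The entries added to `ACT2CLS`/`ACT2FN` above are resolved by string name, which is how modeling code typically resolves `config.hidden_act`. A minimal sketch of that lookup, assuming a transformers build that already includes this patch (the tensor shape is arbitrary):

```python
import torch
from transformers.activations import ACT2FN

x = torch.randn(2, 4)

# ClassInstantier instantiates the module on lookup, so each entry is ready to call
for name in ("gelu_accurate", "laplace", "relu2", "silu"):
    act = ACT2FN[name]
    print(name, type(act).__name__, tuple(act(x).shape))
```

Note that `"silu"` and `"swish"` now resolve directly to `torch.nn.SiLU`, so existing configs keep working after the `SiLUActivation` deprecation.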
""" - Audio processing functions to extract feature from a raw audio. Should all be in numpy to support all frameworks, and - remmove unecessary dependencies. +Audio processing functions to extract features from audio waveforms. This code is pure numpy to support all frameworks +and remove unnecessary dependencies. """ -import math import warnings -from typing import Optional +from typing import Optional, Union import numpy as np -from numpy.fft import fft -def hertz_to_mel(freq: float, mel_scale: str = "htk") -> float: - """Convert Hertz to Mels. +def hertz_to_mel(freq: Union[float, np.ndarray], mel_scale: str = "htk") -> Union[float, np.ndarray]: + """ + Convert frequency from hertz to mels. Args: - freqs (`float`): - Frequencies in Hertz + freq (`float` or `np.ndarray`): + The frequency, or multiple frequencies, in hertz (Hz). mel_scale (`str`, *optional*, defaults to `"htk"`): - Scale to use, `htk` or `slaney`. + The mel frequency scale to use, `"htk"`, `"kaldi"` or `"slaney"`. Returns: - mels (`float`): - Frequency in Mels + `float` or `np.ndarray`: The frequencies on the mel scale. """ - if mel_scale not in ["slaney", "htk"]: - raise ValueError('mel_scale should be one of "htk" or "slaney".') + if mel_scale not in ["slaney", "htk", "kaldi"]: + raise ValueError('mel_scale should be one of "htk", "slaney" or "kaldi".') if mel_scale == "htk": - return 2595.0 * math.log10(1.0 + (freq / 700.0)) - - # Fill in the linear part - frequency_min = 0.0 - f_sp = 200.0 / 3 + return 2595.0 * np.log10(1.0 + (freq / 700.0)) + elif mel_scale == "kaldi": + return 1127.0 * np.log(1.0 + (freq / 700.0)) - mels = (freq - frequency_min) / f_sp - - # Fill in the log-scale part min_log_hertz = 1000.0 - min_log_mel = (min_log_hertz - frequency_min) / f_sp - logstep = math.log(6.4) / 27.0 + min_log_mel = 15.0 + logstep = 27.0 / np.log(6.4) + mels = 3.0 * freq / 200.0 - if freq >= min_log_hertz: - mels = min_log_mel + math.log(freq / min_log_hertz) / logstep + if isinstance(freq, np.ndarray): + log_region = freq >= min_log_hertz + mels[log_region] = min_log_mel + np.log(freq[log_region] / min_log_hertz) * logstep + elif freq >= min_log_hertz: + mels = min_log_mel + np.log(freq / min_log_hertz) * logstep return mels -def mel_to_hertz(mels: np.array, mel_scale: str = "htk") -> np.array: - """Convert mel bin numbers to frequencies. +def mel_to_hertz(mels: Union[float, np.ndarray], mel_scale: str = "htk") -> Union[float, np.ndarray]: + """ + Convert frequency from mels to hertz. Args: - mels (`np.array`): - Mel frequencies + mels (`float` or `np.ndarray`): + The frequency, or multiple frequencies, in mels. mel_scale (`str`, *optional*, `"htk"`): - Scale to use: `htk` or `slaney`. + The mel frequency scale to use, `"htk"`, `"kaldi"` or `"slaney"`. Returns: - freqs (`np.array`): - Mels converted to Hertz + `float` or `np.ndarray`: The frequencies in hertz. 
""" - if mel_scale not in ["slaney", "htk"]: - raise ValueError('mel_scale should be one of "htk" or "slaney".') + if mel_scale not in ["slaney", "htk", "kaldi"]: + raise ValueError('mel_scale should be one of "htk", "slaney" or "kaldi".') if mel_scale == "htk": - return 700.0 * (10.0 ** (mels / 2595.0) - 1.0) + return 700.0 * (np.power(10, mels / 2595.0) - 1.0) + elif mel_scale == "kaldi": + return 700.0 * (np.exp(mels / 1127.0) - 1.0) - # Fill in the linear scale - frequency_min = 0.0 - f_sp = 200.0 / 3 - freqs = frequency_min + f_sp * mels - - # And now the nonlinear scale min_log_hertz = 1000.0 - min_log_mel = (min_log_hertz - frequency_min) / f_sp - logstep = math.log(6.4) / 27.0 + min_log_mel = 15.0 + logstep = np.log(6.4) / 27.0 + freq = 200.0 * mels / 3.0 - log_t = mels >= min_log_mel - freqs[log_t] = min_log_hertz * np.exp(logstep * (mels[log_t] - min_log_mel)) + if isinstance(mels, np.ndarray): + log_region = mels >= min_log_mel + freq[log_region] = min_log_hertz * np.exp(logstep * (mels[log_region] - min_log_mel)) + elif mels >= min_log_mel: + freq = min_log_hertz * np.exp(logstep * (mels - min_log_mel)) - return freqs + return freq -def _create_triangular_filterbank( - all_freqs: np.array, - f_pts: np.array, -) -> np.array: - """Create a triangular filter bank. +def _create_triangular_filter_bank(fft_freqs: np.ndarray, filter_freqs: np.ndarray) -> np.ndarray: + """ + Creates a triangular filter bank. + Adapted from *torchaudio* and *librosa*. Args: - all_freqs (`np.array` of shape (`nb_frequency_bins`, )): - Discrete frequencies used when the STFT was computed. - f_pts (`np.array`, of shape (`nb_mel_filters`, )): - Coordinates of the middle points of the triangular filters to create. + fft_freqs (`np.ndarray` of shape `(num_frequency_bins,)`): + Discrete frequencies of the FFT bins in Hz. + filter_freqs (`np.ndarray` of shape `(num_mel_filters,)`): + Center frequencies of the triangular filters to create, in Hz. Returns: - fb (np.array): - The filter bank of size (`nb_frequency_bins`, `nb_mel_filters`). + `np.ndarray` of shape `(num_frequency_bins, num_mel_filters)` """ - # Adapted from Librosa - # calculate the difference between each filter mid point and each stft freq point in hertz - f_diff = f_pts[1:] - f_pts[:-1] # (n_filter + 1) - slopes = np.expand_dims(f_pts, 0) - np.expand_dims(all_freqs, 1) # (nb_frequency_bins, n_filter + 2) - # create overlapping triangles - zero = np.zeros(1) - down_slopes = (-1.0 * slopes[:, :-2]) / f_diff[:-1] # (nb_frequency_bins, n_filter) - up_slopes = slopes[:, 2:] / f_diff[1:] # (nb_frequency_bins, n_filter) - fb = np.maximum(zero, np.minimum(down_slopes, up_slopes)) - - return fb - - -def get_mel_filter_banks( - nb_frequency_bins: int, - nb_mel_filters: int, - frequency_min: float, - frequency_max: float, - sample_rate: int, + filter_diff = np.diff(filter_freqs) + slopes = np.expand_dims(filter_freqs, 0) - np.expand_dims(fft_freqs, 1) + down_slopes = -slopes[:, :-2] / filter_diff[:-1] + up_slopes = slopes[:, 2:] / filter_diff[1:] + return np.maximum(np.zeros(1), np.minimum(down_slopes, up_slopes)) + + +def mel_filter_bank( + num_frequency_bins: int, + num_mel_filters: int, + min_frequency: float, + max_frequency: float, + sampling_rate: int, norm: Optional[str] = None, mel_scale: str = "htk", -) -> np.array: + triangularize_in_mel_space: bool = False, +) -> np.ndarray: """ - Create a frequency bin conversion matrix used to obtain the Mel Spectrogram. 
This is called a *mel filter bank*, - and various implementation exist, which differ in the number of filters, the shape of the filters, the way the - filters are spaced, the bandwidth of the filters, and the manner in which the spectrum is warped. The goal of these + Creates a frequency bin conversion matrix used to obtain a mel spectrogram. This is called a *mel filter bank*, and + various implementation exist, which differ in the number of filters, the shape of the filters, the way the filters + are spaced, the bandwidth of the filters, and the manner in which the spectrum is warped. The goal of these features is to approximate the non-linear human perception of the variation in pitch with respect to the frequency. - This code is heavily inspired from the *torchaudio* implementation, see - [here](https://pytorch.org/audio/stable/transforms.html) for more details. - - - Tips: - - Different banks of Mel filters were introduced in the litterature. The following variation are supported: - - MFCC FB-20: introduced in 1980 by Davis and Mermelstein, it assumes a sampling frequency of 10 kHertz - and a speech bandwidth of `[0, 4600]` Hertz - - MFCC FB-24 HTK: from the Cambridge HMM Toolkit (HTK) (1995) uses a filter bank of 24 filters for a - speech bandwidth `[0, 8000]` Hertz (sampling rate ≥ 16 kHertz). - - MFCC FB-40: from the Auditory Toolbox for MATLAB written by Slaney in 1998, assumes a sampling rate - of 16 kHertz, and speech bandwidth [133, 6854] Hertz. This version also includes an area normalization. - - HFCC-E FB-29 (Human Factor Cepstral Coefficients) of Skowronski and Harris (2004), assumes sampling - rate of 12.5 kHertz and speech bandwidth [0, 6250] Hertz - - The default parameters of `torchaudio`'s mel filterbanks implement the `"htk"` filers while `torchlibrosa` - uses the `"slaney"` implementation. + + Different banks of mel filters were introduced in the literature. The following variations are supported: + + - MFCC FB-20: introduced in 1980 by Davis and Mermelstein, it assumes a sampling frequency of 10 kHz and a speech + bandwidth of `[0, 4600]` Hz. + - MFCC FB-24 HTK: from the Cambridge HMM Toolkit (HTK) (1995) uses a filter bank of 24 filters for a speech + bandwidth of `[0, 8000]` Hz. This assumes sampling rate ≥ 16 kHz. + - MFCC FB-40: from the Auditory Toolbox for MATLAB written by Slaney in 1998, assumes a sampling rate of 16 kHz and + speech bandwidth of `[133, 6854]` Hz. This version also includes area normalization. + - HFCC-E FB-29 (Human Factor Cepstral Coefficients) of Skowronski and Harris (2004), assumes a sampling rate of + 12.5 kHz and speech bandwidth of `[0, 6250]` Hz. + + This code is adapted from *torchaudio* and *librosa*. Note that the default parameters of torchaudio's + `melscale_fbanks` implement the `"htk"` filters while librosa uses the `"slaney"` implementation. Args: - nb_frequency_bins (`int`): + num_frequency_bins (`int`): Number of frequencies used to compute the spectrogram (should be the same as in `stft`). - nb_mel_filters (`int`): - Number of Mel filers to generate. - frequency_min (`float`): - Minimum frequency of interest(Hertz). - frequency_max (`float`): - Maximum frequency of interest(Hertz). - sample_rate (`int`): + num_mel_filters (`int`): + Number of mel filters to generate. + min_frequency (`float`): + Lowest frequency of interest in Hz. + max_frequency (`float`): + Highest frequency of interest in Hz. This should not exceed `sampling_rate / 2`. + sampling_rate (`int`): Sample rate of the audio waveform. 
norm (`str`, *optional*): - If "slaney", divide the triangular Mel weights by the width of the mel band (area normalization). + If `"slaney"`, divide the triangular mel weights by the width of the mel band (area normalization). mel_scale (`str`, *optional*, defaults to `"htk"`): - Scale to use: `"htk"` or `"slaney"`. + The mel frequency scale to use, `"htk"`, `"kaldi"` or `"slaney"`. + triangularize_in_mel_space (`bool`, *optional*, defaults to `False`): + If this option is enabled, the triangular filter is applied in mel space rather than frequency space. This + should be set to `true` in order to get the same results as `torchaudio` when computing mel filters. Returns: - `np.ndarray`: Triangular filter banks (fb matrix) of shape (`nb_frequency_bins`, `nb_mel_filters`). This matrix - is a projection matrix to go from a spectrogram to a Mel Spectrogram. - + `np.ndarray` of shape (`num_frequency_bins`, `num_mel_filters`): Triangular filter bank matrix. This is a + projection matrix to go from a spectrogram to a mel spectrogram. """ - if norm is not None and norm != "slaney": raise ValueError('norm must be one of None or "slaney"') - # freqency bins - all_freqs = np.linspace(0, sample_rate // 2, nb_frequency_bins) + # center points of the triangular mel filters + mel_min = hertz_to_mel(min_frequency, mel_scale=mel_scale) + mel_max = hertz_to_mel(max_frequency, mel_scale=mel_scale) + mel_freqs = np.linspace(mel_min, mel_max, num_mel_filters + 2) + filter_freqs = mel_to_hertz(mel_freqs, mel_scale=mel_scale) - # Compute mim and max frequencies in mel scale - m_min = hertz_to_mel(frequency_min, mel_scale=mel_scale) - m_max = hertz_to_mel(frequency_max, mel_scale=mel_scale) + if triangularize_in_mel_space: + # frequencies of FFT bins in Hz, but filters triangularized in mel space + fft_bin_width = sampling_rate / (num_frequency_bins * 2) + fft_freqs = hertz_to_mel(fft_bin_width * np.arange(num_frequency_bins), mel_scale=mel_scale) + filter_freqs = mel_freqs + else: + # frequencies of FFT bins in Hz + fft_freqs = np.linspace(0, sampling_rate // 2, num_frequency_bins) - # create the centers of the triangular mel filters. - m_pts = np.linspace(m_min, m_max, nb_mel_filters + 2) - f_pts = mel_to_hertz(m_pts, mel_scale=mel_scale) - - # create the filterbank - filterbank = _create_triangular_filterbank(all_freqs, f_pts) + mel_filters = _create_triangular_filter_bank(fft_freqs, filter_freqs) if norm is not None and norm == "slaney": # Slaney-style mel is scaled to be approx constant energy per channel - enorm = 2.0 / (f_pts[2 : nb_mel_filters + 2] - f_pts[:nb_mel_filters]) - filterbank *= np.expand_dims(enorm, 0) + enorm = 2.0 / (filter_freqs[2 : num_mel_filters + 2] - filter_freqs[:num_mel_filters]) + mel_filters *= np.expand_dims(enorm, 0) - if (filterbank.max(axis=0) == 0.0).any(): + if (mel_filters.max(axis=0) == 0.0).any(): warnings.warn( - "At least one mel filterbank has all zero values. " - f"The value for `nb_mel_filters` ({nb_mel_filters}) may be set too high. " - f"Or, the value for `nb_frequency_bins` ({nb_frequency_bins}) may be set too low." + "At least one mel filter has all zero values. " + f"The value for `num_mel_filters` ({num_mel_filters}) may be set too high. " + f"Or, the value for `num_frequency_bins` ({num_frequency_bins}) may be set too low." ) - return filterbank + return mel_filters + +def optimal_fft_length(window_length: int) -> int: + """ + Finds the best FFT input size for a given `window_length`. 
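A small sketch exercising the mel helpers defined above, assuming this patch is installed; the 201-bin / 80-filter / 16 kHz values are illustrative only and not tied to any particular model:

```python
import numpy as np
from transformers.audio_utils import hertz_to_mel, mel_filter_bank, mel_to_hertz

# the conversions now accept arrays and are inverses of each other
freqs = np.array([100.0, 440.0, 4000.0])
mels = hertz_to_mel(freqs, mel_scale="slaney")
assert np.allclose(mel_to_hertz(mels, mel_scale="slaney"), freqs)

# 80 triangular filters over 201 positive-frequency bins (e.g. a 400-point FFT) at 16 kHz
filters = mel_filter_bank(
    num_frequency_bins=201,
    num_mel_filters=80,
    min_frequency=0.0,
    max_frequency=8000.0,
    sampling_rate=16000,
    norm="slaney",
    mel_scale="slaney",
)
print(filters.shape)  # (201, 80): projects a spectrogram onto the mel scale
```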
This function takes a given window length and, if not + already a power of two, rounds it up to the next power or two. -def power_to_db(mel_spectrogram, top_db=None, a_min=1e-10, ref=1.0): + The FFT algorithm works fastest when the length of the input is a power of two, which may be larger than the size + of the window or analysis frame. For example, if the window is 400 samples, using an FFT input size of 512 samples + is more optimal than an FFT size of 400 samples. Using a larger FFT size does not affect the detected frequencies, + it simply gives a higher frequency resolution (i.e. the frequency bins are smaller). """ - Convert a mel spectrogram from power to db scale, this function is the numpy implementation of librosa.power_to_lb. - It computes `10 * log10(mel_spectrogram / ref)`, using basic log properties for stability. + return 2 ** int(np.ceil(np.log2(window_length))) + + +def window_function( + window_length: int, + name: str = "hann", + periodic: bool = True, + frame_length: Optional[int] = None, + center: bool = True, +) -> np.ndarray: + """ + Returns an array containing the specified window. This window is intended to be used with `stft`. + + The following window types are supported: - Tips: - - The motivation behind applying the log function on the mel spectrogram is that humans do not hear loudness on - a - linear scale. Generally to double the percieved volume of a sound we need to put 8 times as much energy into - it. - - This means that large variations in energy may not sound all that different if the sound is loud to begin - with. This compression operation makes the mel features match more closely what humans actually hear. + - `"boxcar"`: a rectangular window + - `"hamming"`: the Hamming window + - `"hann"`: the Hann window + - `"povey"`: the Povey window Args: - mel_spectrogram (`np.array`): - Input mel spectrogram. - top_db (`int`, *optional*): - The maximum decibel value. - a_min (`int`, *optional*, default to 1e-10): - Minimum value to use when cliping the mel spectrogram. - ref (`float`, *optional*, default to 1.0): - Maximum reference value used to scale the mel_spectrogram. + window_length (`int`): + The length of the window in samples. + name (`str`, *optional*, defaults to `"hann"`): + The name of the window function. + periodic (`bool`, *optional*, defaults to `True`): + Whether the window is periodic or symmetric. + frame_length (`int`, *optional*): + The length of the analysis frames in samples. Provide a value for `frame_length` if the window is smaller + than the frame length, so that it will be zero-padded. + center (`bool`, *optional*, defaults to `True`): + Whether to center the window inside the FFT buffer. Only used when `frame_length` is provided. + Returns: + `np.ndarray` of shape `(window_length,)` or `(frame_length,)` containing the window. 
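As a quick worked check of `optimal_fft_length` described above (a sketch assuming this patch), a 400-sample analysis window rounds up to a 512-point FFT, while a length that is already a power of two is returned unchanged:

```python
from transformers.audio_utils import optimal_fft_length

print(optimal_fft_length(400))   # 512
print(optimal_fft_length(1024))  # 1024
```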
+ """ + length = window_length + 1 if periodic else window_length + + if name == "boxcar": + window = np.ones(length) + elif name in ["hamming", "hamming_window"]: + window = np.hamming(length) + elif name in ["hann", "hann_window"]: + window = np.hanning(length) + elif name in ["povey"]: + window = np.power(np.hanning(length), 0.85) + else: + raise ValueError(f"Unknown window function '{name}'") + + if periodic: + window = window[:-1] + + if frame_length is None: + return window + + if window_length > frame_length: + raise ValueError( + f"Length of the window ({window_length}) may not be larger than frame_length ({frame_length})" + ) + + padded_window = np.zeros(frame_length) + offset = (frame_length - window_length) // 2 if center else 0 + padded_window[offset : offset + window_length] = window + return padded_window + + +# TODO This method does not support batching yet as we are mainly focused on inference. +def spectrogram( + waveform: np.ndarray, + window: np.ndarray, + frame_length: int, + hop_length: int, + fft_length: Optional[int] = None, + power: Optional[float] = 1.0, + center: bool = True, + pad_mode: str = "reflect", + onesided: bool = True, + preemphasis: Optional[float] = None, + mel_filters: Optional[np.ndarray] = None, + mel_floor: float = 1e-10, + log_mel: Optional[str] = None, + reference: float = 1.0, + min_value: float = 1e-10, + db_range: Optional[float] = None, + remove_dc_offset: Optional[bool] = None, + dtype: np.dtype = np.float32, +) -> np.ndarray: + """ + Calculates a spectrogram over one waveform using the Short-Time Fourier Transform. + + This function can create the following kinds of spectrograms: + + - amplitude spectrogram (`power = 1.0`) + - power spectrogram (`power = 2.0`) + - complex-valued spectrogram (`power = None`) + - log spectrogram (use `log_mel` argument) + - mel spectrogram (provide `mel_filters`) + - log-mel spectrogram (provide `mel_filters` and `log_mel`) + + How this works: + + 1. The input waveform is split into frames of size `frame_length` that are partially overlapping by `frame_length + - hop_length` samples. + 2. Each frame is multiplied by the window and placed into a buffer of size `fft_length`. + 3. The DFT is taken of each windowed frame. + 4. The results are stacked into a spectrogram. + + We make a distinction between the following "blocks" of sample data, each of which may have a different lengths: + + - The analysis frame. This is the size of the time slices that the input waveform is split into. + - The window. Each analysis frame is multiplied by the window to avoid spectral leakage. + - The FFT input buffer. The length of this determines how many frequency bins are in the spectrogram. + + In this implementation, the window is assumed to be zero-padded to have the same size as the analysis frame. A + padded window can be obtained from `window_function()`. The FFT input buffer may be larger than the analysis frame, + typically the next power of two. + + Note: This function is not optimized for speed yet. It should be mostly compatible with `librosa.stft` and + `torchaudio.functional.transforms.Spectrogram`, although it is more flexible due to the different ways spectrograms + can be constructed. + + Args: + waveform (`np.ndarray` of shape `(length,)`): + The input waveform. This must be a single real-valued, mono waveform. + window (`np.ndarray` of shape `(frame_length,)`): + The windowing function to apply, including zero-padding if necessary. 
The actual window length may be + shorter than `frame_length`, but we're assuming the array has already been zero-padded. + frame_length (`int`): + The length of the analysis frames in samples. With librosa this is always equal to `fft_length` but we also + allow smaller sizes. + hop_length (`int`): + The stride between successive analysis frames in samples. + fft_length (`int`, *optional*): + The size of the FFT buffer in samples. This determines how many frequency bins the spectrogram will have. + For optimal speed, this should be a power of two. If `None`, uses `frame_length`. + power (`float`, *optional*, defaults to 1.0): + If 1.0, returns the amplitude spectrogram. If 2.0, returns the power spectrogram. If `None`, returns + complex numbers. + center (`bool`, *optional*, defaults to `True`): + Whether to pad the waveform so that frame `t` is centered around time `t * hop_length`. If `False`, frame + `t` will start at time `t * hop_length`. + pad_mode (`str`, *optional*, defaults to `"reflect"`): + Padding mode used when `center` is `True`. Possible values are: `"constant"` (pad with zeros), `"edge"` + (pad with edge values), `"reflect"` (pads with mirrored values). + onesided (`bool`, *optional*, defaults to `True`): + If True, only computes the positive frequencies and returns a spectrogram containing `fft_length // 2 + 1` + frequency bins. If False, also computes the negative frequencies and returns `fft_length` frequency bins. + preemphasis (`float`, *optional*) + Coefficient for a low-pass filter that applies pre-emphasis before the DFT. + mel_filters (`np.ndarray` of shape `(num_freq_bins, num_mel_filters)`, *optional*): + The mel filter bank. If supplied, applies a this filter bank to create a mel spectrogram. + mel_floor (`float`, *optional*, defaults to 1e-10): + Minimum value of mel frequency banks. + log_mel (`str`, *optional*): + How to convert the spectrogram to log scale. Possible options are: `None` (don't convert), `"log"` (take + the natural logarithm) `"log10"` (take the base-10 logarithm), `"dB"` (convert to decibels). Can only be + used when `power` is not `None`. + reference (`float`, *optional*, defaults to 1.0): + Sets the input spectrogram value that corresponds to 0 dB. For example, use `np.max(spectrogram)` to set + the loudest part to 0 dB. Must be greater than zero. + min_value (`float`, *optional*, defaults to `1e-10`): + The spectrogram will be clipped to this minimum value before conversion to decibels, to avoid taking + `log(0)`. For a power spectrogram, the default of `1e-10` corresponds to a minimum of -100 dB. For an + amplitude spectrogram, the value `1e-5` corresponds to -100 dB. Must be greater than zero. + db_range (`float`, *optional*): + Sets the maximum dynamic range in decibels. For example, if `db_range = 80`, the difference between the + peak value and the smallest value will never be more than 80 dB. Must be greater than zero. + remove_dc_offset (`bool`, *optional*): + Subtract mean from waveform on each frame, applied before pre-emphasis. This should be set to `true` in + order to get the same results as `torchaudio.compliance.kaldi.fbank` when computing mel filters. + dtype (`np.dtype`, *optional*, defaults to `np.float32`): + Data type of the spectrogram tensor. If `power` is None, this argument is ignored and the dtype will be + `np.complex64`. + + Returns: + `nd.array` containing a spectrogram of shape `(num_frequency_bins, length)` for a regular spectrogram or shape + `(num_mel_filters, length)` for a mel spectrogram. 
""" - log_spec = 10 * np.log10(np.clip(mel_spectrogram, a_min=a_min, a_max=None)) - log_spec -= 10.0 * np.log10(np.maximum(a_min, ref)) - if top_db is not None: - if top_db < 0: - raise ValueError("top_db must be non-negative") - log_spec = np.clip(log_spec, min=np.maximum(log_spec) - top_db, max=np.inf) - return log_spec + window_length = len(window) + + if fft_length is None: + fft_length = frame_length + + if frame_length > fft_length: + raise ValueError(f"frame_length ({frame_length}) may not be larger than fft_length ({fft_length})") + + if window_length != frame_length: + raise ValueError(f"Length of the window ({window_length}) must equal frame_length ({frame_length})") + + if hop_length <= 0: + raise ValueError("hop_length must be greater than zero") + + if waveform.ndim != 1: + raise ValueError(f"Input waveform must have only one dimension, shape is {waveform.shape}") + + if np.iscomplexobj(waveform): + raise ValueError("Complex-valued input waveforms are not currently supported") + + # center pad the waveform + if center: + padding = [(int(frame_length // 2), int(frame_length // 2))] + waveform = np.pad(waveform, padding, mode=pad_mode) + + # promote to float64, since np.fft uses float64 internally + waveform = waveform.astype(np.float64) + window = window.astype(np.float64) + + # split waveform into frames of frame_length size + num_frames = int(1 + np.floor((waveform.size - frame_length) / hop_length)) + + num_frequency_bins = (fft_length // 2) + 1 if onesided else fft_length + spectrogram = np.empty((num_frames, num_frequency_bins), dtype=np.complex64) + + # rfft is faster than fft + fft_func = np.fft.rfft if onesided else np.fft.fft + buffer = np.zeros(fft_length) + + timestep = 0 + for frame_idx in range(num_frames): + buffer[:frame_length] = waveform[timestep : timestep + frame_length] + + if remove_dc_offset: + buffer[:frame_length] = buffer[:frame_length] - buffer[:frame_length].mean() + + if preemphasis is not None: + buffer[1:frame_length] -= preemphasis * buffer[: frame_length - 1] + buffer[0] *= 1 - preemphasis + + buffer[:frame_length] *= window + + spectrogram[frame_idx] = fft_func(buffer) + timestep += hop_length + + # note: ** is much faster than np.power + if power is not None: + spectrogram = np.abs(spectrogram, dtype=np.float64) ** power + + spectrogram = spectrogram.T + + if mel_filters is not None: + spectrogram = np.maximum(mel_floor, np.dot(mel_filters.T, spectrogram)) + + if power is not None and log_mel is not None: + if log_mel == "log": + spectrogram = np.log(spectrogram) + elif log_mel == "log10": + spectrogram = np.log10(spectrogram) + elif log_mel == "dB": + if power == 1.0: + spectrogram = amplitude_to_db(spectrogram, reference, min_value, db_range) + elif power == 2.0: + spectrogram = power_to_db(spectrogram, reference, min_value, db_range) + else: + raise ValueError(f"Cannot use log_mel option '{log_mel}' with power {power}") + else: + raise ValueError(f"Unknown log_mel option: {log_mel}") + + spectrogram = np.asarray(spectrogram, dtype) + + return spectrogram + + +def power_to_db( + spectrogram: np.ndarray, + reference: float = 1.0, + min_value: float = 1e-10, + db_range: Optional[float] = None, +) -> np.ndarray: + """ + Converts a power spectrogram to the decibel scale. This computes `10 * log10(spectrogram / reference)`, using basic + logarithm properties for numerical stability. + + The motivation behind applying the log function on the (mel) spectrogram is that humans do not hear loudness on a + linear scale. 
Generally to double the perceived volume of a sound we need to put 8 times as much energy into it. + This means that large variations in energy may not sound all that different if the sound is loud to begin with. + This compression operation makes the (mel) spectrogram features match more closely what humans actually hear. + + Based on the implementation of `librosa.power_to_db`. + + Args: + spectrogram (`np.ndarray`): + The input power (mel) spectrogram. Note that a power spectrogram has the amplitudes squared! + reference (`float`, *optional*, defaults to 1.0): + Sets the input spectrogram value that corresponds to 0 dB. For example, use `np.max(spectrogram)` to set + the loudest part to 0 dB. Must be greater than zero. + min_value (`float`, *optional*, defaults to `1e-10`): + The spectrogram will be clipped to this minimum value before conversion to decibels, to avoid taking + `log(0)`. The default of `1e-10` corresponds to a minimum of -100 dB. Must be greater than zero. + db_range (`float`, *optional*): + Sets the maximum dynamic range in decibels. For example, if `db_range = 80`, the difference between the + peak value and the smallest value will never be more than 80 dB. Must be greater than zero. + + Returns: + `np.ndarray`: the spectrogram in decibels + """ + if reference <= 0.0: + raise ValueError("reference must be greater than zero") + if min_value <= 0.0: + raise ValueError("min_value must be greater than zero") + + reference = max(min_value, reference) + + spectrogram = np.clip(spectrogram, a_min=min_value, a_max=None) + spectrogram = 10.0 * (np.log10(spectrogram) - np.log10(reference)) + + if db_range is not None: + if db_range <= 0.0: + raise ValueError("db_range must be greater than zero") + spectrogram = np.clip(spectrogram, a_min=spectrogram.max() - db_range, a_max=None) + + return spectrogram + + +def amplitude_to_db( + spectrogram: np.ndarray, + reference: float = 1.0, + min_value: float = 1e-5, + db_range: Optional[float] = None, +) -> np.ndarray: + """ + Converts an amplitude spectrogram to the decibel scale. This computes `20 * log10(spectrogram / reference)`, using + basic logarithm properties for numerical stability. + + The motivation behind applying the log function on the (mel) spectrogram is that humans do not hear loudness on a + linear scale. Generally to double the perceived volume of a sound we need to put 8 times as much energy into it. + This means that large variations in energy may not sound all that different if the sound is loud to begin with. + This compression operation makes the (mel) spectrogram features match more closely what humans actually hear. + + Args: + spectrogram (`np.ndarray`): + The input amplitude (mel) spectrogram. + reference (`float`, *optional*, defaults to 1.0): + Sets the input spectrogram value that corresponds to 0 dB. For example, use `np.max(spectrogram)` to set + the loudest part to 0 dB. Must be greater than zero. + min_value (`float`, *optional*, defaults to `1e-5`): + The spectrogram will be clipped to this minimum value before conversion to decibels, to avoid taking + `log(0)`. The default of `1e-5` corresponds to a minimum of -100 dB. Must be greater than zero. + db_range (`float`, *optional*): + Sets the maximum dynamic range in decibels. For example, if `db_range = 80`, the difference between the + peak value and the smallest value will never be more than 80 dB. Must be greater than zero. 
+ + Returns: + `np.ndarray`: the spectrogram in decibels + """ + if reference <= 0.0: + raise ValueError("reference must be greater than zero") + if min_value <= 0.0: + raise ValueError("min_value must be greater than zero") + + reference = max(min_value, reference) + + spectrogram = np.clip(spectrogram, a_min=min_value, a_max=None) + spectrogram = 20.0 * (np.log10(spectrogram) - np.log10(reference)) + + if db_range is not None: + if db_range <= 0.0: + raise ValueError("db_range must be greater than zero") + spectrogram = np.clip(spectrogram, a_min=spectrogram.max() - db_range, a_max=None) + + return spectrogram + + +### deprecated functions below this line ### + + +def get_mel_filter_banks( + nb_frequency_bins: int, + nb_mel_filters: int, + frequency_min: float, + frequency_max: float, + sample_rate: int, + norm: Optional[str] = None, + mel_scale: str = "htk", +) -> np.array: + warnings.warn( + "The function `get_mel_filter_banks` is deprecated and will be removed in version 4.31.0 of Transformers", + FutureWarning, + ) + return mel_filter_bank( + num_frequency_bins=nb_frequency_bins, + num_mel_filters=nb_mel_filters, + min_frequency=frequency_min, + max_frequency=frequency_max, + sampling_rate=sample_rate, + norm=norm, + mel_scale=mel_scale, + ) -# TODO @ArthurZucker: This method does not support batching yet as we are mainly focus on inference. def fram_wave(waveform: np.array, hop_length: int = 160, fft_window_size: int = 400, center: bool = True): """ In order to compute the short time fourier transform, the waveform needs to be split in overlapping windowed @@ -270,6 +627,10 @@ def fram_wave(waveform: np.array, hop_length: int = 160, fft_window_size: int = framed_waveform (`np.array` of shape `(waveform.shape // hop_length , fft_window_size)`): The framed waveforms that can be fed to `np.fft`. """ + warnings.warn( + "The function `fram_wave` is deprecated and will be removed in version 4.31.0 of Transformers", + FutureWarning, + ) frames = [] for i in range(0, waveform.shape[0] + 1, hop_length): if center: @@ -298,9 +659,6 @@ def fram_wave(waveform: np.array, hop_length: int = 160, fft_window_size: int = return frames -# TODO @ArthurZucker: This method does not support batching yet as we are mainly focus on inference. - - def stft(frames: np.array, windowing_function: np.array, fft_window_size: int = None): """ Calculates the complex Short-Time Fourier Transform (STFT) of the given framed signal. 
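The deprecated `fram_wave`/`stft` pair above is replaced by the single `spectrogram` function; a rough migration sketch, assuming the helpers keep living in `transformers.audio_utils` and reusing the old defaults (`hop_length=160`, `fft_window_size=400`):

```
import numpy as np
from transformers.audio_utils import spectrogram  # new helper added in this diff

waveform = np.random.randn(16000).astype(np.float32)

# previously: np.abs(stft(fram_wave(waveform, hop_length=160, fft_window_size=400), np.hanning(400)))
magnitudes = spectrogram(
    waveform,
    window=np.hanning(400),
    frame_length=400,
    hop_length=160,
    power=1.0,  # magnitude spectrogram; use 2.0 for power
)
print(magnitudes.shape)  # (num_frequency_bins, num_frames)
```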
Should give the same results @@ -337,6 +695,10 @@ def stft(frames: np.array, windowing_function: np.array, fft_window_size: int = spectrogram (`np.ndarray`): A spectrogram of shape `(num_frames, nb_frequency_bins)` obtained using the STFT algorithm """ + warnings.warn( + "The function `stft` is deprecated and will be removed in version 4.31.0 of Transformers", + FutureWarning, + ) frame_size = frames.shape[1] if fft_window_size is None: @@ -355,5 +717,5 @@ def stft(frames: np.array, windowing_function: np.array, fft_window_size: int = np.multiply(frame, windowing_function, out=fft_signal[:frame_size]) else: fft_signal[:frame_size] = frame - spectrogram[f] = fft(fft_signal, axis=0)[:nb_frequency_bins] + spectrogram[f] = np.fft.fft(fft_signal, axis=0)[:nb_frequency_bins] return spectrogram.T diff --git a/src/transformers/benchmark/benchmark_args_utils.py b/src/transformers/benchmark/benchmark_args_utils.py index d9233906d281c9..b63d792986c619 100644 --- a/src/transformers/benchmark/benchmark_args_utils.py +++ b/src/transformers/benchmark/benchmark_args_utils.py @@ -147,11 +147,12 @@ def to_json_string(self): return json.dumps(dataclasses.asdict(self), indent=2) @property - def model_names(self): - assert len(self.models) > 0, ( - "Please make sure you provide at least one model name / model identifier, *e.g.* `--models" - " bert-base-cased` or `args.models = ['bert-base-cased']." - ) + def model_names(self) -> List[str]: + if len(self.models) <= 0: + raise ValueError( + "Please make sure you provide at least one model name / model identifier, *e.g.* `--models" + " google-bert/bert-base-cased` or `args.models = ['google-bert/bert-base-cased']." + ) return self.models @property diff --git a/src/transformers/benchmark/benchmark_tf.py b/src/transformers/benchmark/benchmark_tf.py index 126172ffbd3000..c813591be0be07 100644 --- a/src/transformers/benchmark/benchmark_tf.py +++ b/src/transformers/benchmark/benchmark_tf.py @@ -60,9 +60,10 @@ def run_in_graph_mode(*args, **kwargs): return func(*args, **kwargs) if do_eager_mode is True: - assert ( - use_xla is False - ), "Cannot run model in XLA, if `args.eager_mode` is set to `True`. Please set `args.eager_mode=False`." + if use_xla is not False: + raise ValueError( + "Cannot run model in XLA, if `args.eager_mode` is set to `True`. Please set `args.eager_mode=False`." + ) return run_in_eager_mode else: return run_in_graph_mode @@ -88,13 +89,15 @@ def framework_version(self): def _inference_speed(self, model_name: str, batch_size: int, sequence_length: int) -> float: # initialize GPU on separate process strategy = self.args.strategy - assert strategy is not None, "A device strategy has to be initialized before using TensorFlow." + if strategy is None: + raise ValueError("A device strategy has to be initialized before using TensorFlow.") _inference = self._prepare_inference_func(model_name, batch_size, sequence_length) return self._measure_speed(_inference) def _train_speed(self, model_name: str, batch_size: int, sequence_length: int) -> float: strategy = self.args.strategy - assert strategy is not None, "A device strategy has to be initialized before using TensorFlow." 
+ if strategy is None: + raise ValueError("A device strategy has to be initialized before using TensorFlow.") _train = self._prepare_train_func(model_name, batch_size, sequence_length) return self._measure_speed(_train) @@ -105,7 +108,8 @@ def _inference_memory( if self.args.is_gpu: tf.config.experimental.set_memory_growth(self.args.gpu_list[self.args.device_idx], True) strategy = self.args.strategy - assert strategy is not None, "A device strategy has to be initialized before using TensorFlow." + if strategy is None: + raise ValueError("A device strategy has to be initialized before using TensorFlow.") _inference = self._prepare_inference_func(model_name, batch_size, sequence_length) return self._measure_memory(_inference) @@ -115,7 +119,8 @@ def _train_memory( if self.args.is_gpu: tf.config.experimental.set_memory_growth(self.args.gpu_list[self.args.device_idx], True) strategy = self.args.strategy - assert strategy is not None, "A device strategy has to be initialized before using TensorFlow." + if strategy is None: + raise ValueError("A device strategy has to be initialized before using TensorFlow.") _train = self._prepare_train_func(model_name, batch_size, sequence_length) return self._measure_memory(_train) @@ -164,9 +169,8 @@ def encoder_forward(): def _prepare_train_func(self, model_name: str, batch_size: int, sequence_length: int) -> Callable[[], None]: config = self.config_dict[model_name] - assert ( - self.args.eager_mode is False - ), "Training cannot be done in eager mode. Please make sure that `args.eager_mode = False`." + if self.args.eager_mode is not False: + raise ValueError("Training cannot be done in eager mode. Please make sure that `args.eager_mode = False`.") if self.args.fp16: raise NotImplementedError("Mixed precision is currently not supported.") @@ -240,10 +244,11 @@ def _measure_memory(self, func: Callable[[], None]) -> [Memory, MemorySummary]: with self.args.strategy.scope(): try: if self.args.trace_memory_line_by_line: - assert self.args.eager_mode, ( - "`args.eager_mode` is set to `False`. Make sure to run model in eager mode to measure memory" - " consumption line by line." - ) + if not self.args.eager_mode: + raise ValueError( + "`args.eager_mode` is set to `False`. Make sure to run model in eager mode to measure memory" + " consumption line by line." 
+ ) trace = start_memory_tracing("transformers") if self.args.is_tpu: diff --git a/src/transformers/benchmark/benchmark_utils.py b/src/transformers/benchmark/benchmark_utils.py index a6c6353c19fb37..a71b1fb65a23ef 100644 --- a/src/transformers/benchmark/benchmark_utils.py +++ b/src/transformers/benchmark/benchmark_utils.py @@ -557,9 +557,9 @@ def stop_memory_tracing( cumulative_memory_dict[frame][2] += cpu_gpu_mem_inc cumulative_memory = sorted( - list(cumulative_memory_dict.items()), key=lambda x: x[1][2], reverse=True + cumulative_memory_dict.items(), key=lambda x: x[1][2], reverse=True ) # order by the total CPU + GPU memory increase - cumulative_memory = list( + cumulative_memory = [ MemoryState( frame=frame, cpu=Memory(cpu_mem_inc), @@ -567,7 +567,7 @@ def stop_memory_tracing( cpu_gpu=Memory(cpu_gpu_mem_inc), ) for frame, (cpu_mem_inc, gpu_mem_inc, cpu_gpu_mem_inc) in cumulative_memory - ) + ] memory_curr_trace = sorted(memory_curr_trace, key=lambda x: x.cpu_gpu.bytes, reverse=True) @@ -610,7 +610,7 @@ def __init__(self, args: BenchmarkArguments = None, configs: PretrainedConfig = model_name: AutoConfig.from_pretrained(model_name) for model_name in self.args.model_names } else: - self.config_dict = {model_name: config for model_name, config in zip(self.args.model_names, configs)} + self.config_dict = dict(zip(self.args.model_names, configs)) warnings.warn( f"The class {self.__class__} is deprecated. Hugging Face Benchmarking utils" @@ -890,7 +890,8 @@ def save_to_csv(self, result_dict, filename): return self.print_fn("Saving results to csv.") with open(filename, mode="w") as csv_file: - assert len(self.args.model_names) > 0, f"At least 1 model should be defined, but got {self.model_names}" + if len(self.args.model_names) <= 0: + raise ValueError(f"At least 1 model should be defined, but got {self.model_names}") fieldnames = ["model", "batch_size", "sequence_length"] writer = csv.DictWriter(csv_file, fieldnames=fieldnames + ["result"]) diff --git a/src/transformers/cache_utils.py b/src/transformers/cache_utils.py new file mode 100644 index 00000000000000..abdc3c7c0707bc --- /dev/null +++ b/src/transformers/cache_utils.py @@ -0,0 +1,417 @@ +from dataclasses import dataclass +from typing import Any, Dict, List, Optional, Tuple + +import torch + +from .configuration_utils import PretrainedConfig + + +@dataclass +class Cache: + """ + Base, abstract class for all caches. The actual data structure is specific to each subclass. + """ + + def update( + self, + key_states: torch.Tensor, + value_states: torch.Tensor, + layer_idx: int, + cache_kwargs: Optional[Dict[str, Any]] = None, + ) -> Tuple[torch.Tensor, torch.Tensor]: + """ + Updates the cache with the new `key_states` and `value_states` for the layer `layer_idx`. + + Parameters: + key_states (`torch.Tensor`): + The new key states to cache. + value_states (`torch.Tensor`): + The new value states to cache. + layer_idx (`int`): + The index of the layer to cache the states for. + cache_kwargs (`Dict[str, Any]`, `optional`): + Additional arguments for the cache subclass. These are specific to each subclass and allow new types of + cache to be created. + + Return: + A tuple containing the updated key and value states. + """ + raise NotImplementedError("Make sure to implement `update` in a subclass.") + + def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: + """Returns the sequence length of the cached states. 
A layer index can be optionally passed.""" + raise NotImplementedError("Make sure to implement `get_seq_length` in a subclass.") + + def get_max_length(self) -> Optional[int]: + """Returns the maximum sequence length of the cached states, if there is any.""" + raise NotImplementedError("Make sure to implement `get_max_length` in a subclass.") + + def get_usable_length(self, new_seq_length: int, layer_idx: Optional[int] = 0) -> int: + """Given the sequence length of the new inputs, returns the usable length of the cache.""" + # Cache without size limit -> all cache is usable + # Cache with size limit -> if the length cache plus the length of the new inputs is larger the maximum cache + # length, we will need to evict part of the cache (and thus not all cache is usable) + max_length = self.get_max_length() + previous_seq_length = self.get_seq_length(layer_idx) + if max_length is not None and previous_seq_length + new_seq_length > max_length: + return max_length - new_seq_length + return previous_seq_length + + +class DynamicCache(Cache): + """ + A cache that grows dynamically as more tokens are generated. This is the default for generative models. + + It stores the Key and Value states as a list of tensors, one for each layer. The expected shape for each tensor is + `[batch_size, num_heads, seq_len, head_dim]`. + """ + + def __init__(self) -> None: + self.key_cache: List[torch.Tensor] = [] + self.value_cache: List[torch.Tensor] = [] + self.seen_tokens = 0 # Used in `generate` to keep tally of how many tokens the cache has seen + + def __getitem__(self, layer_idx: int) -> List[Tuple[torch.Tensor]]: + """ + Support for backwards-compatible `past_key_value` indexing, e.g. `past_key_value[0][0].shape[2]` to get the + sequence length. + """ + if layer_idx < len(self): + return (self.key_cache[layer_idx], self.value_cache[layer_idx]) + else: + raise KeyError(f"Cache only has {len(self)} layers, attempted to access layer with index {layer_idx}") + + def __iter__(self): + """ + Support for backwards-compatible `past_key_value` iteration, e.g. `for x in past_key_value:` to iterate over + keys and values + """ + for layer_idx in range(len(self)): + yield (self.key_cache[layer_idx], self.value_cache[layer_idx]) + + def __len__(self): + """ + Support for backwards-compatible `past_key_value` length, e.g. `len(past_key_value)`. This value corresponds + to the number of layers in the model. + """ + return len(self.key_cache) + + def update( + self, + key_states: torch.Tensor, + value_states: torch.Tensor, + layer_idx: int, + cache_kwargs: Optional[Dict[str, Any]] = None, + ) -> Tuple[torch.Tensor, torch.Tensor]: + """ + Updates the cache with the new `key_states` and `value_states` for the layer `layer_idx`. + + Parameters: + key_states (`torch.Tensor`): + The new key states to cache. + value_states (`torch.Tensor`): + The new value states to cache. + layer_idx (`int`): + The index of the layer to cache the states for. + cache_kwargs (`Dict[str, Any]`, `optional`): + Additional arguments for the cache subclass. No additional arguments are used in `DynamicCache`. + + Return: + A tuple containing the updated key and value states. 
+ """ + # Update the number of seen tokens + if layer_idx == 0: + self.seen_tokens += key_states.shape[-2] + + # Update the cache + if len(self.key_cache) <= layer_idx: + self.key_cache.append(key_states) + self.value_cache.append(value_states) + else: + self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2) + self.value_cache[layer_idx] = torch.cat([self.value_cache[layer_idx], value_states], dim=-2) + + return self.key_cache[layer_idx], self.value_cache[layer_idx] + + def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: + """Returns the sequence length of the cached states. A layer index can be optionally passed.""" + if len(self.key_cache) <= layer_idx: + return 0 + return self.key_cache[layer_idx].shape[-2] + + def get_max_length(self) -> Optional[int]: + """Returns the maximum sequence length of the cached states. DynamicCache does not have a maximum length.""" + return None + + def reorder_cache(self, beam_idx: torch.LongTensor): + """Reorders the cache for beam search, given the selected beam indices.""" + for layer_idx in range(len(self.key_cache)): + device = self.key_cache[layer_idx].device + self.key_cache[layer_idx] = self.key_cache[layer_idx].index_select(0, beam_idx.to(device)) + device = self.value_cache[layer_idx].device + self.value_cache[layer_idx] = self.value_cache[layer_idx].index_select(0, beam_idx.to(device)) + + def to_legacy_cache(self) -> Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]: + """Converts the `DynamicCache` instance into the its equivalent in the legacy cache format.""" + legacy_cache = () + for layer_idx in range(len(self)): + legacy_cache += ((self.key_cache[layer_idx], self.value_cache[layer_idx]),) + return legacy_cache + + @classmethod + def from_legacy_cache(cls, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None) -> "DynamicCache": + """Converts a cache in the legacy cache format into an equivalent `DynamicCache`.""" + cache = cls() + if past_key_values is not None: + for layer_idx in range(len(past_key_values)): + key_states, value_states = past_key_values[layer_idx] + cache.update(key_states, value_states, layer_idx) + return cache + + +class SinkCache(Cache): + """ + A cache that as described in the [Attention Sinks paper](https://arxiv.org/abs/2309.17453). It allows the model to + generate beyond the length of its context window, without losing fluency in the conversation. As it discards past + tokens, the model will lose the ability to generate tokens that depend on the context that was discarded. + + It stores the Key and Value states as a list of tensors, one for each layer. The expected shape for each tensor is + `[batch_size, num_heads, seq_len, head_dim]`. + + Parameters: + window_length (`int`): + The length of the context window. + num_sink_tokens (`int`): + The number of sink tokens. See the original paper for more information. 
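`DynamicCache` above is a drop-in replacement for the legacy tuple-of-tuples `past_key_values`. A short sketch of the API it exposes (key/value shapes are `[batch_size, num_heads, seq_len, head_dim]`):

```
import torch
from transformers.cache_utils import DynamicCache  # module added in this diff

cache = DynamicCache()

# prefill: layer 0 stores 5 tokens
cache.update(torch.randn(1, 8, 5, 64), torch.randn(1, 8, 5, 64), layer_idx=0)
assert cache.get_seq_length(0) == 5

# decoding step: one more token is concatenated along the sequence dimension
cache.update(torch.randn(1, 8, 1, 64), torch.randn(1, 8, 1, 64), layer_idx=0)
assert cache.get_seq_length(0) == 6

# interoperability with the legacy format
legacy = cache.to_legacy_cache()  # tuple of (key, value) per layer
assert DynamicCache.from_legacy_cache(legacy).get_seq_length(0) == 6
```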
+ """ + + def __init__(self, window_length: int, num_sink_tokens: int) -> None: + self.key_cache: List[torch.Tensor] = [] + self.value_cache: List[torch.Tensor] = [] + self.window_length = window_length + self.num_sink_tokens = num_sink_tokens + self.cos_sin_cache = {} + self.seen_tokens = 0 # Used in `generate` to keep tally of how many tokens the cache has seen + + @staticmethod + def _rotate_half(x): + x1 = x[..., : x.shape[-1] // 2] + x2 = x[..., x.shape[-1] // 2 :] + return torch.cat((-x2, x1), dim=-1) + + def _apply_key_rotary_pos_emb( + self, key_states: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor + ) -> torch.Tensor: + rotated_key_states = (key_states * cos) + (self._rotate_half(key_states) * sin) + return rotated_key_states + + def _get_rerotation_cos_sin( + self, key_states: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor + ) -> Tuple[torch.Tensor, torch.Tensor]: + if key_states.shape[-2] not in self.cos_sin_cache: + # Upcast to float32 temporarily for better accuracy + cos = cos.to(torch.float32) + sin = sin.to(torch.float32) + + # Compute the cos and sin required for back- and forward-rotating to one position earlier in the sequence + original_cos = cos[self.num_sink_tokens + key_states.shape[-2] :] + shifted_cos = cos[self.num_sink_tokens : -key_states.shape[-2]] + original_sin = sin[self.num_sink_tokens + key_states.shape[-2] :] + shifted_sin = sin[self.num_sink_tokens : -key_states.shape[-2]] + rerotation_cos = original_cos * shifted_cos + original_sin * shifted_sin + rerotation_sin = -original_sin * shifted_cos + original_cos * shifted_sin + + self.cos_sin_cache[key_states.shape[-2]] = ( + rerotation_cos.to(key_states.dtype).unsqueeze(0), + rerotation_sin.to(key_states.dtype).unsqueeze(0), + ) + return self.cos_sin_cache[key_states.shape[-2]] + + def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: + """Returns the sequence length of the cached states. A layer index can be optionally passed.""" + # Workaround to make 'key_states.shape[-2] + past_key_value.get_seq_length(self.layer_idx)' <= window_length + if len(self.key_cache) <= layer_idx: + return 0 + return self.key_cache[layer_idx].shape[-2] + + def get_max_length(self) -> Optional[int]: + """Returns the maximum sequence length of the cached states.""" + return self.window_length + + def update( + self, + key_states: torch.Tensor, + value_states: torch.Tensor, + layer_idx: int, + cache_kwargs: Optional[Dict[str, Any]] = None, + ) -> Tuple[torch.Tensor, torch.Tensor]: + """ + Updates the cache with the new `key_states` and `value_states` for the layer `layer_idx`. + + Parameters: + key_states (`torch.Tensor`): + The new key states to cache. + value_states (`torch.Tensor`): + The new value states to cache. + layer_idx (`int`): + The index of the layer to cache the states for. + cache_kwargs (`Dict[str, Any]`, `optional`): + Additional arguments for the cache subclass. The following arguments can be used in `SinkCache`: `sin`, + `cos` and `partial_rotation_size`. These arguments are used with models using RoPE, to recompute the + rotation as the tokens are shifted. + + Return: + A tuple containing the updated key and value states. + """ + # Optional kwargs for `SinkCache` -- needed on models using RoPE. `partial_rotation_size` is used on models + # with partially rotated position embeddings, like Phi or Persimmon. 
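The `_get_rerotation_cos_sin` helper above is the angle-difference identity in disguise: with `original_cos/sin` taken at a key's old position and `shifted_cos/sin` at the position it is moved to, the combination equals `cos(shifted - original)` and `sin(shifted - original)`, i.e. a single RoPE rotation by the amount the key is shifted. A quick numerical check (standalone, no cache involved):

```
import torch

original = torch.rand(16) * 6.28  # angles at the keys' current positions
shifted = torch.rand(16) * 6.28   # angles at the positions they are moved to

rerotation_cos = original.cos() * shifted.cos() + original.sin() * shifted.sin()
rerotation_sin = -original.sin() * shifted.cos() + original.cos() * shifted.sin()

assert torch.allclose(rerotation_cos, (shifted - original).cos(), atol=1e-6)
assert torch.allclose(rerotation_sin, (shifted - original).sin(), atol=1e-6)
```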
+ sin = cache_kwargs.get("sin") + cos = cache_kwargs.get("cos") + partial_rotation_size = cache_kwargs.get("partial_rotation_size") + using_rope = cos is not None and sin is not None + + # Update the number of seen tokens + if layer_idx == 0: + self.seen_tokens += key_states.shape[-2] + + # [bsz, num_heads, seq_len, head_dim] + if len(self.key_cache) <= layer_idx: + # Empty cache + self.key_cache.append(key_states) + self.value_cache.append(value_states) + + elif key_states.shape[-2] + self.get_seq_length(layer_idx) < self.window_length: + # Growing cache + self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2) + self.value_cache[layer_idx] = torch.cat([self.value_cache[layer_idx], value_states], dim=-2) + + else: + # Shifting cache + keys_to_keep = self.key_cache[layer_idx][ + :, :, -self.window_length + self.num_sink_tokens + key_states.shape[-2] : + ] + + # On RoPE models, we need to recompute the Key rotation as the tokens are shifted + if using_rope: + rerotation_cos, rerotation_sin = self._get_rerotation_cos_sin( + key_states, cos[: self.window_length], sin[: self.window_length] + ) + if partial_rotation_size is not None: + keys_to_keep, keys_pass = ( + keys_to_keep[..., :partial_rotation_size], + keys_to_keep[..., partial_rotation_size:], + ) + keys_to_keep = self._apply_key_rotary_pos_emb(keys_to_keep, rerotation_cos, rerotation_sin) + if partial_rotation_size is not None: + keys_to_keep = torch.cat((keys_to_keep, keys_pass), dim=-1) + + # Concatenate sink tokens, shifted & rotated tokens (if needed), and new tokens + sink_keys = self.key_cache[layer_idx][:, :, : self.num_sink_tokens] + self.key_cache[layer_idx] = torch.cat([sink_keys, keys_to_keep, key_states], dim=-2) + + sink_values = self.value_cache[layer_idx][:, :, : self.num_sink_tokens] + values_to_keep = self.value_cache[layer_idx][ + :, :, -self.window_length + self.num_sink_tokens + value_states.shape[-2] : + ] + self.value_cache[layer_idx] = torch.cat([sink_values, values_to_keep, value_states], dim=-2) + + return self.key_cache[layer_idx], self.value_cache[layer_idx] + + def reorder_cache(self, beam_idx: torch.LongTensor): + """Reorders the cache for beam search, given the selected beam indices.""" + for layer_idx in range(len(self.key_cache)): + device = self.key_cache[layer_idx].device + self.key_cache[layer_idx] = self.key_cache[layer_idx].index_select(0, beam_idx.to(device)) + device = self.value_cache[layer_idx].device + self.value_cache[layer_idx] = self.value_cache[layer_idx].index_select(0, beam_idx.to(device)) + + +class StaticCache(Cache): + """ + Static Cache class to be used with `torch.compile(model)`. + + Parameters: + config (`PretrainedConfig): + The configuration file defining the `max_position_embeddings`, `hidden_size` and `num_attention_heads` + required to initialize the static cache. + max_batch_size (`int`): + The maximum batch size with which the model will be used. + max_cache_len (`int`): + The maximum sequence length with which the model will be used. + device (`torch.device`): + The device on which the cache should be initialized. Should be the same as the layer. + dtype (*optional*, defaults to `torch.float32`): + The default `dtype` to use when initializing the layer. 
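Before moving on to `StaticCache`, here is the eviction behaviour of `SinkCache` above in a toy setting. The RoPE re-rotation is skipped by passing an empty `cache_kwargs` (a real model would pass `sin`, `cos` and optionally `partial_rotation_size`); the total length grows until it reaches `window_length` and then stays there:

```
import torch
from transformers.cache_utils import SinkCache  # module added in this diff

cache = SinkCache(window_length=8, num_sink_tokens=2)
for _ in range(12):
    k = torch.randn(1, 4, 1, 16)  # one new token per step: [batch, heads, 1, head_dim]
    v = torch.randn(1, 4, 1, 16)
    cache.update(k, v, layer_idx=0, cache_kwargs={})  # no RoPE kwargs in this toy check

# 2 sink tokens + the most recent tokens, capped at window_length
assert cache.get_seq_length(0) == 8 == cache.get_max_length()
```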
+ """ + + def __init__(self, config: PretrainedConfig, max_batch_size: int, max_cache_len: int, device, dtype=None) -> None: + super().__init__() + self.max_batch_size = max_batch_size + self.max_cache_len = config.max_position_embeddings if max_cache_len is None else max_cache_len + self.head_dim = config.hidden_size // config.num_attention_heads + self.dtype = dtype if dtype is not None else torch.float32 + self.num_key_value_heads = ( + config.num_attention_heads if config.num_key_value_heads is None else config.num_key_value_heads + ) + + cache_shape = (max_batch_size, self.num_key_value_heads, self.max_cache_len, self.head_dim) + self.key_cache: torch.Tensor = torch.zeros(cache_shape, dtype=self.dtype, device=device) + self.value_cache: torch.Tensor = torch.zeros(cache_shape, dtype=self.dtype, device=device) + self.seen_tokens = 0 + + def update( + self, + key_states: torch.Tensor, + value_states: torch.Tensor, + layer_idx: int, + cache_kwargs: Optional[Dict[str, Any]] = None, + ) -> Tuple[torch.Tensor, torch.Tensor]: + """ + Updates the cache with the new `key_states` and `value_states` for the layer `layer_idx`. + It is VERY important to index using a tensor, otherwise you introduce a copy to the device. + + Parameters: + key_states (`torch.Tensor`): + The new key states to cache. + value_states (`torch.Tensor`): + The new value states to cache. + layer_idx (`int`): + The index of the layer to cache the states for. Kept for backward compatibility + cache_kwargs (`Dict[str, Any]`, `optional`): + Additional arguments for the cache subclass. The `StaticCache` just needs the `q_len` + to know how much of the cache it should overwrite. + + Return: + A tuple containing the updated key and value states. + """ + new_cache_positions = cache_kwargs.get("cache_position") + k_out = self.key_cache + v_out = self.value_cache + + k_out[:, :, new_cache_positions] = key_states + v_out[:, :, new_cache_positions] = value_states + + self.seen_tokens += key_states.shape[2] + return k_out, v_out + + def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: + """Returns the sequence length of the cached states that were seen by the model. `layer_idx` kept for BC""" + return self.seen_tokens + + def get_usable_length(self, new_sequence_length=None, layer_idx: Optional[int] = 0) -> int: + return self.seen_tokens + + def get_max_length(self) -> Optional[int]: + """Returns the maximum sequence length of the cached states. DynamicCache does not have a maximum length.""" + return self.max_cache_len + + def reorder_cache(self, beam_idx: torch.LongTensor): + """Reorders the cache for beam search, given the selected beam indices.""" + device = self.key_cache.device + self.key_cache = self.key_cache.index_select(0, beam_idx.to(device)) + device = self.value_cache.device + self.value_cache = self.value_cache.index_select(0, beam_idx.to(device)) + + def to_legacy_cache(self): + """Dummy function for BC. 
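A minimal sketch of how `StaticCache` above is driven: the key/value buffers are allocated once in `__init__`, and `update` overwrites the slots given through `cache_kwargs["cache_position"]`. The config object here is a hypothetical stand-in exposing only the attributes the constructor reads:

```
from types import SimpleNamespace

import torch
from transformers.cache_utils import StaticCache  # module added in this diff

config = SimpleNamespace(
    max_position_embeddings=32, hidden_size=64, num_attention_heads=4, num_key_value_heads=4
)
cache = StaticCache(config, max_batch_size=1, max_cache_len=16, device="cpu", dtype=torch.float32)

# prefill with 3 tokens, then decode one more token at position 3
k, v = torch.randn(1, 4, 3, 16), torch.randn(1, 4, 3, 16)  # head_dim = 64 / 4 = 16
cache.update(k, v, layer_idx=0, cache_kwargs={"cache_position": torch.arange(3)})
k1, v1 = torch.randn(1, 4, 1, 16), torch.randn(1, 4, 1, 16)
cache.update(k1, v1, layer_idx=0, cache_kwargs={"cache_position": torch.tensor([3])})

assert cache.get_seq_length() == 4
assert cache.key_cache.shape == (1, 4, 16, 16)  # pre-allocated, written in place
```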
We have to keep it because otherwise the call in the forward of models will break it""" + return None diff --git a/src/transformers/commands/add_new_model.py b/src/transformers/commands/add_new_model.py index 85d053a14873a3..87949827d9f884 100644 --- a/src/transformers/commands/add_new_model.py +++ b/src/transformers/commands/add_new_model.py @@ -183,8 +183,8 @@ def remove_copy_lines(path): os.remove(f"{directory}/test_modeling_flax_{lowercase_model_name}.py") shutil.move( - f"{directory}/{lowercase_model_name}.mdx", - f"{path_to_transformer_root}/docs/source/en/model_doc/{lowercase_model_name}.mdx", + f"{directory}/{lowercase_model_name}.md", + f"{path_to_transformer_root}/docs/source/en/model_doc/{lowercase_model_name}.md", ) shutil.move( diff --git a/src/transformers/commands/add_new_model_like.py b/src/transformers/commands/add_new_model_like.py index 5dd5e7dcb82859..3b7fcdf19f869f 100644 --- a/src/transformers/commands/add_new_model_like.py +++ b/src/transformers/commands/add_new_model_like.py @@ -23,9 +23,10 @@ from pathlib import Path from typing import Any, Callable, Dict, List, Optional, Pattern, Tuple, Union -import transformers.models.auto as auto_module -from transformers.models.auto.configuration_auto import model_type_to_module_name +import yaml +from ..models import auto as auto_module +from ..models.auto.configuration_auto import model_type_to_module_name from ..utils import is_flax_available, is_tf_available, is_torch_available, logging from . import BaseTransformersCLICommand @@ -128,7 +129,7 @@ def find_indent(line: str) -> int: """ Returns the number of spaces that start a line indent. """ - search = re.search("^(\s*)(?:\S|$)", line) + search = re.search(r"^(\s*)(?:\S|$)", line) if search is None: return 0 return len(search.groups()[0]) @@ -174,6 +175,56 @@ def parse_module_content(content: str) -> List[str]: return objects +def extract_block(content: str, indent_level: int = 0) -> str: + """Return the first block in `content` with the indent level `indent_level`. + + The first line in `content` should be indented at `indent_level` level, otherwise an error will be thrown. + + This method will immediately stop the search when a (non-empty) line with indent level less than `indent_level` is + encountered. + + Args: + content (`str`): The content to parse + indent_level (`int`, *optional*, default to 0): The indent level of the blocks to search for + + Returns: + `str`: The first block in `content` with the indent level `indent_level`. + """ + current_object = [] + lines = content.split("\n") + # Doc-styler takes everything between two triple quotes in docstrings, so we need a fake """ here to go with this. + end_markers = [")", "]", "}", '"""'] + + for idx, line in enumerate(lines): + if idx == 0 and indent_level > 0 and not is_empty_line(line) and find_indent(line) != indent_level: + raise ValueError( + f"When `indent_level > 0`, the first line in `content` should have indent level {indent_level}. Got " + f"{find_indent(line)} instead." 
+ ) + + if find_indent(line) < indent_level and not is_empty_line(line): + break + + # End of an object + is_valid_object = len(current_object) > 0 + if ( + not is_empty_line(line) + and not line.endswith(":") + and find_indent(line) == indent_level + and is_valid_object + ): + # Closing parts should be included in current object + if line.lstrip() in end_markers: + current_object.append(line) + return "\n".join(current_object) + else: + current_object.append(line) + + # Add last object + if len(current_object) > 0: + return "\n".join(current_object) + + def add_content_to_text( text: str, content: str, @@ -403,12 +454,53 @@ def get_module_from_file(module_file: Union[str, os.PathLike]) -> str: _re_class_func = re.compile(r"^(?:class|def)\s+([^\s:\(]+)\s*(?:\(|\:)", flags=re.MULTILINE) +def remove_attributes(obj, target_attr): + """Remove `target_attr` in `obj`.""" + lines = obj.split(os.linesep) + + target_idx = None + for idx, line in enumerate(lines): + # search for assignment + if line.lstrip().startswith(f"{target_attr} = "): + target_idx = idx + break + # search for function/method definition + elif line.lstrip().startswith(f"def {target_attr}("): + target_idx = idx + break + + # target not found + if target_idx is None: + return obj + + line = lines[target_idx] + indent_level = find_indent(line) + # forward pass to find the ending of the block (including empty lines) + parsed = extract_block("\n".join(lines[target_idx:]), indent_level) + num_lines = len(parsed.split("\n")) + for idx in range(num_lines): + lines[target_idx + idx] = None + + # backward pass to find comments or decorator + for idx in range(target_idx - 1, -1, -1): + line = lines[idx] + if (line.lstrip().startswith("#") or line.lstrip().startswith("@")) and find_indent(line) == indent_level: + lines[idx] = None + else: + break + + new_obj = os.linesep.join([x for x in lines if x is not None]) + + return new_obj + + def duplicate_module( module_file: Union[str, os.PathLike], old_model_patterns: ModelPatterns, new_model_patterns: ModelPatterns, dest_file: Optional[str] = None, add_copied_from: bool = True, + attrs_to_remove: List[str] = None, ): """ Create a new module from an existing one and adapting all function and classes names from old patterns to new ones. @@ -429,7 +521,7 @@ def duplicate_module( with open(module_file, "r", encoding="utf-8") as f: content = f.read() - content = re.sub("# Copyright (\d+)\s", f"# Copyright {CURRENT_YEAR} ", content) + content = re.sub(r"# Copyright (\d+)\s", f"# Copyright {CURRENT_YEAR} ", content) objects = parse_module_content(content) # Loop and treat all objects @@ -478,7 +570,7 @@ def duplicate_module( # Regular classes functions old_obj = obj obj, replacement = replace_model_patterns(obj, old_model_patterns, new_model_patterns) - has_copied_from = re.search("^#\s+Copied from", obj, flags=re.MULTILINE) is not None + has_copied_from = re.search(r"^#\s+Copied from", obj, flags=re.MULTILINE) is not None if add_copied_from and not has_copied_from and _re_class_func.search(obj) is not None and len(replacement) > 0: # Copied from statement must be added just before the class/function definition, which may not be the # first line because of decorators. 
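`extract_block` above returns the first block found at the requested indent level, keeping a closing bracket/quote line with it. A tiny illustration (import path taken from the file this hunk modifies):

```
from transformers.commands.add_new_model_like import extract_block  # added in this diff

content = (
    "SOME_MAPPING = {\n"
    '    "bert": "BertModel",\n'
    "}\n"
    "OTHER_CONSTANT = 1\n"
)
print(extract_block(content, indent_level=0))
# SOME_MAPPING = {
#     "bert": "BertModel",
# }
```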
@@ -492,8 +584,14 @@ def duplicate_module( new_objects.append(obj) + content = "\n".join(new_objects) + # Remove some attributes that we don't want to copy to the new file(s) + if attrs_to_remove is not None: + for attr in attrs_to_remove: + content = remove_attributes(content, target_attr=attr) + with open(dest_file, "w", encoding="utf-8") as f: - content = f.write("\n".join(new_objects)) + f.write(content) def filter_framework_files( @@ -550,7 +648,7 @@ def get_model_files(model_type: str, frameworks: Optional[List[str]] = None) -> model_files = list(model_module.glob("*.py")) model_files = filter_framework_files(model_files, frameworks=frameworks) - doc_file = REPO_PATH / "docs" / "source" / "en" / "model_doc" / f"{model_type}.mdx" + doc_file = REPO_PATH / "docs" / "source" / "en" / "model_doc" / f"{model_type}.md" # Basic pattern for test files test_files = [ @@ -571,7 +669,7 @@ def get_model_files(model_type: str, frameworks: Optional[List[str]] = None) -> return {"doc_file": doc_file, "model_files": model_files, "module_name": module_name, "test_files": test_files} -_re_checkpoint_for_doc = re.compile("^_CHECKPOINT_FOR_DOC\s+=\s+(\S*)\s*$", flags=re.MULTILINE) +_re_checkpoint_for_doc = re.compile(r"^_CHECKPOINT_FOR_DOC\s+=\s+(\S*)\s*$", flags=re.MULTILINE) def find_base_model_checkpoint( @@ -817,8 +915,8 @@ def clean_frameworks_in_init( idx += 1 # Otherwise we keep the line, except if it's a tokenizer import and we don't want to keep it. elif keep_processing or ( - re.search('^\s*"(tokenization|processing|feature_extraction|image_processing)', lines[idx]) is None - and re.search("^\s*from .(tokenization|processing|feature_extraction|image_processing)", lines[idx]) + re.search(r'^\s*"(tokenization|processing|feature_extraction|image_processing)', lines[idx]) is None + and re.search(r"^\s*from .(tokenization|processing|feature_extraction|image_processing)", lines[idx]) is None ): new_lines.append(lines[idx]) @@ -1089,18 +1187,18 @@ def duplicate_doc_file( old_model_patterns (`ModelPatterns`): The patterns for the old model. new_model_patterns (`ModelPatterns`): The patterns for the new model. dest_file (`str` or `os.PathLike`, *optional*): Path to the new doc file. - Will default to the a file named `{new_model_patterns.model_type}.mdx` in the same folder as `module_file`. + Will default to the a file named `{new_model_patterns.model_type}.md` in the same folder as `module_file`. frameworks (`List[str]`, *optional*): If passed, will only keep the model classes corresponding to this list of frameworks in the new doc file. """ with open(doc_file, "r", encoding="utf-8") as f: content = f.read() - content = re.sub(" in auto-factory + # we pop attn_implementation from the kwargs but this handles the case where users + # passes manually the config to `from_pretrained`. 
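As the comment above notes, `from_pretrained` now honours an `attn_implementation` keyword and lets it override whatever the (possibly user-supplied) config carries. A usage sketch (`gpt2` is just an example checkpoint; `"eager"` is the always-available implementation):

```
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2", attn_implementation="eager")
print(model.config._attn_implementation)  # "eager"
```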
+ config = copy.deepcopy(config) + + kwarg_attn_imp = kwargs.pop("attn_implementation", None) + if kwarg_attn_imp is not None and config._attn_implementation != kwarg_attn_imp: + config._attn_implementation = kwarg_attn_imp model_kwargs = kwargs - if commit_hash is None: - commit_hash = getattr(config, "_commit_hash", None) + pre_quantized = getattr(config, "quantization_config", None) is not None + if pre_quantized or quantization_config is not None: + if pre_quantized: + config.quantization_config = AutoHfQuantizer.merge_quantization_configs( + config.quantization_config, quantization_config + ) + else: + config.quantization_config = quantization_config + hf_quantizer = AutoHfQuantizer.from_config(config.quantization_config, pre_quantized=pre_quantized) + else: + hf_quantizer = None + + if hf_quantizer is not None: + hf_quantizer.validate_environment( + torch_dtype=torch_dtype, from_tf=from_tf, from_flax=from_flax, device_map=device_map + ) + torch_dtype = hf_quantizer.update_torch_dtype(torch_dtype) + device_map = hf_quantizer.update_device_map(device_map) + + # Force-set to `True` for more mem efficiency + if low_cpu_mem_usage is None: + low_cpu_mem_usage = True + logger.warning("`low_cpu_mem_usage` was None, now set to True since model is quantized.") # This variable will flag if we're loading a sharded checkpoint. In this case the archive file is just the # index of the files. @@ -2224,14 +3062,14 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P ): # Load from a Flax checkpoint in priority if from_flax archive_file = os.path.join(pretrained_model_name_or_path, subfolder, FLAX_WEIGHTS_NAME) - elif is_safetensors_available() and os.path.isfile( + elif use_safetensors is not False and os.path.isfile( os.path.join(pretrained_model_name_or_path, subfolder, _add_variant(SAFE_WEIGHTS_NAME, variant)) ): # Load from a safetensors checkpoint archive_file = os.path.join( pretrained_model_name_or_path, subfolder, _add_variant(SAFE_WEIGHTS_NAME, variant) ) - elif is_safetensors_available() and os.path.isfile( + elif use_safetensors is not False and os.path.isfile( os.path.join( pretrained_model_name_or_path, subfolder, _add_variant(SAFE_WEIGHTS_INDEX_NAME, variant) ) @@ -2271,6 +3109,11 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P f" {pretrained_model_name_or_path} but there is a file for Flax weights. Use `from_flax=True`" " to load this model from those weights." ) + elif use_safetensors: + raise EnvironmentError( + f"Error no file named {_add_variant(SAFE_WEIGHTS_NAME, variant)} found in directory" + f" {pretrained_model_name_or_path}." 
+ ) else: raise EnvironmentError( f"Error no file named {_add_variant(WEIGHTS_NAME, variant)}, {TF2_WEIGHTS_NAME}," @@ -2297,26 +3140,27 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P filename = TF2_WEIGHTS_NAME elif from_flax: filename = FLAX_WEIGHTS_NAME - elif is_safetensors_available(): + elif use_safetensors is not False: filename = _add_variant(SAFE_WEIGHTS_NAME, variant) else: filename = _add_variant(WEIGHTS_NAME, variant) try: # Load from URL or cache if already cached - cached_file_kwargs = dict( - cache_dir=cache_dir, - force_download=force_download, - proxies=proxies, - resume_download=resume_download, - local_files_only=local_files_only, - use_auth_token=use_auth_token, - user_agent=user_agent, - revision=revision, - subfolder=subfolder, - _raise_exceptions_for_missing_entries=False, - _commit_hash=commit_hash, - ) + cached_file_kwargs = { + "cache_dir": cache_dir, + "force_download": force_download, + "proxies": proxies, + "resume_download": resume_download, + "local_files_only": local_files_only, + "token": token, + "user_agent": user_agent, + "revision": revision, + "subfolder": subfolder, + "_raise_exceptions_for_gated_repo": False, + "_raise_exceptions_for_missing_entries": False, + "_commit_hash": commit_hash, + } resolved_archive_file = cached_file(pretrained_model_name_or_path, filename, **cached_file_kwargs) # Since we set _raise_exceptions_for_missing_entries=False, we don't get an exception but a None @@ -2330,6 +3174,19 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P ) if resolved_archive_file is not None: is_sharded = True + elif use_safetensors: + if revision == "main": + resolved_archive_file, revision, is_sharded = auto_conversion( + pretrained_model_name_or_path, **cached_file_kwargs + ) + cached_file_kwargs["revision"] = revision + if resolved_archive_file is None: + raise EnvironmentError( + f"{pretrained_model_name_or_path} does not appear to have a file named" + f" {_add_variant(SAFE_WEIGHTS_NAME, variant)} or {_add_variant(SAFE_WEIGHTS_INDEX_NAME, variant)} " + "and thus cannot be loaded with `safetensors`. Please make sure that the model has " + "been saved with `safe_serialization=True` or do not set `use_safetensors=True`." + ) else: # This repo has no safetensors file of any kind, we switch to PyTorch. filename = _add_variant(WEIGHTS_NAME, variant) @@ -2351,7 +3208,7 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P has_file_kwargs = { "revision": revision, "proxies": proxies, - "use_auth_token": use_auth_token, + "token": token, } if has_file(pretrained_model_name_or_path, TF2_WEIGHTS_NAME, **has_file_kwargs): raise EnvironmentError( @@ -2383,7 +3240,7 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P # Raise any environment error raise by `cached_file`. It will have a helpful error message adapted # to the original exception. raise - except Exception: + except Exception as e: # For any other exception, we throw a generic error. raise EnvironmentError( f"Can't load the model for '{pretrained_model_name_or_path}'. If you were trying to load it" @@ -2391,7 +3248,7 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P f" same name. Otherwise, make sure '{pretrained_model_name_or_path}' is the correct path to a" f" directory containing a file named {_add_variant(WEIGHTS_NAME, variant)}," f" {TF2_WEIGHTS_NAME}, {TF_WEIGHTS_NAME} or {FLAX_WEIGHTS_NAME}." 
- ) + ) from e if is_local: logger.info(f"loading weights file {archive_file}") @@ -2412,13 +3269,36 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P proxies=proxies, resume_download=resume_download, local_files_only=local_files_only, - use_auth_token=use_auth_token, + token=token, user_agent=user_agent, revision=revision, subfolder=subfolder, _commit_hash=commit_hash, ) + if ( + is_safetensors_available() + and isinstance(resolved_archive_file, str) + and resolved_archive_file.endswith(".safetensors") + ): + with safe_open(resolved_archive_file, framework="pt") as f: + metadata = f.metadata() + + if metadata.get("format") == "pt": + pass + elif metadata.get("format") == "tf": + from_tf = True + logger.info("A TensorFlow safetensors file is being loaded in a PyTorch model.") + elif metadata.get("format") == "flax": + from_flax = True + logger.info("A Flax safetensors file is being loaded in a PyTorch model.") + else: + raise ValueError( + f"Incompatible safetensors file. File metadata is not ['pt', 'tf', 'flax'] but {metadata.get('format')}" + ) + + from_pt = not (from_tf | from_flax) + # load pt weights early so that we know which dtype to init the model under if from_pt: if not is_sharded and state_dict is None: @@ -2458,24 +3338,18 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P dtype_orig = cls._set_default_torch_dtype(torch_dtype) # Check if `_keep_in_fp32_modules` is not None - use_keep_in_fp32_modules = ( - (cls._keep_in_fp32_modules is not None) and is_accelerate_available() and torch_dtype == torch.float16 + use_keep_in_fp32_modules = (cls._keep_in_fp32_modules is not None) and ( + (torch_dtype == torch.float16) or hasattr(hf_quantizer, "use_keep_in_fp32_modules") ) - if ( - (cls._keep_in_fp32_modules is not None) - and not is_accelerate_available() - and torch_dtype == torch.float16 - ): - logger.warning( - "For stability purposes, it is recommended to have accelerate installed when using this model in" - " torch.float16, please install it with `pip install accelerate`" - ) if is_sharded: loaded_state_dict_keys = sharded_metadata["all_checkpoint_keys"] else: - loaded_state_dict_keys = [k for k in state_dict.keys()] - if low_cpu_mem_usage or use_keep_in_fp32_modules: + loaded_state_dict_keys = list(state_dict.keys()) + if low_cpu_mem_usage or (use_keep_in_fp32_modules and is_accelerate_available()): + # In case some weights need to be kept in float32 and accelerate is not installed, + # we later on want to take the path where state_dict is not None, that is the one + # that do not require accelerate. state_dict = None config.name_or_path = pretrained_model_name_or_path @@ -2488,103 +3362,100 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P logger.info("Detected DeepSpeed ZeRO-3: activating zero.init() for this model") init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts - elif load_in_8bit or low_cpu_mem_usage: + elif low_cpu_mem_usage: init_contexts.append(init_empty_weights()) + config = copy.deepcopy(config) # We do not want to modify the config inplace in from_pretrained. 
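The metadata check above decides between `from_pt`, `from_tf` and `from_flax` purely from the `format` entry of the safetensors header. A small sketch of where that entry comes from and how it is read back, using the public `safetensors` API:

```
import torch
from safetensors import safe_open
from safetensors.torch import save_file

save_file({"weight": torch.ones(2, 2)}, "toy.safetensors", metadata={"format": "pt"})

with safe_open("toy.safetensors", framework="pt") as f:
    print(f.metadata())  # {'format': 'pt'}
```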
+ config = cls._autoset_attn_implementation( + config, use_flash_attention_2=use_flash_attention_2, torch_dtype=torch_dtype, device_map=device_map + ) + with ContextManagers(init_contexts): + # Let's make sure we don't run the init function of buffer modules model = cls(config, *model_args, **model_kwargs) + # make sure we use the model's config since the __init__ call might have copied it + config = model.config + # Check first if we are `from_pt` if use_keep_in_fp32_modules: - low_cpu_mem_usage = True + if is_accelerate_available() and not is_deepspeed_zero3_enabled(): + low_cpu_mem_usage = True keep_in_fp32_modules = model._keep_in_fp32_modules else: keep_in_fp32_modules = [] - if load_in_8bit: - from .utils.bitsandbytes import get_keys_to_not_convert, replace_8bit_linear - - load_in_8bit_skip_modules = quantization_config.llm_int8_skip_modules - load_in_8bit_threshold = quantization_config.llm_int8_threshold - load_in_8bit_fp32_cpu_offload = quantization_config.llm_int8_enable_fp32_cpu_offload - - logger.info("Detected 8-bit loading: activating 8-bit loading for this model") - - # We keep some modules such as the lm_head in their original dtype for numerical stability reasons - if load_in_8bit_skip_modules is None: - modules_to_not_convert = get_keys_to_not_convert(model) - else: - modules_to_not_convert = load_in_8bit_skip_modules - - if not isinstance(modules_to_not_convert, list): - modules_to_not_convert = [modules_to_not_convert] - - modules_to_not_convert.extend(keep_in_fp32_modules) + if hf_quantizer is not None: + hf_quantizer.preprocess_model( + model=model, device_map=device_map, keep_in_fp32_modules=keep_in_fp32_modules + ) - # Extend the modules to not convert to keys that are supposed to be offloaded to `cpu` or `disk` - if isinstance(device_map, dict) and len(device_map.keys()) > 1: - keys_on_cpu = [key for key, value in device_map.items() if value in ["disk", "cpu"]] + # We store the original dtype for quantized models as we cannot easily retrieve it + # once the weights have been quantized + # Note that once you have loaded a quantized model, you can't change its dtype so this will + # remain a single source of truth + config._pre_quantization_dtype = torch_dtype - if len(keys_on_cpu) > 0 and not load_in_8bit_fp32_cpu_offload: - raise ValueError( - "If you want to offload some keys to `cpu` or `disk`, you need to set " - "`load_in_8bit_fp32_cpu_offload=True`. Note that these modules will not be " - " converted to 8-bit but kept in 32-bit." 
- ) + if isinstance(device_map, str): + special_dtypes = {} - modules_to_not_convert.extend(keys_on_cpu) + if hf_quantizer is not None: + special_dtypes.update(hf_quantizer.get_special_dtypes_update(model, torch_dtype)) - model = replace_8bit_linear( - model, threshold=load_in_8bit_threshold, modules_to_not_convert=modules_to_not_convert + special_dtypes.update( + { + name: torch.float32 + for name, _ in model.named_parameters() + if any(m in name for m in keep_in_fp32_modules) + } ) - # training in 8-bit is only available in 0.37.0+ - model._is_int8_training_enabled = version.parse( - importlib_metadata.version("bitsandbytes") - ) >= version.parse("0.37.0") + target_dtype = torch_dtype - if isinstance(device_map, str): - if model._no_split_modules is None: - raise ValueError(f"{model.__class__.__name__} does not support `device_map='{device_map}'` yet.") - no_split_modules = model._no_split_modules + if hf_quantizer is not None: + target_dtype = hf_quantizer.adjust_target_dtype(target_dtype) + + no_split_modules = model._get_no_split_modules(device_map) if device_map not in ["auto", "balanced", "balanced_low_0", "sequential"]: raise ValueError( "If passing a string for `device_map`, please choose 'auto', 'balanced', 'balanced_low_0' or " "'sequential'." ) - elif device_map in ["balanced", "balanced_low_0"] and get_balanced_memory is None: - raise ValueError(f"`device_map={device_map}` requires a source install of Accelerate.") - if device_map != "sequential" and get_balanced_memory is not None: + + device_map_kwargs = {"no_split_module_classes": no_split_modules} + if "special_dtypes" in inspect.signature(infer_auto_device_map).parameters: + device_map_kwargs["special_dtypes"] = special_dtypes + elif len(special_dtypes) > 0: + logger.warning( + "This model has some weights that should be kept in higher precision, you need to upgrade " + "`accelerate` to properly deal with them (`pip install --upgrade accelerate`)." + ) + if device_map != "sequential": max_memory = get_balanced_memory( model, - max_memory=max_memory, - no_split_module_classes=no_split_modules, - dtype=torch_dtype, + dtype=target_dtype, low_zero=(device_map == "balanced_low_0"), + max_memory=max_memory, + **device_map_kwargs, ) + else: + max_memory = get_max_memory(max_memory) + if hf_quantizer is not None: + max_memory = hf_quantizer.adjust_max_memory(max_memory) + device_map_kwargs["max_memory"] = max_memory + # Make sure tied weights are tied before creating the device map. model.tie_weights() - device_map = infer_auto_device_map( - model, - no_split_module_classes=no_split_modules, - dtype=torch_dtype if not load_in_8bit else torch.int8, - max_memory=max_memory, - ) + device_map = infer_auto_device_map(model, dtype=target_dtype, **device_map_kwargs) - if load_in_8bit: - # The LM head / tied weights or any last module can stay on disk / CPU - device_map_without_lm_head = { - key: device_map[key] for key in device_map.keys() if key not in modules_to_not_convert - } - if "cpu" in device_map_without_lm_head.values() or "disk" in device_map_without_lm_head.values(): - raise ValueError( - """ - Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit - the quantized model. If you have set a value for `max_memory` you should increase that. To have - an idea of the modules that are set on the CPU or RAM you can print model.hf_device_map. 
- """ - ) - del device_map_without_lm_head + if hf_quantizer is not None: + hf_quantizer.validate_environment(device_map=device_map) + + elif device_map is not None: + model.tie_weights() + tied_params = find_tied_parameters(model) + # check if we don't have tied param in different devices + check_tied_parameters_on_same_device(tied_params, device_map) if from_tf: if resolved_archive_file.endswith(".index"): @@ -2621,7 +3492,6 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P # restore default dtype if dtype_orig is not None: torch.set_default_dtype(dtype_orig) - ( model, missing_keys, @@ -2643,12 +3513,10 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P offload_folder=offload_folder, offload_state_dict=offload_state_dict, dtype=torch_dtype, - load_in_8bit=load_in_8bit, + hf_quantizer=hf_quantizer, keep_in_fp32_modules=keep_in_fp32_modules, ) - model.is_loaded_in_8bit = load_in_8bit - # make sure token embedding weights are still tied if needed model.tie_weights() @@ -2656,7 +3524,7 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P model.eval() # If it is a model with generation capabilities, attempt to load the generation config - if model.can_generate(): + if model.can_generate() and pretrained_model_name_or_path is not None: try: model.generation_config = GenerationConfig.from_pretrained( pretrained_model_name_or_path, @@ -2665,14 +3533,14 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P resume_download=resume_download, proxies=proxies, local_files_only=local_files_only, - use_auth_token=use_auth_token, + token=token, revision=revision, subfolder=subfolder, _from_auto=from_auto_class, _from_pipeline=from_pipeline, **kwargs, ) - except (OSError, TypeError): + except OSError: logger.info( "Generation config file not found, using a generation config created from the model config." 
) @@ -2680,7 +3548,26 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P # Dispatch model with hooks on all devices if necessary if device_map is not None: - dispatch_model(model, device_map=device_map, offload_dir=offload_folder, offload_index=offload_index) + device_map_kwargs = { + "device_map": device_map, + "offload_dir": offload_folder, + "offload_index": offload_index, + } + if "skip_keys" in inspect.signature(dispatch_model).parameters: + device_map_kwargs["skip_keys"] = model._skip_keys_device_placement + dispatch_model(model, **device_map_kwargs) + + if hf_quantizer is not None: + hf_quantizer.postprocess_model(model) + model.hf_quantizer = hf_quantizer + + if _adapter_model_path is not None: + model.load_adapter( + _adapter_model_path, + adapter_name=adapter_name, + token=token, + adapter_kwargs=adapter_kwargs, + ) if output_loading_info: if loading_info is None: @@ -2710,12 +3597,10 @@ def _load_pretrained_model( offload_folder=None, offload_state_dict=None, dtype=None, - load_in_8bit=False, + hf_quantizer=None, keep_in_fp32_modules=None, ): is_safetensors = False - if load_in_8bit: - from .utils.bitsandbytes import set_module_8bit_tensor_to_device if device_map is not None and "disk" in device_map.values(): archive_file = ( @@ -2734,6 +3619,10 @@ def _load_pretrained_model( offload_state_dict = True is_sharded_safetensors = is_safetensors and sharded_metadata is not None + + # tie the model weights before retrieving the state_dict + model.tie_weights() + # Retrieve missing & unexpected_keys model_state_dict = model.state_dict() expected_keys = list(model_state_dict.keys()) @@ -2768,8 +3657,38 @@ def _fix_key(key): elif add_prefix_to_model: expected_keys = [".".join([prefix, s]) for s in expected_keys] - missing_keys = list(set(expected_keys) - set(loaded_keys)) - unexpected_keys = list(set(loaded_keys) - set(expected_keys)) + missing_keys = sorted(set(expected_keys) - set(loaded_keys)) + unexpected_keys = set(loaded_keys) - set(expected_keys) + # Remove nonpersistent buffers from unexpected keys: they are not in the state dict but will be in the model + # buffers + model_buffers = {n for n, _ in model.named_buffers()} + if remove_prefix_from_model: + model_buffers = {key[len(_prefix) :] if key.startswith(_prefix) else key for key in model_buffers} + elif add_prefix_to_model: + model_buffers = {".".join([prefix, key]) for key in model_buffers} + unexpected_keys = sorted(unexpected_keys - model_buffers) + + model.tie_weights() + if device_map is None and not is_fsdp_enabled() and not is_deepspeed_zero3_enabled(): + ptrs = collections.defaultdict(list) + for name, tensor in model.state_dict().items(): + id_tensor = id_tensor_storage(tensor) + ptrs[id_tensor].append(name) + + # These are all the pointers of shared tensors. 
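The pointer grouping above detects tied parameters by bucketing state-dict entries that share the same underlying storage. The same idea in a few lines, using plain `data_ptr()` instead of the library's `id_tensor_storage` helper:

```
import collections

import torch

class TiedLM(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(10, 4)
        self.lm_head = torch.nn.Linear(4, 10, bias=False)
        self.lm_head.weight = self.embed.weight  # tie the weights

ptrs = collections.defaultdict(list)
for name, tensor in TiedLM().state_dict().items():
    ptrs[(tensor.device, tensor.data_ptr())].append(name)

print([names for names in ptrs.values() if len(names) > 1])  # [['embed.weight', 'lm_head.weight']]
```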
+ tied_params = [names for _, names in ptrs.items() if len(names) > 1] + else: + # id function doesn't work for meta tensor so we need this function + tied_params = find_tied_parameters(model) + + for group in tied_params: + if remove_prefix_from_model: + group = [key[len(_prefix) :] if key.startswith(_prefix) else key for key in group] + elif add_prefix_to_model: + group = [".".join([prefix, key]) for key in group] + missing_in_group = [k for k in missing_keys if k in group] + if len(missing_in_group) > 0 and len(missing_in_group) < len(group): + missing_keys = [k for k in missing_keys if k not in missing_in_group] # Some models may have keys that are not in the state by design, removing them before needlessly warning # the user. @@ -2787,8 +3706,8 @@ def _fix_key(key): for key in missing_keys: if key in list(model_state_dict.keys()): key = key - elif f"{prefix}.key" in list(model_state_dict.keys()): - key = f"{prefix}.key" + elif f"{prefix}.{key}" in list(model_state_dict.keys()): + key = f"{prefix}.{key}" elif key.startswith(prefix) and ".".join(key.split(".")[1:]) in list(model_state_dict.keys()): key = ".".join(key.split(".")[1:]) param = model_state_dict[key] @@ -2798,35 +3717,66 @@ def _fix_key(key): if ( keep_in_fp32_modules is not None and dtype == torch.float16 - and any(module_to_keep_in_fp32 in key for module_to_keep_in_fp32 in keep_in_fp32_modules) + and any( + module_to_keep_in_fp32 in key.split(".") for module_to_keep_in_fp32 in keep_in_fp32_modules + ) ): target_dtype = torch.float32 if param.device == torch.device("meta"): - if not load_in_8bit: - set_module_tensor_to_device(model, key, "cpu", torch.empty(*param.size(), dtype=target_dtype)) - else: - set_module_8bit_tensor_to_device( - model, key, "cpu", torch.empty(*param.size(), dtype=target_dtype) + value = torch.empty(*param.size(), dtype=target_dtype) + if ( + hf_quantizer is None + or getattr(hf_quantizer, "requires_parameters_quantization", False) + or not hf_quantizer.check_quantized_param( + model, param_value=value, param_name=key, state_dict={} ) + ): + set_module_tensor_to_device(model, key, "cpu", value) + else: + hf_quantizer.create_quantized_param(model, value, key, "cpu", state_dict) - # retrieve unintialized modules and initialize before maybe overriding that with the pretrained weights. + # retrieve uninitialized modules and initialize before maybe overriding that with the pretrained weights. if _fast_init: - if remove_prefix_from_model: - _loaded_keys = [f"{prefix}.{k}" for k in loaded_keys] - elif add_prefix_to_model: - _loaded_keys = [k[len(prefix) + 1 :] for k in loaded_keys] + if not ignore_mismatched_sizes: + if remove_prefix_from_model: + _loaded_keys = [f"{prefix}.{k}" for k in loaded_keys] + elif add_prefix_to_model: + _loaded_keys = [k[len(prefix) + 1 :] for k in loaded_keys] + else: + _loaded_keys = loaded_keys + not_initialized_submodules = set_initialized_submodules(model, _loaded_keys) + # If we're about to tie the output embeds to the input embeds we don't need to init them + if hasattr(model.config, "tie_word_embeddings") and model.config.tie_word_embeddings: + output_embeddings = model.get_output_embeddings() + if output_embeddings is not None: + # Still need to initialize if there is a bias term since biases are not tied. 
+ if not hasattr(output_embeddings, "bias") or output_embeddings.bias is None: + output_embeddings._is_hf_initialized = True else: - _loaded_keys = loaded_keys - set_initialized_submodules(model, _loaded_keys) + not_initialized_submodules = dict(model.named_modules()) # This will only initialize submodules that are not marked as initialized by the line above. - model.apply(model._initialize_weights) + if is_deepspeed_zero3_enabled(): + import deepspeed + + not_initialized_parameters = list( + set( + itertools.chain.from_iterable( + submodule.parameters(recurse=False) for submodule in not_initialized_submodules.values() + ) + ) + ) + with deepspeed.zero.GatheredParameters(not_initialized_parameters, modifier_rank=0): + model.apply(model._initialize_weights) + else: + model.apply(model._initialize_weights) # Set some modules to fp32 if any if keep_in_fp32_modules is not None: for name, param in model.named_parameters(): - if any(module_to_keep_in_fp32 in name for module_to_keep_in_fp32 in keep_in_fp32_modules): - param = param.to(torch.float32) + if any(module_to_keep_in_fp32 in name.split(".") for module_to_keep_in_fp32 in keep_in_fp32_modules): + # param = param.to(torch.float32) does not work here as only in the local scope. + param.data = param.data.to(torch.float32) # Make sure we are able to load base models as well as derived models (with heads) start_prefix = "" @@ -2855,6 +3805,9 @@ def _find_mismatched_keys( mismatched_keys = [] if ignore_mismatched_sizes: for checkpoint_key in loaded_keys: + # If the checkpoint is sharded, we may not have the key here. + if checkpoint_key not in state_dict: + continue model_key = checkpoint_key if remove_prefix_from_model: # The model key starts with `prefix` but `checkpoint_key` doesn't so we add it. @@ -2867,10 +3820,18 @@ def _find_mismatched_keys( model_key in model_state_dict and state_dict[checkpoint_key].shape != model_state_dict[model_key].shape ): - mismatched_keys.append( - (checkpoint_key, state_dict[checkpoint_key].shape, model_state_dict[model_key].shape) - ) - del state_dict[checkpoint_key] + if ( + state_dict[checkpoint_key].shape[-1] == 1 + and state_dict[checkpoint_key].numel() * 2 == model_state_dict[model_key].numel() + ): + # This skips size mismatches for 4-bit weights. Two 4-bit values share an 8-bit container, causing size differences. + # Without matching with module type or paramter type it seems like a practical way to detect valid 4bit weights. 
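The heuristic described in the comment above relies on how 4-bit checkpoints are packed: two 4-bit values share one uint8 container, so the stored tensor has half as many elements as the parameter it replaces and is flattened to a single column. A small sketch of that shape check, with illustrative sizes only:

```python
import torch


def looks_like_packed_4bit(ckpt_tensor: torch.Tensor, model_tensor: torch.Tensor) -> bool:
    # A packed 4-bit entry is an (n, 1) uint8 column holding two values per byte,
    # so it carries exactly half the element count of the unpacked parameter.
    return ckpt_tensor.shape[-1] == 1 and ckpt_tensor.numel() * 2 == model_tensor.numel()


full_weight = torch.empty(64, 128, dtype=torch.float16)           # what the model allocates
packed_weight = torch.empty(64 * 128 // 2, 1, dtype=torch.uint8)  # what a 4-bit checkpoint stores
print(looks_like_packed_4bit(packed_weight, full_weight))  # True
```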
+ pass + else: + mismatched_keys.append( + (checkpoint_key, state_dict[checkpoint_key].shape, model_state_dict[model_key].shape) + ) + del state_dict[checkpoint_key] return mismatched_keys if resolved_archive_file is not None: @@ -2878,8 +3839,7 @@ def _find_mismatched_keys( else: folder = None if device_map is not None and is_safetensors: - param_device_map = expand_device_map(device_map, original_loaded_keys) - + param_device_map = expand_device_map(device_map, original_loaded_keys, start_prefix) str_dtype = str(dtype).replace("torch.", "") if dtype is not None else "float32" if sharded_metadata is None: archive_file = ( @@ -2891,9 +3851,9 @@ def _find_mismatched_keys( else: weight_map = {p: os.path.join(folder, f) for p, f in sharded_metadata["weight_map"].items()} offload_index = { - p: {"safetensors_file": f, "weight_name": p, "dtype": str_dtype} + p[len(start_prefix) :]: {"safetensors_file": f, "weight_name": p, "dtype": str_dtype} for p, f in weight_map.items() - if param_device_map[p] == "disk" + if p.startswith(start_prefix) and param_device_map[p[len(start_prefix) :]] == "disk" } if state_dict is not None: @@ -2927,7 +3887,9 @@ def _find_mismatched_keys( state_dict_index = None if is_sharded_safetensors: - disk_only_shard_files = get_disk_only_shard_files(device_map, sharded_metadata=sharded_metadata) + disk_only_shard_files = get_disk_only_shard_files( + device_map, sharded_metadata=sharded_metadata, start_prefix=start_prefix + ) disk_only_shard_files = [os.path.join(folder, f) for f in disk_only_shard_files] else: disk_only_shard_files = [] @@ -2950,25 +3912,35 @@ def _find_mismatched_keys( remove_prefix_from_model, ignore_mismatched_sizes, ) - if low_cpu_mem_usage: - new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model( - model_to_load, - state_dict, - loaded_keys, - start_prefix, - expected_keys, - device_map=device_map, - offload_folder=offload_folder, - offload_index=offload_index, - state_dict_folder=state_dict_folder, - state_dict_index=state_dict_index, - dtype=dtype, - load_in_8bit=load_in_8bit, - is_safetensors=is_safetensors, - keep_in_fp32_modules=keep_in_fp32_modules, - ) - error_msgs += new_error_msgs + if is_fsdp_enabled() and not is_local_dist_rank_0(): + for key, param in model_to_load.state_dict().items(): + if param.device == torch.device("meta"): + if hf_quantizer is None: + set_module_tensor_to_device( + model_to_load, key, "cpu", torch.empty(*param.size(), dtype=dtype) + ) + else: + hf_quantizer.create_quantized_param(model, param, key, "cpu", state_dict) + else: + new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model( + model_to_load, + state_dict, + loaded_keys, + start_prefix, + expected_keys, + device_map=device_map, + offload_folder=offload_folder, + offload_index=offload_index, + state_dict_folder=state_dict_folder, + state_dict_index=state_dict_index, + dtype=dtype, + hf_quantizer=hf_quantizer, + is_safetensors=is_safetensors, + keep_in_fp32_modules=keep_in_fp32_modules, + unexpected_keys=unexpected_keys, + ) + error_msgs += new_error_msgs else: error_msgs += _load_state_dict_into_model(model_to_load, state_dict, start_prefix) @@ -3005,7 +3977,9 @@ def _find_mismatched_keys( raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}") if len(unexpected_keys) > 0: - logger.warning( + archs = [] if model.config.architectures is None else model.config.architectures + warner = logger.warning if model.__class__.__name__ in archs else logger.info + warner( 
f"Some weights of the model checkpoint at {pretrained_model_name_or_path} were not used when" f" initializing {model.__class__.__name__}: {unexpected_keys}\n- This IS expected if you are" f" initializing {model.__class__.__name__} from the checkpoint of a model trained on another task or" @@ -3046,12 +4020,12 @@ def _find_mismatched_keys( return model, missing_keys, unexpected_keys, mismatched_keys, offload_index, error_msgs def retrieve_modules_from_names(self, names, add_prefix=False, remove_prefix=False): - module_keys = set([".".join(key.split(".")[:-1]) for key in names]) + module_keys = {".".join(key.split(".")[:-1]) for key in names} # torch.nn.ParameterList is a special case where two parameter keywords # are appended to the module name, *e.g.* bert.special_embeddings.0 module_keys = module_keys.union( - set([".".join(key.split(".")[:-2]) for key in names if len(key) > 0 and key[-1].isdigit()]) + {".".join(key.split(".")[:-2]) for key in names if len(key) > 0 and key[-1].isdigit()} ) retrieved_modules = [] @@ -3069,7 +4043,9 @@ def retrieve_modules_from_names(self, names, add_prefix=False, remove_prefix=Fal return retrieved_modules @staticmethod - def _load_pretrained_model_low_mem(model, loaded_state_dict_keys, resolved_archive_file, start_prefix=""): + def _load_pretrained_model_low_mem( + model, loaded_state_dict_keys, resolved_archive_file, start_prefix="", hf_quantizer=None + ): """ This is an experimental function that loads the model using ~1.x model size CPU memory @@ -3084,12 +4060,21 @@ def _load_pretrained_model_low_mem(model, loaded_state_dict_keys, resolved_archi 4. load state_dict 2nd time 5. replace the params/buffers from the state_dict - Currently, it doesn't handle missing_keys, unexpected_keys, mismatched_keys. It can't handle deepspeed. + Currently, it doesn't handle missing_keys, unexpected_keys, mismatched_keys. It can't handle deepspeed. To + handle bitsandbytes, needs non-empty hf_quantizer argument. """ _move_model_to_meta(model, loaded_state_dict_keys, start_prefix) state_dict = load_state_dict(resolved_archive_file) - error_msgs = _load_state_dict_into_meta_model(model, state_dict, loaded_state_dict_keys, start_prefix) + expected_keys = loaded_state_dict_keys # plug for missing expected_keys. TODO: replace with proper keys + error_msgs = _load_state_dict_into_meta_model( + model, + state_dict, + loaded_state_dict_keys, + start_prefix, + expected_keys=expected_keys, + hf_quantizer=hf_quantizer, + ) return error_msgs @classmethod @@ -3118,6 +4103,103 @@ def register_for_auto_class(cls, auto_class="AutoModel"): cls._auto_class = auto_class + def to_bettertransformer(self) -> "PreTrainedModel": + """ + Converts the model to use [PyTorch's native attention + implementation](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html), integrated to + Transformers through [Optimum library](https://huggingface.co/docs/optimum/bettertransformer/overview). Only a + subset of all Transformers models are supported. + + PyTorch's attention fastpath allows to speed up inference through kernel fusions and the use of [nested + tensors](https://pytorch.org/docs/stable/nested.html). Detailed benchmarks can be found in [this blog + post](https://medium.com/pytorch/bettertransformer-out-of-the-box-performance-for-huggingface-transformers-3fbe27d50ab2). + + Returns: + [`PreTrainedModel`]: The model converted to BetterTransformer. 
+ """ + if not is_optimum_available(): + raise ImportError("The package `optimum` is required to use Better Transformer.") + + from optimum.version import __version__ as optimum_version + + if version.parse(optimum_version) < version.parse("1.7.0"): + raise ImportError( + f"Please install optimum>=1.7.0 to use Better Transformer. The version {optimum_version} was found." + ) + + from optimum.bettertransformer import BetterTransformer + + return BetterTransformer.transform(self) + + def reverse_bettertransformer(self): + """ + Reverts the transformation from [`~PreTrainedModel.to_bettertransformer`] so that the original modeling is + used, for example in order to save the model. + + Returns: + [`PreTrainedModel`]: The model converted back to the original modeling. + """ + if not is_optimum_available(): + raise ImportError("The package `optimum` is required to use Better Transformer.") + + from optimum.version import __version__ as optimum_version + + if version.parse(optimum_version) < version.parse("1.7.0"): + raise ImportError( + f"Please install optimum>=1.7.0 to use Better Transformer. The version {optimum_version} was found." + ) + + from optimum.bettertransformer import BetterTransformer + + return BetterTransformer.reverse(self) + + def warn_if_padding_and_no_attention_mask(self, input_ids, attention_mask): + """ + Shows a one-time warning if the input_ids appear to contain padding and no attention mask was given. + """ + + # Skip the check during tracing. + if is_torch_fx_proxy(input_ids) or torch.jit.is_tracing() or is_torchdynamo_compiling(): + return + + if (attention_mask is not None) or (self.config.pad_token_id is None): + return + + # Check only the first and last input IDs to reduce overhead. + if self.config.pad_token_id in input_ids[:, [-1, 0]]: + warn_string = ( + "We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See " + "https://huggingface.co/docs/transformers/troubleshooting" + "#incorrect-output-when-padding-tokens-arent-masked." + ) + + # If the pad token is equal to either BOS, EOS, or SEP, we do not know whether the user should use an + # attention_mask or not. In this case, we should still show a warning because this is a rare case. + if ( + (self.config.bos_token_id is not None and self.config.bos_token_id == self.config.pad_token_id) + or (self.config.eos_token_id is not None and self.config.eos_token_id == self.config.pad_token_id) + or (self.config.sep_token_id is not None and self.config.sep_token_id == self.config.pad_token_id) + ): + warn_string += ( + f"\nYou may ignore this warning if your `pad_token_id` ({self.config.pad_token_id}) is identical " + f"to the `bos_token_id` ({self.config.bos_token_id}), `eos_token_id` ({self.config.eos_token_id}), " + f"or the `sep_token_id` ({self.config.sep_token_id}), and your input is not padded." + ) + + logger.warning_once(warn_string) + + @property + def _is_quantized_training_enabled(self): + warnings.warn( + "`_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. 
Please use `model.hf_quantizer.is_trainable` instead", + FutureWarning, + ) + + if not hasattr(self, "hf_quantizer"): + return False + + return self.hf_quantizer.is_trainable + PreTrainedModel.push_to_hub = copy_func(PreTrainedModel.push_to_hub) if PreTrainedModel.push_to_hub.__doc__ is not None: @@ -3559,22 +4641,29 @@ def unwrap_model(model: nn.Module) -> nn.Module: return model -def expand_device_map(device_map, param_names): +def expand_device_map(device_map, param_names, start_prefix): """ Expand a device map to return the correspondance parameter name to device. """ new_device_map = {} + param_names = [p[len(start_prefix) :] for p in param_names if p.startswith(start_prefix)] for module, device in device_map.items(): - new_device_map.update({p: device for p in param_names if p == module or p.startswith(f"{module}.")}) + new_device_map.update( + {p: device for p in param_names if p == module or p.startswith(f"{module}.") or module == ""} + ) return new_device_map -def get_disk_only_shard_files(device_map, sharded_metadata): +def get_disk_only_shard_files(device_map, sharded_metadata, start_prefix): """ Returns the list of shard files containing only weights offloaded to disk. """ + + weight_map = { + p[len(start_prefix) :]: v for p, v in sharded_metadata["weight_map"].items() if p.startswith(start_prefix) + } files_content = collections.defaultdict(list) - for weight_name, filename in sharded_metadata["weight_map"].items(): + for weight_name, filename in weight_map.items(): while len(weight_name) > 0 and weight_name not in device_map: weight_name = ".".join(weight_name.split(".")[:-1]) files_content[filename].append(device_map[weight_name]) diff --git a/src/transformers/models/__init__.py b/src/transformers/models/__init__.py index 9716cc6957e2d5..5686cf516c497d 100644 --- a/src/transformers/models/__init__.py +++ b/src/transformers/models/__init__.py @@ -14,9 +14,12 @@ from . 
import ( albert, + align, altclip, audio_spectrogram_transformer, auto, + autoformer, + bark, bart, barthez, bartpho, @@ -34,8 +37,8 @@ blip, blip_2, bloom, - bort, bridgetower, + bros, byt5, camembert, canine, @@ -43,11 +46,15 @@ clap, clip, clipseg, + clvp, + code_llama, codegen, conditional_detr, convbert, convnext, + convnextv2, cpm, + cpmant, ctrl, cvt, data2vec, @@ -56,41 +63,56 @@ decision_transformer, deformable_detr, deit, + deprecated, + depth_anything, deta, detr, dialogpt, dinat, + dinov2, distilbert, dit, donut, dpr, dpt, efficientformer, + efficientnet, electra, + encodec, encoder_decoder, ernie, ernie_m, esm, + falcon, + fastspeech2_conformer, flaubert, flava, fnet, + focalnet, fsmt, funnel, + fuyu, git, glpn, gpt2, + gpt_bigcode, gpt_neo, gpt_neox, gpt_neox_japanese, gpt_sw3, gptj, + gptsan_japanese, graphormer, groupvit, herbert, hubert, ibert, + idefics, imagegpt, + informer, + instructblip, jukebox, + kosmos2, layoutlm, layoutlmv2, layoutlmv3, @@ -98,6 +120,8 @@ led, levit, lilt, + llama, + llava, longformer, longt5, luke, @@ -109,54 +133,77 @@ maskformer, mbart, mbart50, - mctct, + mega, megatron_bert, megatron_gpt2, + mgp_str, + mistral, + mixtral, mluke, - mmbt, mobilebert, mobilenet_v1, mobilenet_v2, mobilevit, + mobilevitv2, mpnet, + mpt, + mra, mt5, + musicgen, mvp, nat, nezha, nllb, + nllb_moe, + nougat, nystromformer, oneformer, openai, opt, + owlv2, owlvit, + patchtsmixer, + patchtst, pegasus, pegasus_x, perceiver, + persimmon, + phi, phobert, + pix2struct, plbart, poolformer, + pop2piano, prophetnet, + pvt, qdqbert, + qwen2, rag, realm, reformer, regnet, rembert, resnet, - retribert, roberta, roberta_prelayernorm, roc_bert, roformer, + rwkv, + sam, + seamless_m4t, + seamless_m4t_v2, segformer, sew, sew_d, + siglip, speech_encoder_decoder, speech_to_text, speech_to_text_2, speecht5, splinter, squeezebert, + stablelm, + swiftformer, swin, swin2sr, swinv2, @@ -164,19 +211,20 @@ t5, table_transformer, tapas, - tapex, time_series_transformer, timesformer, - trajectory_transformer, - transfo_xl, + timm_backbone, trocr, tvlt, + tvp, + umt5, unispeech, unispeech_sat, + univnet, upernet, - van, videomae, vilt, + vipllava, vision_encoder_decoder, vision_text_dual_encoder, visual_bert, @@ -184,7 +232,12 @@ vit_hybrid, vit_mae, vit_msn, + vitdet, + vitmatte, + vits, + vivit, wav2vec2, + wav2vec2_bert, wav2vec2_conformer, wav2vec2_phoneme, wav2vec2_with_lm, diff --git a/src/transformers/models/albert/configuration_albert.py b/src/transformers/models/albert/configuration_albert.py index fd0c6238879257..690be7fbbf2c0c 100644 --- a/src/transformers/models/albert/configuration_albert.py +++ b/src/transformers/models/albert/configuration_albert.py @@ -22,14 +22,14 @@ ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = { - "albert-base-v1": "https://huggingface.co/albert-base-v1/resolve/main/config.json", - "albert-large-v1": "https://huggingface.co/albert-large-v1/resolve/main/config.json", - "albert-xlarge-v1": "https://huggingface.co/albert-xlarge-v1/resolve/main/config.json", - "albert-xxlarge-v1": "https://huggingface.co/albert-xxlarge-v1/resolve/main/config.json", - "albert-base-v2": "https://huggingface.co/albert-base-v2/resolve/main/config.json", - "albert-large-v2": "https://huggingface.co/albert-large-v2/resolve/main/config.json", - "albert-xlarge-v2": "https://huggingface.co/albert-xlarge-v2/resolve/main/config.json", - "albert-xxlarge-v2": "https://huggingface.co/albert-xxlarge-v2/resolve/main/config.json", + "albert/albert-base-v1": 
"https://huggingface.co/albert/albert-base-v1/resolve/main/config.json", + "albert/albert-large-v1": "https://huggingface.co/albert/albert-large-v1/resolve/main/config.json", + "albert/albert-xlarge-v1": "https://huggingface.co/albert/albert-xlarge-v1/resolve/main/config.json", + "albert/albert-xxlarge-v1": "https://huggingface.co/albert/albert-xxlarge-v1/resolve/main/config.json", + "albert/albert-base-v2": "https://huggingface.co/albert/albert-base-v2/resolve/main/config.json", + "albert/albert-large-v2": "https://huggingface.co/albert/albert-large-v2/resolve/main/config.json", + "albert/albert-xlarge-v2": "https://huggingface.co/albert/albert-xlarge-v2/resolve/main/config.json", + "albert/albert-xxlarge-v2": "https://huggingface.co/albert/albert-xxlarge-v2/resolve/main/config.json", } @@ -38,7 +38,7 @@ class AlbertConfig(PretrainedConfig): This is the configuration class to store the configuration of a [`AlbertModel`] or a [`TFAlbertModel`]. It is used to instantiate an ALBERT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the ALBERT - [albert-xxlarge-v2](https://huggingface.co/albert-xxlarge-v2) architecture. + [albert/albert-xxlarge-v2](https://huggingface.co/albert/albert-xxlarge-v2) architecture. Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the documentation from [`PretrainedConfig`] for more information. @@ -85,6 +85,12 @@ class AlbertConfig(PretrainedConfig): [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155). For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658). + pad_token_id (`int`, *optional*, defaults to 0): + Padding token id. + bos_token_id (`int`, *optional*, defaults to 2): + Beginning of stream token id. + eos_token_id (`int`, *optional*, defaults to 3): + End of stream token id. Examples: diff --git a/src/transformers/models/albert/convert_albert_original_tf_checkpoint_to_pytorch.py b/src/transformers/models/albert/convert_albert_original_tf_checkpoint_to_pytorch.py index 8823a86fc8c61d..eecada8b432a2d 100644 --- a/src/transformers/models/albert/convert_albert_original_tf_checkpoint_to_pytorch.py +++ b/src/transformers/models/albert/convert_albert_original_tf_checkpoint_to_pytorch.py @@ -19,8 +19,8 @@ import torch -from transformers import AlbertConfig, AlbertForPreTraining, load_tf_weights_in_albert -from transformers.utils import logging +from ...utils import logging +from . 
import AlbertConfig, AlbertForPreTraining, load_tf_weights_in_albert logging.set_verbosity_info() diff --git a/src/transformers/models/albert/modeling_albert.py b/src/transformers/models/albert/modeling_albert.py index 687a927ef0c486..25ae832b03a00a 100755 --- a/src/transformers/models/albert/modeling_albert.py +++ b/src/transformers/models/albert/modeling_albert.py @@ -48,19 +48,19 @@ logger = logging.get_logger(__name__) -_CHECKPOINT_FOR_DOC = "albert-base-v2" +_CHECKPOINT_FOR_DOC = "albert/albert-base-v2" _CONFIG_FOR_DOC = "AlbertConfig" ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [ - "albert-base-v1", - "albert-large-v1", - "albert-xlarge-v1", - "albert-xxlarge-v1", - "albert-base-v2", - "albert-large-v2", - "albert-xlarge-v2", - "albert-xxlarge-v2", + "albert/albert-base-v1", + "albert/albert-large-v1", + "albert/albert-xlarge-v1", + "albert/albert-xxlarge-v1", + "albert/albert-base-v2", + "albert/albert-large-v2", + "albert/albert-xlarge-v2", + "albert/albert-xxlarge-v2", # See all ALBERT models at https://huggingface.co/models?filter=albert ] @@ -182,7 +182,7 @@ def load_tf_weights_in_albert(model, config, tf_checkpoint_path): try: if pointer.shape != array.shape: raise ValueError(f"Pointer shape {pointer.shape} and array shape {array.shape} mismatched") - except AssertionError as e: + except ValueError as e: e.args += (pointer.shape, array.shape) raise print(f"Initialize PyTorch weight {name} from {original_name}") @@ -208,7 +208,9 @@ def __init__(self, config: AlbertConfig): self.dropout = nn.Dropout(config.hidden_dropout_prob) # position_ids (1, len position emb) is contiguous in memory and exported when serialized - self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1))) + self.register_buffer( + "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False + ) self.position_embedding_type = getattr(config, "position_embedding_type", "absolute") self.register_buffer( "token_type_ids", torch.zeros(self.position_ids.size(), dtype=torch.long), persistent=False @@ -507,7 +509,6 @@ class AlbertPreTrainedModel(PreTrainedModel): config_class = AlbertConfig load_tf_weights = load_tf_weights_in_albert base_model_prefix = "albert" - _keys_to_ignore_on_load_missing = [r"position_ids"] def _init_weights(self, module): """Initialize the weights.""" @@ -687,9 +688,9 @@ def forward( position_ids: Optional[torch.LongTensor] = None, head_mask: Optional[torch.FloatTensor] = None, inputs_embeds: Optional[torch.FloatTensor] = None, - output_attentions: Optional[None] = None, - output_hidden_states: Optional[None] = None, - return_dict: Optional[None] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, ) -> Union[BaseModelOutputWithPooling, Tuple]: output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions output_hidden_states = ( @@ -700,6 +701,7 @@ def forward( if input_ids is not None and inputs_embeds is not None: raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") elif input_ids is not None: + self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask) input_shape = input_ids.size() elif inputs_embeds is not None: input_shape = inputs_embeds.size()[:-1] @@ -759,11 +761,7 @@ def forward( ALBERT_START_DOCSTRING, ) class AlbertForPreTraining(AlbertPreTrainedModel): - _keys_to_ignore_on_load_missing = [ - "predictions.decoder.weight", - 
"predictions.decoder.bias", - "embeddings.position_ids", - ] + _tied_weights_keys = ["predictions.decoder.bias", "predictions.decoder.weight"] def __init__(self, config: AlbertConfig): super().__init__(config) @@ -818,8 +816,8 @@ def forward( >>> from transformers import AutoTokenizer, AlbertForPreTraining >>> import torch - >>> tokenizer = AutoTokenizer.from_pretrained("albert-base-v2") - >>> model = AlbertForPreTraining.from_pretrained("albert-base-v2") + >>> tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v2") + >>> model = AlbertForPreTraining.from_pretrained("albert/albert-base-v2") >>> input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) >>> # Batch size 1 @@ -911,12 +909,7 @@ def forward(self, pooled_output: torch.Tensor) -> torch.Tensor: ALBERT_START_DOCSTRING, ) class AlbertForMaskedLM(AlbertPreTrainedModel): - _keys_to_ignore_on_load_unexpected = [r"pooler"] - _keys_to_ignore_on_load_missing = [ - "predictions.decoder.weight", - "predictions.decoder.bias", - "embeddings.position_ids", - ] + _tied_weights_keys = ["predictions.decoder.bias", "predictions.decoder.weight"] def __init__(self, config): super().__init__(config) @@ -965,8 +958,8 @@ def forward( >>> import torch >>> from transformers import AutoTokenizer, AlbertForMaskedLM - >>> tokenizer = AutoTokenizer.from_pretrained("albert-base-v2") - >>> model = AlbertForMaskedLM.from_pretrained("albert-base-v2") + >>> tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v2") + >>> model = AlbertForMaskedLM.from_pretrained("albert/albert-base-v2") >>> # add mask_token >>> inputs = tokenizer("The capital of [MASK] is Paris.", return_tensors="pt") @@ -1131,8 +1124,6 @@ def forward( ALBERT_START_DOCSTRING, ) class AlbertForTokenClassification(AlbertPreTrainedModel): - _keys_to_ignore_on_load_unexpected = [r"pooler"] - def __init__(self, config: AlbertConfig): super().__init__(config) self.num_labels = config.num_labels @@ -1216,8 +1207,6 @@ def forward( ALBERT_START_DOCSTRING, ) class AlbertForQuestionAnswering(AlbertPreTrainedModel): - _keys_to_ignore_on_load_unexpected = [r"pooler"] - def __init__(self, config: AlbertConfig): super().__init__(config) self.num_labels = config.num_labels diff --git a/src/transformers/models/albert/modeling_flax_albert.py b/src/transformers/models/albert/modeling_flax_albert.py index 0ff1b9276a19d6..b2c01ded3619ca 100644 --- a/src/transformers/models/albert/modeling_flax_albert.py +++ b/src/transformers/models/albert/modeling_flax_albert.py @@ -47,7 +47,7 @@ logger = logging.get_logger(__name__) -_CHECKPOINT_FOR_DOC = "albert-base-v2" +_CHECKPOINT_FOR_DOC = "albert/albert-base-v2" _CONFIG_FOR_DOC = "AlbertConfig" @@ -86,9 +86,10 @@ class FlaxAlbertForPreTrainingOutput(ModelOutput): This model inherits from [`FlaxPreTrainedModel`]. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading, saving and converting weights from PyTorch models) - This model is also a Flax Linen [flax.linen.Module](https://flax.readthedocs.io/en/latest/flax.linen.html#module) - subclass. Use it as a regular Flax linen Module and refer to the Flax documentation for all matter related to - general usage and behavior. + This model is also a + [flax.linen.Module](https://flax.readthedocs.io/en/latest/api_reference/flax.linen/module.html) subclass. Use it as + a regular Flax linen Module and refer to the Flax documentation for all matter related to general usage and + behavior. 
Finally, this model supports inherent JAX features such as: @@ -173,7 +174,6 @@ def setup(self): self.LayerNorm = nn.LayerNorm(epsilon=self.config.layer_norm_eps, dtype=self.dtype) self.dropout = nn.Dropout(rate=self.config.hidden_dropout_prob) - # Copied from transformers.models.bert.modeling_flax_bert.FlaxBertEmbeddings.__call__ def __call__(self, input_ids, token_type_ids, position_ids, deterministic: bool = True): # Embed inputs_embeds = self.word_embeddings(input_ids.astype("i4")) @@ -754,8 +754,8 @@ class FlaxAlbertForPreTraining(FlaxAlbertPreTrainedModel): ```python >>> from transformers import AutoTokenizer, FlaxAlbertForPreTraining - >>> tokenizer = AutoTokenizer.from_pretrained("albert-base-v2") - >>> model = FlaxAlbertForPreTraining.from_pretrained("albert-base-v2") + >>> tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v2") + >>> model = FlaxAlbertForPreTraining.from_pretrained("albert/albert-base-v2") >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="np") >>> outputs = model(**inputs) @@ -829,7 +829,9 @@ class FlaxAlbertForMaskedLM(FlaxAlbertPreTrainedModel): module_class = FlaxAlbertForMaskedLMModule -append_call_sample_docstring(FlaxAlbertForMaskedLM, _CHECKPOINT_FOR_DOC, FlaxMaskedLMOutput, _CONFIG_FOR_DOC) +append_call_sample_docstring( + FlaxAlbertForMaskedLM, _CHECKPOINT_FOR_DOC, FlaxMaskedLMOutput, _CONFIG_FOR_DOC, revision="refs/pr/11" +) class FlaxAlbertForSequenceClassificationModule(nn.Module): diff --git a/src/transformers/models/albert/modeling_tf_albert.py b/src/transformers/models/albert/modeling_tf_albert.py index 247ee395dc60fe..1225465c5260a8 100644 --- a/src/transformers/models/albert/modeling_tf_albert.py +++ b/src/transformers/models/albert/modeling_tf_albert.py @@ -15,6 +15,9 @@ # limitations under the License. 
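The PyTorch ALBERT hunk above now registers `position_ids` with `persistent=False`, which keeps the buffer out of `state_dict()`; that is also why the loading code earlier in this patch filters non-persistent buffers out of `unexpected_keys`. A tiny illustration of the difference, with hypothetical buffer names:

```python
import torch
import torch.nn as nn


class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("saved_ids", torch.arange(4))                      # serialized
        self.register_buffer("derived_ids", torch.arange(4), persistent=False)  # recreated, not serialized


demo = Demo()
print(sorted(demo.state_dict()))                    # ['saved_ids']
print(sorted(n for n, _ in demo.named_buffers()))   # ['derived_ids', 'saved_ids']
```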
""" TF 2.0 ALBERT model.""" + +from __future__ import annotations + import math from dataclasses import dataclass from typing import Dict, Optional, Tuple, Union @@ -41,12 +44,12 @@ TFSequenceClassificationLoss, TFTokenClassificationLoss, get_initializer, + keras, keras_serializable, unpack_inputs, ) -from ...tf_utils import shape_list, stable_softmax +from ...tf_utils import check_embeddings_within_bounds, shape_list, stable_softmax from ...utils import ( - MULTIPLE_CHOICE_DUMMY_INPUTS, ModelOutput, add_code_sample_docstrings, add_start_docstrings, @@ -59,18 +62,18 @@ logger = logging.get_logger(__name__) -_CHECKPOINT_FOR_DOC = "albert-base-v2" +_CHECKPOINT_FOR_DOC = "albert/albert-base-v2" _CONFIG_FOR_DOC = "AlbertConfig" TF_ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [ - "albert-base-v1", - "albert-large-v1", - "albert-xlarge-v1", - "albert-xxlarge-v1", - "albert-base-v2", - "albert-large-v2", - "albert-xlarge-v2", - "albert-xxlarge-v2", + "albert/albert-base-v1", + "albert/albert-large-v1", + "albert/albert-xlarge-v1", + "albert/albert-xxlarge-v1", + "albert/albert-base-v2", + "albert/albert-large-v2", + "albert/albert-xlarge-v2", + "albert/albert-xxlarge-v2", # See all ALBERT models at https://huggingface.co/models?filter=albert ] @@ -82,9 +85,7 @@ class TFAlbertPreTrainingLoss: """ def hf_compute_loss(self, labels: tf.Tensor, logits: tf.Tensor) -> tf.Tensor: - loss_fn = tf.keras.losses.SparseCategoricalCrossentropy( - from_logits=True, reduction=tf.keras.losses.Reduction.NONE - ) + loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction=keras.losses.Reduction.NONE) if self.config.tf_legacy_loss: # make sure only labels that are not equal to -100 # are taken into account as loss @@ -131,7 +132,7 @@ def hf_compute_loss(self, labels: tf.Tensor, logits: tf.Tensor) -> tf.Tensor: return tf.reshape(reduced_masked_lm_loss + reduced_masked_sop_loss, (1,)) -class TFAlbertEmbeddings(tf.keras.layers.Layer): +class TFAlbertEmbeddings(keras.layers.Layer): """Construct the embeddings from word, position and token_type embeddings.""" def __init__(self, config: AlbertConfig, **kwargs): @@ -141,10 +142,10 @@ def __init__(self, config: AlbertConfig, **kwargs): self.embedding_size = config.embedding_size self.max_position_embeddings = config.max_position_embeddings self.initializer_range = config.initializer_range - self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") - self.dropout = tf.keras.layers.Dropout(rate=config.hidden_dropout_prob) + self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") + self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob) - def build(self, input_shape: tf.TensorShape): + def build(self, input_shape=None): with tf.name_scope("word_embeddings"): self.weight = self.add_weight( name="weight", @@ -166,7 +167,12 @@ def build(self, input_shape: tf.TensorShape): initializer=get_initializer(self.initializer_range), ) - super().build(input_shape) + if self.built: + return + self.built = True + if getattr(self, "LayerNorm", None) is not None: + with tf.name_scope(self.LayerNorm.name): + self.LayerNorm.build([None, None, self.config.embedding_size]) # Copied from transformers.models.bert.modeling_tf_bert.TFBertEmbeddings.call def call( @@ -188,16 +194,7 @@ def call( raise ValueError("Need to provide either `input_ids` or `input_embeds`.") if input_ids is not None: - # Note: tf.gather, on which the embedding layer is based, won't check positive out of bound - # 
indices on GPU, returning zeros instead. This is a dangerous silent behavior. - tf.debugging.assert_less( - input_ids, - tf.cast(self.config.vocab_size, dtype=input_ids.dtype), - message=( - "input_ids must be smaller than the embedding layer's input dimension (got" - f" {tf.math.reduce_max(input_ids)} >= {self.config.vocab_size})" - ), - ) + check_embeddings_within_bounds(input_ids, self.config.vocab_size) inputs_embeds = tf.gather(params=self.weight, indices=input_ids) input_shape = shape_list(inputs_embeds)[:-1] @@ -219,7 +216,7 @@ def call( return final_embeddings -class TFAlbertAttention(tf.keras.layers.Layer): +class TFAlbertAttention(keras.layers.Layer): """Contains the complete attention sublayer, including both dropouts and layer norm.""" def __init__(self, config: AlbertConfig, **kwargs): @@ -237,22 +234,23 @@ def __init__(self, config: AlbertConfig, **kwargs): self.sqrt_att_head_size = math.sqrt(self.attention_head_size) self.output_attentions = config.output_attentions - self.query = tf.keras.layers.Dense( + self.query = keras.layers.Dense( units=self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="query" ) - self.key = tf.keras.layers.Dense( + self.key = keras.layers.Dense( units=self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="key" ) - self.value = tf.keras.layers.Dense( + self.value = keras.layers.Dense( units=self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="value" ) - self.dense = tf.keras.layers.Dense( + self.dense = keras.layers.Dense( units=config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="dense" ) - self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") + self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") # Two different dropout probabilities; see https://github.com/google-research/albert/blob/master/modeling.py#L971-L993 - self.attention_dropout = tf.keras.layers.Dropout(rate=config.attention_probs_dropout_prob) - self.output_dropout = tf.keras.layers.Dropout(rate=config.hidden_dropout_prob) + self.attention_dropout = keras.layers.Dropout(rate=config.attention_probs_dropout_prob) + self.output_dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob) + self.config = config def transpose_for_scores(self, tensor: tf.Tensor, batch_size: int) -> tf.Tensor: # Reshape from [batch_size, seq_length, all_head_size] to [batch_size, seq_length, num_attention_heads, attention_head_size] @@ -314,13 +312,33 @@ def call( return outputs - -class TFAlbertLayer(tf.keras.layers.Layer): + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "query", None) is not None: + with tf.name_scope(self.query.name): + self.query.build([None, None, self.config.hidden_size]) + if getattr(self, "key", None) is not None: + with tf.name_scope(self.key.name): + self.key.build([None, None, self.config.hidden_size]) + if getattr(self, "value", None) is not None: + with tf.name_scope(self.value.name): + self.value.build([None, None, self.config.hidden_size]) + if getattr(self, "dense", None) is not None: + with tf.name_scope(self.dense.name): + self.dense.build([None, None, self.config.hidden_size]) + if getattr(self, "LayerNorm", None) is not None: + with tf.name_scope(self.LayerNorm.name): + self.LayerNorm.build([None, None, self.config.hidden_size]) + + +class TFAlbertLayer(keras.layers.Layer): def __init__(self, 
config: AlbertConfig, **kwargs): super().__init__(**kwargs) self.attention = TFAlbertAttention(config, name="attention") - self.ffn = tf.keras.layers.Dense( + self.ffn = keras.layers.Dense( units=config.intermediate_size, kernel_initializer=get_initializer(config.initializer_range), name="ffn" ) @@ -329,13 +347,14 @@ def __init__(self, config: AlbertConfig, **kwargs): else: self.activation = config.hidden_act - self.ffn_output = tf.keras.layers.Dense( + self.ffn_output = keras.layers.Dense( units=config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="ffn_output" ) - self.full_layer_layer_norm = tf.keras.layers.LayerNormalization( + self.full_layer_layer_norm = keras.layers.LayerNormalization( epsilon=config.layer_norm_eps, name="full_layer_layer_norm" ) - self.dropout = tf.keras.layers.Dropout(rate=config.hidden_dropout_prob) + self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob) + self.config = config def call( self, @@ -363,8 +382,25 @@ def call( return outputs - -class TFAlbertLayerGroup(tf.keras.layers.Layer): + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "attention", None) is not None: + with tf.name_scope(self.attention.name): + self.attention.build(None) + if getattr(self, "ffn", None) is not None: + with tf.name_scope(self.ffn.name): + self.ffn.build([None, None, self.config.hidden_size]) + if getattr(self, "ffn_output", None) is not None: + with tf.name_scope(self.ffn_output.name): + self.ffn_output.build([None, None, self.config.intermediate_size]) + if getattr(self, "full_layer_layer_norm", None) is not None: + with tf.name_scope(self.full_layer_layer_norm.name): + self.full_layer_layer_norm.build([None, None, self.config.hidden_size]) + + +class TFAlbertLayerGroup(keras.layers.Layer): def __init__(self, config: AlbertConfig, **kwargs): super().__init__(**kwargs) @@ -406,8 +442,17 @@ def call( return tuple(v for v in [hidden_states, layer_hidden_states, layer_attentions] if v is not None) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "albert_layers", None) is not None: + for layer in self.albert_layers: + with tf.name_scope(layer.name): + layer.build(None) + -class TFAlbertTransformer(tf.keras.layers.Layer): +class TFAlbertTransformer(keras.layers.Layer): def __init__(self, config: AlbertConfig, **kwargs): super().__init__(**kwargs) @@ -415,7 +460,7 @@ def __init__(self, config: AlbertConfig, **kwargs): self.num_hidden_groups = config.num_hidden_groups # Number of layers in a hidden group self.layers_per_group = int(config.num_hidden_layers / config.num_hidden_groups) - self.embedding_hidden_mapping_in = tf.keras.layers.Dense( + self.embedding_hidden_mapping_in = keras.layers.Dense( units=config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="embedding_hidden_mapping_in", @@ -423,6 +468,7 @@ def __init__(self, config: AlbertConfig, **kwargs): self.albert_layer_groups = [ TFAlbertLayerGroup(config, name=f"albert_layer_groups_._{i}") for i in range(config.num_hidden_groups) ] + self.config = config def call( self, @@ -464,6 +510,18 @@ def call( last_hidden_state=hidden_states, hidden_states=all_hidden_states, attentions=all_attentions ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "embedding_hidden_mapping_in", None) is not None: + with tf.name_scope(self.embedding_hidden_mapping_in.name): + 
self.embedding_hidden_mapping_in.build([None, None, self.config.embedding_size]) + if getattr(self, "albert_layer_groups", None) is not None: + for layer in self.albert_layer_groups: + with tf.name_scope(layer.name): + layer.build(None) + class TFAlbertPreTrainedModel(TFPreTrainedModel): """ @@ -475,13 +533,13 @@ class TFAlbertPreTrainedModel(TFPreTrainedModel): base_model_prefix = "albert" -class TFAlbertMLMHead(tf.keras.layers.Layer): - def __init__(self, config: AlbertConfig, input_embeddings: tf.keras.layers.Layer, **kwargs): +class TFAlbertMLMHead(keras.layers.Layer): + def __init__(self, config: AlbertConfig, input_embeddings: keras.layers.Layer, **kwargs): super().__init__(**kwargs) self.config = config self.embedding_size = config.embedding_size - self.dense = tf.keras.layers.Dense( + self.dense = keras.layers.Dense( config.embedding_size, kernel_initializer=get_initializer(config.initializer_range), name="dense" ) if isinstance(config.hidden_act, str): @@ -489,21 +547,29 @@ def __init__(self, config: AlbertConfig, input_embeddings: tf.keras.layers.Layer else: self.activation = config.hidden_act - self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") + self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") # The output weights are the same as the input embeddings, but there is # an output-only bias for each token. self.decoder = input_embeddings - def build(self, input_shape: tf.TensorShape): + def build(self, input_shape=None): self.bias = self.add_weight(shape=(self.config.vocab_size,), initializer="zeros", trainable=True, name="bias") self.decoder_bias = self.add_weight( shape=(self.config.vocab_size,), initializer="zeros", trainable=True, name="decoder/bias" ) - super().build(input_shape) - - def get_output_embeddings(self) -> tf.keras.layers.Layer: + if self.built: + return + self.built = True + if getattr(self, "dense", None) is not None: + with tf.name_scope(self.dense.name): + self.dense.build([None, None, self.config.hidden_size]) + if getattr(self, "LayerNorm", None) is not None: + with tf.name_scope(self.LayerNorm.name): + self.LayerNorm.build([None, None, self.config.embedding_size]) + + def get_output_embeddings(self) -> keras.layers.Layer: return self.decoder def set_output_embeddings(self, value: tf.Variable): @@ -532,7 +598,7 @@ def call(self, hidden_states: tf.Tensor) -> tf.Tensor: @keras_serializable -class TFAlbertMainLayer(tf.keras.layers.Layer): +class TFAlbertMainLayer(keras.layers.Layer): config_class = AlbertConfig def __init__(self, config: AlbertConfig, add_pooling_layer: bool = True, **kwargs): @@ -543,7 +609,7 @@ def __init__(self, config: AlbertConfig, add_pooling_layer: bool = True, **kwarg self.embeddings = TFAlbertEmbeddings(config, name="embeddings") self.encoder = TFAlbertTransformer(config, name="encoder") self.pooler = ( - tf.keras.layers.Dense( + keras.layers.Dense( units=config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), activation="tanh", @@ -553,7 +619,7 @@ def __init__(self, config: AlbertConfig, add_pooling_layer: bool = True, **kwarg else None ) - def get_input_embeddings(self) -> tf.keras.layers.Layer: + def get_input_embeddings(self) -> keras.layers.Layer: return self.embeddings def set_input_embeddings(self, value: tf.Variable): @@ -570,12 +636,12 @@ class PreTrainedModel @unpack_inputs def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - 
token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, @@ -657,6 +723,20 @@ def call( attentions=encoder_outputs.attentions, ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "embeddings", None) is not None: + with tf.name_scope(self.embeddings.name): + self.embeddings.build(None) + if getattr(self, "encoder", None) is not None: + with tf.name_scope(self.encoder.name): + self.encoder.build(None) + if getattr(self, "pooler", None) is not None: + with tf.name_scope(self.pooler.name): + self.pooler.build([None, None, self.config.hidden_size]) + @dataclass class TFAlbertForPreTrainingOutput(ModelOutput): @@ -685,8 +765,8 @@ class TFAlbertForPreTrainingOutput(ModelOutput): loss: tf.Tensor = None prediction_logits: tf.Tensor = None sop_logits: tf.Tensor = None - hidden_states: Optional[Tuple[tf.Tensor]] = None - attentions: Optional[Tuple[tf.Tensor]] = None + hidden_states: Tuple[tf.Tensor] | None = None + attentions: Tuple[tf.Tensor] | None = None ALBERT_START_DOCSTRING = r""" @@ -695,7 +775,7 @@ class TFAlbertForPreTrainingOutput(ModelOutput): library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.) - This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it + This model is also a [keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage and behavior. 
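The TF changes in this file replace implicit layer building with explicit `build()` methods that guard on `self.built` and build each sublayer inside its own name scope, so weight names stay stable without requiring a real forward pass. A stripped-down sketch of that pattern, assuming TF 2.x with its bundled Keras; the layer and attribute names are illustrative:

```python
import tensorflow as tf
from tensorflow import keras


class TinyBlock(keras.layers.Layer):
    def __init__(self, hidden_size: int, **kwargs):
        super().__init__(**kwargs)
        self.hidden_size = hidden_size
        self.dense = keras.layers.Dense(hidden_size, name="dense")

    def build(self, input_shape=None):
        if self.built:  # make repeated build() calls a no-op
            return
        self.built = True
        if getattr(self, "dense", None) is not None:
            with tf.name_scope(self.dense.name):  # keep the sublayer's weight names deterministic
                self.dense.build([None, None, self.hidden_size])


block = TinyBlock(hidden_size=8)
block.build(None)
print([w.name for w in block.weights])
```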
@@ -806,12 +886,12 @@ def __init__(self, config: AlbertConfig, *inputs, **kwargs): ) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, @@ -832,16 +912,13 @@ def call( return outputs - def serving_output(self, output: TFBaseModelOutputWithPooling) -> TFBaseModelOutputWithPooling: - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - - return TFBaseModelOutputWithPooling( - last_hidden_state=output.last_hidden_state, - pooler_output=output.pooler_output, - hidden_states=hs, - attentions=attns, - ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "albert", None) is not None: + with tf.name_scope(self.albert.name): + self.albert.build(None) @add_start_docstrings( @@ -864,7 +941,7 @@ def __init__(self, config: AlbertConfig, *inputs, **kwargs): self.predictions = TFAlbertMLMHead(config, input_embeddings=self.albert.embeddings, name="predictions") self.sop_classifier = TFAlbertSOPHead(config, name="sop_classifier") - def get_lm_head(self) -> tf.keras.layers.Layer: + def get_lm_head(self) -> keras.layers.Layer: return self.predictions @unpack_inputs @@ -872,17 +949,17 @@ def get_lm_head(self) -> tf.keras.layers.Layer: @replace_return_docstrings(output_type=TFAlbertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - labels: Optional[Union[np.ndarray, tf.Tensor]] = None, - sentence_order_label: Optional[Union[np.ndarray, tf.Tensor]] = None, + labels: np.ndarray | tf.Tensor | None = None, + sentence_order_label: np.ndarray | tf.Tensor | None = None, training: Optional[bool] = False, ) -> Union[TFAlbertForPreTrainingOutput, Tuple[tf.Tensor]]: r""" @@ -894,8 +971,8 @@ def call( >>> import tensorflow as tf >>> from transformers import AutoTokenizer, TFAlbertForPreTraining - >>> tokenizer = AutoTokenizer.from_pretrained("albert-base-v2") - >>> model = 
TFAlbertForPreTraining.from_pretrained("albert-base-v2") + >>> tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v2") + >>> model = TFAlbertForPreTraining.from_pretrained("albert/albert-base-v2") >>> input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True))[None, :] >>> # Batch size 1 @@ -939,28 +1016,32 @@ def call( attentions=outputs.attentions, ) - def serving_output(self, output: TFAlbertForPreTrainingOutput) -> TFAlbertForPreTrainingOutput: - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - - return TFAlbertForPreTrainingOutput( - prediction_logits=output.prediction_logits, - sop_logits=output.sop_logits, - hidden_states=hs, - attentions=attns, - ) - - -class TFAlbertSOPHead(tf.keras.layers.Layer): + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "albert", None) is not None: + with tf.name_scope(self.albert.name): + self.albert.build(None) + if getattr(self, "predictions", None) is not None: + with tf.name_scope(self.predictions.name): + self.predictions.build(None) + if getattr(self, "sop_classifier", None) is not None: + with tf.name_scope(self.sop_classifier.name): + self.sop_classifier.build(None) + + +class TFAlbertSOPHead(keras.layers.Layer): def __init__(self, config: AlbertConfig, **kwargs): super().__init__(**kwargs) - self.dropout = tf.keras.layers.Dropout(rate=config.classifier_dropout_prob) - self.classifier = tf.keras.layers.Dense( + self.dropout = keras.layers.Dropout(rate=config.classifier_dropout_prob) + self.classifier = keras.layers.Dense( units=config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier", ) + self.config = config def call(self, pooled_output: tf.Tensor, training: bool) -> tf.Tensor: dropout_pooled_output = self.dropout(inputs=pooled_output, training=training) @@ -968,6 +1049,14 @@ def call(self, pooled_output: tf.Tensor, training: bool) -> tf.Tensor: return logits + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "classifier", None) is not None: + with tf.name_scope(self.classifier.name): + self.classifier.build([None, None, self.config.hidden_size]) + @add_start_docstrings("""Albert Model with a `language modeling` head on top.""", ALBERT_START_DOCSTRING) class TFAlbertForMaskedLM(TFAlbertPreTrainedModel, TFMaskedLanguageModelingLoss): @@ -980,7 +1069,7 @@ def __init__(self, config: AlbertConfig, *inputs, **kwargs): self.albert = TFAlbertMainLayer(config, add_pooling_layer=False, name="albert") self.predictions = TFAlbertMLMHead(config, input_embeddings=self.albert.embeddings, name="predictions") - def get_lm_head(self) -> tf.keras.layers.Layer: + def get_lm_head(self) -> keras.layers.Layer: return self.predictions @unpack_inputs @@ -988,16 +1077,16 @@ def get_lm_head(self) -> tf.keras.layers.Layer: @replace_return_docstrings(output_type=TFMaskedLMOutput, config_class=_CONFIG_FOR_DOC) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: 
np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - labels: Optional[Union[np.ndarray, tf.Tensor]] = None, + labels: np.ndarray | tf.Tensor | None = None, training: Optional[bool] = False, ) -> Union[TFMaskedLMOutput, Tuple[tf.Tensor]]: r""" @@ -1014,8 +1103,8 @@ def call( >>> import tensorflow as tf >>> from transformers import AutoTokenizer, TFAlbertForMaskedLM - >>> tokenizer = AutoTokenizer.from_pretrained("albert-base-v2") - >>> model = TFAlbertForMaskedLM.from_pretrained("albert-base-v2") + >>> tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v2") + >>> model = TFAlbertForMaskedLM.from_pretrained("albert/albert-base-v2") >>> # add mask_token >>> inputs = tokenizer(f"The capital of [MASK] is Paris.", return_tensors="tf") @@ -1064,12 +1153,16 @@ def call( attentions=outputs.attentions, ) - # Copied from transformers.models.bert.modeling_tf_bert.TFBertForMaskedLM.serving_output - def serving_output(self, output: TFMaskedLMOutput) -> TFMaskedLMOutput: - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - - return TFMaskedLMOutput(logits=output.logits, hidden_states=hs, attentions=attns) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "albert", None) is not None: + with tf.name_scope(self.albert.name): + self.albert.build(None) + if getattr(self, "predictions", None) is not None: + with tf.name_scope(self.predictions.name): + self.predictions.build(None) @add_start_docstrings( @@ -1090,10 +1183,11 @@ def __init__(self, config: AlbertConfig, *inputs, **kwargs): self.num_labels = config.num_labels self.albert = TFAlbertMainLayer(config, name="albert") - self.dropout = tf.keras.layers.Dropout(rate=config.classifier_dropout_prob) - self.classifier = tf.keras.layers.Dense( + self.dropout = keras.layers.Dropout(rate=config.classifier_dropout_prob) + self.classifier = keras.layers.Dense( units=config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier" ) + self.config = config @unpack_inputs @add_start_docstrings_to_model_forward(ALBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length")) @@ -1106,16 +1200,16 @@ def __init__(self, config: AlbertConfig, *inputs, **kwargs): ) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - labels: Optional[Union[np.ndarray, tf.Tensor]] = 
None, + labels: np.ndarray | tf.Tensor | None = None, training: Optional[bool] = False, ) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]: r""" @@ -1153,12 +1247,16 @@ def call( attentions=outputs.attentions, ) - # Copied from transformers.models.bert.modeling_tf_bert.TFBertForSequenceClassification.serving_output - def serving_output(self, output: TFSequenceClassifierOutput) -> TFSequenceClassifierOutput: - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - - return TFSequenceClassifierOutput(logits=output.logits, hidden_states=hs, attentions=attns) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "albert", None) is not None: + with tf.name_scope(self.albert.name): + self.albert.build(None) + if getattr(self, "classifier", None) is not None: + with tf.name_scope(self.classifier.name): + self.classifier.build([None, None, self.config.hidden_size]) @add_start_docstrings( @@ -1184,10 +1282,11 @@ def __init__(self, config: AlbertConfig, *inputs, **kwargs): if config.classifier_dropout_prob is not None else config.hidden_dropout_prob ) - self.dropout = tf.keras.layers.Dropout(rate=classifier_dropout_prob) - self.classifier = tf.keras.layers.Dense( + self.dropout = keras.layers.Dropout(rate=classifier_dropout_prob) + self.classifier = keras.layers.Dense( units=config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier" ) + self.config = config @unpack_inputs @add_start_docstrings_to_model_forward(ALBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length")) @@ -1198,16 +1297,16 @@ def __init__(self, config: AlbertConfig, *inputs, **kwargs): ) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - labels: Optional[Union[np.ndarray, tf.Tensor]] = None, + labels: np.ndarray | tf.Tensor | None = None, training: Optional[bool] = False, ) -> Union[TFTokenClassifierOutput, Tuple[tf.Tensor]]: r""" @@ -1243,12 +1342,16 @@ def call( attentions=outputs.attentions, ) - # Copied from transformers.models.bert.modeling_tf_bert.TFBertForTokenClassification.serving_output - def serving_output(self, output: TFTokenClassifierOutput) -> TFTokenClassifierOutput: - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - - return TFTokenClassifierOutput(logits=output.logits, hidden_states=hs, attentions=attns) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "albert", None) is not None: + with tf.name_scope(self.albert.name): + 
self.albert.build(None) + if getattr(self, "classifier", None) is not None: + with tf.name_scope(self.classifier.name): + self.classifier.build([None, None, self.config.hidden_size]) @add_start_docstrings( @@ -1268,9 +1371,10 @@ def __init__(self, config: AlbertConfig, *inputs, **kwargs): self.num_labels = config.num_labels self.albert = TFAlbertMainLayer(config, add_pooling_layer=False, name="albert") - self.qa_outputs = tf.keras.layers.Dense( + self.qa_outputs = keras.layers.Dense( units=config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="qa_outputs" ) + self.config = config @unpack_inputs @add_start_docstrings_to_model_forward(ALBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length")) @@ -1285,17 +1389,17 @@ def __init__(self, config: AlbertConfig, *inputs, **kwargs): ) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - start_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, - end_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, + start_positions: np.ndarray | tf.Tensor | None = None, + end_positions: np.ndarray | tf.Tensor | None = None, training: Optional[bool] = False, ) -> Union[TFQuestionAnsweringModelOutput, Tuple[tf.Tensor]]: r""" @@ -1345,14 +1449,16 @@ def call( attentions=outputs.attentions, ) - # Copied from transformers.models.bert.modeling_tf_bert.TFBertForQuestionAnswering.serving_output - def serving_output(self, output: TFQuestionAnsweringModelOutput) -> TFQuestionAnsweringModelOutput: - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - - return TFQuestionAnsweringModelOutput( - start_logits=output.start_logits, end_logits=output.end_logits, hidden_states=hs, attentions=attns - ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "albert", None) is not None: + with tf.name_scope(self.albert.name): + self.albert.build(None) + if getattr(self, "qa_outputs", None) is not None: + with tf.name_scope(self.qa_outputs.name): + self.qa_outputs.build([None, None, self.config.hidden_size]) @add_start_docstrings( @@ -1371,20 +1477,11 @@ def __init__(self, config: AlbertConfig, *inputs, **kwargs): super().__init__(config, *inputs, **kwargs) self.albert = TFAlbertMainLayer(config, name="albert") - self.dropout = tf.keras.layers.Dropout(rate=config.hidden_dropout_prob) - self.classifier = tf.keras.layers.Dense( + self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob) + self.classifier = keras.layers.Dense( units=1, kernel_initializer=get_initializer(config.initializer_range), name="classifier" ) - - @property - def dummy_inputs(self): - """ - Dummy inputs to build the 
network. - - Returns: - tf.Tensor with dummy inputs - """ - return {"input_ids": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS, dtype=tf.int32)} + self.config = config @unpack_inputs @add_start_docstrings_to_model_forward(ALBERT_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length")) @@ -1395,16 +1492,16 @@ def dummy_inputs(self): ) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - labels: Optional[Union[np.ndarray, tf.Tensor]] = None, + labels: np.ndarray | tf.Tensor | None = None, training: Optional[bool] = False, ) -> Union[TFMultipleChoiceModelOutput, Tuple[tf.Tensor]]: r""" @@ -1464,24 +1561,13 @@ def call( attentions=outputs.attentions, ) - @tf.function( - input_signature=[ - { - "input_ids": tf.TensorSpec((None, None, None), tf.int32, name="input_ids"), - "attention_mask": tf.TensorSpec((None, None, None), tf.int32, name="attention_mask"), - "token_type_ids": tf.TensorSpec((None, None, None), tf.int32, name="token_type_ids"), - } - ] - ) - # Copied from transformers.models.bert.modeling_tf_bert.TFBertForMultipleChoice.serving - def serving(self, inputs: Dict[str, tf.Tensor]) -> TFMultipleChoiceModelOutput: - output = self.call(input_ids=inputs) - - return self.serving_output(output) - - # Copied from transformers.models.bert.modeling_tf_bert.TFBertForMultipleChoice.serving_output - def serving_output(self, output: TFMultipleChoiceModelOutput) -> TFMultipleChoiceModelOutput: - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - - return TFMultipleChoiceModelOutput(logits=output.logits, hidden_states=hs, attentions=attns) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "albert", None) is not None: + with tf.name_scope(self.albert.name): + self.albert.build(None) + if getattr(self, "classifier", None) is not None: + with tf.name_scope(self.classifier.name): + self.classifier.build([None, None, self.config.hidden_size]) diff --git a/src/transformers/models/albert/tokenization_albert.py b/src/transformers/models/albert/tokenization_albert.py index b043a14989fc57..7baaa0a6000e6f 100644 --- a/src/transformers/models/albert/tokenization_albert.py +++ b/src/transformers/models/albert/tokenization_albert.py @@ -31,26 +31,26 @@ PRETRAINED_VOCAB_FILES_MAP = { "vocab_file": { - "albert-base-v1": "https://huggingface.co/albert-base-v1/resolve/main/spiece.model", - "albert-large-v1": "https://huggingface.co/albert-large-v1/resolve/main/spiece.model", - "albert-xlarge-v1": "https://huggingface.co/albert-xlarge-v1/resolve/main/spiece.model", - "albert-xxlarge-v1": "https://huggingface.co/albert-xxlarge-v1/resolve/main/spiece.model", - 
"albert-base-v2": "https://huggingface.co/albert-base-v2/resolve/main/spiece.model", - "albert-large-v2": "https://huggingface.co/albert-large-v2/resolve/main/spiece.model", - "albert-xlarge-v2": "https://huggingface.co/albert-xlarge-v2/resolve/main/spiece.model", - "albert-xxlarge-v2": "https://huggingface.co/albert-xxlarge-v2/resolve/main/spiece.model", + "albert/albert-base-v1": "https://huggingface.co/albert/albert-base-v1/resolve/main/spiece.model", + "albert/albert-large-v1": "https://huggingface.co/albert/albert-large-v1/resolve/main/spiece.model", + "albert/albert-xlarge-v1": "https://huggingface.co/albert/albert-xlarge-v1/resolve/main/spiece.model", + "albert/albert-xxlarge-v1": "https://huggingface.co/albert/albert-xxlarge-v1/resolve/main/spiece.model", + "albert/albert-base-v2": "https://huggingface.co/albert/albert-base-v2/resolve/main/spiece.model", + "albert/albert-large-v2": "https://huggingface.co/albert/albert-large-v2/resolve/main/spiece.model", + "albert/albert-xlarge-v2": "https://huggingface.co/albert/albert-xlarge-v2/resolve/main/spiece.model", + "albert/albert-xxlarge-v2": "https://huggingface.co/albert/albert-xxlarge-v2/resolve/main/spiece.model", } } PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { - "albert-base-v1": 512, - "albert-large-v1": 512, - "albert-xlarge-v1": 512, - "albert-xxlarge-v1": 512, - "albert-base-v2": 512, - "albert-large-v2": 512, - "albert-xlarge-v2": 512, - "albert-xxlarge-v2": 512, + "albert/albert-base-v1": 512, + "albert/albert-large-v1": 512, + "albert/albert-xlarge-v1": 512, + "albert/albert-xxlarge-v1": 512, + "albert/albert-base-v2": 512, + "albert/albert-large-v2": 512, + "albert/albert-xlarge-v2": 512, + "albert/albert-xxlarge-v2": 512, } SPIECE_UNDERLINE = "▁" @@ -159,6 +159,14 @@ def __init__( self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs + self.do_lower_case = do_lower_case + self.remove_space = remove_space + self.keep_accents = keep_accents + self.vocab_file = vocab_file + + self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs) + self.sp_model.Load(vocab_file) + super().__init__( do_lower_case=do_lower_case, remove_space=remove_space, @@ -174,19 +182,11 @@ def __init__( **kwargs, ) - self.do_lower_case = do_lower_case - self.remove_space = remove_space - self.keep_accents = keep_accents - self.vocab_file = vocab_file - - self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs) - self.sp_model.Load(vocab_file) - @property - def vocab_size(self): + def vocab_size(self) -> int: return len(self.sp_model) - def get_vocab(self): + def get_vocab(self) -> Dict[str, int]: vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)} vocab.update(self.added_tokens_encoder) return vocab @@ -228,6 +228,8 @@ def _tokenize(self, text: str) -> List[str]: new_pieces = [] for piece in pieces: if len(piece) > 1 and piece[-1] == str(",") and piece[-2].isdigit(): + # Logic to handle special cases see https://github.com/google-research/bert/blob/master/README.md#tokenization + # `9,9` -> ['▁9', ',', '9'] instead of [`_9,`, '9'] cur_pieces = self.sp_model.EncodeAsPieces(piece[:-1].replace(SPIECE_UNDERLINE, "")) if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE: if len(cur_pieces[0]) == 1: diff --git a/src/transformers/models/albert/tokenization_albert_fast.py b/src/transformers/models/albert/tokenization_albert_fast.py index 16c54e7eac6c94..91cf403d07eefd 100644 --- a/src/transformers/models/albert/tokenization_albert_fast.py +++ 
b/src/transformers/models/albert/tokenization_albert_fast.py @@ -34,36 +34,36 @@ PRETRAINED_VOCAB_FILES_MAP = { "vocab_file": { - "albert-base-v1": "https://huggingface.co/albert-base-v1/resolve/main/spiece.model", - "albert-large-v1": "https://huggingface.co/albert-large-v1/resolve/main/spiece.model", - "albert-xlarge-v1": "https://huggingface.co/albert-xlarge-v1/resolve/main/spiece.model", - "albert-xxlarge-v1": "https://huggingface.co/albert-xxlarge-v1/resolve/main/spiece.model", - "albert-base-v2": "https://huggingface.co/albert-base-v2/resolve/main/spiece.model", - "albert-large-v2": "https://huggingface.co/albert-large-v2/resolve/main/spiece.model", - "albert-xlarge-v2": "https://huggingface.co/albert-xlarge-v2/resolve/main/spiece.model", - "albert-xxlarge-v2": "https://huggingface.co/albert-xxlarge-v2/resolve/main/spiece.model", + "albert/albert-base-v1": "https://huggingface.co/albert/albert-base-v1/resolve/main/spiece.model", + "albert/albert-large-v1": "https://huggingface.co/albert/albert-large-v1/resolve/main/spiece.model", + "albert/albert-xlarge-v1": "https://huggingface.co/albert/albert-xlarge-v1/resolve/main/spiece.model", + "albert/albert-xxlarge-v1": "https://huggingface.co/albert/albert-xxlarge-v1/resolve/main/spiece.model", + "albert/albert-base-v2": "https://huggingface.co/albert/albert-base-v2/resolve/main/spiece.model", + "albert/albert-large-v2": "https://huggingface.co/albert/albert-large-v2/resolve/main/spiece.model", + "albert/albert-xlarge-v2": "https://huggingface.co/albert/albert-xlarge-v2/resolve/main/spiece.model", + "albert/albert-xxlarge-v2": "https://huggingface.co/albert/albert-xxlarge-v2/resolve/main/spiece.model", }, "tokenizer_file": { - "albert-base-v1": "https://huggingface.co/albert-base-v1/resolve/main/tokenizer.json", - "albert-large-v1": "https://huggingface.co/albert-large-v1/resolve/main/tokenizer.json", - "albert-xlarge-v1": "https://huggingface.co/albert-xlarge-v1/resolve/main/tokenizer.json", - "albert-xxlarge-v1": "https://huggingface.co/albert-xxlarge-v1/resolve/main/tokenizer.json", - "albert-base-v2": "https://huggingface.co/albert-base-v2/resolve/main/tokenizer.json", - "albert-large-v2": "https://huggingface.co/albert-large-v2/resolve/main/tokenizer.json", - "albert-xlarge-v2": "https://huggingface.co/albert-xlarge-v2/resolve/main/tokenizer.json", - "albert-xxlarge-v2": "https://huggingface.co/albert-xxlarge-v2/resolve/main/tokenizer.json", + "albert/albert-base-v1": "https://huggingface.co/albert/albert-base-v1/resolve/main/tokenizer.json", + "albert/albert-large-v1": "https://huggingface.co/albert/albert-large-v1/resolve/main/tokenizer.json", + "albert/albert-xlarge-v1": "https://huggingface.co/albert/albert-xlarge-v1/resolve/main/tokenizer.json", + "albert/albert-xxlarge-v1": "https://huggingface.co/albert/albert-xxlarge-v1/resolve/main/tokenizer.json", + "albert/albert-base-v2": "https://huggingface.co/albert/albert-base-v2/resolve/main/tokenizer.json", + "albert/albert-large-v2": "https://huggingface.co/albert/albert-large-v2/resolve/main/tokenizer.json", + "albert/albert-xlarge-v2": "https://huggingface.co/albert/albert-xlarge-v2/resolve/main/tokenizer.json", + "albert/albert-xxlarge-v2": "https://huggingface.co/albert/albert-xxlarge-v2/resolve/main/tokenizer.json", }, } PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { - "albert-base-v1": 512, - "albert-large-v1": 512, - "albert-xlarge-v1": 512, - "albert-xxlarge-v1": 512, - "albert-base-v2": 512, - "albert-large-v2": 512, - "albert-xlarge-v2": 512, - "albert-xxlarge-v2": 512, + 
"albert/albert-base-v1": 512, + "albert/albert-large-v1": 512, + "albert/albert-xlarge-v1": 512, + "albert/albert-xxlarge-v1": 512, + "albert/albert-base-v2": 512, + "albert/albert-large-v2": 512, + "albert/albert-xlarge-v2": 512, + "albert/albert-xxlarge-v2": 512, } SPIECE_UNDERLINE = "▁" @@ -165,7 +165,10 @@ def __init__( self.remove_space = remove_space self.keep_accents = keep_accents self.vocab_file = vocab_file - self.can_save_slow_tokenizer = False if not self.vocab_file else True + + @property + def can_save_slow_tokenizer(self) -> bool: + return os.path.isfile(self.vocab_file) if self.vocab_file else False def build_inputs_with_special_tokens( self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None diff --git a/src/transformers/models/align/__init__.py b/src/transformers/models/align/__init__.py new file mode 100644 index 00000000000000..8f9a6c40a7169f --- /dev/null +++ b/src/transformers/models/align/__init__.py @@ -0,0 +1,73 @@ +# Copyright 2023 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from typing import TYPE_CHECKING + +from ...utils import ( + OptionalDependencyNotAvailable, + _LazyModule, + is_torch_available, +) + + +_import_structure = { + "configuration_align": [ + "ALIGN_PRETRAINED_CONFIG_ARCHIVE_MAP", + "AlignConfig", + "AlignTextConfig", + "AlignVisionConfig", + ], + "processing_align": ["AlignProcessor"], +} + +try: + if not is_torch_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass +else: + _import_structure["modeling_align"] = [ + "ALIGN_PRETRAINED_MODEL_ARCHIVE_LIST", + "AlignModel", + "AlignPreTrainedModel", + "AlignTextModel", + "AlignVisionModel", + ] + +if TYPE_CHECKING: + from .configuration_align import ( + ALIGN_PRETRAINED_CONFIG_ARCHIVE_MAP, + AlignConfig, + AlignTextConfig, + AlignVisionConfig, + ) + from .processing_align import AlignProcessor + + try: + if not is_torch_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + else: + from .modeling_align import ( + ALIGN_PRETRAINED_MODEL_ARCHIVE_LIST, + AlignModel, + AlignPreTrainedModel, + AlignTextModel, + AlignVisionModel, + ) + +else: + import sys + + sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__) diff --git a/src/transformers/models/align/configuration_align.py b/src/transformers/models/align/configuration_align.py new file mode 100644 index 00000000000000..b7f377d4813679 --- /dev/null +++ b/src/transformers/models/align/configuration_align.py @@ -0,0 +1,384 @@ +# coding=utf-8 +# Copyright 2023 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" ALIGN model configuration""" + +import os +from typing import TYPE_CHECKING, List, Union + + +if TYPE_CHECKING: + pass + +from ...configuration_utils import PretrainedConfig +from ...utils import logging + + +logger = logging.get_logger(__name__) + +ALIGN_PRETRAINED_CONFIG_ARCHIVE_MAP = { + "kakaobrain/align-base": "https://huggingface.co/kakaobrain/align-base/resolve/main/config.json", +} + + +class AlignTextConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`AlignTextModel`]. It is used to instantiate a + ALIGN text encoder according to the specified arguments, defining the model architecture. Instantiating a + configuration with the defaults will yield a similar configuration to that of the text encoder of the ALIGN + [kakaobrain/align-base](https://huggingface.co/kakaobrain/align-base) architecture. The default values here are + copied from BERT. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + vocab_size (`int`, *optional*, defaults to 30522): + Vocabulary size of the Align Text model. Defines the number of different tokens that can be represented by + the `inputs_ids` passed when calling [`AlignTextModel`]. + hidden_size (`int`, *optional*, defaults to 768): + Dimensionality of the encoder layers and the pooler layer. + num_hidden_layers (`int`, *optional*, defaults to 12): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 12): + Number of attention heads for each attention layer in the Transformer encoder. + intermediate_size (`int`, *optional*, defaults to 3072): + Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder. + hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"silu"` and `"gelu_new"` are supported. + hidden_dropout_prob (`float`, *optional*, defaults to 0.1): + The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. + attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1): + The dropout ratio for the attention probabilities. + max_position_embeddings (`int`, *optional*, defaults to 512): + The maximum sequence length that this model might ever be used with. Typically set this to something large + just in case (e.g., 512 or 1024 or 2048). + type_vocab_size (`int`, *optional*, defaults to 2): + The vocabulary size of the `token_type_ids` passed when calling [`AlignTextModel`]. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + layer_norm_eps (`float`, *optional*, defaults to 1e-12): + The epsilon used by the layer normalization layers. + pad_token_id (`int`, *optional*, defaults to 0): + Padding token id. 
+ position_embedding_type (`str`, *optional*, defaults to `"absolute"`): + Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For + positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to + [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155). + For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models + with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658). + use_cache (`bool`, *optional*, defaults to `True`): + Whether or not the model should return the last key/values attentions (not used by all models). Only + relevant if `config.is_decoder=True`. + + Example: + + ```python + >>> from transformers import AlignTextConfig, AlignTextModel + + >>> # Initializing a AlignTextConfig with kakaobrain/align-base style configuration + >>> configuration = AlignTextConfig() + + >>> # Initializing a AlignTextModel (with random weights) from the kakaobrain/align-base style configuration + >>> model = AlignTextModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + + model_type = "align_text_model" + + def __init__( + self, + vocab_size=30522, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + intermediate_size=3072, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=2, + initializer_range=0.02, + layer_norm_eps=1e-12, + pad_token_id=0, + position_embedding_type="absolute", + use_cache=True, + **kwargs, + ): + super().__init__(**kwargs) + + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.hidden_act = hidden_act + self.intermediate_size = intermediate_size + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.max_position_embeddings = max_position_embeddings + self.type_vocab_size = type_vocab_size + self.initializer_range = initializer_range + self.layer_norm_eps = layer_norm_eps + self.position_embedding_type = position_embedding_type + self.use_cache = use_cache + self.pad_token_id = pad_token_id + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + cls._set_token_in_kwargs(kwargs) + + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) + + # get the text config dict if we are loading from AlignConfig + if config_dict.get("model_type") == "align": + config_dict = config_dict["text_config"] + + if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: + logger.warning( + f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " + f"{cls.model_type}. This is not supported for all configurations of models and can yield errors." + ) + + return cls.from_dict(config_dict, **kwargs) + + +class AlignVisionConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`AlignVisionModel`]. It is used to instantiate a + ALIGN vision encoder according to the specified arguments, defining the model architecture. 
Instantiating a + configuration with the defaults will yield a similar configuration to that of the vision encoder of the ALIGN + [kakaobrain/align-base](https://huggingface.co/kakaobrain/align-base) architecture. The default values are copied + from EfficientNet (efficientnet-b7) + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + num_channels (`int`, *optional*, defaults to 3): + The number of input channels. + image_size (`int`, *optional*, defaults to 600): + The input image size. + width_coefficient (`float`, *optional*, defaults to 2.0): + Scaling coefficient for network width at each stage. + depth_coefficient (`float`, *optional*, defaults to 3.1): + Scaling coefficient for network depth at each stage. + depth_divisor `int`, *optional*, defaults to 8): + A unit of network width. + kernel_sizes (`List[int]`, *optional*, defaults to `[3, 3, 5, 3, 5, 5, 3]`): + List of kernel sizes to be used in each block. + in_channels (`List[int]`, *optional*, defaults to `[32, 16, 24, 40, 80, 112, 192]`): + List of input channel sizes to be used in each block for convolutional layers. + out_channels (`List[int]`, *optional*, defaults to `[16, 24, 40, 80, 112, 192, 320]`): + List of output channel sizes to be used in each block for convolutional layers. + depthwise_padding (`List[int]`, *optional*, defaults to `[]`): + List of block indices with square padding. + strides (`List[int]`, *optional*, defaults to `[1, 2, 2, 2, 1, 2, 1]`): + List of stride sizes to be used in each block for convolutional layers. + num_block_repeats (`List[int]`, *optional*, defaults to `[1, 2, 2, 3, 3, 4, 1]`): + List of the number of times each block is to repeated. + expand_ratios (`List[int]`, *optional*, defaults to `[1, 6, 6, 6, 6, 6, 6]`): + List of scaling coefficient of each block. + squeeze_expansion_ratio (`float`, *optional*, defaults to 0.25): + Squeeze expansion ratio. + hidden_act (`str` or `function`, *optional*, defaults to `"silu"`): + The non-linear activation function (function or string) in each block. If string, `"gelu"`, `"relu"`, + `"selu", `"gelu_new"`, `"silu"` and `"mish"` are supported. + hiddem_dim (`int`, *optional*, defaults to 1280): + The hidden dimension of the layer before the classification head. + pooling_type (`str` or `function`, *optional*, defaults to `"mean"`): + Type of final pooling to be applied before the dense classification head. Available options are [`"mean"`, + `"max"`] + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + batch_norm_eps (`float`, *optional*, defaults to 1e-3): + The epsilon used by the batch normalization layers. + batch_norm_momentum (`float`, *optional*, defaults to 0.99): + The momentum used by the batch normalization layers. + drop_connect_rate (`float`, *optional*, defaults to 0.2): + The drop rate for skip connections. 
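The per-stage lists above describe an EfficientNet-B7-style backbone whose channel counts and repeats are scaled by `width_coefficient` and `depth_coefficient`. A minimal sketch of how the defaults combine, reusing the divisor-rounding rule from the `round_filters` helper that appears later in this diff; the `ceil`-based depth scaling is the usual EfficientNet convention and is an assumption here, since that helper is not part of this hunk:

```python
import math

# Default per-stage settings from AlignVisionConfig (EfficientNet-B7 style).
width_coefficient = 2.0
depth_coefficient = 3.1
depth_divisor = 8
in_channels = [32, 16, 24, 40, 80, 112, 192]
num_block_repeats = [1, 2, 2, 3, 3, 4, 1]

def scale_channels(num_channels: int) -> int:
    # Same rounding rule as `round_filters` further down in this patch.
    scaled = num_channels * width_coefficient
    new_dim = max(depth_divisor, int(scaled + depth_divisor / 2) // depth_divisor * depth_divisor)
    if new_dim < 0.9 * scaled:  # never round down by more than 10%
        new_dim += depth_divisor
    return int(new_dim)

print([scale_channels(c) for c in in_channels])  # [64, 32, 48, 80, 160, 224, 384]
print(sum(num_block_repeats) * 4)                # 64 -> the derived default num_hidden_layers
# Assumed EfficientNet-style depth scaling (not shown in this hunk):
print([math.ceil(depth_coefficient * r) for r in num_block_repeats])  # [4, 7, 7, 10, 10, 13, 4]
```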
+ + Example: + + ```python + >>> from transformers import AlignVisionConfig, AlignVisionModel + + >>> # Initializing a AlignVisionConfig with kakaobrain/align-base style configuration + >>> configuration = AlignVisionConfig() + + >>> # Initializing a AlignVisionModel (with random weights) from the kakaobrain/align-base style configuration + >>> model = AlignVisionModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + + model_type = "align_vision_model" + + def __init__( + self, + num_channels: int = 3, + image_size: int = 600, + width_coefficient: float = 2.0, + depth_coefficient: float = 3.1, + depth_divisor: int = 8, + kernel_sizes: List[int] = [3, 3, 5, 3, 5, 5, 3], + in_channels: List[int] = [32, 16, 24, 40, 80, 112, 192], + out_channels: List[int] = [16, 24, 40, 80, 112, 192, 320], + depthwise_padding: List[int] = [], + strides: List[int] = [1, 2, 2, 2, 1, 2, 1], + num_block_repeats: List[int] = [1, 2, 2, 3, 3, 4, 1], + expand_ratios: List[int] = [1, 6, 6, 6, 6, 6, 6], + squeeze_expansion_ratio: float = 0.25, + hidden_act: str = "swish", + hidden_dim: int = 2560, + pooling_type: str = "mean", + initializer_range: float = 0.02, + batch_norm_eps: float = 0.001, + batch_norm_momentum: float = 0.99, + drop_connect_rate: float = 0.2, + **kwargs, + ): + super().__init__(**kwargs) + + self.num_channels = num_channels + self.image_size = image_size + self.width_coefficient = width_coefficient + self.depth_coefficient = depth_coefficient + self.depth_divisor = depth_divisor + self.kernel_sizes = kernel_sizes + self.in_channels = in_channels + self.out_channels = out_channels + self.depthwise_padding = depthwise_padding + self.strides = strides + self.num_block_repeats = num_block_repeats + self.expand_ratios = expand_ratios + self.squeeze_expansion_ratio = squeeze_expansion_ratio + self.hidden_act = hidden_act + self.hidden_dim = hidden_dim + self.pooling_type = pooling_type + self.initializer_range = initializer_range + self.batch_norm_eps = batch_norm_eps + self.batch_norm_momentum = batch_norm_momentum + self.drop_connect_rate = drop_connect_rate + self.num_hidden_layers = sum(num_block_repeats) * 4 + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + cls._set_token_in_kwargs(kwargs) + + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) + + # get the vision config dict if we are loading from AlignConfig + if config_dict.get("model_type") == "align": + config_dict = config_dict["vision_config"] + + if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: + logger.warning( + f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " + f"{cls.model_type}. This is not supported for all configurations of models and can yield errors." + ) + + return cls.from_dict(config_dict, **kwargs) + + +class AlignConfig(PretrainedConfig): + r""" + [`AlignConfig`] is the configuration class to store the configuration of a [`AlignModel`]. It is used to + instantiate a ALIGN model according to the specified arguments, defining the text model and vision model configs. + Instantiating a configuration with the defaults will yield a similar configuration to that of the ALIGN + [kakaobrain/align-base](https://huggingface.co/kakaobrain/align-base) architecture. 
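Both sub-config classes override `from_pretrained` (see the `model_type == "align"` checks above) so that each one can pull its own section out of a full ALIGN config. A short usage sketch, assuming network access to the `kakaobrain/align-base` checkpoint referenced in these docstrings:

```python
from transformers import AlignConfig, AlignTextConfig, AlignVisionConfig

# The composite config nests one dict per modality...
config = AlignConfig.from_pretrained("kakaobrain/align-base")
print(type(config.text_config).__name__, type(config.vision_config).__name__)

# ...and each sub-config class can also be loaded directly from the same checkpoint:
# the override extracts `text_config` / `vision_config` from the "align" config dict.
text_config = AlignTextConfig.from_pretrained("kakaobrain/align-base")
vision_config = AlignVisionConfig.from_pretrained("kakaobrain/align-base")
print(text_config.hidden_size, vision_config.image_size)
```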
+ + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + text_config (`dict`, *optional*): + Dictionary of configuration options used to initialize [`AlignTextConfig`]. + vision_config (`dict`, *optional*): + Dictionary of configuration options used to initialize [`AlignVisionConfig`]. + projection_dim (`int`, *optional*, defaults to 640): + Dimentionality of text and vision projection layers. + temperature_init_value (`float`, *optional*, defaults to 1.0): + The inital value of the *temperature* paramter. Default is used as per the original ALIGN implementation. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + kwargs (*optional*): + Dictionary of keyword arguments. + + Example: + + ```python + >>> from transformers import AlignConfig, AlignModel + + >>> # Initializing a AlignConfig with kakaobrain/align-base style configuration + >>> configuration = AlignConfig() + + >>> # Initializing a AlignModel (with random weights) from the kakaobrain/align-base style configuration + >>> model = AlignModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + + >>> # We can also initialize a AlignConfig from a AlignTextConfig and a AlignVisionConfig + >>> from transformers import AlignTextConfig, AlignVisionConfig + + >>> # Initializing ALIGN Text and Vision configurations + >>> config_text = AlignTextConfig() + >>> config_vision = AlignVisionConfig() + + >>> config = AlignConfig.from_text_vision_configs(config_text, config_vision) + ```""" + + model_type = "align" + + def __init__( + self, + text_config=None, + vision_config=None, + projection_dim=640, + temperature_init_value=1.0, + initializer_range=0.02, + **kwargs, + ): + super().__init__(**kwargs) + + if text_config is None: + text_config = {} + logger.info("text_config is None. Initializing the AlignTextConfig with default values.") + + if vision_config is None: + vision_config = {} + logger.info("vision_config is None. Initializing the AlignVisionConfig with default values.") + + self.text_config = AlignTextConfig(**text_config) + self.vision_config = AlignVisionConfig(**vision_config) + + self.projection_dim = projection_dim + self.temperature_init_value = temperature_init_value + self.initializer_range = initializer_range + + @classmethod + def from_text_vision_configs(cls, text_config: AlignTextConfig, vision_config: AlignVisionConfig, **kwargs): + r""" + Instantiate a [`AlignConfig`] (or a derived class) from align text model configuration and align vision model + configuration. + + Returns: + [`AlignConfig`]: An instance of a configuration object + """ + + return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), **kwargs) diff --git a/src/transformers/models/align/convert_align_tf_to_hf.py b/src/transformers/models/align/convert_align_tf_to_hf.py new file mode 100644 index 00000000000000..610db8482f9162 --- /dev/null +++ b/src/transformers/models/align/convert_align_tf_to_hf.py @@ -0,0 +1,389 @@ +# coding=utf-8 +# Copyright 2023 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Convert ALIGN checkpoints from the original repository.""" + +import argparse +import os + +import align +import numpy as np +import requests +import tensorflow as tf +import torch +from PIL import Image +from tokenizer import Tokenizer + +from transformers import ( + AlignConfig, + AlignModel, + AlignProcessor, + BertConfig, + BertTokenizer, + EfficientNetConfig, + EfficientNetImageProcessor, +) +from transformers.utils import logging + + +logging.set_verbosity_info() +logger = logging.get_logger(__name__) + + +def preprocess(image): + image = tf.image.resize(image, (346, 346)) + image = tf.image.crop_to_bounding_box(image, (346 - 289) // 2, (346 - 289) // 2, 289, 289) + return image + + +def get_align_config(): + vision_config = EfficientNetConfig.from_pretrained("google/efficientnet-b7") + vision_config.image_size = 289 + vision_config.hidden_dim = 640 + vision_config.id2label = {"0": "LABEL_0", "1": "LABEL_1"} + vision_config.label2id = {"LABEL_0": 0, "LABEL_1": 1} + vision_config.depthwise_padding = [] + + text_config = BertConfig() + config = AlignConfig.from_text_vision_configs( + text_config=text_config, vision_config=vision_config, projection_dim=640 + ) + return config + + +# We will verify our results on an image of cute cats +def prepare_img(): + url = "http://images.cocodataset.org/val2017/000000039769.jpg" + im = Image.open(requests.get(url, stream=True).raw) + return im + + +def get_processor(): + image_processor = EfficientNetImageProcessor( + do_center_crop=True, + rescale_factor=1 / 127.5, + rescale_offset=True, + do_normalize=False, + include_top=False, + resample=Image.BILINEAR, + ) + tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased") + tokenizer.model_max_length = 64 + processor = AlignProcessor(image_processor=image_processor, tokenizer=tokenizer) + return processor + + +# here we list all keys to be renamed (original name on the left, our name on the right) +def rename_keys(original_param_names): + # EfficientNet image encoder + block_names = [v.split("_")[0].split("block")[1] for v in original_param_names if v.startswith("block")] + block_names = list(set(block_names)) + block_names = sorted(block_names) + num_blocks = len(block_names) + block_name_mapping = {b: str(i) for b, i in zip(block_names, range(num_blocks))} + + rename_keys = [] + rename_keys.append(("stem_conv/kernel:0", "embeddings.convolution.weight")) + rename_keys.append(("stem_bn/gamma:0", "embeddings.batchnorm.weight")) + rename_keys.append(("stem_bn/beta:0", "embeddings.batchnorm.bias")) + rename_keys.append(("stem_bn/moving_mean:0", "embeddings.batchnorm.running_mean")) + rename_keys.append(("stem_bn/moving_variance:0", "embeddings.batchnorm.running_var")) + + for b in block_names: + hf_b = block_name_mapping[b] + rename_keys.append((f"block{b}_expand_conv/kernel:0", f"encoder.blocks.{hf_b}.expansion.expand_conv.weight")) + rename_keys.append((f"block{b}_expand_bn/gamma:0", f"encoder.blocks.{hf_b}.expansion.expand_bn.weight")) + rename_keys.append((f"block{b}_expand_bn/beta:0", f"encoder.blocks.{hf_b}.expansion.expand_bn.bias")) + rename_keys.append( + 
(f"block{b}_expand_bn/moving_mean:0", f"encoder.blocks.{hf_b}.expansion.expand_bn.running_mean") + ) + rename_keys.append( + (f"block{b}_expand_bn/moving_variance:0", f"encoder.blocks.{hf_b}.expansion.expand_bn.running_var") + ) + rename_keys.append( + (f"block{b}_dwconv/depthwise_kernel:0", f"encoder.blocks.{hf_b}.depthwise_conv.depthwise_conv.weight") + ) + rename_keys.append((f"block{b}_bn/gamma:0", f"encoder.blocks.{hf_b}.depthwise_conv.depthwise_norm.weight")) + rename_keys.append((f"block{b}_bn/beta:0", f"encoder.blocks.{hf_b}.depthwise_conv.depthwise_norm.bias")) + rename_keys.append( + (f"block{b}_bn/moving_mean:0", f"encoder.blocks.{hf_b}.depthwise_conv.depthwise_norm.running_mean") + ) + rename_keys.append( + (f"block{b}_bn/moving_variance:0", f"encoder.blocks.{hf_b}.depthwise_conv.depthwise_norm.running_var") + ) + + rename_keys.append((f"block{b}_se_reduce/kernel:0", f"encoder.blocks.{hf_b}.squeeze_excite.reduce.weight")) + rename_keys.append((f"block{b}_se_reduce/bias:0", f"encoder.blocks.{hf_b}.squeeze_excite.reduce.bias")) + rename_keys.append((f"block{b}_se_expand/kernel:0", f"encoder.blocks.{hf_b}.squeeze_excite.expand.weight")) + rename_keys.append((f"block{b}_se_expand/bias:0", f"encoder.blocks.{hf_b}.squeeze_excite.expand.bias")) + rename_keys.append( + (f"block{b}_project_conv/kernel:0", f"encoder.blocks.{hf_b}.projection.project_conv.weight") + ) + rename_keys.append((f"block{b}_project_bn/gamma:0", f"encoder.blocks.{hf_b}.projection.project_bn.weight")) + rename_keys.append((f"block{b}_project_bn/beta:0", f"encoder.blocks.{hf_b}.projection.project_bn.bias")) + rename_keys.append( + (f"block{b}_project_bn/moving_mean:0", f"encoder.blocks.{hf_b}.projection.project_bn.running_mean") + ) + rename_keys.append( + (f"block{b}_project_bn/moving_variance:0", f"encoder.blocks.{hf_b}.projection.project_bn.running_var") + ) + + key_mapping = {} + for item in rename_keys: + if item[0] in original_param_names: + key_mapping[item[0]] = "vision_model." 
+ item[1] + + # BERT text encoder + rename_keys = [] + old = "tf_bert_model/bert" + new = "text_model" + for i in range(12): + rename_keys.append( + ( + f"{old}/encoder/layer_._{i}/attention/self/query/kernel:0", + f"{new}.encoder.layer.{i}.attention.self.query.weight", + ) + ) + rename_keys.append( + ( + f"{old}/encoder/layer_._{i}/attention/self/query/bias:0", + f"{new}.encoder.layer.{i}.attention.self.query.bias", + ) + ) + rename_keys.append( + ( + f"{old}/encoder/layer_._{i}/attention/self/key/kernel:0", + f"{new}.encoder.layer.{i}.attention.self.key.weight", + ) + ) + rename_keys.append( + ( + f"{old}/encoder/layer_._{i}/attention/self/key/bias:0", + f"{new}.encoder.layer.{i}.attention.self.key.bias", + ) + ) + rename_keys.append( + ( + f"{old}/encoder/layer_._{i}/attention/self/value/kernel:0", + f"{new}.encoder.layer.{i}.attention.self.value.weight", + ) + ) + rename_keys.append( + ( + f"{old}/encoder/layer_._{i}/attention/self/value/bias:0", + f"{new}.encoder.layer.{i}.attention.self.value.bias", + ) + ) + rename_keys.append( + ( + f"{old}/encoder/layer_._{i}/attention/output/dense/kernel:0", + f"{new}.encoder.layer.{i}.attention.output.dense.weight", + ) + ) + rename_keys.append( + ( + f"{old}/encoder/layer_._{i}/attention/output/dense/bias:0", + f"{new}.encoder.layer.{i}.attention.output.dense.bias", + ) + ) + rename_keys.append( + ( + f"{old}/encoder/layer_._{i}/attention/output/LayerNorm/gamma:0", + f"{new}.encoder.layer.{i}.attention.output.LayerNorm.weight", + ) + ) + rename_keys.append( + ( + f"{old}/encoder/layer_._{i}/attention/output/LayerNorm/beta:0", + f"{new}.encoder.layer.{i}.attention.output.LayerNorm.bias", + ) + ) + rename_keys.append( + ( + f"{old}/encoder/layer_._{i}/intermediate/dense/kernel:0", + f"{new}.encoder.layer.{i}.intermediate.dense.weight", + ) + ) + rename_keys.append( + ( + f"{old}/encoder/layer_._{i}/intermediate/dense/bias:0", + f"{new}.encoder.layer.{i}.intermediate.dense.bias", + ) + ) + rename_keys.append( + (f"{old}/encoder/layer_._{i}/output/dense/kernel:0", f"{new}.encoder.layer.{i}.output.dense.weight") + ) + rename_keys.append( + (f"{old}/encoder/layer_._{i}/output/dense/bias:0", f"{new}.encoder.layer.{i}.output.dense.bias") + ) + rename_keys.append( + (f"{old}/encoder/layer_._{i}/output/LayerNorm/gamma:0", f"{new}.encoder.layer.{i}.output.LayerNorm.weight") + ) + rename_keys.append( + (f"{old}/encoder/layer_._{i}/output/LayerNorm/beta:0", f"{new}.encoder.layer.{i}.output.LayerNorm.bias") + ) + + rename_keys.append((f"{old}/embeddings/word_embeddings/weight:0", f"{new}.embeddings.word_embeddings.weight")) + rename_keys.append( + (f"{old}/embeddings/position_embeddings/embeddings:0", f"{new}.embeddings.position_embeddings.weight") + ) + rename_keys.append( + (f"{old}/embeddings/token_type_embeddings/embeddings:0", f"{new}.embeddings.token_type_embeddings.weight") + ) + rename_keys.append((f"{old}/embeddings/LayerNorm/gamma:0", f"{new}.embeddings.LayerNorm.weight")) + rename_keys.append((f"{old}/embeddings/LayerNorm/beta:0", f"{new}.embeddings.LayerNorm.bias")) + + rename_keys.append((f"{old}/pooler/dense/kernel:0", f"{new}.pooler.dense.weight")) + rename_keys.append((f"{old}/pooler/dense/bias:0", f"{new}.pooler.dense.bias")) + rename_keys.append(("dense/kernel:0", "text_projection.weight")) + rename_keys.append(("dense/bias:0", "text_projection.bias")) + rename_keys.append(("dense/bias:0", "text_projection.bias")) + rename_keys.append(("temperature:0", "temperature")) + + for item in rename_keys: + if item[0] in original_param_names: + 
key_mapping[item[0]] = item[1] + return key_mapping + + +def replace_params(hf_params, tf_params, key_mapping): + list(hf_params.keys()) + + for key, value in tf_params.items(): + if key not in key_mapping: + continue + + hf_key = key_mapping[key] + if "_conv" in key and "kernel" in key: + new_hf_value = torch.from_numpy(value).permute(3, 2, 0, 1) + elif "embeddings" in key: + new_hf_value = torch.from_numpy(value) + elif "depthwise_kernel" in key: + new_hf_value = torch.from_numpy(value).permute(2, 3, 0, 1) + elif "kernel" in key: + new_hf_value = torch.from_numpy(np.transpose(value)) + elif "temperature" in key: + new_hf_value = value + elif "bn/gamma" or "bn/beta" in key: + new_hf_value = torch.from_numpy(np.transpose(value)).squeeze() + else: + new_hf_value = torch.from_numpy(value) + + # Replace HF parameters with original TF model parameters + hf_params[hf_key].copy_(new_hf_value) + + +@torch.no_grad() +def convert_align_checkpoint(checkpoint_path, pytorch_dump_folder_path, save_model, push_to_hub): + """ + Copy/paste/tweak model's weights to our ALIGN structure. + """ + # Load original model + seq_length = 64 + tok = Tokenizer(seq_length) + original_model = align.Align("efficientnet-b7", "bert-base", 640, seq_length, tok.get_vocab_size()) + original_model.compile() + original_model.load_weights(checkpoint_path) + + tf_params = original_model.trainable_variables + tf_non_train_params = original_model.non_trainable_variables + tf_params = {param.name: param.numpy() for param in tf_params} + for param in tf_non_train_params: + tf_params[param.name] = param.numpy() + tf_param_names = list(tf_params.keys()) + + # Load HuggingFace model + config = get_align_config() + hf_model = AlignModel(config).eval() + hf_params = hf_model.state_dict() + + # Create src-to-dst parameter name mapping dictionary + print("Converting parameters...") + key_mapping = rename_keys(tf_param_names) + replace_params(hf_params, tf_params, key_mapping) + + # Initialize processor + processor = get_processor() + inputs = processor( + images=prepare_img(), text="A picture of a cat", padding="max_length", max_length=64, return_tensors="pt" + ) + + # HF model inference + hf_model.eval() + with torch.no_grad(): + outputs = hf_model(**inputs) + + hf_image_features = outputs.image_embeds.detach().numpy() + hf_text_features = outputs.text_embeds.detach().numpy() + + # Original model inference + original_model.trainable = False + tf_image_processor = EfficientNetImageProcessor( + do_center_crop=True, + do_rescale=False, + do_normalize=False, + include_top=False, + resample=Image.BILINEAR, + ) + image = tf_image_processor(images=prepare_img(), return_tensors="tf", data_format="channels_last")["pixel_values"] + text = tok(tf.constant(["A picture of a cat"])) + + image_features = original_model.image_encoder(image, training=False) + text_features = original_model.text_encoder(text, training=False) + + image_features = tf.nn.l2_normalize(image_features, axis=-1) + text_features = tf.nn.l2_normalize(text_features, axis=-1) + + # Check whether original and HF model outputs match -> np.allclose + if not np.allclose(image_features, hf_image_features, atol=1e-3): + raise ValueError("The predicted image features are not the same.") + if not np.allclose(text_features, hf_text_features, atol=1e-3): + raise ValueError("The predicted text features are not the same.") + print("Model outputs match!") + + if save_model: + # Create folder to save model + if not os.path.isdir(pytorch_dump_folder_path): + os.mkdir(pytorch_dump_folder_path) + # 
Save converted model and image processor + hf_model.save_pretrained(pytorch_dump_folder_path) + processor.save_pretrained(pytorch_dump_folder_path) + + if push_to_hub: + # Push model and image processor to hub + print("Pushing converted ALIGN to the hub...") + processor.push_to_hub("align-base") + hf_model.push_to_hub("align-base") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + # Required parameters + parser.add_argument( + "--checkpoint_path", + default="./weights/model-weights", + type=str, + help="Path to the pretrained TF ALIGN checkpoint.", + ) + parser.add_argument( + "--pytorch_dump_folder_path", + default="hf_model", + type=str, + help="Path to the output PyTorch model directory.", + ) + parser.add_argument("--save_model", action="store_true", help="Save model to local") + parser.add_argument("--push_to_hub", action="store_true", help="Push model and image processor to the hub") + + args = parser.parse_args() + convert_align_checkpoint(args.checkpoint_path, args.pytorch_dump_folder_path, args.save_model, args.push_to_hub) diff --git a/src/transformers/models/align/modeling_align.py b/src/transformers/models/align/modeling_align.py new file mode 100644 index 00000000000000..f48fcbace12f4f --- /dev/null +++ b/src/transformers/models/align/modeling_align.py @@ -0,0 +1,1636 @@ +# coding=utf-8 +# Copyright 2023 The Google Research Team Authors and The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" PyTorch ALIGN model.""" + +import math +from dataclasses import dataclass +from typing import Any, Optional, Tuple, Union + +import torch +import torch.utils.checkpoint +from torch import nn + +from ...activations import ACT2FN +from ...modeling_outputs import ( + BaseModelOutputWithNoAttention, + BaseModelOutputWithPastAndCrossAttentions, + BaseModelOutputWithPoolingAndCrossAttentions, + BaseModelOutputWithPoolingAndNoAttention, +) +from ...modeling_utils import PreTrainedModel +from ...pytorch_utils import apply_chunking_to_forward, find_pruneable_heads_and_indices, prune_linear_layer +from ...utils import ( + ModelOutput, + add_start_docstrings, + add_start_docstrings_to_model_forward, + logging, + replace_return_docstrings, +) +from .configuration_align import AlignConfig, AlignTextConfig, AlignVisionConfig + + +logger = logging.get_logger(__name__) + +_CHECKPOINT_FOR_DOC = "kakaobrain/align-base" +_CONFIG_FOR_DOC = "AlignConfig" + + +ALIGN_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "kakaobrain/align-base", + # See all ALIGN models at https://huggingface.co/models?filter=align +] + + +ALIGN_START_DOCSTRING = r""" + This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the + library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads + etc.) + + This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. 
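The weight copying in `replace_params` above mostly consists of reordering TensorFlow kernel layouts into the layouts PyTorch modules expect, which is why conv kernels are permuted and dense kernels transposed. A standalone illustration of the two most common cases; the shapes are arbitrary and not taken from the ALIGN checkpoint:

```python
import numpy as np
import torch

# TF stores a Conv2D kernel as (kernel_h, kernel_w, in_channels, out_channels);
# torch.nn.Conv2d expects (out_channels, in_channels, kernel_h, kernel_w),
# hence the `.permute(3, 2, 0, 1)` used for "_conv ... kernel" entries above.
tf_conv_kernel = np.random.randn(3, 3, 64, 128).astype(np.float32)
pt_conv_weight = torch.from_numpy(tf_conv_kernel).permute(3, 2, 0, 1)
assert pt_conv_weight.shape == (128, 64, 3, 3)

# TF Dense kernels are (in_features, out_features); torch.nn.Linear stores (out, in),
# hence the plain transpose used for the remaining "kernel" entries.
tf_dense_kernel = np.random.randn(640, 768).astype(np.float32)
pt_linear_weight = torch.from_numpy(np.transpose(tf_dense_kernel))
assert pt_linear_weight.shape == (768, 640)
```

Depthwise kernels follow the same idea with a `(2, 3, 0, 1)` permute, since TF keeps them as (kernel_h, kernel_w, in_channels, channel_multiplier).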
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage + and behavior. + + Parameters: + config ([`AlignConfig`]): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the + configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights. +""" + +ALIGN_TEXT_INPUTS_DOCSTRING = r""" + Args: + input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`): + Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide + it. + + Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. + + [What are input IDs?](../glossary#input-ids) + attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, + config.max_position_embeddings - 1]`. + + [What are position IDs?](../glossary#position-ids) + token_type_ids (`torch.LongTensor` of shape `({0})`, *optional*): + Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, + 1]`: + + - 0 corresponds to a *sentence A* token, + - 1 corresponds to a *sentence B* token. + + [What are token type IDs?](../glossary#token-type-ids) + head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*): + Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + + inputs_embeds (`torch.FloatTensor` of shape `({0}, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This + is useful if you want more control over how to convert `input_ids` indices into associated vectors than the + model's internal embedding lookup matrix. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + +ALIGN_VISION_INPUTS_DOCSTRING = r""" + Args: + pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): + Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using + [`AutoImageProcessor`]. See [`EfficientNetImageProcessor.__call__`] for details. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. 
+""" + +ALIGN_INPUTS_DOCSTRING = r""" + Args: + input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`): + Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide + it. + + Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. + + [What are input IDs?](../glossary#input-ids) + attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, + config.max_position_embeddings - 1]`. + + [What are position IDs?](../glossary#position-ids) + token_type_ids (`torch.LongTensor` of shape `({0})`, *optional*): + Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, + 1]`: + + - 0 corresponds to a *sentence A* token, + - 1 corresponds to a *sentence B* token. + + [What are token type IDs?](../glossary#token-type-ids) + head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*): + Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + + inputs_embeds (`torch.FloatTensor` of shape `({0}, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This + is useful if you want more control over how to convert `input_ids` indices into associated vectors than the + model's internal embedding lookup matrix. + pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): + Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using + [`AutoImageProcessor`]. See [`EfficientNetImageProcessor.__call__`] for details. + return_loss (`bool`, *optional*): + Whether or not to return the contrastive loss. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + + +@dataclass +class AlignVisionModelOutput(ModelOutput): + """ + Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden states. + + Args: + image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`): + The image embeddings obtained by applying the projection layer to the pooler_output. + last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. 
+ hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. + """ + + image_embeds: Optional[torch.FloatTensor] = None + last_hidden_state: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + + +@dataclass +class AlignTextModelOutput(ModelOutput): + """ + Base class for text model's outputs that also contains a pooling of the last hidden states. + + Args: + text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`): + The text embeddings obtained by applying the projection layer to the pooler_output. + last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. + attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + """ + + text_embeds: Optional[torch.FloatTensor] = None + last_hidden_state: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + + +@dataclass +class AlignOutput(ModelOutput): + """ + Args: + loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`): + Contrastive loss for image-text similarity. + logits_per_image:(`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`): + The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text + similarity scores. + logits_per_text:(`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`): + The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image + similarity scores. + text_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim`): + The text embeddings obtained by applying the projection layer to the pooled output of [`AlignTextModel`]. + image_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim`): + The output of [`AlignVisionModel`]. + text_model_output(`BaseModelOutputWithPoolingAndCrossAttentions`): + The output of the [`AlignTextModel`]. + vision_model_output(`BaseModelOutputWithPoolingAndNoAttention`): + The output of the [`AlignVisionModel`]. 
+ """ + + loss: Optional[torch.FloatTensor] = None + logits_per_image: torch.FloatTensor = None + logits_per_text: torch.FloatTensor = None + text_embeds: torch.FloatTensor = None + image_embeds: torch.FloatTensor = None + text_model_output: BaseModelOutputWithPoolingAndCrossAttentions = None + vision_model_output: BaseModelOutputWithPoolingAndNoAttention = None + + def to_tuple(self) -> Tuple[Any]: + return tuple( + self[k] if k not in ["text_model_output", "vision_model_output"] else getattr(self, k).to_tuple() + for k in self.keys() + ) + + +# contrastive loss function, adapted from +# https://sachinruk.github.io/blog/pytorch/pytorch%20lightning/loss%20function/gpu/2021/03/07/CLIP.html +def contrastive_loss(logits: torch.Tensor) -> torch.Tensor: + return nn.functional.cross_entropy(logits, torch.arange(len(logits), device=logits.device), label_smoothing=0.1) + + +def align_loss(similarity: torch.Tensor) -> torch.Tensor: + caption_loss = contrastive_loss(similarity) + image_loss = contrastive_loss(similarity.t()) + return (caption_loss + image_loss) / 2.0 + + +# Copied from transformers.models.efficientnet.modeling_efficientnet.round_filters with EfficientNet->AlignVision +def round_filters(config: AlignVisionConfig, num_channels: int): + r""" + Round number of filters based on depth multiplier. + """ + divisor = config.depth_divisor + num_channels *= config.width_coefficient + new_dim = max(divisor, int(num_channels + divisor / 2) // divisor * divisor) + + # Make sure that round down does not go down by more than 10%. + if new_dim < 0.9 * num_channels: + new_dim += divisor + + return int(new_dim) + + +# Copied from transformers.models.efficientnet.modeling_efficientnet.correct_pad +def correct_pad(kernel_size: Union[int, Tuple], adjust: bool = True): + r""" + Utility function to get the tuple padding value for the depthwise convolution. + + Args: + kernel_size (`int` or `tuple`): + Kernel size of the convolution layers. + adjust (`bool`, *optional*, defaults to `True`): + Adjusts padding value to apply to right and bottom sides of the input. + """ + if isinstance(kernel_size, int): + kernel_size = (kernel_size, kernel_size) + + correct = (kernel_size[0] // 2, kernel_size[1] // 2) + if adjust: + return (correct[1] - 1, correct[1], correct[0] - 1, correct[0]) + else: + return (correct[1], correct[1], correct[0], correct[0]) + + +# Copied from transformers.models.efficientnet.modeling_efficientnet.EfficientNetEmbeddings with EfficientNet->AlignVision +class AlignVisionEmbeddings(nn.Module): + r""" + A module that corresponds to the stem module of the original work. 
+ """ + + def __init__(self, config: AlignVisionConfig): + super().__init__() + + self.out_dim = round_filters(config, 32) + self.padding = nn.ZeroPad2d(padding=(0, 1, 0, 1)) + self.convolution = nn.Conv2d( + config.num_channels, self.out_dim, kernel_size=3, stride=2, padding="valid", bias=False + ) + self.batchnorm = nn.BatchNorm2d(self.out_dim, eps=config.batch_norm_eps, momentum=config.batch_norm_momentum) + self.activation = ACT2FN[config.hidden_act] + + def forward(self, pixel_values: torch.Tensor) -> torch.Tensor: + features = self.padding(pixel_values) + features = self.convolution(features) + features = self.batchnorm(features) + features = self.activation(features) + + return features + + +# Copied from transformers.models.efficientnet.modeling_efficientnet.EfficientNetDepthwiseConv2d with EfficientNet->AlignVision +class AlignVisionDepthwiseConv2d(nn.Conv2d): + def __init__( + self, + in_channels, + depth_multiplier=1, + kernel_size=3, + stride=1, + padding=0, + dilation=1, + bias=True, + padding_mode="zeros", + ): + out_channels = in_channels * depth_multiplier + super().__init__( + in_channels=in_channels, + out_channels=out_channels, + kernel_size=kernel_size, + stride=stride, + padding=padding, + dilation=dilation, + groups=in_channels, + bias=bias, + padding_mode=padding_mode, + ) + + +# Copied from transformers.models.efficientnet.modeling_efficientnet.EfficientNetExpansionLayer with EfficientNet->AlignVision +class AlignVisionExpansionLayer(nn.Module): + r""" + This corresponds to the expansion phase of each block in the original implementation. + """ + + def __init__(self, config: AlignVisionConfig, in_dim: int, out_dim: int, stride: int): + super().__init__() + self.expand_conv = nn.Conv2d( + in_channels=in_dim, + out_channels=out_dim, + kernel_size=1, + padding="same", + bias=False, + ) + self.expand_bn = nn.BatchNorm2d(num_features=out_dim, eps=config.batch_norm_eps) + self.expand_act = ACT2FN[config.hidden_act] + + def forward(self, hidden_states: torch.FloatTensor) -> torch.Tensor: + # Expand phase + hidden_states = self.expand_conv(hidden_states) + hidden_states = self.expand_bn(hidden_states) + hidden_states = self.expand_act(hidden_states) + + return hidden_states + + +# Copied from transformers.models.efficientnet.modeling_efficientnet.EfficientNetDepthwiseLayer with with EfficientNet->AlignVision +class AlignVisionDepthwiseLayer(nn.Module): + r""" + This corresponds to the depthwise convolution phase of each block in the original implementation. 
+ """ + + def __init__( + self, + config: AlignVisionConfig, + in_dim: int, + stride: int, + kernel_size: int, + adjust_padding: bool, + ): + super().__init__() + self.stride = stride + conv_pad = "valid" if self.stride == 2 else "same" + padding = correct_pad(kernel_size, adjust=adjust_padding) + + self.depthwise_conv_pad = nn.ZeroPad2d(padding=padding) + self.depthwise_conv = AlignVisionDepthwiseConv2d( + in_dim, kernel_size=kernel_size, stride=stride, padding=conv_pad, bias=False + ) + self.depthwise_norm = nn.BatchNorm2d( + num_features=in_dim, eps=config.batch_norm_eps, momentum=config.batch_norm_momentum + ) + self.depthwise_act = ACT2FN[config.hidden_act] + + def forward(self, hidden_states: torch.FloatTensor) -> torch.Tensor: + # Depthwise convolution + if self.stride == 2: + hidden_states = self.depthwise_conv_pad(hidden_states) + + hidden_states = self.depthwise_conv(hidden_states) + hidden_states = self.depthwise_norm(hidden_states) + hidden_states = self.depthwise_act(hidden_states) + + return hidden_states + + +# Copied from transformers.models.efficientnet.modeling_efficientnet.EfficientNetSqueezeExciteLayer with with EfficientNet->AlignVision +class AlignVisionSqueezeExciteLayer(nn.Module): + r""" + This corresponds to the Squeeze and Excitement phase of each block in the original implementation. + """ + + def __init__(self, config: AlignVisionConfig, in_dim: int, expand_dim: int, expand: bool = False): + super().__init__() + self.dim = expand_dim if expand else in_dim + self.dim_se = max(1, int(in_dim * config.squeeze_expansion_ratio)) + + self.squeeze = nn.AdaptiveAvgPool2d(output_size=1) + self.reduce = nn.Conv2d( + in_channels=self.dim, + out_channels=self.dim_se, + kernel_size=1, + padding="same", + ) + self.expand = nn.Conv2d( + in_channels=self.dim_se, + out_channels=self.dim, + kernel_size=1, + padding="same", + ) + self.act_reduce = ACT2FN[config.hidden_act] + self.act_expand = nn.Sigmoid() + + def forward(self, hidden_states: torch.FloatTensor) -> torch.Tensor: + inputs = hidden_states + hidden_states = self.squeeze(hidden_states) + hidden_states = self.reduce(hidden_states) + hidden_states = self.act_reduce(hidden_states) + + hidden_states = self.expand(hidden_states) + hidden_states = self.act_expand(hidden_states) + hidden_states = torch.mul(inputs, hidden_states) + + return hidden_states + + +class AlignVisionFinalBlockLayer(nn.Module): + r""" + This corresponds to the final phase of each block in the original implementation. + """ + + def __init__( + self, config: AlignVisionConfig, in_dim: int, out_dim: int, stride: int, drop_rate: float, id_skip: bool + ): + super().__init__() + self.apply_dropout = stride == 1 and not id_skip + self.project_conv = nn.Conv2d( + in_channels=in_dim, + out_channels=out_dim, + kernel_size=1, + padding="same", + bias=False, + ) + self.project_bn = nn.BatchNorm2d( + num_features=out_dim, eps=config.batch_norm_eps, momentum=config.batch_norm_momentum + ) + self.dropout = nn.Dropout(p=drop_rate) + + def forward(self, embeddings: torch.FloatTensor, hidden_states: torch.FloatTensor) -> torch.Tensor: + hidden_states = self.project_conv(hidden_states) + hidden_states = self.project_bn(hidden_states) + + if self.apply_dropout: + hidden_states = self.dropout(hidden_states) + hidden_states = hidden_states + embeddings + + return hidden_states + + +class AlignVisionBlock(nn.Module): + r""" + This corresponds to the block module of original the EfficientNet vision encoder implementation. 
+ + Args: + config ([`AlignVisionConfig`]): + Model configuration class. + in_dim (`int`): + Number of input channels. + out_dim (`int`): + Number of output channels. + stride (`int`): + Stride size to be used in convolution layers. + expand_ratio (`int`): + Expand ratio to set the output dimensions for the expansion and squeeze-excite layers. + kernel_size (`int`): + Kernel size for the depthwise convolution layer. + drop_rate (`float`): + Dropout rate to be used in the final phase of each block. + id_skip (`bool`): + Whether to apply dropout and sum the final hidden states with the input embeddings during the final phase + of each block. Set to `True` for the first block of each stage. + adjust_padding (`bool`): + Whether to apply padding to only right and bottom side of the input kernel before the depthwise convolution + operation, set to `True` for inputs with odd input sizes. + """ + + def __init__( + self, + config: AlignVisionConfig, + in_dim: int, + out_dim: int, + stride: int, + expand_ratio: int, + kernel_size: int, + drop_rate: float, + id_skip: bool, + adjust_padding: bool, + ): + super().__init__() + self.expand_ratio = expand_ratio + self.expand = True if self.expand_ratio != 1 else False + expand_in_dim = in_dim * expand_ratio + + if self.expand: + self.expansion = AlignVisionExpansionLayer( + config=config, in_dim=in_dim, out_dim=expand_in_dim, stride=stride + ) + + self.depthwise_conv = AlignVisionDepthwiseLayer( + config=config, + in_dim=expand_in_dim if self.expand else in_dim, + stride=stride, + kernel_size=kernel_size, + adjust_padding=adjust_padding, + ) + self.squeeze_excite = AlignVisionSqueezeExciteLayer( + config=config, in_dim=in_dim, expand_dim=expand_in_dim, expand=self.expand + ) + self.projection = AlignVisionFinalBlockLayer( + config=config, + in_dim=expand_in_dim if self.expand else in_dim, + out_dim=out_dim, + stride=stride, + drop_rate=drop_rate, + id_skip=id_skip, + ) + + def forward(self, hidden_states: torch.FloatTensor) -> torch.Tensor: + embeddings = hidden_states + # Expansion and depthwise convolution phase + if self.expand_ratio != 1: + hidden_states = self.expansion(hidden_states) + hidden_states = self.depthwise_conv(hidden_states) + + # Squeeze and excite phase + hidden_states = self.squeeze_excite(hidden_states) + hidden_states = self.projection(embeddings, hidden_states) + return hidden_states + + +class AlignVisionEncoder(nn.Module): + r""" + Forward propogates the embeddings through each vision encoder (EfficientNet) block. + + Args: + config ([`AlignVisionConfig`]): + Model configuration class. + """ + + def __init__(self, config: AlignVisionConfig): + super().__init__() + self.depth_coefficient = config.depth_coefficient + + def round_repeats(repeats): + # Round number of block repeats based on depth multiplier. 
+ return int(math.ceil(self.depth_coefficient * repeats)) + + num_base_blocks = len(config.in_channels) + num_blocks = sum(round_repeats(n) for n in config.num_block_repeats) + + curr_block_num = 0 + blocks = [] + for i in range(num_base_blocks): + in_dim = round_filters(config, config.in_channels[i]) + out_dim = round_filters(config, config.out_channels[i]) + stride = config.strides[i] + kernel_size = config.kernel_sizes[i] + expand_ratio = config.expand_ratios[i] + + for j in range(round_repeats(config.num_block_repeats[i])): + id_skip = True if j == 0 else False + stride = 1 if j > 0 else stride + in_dim = out_dim if j > 0 else in_dim + adjust_padding = False if curr_block_num in config.depthwise_padding else True + drop_rate = config.drop_connect_rate * curr_block_num / num_blocks + + block = AlignVisionBlock( + config=config, + in_dim=in_dim, + out_dim=out_dim, + stride=stride, + kernel_size=kernel_size, + expand_ratio=expand_ratio, + drop_rate=drop_rate, + id_skip=id_skip, + adjust_padding=adjust_padding, + ) + blocks.append(block) + curr_block_num += 1 + + self.blocks = nn.ModuleList(blocks) + + def forward( + self, + hidden_states: torch.FloatTensor, + output_hidden_states: Optional[bool] = False, + return_dict: Optional[bool] = True, + ) -> BaseModelOutputWithPoolingAndNoAttention: + all_hidden_states = (hidden_states,) if output_hidden_states else None + + for block in self.blocks: + hidden_states = block(hidden_states) + if output_hidden_states: + all_hidden_states += (hidden_states,) + + if not return_dict: + return tuple(v for v in [hidden_states, all_hidden_states] if v is not None) + + return BaseModelOutputWithNoAttention( + last_hidden_state=hidden_states, + hidden_states=all_hidden_states, + ) + + +# Copied from transformers.models.bert.modeling_bert.BertEmbeddings with Bert->AlignText +class AlignTextEmbeddings(nn.Module): + """Construct the embeddings from word, position and token_type embeddings.""" + + def __init__(self, config): + super().__init__() + self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id) + self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size) + self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size) + + # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load + # any TensorFlow checkpoint file + self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + # position_ids (1, len position emb) is contiguous in memory and exported when serialized + self.position_embedding_type = getattr(config, "position_embedding_type", "absolute") + self.register_buffer( + "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False + ) + self.register_buffer( + "token_type_ids", torch.zeros(self.position_ids.size(), dtype=torch.long), persistent=False + ) + + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + past_key_values_length: int = 0, + ) -> torch.Tensor: + if input_ids is not None: + input_shape = input_ids.size() + else: + input_shape = inputs_embeds.size()[:-1] + + seq_length = input_shape[1] + + if position_ids is None: + position_ids = self.position_ids[:, past_key_values_length : seq_length + 
past_key_values_length] + + # Setting the token_type_ids to the registered buffer in constructor where it is all zeros, which usually occurs + # when its auto-generated, registered buffer helps users when tracing the model without passing token_type_ids, solves + # issue #5664 + if token_type_ids is None: + if hasattr(self, "token_type_ids"): + buffered_token_type_ids = self.token_type_ids[:, :seq_length] + buffered_token_type_ids_expanded = buffered_token_type_ids.expand(input_shape[0], seq_length) + token_type_ids = buffered_token_type_ids_expanded + else: + token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device) + + if inputs_embeds is None: + inputs_embeds = self.word_embeddings(input_ids) + token_type_embeddings = self.token_type_embeddings(token_type_ids) + + embeddings = inputs_embeds + token_type_embeddings + if self.position_embedding_type == "absolute": + position_embeddings = self.position_embeddings(position_ids) + embeddings += position_embeddings + embeddings = self.LayerNorm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + +# Copied from transformers.models.bert.modeling_bert.BertSelfAttention with Bert->AlignText +class AlignTextSelfAttention(nn.Module): + def __init__(self, config, position_embedding_type=None): + super().__init__() + if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"): + raise ValueError( + f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention " + f"heads ({config.num_attention_heads})" + ) + + self.num_attention_heads = config.num_attention_heads + self.attention_head_size = int(config.hidden_size / config.num_attention_heads) + self.all_head_size = self.num_attention_heads * self.attention_head_size + + self.query = nn.Linear(config.hidden_size, self.all_head_size) + self.key = nn.Linear(config.hidden_size, self.all_head_size) + self.value = nn.Linear(config.hidden_size, self.all_head_size) + + self.dropout = nn.Dropout(config.attention_probs_dropout_prob) + self.position_embedding_type = position_embedding_type or getattr( + config, "position_embedding_type", "absolute" + ) + if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query": + self.max_position_embeddings = config.max_position_embeddings + self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size) + + self.is_decoder = config.is_decoder + + def transpose_for_scores(self, x: torch.Tensor) -> torch.Tensor: + new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size) + x = x.view(new_x_shape) + return x.permute(0, 2, 1, 3) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.FloatTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + encoder_attention_mask: Optional[torch.FloatTensor] = None, + past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.Tensor]: + mixed_query_layer = self.query(hidden_states) + + # If this is instantiated as a cross-attention module, the keys + # and values come from an encoder; the attention mask needs to be + # such that the encoder's padding tokens are not attended to. 
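The embedding forward above follows the usual BERT recipe: word, token-type, and (absolute) position embeddings are summed, then layer-normalized and dropped out. A toy re-creation with arbitrary sizes, included here only for illustration:

```python
# Editorial toy version of the embedding sum performed above; sizes are arbitrary.
import torch
import torch.nn as nn

vocab_size, type_vocab_size, max_positions, hidden = 30522, 2, 512, 16

word_emb = nn.Embedding(vocab_size, hidden, padding_idx=0)
pos_emb = nn.Embedding(max_positions, hidden)
type_emb = nn.Embedding(type_vocab_size, hidden)
layer_norm = nn.LayerNorm(hidden, eps=1e-12)
dropout = nn.Dropout(0.1)

input_ids = torch.tensor([[101, 2023, 2003, 102]])
seq_len = input_ids.shape[1]
position_ids = torch.arange(seq_len).unsqueeze(0)
token_type_ids = torch.zeros_like(input_ids)

embeddings = word_emb(input_ids) + type_emb(token_type_ids) + pos_emb(position_ids)
embeddings = dropout(layer_norm(embeddings))
print(embeddings.shape)  # torch.Size([1, 4, 16])
```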
+ is_cross_attention = encoder_hidden_states is not None + + if is_cross_attention and past_key_value is not None: + # reuse k,v, cross_attentions + key_layer = past_key_value[0] + value_layer = past_key_value[1] + attention_mask = encoder_attention_mask + elif is_cross_attention: + key_layer = self.transpose_for_scores(self.key(encoder_hidden_states)) + value_layer = self.transpose_for_scores(self.value(encoder_hidden_states)) + attention_mask = encoder_attention_mask + elif past_key_value is not None: + key_layer = self.transpose_for_scores(self.key(hidden_states)) + value_layer = self.transpose_for_scores(self.value(hidden_states)) + key_layer = torch.cat([past_key_value[0], key_layer], dim=2) + value_layer = torch.cat([past_key_value[1], value_layer], dim=2) + else: + key_layer = self.transpose_for_scores(self.key(hidden_states)) + value_layer = self.transpose_for_scores(self.value(hidden_states)) + + query_layer = self.transpose_for_scores(mixed_query_layer) + + use_cache = past_key_value is not None + if self.is_decoder: + # if cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states. + # Further calls to cross_attention layer can then reuse all cross-attention + # key/value_states (first "if" case) + # if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of + # all previous decoder key/value_states. Further calls to uni-directional self-attention + # can concat previous decoder key/value_states to current projected key/value_states (third "elif" case) + # if encoder bi-directional self-attention `past_key_value` is always `None` + past_key_value = (key_layer, value_layer) + + # Take the dot product between "query" and "key" to get the raw attention scores. + attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2)) + + if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query": + query_length, key_length = query_layer.shape[2], key_layer.shape[2] + if use_cache: + position_ids_l = torch.tensor(key_length - 1, dtype=torch.long, device=hidden_states.device).view( + -1, 1 + ) + else: + position_ids_l = torch.arange(query_length, dtype=torch.long, device=hidden_states.device).view(-1, 1) + position_ids_r = torch.arange(key_length, dtype=torch.long, device=hidden_states.device).view(1, -1) + distance = position_ids_l - position_ids_r + + positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1) + positional_embedding = positional_embedding.to(dtype=query_layer.dtype) # fp16 compatibility + + if self.position_embedding_type == "relative_key": + relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding) + attention_scores = attention_scores + relative_position_scores + elif self.position_embedding_type == "relative_key_query": + relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding) + relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding) + attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key + + attention_scores = attention_scores / math.sqrt(self.attention_head_size) + if attention_mask is not None: + # Apply the attention mask is (precomputed for all layers in AlignTextModel forward() function) + attention_scores = attention_scores + attention_mask + + # Normalize the attention scores to probabilities. 
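A side note on the mask added to the scores above: the `attention_mask` seen here is the *extended* additive mask, in which padding positions carry a very large negative value so that the softmax that follows assigns them essentially zero weight. A minimal editorial sketch of that conversion, using the usual `(1 - mask) * min` construction:

```python
# How a 0/1 padding mask becomes an additive bias on the attention scores.
import torch

scores = torch.randn(1, 1, 4, 4)               # (batch, heads, query_len, key_len)
attention_mask = torch.tensor([[1, 1, 1, 0]])  # last key position is padding

extended_mask = (1.0 - attention_mask[:, None, None, :].float()) * torch.finfo(scores.dtype).min
probs = torch.softmax(scores + extended_mask, dim=-1)
print(probs[0, 0, 0])  # the masked key receives (near-)zero attention weight
```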
+ attention_probs = nn.functional.softmax(attention_scores, dim=-1) + + # This is actually dropping out entire tokens to attend to, which might + # seem a bit unusual, but is taken from the original Transformer paper. + attention_probs = self.dropout(attention_probs) + + # Mask heads if we want to + if head_mask is not None: + attention_probs = attention_probs * head_mask + + context_layer = torch.matmul(attention_probs, value_layer) + + context_layer = context_layer.permute(0, 2, 1, 3).contiguous() + new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,) + context_layer = context_layer.view(new_context_layer_shape) + + outputs = (context_layer, attention_probs) if output_attentions else (context_layer,) + + if self.is_decoder: + outputs = outputs + (past_key_value,) + return outputs + + +# Copied from transformers.models.bert.modeling_bert.BertSelfOutput with Bert->AlignText +class AlignTextSelfOutput(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +# Copied from transformers.models.bert.modeling_bert.BertAttention with Bert->AlignText +class AlignTextAttention(nn.Module): + def __init__(self, config, position_embedding_type=None): + super().__init__() + self.self = AlignTextSelfAttention(config, position_embedding_type=position_embedding_type) + self.output = AlignTextSelfOutput(config) + self.pruned_heads = set() + + def prune_heads(self, heads): + if len(heads) == 0: + return + heads, index = find_pruneable_heads_and_indices( + heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads + ) + + # Prune linear layers + self.self.query = prune_linear_layer(self.self.query, index) + self.self.key = prune_linear_layer(self.self.key, index) + self.self.value = prune_linear_layer(self.self.value, index) + self.output.dense = prune_linear_layer(self.output.dense, index, dim=1) + + # Update hyper params and store pruned heads + self.self.num_attention_heads = self.self.num_attention_heads - len(heads) + self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads + self.pruned_heads = self.pruned_heads.union(heads) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.FloatTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + encoder_attention_mask: Optional[torch.FloatTensor] = None, + past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.Tensor]: + self_outputs = self.self( + hidden_states, + attention_mask, + head_mask, + encoder_hidden_states, + encoder_attention_mask, + past_key_value, + output_attentions, + ) + attention_output = self.output(self_outputs[0], hidden_states) + outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them + return outputs + + +# Copied from transformers.models.bert.modeling_bert.BertIntermediate with Bert->AlignText +class AlignTextIntermediate(nn.Module): + def __init__(self, config): + 
super().__init__() + self.dense = nn.Linear(config.hidden_size, config.intermediate_size) + if isinstance(config.hidden_act, str): + self.intermediate_act_fn = ACT2FN[config.hidden_act] + else: + self.intermediate_act_fn = config.hidden_act + + def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.intermediate_act_fn(hidden_states) + return hidden_states + + +# Copied from transformers.models.bert.modeling_bert.BertOutput with Bert->AlignText +class AlignTextOutput(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.intermediate_size, config.hidden_size) + self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +# Copied from transformers.models.bert.modeling_bert.BertLayer with Bert->AlignText +class AlignTextLayer(nn.Module): + def __init__(self, config): + super().__init__() + self.chunk_size_feed_forward = config.chunk_size_feed_forward + self.seq_len_dim = 1 + self.attention = AlignTextAttention(config) + self.is_decoder = config.is_decoder + self.add_cross_attention = config.add_cross_attention + if self.add_cross_attention: + if not self.is_decoder: + raise ValueError(f"{self} should be used as a decoder model if cross attention is added") + self.crossattention = AlignTextAttention(config, position_embedding_type="absolute") + self.intermediate = AlignTextIntermediate(config) + self.output = AlignTextOutput(config) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.FloatTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + encoder_attention_mask: Optional[torch.FloatTensor] = None, + past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.Tensor]: + # decoder uni-directional self-attention cached key/values tuple is at positions 1,2 + self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None + self_attention_outputs = self.attention( + hidden_states, + attention_mask, + head_mask, + output_attentions=output_attentions, + past_key_value=self_attn_past_key_value, + ) + attention_output = self_attention_outputs[0] + + # if decoder, the last output is tuple of self-attn cache + if self.is_decoder: + outputs = self_attention_outputs[1:-1] + present_key_value = self_attention_outputs[-1] + else: + outputs = self_attention_outputs[1:] # add self attentions if we output attention weights + + cross_attn_present_key_value = None + if self.is_decoder and encoder_hidden_states is not None: + if not hasattr(self, "crossattention"): + raise ValueError( + f"If `encoder_hidden_states` are passed, {self} has to be instantiated with cross-attention layers" + " by setting `config.add_cross_attention=True`" + ) + + # cross_attn cached key/values tuple is at positions 3,4 of past_key_value tuple + cross_attn_past_key_value = past_key_value[-2:] if past_key_value is not None else None + cross_attention_outputs = self.crossattention( + attention_output, + attention_mask, + head_mask, + encoder_hidden_states, + 
encoder_attention_mask, + cross_attn_past_key_value, + output_attentions, + ) + attention_output = cross_attention_outputs[0] + outputs = outputs + cross_attention_outputs[1:-1] # add cross attentions if we output attention weights + + # add cross-attn cache to positions 3,4 of present_key_value tuple + cross_attn_present_key_value = cross_attention_outputs[-1] + present_key_value = present_key_value + cross_attn_present_key_value + + layer_output = apply_chunking_to_forward( + self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output + ) + outputs = (layer_output,) + outputs + + # if decoder, return the attn key/values as the last output + if self.is_decoder: + outputs = outputs + (present_key_value,) + + return outputs + + def feed_forward_chunk(self, attention_output): + intermediate_output = self.intermediate(attention_output) + layer_output = self.output(intermediate_output, attention_output) + return layer_output + + +# Copied from transformers.models.bert.modeling_bert.BertEncoder with Bert->AlignText +class AlignTextEncoder(nn.Module): + def __init__(self, config): + super().__init__() + self.config = config + self.layer = nn.ModuleList([AlignTextLayer(config) for _ in range(config.num_hidden_layers)]) + self.gradient_checkpointing = False + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.FloatTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + encoder_attention_mask: Optional[torch.FloatTensor] = None, + past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = False, + output_hidden_states: Optional[bool] = False, + return_dict: Optional[bool] = True, + ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPastAndCrossAttentions]: + all_hidden_states = () if output_hidden_states else None + all_self_attentions = () if output_attentions else None + all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None + + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning_once( + "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." 
+ ) + use_cache = False + + next_decoder_cache = () if use_cache else None + for i, layer_module in enumerate(self.layer): + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + layer_head_mask = head_mask[i] if head_mask is not None else None + past_key_value = past_key_values[i] if past_key_values is not None else None + + if self.gradient_checkpointing and self.training: + layer_outputs = self._gradient_checkpointing_func( + layer_module.__call__, + hidden_states, + attention_mask, + layer_head_mask, + encoder_hidden_states, + encoder_attention_mask, + past_key_value, + output_attentions, + ) + else: + layer_outputs = layer_module( + hidden_states, + attention_mask, + layer_head_mask, + encoder_hidden_states, + encoder_attention_mask, + past_key_value, + output_attentions, + ) + + hidden_states = layer_outputs[0] + if use_cache: + next_decoder_cache += (layer_outputs[-1],) + if output_attentions: + all_self_attentions = all_self_attentions + (layer_outputs[1],) + if self.config.add_cross_attention: + all_cross_attentions = all_cross_attentions + (layer_outputs[2],) + + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if not return_dict: + return tuple( + v + for v in [ + hidden_states, + next_decoder_cache, + all_hidden_states, + all_self_attentions, + all_cross_attentions, + ] + if v is not None + ) + return BaseModelOutputWithPastAndCrossAttentions( + last_hidden_state=hidden_states, + past_key_values=next_decoder_cache, + hidden_states=all_hidden_states, + attentions=all_self_attentions, + cross_attentions=all_cross_attentions, + ) + + +# Copied from transformers.models.bert.modeling_bert.BertPooler with Bert -> AlignText +class AlignTextPooler(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.activation = nn.Tanh() + + def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: + # We "pool" the model by simply taking the hidden state corresponding + # to the first token. + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +class AlignPreTrainedModel(PreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. 
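The encoder above disables `use_cache` when gradient checkpointing is active, because checkpointed layers are re-run during the backward pass. The snippet below is an editorial illustration of that recomputation trade-off using `torch.utils.checkpoint` directly; it assumes a reasonably recent PyTorch for the `use_reentrant` argument and is not code from this patch.

```python
# Sketch of what the gradient-checkpointing branch trades: activations are not
# stored for the wrapped layer and are recomputed during backward.
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.GELU(), torch.nn.Linear(64, 16))
x = torch.randn(2, 16, requires_grad=True)

out = checkpoint(layer, x, use_reentrant=False)  # layer re-runs on backward
out.sum().backward()
print(x.grad.shape)
```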
+ """ + + config_class = AlignConfig + base_model_prefix = "align" + supports_gradient_checkpointing = True + + def _init_weights(self, module): + """Initialize the weights""" + if isinstance(module, (nn.Linear, nn.Conv2d)): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, AlignModel): + nn.init.xavier_uniform_(module.text_projection.weight) + module.text_projection.bias.data.zero_() + module.text_projection._is_hf_initialized = True + elif isinstance(module, nn.Embedding): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.padding_idx is not None: + module.weight.data[module.padding_idx].zero_() + if isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + + +@add_start_docstrings( + """The text model from ALIGN without any head or projection on top.""", + ALIGN_START_DOCSTRING, +) +class AlignTextModel(AlignPreTrainedModel): + config_class = AlignTextConfig + + def __init__(self, config: AlignTextConfig, add_pooling_layer: bool = True): + super().__init__(config) + self.config = config + + self.embeddings = AlignTextEmbeddings(config) + self.encoder = AlignTextEncoder(config) + + self.pooler = AlignTextPooler(config) if add_pooling_layer else None + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self): + return self.embeddings.word_embeddings + + def set_input_embeddings(self, value): + self.embeddings.word_embeddings = value + + @add_start_docstrings_to_model_forward(ALIGN_TEXT_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=BaseModelOutputWithPoolingAndCrossAttentions, config_class=AlignTextConfig) + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + token_type_ids: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + inputs_embeds: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, BaseModelOutputWithPoolingAndCrossAttentions]: + r""" + Returns: + + Examples: + + ```python + >>> from transformers import AutoTokenizer, AlignTextModel + + >>> model = AlignTextModel.from_pretrained("kakaobrain/align-base") + >>> tokenizer = AutoTokenizer.from_pretrained("kakaobrain/align-base") + + >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt") + + >>> outputs = model(**inputs) + >>> last_hidden_state = outputs.last_hidden_state + >>> pooled_output = outputs.pooler_output # pooled (EOS token) states + ```""" + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if input_ids is not None and inputs_embeds is not None: + raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") + elif input_ids is not None: + self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask) + input_shape = input_ids.size() + elif inputs_embeds is not None: + input_shape = inputs_embeds.size()[:-1] + else: + raise ValueError("You have to specify either input_ids or 
inputs_embeds") + + batch_size, seq_length = input_shape + device = input_ids.device if input_ids is not None else inputs_embeds.device + + if attention_mask is None: + attention_mask = torch.ones(((batch_size, seq_length)), device=device) + + if token_type_ids is None: + if hasattr(self.embeddings, "token_type_ids"): + buffered_token_type_ids = self.embeddings.token_type_ids[:, :seq_length] + buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length) + token_type_ids = buffered_token_type_ids_expanded + else: + token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device) + + # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length] + # ourselves in which case we just need to make it broadcastable to all heads. + extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape) + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x n_heads x N x N + # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] + # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length] + head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers) + + embedding_output = self.embeddings( + input_ids=input_ids, + position_ids=position_ids, + token_type_ids=token_type_ids, + inputs_embeds=inputs_embeds, + ) + encoder_outputs = self.encoder( + embedding_output, + attention_mask=extended_attention_mask, + head_mask=head_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + sequence_output = encoder_outputs[0] + pooled_output = self.pooler(sequence_output) if self.pooler is not None else None + + if not return_dict: + return (sequence_output, pooled_output) + encoder_outputs[1:] + + return BaseModelOutputWithPoolingAndCrossAttentions( + last_hidden_state=sequence_output, + pooler_output=pooled_output, + hidden_states=encoder_outputs.hidden_states, + attentions=encoder_outputs.attentions, + cross_attentions=encoder_outputs.cross_attentions, + ) + + +@add_start_docstrings( + """The vision model from ALIGN without any head or projection on top.""", + ALIGN_START_DOCSTRING, +) +class AlignVisionModel(AlignPreTrainedModel): + config_class = AlignVisionConfig + main_input_name = "pixel_values" + supports_gradient_checkpointing = False + + def __init__(self, config: AlignVisionConfig): + super().__init__(config) + self.config = config + self.embeddings = AlignVisionEmbeddings(config) + self.encoder = AlignVisionEncoder(config) + + # Final pooling layer + if config.pooling_type == "mean": + self.pooler = nn.AvgPool2d(config.hidden_dim, ceil_mode=True) + elif config.pooling_type == "max": + self.pooler = nn.MaxPool2d(config.hidden_dim, ceil_mode=True) + else: + raise ValueError(f"config.pooling must be one of ['mean', 'max'] got {config.pooling}") + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self) -> nn.Module: + return self.vision_model.embeddings.convolution + + @add_start_docstrings_to_model_forward(ALIGN_VISION_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=BaseModelOutputWithPoolingAndNoAttention, config_class=AlignVisionConfig) + def forward( + self, + pixel_values: Optional[torch.FloatTensor] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, 
BaseModelOutputWithPoolingAndNoAttention]: + r""" + Returns: + + Examples: + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import AutoProcessor, AlignVisionModel + + >>> model = AlignVisionModel.from_pretrained("kakaobrain/align-base") + >>> processor = AutoProcessor.from_pretrained("kakaobrain/align-base") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor(images=image, return_tensors="pt") + + >>> outputs = model(**inputs) + >>> last_hidden_state = outputs.last_hidden_state + >>> pooled_output = outputs.pooler_output # pooled CLS states + ```""" + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if pixel_values is None: + raise ValueError("You have to specify pixel_values") + + embedding_output = self.embeddings(pixel_values) + encoder_outputs = self.encoder( + embedding_output, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + # Apply pooling + last_hidden_state = encoder_outputs[0] + pooled_output = self.pooler(last_hidden_state) + # Reshape (batch_size, projection_dim, 1 , 1) -> (batch_size, projection_dim) + pooled_output = pooled_output.reshape(pooled_output.shape[:2]) + + if not return_dict: + return (last_hidden_state, pooled_output) + encoder_outputs[1:] + + return BaseModelOutputWithPoolingAndNoAttention( + last_hidden_state=last_hidden_state, + pooler_output=pooled_output, + hidden_states=encoder_outputs.hidden_states, + ) + + +@add_start_docstrings(ALIGN_START_DOCSTRING) +class AlignModel(AlignPreTrainedModel): + config_class = AlignConfig + + def __init__(self, config: AlignConfig): + super().__init__(config) + + if not isinstance(config.text_config, AlignTextConfig): + raise ValueError( + "config.text_config is expected to be of type AlignTextConfig but is of type" + f" {type(config.text_config)}." + ) + + if not isinstance(config.vision_config, AlignVisionConfig): + raise ValueError( + "config.vision_config is expected to be of type AlignVisionConfig but is of type" + f" {type(config.vision_config)}." + ) + + text_config = config.text_config + vision_config = config.vision_config + + self.projection_dim = config.projection_dim + self.text_embed_dim = text_config.hidden_size + + self.text_model = AlignTextModel(text_config) + self.vision_model = AlignVisionModel(vision_config) + + self.text_projection = nn.Linear(self.text_embed_dim, self.projection_dim) + self.temperature = nn.Parameter(torch.tensor(self.config.temperature_init_value)) + + # Initialize weights and apply final processing + self.post_init() + + @add_start_docstrings_to_model_forward(ALIGN_TEXT_INPUTS_DOCSTRING) + def get_text_features( + self, + input_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + token_type_ids: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + inputs_embeds: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> torch.FloatTensor: + r""" + Returns: + text_features (`torch.FloatTensor` of shape `(batch_size, output_dim`): The text embeddings obtained by + applying the projection layer to the pooled output of [`AlignTextModel`]. 
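For the vision tower, the pooling step above reduces the final feature map to one vector per image before any projection. A stand-alone sketch of the same mean-pool-and-flatten operation with invented shapes (in the real model the kernel size comes from `config.hidden_dim`):

```python
# Editorial sketch of the "mean" pooling head: average over the spatial grid
# of the last feature map and flatten (batch, channels, 1, 1) to (batch, channels).
import torch
import torch.nn as nn

last_hidden_state = torch.randn(2, 1280, 7, 7)        # (batch, channels, H, W), made up
pooler = nn.AvgPool2d(kernel_size=7, ceil_mode=True)  # kernel covers the whole grid

pooled = pooler(last_hidden_state)         # (2, 1280, 1, 1)
pooled = pooled.reshape(pooled.shape[:2])  # (2, 1280)
print(pooled.shape)
```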
+ + Examples: + + ```python + >>> from transformers import AutoTokenizer, AlignModel + + >>> model = AlignModel.from_pretrained("kakaobrain/align-base") + >>> tokenizer = AutoTokenizer.from_pretrained("kakaobrain/align-base") + + >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt") + >>> text_features = model.get_text_features(**inputs) + ```""" + # Use ALIGN model's config for some fields (if specified) instead of those of vision & text components. + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + text_outputs = self.text_model( + input_ids=input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + last_hidden_state = text_outputs[0][:, 0, :] + text_features = self.text_projection(last_hidden_state) + + return text_features + + @add_start_docstrings_to_model_forward(ALIGN_VISION_INPUTS_DOCSTRING) + def get_image_features( + self, + pixel_values: Optional[torch.FloatTensor] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> torch.FloatTensor: + r""" + Returns: + image_features (`torch.FloatTensor` of shape `(batch_size, output_dim`): The image embeddings obtained by + applying the projection layer to the pooled output of [`AlignVisionModel`]. + + Examples: + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import AutoProcessor, AlignModel + + >>> model = AlignModel.from_pretrained("kakaobrain/align-base") + >>> processor = AutoProcessor.from_pretrained("kakaobrain/align-base") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor(images=image, return_tensors="pt") + + >>> image_features = model.get_image_features(**inputs) + ```""" + # Use ALIGN model's config for some fields (if specified) instead of those of vision & text components. 
+ output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + vision_outputs = self.vision_model( + pixel_values=pixel_values, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + image_features = vision_outputs[1] # pooled_output + + return image_features + + @add_start_docstrings_to_model_forward(ALIGN_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=AlignOutput, config_class=AlignConfig) + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + pixel_values: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.Tensor] = None, + token_type_ids: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + inputs_embeds: Optional[torch.Tensor] = None, + return_loss: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, AlignOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import AutoProcessor, AlignModel + + >>> model = AlignModel.from_pretrained("kakaobrain/align-base") + >>> processor = AutoProcessor.from_pretrained("kakaobrain/align-base") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor( + ... text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True + ... ) + + >>> outputs = model(**inputs) + >>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score + >>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities + ```""" + # Use ALIGN model's config for some fields (if specified) instead of those of vision & text components. 
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + vision_outputs = self.vision_model( + pixel_values=pixel_values, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + text_outputs = self.text_model( + input_ids=input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + image_embeds = vision_outputs[1] + text_embeds = text_outputs[0][:, 0, :] + text_embeds = self.text_projection(text_embeds) + + # normalized features + image_embeds = image_embeds / image_embeds.norm(p=2, dim=-1, keepdim=True) + text_embeds = text_embeds / text_embeds.norm(p=2, dim=-1, keepdim=True) + + # cosine similarity as logits + logits_per_text = torch.matmul(text_embeds, image_embeds.t()) / self.temperature + logits_per_image = logits_per_text.t() + + loss = None + if return_loss: + loss = align_loss(logits_per_text) + + if not return_dict: + output = (logits_per_image, logits_per_text, text_embeds, image_embeds, text_outputs, vision_outputs) + return ((loss,) + output) if loss is not None else output + + return AlignOutput( + loss=loss, + logits_per_image=logits_per_image, + logits_per_text=logits_per_text, + text_embeds=text_embeds, + image_embeds=image_embeds, + text_model_output=text_outputs, + vision_model_output=vision_outputs, + ) diff --git a/src/transformers/models/align/processing_align.py b/src/transformers/models/align/processing_align.py new file mode 100644 index 00000000000000..0863c11310e318 --- /dev/null +++ b/src/transformers/models/align/processing_align.py @@ -0,0 +1,122 @@ +# coding=utf-8 +# Copyright 2023 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Image/Text processor class for ALIGN +""" + + +from ...processing_utils import ProcessorMixin +from ...tokenization_utils_base import BatchEncoding + + +class AlignProcessor(ProcessorMixin): + r""" + Constructs an ALIGN processor which wraps [`EfficientNetImageProcessor`] and + [`BertTokenizer`]/[`BertTokenizerFast`] into a single processor that interits both the image processor and + tokenizer functionalities. See the [`~AlignProcessor.__call__`] and [`~OwlViTProcessor.decode`] for more + information. + + Args: + image_processor ([`EfficientNetImageProcessor`]): + The image processor is a required input. + tokenizer ([`BertTokenizer`, `BertTokenizerFast`]): + The tokenizer is a required input. 
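The forward pass above produces image-text logits by L2-normalizing both embedding sets and dividing their dot products by the learned temperature. A minimal numeric sketch with random stand-in embeddings and an arbitrary projection size:

```python
# Editorial sketch of the logit computation: normalize, dot product, temperature.
import torch

text_embeds = torch.randn(4, 640)   # (text_batch, projection_dim), sizes made up
image_embeds = torch.randn(3, 640)  # (image_batch, projection_dim)
temperature = torch.tensor(1.0)     # stands in for the learned parameter

text_embeds = text_embeds / text_embeds.norm(p=2, dim=-1, keepdim=True)
image_embeds = image_embeds / image_embeds.norm(p=2, dim=-1, keepdim=True)

logits_per_text = torch.matmul(text_embeds, image_embeds.t()) / temperature  # (4, 3)
logits_per_image = logits_per_text.t()                                       # (3, 4)
probs = logits_per_image.softmax(dim=1)  # per image, a distribution over the texts
print(probs.shape)
```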
+ """ + + attributes = ["image_processor", "tokenizer"] + image_processor_class = "EfficientNetImageProcessor" + tokenizer_class = ("BertTokenizer", "BertTokenizerFast") + + def __init__(self, image_processor, tokenizer): + super().__init__(image_processor, tokenizer) + + def __call__(self, text=None, images=None, padding="max_length", max_length=64, return_tensors=None, **kwargs): + """ + Main method to prepare text(s) and image(s) to be fed as input to the model. This method forwards the `text` + and `kwargs` arguments to BertTokenizerFast's [`~BertTokenizerFast.__call__`] if `text` is not `None` to encode + the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to + EfficientNetImageProcessor's [`~EfficientNetImageProcessor.__call__`] if `images` is not `None`. Please refer + to the doctsring of the above two methods for more information. + + Args: + text (`str`, `List[str]`): + The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings + (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set + `is_split_into_words=True` (to lift the ambiguity with a batch of sequences). + images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`): + The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch + tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a + number of channels, H and W are image height and width. + padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `max_length`): + Activates and controls padding for tokenization of input text. Choose between [`True` or `'longest'`, + `'max_length'`, `False` or `'do_not_pad'`] + max_length (`int`, *optional*, defaults to `max_length`): + Maximum padding value to use to pad the input text during tokenization. + + return_tensors (`str` or [`~utils.TensorType`], *optional*): + If set, will return tensors of a particular framework. Acceptable values are: + + - `'tf'`: Return TensorFlow `tf.constant` objects. + - `'pt'`: Return PyTorch `torch.Tensor` objects. + - `'np'`: Return NumPy `np.ndarray` objects. + - `'jax'`: Return JAX `jnp.ndarray` objects. + + Returns: + [`BatchEncoding`]: A [`BatchEncoding`] with the following fields: + + - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`. + - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when + `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not + `None`). + - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`. + """ + if text is None and images is None: + raise ValueError("You have to specify either text or images. 
Both cannot be none.") + + if text is not None: + encoding = self.tokenizer( + text, padding=padding, max_length=max_length, return_tensors=return_tensors, **kwargs + ) + + if images is not None: + image_features = self.image_processor(images, return_tensors=return_tensors, **kwargs) + + if text is not None and images is not None: + encoding["pixel_values"] = image_features.pixel_values + return encoding + elif text is not None: + return encoding + else: + return BatchEncoding(data=dict(**image_features), tensor_type=return_tensors) + + def batch_decode(self, *args, **kwargs): + """ + This method forwards all its arguments to BertTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please + refer to the docstring of this method for more information. + """ + return self.tokenizer.batch_decode(*args, **kwargs) + + def decode(self, *args, **kwargs): + """ + This method forwards all its arguments to BertTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to + the docstring of this method for more information. + """ + return self.tokenizer.decode(*args, **kwargs) + + @property + def model_input_names(self): + tokenizer_input_names = self.tokenizer.model_input_names + image_processor_input_names = self.image_processor.model_input_names + return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names)) diff --git a/src/transformers/models/altclip/configuration_altclip.py b/src/transformers/models/altclip/configuration_altclip.py index 523cf420e0aed5..b9d451d2c05050 100755 --- a/src/transformers/models/altclip/configuration_altclip.py +++ b/src/transformers/models/altclip/configuration_altclip.py @@ -13,7 +13,6 @@ # See the License for the specific language governing permissions and # limitations under the License. """ AltCLIP model configuration""" -import copy import os from typing import Union @@ -62,12 +61,19 @@ class AltCLIPTextConfig(PretrainedConfig): max_position_embeddings (`int`, *optional*, defaults to 514): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). - type_vocab_size (`int`, *optional*, defaults to 2): + type_vocab_size (`int`, *optional*, defaults to 1): The vocabulary size of the `token_type_ids` passed when calling [`AltCLIPTextModel`] initializer_range (`float`, *optional*, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. - layer_norm_eps (`float`, *optional*, defaults to 1e-5): + initializer_factor (`float`, *optional*, defaults to 0.02): + A factor for initializing all weight matrices (should be kept to 1, used internally for initialization + testing). + layer_norm_eps (`float`, *optional*, defaults to 1e-05): The epsilon used by the layer normalization layers. + pad_token_id (`int`, *optional*, defaults to 1): The id of the *padding* token. + bos_token_id (`int`, *optional*, defaults to 0): The id of the *beginning-of-sequence* token. + eos_token_id (`Union[int, List[int]]`, *optional*, defaults to 2): + The id of the *end-of-sequence* token. Optionally, use a list to set multiple *end-of-sequence* tokens. position_embedding_type (`str`, *optional*, defaults to `"absolute"`): Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For positional embeddings use `"absolute"`. 
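A hedged usage sketch for the `AlignProcessor` added above; the checkpoint id and image URL are assumptions made for illustration only:

```python
# Sketch: pairing the Bert tokenizer and EfficientNet image processor through AlignProcessor.
import requests
from PIL import Image
from transformers import AlignProcessor

processor = AlignProcessor.from_pretrained("kakaobrain/align-base")  # assumed checkpoint id
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["two cats on a couch"], images=image, return_tensors="pt")
# input_ids / attention_mask (and token_type_ids) come from the tokenizer, pixel_values from the image processor
print(list(inputs.keys()))
```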
For more information on `"relative_key"`, please refer to @@ -94,6 +100,7 @@ class AltCLIPTextConfig(PretrainedConfig): >>> # Accessing the model configuration >>> configuration = model.config ```""" + model_type = "altclip_text_model" def __init__( @@ -155,10 +162,14 @@ class AltCLIPVisionConfig(PretrainedConfig): Dimensionality of the encoder layers and the pooler layer. intermediate_size (`int`, *optional*, defaults to 3072): Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. + projection_dim (`int`, *optional*, defaults to 512): + Dimentionality of text and vision projection layers. num_hidden_layers (`int`, *optional*, defaults to 12): Number of hidden layers in the Transformer encoder. num_attention_heads (`int`, *optional*, defaults to 12): Number of attention heads for each attention layer in the Transformer encoder. + num_channels (`int`, *optional*, defaults to 3): + The number of input channels. image_size (`int`, *optional*, defaults to 224): The size (resolution) of each image. patch_size (`int`, *optional*, defaults to 32): @@ -166,13 +177,13 @@ class AltCLIPVisionConfig(PretrainedConfig): hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`): The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, `"relu"`, `"selu"` and `"gelu_new"` ``"quick_gelu"` are supported. - layer_norm_eps (`float`, *optional*, defaults to 1e-5): + layer_norm_eps (`float`, *optional*, defaults to 1e-05): The epsilon used by the layer normalization layers. attention_dropout (`float`, *optional*, defaults to 0.0): The dropout ratio for the attention probabilities. initializer_range (`float`, *optional*, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. - initializer_factor (`float``, *optional*, defaults to 1): + initializer_factor (`float`, *optional*, defaults to 1.0): A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing). @@ -228,6 +239,8 @@ def __init__( @classmethod def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + cls._set_token_in_kwargs(kwargs) + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) # get the vision config dict if we are loading from AltCLIPConfig @@ -258,7 +271,7 @@ class AltCLIPConfig(PretrainedConfig): Dictionary of configuration options used to initialize [`AltCLIPTextConfig`]. vision_config (`dict`, *optional*): Dictionary of configuration options used to initialize [`AltCLIPVisionConfig`]. - projection_dim (`int`, *optional*, defaults to 512): + projection_dim (`int`, *optional*, defaults to 768): Dimentionality of text and vision projection layers. logit_scale_init_value (`float`, *optional*, defaults to 2.6592): The inital value of the *logit_scale* paramter. Default is used as per the original CLIP implementation. @@ -289,28 +302,87 @@ class AltCLIPConfig(PretrainedConfig): ```""" model_type = "altclip" - is_composition = True def __init__( self, text_config=None, vision_config=None, projection_dim=768, logit_scale_init_value=2.6592, **kwargs ): - super().__init__(**kwargs) - # If `_config_dict` exist, we use them for the backward compatibility. + # We pop out these 2 attributes before calling `super().__init__` to avoid them being saved (which causes a lot + # of confusion!). 
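The `__init__` hunk that follows merges the legacy `text_config_dict` / `vision_config_dict` kwargs into `text_config` / `vision_config` instead of replacing them, logging when a key is overridden. A small sketch of the resulting behavior, with illustrative values:

```python
# Sketch of the backward-compatibility path implemented below (values are illustrative).
from transformers import AltCLIPConfig

# New-style composition: nested dicts become AltCLIPTextConfig / AltCLIPVisionConfig instances.
config = AltCLIPConfig(text_config={"vocab_size": 250002}, vision_config={"image_size": 224})

# Legacy `text_config_dict` is popped before `super().__init__` and merged into `text_config`;
# on conflicting keys, the `text_config_dict` value wins and an informational message is logged.
legacy = AltCLIPConfig(text_config={"vocab_size": 1000}, text_config_dict={"vocab_size": 250002})
print(legacy.text_config.vocab_size)  # 250002
```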
text_config_dict = kwargs.pop("text_config_dict", None) vision_config_dict = kwargs.pop("vision_config_dict", None) + + super().__init__(**kwargs) + + # Instead of simply assigning `[text|vision]_config_dict` to `[text|vision]_config`, we use the values in + # `[text|vision]_config_dict` to update the values in `[text|vision]_config`. The values should be same in most + # cases, but we don't want to break anything regarding `_config_dict` that existed before commit `8827e1b2`. if text_config_dict is not None: - text_config = text_config_dict + if text_config is None: + text_config = {} + + # This is the complete result when using `text_config_dict`. + _text_config_dict = AltCLIPTextConfig(**text_config_dict).to_dict() + + # Give a warning if the values exist in both `_text_config_dict` and `text_config` but being different. + for key, value in _text_config_dict.items(): + if key in text_config and value != text_config[key] and key not in ["transformers_version"]: + # If specified in `text_config_dict` + if key in text_config_dict: + message = ( + f"`{key}` is found in both `text_config_dict` and `text_config` but with different values. " + f'The value `text_config_dict["{key}"]` will be used instead.' + ) + # If inferred from default argument values (just to be super careful) + else: + message = ( + f"`text_config_dict` is provided which will be used to initialize `AltCLIPTextConfig`. The " + f'value `text_config["{key}"]` will be overriden.' + ) + logger.info(message) + + # Update all values in `text_config` with the ones in `_text_config_dict`. + text_config.update(_text_config_dict) + if vision_config_dict is not None: - vision_config = vision_config_dict + if vision_config is None: + vision_config = {} + + # This is the complete result when using `vision_config_dict`. + _vision_config_dict = AltCLIPVisionConfig(**vision_config_dict).to_dict() + # convert keys to string instead of integer + if "id2label" in _vision_config_dict: + _vision_config_dict["id2label"] = { + str(key): value for key, value in _vision_config_dict["id2label"].items() + } + + # Give a warning if the values exist in both `_vision_config_dict` and `vision_config` but being different. + for key, value in _vision_config_dict.items(): + if key in vision_config and value != vision_config[key] and key not in ["transformers_version"]: + # If specified in `vision_config_dict` + if key in vision_config_dict: + message = ( + f"`{key}` is found in both `vision_config_dict` and `vision_config` but with different " + f'values. The value `vision_config_dict["{key}"]` will be used instead.' + ) + # If inferred from default argument values (just to be super careful) + else: + message = ( + f"`vision_config_dict` is provided which will be used to initialize `AltCLIPVisionConfig`. " + f'The value `vision_config["{key}"]` will be overriden.' + ) + logger.info(message) + + # Update all values in `vision_config` with the ones in `_vision_config_dict`. + vision_config.update(_vision_config_dict) if text_config is None: text_config = {} - logger.info("text_config is None. Initializing the AltCLIPTextConfig with default values.") + logger.info("`text_config` is `None`. Initializing the `AltCLIPTextConfig` with default values.") if vision_config is None: vision_config = {} - logger.info("vision_config is None. initializing the AltCLIPVisionConfig with default values.") + logger.info("`vision_config` is `None`. 
initializing the `AltCLIPVisionConfig` with default values.") self.text_config = AltCLIPTextConfig(**text_config) self.vision_config = AltCLIPVisionConfig(**vision_config) @@ -330,16 +402,3 @@ def from_text_vision_configs(cls, text_config: AltCLIPTextConfig, vision_config: """ return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), **kwargs) - - def to_dict(self): - """ - Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`]. - - Returns: - `Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance, - """ - output = copy.deepcopy(self.__dict__) - output["text_config"] = self.text_config.to_dict() - output["vision_config"] = self.vision_config.to_dict() - output["model_type"] = self.__class__.model_type - return output diff --git a/src/transformers/models/altclip/modeling_altclip.py b/src/transformers/models/altclip/modeling_altclip.py index 8f05e71a460dfa..2f511bace5fa25 100755 --- a/src/transformers/models/altclip/modeling_altclip.py +++ b/src/transformers/models/altclip/modeling_altclip.py @@ -174,8 +174,7 @@ class AltCLIPOutput(ModelOutput): text_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim`): The text embeddings obtained by applying the projection layer to the pooled output of [`AltCLIPTextModel`]. image_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim`): - The image embeddings obtained by applying the projection layer to the pooled output of - [`AltCLIPVisionModel`]. + The image embeddings obtained by applying the projection layer to the pooled output of [`AltCLIPVisionModel`]. text_model_output(`BaseModelOutputWithPooling`): The output of the [`AltCLIPTextModel`]. vision_model_output(`BaseModelOutputWithPooling`): @@ -216,7 +215,9 @@ def __init__(self, config): self.dropout = nn.Dropout(config.hidden_dropout_prob) # position_ids (1, len position emb) is contiguous in memory and exported when serialized self.position_embedding_type = getattr(config, "position_embedding_type", "absolute") - self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1))) + self.register_buffer( + "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False + ) self.register_buffer( "token_type_ids", torch.zeros(self.position_ids.size(), dtype=torch.long), persistent=False ) @@ -628,6 +629,13 @@ def forward( all_self_attentions = () if output_attentions else None all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning_once( + "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." + ) + use_cache = False + next_decoder_cache = () if use_cache else None for i, layer_module in enumerate(self.layer): if output_hidden_states: @@ -637,25 +645,15 @@ def forward( past_key_value = past_key_values[i] if past_key_values is not None else None if self.gradient_checkpointing and self.training: - if use_cache: - logger.warning( - "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." 
- ) - use_cache = False - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, past_key_value, output_attentions) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(layer_module), + layer_outputs = self._gradient_checkpointing_func( + layer_module.__call__, hidden_states, attention_mask, layer_head_mask, encoder_hidden_states, encoder_attention_mask, + past_key_value, + output_attentions, ) else: layer_outputs = layer_module( @@ -956,18 +954,12 @@ def forward( if output_hidden_states: encoder_states = encoder_states + (hidden_states,) if self.gradient_checkpointing and self.training: - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, output_attentions) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(encoder_layer), + layer_outputs = self._gradient_checkpointing_func( + encoder_layer.__call__, hidden_states, attention_mask, causal_attention_mask, + output_attentions, ) else: layer_outputs = encoder_layer( @@ -1014,11 +1006,12 @@ def __init__(self, config: AltCLIPVisionConfig): self.num_patches = (self.image_size // self.patch_size) ** 2 self.num_positions = self.num_patches + 1 self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim) - self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1))) + self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1)), persistent=False) def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor: batch_size = pixel_values.shape[0] - patch_embeds = self.patch_embedding(pixel_values) # shape = [*, width, grid, grid] + target_dtype = self.patch_embedding.weight.dtype + patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype)) # shape = [*, width, grid, grid] patch_embeds = patch_embeds.flatten(2).transpose(1, 2) class_embeds = self.class_embedding.expand(batch_size, 1, -1) @@ -1036,7 +1029,6 @@ class AltCLIPPreTrainedModel(PreTrainedModel): config_class = AltCLIPConfig base_model_prefix = "altclip" supports_gradient_checkpointing = True - _keys_to_ignore_on_load_missing = [r"position_ids"] def _init_weights(self, module): """Initialize the weights""" @@ -1056,9 +1048,7 @@ def _init_weights(self, module): nn.init.normal_(module.out_proj.weight, std=out_proj_std) elif isinstance(module, AltCLIPMLP): factor = self.config.initializer_factor - in_proj_std = ( - (module.config.hidden_size**-0.5) * ((2 * module.config.num_hidden_layers) ** -0.5) * factor - ) + in_proj_std = (module.config.hidden_size**-0.5) * ((2 * module.config.num_hidden_layers) ** -0.5) * factor fc_std = (2 * module.config.hidden_size) ** -0.5 * factor nn.init.normal_(module.fc1.weight, std=fc_std) nn.init.normal_(module.fc2.weight, std=in_proj_std) @@ -1085,12 +1075,6 @@ def _init_weights(self, module): if module.padding_idx is not None: module.weight.data[module.padding_idx].zero_() - def _set_gradient_checkpointing(self, module, value=False): - if isinstance(module, AltCLIPEncoder): - module.gradient_checkpointing = value - if isinstance(module, AltRobertaEncoder): - module.gradient_checkpointing = value - # Copied from transformers.models.clip.modeling_clip.CLIPVisionTransformer with CLIPVisionTransformer->AltCLIPVisionTransformer,CLIPVisionConfig->AltCLIPVisionConfig,CLIPVisionEmbeddings->AltCLIPVisionEmbeddings,CLIPEncoder->AltCLIPEncoder,CLIP_VISION_INPUTS_DOCSTRING->ALTCLIP_VISION_INPUTS_DOCSTRING class 
AltCLIPVisionTransformer(nn.Module): @@ -1301,6 +1285,7 @@ def forward( if input_ids is not None and inputs_embeds is not None: raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") elif input_ids is not None: + self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask) input_shape = input_ids.size() elif inputs_embeds is not None: input_shape = inputs_embeds.size()[:-1] @@ -1415,7 +1400,7 @@ def forward( output_attentions: Optional[bool] = None, return_dict: Optional[bool] = None, output_hidden_states: Optional[bool] = None, - ): + ) -> Union[Tuple, BaseModelOutputWithPoolingAndProjection]: r""" Returns: @@ -1502,7 +1487,7 @@ def __init__(self, config: AltCLIPConfig): self.visual_projection = nn.Linear(self.vision_embed_dim, self.projection_dim, bias=False) self.text_projection = nn.Linear(self.text_embed_dim, self.projection_dim, bias=False) - self.logit_scale = nn.Parameter(torch.ones([]) * self.config.logit_scale_init_value) + self.logit_scale = nn.Parameter(torch.tensor(self.config.logit_scale_init_value)) # Initialize weights and apply final processing self.post_init() @@ -1608,7 +1593,7 @@ def forward( pixel_values: Optional[torch.FloatTensor] = None, attention_mask: Optional[torch.Tensor] = None, position_ids: Optional[torch.LongTensor] = None, - token_type_ids=None, + token_type_ids: Optional[torch.Tensor] = None, return_loss: Optional[bool] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, diff --git a/src/transformers/models/altclip/processing_altclip.py b/src/transformers/models/altclip/processing_altclip.py index 8fe49ad678e9c8..e9b4f45269ca76 100644 --- a/src/transformers/models/altclip/processing_altclip.py +++ b/src/transformers/models/altclip/processing_altclip.py @@ -30,16 +30,18 @@ class AltCLIPProcessor(ProcessorMixin): the [`~AltCLIPProcessor.__call__`] and [`~AltCLIPProcessor.decode`] for more information. Args: - image_processor ([`CLIPImageProcessor`]): + image_processor ([`CLIPImageProcessor`], *optional*): The image processor is a required input. - tokenizer ([`XLMRobertaTokenizerFast`]): + tokenizer ([`XLMRobertaTokenizerFast`], *optional*): The tokenizer is a required input. """ + attributes = ["image_processor", "tokenizer"] image_processor_class = "CLIPImageProcessor" tokenizer_class = ("XLMRobertaTokenizer", "XLMRobertaTokenizerFast") def __init__(self, image_processor=None, tokenizer=None, **kwargs): + feature_extractor = None if "feature_extractor" in kwargs: warnings.warn( "The `feature_extractor` argument is deprecated and will be removed in v5, use `image_processor`" diff --git a/src/transformers/models/audio_spectrogram_transformer/__init__.py b/src/transformers/models/audio_spectrogram_transformer/__init__.py index 9aa42423cf5fda..2b48fe07311c1e 100644 --- a/src/transformers/models/audio_spectrogram_transformer/__init__.py +++ b/src/transformers/models/audio_spectrogram_transformer/__init__.py @@ -13,14 +13,15 @@ # limitations under the License. 
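The modeling hunks above replace the per-encoder `create_custom_forward` closures with the shared `self._gradient_checkpointing_func` helper and hoist the `use_cache` warning ahead of the layer loop. A minimal sketch of how that path is exercised from user code (the checkpoint id is assumed to be the public AltCLIP weights; inputs are arbitrary):

```python
# Sketch: enabling the refactored gradient-checkpointing path on AltCLIPModel.
import torch
from transformers import AltCLIPModel

model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")  # assumed checkpoint id
model.gradient_checkpointing_enable()  # layers now run through self._gradient_checkpointing_func
model.train()

input_ids = torch.tensor([[0, 35378, 2]])        # arbitrary token ids for illustration
pixel_values = torch.randn(1, 3, 224, 224)
outputs = model(input_ids=input_ids, pixel_values=pixel_values)
```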
from typing import TYPE_CHECKING -from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_speech_available, is_torch_available +from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available _import_structure = { "configuration_audio_spectrogram_transformer": [ "AUDIO_SPECTROGRAM_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "ASTConfig", - ] + ], + "feature_extraction_audio_spectrogram_transformer": ["ASTFeatureExtractor"], } try: @@ -36,19 +37,13 @@ "ASTPreTrainedModel", ] -try: - if not is_speech_available(): - raise OptionalDependencyNotAvailable() -except OptionalDependencyNotAvailable: - pass -else: - _import_structure["feature_extraction_audio_spectrogram_transformer"] = ["ASTFeatureExtractor"] if TYPE_CHECKING: from .configuration_audio_spectrogram_transformer import ( AUDIO_SPECTROGRAM_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, ASTConfig, ) + from .feature_extraction_audio_spectrogram_transformer import ASTFeatureExtractor try: if not is_torch_available(): @@ -63,14 +58,6 @@ ASTPreTrainedModel, ) - try: - if not is_speech_available(): - raise OptionalDependencyNotAvailable() - except OptionalDependencyNotAvailable: - pass - else: - from .feature_extraction_audio_spectrogram_transformer import ASTFeatureExtractor - else: import sys diff --git a/src/transformers/models/audio_spectrogram_transformer/configuration_audio_spectrogram_transformer.py b/src/transformers/models/audio_spectrogram_transformer/configuration_audio_spectrogram_transformer.py index 22b0ca70ac8520..81a087f07f69f1 100644 --- a/src/transformers/models/audio_spectrogram_transformer/configuration_audio_spectrogram_transformer.py +++ b/src/transformers/models/audio_spectrogram_transformer/configuration_audio_spectrogram_transformer.py @@ -51,15 +51,15 @@ class ASTConfig(PretrainedConfig): hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`): The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, `"relu"`, `"selu"` and `"gelu_new"` are supported. - hidden_dropout_prob (`float`, *optional*, defaults to 0.1): + hidden_dropout_prob (`float`, *optional*, defaults to 0.0): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. - attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1): + attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0): The dropout ratio for the attention probabilities. initializer_range (`float`, *optional*, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. layer_norm_eps (`float`, *optional*, defaults to 1e-12): The epsilon used by the layer normalization layers. - patch_size (`int`, *optional*, defaults to `16`): + patch_size (`int`, *optional*, defaults to 16): The size (resolution) of each patch. qkv_bias (`bool`, *optional*, defaults to `True`): Whether to add a bias to the queries, keys and values. 
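With the lazy-import change above, `ASTFeatureExtractor` is exposed regardless of whether torchaudio is installed, and the docstring defaults now match the code. A short sketch assuming default arguments:

```python
# Sketch: the feature extractor is importable without the speech extra; defaults match the docstring fixes.
from transformers import ASTConfig, ASTFeatureExtractor

config = ASTConfig()               # hidden_dropout_prob and attention_probs_dropout_prob default to 0.0
feature_extractor = ASTFeatureExtractor()
print(config.patch_size, feature_extractor.num_mel_bins)  # 16 128
```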
@@ -86,6 +86,7 @@ class ASTConfig(PretrainedConfig): >>> # Accessing the model configuration >>> configuration = model.config ```""" + model_type = "audio-spectrogram-transformer" def __init__( diff --git a/src/transformers/models/audio_spectrogram_transformer/feature_extraction_audio_spectrogram_transformer.py b/src/transformers/models/audio_spectrogram_transformer/feature_extraction_audio_spectrogram_transformer.py index deda2fc7781b28..2bd122b4098c36 100644 --- a/src/transformers/models/audio_spectrogram_transformer/feature_extraction_audio_spectrogram_transformer.py +++ b/src/transformers/models/audio_spectrogram_transformer/feature_extraction_audio_spectrogram_transformer.py @@ -19,12 +19,18 @@ from typing import List, Optional, Union import numpy as np -import torch -import torchaudio.compliance.kaldi as ta_kaldi +from ...audio_utils import mel_filter_bank, spectrogram, window_function from ...feature_extraction_sequence_utils import SequenceFeatureExtractor from ...feature_extraction_utils import BatchFeature -from ...utils import TensorType, logging +from ...utils import TensorType, is_speech_available, is_torch_available, logging + + +if is_speech_available(): + import torchaudio.compliance.kaldi as ta_kaldi + +if is_torch_available(): + import torch logger = logging.get_logger(__name__) @@ -37,8 +43,8 @@ class ASTFeatureExtractor(SequenceFeatureExtractor): This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods. - This class extracts mel-filter bank features from raw speech using TorchAudio, pads/truncates them to a fixed - length and normalizes them using a mean and standard deviation. + This class extracts mel-filter bank features from raw speech using TorchAudio if installed or using numpy + otherwise, pads/truncates them to a fixed length and normalizes them using a mean and standard deviation. Args: feature_size (`int`, *optional*, defaults to 1): @@ -83,6 +89,21 @@ def __init__( self.std = std self.return_attention_mask = return_attention_mask + if not is_speech_available(): + mel_filters = mel_filter_bank( + num_frequency_bins=256, + num_mel_filters=self.num_mel_bins, + min_frequency=20, + max_frequency=sampling_rate // 2, + sampling_rate=sampling_rate, + norm=None, + mel_scale="kaldi", + triangularize_in_mel_space=True, + ) + + self.mel_filters = np.pad(mel_filters, ((0, 1), (0, 0))) + self.window = window_function(400, "hann", periodic=False) + def _extract_fbank_features( self, waveform: np.ndarray, @@ -93,17 +114,32 @@ def _extract_fbank_features( and hence the waveform should not be normalized before feature extraction. 
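Whether the Kaldi `fbank` path (torchaudio available) or the numpy `spectrogram` fallback configured in `__init__` above is taken, the extractor returns the same padded log-mel features. A hedged usage sketch with illustrative values:

```python
# Sketch: extracting log-mel features from a mono waveform with the AST feature extractor.
import numpy as np
from transformers import ASTFeatureExtractor

feature_extractor = ASTFeatureExtractor()
waveform = np.random.randn(16000).astype(np.float32)   # 1 second of mono audio at 16 kHz

inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
print(inputs["input_values"].shape)   # (batch, max_length, num_mel_bins), e.g. torch.Size([1, 1024, 128])
```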
""" # waveform = waveform * (2**15) # Kaldi compliance: 16-bit signed integers - waveform = torch.from_numpy(waveform).unsqueeze(0) - fbank = ta_kaldi.fbank( - waveform, - htk_compat=True, - sample_frequency=self.sampling_rate, - use_energy=False, - window_type="hanning", - num_mel_bins=self.num_mel_bins, - dither=0.0, - frame_shift=10, - ) + if is_speech_available(): + waveform = torch.from_numpy(waveform).unsqueeze(0) + fbank = ta_kaldi.fbank( + waveform, + sample_frequency=self.sampling_rate, + window_type="hanning", + num_mel_bins=self.num_mel_bins, + ) + else: + waveform = np.squeeze(waveform) + fbank = spectrogram( + waveform, + self.window, + frame_length=400, + hop_length=160, + fft_length=512, + power=2.0, + center=False, + preemphasis=0.97, + mel_filters=self.mel_filters, + log_mel="log", + mel_floor=1.192092955078125e-07, + remove_dc_offset=True, + ).T + + fbank = torch.from_numpy(fbank) n_frames = fbank.shape[0] difference = max_length - n_frames @@ -135,7 +171,8 @@ def __call__( Args: raw_speech (`np.ndarray`, `List[float]`, `List[np.ndarray]`, `List[List[float]]`): The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float - values, a list of numpy arrays or a list of list of float values. + values, a list of numpy arrays or a list of list of float values. Must be mono channel audio, not + stereo, i.e. single float per timestep. sampling_rate (`int`, *optional*): The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass `sampling_rate` at the forward call to prevent silent errors. @@ -160,9 +197,11 @@ def __call__( "Failing to do so can result in silent errors that might be hard to debug." ) - is_batched = bool( - isinstance(raw_speech, (list, tuple)) - and (isinstance(raw_speech[0], np.ndarray) or isinstance(raw_speech[0], (tuple, list))) + is_batched_numpy = isinstance(raw_speech, np.ndarray) and len(raw_speech.shape) > 1 + if is_batched_numpy and len(raw_speech.shape) > 2: + raise ValueError(f"Only mono-channel audio is supported for input to {self}") + is_batched = is_batched_numpy or ( + isinstance(raw_speech, (list, tuple)) and (isinstance(raw_speech[0], (np.ndarray, tuple, list))) ) if is_batched: diff --git a/src/transformers/models/audio_spectrogram_transformer/modeling_audio_spectrogram_transformer.py b/src/transformers/models/audio_spectrogram_transformer/modeling_audio_spectrogram_transformer.py index 54b77df7458d2c..3fddccdea75273 100644 --- a/src/transformers/models/audio_spectrogram_transformer/modeling_audio_spectrogram_transformer.py +++ b/src/transformers/models/audio_spectrogram_transformer/modeling_audio_spectrogram_transformer.py @@ -336,17 +336,11 @@ def forward( layer_head_mask = head_mask[i] if head_mask is not None else None if self.gradient_checkpointing and self.training: - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, output_attentions) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(layer_module), + layer_outputs = self._gradient_checkpointing_func( + layer_module.__call__, hidden_states, layer_head_mask, + output_attentions, ) else: layer_outputs = layer_module(hidden_states, layer_head_mask, output_attentions) @@ -394,11 +388,6 @@ def _init_weights(self, module: Union[nn.Linear, nn.Conv2d, nn.LayerNorm]) -> No module.bias.data.zero_() module.weight.data.fill_(1.0) - # Copied from transformers.models.vit.modeling_vit.ViTPreTrainedModel._set_gradient_checkpointing 
with ViT->AST - def _set_gradient_checkpointing(self, module: ASTEncoder, value: bool = False) -> None: - if isinstance(module, ASTEncoder): - module.gradient_checkpointing = value - AUDIO_SPECTROGRAM_TRANSFORMER_START_DOCSTRING = r""" This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it @@ -414,9 +403,12 @@ def _set_gradient_checkpointing(self, module: ASTEncoder, value: bool = False) - AUDIO_SPECTROGRAM_TRANSFORMER_INPUTS_DOCSTRING = r""" Args: - input_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): - Pixel values. Pixel values can be obtained using [`AutoFeatureExtractor`]. See - [`ASTFeatureExtractor.__call__`] for details. + input_values (`torch.FloatTensor` of shape `(batch_size, max_length, num_mel_bins)`): + Float values mel features extracted from the raw audio waveform. Raw audio waveform can be obtained by + loading a `.flac` or `.wav` audio file into an array of type `List[float]` or a `numpy.ndarray`, *e.g.* via + the soundfile library (`pip install soundfile`). To prepare the array into `input_features`, the + [`AutoFeatureExtractor`] should be used for extracting the mel features, padding and conversion into a + tensor of type `torch.FloatTensor`. See [`~ASTFeatureExtractor.__call__`] head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*): Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`: @@ -440,7 +432,7 @@ def _set_gradient_checkpointing(self, module: ASTEncoder, value: bool = False) - AUDIO_SPECTROGRAM_TRANSFORMER_START_DOCSTRING, ) class ASTModel(ASTPreTrainedModel): - def __init__(self, config: ASTConfig): + def __init__(self, config: ASTConfig) -> None: super().__init__(config) self.config = config @@ -478,7 +470,7 @@ def forward( output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - ): + ) -> Union[Tuple, BaseModelOutputWithPooling]: output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions output_hidden_states = ( output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states diff --git a/src/transformers/models/auto/__init__.py b/src/transformers/models/auto/__init__.py index c81071d62a79fd..153f7f10def694 100644 --- a/src/transformers/models/auto/__init__.py +++ b/src/transformers/models/auto/__init__.py @@ -40,6 +40,7 @@ else: _import_structure["modeling_auto"] = [ "MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING", + "MODEL_FOR_AUDIO_FRAME_CLASSIFICATION_MAPPING", "MODEL_FOR_AUDIO_XVECTOR_MAPPING", "MODEL_FOR_BACKBONE_MAPPING", "MODEL_FOR_CAUSAL_IMAGE_MODELING_MAPPING", @@ -49,9 +50,11 @@ "MODEL_FOR_DEPTH_ESTIMATION_MAPPING", "MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING", "MODEL_FOR_IMAGE_SEGMENTATION_MAPPING", + "MODEL_FOR_IMAGE_TO_IMAGE_MAPPING", "MODEL_FOR_INSTANCE_SEGMENTATION_MAPPING", "MODEL_FOR_MASKED_IMAGE_MODELING_MAPPING", "MODEL_FOR_MASKED_LM_MAPPING", + "MODEL_FOR_MASK_GENERATION_MAPPING", "MODEL_FOR_MULTIPLE_CHOICE_MAPPING", "MODEL_FOR_NEXT_SENTENCE_PREDICTION_MAPPING", "MODEL_FOR_OBJECT_DETECTION_MAPPING", @@ -62,6 +65,9 @@ "MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING", "MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING", "MODEL_FOR_TABLE_QUESTION_ANSWERING_MAPPING", + "MODEL_FOR_TEXT_ENCODING_MAPPING", + "MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING", + "MODEL_FOR_TEXT_TO_SPECTROGRAM_MAPPING", "MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING", 
"MODEL_FOR_UNIVERSAL_SEGMENTATION_MAPPING", "MODEL_FOR_VIDEO_CLASSIFICATION_MAPPING", @@ -69,7 +75,10 @@ "MODEL_FOR_VISUAL_QUESTION_ANSWERING_MAPPING", "MODEL_MAPPING", "MODEL_WITH_LM_HEAD_MAPPING", + "MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING", "MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING", + "MODEL_FOR_TIME_SERIES_CLASSIFICATION_MAPPING", + "MODEL_FOR_TIME_SERIES_REGRESSION_MAPPING", "AutoModel", "AutoBackbone", "AutoModelForAudioClassification", @@ -80,7 +89,10 @@ "AutoModelForDepthEstimation", "AutoModelForImageClassification", "AutoModelForImageSegmentation", + "AutoModelForImageToImage", "AutoModelForInstanceSegmentation", + "AutoModelForMaskGeneration", + "AutoModelForTextEncoding", "AutoModelForMaskedImageModeling", "AutoModelForMaskedLM", "AutoModelForMultipleChoice", @@ -93,6 +105,8 @@ "AutoModelForSequenceClassification", "AutoModelForSpeechSeq2Seq", "AutoModelForTableQuestionAnswering", + "AutoModelForTextToSpectrogram", + "AutoModelForTextToWaveform", "AutoModelForTokenClassification", "AutoModelForUniversalSegmentation", "AutoModelForVideoClassification", @@ -100,6 +114,7 @@ "AutoModelForVisualQuestionAnswering", "AutoModelForDocumentQuestionAnswering", "AutoModelWithLMHead", + "AutoModelForZeroShotImageClassification", "AutoModelForZeroShotObjectDetection", ] @@ -110,8 +125,10 @@ pass else: _import_structure["modeling_tf_auto"] = [ + "TF_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING", "TF_MODEL_FOR_CAUSAL_LM_MAPPING", "TF_MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING", + "TF_MODEL_FOR_MASK_GENERATION_MAPPING", "TF_MODEL_FOR_MASKED_IMAGE_MODELING_MAPPING", "TF_MODEL_FOR_MASKED_LM_MAPPING", "TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING", @@ -124,14 +141,19 @@ "TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING", "TF_MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING", "TF_MODEL_FOR_TABLE_QUESTION_ANSWERING_MAPPING", + "TF_MODEL_FOR_TEXT_ENCODING_MAPPING", "TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING", "TF_MODEL_FOR_VISION_2_SEQ_MAPPING", + "TF_MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING", "TF_MODEL_MAPPING", "TF_MODEL_WITH_LM_HEAD_MAPPING", "TFAutoModel", + "TFAutoModelForAudioClassification", "TFAutoModelForCausalLM", "TFAutoModelForImageClassification", + "TFAutoModelForMaskedImageModeling", "TFAutoModelForMaskedLM", + "TFAutoModelForMaskGeneration", "TFAutoModelForMultipleChoice", "TFAutoModelForNextSentencePrediction", "TFAutoModelForPreTraining", @@ -142,8 +164,10 @@ "TFAutoModelForSequenceClassification", "TFAutoModelForSpeechSeq2Seq", "TFAutoModelForTableQuestionAnswering", + "TFAutoModelForTextEncoding", "TFAutoModelForTokenClassification", "TFAutoModelForVision2Seq", + "TFAutoModelForZeroShotImageClassification", "TFAutoModelWithLMHead", ] @@ -154,6 +178,7 @@ pass else: _import_structure["modeling_flax_auto"] = [ + "FLAX_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING", "FLAX_MODEL_FOR_CAUSAL_LM_MAPPING", "FLAX_MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING", "FLAX_MODEL_FOR_MASKED_LM_MAPPING", @@ -163,6 +188,7 @@ "FLAX_MODEL_FOR_QUESTION_ANSWERING_MAPPING", "FLAX_MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING", "FLAX_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING", + "FLAX_MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING", "FLAX_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING", "FLAX_MODEL_FOR_VISION_2_SEQ_MAPPING", "FLAX_MODEL_MAPPING", @@ -176,6 +202,7 @@ "FlaxAutoModelForQuestionAnswering", "FlaxAutoModelForSeq2SeqLM", "FlaxAutoModelForSequenceClassification", + "FlaxAutoModelForSpeechSeq2Seq", "FlaxAutoModelForTokenClassification", "FlaxAutoModelForVision2Seq", ] @@ -197,6 +224,7 @@ else: from .modeling_auto import ( 
MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING, + MODEL_FOR_AUDIO_FRAME_CLASSIFICATION_MAPPING, MODEL_FOR_AUDIO_XVECTOR_MAPPING, MODEL_FOR_BACKBONE_MAPPING, MODEL_FOR_CAUSAL_IMAGE_MODELING_MAPPING, @@ -206,7 +234,9 @@ MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING, MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING, MODEL_FOR_IMAGE_SEGMENTATION_MAPPING, + MODEL_FOR_IMAGE_TO_IMAGE_MAPPING, MODEL_FOR_INSTANCE_SEGMENTATION_MAPPING, + MODEL_FOR_MASK_GENERATION_MAPPING, MODEL_FOR_MASKED_IMAGE_MODELING_MAPPING, MODEL_FOR_MASKED_LM_MAPPING, MODEL_FOR_MULTIPLE_CHOICE_MAPPING, @@ -219,11 +249,17 @@ MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING, MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING, MODEL_FOR_TABLE_QUESTION_ANSWERING_MAPPING, + MODEL_FOR_TEXT_ENCODING_MAPPING, + MODEL_FOR_TEXT_TO_SPECTROGRAM_MAPPING, + MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING, + MODEL_FOR_TIME_SERIES_CLASSIFICATION_MAPPING, + MODEL_FOR_TIME_SERIES_REGRESSION_MAPPING, MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING, MODEL_FOR_UNIVERSAL_SEGMENTATION_MAPPING, MODEL_FOR_VIDEO_CLASSIFICATION_MAPPING, MODEL_FOR_VISION_2_SEQ_MAPPING, MODEL_FOR_VISUAL_QUESTION_ANSWERING_MAPPING, + MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING, MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING, MODEL_MAPPING, MODEL_WITH_LM_HEAD_MAPPING, @@ -238,9 +274,11 @@ AutoModelForDocumentQuestionAnswering, AutoModelForImageClassification, AutoModelForImageSegmentation, + AutoModelForImageToImage, AutoModelForInstanceSegmentation, AutoModelForMaskedImageModeling, AutoModelForMaskedLM, + AutoModelForMaskGeneration, AutoModelForMultipleChoice, AutoModelForNextSentencePrediction, AutoModelForObjectDetection, @@ -251,11 +289,15 @@ AutoModelForSequenceClassification, AutoModelForSpeechSeq2Seq, AutoModelForTableQuestionAnswering, + AutoModelForTextEncoding, + AutoModelForTextToSpectrogram, + AutoModelForTextToWaveform, AutoModelForTokenClassification, AutoModelForUniversalSegmentation, AutoModelForVideoClassification, AutoModelForVision2Seq, AutoModelForVisualQuestionAnswering, + AutoModelForZeroShotImageClassification, AutoModelForZeroShotObjectDetection, AutoModelWithLMHead, ) @@ -267,9 +309,11 @@ pass else: from .modeling_tf_auto import ( + TF_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING, TF_MODEL_FOR_CAUSAL_LM_MAPPING, TF_MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING, TF_MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING, + TF_MODEL_FOR_MASK_GENERATION_MAPPING, TF_MODEL_FOR_MASKED_IMAGE_MODELING_MAPPING, TF_MODEL_FOR_MASKED_LM_MAPPING, TF_MODEL_FOR_MULTIPLE_CHOICE_MAPPING, @@ -281,15 +325,20 @@ TF_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING, TF_MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING, TF_MODEL_FOR_TABLE_QUESTION_ANSWERING_MAPPING, + TF_MODEL_FOR_TEXT_ENCODING_MAPPING, TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING, TF_MODEL_FOR_VISION_2_SEQ_MAPPING, + TF_MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING, TF_MODEL_MAPPING, TF_MODEL_WITH_LM_HEAD_MAPPING, TFAutoModel, + TFAutoModelForAudioClassification, TFAutoModelForCausalLM, TFAutoModelForDocumentQuestionAnswering, TFAutoModelForImageClassification, + TFAutoModelForMaskedImageModeling, TFAutoModelForMaskedLM, + TFAutoModelForMaskGeneration, TFAutoModelForMultipleChoice, TFAutoModelForNextSentencePrediction, TFAutoModelForPreTraining, @@ -299,8 +348,10 @@ TFAutoModelForSequenceClassification, TFAutoModelForSpeechSeq2Seq, TFAutoModelForTableQuestionAnswering, + TFAutoModelForTextEncoding, TFAutoModelForTokenClassification, TFAutoModelForVision2Seq, + TFAutoModelForZeroShotImageClassification, TFAutoModelWithLMHead, ) @@ -311,6 +362,7 @@ pass else: from .modeling_flax_auto import ( + 
FLAX_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING, FLAX_MODEL_FOR_CAUSAL_LM_MAPPING, FLAX_MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING, FLAX_MODEL_FOR_MASKED_LM_MAPPING, @@ -320,6 +372,7 @@ FLAX_MODEL_FOR_QUESTION_ANSWERING_MAPPING, FLAX_MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING, FLAX_MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING, + FLAX_MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING, FLAX_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING, FLAX_MODEL_FOR_VISION_2_SEQ_MAPPING, FLAX_MODEL_MAPPING, @@ -333,6 +386,7 @@ FlaxAutoModelForQuestionAnswering, FlaxAutoModelForSeq2SeqLM, FlaxAutoModelForSequenceClassification, + FlaxAutoModelForSpeechSeq2Seq, FlaxAutoModelForTokenClassification, FlaxAutoModelForVision2Seq, ) diff --git a/src/transformers/models/auto/auto_factory.py b/src/transformers/models/auto/auto_factory.py index eb87bb1ff7dbdd..ce7884d2ef120e 100644 --- a/src/transformers/models/auto/auto_factory.py +++ b/src/transformers/models/auto/auto_factory.py @@ -15,11 +15,23 @@ """Factory function to build auto-model classes.""" import copy import importlib +import json +import os +import warnings from collections import OrderedDict from ...configuration_utils import PretrainedConfig -from ...dynamic_module_utils import get_class_from_dynamic_module -from ...utils import copy_func, logging +from ...dynamic_module_utils import get_class_from_dynamic_module, resolve_trust_remote_code +from ...utils import ( + CONFIG_NAME, + cached_file, + copy_func, + extract_commit_hash, + find_adapter_config_file, + is_peft_available, + logging, + requires_backends, +) from .configuration_auto import AutoConfig, model_type_to_module_name, replace_list_option_in_docstrings @@ -75,8 +87,6 @@ Can be either: - A string, the *model id* of a pretrained model hosted inside a model repo on huggingface.co. - Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced under a - user or organization name, like `dbmdz/bert-base-german-cased`. - A path to a *directory* containing model weights saved using [`~PreTrainedModel.save_pretrained`], e.g., `./my_model_directory/`. - A path or url to a *tensorflow index checkpoint file* (e.g, `./tf_model/model.ckpt.index`). In @@ -128,6 +138,11 @@ Whether or not to allow for custom models defined on the Hub in their own modeling files. This option should only be set to `True` for repositories you trust and in which you have read the code, as it will execute code present on the Hub on your local machine. + code_revision (`str`, *optional*, defaults to `"main"`): + The specific revision to use for the code on the Hub, if the code leaves in a different repository than + the rest of the model. It can be a branch name, a tag name, or a commit id, since we use a git-based + system for storing models and other artifacts on huggingface.co, so `revision` can be any identifier + allowed by git. kwargs (additional keyword arguments, *optional*): Can be used to update the configuration object (after it being loaded) and initiate the model (e.g., `output_attentions=True`). Behaves differently depending on whether a `config` is provided or @@ -177,8 +192,6 @@ Can be either: - A string, the *model id* of a pretrained model hosted inside a model repo on huggingface.co. - Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced under a - user or organization name, like `dbmdz/bert-base-german-cased`. - A path to a *directory* containing model weights saved using [`~PreTrainedModel.save_pretrained`], e.g., `./my_model_directory/`. 
- A path or url to a *PyTorch state_dict save file* (e.g, `./pt_model/pytorch_model.bin`). In this @@ -224,6 +237,11 @@ Whether or not to allow for custom models defined on the Hub in their own modeling files. This option should only be set to `True` for repositories you trust and in which you have read the code, as it will execute code present on the Hub on your local machine. + code_revision (`str`, *optional*, defaults to `"main"`): + The specific revision to use for the code on the Hub, if the code leaves in a different repository than + the rest of the model. It can be a branch name, a tag name, or a commit id, since we use a git-based + system for storing models and other artifacts on huggingface.co, so `revision` can be any identifier + allowed by git. kwargs (additional keyword arguments, *optional*): Can be used to update the configuration object (after it being loaded) and initiate the model (e.g., `output_attentions=True`). Behaves differently depending on whether a `config` is provided or @@ -273,8 +291,6 @@ Can be either: - A string, the *model id* of a pretrained model hosted inside a model repo on huggingface.co. - Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced under a - user or organization name, like `dbmdz/bert-base-german-cased`. - A path to a *directory* containing model weights saved using [`~PreTrainedModel.save_pretrained`], e.g., `./my_model_directory/`. - A path or url to a *PyTorch state_dict save file* (e.g, `./pt_model/pytorch_model.bin`). In this @@ -320,6 +336,11 @@ Whether or not to allow for custom models defined on the Hub in their own modeling files. This option should only be set to `True` for repositories you trust and in which you have read the code, as it will execute code present on the Hub on your local machine. + code_revision (`str`, *optional*, defaults to `"main"`): + The specific revision to use for the code on the Hub, if the code leaves in a different repository than + the rest of the model. It can be a branch name, a tag name, or a commit id, since we use a git-based + system for storing models and other artifacts on huggingface.co, so `revision` can be any identifier + allowed by git. kwargs (additional keyword arguments, *optional*): Can be used to update the configuration object (after it being loaded) and initiate the model (e.g., `output_attentions=True`). Behaves differently depending on whether a `config` is provided or @@ -389,22 +410,25 @@ def __init__(self, *args, **kwargs): @classmethod def from_config(cls, config, **kwargs): - trust_remote_code = kwargs.pop("trust_remote_code", False) - if hasattr(config, "auto_map") and cls.__name__ in config.auto_map: - if not trust_remote_code: - raise ValueError( - "Loading this model requires you to execute the modeling file in that repo " - "on your local machine. Make sure you have read the code there to avoid malicious use, then set " - "the option `trust_remote_code=True` to remove this error." - ) - if kwargs.get("revision", None) is None: - logger.warning( - "Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure " - "no malicious code has been contributed in a newer revision." 
- ) + trust_remote_code = kwargs.pop("trust_remote_code", None) + has_remote_code = hasattr(config, "auto_map") and cls.__name__ in config.auto_map + has_local_code = type(config) in cls._model_mapping.keys() + trust_remote_code = resolve_trust_remote_code( + trust_remote_code, config._name_or_path, has_local_code, has_remote_code + ) + + if has_remote_code and trust_remote_code: class_ref = config.auto_map[cls.__name__] - module_file, class_name = class_ref.split(".") - model_class = get_class_from_dynamic_module(config.name_or_path, module_file + ".py", class_name, **kwargs) + if "--" in class_ref: + repo_id, class_ref = class_ref.split("--") + else: + repo_id = config.name_or_path + model_class = get_class_from_dynamic_module(class_ref, repo_id, **kwargs) + if os.path.isdir(config._name_or_path): + model_class.register_for_auto_class(cls.__name__) + else: + cls.register(config.__class__, model_class, exist_ok=True) + _ = kwargs.pop("code_revision", None) return model_class._from_config(config, **kwargs) elif type(config) in cls._model_mapping.keys(): model_class = _get_model_class(config, cls._model_mapping) @@ -418,7 +442,7 @@ def from_config(cls, config, **kwargs): @classmethod def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs): config = kwargs.pop("config", None) - trust_remote_code = kwargs.pop("trust_remote_code", False) + trust_remote_code = kwargs.pop("trust_remote_code", None) kwargs["_from_auto"] = True hub_kwargs_names = [ "cache_dir", @@ -429,40 +453,106 @@ def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs): "revision", "subfolder", "use_auth_token", + "token", ] hub_kwargs = {name: kwargs.pop(name) for name in hub_kwargs_names if name in kwargs} + code_revision = kwargs.pop("code_revision", None) + commit_hash = kwargs.pop("_commit_hash", None) + adapter_kwargs = kwargs.pop("adapter_kwargs", None) + + token = hub_kwargs.pop("token", None) + use_auth_token = hub_kwargs.pop("use_auth_token", None) + if use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.", + FutureWarning, + ) + if token is not None: + raise ValueError( + "`token` and `use_auth_token` are both specified. Please set only the argument `token`." 
+ ) + token = use_auth_token + + if token is not None: + hub_kwargs["token"] = token + + if commit_hash is None: + if not isinstance(config, PretrainedConfig): + # We make a call to the config file first (which may be absent) to get the commit hash as soon as possible + resolved_config_file = cached_file( + pretrained_model_name_or_path, + CONFIG_NAME, + _raise_exceptions_for_gated_repo=False, + _raise_exceptions_for_missing_entries=False, + _raise_exceptions_for_connection_errors=False, + **hub_kwargs, + ) + commit_hash = extract_commit_hash(resolved_config_file, commit_hash) + else: + commit_hash = getattr(config, "_commit_hash", None) + + if is_peft_available(): + if adapter_kwargs is None: + adapter_kwargs = {} + if token is not None: + adapter_kwargs["token"] = token + + maybe_adapter_path = find_adapter_config_file( + pretrained_model_name_or_path, _commit_hash=commit_hash, **adapter_kwargs + ) + + if maybe_adapter_path is not None: + with open(maybe_adapter_path, "r", encoding="utf-8") as f: + adapter_config = json.load(f) + + adapter_kwargs["_adapter_model_path"] = pretrained_model_name_or_path + pretrained_model_name_or_path = adapter_config["base_model_name_or_path"] + if not isinstance(config, PretrainedConfig): - kwargs_copy = copy.deepcopy(kwargs) + kwargs_orig = copy.deepcopy(kwargs) # ensure not to pollute the config object with torch_dtype="auto" - since it's # meaningless in the context of the config object - torch.dtype values are acceptable - if kwargs_copy.get("torch_dtype", None) == "auto": - _ = kwargs_copy.pop("torch_dtype") + if kwargs.get("torch_dtype", None) == "auto": + _ = kwargs.pop("torch_dtype") + # to not overwrite the quantization_config if config has a quantization_config + if kwargs.get("quantization_config", None) is not None: + _ = kwargs.pop("quantization_config") config, kwargs = AutoConfig.from_pretrained( pretrained_model_name_or_path, return_unused_kwargs=True, trust_remote_code=trust_remote_code, + code_revision=code_revision, + _commit_hash=commit_hash, **hub_kwargs, - **kwargs_copy, + **kwargs, ) - if hasattr(config, "auto_map") and cls.__name__ in config.auto_map: - if not trust_remote_code: - raise ValueError( - f"Loading {pretrained_model_name_or_path} requires you to execute the modeling file in that repo " - "on your local machine. Make sure you have read the code there to avoid malicious use, then set " - "the option `trust_remote_code=True` to remove this error." - ) - if hub_kwargs.get("revision", None) is None: - logger.warning( - "Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure " - "no malicious code has been contributed in a newer revision." 
- ) + + # if torch_dtype=auto was passed here, ensure to pass it on + if kwargs_orig.get("torch_dtype", None) == "auto": + kwargs["torch_dtype"] = "auto" + if kwargs_orig.get("quantization_config", None) is not None: + kwargs["quantization_config"] = kwargs_orig["quantization_config"] + + has_remote_code = hasattr(config, "auto_map") and cls.__name__ in config.auto_map + has_local_code = type(config) in cls._model_mapping.keys() + trust_remote_code = resolve_trust_remote_code( + trust_remote_code, pretrained_model_name_or_path, has_local_code, has_remote_code + ) + + # Set the adapter kwargs + kwargs["adapter_kwargs"] = adapter_kwargs + + if has_remote_code and trust_remote_code: class_ref = config.auto_map[cls.__name__] - module_file, class_name = class_ref.split(".") model_class = get_class_from_dynamic_module( - pretrained_model_name_or_path, module_file + ".py", class_name, **hub_kwargs, **kwargs + class_ref, pretrained_model_name_or_path, code_revision=code_revision, **hub_kwargs, **kwargs ) - model_class.register_for_auto_class(cls.__name__) + _ = hub_kwargs.pop("code_revision", None) + if os.path.isdir(pretrained_model_name_or_path): + model_class.register_for_auto_class(cls.__name__) + else: + cls.register(config.__class__, model_class, exist_ok=True) return model_class.from_pretrained( pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs ) @@ -477,7 +567,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs): ) @classmethod - def register(cls, config_class, model_class): + def register(cls, config_class, model_class, exist_ok=False): """ Register a new model for this class. @@ -493,7 +583,46 @@ def register(cls, config_class, model_class): f"config class you passed (model has {model_class.config_class} and you passed {config_class}. Fix " "one of those so they match!" ) - cls._model_mapping.register(config_class, model_class) + cls._model_mapping.register(config_class, model_class, exist_ok=exist_ok) + + +class _BaseAutoBackboneClass(_BaseAutoModelClass): + # Base class for auto backbone models. 
+ _model_mapping = None + + @classmethod + def _load_timm_backbone_from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs): + requires_backends(cls, ["vision", "timm"]) + from ...models.timm_backbone import TimmBackboneConfig + + config = kwargs.pop("config", TimmBackboneConfig()) + + if kwargs.get("out_features", None) is not None: + raise ValueError("Cannot specify `out_features` for timm backbones") + + if kwargs.get("output_loading_info", False): + raise ValueError("Cannot specify `output_loading_info=True` when loading from timm") + + num_channels = kwargs.pop("num_channels", config.num_channels) + features_only = kwargs.pop("features_only", config.features_only) + use_pretrained_backbone = kwargs.pop("use_pretrained_backbone", config.use_pretrained_backbone) + out_indices = kwargs.pop("out_indices", config.out_indices) + config = TimmBackboneConfig( + backbone=pretrained_model_name_or_path, + num_channels=num_channels, + features_only=features_only, + use_pretrained_backbone=use_pretrained_backbone, + out_indices=out_indices, + ) + return super().from_config(config, **kwargs) + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs): + use_timm_backbone = kwargs.pop("use_timm_backbone", False) + if use_timm_backbone: + return cls._load_timm_backbone_from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) + + return super().from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs) def insert_head_doc(docstring, head_doc=""): @@ -507,7 +636,7 @@ def insert_head_doc(docstring, head_doc=""): ) -def auto_class_update(cls, checkpoint_for_example="bert-base-cased", head_doc=""): +def auto_class_update(cls, checkpoint_for_example="google-bert/bert-base-cased", head_doc=""): # Create a new class with the right name from the base class model_mapping = cls._model_mapping name = cls.__name__ @@ -586,6 +715,7 @@ def __init__(self, config_mapping, model_mapping): self._config_mapping = config_mapping self._reverse_config_mapping = {v: k for k, v in config_mapping.items()} self._model_mapping = model_mapping + self._model_mapping._model_mapping = self self._extra_content = {} self._modules = {} @@ -662,13 +792,13 @@ def __contains__(self, item): model_type = self._reverse_config_mapping[item.__name__] return model_type in self._model_mapping - def register(self, key, value): + def register(self, key, value, exist_ok=False): """ Register a new model in this mapping. """ if hasattr(key, "__name__") and key.__name__ in self._reverse_config_mapping: model_type = self._reverse_config_mapping[key.__name__] - if model_type in self._model_mapping.keys(): + if model_type in self._model_mapping.keys() and not exist_ok: raise ValueError(f"'{key}' is already used by a Transformers model.") self._extra_content[key] = value diff --git a/src/transformers/models/auto/configuration_auto.py b/src/transformers/models/auto/configuration_auto.py index 567ae7caddeeed..6868175b2a7060 100755 --- a/src/transformers/models/auto/configuration_auto.py +++ b/src/transformers/models/auto/configuration_auto.py @@ -14,13 +14,14 @@ # limitations under the License. 
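`_BaseAutoBackboneClass` above adds a timm escape hatch to `AutoBackbone.from_pretrained`. A hedged sketch, assuming `timm` and the vision extras are installed and that `resnet18` is a valid timm architecture name:

```python
# Sketch: routing AutoBackbone through the new TimmBackboneConfig path.
import torch
from transformers import AutoBackbone

backbone = AutoBackbone.from_pretrained(
    "resnet18",                  # treated as a timm architecture name rather than a Hub repo id
    use_timm_backbone=True,
    use_pretrained_backbone=True,
    out_indices=(2, 3, 4),
)
outputs = backbone(torch.randn(1, 3, 224, 224))
print([f.shape for f in outputs.feature_maps])
```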
""" Auto Config class.""" import importlib +import os import re import warnings from collections import OrderedDict from typing import List, Union from ...configuration_utils import PretrainedConfig -from ...dynamic_module_utils import get_class_from_dynamic_module +from ...dynamic_module_utils import get_class_from_dynamic_module, resolve_trust_remote_code from ...utils import CONFIG_NAME, logging @@ -30,8 +31,11 @@ [ # Add configs here ("albert", "AlbertConfig"), + ("align", "AlignConfig"), ("altclip", "AltCLIPConfig"), ("audio-spectrogram-transformer", "ASTConfig"), + ("autoformer", "AutoformerConfig"), + ("bark", "BarkConfig"), ("bart", "BartConfig"), ("beit", "BeitConfig"), ("bert", "BertConfig"), @@ -46,16 +50,23 @@ ("blip-2", "Blip2Config"), ("bloom", "BloomConfig"), ("bridgetower", "BridgeTowerConfig"), + ("bros", "BrosConfig"), ("camembert", "CamembertConfig"), ("canine", "CanineConfig"), ("chinese_clip", "ChineseCLIPConfig"), + ("chinese_clip_vision_model", "ChineseCLIPVisionConfig"), ("clap", "ClapConfig"), ("clip", "CLIPConfig"), + ("clip_vision_model", "CLIPVisionConfig"), ("clipseg", "CLIPSegConfig"), + ("clvp", "ClvpConfig"), + ("code_llama", "LlamaConfig"), ("codegen", "CodeGenConfig"), ("conditional_detr", "ConditionalDetrConfig"), ("convbert", "ConvBertConfig"), ("convnext", "ConvNextConfig"), + ("convnextv2", "ConvNextV2Config"), + ("cpmant", "CpmAntConfig"), ("ctrl", "CTRLConfig"), ("cvt", "CvtConfig"), ("data2vec-audio", "Data2VecAudioConfig"), @@ -66,44 +77,60 @@ ("decision_transformer", "DecisionTransformerConfig"), ("deformable_detr", "DeformableDetrConfig"), ("deit", "DeiTConfig"), + ("depth_anything", "DepthAnythingConfig"), ("deta", "DetaConfig"), ("detr", "DetrConfig"), ("dinat", "DinatConfig"), + ("dinov2", "Dinov2Config"), ("distilbert", "DistilBertConfig"), ("donut-swin", "DonutSwinConfig"), ("dpr", "DPRConfig"), ("dpt", "DPTConfig"), ("efficientformer", "EfficientFormerConfig"), + ("efficientnet", "EfficientNetConfig"), ("electra", "ElectraConfig"), + ("encodec", "EncodecConfig"), ("encoder-decoder", "EncoderDecoderConfig"), ("ernie", "ErnieConfig"), ("ernie_m", "ErnieMConfig"), ("esm", "EsmConfig"), + ("falcon", "FalconConfig"), + ("fastspeech2_conformer", "FastSpeech2ConformerConfig"), ("flaubert", "FlaubertConfig"), ("flava", "FlavaConfig"), ("fnet", "FNetConfig"), + ("focalnet", "FocalNetConfig"), ("fsmt", "FSMTConfig"), ("funnel", "FunnelConfig"), + ("fuyu", "FuyuConfig"), ("git", "GitConfig"), ("glpn", "GLPNConfig"), ("gpt-sw3", "GPT2Config"), ("gpt2", "GPT2Config"), + ("gpt_bigcode", "GPTBigCodeConfig"), ("gpt_neo", "GPTNeoConfig"), ("gpt_neox", "GPTNeoXConfig"), ("gpt_neox_japanese", "GPTNeoXJapaneseConfig"), ("gptj", "GPTJConfig"), + ("gptsan-japanese", "GPTSanJapaneseConfig"), ("graphormer", "GraphormerConfig"), ("groupvit", "GroupViTConfig"), ("hubert", "HubertConfig"), ("ibert", "IBertConfig"), + ("idefics", "IdeficsConfig"), ("imagegpt", "ImageGPTConfig"), + ("informer", "InformerConfig"), + ("instructblip", "InstructBlipConfig"), ("jukebox", "JukeboxConfig"), + ("kosmos-2", "Kosmos2Config"), ("layoutlm", "LayoutLMConfig"), ("layoutlmv2", "LayoutLMv2Config"), ("layoutlmv3", "LayoutLMv3Config"), ("led", "LEDConfig"), ("levit", "LevitConfig"), ("lilt", "LiltConfig"), + ("llama", "LlamaConfig"), + ("llava", "LlavaConfig"), ("longformer", "LongformerConfig"), ("longt5", "LongT5Config"), ("luke", "LukeConfig"), @@ -116,28 +143,48 @@ ("maskformer-swin", "MaskFormerSwinConfig"), ("mbart", "MBartConfig"), ("mctct", "MCTCTConfig"), + ("mega", 
"MegaConfig"), ("megatron-bert", "MegatronBertConfig"), + ("mgp-str", "MgpstrConfig"), + ("mistral", "MistralConfig"), + ("mixtral", "MixtralConfig"), ("mobilebert", "MobileBertConfig"), ("mobilenet_v1", "MobileNetV1Config"), ("mobilenet_v2", "MobileNetV2Config"), ("mobilevit", "MobileViTConfig"), + ("mobilevitv2", "MobileViTV2Config"), ("mpnet", "MPNetConfig"), + ("mpt", "MptConfig"), + ("mra", "MraConfig"), ("mt5", "MT5Config"), + ("musicgen", "MusicgenConfig"), ("mvp", "MvpConfig"), ("nat", "NatConfig"), ("nezha", "NezhaConfig"), + ("nllb-moe", "NllbMoeConfig"), + ("nougat", "VisionEncoderDecoderConfig"), ("nystromformer", "NystromformerConfig"), ("oneformer", "OneFormerConfig"), + ("open-llama", "OpenLlamaConfig"), ("openai-gpt", "OpenAIGPTConfig"), ("opt", "OPTConfig"), + ("owlv2", "Owlv2Config"), ("owlvit", "OwlViTConfig"), + ("patchtsmixer", "PatchTSMixerConfig"), + ("patchtst", "PatchTSTConfig"), ("pegasus", "PegasusConfig"), ("pegasus_x", "PegasusXConfig"), ("perceiver", "PerceiverConfig"), + ("persimmon", "PersimmonConfig"), + ("phi", "PhiConfig"), + ("pix2struct", "Pix2StructConfig"), ("plbart", "PLBartConfig"), ("poolformer", "PoolFormerConfig"), + ("pop2piano", "Pop2PianoConfig"), ("prophetnet", "ProphetNetConfig"), + ("pvt", "PvtConfig"), ("qdqbert", "QDQBertConfig"), + ("qwen2", "Qwen2Config"), ("rag", "RagConfig"), ("realm", "RealmConfig"), ("reformer", "ReformerConfig"), @@ -149,15 +196,23 @@ ("roberta-prelayernorm", "RobertaPreLayerNormConfig"), ("roc_bert", "RoCBertConfig"), ("roformer", "RoFormerConfig"), + ("rwkv", "RwkvConfig"), + ("sam", "SamConfig"), + ("seamless_m4t", "SeamlessM4TConfig"), + ("seamless_m4t_v2", "SeamlessM4Tv2Config"), ("segformer", "SegformerConfig"), ("sew", "SEWConfig"), ("sew-d", "SEWDConfig"), + ("siglip", "SiglipConfig"), + ("siglip_vision_model", "SiglipVisionConfig"), ("speech-encoder-decoder", "SpeechEncoderDecoderConfig"), ("speech_to_text", "Speech2TextConfig"), ("speech_to_text_2", "Speech2Text2Config"), ("speecht5", "SpeechT5Config"), ("splinter", "SplinterConfig"), ("squeezebert", "SqueezeBertConfig"), + ("stablelm", "StableLmConfig"), + ("swiftformer", "SwiftFormerConfig"), ("swin", "SwinConfig"), ("swin2sr", "Swin2SRConfig"), ("swinv2", "Swinv2Config"), @@ -167,16 +222,21 @@ ("tapas", "TapasConfig"), ("time_series_transformer", "TimeSeriesTransformerConfig"), ("timesformer", "TimesformerConfig"), + ("timm_backbone", "TimmBackboneConfig"), ("trajectory_transformer", "TrajectoryTransformerConfig"), ("transfo-xl", "TransfoXLConfig"), ("trocr", "TrOCRConfig"), ("tvlt", "TvltConfig"), + ("tvp", "TvpConfig"), + ("umt5", "UMT5Config"), ("unispeech", "UniSpeechConfig"), ("unispeech-sat", "UniSpeechSatConfig"), + ("univnet", "UnivNetConfig"), ("upernet", "UperNetConfig"), ("van", "VanConfig"), ("videomae", "VideoMAEConfig"), ("vilt", "ViltConfig"), + ("vipllava", "VipLlavaConfig"), ("vision-encoder-decoder", "VisionEncoderDecoderConfig"), ("vision-text-dual-encoder", "VisionTextDualEncoderConfig"), ("visual_bert", "VisualBertConfig"), @@ -184,7 +244,12 @@ ("vit_hybrid", "ViTHybridConfig"), ("vit_mae", "ViTMAEConfig"), ("vit_msn", "ViTMSNConfig"), + ("vitdet", "VitDetConfig"), + ("vitmatte", "VitMatteConfig"), + ("vits", "VitsConfig"), + ("vivit", "VivitConfig"), ("wav2vec2", "Wav2Vec2Config"), + ("wav2vec2-bert", "Wav2Vec2BertConfig"), ("wav2vec2-conformer", "Wav2Vec2ConformerConfig"), ("wavlm", "WavLMConfig"), ("whisper", "WhisperConfig"), @@ -205,8 +270,11 @@ [ # Add archive maps here) ("albert", "ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), + 
("align", "ALIGN_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("altclip", "ALTCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("audio-spectrogram-transformer", "AUDIO_SPECTROGRAM_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("autoformer", "AUTOFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("bark", "BARK_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("bart", "BART_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("beit", "BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("bert", "BERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), @@ -220,16 +288,20 @@ ("blip-2", "BLIP_2_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("bloom", "BLOOM_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("bridgetower", "BRIDGETOWER_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("bros", "BROS_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("camembert", "CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("canine", "CANINE_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("chinese_clip", "CHINESE_CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("clap", "CLAP_PRETRAINED_MODEL_ARCHIVE_LIST"), ("clip", "CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("clipseg", "CLIPSEG_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("clvp", "CLVP_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("codegen", "CODEGEN_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("conditional_detr", "CONDITIONAL_DETR_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("convbert", "CONVBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("convnext", "CONVNEXT_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("convnextv2", "CONVNEXTV2_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("cpmant", "CPMANT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("ctrl", "CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("cvt", "CVT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("data2vec-audio", "DATA2VEC_AUDIO_PRETRAINED_CONFIG_ARCHIVE_MAP"), @@ -239,42 +311,58 @@ ("deberta-v2", "DEBERTA_V2_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("deformable_detr", "DEFORMABLE_DETR_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("deit", "DEIT_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("depth_anything", "DEPTH_ANYTHING_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("deta", "DETA_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("detr", "DETR_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("dinat", "DINAT_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("dinov2", "DINOV2_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("distilbert", "DISTILBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("donut-swin", "DONUT_SWIN_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("dpr", "DPR_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("dpt", "DPT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("efficientformer", "EFFICIENTFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("efficientnet", "EFFICIENTNET_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("electra", "ELECTRA_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("encodec", "ENCODEC_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("ernie", "ERNIE_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("ernie_m", "ERNIE_M_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("esm", "ESM_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("falcon", "FALCON_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("fastspeech2_conformer", "FASTSPEECH2_CONFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("flaubert", "FLAUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("flava", "FLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("fnet", "FNET_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("focalnet", "FOCALNET_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("fsmt", "FSMT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("funnel", "FUNNEL_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("fuyu", "FUYU_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("git", "GIT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("glpn", "GLPN_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("gpt2", "GPT2_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("gpt_bigcode", "GPT_BIGCODE_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("gpt_neo", "GPT_NEO_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("gpt_neox", "GPT_NEOX_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("gpt_neox_japanese", "GPT_NEOX_JAPANESE_PRETRAINED_CONFIG_ARCHIVE_MAP"), 
("gptj", "GPTJ_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("gptsan-japanese", "GPTSAN_JAPANESE_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("graphormer", "GRAPHORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("groupvit", "GROUPVIT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("hubert", "HUBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("ibert", "IBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("idefics", "IDEFICS_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("imagegpt", "IMAGEGPT_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("informer", "INFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("instructblip", "INSTRUCTBLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("jukebox", "JUKEBOX_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("kosmos-2", "KOSMOS2_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("layoutlm", "LAYOUTLM_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("layoutlmv2", "LAYOUTLMV2_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("layoutlmv3", "LAYOUTLMV3_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("led", "LED_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("levit", "LEVIT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("lilt", "LILT_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("llama", "LLAMA_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("llava", "LLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("longformer", "LONGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("longt5", "LONGT5_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("luke", "LUKE_PRETRAINED_CONFIG_ARCHIVE_MAP"), @@ -285,26 +373,45 @@ ("maskformer", "MASKFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("mbart", "MBART_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("mctct", "MCTCT_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("mega", "MEGA_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("megatron-bert", "MEGATRON_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("mgp-str", "MGP_STR_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("mistral", "MISTRAL_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("mixtral", "MIXTRAL_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("mobilenet_v1", "MOBILENET_V1_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("mobilenet_v2", "MOBILENET_V2_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("mobilevit", "MOBILEVIT_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("mobilevitv2", "MOBILEVITV2_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("mpnet", "MPNET_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("mpt", "MPT_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("mra", "MRA_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("musicgen", "MUSICGEN_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("mvp", "MVP_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("nat", "NAT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("nezha", "NEZHA_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("nllb-moe", "NLLB_MOE_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("nystromformer", "NYSTROMFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("oneformer", "ONEFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("open-llama", "OPEN_LLAMA_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("openai-gpt", "OPENAI_GPT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("opt", "OPT_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("owlv2", "OWLV2_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("owlvit", "OWLVIT_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("patchtsmixer", "PATCHTSMIXER_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("patchtst", "PATCHTST_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("pegasus", "PEGASUS_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("pegasus_x", "PEGASUS_X_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("perceiver", "PERCEIVER_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("persimmon", "PERSIMMON_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("phi", "PHI_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("pix2struct", "PIX2STRUCT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("plbart", "PLBART_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("poolformer", "POOLFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("pop2piano", "POP2PIANO_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("prophetnet", "PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("pvt", "PVT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("qdqbert", 
"QDQBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("qwen2", "QWEN2_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("realm", "REALM_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("regnet", "REGNET_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("rembert", "REMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), @@ -314,14 +421,21 @@ ("roberta-prelayernorm", "ROBERTA_PRELAYERNORM_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("roc_bert", "ROC_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("roformer", "ROFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("rwkv", "RWKV_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("sam", "SAM_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("seamless_m4t", "SEAMLESS_M4T_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("seamless_m4t_v2", "SEAMLESS_M4T_V2_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("segformer", "SEGFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("sew", "SEW_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("sew-d", "SEW_D_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("siglip", "SIGLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("speech_to_text", "SPEECH_TO_TEXT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("speech_to_text_2", "SPEECH_TO_TEXT_2_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("speecht5", "SPEECHT5_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("splinter", "SPLINTER_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("squeezebert", "SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("stablelm", "STABLELM_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("swiftformer", "SWIFTFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("swin", "SWIN_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("swin2sr", "SWIN2SR_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("swinv2", "SWINV2_PRETRAINED_CONFIG_ARCHIVE_MAP"), @@ -333,17 +447,25 @@ ("timesformer", "TIMESFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("transfo-xl", "TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("tvlt", "TVLT_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("tvp", "TVP_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("unispeech", "UNISPEECH_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("unispeech-sat", "UNISPEECH_SAT_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("univnet", "UNIVNET_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("van", "VAN_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("videomae", "VIDEOMAE_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("vilt", "VILT_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("vipllava", "VIPLLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("visual_bert", "VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("vit", "VIT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("vit_hybrid", "VIT_HYBRID_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("vit_mae", "VIT_MAE_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("vit_msn", "VIT_MSN_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("vitdet", "VITDET_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("vitmatte", "VITMATTE_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("vits", "VITS_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("vivit", "VIVIT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("wav2vec2", "WAV_2_VEC_2_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("wav2vec2-bert", "WAV2VEC2_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("wav2vec2-conformer", "WAV2VEC2_CONFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("whisper", "WHISPER_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("xclip", "XCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP"), @@ -362,8 +484,11 @@ [ # Add full (and cased) model names here ("albert", "ALBERT"), + ("align", "ALIGN"), ("altclip", "AltCLIP"), ("audio-spectrogram-transformer", "Audio Spectrogram Transformer"), + ("autoformer", "Autoformer"), + ("bark", "Bark"), ("bart", "BART"), ("barthez", "BARThez"), ("bartpho", "BARTpho"), @@ -383,18 +508,25 @@ ("bloom", "BLOOM"), ("bort", "BORT"), ("bridgetower", "BridgeTower"), + ("bros", "BROS"), ("byt5", "ByT5"), ("camembert", "CamemBERT"), ("canine", "CANINE"), ("chinese_clip", "Chinese-CLIP"), + ("chinese_clip_vision_model", "ChineseCLIPVisionModel"), ("clap", "CLAP"), ("clip", "CLIP"), + 
("clip_vision_model", "CLIPVisionModel"), ("clipseg", "CLIPSeg"), + ("clvp", "CLVP"), + ("code_llama", "CodeLlama"), ("codegen", "CodeGen"), ("conditional_detr", "Conditional DETR"), ("convbert", "ConvBERT"), ("convnext", "ConvNeXT"), + ("convnextv2", "ConvNeXTV2"), ("cpm", "CPM"), + ("cpmant", "CPM-Ant"), ("ctrl", "CTRL"), ("cvt", "CvT"), ("data2vec-audio", "Data2VecAudio"), @@ -405,42 +537,58 @@ ("decision_transformer", "Decision Transformer"), ("deformable_detr", "Deformable DETR"), ("deit", "DeiT"), + ("deplot", "DePlot"), + ("depth_anything", "Depth Anything"), ("deta", "DETA"), ("detr", "DETR"), ("dialogpt", "DialoGPT"), ("dinat", "DiNAT"), + ("dinov2", "DINOv2"), ("distilbert", "DistilBERT"), ("dit", "DiT"), ("donut-swin", "DonutSwin"), ("dpr", "DPR"), ("dpt", "DPT"), ("efficientformer", "EfficientFormer"), + ("efficientnet", "EfficientNet"), ("electra", "ELECTRA"), + ("encodec", "EnCodec"), ("encoder-decoder", "Encoder decoder"), ("ernie", "ERNIE"), ("ernie_m", "ErnieM"), ("esm", "ESM"), + ("falcon", "Falcon"), + ("fastspeech2_conformer", "FastSpeech2Conformer"), ("flan-t5", "FLAN-T5"), + ("flan-ul2", "FLAN-UL2"), ("flaubert", "FlauBERT"), ("flava", "FLAVA"), ("fnet", "FNet"), + ("focalnet", "FocalNet"), ("fsmt", "FairSeq Machine-Translation"), ("funnel", "Funnel Transformer"), + ("fuyu", "Fuyu"), ("git", "GIT"), ("glpn", "GLPN"), ("gpt-sw3", "GPT-Sw3"), ("gpt2", "OpenAI GPT-2"), + ("gpt_bigcode", "GPTBigCode"), ("gpt_neo", "GPT Neo"), ("gpt_neox", "GPT NeoX"), ("gpt_neox_japanese", "GPT NeoX Japanese"), ("gptj", "GPT-J"), + ("gptsan-japanese", "GPTSAN-japanese"), ("graphormer", "Graphormer"), ("groupvit", "GroupViT"), ("herbert", "HerBERT"), ("hubert", "Hubert"), ("ibert", "I-BERT"), + ("idefics", "IDEFICS"), ("imagegpt", "ImageGPT"), + ("informer", "Informer"), + ("instructblip", "InstructBLIP"), ("jukebox", "Jukebox"), + ("kosmos-2", "KOSMOS-2"), ("layoutlm", "LayoutLM"), ("layoutlmv2", "LayoutLMv2"), ("layoutlmv3", "LayoutLMv3"), @@ -448,45 +596,71 @@ ("led", "LED"), ("levit", "LeViT"), ("lilt", "LiLT"), + ("llama", "LLaMA"), + ("llama2", "Llama2"), + ("llava", "LLaVa"), ("longformer", "Longformer"), ("longt5", "LongT5"), ("luke", "LUKE"), ("lxmert", "LXMERT"), ("m2m_100", "M2M100"), + ("madlad-400", "MADLAD-400"), ("marian", "Marian"), ("markuplm", "MarkupLM"), ("mask2former", "Mask2Former"), ("maskformer", "MaskFormer"), ("maskformer-swin", "MaskFormerSwin"), + ("matcha", "MatCha"), ("mbart", "mBART"), ("mbart50", "mBART-50"), ("mctct", "M-CTC-T"), + ("mega", "MEGA"), ("megatron-bert", "Megatron-BERT"), ("megatron_gpt2", "Megatron-GPT2"), + ("mgp-str", "MGP-STR"), + ("mistral", "Mistral"), + ("mixtral", "Mixtral"), ("mluke", "mLUKE"), + ("mms", "MMS"), ("mobilebert", "MobileBERT"), ("mobilenet_v1", "MobileNetV1"), ("mobilenet_v2", "MobileNetV2"), ("mobilevit", "MobileViT"), + ("mobilevitv2", "MobileViTV2"), ("mpnet", "MPNet"), + ("mpt", "MPT"), + ("mra", "MRA"), ("mt5", "MT5"), + ("musicgen", "MusicGen"), ("mvp", "MVP"), ("nat", "NAT"), ("nezha", "Nezha"), ("nllb", "NLLB"), + ("nllb-moe", "NLLB-MOE"), + ("nougat", "Nougat"), ("nystromformer", "Nyströmformer"), ("oneformer", "OneFormer"), + ("open-llama", "OpenLlama"), ("openai-gpt", "OpenAI GPT"), ("opt", "OPT"), + ("owlv2", "OWLv2"), ("owlvit", "OWL-ViT"), + ("patchtsmixer", "PatchTSMixer"), + ("patchtst", "PatchTST"), ("pegasus", "Pegasus"), ("pegasus_x", "PEGASUS-X"), ("perceiver", "Perceiver"), + ("persimmon", "Persimmon"), + ("phi", "Phi"), ("phobert", "PhoBERT"), + ("pix2struct", "Pix2Struct"), ("plbart", "PLBart"), 
("poolformer", "PoolFormer"), + ("pop2piano", "Pop2Piano"), ("prophetnet", "ProphetNet"), + ("pvt", "PVT"), ("qdqbert", "QDQBert"), + ("qwen2", "Qwen2"), ("rag", "RAG"), ("realm", "REALM"), ("reformer", "Reformer"), @@ -498,15 +672,23 @@ ("roberta-prelayernorm", "RoBERTa-PreLayerNorm"), ("roc_bert", "RoCBert"), ("roformer", "RoFormer"), + ("rwkv", "RWKV"), + ("sam", "SAM"), + ("seamless_m4t", "SeamlessM4T"), + ("seamless_m4t_v2", "SeamlessM4Tv2"), ("segformer", "SegFormer"), ("sew", "SEW"), ("sew-d", "SEW-D"), + ("siglip", "SigLIP"), + ("siglip_vision_model", "SiglipVisionModel"), ("speech-encoder-decoder", "Speech Encoder decoder"), ("speech_to_text", "Speech2Text"), ("speech_to_text_2", "Speech2Text2"), ("speecht5", "SpeechT5"), ("splinter", "Splinter"), ("squeezebert", "SqueezeBERT"), + ("stablelm", "StableLm"), + ("swiftformer", "SwiftFormer"), ("swin", "Swin Transformer"), ("swin2sr", "Swin2SR"), ("swinv2", "Swin Transformer V2"), @@ -518,17 +700,22 @@ ("tapex", "TAPEX"), ("time_series_transformer", "Time Series Transformer"), ("timesformer", "TimeSformer"), + ("timm_backbone", "TimmBackbone"), ("trajectory_transformer", "Trajectory Transformer"), ("transfo-xl", "Transformer-XL"), ("trocr", "TrOCR"), ("tvlt", "TVLT"), + ("tvp", "TVP"), ("ul2", "UL2"), + ("umt5", "UMT5"), ("unispeech", "UniSpeech"), ("unispeech-sat", "UniSpeechSat"), + ("univnet", "UnivNet"), ("upernet", "UPerNet"), ("van", "VAN"), ("videomae", "VideoMAE"), ("vilt", "ViLT"), + ("vipllava", "VipLlava"), ("vision-encoder-decoder", "Vision Encoder decoder"), ("vision-text-dual-encoder", "VisionTextDualEncoder"), ("visual_bert", "VisualBERT"), @@ -536,7 +723,12 @@ ("vit_hybrid", "ViT Hybrid"), ("vit_mae", "ViTMAE"), ("vit_msn", "ViTMSN"), + ("vitdet", "VitDet"), + ("vitmatte", "ViTMatte"), + ("vits", "VITS"), + ("vivit", "ViViT"), ("wav2vec2", "Wav2Vec2"), + ("wav2vec2-bert", "Wav2Vec2-BERT"), ("wav2vec2-conformer", "Wav2Vec2-Conformer"), ("wav2vec2_phoneme", "Wav2Vec2Phoneme"), ("wavlm", "WavLM"), @@ -557,6 +749,20 @@ ] ) +# This is tied to the processing `-` -> `_` in `model_type_to_module_name`. For example, instead of putting +# `transfo-xl` (as in `CONFIG_MAPPING_NAMES`), we should use `transfo_xl`. 
+DEPRECATED_MODELS = [ + "bort", + "mctct", + "mmbt", + "open_llama", + "retribert", + "tapex", + "trajectory_transformer", + "transfo_xl", + "van", +] + SPECIAL_MODEL_TYPE_TO_MODULE_NAME = OrderedDict( [ ("openai-gpt", "openai"), @@ -564,8 +770,12 @@ ("data2vec-text", "data2vec"), ("data2vec-vision", "data2vec"), ("donut-swin", "donut"), + ("kosmos-2", "kosmos2"), ("maskformer-swin", "maskformer"), ("xclip", "x_clip"), + ("clip_vision_model", "clip"), + ("siglip_vision_model", "siglip"), + ("chinese_clip_vision_model", "chinese_clip"), ] ) @@ -576,7 +786,11 @@ def model_type_to_module_name(key): if key in SPECIAL_MODEL_TYPE_TO_MODULE_NAME: return SPECIAL_MODEL_TYPE_TO_MODULE_NAME[key] - return key.replace("-", "_") + key = key.replace("-", "_") + if key in DEPRECATED_MODELS: + key = f"deprecated.{key}" + + return key def config_class_to_model_type(config): @@ -584,6 +798,10 @@ def config_class_to_model_type(config): for key, cls in CONFIG_MAPPING_NAMES.items(): if cls == config: return key + # if key not found check in extra content + for key, cls in CONFIG_MAPPING._extra_content.items(): + if cls.__name__ == config: + return key return None @@ -629,11 +847,11 @@ def __iter__(self): def __contains__(self, item): return item in self._mapping or item in self._extra_content - def register(self, key, value): + def register(self, key, value, exist_ok=False): """ Register a new configuration in this mapping. """ - if key in self._mapping.keys(): + if key in self._mapping.keys() and not exist_ok: raise ValueError(f"'{key}' is already used by a Transformers config, pick another name.") self._extra_content[key] = value @@ -802,8 +1020,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): Can be either: - A string, the *model id* of a pretrained model configuration hosted inside a model repo on - huggingface.co. Valid model ids can be located at the root-level, like `bert-base-uncased`, or - namespaced under a user or organization name, like `dbmdz/bert-base-german-cased`. + huggingface.co. - A path to a *directory* containing a configuration file saved using the [`~PretrainedConfig.save_pretrained`] method, or the [`~PreTrainedModel.save_pretrained`] method, e.g., `./my_model_directory/`. @@ -846,7 +1063,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): >>> from transformers import AutoConfig >>> # Download configuration from huggingface.co and cache. - >>> config = AutoConfig.from_pretrained("bert-base-uncased") + >>> config = AutoConfig.from_pretrained("google-bert/bert-base-uncased") >>> # Download configuration from huggingface.co (user-uploaded) and cache. >>> config = AutoConfig.from_pretrained("dbmdz/bert-base-german-cased") @@ -858,12 +1075,12 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): >>> config = AutoConfig.from_pretrained("./test/bert_saved_model/my_configuration.json") >>> # Change some config attributes when loading a pretrained config. - >>> config = AutoConfig.from_pretrained("bert-base-uncased", output_attentions=True, foo=False) + >>> config = AutoConfig.from_pretrained("google-bert/bert-base-uncased", output_attentions=True, foo=False) >>> config.output_attentions True >>> config, unused_kwargs = AutoConfig.from_pretrained( - ... "bert-base-uncased", output_attentions=True, foo=False, return_unused_kwargs=True + ... "google-bert/bert-base-uncased", output_attentions=True, foo=False, return_unused_kwargs=True ... 
) >>> config.output_attentions True @@ -871,31 +1088,47 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): >>> unused_kwargs {'foo': False} ```""" + use_auth_token = kwargs.pop("use_auth_token", None) + if use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.", + FutureWarning, + ) + if kwargs.get("token", None) is not None: + raise ValueError( + "`token` and `use_auth_token` are both specified. Please set only the argument `token`." + ) + kwargs["token"] = use_auth_token + kwargs["_from_auto"] = True kwargs["name_or_path"] = pretrained_model_name_or_path - trust_remote_code = kwargs.pop("trust_remote_code", False) + trust_remote_code = kwargs.pop("trust_remote_code", None) + code_revision = kwargs.pop("code_revision", None) + config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs) - if "auto_map" in config_dict and "AutoConfig" in config_dict["auto_map"]: - if not trust_remote_code: - raise ValueError( - f"Loading {pretrained_model_name_or_path} requires you to execute the configuration file in that" - " repo on your local machine. Make sure you have read the code there to avoid malicious use, then" - " set the option `trust_remote_code=True` to remove this error." - ) - if kwargs.get("revision", None) is None: - logger.warning( - "Explicitly passing a `revision` is encouraged when loading a configuration with custom code to " - "ensure no malicious code has been contributed in a newer revision." - ) + has_remote_code = "auto_map" in config_dict and "AutoConfig" in config_dict["auto_map"] + has_local_code = "model_type" in config_dict and config_dict["model_type"] in CONFIG_MAPPING + trust_remote_code = resolve_trust_remote_code( + trust_remote_code, pretrained_model_name_or_path, has_local_code, has_remote_code + ) + + if has_remote_code and trust_remote_code: class_ref = config_dict["auto_map"]["AutoConfig"] - module_file, class_name = class_ref.split(".") config_class = get_class_from_dynamic_module( - pretrained_model_name_or_path, module_file + ".py", class_name, **kwargs + class_ref, pretrained_model_name_or_path, code_revision=code_revision, **kwargs ) - config_class.register_for_auto_class() + if os.path.isdir(pretrained_model_name_or_path): + config_class.register_for_auto_class() return config_class.from_pretrained(pretrained_model_name_or_path, **kwargs) elif "model_type" in config_dict: - config_class = CONFIG_MAPPING[config_dict["model_type"]] + try: + config_class = CONFIG_MAPPING[config_dict["model_type"]] + except KeyError: + raise ValueError( + f"The checkpoint you are trying to load has model type `{config_dict['model_type']}` " + "but Transformers does not recognize this architecture. This could be because of an " + "issue with the checkpoint, or because your version of Transformers is out of date." + ) return config_class.from_dict(config_dict, **unused_kwargs) else: # Fallback: use pattern matching on the string. @@ -911,7 +1144,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): ) @staticmethod - def register(model_type, config): + def register(model_type, config, exist_ok=False): """ Register a new configuration for this class. @@ -925,4 +1158,4 @@ def register(model_type, config): f"you passed (config has {config.model_type} and you passed {model_type}. Fix one of those so they " "match!" 
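A short sketch of the module-name routing added above in `configuration_auto.py`, assuming a tree with this patch applied (not part of the patch itself):

```python
# Sketch of the new `model_type_to_module_name` routing.
from transformers.models.auto.configuration_auto import model_type_to_module_name

print(model_type_to_module_name("kosmos-2"))       # "kosmos2": special-cased module name
print(model_type_to_module_name("transfo-xl"))     # "deprecated.transfo_xl": listed in DEPRECATED_MODELS
print(model_type_to_module_name("wav2vec2-bert"))  # "wav2vec2_bert": default `-` -> `_` rule
```

Unknown model types now surface a `ValueError` from `AutoConfig.from_pretrained` pointing at either a problematic checkpoint or an outdated Transformers install, instead of a bare `KeyError`.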
) - CONFIG_MAPPING.register(model_type, config) + CONFIG_MAPPING.register(model_type, config, exist_ok=exist_ok) diff --git a/src/transformers/models/auto/feature_extraction_auto.py b/src/transformers/models/auto/feature_extraction_auto.py index caf27f2176603b..f8cb55091b02fd 100644 --- a/src/transformers/models/auto/feature_extraction_auto.py +++ b/src/transformers/models/auto/feature_extraction_auto.py @@ -16,12 +16,13 @@ import importlib import json import os +import warnings from collections import OrderedDict from typing import Dict, Optional, Union # Build the list of all feature extractors from ...configuration_utils import PretrainedConfig -from ...dynamic_module_utils import get_class_from_dynamic_module +from ...dynamic_module_utils import get_class_from_dynamic_module, resolve_trust_remote_code from ...feature_extraction_utils import FeatureExtractionMixin from ...utils import CONFIG_NAME, FEATURE_EXTRACTOR_NAME, get_file_from_repo, logging from .auto_factory import _LazyAutoMapping @@ -43,6 +44,7 @@ ("clap", "ClapFeatureExtractor"), ("clip", "CLIPFeatureExtractor"), ("clipseg", "ViTFeatureExtractor"), + ("clvp", "ClvpFeatureExtractor"), ("conditional_detr", "ConditionalDetrFeatureExtractor"), ("convnext", "ConvNextFeatureExtractor"), ("cvt", "ConvNextFeatureExtractor"), @@ -54,6 +56,7 @@ ("dinat", "ViTFeatureExtractor"), ("donut-swin", "DonutFeatureExtractor"), ("dpt", "DPTFeatureExtractor"), + ("encodec", "EncodecFeatureExtractor"), ("flava", "FlavaFeatureExtractor"), ("glpn", "GLPNFeatureExtractor"), ("groupvit", "CLIPFeatureExtractor"), @@ -71,19 +74,25 @@ ("owlvit", "OwlViTFeatureExtractor"), ("perceiver", "PerceiverFeatureExtractor"), ("poolformer", "PoolFormerFeatureExtractor"), + ("pop2piano", "Pop2PianoFeatureExtractor"), ("regnet", "ConvNextFeatureExtractor"), ("resnet", "ConvNextFeatureExtractor"), + ("seamless_m4t", "SeamlessM4TFeatureExtractor"), + ("seamless_m4t_v2", "SeamlessM4TFeatureExtractor"), ("segformer", "SegformerFeatureExtractor"), ("sew", "Wav2Vec2FeatureExtractor"), ("sew-d", "Wav2Vec2FeatureExtractor"), ("speech_to_text", "Speech2TextFeatureExtractor"), ("speecht5", "SpeechT5FeatureExtractor"), + ("swiftformer", "ViTFeatureExtractor"), ("swin", "ViTFeatureExtractor"), ("swinv2", "ViTFeatureExtractor"), ("table-transformer", "DetrFeatureExtractor"), ("timesformer", "VideoMAEFeatureExtractor"), + ("tvlt", "TvltFeatureExtractor"), ("unispeech", "Wav2Vec2FeatureExtractor"), ("unispeech-sat", "Wav2Vec2FeatureExtractor"), + ("univnet", "UnivNetFeatureExtractor"), ("van", "ConvNextFeatureExtractor"), ("videomae", "VideoMAEFeatureExtractor"), ("vilt", "ViltFeatureExtractor"), @@ -91,6 +100,7 @@ ("vit_mae", "ViTFeatureExtractor"), ("vit_msn", "ViTFeatureExtractor"), ("wav2vec2", "Wav2Vec2FeatureExtractor"), + ("wav2vec2-bert", "Wav2Vec2FeatureExtractor"), ("wav2vec2-conformer", "Wav2Vec2FeatureExtractor"), ("wavlm", "Wav2Vec2FeatureExtractor"), ("whisper", "WhisperFeatureExtractor"), @@ -132,7 +142,7 @@ def get_feature_extractor_config( force_download: bool = False, resume_download: bool = False, proxies: Optional[Dict[str, str]] = None, - use_auth_token: Optional[Union[bool, str]] = None, + token: Optional[Union[bool, str]] = None, revision: Optional[str] = None, local_files_only: bool = False, **kwargs, @@ -145,8 +155,7 @@ def get_feature_extractor_config( This can be either: - a string, the *model id* of a pretrained model configuration hosted inside a model repo on - huggingface.co. 
Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced - under a user or organization name, like `dbmdz/bert-base-german-cased`. + huggingface.co. - a path to a *directory* containing a configuration file saved using the [`~PreTrainedTokenizer.save_pretrained`] method, e.g., `./my_model_directory/`. @@ -161,7 +170,7 @@ def get_feature_extractor_config( proxies (`Dict[str, str]`, *optional*): A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request. - use_auth_token (`str` or *bool*, *optional*): + token (`str` or *bool*, *optional*): The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated when running `huggingface-cli login` (stored in `~/.huggingface`). revision (`str`, *optional*, defaults to `"main"`): @@ -173,7 +182,7 @@ def get_feature_extractor_config( - Passing `use_auth_token=True` is required when you want to use a private model. + Passing `token=True` is required when you want to use a private model. @@ -184,17 +193,27 @@ def get_feature_extractor_config( ```python # Download configuration from huggingface.co and cache. - tokenizer_config = get_tokenizer_config("bert-base-uncased") + tokenizer_config = get_tokenizer_config("google-bert/bert-base-uncased") # This model does not have a tokenizer config so the result will be an empty dict. - tokenizer_config = get_tokenizer_config("xlm-roberta-base") + tokenizer_config = get_tokenizer_config("FacebookAI/xlm-roberta-base") # Save a pretrained tokenizer locally and you can reload its config from transformers import AutoTokenizer - tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") + tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") tokenizer.save_pretrained("tokenizer-test") tokenizer_config = get_tokenizer_config("tokenizer-test") ```""" + use_auth_token = kwargs.pop("use_auth_token", None) + if use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.", + FutureWarning, + ) + if token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + token = use_auth_token + resolved_config_file = get_file_from_repo( pretrained_model_name_or_path, FEATURE_EXTRACTOR_NAME, @@ -202,7 +221,7 @@ def get_feature_extractor_config( force_download=force_download, resume_download=resume_download, proxies=proxies, - use_auth_token=use_auth_token, + token=token, revision=revision, local_files_only=local_files_only, ) @@ -247,8 +266,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): This can be either: - a string, the *model id* of a pretrained feature_extractor hosted inside a model repo on - huggingface.co. Valid model ids can be located at the root-level, like `bert-base-uncased`, or - namespaced under a user or organization name, like `dbmdz/bert-base-german-cased`. + huggingface.co. - a path to a *directory* containing a feature extractor file saved using the [`~feature_extraction_utils.FeatureExtractionMixin.save_pretrained`] method, e.g., `./my_model_directory/`. 
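The `use_auth_token` -> `token` migration above applies uniformly to the helpers and the Auto classes; a hedged sketch (repo id taken from the docstring examples, `token=True` assumes a token cached by `huggingface-cli login`):

```python
# Sketch only: new-style authentication argument after this patch.
from transformers import AutoFeatureExtractor
from transformers.models.auto.feature_extraction_auto import get_feature_extractor_config

fe_config = get_feature_extractor_config("facebook/wav2vec2-base-960h", token=True)
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h", token=True)

# The old `use_auth_token=...` spelling still works but emits a FutureWarning and is
# slated for removal in v5; passing both `token` and `use_auth_token` raises a ValueError.
```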
@@ -266,7 +284,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): proxies (`Dict[str, str]`, *optional*): A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request. - use_auth_token (`str` or *bool*, *optional*): + token (`str` or *bool*, *optional*): The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated when running `huggingface-cli login` (stored in `~/.huggingface`). revision (`str`, *optional*, defaults to `"main"`): @@ -289,7 +307,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): - Passing `use_auth_token=True` is required when you want to use a private model. + Passing `token=True` is required when you want to use a private model. @@ -302,10 +320,22 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): >>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h") >>> # If feature extractor files are in a directory (e.g. feature extractor was saved using *save_pretrained('./test/saved_model/')*) - >>> feature_extractor = AutoFeatureExtractor.from_pretrained("./test/saved_model/") + >>> # feature_extractor = AutoFeatureExtractor.from_pretrained("./test/saved_model/") ```""" + use_auth_token = kwargs.pop("use_auth_token", None) + if use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.", + FutureWarning, + ) + if kwargs.get("token", None) is not None: + raise ValueError( + "`token` and `use_auth_token` are both specified. Please set only the argument `token`." + ) + kwargs["token"] = use_auth_token + config = kwargs.pop("config", None) - trust_remote_code = kwargs.pop("trust_remote_code", False) + trust_remote_code = kwargs.pop("trust_remote_code", None) kwargs["_from_auto"] = True config_dict, _ = FeatureExtractionMixin.get_feature_extractor_dict(pretrained_model_name_or_path, **kwargs) @@ -324,28 +354,23 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): feature_extractor_auto_map = config.auto_map["AutoFeatureExtractor"] if feature_extractor_class is not None: - # If we have custom code for a feature extractor, we get the proper class. - if feature_extractor_auto_map is not None: - if not trust_remote_code: - raise ValueError( - f"Loading {pretrained_model_name_or_path} requires you to execute the feature extractor file " - "in that repo on your local machine. Make sure you have read the code there to avoid " - "malicious use, then set the option `trust_remote_code=True` to remove this error." - ) - if kwargs.get("revision", None) is None: - logger.warning( - "Explicitly passing a `revision` is encouraged when loading a feature extractor with custom " - "code to ensure no malicious code has been contributed in a newer revision." 
- ) - - module_file, class_name = feature_extractor_auto_map.split(".") - feature_extractor_class = get_class_from_dynamic_module( - pretrained_model_name_or_path, module_file + ".py", class_name, **kwargs - ) - feature_extractor_class.register_for_auto_class() - else: - feature_extractor_class = feature_extractor_class_from_name(feature_extractor_class) + feature_extractor_class = feature_extractor_class_from_name(feature_extractor_class) + + has_remote_code = feature_extractor_auto_map is not None + has_local_code = feature_extractor_class is not None or type(config) in FEATURE_EXTRACTOR_MAPPING + trust_remote_code = resolve_trust_remote_code( + trust_remote_code, pretrained_model_name_or_path, has_local_code, has_remote_code + ) + if has_remote_code and trust_remote_code: + feature_extractor_class = get_class_from_dynamic_module( + feature_extractor_auto_map, pretrained_model_name_or_path, **kwargs + ) + _ = kwargs.pop("code_revision", None) + if os.path.isdir(pretrained_model_name_or_path): + feature_extractor_class.register_for_auto_class() + return feature_extractor_class.from_dict(config_dict, **kwargs) + elif feature_extractor_class is not None: return feature_extractor_class.from_dict(config_dict, **kwargs) # Last try: we use the FEATURE_EXTRACTOR_MAPPING. elif type(config) in FEATURE_EXTRACTOR_MAPPING: @@ -359,7 +384,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): ) @staticmethod - def register(config_class, feature_extractor_class): + def register(config_class, feature_extractor_class, exist_ok=False): """ Register a new feature extractor for this class. @@ -368,4 +393,4 @@ def register(config_class, feature_extractor_class): The configuration corresponding to the model to register. feature_extractor_class ([`FeatureExtractorMixin`]): The feature extractor to register. 
""" - FEATURE_EXTRACTOR_MAPPING.register(config_class, feature_extractor_class) + FEATURE_EXTRACTOR_MAPPING.register(config_class, feature_extractor_class, exist_ok=exist_ok) diff --git a/src/transformers/models/auto/image_processing_auto.py b/src/transformers/models/auto/image_processing_auto.py index d12aa1604162e9..c9cd6fca69d661 100644 --- a/src/transformers/models/auto/image_processing_auto.py +++ b/src/transformers/models/auto/image_processing_auto.py @@ -16,12 +16,13 @@ import importlib import json import os +import warnings from collections import OrderedDict from typing import Dict, Optional, Union # Build the list of all image processors from ...configuration_utils import PretrainedConfig -from ...dynamic_module_utils import get_class_from_dynamic_module +from ...dynamic_module_utils import get_class_from_dynamic_module, resolve_trust_remote_code from ...image_processing_utils import ImageProcessingMixin from ...utils import CONFIG_NAME, IMAGE_PROCESSOR_NAME, get_file_from_repo, logging from .auto_factory import _LazyAutoMapping @@ -37,6 +38,7 @@ IMAGE_PROCESSOR_MAPPING_NAMES = OrderedDict( [ + ("align", "EfficientNetImageProcessor"), ("beit", "BeitImageProcessor"), ("bit", "BitImageProcessor"), ("blip", "BlipImageProcessor"), @@ -47,52 +49,74 @@ ("clipseg", "ViTImageProcessor"), ("conditional_detr", "ConditionalDetrImageProcessor"), ("convnext", "ConvNextImageProcessor"), + ("convnextv2", "ConvNextImageProcessor"), ("cvt", "ConvNextImageProcessor"), ("data2vec-vision", "BeitImageProcessor"), ("deformable_detr", "DeformableDetrImageProcessor"), ("deit", "DeiTImageProcessor"), + ("depth_anything", "DPTImageProcessor"), ("deta", "DetaImageProcessor"), ("detr", "DetrImageProcessor"), ("dinat", "ViTImageProcessor"), + ("dinov2", "BitImageProcessor"), ("donut-swin", "DonutImageProcessor"), ("dpt", "DPTImageProcessor"), ("efficientformer", "EfficientFormerImageProcessor"), + ("efficientnet", "EfficientNetImageProcessor"), ("flava", "FlavaImageProcessor"), + ("focalnet", "BitImageProcessor"), + ("fuyu", "FuyuImageProcessor"), ("git", "CLIPImageProcessor"), ("glpn", "GLPNImageProcessor"), ("groupvit", "CLIPImageProcessor"), + ("idefics", "IdeficsImageProcessor"), ("imagegpt", "ImageGPTImageProcessor"), + ("instructblip", "BlipImageProcessor"), + ("kosmos-2", "CLIPImageProcessor"), ("layoutlmv2", "LayoutLMv2ImageProcessor"), ("layoutlmv3", "LayoutLMv3ImageProcessor"), ("levit", "LevitImageProcessor"), + ("llava", "CLIPImageProcessor"), ("mask2former", "Mask2FormerImageProcessor"), ("maskformer", "MaskFormerImageProcessor"), + ("mgp-str", "ViTImageProcessor"), ("mobilenet_v1", "MobileNetV1ImageProcessor"), ("mobilenet_v2", "MobileNetV2ImageProcessor"), - ("mobilenet_v2", "MobileNetV2ImageProcessor"), ("mobilevit", "MobileViTImageProcessor"), ("mobilevit", "MobileViTImageProcessor"), + ("mobilevitv2", "MobileViTImageProcessor"), ("nat", "ViTImageProcessor"), + ("nougat", "NougatImageProcessor"), ("oneformer", "OneFormerImageProcessor"), + ("owlv2", "Owlv2ImageProcessor"), ("owlvit", "OwlViTImageProcessor"), ("perceiver", "PerceiverImageProcessor"), + ("pix2struct", "Pix2StructImageProcessor"), ("poolformer", "PoolFormerImageProcessor"), + ("pvt", "PvtImageProcessor"), ("regnet", "ConvNextImageProcessor"), ("resnet", "ConvNextImageProcessor"), + ("sam", "SamImageProcessor"), ("segformer", "SegformerImageProcessor"), + ("siglip", "SiglipImageProcessor"), + ("swiftformer", "ViTImageProcessor"), ("swin", "ViTImageProcessor"), ("swin2sr", "Swin2SRImageProcessor"), ("swinv2", 
"ViTImageProcessor"), ("table-transformer", "DetrImageProcessor"), ("timesformer", "VideoMAEImageProcessor"), + ("tvlt", "TvltImageProcessor"), + ("tvp", "TvpImageProcessor"), ("upernet", "SegformerImageProcessor"), ("van", "ConvNextImageProcessor"), ("videomae", "VideoMAEImageProcessor"), ("vilt", "ViltImageProcessor"), + ("vipllava", "CLIPImageProcessor"), ("vit", "ViTImageProcessor"), ("vit_hybrid", "ViTHybridImageProcessor"), ("vit_mae", "ViTImageProcessor"), ("vit_msn", "ViTImageProcessor"), + ("vitmatte", "VitMatteImageProcessor"), ("xclip", "CLIPImageProcessor"), ("yolos", "YolosImageProcessor"), ] @@ -131,7 +155,7 @@ def get_image_processor_config( force_download: bool = False, resume_download: bool = False, proxies: Optional[Dict[str, str]] = None, - use_auth_token: Optional[Union[bool, str]] = None, + token: Optional[Union[bool, str]] = None, revision: Optional[str] = None, local_files_only: bool = False, **kwargs, @@ -144,8 +168,7 @@ def get_image_processor_config( This can be either: - a string, the *model id* of a pretrained model configuration hosted inside a model repo on - huggingface.co. Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced - under a user or organization name, like `dbmdz/bert-base-german-cased`. + huggingface.co. - a path to a *directory* containing a configuration file saved using the [`~PreTrainedTokenizer.save_pretrained`] method, e.g., `./my_model_directory/`. @@ -160,7 +183,7 @@ def get_image_processor_config( proxies (`Dict[str, str]`, *optional*): A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request. - use_auth_token (`str` or *bool*, *optional*): + token (`str` or *bool*, *optional*): The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated when running `huggingface-cli login` (stored in `~/.huggingface`). revision (`str`, *optional*, defaults to `"main"`): @@ -172,7 +195,7 @@ def get_image_processor_config( - Passing `use_auth_token=True` is required when you want to use a private model. + Passing `token=True` is required when you want to use a private model. @@ -183,9 +206,9 @@ def get_image_processor_config( ```python # Download configuration from huggingface.co and cache. - image_processor_config = get_image_processor_config("bert-base-uncased") + image_processor_config = get_image_processor_config("google-bert/bert-base-uncased") # This model does not have a image processor config so the result will be an empty dict. - image_processor_config = get_image_processor_config("xlm-roberta-base") + image_processor_config = get_image_processor_config("FacebookAI/xlm-roberta-base") # Save a pretrained image processor locally and you can reload its config from transformers import AutoTokenizer @@ -194,6 +217,16 @@ def get_image_processor_config( image_processor.save_pretrained("image-processor-test") image_processor_config = get_image_processor_config("image-processor-test") ```""" + use_auth_token = kwargs.pop("use_auth_token", None) + if use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.", + FutureWarning, + ) + if token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. 
Please set only the argument `token`.") + token = use_auth_token + resolved_config_file = get_file_from_repo( pretrained_model_name_or_path, IMAGE_PROCESSOR_NAME, @@ -201,7 +234,7 @@ def get_image_processor_config( force_download=force_download, resume_download=resume_download, proxies=proxies, - use_auth_token=use_auth_token, + token=token, revision=revision, local_files_only=local_files_only, ) @@ -246,8 +279,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): This can be either: - a string, the *model id* of a pretrained image_processor hosted inside a model repo on - huggingface.co. Valid model ids can be located at the root-level, like `bert-base-uncased`, or - namespaced under a user or organization name, like `dbmdz/bert-base-german-cased`. + huggingface.co. - a path to a *directory* containing a image processor file saved using the [`~image_processing_utils.ImageProcessingMixin.save_pretrained`] method, e.g., `./my_model_directory/`. @@ -265,7 +297,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): proxies (`Dict[str, str]`, *optional*): A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request. - use_auth_token (`str` or *bool*, *optional*): + token (`str` or *bool*, *optional*): The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated when running `huggingface-cli login` (stored in `~/.huggingface`). revision (`str`, *optional*, defaults to `"main"`): @@ -288,7 +320,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): - Passing `use_auth_token=True` is required when you want to use a private model. + Passing `token=True` is required when you want to use a private model. @@ -301,10 +333,22 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): >>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k") >>> # If image processor files are in a directory (e.g. image processor was saved using *save_pretrained('./test/saved_model/')*) - >>> image_processor = AutoImageProcessor.from_pretrained("./test/saved_model/") + >>> # image_processor = AutoImageProcessor.from_pretrained("./test/saved_model/") ```""" + use_auth_token = kwargs.pop("use_auth_token", None) + if use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.", + FutureWarning, + ) + if kwargs.get("token", None) is not None: + raise ValueError( + "`token` and `use_auth_token` are both specified. Please set only the argument `token`." + ) + kwargs["token"] = use_auth_token + config = kwargs.pop("config", None) - trust_remote_code = kwargs.pop("trust_remote_code", False) + trust_remote_code = kwargs.pop("trust_remote_code", None) kwargs["_from_auto"] = True config_dict, _ = ImageProcessingMixin.get_image_processor_dict(pretrained_model_name_or_path, **kwargs) @@ -319,16 +363,20 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): feature_extractor_class = config_dict.pop("feature_extractor_type", None) if feature_extractor_class is not None: logger.warning( - "Could not find image processor class in the image processor config or the model config. Loading" - " based on pattern matching with the model's feature extractor configuration." + "Could not find image processor class in the image processor config or the model config. 
Loading " + "based on pattern matching with the model's feature extractor configuration. Please open a " + "PR/issue to update `preprocessor_config.json` to use `image_processor_type` instead of " + "`feature_extractor_type`. This warning will be removed in v4.40." ) image_processor_class = feature_extractor_class.replace("FeatureExtractor", "ImageProcessor") if "AutoFeatureExtractor" in config_dict.get("auto_map", {}): feature_extractor_auto_map = config_dict["auto_map"]["AutoFeatureExtractor"] image_processor_auto_map = feature_extractor_auto_map.replace("FeatureExtractor", "ImageProcessor") logger.warning( - "Could not find image processor auto map in the image processor config or the model config." - " Loading based on pattern matching with the model's feature extractor configuration." + "Could not find image processor auto map in the image processor config or the model config. " + "Loading based on pattern matching with the model's feature extractor configuration. Please open a " + "PR/issue to update `preprocessor_config.json` to use `AutoImageProcessor` instead of " + "`AutoFeatureExtractor`. This warning will be removed in v4.40." ) # If we don't find the image processor class in the image processor config, let's try the model config. @@ -341,28 +389,23 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): image_processor_auto_map = config.auto_map["AutoImageProcessor"] if image_processor_class is not None: - # If we have custom code for a image processor, we get the proper class. - if image_processor_auto_map is not None: - if not trust_remote_code: - raise ValueError( - f"Loading {pretrained_model_name_or_path} requires you to execute the image processor file " - "in that repo on your local machine. Make sure you have read the code there to avoid " - "malicious use, then set the option `trust_remote_code=True` to remove this error." - ) - if kwargs.get("revision", None) is None: - logger.warning( - "Explicitly passing a `revision` is encouraged when loading a image processor with custom " - "code to ensure no malicious code has been contributed in a newer revision." - ) - - module_file, class_name = image_processor_auto_map.split(".") - image_processor_class = get_class_from_dynamic_module( - pretrained_model_name_or_path, module_file + ".py", class_name, **kwargs - ) - image_processor_class.register_for_auto_class() - else: - image_processor_class = image_processor_class_from_name(image_processor_class) + image_processor_class = image_processor_class_from_name(image_processor_class) + + has_remote_code = image_processor_auto_map is not None + has_local_code = image_processor_class is not None or type(config) in IMAGE_PROCESSOR_MAPPING + trust_remote_code = resolve_trust_remote_code( + trust_remote_code, pretrained_model_name_or_path, has_local_code, has_remote_code + ) + if has_remote_code and trust_remote_code: + image_processor_class = get_class_from_dynamic_module( + image_processor_auto_map, pretrained_model_name_or_path, **kwargs + ) + _ = kwargs.pop("code_revision", None) + if os.path.isdir(pretrained_model_name_or_path): + image_processor_class.register_for_auto_class() + return image_processor_class.from_dict(config_dict, **kwargs) + elif image_processor_class is not None: return image_processor_class.from_dict(config_dict, **kwargs) # Last try: we use the IMAGE_PROCESSOR_MAPPING. 
elif type(config) in IMAGE_PROCESSOR_MAPPING: @@ -376,7 +419,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): ) @staticmethod - def register(config_class, image_processor_class): + def register(config_class, image_processor_class, exist_ok=False): """ Register a new image processor for this class. @@ -385,4 +428,4 @@ def register(config_class, image_processor_class): The configuration corresponding to the model to register. image_processor_class ([`ImageProcessingMixin`]): The image processor to register. """ - IMAGE_PROCESSOR_MAPPING.register(config_class, image_processor_class) + IMAGE_PROCESSOR_MAPPING.register(config_class, image_processor_class, exist_ok=exist_ok) diff --git a/src/transformers/models/auto/modeling_auto.py b/src/transformers/models/auto/modeling_auto.py index 8b141b2cd45825..1da2a644326d1b 100755 --- a/src/transformers/models/auto/modeling_auto.py +++ b/src/transformers/models/auto/modeling_auto.py @@ -18,7 +18,12 @@ from collections import OrderedDict from ...utils import logging -from .auto_factory import _BaseAutoModelClass, _LazyAutoMapping, auto_class_update +from .auto_factory import ( + _BaseAutoBackboneClass, + _BaseAutoModelClass, + _LazyAutoMapping, + auto_class_update, +) from .configuration_auto import CONFIG_MAPPING_NAMES @@ -29,8 +34,11 @@ [ # Base model mapping ("albert", "AlbertModel"), + ("align", "AlignModel"), ("altclip", "AltCLIPModel"), ("audio-spectrogram-transformer", "ASTModel"), + ("autoformer", "AutoformerModel"), + ("bark", "BarkModel"), ("bart", "BartModel"), ("beit", "BeitModel"), ("bert", "BertModel"), @@ -42,18 +50,26 @@ ("blenderbot", "BlenderbotModel"), ("blenderbot-small", "BlenderbotSmallModel"), ("blip", "BlipModel"), + ("blip-2", "Blip2Model"), ("bloom", "BloomModel"), ("bridgetower", "BridgeTowerModel"), + ("bros", "BrosModel"), ("camembert", "CamembertModel"), ("canine", "CanineModel"), ("chinese_clip", "ChineseCLIPModel"), + ("chinese_clip_vision_model", "ChineseCLIPVisionModel"), ("clap", "ClapModel"), ("clip", "CLIPModel"), + ("clip_vision_model", "CLIPVisionModel"), ("clipseg", "CLIPSegModel"), + ("clvp", "ClvpModelForConditionalGeneration"), + ("code_llama", "LlamaModel"), ("codegen", "CodeGenModel"), ("conditional_detr", "ConditionalDetrModel"), ("convbert", "ConvBertModel"), ("convnext", "ConvNextModel"), + ("convnextv2", "ConvNextV2Model"), + ("cpmant", "CpmAntModel"), ("ctrl", "CTRLModel"), ("cvt", "CvtModel"), ("data2vec-audio", "Data2VecAudioModel"), @@ -62,46 +78,57 @@ ("deberta", "DebertaModel"), ("deberta-v2", "DebertaV2Model"), ("decision_transformer", "DecisionTransformerModel"), - ("decision_transformer_gpt2", "DecisionTransformerGPT2Model"), ("deformable_detr", "DeformableDetrModel"), ("deit", "DeiTModel"), ("deta", "DetaModel"), ("detr", "DetrModel"), ("dinat", "DinatModel"), + ("dinov2", "Dinov2Model"), ("distilbert", "DistilBertModel"), ("donut-swin", "DonutSwinModel"), ("dpr", "DPRQuestionEncoder"), ("dpt", "DPTModel"), ("efficientformer", "EfficientFormerModel"), + ("efficientnet", "EfficientNetModel"), ("electra", "ElectraModel"), + ("encodec", "EncodecModel"), ("ernie", "ErnieModel"), ("ernie_m", "ErnieMModel"), ("esm", "EsmModel"), + ("falcon", "FalconModel"), + ("fastspeech2_conformer", "FastSpeech2ConformerModel"), ("flaubert", "FlaubertModel"), ("flava", "FlavaModel"), ("fnet", "FNetModel"), + ("focalnet", "FocalNetModel"), ("fsmt", "FSMTModel"), ("funnel", ("FunnelModel", "FunnelBaseModel")), ("git", "GitModel"), ("glpn", "GLPNModel"), ("gpt-sw3", "GPT2Model"), ("gpt2", 
"GPT2Model"), + ("gpt_bigcode", "GPTBigCodeModel"), ("gpt_neo", "GPTNeoModel"), ("gpt_neox", "GPTNeoXModel"), ("gpt_neox_japanese", "GPTNeoXJapaneseModel"), ("gptj", "GPTJModel"), + ("gptsan-japanese", "GPTSanJapaneseForConditionalGeneration"), ("graphormer", "GraphormerModel"), ("groupvit", "GroupViTModel"), ("hubert", "HubertModel"), ("ibert", "IBertModel"), + ("idefics", "IdeficsModel"), ("imagegpt", "ImageGPTModel"), + ("informer", "InformerModel"), ("jukebox", "JukeboxModel"), + ("kosmos-2", "Kosmos2Model"), ("layoutlm", "LayoutLMModel"), ("layoutlmv2", "LayoutLMv2Model"), ("layoutlmv3", "LayoutLMv3Model"), ("led", "LEDModel"), ("levit", "LevitModel"), ("lilt", "LiltModel"), + ("llama", "LlamaModel"), ("longformer", "LongformerModel"), ("longt5", "LongT5Model"), ("luke", "LukeModel"), @@ -114,29 +141,44 @@ ("maskformer-swin", "MaskFormerSwinModel"), ("mbart", "MBartModel"), ("mctct", "MCTCTModel"), + ("mega", "MegaModel"), ("megatron-bert", "MegatronBertModel"), + ("mgp-str", "MgpstrForSceneTextRecognition"), + ("mistral", "MistralModel"), + ("mixtral", "MixtralModel"), ("mobilebert", "MobileBertModel"), ("mobilenet_v1", "MobileNetV1Model"), ("mobilenet_v2", "MobileNetV2Model"), ("mobilevit", "MobileViTModel"), + ("mobilevitv2", "MobileViTV2Model"), ("mpnet", "MPNetModel"), + ("mpt", "MptModel"), + ("mra", "MraModel"), ("mt5", "MT5Model"), ("mvp", "MvpModel"), ("nat", "NatModel"), ("nezha", "NezhaModel"), - ("nllb", "M2M100Model"), + ("nllb-moe", "NllbMoeModel"), ("nystromformer", "NystromformerModel"), ("oneformer", "OneFormerModel"), + ("open-llama", "OpenLlamaModel"), ("openai-gpt", "OpenAIGPTModel"), ("opt", "OPTModel"), + ("owlv2", "Owlv2Model"), ("owlvit", "OwlViTModel"), + ("patchtsmixer", "PatchTSMixerModel"), + ("patchtst", "PatchTSTModel"), ("pegasus", "PegasusModel"), ("pegasus_x", "PegasusXModel"), ("perceiver", "PerceiverModel"), + ("persimmon", "PersimmonModel"), + ("phi", "PhiModel"), ("plbart", "PLBartModel"), ("poolformer", "PoolFormerModel"), ("prophetnet", "ProphetNetModel"), + ("pvt", "PvtModel"), ("qdqbert", "QDQBertModel"), + ("qwen2", "Qwen2Model"), ("reformer", "ReformerModel"), ("regnet", "RegNetModel"), ("rembert", "RemBertModel"), @@ -146,13 +188,21 @@ ("roberta-prelayernorm", "RobertaPreLayerNormModel"), ("roc_bert", "RoCBertModel"), ("roformer", "RoFormerModel"), + ("rwkv", "RwkvModel"), + ("sam", "SamModel"), + ("seamless_m4t", "SeamlessM4TModel"), + ("seamless_m4t_v2", "SeamlessM4Tv2Model"), ("segformer", "SegformerModel"), ("sew", "SEWModel"), ("sew-d", "SEWDModel"), + ("siglip", "SiglipModel"), + ("siglip_vision_model", "SiglipVisionModel"), ("speech_to_text", "Speech2TextModel"), ("speecht5", "SpeechT5Model"), ("splinter", "SplinterModel"), ("squeezebert", "SqueezeBertModel"), + ("stablelm", "StableLmModel"), + ("swiftformer", "SwiftFormerModel"), ("swin", "SwinModel"), ("swin2sr", "Swin2SRModel"), ("swinv2", "Swinv2Model"), @@ -162,11 +212,15 @@ ("tapas", "TapasModel"), ("time_series_transformer", "TimeSeriesTransformerModel"), ("timesformer", "TimesformerModel"), + ("timm_backbone", "TimmBackbone"), ("trajectory_transformer", "TrajectoryTransformerModel"), ("transfo-xl", "TransfoXLModel"), ("tvlt", "TvltModel"), + ("tvp", "TvpModel"), + ("umt5", "UMT5Model"), ("unispeech", "UniSpeechModel"), ("unispeech-sat", "UniSpeechSatModel"), + ("univnet", "UnivNetModel"), ("van", "VanModel"), ("videomae", "VideoMAEModel"), ("vilt", "ViltModel"), @@ -176,7 +230,11 @@ ("vit_hybrid", "ViTHybridModel"), ("vit_mae", "ViTMAEModel"), ("vit_msn", "ViTMSNModel"), + 
("vitdet", "VitDetModel"), + ("vits", "VitsModel"), + ("vivit", "VivitModel"), ("wav2vec2", "Wav2Vec2Model"), + ("wav2vec2-bert", "Wav2Vec2BertModel"), ("wav2vec2-conformer", "Wav2Vec2ConformerModel"), ("wavlm", "WavLMModel"), ("whisper", "WhisperModel"), @@ -216,21 +274,30 @@ ("funnel", "FunnelForPreTraining"), ("gpt-sw3", "GPT2LMHeadModel"), ("gpt2", "GPT2LMHeadModel"), + ("gpt_bigcode", "GPTBigCodeForCausalLM"), + ("gptsan-japanese", "GPTSanJapaneseForConditionalGeneration"), ("ibert", "IBertForMaskedLM"), + ("idefics", "IdeficsForVisionText2Text"), ("layoutlm", "LayoutLMForMaskedLM"), + ("llava", "LlavaForConditionalGeneration"), ("longformer", "LongformerForMaskedLM"), ("luke", "LukeForMaskedLM"), ("lxmert", "LxmertForPreTraining"), + ("mega", "MegaForMaskedLM"), ("megatron-bert", "MegatronBertForPreTraining"), ("mobilebert", "MobileBertForPreTraining"), ("mpnet", "MPNetForMaskedLM"), + ("mpt", "MptForCausalLM"), + ("mra", "MraForMaskedLM"), ("mvp", "MvpForConditionalGeneration"), ("nezha", "NezhaForPreTraining"), + ("nllb-moe", "NllbMoeForConditionalGeneration"), ("openai-gpt", "OpenAIGPTLMHeadModel"), ("retribert", "RetriBertModel"), ("roberta", "RobertaForMaskedLM"), ("roberta-prelayernorm", "RobertaPreLayerNormForMaskedLM"), ("roc_bert", "RoCBertForPreTraining"), + ("rwkv", "RwkvForCausalLM"), ("splinter", "SplinterForPreTraining"), ("squeezebert", "SqueezeBertForMaskedLM"), ("switch_transformers", "SwitchTransformersForConditionalGeneration"), @@ -241,6 +308,7 @@ ("unispeech", "UniSpeechForPreTraining"), ("unispeech-sat", "UniSpeechSatForPreTraining"), ("videomae", "VideoMAEForPreTraining"), + ("vipllava", "VipLlavaForConditionalGeneration"), ("visual_bert", "VisualBertForPreTraining"), ("vit_mae", "ViTMAEForPreTraining"), ("wav2vec2", "Wav2Vec2ForPreTraining"), @@ -266,6 +334,7 @@ ("camembert", "CamembertForMaskedLM"), ("codegen", "CodeGenForCausalLM"), ("convbert", "ConvBertForMaskedLM"), + ("cpmant", "CpmAntForCausalLM"), ("ctrl", "CTRLLMHeadModel"), ("data2vec-text", "Data2VecTextForMaskedLM"), ("deberta", "DebertaForMaskedLM"), @@ -282,10 +351,12 @@ ("git", "GitForCausalLM"), ("gpt-sw3", "GPT2LMHeadModel"), ("gpt2", "GPT2LMHeadModel"), + ("gpt_bigcode", "GPTBigCodeForCausalLM"), ("gpt_neo", "GPTNeoForCausalLM"), ("gpt_neox", "GPTNeoXForCausalLM"), ("gpt_neox_japanese", "GPTNeoXJapaneseForCausalLM"), ("gptj", "GPTJForCausalLM"), + ("gptsan-japanese", "GPTSanJapaneseForConditionalGeneration"), ("ibert", "IBertForMaskedLM"), ("layoutlm", "LayoutLMForMaskedLM"), ("led", "LEDForConditionalGeneration"), @@ -294,16 +365,20 @@ ("luke", "LukeForMaskedLM"), ("m2m_100", "M2M100ForConditionalGeneration"), ("marian", "MarianMTModel"), + ("mega", "MegaForMaskedLM"), ("megatron-bert", "MegatronBertForCausalLM"), ("mobilebert", "MobileBertForMaskedLM"), ("mpnet", "MPNetForMaskedLM"), + ("mpt", "MptForCausalLM"), + ("mra", "MraForMaskedLM"), ("mvp", "MvpForConditionalGeneration"), ("nezha", "NezhaForMaskedLM"), - ("nllb", "M2M100ForConditionalGeneration"), + ("nllb-moe", "NllbMoeForConditionalGeneration"), ("nystromformer", "NystromformerForMaskedLM"), ("openai-gpt", "OpenAIGPTLMHeadModel"), ("pegasus_x", "PegasusXForConditionalGeneration"), ("plbart", "PLBartForConditionalGeneration"), + ("pop2piano", "Pop2PianoForConditionalGeneration"), ("qdqbert", "QDQBertForMaskedLM"), ("reformer", "ReformerModelWithLMHead"), ("rembert", "RemBertForMaskedLM"), @@ -311,6 +386,7 @@ ("roberta-prelayernorm", "RobertaPreLayerNormForMaskedLM"), ("roc_bert", "RoCBertForMaskedLM"), ("roformer", 
"RoFormerForMaskedLM"), + ("rwkv", "RwkvForCausalLM"), ("speech_to_text", "Speech2TextForConditionalGeneration"), ("squeezebert", "SqueezeBertForMaskedLM"), ("switch_transformers", "SwitchTransformersForConditionalGeneration"), @@ -341,37 +417,55 @@ ("blenderbot-small", "BlenderbotSmallForCausalLM"), ("bloom", "BloomForCausalLM"), ("camembert", "CamembertForCausalLM"), + ("code_llama", "LlamaForCausalLM"), ("codegen", "CodeGenForCausalLM"), + ("cpmant", "CpmAntForCausalLM"), ("ctrl", "CTRLLMHeadModel"), ("data2vec-text", "Data2VecTextForCausalLM"), ("electra", "ElectraForCausalLM"), ("ernie", "ErnieForCausalLM"), + ("falcon", "FalconForCausalLM"), + ("fuyu", "FuyuForCausalLM"), ("git", "GitForCausalLM"), ("gpt-sw3", "GPT2LMHeadModel"), ("gpt2", "GPT2LMHeadModel"), + ("gpt_bigcode", "GPTBigCodeForCausalLM"), ("gpt_neo", "GPTNeoForCausalLM"), ("gpt_neox", "GPTNeoXForCausalLM"), ("gpt_neox_japanese", "GPTNeoXJapaneseForCausalLM"), ("gptj", "GPTJForCausalLM"), + ("llama", "LlamaForCausalLM"), ("marian", "MarianForCausalLM"), ("mbart", "MBartForCausalLM"), + ("mega", "MegaForCausalLM"), ("megatron-bert", "MegatronBertForCausalLM"), + ("mistral", "MistralForCausalLM"), + ("mixtral", "MixtralForCausalLM"), + ("mpt", "MptForCausalLM"), + ("musicgen", "MusicgenForCausalLM"), ("mvp", "MvpForCausalLM"), + ("open-llama", "OpenLlamaForCausalLM"), ("openai-gpt", "OpenAIGPTLMHeadModel"), ("opt", "OPTForCausalLM"), ("pegasus", "PegasusForCausalLM"), + ("persimmon", "PersimmonForCausalLM"), + ("phi", "PhiForCausalLM"), ("plbart", "PLBartForCausalLM"), ("prophetnet", "ProphetNetForCausalLM"), ("qdqbert", "QDQBertLMHeadModel"), + ("qwen2", "Qwen2ForCausalLM"), ("reformer", "ReformerModelWithLMHead"), ("rembert", "RemBertForCausalLM"), ("roberta", "RobertaForCausalLM"), ("roberta-prelayernorm", "RobertaPreLayerNormForCausalLM"), ("roc_bert", "RoCBertForCausalLM"), ("roformer", "RoFormerForCausalLM"), + ("rwkv", "RwkvForCausalLM"), ("speech_to_text_2", "Speech2Text2ForCausalLM"), + ("stablelm", "StableLmForCausalLM"), ("transfo-xl", "TransfoXLLMHeadModel"), ("trocr", "TrOCRForCausalLM"), + ("whisper", "WhisperForCausalLM"), ("xglm", "XGLMForCausalLM"), ("xlm", "XLMWithLMHeadModel"), ("xlm-prophetnet", "XLMProphetNetForCausalLM"), @@ -385,6 +479,7 @@ MODEL_FOR_MASKED_IMAGE_MODELING_MAPPING_NAMES = OrderedDict( [ ("deit", "DeiTForMaskedImageModeling"), + ("focalnet", "FocalNetForMaskedImageModeling"), ("swin", "SwinForMaskedImageModeling"), ("swinv2", "Swinv2ForMaskedImageModeling"), ("vit", "ViTForMaskedImageModeling"), @@ -404,11 +499,17 @@ # Model for Image Classification mapping ("beit", "BeitForImageClassification"), ("bit", "BitForImageClassification"), + ("clip", "CLIPForImageClassification"), ("convnext", "ConvNextForImageClassification"), + ("convnextv2", "ConvNextV2ForImageClassification"), ("cvt", "CvtForImageClassification"), ("data2vec-vision", "Data2VecVisionForImageClassification"), - ("deit", ("DeiTForImageClassification", "DeiTForImageClassificationWithTeacher")), + ( + "deit", + ("DeiTForImageClassification", "DeiTForImageClassificationWithTeacher"), + ), ("dinat", "DinatForImageClassification"), + ("dinov2", "Dinov2ForImageClassification"), ( "efficientformer", ( @@ -416,11 +517,17 @@ "EfficientFormerForImageClassificationWithTeacher", ), ), + ("efficientnet", "EfficientNetForImageClassification"), + ("focalnet", "FocalNetForImageClassification"), ("imagegpt", "ImageGPTForImageClassification"), - ("levit", ("LevitForImageClassification", "LevitForImageClassificationWithTeacher")), + ( + 
"levit", + ("LevitForImageClassification", "LevitForImageClassificationWithTeacher"), + ), ("mobilenet_v1", "MobileNetV1ForImageClassification"), ("mobilenet_v2", "MobileNetV2ForImageClassification"), ("mobilevit", "MobileViTForImageClassification"), + ("mobilevitv2", "MobileViTV2ForImageClassification"), ("nat", "NatForImageClassification"), ( "perceiver", @@ -431,9 +538,12 @@ ), ), ("poolformer", "PoolFormerForImageClassification"), + ("pvt", "PvtForImageClassification"), ("regnet", "RegNetForImageClassification"), ("resnet", "ResNetForImageClassification"), ("segformer", "SegformerForImageClassification"), + ("siglip", "SiglipForImageClassification"), + ("swiftformer", "SwiftFormerForImageClassification"), ("swin", "SwinForImageClassification"), ("swinv2", "Swinv2ForImageClassification"), ("van", "VanForImageClassification"), @@ -459,6 +569,7 @@ ("dpt", "DPTForSemanticSegmentation"), ("mobilenet_v2", "MobileNetV2ForSemanticSegmentation"), ("mobilevit", "MobileViTForSemanticSegmentation"), + ("mobilevitv2", "MobileViTV2ForSemanticSegmentation"), ("segformer", "SegformerForSemanticSegmentation"), ("upernet", "UperNetForSemanticSegmentation"), ] @@ -486,11 +597,20 @@ [ ("timesformer", "TimesformerForVideoClassification"), ("videomae", "VideoMAEForVideoClassification"), + ("vivit", "VivitForVideoClassification"), ] ) MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES = OrderedDict( [ + ("blip", "BlipForConditionalGeneration"), + ("blip-2", "Blip2ForConditionalGeneration"), + ("git", "GitForCausalLM"), + ("instructblip", "InstructBlipForConditionalGeneration"), + ("kosmos-2", "Kosmos2ForConditionalGeneration"), + ("llava", "LlavaForConditionalGeneration"), + ("pix2struct", "Pix2StructForConditionalGeneration"), + ("vipllava", "VipLlavaForConditionalGeneration"), ("vision-encoder-decoder", "VisionEncoderDecoderModel"), ] ) @@ -519,9 +639,11 @@ ("longformer", "LongformerForMaskedLM"), ("luke", "LukeForMaskedLM"), ("mbart", "MBartForConditionalGeneration"), + ("mega", "MegaForMaskedLM"), ("megatron-bert", "MegatronBertForMaskedLM"), ("mobilebert", "MobileBertForMaskedLM"), ("mpnet", "MPNetForMaskedLM"), + ("mra", "MraForMaskedLM"), ("mvp", "MvpForConditionalGeneration"), ("nezha", "NezhaForMaskedLM"), ("nystromformer", "NystromformerForMaskedLM"), @@ -559,13 +681,15 @@ MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING_NAMES = OrderedDict( [ # Model for Zero Shot Object Detection mapping - ("owlvit", "OwlViTForObjectDetection") + ("owlv2", "Owlv2ForObjectDetection"), + ("owlvit", "OwlViTForObjectDetection"), ] ) MODEL_FOR_DEPTH_ESTIMATION_MAPPING_NAMES = OrderedDict( [ # Model for depth estimation mapping + ("depth_anything", "DepthAnythingForDepthEstimation"), ("dpt", "DPTForDepthEstimation"), ("glpn", "GLPNForDepthEstimation"), ] @@ -579,6 +703,7 @@ ("blenderbot-small", "BlenderbotSmallForConditionalGeneration"), ("encoder-decoder", "EncoderDecoderModel"), ("fsmt", "FSMTForConditionalGeneration"), + ("gptsan-japanese", "GPTSanJapaneseForConditionalGeneration"), ("led", "LEDForConditionalGeneration"), ("longt5", "LongT5ForConditionalGeneration"), ("m2m_100", "M2M100ForConditionalGeneration"), @@ -586,19 +711,25 @@ ("mbart", "MBartForConditionalGeneration"), ("mt5", "MT5ForConditionalGeneration"), ("mvp", "MvpForConditionalGeneration"), - ("nllb", "M2M100ForConditionalGeneration"), + ("nllb-moe", "NllbMoeForConditionalGeneration"), ("pegasus", "PegasusForConditionalGeneration"), ("pegasus_x", "PegasusXForConditionalGeneration"), ("plbart", "PLBartForConditionalGeneration"), ("prophetnet", 
"ProphetNetForConditionalGeneration"), + ("seamless_m4t", "SeamlessM4TForTextToText"), + ("seamless_m4t_v2", "SeamlessM4Tv2ForTextToText"), ("switch_transformers", "SwitchTransformersForConditionalGeneration"), ("t5", "T5ForConditionalGeneration"), + ("umt5", "UMT5ForConditionalGeneration"), ("xlm-prophetnet", "XLMProphetNetForConditionalGeneration"), ] ) MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES = OrderedDict( [ + ("pop2piano", "Pop2PianoForConditionalGeneration"), + ("seamless_m4t", "SeamlessM4TForSpeechToText"), + ("seamless_m4t_v2", "SeamlessM4Tv2ForSpeechToText"), ("speech-encoder-decoder", "SpeechEncoderDecoderModel"), ("speech_to_text", "Speech2TextForConditionalGeneration"), ("speecht5", "SpeechT5ForSpeechToText"), @@ -614,9 +745,11 @@ ("bert", "BertForSequenceClassification"), ("big_bird", "BigBirdForSequenceClassification"), ("bigbird_pegasus", "BigBirdPegasusForSequenceClassification"), + ("biogpt", "BioGptForSequenceClassification"), ("bloom", "BloomForSequenceClassification"), ("camembert", "CamembertForSequenceClassification"), ("canine", "CanineForSequenceClassification"), + ("code_llama", "LlamaForSequenceClassification"), ("convbert", "ConvBertForSequenceClassification"), ("ctrl", "CTRLForSequenceClassification"), ("data2vec-text", "Data2VecTextForSequenceClassification"), @@ -627,12 +760,15 @@ ("ernie", "ErnieForSequenceClassification"), ("ernie_m", "ErnieMForSequenceClassification"), ("esm", "EsmForSequenceClassification"), + ("falcon", "FalconForSequenceClassification"), ("flaubert", "FlaubertForSequenceClassification"), ("fnet", "FNetForSequenceClassification"), ("funnel", "FunnelForSequenceClassification"), ("gpt-sw3", "GPT2ForSequenceClassification"), ("gpt2", "GPT2ForSequenceClassification"), + ("gpt_bigcode", "GPTBigCodeForSequenceClassification"), ("gpt_neo", "GPTNeoForSequenceClassification"), + ("gpt_neox", "GPTNeoXForSequenceClassification"), ("gptj", "GPTJForSequenceClassification"), ("ibert", "IBertForSequenceClassification"), ("layoutlm", "LayoutLMForSequenceClassification"), @@ -640,21 +776,32 @@ ("layoutlmv3", "LayoutLMv3ForSequenceClassification"), ("led", "LEDForSequenceClassification"), ("lilt", "LiltForSequenceClassification"), + ("llama", "LlamaForSequenceClassification"), ("longformer", "LongformerForSequenceClassification"), ("luke", "LukeForSequenceClassification"), ("markuplm", "MarkupLMForSequenceClassification"), ("mbart", "MBartForSequenceClassification"), + ("mega", "MegaForSequenceClassification"), ("megatron-bert", "MegatronBertForSequenceClassification"), + ("mistral", "MistralForSequenceClassification"), + ("mixtral", "MixtralForSequenceClassification"), ("mobilebert", "MobileBertForSequenceClassification"), ("mpnet", "MPNetForSequenceClassification"), + ("mpt", "MptForSequenceClassification"), + ("mra", "MraForSequenceClassification"), + ("mt5", "MT5ForSequenceClassification"), ("mvp", "MvpForSequenceClassification"), ("nezha", "NezhaForSequenceClassification"), ("nystromformer", "NystromformerForSequenceClassification"), + ("open-llama", "OpenLlamaForSequenceClassification"), ("openai-gpt", "OpenAIGPTForSequenceClassification"), ("opt", "OPTForSequenceClassification"), ("perceiver", "PerceiverForSequenceClassification"), + ("persimmon", "PersimmonForSequenceClassification"), + ("phi", "PhiForSequenceClassification"), ("plbart", "PLBartForSequenceClassification"), ("qdqbert", "QDQBertForSequenceClassification"), + ("qwen2", "Qwen2ForSequenceClassification"), ("reformer", "ReformerForSequenceClassification"), ("rembert", 
"RemBertForSequenceClassification"), ("roberta", "RobertaForSequenceClassification"), @@ -662,8 +809,11 @@ ("roc_bert", "RoCBertForSequenceClassification"), ("roformer", "RoFormerForSequenceClassification"), ("squeezebert", "SqueezeBertForSequenceClassification"), + ("stablelm", "StableLmForSequenceClassification"), + ("t5", "T5ForSequenceClassification"), ("tapas", "TapasForSequenceClassification"), ("transfo-xl", "TransfoXLForSequenceClassification"), + ("umt5", "UMT5ForSequenceClassification"), ("xlm", "XLMForSequenceClassification"), ("xlm-roberta", "XLMRobertaForSequenceClassification"), ("xlm-roberta-xl", "XLMRobertaXLForSequenceClassification"), @@ -692,23 +842,32 @@ ("electra", "ElectraForQuestionAnswering"), ("ernie", "ErnieForQuestionAnswering"), ("ernie_m", "ErnieMForQuestionAnswering"), + ("falcon", "FalconForQuestionAnswering"), ("flaubert", "FlaubertForQuestionAnsweringSimple"), ("fnet", "FNetForQuestionAnswering"), ("funnel", "FunnelForQuestionAnswering"), + ("gpt2", "GPT2ForQuestionAnswering"), + ("gpt_neo", "GPTNeoForQuestionAnswering"), + ("gpt_neox", "GPTNeoXForQuestionAnswering"), ("gptj", "GPTJForQuestionAnswering"), ("ibert", "IBertForQuestionAnswering"), ("layoutlmv2", "LayoutLMv2ForQuestionAnswering"), ("layoutlmv3", "LayoutLMv3ForQuestionAnswering"), ("led", "LEDForQuestionAnswering"), ("lilt", "LiltForQuestionAnswering"), + ("llama", "LlamaForQuestionAnswering"), ("longformer", "LongformerForQuestionAnswering"), ("luke", "LukeForQuestionAnswering"), ("lxmert", "LxmertForQuestionAnswering"), ("markuplm", "MarkupLMForQuestionAnswering"), ("mbart", "MBartForQuestionAnswering"), + ("mega", "MegaForQuestionAnswering"), ("megatron-bert", "MegatronBertForQuestionAnswering"), ("mobilebert", "MobileBertForQuestionAnswering"), ("mpnet", "MPNetForQuestionAnswering"), + ("mpt", "MptForQuestionAnswering"), + ("mra", "MraForQuestionAnswering"), + ("mt5", "MT5ForQuestionAnswering"), ("mvp", "MvpForQuestionAnswering"), ("nezha", "NezhaForQuestionAnswering"), ("nystromformer", "NystromformerForQuestionAnswering"), @@ -722,6 +881,8 @@ ("roformer", "RoFormerForQuestionAnswering"), ("splinter", "SplinterForQuestionAnswering"), ("squeezebert", "SqueezeBertForQuestionAnswering"), + ("t5", "T5ForQuestionAnswering"), + ("umt5", "UMT5ForQuestionAnswering"), ("xlm", "XLMForQuestionAnsweringSimple"), ("xlm-roberta", "XLMRobertaForQuestionAnswering"), ("xlm-roberta-xl", "XLMRobertaXLForQuestionAnswering"), @@ -740,6 +901,7 @@ MODEL_FOR_VISUAL_QUESTION_ANSWERING_MAPPING_NAMES = OrderedDict( [ + ("blip-2", "Blip2ForConditionalGeneration"), ("vilt", "ViltForQuestionAnswering"), ] ) @@ -758,7 +920,9 @@ ("albert", "AlbertForTokenClassification"), ("bert", "BertForTokenClassification"), ("big_bird", "BigBirdForTokenClassification"), + ("biogpt", "BioGptForTokenClassification"), ("bloom", "BloomForTokenClassification"), + ("bros", "BrosForTokenClassification"), ("camembert", "CamembertForTokenClassification"), ("canine", "CanineForTokenClassification"), ("convbert", "ConvBertForTokenClassification"), @@ -770,11 +934,15 @@ ("ernie", "ErnieForTokenClassification"), ("ernie_m", "ErnieMForTokenClassification"), ("esm", "EsmForTokenClassification"), + ("falcon", "FalconForTokenClassification"), ("flaubert", "FlaubertForTokenClassification"), ("fnet", "FNetForTokenClassification"), ("funnel", "FunnelForTokenClassification"), ("gpt-sw3", "GPT2ForTokenClassification"), ("gpt2", "GPT2ForTokenClassification"), + ("gpt_bigcode", "GPTBigCodeForTokenClassification"), + ("gpt_neo", 
"GPTNeoForTokenClassification"), + ("gpt_neox", "GPTNeoXForTokenClassification"), ("ibert", "IBertForTokenClassification"), ("layoutlm", "LayoutLMForTokenClassification"), ("layoutlmv2", "LayoutLMv2ForTokenClassification"), @@ -783,11 +951,16 @@ ("longformer", "LongformerForTokenClassification"), ("luke", "LukeForTokenClassification"), ("markuplm", "MarkupLMForTokenClassification"), + ("mega", "MegaForTokenClassification"), ("megatron-bert", "MegatronBertForTokenClassification"), ("mobilebert", "MobileBertForTokenClassification"), ("mpnet", "MPNetForTokenClassification"), + ("mpt", "MptForTokenClassification"), + ("mra", "MraForTokenClassification"), + ("mt5", "MT5ForTokenClassification"), ("nezha", "NezhaForTokenClassification"), ("nystromformer", "NystromformerForTokenClassification"), + ("phi", "PhiForTokenClassification"), ("qdqbert", "QDQBertForTokenClassification"), ("rembert", "RemBertForTokenClassification"), ("roberta", "RobertaForTokenClassification"), @@ -795,6 +968,8 @@ ("roc_bert", "RoCBertForTokenClassification"), ("roformer", "RoFormerForTokenClassification"), ("squeezebert", "SqueezeBertForTokenClassification"), + ("t5", "T5ForTokenClassification"), + ("umt5", "UMT5ForTokenClassification"), ("xlm", "XLMForTokenClassification"), ("xlm-roberta", "XLMRobertaForTokenClassification"), ("xlm-roberta-xl", "XLMRobertaXLForTokenClassification"), @@ -825,9 +1000,11 @@ ("ibert", "IBertForMultipleChoice"), ("longformer", "LongformerForMultipleChoice"), ("luke", "LukeForMultipleChoice"), + ("mega", "MegaForMultipleChoice"), ("megatron-bert", "MegatronBertForMultipleChoice"), ("mobilebert", "MobileBertForMultipleChoice"), ("mpnet", "MPNetForMultipleChoice"), + ("mra", "MraForMultipleChoice"), ("nezha", "NezhaForMultipleChoice"), ("nystromformer", "NystromformerForMultipleChoice"), ("qdqbert", "QDQBertForMultipleChoice"), @@ -869,8 +1046,10 @@ ("unispeech", "UniSpeechForSequenceClassification"), ("unispeech-sat", "UniSpeechSatForSequenceClassification"), ("wav2vec2", "Wav2Vec2ForSequenceClassification"), + ("wav2vec2-bert", "Wav2Vec2BertForSequenceClassification"), ("wav2vec2-conformer", "Wav2Vec2ConformerForSequenceClassification"), ("wavlm", "WavLMForSequenceClassification"), + ("whisper", "WhisperForAudioClassification"), ] ) @@ -885,6 +1064,7 @@ ("unispeech", "UniSpeechForCTC"), ("unispeech-sat", "UniSpeechSatForCTC"), ("wav2vec2", "Wav2Vec2ForCTC"), + ("wav2vec2-bert", "Wav2Vec2BertForCTC"), ("wav2vec2-conformer", "Wav2Vec2ConformerForCTC"), ("wavlm", "WavLMForCTC"), ] @@ -896,6 +1076,7 @@ ("data2vec-audio", "Data2VecAudioForAudioFrameClassification"), ("unispeech-sat", "UniSpeechSatForAudioFrameClassification"), ("wav2vec2", "Wav2Vec2ForAudioFrameClassification"), + ("wav2vec2-bert", "Wav2Vec2BertForAudioFrameClassification"), ("wav2vec2-conformer", "Wav2Vec2ConformerForAudioFrameClassification"), ("wavlm", "WavLMForAudioFrameClassification"), ] @@ -907,32 +1088,119 @@ ("data2vec-audio", "Data2VecAudioForXVector"), ("unispeech-sat", "UniSpeechSatForXVector"), ("wav2vec2", "Wav2Vec2ForXVector"), + ("wav2vec2-bert", "Wav2Vec2BertForXVector"), ("wav2vec2-conformer", "Wav2Vec2ConformerForXVector"), ("wavlm", "WavLMForXVector"), ] ) -_MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING_NAMES = OrderedDict( +MODEL_FOR_TEXT_TO_SPECTROGRAM_MAPPING_NAMES = OrderedDict( + [ + # Model for Text-To-Spectrogram mapping + ("fastspeech2_conformer", "FastSpeech2ConformerModel"), + ("speecht5", "SpeechT5ForTextToSpeech"), + ] +) + +MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING_NAMES = OrderedDict( + [ + # Model 
for Text-To-Waveform mapping + ("bark", "BarkModel"), + ("fastspeech2_conformer", "FastSpeech2ConformerWithHifiGan"), + ("musicgen", "MusicgenForConditionalGeneration"), + ("seamless_m4t", "SeamlessM4TForTextToSpeech"), + ("seamless_m4t_v2", "SeamlessM4Tv2ForTextToSpeech"), + ("vits", "VitsModel"), + ] +) + +MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING_NAMES = OrderedDict( [ # Model for Zero Shot Image Classification mapping + ("align", "AlignModel"), ("altclip", "AltCLIPModel"), ("blip", "BlipModel"), ("chinese_clip", "ChineseCLIPModel"), ("clip", "CLIPModel"), ("clipseg", "CLIPSegModel"), + ("siglip", "SiglipModel"), ] ) MODEL_FOR_BACKBONE_MAPPING_NAMES = OrderedDict( [ # Backbone mapping + ("beit", "BeitBackbone"), ("bit", "BitBackbone"), ("convnext", "ConvNextBackbone"), + ("convnextv2", "ConvNextV2Backbone"), ("dinat", "DinatBackbone"), + ("dinov2", "Dinov2Backbone"), + ("focalnet", "FocalNetBackbone"), ("maskformer-swin", "MaskFormerSwinBackbone"), ("nat", "NatBackbone"), ("resnet", "ResNetBackbone"), ("swin", "SwinBackbone"), + ("swinv2", "Swinv2Backbone"), + ("timm_backbone", "TimmBackbone"), + ("vitdet", "VitDetBackbone"), + ] +) + +MODEL_FOR_MASK_GENERATION_MAPPING_NAMES = OrderedDict( + [ + ("sam", "SamModel"), + ] +) + +MODEL_FOR_TEXT_ENCODING_MAPPING_NAMES = OrderedDict( + [ + ("albert", "AlbertModel"), + ("bert", "BertModel"), + ("big_bird", "BigBirdModel"), + ("data2vec-text", "Data2VecTextModel"), + ("deberta", "DebertaModel"), + ("deberta-v2", "DebertaV2Model"), + ("distilbert", "DistilBertModel"), + ("electra", "ElectraModel"), + ("flaubert", "FlaubertModel"), + ("ibert", "IBertModel"), + ("longformer", "LongformerModel"), + ("mobilebert", "MobileBertModel"), + ("mt5", "MT5EncoderModel"), + ("nystromformer", "NystromformerModel"), + ("reformer", "ReformerModel"), + ("rembert", "RemBertModel"), + ("roberta", "RobertaModel"), + ("roberta-prelayernorm", "RobertaPreLayerNormModel"), + ("roc_bert", "RoCBertModel"), + ("roformer", "RoFormerModel"), + ("squeezebert", "SqueezeBertModel"), + ("t5", "T5EncoderModel"), + ("umt5", "UMT5EncoderModel"), + ("xlm", "XLMModel"), + ("xlm-roberta", "XLMRobertaModel"), + ("xlm-roberta-xl", "XLMRobertaXLModel"), + ] +) + +MODEL_FOR_TIME_SERIES_CLASSIFICATION_MAPPING_NAMES = OrderedDict( + [ + ("patchtsmixer", "PatchTSMixerForTimeSeriesClassification"), + ("patchtst", "PatchTSTForClassification"), + ] +) + +MODEL_FOR_TIME_SERIES_REGRESSION_MAPPING_NAMES = OrderedDict( + [ + ("patchtsmixer", "PatchTSMixerForRegression"), + ("patchtst", "PatchTSTForRegression"), + ] +) + +MODEL_FOR_IMAGE_TO_IMAGE_MAPPING_NAMES = OrderedDict( + [ + ("swin2sr", "Swin2SRForImageSuperResolution"), ] ) @@ -946,6 +1214,9 @@ MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING = _LazyAutoMapping( CONFIG_MAPPING_NAMES, MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES ) +MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING = _LazyAutoMapping( + CONFIG_MAPPING_NAMES, MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING_NAMES +) MODEL_FOR_IMAGE_SEGMENTATION_MAPPING = _LazyAutoMapping( CONFIG_MAPPING_NAMES, MODEL_FOR_IMAGE_SEGMENTATION_MAPPING_NAMES ) @@ -1006,8 +1277,40 @@ ) MODEL_FOR_AUDIO_XVECTOR_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_AUDIO_XVECTOR_MAPPING_NAMES) +MODEL_FOR_TEXT_TO_SPECTROGRAM_MAPPING = _LazyAutoMapping( + CONFIG_MAPPING_NAMES, MODEL_FOR_TEXT_TO_SPECTROGRAM_MAPPING_NAMES +) + +MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING_NAMES) + MODEL_FOR_BACKBONE_MAPPING = 
_LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_BACKBONE_MAPPING_NAMES) +MODEL_FOR_MASK_GENERATION_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_MASK_GENERATION_MAPPING_NAMES) + +MODEL_FOR_TEXT_ENCODING_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_TEXT_ENCODING_MAPPING_NAMES) + +MODEL_FOR_TIME_SERIES_CLASSIFICATION_MAPPING = _LazyAutoMapping( + CONFIG_MAPPING_NAMES, MODEL_FOR_TIME_SERIES_CLASSIFICATION_MAPPING_NAMES +) + +MODEL_FOR_TIME_SERIES_REGRESSION_MAPPING = _LazyAutoMapping( + CONFIG_MAPPING_NAMES, MODEL_FOR_TIME_SERIES_REGRESSION_MAPPING_NAMES +) + +MODEL_FOR_IMAGE_TO_IMAGE_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_IMAGE_TO_IMAGE_MAPPING_NAMES) + + +class AutoModelForMaskGeneration(_BaseAutoModelClass): + _model_mapping = MODEL_FOR_MASK_GENERATION_MAPPING + + +class AutoModelForTextEncoding(_BaseAutoModelClass): + _model_mapping = MODEL_FOR_TEXT_ENCODING_MAPPING + + +class AutoModelForImageToImage(_BaseAutoModelClass): + _model_mapping = MODEL_FOR_IMAGE_TO_IMAGE_MAPPING + class AutoModel(_BaseAutoModelClass): _model_mapping = MODEL_MAPPING @@ -1050,7 +1353,9 @@ class AutoModelForSeq2SeqLM(_BaseAutoModelClass): AutoModelForSeq2SeqLM = auto_class_update( - AutoModelForSeq2SeqLM, head_doc="sequence-to-sequence language modeling", checkpoint_for_example="t5-base" + AutoModelForSeq2SeqLM, + head_doc="sequence-to-sequence language modeling", + checkpoint_for_example="google-t5/t5-base", ) @@ -1133,6 +1438,15 @@ class AutoModelForImageClassification(_BaseAutoModelClass): AutoModelForImageClassification = auto_class_update(AutoModelForImageClassification, head_doc="image classification") +class AutoModelForZeroShotImageClassification(_BaseAutoModelClass): + _model_mapping = MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING + + +AutoModelForZeroShotImageClassification = auto_class_update( + AutoModelForZeroShotImageClassification, head_doc="zero-shot image classification" +) + + class AutoModelForImageSegmentation(_BaseAutoModelClass): _model_mapping = MODEL_FOR_IMAGE_SEGMENTATION_MAPPING @@ -1240,7 +1554,15 @@ class AutoModelForAudioXVector(_BaseAutoModelClass): _model_mapping = MODEL_FOR_AUDIO_XVECTOR_MAPPING -class AutoBackbone(_BaseAutoModelClass): +class AutoModelForTextToSpectrogram(_BaseAutoModelClass): + _model_mapping = MODEL_FOR_TEXT_TO_SPECTROGRAM_MAPPING + + +class AutoModelForTextToWaveform(_BaseAutoModelClass): + _model_mapping = MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING + + +class AutoBackbone(_BaseAutoBackboneClass): _model_mapping = MODEL_FOR_BACKBONE_MAPPING diff --git a/src/transformers/models/auto/modeling_flax_auto.py b/src/transformers/models/auto/modeling_flax_auto.py index 61d34f0f082675..785035b98fb74e 100644 --- a/src/transformers/models/auto/modeling_flax_auto.py +++ b/src/transformers/models/auto/modeling_flax_auto.py @@ -35,6 +35,7 @@ ("big_bird", "FlaxBigBirdModel"), ("blenderbot", "FlaxBlenderbotModel"), ("blenderbot-small", "FlaxBlenderbotSmallModel"), + ("bloom", "FlaxBloomModel"), ("clip", "FlaxCLIPModel"), ("distilbert", "FlaxDistilBertModel"), ("electra", "FlaxElectraModel"), @@ -42,12 +43,16 @@ ("gpt2", "FlaxGPT2Model"), ("gpt_neo", "FlaxGPTNeoModel"), ("gptj", "FlaxGPTJModel"), + ("llama", "FlaxLlamaModel"), ("longt5", "FlaxLongT5Model"), ("marian", "FlaxMarianModel"), ("mbart", "FlaxMBartModel"), + ("mistral", "FlaxMistralModel"), ("mt5", "FlaxMT5Model"), ("opt", "FlaxOPTModel"), ("pegasus", "FlaxPegasusModel"), + ("regnet", "FlaxRegNetModel"), + ("resnet", "FlaxResNetModel"), ("roberta", "FlaxRobertaModel"), 
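The new auto classes declared above (`AutoModelForMaskGeneration`, `AutoModelForTextEncoding`, `AutoModelForImageToImage`, `AutoModelForZeroShotImageClassification`, `AutoModelForTextToSpectrogram`, `AutoModelForTextToWaveform`) follow the usual auto-class pattern: the checkpoint's config type selects the concrete model from the corresponding `_LazyAutoMapping`. A brief sketch, with checkpoint ids given only as examples rather than anything prescribed by the patch:

```python
from transformers import AutoModelForMaskGeneration, AutoModelForZeroShotImageClassification

# SAM checkpoints resolve to SamModel through MODEL_FOR_MASK_GENERATION_MAPPING.
sam = AutoModelForMaskGeneration.from_pretrained("facebook/sam-vit-base")

# CLIP checkpoints resolve to CLIPModel through MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING.
clip = AutoModelForZeroShotImageClassification.from_pretrained("openai/clip-vit-base-patch32")

print(type(sam).__name__, type(clip).__name__)  # SamModel CLIPModel
```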
("roberta-prelayernorm", "FlaxRobertaPreLayerNormModel"), ("roformer", "FlaxRoFormerModel"), @@ -55,6 +60,7 @@ ("vision-text-dual-encoder", "FlaxVisionTextDualEncoderModel"), ("vit", "FlaxViTModel"), ("wav2vec2", "FlaxWav2Vec2Model"), + ("whisper", "FlaxWhisperModel"), ("xglm", "FlaxXGLMModel"), ("xlm-roberta", "FlaxXLMRobertaModel"), ] @@ -76,6 +82,7 @@ ("roformer", "FlaxRoFormerForMaskedLM"), ("t5", "FlaxT5ForConditionalGeneration"), ("wav2vec2", "FlaxWav2Vec2ForPreTraining"), + ("whisper", "FlaxWhisperForConditionalGeneration"), ("xlm-roberta", "FlaxXLMRobertaForMaskedLM"), ] ) @@ -117,6 +124,8 @@ [ # Model for Image-classsification ("beit", "FlaxBeitForImageClassification"), + ("regnet", "FlaxRegNetForImageClassification"), + ("resnet", "FlaxResNetForImageClassification"), ("vit", "FlaxViTForImageClassification"), ] ) @@ -133,11 +142,14 @@ ("bart", "FlaxBartForCausalLM"), ("bert", "FlaxBertForCausalLM"), ("big_bird", "FlaxBigBirdForCausalLM"), + ("bloom", "FlaxBloomForCausalLM"), ("electra", "FlaxElectraForCausalLM"), ("gpt-sw3", "FlaxGPT2LMHeadModel"), ("gpt2", "FlaxGPT2LMHeadModel"), ("gpt_neo", "FlaxGPTNeoForCausalLM"), ("gptj", "FlaxGPTJForCausalLM"), + ("llama", "FlaxLlamaForCausalLM"), + ("mistral", "FlaxMistralForCausalLM"), ("opt", "FlaxOPTForCausalLM"), ("roberta", "FlaxRobertaForCausalLM"), ("roberta-prelayernorm", "FlaxRobertaPreLayerNormForCausalLM"), @@ -219,9 +231,15 @@ FLAX_MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES = OrderedDict( [ ("speech-encoder-decoder", "FlaxSpeechEncoderDecoderModel"), + ("whisper", "FlaxWhisperForConditionalGeneration"), ] ) +FLAX_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES = OrderedDict( + [ + ("whisper", "FlaxWhisperForAudioClassification"), + ] +) FLAX_MODEL_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, FLAX_MODEL_MAPPING_NAMES) FLAX_MODEL_FOR_PRETRAINING_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, FLAX_MODEL_FOR_PRETRAINING_MAPPING_NAMES) @@ -252,6 +270,9 @@ FLAX_MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING = _LazyAutoMapping( CONFIG_MAPPING_NAMES, FLAX_MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES ) +FLAX_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING = _LazyAutoMapping( + CONFIG_MAPPING_NAMES, FLAX_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES +) class FlaxAutoModel(_BaseAutoModelClass): @@ -287,7 +308,9 @@ class FlaxAutoModelForSeq2SeqLM(_BaseAutoModelClass): FlaxAutoModelForSeq2SeqLM = auto_class_update( - FlaxAutoModelForSeq2SeqLM, head_doc="sequence-to-sequence language modeling", checkpoint_for_example="t5-base" + FlaxAutoModelForSeq2SeqLM, + head_doc="sequence-to-sequence language modeling", + checkpoint_for_example="google-t5/t5-base", ) diff --git a/src/transformers/models/auto/modeling_tf_auto.py b/src/transformers/models/auto/modeling_tf_auto.py index fcb8c5e5d5fda3..deed743162e477 100644 --- a/src/transformers/models/auto/modeling_tf_auto.py +++ b/src/transformers/models/auto/modeling_tf_auto.py @@ -34,10 +34,12 @@ ("bert", "TFBertModel"), ("blenderbot", "TFBlenderbotModel"), ("blenderbot-small", "TFBlenderbotSmallModel"), + ("blip", "TFBlipModel"), ("camembert", "TFCamembertModel"), ("clip", "TFCLIPModel"), ("convbert", "TFConvBertModel"), ("convnext", "TFConvNextModel"), + ("convnextv2", "TFConvNextV2Model"), ("ctrl", "TFCTRLModel"), ("cvt", "TFCvtModel"), ("data2vec-vision", "TFData2VecVisionModel"), @@ -46,6 +48,7 @@ ("deit", "TFDeiTModel"), ("distilbert", "TFDistilBertModel"), ("dpr", "TFDPRQuestionEncoder"), + ("efficientformer", "TFEfficientFormerModel"), ("electra", "TFElectraModel"), ("esm", "TFEsmModel"), ("flaubert", 
"TFFlaubertModel"), @@ -75,12 +78,14 @@ ("roberta", "TFRobertaModel"), ("roberta-prelayernorm", "TFRobertaPreLayerNormModel"), ("roformer", "TFRoFormerModel"), + ("sam", "TFSamModel"), ("segformer", "TFSegformerModel"), ("speech_to_text", "TFSpeech2TextModel"), ("swin", "TFSwinModel"), ("t5", "TFT5Model"), ("tapas", "TFTapasModel"), ("transfo-xl", "TFTransfoXLModel"), + ("vision-text-dual-encoder", "TFVisionTextDualEncoderModel"), ("vit", "TFViTModel"), ("vit_mae", "TFViTMAEModel"), ("wav2vec2", "TFWav2Vec2Model"), @@ -196,9 +201,14 @@ [ # Model for Image-classsification ("convnext", "TFConvNextForImageClassification"), + ("convnextv2", "TFConvNextV2ForImageClassification"), ("cvt", "TFCvtForImageClassification"), ("data2vec-vision", "TFData2VecVisionForImageClassification"), ("deit", ("TFDeiTForImageClassification", "TFDeiTForImageClassificationWithTeacher")), + ( + "efficientformer", + ("TFEfficientFormerForImageClassification", "TFEfficientFormerForImageClassificationWithTeacher"), + ), ("mobilevit", "TFMobileViTForImageClassification"), ("regnet", "TFRegNetForImageClassification"), ("resnet", "TFResNetForImageClassification"), @@ -208,6 +218,16 @@ ] ) + +TF_MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING_NAMES = OrderedDict( + [ + # Model for Zero Shot Image Classification mapping + ("blip", "TFBlipModel"), + ("clip", "TFCLIPModel"), + ] +) + + TF_MODEL_FOR_SEMANTIC_SEGMENTATION_MAPPING_NAMES = OrderedDict( [ # Model for Semantic Segmentation mapping @@ -219,6 +239,7 @@ TF_MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES = OrderedDict( [ + ("blip", "TFBlipForConditionalGeneration"), ("vision-encoder-decoder", "TFVisionEncoderDecoderModel"), ] ) @@ -338,10 +359,12 @@ ("xlnet", "TFXLNetForQuestionAnsweringSimple"), ] ) +TF_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES = OrderedDict([("wav2vec2", "TFWav2Vec2ForSequenceClassification")]) TF_MODEL_FOR_DOCUMENT_QUESTION_ANSWERING_MAPPING_NAMES = OrderedDict( [ ("layoutlm", "TFLayoutLMForQuestionAnswering"), + ("layoutlmv3", "TFLayoutLMv3ForQuestionAnswering"), ] ) @@ -389,6 +412,7 @@ ("bert", "TFBertForMultipleChoice"), ("camembert", "TFCamembertForMultipleChoice"), ("convbert", "TFConvBertForMultipleChoice"), + ("deberta-v2", "TFDebertaV2ForMultipleChoice"), ("distilbert", "TFDistilBertForMultipleChoice"), ("electra", "TFElectraForMultipleChoice"), ("flaubert", "TFFlaubertForMultipleChoice"), @@ -412,6 +436,33 @@ ("mobilebert", "TFMobileBertForNextSentencePrediction"), ] ) +TF_MODEL_FOR_MASK_GENERATION_MAPPING_NAMES = OrderedDict( + [ + ("sam", "TFSamModel"), + ] +) +TF_MODEL_FOR_TEXT_ENCODING_MAPPING_NAMES = OrderedDict( + [ + ("albert", "TFAlbertModel"), + ("bert", "TFBertModel"), + ("convbert", "TFConvBertModel"), + ("deberta", "TFDebertaModel"), + ("deberta-v2", "TFDebertaV2Model"), + ("distilbert", "TFDistilBertModel"), + ("electra", "TFElectraModel"), + ("flaubert", "TFFlaubertModel"), + ("longformer", "TFLongformerModel"), + ("mobilebert", "TFMobileBertModel"), + ("mt5", "TFMT5EncoderModel"), + ("rembert", "TFRemBertModel"), + ("roberta", "TFRobertaModel"), + ("roberta-prelayernorm", "TFRobertaPreLayerNormModel"), + ("roformer", "TFRoFormerModel"), + ("t5", "TFT5EncoderModel"), + ("xlm", "TFXLMModel"), + ("xlm-roberta", "TFXLMRobertaModel"), + ] +) TF_MODEL_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, TF_MODEL_MAPPING_NAMES) TF_MODEL_FOR_PRETRAINING_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, TF_MODEL_FOR_PRETRAINING_MAPPING_NAMES) @@ -423,6 +474,9 @@ TF_MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING = _LazyAutoMapping( CONFIG_MAPPING_NAMES, 
TF_MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING_NAMES ) +TF_MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING = _LazyAutoMapping( + CONFIG_MAPPING_NAMES, TF_MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING_NAMES +) TF_MODEL_FOR_SEMANTIC_SEGMENTATION_MAPPING = _LazyAutoMapping( CONFIG_MAPPING_NAMES, TF_MODEL_FOR_SEMANTIC_SEGMENTATION_MAPPING_NAMES ) @@ -455,6 +509,23 @@ TF_MODEL_FOR_NEXT_SENTENCE_PREDICTION_MAPPING = _LazyAutoMapping( CONFIG_MAPPING_NAMES, TF_MODEL_FOR_NEXT_SENTENCE_PREDICTION_MAPPING_NAMES ) +TF_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING = _LazyAutoMapping( + CONFIG_MAPPING_NAMES, TF_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES +) + +TF_MODEL_FOR_MASK_GENERATION_MAPPING = _LazyAutoMapping( + CONFIG_MAPPING_NAMES, TF_MODEL_FOR_MASK_GENERATION_MAPPING_NAMES +) + +TF_MODEL_FOR_TEXT_ENCODING_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, TF_MODEL_FOR_TEXT_ENCODING_MAPPING_NAMES) + + +class TFAutoModelForMaskGeneration(_BaseAutoModelClass): + _model_mapping = TF_MODEL_FOR_MASK_GENERATION_MAPPING + + +class TFAutoModelForTextEncoding(_BaseAutoModelClass): + _model_mapping = TF_MODEL_FOR_TEXT_ENCODING_MAPPING class TFAutoModel(_BaseAutoModelClass): @@ -464,6 +535,15 @@ class TFAutoModel(_BaseAutoModelClass): TFAutoModel = auto_class_update(TFAutoModel) +class TFAutoModelForAudioClassification(_BaseAutoModelClass): + _model_mapping = TF_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING + + +TFAutoModelForAudioClassification = auto_class_update( + TFAutoModelForAudioClassification, head_doc="audio classification" +) + + class TFAutoModelForPreTraining(_BaseAutoModelClass): _model_mapping = TF_MODEL_FOR_PRETRAINING_MAPPING @@ -504,11 +584,20 @@ class TFAutoModelForImageClassification(_BaseAutoModelClass): ) +class TFAutoModelForZeroShotImageClassification(_BaseAutoModelClass): + _model_mapping = TF_MODEL_FOR_ZERO_SHOT_IMAGE_CLASSIFICATION_MAPPING + + +TFAutoModelForZeroShotImageClassification = auto_class_update( + TFAutoModelForZeroShotImageClassification, head_doc="zero-shot image classification" +) + + class TFAutoModelForSemanticSegmentation(_BaseAutoModelClass): _model_mapping = TF_MODEL_FOR_SEMANTIC_SEGMENTATION_MAPPING -TF_AutoModelForSemanticSegmentation = auto_class_update( +TFAutoModelForSemanticSegmentation = auto_class_update( TFAutoModelForSemanticSegmentation, head_doc="semantic segmentation" ) @@ -532,7 +621,9 @@ class TFAutoModelForSeq2SeqLM(_BaseAutoModelClass): TFAutoModelForSeq2SeqLM = auto_class_update( - TFAutoModelForSeq2SeqLM, head_doc="sequence-to-sequence language modeling", checkpoint_for_example="t5-base" + TFAutoModelForSeq2SeqLM, + head_doc="sequence-to-sequence language modeling", + checkpoint_for_example="google-t5/t5-base", ) diff --git a/src/transformers/models/auto/processing_auto.py b/src/transformers/models/auto/processing_auto.py index bac6fe78c192fc..e41e39e56eeea2 100644 --- a/src/transformers/models/auto/processing_auto.py +++ b/src/transformers/models/auto/processing_auto.py @@ -16,15 +16,18 @@ import importlib import inspect import json +import os +import warnings from collections import OrderedDict # Build the list of all feature extractors from ...configuration_utils import PretrainedConfig -from ...dynamic_module_utils import get_class_from_dynamic_module +from ...dynamic_module_utils import get_class_from_dynamic_module, resolve_trust_remote_code from ...feature_extraction_utils import FeatureExtractionMixin from ...image_processing_utils import ImageProcessingMixin +from ...processing_utils import ProcessorMixin from ...tokenization_utils import 
TOKENIZER_CONFIG_FILE -from ...utils import FEATURE_EXTRACTOR_NAME, get_file_from_repo, logging +from ...utils import FEATURE_EXTRACTOR_NAME, PROCESSOR_NAME, get_file_from_repo, logging from .auto_factory import _LazyAutoMapping from .configuration_auto import ( CONFIG_MAPPING_NAMES, @@ -41,7 +44,9 @@ PROCESSOR_MAPPING_NAMES = OrderedDict( [ + ("align", "AlignProcessor"), ("altclip", "AltCLIPProcessor"), + ("bark", "BarkProcessor"), ("blip", "BlipProcessor"), ("blip-2", "Blip2Processor"), ("bridgetower", "BridgeTowerProcessor"), @@ -49,29 +54,45 @@ ("clap", "ClapProcessor"), ("clip", "CLIPProcessor"), ("clipseg", "CLIPSegProcessor"), + ("clvp", "ClvpProcessor"), ("flava", "FlavaProcessor"), - ("git", "GITProcessor"), + ("fuyu", "FuyuProcessor"), + ("git", "GitProcessor"), ("groupvit", "CLIPProcessor"), ("hubert", "Wav2Vec2Processor"), + ("idefics", "IdeficsProcessor"), + ("instructblip", "InstructBlipProcessor"), + ("kosmos-2", "Kosmos2Processor"), ("layoutlmv2", "LayoutLMv2Processor"), ("layoutlmv3", "LayoutLMv3Processor"), - ("layoutxlm", "LayoutXLMProcessor"), + ("llava", "LlavaProcessor"), ("markuplm", "MarkupLMProcessor"), + ("mctct", "MCTCTProcessor"), + ("mgp-str", "MgpstrProcessor"), ("oneformer", "OneFormerProcessor"), + ("owlv2", "Owlv2Processor"), ("owlvit", "OwlViTProcessor"), + ("pix2struct", "Pix2StructProcessor"), + ("pop2piano", "Pop2PianoProcessor"), + ("sam", "SamProcessor"), + ("seamless_m4t", "SeamlessM4TProcessor"), ("sew", "Wav2Vec2Processor"), ("sew-d", "Wav2Vec2Processor"), + ("siglip", "SiglipProcessor"), ("speech_to_text", "Speech2TextProcessor"), ("speech_to_text_2", "Speech2Text2Processor"), ("speecht5", "SpeechT5Processor"), ("trocr", "TrOCRProcessor"), + ("tvlt", "TvltProcessor"), + ("tvp", "TvpProcessor"), ("unispeech", "Wav2Vec2Processor"), ("unispeech-sat", "Wav2Vec2Processor"), ("vilt", "ViltProcessor"), + ("vipllava", "LlavaProcessor"), ("vision-text-dual-encoder", "VisionTextDualEncoderProcessor"), ("wav2vec2", "Wav2Vec2Processor"), + ("wav2vec2-bert", "Wav2Vec2Processor"), ("wav2vec2-conformer", "Wav2Vec2Processor"), - ("wav2vec2_with_lm", "Wav2Vec2ProcessorWithLM"), ("wavlm", "Wav2Vec2Processor"), ("whisper", "WhisperProcessor"), ("xclip", "XCLIPProcessor"), @@ -135,8 +156,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): This can be either: - a string, the *model id* of a pretrained feature_extractor hosted inside a model repo on - huggingface.co. Valid model ids can be located at the root-level, like `bert-base-uncased`, or - namespaced under a user or organization name, like `dbmdz/bert-base-german-cased`. + huggingface.co. - a path to a *directory* containing a processor files saved using the `save_pretrained()` method, e.g., `./my_model_directory/`. cache_dir (`str` or `os.PathLike`, *optional*): @@ -151,7 +171,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): proxies (`Dict[str, str]`, *optional*): A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request. - use_auth_token (`str` or *bool*, *optional*): + token (`str` or *bool*, *optional*): The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated when running `huggingface-cli login` (stored in `~/.huggingface`). 
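As the updated docstring above states, `token` supersedes the deprecated `use_auth_token` argument; the deprecation shim a few lines below emits a `FutureWarning` and rejects passing both. A minimal sketch of the preferred call:

```python
from transformers import AutoProcessor

# Preferred: pass `token` (True reuses the token cached by `huggingface-cli login`).
processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h", token=True)

# Deprecated but still accepted: `use_auth_token=...` triggers a FutureWarning and is
# copied into `token`; supplying both keyword arguments raises a ValueError.
```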
revision (`str`, *optional*, defaults to `"main"`): @@ -174,7 +194,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): - Passing `use_auth_token=True` is required when you want to use a private model. + Passing `token=True` is required when you want to use a private model. @@ -187,36 +207,62 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): >>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h") >>> # If processor files are in a directory (e.g. processor was saved using *save_pretrained('./test/saved_model/')*) - >>> processor = AutoProcessor.from_pretrained("./test/saved_model/") + >>> # processor = AutoProcessor.from_pretrained("./test/saved_model/") ```""" + use_auth_token = kwargs.pop("use_auth_token", None) + if use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.", + FutureWarning, + ) + if kwargs.get("token", None) is not None: + raise ValueError( + "`token` and `use_auth_token` are both specified. Please set only the argument `token`." + ) + kwargs["token"] = use_auth_token + config = kwargs.pop("config", None) - trust_remote_code = kwargs.pop("trust_remote_code", False) + trust_remote_code = kwargs.pop("trust_remote_code", None) kwargs["_from_auto"] = True processor_class = None processor_auto_map = None - # First, let's see if we have a preprocessor config. + # First, let's see if we have a processor or preprocessor config. # Filter the kwargs for `get_file_from_repo`. get_file_from_repo_kwargs = { key: kwargs[key] for key in inspect.signature(get_file_from_repo).parameters.keys() if key in kwargs } - # Let's start by checking whether the processor class is saved in an image processor - preprocessor_config_file = get_file_from_repo( - pretrained_model_name_or_path, FEATURE_EXTRACTOR_NAME, **get_file_from_repo_kwargs + + # Let's start by checking whether the processor class is saved in a processor config + processor_config_file = get_file_from_repo( + pretrained_model_name_or_path, PROCESSOR_NAME, **get_file_from_repo_kwargs ) - if preprocessor_config_file is not None: - config_dict, _ = ImageProcessingMixin.get_image_processor_dict(pretrained_model_name_or_path, **kwargs) + if processor_config_file is not None: + config_dict, _ = ProcessorMixin.get_processor_dict(pretrained_model_name_or_path, **kwargs) processor_class = config_dict.get("processor_class", None) if "AutoProcessor" in config_dict.get("auto_map", {}): processor_auto_map = config_dict["auto_map"]["AutoProcessor"] - # If not found, let's check whether the processor class is saved in a feature extractor config - if preprocessor_config_file is not None and processor_class is None: - config_dict, _ = FeatureExtractionMixin.get_feature_extractor_dict(pretrained_model_name_or_path, **kwargs) - processor_class = config_dict.get("processor_class", None) - if "AutoProcessor" in config_dict.get("auto_map", {}): - processor_auto_map = config_dict["auto_map"]["AutoProcessor"] + if processor_class is None: + # If not found, let's check whether the processor class is saved in an image processor config + preprocessor_config_file = get_file_from_repo( + pretrained_model_name_or_path, FEATURE_EXTRACTOR_NAME, **get_file_from_repo_kwargs + ) + if preprocessor_config_file is not None: + config_dict, _ = ImageProcessingMixin.get_image_processor_dict(pretrained_model_name_or_path, **kwargs) + processor_class = config_dict.get("processor_class", None) + if "AutoProcessor" 
in config_dict.get("auto_map", {}): + processor_auto_map = config_dict["auto_map"]["AutoProcessor"] + + # If not found, let's check whether the processor class is saved in a feature extractor config + if preprocessor_config_file is not None and processor_class is None: + config_dict, _ = FeatureExtractionMixin.get_feature_extractor_dict( + pretrained_model_name_or_path, **kwargs + ) + processor_class = config_dict.get("processor_class", None) + if "AutoProcessor" in config_dict.get("auto_map", {}): + processor_auto_map = config_dict["auto_map"]["AutoProcessor"] if processor_class is None: # Next, let's check whether the processor class is saved in a tokenizer @@ -244,34 +290,30 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): processor_auto_map = config.auto_map["AutoProcessor"] if processor_class is not None: - # If we have custom code for a feature extractor, we get the proper class. - if processor_auto_map is not None: - if not trust_remote_code: - raise ValueError( - f"Loading {pretrained_model_name_or_path} requires you to execute the feature extractor file " - "in that repo on your local machine. Make sure you have read the code there to avoid " - "malicious use, then set the option `trust_remote_code=True` to remove this error." - ) - if kwargs.get("revision", None) is None: - logger.warning( - "Explicitly passing a `revision` is encouraged when loading a feature extractor with custom " - "code to ensure no malicious code has been contributed in a newer revision." - ) - - module_file, class_name = processor_auto_map.split(".") - processor_class = get_class_from_dynamic_module( - pretrained_model_name_or_path, module_file + ".py", class_name, **kwargs - ) - processor_class.register_for_auto_class() - else: - processor_class = processor_class_from_name(processor_class) + processor_class = processor_class_from_name(processor_class) + + has_remote_code = processor_auto_map is not None + has_local_code = processor_class is not None or type(config) in PROCESSOR_MAPPING + trust_remote_code = resolve_trust_remote_code( + trust_remote_code, pretrained_model_name_or_path, has_local_code, has_remote_code + ) + if has_remote_code and trust_remote_code: + processor_class = get_class_from_dynamic_module( + processor_auto_map, pretrained_model_name_or_path, **kwargs + ) + _ = kwargs.pop("code_revision", None) + if os.path.isdir(pretrained_model_name_or_path): + processor_class.register_for_auto_class() + return processor_class.from_pretrained( + pretrained_model_name_or_path, trust_remote_code=trust_remote_code, **kwargs + ) + elif processor_class is not None: return processor_class.from_pretrained( pretrained_model_name_or_path, trust_remote_code=trust_remote_code, **kwargs ) - # Last try: we use the PROCESSOR_MAPPING. - if type(config) in PROCESSOR_MAPPING: + elif type(config) in PROCESSOR_MAPPING: return PROCESSOR_MAPPING[type(config)].from_pretrained(pretrained_model_name_or_path, **kwargs) # At this stage, there doesn't seem to be a `Processor` class available for this model, so let's try a @@ -297,12 +339,12 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): raise ValueError( f"Unrecognized processing class in {pretrained_model_name_or_path}. Can't instantiate a processor, a " - "tokenizer, an image processor or a feature extractor for this model. Make sure the repository contains" + "tokenizer, an image processor or a feature extractor for this model. Make sure the repository contains " "the files of at least one of those processing classes." 
) @staticmethod - def register(config_class, processor_class): + def register(config_class, processor_class, exist_ok=False): """ Register a new processor for this class. @@ -311,4 +353,4 @@ def register(config_class, processor_class): The configuration corresponding to the model to register. processor_class ([`FeatureExtractorMixin`]): The processor to register. """ - PROCESSOR_MAPPING.register(config_class, processor_class) + PROCESSOR_MAPPING.register(config_class, processor_class, exist_ok=exist_ok) diff --git a/src/transformers/models/auto/tokenization_auto.py b/src/transformers/models/auto/tokenization_auto.py index 7dadfa5fdf58f6..83bb7041d3942b 100644 --- a/src/transformers/models/auto/tokenization_auto.py +++ b/src/transformers/models/auto/tokenization_auto.py @@ -17,15 +17,22 @@ import importlib import json import os +import warnings from collections import OrderedDict from typing import TYPE_CHECKING, Dict, Optional, Tuple, Union from ...configuration_utils import PretrainedConfig -from ...dynamic_module_utils import get_class_from_dynamic_module +from ...dynamic_module_utils import get_class_from_dynamic_module, resolve_trust_remote_code from ...tokenization_utils import PreTrainedTokenizer from ...tokenization_utils_base import TOKENIZER_CONFIG_FILE -from ...tokenization_utils_fast import PreTrainedTokenizerFast -from ...utils import cached_file, extract_commit_hash, is_sentencepiece_available, is_tokenizers_available, logging +from ...utils import ( + cached_file, + extract_commit_hash, + is_g2p_en_available, + is_sentencepiece_available, + is_tokenizers_available, + logging, +) from ..encoder_decoder import EncoderDecoderConfig from .auto_factory import _LazyAutoMapping from .configuration_auto import ( @@ -37,6 +44,12 @@ ) +if is_tokenizers_available(): + from ...tokenization_utils_fast import PreTrainedTokenizerFast +else: + PreTrainedTokenizerFast = None + + logger = logging.get_logger(__name__) if TYPE_CHECKING: @@ -53,6 +66,8 @@ "AlbertTokenizerFast" if is_tokenizers_available() else None, ), ), + ("align", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)), + ("bark", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)), ("bart", ("BartTokenizer", "BartTokenizerFast")), ( "barthez", @@ -81,6 +96,7 @@ ("blip-2", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)), ("bloom", (None, "BloomTokenizerFast" if is_tokenizers_available() else None)), ("bridgetower", ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None)), + ("bros", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)), ("byt5", ("ByT5Tokenizer", None)), ( "camembert", @@ -112,6 +128,14 @@ "CLIPTokenizerFast" if is_tokenizers_available() else None, ), ), + ("clvp", ("ClvpTokenizer", None)), + ( + "code_llama", + ( + "CodeLlamaTokenizer" if is_sentencepiece_available() else None, + "CodeLlamaTokenizerFast" if is_tokenizers_available() else None, + ), + ), ("codegen", ("CodeGenTokenizer", "CodeGenTokenizerFast" if is_tokenizers_available() else None)), ("convbert", ("ConvBertTokenizer", "ConvBertTokenizerFast" if is_tokenizers_available() else None)), ( @@ -121,7 +145,9 @@ "CpmTokenizerFast" if is_tokenizers_available() else None, ), ), + ("cpmant", ("CpmAntTokenizer", None)), ("ctrl", ("CTRLTokenizer", None)), + ("data2vec-audio", ("Wav2Vec2CTCTokenizer", None)), ("data2vec-text", ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None)), ("deberta", 
("DebertaTokenizer", "DebertaTokenizerFast" if is_tokenizers_available() else None)), ( @@ -143,6 +169,11 @@ ("ernie", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)), ("ernie_m", ("ErnieMTokenizer" if is_sentencepiece_available() else None, None)), ("esm", ("EsmTokenizer", None)), + ("falcon", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)), + ( + "fastspeech2_conformer", + ("FastSpeech2ConformerTokenizer" if is_g2p_en_available() else None, None), + ), ("flaubert", ("FlaubertTokenizer", None)), ("fnet", ("FNetTokenizer", "FNetTokenizerFast" if is_tokenizers_available() else None)), ("fsmt", ("FSMTTokenizer", None)), @@ -150,21 +181,40 @@ ("git", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)), ("gpt-sw3", ("GPTSw3Tokenizer" if is_sentencepiece_available() else None, None)), ("gpt2", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)), + ("gpt_bigcode", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)), ("gpt_neo", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)), ("gpt_neox", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)), ("gpt_neox_japanese", ("GPTNeoXJapaneseTokenizer", None)), ("gptj", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)), + ("gptsan-japanese", ("GPTSanJapaneseTokenizer", None)), ("groupvit", ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None)), ("herbert", ("HerbertTokenizer", "HerbertTokenizerFast" if is_tokenizers_available() else None)), ("hubert", ("Wav2Vec2CTCTokenizer", None)), ("ibert", ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None)), + ("idefics", (None, "LlamaTokenizerFast" if is_tokenizers_available() else None)), + ("instructblip", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)), ("jukebox", ("JukeboxTokenizer", None)), + ( + "kosmos-2", + ( + "XLMRobertaTokenizer" if is_sentencepiece_available() else None, + "XLMRobertaTokenizerFast" if is_tokenizers_available() else None, + ), + ), ("layoutlm", ("LayoutLMTokenizer", "LayoutLMTokenizerFast" if is_tokenizers_available() else None)), ("layoutlmv2", ("LayoutLMv2Tokenizer", "LayoutLMv2TokenizerFast" if is_tokenizers_available() else None)), ("layoutlmv3", ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast" if is_tokenizers_available() else None)), ("layoutxlm", ("LayoutXLMTokenizer", "LayoutXLMTokenizerFast" if is_tokenizers_available() else None)), ("led", ("LEDTokenizer", "LEDTokenizerFast" if is_tokenizers_available() else None)), ("lilt", ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast" if is_tokenizers_available() else None)), + ( + "llama", + ( + "LlamaTokenizer" if is_sentencepiece_available() else None, + "LlamaTokenizerFast" if is_tokenizers_available() else None, + ), + ), + ("llava", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)), ("longformer", ("LongformerTokenizer", "LongformerTokenizerFast" if is_tokenizers_available() else None)), ( "longt5", @@ -191,10 +241,28 @@ "MBart50TokenizerFast" if is_tokenizers_available() else None, ), ), + ("mega", ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None)), ("megatron-bert", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)), + ("mgp-str", ("MgpstrTokenizer", None)), + ( + "mistral", + ( + "LlamaTokenizer" if is_sentencepiece_available() else None, + 
"LlamaTokenizerFast" if is_tokenizers_available() else None, + ), + ), + ( + "mixtral", + ( + "LlamaTokenizer" if is_sentencepiece_available() else None, + "LlamaTokenizerFast" if is_tokenizers_available() else None, + ), + ), ("mluke", ("MLukeTokenizer" if is_sentencepiece_available() else None, None)), ("mobilebert", ("MobileBertTokenizer", "MobileBertTokenizerFast" if is_tokenizers_available() else None)), ("mpnet", ("MPNetTokenizer", "MPNetTokenizerFast" if is_tokenizers_available() else None)), + ("mpt", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)), + ("mra", ("RobertaTokenizer", "RobertaTokenizerFast" if is_tokenizers_available() else None)), ( "mt5", ( @@ -202,6 +270,7 @@ "MT5TokenizerFast" if is_tokenizers_available() else None, ), ), + ("musicgen", ("T5Tokenizer", "T5TokenizerFast" if is_tokenizers_available() else None)), ("mvp", ("MvpTokenizer", "MvpTokenizerFast" if is_tokenizers_available() else None)), ("nezha", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)), ( @@ -211,6 +280,13 @@ "NllbTokenizerFast" if is_tokenizers_available() else None, ), ), + ( + "nllb-moe", + ( + "NllbTokenizer" if is_sentencepiece_available() else None, + "NllbTokenizerFast" if is_tokenizers_available() else None, + ), + ), ( "nystromformer", ( @@ -219,8 +295,12 @@ ), ), ("oneformer", ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None)), - ("openai-gpt", ("OpenAIGPTTokenizer", "OpenAIGPTTokenizerFast" if is_tokenizers_available() else None)), + ( + "openai-gpt", + ("OpenAIGPTTokenizer", "OpenAIGPTTokenizerFast" if is_tokenizers_available() else None), + ), ("opt", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)), + ("owlv2", ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None)), ("owlvit", ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None)), ( "pegasus", @@ -243,10 +323,26 @@ None, ), ), + ( + "persimmon", + ( + "LlamaTokenizer" if is_sentencepiece_available() else None, + "LlamaTokenizerFast" if is_tokenizers_available() else None, + ), + ), + ("phi", ("CodeGenTokenizer", "CodeGenTokenizerFast" if is_tokenizers_available() else None)), ("phobert", ("PhobertTokenizer", None)), + ("pix2struct", ("T5Tokenizer", "T5TokenizerFast" if is_tokenizers_available() else None)), ("plbart", ("PLBartTokenizer" if is_sentencepiece_available() else None, None)), ("prophetnet", ("ProphetNetTokenizer", None)), ("qdqbert", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)), + ( + "qwen2", + ( + "Qwen2Tokenizer", + "Qwen2TokenizerFast" if is_tokenizers_available() else None, + ), + ), ("rag", ("RagTokenizer", None)), ("realm", ("RealmTokenizer", "RealmTokenizerFast" if is_tokenizers_available() else None)), ( @@ -271,6 +367,22 @@ ), ("roc_bert", ("RoCBertTokenizer", None)), ("roformer", ("RoFormerTokenizer", "RoFormerTokenizerFast" if is_tokenizers_available() else None)), + ("rwkv", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)), + ( + "seamless_m4t", + ( + "SeamlessM4TTokenizer" if is_sentencepiece_available() else None, + "SeamlessM4TTokenizerFast" if is_tokenizers_available() else None, + ), + ), + ( + "seamless_m4t_v2", + ( + "SeamlessM4TTokenizer" if is_sentencepiece_available() else None, + "SeamlessM4TTokenizerFast" if is_tokenizers_available() else None, + ), + ), + ("siglip", ("SiglipTokenizer" if is_sentencepiece_available() else None, None)), ("speech_to_text", ("Speech2TextTokenizer" if 
is_sentencepiece_available() else None, None)), ("speech_to_text_2", ("Speech2Text2Tokenizer", None)), ("speecht5", ("SpeechT5Tokenizer" if is_sentencepiece_available() else None, None)), @@ -279,6 +391,7 @@ "squeezebert", ("SqueezeBertTokenizer", "SqueezeBertTokenizerFast" if is_tokenizers_available() else None), ), + ("stablelm", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)), ( "switch_transformers", ( @@ -296,12 +409,23 @@ ("tapas", ("TapasTokenizer", None)), ("tapex", ("TapexTokenizer", None)), ("transfo-xl", ("TransfoXLTokenizer", None)), + ("tvp", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)), + ( + "umt5", + ( + "T5Tokenizer" if is_sentencepiece_available() else None, + "T5TokenizerFast" if is_tokenizers_available() else None, + ), + ), ("vilt", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)), + ("vipllava", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)), ("visual_bert", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)), + ("vits", ("VitsTokenizer", None)), ("wav2vec2", ("Wav2Vec2CTCTokenizer", None)), + ("wav2vec2-bert", ("Wav2Vec2CTCTokenizer", None)), ("wav2vec2-conformer", ("Wav2Vec2CTCTokenizer", None)), ("wav2vec2_phoneme", ("Wav2Vec2PhonemeCTCTokenizer", None)), - ("whisper", ("WhisperTokenizer" if is_sentencepiece_available() else None, None)), + ("whisper", ("WhisperTokenizer", "WhisperTokenizerFast" if is_tokenizers_available() else None)), ("xclip", ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None)), ( "xglm", @@ -389,7 +513,7 @@ def get_tokenizer_config( force_download: bool = False, resume_download: bool = False, proxies: Optional[Dict[str, str]] = None, - use_auth_token: Optional[Union[bool, str]] = None, + token: Optional[Union[bool, str]] = None, revision: Optional[str] = None, local_files_only: bool = False, subfolder: str = "", @@ -403,8 +527,7 @@ def get_tokenizer_config( This can be either: - a string, the *model id* of a pretrained model configuration hosted inside a model repo on - huggingface.co. Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced - under a user or organization name, like `dbmdz/bert-base-german-cased`. + huggingface.co. - a path to a *directory* containing a configuration file saved using the [`~PreTrainedTokenizer.save_pretrained`] method, e.g., `./my_model_directory/`. @@ -419,7 +542,7 @@ def get_tokenizer_config( proxies (`Dict[str, str]`, *optional*): A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}.` The proxies are used on each request. - use_auth_token (`str` or *bool*, *optional*): + token (`str` or *bool*, *optional*): The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated when running `huggingface-cli login` (stored in `~/.huggingface`). revision (`str`, *optional*, defaults to `"main"`): @@ -434,7 +557,7 @@ def get_tokenizer_config( - Passing `use_auth_token=True` is required when you want to use a private model. + Passing `token=True` is required when you want to use a private model. @@ -445,17 +568,27 @@ def get_tokenizer_config( ```python # Download configuration from huggingface.co and cache. 
- tokenizer_config = get_tokenizer_config("bert-base-uncased") + tokenizer_config = get_tokenizer_config("google-bert/bert-base-uncased") # This model does not have a tokenizer config so the result will be an empty dict. - tokenizer_config = get_tokenizer_config("xlm-roberta-base") + tokenizer_config = get_tokenizer_config("FacebookAI/xlm-roberta-base") # Save a pretrained tokenizer locally and you can reload its config from transformers import AutoTokenizer - tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") + tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased") tokenizer.save_pretrained("tokenizer-test") tokenizer_config = get_tokenizer_config("tokenizer-test") ```""" + use_auth_token = kwargs.pop("use_auth_token", None) + if use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.", + FutureWarning, + ) + if token is not None: + raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.") + token = use_auth_token + commit_hash = kwargs.get("_commit_hash", None) resolved_config_file = cached_file( pretrained_model_name_or_path, @@ -464,10 +597,11 @@ def get_tokenizer_config( force_download=force_download, resume_download=resume_download, proxies=proxies, - use_auth_token=use_auth_token, + token=token, revision=revision, local_files_only=local_files_only, subfolder=subfolder, + _raise_exceptions_for_gated_repo=False, _raise_exceptions_for_missing_entries=False, _raise_exceptions_for_connection_errors=False, _commit_hash=commit_hash, @@ -514,8 +648,6 @@ def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs): Can be either: - A string, the *model id* of a predefined tokenizer hosted inside a model repo on huggingface.co. - Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced under a - user or organization name, like `dbmdz/bert-base-german-cased`. - A path to a *directory* containing vocabulary files required by the tokenizer, for instance saved using the [`~PreTrainedTokenizer.save_pretrained`] method, e.g., `./my_model_directory/`. - A path or url to a single saved vocabulary file if and only if the tokenizer only requires a @@ -524,7 +656,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs): inputs (additional positional arguments, *optional*): Will be passed along to the Tokenizer `__init__()` method. config ([`PretrainedConfig`], *optional*) - The configuration object used to dertermine the tokenizer class to instantiate. + The configuration object used to determine the tokenizer class to instantiate. cache_dir (`str` or `os.PathLike`, *optional*): Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used. @@ -565,23 +697,35 @@ def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs): >>> from transformers import AutoTokenizer >>> # Download vocabulary from huggingface.co and cache. - >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") + >>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") >>> # Download vocabulary from huggingface.co (user-uploaded) and cache. >>> tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased") >>> # If vocabulary files are in a directory (e.g. 
tokenizer was saved using *save_pretrained('./test/saved_model/')*) - >>> tokenizer = AutoTokenizer.from_pretrained("./test/bert_saved_model/") + >>> # tokenizer = AutoTokenizer.from_pretrained("./test/bert_saved_model/") >>> # Download vocabulary from huggingface.co and define model-specific arguments - >>> tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True) + >>> tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base", add_prefix_space=True) ```""" + use_auth_token = kwargs.pop("use_auth_token", None) + if use_auth_token is not None: + warnings.warn( + "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.", + FutureWarning, + ) + if kwargs.get("token", None) is not None: + raise ValueError( + "`token` and `use_auth_token` are both specified. Please set only the argument `token`." + ) + kwargs["token"] = use_auth_token + config = kwargs.pop("config", None) kwargs["_from_auto"] = True use_fast = kwargs.pop("use_fast", True) tokenizer_type = kwargs.pop("tokenizer_type", None) - trust_remote_code = kwargs.pop("trust_remote_code", False) + trust_remote_code = kwargs.pop("trust_remote_code", None) # First, let's see whether the tokenizer_type is passed so that we can leverage it if tokenizer_type is not None: @@ -635,40 +779,38 @@ def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs): if hasattr(config, "auto_map") and "AutoTokenizer" in config.auto_map: tokenizer_auto_map = config.auto_map["AutoTokenizer"] - # If we have the tokenizer class from the tokenizer config or the model config we're good! - if config_tokenizer_class is not None: - tokenizer_class = None - if tokenizer_auto_map is not None: - if not trust_remote_code: - raise ValueError( - f"Loading {pretrained_model_name_or_path} requires you to execute the tokenizer file in that" - " repo on your local machine. Make sure you have read the code there to avoid malicious use," - " then set the option `trust_remote_code=True` to remove this error." - ) - if kwargs.get("revision", None) is None: - logger.warning( - "Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure" - " no malicious code has been contributed in a newer revision." 
- ) - - if use_fast and tokenizer_auto_map[1] is not None: - class_ref = tokenizer_auto_map[1] - else: - class_ref = tokenizer_auto_map[0] + has_remote_code = tokenizer_auto_map is not None + has_local_code = type(config) in TOKENIZER_MAPPING or ( + config_tokenizer_class is not None + and ( + tokenizer_class_from_name(config_tokenizer_class) is not None + or tokenizer_class_from_name(config_tokenizer_class + "Fast") is not None + ) + ) + trust_remote_code = resolve_trust_remote_code( + trust_remote_code, pretrained_model_name_or_path, has_local_code, has_remote_code + ) - module_file, class_name = class_ref.split(".") - tokenizer_class = get_class_from_dynamic_module( - pretrained_model_name_or_path, module_file + ".py", class_name, **kwargs - ) + if has_remote_code and trust_remote_code: + if use_fast and tokenizer_auto_map[1] is not None: + class_ref = tokenizer_auto_map[1] + else: + class_ref = tokenizer_auto_map[0] + tokenizer_class = get_class_from_dynamic_module(class_ref, pretrained_model_name_or_path, **kwargs) + _ = kwargs.pop("code_revision", None) + if os.path.isdir(pretrained_model_name_or_path): tokenizer_class.register_for_auto_class() - - elif use_fast and not config_tokenizer_class.endswith("Fast"): + return tokenizer_class.from_pretrained( + pretrained_model_name_or_path, *inputs, trust_remote_code=trust_remote_code, **kwargs + ) + elif config_tokenizer_class is not None: + tokenizer_class = None + if use_fast and not config_tokenizer_class.endswith("Fast"): tokenizer_class_candidate = f"{config_tokenizer_class}Fast" tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate) if tokenizer_class is None: tokenizer_class_candidate = config_tokenizer_class tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate) - if tokenizer_class is None: raise ValueError( f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported." @@ -706,7 +848,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs): f"Model type should be one of {', '.join(c.__name__ for c in TOKENIZER_MAPPING.keys())}." ) - def register(config_class, slow_tokenizer_class=None, fast_tokenizer_class=None): + def register(config_class, slow_tokenizer_class=None, fast_tokenizer_class=None, exist_ok=False): """ Register a new tokenizer in this mapping. @@ -716,7 +858,7 @@ def register(config_class, slow_tokenizer_class=None, fast_tokenizer_class=None) The configuration corresponding to the model to register. slow_tokenizer_class ([`PretrainedTokenizer`], *optional*): The slow tokenizer to register. - slow_tokenizer_class ([`PretrainedTokenizerFast`], *optional*): + fast_tokenizer_class ([`PretrainedTokenizerFast`], *optional*): The fast tokenizer to register. """ if slow_tokenizer_class is None and fast_tokenizer_class is None: @@ -747,4 +889,4 @@ def register(config_class, slow_tokenizer_class=None, fast_tokenizer_class=None) if fast_tokenizer_class is None: fast_tokenizer_class = existing_fast - TOKENIZER_MAPPING.register(config_class, (slow_tokenizer_class, fast_tokenizer_class)) + TOKENIZER_MAPPING.register(config_class, (slow_tokenizer_class, fast_tokenizer_class), exist_ok=exist_ok) diff --git a/src/transformers/models/autoformer/__init__.py b/src/transformers/models/autoformer/__init__.py new file mode 100644 index 00000000000000..f87bfdea532d61 --- /dev/null +++ b/src/transformers/models/autoformer/__init__.py @@ -0,0 +1,63 @@ +# Copyright 2023 The HuggingFace Team. All rights reserved. 
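For context on the `exist_ok` argument threaded through the `register` methods in this diff: it allows re-registering a config class that already has an entry in the auto mappings instead of raising. A minimal sketch, assuming a `transformers` build that includes these `exist_ok` changes; the `MyBertTokenizer` subclass is hypothetical and not part of this patch:

```python
from transformers import AutoTokenizer, BertConfig, BertTokenizer


class MyBertTokenizer(BertTokenizer):
    """Hypothetical drop-in replacement, used only to illustrate the flag."""


# `bert` already has a tokenizer in the built-in mapping, so this duplicate
# registration would raise a ValueError by default; `exist_ok=True` permits it.
AutoTokenizer.register(BertConfig, slow_tokenizer_class=MyBertTokenizer, exist_ok=True)
```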
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from typing import TYPE_CHECKING + +# rely on isort to merge the imports +from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available + + +_import_structure = { + "configuration_autoformer": [ + "AUTOFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", + "AutoformerConfig", + ], +} + +try: + if not is_torch_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass +else: + _import_structure["modeling_autoformer"] = [ + "AUTOFORMER_PRETRAINED_MODEL_ARCHIVE_LIST", + "AutoformerForPrediction", + "AutoformerModel", + "AutoformerPreTrainedModel", + ] + + +if TYPE_CHECKING: + from .configuration_autoformer import ( + AUTOFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, + AutoformerConfig, + ) + + try: + if not is_torch_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + else: + from .modeling_autoformer import ( + AUTOFORMER_PRETRAINED_MODEL_ARCHIVE_LIST, + AutoformerForPrediction, + AutoformerModel, + AutoformerPreTrainedModel, + ) + +else: + import sys + + sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__) diff --git a/src/transformers/models/autoformer/configuration_autoformer.py b/src/transformers/models/autoformer/configuration_autoformer.py new file mode 100644 index 00000000000000..7604233e327369 --- /dev/null +++ b/src/transformers/models/autoformer/configuration_autoformer.py @@ -0,0 +1,246 @@ +# coding=utf-8 +# Copyright 2023 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Autoformer model configuration""" + +from typing import List, Optional + +from ...configuration_utils import PretrainedConfig +from ...utils import logging + + +logger = logging.get_logger(__name__) + +AUTOFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = { + "huggingface/autoformer-tourism-monthly": "https://huggingface.co/huggingface/autoformer-tourism-monthly/resolve/main/config.json", +} + + +class AutoformerConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of an [`AutoformerModel`]. It is used to instantiate an + Autoformer model according to the specified arguments, defining the model architecture. Instantiating a + configuration with the defaults will yield a similar configuration to that of the Autoformer + [huggingface/autoformer-tourism-monthly](https://huggingface.co/huggingface/autoformer-tourism-monthly) + architecture. 
+ + Configuration objects inherit from [`PretrainedConfig`] can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + prediction_length (`int`): + The prediction length for the decoder. In other words, the prediction horizon of the model. + context_length (`int`, *optional*, defaults to `prediction_length`): + The context length for the encoder. If unset, the context length will be the same as the + `prediction_length`. + distribution_output (`string`, *optional*, defaults to `"student_t"`): + The distribution emission head for the model. Could be either "student_t", "normal" or "negative_binomial". + loss (`string`, *optional*, defaults to `"nll"`): + The loss function for the model corresponding to the `distribution_output` head. For parametric + distributions it is the negative log likelihood (nll) - which currently is the only supported one. + input_size (`int`, *optional*, defaults to 1): + The size of the target variable which by default is 1 for univariate targets. Would be > 1 in case of + multivariate targets. + lags_sequence (`list[int]`, *optional*, defaults to `[1, 2, 3, 4, 5, 6, 7]`): + The lags of the input time series as covariates often dictated by the frequency. Default is `[1, 2, 3, 4, + 5, 6, 7]`. + scaling (`bool`, *optional* defaults to `True`): + Whether to scale the input targets. + num_time_features (`int`, *optional*, defaults to 0): + The number of time features in the input time series. + num_dynamic_real_features (`int`, *optional*, defaults to 0): + The number of dynamic real valued features. + num_static_categorical_features (`int`, *optional*, defaults to 0): + The number of static categorical features. + num_static_real_features (`int`, *optional*, defaults to 0): + The number of static real valued features. + cardinality (`list[int]`, *optional*): + The cardinality (number of different values) for each of the static categorical features. Should be a list + of integers, having the same length as `num_static_categorical_features`. Cannot be `None` if + `num_static_categorical_features` is > 0. + embedding_dimension (`list[int]`, *optional*): + The dimension of the embedding for each of the static categorical features. Should be a list of integers, + having the same length as `num_static_categorical_features`. Cannot be `None` if + `num_static_categorical_features` is > 0. + d_model (`int`, *optional*, defaults to 64): + Dimensionality of the transformer layers. + encoder_layers (`int`, *optional*, defaults to 2): + Number of encoder layers. + decoder_layers (`int`, *optional*, defaults to 2): + Number of decoder layers. + encoder_attention_heads (`int`, *optional*, defaults to 2): + Number of attention heads for each attention layer in the Transformer encoder. + decoder_attention_heads (`int`, *optional*, defaults to 2): + Number of attention heads for each attention layer in the Transformer decoder. + encoder_ffn_dim (`int`, *optional*, defaults to 32): + Dimension of the "intermediate" (often named feed-forward) layer in encoder. + decoder_ffn_dim (`int`, *optional*, defaults to 32): + Dimension of the "intermediate" (often named feed-forward) layer in decoder. + activation_function (`str` or `function`, *optional*, defaults to `"gelu"`): + The non-linear activation function (function or string) in the encoder and decoder. If string, `"gelu"` and + `"relu"` are supported. 
+ dropout (`float`, *optional*, defaults to 0.1): + The dropout probability for all fully connected layers in the encoder, and decoder. + encoder_layerdrop (`float`, *optional*, defaults to 0.1): + The dropout probability for the attention and fully connected layers for each encoder layer. + decoder_layerdrop (`float`, *optional*, defaults to 0.1): + The dropout probability for the attention and fully connected layers for each decoder layer. + attention_dropout (`float`, *optional*, defaults to 0.1): + The dropout probability for the attention probabilities. + activation_dropout (`float`, *optional*, defaults to 0.1): + The dropout probability used between the two layers of the feed-forward networks. + num_parallel_samples (`int`, *optional*, defaults to 100): + The number of samples to generate in parallel for each time step of inference. + init_std (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated normal weight initialization distribution. + use_cache (`bool`, *optional*, defaults to `True`): + Whether to use the past key/values attentions (if applicable to the model) to speed up decoding. + label_length (`int`, *optional*, defaults to 10): + Start token length of the Autoformer decoder, which is used for direct multi-step prediction (i.e. + non-autoregressive generation). + moving_average (`int`, defaults to 25): + The window size of the moving average. In practice, it's the kernel size in AvgPool1d of the Decomposition + Layer. + autocorrelation_factor (`int`, defaults to 3): + "Attention" (i.e. AutoCorrelation mechanism) factor which is used to find top k autocorrelations delays. + It's recommended in the paper to set it to a number between 1 and 5. + + + Example: + + ```python + >>> from transformers import AutoformerConfig, AutoformerModel + + >>> # Initializing a default Autoformer configuration + >>> configuration = AutoformerConfig() + + >>> # Randomly initializing a model (with random weights) from the configuration + >>> model = AutoformerModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + + model_type = "autoformer" + attribute_map = { + "hidden_size": "d_model", + "num_attention_heads": "encoder_attention_heads", + "num_hidden_layers": "encoder_layers", + } + + def __init__( + self, + prediction_length: Optional[int] = None, + context_length: Optional[int] = None, + distribution_output: str = "student_t", + loss: str = "nll", + input_size: int = 1, + lags_sequence: List[int] = [1, 2, 3, 4, 5, 6, 7], + scaling: bool = True, + num_time_features: int = 0, + num_dynamic_real_features: int = 0, + num_static_categorical_features: int = 0, + num_static_real_features: int = 0, + cardinality: Optional[List[int]] = None, + embedding_dimension: Optional[List[int]] = None, + d_model: int = 64, + encoder_attention_heads: int = 2, + decoder_attention_heads: int = 2, + encoder_layers: int = 2, + decoder_layers: int = 2, + encoder_ffn_dim: int = 32, + decoder_ffn_dim: int = 32, + activation_function: str = "gelu", + dropout: float = 0.1, + encoder_layerdrop: float = 0.1, + decoder_layerdrop: float = 0.1, + attention_dropout: float = 0.1, + activation_dropout: float = 0.1, + num_parallel_samples: int = 100, + init_std: float = 0.02, + use_cache: bool = True, + is_encoder_decoder=True, + # Autoformer arguments + label_length: int = 10, + moving_average: int = 25, + autocorrelation_factor: int = 3, + **kwargs, + ): + # time series specific configuration + self.prediction_length = prediction_length + 
self.context_length = context_length if context_length is not None else prediction_length + self.distribution_output = distribution_output + self.loss = loss + self.input_size = input_size + self.num_time_features = num_time_features + self.lags_sequence = lags_sequence + self.scaling = scaling + self.num_dynamic_real_features = num_dynamic_real_features + self.num_static_real_features = num_static_real_features + self.num_static_categorical_features = num_static_categorical_features + if cardinality is not None and num_static_categorical_features > 0: + if len(cardinality) != num_static_categorical_features: + raise ValueError( + "The cardinality should be a list of the same length as `num_static_categorical_features`" + ) + self.cardinality = cardinality + else: + self.cardinality = [0] + if embedding_dimension is not None and num_static_categorical_features > 0: + if len(embedding_dimension) != num_static_categorical_features: + raise ValueError( + "The embedding dimension should be a list of the same length as `num_static_categorical_features`" + ) + self.embedding_dimension = embedding_dimension + else: + self.embedding_dimension = [min(50, (cat + 1) // 2) for cat in self.cardinality] + self.num_parallel_samples = num_parallel_samples + + # Transformer architecture configuration + self.feature_size = input_size * len(self.lags_sequence) + self._number_of_features + self.d_model = d_model + self.encoder_attention_heads = encoder_attention_heads + self.decoder_attention_heads = decoder_attention_heads + self.encoder_ffn_dim = encoder_ffn_dim + self.decoder_ffn_dim = decoder_ffn_dim + self.encoder_layers = encoder_layers + self.decoder_layers = decoder_layers + + self.dropout = dropout + self.attention_dropout = attention_dropout + self.activation_dropout = activation_dropout + self.encoder_layerdrop = encoder_layerdrop + self.decoder_layerdrop = decoder_layerdrop + + self.activation_function = activation_function + self.init_std = init_std + + self.use_cache = use_cache + + # Autoformer + self.label_length = label_length + self.moving_average = moving_average + self.autocorrelation_factor = autocorrelation_factor + + super().__init__(is_encoder_decoder=is_encoder_decoder, **kwargs) + + @property + def _number_of_features(self) -> int: + return ( + sum(self.embedding_dimension) + + self.num_dynamic_real_features + + self.num_time_features + + self.num_static_real_features + + self.input_size * 2 # the log1p(abs(loc)) and log(scale) features + ) diff --git a/src/transformers/models/autoformer/modeling_autoformer.py b/src/transformers/models/autoformer/modeling_autoformer.py new file mode 100644 index 00000000000000..3fb9fac5caaa5f --- /dev/null +++ b/src/transformers/models/autoformer/modeling_autoformer.py @@ -0,0 +1,2117 @@ +# coding=utf-8 +# Copyright (c) 2021 THUML @ Tsinghua University +# Copyright 2023 Amazon.com, Inc. or its affiliates. All Rights Reserved. +# Copyright 2023 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
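As a quick sanity check on the feature bookkeeping in `AutoformerConfig` above, a minimal sketch using only the defaults shown (the arithmetic mirrors `feature_size = input_size * len(lags_sequence) + _number_of_features`):

```python
from transformers import AutoformerConfig

config = AutoformerConfig()

# With the defaults there are no categorical, static real, dynamic real, or time
# features, so `_number_of_features` reduces to `input_size * 2` (the log1p(abs(loc))
# and log(scale) features) and `feature_size` is 1 * 7 + 2 = 9.
assert config.feature_size == config.input_size * len(config.lags_sequence) + 2
```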
+""" PyTorch Autoformer model.""" + +import math +from dataclasses import dataclass +from typing import List, Optional, Tuple, Union + +import numpy as np +import torch +import torch.utils.checkpoint +from torch import nn + +from ...activations import ACT2FN +from ...modeling_attn_mask_utils import _prepare_4d_attention_mask +from ...modeling_outputs import ( + BaseModelOutput, + ModelOutput, + SampleTSPredictionOutput, + Seq2SeqTSPredictionOutput, +) +from ...modeling_utils import PreTrainedModel +from ...time_series_utils import NegativeBinomialOutput, NormalOutput, StudentTOutput +from ...utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings +from .configuration_autoformer import AutoformerConfig + + +logger = logging.get_logger(__name__) + +_CONFIG_FOR_DOC = "AutoformerConfig" + + +@dataclass +class AutoFormerDecoderOutput(ModelOutput): + """ + Base class for model's outputs that may also contain a past key/values (to speed up sequential decoding). + + Args: + last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + + If `past_key_values` is used only the last hidden-state of the sequences of shape `(batch_size, 1, + hidden_size)` is output. + trend (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Trend tensor for each time series. + past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): + Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape + `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and optionally if + `config.is_encoder_decoder=True` 2 additional tensors of shape `(batch_size, num_heads, + encoder_sequence_length, embed_size_per_head)`. + + Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if + `config.is_encoder_decoder=True` in the cross-attention blocks) that can be used (see `past_key_values` + input) to speed up sequential decoding. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. + attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` and `config.add_cross_attention=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights of the decoder's cross-attention layer, after the attention softmax, used to compute the + weighted average in the cross-attention heads. 
+ """ + + last_hidden_state: torch.FloatTensor = None + trend: torch.FloatTensor = None + past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + cross_attentions: Optional[Tuple[torch.FloatTensor]] = None + + +@dataclass +class AutoformerModelOutput(ModelOutput): + """ + Autoformer model output that contains the additional trend output. + + Args: + last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the decoder of the model. + + If `past_key_values` is used only the last hidden-state of the sequences of shape `(batch_size, 1, + hidden_size)` is output. + trend (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Trend tensor for each time series. + past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): + Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape + `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape + `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`. + + Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention + blocks) that can be used (see `past_key_values` input) to speed up sequential decoding. + decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the decoder at the output of each layer plus the optional initial embedding outputs. + decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the + self-attention heads. + cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights of the decoder's cross-attention layer, after the attention softmax, used to compute the + weighted average in the cross-attention heads. + encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + Sequence of hidden-states at the output of the last layer of the encoder of the model. + encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the encoder at the output of each layer plus the optional initial embedding outputs. 
+ encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the + self-attention heads. + loc (`torch.FloatTensor` of shape `(batch_size,)` or `(batch_size, input_size)`, *optional*): + Shift values of each time series' context window which is used to give the model inputs of the same + magnitude and then used to shift back to the original magnitude. + scale (`torch.FloatTensor` of shape `(batch_size,)` or `(batch_size, input_size)`, *optional*): + Scaling values of each time series' context window which is used to give the model inputs of the same + magnitude and then used to rescale back to the original magnitude. + static_features: (`torch.FloatTensor` of shape `(batch_size, feature size)`, *optional*): + Static features of each time series' in a batch which are copied to the covariates at inference time. + """ + + last_hidden_state: torch.FloatTensor = None + trend: torch.FloatTensor = None + past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None + decoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None + decoder_attentions: Optional[Tuple[torch.FloatTensor]] = None + cross_attentions: Optional[Tuple[torch.FloatTensor]] = None + encoder_last_hidden_state: Optional[torch.FloatTensor] = None + encoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None + encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None + loc: Optional[torch.FloatTensor] = None + scale: Optional[torch.FloatTensor] = None + static_features: Optional[torch.FloatTensor] = None + + +AUTOFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "huggingface/autoformer-tourism-monthly", + # See all Autoformer models at https://huggingface.co/models?filter=autoformer +] + + +# Copied from transformers.models.time_series_transformer.modeling_time_series_transformer.TimeSeriesFeatureEmbedder with TimeSeries->Autoformer +class AutoformerFeatureEmbedder(nn.Module): + """ + Embed a sequence of categorical features. + + Args: + cardinalities (`list[int]`): + List of cardinalities of the categorical features. + embedding_dims (`list[int]`): + List of embedding dimensions of the categorical features. 
+ """ + + def __init__(self, cardinalities: List[int], embedding_dims: List[int]) -> None: + super().__init__() + + self.num_features = len(cardinalities) + self.embedders = nn.ModuleList([nn.Embedding(c, d) for c, d in zip(cardinalities, embedding_dims)]) + + def forward(self, features: torch.Tensor) -> torch.Tensor: + if self.num_features > 1: + # we slice the last dimension, giving an array of length + # self.num_features with shape (N,T) or (N) + cat_feature_slices = torch.chunk(features, self.num_features, dim=-1) + else: + cat_feature_slices = [features] + + return torch.cat( + [ + embed(cat_feature_slice.squeeze(-1)) + for embed, cat_feature_slice in zip(self.embedders, cat_feature_slices) + ], + dim=-1, + ) + + +# Copied from transformers.models.time_series_transformer.modeling_time_series_transformer.TimeSeriesStdScaler with TimeSeriesTransformer->Autoformer,TimeSeries->Autoformer +class AutoformerStdScaler(nn.Module): + """ + Standardize features by calculating the mean and scaling along the first dimension, and then normalizes it by + subtracting from the mean and dividing by the standard deviation. + """ + + def __init__(self, config: AutoformerConfig): + super().__init__() + self.dim = config.scaling_dim if hasattr(config, "scaling_dim") else 1 + self.keepdim = config.keepdim if hasattr(config, "keepdim") else True + self.minimum_scale = config.minimum_scale if hasattr(config, "minimum_scale") else 1e-5 + + def forward( + self, data: torch.Tensor, observed_indicator: torch.Tensor + ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]: + """ + Parameters: + data (`torch.Tensor` of shape `(batch_size, sequence_length, num_input_channels)`): + input for Batch norm calculation + observed_indicator (`torch.BoolTensor` of shape `(batch_size, sequence_length, num_input_channels)`): + Calculating the scale on the observed indicator. + Returns: + tuple of `torch.Tensor` of shapes + (`(batch_size, sequence_length, num_input_channels)`,`(batch_size, 1, num_input_channels)`, + `(batch_size, 1, num_input_channels)`) + """ + denominator = observed_indicator.sum(self.dim, keepdim=self.keepdim) + denominator = denominator.clamp_min(1.0) + loc = (data * observed_indicator).sum(self.dim, keepdim=self.keepdim) / denominator + + variance = (((data - loc) * observed_indicator) ** 2).sum(self.dim, keepdim=self.keepdim) / denominator + scale = torch.sqrt(variance + self.minimum_scale) + return (data - loc) / scale, loc, scale + + +# Copied from transformers.models.time_series_transformer.modeling_time_series_transformer.TimeSeriesMeanScaler with TimeSeriesTransformer->Autoformer,TimeSeries->Autoformer +class AutoformerMeanScaler(nn.Module): + """ + Computes a scaling factor as the weighted average absolute value along the first dimension, and scales the data + accordingly. 
+ """ + + def __init__(self, config: AutoformerConfig): + super().__init__() + self.dim = config.scaling_dim if hasattr(config, "scaling_dim") else 1 + self.keepdim = config.keepdim if hasattr(config, "keepdim") else True + self.minimum_scale = config.minimum_scale if hasattr(config, "minimum_scale") else 1e-10 + self.default_scale = config.default_scale if hasattr(config, "default_scale") else None + + def forward( + self, data: torch.Tensor, observed_indicator: torch.Tensor + ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]: + """ + Parameters: + data (`torch.Tensor` of shape `(batch_size, sequence_length, num_input_channels)`): + input for Batch norm calculation + observed_indicator (`torch.BoolTensor` of shape `(batch_size, sequence_length, num_input_channels)`): + Calculating the scale on the observed indicator. + Returns: + tuple of `torch.Tensor` of shapes + (`(batch_size, sequence_length, num_input_channels)`,`(batch_size, 1, num_input_channels)`, + `(batch_size, 1, num_input_channels)`) + """ + ts_sum = (data * observed_indicator).abs().sum(self.dim, keepdim=True) + num_observed = observed_indicator.sum(self.dim, keepdim=True) + + scale = ts_sum / torch.clamp(num_observed, min=1) + + # If `default_scale` is provided, we use it, otherwise we use the scale + # of the batch. + if self.default_scale is None: + batch_sum = ts_sum.sum(dim=0) + batch_observations = torch.clamp(num_observed.sum(0), min=1) + default_scale = torch.squeeze(batch_sum / batch_observations) + else: + default_scale = self.default_scale * torch.ones_like(scale) + + # apply default scale where there are no observations + scale = torch.where(num_observed > 0, scale, default_scale) + + # ensure the scale is at least `self.minimum_scale` + scale = torch.clamp(scale, min=self.minimum_scale) + scaled_data = data / scale + + if not self.keepdim: + scale = scale.squeeze(dim=self.dim) + + return scaled_data, torch.zeros_like(scale), scale + + +# Copied from transformers.models.time_series_transformer.modeling_time_series_transformer.TimeSeriesNOPScaler with TimeSeriesTransformer->Autoformer,TimeSeries->Autoformer +class AutoformerNOPScaler(nn.Module): + """ + Assigns a scaling factor equal to 1 along the first dimension, and therefore applies no scaling to the input data. 
+ """ + + def __init__(self, config: AutoformerConfig): + super().__init__() + self.dim = config.scaling_dim if hasattr(config, "scaling_dim") else 1 + self.keepdim = config.keepdim if hasattr(config, "keepdim") else True + + def forward( + self, data: torch.Tensor, observed_indicator: torch.Tensor = None + ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]: + """ + Parameters: + data (`torch.Tensor` of shape `(batch_size, sequence_length, num_input_channels)`): + input for Batch norm calculation + Returns: + tuple of `torch.Tensor` of shapes + (`(batch_size, sequence_length, num_input_channels)`,`(batch_size, 1, num_input_channels)`, + `(batch_size, 1, num_input_channels)`) + """ + scale = torch.ones_like(data, requires_grad=False).mean(dim=self.dim, keepdim=self.keepdim) + loc = torch.zeros_like(data, requires_grad=False).mean(dim=self.dim, keepdim=self.keepdim) + return data, loc, scale + + +# Copied from transformers.models.time_series_transformer.modeling_time_series_transformer.weighted_average +def weighted_average(input_tensor: torch.Tensor, weights: Optional[torch.Tensor] = None, dim=None) -> torch.Tensor: + """ + Computes the weighted average of a given tensor across a given `dim`, masking values associated with weight zero, + meaning instead of `nan * 0 = nan` you will get `0 * 0 = 0`. + + Args: + input_tensor (`torch.FloatTensor`): + Input tensor, of which the average must be computed. + weights (`torch.FloatTensor`, *optional*): + Weights tensor, of the same shape as `input_tensor`. + dim (`int`, *optional*): + The dim along which to average `input_tensor`. + + Returns: + `torch.FloatTensor`: The tensor with values averaged along the specified `dim`. + """ + if weights is not None: + weighted_tensor = torch.where(weights != 0, input_tensor * weights, torch.zeros_like(input_tensor)) + sum_weights = torch.clamp(weights.sum(dim=dim) if dim else weights.sum(), min=1.0) + return (weighted_tensor.sum(dim=dim) if dim else weighted_tensor.sum()) / sum_weights + else: + return input_tensor.mean(dim=dim) + + +# Copied from transformers.models.time_series_transformer.modeling_time_series_transformer.nll +def nll(input: torch.distributions.Distribution, target: torch.Tensor) -> torch.Tensor: + """ + Computes the negative log likelihood loss from input distribution with respect to target. + """ + return -input.log_prob(target) + + +# Copied from transformers.models.marian.modeling_marian.MarianSinusoidalPositionalEmbedding with Marian->Autoformer +class AutoformerSinusoidalPositionalEmbedding(nn.Embedding): + """This module produces sinusoidal positional embeddings of any length.""" + + def __init__(self, num_positions: int, embedding_dim: int, padding_idx: Optional[int] = None) -> None: + super().__init__(num_positions, embedding_dim) + self.weight = self._init_weight(self.weight) + + @staticmethod + def _init_weight(out: nn.Parameter) -> nn.Parameter: + """ + Identical to the XLM create_sinusoidal_embeddings except features are not interleaved. The cos features are in + the 2nd half of the vector. 
[dim // 2:] + """ + n_pos, dim = out.shape + position_enc = np.array( + [[pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)] for pos in range(n_pos)] + ) + out.requires_grad = False # set early to avoid an error in pytorch-1.8+ + sentinel = dim // 2 if dim % 2 == 0 else (dim // 2) + 1 + out[:, 0:sentinel] = torch.FloatTensor(np.sin(position_enc[:, 0::2])) + out[:, sentinel:] = torch.FloatTensor(np.cos(position_enc[:, 1::2])) + out.detach_() + return out + + @torch.no_grad() + def forward(self, input_ids_shape: torch.Size, past_key_values_length: int = 0) -> torch.Tensor: + """`input_ids_shape` is expected to be [bsz x seqlen].""" + bsz, seq_len = input_ids_shape[:2] + positions = torch.arange( + past_key_values_length, past_key_values_length + seq_len, dtype=torch.long, device=self.weight.device + ) + return super().forward(positions) + + +# Copied from transformers.models.time_series_transformer.modeling_time_series_transformer.TimeSeriesValueEmbedding with TimeSeries->Autoformer +class AutoformerValueEmbedding(nn.Module): + def __init__(self, feature_size, d_model): + super().__init__() + self.value_projection = nn.Linear(in_features=feature_size, out_features=d_model, bias=False) + + def forward(self, x): + return self.value_projection(x) + + +# Class based on +# https://github.com/thuml/Autoformer/blob/c6a0694ff484753f2d986cc0bb1f99ee850fc1a8/layers/Autoformer_EncDec.py#L39 +# where AutoformerSeriesDecompositionLayer is series_decomp + moving_average +class AutoformerSeriesDecompositionLayer(nn.Module): + """ + Returns the trend and the seasonal parts of the time series. Calculated as: + + x_trend = AvgPool(Padding(X)) and x_seasonal = X - x_trend + """ + + def __init__(self, config: AutoformerConfig): + super().__init__() + self.kernel_size = config.moving_average + self.avg = nn.AvgPool1d(kernel_size=self.kernel_size, stride=1, padding=0) + + def forward(self, x): + """Input shape: Batch x Time x EMBED_DIM""" + # padding on the both ends of time series + num_of_pads = (self.kernel_size - 1) // 2 + front = x[:, 0:1, :].repeat(1, num_of_pads, 1) + end = x[:, -1:, :].repeat(1, num_of_pads, 1) + x_padded = torch.cat([front, x, end], dim=1) + + # calculate the trend and seasonal part of the series + x_trend = self.avg(x_padded.permute(0, 2, 1)).permute(0, 2, 1) + x_seasonal = x - x_trend + return x_seasonal, x_trend + + +# Class based on +# https://github.com/thuml/Autoformer/blob/c6a0694ff484753f2d986cc0bb1f99ee850fc1a8/layers/Autoformer_EncDec.py#L6 +# where AutoformerLayernorm is my_Layernorm +class AutoformerLayernorm(nn.Module): + """ + Special designed layer normalization for the seasonal part, calculated as: AutoformerLayernorm(x) = nn.LayerNorm(x) + - torch.mean(nn.LayerNorm(x)) + """ + + def __init__(self, config: AutoformerConfig): + super().__init__() + self.layernorm = nn.LayerNorm(config.d_model) + + def forward(self, x): + x_hat = self.layernorm(x) + bias = torch.mean(x_hat, dim=1).unsqueeze(1).repeat(1, x.shape[1], 1) + return x_hat - bias + + +class AutoformerAttention(nn.Module): + """ + AutoCorrelation Mechanism with the following two phases: + (1) period-based dependencies discovery (2) time delay aggregation + This block replace the canonical self-attention mechanism. 
+ """ + + def __init__( + self, + embed_dim: int, + num_heads: int, + dropout: float = 0.0, + is_decoder: bool = False, + bias: bool = True, + autocorrelation_factor: int = 3, + ): + super().__init__() + self.embed_dim = embed_dim + self.num_heads = num_heads + self.dropout = dropout + self.head_dim = embed_dim // num_heads + + if (self.head_dim * num_heads) != self.embed_dim: + raise ValueError( + f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim}" + f" and `num_heads`: {num_heads})." + ) + self.scaling = self.head_dim**-0.5 + self.is_decoder = is_decoder + + self.k_proj = nn.Linear(embed_dim, embed_dim, bias=bias) + self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias) + self.q_proj = nn.Linear(embed_dim, embed_dim, bias=bias) + self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias) + + self.autocorrelation_factor = autocorrelation_factor + + def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int): + return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous() + + def forward( + self, + hidden_states: torch.Tensor, + key_value_states: Optional[torch.Tensor] = None, + past_key_value: Optional[Tuple[torch.Tensor]] = None, + attention_mask: Optional[torch.Tensor] = None, + layer_head_mask: Optional[torch.Tensor] = None, + output_attentions: bool = False, + ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]: + """Input shape: Batch x Time x Channel""" + + # if key_value_states are provided this layer is used as a cross-attention layer + # for the decoder + is_cross_attention = key_value_states is not None + + bsz, tgt_len, _ = hidden_states.size() + + # get query proj + query_states = self.q_proj(hidden_states) + # get key, value proj + # `past_key_value[0].shape[2] == key_value_states.shape[1]` + # is checking that the `sequence_length` of the `past_key_value` is the same as + # the provided `key_value_states` to support prefix tuning + if ( + is_cross_attention + and past_key_value is not None + and past_key_value[0].shape[2] == key_value_states.shape[1] + ): + # reuse k,v, cross_attentions + key_states = past_key_value[0] + value_states = past_key_value[1] + elif is_cross_attention: + # cross_attentions + key_states = self._shape(self.k_proj(key_value_states), -1, bsz) + value_states = self._shape(self.v_proj(key_value_states), -1, bsz) + elif past_key_value is not None: + # reuse k, v, self_attention + key_states = self._shape(self.k_proj(hidden_states), -1, bsz) + value_states = self._shape(self.v_proj(hidden_states), -1, bsz) + key_states = torch.cat([past_key_value[0], key_states], dim=2) + value_states = torch.cat([past_key_value[1], value_states], dim=2) + else: + # self_attention + key_states = self._shape(self.k_proj(hidden_states), -1, bsz) + value_states = self._shape(self.v_proj(hidden_states), -1, bsz) + + if self.is_decoder: + # if cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states. + # Further calls to cross_attention layer can then reuse all cross-attention + # key/value_states (first "if" case) + # if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of + # all previous decoder key/value_states. 
Further calls to uni-directional self-attention + # can concat previous decoder key/value_states to current projected key/value_states (third "elif" case) + # if encoder bi-directional self-attention `past_key_value` is always `None` + past_key_value = (key_states, value_states) + + proj_shape = (bsz * self.num_heads, -1, self.head_dim) + query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape) + key_states = key_states.view(*proj_shape) + value_states = value_states.view(*proj_shape) + + # (1) period-based dependencies discovery + # Resize (truncation or zero filling) + queries_time_length = query_states.size(1) + values_time_length = value_states.size(1) + if queries_time_length > values_time_length: + query_states = query_states[:, : (queries_time_length - values_time_length), :] + zeros = torch.zeros_like(query_states).float() + value_states = torch.cat([value_states, zeros], dim=1) + key_states = torch.cat([key_states, zeros], dim=1) + else: + value_states = value_states[:, :queries_time_length, :] + key_states = key_states[:, :queries_time_length, :] + + query_states_fft = torch.fft.rfft(query_states, n=tgt_len, dim=1) + key_states_fft = torch.fft.rfft(key_states, n=tgt_len, dim=1) + attn_weights = query_states_fft * torch.conj(key_states_fft) + attn_weights = torch.fft.irfft(attn_weights, n=tgt_len, dim=1) # Autocorrelation(Q,K) + + src_len = key_states.size(1) + channel = key_states.size(2) + + if attn_weights.size() != (bsz * self.num_heads, tgt_len, channel): + raise ValueError( + f"Attention weights should be of size {(bsz * self.num_heads, tgt_len, channel)}, but is" + f" {attn_weights.size()}" + ) + + if attention_mask is not None: + if attention_mask.size() != (bsz, 1, tgt_len, src_len): + raise ValueError( + f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is {attention_mask.size()}" + ) + attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attention_mask + attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len) + + if layer_head_mask is not None: + if layer_head_mask.size() != (self.num_heads,): + raise ValueError( + f"Head mask for a single layer should be of size {(self.num_heads,)}, but is" + f" {layer_head_mask.size()}" + ) + attn_weights = layer_head_mask.view(1, -1, 1, 1) * attn_weights.view(bsz, self.num_heads, tgt_len, channel) + attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, channel) + + if output_attentions: + # this operation is a bit awkward, but it's required to + # make sure that attn_weights keeps its gradient. 
+ # In order to do so, attn_weights have to be reshaped + # twice and have to be reused in the following + attn_weights_reshaped = attn_weights.view(bsz, self.num_heads, tgt_len, channel) + attn_weights = attn_weights_reshaped.view(bsz * self.num_heads, tgt_len, channel) + else: + attn_weights_reshaped = None + + # time delay aggregation + time_length = value_states.size(1) + autocorrelations = attn_weights.view(bsz, self.num_heads, tgt_len, channel) + + # find top k autocorrelations delays + top_k = int(self.autocorrelation_factor * math.log(time_length)) + autocorrelations_mean_on_head_channel = torch.mean(autocorrelations, dim=(1, -1)) # bsz x tgt_len + if self.training: + autocorrelations_mean_on_bsz = torch.mean(autocorrelations_mean_on_head_channel, dim=0) + _, top_k_delays_index = torch.topk(autocorrelations_mean_on_bsz, top_k) + top_k_autocorrelations = torch.stack( + [autocorrelations_mean_on_head_channel[:, top_k_delays_index[i]] for i in range(top_k)], dim=-1 + ) + else: + top_k_autocorrelations, top_k_delays_index = torch.topk( + autocorrelations_mean_on_head_channel, top_k, dim=1 + ) + + top_k_autocorrelations = torch.softmax(top_k_autocorrelations, dim=-1) # bsz x top_k + + # compute aggregation: value_states.roll(delay) * top_k_autocorrelations(delay) + if not self.training: + # used for compute values_states.roll(delay) in inference + tmp_values = value_states.repeat(1, 2, 1) + init_index = ( + torch.arange(time_length) + .view(1, -1, 1) + .repeat(bsz * self.num_heads, 1, channel) + .to(value_states.device) + ) + + delays_agg = torch.zeros_like(value_states).float() # bsz x time_length x channel + for i in range(top_k): + # compute value_states roll delay + if not self.training: + tmp_delay = init_index + top_k_delays_index[:, i].view(-1, 1, 1).repeat( + self.num_heads, tgt_len, channel + ) + value_states_roll_delay = torch.gather(tmp_values, dim=1, index=tmp_delay) + else: + value_states_roll_delay = value_states.roll(shifts=-int(top_k_delays_index[i]), dims=1) + + # aggregation + top_k_autocorrelations_at_delay = ( + top_k_autocorrelations[:, i].view(-1, 1, 1).repeat(self.num_heads, tgt_len, channel) + ) + delays_agg += value_states_roll_delay * top_k_autocorrelations_at_delay + + attn_output = delays_agg.contiguous() + + if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim): + raise ValueError( + f"`attn_output` should be of size {(bsz * self.num_heads, tgt_len, self.head_dim)}, but is" + f" {attn_output.size()}" + ) + + attn_output = attn_output.view(bsz, self.num_heads, tgt_len, self.head_dim) + attn_output = attn_output.transpose(1, 2) + + # Use the `embed_dim` from the config (stored in the class) rather than `hidden_state` because `attn_output` can be + # partitioned across GPUs when using tensor-parallelism. 
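The delay-aggregation loop above replaces softmax attention over positions with a sum over the `top_k = factor * log(time_length)` most autocorrelated lags: the selected scores are softmaxed and the value series rolled by each selected lag is accumulated with those weights. A simplified standalone sketch of the training-time path (batch and heads are folded into one dimension here, so the averaging differs slightly from the per-batch handling above):

```python
import math

import torch

bsz_heads, time_length, channel = 4, 32, 8
autocorr = torch.rand(bsz_heads, time_length, channel)  # stand-in for Autocorrelation(Q, K)
values = torch.randn(bsz_heads, time_length, channel)

factor = 3
top_k = int(factor * math.log(time_length))

# one score per candidate delay: average the autocorrelation over batch*heads and channels
scores = autocorr.mean(dim=(0, 2))                  # (time_length,)
top_scores, top_delays = torch.topk(scores, top_k)
weights = torch.softmax(top_scores, dim=-1)         # normalize only over the selected delays

# weighted sum of the value series rolled back by each selected delay
delays_agg = torch.zeros_like(values)
for weight, delay in zip(weights, top_delays):
    delays_agg += weight * values.roll(shifts=-int(delay), dims=1)

print(delays_agg.shape)  # torch.Size([4, 32, 8])
```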
+ attn_output = attn_output.reshape(bsz, tgt_len, self.embed_dim) + + attn_output = self.out_proj(attn_output) + + return attn_output, attn_weights_reshaped, past_key_value + + +class AutoformerEncoderLayer(nn.Module): + def __init__(self, config: AutoformerConfig): + super().__init__() + self.embed_dim = config.d_model + self.self_attn = AutoformerAttention( + embed_dim=self.embed_dim, + num_heads=config.encoder_attention_heads, + dropout=config.attention_dropout, + autocorrelation_factor=config.autocorrelation_factor, + ) + self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim) + self.dropout = config.dropout + self.activation_fn = ACT2FN[config.activation_function] + self.activation_dropout = config.activation_dropout + self.fc1 = nn.Linear(self.embed_dim, config.encoder_ffn_dim) + self.fc2 = nn.Linear(config.encoder_ffn_dim, self.embed_dim) + self.final_layer_norm = AutoformerLayernorm(config) + self.decomp1 = AutoformerSeriesDecompositionLayer(config) + self.decomp2 = AutoformerSeriesDecompositionLayer(config) + + def forward( + self, + hidden_states: torch.FloatTensor, + attention_mask: torch.FloatTensor, + layer_head_mask: torch.FloatTensor, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.FloatTensor, Optional[torch.FloatTensor]]: + """ + Args: + hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)` + attention_mask (`torch.FloatTensor`): attention mask of size + `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. + layer_head_mask (`torch.FloatTensor`): mask for attention heads in a given layer of size + `(encoder_attention_heads,)`. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under + returned tensors for more detail. 
+ """ + residual = hidden_states + hidden_states, attn_weights, _ = self.self_attn( + hidden_states=hidden_states, + attention_mask=attention_mask, + layer_head_mask=layer_head_mask, + output_attentions=output_attentions, + ) + hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training) + hidden_states = residual + hidden_states + # added layer norm here as an improvement + hidden_states = self.self_attn_layer_norm(hidden_states) + hidden_states, _ = self.decomp1(hidden_states) + + residual = hidden_states + hidden_states = self.activation_fn(self.fc1(hidden_states)) + hidden_states = nn.functional.dropout(hidden_states, p=self.activation_dropout, training=self.training) + hidden_states = self.fc2(hidden_states) + hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training) + hidden_states = residual + hidden_states + hidden_states, _ = self.decomp2(hidden_states) + hidden_states = self.final_layer_norm(hidden_states) + + if hidden_states.dtype == torch.float16 and ( + torch.isinf(hidden_states).any() or torch.isnan(hidden_states).any() + ): + clamp_value = torch.finfo(hidden_states.dtype).max - 1000 + hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value) + + outputs = (hidden_states,) + + if output_attentions: + outputs += (attn_weights,) + + return outputs + + +class AutoformerDecoderLayer(nn.Module): + def __init__(self, config: AutoformerConfig): + super().__init__() + self.embed_dim = config.d_model + + self.self_attn = AutoformerAttention( + embed_dim=self.embed_dim, + num_heads=config.decoder_attention_heads, + dropout=config.attention_dropout, + is_decoder=True, + autocorrelation_factor=config.autocorrelation_factor, + ) + self.dropout = config.dropout + self.activation_fn = ACT2FN[config.activation_function] + self.activation_dropout = config.activation_dropout + + self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim) + self.encoder_attn = AutoformerAttention( + self.embed_dim, + config.decoder_attention_heads, + dropout=config.attention_dropout, + is_decoder=True, + autocorrelation_factor=config.autocorrelation_factor, + ) + self.encoder_attn_layer_norm = nn.LayerNorm(self.embed_dim) + self.fc1 = nn.Linear(self.embed_dim, config.decoder_ffn_dim) + self.fc2 = nn.Linear(config.decoder_ffn_dim, self.embed_dim) + self.final_layer_norm = AutoformerLayernorm(config) + + self.decomp1 = AutoformerSeriesDecompositionLayer(config) + self.decomp2 = AutoformerSeriesDecompositionLayer(config) + self.decomp3 = AutoformerSeriesDecompositionLayer(config) + + # source: https://github.com/thuml/Autoformer/blob/e6371e24f2ae2dd53e472edefdd5814c5176f864/layers/Autoformer_EncDec.py#L128 + self.trend_projection = nn.Conv1d( + in_channels=self.embed_dim, + out_channels=config.feature_size, + kernel_size=3, + stride=1, + padding=1, + padding_mode="circular", + bias=False, + ) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.Tensor] = None, + encoder_hidden_states: Optional[torch.Tensor] = None, + encoder_attention_mask: Optional[torch.Tensor] = None, + layer_head_mask: Optional[torch.Tensor] = None, + cross_attn_layer_head_mask: Optional[torch.Tensor] = None, + past_key_value: Optional[Tuple[torch.Tensor]] = None, + output_attentions: Optional[bool] = False, + use_cache: Optional[bool] = True, + ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]: + """ + Args: + hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, 
seq_len, embed_dim)` + attention_mask (`torch.FloatTensor`): attention mask of size + `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. + encoder_hidden_states (`torch.FloatTensor`): + cross attention input to the layer of shape `(batch, seq_len, embed_dim)` + encoder_attention_mask (`torch.FloatTensor`): encoder attention mask of size + `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. + layer_head_mask (`torch.FloatTensor`): mask for attention heads in a given layer of size + `(encoder_attention_heads,)`. + cross_attn_layer_head_mask (`torch.FloatTensor`): mask for cross-attention heads in a given layer of + size `(decoder_attention_heads,)`. + past_key_value (`Tuple(torch.FloatTensor)`): cached past key and value projection states + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under + returned tensors for more detail. + use_cache: (`bool`, *optional*, defaults to `True`): + Whether or not the model should return the `present_key_value` state to be used for subsequent + decoding. + """ + residual = hidden_states + + # Self Attention + # decoder uni-directional self-attention cached key/values tuple is at positions 1,2 + self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None + # add present self-attn cache to positions 1,2 of present_key_value tuple + hidden_states, self_attn_weights, present_key_value = self.self_attn( + hidden_states=hidden_states, + past_key_value=self_attn_past_key_value, + attention_mask=attention_mask, + layer_head_mask=layer_head_mask, + output_attentions=output_attentions, + ) + hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training) + hidden_states = residual + hidden_states + hidden_states, trend1 = self.decomp1(hidden_states) + # added layer norm here as an improvement + hidden_states = self.self_attn_layer_norm(hidden_states) + + # Cross-Attention Block + cross_attn_present_key_value = None + cross_attn_weights = None + if encoder_hidden_states is not None: + residual = hidden_states + + # cross_attn cached key/values tuple is at positions 3,4 of present_key_value tuple + cross_attn_past_key_value = past_key_value[-2:] if past_key_value is not None else None + hidden_states, cross_attn_weights, cross_attn_present_key_value = self.encoder_attn( + hidden_states=hidden_states, + key_value_states=encoder_hidden_states, + attention_mask=encoder_attention_mask, + layer_head_mask=cross_attn_layer_head_mask, + past_key_value=cross_attn_past_key_value, + output_attentions=output_attentions, + ) + hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training) + hidden_states = residual + hidden_states + hidden_states, trend2 = self.decomp2(hidden_states) + # added layer norm here as an improvement + hidden_states = self.encoder_attn_layer_norm(hidden_states) + + # add cross-attn to positions 3,4 of present_key_value tuple + present_key_value = present_key_value + cross_attn_present_key_value + + # Fully Connected + residual = hidden_states + hidden_states = self.activation_fn(self.fc1(hidden_states)) + hidden_states = nn.functional.dropout(hidden_states, p=self.activation_dropout, training=self.training) + hidden_states = self.fc2(hidden_states) + hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training) + hidden_states = residual + hidden_states + 
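The `decomp` calls threaded through this layer (`decomp1` and `decomp2` above, `decomp3` just below) are Autoformer's progressive series decomposition: each call splits its input into a smooth trend component, accumulated separately as `trend1`/`trend2`/`trend3`, and a seasonal component that stays on the residual stream. A minimal sketch of the idea, assuming the moving-average decomposition that `AutoformerSeriesDecompositionLayer` (defined earlier in this file) implements:

```python
import torch
import torch.nn.functional as F


def moving_average_decompose(x: torch.Tensor, kernel_size: int = 25):
    """Split a (batch, time, channel) series into (seasonal, trend) with a moving average."""
    # repeat the endpoints so the moving average keeps the original length
    front = x[:, :1, :].repeat(1, (kernel_size - 1) // 2, 1)
    end = x[:, -1:, :].repeat(1, (kernel_size - 1) // 2, 1)
    padded = torch.cat([front, x, end], dim=1)
    trend = F.avg_pool1d(padded.permute(0, 2, 1), kernel_size, stride=1).permute(0, 2, 1)
    return x - trend, trend


x = torch.randn(2, 96, 16)
seasonal, trend = moving_average_decompose(x)
print(seasonal.shape, trend.shape)  # both torch.Size([2, 96, 16])
```

The accumulated trend is later projected back to `feature_size` by the circular `trend_projection` convolution defined in this layer.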
hidden_states, trend3 = self.decomp3(hidden_states) + hidden_states = self.final_layer_norm(hidden_states) + + if encoder_hidden_states is not None: + residual_trend = trend1 + trend2 + trend3 + else: + residual_trend = trend1 + trend3 + residual_trend = self.trend_projection(residual_trend.permute(0, 2, 1)).transpose(1, 2) + outputs = ((hidden_states, residual_trend),) + + if output_attentions: + outputs += (self_attn_weights, cross_attn_weights) + + if use_cache: + outputs += (present_key_value,) + + return outputs + + +class AutoformerPreTrainedModel(PreTrainedModel): + config_class = AutoformerConfig + base_model_prefix = "model" + main_input_name = "past_values" + supports_gradient_checkpointing = True + + def _init_weights(self, module): + std = self.config.init_std + if isinstance(module, (nn.Linear, nn.Conv1d)): + module.weight.data.normal_(mean=0.0, std=std) + if module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, AutoformerSinusoidalPositionalEmbedding): + pass + elif isinstance(module, nn.Embedding): + module.weight.data.normal_(mean=0.0, std=std) + if module.padding_idx is not None: + module.weight.data[module.padding_idx].zero_() + + +AUTOFORMER_START_DOCSTRING = r""" + This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the + library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads + etc.) + + This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. + Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage + and behavior. + + Parameters: + config ([`AutoformerConfig`]): + Model configuration class with all the parameters of the model. Initializing with a config file does not + load the weights associated with the model, only the configuration. Check out the + [`~PreTrainedModel.from_pretrained`] method to load the model weights. +""" + +AUTOFORMER_INPUTS_DOCSTRING = r""" + Args: + past_values (`torch.FloatTensor` of shape `(batch_size, sequence_length)`): + Past values of the time series, that serve as context in order to predict the future. These values may + contain lags, i.e. additional values from the past which are added in order to serve as "extra context". + The `past_values` is what the Transformer encoder gets as input (with optional additional features, such as + `static_categorical_features`, `static_real_features`, `past_time_features`). + + The sequence length here is equal to `context_length` + `max(config.lags_sequence)`. + + Missing values need to be replaced with zeros. + + past_time_features (`torch.FloatTensor` of shape `(batch_size, sequence_length, num_features)`, *optional*): + Optional time features, which the model internally will add to `past_values`. These could be things like + "month of year", "day of the month", etc. encoded as vectors (for instance as Fourier features). These + could also be so-called "age" features, which basically help the model know "at which point in life" a + time-series is. Age features have small values for distant past time steps and increase monotonically the + more we approach the current time step. + + These features serve as the "positional encodings" of the inputs. So contrary to a model like BERT, where + the position encodings are learned from scratch internally as parameters of the model, the Time Series + Transformer requires to provide additional time features. 
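To make the shape requirement above concrete (`past_values` has to cover `context_length + max(config.lags_sequence)` steps so the lag features can be built), here is a hypothetical toy setup with randomly initialised weights; the config values below are made up purely for illustration and are not any checkpoint's settings:

```python
import torch

from transformers import AutoformerConfig, AutoformerModel

config = AutoformerConfig(
    prediction_length=24,
    context_length=48,
    lags_sequence=[1, 2, 3, 4, 5, 6, 7],
    num_time_features=2,
)
model = AutoformerModel(config)

batch_size = 2
past_length = config.context_length + max(config.lags_sequence)  # 48 + 7 = 55

outputs = model(
    past_values=torch.randn(batch_size, past_length),
    past_time_features=torch.randn(batch_size, past_length, config.num_time_features),
    past_observed_mask=torch.ones(batch_size, past_length),
    future_values=torch.randn(batch_size, config.prediction_length),
    future_time_features=torch.randn(batch_size, config.prediction_length, config.num_time_features),
)
print(outputs.last_hidden_state.shape)  # (batch_size, label_length + prediction_length, feature_size)
```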
+ + The Autoformer only learns additional embeddings for `static_categorical_features`. + + past_observed_mask (`torch.BoolTensor` of shape `(batch_size, sequence_length)`, *optional*): + Boolean mask to indicate which `past_values` were observed and which were missing. Mask values selected in + `[0, 1]`: + + - 1 for values that are **observed**, + - 0 for values that are **missing** (i.e. NaNs that were replaced by zeros). + + static_categorical_features (`torch.LongTensor` of shape `(batch_size, number of static categorical features)`, *optional*): + Optional static categorical features for which the model will learn an embedding, which it will add to the + values of the time series. + + Static categorical features are features which have the same value for all time steps (static over time). + + A typical example of a static categorical feature is a time series ID. + + static_real_features (`torch.FloatTensor` of shape `(batch_size, number of static real features)`, *optional*): + Optional static real features which the model will add to the values of the time series. + + Static real features are features which have the same value for all time steps (static over time). + + A typical example of a static real feature is promotion information. + + future_values (`torch.FloatTensor` of shape `(batch_size, prediction_length)`): + Future values of the time series, that serve as labels for the model. The `future_values` is what the + Transformer needs to learn to output, given the `past_values`. + + See the demo notebook and code snippets for details. + + Missing values need to be replaced with zeros. + + future_time_features (`torch.FloatTensor` of shape `(batch_size, prediction_length, num_features)`, *optional*): + Optional time features, which the model internally will add to `future_values`. These could be things like + "month of year", "day of the month", etc. encoded as vectors (for instance as Fourier features). These + could also be so-called "age" features, which basically help the model know "at which point in life" a + time-series is. Age features have small values for distant past time steps and increase monotonically the + more we approach the current time step. + + These features serve as the "positional encodings" of the inputs. So contrary to a model like BERT, where + the position encodings are learned from scratch internally as parameters of the model, the Time Series + Transformer requires to provide additional features. + + The Autoformer only learns additional embeddings for `static_categorical_features`. + + attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on certain token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + + decoder_attention_mask (`torch.LongTensor` of shape `(batch_size, target_sequence_length)`, *optional*): + Mask to avoid performing attention on certain token indices. By default, a causal mask will be used, to + make sure the model can only look at previous inputs in order to predict the future. + + head_mask (`torch.Tensor` of shape `(encoder_layers, encoder_attention_heads)`, *optional*): + Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in `[0, 1]`: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. 
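A note on the masks described above: the 2-D `attention_mask` / `decoder_attention_mask` are expanded internally into the 4-D additive form that the per-layer docstrings mention, i.e. `(batch, 1, tgt_len, src_len)` with very large negative values at padded positions. The encoder and decoder below delegate this to `_prepare_4d_attention_mask`; the rough sketch here only illustrates the idea and is not the library helper itself:

```python
from typing import Optional

import torch


def expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None) -> torch.Tensor:
    """Turn a (batch, src_len) padding mask of 0s and 1s into a (batch, 1, tgt_len, src_len) additive mask."""
    bsz, src_len = mask.shape
    tgt_len = tgt_len if tgt_len is not None else src_len
    expanded = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
    inverted = 1.0 - expanded
    return inverted.masked_fill(inverted.bool(), torch.finfo(dtype).min)


mask = torch.tensor([[1, 1, 1, 0]])      # last time step is padding
print(expand_mask(mask, torch.float32))  # zeros where attention is allowed, a large negative value at the padded step
```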
+ + decoder_head_mask (`torch.Tensor` of shape `(decoder_layers, decoder_attention_heads)`, *optional*): + Mask to nullify selected heads of the attention modules in the decoder. Mask values selected in `[0, 1]`: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + + cross_attn_head_mask (`torch.Tensor` of shape `(decoder_layers, decoder_attention_heads)`, *optional*): + Mask to nullify selected heads of the cross-attention modules. Mask values selected in `[0, 1]`: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + + encoder_outputs (`tuple(tuple(torch.FloatTensor)`, *optional*): + Tuple consists of `last_hidden_state`, `hidden_states` (*optional*) and `attentions` (*optional*) + `last_hidden_state` of shape `(batch_size, sequence_length, hidden_size)` (*optional*) is a sequence of + hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder. + past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): + Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape + `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape + `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`. + + Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention + blocks) that can be used (see `past_key_values` input) to speed up sequential decoding. + + If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that + don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all + `decoder_input_ids` of shape `(batch_size, sequence_length)`. + inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This + is useful if you want more control over how to convert `input_ids` indices into associated vectors than the + model's internal embedding lookup matrix. + + use_cache (`bool`, *optional*): + If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see + `past_key_values`). + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + + +# Copied from transformers.models.time_series_transformer.modeling_time_series_transformer.TimeSeriesTransformerEncoder with TimeSeriesTransformer->Autoformer,TimeSeries->Autoformer +class AutoformerEncoder(AutoformerPreTrainedModel): + """ + Transformer encoder consisting of *config.encoder_layers* self attention layers. Each layer is a + [`AutoformerEncoderLayer`]. 
+ + Args: + config: AutoformerConfig + """ + + def __init__(self, config: AutoformerConfig): + super().__init__(config) + + self.dropout = config.dropout + self.layerdrop = config.encoder_layerdrop + if config.prediction_length is None: + raise ValueError("The `prediction_length` config needs to be specified.") + + self.value_embedding = AutoformerValueEmbedding(feature_size=config.feature_size, d_model=config.d_model) + self.embed_positions = AutoformerSinusoidalPositionalEmbedding( + config.context_length + config.prediction_length, config.d_model + ) + self.layers = nn.ModuleList([AutoformerEncoderLayer(config) for _ in range(config.encoder_layers)]) + self.layernorm_embedding = nn.LayerNorm(config.d_model) + + self.gradient_checkpointing = False + # Initialize weights and apply final processing + self.post_init() + + def forward( + self, + attention_mask: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, BaseModelOutput]: + r""" + Args: + attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + head_mask (`torch.Tensor` of shape `(encoder_layers, encoder_attention_heads)`, *optional*): + Mask to nullify selected heads of the attention modules. Mask values selected in `[0, 1]`: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + + inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. + This is useful if you want more control over how to convert `input_ids` indices into associated vectors + than the model's internal embedding lookup matrix. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under + returned tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors + for more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. 
+ """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + hidden_states = self.value_embedding(inputs_embeds) + embed_pos = self.embed_positions(inputs_embeds.size()) + + hidden_states = self.layernorm_embedding(hidden_states + embed_pos) + hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training) + + # expand attention_mask + if attention_mask is not None: + # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] + attention_mask = _prepare_4d_attention_mask(attention_mask, inputs_embeds.dtype) + + encoder_states = () if output_hidden_states else None + all_attentions = () if output_attentions else None + + # check if head_mask has a correct number of layers specified if desired + if head_mask is not None: + if head_mask.size()[0] != (len(self.layers)): + raise ValueError( + f"The head_mask should be specified for {len(self.layers)} layers, but it is for" + f" {head_mask.size()[0]}." + ) + + for idx, encoder_layer in enumerate(self.layers): + if output_hidden_states: + encoder_states = encoder_states + (hidden_states,) + # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description) + to_drop = False + if self.training: + dropout_probability = torch.rand([]) + if dropout_probability < self.layerdrop: # skip the layer + to_drop = True + + if to_drop: + layer_outputs = (None, None) + else: + if self.gradient_checkpointing and self.training: + layer_outputs = self._gradient_checkpointing_func( + encoder_layer.__call__, + hidden_states, + attention_mask, + (head_mask[idx] if head_mask is not None else None), + output_attentions, + ) + else: + layer_outputs = encoder_layer( + hidden_states, + attention_mask, + layer_head_mask=(head_mask[idx] if head_mask is not None else None), + output_attentions=output_attentions, + ) + + hidden_states = layer_outputs[0] + + if output_attentions: + all_attentions = all_attentions + (layer_outputs[1],) + + if output_hidden_states: + encoder_states = encoder_states + (hidden_states,) + + if not return_dict: + return tuple(v for v in [hidden_states, encoder_states, all_attentions] if v is not None) + return BaseModelOutput( + last_hidden_state=hidden_states, hidden_states=encoder_states, attentions=all_attentions + ) + + +class AutoformerDecoder(AutoformerPreTrainedModel): + """ + Transformer decoder consisting of `config.decoder_layers` layers. 
Each layer is a [`AutoformerDecoderLayer`] + + Args: + config: AutoformerConfig + """ + + def __init__(self, config: AutoformerConfig): + super().__init__(config) + self.dropout = config.dropout + self.layerdrop = config.decoder_layerdrop + if config.prediction_length is None: + raise ValueError("The `prediction_length` config needs to be specified.") + + self.value_embedding = AutoformerValueEmbedding(feature_size=config.feature_size, d_model=config.d_model) + self.embed_positions = AutoformerSinusoidalPositionalEmbedding( + config.context_length + config.prediction_length, config.d_model + ) + self.layers = nn.ModuleList([AutoformerDecoderLayer(config) for _ in range(config.decoder_layers)]) + self.layernorm_embedding = nn.LayerNorm(config.d_model) + + # https://github.com/thuml/Autoformer/blob/e6371e24f2ae2dd53e472edefdd5814c5176f864/models/Autoformer.py#L74 + self.seasonality_projection = nn.Linear(config.d_model, config.feature_size) + + self.gradient_checkpointing = False + # Initialize weights and apply final processing + self.post_init() + + def forward( + self, + trend: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + encoder_attention_mask: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.Tensor] = None, + cross_attn_head_mask: Optional[torch.Tensor] = None, + past_key_values: Optional[List[torch.FloatTensor]] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, AutoFormerDecoderOutput]: + r""" + Args: + trend (`torch.FloatTensor` of shape `(batch_size, prediction_length, feature_size)`, *optional*): + The trend sequence to be fed to the decoder. + attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, encoder_sequence_length, hidden_size)`, *optional*): + Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention + of the decoder. + encoder_attention_mask (`torch.LongTensor` of shape `(batch_size, encoder_sequence_length)`, *optional*): + Mask to avoid performing cross-attention on padding tokens indices of encoder input_ids. Mask values + selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + head_mask (`torch.Tensor` of shape `(decoder_layers, decoder_attention_heads)`, *optional*): + Mask to nullify selected heads of the attention modules. Mask values selected in `[0, 1]`: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + + cross_attn_head_mask (`torch.Tensor` of shape `(decoder_layers, decoder_attention_heads)`, *optional*): + Mask to nullify selected heads of the cross-attention modules in the decoder to avoid performing + cross-attention on hidden heads. Mask values selected in `[0, 1]`: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. 
+ + past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): + Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of + shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of + shape `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`. + + Contains pre-computed hidden-states (key and values in the self-attention blocks and in the + cross-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding. + + If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those + that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of + all `decoder_input_ids` of shape `(batch_size, sequence_length)`. + inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. + This is useful if you want more control over how to convert `input_ids` indices into associated vectors + than the model's internal embedding lookup matrix. + use_cache (`bool`, *optional*): + If `use_cache` is True, `past_key_values` key value states are returned and can be used to speed up + decoding (see `past_key_values`). + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under + returned tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors + for more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. 
+ """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + use_cache = use_cache if use_cache is not None else self.config.use_cache + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + input_shape = inputs_embeds.size()[:-1] + + # expand encoder attention mask + if encoder_hidden_states is not None and encoder_attention_mask is not None: + # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] + encoder_attention_mask = _prepare_4d_attention_mask( + encoder_attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1] + ) + + hidden_states = self.value_embedding(inputs_embeds) + embed_pos = self.embed_positions( + inputs_embeds.size(), past_key_values_length=self.config.context_length - self.config.label_length + ) + hidden_states = self.layernorm_embedding(hidden_states + embed_pos) + hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training) + + # decoder layers + all_hidden_states = () if output_hidden_states else None + all_self_attns = () if output_attentions else None + all_cross_attentions = () if (output_attentions and encoder_hidden_states is not None) else None + next_decoder_cache = () if use_cache else None + + # check if head_mask/cross_attn_head_mask has a correct number of layers specified if desired + for attn_mask, mask_name in zip([head_mask, cross_attn_head_mask], ["head_mask", "cross_attn_head_mask"]): + if attn_mask is not None: + if attn_mask.size()[0] != (len(self.layers)): + raise ValueError( + f"The `{mask_name}` should be specified for {len(self.layers)} layers, but it is for" + f" {head_mask.size()[0]}." + ) + + for idx, decoder_layer in enumerate(self.layers): + # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description) + if output_hidden_states: + all_hidden_states += (hidden_states,) + if self.training: + dropout_probability = torch.rand([]) + if dropout_probability < self.layerdrop: + continue + + past_key_value = past_key_values[idx] if past_key_values is not None else None + + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning( + "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." 
+ ) + use_cache = False + layer_outputs = self._gradient_checkpointing_func( + decoder_layer.__call__, + hidden_states, + attention_mask, + encoder_hidden_states, + encoder_attention_mask, + head_mask[idx] if head_mask is not None else None, + cross_attn_head_mask[idx] if cross_attn_head_mask is not None else None, + None, + output_attentions, + use_cache, + ) + else: + layer_outputs = decoder_layer( + hidden_states, + attention_mask=attention_mask, + encoder_hidden_states=encoder_hidden_states, + encoder_attention_mask=encoder_attention_mask, + layer_head_mask=(head_mask[idx] if head_mask is not None else None), + cross_attn_layer_head_mask=( + cross_attn_head_mask[idx] if cross_attn_head_mask is not None else None + ), + past_key_value=past_key_value, + output_attentions=output_attentions, + use_cache=use_cache, + ) + (hidden_states, residual_trend) = layer_outputs[0] + trend = trend + residual_trend + + if use_cache: + next_decoder_cache += (layer_outputs[3 if output_attentions else 1],) + + if output_attentions: + all_self_attns += (layer_outputs[1],) + + if encoder_hidden_states is not None: + all_cross_attentions += (layer_outputs[2],) + + # project seasonality representation + hidden_states = self.seasonality_projection(hidden_states) + + # add hidden states from the last decoder layer + if output_hidden_states: + all_hidden_states += (hidden_states,) + + next_cache = next_decoder_cache if use_cache else None + if not return_dict: + return tuple( + v + for v in [hidden_states, trend, next_cache, all_hidden_states, all_self_attns, all_cross_attentions] + if v is not None + ) + return AutoFormerDecoderOutput( + last_hidden_state=hidden_states, + trend=trend, + past_key_values=next_cache, + hidden_states=all_hidden_states, + attentions=all_self_attns, + cross_attentions=all_cross_attentions, + ) + + +@add_start_docstrings( + "The bare Autoformer Model outputting raw hidden-states without any specific head on top.", + AUTOFORMER_START_DOCSTRING, +) +class AutoformerModel(AutoformerPreTrainedModel): + def __init__(self, config: AutoformerConfig): + super().__init__(config) + + if config.scaling == "mean" or config.scaling is True: + self.scaler = AutoformerMeanScaler(config) + elif config.scaling == "std": + self.scaler = AutoformerStdScaler(config) + else: + self.scaler = AutoformerNOPScaler(config) + + if config.num_static_categorical_features > 0: + self.embedder = AutoformerFeatureEmbedder( + cardinalities=config.cardinality, embedding_dims=config.embedding_dimension + ) + + # transformer encoder-decoder and mask initializer + self.encoder = AutoformerEncoder(config) + self.decoder = AutoformerDecoder(config) + + # used for decoder seasonal and trend initialization + self.decomposition_layer = AutoformerSeriesDecompositionLayer(config) + + # Initialize weights and apply final processing + self.post_init() + + @property + def _past_length(self) -> int: + return self.config.context_length + max(self.config.lags_sequence) + + def get_lagged_subsequences( + self, sequence: torch.Tensor, subsequences_length: int, shift: int = 0 + ) -> torch.Tensor: + """ + Returns lagged subsequences of a given sequence. Returns a tensor of shape (batch_size, subsequences_length, + feature_size, indices_length), containing lagged subsequences. Specifically, lagged[i, j, :, k] = sequence[i, + -indices[k]-subsequences_length+j, :]. + + Args: + sequence (`torch.Tensor` or shape `(batch_size, context_length, + feature_size)`): The sequence from which lagged subsequences should be extracted. 
+ subsequences_length (`int`): + Length of the subsequences to be extracted. + shift (`int`, *optional* defaults to 0): + Shift the lags by this amount back in the time index. + """ + + # calculates the indices of the lags by subtracting the shift value from the given lags_sequence + indices = [lag - shift for lag in self.config.lags_sequence] + + # checks if the maximum lag plus the length of the subsequences exceeds the length of the input sequence + sequence_length = sequence.shape[1] + if max(indices) + subsequences_length > sequence_length: + raise ValueError( + f"lags cannot go further than history length, found lag {max(indices)} " + f"while history length is only {sequence_length}" + ) + + # extracts the lagged subsequences from the input sequence using the calculated indices + lagged_values = [] + for lag_index in indices: + begin_index = -lag_index - subsequences_length + end_index = -lag_index if lag_index > 0 else None + lagged_values.append(sequence[:, begin_index:end_index, ...]) + + # return as stacked tensor in the feature dimension + return torch.stack(lagged_values, dim=-1) + + def create_network_inputs( + self, + past_values: torch.Tensor, + past_time_features: torch.Tensor, + static_categorical_features: Optional[torch.Tensor] = None, + static_real_features: Optional[torch.Tensor] = None, + past_observed_mask: Optional[torch.Tensor] = None, + future_values: Optional[torch.Tensor] = None, + future_time_features: Optional[torch.Tensor] = None, + ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]: + """ + Creates the inputs for the network given the past and future values, time features, and static features. + + Args: + past_values (`torch.Tensor`): + A tensor of shape `(batch_size, past_length, input_size)` containing the past values. + past_time_features (`torch.Tensor`): + A tensor of shape `(batch_size, past_length, num_features)` containing the past time features. + static_categorical_features (`Optional[torch.Tensor]`): + An optional tensor of shape `(batch_size, num_categorical_features)` containing the static categorical + features. + static_real_features (`Optional[torch.Tensor]`): + An optional tensor of shape `(batch_size, num_real_features)` containing the static real features. + past_observed_mask (`Optional[torch.Tensor]`): + An optional tensor of shape `(batch_size, past_length, input_size)` containing the mask of observed + values in the past. + future_values (`Optional[torch.Tensor]`): + An optional tensor of shape `(batch_size, future_length, input_size)` containing the future values. + + Returns: + A tuple containing the following tensors: + - reshaped_lagged_sequence (`torch.Tensor`): A tensor of shape `(batch_size, sequence_length, num_lags * + input_size)` containing the lagged subsequences of the inputs. + - features (`torch.Tensor`): A tensor of shape `(batch_size, sequence_length, num_features)` containing the + concatenated static and time features. + - loc (`torch.Tensor`): A tensor of shape `(batch_size, input_size)` containing the mean of the input + values. + - scale (`torch.Tensor`): A tensor of shape `(batch_size, input_size)` containing the std of the input + values. + - static_feat (`torch.Tensor`): A tensor of shape `(batch_size, num_static_features)` containing the + concatenated static features. 
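For intuition on `get_lagged_subsequences` above: every configured lag yields a window of length `subsequences_length` shifted back by that lag, and the windows are stacked along a new trailing dimension. A small standalone example with a made-up `lags_sequence`:

```python
import torch

lags_sequence = [1, 2, 3]  # made-up lags, just for illustration
batch_size, history_length, feature_size = 1, 10, 1
sequence = torch.arange(history_length, dtype=torch.float32).reshape(batch_size, history_length, feature_size)

subsequences_length = 4
lagged = []
for lag in lags_sequence:
    begin = -lag - subsequences_length
    end = -lag if lag > 0 else None
    lagged.append(sequence[:, begin:end, ...])
lagged = torch.stack(lagged, dim=-1)  # (batch, subsequences_length, feature_size, len(lags_sequence))

print(lagged.shape)        # torch.Size([1, 4, 1, 3])
print(lagged[0, :, 0, 0])  # tensor([5., 6., 7., 8.]), i.e. one step behind the last 4 values 6..9
```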
+ """ + # time feature + time_feat = ( + torch.cat( + ( + past_time_features[:, self._past_length - self.config.context_length :, ...], + future_time_features, + ), + dim=1, + ) + if future_values is not None + else past_time_features[:, self._past_length - self.config.context_length :, ...] + ) + + # target + if past_observed_mask is None: + past_observed_mask = torch.ones_like(past_values) + + context = past_values[:, -self.config.context_length :] + observed_context = past_observed_mask[:, -self.config.context_length :] + _, loc, scale = self.scaler(context, observed_context) + + inputs = ( + (torch.cat((past_values, future_values), dim=1) - loc) / scale + if future_values is not None + else (past_values - loc) / scale + ) + + # static features + log_abs_loc = loc.abs().log1p() if self.config.input_size == 1 else loc.squeeze(1).abs().log1p() + log_scale = scale.log() if self.config.input_size == 1 else scale.squeeze(1).log() + static_feat = torch.cat((log_abs_loc, log_scale), dim=1) + + if static_real_features is not None: + static_feat = torch.cat((static_real_features, static_feat), dim=1) + if static_categorical_features is not None: + embedded_cat = self.embedder(static_categorical_features) + static_feat = torch.cat((embedded_cat, static_feat), dim=1) + expanded_static_feat = static_feat.unsqueeze(1).expand(-1, time_feat.shape[1], -1) + + # all features + features = torch.cat((expanded_static_feat, time_feat), dim=-1) + + # lagged features + subsequences_length = ( + self.config.context_length + self.config.prediction_length + if future_values is not None + else self.config.context_length + ) + lagged_sequence = self.get_lagged_subsequences(sequence=inputs, subsequences_length=subsequences_length) + lags_shape = lagged_sequence.shape + reshaped_lagged_sequence = lagged_sequence.reshape(lags_shape[0], lags_shape[1], -1) + + if reshaped_lagged_sequence.shape[1] != time_feat.shape[1]: + raise ValueError( + f"input length {reshaped_lagged_sequence.shape[1]} and time feature lengths {time_feat.shape[1]} does not match" + ) + return reshaped_lagged_sequence, features, loc, scale, static_feat + + def get_encoder(self): + return self.encoder + + def get_decoder(self): + return self.decoder + + @add_start_docstrings_to_model_forward(AUTOFORMER_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=AutoformerModelOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + past_values: torch.Tensor, + past_time_features: torch.Tensor, + past_observed_mask: torch.Tensor, + static_categorical_features: Optional[torch.Tensor] = None, + static_real_features: Optional[torch.Tensor] = None, + future_values: Optional[torch.Tensor] = None, + future_time_features: Optional[torch.Tensor] = None, + decoder_attention_mask: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.Tensor] = None, + decoder_head_mask: Optional[torch.Tensor] = None, + cross_attn_head_mask: Optional[torch.Tensor] = None, + encoder_outputs: Optional[List[torch.FloatTensor]] = None, + past_key_values: Optional[List[torch.FloatTensor]] = None, + output_hidden_states: Optional[bool] = None, + output_attentions: Optional[bool] = None, + use_cache: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[AutoformerModelOutput, Tuple]: + r""" + Returns: + + Examples: + + ```python + >>> from huggingface_hub import hf_hub_download + >>> import torch + >>> from transformers import AutoformerModel + + >>> file = hf_hub_download( + ... 
repo_id="hf-internal-testing/tourism-monthly-batch", filename="train-batch.pt", repo_type="dataset" + ... ) + >>> batch = torch.load(file) + + >>> model = AutoformerModel.from_pretrained("huggingface/autoformer-tourism-monthly") + + >>> # during training, one provides both past and future values + >>> # as well as possible additional features + >>> outputs = model( + ... past_values=batch["past_values"], + ... past_time_features=batch["past_time_features"], + ... past_observed_mask=batch["past_observed_mask"], + ... static_categorical_features=batch["static_categorical_features"], + ... future_values=batch["future_values"], + ... future_time_features=batch["future_time_features"], + ... ) + + >>> last_hidden_state = outputs.last_hidden_state + ```""" + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + use_cache = use_cache if use_cache is not None else self.config.use_cache + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + transformer_inputs, temporal_features, loc, scale, static_feat = self.create_network_inputs( + past_values=past_values, + past_time_features=past_time_features, + past_observed_mask=past_observed_mask, + static_categorical_features=static_categorical_features, + static_real_features=static_real_features, + future_values=future_values, + future_time_features=future_time_features, + ) + + if encoder_outputs is None: + enc_input = torch.cat( + ( + transformer_inputs[:, : self.config.context_length, ...], + temporal_features[:, : self.config.context_length, ...], + ), + dim=-1, + ) + encoder_outputs = self.encoder( + inputs_embeds=enc_input, + head_mask=head_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + # If the user passed a tuple for encoder_outputs, we wrap it in a BaseModelOutput when return_dict=True + elif return_dict and not isinstance(encoder_outputs, BaseModelOutput): + encoder_outputs = BaseModelOutput( + last_hidden_state=encoder_outputs[0], + hidden_states=encoder_outputs[1] if len(encoder_outputs) > 1 else None, + attentions=encoder_outputs[2] if len(encoder_outputs) > 2 else None, + ) + + if future_values is not None: + # Decoder inputs + # seasonality and trend from context length + seasonal_input, trend_input = self.decomposition_layer( + transformer_inputs[:, : self.config.context_length, ...] 
+ ) + mean = ( + torch.mean(transformer_inputs[:, : self.config.context_length, ...], dim=1) + .unsqueeze(1) + .repeat(1, self.config.prediction_length, 1) + ) + zeros = torch.zeros( + [transformer_inputs.shape[0], self.config.prediction_length, transformer_inputs.shape[2]], + device=enc_input.device, + ) + + decoder_input = torch.cat( + ( + torch.cat((seasonal_input[:, -self.config.label_length :, ...], zeros), dim=1), + temporal_features[:, self.config.context_length - self.config.label_length :, ...], + ), + dim=-1, + ) + trend_init = torch.cat( + ( + torch.cat((trend_input[:, -self.config.label_length :, ...], mean), dim=1), + temporal_features[:, self.config.context_length - self.config.label_length :, ...], + ), + dim=-1, + ) + + decoder_outputs = self.decoder( + trend=trend_init, + inputs_embeds=decoder_input, + attention_mask=decoder_attention_mask, + encoder_hidden_states=encoder_outputs[0], + head_mask=decoder_head_mask, + cross_attn_head_mask=cross_attn_head_mask, + past_key_values=past_key_values, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + else: + decoder_outputs = AutoFormerDecoderOutput() + + if not return_dict: + return decoder_outputs + encoder_outputs + (loc, scale, static_feat) + + return AutoformerModelOutput( + last_hidden_state=decoder_outputs.last_hidden_state, + trend=decoder_outputs.trend, + past_key_values=decoder_outputs.past_key_values, + decoder_hidden_states=decoder_outputs.hidden_states, + decoder_attentions=decoder_outputs.attentions, + cross_attentions=decoder_outputs.cross_attentions, + encoder_last_hidden_state=encoder_outputs.last_hidden_state, + encoder_hidden_states=encoder_outputs.hidden_states, + encoder_attentions=encoder_outputs.attentions, + loc=loc, + scale=scale, + static_features=static_feat, + ) + + +@add_start_docstrings( + "The Autoformer Model with a distribution head on top for time-series forecasting.", + AUTOFORMER_START_DOCSTRING, +) +class AutoformerForPrediction(AutoformerPreTrainedModel): + def __init__(self, config: AutoformerConfig): + super().__init__(config) + self.model = AutoformerModel(config) + if config.distribution_output == "student_t": + self.distribution_output = StudentTOutput(dim=config.input_size) + elif config.distribution_output == "normal": + self.distribution_output = NormalOutput(dim=config.input_size) + elif config.distribution_output == "negative_binomial": + self.distribution_output = NegativeBinomialOutput(dim=config.input_size) + else: + raise ValueError(f"Unknown distribution output {config.distribution_output}") + + self.parameter_projection = self.distribution_output.get_parameter_projection(self.model.config.feature_size) + self.target_shape = self.distribution_output.event_shape + + if config.loss == "nll": + self.loss = nll + else: + raise ValueError(f"Unknown loss function {config.loss}") + + # Initialize weights of distribution_output and apply final processing + self.post_init() + + def output_params(self, decoder_output): + return self.parameter_projection(decoder_output[:, -self.config.prediction_length :, :]) + + def get_encoder(self): + return self.model.get_encoder() + + def get_decoder(self): + return self.model.get_decoder() + + @torch.jit.ignore + def output_distribution(self, params, loc=None, scale=None, trailing_n=None) -> torch.distributions.Distribution: + sliced_params = params + if trailing_n is not None: + sliced_params = [p[:, -trailing_n:] for p in params] + return 
self.distribution_output.distribution(sliced_params, loc=loc, scale=scale) + + @add_start_docstrings_to_model_forward(AUTOFORMER_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=Seq2SeqTSPredictionOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + past_values: torch.Tensor, + past_time_features: torch.Tensor, + past_observed_mask: torch.Tensor, + static_categorical_features: Optional[torch.Tensor] = None, + static_real_features: Optional[torch.Tensor] = None, + future_values: Optional[torch.Tensor] = None, + future_time_features: Optional[torch.Tensor] = None, + future_observed_mask: Optional[torch.Tensor] = None, + decoder_attention_mask: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.Tensor] = None, + decoder_head_mask: Optional[torch.Tensor] = None, + cross_attn_head_mask: Optional[torch.Tensor] = None, + encoder_outputs: Optional[List[torch.FloatTensor]] = None, + past_key_values: Optional[List[torch.FloatTensor]] = None, + output_hidden_states: Optional[bool] = None, + output_attentions: Optional[bool] = None, + use_cache: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Seq2SeqTSPredictionOutput, Tuple]: + r""" + Returns: + + Examples: + + ```python + >>> from huggingface_hub import hf_hub_download + >>> import torch + >>> from transformers import AutoformerForPrediction + + >>> file = hf_hub_download( + ... repo_id="hf-internal-testing/tourism-monthly-batch", filename="train-batch.pt", repo_type="dataset" + ... ) + >>> batch = torch.load(file) + + >>> model = AutoformerForPrediction.from_pretrained("huggingface/autoformer-tourism-monthly") + + >>> # during training, one provides both past and future values + >>> # as well as possible additional features + >>> outputs = model( + ... past_values=batch["past_values"], + ... past_time_features=batch["past_time_features"], + ... past_observed_mask=batch["past_observed_mask"], + ... static_categorical_features=batch["static_categorical_features"], + ... static_real_features=batch["static_real_features"], + ... future_values=batch["future_values"], + ... future_time_features=batch["future_time_features"], + ... ) + + >>> loss = outputs.loss + >>> loss.backward() + + >>> # during inference, one only provides past values + >>> # as well as possible additional features + >>> # the model autoregressively generates future values + >>> outputs = model.generate( + ... past_values=batch["past_values"], + ... past_time_features=batch["past_time_features"], + ... past_observed_mask=batch["past_observed_mask"], + ... static_categorical_features=batch["static_categorical_features"], + ... static_real_features=batch["static_real_features"], + ... future_time_features=batch["future_time_features"], + ... 
) + + >>> mean_prediction = outputs.sequences.mean(dim=1) + ```""" + + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + if future_values is not None: + use_cache = False + + outputs = self.model( + past_values=past_values, + past_time_features=past_time_features, + past_observed_mask=past_observed_mask, + static_categorical_features=static_categorical_features, + static_real_features=static_real_features, + future_values=future_values, + future_time_features=future_time_features, + decoder_attention_mask=decoder_attention_mask, + head_mask=head_mask, + decoder_head_mask=decoder_head_mask, + cross_attn_head_mask=cross_attn_head_mask, + encoder_outputs=encoder_outputs, + past_key_values=past_key_values, + output_hidden_states=output_hidden_states, + output_attentions=output_attentions, + use_cache=use_cache, + return_dict=return_dict, + ) + + prediction_loss = None + params = None + if future_values is not None: + # outputs.last_hidden_state and trend + # loc is 4rd last and scale is 3rd last output + params = self.output_params(outputs[0] + outputs[1]) + distribution = self.output_distribution(params, loc=outputs[-3], scale=outputs[-2]) + + loss = self.loss(distribution, future_values) + + if future_observed_mask is None: + future_observed_mask = torch.ones_like(future_values) + + if len(self.target_shape) == 0: + loss_weights = future_observed_mask + else: + loss_weights, _ = future_observed_mask.min(dim=-1, keepdim=False) + + prediction_loss = weighted_average(loss, weights=loss_weights) + + if not return_dict: + outputs = ((params,) + outputs[2:]) if params is not None else outputs[2:] + return ((prediction_loss,) + outputs) if prediction_loss is not None else outputs + + return Seq2SeqTSPredictionOutput( + loss=prediction_loss, + params=params, + past_key_values=outputs.past_key_values, + decoder_hidden_states=outputs.decoder_hidden_states, + decoder_attentions=outputs.decoder_attentions, + cross_attentions=outputs.cross_attentions, + encoder_last_hidden_state=outputs.encoder_last_hidden_state, + encoder_hidden_states=outputs.encoder_hidden_states, + encoder_attentions=outputs.encoder_attentions, + loc=outputs.loc, + scale=outputs.scale, + static_features=outputs.static_features, + ) + + @torch.no_grad() + def generate( + self, + past_values: torch.Tensor, + past_time_features: torch.Tensor, + future_time_features: torch.Tensor, + past_observed_mask: Optional[torch.Tensor] = None, + static_categorical_features: Optional[torch.Tensor] = None, + static_real_features: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + ) -> SampleTSPredictionOutput: + r""" + Greedily generate sequences of sample predictions from a model with a probability distribution head. + + Parameters: + past_values (`torch.FloatTensor` of shape `(batch_size, sequence_length)` or `(batch_size, sequence_length, input_size)`): + Past values of the time series, that serve as context in order to predict the future. The sequence size + of this tensor must be larger than the `context_length` of the model, since the model will use the + larger size to construct lag features, i.e. additional values from the past which are added in order to + serve as "extra context". 
+ + The `sequence_length` here is equal to `config.context_length` + `max(config.lags_sequence)`, which, if + no `lags_sequence` is configured, is equal to `config.context_length` + 7 (as by default, the largest + look-back index in `config.lags_sequence` is 7). The property `_past_length` returns the actual length + of the past. + + The `past_values` is what the Transformer encoder gets as input (with optional additional features, + such as `static_categorical_features`, `static_real_features`, `past_time_features` and lags). + + Optionally, missing values need to be replaced with zeros and indicated via the `past_observed_mask`. + + For multivariate time series, the `input_size` > 1 dimension is required and corresponds to the number + of variates in the time series per time step. + past_time_features (`torch.FloatTensor` of shape `(batch_size, sequence_length, num_features)`): + Required time features, which the model internally will add to `past_values`. These could be things + like "month of year", "day of the month", etc. encoded as vectors (for instance as Fourier features). + These could also be so-called "age" features, which basically help the model know "at which point in + life" a time-series is. Age features have small values for distant past time steps and increase + monotonically the more we approach the current time step. Holiday features are also a good example of + time features. + + These features serve as the "positional encodings" of the inputs. So contrary to a model like BERT, + where the position encodings are learned from scratch internally as parameters of the model, the Time + Series Transformer requires additional time features to be provided. The Time Series Transformer only + learns additional embeddings for `static_categorical_features`. + + Additional dynamic real covariates can be concatenated to this tensor, with the caveat that these + features must be known at prediction time. + + The `num_features` here is equal to `config.num_time_features` + `config.num_dynamic_real_features`. + future_time_features (`torch.FloatTensor` of shape `(batch_size, prediction_length, num_features)`): + Required time features for the prediction window, which the model internally will add to sampled + predictions. These could be things like "month of year", "day of the month", etc. encoded as vectors + (for instance as Fourier features). These could also be so-called "age" features, which basically help + the model know "at which point in life" a time-series is. Age features have small values for distant + past time steps and increase monotonically the more we approach the current time step. Holiday features + are also a good example of time features. + + These features serve as the "positional encodings" of the inputs. So contrary to a model like BERT, + where the position encodings are learned from scratch internally as parameters of the model, the Time + Series Transformer requires additional time features to be provided. The Time Series Transformer only + learns additional embeddings for `static_categorical_features`. + + Additional dynamic real covariates can be concatenated to this tensor, with the caveat that these + features must be known at prediction time. + + The `num_features` here is equal to `config.num_time_features` + `config.num_dynamic_real_features`.
+ past_observed_mask (`torch.BoolTensor` of shape `(batch_size, sequence_length)` or `(batch_size, sequence_length, input_size)`, *optional*): + Boolean mask to indicate which `past_values` were observed and which were missing. Mask values selected + in `[0, 1]`: + + - 1 for values that are **observed**, + - 0 for values that are **missing** (i.e. NaNs that were replaced by zeros). + + static_categorical_features (`torch.LongTensor` of shape `(batch_size, number of static categorical features)`, *optional*): + Optional static categorical features for which the model will learn an embedding, which it will add to + the values of the time series. + + Static categorical features are features which have the same value for all time steps (static over + time). + + A typical example of a static categorical feature is a time series ID. + static_real_features (`torch.FloatTensor` of shape `(batch_size, number of static real features)`, *optional*): + Optional static real features which the model will add to the values of the time series. + + Static real features are features which have the same value for all time steps (static over time). + + A typical example of a static real feature is promotion information. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. + + Return: + [`SampleTSPredictionOutput`] where the outputs `sequences` tensor will have shape `(batch_size, number of + samples, prediction_length)` or `(batch_size, number of samples, prediction_length, input_size)` for + multivariate predictions. + """ + outputs = self( + static_categorical_features=static_categorical_features, + static_real_features=static_real_features, + past_time_features=past_time_features, + past_values=past_values, + past_observed_mask=past_observed_mask, + future_time_features=None, + future_values=None, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=True, + use_cache=False, + ) + + decoder = self.model.get_decoder() + enc_last_hidden = outputs.encoder_last_hidden_state + loc = outputs.loc + scale = outputs.scale + static_feat = outputs.static_features + + num_parallel_samples = self.config.num_parallel_samples + repeated_loc = loc.repeat_interleave(repeats=num_parallel_samples, dim=0) + repeated_scale = scale.repeat_interleave(repeats=num_parallel_samples, dim=0) + + repeated_past_values = ( + past_values.repeat_interleave(repeats=num_parallel_samples, dim=0) - repeated_loc + ) / repeated_scale + + time_features = torch.cat((past_time_features, future_time_features), dim=1) + + expanded_static_feat = static_feat.unsqueeze(1).expand(-1, time_features.shape[1], -1) + features = torch.cat((expanded_static_feat, time_features), dim=-1) + repeated_features = features.repeat_interleave(repeats=num_parallel_samples, dim=0) + + repeated_enc_last_hidden = enc_last_hidden.repeat_interleave(repeats=num_parallel_samples, dim=0) + + lagged_sequence = self.model.get_lagged_subsequences( + sequence=repeated_past_values, subsequences_length=self.config.context_length + ) + lags_shape = lagged_sequence.shape + reshaped_lagged_sequence = lagged_sequence.reshape(lags_shape[0], lags_shape[1], -1) + seasonal_input, trend_input = self.model.decomposition_layer(reshaped_lagged_sequence) + + mean = torch.mean(reshaped_lagged_sequence, dim=1).unsqueeze(1).repeat(1, self.config.prediction_length, 1) + zeros = torch.zeros( 
+ [reshaped_lagged_sequence.shape[0], self.config.prediction_length, reshaped_lagged_sequence.shape[2]], + device=reshaped_lagged_sequence.device, + ) + + decoder_input = torch.cat( + ( + torch.cat((seasonal_input[:, -self.config.label_length :, ...], zeros), dim=1), + repeated_features[:, -self.config.prediction_length - self.config.label_length :, ...], + ), + dim=-1, + ) + trend_init = torch.cat( + ( + torch.cat((trend_input[:, -self.config.label_length :, ...], mean), dim=1), + repeated_features[:, -self.config.prediction_length - self.config.label_length :, ...], + ), + dim=-1, + ) + decoder_outputs = decoder( + trend=trend_init, inputs_embeds=decoder_input, encoder_hidden_states=repeated_enc_last_hidden + ) + decoder_last_hidden = decoder_outputs.last_hidden_state + trend = decoder_outputs.trend + params = self.output_params(decoder_last_hidden + trend) + distr = self.output_distribution(params, loc=repeated_loc, scale=repeated_scale) + future_samples = distr.sample() + + return SampleTSPredictionOutput( + sequences=future_samples.reshape( + (-1, num_parallel_samples, self.config.prediction_length) + self.target_shape, + ) + ) diff --git a/src/transformers/models/bark/__init__.py b/src/transformers/models/bark/__init__.py new file mode 100644 index 00000000000000..03e5865ca4a483 --- /dev/null +++ b/src/transformers/models/bark/__init__.py @@ -0,0 +1,79 @@ +# Copyright 2023 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
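The `generate` method above returns `num_parallel_samples` sampled trajectories per series rather than a single forecast, so downstream code reduces the sample dimension itself. A minimal sketch of that reduction (the tensor below is random stand-in data; the shape and the `mean(dim=1)` reduction follow the `SampleTSPredictionOutput` description above, while the concrete sizes are illustrative assumptions):

```python
import torch

# Stand-in for AutoformerForPrediction.generate(...).sequences, which has shape
# (batch_size, num_parallel_samples, prediction_length) for univariate series.
batch_size, num_parallel_samples, prediction_length = 64, 100, 24  # illustrative sizes only
sequences = torch.randn(batch_size, num_parallel_samples, prediction_length)

point_forecast = sequences.mean(dim=1)             # per-step mean over sampled trajectories, as in the docstring example
median_forecast = sequences.median(dim=1).values   # a more outlier-robust point forecast
lower = sequences.quantile(0.1, dim=1)             # empirical 10th percentile
upper = sequences.quantile(0.9, dim=1)             # empirical 90th percentile -> 80% prediction interval
```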
+from typing import TYPE_CHECKING + +from ...utils import ( + OptionalDependencyNotAvailable, + _LazyModule, + is_torch_available, +) + + +_import_structure = { + "configuration_bark": [ + "BARK_PRETRAINED_CONFIG_ARCHIVE_MAP", + "BarkCoarseConfig", + "BarkConfig", + "BarkFineConfig", + "BarkSemanticConfig", + ], + "processing_bark": ["BarkProcessor"], +} + +try: + if not is_torch_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass +else: + _import_structure["modeling_bark"] = [ + "BARK_PRETRAINED_MODEL_ARCHIVE_LIST", + "BarkFineModel", + "BarkSemanticModel", + "BarkCoarseModel", + "BarkModel", + "BarkPreTrainedModel", + "BarkCausalModel", + ] + +if TYPE_CHECKING: + from .configuration_bark import ( + BARK_PRETRAINED_CONFIG_ARCHIVE_MAP, + BarkCoarseConfig, + BarkConfig, + BarkFineConfig, + BarkSemanticConfig, + ) + from .processing_bark import BarkProcessor + + try: + if not is_torch_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + else: + from .modeling_bark import ( + BARK_PRETRAINED_MODEL_ARCHIVE_LIST, + BarkCausalModel, + BarkCoarseModel, + BarkFineModel, + BarkModel, + BarkPreTrainedModel, + BarkSemanticModel, + ) + +else: + import sys + + sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__) diff --git a/src/transformers/models/bark/configuration_bark.py b/src/transformers/models/bark/configuration_bark.py new file mode 100644 index 00000000000000..15efb11dc7d4a5 --- /dev/null +++ b/src/transformers/models/bark/configuration_bark.py @@ -0,0 +1,330 @@ +# coding=utf-8 +# Copyright 2023 The Suno AI Authors and The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" BARK model configuration""" + +import os +from typing import Dict, Optional, Union + +from ...configuration_utils import PretrainedConfig +from ...utils import add_start_docstrings, logging +from ..auto import CONFIG_MAPPING + + +logger = logging.get_logger(__name__) + + +BARK_PRETRAINED_CONFIG_ARCHIVE_MAP = { + "suno/bark-small": "https://huggingface.co/suno/bark-small/resolve/main/config.json", + "suno/bark": "https://huggingface.co/suno/bark/resolve/main/config.json", +} + +BARK_SUBMODELCONFIG_START_DOCSTRING = """ + This is the configuration class to store the configuration of a [`{model}`]. It is used to instantiate the model + according to the specified arguments, defining the model architecture. Instantiating a configuration with the + defaults will yield a similar configuration to that of the Bark [suno/bark](https://huggingface.co/suno/bark) + architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + block_size (`int`, *optional*, defaults to 1024): + The maximum sequence length that this model might ever be used with. 
Typically set this to something large + just in case (e.g., 512 or 1024 or 2048). + input_vocab_size (`int`, *optional*, defaults to 10_048): + Vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented by the + `input_ids` passed when calling [`{model}`]. Defaults to 10_048, but the value should be chosen carefully to + match the given sub-model. + output_vocab_size (`int`, *optional*, defaults to 10_048): + Output vocabulary size of a Bark sub-model. Defines the number of different tokens that can be represented + by the `output_ids` returned by the forward pass of [`{model}`]. Defaults to 10_048, but the value should be chosen + carefully to match the given sub-model. + num_layers (`int`, *optional*, defaults to 12): + Number of hidden layers in the given sub-model. + num_heads (`int`, *optional*, defaults to 12): + Number of attention heads for each attention layer in the Transformer architecture. + hidden_size (`int`, *optional*, defaults to 768): + Dimensionality of the "intermediate" (often named feed-forward) layer in the architecture. + dropout (`float`, *optional*, defaults to 0.0): + The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. + bias (`bool`, *optional*, defaults to `True`): + Whether or not to use bias in the linear layers and layer norm layers. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + use_cache (`bool`, *optional*, defaults to `True`): + Whether or not the model should return the last key/values attentions (not used by all models). +""" + + +class BarkSubModelConfig(PretrainedConfig): + model_type = "bark_module" + keys_to_ignore_at_inference = ["past_key_values"] + + attribute_map = { + "num_attention_heads": "num_heads", + "num_hidden_layers": "num_layers", + "vocab_size": "input_vocab_size", + "window_size": "block_size", + } + + def __init__( + self, + block_size=1024, + input_vocab_size=10_048, + output_vocab_size=10_048, + num_layers=12, + num_heads=12, + hidden_size=768, + dropout=0.0, + bias=True, # True: bias in Linears and LayerNorms, like GPT-2.
False: a bit better and faster + initializer_range=0.02, + use_cache=True, + **kwargs, + ): + self.block_size = block_size + self.input_vocab_size = input_vocab_size + self.output_vocab_size = output_vocab_size + self.num_layers = num_layers + self.num_heads = num_heads + self.hidden_size = hidden_size + self.dropout = dropout + self.bias = bias + self.use_cache = use_cache + self.initializer_range = initializer_range + + super().__init__(**kwargs) + + @classmethod + def from_pretrained( + cls, + pretrained_model_name_or_path: Union[str, os.PathLike], + cache_dir: Optional[Union[str, os.PathLike]] = None, + force_download: bool = False, + local_files_only: bool = False, + token: Optional[Union[str, bool]] = None, + revision: str = "main", + **kwargs, + ) -> "PretrainedConfig": + kwargs["cache_dir"] = cache_dir + kwargs["force_download"] = force_download + kwargs["local_files_only"] = local_files_only + kwargs["revision"] = revision + + cls._set_token_in_kwargs(kwargs, token) + + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) + + # get the config dict if we are loading from Bark + if config_dict.get("model_type") == "bark": + config_dict = config_dict[f"{cls.model_type}_config"] + + if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: + logger.warning( + f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " + f"{cls.model_type}. This is not supported for all configurations of models and can yield errors." + ) + + return cls.from_dict(config_dict, **kwargs) + + +@add_start_docstrings( + BARK_SUBMODELCONFIG_START_DOCSTRING.format(config="BarkSemanticConfig", model="BarkSemanticModel"), + """ + Example: + + ```python + >>> from transformers import BarkSemanticConfig, BarkSemanticModel + + >>> # Initializing a Bark sub-module style configuration + >>> configuration = BarkSemanticConfig() + + >>> # Initializing a model (with random weights) from the suno/bark style configuration + >>> model = BarkSemanticModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""", +) +class BarkSemanticConfig(BarkSubModelConfig): + model_type = "semantic" + + +@add_start_docstrings( + BARK_SUBMODELCONFIG_START_DOCSTRING.format(config="BarkCoarseConfig", model="BarkCoarseModel"), + """ + Example: + + ```python + >>> from transformers import BarkCoarseConfig, BarkCoarseModel + + >>> # Initializing a Bark sub-module style configuration + >>> configuration = BarkCoarseConfig() + + >>> # Initializing a model (with random weights) from the suno/bark style configuration + >>> model = BarkCoarseModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""", +) +class BarkCoarseConfig(BarkSubModelConfig): + model_type = "coarse_acoustics" + + +@add_start_docstrings( + BARK_SUBMODELCONFIG_START_DOCSTRING.format(config="BarkFineConfig", model="BarkFineModel"), + """ + n_codes_total (`int`, *optional*, defaults to 8): + The total number of audio codebooks predicted. Used in the fine acoustics sub-model. + n_codes_given (`int`, *optional*, defaults to 1): + The number of audio codebooks predicted in the coarse acoustics sub-model. Used in the acoustics + sub-models. 
+ Example: + + ```python + >>> from transformers import BarkFineConfig, BarkFineModel + + >>> # Initializing a Bark sub-module style configuration + >>> configuration = BarkFineConfig() + + >>> # Initializing a model (with random weights) from the suno/bark style configuration + >>> model = BarkFineModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""", +) +class BarkFineConfig(BarkSubModelConfig): + model_type = "fine_acoustics" + + def __init__(self, tie_word_embeddings=True, n_codes_total=8, n_codes_given=1, **kwargs): + self.n_codes_total = n_codes_total + self.n_codes_given = n_codes_given + + super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs) + + +class BarkConfig(PretrainedConfig): + """ + This is the configuration class to store the configuration of a [`BarkModel`]. It is used to instantiate a Bark + model according to the specified sub-models configurations, defining the model architecture. + + Instantiating a configuration with the defaults will yield a similar configuration to that of the Bark + [suno/bark](https://huggingface.co/suno/bark) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + semantic_config ([`BarkSemanticConfig`], *optional*): + Configuration of the underlying semantic sub-model. + coarse_acoustics_config ([`BarkCoarseConfig`], *optional*): + Configuration of the underlying coarse acoustics sub-model. + fine_acoustics_config ([`BarkFineConfig`], *optional*): + Configuration of the underlying fine acoustics sub-model. + codec_config ([`AutoConfig`], *optional*): + Configuration of the underlying codec sub-model. + + Example: + + ```python + >>> from transformers import ( + ... BarkSemanticConfig, + ... BarkCoarseConfig, + ... BarkFineConfig, + ... BarkModel, + ... BarkConfig, + ... AutoConfig, + ... ) + + >>> # Initializing Bark sub-modules configurations. + >>> semantic_config = BarkSemanticConfig() + >>> coarse_acoustics_config = BarkCoarseConfig() + >>> fine_acoustics_config = BarkFineConfig() + >>> codec_config = AutoConfig.from_pretrained("facebook/encodec_24khz") + + + >>> # Initializing a Bark module style configuration + >>> configuration = BarkConfig.from_sub_model_configs( + ... semantic_config, coarse_acoustics_config, fine_acoustics_config, codec_config + ... ) + + >>> # Initializing a model (with random weights) + >>> model = BarkModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ``` + """ + + model_type = "bark" + + def __init__( + self, + semantic_config: Dict = None, + coarse_acoustics_config: Dict = None, + fine_acoustics_config: Dict = None, + codec_config: Dict = None, + initializer_range=0.02, + **kwargs, + ): + if semantic_config is None: + semantic_config = {} + logger.info("semantic_config is None. initializing the semantic model with default values.") + + if coarse_acoustics_config is None: + coarse_acoustics_config = {} + logger.info("coarse_acoustics_config is None. initializing the coarse model with default values.") + + if fine_acoustics_config is None: + fine_acoustics_config = {} + logger.info("fine_acoustics_config is None. initializing the fine model with default values.") + + if codec_config is None: + codec_config = {} + logger.info("codec_config is None. 
initializing the codec model with default values.") + + self.semantic_config = BarkSemanticConfig(**semantic_config) + self.coarse_acoustics_config = BarkCoarseConfig(**coarse_acoustics_config) + self.fine_acoustics_config = BarkFineConfig(**fine_acoustics_config) + codec_model_type = codec_config["model_type"] if "model_type" in codec_config else "encodec" + self.codec_config = CONFIG_MAPPING[codec_model_type](**codec_config) + + self.initializer_range = initializer_range + + super().__init__(**kwargs) + + @classmethod + def from_sub_model_configs( + cls, + semantic_config: BarkSemanticConfig, + coarse_acoustics_config: BarkCoarseConfig, + fine_acoustics_config: BarkFineConfig, + codec_config: PretrainedConfig, + **kwargs, + ): + r""" + Instantiate a [`BarkConfig`] (or a derived class) from bark sub-models configuration. + + Returns: + [`BarkConfig`]: An instance of a configuration object + """ + return cls( + semantic_config=semantic_config.to_dict(), + coarse_acoustics_config=coarse_acoustics_config.to_dict(), + fine_acoustics_config=fine_acoustics_config.to_dict(), + codec_config=codec_config.to_dict(), + **kwargs, + ) diff --git a/src/transformers/models/bark/convert_suno_to_hf.py b/src/transformers/models/bark/convert_suno_to_hf.py new file mode 100644 index 00000000000000..4720a70d5cd2ad --- /dev/null +++ b/src/transformers/models/bark/convert_suno_to_hf.py @@ -0,0 +1,262 @@ +"""Convert Bark checkpoint.""" +import argparse +import os +from pathlib import Path + +import torch +from bark.generation import _load_model as _bark_load_model +from huggingface_hub import hf_hub_download + +from transformers import EncodecConfig, EncodecModel, set_seed +from transformers.models.bark.configuration_bark import ( + BarkCoarseConfig, + BarkConfig, + BarkFineConfig, + BarkSemanticConfig, +) +from transformers.models.bark.generation_configuration_bark import ( + BarkCoarseGenerationConfig, + BarkFineGenerationConfig, + BarkGenerationConfig, + BarkSemanticGenerationConfig, +) +from transformers.models.bark.modeling_bark import BarkCoarseModel, BarkFineModel, BarkModel, BarkSemanticModel +from transformers.utils import logging + + +logging.set_verbosity_info() +logger = logging.get_logger(__name__) + +set_seed(770) + + +new_layer_name_dict = { + "c_attn": "att_proj", + "c_proj": "out_proj", + "c_fc": "in_proj", + "transformer.": "", + "h.": "layers.", + "ln_1": "layernorm_1", + "ln_2": "layernorm_2", + "ln_f": "layernorm_final", + "wpe": "position_embeds_layer", + "wte": "input_embeds_layer", +} + + +REMOTE_MODEL_PATHS = { + "text_small": { + "repo_id": "suno/bark", + "file_name": "text.pt", + }, + "coarse_small": { + "repo_id": "suno/bark", + "file_name": "coarse.pt", + }, + "fine_small": { + "repo_id": "suno/bark", + "file_name": "fine.pt", + }, + "text": { + "repo_id": "suno/bark", + "file_name": "text_2.pt", + }, + "coarse": { + "repo_id": "suno/bark", + "file_name": "coarse_2.pt", + }, + "fine": { + "repo_id": "suno/bark", + "file_name": "fine_2.pt", + }, +} + +CUR_PATH = os.path.dirname(os.path.abspath(__file__)) +default_cache_dir = os.path.join(os.path.expanduser("~"), ".cache") +CACHE_DIR = os.path.join(os.getenv("XDG_CACHE_HOME", default_cache_dir), "suno", "bark_v0") + + +def _get_ckpt_path(model_type, use_small=False): + key = model_type + if use_small: + key += "_small" + return os.path.join(CACHE_DIR, REMOTE_MODEL_PATHS[key]["file_name"]) + + +def _download(from_hf_path, file_name): + os.makedirs(CACHE_DIR, exist_ok=True) + hf_hub_download(repo_id=from_hf_path, filename=file_name, 
local_dir=CACHE_DIR) + + +def _load_model(ckpt_path, device, use_small=False, model_type="text"): + if model_type == "text": + ModelClass = BarkSemanticModel + ConfigClass = BarkSemanticConfig + GenerationConfigClass = BarkSemanticGenerationConfig + elif model_type == "coarse": + ModelClass = BarkCoarseModel + ConfigClass = BarkCoarseConfig + GenerationConfigClass = BarkCoarseGenerationConfig + elif model_type == "fine": + ModelClass = BarkFineModel + ConfigClass = BarkFineConfig + GenerationConfigClass = BarkFineGenerationConfig + else: + raise NotImplementedError() + model_key = f"{model_type}_small" if use_small else model_type + model_info = REMOTE_MODEL_PATHS[model_key] + if not os.path.exists(ckpt_path): + logger.info(f"{model_type} model not found, downloading into `{CACHE_DIR}`.") + _download(model_info["repo_id"], model_info["file_name"]) + checkpoint = torch.load(ckpt_path, map_location=device) + # this is a hack + model_args = checkpoint["model_args"] + if "input_vocab_size" not in model_args: + model_args["input_vocab_size"] = model_args["vocab_size"] + model_args["output_vocab_size"] = model_args["vocab_size"] + del model_args["vocab_size"] + + # convert Bark model arguments to HF Bark model arguments + model_args["num_heads"] = model_args.pop("n_head") + model_args["hidden_size"] = model_args.pop("n_embd") + model_args["num_layers"] = model_args.pop("n_layer") + + model_config = ConfigClass(**checkpoint["model_args"]) + model = ModelClass(config=model_config) + model_generation_config = GenerationConfigClass() + + model.generation_config = model_generation_config + state_dict = checkpoint["model"] + # fixup checkpoint + unwanted_prefix = "_orig_mod." + for k, v in list(state_dict.items()): + if k.startswith(unwanted_prefix): + # replace part of the key with corresponding layer name in HF implementation + new_k = k[len(unwanted_prefix) :] + for old_layer_name in new_layer_name_dict: + new_k = new_k.replace(old_layer_name, new_layer_name_dict[old_layer_name]) + + state_dict[new_k] = state_dict.pop(k) + + extra_keys = set(state_dict.keys()) - set(model.state_dict().keys()) + extra_keys = {k for k in extra_keys if not k.endswith(".attn.bias")} + missing_keys = set(model.state_dict().keys()) - set(state_dict.keys()) + missing_keys = {k for k in missing_keys if not k.endswith(".attn.bias")} + if len(extra_keys) != 0: + raise ValueError(f"extra keys found: {extra_keys}") + if len(missing_keys) != 0: + raise ValueError(f"missing keys: {missing_keys}") + model.load_state_dict(state_dict, strict=False) + n_params = model.num_parameters(exclude_embeddings=True) + val_loss = checkpoint["best_val_loss"].item() + logger.info(f"model loaded: {round(n_params/1e6,1)}M params, {round(val_loss,3)} loss") + model.eval() + model.to(device) + del checkpoint, state_dict + + return model + + +def load_model(pytorch_dump_folder_path, use_small=False, model_type="text"): + if model_type not in ("text", "coarse", "fine"): + raise NotImplementedError() + + device = "cpu" # do conversion on cpu + + ckpt_path = _get_ckpt_path(model_type, use_small=use_small) + model = _load_model(ckpt_path, device, model_type=model_type, use_small=use_small) + + # load bark initial model + bark_model = _bark_load_model(ckpt_path, "cpu", model_type=model_type, use_small=use_small) + + if model_type == "text": + bark_model = bark_model["model"] + + if model.num_parameters(exclude_embeddings=True) != bark_model.get_num_params(): + raise ValueError("initial and new models don't have the same number of parameters") + + # check 
if same output as the bark model + batch_size = 5 + sequence_length = 10 + + if model_type in ["text", "coarse"]: + vec = torch.randint(256, (batch_size, sequence_length), dtype=torch.int) + output_old_model = bark_model(vec)[0] + + output_new_model_total = model(vec) + + # take last logits + output_new_model = output_new_model_total.logits[:, [-1], :] + + else: + prediction_codeboook_channel = 3 + n_codes_total = 8 + vec = torch.randint(256, (batch_size, sequence_length, n_codes_total), dtype=torch.int) + + output_new_model_total = model(prediction_codeboook_channel, vec) + output_old_model = bark_model(prediction_codeboook_channel, vec) + + output_new_model = output_new_model_total.logits + + # output difference should come from the difference of self-attention implementation design + if output_new_model.shape != output_old_model.shape: + raise ValueError("initial and new outputs don't have the same shape") + if (output_new_model - output_old_model).abs().max().item() > 1e-3: + raise ValueError("initial and new outputs are not equal") + + Path(pytorch_dump_folder_path).mkdir(exist_ok=True) + model.save_pretrained(pytorch_dump_folder_path) + + +def load_whole_bark_model( + semantic_path, + coarse_path, + fine_path, + append_text, + hub_path, + folder_path, +): + pytorch_dump_folder_path = os.path.join(folder_path, append_text) + + semanticConfig = BarkSemanticConfig.from_pretrained(os.path.join(semantic_path, "config.json")) + coarseAcousticConfig = BarkCoarseConfig.from_pretrained(os.path.join(coarse_path, "config.json")) + fineAcousticConfig = BarkFineConfig.from_pretrained(os.path.join(fine_path, "config.json")) + codecConfig = EncodecConfig.from_pretrained("facebook/encodec_24khz") + + semantic = BarkSemanticModel.from_pretrained(semantic_path) + coarseAcoustic = BarkCoarseModel.from_pretrained(coarse_path) + fineAcoustic = BarkFineModel.from_pretrained(fine_path) + codec = EncodecModel.from_pretrained("facebook/encodec_24khz") + + bark_config = BarkConfig.from_sub_model_configs( + semanticConfig, coarseAcousticConfig, fineAcousticConfig, codecConfig + ) + + bark_generation_config = BarkGenerationConfig.from_sub_model_configs( + semantic.generation_config, coarseAcoustic.generation_config, fineAcoustic.generation_config + ) + + bark = BarkModel(bark_config) + + bark.semantic = semantic + bark.coarse_acoustics = coarseAcoustic + bark.fine_acoustics = fineAcoustic + bark.codec_model = codec + + bark.generation_config = bark_generation_config + + Path(pytorch_dump_folder_path).mkdir(exist_ok=True) + bark.save_pretrained(pytorch_dump_folder_path, repo_id=hub_path, push_to_hub=True) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + # Required parameters + + parser.add_argument("model_type", type=str, help="text, coarse or fine.") + parser.add_argument("pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model.") + parser.add_argument("--is_small", action="store_true", help="convert the small version instead of the large.") + + args = parser.parse_args() + + load_model(args.pytorch_dump_folder_path, model_type=args.model_type, use_small=args.is_small) diff --git a/src/transformers/models/bark/generation_configuration_bark.py b/src/transformers/models/bark/generation_configuration_bark.py new file mode 100644 index 00000000000000..7d7d98449d665f --- /dev/null +++ b/src/transformers/models/bark/generation_configuration_bark.py @@ -0,0 +1,331 @@ +# coding=utf-8 +# Copyright 2023 The Suno AI Authors and The HuggingFace Inc. team. 
All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" BARK model generation configuration""" + +import copy +from typing import Dict + +from ...generation.configuration_utils import GenerationConfig +from ...utils import logging + + +logger = logging.get_logger(__name__) + + +class BarkSemanticGenerationConfig(GenerationConfig): + model_type = "semantic" + + def __init__( + self, + eos_token_id=10_000, + renormalize_logits=True, + max_new_tokens=768, + output_scores=False, + return_dict_in_generate=False, + output_hidden_states=False, + output_attentions=False, + temperature=1.0, + do_sample=False, + text_encoding_offset=10_048, + text_pad_token=129_595, + semantic_infer_token=129_599, + semantic_vocab_size=10_000, + max_input_semantic_length=256, + semantic_rate_hz=49.9, + min_eos_p=None, + **kwargs, + ): + """Class that holds a generation configuration for [`BarkSemanticModel`]. + + This configuration inherit from [`GenerationConfig`] and can be used to control the model generation. Read the + documentation from [`GenerationConfig`] for more information. + + Args: + eos_token_id (`int`, *optional*, defaults to 10_000): + The id of the *end-of-sequence* token. + renormalize_logits (`bool`, *optional*, defaults to `True`): + Whether to renormalize the logits after applying all the logits processors or warpers (including the + custom ones). It's highly recommended to set this flag to `True` as the search algorithms suppose the + score logits are normalized but some logit processors or warpers break the normalization. + max_new_tokens (`int`, *optional*, defaults to 768): + The maximum numbers of tokens to generate, ignoring the number of tokens in the prompt. + output_scores (`bool`, *optional*, defaults to `False`): + Whether or not to return the prediction scores. See `scores` under returned tensors for more details. + return_dict_in_generate (`bool`, *optional*, defaults to `False`): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. + output_hidden_states (`bool`, *optional*, defaults to `False`): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors + for more details. + output_attentions (`bool`, *optional*, defaults to `False`): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under + returned tensors for more details. + temperature (`float`, *optional*, defaults to 1.0): + The value used to modulate the next token probabilities. + do_sample (`bool`, *optional*, defaults to `False`): + Whether or not to use sampling ; use greedy decoding otherwise. + text_encoding_offset (`int`, *optional*, defaults to 10_048): + Text encoding offset. + text_pad_token (`int`, *optional*, defaults to 129_595): + Text pad token. + semantic_infer_token (`int`, *optional*, defaults to 129_599): + Semantic infer token. + semantic_vocab_size (`int`, *optional*, defaults to 10_000): + Semantic vocab size. 
+ max_input_semantic_length (`int`, *optional*, defaults to 256): + Max length of semantic input vector. + semantic_rate_hz (`float`, *optional*, defaults to 49.9): + Semantic rate in Hertz. + min_eos_p (`float`, *optional*): + Minimum threshold of the probability of the EOS token for it to be sampled. This is an early stopping + strategy to mitigate potential unwanted generations at the end of a prompt. The original implementation + suggests a default value of 0.2. + """ + super().__init__( + temperature=temperature, + do_sample=do_sample, + eos_token_id=eos_token_id, + renormalize_logits=renormalize_logits, + max_new_tokens=max_new_tokens, + output_scores=output_scores, + return_dict_in_generate=return_dict_in_generate, + output_hidden_states=output_hidden_states, + output_attentions=output_attentions, + **kwargs, + ) + + self.text_encoding_offset = text_encoding_offset + self.text_pad_token = text_pad_token + self.semantic_pad_token = eos_token_id + self.semantic_infer_token = semantic_infer_token + self.semantic_vocab_size = semantic_vocab_size + self.max_input_semantic_length = max_input_semantic_length + self.semantic_rate_hz = semantic_rate_hz + self.min_eos_p = min_eos_p + + +class BarkCoarseGenerationConfig(GenerationConfig): + model_type = "coarse_acoustics" + + def __init__( + self, + renormalize_logits=True, + output_scores=False, + return_dict_in_generate=False, + output_hidden_states=False, + output_attentions=False, + temperature=1.0, + do_sample=False, + coarse_semantic_pad_token=12_048, + coarse_rate_hz=75, + n_coarse_codebooks=2, + coarse_infer_token=12_050, + max_coarse_input_length=256, + max_coarse_history: int = 630, + sliding_window_len: int = 60, + **kwargs, + ): + """Class that holds a generation configuration for [`BarkCoarseModel`]. + + This configuration inherit from [`GenerationConfig`] and can be used to control the model generation. Read the + documentation from [`GenerationConfig`] for more information. + + Args: + renormalize_logits (`bool`, *optional*, defaults to `True`): + Whether to renormalize the logits after applying all the logits processors or warpers (including the + custom ones). It's highly recommended to set this flag to `True` as the search algorithms suppose the + score logits are normalized but some logit processors or warpers break the normalization. + output_scores (`bool`, *optional*, defaults to `False`): + Whether or not to return the prediction scores. See `scores` under returned tensors for more details. + return_dict_in_generate (`bool`, *optional*, defaults to `False`): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. + output_hidden_states (`bool`, *optional*, defaults to `False`): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors + for more details. + output_attentions (`bool`, *optional*, defaults to `False`): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under + returned tensors for more details. + temperature (`float`, *optional*, defaults to 1.0): + The value used to modulate the next token probabilities. + do_sample (`bool`, *optional*, defaults to `False`): + Whether or not to use sampling ; use greedy decoding otherwise. + coarse_semantic_pad_token (`int`, *optional*, defaults to 12_048): + Coarse semantic pad token. + coarse_rate_hz (`int`, *optional*, defaults to 75): + Coarse rate in Hertz. + n_coarse_codebooks (`int`, *optional*, defaults to 2): + Number of coarse codebooks. 
+ coarse_infer_token (`int`, *optional*, defaults to 12_050): + Coarse infer token. + max_coarse_input_length (`int`, *optional*, defaults to 256): + Max length of input coarse vector. + max_coarse_history (`int`, *optional*, defaults to 630): + Max length of the output of the coarse acoustics model used in the fine generation step. + sliding_window_len (`int`, *optional*, defaults to 60): + The coarse generation step uses a sliding window to generate raw audio. + """ + super().__init__( + temperature=temperature, + do_sample=do_sample, + renormalize_logits=renormalize_logits, + output_scores=output_scores, + return_dict_in_generate=return_dict_in_generate, + output_hidden_states=output_hidden_states, + output_attentions=output_attentions, + **kwargs, + ) + + self.coarse_semantic_pad_token = coarse_semantic_pad_token + self.coarse_rate_hz = coarse_rate_hz + self.n_coarse_codebooks = n_coarse_codebooks + self.coarse_infer_token = coarse_infer_token + self.max_coarse_input_length = max_coarse_input_length + self.max_coarse_history = max_coarse_history + self.sliding_window_len = sliding_window_len + + +class BarkFineGenerationConfig(GenerationConfig): + model_type = "fine_acoustics" + + def __init__( + self, + temperature=1.0, + max_fine_history_length=512, + max_fine_input_length=1024, + n_fine_codebooks=8, + **kwargs, + ): + """Class that holds a generation configuration for [`BarkFineModel`]. + + [`BarkFineModel`] is an autoencoder model, so should not usually be used for generation. However, under the + hood, it uses `temperature` when used by [`BarkModel`] + + This configuration inherit from [`GenerationConfig`] and can be used to control the model generation. Read the + documentation from [`GenerationConfig`] for more information. + + Args: + temperature (`float`, *optional*): + The value used to modulate the next token probabilities. + max_fine_history_length (`int`, *optional*, defaults to 512): + Max length of the fine history vector. + max_fine_input_length (`int`, *optional*, defaults to 1024): + Max length of fine input vector. + n_fine_codebooks (`int`, *optional*, defaults to 8): + Number of codebooks used. + """ + super().__init__(temperature=temperature) + + self.max_fine_history_length = max_fine_history_length + self.max_fine_input_length = max_fine_input_length + self.n_fine_codebooks = n_fine_codebooks + + def validate(self, **kwargs): + """ + Overrides GenerationConfig.validate because BarkFineGenerationConfig don't use any parameters outside + temperature. + """ + pass + + +class BarkGenerationConfig(GenerationConfig): + model_type = "bark" + is_composition = True + + # TODO (joao): nested from_dict + + def __init__( + self, + semantic_config: Dict = None, + coarse_acoustics_config: Dict = None, + fine_acoustics_config: Dict = None, + sample_rate=24_000, + codebook_size=1024, + **kwargs, + ): + """Class that holds a generation configuration for [`BarkModel`]. + + The [`BarkModel`] does not have a `generate` method, but uses this class to generate speeches with a nested + [`BarkGenerationConfig`] which uses [`BarkSemanticGenerationConfig`], [`BarkCoarseGenerationConfig`], + [`BarkFineGenerationConfig`]. + + This configuration inherit from [`GenerationConfig`] and can be used to control the model generation. Read the + documentation from [`GenerationConfig`] for more information. + + Args: + semantic_config (`Dict`, *optional*): + Semantic generation configuration. + coarse_acoustics_config (`Dict`, *optional*): + Coarse generation configuration. 
+ fine_acoustics_config (`Dict`, *optional*): + Fine generation configuration. + sample_rate (`int`, *optional*, defaults to 24_000): + Sample rate. + codebook_size (`int`, *optional*, defaults to 1024): + Vector length for each codebook. + """ + if semantic_config is None: + semantic_config = {} + logger.info("semantic_config is None. initializing the semantic model with default values.") + + if coarse_acoustics_config is None: + coarse_acoustics_config = {} + logger.info("coarse_acoustics_config is None. initializing the coarse model with default values.") + + if fine_acoustics_config is None: + fine_acoustics_config = {} + logger.info("fine_acoustics_config is None. initializing the fine model with default values.") + + self.semantic_config = BarkSemanticGenerationConfig(**semantic_config) + self.coarse_acoustics_config = BarkCoarseGenerationConfig(**coarse_acoustics_config) + self.fine_acoustics_config = BarkFineGenerationConfig(**fine_acoustics_config) + + self.sample_rate = sample_rate + self.codebook_size = codebook_size + + @classmethod + def from_sub_model_configs( + cls, + semantic_config: BarkSemanticGenerationConfig, + coarse_acoustics_config: BarkCoarseGenerationConfig, + fine_acoustics_config: BarkFineGenerationConfig, + **kwargs, + ): + r""" + Instantiate a [`BarkGenerationConfig`] (or a derived class) from bark sub-models generation configuration. + + Returns: + [`BarkGenerationConfig`]: An instance of a configuration object + """ + return cls( + semantic_config=semantic_config.to_dict(), + coarse_acoustics_config=coarse_acoustics_config.to_dict(), + fine_acoustics_config=fine_acoustics_config.to_dict(), + **kwargs, + ) + + def to_dict(self): + """ + Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`]. + + Returns: + `Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance, + """ + output = copy.deepcopy(self.__dict__) + + output["semantic_config"] = self.semantic_config.to_dict() + output["coarse_acoustics_config"] = self.coarse_acoustics_config.to_dict() + output["fine_acoustics_config"] = self.fine_acoustics_config.to_dict() + + output["model_type"] = self.__class__.model_type + return output diff --git a/src/transformers/models/bark/modeling_bark.py b/src/transformers/models/bark/modeling_bark.py new file mode 100644 index 00000000000000..57cccd43127fa8 --- /dev/null +++ b/src/transformers/models/bark/modeling_bark.py @@ -0,0 +1,1910 @@ +# coding=utf-8 +# Copyright 2023 The Suno AI Authors and The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
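Before the modeling code below, a short hedged sketch of how the nested generation configuration defined above is meant to compose: `BarkGenerationConfig.from_sub_model_configs` serializes each sub-model generation config via `to_dict` and re-instantiates it, and the overridden `to_dict` re-serializes the sub-configs so the nesting round-trips. The snippet only uses classes and defaults from this file and is an illustration, not part of the patch:

```python
from transformers.models.bark.generation_configuration_bark import (
    BarkCoarseGenerationConfig,
    BarkFineGenerationConfig,
    BarkGenerationConfig,
    BarkSemanticGenerationConfig,
)

# Build each sub-model generation config with the defaults defined above.
semantic = BarkSemanticGenerationConfig()
coarse = BarkCoarseGenerationConfig()
fine = BarkFineGenerationConfig()

# Compose the nested config; each sub-config is passed around internally as a plain dict.
generation_config = BarkGenerationConfig.from_sub_model_configs(semantic, coarse, fine)

# The overridden to_dict re-serializes every sub-config, so the nested structure survives save/load.
as_dict = generation_config.to_dict()
assert as_dict["model_type"] == "bark"
assert as_dict["semantic_config"]["semantic_vocab_size"] == 10_000      # default from BarkSemanticGenerationConfig
assert as_dict["coarse_acoustics_config"]["n_coarse_codebooks"] == 2    # default from BarkCoarseGenerationConfig
```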
+""" PyTorch BARK model.""" +import math +from typing import Dict, Optional, Tuple, Union + +import numpy as np +import torch +from torch import nn +from torch.nn import functional as F + +from ...generation.logits_process import ( + AlternatingCodebooksLogitsProcessor, + BarkEosPrioritizerLogitsProcessor, + SuppressTokensLogitsProcessor, +) +from ...modeling_attn_mask_utils import _prepare_4d_attention_mask +from ...modeling_outputs import CausalLMOutputWithPast, MaskedLMOutput +from ...modeling_utils import PreTrainedModel, get_parameter_device +from ...utils import ( + add_start_docstrings, + add_start_docstrings_to_model_forward, + is_accelerate_available, + is_flash_attn_2_available, + is_flash_attn_greater_or_equal_2_10, + logging, +) +from ..auto import AutoModel +from .configuration_bark import ( + BarkCoarseConfig, + BarkConfig, + BarkFineConfig, + BarkSemanticConfig, + BarkSubModelConfig, +) +from .generation_configuration_bark import ( + BarkCoarseGenerationConfig, + BarkFineGenerationConfig, + BarkSemanticGenerationConfig, +) + + +if is_flash_attn_2_available(): + from flash_attn import flash_attn_func, flash_attn_varlen_func + from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa + + +logger = logging.get_logger(__name__) + + +_CHECKPOINT_FOR_DOC = "suno/bark-small" +_CONFIG_FOR_DOC = "BarkConfig" + +BARK_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "suno/bark-small", + "suno/bark", + # See all Bark models at https://huggingface.co/models?filter=bark +] + + +# Copied from transformers.models.llama.modeling_llama._get_unpad_data +def _get_unpad_data(attention_mask): + seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32) + indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten() + max_seqlen_in_batch = seqlens_in_batch.max().item() + cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0)) + return ( + indices, + cu_seqlens, + max_seqlen_in_batch, + ) + + +class BarkSelfAttention(nn.Module): + # adapted from GPTNeoSelfAttention and Bark code + # BarkSelfAttention can have two attention type, i.e full attention or causal attention + + def __init__(self, config, is_causal=False): + super().__init__() + + # regularization + self.dropout = config.dropout + self.attn_dropout = nn.Dropout(config.dropout) + self.resid_dropout = nn.Dropout(config.dropout) + + self.embed_dim = config.hidden_size + self.num_heads = config.num_heads + self.head_dim = self.embed_dim // self.num_heads + + if config.hidden_size % config.num_heads != 0: + raise ValueError( + f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:" + f" {self.num_heads})." 
+ ) + + # key, query, value projections for all heads, but in a batch + self.att_proj = nn.Linear(config.hidden_size, 3 * config.hidden_size, bias=config.bias) + # output projection + self.out_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=config.bias) + + self.is_causal = is_causal + if is_causal: + block_size = config.block_size + bias = torch.tril(torch.ones((block_size, block_size), dtype=bool)).view(1, 1, block_size, block_size) + self.register_buffer("bias", bias) + + # Copied from transformers.models.gpt_neo.modeling_gpt_neo.GPTNeoSelfAttention._split_heads + def _split_heads(self, tensor, num_heads, attn_head_size): + """ + Splits hidden_size dim into attn_head_size and num_heads + """ + new_shape = tensor.size()[:-1] + (num_heads, attn_head_size) + tensor = tensor.view(new_shape) + return tensor.permute(0, 2, 1, 3) # (batch, head, seq_length, head_features) + + def _merge_heads(self, tensor, num_heads, attn_head_size): + """ + Merges attn_head_size dim and num_attn_heads dim into hidden_size + """ + + # re-assemble all head outputs side by side + # (batch, num_heads, seq_len, attn_head_size) -> (batch, seq_len, num_heads*attn_head_size) + tensor = tensor.transpose(1, 2).contiguous() + tensor = tensor.view(tensor.size()[:-2] + (num_heads * attn_head_size,)) + + return tensor + + def _attn(self, query, key, value, attention_mask=None, head_mask=None): + # unlike GPTNeo's SelfAttention, divide by the square root of the dimension of the query and the key + attn_weights = torch.matmul(query, key.transpose(-1, -2)) * (1.0 / math.sqrt(self.head_dim)) + + if self.is_causal: + query_length, key_length = query.size(-2), key.size(-2) + + # fill the upper left part of the attention weights with inf + attn_weights = attn_weights.masked_fill( + self.bias[:, :, key_length - query_length : key_length, :key_length] == 0, + torch.finfo(attn_weights.dtype).min, + ) + + if attention_mask is not None: + # Apply the attention mask + attn_weights = attn_weights + attention_mask + + attn_weights = nn.functional.softmax(attn_weights, dim=-1) + attn_weights = attn_weights.to(value.dtype) + attn_weights = self.attn_dropout(attn_weights) + + # Mask heads if we want to + if head_mask is not None: + attn_weights = attn_weights * head_mask + + # (batch, num_heads, seq_len, seq_len) x (batch, num_heads, seq_len, attn_head_size) + # -> (batch, num_heads, seq_len, attn_head_size) + attn_output = torch.matmul(attn_weights, value) + + return attn_output, attn_weights + + def forward( + self, + hidden_states, + attention_mask=None, + past_key_values=None, + head_mask=None, + use_cache=False, + output_attentions=False, + ): + # calculate query, key, values for all heads in batch and move head forward to be the batch dim + query, key, value = self.att_proj(hidden_states).split(self.embed_dim, dim=2) + + query = self._split_heads(query, self.num_heads, self.head_dim) + key = self._split_heads(key, self.num_heads, self.head_dim) + value = self._split_heads(value, self.num_heads, self.head_dim) + + if past_key_values is not None: + past_key = past_key_values[0] + past_value = past_key_values[1] + key = torch.cat((past_key, key), dim=-2) + value = torch.cat((past_value, value), dim=-2) + + if use_cache is True: + present = (key, value) + else: + present = None + + attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask) + + attn_output = self._merge_heads(attn_output, self.num_heads, self.head_dim) + attn_output = self.out_proj(attn_output) + attn_output = 
self.resid_dropout(attn_output) + + outputs = (attn_output, present) + if output_attentions: + outputs += (attn_weights,) + + return outputs + + +class BarkSelfFlashAttention2(BarkSelfAttention): + """ + Bark flash attention module. This module inherits from `BarkSelfAttention` as the weights of the module stays + untouched. The only required change would be on the forward pass where it needs to correctly call the public API of + flash attention and deal with padding tokens in case the input contains any of them. + """ + + # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2.__init__ + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + # TODO: Should be removed once Flash Attention for RoCm is bumped to 2.1. + # flash_attn<2.1 generates top-left aligned causal mask, while what is needed here is bottom-right alignement, that was made default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0. + # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left). + self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10() + + def _split_heads(self, tensor, num_heads, attn_head_size): + """ + Splits hidden_size dim into attn_head_size and num_heads + """ + new_shape = tensor.size()[:-1] + (num_heads, attn_head_size) + tensor = tensor.view(new_shape) + # Flash attention requires the input to have the shape + # batch_size x seq_length x head_dim x hidden_dim - (batch, seq_length, head, head_features) + return tensor + + def _merge_heads(self, tensor, num_heads, attn_head_size): + """ + Merges attn_head_size dim and num_attn_heads dim into hidden_size + """ + # re-assemble all head outputs side by side + # (batch, seq_len, num_heads, attn_head_size) -> (batch, seq_len, num_heads*attn_head_size) + tensor = tensor.view(tensor.size()[:-2] + (num_heads * attn_head_size,)) + return tensor + + def forward( + self, + hidden_states, + attention_mask=None, + past_key_values=None, + head_mask=None, + use_cache=False, + output_attentions=False, + ): + batch_size, query_len, _ = hidden_states.size() + + # calculate query, key, values for all heads in batch and move head forward to be the batch dim + query, key, value = self.att_proj(hidden_states).split(self.embed_dim, dim=2) + + query = self._split_heads(query, self.num_heads, self.head_dim) + key = self._split_heads(key, self.num_heads, self.head_dim) + value = self._split_heads(value, self.num_heads, self.head_dim) + + if past_key_values is not None: + # (batch, head, seq_length, head_features) -> (batch, seq_length, head, head_features) + past_key = past_key_values[0].transpose(1, 2) + past_value = past_key_values[1].transpose(1, 2) + # and merge on seq_length + key = torch.cat((past_key, key), dim=1) + value = torch.cat((past_value, value), dim=1) + + if use_cache is True: + # (batch, head, seq_length, head_features) + present = (key.transpose(1, 2), value.transpose(1, 2)) + else: + present = None + + attn_output = self._flash_attention_forward(query, key, value, attention_mask, query_len, dropout=self.dropout) + + attn_output = self._merge_heads(attn_output, self.num_heads, self.head_dim) + attn_output = self.out_proj(attn_output) + attn_output = self.resid_dropout(attn_output) + + outputs = (attn_output, present) + if output_attentions: + attn_weights = None + outputs += (attn_weights,) + + return outputs + + # Copied from 
transformers.models.llama.modeling_llama.LlamaFlashAttention2._flash_attention_forward + def _flash_attention_forward( + self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None + ): + """ + Calls the forward method of Flash Attention - if the input hidden states contain at least one padding token + first unpad the input, then computes the attention scores and pad the final attention scores. + + Args: + query_states (`torch.Tensor`): + Input query states to be passed to Flash Attention API + key_states (`torch.Tensor`): + Input key states to be passed to Flash Attention API + value_states (`torch.Tensor`): + Input value states to be passed to Flash Attention API + attention_mask (`torch.Tensor`): + The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the + position of padding tokens and 1 for the position of non-padding tokens. + dropout (`int`, *optional*): + Attention dropout + softmax_scale (`float`, *optional*): + The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim) + """ + if not self._flash_attn_uses_top_left_mask: + causal = self.is_causal + else: + # TODO: Remove the `query_length != 1` check once Flash Attention for RoCm is bumped to 2.1. For details, please see the comment in LlamaFlashAttention2 __init__. + causal = self.is_causal and query_length != 1 + + # Contains at least one padding token in the sequence + if attention_mask is not None: + batch_size = query_states.shape[0] + query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input( + query_states, key_states, value_states, attention_mask, query_length + ) + + cu_seqlens_q, cu_seqlens_k = cu_seq_lens + max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens + + attn_output_unpad = flash_attn_varlen_func( + query_states, + key_states, + value_states, + cu_seqlens_q=cu_seqlens_q, + cu_seqlens_k=cu_seqlens_k, + max_seqlen_q=max_seqlen_in_batch_q, + max_seqlen_k=max_seqlen_in_batch_k, + dropout_p=dropout, + softmax_scale=softmax_scale, + causal=causal, + ) + + attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length) + else: + attn_output = flash_attn_func( + query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=causal + ) + + return attn_output + + # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2._upad_input + def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length): + indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask) + batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape + + key_layer = index_first_axis( + key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k + ) + value_layer = index_first_axis( + value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k + ) + if query_length == kv_seq_len: + query_layer = index_first_axis( + query_layer.reshape(batch_size * kv_seq_len, self.num_heads, head_dim), indices_k + ) + cu_seqlens_q = cu_seqlens_k + max_seqlen_in_batch_q = max_seqlen_in_batch_k + indices_q = indices_k + elif query_length == 1: + max_seqlen_in_batch_q = 1 + cu_seqlens_q = torch.arange( + batch_size + 1, dtype=torch.int32, device=query_layer.device + ) # There is a memcpy here, that is very bad. + indices_q = cu_seqlens_q[:-1] + query_layer = query_layer.squeeze(1) + else: + # The -q_len: slice assumes left padding. 
+ attention_mask = attention_mask[:, -query_length:] + query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask) + + return ( + query_layer, + key_layer, + value_layer, + indices_q, + (cu_seqlens_q, cu_seqlens_k), + (max_seqlen_in_batch_q, max_seqlen_in_batch_k), + ) + + +BARK_ATTENTION_CLASSES = { + "eager": BarkSelfAttention, + "flash_attention_2": BarkSelfFlashAttention2, +} + + +class BarkLayerNorm(nn.Module): + """LayerNorm but with an optional bias. PyTorch doesn't support simply bias=False.""" + + def __init__(self, hidden_size, bias=True): + super().__init__() + self.weight = nn.Parameter(torch.ones(hidden_size)) + self.bias = nn.Parameter(torch.zeros(hidden_size)) if bias else None + + def forward(self, input): + return F.layer_norm(input, self.weight.shape, self.weight, self.bias, eps=1e-5) + + +class BarkMLP(nn.Module): + def __init__(self, config): + super().__init__() + self.in_proj = nn.Linear(config.hidden_size, 4 * config.hidden_size, bias=config.bias) + self.out_proj = nn.Linear(4 * config.hidden_size, config.hidden_size, bias=config.bias) + self.dropout = nn.Dropout(config.dropout) + self.gelu = nn.GELU() + + def forward(self, hidden_states): + hidden_states = self.in_proj(hidden_states) + hidden_states = self.gelu(hidden_states) + hidden_states = self.out_proj(hidden_states) + hidden_states = self.dropout(hidden_states) + return hidden_states + + +class BarkBlock(nn.Module): + def __init__(self, config, is_causal=False): + super().__init__() + + if is_causal: + # if causal, uses handmade LayerNorm, so that the layerNorm bias is optional + # this handmade layerNorm is used to stick with Bark choice of leaving optional bias in + # AutoRegressive models (corresponding to the "Text" and the "Coarse" modules) + self.layernorm_1 = BarkLayerNorm(config.hidden_size, bias=config.bias) + self.layernorm_2 = BarkLayerNorm(config.hidden_size, bias=config.bias) + else: + self.layernorm_1 = nn.LayerNorm(config.hidden_size) + self.layernorm_2 = nn.LayerNorm(config.hidden_size) + + self.attn = BARK_ATTENTION_CLASSES[config._attn_implementation](config, is_causal=is_causal) + + self.mlp = BarkMLP(config) + + def forward( + self, + hidden_states, + past_key_values=None, + attention_mask=None, + head_mask=None, + use_cache=False, + output_attentions=False, + ): + intermediary_hidden_states = self.layernorm_1(hidden_states) + + attn_outputs = self.attn( + intermediary_hidden_states, + past_key_values=past_key_values, + attention_mask=attention_mask, + head_mask=head_mask, + use_cache=use_cache, + output_attentions=output_attentions, + ) + + attn_output = attn_outputs[0] # output_attn: output, present_key_values, (attn_weights) + outputs = attn_outputs[1:] + + intermediary_hidden_states = hidden_states + attn_output + intermediary_hidden_states = intermediary_hidden_states + self.mlp( + self.layernorm_2(intermediary_hidden_states) + ) + + if use_cache: + outputs = (intermediary_hidden_states,) + outputs + else: + outputs = (intermediary_hidden_states,) + outputs[1:] + + return outputs # hidden_states, ((present), attentions) + + +class BarkPreTrainedModel(PreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. 
+ """ + + config_class = BarkConfig + supports_gradient_checkpointing = False + _supports_flash_attn_2 = True + + def _init_weights(self, module): + """Initialize the weights.""" + if isinstance(module, (nn.Linear,)): + # Slightly different from the TF version which uses truncated_normal for initialization + # cf https://github.com/pytorch/pytorch/pull/5617 + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.Embedding): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.padding_idx is not None: + module.weight.data[module.padding_idx].zero_() + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + + def __init__(self, *inputs, **kwargs): + super().__init__(*inputs, **kwargs) + + @property + def device(self) -> torch.device: + """ + `torch.device`: The device on which the module is (assuming that all the module parameters are on the same + device). + """ + + # if has _hf_hook, has been offloaded so the device has to be found in the hook + if not hasattr(self, "_hf_hook"): + return get_parameter_device(self) + for module in self.modules(): + if ( + hasattr(module, "_hf_hook") + and hasattr(module._hf_hook, "execution_device") + and module._hf_hook.execution_device is not None + ): + return torch.device(module._hf_hook.execution_device) + + return get_parameter_device(self) + + +BARK_MODEL_START_DOCSTRING = """ + This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the + library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads + etc.) + + This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. + Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage + and behavior. + + Parameters: + config ([`{config}`]): + Model configuration class with all the parameters of the model. Initializing with a config file does not + load the weights associated with the model, only the configuration. Check out the + [`~PreTrainedModel.from_pretrained`] method to load the model weights. +""" + + +BARK_START_DOCSTRING = r""" + This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the + library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads + etc.) + + This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. + Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage + and behavior. + + Parameters: + config ([`BarkConfig`]): + Model configuration class with all the parameters of the model. Initializing with a config file does not + load the weights associated with the model, only the configuration. Check out the + [`~PreTrainedModel.from_pretrained`] method to load the model weights. +""" + + +BARK_FINE_INPUTS_DOCSTRING = r""" + Args: + codebook_idx (`int`): + Index of the codebook that will be predicted. + input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length, number_of_codebooks)`): + Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide + it. 
Initially, indices of the first two codebooks are obtained from the `coarse` sub-model. The rest is + predicted recursively by attending the previously predicted channels. The model predicts on windows of + length 1024. + attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, + config.max_position_embeddings - 1]`. + + [What are position IDs?](../glossary#position-ids) + head_mask (`torch.Tensor` of shape `(encoder_layers, encoder_attention_heads)`, *optional*): + Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in `[0, 1]`: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): NOT IMPLEMENTED YET. + input_embeds (`torch.FloatTensor` of shape `(batch_size, input_sequence_length, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. If + `past_key_values` is used, optionally only the last `input_embeds` have to be input (see + `past_key_values`). This is useful if you want more control over how to convert `input_ids` indices into + associated vectors than the model's internal embedding lookup matrix. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + +BARK_CAUSAL_MODEL_INPUTS_DOCSTRING = r""" + Args: + input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`): + Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide + it. Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. [What are input IDs?](../glossary#input-ids) + past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache` is passed or when `config.use_cache=True`): + Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape + `(batch_size, num_heads, sequence_length, embed_size_per_head)`. + + Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see + `past_key_values` input) to speed up sequential decoding. + + If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that + don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all + `input_ids` of shape `(batch_size, sequence_length)`. + attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. 
Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, + config.max_position_embeddings - 1]`. + + [What are position IDs?](../glossary#position-ids) + head_mask (`torch.Tensor` of shape `(encoder_layers, encoder_attention_heads)`, *optional*): + Mask to nullify selected heads of the attention modules in the encoder. Mask values selected in `[0, 1]`: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + input_embeds (`torch.FloatTensor` of shape `(batch_size, input_sequence_length, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. + Here, due to `Bark` particularities, if `past_key_values` is used, `input_embeds` will be ignored and you + have to use `input_ids`. If `past_key_values` is not used and `use_cache` is set to `True`, `input_embeds` + is used in priority instead of `input_ids`. + use_cache (`bool`, *optional*): + If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see + `past_key_values`). + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. 
+""" + + +# GPT2-like autoregressive model +class BarkCausalModel(BarkPreTrainedModel): + config_class = BarkSubModelConfig + + def __init__(self, config): + super().__init__(config) + self.config = config + + # initialize as an autoregressive GPT-like model + self.input_embeds_layer = nn.Embedding(config.input_vocab_size, config.hidden_size) + self.position_embeds_layer = nn.Embedding(config.block_size, config.hidden_size) + + self.drop = nn.Dropout(config.dropout) + + self.layers = nn.ModuleList([BarkBlock(config, is_causal=True) for _ in range(config.num_layers)]) + self._use_flash_attention_2 = config._attn_implementation == "flash_attention_2" + + self.layernorm_final = BarkLayerNorm(config.hidden_size, bias=config.bias) + + self.lm_head = nn.Linear(config.hidden_size, config.output_vocab_size, bias=False) + self.gradient_checkpointing = False + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self): + return self.input_embeds_layer + + def set_input_embeddings(self, new_embeddings): + self.input_embeds_layer = new_embeddings + + def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs): + input_embeds = kwargs.get("input_embeds", None) + + attention_mask = kwargs.get("attention_mask", None) + position_ids = kwargs.get("position_ids", None) + + if past_key_values is not None: + # Omit tokens covered by past_key_values + seq_len = input_ids.shape[1] + past_length = past_key_values[0][0].shape[2] + + # Some generation methods already pass only the last input ID + if input_ids.shape[1] > past_length: + remove_prefix_length = past_length + else: + # Default to old behavior: keep only final ID + remove_prefix_length = input_ids.shape[1] - 1 + + input_ids = input_ids[:, remove_prefix_length:] + + # input_embeds have already been used and is not required anymore + input_embeds = None + else: + if input_embeds is not None and kwargs.get("use_cache"): + seq_len = input_embeds.shape[1] + else: + seq_len = input_ids.shape[1] + + # ensure that attention_mask and position_ids shapes are aligned with the weird Bark hack of reducing + # sequence length on the first forward pass + if attention_mask is not None: + attention_mask = attention_mask[:, :seq_len] + if position_ids is not None: + position_ids = position_ids[:, :seq_len] + + if attention_mask is not None and position_ids is None: + # create position_ids on the fly for batch generation + position_ids = attention_mask.long().cumsum(-1) - 1 + position_ids.masked_fill_(attention_mask == 0, 1) + if past_key_values: + position_ids = position_ids[:, -input_ids.shape[1] :] + else: + position_ids = None + + if input_embeds is not None and kwargs.get("use_cache"): + return { + "input_ids": None, + "input_embeds": input_embeds, + "past_key_values": past_key_values, + "use_cache": kwargs.get("use_cache"), + "position_ids": position_ids, + "attention_mask": attention_mask, + } + return { + "input_ids": input_ids, + "past_key_values": past_key_values, + "use_cache": kwargs.get("use_cache"), + "position_ids": position_ids, + "attention_mask": attention_mask, + } + + @add_start_docstrings_to_model_forward(BARK_CAUSAL_MODEL_INPUTS_DOCSTRING) + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + past_key_values: Optional[Tuple[torch.FloatTensor]] = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + labels: Optional[torch.LongTensor] = None, + input_embeds: 
Optional[torch.Tensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple[torch.Tensor], CausalLMOutputWithPast]: + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + use_cache = use_cache if use_cache is not None else self.config.use_cache + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + # Verify if input_embeds already exists + # then compute embeddings. + if input_ids is not None and input_embeds is not None: + raise ValueError("You cannot specify both input_ids and input_embeds at the same time") + elif input_embeds is not None and past_key_values is None: + # we want to return the input_embeds in priority so that it is in line with a weird hack + # of Bark which concatenate two bits of the input_embeds on the first forward pass of the semantic model + pass + elif input_ids is not None: + input_embeds = self.input_embeds_layer(input_ids) # token embeddings of shape (b, t, n_embd) + elif input_embeds is not None: + pass + else: + raise ValueError("You have to specify either input_ids or input_embeds") + + input_shape = input_embeds.size()[:-1] + batch_size = input_embeds.shape[0] + seq_length = input_shape[-1] + + device = input_ids.device if input_ids is not None else input_embeds.device + + if past_key_values is None: + past_length = 0 + past_key_values = tuple([None] * len(self.layers)) + else: + past_length = past_key_values[0][0].size(-2) + + if position_ids is None: + position_ids = torch.arange(past_length, seq_length + past_length, dtype=torch.long, device=device) + position_ids = position_ids.unsqueeze(0) # shape (1, seq_length) + + position_embeds = self.position_embeds_layer(position_ids) # position embeddings of shape (1, t, n_embd) + + # Attention mask. + if attention_mask is not None: + if batch_size <= 0: + raise ValueError("batch_size has to be defined and > 0") + if self._use_flash_attention_2: + attention_mask = attention_mask if 0 in attention_mask else None + else: + attention_mask = attention_mask.view(batch_size, -1) + # [bsz, to_seq_length] -> [bsz, 1, 1, to_seq_length] + # from_seq_length is 1 to easily broadcast + attention_mask = _prepare_4d_attention_mask(attention_mask, input_embeds.dtype, tgt_len=1) + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x num_heads x N x N + # head_mask has shape num_layers x batch x num_heads x N x N + head_mask = self.get_head_mask(head_mask, self.config.num_layers) + + hidden_states = self.drop(input_embeds + position_embeds) + output_shape = input_shape + (hidden_states.size(-1),) + + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning_once( + "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." 
+ ) + use_cache = False + + present_key_values = () if use_cache else None + all_self_attentions = () if output_attentions else None + all_hidden_states = () if output_hidden_states else None + + for i, (block, past_layer_key_values) in enumerate(zip(self.layers, past_key_values)): + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if self.gradient_checkpointing and self.training: + outputs = self._gradient_checkpointing_func( + block.__call__, + hidden_states, + None, + attention_mask, + head_mask[i], + use_cache, + output_attentions, + ) + else: + outputs = block( + hidden_states, + past_key_values=past_layer_key_values, + attention_mask=attention_mask, + head_mask=head_mask[i], + use_cache=use_cache, + output_attentions=output_attentions, + ) + + hidden_states = outputs[0] + + if use_cache: + present_key_values = present_key_values + (outputs[1],) + + if output_attentions: + all_self_attentions = all_self_attentions + (outputs[2 if use_cache else 1],) + + hidden_states = self.layernorm_final(hidden_states) + + hidden_states = hidden_states.view(output_shape) + + # Add last hidden state + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + logits = self.lm_head(hidden_states) + + loss = None + if labels is not None: + raise NotImplementedError( + "Training is not implemented yet for Bark - ensure you do not pass `labels` to the model." + ) + + if not return_dict: + return tuple( + v for v in [None, logits, present_key_values, all_hidden_states, all_self_attentions] if v is not None + ) + + return CausalLMOutputWithPast( + loss=loss, + logits=logits, + past_key_values=present_key_values, + hidden_states=all_hidden_states, + attentions=all_self_attentions, + ) + + @staticmethod + def _reorder_cache( + past_key_values: Tuple[Tuple[torch.Tensor]], beam_idx: torch.Tensor + ) -> Tuple[Tuple[torch.Tensor]]: + """ + This function is used to re-order the `past_key_values` cache if [`~PreTrainedModel.beam_search`] or + [`~PreTrainedModel.beam_sample`] is called. This is required to match `past_key_values` with the correct + beam_idx at every generation step. + """ + # Necessary for beam_search + return tuple( + tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past) + for layer_past in past_key_values + ) + + +@add_start_docstrings( + """Bark semantic (or text) model. It shares the same architecture as the coarse model. + It is a GPT-2 like autoregressive model with a language modeling head on top.""", + BARK_MODEL_START_DOCSTRING.format(config="BarkSemanticConfig"), +) +class BarkSemanticModel(BarkCausalModel): + base_model_prefix = "semantic" + config_class = BarkSemanticConfig + + def generate( + self, + input_ids: torch.Tensor, + semantic_generation_config: BarkSemanticGenerationConfig = None, + history_prompt: Optional[Dict[str, torch.Tensor]] = None, + attention_mask: Optional[torch.Tensor] = None, + **kwargs, + ) -> torch.LongTensor: + """ + Generates text semantic tokens from an input prompt and an additional optional `Bark` speaker prompt. + + Args: + input_ids (`Optional[torch.Tensor]` of shape (batch_size, seq_len), *optional*): + Input ids, i.e tokenized input sentences. Will be truncated up to + semantic_generation_config.max_input_semantic_length tokens. Note that the output audios will be as + long as the longest generation among the batch. + semantic_generation_config (`BarkSemanticGenerationConfig`): + Generation config indicating how to generate the semantic tokens. 
+ history_prompt (`Optional[Dict[str,torch.Tensor]]`, *optional*): + Optional `Bark` speaker prompt. + attention_mask (`Optional[torch.Tensor]`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + Returns: + torch.LongTensor: Output semantic tokens. + """ + if semantic_generation_config is None: + raise ValueError("`semantic_generation_config` has to be provided") + + batch_size = input_ids.shape[0] + + max_input_semantic_length = semantic_generation_config.max_input_semantic_length + + input_ids = input_ids + semantic_generation_config.text_encoding_offset + + if attention_mask is not None: + input_ids = input_ids.masked_fill((1 - attention_mask).bool(), semantic_generation_config.text_pad_token) + + if history_prompt is not None: + semantic_history = history_prompt["semantic_prompt"][-max_input_semantic_length:] + semantic_history = nn.functional.pad( + semantic_history, + (0, max_input_semantic_length - len(semantic_history)), + value=semantic_generation_config.semantic_pad_token, + mode="constant", + ) + else: + semantic_history = torch.tensor( + [semantic_generation_config.semantic_pad_token] * max_input_semantic_length, dtype=torch.int + ).to(self.device) + + semantic_history = torch.repeat_interleave(semantic_history[None], batch_size, dim=0) + + infer_array = torch.tensor( + [[semantic_generation_config.semantic_infer_token]] * batch_size, dtype=torch.int + ).to(self.device) + + input_embeds = torch.cat( + [ + self.input_embeds_layer(input_ids[:, :max_input_semantic_length]) + + self.input_embeds_layer(semantic_history[:, : max_input_semantic_length + 1]), + self.input_embeds_layer(infer_array), + ], + dim=1, + ) + + tokens_to_suppress = list( + range(semantic_generation_config.semantic_vocab_size, semantic_generation_config.semantic_pad_token) + ) + tokens_to_suppress.extend( + list(range(semantic_generation_config.semantic_pad_token + 1, self.config.output_vocab_size)) + ) + + suppress_tokens_logits_processor = SuppressTokensLogitsProcessor(tokens_to_suppress) + + min_eos_p = kwargs.get("min_eos_p", semantic_generation_config.min_eos_p) + early_stopping_logits_processor = BarkEosPrioritizerLogitsProcessor( + eos_token_id=semantic_generation_config.eos_token_id, min_eos_p=min_eos_p + ) + + # pass input_ids in order to stay consistent with the transformers generate method even though it is not used + # (except to get the input seq_len - that's why we keep the first 257 tokens) + semantic_output = super().generate( + torch.ones((batch_size, max_input_semantic_length + 1), dtype=torch.int).to(self.device), + input_embeds=input_embeds, + logits_processor=[suppress_tokens_logits_processor, early_stopping_logits_processor], + generation_config=semantic_generation_config, + **kwargs, + ) # size: 10048 + + # take the generated semantic tokens + semantic_output = semantic_output[:, max_input_semantic_length + 1 :] + + return semantic_output + + +@add_start_docstrings( + """Bark coarse acoustics model. + It shares the same architecture as the semantic (or text) model. 
It is a GPT-2 like autoregressive model with a + language modeling head on top.""", + BARK_MODEL_START_DOCSTRING.format(config="BarkCoarseConfig"), +) +class BarkCoarseModel(BarkCausalModel): + base_model_prefix = "coarse_acoustics" + config_class = BarkCoarseConfig + + def preprocess_histories( + self, + max_coarse_history: int, + semantic_to_coarse_ratio: int, + batch_size: int, + semantic_generation_config: int, + codebook_size: int, + history_prompt: Optional[Dict[str, torch.Tensor]] = None, + ): + """ + Preprocess the optional `Bark` speaker prompts before `self.generate`. + + Args: + max_coarse_history (`int`): + Maximum size of coarse tokens used. + semantic_to_coarse_ratio (`int`): + Ratio of semantic to coarse frequency. + batch_size (`int`): + Batch size, i.e. the number of samples. + semantic_generation_config (`BarkSemanticGenerationConfig`): + Generation config indicating how to generate the semantic tokens. + codebook_size (`int`): + Codebook channel size, i.e. the size of the output vocabulary per codebook channel. + history_prompt (`Optional[Dict[str,torch.Tensor]]`): + Optional `Bark` speaker prompt. + Returns: + `tuple(torch.FloatTensor)`: + - **x_semantic_history** (`torch.FloatTensor`) -- Processed semantic speaker prompt. + - **x_coarse_history** (`torch.FloatTensor`) -- Processed coarse speaker prompt. + """ + if history_prompt is not None: + x_semantic_history = torch.repeat_interleave(history_prompt["semantic_prompt"][None], batch_size, dim=0) + # clone to avoid modifying history_prompt.coarse_prompt + x_coarse_history = history_prompt["coarse_prompt"].clone() + + # offset x_coarse_history + if codebook_size is not None: + for n in range(1, x_coarse_history.shape[0]): + # offset + x_coarse_history[n, :] += codebook_size * n + + # flatten x_coarse_history + x_coarse_history = torch.transpose(x_coarse_history, 0, 1).view(-1) + + x_coarse_history = x_coarse_history + semantic_generation_config.semantic_vocab_size + + x_coarse_history = torch.repeat_interleave(x_coarse_history[None], batch_size, dim=0) + # e.g.: after SEMANTIC_VOCAB_SIZE (10000), 1024 tokens dedicated to first codebook, 1024 next tokens + # dedicated to second codebook. 
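+ # Illustrative example (assuming semantic_vocab_size == 10000 and codebook_size == 1024): a raw id of 5 in the + # second coarse codebook (n == 1) becomes 5 + 1024 * 1 + 10000 = 11029 after the two offsets above.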
+ + max_semantic_history = int(np.floor(max_coarse_history / semantic_to_coarse_ratio)) + # trim histories correctly + n_semantic_hist_provided = min( + [ + max_semantic_history, + x_semantic_history.shape[1] - x_semantic_history.shape[1] % 2, + int(np.floor(x_coarse_history.shape[1] / semantic_to_coarse_ratio)), + ] + ) + + n_coarse_hist_provided = int(round(n_semantic_hist_provided * semantic_to_coarse_ratio)) + + x_semantic_history = x_semantic_history[:, -n_semantic_hist_provided:].int() + x_coarse_history = x_coarse_history[:, -n_coarse_hist_provided:].int() + # bit of a hack for time alignment (sounds better) - from Bark original implementation + x_coarse_history = x_coarse_history[:, :-2] + + else: + # shape: (batch_size, 0) + x_semantic_history = torch.tensor([[]] * batch_size, dtype=torch.int).to(self.device) + x_coarse_history = torch.tensor([[]] * batch_size, dtype=torch.int).to(self.device) + + return x_semantic_history, x_coarse_history + + def generate( + self, + semantic_output: torch.Tensor, + semantic_generation_config: BarkSemanticGenerationConfig = None, + coarse_generation_config: BarkCoarseGenerationConfig = None, + codebook_size: int = 1024, + history_prompt: Optional[Dict[str, torch.Tensor]] = None, + return_output_lengths: Optional[bool] = None, + **kwargs, + ) -> Union[torch.LongTensor, Tuple[torch.LongTensor, torch.LongTensor]]: + """ + Generates coarse acoustics tokens from input text semantic tokens and an additional optional `Bark` speaker + prompt. + + Args: + semantic_output (`torch.Tensor` of shape (batch_size, seq_len), *optional*): + Input text semantic ids, i.e the output of `BarkSemanticModel.generate`. + semantic_generation_config (`BarkSemanticGenerationConfig`): + Generation config indicating how to generate the semantic tokens. + coarse_generation_config (`BarkCoarseGenerationConfig`): + Generation config indicating how to generate the coarse tokens. + codebook_size (`int`, *optional*, defaults to 1024): + Codebook channel size, i.e. the size of the output vocabulary per codebook channel. + history_prompt (`Optional[Dict[str,torch.Tensor]]`, *optional*): + Optional `Bark` speaker prompt. + return_output_lengths (`bool`, *optional*): + Whether or not to return the output lengths. Useful when batching. + Returns: + By default: + torch.LongTensor: Output coarse acoustics tokens. + If `return_output_lengths=True`: + `Tuple(torch.Tensor, torch.Tensor): The output coarse acoustics tokens, and the length of each sample + of the batch. 
+ """ + + if semantic_generation_config is None: + raise ValueError("`semantic_generation_config` has to be provided") + + if coarse_generation_config is None: + raise ValueError("`coarse_generation_config` has to be provided") + + max_coarse_input_length = coarse_generation_config.max_coarse_input_length + max_coarse_history = coarse_generation_config.max_coarse_history + sliding_window_len = coarse_generation_config.sliding_window_len + + # replace semantic_pad_token (eos_tok and pad_tok here) with coarse_semantic_pad_token i.e the pad_token + # used in the next model + semantic_output.masked_fill_( + semantic_output == semantic_generation_config.semantic_pad_token, + coarse_generation_config.coarse_semantic_pad_token, + ) + + semantic_to_coarse_ratio = ( + coarse_generation_config.coarse_rate_hz + / semantic_generation_config.semantic_rate_hz + * coarse_generation_config.n_coarse_codebooks + ) + max_semantic_history = int(np.floor(max_coarse_history / semantic_to_coarse_ratio)) + + output_lengths = (semantic_output != coarse_generation_config.coarse_semantic_pad_token).sum(1) + output_lengths = torch.floor( + output_lengths * semantic_to_coarse_ratio / coarse_generation_config.n_coarse_codebooks + ) + output_lengths = torch.round(output_lengths * coarse_generation_config.n_coarse_codebooks).int() + + max_generated_len = torch.max(output_lengths).item() + + batch_size = semantic_output.shape[0] + + x_semantic_history, x_coarse = self.preprocess_histories( + history_prompt=history_prompt, + max_coarse_history=max_coarse_history, + semantic_to_coarse_ratio=semantic_to_coarse_ratio, + batch_size=batch_size, + semantic_generation_config=semantic_generation_config, + codebook_size=codebook_size, + ) + base_semantic_idx = x_semantic_history.shape[1] + + semantic_output = torch.hstack([x_semantic_history, semantic_output]) + + n_window_steps = int(np.ceil(max_generated_len / sliding_window_len)) + + total_generated_len = 0 + + len_coarse_history = x_coarse.shape[1] + + for _ in range(n_window_steps): + semantic_idx = base_semantic_idx + int(round(total_generated_len / semantic_to_coarse_ratio)) + + # pad from right side + input_coarse = semantic_output[:, np.max([0, semantic_idx - max_semantic_history]) :] + input_coarse = input_coarse[:, :max_coarse_input_length] + input_coarse = F.pad( + input_coarse, + (0, max_coarse_input_length - input_coarse.shape[-1]), + "constant", + coarse_generation_config.coarse_semantic_pad_token, + ) + + input_coarse = torch.hstack( + [ + input_coarse, + torch.tensor([[coarse_generation_config.coarse_infer_token]] * batch_size).to(self.device), + x_coarse[:, -max_coarse_history:], + ] + ) + + alternatingLogitsProcessor = AlternatingCodebooksLogitsProcessor( + input_coarse.shape[1], + semantic_generation_config.semantic_vocab_size, + codebook_size, + ) + + output_coarse = super().generate( + input_coarse, + logits_processor=[alternatingLogitsProcessor], + max_new_tokens=min(sliding_window_len, max_generated_len - total_generated_len), + generation_config=coarse_generation_config, + **kwargs, + ) + + input_coarse_len = input_coarse.shape[1] + + x_coarse = torch.hstack([x_coarse, output_coarse[:, input_coarse_len:]]) + total_generated_len = x_coarse.shape[1] - len_coarse_history + + del output_coarse + + coarse_output = x_coarse[:, len_coarse_history:] + + if return_output_lengths: + return coarse_output, output_lengths + + return coarse_output + + +@add_start_docstrings( + """Bark fine acoustics model. 
It is a non-causal GPT-like model with `config.n_codes_total` embedding layers and + language modeling heads, one for each codebook.""", + BARK_MODEL_START_DOCSTRING.format(config="BarkFineConfig"), +) +class BarkFineModel(BarkPreTrainedModel): + base_model_prefix = "fine_acoustics" + config_class = BarkFineConfig + main_input_name = "codebook_idx" + + def __init__(self, config): + # non-causal gpt-like model with one embedding layer and one lm_head for each codebook of Encodec + super().__init__(config) + self.config = config + + # initialize a modified non causal GPT-like model + # note that for there is one embedding layer and one lm_head for each codebook of Encodec + self.input_embeds_layers = nn.ModuleList( + [nn.Embedding(config.input_vocab_size, config.hidden_size) for _ in range(config.n_codes_total)] + ) + self.position_embeds_layer = nn.Embedding(config.block_size, config.hidden_size) + + self.drop = nn.Dropout(config.dropout) + + self.layers = nn.ModuleList([BarkBlock(config, is_causal=False) for _ in range(config.num_layers)]) + self._use_flash_attention_2 = config._attn_implementation == "flash_attention_2" + + self.layernorm_final = nn.LayerNorm(config.hidden_size) + + self.lm_heads = nn.ModuleList( + [ + nn.Linear(config.hidden_size, config.output_vocab_size, bias=False) + for _ in range(config.n_codes_given, config.n_codes_total) + ] + ) + self.gradient_checkpointing = False + self.n_codes_total = config.n_codes_total + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self): + # one embedding layers for each codebook + return self.input_embeds_layers + + def set_input_embeddings(self, new_embeddings): + # one embedding layers for each codebook + self.input_embeds_layers = new_embeddings + + def get_output_embeddings(self): + # one lm_head for each codebook + return self.lm_heads + + def set_output_embeddings(self, new_output_embeddings): + # one lm_head for each codebook + self.lm_heads = new_output_embeddings + + def _resize_token_embeddings(self, new_num_tokens, pad_to_multiple_of=None): + old_embeddings_list = self.get_input_embeddings() + new_embeddings_list = nn.ModuleList( + [ + self._get_resized_embeddings(old_embeddings, new_num_tokens, pad_to_multiple_of) + for old_embeddings in old_embeddings_list + ] + ) + self.set_input_embeddings(new_embeddings_list) + new_num_tokens = new_embeddings_list[0].weight.shape[0] + + # if word embeddings are not tied, make sure that lm head is resized as well + if self.get_output_embeddings() is not None and not self.config.tie_word_embeddings: + old_lm_head_list = self.get_output_embeddings() + new_lm_head_list = nn.ModuleList( + [self._get_resized_lm_head(old_lm_head, new_num_tokens) for old_lm_head in old_lm_head_list] + ) + self.set_output_embeddings(new_lm_head_list) + + return self.get_input_embeddings() + + def resize_token_embeddings( + self, new_num_tokens: Optional[int] = None, pad_to_multiple_of: Optional[int] = None + ) -> nn.Embedding: + """ + Resizes input token embeddings matrix of the model if `new_num_tokens != config.vocab_size`. + + Takes care of tying weights embeddings afterwards if the model class has a `tie_weights()` method. + + Arguments: + new_num_tokens (`int`, *optional*): + The number of new tokens in the embedding matrix. Increasing the size will add newly initialized + vectors at the end. Reducing the size will remove vectors from the end. 
If not provided or `None`, just + returns a pointer to the input tokens `torch.nn.Embedding` module of the model without doing anything. + pad_to_multiple_of (`int`, *optional*): + If set will pad the embedding matrix to a multiple of the provided value. + + This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability + `>= 7.5` (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128. For more + details about this, or help on choosing the correct value for resizing, refer to this guide: + https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc + + Return: + `torch.nn.Embedding`: Pointer to the input tokens Embeddings Module of the model. + """ + model_embeds = self._resize_token_embeddings(new_num_tokens, pad_to_multiple_of) + if new_num_tokens is None and pad_to_multiple_of is None: + return model_embeds + + # Update base model and current model config + self.config.output_vocab_size = model_embeds[0].weight.shape[0] + self.config.vocab_size = model_embeds[0].weight.shape[0] + self.output_vocab_size = model_embeds[0].weight.shape[0] + self.vocab_size = model_embeds[0].weight.shape[0] + + # Tie weights again if needed + self.tie_weights() + + return model_embeds + + def tie_weights(self): + """ + Tie the weights between the input embeddings list and the output embeddings list. + + If the `torchscript` flag is set in the configuration, can't handle parameter sharing so we are cloning the + weights instead. + """ + if getattr(self.config, "tie_word_embeddings", True): + self._tied_weights_keys = [] + output_embeddings = self.get_output_embeddings() + input_embeddings = self.get_input_embeddings() + + for i in range(self.config.n_codes_total - self.config.n_codes_given): + # self.input_embeds_layers[i + 1].weight = self.lm_heads[i].weight + self._tie_or_clone_weights(output_embeddings[i], input_embeddings[i + 1]) + self._tied_weights_keys.append(f"lm_heads.{i}.weight") + + for module in self.modules(): + if hasattr(module, "_tie_weights"): + module._tie_weights() + + @add_start_docstrings_to_model_forward(BARK_FINE_INPUTS_DOCSTRING) + def forward( + self, + codebook_idx: int, # an additionnal idx corresponding to the id of the codebook that will be predicted + input_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + labels: Optional[torch.LongTensor] = None, + input_embeds: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple[torch.Tensor], MaskedLMOutput]: + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if codebook_idx == 0: + raise ValueError("Cannot predict 0th codebook - 0th codebook should be predicted by the coarse model") + + if input_ids is not None and input_embeds is not None: + raise ValueError("You cannot specify both input_ids and input_embeds at the same time") + + if input_ids is None and input_embeds is None: + raise ValueError("You have to specify either input_ids or input_embeds") + + if input_ids is not None: + # the 
input_embeddings are the sum of the j previous codebooks embeddings before + # the current codebook_idx codebook + + # forward the GPT model itself + input_embeds = [ + input_embeds_layer(input_ids[:, :, i]).unsqueeze(-1) + for i, input_embeds_layer in enumerate(self.input_embeds_layers) + ] # token embeddings of shape (b, t, n_embd) + input_embeds = torch.cat(input_embeds, dim=-1) + input_embeds = input_embeds[:, :, :, : codebook_idx + 1].sum(dim=-1) + + input_shape = input_embeds.size()[:-1] + batch_size = input_embeds.shape[0] + seq_length = input_shape[1] + + device = input_ids.device if input_ids is not None else input_embeds.device + + if position_ids is None: + position_ids = torch.arange(0, seq_length, dtype=torch.long, device=device) + position_ids = position_ids.unsqueeze(0) # shape (1, seq_length) + + position_embeds = self.position_embeds_layer(position_ids) # position embeddings of shape (1, t, n_embd) + + # Attention mask. + if attention_mask is not None: + if batch_size <= 0: + raise ValueError("batch_size has to be defined and > 0") + if self._use_flash_attention_2: + attention_mask = attention_mask if 0 in attention_mask else None + else: + # [bsz, to_seq_length] -> [bsz, 1, 1, to_seq_length] + # from_seq_length is 1 to easily broadcast + attention_mask = _prepare_4d_attention_mask(attention_mask, input_embeds.dtype, tgt_len=1) + + head_mask = self.get_head_mask(head_mask, self.config.num_layers) + + hidden_states = self.drop(input_embeds + position_embeds) + output_shape = input_shape + (hidden_states.size(-1),) + + all_self_attentions = () if output_attentions else None + all_hidden_states = () if output_hidden_states else None + + for i, block in enumerate(self.layers): + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + outputs = block( + hidden_states, + attention_mask=attention_mask, + head_mask=head_mask[i], + output_attentions=output_attentions, + ) + + hidden_states = outputs[0] + + if output_attentions: + all_self_attentions = all_self_attentions + (outputs[1],) + + hidden_states = self.layernorm_final(hidden_states) + hidden_states = hidden_states.view(output_shape) + + # Add last hidden state + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + logits = self.lm_heads[codebook_idx - self.config.n_codes_given](hidden_states) + + loss = None + if labels is not None: + raise NotImplementedError("Training is not implemented yet") + + if not return_dict: + return tuple(v for v in [None, logits, all_hidden_states, all_self_attentions] if v is not None) + + return MaskedLMOutput( + loss=loss, + logits=logits, + hidden_states=all_hidden_states, + attentions=all_self_attentions, + ) + + def generate( + self, + coarse_output: torch.Tensor, + semantic_generation_config: BarkSemanticGenerationConfig = None, + coarse_generation_config: BarkCoarseGenerationConfig = None, + fine_generation_config: BarkFineGenerationConfig = None, + codebook_size: int = 1024, + history_prompt: Optional[Dict[str, torch.Tensor]] = None, + **kwargs, + ) -> torch.LongTensor: + """ + Generates fine acoustics tokens from input coarse acoustics tokens and an additional optional `Bark` speaker + prompt. + + Args: + coarse_output (`torch.Tensor` of shape (batch_size, seq_len)): + Input coarse acoustics ids, i.e the output of `BarkCoarseModel.generate`. + semantic_generation_config (`BarkSemanticGenerationConfig`): + Generation config indicating how to generate the semantic tokens. 
+ coarse_generation_config (`BarkCoarseGenerationConfig`): + Generation config indicating how to generate the coarse tokens. + fine_generation_config (`BarkFineGenerationConfig`): + Generation config indicating how to generate the fine tokens. + codebook_size (`int`, *optional*, defaults to 1024): + Codebook channel size, i.e. the size of the output vocabulary per codebook channel. + history_prompt (`Optional[Dict[str,torch.Tensor]]`, *optional*): + Optional `Bark` speaker prompt. + Returns: + torch.LongTensor: Output fine acoustics tokens. + """ + if semantic_generation_config is None: + raise ValueError("`semantic_generation_config` has to be provided") + + if coarse_generation_config is None: + raise ValueError("`coarse_generation_config` has to be provided") + + if fine_generation_config is None: + raise ValueError("`fine_generation_config` has to be provided") + + # since we don't really use GenerationConfig through the fine model (autoencoder) + # and since only temperature is used from the classic GenerationConfig parameters + # manually impose the kwargs priority over the generation config + temperature = kwargs.get("temperature", fine_generation_config.temperature) + + max_fine_history_length = fine_generation_config.max_fine_history_length + max_fine_input_length = fine_generation_config.max_fine_input_length + + # shape: (batch, n_coarse_codebooks * seq_len) + # new_shape: (batch, seq_len, n_coarse_codebooks) + coarse_output = coarse_output.view(coarse_output.shape[0], -1, coarse_generation_config.n_coarse_codebooks) + + # brings ids into the range [0, codebook_size -1] + coarse_output = torch.remainder(coarse_output - semantic_generation_config.semantic_vocab_size, codebook_size) + batch_size = coarse_output.shape[0] + + if history_prompt is not None: + x_fine_history = torch.repeat_interleave(history_prompt["fine_prompt"].T[None], batch_size, dim=0) + # transpose to get to shape (seq_len, n_fine_codebooks) + else: + x_fine_history = None + + n_coarse = coarse_generation_config.n_coarse_codebooks + + # pad the last 6th codebooks + fine_input = F.pad( + coarse_output, + (0, fine_generation_config.n_fine_codebooks - n_coarse), + "constant", + codebook_size, + ) + + # prepend history if available (max max_fine_history_length) + if x_fine_history is not None: + fine_input = torch.cat([x_fine_history[:, -max_fine_history_length:, :], fine_input], dim=1) + + # len of the fine_history that has been added to fine_input + n_history = x_fine_history[:, -max_fine_history_length:, :].shape[1] + else: + n_history = 0 + + n_remove_from_end = 0 + # need to pad if too short (since non-causal model) + if fine_input.shape[1] < max_fine_input_length: + n_remove_from_end = max_fine_input_length - fine_input.shape[1] + fine_input = F.pad(fine_input, (0, 0, 0, n_remove_from_end), mode="constant", value=codebook_size) + + # we can be lazy about fractional loop and just keep overwriting codebooks. + # seems that coarse_output.shape[1] - (max_fine_input_length - n_history) is equal to minus n_remove_from_end + # So if we needed to pad because too short, n_loops is always 1 (because n_remove_from_end > 0) + # If not, we loop over at least twice. 
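+ # Worked example with illustrative values (max_fine_input_length == 1024, max_fine_history_length == 512, + # n_history == 0): a coarse_output of length 1500 gives n_loops = max(0, ceil((1500 - 1024) / 512)) + 1 = 2.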
+ + n_loops = (coarse_output.shape[1] - (max_fine_input_length - n_history)) / max_fine_history_length + n_loops = int(np.ceil(n_loops)) + n_loops = max(0, n_loops) + 1 + + for n_outer in range(n_loops): + start_idx = min([n_outer * max_fine_history_length, fine_input.shape[1] - max_fine_input_length]) + + start_fill_idx = min( + [n_history + n_outer * max_fine_history_length, fine_input.shape[1] - max_fine_history_length] + ) + rel_start_fill_idx = start_fill_idx - start_idx + input_buffer = fine_input[:, start_idx : start_idx + max_fine_input_length, :] + for n_inner in range(n_coarse, fine_generation_config.n_fine_codebooks): + logits = self.forward(n_inner, input_buffer).logits + if temperature is None or temperature == 1.0: + relevant_logits = logits[:, rel_start_fill_idx:, :codebook_size] + codebook_preds = torch.argmax(relevant_logits, -1) + else: + relevant_logits = logits[:, :, :codebook_size] / temperature + # apply softmax + probs = F.softmax(relevant_logits, dim=-1)[:, rel_start_fill_idx:max_fine_input_length] + # reshape to 2D: (batch_size, seq_len, codebook_size) -> (batch_size*seq_len, codebook_size) + probs = probs.reshape((-1, codebook_size)) + # multinomial then reshape: (batch_size*seq_len) -> (batch_size, seq_len) + codebook_preds = torch.multinomial(probs, num_samples=1).view(batch_size, -1) + codebook_preds = codebook_preds.to(torch.int32) + input_buffer[:, rel_start_fill_idx:, n_inner] = codebook_preds + del logits, codebook_preds + + # transfer into fine_input + for n_inner in range(n_coarse, fine_generation_config.n_fine_codebooks): + fine_input[ + :, start_fill_idx : start_fill_idx + (max_fine_input_length - rel_start_fill_idx), n_inner + ] = input_buffer[:, rel_start_fill_idx:, n_inner] + del input_buffer + + fine_input = fine_input.transpose(1, 2)[:, :, n_history:] + if n_remove_from_end > 0: + fine_input = fine_input[:, :, :-n_remove_from_end] + + if fine_input.shape[-1] != coarse_output.shape[-2]: + raise ValueError("input and output should have the same seq_len") + + return fine_input + + +@add_start_docstrings( + """ + The full Bark model, a text-to-speech model composed of 4 sub-models: + - [`BarkSemanticModel`] (also referred to as the 'text' model): a causal auto-regressive transformer model that + takes tokenized text as input and predicts semantic text tokens that capture the meaning of the text. + - [`BarkCoarseModel`] (also referred to as the 'coarse acoustics' model), also a causal autoregressive transformer, + that takes as input the results of the previous model. It aims at predicting the first two audio codebooks needed + by `encodec`. + - [`BarkFineModel`] (the 'fine acoustics' model), this time a non-causal autoencoder transformer, which iteratively + predicts the last codebooks based on the sum of the previous codebook embeddings. + - once all the codebook channels of the [`EncodecModel`] have been predicted, Bark uses it to decode the output + audio array. + + It should be noted that each of the first three modules can support conditional speaker embeddings to condition the + output sound according to a specific predefined voice. 
+ """, + BARK_START_DOCSTRING, +) +class BarkModel(BarkPreTrainedModel): + config_class = BarkConfig + + def __init__(self, config): + super().__init__(config) + + self.semantic = BarkSemanticModel(config.semantic_config) + self.coarse_acoustics = BarkCoarseModel(config.coarse_acoustics_config) + self.fine_acoustics = BarkFineModel(config.fine_acoustics_config) + + self.codec_model = AutoModel.from_config(config.codec_config) + + self.config = config + + @property + def device(self) -> torch.device: + """ + `torch.device`: The device on which the module is (assuming that all the module parameters are on the same + device). + """ + # for bark_model, device must be verified on its sub-models + # if has _hf_hook, has been offloaded so the device has to be found in the hook + if not hasattr(self.semantic, "_hf_hook"): + return get_parameter_device(self) + for module in self.semantic.modules(): + if ( + hasattr(module, "_hf_hook") + and hasattr(module._hf_hook, "execution_device") + and module._hf_hook.execution_device is not None + ): + return torch.device(module._hf_hook.execution_device) + + def enable_cpu_offload(self, gpu_id: Optional[int] = 0): + r""" + Offloads all sub-models to CPU using accelerate, reducing memory usage with a low impact on performance. This + method moves one whole sub-model at a time to the GPU when it is used, and the sub-model remains in GPU until + the next sub-model runs. + + Args: + gpu_id (`int`, *optional*, defaults to 0): + GPU id on which the sub-models will be loaded and offloaded. + """ + if is_accelerate_available(): + from accelerate import cpu_offload_with_hook + else: + raise ImportError("`enable_model_cpu_offload` requires `accelerate`.") + + device = torch.device(f"cuda:{gpu_id}") + + if self.device.type != "cpu": + self.to("cpu") + torch.cuda.empty_cache() # otherwise we don't see the memory savings (but they probably exist) + + # this layer is used outside the first foward pass of semantic so need to be loaded before semantic + self.semantic.input_embeds_layer, _ = cpu_offload_with_hook(self.semantic.input_embeds_layer, device) + + hook = None + for cpu_offloaded_model in [ + self.semantic, + self.coarse_acoustics, + self.fine_acoustics, + ]: + _, hook = cpu_offload_with_hook(cpu_offloaded_model, device, prev_module_hook=hook) + + self.fine_acoustics_hook = hook + + _, hook = cpu_offload_with_hook(self.codec_model, device, prev_module_hook=hook) + + # We'll offload the last model manually. 
+ self.codec_model_hook = hook + + def codec_decode(self, fine_output, output_lengths=None): + """Turn quantized audio codes into audio array using encodec.""" + + fine_output = fine_output.transpose(0, 1) + emb = self.codec_model.quantizer.decode(fine_output) + + if output_lengths is not None: + # encodec uses LSTMs which behaves differently with appended padding + # decoding with encodec takes around 0.1% of the total generation time + # to keep generation quality, we break batching + out = [sample[:, :l].unsqueeze(0) for (sample, l) in zip(emb, output_lengths)] + audio_arr = [self.codec_model.decoder(sample).squeeze() for sample in out] + else: + out = self.codec_model.decoder(emb) + audio_arr = out.squeeze(1) # squeeze the codebook dimension + + return audio_arr + + @torch.no_grad() + def generate( + self, + input_ids: Optional[torch.Tensor] = None, + history_prompt: Optional[Dict[str, torch.Tensor]] = None, + return_output_lengths: Optional[bool] = None, + **kwargs, + ) -> torch.LongTensor: + """ + Generates audio from an input prompt and an additional optional `Bark` speaker prompt. + + Args: + input_ids (`Optional[torch.Tensor]` of shape (batch_size, seq_len), *optional*): + Input ids. Will be truncated up to 256 tokens. Note that the output audios will be as long as the + longest generation among the batch. + history_prompt (`Optional[Dict[str,torch.Tensor]]`, *optional*): + Optional `Bark` speaker prompt. Note that for now, this model takes only one speaker prompt per batch. + kwargs (*optional*): Remaining dictionary of keyword arguments. Keyword arguments are of two types: + + - Without a prefix, they will be entered as `**kwargs` for the `generate` method of each sub-model. + - With a *semantic_*, *coarse_*, *fine_* prefix, they will be input for the `generate` method of the + semantic, coarse and fine respectively. It has the priority over the keywords without a prefix. + + This means you can, for example, specify a generation strategy for all sub-models except one. + return_output_lengths (`bool`, *optional*): + Whether or not to return the waveform lengths. Useful when batching. + Returns: + By default: + - **audio_waveform** (`torch.Tensor` of shape (batch_size, seq_len)): Generated audio waveform. + When `return_output_lengths=True`: + Returns a tuple made of: + - **audio_waveform** (`torch.Tensor` of shape (batch_size, seq_len)): Generated audio waveform. 
+ - **output_lengths** (`torch.Tensor` of shape (batch_size)): The length of each waveform in the batch + Example: + + ```python + >>> from transformers import AutoProcessor, BarkModel + + >>> processor = AutoProcessor.from_pretrained("suno/bark-small") + >>> model = BarkModel.from_pretrained("suno/bark-small") + + >>> # To add a voice preset, you can pass `voice_preset` to `BarkProcessor.__call__(...)` + >>> voice_preset = "v2/en_speaker_6" + + >>> inputs = processor("Hello, my dog is cute, I need him in my life", voice_preset=voice_preset) + + >>> audio_array = model.generate(**inputs, semantic_max_new_tokens=100) + >>> audio_array = audio_array.cpu().numpy().squeeze() + ``` + """ + # TODO (joao):workaround until nested generation config is compatible with PreTrained Model + # todo: dict + semantic_generation_config = BarkSemanticGenerationConfig(**self.generation_config.semantic_config) + coarse_generation_config = BarkCoarseGenerationConfig(**self.generation_config.coarse_acoustics_config) + fine_generation_config = BarkFineGenerationConfig(**self.generation_config.fine_acoustics_config) + + kwargs_semantic = { + # if "attention_mask" is set, it should not be passed to CoarseModel and FineModel + "attention_mask": kwargs.pop("attention_mask", None), + "min_eos_p": kwargs.pop("min_eos_p", None), + } + kwargs_coarse = {} + kwargs_fine = {} + for key, value in kwargs.items(): + if key.startswith("semantic_"): + key = key[len("semantic_") :] + kwargs_semantic[key] = value + elif key.startswith("coarse_"): + key = key[len("coarse_") :] + kwargs_coarse[key] = value + elif key.startswith("fine_"): + key = key[len("fine_") :] + kwargs_fine[key] = value + else: + # If the key is already in a specific config, then it's been set with a + # submodules specific value and we don't override + if key not in kwargs_semantic: + kwargs_semantic[key] = value + if key not in kwargs_coarse: + kwargs_coarse[key] = value + if key not in kwargs_fine: + kwargs_fine[key] = value + + # 1. Generate from the semantic model + semantic_output = self.semantic.generate( + input_ids, + history_prompt=history_prompt, + semantic_generation_config=semantic_generation_config, + **kwargs_semantic, + ) + + # 2. Generate from the coarse model + coarse_output = self.coarse_acoustics.generate( + semantic_output, + history_prompt=history_prompt, + semantic_generation_config=semantic_generation_config, + coarse_generation_config=coarse_generation_config, + codebook_size=self.generation_config.codebook_size, + return_output_lengths=return_output_lengths, + **kwargs_coarse, + ) + + output_lengths = None + if return_output_lengths: + coarse_output, output_lengths = coarse_output + # (batch_size, seq_len*coarse_codebooks) -> (batch_size, seq_len) + output_lengths = output_lengths // coarse_generation_config.n_coarse_codebooks + + # 3. "generate" from the fine model + output = self.fine_acoustics.generate( + coarse_output, + history_prompt=history_prompt, + semantic_generation_config=semantic_generation_config, + coarse_generation_config=coarse_generation_config, + fine_generation_config=fine_generation_config, + codebook_size=self.generation_config.codebook_size, + **kwargs_fine, + ) + + if getattr(self, "fine_acoustics_hook", None) is not None: + # Manually offload fine_acoustics to CPU + # and load codec_model to GPU + # since bark doesn't use codec_model forward pass + self.fine_acoustics_hook.offload() + self.codec_model = self.codec_model.to(self.device) + + # 4. 
Decode the output and generate audio array + audio = self.codec_decode(output, output_lengths) + + if getattr(self, "codec_model_hook", None) is not None: + # Offload codec_model to CPU + self.codec_model_hook.offload() + + if return_output_lengths: + output_lengths = [len(sample) for sample in audio] + audio = nn.utils.rnn.pad_sequence(audio, batch_first=True, padding_value=0) + return audio, output_lengths + + return audio + + @classmethod + def _check_and_enable_flash_attn_2( + cls, + config, + torch_dtype: Optional[torch.dtype] = None, + device_map: Optional[Union[str, Dict[str, int]]] = None, + hard_check_only: bool = False, + ): + """ + `_check_and_enable_flash_attn_2` originally don't expand flash attention enabling to the model + sub-configurations. We override the original method to make sure that Bark sub-models are using Flash Attention + if necessary. + + If you don't know about Flash Attention, check out the official repository of flash attention: + https://github.com/Dao-AILab/flash-attention + + For using Flash Attention 1.0 you can do it directly via the `BetterTransformer` API, have a look at this + specific section of the documentation to learn more about it: + https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#decoder-models + + The method checks if the current setup is compatible with Flash Attention as it requires the model to be in + half precision and not ran on CPU. + + If all checks pass and `hard_check_only` is False, the method will set the config attribute `_attn_implementation` to "flash_attention_2" so that the model + can initialize the correct attention module + """ + config = super()._check_and_enable_flash_attn_2( + config, torch_dtype, device_map, hard_check_only=hard_check_only + ) + + config.semantic_config._attn_implementation = config._attn_implementation + config.coarse_acoustics_config._attn_implementation = config._attn_implementation + config.fine_acoustics_config._attn_implementation = config._attn_implementation + return config diff --git a/src/transformers/models/bark/processing_bark.py b/src/transformers/models/bark/processing_bark.py new file mode 100644 index 00000000000000..d58b89bf6f8f9b --- /dev/null +++ b/src/transformers/models/bark/processing_bark.py @@ -0,0 +1,286 @@ +# coding=utf-8 +# Copyright 2023 The Suno AI Authors and The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Processor class for Bark +""" +import json +import os +from typing import Optional + +import numpy as np + +from ...feature_extraction_utils import BatchFeature +from ...processing_utils import ProcessorMixin +from ...utils import logging +from ...utils.hub import get_file_from_repo +from ..auto import AutoTokenizer + + +logger = logging.get_logger(__name__) + + +class BarkProcessor(ProcessorMixin): + r""" + Constructs a Bark processor which wraps a text tokenizer and optional Bark voice presets into a single processor. 
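Stepping back to the kwarg routing documented in `BarkModel.generate` above: un-prefixed generation kwargs are forwarded to every sub-model, while `semantic_`/`coarse_`/`fine_`-prefixed ones reach only the named sub-model and take priority. A minimal sketch reusing the checkpoint from the docstring example (the specific kwargs shown are illustrative, not an exhaustive list of what each sub-model accepts):

```python
from transformers import AutoProcessor, BarkModel

processor = AutoProcessor.from_pretrained("suno/bark-small")
model = BarkModel.from_pretrained("suno/bark-small")

inputs = processor(
    ["Hello, my dog is cute", "And here is a second, longer sentence."],
    voice_preset="v2/en_speaker_6",
)

# `semantic_max_new_tokens` only reaches the semantic sub-model; an un-prefixed kwarg
# would be forwarded to the semantic, coarse and fine sub-models alike.
audio, lengths = model.generate(
    **inputs,
    semantic_max_new_tokens=100,
    return_output_lengths=True,  # also return per-sample waveform lengths
)
print(audio.shape)  # (batch_size, longest_waveform_len), zero-padded
print(lengths)      # one length per input sentence
```

`return_output_lengths=True` is mainly useful for batched inputs, since the returned waveform tensor is padded to the longest sample.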
+ + Args: + tokenizer ([`PreTrainedTokenizer`]): + An instance of [`PreTrainedTokenizer`]. + speaker_embeddings (`Dict[Dict[str]]`, *optional*): + Optional nested speaker embeddings dictionary. The first level contains voice preset names (e.g + `"en_speaker_4"`). The second level contains `"semantic_prompt"`, `"coarse_prompt"` and `"fine_prompt"` + embeddings. The values correspond to the path of the corresponding `np.ndarray`. See + [here](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c) for + a list of `voice_preset_names`. + + """ + + tokenizer_class = "AutoTokenizer" + attributes = ["tokenizer"] + + preset_shape = { + "semantic_prompt": 1, + "coarse_prompt": 2, + "fine_prompt": 2, + } + + def __init__(self, tokenizer, speaker_embeddings=None): + super().__init__(tokenizer) + + self.speaker_embeddings = speaker_embeddings + + @classmethod + def from_pretrained( + cls, pretrained_processor_name_or_path, speaker_embeddings_dict_path="speaker_embeddings_path.json", **kwargs + ): + r""" + Instantiate a Bark processor associated with a pretrained model. + + Args: + pretrained_model_name_or_path (`str` or `os.PathLike`): + This can be either: + + - a string, the *model id* of a pretrained [`BarkProcessor`] hosted inside a model repo on + huggingface.co. + - a path to a *directory* containing a processor saved using the [`~BarkProcessor.save_pretrained`] + method, e.g., `./my_model_directory/`. + speaker_embeddings_dict_path (`str`, *optional*, defaults to `"speaker_embeddings_path.json"`): + The name of the `.json` file containing the speaker_embeddings dictionnary located in + `pretrained_model_name_or_path`. If `None`, no speaker_embeddings is loaded. + **kwargs + Additional keyword arguments passed along to both + [`~tokenization_utils_base.PreTrainedTokenizer.from_pretrained`]. + """ + + if speaker_embeddings_dict_path is not None: + speaker_embeddings_path = get_file_from_repo( + pretrained_processor_name_or_path, + speaker_embeddings_dict_path, + subfolder=kwargs.pop("subfolder", None), + cache_dir=kwargs.pop("cache_dir", None), + force_download=kwargs.pop("force_download", False), + proxies=kwargs.pop("proxies", None), + resume_download=kwargs.pop("resume_download", False), + local_files_only=kwargs.pop("local_files_only", False), + token=kwargs.pop("use_auth_token", None), + revision=kwargs.pop("revision", None), + ) + if speaker_embeddings_path is None: + logger.warning( + f"""`{os.path.join(pretrained_processor_name_or_path,speaker_embeddings_dict_path)}` does not exists + , no preloaded speaker embeddings will be used - Make sure to provide a correct path to the json + dictionnary if wanted, otherwise set `speaker_embeddings_dict_path=None`.""" + ) + speaker_embeddings = None + else: + with open(speaker_embeddings_path) as speaker_embeddings_json: + speaker_embeddings = json.load(speaker_embeddings_json) + else: + speaker_embeddings = None + + tokenizer = AutoTokenizer.from_pretrained(pretrained_processor_name_or_path, **kwargs) + + return cls(tokenizer=tokenizer, speaker_embeddings=speaker_embeddings) + + def save_pretrained( + self, + save_directory, + speaker_embeddings_dict_path="speaker_embeddings_path.json", + speaker_embeddings_directory="speaker_embeddings", + push_to_hub: bool = False, + **kwargs, + ): + """ + Saves the attributes of this processor (tokenizer...) in the specified directory so that it can be reloaded + using the [`~BarkProcessor.from_pretrained`] method. 
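Since `from_pretrained` and `save_pretrained` above manage both the tokenizer and the speaker-embeddings layout (a `speaker_embeddings_path.json` index plus per-preset `.npy` files), a hedged round-trip sketch may help; the local directory name is made up:

```python
from transformers import BarkProcessor

processor = BarkProcessor.from_pretrained("suno/bark-small")

# Writes the tokenizer files, the speaker_embeddings_path.json index, and one .npy file per
# prompt array under <save_directory>/speaker_embeddings/ (see save_pretrained above).
processor.save_pretrained("./bark_processor_local")

# Reloading from the local directory resolves the same JSON + .npy layout.
reloaded = BarkProcessor.from_pretrained("./bark_processor_local")
inputs = reloaded("A short test sentence.", voice_preset="v2/en_speaker_6")
print(inputs.keys())  # typically input_ids, attention_mask and history_prompt (the voice preset)
```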
+ + Args: + save_directory (`str` or `os.PathLike`): + Directory where the tokenizer files and the speaker embeddings will be saved (directory will be created + if it does not exist). + speaker_embeddings_dict_path (`str`, *optional*, defaults to `"speaker_embeddings_path.json"`): + The name of the `.json` file that will contains the speaker_embeddings nested path dictionnary, if it + exists, and that will be located in `pretrained_model_name_or_path/speaker_embeddings_directory`. + speaker_embeddings_directory (`str`, *optional*, defaults to `"speaker_embeddings/"`): + The name of the folder in which the speaker_embeddings arrays will be saved. + push_to_hub (`bool`, *optional*, defaults to `False`): + Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the + repository you want to push to with `repo_id` (will default to the name of `save_directory` in your + namespace). + kwargs: + Additional key word arguments passed along to the [`~utils.PushToHubMixin.push_to_hub`] method. + """ + if self.speaker_embeddings is not None: + os.makedirs(os.path.join(save_directory, speaker_embeddings_directory, "v2"), exist_ok=True) + + embeddings_dict = {} + + embeddings_dict["repo_or_path"] = save_directory + + for prompt_key in self.speaker_embeddings: + if prompt_key != "repo_or_path": + voice_preset = self._load_voice_preset(prompt_key) + + tmp_dict = {} + for key in self.speaker_embeddings[prompt_key]: + np.save( + os.path.join( + embeddings_dict["repo_or_path"], speaker_embeddings_directory, f"{prompt_key}_{key}" + ), + voice_preset[key], + allow_pickle=False, + ) + tmp_dict[key] = os.path.join(speaker_embeddings_directory, f"{prompt_key}_{key}.npy") + + embeddings_dict[prompt_key] = tmp_dict + + with open(os.path.join(save_directory, speaker_embeddings_dict_path), "w") as fp: + json.dump(embeddings_dict, fp) + + super().save_pretrained(save_directory, push_to_hub, **kwargs) + + def _load_voice_preset(self, voice_preset: str = None, **kwargs): + voice_preset_paths = self.speaker_embeddings[voice_preset] + + voice_preset_dict = {} + for key in ["semantic_prompt", "coarse_prompt", "fine_prompt"]: + if key not in voice_preset_paths: + raise ValueError( + f"Voice preset unrecognized, missing {key} as a key in self.speaker_embeddings[{voice_preset}]." 
+ ) + + path = get_file_from_repo( + self.speaker_embeddings.get("repo_or_path", "/"), + voice_preset_paths[key], + subfolder=kwargs.pop("subfolder", None), + cache_dir=kwargs.pop("cache_dir", None), + force_download=kwargs.pop("force_download", False), + proxies=kwargs.pop("proxies", None), + resume_download=kwargs.pop("resume_download", False), + local_files_only=kwargs.pop("local_files_only", False), + token=kwargs.pop("use_auth_token", None), + revision=kwargs.pop("revision", None), + ) + if path is None: + raise ValueError( + f"""`{os.path.join(self.speaker_embeddings.get("repo_or_path", "/"),voice_preset_paths[key])}` does not exists + , no preloaded voice preset will be used - Make sure to provide correct paths to the {voice_preset} + embeddings.""" + ) + + voice_preset_dict[key] = np.load(path) + + return voice_preset_dict + + def _validate_voice_preset_dict(self, voice_preset: Optional[dict] = None): + for key in ["semantic_prompt", "coarse_prompt", "fine_prompt"]: + if key not in voice_preset: + raise ValueError(f"Voice preset unrecognized, missing {key} as a key.") + + if not isinstance(voice_preset[key], np.ndarray): + raise ValueError(f"{key} voice preset must be a {str(self.preset_shape[key])}D ndarray.") + + if len(voice_preset[key].shape) != self.preset_shape[key]: + raise ValueError(f"{key} voice preset must be a {str(self.preset_shape[key])}D ndarray.") + + def __call__( + self, + text=None, + voice_preset=None, + return_tensors="pt", + max_length=256, + add_special_tokens=False, + return_attention_mask=True, + return_token_type_ids=False, + **kwargs, + ): + """ + Main method to prepare for the model one or several sequences(s). This method forwards the `text` and `kwargs` + arguments to the AutoTokenizer's [`~AutoTokenizer.__call__`] to encode the text. The method also proposes a + voice preset which is a dictionary of arrays that conditions `Bark`'s output. `kwargs` arguments are forwarded + to the tokenizer and to `cached_file` method if `voice_preset` is a valid filename. + + Args: + text (`str`, `List[str]`, `List[List[str]]`): + The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings + (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set + `is_split_into_words=True` (to lift the ambiguity with a batch of sequences). + voice_preset (`str`, `Dict[np.ndarray]`): + The voice preset, i.e the speaker embeddings. It can either be a valid voice_preset name, e.g + `"en_speaker_1"`, or directly a dictionnary of `np.ndarray` embeddings for each submodel of `Bark`. Or + it can be a valid file name of a local `.npz` single voice preset. + return_tensors (`str` or [`~utils.TensorType`], *optional*): + If set, will return tensors of a particular framework. Acceptable values are: + + - `'pt'`: Return PyTorch `torch.Tensor` objects. + - `'np'`: Return NumPy `np.ndarray` objects. + + Returns: + Tuple([`BatchEncoding`], [`BatchFeature`]): A tuple composed of a [`BatchEncoding`], i.e the output of the + `tokenizer` and a [`BatchFeature`], i.e the voice preset with the right tensors type. 
+ """ + if voice_preset is not None and not isinstance(voice_preset, dict): + if ( + isinstance(voice_preset, str) + and self.speaker_embeddings is not None + and voice_preset in self.speaker_embeddings + ): + voice_preset = self._load_voice_preset(voice_preset) + + else: + if isinstance(voice_preset, str) and not voice_preset.endswith(".npz"): + voice_preset = voice_preset + ".npz" + + voice_preset = np.load(voice_preset) + + if voice_preset is not None: + self._validate_voice_preset_dict(voice_preset, **kwargs) + voice_preset = BatchFeature(data=voice_preset, tensor_type=return_tensors) + + encoded_text = self.tokenizer( + text, + return_tensors=return_tensors, + padding="max_length", + max_length=max_length, + return_attention_mask=return_attention_mask, + return_token_type_ids=return_token_type_ids, + add_special_tokens=add_special_tokens, + **kwargs, + ) + + if voice_preset is not None: + encoded_text["history_prompt"] = voice_preset + + return encoded_text diff --git a/src/transformers/models/bart/__init__.py b/src/transformers/models/bart/__init__.py index 7129474b4ee449..4f104efce1a4d2 100644 --- a/src/transformers/models/bart/__init__.py +++ b/src/transformers/models/bart/__init__.py @@ -49,6 +49,7 @@ "BartForQuestionAnswering", "BartForSequenceClassification", "BartModel", + "BartPreTrainedModel", "BartPretrainedModel", "PretrainedBartModel", ] @@ -107,6 +108,7 @@ BartForQuestionAnswering, BartForSequenceClassification, BartModel, + BartPreTrainedModel, BartPretrainedModel, PretrainedBartModel, ) diff --git a/src/transformers/models/bart/configuration_bart.py b/src/transformers/models/bart/configuration_bart.py index 2a04657f419909..8c03be9a6202a8 100644 --- a/src/transformers/models/bart/configuration_bart.py +++ b/src/transformers/models/bart/configuration_bart.py @@ -107,6 +107,7 @@ class BartConfig(PretrainedConfig): >>> # Accessing the model configuration >>> configuration = model.config ```""" + model_type = "bart" keys_to_ignore_at_inference = ["past_key_values"] attribute_map = {"num_attention_heads": "encoder_attention_heads", "hidden_size": "d_model"} diff --git a/src/transformers/models/bart/convert_bart_original_pytorch_checkpoint_to_pytorch.py b/src/transformers/models/bart/convert_bart_original_pytorch_checkpoint_to_pytorch.py index baa2fff290f79d..d09b39d51e0038 100644 --- a/src/transformers/models/bart/convert_bart_original_pytorch_checkpoint_to_pytorch.py +++ b/src/transformers/models/bart/convert_bart_original_pytorch_checkpoint_to_pytorch.py @@ -101,7 +101,10 @@ def convert_bart_checkpoint(checkpoint_path, pytorch_dump_folder_path, hf_checkp config = BartConfig.from_pretrained(hf_checkpoint_name) tokens = bart.encode(SAMPLE_TEXT).unsqueeze(0) tokens2 = BartTokenizer.from_pretrained(hf_checkpoint_name).encode(SAMPLE_TEXT, return_tensors="pt").unsqueeze(0) - assert torch.eq(tokens, tokens2).all() + if not torch.eq(tokens, tokens2).all(): + raise ValueError( + f"converted tokenizer and pretrained tokenizer returned different output: {tokens} != {tokens2}" + ) if checkpoint_path == "bart.large.mnli": state_dict = bart.state_dict() @@ -130,8 +133,12 @@ def convert_bart_checkpoint(checkpoint_path, pytorch_dump_folder_path, hf_checkp new_model_outputs = model.model(tokens)[0] # Check results - assert fairseq_output.shape == new_model_outputs.shape - assert (fairseq_output == new_model_outputs).all().item() + if fairseq_output.shape != new_model_outputs.shape: + raise ValueError( + f"`fairseq_output` shape and `new_model_output` shape are different: 
{fairseq_output.shape=}, {new_model_outputs.shape}" + ) + if (fairseq_output != new_model_outputs).any().item(): + raise ValueError("Some values in `fairseq_output` are different from `new_model_outputs`") Path(pytorch_dump_folder_path).mkdir(exist_ok=True) model.save_pretrained(pytorch_dump_folder_path) diff --git a/src/transformers/models/bart/modeling_bart.py b/src/transformers/models/bart/modeling_bart.py index 1f2c4a14ed14e2..ca5f724b08a917 100755 --- a/src/transformers/models/bart/modeling_bart.py +++ b/src/transformers/models/bart/modeling_bart.py @@ -15,16 +15,22 @@ """ PyTorch BART model.""" import copy import math -import random import warnings from typing import List, Optional, Tuple, Union import torch +import torch.nn.functional as F import torch.utils.checkpoint from torch import nn from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss from ...activations import ACT2FN +from ...modeling_attn_mask_utils import ( + _prepare_4d_attention_mask, + _prepare_4d_attention_mask_for_sdpa, + _prepare_4d_causal_attention_mask, + _prepare_4d_causal_attention_mask_for_sdpa, +) from ...modeling_outputs import ( BaseModelOutput, BaseModelOutputWithPastAndCrossAttentions, @@ -40,12 +46,19 @@ add_end_docstrings, add_start_docstrings, add_start_docstrings_to_model_forward, + is_flash_attn_2_available, + is_flash_attn_greater_or_equal_2_10, logging, replace_return_docstrings, ) from .configuration_bart import BartConfig +if is_flash_attn_2_available(): + from flash_attn import flash_attn_func, flash_attn_varlen_func + from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa + + logger = logging.get_logger(__name__) _CHECKPOINT_FOR_DOC = "facebook/bart-base" @@ -71,6 +84,19 @@ ] +# Copied from transformers.models.llama.modeling_llama._get_unpad_data +def _get_unpad_data(attention_mask): + seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32) + indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten() + max_seqlen_in_batch = seqlens_in_batch.max().item() + cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0)) + return ( + indices, + cu_seqlens, + max_seqlen_in_batch, + ) + + def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int, decoder_start_token_id: int): """ Shift input ids one token to the right. @@ -87,35 +113,6 @@ def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int, decoder_start return shifted_input_ids -def _make_causal_mask(input_ids_shape: torch.Size, dtype: torch.dtype, past_key_values_length: int = 0): - """ - Make causal mask used for bi-directional self-attention. - """ - bsz, tgt_len = input_ids_shape - mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min)) - mask_cond = torch.arange(mask.size(-1)) - mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0) - mask = mask.to(dtype) - - if past_key_values_length > 0: - mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype), mask], dim=-1) - return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length) - - -def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None): - """ - Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`. 
- """ - bsz, src_len = mask.size() - tgt_len = tgt_len if tgt_len is not None else src_len - - expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype) - - inverted_mask = 1.0 - expanded_mask - - return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min) - - class BartLearnedPositionalEmbedding(nn.Embedding): """ This module learns positional embeddings up to a fixed maximum size. @@ -148,12 +145,15 @@ def __init__( dropout: float = 0.0, is_decoder: bool = False, bias: bool = True, + is_causal: bool = False, + config: Optional[BartConfig] = None, ): super().__init__() self.embed_dim = embed_dim self.num_heads = num_heads self.dropout = dropout self.head_dim = embed_dim // num_heads + self.config = config if (self.head_dim * num_heads) != self.embed_dim: raise ValueError( @@ -162,6 +162,7 @@ def __init__( ) self.scaling = self.head_dim**-0.5 self.is_decoder = is_decoder + self.is_causal = is_causal self.k_proj = nn.Linear(embed_dim, embed_dim, bias=bias) self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias) @@ -229,8 +230,8 @@ def forward( proj_shape = (bsz * self.num_heads, -1, self.head_dim) query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape) - key_states = key_states.view(*proj_shape) - value_states = value_states.view(*proj_shape) + key_states = key_states.reshape(*proj_shape) + value_states = value_states.reshape(*proj_shape) src_len = key_states.size(1) attn_weights = torch.bmm(query_states, key_states.transpose(1, 2)) @@ -276,7 +277,7 @@ def forward( if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim): raise ValueError( - f"`attn_output` should be of size {(bsz, self.num_heads, tgt_len, self.head_dim)}, but is" + f"`attn_output` should be of size {(bsz * self.num_heads, tgt_len, self.head_dim)}, but is" f" {attn_output.size()}" ) @@ -284,7 +285,7 @@ def forward( attn_output = attn_output.transpose(1, 2) # Use the `embed_dim` from the config (stored in the class) rather than `hidden_state` because `attn_output` can be - # partitioned aross GPUs when using tensor-parallelism. + # partitioned across GPUs when using tensor-parallelism. attn_output = attn_output.reshape(bsz, tgt_len, self.embed_dim) attn_output = self.out_proj(attn_output) @@ -292,14 +293,344 @@ def forward( return attn_output, attn_weights_reshaped, past_key_value +class BartFlashAttention2(BartAttention): + """ + Bart flash attention module. This module inherits from `BartAttention` as the weights of the module stays + untouched. The only required change would be on the forward pass where it needs to correctly call the public API of + flash attention and deal with padding tokens in case the input contains any of them. + """ + + # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2.__init__ + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + # TODO: Should be removed once Flash Attention for RoCm is bumped to 2.1. + # flash_attn<2.1 generates top-left aligned causal mask, while what is needed here is bottom-right alignement, that was made default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0. + # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left). 
+ self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10() + + def _reshape(self, tensor: torch.Tensor, seq_len: int, bsz: int): + return tensor.view(bsz, seq_len, self.num_heads, self.head_dim) + + def forward( + self, + hidden_states: torch.Tensor, + key_value_states: Optional[torch.Tensor] = None, + past_key_value: Optional[Tuple[torch.Tensor]] = None, + attention_mask: Optional[torch.Tensor] = None, + layer_head_mask: Optional[torch.Tensor] = None, + output_attentions: bool = False, + ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]: + # BartFlashAttention2 attention does not support output_attentions + if output_attentions: + raise ValueError("BartFlashAttention2 attention does not support output_attentions") + + # if key_value_states are provided this layer is used as a cross-attention layer + # for the decoder + is_cross_attention = key_value_states is not None + + bsz, q_len, _ = hidden_states.size() + + # get query proj + query_states = self._reshape(self.q_proj(hidden_states), -1, bsz) + # get key, value proj + # `past_key_value[0].shape[2] == key_value_states.shape[1]` + # is checking that the `sequence_length` of the `past_key_value` is the same as + # the provided `key_value_states` to support prefix tuning + if ( + is_cross_attention + and past_key_value is not None + and past_key_value[0].shape[2] == key_value_states.shape[1] + ): + # reuse k,v, cross_attentions + key_states = past_key_value[0].transpose(1, 2) + value_states = past_key_value[1].transpose(1, 2) + elif is_cross_attention: + # cross_attentions + key_states = self._reshape(self.k_proj(key_value_states), -1, bsz) + value_states = self._reshape(self.v_proj(key_value_states), -1, bsz) + elif past_key_value is not None: + # reuse k, v, self_attention + key_states = self._reshape(self.k_proj(hidden_states), -1, bsz) + value_states = self._reshape(self.v_proj(hidden_states), -1, bsz) + key_states = torch.cat([past_key_value[0].transpose(1, 2), key_states], dim=1) + value_states = torch.cat([past_key_value[1].transpose(1, 2), value_states], dim=1) + else: + # self_attention + key_states = self._reshape(self.k_proj(hidden_states), -1, bsz) + value_states = self._reshape(self.v_proj(hidden_states), -1, bsz) + + if self.is_decoder: + # if cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states. + # Further calls to cross_attention layer can then reuse all cross-attention + # key/value_states (first "if" case) + # if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of + # all previous decoder key/value_states. Further calls to uni-directional self-attention + # can concat previous decoder key/value_states to current projected key/value_states (third "elif" case) + # if encoder bi-directional self-attention `past_key_value` is always `None` + past_key_value = (key_states.transpose(1, 2), value_states.transpose(1, 2)) + + kv_seq_len = key_states.shape[-2] + if past_key_value is not None: + kv_seq_len += past_key_value[0].shape[-2] + + # In PEFT, usually we cast the layer norms in float32 for training stability reasons + # therefore the input hidden states gets silently casted in float32. Hence, we need + # cast them back in the correct dtype just to be sure everything works as expected. + # This might slowdown training & inference so it is recommended to not cast the LayerNorms + # in fp32. 
(LlamaRMSNorm handles it correctly) + + input_dtype = query_states.dtype + if input_dtype == torch.float32: + if torch.is_autocast_enabled(): + target_dtype = torch.get_autocast_gpu_dtype() + # Handle the case where the model is quantized + elif hasattr(self.config, "_pre_quantization_dtype"): + target_dtype = self.config._pre_quantization_dtype + else: + target_dtype = self.q_proj.weight.dtype + + logger.warning_once( + f"The input hidden states seems to be silently casted in float32, this might be related to" + f" the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in" + f" {target_dtype}." + ) + + query_states = query_states.to(target_dtype) + key_states = key_states.to(target_dtype) + value_states = value_states.to(target_dtype) + + attn_output = self._flash_attention_forward( + query_states, key_states, value_states, attention_mask, q_len, dropout=self.dropout + ) + + attn_output = attn_output.reshape(bsz, q_len, -1) + attn_output = self.out_proj(attn_output) + + if not output_attentions: + attn_weights = None + + return attn_output, attn_weights, past_key_value + + # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2._flash_attention_forward + def _flash_attention_forward( + self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None + ): + """ + Calls the forward method of Flash Attention - if the input hidden states contain at least one padding token + first unpad the input, then computes the attention scores and pad the final attention scores. + + Args: + query_states (`torch.Tensor`): + Input query states to be passed to Flash Attention API + key_states (`torch.Tensor`): + Input key states to be passed to Flash Attention API + value_states (`torch.Tensor`): + Input value states to be passed to Flash Attention API + attention_mask (`torch.Tensor`): + The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the + position of padding tokens and 1 for the position of non-padding tokens. + dropout (`int`, *optional*): + Attention dropout + softmax_scale (`float`, *optional*): + The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim) + """ + if not self._flash_attn_uses_top_left_mask: + causal = self.is_causal + else: + # TODO: Remove the `query_length != 1` check once Flash Attention for RoCm is bumped to 2.1. For details, please see the comment in LlamaFlashAttention2 __init__. 
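The comment block above explains why fp32 hidden states are re-cast before calling flash attention. A condensed sketch of that decision, mirroring the logic in `BartFlashAttention2.forward` (illustrative only; `_pre_quantization_dtype` is only present on quantized configs, and the real code casts the query/key/value states rather than returning a dtype):

```python
import torch


def pick_flash_attn_target_dtype(attn_module: torch.nn.Module, config) -> torch.dtype:
    if torch.is_autocast_enabled():
        return torch.get_autocast_gpu_dtype()  # running under torch.autocast
    if hasattr(config, "_pre_quantization_dtype"):
        return config._pre_quantization_dtype  # quantized checkpoint remembers its load dtype
    return attn_module.q_proj.weight.dtype     # fall back to the projection weight dtype
```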
+ causal = self.is_causal and query_length != 1 + + # Contains at least one padding token in the sequence + if attention_mask is not None: + batch_size = query_states.shape[0] + query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input( + query_states, key_states, value_states, attention_mask, query_length + ) + + cu_seqlens_q, cu_seqlens_k = cu_seq_lens + max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens + + attn_output_unpad = flash_attn_varlen_func( + query_states, + key_states, + value_states, + cu_seqlens_q=cu_seqlens_q, + cu_seqlens_k=cu_seqlens_k, + max_seqlen_q=max_seqlen_in_batch_q, + max_seqlen_k=max_seqlen_in_batch_k, + dropout_p=dropout, + softmax_scale=softmax_scale, + causal=causal, + ) + + attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length) + else: + attn_output = flash_attn_func( + query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=causal + ) + + return attn_output + + # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2._upad_input + def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length): + indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask) + batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape + + key_layer = index_first_axis( + key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k + ) + value_layer = index_first_axis( + value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k + ) + if query_length == kv_seq_len: + query_layer = index_first_axis( + query_layer.reshape(batch_size * kv_seq_len, self.num_heads, head_dim), indices_k + ) + cu_seqlens_q = cu_seqlens_k + max_seqlen_in_batch_q = max_seqlen_in_batch_k + indices_q = indices_k + elif query_length == 1: + max_seqlen_in_batch_q = 1 + cu_seqlens_q = torch.arange( + batch_size + 1, dtype=torch.int32, device=query_layer.device + ) # There is a memcpy here, that is very bad. + indices_q = cu_seqlens_q[:-1] + query_layer = query_layer.squeeze(1) + else: + # The -q_len: slice assumes left padding. + attention_mask = attention_mask[:, -query_length:] + query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask) + + return ( + query_layer, + key_layer, + value_layer, + indices_q, + (cu_seqlens_q, cu_seqlens_k), + (max_seqlen_in_batch_q, max_seqlen_in_batch_k), + ) + + +class BartSdpaAttention(BartAttention): + def forward( + self, + hidden_states: torch.Tensor, + key_value_states: Optional[torch.Tensor] = None, + past_key_value: Optional[Tuple[torch.Tensor]] = None, + attention_mask: Optional[torch.Tensor] = None, + layer_head_mask: Optional[torch.Tensor] = None, + output_attentions: bool = False, + ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]: + """Input shape: Batch x Time x Channel""" + if output_attentions or layer_head_mask is not None: + # TODO: Improve this warning with e.g. `model.config._attn_implementation = "manual"` once this is implemented. + logger.warning_once( + "BartModel is using BartSdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True` or `layer_head_mask` not None. Falling back to the manual attention" + ' implementation, but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. 
This warning can be removed using the argument `attn_implementation="eager"` when loading the model.' + ) + return super().forward( + hidden_states, + key_value_states=key_value_states, + past_key_value=past_key_value, + attention_mask=attention_mask, + layer_head_mask=layer_head_mask, + output_attentions=output_attentions, + ) + + # if key_value_states are provided this layer is used as a cross-attention layer + # for the decoder + is_cross_attention = key_value_states is not None + + bsz, tgt_len, _ = hidden_states.size() + + # get query proj + query_states = self.q_proj(hidden_states) + # get key, value proj + # `past_key_value[0].shape[2] == key_value_states.shape[1]` + # is checking that the `sequence_length` of the `past_key_value` is the same as + # the provided `key_value_states` to support prefix tuning + if ( + is_cross_attention + and past_key_value is not None + and past_key_value[0].shape[2] == key_value_states.shape[1] + ): + # reuse k,v, cross_attentions + key_states = past_key_value[0] + value_states = past_key_value[1] + elif is_cross_attention: + # cross_attentions + key_states = self._shape(self.k_proj(key_value_states), -1, bsz) + value_states = self._shape(self.v_proj(key_value_states), -1, bsz) + elif past_key_value is not None: + # reuse k, v, self_attention + key_states = self._shape(self.k_proj(hidden_states), -1, bsz) + value_states = self._shape(self.v_proj(hidden_states), -1, bsz) + key_states = torch.cat([past_key_value[0], key_states], dim=2) + value_states = torch.cat([past_key_value[1], value_states], dim=2) + else: + # self_attention + key_states = self._shape(self.k_proj(hidden_states), -1, bsz) + value_states = self._shape(self.v_proj(hidden_states), -1, bsz) + + if self.is_decoder: + # if cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states. + # Further calls to cross_attention layer can then reuse all cross-attention + # key/value_states (first "if" case) + # if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of + # all previous decoder key/value_states. Further calls to uni-directional self-attention + # can concat previous decoder key/value_states to current projected key/value_states (third "elif" case) + # if encoder bi-directional self-attention `past_key_value` is always `None` + past_key_value = (key_states, value_states) + + query_states = self._shape(query_states, tgt_len, bsz) + + # NOTE: SDPA with memory-efficient backend is currently (torch==2.1.2) bugged when using non-contiguous inputs and a custom attn_mask, + # but we are fine here as `_shape` do call `.contiguous()`. Reference: https://github.com/pytorch/pytorch/issues/112577 + attn_output = torch.nn.functional.scaled_dot_product_attention( + query_states, + key_states, + value_states, + attn_mask=attention_mask, + dropout_p=self.dropout if self.training else 0.0, + # The tgt_len > 1 is necessary to match with AttentionMaskConverter.to_causal_4d that does not create a causal mask in case tgt_len == 1. + is_causal=self.is_causal and attention_mask is None and tgt_len > 1, + ) + + if attn_output.size() != (bsz, self.num_heads, tgt_len, self.head_dim): + raise ValueError( + f"`attn_output` should be of size {(bsz, self.num_heads, tgt_len, self.head_dim)}, but is" + f" {attn_output.size()}" + ) + + attn_output = attn_output.transpose(1, 2) + + # Use the `embed_dim` from the config (stored in the class) rather than `hidden_state` because `attn_output` can be + # partitioned across GPUs when using tensor-parallelism. 
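`BartSdpaAttention` delegates the attention math to `torch.nn.functional.scaled_dot_product_attention` (torch >= 2.0). A minimal sketch with made-up shapes, showing that the SDPA call matches the eager softmax formulation used elsewhere in this file:

```python
import math

import torch
import torch.nn.functional as F

bsz, num_heads, tgt_len, head_dim = 2, 4, 5, 8
q = torch.randn(bsz, num_heads, tgt_len, head_dim)
k = torch.randn(bsz, num_heads, tgt_len, head_dim)
v = torch.randn(bsz, num_heads, tgt_len, head_dim)

# Additive mask broadcastable to (bsz, 1, tgt_len, src_len): 0 keeps a position,
# large negative values drop it. All zeros here, i.e. no padding.
attn_mask = torch.zeros(bsz, 1, tgt_len, tgt_len)

sdpa_out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask, dropout_p=0.0)

# Eager reference: softmax(QK^T / sqrt(d) + mask) V
scores = q @ k.transpose(-1, -2) / math.sqrt(head_dim) + attn_mask
eager_out = torch.softmax(scores, dim=-1) @ v

print(torch.allclose(sdpa_out, eager_out, atol=1e-5))  # True, up to numerical tolerance
```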
+ attn_output = attn_output.reshape(bsz, tgt_len, self.embed_dim) + + attn_output = self.out_proj(attn_output) + + return attn_output, None, past_key_value + + +BART_ATTENTION_CLASSES = { + "eager": BartAttention, + "sdpa": BartSdpaAttention, + "flash_attention_2": BartFlashAttention2, +} + + class BartEncoderLayer(nn.Module): def __init__(self, config: BartConfig): super().__init__() self.embed_dim = config.d_model - self.self_attn = BartAttention( + + self.self_attn = BART_ATTENTION_CLASSES[config._attn_implementation]( embed_dim=self.embed_dim, num_heads=config.encoder_attention_heads, dropout=config.attention_dropout, + config=config, ) self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim) self.dropout = config.dropout @@ -318,7 +649,7 @@ def forward( ) -> Tuple[torch.FloatTensor, Optional[torch.FloatTensor]]: """ Args: - hidden_states (`torch.FloatTensor`): input to the layer of shape `(seq_len, batch, embed_dim)` + hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)` attention_mask (`torch.FloatTensor`): attention mask of size `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. layer_head_mask (`torch.FloatTensor`): mask for attention heads in a given layer of size @@ -365,22 +696,25 @@ def __init__(self, config: BartConfig): super().__init__() self.embed_dim = config.d_model - self.self_attn = BartAttention( + self.self_attn = BART_ATTENTION_CLASSES[config._attn_implementation]( embed_dim=self.embed_dim, num_heads=config.decoder_attention_heads, dropout=config.attention_dropout, is_decoder=True, + is_causal=True, + config=config, ) self.dropout = config.dropout self.activation_fn = ACT2FN[config.activation_function] self.activation_dropout = config.activation_dropout self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim) - self.encoder_attn = BartAttention( + self.encoder_attn = BART_ATTENTION_CLASSES[config._attn_implementation]( self.embed_dim, config.decoder_attention_heads, dropout=config.attention_dropout, is_decoder=True, + config=config, ) self.encoder_attn_layer_norm = nn.LayerNorm(self.embed_dim) self.fc1 = nn.Linear(self.embed_dim, config.decoder_ffn_dim) @@ -501,12 +835,15 @@ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: return hidden_states -class BartPretrainedModel(PreTrainedModel): +class BartPreTrainedModel(PreTrainedModel): config_class = BartConfig base_model_prefix = "model" supports_gradient_checkpointing = True - _keys_to_ignore_on_load_unexpected = [r"encoder.version", r"decoder.version"] + _keys_to_ignore_on_load_unexpected = ["encoder.version", "decoder.version"] _no_split_modules = [r"BartEncoderLayer", r"BartDecoderLayer"] + _skip_keys_device_placement = "past_key_values" + _supports_flash_attn_2 = True + _supports_sdpa = True def _init_weights(self, module): std = self.config.init_std @@ -519,10 +856,6 @@ def _init_weights(self, module): if module.padding_idx is not None: module.weight.data[module.padding_idx].zero_() - def _set_gradient_checkpointing(self, module, value=False): - if isinstance(module, (BartDecoder, BartEncoder)): - module.gradient_checkpointing = value - @property def dummy_inputs(self): pad_token = self.config.pad_token_id @@ -534,10 +867,18 @@ def dummy_inputs(self): return dummy_inputs -class PretrainedBartModel(BartPretrainedModel): +class PretrainedBartModel(BartPreTrainedModel): def __init_subclass__(self): warnings.warn( - "The class `PretrainedBartModel` has been depreciated, please use `BartPretrainedModel` instead.", + 
"The class `PretrainedBartModel` has been depreciated, please use `BartPreTrainedModel` instead.", + FutureWarning, + ) + + +class BartPretrainedModel(BartPreTrainedModel): + def __init_subclass__(self): + warnings.warn( + "The class `PretrainedBartModel` has been depreciated, please use `BartPreTrainedModel` instead.", FutureWarning, ) @@ -672,10 +1013,11 @@ def __init_subclass__(self): If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all - `decoder_input_ids` of shape `(batch_size, sequence_length)`. inputs_embeds (`torch.FloatTensor` of shape - `(batch_size, sequence_length, hidden_size)`, *optional*): Optionally, instead of passing `input_ids` you - can choose to directly pass an embedded representation. This is useful if you want more control over how to - convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix. + `decoder_input_ids` of shape `(batch_size, sequence_length)`. + inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. + This is useful if you want more control over how to convert `input_ids` indices into associated vectors + than the model's internal embedding lookup matrix. decoder_inputs_embeds (`torch.FloatTensor` of shape `(batch_size, target_sequence_length, hidden_size)`, *optional*): Optionally, instead of passing `decoder_input_ids` you can choose to directly pass an embedded representation. If `past_key_values` is used, optionally only the last `decoder_inputs_embeds` have to be @@ -698,7 +1040,7 @@ def __init_subclass__(self): """ -class BartEncoder(BartPretrainedModel): +class BartEncoder(BartPreTrainedModel): """ Transformer encoder consisting of *config.encoder_layers* self attention layers. Each layer is a [`BartEncoderLayer`]. @@ -729,6 +1071,8 @@ def __init__(self, config: BartConfig, embed_tokens: Optional[nn.Embedding] = No embed_dim, ) self.layers = nn.ModuleList([BartEncoderLayer(config) for _ in range(config.encoder_layers)]) + self._use_flash_attention_2 = config._attn_implementation == "flash_attention_2" + self._use_sdpa = config._attn_implementation == "sdpa" self.layernorm_embedding = nn.LayerNorm(embed_dim) self.gradient_checkpointing = False @@ -816,8 +1160,16 @@ def forward( # expand attention_mask if attention_mask is not None: - # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] - attention_mask = _expand_mask(attention_mask, inputs_embeds.dtype) + if self._use_flash_attention_2: + attention_mask = attention_mask if 0 in attention_mask else None + elif self._use_sdpa and head_mask is None and not output_attentions: + # output_attentions=True & head_mask can not be supported when using SDPA, fall back to + # the manual implementation that requires a 4D causal mask in all cases. 
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] + attention_mask = _prepare_4d_attention_mask_for_sdpa(attention_mask, inputs_embeds.dtype) + else: + # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] + attention_mask = _prepare_4d_attention_mask(attention_mask, inputs_embeds.dtype) encoder_states = () if output_hidden_states else None all_attentions = () if output_attentions else None @@ -834,23 +1186,22 @@ def forward( if output_hidden_states: encoder_states = encoder_states + (hidden_states,) # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description) - dropout_probability = random.uniform(0, 1) - if self.training and (dropout_probability < self.layerdrop): # skip the layer + to_drop = False + if self.training: + dropout_probability = torch.rand([]) + if dropout_probability < self.layerdrop: # skip the layer + to_drop = True + + if to_drop: layer_outputs = (None, None) else: if self.gradient_checkpointing and self.training: - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, output_attentions) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(encoder_layer), + layer_outputs = self._gradient_checkpointing_func( + encoder_layer.__call__, hidden_states, attention_mask, (head_mask[idx] if head_mask is not None else None), + output_attentions, ) else: layer_outputs = encoder_layer( @@ -875,7 +1226,7 @@ def custom_forward(*inputs): ) -class BartDecoder(BartPretrainedModel): +class BartDecoder(BartPreTrainedModel): """ Transformer decoder consisting of *config.decoder_layers* layers. Each layer is a [`BartDecoderLayer`] @@ -902,6 +1253,9 @@ def __init__(self, config: BartConfig, embed_tokens: Optional[nn.Embedding] = No config.d_model, ) self.layers = nn.ModuleList([BartDecoderLayer(config) for _ in range(config.decoder_layers)]) + self._use_flash_attention_2 = config._attn_implementation == "flash_attention_2" + self._use_sdpa = config._attn_implementation == "sdpa" + self.layernorm_embedding = nn.LayerNorm(config.d_model) self.gradient_checkpointing = False @@ -914,26 +1268,6 @@ def get_input_embeddings(self): def set_input_embeddings(self, value): self.embed_tokens = value - def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length): - # create causal mask - # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] - combined_attention_mask = None - if input_shape[-1] > 1: - combined_attention_mask = _make_causal_mask( - input_shape, inputs_embeds.dtype, past_key_values_length=past_key_values_length - ).to(inputs_embeds.device) - - if attention_mask is not None: - # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] - expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to( - inputs_embeds.device - ) - combined_attention_mask = ( - expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask - ) - - return combined_attention_mask - def forward( self, input_ids: torch.LongTensor = None, @@ -1000,11 +1334,11 @@ def forward( If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of - all `decoder_input_ids` of shape `(batch_size, sequence_length)`. 
inputs_embeds (`torch.FloatTensor` of - shape `(batch_size, sequence_length, hidden_size)`, *optional*): Optionally, instead of passing - `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more - control over how to convert `input_ids` indices into associated vectors than the model's internal - embedding lookup matrix. + all `decoder_input_ids` of shape `(batch_size, sequence_length)`. + inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. + This is useful if you want more control over how to convert `input_ids` indices into associated vectors + than the model's internal embedding lookup matrix. output_attentions (`bool`, *optional*): Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail. @@ -1040,14 +1374,42 @@ def forward( if inputs_embeds is None: inputs_embeds = self.embed_tokens(input) * self.embed_scale - attention_mask = self._prepare_decoder_attention_mask( - attention_mask, input_shape, inputs_embeds, past_key_values_length - ) + if self._use_flash_attention_2: + # 2d mask is passed through the layers + attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None + elif self._use_sdpa and not output_attentions and cross_attn_head_mask is None: + # output_attentions=True & cross_attn_head_mask can not be supported when using SDPA, and we fall back on + # the manual implementation that requires a 4D causal mask in all cases. + attention_mask = _prepare_4d_causal_attention_mask_for_sdpa( + attention_mask, + input_shape, + inputs_embeds, + past_key_values_length, + ) + else: + # 4d mask is passed through the layers + attention_mask = _prepare_4d_causal_attention_mask( + attention_mask, input_shape, inputs_embeds, past_key_values_length + ) # expand encoder attention mask if encoder_hidden_states is not None and encoder_attention_mask is not None: - # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] - encoder_attention_mask = _expand_mask(encoder_attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]) + if self._use_flash_attention_2: + encoder_attention_mask = encoder_attention_mask if 0 in encoder_attention_mask else None + elif self._use_sdpa and cross_attn_head_mask is None and not output_attentions: + # output_attentions=True & cross_attn_head_mask can not be supported when using SDPA, and we fall back on + # the manual implementation that requires a 4D causal mask in all cases. + # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] + encoder_attention_mask = _prepare_4d_attention_mask_for_sdpa( + encoder_attention_mask, + inputs_embeds.dtype, + tgt_len=input_shape[-1], + ) + else: + # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] + encoder_attention_mask = _prepare_4d_attention_mask( + encoder_attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1] + ) # embed positions positions = self.embed_positions(input, past_key_values_length) @@ -1058,6 +1420,13 @@ def forward( hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training) + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning_once( + "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." 
+ ) + use_cache = False + # decoder layers all_hidden_states = () if output_hidden_states else None all_self_attns = () if output_attentions else None @@ -1077,28 +1446,16 @@ def forward( # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description) if output_hidden_states: all_hidden_states += (hidden_states,) - dropout_probability = random.uniform(0, 1) - if self.training and (dropout_probability < self.layerdrop): - continue + if self.training: + dropout_probability = torch.rand([]) + if dropout_probability < self.layerdrop: + continue past_key_value = past_key_values[idx] if past_key_values is not None else None if self.gradient_checkpointing and self.training: - if use_cache: - logger.warning( - "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." - ) - use_cache = False - - def create_custom_forward(module): - def custom_forward(*inputs): - # None for past_key_value - return module(*inputs, output_attentions, use_cache) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(decoder_layer), + layer_outputs = self._gradient_checkpointing_func( + decoder_layer.__call__, hidden_states, attention_mask, encoder_hidden_states, @@ -1106,6 +1463,8 @@ def custom_forward(*inputs): head_mask[idx] if head_mask is not None else None, cross_attn_head_mask[idx] if cross_attn_head_mask is not None else None, None, + output_attentions, + use_cache, ) else: layer_outputs = decoder_layer( @@ -1156,8 +1515,8 @@ def custom_forward(*inputs): "The bare BART Model outputting raw hidden-states without any specific head on top.", BART_START_DOCSTRING, ) -class BartModel(BartPretrainedModel): - _keys_to_ignore_on_load_missing = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight"] +class BartModel(BartPreTrainedModel): + _tied_weights_keys = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight"] def __init__(self, config: BartConfig): super().__init__(config) @@ -1171,6 +1530,11 @@ def __init__(self, config: BartConfig): # Initialize weights and apply final processing self.post_init() + def _tie_weights(self): + if self.config.tie_word_embeddings: + self._tie_or_clone_weights(self.encoder.embed_tokens, self.shared) + self._tie_or_clone_weights(self.decoder.embed_tokens, self.shared) + def get_input_embeddings(self): return self.shared @@ -1283,14 +1647,10 @@ def forward( @add_start_docstrings( "The BART Model with a language modeling head. 
Can be used for summarization.", BART_START_DOCSTRING ) -class BartForConditionalGeneration(BartPretrainedModel): +class BartForConditionalGeneration(BartPreTrainedModel): base_model_prefix = "model" - _keys_to_ignore_on_load_missing = [ - r"final_logits_bias", - r"lm_head.weight", - "encoder.embed_tokens.weight", - "decoder.embed_tokens.weight", - ] + _tied_weights_keys = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight", "lm_head.weight"] + _keys_to_ignore_on_load_missing = ["final_logits_bias"] def __init__(self, config: BartConfig): super().__init__(config) @@ -1307,9 +1667,9 @@ def get_encoder(self): def get_decoder(self): return self.model.get_decoder() - def resize_token_embeddings(self, new_num_tokens: int) -> nn.Embedding: - new_embeddings = super().resize_token_embeddings(new_num_tokens) - self._resize_final_logits_bias(new_num_tokens) + def resize_token_embeddings(self, new_num_tokens: int, pad_to_multiple_of: Optional[int] = None) -> nn.Embedding: + new_embeddings = super().resize_token_embeddings(new_num_tokens, pad_to_multiple_of) + self._resize_final_logits_bias(new_embeddings.weight.shape[0]) return new_embeddings def _resize_final_logits_bias(self, new_num_tokens: int) -> None: @@ -1391,6 +1751,7 @@ def forward( masked_lm_loss = None if labels is not None: + labels = labels.to(lm_logits.device) loss_fct = CrossEntropyLoss() masked_lm_loss = loss_fct(lm_logits.view(-1, self.config.vocab_size), labels.view(-1)) @@ -1425,7 +1786,16 @@ def prepare_inputs_for_generation( ): # cut decoder_input_ids if past_key_values is used if past_key_values is not None: - decoder_input_ids = decoder_input_ids[:, -1:] + past_length = past_key_values[0][0].shape[2] + + # Some generation methods already pass only the last input ID + if decoder_input_ids.shape[1] > past_length: + remove_prefix_length = past_length + else: + # Default to old behavior: keep only final ID + remove_prefix_length = decoder_input_ids.shape[1] - 1 + + decoder_input_ids = decoder_input_ids[:, remove_prefix_length:] return { "input_ids": None, # encoder_outputs is defined. 
input_ids not needed @@ -1449,7 +1819,8 @@ def _reorder_cache(past_key_values, beam_idx): for layer_past in past_key_values: # cached cross_attention states don't have to be reordered -> they are always the same reordered_past += ( - tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:], + tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past[:2]) + + layer_past[2:], ) return reordered_past @@ -1461,8 +1832,8 @@ def _reorder_cache(past_key_values, beam_idx): """, BART_START_DOCSTRING, ) -class BartForSequenceClassification(BartPretrainedModel): - _keys_to_ignore_on_load_missing = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight"] +class BartForSequenceClassification(BartPreTrainedModel): + _tied_weights_keys = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight"] def __init__(self, config: BartConfig, **kwargs): super().__init__(config, **kwargs) @@ -1546,6 +1917,7 @@ def forward( loss = None if labels is not None: + labels = labels.to(logits.device) if self.config.problem_type is None: if self.config.num_labels == 1: self.config.problem_type = "regression" @@ -1590,8 +1962,8 @@ def forward( """, BART_START_DOCSTRING, ) -class BartForQuestionAnswering(BartPretrainedModel): - _keys_to_ignore_on_load_missing = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight"] +class BartForQuestionAnswering(BartPreTrainedModel): + _tied_weights_keys = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight"] def __init__(self, config): super().__init__(config) @@ -1708,7 +2080,7 @@ def forward( ) -class BartDecoderWrapper(BartPretrainedModel): +class BartDecoderWrapper(BartPreTrainedModel): """ This wrapper class is a helper class to correctly load pretrained checkpoints when the causal language model is used in combination with the [`EncoderDecoderModel`] framework. @@ -1728,8 +2100,8 @@ def forward(self, *args, **kwargs): """, BART_START_DOCSTRING, ) -class BartForCausalLM(BartPretrainedModel): - _keys_to_ignore_on_load_missing = ["lm_head.weight"] +class BartForCausalLM(BartPreTrainedModel): + _tied_weights_keys = ["lm_head.weight"] def __init__(self, config): config = copy.deepcopy(config) @@ -1889,6 +2261,7 @@ def forward( loss = None if labels is not None: + labels = labels.to(logits.device) loss_fct = CrossEntropyLoss() loss = loss_fct(logits.view(-1, self.config.vocab_size), labels.view(-1)) @@ -1913,7 +2286,16 @@ def prepare_inputs_for_generation( attention_mask = input_ids.new_ones(input_ids.shape) if past_key_values: - input_ids = input_ids[:, -1:] + past_length = past_key_values[0][0].shape[2] + + # Some generation methods already pass only the last input ID + if input_ids.shape[1] > past_length: + remove_prefix_length = past_length + else: + # Default to old behavior: keep only final ID + remove_prefix_length = input_ids.shape[1] - 1 + + input_ids = input_ids[:, remove_prefix_length:] # first step, decoder_cached_states are empty return { "input_ids": input_ids, # encoder_outputs is defined. 
input_ids not needed @@ -1926,5 +2308,7 @@ def prepare_inputs_for_generation( def _reorder_cache(past_key_values, beam_idx): reordered_past = () for layer_past in past_key_values: - reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),) + reordered_past += ( + tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past), + ) return reordered_past diff --git a/src/transformers/models/bart/modeling_flax_bart.py b/src/transformers/models/bart/modeling_flax_bart.py index ac292cc77707db..6abfcdc398422f 100644 --- a/src/transformers/models/bart/modeling_flax_bart.py +++ b/src/transformers/models/bart/modeling_flax_bart.py @@ -22,7 +22,6 @@ import flax.linen as nn import jax import jax.numpy as jnp -import numpy as np from flax.core.frozen_dict import FrozenDict, freeze, unfreeze from flax.linen import combine_masks, make_causal_mask from flax.linen.attention import dot_product_attention_weights @@ -218,15 +217,15 @@ """ -def shift_tokens_right(input_ids: np.array, pad_token_id: int, decoder_start_token_id: int) -> np.ndarray: +def shift_tokens_right(input_ids: jnp.ndarray, pad_token_id: int, decoder_start_token_id: int) -> jnp.ndarray: """ Shift input ids one token to the right. """ - shifted_input_ids = np.zeros_like(input_ids) - shifted_input_ids[:, 1:] = input_ids[:, :-1] - shifted_input_ids[:, 0] = decoder_start_token_id + shifted_input_ids = jnp.zeros_like(input_ids) + shifted_input_ids = shifted_input_ids.at[:, 1:].set(input_ids[:, :-1]) + shifted_input_ids = shifted_input_ids.at[:, 0].set(decoder_start_token_id) - shifted_input_ids = np.where(shifted_input_ids == -100, pad_token_id, shifted_input_ids) + shifted_input_ids = jnp.where(shifted_input_ids == -100, pad_token_id, shifted_input_ids) return shifted_input_ids @@ -1468,8 +1467,8 @@ def prepare_inputs_for_generation( self, decoder_input_ids, max_length, - attention_mask: Optional[jnp.DeviceArray] = None, - decoder_attention_mask: Optional[jnp.DeviceArray] = None, + attention_mask: Optional[jax.Array] = None, + decoder_attention_mask: Optional[jax.Array] = None, encoder_outputs=None, **kwargs, ): @@ -1961,7 +1960,7 @@ def __call__( class FlaxBartForCausalLM(FlaxBartDecoderPreTrainedModel): module_class = FlaxBartForCausalLMModule - def prepare_inputs_for_generation(self, input_ids, max_length, attention_mask: Optional[jnp.DeviceArray] = None): + def prepare_inputs_for_generation(self, input_ids, max_length, attention_mask: Optional[jax.Array] = None): # initializing the cache batch_size, seq_length = input_ids.shape diff --git a/src/transformers/models/bart/modeling_tf_bart.py b/src/transformers/models/bart/modeling_tf_bart.py index 6e29434c4df158..1e38908b4a4934 100644 --- a/src/transformers/models/bart/modeling_tf_bart.py +++ b/src/transformers/models/bart/modeling_tf_bart.py @@ -15,6 +15,8 @@ """ TF 2.0 Bart model.""" +from __future__ import annotations + import random from typing import Optional, Tuple, Union @@ -32,17 +34,16 @@ # Public API from ...modeling_tf_utils import ( - DUMMY_INPUTS, TFCausalLanguageModelingLoss, TFModelInputType, TFPreTrainedModel, TFSequenceClassificationLoss, + keras, keras_serializable, unpack_inputs, ) -from ...tf_utils import shape_list, stable_softmax +from ...tf_utils import check_embeddings_within_bounds, shape_list, stable_softmax from ...utils import ( - ContextManagers, add_code_sample_docstrings, add_end_docstrings, add_start_docstrings, @@ -116,7 +117,7 @@ def _expand_mask(mask: tf.Tensor, tgt_len: Optional[int] = None): 
return (one_cst - expanded_mask) * LARGE_NEGATIVE -class TFBartLearnedPositionalEmbedding(tf.keras.layers.Embedding): +class TFBartLearnedPositionalEmbedding(keras.layers.Embedding): """ This module learns positional embeddings up to a fixed maximum size. """ @@ -131,7 +132,7 @@ def call( self, input_shape: Optional[tf.TensorShape] = None, past_key_values_length: int = 0, - position_ids: Optional[tf.Tensor] = None, + position_ids: tf.Tensor | None = None, ): """Input is expected to be of size [bsz x seqlen].""" if position_ids is None: @@ -143,7 +144,7 @@ def call( return super().call(position_ids + tf.constant(self.offset, dtype=offset_dtype)) -class TFBartAttention(tf.keras.layers.Layer): +class TFBartAttention(keras.layers.Layer): """Multi-headed attention from "Attention Is All You Need""" def __init__( @@ -159,7 +160,7 @@ def __init__( self.embed_dim = embed_dim self.num_heads = num_heads - self.dropout = tf.keras.layers.Dropout(dropout) + self.dropout = keras.layers.Dropout(dropout) self.head_dim = embed_dim // num_heads if (self.head_dim * num_heads) != self.embed_dim: raise ValueError( @@ -169,10 +170,10 @@ def __init__( self.scaling = self.head_dim**-0.5 self.is_decoder = is_decoder - self.k_proj = tf.keras.layers.Dense(embed_dim, use_bias=bias, name="k_proj") - self.q_proj = tf.keras.layers.Dense(embed_dim, use_bias=bias, name="q_proj") - self.v_proj = tf.keras.layers.Dense(embed_dim, use_bias=bias, name="v_proj") - self.out_proj = tf.keras.layers.Dense(embed_dim, use_bias=bias, name="out_proj") + self.k_proj = keras.layers.Dense(embed_dim, use_bias=bias, name="k_proj") + self.q_proj = keras.layers.Dense(embed_dim, use_bias=bias, name="q_proj") + self.v_proj = keras.layers.Dense(embed_dim, use_bias=bias, name="v_proj") + self.out_proj = keras.layers.Dense(embed_dim, use_bias=bias, name="out_proj") def _shape(self, tensor: tf.Tensor, seq_len: int, bsz: int): return tf.transpose(tf.reshape(tensor, (bsz, seq_len, self.num_heads, self.head_dim)), (0, 2, 1, 3)) @@ -180,12 +181,12 @@ def _shape(self, tensor: tf.Tensor, seq_len: int, bsz: int): def call( self, hidden_states: tf.Tensor, - key_value_states: Optional[tf.Tensor] = None, - past_key_value: Optional[Tuple[Tuple[tf.Tensor]]] = None, - attention_mask: Optional[tf.Tensor] = None, - layer_head_mask: Optional[tf.Tensor] = None, + key_value_states: tf.Tensor | None = None, + past_key_value: Tuple[Tuple[tf.Tensor]] | None = None, + attention_mask: tf.Tensor | None = None, + layer_head_mask: tf.Tensor | None = None, training: Optional[bool] = False, - ) -> Tuple[tf.Tensor, Optional[tf.Tensor]]: + ) -> Tuple[tf.Tensor, tf.Tensor | None]: """Input shape: Batch x Time x Channel""" # if key_value_states are provided this layer is used as a cross-attention layer @@ -295,32 +296,50 @@ def call( return attn_output, attn_weights, past_key_value - -class TFBartEncoderLayer(tf.keras.layers.Layer): + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "k_proj", None) is not None: + with tf.name_scope(self.k_proj.name): + self.k_proj.build([None, None, self.embed_dim]) + if getattr(self, "q_proj", None) is not None: + with tf.name_scope(self.q_proj.name): + self.q_proj.build([None, None, self.embed_dim]) + if getattr(self, "v_proj", None) is not None: + with tf.name_scope(self.v_proj.name): + self.v_proj.build([None, None, self.embed_dim]) + if getattr(self, "out_proj", None) is not None: + with tf.name_scope(self.out_proj.name): + self.out_proj.build([None, None, self.embed_dim]) + + +class 
TFBartEncoderLayer(keras.layers.Layer): def __init__(self, config: BartConfig, **kwargs): super().__init__(**kwargs) self.embed_dim = config.d_model self.self_attn = TFBartAttention( self.embed_dim, config.encoder_attention_heads, dropout=config.attention_dropout, name="self_attn" ) - self.self_attn_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="self_attn_layer_norm") - self.dropout = tf.keras.layers.Dropout(config.dropout) + self.self_attn_layer_norm = keras.layers.LayerNormalization(epsilon=1e-5, name="self_attn_layer_norm") + self.dropout = keras.layers.Dropout(config.dropout) self.activation_fn = get_tf_activation(config.activation_function) - self.activation_dropout = tf.keras.layers.Dropout(config.activation_dropout) - self.fc1 = tf.keras.layers.Dense(config.encoder_ffn_dim, name="fc1") - self.fc2 = tf.keras.layers.Dense(self.embed_dim, name="fc2") - self.final_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="final_layer_norm") + self.activation_dropout = keras.layers.Dropout(config.activation_dropout) + self.fc1 = keras.layers.Dense(config.encoder_ffn_dim, name="fc1") + self.fc2 = keras.layers.Dense(self.embed_dim, name="fc2") + self.final_layer_norm = keras.layers.LayerNormalization(epsilon=1e-5, name="final_layer_norm") + self.config = config def call( self, hidden_states: tf.Tensor, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]], - layer_head_mask: Optional[tf.Tensor], + attention_mask: np.ndarray | tf.Tensor | None, + layer_head_mask: tf.Tensor | None, training: Optional[bool] = False, ) -> tf.Tensor: """ Args: - hidden_states (`tf.Tensor`): input to the layer of shape `(seq_len, batch, embed_dim)` + hidden_states (`tf.Tensor`): input to the layer of shape `(batch, seq_len, embed_dim)` attention_mask (`tf.Tensor`): attention mask of size `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. 
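A note on the type hints in these hunks: the signatures switch from `Optional[tf.Tensor]` to `tf.Tensor | None`, which is only safe on older Python versions because of the `from __future__ import annotations` added at the top of the module; annotations are then stored as strings and never evaluated. A small illustration, using an invented stand-in class (`FakeTensor`) rather than a real tensor type:

```python
from __future__ import annotations


class FakeTensor:  # stand-in for tf.Tensor, purely illustrative
    pass


def call(hidden_states: FakeTensor, attention_mask: FakeTensor | None = None) -> FakeTensor:
    return hidden_states


# With postponed evaluation (PEP 563) the annotation is kept as a plain string,
# so the pipe-union syntax cannot fail at function-definition time.
print(call.__annotations__["attention_mask"])  # 'FakeTensor | None'
```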
layer_head_mask (`tf.Tensor`): mask for attention heads in a given layer of size @@ -351,8 +370,28 @@ def call( return hidden_states, self_attn_weights - -class TFBartDecoderLayer(tf.keras.layers.Layer): + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "self_attn", None) is not None: + with tf.name_scope(self.self_attn.name): + self.self_attn.build(None) + if getattr(self, "self_attn_layer_norm", None) is not None: + with tf.name_scope(self.self_attn_layer_norm.name): + self.self_attn_layer_norm.build([None, None, self.embed_dim]) + if getattr(self, "fc1", None) is not None: + with tf.name_scope(self.fc1.name): + self.fc1.build([None, None, self.embed_dim]) + if getattr(self, "fc2", None) is not None: + with tf.name_scope(self.fc2.name): + self.fc2.build([None, None, self.config.encoder_ffn_dim]) + if getattr(self, "final_layer_norm", None) is not None: + with tf.name_scope(self.final_layer_norm.name): + self.final_layer_norm.build([None, None, self.embed_dim]) + + +class TFBartDecoderLayer(keras.layers.Layer): def __init__(self, config: BartConfig, **kwargs): super().__init__(**kwargs) self.embed_dim = config.d_model @@ -363,11 +402,11 @@ def __init__(self, config: BartConfig, **kwargs): name="self_attn", is_decoder=True, ) - self.dropout = tf.keras.layers.Dropout(config.dropout) + self.dropout = keras.layers.Dropout(config.dropout) self.activation_fn = get_tf_activation(config.activation_function) - self.activation_dropout = tf.keras.layers.Dropout(config.activation_dropout) + self.activation_dropout = keras.layers.Dropout(config.activation_dropout) - self.self_attn_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="self_attn_layer_norm") + self.self_attn_layer_norm = keras.layers.LayerNormalization(epsilon=1e-5, name="self_attn_layer_norm") self.encoder_attn = TFBartAttention( self.embed_dim, config.decoder_attention_heads, @@ -375,29 +414,30 @@ def __init__(self, config: BartConfig, **kwargs): name="encoder_attn", is_decoder=True, ) - self.encoder_attn_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="encoder_attn_layer_norm") - self.fc1 = tf.keras.layers.Dense(config.decoder_ffn_dim, name="fc1") - self.fc2 = tf.keras.layers.Dense(self.embed_dim, name="fc2") - self.final_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="final_layer_norm") + self.encoder_attn_layer_norm = keras.layers.LayerNormalization(epsilon=1e-5, name="encoder_attn_layer_norm") + self.fc1 = keras.layers.Dense(config.decoder_ffn_dim, name="fc1") + self.fc2 = keras.layers.Dense(self.embed_dim, name="fc2") + self.final_layer_norm = keras.layers.LayerNormalization(epsilon=1e-5, name="final_layer_norm") + self.config = config def call( self, hidden_states: tf.Tensor, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - encoder_hidden_states: Optional[Union[np.ndarray, tf.Tensor]] = None, - encoder_attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - layer_head_mask: Optional[tf.Tensor] = None, - cross_attn_layer_head_mask: Optional[tf.Tensor] = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + encoder_hidden_states: np.ndarray | tf.Tensor | None = None, + encoder_attention_mask: np.ndarray | tf.Tensor | None = None, + layer_head_mask: tf.Tensor | None = None, + cross_attn_layer_head_mask: tf.Tensor | None = None, past_key_value: Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None, training: Optional[bool] = False, ) -> Tuple[tf.Tensor, tf.Tensor, 
Tuple[Tuple[tf.Tensor]]]: """ Args: - hidden_states (`tf.Tensor`): input to the layer of shape `(seq_len, batch, embed_dim)` + hidden_states (`tf.Tensor`): input to the layer of shape `(batch, seq_len, embed_dim)` attention_mask (`tf.Tensor`): attention mask of size `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. encoder_hidden_states (`tf.Tensor`): - cross attention input to the layer of shape `(seq_len, batch, embed_dim)` + cross attention input to the layer of shape `(batch, seq_len, embed_dim)` encoder_attention_mask (`tf.Tensor`): encoder attention mask of size `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. layer_head_mask (`tf.Tensor`): mask for attention heads in a given layer of size @@ -460,24 +500,63 @@ def call( present_key_value, ) - -class TFBartClassificationHead(tf.keras.layers.Layer): + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "self_attn", None) is not None: + with tf.name_scope(self.self_attn.name): + self.self_attn.build(None) + if getattr(self, "self_attn_layer_norm", None) is not None: + with tf.name_scope(self.self_attn_layer_norm.name): + self.self_attn_layer_norm.build([None, None, self.embed_dim]) + if getattr(self, "encoder_attn", None) is not None: + with tf.name_scope(self.encoder_attn.name): + self.encoder_attn.build(None) + if getattr(self, "encoder_attn_layer_norm", None) is not None: + with tf.name_scope(self.encoder_attn_layer_norm.name): + self.encoder_attn_layer_norm.build([None, None, self.embed_dim]) + if getattr(self, "fc1", None) is not None: + with tf.name_scope(self.fc1.name): + self.fc1.build([None, None, self.embed_dim]) + if getattr(self, "fc2", None) is not None: + with tf.name_scope(self.fc2.name): + self.fc2.build([None, None, self.config.decoder_ffn_dim]) + if getattr(self, "final_layer_norm", None) is not None: + with tf.name_scope(self.final_layer_norm.name): + self.final_layer_norm.build([None, None, self.embed_dim]) + + +class TFBartClassificationHead(keras.layers.Layer): """Head for sentence-level classification tasks.""" def __init__(self, inner_dim: int, num_classes: int, pooler_dropout: float, name: str, **kwargs): super().__init__(name=name, **kwargs) - self.dense = tf.keras.layers.Dense(inner_dim, name="dense") - self.dropout = tf.keras.layers.Dropout(pooler_dropout) - self.out_proj = tf.keras.layers.Dense(num_classes, name="out_proj") + self.dense = keras.layers.Dense(inner_dim, name="dense") + self.dropout = keras.layers.Dropout(pooler_dropout) + self.out_proj = keras.layers.Dense(num_classes, name="out_proj") + self.input_dim = inner_dim + self.inner_dim = inner_dim def call(self, inputs): hidden_states = self.dropout(inputs) hidden_states = self.dense(hidden_states) - hidden_states = tf.keras.activations.tanh(hidden_states) + hidden_states = keras.activations.tanh(hidden_states) hidden_states = self.dropout(hidden_states) hidden_states = self.out_proj(hidden_states) return hidden_states + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "dense", None) is not None: + with tf.name_scope(self.dense.name): + self.dense.build([None, None, self.input_dim]) + if getattr(self, "out_proj", None) is not None: + with tf.name_scope(self.out_proj.name): + self.out_proj.build([None, None, self.inner_dim]) + class TFBartPretrainedModel(TFPreTrainedModel): config_class = BartConfig @@ -485,30 +564,19 @@ class 
TFBartPretrainedModel(TFPreTrainedModel): @property def dummy_inputs(self): - pad_token = 1 - input_ids = tf.convert_to_tensor(DUMMY_INPUTS, dtype=tf.int32) - decoder_input_ids = tf.convert_to_tensor(DUMMY_INPUTS, dtype=tf.int32) - dummy_inputs = { - "decoder_input_ids": decoder_input_ids, - "attention_mask": tf.cast(input_ids != pad_token, tf.int32), - "input_ids": input_ids, - } + dummy_inputs = super().dummy_inputs + # Dummy inputs should not contain the default val of 1 + # as this is the padding token and some assertions check it + dummy_inputs["input_ids"] = dummy_inputs["input_ids"] * 2 + if "decoder_input_ids" in dummy_inputs: + dummy_inputs["decoder_input_ids"] = dummy_inputs["decoder_input_ids"] * 2 return dummy_inputs - @tf.function( - input_signature=[ - { - "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"), - "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"), - "decoder_input_ids": tf.TensorSpec((None, None), tf.int32, name="decoder_input_ids"), - "decoder_attention_mask": tf.TensorSpec((None, None), tf.int32, name="decoder_attention_mask"), - } - ] - ) - def serving(self, inputs): - output = self.call(inputs) - - return self.serving_output(output) + def tf_to_pt_weight_rename(self, tf_weight): + if tf_weight == "model.shared.weight": + return tf_weight, "model.decoder.embed_tokens.weight" + else: + return (tf_weight,) BART_START_DOCSTRING = r""" @@ -516,7 +584,7 @@ def serving(self, inputs): library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.) - This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it + This model is also a [keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage and behavior. @@ -592,7 +660,7 @@ def serving(self, inputs): input_ids (`tf.Tensor` of shape `({0})`): Indices of input sequence tokens in the vocabulary. - Indices can be obtained using [`BertTokenizer`]. See [`PreTrainedTokenizer.encode`] and + Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for details. [What are input IDs?](../glossary#input-ids) @@ -648,6 +716,10 @@ def serving(self, inputs): If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `decoder_input_ids` of shape `(batch_size, sequence_length)`. + inputs_embeds (`tf.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. + This is useful if you want more control over how to convert `input_ids` indices into associated vectors + than the model's internal embedding lookup matrix. use_cache (`bool`, *optional*, defaults to `True`): If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`). Set to `False` during training, `True` during generation @@ -669,7 +741,7 @@ def serving(self, inputs): @keras_serializable -class TFBartEncoder(tf.keras.layers.Layer): +class TFBartEncoder(keras.layers.Layer): config_class = BartConfig """ Transformer encoder consisting of *config.encoder_layers* self attention layers. 
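The `build()` methods added throughout this file all follow the same shape: return early if already built, then build each sublayer under its own `tf.name_scope` with an explicit input shape, which appears to keep weight names stable for checkpoint loading without requiring a forward pass. A toy layer sketches the pattern; `TinyBlock` is invented for the example, and plain `tf.keras` stands in for the library-internal `keras` shim used in the patch.

```python
import tensorflow as tf


class TinyBlock(tf.keras.layers.Layer):
    def __init__(self, hidden_dim: int, **kwargs):
        super().__init__(**kwargs)
        self.hidden_dim = hidden_dim
        self.dense = tf.keras.layers.Dense(hidden_dim, name="dense")
        self.norm = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="norm")

    def call(self, x):
        return self.norm(self.dense(x))

    def build(self, input_shape=None):
        if self.built:
            return
        self.built = True
        # Build each sublayer under its own name scope with an explicit shape,
        # mirroring e.g. the TFBartAttention / TFBartEncoderLayer builds above.
        if getattr(self, "dense", None) is not None:
            with tf.name_scope(self.dense.name):
                self.dense.build([None, None, self.hidden_dim])
        if getattr(self, "norm", None) is not None:
            with tf.name_scope(self.norm.name):
                self.norm.build([None, None, self.hidden_dim])


block = TinyBlock(hidden_dim=8)
block.build(None)
print([w.name for w in block.weights])  # weights exist and are named without a forward pass
```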
Each layer is a @@ -679,10 +751,10 @@ class TFBartEncoder(tf.keras.layers.Layer): config: BartConfig """ - def __init__(self, config: BartConfig, embed_tokens: Optional[tf.keras.layers.Embedding] = None, **kwargs): + def __init__(self, config: BartConfig, embed_tokens: Optional[keras.layers.Embedding] = None, **kwargs): super().__init__(**kwargs) self.config = config - self.dropout = tf.keras.layers.Dropout(config.dropout) + self.dropout = keras.layers.Dropout(config.dropout) self.layerdrop = config.encoder_layerdrop self.padding_idx = config.pad_token_id self.max_source_positions = config.max_position_embeddings @@ -695,15 +767,16 @@ def __init__(self, config: BartConfig, embed_tokens: Optional[tf.keras.layers.Em name="embed_positions", ) self.layers = [TFBartEncoderLayer(config, name=f"layers.{i}") for i in range(config.encoder_layers)] - self.layernorm_embedding = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="layernorm_embedding") + self.layernorm_embedding = keras.layers.LayerNormalization(epsilon=1e-5, name="layernorm_embedding") + self.embed_dim = config.d_model @unpack_inputs def call( self, - input_ids: Optional[TFModelInputType] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, @@ -755,25 +828,8 @@ def call( raise ValueError("You have to specify either input_ids or inputs_embeds") if inputs_embeds is None: - # if `self.embed_tokens.load_weight_prefix` is set, runs the embedding operation with the correct name - # scope, so that its weights are registered with the desired name for loading/storing. When `tf.name_scope` - # is used with a name ending in `/`, that name replaces the current name scope. - # (embeddings with tf.name_scope: self.embed_tokens.load_weight_prefix/self.embed_tokens.name/embeddings:0) - context = [] - if hasattr(self.embed_tokens, "load_weight_prefix"): - context.append(tf.name_scope(self.embed_tokens.load_weight_prefix + "/")) - with ContextManagers(context): - # Note: tf.gather, on which the embedding layer is based, won't check positive out of bound - # indices on GPU, returning zeros instead. This is a dangerous silent behavior. 
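The inline `tf.debugging.assert_less` block being removed here is consolidated, just below, into the shared `check_embeddings_within_bounds` helper imported from `...tf_utils` at the top of the file. That helper's exact implementation is not shown in this patch; the sketch below uses an invented name (`check_ids_within_bounds`) and only illustrates what such a guard has to do and why it matters on GPU.

```python
import tensorflow as tf


def check_ids_within_bounds(input_ids: tf.Tensor, embed_dim: int, tensor_name: str = "input_ids") -> None:
    # tf.gather (which backs the embedding lookup) silently returns zeros for
    # out-of-bound indices on GPU, so the check has to fail loudly instead.
    tf.debugging.assert_less(
        input_ids,
        tf.cast(embed_dim, dtype=input_ids.dtype),
        message=(
            f"The maximum value of {tensor_name} ({tf.math.reduce_max(input_ids)}) must be smaller than the "
            f"embedding layer's input dimension ({embed_dim})."
        ),
    )


check_ids_within_bounds(tf.constant([[1, 2, 3]]), embed_dim=10)   # passes silently
# check_ids_within_bounds(tf.constant([[11]]), embed_dim=10)      # would raise InvalidArgumentError
```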
- tf.debugging.assert_less( - input_ids, - tf.cast(self.embed_tokens.input_dim, dtype=input_ids.dtype), - message=( - "input_ids must be smaller than the embedding layer's input dimension (got" - f" {tf.math.reduce_max(input_ids)} >= {self.embed_tokens.input_dim})" - ), - ) - inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale + check_embeddings_within_bounds(input_ids, self.embed_tokens.input_dim) + inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale embed_pos = self.embed_positions(input_shape) hidden_states = inputs_embeds + embed_pos @@ -828,9 +884,24 @@ def call( last_hidden_state=hidden_states, hidden_states=encoder_states, attentions=all_attentions ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "embed_positions", None) is not None: + with tf.name_scope(self.embed_positions.name): + self.embed_positions.build(None) + if getattr(self, "layernorm_embedding", None) is not None: + with tf.name_scope(self.layernorm_embedding.name): + self.layernorm_embedding.build([None, None, self.embed_dim]) + if getattr(self, "layers", None) is not None: + for layer in self.layers: + with tf.name_scope(layer.name): + layer.build(None) + @keras_serializable -class TFBartDecoder(tf.keras.layers.Layer): +class TFBartDecoder(keras.layers.Layer): config_class = BartConfig """ Transformer decoder consisting of *config.decoder_layers* layers. Each layer is a [`TFBartDecoderLayer`] @@ -840,7 +911,7 @@ class TFBartDecoder(tf.keras.layers.Layer): embed_tokens: output embedding """ - def __init__(self, config: BartConfig, embed_tokens: Optional[tf.keras.layers.Embedding] = None, **kwargs): + def __init__(self, config: BartConfig, embed_tokens: Optional[keras.layers.Embedding] = None, **kwargs): super().__init__(**kwargs) self.config = config self.padding_idx = config.pad_token_id @@ -853,21 +924,21 @@ def __init__(self, config: BartConfig, embed_tokens: Optional[tf.keras.layers.Em ) self.embed_scale = tf.math.sqrt(float(config.d_model)) if config.scale_embedding else 1.0 self.layers = [TFBartDecoderLayer(config, name=f"layers.{i}") for i in range(config.decoder_layers)] - self.layernorm_embedding = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="layernorm_embedding") + self.layernorm_embedding = keras.layers.LayerNormalization(epsilon=1e-5, name="layernorm_embedding") - self.dropout = tf.keras.layers.Dropout(config.dropout) + self.dropout = keras.layers.Dropout(config.dropout) @unpack_inputs def call( self, - input_ids: Optional[TFModelInputType] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - encoder_hidden_states: Optional[Union[np.ndarray, tf.Tensor]] = None, - encoder_attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - cross_attn_head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + encoder_hidden_states: np.ndarray | tf.Tensor | None = None, + encoder_attention_mask: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + cross_attn_head_mask: np.ndarray | tf.Tensor | None = None, past_key_values: Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = 
None, use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None, @@ -924,11 +995,11 @@ def call( If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of - all `decoder_input_ids` of shape `(batch_size, sequence_length)`. inputs_embeds (`tf.Tensor` of shape - `(batch_size, sequence_length, hidden_size)`, *optional*): Optionally, instead of passing `input_ids` - you can choose to directly pass an embedded representation. This is useful if you want more control - over how to convert `input_ids` indices into associated vectors than the model's internal embedding - lookup matrix. + all `decoder_input_ids` of shape `(batch_size, sequence_length)`. + inputs_embeds (`tf.tTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. + This is useful if you want more control over how to convert `input_ids` indices into associated vectors + than the model's internal embedding lookup matrix. output_attentions (`bool`, *optional*): Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail. @@ -957,25 +1028,8 @@ def call( positions = self.embed_positions(input_shape, position_ids=position_ids) if inputs_embeds is None: - # if `self.embed_tokens.load_weight_prefix` is set, runs the embedding operation with the correct name - # scope, so that its weights are registered with the desired name for loading/storing. When `tf.name_scope` - # is used with a name ending in `/`, that name replaces the current name scope. - # (embeddings with tf.name_scope: self.embed_tokens.load_weight_prefix/self.embed_tokens.name/embeddings:0) - context = [] - if hasattr(self.embed_tokens, "load_weight_prefix"): - context.append(tf.name_scope(self.embed_tokens.load_weight_prefix + "/")) - with ContextManagers(context): - # Note: tf.gather, on which the embedding layer is based, won't check positive out of bound - # indices on GPU, returning zeros instead. This is a dangerous silent behavior. 
- tf.debugging.assert_less( - input_ids, - tf.cast(self.embed_tokens.input_dim, dtype=input_ids.dtype), - message=( - "input_ids must be smaller than the embedding layer's input dimension (got" - f" {tf.math.reduce_max(input_ids)} >= {self.embed_tokens.input_dim})" - ), - ) - inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale + check_embeddings_within_bounds(input_ids, self.embed_tokens.input_dim) + inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale hidden_states = inputs_embeds @@ -1060,18 +1114,33 @@ def call( cross_attentions=all_cross_attns, ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "embed_positions", None) is not None: + with tf.name_scope(self.embed_positions.name): + self.embed_positions.build(None) + if getattr(self, "layernorm_embedding", None) is not None: + with tf.name_scope(self.layernorm_embedding.name): + self.layernorm_embedding.build([None, None, self.config.d_model]) + if getattr(self, "layers", None) is not None: + for layer in self.layers: + with tf.name_scope(layer.name): + layer.build(None) + @keras_serializable -class TFBartMainLayer(tf.keras.layers.Layer): +class TFBartMainLayer(keras.layers.Layer): config_class = BartConfig def __init__(self, config: BartConfig, load_weight_prefix=None, **kwargs): super().__init__(**kwargs) self.config = config - self.shared = tf.keras.layers.Embedding( + self.shared = keras.layers.Embedding( input_dim=config.vocab_size, output_dim=config.d_model, - embeddings_initializer=tf.keras.initializers.TruncatedNormal(stddev=self.config.init_std), + embeddings_initializer=keras.initializers.TruncatedNormal(stddev=self.config.init_std), name="model.shared", ) # Additional attribute to specify the expected name scope of the layer (for loading/storing weights) @@ -1091,18 +1160,18 @@ def set_input_embeddings(self, new_embeddings): @unpack_inputs def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - decoder_input_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - decoder_attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - decoder_position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - decoder_head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - cross_attn_head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + decoder_input_ids: np.ndarray | tf.Tensor | None = None, + decoder_attention_mask: np.ndarray | tf.Tensor | None = None, + decoder_position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + decoder_head_mask: np.ndarray | tf.Tensor | None = None, + cross_attn_head_mask: np.ndarray | tf.Tensor | None = None, encoder_outputs: Optional[Union[Tuple, TFBaseModelOutput]] = None, past_key_values: Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, - decoder_inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, + decoder_inputs_embeds: np.ndarray | tf.Tensor | None = None, use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, @@ -1177,6 +1246,22 @@ def call( encoder_attentions=encoder_outputs.attentions, ) + def build(self, input_shape=None): + if 
self.built: + return + self.built = True + # The shared/tied weights expect to be in the model base namespace + # Adding "/" to the end (not the start!) of a tf.name_scope puts it in the root namespace rather than + # the current one. + with tf.name_scope(self.shared.load_weight_prefix + "/" + self.shared.name + "/"): + self.shared.build(None) + if getattr(self, "encoder", None) is not None: + with tf.name_scope(self.encoder.name): + self.encoder.build(None) + if getattr(self, "decoder", None) is not None: + with tf.name_scope(self.decoder.name): + self.decoder.build(None) + @add_start_docstrings( "The bare BART Model outputting raw hidden-states without any specific head on top.", @@ -1205,18 +1290,18 @@ def get_decoder(self): @unpack_inputs def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - decoder_input_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - decoder_attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - decoder_position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - decoder_head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - cross_attn_head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + decoder_input_ids: np.ndarray | tf.Tensor | None = None, + decoder_attention_mask: np.ndarray | tf.Tensor | None = None, + decoder_position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + decoder_head_mask: np.ndarray | tf.Tensor | None = None, + cross_attn_head_mask: np.ndarray | tf.Tensor | None = None, encoder_outputs: Optional[Union[Tuple, TFBaseModelOutput]] = None, past_key_values: Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, - decoder_inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, + decoder_inputs_embeds: np.ndarray | tf.Tensor | None = None, use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, @@ -1265,10 +1350,18 @@ def serving_output(self, output): encoder_attentions=enc_attns, ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "model", None) is not None: + with tf.name_scope(self.model.name): + self.model.build(None) -class BiasLayer(tf.keras.layers.Layer): + +class BiasLayer(keras.layers.Layer): """ - Bias as a layer. It is used for serialization purposes: `tf.keras.Model.save_weights` stores on a per-layer basis, + Bias as a layer. It is used for serialization purposes: `keras.Model.save_weights` stores on a per-layer basis, so all weights have to be registered in a layer. 
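The `BiasLayer` docstring above explains the trick: `keras.Model.save_weights` serializes weights per layer, so a lone bias vector is wrapped in its own layer to get saved at all. A simplified sketch of the idea follows; it is not the exact class from the patch, and the weight name and default arguments are assumptions.

```python
import tensorflow as tf


class BiasLayer(tf.keras.layers.Layer):
    """A single bias vector registered as its own layer so it is serializable."""

    def __init__(self, shape, initializer="zeros", trainable=True, name="final_logits_bias", **kwargs):
        super().__init__(name=name, **kwargs)
        self.bias = self.add_weight(name="bias", shape=shape, initializer=initializer, trainable=trainable)

    def call(self, x):
        return x + self.bias


layer = BiasLayer(shape=[1, 5])
print(layer(tf.zeros([2, 5])).shape)  # (2, 5): the bias broadcasts over the batch dimension
```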
""" @@ -1329,23 +1422,23 @@ def set_bias(self, value): @unpack_inputs def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - decoder_input_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - decoder_attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - decoder_position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - decoder_head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - cross_attn_head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + decoder_input_ids: np.ndarray | tf.Tensor | None = None, + decoder_attention_mask: np.ndarray | tf.Tensor | None = None, + decoder_position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + decoder_head_mask: np.ndarray | tf.Tensor | None = None, + cross_attn_head_mask: np.ndarray | tf.Tensor | None = None, encoder_outputs: Optional[TFBaseModelOutput] = None, past_key_values: Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, - decoder_inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, + decoder_inputs_embeds: np.ndarray | tf.Tensor | None = None, use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - labels: Optional[tf.Tensor] = None, + labels: tf.Tensor | None = None, training: Optional[bool] = False, ) -> Union[TFSeq2SeqLMOutput, Tuple[tf.Tensor]]: r""" @@ -1468,6 +1561,17 @@ def prepare_inputs_for_generation( def prepare_decoder_input_ids_from_labels(self, labels: tf.Tensor): return shift_tokens_right(labels, self.config.pad_token_id, self.config.decoder_start_token_id) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "model", None) is not None: + with tf.name_scope(self.model.name): + self.model.build(None) + if getattr(self, "bias_layer", None) is not None: + with tf.name_scope(self.bias_layer.name): + self.bias_layer.build(None) + @add_start_docstrings( """ @@ -1477,16 +1581,6 @@ def prepare_decoder_input_ids_from_labels(self, labels: tf.Tensor): BART_START_DOCSTRING, ) class TFBartForSequenceClassification(TFBartPretrainedModel, TFSequenceClassificationLoss): - @property - def dummy_inputs(self): - pad_token = self.config.pad_token_id - input_ids = tf.constant([[0, 6, 10, 4, 2], [0, 8, 12, 2, pad_token]]) - dummy_inputs = { - "attention_mask": tf.cast(tf.math.not_equal(input_ids, (pad_token)), dtype=tf.int32), - "input_ids": input_ids, - } - return dummy_inputs - def __init__(self, config: BartConfig, load_weight_prefix=None, *inputs, **kwargs): super().__init__(config, *inputs, **kwargs) self.model = TFBartMainLayer(config, load_weight_prefix=load_weight_prefix, name="model") @@ -1499,23 +1593,23 @@ def __init__(self, config: BartConfig, load_weight_prefix=None, *inputs, **kwarg @unpack_inputs def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - decoder_input_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - decoder_attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - decoder_position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: 
Optional[Union[np.ndarray, tf.Tensor]] = None, - decoder_head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - cross_attn_head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + decoder_input_ids: np.ndarray | tf.Tensor | None = None, + decoder_attention_mask: np.ndarray | tf.Tensor | None = None, + decoder_position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + decoder_head_mask: np.ndarray | tf.Tensor | None = None, + cross_attn_head_mask: np.ndarray | tf.Tensor | None = None, encoder_outputs: Optional[TFBaseModelOutput] = None, past_key_values: Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, - decoder_inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, + decoder_inputs_embeds: np.ndarray | tf.Tensor | None = None, use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - labels: Optional[tf.Tensor] = None, + labels: tf.Tensor | None = None, training: Optional[bool] = False, ) -> Union[TFSeq2SeqSequenceClassifierOutput, Tuple[tf.Tensor]]: r""" @@ -1605,3 +1699,14 @@ def serving_output(self, output): encoder_hidden_states=enc_hs, encoder_attentions=enc_attns, ) + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "model", None) is not None: + with tf.name_scope(self.model.name): + self.model.build(None) + if getattr(self, "classification_head", None) is not None: + with tf.name_scope(self.classification_head.name): + self.classification_head.build(None) diff --git a/src/transformers/models/bart/tokenization_bart.py b/src/transformers/models/bart/tokenization_bart.py index 5dc93578109f8f..b21e81000f2daf 100644 --- a/src/transformers/models/bart/tokenization_bart.py +++ b/src/transformers/models/bart/tokenization_bart.py @@ -105,12 +105,14 @@ class BartTokenizer(PreTrainedTokenizer): This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will be encoded differently whether it is at the beginning of the sentence (without space) or not: - ``` + ```python >>> from transformers import BartTokenizer + >>> tokenizer = BartTokenizer.from_pretrained("facebook/bart-base") - >>> tokenizer("Hello world")['input_ids'] + >>> tokenizer("Hello world")["input_ids"] [0, 31414, 232, 2] - >>> tokenizer(" Hello world")['input_ids'] + + >>> tokenizer(" Hello world")["input_ids"] [0, 20920, 232, 2] ``` @@ -204,19 +206,6 @@ def __init__( # Mask token behave like a normal word, i.e. 
include the space before it mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token - super().__init__( - errors=errors, - bos_token=bos_token, - eos_token=eos_token, - unk_token=unk_token, - sep_token=sep_token, - cls_token=cls_token, - pad_token=pad_token, - mask_token=mask_token, - add_prefix_space=add_prefix_space, - **kwargs, - ) - with open(vocab_file, encoding="utf-8") as vocab_handle: self.encoder = json.load(vocab_handle) self.decoder = {v: k for k, v in self.encoder.items()} @@ -233,6 +222,19 @@ def __init__( # Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""") + super().__init__( + errors=errors, + bos_token=bos_token, + eos_token=eos_token, + unk_token=unk_token, + sep_token=sep_token, + cls_token=cls_token, + pad_token=pad_token, + mask_token=mask_token, + add_prefix_space=add_prefix_space, + **kwargs, + ) + @property def vocab_size(self): return len(self.encoder) diff --git a/src/transformers/models/bart/tokenization_bart_fast.py b/src/transformers/models/bart/tokenization_bart_fast.py index 6d6e29986be4fa..850c9636833aa2 100644 --- a/src/transformers/models/bart/tokenization_bart_fast.py +++ b/src/transformers/models/bart/tokenization_bart_fast.py @@ -75,12 +75,14 @@ class BartTokenizerFast(PreTrainedTokenizerFast): This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will be encoded differently whether it is at the beginning of the sentence (without space) or not: - ``` + ```python >>> from transformers import BartTokenizerFast + >>> tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base") - >>> tokenizer("Hello world")['input_ids'] + >>> tokenizer("Hello world")["input_ids"] [0, 31414, 232, 2] - >>> tokenizer(" Hello world")['input_ids'] + + >>> tokenizer(" Hello world")["input_ids"] [0, 20920, 232, 2] ``` @@ -145,6 +147,7 @@ class BartTokenizerFast(PreTrainedTokenizerFast): trim_offsets (`bool`, *optional*, defaults to `True`): Whether the post processing step should trim offsets to avoid including whitespaces. """ + vocab_files_names = VOCAB_FILES_NAMES pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES @@ -168,6 +171,12 @@ def __init__( trim_offsets=True, **kwargs, ): + # we have to specify that this tokens is special otherwise adding it will reset the normalized flag to `False` in `add_special_tokens` + mask_token = ( + AddedToken(mask_token, lstrip=True, normalized=True, special=True) + if isinstance(mask_token, str) + else mask_token + ) super().__init__( vocab_file, merges_file, diff --git a/src/transformers/models/barthez/tokenization_barthez.py b/src/transformers/models/barthez/tokenization_barthez.py index 77ab8a9d64166b..f6ea253402f69a 100644 --- a/src/transformers/models/barthez/tokenization_barthez.py +++ b/src/transformers/models/barthez/tokenization_barthez.py @@ -47,6 +47,8 @@ SPIECE_UNDERLINE = "▁" +# TODO this class is useless. This is the most standard sentencpiece model. Let's find which one is closest and nuke this. + class BarthezTokenizer(PreTrainedTokenizer): """ @@ -95,8 +97,6 @@ class BarthezTokenizer(PreTrainedTokenizer): mask_token (`str`, *optional*, defaults to `""`): The token used for masking values. This is the token used when training this model with masked language modeling. 
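Several tokenizer hunks in this patch move `super().__init__(...)` from the top of `__init__` to the end, after the vocabulary state (`self.encoder`, `self.sp_model`, ...) has been set up, presumably because the base constructor now consults that state (for example when registering added tokens). The toy classes below are invented for the example and only show why the ordering matters:

```python
class BaseTokenizer:
    def __init__(self, **kwargs):
        # The base class uses the vocabulary right away, e.g. to register special tokens.
        self.vocab_size_at_init = len(self.get_vocab())

    def get_vocab(self):
        raise NotImplementedError


class ToyTokenizer(BaseTokenizer):
    def __init__(self, vocab):
        self.encoder = dict(vocab)  # state needed by get_vocab()...
        super().__init__()          # ...so it must exist before the base constructor runs

    def get_vocab(self):
        return self.encoder


tok = ToyTokenizer({"hello": 0, "world": 1})
print(tok.vocab_size_at_init)  # 2
```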
This is the token which the model will try to predict. - additional_special_tokens (`List[str]`, *optional*, defaults to `["<s>NOTUSED", "</s>NOTUSED"]`): - Additional special tokens used by the tokenizer. sp_model_kwargs (`dict`, *optional*): Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things, @@ -136,11 +136,14 @@ def __init__( sp_model_kwargs: Optional[Dict[str, Any]] = None, **kwargs, ) -> None: - # Mask token behave like a normal word, i.e. include the space before it - mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token + # Mask token behave like a normal word, i.e. include the space before it. Will have normalized=False by default this way + mask_token = AddedToken(mask_token, lstrip=True, special=True) if isinstance(mask_token, str) else mask_token self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs + self.vocab_file = vocab_file + self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs) + self.sp_model.Load(str(vocab_file)) super().__init__( bos_token=bos_token, eos_token=eos_token, @@ -153,15 +156,6 @@ def __init__( **kwargs, ) - self.vocab_file = vocab_file - self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs) - self.sp_model.Load(str(vocab_file)) - - self.fairseq_tokens_to_ids = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3} - - self.fairseq_tokens_to_ids["<mask>"] = len(self.sp_model) - 1 - self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()} - def build_inputs_with_special_tokens( self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None ) -> List[int]: @@ -251,18 +245,13 @@ def _tokenize(self, text: str) -> List[str]: def _convert_token_to_id(self, token): """Converts a token (str) in an id using the vocab.""" - if token in self.fairseq_tokens_to_ids: - return self.fairseq_tokens_to_ids[token] - spm_id = self.sp_model.PieceToId(token) - - return spm_id if spm_id else self.unk_token_id + return self.sp_model.PieceToId(token) def _convert_id_to_token(self, index): """Converts an index (integer) in a token (str) using the vocab.""" - if index in self.fairseq_ids_to_tokens: - return self.fairseq_ids_to_tokens[index] return self.sp_model.IdToPiece(index) + # Copied from transformers.models.albert.tokenization_albert.AlbertTokenizer.convert_tokens_to_string def convert_tokens_to_string(self, tokens): """Converts a sequence of tokens (string) in a single string.""" current_sub_tokens = [] diff --git a/src/transformers/models/barthez/tokenization_barthez_fast.py b/src/transformers/models/barthez/tokenization_barthez_fast.py index f53a5acd712c71..fb4a114b43bf62 100644 --- a/src/transformers/models/barthez/tokenization_barthez_fast.py +++ b/src/transformers/models/barthez/tokenization_barthez_fast.py @@ -146,7 +146,10 @@ def __init__( ) self.vocab_file = vocab_file - self.can_save_slow_tokenizer = False if not self.vocab_file else True + + @property + def can_save_slow_tokenizer(self) -> bool: + return os.path.isfile(self.vocab_file) if self.vocab_file else False def build_inputs_with_special_tokens( self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None diff --git a/src/transformers/models/bartpho/tokenization_bartpho.py b/src/transformers/models/bartpho/tokenization_bartpho.py index 1c1ef0b8675b8a..6b9dc266b29ff4 100644 --- a/src/transformers/models/bartpho/tokenization_bartpho.py +++
b/src/transformers/models/bartpho/tokenization_bartpho.py @@ -92,8 +92,6 @@ class BartphoTokenizer(PreTrainedTokenizer): mask_token (`str`, *optional*, defaults to `"<mask>"`): The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict. - additional_special_tokens (`List[str]`, *optional*, defaults to `["<s>NOTUSED", "</s>NOTUSED"]`): - Additional special tokens used by the tokenizer. sp_model_kwargs (`dict`, *optional*): Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things, @@ -139,18 +137,6 @@ def __init__( self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs - super().__init__( - bos_token=bos_token, - eos_token=eos_token, - unk_token=unk_token, - sep_token=sep_token, - cls_token=cls_token, - pad_token=pad_token, - mask_token=mask_token, - sp_model_kwargs=self.sp_model_kwargs, - **kwargs, - ) - self.vocab_file = vocab_file self.monolingual_vocab_file = monolingual_vocab_file self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs) @@ -174,6 +160,18 @@ def __init__( self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()} + super().__init__( + bos_token=bos_token, + eos_token=eos_token, + unk_token=unk_token, + sep_token=sep_token, + cls_token=cls_token, + pad_token=pad_token, + mask_token=mask_token, + sp_model_kwargs=self.sp_model_kwargs, + **kwargs, + ) + def __getstate__(self): state = self.__dict__.copy() state["sp_model"] = None diff --git a/src/transformers/models/beit/__init__.py b/src/transformers/models/beit/__init__.py index 4b631ac1c36aae..ce399f92e0fa4d 100644 --- a/src/transformers/models/beit/__init__.py +++ b/src/transformers/models/beit/__init__.py @@ -47,6 +47,7 @@ "BeitForSemanticSegmentation", "BeitModel", "BeitPreTrainedModel", + "BeitBackbone", ] @@ -83,6 +84,7 @@ else: from .modeling_beit import ( BEIT_PRETRAINED_MODEL_ARCHIVE_LIST, + BeitBackbone, BeitForImageClassification, BeitForMaskedImageModeling, BeitForSemanticSegmentation, diff --git a/src/transformers/models/beit/configuration_beit.py b/src/transformers/models/beit/configuration_beit.py index ef7bf22b9189cf..b579eeea37c480 100644 --- a/src/transformers/models/beit/configuration_beit.py +++ b/src/transformers/models/beit/configuration_beit.py @@ -21,6 +21,7 @@ from ...configuration_utils import PretrainedConfig from ...onnx import OnnxConfig from ...utils import logging +from ...utils.backbone_utils import BackboneConfigMixin, get_aligned_output_features_output_indices logger = logging.get_logger(__name__) @@ -33,7 +34,7 @@ } -class BeitConfig(PretrainedConfig): +class BeitConfig(BackboneConfigMixin, PretrainedConfig): r""" This is the configuration class to store the configuration of a [`BeitModel`]. It is used to instantiate an BEiT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the @@ -41,7 +42,7 @@ class BeitConfig(PretrainedConfig): [microsoft/beit-base-patch16-224-pt22k](https://huggingface.co/microsoft/beit-base-patch16-224-pt22k) architecture. Args: - vocab_size (`int`, *optional*, defaults to 8092): + vocab_size (`int`, *optional*, defaults to 8192): Vocabulary size of the BEiT model. Defines the number of different image tokens that can be used during pre-training.
hidden_size (`int`, *optional*, defaults to 768): @@ -84,8 +85,6 @@ class BeitConfig(PretrainedConfig): use_mean_pooling (`bool`, *optional*, defaults to `True`): Whether to mean pool the final hidden states of the patches instead of using the final hidden state of the CLS token, before applying the classification head. - out_indices (`List[int]`, *optional*, defaults to `[3, 5, 7, 11]`): - Indices of the feature maps to use for semantic segmentation. pool_scales (`Tuple[int]`, *optional*, defaults to `[1, 2, 3, 6]`): Pooling scales used in Pooling Pyramid Module applied on the last feature map. use_auxiliary_head (`bool`, *optional*, defaults to `True`): @@ -100,6 +99,22 @@ class BeitConfig(PretrainedConfig): Whether to concatenate the output of the auxiliary head with the input before the classification layer. semantic_loss_ignore_index (`int`, *optional*, defaults to 255): The index that is ignored by the loss function of the semantic segmentation model. + out_features (`List[str]`, *optional*): + If used as backbone, list of features to output. Can be any of `"stem"`, `"stage1"`, `"stage2"`, etc. + (depending on how many stages the model has). If unset and `out_indices` is set, will default to the + corresponding stages. If unset and `out_indices` is unset, will default to the last stage. Must be in the + same order as defined in the `stage_names` attribute. + out_indices (`List[int]`, *optional*): + If used as backbone, list of indices of features to output. Can be any of 0, 1, 2, etc. (depending on how + many stages the model has). If unset and `out_features` is set, will default to the corresponding stages. + If unset and `out_features` is unset, will default to the last stage. Must be in the + same order as defined in the `stage_names` attribute. + add_fpn (`bool`, *optional*, defaults to `False`): + Whether to add a FPN as part of the backbone. Only relevant for [`BeitBackbone`]. + reshape_hidden_states (`bool`, *optional*, defaults to `True`): + Whether to reshape the feature maps to 4D tensors of shape `(batch_size, hidden_size, height, width)` in + case the model is used as backbone. If `False`, the feature maps will be 3D tensors of shape `(batch_size, + seq_len, hidden_size)`. Only relevant for [`BeitBackbone`]. 
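The new `out_features` / `out_indices` arguments documented above are reconciled by `get_aligned_output_features_output_indices`, whose implementation is not part of this diff. The sketch below uses an invented name (`align_output_features_indices`) and only mirrors the behaviour the docstring describes: each argument defaults to the other, and the last stage is used when both are unset.

```python
from typing import List, Optional, Tuple


def align_output_features_indices(
    out_features: Optional[List[str]],
    out_indices: Optional[List[int]],
    stage_names: List[str],
) -> Tuple[List[str], List[int]]:
    if out_features is None and out_indices is None:
        return [stage_names[-1]], [len(stage_names) - 1]
    if out_features is None:
        return [stage_names[i] for i in out_indices], list(out_indices)
    if out_indices is None:
        return list(out_features), [stage_names.index(name) for name in out_features]
    return list(out_features), list(out_indices)


stage_names = ["stem"] + [f"stage{i}" for i in range(1, 13)]  # e.g. a 12-layer BEiT
print(align_output_features_indices(None, [3, 5, 7, 11], stage_names))
# (['stage3', 'stage5', 'stage7', 'stage11'], [3, 5, 7, 11])
```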
Example: @@ -115,6 +130,7 @@ class BeitConfig(PretrainedConfig): >>> # Accessing the model configuration >>> configuration = model.config ```""" + model_type = "beit" def __init__( @@ -139,7 +155,6 @@ def __init__( layer_scale_init_value=0.1, drop_path_rate=0.1, use_mean_pooling=True, - out_indices=[3, 5, 7, 11], pool_scales=[1, 2, 3, 6], use_auxiliary_head=True, auxiliary_loss_weight=0.4, @@ -147,6 +162,10 @@ def __init__( auxiliary_num_convs=1, auxiliary_concat_input=False, semantic_loss_ignore_index=255, + out_features=None, + out_indices=None, + add_fpn=False, + reshape_hidden_states=True, **kwargs, ): super().__init__(**kwargs) @@ -173,7 +192,6 @@ def __init__( self.drop_path_rate = drop_path_rate self.use_mean_pooling = use_mean_pooling # decode head attributes (semantic segmentation) - self.out_indices = out_indices self.pool_scales = pool_scales # auxiliary head attributes (semantic segmentation) self.use_auxiliary_head = use_auxiliary_head @@ -183,6 +201,22 @@ def __init__( self.auxiliary_concat_input = auxiliary_concat_input self.semantic_loss_ignore_index = semantic_loss_ignore_index + # handle backwards compatibility + if "segmentation_indices" in kwargs: + logger.warning( + "The `segmentation_indices` argument is deprecated and will be removed in a future version, use `out_indices` instead.", + FutureWarning, + ) + out_indices = kwargs.pop("segmentation_indices") + + # backbone attributes + self.stage_names = ["stem"] + [f"stage{idx}" for idx in range(1, self.num_hidden_layers + 1)] + self._out_features, self._out_indices = get_aligned_output_features_output_indices( + out_features=out_features, out_indices=out_indices, stage_names=self.stage_names + ) + self.add_fpn = add_fpn + self.reshape_hidden_states = reshape_hidden_states + # Copied from transformers.models.vit.configuration_vit.ViTOnnxConfig class BeitOnnxConfig(OnnxConfig): diff --git a/src/transformers/models/beit/convert_beit_unilm_to_pytorch.py b/src/transformers/models/beit/convert_beit_unilm_to_pytorch.py index f80d52db7d8893..757113c8a60fcc 100644 --- a/src/transformers/models/beit/convert_beit_unilm_to_pytorch.py +++ b/src/transformers/models/beit/convert_beit_unilm_to_pytorch.py @@ -27,10 +27,10 @@ from transformers import ( BeitConfig, - BeitFeatureExtractor, BeitForImageClassification, BeitForMaskedImageModeling, BeitForSemanticSegmentation, + BeitImageProcessor, ) from transformers.image_utils import PILImageResampling from transformers.utils import logging @@ -266,16 +266,16 @@ def convert_beit_checkpoint(checkpoint_url, pytorch_dump_folder_path): # Check outputs on an image if is_semantic: - feature_extractor = BeitFeatureExtractor(size=config.image_size, do_center_crop=False) + image_processor = BeitImageProcessor(size=config.image_size, do_center_crop=False) ds = load_dataset("hf-internal-testing/fixtures_ade20k", split="test") image = Image.open(ds[0]["file"]) else: - feature_extractor = BeitFeatureExtractor( + image_processor = BeitImageProcessor( size=config.image_size, resample=PILImageResampling.BILINEAR, do_center_crop=False ) image = prepare_img() - encoding = feature_extractor(images=image, return_tensors="pt") + encoding = image_processor(images=image, return_tensors="pt") pixel_values = encoding["pixel_values"] outputs = model(pixel_values) @@ -337,24 +337,25 @@ def convert_beit_checkpoint(checkpoint_url, pytorch_dump_folder_path): else: raise ValueError("Can't verify logits as model is not supported") - assert logits.shape == expected_shape, "Shape of logits not as expected" + if logits.shape 
!= expected_shape: + raise ValueError(f"Shape of logits not as expected. {logits.shape=}, {expected_shape=}") if not has_lm_head: if is_semantic: - assert torch.allclose( - logits[0, :3, :3, :3], expected_logits, atol=1e-3 - ), "First elements of logits not as expected" + if not torch.allclose(logits[0, :3, :3, :3], expected_logits, atol=1e-3): + raise ValueError("First elements of logits not as expected") else: print("Predicted class idx:", logits.argmax(-1).item()) - assert torch.allclose( - logits[0, :3], expected_logits, atol=1e-3 - ), "First elements of logits not as expected" - assert logits.argmax(-1).item() == expected_class_idx, "Predicted class index not as expected" + + if not torch.allclose(logits[0, :3], expected_logits, atol=1e-3): + raise ValueError("First elements of logits not as expected") + if logits.argmax(-1).item() != expected_class_idx: + raise ValueError("Predicted class index not as expected") Path(pytorch_dump_folder_path).mkdir(exist_ok=True) print(f"Saving model to {pytorch_dump_folder_path}") model.save_pretrained(pytorch_dump_folder_path) - print(f"Saving feature extractor to {pytorch_dump_folder_path}") - feature_extractor.save_pretrained(pytorch_dump_folder_path) + print(f"Saving image processor to {pytorch_dump_folder_path}") + image_processor.save_pretrained(pytorch_dump_folder_path) if __name__ == "__main__": diff --git a/src/transformers/models/beit/image_processing_beit.py b/src/transformers/models/beit/image_processing_beit.py index 6f73679a71baf6..52c1a813f6091a 100644 --- a/src/transformers/models/beit/image_processing_beit.py +++ b/src/transformers/models/beit/image_processing_beit.py @@ -19,22 +19,22 @@ import numpy as np -from transformers.utils import is_torch_available, is_torch_tensor, is_vision_available -from transformers.utils.generic import TensorType - from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict -from ...image_transforms import center_crop, normalize, rescale, resize, to_channel_dimension_format +from ...image_transforms import resize, to_channel_dimension_format from ...image_utils import ( IMAGENET_STANDARD_MEAN, IMAGENET_STANDARD_STD, ChannelDimension, ImageInput, PILImageResampling, + infer_channel_dimension_format, + is_scaled_image, make_list_of_images, to_numpy_array, valid_images, + validate_preprocess_arguments, ) -from ...utils import logging +from ...utils import TensorType, is_torch_available, is_torch_tensor, is_vision_available, logging if is_vision_available(): @@ -58,7 +58,7 @@ class BeitImageProcessor(BaseImageProcessor): size (`Dict[str, int]` *optional*, defaults to `{"height": 256, "width": 256}`): Size of the output image after resizing. Can be overridden by the `size` parameter in the `preprocess` method. - resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`): + resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`): Resampling filter to use if resizing the image. Can be overridden by the `resample` parameter in the `preprocess` method. do_center_crop (`bool`, *optional*, defaults to `True`): @@ -68,12 +68,12 @@ class BeitImageProcessor(BaseImageProcessor): crop_size (`Dict[str, int]`, *optional*, defaults to `{"height": 224, "width": 224}`): Desired output size when applying center-cropping. Only has an effect if `do_center_crop` is set to `True`. Can be overridden by the `crop_size` parameter in the `preprocess` method. 
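The conversion-script hunk above swaps bare `assert` statements for explicit `raise ValueError(...)`, so the checks still run under `python -O` and produce informative messages. A compact restatement of the same pattern, with made-up tensors:

```python
import torch


def verify_logits(logits: torch.Tensor, expected_shape: torch.Size, expected_logits: torch.Tensor) -> None:
    if logits.shape != expected_shape:
        raise ValueError(f"Shape of logits not as expected. {logits.shape=}, {expected_shape=}")
    if not torch.allclose(logits[0, :3], expected_logits, atol=1e-3):
        raise ValueError("First elements of logits not as expected")


logits = torch.tensor([[0.1, 0.2, 0.3, 0.4]])
verify_logits(logits, torch.Size([1, 4]), torch.tensor([0.1, 0.2, 0.3]))  # passes silently
```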
- do_rescale (`bool`, *optional*, defaults to `True`): - Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the `do_rescale` - parameter in the `preprocess` method. rescale_factor (`int` or `float`, *optional*, defaults to `1/255`): Scale factor to use if rescaling the image. Can be overridden by the `rescale_factor` parameter in the `preprocess` method. + do_rescale (`bool`, *optional*, defaults to `True`): + Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the `do_rescale` + parameter in the `preprocess` method. do_normalize (`bool`, *optional*, defaults to `True`): Whether to normalize the image. Can be overridden by the `do_normalize` parameter in the `preprocess` method. @@ -131,15 +131,6 @@ def __init__( self.image_std = image_std if image_std is not None else IMAGENET_STANDARD_STD self.do_reduce_labels = do_reduce_labels - @property - def reduce_labels(self) -> bool: - warnings.warn( - "The `reduce_labels` property is deprecated and will be removed in v4.27. Please use" - " `do_reduce_labels` instead.", - FutureWarning, - ) - return self.do_reduce_labels - @classmethod def from_dict(cls, image_processor_dict: Dict[str, Any], **kwargs): """ @@ -157,6 +148,7 @@ def resize( size: Dict[str, int], resample: PILImageResampling = PILImageResampling.BICUBIC, data_format: Optional[Union[str, ChannelDimension]] = None, + input_data_format: Optional[Union[str, ChannelDimension]] = None, **kwargs, ) -> np.ndarray: """ @@ -171,79 +163,21 @@ def resize( Resampling filter to use when resiizing the image. data_format (`str` or `ChannelDimension`, *optional*): The channel dimension format of the image. If not provided, it will be the same as the input image. + input_data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the input image. If not provided, it will be inferred. """ size = get_size_dict(size, default_to_square=True, param_name="size") if "height" not in size or "width" not in size: raise ValueError(f"The `size` argument must contain `height` and `width` keys. Got {size.keys()}") return resize( - image, size=(size["height"], size["width"]), resample=resample, data_format=data_format, **kwargs + image, + size=(size["height"], size["width"]), + resample=resample, + data_format=data_format, + input_data_format=input_data_format, + **kwargs, ) - def center_crop( - self, - image: np.ndarray, - size: Dict[str, int], - data_format: Optional[Union[str, ChannelDimension]] = None, - **kwargs, - ) -> np.ndarray: - """ - Center crop an image to (size["height"], size["width"]). If the input size is smaller than `size` along any - edge, the image is padded with 0's and then center cropped. - - Args: - image (`np.ndarray`): - Image to center crop. - size (`Dict[str, int]`): - Size of the output image. - data_format (`str` or `ChannelDimension`, *optional*): - The channel dimension format of the image. If not provided, it will be the same as the input image. - """ - size = get_size_dict(size, default_to_square=True, param_name="size") - return center_crop(image, size=(size["height"], size["width"]), data_format=data_format, **kwargs) - - def rescale( - self, - image: np.ndarray, - scale: Union[int, float], - data_format: Optional[Union[str, ChannelDimension]] = None, - **kwargs, - ): - """ - Rescale an image by a scale factor. image = image * scale. - - Args: - image (`np.ndarray`): - Image to rescale. - scale (`int` or `float`): - Scale to apply to the image. 
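The deleted `center_crop`/`rescale`/`normalize` overrides now come from `BaseImageProcessor`, and the remaining methods grow an `input_data_format` argument. The sketch below, assuming the helper is importable from `transformers.image_utils` as in recent releases, shows the channel-layout inference that runs when `input_data_format` is left unset:

```python
import numpy as np

from transformers.image_utils import ChannelDimension, infer_channel_dimension_format

# The same hypothetical 224x224 RGB image in the two supported layouts.
channels_first = np.zeros((3, 224, 224), dtype=np.uint8)
channels_last = np.zeros((224, 224, 3), dtype=np.uint8)

print(infer_channel_dimension_format(channels_first) == ChannelDimension.FIRST)  # True
print(infer_channel_dimension_format(channels_last) == ChannelDimension.LAST)    # True

# For ambiguous shapes (e.g. a tiny 3x3 image) inference can guess wrong, which is
# why the processors now also accept an explicit `input_data_format`.
```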
- data_format (`str` or `ChannelDimension`, *optional*): - The channel dimension format of the image. If not provided, it will be the same as the input image. - """ - return rescale(image, scale=scale, data_format=data_format, **kwargs) - - def normalize( - self, - image: np.ndarray, - mean: Union[float, List[float]], - std: Union[float, List[float]], - data_format: Optional[Union[str, ChannelDimension]] = None, - **kwargs, - ) -> np.ndarray: - """ - Normalize an image. image = (image - image_mean) / image_std. - - Args: - image (`np.ndarray`): - Image to normalize. - image_mean (`float` or `List[float]`): - Image mean. - image_std (`float` or `List[float]`): - Image standard deviation. - data_format (`str` or `ChannelDimension`, *optional*): - The channel dimension format of the image. If not provided, it will be the same as the input image. - """ - return normalize(image, mean=mean, std=std, data_format=data_format, **kwargs) - def reduce_label(self, label: ImageInput) -> np.ndarray: label = to_numpy_array(label) # Avoid using underflow conversion @@ -266,21 +200,22 @@ def _preprocess( do_normalize: bool = None, image_mean: Optional[Union[float, List[float]]] = None, image_std: Optional[Union[float, List[float]]] = None, + input_data_format: Optional[Union[str, ChannelDimension]] = None, ): if do_reduce_labels: image = self.reduce_label(image) if do_resize: - image = self.resize(image=image, size=size, resample=resample) + image = self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format) if do_center_crop: - image = self.center_crop(image=image, size=crop_size) + image = self.center_crop(image=image, size=crop_size, input_data_format=input_data_format) if do_rescale: - image = self.rescale(image=image, scale=rescale_factor) + image = self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format) if do_normalize: - image = self.normalize(image=image, mean=image_mean, std=image_std) + image = self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format) return image @@ -298,10 +233,18 @@ def _preprocess_image( image_mean: Optional[Union[float, List[float]]] = None, image_std: Optional[Union[float, List[float]]] = None, data_format: Optional[Union[str, ChannelDimension]] = None, + input_data_format: Optional[Union[str, ChannelDimension]] = None, ) -> np.ndarray: """Preprocesses a single image.""" # All transformations expect numpy arrays. image = to_numpy_array(image) + if is_scaled_image(image) and do_rescale: + logger.warning_once( + "It looks like you are trying to rescale already rescaled images. If the input" + " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again." 
+ ) + if input_data_format is None: + input_data_format = infer_channel_dimension_format(image) image = self._preprocess( image, do_reduce_labels=False, @@ -315,9 +258,10 @@ def _preprocess_image( do_normalize=do_normalize, image_mean=image_mean, image_std=image_std, + input_data_format=input_data_format, ) if data_format is not None: - image = to_channel_dimension_format(image, data_format) + image = to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) return image def _preprocess_segmentation_map( @@ -329,6 +273,7 @@ def _preprocess_segmentation_map( do_center_crop: bool = None, crop_size: Dict[str, int] = None, do_reduce_labels: bool = None, + input_data_format: Optional[Union[str, ChannelDimension]] = None, ): """Preprocesses a single segmentation map.""" # All transformations expect numpy arrays. @@ -337,8 +282,11 @@ def _preprocess_segmentation_map( if segmentation_map.ndim == 2: segmentation_map = segmentation_map[None, ...] added_dimension = True + input_data_format = ChannelDimension.FIRST else: added_dimension = False + if input_data_format is None: + input_data_format = infer_channel_dimension_format(segmentation_map, num_channels=1) segmentation_map = self._preprocess( image=segmentation_map, do_reduce_labels=do_reduce_labels, @@ -349,6 +297,7 @@ def _preprocess_segmentation_map( crop_size=crop_size, do_normalize=False, do_rescale=False, + input_data_format=ChannelDimension.FIRST, ) # Remove extra axis if added if added_dimension: @@ -378,6 +327,7 @@ def preprocess( do_reduce_labels: Optional[bool] = None, return_tensors: Optional[Union[str, TensorType]] = None, data_format: ChannelDimension = ChannelDimension.FIRST, + input_data_format: Optional[Union[str, ChannelDimension]] = None, **kwargs, ) -> PIL.Image.Image: """ @@ -385,7 +335,8 @@ def preprocess( Args: images (`ImageInput`): - Image to preprocess. + Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If + passing in images with pixel values between 0 and 1, set `do_rescale=False`. do_resize (`bool`, *optional*, defaults to `self.do_resize`): Whether to resize the image. size (`Dict[str, int]`, *optional*, defaults to `self.size`): @@ -421,8 +372,15 @@ def preprocess( - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`. data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`): The channel dimension format for the output image. Can be one of: - - `ChannelDimension.FIRST`: image in (num_channels, height, width) format. - - `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format. + - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - Unset: Use the channel dimension format of the input image. + input_data_format (`ChannelDimension` or `str`, *optional*): + The channel dimension format for the input image. If unset, the channel dimension format is inferred + from the input image. Can be one of: + - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format. + - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - `"none"` or `ChannelDimension.NONE`: image in (height, width) format. 
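With `is_scaled_image` guarding against double rescaling and `input_data_format` threaded through every transform, a caller can be explicit about both. A minimal usage sketch, assuming PyTorch is installed for `return_tensors="pt"`:

```python
import numpy as np

from transformers import BeitImageProcessor

image_processor = BeitImageProcessor()

# A hypothetical image that is already scaled to [0, 1] and stored channels-last.
image = np.random.rand(480, 640, 3).astype(np.float32)

# `do_rescale=False` skips a second rescale (and the warning above);
# `input_data_format` bypasses channel-dimension inference entirely.
encoding = image_processor(
    images=image,
    do_rescale=False,
    input_data_format="channels_last",
    return_tensors="pt",
)
print(encoding["pixel_values"].shape)  # torch.Size([1, 3, 224, 224]) with the default size/crop
```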
""" do_resize = do_resize if do_resize is not None else self.do_resize size = size if size is not None else self.size @@ -439,32 +397,33 @@ def preprocess( do_reduce_labels = do_reduce_labels if do_reduce_labels is not None else self.do_reduce_labels images = make_list_of_images(images) + if segmentation_maps is not None: segmentation_maps = make_list_of_images(segmentation_maps, expected_ndims=2) - if not valid_images(images): + if segmentation_maps is not None and not valid_images(segmentation_maps): raise ValueError( - "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, " + "Invalid segmentation_maps type. Must be of type PIL.Image.Image, numpy.ndarray, " "torch.Tensor, tf.Tensor or jax.ndarray." ) - - if segmentation_maps is not None and not valid_images(segmentation_maps): + if not valid_images(images): raise ValueError( - "Invalid segmentation map type. Must be of type PIL.Image.Image, numpy.ndarray, " + "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, " "torch.Tensor, tf.Tensor or jax.ndarray." ) - if do_resize and size is None or resample is None: - raise ValueError("Size and resample must be specified if do_resize is True.") - - if do_center_crop and crop_size is None: - raise ValueError("Crop size must be specified if do_center_crop is True.") - - if do_rescale and rescale_factor is None: - raise ValueError("Rescale factor must be specified if do_rescale is True.") - - if do_normalize and (image_mean is None or image_std is None): - raise ValueError("Image mean and std must be specified if do_normalize is True.") + validate_preprocess_arguments( + do_rescale=do_rescale, + rescale_factor=rescale_factor, + do_normalize=do_normalize, + image_mean=image_mean, + image_std=image_std, + do_center_crop=do_center_crop, + crop_size=crop_size, + do_resize=do_resize, + size=size, + resample=resample, + ) images = [ self._preprocess_image( @@ -480,6 +439,7 @@ def preprocess( image_mean=image_mean, image_std=image_std, data_format=data_format, + input_data_format=input_data_format, ) for img in images ] @@ -511,8 +471,9 @@ def post_process_semantic_segmentation(self, outputs, target_sizes: List[Tuple] outputs ([`BeitForSemanticSegmentation`]): Raw outputs of the model. target_sizes (`List[Tuple]` of length `batch_size`, *optional*): - List of tuples corresponding to the requested final size (height, width) of each prediction. If left to - None, predictions will not be resized. + List of tuples corresponding to the requested final size (height, width) of each prediction. If unset, + predictions will not be resized. 
+ Returns: semantic_segmentation: `List[torch.Tensor]` of length `batch_size`, where each item is a semantic segmentation map of shape (height, width) corresponding to the target_sizes entry (if `target_sizes` is diff --git a/src/transformers/models/beit/modeling_beit.py b/src/transformers/models/beit/modeling_beit.py index 4caf0f478fb6db..da4721656c0285 100755 --- a/src/transformers/models/beit/modeling_beit.py +++ b/src/transformers/models/beit/modeling_beit.py @@ -22,11 +22,12 @@ import torch import torch.utils.checkpoint -from torch import nn +from torch import Tensor, nn from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss from ...activations import ACT2FN from ...modeling_outputs import ( + BackboneOutput, BaseModelOutput, BaseModelOutputWithPooling, ImageClassifierOutput, @@ -42,6 +43,7 @@ logging, replace_return_docstrings, ) +from ...utils.backbone_utils import BackboneMixin from .configuration_beit import BeitConfig @@ -149,22 +151,26 @@ def __init__(self, config: BeitConfig) -> None: self.dropout = nn.Dropout(config.hidden_dropout_prob) def forward(self, pixel_values: torch.Tensor, bool_masked_pos: Optional[torch.BoolTensor] = None) -> torch.Tensor: - embeddings = self.patch_embeddings(pixel_values) + embeddings, (patch_height, patch_width) = self.patch_embeddings( + pixel_values, self.position_embeddings[:, 1:, :] if self.position_embeddings is not None else None + ) batch_size, seq_len, _ = embeddings.size() - cls_tokens = self.cls_token.expand(batch_size, -1, -1) if bool_masked_pos is not None: mask_tokens = self.mask_token.expand(batch_size, seq_len, -1) # replace the masked visual tokens by mask_tokens w = bool_masked_pos.unsqueeze(-1).type_as(mask_tokens) embeddings = embeddings * (1 - w) + mask_tokens * w - embeddings = torch.cat((cls_tokens, embeddings), dim=1) + cls_tokens = self.cls_token.expand(batch_size, -1, -1) if self.position_embeddings is not None: - embeddings = embeddings + self.position_embeddings + cls_tokens = cls_tokens + self.position_embeddings[:, :1, :] + + embeddings = torch.cat((cls_tokens, embeddings), dim=1) + embeddings = self.dropout(embeddings) - return embeddings + return embeddings, (patch_height, patch_width) class BeitPatchEmbeddings(nn.Module): @@ -191,19 +197,29 @@ def __init__(self, config): self.projection = nn.Conv2d(num_channels, hidden_size, kernel_size=patch_size, stride=patch_size) - def forward(self, pixel_values: torch.Tensor) -> torch.Tensor: + def forward(self, pixel_values: torch.Tensor, position_embedding: Optional[torch.Tensor] = None) -> torch.Tensor: batch_size, num_channels, height, width = pixel_values.shape if num_channels != self.num_channels: raise ValueError( "Make sure that the channel dimension of the pixel values match with the one set in the configuration." ) - if height != self.image_size[0] or width != self.image_size[1]: - raise ValueError( - f"Input image size ({height}*{width}) doesn't match model ({self.image_size[0]}*{self.image_size[1]})." 
+ + embeddings = self.projection(pixel_values) + patch_height, patch_width = embeddings.shape[2], embeddings.shape[3] + + if position_embedding is not None: + # interpolate the position embedding to the corresponding size + position_embedding = position_embedding.view(1, self.patch_shape[0], self.patch_shape[1], -1).permute( + 0, 3, 1, 2 + ) + position_embedding = nn.functional.interpolate( + position_embedding, size=(patch_height, patch_width), mode="bicubic" ) - embeddings = self.projection(pixel_values).flatten(2).transpose(1, 2) + embeddings = embeddings + position_embedding + + embeddings = embeddings.flatten(2).transpose(1, 2) - return embeddings + return embeddings, (patch_height, patch_width) class BeitSelfAttention(nn.Module): @@ -459,7 +475,7 @@ def __init__(self, config: BeitConfig, window_size: tuple) -> None: relative_position_index[0:, 0] = self.num_relative_distance - 2 relative_position_index[0, 0] = self.num_relative_distance - 1 - self.register_buffer("relative_position_index", relative_position_index) + self.register_buffer("relative_position_index", relative_position_index, persistent=False) def forward(self) -> torch.Tensor: relative_position_bias = self.relative_position_bias_table[self.relative_position_index.view(-1)].view( @@ -510,17 +526,11 @@ def forward( layer_head_mask = head_mask[i] if head_mask is not None else None if self.gradient_checkpointing and self.training: - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, output_attentions) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(layer_module), + layer_outputs = self._gradient_checkpointing_func( + layer_module.__call__, hidden_states, layer_head_mask, + output_attentions, ) else: relative_position_bias = ( @@ -572,10 +582,6 @@ def _init_weights(self, module): module.bias.data.zero_() module.weight.data.fill_(1.0) - def _set_gradient_checkpointing(self, module, value=False): - if isinstance(module, BeitEncoder): - module.gradient_checkpointing = value - BEIT_START_DOCSTRING = r""" This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it @@ -659,6 +665,10 @@ def forward( output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, ) -> Union[tuple, BeitModelOutputWithPooling]: + r""" + bool_masked_pos (`torch.BoolTensor` of shape `(batch_size, num_patches)`, *optional*): + Boolean masked positions. Indicates which patches are masked (1) and which aren't (0). + """ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions output_hidden_states = ( output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states @@ -675,7 +685,7 @@ def forward( # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length] head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers) - embedding_output = self.embeddings(pixel_values, bool_masked_pos) + embedding_output, (patch_height, patch_width) = self.embeddings(pixel_values, bool_masked_pos) encoder_outputs = self.encoder( embedding_output, @@ -1157,6 +1167,12 @@ def __init__(self, config: BeitConfig) -> None: self.beit = BeitModel(config, add_pooling_layer=False) # FPNs + if len(self.config.out_indices) != 4: + raise ValueError( + "BeitForSemanticSegmentation requires config.out_indices to be a list of 4 integers, " + "specifying which features to use from the backbone. 
One can use [3, 5, 7, 11] in case of " + "a base-sized architecture." + ) self.fpn1 = nn.Sequential( nn.ConvTranspose2d(config.hidden_size, config.hidden_size, kernel_size=2, stride=2), nn.BatchNorm2d(config.hidden_size), @@ -1188,8 +1204,10 @@ def compute_loss(self, logits, auxiliary_logits, labels): # compute weighted loss loss_fct = CrossEntropyLoss(ignore_index=self.config.semantic_loss_ignore_index) main_loss = loss_fct(upsampled_logits, labels) - auxiliary_loss = loss_fct(upsampled_auxiliary_logits, labels) - loss = main_loss + self.config.auxiliary_loss_weight * auxiliary_loss + loss = main_loss + if auxiliary_logits is not None: + auxiliary_loss = loss_fct(upsampled_auxiliary_logits, labels) + loss += self.config.auxiliary_loss_weight * auxiliary_loss return loss @@ -1284,3 +1302,126 @@ def forward( hidden_states=outputs.hidden_states if output_hidden_states else None, attentions=outputs.attentions, ) + + +@add_start_docstrings( + """ + BEiT backbone, to be used with frameworks like DETR and MaskFormer. + """, + BEIT_START_DOCSTRING, +) +class BeitBackbone(BeitPreTrainedModel, BackboneMixin): + def __init__(self, config): + super().__init__(config) + super()._init_backbone(config) + + self.num_features = [config.hidden_size for _ in range(config.num_hidden_layers + 1)] + self.embeddings = BeitEmbeddings(config) + self.encoder = BeitEncoder(config, window_size=self.embeddings.patch_embeddings.patch_shape) + + if config.add_fpn: + if len(self.config.out_indices) != 4: + raise ValueError( + "BeitBackbone requires config.out_indices to be a list of 4 integers, " + "specifying which features to use from the backbone. One can use [3, 5, 7, 11] in case of " + "a base-sized architecture." + ) + hidden_size = config.hidden_size + self.fpn1 = nn.Sequential( + nn.ConvTranspose2d(hidden_size, hidden_size, kernel_size=2, stride=2), + nn.BatchNorm2d(hidden_size, eps=config.batch_norm_eps), + nn.GELU(), + nn.ConvTranspose2d(hidden_size, hidden_size, kernel_size=2, stride=2), + ) + + self.fpn2 = nn.Sequential(nn.ConvTranspose2d(hidden_size, hidden_size, kernel_size=2, stride=2)) + self.fpn3 = nn.Identity() + self.fpn4 = nn.MaxPool2d(kernel_size=2, stride=2) + + # initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self): + return self.embeddings.patch_embeddings + + @add_start_docstrings_to_model_forward(BEIT_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=BackboneOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + pixel_values: Tensor, + output_hidden_states: Optional[bool] = None, + output_attentions: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> BackboneOutput: + """ + Returns: + + Examples: + + ```python + >>> from transformers import AutoImageProcessor, AutoBackbone + >>> import torch + >>> from PIL import Image + >>> import requests + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> processor = AutoImageProcessor.from_pretrained("microsoft/beit-base-patch16-224") + >>> model = AutoBackbone.from_pretrained( + ... "microsoft/beit-base-patch16-224", out_features=["stage1", "stage2", "stage3", "stage4"] + ... 
) + + >>> inputs = processor(image, return_tensors="pt") + + >>> outputs = model(**inputs) + >>> feature_maps = outputs.feature_maps + >>> list(feature_maps[-1].shape) + [1, 768, 14, 14] + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + + batch_size = pixel_values.shape[0] + embedding_output, (patch_height, patch_width) = self.embeddings(pixel_values) + + outputs = self.encoder( + embedding_output, output_hidden_states=True, output_attentions=output_attentions, return_dict=return_dict + ) + + hidden_states = outputs.hidden_states if return_dict else outputs[1] + + feature_maps = () + for stage, hidden_state in zip(self.stage_names, hidden_states): + if stage in self.out_features: + if self.config.reshape_hidden_states: + hidden_state = hidden_state[:, 1:, :] + hidden_state = hidden_state.permute(0, 2, 1) + hidden_state = hidden_state.reshape(batch_size, -1, patch_height, patch_width) + + feature_maps += (hidden_state,) + + if self.config.add_fpn: + feature_maps = [ + self.fpn1(feature_maps[0]), + self.fpn2(feature_maps[1]), + self.fpn3(feature_maps[2]), + self.fpn4(feature_maps[3]), + ] + feature_maps = tuple(feature_maps) + + if not return_dict: + if output_hidden_states: + output = (feature_maps,) + outputs[1:] + else: + output = (feature_maps,) + outputs[2:] + return output + + return BackboneOutput( + feature_maps=feature_maps, + hidden_states=outputs.hidden_states if output_hidden_states else None, + attentions=outputs.attentions, + ) diff --git a/src/transformers/models/beit/modeling_flax_beit.py b/src/transformers/models/beit/modeling_flax_beit.py index 02fb2e5e338dfa..c1da64d263a266 100644 --- a/src/transformers/models/beit/modeling_flax_beit.py +++ b/src/transformers/models/beit/modeling_flax_beit.py @@ -69,9 +69,10 @@ class FlaxBeitModelOutputWithPooling(FlaxBaseModelOutputWithPooling): This model inherits from [`FlaxPreTrainedModel`]. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading, saving and converting weights from PyTorch models) - This model is also a Flax Linen [flax.linen.Module](https://flax.readthedocs.io/en/latest/flax.linen.html#module) - subclass. Use it as a regular Flax linen Module and refer to the Flax documentation for all matter related to - general usage and behavior. + This model is also a + [flax.linen.Module](https://flax.readthedocs.io/en/latest/api_reference/flax.linen/module.html) subclass. Use it as + a regular Flax linen Module and refer to the Flax documentation for all matter related to general usage and + behavior. Finally, this model supports inherent JAX features such as: @@ -555,7 +556,7 @@ def setup(self): ) # stochastic depth decay rule - drop_path_rates = [x for x in np.linspace(0, self.config.drop_path_rate, self.config.num_hidden_layers)] + drop_path_rates = list(np.linspace(0, self.config.drop_path_rate, self.config.num_hidden_layers)) self.layer = FlaxBeitLayerCollection( self.config, window_size=self.window_size, @@ -831,7 +832,7 @@ class FlaxBeitForMaskedImageModeling(FlaxBeitPreTrainedModel): FLAX_BEIT_MLM_DOCSTRING = """ bool_masked_pos (`numpy.ndarray` of shape `(batch_size, num_patches)`): - Boolean masked positions. Indicates which patches are masked (1) and which aren't (0). 
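The backbone's `reshape_hidden_states` branch above turns token sequences back into spatial feature maps. Below is a standalone sketch of that exact reshaping, using hypothetical BEiT-base dimensions:

```python
import torch

batch_size, patch_height, patch_width, hidden_size = 1, 14, 14, 768

# One encoder hidden state: a CLS token followed by 14*14 patch tokens.
hidden_state = torch.randn(batch_size, 1 + patch_height * patch_width, hidden_size)

feature_map = (
    hidden_state[:, 1:, :]    # drop the CLS token
    .permute(0, 2, 1)         # (batch, hidden, seq)
    .reshape(batch_size, -1, patch_height, patch_width)
)
print(list(feature_map.shape))  # [1, 768, 14, 14], matching the docstring example above
```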
+ Boolean masked positions. Indicates which patches are masked (1) and which aren't (0). Returns: diff --git a/src/transformers/models/bert/configuration_bert.py b/src/transformers/models/bert/configuration_bert.py index 589c2b0261854b..1f79260f510ff2 100644 --- a/src/transformers/models/bert/configuration_bert.py +++ b/src/transformers/models/bert/configuration_bert.py @@ -25,29 +25,29 @@ logger = logging.get_logger(__name__) BERT_PRETRAINED_CONFIG_ARCHIVE_MAP = { - "bert-base-uncased": "https://huggingface.co/bert-base-uncased/resolve/main/config.json", - "bert-large-uncased": "https://huggingface.co/bert-large-uncased/resolve/main/config.json", - "bert-base-cased": "https://huggingface.co/bert-base-cased/resolve/main/config.json", - "bert-large-cased": "https://huggingface.co/bert-large-cased/resolve/main/config.json", - "bert-base-multilingual-uncased": "https://huggingface.co/bert-base-multilingual-uncased/resolve/main/config.json", - "bert-base-multilingual-cased": "https://huggingface.co/bert-base-multilingual-cased/resolve/main/config.json", - "bert-base-chinese": "https://huggingface.co/bert-base-chinese/resolve/main/config.json", - "bert-base-german-cased": "https://huggingface.co/bert-base-german-cased/resolve/main/config.json", - "bert-large-uncased-whole-word-masking": ( - "https://huggingface.co/bert-large-uncased-whole-word-masking/resolve/main/config.json" + "google-bert/bert-base-uncased": "https://huggingface.co/google-bert/bert-base-uncased/resolve/main/config.json", + "google-bert/bert-large-uncased": "https://huggingface.co/google-bert/bert-large-uncased/resolve/main/config.json", + "google-bert/bert-base-cased": "https://huggingface.co/google-bert/bert-base-cased/resolve/main/config.json", + "google-bert/bert-large-cased": "https://huggingface.co/google-bert/bert-large-cased/resolve/main/config.json", + "google-bert/bert-base-multilingual-uncased": "https://huggingface.co/google-bert/bert-base-multilingual-uncased/resolve/main/config.json", + "google-bert/bert-base-multilingual-cased": "https://huggingface.co/google-bert/bert-base-multilingual-cased/resolve/main/config.json", + "google-bert/bert-base-chinese": "https://huggingface.co/google-bert/bert-base-chinese/resolve/main/config.json", + "google-bert/bert-base-german-cased": "https://huggingface.co/google-bert/bert-base-german-cased/resolve/main/config.json", + "google-bert/bert-large-uncased-whole-word-masking": ( + "https://huggingface.co/google-bert/bert-large-uncased-whole-word-masking/resolve/main/config.json" ), - "bert-large-cased-whole-word-masking": ( - "https://huggingface.co/bert-large-cased-whole-word-masking/resolve/main/config.json" + "google-bert/bert-large-cased-whole-word-masking": ( + "https://huggingface.co/google-bert/bert-large-cased-whole-word-masking/resolve/main/config.json" ), - "bert-large-uncased-whole-word-masking-finetuned-squad": ( - "https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad/resolve/main/config.json" + "google-bert/bert-large-uncased-whole-word-masking-finetuned-squad": ( + "https://huggingface.co/google-bert/bert-large-uncased-whole-word-masking-finetuned-squad/resolve/main/config.json" ), - "bert-large-cased-whole-word-masking-finetuned-squad": ( - "https://huggingface.co/bert-large-cased-whole-word-masking-finetuned-squad/resolve/main/config.json" + "google-bert/bert-large-cased-whole-word-masking-finetuned-squad": ( + "https://huggingface.co/google-bert/bert-large-cased-whole-word-masking-finetuned-squad/resolve/main/config.json" ), - 
"bert-base-cased-finetuned-mrpc": "https://huggingface.co/bert-base-cased-finetuned-mrpc/resolve/main/config.json", - "bert-base-german-dbmdz-cased": "https://huggingface.co/bert-base-german-dbmdz-cased/resolve/main/config.json", - "bert-base-german-dbmdz-uncased": "https://huggingface.co/bert-base-german-dbmdz-uncased/resolve/main/config.json", + "google-bert/bert-base-cased-finetuned-mrpc": "https://huggingface.co/google-bert/bert-base-cased-finetuned-mrpc/resolve/main/config.json", + "google-bert/bert-base-german-dbmdz-cased": "https://huggingface.co/google-bert/bert-base-german-dbmdz-cased/resolve/main/config.json", + "google-bert/bert-base-german-dbmdz-uncased": "https://huggingface.co/google-bert/bert-base-german-dbmdz-uncased/resolve/main/config.json", "cl-tohoku/bert-base-japanese": "https://huggingface.co/cl-tohoku/bert-base-japanese/resolve/main/config.json", "cl-tohoku/bert-base-japanese-whole-word-masking": ( "https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking/resolve/main/config.json" @@ -74,7 +74,7 @@ class BertConfig(PretrainedConfig): This is the configuration class to store the configuration of a [`BertModel`] or a [`TFBertModel`]. It is used to instantiate a BERT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the BERT - [bert-base-uncased](https://huggingface.co/bert-base-uncased) architecture. + [google-bert/bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased) architecture. Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the documentation from [`PretrainedConfig`] for more information. @@ -127,15 +127,16 @@ class BertConfig(PretrainedConfig): ```python >>> from transformers import BertConfig, BertModel - >>> # Initializing a BERT bert-base-uncased style configuration + >>> # Initializing a BERT google-bert/bert-base-uncased style configuration >>> configuration = BertConfig() - >>> # Initializing a model (with random weights) from the bert-base-uncased style configuration + >>> # Initializing a model (with random weights) from the google-bert/bert-base-uncased style configuration >>> model = BertModel(configuration) >>> # Accessing the model configuration >>> configuration = model.config ```""" + model_type = "bert" def __init__( diff --git a/src/transformers/models/bert/convert_bert_pytorch_checkpoint_to_original_tf.py b/src/transformers/models/bert/convert_bert_pytorch_checkpoint_to_original_tf.py index 68ed9bafc873ac..f7cb149053a3d0 100644 --- a/src/transformers/models/bert/convert_bert_pytorch_checkpoint_to_original_tf.py +++ b/src/transformers/models/bert/convert_bert_pytorch_checkpoint_to_original_tf.py @@ -78,10 +78,10 @@ def create_tf_var(tensor: np.ndarray, name: str, session: tf.Session): for var_name in state_dict: tf_name = to_tf_var_name(var_name) torch_tensor = state_dict[var_name].numpy() - if any([x in var_name for x in tensors_to_transpose]): + if any(x in var_name for x in tensors_to_transpose): torch_tensor = torch_tensor.T tf_var = create_tf_var(tensor=torch_tensor, name=tf_name, session=session) - tf.keras.backend.set_value(tf_var, torch_tensor) + tf_var.assign(tf.cast(torch_tensor, tf_var.dtype)) tf_weight = session.run(tf_var) print(f"Successfully created {tf_name}: {np.allclose(tf_weight, torch_tensor)}") @@ -91,7 +91,7 @@ def create_tf_var(tensor: np.ndarray, name: str, session: tf.Session): def main(raw_args=None): parser = 
argparse.ArgumentParser() - parser.add_argument("--model_name", type=str, required=True, help="model name e.g. bert-base-uncased") + parser.add_argument("--model_name", type=str, required=True, help="model name e.g. google-bert/bert-base-uncased") parser.add_argument( "--cache_dir", type=str, default=None, required=False, help="Directory containing pytorch model" ) diff --git a/src/transformers/models/bert/modeling_bert.py b/src/transformers/models/bert/modeling_bert.py index 9c8cf804400de4..4c068c4d4f1d76 100755 --- a/src/transformers/models/bert/modeling_bert.py +++ b/src/transformers/models/bert/modeling_bert.py @@ -54,7 +54,7 @@ logger = logging.get_logger(__name__) -_CHECKPOINT_FOR_DOC = "bert-base-uncased" +_CHECKPOINT_FOR_DOC = "google-bert/bert-base-uncased" _CONFIG_FOR_DOC = "BertConfig" # TokenClassification docstring @@ -78,21 +78,21 @@ BERT_PRETRAINED_MODEL_ARCHIVE_LIST = [ - "bert-base-uncased", - "bert-large-uncased", - "bert-base-cased", - "bert-large-cased", - "bert-base-multilingual-uncased", - "bert-base-multilingual-cased", - "bert-base-chinese", - "bert-base-german-cased", - "bert-large-uncased-whole-word-masking", - "bert-large-cased-whole-word-masking", - "bert-large-uncased-whole-word-masking-finetuned-squad", - "bert-large-cased-whole-word-masking-finetuned-squad", - "bert-base-cased-finetuned-mrpc", - "bert-base-german-dbmdz-cased", - "bert-base-german-dbmdz-uncased", + "google-bert/bert-base-uncased", + "google-bert/bert-large-uncased", + "google-bert/bert-base-cased", + "google-bert/bert-large-cased", + "google-bert/bert-base-multilingual-uncased", + "google-bert/bert-base-multilingual-cased", + "google-bert/bert-base-chinese", + "google-bert/bert-base-german-cased", + "google-bert/bert-large-uncased-whole-word-masking", + "google-bert/bert-large-cased-whole-word-masking", + "google-bert/bert-large-uncased-whole-word-masking-finetuned-squad", + "google-bert/bert-large-cased-whole-word-masking-finetuned-squad", + "google-bert/bert-base-cased-finetuned-mrpc", + "google-bert/bert-base-german-dbmdz-cased", + "google-bert/bert-base-german-dbmdz-uncased", "cl-tohoku/bert-base-japanese", "cl-tohoku/bert-base-japanese-whole-word-masking", "cl-tohoku/bert-base-japanese-char", @@ -169,7 +169,7 @@ def load_tf_weights_in_bert(model, config, tf_checkpoint_path): try: if pointer.shape != array.shape: raise ValueError(f"Pointer shape {pointer.shape} and array shape {array.shape} mismatched") - except AssertionError as e: + except ValueError as e: e.args += (pointer.shape, array.shape) raise logger.info(f"Initialize PyTorch weight {name}") @@ -192,7 +192,9 @@ def __init__(self, config): self.dropout = nn.Dropout(config.hidden_dropout_prob) # position_ids (1, len position emb) is contiguous in memory and exported when serialized self.position_embedding_type = getattr(config, "position_embedding_type", "absolute") - self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1))) + self.register_buffer( + "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False + ) self.register_buffer( "token_type_ids", torch.zeros(self.position_ids.size(), dtype=torch.long), persistent=False ) @@ -575,6 +577,13 @@ def forward( all_self_attentions = () if output_attentions else None all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning_once( + "`use_cache=True` is incompatible with gradient 
checkpointing. Setting `use_cache=False`..." + ) + use_cache = False + next_decoder_cache = () if use_cache else None for i, layer_module in enumerate(self.layer): if output_hidden_states: @@ -584,25 +593,15 @@ def forward( past_key_value = past_key_values[i] if past_key_values is not None else None if self.gradient_checkpointing and self.training: - if use_cache: - logger.warning( - "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." - ) - use_cache = False - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, past_key_value, output_attentions) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(layer_module), + layer_outputs = self._gradient_checkpointing_func( + layer_module.__call__, hidden_states, attention_mask, layer_head_mask, encoder_hidden_states, encoder_attention_mask, + past_key_value, + output_attentions, ) else: layer_outputs = layer_module( @@ -741,7 +740,6 @@ class BertPreTrainedModel(PreTrainedModel): load_tf_weights = load_tf_weights_in_bert base_model_prefix = "bert" supports_gradient_checkpointing = True - _keys_to_ignore_on_load_missing = [r"position_ids"] def _init_weights(self, module): """Initialize the weights""" @@ -759,10 +757,6 @@ def _init_weights(self, module): module.bias.data.zero_() module.weight.data.fill_(1.0) - def _set_gradient_checkpointing(self, module, value=False): - if isinstance(module, BertEncoder): - module.gradient_checkpointing = value - @dataclass class BertForPreTrainingOutput(ModelOutput): @@ -963,6 +957,7 @@ def forward( if input_ids is not None and inputs_embeds is not None: raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") elif input_ids is not None: + self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask) input_shape = input_ids.size() elif inputs_embeds is not None: input_shape = inputs_embeds.size()[:-1] @@ -1051,7 +1046,7 @@ def forward( BERT_START_DOCSTRING, ) class BertForPreTraining(BertPreTrainedModel): - _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias", r"cls.predictions.decoder.weight"] + _tied_weights_keys = ["predictions.decoder.bias", "cls.predictions.decoder.weight"] def __init__(self, config): super().__init__(config) @@ -1106,8 +1101,8 @@ def forward( >>> from transformers import AutoTokenizer, BertForPreTraining >>> import torch - >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") - >>> model = BertForPreTraining.from_pretrained("bert-base-uncased") + >>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") + >>> model = BertForPreTraining.from_pretrained("google-bert/bert-base-uncased") >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt") >>> outputs = model(**inputs) @@ -1157,8 +1152,7 @@ def forward( """Bert Model with a `language modeling` head on top for CLM fine-tuning.""", BERT_START_DOCSTRING ) class BertLMHeadModel(BertPreTrainedModel): - _keys_to_ignore_on_load_unexpected = [r"pooler"] - _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias", r"cls.predictions.decoder.weight"] + _tied_weights_keys = ["predictions.decoder.bias", "cls.predictions.decoder.weight"] def __init__(self, config): super().__init__(config) @@ -1279,7 +1273,16 @@ def prepare_inputs_for_generation( # cut decoder_input_ids if past_key_values is used if past_key_values is not None: - input_ids = input_ids[:, -1:] + past_length = 
past_key_values[0][0].shape[2] + + # Some generation methods already pass only the last input ID + if input_ids.shape[1] > past_length: + remove_prefix_length = past_length + else: + # Default to old behavior: keep only final ID + remove_prefix_length = input_ids.shape[1] - 1 + + input_ids = input_ids[:, remove_prefix_length:] return { "input_ids": input_ids, @@ -1291,14 +1294,15 @@ def prepare_inputs_for_generation( def _reorder_cache(self, past_key_values, beam_idx): reordered_past = () for layer_past in past_key_values: - reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),) + reordered_past += ( + tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past), + ) return reordered_past @add_start_docstrings("""Bert Model with a `language modeling` head on top.""", BERT_START_DOCSTRING) class BertForMaskedLM(BertPreTrainedModel): - _keys_to_ignore_on_load_unexpected = [r"pooler"] - _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias", r"cls.predictions.decoder.weight"] + _tied_weights_keys = ["predictions.decoder.bias", "cls.predictions.decoder.weight"] def __init__(self, config): super().__init__(config) @@ -1449,8 +1453,8 @@ def forward( >>> from transformers import AutoTokenizer, BertForNextSentencePrediction >>> import torch - >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") - >>> model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased") + >>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") + >>> model = BertForNextSentencePrediction.from_pretrained("google-bert/bert-base-uncased") >>> prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced." >>> next_sentence = "The sky is blue due to the shorter wavelength of blue light." @@ -1710,8 +1714,6 @@ def forward( BERT_START_DOCSTRING, ) class BertForTokenClassification(BertPreTrainedModel): - _keys_to_ignore_on_load_unexpected = [r"pooler"] - def __init__(self, config): super().__init__(config) self.num_labels = config.num_labels @@ -1795,8 +1797,6 @@ def forward( BERT_START_DOCSTRING, ) class BertForQuestionAnswering(BertPreTrainedModel): - _keys_to_ignore_on_load_unexpected = [r"pooler"] - def __init__(self, config): super().__init__(config) self.num_labels = config.num_labels diff --git a/src/transformers/models/bert/modeling_flax_bert.py b/src/transformers/models/bert/modeling_flax_bert.py index 6e8eb829b90903..772ea2bf12b2ee 100644 --- a/src/transformers/models/bert/modeling_flax_bert.py +++ b/src/transformers/models/bert/modeling_flax_bert.py @@ -52,7 +52,7 @@ logger = logging.get_logger(__name__) -_CHECKPOINT_FOR_DOC = "bert-base-uncased" +_CHECKPOINT_FOR_DOC = "google-bert/bert-base-uncased" _CONFIG_FOR_DOC = "BertConfig" remat = nn_partitioning.remat @@ -93,9 +93,10 @@ class FlaxBertForPreTrainingOutput(ModelOutput): This model inherits from [`FlaxPreTrainedModel`]. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading, saving and converting weights from PyTorch models) - This model is also a Flax Linen [flax.linen.Module](https://flax.readthedocs.io/en/latest/flax.linen.html#module) - subclass. Use it as a regular Flax linen Module and refer to the Flax documentation for all matter related to - general usage and behavior. + This model is also a + [flax.linen.Module](https://flax.readthedocs.io/en/latest/api_reference/flax.linen/module.html) subclass. 
Use it as + a regular Flax linen Module and refer to the Flax documentation for all matter related to general usage and + behavior. Finally, this model supports inherent JAX features such as: @@ -295,7 +296,7 @@ def __call__( hidden_states, attention_mask, layer_head_mask, - key_value_states: Optional[jnp.array] = None, + key_value_states: Optional[jnp.ndarray] = None, init_cache: bool = False, deterministic=True, output_attentions: bool = False, @@ -1113,8 +1114,8 @@ class FlaxBertForPreTraining(FlaxBertPreTrainedModel): ```python >>> from transformers import AutoTokenizer, FlaxBertForPreTraining - >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") - >>> model = FlaxBertForPreTraining.from_pretrained("bert-base-uncased") + >>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") + >>> model = FlaxBertForPreTraining.from_pretrained("google-bert/bert-base-uncased") >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="np") >>> outputs = model(**inputs) @@ -1268,8 +1269,8 @@ class FlaxBertForNextSentencePrediction(FlaxBertPreTrainedModel): ```python >>> from transformers import AutoTokenizer, FlaxBertForNextSentencePrediction - >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") - >>> model = FlaxBertForNextSentencePrediction.from_pretrained("bert-base-uncased") + >>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") + >>> model = FlaxBertForNextSentencePrediction.from_pretrained("google-bert/bert-base-uncased") >>> prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced." >>> next_sentence = "The sky is blue due to the shorter wavelength of blue light." @@ -1568,7 +1569,7 @@ def __call__( hidden_states = outputs[0] logits = self.qa_outputs(hidden_states) - start_logits, end_logits = logits.split(self.config.num_labels, axis=-1) + start_logits, end_logits = jnp.split(logits, self.config.num_labels, axis=-1) start_logits = start_logits.squeeze(-1) end_logits = end_logits.squeeze(-1) @@ -1677,7 +1678,7 @@ def __call__( class FlaxBertForCausalLM(FlaxBertPreTrainedModel): module_class = FlaxBertForCausalLMModule - def prepare_inputs_for_generation(self, input_ids, max_length, attention_mask: Optional[jnp.DeviceArray] = None): + def prepare_inputs_for_generation(self, input_ids, max_length, attention_mask: Optional[jax.Array] = None): # initializing the cache batch_size, seq_length = input_ids.shape diff --git a/src/transformers/models/bert/modeling_tf_bert.py b/src/transformers/models/bert/modeling_tf_bert.py index 5391d71a916c3b..7fe89e43e86335 100644 --- a/src/transformers/models/bert/modeling_tf_bert.py +++ b/src/transformers/models/bert/modeling_tf_bert.py @@ -15,6 +15,9 @@ # limitations under the License. 
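The `prepare_inputs_for_generation` change above trims `input_ids` by the cached length instead of always keeping only the last token. A minimal sketch of that bookkeeping with hypothetical shapes:

```python
import torch

# One decoder layer's cache: key/value of shape (batch, heads, cached_tokens, head_dim).
key = value = torch.zeros(1, 12, 7, 64)
past_key_values = ((key, value),)

# Some generation paths pass the full sequence even when a cache exists.
input_ids = torch.arange(8).unsqueeze(0)  # shape (1, 8)

past_length = past_key_values[0][0].shape[2]
if input_ids.shape[1] > past_length:
    remove_prefix_length = past_length             # keep only the not-yet-cached tokens
else:
    remove_prefix_length = input_ids.shape[1] - 1  # previous behaviour: keep the last token

input_ids = input_ids[:, remove_prefix_length:]
print(input_ids)  # tensor([[7]])
```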
""" TF 2.0 BERT model.""" + +from __future__ import annotations + import math import warnings from dataclasses import dataclass @@ -46,13 +49,12 @@ TFSequenceClassificationLoss, TFTokenClassificationLoss, get_initializer, + keras, keras_serializable, unpack_inputs, ) -from ...tf_utils import shape_list, stable_softmax +from ...tf_utils import check_embeddings_within_bounds, shape_list, stable_softmax from ...utils import ( - DUMMY_INPUTS, - MULTIPLE_CHOICE_DUMMY_INPUTS, ModelOutput, add_code_sample_docstrings, add_start_docstrings, @@ -65,7 +67,7 @@ logger = logging.get_logger(__name__) -_CHECKPOINT_FOR_DOC = "bert-base-uncased" +_CHECKPOINT_FOR_DOC = "google-bert/bert-base-uncased" _CONFIG_FOR_DOC = "BertConfig" # TokenClassification docstring @@ -88,19 +90,19 @@ _SEQ_CLASS_EXPECTED_LOSS = 0.01 TF_BERT_PRETRAINED_MODEL_ARCHIVE_LIST = [ - "bert-base-uncased", - "bert-large-uncased", - "bert-base-cased", - "bert-large-cased", - "bert-base-multilingual-uncased", - "bert-base-multilingual-cased", - "bert-base-chinese", - "bert-base-german-cased", - "bert-large-uncased-whole-word-masking", - "bert-large-cased-whole-word-masking", - "bert-large-uncased-whole-word-masking-finetuned-squad", - "bert-large-cased-whole-word-masking-finetuned-squad", - "bert-base-cased-finetuned-mrpc", + "google-bert/bert-base-uncased", + "google-bert/bert-large-uncased", + "google-bert/bert-base-cased", + "google-bert/bert-large-cased", + "google-bert/bert-base-multilingual-uncased", + "google-bert/bert-base-multilingual-cased", + "google-bert/bert-base-chinese", + "google-bert/bert-base-german-cased", + "google-bert/bert-large-uncased-whole-word-masking", + "google-bert/bert-large-cased-whole-word-masking", + "google-bert/bert-large-uncased-whole-word-masking-finetuned-squad", + "google-bert/bert-large-cased-whole-word-masking-finetuned-squad", + "google-bert/bert-base-cased-finetuned-mrpc", "cl-tohoku/bert-base-japanese", "cl-tohoku/bert-base-japanese-whole-word-masking", "cl-tohoku/bert-base-japanese-char", @@ -120,9 +122,7 @@ class TFBertPreTrainingLoss: """ def hf_compute_loss(self, labels: tf.Tensor, logits: tf.Tensor) -> tf.Tensor: - loss_fn = tf.keras.losses.SparseCategoricalCrossentropy( - from_logits=True, reduction=tf.keras.losses.Reduction.NONE - ) + loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction=keras.losses.Reduction.NONE) # Clip negative labels to zero here to avoid NaNs and errors - those positions will get masked later anyway unmasked_lm_losses = loss_fn(y_true=tf.nn.relu(labels["labels"]), y_pred=logits[0]) @@ -142,7 +142,7 @@ def hf_compute_loss(self, labels: tf.Tensor, logits: tf.Tensor) -> tf.Tensor: return tf.reshape(reduced_masked_lm_loss + reduced_masked_ns_loss, (1,)) -class TFBertEmbeddings(tf.keras.layers.Layer): +class TFBertEmbeddings(keras.layers.Layer): """Construct the embeddings from word, position and token_type embeddings.""" def __init__(self, config: BertConfig, **kwargs): @@ -152,10 +152,10 @@ def __init__(self, config: BertConfig, **kwargs): self.hidden_size = config.hidden_size self.max_position_embeddings = config.max_position_embeddings self.initializer_range = config.initializer_range - self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") - self.dropout = tf.keras.layers.Dropout(rate=config.hidden_dropout_prob) + self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") + self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob) - def 
build(self, input_shape: tf.TensorShape): + def build(self, input_shape=None): with tf.name_scope("word_embeddings"): self.weight = self.add_weight( name="weight", @@ -177,7 +177,12 @@ def build(self, input_shape: tf.TensorShape): initializer=get_initializer(self.initializer_range), ) - super().build(input_shape) + if self.built: + return + self.built = True + if getattr(self, "LayerNorm", None) is not None: + with tf.name_scope(self.LayerNorm.name): + self.LayerNorm.build([None, None, self.config.hidden_size]) def call( self, @@ -198,16 +203,7 @@ def call( raise ValueError("Need to provide either `input_ids` or `input_embeds`.") if input_ids is not None: - # Note: tf.gather, on which the embedding layer is based, won't check positive out of bound - # indices on GPU, returning zeros instead. This is a dangerous silent behavior. - tf.debugging.assert_less( - input_ids, - tf.cast(self.config.vocab_size, dtype=input_ids.dtype), - message=( - "input_ids must be smaller than the embedding layer's input dimension (got" - f" {tf.math.reduce_max(input_ids)} >= {self.config.vocab_size})" - ), - ) + check_embeddings_within_bounds(input_ids, self.config.vocab_size) inputs_embeds = tf.gather(params=self.weight, indices=input_ids) input_shape = shape_list(inputs_embeds)[:-1] @@ -229,7 +225,7 @@ def call( return final_embeddings -class TFBertSelfAttention(tf.keras.layers.Layer): +class TFBertSelfAttention(keras.layers.Layer): def __init__(self, config: BertConfig, **kwargs): super().__init__(**kwargs) @@ -244,18 +240,19 @@ def __init__(self, config: BertConfig, **kwargs): self.all_head_size = self.num_attention_heads * self.attention_head_size self.sqrt_att_head_size = math.sqrt(self.attention_head_size) - self.query = tf.keras.layers.Dense( + self.query = keras.layers.Dense( units=self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="query" ) - self.key = tf.keras.layers.Dense( + self.key = keras.layers.Dense( units=self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="key" ) - self.value = tf.keras.layers.Dense( + self.value = keras.layers.Dense( units=self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="value" ) - self.dropout = tf.keras.layers.Dropout(rate=config.attention_probs_dropout_prob) + self.dropout = keras.layers.Dropout(rate=config.attention_probs_dropout_prob) self.is_decoder = config.is_decoder + self.config = config def transpose_for_scores(self, tensor: tf.Tensor, batch_size: int) -> tf.Tensor: # Reshape from [batch_size, seq_length, all_head_size] to [batch_size, seq_length, num_attention_heads, attention_head_size] @@ -345,16 +342,31 @@ def call( outputs = outputs + (past_key_value,) return outputs - -class TFBertSelfOutput(tf.keras.layers.Layer): + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "query", None) is not None: + with tf.name_scope(self.query.name): + self.query.build([None, None, self.config.hidden_size]) + if getattr(self, "key", None) is not None: + with tf.name_scope(self.key.name): + self.key.build([None, None, self.config.hidden_size]) + if getattr(self, "value", None) is not None: + with tf.name_scope(self.value.name): + self.value.build([None, None, self.config.hidden_size]) + + +class TFBertSelfOutput(keras.layers.Layer): def __init__(self, config: BertConfig, **kwargs): super().__init__(**kwargs) - self.dense = tf.keras.layers.Dense( + self.dense = keras.layers.Dense( units=config.hidden_size, 
kernel_initializer=get_initializer(config.initializer_range), name="dense" ) - self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") - self.dropout = tf.keras.layers.Dropout(rate=config.hidden_dropout_prob) + self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") + self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob) + self.config = config def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, training: bool = False) -> tf.Tensor: hidden_states = self.dense(inputs=hidden_states) @@ -363,8 +375,19 @@ def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, training: bool return hidden_states + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "dense", None) is not None: + with tf.name_scope(self.dense.name): + self.dense.build([None, None, self.config.hidden_size]) + if getattr(self, "LayerNorm", None) is not None: + with tf.name_scope(self.LayerNorm.name): + self.LayerNorm.build([None, None, self.config.hidden_size]) + -class TFBertAttention(tf.keras.layers.Layer): +class TFBertAttention(keras.layers.Layer): def __init__(self, config: BertConfig, **kwargs): super().__init__(**kwargs) @@ -403,12 +426,23 @@ def call( return outputs + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "self_attention", None) is not None: + with tf.name_scope(self.self_attention.name): + self.self_attention.build(None) + if getattr(self, "dense_output", None) is not None: + with tf.name_scope(self.dense_output.name): + self.dense_output.build(None) -class TFBertIntermediate(tf.keras.layers.Layer): + +class TFBertIntermediate(keras.layers.Layer): def __init__(self, config: BertConfig, **kwargs): super().__init__(**kwargs) - self.dense = tf.keras.layers.Dense( + self.dense = keras.layers.Dense( units=config.intermediate_size, kernel_initializer=get_initializer(config.initializer_range), name="dense" ) @@ -416,6 +450,7 @@ def __init__(self, config: BertConfig, **kwargs): self.intermediate_act_fn = get_tf_activation(config.hidden_act) else: self.intermediate_act_fn = config.hidden_act + self.config = config def call(self, hidden_states: tf.Tensor) -> tf.Tensor: hidden_states = self.dense(inputs=hidden_states) @@ -423,16 +458,25 @@ def call(self, hidden_states: tf.Tensor) -> tf.Tensor: return hidden_states + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "dense", None) is not None: + with tf.name_scope(self.dense.name): + self.dense.build([None, None, self.config.hidden_size]) -class TFBertOutput(tf.keras.layers.Layer): + +class TFBertOutput(keras.layers.Layer): def __init__(self, config: BertConfig, **kwargs): super().__init__(**kwargs) - self.dense = tf.keras.layers.Dense( + self.dense = keras.layers.Dense( units=config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="dense" ) - self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") - self.dropout = tf.keras.layers.Dropout(rate=config.hidden_dropout_prob) + self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") + self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob) + self.config = config def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, training: bool = False) -> tf.Tensor: hidden_states = self.dense(inputs=hidden_states) @@ -441,8 +485,19 
@@ def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, training: bool return hidden_states + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "dense", None) is not None: + with tf.name_scope(self.dense.name): + self.dense.build([None, None, self.config.intermediate_size]) + if getattr(self, "LayerNorm", None) is not None: + with tf.name_scope(self.LayerNorm.name): + self.LayerNorm.build([None, None, self.config.hidden_size]) -class TFBertLayer(tf.keras.layers.Layer): + +class TFBertLayer(keras.layers.Layer): def __init__(self, config: BertConfig, **kwargs): super().__init__(**kwargs) @@ -461,9 +516,9 @@ def call( hidden_states: tf.Tensor, attention_mask: tf.Tensor, head_mask: tf.Tensor, - encoder_hidden_states: Optional[tf.Tensor], - encoder_attention_mask: Optional[tf.Tensor], - past_key_value: Optional[Tuple[tf.Tensor]], + encoder_hidden_states: tf.Tensor | None, + encoder_attention_mask: tf.Tensor | None, + past_key_value: Tuple[tf.Tensor] | None, output_attentions: bool, training: bool = False, ) -> Tuple[tf.Tensor]: @@ -527,8 +582,25 @@ def call( return outputs - -class TFBertEncoder(tf.keras.layers.Layer): + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "attention", None) is not None: + with tf.name_scope(self.attention.name): + self.attention.build(None) + if getattr(self, "intermediate", None) is not None: + with tf.name_scope(self.intermediate.name): + self.intermediate.build(None) + if getattr(self, "bert_output", None) is not None: + with tf.name_scope(self.bert_output.name): + self.bert_output.build(None) + if getattr(self, "crossattention", None) is not None: + with tf.name_scope(self.crossattention.name): + self.crossattention.build(None) + + +class TFBertEncoder(keras.layers.Layer): def __init__(self, config: BertConfig, **kwargs): super().__init__(**kwargs) self.config = config @@ -539,9 +611,9 @@ def call( hidden_states: tf.Tensor, attention_mask: tf.Tensor, head_mask: tf.Tensor, - encoder_hidden_states: Optional[tf.Tensor], - encoder_attention_mask: Optional[tf.Tensor], - past_key_values: Optional[Tuple[Tuple[tf.Tensor]]], + encoder_hidden_states: tf.Tensor | None, + encoder_attention_mask: tf.Tensor | None, + past_key_values: Tuple[Tuple[tf.Tensor]] | None, use_cache: Optional[bool], output_attentions: bool, output_hidden_states: bool, @@ -596,17 +668,27 @@ def call( cross_attentions=all_cross_attentions, ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "layer", None) is not None: + for layer in self.layer: + with tf.name_scope(layer.name): + layer.build(None) + -class TFBertPooler(tf.keras.layers.Layer): +class TFBertPooler(keras.layers.Layer): def __init__(self, config: BertConfig, **kwargs): super().__init__(**kwargs) - self.dense = tf.keras.layers.Dense( + self.dense = keras.layers.Dense( units=config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), activation="tanh", name="dense", ) + self.config = config def call(self, hidden_states: tf.Tensor) -> tf.Tensor: # We "pool" the model by simply taking the hidden state corresponding @@ -616,12 +698,20 @@ def call(self, hidden_states: tf.Tensor) -> tf.Tensor: return pooled_output + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "dense", None) is not None: + with tf.name_scope(self.dense.name): + self.dense.build([None, None, self.config.hidden_size]) + -class 
TFBertPredictionHeadTransform(tf.keras.layers.Layer): +class TFBertPredictionHeadTransform(keras.layers.Layer): def __init__(self, config: BertConfig, **kwargs): super().__init__(**kwargs) - self.dense = tf.keras.layers.Dense( + self.dense = keras.layers.Dense( units=config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="dense", @@ -632,7 +722,8 @@ def __init__(self, config: BertConfig, **kwargs): else: self.transform_act_fn = config.hidden_act - self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") + self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") + self.config = config def call(self, hidden_states: tf.Tensor) -> tf.Tensor: hidden_states = self.dense(inputs=hidden_states) @@ -641,9 +732,20 @@ def call(self, hidden_states: tf.Tensor) -> tf.Tensor: return hidden_states + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "dense", None) is not None: + with tf.name_scope(self.dense.name): + self.dense.build([None, None, self.config.hidden_size]) + if getattr(self, "LayerNorm", None) is not None: + with tf.name_scope(self.LayerNorm.name): + self.LayerNorm.build([None, None, self.config.hidden_size]) + -class TFBertLMPredictionHead(tf.keras.layers.Layer): - def __init__(self, config: BertConfig, input_embeddings: tf.keras.layers.Layer, **kwargs): +class TFBertLMPredictionHead(keras.layers.Layer): + def __init__(self, config: BertConfig, input_embeddings: keras.layers.Layer, **kwargs): super().__init__(**kwargs) self.config = config @@ -655,12 +757,17 @@ def __init__(self, config: BertConfig, input_embeddings: tf.keras.layers.Layer, # an output-only bias for each token. self.input_embeddings = input_embeddings - def build(self, input_shape: tf.TensorShape): + def build(self, input_shape=None): self.bias = self.add_weight(shape=(self.config.vocab_size,), initializer="zeros", trainable=True, name="bias") - super().build(input_shape) + if self.built: + return + self.built = True + if getattr(self, "transform", None) is not None: + with tf.name_scope(self.transform.name): + self.transform.build(None) - def get_output_embeddings(self) -> tf.keras.layers.Layer: + def get_output_embeddings(self) -> keras.layers.Layer: return self.input_embeddings def set_output_embeddings(self, value: tf.Variable): @@ -685,8 +792,8 @@ def call(self, hidden_states: tf.Tensor) -> tf.Tensor: return hidden_states -class TFBertMLMHead(tf.keras.layers.Layer): - def __init__(self, config: BertConfig, input_embeddings: tf.keras.layers.Layer, **kwargs): +class TFBertMLMHead(keras.layers.Layer): + def __init__(self, config: BertConfig, input_embeddings: keras.layers.Layer, **kwargs): super().__init__(**kwargs) self.predictions = TFBertLMPredictionHead(config, input_embeddings, name="predictions") @@ -696,25 +803,42 @@ def call(self, sequence_output: tf.Tensor) -> tf.Tensor: return prediction_scores + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "predictions", None) is not None: + with tf.name_scope(self.predictions.name): + self.predictions.build(None) -class TFBertNSPHead(tf.keras.layers.Layer): + +class TFBertNSPHead(keras.layers.Layer): def __init__(self, config: BertConfig, **kwargs): super().__init__(**kwargs) - self.seq_relationship = tf.keras.layers.Dense( + self.seq_relationship = keras.layers.Dense( units=2, kernel_initializer=get_initializer(config.initializer_range), 
name="seq_relationship", ) + self.config = config def call(self, pooled_output: tf.Tensor) -> tf.Tensor: seq_relationship_score = self.seq_relationship(inputs=pooled_output) return seq_relationship_score + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "seq_relationship", None) is not None: + with tf.name_scope(self.seq_relationship.name): + self.seq_relationship.build([None, None, self.config.hidden_size]) + @keras_serializable -class TFBertMainLayer(tf.keras.layers.Layer): +class TFBertMainLayer(keras.layers.Layer): config_class = BertConfig def __init__(self, config: BertConfig, add_pooling_layer: bool = True, **kwargs): @@ -727,7 +851,7 @@ def __init__(self, config: BertConfig, add_pooling_layer: bool = True, **kwargs) self.encoder = TFBertEncoder(config, name="encoder") self.pooler = TFBertPooler(config, name="pooler") if add_pooling_layer else None - def get_input_embeddings(self) -> tf.keras.layers.Layer: + def get_input_embeddings(self) -> keras.layers.Layer: return self.embeddings def set_input_embeddings(self, value: tf.Variable): @@ -744,14 +868,14 @@ class PreTrainedModel @unpack_inputs def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, - encoder_hidden_states: Optional[Union[np.ndarray, tf.Tensor]] = None, - encoder_attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, + encoder_hidden_states: np.ndarray | tf.Tensor | None = None, + encoder_attention_mask: np.ndarray | tf.Tensor | None = None, past_key_values: Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None, use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None, @@ -899,6 +1023,20 @@ def call( cross_attentions=encoder_outputs.cross_attentions, ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "embeddings", None) is not None: + with tf.name_scope(self.embeddings.name): + self.embeddings.build(None) + if getattr(self, "encoder", None) is not None: + with tf.name_scope(self.encoder.name): + self.encoder.build(None) + if getattr(self, "pooler", None) is not None: + with tf.name_scope(self.pooler.name): + self.pooler.build(None) + class TFBertPreTrainedModel(TFPreTrainedModel): """ @@ -909,24 +1047,6 @@ class TFBertPreTrainedModel(TFPreTrainedModel): config_class = BertConfig base_model_prefix = "bert" - @property - def dummy_inputs(self): - """ - Dummy inputs to build the network. - - Returns: - `Dict[str, tf.Tensor]`: The dummy inputs. 
- """ - dummy = {"input_ids": tf.constant(DUMMY_INPUTS, dtype=tf.int32)} - # Add `encoder_hidden_states` to make the cross-attention layers' weights initialized - if self.config.add_cross_attention: - batch_size, seq_len = tf.constant(DUMMY_INPUTS).shape - shape = (batch_size, seq_len) + (self.config.hidden_size,) - h = tf.random.uniform(shape=shape) - dummy["encoder_hidden_states"] = h - - return dummy - @dataclass class TFBertForPreTrainingOutput(ModelOutput): @@ -952,7 +1072,7 @@ class TFBertForPreTrainingOutput(ModelOutput): heads. """ - loss: Optional[tf.Tensor] = None + loss: tf.Tensor | None = None prediction_logits: tf.Tensor = None seq_relationship_logits: tf.Tensor = None hidden_states: Optional[Union[Tuple[tf.Tensor], tf.Tensor]] = None @@ -965,7 +1085,7 @@ class TFBertForPreTrainingOutput(ModelOutput): library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.) - This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it + This model is also a [keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage and behavior. @@ -1076,14 +1196,14 @@ def __init__(self, config: BertConfig, *inputs, **kwargs): ) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, - encoder_hidden_states: Optional[Union[np.ndarray, tf.Tensor]] = None, - encoder_attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, + encoder_hidden_states: np.ndarray | tf.Tensor | None = None, + encoder_attention_mask: np.ndarray | tf.Tensor | None = None, past_key_values: Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None, use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None, @@ -1129,25 +1249,13 @@ def call( ) return outputs - def serving_output( - self, output: TFBaseModelOutputWithPoolingAndCrossAttentions - ) -> TFBaseModelOutputWithPoolingAndCrossAttentions: - output_cache = self.config.use_cache and self.config.is_decoder - pkv = tf.convert_to_tensor(output.past_key_values) if output_cache else None - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - cross_attns = tf.convert_to_tensor(output.cross_attentions) if output.cross_attentions is not None else None - if not (self.config.output_attentions and self.config.add_cross_attention): - cross_attns = None - - return TFBaseModelOutputWithPoolingAndCrossAttentions( - last_hidden_state=output.last_hidden_state, - pooler_output=output.pooler_output, - past_key_values=pkv, - hidden_states=hs, - attentions=attns, - cross_attentions=cross_attns, - ) + def build(self, input_shape=None): + if self.built: + return + 
self.built = True + if getattr(self, "bert", None) is not None: + with tf.name_scope(self.bert.name): + self.bert.build(None) @add_start_docstrings( @@ -1172,7 +1280,7 @@ def __init__(self, config: BertConfig, *inputs, **kwargs): self.nsp = TFBertNSPHead(config, name="nsp___cls") self.mlm = TFBertMLMHead(config, input_embeddings=self.bert.embeddings, name="mlm___cls") - def get_lm_head(self) -> tf.keras.layers.Layer: + def get_lm_head(self) -> keras.layers.Layer: return self.mlm.predictions def get_prefix_bias_name(self) -> str: @@ -1184,17 +1292,17 @@ def get_prefix_bias_name(self) -> str: @replace_return_docstrings(output_type=TFBertForPreTrainingOutput, config_class=_CONFIG_FOR_DOC) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - labels: Optional[Union[np.ndarray, tf.Tensor]] = None, - next_sentence_label: Optional[Union[np.ndarray, tf.Tensor]] = None, + labels: np.ndarray | tf.Tensor | None = None, + next_sentence_label: np.ndarray | tf.Tensor | None = None, training: Optional[bool] = False, ) -> Union[TFBertForPreTrainingOutput, Tuple[tf.Tensor]]: r""" @@ -1219,8 +1327,8 @@ def call( >>> import tensorflow as tf >>> from transformers import AutoTokenizer, TFBertForPreTraining - >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") - >>> model = TFBertForPreTraining.from_pretrained("bert-base-uncased") + >>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") + >>> model = TFBertForPreTraining.from_pretrained("google-bert/bert-base-uncased") >>> input_ids = tokenizer("Hello, my dog is cute", add_special_tokens=True, return_tensors="tf") >>> # Batch size 1 @@ -1261,16 +1369,19 @@ def call( attentions=outputs.attentions, ) - def serving_output(self, output: TFBertForPreTrainingOutput) -> TFBertForPreTrainingOutput: - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - - return TFBertForPreTrainingOutput( - prediction_logits=output.prediction_logits, - seq_relationship_logits=output.seq_relationship_logits, - hidden_states=hs, - attentions=attns, - ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "bert", None) is not None: + with tf.name_scope(self.bert.name): + self.bert.build(None) + if getattr(self, "nsp", None) is not None: + with tf.name_scope(self.nsp.name): + self.nsp.build(None) + if getattr(self, "mlm", None) is not None: + with tf.name_scope(self.mlm.name): + self.mlm.build(None) @add_start_docstrings("""Bert Model with a `language modeling` head on top.""", BERT_START_DOCSTRING) @@ -1295,7 +1406,7 @@ def __init__(self, config: BertConfig, *inputs, **kwargs): self.bert = TFBertMainLayer(config, 
add_pooling_layer=False, name="bert") self.mlm = TFBertMLMHead(config, input_embeddings=self.bert.embeddings, name="mlm___cls") - def get_lm_head(self) -> tf.keras.layers.Layer: + def get_lm_head(self) -> keras.layers.Layer: return self.mlm.predictions def get_prefix_bias_name(self) -> str: @@ -1313,16 +1424,16 @@ def get_prefix_bias_name(self) -> str: ) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - labels: Optional[Union[np.ndarray, tf.Tensor]] = None, + labels: np.ndarray | tf.Tensor | None = None, training: Optional[bool] = False, ) -> Union[TFMaskedLMOutput, Tuple[tf.Tensor]]: r""" @@ -1358,11 +1469,16 @@ def call( attentions=outputs.attentions, ) - def serving_output(self, output: TFMaskedLMOutput) -> TFMaskedLMOutput: - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - - return TFMaskedLMOutput(logits=output.logits, hidden_states=hs, attentions=attns) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "bert", None) is not None: + with tf.name_scope(self.bert.name): + self.bert.build(None) + if getattr(self, "mlm", None) is not None: + with tf.name_scope(self.mlm.name): + self.mlm.build(None) class TFBertLMHeadModel(TFBertPreTrainedModel, TFCausalLanguageModelingLoss): @@ -1383,7 +1499,7 @@ def __init__(self, config: BertConfig, *inputs, **kwargs): self.bert = TFBertMainLayer(config, add_pooling_layer=False, name="bert") self.mlm = TFBertMLMHead(config, input_embeddings=self.bert.embeddings, name="mlm___cls") - def get_lm_head(self) -> tf.keras.layers.Layer: + def get_lm_head(self) -> keras.layers.Layer: return self.mlm.predictions def get_prefix_bias_name(self) -> str: @@ -1410,20 +1526,20 @@ def prepare_inputs_for_generation(self, input_ids, past_key_values=None, attenti ) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, - encoder_hidden_states: Optional[Union[np.ndarray, tf.Tensor]] = None, - encoder_attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, + encoder_hidden_states: np.ndarray 
| tf.Tensor | None = None, + encoder_attention_mask: np.ndarray | tf.Tensor | None = None, past_key_values: Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None, use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - labels: Optional[Union[np.ndarray, tf.Tensor]] = None, + labels: np.ndarray | tf.Tensor | None = None, training: Optional[bool] = False, **kwargs, ) -> Union[TFCausalLMOutputWithCrossAttentions, Tuple[tf.Tensor]]: @@ -1489,18 +1605,16 @@ def call( cross_attentions=outputs.cross_attentions, ) - def serving_output(self, output: TFCausalLMOutputWithCrossAttentions) -> TFCausalLMOutputWithCrossAttentions: - output_cache = self.config.use_cache and self.config.is_decoder - pkv = tf.convert_to_tensor(output.past_key_values) if output_cache else None - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - cross_attns = tf.convert_to_tensor(output.cross_attentions) if output.cross_attentions is not None else None - if not (self.config.output_attentions and self.config.add_cross_attention): - cross_attns = None - - return TFCausalLMOutputWithCrossAttentions( - logits=output.logits, past_key_values=pkv, hidden_states=hs, attentions=attns, cross_attentions=cross_attns - ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "bert", None) is not None: + with tf.name_scope(self.bert.name): + self.bert.build(None) + if getattr(self, "mlm", None) is not None: + with tf.name_scope(self.mlm.name): + self.mlm.build(None) @add_start_docstrings( @@ -1522,16 +1636,16 @@ def __init__(self, config: BertConfig, *inputs, **kwargs): @replace_return_docstrings(output_type=TFNextSentencePredictorOutput, config_class=_CONFIG_FOR_DOC) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - next_sentence_label: Optional[Union[np.ndarray, tf.Tensor]] = None, + next_sentence_label: np.ndarray | tf.Tensor | None = None, training: Optional[bool] = False, ) -> Union[TFNextSentencePredictorOutput, Tuple[tf.Tensor]]: r""" @@ -1543,8 +1657,8 @@ def call( >>> import tensorflow as tf >>> from transformers import AutoTokenizer, TFBertForNextSentencePrediction - >>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") - >>> model = TFBertForNextSentencePrediction.from_pretrained("bert-base-uncased") + >>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") + >>> model = TFBertForNextSentencePrediction.from_pretrained("google-bert/bert-base-uncased") >>> prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented 
unsliced." >>> next_sentence = "The sky is blue due to the shorter wavelength of blue light." @@ -1584,11 +1698,16 @@ def call( attentions=outputs.attentions, ) - def serving_output(self, output: TFNextSentencePredictorOutput) -> TFNextSentencePredictorOutput: - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - - return TFNextSentencePredictorOutput(logits=output.logits, hidden_states=hs, attentions=attns) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "bert", None) is not None: + with tf.name_scope(self.bert.name): + self.bert.build(None) + if getattr(self, "nsp", None) is not None: + with tf.name_scope(self.nsp.name): + self.nsp.build(None) @add_start_docstrings( @@ -1612,12 +1731,13 @@ def __init__(self, config: BertConfig, *inputs, **kwargs): classifier_dropout = ( config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob ) - self.dropout = tf.keras.layers.Dropout(rate=classifier_dropout) - self.classifier = tf.keras.layers.Dense( + self.dropout = keras.layers.Dropout(rate=classifier_dropout) + self.classifier = keras.layers.Dense( units=config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier", ) + self.config = config @unpack_inputs @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length")) @@ -1630,16 +1750,16 @@ def __init__(self, config: BertConfig, *inputs, **kwargs): ) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - labels: Optional[Union[np.ndarray, tf.Tensor]] = None, + labels: np.ndarray | tf.Tensor | None = None, training: Optional[bool] = False, ) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]: r""" @@ -1676,11 +1796,16 @@ def call( attentions=outputs.attentions, ) - def serving_output(self, output: TFSequenceClassifierOutput) -> TFSequenceClassifierOutput: - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - - return TFSequenceClassifierOutput(logits=output.logits, hidden_states=hs, attentions=attns) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "bert", None) is not None: + with tf.name_scope(self.bert.name): + self.bert.build(None) + if getattr(self, "classifier", None) is not None: + with tf.name_scope(self.classifier.name): + self.classifier.build([None, None, self.config.hidden_size]) @add_start_docstrings( @@ -1699,20 +1824,11 @@ def __init__(self, config: BertConfig, *inputs, **kwargs): 
super().__init__(config, *inputs, **kwargs) self.bert = TFBertMainLayer(config, name="bert") - self.dropout = tf.keras.layers.Dropout(rate=config.hidden_dropout_prob) - self.classifier = tf.keras.layers.Dense( + self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob) + self.classifier = keras.layers.Dense( units=1, kernel_initializer=get_initializer(config.initializer_range), name="classifier" ) - - @property - def dummy_inputs(self) -> Dict[str, tf.Tensor]: - """ - Dummy inputs to build the network. - - Returns: - tf.Tensor with dummy inputs - """ - return {"input_ids": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS, dtype=tf.int32)} + self.config = config @unpack_inputs @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length")) @@ -1723,16 +1839,16 @@ def dummy_inputs(self) -> Dict[str, tf.Tensor]: ) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - labels: Optional[Union[np.ndarray, tf.Tensor]] = None, + labels: np.ndarray | tf.Tensor | None = None, training: Optional[bool] = False, ) -> Union[TFMultipleChoiceModelOutput, Tuple[tf.Tensor]]: r""" @@ -1791,25 +1907,16 @@ def call( attentions=outputs.attentions, ) - @tf.function( - input_signature=[ - { - "input_ids": tf.TensorSpec((None, None, None), tf.int32, name="input_ids"), - "attention_mask": tf.TensorSpec((None, None, None), tf.int32, name="attention_mask"), - "token_type_ids": tf.TensorSpec((None, None, None), tf.int32, name="token_type_ids"), - } - ] - ) - def serving(self, inputs: Dict[str, tf.Tensor]) -> TFMultipleChoiceModelOutput: - output = self.call(input_ids=inputs) - - return self.serving_output(output) - - def serving_output(self, output: TFMultipleChoiceModelOutput) -> TFMultipleChoiceModelOutput: - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - - return TFMultipleChoiceModelOutput(logits=output.logits, hidden_states=hs, attentions=attns) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "bert", None) is not None: + with tf.name_scope(self.bert.name): + self.bert.build(None) + if getattr(self, "classifier", None) is not None: + with tf.name_scope(self.classifier.name): + self.classifier.build([None, None, self.config.hidden_size]) @add_start_docstrings( @@ -1839,12 +1946,13 @@ def __init__(self, config: BertConfig, *inputs, **kwargs): classifier_dropout = ( config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob ) - self.dropout = tf.keras.layers.Dropout(rate=classifier_dropout) - self.classifier = tf.keras.layers.Dense( + self.dropout = 
keras.layers.Dropout(rate=classifier_dropout) + self.classifier = keras.layers.Dense( units=config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier", ) + self.config = config @unpack_inputs @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length")) @@ -1857,16 +1965,16 @@ def __init__(self, config: BertConfig, *inputs, **kwargs): ) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - labels: Optional[Union[np.ndarray, tf.Tensor]] = None, + labels: np.ndarray | tf.Tensor | None = None, training: Optional[bool] = False, ) -> Union[TFTokenClassifierOutput, Tuple[tf.Tensor]]: r""" @@ -1901,11 +2009,16 @@ def call( attentions=outputs.attentions, ) - def serving_output(self, output: TFTokenClassifierOutput) -> TFTokenClassifierOutput: - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - - return TFTokenClassifierOutput(logits=output.logits, hidden_states=hs, attentions=attns) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "bert", None) is not None: + with tf.name_scope(self.bert.name): + self.bert.build(None) + if getattr(self, "classifier", None) is not None: + with tf.name_scope(self.classifier.name): + self.classifier.build([None, None, self.config.hidden_size]) @add_start_docstrings( @@ -1931,11 +2044,12 @@ def __init__(self, config: BertConfig, *inputs, **kwargs): self.num_labels = config.num_labels self.bert = TFBertMainLayer(config, add_pooling_layer=False, name="bert") - self.qa_outputs = tf.keras.layers.Dense( + self.qa_outputs = keras.layers.Dense( units=config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="qa_outputs", ) + self.config = config @unpack_inputs @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length")) @@ -1950,17 +2064,17 @@ def __init__(self, config: BertConfig, *inputs, **kwargs): ) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, 
output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - start_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, - end_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, + start_positions: np.ndarray | tf.Tensor | None = None, + end_positions: np.ndarray | tf.Tensor | None = None, training: Optional[bool] = False, ) -> Union[TFQuestionAnsweringModelOutput, Tuple[tf.Tensor]]: r""" @@ -2009,10 +2123,13 @@ def call( attentions=outputs.attentions, ) - def serving_output(self, output: TFQuestionAnsweringModelOutput) -> TFQuestionAnsweringModelOutput: - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - - return TFQuestionAnsweringModelOutput( - start_logits=output.start_logits, end_logits=output.end_logits, hidden_states=hs, attentions=attns - ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "bert", None) is not None: + with tf.name_scope(self.bert.name): + self.bert.build(None) + if getattr(self, "qa_outputs", None) is not None: + with tf.name_scope(self.qa_outputs.name): + self.qa_outputs.build([None, None, self.config.hidden_size]) diff --git a/src/transformers/models/bert/tokenization_bert.py b/src/transformers/models/bert/tokenization_bert.py index 8d13bb4e546c22..c95e9ff0f8b43c 100644 --- a/src/transformers/models/bert/tokenization_bert.py +++ b/src/transformers/models/bert/tokenization_bert.py @@ -30,34 +30,34 @@ PRETRAINED_VOCAB_FILES_MAP = { "vocab_file": { - "bert-base-uncased": "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt", - "bert-large-uncased": "https://huggingface.co/bert-large-uncased/resolve/main/vocab.txt", - "bert-base-cased": "https://huggingface.co/bert-base-cased/resolve/main/vocab.txt", - "bert-large-cased": "https://huggingface.co/bert-large-cased/resolve/main/vocab.txt", - "bert-base-multilingual-uncased": ( - "https://huggingface.co/bert-base-multilingual-uncased/resolve/main/vocab.txt" + "google-bert/bert-base-uncased": "https://huggingface.co/google-bert/bert-base-uncased/resolve/main/vocab.txt", + "google-bert/bert-large-uncased": "https://huggingface.co/google-bert/bert-large-uncased/resolve/main/vocab.txt", + "google-bert/bert-base-cased": "https://huggingface.co/google-bert/bert-base-cased/resolve/main/vocab.txt", + "google-bert/bert-large-cased": "https://huggingface.co/google-bert/bert-large-cased/resolve/main/vocab.txt", + "google-bert/bert-base-multilingual-uncased": ( + "https://huggingface.co/google-bert/bert-base-multilingual-uncased/resolve/main/vocab.txt" ), - "bert-base-multilingual-cased": "https://huggingface.co/bert-base-multilingual-cased/resolve/main/vocab.txt", - "bert-base-chinese": "https://huggingface.co/bert-base-chinese/resolve/main/vocab.txt", - "bert-base-german-cased": "https://huggingface.co/bert-base-german-cased/resolve/main/vocab.txt", - "bert-large-uncased-whole-word-masking": ( - "https://huggingface.co/bert-large-uncased-whole-word-masking/resolve/main/vocab.txt" + "google-bert/bert-base-multilingual-cased": "https://huggingface.co/google-bert/bert-base-multilingual-cased/resolve/main/vocab.txt", + "google-bert/bert-base-chinese": "https://huggingface.co/google-bert/bert-base-chinese/resolve/main/vocab.txt", + "google-bert/bert-base-german-cased": "https://huggingface.co/google-bert/bert-base-german-cased/resolve/main/vocab.txt", + 
"google-bert/bert-large-uncased-whole-word-masking": ( + "https://huggingface.co/google-bert/bert-large-uncased-whole-word-masking/resolve/main/vocab.txt" ), - "bert-large-cased-whole-word-masking": ( - "https://huggingface.co/bert-large-cased-whole-word-masking/resolve/main/vocab.txt" + "google-bert/bert-large-cased-whole-word-masking": ( + "https://huggingface.co/google-bert/bert-large-cased-whole-word-masking/resolve/main/vocab.txt" ), - "bert-large-uncased-whole-word-masking-finetuned-squad": ( - "https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad/resolve/main/vocab.txt" + "google-bert/bert-large-uncased-whole-word-masking-finetuned-squad": ( + "https://huggingface.co/google-bert/bert-large-uncased-whole-word-masking-finetuned-squad/resolve/main/vocab.txt" ), - "bert-large-cased-whole-word-masking-finetuned-squad": ( - "https://huggingface.co/bert-large-cased-whole-word-masking-finetuned-squad/resolve/main/vocab.txt" + "google-bert/bert-large-cased-whole-word-masking-finetuned-squad": ( + "https://huggingface.co/google-bert/bert-large-cased-whole-word-masking-finetuned-squad/resolve/main/vocab.txt" ), - "bert-base-cased-finetuned-mrpc": ( - "https://huggingface.co/bert-base-cased-finetuned-mrpc/resolve/main/vocab.txt" + "google-bert/bert-base-cased-finetuned-mrpc": ( + "https://huggingface.co/google-bert/bert-base-cased-finetuned-mrpc/resolve/main/vocab.txt" ), - "bert-base-german-dbmdz-cased": "https://huggingface.co/bert-base-german-dbmdz-cased/resolve/main/vocab.txt", - "bert-base-german-dbmdz-uncased": ( - "https://huggingface.co/bert-base-german-dbmdz-uncased/resolve/main/vocab.txt" + "google-bert/bert-base-german-dbmdz-cased": "https://huggingface.co/google-bert/bert-base-german-dbmdz-cased/resolve/main/vocab.txt", + "google-bert/bert-base-german-dbmdz-uncased": ( + "https://huggingface.co/google-bert/bert-base-german-dbmdz-uncased/resolve/main/vocab.txt" ), "TurkuNLP/bert-base-finnish-cased-v1": ( "https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1/resolve/main/vocab.txt" @@ -72,42 +72,42 @@ } PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { - "bert-base-uncased": 512, - "bert-large-uncased": 512, - "bert-base-cased": 512, - "bert-large-cased": 512, - "bert-base-multilingual-uncased": 512, - "bert-base-multilingual-cased": 512, - "bert-base-chinese": 512, - "bert-base-german-cased": 512, - "bert-large-uncased-whole-word-masking": 512, - "bert-large-cased-whole-word-masking": 512, - "bert-large-uncased-whole-word-masking-finetuned-squad": 512, - "bert-large-cased-whole-word-masking-finetuned-squad": 512, - "bert-base-cased-finetuned-mrpc": 512, - "bert-base-german-dbmdz-cased": 512, - "bert-base-german-dbmdz-uncased": 512, + "google-bert/bert-base-uncased": 512, + "google-bert/bert-large-uncased": 512, + "google-bert/bert-base-cased": 512, + "google-bert/bert-large-cased": 512, + "google-bert/bert-base-multilingual-uncased": 512, + "google-bert/bert-base-multilingual-cased": 512, + "google-bert/bert-base-chinese": 512, + "google-bert/bert-base-german-cased": 512, + "google-bert/bert-large-uncased-whole-word-masking": 512, + "google-bert/bert-large-cased-whole-word-masking": 512, + "google-bert/bert-large-uncased-whole-word-masking-finetuned-squad": 512, + "google-bert/bert-large-cased-whole-word-masking-finetuned-squad": 512, + "google-bert/bert-base-cased-finetuned-mrpc": 512, + "google-bert/bert-base-german-dbmdz-cased": 512, + "google-bert/bert-base-german-dbmdz-uncased": 512, "TurkuNLP/bert-base-finnish-cased-v1": 512, 
"TurkuNLP/bert-base-finnish-uncased-v1": 512, "wietsedv/bert-base-dutch-cased": 512, } PRETRAINED_INIT_CONFIGURATION = { - "bert-base-uncased": {"do_lower_case": True}, - "bert-large-uncased": {"do_lower_case": True}, - "bert-base-cased": {"do_lower_case": False}, - "bert-large-cased": {"do_lower_case": False}, - "bert-base-multilingual-uncased": {"do_lower_case": True}, - "bert-base-multilingual-cased": {"do_lower_case": False}, - "bert-base-chinese": {"do_lower_case": False}, - "bert-base-german-cased": {"do_lower_case": False}, - "bert-large-uncased-whole-word-masking": {"do_lower_case": True}, - "bert-large-cased-whole-word-masking": {"do_lower_case": False}, - "bert-large-uncased-whole-word-masking-finetuned-squad": {"do_lower_case": True}, - "bert-large-cased-whole-word-masking-finetuned-squad": {"do_lower_case": False}, - "bert-base-cased-finetuned-mrpc": {"do_lower_case": False}, - "bert-base-german-dbmdz-cased": {"do_lower_case": False}, - "bert-base-german-dbmdz-uncased": {"do_lower_case": True}, + "google-bert/bert-base-uncased": {"do_lower_case": True}, + "google-bert/bert-large-uncased": {"do_lower_case": True}, + "google-bert/bert-base-cased": {"do_lower_case": False}, + "google-bert/bert-large-cased": {"do_lower_case": False}, + "google-bert/bert-base-multilingual-uncased": {"do_lower_case": True}, + "google-bert/bert-base-multilingual-cased": {"do_lower_case": False}, + "google-bert/bert-base-chinese": {"do_lower_case": False}, + "google-bert/bert-base-german-cased": {"do_lower_case": False}, + "google-bert/bert-large-uncased-whole-word-masking": {"do_lower_case": True}, + "google-bert/bert-large-cased-whole-word-masking": {"do_lower_case": False}, + "google-bert/bert-large-uncased-whole-word-masking-finetuned-squad": {"do_lower_case": True}, + "google-bert/bert-large-cased-whole-word-masking-finetuned-squad": {"do_lower_case": False}, + "google-bert/bert-base-cased-finetuned-mrpc": {"do_lower_case": False}, + "google-bert/bert-base-german-dbmdz-cased": {"do_lower_case": False}, + "google-bert/bert-base-german-dbmdz-uncased": {"do_lower_case": True}, "TurkuNLP/bert-base-finnish-cased-v1": {"do_lower_case": False}, "TurkuNLP/bert-base-finnish-uncased-v1": {"do_lower_case": True}, "wietsedv/bert-base-dutch-cased": {"do_lower_case": False}, @@ -196,20 +196,6 @@ def __init__( strip_accents=None, **kwargs, ): - super().__init__( - do_lower_case=do_lower_case, - do_basic_tokenize=do_basic_tokenize, - never_split=never_split, - unk_token=unk_token, - sep_token=sep_token, - pad_token=pad_token, - cls_token=cls_token, - mask_token=mask_token, - tokenize_chinese_chars=tokenize_chinese_chars, - strip_accents=strip_accents, - **kwargs, - ) - if not os.path.isfile(vocab_file): raise ValueError( f"Can't find a vocabulary file at path '{vocab_file}'. 
To load the vocabulary from a Google pretrained" @@ -225,7 +211,22 @@ def __init__( tokenize_chinese_chars=tokenize_chinese_chars, strip_accents=strip_accents, ) - self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token) + + self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=str(unk_token)) + + super().__init__( + do_lower_case=do_lower_case, + do_basic_tokenize=do_basic_tokenize, + never_split=never_split, + unk_token=unk_token, + sep_token=sep_token, + pad_token=pad_token, + cls_token=cls_token, + mask_token=mask_token, + tokenize_chinese_chars=tokenize_chinese_chars, + strip_accents=strip_accents, + **kwargs, + ) @property def do_lower_case(self): @@ -238,10 +239,12 @@ def vocab_size(self): def get_vocab(self): return dict(self.vocab, **self.added_tokens_encoder) - def _tokenize(self, text): + def _tokenize(self, text, split_special_tokens=False): split_tokens = [] if self.do_basic_tokenize: - for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens): + for token in self.basic_tokenizer.tokenize( + text, never_split=self.all_special_tokens if not split_special_tokens else None + ): # If the token is part of the never_split set if token in self.basic_tokenizer.never_split: split_tokens.append(token) @@ -385,20 +388,30 @@ class BasicTokenizer(object): strip_accents (`bool`, *optional*): Whether or not to strip all accents. If this option is not specified, then it will be determined by the value for `lowercase` (as in the original BERT). + do_split_on_punc (`bool`, *optional*, defaults to `True`): + In some instances we want to skip the basic punctuation splitting so that later tokenization can capture + the full context of the words, such as contractions. """ - def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True, strip_accents=None): + def __init__( + self, + do_lower_case=True, + never_split=None, + tokenize_chinese_chars=True, + strip_accents=None, + do_split_on_punc=True, + ): if never_split is None: never_split = [] self.do_lower_case = do_lower_case self.never_split = set(never_split) self.tokenize_chinese_chars = tokenize_chinese_chars self.strip_accents = strip_accents + self.do_split_on_punc = do_split_on_punc def tokenize(self, text, never_split=None): """ - Basic Tokenization of a piece of text. Split on "white spaces" only, for sub-word tokenization, see - WordPieceTokenizer. + Basic Tokenization of a piece of text. For sub-word tokenization, see WordPieceTokenizer. Args: never_split (`List[str]`, *optional*) @@ -417,7 +430,9 @@ def tokenize(self, text, never_split=None): # words in the English Wikipedia.). 
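A minimal sketch of the `_tokenize` control flow introduced just above (illustrative only, not part of the patch; `basic` and `wordpiece` stand in for the already-built `BasicTokenizer` and `WordpieceTokenizer` instances):

```python
# Sketch: when split_special_tokens=True, the special tokens are no longer
# passed as never_split, so they are tokenized like ordinary text.
# Names here are placeholders, not the real method.
def tokenize_sketch(basic, wordpiece, text, all_special_tokens, split_special_tokens=False):
    never_split = all_special_tokens if not split_special_tokens else None
    out = []
    for token in basic.tokenize(text, never_split=never_split):
        if token in basic.never_split:
            out.append(token)                  # protected tokens stay whole
        else:
            out.extend(wordpiece.tokenize(token))
    return out
```

Likewise, the new `do_split_on_punc=False` option simply short-circuits `_run_split_on_punc`, so strings such as contractions reach WordPiece without being broken at punctuation.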
if self.tokenize_chinese_chars: text = self._tokenize_chinese_chars(text) - orig_tokens = whitespace_tokenize(text) + # prevents treating the same character with different unicode codepoints as different characters + unicode_normalized_text = unicodedata.normalize("NFC", text) + orig_tokens = whitespace_tokenize(unicode_normalized_text) split_tokens = [] for token in orig_tokens: if token not in never_split: @@ -445,7 +460,7 @@ def _run_strip_accents(self, text): def _run_split_on_punc(self, text, never_split=None): """Splits punctuation on a piece of text.""" - if never_split is not None and text in never_split: + if not self.do_split_on_punc or (never_split is not None and text in never_split): return [text] chars = list(text) i = 0 diff --git a/src/transformers/models/bert/tokenization_bert_fast.py b/src/transformers/models/bert/tokenization_bert_fast.py index e55f3f36ad6dd3..e7754b2fb5a128 100644 --- a/src/transformers/models/bert/tokenization_bert_fast.py +++ b/src/transformers/models/bert/tokenization_bert_fast.py @@ -30,34 +30,34 @@ PRETRAINED_VOCAB_FILES_MAP = { "vocab_file": { - "bert-base-uncased": "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt", - "bert-large-uncased": "https://huggingface.co/bert-large-uncased/resolve/main/vocab.txt", - "bert-base-cased": "https://huggingface.co/bert-base-cased/resolve/main/vocab.txt", - "bert-large-cased": "https://huggingface.co/bert-large-cased/resolve/main/vocab.txt", - "bert-base-multilingual-uncased": ( - "https://huggingface.co/bert-base-multilingual-uncased/resolve/main/vocab.txt" + "google-bert/bert-base-uncased": "https://huggingface.co/google-bert/bert-base-uncased/resolve/main/vocab.txt", + "google-bert/bert-large-uncased": "https://huggingface.co/google-bert/bert-large-uncased/resolve/main/vocab.txt", + "google-bert/bert-base-cased": "https://huggingface.co/google-bert/bert-base-cased/resolve/main/vocab.txt", + "google-bert/bert-large-cased": "https://huggingface.co/google-bert/bert-large-cased/resolve/main/vocab.txt", + "google-bert/bert-base-multilingual-uncased": ( + "https://huggingface.co/google-bert/bert-base-multilingual-uncased/resolve/main/vocab.txt" ), - "bert-base-multilingual-cased": "https://huggingface.co/bert-base-multilingual-cased/resolve/main/vocab.txt", - "bert-base-chinese": "https://huggingface.co/bert-base-chinese/resolve/main/vocab.txt", - "bert-base-german-cased": "https://huggingface.co/bert-base-german-cased/resolve/main/vocab.txt", - "bert-large-uncased-whole-word-masking": ( - "https://huggingface.co/bert-large-uncased-whole-word-masking/resolve/main/vocab.txt" + "google-bert/bert-base-multilingual-cased": "https://huggingface.co/google-bert/bert-base-multilingual-cased/resolve/main/vocab.txt", + "google-bert/bert-base-chinese": "https://huggingface.co/google-bert/bert-base-chinese/resolve/main/vocab.txt", + "google-bert/bert-base-german-cased": "https://huggingface.co/google-bert/bert-base-german-cased/resolve/main/vocab.txt", + "google-bert/bert-large-uncased-whole-word-masking": ( + "https://huggingface.co/google-bert/bert-large-uncased-whole-word-masking/resolve/main/vocab.txt" ), - "bert-large-cased-whole-word-masking": ( - "https://huggingface.co/bert-large-cased-whole-word-masking/resolve/main/vocab.txt" + "google-bert/bert-large-cased-whole-word-masking": ( + "https://huggingface.co/google-bert/bert-large-cased-whole-word-masking/resolve/main/vocab.txt" ), - "bert-large-uncased-whole-word-masking-finetuned-squad": ( - 
"https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad/resolve/main/vocab.txt" + "google-bert/bert-large-uncased-whole-word-masking-finetuned-squad": ( + "https://huggingface.co/google-bert/bert-large-uncased-whole-word-masking-finetuned-squad/resolve/main/vocab.txt" ), - "bert-large-cased-whole-word-masking-finetuned-squad": ( - "https://huggingface.co/bert-large-cased-whole-word-masking-finetuned-squad/resolve/main/vocab.txt" + "google-bert/bert-large-cased-whole-word-masking-finetuned-squad": ( + "https://huggingface.co/google-bert/bert-large-cased-whole-word-masking-finetuned-squad/resolve/main/vocab.txt" ), - "bert-base-cased-finetuned-mrpc": ( - "https://huggingface.co/bert-base-cased-finetuned-mrpc/resolve/main/vocab.txt" + "google-bert/bert-base-cased-finetuned-mrpc": ( + "https://huggingface.co/google-bert/bert-base-cased-finetuned-mrpc/resolve/main/vocab.txt" ), - "bert-base-german-dbmdz-cased": "https://huggingface.co/bert-base-german-dbmdz-cased/resolve/main/vocab.txt", - "bert-base-german-dbmdz-uncased": ( - "https://huggingface.co/bert-base-german-dbmdz-uncased/resolve/main/vocab.txt" + "google-bert/bert-base-german-dbmdz-cased": "https://huggingface.co/google-bert/bert-base-german-dbmdz-cased/resolve/main/vocab.txt", + "google-bert/bert-base-german-dbmdz-uncased": ( + "https://huggingface.co/google-bert/bert-base-german-dbmdz-uncased/resolve/main/vocab.txt" ), "TurkuNLP/bert-base-finnish-cased-v1": ( "https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1/resolve/main/vocab.txt" @@ -70,38 +70,38 @@ ), }, "tokenizer_file": { - "bert-base-uncased": "https://huggingface.co/bert-base-uncased/resolve/main/tokenizer.json", - "bert-large-uncased": "https://huggingface.co/bert-large-uncased/resolve/main/tokenizer.json", - "bert-base-cased": "https://huggingface.co/bert-base-cased/resolve/main/tokenizer.json", - "bert-large-cased": "https://huggingface.co/bert-large-cased/resolve/main/tokenizer.json", - "bert-base-multilingual-uncased": ( - "https://huggingface.co/bert-base-multilingual-uncased/resolve/main/tokenizer.json" + "google-bert/bert-base-uncased": "https://huggingface.co/google-bert/bert-base-uncased/resolve/main/tokenizer.json", + "google-bert/bert-large-uncased": "https://huggingface.co/google-bert/bert-large-uncased/resolve/main/tokenizer.json", + "google-bert/bert-base-cased": "https://huggingface.co/google-bert/bert-base-cased/resolve/main/tokenizer.json", + "google-bert/bert-large-cased": "https://huggingface.co/google-bert/bert-large-cased/resolve/main/tokenizer.json", + "google-bert/bert-base-multilingual-uncased": ( + "https://huggingface.co/google-bert/bert-base-multilingual-uncased/resolve/main/tokenizer.json" ), - "bert-base-multilingual-cased": ( - "https://huggingface.co/bert-base-multilingual-cased/resolve/main/tokenizer.json" + "google-bert/bert-base-multilingual-cased": ( + "https://huggingface.co/google-bert/bert-base-multilingual-cased/resolve/main/tokenizer.json" ), - "bert-base-chinese": "https://huggingface.co/bert-base-chinese/resolve/main/tokenizer.json", - "bert-base-german-cased": "https://huggingface.co/bert-base-german-cased/resolve/main/tokenizer.json", - "bert-large-uncased-whole-word-masking": ( - "https://huggingface.co/bert-large-uncased-whole-word-masking/resolve/main/tokenizer.json" + "google-bert/bert-base-chinese": "https://huggingface.co/google-bert/bert-base-chinese/resolve/main/tokenizer.json", + "google-bert/bert-base-german-cased": 
"https://huggingface.co/google-bert/bert-base-german-cased/resolve/main/tokenizer.json", + "google-bert/bert-large-uncased-whole-word-masking": ( + "https://huggingface.co/google-bert/bert-large-uncased-whole-word-masking/resolve/main/tokenizer.json" ), - "bert-large-cased-whole-word-masking": ( - "https://huggingface.co/bert-large-cased-whole-word-masking/resolve/main/tokenizer.json" + "google-bert/bert-large-cased-whole-word-masking": ( + "https://huggingface.co/google-bert/bert-large-cased-whole-word-masking/resolve/main/tokenizer.json" ), - "bert-large-uncased-whole-word-masking-finetuned-squad": ( - "https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad/resolve/main/tokenizer.json" + "google-bert/bert-large-uncased-whole-word-masking-finetuned-squad": ( + "https://huggingface.co/google-bert/bert-large-uncased-whole-word-masking-finetuned-squad/resolve/main/tokenizer.json" ), - "bert-large-cased-whole-word-masking-finetuned-squad": ( - "https://huggingface.co/bert-large-cased-whole-word-masking-finetuned-squad/resolve/main/tokenizer.json" + "google-bert/bert-large-cased-whole-word-masking-finetuned-squad": ( + "https://huggingface.co/google-bert/bert-large-cased-whole-word-masking-finetuned-squad/resolve/main/tokenizer.json" ), - "bert-base-cased-finetuned-mrpc": ( - "https://huggingface.co/bert-base-cased-finetuned-mrpc/resolve/main/tokenizer.json" + "google-bert/bert-base-cased-finetuned-mrpc": ( + "https://huggingface.co/google-bert/bert-base-cased-finetuned-mrpc/resolve/main/tokenizer.json" ), - "bert-base-german-dbmdz-cased": ( - "https://huggingface.co/bert-base-german-dbmdz-cased/resolve/main/tokenizer.json" + "google-bert/bert-base-german-dbmdz-cased": ( + "https://huggingface.co/google-bert/bert-base-german-dbmdz-cased/resolve/main/tokenizer.json" ), - "bert-base-german-dbmdz-uncased": ( - "https://huggingface.co/bert-base-german-dbmdz-uncased/resolve/main/tokenizer.json" + "google-bert/bert-base-german-dbmdz-uncased": ( + "https://huggingface.co/google-bert/bert-base-german-dbmdz-uncased/resolve/main/tokenizer.json" ), "TurkuNLP/bert-base-finnish-cased-v1": ( "https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1/resolve/main/tokenizer.json" @@ -116,42 +116,42 @@ } PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { - "bert-base-uncased": 512, - "bert-large-uncased": 512, - "bert-base-cased": 512, - "bert-large-cased": 512, - "bert-base-multilingual-uncased": 512, - "bert-base-multilingual-cased": 512, - "bert-base-chinese": 512, - "bert-base-german-cased": 512, - "bert-large-uncased-whole-word-masking": 512, - "bert-large-cased-whole-word-masking": 512, - "bert-large-uncased-whole-word-masking-finetuned-squad": 512, - "bert-large-cased-whole-word-masking-finetuned-squad": 512, - "bert-base-cased-finetuned-mrpc": 512, - "bert-base-german-dbmdz-cased": 512, - "bert-base-german-dbmdz-uncased": 512, + "google-bert/bert-base-uncased": 512, + "google-bert/bert-large-uncased": 512, + "google-bert/bert-base-cased": 512, + "google-bert/bert-large-cased": 512, + "google-bert/bert-base-multilingual-uncased": 512, + "google-bert/bert-base-multilingual-cased": 512, + "google-bert/bert-base-chinese": 512, + "google-bert/bert-base-german-cased": 512, + "google-bert/bert-large-uncased-whole-word-masking": 512, + "google-bert/bert-large-cased-whole-word-masking": 512, + "google-bert/bert-large-uncased-whole-word-masking-finetuned-squad": 512, + "google-bert/bert-large-cased-whole-word-masking-finetuned-squad": 512, + "google-bert/bert-base-cased-finetuned-mrpc": 512, + 
"google-bert/bert-base-german-dbmdz-cased": 512, + "google-bert/bert-base-german-dbmdz-uncased": 512, "TurkuNLP/bert-base-finnish-cased-v1": 512, "TurkuNLP/bert-base-finnish-uncased-v1": 512, "wietsedv/bert-base-dutch-cased": 512, } PRETRAINED_INIT_CONFIGURATION = { - "bert-base-uncased": {"do_lower_case": True}, - "bert-large-uncased": {"do_lower_case": True}, - "bert-base-cased": {"do_lower_case": False}, - "bert-large-cased": {"do_lower_case": False}, - "bert-base-multilingual-uncased": {"do_lower_case": True}, - "bert-base-multilingual-cased": {"do_lower_case": False}, - "bert-base-chinese": {"do_lower_case": False}, - "bert-base-german-cased": {"do_lower_case": False}, - "bert-large-uncased-whole-word-masking": {"do_lower_case": True}, - "bert-large-cased-whole-word-masking": {"do_lower_case": False}, - "bert-large-uncased-whole-word-masking-finetuned-squad": {"do_lower_case": True}, - "bert-large-cased-whole-word-masking-finetuned-squad": {"do_lower_case": False}, - "bert-base-cased-finetuned-mrpc": {"do_lower_case": False}, - "bert-base-german-dbmdz-cased": {"do_lower_case": False}, - "bert-base-german-dbmdz-uncased": {"do_lower_case": True}, + "google-bert/bert-base-uncased": {"do_lower_case": True}, + "google-bert/bert-large-uncased": {"do_lower_case": True}, + "google-bert/bert-base-cased": {"do_lower_case": False}, + "google-bert/bert-large-cased": {"do_lower_case": False}, + "google-bert/bert-base-multilingual-uncased": {"do_lower_case": True}, + "google-bert/bert-base-multilingual-cased": {"do_lower_case": False}, + "google-bert/bert-base-chinese": {"do_lower_case": False}, + "google-bert/bert-base-german-cased": {"do_lower_case": False}, + "google-bert/bert-large-uncased-whole-word-masking": {"do_lower_case": True}, + "google-bert/bert-large-cased-whole-word-masking": {"do_lower_case": False}, + "google-bert/bert-large-uncased-whole-word-masking-finetuned-squad": {"do_lower_case": True}, + "google-bert/bert-large-cased-whole-word-masking-finetuned-squad": {"do_lower_case": False}, + "google-bert/bert-base-cased-finetuned-mrpc": {"do_lower_case": False}, + "google-bert/bert-base-german-dbmdz-cased": {"do_lower_case": False}, + "google-bert/bert-base-german-dbmdz-uncased": {"do_lower_case": True}, "TurkuNLP/bert-base-finnish-cased-v1": {"do_lower_case": False}, "TurkuNLP/bert-base-finnish-uncased-v1": {"do_lower_case": True}, "wietsedv/bert-base-dutch-cased": {"do_lower_case": False}, @@ -265,7 +265,7 @@ def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None): """ output = [self.cls_token_id] + token_ids_0 + [self.sep_token_id] - if token_ids_1: + if token_ids_1 is not None: output += token_ids_1 + [self.sep_token_id] return output diff --git a/src/transformers/models/bert/tokenization_bert_tf.py b/src/transformers/models/bert/tokenization_bert_tf.py index e0e38d68a58c3e..ebf88eeac9bbe8 100644 --- a/src/transformers/models/bert/tokenization_bert_tf.py +++ b/src/transformers/models/bert/tokenization_bert_tf.py @@ -5,10 +5,11 @@ from tensorflow_text import BertTokenizer as BertTokenizerLayer from tensorflow_text import FastBertTokenizer, ShrinkLongestTrimmer, case_fold_utf8, combine_segments, pad_model_inputs +from ...modeling_tf_utils import keras from .tokenization_bert import BertTokenizer -class TFBertTokenizer(tf.keras.layers.Layer): +class TFBertTokenizer(keras.layers.Layer): """ This is an in-graph tokenizer for BERT. It should be initialized similarly to other tokenizers, using the `from_pretrained()` method. 
It can also be initialized with the `from_tokenizer()` method, which imports settings @@ -48,7 +49,9 @@ class TFBertTokenizer(tf.keras.layers.Layer): return_attention_mask (`bool`, *optional*, defaults to `True`): Whether to return the attention_mask. use_fast_bert_tokenizer (`bool`, *optional*, defaults to `True`): - If set to false will use standard TF Text BertTokenizer, making it servable by TF Serving. + If True, will use the FastBertTokenizer class from Tensorflow Text. If False, will use the BertTokenizer + class instead. BertTokenizer supports some additional options, but is slower and cannot be exported to + TFLite. """ def __init__( @@ -65,11 +68,12 @@ def __init__( return_token_type_ids: bool = True, return_attention_mask: bool = True, use_fast_bert_tokenizer: bool = True, + **tokenizer_kwargs, ): super().__init__() if use_fast_bert_tokenizer: self.tf_tokenizer = FastBertTokenizer( - vocab_list, token_out_type=tf.int64, lower_case_nfd_strip_accents=do_lower_case + vocab_list, token_out_type=tf.int64, lower_case_nfd_strip_accents=do_lower_case, **tokenizer_kwargs ) else: lookup_table = tf.lookup.StaticVocabularyTable( @@ -81,13 +85,15 @@ def __init__( ), num_oov_buckets=1, ) - self.tf_tokenizer = BertTokenizerLayer(lookup_table, token_out_type=tf.int64, lower_case=do_lower_case) + self.tf_tokenizer = BertTokenizerLayer( + lookup_table, token_out_type=tf.int64, lower_case=do_lower_case, **tokenizer_kwargs + ) self.vocab_list = vocab_list self.do_lower_case = do_lower_case - self.cls_token_id = cls_token_id or vocab_list.index("[CLS]") - self.sep_token_id = sep_token_id or vocab_list.index("[SEP]") - self.pad_token_id = pad_token_id or vocab_list.index("[PAD]") + self.cls_token_id = vocab_list.index("[CLS]") if cls_token_id is None else cls_token_id + self.sep_token_id = vocab_list.index("[SEP]") if sep_token_id is None else sep_token_id + self.pad_token_id = vocab_list.index("[PAD]") if pad_token_id is None else pad_token_id self.paired_trimmer = ShrinkLongestTrimmer(max_length - 3, axis=1) # Allow room for special tokens self.max_length = max_length self.padding = padding @@ -110,7 +116,7 @@ def from_tokenizer(cls, tokenizer: "PreTrainedTokenizerBase", **kwargs): # noqa ```python from transformers import AutoTokenizer, TFBertTokenizer - tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") + tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") tf_tokenizer = TFBertTokenizer.from_tokenizer(tokenizer) ``` """ @@ -124,7 +130,7 @@ def from_tokenizer(cls, tokenizer: "PreTrainedTokenizerBase", **kwargs): # noqa pad_token_id = tokenizer.pad_token_id if pad_token_id is None else pad_token_id vocab = tokenizer.get_vocab() - vocab = sorted([(wordpiece, idx) for wordpiece, idx in vocab.items()], key=lambda x: x[1]) + vocab = sorted(vocab.items(), key=lambda x: x[1]) vocab_list = [entry[0] for entry in vocab] return cls( vocab_list=vocab_list, @@ -149,7 +155,7 @@ def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], ```python from transformers import TFBertTokenizer - tf_tokenizer = TFBertTokenizer.from_pretrained("bert-base-uncased") + tf_tokenizer = TFBertTokenizer.from_pretrained("google-bert/bert-base-uncased") ``` """ try: diff --git a/src/transformers/models/bert_generation/configuration_bert_generation.py b/src/transformers/models/bert_generation/configuration_bert_generation.py index f0cb795d93615f..841aec5c0fb7ac 100644 --- a/src/transformers/models/bert_generation/configuration_bert_generation.py +++ 
b/src/transformers/models/bert_generation/configuration_bert_generation.py @@ -38,7 +38,7 @@ class BertGenerationConfig(PretrainedConfig): Number of hidden layers in the Transformer encoder. num_attention_heads (`int`, *optional*, defaults to 16): Number of attention heads for each attention layer in the Transformer encoder. - intermediate_size (`int`, *optional*, defaults to 3072): + intermediate_size (`int`, *optional*, defaults to 4096): Dimensionality of the "intermediate" (often called feed-forward) layer in the Transformer encoder. hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`): The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, @@ -54,14 +54,18 @@ class BertGenerationConfig(PretrainedConfig): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. layer_norm_eps (`float`, *optional*, defaults to 1e-12): The epsilon used by the layer normalization layers. + pad_token_id (`int`, *optional*, defaults to 0): + Padding token id. + bos_token_id (`int`, *optional*, defaults to 2): + Beginning of stream token id. + eos_token_id (`int`, *optional*, defaults to 1): + End of stream token id. position_embedding_type (`str`, *optional*, defaults to `"absolute"`): Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155). For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658). - is_decoder (`bool`, *optional*, defaults to `False`): - Whether the model is used as a decoder or not. If `False`, the model is used as an encoder. use_cache (`bool`, *optional*, defaults to `True`): Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if `config.is_decoder=True`. @@ -80,6 +84,7 @@ class BertGenerationConfig(PretrainedConfig): >>> # Accessing the model configuration >>> configuration = model.config ```""" + model_type = "bert-generation" def __init__( diff --git a/src/transformers/models/bert_generation/modeling_bert_generation.py b/src/transformers/models/bert_generation/modeling_bert_generation.py index 928cd4433e1ef5..b7250f6f7b926f 100755 --- a/src/transformers/models/bert_generation/modeling_bert_generation.py +++ b/src/transformers/models/bert_generation/modeling_bert_generation.py @@ -385,6 +385,13 @@ def forward( all_self_attentions = () if output_attentions else None all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning_once( + "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." + ) + use_cache = False + next_decoder_cache = () if use_cache else None for i, layer_module in enumerate(self.layer): if output_hidden_states: @@ -394,25 +401,15 @@ def forward( past_key_value = past_key_values[i] if past_key_values is not None else None if self.gradient_checkpointing and self.training: - if use_cache: - logger.warning( - "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." 
- ) - use_cache = False - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, past_key_value, output_attentions) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(layer_module), + layer_outputs = self._gradient_checkpointing_func( + layer_module.__call__, hidden_states, attention_mask, layer_head_mask, encoder_hidden_states, encoder_attention_mask, + past_key_value, + output_attentions, ) else: layer_outputs = layer_module( @@ -554,7 +551,9 @@ def __init__(self, config): self.dropout = nn.Dropout(config.hidden_dropout_prob) # position_ids (1, len position emb) is contiguous in memory and exported when serialized - self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1))) + self.register_buffer( + "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False + ) def forward(self, input_ids=None, position_ids=None, inputs_embeds=None, past_key_values_length=0): if input_ids is not None: @@ -586,7 +585,6 @@ class BertGenerationPreTrainedModel(PreTrainedModel): config_class = BertGenerationConfig base_model_prefix = "bert" supports_gradient_checkpointing = True - _keys_to_ignore_on_load_missing = [r"position_ids"] def _init_weights(self, module): """Initialize the weights""" @@ -604,10 +602,6 @@ def _init_weights(self, module): module.bias.data.zero_() module.weight.data.fill_(1.0) - def _set_gradient_checkpointing(self, module, value=False): - if isinstance(module, BertEncoder): - module.gradient_checkpointing = value - BERT_GENERATION_START_DOCSTRING = r""" @@ -765,6 +759,7 @@ def forward( if input_ids is not None and inputs_embeds is not None: raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") elif input_ids is not None: + self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask) input_shape = input_ids.size() elif inputs_embeds is not None: input_shape = inputs_embeds.size()[:-1] @@ -858,7 +853,7 @@ def _tie_weights(self): BERT_GENERATION_START_DOCSTRING, ) class BertGenerationDecoder(BertGenerationPreTrainedModel): - _keys_to_ignore_on_load_missing = ["lm_head.decoder.weight", "lm_head.decoder.bias", "embeddings.position_ids"] + _tied_weights_keys = ["lm_head.decoder.weight", "lm_head.decoder.bias"] def __init__(self, config): super().__init__(config) @@ -989,14 +984,25 @@ def prepare_inputs_for_generation(self, input_ids, past_key_values=None, attenti if attention_mask is None: attention_mask = input_ids.new_ones(input_shape) - # cut decoder_input_ids if past is used + # cut decoder_input_ids if past_key_values is used if past_key_values is not None: - input_ids = input_ids[:, -1:] + past_length = past_key_values[0][0].shape[2] + + # Some generation methods already pass only the last input ID + if input_ids.shape[1] > past_length: + remove_prefix_length = past_length + else: + # Default to old behavior: keep only final ID + remove_prefix_length = input_ids.shape[1] - 1 + + input_ids = input_ids[:, remove_prefix_length:] return {"input_ids": input_ids, "attention_mask": attention_mask, "past_key_values": past_key_values} def _reorder_cache(self, past_key_values, beam_idx): reordered_past = () for layer_past in past_key_values: - reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),) + reordered_past += ( + tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past), + ) return reordered_past diff 
--git a/src/transformers/models/bert_generation/tokenization_bert_generation.py b/src/transformers/models/bert_generation/tokenization_bert_generation.py index 6ef3321277f365..3b6298fcbd8f6e 100644 --- a/src/transformers/models/bert_generation/tokenization_bert_generation.py +++ b/src/transformers/models/bert_generation/tokenization_bert_generation.py @@ -51,15 +51,19 @@ class BertGenerationTokenizer(PreTrainedTokenizer): vocab_file (`str`): [SentencePiece](https://github.com/google/sentencepiece) file (generally has a *.spm* extension) that contains the vocabulary necessary to instantiate a tokenizer. - eos_token (`str`, *optional*, defaults to `""`): - The end of sequence token. bos_token (`str`, *optional*, defaults to `""`): The begin of sequence token. + eos_token (`str`, *optional*, defaults to `""`): + The end of sequence token. unk_token (`str`, *optional*, defaults to `""`): The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. pad_token (`str`, *optional*, defaults to `""`): The token used for padding, for example when batching sequences of different lengths. + sep_token (`str`, *optional*, defaults to `"<::::>"`): + The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for + sequence classification or for a text and a question for question answering. It is also used as the last + token of a sequence built with special tokens. sp_model_kwargs (`dict`, *optional*): Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things, @@ -96,6 +100,11 @@ def __init__( ) -> None: self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs + self.vocab_file = vocab_file + + self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs) + self.sp_model.Load(vocab_file) + # Add extra_ids to the special token list super().__init__( bos_token=bos_token, @@ -107,11 +116,6 @@ def __init__( **kwargs, ) - self.vocab_file = vocab_file - - self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs) - self.sp_model.Load(vocab_file) - @property def vocab_size(self): return self.sp_model.get_piece_size() diff --git a/src/transformers/models/bert_japanese/tokenization_bert_japanese.py b/src/transformers/models/bert_japanese/tokenization_bert_japanese.py index 5af9984bb9e9b0..b2d1ac19580191 100644 --- a/src/transformers/models/bert_japanese/tokenization_bert_japanese.py +++ b/src/transformers/models/bert_japanese/tokenization_bert_japanese.py @@ -22,7 +22,7 @@ from typing import Any, Dict, List, Optional, Tuple from ...tokenization_utils import PreTrainedTokenizer, _is_control, _is_punctuation, _is_whitespace -from ...utils import is_sentencepiece_available, logging +from ...utils import is_sentencepiece_available, is_sudachi_projection_available, logging if is_sentencepiece_available(): @@ -160,25 +160,6 @@ def __init__( jumanpp_kwargs=None, **kwargs, ): - super().__init__( - spm_file=spm_file, - unk_token=unk_token, - sep_token=sep_token, - pad_token=pad_token, - cls_token=cls_token, - mask_token=mask_token, - do_lower_case=do_lower_case, - do_word_tokenize=do_word_tokenize, - do_subword_tokenize=do_subword_tokenize, - word_tokenizer_type=word_tokenizer_type, - subword_tokenizer_type=subword_tokenizer_type, - never_split=never_split, - mecab_kwargs=mecab_kwargs, - sudachi_kwargs=sudachi_kwargs, - jumanpp_kwargs=jumanpp_kwargs, - 
**kwargs, - ) - if subword_tokenizer_type == "sentencepiece": if not os.path.isfile(spm_file): raise ValueError( @@ -226,13 +207,31 @@ def __init__( self.subword_tokenizer_type = subword_tokenizer_type if do_subword_tokenize: if subword_tokenizer_type == "wordpiece": - self.subword_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token) + self.subword_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=str(unk_token)) elif subword_tokenizer_type == "character": - self.subword_tokenizer = CharacterTokenizer(vocab=self.vocab, unk_token=self.unk_token) + self.subword_tokenizer = CharacterTokenizer(vocab=self.vocab, unk_token=str(unk_token)) elif subword_tokenizer_type == "sentencepiece": - self.subword_tokenizer = SentencepieceTokenizer(vocab=self.spm_file, unk_token=self.unk_token) + self.subword_tokenizer = SentencepieceTokenizer(vocab=self.spm_file, unk_token=str(unk_token)) else: raise ValueError(f"Invalid subword_tokenizer_type '{subword_tokenizer_type}' is specified.") + super().__init__( + spm_file=spm_file, + unk_token=unk_token, + sep_token=sep_token, + pad_token=pad_token, + cls_token=cls_token, + mask_token=mask_token, + do_lower_case=do_lower_case, + do_word_tokenize=do_word_tokenize, + do_subword_tokenize=do_subword_tokenize, + word_tokenizer_type=word_tokenizer_type, + subword_tokenizer_type=subword_tokenizer_type, + never_split=never_split, + mecab_kwargs=mecab_kwargs, + sudachi_kwargs=sudachi_kwargs, + jumanpp_kwargs=jumanpp_kwargs, + **kwargs, + ) @property def do_lower_case(self): @@ -543,6 +542,7 @@ def __init__( sudachi_config_path=None, sudachi_resource_dir=None, sudachi_dict_type="core", + sudachi_projection=None, ): """ Constructs a SudachiTokenizer. @@ -558,11 +558,13 @@ def __init__( **trim_whitespace**: (*optional*) boolean (default False) Whether to trim all whitespace, tab, newline from tokens. **sudachi_split_mode**: (*optional*) string - Split mode of sudachi, choose from "A", "B", "C". + Split mode of sudachi, choose from `["A", "B", "C"]`. **sudachi_config_path**: (*optional*) string **sudachi_resource_dir**: (*optional*) string **sudachi_dict_type**: (*optional*) string - dict type of sudachi, choose from "small", "core", "full". + dict type of sudachi, choose from `["small", "core", "full"]`. + **sudachi_projection**: (*optional*) string + Word projection mode of sudachi, choose from `["surface", "normalized", "reading", "dictionary", "dictionary_and_surface", "normalized_and_surface", "normalized_nouns"]`. """ self.do_lower_case = do_lower_case @@ -587,9 +589,17 @@ def __init__( else: raise ValueError("Invalid sudachi_split_mode is specified.") - self.sudachi = dictionary.Dictionary( + self.projection = sudachi_projection + + sudachi_dictionary = dictionary.Dictionary( config_path=sudachi_config_path, resource_dir=sudachi_resource_dir, dict=sudachi_dict_type - ).create(self.split_mode) + ) + if is_sudachi_projection_available(): + self.sudachi = sudachi_dictionary.create(self.split_mode, projection=self.projection) + elif self.projection is not None: + raise ImportError("You need to install sudachipy>=0.6.8 to specify `projection` field in sudachi_kwargs.") + else: + self.sudachi = sudachi_dictionary.create(self.split_mode) def tokenize(self, text, never_split=None, **kwargs): """Tokenizes a piece of text.""" @@ -748,20 +758,30 @@ class BasicTokenizer(object): strip_accents (`bool`, *optional*): Whether or not to strip all accents. 
If this option is not specified, then it will be determined by the value for `lowercase` (as in the original BERT). + do_split_on_punc (`bool`, *optional*, defaults to `True`): + In some instances we want to skip the basic punctuation splitting so that later tokenization can capture + the full context of the words, such as contractions. """ - def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True, strip_accents=None): + def __init__( + self, + do_lower_case=True, + never_split=None, + tokenize_chinese_chars=True, + strip_accents=None, + do_split_on_punc=True, + ): if never_split is None: never_split = [] self.do_lower_case = do_lower_case self.never_split = set(never_split) self.tokenize_chinese_chars = tokenize_chinese_chars self.strip_accents = strip_accents + self.do_split_on_punc = do_split_on_punc def tokenize(self, text, never_split=None): """ - Basic Tokenization of a piece of text. Split on "white spaces" only, for sub-word tokenization, see - WordPieceTokenizer. + Basic Tokenization of a piece of text. For sub-word tokenization, see WordPieceTokenizer. Args: never_split (`List[str]`, *optional*) @@ -780,7 +800,9 @@ def tokenize(self, text, never_split=None): # words in the English Wikipedia.). if self.tokenize_chinese_chars: text = self._tokenize_chinese_chars(text) - orig_tokens = whitespace_tokenize(text) + # prevents treating the same character with different unicode codepoints as different characters + unicode_normalized_text = unicodedata.normalize("NFC", text) + orig_tokens = whitespace_tokenize(unicode_normalized_text) split_tokens = [] for token in orig_tokens: if token not in never_split: @@ -808,7 +830,7 @@ def _run_strip_accents(self, text): def _run_split_on_punc(self, text, never_split=None): """Splits punctuation on a piece of text.""" - if never_split is not None and text in never_split: + if not self.do_split_on_punc or (never_split is not None and text in never_split): return [text] chars = list(text) i = 0 diff --git a/src/transformers/models/bertweet/tokenization_bertweet.py b/src/transformers/models/bertweet/tokenization_bertweet.py index 837fea136743d2..74bc040c25b13d 100644 --- a/src/transformers/models/bertweet/tokenization_bertweet.py +++ b/src/transformers/models/bertweet/tokenization_bertweet.py @@ -77,7 +77,7 @@ class BertweetTokenizer(PreTrainedTokenizer): Path to the vocabulary file. merges_file (`str`): Path to the merges file. - normalization (`bool`, *optional*, defaults to `False`) + normalization (`bool`, *optional*, defaults to `False`): Whether or not to apply a normalization preprocess. bos_token (`str`, *optional*, defaults to `""`): The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token. 
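A recurring change in this patch (the `BertJapaneseTokenizer` hunk above, and the `BertweetTokenizer`, `BertGenerationTokenizer`, and `BigBirdTokenizer` hunks elsewhere in the diff) is that backend state such as the vocab dict or the SentencePiece model is now built *before* the `super().__init__()` call, so it is already available when the base `PreTrainedTokenizer` resolves special tokens during its own initialization. The snippet below is an illustrative sketch of that ordering only; `ToyTokenizer` and its one-token-per-line vocab file are hypothetical and not taken from this patch.

```python
# Illustrative sketch (hypothetical ToyTokenizer, plain one-token-per-line vocab file):
# build the vocabulary *before* calling super().__init__(), mirroring the reordering above.
from transformers import PreTrainedTokenizer


class ToyTokenizer(PreTrainedTokenizer):
    def __init__(self, vocab_file, unk_token="<unk>", **kwargs):
        # Backend state first: the base class may look tokens up against the vocab
        # during its own __init__, so self.vocab must already exist at that point.
        self.vocab_file = vocab_file
        with open(vocab_file, encoding="utf-8") as f:
            self.vocab = {line.rstrip("\n"): idx for idx, line in enumerate(f)}
        self.ids_to_tokens = {idx: tok for tok, idx in self.vocab.items()}
        super().__init__(unk_token=unk_token, **kwargs)

    @property
    def vocab_size(self):
        return len(self.vocab)

    def get_vocab(self):
        return dict(self.vocab)

    def _tokenize(self, text):
        # Whitespace splitting keeps the sketch minimal.
        return text.split()

    def _convert_token_to_id(self, token):
        # Assumes "<unk>" is present in the vocab file.
        return self.vocab.get(token, self.vocab.get(str(self.unk_token)))

    def _convert_id_to_token(self, index):
        return self.ids_to_tokens.get(index, str(self.unk_token))
```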
@@ -134,18 +134,6 @@ def __init__( mask_token="", **kwargs, ): - super().__init__( - normalization=normalization, - bos_token=bos_token, - eos_token=eos_token, - sep_token=sep_token, - cls_token=cls_token, - unk_token=unk_token, - pad_token=pad_token, - mask_token=mask_token, - **kwargs, - ) - try: from emoji import demojize @@ -161,10 +149,10 @@ def __init__( self.merges_file = merges_file self.encoder = {} - self.encoder[self.bos_token] = 0 - self.encoder[self.pad_token] = 1 - self.encoder[self.eos_token] = 2 - self.encoder[self.unk_token] = 3 + self.encoder[str(bos_token)] = 0 + self.encoder[str(pad_token)] = 1 + self.encoder[str(eos_token)] = 2 + self.encoder[str(unk_token)] = 3 self.add_from_file(vocab_file) @@ -178,9 +166,20 @@ def __init__( self.normalization = normalization self.tweetPreprocessor = TweetTokenizer() - self.special_puncts = {"’": "'", "…": "..."} + super().__init__( + normalization=normalization, + bos_token=bos_token, + eos_token=eos_token, + sep_token=sep_token, + cls_token=cls_token, + unk_token=unk_token, + pad_token=pad_token, + mask_token=mask_token, + **kwargs, + ) + def build_inputs_with_special_tokens( self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None ) -> List[int]: @@ -318,7 +317,7 @@ def _tokenize(self, text): split_tokens = [] words = re.findall(r"\S+\n?", text) for token in words: - split_tokens.extend([t for t in self.bpe(token).split(" ")]) + split_tokens.extend(list(self.bpe(token).split(" "))) return split_tokens def normalizeTweet(self, tweet): @@ -398,8 +397,12 @@ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["merges_file"] ) - if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file): + if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file): copyfile(self.vocab_file, out_vocab_file) + elif not os.path.isfile(self.vocab_file): + with open(out_vocab_file, "wb") as fi: + content_spiece_model = self.sp_model.serialized_model_proto() + fi.write(content_spiece_model) if os.path.abspath(self.merges_file) != os.path.abspath(out_merge_file): copyfile(self.merges_file, out_merge_file) @@ -640,9 +643,17 @@ def _replace_html_entities(text, keep=(), remove_illegal=True, encoding="utf-8") See https://github.com/scrapy/w3lib/blob/master/w3lib/html.py - >>> from nltk.tokenize.casual import _replace_html_entities >>> _replace_html_entities(b'Price: £100') - 'Price: \\xa3100' >>> print(_replace_html_entities(b'Price: £100')) Price: £100 >>> - """ + Examples: + + ```python + >>> from nltk.tokenize.casual import _replace_html_entities + + >>> _replace_html_entities(b"Price: £100") + 'Price: \\xa3100' + + >>> print(_replace_html_entities(b"Price: £100")) + Price: £100 + ```""" def _convert_entity(match): entity_body = match.group(3) @@ -726,7 +737,7 @@ def tokenize(self, text): words = WORD_RE.findall(safe_text) # Possibly alter the case, but avoid changing emoticons like :D into :d: if not self.preserve_case: - words = list(map((lambda x: x if EMOTICON_RE.search(x) else x.lower()), words)) + words = [x if EMOTICON_RE.search(x) else x.lower() for x in words] return words diff --git a/src/transformers/models/big_bird/configuration_big_bird.py b/src/transformers/models/big_bird/configuration_big_bird.py index 53bf1ee6f44b75..9802e758539858 100644 --- a/src/transformers/models/big_bird/configuration_big_bird.py +++ 
b/src/transformers/models/big_bird/configuration_big_bird.py @@ -58,7 +58,7 @@ class BigBirdConfig(PretrainedConfig): The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, `"relu"`, `"selu"` and `"gelu_new"` are supported. hidden_dropout_prob (`float`, *optional*, defaults to 0.1): - The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler. + The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1): The dropout ratio for the attention probabilities. max_position_embeddings (`int`, *optional*, defaults to 4096): @@ -104,6 +104,7 @@ class BigBirdConfig(PretrainedConfig): >>> # Accessing the model configuration >>> configuration = model.config ```""" + model_type = "big_bird" def __init__( diff --git a/src/transformers/models/big_bird/modeling_big_bird.py b/src/transformers/models/big_bird/modeling_big_bird.py index cdb9e787b791e3..008985f760e867 100755 --- a/src/transformers/models/big_bird/modeling_big_bird.py +++ b/src/transformers/models/big_bird/modeling_big_bird.py @@ -227,7 +227,7 @@ def load_tf_weights_trivia_qa(init_vars): raise ValueError( f"Pointer shape {pointer.shape} and array shape {array.shape} mismatched of {txt_name}." ) - except AssertionError as e: + except ValueError as e: e.args += (pointer.shape, array.shape) raise pt_weight_name = ".".join(pt_name) @@ -257,7 +257,9 @@ def __init__(self, config): self.dropout = nn.Dropout(config.hidden_dropout_prob) # position_ids (1, len position emb) is contiguous in memory and exported when serialized self.position_embedding_type = getattr(config, "position_embedding_type", "absolute") - self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1))) + self.register_buffer( + "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False + ) self.register_buffer( "token_type_ids", torch.zeros(self.position_ids.size(), dtype=torch.long), persistent=False ) @@ -894,15 +896,11 @@ def bigbird_block_sparse_attention( # global keys (corresponding to 1st key block) attention_probs[:, :, 2 * from_block_size : -2 * from_block_size, :to_block_size] = attn_weights[ :, :, :, :, :to_block_size - ].view( - bsz, n_heads, -1, to_block_size - ) # first_band_product + ].view(bsz, n_heads, -1, to_block_size) # first_band_product # global keys (corresponding to last key block) attention_probs[:, :, 2 * from_block_size : -2 * from_block_size, -to_block_size:] = attn_weights[ :, :, :, :, -to_block_size: - ].view( - bsz, n_heads, -1, to_block_size - ) # last_band_product + ].view(bsz, n_heads, -1, to_block_size) # last_band_product # random keys for p1, i1, w1 in zip(range(bsz), rand_attn, attn_weights): # p1, i1, w1 corresponds to batch_dim i.e. 
following operation is done for each sequence in batch @@ -971,11 +969,8 @@ def torch_gather_b2(params, indices): num_indices_to_gather = indices.shape[-2] * indices.shape[-1] num_indices_to_pick_from = params.shape[2] - indices_shift = ( - torch.arange(indices.shape[0] * indices.shape[1] * num_indices_to_gather, device=indices.device) - // num_indices_to_gather - * num_indices_to_pick_from - ) + shift = torch.arange(indices.shape[0] * indices.shape[1] * num_indices_to_gather, device=indices.device) + indices_shift = torch.div(shift, num_indices_to_gather, rounding_mode="floor") * num_indices_to_pick_from flattened_indices = indices.view(-1) + indices_shift flattened_params = params.reshape(-1, params.shape[-2], params.shape[-1]) @@ -1055,9 +1050,8 @@ def _get_rand_attn_plan(from_seq_length, from_block_size, num_rand_blocks): return plan_from_length, plan_num_rand_blocks - @staticmethod def _bigbird_block_rand_mask( - from_seq_length, to_seq_length, from_block_size, to_block_size, num_rand_blocks, last_idx=-1 + self, from_seq_length, to_seq_length, from_block_size, to_block_size, num_rand_blocks, last_idx=-1 ): """ Create adjacency list of random attention. @@ -1080,6 +1074,9 @@ def _bigbird_block_rand_mask( raise ValueError("Error the number of blocks needs to be same!") rand_attn = np.zeros((from_seq_length // from_block_size - 2, num_rand_blocks), dtype=np.int32) + # During inference (eval) no randomness + if not self.training: + return rand_attn middle_seq = np.arange(1, to_seq_length // to_block_size - 1, dtype=np.int32) last = to_seq_length // to_block_size - 1 if last_idx > (2 * to_block_size): @@ -1163,11 +1160,17 @@ def _bigbird_block_rand_mask_with_head( plan_block_length = np.array(plan_from_length) // from_block_size # till when to follow plan max_plan_idx = plan_from_length.index(from_seq_length) + # Random Attention adjacency list rand_attn = [ np.zeros((num_blocks, np.sum(plan_num_rand_blocks[: max_plan_idx + 1])), dtype=np.int32) for i in range(num_heads) ] + # During inference (eval) no randomness + if not self.training: + for nh in range(num_heads): + rand_attn[nh] = rand_attn[nh][global_block_top : num_blocks - global_block_bottom, :] + return rand_attn # We will go iteratively over the plan blocks and pick random number of # Attention blocks from the legally allowed blocks @@ -1356,7 +1359,6 @@ def set_attention_type(self, value: str): attn_weights.key = self.self.key self.self = attn_weights self.attention_type = value - if not self.training: self.self.eval() @@ -1383,7 +1385,6 @@ def forward( from_mask = from_mask.to(hidden_states.dtype) if to_mask is not None: to_mask = to_mask.to(hidden_states.dtype) - if self.attention_type == "original_full": self_outputs = self.self( hidden_states, @@ -1595,6 +1596,13 @@ def forward( all_self_attentions = () if output_attentions else None all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning_once( + "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." + ) + use_cache = False + next_decoder_cache = () if use_cache else None for i, layer_module in enumerate(self.layer): @@ -1605,20 +1613,8 @@ def forward( past_key_value = past_key_values[i] if past_key_values is not None else None if self.gradient_checkpointing and self.training: - if use_cache: - logger.warning( - "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." 
- ) - use_cache = False - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, past_key_value, output_attentions) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(layer_module), + layer_outputs = self._gradient_checkpointing_func( + layer_module.__call__, hidden_states, attention_mask, layer_head_mask, @@ -1628,6 +1624,8 @@ def custom_forward(*inputs): from_mask, to_mask, blocked_encoder_mask, + past_key_value, + output_attentions, ) else: layer_outputs = layer_module( @@ -1760,7 +1758,6 @@ class BigBirdPreTrainedModel(PreTrainedModel): load_tf_weights = load_tf_weights_in_big_bird base_model_prefix = "bert" supports_gradient_checkpointing = True - _keys_to_ignore_on_load_missing = [r"position_ids"] def _init_weights(self, module): """Initialize the weights""" @@ -1778,10 +1775,6 @@ def _init_weights(self, module): module.bias.data.zero_() module.weight.data.fill_(1.0) - def _set_gradient_checkpointing(self, module, value=False): - if isinstance(module, BigBirdEncoder): - module.gradient_checkpointing = value - BIG_BIRD_START_DOCSTRING = r""" This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) sub-class. Use @@ -2029,6 +2022,7 @@ def forward( if input_ids is not None and inputs_embeds is not None: raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") elif input_ids is not None: + self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask) input_shape = input_ids.size() elif inputs_embeds is not None: input_shape = inputs_embeds.size()[:-1] @@ -2229,7 +2223,7 @@ def _pad_to_block_size( padding_len = (block_size - seq_len % block_size) % block_size if padding_len > 0: - logger.info( + logger.warning_once( f"Input ids are automatically padded from {seq_len} to {seq_len + padding_len} to be a multiple of " f"`config.block_size`: {block_size}" ) @@ -2256,7 +2250,7 @@ def _pad_to_block_size( class BigBirdForPreTraining(BigBirdPreTrainedModel): - _keys_to_ignore_on_load_missing = ["cls.predictions.decoder.weight", "cls.predictions.decoder.bias"] + _tied_weights_keys = ["cls.predictions.decoder.weight", "cls.predictions.decoder.bias"] def __init__(self, config): super().__init__(config) @@ -2362,7 +2356,7 @@ def forward( @add_start_docstrings("""BigBird Model with a `language modeling` head on top.""", BIG_BIRD_START_DOCSTRING) class BigBirdForMaskedLM(BigBirdPreTrainedModel): - _keys_to_ignore_on_load_missing = ["cls.predictions.decoder.weight", "cls.predictions.decoder.bias"] + _tied_weights_keys = ["cls.predictions.decoder.weight", "cls.predictions.decoder.bias"] def __init__(self, config): super().__init__(config) @@ -2448,7 +2442,7 @@ def forward( >>> labels = torch.where(inputs.input_ids == tokenizer.mask_token_id, labels, -100) >>> outputs = model(**inputs, labels=labels) >>> round(outputs.loss.item(), 2) - 1.08 + 1.99 ``` """ return_dict = return_dict if return_dict is not None else self.config.use_return_dict @@ -2506,12 +2500,7 @@ def prepare_inputs_for_generation(self, input_ids, attention_mask=None, **model_ """BigBird Model with a `language modeling` head on top for CLM fine-tuning.""", BIG_BIRD_START_DOCSTRING ) class BigBirdForCausalLM(BigBirdPreTrainedModel): - _keys_to_ignore_on_load_missing = [ - r"position_ids", - r"predictions.decoder.bias", - "cls.predictions.decoder.weight", - "cls.predictions.decoder.bias", - ] + _tied_weights_keys = ["cls.predictions.decoder.weight", 
"cls.predictions.decoder.bias"] def __init__(self, config): super().__init__(config) @@ -2626,9 +2615,18 @@ def prepare_inputs_for_generation(self, input_ids, past_key_values=None, attenti if attention_mask is None: attention_mask = input_ids.new_ones(input_shape) - # cut decoder_input_ids if past is used + # cut decoder_input_ids if past_key_values is used if past_key_values is not None: - input_ids = input_ids[:, -1:] + past_length = past_key_values[0][0].shape[2] + + # Some generation methods already pass only the last input ID + if input_ids.shape[1] > past_length: + remove_prefix_length = past_length + else: + # Default to old behavior: keep only final ID + remove_prefix_length = input_ids.shape[1] - 1 + + input_ids = input_ids[:, remove_prefix_length:] return {"input_ids": input_ids, "attention_mask": attention_mask, "past_key_values": past_key_values} @@ -2636,7 +2634,8 @@ def _reorder_cache(self, past_key_values, beam_idx): reordered_past = () for layer_past in past_key_values: reordered_past += ( - tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:], + tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past[:2]) + + layer_past[2:], ) return reordered_past @@ -3010,9 +3009,9 @@ def __init__(self, config, add_pooling_layer=False): @replace_return_docstrings(output_type=BigBirdForQuestionAnsweringModelOutput, config_class=_CONFIG_FOR_DOC) def forward( self, - input_ids: torch.LongTensor = None, + input_ids: Optional[torch.LongTensor] = None, attention_mask: Optional[torch.FloatTensor] = None, - question_lengths=None, + question_lengths: Optional[torch.Tensor] = None, token_type_ids: Optional[torch.LongTensor] = None, position_ids: Optional[torch.LongTensor] = None, head_mask: Optional[torch.FloatTensor] = None, @@ -3042,8 +3041,8 @@ def forward( >>> from transformers import AutoTokenizer, BigBirdForQuestionAnswering >>> from datasets import load_dataset - >>> tokenizer = AutoTokenizer.from_pretrained("abhinavkulkarni/bigbird-roberta-base-finetuned-squad") - >>> model = BigBirdForQuestionAnswering.from_pretrained("abhinavkulkarni/bigbird-roberta-base-finetuned-squad") + >>> tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base") + >>> model = BigBirdForQuestionAnswering.from_pretrained("google/bigbird-roberta-base") >>> squad_ds = load_dataset("squad_v2", split="train") # doctest: +IGNORE_RESULT >>> # select random article and question @@ -3062,17 +3061,14 @@ def forward( >>> answer_start_index = outputs.start_logits.argmax() >>> answer_end_index = outputs.end_logits.argmax() - >>> predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1] - >>> tokenizer.decode(predict_answer_tokens) - '80 °C (176 °F) or more' + >>> predict_answer_token_ids = inputs.input_ids[0, answer_start_index : answer_end_index + 1] + >>> predict_answer_token = tokenizer.decode(predict_answer_token_ids) ``` ```python >>> target_start_index, target_end_index = torch.tensor([130]), torch.tensor([132]) >>> outputs = model(**inputs, start_positions=target_start_index, end_positions=target_end_index) >>> loss = outputs.loss - >>> round(outputs.loss.item(), 2) - 7.63 ``` """ return_dict = return_dict if return_dict is not None else self.config.use_return_dict diff --git a/src/transformers/models/big_bird/modeling_flax_big_bird.py b/src/transformers/models/big_bird/modeling_flax_big_bird.py index 2c3806c754e9cd..94eabdec451dda 100644 --- 
a/src/transformers/models/big_bird/modeling_flax_big_bird.py +++ b/src/transformers/models/big_bird/modeling_flax_big_bird.py @@ -19,7 +19,6 @@ import flax.linen as nn import jax import jax.numpy as jnp -import numpy as np from flax.core.frozen_dict import FrozenDict, freeze, unfreeze from flax.linen import combine_masks, make_causal_mask from flax.linen import partitioning as nn_partitioning @@ -123,9 +122,10 @@ class FlaxBigBirdForQuestionAnsweringModelOutput(ModelOutput): This model inherits from [`FlaxPreTrainedModel`]. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading, saving and converting weights from PyTorch models) - This model is also a Flax Linen [flax.linen.Module](https://flax.readthedocs.io/en/latest/flax.linen.html#module) - subclass. Use it as a regular Flax linen Module and refer to the Flax documentation for all matter related to - general usage and behavior. + This model is also a + [flax.linen.Module](https://flax.readthedocs.io/en/latest/api_reference/flax.linen/module.html) subclass. Use it as + a regular Flax linen Module and refer to the Flax documentation for all matter related to general usage and + behavior. Finally, this model supports inherent JAX features such as: @@ -317,7 +317,7 @@ def __call__( hidden_states, attention_mask, layer_head_mask, - key_value_states: Optional[jnp.array] = None, + key_value_states: Optional[jnp.ndarray] = None, init_cache: bool = False, deterministic=True, output_attentions: bool = False, @@ -459,6 +459,10 @@ def __call__( key_layer = self.transpose_for_scores(self.key(hidden_states), n_heads, head_size) value_layer = self.transpose_for_scores(self.value(hidden_states), n_heads, head_size) + indices_prng_key = None + if not deterministic: + indices_prng_key = self.make_rng("indices") + attn_output, attn_weights = self.bigbird_block_sparse_attention( query_layer, key_layer, @@ -470,6 +474,8 @@ def __call__( blocked_encoder_mask, n_heads, head_size, + indices_prng_key=indices_prng_key, + deterministic=deterministic, plan_from_length=None, plan_num_rand_blocks=None, output_attentions=output_attentions, @@ -528,6 +534,8 @@ def bigbird_block_sparse_attention( to_blocked_mask, n_heads, head_size, + indices_prng_key: Optional[jax.random.PRNGKey] = None, + deterministic: Optional[bool] = True, plan_from_length=None, plan_num_rand_blocks=None, output_attentions=None, @@ -571,12 +579,18 @@ def bigbird_block_sparse_attention( rsqrt_d = 1 / jnp.sqrt(head_size) attn_mask_penalty = -10000.0 - np.random.seed(self.block_sparse_seed) if from_seq_len in [1024, 3072, 4096]: # old plans used in paper max_seqlen = self.config.max_position_embeddings rand_attn = [ self._bigbird_block_rand_mask( - max_seqlen, max_seqlen, from_block_size, to_block_size, n_rand_blocks, last_idx=1024 + max_seqlen, + max_seqlen, + from_block_size, + to_block_size, + n_rand_blocks, + indices_prng_key=indices_prng_key, + deterministic=deterministic, + last_idx=1024, )[: (from_seq_len // from_block_size - 2)] for _ in range(n_heads) ] @@ -585,7 +599,6 @@ def bigbird_block_sparse_attention( plan_from_length, plan_num_rand_blocks = self._get_rand_attn_plan( from_seq_len, from_block_size, n_rand_blocks ) - rand_attn = self._bigbird_block_rand_mask_with_head( from_seq_length=from_seq_len, to_seq_length=to_seq_len, @@ -594,6 +607,7 @@ def bigbird_block_sparse_attention( num_heads=n_heads, plan_from_length=plan_from_length, plan_num_rand_blocks=plan_num_rand_blocks, + indices_prng_key=indices_prng_key, ) 
rand_attn = jnp.stack(rand_attn, axis=0) @@ -942,7 +956,14 @@ def _get_rand_attn_plan(from_seq_length, from_block_size, num_rand_blocks): @staticmethod def _bigbird_block_rand_mask( - from_seq_length, to_seq_length, from_block_size, to_block_size, num_rand_blocks, last_idx=-1 + from_seq_length, + to_seq_length, + from_block_size, + to_block_size, + num_rand_blocks, + indices_prng_key: Optional[jax.random.PRNGKey] = None, + deterministic: Optional[bool] = True, + last_idx: Optional[int] = -1, ): """ Create adjacency list of random attention. @@ -953,6 +974,8 @@ def _bigbird_block_rand_mask( from_block_size: int. size of block in from sequence. to_block_size: int. size of block in to sequence. num_rand_blocks: int. Number of random chunks per row. + indices_prng_key: jax.random.PRNGKey. PRNG key that is used to perform random jax operations. + deterministic: bool. When False random attention will be used. last_idx: if -1 then num_rand_blocks blocks chosen anywhere in to sequence, if positive then num_rand_blocks blocks chosen only up to last_idx. @@ -963,9 +986,12 @@ def _bigbird_block_rand_mask( if from_seq_length // from_block_size != to_seq_length // to_block_size: raise ValueError("Error the number of blocks needs to be same!") + rand_attn = jnp.zeros((from_seq_length // from_block_size - 2, num_rand_blocks), dtype=jnp.int32) + # deterministic nor randomness + if deterministic: + return rand_attn - rand_attn = np.zeros((from_seq_length // from_block_size - 2, num_rand_blocks), dtype=np.int32) - middle_seq = np.arange(1, to_seq_length // to_block_size - 1, dtype=np.int32) + middle_seq = jnp.arange(1, to_seq_length // to_block_size - 1, dtype=jnp.int32) last = to_seq_length // to_block_size - 1 if last_idx > (2 * to_block_size): last = (last_idx // to_block_size) - 1 @@ -975,25 +1001,31 @@ def _bigbird_block_rand_mask( start = i - 2 end = i if i == 1: - rand_attn[i - 1, :] = np.random.permutation(middle_seq[2:last])[:r] + seq_values = jax.random.permutation(indices_prng_key, middle_seq[2:last])[:r] + rand_attn = rand_attn.at[i - 1].set(seq_values) elif i == 2: - rand_attn[i - 1, :] = np.random.permutation(middle_seq[3:last])[:r] + seq_values = jax.random.permutation(indices_prng_key, middle_seq[3:last])[:r] + rand_attn = rand_attn.at[i - 1].set(seq_values) elif i == from_seq_length // from_block_size - 3: - rand_attn[i - 1, :] = np.random.permutation(middle_seq[:last])[:r] + seq_values = jax.random.permutation(indices_prng_key, middle_seq[:last])[:r] + rand_attn = rand_attn.at[i - 1].set(seq_values) # Missing -3: should have been sliced till last-3 elif i == from_seq_length // from_block_size - 2: - rand_attn[i - 1, :] = np.random.permutation(middle_seq[:last])[:r] + seq_values = jax.random.permutation(indices_prng_key, middle_seq[:last])[:r] + rand_attn = rand_attn.at[i - 1].set(seq_values) # Missing -4: should have been sliced till last-4 else: if start > last: start = last - rand_attn[i - 1, :] = np.random.permutation(middle_seq[:start])[:r] + seq_values = jax.random.permutation(indices_prng_key, middle_seq[:start])[:r] + rand_attn = rand_attn.at[i - 1].set(seq_values) elif (end + 1) == last: - rand_attn[i - 1, :] = np.random.permutation(middle_seq[:start])[:r] + seq_values = jax.random.permutation(indices_prng_key, middle_seq[:start])[:r] + rand_attn = rand_attn.at[i - 1].set(seq_values) else: - rand_attn[i - 1, :] = np.random.permutation( - np.concatenate((middle_seq[:start], middle_seq[end + 1 : last])) - )[:r] + concat_values = jnp.concatenate((middle_seq[:start], middle_seq[end + 1 
: last])) + seq_values = jax.random.permutation(indices_prng_key, concat_values)[:r] + rand_attn = rand_attn.at[i - 1].set(seq_values) return rand_attn def _bigbird_block_rand_mask_with_head( @@ -1005,6 +1037,8 @@ def _bigbird_block_rand_mask_with_head( num_heads, plan_from_length, plan_num_rand_blocks, + indices_prng_key: Optional[jax.random.PRNGKey] = None, + deterministic: Optional[bool] = True, window_block_left=1, window_block_right=1, global_block_top=1, @@ -1023,6 +1057,8 @@ def _bigbird_block_rand_mask_with_head( num_heads: int. total number of heads. plan_from_length: list. plan from length where num_random_blocks are choosen from. plan_num_rand_blocks: list. number of rand blocks within the plan. + indices_prng_key: jax.random.PRNGKey. PRNG key that is used to perform random jax operations. + deterministic: bool. When False random attention will be used. window_block_left: int. number of blocks of window to left of a block. window_block_right: int. number of blocks of window to right of a block. global_block_top: int. number of blocks at the top. @@ -1045,15 +1081,22 @@ def _bigbird_block_rand_mask_with_head( # Total number of blocks in the mmask num_blocks = from_seq_length // from_block_size # Number of blocks per plan - plan_block_length = np.array(plan_from_length) // from_block_size + plan_block_length = jnp.array(plan_from_length) // from_block_size # till when to follow plan max_plan_idx = plan_from_length.index(from_seq_length) + # Random Attention adjacency list rand_attn = [ - np.zeros((num_blocks, np.sum(plan_num_rand_blocks[: max_plan_idx + 1])), dtype=np.int32) + jnp.zeros((num_blocks, sum(plan_num_rand_blocks[: max_plan_idx + 1])), dtype=jnp.int32) for i in range(num_heads) ] + # deterministic + if deterministic: + for nh in range(num_heads): + rand_attn[nh] = rand_attn[nh][global_block_top : num_blocks - global_block_bottom, :] + return rand_attn + # We will go iteratively over the plan blocks and pick random number of # Attention blocks from the legally allowed blocks for plan_idx in range(max_plan_idx + 1): @@ -1064,11 +1107,11 @@ def _bigbird_block_rand_mask_with_head( # column indx start fromm plan_block_length[plan_idx-1] and ends at # plan_block_length[plan_idx] if plan_num_rand_blocks[plan_idx] > 0: - rnd_r_cnt = int(np.sum(plan_num_rand_blocks[:plan_idx])) - curr_r_cnt = int(np.sum(plan_num_rand_blocks[: plan_idx + 1])) + rnd_r_cnt = int(sum(plan_num_rand_blocks[:plan_idx])) + curr_r_cnt = int(sum(plan_num_rand_blocks[: plan_idx + 1])) for blk_rw_idx in range(global_block_top, plan_block_length[plan_idx - 1]): for h in range(num_heads): - rand_attn[h][blk_rw_idx, rnd_r_cnt:curr_r_cnt] = self._get_single_block_row_attention( + single_block_row_attention = self._get_single_block_row_attention( block_id=blk_rw_idx, to_start_block_id=plan_block_length[plan_idx - 1], to_end_block_id=plan_block_length[plan_idx], @@ -1077,6 +1120,10 @@ def _bigbird_block_rand_mask_with_head( window_block_right=window_block_right, global_block_left=global_block_left, global_block_right=global_block_right, + indices_prng_key=indices_prng_key, + ) + rand_attn[h] = ( + rand_attn[h].at[blk_rw_idx, rnd_r_cnt:curr_r_cnt].set(single_block_row_attention) ) for pl_id in range(plan_idx): @@ -1086,11 +1133,11 @@ def _bigbird_block_rand_mask_with_head( rnd_r_cnt = 0 to_start_block_id = 0 if pl_id > 0: - rnd_r_cnt = int(np.sum(plan_num_rand_blocks[:pl_id])) + rnd_r_cnt = int(sum(plan_num_rand_blocks[:pl_id])) to_start_block_id = plan_block_length[pl_id - 1] - curr_r_cnt = 
int(np.sum(plan_num_rand_blocks[: pl_id + 1])) + curr_r_cnt = int(sum(plan_num_rand_blocks[: pl_id + 1])) for h in range(num_heads): - rand_attn[h][blk_rw_idx, rnd_r_cnt:curr_r_cnt] = self._get_single_block_row_attention( + single_block_row_attention = self._get_single_block_row_attention( block_id=blk_rw_idx, to_start_block_id=to_start_block_id, to_end_block_id=plan_block_length[pl_id], @@ -1099,21 +1146,24 @@ def _bigbird_block_rand_mask_with_head( window_block_right=window_block_right, global_block_left=global_block_left, global_block_right=global_block_right, + indices_prng_key=indices_prng_key, + ) + rand_attn[h] = ( + rand_attn[h].at[blk_rw_idx, rnd_r_cnt:curr_r_cnt].set(single_block_row_attention) ) if plan_num_rand_blocks[plan_idx] == 0: continue - curr_r_cnt = int(np.sum(plan_num_rand_blocks[: plan_idx + 1])) + curr_r_cnt = int(sum(plan_num_rand_blocks[: plan_idx + 1])) from_start_block_id = global_block_top to_start_block_id = 0 if plan_idx > 0: - rnd_r_cnt = int(np.sum(plan_num_rand_blocks[:plan_idx])) + rnd_r_cnt = int(sum(plan_num_rand_blocks[:plan_idx])) from_start_block_id = plan_block_length[plan_idx - 1] to_start_block_id = plan_block_length[plan_idx - 1] - for blk_rw_idx in range(from_start_block_id, plan_block_length[plan_idx]): for h in range(num_heads): - rand_attn[h][blk_rw_idx, rnd_r_cnt:curr_r_cnt] = self._get_single_block_row_attention( + single_block_row_attention = self._get_single_block_row_attention( block_id=blk_rw_idx, to_start_block_id=to_start_block_id, to_end_block_id=plan_block_length[plan_idx], @@ -1122,11 +1172,12 @@ def _bigbird_block_rand_mask_with_head( window_block_right=window_block_right, global_block_left=global_block_left, global_block_right=global_block_right, + indices_prng_key=indices_prng_key, ) + rand_attn[h] = rand_attn[h].at[blk_rw_idx, rnd_r_cnt:curr_r_cnt].set(single_block_row_attention) for nh in range(num_heads): rand_attn[nh] = rand_attn[nh][global_block_top : num_blocks - global_block_bottom, :] - return rand_attn @staticmethod @@ -1135,6 +1186,7 @@ def _get_single_block_row_attention( to_start_block_id, to_end_block_id, num_rand_blocks, + indices_prng_key: Optional[jax.random.PRNGKey] = None, window_block_left=1, window_block_right=1, global_block_left=1, @@ -1148,6 +1200,7 @@ def _get_single_block_row_attention( to_start_block_id: int. random attention column start id. to_end_block_id: int. random attention column end id. num_rand_blocks: int. number of random blocks to be selected. + indices_prng_key: jax.random.PRNGKey. PRNG key that is used to perform random jax operations window_block_left: int. number of blocks of window to left of a block. window_block_right: int. number of blocks of window to right of a block. global_block_left: int. Number of blocks globally used to the left. @@ -1157,9 +1210,9 @@ def _get_single_block_row_attention( row containing the random attention vector of size num_rand_blocks. 
""" # list of to_blocks from which to choose random attention - to_block_list = np.arange(to_start_block_id, to_end_block_id, dtype=np.int32) + to_block_list = jnp.arange(to_start_block_id, to_end_block_id, dtype=jnp.int32) # permute the blocks - perm_block = np.random.permutation(to_block_list) + perm_block = jax.random.permutation(indices_prng_key, to_block_list) # illegal blocks for the current block id, using window illegal_blocks = list(range(block_id - window_block_left, block_id + window_block_right + 1)) @@ -1176,14 +1229,14 @@ def _get_single_block_row_attention( if block_id == to_end_block_id - 2: illegal_blocks.append(1) - selected_random_blokcs = [] + selected_random_blocks = [] for i in range(to_end_block_id - to_start_block_id): if perm_block[i] not in illegal_blocks: - selected_random_blokcs.append(perm_block[i]) - if len(selected_random_blokcs) == num_rand_blocks: + selected_random_blocks.append(perm_block[i]) + if len(selected_random_blocks) == num_rand_blocks: break - return np.array(selected_random_blokcs, dtype=np.int32) + return jnp.array(selected_random_blocks, dtype=jnp.int32) # Copied from transformers.models.bert.modeling_flax_bert.FlaxBertSelfOutput with Bert->BigBird @@ -1507,11 +1560,11 @@ def __call__(self, hidden_states): return self.LayerNorm(hidden_states) -# Copied from transformers.models.bert.modeling_flax_bert.FlaxBertLMPredictionHead with Bert->BigBird +# Copied from transformers.models.bert.modeling_flax_bert.FlaxBertLMPredictionHead with Bert->BigBird, np.ndarray->jnp.ndarray class FlaxBigBirdLMPredictionHead(nn.Module): config: BigBirdConfig dtype: jnp.dtype = jnp.float32 - bias_init: Callable[..., np.ndarray] = jax.nn.initializers.zeros + bias_init: Callable[..., jnp.ndarray] = jax.nn.initializers.zeros def setup(self): self.transform = FlaxBigBirdPredictionHeadTransform(self.config, dtype=self.dtype) @@ -1594,7 +1647,6 @@ def enable_gradient_checkpointing(self): gradient_checkpointing=True, ) - # Copied from transformers.models.bert.modeling_flax_bert.FlaxBertPreTrainedModel.init_weights def init_weights(self, rng: jax.random.PRNGKey, input_shape: Tuple, params: FrozenDict = None) -> FrozenDict: # init input tensors input_ids = jnp.zeros(input_shape, dtype="i4") @@ -1603,8 +1655,8 @@ def init_weights(self, rng: jax.random.PRNGKey, input_shape: Tuple, params: Froz attention_mask = jnp.ones_like(input_ids) head_mask = jnp.ones((self.config.num_hidden_layers, self.config.num_attention_heads)) - params_rng, dropout_rng = jax.random.split(rng) - rngs = {"params": params_rng, "dropout": dropout_rng} + params_rng, dropout_rng, indices_rng = jax.random.split(rng, num=3) + rngs = {"params": params_rng, "dropout": dropout_rng, "indices": indices_rng} if self.config.add_cross_attention: encoder_hidden_states = jnp.zeros(input_shape + (self.config.hidden_size,)) @@ -1622,7 +1674,13 @@ def init_weights(self, rng: jax.random.PRNGKey, input_shape: Tuple, params: Froz ) else: module_init_outputs = self.module.init( - rngs, input_ids, attention_mask, token_type_ids, position_ids, head_mask, return_dict=False + rngs, + input_ids, + attention_mask, + token_type_ids, + position_ids, + head_mask, + return_dict=False, ) random_params = module_init_outputs["params"] @@ -1658,7 +1716,6 @@ def init_cache(self, batch_size, max_length): return unfreeze(init_variables["cache"]) @add_start_docstrings_to_model_forward(BIG_BIRD_INPUTS_DOCSTRING.format("batch_size, sequence_length")) - # Copied from transformers.models.bert.modeling_flax_bert.FlaxBertPreTrainedModel.__call__ with 
Bert->BigBird def __call__( self, input_ids, @@ -1669,7 +1726,8 @@ def __call__( encoder_hidden_states=None, encoder_attention_mask=None, params: dict = None, - dropout_rng: jax.random.PRNGKey = None, + dropout_rng: Optional[jax.random.PRNGKey] = None, + indices_rng: Optional[jax.random.PRNGKey] = None, train: bool = False, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, @@ -1697,6 +1755,9 @@ def __call__( # Handle any PRNG if needed rngs = {} + if indices_rng is not None: + rngs["indices"] = indices_rng + if dropout_rng is not None: rngs["dropout"] = dropout_rng @@ -2382,7 +2443,8 @@ def __call__( head_mask=None, question_lengths=None, params: dict = None, - dropout_rng: jax.random.PRNGKey = None, + dropout_rng: Optional[jax.random.PRNGKey] = None, + indices_rng: Optional[jax.random.PRNGKey] = None, train: bool = False, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, @@ -2428,6 +2490,9 @@ def __call__( if dropout_rng is not None: rngs["dropout"] = dropout_rng + if indices_rng is not None: + rngs["indices"] = indices_rng + return self.module.apply( {"params": params or self.params}, jnp.array(input_ids, dtype="i4"), @@ -2459,7 +2524,6 @@ def prepare_question_mask(q_lengths, maxlen: int): ) -# Copied from transformers.models.bert.modeling_flax_bert.FlaxBertForCausalLMModule with Bert->BigBird class FlaxBigBirdForCausalLMModule(nn.Module): config: BigBirdConfig dtype: jnp.dtype = jnp.float32 @@ -2491,11 +2555,11 @@ def __call__( ): # Model outputs = self.bert( - input_ids, - attention_mask, - token_type_ids, - position_ids, - head_mask, + input_ids=input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, encoder_hidden_states=encoder_hidden_states, encoder_attention_mask=encoder_attention_mask, init_cache=init_cache, @@ -2536,7 +2600,7 @@ def __call__( class FlaxBigBirdForCausalLM(FlaxBigBirdPreTrainedModel): module_class = FlaxBigBirdForCausalLMModule - def prepare_inputs_for_generation(self, input_ids, max_length, attention_mask: Optional[jnp.DeviceArray] = None): + def prepare_inputs_for_generation(self, input_ids, max_length, attention_mask: Optional[jax.Array] = None): # initializing the cache batch_size, seq_length = input_ids.shape diff --git a/src/transformers/models/big_bird/tokenization_big_bird.py b/src/transformers/models/big_bird/tokenization_big_bird.py index bd6f90ef027acd..e7c43a86a6cab4 100644 --- a/src/transformers/models/big_bird/tokenization_big_bird.py +++ b/src/transformers/models/big_bird/tokenization_big_bird.py @@ -60,25 +60,25 @@ class BigBirdTokenizer(PreTrainedTokenizer): vocab_file (`str`): [SentencePiece](https://github.com/google/sentencepiece) file (generally has a *.spm* extension) that contains the vocabulary necessary to instantiate a tokenizer. - eos_token (`str`, *optional*, defaults to `""`): - The end of sequence token. - bos_token (`str`, *optional*, defaults to `""`): - The begin of sequence token. unk_token (`str`, *optional*, defaults to `""`): The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. + bos_token (`str`, *optional*, defaults to `""`): + The begin of sequence token. + eos_token (`str`, *optional*, defaults to `""`): + The end of sequence token. pad_token (`str`, *optional*, defaults to `""`): The token used for padding, for example when batching sequences of different lengths. 
sep_token (`str`, *optional*, defaults to `"[SEP]"`): The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens. - cls_token (`str`, *optional*, defaults to `"[CLS]"`): - The classifier token which is used when doing sequence classification (classification of the whole sequence - instead of per-token classification). It is the first token of the sequence when built with special tokens. mask_token (`str`, *optional*, defaults to `"[MASK]"`): The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict. + cls_token (`str`, *optional*, defaults to `"[CLS]"`): + The classifier token which is used when doing sequence classification (classification of the whole sequence + instead of per-token classification). It is the first token of the sequence when built with special tokens. sp_model_kwargs (`dict`, *optional*): Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things, @@ -127,6 +127,11 @@ def __init__( self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs + self.vocab_file = vocab_file + + self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs) + self.sp_model.Load(vocab_file) + super().__init__( bos_token=bos_token, eos_token=eos_token, @@ -139,11 +144,6 @@ def __init__( **kwargs, ) - self.vocab_file = vocab_file - - self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs) - self.sp_model.Load(vocab_file) - @property def vocab_size(self): return self.sp_model.get_piece_size() @@ -181,6 +181,7 @@ def _convert_id_to_token(self, index): token = self.sp_model.IdToPiece(index) return token + # Copied from transformers.models.albert.tokenization_albert.AlbertTokenizer.convert_tokens_to_string def convert_tokens_to_string(self, tokens): """Converts a sequence of tokens (string) in a single string.""" current_sub_tokens = [] @@ -204,7 +205,7 @@ def _decode( self, token_ids: List[int], skip_special_tokens: bool = False, - clean_up_tokenization_spaces: bool = True, + clean_up_tokenization_spaces: bool = None, spaces_between_special_tokens: bool = True, **kwargs, ) -> str: @@ -237,6 +238,11 @@ def _decode( else: text = "".join(sub_texts) + clean_up_tokenization_spaces = ( + clean_up_tokenization_spaces + if clean_up_tokenization_spaces is not None + else self.clean_up_tokenization_spaces + ) if clean_up_tokenization_spaces: clean_text = self.clean_up_tokenization(text) return clean_text diff --git a/src/transformers/models/big_bird/tokenization_big_bird_fast.py b/src/transformers/models/big_bird/tokenization_big_bird_fast.py index 11c3386794701d..24fc33d8052962 100644 --- a/src/transformers/models/big_bird/tokenization_big_bird_fast.py +++ b/src/transformers/models/big_bird/tokenization_big_bird_fast.py @@ -150,7 +150,10 @@ def __init__( ) self.vocab_file = vocab_file - self.can_save_slow_tokenizer = False if not self.vocab_file else True + + @property + def can_save_slow_tokenizer(self) -> bool: + return os.path.isfile(self.vocab_file) if self.vocab_file else False def build_inputs_with_special_tokens( self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None @@ -202,7 +205,7 @@ def get_special_tokens_mask( 
"You should not supply a second sequence if the provided sequence of " "ids is already formatted with special tokens for the model." ) - return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0)) + return [1 if x in [self.sep_token_id, self.cls_token_id] else 0 for x in token_ids_0] if token_ids_1 is None: return [1] + ([0] * len(token_ids_0)) + [1] diff --git a/src/transformers/models/bigbird_pegasus/configuration_bigbird_pegasus.py b/src/transformers/models/bigbird_pegasus/configuration_bigbird_pegasus.py index a7f198a735b385..1c78803c4b1146 100644 --- a/src/transformers/models/bigbird_pegasus/configuration_bigbird_pegasus.py +++ b/src/transformers/models/bigbird_pegasus/configuration_bigbird_pegasus.py @@ -120,6 +120,7 @@ class BigBirdPegasusConfig(PretrainedConfig): >>> # Accessing the model configuration >>> configuration = model.config ```""" + model_type = "bigbird_pegasus" keys_to_ignore_at_inference = ["past_key_values"] attribute_map = { diff --git a/src/transformers/models/bigbird_pegasus/convert_bigbird_pegasus_tf_to_pytorch.py b/src/transformers/models/bigbird_pegasus/convert_bigbird_pegasus_tf_to_pytorch.py index 5a81207548f9a2..e17369e48041c6 100644 --- a/src/transformers/models/bigbird_pegasus/convert_bigbird_pegasus_tf_to_pytorch.py +++ b/src/transformers/models/bigbird_pegasus/convert_bigbird_pegasus_tf_to_pytorch.py @@ -104,7 +104,7 @@ def convert_bigbird_pegasus(tf_weights: dict, config_update: dict) -> BigBirdPeg new_k = rename_state_dict_key(k, patterns) if new_k not in state_dict: raise ValueError(f"could not find new key {new_k} in state dict. (converted from {k})") - if any([True if i in k else False for i in ["dense", "query", "key", "value"]]): + if any(True if i in k else False for i in ["dense", "query", "key", "value"]): v = v.T mapping[new_k] = torch.from_numpy(v) assert v.shape == state_dict[new_k].shape, f"{new_k}, {k}, {v.shape}, {state_dict[new_k].shape}" @@ -117,7 +117,7 @@ def convert_bigbird_pegasus(tf_weights: dict, config_update: dict) -> BigBirdPeg new_k = rename_state_dict_key(k, patterns) if new_k not in state_dict and k != "pegasus/embeddings/position_embeddings": raise ValueError(f"could not find new key {new_k} in state dict. 
(converted from {k})") - if any([True if i in k else False for i in ["dense", "query", "key", "value"]]): + if any(True if i in k else False for i in ["dense", "query", "key", "value"]): v = v.T mapping[new_k] = torch.from_numpy(v) if k != "pegasus/embeddings/position_embeddings": @@ -147,7 +147,7 @@ def get_tf_weights_as_numpy(path) -> Dict: tf_weights = {} ignore_name = ["global_step"] for name, shape in tqdm(init_vars, desc="converting tf checkpoint to dict"): - skip_key = any([pat in name for pat in ignore_name]) + skip_key = any(pat in name for pat in ignore_name) if skip_key: continue array = tf.train.load_variable(path, name) diff --git a/src/transformers/models/bigbird_pegasus/modeling_bigbird_pegasus.py b/src/transformers/models/bigbird_pegasus/modeling_bigbird_pegasus.py index 17a92b5e27190d..baf08143431693 100755 --- a/src/transformers/models/bigbird_pegasus/modeling_bigbird_pegasus.py +++ b/src/transformers/models/bigbird_pegasus/modeling_bigbird_pegasus.py @@ -17,7 +17,6 @@ import copy import math -import random from typing import List, Optional, Tuple, Union import numpy as np @@ -26,6 +25,7 @@ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss from ...activations import ACT2FN +from ...modeling_attn_mask_utils import _prepare_4d_attention_mask, _prepare_4d_causal_attention_mask from ...modeling_outputs import ( BaseModelOutput, BaseModelOutputWithPastAndCrossAttentions, @@ -78,35 +78,6 @@ def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int, decoder_start return shifted_input_ids -def _make_causal_mask(input_ids_shape: torch.Size, dtype: torch.dtype, past_key_values_length: int = 0): - """ - Make causal mask used for bi-directional self-attention. - """ - bsz, tgt_len = input_ids_shape - mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min)) - mask_cond = torch.arange(mask.size(-1)) - mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0) - mask = mask.to(dtype) - - if past_key_values_length > 0: - mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype), mask], dim=-1) - return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length) - - -def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None): - """ - Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`. - """ - bsz, src_len = mask.size() - tgt_len = tgt_len if tgt_len is not None else src_len - - expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype) - - inverted_mask = 1.0 - expanded_mask - - return inverted_mask.masked_fill(inverted_mask.bool(), torch.finfo(dtype).min) - - class BigBirdPegasusLearnedPositionalEmbedding(nn.Embedding): """ This module learns positional embeddings up to a fixed maximum size. 
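The BigBirdPegasus modeling file above drops its local `_make_causal_mask`/`_expand_mask` helpers in favour of the shared `_prepare_4d_attention_mask`/`_prepare_4d_causal_attention_mask` utilities imported from `modeling_attn_mask_utils`. For orientation, the padding-mask expansion being consolidated amounts to the following (a plain-PyTorch sketch based on the removed helper, not the shared library implementation):

```python
import torch

def expand_padding_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: int = None) -> torch.Tensor:
    # [bsz, src_len] -> [bsz, 1, tgt_len, src_len], with 0.0 where attention is
    # allowed and the most negative representable value where it is masked out.
    bsz, src_len = mask.size()
    tgt_len = tgt_len if tgt_len is not None else src_len
    expanded = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
    inverted = 1.0 - expanded
    return inverted.masked_fill(inverted.bool(), torch.finfo(dtype).min)

attention_mask = torch.tensor([[1, 1, 1, 0]])
print(expand_padding_mask(attention_mask, torch.float32).shape)  # torch.Size([1, 1, 4, 4])
```

Centralising this logic is what later lets the decoder call `_prepare_4d_causal_attention_mask(attention_mask, input_shape, inputs_embeds, past_key_values_length)` directly instead of its own `_prepare_decoder_attention_mask`.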
@@ -712,15 +683,11 @@ def bigbird_block_sparse_attention( # global keys (corresponding to 1st key block) attention_probs[:, :, 2 * from_block_size : -2 * from_block_size, :to_block_size] = attn_weights[ :, :, :, :, :to_block_size - ].view( - bsz, n_heads, -1, to_block_size - ) # first_band_product + ].view(bsz, n_heads, -1, to_block_size) # first_band_product # global keys (corresponding to last key block) attention_probs[:, :, 2 * from_block_size : -2 * from_block_size, -to_block_size:] = attn_weights[ :, :, :, :, -to_block_size: - ].view( - bsz, n_heads, -1, to_block_size - ) # last_band_product + ].view(bsz, n_heads, -1, to_block_size) # last_band_product # random keys for p1, i1, w1 in zip(range(bsz), rand_attn, attn_weights): # p1, i1, w1 corresponds to batch_dim i.e. following operation is done for each sequence in batch @@ -789,11 +756,8 @@ def torch_gather_b2(params, indices): num_indices_to_gather = indices.shape[-2] * indices.shape[-1] num_indices_to_pick_from = params.shape[2] - indices_shift = ( - torch.arange(indices.shape[0] * indices.shape[1] * num_indices_to_gather, device=indices.device) - // num_indices_to_gather - * num_indices_to_pick_from - ) + shift = torch.arange(indices.shape[0] * indices.shape[1] * num_indices_to_gather, device=indices.device) + indices_shift = torch.div(shift, num_indices_to_gather, rounding_mode="floor") * num_indices_to_pick_from flattened_indices = indices.view(-1) + indices_shift flattened_params = params.reshape(-1, params.shape[-2], params.shape[-1]) @@ -873,9 +837,8 @@ def _get_rand_attn_plan(from_seq_length, from_block_size, num_rand_blocks): return plan_from_length, plan_num_rand_blocks - @staticmethod def _bigbird_block_rand_mask( - from_seq_length, to_seq_length, from_block_size, to_block_size, num_rand_blocks, last_idx=-1 + self, from_seq_length, to_seq_length, from_block_size, to_block_size, num_rand_blocks, last_idx=-1 ): """ Create adjacency list of random attention. 
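In `torch_gather_b2` above, the index shift is now computed with `torch.div(..., rounding_mode="floor")` rather than applying `//` directly to a tensor, which sidesteps the floor-division deprecation warning on newer PyTorch releases. A small sketch with assumed toy dimensions:

```python
import torch

bsz, n_heads, num_gather, num_pick_from = 2, 4, 6, 8

# One entry per index to gather; integer-divide by the gather count to get the
# owning (batch, head) slot, then scale by the source length to get a flat offset.
shift = torch.arange(bsz * n_heads * num_gather)
indices_shift = torch.div(shift, num_gather, rounding_mode="floor") * num_pick_from
print(indices_shift[:8])  # tensor([0, 0, 0, 0, 0, 0, 8, 8])
```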
@@ -898,6 +861,9 @@ def _bigbird_block_rand_mask( raise ValueError("Error the number of blocks needs to be same!") rand_attn = np.zeros((from_seq_length // from_block_size - 2, num_rand_blocks), dtype=np.int32) + # During inference (eval) no randomness + if not self.training: + return rand_attn middle_seq = np.arange(1, to_seq_length // to_block_size - 1, dtype=np.int32) last = to_seq_length // to_block_size - 1 if last_idx > (2 * to_block_size): @@ -981,11 +947,17 @@ def _bigbird_block_rand_mask_with_head( plan_block_length = np.array(plan_from_length) // from_block_size # till when to follow plan max_plan_idx = plan_from_length.index(from_seq_length) + # Random Attention adjacency list rand_attn = [ np.zeros((num_blocks, np.sum(plan_num_rand_blocks[: max_plan_idx + 1])), dtype=np.int32) for i in range(num_heads) ] + # During inference (eval) no randomness + if not self.training: + for nh in range(num_heads): + rand_attn[nh] = rand_attn[nh][global_block_top : num_blocks - global_block_bottom, :] + return rand_attn # We will go iteratively over the plan blocks and pick random number of # Attention blocks from the legally allowed blocks @@ -1198,7 +1170,7 @@ def forward( return outputs -# Copied from transformers.models.bart.modeling_bart.BartAttention with Bart->BigBirdPegasusDecoder +# Copied from transformers.models.bart.modeling_bart.BartAttention with BartConfig->BigBirdPegasusConfig, Bart->BigBirdPegasusDecoder class BigBirdPegasusDecoderAttention(nn.Module): """Multi-headed attention from 'Attention Is All You Need' paper""" @@ -1209,12 +1181,15 @@ def __init__( dropout: float = 0.0, is_decoder: bool = False, bias: bool = True, + is_causal: bool = False, + config: Optional[BigBirdPegasusConfig] = None, ): super().__init__() self.embed_dim = embed_dim self.num_heads = num_heads self.dropout = dropout self.head_dim = embed_dim // num_heads + self.config = config if (self.head_dim * num_heads) != self.embed_dim: raise ValueError( @@ -1223,6 +1198,7 @@ def __init__( ) self.scaling = self.head_dim**-0.5 self.is_decoder = is_decoder + self.is_causal = is_causal self.k_proj = nn.Linear(embed_dim, embed_dim, bias=bias) self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias) @@ -1290,8 +1266,8 @@ def forward( proj_shape = (bsz * self.num_heads, -1, self.head_dim) query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape) - key_states = key_states.view(*proj_shape) - value_states = value_states.view(*proj_shape) + key_states = key_states.reshape(*proj_shape) + value_states = value_states.reshape(*proj_shape) src_len = key_states.size(1) attn_weights = torch.bmm(query_states, key_states.transpose(1, 2)) @@ -1337,7 +1313,7 @@ def forward( if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim): raise ValueError( - f"`attn_output` should be of size {(bsz, self.num_heads, tgt_len, self.head_dim)}, but is" + f"`attn_output` should be of size {(bsz * self.num_heads, tgt_len, self.head_dim)}, but is" f" {attn_output.size()}" ) @@ -1345,7 +1321,7 @@ def forward( attn_output = attn_output.transpose(1, 2) # Use the `embed_dim` from the config (stored in the class) rather than `hidden_state` because `attn_output` can be - # partitioned aross GPUs when using tensor-parallelism. + # partitioned across GPUs when using tensor-parallelism. 
attn_output = attn_output.reshape(bsz, tgt_len, self.embed_dim) attn_output = self.out_proj(attn_output) @@ -1381,7 +1357,7 @@ def forward( ): """ Args: - hidden_states (`torch.FloatTensor`): input to the layer of shape `(seq_len, batch, embed_dim)` + hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)` attention_mask (`torch.FloatTensor`): attention mask of size `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. output_attentions (`bool`, *optional*): @@ -1589,6 +1565,7 @@ class BigBirdPegasusPreTrainedModel(PreTrainedModel): base_model_prefix = "model" supports_gradient_checkpointing = True _no_split_modules = ["BigBirdPegasusEncoderLayer", "BigBirdPegasusDecoderLayer"] + _skip_keys_device_placement = "past_key_values" def _init_weights(self, module): std = self.config.init_std @@ -1601,10 +1578,6 @@ def _init_weights(self, module): if module.padding_idx is not None: module.weight.data[module.padding_idx].zero_() - def _set_gradient_checkpointing(self, module, value=False): - if isinstance(module, (BigBirdPegasusDecoder, BigBirdPegasusEncoder)): - module.gradient_checkpointing = value - @property def dummy_inputs(self): pad_token = self.config.pad_token_id @@ -1705,10 +1678,11 @@ def dummy_inputs(self): If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all - `decoder_input_ids` of shape `(batch_size, sequence_length)`. inputs_embeds (`torch.FloatTensor` of shape - `(batch_size, sequence_length, hidden_size)`, *optional*): Optionally, instead of passing `input_ids` you - can choose to directly pass an embedded representation. This is useful if you want more control over how to - convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix. + `decoder_input_ids` of shape `(batch_size, sequence_length)`. + inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. + This is useful if you want more control over how to convert `input_ids` indices into associated vectors + than the model's internal embedding lookup matrix. decoder_inputs_embeds (`torch.FloatTensor` of shape `(batch_size, target_sequence_length, hidden_size)`, *optional*): Optionally, instead of passing `decoder_input_ids` you can choose to directly pass an embedded representation. 
If `past_key_values` is used, optionally only the last `decoder_inputs_embeds` have to be @@ -1849,6 +1823,7 @@ def forward( if input_ids is not None and inputs_embeds is not None: raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") elif input_ids is not None: + self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask) input_shape = input_ids.size() input_ids = input_ids.view(-1, input_shape[-1]) elif inputs_embeds is not None: @@ -1897,7 +1872,7 @@ def forward( # expand attention_mask if self.attention_type == "original_full": # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] - attention_mask = _expand_mask(attention_mask, inputs_embeds.dtype) + attention_mask = _prepare_4d_attention_mask(attention_mask, inputs_embeds.dtype) blocked_encoder_mask = band_mask = from_mask = to_mask = None elif self.attention_type == "block_sparse": blocked_encoder_mask, band_mask, from_mask, to_mask = self.create_masks_for_block_sparse_attn( @@ -1924,20 +1899,18 @@ def forward( if output_hidden_states: encoder_states = encoder_states + (hidden_states,) # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description) - dropout_probability = random.uniform(0, 1) - if self.training and (dropout_probability < self.layerdrop): # skip the layer + to_drop = False + if self.training: + dropout_probability = torch.rand([]) + if dropout_probability < self.layerdrop: # skip the layer + to_drop = True + + if to_drop: layer_outputs = (None, None) else: if self.gradient_checkpointing and self.training: - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, output_attentions) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(encoder_layer), + layer_outputs = self._gradient_checkpointing_func( + encoder_layer.__call__, hidden_states, attention_mask, (head_mask[idx] if head_mask is not None else None), @@ -1946,6 +1919,7 @@ def custom_forward(*inputs): to_mask, blocked_encoder_mask, blocked_encoder_mask, + output_attentions, ) else: layer_outputs = encoder_layer( @@ -2041,7 +2015,7 @@ def _pad_to_block_size(self, hidden_states: torch.Tensor, attention_mask: torch. 
padding_len = (block_size - seq_len % block_size) % block_size if padding_len > 0: - logger.info( + logger.warning_once( f"Input ids are automatically padded from {seq_len} to {seq_len + padding_len} to be a multiple of " f"`config.block_size`: {block_size}" ) @@ -2097,27 +2071,6 @@ def get_input_embeddings(self): def set_input_embeddings(self, value): self.embed_tokens = value - # Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask - def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length): - # create causal mask - # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] - combined_attention_mask = None - if input_shape[-1] > 1: - combined_attention_mask = _make_causal_mask( - input_shape, inputs_embeds.dtype, past_key_values_length=past_key_values_length - ).to(inputs_embeds.device) - - if attention_mask is not None: - # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] - expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to( - inputs_embeds.device - ) - combined_attention_mask = ( - expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask - ) - - return combined_attention_mask - def forward( self, input_ids: Optional[torch.Tensor] = None, @@ -2184,11 +2137,11 @@ def forward( If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of - all `decoder_input_ids` of shape `(batch_size, sequence_length)`. inputs_embeds (`torch.FloatTensor` of - shape `(batch_size, sequence_length, hidden_size)`, *optional*): Optionally, instead of passing - `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more - control over how to convert `input_ids` indices into associated vectors than the model's internal - embedding lookup matrix. + all `decoder_input_ids` of shape `(batch_size, sequence_length)`. + inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. + This is useful if you want more control over how to convert `input_ids` indices into associated vectors + than the model's internal embedding lookup matrix. output_attentions (`bool`, *optional*): Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail. 
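The encoder and decoder forward passes in the surrounding hunks share two refactors: LayerDrop now draws its skip probability with `torch.rand([])` and only when `self.training`, and gradient checkpointing routes through `self._gradient_checkpointing_func(layer.__call__, ...)` instead of a locally defined `create_custom_forward` closure, with the decoder's `use_cache` warning hoisted above the loop as a `warning_once`. The LayerDrop piece in isolation looks roughly like this (a toy module, not the BigBirdPegasus code):

```python
import torch
import torch.nn as nn

class TinyLayerDropStack(nn.Module):
    # Minimal illustration: each layer is skipped with probability `layerdrop`,
    # but only in training mode, and the draw uses torch.rand([]) rather than
    # Python's random module.
    def __init__(self, num_layers: int = 4, hidden: int = 8, layerdrop: float = 0.5):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_layers))
        self.layerdrop = layerdrop

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            if self.training and torch.rand([]) < self.layerdrop:
                continue  # skip this layer entirely
            hidden_states = layer(hidden_states)
        return hidden_states

model = TinyLayerDropStack().train()
print(model(torch.randn(2, 8)).shape)  # torch.Size([2, 8])
```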
@@ -2222,14 +2175,16 @@ def forward( if inputs_embeds is None: inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale - attention_mask = self._prepare_decoder_attention_mask( + attention_mask = _prepare_4d_causal_attention_mask( attention_mask, input_shape, inputs_embeds, past_key_values_length ) # expand encoder attention mask if encoder_hidden_states is not None and encoder_attention_mask is not None: # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] - encoder_attention_mask = _expand_mask(encoder_attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]) + encoder_attention_mask = _prepare_4d_attention_mask( + encoder_attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1] + ) # embed positions positions = self.embed_positions(input_shape, past_key_values_length) @@ -2239,6 +2194,13 @@ def forward( hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training) + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning_once( + "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." + ) + use_cache = False + # decoder layers all_hidden_states = () if output_hidden_states else None all_self_attns = () if output_attentions else None @@ -2257,28 +2219,16 @@ def forward( # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description) if output_hidden_states: all_hidden_states += (hidden_states,) - dropout_probability = random.uniform(0, 1) - if self.training and (dropout_probability < self.layerdrop): - continue + if self.training: + dropout_probability = torch.rand([]) + if dropout_probability < self.layerdrop: + continue past_key_value = past_key_values[idx] if past_key_values is not None else None if self.gradient_checkpointing and self.training: - if use_cache: - logger.warning( - "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." 
- ) - use_cache = False - - def create_custom_forward(module): - def custom_forward(*inputs): - # None for past_key_value - return module(*inputs, output_attentions, use_cache) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(decoder_layer), + layer_outputs = self._gradient_checkpointing_func( + decoder_layer.__call__, hidden_states, attention_mask, encoder_hidden_states, @@ -2286,6 +2236,8 @@ def custom_forward(*inputs): head_mask[idx] if head_mask is not None else None, cross_attn_head_mask[idx] if cross_attn_head_mask is not None else None, None, + output_attentions, + use_cache, ) else: layer_outputs = decoder_layer( @@ -2339,7 +2291,7 @@ def custom_forward(*inputs): BIGBIRD_PEGASUS_START_DOCSTRING, ) class BigBirdPegasusModel(BigBirdPegasusPreTrainedModel): - _keys_to_ignore_on_load_missing = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight"] + _tied_weights_keys = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight"] def __init__(self, config: BigBirdPegasusConfig): super().__init__(config) @@ -2361,6 +2313,11 @@ def set_input_embeddings(self, value): self.encoder.embed_tokens = self.shared self.decoder.embed_tokens = self.shared + def _tie_weights(self): + if self.config.tie_word_embeddings: + self._tie_or_clone_weights(self.encoder.embed_tokens, self.shared) + self._tie_or_clone_weights(self.decoder.embed_tokens, self.shared) + def get_encoder(self): return self.encoder @@ -2470,12 +2427,8 @@ def forward( # Copied from transformers.models.bart.modeling_bart.BartForConditionalGeneration with Bart->BigBirdPegasus, BART->BIGBIRD_PEGASUS class BigBirdPegasusForConditionalGeneration(BigBirdPegasusPreTrainedModel): base_model_prefix = "model" - _keys_to_ignore_on_load_missing = [ - r"final_logits_bias", - r"lm_head.weight", - "encoder.embed_tokens.weight", - "decoder.embed_tokens.weight", - ] + _tied_weights_keys = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight", "lm_head.weight"] + _keys_to_ignore_on_load_missing = ["final_logits_bias"] def __init__(self, config: BigBirdPegasusConfig): super().__init__(config) @@ -2492,9 +2445,9 @@ def get_encoder(self): def get_decoder(self): return self.model.get_decoder() - def resize_token_embeddings(self, new_num_tokens: int) -> nn.Embedding: - new_embeddings = super().resize_token_embeddings(new_num_tokens) - self._resize_final_logits_bias(new_num_tokens) + def resize_token_embeddings(self, new_num_tokens: int, pad_to_multiple_of: Optional[int] = None) -> nn.Embedding: + new_embeddings = super().resize_token_embeddings(new_num_tokens, pad_to_multiple_of) + self._resize_final_logits_bias(new_embeddings.weight.shape[0]) return new_embeddings def _resize_final_logits_bias(self, new_num_tokens: int) -> None: @@ -2576,6 +2529,7 @@ def forward( masked_lm_loss = None if labels is not None: + labels = labels.to(lm_logits.device) loss_fct = CrossEntropyLoss() masked_lm_loss = loss_fct(lm_logits.view(-1, self.config.vocab_size), labels.view(-1)) @@ -2610,7 +2564,16 @@ def prepare_inputs_for_generation( ): # cut decoder_input_ids if past_key_values is used if past_key_values is not None: - decoder_input_ids = decoder_input_ids[:, -1:] + past_length = past_key_values[0][0].shape[2] + + # Some generation methods already pass only the last input ID + if decoder_input_ids.shape[1] > past_length: + remove_prefix_length = past_length + else: + # Default to old behavior: keep only final ID + remove_prefix_length = decoder_input_ids.shape[1] - 1 + + decoder_input_ids = 
decoder_input_ids[:, remove_prefix_length:] return { "input_ids": None, # encoder_outputs is defined. input_ids not needed @@ -2634,7 +2597,8 @@ def _reorder_cache(past_key_values, beam_idx): for layer_past in past_key_values: # cached cross_attention states don't have to be reordered -> they are always the same reordered_past += ( - tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:], + tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past[:2]) + + layer_past[2:], ) return reordered_past @@ -2647,7 +2611,7 @@ def _reorder_cache(past_key_values, beam_idx): BIGBIRD_PEGASUS_START_DOCSTRING, ) class BigBirdPegasusForSequenceClassification(BigBirdPegasusPreTrainedModel): - _keys_to_ignore_on_load_missing = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight"] + _tied_weights_keys = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight"] def __init__(self, config: BigBirdPegasusConfig, **kwargs): super().__init__(config, **kwargs) @@ -2730,6 +2694,7 @@ def forward( loss = None if labels is not None: + labels = labels.to(logits.device) if self.config.problem_type is None: if self.config.num_labels == 1: self.config.problem_type = "regression" @@ -2775,7 +2740,7 @@ def forward( BIGBIRD_PEGASUS_START_DOCSTRING, ) class BigBirdPegasusForQuestionAnswering(BigBirdPegasusPreTrainedModel): - _keys_to_ignore_on_load_missing = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight"] + _tied_weights_keys = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight"] def __init__(self, config): super().__init__(config) @@ -2907,7 +2872,7 @@ def forward(self, *args, **kwargs): class BigBirdPegasusForCausalLM(BigBirdPegasusPreTrainedModel): - _keys_to_ignore_on_load_missing = ["lm_head.weight"] + _tied_weights_keys = ["lm_head.weight"] def __init__(self, config): config = copy.deepcopy(config) @@ -3103,5 +3068,7 @@ def prepare_inputs_for_generation( def _reorder_cache(past_key_values, beam_idx): reordered_past = () for layer_past in past_key_values: - reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),) + reordered_past += ( + tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past), + ) return reordered_past diff --git a/src/transformers/models/biogpt/__init__.py b/src/transformers/models/biogpt/__init__.py index 64fd23977a5c22..ec3d6966ac419d 100644 --- a/src/transformers/models/biogpt/__init__.py +++ b/src/transformers/models/biogpt/__init__.py @@ -30,6 +30,8 @@ _import_structure["modeling_biogpt"] = [ "BIOGPT_PRETRAINED_MODEL_ARCHIVE_LIST", "BioGptForCausalLM", + "BioGptForTokenClassification", + "BioGptForSequenceClassification", "BioGptModel", "BioGptPreTrainedModel", ] @@ -48,6 +50,8 @@ from .modeling_biogpt import ( BIOGPT_PRETRAINED_MODEL_ARCHIVE_LIST, BioGptForCausalLM, + BioGptForSequenceClassification, + BioGptForTokenClassification, BioGptModel, BioGptPreTrainedModel, ) diff --git a/src/transformers/models/biogpt/configuration_biogpt.py b/src/transformers/models/biogpt/configuration_biogpt.py index 2fe46354d291e8..1fb2933f2843eb 100644 --- a/src/transformers/models/biogpt/configuration_biogpt.py +++ b/src/transformers/models/biogpt/configuration_biogpt.py @@ -53,7 +53,7 @@ class BioGptConfig(PretrainedConfig): The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, `"relu"`, `"selu"` and `"gelu_new"` are supported. 
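`prepare_inputs_for_generation` (for the BigBirdPegasus conditional-generation model above and, later in this patch, for BioGPT) now trims the prompt against the actual cache length instead of unconditionally keeping only the last token: the cached length is read from `past_key_values[0][0].shape[2]` and only the not-yet-cached suffix is kept. A standalone sketch of that trimming rule (toy tensors; `trim_to_uncached_suffix` is a hypothetical helper name):

```python
import torch

def trim_to_uncached_suffix(decoder_input_ids: torch.Tensor, past_key_values) -> torch.Tensor:
    # Cache layout assumed as in the patch: past_key_values[layer][k_or_v] has
    # shape (batch, num_heads, past_seq_len, head_dim).
    if past_key_values is None:
        return decoder_input_ids
    past_length = past_key_values[0][0].shape[2]
    if decoder_input_ids.shape[1] > past_length:
        remove_prefix_length = past_length  # keep everything not yet cached
    else:
        remove_prefix_length = decoder_input_ids.shape[1] - 1  # old behaviour: last token only
    return decoder_input_ids[:, remove_prefix_length:]

ids = torch.tensor([[5, 6, 7, 8]])
fake_cache = [(torch.zeros(1, 2, 3, 4), torch.zeros(1, 2, 3, 4))]
print(trim_to_uncached_suffix(ids, fake_cache))  # tensor([[8]])
```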
hidden_dropout_prob (`float`, *optional*, defaults to 0.1): - The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler. + The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1): The dropout ratio for the attention probabilities. max_position_embeddings (`int`, *optional*, defaults to 1024): @@ -72,13 +72,14 @@ class BioGptConfig(PretrainedConfig): Please refer to the paper about LayerDrop: https://arxiv.org/abs/1909.11556 for further details activation_dropout (`float`, *optional*, defaults to 0.0): The dropout ratio for activations inside the fully connected layer. - pad_token_id (`int`, *optional*, defaults to 1) + pad_token_id (`int`, *optional*, defaults to 1): Padding token id. - bos_token_id (`int`, *optional*, defaults to 0) + bos_token_id (`int`, *optional*, defaults to 0): Beginning of stream token id. - eos_token_id (`int`, *optional*, defaults to 2) + eos_token_id (`int`, *optional*, defaults to 2): End of stream token id. - Example: + + Example: ```python >>> from transformers import BioGptModel, BioGptConfig @@ -92,6 +93,7 @@ class BioGptConfig(PretrainedConfig): >>> # Accessing the model configuration >>> configuration = model.config ```""" + model_type = "biogpt" def __init__( diff --git a/src/transformers/models/biogpt/modeling_biogpt.py b/src/transformers/models/biogpt/modeling_biogpt.py index c4c89a9f4e2a3d..d98f0886dfa95c 100755 --- a/src/transformers/models/biogpt/modeling_biogpt.py +++ b/src/transformers/models/biogpt/modeling_biogpt.py @@ -16,18 +16,28 @@ import math -import random from typing import Optional, Tuple, Union import torch import torch.utils.checkpoint from torch import nn -from torch.nn import CrossEntropyLoss +from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss from ...activations import ACT2FN -from ...modeling_outputs import BaseModelOutputWithPastAndCrossAttentions, CausalLMOutputWithCrossAttentions +from ...modeling_attn_mask_utils import _prepare_4d_causal_attention_mask +from ...modeling_outputs import ( + BaseModelOutputWithPastAndCrossAttentions, + CausalLMOutputWithCrossAttentions, + SequenceClassifierOutputWithPast, + TokenClassifierOutput, +) from ...modeling_utils import PreTrainedModel -from ...utils import add_code_sample_docstrings, add_start_docstrings, add_start_docstrings_to_model_forward, logging +from ...utils import ( + add_code_sample_docstrings, + add_start_docstrings, + add_start_docstrings_to_model_forward, + logging, +) from .configuration_biogpt import BioGptConfig @@ -36,43 +46,14 @@ _CHECKPOINT_FOR_DOC = "microsoft/biogpt" _CONFIG_FOR_DOC = "BioGptConfig" + BIOGPT_PRETRAINED_MODEL_ARCHIVE_LIST = [ "microsoft/biogpt", + "microsoft/BioGPT-Large", # See all BioGPT models at https://huggingface.co/models?filter=biogpt ] -# Copied from transformers.models.bart.modeling_bart._make_causal_mask -def _make_causal_mask(input_ids_shape: torch.Size, dtype: torch.dtype, past_key_values_length: int = 0): - """ - Make causal mask used for bi-directional self-attention. 
- """ - bsz, tgt_len = input_ids_shape - mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min)) - mask_cond = torch.arange(mask.size(-1)) - mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0) - mask = mask.to(dtype) - - if past_key_values_length > 0: - mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype), mask], dim=-1) - return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length) - - -# Copied from transformers.models.bart.modeling_bart._expand_mask -def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None): - """ - Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`. - """ - bsz, src_len = mask.size() - tgt_len = tgt_len if tgt_len is not None else src_len - - expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype) - - inverted_mask = 1.0 - expanded_mask - - return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min) - - # Copied from transformers.models.opt.modeling_opt.OPTLearnedPositionalEmbedding with OPT->BioGpt class BioGptLearnedPositionalEmbedding(nn.Embedding): """ @@ -109,12 +90,15 @@ def __init__( dropout: float = 0.0, is_decoder: bool = False, bias: bool = True, + is_causal: bool = False, + config: Optional[BioGptConfig] = None, ): super().__init__() self.embed_dim = embed_dim self.num_heads = num_heads self.dropout = dropout self.head_dim = embed_dim // num_heads + self.config = config if (self.head_dim * num_heads) != self.embed_dim: raise ValueError( @@ -123,6 +107,7 @@ def __init__( ) self.scaling = self.head_dim**-0.5 self.is_decoder = is_decoder + self.is_causal = is_causal self.k_proj = nn.Linear(embed_dim, embed_dim, bias=bias) self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias) @@ -190,8 +175,8 @@ def forward( proj_shape = (bsz * self.num_heads, -1, self.head_dim) query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape) - key_states = key_states.view(*proj_shape) - value_states = value_states.view(*proj_shape) + key_states = key_states.reshape(*proj_shape) + value_states = value_states.reshape(*proj_shape) src_len = key_states.size(1) attn_weights = torch.bmm(query_states, key_states.transpose(1, 2)) @@ -237,7 +222,7 @@ def forward( if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim): raise ValueError( - f"`attn_output` should be of size {(bsz, self.num_heads, tgt_len, self.head_dim)}, but is" + f"`attn_output` should be of size {(bsz * self.num_heads, tgt_len, self.head_dim)}, but is" f" {attn_output.size()}" ) @@ -245,7 +230,7 @@ def forward( attn_output = attn_output.transpose(1, 2) # Use the `embed_dim` from the config (stored in the class) rather than `hidden_state` because `attn_output` can be - # partitioned aross GPUs when using tensor-parallelism. + # partitioned across GPUs when using tensor-parallelism. attn_output = attn_output.reshape(bsz, tgt_len, self.embed_dim) attn_output = self.out_proj(attn_output) @@ -363,10 +348,6 @@ def _init_weights(self, module): module.bias.data.zero_() module.weight.data.fill_(1.0) - def _set_gradient_checkpointing(self, module, value=False): - if isinstance(module, BioGptModel): - module.gradient_checkpointing = value - BIOGPT_START_DOCSTRING = r""" This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) sub-class. 
Use @@ -415,10 +396,11 @@ def _set_gradient_checkpointing(self, module, value=False): If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all - `decoder_input_ids` of shape `(batch_size, sequence_length)`. inputs_embeds (`torch.FloatTensor` of shape - `(batch_size, sequence_length, hidden_size)`, *optional*): Optionally, instead of passing `input_ids` you - can choose to directly pass an embedded representation. This is useful if you want more control over how to - convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix. + `decoder_input_ids` of shape `(batch_size, sequence_length)`. + inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. + This is useful if you want more control over how to convert `input_ids` indices into associated vectors + than the model's internal embedding lookup matrix. use_cache (`bool`, *optional*): If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`). @@ -463,27 +445,6 @@ def get_input_embeddings(self): def set_input_embeddings(self, value): self.embed_tokens = value - # Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask - def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length): - # create causal mask - # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] - combined_attention_mask = None - if input_shape[-1] > 1: - combined_attention_mask = _make_causal_mask( - input_shape, inputs_embeds.dtype, past_key_values_length=past_key_values_length - ).to(inputs_embeds.device) - - if attention_mask is not None: - # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] - expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to( - inputs_embeds.device - ) - combined_attention_mask = ( - expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask - ) - - return combined_attention_mask - @add_start_docstrings_to_model_forward(BIOGPT_INPUTS_DOCSTRING.format("batch_size, sequence_length")) @add_code_sample_docstrings( checkpoint=_CHECKPOINT_FOR_DOC, @@ -528,11 +489,21 @@ def forward( inputs_embeds = self.embed_tokens(input) * self.embed_scale if attention_mask is None: - attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.bool, device=inputs_embeds.device) + attention_mask = torch.ones( + (inputs_embeds.shape[0], inputs_embeds.shape[1] + past_key_values_length), + dtype=torch.bool, + device=inputs_embeds.device, + ) + elif attention_mask.shape[1] != past_key_values_length + input_shape[1]: + raise ValueError( + f"The provided attention mask has length {attention_mask.shape[1]}, but its length should be " + f"{past_key_values_length + input_shape[1]} (sum of the lengths of current and past inputs)" + ) + # embed positions positions = self.embed_positions(attention_mask, past_key_values_length) - attention_mask = self._prepare_decoder_attention_mask( + attention_mask = _prepare_4d_causal_attention_mask( attention_mask, input_shape, inputs_embeds, past_key_values_length ) @@ -540,6 +511,13 @@ def forward( hidden_states = nn.functional.dropout(hidden_states, 
p=self.dropout, training=self.training) + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning_once( + "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." + ) + use_cache = False + all_hidden_states = () if output_hidden_states else None all_self_attns = () if output_attentions else None all_cross_attentions = None @@ -549,32 +527,22 @@ def forward( # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description) if output_hidden_states: all_hidden_states += (hidden_states,) - dropout_probability = random.uniform(0, 1) - if self.training and (dropout_probability < self.layerdrop): - continue + if self.training: + dropout_probability = torch.rand([]) + if dropout_probability < self.layerdrop: + continue past_key_value = past_key_values[idx] if past_key_values is not None else None if self.gradient_checkpointing and self.training: - if use_cache: - logger.warning( - "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." - ) - use_cache = False - - def create_custom_forward(module): - def custom_forward(*inputs): - # None for past_key_value - return module(*inputs, output_attentions, use_cache) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(decoder_layer), + layer_outputs = self._gradient_checkpointing_func( + decoder_layer.__call__, hidden_states, attention_mask, head_mask[idx] if head_mask is not None else None, None, + output_attentions, + use_cache, ) else: layer_outputs = decoder_layer( @@ -621,7 +589,7 @@ def custom_forward(*inputs): """BioGPT Model with a `language modeling` head on top for CLM fine-tuning.""", BIOGPT_START_DOCSTRING ) class BioGptForCausalLM(BioGptPreTrainedModel): - _keys_to_ignore_on_load_missing = ["output_projection.weight"] + _tied_weights_keys = ["output_projection.weight"] def __init__(self, config): super().__init__(config) @@ -701,21 +669,260 @@ def forward( cross_attentions=outputs.cross_attentions, ) - def prepare_inputs_for_generation(self, input_ids, attention_mask, past_key_values=None, **kwargs): - # only last token for inputs_ids if past is defined in kwargs - if past_key_values: - input_ids = input_ids[:, -1].unsqueeze(-1) + def prepare_inputs_for_generation( + self, input_ids, attention_mask, inputs_embeds=None, past_key_values=None, **kwargs + ): + # only last tokens for inputs_ids if past is defined in kwargs + if past_key_values is not None: + past_length = past_key_values[0][0].shape[2] - return { - "input_ids": input_ids, - "attention_mask": attention_mask, - "past_key_values": past_key_values, - "use_cache": kwargs.get("use_cache"), - } + # Some generation methods already pass only the last input ID + if input_ids.shape[1] > past_length: + remove_prefix_length = past_length + else: + # Default to old behavior: keep only final ID + remove_prefix_length = input_ids.shape[1] - 1 + + input_ids = input_ids[:, remove_prefix_length:] + + if inputs_embeds is not None and past_key_values is None: + model_inputs = {"inputs_embeds": inputs_embeds} + else: + model_inputs = {"input_ids": input_ids} + + model_inputs.update( + { + "attention_mask": attention_mask, + "past_key_values": past_key_values, + "use_cache": kwargs.get("use_cache"), + } + ) + + return model_inputs @staticmethod def _reorder_cache(past_key_values, beam_idx): reordered_past = () for layer_past in past_key_values: - reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),) + 
reordered_past += ( + tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past), + ) return reordered_past + + +@add_start_docstrings( + """ + BioGPT Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for + Named-Entity-Recognition (NER) tasks. + """, + BIOGPT_START_DOCSTRING, +) +class BioGptForTokenClassification(BioGptPreTrainedModel): + def __init__(self, config): + super().__init__(config) + self.num_labels = config.num_labels + + self.biogpt = BioGptModel(config) + if hasattr(config, "classifier_dropout") and config.classifier_dropout is not None: + classifier_dropout = config.classifier_dropout + else: + classifier_dropout = config.hidden_dropout_prob + self.dropout = nn.Dropout(classifier_dropout) + self.classifier = nn.Linear(config.hidden_size, config.num_labels) + + self.post_init() + + @add_start_docstrings_to_model_forward(BIOGPT_INPUTS_DOCSTRING) + @add_code_sample_docstrings( + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=TokenClassifierOutput, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + labels: Optional[torch.LongTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, TokenClassifierOutput]: + r""" + labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*): + Labels for computing the sequence classification/regression loss. Indices should be in `[0, ..., + config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If + `config.num_labels > 1` a classification loss is computed (Cross-Entropy). + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + transformer_outputs = self.biogpt( + input_ids, + past_key_values=past_key_values, + attention_mask=attention_mask, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + hidden_states = transformer_outputs[0] + hidden_states = self.dropout(hidden_states) + logits = self.classifier(hidden_states) + + loss = None + if labels is not None: + loss_fct = CrossEntropyLoss() + # Only keep active parts of the loss + if attention_mask is not None: + active_loss = attention_mask.view(-1) == 1 + active_logits = logits.view(-1, self.num_labels) + active_labels = torch.where( + active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels) + ) + loss = loss_fct(active_logits, active_labels) + else: + loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) + + if not return_dict: + output = (logits,) + transformer_outputs[2:] + return ((loss,) + output) if loss is not None else output + + return TokenClassifierOutput( + loss=loss, + logits=logits, + hidden_states=transformer_outputs.hidden_states, + attentions=transformer_outputs.attentions, + ) + + +@add_start_docstrings( + """ + The BioGpt Model transformer with a sequence classification head on top (linear layer). 
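The `BioGptForSequenceClassification` head added below pools the logits at the last non-padding position of each sequence when `pad_token_id` is configured (falling back to the final position otherwise). The index arithmetic is simply a count of non-pad tokens minus one; a short sketch with made-up ids and shapes:

```python
import torch

pad_token_id = 0
input_ids = torch.tensor(
    [[11, 12, 13, 0, 0],    # 3 real tokens -> pool position 2
     [21, 22, 23, 24, 25]]  # 5 real tokens -> pool position 4
)
logits = torch.randn(2, 5, 3)  # (batch, seq_len, num_labels)

last_token_index = torch.ne(input_ids, pad_token_id).sum(-1) - 1
pooled_logits = logits[torch.arange(input_ids.shape[0]), last_token_index]
print(last_token_index.tolist())  # [2, 4]
print(pooled_logits.shape)        # torch.Size([2, 3])
```

As the added docstring notes, this relies on right padding; when only `inputs_embeds` are passed, the model cannot detect padding and simply uses the last position.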
+ + [`BioGptForSequenceClassification`] uses the last token in order to do the classification, as other causal models + (e.g. GPT-2) do. + + Since it does classification on the last token, it is required to know the position of the last token. If a + `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If + no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the + padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in + each row of the batch). + """, + BIOGPT_START_DOCSTRING, +) +class BioGptForSequenceClassification(BioGptPreTrainedModel): + def __init__(self, config: BioGptConfig): + super().__init__(config) + self.num_labels = config.num_labels + self.biogpt = BioGptModel(config) + self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False) + + # Initialize weights and apply final processing + self.post_init() + + @add_start_docstrings_to_model_forward(BIOGPT_INPUTS_DOCSTRING) + @add_code_sample_docstrings( + checkpoint=_CHECKPOINT_FOR_DOC, + output_type=SequenceClassifierOutputWithPast, + config_class=_CONFIG_FOR_DOC, + ) + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + labels: Optional[torch.LongTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, SequenceClassifierOutputWithPast]: + r""" + labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*): + Labels for computing the sequence classification/regression loss. Indices should be in `[0, ..., + config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If + `config.num_labels > 1` a classification loss is computed (Cross-Entropy). + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + transformer_outputs = self.biogpt( + input_ids, + past_key_values=past_key_values, + attention_mask=attention_mask, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + hidden_states = transformer_outputs[0] + logits = self.score(hidden_states) + + if input_ids is not None: + batch_size, sequence_length = input_ids.shape[:2] + else: + batch_size, sequence_length = inputs_embeds.shape[:2] + + if self.config.pad_token_id is None: + sequence_length = -1 + else: + if input_ids is not None: + sequence_length = (torch.ne(input_ids, self.config.pad_token_id).sum(-1) - 1).to(logits.device) + else: + sequence_length = -1 + logger.warning( + f"{self.__class__.__name__} will not detect padding tokens in `inputs_embeds`. 
Results may be " + "unexpected if using padding tokens in conjunction with `inputs_embeds.`" + ) + + pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_length] + + loss = None + if labels is not None: + if self.config.problem_type is None: + if self.num_labels == 1: + self.config.problem_type = "regression" + elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int): + self.config.problem_type = "single_label_classification" + else: + self.config.problem_type = "multi_label_classification" + + if self.config.problem_type == "regression": + loss_fct = MSELoss() + if self.num_labels == 1: + loss = loss_fct(pooled_logits.squeeze(), labels.squeeze()) + else: + loss = loss_fct(pooled_logits, labels) + elif self.config.problem_type == "single_label_classification": + loss_fct = CrossEntropyLoss() + loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1)) + elif self.config.problem_type == "multi_label_classification": + loss_fct = BCEWithLogitsLoss() + loss = loss_fct(pooled_logits, labels) + if not return_dict: + output = (pooled_logits,) + transformer_outputs[1:] + return ((loss,) + output) if loss is not None else output + + return SequenceClassifierOutputWithPast( + loss=loss, + logits=pooled_logits, + past_key_values=transformer_outputs.past_key_values, + hidden_states=transformer_outputs.hidden_states, + attentions=transformer_outputs.attentions, + ) + + def get_input_embeddings(self): + return self.biogpt.embed_tokens + + def set_input_embeddings(self, value): + self.biogpt.embed_tokens = value diff --git a/src/transformers/models/biogpt/tokenization_biogpt.py b/src/transformers/models/biogpt/tokenization_biogpt.py index 55f337f2ec9132..093991ecb3885d 100644 --- a/src/transformers/models/biogpt/tokenization_biogpt.py +++ b/src/transformers/models/biogpt/tokenization_biogpt.py @@ -112,15 +112,6 @@ def __init__( pad_token="", **kwargs, ): - super().__init__( - bos_token=bos_token, - eos_token=eos_token, - sep_token=sep_token, - unk_token=unk_token, - pad_token=pad_token, - **kwargs, - ) - try: import sacremoses except ImportError: @@ -132,8 +123,8 @@ def __init__( self.lang = "en" self.sm = sacremoses # cache of sm.MosesTokenizer instance - self.cache_moses_tokenizer = dict() - self.cache_moses_detokenizer = dict() + self.cache_moses_tokenizer = {} + self.cache_moses_detokenizer = {} """ Initialisation""" with open(vocab_file, encoding="utf-8") as vocab_handle: @@ -145,6 +136,15 @@ def __init__( self.bpe_ranks = dict(zip(merges, range(len(merges)))) self.cache = {} + super().__init__( + bos_token=bos_token, + eos_token=eos_token, + sep_token=sep_token, + unk_token=unk_token, + pad_token=pad_token, + **kwargs, + ) + @property def vocab_size(self): """Returns vocab size""" @@ -221,7 +221,7 @@ def _tokenize(self, text, bypass_tokenizer=False): split_tokens = [] for token in text: if token: - split_tokens.extend([t for t in self.bpe(token).split(" ")]) + split_tokens.extend(list(self.bpe(token).split(" "))) return split_tokens diff --git a/src/transformers/models/bit/configuration_bit.py b/src/transformers/models/bit/configuration_bit.py index 278c7f1c7f1a5a..d11a8e38185113 100644 --- a/src/transformers/models/bit/configuration_bit.py +++ b/src/transformers/models/bit/configuration_bit.py @@ -16,6 +16,7 @@ from ...configuration_utils import PretrainedConfig from ...utils import logging +from ...utils.backbone_utils import BackboneConfigMixin, get_aligned_output_features_output_indices logger = 
logging.get_logger(__name__) @@ -25,7 +26,7 @@ } -class BitConfig(PretrainedConfig): +class BitConfig(BackboneConfigMixin, PretrainedConfig): r""" This is the configuration class to store the configuration of a [`BitModel`]. It is used to instantiate an BiT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the @@ -51,7 +52,7 @@ class BitConfig(PretrainedConfig): are supported. global_padding (`str`, *optional*): Padding strategy to use for the convolutional layers. Can be either `"valid"`, `"same"`, or `None`. - num_groups (`int`, *optional*, defaults to `32`): + num_groups (`int`, *optional*, defaults to 32): Number of groups used for the `BitGroupNormActivation` layers. drop_path_rate (`float`, *optional*, defaults to 0.0): The drop path rate for the stochastic depth. @@ -63,7 +64,14 @@ class BitConfig(PretrainedConfig): The width factor for the model. out_features (`List[str]`, *optional*): If used as backbone, list of features to output. Can be any of `"stem"`, `"stage1"`, `"stage2"`, etc. - (depending on how many stages the model has). Will default to the last stage if unset. + (depending on how many stages the model has). If unset and `out_indices` is set, will default to the + corresponding stages. If unset and `out_indices` is unset, will default to the last stage. Must be in the + same order as defined in the `stage_names` attribute. + out_indices (`List[int]`, *optional*): + If used as backbone, list of indices of features to output. Can be any of 0, 1, 2, etc. (depending on how + many stages the model has). If unset and `out_features` is set, will default to the corresponding stages. + If unset and `out_features` is unset, will default to the last stage. Must be in the + same order as defined in the `stage_names` attribute. Example: ```python @@ -79,6 +87,7 @@ class BitConfig(PretrainedConfig): >>> configuration = model.config ``` """ + model_type = "bit" layer_types = ["preactivation", "bottleneck"] supported_padding = ["SAME", "VALID"] @@ -98,6 +107,7 @@ def __init__( output_stride=32, width_factor=1, out_features=None, + out_indices=None, **kwargs, ): super().__init__(**kwargs) @@ -122,12 +132,6 @@ def __init__( self.width_factor = width_factor self.stage_names = ["stem"] + [f"stage{idx}" for idx in range(1, len(depths) + 1)] - if out_features is not None: - if not isinstance(out_features, list): - raise ValueError("out_features should be a list") - for feature in out_features: - if feature not in self.stage_names: - raise ValueError( - f"Feature {feature} is not a valid feature name. 
Valid names are {self.stage_names}" - ) - self.out_features = out_features + self._out_features, self._out_indices = get_aligned_output_features_output_indices( + out_features=out_features, out_indices=out_indices, stage_names=self.stage_names + ) diff --git a/src/transformers/models/bit/image_processing_bit.py b/src/transformers/models/bit/image_processing_bit.py index 0394ecb411efd1..df9336c347955b 100644 --- a/src/transformers/models/bit/image_processing_bit.py +++ b/src/transformers/models/bit/image_processing_bit.py @@ -18,15 +18,10 @@ import numpy as np -from transformers.utils.generic import TensorType - from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict from ...image_transforms import ( - center_crop, convert_to_rgb, get_resize_output_image_size, - normalize, - rescale, resize, to_channel_dimension_format, ) @@ -36,12 +31,14 @@ ChannelDimension, ImageInput, PILImageResampling, + infer_channel_dimension_format, + is_scaled_image, make_list_of_images, to_numpy_array, valid_images, + validate_preprocess_arguments, ) -from ...utils import logging -from ...utils.import_utils import is_vision_available +from ...utils import TensorType, is_vision_available, logging logger = logging.get_logger(__name__) @@ -79,14 +76,15 @@ class BitImageProcessor(BaseImageProcessor): method. do_normalize: Whether to normalize the image. Can be overridden by `do_normalize` in the `preprocess` method. - image_mean (`float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_MEAN`): + image_mean (`float` or `List[float]`, *optional*, defaults to `OPENAI_CLIP_MEAN`): Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method. - image_std (`float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_STD`): - Image standard deviation. - do_convert_rgb (`bool`, *optional*, defaults to `True`): + image_std (`float` or `List[float]`, *optional*, defaults to `OPENAI_CLIP_MEAN`): Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method. + Can be overridden by the `image_std` parameter in the `preprocess` method. + do_convert_rgb (`bool`, *optional*, defaults to `True`): + Whether to convert the image to RGB. """ model_input_names = ["pixel_values"] @@ -124,12 +122,14 @@ def __init__( self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD self.do_convert_rgb = do_convert_rgb + # Copied from transformers.models.clip.image_processing_clip.CLIPImageProcessor.resize def resize( self, image: np.ndarray, size: Dict[str, int], resample: PILImageResampling = PILImageResampling.BICUBIC, data_format: Optional[Union[str, ChannelDimension]] = None, + input_data_format: Optional[Union[str, ChannelDimension]] = None, **kwargs, ) -> np.ndarray: """ @@ -145,79 +145,32 @@ def resize( Resampling filter to use when resiizing the image. data_format (`str` or `ChannelDimension`, *optional*): The channel dimension format of the image. If not provided, it will be the same as the input image. + input_data_format (`ChannelDimension` or `str`, *optional*): + The channel dimension format of the input image. If not provided, it will be inferred. 
""" - size = get_size_dict(size, default_to_square=False) - if "shortest_edge" not in size: - raise ValueError(f"The `size` parameter must contain the key `shortest_edge`. Got {size.keys()}") - output_size = get_resize_output_image_size(image, size=size["shortest_edge"], default_to_square=False) - return resize(image, size=output_size, resample=resample, data_format=data_format, **kwargs) - - def center_crop( - self, - image: np.ndarray, - size: Dict[str, int], - data_format: Optional[Union[str, ChannelDimension]] = None, - **kwargs, - ) -> np.ndarray: - """ - Center crop an image. If the image is too small to be cropped to the size given, it will be padded (so the - returned result will always be of size `size`). - - Args: - image (`np.ndarray`): - Image to center crop. - size (`Dict[str, int]`): - Size of the output image in the form of a dictionary with keys `height` and `width`. - data_format (`str` or `ChannelDimension`, *optional*): - The channel dimension format of the image. If not provided, it will be the same as the input image. - """ - size = get_size_dict(size) - if "height" not in size or "width" not in size: - raise ValueError(f"The `size` parameter must contain the keys (height, width). Got {size.keys()}") - return center_crop(image, size=(size["height"], size["width"]), data_format=data_format, **kwargs) - - def rescale( - self, - image: np.ndarray, - scale: Union[int, float], - data_format: Optional[Union[str, ChannelDimension]] = None, - **kwargs, - ): - """ - Rescale an image by a scale factor. image = image * scale. - - Args: - image (`np.ndarray`): - Image to rescale. - scale (`int` or `float`): - Scale to apply to the image. - data_format (`str` or `ChannelDimension`, *optional*): - The channel dimension format of the image. If not provided, it will be the same as the input image. - """ - return rescale(image, scale=scale, data_format=data_format, **kwargs) - - def normalize( - self, - image: np.ndarray, - mean: Union[float, List[float]], - std: Union[float, List[float]], - data_format: Optional[Union[str, ChannelDimension]] = None, - **kwargs, - ) -> np.ndarray: - """ - Normalize an image. image = (image - image_mean) / image_std. - - Args: - image (`np.ndarray`): - Image to normalize. - image_mean (`float` or `List[float]`): - Image mean. - image_std (`float` or `List[float]`): - Image standard deviation. - data_format (`str` or `ChannelDimension`, *optional*): - The channel dimension format of the image. If not provided, it will be the same as the input image. 
- """ - return normalize(image, mean=mean, std=std, data_format=data_format, **kwargs) + default_to_square = True + if "shortest_edge" in size: + size = size["shortest_edge"] + default_to_square = False + elif "height" in size and "width" in size: + size = (size["height"], size["width"]) + else: + raise ValueError("Size must contain either 'shortest_edge' or 'height' and 'width'.") + + output_size = get_resize_output_image_size( + image, + size=size, + default_to_square=default_to_square, + input_data_format=input_data_format, + ) + return resize( + image, + size=output_size, + resample=resample, + data_format=data_format, + input_data_format=input_data_format, + **kwargs, + ) def preprocess( self, @@ -235,6 +188,7 @@ def preprocess( do_convert_rgb: bool = None, return_tensors: Optional[Union[str, TensorType]] = None, data_format: Optional[ChannelDimension] = ChannelDimension.FIRST, + input_data_format: Optional[Union[str, ChannelDimension]] = None, **kwargs, ) -> PIL.Image.Image: """ @@ -242,7 +196,8 @@ def preprocess( Args: images (`ImageInput`): - Image to preprocess. + Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If + passing in images with pixel values between 0 and 1, set `do_rescale=False`. do_resize (`bool`, *optional*, defaults to `self.do_resize`): Whether to resize the image. size (`Dict[str, int]`, *optional*, defaults to `self.size`): @@ -277,9 +232,15 @@ def preprocess( - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`. data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`): The channel dimension format for the output image. Can be one of: - - `ChannelDimension.FIRST`: image in (num_channels, height, width) format. - - `ChannelDimension.LAST`: image in (height, width, num_channels) format. - - Unset: defaults to the channel dimension format of the input image. + - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format. + - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - Unset: Use the channel dimension format of the input image. + input_data_format (`ChannelDimension` or `str`, *optional*): + The channel dimension format for the input image. If unset, the channel dimension format is inferred + from the input image. Can be one of: + - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format. + - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - `"none"` or `ChannelDimension.NONE`: image in (height, width) format. """ do_resize = do_resize if do_resize is not None else self.do_resize size = size if size is not None else self.size @@ -303,17 +264,18 @@ def preprocess( "torch.Tensor, tf.Tensor or jax.ndarray." 
) - if do_resize and size is None: - raise ValueError("Size must be specified if do_resize is True.") - - if do_center_crop and crop_size is None: - raise ValueError("Crop size must be specified if do_center_crop is True.") - - if do_rescale and rescale_factor is None: - raise ValueError("Rescale factor must be specified if do_rescale is True.") - - if do_normalize and (image_mean is None or image_std is None): - raise ValueError("Image mean and std must be specified if do_normalize is True.") + validate_preprocess_arguments( + do_rescale=do_rescale, + rescale_factor=rescale_factor, + do_normalize=do_normalize, + image_mean=image_mean, + image_std=image_std, + do_center_crop=do_center_crop, + crop_size=crop_size, + do_resize=do_resize, + size=size, + resample=resample, + ) # PIL RGBA images are converted to RGB if do_convert_rgb: @@ -322,19 +284,42 @@ def preprocess( # All transformations expect numpy arrays. images = [to_numpy_array(image) for image in images] + if is_scaled_image(images[0]) and do_rescale: + logger.warning_once( + "It looks like you are trying to rescale already rescaled images. If the input" + " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again." + ) + + if input_data_format is None: + # We assume that all images have the same channel dimension format. + input_data_format = infer_channel_dimension_format(images[0]) + if do_resize: - images = [self.resize(image=image, size=size, resample=resample) for image in images] + images = [ + self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format) + for image in images + ] if do_center_crop: - images = [self.center_crop(image=image, size=crop_size) for image in images] + images = [ + self.center_crop(image=image, size=crop_size, input_data_format=input_data_format) for image in images + ] if do_rescale: - images = [self.rescale(image=image, scale=rescale_factor) for image in images] + images = [ + self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format) + for image in images + ] if do_normalize: - images = [self.normalize(image=image, mean=image_mean, std=image_std) for image in images] - - images = [to_channel_dimension_format(image, data_format) for image in images] + images = [ + self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format) + for image in images + ] + + images = [ + to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images + ] data = {"pixel_values": images} return BatchFeature(data=data, tensor_type=return_tensors) diff --git a/src/transformers/models/bit/modeling_bit.py b/src/transformers/models/bit/modeling_bit.py index 7ebe461e5be01a..49bc75b5f0aa6b 100644 --- a/src/transformers/models/bit/modeling_bit.py +++ b/src/transformers/models/bit/modeling_bit.py @@ -31,7 +31,7 @@ BaseModelOutputWithPoolingAndNoAttention, ImageClassifierOutputWithNoAttention, ) -from ...modeling_utils import BackboneMixin, PreTrainedModel +from ...modeling_utils import PreTrainedModel from ...utils import ( add_code_sample_docstrings, add_start_docstrings, @@ -39,6 +39,7 @@ logging, replace_return_docstrings, ) +from ...utils.backbone_utils import BackboneMixin from .configuration_bit import BitConfig @@ -299,7 +300,7 @@ def forward(self, pixel_values: Tensor) -> Tensor: # Copied from transformers.models.convnext.modeling_convnext.drop_path -def drop_path(input, drop_prob: float = 0.0, training: bool = False): +def drop_path(input: torch.Tensor, 
drop_prob: float = 0.0, training: bool = False) -> torch.Tensor: """ Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks). @@ -659,7 +660,6 @@ class BitPreTrainedModel(PreTrainedModel): config_class = BitConfig base_model_prefix = "bit" main_input_name = "pixel_values" - supports_gradient_checkpointing = True def _init_weights(self, module): if isinstance(module, nn.Conv2d): @@ -668,10 +668,6 @@ def _init_weights(self, module): nn.init.constant_(module.weight, 1) nn.init.constant_(module.bias, 0) - def _set_gradient_checkpointing(self, module, value=False): - if isinstance(module, BitModel): - module.gradient_checkpointing = value - BIT_START_DOCSTRING = r""" This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it @@ -844,26 +840,14 @@ def forward( class BitBackbone(BitPreTrainedModel, BackboneMixin): def __init__(self, config): super().__init__(config) + super()._init_backbone(config) - self.stage_names = config.stage_names self.bit = BitModel(config) - - self.out_features = config.out_features if config.out_features is not None else [self.stage_names[-1]] - - out_feature_channels = {} - out_feature_channels["stem"] = config.embedding_size - for idx, stage in enumerate(self.stage_names[1:]): - out_feature_channels[stage] = config.hidden_sizes[idx] - - self.out_feature_channels = out_feature_channels + self.num_features = [config.embedding_size] + config.hidden_sizes # initialize weights and apply final processing self.post_init() - @property - def channels(self): - return [self.out_feature_channels[name] for name in self.out_features] - @add_start_docstrings_to_model_forward(BIT_INPUTS_DOCSTRING) @replace_return_docstrings(output_type=BackboneOutput, config_class=_CONFIG_FOR_DOC) def forward( diff --git a/src/transformers/models/blenderbot/configuration_blenderbot.py b/src/transformers/models/blenderbot/configuration_blenderbot.py index 93ee9281364526..4f55a96bf62b71 100644 --- a/src/transformers/models/blenderbot/configuration_blenderbot.py +++ b/src/transformers/models/blenderbot/configuration_blenderbot.py @@ -104,6 +104,7 @@ class BlenderbotConfig(PretrainedConfig): >>> # Accessing the model configuration >>> configuration = model.config ```""" + model_type = "blenderbot" keys_to_ignore_at_inference = ["past_key_values"] attribute_map = {"num_attention_heads": "encoder_attention_heads", "hidden_size": "d_model"} diff --git a/src/transformers/models/blenderbot/modeling_blenderbot.py b/src/transformers/models/blenderbot/modeling_blenderbot.py index 6ddefde8560a12..28b81387c13e62 100755 --- a/src/transformers/models/blenderbot/modeling_blenderbot.py +++ b/src/transformers/models/blenderbot/modeling_blenderbot.py @@ -18,7 +18,6 @@ import copy import math import os -import random import warnings from typing import List, Optional, Tuple, Union @@ -28,6 +27,7 @@ from torch.nn import CrossEntropyLoss from ...activations import ACT2FN +from ...modeling_attn_mask_utils import _prepare_4d_attention_mask, _prepare_4d_causal_attention_mask from ...modeling_outputs import ( BaseModelOutput, BaseModelOutputWithPastAndCrossAttentions, @@ -76,37 +76,6 @@ def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int, decoder_start return shifted_input_ids -# Copied from transformers.models.bart.modeling_bart._make_causal_mask -def _make_causal_mask(input_ids_shape: torch.Size, dtype: torch.dtype, past_key_values_length: int = 0): - """ - Make causal mask used for bi-directional self-attention. 
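The module-local `_make_causal_mask`/`_expand_mask` helpers removed in this hunk are superseded by the shared utilities imported from `...modeling_attn_mask_utils` above. A small sketch of the replacement call, assuming the signature used later in this patch (batch and sequence sizes are arbitrary):

```python
import torch

from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask

bsz, tgt_len, hidden = 2, 5, 8
inputs_embeds = torch.zeros(bsz, tgt_len, hidden)
attention_mask = torch.ones(bsz, tgt_len, dtype=torch.long)  # 2D padding mask

# Returns a [bsz, 1, tgt_len, tgt_len + past_key_values_length] mask with the upper
# triangle filled with the dtype minimum, mirroring what _make_causal_mask used to build.
mask_4d = _prepare_4d_causal_attention_mask(
    attention_mask, (bsz, tgt_len), inputs_embeds, past_key_values_length=0
)
print(mask_4d.shape)  # torch.Size([2, 1, 5, 5])
```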
- """ - bsz, tgt_len = input_ids_shape - mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min)) - mask_cond = torch.arange(mask.size(-1)) - mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0) - mask = mask.to(dtype) - - if past_key_values_length > 0: - mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype), mask], dim=-1) - return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length) - - -# Copied from transformers.models.bart.modeling_bart._expand_mask -def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None): - """ - Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`. - """ - bsz, src_len = mask.size() - tgt_len = tgt_len if tgt_len is not None else src_len - - expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype) - - inverted_mask = 1.0 - expanded_mask - - return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min) - - class BlenderbotLearnedPositionalEmbedding(nn.Embedding): """ This module learns positional embeddings up to a fixed maximum size. @@ -135,12 +104,15 @@ def __init__( dropout: float = 0.0, is_decoder: bool = False, bias: bool = True, + is_causal: bool = False, + config: Optional[BlenderbotConfig] = None, ): super().__init__() self.embed_dim = embed_dim self.num_heads = num_heads self.dropout = dropout self.head_dim = embed_dim // num_heads + self.config = config if (self.head_dim * num_heads) != self.embed_dim: raise ValueError( @@ -149,6 +121,7 @@ def __init__( ) self.scaling = self.head_dim**-0.5 self.is_decoder = is_decoder + self.is_causal = is_causal self.k_proj = nn.Linear(embed_dim, embed_dim, bias=bias) self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias) @@ -216,8 +189,8 @@ def forward( proj_shape = (bsz * self.num_heads, -1, self.head_dim) query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape) - key_states = key_states.view(*proj_shape) - value_states = value_states.view(*proj_shape) + key_states = key_states.reshape(*proj_shape) + value_states = value_states.reshape(*proj_shape) src_len = key_states.size(1) attn_weights = torch.bmm(query_states, key_states.transpose(1, 2)) @@ -263,7 +236,7 @@ def forward( if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim): raise ValueError( - f"`attn_output` should be of size {(bsz, self.num_heads, tgt_len, self.head_dim)}, but is" + f"`attn_output` should be of size {(bsz * self.num_heads, tgt_len, self.head_dim)}, but is" f" {attn_output.size()}" ) @@ -271,7 +244,7 @@ def forward( attn_output = attn_output.transpose(1, 2) # Use the `embed_dim` from the config (stored in the class) rather than `hidden_state` because `attn_output` can be - # partitioned aross GPUs when using tensor-parallelism. + # partitioned across GPUs when using tensor-parallelism. 
attn_output = attn_output.reshape(bsz, tgt_len, self.embed_dim) attn_output = self.out_proj(attn_output) @@ -279,15 +252,20 @@ def forward( return attn_output, attn_weights_reshaped, past_key_value -# Copied from transformers.models.mbart.modeling_mbart.MBartEncoderLayer with MBart->Blenderbot +BLENDERBOT_ATTENTION_CLASSES = {"eager": BlenderbotAttention} + + +# Copied from transformers.models.mbart.modeling_mbart.MBartEncoderLayer with MBart->Blenderbot, MBART->BLENDERBOT class BlenderbotEncoderLayer(nn.Module): def __init__(self, config: BlenderbotConfig): super().__init__() self.embed_dim = config.d_model - self.self_attn = BlenderbotAttention( + + self.self_attn = BLENDERBOT_ATTENTION_CLASSES[config._attn_implementation]( embed_dim=self.embed_dim, num_heads=config.encoder_attention_heads, dropout=config.attention_dropout, + config=config, ) self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim) self.dropout = config.dropout @@ -306,7 +284,7 @@ def forward( ) -> torch.Tensor: """ Args: - hidden_states (`torch.FloatTensor`): input to the layer of shape `(seq_len, batch, embed_dim)` + hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)` attention_mask (`torch.FloatTensor`): attention mask of size `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. layer_head_mask (`torch.FloatTensor`): mask for attention heads in a given layer of size @@ -348,28 +326,31 @@ def forward( return outputs -# Copied from transformers.models.mbart.modeling_mbart.MBartDecoderLayer with MBart->Blenderbot +# Copied from transformers.models.mbart.modeling_mbart.MBartDecoderLayer with MBart->Blenderbot, MBART->BLENDERBOT class BlenderbotDecoderLayer(nn.Module): def __init__(self, config: BlenderbotConfig): super().__init__() self.embed_dim = config.d_model - self.self_attn = BlenderbotAttention( + self.self_attn = BLENDERBOT_ATTENTION_CLASSES[config._attn_implementation]( embed_dim=self.embed_dim, num_heads=config.decoder_attention_heads, dropout=config.attention_dropout, is_decoder=True, + is_causal=True, + config=config, ) self.dropout = config.dropout self.activation_fn = ACT2FN[config.activation_function] self.activation_dropout = config.activation_dropout self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim) - self.encoder_attn = BlenderbotAttention( + self.encoder_attn = BLENDERBOT_ATTENTION_CLASSES[config._attn_implementation]( self.embed_dim, config.decoder_attention_heads, dropout=config.attention_dropout, is_decoder=True, + config=config, ) self.encoder_attn_layer_norm = nn.LayerNorm(self.embed_dim) self.fc1 = nn.Linear(self.embed_dim, config.decoder_ffn_dim) @@ -482,10 +463,6 @@ def _init_weights(self, module): if module.padding_idx is not None: module.weight.data[module.padding_idx].zero_() - def _set_gradient_checkpointing(self, module, value=False): - if isinstance(module, (BlenderbotDecoder, BlenderbotEncoder)): - module.gradient_checkpointing = value - @property def dummy_inputs(self): pad_token = self.config.pad_token_id @@ -612,10 +589,11 @@ def dummy_inputs(self): If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all - `decoder_input_ids` of shape `(batch_size, sequence_length)`. 
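The `BLENDERBOT_ATTENTION_CLASSES` mapping introduced above lets the encoder and decoder layers pick their attention module from `config._attn_implementation`; only `"eager"` is registered in this patch. A toy sketch of the lookup pattern (the `"sdpa"` entry is illustrative only and not something this patch adds):

```python
# Toy classes standing in for the real attention modules.
class EagerAttention:
    name = "eager"


class SdpaAttention:
    name = "sdpa"


ATTENTION_CLASSES = {"eager": EagerAttention, "sdpa": SdpaAttention}


def build_attention(attn_implementation: str):
    # The model layers index this mapping with config._attn_implementation.
    return ATTENTION_CLASSES[attn_implementation]()


print(build_attention("eager").name)  # eager
```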
inputs_embeds (`torch.FloatTensor` of shape - `(batch_size, sequence_length, hidden_size)`, *optional*): Optionally, instead of passing `input_ids` you - can choose to directly pass an embedded representation. This is useful if you want more control over how to - convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix. + `decoder_input_ids` of shape `(batch_size, sequence_length)`. + inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. + This is useful if you want more control over how to convert `input_ids` indices into associated vectors + than the model's internal embedding lookup matrix. decoder_inputs_embeds (`torch.FloatTensor` of shape `(batch_size, target_sequence_length, hidden_size)`, *optional*): Optionally, instead of passing `decoder_input_ids` you can choose to directly pass an embedded representation. If `past_key_values` is used, optionally only the last `decoder_inputs_embeds` have to be @@ -731,6 +709,7 @@ def forward( if input_ids is not None and inputs_embeds is not None: raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") elif input_ids is not None: + self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask) input_shape = input_ids.size() input_ids = input_ids.view(-1, input_shape[-1]) elif inputs_embeds is not None: @@ -749,7 +728,7 @@ def forward( # expand attention_mask if attention_mask is not None: # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] - attention_mask = _expand_mask(attention_mask, inputs_embeds.dtype) + attention_mask = _prepare_4d_attention_mask(attention_mask, inputs_embeds.dtype) encoder_states = () if output_hidden_states else None all_attentions = () if output_attentions else None @@ -765,23 +744,22 @@ def forward( if output_hidden_states: encoder_states = encoder_states + (hidden_states,) # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description) - dropout_probability = random.uniform(0, 1) - if self.training and (dropout_probability < self.layerdrop): # skip the layer + to_drop = False + if self.training: + dropout_probability = torch.rand([]) + if dropout_probability < self.layerdrop: # skip the layer + to_drop = True + + if to_drop: layer_outputs = (None, None) else: if self.gradient_checkpointing and self.training: - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, output_attentions) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(encoder_layer), + layer_outputs = self._gradient_checkpointing_func( + encoder_layer.__call__, hidden_states, attention_mask, (head_mask[idx] if head_mask is not None else None), + output_attentions, ) else: layer_outputs = encoder_layer( @@ -848,27 +826,6 @@ def get_input_embeddings(self): def set_input_embeddings(self, value): self.embed_tokens = value - # Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask - def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length): - # create causal mask - # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] - combined_attention_mask = None - if input_shape[-1] > 1: - combined_attention_mask = _make_causal_mask( - input_shape, inputs_embeds.dtype, past_key_values_length=past_key_values_length - ).to(inputs_embeds.device) 
- - if attention_mask is not None: - # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] - expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to( - inputs_embeds.device - ) - combined_attention_mask = ( - expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask - ) - - return combined_attention_mask - def forward( self, input_ids=None, @@ -936,11 +893,11 @@ def forward( If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of - all `decoder_input_ids` of shape `(batch_size, sequence_length)`. inputs_embeds (`torch.FloatTensor` of - shape `(batch_size, sequence_length, hidden_size)`, *optional*): Optionally, instead of passing - `input_ids` you can choose to directly pass an embedded representation. This is useful if you want more - control over how to convert `input_ids` indices into associated vectors than the model's internal - embedding lookup matrix. + all `decoder_input_ids` of shape `(batch_size, sequence_length)`. + inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. + This is useful if you want more control over how to convert `input_ids` indices into associated vectors + than the model's internal embedding lookup matrix. output_attentions (`bool`, *optional*): Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail. @@ -974,14 +931,16 @@ def forward( if inputs_embeds is None: inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale - attention_mask = self._prepare_decoder_attention_mask( + attention_mask = _prepare_4d_causal_attention_mask( attention_mask, input_shape, inputs_embeds, past_key_values_length ) # expand encoder attention mask if encoder_hidden_states is not None and encoder_attention_mask is not None: # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] - encoder_attention_mask = _expand_mask(encoder_attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]) + encoder_attention_mask = _prepare_4d_attention_mask( + encoder_attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1] + ) # embed positions positions = self.embed_positions(input_shape, past_key_values_length) @@ -990,6 +949,12 @@ def forward( hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training) + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning( + "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." 
+ ) + use_cache = False # decoder layers all_hidden_states = () if output_hidden_states else None all_self_attns = () if output_attentions else None @@ -1008,28 +973,16 @@ def forward( # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description) if output_hidden_states: all_hidden_states += (hidden_states,) - dropout_probability = random.uniform(0, 1) - if self.training and (dropout_probability < self.layerdrop): - continue + if self.training: + dropout_probability = torch.rand([]) + if dropout_probability < self.layerdrop: + continue past_key_value = past_key_values[idx] if past_key_values is not None else None if self.gradient_checkpointing and self.training: - if use_cache: - logger.warning( - "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." - ) - use_cache = False - - def create_custom_forward(module): - def custom_forward(*inputs): - # None for past_key_value - return module(*inputs, output_attentions, use_cache) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(decoder_layer), + layer_outputs = self._gradient_checkpointing_func( + decoder_layer.__call__, hidden_states, attention_mask, encoder_hidden_states, @@ -1037,6 +990,8 @@ def custom_forward(*inputs): head_mask[idx] if head_mask is not None else None, cross_attn_head_mask[idx] if cross_attn_head_mask is not None else None, None, + output_attentions, + use_cache, ) else: layer_outputs = decoder_layer( @@ -1091,7 +1046,7 @@ def custom_forward(*inputs): BLENDERBOT_START_DOCSTRING, ) class BlenderbotModel(BlenderbotPreTrainedModel): - _keys_to_ignore_on_load_missing = ["decoder.embed_tokens.weight", "encoder.embed_tokens.weight"] + _tied_weights_keys = ["decoder.embed_tokens.weight", "encoder.embed_tokens.weight"] def __init__(self, config: BlenderbotConfig): super().__init__(config) @@ -1232,14 +1187,8 @@ def forward( ) class BlenderbotForConditionalGeneration(BlenderbotPreTrainedModel): base_model_prefix = "model" - _keys_to_ignore_on_load_missing = [ - r"final_logits_bias", - r"encoder.version", - r"decoder.version", - r"lm_head.weight", - "decoder.embed_tokens.weight", - "encoder.embed_tokens.weight", - ] + _keys_to_ignore_on_load_missing = ["final_logits_bias"] + _tied_weights_keys = ["decoder.embed_tokens.weight", "encoder.embed_tokens.weight", "lm_head.weight"] def __init__(self, config: BlenderbotConfig): super().__init__(config) @@ -1271,9 +1220,9 @@ def get_encoder(self): def get_decoder(self): return self.model.get_decoder() - def resize_token_embeddings(self, new_num_tokens: int) -> nn.Embedding: - new_embeddings = super().resize_token_embeddings(new_num_tokens) - self._resize_final_logits_bias(new_num_tokens) + def resize_token_embeddings(self, new_num_tokens: int, pad_to_multiple_of: Optional[int] = None) -> nn.Embedding: + new_embeddings = super().resize_token_embeddings(new_num_tokens, pad_to_multiple_of) + self._resize_final_logits_bias(new_embeddings.weight.shape[0]) return new_embeddings def _resize_final_logits_bias(self, new_num_tokens: int) -> None: @@ -1386,7 +1335,16 @@ def prepare_inputs_for_generation( ): # cut decoder_input_ids if past is used if past_key_values is not None: - decoder_input_ids = decoder_input_ids[:, -1:] + past_length = past_key_values[0][0].shape[2] + + # Some generation methods already pass only the last input ID + if decoder_input_ids.shape[1] > past_length: + remove_prefix_length = past_length + else: + # Default to old behavior: keep only final ID + remove_prefix_length = 
decoder_input_ids.shape[1] - 1 + + decoder_input_ids = decoder_input_ids[:, remove_prefix_length:] return { "input_ids": None, # encoder_outputs is defined. input_ids not needed @@ -1406,7 +1364,8 @@ def _reorder_cache(past_key_values, beam_idx): for layer_past in past_key_values: # cached cross_attention states don't have to be reordered -> they are always the same reordered_past += ( - tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:], + tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past[:2]) + + layer_past[2:], ) return reordered_past @@ -1428,7 +1387,7 @@ def forward(self, *args, **kwargs): # Copied from transformers.models.bart.modeling_bart.BartForCausalLM with Bart->Blenderbot, facebook/bart-base->facebook/blenderbot-400M-distill class BlenderbotForCausalLM(BlenderbotPreTrainedModel): - _keys_to_ignore_on_load_missing = ["lm_head.weight"] + _tied_weights_keys = ["lm_head.weight"] def __init__(self, config): config = copy.deepcopy(config) @@ -1551,9 +1510,7 @@ def forward( >>> from transformers import AutoTokenizer, BlenderbotForCausalLM >>> tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill") - >>> model = BlenderbotForCausalLM.from_pretrained( - ... "facebook/blenderbot-400M-distill", add_cross_attention=False - ... ) + >>> model = BlenderbotForCausalLM.from_pretrained("facebook/blenderbot-400M-distill", add_cross_attention=False) >>> assert model.config.is_decoder, f"{model.__class__} has to be configured as a decoder." >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt") >>> outputs = model(**inputs) @@ -1590,6 +1547,7 @@ def forward( loss = None if labels is not None: + labels = labels.to(logits.device) loss_fct = CrossEntropyLoss() loss = loss_fct(logits.view(-1, self.config.vocab_size), labels.view(-1)) @@ -1614,7 +1572,16 @@ def prepare_inputs_for_generation( attention_mask = input_ids.new_ones(input_ids.shape) if past_key_values: - input_ids = input_ids[:, -1:] + past_length = past_key_values[0][0].shape[2] + + # Some generation methods already pass only the last input ID + if input_ids.shape[1] > past_length: + remove_prefix_length = past_length + else: + # Default to old behavior: keep only final ID + remove_prefix_length = input_ids.shape[1] - 1 + + input_ids = input_ids[:, remove_prefix_length:] # first step, decoder_cached_states are empty return { "input_ids": input_ids, # encoder_outputs is defined. 
input_ids not needed @@ -1627,5 +1594,7 @@ def prepare_inputs_for_generation( def _reorder_cache(past_key_values, beam_idx): reordered_past = () for layer_past in past_key_values: - reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),) + reordered_past += ( + tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past), + ) return reordered_past diff --git a/src/transformers/models/blenderbot/modeling_flax_blenderbot.py b/src/transformers/models/blenderbot/modeling_flax_blenderbot.py index 629ddb99a80c5c..61239335be3b63 100644 --- a/src/transformers/models/blenderbot/modeling_flax_blenderbot.py +++ b/src/transformers/models/blenderbot/modeling_flax_blenderbot.py @@ -22,7 +22,6 @@ import flax.linen as nn import jax import jax.numpy as jnp -import numpy as np from flax.core.frozen_dict import FrozenDict, freeze, unfreeze from flax.linen import combine_masks, make_causal_mask from flax.linen.attention import dot_product_attention_weights @@ -205,15 +204,15 @@ # Copied from transformers.models.bart.modeling_flax_bart.shift_tokens_right -def shift_tokens_right(input_ids: np.array, pad_token_id: int, decoder_start_token_id: int) -> np.ndarray: +def shift_tokens_right(input_ids: jnp.ndarray, pad_token_id: int, decoder_start_token_id: int) -> jnp.ndarray: """ Shift input ids one token to the right. """ - shifted_input_ids = np.zeros_like(input_ids) - shifted_input_ids[:, 1:] = input_ids[:, :-1] - shifted_input_ids[:, 0] = decoder_start_token_id + shifted_input_ids = jnp.zeros_like(input_ids) + shifted_input_ids = shifted_input_ids.at[:, 1:].set(input_ids[:, :-1]) + shifted_input_ids = shifted_input_ids.at[:, 0].set(decoder_start_token_id) - shifted_input_ids = np.where(shifted_input_ids == -100, pad_token_id, shifted_input_ids) + shifted_input_ids = jnp.where(shifted_input_ids == -100, pad_token_id, shifted_input_ids) return shifted_input_ids @@ -1444,8 +1443,8 @@ def prepare_inputs_for_generation( self, decoder_input_ids, max_length, - attention_mask: Optional[jnp.DeviceArray] = None, - decoder_attention_mask: Optional[jnp.DeviceArray] = None, + attention_mask: Optional[jax.Array] = None, + decoder_attention_mask: Optional[jax.Array] = None, encoder_outputs=None, **kwargs, ): diff --git a/src/transformers/models/blenderbot/modeling_tf_blenderbot.py b/src/transformers/models/blenderbot/modeling_tf_blenderbot.py index 6b95bd56739adc..ccb07d20ecf97d 100644 --- a/src/transformers/models/blenderbot/modeling_tf_blenderbot.py +++ b/src/transformers/models/blenderbot/modeling_tf_blenderbot.py @@ -15,6 +15,8 @@ """ TF 2.0 Blenderbot model.""" +from __future__ import annotations + import os import random import warnings @@ -32,15 +34,14 @@ # Public API from ...modeling_tf_utils import ( - DUMMY_INPUTS, TFCausalLanguageModelingLoss, TFPreTrainedModel, + keras, keras_serializable, unpack_inputs, ) -from ...tf_utils import shape_list, stable_softmax +from ...tf_utils import check_embeddings_within_bounds, shape_list, stable_softmax from ...utils import ( - ContextManagers, add_code_sample_docstrings, add_end_docstrings, add_start_docstrings, @@ -117,7 +118,7 @@ def _expand_mask(mask: tf.Tensor, tgt_len: Optional[int] = None): return (one_cst - expanded_mask) * LARGE_NEGATIVE -class TFBlenderbotLearnedPositionalEmbedding(tf.keras.layers.Embedding): +class TFBlenderbotLearnedPositionalEmbedding(keras.layers.Embedding): """ This module learns positional embeddings up to a fixed maximum size. 
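The `from __future__ import annotations` line added at the top of this TF module is what makes the `tf.Tensor | None` annotations used throughout these signatures safe on Python versions without runtime `X | Y` unions: annotations become plain strings and are never evaluated. A self-contained sketch of the idea (the class is a stand-in, not a transformers type):

```python
from __future__ import annotations


class FakeTensor:  # stand-in for tf.Tensor, only for this sketch
    pass


def call(hidden_states: FakeTensor, attention_mask: FakeTensor | None = None) -> FakeTensor | None:
    # With PEP 563 semantics the annotations above are stored as strings,
    # so the `|` union never has to be evaluated at import time.
    return attention_mask


print(call(FakeTensor()))  # None
```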
""" @@ -126,7 +127,7 @@ def __init__(self, num_embeddings: int, embedding_dim: int, **kwargs): super().__init__(num_embeddings, embedding_dim, **kwargs) def call( - self, input_shape: tf.TensorShape, past_key_values_length: int = 0, position_ids: Optional[tf.Tensor] = None + self, input_shape: tf.TensorShape, past_key_values_length: int = 0, position_ids: tf.Tensor | None = None ): """Input is expected to be of size [bsz x seqlen].""" if position_ids is None: @@ -138,7 +139,7 @@ def call( # Copied from transformers.models.bart.modeling_tf_bart.TFBartAttention with Bart->Blenderbot -class TFBlenderbotAttention(tf.keras.layers.Layer): +class TFBlenderbotAttention(keras.layers.Layer): """Multi-headed attention from "Attention Is All You Need""" def __init__( @@ -154,7 +155,7 @@ def __init__( self.embed_dim = embed_dim self.num_heads = num_heads - self.dropout = tf.keras.layers.Dropout(dropout) + self.dropout = keras.layers.Dropout(dropout) self.head_dim = embed_dim // num_heads if (self.head_dim * num_heads) != self.embed_dim: raise ValueError( @@ -164,10 +165,10 @@ def __init__( self.scaling = self.head_dim**-0.5 self.is_decoder = is_decoder - self.k_proj = tf.keras.layers.Dense(embed_dim, use_bias=bias, name="k_proj") - self.q_proj = tf.keras.layers.Dense(embed_dim, use_bias=bias, name="q_proj") - self.v_proj = tf.keras.layers.Dense(embed_dim, use_bias=bias, name="v_proj") - self.out_proj = tf.keras.layers.Dense(embed_dim, use_bias=bias, name="out_proj") + self.k_proj = keras.layers.Dense(embed_dim, use_bias=bias, name="k_proj") + self.q_proj = keras.layers.Dense(embed_dim, use_bias=bias, name="q_proj") + self.v_proj = keras.layers.Dense(embed_dim, use_bias=bias, name="v_proj") + self.out_proj = keras.layers.Dense(embed_dim, use_bias=bias, name="out_proj") def _shape(self, tensor: tf.Tensor, seq_len: int, bsz: int): return tf.transpose(tf.reshape(tensor, (bsz, seq_len, self.num_heads, self.head_dim)), (0, 2, 1, 3)) @@ -175,12 +176,12 @@ def _shape(self, tensor: tf.Tensor, seq_len: int, bsz: int): def call( self, hidden_states: tf.Tensor, - key_value_states: Optional[tf.Tensor] = None, - past_key_value: Optional[Tuple[Tuple[tf.Tensor]]] = None, - attention_mask: Optional[tf.Tensor] = None, - layer_head_mask: Optional[tf.Tensor] = None, + key_value_states: tf.Tensor | None = None, + past_key_value: Tuple[Tuple[tf.Tensor]] | None = None, + attention_mask: tf.Tensor | None = None, + layer_head_mask: tf.Tensor | None = None, training: Optional[bool] = False, - ) -> Tuple[tf.Tensor, Optional[tf.Tensor]]: + ) -> Tuple[tf.Tensor, tf.Tensor | None]: """Input shape: Batch x Time x Channel""" # if key_value_states are provided this layer is used as a cross-attention layer @@ -290,22 +291,40 @@ def call( return attn_output, attn_weights, past_key_value + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "k_proj", None) is not None: + with tf.name_scope(self.k_proj.name): + self.k_proj.build([None, None, self.embed_dim]) + if getattr(self, "q_proj", None) is not None: + with tf.name_scope(self.q_proj.name): + self.q_proj.build([None, None, self.embed_dim]) + if getattr(self, "v_proj", None) is not None: + with tf.name_scope(self.v_proj.name): + self.v_proj.build([None, None, self.embed_dim]) + if getattr(self, "out_proj", None) is not None: + with tf.name_scope(self.out_proj.name): + self.out_proj.build([None, None, self.embed_dim]) + # Copied from transformers.models.mbart.modeling_tf_mbart.TFMBartEncoderLayer with MBart->Blenderbot -class 
TFBlenderbotEncoderLayer(tf.keras.layers.Layer): +class TFBlenderbotEncoderLayer(keras.layers.Layer): def __init__(self, config: BlenderbotConfig, **kwargs): super().__init__(**kwargs) self.embed_dim = config.d_model self.self_attn = TFBlenderbotAttention( self.embed_dim, config.encoder_attention_heads, dropout=config.attention_dropout, name="self_attn" ) - self.self_attn_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="self_attn_layer_norm") - self.dropout = tf.keras.layers.Dropout(config.dropout) + self.self_attn_layer_norm = keras.layers.LayerNormalization(epsilon=1e-5, name="self_attn_layer_norm") + self.dropout = keras.layers.Dropout(config.dropout) self.activation_fn = get_tf_activation(config.activation_function) - self.activation_dropout = tf.keras.layers.Dropout(config.activation_dropout) - self.fc1 = tf.keras.layers.Dense(config.encoder_ffn_dim, name="fc1") - self.fc2 = tf.keras.layers.Dense(self.embed_dim, name="fc2") - self.final_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="final_layer_norm") + self.activation_dropout = keras.layers.Dropout(config.activation_dropout) + self.fc1 = keras.layers.Dense(config.encoder_ffn_dim, name="fc1") + self.fc2 = keras.layers.Dense(self.embed_dim, name="fc2") + self.final_layer_norm = keras.layers.LayerNormalization(epsilon=1e-5, name="final_layer_norm") + self.config = config def call( self, @@ -316,7 +335,7 @@ def call( ): """ Args: - hidden_states (`tf.Tensor`): input to the layer of shape *(seq_len, batch, embed_dim)* + hidden_states (`tf.Tensor`): input to the layer of shape *(batch, seq_len, embed_dim)* attention_mask (`tf.Tensor`): attention mask of size *(batch, 1, tgt_len, src_len)* where padding elements are indicated by very large negative values. layer_head_mask (`tf.Tensor`): mask for attention heads in a given layer of size @@ -347,9 +366,29 @@ def call( return hidden_states, self_attn_weights + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "self_attn", None) is not None: + with tf.name_scope(self.self_attn.name): + self.self_attn.build(None) + if getattr(self, "self_attn_layer_norm", None) is not None: + with tf.name_scope(self.self_attn_layer_norm.name): + self.self_attn_layer_norm.build([None, None, self.embed_dim]) + if getattr(self, "fc1", None) is not None: + with tf.name_scope(self.fc1.name): + self.fc1.build([None, None, self.embed_dim]) + if getattr(self, "fc2", None) is not None: + with tf.name_scope(self.fc2.name): + self.fc2.build([None, None, self.config.encoder_ffn_dim]) + if getattr(self, "final_layer_norm", None) is not None: + with tf.name_scope(self.final_layer_norm.name): + self.final_layer_norm.build([None, None, self.embed_dim]) + # Copied from transformers.models.mbart.modeling_tf_mbart.TFMBartDecoderLayer with MBart->Blenderbot -class TFBlenderbotDecoderLayer(tf.keras.layers.Layer): +class TFBlenderbotDecoderLayer(keras.layers.Layer): def __init__(self, config: BlenderbotConfig, **kwargs): super().__init__(**kwargs) self.embed_dim = config.d_model @@ -360,11 +399,11 @@ def __init__(self, config: BlenderbotConfig, **kwargs): name="self_attn", is_decoder=True, ) - self.dropout = tf.keras.layers.Dropout(config.dropout) + self.dropout = keras.layers.Dropout(config.dropout) self.activation_fn = get_tf_activation(config.activation_function) - self.activation_dropout = tf.keras.layers.Dropout(config.activation_dropout) + self.activation_dropout = keras.layers.Dropout(config.activation_dropout) - self.self_attn_layer_norm 
= tf.keras.layers.LayerNormalization(epsilon=1e-5, name="self_attn_layer_norm") + self.self_attn_layer_norm = keras.layers.LayerNormalization(epsilon=1e-5, name="self_attn_layer_norm") self.encoder_attn = TFBlenderbotAttention( self.embed_dim, config.decoder_attention_heads, @@ -372,29 +411,30 @@ def __init__(self, config: BlenderbotConfig, **kwargs): name="encoder_attn", is_decoder=True, ) - self.encoder_attn_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="encoder_attn_layer_norm") - self.fc1 = tf.keras.layers.Dense(config.decoder_ffn_dim, name="fc1") - self.fc2 = tf.keras.layers.Dense(self.embed_dim, name="fc2") - self.final_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="final_layer_norm") + self.encoder_attn_layer_norm = keras.layers.LayerNormalization(epsilon=1e-5, name="encoder_attn_layer_norm") + self.fc1 = keras.layers.Dense(config.decoder_ffn_dim, name="fc1") + self.fc2 = keras.layers.Dense(self.embed_dim, name="fc2") + self.final_layer_norm = keras.layers.LayerNormalization(epsilon=1e-5, name="final_layer_norm") + self.config = config def call( self, hidden_states: tf.Tensor, - attention_mask: Optional[tf.Tensor] = None, - encoder_hidden_states: Optional[tf.Tensor] = None, - encoder_attention_mask: Optional[tf.Tensor] = None, - layer_head_mask: Optional[tf.Tensor] = None, - cross_attn_layer_head_mask: Optional[tf.Tensor] = None, - past_key_value: Optional[Tuple[tf.Tensor]] = None, + attention_mask: tf.Tensor | None = None, + encoder_hidden_states: tf.Tensor | None = None, + encoder_attention_mask: tf.Tensor | None = None, + layer_head_mask: tf.Tensor | None = None, + cross_attn_layer_head_mask: tf.Tensor | None = None, + past_key_value: Tuple[tf.Tensor] | None = None, training: Optional[bool] = False, ) -> Tuple[tf.Tensor, tf.Tensor, Tuple[Tuple[tf.Tensor]]]: """ Args: - hidden_states (`tf.Tensor`): input to the layer of shape *(seq_len, batch, embed_dim)* + hidden_states (`tf.Tensor`): input to the layer of shape *(batch, seq_len, embed_dim)* attention_mask (`tf.Tensor`): attention mask of size *(batch, 1, tgt_len, src_len)* where padding elements are indicated by very large negative values. encoder_hidden_states (`tf.Tensor`): - cross attention input to the layer of shape *(seq_len, batch, embed_dim)* + cross attention input to the layer of shape *(batch, seq_len, embed_dim)* encoder_attention_mask (`tf.Tensor`): encoder attention mask of size *(batch, 1, tgt_len, src_len)* where padding elements are indicated by very large negative values. 
layer_head_mask (`tf.Tensor`): mask for attention heads in a given layer of size @@ -457,46 +497,44 @@ def call( present_key_value, ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "self_attn", None) is not None: + with tf.name_scope(self.self_attn.name): + self.self_attn.build(None) + if getattr(self, "self_attn_layer_norm", None) is not None: + with tf.name_scope(self.self_attn_layer_norm.name): + self.self_attn_layer_norm.build([None, None, self.embed_dim]) + if getattr(self, "encoder_attn", None) is not None: + with tf.name_scope(self.encoder_attn.name): + self.encoder_attn.build(None) + if getattr(self, "encoder_attn_layer_norm", None) is not None: + with tf.name_scope(self.encoder_attn_layer_norm.name): + self.encoder_attn_layer_norm.build([None, None, self.embed_dim]) + if getattr(self, "fc1", None) is not None: + with tf.name_scope(self.fc1.name): + self.fc1.build([None, None, self.embed_dim]) + if getattr(self, "fc2", None) is not None: + with tf.name_scope(self.fc2.name): + self.fc2.build([None, None, self.config.decoder_ffn_dim]) + if getattr(self, "final_layer_norm", None) is not None: + with tf.name_scope(self.final_layer_norm.name): + self.final_layer_norm.build([None, None, self.embed_dim]) + class TFBlenderbotPreTrainedModel(TFPreTrainedModel): config_class = BlenderbotConfig base_model_prefix = "model" - @property - def dummy_inputs(self): - pad_token = 1 - input_ids = tf.convert_to_tensor(DUMMY_INPUTS, dtype=tf.int32) - decoder_input_ids = tf.convert_to_tensor(DUMMY_INPUTS, dtype=tf.int32) - dummy_inputs = { - "decoder_input_ids": decoder_input_ids, - "attention_mask": tf.cast(input_ids != pad_token, tf.int32), - "input_ids": input_ids, - } - return dummy_inputs - - @tf.function( - input_signature=[ - { - "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"), - "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"), - "decoder_input_ids": tf.TensorSpec((None, None), tf.int32, name="decoder_input_ids"), - "decoder_attention_mask": tf.TensorSpec((None, None), tf.int32, name="decoder_attention_mask"), - } - ] - ) - # Copied from transformers.models.bart.modeling_tf_bart.TFBartPretrainedModel.serving - def serving(self, inputs): - output = self.call(inputs) - - return self.serving_output(output) - BLENDERBOT_START_DOCSTRING = r""" This model inherits from [`TFPreTrainedModel`]. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.) - This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it + This model is also a [keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage and behavior. @@ -640,7 +678,7 @@ def serving(self, inputs): @keras_serializable -class TFBlenderbotEncoder(tf.keras.layers.Layer): +class TFBlenderbotEncoder(keras.layers.Layer): config_class = BlenderbotConfig """ Transformer encoder consisting of *config.encoder_layers* self attention layers. 
Each layer is a @@ -650,10 +688,10 @@ class TFBlenderbotEncoder(tf.keras.layers.Layer): config: BlenderbotConfig """ - def __init__(self, config: BlenderbotConfig, embed_tokens: Optional[tf.keras.layers.Embedding] = None, **kwargs): + def __init__(self, config: BlenderbotConfig, embed_tokens: Optional[keras.layers.Embedding] = None, **kwargs): super().__init__(**kwargs) self.config = config - self.dropout = tf.keras.layers.Dropout(config.dropout) + self.dropout = keras.layers.Dropout(config.dropout) self.layerdrop = config.encoder_layerdrop self.padding_idx = config.pad_token_id self.max_source_positions = config.max_position_embeddings @@ -666,7 +704,7 @@ def __init__(self, config: BlenderbotConfig, embed_tokens: Optional[tf.keras.lay name="embed_positions", ) self.layers = [TFBlenderbotEncoderLayer(config, name=f"layers.{i}") for i in range(config.encoder_layers)] - self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="layer_norm") + self.layer_norm = keras.layers.LayerNormalization(epsilon=1e-5, name="layer_norm") def get_embed_tokens(self): return self.embed_tokens @@ -738,25 +776,8 @@ def call( raise ValueError("You have to specify either input_ids or inputs_embeds") if inputs_embeds is None: - # if `self.embed_tokens.load_weight_prefix` is set, runs the embedding operation with the correct name - # scope, so that its weights are registered with the desired name for loading/storing. When `tf.name_scope` - # is used with a name ending in `/`, that name replaces the current name scope. - # (embeddings with tf.name_scope: self.embed_tokens.load_weight_prefix/self.embed_tokens.name/embeddings:0) - context = [] - if hasattr(self.embed_tokens, "load_weight_prefix"): - context.append(tf.name_scope(self.embed_tokens.load_weight_prefix + "/")) - with ContextManagers(context): - # Note: tf.gather, on which the embedding layer is based, won't check positive out of bound - # indices on GPU, returning zeros instead. This is a dangerous silent behavior. - tf.debugging.assert_less( - input_ids, - tf.cast(self.embed_tokens.input_dim, dtype=input_ids.dtype), - message=( - "input_ids must be smaller than the embedding layer's input dimension (got" - f" {tf.math.reduce_max(input_ids)} >= {self.embed_tokens.input_dim})" - ), - ) - inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale + check_embeddings_within_bounds(input_ids, self.embed_tokens.input_dim) + inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale embed_pos = self.embed_positions(input_shape) hidden_states = inputs_embeds + embed_pos @@ -812,9 +833,24 @@ def call( last_hidden_state=hidden_states, hidden_states=encoder_states, attentions=all_attentions ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "embed_positions", None) is not None: + with tf.name_scope(self.embed_positions.name): + self.embed_positions.build(None) + if getattr(self, "layer_norm", None) is not None: + with tf.name_scope(self.layer_norm.name): + self.layer_norm.build([None, None, self.config.d_model]) + if getattr(self, "layers", None) is not None: + for layer in self.layers: + with tf.name_scope(layer.name): + layer.build(None) + @keras_serializable -class TFBlenderbotDecoder(tf.keras.layers.Layer): +class TFBlenderbotDecoder(keras.layers.Layer): config_class = BlenderbotConfig """ Transformer decoder consisting of *config.decoder_layers* layers. 
Each layer is a [`TFBlenderbotDecoderLayer`] @@ -824,7 +860,7 @@ class TFBlenderbotDecoder(tf.keras.layers.Layer): embed_tokens: output embedding """ - def __init__(self, config: BlenderbotConfig, embed_tokens: Optional[tf.keras.layers.Embedding] = None, **kwargs): + def __init__(self, config: BlenderbotConfig, embed_tokens: Optional[keras.layers.Embedding] = None, **kwargs): super().__init__(**kwargs) self.config = config self.padding_idx = config.pad_token_id @@ -837,9 +873,9 @@ def __init__(self, config: BlenderbotConfig, embed_tokens: Optional[tf.keras.lay ) self.embed_scale = tf.math.sqrt(float(config.d_model)) if config.scale_embedding else 1.0 self.layers = [TFBlenderbotDecoderLayer(config, name=f"layers.{i}") for i in range(config.decoder_layers)] - self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="layer_norm") + self.layer_norm = keras.layers.LayerNormalization(epsilon=1e-5, name="layer_norm") - self.dropout = tf.keras.layers.Dropout(config.dropout) + self.dropout = keras.layers.Dropout(config.dropout) def get_embed_tokens(self): return self.embed_tokens @@ -914,11 +950,11 @@ def call( If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of - all `decoder_input_ids` of shape `(batch_size, sequence_length)`. inputs_embeds (`tf.Tensor` of shape - `(batch_size, sequence_length, hidden_size)`, *optional*): Optionally, instead of passing `input_ids` - you can choose to directly pass an embedded representation. This is useful if you want more control - over how to convert `input_ids` indices into associated vectors than the model's internal embedding - lookup matrix. + all `decoder_input_ids` of shape `(batch_size, sequence_length)`. + inputs_embeds (`tf.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. + This is useful if you want more control over how to convert `input_ids` indices into associated vectors + than the model's internal embedding lookup matrix. output_attentions (`bool`, *optional*): Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail. This argument can be used only in eager mode, in graph mode the value @@ -952,21 +988,8 @@ def call( positions = self.embed_positions(input_shape, position_ids=position_ids) if inputs_embeds is None: - context = [] - if hasattr(self.embed_tokens, "load_weight_prefix"): - context.append(tf.name_scope(self.embed_tokens.load_weight_prefix + "/")) - with ContextManagers(context): - # Note: tf.gather, on which the embedding layer is based, won't check positive out of bound - # indices on GPU, returning zeros instead. This is a dangerous silent behavior. 
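The inline `tf.debugging.assert_less` guard deleted here now lives in `check_embeddings_within_bounds` from `...tf_utils`. A minimal sketch of the check it performs, mirroring the removed code (the helper's exact error message may differ):

```python
import tensorflow as tf


def check_ids(input_ids: tf.Tensor, embedding_input_dim: int) -> None:
    # tf.gather silently returns zeros for out-of-bound indices on GPU, so the ids are
    # asserted to be smaller than the embedding table size before the lookup.
    tf.debugging.assert_less(
        input_ids,
        tf.cast(embedding_input_dim, dtype=input_ids.dtype),
        message="input_ids must be smaller than the embedding layer's input dimension",
    )


check_ids(tf.constant([[1, 2, 3]]), embedding_input_dim=10)  # passes silently
```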
- tf.debugging.assert_less( - input_ids, - tf.cast(self.embed_tokens.input_dim, dtype=input_ids.dtype), - message=( - "input_ids must be smaller than the embedding layer's input dimension (got" - f" {tf.math.reduce_max(input_ids)} >= {self.embed_tokens.input_dim})" - ), - ) - inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale + check_embeddings_within_bounds(input_ids, self.embed_tokens.input_dim) + inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale hidden_states = inputs_embeds @@ -1051,19 +1074,34 @@ def call( cross_attentions=all_cross_attns, ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "embed_positions", None) is not None: + with tf.name_scope(self.embed_positions.name): + self.embed_positions.build(None) + if getattr(self, "layer_norm", None) is not None: + with tf.name_scope(self.layer_norm.name): + self.layer_norm.build([None, None, self.config.d_model]) + if getattr(self, "layers", None) is not None: + for layer in self.layers: + with tf.name_scope(layer.name): + layer.build(None) + @keras_serializable -class TFBlenderbotMainLayer(tf.keras.layers.Layer): +class TFBlenderbotMainLayer(keras.layers.Layer): config_class = BlenderbotConfig def __init__(self, config: BlenderbotConfig, **kwargs): super().__init__(**kwargs) self.config = config - self.shared = tf.keras.layers.Embedding( + self.shared = keras.layers.Embedding( input_dim=config.vocab_size, output_dim=config.d_model, - embeddings_initializer=tf.keras.initializers.TruncatedNormal(stddev=self.config.init_std), + embeddings_initializer=keras.initializers.TruncatedNormal(stddev=self.config.init_std), name="model.shared", ) # Additional attribute to specify the expected name scope of the layer (for loading/storing weights) @@ -1159,6 +1197,22 @@ def call( encoder_attentions=encoder_outputs.attentions, ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + # The shared/tied weights expect to be in the model base namespace + # Adding "/" to the end (not the start!) of a tf.name_scope puts it in the root namespace rather than + # the current one. 
+ with tf.name_scope(self.shared.load_weight_prefix + "/" + self.shared.name + "/"): + self.shared.build(None) + if getattr(self, "encoder", None) is not None: + with tf.name_scope(self.encoder.name): + self.encoder.build(None) + if getattr(self, "decoder", None) is not None: + with tf.name_scope(self.decoder.name): + self.decoder.build(None) + @add_start_docstrings( "The bare BLENDERBOT Model outputting raw hidden-states without any specific head on top.", @@ -1201,18 +1255,18 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P ) def call( self, - input_ids: Optional[tf.Tensor] = None, - attention_mask: Optional[tf.Tensor] = None, - decoder_input_ids: Optional[tf.Tensor] = None, - decoder_attention_mask: Optional[tf.Tensor] = None, - decoder_position_ids: Optional[tf.Tensor] = None, - head_mask: Optional[tf.Tensor] = None, - decoder_head_mask: Optional[tf.Tensor] = None, - cross_attn_head_mask: Optional[tf.Tensor] = None, + input_ids: tf.Tensor | None = None, + attention_mask: tf.Tensor | None = None, + decoder_input_ids: tf.Tensor | None = None, + decoder_attention_mask: tf.Tensor | None = None, + decoder_position_ids: tf.Tensor | None = None, + head_mask: tf.Tensor | None = None, + decoder_head_mask: tf.Tensor | None = None, + cross_attn_head_mask: tf.Tensor | None = None, encoder_outputs: Optional[Union[Tuple, TFBaseModelOutput]] = None, - past_key_values: Optional[List[tf.Tensor]] = None, - inputs_embeds: Optional[tf.Tensor] = None, - decoder_inputs_embeds: Optional[tf.Tensor] = None, + past_key_values: List[tf.Tensor] | None = None, + inputs_embeds: tf.Tensor | None = None, + decoder_inputs_embeds: tf.Tensor | None = None, use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, @@ -1262,11 +1316,19 @@ def serving_output(self, output): encoder_attentions=enc_attns, ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "model", None) is not None: + with tf.name_scope(self.model.name): + self.model.build(None) + # Copied from transformers.models.bart.modeling_tf_bart.BiasLayer -class BiasLayer(tf.keras.layers.Layer): +class BiasLayer(keras.layers.Layer): """ - Bias as a layer. It is used for serialization purposes: `tf.keras.Model.save_weights` stores on a per-layer basis, + Bias as a layer. It is used for serialization purposes: `keras.Model.save_weights` stores on a per-layer basis, so all weights have to be registered in a layer. 
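`BiasLayer` exists only so the standalone `final_logits_bias` vector is registered on a layer and therefore saved and restored by `keras.Model.save_weights`. A small sketch of the pattern (initializer and trainability chosen for illustration; the real layer is copied from the BART implementation):

```python
import tensorflow as tf
from tensorflow import keras


class BiasLayer(keras.layers.Layer):
    """Wraps a bare bias vector in a Layer so keras weight (de)serialization can see it."""

    def __init__(self, shape, **kwargs):
        super().__init__(**kwargs)
        self.bias = self.add_weight(name="bias", shape=shape, initializer="zeros", trainable=False)

    def call(self, x):
        return x + self.bias


bias_layer = BiasLayer(shape=[1, 8], name="final_logits_bias")
print(bias_layer(tf.zeros((2, 8))).shape)  # (2, 8)
```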
""" @@ -1345,23 +1407,23 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P @add_end_docstrings(BLENDERBOT_GENERATION_EXAMPLE) def call( self, - input_ids: Optional[tf.Tensor] = None, - attention_mask: Optional[tf.Tensor] = None, - decoder_input_ids: Optional[tf.Tensor] = None, - decoder_attention_mask: Optional[tf.Tensor] = None, - decoder_position_ids: Optional[tf.Tensor] = None, - head_mask: Optional[tf.Tensor] = None, - decoder_head_mask: Optional[tf.Tensor] = None, - cross_attn_head_mask: Optional[tf.Tensor] = None, + input_ids: tf.Tensor | None = None, + attention_mask: tf.Tensor | None = None, + decoder_input_ids: tf.Tensor | None = None, + decoder_attention_mask: tf.Tensor | None = None, + decoder_position_ids: tf.Tensor | None = None, + head_mask: tf.Tensor | None = None, + decoder_head_mask: tf.Tensor | None = None, + cross_attn_head_mask: tf.Tensor | None = None, encoder_outputs: Optional[Union[Tuple, TFBaseModelOutput]] = None, - past_key_values: Optional[List[tf.Tensor]] = None, - inputs_embeds: Optional[tf.Tensor] = None, - decoder_inputs_embeds: Optional[tf.Tensor] = None, + past_key_values: List[tf.Tensor] | None = None, + inputs_embeds: tf.Tensor | None = None, + decoder_inputs_embeds: tf.Tensor | None = None, use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - labels: Optional[tf.Tensor] = None, + labels: tf.Tensor | None = None, training: Optional[bool] = False, ) -> Union[Tuple[tf.Tensor], TFSeq2SeqLMOutput]: r""" @@ -1481,3 +1543,14 @@ def prepare_inputs_for_generation( "cross_attn_head_mask": cross_attn_head_mask, "use_cache": use_cache, # change this to avoid caching (presumably for debugging) } + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "model", None) is not None: + with tf.name_scope(self.model.name): + self.model.build(None) + if getattr(self, "bias_layer", None) is not None: + with tf.name_scope(self.bias_layer.name): + self.bias_layer.build(None) diff --git a/src/transformers/models/blenderbot/tokenization_blenderbot.py b/src/transformers/models/blenderbot/tokenization_blenderbot.py index 208ced46bc2db8..29386c1233adf0 100644 --- a/src/transformers/models/blenderbot/tokenization_blenderbot.py +++ b/src/transformers/models/blenderbot/tokenization_blenderbot.py @@ -17,7 +17,7 @@ import json import os from functools import lru_cache -from typing import TYPE_CHECKING, List, Optional, Tuple +from typing import List, Optional, Tuple import regex as re @@ -25,9 +25,6 @@ from ...utils import logging -if TYPE_CHECKING: - from transformers.pipelines.conversational import Conversation - logger = logging.get_logger(__name__) @@ -96,13 +93,15 @@ class BlenderbotTokenizer(PreTrainedTokenizer): This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will be encoded differently whether it is at the beginning of the sentence (without space) or not: - ``` + ```python >>> from transformers import BlenderbotTokenizer + >>> tokenizer = BlenderbotTokenizer.from_pretrained("facebook/blenderbot-3B") >>> tokenizer.add_prefix_space = False - >>> tokenizer("Hello world")['input_ids'] + >>> tokenizer("Hello world")["input_ids"] [47, 921, 86, 1085, 2] - >>> tokenizer(" Hello world")['input_ids'] + + >>> tokenizer(" Hello world")["input_ids"] [6950, 1085, 2] ``` @@ -188,28 +187,21 @@ def __init__( **kwargs, ): bos_token = 
AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token + pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token + unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token sep_token = AddedToken(sep_token, lstrip=False, rstrip=False) if isinstance(sep_token, str) else sep_token cls_token = AddedToken(cls_token, lstrip=False, rstrip=False) if isinstance(cls_token, str) else cls_token - unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token - pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token # Mask token behave like a normal word, i.e. include the space before it - mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token - - super().__init__( - errors=errors, - bos_token=bos_token, - eos_token=eos_token, - unk_token=unk_token, - sep_token=sep_token, - cls_token=cls_token, - pad_token=pad_token, - mask_token=mask_token, - add_prefix_space=add_prefix_space, - **kwargs, + mask_token = ( + AddedToken(mask_token, lstrip=True, rstrip=False, normalized=False) + if isinstance(mask_token, str) + else mask_token ) + # these special tokens are not part of the vocab.json, let's add them in the correct order + with open(vocab_file, encoding="utf-8") as vocab_handle: self.encoder = json.load(vocab_handle) self.decoder = {v: k for k, v in self.encoder.items()} @@ -226,6 +218,19 @@ def __init__( # Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""") + super().__init__( + errors=errors, + bos_token=bos_token, + eos_token=eos_token, + unk_token=unk_token, + sep_token=sep_token, + cls_token=cls_token, + pad_token=pad_token, + mask_token=mask_token, + add_prefix_space=add_prefix_space, + **kwargs, + ) + @property # Copied from transformers.models.roberta.tokenization_roberta.RobertaTokenizer.vocab_size with Roberta->Blenderbot, RoBERTa->Blenderbot def vocab_size(self): @@ -233,7 +238,9 @@ def vocab_size(self): # Copied from transformers.models.roberta.tokenization_roberta.RobertaTokenizer.get_vocab with Roberta->Blenderbot, RoBERTa->Blenderbot def get_vocab(self): - return dict(self.encoder, **self.added_tokens_encoder) + vocab = dict(self.encoder).copy() + vocab.update(self.added_tokens_encoder) + return vocab # Copied from transformers.models.roberta.tokenization_roberta.RobertaTokenizer.bpe with Roberta->Blenderbot, RoBERTa->Blenderbot def bpe(self, token): @@ -369,8 +376,8 @@ def create_token_type_ids_from_sequences( self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None ) -> List[int]: """ - Create a mask from the two sequences passed to be used in a sequence-pair classification task. Blenderbot does - not make use of token type ids, therefore a list of zeros is returned. + Create a mask from the two sequences passed to be used in a sequence-pair classification task. Blenderbot does not + make use of token type ids, therefore a list of zeros is returned. 
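A hedged illustration of the `AddedToken` wrapping performed in `__init__` above: plain string special tokens are made explicit about stripping and normalization, and the mask token uses `lstrip=True` so it absorbs the space before it (the `<mask>` and `<pad>` literals below are illustrative defaults):

```python
from tokenizers import AddedToken  # also re-exported as transformers.AddedToken


def as_added_token(token, lstrip=False, rstrip=False, normalized=True):
    """Wrap plain strings so their stripping/normalization behaviour is explicit."""
    if isinstance(token, str):
        return AddedToken(token, lstrip=lstrip, rstrip=rstrip, normalized=normalized)
    return token


pad_token = as_added_token("<pad>")
# lstrip=True makes the mask token greedily include the preceding space, which matters for
# fill-mask style prompts; normalized=False keeps the literal "<mask>" form.
mask_token = as_added_token("<mask>", lstrip=True, normalized=False)
print(mask_token.lstrip, mask_token.normalized)  # True False
```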
Args: token_ids_0 (`List[int]`): @@ -411,19 +418,22 @@ def build_inputs_with_special_tokens(self, token_ids_0: List[int], token_ids_1: """ return token_ids_0 + [self.eos_token_id] - def _build_conversation_input_ids(self, conversation: "Conversation") -> List[int]: - inputs = [] - for is_user, text in conversation.iter_texts(): - if is_user: - # We need to space prefix as it's being done within blenderbot - inputs.append(" " + text) - else: - # Generated responses should contain them already. - inputs.append(text) - - full_string = " ".join(inputs) - input_ids = self.encode(full_string) - if len(input_ids) > self.model_max_length: - input_ids = input_ids[-self.model_max_length :] - logger.warning(f"Trimmed input from conversation as it was longer than {self.model_max_length} tokens.") - return input_ids + @property + def default_chat_template(self): + """ + A very simple chat template that just adds whitespace between messages. + """ + logger.warning_once( + "\nNo chat template is defined for this tokenizer - using the default template " + f"for the {self.__class__.__name__} class. If the default is not appropriate for " + "your model, please set `tokenizer.chat_template` to an appropriate template. " + "See https://huggingface.co/docs/transformers/main/chat_templating for more information.\n" + ) + return ( + "{% for message in messages %}" + "{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}" + "{{ message['content'] }}" + "{% if not loop.last %}{{ ' ' }}{% endif %}" + "{% endfor %}" + "{{ eos_token }}" + ) diff --git a/src/transformers/models/blenderbot/tokenization_blenderbot_fast.py b/src/transformers/models/blenderbot/tokenization_blenderbot_fast.py index 7c4e060e5d2035..6245025b503d53 100644 --- a/src/transformers/models/blenderbot/tokenization_blenderbot_fast.py +++ b/src/transformers/models/blenderbot/tokenization_blenderbot_fast.py @@ -14,7 +14,7 @@ # limitations under the License. """Fast Tokenization class for Blenderbot.""" import json -from typing import TYPE_CHECKING, List, Optional, Tuple +from typing import List, Optional, Tuple from tokenizers import pre_tokenizers, processors @@ -24,9 +24,6 @@ from .tokenization_blenderbot import BlenderbotTokenizer -if TYPE_CHECKING: - from transformers.pipelines.conversational import Conversation - logger = logging.get_logger(__name__) @@ -55,12 +52,14 @@ class BlenderbotTokenizerFast(PreTrainedTokenizerFast): This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will be encoded differently whether it is at the beginning of the sentence (without space) or not: - ``` + ```python >>> from transformers import BlenderbotTokenizerFast + >>> tokenizer = BlenderbotTokenizerFast.from_pretrained("facebook/blenderbot-3B") - >>> tokenizer("Hello world")['input_ids'] + >>> tokenizer("Hello world")["input_ids"] [6950, 1085, 2] - >>> tokenizer(" Hello world")['input_ids'] + + >>> tokenizer(" Hello world")["input_ids"] [6950, 1085, 2] ``` @@ -150,6 +149,11 @@ def __init__( trim_offsets=True, **kwargs, ): + mask_token = ( + AddedToken(mask_token, lstrip=True, rstrip=False, normalized=False) + if isinstance(mask_token, str) + else mask_token + ) super().__init__( vocab_file, merges_file, @@ -208,8 +212,8 @@ def mask_token(self) -> str: `str`: Mask token, to use when training a model with masked-language modeling. Log an error if used while not having been set. - Blenderbot tokenizer has a special mask token to be usable in the fill-mask pipeline. 
The mask token will - greedily comprise the space before the **. + Blenderbot tokenizer has a special mask token to be usable in the fill-mask pipeline. The mask token will greedily + comprise the space before the **. """ if self._mask_token is None: if self.verbose: @@ -260,8 +264,8 @@ def create_token_type_ids_from_sequences( self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None ) -> List[int]: """ - Create a mask from the two sequences passed to be used in a sequence-pair classification task. Blenderbot does - not make use of token type ids, therefore a list of zeros is returned. + Create a mask from the two sequences passed to be used in a sequence-pair classification task. Blenderbot does not + make use of token type ids, therefore a list of zeros is returned. Args: token_ids_0 (`List[int]`): @@ -295,19 +299,23 @@ def build_inputs_with_special_tokens(self, token_ids_0: List[int], token_ids_1: """ return token_ids_0 + [self.eos_token_id] - def _build_conversation_input_ids(self, conversation: "Conversation") -> List[int]: - inputs = [] - for is_user, text in conversation.iter_texts(): - if is_user: - # We need to space prefix as it's being done within blenderbot - inputs.append(" " + text) - else: - # Generated responses should contain them already. - inputs.append(text) - - full_string = " ".join(inputs) - input_ids = self.encode(full_string) - if len(input_ids) > self.model_max_length: - input_ids = input_ids[-self.model_max_length :] - logger.warning(f"Trimmed input from conversation as it was longer than {self.model_max_length} tokens.") - return input_ids + @property + # Copied from transformers.models.blenderbot.tokenization_blenderbot.BlenderbotTokenizer.default_chat_template + def default_chat_template(self): + """ + A very simple chat template that just adds whitespace between messages. + """ + logger.warning_once( + "\nNo chat template is defined for this tokenizer - using the default template " + f"for the {self.__class__.__name__} class. If the default is not appropriate for " + "your model, please set `tokenizer.chat_template` to an appropriate template. 
" + "See https://huggingface.co/docs/transformers/main/chat_templating for more information.\n" + ) + return ( + "{% for message in messages %}" + "{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}" + "{{ message['content'] }}" + "{% if not loop.last %}{{ ' ' }}{% endif %}" + "{% endfor %}" + "{{ eos_token }}" + ) diff --git a/src/transformers/models/blenderbot_small/configuration_blenderbot_small.py b/src/transformers/models/blenderbot_small/configuration_blenderbot_small.py index fbc23435d66f31..b41330656d39ab 100644 --- a/src/transformers/models/blenderbot_small/configuration_blenderbot_small.py +++ b/src/transformers/models/blenderbot_small/configuration_blenderbot_small.py @@ -104,6 +104,7 @@ class BlenderbotSmallConfig(PretrainedConfig): >>> # Accessing the model configuration >>> configuration = model.config ```""" + model_type = "blenderbot-small" keys_to_ignore_at_inference = ["past_key_values"] attribute_map = {"num_attention_heads": "encoder_attention_heads", "hidden_size": "d_model"} diff --git a/src/transformers/models/blenderbot_small/modeling_blenderbot_small.py b/src/transformers/models/blenderbot_small/modeling_blenderbot_small.py index 9927d6ab4e9247..f9a9508e590557 100755 --- a/src/transformers/models/blenderbot_small/modeling_blenderbot_small.py +++ b/src/transformers/models/blenderbot_small/modeling_blenderbot_small.py @@ -17,7 +17,6 @@ import copy import math -import random from typing import List, Optional, Tuple, Union import torch @@ -26,6 +25,7 @@ from torch.nn import CrossEntropyLoss from ...activations import ACT2FN +from ...modeling_attn_mask_utils import _prepare_4d_attention_mask, _prepare_4d_causal_attention_mask from ...modeling_outputs import ( BaseModelOutput, BaseModelOutputWithPastAndCrossAttentions, @@ -72,37 +72,6 @@ def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int, decoder_start return shifted_input_ids -# Copied from transformers.models.bart.modeling_bart._make_causal_mask -def _make_causal_mask(input_ids_shape: torch.Size, dtype: torch.dtype, past_key_values_length: int = 0): - """ - Make causal mask used for bi-directional self-attention. - """ - bsz, tgt_len = input_ids_shape - mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min)) - mask_cond = torch.arange(mask.size(-1)) - mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0) - mask = mask.to(dtype) - - if past_key_values_length > 0: - mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype), mask], dim=-1) - return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length) - - -# Copied from transformers.models.bart.modeling_bart._expand_mask -def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None): - """ - Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`. 
- """ - bsz, src_len = mask.size() - tgt_len = tgt_len if tgt_len is not None else src_len - - expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype) - - inverted_mask = 1.0 - expanded_mask - - return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min) - - # Copied from transformers.models.blenderbot.modeling_blenderbot.BlenderbotLearnedPositionalEmbedding with Blenderbot->BlenderbotSmall class BlenderbotSmallLearnedPositionalEmbedding(nn.Embedding): """ @@ -132,12 +101,15 @@ def __init__( dropout: float = 0.0, is_decoder: bool = False, bias: bool = True, + is_causal: bool = False, + config: Optional[BlenderbotSmallConfig] = None, ): super().__init__() self.embed_dim = embed_dim self.num_heads = num_heads self.dropout = dropout self.head_dim = embed_dim // num_heads + self.config = config if (self.head_dim * num_heads) != self.embed_dim: raise ValueError( @@ -146,6 +118,7 @@ def __init__( ) self.scaling = self.head_dim**-0.5 self.is_decoder = is_decoder + self.is_causal = is_causal self.k_proj = nn.Linear(embed_dim, embed_dim, bias=bias) self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias) @@ -213,8 +186,8 @@ def forward( proj_shape = (bsz * self.num_heads, -1, self.head_dim) query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape) - key_states = key_states.view(*proj_shape) - value_states = value_states.view(*proj_shape) + key_states = key_states.reshape(*proj_shape) + value_states = value_states.reshape(*proj_shape) src_len = key_states.size(1) attn_weights = torch.bmm(query_states, key_states.transpose(1, 2)) @@ -260,7 +233,7 @@ def forward( if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim): raise ValueError( - f"`attn_output` should be of size {(bsz, self.num_heads, tgt_len, self.head_dim)}, but is" + f"`attn_output` should be of size {(bsz * self.num_heads, tgt_len, self.head_dim)}, but is" f" {attn_output.size()}" ) @@ -268,7 +241,7 @@ def forward( attn_output = attn_output.transpose(1, 2) # Use the `embed_dim` from the config (stored in the class) rather than `hidden_state` because `attn_output` can be - # partitioned aross GPUs when using tensor-parallelism. + # partitioned across GPUs when using tensor-parallelism. 
attn_output = attn_output.reshape(bsz, tgt_len, self.embed_dim) attn_output = self.out_proj(attn_output) @@ -276,15 +249,17 @@ def forward( return attn_output, attn_weights_reshaped, past_key_value -# Copied from transformers.models.bart.modeling_bart.BartEncoderLayer with Bart->BlenderbotSmall +# Copied from transformers.models.bart.modeling_bart.BartEncoderLayer with Bart->BlenderbotSmall, BART->BLENDERBOT_SMALL class BlenderbotSmallEncoderLayer(nn.Module): def __init__(self, config: BlenderbotSmallConfig): super().__init__() self.embed_dim = config.d_model - self.self_attn = BlenderbotSmallAttention( + + self.self_attn = BLENDERBOT_SMALL_ATTENTION_CLASSES[config._attn_implementation]( embed_dim=self.embed_dim, num_heads=config.encoder_attention_heads, dropout=config.attention_dropout, + config=config, ) self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim) self.dropout = config.dropout @@ -303,7 +278,7 @@ def forward( ) -> Tuple[torch.FloatTensor, Optional[torch.FloatTensor]]: """ Args: - hidden_states (`torch.FloatTensor`): input to the layer of shape `(seq_len, batch, embed_dim)` + hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)` attention_mask (`torch.FloatTensor`): attention mask of size `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. layer_head_mask (`torch.FloatTensor`): mask for attention heads in a given layer of size @@ -345,28 +320,37 @@ def forward( return outputs -# Copied from transformers.models.bart.modeling_bart.BartDecoderLayer with Bart->BlenderbotSmall +# TODO: Implement attention with SDPA for TimeSeriesTransformer. +BLENDERBOT_SMALL_ATTENTION_CLASSES = { + "eager": BlenderbotSmallAttention, +} + + +# Copied from transformers.models.bart.modeling_bart.BartDecoderLayer with Bart->BlenderbotSmall, BART->BLENDERBOT_SMALL class BlenderbotSmallDecoderLayer(nn.Module): def __init__(self, config: BlenderbotSmallConfig): super().__init__() self.embed_dim = config.d_model - self.self_attn = BlenderbotSmallAttention( + self.self_attn = BLENDERBOT_SMALL_ATTENTION_CLASSES[config._attn_implementation]( embed_dim=self.embed_dim, num_heads=config.decoder_attention_heads, dropout=config.attention_dropout, is_decoder=True, + is_causal=True, + config=config, ) self.dropout = config.dropout self.activation_fn = ACT2FN[config.activation_function] self.activation_dropout = config.activation_dropout self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim) - self.encoder_attn = BlenderbotSmallAttention( + self.encoder_attn = BLENDERBOT_SMALL_ATTENTION_CLASSES[config._attn_implementation]( self.embed_dim, config.decoder_attention_heads, dropout=config.attention_dropout, is_decoder=True, + config=config, ) self.encoder_attn_layer_norm = nn.LayerNorm(self.embed_dim) self.fc1 = nn.Linear(self.embed_dim, config.decoder_ffn_dim) @@ -479,10 +463,6 @@ def _init_weights(self, module): if module.padding_idx is not None: module.weight.data[module.padding_idx].zero_() - def _set_gradient_checkpointing(self, module, value=False): - if isinstance(module, (BlenderbotSmallDecoder, BlenderbotSmallEncoder)): - module.gradient_checkpointing = value - @property def dummy_inputs(self): pad_token = self.config.pad_token_id @@ -534,14 +514,14 @@ def dummy_inputs(self): Human: I'm not sure >>> NEXT_UTTERANCE = ( - ... "My friends are cool but they eat too many carbs. what kind of carbs do they eat? " - ... "i don't know much about carbs " - ... " I'm not sure." + ... 
"My friends are cool but they eat too many carbs.__end__ __start__what kind of carbs do they eat? " + ... "i don't know much about carbs__end__ " + ... "__start__ I'm not sure." ... ) >>> inputs = tokenizer([NEXT_UTTERANCE], return_tensors="pt") >>> next_reply_ids = model.generate(**inputs) >>> print("Bot: ", tokenizer.batch_decode(next_reply_ids, skip_special_tokens=True)[0]) - Bot: they eat a lot of carbs. carbs are high in fat, protein, and carbohydrates. + Bot: they eat a lot of carbs. carbs are high in fat, protein, and fats. ``` """ @@ -609,10 +589,11 @@ def dummy_inputs(self): If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all - `decoder_input_ids` of shape `(batch_size, sequence_length)`. inputs_embeds (`torch.FloatTensor` of shape - `(batch_size, sequence_length, hidden_size)`, *optional*): Optionally, instead of passing `input_ids` you - can choose to directly pass an embedded representation. This is useful if you want more control over how to - convert `input_ids` indices into associated vectors than the model's internal embedding lookup matrix. + `decoder_input_ids` of shape `(batch_size, sequence_length)`. + inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. + This is useful if you want more control over how to convert `input_ids` indices into associated vectors + than the model's internal embedding lookup matrix. decoder_inputs_embeds (`torch.FloatTensor` of shape `(batch_size, target_sequence_length, hidden_size)`, *optional*): Optionally, instead of passing `decoder_input_ids` you can choose to directly pass an embedded representation. 
If `past_key_values` is used, optionally only the last `decoder_inputs_embeds` have to be @@ -728,6 +709,7 @@ def forward( if input_ids is not None and inputs_embeds is not None: raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") elif input_ids is not None: + self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask) input_shape = input_ids.size() input_ids = input_ids.view(-1, input_shape[-1]) elif inputs_embeds is not None: @@ -747,7 +729,7 @@ def forward( # expand attention_mask if attention_mask is not None: # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] - attention_mask = _expand_mask(attention_mask, inputs_embeds.dtype) + attention_mask = _prepare_4d_attention_mask(attention_mask, inputs_embeds.dtype) encoder_states = () if output_hidden_states else None all_attentions = () if output_attentions else None @@ -763,23 +745,22 @@ def forward( if output_hidden_states: encoder_states = encoder_states + (hidden_states,) # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description) - dropout_probability = random.uniform(0, 1) - if self.training and (dropout_probability < self.layerdrop): # skip the layer + to_drop = False + if self.training: + dropout_probability = torch.rand([]) + if dropout_probability < self.layerdrop: # skip the layer + to_drop = True + + if to_drop: layer_outputs = (None, None) else: if self.gradient_checkpointing and self.training: - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, output_attentions) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(encoder_layer), + layer_outputs = self._gradient_checkpointing_func( + encoder_layer.__call__, hidden_states, attention_mask, (head_mask[idx] if head_mask is not None else None), + output_attentions, ) else: layer_outputs = encoder_layer( @@ -843,27 +824,6 @@ def get_input_embeddings(self): def set_input_embeddings(self, value): self.embed_tokens = value - # Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask - def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length): - # create causal mask - # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] - combined_attention_mask = None - if input_shape[-1] > 1: - combined_attention_mask = _make_causal_mask( - input_shape, inputs_embeds.dtype, past_key_values_length=past_key_values_length - ).to(inputs_embeds.device) - - if attention_mask is not None: - # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] - expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to( - inputs_embeds.device - ) - combined_attention_mask = ( - expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask - ) - - return combined_attention_mask - def forward( self, input_ids=None, @@ -930,11 +890,11 @@ def forward( If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of - all `decoder_input_ids` of shape `(batch_size, sequence_length)`. inputs_embeds (`torch.FloatTensor` of - shape `(batch_size, sequence_length, hidden_size)`, *optional*): Optionally, instead of passing - `input_ids` you can choose to directly pass an embedded representation. 
This is useful if you want more - control over how to convert `input_ids` indices into associated vectors than the model's internal - embedding lookup matrix. + all `decoder_input_ids` of shape `(batch_size, sequence_length)`. + inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. + This is useful if you want more control over how to convert `input_ids` indices into associated vectors + than the model's internal embedding lookup matrix. output_attentions (`bool`, *optional*): Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail. @@ -968,14 +928,16 @@ def forward( if inputs_embeds is None: inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale - attention_mask = self._prepare_decoder_attention_mask( + attention_mask = _prepare_4d_causal_attention_mask( attention_mask, input_shape, inputs_embeds, past_key_values_length ) # expand encoder attention mask if encoder_hidden_states is not None and encoder_attention_mask is not None: # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] - encoder_attention_mask = _expand_mask(encoder_attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]) + encoder_attention_mask = _prepare_4d_attention_mask( + encoder_attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1] + ) # embed positions positions = self.embed_positions(input_shape, past_key_values_length) @@ -986,6 +948,13 @@ def forward( hidden_states = nn.functional.dropout(hidden_states, p=self.dropout, training=self.training) + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning_once( + "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." + ) + use_cache = False + # decoder layers all_hidden_states = () if output_hidden_states else None all_self_attns = () if output_attentions else None @@ -1004,28 +973,16 @@ def forward( # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description) if output_hidden_states: all_hidden_states += (hidden_states,) - dropout_probability = random.uniform(0, 1) - if self.training and (dropout_probability < self.layerdrop): - continue + if self.training: + dropout_probability = torch.rand([]) + if dropout_probability < self.layerdrop: + continue past_key_value = past_key_values[idx] if past_key_values is not None else None if self.gradient_checkpointing and self.training: - if use_cache: - logger.warning( - "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." 
- ) - use_cache = False - - def create_custom_forward(module): - def custom_forward(*inputs): - # None for past_key_value - return module(*inputs, output_attentions, use_cache) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(decoder_layer), + layer_outputs = self._gradient_checkpointing_func( + decoder_layer.__call__, hidden_states, attention_mask, encoder_hidden_states, @@ -1033,6 +990,8 @@ def custom_forward(*inputs): head_mask[idx] if head_mask is not None else None, cross_attn_head_mask[idx] if cross_attn_head_mask is not None else None, None, + output_attentions, + use_cache, ) else: layer_outputs = decoder_layer( @@ -1084,7 +1043,7 @@ def custom_forward(*inputs): BLENDERBOT_SMALL_START_DOCSTRING, ) class BlenderbotSmallModel(BlenderbotSmallPreTrainedModel): - _keys_to_ignore_on_load_missing = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight"] + _tied_weights_keys = ["decoder.embed_tokens.weight", "encoder.embed_tokens.weight"] def __init__(self, config: BlenderbotSmallConfig): super().__init__(config) @@ -1213,14 +1172,8 @@ def forward( ) class BlenderbotSmallForConditionalGeneration(BlenderbotSmallPreTrainedModel): base_model_prefix = "model" - _keys_to_ignore_on_load_missing = [ - r"final_logits_bias", - r"encoder.version", - r"decoder.version", - r"lm_head.weight", - "encoder.embed_tokens.weight", - "decoder.embed_tokens.weight", - ] + _keys_to_ignore_on_load_missing = ["final_logits_bias"] + _tied_weights_keys = ["decoder.embed_tokens.weight", "encoder.embed_tokens.weight", "lm_head.weight"] def __init__(self, config: BlenderbotSmallConfig): super().__init__(config) @@ -1237,9 +1190,9 @@ def get_encoder(self): def get_decoder(self): return self.model.get_decoder() - def resize_token_embeddings(self, new_num_tokens: int) -> nn.Embedding: - new_embeddings = super().resize_token_embeddings(new_num_tokens) - self._resize_final_logits_bias(new_num_tokens) + def resize_token_embeddings(self, new_num_tokens: int, pad_to_multiple_of: Optional[int] = None) -> nn.Embedding: + new_embeddings = super().resize_token_embeddings(new_num_tokens, pad_to_multiple_of) + self._resize_final_logits_bias(new_embeddings.weight.shape[0]) return new_embeddings def _resize_final_logits_bias(self, new_num_tokens: int) -> None: @@ -1352,7 +1305,16 @@ def prepare_inputs_for_generation( ): # cut decoder_input_ids if past is used if past_key_values is not None: - decoder_input_ids = decoder_input_ids[:, -1:] + past_length = past_key_values[0][0].shape[2] + + # Some generation methods already pass only the last input ID + if decoder_input_ids.shape[1] > past_length: + remove_prefix_length = past_length + else: + # Default to old behavior: keep only final ID + remove_prefix_length = decoder_input_ids.shape[1] - 1 + + decoder_input_ids = decoder_input_ids[:, remove_prefix_length:] return { "input_ids": None, # encoder_outputs is defined. 
input_ids not needed @@ -1372,7 +1334,8 @@ def _reorder_cache(past_key_values, beam_idx): for layer_past in past_key_values: # cached cross_attention states don't have to be reordered -> they are always the same reordered_past += ( - tuple(past_state.index_select(0, beam_idx) for past_state in layer_past[:2]) + layer_past[2:], + tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past[:2]) + + layer_past[2:], ) return reordered_past @@ -1394,7 +1357,7 @@ def forward(self, *args, **kwargs): # Copied from transformers.models.bart.modeling_bart.BartForCausalLM with Bart->BlenderbotSmall, facebook/bart-base->facebook/blenderbot_small-90M class BlenderbotSmallForCausalLM(BlenderbotSmallPreTrainedModel): - _keys_to_ignore_on_load_missing = ["lm_head.weight"] + _tied_weights_keys = ["lm_head.weight"] def __init__(self, config): config = copy.deepcopy(config) @@ -1517,9 +1480,7 @@ def forward( >>> from transformers import AutoTokenizer, BlenderbotSmallForCausalLM >>> tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot_small-90M") - >>> model = BlenderbotSmallForCausalLM.from_pretrained( - ... "facebook/blenderbot_small-90M", add_cross_attention=False - ... ) + >>> model = BlenderbotSmallForCausalLM.from_pretrained("facebook/blenderbot_small-90M", add_cross_attention=False) >>> assert model.config.is_decoder, f"{model.__class__} has to be configured as a decoder." >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt") >>> outputs = model(**inputs) @@ -1556,6 +1517,7 @@ def forward( loss = None if labels is not None: + labels = labels.to(logits.device) loss_fct = CrossEntropyLoss() loss = loss_fct(logits.view(-1, self.config.vocab_size), labels.view(-1)) @@ -1580,7 +1542,16 @@ def prepare_inputs_for_generation( attention_mask = input_ids.new_ones(input_ids.shape) if past_key_values: - input_ids = input_ids[:, -1:] + past_length = past_key_values[0][0].shape[2] + + # Some generation methods already pass only the last input ID + if input_ids.shape[1] > past_length: + remove_prefix_length = past_length + else: + # Default to old behavior: keep only final ID + remove_prefix_length = input_ids.shape[1] - 1 + + input_ids = input_ids[:, remove_prefix_length:] # first step, decoder_cached_states are empty return { "input_ids": input_ids, # encoder_outputs is defined. 
input_ids not needed @@ -1593,5 +1564,7 @@ def prepare_inputs_for_generation( def _reorder_cache(past_key_values, beam_idx): reordered_past = () for layer_past in past_key_values: - reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),) + reordered_past += ( + tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past), + ) return reordered_past diff --git a/src/transformers/models/blenderbot_small/modeling_flax_blenderbot_small.py b/src/transformers/models/blenderbot_small/modeling_flax_blenderbot_small.py index 226e401c921ea5..b5272fb3bca9e2 100644 --- a/src/transformers/models/blenderbot_small/modeling_flax_blenderbot_small.py +++ b/src/transformers/models/blenderbot_small/modeling_flax_blenderbot_small.py @@ -23,7 +23,6 @@ import flax.linen as nn import jax import jax.numpy as jnp -import numpy as np from flax.core.frozen_dict import FrozenDict, freeze, unfreeze from flax.linen import combine_masks, make_causal_mask from flax.linen.attention import dot_product_attention_weights @@ -221,11 +220,11 @@ def shift_tokens_right(input_ids: jnp.ndarray, pad_token_id: int, decoder_start_ """ Shift input ids one token to the right. """ - shifted_input_ids = np.zeros_like(input_ids) - shifted_input_ids[:, 1:] = input_ids[:, :-1] - shifted_input_ids[:, 0] = decoder_start_token_id + shifted_input_ids = jnp.zeros_like(input_ids) + shifted_input_ids = shifted_input_ids.at[:, 1:].set(input_ids[:, :-1]) + shifted_input_ids = shifted_input_ids.at[:, 0].set(decoder_start_token_id) - shifted_input_ids = np.where(shifted_input_ids == -100, pad_token_id, shifted_input_ids) + shifted_input_ids = jnp.where(shifted_input_ids == -100, pad_token_id, shifted_input_ids) return shifted_input_ids @@ -1442,8 +1441,8 @@ def prepare_inputs_for_generation( self, decoder_input_ids, max_length, - attention_mask: Optional[jnp.DeviceArray] = None, - decoder_attention_mask: Optional[jnp.DeviceArray] = None, + attention_mask: Optional[jax.Array] = None, + decoder_attention_mask: Optional[jax.Array] = None, encoder_outputs=None, **kwargs, ): diff --git a/src/transformers/models/blenderbot_small/modeling_tf_blenderbot_small.py b/src/transformers/models/blenderbot_small/modeling_tf_blenderbot_small.py index 3d521ea77a4d67..01206831ac96c3 100644 --- a/src/transformers/models/blenderbot_small/modeling_tf_blenderbot_small.py +++ b/src/transformers/models/blenderbot_small/modeling_tf_blenderbot_small.py @@ -15,6 +15,8 @@ """ TF 2.0 BlenderbotSmall model.""" +from __future__ import annotations + import random from typing import List, Optional, Tuple, Union @@ -31,15 +33,14 @@ # Public API from ...modeling_tf_utils import ( - DUMMY_INPUTS, TFCausalLanguageModelingLoss, TFPreTrainedModel, + keras, keras_serializable, unpack_inputs, ) -from ...tf_utils import shape_list, stable_softmax +from ...tf_utils import check_embeddings_within_bounds, shape_list, stable_softmax from ...utils import ( - ContextManagers, add_code_sample_docstrings, add_end_docstrings, add_start_docstrings, @@ -117,7 +118,7 @@ def _expand_mask(mask: tf.Tensor, tgt_len: Optional[int] = None): # Copied from transformers.models.blenderbot.modeling_tf_blenderbot.TFBlenderbotLearnedPositionalEmbedding with Blenderbot->BlenderbotSmall -class TFBlenderbotSmallLearnedPositionalEmbedding(tf.keras.layers.Embedding): +class TFBlenderbotSmallLearnedPositionalEmbedding(keras.layers.Embedding): """ This module learns positional embeddings up to a fixed maximum size. 
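The Flax `shift_tokens_right` change above swaps NumPy-style in-place assignment for JAX's functional updates, since JAX arrays are immutable. A small hedged illustration with toy ids (pad and start token ids are placeholders):

```python
import jax.numpy as jnp

pad_token_id, decoder_start_token_id = 0, 1
input_ids = jnp.array([[5, 6, 7, -100]])

# Slices are updated through .at[...].set(...) instead of item assignment.
shifted = jnp.zeros_like(input_ids)
shifted = shifted.at[:, 1:].set(input_ids[:, :-1])
shifted = shifted.at[:, 0].set(decoder_start_token_id)
shifted = jnp.where(shifted == -100, pad_token_id, shifted)
print(shifted)  # [[1 5 6 7]]
```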
""" @@ -126,7 +127,7 @@ def __init__(self, num_embeddings: int, embedding_dim: int, **kwargs): super().__init__(num_embeddings, embedding_dim, **kwargs) def call( - self, input_shape: tf.TensorShape, past_key_values_length: int = 0, position_ids: Optional[tf.Tensor] = None + self, input_shape: tf.TensorShape, past_key_values_length: int = 0, position_ids: tf.Tensor | None = None ): """Input is expected to be of size [bsz x seqlen].""" if position_ids is None: @@ -138,7 +139,7 @@ def call( # Copied from transformers.models.bart.modeling_tf_bart.TFBartAttention with Bart->BlenderbotSmall -class TFBlenderbotSmallAttention(tf.keras.layers.Layer): +class TFBlenderbotSmallAttention(keras.layers.Layer): """Multi-headed attention from "Attention Is All You Need""" def __init__( @@ -154,7 +155,7 @@ def __init__( self.embed_dim = embed_dim self.num_heads = num_heads - self.dropout = tf.keras.layers.Dropout(dropout) + self.dropout = keras.layers.Dropout(dropout) self.head_dim = embed_dim // num_heads if (self.head_dim * num_heads) != self.embed_dim: raise ValueError( @@ -164,10 +165,10 @@ def __init__( self.scaling = self.head_dim**-0.5 self.is_decoder = is_decoder - self.k_proj = tf.keras.layers.Dense(embed_dim, use_bias=bias, name="k_proj") - self.q_proj = tf.keras.layers.Dense(embed_dim, use_bias=bias, name="q_proj") - self.v_proj = tf.keras.layers.Dense(embed_dim, use_bias=bias, name="v_proj") - self.out_proj = tf.keras.layers.Dense(embed_dim, use_bias=bias, name="out_proj") + self.k_proj = keras.layers.Dense(embed_dim, use_bias=bias, name="k_proj") + self.q_proj = keras.layers.Dense(embed_dim, use_bias=bias, name="q_proj") + self.v_proj = keras.layers.Dense(embed_dim, use_bias=bias, name="v_proj") + self.out_proj = keras.layers.Dense(embed_dim, use_bias=bias, name="out_proj") def _shape(self, tensor: tf.Tensor, seq_len: int, bsz: int): return tf.transpose(tf.reshape(tensor, (bsz, seq_len, self.num_heads, self.head_dim)), (0, 2, 1, 3)) @@ -175,12 +176,12 @@ def _shape(self, tensor: tf.Tensor, seq_len: int, bsz: int): def call( self, hidden_states: tf.Tensor, - key_value_states: Optional[tf.Tensor] = None, - past_key_value: Optional[Tuple[Tuple[tf.Tensor]]] = None, - attention_mask: Optional[tf.Tensor] = None, - layer_head_mask: Optional[tf.Tensor] = None, + key_value_states: tf.Tensor | None = None, + past_key_value: Tuple[Tuple[tf.Tensor]] | None = None, + attention_mask: tf.Tensor | None = None, + layer_head_mask: tf.Tensor | None = None, training: Optional[bool] = False, - ) -> Tuple[tf.Tensor, Optional[tf.Tensor]]: + ) -> Tuple[tf.Tensor, tf.Tensor | None]: """Input shape: Batch x Time x Channel""" # if key_value_states are provided this layer is used as a cross-attention layer @@ -290,33 +291,51 @@ def call( return attn_output, attn_weights, past_key_value + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "k_proj", None) is not None: + with tf.name_scope(self.k_proj.name): + self.k_proj.build([None, None, self.embed_dim]) + if getattr(self, "q_proj", None) is not None: + with tf.name_scope(self.q_proj.name): + self.q_proj.build([None, None, self.embed_dim]) + if getattr(self, "v_proj", None) is not None: + with tf.name_scope(self.v_proj.name): + self.v_proj.build([None, None, self.embed_dim]) + if getattr(self, "out_proj", None) is not None: + with tf.name_scope(self.out_proj.name): + self.out_proj.build([None, None, self.embed_dim]) + # Copied from transformers.models.bart.modeling_tf_bart.TFBartEncoderLayer with Bart->BlenderbotSmall 
-class TFBlenderbotSmallEncoderLayer(tf.keras.layers.Layer): +class TFBlenderbotSmallEncoderLayer(keras.layers.Layer): def __init__(self, config: BlenderbotSmallConfig, **kwargs): super().__init__(**kwargs) self.embed_dim = config.d_model self.self_attn = TFBlenderbotSmallAttention( self.embed_dim, config.encoder_attention_heads, dropout=config.attention_dropout, name="self_attn" ) - self.self_attn_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="self_attn_layer_norm") - self.dropout = tf.keras.layers.Dropout(config.dropout) + self.self_attn_layer_norm = keras.layers.LayerNormalization(epsilon=1e-5, name="self_attn_layer_norm") + self.dropout = keras.layers.Dropout(config.dropout) self.activation_fn = get_tf_activation(config.activation_function) - self.activation_dropout = tf.keras.layers.Dropout(config.activation_dropout) - self.fc1 = tf.keras.layers.Dense(config.encoder_ffn_dim, name="fc1") - self.fc2 = tf.keras.layers.Dense(self.embed_dim, name="fc2") - self.final_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="final_layer_norm") + self.activation_dropout = keras.layers.Dropout(config.activation_dropout) + self.fc1 = keras.layers.Dense(config.encoder_ffn_dim, name="fc1") + self.fc2 = keras.layers.Dense(self.embed_dim, name="fc2") + self.final_layer_norm = keras.layers.LayerNormalization(epsilon=1e-5, name="final_layer_norm") + self.config = config def call( self, hidden_states: tf.Tensor, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]], - layer_head_mask: Optional[tf.Tensor], + attention_mask: np.ndarray | tf.Tensor | None, + layer_head_mask: tf.Tensor | None, training: Optional[bool] = False, ) -> tf.Tensor: """ Args: - hidden_states (`tf.Tensor`): input to the layer of shape `(seq_len, batch, embed_dim)` + hidden_states (`tf.Tensor`): input to the layer of shape `(batch, seq_len, embed_dim)` attention_mask (`tf.Tensor`): attention mask of size `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. 
layer_head_mask (`tf.Tensor`): mask for attention heads in a given layer of size @@ -347,9 +366,29 @@ def call( return hidden_states, self_attn_weights + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "self_attn", None) is not None: + with tf.name_scope(self.self_attn.name): + self.self_attn.build(None) + if getattr(self, "self_attn_layer_norm", None) is not None: + with tf.name_scope(self.self_attn_layer_norm.name): + self.self_attn_layer_norm.build([None, None, self.embed_dim]) + if getattr(self, "fc1", None) is not None: + with tf.name_scope(self.fc1.name): + self.fc1.build([None, None, self.embed_dim]) + if getattr(self, "fc2", None) is not None: + with tf.name_scope(self.fc2.name): + self.fc2.build([None, None, self.config.encoder_ffn_dim]) + if getattr(self, "final_layer_norm", None) is not None: + with tf.name_scope(self.final_layer_norm.name): + self.final_layer_norm.build([None, None, self.embed_dim]) + # Copied from transformers.models.bart.modeling_tf_bart.TFBartDecoderLayer with Bart->BlenderbotSmall -class TFBlenderbotSmallDecoderLayer(tf.keras.layers.Layer): +class TFBlenderbotSmallDecoderLayer(keras.layers.Layer): def __init__(self, config: BlenderbotSmallConfig, **kwargs): super().__init__(**kwargs) self.embed_dim = config.d_model @@ -360,11 +399,11 @@ def __init__(self, config: BlenderbotSmallConfig, **kwargs): name="self_attn", is_decoder=True, ) - self.dropout = tf.keras.layers.Dropout(config.dropout) + self.dropout = keras.layers.Dropout(config.dropout) self.activation_fn = get_tf_activation(config.activation_function) - self.activation_dropout = tf.keras.layers.Dropout(config.activation_dropout) + self.activation_dropout = keras.layers.Dropout(config.activation_dropout) - self.self_attn_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="self_attn_layer_norm") + self.self_attn_layer_norm = keras.layers.LayerNormalization(epsilon=1e-5, name="self_attn_layer_norm") self.encoder_attn = TFBlenderbotSmallAttention( self.embed_dim, config.decoder_attention_heads, @@ -372,29 +411,30 @@ def __init__(self, config: BlenderbotSmallConfig, **kwargs): name="encoder_attn", is_decoder=True, ) - self.encoder_attn_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="encoder_attn_layer_norm") - self.fc1 = tf.keras.layers.Dense(config.decoder_ffn_dim, name="fc1") - self.fc2 = tf.keras.layers.Dense(self.embed_dim, name="fc2") - self.final_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="final_layer_norm") + self.encoder_attn_layer_norm = keras.layers.LayerNormalization(epsilon=1e-5, name="encoder_attn_layer_norm") + self.fc1 = keras.layers.Dense(config.decoder_ffn_dim, name="fc1") + self.fc2 = keras.layers.Dense(self.embed_dim, name="fc2") + self.final_layer_norm = keras.layers.LayerNormalization(epsilon=1e-5, name="final_layer_norm") + self.config = config def call( self, hidden_states: tf.Tensor, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - encoder_hidden_states: Optional[Union[np.ndarray, tf.Tensor]] = None, - encoder_attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - layer_head_mask: Optional[tf.Tensor] = None, - cross_attn_layer_head_mask: Optional[tf.Tensor] = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + encoder_hidden_states: np.ndarray | tf.Tensor | None = None, + encoder_attention_mask: np.ndarray | tf.Tensor | None = None, + layer_head_mask: tf.Tensor | None = None, + cross_attn_layer_head_mask: tf.Tensor | None = None, 
past_key_value: Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None, training: Optional[bool] = False, ) -> Tuple[tf.Tensor, tf.Tensor, Tuple[Tuple[tf.Tensor]]]: """ Args: - hidden_states (`tf.Tensor`): input to the layer of shape `(seq_len, batch, embed_dim)` + hidden_states (`tf.Tensor`): input to the layer of shape `(batch, seq_len, embed_dim)` attention_mask (`tf.Tensor`): attention mask of size `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. encoder_hidden_states (`tf.Tensor`): - cross attention input to the layer of shape `(seq_len, batch, embed_dim)` + cross attention input to the layer of shape `(batch, seq_len, embed_dim)` encoder_attention_mask (`tf.Tensor`): encoder attention mask of size `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. layer_head_mask (`tf.Tensor`): mask for attention heads in a given layer of size @@ -457,46 +497,44 @@ def call( present_key_value, ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "self_attn", None) is not None: + with tf.name_scope(self.self_attn.name): + self.self_attn.build(None) + if getattr(self, "self_attn_layer_norm", None) is not None: + with tf.name_scope(self.self_attn_layer_norm.name): + self.self_attn_layer_norm.build([None, None, self.embed_dim]) + if getattr(self, "encoder_attn", None) is not None: + with tf.name_scope(self.encoder_attn.name): + self.encoder_attn.build(None) + if getattr(self, "encoder_attn_layer_norm", None) is not None: + with tf.name_scope(self.encoder_attn_layer_norm.name): + self.encoder_attn_layer_norm.build([None, None, self.embed_dim]) + if getattr(self, "fc1", None) is not None: + with tf.name_scope(self.fc1.name): + self.fc1.build([None, None, self.embed_dim]) + if getattr(self, "fc2", None) is not None: + with tf.name_scope(self.fc2.name): + self.fc2.build([None, None, self.config.decoder_ffn_dim]) + if getattr(self, "final_layer_norm", None) is not None: + with tf.name_scope(self.final_layer_norm.name): + self.final_layer_norm.build([None, None, self.embed_dim]) + class TFBlenderbotSmallPreTrainedModel(TFPreTrainedModel): config_class = BlenderbotSmallConfig base_model_prefix = "model" - @property - def dummy_inputs(self): - pad_token = 1 - input_ids = tf.convert_to_tensor(DUMMY_INPUTS, dtype=tf.int32) - decoder_input_ids = tf.convert_to_tensor(DUMMY_INPUTS, dtype=tf.int32) - dummy_inputs = { - "decoder_input_ids": decoder_input_ids, - "attention_mask": tf.cast(input_ids != pad_token, tf.int32), - "input_ids": input_ids, - } - return dummy_inputs - - @tf.function( - input_signature=[ - { - "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"), - "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"), - "decoder_input_ids": tf.TensorSpec((None, None), tf.int32, name="decoder_input_ids"), - "decoder_attention_mask": tf.TensorSpec((None, None), tf.int32, name="decoder_attention_mask"), - } - ] - ) - # Copied from transformers.models.bart.modeling_tf_bart.TFBartPretrainedModel.serving - def serving(self, inputs): - output = self.call(inputs) - - return self.serving_output(output) - BLENDERBOT_SMALL_START_DOCSTRING = r""" This model inherits from [`TFPreTrainedModel`]. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.) 
- This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it + This model is also a [keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage and behavior. @@ -644,7 +682,7 @@ def serving(self, inputs): @keras_serializable -class TFBlenderbotSmallEncoder(tf.keras.layers.Layer): +class TFBlenderbotSmallEncoder(keras.layers.Layer): config_class = BlenderbotSmallConfig """ Transformer encoder consisting of *config.encoder_layers* self attention layers. Each layer is a @@ -654,12 +692,10 @@ class TFBlenderbotSmallEncoder(tf.keras.layers.Layer): config: BlenderbotSmallConfig """ - def __init__( - self, config: BlenderbotSmallConfig, embed_tokens: Optional[tf.keras.layers.Embedding] = None, **kwargs - ): + def __init__(self, config: BlenderbotSmallConfig, embed_tokens: Optional[keras.layers.Embedding] = None, **kwargs): super().__init__(**kwargs) self.config = config - self.dropout = tf.keras.layers.Dropout(config.dropout) + self.dropout = keras.layers.Dropout(config.dropout) self.layerdrop = config.encoder_layerdrop self.padding_idx = config.pad_token_id self.max_source_positions = config.max_position_embeddings @@ -672,7 +708,8 @@ def __init__( name="embed_positions", ) self.layers = [TFBlenderbotSmallEncoderLayer(config, name=f"layers.{i}") for i in range(config.encoder_layers)] - self.layernorm_embedding = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="layernorm_embedding") + self.layernorm_embedding = keras.layers.LayerNormalization(epsilon=1e-5, name="layernorm_embedding") + self.embed_dim = config.d_model def get_embed_tokens(self): return self.embed_tokens @@ -744,25 +781,8 @@ def call( raise ValueError("You have to specify either input_ids or inputs_embeds") if inputs_embeds is None: - # if `self.embed_tokens.load_weight_prefix` is set, runs the embedding operation with the correct name - # scope, so that its weights are registered with the desired name for loading/storing. When `tf.name_scope` - # is used with a name ending in `/`, that name replaces the current name scope. - # (embeddings with tf.name_scope: self.embed_tokens.load_weight_prefix/self.embed_tokens.name/embeddings:0) - context = [] - if hasattr(self.embed_tokens, "load_weight_prefix"): - context.append(tf.name_scope(self.embed_tokens.load_weight_prefix + "/")) - with ContextManagers(context): - # Note: tf.gather, on which the embedding layer is based, won't check positive out of bound - # indices on GPU, returning zeros instead. This is a dangerous silent behavior. 
- tf.debugging.assert_less( - input_ids, - tf.cast(self.embed_tokens.input_dim, dtype=input_ids.dtype), - message=( - "input_ids must be smaller than the embedding layer's input dimension (got" - f" {tf.math.reduce_max(input_ids)} >= {self.embed_tokens.input_dim})" - ), - ) - inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale + check_embeddings_within_bounds(input_ids, self.embed_tokens.input_dim) + inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale embed_pos = self.embed_positions(input_shape) hidden_states = inputs_embeds + embed_pos @@ -817,9 +837,24 @@ def call( last_hidden_state=hidden_states, hidden_states=encoder_states, attentions=all_attentions ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "embed_positions", None) is not None: + with tf.name_scope(self.embed_positions.name): + self.embed_positions.build(None) + if getattr(self, "layernorm_embedding", None) is not None: + with tf.name_scope(self.layernorm_embedding.name): + self.layernorm_embedding.build([None, None, self.embed_dim]) + if getattr(self, "layers", None) is not None: + for layer in self.layers: + with tf.name_scope(layer.name): + layer.build(None) + @keras_serializable -class TFBlenderbotSmallDecoder(tf.keras.layers.Layer): +class TFBlenderbotSmallDecoder(keras.layers.Layer): config_class = BlenderbotSmallConfig """ Transformer decoder consisting of *config.decoder_layers* layers. Each layer is a [`TFBlenderbotSmallDecoderLayer`] @@ -829,9 +864,7 @@ class TFBlenderbotSmallDecoder(tf.keras.layers.Layer): embed_tokens: output embedding """ - def __init__( - self, config: BlenderbotSmallConfig, embed_tokens: Optional[tf.keras.layers.Embedding] = None, **kwargs - ): + def __init__(self, config: BlenderbotSmallConfig, embed_tokens: Optional[keras.layers.Embedding] = None, **kwargs): super().__init__(**kwargs) self.config = config self.padding_idx = config.pad_token_id @@ -844,9 +877,9 @@ def __init__( ) self.embed_scale = tf.math.sqrt(float(config.d_model)) if config.scale_embedding else 1.0 self.layers = [TFBlenderbotSmallDecoderLayer(config, name=f"layers.{i}") for i in range(config.decoder_layers)] - self.layernorm_embedding = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="layernorm_embedding") + self.layernorm_embedding = keras.layers.LayerNormalization(epsilon=1e-5, name="layernorm_embedding") - self.dropout = tf.keras.layers.Dropout(config.dropout) + self.dropout = keras.layers.Dropout(config.dropout) def get_embed_tokens(self): return self.embed_tokens @@ -921,11 +954,11 @@ def call( If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of - all `decoder_input_ids` of shape `(batch_size, sequence_length)`. inputs_embeds (`tf.Tensor` of shape - `(batch_size, sequence_length, hidden_size)`, *optional*): Optionally, instead of passing `input_ids` - you can choose to directly pass an embedded representation. This is useful if you want more control - over how to convert `input_ids` indices into associated vectors than the model's internal embedding - lookup matrix. + all `decoder_input_ids` of shape `(batch_size, sequence_length)`. + inputs_embeds (`tf.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. 
+ This is useful if you want more control over how to convert `input_ids` indices into associated vectors + than the model's internal embedding lookup matrix. output_attentions (`bool`, *optional*): Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail. This argument can be used only in eager mode, in graph mode the value @@ -953,25 +986,8 @@ def call( past_key_values_length = shape_list(past_key_values[0][0])[2] if past_key_values is not None else 0 if inputs_embeds is None: - # if `self.embed_tokens.load_weight_prefix` is set, runs the embedding operation with the correct name - # scope, so that its weights are registered with the desired name for loading/storing. When `tf.name_scope` - # is used with a name ending in `/`, that name replaces the current name scope. - # (embeddings with tf.name_scope: self.embed_tokens.load_weight_prefix/self.embed_tokens.name/embeddings:0) - context = [] - if hasattr(self.embed_tokens, "load_weight_prefix"): - context.append(tf.name_scope(self.embed_tokens.load_weight_prefix + "/")) - with ContextManagers(context): - # Note: tf.gather, on which the embedding layer is based, won't check positive out of bound - # indices on GPU, returning zeros instead. This is a dangerous silent behavior. - tf.debugging.assert_less( - input_ids, - tf.cast(self.embed_tokens.input_dim, dtype=input_ids.dtype), - message=( - "input_ids must be smaller than the embedding layer's input dimension (got" - f" {tf.math.reduce_max(input_ids)} >= {self.embed_tokens.input_dim})" - ), - ) - inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale + check_embeddings_within_bounds(input_ids, self.embed_tokens.input_dim) + inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] if input_shape[-1] > 1: @@ -1059,19 +1075,34 @@ def call( cross_attentions=all_cross_attns, ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "embed_positions", None) is not None: + with tf.name_scope(self.embed_positions.name): + self.embed_positions.build(None) + if getattr(self, "layernorm_embedding", None) is not None: + with tf.name_scope(self.layernorm_embedding.name): + self.layernorm_embedding.build([None, None, self.config.d_model]) + if getattr(self, "layers", None) is not None: + for layer in self.layers: + with tf.name_scope(layer.name): + layer.build(None) + @keras_serializable -class TFBlenderbotSmallMainLayer(tf.keras.layers.Layer): +class TFBlenderbotSmallMainLayer(keras.layers.Layer): config_class = BlenderbotSmallConfig def __init__(self, config: BlenderbotSmallConfig, **kwargs): super().__init__(**kwargs) self.config = config - self.shared = tf.keras.layers.Embedding( + self.shared = keras.layers.Embedding( input_dim=config.vocab_size, output_dim=config.d_model, - embeddings_initializer=tf.keras.initializers.TruncatedNormal(stddev=self.config.init_std), + embeddings_initializer=keras.initializers.TruncatedNormal(stddev=self.config.init_std), name="model.shared", ) # Additional attribute to specify the expected name scope of the layer (for loading/storing weights) @@ -1167,6 +1198,22 @@ def call( encoder_attentions=encoder_outputs.attentions, ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + # The shared/tied weights expect to be in the model base namespace + # Adding "/" to the end (not the start!) 
of a tf.name_scope puts it in the root namespace rather than + # the current one. + with tf.name_scope(self.shared.load_weight_prefix + "/" + self.shared.name + "/"): + self.shared.build(None) + if getattr(self, "encoder", None) is not None: + with tf.name_scope(self.encoder.name): + self.encoder.build(None) + if getattr(self, "decoder", None) is not None: + with tf.name_scope(self.decoder.name): + self.decoder.build(None) + @add_start_docstrings( "The bare BLENDERBOT_SMALL Model outputting raw hidden-states without any specific head on top.", @@ -1193,18 +1240,18 @@ def get_decoder(self): ) def call( self, - input_ids: Optional[tf.Tensor] = None, - attention_mask: Optional[tf.Tensor] = None, - decoder_input_ids: Optional[tf.Tensor] = None, - decoder_attention_mask: Optional[tf.Tensor] = None, - decoder_position_ids: Optional[tf.Tensor] = None, - head_mask: Optional[tf.Tensor] = None, - decoder_head_mask: Optional[tf.Tensor] = None, - cross_attn_head_mask: Optional[tf.Tensor] = None, + input_ids: tf.Tensor | None = None, + attention_mask: tf.Tensor | None = None, + decoder_input_ids: tf.Tensor | None = None, + decoder_attention_mask: tf.Tensor | None = None, + decoder_position_ids: tf.Tensor | None = None, + head_mask: tf.Tensor | None = None, + decoder_head_mask: tf.Tensor | None = None, + cross_attn_head_mask: tf.Tensor | None = None, encoder_outputs: Optional[Union[Tuple, TFBaseModelOutput]] = None, - past_key_values: Optional[List[tf.Tensor]] = None, - inputs_embeds: Optional[tf.Tensor] = None, - decoder_inputs_embeds: Optional[tf.Tensor] = None, + past_key_values: List[tf.Tensor] | None = None, + inputs_embeds: tf.Tensor | None = None, + decoder_inputs_embeds: tf.Tensor | None = None, use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, @@ -1254,11 +1301,19 @@ def serving_output(self, output): encoder_attentions=enc_attns, ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "model", None) is not None: + with tf.name_scope(self.model.name): + self.model.build(None) + # Copied from transformers.models.bart.modeling_tf_bart.BiasLayer -class BiasLayer(tf.keras.layers.Layer): +class BiasLayer(keras.layers.Layer): """ - Bias as a layer. It is used for serialization purposes: `tf.keras.Model.save_weights` stores on a per-layer basis, + Bias as a layer. It is used for serialization purposes: `keras.Model.save_weights` stores on a per-layer basis, so all weights have to be registered in a layer. 
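The new `build()` overrides all follow the same pattern: return early if already built, then build each sublayer inside its own `tf.name_scope` so variable names stay consistent with the checkpoint layout. A stripped-down illustration of that pattern; every name here is made up:

```
import tensorflow as tf
from tensorflow import keras


class ToyEncoder(keras.layers.Layer):
    """Minimal illustration of the lazy-build pattern; not part of the patch."""

    def __init__(self, hidden_size=8, **kwargs):
        super().__init__(**kwargs)
        self.hidden_size = hidden_size
        self.layernorm = keras.layers.LayerNormalization(epsilon=1e-5, name="layernorm")
        self.dense = keras.layers.Dense(hidden_size, name="dense")

    def call(self, hidden_states):
        return self.dense(self.layernorm(hidden_states))

    def build(self, input_shape=None):
        if self.built:  # guard against building twice
            return
        self.built = True
        # Each sublayer is built inside its own name scope so its variable names stay
        # consistent with what the weight loading code expects.
        with tf.name_scope(self.layernorm.name):
            self.layernorm.build([None, None, self.hidden_size])
        with tf.name_scope(self.dense.name):
            self.dense.build([None, None, self.hidden_size])


layer = ToyEncoder(name="toy_encoder")
layer.build()  # weights now exist before the first call
print([w.name for w in layer.weights])
```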
""" @@ -1321,23 +1376,23 @@ def set_bias(self, value): @add_end_docstrings(BLENDERBOT_SMALL_GENERATION_EXAMPLE) def call( self, - input_ids: Optional[tf.Tensor] = None, - attention_mask: Optional[tf.Tensor] = None, - decoder_input_ids: Optional[tf.Tensor] = None, - decoder_attention_mask: Optional[tf.Tensor] = None, - decoder_position_ids: Optional[tf.Tensor] = None, - head_mask: Optional[tf.Tensor] = None, - decoder_head_mask: Optional[tf.Tensor] = None, - cross_attn_head_mask: Optional[tf.Tensor] = None, + input_ids: tf.Tensor | None = None, + attention_mask: tf.Tensor | None = None, + decoder_input_ids: tf.Tensor | None = None, + decoder_attention_mask: tf.Tensor | None = None, + decoder_position_ids: tf.Tensor | None = None, + head_mask: tf.Tensor | None = None, + decoder_head_mask: tf.Tensor | None = None, + cross_attn_head_mask: tf.Tensor | None = None, encoder_outputs: Optional[TFBaseModelOutput] = None, - past_key_values: Optional[List[tf.Tensor]] = None, - inputs_embeds: Optional[tf.Tensor] = None, - decoder_inputs_embeds: Optional[tf.Tensor] = None, + past_key_values: List[tf.Tensor] | None = None, + inputs_embeds: tf.Tensor | None = None, + decoder_inputs_embeds: tf.Tensor | None = None, use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - labels: Optional[tf.Tensor] = None, + labels: tf.Tensor | None = None, training: Optional[bool] = False, ) -> Union[Tuple[tf.Tensor], TFSeq2SeqLMOutput]: r""" @@ -1458,3 +1513,14 @@ def prepare_inputs_for_generation( "cross_attn_head_mask": cross_attn_head_mask, "use_cache": use_cache, # change this to avoid caching (presumably for debugging) } + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "model", None) is not None: + with tf.name_scope(self.model.name): + self.model.build(None) + if getattr(self, "bias_layer", None) is not None: + with tf.name_scope(self.bias_layer.name): + self.bias_layer.build(None) diff --git a/src/transformers/models/blenderbot_small/tokenization_blenderbot_small.py b/src/transformers/models/blenderbot_small/tokenization_blenderbot_small.py index a0b45bff1dc78c..240495d73894ef 100644 --- a/src/transformers/models/blenderbot_small/tokenization_blenderbot_small.py +++ b/src/transformers/models/blenderbot_small/tokenization_blenderbot_small.py @@ -85,9 +85,9 @@ class BlenderbotSmallTokenizer(PreTrainedTokenizer): unk_token (`str`, *optional*, defaults to `"__unk__"`): The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. - pad_token (`str`, *optional*, defaults to `"__pad__"`): + pad_token (`str`, *optional*, defaults to `"__null__"`): The token used for padding, for example when batching sequences of different lengths. 
- **kwargs + kwargs (*optional*): Additional keyword arguments passed along to [`PreTrainedTokenizer`] """ @@ -106,8 +106,6 @@ def __init__( pad_token="__null__", **kwargs, ): - super().__init__(unk_token=unk_token, bos_token=bos_token, eos_token=eos_token, pad_token=pad_token, **kwargs) - with open(vocab_file, encoding="utf-8") as vocab_handle: self.encoder = json.load(vocab_handle) self.decoder = {v: k for k, v in self.encoder.items()} @@ -116,6 +114,7 @@ def __init__( merges = [tuple(merge.split()) for merge in merges] self.bpe_ranks = dict(zip(merges, range(len(merges)))) self.cache = {} + super().__init__(unk_token=unk_token, bos_token=bos_token, eos_token=eos_token, pad_token=pad_token, **kwargs) @property def vocab_size(self) -> int: @@ -191,7 +190,7 @@ def _tokenize(self, text: str) -> List[str]: words = re.findall(r"\S+\n?", text) for token in words: - split_tokens.extend([t for t in self.bpe(token).split(" ")]) + split_tokens.extend(list(self.bpe(token).split(" "))) return split_tokens def _convert_token_to_id(self, token: str) -> int: @@ -236,3 +235,24 @@ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = index += 1 return vocab_file, merge_file + + @property + # Copied from transformers.models.blenderbot.tokenization_blenderbot.BlenderbotTokenizer.default_chat_template + def default_chat_template(self): + """ + A very simple chat template that just adds whitespace between messages. + """ + logger.warning_once( + "\nNo chat template is defined for this tokenizer - using the default template " + f"for the {self.__class__.__name__} class. If the default is not appropriate for " + "your model, please set `tokenizer.chat_template` to an appropriate template. " + "See https://huggingface.co/docs/transformers/main/chat_templating for more information.\n" + ) + return ( + "{% for message in messages %}" + "{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}" + "{{ message['content'] }}" + "{% if not loop.last %}{{ ' ' }}{% endif %}" + "{% endfor %}" + "{{ eos_token }}" + ) diff --git a/src/transformers/models/blenderbot_small/tokenization_blenderbot_small_fast.py b/src/transformers/models/blenderbot_small/tokenization_blenderbot_small_fast.py index adc350f3d11132..4bf0017b5f2a29 100644 --- a/src/transformers/models/blenderbot_small/tokenization_blenderbot_small_fast.py +++ b/src/transformers/models/blenderbot_small/tokenization_blenderbot_small_fast.py @@ -117,3 +117,24 @@ def create_token_type_ids_from_sequences( if token_ids_1 is None: return len(cls + token_ids_0 + sep) * [0] return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0] + + @property + # Copied from transformers.models.blenderbot.tokenization_blenderbot.BlenderbotTokenizer.default_chat_template + def default_chat_template(self): + """ + A very simple chat template that just adds whitespace between messages. + """ + logger.warning_once( + "\nNo chat template is defined for this tokenizer - using the default template " + f"for the {self.__class__.__name__} class. If the default is not appropriate for " + "your model, please set `tokenizer.chat_template` to an appropriate template. 
" + "See https://huggingface.co/docs/transformers/main/chat_templating for more information.\n" + ) + return ( + "{% for message in messages %}" + "{% if message['role'] == 'user' %}{{ ' ' }}{% endif %}" + "{{ message['content'] }}" + "{% if not loop.last %}{{ ' ' }}{% endif %}" + "{% endfor %}" + "{{ eos_token }}" + ) diff --git a/src/transformers/models/blip/__init__.py b/src/transformers/models/blip/__init__.py index d295801a0f5e2c..a7001788e62916 100644 --- a/src/transformers/models/blip/__init__.py +++ b/src/transformers/models/blip/__init__.py @@ -13,7 +13,13 @@ # limitations under the License. from typing import TYPE_CHECKING -from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available +from ...utils import ( + OptionalDependencyNotAvailable, + _LazyModule, + is_tf_available, + is_torch_available, + is_vision_available, +) _import_structure = { @@ -52,6 +58,23 @@ "BlipForImageTextRetrieval", ] +try: + if not is_tf_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass +else: + _import_structure["modeling_tf_blip"] = [ + "TF_BLIP_PRETRAINED_MODEL_ARCHIVE_LIST", + "TFBlipModel", + "TFBlipPreTrainedModel", + "TFBlipForConditionalGeneration", + "TFBlipForQuestionAnswering", + "TFBlipVisionModel", + "TFBlipTextModel", + "TFBlipForImageTextRetrieval", + ] + if TYPE_CHECKING: from .configuration_blip import BLIP_PRETRAINED_CONFIG_ARCHIVE_MAP, BlipConfig, BlipTextConfig, BlipVisionConfig from .processing_blip import BlipProcessor @@ -81,6 +104,23 @@ BlipVisionModel, ) + try: + if not is_tf_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + else: + from .modeling_tf_blip import ( + TF_BLIP_PRETRAINED_MODEL_ARCHIVE_LIST, + TFBlipForConditionalGeneration, + TFBlipForImageTextRetrieval, + TFBlipForQuestionAnswering, + TFBlipModel, + TFBlipPreTrainedModel, + TFBlipTextModel, + TFBlipVisionModel, + ) + else: import sys diff --git a/src/transformers/models/blip/configuration_blip.py b/src/transformers/models/blip/configuration_blip.py index 8bdff88bff2fe8..42e35958ced3cf 100644 --- a/src/transformers/models/blip/configuration_blip.py +++ b/src/transformers/models/blip/configuration_blip.py @@ -14,7 +14,6 @@ # limitations under the License. """ Blip model configuration""" -import copy import os from typing import Union @@ -56,7 +55,7 @@ class BlipTextConfig(PretrainedConfig): Args: - vocab_size (`int`, *optional*, defaults to 30522): + vocab_size (`int`, *optional*, defaults to 30524): Vocabulary size of the `Blip` text model. Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling [`BlipModel`]. hidden_size (`int`, *optional*, defaults to 768): @@ -69,7 +68,7 @@ class BlipTextConfig(PretrainedConfig): Number of hidden layers in the Transformer encoder. num_attention_heads (`int`, *optional*, defaults to 8): Number of attention heads for each attention layer in the Transformer encoder. - max_position_embeddings (`int`, *optional*, defaults to 77): + max_position_embeddings (`int`, *optional*, defaults to 512): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`): @@ -91,10 +90,14 @@ class BlipTextConfig(PretrainedConfig): The id of the `padding` token. sep_token_id (`int`, *optional*, defaults to 102): The id of the `separator` token. 
- is_decoder (`bool`, *optional*, defaults to `False`): + is_decoder (`bool`, *optional*, defaults to `True`): Whether the model is used as a decoder. use_cache (`bool`, *optional*, defaults to `True`): Whether or not the model should return the last key/values attentions (not used by all models). + label_smoothing (float, *optional*): + A float in [0.0, 1.0]. Specifies the amount of smoothing when computing the loss, where 0.0 means no smoothing. The targets + become a mixture of the original ground truth and a uniform distribution as described in + `Rethinking the Inception Architecture for Computer Vision `__. Default: :math:`0.0`. Example: @@ -110,6 +113,7 @@ class BlipTextConfig(PretrainedConfig): >>> # Accessing the model configuration >>> configuration = model.config ```""" + model_type = "blip_text_model" def __init__( @@ -133,6 +137,7 @@ def __init__( sep_token_id=102, is_decoder=True, use_cache=True, + label_smoothing=0.0, **kwargs, ): super().__init__( @@ -158,9 +163,12 @@ def __init__( self.attention_probs_dropout_prob = attention_probs_dropout_prob self.is_decoder = is_decoder self.use_cache = use_cache + self.label_smoothing = label_smoothing @classmethod def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + cls._set_token_in_kwargs(kwargs) + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) # get the text config dict if we are loading from BlipConfig @@ -196,9 +204,9 @@ class BlipVisionConfig(PretrainedConfig): Number of hidden layers in the Transformer encoder. num_attention_heads (`int`, *optional*, defaults to 12): Number of attention heads for each attention layer in the Transformer encoder. - image_size (`int`, *optional*, defaults to 224): + image_size (`int`, *optional*, defaults to 384): The size (resolution) of each image. - patch_size (`int`, *optional*, defaults to 32): + patch_size (`int`, *optional*, defaults to 16): The size (resolution) of each patch. hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`): The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, @@ -207,7 +215,7 @@ class BlipVisionConfig(PretrainedConfig): The epsilon used by the layer normalization layers. attention_dropout (`float`, *optional*, defaults to 0.0): The dropout ratio for the attention probabilities. - initializer_range (`float`, *optional*, defaults to 0.02): + initializer_range (`float`, *optional*, defaults to 1e-10): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. Example: @@ -234,7 +242,6 @@ def __init__( projection_dim=512, num_hidden_layers=12, num_attention_heads=12, - num_channels=3, image_size=384, patch_size=16, hidden_act="gelu", @@ -250,7 +257,6 @@ def __init__( self.projection_dim = projection_dim self.num_hidden_layers = num_hidden_layers self.num_attention_heads = num_attention_heads - self.num_channels = num_channels self.patch_size = patch_size self.image_size = image_size self.initializer_range = initializer_range @@ -260,6 +266,8 @@ def __init__( @classmethod def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + cls._set_token_in_kwargs(kwargs) + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) # get the vision config dict if we are loading from BlipConfig @@ -294,8 +302,12 @@ class BlipConfig(PretrainedConfig): Dimentionality of text and vision projection layers. 
logit_scale_init_value (`float`, *optional*, defaults to 2.6592): The inital value of the *logit_scale* paramter. Default is used as per the original BLIP implementation. - image_text_hidden_size (`int`, *optional*, defaults to 768): + image_text_hidden_size (`int`, *optional*, defaults to 256): Dimentionality of the hidden state of the image-text fusion layer. + label_smoothing (float, optional, *optional*, defaults to 0.0): + A float in [0.0, 1.0]. Specifies the amount of smoothing when computing the loss, where 0.0 means no smoothing. The targets + become a mixture of the original ground truth and a uniform distribution as described in + `Rethinking the Inception Architecture for Computer Vision `__. Default: :math:`0.0`. kwargs (*optional*): Dictionary of keyword arguments. @@ -323,7 +335,6 @@ class BlipConfig(PretrainedConfig): ```""" model_type = "blip" - is_composition = True def __init__( self, @@ -332,25 +343,18 @@ def __init__( projection_dim=512, logit_scale_init_value=2.6592, image_text_hidden_size=256, + label_smoothing=0.0, **kwargs, ): super().__init__(**kwargs) - # If `_config_dict` exist, we use them for the backward compatibility. - text_config_dict = kwargs.pop("text_config_dict", None) - vision_config_dict = kwargs.pop("vision_config_dict", None) - if text_config_dict is not None: - text_config = text_config_dict - if vision_config_dict is not None: - vision_config = vision_config_dict - if text_config is None: text_config = {} - logger.info("text_config is None. Initializing the BlipTextConfig with default values.") + logger.info("`text_config` is `None`. Initializing the `BlipTextConfig` with default values.") if vision_config is None: vision_config = {} - logger.info("vision_config is None. initializing the BlipVisionConfig with default values.") + logger.info("`vision_config` is `None`. Initializing the `BlipVisionConfig` with default values.") self.text_config = BlipTextConfig(**text_config) self.vision_config = BlipVisionConfig(**vision_config) @@ -362,6 +366,7 @@ def __init__( self.initializer_factor = 1.0 self.initializer_range = 0.02 self.image_text_hidden_size = image_text_hidden_size + self.label_smoothing = label_smoothing @classmethod def from_text_vision_configs(cls, text_config: BlipTextConfig, vision_config: BlipVisionConfig, **kwargs): @@ -374,16 +379,3 @@ def from_text_vision_configs(cls, text_config: BlipTextConfig, vision_config: Bl """ return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), **kwargs) - - def to_dict(self): - """ - Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`]. 
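The `label_smoothing` value added to the BLIP configs here is eventually forwarded to `torch.nn.CrossEntropyLoss` in `modeling_blip_text.py` further down, replacing the previously hard-coded `0.1`. A small, self-contained illustration of what the knob does, with made-up logits:

```
import torch
from torch import nn

logits = torch.tensor([[2.0, 0.5, -1.0]])  # one example, three classes, made-up values
target = torch.tensor([0])

plain = nn.CrossEntropyLoss()(logits, target)
smoothed = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, target)  # 0.1 was the old hard-coded value
print(plain.item(), smoothed.item())  # the smoothed loss is higher for a confident, correct prediction
```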
- - Returns: - `Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance, - """ - output = copy.deepcopy(self.__dict__) - output["text_config"] = self.text_config.to_dict() - output["vision_config"] = self.vision_config.to_dict() - output["model_type"] = self.__class__.model_type - return output diff --git a/src/transformers/models/blip/convert_blip_original_pytorch_to_hf.py b/src/transformers/models/blip/convert_blip_original_pytorch_to_hf.py index 7609b4a40e857f..714aaa1e273d1a 100644 --- a/src/transformers/models/blip/convert_blip_original_pytorch_to_hf.py +++ b/src/transformers/models/blip/convert_blip_original_pytorch_to_hf.py @@ -105,7 +105,7 @@ def convert_blip_checkpoint(pytorch_dump_folder_path, config_path=None): image_size = 384 image = load_demo_image(image_size=image_size, device="cpu") - tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") + tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased") input_ids = tokenizer(["a picture of"]).input_ids out = hf_model.generate(image, input_ids) diff --git a/src/transformers/models/blip/image_processing_blip.py b/src/transformers/models/blip/image_processing_blip.py index 539d6d198603ef..fa65624937f35e 100644 --- a/src/transformers/models/blip/image_processing_blip.py +++ b/src/transformers/models/blip/image_processing_blip.py @@ -18,22 +18,22 @@ import numpy as np -from transformers.utils import is_vision_available -from transformers.utils.generic import TensorType - from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict -from ...image_transforms import convert_to_rgb, normalize, rescale, resize, to_channel_dimension_format +from ...image_transforms import convert_to_rgb, resize, to_channel_dimension_format from ...image_utils import ( OPENAI_CLIP_MEAN, OPENAI_CLIP_STD, ChannelDimension, ImageInput, PILImageResampling, + infer_channel_dimension_format, + is_scaled_image, make_list_of_images, to_numpy_array, valid_images, + validate_preprocess_arguments, ) -from ...utils import logging +from ...utils import TensorType, is_vision_available, logging if is_vision_available(): @@ -54,11 +54,11 @@ class BlipImageProcessor(BaseImageProcessor): size (`dict`, *optional*, defaults to `{"height": 384, "width": 384}`): Size of the output image after resizing. Can be overridden by the `size` parameter in the `preprocess` method. - resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`): + resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`): Resampling filter to use if resizing the image. Only has an effect if `do_resize` is set to `True`. Can be overridden by the `resample` parameter in the `preprocess` method. do_rescale (`bool`, *optional*, defaults to `True`): - Wwhether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the + Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the `do_rescale` parameter in the `preprocess` method. rescale_factor (`int` or `float`, *optional*, defaults to `1/255`): Scale factor to use if rescaling the image. Only has an effect if `do_rescale` is set to `True`. 
Can be @@ -107,77 +107,54 @@ def __init__( self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD self.do_convert_rgb = do_convert_rgb + # Copied from transformers.models.vit.image_processing_vit.ViTImageProcessor.resize with PILImageResampling.BILINEAR->PILImageResampling.BICUBIC def resize( self, image: np.ndarray, size: Dict[str, int], resample: PILImageResampling = PILImageResampling.BICUBIC, data_format: Optional[Union[str, ChannelDimension]] = None, + input_data_format: Optional[Union[str, ChannelDimension]] = None, **kwargs, ) -> np.ndarray: """ - Resize an image. - - Resizes the shorter side of the image to `size["shortest_edge"]` while preserving the aspect ratio. If the - longer side is larger than the max size `(int(`size["shortest_edge"]` * 1333 / 800))`, the longer side is then - resized to the max size while preserving the aspect ratio. + Resize an image to `(size["height"], size["width"])`. Args: image (`np.ndarray`): Image to resize. size (`Dict[str, int]`): - Controls the size of the output image. Should be of the form `{"shortest_edge": int}`. - resample (`PILImageResampling` filter, *optional*, defaults to `PILImageResampling.BICUBIC`): - Resampling filter to use when resiizing the image. - data_format (`str` or `ChannelDimension`, *optional*): - The channel dimension format of the image. If not provided, it will be the same as the input image. - """ - size = get_size_dict(size, default_to_square=True) - output_size = (size["width"], size["height"]) - return resize(image, size=output_size, resample=resample, data_format=data_format, **kwargs) - - def rescale( - self, - image: np.ndarray, - scale: Union[int, float], - data_format: Optional[Union[str, ChannelDimension]] = None, - **kwargs, - ): - """ - Rescale an image by a scale factor. image = image * scale. - - Args: - image (`np.ndarray`): - Image to rescale. - scale (`int` or `float`): - Scale to apply to the image. - data_format (`str` or `ChannelDimension`, *optional*): - The channel dimension format of the image. If not provided, it will be the same as the input image. + Dictionary in the format `{"height": int, "width": int}` specifying the size of the output image. + resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`): + `PILImageResampling` filter to use when resizing the image e.g. `PILImageResampling.BICUBIC`. + data_format (`ChannelDimension` or `str`, *optional*): + The channel dimension format for the output image. If unset, the channel dimension format of the input + image is used. Can be one of: + - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format. + - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - `"none"` or `ChannelDimension.NONE`: image in (height, width) format. + input_data_format (`ChannelDimension` or `str`, *optional*): + The channel dimension format for the input image. If unset, the channel dimension format is inferred + from the input image. Can be one of: + - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format. + - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - `"none"` or `ChannelDimension.NONE`: image in (height, width) format. + + Returns: + `np.ndarray`: The resized image. 
""" - return rescale(image, scale=scale, data_format=data_format, **kwargs) - - def normalize( - self, - image: np.ndarray, - mean: Union[float, List[float]], - std: Union[float, List[float]], - data_format: Optional[Union[str, ChannelDimension]] = None, - **kwargs, - ) -> np.ndarray: - """ - Normalize an image. image = (image - image_mean) / image_std. - - Args: - image (`np.ndarray`): - Image to normalize. - mean (`float` or `List[float]`): - Image mean. - std (`float` or `List[float]`): - Image standard deviation. - data_format (`str` or `ChannelDimension`, *optional*): - The channel dimension format of the image. If not provided, it will be the same as the input image. - """ - return normalize(image, mean=mean, std=std, data_format=data_format, **kwargs) + size = get_size_dict(size) + if "height" not in size or "width" not in size: + raise ValueError(f"The `size` dictionary must contain the keys `height` and `width`. Got {size.keys()}") + output_size = (size["height"], size["width"]) + return resize( + image, + size=output_size, + resample=resample, + data_format=data_format, + input_data_format=input_data_format, + **kwargs, + ) def preprocess( self, @@ -193,6 +170,7 @@ def preprocess( return_tensors: Optional[Union[str, TensorType]] = None, do_convert_rgb: bool = None, data_format: ChannelDimension = ChannelDimension.FIRST, + input_data_format: Optional[Union[str, ChannelDimension]] = None, **kwargs, ) -> PIL.Image.Image: """ @@ -200,7 +178,8 @@ def preprocess( Args: images (`ImageInput`): - Image to preprocess. + Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If + passing in images with pixel values between 0 and 1, set `do_rescale=False`. do_resize (`bool`, *optional*, defaults to `self.do_resize`): Whether to resize the image. size (`Dict[str, int]`, *optional*, defaults to `self.size`): @@ -231,8 +210,15 @@ def preprocess( - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`. data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`): The channel dimension format for the output image. Can be one of: - - `ChannelDimension.FIRST`: image in (num_channels, height, width) format. - - `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format. + - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - Unset: Use the channel dimension format of the input image. + input_data_format (`ChannelDimension` or `str`, *optional*): + The channel dimension format for the input image. If unset, the channel dimension format is inferred + from the input image. Can be one of: + - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format. + - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - `"none"` or `ChannelDimension.NONE`: image in (height, width) format. """ do_resize = do_resize if do_resize is not None else self.do_resize resample = resample if resample is not None else self.resample @@ -254,15 +240,16 @@ def preprocess( "torch.Tensor, tf.Tensor or jax.ndarray." 
) - if do_resize and size is None or resample is None: - raise ValueError("Size and resample must be specified if do_resize is True.") - - if do_rescale and rescale_factor is None: - raise ValueError("Rescale factor must be specified if do_rescale is True.") - - if do_normalize and (image_mean is None or image_std is None): - raise ValueError("Image mean and std must be specified if do_normalize is True.") - + validate_preprocess_arguments( + do_rescale=do_rescale, + rescale_factor=rescale_factor, + do_normalize=do_normalize, + image_mean=image_mean, + image_std=image_std, + do_resize=do_resize, + size=size, + resample=resample, + ) # PIL RGBA images are converted to RGB if do_convert_rgb: images = [convert_to_rgb(image) for image in images] @@ -270,16 +257,37 @@ def preprocess( # All transformations expect numpy arrays. images = [to_numpy_array(image) for image in images] + if is_scaled_image(images[0]) and do_rescale: + logger.warning_once( + "It looks like you are trying to rescale already rescaled images. If the input" + " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again." + ) + + if input_data_format is None: + # We assume that all images have the same channel dimension format. + input_data_format = infer_channel_dimension_format(images[0]) + if do_resize: - images = [self.resize(image=image, size=size, resample=resample) for image in images] + images = [ + self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format) + for image in images + ] if do_rescale: - images = [self.rescale(image=image, scale=rescale_factor) for image in images] + images = [ + self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format) + for image in images + ] if do_normalize: - images = [self.normalize(image=image, mean=image_mean, std=image_std) for image in images] - - images = [to_channel_dimension_format(image, data_format) for image in images] + images = [ + self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format) + for image in images + ] + + images = [ + to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images + ] encoded_outputs = BatchFeature(data={"pixel_values": images}, tensor_type=return_tensors) diff --git a/src/transformers/models/blip/modeling_blip.py b/src/transformers/models/blip/modeling_blip.py index 7f1b3412b6845e..1dc79efb6546af 100644 --- a/src/transformers/models/blip/modeling_blip.py +++ b/src/transformers/models/blip/modeling_blip.py @@ -14,6 +14,7 @@ # limitations under the License. """ PyTorch BLIP model.""" +import warnings from dataclasses import dataclass from typing import Any, Optional, Tuple, Union @@ -42,13 +43,13 @@ BLIP_PRETRAINED_MODEL_ARCHIVE_LIST = [ "Salesforce/blip-vqa-base", - "Salesforce/blip-vqa-capfit-large", + "Salesforce/blip-vqa-capfilt-large", "Salesforce/blip-image-captioning-base", "Salesforce/blip-image-captioning-large", "Salesforce/blip-itm-base-coco", "Salesforce/blip-itm-large-coco", - "Salesforce/blip-itm-base-flikr", - "Salesforce/blip-itm-large-flikr", + "Salesforce/blip-itm-base-flickr", + "Salesforce/blip-itm-large-flickr", # See all BLIP models at https://huggingface.co/models?filter=blip ] @@ -74,7 +75,7 @@ class BlipForConditionalGenerationModelOutput(ModelOutput): Args: loss (`torch.FloatTensor`, *optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`): Languge modeling loss from the text decoder. 
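To make the reworked `BlipImageProcessor.preprocess` above concrete, here is a usage sketch. The constructor defaults (resize to 384x384, rescale, normalize) and the expected output shape are inferred from the signatures and docstrings in this diff, not verified against a run:

```
import numpy as np
from transformers import BlipImageProcessor

processor = BlipImageProcessor()  # defaults: resize to 384x384, rescale by 1/255, normalize

# Channels-last uint8 input; `input_data_format` can be given explicitly or left to be inferred.
image = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
batch = processor(images=image, return_tensors="np", input_data_format="channels_last")
print(batch["pixel_values"].shape)  # expected (1, 3, 384, 384) with the default channels-first output

# Inputs already scaled to [0, 1] should set do_rescale=False to avoid the new double-rescale warning.
scaled = image.astype(np.float32) / 255.0
_ = processor(images=scaled, do_rescale=False, return_tensors="np")
```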
- decoder_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`, *optional*): + logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`, *optional*): Prediction scores of the language modeling head of the text decoder model. image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*): The image embeddings obtained after applying the Vision Transformer model to the input image. @@ -94,11 +95,20 @@ class BlipForConditionalGenerationModelOutput(ModelOutput): """ loss: Optional[Tuple[torch.FloatTensor]] = None - decoder_logits: Optional[Tuple[torch.FloatTensor]] = None + logits: Optional[Tuple[torch.FloatTensor]] = None image_embeds: Optional[torch.FloatTensor] = None last_hidden_state: torch.FloatTensor = None - hidden_states: Optional[Tuple[torch.FloatTensor]] = None - attentions: Optional[Tuple[torch.FloatTensor]] = None + hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None + attentions: Optional[Tuple[torch.FloatTensor, ...]] = None + + @property + def decoder_logits(self): + warnings.warn( + "`decoder_logits` attribute is deprecated and will be removed in version 5 of Transformers." + " Please use the `logits` attribute to retrieve the final output instead.", + FutureWarning, + ) + return self.logits @dataclass @@ -130,8 +140,8 @@ class BlipTextVisionModelOutput(ModelOutput): loss: Optional[torch.FloatTensor] = None image_embeds: Optional[torch.FloatTensor] = None last_hidden_state: torch.FloatTensor = None - hidden_states: Optional[Tuple[torch.FloatTensor]] = None - attentions: Optional[Tuple[torch.FloatTensor]] = None + hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None + attentions: Optional[Tuple[torch.FloatTensor, ...]] = None @dataclass @@ -171,9 +181,9 @@ class BlipImageTextMatchingModelOutput(ModelOutput): loss: Optional[torch.FloatTensor] = None image_embeds: Optional[torch.FloatTensor] = None last_hidden_state: torch.FloatTensor = None - hidden_states: Optional[Tuple[torch.FloatTensor]] = None + hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None vision_pooler_output: Optional[torch.FloatTensor] = None - attentions: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor, ...]] = None question_embeds: Optional[Tuple[torch.FloatTensor]] = None @@ -222,9 +232,7 @@ def __init__(self, config: BlipVisionConfig): self.image_size = config.image_size self.patch_size = config.patch_size - self.class_embedding = nn.Parameter( - torch.randn(1, 1, self.embed_dim), - ) + self.class_embedding = nn.Parameter(torch.randn(1, 1, self.embed_dim)) self.patch_embedding = nn.Conv2d( in_channels=3, out_channels=self.embed_dim, kernel_size=self.patch_size, stride=self.patch_size @@ -238,7 +246,7 @@ def __init__(self, config: BlipVisionConfig): def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor: batch_size = pixel_values.shape[0] target_dtype = self.patch_embedding.weight.dtype - patch_embeds = self.patch_embedding(pixel_values) # shape = [*, width, grid, grid] + patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype)) # shape = [*, width, grid, grid] patch_embeds = patch_embeds.flatten(2).transpose(1, 2) class_embeds = self.class_embedding.expand(batch_size, 1, -1).to(target_dtype) @@ -257,7 +265,9 @@ def __init__(self, config: BlipTextConfig): self.position_embedding = nn.Embedding(config.max_position_embeddings, embed_dim) # position_ids (1, len position emb) is contiguous in memory and exported when serialized - 
self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1))) + self.register_buffer( + "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False + ) def forward( self, @@ -313,17 +323,12 @@ def forward( bsz, tgt_len, embed_dim = hidden_states.size() - mixed_qkv = self.qkv(hidden_states) mixed_qkv = ( self.qkv(hidden_states) .reshape(bsz, tgt_len, 3, self.num_heads, embed_dim // self.num_heads) .permute(2, 0, 3, 1, 4) ) - query_states, key_states, value_states = ( - mixed_qkv[0], - mixed_qkv[1], - mixed_qkv[2], - ) + query_states, key_states, value_states = mixed_qkv[0], mixed_qkv[1], mixed_qkv[2] # Take the dot product between "query" and "key" to get the raw attention scores. attention_scores = torch.matmul(query_states, key_states.transpose(-1, -2)) @@ -426,7 +431,6 @@ class BlipPreTrainedModel(PreTrainedModel): config_class = BlipConfig base_model_prefix = "blip" supports_gradient_checkpointing = True - _keys_to_ignore_on_load_missing = [r"position_ids"] def _init_weights(self, module): """Initialize the weights""" @@ -457,10 +461,6 @@ def _init_weights(self, module): elif isinstance(module, nn.Linear) and module.bias is not None: module.bias.data.zero_() - def _set_gradient_checkpointing(self, module, value=False): - if isinstance(module, BlipEncoder): - module.gradient_checkpointing = value - BLIP_START_DOCSTRING = r""" This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the @@ -587,9 +587,7 @@ def forward( r""" Args: inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): - Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. - This is useful if you want more control over how to convert `input_ids` indices into associated vectors - than the model's internal embedding lookup matrix. + Embedded representation of the inputs. Should be float, not int tokens. attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): Mask to avoid performing attention on padding token indices. 
Mask values selected in `[0, 1]`: @@ -620,17 +618,11 @@ def forward( if output_hidden_states: encoder_states = encoder_states + (hidden_states,) if self.gradient_checkpointing and self.training: - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, output_attentions) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(encoder_layer), + layer_outputs = self._gradient_checkpointing_func( + encoder_layer.__call__, hidden_states, attention_mask, + output_attentions, ) else: layer_outputs = encoder_layer( @@ -751,7 +743,7 @@ def __init__(self, config: BlipConfig): self.visual_projection = nn.Linear(self.vision_embed_dim, self.projection_dim, bias=False) self.text_projection = nn.Linear(self.text_embed_dim, self.projection_dim, bias=False) - self.logit_scale = nn.Parameter(torch.ones([]) * self.config.logit_scale_init_value) + self.logit_scale = nn.Parameter(torch.tensor(self.config.logit_scale_init_value)) # Initialize weights and apply final processing self.post_init() @@ -824,10 +816,7 @@ def get_image_features( ```""" return_dict = return_dict if return_dict is not None else self.config.use_return_dict - vision_outputs = self.vision_model( - pixel_values=pixel_values, - return_dict=return_dict, - ) + vision_outputs = self.vision_model(pixel_values=pixel_values, return_dict=return_dict) pooled_output = vision_outputs[1] # pooled_output image_features = self.visual_projection(pooled_output) @@ -939,7 +928,7 @@ def forward( ) class BlipForConditionalGeneration(BlipPreTrainedModel): config_class = BlipConfig - _keys_to_ignore_on_load_missing = [r"text_decoder.cls.predictions.decoder.bias"] + _tied_weights_keys = ["text_decoder.cls.predictions.decoder.bias"] main_input_name = "pixel_values" def __init__(self, config: BlipConfig): @@ -985,13 +974,18 @@ def forward( >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" >>> image = Image.open(requests.get(url, stream=True).raw) + >>> text = "A picture of" - >>> inputs = processor(images=image, return_tensors="pt") + >>> inputs = processor(images=image, text=text, return_tensors="pt") >>> outputs = model(**inputs) ```""" - batch_size = pixel_values.shape[0] + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) vision_outputs = self.vision_model( pixel_values=pixel_values, @@ -1002,12 +996,6 @@ def forward( image_embeds = vision_outputs[0] - if input_ids is None: - input_ids = torch.LongTensor([[self.decoder_input_ids] * batch_size]).to(image_embeds.device) - - if labels is None: - labels = input_ids.masked_fill(input_ids == self.decoder_pad_token_id, -100) - outputs = self.text_decoder( input_ids=input_ids, attention_mask=attention_mask, @@ -1023,7 +1011,7 @@ def forward( return BlipForConditionalGenerationModelOutput( loss=outputs.loss, - decoder_logits=outputs.logits, + logits=outputs.logits, image_embeds=image_embeds, last_hidden_state=vision_outputs.last_hidden_state, hidden_states=vision_outputs.hidden_states, @@ -1042,7 +1030,7 @@ def generate( Overrides *generate* function to be able to use the model as a conditional generator Parameters: - pixel_values (*torch.FloatTensor* of shape *(batch_size, image_width, image_height)*: + pixel_values (*torch.FloatTensor* of shape 
*(batch_size, num_channels, image_height, image_width)*: Input image to be processed input_ids (*torch.LongTensor* of shape *(batch_size, sequence_length)*, *optional*): The sequence used as a prompt for the generation. @@ -1066,14 +1054,12 @@ def generate( >>> outputs = model.generate(**inputs) >>> print(processor.decode(outputs[0], skip_special_tokens=True)) - two cats are laying on a couch + two cats sleeping on a couch ``` """ batch_size = pixel_values.shape[0] - vision_outputs = self.vision_model( - pixel_values=pixel_values, - ) + vision_outputs = self.vision_model(pixel_values=pixel_values) image_embeds = vision_outputs[0] @@ -1114,7 +1100,7 @@ def generate( ) class BlipForQuestionAnswering(BlipPreTrainedModel): config_class = BlipConfig - _keys_to_ignore_on_load_missing = [r"text_decoder.cls.predictions.decoder.bias"] + _tied_weights_keys = ["text_decoder.cls.predictions.decoder.bias"] def __init__(self, config: BlipConfig): super().__init__(config) @@ -1134,19 +1120,6 @@ def __init__(self, config: BlipConfig): def get_input_embeddings(self) -> nn.Module: return self.vision_model.embeddings.patch_embedding - # Adapted from transformers.models.t5.modeling_t5.T5PreTrainedModel._shift_right - def _shift_right(self, input_ids): - pad_token_id = self.decoder_pad_token_id - - shifted_input_ids = input_ids.new_zeros(input_ids.shape) - shifted_input_ids[..., 1:] = input_ids[..., :-1].clone() - shifted_input_ids[..., 0] = self.decoder_start_token_id - - # replace possible -100 values in labels by `pad_token_id` - shifted_input_ids.masked_fill_(shifted_input_ids == -100, pad_token_id) - - return shifted_input_ids - @add_start_docstrings_to_model_forward(BLIP_VISION_INPUTS_DOCSTRING) @replace_return_docstrings(output_type=BlipTextVisionModelOutput, config_class=BlipVisionConfig) def forward( @@ -1203,6 +1176,10 @@ def forward( ) return_dict = return_dict if return_dict is not None else self.config.use_return_dict + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) vision_outputs = self.vision_model( pixel_values=pixel_values, @@ -1222,13 +1199,11 @@ def forward( return_dict=return_dict, ) - question_embeds = question_embeds[0] if not return_dict else question_embeds.last_hidden_state - if labels is not None and decoder_input_ids is None: - # get decoder inputs from shifting lm labels to the right - this is used in training mode - decoder_input_ids = self._shift_right(labels) - # replace possible -100 values in labels by `pad_token_id` - labels = labels.masked_fill(labels == self.decoder_pad_token_id, -100) + # labels are already shifted right, see: https://github.com/huggingface/transformers/pull/23153 + decoder_input_ids = labels + + question_embeds = question_embeds[0] if not return_dict else question_embeds.last_hidden_state answer_output = self.text_decoder( input_ids=decoder_input_ids, @@ -1271,7 +1246,7 @@ def generate( Parameters: input_ids (*torch.LongTensor* of shape *(batch_size, sequence_length)*): The sequence used as a prompt for the generation. 
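Because `_shift_right` is removed above and `decoder_input_ids` now falls back to `labels` directly, a VQA training step looks roughly like the snippet below. The checkpoint name and image URL are taken from docstrings in this file; treat the rest as a sketch, not a tested recipe:

```
import requests
from PIL import Image
from transformers import BlipForQuestionAnswering, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, text="how many cats are there?", return_tensors="pt")
labels = processor(text="two", return_tensors="pt").input_ids

# After this change the labels double as decoder inputs, so no manual shifting is needed.
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
```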
- pixel_values (*torch.FloatTensor* of shape *(batch_size, image_width, image_height)*: + pixel_values (*torch.FloatTensor* of shape *(batch_size, num_channels, image_height, image_width)*: Input image to be processed attention_mask (*torch.LongTensor* of shape *(batch_size, sequence_length)*, *optional*): Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`. `1` for @@ -1300,9 +1275,7 @@ def generate( 2 ``` """ - vision_outputs = self.vision_model( - pixel_values=pixel_values, - ) + vision_outputs = self.vision_model(pixel_values=pixel_values) image_embeds = vision_outputs[0] @@ -1417,6 +1390,10 @@ def forward( ``` """ return_dict = return_dict if return_dict is not None else self.config.use_return_dict + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) vision_outputs = self.vision_model( pixel_values=pixel_values, diff --git a/src/transformers/models/blip/modeling_blip_text.py b/src/transformers/models/blip/modeling_blip_text.py index 3ec4a994d783a8..808c33f8104fc1 100644 --- a/src/transformers/models/blip/modeling_blip_text.py +++ b/src/transformers/models/blip/modeling_blip_text.py @@ -15,27 +15,26 @@ import math -from typing import Optional, Tuple +from typing import List, Optional, Tuple, Union import torch import torch.utils.checkpoint from torch import Tensor, device, nn from torch.nn import CrossEntropyLoss -from transformers.activations import ACT2FN -from transformers.modeling_outputs import ( +from ...activations import ACT2FN +from ...modeling_outputs import ( BaseModelOutputWithPastAndCrossAttentions, BaseModelOutputWithPoolingAndCrossAttentions, CausalLMOutputWithCrossAttentions, ) -from transformers.modeling_utils import ( +from ...modeling_utils import ( PreTrainedModel, apply_chunking_to_forward, find_pruneable_heads_and_indices, prune_linear_layer, ) -from transformers.utils import logging - +from ...utils import logging from .configuration_blip import BlipTextConfig @@ -57,12 +56,20 @@ def __init__(self, config): self.dropout = nn.Dropout(config.hidden_dropout_prob) # position_ids (1, len position emb) is contiguous in memory and exported when serialized - self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1))) + self.register_buffer( + "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False + ) self.position_embedding_type = getattr(config, "position_embedding_type", "absolute") self.config = config - def forward(self, input_ids=None, position_ids=None, inputs_embeds=None, past_key_values_length=0): + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + past_key_values_length: int = 0, + ) -> torch.Tensor: if input_ids is not None: input_shape = input_ids.size() else: @@ -135,14 +142,14 @@ def transpose_for_scores(self, x): def forward( self, - hidden_states, - attention_mask=None, - head_mask=None, - encoder_hidden_states=None, - encoder_attention_mask=None, - past_key_value=None, - output_attentions=False, - ): + hidden_states: torch.Tensor, + attention_mask: Optional[torch.FloatTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + encoder_attention_mask: 
Optional[torch.FloatTensor] = None, + past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.Tensor]: mixed_query_layer = self.query(hidden_states) # If this is instantiated as a cross-attention module, the keys @@ -264,7 +271,7 @@ def forward( encoder_attention_mask: Optional[torch.FloatTensor] = None, past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, output_attentions: Optional[bool] = False, - ): + ) -> Tuple[torch.Tensor]: self_outputs = self.self( hidden_states, attention_mask, @@ -325,14 +332,14 @@ def __init__(self, config, layer_num): def forward( self, - hidden_states, - attention_mask=None, - head_mask=None, - encoder_hidden_states=None, - encoder_attention_mask=None, - past_key_value=None, - output_attentions=False, - ): + hidden_states: torch.Tensor, + attention_mask: Optional[torch.FloatTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + encoder_attention_mask: Optional[torch.FloatTensor] = None, + past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.Tensor]: # decoder uni-directional self-attention cached key/values tuple is at positions 1,2 self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None self_attention_outputs = self.attention( @@ -383,17 +390,23 @@ def __init__(self, config): def forward( self, - hidden_states, - attention_mask=None, - head_mask=None, - encoder_hidden_states=None, - encoder_attention_mask=None, - past_key_values=None, - use_cache=None, - output_attentions=False, - output_hidden_states=False, - return_dict=True, - ): + hidden_states: torch.Tensor, + attention_mask: Optional[torch.FloatTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + encoder_attention_mask: Optional[torch.FloatTensor] = None, + past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = False, + output_hidden_states: Optional[bool] = False, + return_dict: Optional[bool] = True, + ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPastAndCrossAttentions]: + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning( + "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." + ) + use_cache = False all_hidden_states = () if output_hidden_states else None all_self_attentions = () if output_attentions else None all_cross_attentions = () if output_attentions and self.config.is_decoder else None @@ -409,25 +422,15 @@ def forward( past_key_value = past_key_values[i] if past_key_values is not None else None if self.gradient_checkpointing and self.training: - if use_cache: - logger.warn( - "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." 
- ) - use_cache = False - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, past_key_value, output_attentions) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(layer_module), + layer_outputs = self._gradient_checkpointing_func( + layer_module.__call__, hidden_states, attention_mask, layer_head_mask, encoder_hidden_states, encoder_attention_mask, + past_key_value, + output_attentions, ) else: layer_outputs = layer_module( @@ -445,6 +448,7 @@ def custom_forward(*inputs): next_decoder_cache += (layer_outputs[-1],) if output_attentions: all_self_attentions = all_self_attentions + (layer_outputs[1],) + all_cross_attentions = all_cross_attentions + (layer_outputs[2],) if output_hidden_states: all_hidden_states = all_hidden_states + (hidden_states,) @@ -545,7 +549,6 @@ class BlipTextPreTrainedModel(PreTrainedModel): config_class = BlipTextConfig base_model_prefix = "bert" - _keys_to_ignore_on_load_missing = [r"position_ids"] def _init_weights(self, module): """Initialize the weights""" @@ -606,7 +609,7 @@ def get_extended_attention_mask( Mask with ones indicating tokens to attend to, zeros for tokens to ignore. input_shape (`Tuple[int]`): The shape of the input to the model. - device: (`torch.device`): + device (`torch.device`): The device of the input to the model. Returns: @@ -662,21 +665,21 @@ def get_extended_attention_mask( def forward( self, - input_ids=None, - attention_mask=None, - position_ids=None, - head_mask=None, - inputs_embeds=None, - encoder_embeds=None, - encoder_hidden_states=None, - encoder_attention_mask=None, - past_key_values=None, - use_cache=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - is_decoder=False, - ): + input_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + inputs_embeds: Optional[torch.Tensor] = None, + encoder_embeds: Optional[torch.Tensor] = None, + encoder_hidden_states: Optional[torch.Tensor] = None, + encoder_attention_mask: Optional[torch.Tensor] = None, + past_key_values: Optional[List[torch.FloatTensor]] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + is_decoder: Optional[bool] = False, + ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPoolingAndCrossAttentions]: r""" encoder_hidden_states (`torch.FloatTensor`, *optional*): Sequence of hidden-states at the output of the last layer of the encoder. 
Used in the cross-attention if @@ -709,6 +712,7 @@ def forward( if input_ids is not None and inputs_embeds is not None: raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") elif input_ids is not None: + self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask) input_shape = input_ids.size() batch_size, seq_length = input_shape device = input_ids.device @@ -738,13 +742,13 @@ def forward( # If a 2D or 3D attention mask is provided for the cross-attention # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length] if encoder_hidden_states is not None: - if type(encoder_hidden_states) == list: + if isinstance(encoder_hidden_states, list): encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states[0].size() else: encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size() encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length) - if type(encoder_attention_mask) == list: + if isinstance(encoder_attention_mask, list): encoder_extended_attention_mask = [self.invert_attention_mask(mask) for mask in encoder_attention_mask] elif encoder_attention_mask is None: encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device) @@ -801,14 +805,12 @@ def forward( # Adapted from https://github.com/salesforce/BLIP/blob/main/models/med.py#L811 class BlipTextLMHeadModel(BlipTextPreTrainedModel): - _keys_to_ignore_on_load_unexpected = [r"pooler"] - _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias"] - def __init__(self, config): super().__init__(config) self.bert = BlipTextModel(config, add_pooling_layer=False) self.cls = BlipTextOnlyMLMHead(config) + self.label_smoothing = config.label_smoothing def get_output_embeddings(self): return self.cls.predictions.decoder @@ -818,23 +820,23 @@ def set_output_embeddings(self, new_embeddings): def forward( self, - input_ids=None, - attention_mask=None, - position_ids=None, - head_mask=None, - inputs_embeds=None, - encoder_hidden_states=None, - encoder_attention_mask=None, - labels=None, - past_key_values=None, - use_cache=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - return_logits=False, - is_decoder=True, - reduction="mean", - ): + input_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + inputs_embeds: Optional[torch.Tensor] = None, + encoder_hidden_states: Optional[torch.Tensor] = None, + encoder_attention_mask: Optional[torch.Tensor] = None, + labels: Optional[torch.Tensor] = None, + past_key_values: Optional[List[torch.Tensor]] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + return_logits: Optional[bool] = False, + is_decoder: Optional[bool] = True, + reduction: Optional[str] = "mean", + ) -> Union[Tuple[torch.Tensor], CausalLMOutputWithCrossAttentions]: r""" encoder_hidden_states (`torch.FloatTensor`, *optional*): Sequence of hidden-states at the output of the last layer of the encoder. 
Used in the cross-attention if the model is @@ -888,7 +890,7 @@ def forward( # we are doing next-token prediction; shift prediction scores and input ids by one shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous() labels = labels[:, 1:].contiguous().to(shifted_prediction_scores.device) - loss_fct = CrossEntropyLoss(reduction=reduction, label_smoothing=0.1) + loss_fct = CrossEntropyLoss(reduction=reduction, label_smoothing=self.label_smoothing) lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1)) if reduction == "none": lm_loss = lm_loss.view(prediction_scores.size(0), -1).sum(1) @@ -914,7 +916,16 @@ def prepare_inputs_for_generation(self, input_ids, past_key_values=None, attenti # cut decoder_input_ids if past_key_values is used if past_key_values is not None: - input_ids = input_ids[:, -1:] + past_length = past_key_values[0][0].shape[2] + + # Some generation methods already pass only the last input ID + if input_ids.shape[1] > past_length: + remove_prefix_length = past_length + else: + # Default to old behavior: keep only final ID + remove_prefix_length = input_ids.shape[1] - 1 + + input_ids = input_ids[:, remove_prefix_length:] return { "input_ids": input_ids, @@ -928,5 +939,7 @@ def prepare_inputs_for_generation(self, input_ids, past_key_values=None, attenti def _reorder_cache(self, past_key_values, beam_idx): reordered_past = () for layer_past in past_key_values: - reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),) + reordered_past += ( + tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past), + ) return reordered_past diff --git a/src/transformers/models/blip/modeling_tf_blip.py b/src/transformers/models/blip/modeling_tf_blip.py new file mode 100644 index 00000000000000..5952aa145c9f78 --- /dev/null +++ b/src/transformers/models/blip/modeling_tf_blip.py @@ -0,0 +1,1710 @@ +# coding=utf-8 +# Copyright 2023 The Salesforce Team Authors and The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
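The `prepare_inputs_for_generation` change above trims `input_ids` against the cached sequence length instead of always keeping only the last token. The same logic in isolation, wrapped in a hypothetical helper name for clarity:

```
import torch


def trim_input_ids(input_ids: torch.Tensor, past_length: int) -> torch.Tensor:
    # Hypothetical helper; the body mirrors the prepare_inputs_for_generation logic above.
    if input_ids.shape[1] > past_length:
        remove_prefix_length = past_length  # some generation methods pass the full sequence
    else:
        remove_prefix_length = input_ids.shape[1] - 1  # old behavior: keep only the final token
    return input_ids[:, remove_prefix_length:]


full = torch.tensor([[101, 7592, 2088, 102]])
print(trim_input_ids(full, past_length=3))          # tensor([[102]])
print(trim_input_ids(full[:, -1:], past_length=3))  # tensor([[102]]) -- already-trimmed input stays as-is
```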
+""" TensorFlow BLIP model.""" + +from __future__ import annotations + +import warnings +from dataclasses import dataclass +from typing import Any, Optional, Tuple, Union + +import tensorflow as tf + +from ...modeling_tf_outputs import TFBaseModelOutput, TFBaseModelOutputWithPooling +from ...modeling_tf_utils import ( + TFPreTrainedModel, + get_initializer, + get_tf_activation, + keras, + keras_serializable, + shape_list, + unpack_inputs, +) +from ...tf_utils import check_embeddings_within_bounds, stable_softmax +from ...utils import ( + ModelOutput, + add_start_docstrings, + add_start_docstrings_to_model_forward, + logging, + replace_return_docstrings, +) +from .configuration_blip import BlipConfig, BlipTextConfig, BlipVisionConfig +from .modeling_tf_blip_text import BLIP_TEXT_INPUTS_DOCSTRING, TFBlipTextLMHeadModel, TFBlipTextModel + + +logger = logging.get_logger(__name__) + +_CHECKPOINT_FOR_DOC = "Salesforce/blip-vqa-base" + +TF_BLIP_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "Salesforce/blip-vqa-base", + "Salesforce/blip-vqa-capfilt-large", + "Salesforce/blip-image-captioning-base", + "Salesforce/blip-image-captioning-large", + "Salesforce/blip-itm-base-coco", + "Salesforce/blip-itm-large-coco", + "Salesforce/blip-itm-base-flickr", + "Salesforce/blip-itm-large-flickr", + # See all BLIP models at https://huggingface.co/models?filter=blip +] + + +# Copied from transformers.models.clip.modeling_tf_clip.contrastive_loss +def contrastive_loss(logits: tf.Tensor) -> tf.Tensor: + return tf.math.reduce_mean( + keras.metrics.sparse_categorical_crossentropy( + y_true=tf.range(shape_list(logits)[0]), y_pred=logits, from_logits=True + ) + ) + + +# Copied from transformers.models.clip.modeling_tf_clip.clip_loss with clip->blip +def blip_loss(similarity: tf.Tensor) -> tf.Tensor: + caption_loss = contrastive_loss(similarity) + image_loss = contrastive_loss(tf.transpose(similarity)) + return (caption_loss + image_loss) / 2.0 + + +@dataclass +class TFBlipForConditionalGenerationModelOutput(ModelOutput): + """ + Adapted from the base class for vision model's outputs that also contains image embeddings of the pooling of the + last hidden states. This class also adds the loss term from the text decoder. + + Args: + loss (`tf.Tensor`, *optional*, returned when `labels` is provided, `tf.Tensor` of shape `(1,)`): + Languge modeling loss from the text decoder. + logits (`tf.Tensor` of shape `(batch_size, sequence_length, config.vocab_size)`, *optional*): + Prediction scores of the language modeling head of the text decoder model. + image_embeds (`tf.Tensor` of shape `(batch_size, output_dim)`, *optional*): + The image embeddings obtained after applying the Vision Transformer model to the input image. + last_hidden_state (`tf.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + Sequence of hidden-states at the output of the last layer of the model. + hidden_states (`tuple(tf.Tensor)`, *optional*, returned when `output_hidden_states=True`): + Tuple of `tf.Tensor` (one for the output of the embeddings, if the model has an embedding layer, + one for + the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. + attentions (`tuple(tf.Tensor)`, *optional*, returned when `output_attentions=True` is passed): + Tuple of `tf.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. 
+ + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + """ + + loss: Tuple[tf.Tensor] | None = None + logits: Tuple[tf.Tensor] | None = None + image_embeds: tf.Tensor | None = None + last_hidden_state: tf.Tensor = None + hidden_states: Tuple[tf.Tensor, ...] | None = None + attentions: Tuple[tf.Tensor, ...] | None = None + + @property + def decoder_logits(self): + warnings.warn( + "`decoder_logits` attribute is deprecated and will be removed in version 5 of Transformers." + " Please use the `logits` attribute to retrieve the final output instead.", + FutureWarning, + ) + return self.logits + + +@dataclass +class TFBlipTextVisionModelOutput(ModelOutput): + """ + Adapted from the base class for vision model's outputs that also contains image embeddings of the pooling of the + last hidden states. This class also adds the loss term from the text decoder. + + Args: + loss (`tf.Tensor` of shape `(1,)`, *optional*, returned when `labels` is provided): + Language modeling loss from the text decoder. + image_embeds (`tf.Tensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`): + The image embeddings obtained by applying the projection layer to the pooler_output. + last_hidden_state (`tf.Tensor` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + hidden_states (`tuple(tf.Tensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `tf.Tensor` (one for the output of the embeddings, if the model has an embedding layer, + one for + the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. + attentions (`tuple(tf.Tensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `tf.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + """ + + loss: tf.Tensor | None = None + image_embeds: tf.Tensor | None = None + last_hidden_state: tf.Tensor = None + hidden_states: Tuple[tf.Tensor, ...] | None = None + attentions: Tuple[tf.Tensor, ...] | None = None + + +@dataclass +class TFBlipImageTextMatchingModelOutput(ModelOutput): + """ + Adapted from the base class for vision model's outputs that also contains image embeddings of the pooling of the + last hidden states. This class also adds the loss term from the text decoder as well as the image-text similarity + scores. + + Args: + itm_score (`tf.Tensor`): + The image-text similarity scores. + loss (`tf.Tensor` of shape `(1,)`, *optional*, returned when `labels` is provided): + Language modeling loss from the text decoder. + image_embeds (`tf.Tensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`): + The image embeddings obtained by applying the projection layer to the pooler_output. + last_hidden_state (`tf.Tensor` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model.
+ hidden_states (`tuple(tf.Tensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `tf.Tensor` (one for the output of the embeddings, if the model has an embedding layer, + one for + the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. + vision_pooler_output (`tf.Tensor` of shape `(batch_size, hidden_size)`, *optional*): + Last layer hidden-state of the vision-only branch of the model. + attentions (`tuple(tf.Tensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `tf.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + question_embeds (`tf.Tensor`): + The question embeddings obtained by the text projection layer. + """ + + itm_score: tf.Tensor | None = None + loss: tf.Tensor | None = None + image_embeds: tf.Tensor | None = None + last_hidden_state: tf.Tensor = None + hidden_states: Tuple[tf.Tensor, ...] | None = None + vision_pooler_output: tf.Tensor | None = None + attentions: Tuple[tf.Tensor, ...] | None = None + question_embeds: Tuple[tf.Tensor] | None = None + + +@dataclass +class TFBlipOutput(ModelOutput): + """ + Args: + loss (`tf.Tensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`): + Contrastive loss for image-text similarity. + logits_per_image:(`tf.Tensor` of shape `(image_batch_size, text_batch_size)`): + The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text + similarity scores. + logits_per_text:(`tf.Tensor` of shape `(text_batch_size, image_batch_size)`): + The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image + similarity scores. + text_embeds(`tf.Tensor` of shape `(batch_size, output_dim)`): + The text embeddings obtained by applying the projection layer to the pooled output of [`BlipTextModel`]. + image_embeds(`tf.Tensor` of shape `(batch_size, output_dim)`): + The image embeddings obtained by applying the projection layer to the pooled output of [`BlipVisionModel`]. + text_model_output(`BaseModelOutputWithPooling`): + The output of the [`BlipTextModel`]. + vision_model_output(`BaseModelOutputWithPooling`): + The output of the [`BlipVisionModel`].
+ """ + + loss: tf.Tensor | None = None + logits_per_image: tf.Tensor = None + logits_per_text: tf.Tensor = None + text_embeds: tf.Tensor = None + image_embeds: tf.Tensor = None + text_model_output: TFBaseModelOutputWithPooling = None + vision_model_output: TFBaseModelOutputWithPooling = None + + def to_tuple(self) -> Tuple[Any]: + return tuple( + self[k] if k not in ["text_model_output", "vision_model_output"] else getattr(self, k).to_tuple() + for k in self.keys() + ) + + +class TFBlipVisionEmbeddings(keras.layers.Layer): + def __init__(self, config: BlipVisionConfig, **kwargs): + super().__init__(**kwargs) + self.config = config + self.embed_dim = config.hidden_size + self.image_size = config.image_size + self.patch_size = config.patch_size + + self.patch_embedding = keras.layers.Conv2D( + filters=self.embed_dim, + kernel_size=self.patch_size, + strides=self.patch_size, + kernel_initializer=get_initializer(self.config.initializer_range), + data_format="channels_last", + name="patch_embedding", + ) + + self.num_patches = (self.image_size // self.patch_size) ** 2 + self.num_positions = self.num_patches + 1 + + def build(self, input_shape=None): + self.class_embedding = self.add_weight( + shape=(1, 1, self.embed_dim), + initializer=get_initializer(self.config.initializer_range), + trainable=True, + name="class_embedding", + ) + + self.position_embedding = self.add_weight( + shape=(1, self.num_positions, self.embed_dim), + initializer=get_initializer(self.config.initializer_range), + trainable=True, + name="position_embedding", + ) + + if self.built: + return + self.built = True + if getattr(self, "patch_embedding", None) is not None: + with tf.name_scope(self.patch_embedding.name): + self.patch_embedding.build([None, None, None, 3]) + + def call(self, pixel_values: tf.Tensor) -> tf.Tensor: + # Input is channels-first, we transpose. PyTorch transposes after the conv because PyTorch + # likes channels-first convs. 
+ batch_size = tf.shape(pixel_values)[0] + pixel_values = tf.transpose(pixel_values, perm=(0, 2, 3, 1)) + patch_embeds = self.patch_embedding(pixel_values) + patch_embeds = tf.reshape(patch_embeds, (batch_size, self.num_patches, -1)) + + class_embeds = tf.broadcast_to(self.class_embedding, (batch_size, 1, self.embed_dim)) + embeddings = tf.concat([class_embeds, patch_embeds], axis=1) + embeddings = embeddings + self.position_embedding[:, : tf.shape(embeddings)[1], :] + return embeddings + + +# Copied from transformers.models.clip.modeling_tf_clip.TFCLIPTextEmbeddings with CLIP->Blip +class TFBlipTextEmbeddings(keras.layers.Layer): + def __init__(self, config: BlipTextConfig, **kwargs): + super().__init__(**kwargs) + + self.embed_dim = config.hidden_size + + self.config = config + + def build(self, input_shape: tf.TensorShape = None): + with tf.name_scope("token_embedding"): + self.weight = self.add_weight( + shape=(self.config.vocab_size, self.embed_dim), + initializer=get_initializer(self.config.initializer_factor * self.config.initializer_range), + trainable=True, + name="weight", + ) + + with tf.name_scope("position_embedding"): + self.position_embedding = self.add_weight( + shape=(self.config.max_position_embeddings, self.embed_dim), + initializer=get_initializer(self.config.initializer_factor * self.config.initializer_range), + trainable=True, + name="embeddings", + ) + + super().build(input_shape) + + def call( + self, + input_ids: tf.Tensor = None, + position_ids: tf.Tensor = None, + inputs_embeds: tf.Tensor = None, + ) -> tf.Tensor: + """ + Applies embedding based on inputs tensor. + + Returns: + final_embeddings (`tf.Tensor`): output embedding tensor. + """ + if input_ids is None and inputs_embeds is None: + raise ValueError("You have to specify either input_ids or inputs_embeds") + + if inputs_embeds is None: + check_embeddings_within_bounds(input_ids, self.config.vocab_size) + inputs_embeds = tf.gather(params=self.weight, indices=input_ids) + + input_shape = shape_list(inputs_embeds)[:-1] + + if position_ids is None: + position_ids = tf.expand_dims(tf.range(start=0, limit=input_shape[-1]), axis=0) + + position_embeds = tf.gather(params=self.position_embedding, indices=position_ids) + position_embeds = tf.tile(input=position_embeds, multiples=(input_shape[0], 1, 1)) + final_embeddings = inputs_embeds + position_embeds + + return final_embeddings + + +class TFBlipAttention(keras.layers.Layer): + """Multi-headed attention from 'Attention Is All You Need' paper""" + + def __init__(self, config, **kwargs): + super().__init__(**kwargs) + self.config = config + self.embed_dim = config.hidden_size + self.num_heads = config.num_attention_heads + self.head_dim = self.embed_dim // self.num_heads + if self.head_dim * self.num_heads != self.embed_dim: + raise ValueError( + f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:" + f" {self.num_heads})." 
+ ) + self.scale = self.head_dim**-0.5 + self.dropout = keras.layers.Dropout(config.attention_dropout, name="dropout") + + self.qkv = keras.layers.Dense( + 3 * self.embed_dim, kernel_initializer=get_initializer(config.initializer_range), name="qkv" + ) + + self.projection = keras.layers.Dense( + self.embed_dim, kernel_initializer=get_initializer(config.initializer_range), name="projection" + ) + + def call( + self, + hidden_states: tf.Tensor, + head_mask: tf.Tensor | None = None, + output_attentions: Optional[bool] = False, + training: Optional[bool] = None, + ) -> Tuple[tf.Tensor, tf.Tensor | None, Tuple[tf.Tensor] | None]: + """Input shape: Batch x Time x Channel""" + + bsz, tgt_len, embed_dim = shape_list(hidden_states) + + mixed_qkv = self.qkv(hidden_states) + mixed_qkv = tf.reshape(mixed_qkv, (bsz, tgt_len, 3, self.num_heads, self.head_dim)) + mixed_qkv = tf.transpose(mixed_qkv, perm=(2, 0, 3, 1, 4)) + + query_states, key_states, value_states = mixed_qkv[0], mixed_qkv[1], mixed_qkv[2] + + # Take the dot product between "query" and "key" to get the raw attention scores. + attention_scores = query_states @ tf.transpose(key_states, (0, 1, 3, 2)) + + attention_scores = attention_scores * self.scale + + # Normalize the attention scores to probabilities. + attention_probs = stable_softmax(attention_scores, axis=-1) + + # This is actually dropping out entire tokens to attend to, which might + # seem a bit unusual, but is taken from the original Transformer paper. + attention_probs = self.dropout(attention_probs, training=training) + + # Mask heads if we want to + if head_mask is not None: + attention_probs = attention_probs * head_mask + + context_layer = tf.transpose(attention_probs @ value_states, perm=(0, 2, 1, 3)) + + new_context_layer_shape = shape_list(context_layer)[:-2] + [self.embed_dim] + context_layer = tf.reshape(context_layer, new_context_layer_shape) + + output = self.projection(context_layer) + + outputs = (output, attention_probs) if output_attentions else (output, None) + + return outputs + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "dropout", None) is not None: + with tf.name_scope(self.dropout.name): + self.dropout.build(None) + if getattr(self, "qkv", None) is not None: + with tf.name_scope(self.qkv.name): + self.qkv.build([None, None, self.embed_dim]) + if getattr(self, "projection", None) is not None: + with tf.name_scope(self.projection.name): + self.projection.build([None, None, self.embed_dim]) + + +class TFBlipMLP(keras.layers.Layer): + def __init__(self, config: BlipConfig, **kwargs): + super().__init__(**kwargs) + + self.activation_fn = get_tf_activation(config.hidden_act) + + in_proj_std = (config.hidden_size**-0.5) * ((2 * config.num_hidden_layers) ** -0.5) + fc_std = (2 * config.hidden_size) ** -0.5 + + self.fc1 = keras.layers.Dense( + units=config.intermediate_size, kernel_initializer=get_initializer(fc_std), name="fc1" + ) + self.fc2 = keras.layers.Dense( + units=config.hidden_size, kernel_initializer=get_initializer(in_proj_std), name="fc2" + ) + self.config = config + + def call(self, hidden_states: tf.Tensor) -> tf.Tensor: + hidden_states = self.fc1(inputs=hidden_states) + hidden_states = self.activation_fn(hidden_states) + hidden_states = self.fc2(inputs=hidden_states) + return hidden_states + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "fc1", None) is not None: + with tf.name_scope(self.fc1.name): + self.fc1.build([None, None, 
self.config.hidden_size]) + if getattr(self, "fc2", None) is not None: + with tf.name_scope(self.fc2.name): + self.fc2.build([None, None, self.config.intermediate_size]) + + +class TFBlipEncoderLayer(keras.layers.Layer): + def __init__(self, config: BlipConfig, **kwargs): + super().__init__(**kwargs) + self.embed_dim = config.hidden_size + self.self_attn = TFBlipAttention(config, name="self_attn") + self.layer_norm1 = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layer_norm1") + self.mlp = TFBlipMLP(config, name="mlp") + self.layer_norm2 = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layer_norm2") + + def call( + self, + hidden_states: tf.Tensor, + attention_mask: tf.Tensor, + output_attentions: Optional[bool] = False, + training: Optional[bool] = None, + ) -> Tuple[tf.Tensor]: + """ + Args: + hidden_states (`tf.Tensor`): input to the layer of shape `(batch, seq_len, embed_dim)` + attention_mask (`tf.Tensor`): attention mask of size + `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values. + `(config.encoder_attention_heads,)`. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under + returned tensors for more detail. + """ + residual = hidden_states + + hidden_states = self.layer_norm1(hidden_states) + hidden_states, attn_weights = self.self_attn( + hidden_states=hidden_states, + head_mask=attention_mask, + output_attentions=output_attentions, + training=training, + ) + hidden_states = hidden_states + residual + residual = hidden_states + hidden_states = self.layer_norm2(hidden_states) + hidden_states = self.mlp(hidden_states) + + hidden_states = hidden_states + residual + + outputs = (hidden_states,) + + if output_attentions: + outputs += (attn_weights,) + + return outputs + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "self_attn", None) is not None: + with tf.name_scope(self.self_attn.name): + self.self_attn.build(None) + if getattr(self, "layer_norm1", None) is not None: + with tf.name_scope(self.layer_norm1.name): + self.layer_norm1.build([None, None, self.embed_dim]) + if getattr(self, "mlp", None) is not None: + with tf.name_scope(self.mlp.name): + self.mlp.build(None) + if getattr(self, "layer_norm2", None) is not None: + with tf.name_scope(self.layer_norm2.name): + self.layer_norm2.build([None, None, self.embed_dim]) + + +class TFBlipPreTrainedModel(TFPreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. + """ + + config_class = BlipConfig + base_model_prefix = "blip" + _keys_to_ignore_on_load_missing = [r"position_ids"] + + +BLIP_START_DOCSTRING = r""" + This model inherits from [`TFPreTrainedModel`]. Check the superclass documentation for the generic methods the + library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads + etc.) + + This model is also a [keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it + as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage and + behavior. + + Parameters: + config ([`BlipConfig`]): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the + configuration. 
Check out the [`~TFPreTrainedModel.from_pretrained`] method to load the model weights. +""" + +BLIP_VISION_INPUTS_DOCSTRING = r""" + Args: + pixel_values (`tf.Tensor` of shape `(batch_size, num_channels, height, width)`): + Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using + [`BlipImageProcessor`]. See [`BlipImageProcessor.__call__`] for details. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + +BLIP_INPUTS_DOCSTRING = r""" + Args: + input_ids (`tf.Tensor` of shape `(batch_size, sequence_length)`): + Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide + it. + + Indices can be obtained using [`AutoProcessor`]. See [`BlipProcessor.__call__`] for details. + + [What are input IDs?](../glossary#input-ids) + attention_mask (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + position_ids (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, + config.max_position_embeddings - 1]`. + + [What are position IDs?](../glossary#position-ids) + pixel_values (`tf.Tensor` of shape `(batch_size, num_channels, height, width)`): + Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using + [`BlipImageProcessor`]. See [`BlipImageProcessor.__call__`] for details. + return_loss (`bool`, *optional*): + Whether or not to return the contrastive loss. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + + +@keras_serializable +class TFBlipEncoder(keras.layers.Layer): + config_class = BlipConfig + """ + Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a + [`BlipEncoderLayer`]. + + Args: + config (`BlipConfig`): + The corresponding vision configuration for the `BlipEncoder`. 
+ """ + + def __init__(self, config: BlipConfig, **kwargs): + super().__init__(**kwargs) + self.config = config + self.layers = [TFBlipEncoderLayer(config, name=f"layers_._{i}") for i in range(config.num_hidden_layers)] + + @unpack_inputs + def call( + self, + inputs_embeds, + attention_mask: tf.Tensor | None = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + training: Optional[bool] = None, + ) -> Union[Tuple, TFBaseModelOutput]: + r""" + Args: + inputs_embeds (`tf.Tensor` of shape `(batch_size, sequence_length, hidden_size)`): + Embedded representation of the inputs. Should be float, not int tokens. + attention_mask (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under + returned tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors + for more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + encoder_states = () if output_hidden_states else None + all_attentions = () if output_attentions else None + + hidden_states = inputs_embeds + for idx, encoder_layer in enumerate(self.layers): + if output_hidden_states: + encoder_states = encoder_states + (hidden_states,) + layer_outputs = encoder_layer( + hidden_states, + attention_mask, + output_attentions=output_attentions, + training=training, + ) + + hidden_states = layer_outputs[0] + + if output_attentions: + all_attentions = all_attentions + (layer_outputs[1],) + + if output_hidden_states: + encoder_states = encoder_states + (hidden_states,) + + if not return_dict: + return tuple(v for v in [hidden_states, encoder_states, all_attentions] if v is not None) + return TFBaseModelOutput( + last_hidden_state=hidden_states, hidden_states=encoder_states, attentions=all_attentions + ) + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "layers", None) is not None: + for layer in self.layers: + with tf.name_scope(layer.name): + layer.build(None) + + +class TFBlipVisionModel(TFBlipPreTrainedModel): + main_input_name = "pixel_values" + config_class = BlipVisionConfig + + def __init__(self, config: BlipVisionConfig, *args, **kwargs): + super().__init__(config, *args, **kwargs) + self.config = config + + self.embeddings = TFBlipVisionEmbeddings(config, name="embeddings") + self.encoder = TFBlipEncoder(config, name="encoder") + self.post_layernorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="post_layernorm") + self.embed_dim = config.hidden_size + + def serving_output(self, output: TFBaseModelOutputWithPooling) -> TFBaseModelOutputWithPooling: + hs = tf.convert_to_tensor(output.hidden_states) 
if self.config.output_hidden_states else None + attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None + + return TFBaseModelOutputWithPooling( + last_hidden_state=output.last_hidden_state, + pooler_output=output.pooler_output, + hidden_states=hs, + attentions=attns, + ) + + @unpack_inputs + @add_start_docstrings_to_model_forward(BLIP_VISION_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=TFBaseModelOutputWithPooling, config_class=BlipVisionConfig) + def call( + self, + pixel_values: tf.Tensor | None = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + training: Optional[bool] = None, + ) -> Union[Tuple, TFBaseModelOutputWithPooling]: + r""" + Returns: + + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if pixel_values is None: + raise ValueError("You have to specify pixel_values") + + hidden_states = self.embeddings(pixel_values) + + encoder_outputs = self.encoder( + inputs_embeds=hidden_states, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + training=training, + ) + + last_hidden_state = encoder_outputs[0] + last_hidden_state = self.post_layernorm(last_hidden_state) + + pooled_output = last_hidden_state[:, 0, :] + # TF gets confused if we call the layer with inputs of different ranks, so insert a singleton dimension + pooled_output = self.post_layernorm(tf.expand_dims(pooled_output, 1)) + pooled_output = tf.squeeze(pooled_output, 1) + + if not return_dict: + return (last_hidden_state, pooled_output) + encoder_outputs[1:] + + return TFBaseModelOutputWithPooling( + last_hidden_state=last_hidden_state, + pooler_output=pooled_output, + hidden_states=encoder_outputs.hidden_states, + attentions=encoder_outputs.attentions, + ) + + def get_input_embeddings(self): + return self.embeddings + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "embeddings", None) is not None: + with tf.name_scope(self.embeddings.name): + self.embeddings.build(None) + if getattr(self, "encoder", None) is not None: + with tf.name_scope(self.encoder.name): + self.encoder.build(None) + if getattr(self, "post_layernorm", None) is not None: + with tf.name_scope(self.post_layernorm.name): + self.post_layernorm.build([None, None, self.embed_dim]) + + +class TFBlipMainLayer(keras.layers.Layer): + config_class = BlipConfig + + def __init__(self, config: BlipConfig, *args, **kwargs): + super().__init__(*args, **kwargs) + + if not isinstance(config.text_config, BlipTextConfig): + raise ValueError( + "config.text_config is expected to be of type BlipTextConfig but is of type" + f" {type(config.text_config)}." + ) + + if not isinstance(config.vision_config, BlipVisionConfig): + raise ValueError( + "config.vision_config is expected to be of type BlipVisionConfig but is of type" + f" {type(config.vision_config)}." 
+ ) + + text_config = config.text_config + vision_config = config.vision_config + + self.projection_dim = config.projection_dim + self.text_embed_dim = text_config.hidden_size + self.vision_embed_dim = vision_config.hidden_size + + self.text_model = TFBlipTextModel(text_config, name="text_model") + self.vision_model = TFBlipVisionModel(vision_config, name="vision_model") + + self.visual_projection = keras.layers.Dense( + self.projection_dim, + use_bias=False, + kernel_initializer=get_initializer(config.initializer_range), + name="visual_projection", + ) + self.text_projection = keras.layers.Dense( + self.projection_dim, + use_bias=False, + kernel_initializer=get_initializer(config.initializer_range), + name="text_projection", + ) + + self.config = config + + def build(self, input_shape=None): + self.logit_scale = self.add_weight( + name="logit_scale", + shape=[], + initializer=keras.initializers.Constant(self.config.logit_scale_init_value), + trainable=True, + ) + + if self.built: + return + self.built = True + if getattr(self, "text_model", None) is not None: + with tf.name_scope(self.text_model.name): + self.text_model.build(None) + if getattr(self, "vision_model", None) is not None: + with tf.name_scope(self.vision_model.name): + self.vision_model.build(None) + if getattr(self, "visual_projection", None) is not None: + with tf.name_scope(self.visual_projection.name): + self.visual_projection.build([None, None, self.vision_embed_dim]) + if getattr(self, "text_projection", None) is not None: + with tf.name_scope(self.text_projection.name): + self.text_projection.build([None, None, self.text_embed_dim]) + + @unpack_inputs + def call( + self, + input_ids: tf.Tensor | None = None, + pixel_values: tf.Tensor | None = None, + attention_mask: tf.Tensor | None = None, + position_ids: tf.Tensor | None = None, + return_loss: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + training: Optional[bool] = None, + ) -> Union[Tuple, TFBlipOutput]: + # Use BLIP model's config for some fields (if specified) instead of those of vision & text components. 
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + vision_outputs = self.vision_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + training=training, + ) + + text_outputs = self.text_model( + input_ids=input_ids, + attention_mask=attention_mask, + position_ids=position_ids, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + training=training, + ) + + image_embeds = vision_outputs[1] + image_embeds = self.visual_projection(image_embeds) + + text_embeds = text_outputs[1] + text_embeds = self.text_projection(text_embeds) + + # normalized features + image_embeds = image_embeds / tf.norm(image_embeds, ord=2, axis=-1, keepdims=True) + text_embeds = text_embeds / tf.norm(text_embeds, ord=2, axis=-1, keepdims=True) + + # cosine similarity as logits + logit_scale = tf.exp(self.logit_scale) + logits_per_text = tf.matmul(text_embeds, image_embeds, transpose_b=True) * logit_scale + logits_per_image = tf.transpose(logits_per_text) + + loss = None + if return_loss: + loss = blip_loss(logits_per_text) + loss = tf.reshape(loss, (1,)) + + if not return_dict: + output = (logits_per_image, logits_per_text, text_embeds, image_embeds, text_outputs, vision_outputs) + return ((loss,) + output) if loss is not None else output + + return TFBlipOutput( + loss=loss, + logits_per_image=logits_per_image, + logits_per_text=logits_per_text, + text_embeds=text_embeds, + image_embeds=image_embeds, + text_model_output=text_outputs, + vision_model_output=vision_outputs, + ) + + +class TFBlipModel(TFBlipPreTrainedModel): + config_class = BlipConfig + _keys_to_ignore_on_load_missing = [r"text_decoder.cls.predictions.decoder.bias"] + main_input_name = "input_ids" + + def __init__(self, config: BlipConfig, *inputs, **kwargs): + super().__init__(config, *inputs, **kwargs) + + self.blip = TFBlipMainLayer(config, name="blip") + + def serving_output(self, output: TFBlipOutput) -> TFBlipOutput: + return TFBlipOutput( + logits_per_image=output.logits_per_image, + logits_per_text=output.logits_per_text, + text_embeds=output.text_embeds, + image_embeds=output.image_embeds, + ) + + @unpack_inputs + @add_start_docstrings_to_model_forward(BLIP_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=TFBlipOutput, config_class=BlipConfig) + def call( + self, + input_ids: tf.Tensor | None = None, + pixel_values: tf.Tensor | None = None, + attention_mask: tf.Tensor | None = None, + position_ids: tf.Tensor | None = None, + return_loss: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + training: Optional[bool] = None, + ) -> Union[Tuple, TFBlipOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import AutoProcessor, TFBlipModel + + >>> model = TFBlipModel.from_pretrained("Salesforce/blip-image-captioning-base") + >>> processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> 
inputs = processor( + ... text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="tf", padding=True + ... ) + + >>> outputs = model(**inputs) + >>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score + >>> probs = tf.nn.softmax(logits_per_image, axis=1) # we can take the softmax to get the label probabilities + ```""" + outputs = self.blip( + input_ids=input_ids, + pixel_values=pixel_values, + attention_mask=attention_mask, + position_ids=position_ids, + return_loss=return_loss, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + training=training, + ) + return outputs + + @add_start_docstrings_to_model_forward(BLIP_TEXT_INPUTS_DOCSTRING) + def get_text_features( + self, + input_ids: tf.Tensor | None = None, + attention_mask: tf.Tensor | None = None, + position_ids: tf.Tensor | None = None, + return_dict: Optional[bool] = None, + ) -> tf.Tensor: + r""" + Returns: + text_features (`tf.Tensor` of shape `(batch_size, output_dim`): The text embeddings obtained by applying + the projection layer to the pooled output of [`TFBlipTextModel`]. + + Examples: + + ```python + >>> from transformers import AutoProcessor, TFBlipModel + + >>> model = TFBlipModel.from_pretrained("Salesforce/blip-image-captioning-base") + >>> processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base") + + >>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="tf") + >>> text_features = model.get_text_features(**inputs) + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + text_outputs = self.blip.text_model( + input_ids=input_ids, + attention_mask=attention_mask, + position_ids=position_ids, + return_dict=return_dict, + ) + + pooled_output = text_outputs[1] + text_features = self.blip.text_projection(pooled_output) + + return text_features + + @add_start_docstrings_to_model_forward(BLIP_VISION_INPUTS_DOCSTRING) + def get_image_features( + self, + pixel_values: tf.Tensor | None = None, + return_dict: Optional[bool] = None, + ) -> tf.Tensor: + r""" + Returns: + image_features (`tf.Tensor` of shape `(batch_size, output_dim`): The image embeddings obtained by applying + the projection layer to the pooled output of [`TFBlipVisionModel`]. + + Examples: + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import AutoProcessor, TFBlipModel + + >>> model = TFBlipModel.from_pretrained("Salesforce/blip-image-captioning-base") + >>> processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor(images=image, return_tensors="tf") + + >>> image_features = model.get_image_features(**inputs) + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + vision_outputs = self.blip.vision_model(pixel_values=pixel_values, return_dict=return_dict) + + pooled_output = vision_outputs[1] # pooled_output + image_features = self.blip.visual_projection(pooled_output) + + return image_features + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "blip", None) is not None: + with tf.name_scope(self.blip.name): + self.blip.build(None) + + +@add_start_docstrings( + """ + BLIP Model for image captioning. 
The model consists of a vision encoder and a text decoder. One can optionally pass + `input_ids` to the model, which serve as a text prompt, to make the text decoder continue the prompt; the caption is + then generated from that text input. If no text input is provided, the decoder starts generating from the [BOS] + (beginning-of-sequence) token only. + """, + BLIP_START_DOCSTRING, +) +class TFBlipForConditionalGeneration(TFBlipPreTrainedModel): + config_class = BlipConfig + _keys_to_ignore_on_load_missing = [r"text_decoder.cls.predictions.decoder.bias"] + main_input_name = "pixel_values" + + def __init__(self, config: BlipConfig, *args, **kwargs): + super().__init__(config, *args, **kwargs) + + self.vision_model = TFBlipVisionModel(config.vision_config, name="vision_model") + + self.text_decoder = TFBlipTextLMHeadModel(config.text_config, name="text_decoder") + + self.decoder_input_ids = config.text_config.bos_token_id + self.decoder_pad_token_id = config.text_config.pad_token_id + + def get_input_embeddings(self) -> keras.layers.Layer: + return self.vision_model.embeddings.patch_embedding + + @unpack_inputs + @add_start_docstrings_to_model_forward(BLIP_VISION_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=TFBlipForConditionalGenerationModelOutput, config_class=BlipConfig) + def call( + self, + pixel_values: tf.Tensor, + input_ids: tf.Tensor | None = None, + attention_mask: tf.Tensor | None = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + labels: tf.Tensor | None = None, + return_dict: Optional[bool] = None, + training: Optional[bool] = None, + ) -> Union[Tuple, TFBlipForConditionalGenerationModelOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import AutoProcessor, TFBlipForConditionalGeneration + + >>> processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base") + >>> model = TFBlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + >>> text = "A picture of" + + >>> inputs = processor(images=image, text=text, return_tensors="tf") + + >>> outputs = model(**inputs) + ```""" + + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + vision_outputs = self.vision_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + training=training, + ) + + image_embeds = vision_outputs[0] + + outputs = self.text_decoder( + input_ids=input_ids, + attention_mask=attention_mask, + encoder_hidden_states=image_embeds, + labels=labels, + return_dict=False, + training=training, + ) + + if not return_dict: + outputs = (outputs[0], outputs[1], image_embeds, vision_outputs[0]) + vision_outputs[2:] + return tuple(output for output in outputs if output is not None) + + if labels is not None: + loss = outputs[0] + logits = outputs[1] + else: + loss = None + logits = outputs[0] + + if loss is not None and loss.shape.rank == 0: + loss = tf.reshape(loss, (1,)) + + return TFBlipForConditionalGenerationModelOutput( + loss=loss, + logits=logits, + image_embeds=image_embeds, + last_hidden_state=vision_outputs.last_hidden_state, + hidden_states=vision_outputs.hidden_states, + attentions=vision_outputs.attentions,
+ ) + + def generate( + self, + pixel_values: tf.Tensor, + input_ids: tf.Tensor | None = None, + attention_mask: tf.Tensor | None = None, + **generate_kwargs, + ) -> tf.Tensor: + r""" + Overrides *generate* function to be able to use the model as a conditional generator + + Parameters: + pixel_values (`tf.Tensor` of shape `(batch_size, num_channels, image_height, image_width)`: + Input image to be processed + input_ids (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + The sequence used as a prompt for the generation. + attention_mask (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + + Examples: + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import AutoProcessor, TFBlipForConditionalGeneration + + >>> model = TFBlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base") + >>> processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> inputs = processor(images=image, return_tensors="tf") + + >>> outputs = model.generate(**inputs) + >>> print(processor.decode(outputs[0], skip_special_tokens=True)) + two cats sleeping on a couch + ``` + """ + + batch_size = pixel_values.shape[0] + vision_outputs = self.vision_model(pixel_values=pixel_values) + + image_embeds = vision_outputs[0] + + image_attention_mask = tf.ones(shape_list(image_embeds)[:-1], dtype=tf.int32) + + if isinstance(input_ids, list): + input_ids = tf.convert_to_tensor(input_ids, dtype=tf.int32) + elif input_ids is None: + input_ids = tf.convert_to_tensor( + [[self.decoder_input_ids, self.config.text_config.eos_token_id]], dtype=tf.int32 + ) + + input_ids = tf.tile(input_ids, (batch_size, 1)) + + # PyTorch: input_ids[:, 0] = self.config.text_config.bos_token_id + input_ids = tf.concat( + [tf.ones((batch_size, 1), dtype=tf.int32) * self.config.text_config.bos_token_id, input_ids[:, 1:]], axis=1 + ) + attention_mask = attention_mask[:, :-1] if attention_mask is not None else None + + outputs = self.text_decoder.generate( + input_ids=input_ids[:, :-1], + eos_token_id=self.config.text_config.sep_token_id, + pad_token_id=self.config.text_config.pad_token_id, + attention_mask=attention_mask, + encoder_hidden_states=image_embeds, + encoder_attention_mask=image_attention_mask, + **generate_kwargs, + ) + + return outputs + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "vision_model", None) is not None: + with tf.name_scope(self.vision_model.name): + self.vision_model.build(None) + if getattr(self, "text_decoder", None) is not None: + with tf.name_scope(self.text_decoder.name): + self.text_decoder.build(None) + + +@add_start_docstrings( + """ + BLIP Model for visual question answering. The model consists of a vision encoder, a text encoder as well as a text + decoder. The vision encoder will encode the input image, the text encoder will encode the input question together + with the encoding of the image, and the text decoder will output the answer to the question. 
+ """, + BLIP_START_DOCSTRING, +) +class TFBlipForQuestionAnswering(TFBlipPreTrainedModel): + config_class = BlipConfig + _keys_to_ignore_on_load_missing = [r"text_decoder.cls.predictions.decoder.bias"] + + def __init__(self, config: BlipConfig, *args, **kwargs): + super().__init__(config, *args, **kwargs) + + self.vision_model = TFBlipVisionModel(config.vision_config, name="vision_model") + + self.text_encoder = TFBlipTextModel(config.text_config, name="text_encoder", add_pooling_layer=False) + + self.text_decoder = TFBlipTextLMHeadModel(config.text_config, name="text_decoder") + + self.decoder_pad_token_id = config.text_config.pad_token_id + self.decoder_start_token_id = config.text_config.bos_token_id + + def get_input_embeddings(self) -> keras.layers.Layer: + return self.vision_model.embeddings.patch_embedding + + # Adapted from transformers.models.t5.modeling_tf_t5.TFT5PreTrainedModel._shift_right + def _shift_right(self, input_ids): + decoder_start_token_id = self.decoder_start_token_id + pad_token_id = self.decoder_pad_token_id + + if decoder_start_token_id is None or pad_token_id is None: + raise ValueError("decoder_start_token_id and pad_token_id must be defined!") + + start_tokens = tf.fill((shape_list(input_ids)[0], 1), decoder_start_token_id) + start_tokens = tf.cast(start_tokens, input_ids.dtype) # Ensure compatible dtypes for concatenation + shifted_input_ids = tf.concat([start_tokens, input_ids[:, :-1]], -1) + + # replace possible -100 values in labels by `pad_token_id` + shifted_input_ids = tf.where( + shifted_input_ids == -100, + tf.cast(tf.fill(shape_list(shifted_input_ids), pad_token_id), shifted_input_ids.dtype), + shifted_input_ids, + ) + + # "Verify that `labels` has only positive values and -100" + tf.debugging.assert_greater_equal(shifted_input_ids, tf.constant(0, dtype=shifted_input_ids.dtype)) + + return shifted_input_ids + + @unpack_inputs + @add_start_docstrings_to_model_forward(BLIP_VISION_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=TFBlipTextVisionModelOutput, config_class=BlipVisionConfig) + def call( + self, + input_ids: tf.Tensor, + pixel_values: tf.Tensor | None = None, + decoder_input_ids: tf.Tensor | None = None, + decoder_attention_mask: tf.Tensor | None = None, + attention_mask: tf.Tensor | None = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + labels: tf.Tensor | None = None, + return_dict: Optional[bool] = None, + training: Optional[bool] = None, + ) -> Union[Tuple, TFBlipTextVisionModelOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import AutoProcessor, TFBlipForQuestionAnswering + + >>> model = TFBlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base") + >>> processor = AutoProcessor.from_pretrained("Salesforce/blip-vqa-base") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> # training + >>> text = "How many cats are in the picture?" + >>> label = "2" + >>> inputs = processor(images=image, text=text, return_tensors="tf") + >>> labels = processor(text=label, return_tensors="tf").input_ids + + >>> inputs["labels"] = labels + >>> outputs = model(**inputs) + >>> loss = outputs.loss + + >>> # inference + >>> text = "How many cats are in the picture?" 
+ >>> inputs = processor(images=image, text=text, return_tensors="tf") + >>> outputs = model.generate(**inputs) + >>> print(processor.decode(outputs[0], skip_special_tokens=True)) + 2 + ```""" + if labels is None and decoder_input_ids is None: + raise ValueError( + "Either `decoder_input_ids` or `labels` should be passed when calling" + " `TFBlipForQuestionAnswering`. if you are training the model make sure that `labels` is passed, if you" + " are using the model for inference make sure that `decoder_input_ids` is passed or call `generate`" + ) + + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + vision_outputs = self.vision_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + training=training, + ) + + image_embeds = vision_outputs[0] + image_attention_mask = tf.ones(shape_list(image_embeds)[:-1], dtype=tf.int64) + + question_embeds = self.text_encoder( + input_ids=input_ids, + attention_mask=attention_mask, + encoder_hidden_states=image_embeds, + encoder_attention_mask=image_attention_mask, + return_dict=return_dict, + training=training, + ) + + question_embeds = question_embeds[0] if not return_dict else question_embeds.last_hidden_state + + if labels is not None and decoder_input_ids is None: + # labels are already shifted right, see: https://github.com/huggingface/transformers/pull/23153 + decoder_input_ids = labels + + answer_output = self.text_decoder( + input_ids=decoder_input_ids, + attention_mask=decoder_attention_mask, + encoder_hidden_states=question_embeds, + encoder_attention_mask=attention_mask, + labels=labels, + return_dict=return_dict, + training=training, + ) + + if labels is not None: + decoder_loss = tf.reduce_mean(answer_output.loss) if return_dict else tf.reduce_mean(answer_output[0]) + else: + decoder_loss = None + + if not return_dict: + outputs = (decoder_loss, image_embeds, vision_outputs[0]) + vision_outputs[2:] + return tuple(output for output in outputs if output is not None) + + return TFBlipTextVisionModelOutput( + loss=decoder_loss, + image_embeds=image_embeds, + last_hidden_state=vision_outputs.last_hidden_state, + hidden_states=vision_outputs.hidden_states, + attentions=vision_outputs.attentions, + ) + + def generate( + self, + input_ids: tf.Tensor, + pixel_values: tf.Tensor, + attention_mask: tf.Tensor | None = None, + **generate_kwargs, + ) -> tf.Tensor: + r""" + Overrides *generate* function to be able to use the model as a conditional generator + + Parameters: + input_ids (`tf.Tensor` of shape `(batch_size, sequence_length)`): + The sequence used as a prompt for the generation. + pixel_values (`tf.Tensor` of shape `(batch_size, num_channels, image_height, image_width)`: + Input image to be processed + attention_mask (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`. `1` for + tokens that are NOT MASKED, `0` for MASKED tokens. 
+ generate_kwargs (dict, *optional*): + Additional arguments passed to the `generate` function of the decoder + + + Examples: + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import AutoProcessor, TFBlipForQuestionAnswering + + >>> model = TFBlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base") + >>> processor = AutoProcessor.from_pretrained("Salesforce/blip-vqa-base") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + >>> text = "How many cats are in the picture?" + + >>> inputs = processor(images=image, text=text, return_tensors="tf") + + >>> outputs = model.generate(**inputs) + >>> print(processor.decode(outputs[0], skip_special_tokens=True)) + 2 + ``` + """ + vision_outputs = self.vision_model(pixel_values=pixel_values) + + image_embeds = vision_outputs[0] + + image_attention_mask = tf.ones(shape_list(image_embeds)[:-1], dtype=tf.int32) + + if isinstance(input_ids, list): + input_ids = tf.convert_to_tensor(input_ids, dtype=tf.int32) + + question_outputs = self.text_encoder( + input_ids=input_ids, + attention_mask=attention_mask, + encoder_hidden_states=image_embeds, + encoder_attention_mask=image_attention_mask, + return_dict=False, + ) + + question_embeds = question_outputs[0] + + question_attention_mask = tf.ones(shape_list(question_embeds)[:-1], dtype=tf.int32) + + bos_ids = tf.fill( + (tf.shape(question_embeds)[0], 1), value=tf.cast(self.decoder_start_token_id, input_ids.dtype) + ) + + outputs = self.text_decoder.generate( + input_ids=bos_ids, + eos_token_id=self.config.text_config.sep_token_id, + pad_token_id=self.config.text_config.pad_token_id, + encoder_hidden_states=question_embeds, + encoder_attention_mask=question_attention_mask, + **generate_kwargs, + ) + + return outputs + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "vision_model", None) is not None: + with tf.name_scope(self.vision_model.name): + self.vision_model.build(None) + if getattr(self, "text_encoder", None) is not None: + with tf.name_scope(self.text_encoder.name): + self.text_encoder.build(None) + if getattr(self, "text_decoder", None) is not None: + with tf.name_scope(self.text_decoder.name): + self.text_decoder.build(None) + + +@add_start_docstrings( + """ + BLIP Model with a vision and text projector, and a classification head on top. The model is used in the context of + image-text retrieval. Given an image and a text, the model returns the probability of the text being relevant to + the image.
+ """, + BLIP_START_DOCSTRING, +) +class TFBlipForImageTextRetrieval(TFBlipPreTrainedModel): + config_class = BlipConfig + + def __init__(self, config: BlipConfig, *args, **kwargs): + super().__init__(config, *args, **kwargs) + + self.vision_model = TFBlipVisionModel(config.vision_config, name="vision_model") + + self.text_encoder = TFBlipTextModel(config.text_config, name="text_encoder", add_pooling_layer=False) + + # vision projection layer + self.vision_proj = keras.layers.Dense( + config.image_text_hidden_size, + kernel_initializer=get_initializer(config.initializer_range), + name="vision_proj", + ) + + # text projection layer + self.text_proj = keras.layers.Dense( + config.image_text_hidden_size, + kernel_initializer=get_initializer(config.initializer_range), + name="text_proj", + ) + + # image text matching head + self.itm_head = keras.layers.Dense( + 2, kernel_initializer=get_initializer(config.initializer_range), name="itm_head" + ) + + self.decoder_pad_token_id = ( + config.text_config.pad_token_id + if not hasattr(config, "decoder_pad_token_id") + else config.decoder_pad_token_id + ) + self.decoder_start_token_id = ( + config.text_config.bos_token_id + if not hasattr(config, "decoder_start_token_id") + else config.decoder_start_token_id + ) + self.config = config + + def get_input_embeddings(self) -> keras.layers.Layer: + return self.vision_model.embeddings.patch_embedding + + @unpack_inputs + @add_start_docstrings_to_model_forward(BLIP_VISION_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=TFBlipImageTextMatchingModelOutput, config_class=BlipVisionConfig) + def call( + self, + input_ids: tf.Tensor, + pixel_values: tf.Tensor | None = None, + use_itm_head: Optional[bool] = True, + attention_mask: tf.Tensor | None = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + training: Optional[bool] = None, + ) -> Union[Tuple, TFBlipImageTextMatchingModelOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import AutoProcessor, TFBlipForImageTextRetrieval + + >>> model = TFBlipForImageTextRetrieval.from_pretrained("Salesforce/blip-itm-base-coco") + >>> processor = AutoProcessor.from_pretrained("Salesforce/blip-itm-base-coco") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + >>> text = "an image of a cat" + + >>> inputs = processor(images=image, text=text, return_tensors="tf") + >>> outputs = model(**inputs) + ``` + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + vision_outputs = self.vision_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + training=training, + ) + + image_embeds = vision_outputs[0] + image_atts = tf.ones(shape_list(image_embeds)[:-1], dtype=tf.int64) + + # Matt: In PyTorch, only one path (itm/non-itm) is taken. However, in TensorFlow this can result in + # some layers not being built! To avoid this, we always call both paths, then use an if statement to select + # which output to pass to the final output. The unnecessary nodes will be pruned from the final graph, but + # not before the layers have all been built correctly. 
+ itm_question_embeds = self.text_encoder( + input_ids=input_ids, + attention_mask=attention_mask, + encoder_hidden_states=image_embeds, + encoder_attention_mask=image_atts, + return_dict=return_dict, + training=training, + ) + itm_question_embeds = itm_question_embeds[0] if not return_dict else itm_question_embeds.last_hidden_state + + itm_output = self.itm_head(itm_question_embeds[:, 0, :]) + + no_itm_question_embeds = self.text_encoder( + input_ids=input_ids, + attention_mask=attention_mask, + return_dict=return_dict, + training=training, + ) + no_itm_question_embeds = ( + no_itm_question_embeds[0] if not return_dict else no_itm_question_embeds.last_hidden_state + ) + + image_feat, _ = tf.linalg.normalize(self.vision_proj(image_embeds[:, 0, :]), ord=2, axis=-1) + text_feat, _ = tf.linalg.normalize(self.text_proj(no_itm_question_embeds[:, 0, :]), ord=2, axis=-1) + + no_itm_output = tf.matmul(image_feat, text_feat, transpose_b=True) + + if use_itm_head: + output = itm_output + question_embeds = itm_question_embeds + else: + output = no_itm_output + question_embeds = no_itm_question_embeds + + if not return_dict: + outputs = (output, vision_outputs[0]) + vision_outputs[2:] + (question_embeds,) + return tuple(output for output in outputs if output is not None) + + return TFBlipImageTextMatchingModelOutput( + itm_score=output, + last_hidden_state=vision_outputs.last_hidden_state, + hidden_states=vision_outputs.hidden_states, + attentions=vision_outputs.attentions, + question_embeds=question_embeds, + ) + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "vision_model", None) is not None: + with tf.name_scope(self.vision_model.name): + self.vision_model.build(None) + if getattr(self, "text_encoder", None) is not None: + with tf.name_scope(self.text_encoder.name): + self.text_encoder.build(None) + if getattr(self, "vision_proj", None) is not None: + with tf.name_scope(self.vision_proj.name): + self.vision_proj.build([None, None, self.config.vision_config.hidden_size]) + if getattr(self, "text_proj", None) is not None: + with tf.name_scope(self.text_proj.name): + self.text_proj.build([None, None, self.config.text_config.hidden_size]) + if getattr(self, "itm_head", None) is not None: + with tf.name_scope(self.itm_head.name): + self.itm_head.build([None, None, self.config.text_config.hidden_size]) diff --git a/src/transformers/models/blip/modeling_tf_blip_text.py b/src/transformers/models/blip/modeling_tf_blip_text.py new file mode 100644 index 00000000000000..b605a25eeb4bcf --- /dev/null +++ b/src/transformers/models/blip/modeling_tf_blip_text.py @@ -0,0 +1,1122 @@ +# coding=utf-8 +# Copyright 2023 The Salesforce Team Authors and The HuggingFace Team. All rights reserved. +# +# Licensed under the BSD-3-clause license (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# https://opensource.org/licenses/BSD-3-Clause +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +from __future__ import annotations + +import math +from typing import Optional, Tuple + +import tensorflow as tf + +from ...modeling_tf_outputs import ( + TFBaseModelOutputWithPastAndCrossAttentions, + TFBaseModelOutputWithPoolingAndCrossAttentions, + TFCausalLMOutputWithCrossAttentions, +) +from ...modeling_tf_utils import ( + TFModelInputType, + TFPreTrainedModel, + get_initializer, + get_tf_activation, + keras, + keras_serializable, + shape_list, + unpack_inputs, +) +from ...tf_utils import check_embeddings_within_bounds, invert_attention_mask, stable_softmax +from ...utils import add_start_docstrings_to_model_forward, logging +from .configuration_blip import BlipTextConfig + + +logger = logging.get_logger(__name__) + +BLIP_TEXT_INPUTS_DOCSTRING = r""" + Args: + input_ids (`tf.Tensor` of shape `(batch_size, sequence_length)`): + Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide + it. + + Indices can be obtained using [`AutoProcessor`]. See [`BlipProcessor.__call__`] for details. + + [What are input IDs?](../glossary#input-ids) + attention_mask (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + position_ids (`tf.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, + config.max_position_embeddings - 1]`. + + [What are position IDs?](../glossary#position-ids) + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. 
+""" + + +# Adapted from https://github.com/salesforce/BLIP/blob/main/models/med.py#L52 +class TFBlipTextEmbeddings(keras.layers.Layer): + """Construct the embeddings from word and position embeddings.""" + + def __init__(self, config, **kwargs): + super().__init__(**kwargs) + self.word_embeddings = keras.layers.Embedding( + config.vocab_size, + config.hidden_size, + embeddings_initializer=get_initializer(config.initializer_range), + name="word_embeddings", + ) + self.position_embeddings = keras.layers.Embedding( + config.max_position_embeddings, + config.hidden_size, + embeddings_initializer=get_initializer(config.initializer_range), + name="position_embeddings", + ) + + # self.LayerNorm is not snake-cased to stick with PyTorch model variable name and be able to load + # any TensorFlow checkpoint file + self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") + self.dropout = keras.layers.Dropout(config.hidden_dropout_prob, name="dropout") + + self.position_ids = tf.expand_dims(tf.range(config.max_position_embeddings), 0) + self.position_embedding_type = getattr(config, "position_embedding_type", "absolute") + + self.config = config + + def call(self, input_ids=None, position_ids=None, inputs_embeds=None, past_key_values_length=0, training=None): + if input_ids is not None: + input_shape = tf.shape(input_ids) + else: + input_shape = tf.shape(inputs_embeds)[:-1] + + seq_length = input_shape[1] + + if position_ids is None: + position_ids = self.position_ids[:, past_key_values_length : seq_length + past_key_values_length] + + if inputs_embeds is None: + check_embeddings_within_bounds(input_ids, self.config.vocab_size) + inputs_embeds = self.word_embeddings(input_ids) + + embeddings = inputs_embeds + + if self.position_embedding_type == "absolute": + position_embeddings = self.position_embeddings(position_ids) + embeddings += position_embeddings + embeddings = self.LayerNorm(embeddings) + embeddings = self.dropout(embeddings, training=training) + return embeddings + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "word_embeddings", None) is not None: + with tf.name_scope(self.word_embeddings.name): + self.word_embeddings.build(None) + if getattr(self, "position_embeddings", None) is not None: + with tf.name_scope(self.position_embeddings.name): + self.position_embeddings.build(None) + if getattr(self, "LayerNorm", None) is not None: + with tf.name_scope(self.LayerNorm.name): + self.LayerNorm.build([None, None, self.config.hidden_size]) + if getattr(self, "dropout", None) is not None: + with tf.name_scope(self.dropout.name): + self.dropout.build(None) + + +# Adapted from https://github.com/salesforce/BLIP/blob/main/models/med.py#L97 +class TFBlipTextSelfAttention(keras.layers.Layer): + def __init__(self, config, is_cross_attention, **kwargs): + super().__init__(**kwargs) + self.config = config + if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"): + raise ValueError( + "The hidden size (%d) is not a multiple of the number of attention heads (%d)" + % (config.hidden_size, config.num_attention_heads) + ) + + self.num_attention_heads = config.num_attention_heads + self.attention_head_size = int(config.hidden_size / config.num_attention_heads) + self.all_head_size = self.num_attention_heads * self.attention_head_size + + self.query = keras.layers.Dense( + self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="query" + ) + 
self.key = keras.layers.Dense( + self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="key" + ) + self.value = keras.layers.Dense( + self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="value" + ) + + self.dropout = keras.layers.Dropout(config.attention_probs_dropout_prob) + self.position_embedding_type = getattr(config, "position_embedding_type", "absolute") + if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query": + self.max_position_embeddings = config.max_position_embeddings + self.distance_embedding = keras.layers.Embedding( + 2 * config.max_position_embeddings - 1, self.attention_head_size + ) + self.is_cross_attention = is_cross_attention + + def transpose_for_scores(self, x): + new_x_shape = tf.concat( + [tf.shape(x)[:-1], tf.constant([self.num_attention_heads, self.attention_head_size], dtype=tf.int32)], + axis=0, + ) + x = tf.reshape(x, new_x_shape) + return tf.transpose(x, perm=(0, 2, 1, 3)) + + def call( + self, + hidden_states, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + past_key_value=None, + output_attentions=False, + training=None, + ): + mixed_query_layer = self.query(hidden_states) + + # If this is instantiated as a cross-attention module, the keys + # and values come from an encoder; the attention mask needs to be + # such that the encoder's padding tokens are not attended to. + is_cross_attention = encoder_hidden_states is not None + + if is_cross_attention: + key_layer = self.transpose_for_scores(self.key(encoder_hidden_states)) + value_layer = self.transpose_for_scores(self.value(encoder_hidden_states)) + attention_mask = encoder_attention_mask + elif past_key_value is not None: + key_layer = self.transpose_for_scores(self.key(hidden_states)) + value_layer = self.transpose_for_scores(self.value(hidden_states)) + key_layer = tf.concat([past_key_value[0], key_layer], axis=2) + value_layer = tf.concat([past_key_value[1], value_layer], axis=2) + else: + key_layer = self.transpose_for_scores(self.key(hidden_states)) + value_layer = self.transpose_for_scores(self.value(hidden_states)) + + query_layer = self.transpose_for_scores(mixed_query_layer) + + past_key_value = (key_layer, value_layer) + + # Take the dot product between "query" and "key" to get the raw attention scores. 
+ attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True) + + if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query": + seq_length = shape_list(hidden_states)[1] + position_ids_l = tf.expand_dims(tf.range(seq_length, dtype=tf.int64), 1) + position_ids_r = tf.expand_dims(tf.range(seq_length, dtype=tf.int64), 0) + distance = position_ids_l - position_ids_r + positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1) + positional_embedding = tf.cast(positional_embedding, query_layer.dtype) # fp16 compatibility + + if self.position_embedding_type == "relative_key": + relative_position_scores = tf.einsum("bhld,lrd->bhlr", query_layer, positional_embedding) + attention_scores = attention_scores + relative_position_scores + elif self.position_embedding_type == "relative_key_query": + relative_position_scores_query = tf.einsum("bhld,lrd->bhlr", query_layer, positional_embedding) + relative_position_scores_key = tf.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding) + attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key + + attention_scores = attention_scores / math.sqrt(self.attention_head_size) + if attention_mask is not None: + # Apply the attention mask (precomputed for all layers in the TFBlipTextModel call() function) + attention_scores = attention_scores + tf.cast(attention_mask, attention_scores.dtype) + + # Normalize the attention scores to probabilities. + attention_probs = stable_softmax(attention_scores, axis=-1) + + # This is actually dropping out entire tokens to attend to, which might + # seem a bit unusual, but is taken from the original Transformer paper.
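The additive mask applied above deserves a quick illustration. The standalone sketch below is illustrative only and not part of the patch; the numbers are made up. It shows why adding a large negative value to the raw scores of padded positions drives their post-softmax probability to (effectively) zero.

```python
import tensorflow as tf

# Raw attention scores for one query over three key positions; the last key is padding.
scores = tf.constant([[2.0, 1.0, 0.5]])
additive_mask = tf.constant([[0.0, 0.0, -10000.0]])  # 0.0 keeps a position, -10000.0 masks it

# After softmax the masked position contributes (almost) nothing.
probs = tf.nn.softmax(scores + additive_mask, axis=-1)
print(probs.numpy())  # approximately [[0.73, 0.27, 0.00]]
```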
+ attention_probs_dropped = self.dropout(attention_probs, training=training) + + # Mask heads if we want to + if head_mask is not None: + attention_probs_dropped = attention_probs_dropped * head_mask + + context_layer = attention_probs_dropped @ value_layer + + context_layer = tf.transpose(context_layer, perm=(0, 2, 1, 3)) + new_context_layer_shape = shape_list(context_layer)[:-2] + [self.all_head_size] + context_layer = tf.reshape(context_layer, new_context_layer_shape) + + outputs = (context_layer, attention_probs) if output_attentions else (context_layer,) + + outputs = outputs + (past_key_value,) + return outputs + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "query", None) is not None: + with tf.name_scope(self.query.name): + self.query.build([None, None, self.config.hidden_size]) + if self.is_cross_attention: + if getattr(self, "key", None) is not None: + with tf.name_scope(self.key.name): + self.key.build([None, None, self.config.encoder_hidden_size]) + if getattr(self, "value", None) is not None: + with tf.name_scope(self.value.name): + self.value.build([None, None, self.config.encoder_hidden_size]) + else: + if getattr(self, "key", None) is not None: + with tf.name_scope(self.key.name): + self.key.build([None, None, self.config.hidden_size]) + if getattr(self, "value", None) is not None: + with tf.name_scope(self.value.name): + self.value.build([None, None, self.config.hidden_size]) + + +class TFBlipTextSelfOutput(keras.layers.Layer): + def __init__(self, config: BlipTextConfig, **kwargs): + super().__init__(**kwargs) + + self.dense = keras.layers.Dense( + units=config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="dense" + ) + self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") + self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob) + self.config = config + + def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, training: Optional[bool] = None) -> tf.Tensor: + hidden_states = self.dense(inputs=hidden_states) + hidden_states = self.dropout(inputs=hidden_states, training=training) + hidden_states = self.LayerNorm(inputs=hidden_states + input_tensor) + + return hidden_states + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "dense", None) is not None: + with tf.name_scope(self.dense.name): + self.dense.build([None, None, self.config.hidden_size]) + if getattr(self, "LayerNorm", None) is not None: + with tf.name_scope(self.LayerNorm.name): + self.LayerNorm.build([None, None, self.config.hidden_size]) + + +# Adapted from https://github.com/salesforce/BLIP/blob/main/models/med.py#242 +class TFBlipTextAttention(keras.layers.Layer): + def __init__(self, config, is_cross_attention=False, **kwargs): + super().__init__(**kwargs) + self.self = TFBlipTextSelfAttention(config, is_cross_attention, name="self") + # "output" is a protected attribute on TF models + self.self_output = TFBlipTextSelfOutput(config, name="output") + + def call( + self, + hidden_states: tf.Tensor, + attention_mask: tf.Tensor | None = None, + head_mask: tf.Tensor | None = None, + encoder_hidden_states: tf.Tensor | None = None, + encoder_attention_mask: tf.Tensor | None = None, + past_key_value: Tuple[Tuple[tf.Tensor]] | None = None, + output_attentions: Optional[bool] = False, + training: Optional[bool] = None, + ): + self_outputs = self.self( + hidden_states, + attention_mask, + head_mask, + 
encoder_hidden_states, + encoder_attention_mask, + past_key_value, + output_attentions, + training=training, + ) + attention_output = self.self_output(self_outputs[0], hidden_states, training=training) + outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them + return outputs + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "self", None) is not None: + with tf.name_scope(self.self.name): + self.self.build(None) + if getattr(self, "self_output", None) is not None: + with tf.name_scope(self.self_output.name): + self.self_output.build(None) + + +# Copied from transformers.models.bert.modeling_tf_bert.TFBertIntermediate with Bert->BlipText +class TFBlipTextIntermediate(keras.layers.Layer): + def __init__(self, config: BlipTextConfig, **kwargs): + super().__init__(**kwargs) + + self.dense = keras.layers.Dense( + units=config.intermediate_size, kernel_initializer=get_initializer(config.initializer_range), name="dense" + ) + + if isinstance(config.hidden_act, str): + self.intermediate_act_fn = get_tf_activation(config.hidden_act) + else: + self.intermediate_act_fn = config.hidden_act + self.config = config + + def call(self, hidden_states: tf.Tensor) -> tf.Tensor: + hidden_states = self.dense(inputs=hidden_states) + hidden_states = self.intermediate_act_fn(hidden_states) + + return hidden_states + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "dense", None) is not None: + with tf.name_scope(self.dense.name): + self.dense.build([None, None, self.config.hidden_size]) + + +class TFBlipTextOutput(keras.layers.Layer): + def __init__(self, config: BlipTextConfig, **kwargs): + super().__init__(**kwargs) + + self.dense = keras.layers.Dense( + units=config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="dense" + ) + self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") + self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob) + self.config = config + + def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, training: bool = False) -> tf.Tensor: + hidden_states = self.dense(inputs=hidden_states) + hidden_states = self.dropout(inputs=hidden_states, training=training) + hidden_states = self.LayerNorm(inputs=hidden_states + input_tensor) + + return hidden_states + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "dense", None) is not None: + with tf.name_scope(self.dense.name): + self.dense.build([None, None, self.config.intermediate_size]) + if getattr(self, "LayerNorm", None) is not None: + with tf.name_scope(self.LayerNorm.name): + self.LayerNorm.build([None, None, self.config.hidden_size]) + + +class TFBlipTextLayer(keras.layers.Layer): + def __init__(self, config, **kwargs): + super().__init__(**kwargs) + self.config = config + self.attention = TFBlipTextAttention(config, name="attention") + if self.config.is_decoder: + self.crossattention = TFBlipTextAttention( + config, is_cross_attention=self.config.is_decoder, name="crossattention" + ) + self.intermediate = TFBlipTextIntermediate(config, name="intermediate") + self.self_output = TFBlipTextOutput(config, name="output") + + def call( + self, + hidden_states, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + past_key_value=None, + output_attentions=False, + training=None, + ): + # decoder uni-directional 
self-attention cached key/values tuple is at positions 1,2 + self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None + self_attention_outputs = self.attention( + hidden_states, + attention_mask, + head_mask, + output_attentions=output_attentions, + past_key_value=self_attn_past_key_value, + training=training, + ) + attention_output = self_attention_outputs[0] + + outputs = self_attention_outputs[1:-1] + present_key_value = self_attention_outputs[-1] + + if encoder_hidden_states is not None: + cross_attention_outputs = self.crossattention( + attention_output, + attention_mask, + head_mask, + encoder_hidden_states, + encoder_attention_mask, + output_attentions=output_attentions, + training=training, + ) + attention_output = cross_attention_outputs[0] + outputs = outputs + cross_attention_outputs[1:-1] # add cross attentions if we output attention weights + intermediate_output = self.intermediate(attention_output) + layer_output = self.self_output(intermediate_output, attention_output, training=training) + outputs = (layer_output,) + outputs + + outputs = outputs + (present_key_value,) + + return outputs + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "attention", None) is not None: + with tf.name_scope(self.attention.name): + self.attention.build(None) + if getattr(self, "intermediate", None) is not None: + with tf.name_scope(self.intermediate.name): + self.intermediate.build(None) + if getattr(self, "self_output", None) is not None: + with tf.name_scope(self.self_output.name): + self.self_output.build(None) + if getattr(self, "crossattention", None) is not None: + with tf.name_scope(self.crossattention.name): + self.crossattention.build(None) + + +# Adapted from https://github.com/salesforce/BLIP/blob/main/models/med.py#L386 +@keras_serializable +class TFBlipTextEncoder(keras.layers.Layer): + config_class = BlipTextConfig + + def __init__(self, config, name=None, **kwargs): + super().__init__(name=name, **kwargs) + self.config = config + self.layer = [TFBlipTextLayer(config, name=f"layer_._{i}") for i in range(config.num_hidden_layers)] + + @unpack_inputs + def call( + self, + hidden_states, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + past_key_values=None, + use_cache=None, + output_attentions=False, + output_hidden_states=False, + return_dict=True, + training=None, + ): + all_hidden_states = () if output_hidden_states else None + all_self_attentions = () if output_attentions else None + all_cross_attentions = () if output_attentions and self.config.is_decoder else None + + next_decoder_cache = () if use_cache else None + + for i in range(self.config.num_hidden_layers): + layer_module = self.layer[i] + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + layer_head_mask = head_mask[i] if head_mask is not None else None + past_key_value = past_key_values[i] if past_key_values is not None else None + + layer_outputs = layer_module( + hidden_states, + attention_mask, + layer_head_mask, + encoder_hidden_states, + encoder_attention_mask, + past_key_value, + output_attentions, + training=training, + ) + + hidden_states = layer_outputs[0] + if use_cache: + next_decoder_cache += (layer_outputs[-1],) + if output_attentions: + all_self_attentions = all_self_attentions + (layer_outputs[1],) + all_cross_attentions = all_cross_attentions + (layer_outputs[2],) + + if output_hidden_states: + all_hidden_states = 
all_hidden_states + (hidden_states,) + + if not return_dict: + return tuple( + v + for v in [ + hidden_states, + next_decoder_cache, + all_hidden_states, + all_self_attentions, + all_cross_attentions, + ] + if v is not None + ) + return TFBaseModelOutputWithPastAndCrossAttentions( + last_hidden_state=hidden_states, + past_key_values=next_decoder_cache, + hidden_states=all_hidden_states, + attentions=all_self_attentions, + cross_attentions=all_cross_attentions, + ) + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "layer", None) is not None: + for layer in self.layer: + with tf.name_scope(layer.name): + layer.build(None) + + +# Copied from transformers.models.bert.modeling_tf_bert.TFBertPooler with Bert->BlipText +class TFBlipTextPooler(keras.layers.Layer): + def __init__(self, config: BlipTextConfig, **kwargs): + super().__init__(**kwargs) + + self.dense = keras.layers.Dense( + units=config.hidden_size, + kernel_initializer=get_initializer(config.initializer_range), + activation="tanh", + name="dense", + ) + self.config = config + + def call(self, hidden_states: tf.Tensor) -> tf.Tensor: + # We "pool" the model by simply taking the hidden state corresponding + # to the first token. + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(inputs=first_token_tensor) + + return pooled_output + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "dense", None) is not None: + with tf.name_scope(self.dense.name): + self.dense.build([None, None, self.config.hidden_size]) + + +# Copied from transformers.models.bert.modeling_tf_bert.TFBertPredictionHeadTransform with Bert->BlipText +class TFBlipTextPredictionHeadTransform(keras.layers.Layer): + def __init__(self, config: BlipTextConfig, **kwargs): + super().__init__(**kwargs) + + self.dense = keras.layers.Dense( + units=config.hidden_size, + kernel_initializer=get_initializer(config.initializer_range), + name="dense", + ) + + if isinstance(config.hidden_act, str): + self.transform_act_fn = get_tf_activation(config.hidden_act) + else: + self.transform_act_fn = config.hidden_act + + self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") + self.config = config + + def call(self, hidden_states: tf.Tensor) -> tf.Tensor: + hidden_states = self.dense(inputs=hidden_states) + hidden_states = self.transform_act_fn(hidden_states) + hidden_states = self.LayerNorm(inputs=hidden_states) + + return hidden_states + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "dense", None) is not None: + with tf.name_scope(self.dense.name): + self.dense.build([None, None, self.config.hidden_size]) + if getattr(self, "LayerNorm", None) is not None: + with tf.name_scope(self.LayerNorm.name): + self.LayerNorm.build([None, None, self.config.hidden_size]) + + +class TFBlipTextLMPredictionHead(keras.layers.Layer): + def __init__(self, config, **kwargs): + super().__init__(**kwargs) + self.transform = TFBlipTextPredictionHeadTransform(config, name="transform") + + # The output weights are the same as the input embeddings, but there is + # an output-only bias for each token. 
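To make the shape flow of this prediction head concrete, here is a tiny standalone sketch (an assumption-laden illustration with arbitrary sizes, not code from the patch): hidden states of shape `[batch, seq, hidden]` are projected to vocabulary logits by a bias-free Dense layer, and a separately created per-token bias is added on top, mirroring what the head's `call` method does.

```python
import tensorflow as tf
from tensorflow import keras

hidden_size, vocab_size = 32, 100  # arbitrary toy sizes

decoder = keras.layers.Dense(vocab_size, use_bias=False)  # bias-free projection to the vocabulary
bias = tf.zeros((vocab_size,))  # stands in for the separately created output bias weight

hidden_states = tf.random.normal((2, 5, hidden_size))  # [batch, seq, hidden]
logits = decoder(hidden_states) + bias
print(logits.shape)  # (2, 5, 100) -> [batch, seq, vocab]
```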
+ self.decoder = keras.layers.Dense( + config.vocab_size, + kernel_initializer=get_initializer(config.initializer_range), + name="decoder", + use_bias=False, + ) + self.config = config + + def build(self, input_shape=None): + self.bias = self.add_weight(name="bias", shape=(self.config.vocab_size,), initializer="zeros", trainable=True) + + if self.built: + return + self.built = True + if getattr(self, "transform", None) is not None: + with tf.name_scope(self.transform.name): + self.transform.build(None) + if getattr(self, "decoder", None) is not None: + with tf.name_scope(self.decoder.name): + self.decoder.build([None, None, self.config.hidden_size]) + + def call(self, hidden_states): + hidden_states = self.transform(hidden_states) + hidden_states = self.decoder(hidden_states) + self.bias + return hidden_states + + +class TFBlipTextOnlyMLMHead(keras.layers.Layer): + def __init__(self, config, **kwargs): + super().__init__(**kwargs) + self.predictions = TFBlipTextLMPredictionHead(config, name="predictions") + + def call(self, sequence_output: tf.Tensor) -> tf.Tensor: + prediction_scores = self.predictions(sequence_output) + return prediction_scores + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "predictions", None) is not None: + with tf.name_scope(self.predictions.name): + self.predictions.build(None) + + +# Adapted from https://github.com/salesforce/BLIP/blob/main/models/med.py#L548 +class TFBlipTextPreTrainedModel(TFPreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. + """ + + config_class = BlipTextConfig + base_model_prefix = "bert" + _keys_to_ignore_on_load_missing = [r"position_ids"] + + +# Adapted from https://github.com/salesforce/BLIP/blob/3a29b7410476bf5f2ba0955827390eb6ea1f4f9d/models/med.py#L571 +class TFBlipTextModel(TFBlipTextPreTrainedModel): + """ + The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of + cross-attention is added between the self-attention layers, following the architecture described in [Attention is + all you need](https://arxiv.org/abs/1706.03762) by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, + Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin. To behave as a decoder, the model needs to be + initialized with the `is_decoder` argument of the configuration set to `True`; an + `encoder_hidden_states` is then expected as an input to the forward pass. + """ + + def __init__(self, config, add_pooling_layer=True, name=None, **kwargs): + super().__init__(config, name=name, **kwargs) + self.config = config + + self.embeddings = TFBlipTextEmbeddings(config, name="embeddings") + self.encoder = TFBlipTextEncoder(config, name="encoder") + self.pooler = TFBlipTextPooler(config, name="pooler") if add_pooling_layer else None + + def get_input_embeddings(self): + return self.embeddings.word_embeddings + + def set_input_embeddings(self, value): + self.embeddings.word_embeddings = value + + @tf.function + def get_extended_attention_mask( + self, attention_mask: tf.Tensor, input_shape: Tuple[int], is_decoder: bool + ) -> tf.Tensor: + """ + Makes broadcastable attention and causal masks so that future and masked tokens are ignored. + + Arguments: + attention_mask (`tf.Tensor`): + Mask with ones indicating tokens to attend to, zeros for tokens to ignore. + input_shape (`Tuple[int]`): + The shape of the input to the model. + is_decoder (`bool`): + Whether the model is used as a decoder.
+ + Returns: + `tf.Tensor` The extended attention mask, with the same dtype as `attention_mask.dtype`. + """ + # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length] + # ourselves in which case we just need to make it broadcastable to all heads. + if not isinstance(attention_mask, tf.Tensor): + attention_mask = tf.convert_to_tensor(attention_mask) # Catches NumPy inputs that haven't been cast yet + if attention_mask.shape.rank == 3: + extended_attention_mask = attention_mask[:, None, :, :] + elif attention_mask.shape.rank == 2: + # Provided a padding mask of dimensions [batch_size, seq_length] + # - if the model is a decoder, apply a causal mask in addition to the padding mask + # - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length] + if is_decoder: + batch_size, seq_length = input_shape + + seq_ids = tf.range(seq_length, dtype=attention_mask.dtype) + causal_mask = tf.broadcast_to(seq_ids, (batch_size, seq_length, seq_length)) <= seq_ids[None, :, None] + # in case past_key_values are used we need to add a prefix ones mask to the causal mask + + if shape_list(causal_mask)[1] < shape_list(attention_mask)[1]: + prefix_seq_len = tf.shape(attention_mask)[1] - tf.shape(causal_mask)[1] + causal_mask = tf.concat( + [ + tf.ones((batch_size, seq_length, prefix_seq_len), dtype=causal_mask.dtype), + causal_mask, + ], + axis=-1, + ) + extended_attention_mask = ( + tf.cast(causal_mask[:, None, :, :], attention_mask.dtype) * attention_mask[:, None, None, :] + ) + else: + extended_attention_mask = attention_mask[:, None, None, :] + else: + raise ValueError( + "Wrong shape for input_ids (shape {}) or attention_mask (shape {})".format( + input_shape, attention_mask.shape + ) + ) + + # Since attention_mask is 1.0 for positions we want to attend and 0.0 for + # masked positions, this operation will create a tensor which is 0.0 for + # positions we want to attend and -10000.0 for masked positions. + # Since we are adding it to the raw scores before the softmax, this is + # effectively the same as removing these entirely. + extended_attention_mask = tf.cast(extended_attention_mask, self.dtype) # fp16 compatibility + extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0 + return extended_attention_mask + + @add_start_docstrings_to_model_forward(BLIP_TEXT_INPUTS_DOCSTRING) + @unpack_inputs + def call( + self, + input_ids: TFModelInputType | None = None, + attention_mask: tf.Tensor | None = None, + position_ids: tf.Tensor | None = None, + head_mask: tf.Tensor | None = None, + inputs_embeds: tf.Tensor | None = None, + encoder_embeds: tf.Tensor | None = None, + encoder_hidden_states: tf.Tensor | None = None, + encoder_attention_mask: tf.Tensor | None = None, + past_key_values: Tuple[Tuple[tf.Tensor]] | None = None, + use_cache: bool | None = None, + output_attentions: bool | None = None, + output_hidden_states: bool | None = None, + return_dict: bool | None = None, + is_decoder: bool = False, + training: bool = False, + ) -> Tuple[tf.Tensor] | TFBaseModelOutputWithPoolingAndCrossAttentions: + r""" + encoder_hidden_states (`tf.Tensor`, *optional*): + Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if + the model is configured as a decoder. + encoder_attention_mask (`tf.Tensor`, *optional*): + Mask to avoid performing attention on the padding token indices of the encoder input. 
This mask is used in + the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`: + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + past_key_values (`tuple(tuple(tf.Tensor))`, *optional*): + Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. + If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that + don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all + `decoder_input_ids` of shape `(batch_size, sequence_length)`. + use_cache (`bool`, *optional*): + If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see + `past_key_values`). + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if is_decoder: + use_cache = use_cache if use_cache is not None else self.config.use_cache + else: + use_cache = False + + if input_ids is not None and inputs_embeds is not None: + raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") + elif input_ids is not None: + input_shape = shape_list(input_ids) + batch_size, seq_length = input_shape + elif inputs_embeds is not None: + input_shape = shape_list(inputs_embeds)[:-1] + batch_size, seq_length = input_shape + elif encoder_embeds is not None: + input_shape = shape_list(encoder_embeds)[:-1] + batch_size, seq_length = input_shape + else: + raise ValueError("You have to specify either input_ids or inputs_embeds or encoder_embeds") + + # past_key_values_length + past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0 + + if attention_mask is None: + attention_mask = tf.ones(((batch_size, seq_length + past_key_values_length))) + + # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length] + # ourselves in which case we just need to make it broadcastable to all heads. 
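For readers unfamiliar with the broadcast trick used by `get_extended_attention_mask` above, the standalone sketch below (illustrative only, not part of the patch) builds the same lower-triangular causal mask by comparing a broadcast row of position ids with a column of position ids.

```python
import tensorflow as tf

batch_size, seq_length = 1, 4
seq_ids = tf.range(seq_length)

# Element [b, i, j] is True when key position j is not after query position i.
causal_mask = tf.broadcast_to(seq_ids, (batch_size, seq_length, seq_length)) <= seq_ids[None, :, None]
print(tf.cast(causal_mask, tf.int32).numpy()[0])
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```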
+ extended_attention_mask: tf.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, is_decoder) + + # If a 2D or 3D attention mask is provided for the cross-attention + # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length] + if encoder_hidden_states is not None: + if isinstance(encoder_hidden_states, list): + encoder_batch_size, encoder_sequence_length, _ = shape_list(encoder_hidden_states[0]) + else: + encoder_batch_size, encoder_sequence_length, _ = shape_list(encoder_hidden_states) + encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length) + + if isinstance(encoder_attention_mask, list): + encoder_extended_attention_mask = [invert_attention_mask(mask) for mask in encoder_attention_mask] + elif encoder_attention_mask is None: + encoder_attention_mask = tf.ones(encoder_hidden_shape) + encoder_extended_attention_mask = invert_attention_mask(encoder_attention_mask) + else: + encoder_extended_attention_mask = invert_attention_mask(encoder_attention_mask) + else: + encoder_extended_attention_mask = None + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x n_heads x N x N + # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] + # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length] + head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers) + + if encoder_embeds is None: + embedding_output = self.embeddings( + input_ids=input_ids, + position_ids=position_ids, + inputs_embeds=inputs_embeds, + past_key_values_length=past_key_values_length, + ) + else: + embedding_output = encoder_embeds + + encoder_outputs = self.encoder( + embedding_output, + attention_mask=extended_attention_mask, + head_mask=head_mask, + encoder_hidden_states=encoder_hidden_states, + encoder_attention_mask=encoder_extended_attention_mask, + past_key_values=past_key_values, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + training=training, + ) + sequence_output = encoder_outputs[0] + pooled_output = self.pooler(sequence_output) if self.pooler is not None else None + + if not return_dict: + return (sequence_output, pooled_output) + encoder_outputs[1:] + + return TFBaseModelOutputWithPoolingAndCrossAttentions( + last_hidden_state=sequence_output, + pooler_output=pooled_output, + past_key_values=encoder_outputs.past_key_values, + hidden_states=encoder_outputs.hidden_states, + attentions=encoder_outputs.attentions, + cross_attentions=encoder_outputs.cross_attentions, + ) + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "embeddings", None) is not None: + with tf.name_scope(self.embeddings.name): + self.embeddings.build(None) + if getattr(self, "encoder", None) is not None: + with tf.name_scope(self.encoder.name): + self.encoder.build(None) + if getattr(self, "pooler", None) is not None: + with tf.name_scope(self.pooler.name): + self.pooler.build(None) + + +# Adapted from https://github.com/salesforce/BLIP/blob/main/models/med.py#L811 +class TFBlipTextLMHeadModel(TFBlipTextPreTrainedModel): + _keys_to_ignore_on_load_unexpected = [r"pooler"] + _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias"] + + def __init__(self, config, **kwargs): + super().__init__(config, **kwargs) + + self.bert = TFBlipTextModel(config, add_pooling_layer=False, name="bert") + self.cls = 
TFBlipTextOnlyMLMHead(config, name="cls") + self.label_smoothing = config.label_smoothing + + def get_output_embeddings(self): + return self.cls.predictions.decoder + + def set_output_embeddings(self, new_embeddings): + self.cls.predictions.decoder = new_embeddings + + @add_start_docstrings_to_model_forward(BLIP_TEXT_INPUTS_DOCSTRING) + @unpack_inputs + def call( + self, + input_ids=None, + attention_mask=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + labels=None, + past_key_values=None, + use_cache=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + return_logits=False, + is_decoder=True, + training=None, + ): + r""" + encoder_hidden_states (`tf.Tensor`, *optional*): Sequence of + hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is + configured as a decoder. + encoder_attention_mask (`tf.Tensor`, *optional*): + Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in + the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`: + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + labels (`tf.Tensor`, *optional*): + Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in + `[-100, 0, ..., config.vocab_size]` (see `input_ids` docstring). Tokens with indices set to `-100` are + ignored (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]` + past_key_values (`tuple(tuple(tf.Tensor))`, *optional*): + Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. + If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that + don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all + `decoder_input_ids` of shape `(batch_size, sequence_length)`. + use_cache (`bool`, *optional*): + If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see + `past_key_values`).
+ """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + if labels is not None: + use_cache = False + + outputs = self.bert( + input_ids, + attention_mask=attention_mask, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + encoder_hidden_states=encoder_hidden_states, + encoder_attention_mask=encoder_attention_mask, + past_key_values=past_key_values, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + is_decoder=is_decoder, + training=training, + ) + + sequence_output = outputs[0] + prediction_scores = self.cls(sequence_output) + + if return_logits: + return prediction_scores[:, :-1, :] + + lm_loss = None + if labels is not None: + # we are doing next-token prediction; shift prediction scores and input ids by one + shifted_prediction_scores = prediction_scores[:, :-1, :] + shifted_prediction_scores = tf.reshape(shifted_prediction_scores, (-1, self.config.vocab_size)) + labels = labels[:, 1:] + labels = tf.reshape(labels, (-1,)) + # Keras won't give us label smoothing for sparse CE, so we de-sparsify things here + # Use relu to clamp masked labels at 0 to avoid NaN (we will be zeroing those out later anyway) + one_hot_labels = tf.one_hot(tf.nn.relu(labels), depth=self.config.vocab_size, dtype=tf.float32) + loss_fct = keras.losses.CategoricalCrossentropy( + from_logits=True, label_smoothing=self.label_smoothing, reduction="none" + ) + masked_positions = tf.cast(tf.not_equal(labels, -100), dtype=tf.float32) + lm_loss = loss_fct(one_hot_labels, shifted_prediction_scores) + lm_loss *= masked_positions + lm_loss = tf.reduce_sum(lm_loss, axis=0) / tf.math.count_nonzero(masked_positions, dtype=tf.float32) + + if not return_dict: + output = (prediction_scores,) + outputs[2:] + return ((lm_loss,) + output) if lm_loss is not None else output + + return TFCausalLMOutputWithCrossAttentions( + loss=lm_loss, + logits=prediction_scores, + past_key_values=outputs.past_key_values, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + cross_attentions=outputs.cross_attentions, + ) + + def prepare_inputs_for_generation(self, input_ids, past_key_values=None, attention_mask=None, **model_kwargs): + input_shape = input_ids.shape + # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly + if attention_mask is None: + attention_mask = tf.ones(input_shape) + + # cut decoder_input_ids if past_key_values is used + if past_key_values is not None: + input_ids = input_ids[:, -1:] + + return { + "input_ids": input_ids, + "attention_mask": attention_mask, + "past_key_values": past_key_values, + "encoder_hidden_states": model_kwargs.get("encoder_hidden_states", None), + "encoder_attention_mask": model_kwargs.get("encoder_attention_mask", None), + "is_decoder": True, + } + + def _reorder_cache(self, past_key_values, beam_idx): + reordered_past = () + for layer_past in past_key_values: + reordered_past += (tuple(tf.gather(past_state, beam_idx, axis=0) for past_state in layer_past),) + return reordered_past + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "bert", None) is not None: + with tf.name_scope(self.bert.name): + self.bert.build(None) + if getattr(self, "cls", None) is not None: + with tf.name_scope(self.cls.name): + self.cls.build(None) diff --git a/src/transformers/models/blip/processing_blip.py
b/src/transformers/models/blip/processing_blip.py index 5a9967913e4486..3b9d5c369a4412 100644 --- a/src/transformers/models/blip/processing_blip.py +++ b/src/transformers/models/blip/processing_blip.py @@ -18,6 +18,7 @@ from typing import List, Optional, Union +from ...image_utils import ImageInput from ...processing_utils import ProcessorMixin from ...tokenization_utils_base import BatchEncoding, PaddingStrategy, PreTokenizedInput, TextInput, TruncationStrategy from ...utils import TensorType @@ -36,6 +37,7 @@ class BlipProcessor(ProcessorMixin): tokenizer (`BertTokenizerFast`): An instance of ['BertTokenizerFast`]. The tokenizer is a required input. """ + attributes = ["image_processor", "tokenizer"] image_processor_class = "BlipImageProcessor" tokenizer_class = ("BertTokenizer", "BertTokenizerFast") @@ -47,7 +49,7 @@ def __init__(self, image_processor, tokenizer): def __call__( self, - images=None, + images: ImageInput = None, text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None, add_special_tokens: bool = True, padding: Union[bool, str, PaddingStrategy] = False, diff --git a/src/transformers/models/blip_2/__init__.py b/src/transformers/models/blip_2/__init__.py index 1c25156654c254..6fbfd53b3703fd 100644 --- a/src/transformers/models/blip_2/__init__.py +++ b/src/transformers/models/blip_2/__init__.py @@ -34,6 +34,7 @@ else: _import_structure["modeling_blip_2"] = [ "BLIP_2_PRETRAINED_MODEL_ARCHIVE_LIST", + "Blip2Model", "Blip2QFormerModel", "Blip2PreTrainedModel", "Blip2ForConditionalGeneration", @@ -58,6 +59,7 @@ from .modeling_blip_2 import ( BLIP_2_PRETRAINED_MODEL_ARCHIVE_LIST, Blip2ForConditionalGeneration, + Blip2Model, Blip2PreTrainedModel, Blip2QFormerModel, Blip2VisionModel, diff --git a/src/transformers/models/blip_2/configuration_blip_2.py b/src/transformers/models/blip_2/configuration_blip_2.py index 17d16cd3be5c9d..85749888a54bba 100644 --- a/src/transformers/models/blip_2/configuration_blip_2.py +++ b/src/transformers/models/blip_2/configuration_blip_2.py @@ -14,13 +14,11 @@ # limitations under the License. """ BLIP-2 model configuration""" -import copy import os from typing import Union -from transformers.models.auto.modeling_auto import MODEL_FOR_CAUSAL_LM_MAPPING_NAMES - from ...configuration_utils import PretrainedConfig +from ...models.auto.modeling_auto import MODEL_FOR_CAUSAL_LM_MAPPING_NAMES from ...utils import logging from ..auto import CONFIG_MAPPING @@ -59,15 +57,10 @@ class Blip2VisionConfig(PretrainedConfig): The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, `"relu"`, `"selu"` and `"gelu_new"` ``"gelu"` are supported. layer_norm_eps (`float`, *optional*, defaults to 1e-5): The epsilon used by the layer normalization layers. - dropout (`float`, *optional*, defaults to 0.0): - The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler. attention_dropout (`float`, *optional*, defaults to 0.0): The dropout ratio for the attention probabilities. initializer_range (`float`, *optional*, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. - initializer_factor (`float``, *optional*, defaults to 1): - A factor for initializing all weight matrices (should be kept to 1, used internally for initialization - testing). qkv_bias (`bool`, *optional*, defaults to `True`): Whether to add a bias to the queries and values in the self-attention layers. 
@@ -92,18 +85,14 @@ def __init__( self, hidden_size=1408, intermediate_size=6144, - projection_dim=512, num_hidden_layers=39, num_attention_heads=16, - num_channels=3, image_size=224, patch_size=14, hidden_act="gelu", - layer_norm_eps=0.00001, - dropout=0.0, + layer_norm_eps=1e-6, attention_dropout=0.0, initializer_range=1e-10, - initializer_factor=1.0, qkv_bias=True, **kwargs, ): @@ -111,15 +100,11 @@ def __init__( self.hidden_size = hidden_size self.intermediate_size = intermediate_size - self.projection_dim = projection_dim - self.dropout = dropout self.num_hidden_layers = num_hidden_layers self.num_attention_heads = num_attention_heads - self.num_channels = num_channels self.patch_size = patch_size self.image_size = image_size self.initializer_range = initializer_range - self.initializer_factor = initializer_factor self.attention_dropout = attention_dropout self.layer_norm_eps = layer_norm_eps self.hidden_act = hidden_act @@ -127,6 +112,8 @@ def __init__( @classmethod def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + cls._set_token_in_kwargs(kwargs) + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) # get the vision config dict if we are loading from Blip2Config @@ -185,8 +172,6 @@ class Blip2QFormerConfig(PretrainedConfig): [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155). For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658). - classifier_dropout (`float`, *optional*): - The dropout ratio for the classification head. cross_attention_frequency (`int`, *optional*, defaults to 2): The frequency of adding cross-attention to the Transformer layers. 
encoder_hidden_size (`int`, *optional*, defaults to 1408): @@ -205,6 +190,7 @@ class Blip2QFormerConfig(PretrainedConfig): >>> # Accessing the model configuration >>> configuration = model.config ```""" + model_type = "blip_2_qformer" def __init__( @@ -222,7 +208,6 @@ def __init__( layer_norm_eps=1e-12, pad_token_id=0, position_embedding_type="absolute", - classifier_dropout=None, cross_attention_frequency=2, encoder_hidden_size=1408, **kwargs, @@ -241,12 +226,13 @@ def __init__( self.initializer_range = initializer_range self.layer_norm_eps = layer_norm_eps self.position_embedding_type = position_embedding_type - self.classifier_dropout = classifier_dropout self.cross_attention_frequency = cross_attention_frequency self.encoder_hidden_size = encoder_hidden_size @classmethod def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + cls._set_token_in_kwargs(kwargs) + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) # get the qformer config dict if we are loading from Blip2Config @@ -316,7 +302,6 @@ class Blip2Config(PretrainedConfig): ```""" model_type = "blip-2" - is_composition = True def __init__(self, vision_config=None, qformer_config=None, text_config=None, num_query_tokens=32, **kwargs): super().__init__(**kwargs) @@ -338,6 +323,9 @@ def __init__(self, vision_config=None, qformer_config=None, text_config=None, nu text_model_type = text_config["model_type"] if "model_type" in text_config else "opt" self.text_config = CONFIG_MAPPING[text_model_type](**text_config) + self.tie_word_embeddings = self.text_config.tie_word_embeddings + self.is_encoder_decoder = self.text_config.is_encoder_decoder + self.num_query_tokens = num_query_tokens self.qformer_config.encoder_hidden_size = self.vision_config.hidden_size self.use_decoder_only_language_model = self.text_config.model_type in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES @@ -366,17 +354,3 @@ def from_vision_qformer_text_configs( text_config=text_config.to_dict(), **kwargs, ) - - def to_dict(self): - """ - Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`]. 
- - Returns: - `Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance, - """ - output = copy.deepcopy(self.__dict__) - output["vision_config"] = self.vision_config.to_dict() - output["qformer_config"] = self.qformer_config.to_dict() - output["text_config"] = self.text_config.to_dict() - output["model_type"] = self.__class__.model_type - return output diff --git a/src/transformers/models/blip_2/convert_blip_2_original_to_pytorch.py b/src/transformers/models/blip_2/convert_blip_2_original_to_pytorch.py index 2e33f81745a8ec..c2e6eceae53273 100644 --- a/src/transformers/models/blip_2/convert_blip_2_original_to_pytorch.py +++ b/src/transformers/models/blip_2/convert_blip_2_original_to_pytorch.py @@ -24,7 +24,8 @@ import torch # pip3 install salesforce-lavis -# I'm actually installing a slightly modified version: pip3 install git+https://github.com/nielsrogge/LAVIS.git@fix_lavis +# I'm actually installing a slightly modified version: pip3 install -U git+https://github.com/nielsrogge/LAVIS.git@blip2_float32 +# to make sure we can compare both original and HF implementation in float32 from lavis.models import load_model_and_preprocess from PIL import Image @@ -37,6 +38,7 @@ BlipImageProcessor, OPTConfig, T5Config, + set_seed, ) from transformers.utils.constants import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD @@ -145,11 +147,16 @@ def convert_blip2_checkpoint(model_name, pytorch_dump_folder_path=None, push_to_ name, type = model_name_to_original[model_name] + # note: this script is tested on 2 GPUs, as models are compared in float32, + # which requires quite some memory. Hence loading both on a + # separate device is the easiest to compare + hf_model_device = "cuda:0" if torch.cuda.is_available() else "cpu" + lavis_device = "cuda:1" if torch.cuda.is_available() else "cpu" + # load original model print("Loading original model...") - device = "cuda" if torch.cuda.is_available() else "cpu" original_model, vis_processors, _ = load_model_and_preprocess( - name=name, model_type=type, is_eval=True, device=device + name=name, model_type=type, is_eval=True, device=lavis_device ) original_model.eval() print("Done!") @@ -185,61 +192,53 @@ def convert_blip2_checkpoint(model_name, pytorch_dump_folder_path=None, push_to_ assert unexpected_keys == ["qformer.embeddings.position_ids"] image = load_demo_image() - original_pixel_values = vis_processors["eval"](image).unsqueeze(0).to(device) - input_ids = tokenizer(["\n"], return_tensors="pt").input_ids.to(device) + original_pixel_values = vis_processors["eval"](image).unsqueeze(0).to(lavis_device) + input_ids = tokenizer(["\n"], return_tensors="pt").input_ids.to(hf_model_device) # create processor image_processor = BlipImageProcessor( size={"height": image_size, "width": image_size}, image_mean=OPENAI_CLIP_MEAN, image_std=OPENAI_CLIP_STD ) processor = Blip2Processor(image_processor=image_processor, tokenizer=tokenizer) - pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device) + pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(hf_model_device) # make sure processor creates exact same pixel values - assert torch.allclose(pixel_values, original_pixel_values) + assert torch.allclose(pixel_values, original_pixel_values.to(pixel_values.device)) - original_model.to(device) - hf_model.to(device) + original_model.to(lavis_device) + hf_model.to(hf_model_device) with torch.no_grad(): if "opt" in model_name: original_logits = original_model({"image": original_pixel_values, "text_input": 
[""]}).logits - logits = hf_model(original_pixel_values, input_ids).logits + logits = hf_model(pixel_values, input_ids).logits else: original_logits = original_model( {"image": original_pixel_values, "text_input": ["\n"], "text_output": ["\n"]} ).logits labels = input_ids.masked_fill(input_ids == tokenizer.pad_token_id, -100) - logits = hf_model(original_pixel_values, input_ids, labels=labels).logits + logits = hf_model(pixel_values, input_ids, labels=labels).logits assert original_logits.shape == logits.shape print("First values of original logits:", original_logits[0, :3, :3]) print("First values of HF logits:", logits[0, :3, :3]) # assert values - if model_name == "blip2-flan-t5-xl": - expected_slice_logits = torch.tensor( - [[-41.5850, -4.4440, -8.9922], [-47.4322, -5.9143, -1.7340]], device=device - ) - assert torch.allclose(logits[0, :3, :3], expected_slice_logits, atol=1e-4) - elif model_name == "blip2-flan-t5-xl-coco": - expected_slice_logits = torch.tensor( - [[-57.0109, -9.8967, -12.6280], [-68.6578, -12.7191, -10.5065]], device=device - ) - else: - # cast to same type - target_dtype = logits.dtype - assert torch.allclose(original_logits.to(target_dtype), logits, atol=1e-2) + assert torch.allclose(original_logits.to(logits.device), logits, atol=1e-4) print("Looks ok!") print("Generating a caption...") - prompt = "" - input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device) + prompt = "Question: what object is in this image? Answer:" + input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(hf_model_device) + + set_seed(42) - original_outputs = original_model.generate({"image": original_pixel_values}) + original_outputs = original_model.generate( + {"image": original_pixel_values, "prompt": prompt}, use_nucleus_sampling=True + ) outputs = hf_model.generate( - original_pixel_values, + pixel_values, input_ids, - do_sample=False, + do_sample=True, num_beams=5, max_length=30, min_length=1, @@ -248,10 +247,9 @@ def convert_blip2_checkpoint(model_name, pytorch_dump_folder_path=None, push_to_ length_penalty=1.0, temperature=1, ) - print("Original generation:", original_outputs) - prompt_length = input_ids.shape[1] - output_text = processor.batch_decode(outputs[:, prompt_length:], skip_special_tokens=True) + output_text = processor.batch_decode(outputs, skip_special_tokens=True) output_text = [text.strip() for text in output_text] + print("Original generation:", original_outputs) print("HF generation:", output_text) if pytorch_dump_folder_path is not None: diff --git a/src/transformers/models/blip_2/modeling_blip_2.py b/src/transformers/models/blip_2/modeling_blip_2.py index 5006a7819c5812..00433f3ea349ac 100644 --- a/src/transformers/models/blip_2/modeling_blip_2.py +++ b/src/transformers/models/blip_2/modeling_blip_2.py @@ -95,9 +95,7 @@ def __init__(self, config: Blip2VisionConfig): self.image_size = config.image_size self.patch_size = config.patch_size - self.class_embedding = nn.Parameter( - torch.randn(1, 1, self.embed_dim), - ) + self.class_embedding = nn.Parameter(torch.randn(1, 1, self.embed_dim)) self.patch_embedding = nn.Conv2d( in_channels=3, out_channels=self.embed_dim, kernel_size=self.patch_size, stride=self.patch_size @@ -111,7 +109,7 @@ def __init__(self, config: Blip2VisionConfig): def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor: batch_size = pixel_values.shape[0] target_dtype = self.patch_embedding.weight.dtype - patch_embeds = self.patch_embedding(pixel_values) # shape = [*, width, grid, grid] + patch_embeds = 
self.patch_embedding(pixel_values.to(dtype=target_dtype)) # shape = [*, width, grid, grid] patch_embeds = patch_embeds.flatten(2).transpose(1, 2) class_embeds = self.class_embedding.expand(batch_size, 1, -1).to(target_dtype) @@ -171,11 +169,7 @@ def forward( mixed_qkv = mixed_qkv.reshape(bsz, tgt_len, 3, self.num_heads, embed_dim // self.num_heads).permute( 2, 0, 3, 1, 4 ) - query_states, key_states, value_states = ( - mixed_qkv[0], - mixed_qkv[1], - mixed_qkv[2], - ) + query_states, key_states, value_states = mixed_qkv[0], mixed_qkv[1], mixed_qkv[2] # Take the dot product between "query" and "key" to get the raw attention scores. attention_scores = torch.matmul(query_states, key_states.transpose(-1, -2)) @@ -279,12 +273,8 @@ class Blip2PreTrainedModel(PreTrainedModel): config_class = Blip2Config base_model_prefix = "blip" supports_gradient_checkpointing = True - _keys_to_ignore_on_load_missing = [ - r"position_ids", - r"language_model.encoder.embed_tokens.weight", - r"language_model.decoder.embed_tokens.weight", - ] _no_split_modules = ["Blip2Attention", "T5Block", "OPTDecoderLayer"] + _skip_keys_device_placement = "past_key_values" _keep_in_fp32_modules = ["wo"] def _init_weights(self, module): @@ -307,10 +297,6 @@ def _init_weights(self, module): elif isinstance(module, nn.Linear) and module.bias is not None: module.bias.data.zero_() - def _set_gradient_checkpointing(self, module, value=False): - if isinstance(module, Blip2Encoder): - module.gradient_checkpointing = value - BLIP_2_START_DOCSTRING = r""" This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the @@ -342,12 +328,49 @@ def _set_gradient_checkpointing(self, module, value=False): Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. """ +BLIP_2_TEXT_INPUTS_DOCSTRING = r""" + Args: + input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`): + Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide + it. Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. [What are input IDs?](../glossary#input-ids) + attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + [What are attention masks?](../glossary#attention-mask) + decoder_input_ids (`torch.LongTensor` of shape `(batch_size, target_sequence_length)`, *optional*): + Indices of decoder input sequence tokens in the vocabulary. + + Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. + + [What are decoder input IDs?](../glossary#decoder-input-ids) + + T5 uses the `pad_token_id` as the starting token for `decoder_input_ids` generation. If `past_key_values` + is used, optionally only the last `decoder_input_ids` have to be input (see `past_key_values`). + + To know more on how to prepare `decoder_input_ids` for pretraining take a look at [T5 + Training](./t5#training). + decoder_attention_mask (`torch.BoolTensor` of shape `(batch_size, target_sequence_length)`, *optional*): + Default behavior: generate a tensor that ignores pad tokens in `decoder_input_ids`. Causal mask will also + be used by default. 
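Regarding the `decoder_input_ids` convention documented above: for a T5-style text backbone, decoding starts from `pad_token_id` and the targets are shifted one position to the right. The snippet below is a minimal, self-contained illustration of that convention only; `shift_right` is a toy helper written for this sketch, not a library function.

```python
# Illustrative only: how a T5-style backbone derives `decoder_input_ids` from labels
# by shifting right and using `pad_token_id` as the decoder start token.
import torch


def shift_right(labels: torch.LongTensor, pad_token_id: int) -> torch.LongTensor:
    # Prepend the start token (pad_token_id for T5) and drop the last position.
    decoder_input_ids = labels.new_zeros(labels.shape)
    decoder_input_ids[:, 1:] = labels[:, :-1].clone()
    decoder_input_ids[:, 0] = pad_token_id
    # Replace any -100 (ignored label positions) with the pad token.
    decoder_input_ids.masked_fill_(decoder_input_ids == -100, pad_token_id)
    return decoder_input_ids


labels = torch.tensor([[37, 1820, 1]])      # toy token ids ending with EOS
print(shift_right(labels, pad_token_id=0))  # tensor([[   0,   37, 1820]])
```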
+ output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + BLIP_2_INPUTS_DOCSTRING = r""" Args: pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): Pixel values. Pixel values can be obtained using [`Blip2Processor`]. See [`Blip2Processor.__call__`] for details. - + input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): Indices of input sequence tokens in the vocabulary of the language model. Input tokens can optionally be provided to serve as text prompt, which the language model can continue. @@ -366,10 +389,10 @@ def _set_gradient_checkpointing(self, module, value=False): decoder_input_ids (`torch.LongTensor` of shape `(batch_size, target_sequence_length)`, *optional*): Indices of decoder input sequence tokens in the vocabulary of the language model. Only relevant in case an encoder-decoder language model (like T5) is used. - + Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for details. [What are decoder input IDs?](../glossary#decoder-input-ids) - + decoder_attention_mask (`torch.BoolTensor` of shape `(batch_size, target_sequence_length)`, *optional*): Default behavior: generate a tensor that ignores pad tokens in `decoder_input_ids`. Causal mask will also be used by default. @@ -415,9 +438,7 @@ def forward( r""" Args: inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): - Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. - This is useful if you want more control over how to convert `input_ids` indices into associated vectors - than the model's internal embedding lookup matrix. + Embedded representation of the inputs. Should be float, not int tokens. attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: @@ -448,17 +469,11 @@ def forward( if output_hidden_states: encoder_states = encoder_states + (hidden_states,) if self.gradient_checkpointing and self.training: - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, output_attentions) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(encoder_layer), + layer_outputs = self._gradient_checkpointing_func( + encoder_layer.__call__, hidden_states, attention_mask, + output_attentions, ) else: layer_outputs = encoder_layer( @@ -915,19 +930,12 @@ def forward( if getattr(self.config, "gradient_checkpointing", False) and self.training: if use_cache: - logger.warn( + logger.warning( "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." 
) use_cache = False - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, past_key_value, output_attentions, query_length) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(layer_module), + layer_outputs = self._gradient_checkpointing_func( + layer_module.__call__, hidden_states, attention_mask, layer_head_mask, @@ -1023,7 +1031,7 @@ def get_extended_attention_mask( Mask with ones indicating tokens to attend to, zeros for tokens to ignore. input_shape (`Tuple[int]`): The shape of the input to the model. - device: (`torch.device`): + device (`torch.device`): The device of the input to the model. Returns: @@ -1055,17 +1063,17 @@ def get_extended_attention_mask( def forward( self, - query_embeds, - attention_mask=None, - head_mask=None, - encoder_hidden_states=None, - encoder_attention_mask=None, - past_key_values=None, - use_cache=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - ): + query_embeds: torch.FloatTensor, + attention_mask: Optional[torch.FloatTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + encoder_attention_mask: Optional[torch.FloatTensor] = None, + past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPoolingAndCrossAttentions]: r""" encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, `optional`): Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if @@ -1115,17 +1123,13 @@ def forward( # If a 2D or 3D attention mask is provided for the cross-attention # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length] if encoder_hidden_states is not None: - if type(encoder_hidden_states) == list: + if isinstance(encoder_hidden_states, list): encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states[0].size() else: - ( - encoder_batch_size, - encoder_sequence_length, - _, - ) = encoder_hidden_states.size() + encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size() encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length) - if type(encoder_attention_mask) == list: + if isinstance(encoder_attention_mask, list): encoder_extended_attention_mask = [self.invert_attention_mask(mask) for mask in encoder_attention_mask] elif encoder_attention_mask is None: encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device) @@ -1171,6 +1175,352 @@ def forward( ) +@add_start_docstrings( + """ + BLIP-2 Model for generating text and image features. The model consists of a vision encoder, Querying Transformer + (Q-Former) and a language model. 
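The three building blocks named in this docstring (frozen vision encoder, Q-Former with learned query tokens, language model) can be sketched end to end with toy modules. The snippet below is only a rough illustration of the data flow, with made-up sizes and the Q-Former collapsed into a single cross-attention layer; it is not the real implementation.

```python
import torch
import torch.nn as nn


class ToyBlip2(nn.Module):
    def __init__(self, vision_dim=32, qformer_dim=32, lm_dim=64, num_query_tokens=8, vocab_size=100):
        super().__init__()
        self.query_tokens = nn.Parameter(torch.randn(1, num_query_tokens, qformer_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(qformer_dim, num_heads=4, batch_first=True)
        self.language_projection = nn.Linear(qformer_dim, lm_dim)
        self.token_embedding = nn.Embedding(vocab_size, lm_dim)

    def forward(self, image_embeds, input_ids):
        queries = self.query_tokens.expand(image_embeds.shape[0], -1, -1)
        # learned queries attend to the image features (stand-in for the Q-Former)
        query_output, _ = self.cross_attn(queries, image_embeds, image_embeds)
        prefix = self.language_projection(query_output)   # (B, Q, lm_dim)
        text_embeds = self.token_embedding(input_ids)      # (B, T, lm_dim)
        # the language model would consume this concatenated sequence
        return torch.cat([prefix, text_embeds], dim=1)


model = ToyBlip2()
out = model(torch.randn(2, 16, 32), torch.randint(0, 100, (2, 5)))
print(out.shape)  # torch.Size([2, 13, 64])
```

The real `Blip2Model` keeps the same flow: `query_tokens` attend to the image features inside `Blip2QFormerModel`, `language_projection` maps them into the language model's hidden size, and the result is prepended to the text embeddings.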
+ """, + BLIP_2_START_DOCSTRING, +) +class Blip2Model(Blip2PreTrainedModel): + config_class = Blip2Config + main_input_name = "pixel_values" + + def __init__(self, config: Blip2Config): + super().__init__(config) + + self.vision_model = Blip2VisionModel(config.vision_config) + + self.query_tokens = nn.Parameter(torch.zeros(1, config.num_query_tokens, config.qformer_config.hidden_size)) + self.qformer = Blip2QFormerModel(config.qformer_config) + + self.language_projection = nn.Linear(config.qformer_config.hidden_size, config.text_config.hidden_size) + if config.use_decoder_only_language_model: + language_model = AutoModelForCausalLM.from_config(config.text_config) + else: + language_model = AutoModelForSeq2SeqLM.from_config(config.text_config) + + # Update _tied_weights_keys using the base model used. + if language_model._tied_weights_keys is not None: + self._tied_weights_keys = [f"language_model.{k}" for k in language_model._tied_weights_keys] + + self.language_model = language_model + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self): + return self.language_model.get_input_embeddings() + + def set_input_embeddings(self, value): + self.language_model.set_input_embeddings(value) + + def set_output_embeddings(self, new_embeddings): + self.language_model.set_output_embeddings(new_embeddings) + + def get_output_embeddings(self) -> nn.Module: + return self.language_model.get_output_embeddings() + + def get_encoder(self): + return self.language_model.get_encoder() + + def get_decoder(self): + return self.language_model.get_decoder() + + def _tie_weights(self): + if not self.config.use_decoder_only_language_model: + self.language_model.encoder.embed_tokens = self.language_model.shared + self.language_model.decoder.embed_tokens = self.language_model.shared + + @add_start_docstrings_to_model_forward(BLIP_2_TEXT_INPUTS_DOCSTRING) + def get_text_features( + self, + input_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + decoder_input_ids: Optional[torch.Tensor] = None, + decoder_attention_mask: Optional[torch.Tensor] = None, + labels: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ): + r""" + Returns: + text_outputs (`CausalLMOutputWithPast`, or `tuple(torch.FloatTensor)` if `return_dict=False`): + The language model outputs. If `return_dict=True`, the output is a [`CausalLMOutputWithPast`] that + contains the language model logits, the past key values and the hidden states if + `output_hidden_states=True`. 
+ Examples: + ```python + >>> import torch + >>> from transformers import AutoTokenizer, Blip2Model + + >>> model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b") + + >>> tokenizer = AutoTokenizer.from_pretrained("Salesforce/blip2-opt-2.7b") + >>> inputs = tokenizer(["a photo of a cat"], padding=True, return_tensors="pt") + >>> text_features = model.get_text_features(**inputs) + ```""" + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if self.config.use_decoder_only_language_model: + text_outputs = self.language_model( + input_ids=input_ids, + attention_mask=attention_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + else: + inputs_embeds = self.language_model.get_input_embeddings()(input_ids) + + text_outputs = self.language_model( + inputs_embeds=inputs_embeds, + attention_mask=attention_mask, + decoder_input_ids=decoder_input_ids, + decoder_attention_mask=decoder_attention_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + labels=labels, + ) + + return text_outputs + + @add_start_docstrings_to_model_forward(BLIP_2_VISION_INPUTS_DOCSTRING) + def get_image_features( + self, + pixel_values: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ): + r""" + Returns: + vision_outputs (`BaseModelOutputWithPooling` or tuple of `torch.FloatTensor`): + The vision model outputs. If `return_dict=True`, the output is a [`BaseModelOutputWithPooling`] that + contains the image features, the pooled image features and the hidden states if + `output_hidden_states=True`. + Examples: + ```python + >>> import torch + >>> from PIL import Image + >>> import requests + >>> from transformers import AutoProcessor, Blip2Model + + >>> model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b") + + >>> processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b") + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + >>> inputs = processor(images=image, return_tensors="pt") + >>> image_outputs = model.get_image_features(**inputs) + ```""" + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + vision_outputs = self.vision_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + return vision_outputs + + @add_start_docstrings_to_model_forward(BLIP_2_INPUTS_DOCSTRING) + def get_qformer_features( + self, + pixel_values: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ): + r""" + Returns: + vision_outputs (`BaseModelOutputWithPooling` or tuple of `torch.FloatTensor`): + The vision model outputs. 
If `return_dict=True`, the output is a [`BaseModelOutputWithPooling`] that + contains the image features, the pooled image features and the hidden states if + `output_hidden_states=True`. + Examples: + ```python + >>> import torch + >>> from PIL import Image + >>> import requests + >>> from transformers import Blip2Processor, Blip2Model + + >>> processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b") + >>> model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b") + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + >>> inputs = processor(images=image, return_tensors="pt") + >>> qformer_outputs = model.get_qformer_features(**inputs) + ```""" + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + vision_outputs = self.vision_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + image_embeds = vision_outputs[0] + + # step 2: forward the query tokens through the QFormer, using the image embeddings for cross-attention + image_attention_mask = torch.ones(image_embeds.size()[:-1], dtype=torch.long, device=image_embeds.device) + + query_tokens = self.query_tokens.expand(image_embeds.shape[0], -1, -1) + query_outputs = self.qformer( + query_embeds=query_tokens, + encoder_hidden_states=image_embeds, + encoder_attention_mask=image_attention_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + return query_outputs + + @add_start_docstrings_to_model_forward(BLIP_2_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=Blip2ForConditionalGenerationModelOutput, config_class=Blip2VisionConfig) + def forward( + self, + pixel_values: torch.FloatTensor, + input_ids: torch.FloatTensor, + attention_mask: Optional[torch.LongTensor] = None, + decoder_input_ids: Optional[torch.LongTensor] = None, + decoder_attention_mask: Optional[torch.LongTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + labels: Optional[torch.LongTensor] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, Blip2ForConditionalGenerationModelOutput]: + r""" + Returns: + + Examples: + + ```python + >>> from PIL import Image + >>> import requests + >>> from transformers import Blip2Processor, Blip2Model + >>> import torch + + >>> device = "cuda" if torch.cuda.is_available() else "cpu" + + >>> processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b") + >>> model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16) + >>> model.to(device) # doctest: +IGNORE_RESULT + + >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" + >>> image = Image.open(requests.get(url, stream=True).raw) + + >>> prompt = "Question: how many cats are there? 
Answer:" + >>> inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, torch.float16) + + >>> outputs = model(**inputs) + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + # step 1: forward the images through the vision encoder, + # to get image embeddings of shape (batch_size, seq_len, hidden_size) + vision_outputs = self.vision_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + image_embeds = vision_outputs[0] + + # step 2: forward the query tokens through the QFormer, using the image embeddings for cross-attention + image_attention_mask = torch.ones(image_embeds.size()[:-1], dtype=torch.long, device=image_embeds.device) + + query_tokens = self.query_tokens.expand(image_embeds.shape[0], -1, -1) + query_outputs = self.qformer( + query_embeds=query_tokens, + encoder_hidden_states=image_embeds, + encoder_attention_mask=image_attention_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + query_output = query_outputs[0] + + # step 3: use the language model, conditioned on the query outputs and the prompt + language_model_inputs = self.language_projection(query_output) + language_model_attention_mask = torch.ones( + language_model_inputs.size()[:-1], dtype=torch.long, device=language_model_inputs.device + ) + inputs_embeds = self.language_model.get_input_embeddings()(input_ids) + inputs_embeds = torch.cat([language_model_inputs, inputs_embeds], dim=1) + + if attention_mask is None: + attention_mask = torch.ones_like(input_ids) + expected_device = language_model_attention_mask.device + attention_mask = torch.cat([language_model_attention_mask, attention_mask.to(expected_device)], dim=1) + + if self.config.use_decoder_only_language_model: + outputs = self.language_model( + inputs_embeds=inputs_embeds, + attention_mask=attention_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + logits = outputs.logits if return_dict else outputs[0] + loss = None + # we compute the loss here since we need to take into account the sequence length of the query embeds + if labels is not None: + labels = labels.to(logits.device) + logits = logits[:, -labels.size(1) :, :] + # Shift so that tokens < n predict n + shift_logits = logits[..., :-1, :].contiguous() + shift_labels = labels[..., 1:].contiguous().to(logits.device) + + # Flatten the tokens + loss_fct = CrossEntropyLoss(reduction="mean") + + loss = loss_fct(shift_logits.view(-1, self.config.text_config.vocab_size), shift_labels.view(-1)) + else: + outputs = self.language_model( + inputs_embeds=inputs_embeds, + attention_mask=attention_mask, + decoder_input_ids=decoder_input_ids, + decoder_attention_mask=decoder_attention_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + labels=labels, + ) + loss = outputs.loss if return_dict else outputs[0] + logits = outputs.logits if return_dict else outputs[1] + + if not return_dict: + output = (logits, vision_outputs, query_outputs, outputs) + return ((loss,) + output) if loss is not None else output + + return Blip2ForConditionalGenerationModelOutput( + loss=loss, + logits=logits, + vision_outputs=vision_outputs, + qformer_outputs=query_outputs, + language_model_outputs=outputs, + ) + + @add_start_docstrings( """ BLIP-2 Model for generating text 
given an image and an optional text prompt. The model consists of a vision @@ -1178,6 +1528,12 @@ def forward( One can optionally pass `input_ids` to the model, which serve as a text prompt, to make the language model continue the prompt. Otherwise, the language model starts generating text from the [BOS] (beginning-of-sequence) token. + + + + Note that Flan-T5 checkpoints cannot be cast to float16. They are pre-trained using bfloat16. + + """, BLIP_2_START_DOCSTRING, ) @@ -1198,13 +1554,58 @@ def __init__(self, config: Blip2Config): language_model = AutoModelForCausalLM.from_config(config.text_config) else: language_model = AutoModelForSeq2SeqLM.from_config(config.text_config) + + # Update _tied_weights_keys using the base model used. + if language_model._tied_weights_keys is not None: + self._tied_weights_keys = [f"language_model.{k}" for k in language_model._tied_weights_keys] + self.language_model = language_model # Initialize weights and apply final processing self.post_init() - def get_input_embeddings(self) -> nn.Module: - return self.vision_model.embeddings.patch_embedding + def get_input_embeddings(self): + return self.language_model.get_input_embeddings() + + def set_input_embeddings(self, value): + self.language_model.set_input_embeddings(value) + + def set_output_embeddings(self, new_embeddings): + self.language_model.set_output_embeddings(new_embeddings) + + def get_output_embeddings(self) -> nn.Module: + return self.language_model.get_output_embeddings() + + def get_encoder(self): + return self.language_model.get_encoder() + + def get_decoder(self): + return self.language_model.get_decoder() + + def _tie_weights(self): + if not self.config.use_decoder_only_language_model: + self.language_model.encoder.embed_tokens = self.language_model.shared + self.language_model.decoder.embed_tokens = self.language_model.shared + + def _preprocess_accelerate(self): + r""" + Some pre-processing hacks to make the model `accelerate` compatible. Check + https://github.com/huggingface/transformers/pull/21707 for more details. + """ + hf_device_map = self.hf_device_map + + if len(hf_device_map) > 1 and "language_model" not in hf_device_map and torch.cuda.device_count() > 1: + # warn users about unexpected behavior when using multi-GPU + BLIP-2 + `accelerate`. + logger.warning( + "The `language_model` is not in the `hf_device_map` dictionary and you are running your script" + " in a multi-GPU environment. this may lead to unexpected behavior when using `accelerate`." + " Please pass a `device_map` that contains `language_model` to remove this warning." + " Please refer to https://github.com/huggingface/blog/blob/main/accelerate-large-models.md for" + " more details on creating a `device_map` for large models.", + ) + + if hasattr(self.language_model, "_hf_hook"): + self.language_model._hf_hook.io_same_device = True # For `generate` compatibility @add_start_docstrings_to_model_forward(BLIP_2_INPUTS_DOCSTRING) @replace_return_docstrings(output_type=Blip2ForConditionalGenerationModelOutput, config_class=Blip2VisionConfig) @@ -1225,7 +1626,7 @@ def forward( Examples: - Image captioning (without providing a text prompt): + Prepare processor, model and image input ```python >>> from PIL import Image @@ -1237,13 +1638,16 @@ def forward( >>> processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b") >>> model = Blip2ForConditionalGeneration.from_pretrained( - ... "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16 - ... ) - >>> model.to(device) # doctest: +IGNORE_RESULT + ... 
"Salesforce/blip2-opt-2.7b", load_in_8bit=True, device_map={"": 0}, torch_dtype=torch.float16 + ... ) # doctest: +IGNORE_RESULT >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" >>> image = Image.open(requests.get(url, stream=True).raw) + ``` + + Image captioning (without providing a text prompt): + ```python >>> inputs = processor(images=image, return_tensors="pt").to(device, torch.float16) >>> generated_ids = model.generate(**inputs) @@ -1255,24 +1659,24 @@ def forward( Visual question answering (prompt = question): ```python - >>> from PIL import Image - >>> import requests - >>> from transformers import Blip2Processor, Blip2ForConditionalGeneration - >>> import torch + >>> prompt = "Question: how many cats are there? Answer:" + >>> inputs = processor(images=image, text=prompt, return_tensors="pt").to(device="cuda", dtype=torch.float16) - >>> device = "cuda" if torch.cuda.is_available() else "cpu" + >>> generated_ids = model.generate(**inputs) + >>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip() + >>> print(generated_text) + two + ``` - >>> processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b") - >>> model = Blip2ForConditionalGeneration.from_pretrained( - ... "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16 - ... ) - >>> model.to(device) # doctest: +IGNORE_RESULT + Note that int8 inference is also supported through [bitsandbytes](https://github.com/TimDettmers/bitsandbytes). + This greatly reduces the amount of memory used by the model while maintaining the same performance. - >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" - >>> image = Image.open(requests.get(url, stream=True).raw) + ```python + >>> model = Blip2ForConditionalGeneration.from_pretrained( + ... "Salesforce/blip2-opt-2.7b", load_in_8bit=True, device_map={"": 0}, torch_dtype=torch.bfloat16 + ... ) # doctest: +IGNORE_RESULT - >>> prompt = "Question: how many cats are there? Answer:" - >>> inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, torch.float16) + >>> inputs = processor(images=image, text=prompt, return_tensors="pt").to(device="cuda", dtype=torch.bfloat16) >>> generated_ids = model.generate(**inputs) >>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip() @@ -1311,7 +1715,7 @@ def forward( language_model_inputs.size()[:-1], dtype=torch.long, device=language_model_inputs.device ) inputs_embeds = self.language_model.get_input_embeddings()(input_ids) - inputs_embeds = torch.cat([language_model_inputs, inputs_embeds], dim=1) + inputs_embeds = torch.cat([language_model_inputs, inputs_embeds.to(language_model_inputs.device)], dim=1) if attention_mask is None: attention_mask = torch.ones_like(input_ids) @@ -1330,6 +1734,7 @@ def forward( loss = None # we compute the loss here since we need to take into account the sequence length of the query embeds if labels is not None: + labels = labels.to(logits.device) logits = logits[:, -labels.size(1) :, :] # Shift so that tokens < n predict n shift_logits = logits[..., :-1, :].contiguous() @@ -1387,6 +1792,10 @@ def generate( Returns: captions (list): A list of strings of length batch_size * num_captions. 
""" + if hasattr(self, "hf_device_map"): + # preprocess for `accelerate` + self._preprocess_accelerate() + batch_size = pixel_values.shape[0] image_embeds = self.vision_model(pixel_values, return_dict=True).last_hidden_state image_attention_mask = torch.ones(image_embeds.size()[:-1], dtype=torch.long, device=image_embeds.device) @@ -1412,11 +1821,11 @@ def generate( ) if attention_mask is None: attention_mask = torch.ones_like(input_ids) - attention_mask = torch.cat([language_attention_mask, attention_mask], dim=1) + attention_mask = torch.cat([language_attention_mask, attention_mask.to(language_attention_mask.device)], dim=1) # concatenate query embeddings with prompt embeddings - inputs_embeds = self.language_model.get_input_embeddings()(input_ids) - inputs_embeds = torch.cat([language_model_inputs, inputs_embeds], dim=1) + inputs_embeds = self.get_input_embeddings()(input_ids) + inputs_embeds = torch.cat([language_model_inputs, inputs_embeds.to(language_model_inputs.device)], dim=1) outputs = self.language_model.generate( inputs_embeds=inputs_embeds, diff --git a/src/transformers/models/blip_2/processing_blip_2.py b/src/transformers/models/blip_2/processing_blip_2.py index 5616a30ab35667..ff7044c82aedb6 100644 --- a/src/transformers/models/blip_2/processing_blip_2.py +++ b/src/transformers/models/blip_2/processing_blip_2.py @@ -18,6 +18,7 @@ from typing import List, Optional, Union +from ...image_utils import ImageInput from ...processing_utils import ProcessorMixin from ...tokenization_utils_base import BatchEncoding, PaddingStrategy, PreTokenizedInput, TextInput, TruncationStrategy from ...utils import TensorType @@ -36,6 +37,7 @@ class Blip2Processor(ProcessorMixin): tokenizer (`AutoTokenizer`): An instance of ['PreTrainedTokenizer`]. The tokenizer is a required input. """ + attributes = ["image_processor", "tokenizer"] image_processor_class = "BlipImageProcessor" tokenizer_class = "AutoTokenizer" @@ -49,7 +51,7 @@ def __init__(self, image_processor, tokenizer): # Copied from transformers.models.blip.processing_blip.BlipProcessor.__call__ def __call__( self, - images=None, + images: ImageInput = None, text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None, add_special_tokens: bool = True, padding: Union[bool, str, PaddingStrategy] = False, @@ -140,8 +142,8 @@ def batch_decode(self, *args, **kwargs): # Copied from transformers.models.blip.processing_blip.BlipProcessor.decode with BertTokenizerFast->PreTrainedTokenizer def decode(self, *args, **kwargs): """ - This method forwards all its arguments to PreTrainedTokenizer's [`~PreTrainedTokenizer.decode`]. Please refer - to the docstring of this method for more information. + This method forwards all its arguments to PreTrainedTokenizer's [`~PreTrainedTokenizer.decode`]. Please refer to + the docstring of this method for more information. 
""" return self.tokenizer.decode(*args, **kwargs) diff --git a/src/transformers/models/bloom/__init__.py b/src/transformers/models/bloom/__init__.py index c91b3d0f385ae2..32e8617e8270e9 100644 --- a/src/transformers/models/bloom/__init__.py +++ b/src/transformers/models/bloom/__init__.py @@ -14,7 +14,13 @@ from typing import TYPE_CHECKING -from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_tokenizers_available, is_torch_available +from ...utils import ( + OptionalDependencyNotAvailable, + _LazyModule, + is_flax_available, + is_tokenizers_available, + is_torch_available, +) _import_structure = { @@ -44,6 +50,19 @@ "BloomForQuestionAnswering", ] +try: + if not is_flax_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass +else: + _import_structure["modeling_flax_bloom"] = [ + "FlaxBloomForCausalLM", + "FlaxBloomModel", + "FlaxBloomPreTrainedModel", + ] + + if TYPE_CHECKING: from .configuration_bloom import BLOOM_PRETRAINED_CONFIG_ARCHIVE_MAP, BloomConfig, BloomOnnxConfig @@ -71,6 +90,13 @@ BloomPreTrainedModel, ) + try: + if not is_flax_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + else: + from .modeling_flax_bloom import FlaxBloomForCausalLM, FlaxBloomModel, FlaxBloomPreTrainedModel else: import sys diff --git a/src/transformers/models/bloom/configuration_bloom.py b/src/transformers/models/bloom/configuration_bloom.py index f2ea93c11683d6..17395625e0177e 100644 --- a/src/transformers/models/bloom/configuration_bloom.py +++ b/src/transformers/models/bloom/configuration_bloom.py @@ -18,15 +18,13 @@ from packaging import version -from transformers import is_torch_available - if TYPE_CHECKING: - from transformers import PreTrainedTokenizer, TensorType + from ... 
import PreTrainedTokenizer, TensorType from ...configuration_utils import PretrainedConfig from ...onnx import OnnxConfigWithPast, PatchingSpec -from ...utils import logging +from ...utils import is_torch_available, logging logger = logging.get_logger(__name__) diff --git a/src/transformers/models/bloom/convert_bloom_original_checkpoint_to_pytorch.py b/src/transformers/models/bloom/convert_bloom_original_checkpoint_to_pytorch.py index c8a069784d5e2b..eda9a2d815e6b8 100644 --- a/src/transformers/models/bloom/convert_bloom_original_checkpoint_to_pytorch.py +++ b/src/transformers/models/bloom/convert_bloom_original_checkpoint_to_pytorch.py @@ -71,7 +71,7 @@ def layer_name_mapping(key, file): def get_dtype_size(dtype): if dtype == torch.bool: return 1 / 8 - bit_search = re.search("[^\d](\d+)$", str(dtype)) + bit_search = re.search(r"[^\d](\d+)$", str(dtype)) if bit_search is None: raise ValueError(f"`dtype` is not a valid dtype: {dtype}.") bit_size = int(bit_search.groups()[0]) @@ -89,7 +89,7 @@ def convert_bloom_checkpoint_to_pytorch( if shard_model: file_names = os.listdir(bloom_checkpoint_path) - file_names = list(sorted(filter(lambda s: s.startswith("layer") and "model_00" in s, file_names))) + file_names = sorted(filter(lambda s: s.startswith("layer") and "model_00" in s, file_names)) index_dict = {"weight_map": {}, "metadata": {}} total_size = 0 @@ -157,7 +157,7 @@ def convert_bloom_checkpoint_to_pytorch( model = BloomModel(config) file_names = os.listdir(bloom_checkpoint_path) - file_names = list(sorted(filter(lambda s: s.startswith("layer") and "model_00" in s, file_names))) + file_names = sorted(filter(lambda s: s.startswith("layer") and "model_00" in s, file_names)) missing_keys = None for i, file in enumerate(file_names): diff --git a/src/transformers/models/bloom/modeling_bloom.py b/src/transformers/models/bloom/modeling_bloom.py index 9d333c7ed9b8a0..14700d6f12d3f7 100644 --- a/src/transformers/models/bloom/modeling_bloom.py +++ b/src/transformers/models/bloom/modeling_bloom.py @@ -25,6 +25,7 @@ from torch.nn import functional as F from ...file_utils import add_code_sample_docstrings, add_start_docstrings, add_start_docstrings_to_model_forward +from ...modeling_attn_mask_utils import _prepare_4d_causal_attention_mask from ...modeling_outputs import ( BaseModelOutputWithPastAndCrossAttentions, CausalLMOutputWithCrossAttentions, @@ -53,36 +54,6 @@ ] -def _make_causal_mask( - input_ids_shape: torch.Size, device: torch.device, past_key_values_length: int -) -> torch.BoolTensor: - """ - Make causal mask used for self-attention. - """ - batch_size, target_length = input_ids_shape - mask = torch.empty((target_length, target_length + past_key_values_length), dtype=torch.bool, device=device) - # ONNX doesn't support `torch.Tensor.triu` properly, thus we use this workaround - seq_ids = torch.arange(target_length, device=device) - mask[:, past_key_values_length:] = seq_ids[:, None] < seq_ids[None, :] - - if past_key_values_length > 0: - mask[:, :past_key_values_length] = False - - expanded_mask = mask[None, None, :, :].expand(batch_size, 1, target_length, target_length + past_key_values_length) - return expanded_mask - - -def _expand_mask(mask: torch.Tensor, tgt_length: int) -> torch.BoolTensor: - """ - Expands attention_mask from `[batch_size, src_length]` to `[batch_size, 1, tgt_length, src_length]`. 
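The Bloom-local `_make_causal_mask` / `_expand_mask` helpers removed here are superseded by the shared `_prepare_4d_causal_attention_mask` utility imported above. The toy function below re-implements the combined idea for illustration only (it is not the library helper): a 2D padding mask becomes a `[batch, 1, tgt_len, src_len]` boolean mask where `True` marks positions to ignore.

```python
import torch


def toy_4d_causal_mask(attention_mask: torch.Tensor, past_key_values_length: int = 0) -> torch.BoolTensor:
    src_length = attention_mask.shape[-1]
    seq_ids = torch.arange(src_length, device=attention_mask.device)
    # causal part: a query at absolute position i may not attend to keys at j > i
    causal = seq_ids[None, :] > seq_ids[past_key_values_length:, None]
    # padding part: positions with mask == 0 are always ignored
    padding = ~attention_mask[:, None, None, :].bool()
    return causal[None, None, :, :] | padding


mask = toy_4d_causal_mask(torch.tensor([[1, 1, 1, 0]]))
print(mask.shape)   # torch.Size([1, 1, 4, 4])
print(mask[0, 0])
# [[False,  True,  True,  True],
#  [False, False,  True,  True],
#  [False, False, False,  True],
#  [False, False, False,  True]]
```

The updated Bloom forward pass below additionally casts the helper's output with `.bool()` before handing it to the attention blocks, which expect a boolean mask.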
- """ - batch_size, src_length = mask.shape - tgt_length = tgt_length if tgt_length is not None else src_length - - expanded_mask = ~(mask[:, None, None, :].to(torch.bool)) - return expanded_mask.expand(batch_size, 1, tgt_length, src_length) - - def build_alibi_tensor(attention_mask: torch.Tensor, num_heads: int, dtype: torch.dtype) -> torch.Tensor: """ Link to paper: https://arxiv.org/abs/2108.12409 Alibi tensor is not causal as the original paper mentions, it @@ -135,7 +106,7 @@ def dropout_add(x: torch.Tensor, residual: torch.Tensor, prob: float, training: x (`torch.tensor`, *required*): input tensor residual (`torch.tensor`, *required*): - esidual tensor + residual tensor prob (`float`, *required*): dropout probability training (`bool`, *required*): @@ -253,10 +224,10 @@ def _split_heads(self, fused_qkv: torch.Tensor) -> Tuple[torch.Tensor, torch.Ten def _merge_heads(self, x: torch.Tensor) -> torch.Tensor: """ - Merge heads together over the last dimenstion + Merge heads together over the last dimension Args: - x: (`torch.tensor`, *required*): [batch_size * num_heads, seq_length, head_dim] + x (`torch.tensor`, *required*): [batch_size * num_heads, seq_length, head_dim] Returns: torch.tensor: [batch_size, seq_length, num_heads * head_dim] @@ -344,7 +315,7 @@ def forward( # matmul: [batch_size * num_heads, q_length, head_dim] context_layer = torch.bmm(attention_probs_reshaped, value_layer) - # change view [batch_size, num_heads, q_length, head_dim] + # change view [batch_size, q_length, num_heads * head_dim] context_layer = self._merge_heads(context_layer) # aggregate results across tp ranks. See here: https://github.com/pytorch/pytorch/issues/76232 @@ -471,16 +442,11 @@ def forward( class BloomPreTrainedModel(PreTrainedModel): - _keys_to_ignore_on_load_missing = [r"h.*.self_attention.scale_mask_softmax.causal_mask", r"lm_head.weight"] - """ - An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained - models. - """ - config_class = BloomConfig base_model_prefix = "transformer" supports_gradient_checkpointing = True _no_split_modules = ["BloomBlock"] + _skip_keys_device_placement = "past_key_values" def __init__(self, *inputs, **kwargs): super().__init__(*inputs, **kwargs) @@ -501,10 +467,6 @@ def _init_weights(self, module: nn.Module): module.bias.data.zero_() module.weight.data.fill_(1.0) - def _set_gradient_checkpointing(self, module: nn.Module, value: bool = False): - if isinstance(module, BloomModel): - module.gradient_checkpointing = value - @staticmethod def _convert_to_standard_cache( past_key_value: Tuple[Tuple[torch.Tensor, torch.Tensor]], batch_size: int @@ -527,7 +489,7 @@ def _convert_to_standard_cache( @staticmethod def _convert_to_bloom_cache( - past_key_value: Tuple[Tuple[torch.Tensor, torch.Tensor]] + past_key_value: Tuple[Tuple[torch.Tensor, torch.Tensor]], ) -> Tuple[Tuple[torch.Tensor, torch.Tensor]]: """ Converts the cache to the format expected by Bloom, i.e. 
to tuple(tuple([batch_size * num_heads, ...])) @@ -641,31 +603,12 @@ def __init__(self, config: BloomConfig): # Initialize weights and apply final processing self.post_init() + def build_alibi_tensor(self, attention_mask: torch.Tensor, num_heads: int, dtype: torch.dtype) -> torch.Tensor: + return build_alibi_tensor(attention_mask, num_heads, dtype) + def get_input_embeddings(self): return self.word_embeddings - def _prepare_attn_mask( - self, attention_mask: torch.Tensor, input_shape: Tuple[int, int], past_key_values_length: int - ) -> torch.BoolTensor: - # create causal mask - # [batch_size, seq_length] -> [batch_size, 1, tgt_length, src_length] - combined_attention_mask = None - device = attention_mask.device - _, src_length = input_shape - - if src_length > 1: - combined_attention_mask = _make_causal_mask( - input_shape, device=device, past_key_values_length=past_key_values_length - ) - - # [batch_size, seq_length] -> [batch_size, 1, tgt_length, src_length] - expanded_attn_mask = _expand_mask(attention_mask, tgt_length=src_length) - combined_attention_mask = ( - expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask | combined_attention_mask - ) - - return combined_attention_mask - def set_input_embeddings(self, new_embeddings: torch.Tensor): self.word_embeddings = new_embeddings @@ -732,6 +675,13 @@ def forward( all_self_attentions = () if output_attentions else None all_hidden_states = () if output_hidden_states else None + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning_once( + "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." + ) + use_cache = False + # Compute alibi tensor: check build_alibi_tensor documentation seq_length_with_past = seq_length past_key_values_length = 0 @@ -743,39 +693,30 @@ def forward( else: attention_mask = attention_mask.to(hidden_states.device) - alibi = build_alibi_tensor(attention_mask, self.num_heads, dtype=hidden_states.dtype) + alibi = self.build_alibi_tensor(attention_mask, self.num_heads, dtype=hidden_states.dtype) - causal_mask = self._prepare_attn_mask( + causal_mask = _prepare_4d_causal_attention_mask( attention_mask, input_shape=(batch_size, seq_length), + inputs_embeds=inputs_embeds, past_key_values_length=past_key_values_length, ) + causal_mask = causal_mask.bool() for i, (block, layer_past) in enumerate(zip(self.h, past_key_values)): if output_hidden_states: all_hidden_states = all_hidden_states + (hidden_states,) if self.gradient_checkpointing and self.training: - if use_cache: - logger.warning( - "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." 
- ) - use_cache = False - - def create_custom_forward(module): - def custom_forward(*inputs): - # None for past_key_value - return module(*inputs, use_cache=use_cache, output_attentions=output_attentions) - - return custom_forward - - outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(block), + outputs = self._gradient_checkpointing_func( + block.__call__, hidden_states, alibi, causal_mask, layer_past, head_mask[i], + use_cache, + output_attentions, ) else: outputs = block( @@ -820,7 +761,7 @@ def custom_forward(*inputs): BLOOM_START_DOCSTRING, ) class BloomForCausalLM(BloomPreTrainedModel): - _keys_to_ignore_on_load_missing = [r"h.*.self_attention.scale_mask_softmax.causal_mask", r"lm_head.weight"] + _tied_weights_keys = ["lm_head.weight"] def __init__(self, config: BloomConfig): super().__init__(config) @@ -844,9 +785,18 @@ def prepare_inputs_for_generation( inputs_embeds: Optional[torch.Tensor] = None, **kwargs, ) -> dict: - # only last token for input_ids if past is not None - if past_key_values: - input_ids = input_ids[:, -1].unsqueeze(-1) + # only last tokens for input_ids if past is not None + if past_key_values is not None: + past_length = past_key_values[0][0].shape[2] + + # Some generation methods already pass only the last input ID + if input_ids.shape[1] > past_length: + remove_prefix_length = past_length + else: + # Default to old behavior: keep only final ID + remove_prefix_length = input_ids.shape[1] - 1 + + input_ids = input_ids[:, remove_prefix_length:] # the cache may be in the stardard format (e.g. in contrastive search), convert to bloom's format if needed if past_key_values[0][0].shape[0] == input_ids.shape[0]: @@ -922,6 +872,8 @@ def forward( loss = None if labels is not None: + # move labels to correct device to enable model parallelism + labels = labels.to(lm_logits.device) # Shift so that tokens < n predict n shift_logits = lm_logits[..., :-1, :].contiguous() shift_labels = labels[..., 1:].contiguous() @@ -986,8 +938,6 @@ def _reorder_cache( BLOOM_START_DOCSTRING, ) class BloomForSequenceClassification(BloomPreTrainedModel): - _keys_to_ignore_on_load_missing = [r"h.*.self_attention.scale_mask_softmax.causal_mask", r"lm_head.weight"] - def __init__(self, config: BloomConfig): super().__init__(config) self.num_labels = config.num_labels @@ -1061,7 +1011,10 @@ def forward( sequence_lengths = -1 else: if input_ids is not None: - sequence_lengths = (torch.ne(input_ids, self.config.pad_token_id).sum(-1) - 1).to(logits.device) + # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility + sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1 + sequence_lengths = sequence_lengths % input_ids.shape[-1] + sequence_lengths = sequence_lengths.to(logits.device) else: sequence_lengths = -1 logger.warning( @@ -1114,8 +1067,6 @@ def forward( BLOOM_START_DOCSTRING, ) class BloomForTokenClassification(BloomPreTrainedModel): - _keys_to_ignore_on_load_missing = [r"h.*.self_attention.scale_mask_softmax.causal_mask", r"lm_head.weight"] - def __init__(self, config: BloomConfig): super().__init__(config) self.num_labels = config.num_labels @@ -1189,6 +1140,8 @@ def forward( loss = None if labels is not None: + # move labels to correct device to enable model parallelism + labels = labels.to(logits.device) batch_size, seq_length = labels.shape loss_fct = CrossEntropyLoss() loss = loss_fct( @@ -1215,8 +1168,6 @@ def forward( BLOOM_START_DOCSTRING, ) class BloomForQuestionAnswering(BloomPreTrainedModel): - 
_keys_to_ignore_on_load_missing = [r"h.*.self_attention.scale_mask_softmax.causal_mask", r"lm_head.weight"] - def __init__(self, config): super().__init__(config) self.transformer = BloomModel(config) diff --git a/src/transformers/models/bloom/modeling_flax_bloom.py b/src/transformers/models/bloom/modeling_flax_bloom.py new file mode 100644 index 00000000000000..187230f35ab9e4 --- /dev/null +++ b/src/transformers/models/bloom/modeling_flax_bloom.py @@ -0,0 +1,734 @@ +# coding=utf-8 +# Copyright 2023 HuggingFace Inc. Team and Bigscience Workshop. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Flax BLOOM model.""" + +import math +from functools import partial +from typing import Optional, Tuple + +import flax.linen as nn +import jax +import jax.numpy as jnp +from flax.core.frozen_dict import FrozenDict, freeze, unfreeze +from flax.linen import combine_masks, dot_product_attention_weights, make_causal_mask +from flax.linen.activation import tanh +from flax.traverse_util import flatten_dict, unflatten_dict +from jax import lax + +from ...modeling_flax_outputs import ( + FlaxBaseModelOutput, + FlaxBaseModelOutputWithPastAndCrossAttentions, + FlaxCausalLMOutput, +) +from ...modeling_flax_utils import FlaxPreTrainedModel, append_call_sample_docstring +from ...utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging +from .configuration_bloom import BloomConfig + + +logger = logging.get_logger(__name__) + +_CHECKPOINT_FOR_DOC = "bigscience/bloom" +_CONFIG_FOR_DOC = "BloomConfig" + + +BLOOM_START_DOCSTRING = r""" + + This model inherits from [`FlaxPreTrainedModel`]. Check the superclass documentation for the generic methods the + library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads + etc.) + + This model is also a Flax Linen + [flax.nn.Module](https://flax.readthedocs.io/en/latest/_autosummary/flax.nn.module.html) subclass. Use it as a + regular Flax Module and refer to the Flax documentation for all matter related to general usage and behavior. + + Finally, this model supports inherent JAX features such as: + + - [Just-In-Time (JIT) compilation](https://jax.readthedocs.io/en/latest/jax.html#just-in-time-compilation-jit) + - [Automatic Differentiation](https://jax.readthedocs.io/en/latest/jax.html#automatic-differentiation) + - [Vectorization](https://jax.readthedocs.io/en/latest/jax.html#vectorization-vmap) + - [Parallelization](https://jax.readthedocs.io/en/latest/jax.html#parallelization-pmap) + + Parameters: + config ([`BloomConfig`]): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the + configuration. Check out the [`~FlaxPreTrainedModel.from_pretrained`] method to load the model weights. + dtype (`jax.numpy.dtype`, *optional*, defaults to `jax.numpy.float32`): + The data type of the computation. 
Can be one of `jax.numpy.float32`, `jax.numpy.float16` (on GPUs) and + `jax.numpy.bfloat16` (on TPUs). + + This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If + specified all the computation will be performed with the given `dtype`. + + **Note that this only specifies the dtype of the computation and does not influence the dtype of model + parameters.** + + If you wish to change the dtype of the model parameters, see [`~FlaxPreTrainedModel.to_fp16`] and + [`~FlaxPreTrainedModel.to_bf16`]. +""" + +BLOOM_INPUTS_DOCSTRING = r""" + Args: + input_ids (`numpy.ndarray` of shape `(batch_size, input_ids_length)`): + `input_ids_length` = `sequence_length`. Indices of input sequence tokens in the vocabulary. + + Indices can be obtained using [`BloomTokenizer`]. See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. + + [What are input IDs?](../glossary#input-ids) + attention_mask (`numpy.ndarray` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + past_key_values (`Dict[str, np.ndarray]`, *optional*, returned by `init_cache` or when passing previous `past_key_values`): + Dictionary of pre-computed hidden-states (key and values in the attention blocks) that can be used for fast + auto-regressive decoding. Pre-computed key and value hidden-states are of shape *[batch_size, max_length]*. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + + +def build_alibi_tensor(attention_mask: jnp.ndarray, num_heads: int, dtype: Optional[jnp.dtype] = jnp.float32): + """ + Flax implementation of the BLOOM Alibi tensor. BLOOM Alibi tensor is not causal as the original paper mentions, it + relies on a translation invariance of softmax for quick implementation: with l being a tensor, and a fixed value + `softmax(l+a) = softmax(l)`. Based on + https://github.com/ofirpress/attention_with_linear_biases/blob/a35aaca144e0eb6b789dfcb46784c4b8e31b7983/fairseq/models/transformer.py#L742 + Link to paper: https://arxiv.org/abs/2108.12409 + + Args: + attention_mask (`jnp.ndarray`): + Token-wise attention mask, this should be of shape `(batch_size, max_seq_len)`. + num_heads (`int`): + Number of attention heads. + dtype (`jnp.dtype`, *optional*, defaults to `jnp.float32`): + The data type (dtype) of the output tensor. + + Returns: Alibi tensor of shape `(batch_size * num_heads, 1, max_seq_len)`. 
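As a sanity check on the slope and offset construction described in this docstring, here is a toy NumPy re-derivation (illustration only; the Flax code that follows is the implementation actually being added). Each head gets a geometrically decaying slope, so heads differ in how strongly they penalise distant positions.

```python
import math
import numpy as np


def toy_alibi(attention_mask: np.ndarray, num_heads: int) -> np.ndarray:
    closest_power_of_2 = 2 ** math.floor(math.log2(num_heads))
    base = 2 ** (-(2 ** -(math.log2(closest_power_of_2) - 3)))
    slopes = base ** np.arange(1, 1 + closest_power_of_2)
    if closest_power_of_2 != num_heads:
        extra_base = 2 ** (-(2 ** -(math.log2(2 * closest_power_of_2) - 3)))
        num_remaining = min(closest_power_of_2, num_heads - closest_power_of_2)
        slopes = np.concatenate([slopes, extra_base ** np.arange(1, 1 + 2 * num_remaining, 2)])
    # per-position offsets, shifted so padded positions contribute nothing
    arange = ((attention_mask.cumsum(axis=-1) - 1) * attention_mask)[:, None, :]
    return (slopes[None, :, None] * arange)[:, :, None, :]  # (batch, num_heads, 1, seq_len)


bias = toy_alibi(np.array([[1, 1, 1, 1]]), num_heads=4)
print(bias.shape)  # (1, 4, 1, 4)
```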
+ """ + batch_size, seq_length = attention_mask.shape + closest_power_of_2 = 2 ** math.floor(math.log2(num_heads)) + base = jnp.array(2 ** (-(2 ** -(math.log2(closest_power_of_2) - 3))), dtype=jnp.float32) + powers = jnp.arange(1, 1 + closest_power_of_2, dtype=jnp.float32) + slopes = jax.lax.pow(base, powers) + + if closest_power_of_2 != num_heads: + extra_base = jnp.array(2 ** (-(2 ** -(math.log2(2 * closest_power_of_2) - 3))), dtype=jnp.float32) + num_remaining_heads = min(closest_power_of_2, num_heads - closest_power_of_2) + extra_powers = jnp.arange(1, 1 + 2 * num_remaining_heads, 2, dtype=jnp.float32) + slopes = jnp.cat([slopes, jax.lax.pow(extra_base, extra_powers)], axis=0) + + # Note: the Alibi tensor will added to the attention bias that will be applied to the query, key product of attention + # therefore, Alibi will have to be of shape (batch_size, num_heads, query_length, key_length) + # => here we set (batch_size=1, num_heads=num_heads, query_length=1, key_length=max_length) + # so that the query_length dimension will then be broadcast correctly. + # This is more or less identical to T5's relative position bias: + # https://github.com/huggingface/transformers/blob/f681437203baa7671de3174b0fa583c349d9d5e1/src/transformers/models/t5/modeling_t5.py#L527 + arange_tensor = ((attention_mask.cumsum(axis=-1) - 1) * attention_mask)[:, None, :] + alibi = slopes[..., None] * arange_tensor + alibi = jnp.expand_dims(alibi, axis=2) + return jnp.asarray(alibi, dtype) + + +class FlaxBloomAttention(nn.Module): + config: BloomConfig + dtype: jnp.dtype = jnp.float32 + + def setup(self): + self.hidden_size = self.config.hidden_size + self.num_heads = self.config.n_head + self.head_dim = self.hidden_size // self.num_heads + self.attention_softmax_in_fp32 = self.dtype is not jnp.float32 + + if self.head_dim * self.num_heads != self.hidden_size: + raise ValueError( + f"`hidden_size` must be divisible by `num_heads` (got `hidden_size`: {self.hidden_size} and " + f"`num_heads`: {self.num_heads})." + ) + + dense = partial( + nn.Dense, + dtype=self.dtype, + kernel_init=jax.nn.initializers.normal(self.config.initializer_range), + ) + + self.query_key_value = dense(self.hidden_size * 3) + self.dense = dense(self.hidden_size) + self.resid_dropout = nn.Dropout(rate=self.config.hidden_dropout) + + def _split_heads(self, hidden_states): + return hidden_states.reshape(hidden_states.shape[:-1] + (self.num_heads, self.head_dim * 3)) + + def _merge_heads(self, hidden_states): + return hidden_states.reshape(hidden_states.shape[:2] + (self.hidden_size,)) + + @nn.compact + # Copied from transformers.models.gptj.modeling_flax_gptj.FlaxGPTJAttention._concatenate_to_cache + def _concatenate_to_cache(self, key, value, query, attention_mask): + """ + This function takes projected key, value states from a single input token and concatenates the states to cached + states from previous steps. This function is slighly adapted from the official Flax repository: + https://github.com/google/flax/blob/491ce18759622506588784b4fca0e4bf05f8c8cd/flax/linen/attention.py#L252 + """ + # detect if we're initializing by absence of existing cache data. 
+ is_initialized = self.has_variable("cache", "cached_key") + cached_key = self.variable("cache", "cached_key", jnp.zeros, key.shape, key.dtype) + cached_value = self.variable("cache", "cached_value", jnp.zeros, value.shape, value.dtype) + cache_index = self.variable("cache", "cache_index", lambda: jnp.array(0, dtype=jnp.int32)) + + if is_initialized: + *batch_dims, max_length, num_heads, depth_per_head = cached_key.value.shape + # update key, value caches with our new 1d spatial slices + cur_index = cache_index.value + indices = (0,) * len(batch_dims) + (cur_index, 0, 0) + key = lax.dynamic_update_slice(cached_key.value, key, indices) + value = lax.dynamic_update_slice(cached_value.value, value, indices) + cached_key.value = key + cached_value.value = value + num_updated_cache_vectors = query.shape[1] + cache_index.value = cache_index.value + num_updated_cache_vectors + # causal mask for cached decoder self-attention: our single query position should only attend to those key + # positions that have already been generated and cached, not the remaining zero elements. + pad_mask = jnp.broadcast_to( + jnp.arange(max_length) < cur_index + num_updated_cache_vectors, + tuple(batch_dims) + (1, num_updated_cache_vectors, max_length), + ) + attention_mask = combine_masks(pad_mask, attention_mask) + return key, value, attention_mask + + def __call__( + self, + hidden_states, + residual, + alibi, + attention_mask=None, + deterministic: bool = True, + init_cache: bool = False, + output_attentions: bool = False, + ): + batch_size, seq_length = hidden_states.shape[:2] + + # proj q, k, v + fused_qkv = self.query_key_value(hidden_states) + fused_qkv = self._split_heads(fused_qkv) + query, key, value = jnp.split(fused_qkv, 3, axis=-1) + + causal_attention_mask = make_causal_mask(attention_mask, dtype="bool") + + # for fast decoding causal attention mask should be shifted + causal_attention_mask_shift = ( + self.variables["cache"]["cache_index"] if self.has_variable("cache", "cached_key") else 0 + ) + + # fast decoding for generate requires special attention_mask + if self.has_variable("cache", "cached_key"): + max_decoder_length = self.variables["cache"]["cached_key"].shape[1] + causal_attention_mask = jax.lax.dynamic_slice( + causal_attention_mask, + (0, 0, causal_attention_mask_shift, 0), + (1, 1, seq_length, max_decoder_length), + ) + + # broadcast causal attention mask & attention mask to fit for merge + causal_attention_mask = jnp.broadcast_to( + causal_attention_mask, (batch_size,) + causal_attention_mask.shape[1:] + ) + attention_mask = jnp.broadcast_to(jnp.expand_dims(attention_mask, axis=(-3, -2)), causal_attention_mask.shape) + attention_mask = combine_masks(attention_mask, causal_attention_mask) + + dropout_rng = None + if not deterministic and self.config.attention_dropout > 0.0: + dropout_rng = self.make_rng("dropout") + + # During fast autoregressive decoding, we feed one position at a time, + # and cache the keys and values step by step. 
+ if self.has_variable("cache", "cached_key") or init_cache: + key, value, attention_mask = self._concatenate_to_cache(key, value, query, attention_mask) + + # transform boolean mask into float mask + mask_value = jnp.finfo(self.dtype).min + attention_bias = lax.select( + attention_mask > 0, + jnp.full(attention_mask.shape, 0.0).astype(self.dtype), + jnp.full(attention_mask.shape, mask_value).astype(self.dtype), + ) + + attention_bias = attention_bias + alibi + + # Cast in fp32 if the original dtype is different from fp32 + attention_dtype = jnp.float32 if self.attention_softmax_in_fp32 else self.dtype + + attn_weights = dot_product_attention_weights( + query, + key, + bias=attention_bias, + dropout_rng=dropout_rng, + dropout_rate=self.config.attention_dropout, + deterministic=deterministic, + dtype=attention_dtype, + ) + + # Cast back in the original dtype if the native dtype is not fp32 + if self.attention_softmax_in_fp32: + attn_weights = attn_weights.astype(self.dtype) + + attn_output = jnp.einsum("...hqk,...khd->...qhd", attn_weights, value) + attn_output = self._merge_heads(attn_output) + attn_output = self.dense(attn_output) + attn_output = self.resid_dropout(attn_output, deterministic=deterministic) + + attn_output = attn_output + residual + + outputs = (attn_output, attn_weights) if output_attentions else (attn_output,) + return outputs + + +class BloomGELU(nn.Module): + def setup(self): + self.dtype = jnp.float32 + + def __call__(self, x): + return x * 0.5 * (1.0 + tanh(0.79788456 * x * (1 + 0.044715 * x * x))) + + +class FlaxBloomMLP(nn.Module): + config: BloomConfig + dtype: jnp.dtype = jnp.float32 + + def setup(self): + hidden_size = self.config.hidden_size + + kernel_init = jax.nn.initializers.normal(self.config.initializer_range) + + self.dense_h_to_4h = nn.Dense(4 * hidden_size, dtype=self.dtype, kernel_init=kernel_init) + self.dense_4h_to_h = nn.Dense(hidden_size, dtype=self.dtype, kernel_init=kernel_init) + self.hidden_dropout = nn.Dropout(self.config.hidden_dropout) + self.act = BloomGELU() + + def __call__(self, hidden_states, residual, deterministic: bool = True): + hidden_states = self.dense_h_to_4h(hidden_states) + hidden_states = self.act(hidden_states) + + intermediate_output = self.dense_4h_to_h(hidden_states) + + intermediate_output = intermediate_output + residual + hidden_states = self.hidden_dropout(intermediate_output, deterministic=deterministic) + + return hidden_states + + +class FlaxBloomBlock(nn.Module): + config: BloomConfig + dtype: jnp.dtype = jnp.float32 + + def setup(self): + self.input_layernorm = nn.LayerNorm(epsilon=self.config.layer_norm_epsilon, dtype=self.dtype) + + self.self_attention = FlaxBloomAttention(self.config, dtype=self.dtype) + self.post_attention_layernorm = nn.LayerNorm(epsilon=self.config.layer_norm_epsilon, dtype=self.dtype) + + self.mlp = FlaxBloomMLP(self.config, dtype=self.dtype) + + self.apply_residual_connection_post_layernorm = self.config.apply_residual_connection_post_layernorm + self.hidden_dropout = self.config.hidden_dropout + + def __call__( + self, + hidden_states, + alibi, + attention_mask=None, + deterministic: bool = True, + init_cache: bool = False, + output_attentions: bool = False, + ): + layernorm_output = self.input_layernorm(hidden_states) + + # layer norm before saving residual if config calls for it + if self.apply_residual_connection_post_layernorm: + residual = layernorm_output + else: + residual = hidden_states + + # self-attention + attn_outputs = self.self_attention( + layernorm_output, + 
residual=residual, + alibi=alibi, + attention_mask=attention_mask, + deterministic=deterministic, + init_cache=init_cache, + output_attentions=output_attentions, + ) + + attention_output = attn_outputs[0] + + outputs = attn_outputs[1:] + + post_layernorm = self.post_attention_layernorm(attention_output) + + # set residual based on config + if self.apply_residual_connection_post_layernorm: + residual = post_layernorm + else: + residual = attention_output + + output = self.mlp(post_layernorm, residual, deterministic=deterministic) + + outputs = (output,) + outputs + + return outputs + + +class FlaxBloomPreTrainedModel(FlaxPreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. + """ + + config_class = BloomConfig + base_model_prefix = "transformer" + module_class: nn.Module = None + + def __init__( + self, + config: BloomConfig, + input_shape: Tuple = (1, 1), + seed: int = 0, + dtype: jnp.dtype = jnp.float32, + _do_init: bool = True, + **kwargs, + ): + module = self.module_class(config=config, dtype=dtype, **kwargs) + super().__init__(config, module, input_shape=input_shape, seed=seed, dtype=dtype, _do_init=_do_init) + + def init_weights(self, rng: jax.random.PRNGKey, input_shape: Tuple, params: FrozenDict = None) -> FrozenDict: + # init input tensors + input_ids = jnp.zeros(input_shape, dtype="i4") + attention_mask = jnp.ones_like(input_ids) + params_rng, dropout_rng = jax.random.split(rng) + rngs = {"params": params_rng, "dropout": dropout_rng} + + random_params = self.module.init(rngs, input_ids, attention_mask, return_dict=False)["params"] + + if params is not None: + random_params = flatten_dict(unfreeze(random_params)) + params = flatten_dict(unfreeze(params)) + for missing_key in self._missing_keys: + params[missing_key] = random_params[missing_key] + self._missing_keys = set() + return freeze(unflatten_dict(params)) + else: + return random_params + + def init_cache(self, batch_size, max_length): + r""" + Args: + batch_size (`int`): + batch_size used for fast auto-regressive decoding. Defines the batch size of the initialized cache. + max_length (`int`): + maximum possible length for auto-regressive decoding. Defines the sequence length of the initialized + cache. 
+ """ + # init input variables to retrieve cache + input_ids = jnp.ones((batch_size, max_length), dtype="i4") + attention_mask = jnp.ones_like(input_ids) + + init_variables = self.module.init( + jax.random.PRNGKey(0), input_ids, attention_mask, return_dict=False, init_cache=True + ) + return unfreeze(init_variables["cache"]) + + @add_start_docstrings_to_model_forward(BLOOM_INPUTS_DOCSTRING) + def __call__( + self, + input_ids, + attention_mask=None, + past_key_values: dict = None, + params: dict = None, + dropout_rng: jax.random.PRNGKey = None, + train: bool = False, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ): + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + batch_size, sequence_length = input_ids.shape + + if attention_mask is None: + attention_mask = jnp.ones((batch_size, sequence_length)) + + # Handle any PRNG if needed + rngs = {} + if dropout_rng is not None: + rngs["dropout"] = dropout_rng + + inputs = {"params": params or self.params} + + # If past_key_values are passed then cache is already initialized a private flag init_cache has to be passed + # down to ensure cache is used. It has to be made sure that cache is marked as mutable so that it can be + # changed by FlaxBloomAttention module + if past_key_values: + inputs["cache"] = past_key_values + mutable = ["cache"] + else: + mutable = False + + outputs = self.module.apply( + inputs, + jnp.array(input_ids, dtype="i4"), + jnp.array(attention_mask, dtype="i4"), + not train, + False, + output_attentions, + output_hidden_states, + return_dict, + rngs=rngs, + mutable=mutable, + ) + + # add updated cache to model output + if past_key_values is not None and return_dict: + outputs, past_key_values = outputs + outputs["past_key_values"] = unfreeze(past_key_values["cache"]) + return outputs + elif past_key_values is not None and not return_dict: + outputs, past_key_values = outputs + outputs = outputs[:1] + (unfreeze(past_key_values["cache"]),) + outputs[1:] + + return outputs + + +class FlaxBloomBlockCollection(nn.Module): + config: BloomConfig + dtype: jnp.dtype = jnp.float32 + + def setup(self): + self.layers = [ + FlaxBloomBlock(self.config, name=str(layer_number), dtype=self.dtype) + for layer_number in range(self.config.num_hidden_layers) + ] + + def __call__( + self, + hidden_states, + alibi, + attention_mask=None, + deterministic: bool = True, + init_cache: bool = False, + output_attentions: bool = False, + output_hidden_states: bool = False, + ): + all_attentions = () if output_attentions else None + all_hidden_states = () if output_hidden_states else None + + for layer_number in range(self.config.num_hidden_layers): + if output_hidden_states: + all_hidden_states += (hidden_states,) + + layer_outputs = self.layers[layer_number]( + hidden_states, + alibi=alibi, + attention_mask=attention_mask, + deterministic=deterministic, + init_cache=init_cache, + output_attentions=output_attentions, + ) + hidden_states = layer_outputs[0] + + if output_attentions: + all_attentions += (layer_outputs[1],) + + # this contains possible `None` values - `FlaxBloomModule` will filter them out + outputs = (hidden_states, all_hidden_states, all_attentions) + + return outputs + + +class 
FlaxBloomModule(nn.Module): + config: BloomConfig + dtype: jnp.dtype = jnp.float32 + + def setup(self): + self.embed_dim = self.config.hidden_size + + # word embeddings (no positional embedding layer) + self.word_embeddings = nn.Embed( + self.config.vocab_size, + self.embed_dim, + embedding_init=jax.nn.initializers.normal(stddev=self.config.initializer_range), + dtype=self.dtype, + ) + + # post-embedding layernorm + self.word_embeddings_layernorm = nn.LayerNorm(epsilon=self.config.layer_norm_epsilon, dtype=self.dtype) + + # transformer layers + self.h = FlaxBloomBlockCollection(self.config, dtype=self.dtype) + + # final layernorm + self.ln_f = nn.LayerNorm(epsilon=self.config.layer_norm_epsilon, dtype=self.dtype) + + def __call__( + self, + input_ids=None, + attention_mask=None, + deterministic=True, + init_cache: bool = False, + output_attentions: bool = False, + output_hidden_states: bool = False, + return_dict: bool = True, + ): + inputs_embeds = self.word_embeddings(input_ids) + # do post-embedding layernorm + hidden_states = self.word_embeddings_layernorm(inputs_embeds) + + # build alibi depending on `attention_mask` + alibi = build_alibi_tensor(attention_mask, self.config.n_head, dtype=hidden_states.dtype) + + outputs = self.h( + hidden_states, + alibi=alibi, + attention_mask=attention_mask, + deterministic=deterministic, + init_cache=init_cache, + output_hidden_states=output_hidden_states, + output_attentions=output_attentions, + ) + + hidden_states = outputs[0] + hidden_states = self.ln_f(hidden_states) + + if output_hidden_states: + all_hidden_states = outputs[1] + (hidden_states,) + outputs = (hidden_states, all_hidden_states) + outputs[2:] + else: + outputs = (hidden_states,) + outputs[1:] + + if not return_dict: + return tuple(v for v in [outputs[0], outputs[-1]] if v is not None) + + return FlaxBaseModelOutputWithPastAndCrossAttentions( + last_hidden_state=hidden_states, + hidden_states=outputs[1], + attentions=outputs[-1], + ) + + +@add_start_docstrings( + "The bare Bloom Model transformer outputting raw hidden-states without any specific head on top.", + BLOOM_START_DOCSTRING, +) +# Copied from transformers.models.gpt_neo.modeling_flax_gpt_neo.FlaxGPTNeoModel with GPTNeo->Bloom +class FlaxBloomModel(FlaxBloomPreTrainedModel): + module_class = FlaxBloomModule + + +append_call_sample_docstring(FlaxBloomModel, _CHECKPOINT_FOR_DOC, FlaxBaseModelOutput, _CONFIG_FOR_DOC) + + +class FlaxBloomForCausalLMModule(nn.Module): + config: BloomConfig + dtype: jnp.dtype = jnp.float32 + + def setup(self): + self.transformer = FlaxBloomModule(self.config, dtype=self.dtype) + self.lm_head = nn.Dense( + self.config.vocab_size, + use_bias=False, + dtype=self.dtype, + kernel_init=jax.nn.initializers.normal(stddev=self.config.initializer_range), + ) + + def __call__( + self, + input_ids, + attention_mask, + deterministic: bool = True, + init_cache: bool = False, + output_attentions: bool = False, + output_hidden_states: bool = False, + return_dict: bool = True, + ): + outputs = self.transformer( + input_ids, + attention_mask=attention_mask, + deterministic=deterministic, + init_cache=init_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + hidden_states = outputs[0] + + if self.config.tie_word_embeddings: + shared_kernel = self.transformer.variables["params"]["word_embeddings"]["embedding"].T + lm_logits = self.lm_head.apply({"params": {"kernel": shared_kernel}}, hidden_states) + else: + lm_logits = 
self.lm_head(hidden_states) + + if not return_dict: + return (lm_logits,) + outputs[1:] + + return FlaxCausalLMOutput(logits=lm_logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions) + + +@add_start_docstrings( + """ + The Bloom Model transformer with a language modeling head on top (linear layer with weights tied to the input + embeddings). + """, + BLOOM_START_DOCSTRING, +) +class FlaxBloomForCausalLM(FlaxBloomPreTrainedModel): + module_class = FlaxBloomForCausalLMModule + + def prepare_inputs_for_generation(self, input_ids, max_length, attention_mask: Optional[jax.Array] = None): + # initializing the cache + batch_size, seq_length = input_ids.shape + + past_key_values = self.init_cache(batch_size, max_length) + # Note that usually one would have to put 0's in the attention_mask for + # x > input_ids.shape[-1] and x < cache_length. But since Bloom uses a causal mask, + # those positions are masked anyway. Thus, we can create a single static attention_mask here, + # which is more efficient for compilation + extended_attention_mask = jnp.ones((batch_size, max_length), dtype="i4") + if attention_mask is not None: + extended_attention_mask = lax.dynamic_update_slice(extended_attention_mask, attention_mask, (0, 0)) + + return { + "past_key_values": past_key_values, + "attention_mask": extended_attention_mask, + } + + def update_inputs_for_generation(self, model_outputs, model_kwargs): + model_kwargs["past_key_values"] = model_outputs.past_key_values + return model_kwargs + + +append_call_sample_docstring(FlaxBloomForCausalLM, _CHECKPOINT_FOR_DOC, FlaxCausalLMOutput, _CONFIG_FOR_DOC) diff --git a/src/transformers/models/bloom/tokenization_bloom_fast.py b/src/transformers/models/bloom/tokenization_bloom_fast.py index 1c8efb10cb6c46..c0189e08b3d149 100644 --- a/src/transformers/models/bloom/tokenization_bloom_fast.py +++ b/src/transformers/models/bloom/tokenization_bloom_fast.py @@ -15,20 +15,14 @@ """Tokenization classes for Bloom.""" -import json -from typing import TYPE_CHECKING, List, Optional, Tuple - -from tokenizers import pre_tokenizers +import pickle +from typing import Optional, Tuple from ...tokenization_utils_base import BatchEncoding from ...tokenization_utils_fast import PreTrainedTokenizerFast from ...utils import logging -if TYPE_CHECKING: - from transformers.pipelines.conversational import Conversation - - logger = logging.get_logger(__name__) VOCAB_FILES_NAMES = {"tokenizer_file": "tokenizer.json"} @@ -54,13 +48,15 @@ class BloomTokenizerFast(PreTrainedTokenizerFast): This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will be encoded differently whether it is at the beginning of the sentence (without space) or not: - ``` + ```python >>> from transformers import BloomTokenizerFast + >>> tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom") - >>> tokenizer("Hello world")['input_ids'] - [15496, 995] - >>> tokenizer(" Hello world")['input_ids'] - [18435, 995] + >>> tokenizer("Hello world")["input_ids"] + [59414, 8876] + + >>> tokenizer(" Hello world")["input_ids"] + [86153, 8876] ``` You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer, but since @@ -113,6 +109,7 @@ def __init__( eos_token="", pad_token="", add_prefix_space=False, + clean_up_tokenization_spaces=False, **kwargs, ): super().__init__( @@ -124,13 +121,19 @@ def __init__( eos_token=eos_token, pad_token=pad_token, add_prefix_space=add_prefix_space, + 
clean_up_tokenization_spaces=clean_up_tokenization_spaces, **kwargs, ) - pre_tok_state = json.loads(self.backend_tokenizer.pre_tokenizer.__getstate__()) - if pre_tok_state.get("add_prefix_space", add_prefix_space) != add_prefix_space: - pre_tok_class = getattr(pre_tokenizers, pre_tok_state.pop("type")) - pre_tok_state["add_prefix_space"] = add_prefix_space - self.backend_tokenizer.pre_tokenizer = pre_tok_class(**pre_tok_state) + # TODO @ArthurZucker this can only work one way for now, to update later-on. Tests should also properly + # check this as they were green before. + pre_tok_state = pickle.dumps(self.backend_tokenizer.pre_tokenizer) + decoder_state = pickle.dumps(self.backend_tokenizer.decoder) + + if add_prefix_space: + pre_tok_state = pre_tok_state.replace(b'"add_prefix_space":false', b'"add_prefix_space": true') + decoder_state = decoder_state.replace(b'"add_prefix_space":false', b'"add_prefix_space": true') + self.backend_tokenizer.pre_tokenizer = pickle.loads(pre_tok_state) + self.backend_tokenizer.decoder = pickle.loads(decoder_state) self.add_prefix_space = add_prefix_space @@ -159,12 +162,16 @@ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = files = self._tokenizer.model.save(save_directory, name=filename_prefix) return tuple(files) - def _build_conversation_input_ids(self, conversation: "Conversation") -> List[int]: - """This corresponds to DialoGPT variants of models.""" - input_ids = [] - for is_user, text in conversation.iter_texts(): - input_ids.extend(self.encode(text, add_special_tokens=False) + [self.eos_token_id]) - - if len(input_ids) > self.model_max_length: - input_ids = input_ids[-self.model_max_length :] - return input_ids + @property + # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.default_chat_template + def default_chat_template(self): + """ + A simple chat template that ignores role information and just concatenates messages with EOS tokens. + """ + logger.warning_once( + "\nNo chat template is defined for this tokenizer - using the default template " + f"for the {self.__class__.__name__} class. If the default is not appropriate for " + "your model, please set `tokenizer.chat_template` to an appropriate template. " + "See https://huggingface.co/docs/transformers/main/chat_templating for more information.\n" + ) + return "{% for message in messages %}" "{{ message.content }}{{ eos_token }}" "{% endfor %}" diff --git a/src/transformers/models/bridgetower/__init__.py b/src/transformers/models/bridgetower/__init__.py index 7058fffa529e80..cbd5bd4a366aed 100644 --- a/src/transformers/models/bridgetower/__init__.py +++ b/src/transformers/models/bridgetower/__init__.py @@ -42,6 +42,7 @@ else: _import_structure["modeling_bridgetower"] = [ "BRIDGETOWER_PRETRAINED_MODEL_ARCHIVE_LIST", + "BridgeTowerForContrastiveLearning", "BridgeTowerForImageAndTextRetrieval", "BridgeTowerForMaskedLM", "BridgeTowerModel", @@ -74,6 +75,7 @@ else: from .modeling_bridgetower import ( BRIDGETOWER_PRETRAINED_MODEL_ARCHIVE_LIST, + BridgeTowerForContrastiveLearning, BridgeTowerForImageAndTextRetrieval, BridgeTowerForMaskedLM, BridgeTowerModel, diff --git a/src/transformers/models/bridgetower/configuration_bridgetower.py b/src/transformers/models/bridgetower/configuration_bridgetower.py index c04b01a2ea713c..c12c1600e9b449 100644 --- a/src/transformers/models/bridgetower/configuration_bridgetower.py +++ b/src/transformers/models/bridgetower/configuration_bridgetower.py @@ -14,7 +14,6 @@ # limitations under the License. 
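A minimal sketch, assuming an `eos_token` of `</s>` and two toy messages, of what the `default_chat_template` added for `BloomTokenizerFast` above renders to when evaluated with plain `jinja2`: roles are ignored and message contents are simply joined with the EOS token.

```python
from jinja2 import Template

# assumed values for illustration; the real eos_token comes from the tokenizer
eos_token = "</s>"
messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there"},
]

# same template string returned by default_chat_template above
template = Template("{% for message in messages %}{{ message.content }}{{ eos_token }}{% endfor %}")
print(template.render(messages=messages, eos_token=eos_token))
# Hello</s>Hi there</s>
```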
""" BridgeTower model configuration""" -import copy import os from typing import Union @@ -50,7 +49,7 @@ class BridgeTowerVisionConfig(PretrainedConfig): The size (resolution) of each patch. image_size (`int`, *optional*, defaults to 288): The size (resolution) of each image. - initializer_factor (`float``, *optional*, defaults to 1): + initializer_factor (`float`, *optional*, defaults to 1): A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing). layer_norm_eps (`float`, *optional*, defaults to 1e-05): @@ -74,6 +73,7 @@ class BridgeTowerVisionConfig(PretrainedConfig): >>> # Accessing the configuration >>> configuration ```""" + model_type = "bridgetower_vision_model" def __init__( @@ -152,11 +152,9 @@ class BridgeTowerTextConfig(PretrainedConfig): just in case (e.g., 512 or 1024 or 2048). type_vocab_size (`int`, *optional*, defaults to 2): The vocabulary size of the `token_type_ids`. - initializer_factor (`float``, *optional*, defaults to 1): + initializer_factor (`float`, *optional*, defaults to 1): A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing). - initializer_range (`float`, *optional*, defaults to 0.02): - The standard deviation of the truncated_normal_initializer for initializing all weight matrices. layer_norm_eps (`float`, *optional*, defaults to 1e-05): The epsilon used by the layer normalization layers. position_embedding_type (`str`, *optional*, defaults to `"absolute"`): @@ -170,8 +168,6 @@ class BridgeTowerTextConfig(PretrainedConfig): use_cache (`bool`, *optional*, defaults to `True`): Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if `config.is_decoder=True`. - classifier_dropout (`float`, *optional*): - The dropout ratio for the classification head. Example: @@ -184,6 +180,7 @@ class BridgeTowerTextConfig(PretrainedConfig): >>> # Accessing the configuration >>> configuration ```""" + model_type = "bridgetower_text_model" def __init__( @@ -199,14 +196,12 @@ def __init__( attention_probs_dropout_prob=0.1, max_position_embeddings=514, type_vocab_size=1, - initializer_range=0.02, layer_norm_eps=1e-05, pad_token_id=1, bos_token_id=0, eos_token_id=2, position_embedding_type="absolute", use_cache=True, - classifier_dropout=None, **kwargs, ): super().__init__(**kwargs) @@ -222,11 +217,9 @@ def __init__( self.attention_probs_dropout_prob = attention_probs_dropout_prob self.max_position_embeddings = max_position_embeddings self.type_vocab_size = type_vocab_size - self.initializer_range = initializer_range self.layer_norm_eps = layer_norm_eps self.position_embedding_type = position_embedding_type self.use_cache = use_cache - self.classifier_dropout = classifier_dropout self.pad_token_id = pad_token_id self.bos_token_id = bos_token_id self.eos_token_id = eos_token_id @@ -264,7 +257,7 @@ class BridgeTowerConfig(PretrainedConfig): The non-linear activation function (function or string) in the encoder and pooler. hidden_size (`int`, *optional*, defaults to 768): Dimensionality of the encoder layers and the pooler layer. - initializer_factor (`float``, *optional*, defaults to 1): + initializer_factor (`float`, *optional*, defaults to 1): A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing). 
layer_norm_eps (`float`, *optional*, defaults to 1e-05): @@ -300,6 +293,7 @@ class BridgeTowerConfig(PretrainedConfig): >>> # Accessing the model configuration >>> configuration = model.config ```""" + model_type = "bridgetower" def __init__( @@ -319,6 +313,10 @@ def __init__( vision_config=None, **kwargs, ): + # TODO: remove this once the Hub files are updated. + _ = kwargs.pop("text_config_dict", None) + _ = kwargs.pop("vision_config_dict", None) + super().__init__(**kwargs) self.share_cross_modal_transformer_layers = share_cross_modal_transformer_layers self.hidden_act = hidden_act @@ -332,20 +330,13 @@ def __init__( self.tie_word_embeddings = tie_word_embeddings self.init_layernorm_from_vision_encoder = init_layernorm_from_vision_encoder - text_config_dict = kwargs.pop("text_config_dict", None) - vision_config_dict = kwargs.pop("vision_config_dict", None) - if text_config_dict is not None: - text_config = text_config_dict - if vision_config_dict is not None: - vision_config = vision_config_dict - if text_config is None: text_config = {} - logger.info("text_config is None. Initializing the BridgeTowerTextConfig with default values.") + logger.info("`text_config` is `None`. Initializing the `BridgeTowerTextConfig` with default values.") if vision_config is None: vision_config = {} - logger.info("vision_config is None. Initializing the BridgeTowerVisionConfig with default values.") + logger.info("`vision_config` is `None`. Initializing the `BridgeTowerVisionConfig` with default values.") self.text_config = BridgeTowerTextConfig(**text_config) self.vision_config = BridgeTowerVisionConfig(**vision_config) @@ -360,16 +351,3 @@ def from_text_vision_configs( """ return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), **kwargs) - - def to_dict(self): - """ - Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`]. - - Returns: - `Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance, - """ - output = copy.deepcopy(self.__dict__) - output["text_config"] = self.text_config.to_dict() - output["vision_config"] = self.vision_config.to_dict() - output["model_type"] = self.__class__.model_type - return output diff --git a/src/transformers/models/bridgetower/image_processing_bridgetower.py b/src/transformers/models/bridgetower/image_processing_bridgetower.py index 6cf988114e2fd1..3053c72a4c5bb7 100644 --- a/src/transformers/models/bridgetower/image_processing_bridgetower.py +++ b/src/transformers/models/bridgetower/image_processing_bridgetower.py @@ -14,16 +14,12 @@ # limitations under the License. 
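Before the image processor changes, a minimal sketch, assuming default values only, of how the nested BridgeTower configs above compose via the retained `from_text_vision_configs` classmethod:

```python
# hedged sketch: compose a BridgeTowerConfig from default sub-configs (nothing loaded from the Hub)
from transformers.models.bridgetower.configuration_bridgetower import (
    BridgeTowerConfig,
    BridgeTowerTextConfig,
    BridgeTowerVisionConfig,
)

text_config = BridgeTowerTextConfig()      # defaults; a real checkpoint would override these
vision_config = BridgeTowerVisionConfig()

# classmethod kept in the hunk above; it nests both sub-configs into one config
config = BridgeTowerConfig.from_text_vision_configs(text_config, vision_config)
print(config.model_type)                # bridgetower
print(config.text_config.model_type)    # bridgetower_text_model
print(config.vision_config.model_type)  # bridgetower_vision_model
```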
"""Image processor class for BridgeTower.""" -import warnings from typing import Any, Dict, Iterable, List, Optional, Tuple, Union import numpy as np -from transformers.utils import is_vision_available -from transformers.utils.generic import TensorType - from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict -from ...image_transforms import PaddingMode, center_crop, normalize, pad, rescale, resize, to_channel_dimension_format +from ...image_transforms import PaddingMode, center_crop, pad, resize, to_channel_dimension_format from ...image_utils import ( OPENAI_CLIP_MEAN, OPENAI_CLIP_STD, @@ -33,10 +29,12 @@ get_image_size, infer_channel_dimension_format, is_batched, + is_scaled_image, to_numpy_array, valid_images, + validate_preprocess_arguments, ) -from ...utils import logging +from ...utils import TensorType, is_vision_available, logging if is_vision_available(): @@ -54,7 +52,9 @@ def max_across_indices(values: Iterable[Any]) -> List[Any]: # Copied from transformers.models.vilt.image_processing_vilt.make_pixel_mask -def make_pixel_mask(image: np.ndarray, output_size: Tuple[int, int]) -> np.ndarray: +def make_pixel_mask( + image: np.ndarray, output_size: Tuple[int, int], input_data_format: Optional[Union[str, ChannelDimension]] = None +) -> np.ndarray: """ Make a pixel mask for the image, where 1 indicates a valid pixel and 0 indicates padding. @@ -64,33 +64,40 @@ def make_pixel_mask(image: np.ndarray, output_size: Tuple[int, int]) -> np.ndarr output_size (`Tuple[int, int]`): Output size of the mask. """ - input_height, input_width = get_image_size(image) + input_height, input_width = get_image_size(image, channel_dim=input_data_format) mask = np.zeros(output_size, dtype=np.int64) mask[:input_height, :input_width] = 1 return mask # Copied from transformers.models.vilt.image_processing_vilt.get_max_height_width -def get_max_height_width(images: List[np.ndarray]) -> List[int]: +def get_max_height_width( + images: List[np.ndarray], input_data_format: Optional[Union[str, ChannelDimension]] = None +) -> List[int]: """ Get the maximum height and width across all images in a batch. 
""" - input_channel_dimension = infer_channel_dimension_format(images[0]) + if input_data_format is None: + input_data_format = infer_channel_dimension_format(images[0]) - if input_channel_dimension == ChannelDimension.FIRST: + if input_data_format == ChannelDimension.FIRST: _, max_height, max_width = max_across_indices([img.shape for img in images]) - elif input_channel_dimension == ChannelDimension.LAST: + elif input_data_format == ChannelDimension.LAST: max_height, max_width, _ = max_across_indices([img.shape for img in images]) else: - raise ValueError(f"Invalid channel dimension format: {input_channel_dimension}") + raise ValueError(f"Invalid channel dimension format: {input_data_format}") return (max_height, max_width) # Copied from transformers.models.vilt.image_processing_vilt.get_resize_output_image_size def get_resize_output_image_size( - input_image: np.ndarray, shorter: int = 800, longer: int = 1333, size_divisor: int = 32 + input_image: np.ndarray, + shorter: int = 800, + longer: int = 1333, + size_divisor: int = 32, + input_data_format: Optional[Union[str, ChannelDimension]] = None, ) -> Tuple[int, int]: - input_height, input_width = get_image_size(input_image) + input_height, input_width = get_image_size(input_image, input_data_format) min_size, max_size = shorter, longer scale = min_size / min(input_height, input_width) @@ -122,14 +129,14 @@ class BridgeTowerImageProcessor(BaseImageProcessor): do_resize (`bool`, *optional*, defaults to `True`): Whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by the `do_resize` parameter in the `preprocess` method. - size (`Dict[str, int]` *optional*, defaults to `288`): + size (`Dict[str, int]` *optional*, defaults to `{'shortest_edge': 288}`): Resize the shorter side of the input to `size["shortest_edge"]`. The longer side will be limited to under `int((1333 / 800) * size["shortest_edge"])` while preserving the aspect ratio. Only has an effect if `do_resize` is set to `True`. Can be overridden by the `size` parameter in the `preprocess` method. - size_divisor (`int`, *optional*, defaults to `32`): + size_divisor (`int`, *optional*, defaults to 32): The size by which to make sure both the height and width can be divided. Only has an effect if `do_resize` is set to `True`. Can be overridden by the `size_divisor` parameter in the `preprocess` method. - resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`): + resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`): Resampling filter to use if resizing the image. Only has an effect if `do_resize` is set to `True`. Can be overridden by the `resample` parameter in the `preprocess` method. do_rescale (`bool`, *optional*, defaults to `True`): @@ -152,6 +159,9 @@ class BridgeTowerImageProcessor(BaseImageProcessor): do_center_crop (`bool`, *optional*, defaults to `True`): Whether to center crop the image. Can be overridden by the `do_center_crop` parameter in the `preprocess` method. + crop_size (`Dict[str, int]`, *optional*): + Desired output size when applying center-cropping. Only has an effect if `do_center_crop` is set to `True`. + Can be overridden by the `crop_size` parameter in the `preprocess` method. If unset defaults to `size`, do_pad (`bool`, *optional*, defaults to `True`): Whether to pad the image to the `(max_height, max_width)` of the images in the batch. Can be overridden by the `do_pad` parameter in the `preprocess` method. 
@@ -162,7 +172,7 @@ class BridgeTowerImageProcessor(BaseImageProcessor): def __init__( self, do_resize: bool = True, - size: Dict[str, int] = 288, + size: Dict[str, int] = None, size_divisor: int = 32, resample: PILImageResampling = PILImageResampling.BICUBIC, do_rescale: bool = True, @@ -171,6 +181,7 @@ def __init__( image_mean: Optional[Union[float, List[float]]] = None, image_std: Optional[Union[float, List[float]]] = None, do_center_crop: bool = True, + crop_size: Dict[str, int] = None, do_pad: bool = True, **kwargs, ) -> None: @@ -192,6 +203,7 @@ def __init__( self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD self.do_pad = do_pad self.do_center_crop = do_center_crop + self.crop_size = crop_size # Copied from transformers.models.vilt.image_processing_vilt.ViltImageProcessor.resize def resize( @@ -201,6 +213,7 @@ def resize( size_divisor: int = 32, resample: PILImageResampling = PILImageResampling.BICUBIC, data_format: Optional[Union[str, ChannelDimension]] = None, + input_data_format: Optional[Union[str, ChannelDimension]] = None, **kwargs, ) -> np.ndarray: """ @@ -221,119 +234,107 @@ def resize( Resampling filter to use when resiizing the image. data_format (`str` or `ChannelDimension`, *optional*): The channel dimension format of the image. If not provided, it will be the same as the input image. + input_data_format (`str` or `ChannelDimension`, *optional*): + The channel dimension format of the input image. If not provided, it will be inferred. """ size = get_size_dict(size, default_to_square=False) if "shortest_edge" not in size: raise ValueError(f"The `size` dictionary must contain the key `shortest_edge`. Got {size.keys()}") shorter = size["shortest_edge"] longer = int(1333 / 800 * shorter) - output_size = get_resize_output_image_size(image, shorter=shorter, longer=longer, size_divisor=size_divisor) - return resize(image, size=output_size, resample=resample, data_format=data_format, **kwargs) - - # Copied from transformers.models.vilt.image_processing_vilt.ViltImageProcessor.rescale - def rescale( - self, - image: np.ndarray, - scale: Union[int, float], - data_format: Optional[Union[str, ChannelDimension]] = None, - **kwargs, - ): - """ - Rescale an image by a scale factor. image = image * scale. - - Args: - image (`np.ndarray`): - Image to rescale. - scale (`int` or `float`): - Scale to apply to the image. - data_format (`str` or `ChannelDimension`, *optional*): - The channel dimension format of the image. If not provided, it will be the same as the input image. - """ - return rescale(image, scale=scale, data_format=data_format, **kwargs) + output_size = get_resize_output_image_size( + image, shorter=shorter, longer=longer, size_divisor=size_divisor, input_data_format=input_data_format + ) + return resize( + image, + size=output_size, + resample=resample, + data_format=data_format, + input_data_format=input_data_format, + **kwargs, + ) def center_crop( self, image: np.ndarray, size: Dict[str, int], data_format: Optional[Union[str, ChannelDimension]] = None, + input_data_format: Optional[Union[str, ChannelDimension]] = None, **kwargs, ) -> np.ndarray: """ - Center crop an image to (size["height"], size["width"]). If the input size is smaller than `size` along any - edge, the image is padded with 0's and then center cropped. + Center crop an image to `(size["height"], size["width"])`. If the input size is smaller than `crop_size` along + any edge, the image is padded with 0's and then center cropped. Args: image (`np.ndarray`): Image to center crop. 
size (`Dict[str, int]`): - Size of the output image. + Size of the output image in the form `{"height": h, "width": w}`. data_format (`str` or `ChannelDimension`, *optional*): The channel dimension format of the image. If not provided, it will be the same as the input image. + input_data_format (`ChannelDimension` or `str`, *optional*): + The channel dimension format of the input image. If not provided, it will be inferred from the input + image. """ output_size = size["shortest_edge"] - return center_crop(image, size=(output_size, output_size), data_format=data_format, **kwargs) - - # Copied from transformers.models.vilt.image_processing_vilt.ViltImageProcessor.normalize - def normalize( - self, - image: np.ndarray, - mean: Union[float, List[float]], - std: Union[float, List[float]], - data_format: Optional[Union[str, ChannelDimension]] = None, - **kwargs, - ) -> np.ndarray: - """ - Normalize an image. image = (image - image_mean) / image_std. - - Args: - image (`np.ndarray`): - Image to normalize. - mean (`float` or `List[float]`): - Image mean. - std (`float` or `List[float]`): - Image standard deviation. - data_format (`str` or `ChannelDimension`, *optional*): - The channel dimension format of the image. If not provided, it will be the same as the input image. - """ - return normalize(image, mean=mean, std=std, data_format=data_format, **kwargs) + return center_crop( + image, + size=(output_size, output_size), + data_format=data_format, + input_data_format=input_data_format, + **kwargs, + ) + # Copied from transformers.models.vilt.image_processing_vilt.ViltImageProcessor._pad_image def _pad_image( self, image: np.ndarray, output_size: Tuple[int, int], constant_values: Union[float, Iterable[float]] = 0, data_format: Optional[ChannelDimension] = None, + input_data_format: Optional[Union[str, ChannelDimension]] = None, ) -> np.ndarray: """ Pad an image with zeros to the given size. """ - input_height, input_width = get_image_size(image) + input_height, input_width = get_image_size(image, channel_dim=input_data_format) output_height, output_width = output_size pad_bottom = output_height - input_height pad_right = output_width - input_width padding = ((0, pad_bottom), (0, pad_right)) padded_image = pad( - image, padding, mode=PaddingMode.CONSTANT, constant_values=constant_values, data_format=data_format + image, + padding, + mode=PaddingMode.CONSTANT, + constant_values=constant_values, + data_format=data_format, + input_data_format=input_data_format, ) return padded_image + # Copied from transformers.models.vilt.image_processing_vilt.ViltImageProcessor.pad def pad( self, images: List[np.ndarray], + constant_values: Union[float, Iterable[float]] = 0, return_pixel_mask: bool = True, return_tensors: Optional[Union[str, TensorType]] = None, data_format: Optional[ChannelDimension] = None, + input_data_format: Optional[Union[str, ChannelDimension]] = None, ) -> BatchFeature: """ - Pads a batch of images with zeros to the size of largest height and width in the batch and optionally returns - their corresponding pixel mask. + Pads a batch of images to the bottom and right of the image with zeros to the size of largest height and width + in the batch and optionally returns their corresponding pixel mask. Args: - images (`List[np.ndarray]`): - Batch of images to pad. - return_pixel_mask (`bool`, *optional*, defaults to `False`): - Whether to return the pixel mask. + image (`np.ndarray`): + Image to pad. 
+ constant_values (`float` or `Iterable[float]`, *optional*): + The value to use for the padding if `mode` is `"constant"`. + return_pixel_mask (`bool`, *optional*, defaults to `True`): + Whether to return a pixel mask. return_tensors (`str` or `TensorType`, *optional*): The type of tensors to return. Can be one of: - Unset: Return a list of `np.ndarray`. @@ -343,54 +344,32 @@ def pad( - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`. data_format (`str` or `ChannelDimension`, *optional*): The channel dimension format of the image. If not provided, it will be the same as the input image. + input_data_format (`ChannelDimension` or `str`, *optional*): + The channel dimension format of the input image. If not provided, it will be inferred. """ - pad_size = get_max_height_width(images) + pad_size = get_max_height_width(images, input_data_format=input_data_format) + padded_images = [ - self._pad_image(image=image, output_size=pad_size, data_format=data_format) for image in images + self._pad_image( + image, + pad_size, + constant_values=constant_values, + data_format=data_format, + input_data_format=input_data_format, + ) + for image in images ] data = {"pixel_values": padded_images} + if return_pixel_mask: - masks = [make_pixel_mask(image=image, output_size=pad_size) for image in images] + masks = [ + make_pixel_mask(image=image, output_size=pad_size, input_data_format=input_data_format) + for image in images + ] data["pixel_mask"] = masks return BatchFeature(data=data, tensor_type=return_tensors) - # Copied from transformers.models.vilt.image_processing_vilt.ViltImageProcessor.pad_and_create_pixel_mask - def pad_and_create_pixel_mask( - self, - pixel_values_list: List[ImageInput], - return_tensors: Optional[Union[str, TensorType]] = None, - data_format: Optional[ChannelDimension] = None, - ) -> BatchFeature: - """ - Pads a batch of images with zeros to the size of largest height and width in the batch and returns their - corresponding pixel mask. - - Args: - images (`List[np.ndarray]`): - Batch of images to pad. - return_tensors (`str` or `TensorType`, *optional*): - The type of tensors to return. Can be one of: - - Unset: Return a list of `np.ndarray`. - - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`. - - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`. - - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`. - - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`. - data_format (`str` or `ChannelDimension`, *optional*): - The channel dimension format of the image. If not provided, it will be the same as the input image. - """ - warnings.warn( - "This method is deprecated and will be removed in v4.26.0. 
Please use pad instead.", FutureWarning - ) - # pad expects a list of np.ndarray, but the previous feature extractors expected torch tensors - images = [to_numpy_array(image) for image in pixel_values_list] - return self.pad( - images=images, - return_pixel_mask=True, - return_tensors=return_tensors, - data_format=data_format, - ) - def preprocess( self, images: ImageInput, @@ -405,8 +384,10 @@ def preprocess( image_std: Optional[Union[float, List[float]]] = None, do_pad: Optional[bool] = None, do_center_crop: Optional[bool] = None, + crop_size: Dict[str, int] = None, return_tensors: Optional[Union[str, TensorType]] = None, data_format: ChannelDimension = ChannelDimension.FIRST, + input_data_format: Optional[Union[str, ChannelDimension]] = None, **kwargs, ) -> PIL.Image.Image: """ @@ -414,7 +395,8 @@ def preprocess( Args: images (`ImageInput`): - Image to preprocess. + Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If + passing in images with pixel values between 0 and 1, set `do_rescale=False`. do_resize (`bool`, *optional*, defaults to `self.do_resize`): Whether to resize the image. size (`Dict[str, int]`, *optional*, defaults to `self.size`): @@ -442,6 +424,9 @@ def preprocess( do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`): Whether to center crop the image. If the input size is smaller than `crop_size` along any edge, the image is padded with 0's and then center cropped. + crop_size (`Dict[str, int]`, *optional*, defaults to `self.crop_size`): + Size of the image after center crop. If one edge the image is smaller than `crop_size`, it will be + padded with zeros and then cropped return_tensors (`str` or `TensorType`, *optional*): The type of tensors to return. Can be one of: - Unset: Return a list of `np.ndarray`. @@ -451,8 +436,15 @@ def preprocess( - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`. data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`): The channel dimension format for the output image. Can be one of: - - `ChannelDimension.FIRST`: image in (num_channels, height, width) format. - - `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format. + - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - Unset: Use the channel dimension format of the input image. + input_data_format (`ChannelDimension` or `str`, *optional*): + The channel dimension format for the input image. If unset, the channel dimension format is inferred + from the input image. Can be one of: + - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format. + - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - `"none"` or `ChannelDimension.NONE`: image in (height, width) format. """ do_resize = do_resize if do_resize is not None else self.do_resize size_divisor = size_divisor if size_divisor is not None else self.size_divisor @@ -464,6 +456,11 @@ def preprocess( image_std = image_std if image_std is not None else self.image_std do_pad = do_pad if do_pad is not None else self.do_pad do_center_crop if do_center_crop is not None else self.do_center_crop + # For backwards compatibility. Initial version of this processor was cropping to the "size" argument, which + # it should default to if crop_size is undefined. 
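A minimal NumPy sketch, assuming two toy channels-first images, of the batch padding and pixel-mask logic (`_pad_image` / `make_pixel_mask` / `pad`) shown above: every image is zero-padded at the bottom and right to the batch-wide maximum size, and the mask records which pixels were real.

```python
import numpy as np

# two ragged channels-first images (C, H, W); values are dummies
images = [np.ones((3, 4, 6)), np.ones((3, 5, 3))]
max_h = max(img.shape[1] for img in images)
max_w = max(img.shape[2] for img in images)

padded, masks = [], []
for img in images:
    pad_bottom, pad_right = max_h - img.shape[1], max_w - img.shape[2]
    # zero-pad bottom/right, mirroring _pad_image above
    padded.append(np.pad(img, ((0, 0), (0, pad_bottom), (0, pad_right))))
    # pixel mask: 1 where the original image was, 0 over the padding
    mask = np.zeros((max_h, max_w), dtype=np.int64)
    mask[: img.shape[1], : img.shape[2]] = 1
    masks.append(mask)

print(padded[0].shape, padded[1].shape)  # (3, 5, 6) (3, 5, 6)
print(masks[1][0])                       # [1 1 1 0 0 0]
```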
+ crop_size = ( + crop_size if crop_size is not None else (self.crop_size if self.crop_size is not None else self.size) + ) size = size if size is not None else self.size size = get_size_dict(size, default_to_square=False) @@ -476,37 +473,67 @@ def preprocess( "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, " "torch.Tensor, tf.Tensor or jax.ndarray." ) - - if do_resize and size is None or resample is None: - raise ValueError("Size and resample must be specified if do_resize is True.") - - if do_rescale and rescale_factor is None: - raise ValueError("Rescale factor must be specified if do_rescale is True.") - - if do_normalize and (image_mean is None or image_std is None): - raise ValueError("Image mean and std must be specified if do_normalize is True.") - + # Here, crop_size is used only if it is set, else size will be used. + validate_preprocess_arguments( + do_rescale=do_rescale, + rescale_factor=rescale_factor, + do_normalize=do_normalize, + image_mean=image_mean, + image_std=image_std, + do_pad=do_pad, + size_divisibility=size_divisor, + do_center_crop=do_center_crop, + crop_size=crop_size, + do_resize=do_resize, + size=size, + resample=resample, + ) # All transformations expect numpy arrays. images = [to_numpy_array(image) for image in images] + if is_scaled_image(images[0]) and do_rescale: + logger.warning_once( + "It looks like you are trying to rescale already rescaled images. If the input" + " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again." + ) + if do_resize: images = [ - self.resize(image=image, size=size, size_divisor=size_divisor, resample=resample) for image in images + self.resize( + image=image, + size=size, + size_divisor=size_divisor, + resample=resample, + input_data_format=input_data_format, + ) + for image in images ] if do_center_crop: - images = [self.center_crop(image=image, size=size) for image in images] + images = [ + self.center_crop(image=image, size=crop_size, input_data_format=input_data_format) for image in images + ] if do_rescale: - images = [self.rescale(image=image, scale=rescale_factor) for image in images] + images = [ + self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format) + for image in images + ] if do_normalize: - images = [self.normalize(image=image, mean=image_mean, std=image_std) for image in images] + images = [ + self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format) + for image in images + ] - images = [to_channel_dimension_format(image, data_format) for image in images] + images = [ + to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images + ] if do_pad: - encoded_outputs = self.pad(images, return_pixel_mask=True, return_tensors=return_tensors) + encoded_outputs = self.pad( + images, return_pixel_mask=True, return_tensors=return_tensors, input_data_format=data_format + ) else: encoded_outputs = BatchFeature(data={"pixel_values": images}, tensor_type=return_tensors) diff --git a/src/transformers/models/bridgetower/modeling_bridgetower.py b/src/transformers/models/bridgetower/modeling_bridgetower.py index aa25ad52d7edf8..f5822070db6a3d 100644 --- a/src/transformers/models/bridgetower/modeling_bridgetower.py +++ b/src/transformers/models/bridgetower/modeling_bridgetower.py @@ -22,6 +22,7 @@ import torch import torch.utils.checkpoint from torch import nn +from torch.nn import CrossEntropyLoss from ...activations import ACT2FN, QuickGELUActivation from 
...modeling_outputs import ( @@ -32,26 +33,20 @@ SequenceClassifierOutput, ) from ...modeling_utils import PreTrainedModel, apply_chunking_to_forward -from ...pytorch_utils import find_pruneable_heads_and_indices, is_torch_greater_or_equal_than_1_10, prune_linear_layer +from ...pytorch_utils import find_pruneable_heads_and_indices, prune_linear_layer from ...utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings from .configuration_bridgetower import BridgeTowerConfig, BridgeTowerTextConfig, BridgeTowerVisionConfig logger = logging.get_logger(__name__) -if not is_torch_greater_or_equal_than_1_10: - logger.warning( - f"You are using torch=={torch.__version__}, but torch>=1.10.0 is required to use " - "BridgeTowerModel. Please upgrade torch." - ) - _CONFIG_FOR_DOC = "BridgeTowerConfig" _CHECKPOINT_FOR_DOC = "BridgeTower/bridgetower-base" _TOKENIZER_FOR_DOC = "RobertaTokenizer" BRIDGETOWER_PRETRAINED_MODEL_ARCHIVE_LIST = [ "BridgeTower/bridgetower-base", - "BridgeTower/bridgetower-base-itm-mlm" + "BridgeTower/bridgetower-base-itm-mlm", # See all bridgetower models at https://huggingface.co/BridgeTower ] @@ -70,7 +65,7 @@ BRIDGETOWER_INPUTS_DOCSTRING = r""" Args: input_ids (`torch.LongTensor` of shape `({0})`): - Indices of input sequence tokens in the vocabulary. Indices can be obtained using [`BertTokenizer`]. See + Indices of input sequence tokens in the vocabulary. Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for details. [What are input IDs?](../glossary#input-ids) @@ -142,9 +137,8 @@ class BridgeTowerModelOutput(ModelOutput): token), respectively, after further processing through layers used for auxiliary pretraining tasks. hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + - one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. - - Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of + the model at the output of each layer plus the optional initial embedding outputs. attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. @@ -160,6 +154,40 @@ class BridgeTowerModelOutput(ModelOutput): attentions: Optional[Tuple[torch.FloatTensor]] = None +@dataclass +class BridgeTowerContrastiveOutput(ModelOutput): + """ + Output type of ['BridgeTowerForContrastiveLearning'] + + Args: + loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`: + Image-text contrastive loss. + logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): + Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). + text_embeds (`torch.FloatTensor)`, *optional*, returned when model is initialized with `with_projection=True`): + The text embeddings obtained by applying the projection layer to the pooler_output. 
+ image_embeds (`torch.FloatTensor)`, *optional*, returned when model is initialized with `with_projection=True`): + The image embeddings obtained by applying the projection layer to the pooler_output. + cross_embeds (`torch.FloatTensor)`, *optional*, returned when model is initialized with `with_projection=True`): + The text-image cross-modal embeddings obtained by applying the projection layer to the pooler_output. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of + the model at the output of each layer plus the optional initial embedding outputs. + attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + """ + + loss: Optional[torch.FloatTensor] = None + logits: torch.FloatTensor = None + text_embeds: Optional[Tuple[torch.FloatTensor]] = None + image_embeds: Optional[Tuple[torch.FloatTensor]] = None + cross_embeds: Optional[Tuple[torch.FloatTensor]] = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + + class BridgeTowerResidualAttention(nn.Module): def __init__(self, config): super().__init__() @@ -252,11 +280,12 @@ def __init__(self, config: BridgeTowerVisionConfig): self.num_patches = (self.image_size // self.patch_size) ** 2 self.num_positions = self.num_patches + 1 self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim) - self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1))) + self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1)), persistent=False) def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor: batch_size = pixel_values.shape[0] - patch_embeds = self.patch_embedding(pixel_values) # shape = [*, width, grid, grid] + target_dtype = self.patch_embedding.weight.dtype + patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype)) # shape = [*, width, grid, grid] patch_embeds = patch_embeds.flatten(2).transpose(1, 2) class_embeds = self.class_embedding.expand(batch_size, 1, -1) @@ -759,6 +788,13 @@ def forward( all_self_attentions = () if output_attentions else None all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning_once( + "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." + ) + use_cache = False + next_decoder_cache = () if use_cache else None for i, layer_module in enumerate(self.layer): if output_hidden_states: @@ -768,25 +804,15 @@ def forward( past_key_value = past_key_values[i] if past_key_values is not None else None if self.gradient_checkpointing and self.training: - if use_cache: - logger.warning( - "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." 
- ) - use_cache = False - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, past_key_value, output_attentions) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(layer_module), + layer_outputs = self._gradient_checkpointing_func( + layer_module.__call__, hidden_states, attention_mask, layer_head_mask, encoder_hidden_states, encoder_attention_mask, + past_key_value, + output_attentions, ) else: layer_outputs = layer_module( @@ -850,7 +876,9 @@ def __init__(self, config): self.dropout = nn.Dropout(config.hidden_dropout_prob) # position_ids (1, len position emb) is contiguous in memory and exported when serialized self.position_embedding_type = getattr(config, "position_embedding_type", "absolute") - self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1))) + self.register_buffer( + "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False + ) self.register_buffer( "token_type_ids", torch.zeros(self.position_ids.size(), dtype=torch.long), persistent=False ) @@ -945,7 +973,8 @@ class BridgeTowerPreTrainedModel(PreTrainedModel): config_class = BridgeTowerConfig base_model_prefix = "bridgetower" supports_gradient_checkpointing = False - _no_split_modules = ["BridgeTowerSelfAttention"] + _no_split_modules = ["BridgeTowerSelfAttention", "BridgeTowerResidualAttention"] + _skip_keys_device_placement = "past_key_values" def _init_weights(self, module): if isinstance(module, BridgeTowerVisionModel): @@ -1007,8 +1036,6 @@ class BridgeTowerTextModel(BridgeTowerPreTrainedModel): config_class = BridgeTowerTextConfig - _keys_to_ignore_on_load_missing = [r"position_ids"] - def __init__(self, config, add_pooling_layer=True): super().__init__(config) self.config = config @@ -1086,6 +1113,7 @@ def forward( if input_ids is not None and inputs_embeds is not None: raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") elif input_ids is not None: + self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask) input_shape = input_ids.size() elif inputs_embeds is not None: input_shape = inputs_embeds.size()[:-1] @@ -1228,6 +1256,12 @@ def __init__(self, config): self.post_init() + def get_input_embeddings(self): + return self.text_model.get_input_embeddings() + + def set_input_embeddings(self, value): + self.text_model.set_input_embeddings(value) + @add_start_docstrings_to_model_forward(BRIDGETOWER_INPUTS_DOCSTRING) @replace_return_docstrings(output_type=BridgeTowerModelOutput, config_class=_CONFIG_FOR_DOC) def forward( @@ -1311,7 +1345,12 @@ def forward( if output_hidden_states: all_hidden_states_text += (text_embeds,) - image_embeds = self.vision_model.visual.forward_pre(pixel_values.type(self.vision_model.dtype)) + if image_embeds is None: + image_embeds = self.vision_model.visual.forward_pre(pixel_values.type(self.vision_model.dtype)) + else: + # Permute as BridgeTowerResidualAttention has batch_first=True + image_embeds = image_embeds.permute(1, 0, 2) + if output_hidden_states: all_hidden_states_image += (image_embeds,) @@ -1435,7 +1474,11 @@ def forward( all_hidden_states = (all_hidden_states_text, all_hidden_states_image, all_hidden_states_cross) if not return_dict: - return tuple(v for v in [text_features, image_features, cls_features] if v is not None) + return tuple( + v + for v in [text_features, image_features, cls_features, all_hidden_states, all_self_attentions] + if v is not None + ) 
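A minimal sketch, assuming a toy module, of what the `persistent=False` flag added to the `position_ids` buffer above changes: the buffer stays usable in `forward`, but it is no longer written to (or expected in) the `state_dict`.

```python
import torch
from torch import nn


class Toy(nn.Module):
    def __init__(self, max_position_embeddings: int = 8):
        super().__init__()
        # same pattern as above: a non-persistent position_ids buffer
        self.register_buffer(
            "position_ids", torch.arange(max_position_embeddings).expand((1, -1)), persistent=False
        )


m = Toy()
print(m.position_ids.shape)         # torch.Size([1, 8]) -> still available inside forward()
print(list(m.state_dict().keys()))  # [] -> not serialized, so no missing-key noise when loading
```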
return BridgeTowerModelOutput( text_features=text_features, @@ -1502,6 +1545,8 @@ def forward(self, x): BRIDGETOWER_START_DOCSTRING, ) class BridgeTowerForMaskedLM(BridgeTowerPreTrainedModel): + _tied_weights_keys = ["mlm_score.decoder.weight"] + def __init__(self, config): super().__init__(config) @@ -1535,8 +1580,10 @@ def forward( labels: Optional[torch.LongTensor] = None, ) -> Union[MaskedLMOutput, Tuple[torch.FloatTensor]]: r""" - labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*): - Labels are currently not supported. + labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): + Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ..., + config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked), the + loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]` Returns: Examples: @@ -1580,11 +1627,19 @@ def forward( ) mlm_logits = self.mlm_score(outputs.text_features if return_dict else outputs[0]) + masked_lm_loss = None + if labels is not None: + loss_fct = CrossEntropyLoss() # -100 index = padding token + + labels = labels.to(mlm_logits.device) + masked_lm_loss = loss_fct(mlm_logits.view(-1, self.config.text_config.vocab_size), labels.view(-1)) if not return_dict: - return tuple(mlm_logits) + output = tuple(mlm_logits) + return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output return MaskedLMOutput( + loss=masked_lm_loss, logits=mlm_logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions, @@ -1627,8 +1682,9 @@ def forward( labels: Optional[torch.LongTensor] = None, ) -> Union[SequenceClassifierOutput, Tuple[torch.FloatTensor]]: r""" - labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*): - Labels are currently not supported. + labels (`torch.LongTensor` of shape `(batch_size, 1)`, *optional*): + Labels for computing the image-text matching loss. 0 means the pairs don't match and 1 means they match. + The pairs with 0 will be skipped for calculation. Returns: Examples: @@ -1673,12 +1729,173 @@ def forward( logits = self.itm_score(pooler_output) + itm_loss = None + if labels is not None: + loss_fct = CrossEntropyLoss() + + labels = labels.to(logits.device) + itm_loss = loss_fct(logits, labels) + if not return_dict: - return tuple(logits) + output = tuple(logits) + return ((itm_loss,) + output) if itm_loss is not None else output return SequenceClassifierOutput( - loss=None, + loss=itm_loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions, ) + + +class BridgeTowerContrastiveHead(nn.Module): + def __init__(self, hidden_size, embed_size): + super().__init__() + self.fc = nn.Linear(hidden_size, embed_size) + + def forward(self, x): + x = self.fc(x) + return x + + +@add_start_docstrings( + """ + BridgeTower Model with a image-text contrastive head on top computing image-text contrastive loss. 
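+    When `return_loss=True` is passed to the forward pass, the returned loss is the mean of the text-to-image,
+    text-to-cross-modal and image-to-cross-modal contrastive losses.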
+ """, + BRIDGETOWER_START_DOCSTRING, +) +class BridgeTowerForContrastiveLearning(BridgeTowerPreTrainedModel): + def __init__(self, config): + super().__init__(config) + + self.bridgetower = BridgeTowerModel(config) + + self.itc_text_head = BridgeTowerContrastiveHead(config.hidden_size, config.contrastive_hidden_size) + self.itc_image_head = BridgeTowerContrastiveHead(config.hidden_size, config.contrastive_hidden_size) + self.itc_cross_modal_head = BridgeTowerContrastiveHead(config.hidden_size * 2, config.contrastive_hidden_size) + + self.logit_scale = nn.Parameter(torch.tensor(self.config.logit_scale_init_value)) + # Initialize weights and apply final processing + self.post_init() + + @add_start_docstrings_to_model_forward(BRIDGETOWER_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=BridgeTowerContrastiveOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + pixel_values: Optional[torch.FloatTensor] = None, + pixel_mask: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + image_embeds: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = True, + return_dict: Optional[bool] = None, + return_loss: Optional[bool] = None, + ) -> Union[BridgeTowerContrastiveOutput, Tuple[torch.FloatTensor]]: + r""" + return_loss (`bool`, *optional*): + Whether or not to return the contrastive loss. + Returns: + + Examples: + + ```python + >>> from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning + >>> import requests + >>> from PIL import Image + >>> import torch + + >>> image_urls = [ + ... "https://farm4.staticflickr.com/3395/3428278415_81c3e27f15_z.jpg", + ... "http://images.cocodataset.org/val2017/000000039769.jpg", + ... 
] + >>> texts = ["two dogs in a car", "two cats sleeping on a couch"] + >>> images = [Image.open(requests.get(url, stream=True).raw) for url in image_urls] + + >>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc") + >>> model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc") + + >>> inputs = processor(images, texts, padding=True, return_tensors="pt") + >>> loss = model(**inputs, return_loss=True).loss + + >>> inputs = processor(images, texts[::-1], padding=True, return_tensors="pt") + >>> loss_swapped = model(**inputs, return_loss=True).loss + + >>> print("Loss", round(loss.item(), 4)) + Loss 0.0019 + + >>> print("Loss with swapped images", round(loss_swapped.item(), 4)) + Loss with swapped images 2.126 + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.bridgetower( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + pixel_values=pixel_values, + pixel_mask=pixel_mask, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + image_embeds=image_embeds, + output_attentions=output_attentions, + output_hidden_states=True, + return_dict=return_dict, + ) + + pooler_output = outputs.pooler_output if return_dict else outputs[2] + hidden_states_txt, hidden_states_img, hidden_states_cross_modal = ( + outputs.hidden_states if return_dict else outputs[3] + ) + + text_embeds = hidden_states_txt[-1] + image_embeds = hidden_states_img[-1] + + image_embeds_with_ln = self.bridgetower.vision_model.visual.forward_post(image_embeds) + image_token_type_embeddings = self.bridgetower.token_type_embeddings( + torch.full((1,), 1, dtype=torch.long, device=self.bridgetower.token_type_embeddings.weight.device) + ).expand_as(image_embeds_with_ln) + + image_embeds = self.bridgetower.cross_modal_image_transform(image_embeds_with_ln) + image_token_type_embeddings + + # normalized features + text_embeds = nn.functional.normalize(self.itc_text_head(text_embeds[:, 0, :]), dim=-1, p=2) + image_embeds = nn.functional.normalize(self.itc_image_head(image_embeds[:, 0, :]), dim=-1, p=2).to( + device=text_embeds.device + ) + cross_embeds = nn.functional.normalize(self.itc_cross_modal_head(pooler_output), dim=-1, p=2).to( + device=text_embeds.device + ) + + logits = torch.stack([text_embeds, image_embeds, cross_embeds], dim=-2) + + logit_scale = self.logit_scale.exp().to(device=text_embeds.device) + logits_text_to_image = torch.matmul(text_embeds, image_embeds.t()) * logit_scale + logits_text_to_cross = torch.matmul(text_embeds, cross_embeds.t()) * logit_scale + logits_image_to_cross = torch.matmul(image_embeds, cross_embeds.t()) * logit_scale + + itc_loss = None + + if return_loss: + labels = torch.arange(len(logits), device=logits.device) + text_to_image_loss = nn.functional.cross_entropy(logits_text_to_image, labels) + text_to_cross_loss = nn.functional.cross_entropy(logits_text_to_cross, labels) + image_to_cross_loss = nn.functional.cross_entropy(logits_image_to_cross, labels) + itc_loss = (text_to_image_loss + text_to_cross_loss + image_to_cross_loss) / 3.0 + + if not return_dict: + output = (logits, text_embeds, image_embeds, cross_embeds) + outputs[3:] + return ((itc_loss,) + output) if itc_loss is not None else output + + return BridgeTowerContrastiveOutput( + loss=itc_loss, + logits=logits, + text_embeds=text_embeds, + image_embeds=image_embeds, + cross_embeds=cross_embeds, + hidden_states=outputs.hidden_states, + 
attentions=outputs.attentions, + ) diff --git a/src/transformers/models/bridgetower/processing_bridgetower.py b/src/transformers/models/bridgetower/processing_bridgetower.py index c268d7c26f43d9..7718c3bf833fec 100644 --- a/src/transformers/models/bridgetower/processing_bridgetower.py +++ b/src/transformers/models/bridgetower/processing_bridgetower.py @@ -38,6 +38,7 @@ class BridgeTowerProcessor(ProcessorMixin): tokenizer (`RobertaTokenizerFast`): An instance of ['RobertaTokenizerFast`]. The tokenizer is a required input. """ + attributes = ["image_processor", "tokenizer"] image_processor_class = "BridgeTowerImageProcessor" tokenizer_class = ("RobertaTokenizer", "RobertaTokenizerFast") diff --git a/src/transformers/models/bros/__init__.py b/src/transformers/models/bros/__init__.py new file mode 100644 index 00000000000000..b08d55836488a0 --- /dev/null +++ b/src/transformers/models/bros/__init__.py @@ -0,0 +1,77 @@ +# Copyright 2023-present NAVER Corp, The Microsoft Research Asia LayoutLM Team Authors and the HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from typing import TYPE_CHECKING + +from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_tokenizers_available, is_torch_available + + +_import_structure = { + "configuration_bros": ["BROS_PRETRAINED_CONFIG_ARCHIVE_MAP", "BrosConfig"], +} + +try: + if not is_tokenizers_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass +else: + _import_structure["processing_bros"] = ["BrosProcessor"] + +try: + if not is_torch_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass +else: + _import_structure["modeling_bros"] = [ + "BROS_PRETRAINED_MODEL_ARCHIVE_LIST", + "BrosPreTrainedModel", + "BrosModel", + "BrosForTokenClassification", + "BrosSpadeEEForTokenClassification", + "BrosSpadeELForTokenClassification", + ] + + +if TYPE_CHECKING: + from .configuration_bros import BROS_PRETRAINED_CONFIG_ARCHIVE_MAP, BrosConfig + + try: + if not is_tokenizers_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + else: + from .processing_bros import BrosProcessor + + try: + if not is_torch_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + else: + from .modeling_bros import ( + BROS_PRETRAINED_MODEL_ARCHIVE_LIST, + BrosForTokenClassification, + BrosModel, + BrosPreTrainedModel, + BrosSpadeEEForTokenClassification, + BrosSpadeELForTokenClassification, + ) + + +else: + import sys + + sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__) diff --git a/src/transformers/models/bros/configuration_bros.py b/src/transformers/models/bros/configuration_bros.py new file mode 100644 index 00000000000000..4384810a55a013 --- /dev/null +++ b/src/transformers/models/bros/configuration_bros.py @@ -0,0 +1,140 @@ +# coding=utf-8 +# Copyright 2023-present NAVER Corp, The Microsoft Research Asia 
LayoutLM Team Authors and the HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Bros model configuration""" + +from ...configuration_utils import PretrainedConfig +from ...utils import logging + + +logger = logging.get_logger(__name__) + +BROS_PRETRAINED_CONFIG_ARCHIVE_MAP = { + "jinho8345/bros-base-uncased": "https://huggingface.co/jinho8345/bros-base-uncased/blob/main/config.json", + "jinho8345/bros-large-uncased": "https://huggingface.co/jinho8345/bros-large-uncased/blob/main/config.json", +} + + +class BrosConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`BrosModel`] or a [`TFBrosModel`]. It is used to + instantiate a Bros model according to the specified arguments, defining the model architecture. Instantiating a + configuration with the defaults will yield a similar configuration to that of the Bros + [jinho8345/bros-base-uncased](https://huggingface.co/jinho8345/bros-base-uncased) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + vocab_size (`int`, *optional*, defaults to 30522): + Vocabulary size of the Bros model. Defines the number of different tokens that can be represented by the + `inputs_ids` passed when calling [`BrosModel`] or [`TFBrosModel`]. + hidden_size (`int`, *optional*, defaults to 768): + Dimensionality of the encoder layers and the pooler layer. + num_hidden_layers (`int`, *optional*, defaults to 12): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 12): + Number of attention heads for each attention layer in the Transformer encoder. + intermediate_size (`int`, *optional*, defaults to 3072): + Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder. + hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"silu"` and `"gelu_new"` are supported. + hidden_dropout_prob (`float`, *optional*, defaults to 0.1): + The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. + attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1): + The dropout ratio for the attention probabilities. + max_position_embeddings (`int`, *optional*, defaults to 512): + The maximum sequence length that this model might ever be used with. Typically set this to something large + just in case (e.g., 512 or 1024 or 2048). + type_vocab_size (`int`, *optional*, defaults to 2): + The vocabulary size of the `token_type_ids` passed when calling [`BrosModel`] or [`TFBrosModel`]. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. 
+ layer_norm_eps (`float`, *optional*, defaults to 1e-12): + The epsilon used by the layer normalization layers. + pad_token_id (`int`, *optional*, defaults to 0): + The index of the padding token in the token vocabulary. + dim_bbox (`int`, *optional*, defaults to 8): + The dimension of the bounding box coordinates. (x0, y1, x1, y0, x1, y1, x0, y1) + bbox_scale (`float`, *optional*, defaults to 100.0): + The scale factor of the bounding box coordinates. + n_relations (`int`, *optional*, defaults to 1): + The number of relations for SpadeEE(entity extraction), SpadeEL(entity linking) head. + classifier_dropout_prob (`float`, *optional*, defaults to 0.1): + The dropout ratio for the classifier head. + + + Examples: + + ```python + >>> from transformers import BrosConfig, BrosModel + + >>> # Initializing a BROS jinho8345/bros-base-uncased style configuration + >>> configuration = BrosConfig() + + >>> # Initializing a model from the jinho8345/bros-base-uncased style configuration + >>> model = BrosModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + + model_type = "bros" + + def __init__( + self, + vocab_size=30522, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + intermediate_size=3072, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=512, + type_vocab_size=2, + initializer_range=0.02, + layer_norm_eps=1e-12, + pad_token_id=0, + dim_bbox=8, + bbox_scale=100.0, + n_relations=1, + classifier_dropout_prob=0.1, + **kwargs, + ): + super().__init__( + vocab_size=vocab_size, + hidden_size=hidden_size, + num_hidden_layers=num_hidden_layers, + num_attention_heads=num_attention_heads, + intermediate_size=intermediate_size, + hidden_act=hidden_act, + hidden_dropout_prob=hidden_dropout_prob, + attention_probs_dropout_prob=attention_probs_dropout_prob, + max_position_embeddings=max_position_embeddings, + type_vocab_size=type_vocab_size, + initializer_range=initializer_range, + layer_norm_eps=layer_norm_eps, + pad_token_id=pad_token_id, + **kwargs, + ) + + self.dim_bbox = dim_bbox + self.bbox_scale = bbox_scale + self.n_relations = n_relations + self.dim_bbox_sinusoid_emb_2d = self.hidden_size // 4 + self.dim_bbox_sinusoid_emb_1d = self.dim_bbox_sinusoid_emb_2d // self.dim_bbox + self.dim_bbox_projection = self.hidden_size // self.num_attention_heads + self.classifier_dropout_prob = classifier_dropout_prob diff --git a/src/transformers/models/bros/convert_bros_to_pytorch.py b/src/transformers/models/bros/convert_bros_to_pytorch.py new file mode 100644 index 00000000000000..c0984f2c74b20c --- /dev/null +++ b/src/transformers/models/bros/convert_bros_to_pytorch.py @@ -0,0 +1,145 @@ +# coding=utf-8 +# Copyright 2023 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
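+# Usage sketch (flags match the argparse arguments defined at the bottom of this file; the dump folder is only an
+# example path, and the original `bros` package imported below must be installed):
+#   python src/transformers/models/bros/convert_bros_to_pytorch.py \
+#       --model_name jinho8345/bros-base-uncased \
+#       --pytorch_dump_folder_path ./bros-base-uncased \
+#       --push_to_hub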
+"""Convert Bros checkpoints.""" + +import argparse + +import bros # original repo +import torch + +from transformers import BrosConfig, BrosModel, BrosProcessor +from transformers.utils import logging + + +logging.set_verbosity_info() +logger = logging.get_logger(__name__) + + +def get_configs(model_name): + bros_config = BrosConfig.from_pretrained(model_name) + return bros_config + + +def remove_ignore_keys_(state_dict): + ignore_keys = [ + "embeddings.bbox_sinusoid_emb.inv_freq", + ] + for k in ignore_keys: + state_dict.pop(k, None) + + +def rename_key(name): + if name == "embeddings.bbox_projection.weight": + name = "bbox_embeddings.bbox_projection.weight" + + if name == "embeddings.bbox_sinusoid_emb.x_pos_emb.inv_freq": + name = "bbox_embeddings.bbox_sinusoid_emb.x_pos_emb.inv_freq" + + if name == "embeddings.bbox_sinusoid_emb.y_pos_emb.inv_freq": + name = "bbox_embeddings.bbox_sinusoid_emb.y_pos_emb.inv_freq" + + return name + + +def convert_state_dict(orig_state_dict, model): + # rename keys + for key in orig_state_dict.copy().keys(): + val = orig_state_dict.pop(key) + orig_state_dict[rename_key(key)] = val + + # remove ignore keys + remove_ignore_keys_(orig_state_dict) + + return orig_state_dict + + +def convert_bros_checkpoint(model_name, pytorch_dump_folder_path=None, push_to_hub=False): + # load original model + original_model = bros.BrosModel.from_pretrained(model_name).eval() + + # load HuggingFace Model + bros_config = get_configs(model_name) + model = BrosModel.from_pretrained(model_name, config=bros_config) + model.eval() + + state_dict = original_model.state_dict() + new_state_dict = convert_state_dict(state_dict, model) + model.load_state_dict(new_state_dict) + + # verify results + + # original BROS model require 4 points (8 float values) for each bbox, prepare bbox with [batch_size, seq_len, 8] shape + bbox = torch.tensor( + [ + [ + [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000], + [0.4396, 0.6720, 0.4659, 0.6720, 0.4659, 0.6850, 0.4396, 0.6850], + [0.4698, 0.6720, 0.4843, 0.6720, 0.4843, 0.6850, 0.4698, 0.6850], + [0.4698, 0.6720, 0.4843, 0.6720, 0.4843, 0.6850, 0.4698, 0.6850], + [0.2047, 0.6870, 0.2730, 0.6870, 0.2730, 0.7000, 0.2047, 0.7000], + [0.2047, 0.6870, 0.2730, 0.6870, 0.2730, 0.7000, 0.2047, 0.7000], + [1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000], + ] + ] + ) + + processor = BrosProcessor.from_pretrained(model_name) + + encoding = processor("His name is Rocco.", return_tensors="pt") + encoding["bbox"] = bbox + + original_hidden_states = original_model(**encoding).last_hidden_state + # pixel_values = processor(image, return_tensors="pt").pixel_values + + last_hidden_states = model(**encoding).last_hidden_state + + assert torch.allclose(original_hidden_states, last_hidden_states, atol=1e-4) + + if pytorch_dump_folder_path is not None: + print(f"Saving model and processor to {pytorch_dump_folder_path}") + model.save_pretrained(pytorch_dump_folder_path) + processor.save_pretrained(pytorch_dump_folder_path) + + if push_to_hub: + model.push_to_hub("jinho8345/" + model_name.split("/")[-1], commit_message="Update model") + processor.push_to_hub("jinho8345/" + model_name.split("/")[-1], commit_message="Update model") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--model_name", + default="jinho8345/bros-base-uncased", + required=False, + type=str, + help="Name of the original model you'd like to convert.", + ) + parser.add_argument( + 
"--pytorch_dump_folder_path", + default=None, + required=False, + type=str, + help="Path to the output PyTorch model directory.", + ) + parser.add_argument( + "--push_to_hub", + action="store_true", + help="Whether or not to push the converted model and processor to the 🤗 hub.", + ) + + args = parser.parse_args() + convert_bros_checkpoint(args.model_name, args.pytorch_dump_folder_path, args.push_to_hub) diff --git a/src/transformers/models/bros/modeling_bros.py b/src/transformers/models/bros/modeling_bros.py new file mode 100755 index 00000000000000..d3a17b23c94d48 --- /dev/null +++ b/src/transformers/models/bros/modeling_bros.py @@ -0,0 +1,1320 @@ +# coding=utf-8 +# Copyright 2023-present NAVER Corp, The Microsoft Research Asia LayoutLM Team Authors and the HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" PyTorch Bros model.""" + + +import math +from dataclasses import dataclass +from typing import List, Optional, Tuple, Union + +import torch +import torch.utils.checkpoint +from torch import nn +from torch.nn import CrossEntropyLoss + +from ...activations import ACT2FN +from ...modeling_outputs import ( + BaseModelOutputWithPastAndCrossAttentions, + BaseModelOutputWithPoolingAndCrossAttentions, + TokenClassifierOutput, +) +from ...modeling_utils import PreTrainedModel +from ...pytorch_utils import apply_chunking_to_forward, find_pruneable_heads_and_indices, prune_linear_layer +from ...utils import ( + ModelOutput, + add_start_docstrings, + add_start_docstrings_to_model_forward, + logging, + replace_return_docstrings, +) +from .configuration_bros import BrosConfig + + +logger = logging.get_logger(__name__) + +_CHECKPOINT_FOR_DOC = "jinho8345/bros-base-uncased" +_CONFIG_FOR_DOC = "BrosConfig" + +BROS_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "jinho8345/bros-base-uncased", + "jinho8345/bros-large-uncased", + # See all Bros models at https://huggingface.co/models?filter=bros +] + +BROS_START_DOCSTRING = r""" + This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. + Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage + and behavior. + + Parameters: + config ([`BrosConfig`]): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the + configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights. +""" + +BROS_INPUTS_DOCSTRING = r""" + Args: + input_ids (`torch.LongTensor` of shape `({0})`): + Indices of input sequence tokens in the vocabulary. + + Indices can be obtained using [`BrosProcessor`]. See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. + + [What are input IDs?](../glossary#input-ids) + + bbox ('torch.FloatTensor' of shape '(batch_size, num_boxes, 4)'): + Bounding box coordinates for each token in the input sequence. 
Each bounding box is a list of four values + (x1, y1, x2, y2), where (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner of the + bounding box. + + attention_mask (`torch.FloatTensor` of shape `({0})`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + + bbox_first_token_mask (`torch.FloatTensor` of shape `({0})`, *optional*): + Mask to indicate the first token of each bounding box. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + token_type_ids (`torch.LongTensor` of shape `({0})`, *optional*): + Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, + 1]`: + + - 0 corresponds to a *sentence A* token, + - 1 corresponds to a *sentence B* token. + + [What are token type IDs?](../glossary#token-type-ids) + + position_ids (`torch.LongTensor` of shape `({0})`, *optional*): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, + config.max_position_embeddings - 1]`. + + [What are position IDs?](../glossary#position-ids) + + head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*): + Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + + inputs_embeds (`torch.FloatTensor` of shape `({0}, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This + is useful if you want more control over how to convert `input_ids` indices into associated vectors than the + model's internal embedding lookup matrix. + + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + + return_dict (`bool`, *optional*): + Whether or not to return a [`~file_utils.ModelOutput`] instead of a plain tuple. +""" + + +@dataclass +class BrosSpadeOutput(ModelOutput): + """ + Base class for outputs of token classification models. + + Args: + loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided) : + Classification loss. + initial_token_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`): + Classification scores for entity initial tokens (before SoftMax). + subsequent_token_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, sequence_length+1)`): + Classification scores for entity sequence tokens (before SoftMax). + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. 
+ attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + """ + + loss: Optional[torch.FloatTensor] = None + initial_token_logits: torch.FloatTensor = None + subsequent_token_logits: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + + +class BrosPositionalEmbedding1D(nn.Module): + # Reference: https://github.com/kimiyoung/transformer-xl/blob/master/pytorch/mem_transformer.py#L15 + + def __init__(self, config): + super(BrosPositionalEmbedding1D, self).__init__() + + self.dim_bbox_sinusoid_emb_1d = config.dim_bbox_sinusoid_emb_1d + + inv_freq = 1 / ( + 10000 ** (torch.arange(0.0, self.dim_bbox_sinusoid_emb_1d, 2.0) / self.dim_bbox_sinusoid_emb_1d) + ) + self.register_buffer("inv_freq", inv_freq) + + def forward(self, pos_seq: torch.Tensor) -> torch.Tensor: + seq_size = pos_seq.size() + b1, b2, b3 = seq_size + sinusoid_inp = pos_seq.view(b1, b2, b3, 1) * self.inv_freq.view(1, 1, 1, self.dim_bbox_sinusoid_emb_1d // 2) + pos_emb = torch.cat([sinusoid_inp.sin(), sinusoid_inp.cos()], dim=-1) + return pos_emb + + +class BrosPositionalEmbedding2D(nn.Module): + def __init__(self, config): + super(BrosPositionalEmbedding2D, self).__init__() + + self.dim_bbox = config.dim_bbox + self.x_pos_emb = BrosPositionalEmbedding1D(config) + self.y_pos_emb = BrosPositionalEmbedding1D(config) + + def forward(self, bbox: torch.Tensor) -> torch.Tensor: + stack = [] + for i in range(self.dim_bbox): + if i % 2 == 0: + stack.append(self.x_pos_emb(bbox[..., i])) + else: + stack.append(self.y_pos_emb(bbox[..., i])) + bbox_pos_emb = torch.cat(stack, dim=-1) + return bbox_pos_emb + + +class BrosBboxEmbeddings(nn.Module): + def __init__(self, config): + super(BrosBboxEmbeddings, self).__init__() + self.bbox_sinusoid_emb = BrosPositionalEmbedding2D(config) + self.bbox_projection = nn.Linear(config.dim_bbox_sinusoid_emb_2d, config.dim_bbox_projection, bias=False) + + def forward(self, bbox: torch.Tensor): + bbox_t = bbox.transpose(0, 1) + bbox_pos = bbox_t[None, :, :, :] - bbox_t[:, None, :, :] + bbox_pos_emb = self.bbox_sinusoid_emb(bbox_pos) + bbox_pos_emb = self.bbox_projection(bbox_pos_emb) + + return bbox_pos_emb + + +class BrosTextEmbeddings(nn.Module): + """Construct the embeddings from word, position and token_type embeddings.""" + + def __init__(self, config): + super().__init__() + + self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id) + self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size) + self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size) + + # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load + # any TensorFlow checkpoint file + self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + # position_ids (1, len position emb) is contiguous in memory and exported when serialized + self.position_embedding_type = getattr(config, "position_embedding_type", "absolute") + self.register_buffer("position_ids", 
torch.arange(config.max_position_embeddings).expand((1, -1))) + self.register_buffer( + "token_type_ids", + torch.zeros( + self.position_ids.size(), + dtype=torch.long, + device=self.position_ids.device, + ), + persistent=False, + ) + + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + token_type_ids: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + inputs_embeds: Optional[torch.Tensor] = None, + past_key_values_length: int = 0, + ) -> torch.Tensor: + if input_ids is not None: + input_shape = input_ids.size() + else: + input_shape = inputs_embeds.size()[:-1] + + seq_length = input_shape[1] + + if position_ids is None: + position_ids = self.position_ids[:, past_key_values_length : seq_length + past_key_values_length] + + if token_type_ids is None: + if hasattr(self, "token_type_ids"): + buffered_token_type_ids = self.token_type_ids[:, :seq_length] + buffered_token_type_ids_expanded = buffered_token_type_ids.expand(input_shape[0], seq_length) + token_type_ids = buffered_token_type_ids_expanded + else: + token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device) + + if inputs_embeds is None: + inputs_embeds = self.word_embeddings(input_ids) + token_type_embeddings = self.token_type_embeddings(token_type_ids) + + embeddings = inputs_embeds + token_type_embeddings + if self.position_embedding_type == "absolute": + position_embeddings = self.position_embeddings(position_ids) + embeddings += position_embeddings + embeddings = self.LayerNorm(embeddings) + embeddings = self.dropout(embeddings) + return embeddings + + +class BrosSelfAttention(nn.Module): + def __init__(self, config): + super().__init__() + if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"): + raise ValueError( + f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention " + f"heads ({config.num_attention_heads})" + ) + + self.num_attention_heads = config.num_attention_heads + self.attention_head_size = int(config.hidden_size / config.num_attention_heads) + self.all_head_size = self.num_attention_heads * self.attention_head_size + + self.query = nn.Linear(config.hidden_size, self.all_head_size) + self.key = nn.Linear(config.hidden_size, self.all_head_size) + self.value = nn.Linear(config.hidden_size, self.all_head_size) + + self.dropout = nn.Dropout(config.attention_probs_dropout_prob) + self.position_embedding_type = getattr(config, "position_embedding_type", "absolute") + if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query": + self.max_position_embeddings = config.max_position_embeddings + self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size) + + self.is_decoder = config.is_decoder + + def transpose_for_scores(self, x: torch.Tensor): + new_x_shape = x.size()[:-1] + ( + self.num_attention_heads, + self.attention_head_size, + ) + x = x.view(*new_x_shape) + return x.permute(0, 2, 1, 3) + + def forward( + self, + hidden_states: torch.Tensor, + bbox_pos_emb: torch.Tensor, + attention_mask: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + encoder_hidden_states: Optional[torch.Tensor] = None, + encoder_attention_mask: Optional[torch.Tensor] = None, + past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, + output_attentions: Optional[torch.Tensor] = False, + ) -> Tuple[torch.Tensor]: + mixed_query_layer = self.query(hidden_states) + + # If this 
is instantiated as a cross-attention module, the keys + # and values come from an encoder; the attention mask needs to be + # such that the encoder's padding tokens are not attended to. + is_cross_attention = encoder_hidden_states is not None + + if is_cross_attention and past_key_value is not None: + # reuse k,v, cross_attentions + key_layer = past_key_value[0] + value_layer = past_key_value[1] + attention_mask = encoder_attention_mask + elif is_cross_attention: + key_layer = self.transpose_for_scores(self.key(encoder_hidden_states)) + value_layer = self.transpose_for_scores(self.value(encoder_hidden_states)) + attention_mask = encoder_attention_mask + elif past_key_value is not None: + key_layer = self.transpose_for_scores(self.key(hidden_states)) + value_layer = self.transpose_for_scores(self.value(hidden_states)) + key_layer = torch.cat([past_key_value[0], key_layer], dim=2) + value_layer = torch.cat([past_key_value[1], value_layer], dim=2) + else: + key_layer = self.transpose_for_scores(self.key(hidden_states)) + value_layer = self.transpose_for_scores(self.value(hidden_states)) + + query_layer = self.transpose_for_scores(mixed_query_layer) + + if self.is_decoder: + # if cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states. + # Further calls to cross_attention layer can then reuse all cross-attention + # key/value_states (first "if" case) + # if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of + # all previous decoder key/value_states. Further calls to uni-directional self-attention + # can concat previous decoder key/value_states to current projected key/value_states (third "elif" case) + # if encoder bi-directional self-attention `past_key_value` is always `None` + past_key_value = (key_layer, value_layer) + + # Take the dot product between "query" and "key" to get the raw attention scores. 
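+        # query_layer and key_layer have shape (batch_size, num_heads, seq_length, head_size), so the matmul below
+        # yields raw attention scores of shape (batch_size, num_heads, query_length, key_length).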
+ attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2)) + + if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query": + seq_length = hidden_states.size()[1] + position_ids_l = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(-1, 1) + position_ids_r = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(1, -1) + distance = position_ids_l - position_ids_r + positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1) + positional_embedding = positional_embedding.to(dtype=query_layer.dtype) # fp16 compatibility + + if self.position_embedding_type == "relative_key": + relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding) + attention_scores = attention_scores + relative_position_scores + elif self.position_embedding_type == "relative_key_query": + relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding) + relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding) + + attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key + + # bbox positional encoding + batch_size, n_head, seq_length, d_head = query_layer.shape + bbox_pos_emb = bbox_pos_emb.view(seq_length, seq_length, batch_size, d_head) + bbox_pos_emb = bbox_pos_emb.permute([2, 0, 1, 3]) + bbox_pos_scores = torch.einsum("bnid,bijd->bnij", (query_layer, bbox_pos_emb)) + + attention_scores = attention_scores + bbox_pos_scores + + attention_scores = attention_scores / math.sqrt(self.attention_head_size) + if attention_mask is not None: + # Apply the attention mask is (precomputed for all layers in BrosModel forward() function) + attention_scores = attention_scores + attention_mask + + # Normalize the attention scores to probabilities. + attention_probs = nn.Softmax(dim=-1)(attention_scores) + + # This is actually dropping out entire tokens to attend to, which might + # seem a bit unusual, but is taken from the original Transformer paper. 
+ attention_probs = self.dropout(attention_probs) + + # Mask heads if we want to + if head_mask is not None: + attention_probs = attention_probs * head_mask + + context_layer = torch.matmul(attention_probs, value_layer) + + context_layer = context_layer.permute(0, 2, 1, 3).contiguous() + new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,) + context_layer = context_layer.view(*new_context_layer_shape) + + outputs = (context_layer, attention_probs) if output_attentions else (context_layer,) + + if self.is_decoder: + outputs = outputs + (past_key_value,) + return outputs + + +# Copied from transformers.models.bert.modeling_bert.BertSelfOutput with Bert->Bros +class BrosSelfOutput(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +class BrosAttention(nn.Module): + def __init__(self, config): + super().__init__() + self.self = BrosSelfAttention(config) + self.output = BrosSelfOutput(config) + self.pruned_heads = set() + + def prune_heads(self, heads): + if len(heads) == 0: + return + heads, index = find_pruneable_heads_and_indices( + heads, + self.self.num_attention_heads, + self.self.attention_head_size, + self.pruned_heads, + ) + + # Prune linear layers + self.self.query = prune_linear_layer(self.self.query, index) + self.self.key = prune_linear_layer(self.self.key, index) + self.self.value = prune_linear_layer(self.self.value, index) + self.output.dense = prune_linear_layer(self.output.dense, index, dim=1) + + # Update hyper params and store pruned heads + self.self.num_attention_heads = self.self.num_attention_heads - len(heads) + self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads + self.pruned_heads = self.pruned_heads.union(heads) + + def forward( + self, + hidden_states: torch.Tensor, + bbox_pos_emb: torch.Tensor, + attention_mask: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + encoder_hidden_states: Optional[torch.Tensor] = None, + encoder_attention_mask: Optional[torch.Tensor] = None, + past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.Tensor]: + self_outputs = self.self( + hidden_states=hidden_states, + bbox_pos_emb=bbox_pos_emb, + attention_mask=attention_mask, + head_mask=head_mask, + encoder_hidden_states=encoder_hidden_states, + encoder_attention_mask=encoder_attention_mask, + past_key_value=past_key_value, + output_attentions=output_attentions, + ) + attention_output = self.output(self_outputs[0], hidden_states) + outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them + return outputs + + +# Copied from transformers.models.bert.modeling_bert.BertIntermediate with Bert->Bros +class BrosIntermediate(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.intermediate_size) + if isinstance(config.hidden_act, str): + self.intermediate_act_fn = ACT2FN[config.hidden_act] + else: + self.intermediate_act_fn = config.hidden_act + + def forward(self, 
hidden_states: torch.Tensor) -> torch.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.intermediate_act_fn(hidden_states) + return hidden_states + + +class BrosOutput(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.intermediate_size, config.hidden_size) + self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +class BrosLayer(nn.Module): + def __init__(self, config): + super().__init__() + self.chunk_size_feed_forward = config.chunk_size_feed_forward + self.seq_len_dim = 1 + self.attention = BrosAttention(config) + self.is_decoder = config.is_decoder + self.add_cross_attention = config.add_cross_attention + if self.add_cross_attention: + if not self.is_decoder: + raise Exception(f"{self} should be used as a decoder model if cross attention is added") + self.crossattention = BrosAttention(config) + self.intermediate = BrosIntermediate(config) + self.output = BrosOutput(config) + + def forward( + self, + hidden_states: torch.Tensor, + bbox_pos_emb: torch.Tensor, + attention_mask: Optional[torch.FloatTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + encoder_attention_mask: Optional[torch.FloatTensor] = None, + past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.Tensor]: + # decoder uni-directional self-attention cached key/values tuple is at positions 1,2 + self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None + self_attention_outputs = self.attention( + hidden_states, + bbox_pos_emb=bbox_pos_emb, + attention_mask=attention_mask, + head_mask=head_mask, + output_attentions=output_attentions, + past_key_value=self_attn_past_key_value, + ) + attention_output = self_attention_outputs[0] + + # if decoder, the last output is tuple of self-attn cache + if self.is_decoder: + outputs = self_attention_outputs[1:-1] + present_key_value = self_attention_outputs[-1] + else: + outputs = self_attention_outputs[1:] # add self attentions if we output attention weights + + cross_attn_present_key_value = None + if self.is_decoder and encoder_hidden_states is not None: + if hasattr(self, "crossattention"): + raise Exception( + f"If `encoder_hidden_states` are passed, {self} has to be instantiated with cross-attention layers by setting `config.add_cross_attention=True`" + ) + + # cross_attn cached key/values tuple is at positions 3,4 of past_key_value tuple + cross_attn_past_key_value = past_key_value[-2:] if past_key_value is not None else None + cross_attention_outputs = self.crossattention( + attention_output, + attention_mask, + head_mask, + encoder_hidden_states, + encoder_attention_mask, + cross_attn_past_key_value, + output_attentions, + ) + attention_output = cross_attention_outputs[0] + outputs = outputs + cross_attention_outputs[1:-1] # add cross attentions if we output attention weights + + # add cross-attn cache to positions 3,4 of present_key_value tuple + cross_attn_present_key_value = cross_attention_outputs[-1] + present_key_value = present_key_value + cross_attn_present_key_value + + 
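+        # apply_chunking_to_forward runs the feed-forward block in chunks of `chunk_size_feed_forward` along the
+        # sequence dimension (self.seq_len_dim) to reduce peak memory; with the default chunk size of 0 it is
+        # equivalent to calling self.feed_forward_chunk directly.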
layer_output = apply_chunking_to_forward( + self.feed_forward_chunk, + self.chunk_size_feed_forward, + self.seq_len_dim, + attention_output, + ) + outputs = (layer_output,) + outputs + + # if decoder, return the attn key/values as the last output + if self.is_decoder: + outputs = outputs + (present_key_value,) + + return outputs + + def feed_forward_chunk(self, attention_output): + intermediate_output = self.intermediate(attention_output) + layer_output = self.output(intermediate_output, attention_output) + return layer_output + + +class BrosEncoder(nn.Module): + def __init__(self, config): + super().__init__() + self.config = config + self.layer = nn.ModuleList([BrosLayer(config) for _ in range(config.num_hidden_layers)]) + + def forward( + self, + hidden_states: torch.Tensor, + bbox_pos_emb: torch.Tensor, + attention_mask: Optional[torch.FloatTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + encoder_attention_mask: Optional[torch.FloatTensor] = None, + past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = False, + output_hidden_states: Optional[bool] = False, + return_dict: Optional[bool] = True, + ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPastAndCrossAttentions]: + all_hidden_states = () if output_hidden_states else None + all_self_attentions = () if output_attentions else None + all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None + + next_decoder_cache = () if use_cache else None + for i, layer_module in enumerate(self.layer): + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + layer_head_mask = head_mask[i] if head_mask is not None else None + past_key_value = past_key_values[i] if past_key_values is not None else None + + if getattr(self.config, "gradient_checkpointing", False) and self.training: + if use_cache: + logger.warning( + "`use_cache=True` is incompatible with `config.gradient_checkpointing=True`. Setting " + "`use_cache=False`..." 
+ ) + use_cache = False + layer_outputs = self._gradient_checkpointing_func( + layer_module.__call__, + hidden_states, + bbox_pos_emb, + attention_mask, + layer_head_mask, + encoder_hidden_states, + encoder_attention_mask, + output_attentions, + ) + else: + layer_outputs = layer_module( + hidden_states=hidden_states, + bbox_pos_emb=bbox_pos_emb, + attention_mask=attention_mask, + head_mask=layer_head_mask, + encoder_hidden_states=encoder_hidden_states, + encoder_attention_mask=encoder_attention_mask, + past_key_value=past_key_value, + output_attentions=output_attentions, + ) + + hidden_states = layer_outputs[0] + if use_cache: + next_decoder_cache += (layer_outputs[-1],) + if output_attentions: + all_self_attentions = all_self_attentions + (layer_outputs[1],) + if self.config.add_cross_attention: + all_cross_attentions = all_cross_attentions + (layer_outputs[2],) + + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if not return_dict: + return tuple( + v + for v in [ + hidden_states, + next_decoder_cache, + all_hidden_states, + all_self_attentions, + all_cross_attentions, + ] + if v is not None + ) + return BaseModelOutputWithPastAndCrossAttentions( + last_hidden_state=hidden_states, + past_key_values=next_decoder_cache, + hidden_states=all_hidden_states, + attentions=all_self_attentions, + cross_attentions=all_cross_attentions, + ) + + +# Copied from transformers.models.bert.modeling_bert.BertPooler with Bert->Bros +class BrosPooler(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.activation = nn.Tanh() + + def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: + # We "pool" the model by simply taking the hidden state corresponding + # to the first token. + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +class BrosRelationExtractor(nn.Module): + def __init__(self, config): + super().__init__() + self.n_relations = config.n_relations + self.backbone_hidden_size = config.hidden_size + self.head_hidden_size = config.hidden_size + self.classifier_dropout_prob = config.classifier_dropout_prob + + self.drop = nn.Dropout(self.classifier_dropout_prob) + self.query = nn.Linear(self.backbone_hidden_size, self.n_relations * self.head_hidden_size) + + self.key = nn.Linear(self.backbone_hidden_size, self.n_relations * self.head_hidden_size) + + self.dummy_node = nn.Parameter(torch.zeros(1, self.backbone_hidden_size)) + + def forward(self, query_layer: torch.Tensor, key_layer: torch.Tensor): + query_layer = self.query(self.drop(query_layer)) + + dummy_vec = self.dummy_node.unsqueeze(0).repeat(1, key_layer.size(1), 1) + key_layer = torch.cat([key_layer, dummy_vec], axis=0) + key_layer = self.key(self.drop(key_layer)) + + query_layer = query_layer.view( + query_layer.size(0), query_layer.size(1), self.n_relations, self.head_hidden_size + ) + key_layer = key_layer.view(key_layer.size(0), key_layer.size(1), self.n_relations, self.head_hidden_size) + + relation_score = torch.matmul( + query_layer.permute(2, 1, 0, 3), key_layer.permute(2, 1, 3, 0) + ) # equivalent to torch.einsum("ibnd,jbnd->nbij", (query_layer, key_layer)) + + return relation_score + + +class BrosPreTrainedModel(PreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. 
+ """ + + config_class = BrosConfig + base_model_prefix = "bros" + + def _init_weights(self, module): + """Initialize the weights""" + if isinstance(module, nn.Linear): + # Slightly different from the TF version which uses truncated_normal for initialization + # cf https://github.com/pytorch/pytorch/pull/5617 + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.Embedding): + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.padding_idx is not None: + module.weight.data[module.padding_idx].zero_() + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + + +@add_start_docstrings( + "The bare Bros Model transformer outputting raw hidden-states without any specific head on top.", + BROS_START_DOCSTRING, +) +class BrosModel(BrosPreTrainedModel): + def __init__(self, config, add_pooling_layer=True): + super().__init__(config) + self.config = config + + self.embeddings = BrosTextEmbeddings(config) + self.bbox_embeddings = BrosBboxEmbeddings(config) + self.encoder = BrosEncoder(config) + + self.pooler = BrosPooler(config) if add_pooling_layer else None + + self.init_weights() + + def get_input_embeddings(self): + return self.embeddings.word_embeddings + + def set_input_embeddings(self, value): + self.embeddings.word_embeddings = value + + def _prune_heads(self, heads_to_prune): + """ + Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base + class PreTrainedModel + """ + for layer, heads in heads_to_prune.items(): + self.encoder.layer[layer].attention.prune_heads(heads) + + @add_start_docstrings_to_model_forward(BROS_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @replace_return_docstrings(output_type=BaseModelOutputWithPoolingAndCrossAttentions, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + bbox: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + token_type_ids: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + inputs_embeds: Optional[torch.Tensor] = None, + encoder_hidden_states: Optional[torch.Tensor] = None, + encoder_attention_mask: Optional[torch.Tensor] = None, + past_key_values: Optional[List[torch.FloatTensor]] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPoolingAndCrossAttentions]: + r""" + Returns: + + Examples: + + ```python + >>> import torch + >>> from transformers import BrosProcessor, BrosModel + + >>> processor = BrosProcessor.from_pretrained("jinho8345/bros-base-uncased") + + >>> model = BrosModel.from_pretrained("jinho8345/bros-base-uncased") + + >>> encoding = processor("Hello, my dog is cute", add_special_tokens=False, return_tensors="pt") + >>> bbox = torch.tensor([[[0, 0, 1, 1]]]).repeat(1, encoding["input_ids"].shape[-1], 1) + >>> encoding["bbox"] = bbox + + >>> outputs = model(**encoding) + >>> last_hidden_states = outputs.last_hidden_state + ```""" + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + 
return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if self.config.is_decoder: + use_cache = use_cache if use_cache is not None else self.config.use_cache + else: + use_cache = False + + if input_ids is not None and inputs_embeds is not None: + raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") + elif input_ids is not None: + input_shape = input_ids.size() + elif inputs_embeds is not None: + input_shape = inputs_embeds.size()[:-1] + else: + raise ValueError("You have to specify either input_ids or inputs_embeds") + + if bbox is None: + raise ValueError("You have to specify bbox") + + batch_size, seq_length = input_shape + device = input_ids.device if input_ids is not None else inputs_embeds.device + + # past_key_values_length + past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0 + + if attention_mask is None: + attention_mask = torch.ones(input_shape, device=device) + + if token_type_ids is None: + if hasattr(self.embeddings, "token_type_ids"): + buffered_token_type_ids = self.embeddings.token_type_ids[:, :seq_length] + buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length) + token_type_ids = buffered_token_type_ids_expanded + else: + token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device) + + # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length] + # ourselves in which case we just need to make it broadcastable to all heads. + extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device) + + # If a 2D or 3D attention mask is provided for the cross-attention + # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length] + if self.config.is_decoder and encoder_hidden_states is not None: + encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size() + encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length) + if encoder_attention_mask is None: + encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device) + encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask) + else: + encoder_extended_attention_mask = None + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x n_heads x N x N + # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] + # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length] + head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers) + + embedding_output = self.embeddings( + input_ids=input_ids, + position_ids=position_ids, + token_type_ids=token_type_ids, + inputs_embeds=inputs_embeds, + past_key_values_length=past_key_values_length, + ) + + # if bbox has 2 points (4 float tensors) per token, convert it to 4 points (8 float tensors) per token + if bbox.shape[-1] == 4: + bbox = bbox[:, :, [0, 1, 2, 1, 2, 3, 0, 3]] + scaled_bbox = bbox * self.config.bbox_scale + bbox_position_embeddings = self.bbox_embeddings(scaled_bbox) + + encoder_outputs = self.encoder( + embedding_output, + bbox_pos_emb=bbox_position_embeddings, + attention_mask=extended_attention_mask, + head_mask=head_mask, + encoder_hidden_states=encoder_hidden_states, + encoder_attention_mask=encoder_extended_attention_mask, + past_key_values=past_key_values, + use_cache=use_cache, + output_attentions=output_attentions, + 
output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + sequence_output = encoder_outputs[0] + pooled_output = self.pooler(sequence_output) if self.pooler is not None else None + + if not return_dict: + return (sequence_output, pooled_output) + encoder_outputs[1:] + + return BaseModelOutputWithPoolingAndCrossAttentions( + last_hidden_state=sequence_output, + pooler_output=pooled_output, + past_key_values=encoder_outputs.past_key_values, + hidden_states=encoder_outputs.hidden_states, + attentions=encoder_outputs.attentions, + cross_attentions=encoder_outputs.cross_attentions, + ) + + +@add_start_docstrings( + """ + Bros Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for + Named-Entity-Recognition (NER) tasks. + """, + BROS_START_DOCSTRING, +) +class BrosForTokenClassification(BrosPreTrainedModel): + _keys_to_ignore_on_load_unexpected = [r"pooler"] + + def __init__(self, config): + super().__init__(config) + self.num_labels = config.num_labels + + self.bros = BrosModel(config) + classifier_dropout = ( + config.classifier_dropout if hasattr(config, "classifier_dropout") else config.hidden_dropout_prob + ) + self.dropout = nn.Dropout(classifier_dropout) + self.classifier = nn.Linear(config.hidden_size, config.num_labels) + + self.init_weights() + + @add_start_docstrings_to_model_forward(BROS_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @replace_return_docstrings(output_type=TokenClassifierOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + bbox: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + bbox_first_token_mask: Optional[torch.Tensor] = None, + token_type_ids: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + inputs_embeds: Optional[torch.Tensor] = None, + labels: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple[torch.Tensor], TokenClassifierOutput]: + r""" + + Returns: + + Examples: + + ```python + >>> import torch + >>> from transformers import BrosProcessor, BrosForTokenClassification + + >>> processor = BrosProcessor.from_pretrained("jinho8345/bros-base-uncased") + + >>> model = BrosForTokenClassification.from_pretrained("jinho8345/bros-base-uncased") + + >>> encoding = processor("Hello, my dog is cute", add_special_tokens=False, return_tensors="pt") + >>> bbox = torch.tensor([[[0, 0, 1, 1]]]).repeat(1, encoding["input_ids"].shape[-1], 1) + >>> encoding["bbox"] = bbox + + >>> outputs = model(**encoding) + ```""" + + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.bros( + input_ids, + bbox=bbox, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + sequence_output = outputs[0] + + sequence_output = self.dropout(sequence_output) + logits = self.classifier(sequence_output) + + loss = None + if labels is not None: + loss_fct = CrossEntropyLoss() + if bbox_first_token_mask is not None: + bbox_first_token_mask = bbox_first_token_mask.view(-1) + loss = loss_fct( + logits.view(-1, self.num_labels)[bbox_first_token_mask], 
labels.view(-1)[bbox_first_token_mask] + ) + else: + loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) + + if not return_dict: + output = (logits,) + outputs[2:] + return ((loss,) + output) if loss is not None else output + + return TokenClassifierOutput( + loss=loss, + logits=logits, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + +@add_start_docstrings( + """ + Bros Model with a token classification head on top (initial_token_layers and subsequent_token_layer on top of the + hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. The initial_token_classifier is used to + predict the first token of each entity, and the subsequent_token_classifier is used to predict the subsequent + tokens within an entity. Compared to BrosForTokenClassification, this model is more robust to serialization errors + since it predicts next token from one token. + """, + BROS_START_DOCSTRING, +) +class BrosSpadeEEForTokenClassification(BrosPreTrainedModel): + _keys_to_ignore_on_load_unexpected = [r"pooler"] + + def __init__(self, config): + super().__init__(config) + self.config = config + self.num_labels = config.num_labels + self.n_relations = config.n_relations + self.backbone_hidden_size = config.hidden_size + + self.bros = BrosModel(config) + classifier_dropout = ( + config.classifier_dropout if hasattr(config, "classifier_dropout") else config.hidden_dropout_prob + ) + + # Initial token classification for Entity Extraction (NER) + self.initial_token_classifier = nn.Sequential( + nn.Dropout(classifier_dropout), + nn.Linear(config.hidden_size, config.hidden_size), + nn.Dropout(classifier_dropout), + nn.Linear(config.hidden_size, config.num_labels), + ) + + # Subsequent token classification for Entity Extraction (NER) + self.subsequent_token_classifier = BrosRelationExtractor(config) + + self.init_weights() + + @add_start_docstrings_to_model_forward(BROS_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @replace_return_docstrings(output_type=BrosSpadeOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + bbox: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + bbox_first_token_mask: Optional[torch.Tensor] = None, + token_type_ids: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + inputs_embeds: Optional[torch.Tensor] = None, + initial_token_labels: Optional[torch.Tensor] = None, + subsequent_token_labels: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple[torch.Tensor], BrosSpadeOutput]: + r""" + Returns: + + Examples: + + ```python + >>> import torch + >>> from transformers import BrosProcessor, BrosSpadeEEForTokenClassification + + >>> processor = BrosProcessor.from_pretrained("jinho8345/bros-base-uncased") + + >>> model = BrosSpadeEEForTokenClassification.from_pretrained("jinho8345/bros-base-uncased") + + >>> encoding = processor("Hello, my dog is cute", add_special_tokens=False, return_tensors="pt") + >>> bbox = torch.tensor([[[0, 0, 1, 1]]]).repeat(1, encoding["input_ids"].shape[-1], 1) + >>> encoding["bbox"] = bbox + + >>> outputs = model(**encoding) + ```""" + + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.bros( + input_ids=input_ids, + bbox=bbox, + attention_mask=attention_mask, + 
token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + last_hidden_states = outputs[0] + last_hidden_states = last_hidden_states.transpose(0, 1).contiguous() + initial_token_logits = self.initial_token_classifier(last_hidden_states).transpose(0, 1).contiguous() + subsequent_token_logits = self.subsequent_token_classifier(last_hidden_states, last_hidden_states).squeeze(0) + + # make subsequent token (sequence token classification) mask + inv_attention_mask = 1 - attention_mask + batch_size, max_seq_length = inv_attention_mask.shape + device = inv_attention_mask.device + invalid_token_mask = torch.cat([inv_attention_mask, torch.zeros([batch_size, 1]).to(device)], axis=1).bool() + subsequent_token_logits = subsequent_token_logits.masked_fill( + invalid_token_mask[:, None, :], torch.finfo(subsequent_token_logits.dtype).min + ) + self_token_mask = torch.eye(max_seq_length, max_seq_length + 1).to(device).bool() + subsequent_token_logits = subsequent_token_logits.masked_fill( + self_token_mask[None, :, :], torch.finfo(subsequent_token_logits.dtype).min + ) + subsequent_token_mask = attention_mask.view(-1).bool() + + loss = None + if initial_token_labels is not None and subsequent_token_labels is not None: + loss_fct = CrossEntropyLoss() + + # get initial token loss + initial_token_labels = initial_token_labels.view(-1) + if bbox_first_token_mask is not None: + bbox_first_token_mask = bbox_first_token_mask.view(-1) + initial_token_loss = loss_fct( + initial_token_logits.view(-1, self.num_labels)[bbox_first_token_mask], + initial_token_labels[bbox_first_token_mask], + ) + else: + initial_token_loss = loss_fct(initial_token_logits.view(-1, self.num_labels), initial_token_labels) + + subsequent_token_labels = subsequent_token_labels.view(-1) + subsequent_token_loss = loss_fct( + subsequent_token_logits.view(-1, max_seq_length + 1)[subsequent_token_mask], + subsequent_token_labels[subsequent_token_mask], + ) + + loss = initial_token_loss + subsequent_token_loss + + if not return_dict: + output = (initial_token_logits, subsequent_token_logits) + outputs[2:] + return ((loss,) + output) if loss is not None else output + + return BrosSpadeOutput( + loss=loss, + initial_token_logits=initial_token_logits, + subsequent_token_logits=subsequent_token_logits, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + +@add_start_docstrings( + """ + Bros Model with a token classification head on top (an entity_linker layer on top of the hidden-states output) e.g. + for Entity-Linking. The entity_linker is used to predict intra-entity links (one entity to another entity).
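Editor's aside on the `subsequent_token_logits` computed in the SPADE EE head above, not part of the patch: the logits have shape `(batch_size, seq_len, seq_len + 1)`, and the sketch below assumes the extra final class plays the usual SPADE role of "no subsequent token"; `follow_entity` is an invented helper, not a method of the model.

```python
import torch

# Editor's sketch, not part of the patch. Assumes subsequent_token_logits has
# shape (batch, seq_len, seq_len + 1) and that the extra final class means
# "no subsequent token" (the usual SPADE convention).
def follow_entity(subsequent_token_logits: torch.Tensor, start: int) -> list:
    seq_len = subsequent_token_logits.shape[1]
    next_token = subsequent_token_logits[0].argmax(dim=-1)  # (seq_len,) predicted links
    chain, current = [start], start
    while next_token[current].item() != seq_len and len(chain) < seq_len:
        current = next_token[current].item()
        chain.append(current)
    return chain  # token indices of one predicted entity, in order
```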
+ """, + BROS_START_DOCSTRING, +) +class BrosSpadeELForTokenClassification(BrosPreTrainedModel): + _keys_to_ignore_on_load_unexpected = [r"pooler"] + + def __init__(self, config): + super().__init__(config) + self.config = config + self.num_labels = config.num_labels + self.n_relations = config.n_relations + self.backbone_hidden_size = config.hidden_size + + self.bros = BrosModel(config) + (config.classifier_dropout if hasattr(config, "classifier_dropout") else config.hidden_dropout_prob) + + self.entity_linker = BrosRelationExtractor(config) + + self.init_weights() + + @add_start_docstrings_to_model_forward(BROS_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @replace_return_docstrings(output_type=TokenClassifierOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + bbox: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + bbox_first_token_mask: Optional[torch.Tensor] = None, + token_type_ids: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + head_mask: Optional[torch.Tensor] = None, + inputs_embeds: Optional[torch.Tensor] = None, + labels: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple[torch.Tensor], TokenClassifierOutput]: + r""" + Returns: + + Examples: + + ```python + >>> import torch + >>> from transformers import BrosProcessor, BrosSpadeELForTokenClassification + + >>> processor = BrosProcessor.from_pretrained("jinho8345/bros-base-uncased") + + >>> model = BrosSpadeELForTokenClassification.from_pretrained("jinho8345/bros-base-uncased") + + >>> encoding = processor("Hello, my dog is cute", add_special_tokens=False, return_tensors="pt") + >>> bbox = torch.tensor([[[0, 0, 1, 1]]]).repeat(1, encoding["input_ids"].shape[-1], 1) + >>> encoding["bbox"] = bbox + + >>> outputs = model(**encoding) + ```""" + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.bros( + input_ids=input_ids, + bbox=bbox, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + last_hidden_states = outputs[0] + last_hidden_states = last_hidden_states.transpose(0, 1).contiguous() + + logits = self.entity_linker(last_hidden_states, last_hidden_states).squeeze(0) + + loss = None + if labels is not None: + loss_fct = CrossEntropyLoss() + + batch_size, max_seq_length = attention_mask.shape + device = attention_mask.device + + self_token_mask = torch.eye(max_seq_length, max_seq_length + 1).to(device).bool() + + mask = bbox_first_token_mask.view(-1) + bbox_first_token_mask = torch.cat( + [ + ~bbox_first_token_mask, + torch.zeros([batch_size, 1], dtype=torch.bool).to(device), + ], + axis=1, + ) + logits = logits.masked_fill(bbox_first_token_mask[:, None, :], torch.finfo(logits.dtype).min) + logits = logits.masked_fill(self_token_mask[None, :, :], torch.finfo(logits.dtype).min) + + loss = loss_fct(logits.view(-1, max_seq_length + 1)[mask], labels.view(-1)[mask]) + + if not return_dict: + output = (logits,) + outputs[2:] + return ((loss,) + output) if loss is not None else output + + return TokenClassifierOutput( + loss=loss, + logits=logits, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + 
) diff --git a/src/transformers/models/bros/processing_bros.py b/src/transformers/models/bros/processing_bros.py new file mode 100644 index 00000000000000..9c2e0642d8cdc4 --- /dev/null +++ b/src/transformers/models/bros/processing_bros.py @@ -0,0 +1,109 @@ +# coding=utf-8 +# Copyright 2023 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Processor class for Bros. +""" + +from typing import List, Optional, Union + +from ...processing_utils import ProcessorMixin +from ...tokenization_utils_base import BatchEncoding, PaddingStrategy, PreTokenizedInput, TextInput, TruncationStrategy +from ...utils import TensorType + + +class BrosProcessor(ProcessorMixin): + r""" + Constructs a Bros processor which wraps a BERT tokenizer. + + [`BrosProcessor`] offers all the functionalities of [`BertTokenizerFast`]. See the docstring of + [`~BrosProcessor.__call__`] and [`~BrosProcessor.decode`] for more information. + + Args: + tokenizer (`BertTokenizerFast`, *optional*): + An instance of [`BertTokenizerFast`]. The tokenizer is a required input. + """ + + attributes = ["tokenizer"] + tokenizer_class = ("BertTokenizer", "BertTokenizerFast") + + def __init__(self, tokenizer=None, **kwargs): + if tokenizer is None: + raise ValueError("You need to specify a `tokenizer`.") + + super().__init__(tokenizer) + + def __call__( + self, + text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None, + add_special_tokens: bool = True, + padding: Union[bool, str, PaddingStrategy] = False, + truncation: Union[bool, str, TruncationStrategy] = None, + max_length: Optional[int] = None, + stride: int = 0, + pad_to_multiple_of: Optional[int] = None, + return_token_type_ids: Optional[bool] = None, + return_attention_mask: Optional[bool] = None, + return_overflowing_tokens: bool = False, + return_special_tokens_mask: bool = False, + return_offsets_mapping: bool = False, + return_length: bool = False, + verbose: bool = True, + return_tensors: Optional[Union[str, TensorType]] = None, + **kwargs, + ) -> BatchEncoding: + """ + This method uses [`BertTokenizerFast.__call__`] to prepare text for the model. + + Please refer to the docstring of this method for more information. + """ + encoding = self.tokenizer( + text=text, + add_special_tokens=add_special_tokens, + padding=padding, + truncation=truncation, + max_length=max_length, + stride=stride, + pad_to_multiple_of=pad_to_multiple_of, + return_token_type_ids=return_token_type_ids, + return_attention_mask=return_attention_mask, + return_overflowing_tokens=return_overflowing_tokens, + return_special_tokens_mask=return_special_tokens_mask, + return_offsets_mapping=return_offsets_mapping, + return_length=return_length, + verbose=verbose, + return_tensors=return_tensors, + **kwargs, + ) + + return encoding + + def batch_decode(self, *args, **kwargs): + """ + This method forwards all its arguments to BertTokenizerFast's [`~PreTrainedTokenizer.batch_decode`].
Please + refer to the docstring of this method for more information. + """ + return self.tokenizer.batch_decode(*args, **kwargs) + + def decode(self, *args, **kwargs): + """ + This method forwards all its arguments to BertTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to + the docstring of this method for more information. + """ + return self.tokenizer.decode(*args, **kwargs) + + @property + def model_input_names(self): + tokenizer_input_names = self.tokenizer.model_input_names + return list(dict.fromkeys(tokenizer_input_names)) diff --git a/src/transformers/models/byt5/tokenization_byt5.py b/src/transformers/models/byt5/tokenization_byt5.py index 59e694c343c559..68c70db0d18d65 100644 --- a/src/transformers/models/byt5/tokenization_byt5.py +++ b/src/transformers/models/byt5/tokenization_byt5.py @@ -16,7 +16,7 @@ import warnings -from typing import Dict, List, Optional, Tuple +from typing import List, Optional, Tuple from ...tokenization_utils import AddedToken, PreTrainedTokenizer from ...utils import logging @@ -48,7 +48,7 @@ class ByT5Tokenizer(PreTrainedTokenizer): token instead. pad_token (`str`, *optional*, defaults to `"<pad>"`): The token used for padding, for example when batching sequences of different lengths. - extra_ids (`int`, *optional*, defaults to 100): + extra_ids (`int`, *optional*, defaults to 125): Add a number of extra ids added to the end of the vocabulary for use as sentinels. These tokens are accessible as "<extra_id_{%d}>" where "{%d}" is a number between 0 and extra_ids-1. Extra tokens are indexed from the end of the vocabulary up to beginning ("<extra_id_0>" is the last token in the vocabulary @@ -72,7 +72,7 @@ def __init__( # Add extra_ids to the special token list if extra_ids > 0 and additional_special_tokens is None: additional_special_tokens = [f"<extra_id_{i}>" for i in range(extra_ids)] - elif extra_ids > 0 and additional_special_tokens is not None: + elif extra_ids > 0 and additional_special_tokens is not None and len(additional_special_tokens) > 0: # Check that we have the right number of extra_id special tokens extra_tokens = len(set(filter(lambda x: bool("extra_id" in str(x)), additional_special_tokens))) if extra_tokens != extra_ids: @@ -82,38 +82,31 @@ def __init__( " extra_ids tokens" ) - pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token - eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token - unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token - + pad_token = AddedToken(pad_token, lstrip=True, rstrip=True) if isinstance(pad_token, str) else pad_token + # we force left and right stripping for backward compatibility. The ByT5 tests depend on this.
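Editor's aside on this tokenization hunk, not part of the patch: with `pad`, `eos` and `unk` held in `_added_tokens_decoder`, the refactored ByT5 tokenizer maps each UTF-8 byte to its ordinal plus a fixed offset of 3. A tiny sketch of that id scheme, written as standalone helpers rather than the tokenizer's actual methods:

```python
# Editor's sketch, not part of the patch: ids 0-2 are pad/eos/unk, and every
# UTF-8 byte b maps to b + offset.
offset = 3  # len(self._added_tokens_decoder) in the new __init__

def byte_token_to_id(token: str) -> int:
    return ord(token) + offset  # mirrors _convert_token_to_id for 1-char tokens

def id_to_byte_token(index: int) -> str:
    return chr(index - offset)  # mirrors _convert_id_to_token

assert byte_token_to_id("A") == 68
assert id_to_byte_token(68) == "A"
```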
+ eos_token = AddedToken(eos_token, lstrip=True, rstrip=True) if isinstance(eos_token, str) else eos_token + unk_token = AddedToken(unk_token, lstrip=True, rstrip=True) if isinstance(unk_token, str) else unk_token + # unk token needs to be in the vocab with correct index + self._added_tokens_decoder = {0: pad_token, 1: eos_token, 2: unk_token} + self.offset = len(self._added_tokens_decoder) + self._utf_vocab_size = 2**8 # utf is 8 bits super().__init__( eos_token=eos_token, unk_token=unk_token, pad_token=pad_token, - extra_ids=extra_ids, - additional_special_tokens=additional_special_tokens, + extra_ids=0, + additional_special_tokens=additional_special_tokens, # TODO extra ids are not used :sweatywmile: **kwargs, ) - self._extra_ids = extra_ids - - self._utf_vocab_size = 2**8 # utf is 8 bits - - # define special tokens dict - self.special_tokens_encoder: Dict[int, str] = { - self.pad_token: 0, - self.eos_token: 1, - self.unk_token: 2, - } - self._num_special_tokens = len(self.special_tokens_encoder) - n = len(additional_special_tokens) - for i, token in enumerate(additional_special_tokens): - self.special_tokens_encoder[token] = self.vocab_size + i - n - self.special_tokens_decoder: Dict[str, int] = {v: k for k, v in self.special_tokens_encoder.items()} - @property def vocab_size(self): - return self._utf_vocab_size + self._num_special_tokens + self._extra_ids + return self._utf_vocab_size + + def get_vocab(self): + vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size + self.offset)} + vocab.update(self.added_tokens_encoder) + return vocab def get_special_tokens_mask( self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False @@ -209,34 +202,25 @@ def _tokenize(self, text: str) -> List[str]: def _convert_token_to_id(self, token): """Converts a token (str) in an id using the vocab.""" - if token in self.special_tokens_encoder: - token_id = self.special_tokens_encoder[token] - elif token in self.added_tokens_encoder: - token_id = self.added_tokens_encoder[token] - elif len(token) != 1: - token_id = self.unk_token_id + + if len(token) != 1: + token_id = None else: - token_id = ord(token) + self._num_special_tokens + token_id = ord(token) + self.offset + return token_id def _convert_id_to_token(self, index): """Converts an index (integer) in a token (str) using the vocab.""" - if index in self.special_tokens_decoder: - token = self.special_tokens_decoder[index] - else: - token = chr(index - self._num_special_tokens) + token = chr(index - self.offset) return token def convert_tokens_to_string(self, tokens): """Converts a sequence of tokens (string) in a single string.""" bstring = b"" for token in tokens: - if token in self.special_tokens_decoder: - tok_string = self.special_tokens_decoder[token].encode("utf-8") - elif token in self.added_tokens_decoder: - tok_string = self.special_tokens_decoder[token].encode("utf-8") - elif token in self.special_tokens_encoder: - tok_string = token.encode("utf-8") + if token in self.added_tokens_decoder: + tok_string = self.added_tokens_decoder[token].encode("utf-8") elif token in self.added_tokens_encoder: tok_string = token.encode("utf-8") else: diff --git a/src/transformers/models/camembert/configuration_camembert.py b/src/transformers/models/camembert/configuration_camembert.py index d712726492ae18..d904c35ad7b7a5 100644 --- a/src/transformers/models/camembert/configuration_camembert.py +++ b/src/transformers/models/camembert/configuration_camembert.py @@ -26,7 +26,7 @@ logger = 
logging.get_logger(__name__) CAMEMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP = { - "camembert-base": "https://huggingface.co/camembert-base/resolve/main/config.json", + "almanach/camembert-base": "https://huggingface.co/almanach/camembert-base/resolve/main/config.json", "umberto-commoncrawl-cased-v1": ( "https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1/resolve/main/config.json" ), @@ -41,7 +41,7 @@ class CamembertConfig(PretrainedConfig): This is the configuration class to store the configuration of a [`CamembertModel`] or a [`TFCamembertModel`]. It is used to instantiate a Camembert model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Camembert - [camembert-base](https://huggingface.co/camembert-base) architecture. + [almanach/camembert-base](https://huggingface.co/almanach/camembert-base) architecture. Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the documentation from [`PretrainedConfig`] for more information. @@ -94,10 +94,10 @@ class CamembertConfig(PretrainedConfig): ```python >>> from transformers import CamembertConfig, CamembertModel - >>> # Initializing a Camembert camembert-base style configuration + >>> # Initializing a Camembert almanach/camembert-base style configuration >>> configuration = CamembertConfig() - >>> # Initializing a model (with random weights) from the camembert-base style configuration + >>> # Initializing a model (with random weights) from the almanach/camembert-base style configuration >>> model = CamembertModel(configuration) >>> # Accessing the model configuration diff --git a/src/transformers/models/camembert/modeling_camembert.py b/src/transformers/models/camembert/modeling_camembert.py index 81352b9cca2c21..cd0b329b6ae00d 100644 --- a/src/transformers/models/camembert/modeling_camembert.py +++ b/src/transformers/models/camembert/modeling_camembert.py @@ -48,11 +48,11 @@ logger = logging.get_logger(__name__) -_CHECKPOINT_FOR_DOC = "camembert-base" +_CHECKPOINT_FOR_DOC = "almanach/camembert-base" _CONFIG_FOR_DOC = "CamembertConfig" CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [ - "camembert-base", + "almanach/camembert-base", "Musixmatch/umberto-commoncrawl-cased-v1", "Musixmatch/umberto-wikipedia-uncased-v1", # See all CamemBERT models at https://huggingface.co/models?filter=camembert @@ -94,7 +94,9 @@ def __init__(self, config): self.dropout = nn.Dropout(config.hidden_dropout_prob) # position_ids (1, len position emb) is contiguous in memory and exported when serialized self.position_embedding_type = getattr(config, "position_embedding_type", "absolute") - self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1))) + self.register_buffer( + "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False + ) self.register_buffer( "token_type_ids", torch.zeros(self.position_ids.size(), dtype=torch.long), persistent=False ) @@ -506,6 +508,13 @@ def forward( all_self_attentions = () if output_attentions else None all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning_once( + "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." 
+ ) + use_cache = False + next_decoder_cache = () if use_cache else None for i, layer_module in enumerate(self.layer): if output_hidden_states: @@ -515,25 +524,15 @@ def forward( past_key_value = past_key_values[i] if past_key_values is not None else None if self.gradient_checkpointing and self.training: - if use_cache: - logger.warning( - "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." - ) - use_cache = False - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, past_key_value, output_attentions) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(layer_module), + layer_outputs = self._gradient_checkpointing_func( + layer_module.__call__, hidden_states, attention_mask, layer_head_mask, encoder_hidden_states, encoder_attention_mask, + past_key_value, + output_attentions, ) else: layer_outputs = layer_module( @@ -621,19 +620,6 @@ def _init_weights(self, module): module.bias.data.zero_() module.weight.data.fill_(1.0) - def _set_gradient_checkpointing(self, module, value=False): - if isinstance(module, CamembertEncoder): - module.gradient_checkpointing = value - - def update_keys_to_ignore(self, config, del_keys_to_ignore): - """Remove some keys from ignore list""" - if not config.tie_word_embeddings: - # must make a new list, or the class variable gets modified! - self._keys_to_ignore_on_save = [k for k in self._keys_to_ignore_on_save if k not in del_keys_to_ignore] - self._keys_to_ignore_on_load_missing = [ - k for k in self._keys_to_ignore_on_load_missing if k not in del_keys_to_ignore - ] - CAMEMBERT_INPUTS_DOCSTRING = r""" Args: @@ -760,7 +746,6 @@ class CamembertModel(CamembertPreTrainedModel): """ - _keys_to_ignore_on_load_missing = [r"position_ids"] _no_split_modules = [] # Copied from transformers.models.bert.modeling_bert.BertModel.__init__ with Bert->Camembert @@ -847,6 +832,7 @@ def forward( if input_ids is not None and inputs_embeds is not None: raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") elif input_ids is not None: + self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask) input_shape = input_ids.size() elif inputs_embeds is not None: input_shape = inputs_embeds.size()[:-1] @@ -933,9 +919,7 @@ def forward( ) # Copied from transformers.models.roberta.modeling_roberta.RobertaForMaskedLM with Roberta->Camembert, ROBERTA->CAMEMBERT class CamembertForMaskedLM(CamembertPreTrainedModel): - _keys_to_ignore_on_save = [r"lm_head.decoder.weight", r"lm_head.decoder.bias"] - _keys_to_ignore_on_load_missing = [r"position_ids", r"lm_head.decoder.weight", r"lm_head.decoder.bias"] - _keys_to_ignore_on_load_unexpected = [r"pooler"] + _tied_weights_keys = ["lm_head.decoder.weight", "lm_head.decoder.bias"] def __init__(self, config): super().__init__(config) @@ -949,9 +933,6 @@ def __init__(self, config): self.roberta = CamembertModel(config, add_pooling_layer=False) self.lm_head = CamembertLMHead(config) - # The LM head weights require special treatment only when they are tied with the word embeddings - self.update_keys_to_ignore(config, ["lm_head.decoder.weight"]) - # Initialize weights and apply final processing self.post_init() @@ -1013,6 +994,8 @@ def forward( masked_lm_loss = None if labels is not None: + # move labels to correct device to enable model parallelism + labels = labels.to(prediction_scores.device) loss_fct = CrossEntropyLoss() masked_lm_loss = loss_fct(prediction_scores.view(-1, 
self.config.vocab_size), labels.view(-1)) @@ -1037,8 +1020,6 @@ def forward( ) # Copied from transformers.models.roberta.modeling_roberta.RobertaForSequenceClassification with Roberta->Camembert, ROBERTA->CAMEMBERT class CamembertForSequenceClassification(CamembertPreTrainedModel): - _keys_to_ignore_on_load_missing = [r"position_ids"] - def __init__(self, config): super().__init__(config) self.num_labels = config.num_labels @@ -1095,6 +1076,8 @@ def forward( loss = None if labels is not None: + # move labels to correct device to enable model parallelism + labels = labels.to(logits.device) if self.config.problem_type is None: if self.num_labels == 1: self.config.problem_type = "regression" @@ -1137,8 +1120,6 @@ def forward( ) # Copied from transformers.models.roberta.modeling_roberta.RobertaForMultipleChoice with Roberta->Camembert, ROBERTA->CAMEMBERT class CamembertForMultipleChoice(CamembertPreTrainedModel): - _keys_to_ignore_on_load_missing = [r"position_ids"] - def __init__(self, config): super().__init__(config) @@ -1208,6 +1189,8 @@ def forward( loss = None if labels is not None: + # move labels to correct device to enable model parallelism + labels = labels.to(reshaped_logits.device) loss_fct = CrossEntropyLoss() loss = loss_fct(reshaped_logits, labels) @@ -1232,9 +1215,6 @@ def forward( ) # Copied from transformers.models.roberta.modeling_roberta.RobertaForTokenClassification with Roberta->Camembert, ROBERTA->CAMEMBERT class CamembertForTokenClassification(CamembertPreTrainedModel): - _keys_to_ignore_on_load_unexpected = [r"pooler"] - _keys_to_ignore_on_load_missing = [r"position_ids"] - def __init__(self, config): super().__init__(config) self.num_labels = config.num_labels @@ -1295,6 +1275,8 @@ def forward( loss = None if labels is not None: + # move labels to correct device to enable model parallelism + labels = labels.to(logits.device) loss_fct = CrossEntropyLoss() loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) @@ -1319,9 +1301,6 @@ def forward( ) # Copied from transformers.models.roberta.modeling_roberta.RobertaForQuestionAnswering with Roberta->Camembert, ROBERTA->CAMEMBERT class CamembertForQuestionAnswering(CamembertPreTrainedModel): - _keys_to_ignore_on_load_unexpected = [r"pooler"] - _keys_to_ignore_on_load_missing = [r"position_ids"] - def __init__(self, config): super().__init__(config) self.num_labels = config.num_labels @@ -1418,11 +1397,9 @@ def forward( @add_start_docstrings( """CamemBERT Model with a `language modeling` head on top for CLM fine-tuning.""", CAMEMBERT_START_DOCSTRING ) -# Copied from transformers.models.roberta.modeling_roberta.RobertaForCausalLM with Roberta->Camembert, ROBERTA->CAMEMBERT, roberta-base->camembert-base +# Copied from transformers.models.roberta.modeling_roberta.RobertaForCausalLM with Roberta->Camembert, ROBERTA->CAMEMBERT, FacebookAI/roberta-base->almanach/camembert-base class CamembertForCausalLM(CamembertPreTrainedModel): - _keys_to_ignore_on_save = [r"lm_head.decoder.weight", r"lm_head.decoder.bias"] - _keys_to_ignore_on_load_missing = [r"position_ids", r"lm_head.decoder.weight", r"lm_head.decoder.bias"] - _keys_to_ignore_on_load_unexpected = [r"pooler"] + _tied_weights_keys = ["lm_head.decoder.weight", "lm_head.decoder.bias"] def __init__(self, config): super().__init__(config) @@ -1433,9 +1410,6 @@ def __init__(self, config): self.roberta = CamembertModel(config, add_pooling_layer=False) self.lm_head = CamembertLMHead(config) - # The LM head weights require special treatment only when they are tied with the 
word embeddings - self.update_keys_to_ignore(config, ["lm_head.decoder.weight"]) - # Initialize weights and apply final processing self.post_init() @@ -1497,10 +1471,10 @@ def forward( >>> from transformers import AutoTokenizer, CamembertForCausalLM, AutoConfig >>> import torch - >>> tokenizer = AutoTokenizer.from_pretrained("camembert-base") - >>> config = AutoConfig.from_pretrained("camembert-base") + >>> tokenizer = AutoTokenizer.from_pretrained("almanach/camembert-base") + >>> config = AutoConfig.from_pretrained("almanach/camembert-base") >>> config.is_decoder = True - >>> model = CamembertForCausalLM.from_pretrained("camembert-base", config=config) + >>> model = CamembertForCausalLM.from_pretrained("almanach/camembert-base", config=config) >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt") >>> outputs = model(**inputs) @@ -1532,6 +1506,8 @@ def forward( lm_loss = None if labels is not None: + # move labels to correct device to enable model parallelism + labels = labels.to(prediction_scores.device) # we are doing next-token prediction; shift prediction scores and input ids by one shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous() labels = labels[:, 1:].contiguous() @@ -1557,16 +1533,27 @@ def prepare_inputs_for_generation(self, input_ids, past_key_values=None, attenti if attention_mask is None: attention_mask = input_ids.new_ones(input_shape) - # cut decoder_input_ids if past is used + # cut decoder_input_ids if past_key_values is used if past_key_values is not None: - input_ids = input_ids[:, -1:] + past_length = past_key_values[0][0].shape[2] + + # Some generation methods already pass only the last input ID + if input_ids.shape[1] > past_length: + remove_prefix_length = past_length + else: + # Default to old behavior: keep only final ID + remove_prefix_length = input_ids.shape[1] - 1 + + input_ids = input_ids[:, remove_prefix_length:] return {"input_ids": input_ids, "attention_mask": attention_mask, "past_key_values": past_key_values} def _reorder_cache(self, past_key_values, beam_idx): reordered_past = () for layer_past in past_key_values: - reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),) + reordered_past += ( + tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past), + ) return reordered_past diff --git a/src/transformers/models/camembert/modeling_tf_camembert.py b/src/transformers/models/camembert/modeling_tf_camembert.py index 5142b3d82b04cb..e3e3fca4cef440 100644 --- a/src/transformers/models/camembert/modeling_tf_camembert.py +++ b/src/transformers/models/camembert/modeling_tf_camembert.py @@ -15,6 +15,9 @@ # limitations under the License. 
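Editor's aside on the `prepare_inputs_for_generation` hunk above, not part of the patch: the new code trims exactly the prefix already covered by `past_key_values` instead of unconditionally keeping only the last id. A toy illustration with invented values:

```python
import torch

# Editor's toy illustration, not part of the patch; all values are invented.
# Only the ids that the cached past_key_values do not already cover are kept.
input_ids = torch.tensor([[5, 8, 13, 21]])
past_length = 3  # pretend the cache already holds 3 positions
if input_ids.shape[1] > past_length:
    remove_prefix_length = past_length
else:
    remove_prefix_length = input_ids.shape[1] - 1  # old behaviour: keep only the last id
input_ids = input_ids[:, remove_prefix_length:]  # -> tensor([[21]])
```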
""" TF 2.0 CamemBERT model.""" + +from __future__ import annotations + import math import warnings from typing import Optional, Tuple, Union @@ -43,13 +46,12 @@ TFSequenceClassificationLoss, TFTokenClassificationLoss, get_initializer, + keras, keras_serializable, unpack_inputs, ) -from ...tf_utils import shape_list, stable_softmax +from ...tf_utils import check_embeddings_within_bounds, shape_list, stable_softmax from ...utils import ( - DUMMY_INPUTS, - MULTIPLE_CHOICE_DUMMY_INPUTS, add_code_sample_docstrings, add_start_docstrings, add_start_docstrings_to_model_forward, @@ -60,7 +62,7 @@ logger = logging.get_logger(__name__) -_CHECKPOINT_FOR_DOC = "camembert-base" +_CHECKPOINT_FOR_DOC = "almanach/camembert-base" _CONFIG_FOR_DOC = "CamembertConfig" TF_CAMEMBERT_PRETRAINED_MODEL_ARCHIVE_LIST = [ @@ -74,7 +76,7 @@ library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.) - This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it + This model is also a [keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage and behavior. @@ -167,7 +169,7 @@ # Copied from transformers.models.roberta.modeling_tf_roberta.TFRobertaEmbeddings -class TFCamembertEmbeddings(tf.keras.layers.Layer): +class TFCamembertEmbeddings(keras.layers.Layer): """ Same as BertEmbeddings with a tiny tweak for positional embeddings indexing. """ @@ -180,10 +182,10 @@ def __init__(self, config, **kwargs): self.hidden_size = config.hidden_size self.max_position_embeddings = config.max_position_embeddings self.initializer_range = config.initializer_range - self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") - self.dropout = tf.keras.layers.Dropout(rate=config.hidden_dropout_prob) + self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") + self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob) - def build(self, input_shape: tf.TensorShape): + def build(self, input_shape=None): with tf.name_scope("word_embeddings"): self.weight = self.add_weight( name="weight", @@ -205,7 +207,12 @@ def build(self, input_shape: tf.TensorShape): initializer=get_initializer(self.initializer_range), ) - super().build(input_shape) + if self.built: + return + self.built = True + if getattr(self, "LayerNorm", None) is not None: + with tf.name_scope(self.LayerNorm.name): + self.LayerNorm.build([None, None, self.config.hidden_size]) def create_position_ids_from_input_ids(self, input_ids, past_key_values_length=0): """ @@ -239,16 +246,7 @@ def call( assert not (input_ids is None and inputs_embeds is None) if input_ids is not None: - # Note: tf.gather, on which the embedding layer is based, won't check positive out of bound - # indices on GPU, returning zeros instead. This is a dangerous silent behavior. 
- tf.debugging.assert_less( - input_ids, - tf.cast(self.config.vocab_size, dtype=input_ids.dtype), - message=( - "input_ids must be smaller than the embedding layer's input dimension (got" - f" {tf.math.reduce_max(input_ids)} >= {self.config.vocab_size})" - ), - ) + check_embeddings_within_bounds(input_ids, self.config.vocab_size) inputs_embeds = tf.gather(params=self.weight, indices=input_ids) input_shape = shape_list(inputs_embeds)[:-1] @@ -277,16 +275,17 @@ def call( # Copied from transformers.models.bert.modeling_tf_bert.TFBertPooler with Bert->Camembert -class TFCamembertPooler(tf.keras.layers.Layer): +class TFCamembertPooler(keras.layers.Layer): def __init__(self, config: CamembertConfig, **kwargs): super().__init__(**kwargs) - self.dense = tf.keras.layers.Dense( + self.dense = keras.layers.Dense( units=config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), activation="tanh", name="dense", ) + self.config = config def call(self, hidden_states: tf.Tensor) -> tf.Tensor: # We "pool" the model by simply taking the hidden state corresponding @@ -296,9 +295,17 @@ def call(self, hidden_states: tf.Tensor) -> tf.Tensor: return pooled_output + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "dense", None) is not None: + with tf.name_scope(self.dense.name): + self.dense.build([None, None, self.config.hidden_size]) + # Copied from transformers.models.bert.modeling_tf_bert.TFBertSelfAttention with Bert->Camembert -class TFCamembertSelfAttention(tf.keras.layers.Layer): +class TFCamembertSelfAttention(keras.layers.Layer): def __init__(self, config: CamembertConfig, **kwargs): super().__init__(**kwargs) @@ -313,18 +320,19 @@ def __init__(self, config: CamembertConfig, **kwargs): self.all_head_size = self.num_attention_heads * self.attention_head_size self.sqrt_att_head_size = math.sqrt(self.attention_head_size) - self.query = tf.keras.layers.Dense( + self.query = keras.layers.Dense( units=self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="query" ) - self.key = tf.keras.layers.Dense( + self.key = keras.layers.Dense( units=self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="key" ) - self.value = tf.keras.layers.Dense( + self.value = keras.layers.Dense( units=self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="value" ) - self.dropout = tf.keras.layers.Dropout(rate=config.attention_probs_dropout_prob) + self.dropout = keras.layers.Dropout(rate=config.attention_probs_dropout_prob) self.is_decoder = config.is_decoder + self.config = config def transpose_for_scores(self, tensor: tf.Tensor, batch_size: int) -> tf.Tensor: # Reshape from [batch_size, seq_length, all_head_size] to [batch_size, seq_length, num_attention_heads, attention_head_size] @@ -414,17 +422,32 @@ def call( outputs = outputs + (past_key_value,) return outputs + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "query", None) is not None: + with tf.name_scope(self.query.name): + self.query.build([None, None, self.config.hidden_size]) + if getattr(self, "key", None) is not None: + with tf.name_scope(self.key.name): + self.key.build([None, None, self.config.hidden_size]) + if getattr(self, "value", None) is not None: + with tf.name_scope(self.value.name): + self.value.build([None, None, self.config.hidden_size]) + # Copied from transformers.models.bert.modeling_tf_bert.TFBertSelfOutput with Bert->Camembert 
-class TFCamembertSelfOutput(tf.keras.layers.Layer): +class TFCamembertSelfOutput(keras.layers.Layer): def __init__(self, config: CamembertConfig, **kwargs): super().__init__(**kwargs) - self.dense = tf.keras.layers.Dense( + self.dense = keras.layers.Dense( units=config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="dense" ) - self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") - self.dropout = tf.keras.layers.Dropout(rate=config.hidden_dropout_prob) + self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") + self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob) + self.config = config def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, training: bool = False) -> tf.Tensor: hidden_states = self.dense(inputs=hidden_states) @@ -433,9 +456,20 @@ def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, training: bool return hidden_states + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "dense", None) is not None: + with tf.name_scope(self.dense.name): + self.dense.build([None, None, self.config.hidden_size]) + if getattr(self, "LayerNorm", None) is not None: + with tf.name_scope(self.LayerNorm.name): + self.LayerNorm.build([None, None, self.config.hidden_size]) + # Copied from transformers.models.bert.modeling_tf_bert.TFBertAttention with Bert->Camembert -class TFCamembertAttention(tf.keras.layers.Layer): +class TFCamembertAttention(keras.layers.Layer): def __init__(self, config: CamembertConfig, **kwargs): super().__init__(**kwargs) @@ -474,13 +508,24 @@ def call( return outputs + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "self_attention", None) is not None: + with tf.name_scope(self.self_attention.name): + self.self_attention.build(None) + if getattr(self, "dense_output", None) is not None: + with tf.name_scope(self.dense_output.name): + self.dense_output.build(None) + # Copied from transformers.models.bert.modeling_tf_bert.TFBertIntermediate with Bert->Camembert -class TFCamembertIntermediate(tf.keras.layers.Layer): +class TFCamembertIntermediate(keras.layers.Layer): def __init__(self, config: CamembertConfig, **kwargs): super().__init__(**kwargs) - self.dense = tf.keras.layers.Dense( + self.dense = keras.layers.Dense( units=config.intermediate_size, kernel_initializer=get_initializer(config.initializer_range), name="dense" ) @@ -488,6 +533,7 @@ def __init__(self, config: CamembertConfig, **kwargs): self.intermediate_act_fn = get_tf_activation(config.hidden_act) else: self.intermediate_act_fn = config.hidden_act + self.config = config def call(self, hidden_states: tf.Tensor) -> tf.Tensor: hidden_states = self.dense(inputs=hidden_states) @@ -495,17 +541,26 @@ def call(self, hidden_states: tf.Tensor) -> tf.Tensor: return hidden_states + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "dense", None) is not None: + with tf.name_scope(self.dense.name): + self.dense.build([None, None, self.config.hidden_size]) + # Copied from transformers.models.bert.modeling_tf_bert.TFBertOutput with Bert->Camembert -class TFCamembertOutput(tf.keras.layers.Layer): +class TFCamembertOutput(keras.layers.Layer): def __init__(self, config: CamembertConfig, **kwargs): super().__init__(**kwargs) - self.dense = tf.keras.layers.Dense( + self.dense = keras.layers.Dense( units=config.hidden_size, 
kernel_initializer=get_initializer(config.initializer_range), name="dense" ) - self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") - self.dropout = tf.keras.layers.Dropout(rate=config.hidden_dropout_prob) + self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm") + self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob) + self.config = config def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, training: bool = False) -> tf.Tensor: hidden_states = self.dense(inputs=hidden_states) @@ -514,9 +569,20 @@ def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, training: bool return hidden_states + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "dense", None) is not None: + with tf.name_scope(self.dense.name): + self.dense.build([None, None, self.config.intermediate_size]) + if getattr(self, "LayerNorm", None) is not None: + with tf.name_scope(self.LayerNorm.name): + self.LayerNorm.build([None, None, self.config.hidden_size]) + # Copied from transformers.models.bert.modeling_tf_bert.TFBertLayer with Bert->Camembert -class TFCamembertLayer(tf.keras.layers.Layer): +class TFCamembertLayer(keras.layers.Layer): def __init__(self, config: CamembertConfig, **kwargs): super().__init__(**kwargs) @@ -535,9 +601,9 @@ def call( hidden_states: tf.Tensor, attention_mask: tf.Tensor, head_mask: tf.Tensor, - encoder_hidden_states: Optional[tf.Tensor], - encoder_attention_mask: Optional[tf.Tensor], - past_key_value: Optional[Tuple[tf.Tensor]], + encoder_hidden_states: tf.Tensor | None, + encoder_attention_mask: tf.Tensor | None, + past_key_value: Tuple[tf.Tensor] | None, output_attentions: bool, training: bool = False, ) -> Tuple[tf.Tensor]: @@ -601,9 +667,26 @@ def call( return outputs + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "attention", None) is not None: + with tf.name_scope(self.attention.name): + self.attention.build(None) + if getattr(self, "intermediate", None) is not None: + with tf.name_scope(self.intermediate.name): + self.intermediate.build(None) + if getattr(self, "bert_output", None) is not None: + with tf.name_scope(self.bert_output.name): + self.bert_output.build(None) + if getattr(self, "crossattention", None) is not None: + with tf.name_scope(self.crossattention.name): + self.crossattention.build(None) + # Copied from transformers.models.bert.modeling_tf_bert.TFBertEncoder with Bert->Camembert -class TFCamembertEncoder(tf.keras.layers.Layer): +class TFCamembertEncoder(keras.layers.Layer): def __init__(self, config: CamembertConfig, **kwargs): super().__init__(**kwargs) self.config = config @@ -614,9 +697,9 @@ def call( hidden_states: tf.Tensor, attention_mask: tf.Tensor, head_mask: tf.Tensor, - encoder_hidden_states: Optional[tf.Tensor], - encoder_attention_mask: Optional[tf.Tensor], - past_key_values: Optional[Tuple[Tuple[tf.Tensor]]], + encoder_hidden_states: tf.Tensor | None, + encoder_attention_mask: tf.Tensor | None, + past_key_values: Tuple[Tuple[tf.Tensor]] | None, use_cache: Optional[bool], output_attentions: bool, output_hidden_states: bool, @@ -671,10 +754,19 @@ def call( cross_attentions=all_cross_attentions, ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "layer", None) is not None: + for layer in self.layer: + with tf.name_scope(layer.name): + layer.build(None) + @keras_serializable # Copied 
from transformers.models.roberta.modeling_tf_roberta.TFRobertaMainLayer with Roberta->Camembert -class TFCamembertMainLayer(tf.keras.layers.Layer): +class TFCamembertMainLayer(keras.layers.Layer): config_class = CamembertConfig def __init__(self, config, add_pooling_layer=True, **kwargs): @@ -694,7 +786,7 @@ def __init__(self, config, add_pooling_layer=True, **kwargs): self.embeddings = TFCamembertEmbeddings(config, name="embeddings") # Copied from transformers.models.bert.modeling_tf_bert.TFBertMainLayer.get_input_embeddings - def get_input_embeddings(self) -> tf.keras.layers.Layer: + def get_input_embeddings(self) -> keras.layers.Layer: return self.embeddings # Copied from transformers.models.bert.modeling_tf_bert.TFBertMainLayer.set_input_embeddings @@ -714,14 +806,14 @@ class PreTrainedModel # Copied from transformers.models.bert.modeling_tf_bert.TFBertMainLayer.call def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, - encoder_hidden_states: Optional[Union[np.ndarray, tf.Tensor]] = None, - encoder_attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, + encoder_hidden_states: np.ndarray | tf.Tensor | None = None, + encoder_attention_mask: np.ndarray | tf.Tensor | None = None, past_key_values: Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None, use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None, @@ -869,6 +961,20 @@ def call( cross_attentions=encoder_outputs.cross_attentions, ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "encoder", None) is not None: + with tf.name_scope(self.encoder.name): + self.encoder.build(None) + if getattr(self, "pooler", None) is not None: + with tf.name_scope(self.pooler.name): + self.pooler.build(None) + if getattr(self, "embeddings", None) is not None: + with tf.name_scope(self.embeddings.name): + self.embeddings.build(None) + class TFCamembertPreTrainedModel(TFPreTrainedModel): """ @@ -879,38 +985,6 @@ class TFCamembertPreTrainedModel(TFPreTrainedModel): config_class = CamembertConfig base_model_prefix = "roberta" - @property - # Copied from transformers.models.bert.modeling_tf_bert.TFBertPreTrainedModel.dummy_inputs - def dummy_inputs(self): - """ - Dummy inputs to build the network. - - Returns: - `Dict[str, tf.Tensor]`: The dummy inputs. 
- """ - dummy = {"input_ids": tf.constant(DUMMY_INPUTS, dtype=tf.int32)} - # Add `encoder_hidden_states` to make the cross-attention layers' weights initialized - if self.config.add_cross_attention: - batch_size, seq_len = tf.constant(DUMMY_INPUTS).shape - shape = (batch_size, seq_len) + (self.config.hidden_size,) - h = tf.random.uniform(shape=shape) - dummy["encoder_hidden_states"] = h - - return dummy - - @tf.function( - input_signature=[ - { - "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"), - "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"), - } - ] - ) - def serving(self, inputs): - output = self.call(inputs) - - return self.serving_output(output) - @add_start_docstrings( "The bare CamemBERT Model transformer outputting raw hidden-states without any specific head on top.", @@ -931,14 +1005,14 @@ def __init__(self, config, *inputs, **kwargs): ) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, - encoder_hidden_states: Optional[Union[np.ndarray, tf.Tensor]] = None, - encoder_attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, + encoder_hidden_states: np.ndarray | tf.Tensor | None = None, + encoder_attention_mask: np.ndarray | tf.Tensor | None = None, past_key_values: Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None, use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None, @@ -985,30 +1059,17 @@ def call( return outputs - # Copied from transformers.models.bert.modeling_tf_bert.TFBertModel.serving_output - def serving_output( - self, output: TFBaseModelOutputWithPoolingAndCrossAttentions - ) -> TFBaseModelOutputWithPoolingAndCrossAttentions: - output_cache = self.config.use_cache and self.config.is_decoder - pkv = tf.convert_to_tensor(output.past_key_values) if output_cache else None - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - cross_attns = tf.convert_to_tensor(output.cross_attentions) if output.cross_attentions is not None else None - if not (self.config.output_attentions and self.config.add_cross_attention): - cross_attns = None - - return TFBaseModelOutputWithPoolingAndCrossAttentions( - last_hidden_state=output.last_hidden_state, - pooler_output=output.pooler_output, - past_key_values=pkv, - hidden_states=hs, - attentions=attns, - cross_attentions=cross_attns, - ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "roberta", None) is not None: + with tf.name_scope(self.roberta.name): + self.roberta.build(None) # Copied from transformers.models.roberta.modeling_tf_roberta.TFRobertaLMHead with Roberta->Camembert -class TFCamembertLMHead(tf.keras.layers.Layer): +class TFCamembertLMHead(keras.layers.Layer): """Camembert Head for masked language modeling.""" def 
__init__(self, config, input_embeddings, **kwargs): @@ -1016,20 +1077,28 @@ def __init__(self, config, input_embeddings, **kwargs): self.config = config self.hidden_size = config.hidden_size - self.dense = tf.keras.layers.Dense( + self.dense = keras.layers.Dense( config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="dense" ) - self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layer_norm") + self.layer_norm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layer_norm") self.act = get_tf_activation("gelu") # The output weights are the same as the input embeddings, but there is # an output-only bias for each token. self.decoder = input_embeddings - def build(self, input_shape): + def build(self, input_shape=None): self.bias = self.add_weight(shape=(self.config.vocab_size,), initializer="zeros", trainable=True, name="bias") - super().build(input_shape) + if self.built: + return + self.built = True + if getattr(self, "dense", None) is not None: + with tf.name_scope(self.dense.name): + self.dense.build([None, None, self.config.hidden_size]) + if getattr(self, "layer_norm", None) is not None: + with tf.name_scope(self.layer_norm.name): + self.layer_norm.build([None, None, self.config.hidden_size]) def get_output_embeddings(self): return self.decoder @@ -1094,16 +1163,16 @@ def get_prefix_bias_name(self): ) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - labels: Optional[Union[np.ndarray, tf.Tensor]] = None, + labels: np.ndarray | tf.Tensor | None = None, training: Optional[bool] = False, ) -> Union[TFMaskedLMOutput, Tuple[tf.Tensor]]: r""" @@ -1141,21 +1210,25 @@ def call( attentions=outputs.attentions, ) - # Copied from transformers.models.bert.modeling_tf_bert.TFBertForMaskedLM.serving_output - def serving_output(self, output: TFMaskedLMOutput) -> TFMaskedLMOutput: - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - - return TFMaskedLMOutput(logits=output.logits, hidden_states=hs, attentions=attns) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "roberta", None) is not None: + with tf.name_scope(self.roberta.name): + self.roberta.build(None) + if getattr(self, "lm_head", None) is not None: + with tf.name_scope(self.lm_head.name): + self.lm_head.build(None) # Copied from transformers.models.roberta.modeling_tf_roberta.TFRobertaClassificationHead -class TFCamembertClassificationHead(tf.keras.layers.Layer): +class TFCamembertClassificationHead(keras.layers.Layer): """Head for sentence-level classification tasks.""" def __init__(self, config, **kwargs): 
super().__init__(**kwargs) - self.dense = tf.keras.layers.Dense( + self.dense = keras.layers.Dense( config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), activation="tanh", @@ -1164,10 +1237,11 @@ def __init__(self, config, **kwargs): classifier_dropout = ( config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob ) - self.dropout = tf.keras.layers.Dropout(classifier_dropout) - self.out_proj = tf.keras.layers.Dense( + self.dropout = keras.layers.Dropout(classifier_dropout) + self.out_proj = keras.layers.Dense( config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="out_proj" ) + self.config = config def call(self, features, training=False): x = features[:, 0, :] # take token (equiv. to [CLS]) @@ -1177,6 +1251,17 @@ def call(self, features, training=False): x = self.out_proj(x) return x + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "dense", None) is not None: + with tf.name_scope(self.dense.name): + self.dense.build([None, None, self.config.hidden_size]) + if getattr(self, "out_proj", None) is not None: + with tf.name_scope(self.out_proj.name): + self.out_proj.build([None, None, self.config.hidden_size]) + @add_start_docstrings( """ @@ -1208,16 +1293,16 @@ def __init__(self, config, *inputs, **kwargs): ) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - labels: Optional[Union[np.ndarray, tf.Tensor]] = None, + labels: np.ndarray | tf.Tensor | None = None, training: Optional[bool] = False, ) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]: r""" @@ -1254,12 +1339,16 @@ def call( attentions=outputs.attentions, ) - # Copied from transformers.models.bert.modeling_tf_bert.TFBertForSequenceClassification.serving_output - def serving_output(self, output: TFSequenceClassifierOutput) -> TFSequenceClassifierOutput: - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - - return TFSequenceClassifierOutput(logits=output.logits, hidden_states=hs, attentions=attns) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "roberta", None) is not None: + with tf.name_scope(self.roberta.name): + self.roberta.build(None) + if getattr(self, "classifier", None) is not None: + with tf.name_scope(self.classifier.name): + self.classifier.build(None) @add_start_docstrings( @@ -1283,10 +1372,11 @@ def __init__(self, config, *inputs, **kwargs): classifier_dropout = ( config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob ) - self.dropout = 
tf.keras.layers.Dropout(classifier_dropout) - self.classifier = tf.keras.layers.Dense( + self.dropout = keras.layers.Dropout(classifier_dropout) + self.classifier = keras.layers.Dense( config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier" ) + self.config = config @unpack_inputs @add_start_docstrings_to_model_forward(CAMEMBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length")) @@ -1299,16 +1389,16 @@ def __init__(self, config, *inputs, **kwargs): ) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - labels: Optional[Union[np.ndarray, tf.Tensor]] = None, + labels: np.ndarray | tf.Tensor | None = None, training: Optional[bool] = False, ) -> Union[TFTokenClassifierOutput, Tuple[tf.Tensor]]: r""" @@ -1345,12 +1435,16 @@ def call( attentions=outputs.attentions, ) - # Copied from transformers.models.bert.modeling_tf_bert.TFBertForTokenClassification.serving_output - def serving_output(self, output: TFTokenClassifierOutput) -> TFTokenClassifierOutput: - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - - return TFTokenClassifierOutput(logits=output.logits, hidden_states=hs, attentions=attns) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "roberta", None) is not None: + with tf.name_scope(self.roberta.name): + self.roberta.build(None) + if getattr(self, "classifier", None) is not None: + with tf.name_scope(self.classifier.name): + self.classifier.build([None, None, self.config.hidden_size]) @add_start_docstrings( @@ -1370,20 +1464,11 @@ def __init__(self, config, *inputs, **kwargs): super().__init__(config, *inputs, **kwargs) self.roberta = TFCamembertMainLayer(config, name="roberta") - self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob) - self.classifier = tf.keras.layers.Dense( + self.dropout = keras.layers.Dropout(config.hidden_dropout_prob) + self.classifier = keras.layers.Dense( 1, kernel_initializer=get_initializer(config.initializer_range), name="classifier" ) - - @property - def dummy_inputs(self): - """ - Dummy inputs to build the network. 
- - Returns: - tf.Tensor with dummy inputs - """ - return {"input_ids": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS, dtype=tf.int32)} + self.config = config @unpack_inputs @add_start_docstrings_to_model_forward( @@ -1396,16 +1481,16 @@ def dummy_inputs(self): ) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - labels: Optional[Union[np.ndarray, tf.Tensor]] = None, + labels: np.ndarray | tf.Tensor | None = None, training: Optional[bool] = False, ) -> Union[TFMultipleChoiceModelOutput, Tuple[tf.Tensor]]: r""" @@ -1455,25 +1540,16 @@ def call( attentions=outputs.attentions, ) - @tf.function( - input_signature=[ - { - "input_ids": tf.TensorSpec((None, None, None), tf.int32, name="input_ids"), - "attention_mask": tf.TensorSpec((None, None, None), tf.int32, name="attention_mask"), - } - ] - ) - def serving(self, inputs): - output = self.call(inputs) - - return self.serving_output(output) - - # Copied from transformers.models.bert.modeling_tf_bert.TFBertForMultipleChoice.serving_output - def serving_output(self, output: TFMultipleChoiceModelOutput) -> TFMultipleChoiceModelOutput: - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - - return TFMultipleChoiceModelOutput(logits=output.logits, hidden_states=hs, attentions=attns) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "roberta", None) is not None: + with tf.name_scope(self.roberta.name): + self.roberta.build(None) + if getattr(self, "classifier", None) is not None: + with tf.name_scope(self.classifier.name): + self.classifier.build([None, None, self.config.hidden_size]) @add_start_docstrings( @@ -1493,9 +1569,10 @@ def __init__(self, config, *inputs, **kwargs): self.num_labels = config.num_labels self.roberta = TFCamembertMainLayer(config, add_pooling_layer=False, name="roberta") - self.qa_outputs = tf.keras.layers.Dense( + self.qa_outputs = keras.layers.Dense( config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="qa_outputs" ) + self.config = config @unpack_inputs @add_start_docstrings_to_model_forward(CAMEMBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length")) @@ -1508,17 +1585,17 @@ def __init__(self, config, *inputs, **kwargs): ) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + 
attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - start_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, - end_positions: Optional[Union[np.ndarray, tf.Tensor]] = None, + start_positions: np.ndarray | tf.Tensor | None = None, + end_positions: np.ndarray | tf.Tensor | None = None, training: Optional[bool] = False, ) -> Union[TFQuestionAnsweringModelOutput, Tuple[tf.Tensor]]: r""" @@ -1568,14 +1645,16 @@ def call( attentions=outputs.attentions, ) - # Copied from transformers.models.bert.modeling_tf_bert.TFBertForQuestionAnswering.serving_output - def serving_output(self, output: TFQuestionAnsweringModelOutput) -> TFQuestionAnsweringModelOutput: - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - - return TFQuestionAnsweringModelOutput( - start_logits=output.start_logits, end_logits=output.end_logits, hidden_states=hs, attentions=attns - ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "roberta", None) is not None: + with tf.name_scope(self.roberta.name): + self.roberta.build(None) + if getattr(self, "qa_outputs", None) is not None: + with tf.name_scope(self.qa_outputs.name): + self.qa_outputs.build([None, None, self.config.hidden_size]) @add_start_docstrings( @@ -1624,20 +1703,20 @@ def prepare_inputs_for_generation(self, input_ids, past_key_values=None, attenti ) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, - head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None, - encoder_hidden_states: Optional[Union[np.ndarray, tf.Tensor]] = None, - encoder_attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + token_type_ids: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, + head_mask: np.ndarray | tf.Tensor | None = None, + inputs_embeds: np.ndarray | tf.Tensor | None = None, + encoder_hidden_states: np.ndarray | tf.Tensor | None = None, + encoder_attention_mask: np.ndarray | tf.Tensor | None = None, past_key_values: Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None, use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, - labels: Optional[Union[np.ndarray, tf.Tensor]] = None, + labels: np.ndarray | tf.Tensor | None = None, training: Optional[bool] = False, ) -> Union[TFCausalLMOutputWithCrossAttentions, Tuple[tf.Tensor]]: r""" @@ -1703,16 +1782,13 @@ def call( cross_attentions=outputs.cross_attentions, ) - # Copied from transformers.models.bert.modeling_tf_bert.TFBertLMHeadModel.serving_output - def serving_output(self, output: TFCausalLMOutputWithCrossAttentions) -> TFCausalLMOutputWithCrossAttentions: - output_cache = 
self.config.use_cache and self.config.is_decoder - pkv = tf.convert_to_tensor(output.past_key_values) if output_cache else None - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - cross_attns = tf.convert_to_tensor(output.cross_attentions) if output.cross_attentions is not None else None - if not (self.config.output_attentions and self.config.add_cross_attention): - cross_attns = None - - return TFCausalLMOutputWithCrossAttentions( - logits=output.logits, past_key_values=pkv, hidden_states=hs, attentions=attns, cross_attentions=cross_attns - ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "roberta", None) is not None: + with tf.name_scope(self.roberta.name): + self.roberta.build(None) + if getattr(self, "lm_head", None) is not None: + with tf.name_scope(self.lm_head.name): + self.lm_head.build(None) diff --git a/src/transformers/models/camembert/tokenization_camembert.py b/src/transformers/models/camembert/tokenization_camembert.py index 658dd1080b7122..0949db02fbb850 100644 --- a/src/transformers/models/camembert/tokenization_camembert.py +++ b/src/transformers/models/camembert/tokenization_camembert.py @@ -31,12 +31,12 @@ PRETRAINED_VOCAB_FILES_MAP = { "vocab_file": { - "camembert-base": "https://huggingface.co/camembert-base/resolve/main/sentencepiece.bpe.model", + "almanach/camembert-base": "https://huggingface.co/almanach/camembert-base/resolve/main/sentencepiece.bpe.model", } } PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { - "camembert-base": 512, + "almanach/camembert-base": 512, } SPIECE_UNDERLINE = "▁" @@ -89,7 +89,7 @@ class CamembertTokenizer(PreTrainedTokenizer): mask_token (`str`, *optional*, defaults to `""`): The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict. - additional_special_tokens (`List[str]`, *optional*, defaults to `["NOTUSED", "NOTUSED"]`): + additional_special_tokens (`List[str]`, *optional*, defaults to `['NOTUSED', 'NOTUSED', 'NOTUSED']`): Additional special tokens used by the tokenizer. sp_model_kwargs (`dict`, *optional*): Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for @@ -127,15 +127,42 @@ def __init__( unk_token="", pad_token="", mask_token="", - additional_special_tokens=["NOTUSED", "NOTUSED"], + additional_special_tokens=["NOTUSED", "NOTUSED", "NOTUSED"], sp_model_kwargs: Optional[Dict[str, Any]] = None, **kwargs, ) -> None: # Mask token behave like a normal word, i.e. include the space before it - mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token + mask_token = ( + AddedToken(mask_token, lstrip=True, rstrip=False, normalized=False, special=True) + if isinstance(mask_token, str) + else mask_token + ) self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs + self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs) + self.sp_model.Load(str(vocab_file)) + self.vocab_file = vocab_file + + # HACK: These tokens were added by the author for an obscure reason as they were already part of the + # sentencepiece vocabulary (this is the case for and and ). + # In this case it is recommended to properly set the tokens by hand. 
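As an aside on this hunk: the rewritten slow tokenizer keeps the historical fairseq layout by reserving the first few ids for added/special tokens and shifting every SentencePiece piece id by a fixed offset. Below is a minimal sketch of that bookkeeping, assuming a loaded SentencePiece model; the helper names and constants are illustrative, not the actual `CamembertTokenizer` methods.

```python
# Hedged sketch of the fairseq-offset id mapping (illustrative only, not the
# CamembertTokenizer implementation). Assumes the reserved/added special ids
# are resolved elsewhere, as in the added-tokens table of this hunk.
import sentencepiece as spm

FAIRSEQ_OFFSET = 4  # the first ids are reserved for the pad/unk/NOTUSED specials
UNK_TOKEN_ID = 3    # the model-level unknown-token id in the table above


def token_to_id(sp_model: spm.SentencePieceProcessor, token: str) -> int:
    piece_id = sp_model.PieceToId(token)
    if piece_id == 0:                  # SentencePiece's own unknown piece
        return UNK_TOKEN_ID            # collapse it onto the model's unk id
    return piece_id + FAIRSEQ_OFFSET   # shift everything else past the reserved ids


def id_to_token(sp_model: spm.SentencePieceProcessor, index: int) -> str:
    # Reserved/added ids (< FAIRSEQ_OFFSET) are assumed to be handled separately.
    return sp_model.IdToPiece(index - FAIRSEQ_OFFSET)


# Example usage (requires a sentencepiece model file on disk):
# sp = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model")
# token_to_id(sp, "▁Bonjour")
```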
+ self._added_tokens_decoder = { + 0: AddedToken("NOTUSED", special=True), + 1: AddedToken(pad_token, special=True) if isinstance(pad_token, str) else pad_token, + 2: AddedToken("NOTUSED", special=True), + 3: AddedToken(unk_token, special=True) if isinstance(unk_token, str) else unk_token, + 4: AddedToken("NOTUSED", special=True), + } + + self.fairseq_offset = 4 # 3 tokens are newly added, but the offset starts from 4 + + # legacy: camemebert is a particular case were we have to make sure `"NOTUSED"` is here + if "added_tokens_decoder" in kwargs: + # this is the only class that requires this unfortunately..... + # the reason is that the fast version has a whole. + kwargs["added_tokens_decoder"].update(self._added_tokens_decoder) + super().__init__( bos_token=bos_token, eos_token=eos_token, @@ -148,15 +175,83 @@ def __init__( sp_model_kwargs=self.sp_model_kwargs, **kwargs, ) + + @property + def vocab_size(self): + # The length of the vocabulary without added tokens is len(self.sp_model) but the added tokens are added at the beginning. + return len(self.sp_model) + + def get_vocab(self): + vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size + self.fairseq_offset)} + vocab.update(self.added_tokens_encoder) + return vocab + + def _tokenize(self, text: str) -> List[str]: + return self.sp_model.encode(text, out_type=str) + + def _convert_token_to_id(self, token): + """Converts a token (str) in an id using the vocab.""" + # specifi to camembert, both 3 and 4 point to the unk token. + if self.sp_model.PieceToId(token) == 0: + # Convert sentence piece unk token to fairseq unk token index + return self.unk_token_id + return self.fairseq_offset + self.sp_model.PieceToId(token) + + def _convert_id_to_token(self, index): + """Converts an index (integer) in a token (str) using the vocab.""" + return self.sp_model.IdToPiece(index - self.fairseq_offset) + + def convert_tokens_to_string(self, tokens): + """Converts a sequence of tokens (string) in a single string.""" + # TODO decode outputs do not match between fast and slow + current_sub_tokens = [] + out_string = "" + prev_is_special = False + for token in tokens: + # make sure that special tokens are not decoded using sentencepiece model + if token in self.all_special_tokens: + if not prev_is_special: + out_string += " " + out_string += self.sp_model.decode(current_sub_tokens) + token + prev_is_special = True + current_sub_tokens = [] + else: + current_sub_tokens.append(token) + prev_is_special = False + out_string += self.sp_model.decode(current_sub_tokens) + return out_string.strip() + + def __getstate__(self): + state = self.__dict__.copy() + state["sp_model"] = None + return state + + def __setstate__(self, d): + self.__dict__ = d + + # for backward compatibility + if not hasattr(self, "sp_model_kwargs"): + self.sp_model_kwargs = {} + self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs) - self.sp_model.Load(str(vocab_file)) - self.vocab_file = vocab_file - # HACK: These tokens were added by fairseq but don't seem to be actually used when duplicated in the actual - # sentencepiece vocabulary (this is the case for and - self.fairseq_tokens_to_ids = {"NOTUSED": 0, "": 1, "NOTUSED": 2, "": 3} - self.fairseq_offset = len(self.fairseq_tokens_to_ids) - self.fairseq_tokens_to_ids[""] = len(self.sp_model) + len(self.fairseq_tokens_to_ids) - self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()} + self.sp_model.Load(self.vocab_file) + + def save_vocabulary(self, save_directory: str, 
filename_prefix: Optional[str] = None) -> Tuple[str]: + if not os.path.isdir(save_directory): + logger.error(f"Vocabulary path ({save_directory}) should be a directory") + return + out_vocab_file = os.path.join( + save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"] + ) + + if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file): + copyfile(self.vocab_file, out_vocab_file) + elif not os.path.isfile(self.vocab_file): + with open(out_vocab_file, "wb") as fi: + content_spiece_model = self.sp_model.serialized_model_proto() + fi.write(content_spiece_model) + + return (out_vocab_file,) def build_inputs_with_special_tokens( self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None @@ -233,81 +328,3 @@ def create_token_type_ids_from_sequences( if token_ids_1 is None: return len(cls + token_ids_0 + sep) * [0] return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0] - - @property - def vocab_size(self): - return len(self.fairseq_tokens_to_ids) + len(self.sp_model) - - def get_vocab(self): - vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)} - vocab.update(self.added_tokens_encoder) - return vocab - - def _tokenize(self, text: str) -> List[str]: - return self.sp_model.encode(text, out_type=str) - - def _convert_token_to_id(self, token): - """Converts a token (str) in an id using the vocab.""" - if token in self.fairseq_tokens_to_ids: - return self.fairseq_tokens_to_ids[token] - elif self.sp_model.PieceToId(token) == 0: - # Convert sentence piece unk token to fairseq unk token index - return self.unk_token_id - return self.fairseq_offset + self.sp_model.PieceToId(token) - - def _convert_id_to_token(self, index): - """Converts an index (integer) in a token (str) using the vocab.""" - if index in self.fairseq_ids_to_tokens: - return self.fairseq_ids_to_tokens[index] - return self.sp_model.IdToPiece(index - self.fairseq_offset) - - def convert_tokens_to_string(self, tokens): - """Converts a sequence of tokens (string) in a single string.""" - current_sub_tokens = [] - out_string = "" - prev_is_special = False - for token in tokens: - # make sure that special tokens are not decoded using sentencepiece model - if token in self.all_special_tokens: - if not prev_is_special: - out_string += " " - out_string += self.sp_model.decode(current_sub_tokens) + token - prev_is_special = True - current_sub_tokens = [] - else: - current_sub_tokens.append(token) - prev_is_special = False - out_string += self.sp_model.decode(current_sub_tokens) - return out_string.strip() - - def __getstate__(self): - state = self.__dict__.copy() - state["sp_model"] = None - return state - - def __setstate__(self, d): - self.__dict__ = d - - # for backward compatibility - if not hasattr(self, "sp_model_kwargs"): - self.sp_model_kwargs = {} - - self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs) - self.sp_model.Load(self.vocab_file) - - def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]: - if not os.path.isdir(save_directory): - logger.error(f"Vocabulary path ({save_directory}) should be a directory") - return - out_vocab_file = os.path.join( - save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"] - ) - - if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file): - copyfile(self.vocab_file, out_vocab_file) - elif not 
os.path.isfile(self.vocab_file): - with open(out_vocab_file, "wb") as fi: - content_spiece_model = self.sp_model.serialized_model_proto() - fi.write(content_spiece_model) - - return (out_vocab_file,) diff --git a/src/transformers/models/camembert/tokenization_camembert_fast.py b/src/transformers/models/camembert/tokenization_camembert_fast.py index 8a5ebbedd1c7be..627971eb51db3e 100644 --- a/src/transformers/models/camembert/tokenization_camembert_fast.py +++ b/src/transformers/models/camembert/tokenization_camembert_fast.py @@ -36,15 +36,15 @@ PRETRAINED_VOCAB_FILES_MAP = { "vocab_file": { - "camembert-base": "https://huggingface.co/camembert-base/resolve/main/sentencepiece.bpe.model", + "almanach/camembert-base": "https://huggingface.co/almanach/camembert-base/resolve/main/sentencepiece.bpe.model", }, "tokenizer_file": { - "camembert-base": "https://huggingface.co/camembert-base/resolve/main/tokenizer.json", + "almanach/camembert-base": "https://huggingface.co/almanach/camembert-base/resolve/main/tokenizer.json", }, } PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { - "camembert-base": 512, + "almanach/camembert-base": 512, } SPIECE_UNDERLINE = "▁" @@ -119,12 +119,11 @@ def __init__( unk_token="", pad_token="", mask_token="", - additional_special_tokens=["NOTUSED", "NOTUSED"], + additional_special_tokens=["NOTUSED", "NOTUSED", "NOTUSED"], **kwargs, ): - # Mask token behave like a normal word, i.e. include the space before it - mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token - + # Mask token behave like a normal word, i.e. include the space before it. Will have normalized = False + mask_token = AddedToken(mask_token, lstrip=True, special=True) if isinstance(mask_token, str) else mask_token super().__init__( vocab_file, tokenizer_file=tokenizer_file, @@ -140,7 +139,10 @@ def __init__( ) self.vocab_file = vocab_file - self.can_save_slow_tokenizer = False if not self.vocab_file else True + + @property + def can_save_slow_tokenizer(self) -> bool: + return os.path.isfile(self.vocab_file) if self.vocab_file else False def build_inputs_with_special_tokens( self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None diff --git a/src/transformers/models/canine/configuration_canine.py b/src/transformers/models/canine/configuration_canine.py index 1fdeb3204a52e4..f1e1bb415892a2 100644 --- a/src/transformers/models/canine/configuration_canine.py +++ b/src/transformers/models/canine/configuration_canine.py @@ -50,7 +50,7 @@ class CanineConfig(PretrainedConfig): The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, `"relu"`, `"selu"` and `"gelu_new"` are supported. hidden_dropout_prob (`float`, *optional*, defaults to 0.1): - The dropout probabilitiy for all fully connected layers in the embeddings, encoders, and pooler. + The dropout probability for all fully connected layers in the embeddings, encoders, and pooler. attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1): The dropout ratio for the attention probabilities. max_position_embeddings (`int`, *optional*, defaults to 16384): @@ -61,6 +61,12 @@ class CanineConfig(PretrainedConfig): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. layer_norm_eps (`float`, *optional*, defaults to 1e-12): The epsilon used by the layer normalization layers. + pad_token_id (`int`, *optional*, defaults to 0): + Padding token id. 
+ bos_token_id (`int`, *optional*, defaults to 57344): + Beginning of stream token id. + eos_token_id (`int`, *optional*, defaults to 57345): + End of stream token id. downsampling_rate (`int`, *optional*, defaults to 4): The rate at which to downsample the original character sequence length before applying the deep Transformer encoder. @@ -89,6 +95,7 @@ class CanineConfig(PretrainedConfig): >>> # Accessing the model configuration >>> configuration = model.config ```""" + model_type = "canine" def __init__( diff --git a/src/transformers/models/canine/modeling_canine.py b/src/transformers/models/canine/modeling_canine.py index a91d42f0395ee8..378a5775256f70 100644 --- a/src/transformers/models/canine/modeling_canine.py +++ b/src/transformers/models/canine/modeling_canine.py @@ -54,7 +54,7 @@ CANINE_PRETRAINED_MODEL_ARCHIVE_LIST = [ "google/canine-s", - "google/canine-r" + "google/canine-r", # See all CANINE models at https://huggingface.co/models?filter=canine ] @@ -216,7 +216,9 @@ def __init__(self, config): self.dropout = nn.Dropout(config.hidden_dropout_prob) # position_ids (1, len position emb) is contiguous in memory and exported when serialized - self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1))) + self.register_buffer( + "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False + ) self.position_embedding_type = getattr(config, "position_embedding_type", "absolute") def _hash_bucket_tensors(self, input_ids, num_hashes: int, num_buckets: int): @@ -793,18 +795,12 @@ def forward( layer_head_mask = head_mask[i] if head_mask is not None else None if self.gradient_checkpointing and self.training: - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, output_attentions) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(layer_module), + layer_outputs = self._gradient_checkpointing_func( + layer_module.__call__, hidden_states, attention_mask, layer_head_mask, + output_attentions, ) else: layer_outputs = layer_module(hidden_states, attention_mask, layer_head_mask, output_attentions) @@ -900,7 +896,6 @@ class CaninePreTrainedModel(PreTrainedModel): load_tf_weights = load_tf_weights_in_canine base_model_prefix = "canine" supports_gradient_checkpointing = True - _keys_to_ignore_on_load_missing = [r"position_ids"] def _init_weights(self, module): """Initialize the weights""" @@ -918,10 +913,6 @@ def _init_weights(self, module): module.bias.data.zero_() module.weight.data.fill_(1.0) - def _set_gradient_checkpointing(self, module, value=False): - if isinstance(module, CanineEncoder): - module.gradient_checkpointing = value - CANINE_START_DOCSTRING = r""" This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) sub-class. Use @@ -1125,6 +1116,7 @@ def forward( if input_ids is not None and inputs_embeds is not None: raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") elif input_ids is not None: + self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask) input_shape = input_ids.size() elif inputs_embeds is not None: input_shape = inputs_embeds.size()[:-1] @@ -1167,7 +1159,9 @@ def forward( # Contextualize character embeddings using shallow Transformer. # We use a 3D attention mask for the local attention. 
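A side note on the `persistent=False` change to the `position_ids` buffer earlier in this file: a non-persistent buffer still follows the module across devices but is excluded from `state_dict()`, which is why the `_keys_to_ignore_on_load_missing = [r"position_ids"]` escape hatch could be dropped. A minimal sketch of the behaviour, using an illustrative module rather than CANINE itself:

```python
# Illustrative only: non-persistent buffers never appear in checkpoints.
import torch
import torch.nn as nn


class WithPositionIds(nn.Module):
    def __init__(self, max_position_embeddings: int = 512):
        super().__init__()
        # persistent=False: the buffer moves with .to()/.cuda() like any buffer,
        # but state_dict() neither stores nor expects a "position_ids" entry.
        self.register_buffer(
            "position_ids",
            torch.arange(max_position_embeddings).expand((1, -1)),
            persistent=False,
        )


module = WithPositionIds()
assert "position_ids" not in module.state_dict()
assert module.position_ids.shape == (1, 512)
```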
# `input_char_encoding`: shape (batch_size, char_seq_len, char_dim) - char_attention_mask = self._create_3d_attention_mask_from_input_mask(input_ids, attention_mask) + char_attention_mask = self._create_3d_attention_mask_from_input_mask( + input_ids if input_ids is not None else inputs_embeds, attention_mask + ) init_chars_encoder_outputs = self.initial_char_encoder( input_char_embeddings, attention_mask=char_attention_mask, diff --git a/src/transformers/models/canine/tokenization_canine.py b/src/transformers/models/canine/tokenization_canine.py index 2fae9e1482bd32..25932ae75d2a87 100644 --- a/src/transformers/models/canine/tokenization_canine.py +++ b/src/transformers/models/canine/tokenization_canine.py @@ -33,7 +33,6 @@ # Below: Constants defining canonical codepoints for special, pseudo-characters. # Copied from https://github.com/google-research/language/blob/master/language/canine/special_codepoints.py PAD = 0 - CLS = 0xE000 SEP = 0xE001 BOS = 0xE002 @@ -97,18 +96,6 @@ def __init__( # Mask token behave like a normal word, i.e. include the space before it mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token - super().__init__( - bos_token=bos_token, - eos_token=eos_token, - sep_token=sep_token, - cls_token=cls_token, - pad_token=pad_token, - mask_token=mask_token, - add_prefix_space=add_prefix_space, - model_max_length=model_max_length, - **kwargs, - ) - # Creates a mapping for looking up the IDs of special symbols. self._special_codepoints: Dict[str, int] = {} for codepoint, name in SPECIAL_CODEPOINTS.items(): @@ -122,10 +109,27 @@ def __init__( self._unicode_vocab_size = UNICODE_VOCAB_SIZE self._num_special_tokens = len(self._special_codepoints) + super().__init__( + bos_token=bos_token, + eos_token=eos_token, + sep_token=sep_token, + cls_token=cls_token, + pad_token=pad_token, + mask_token=mask_token, + add_prefix_space=add_prefix_space, + model_max_length=model_max_length, + **kwargs, + ) + @property def vocab_size(self) -> int: return self._unicode_vocab_size + def get_vocab(self): + vocab = {chr(i): i for i in range(self.vocab_size)} + vocab.update(self.added_tokens_encoder) + return vocab + def _tokenize(self, text: str) -> List[str]: """Tokenize a string (i.e. perform character splitting).""" return list(text) diff --git a/src/transformers/models/chinese_clip/configuration_chinese_clip.py b/src/transformers/models/chinese_clip/configuration_chinese_clip.py index f20e16e41cac64..53b6d49b3f6698 100644 --- a/src/transformers/models/chinese_clip/configuration_chinese_clip.py +++ b/src/transformers/models/chinese_clip/configuration_chinese_clip.py @@ -14,7 +14,6 @@ # limitations under the License. """ Chinese-CLIP model configuration""" -import copy import os from collections import OrderedDict from typing import TYPE_CHECKING, Any, Mapping, Optional, Union @@ -76,8 +75,13 @@ class ChineseCLIPTextConfig(PretrainedConfig): The vocabulary size of the `token_type_ids` passed when calling [`ChineseCLIPModel`]. initializer_range (`float`, *optional*, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + initializer_factor (`float`, *optional*, defaults to 1.0): + A factor for initializing all weight matrices (should be kept to 1, used internally for initialization + testing). layer_norm_eps (`float`, *optional*, defaults to 1e-12): The epsilon used by the layer normalization layers. + pad_token_id (`int`, *optional*, defaults to 0): + Padding token id. 
position_embedding_type (`str`, *optional*, defaults to `"absolute"`): Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to @@ -102,6 +106,7 @@ class ChineseCLIPTextConfig(PretrainedConfig): >>> # Accessing the model configuration >>> configuration = model.config ```""" + model_type = "chinese_clip_text_model" def __init__( @@ -144,6 +149,8 @@ def __init__( @classmethod def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + cls._set_token_in_kwargs(kwargs) + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) # get the vision config dict if we are loading from ChineseCLIPConfig @@ -164,8 +171,7 @@ class ChineseCLIPVisionConfig(PretrainedConfig): This is the configuration class to store the configuration of a [`ChineseCLIPModel`]. It is used to instantiate an ChineseCLIP model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the ChineseCLIP - [OFA-Sys/chinese-clip-vit-base-patch16](https: - //huggingface.co/OFA-Sys/chinese-clip-vit-base-patch16) architecture. + [OFA-Sys/chinese-clip-vit-base-patch16](https://huggingface.co/OFA-Sys/chinese-clip-vit-base-patch16) architecture. Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the documentation from [`PretrainedConfig`] for more information. @@ -176,10 +182,14 @@ class ChineseCLIPVisionConfig(PretrainedConfig): Dimensionality of the encoder layers and the pooler layer. intermediate_size (`int`, *optional*, defaults to 3072): Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. + projection_dim (`int`, *optional*, defaults to 512): + Dimentionality of text and vision projection layers. num_hidden_layers (`int`, *optional*, defaults to 12): Number of hidden layers in the Transformer encoder. num_attention_heads (`int`, *optional*, defaults to 12): Number of attention heads for each attention layer in the Transformer encoder. + num_channels (`int`, *optional*, defaults to 3): + The number of input channels. image_size (`int`, *optional*, defaults to 224): The size (resolution) of each image. patch_size (`int`, *optional*, defaults to 32): @@ -187,13 +197,13 @@ class ChineseCLIPVisionConfig(PretrainedConfig): hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`): The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, `"relu"`, `"selu"` and `"gelu_new"` ``"quick_gelu"` are supported. - layer_norm_eps (`float`, *optional*, defaults to 1e-5): + layer_norm_eps (`float`, *optional*, defaults to 1e-05): The epsilon used by the layer normalization layers. attention_dropout (`float`, *optional*, defaults to 0.0): The dropout ratio for the attention probabilities. initializer_range (`float`, *optional*, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. - initializer_factor (`float``, *optional*, defaults to 1): + initializer_factor (`float`, *optional*, defaults to 1.0): A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing). 
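Since both sub-configurations are adjusted in this file, a short usage sketch of how they are typically composed into the joint config may help; the hyperparameter values below are placeholders, not the released checkpoint's settings.

```python
# Hedged usage sketch of composing the joint ChineseCLIP config from its parts.
from transformers import ChineseCLIPConfig, ChineseCLIPTextConfig, ChineseCLIPVisionConfig

text_config = ChineseCLIPTextConfig(hidden_size=768, num_hidden_layers=12)
vision_config = ChineseCLIPVisionConfig(hidden_size=768, patch_size=16, image_size=224)

# projection_dim is forwarded to ChineseCLIPConfig.__init__ and shared by both
# projection heads.
config = ChineseCLIPConfig.from_text_vision_configs(text_config, vision_config, projection_dim=512)
assert config.text_config.hidden_size == 768
```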
Example: @@ -247,6 +257,8 @@ def __init__( @classmethod def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + cls._set_token_in_kwargs(kwargs) + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) # get the vision config dict if we are loading from ChineseCLIPConfig @@ -310,28 +322,87 @@ class ChineseCLIPConfig(PretrainedConfig): ```""" model_type = "chinese_clip" - is_composition = True def __init__( self, text_config=None, vision_config=None, projection_dim=512, logit_scale_init_value=2.6592, **kwargs ): - super().__init__(**kwargs) - # If `_config_dict` exist, we use them for the backward compatibility. + # We pop out these 2 attributes before calling `super().__init__` to avoid them being saved (which causes a lot + # of confusion!). text_config_dict = kwargs.pop("text_config_dict", None) vision_config_dict = kwargs.pop("vision_config_dict", None) + + super().__init__(**kwargs) + + # Instead of simply assigning `[text|vision]_config_dict` to `[text|vision]_config`, we use the values in + # `[text|vision]_config_dict` to update the values in `[text|vision]_config`. The values should be same in most + # cases, but we don't want to break anything regarding `_config_dict` that existed before commit `8827e1b2`. if text_config_dict is not None: - text_config = text_config_dict + if text_config is None: + text_config = {} + + # This is the complete result when using `text_config_dict`. + _text_config_dict = ChineseCLIPTextConfig(**text_config_dict).to_dict() + + # Give a warning if the values exist in both `_text_config_dict` and `text_config` but being different. + for key, value in _text_config_dict.items(): + if key in text_config and value != text_config[key] and key not in ["transformers_version"]: + # If specified in `text_config_dict` + if key in text_config_dict: + message = ( + f"`{key}` is found in both `text_config_dict` and `text_config` but with different values. " + f'The value `text_config_dict["{key}"]` will be used instead.' + ) + # If inferred from default argument values (just to be super careful) + else: + message = ( + f"`text_config_dict` is provided which will be used to initialize `ChineseCLIPTextConfig`. " + f'The value `text_config["{key}"]` will be overriden.' + ) + logger.info(message) + + # Update all values in `text_config` with the ones in `_text_config_dict`. + text_config.update(_text_config_dict) + if vision_config_dict is not None: - vision_config = vision_config_dict + if vision_config is None: + vision_config = {} + + # This is the complete result when using `vision_config_dict`. + _vision_config_dict = ChineseCLIPVisionConfig(**vision_config_dict).to_dict() + # convert keys to string instead of integer + if "id2label" in _vision_config_dict: + _vision_config_dict["id2label"] = { + str(key): value for key, value in _vision_config_dict["id2label"].items() + } + + # Give a warning if the values exist in both `_vision_config_dict` and `vision_config` but being different. + for key, value in _vision_config_dict.items(): + if key in vision_config and value != vision_config[key] and key not in ["transformers_version"]: + # If specified in `vision_config_dict` + if key in vision_config_dict: + message = ( + f"`{key}` is found in both `vision_config_dict` and `vision_config` but with different " + f'values. The value `vision_config_dict["{key}"]` will be used instead.' 
+ ) + # If inferred from default argument values (just to be super careful) + else: + message = ( + f"`vision_config_dict` is provided which will be used to initialize " + f'`ChineseCLIPVisionConfig`. The value `vision_config["{key}"]` will be overriden.' + ) + logger.info(message) + + # Update all values in `vision_config` with the ones in `_vision_config_dict`. + vision_config.update(_vision_config_dict) if text_config is None: text_config = {} - logger.info("text_config is None. Initializing the ChineseCLIPTextConfig with default values.") + logger.info("`text_config` is `None`. Initializing the `ChineseCLIPTextConfig` with default values.") if vision_config is None: vision_config = {} - logger.info("vision_config is None. initializing the ChineseCLIPVisionConfig with default values.") + logger.info("`vision_config` is `None`. initializing the `ChineseCLIPVisionConfig` with default values.") self.text_config = ChineseCLIPTextConfig(**text_config) self.vision_config = ChineseCLIPVisionConfig(**vision_config) @@ -353,19 +424,6 @@ def from_text_vision_configs( return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), **kwargs) - def to_dict(self): - """ - Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`]. - - Returns: - `Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance, - """ - output = copy.deepcopy(self.__dict__) - output["text_config"] = self.text_config.to_dict() - output["vision_config"] = self.vision_config.to_dict() - output["model_type"] = self.__class__.model_type - return output - class ChineseCLIPOnnxConfig(OnnxConfig): @property @@ -404,7 +462,7 @@ def generate_dummy_inputs( processor.tokenizer, batch_size=batch_size, seq_length=seq_length, framework=framework ) image_input_dict = super().generate_dummy_inputs( - processor.feature_extractor, batch_size=batch_size, framework=framework + processor.image_processor, batch_size=batch_size, framework=framework ) return {**text_input_dict, **image_input_dict} diff --git a/src/transformers/models/chinese_clip/image_processing_chinese_clip.py b/src/transformers/models/chinese_clip/image_processing_chinese_clip.py index a21372b7533ee7..0216bc5431ea7f 100644 --- a/src/transformers/models/chinese_clip/image_processing_chinese_clip.py +++ b/src/transformers/models/chinese_clip/image_processing_chinese_clip.py @@ -18,15 +18,10 @@ import numpy as np -from transformers.utils.generic import TensorType - from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict from ...image_transforms import ( - center_crop, convert_to_rgb, get_resize_output_image_size, - normalize, - rescale, resize, to_channel_dimension_format, ) @@ -36,12 +31,14 @@ ChannelDimension, ImageInput, PILImageResampling, + infer_channel_dimension_format, + is_scaled_image, make_list_of_images, to_numpy_array, valid_images, + validate_preprocess_arguments, ) -from ...utils import logging -from ...utils.import_utils import is_vision_available +from ...utils import TensorType, is_vision_available, logging logger = logging.get_logger(__name__) @@ -63,7 +60,7 @@ class ChineseCLIPImageProcessor(BaseImageProcessor): Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio. Can be overridden by `size` in the `preprocess` method. 
- resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`): + resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`): Resampling filter to use if resizing the image. Can be overridden by `resample` in the `preprocess` method. do_center_crop (`bool`, *optional*, defaults to `True`): Whether to center crop the image to the specified `crop_size`. Can be overridden by `do_center_crop` in the @@ -77,16 +74,17 @@ class ChineseCLIPImageProcessor(BaseImageProcessor): rescale_factor (`int` or `float`, *optional*, defaults to `1/255`): Scale factor to use if rescaling the image. Can be overridden by `rescale_factor` in the `preprocess` method. - do_normalize: + do_normalize (`bool`, *optional*, defaults to `True`): Whether to normalize the image. Can be overridden by `do_normalize` in the `preprocess` method. image_mean (`float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_MEAN`): Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method. image_std (`float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_STD`): - Image standard deviation. - do_convert_rgb (`bool`, *optional*, defaults to `True`): Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method. + Can be overridden by the `image_std` parameter in the `preprocess` method. + do_convert_rgb (`bool`, *optional*, defaults to `True`): + Whether to convert the image to RGB. """ model_input_names = ["pixel_values"] @@ -130,6 +128,7 @@ def resize( size: Dict[str, int], resample: PILImageResampling = PILImageResampling.BICUBIC, data_format: Optional[Union[str, ChannelDimension]] = None, + input_data_format: Optional[Union[str, ChannelDimension]] = None, **kwargs, ) -> np.ndarray: """ @@ -145,77 +144,22 @@ def resize( Resampling filter to use when resiizing the image. data_format (`str` or `ChannelDimension`, *optional*): The channel dimension format of the image. If not provided, it will be the same as the input image. + input_data_format (`ChannelDimension` or `str`, *optional*): + The channel dimension format of the input image. If not provided, it will be inferred from the input + image. """ size = get_size_dict(size, default_to_square=False) output_size = get_resize_output_image_size( - image, size=(size["height"], size["width"]), default_to_square=False + image, size=(size["height"], size["width"]), default_to_square=False, input_data_format=input_data_format + ) + return resize( + image, + size=output_size, + resample=resample, + data_format=data_format, + input_data_format=input_data_format, + **kwargs, ) - return resize(image, size=output_size, resample=resample, data_format=data_format, **kwargs) - - def center_crop( - self, - image: np.ndarray, - size: Dict[str, int], - data_format: Optional[Union[str, ChannelDimension]] = None, - **kwargs, - ) -> np.ndarray: - """ - Center crop an image. If the image is too small to be cropped to the size given, it will be padded (so the - returned result will always be of size `size`). - - Args: - image (`np.ndarray`): - Image to center crop. - size (`Dict[str, int]`): - Size of the output image in the form of a dictionary with keys `height` and `width`. 
- data_format (`str` or `ChannelDimension`, *optional*): - The channel dimension format of the image. If not provided, it will be the same as the input image. - """ - size = get_size_dict(size) - return center_crop(image, size=(size["height"], size["width"]), data_format=data_format, **kwargs) - - def rescale( - self, - image: np.ndarray, - scale: Union[int, float], - data_format: Optional[Union[str, ChannelDimension]] = None, - **kwargs, - ): - """ - Rescale an image by a scale factor. image = image * scale. - - Args: - image (`np.ndarray`): - Image to rescale. - scale (`int` or `float`): - Scale to apply to the image. - data_format (`str` or `ChannelDimension`, *optional*): - The channel dimension format of the image. If not provided, it will be the same as the input image. - """ - return rescale(image, scale=scale, data_format=data_format, **kwargs) - - def normalize( - self, - image: np.ndarray, - mean: Union[float, List[float]], - std: Union[float, List[float]], - data_format: Optional[Union[str, ChannelDimension]] = None, - **kwargs, - ) -> np.ndarray: - """ - Normalize an image. image = (image - image_mean) / image_std. - - Args: - image (`np.ndarray`): - Image to normalize. - image_mean (`float` or `List[float]`): - Image mean. - image_std (`float` or `List[float]`): - Image standard deviation. - data_format (`str` or `ChannelDimension`, *optional*): - The channel dimension format of the image. If not provided, it will be the same as the input image. - """ - return normalize(image, mean=mean, std=std, data_format=data_format, **kwargs) def preprocess( self, @@ -233,6 +177,7 @@ def preprocess( do_convert_rgb: bool = None, return_tensors: Optional[Union[str, TensorType]] = None, data_format: Optional[ChannelDimension] = ChannelDimension.FIRST, + input_data_format: Optional[Union[str, ChannelDimension]] = None, **kwargs, ) -> PIL.Image.Image: """ @@ -240,7 +185,8 @@ def preprocess( Args: images (`ImageInput`): - Image to preprocess. + Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If + passing in images with pixel values between 0 and 1, set `do_rescale=False`. do_resize (`bool`, *optional*, defaults to `self.do_resize`): Whether to resize the image. size (`Dict[str, int]`, *optional*, defaults to `self.size`): @@ -275,9 +221,15 @@ def preprocess( - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`. data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`): The channel dimension format for the output image. Can be one of: - - `ChannelDimension.FIRST`: image in (num_channels, height, width) format. - - `ChannelDimension.LAST`: image in (height, width, num_channels) format. - - Unset: defaults to the channel dimension format of the input image. + - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format. + - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - Unset: Use the channel dimension format of the input image. + input_data_format (`ChannelDimension` or `str`, *optional*): + The channel dimension format for the input image. If unset, the channel dimension format is inferred + from the input image. Can be one of: + - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format. + - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - `"none"` or `ChannelDimension.NONE`: image in (height, width) format. 
""" do_resize = do_resize if do_resize is not None else self.do_resize size = size if size is not None else self.size @@ -300,39 +252,60 @@ def preprocess( "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, " "torch.Tensor, tf.Tensor or jax.ndarray." ) - - if do_resize and size is None: - raise ValueError("Size must be specified if do_resize is True.") - - if do_center_crop and crop_size is None: - raise ValueError("Crop size must be specified if do_center_crop is True.") - - if do_rescale and rescale_factor is None: - raise ValueError("Rescale factor must be specified if do_rescale is True.") - - if do_normalize and (image_mean is None or image_std is None): - raise ValueError("Image mean and std must be specified if do_normalize is True.") - - # PIL RGBA images are converted to RGB + validate_preprocess_arguments( + do_rescale=do_rescale, + rescale_factor=rescale_factor, + do_normalize=do_normalize, + image_mean=image_mean, + image_std=image_std, + do_center_crop=do_center_crop, + crop_size=crop_size, + do_resize=do_resize, + size=size, + resample=resample, + ) if do_convert_rgb: images = [convert_to_rgb(image) for image in images] # All transformations expect numpy arrays. images = [to_numpy_array(image) for image in images] + if is_scaled_image(images[0]) and do_rescale: + logger.warning_once( + "It looks like you are trying to rescale already rescaled images. If the input" + " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again." + ) + + if input_data_format is None: + # We assume that all images have the same channel dimension format. + input_data_format = infer_channel_dimension_format(images[0]) + if do_resize: - images = [self.resize(image=image, size=size, resample=resample) for image in images] + images = [ + self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format) + for image in images + ] if do_center_crop: - images = [self.center_crop(image=image, size=crop_size) for image in images] + images = [ + self.center_crop(image=image, size=crop_size, input_data_format=input_data_format) for image in images + ] if do_rescale: - images = [self.rescale(image=image, scale=rescale_factor) for image in images] + images = [ + self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format) + for image in images + ] if do_normalize: - images = [self.normalize(image=image, mean=image_mean, std=image_std) for image in images] - - images = [to_channel_dimension_format(image, data_format) for image in images] + images = [ + self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format) + for image in images + ] + + images = [ + to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images + ] data = {"pixel_values": images} return BatchFeature(data=data, tensor_type=return_tensors) diff --git a/src/transformers/models/chinese_clip/modeling_chinese_clip.py b/src/transformers/models/chinese_clip/modeling_chinese_clip.py index ce6a283a05b3ed..a16fb081b19357 100644 --- a/src/transformers/models/chinese_clip/modeling_chinese_clip.py +++ b/src/transformers/models/chinese_clip/modeling_chinese_clip.py @@ -121,7 +121,9 @@ def __init__(self, config): self.dropout = nn.Dropout(config.hidden_dropout_prob) # position_ids (1, len position emb) is contiguous in memory and exported when serialized self.position_embedding_type = getattr(config, "position_embedding_type", "absolute") - self.register_buffer("position_ids", 
torch.arange(config.max_position_embeddings).expand((1, -1))) + self.register_buffer( + "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False + ) self.register_buffer( "token_type_ids", torch.zeros(self.position_ids.size(), dtype=torch.long), persistent=False ) @@ -190,11 +192,12 @@ def __init__(self, config: ChineseCLIPVisionConfig): self.num_patches = (self.image_size // self.patch_size) ** 2 self.num_positions = self.num_patches + 1 self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim) - self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1))) + self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1)), persistent=False) def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor: batch_size = pixel_values.shape[0] - patch_embeds = self.patch_embedding(pixel_values) # shape = [*, width, grid, grid] + target_dtype = self.patch_embedding.weight.dtype + patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype)) # shape = [*, width, grid, grid] patch_embeds = patch_embeds.flatten(2).transpose(1, 2) class_embeds = self.class_embedding.expand(batch_size, 1, -1) @@ -689,7 +692,6 @@ class ChineseCLIPPreTrainedModel(PreTrainedModel): config_class = ChineseCLIPConfig base_model_prefix = "chinese_clip" supports_gradient_checkpointing = True - _keys_to_ignore_on_load_missing = [r"position_ids"] def _init_weights(self, module): """Initialize the weights""" @@ -716,9 +718,7 @@ def _init_weights(self, module): nn.init.normal_(module.out_proj.weight, std=out_proj_std) elif isinstance(module, ChineseCLIPVisionMLP): factor = self.config.initializer_factor - in_proj_std = ( - (module.config.hidden_size**-0.5) * ((2 * module.config.num_hidden_layers) ** -0.5) * factor - ) + in_proj_std = (module.config.hidden_size**-0.5) * ((2 * module.config.num_hidden_layers) ** -0.5) * factor fc_std = (2 * module.config.hidden_size) ** -0.5 * factor nn.init.normal_(module.fc1.weight, std=fc_std) nn.init.normal_(module.fc2.weight, std=in_proj_std) @@ -740,10 +740,6 @@ def _init_weights(self, module): if module.bias is not None: module.bias.data.zero_() - def _set_gradient_checkpointing(self, module, value=False): - if isinstance(module, ChineseCLIPVisionEncoder) or isinstance(module, ChineseCLIPTextEncoder): - module.gradient_checkpointing = value - CHINESE_CLIP_START_DOCSTRING = r""" This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it @@ -891,6 +887,13 @@ def forward( all_self_attentions = () if output_attentions else None all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning_once( + "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." + ) + use_cache = False + next_decoder_cache = () if use_cache else None for i, layer_module in enumerate(self.layer): if output_hidden_states: @@ -900,25 +903,15 @@ def forward( past_key_value = past_key_values[i] if past_key_values is not None else None if self.gradient_checkpointing and self.training: - if use_cache: - logger.warning( - "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." 
- ) - use_cache = False - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, past_key_value, output_attentions) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(layer_module), + layer_outputs = self._gradient_checkpointing_func( + layer_module.__call__, hidden_states, attention_mask, layer_head_mask, encoder_hidden_states, encoder_attention_mask, + past_key_value, + output_attentions, ) else: layer_outputs = layer_module( @@ -1014,16 +1007,10 @@ def forward( if output_hidden_states: encoder_states = encoder_states + (hidden_states,) if self.gradient_checkpointing and self.training: - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, output_attentions) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(encoder_layer), + layer_outputs = self._gradient_checkpointing_func( + encoder_layer.__call__, hidden_states, + output_attentions, ) else: layer_outputs = encoder_layer( @@ -1204,6 +1191,7 @@ def forward( if input_ids is not None and inputs_embeds is not None: raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") elif input_ids is not None: + self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask) input_shape = input_ids.size() elif inputs_embeds is not None: input_shape = inputs_embeds.size()[:-1] @@ -1373,7 +1361,7 @@ def __init__(self, config: ChineseCLIPConfig): self.visual_projection = nn.Linear(self.vision_embed_dim, self.projection_dim, bias=False) self.text_projection = nn.Linear(self.text_embed_dim, self.projection_dim, bias=False) - self.logit_scale = nn.Parameter(torch.ones([]) * self.config.logit_scale_init_value) + self.logit_scale = nn.Parameter(torch.tensor(self.config.logit_scale_init_value)) # Initialize weights and apply final processing self.post_init() diff --git a/src/transformers/models/chinese_clip/processing_chinese_clip.py b/src/transformers/models/chinese_clip/processing_chinese_clip.py index 6a8d9c961a372e..832f44102abf32 100644 --- a/src/transformers/models/chinese_clip/processing_chinese_clip.py +++ b/src/transformers/models/chinese_clip/processing_chinese_clip.py @@ -31,16 +31,18 @@ class ChineseCLIPProcessor(ProcessorMixin): See the [`~ChineseCLIPProcessor.__call__`] and [`~ChineseCLIPProcessor.decode`] for more information. Args: - image_processor ([`ChineseCLIPImageProcessor`]): + image_processor ([`ChineseCLIPImageProcessor`], *optional*): The image processor is a required input. - tokenizer ([`BertTokenizerFast`]): + tokenizer ([`BertTokenizerFast`], *optional*): The tokenizer is a required input. """ + attributes = ["image_processor", "tokenizer"] image_processor_class = "ChineseCLIPImageProcessor" tokenizer_class = ("BertTokenizer", "BertTokenizerFast") def __init__(self, image_processor=None, tokenizer=None, **kwargs): + feature_extractor = None if "feature_extractor" in kwargs: warnings.warn( "The `feature_extractor` argument is deprecated and will be removed in v5, use `image_processor`" diff --git a/src/transformers/models/clap/configuration_clap.py b/src/transformers/models/clap/configuration_clap.py index 13d1f7b7e05973..1a02d8460937d0 100644 --- a/src/transformers/models/clap/configuration_clap.py +++ b/src/transformers/models/clap/configuration_clap.py @@ -14,7 +14,6 @@ # limitations under the License. 
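# The hunks above replace the closure-based `create_custom_forward` wrapper with a call to the
# shared `_gradient_checkpointing_func`, handing it `layer_module.__call__` plus every argument
# positionally. A minimal stand-alone sketch of the same pattern, using a toy nn.Linear layer and
# torch.utils.checkpoint directly (nothing below is a transformers module):
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

layer = nn.Linear(4, 4)
hidden_states = torch.randn(2, 4, requires_grad=True)

# Old pattern: capture extra arguments in a closure around the module.
def create_custom_forward(module):
    def custom_forward(*inputs):
        return module(*inputs)
    return custom_forward

out_old = checkpoint(create_custom_forward(layer), hidden_states, use_reentrant=False)

# New pattern: pass the bound __call__ and all arguments to the checkpoint function directly,
# which is what self._gradient_checkpointing_func wraps.
out_new = checkpoint(layer.__call__, hidden_states, use_reentrant=False)
print(torch.allclose(out_old, out_new))  # True, same weights and inputs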
""" CLAP model configuration""" -import copy import os from typing import Union @@ -65,8 +64,6 @@ class ClapTextConfig(PretrainedConfig): just in case (e.g., 512 or 1024 or 2048). type_vocab_size (`int`, *optional*, defaults to 2): The vocabulary size of the `token_type_ids` passed when calling [`ClapTextModel`]. - initializer_range (`float`, *optional*, defaults to 0.02): - The standard deviation of the truncated_normal_initializer for initializing all weight matrices. layer_norm_eps (`float`, *optional*, defaults to 1e-12): The epsilon used by the layer normalization layers. position_embedding_type (`str`, *optional*, defaults to `"absolute"`): @@ -80,8 +77,6 @@ class ClapTextConfig(PretrainedConfig): use_cache (`bool`, *optional*, defaults to `True`): Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if `config.is_decoder=True`. - classifier_dropout (`float`, *optional*): - The dropout ratio for the classification head. projection_hidden_act (`str`, *optional*, defaults to `"relu"`): The non-linear activation function (function or string) in the projection layer. If string, `"gelu"`, `"relu"`, `"silu"` and `"gelu_new"` are supported. @@ -102,6 +97,7 @@ class ClapTextConfig(PretrainedConfig): >>> # Accessing the model configuration >>> configuration = model.config ```""" + model_type = "clap_text_model" def __init__( @@ -116,7 +112,6 @@ def __init__( attention_probs_dropout_prob=0.1, max_position_embeddings=514, type_vocab_size=1, - initializer_range=0.02, initializer_factor=1.0, layer_norm_eps=1e-12, projection_dim=512, @@ -125,7 +120,6 @@ def __init__( eos_token_id=2, position_embedding_type="absolute", use_cache=True, - classifier_dropout=None, projection_hidden_act="relu", **kwargs, ): @@ -141,17 +135,17 @@ def __init__( self.attention_probs_dropout_prob = attention_probs_dropout_prob self.max_position_embeddings = max_position_embeddings self.type_vocab_size = type_vocab_size - self.initializer_range = initializer_range self.initializer_factor = initializer_factor self.layer_norm_eps = layer_norm_eps self.position_embedding_type = position_embedding_type self.use_cache = use_cache - self.classifier_dropout = classifier_dropout self.projection_hidden_act = projection_hidden_act self.projection_dim = projection_dim @classmethod def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + cls._set_token_in_kwargs(kwargs) + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) # get the text config dict if we are loading from ClapConfig @@ -208,7 +202,7 @@ class ClapAudioConfig(PretrainedConfig): Whether or not to enable patch fusion. This is the main contribution of the authors, and should give the best results. hidden_dropout_prob (`float`, *optional*, defaults to 0.1): - The dropout probabilitiy for all fully connected layers in the encoder. + The dropout probability for all fully connected layers in the encoder. fusion_type (`[type]`, *optional*): Fusion type used for the patch fusion. patch_embed_input_channels (`int`, *optional*, defaults to 1): @@ -234,7 +228,7 @@ class ClapAudioConfig(PretrainedConfig): projection_hidden_act (`str`, *optional*, defaults to `"relu"`): The non-linear activation function (function or string) in the projection layer. If string, `"gelu"`, `"relu"`, `"silu"` and `"gelu_new"` are supported. 
- layer_norm_eps (`[type]`, *optional*, defaults to `1e-5`): + layer_norm_eps (`[type]`, *optional*, defaults to 1e-05): The epsilon used by the layer normalization layers. initializer_factor (`float`, *optional*, defaults to 1.0): A factor for initializing all weight matrices (should be kept to 1, used internally for initialization @@ -320,6 +314,8 @@ def __init__( @classmethod def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + cls._set_token_in_kwargs(kwargs) + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) # get the audio config dict if we are loading from ClapConfig @@ -350,10 +346,10 @@ class ClapConfig(PretrainedConfig): Dictionary of configuration options used to initialize [`ClapTextConfig`]. audio_config (`dict`, *optional*): Dictionary of configuration options used to initialize [`ClapAudioConfig`]. + logit_scale_init_value (`float`, *optional*, defaults to 14.29): + The inital value of the *logit_scale* paramter. Default is used as per the original CLAP implementation. projection_dim (`int`, *optional*, defaults to 512): Dimentionality of text and audio projection layers. - logit_scale_init_value (`float`, *optional*, defaults to 2.6592): - The inital value of the *logit_scale* paramter. Default is used as per the original CLAP implementation. projection_hidden_act (`str`, *optional*, defaults to `"relu"`): Activation function for the projection layers. initializer_factor (`float`, *optional*, defaults to 1.0): @@ -386,7 +382,6 @@ class ClapConfig(PretrainedConfig): ```""" model_type = "clap" - is_composition = True def __init__( self, @@ -435,16 +430,3 @@ def from_text_audio_configs(cls, text_config: ClapTextConfig, audio_config: Clap """ return cls(text_config=text_config.to_dict(), audio_config=audio_config.to_dict(), **kwargs) - - def to_dict(self): - """ - Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`]. 
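# The custom ClapConfig.to_dict override removed above (together with `is_composition`) is no
# longer needed: the base PretrainedConfig.to_dict already serializes nested sub-configs. A short
# usage sketch, assuming a transformers install that contains this change; no download is needed:
from transformers import ClapConfig

config = ClapConfig()  # builds default text and audio sub-configs
serialized = config.to_dict()
print(serialized["model_type"])                      # 'clap'
print(isinstance(serialized["text_config"], dict))   # True, nested config is expanded
print(isinstance(serialized["audio_config"], dict))  # True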
- - Returns: - `Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance, - """ - output = copy.deepcopy(self.__dict__) - output["text_config"] = self.text_config.to_dict() - output["audio_config"] = self.audio_config.to_dict() - output["model_type"] = self.__class__.model_type - return output diff --git a/src/transformers/models/clap/convert_clap_original_pytorch_to_hf.py b/src/transformers/models/clap/convert_clap_original_pytorch_to_hf.py index 908fef5927af02..d422bc45ab3de0 100644 --- a/src/transformers/models/clap/convert_clap_original_pytorch_to_hf.py +++ b/src/transformers/models/clap/convert_clap_original_pytorch_to_hf.py @@ -16,8 +16,7 @@ import argparse import re -import torch -from CLAP import create_model +from laion_clap import CLAP_Module from transformers import AutoFeatureExtractor, ClapConfig, ClapModel @@ -38,17 +37,25 @@ processor = AutoFeatureExtractor.from_pretrained("laion/clap-htsat-unfused", truncation="rand_trunc") -def init_clap(checkpoint_path, enable_fusion=False): - model, model_cfg = create_model( - "HTSAT-tiny", - "roberta", - checkpoint_path, - precision="fp32", - device="cuda:0" if torch.cuda.is_available() else "cpu", +def init_clap(checkpoint_path, model_type, enable_fusion=False): + model = CLAP_Module( + amodel=model_type, enable_fusion=enable_fusion, - fusion_type="aff_2d" if enable_fusion else None, ) - return model, model_cfg + model.load_ckpt(checkpoint_path) + return model + + +def get_config_from_original(clap_model): + audio_config = { + "patch_embeds_hidden_size": clap_model.model.audio_branch.embed_dim, + "depths": clap_model.model.audio_branch.depths, + "hidden_size": clap_model.model.audio_projection[0].in_features, + } + + text_config = {"hidden_size": clap_model.model.text_branch.pooler.dense.in_features} + + return ClapConfig(audio_config=audio_config, text_config=text_config) def rename_state_dict(state_dict): @@ -94,14 +101,14 @@ def rename_state_dict(state_dict): return model_state_dict -def convert_clap_checkpoint(checkpoint_path, pytorch_dump_folder_path, config_path, enable_fusion=False): - clap_model, clap_model_cfg = init_clap(checkpoint_path, enable_fusion=enable_fusion) +def convert_clap_checkpoint(checkpoint_path, pytorch_dump_folder_path, config_path, model_type, enable_fusion=False): + clap_model = init_clap(checkpoint_path, model_type, enable_fusion=enable_fusion) clap_model.eval() - state_dict = clap_model.state_dict() + state_dict = clap_model.model.state_dict() state_dict = rename_state_dict(state_dict) - transformers_config = ClapConfig() + transformers_config = get_config_from_original(clap_model) transformers_config.audio_config.enable_fusion = enable_fusion model = ClapModel(transformers_config) @@ -118,6 +125,9 @@ def convert_clap_checkpoint(checkpoint_path, pytorch_dump_folder_path, config_pa parser.add_argument("--checkpoint_path", default=None, type=str, help="Path to fairseq checkpoint") parser.add_argument("--config_path", default=None, type=str, help="Path to hf config.json of model to convert") parser.add_argument("--enable_fusion", action="store_true", help="Whether to enable fusion or not") + parser.add_argument("--model_type", default="HTSAT-tiny", type=str, help="Whether to enable fusion or not") args = parser.parse_args() - convert_clap_checkpoint(args.checkpoint_path, args.pytorch_dump_folder_path, args.config_path, args.enable_fusion) + convert_clap_checkpoint( + args.checkpoint_path, args.pytorch_dump_folder_path, args.config_path, args.model_type, args.enable_fusion + 
) diff --git a/src/transformers/models/clap/feature_extraction_clap.py b/src/transformers/models/clap/feature_extraction_clap.py index 1367591fead52f..ce18fedd19b109 100644 --- a/src/transformers/models/clap/feature_extraction_clap.py +++ b/src/transformers/models/clap/feature_extraction_clap.py @@ -21,7 +21,7 @@ import numpy as np import torch -from ...audio_utils import fram_wave, get_mel_filter_banks, power_to_db, stft +from ...audio_utils import mel_filter_bank, spectrogram, window_function from ...feature_extraction_sequence_utils import SequenceFeatureExtractor from ...feature_extraction_utils import BatchFeature from ...utils import TensorType, logging @@ -41,32 +41,32 @@ class ClapFeatureExtractor(SequenceFeatureExtractor): Fourier Transform* (STFT) which should match pytorch's `torch.stft` equivalent. Args: - feature_size (`int`, defaults to 64): + feature_size (`int`, *optional*, defaults to 64): The feature dimension of the extracted Mel spectrograms. This corresponds to the number of mel filters (`n_mels`). - sampling_rate (`int`, defaults to 48_000): + sampling_rate (`int`, *optional*, defaults to 48000): The sampling rate at which the audio files should be digitalized expressed in hertz (Hz). This only serves to warn users if the audio fed to the feature extractor does not have the same sampling rate. - hop_length (`int`, defaults to 480): + hop_length (`int`,*optional*, defaults to 480): Length of the overlaping windows for the STFT used to obtain the Mel Spectrogram. The audio will be split in smaller `frames` with a step of `hop_length` between each frame. - max_length_s (`int`, defaults to 10): - The maximum input lenght of the model in seconds. This is used to pad the audio. - fft_window_size (`int`, defaults to 1024): + max_length_s (`int`, *optional*, defaults to 10): + The maximum input length of the model in seconds. This is used to pad the audio. + fft_window_size (`int`, *optional*, defaults to 1024): Size of the window (in samples) on which the Fourier transform is applied. This controls the frequency resolution of the spectrogram. 400 means that the fourrier transform is computed on windows of 400 samples. padding_value (`float`, *optional*, defaults to 0.0): Padding value used to pad the audio. Should correspond to silences. return_attention_mask (`bool`, *optional*, defaults to `False`): Whether or not the model should return the attention masks coresponding to the input. - frequency_min (`float`, *optional*, default to 0): + frequency_min (`float`, *optional*, defaults to 0): The lowest frequency of interest. The STFT will not be computed for values below this. - frequency_max (`float`, *optional*, default to 14_000): + frequency_max (`float`, *optional*, defaults to 14000): The highest frequency of interest. The STFT will not be computed for values above this. top_db (`float`, *optional*): The highest decibel value used to convert the mel spectrogram to the log scale. For more details see the `audio_utils.power_to_db` function - truncation (`str`, *optional*, default to `"fusions"`): + truncation (`str`, *optional*, defaults to `"fusion"`): Truncation pattern for long audio inputs. Two patterns are available: - `fusion` will use `_random_mel_fusion`, which stacks 3 random crops from the mel spectrogram and a downsampled version of the entire mel spectrogram. 
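# The docstring fixes above describe the default CLAP front-end (64 mel bins, 48 kHz, hop 480,
# "fusion" truncation). A usage sketch on a silent one-second clip; the shapes in the comments
# are indicative and follow from these defaults:
import numpy as np
from transformers import ClapFeatureExtractor

feature_extractor = ClapFeatureExtractor()      # built from default values, no download needed
waveform = np.zeros(48_000, dtype=np.float32)   # 1 s of mono audio at 48 kHz
features = feature_extractor(waveform, sampling_rate=48_000, return_tensors="np")
print(features["input_features"].shape)         # (batch, 4 fused crops, frames, 64 mel bins)
print(features["is_longer"].shape)              # (batch, 1), see the `is_longer` hunk further down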
@@ -116,21 +116,21 @@ def __init__( self.sampling_rate = sampling_rate self.frequency_min = frequency_min self.frequency_max = frequency_max - self.mel_filters = get_mel_filter_banks( - nb_frequency_bins=self.nb_frequency_bins, - nb_mel_filters=feature_size, - frequency_min=frequency_min, - frequency_max=frequency_max, - sample_rate=sampling_rate, + self.mel_filters = mel_filter_bank( + num_frequency_bins=self.nb_frequency_bins, + num_mel_filters=feature_size, + min_frequency=frequency_min, + max_frequency=frequency_max, + sampling_rate=sampling_rate, norm=None, mel_scale="htk", ) - self.mel_filters_slaney = get_mel_filter_banks( - nb_frequency_bins=self.nb_frequency_bins, - nb_mel_filters=feature_size, - frequency_min=frequency_min, - frequency_max=frequency_max, - sample_rate=sampling_rate, + self.mel_filters_slaney = mel_filter_bank( + num_frequency_bins=self.nb_frequency_bins, + num_mel_filters=feature_size, + min_frequency=frequency_min, + max_frequency=frequency_max, + sampling_rate=sampling_rate, norm="slaney", mel_scale="slaney", ) @@ -153,24 +153,25 @@ def to_dict(self) -> Dict[str, Any]: def _np_extract_fbank_features(self, waveform: np.array, mel_filters: Optional[np.array] = None) -> np.ndarray: """ - Compute the log-Mel spectrogram of the provided `waveform` using the `hanning` window. In CLAP, two different - filter banks are used depending on the truncation pattern: - - `self.mel_filters`: they correspond to the defaults parameters of `torchaduio` which can be obtained from + Compute the log-mel spectrogram of the provided `waveform` using the Hann window. In CLAP, two different filter + banks are used depending on the truncation pattern: + - `self.mel_filters`: they correspond to the default parameters of `torchaudio` which can be obtained from calling `torchaudio.transforms.MelSpectrogram().mel_scale.fb`. These filters are used when `truncation` is set to `"fusion"`. - - `self.mel_filteres_slaney` : they correspond to the defaults parameters of `torchlibrosa` which used + - `self.mel_filteres_slaney` : they correspond to the default parameters of `librosa` which used `librosa.filters.mel` when computing the mel spectrogram. These filters were only used in the original implementation when the truncation mode is not `"fusion"`. 
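# The two filter banks above are now built with audio_utils.mel_filter_bank; the rewrite of
# _np_extract_fbank_features in the next hunk then hands them to audio_utils.spectrogram. A
# condensed sketch of that pipeline with CLAP's default values (1024-point FFT, hop 480, 64 mel
# bins, 48 kHz); the sine waveform is just a stand-in:
import numpy as np
from transformers.audio_utils import mel_filter_bank, spectrogram, window_function

nb_frequency_bins = 1024 // 2 + 1
mel_filters = mel_filter_bank(
    num_frequency_bins=nb_frequency_bins,
    num_mel_filters=64,
    min_frequency=0,
    max_frequency=14_000,
    sampling_rate=48_000,
    norm=None,          # the "htk" bank; norm="slaney", mel_scale="slaney" gives the second bank
    mel_scale="htk",
)

waveform = np.sin(2 * np.pi * 440 * np.arange(48_000) / 48_000).astype(np.float32)
log_mel = spectrogram(
    waveform,
    window_function(1024, "hann"),
    frame_length=1024,
    hop_length=480,
    power=2.0,
    mel_filters=mel_filters,
    log_mel="dB",
).T
print(mel_filters.shape, log_mel.shape)  # (513, 64) and (num_frames, 64)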
""" - window = np.hanning(self.fft_window_size + 1)[:-1] - frames = fram_wave(waveform, self.hop_length, self.fft_window_size) - spectrogram = stft(frames, window, fft_window_size=self.fft_window_size) - - magnitudes = np.abs(spectrogram) ** 2 - mel_spectrogram = np.matmul(mel_filters.T, magnitudes) - log_mel_spectrogram = power_to_db(mel_spectrogram).T - log_mel_spectrogram = np.asarray(log_mel_spectrogram, np.float32) - return log_mel_spectrogram + log_mel_spectrogram = spectrogram( + waveform, + window_function(self.fft_window_size, "hann"), + frame_length=self.fft_window_size, + hop_length=self.hop_length, + power=2.0, + mel_filters=mel_filters, + log_mel="dB", + ) + return log_mel_spectrogram.T def _random_mel_fusion(self, mel, total_frames, chunk_frames): ranges = np.array_split(list(range(0, total_frames - chunk_frames + 1)), 3) @@ -191,10 +192,10 @@ def _random_mel_fusion(self, mel, total_frames, chunk_frames): mel = torch.tensor(mel[None, None, :]) mel_shrink = torch.nn.functional.interpolate( - mel, size=[chunk_frames, 64], mode="bilinear", align_corners=False, antialias=False + mel, size=[chunk_frames, 64], mode="bilinear", align_corners=False ) mel_shrink = mel_shrink[0][0].numpy() - mel_fusion = np.stack([mel_chunk_front, mel_chunk_middle, mel_chunk_back, mel_shrink], axis=0) + mel_fusion = np.stack([mel_shrink, mel_chunk_front, mel_chunk_middle, mel_chunk_back], axis=0) return mel_fusion def _get_input_mel(self, waveform: np.array, max_length, truncation, padding) -> np.array: @@ -241,10 +242,10 @@ def _get_input_mel(self, waveform: np.array, max_length, truncation, padding) -> if waveform.shape[0] < max_length: if padding == "repeat": n_repeat = int(max_length / len(waveform)) - waveform = np.stack(np.tile(waveform, n_repeat + 1))[:max_length] + waveform = np.tile(waveform, n_repeat + 1)[:max_length] if padding == "repeatpad": n_repeat = int(max_length / len(waveform)) - waveform = np.stack(np.tile(waveform, n_repeat)) + waveform = np.tile(waveform, n_repeat) waveform = np.pad(waveform, (0, max_length - waveform.shape[0]), mode="constant", constant_values=0) if truncation == "fusion": @@ -271,7 +272,8 @@ def __call__( Args: raw_speech (`np.ndarray`, `List[float]`, `List[np.ndarray]`, `List[List[float]]`): The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float - values, a list of numpy arrays or a list of list of float values. + values, a list of numpy arrays or a list of list of float values. Must be mono channel audio, not + stereo, i.e. single float per timestep. truncation (`str`, *optional*): Truncation pattern for long audio inputs. Two patterns are available: - `fusion` will use `_random_mel_fusion`, which stacks 3 random crops from the mel spectrogram and @@ -311,9 +313,11 @@ def __call__( "Failing to do so can result in silent errors that might be hard to debug." 
) - is_batched = bool( - isinstance(raw_speech, (list, tuple)) - and (isinstance(raw_speech[0], np.ndarray) or isinstance(raw_speech[0], (tuple, list))) + is_batched_numpy = isinstance(raw_speech, np.ndarray) and len(raw_speech.shape) > 1 + if is_batched_numpy and len(raw_speech.shape) > 2: + raise ValueError(f"Only mono-channel audio is supported for input to {self}") + is_batched = is_batched_numpy or ( + isinstance(raw_speech, (list, tuple)) and (isinstance(raw_speech[0], (np.ndarray, tuple, list))) ) if is_batched: @@ -347,6 +351,9 @@ def __call__( if isinstance(input_mel[0], List): input_mel = [np.asarray(feature, dtype=np.float64) for feature in input_mel] + # is_longer is a list of bool + is_longer = [[longer] for longer in is_longer] + input_features = {"input_features": input_mel, "is_longer": is_longer} input_features = BatchFeature(input_features) diff --git a/src/transformers/models/clap/modeling_clap.py b/src/transformers/models/clap/modeling_clap.py index aeb64e29cf2dc1..6310b9675fb654 100644 --- a/src/transformers/models/clap/modeling_clap.py +++ b/src/transformers/models/clap/modeling_clap.py @@ -18,7 +18,6 @@ from dataclasses import dataclass from typing import Any, List, Optional, Tuple, Union -import numpy as np import torch import torch.nn.functional as F from torch import nn @@ -93,6 +92,7 @@ def window_partition(hidden_states, window_size): # Adapted from https://github.com/LAION-AI/CLAP/blob/6ad05a971ba0622f6acee8c41993e0d02bbed639/src/open_clip/htsat.py#L263 def window_reverse(windows, window_size, height, width): """ + Merges windows to produce higher resolution features. Args: windows (`torch.FloatTensor` of shape `(num_windows * batch_size, window_size, window_size, num_channels)`): Input windows @@ -103,11 +103,10 @@ def window_reverse(windows, window_size, height, width): width (`int`): Width of the resized audio """ - batch_size = int(windows.shape[0] / (height * width / window_size / window_size)) - - hidden_states = windows.view(batch_size, height // window_size, width // window_size, window_size, window_size, -1) - hidden_states = hidden_states.permute(0, 1, 3, 2, 4, 5).contiguous().view(batch_size, height, width, -1) - return hidden_states + num_channels = windows.shape[-1] + windows = windows.view(-1, height // window_size, width // window_size, window_size, window_size, num_channels) + windows = windows.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, height, width, num_channels) + return windows # Copied from transformers.models.roberta.modeling_roberta.create_position_ids_from_input_ids @@ -160,8 +159,8 @@ class ClapTextModelOutput(ModelOutput): text_embeds: Optional[torch.FloatTensor] = None last_hidden_state: torch.FloatTensor = None - hidden_states: Optional[Tuple[torch.FloatTensor]] = None - attentions: Optional[Tuple[torch.FloatTensor]] = None + hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None + attentions: Optional[Tuple[torch.FloatTensor, ...]] = None @dataclass @@ -189,8 +188,8 @@ class ClapAudioModelOutput(ModelOutput): audio_embeds: Optional[torch.FloatTensor] = None last_hidden_state: torch.FloatTensor = None - hidden_states: Optional[Tuple[torch.FloatTensor]] = None - attentions: Optional[Tuple[torch.FloatTensor]] = None + hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None + attentions: Optional[Tuple[torch.FloatTensor, ...]] = None @dataclass @@ -825,16 +824,15 @@ def __init__(self, config): self.config = config self.patch_embed = ClapAudioPatchEmbed(config) self.enable_fusion = config.enable_fusion - grid_size = 
self.patch_embed.grid_size self.patch_stride = self.patch_embed.patch_stride self.spec_size = config.spec_size - self.freq_ratio = self.spec_size // config.num_mel_bins + self.freq_ratio = config.spec_size // config.num_mel_bins self.num_features = int(config.patch_embeds_hidden_size * 2 ** (self.num_layers - 1)) - self.freq_ratio = config.spec_size // config.num_mel_bins drop_path_rate = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))] + grid_size = self.patch_embed.grid_size self.input_resolutions = [(grid_size[0] // (2**i), grid_size[1] // (2**i)) for i in range(self.num_layers)] self.layers = nn.ModuleList( @@ -941,15 +939,8 @@ def forward( input_dimensions = self.input_resolutions[i] if self.gradient_checkpointing and self.training: - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, output_attentions) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(layer_module), hidden_states, input_dimensions, layer_head_mask + layer_outputs = self._gradient_checkpointing_func( + layer_module.__call__, hidden_states, input_dimensions, layer_head_mask, output_attentions ) else: layer_outputs = layer_module( @@ -1167,7 +1158,9 @@ def __init__(self, config): self.dropout = nn.Dropout(config.hidden_dropout_prob) # position_ids (1, len position emb) is contiguous in memory and exported when serialized self.position_embedding_type = getattr(config, "position_embedding_type", "absolute") - self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1))) + self.register_buffer( + "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=True + ) self.register_buffer( "token_type_ids", torch.zeros(self.position_ids.size(), dtype=torch.long), persistent=True ) @@ -1579,6 +1572,13 @@ def forward( all_self_attentions = () if output_attentions else None all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning_once( + "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." + ) + use_cache = False + next_decoder_cache = () if use_cache else None for i, layer_module in enumerate(self.layer): if output_hidden_states: @@ -1588,25 +1588,15 @@ def forward( past_key_value = past_key_values[i] if past_key_values is not None else None if self.gradient_checkpointing and self.training: - if use_cache: - logger.warning( - "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." 
- ) - use_cache = False - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, past_key_value, output_attentions) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(layer_module), + layer_outputs = self._gradient_checkpointing_func( + layer_module.__call__, hidden_states, attention_mask, layer_head_mask, encoder_hidden_states, encoder_attention_mask, + past_key_value, + output_attentions, ) else: layer_outputs = layer_module( @@ -1676,7 +1666,6 @@ class ClapPreTrainedModel(PreTrainedModel): config_class = ClapConfig base_model_prefix = "clap" supports_gradient_checkpointing = False - _keys_to_ignore_on_load_missing = [r"position_ids", r"logit_scale_a", r"logit_scale_t"] def _init_weights(self, module): """Initialize the weights""" @@ -1700,10 +1689,6 @@ def _init_weights(self, module): if module.bias is not None: module.bias.data.zero_() - def _set_gradient_checkpointing(self, module, value=False): - if isinstance(module, ClapTextEncoder): - module.gradient_checkpointing = value - class ClapAudioModel(ClapPreTrainedModel): config_class = ClapAudioConfig @@ -1780,7 +1765,6 @@ class ClapTextModel(ClapPreTrainedModel): """ config_class = ClapTextConfig - _keys_to_ignore_on_load_missing = [r"position_ids"] # Copied from transformers.models.bert.modeling_bert.BertModel.__init__ with Bert->ClapText def __init__(self, config, add_pooling_layer=True): @@ -1852,6 +1836,7 @@ def forward( if input_ids is not None and inputs_embeds is not None: raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") elif input_ids is not None: + self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask) input_shape = input_ids.size() elif inputs_embeds is not None: input_shape = inputs_embeds.size()[:-1] @@ -1935,7 +1920,6 @@ def forward( @add_start_docstrings(CLAP_START_DOCSTRING) class ClapModel(ClapPreTrainedModel): config_class = ClapConfig - _keys_to_ignore_on_load_missing = [r"position_ids"] def __init__(self, config: ClapConfig): super().__init__(config) @@ -1955,8 +1939,8 @@ def __init__(self, config: ClapConfig): text_config = config.text_config audio_config = config.audio_config - self.logit_scale_a = nn.Parameter(torch.ones([]) * np.log(config.logit_scale_init_value)) - self.logit_scale_t = nn.Parameter(torch.ones([]) * np.log(config.logit_scale_init_value)) + self.logit_scale_a = nn.Parameter(torch.tensor(math.log(config.logit_scale_init_value))) + self.logit_scale_t = nn.Parameter(torch.tensor(math.log(config.logit_scale_init_value))) self.projection_dim = config.projection_dim diff --git a/src/transformers/models/clap/processing_clap.py b/src/transformers/models/clap/processing_clap.py index 7492f102b4b227..87799899945fa6 100644 --- a/src/transformers/models/clap/processing_clap.py +++ b/src/transformers/models/clap/processing_clap.py @@ -33,6 +33,7 @@ class ClapProcessor(ProcessorMixin): tokenizer ([`RobertaTokenizerFast`]): The tokenizer is a required input. 
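# The logit-scale hunk above swaps `torch.ones([]) * np.log(v)` for `torch.tensor(math.log(v))`:
# both produce a 0-d float32 parameter holding the same value, the new form simply drops the numpy
# round-trip. Quick equivalence check with an arbitrary value:
import math

import numpy as np
import torch

value = 14.29
old = torch.ones([]) * np.log(value)
new = torch.tensor(math.log(value))
print(old.shape == new.shape, torch.allclose(old, new))  # True True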
""" + feature_extractor_class = "ClapFeatureExtractor" tokenizer_class = ("RobertaTokenizer", "RobertaTokenizerFast") diff --git a/src/transformers/models/clip/__init__.py b/src/transformers/models/clip/__init__.py index 1f079783bed674..868c46616e9b33 100644 --- a/src/transformers/models/clip/__init__.py +++ b/src/transformers/models/clip/__init__.py @@ -67,6 +67,7 @@ "CLIPTextModelWithProjection", "CLIPVisionModel", "CLIPVisionModelWithProjection", + "CLIPForImageClassification", ] try: @@ -94,6 +95,7 @@ "FlaxCLIPPreTrainedModel", "FlaxCLIPTextModel", "FlaxCLIPTextPreTrainedModel", + "FlaxCLIPTextModelWithProjection", "FlaxCLIPVisionModel", "FlaxCLIPVisionPreTrainedModel", ] @@ -135,6 +137,7 @@ else: from .modeling_clip import ( CLIP_PRETRAINED_MODEL_ARCHIVE_LIST, + CLIPForImageClassification, CLIPModel, CLIPPreTrainedModel, CLIPTextModel, @@ -167,6 +170,7 @@ FlaxCLIPModel, FlaxCLIPPreTrainedModel, FlaxCLIPTextModel, + FlaxCLIPTextModelWithProjection, FlaxCLIPTextPreTrainedModel, FlaxCLIPVisionModel, FlaxCLIPVisionPreTrainedModel, diff --git a/src/transformers/models/clip/configuration_clip.py b/src/transformers/models/clip/configuration_clip.py index 1295245993110b..8c3e30ee0517af 100644 --- a/src/transformers/models/clip/configuration_clip.py +++ b/src/transformers/models/clip/configuration_clip.py @@ -14,7 +14,6 @@ # limitations under the License. """ CLIP model configuration""" -import copy import os from collections import OrderedDict from typing import TYPE_CHECKING, Any, Mapping, Optional, Union @@ -55,6 +54,8 @@ class CLIPTextConfig(PretrainedConfig): Dimensionality of the encoder layers and the pooler layer. intermediate_size (`int`, *optional*, defaults to 2048): Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. + projection_dim (`int`, *optional*, defaults to 512): + Dimentionality of text and vision projection layers. num_hidden_layers (`int`, *optional*, defaults to 12): Number of hidden layers in the Transformer encoder. num_attention_heads (`int`, *optional*, defaults to 8): @@ -65,15 +66,21 @@ class CLIPTextConfig(PretrainedConfig): hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`): The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, `"relu"`, `"selu"` and `"gelu_new"` `"quick_gelu"` are supported. - layer_norm_eps (`float`, *optional*, defaults to 1e-5): + layer_norm_eps (`float`, *optional*, defaults to 1e-05): The epsilon used by the layer normalization layers. attention_dropout (`float`, *optional*, defaults to 0.0): The dropout ratio for the attention probabilities. initializer_range (`float`, *optional*, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. - initializer_factor (`float`, *optional*, defaults to 1): + initializer_factor (`float`, *optional*, defaults to 1.0): A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing). + pad_token_id (`int`, *optional*, defaults to 1): + Padding token id. + bos_token_id (`int`, *optional*, defaults to 49406): + Beginning of stream token id. + eos_token_id (`int`, *optional*, defaults to 49407): + End of stream token id. 
Example: @@ -89,6 +96,7 @@ class CLIPTextConfig(PretrainedConfig): >>> # Accessing the model configuration >>> configuration = model.config ```""" + model_type = "clip_text_model" def __init__( @@ -105,9 +113,11 @@ def __init__( attention_dropout=0.0, initializer_range=0.02, initializer_factor=1.0, + # This differs from `CLIPTokenizer`'s default and from openai/clip + # See https://github.com/huggingface/transformers/pull/24773#issuecomment-1632287538 pad_token_id=1, - bos_token_id=0, - eos_token_id=2, + bos_token_id=49406, + eos_token_id=49407, **kwargs, ): super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs) @@ -127,6 +137,8 @@ def __init__( @classmethod def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + cls._set_token_in_kwargs(kwargs) + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) # get the text config dict if we are loading from CLIPConfig @@ -157,10 +169,14 @@ class CLIPVisionConfig(PretrainedConfig): Dimensionality of the encoder layers and the pooler layer. intermediate_size (`int`, *optional*, defaults to 3072): Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. + projection_dim (`int`, *optional*, defaults to 512): + Dimentionality of text and vision projection layers. num_hidden_layers (`int`, *optional*, defaults to 12): Number of hidden layers in the Transformer encoder. num_attention_heads (`int`, *optional*, defaults to 12): Number of attention heads for each attention layer in the Transformer encoder. + num_channels (`int`, *optional*, defaults to 3): + The number of input channels. image_size (`int`, *optional*, defaults to 224): The size (resolution) of each image. patch_size (`int`, *optional*, defaults to 32): @@ -168,13 +184,13 @@ class CLIPVisionConfig(PretrainedConfig): hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`): The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, `"relu"`, `"selu"` and `"gelu_new"` ``"quick_gelu"` are supported. - layer_norm_eps (`float`, *optional*, defaults to 1e-5): + layer_norm_eps (`float`, *optional*, defaults to 1e-05): The epsilon used by the layer normalization layers. attention_dropout (`float`, *optional*, defaults to 0.0): The dropout ratio for the attention probabilities. initializer_range (`float`, *optional*, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. - initializer_factor (`float`, *optional*, defaults to 1): + initializer_factor (`float`, *optional*, defaults to 1.0): A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing). @@ -230,6 +246,8 @@ def __init__( @classmethod def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + cls._set_token_in_kwargs(kwargs) + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) # get the vision config dict if we are loading from CLIPConfig @@ -292,28 +310,87 @@ class CLIPConfig(PretrainedConfig): ```""" model_type = "clip" - is_composition = True def __init__( self, text_config=None, vision_config=None, projection_dim=512, logit_scale_init_value=2.6592, **kwargs ): - super().__init__(**kwargs) - # If `_config_dict` exist, we use them for the backward compatibility. 
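# Two behaviours from this configuration file are worth pinning down. First, the default special
# token ids above now line up with CLIPTokenizer's vocabulary instead of the old 0/2 placeholders.
# Second, the __init__ hunk that follows merges a legacy `text_config_dict` into `text_config`
# (logging when they disagree) instead of silently replacing it. Both checks run offline:
from transformers import CLIPConfig, CLIPTextConfig

text_config = CLIPTextConfig()
print(text_config.pad_token_id, text_config.bos_token_id, text_config.eos_token_id)  # 1 49406 49407

config = CLIPConfig(
    text_config={"hidden_size": 128},
    text_config_dict={"hidden_size": 256},  # legacy spelling, still honoured
)
print(config.text_config.hidden_size)  # 256 -- the *_config_dict value wins after the merge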
+ # We pop out these 2 attributes before calling `super().__init__` to avoid them being saved (which causes a lot + # of confusion!). text_config_dict = kwargs.pop("text_config_dict", None) vision_config_dict = kwargs.pop("vision_config_dict", None) + + super().__init__(**kwargs) + + # Instead of simply assigning `[text|vision]_config_dict` to `[text|vision]_config`, we use the values in + # `[text|vision]_config_dict` to update the values in `[text|vision]_config`. The values should be same in most + # cases, but we don't want to break anything regarding `_config_dict` that existed before commit `8827e1b2`. if text_config_dict is not None: - text_config = text_config_dict + if text_config is None: + text_config = {} + + # This is the complete result when using `text_config_dict`. + _text_config_dict = CLIPTextConfig(**text_config_dict).to_dict() + + # Give a warning if the values exist in both `_text_config_dict` and `text_config` but being different. + for key, value in _text_config_dict.items(): + if key in text_config and value != text_config[key] and key not in ["transformers_version"]: + # If specified in `text_config_dict` + if key in text_config_dict: + message = ( + f"`{key}` is found in both `text_config_dict` and `text_config` but with different values. " + f'The value `text_config_dict["{key}"]` will be used instead.' + ) + # If inferred from default argument values (just to be super careful) + else: + message = ( + f"`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The " + f'value `text_config["{key}"]` will be overriden.' + ) + logger.info(message) + + # Update all values in `text_config` with the ones in `_text_config_dict`. + text_config.update(_text_config_dict) + if vision_config_dict is not None: - vision_config = vision_config_dict + if vision_config is None: + vision_config = {} + + # This is the complete result when using `vision_config_dict`. + _vision_config_dict = CLIPVisionConfig(**vision_config_dict).to_dict() + # convert keys to string instead of integer + if "id2label" in _vision_config_dict: + _vision_config_dict["id2label"] = { + str(key): value for key, value in _vision_config_dict["id2label"].items() + } + + # Give a warning if the values exist in both `_vision_config_dict` and `vision_config` but being different. + for key, value in _vision_config_dict.items(): + if key in vision_config and value != vision_config[key] and key not in ["transformers_version"]: + # If specified in `vision_config_dict` + if key in vision_config_dict: + message = ( + f"`{key}` is found in both `vision_config_dict` and `vision_config` but with different " + f'values. The value `vision_config_dict["{key}"]` will be used instead.' + ) + # If inferred from default argument values (just to be super careful) + else: + message = ( + f"`vision_config_dict` is provided which will be used to initialize `CLIPVisionConfig`. " + f'The value `vision_config["{key}"]` will be overriden.' + ) + logger.info(message) + + # Update all values in `vision_config` with the ones in `_vision_config_dict`. + vision_config.update(_vision_config_dict) if text_config is None: text_config = {} - logger.info("text_config is None. Initializing the CLIPTextConfig with default values.") + logger.info("`text_config` is `None`. Initializing the `CLIPTextConfig` with default values.") if vision_config is None: vision_config = {} - logger.info("vision_config is None. initializing the CLIPVisionConfig with default values.") + logger.info("`vision_config` is `None`. 
initializing the `CLIPVisionConfig` with default values.") self.text_config = CLIPTextConfig(**text_config) self.vision_config = CLIPVisionConfig(**vision_config) @@ -334,19 +411,6 @@ def from_text_vision_configs(cls, text_config: CLIPTextConfig, vision_config: CL return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), **kwargs) - def to_dict(self): - """ - Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`]. - - Returns: - `Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance, - """ - output = copy.deepcopy(self.__dict__) - output["text_config"] = self.text_config.to_dict() - output["vision_config"] = self.vision_config.to_dict() - output["model_type"] = self.__class__.model_type - return output - class CLIPOnnxConfig(OnnxConfig): @property @@ -385,7 +449,7 @@ def generate_dummy_inputs( processor.tokenizer, batch_size=batch_size, seq_length=seq_length, framework=framework ) image_input_dict = super().generate_dummy_inputs( - processor.feature_extractor, batch_size=batch_size, framework=framework + processor.image_processor, batch_size=batch_size, framework=framework ) return {**text_input_dict, **image_input_dict} diff --git a/src/transformers/models/clip/convert_clip_original_pytorch_to_hf.py b/src/transformers/models/clip/convert_clip_original_pytorch_to_hf.py index 0033be274d5c13..2127da4f6cf902 100644 --- a/src/transformers/models/clip/convert_clip_original_pytorch_to_hf.py +++ b/src/transformers/models/clip/convert_clip_original_pytorch_to_hf.py @@ -127,9 +127,9 @@ def convert_clip_checkpoint(checkpoint_path, pytorch_dump_folder_path, config_pa input_ids = torch.arange(0, 77).unsqueeze(0) pixel_values = torch.randn(1, 3, 224, 224) - hf_logits_per_image, hf_logits_per_text = hf_model( - input_ids=input_ids, pixel_values=pixel_values, return_dict=True - )[1:3] + hf_outputs = hf_model(input_ids=input_ids, pixel_values=pixel_values, return_dict=True) + hf_logits_per_image = hf_outputs.logits_per_image + hf_logits_per_text = hf_outputs.logits_per_text pt_logits_per_image, pt_logits_per_text = pt_model(pixel_values, input_ids) assert torch.allclose(hf_logits_per_image, pt_logits_per_image, atol=1e-3) diff --git a/src/transformers/models/clip/image_processing_clip.py b/src/transformers/models/clip/image_processing_clip.py index b06e121758fd10..6549a572d864f3 100644 --- a/src/transformers/models/clip/image_processing_clip.py +++ b/src/transformers/models/clip/image_processing_clip.py @@ -18,15 +18,10 @@ import numpy as np -from transformers.utils.generic import TensorType - from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict from ...image_transforms import ( - center_crop, convert_to_rgb, get_resize_output_image_size, - normalize, - rescale, resize, to_channel_dimension_format, ) @@ -36,12 +31,14 @@ ChannelDimension, ImageInput, PILImageResampling, + infer_channel_dimension_format, + is_scaled_image, make_list_of_images, to_numpy_array, valid_images, + validate_preprocess_arguments, ) -from ...utils import logging -from ...utils.import_utils import is_vision_available +from ...utils import TensorType, is_vision_available, logging logger = logging.get_logger(__name__) @@ -63,7 +60,7 @@ class CLIPImageProcessor(BaseImageProcessor): Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio. Can be overridden by `size` in the `preprocess` method. 
- resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`): + resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`): Resampling filter to use if resizing the image. Can be overridden by `resample` in the `preprocess` method. do_center_crop (`bool`, *optional*, defaults to `True`): Whether to center crop the image to the specified `crop_size`. Can be overridden by `do_center_crop` in the @@ -77,16 +74,17 @@ class CLIPImageProcessor(BaseImageProcessor): rescale_factor (`int` or `float`, *optional*, defaults to `1/255`): Scale factor to use if rescaling the image. Can be overridden by `rescale_factor` in the `preprocess` method. - do_normalize: + do_normalize (`bool`, *optional*, defaults to `True`): Whether to normalize the image. Can be overridden by `do_normalize` in the `preprocess` method. image_mean (`float` or `List[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]`): Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method. image_std (`float` or `List[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]`): - Image standard deviation. - do_convert_rgb (`bool`, *optional*, defaults to `True`): Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method. + Can be overridden by the `image_std` parameter in the `preprocess` method. + do_convert_rgb (`bool`, *optional*, defaults to `True`): + Whether to convert the image to RGB. """ model_input_names = ["pixel_values"] @@ -124,12 +122,17 @@ def __init__( self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD self.do_convert_rgb = do_convert_rgb + # for backwards compatibility of KOSMOS-2 + if "use_square_size" in kwargs: + self.size = {"height": size["shortest_edge"], "width": size["shortest_edge"]} + def resize( self, image: np.ndarray, size: Dict[str, int], resample: PILImageResampling = PILImageResampling.BICUBIC, data_format: Optional[Union[str, ChannelDimension]] = None, + input_data_format: Optional[Union[str, ChannelDimension]] = None, **kwargs, ) -> np.ndarray: """ @@ -145,79 +148,32 @@ def resize( Resampling filter to use when resiizing the image. data_format (`str` or `ChannelDimension`, *optional*): The channel dimension format of the image. If not provided, it will be the same as the input image. + input_data_format (`ChannelDimension` or `str`, *optional*): + The channel dimension format of the input image. If not provided, it will be inferred. """ - size = get_size_dict(size, default_to_square=False) - if "shortest_edge" not in size: - raise ValueError(f"The `size` parameter must contain the key `shortest_edge`. Got {size.keys()}") - output_size = get_resize_output_image_size(image, size=size["shortest_edge"], default_to_square=False) - return resize(image, size=output_size, resample=resample, data_format=data_format, **kwargs) - - def center_crop( - self, - image: np.ndarray, - size: Dict[str, int], - data_format: Optional[Union[str, ChannelDimension]] = None, - **kwargs, - ) -> np.ndarray: - """ - Center crop an image. If the image is too small to be cropped to the size given, it will be padded (so the - returned result will always be of size `size`). - - Args: - image (`np.ndarray`): - Image to center crop. 
- size (`Dict[str, int]`): - Size of the output image in the form of a dictionary with keys `height` and `width`. - data_format (`str` or `ChannelDimension`, *optional*): - The channel dimension format of the image. If not provided, it will be the same as the input image. - """ - size = get_size_dict(size) - if "height" not in size or "width" not in size: - raise ValueError(f"The `size` parameter must contain the keys (height, width). Got {size.keys()}") - return center_crop(image, size=(size["height"], size["width"]), data_format=data_format, **kwargs) - - def rescale( - self, - image: np.ndarray, - scale: Union[int, float], - data_format: Optional[Union[str, ChannelDimension]] = None, - **kwargs, - ): - """ - Rescale an image by a scale factor. image = image * scale. - - Args: - image (`np.ndarray`): - Image to rescale. - scale (`int` or `float`): - Scale to apply to the image. - data_format (`str` or `ChannelDimension`, *optional*): - The channel dimension format of the image. If not provided, it will be the same as the input image. - """ - return rescale(image, scale=scale, data_format=data_format, **kwargs) - - def normalize( - self, - image: np.ndarray, - mean: Union[float, List[float]], - std: Union[float, List[float]], - data_format: Optional[Union[str, ChannelDimension]] = None, - **kwargs, - ) -> np.ndarray: - """ - Normalize an image. image = (image - image_mean) / image_std. - - Args: - image (`np.ndarray`): - Image to normalize. - image_mean (`float` or `List[float]`): - Image mean. - image_std (`float` or `List[float]`): - Image standard deviation. - data_format (`str` or `ChannelDimension`, *optional*): - The channel dimension format of the image. If not provided, it will be the same as the input image. - """ - return normalize(image, mean=mean, std=std, data_format=data_format, **kwargs) + default_to_square = True + if "shortest_edge" in size: + size = size["shortest_edge"] + default_to_square = False + elif "height" in size and "width" in size: + size = (size["height"], size["width"]) + else: + raise ValueError("Size must contain either 'shortest_edge' or 'height' and 'width'.") + + output_size = get_resize_output_image_size( + image, + size=size, + default_to_square=default_to_square, + input_data_format=input_data_format, + ) + return resize( + image, + size=output_size, + resample=resample, + data_format=data_format, + input_data_format=input_data_format, + **kwargs, + ) def preprocess( self, @@ -235,6 +191,7 @@ def preprocess( do_convert_rgb: bool = None, return_tensors: Optional[Union[str, TensorType]] = None, data_format: Optional[ChannelDimension] = ChannelDimension.FIRST, + input_data_format: Optional[Union[str, ChannelDimension]] = None, **kwargs, ) -> PIL.Image.Image: """ @@ -242,7 +199,8 @@ def preprocess( Args: images (`ImageInput`): - Image to preprocess. + Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If + passing in images with pixel values between 0 and 1, set `do_rescale=False`. do_resize (`bool`, *optional*, defaults to `self.do_resize`): Whether to resize the image. size (`Dict[str, int]`, *optional*, defaults to `self.size`): @@ -277,9 +235,15 @@ def preprocess( - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`. data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`): The channel dimension format for the output image. Can be one of: - - `ChannelDimension.FIRST`: image in (num_channels, height, width) format. 
- - `ChannelDimension.LAST`: image in (height, width, num_channels) format. - - Unset: defaults to the channel dimension format of the input image. + - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format. + - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - Unset: Use the channel dimension format of the input image. + input_data_format (`ChannelDimension` or `str`, *optional*): + The channel dimension format for the input image. If unset, the channel dimension format is inferred + from the input image. Can be one of: + - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format. + - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format. + - `"none"` or `ChannelDimension.NONE`: image in (height, width) format. """ do_resize = do_resize if do_resize is not None else self.do_resize size = size if size is not None else self.size @@ -302,39 +266,61 @@ def preprocess( "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, " "torch.Tensor, tf.Tensor or jax.ndarray." ) + validate_preprocess_arguments( + do_rescale=do_rescale, + rescale_factor=rescale_factor, + do_normalize=do_normalize, + image_mean=image_mean, + image_std=image_std, + do_center_crop=do_center_crop, + crop_size=crop_size, + do_resize=do_resize, + size=size, + resample=resample, + ) - if do_resize and size is None: - raise ValueError("Size must be specified if do_resize is True.") - - if do_center_crop and crop_size is None: - raise ValueError("Crop size must be specified if do_center_crop is True.") - - if do_rescale and rescale_factor is None: - raise ValueError("Rescale factor must be specified if do_rescale is True.") - - if do_normalize and (image_mean is None or image_std is None): - raise ValueError("Image mean and std must be specified if do_normalize is True.") - - # PIL RGBA images are converted to RGB if do_convert_rgb: images = [convert_to_rgb(image) for image in images] # All transformations expect numpy arrays. images = [to_numpy_array(image) for image in images] + if is_scaled_image(images[0]) and do_rescale: + logger.warning_once( + "It looks like you are trying to rescale already rescaled images. If the input" + " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again." + ) + + if input_data_format is None: + # We assume that all images have the same channel dimension format. 
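# preprocess() now validates its flag/value pairs in one shared place and infers the channel
# layout of the incoming arrays once, instead of raising ad-hoc ValueErrors per flag. A compact
# sketch of the helpers involved, run on a dummy channels-first image:
import numpy as np
from transformers.image_transforms import to_channel_dimension_format
from transformers.image_utils import (
    ChannelDimension,
    infer_channel_dimension_format,
    validate_preprocess_arguments,
)

image = np.zeros((3, 224, 224), dtype=np.float32)
input_data_format = infer_channel_dimension_format(image)
print(input_data_format)  # ChannelDimension.FIRST

# Passing input_channel_dim avoids re-guessing (and double-transposing) the layout later on.
channels_last = to_channel_dimension_format(image, ChannelDimension.LAST, input_channel_dim=input_data_format)
print(channels_last.shape)  # (224, 224, 3)

# Inconsistent flag/value combinations now fail through a single validator:
try:
    validate_preprocess_arguments(do_rescale=True, rescale_factor=None)
except ValueError as err:
    print(err)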
+ input_data_format = infer_channel_dimension_format(images[0]) + if do_resize: - images = [self.resize(image=image, size=size, resample=resample) for image in images] + images = [ + self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format) + for image in images + ] if do_center_crop: - images = [self.center_crop(image=image, size=crop_size) for image in images] + images = [ + self.center_crop(image=image, size=crop_size, input_data_format=input_data_format) for image in images + ] if do_rescale: - images = [self.rescale(image=image, scale=rescale_factor) for image in images] + images = [ + self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format) + for image in images + ] if do_normalize: - images = [self.normalize(image=image, mean=image_mean, std=image_std) for image in images] - - images = [to_channel_dimension_format(image, data_format) for image in images] + images = [ + self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format) + for image in images + ] + + images = [ + to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images + ] data = {"pixel_values": images} return BatchFeature(data=data, tensor_type=return_tensors) diff --git a/src/transformers/models/clip/modeling_clip.py b/src/transformers/models/clip/modeling_clip.py index b59a3d244d01ad..06ee5f6e325db4 100644 --- a/src/transformers/models/clip/modeling_clip.py +++ b/src/transformers/models/clip/modeling_clip.py @@ -21,12 +21,15 @@ import torch import torch.utils.checkpoint from torch import nn +from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss from ...activations import ACT2FN -from ...modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling +from ...modeling_attn_mask_utils import _create_4d_causal_attention_mask, _prepare_4d_attention_mask +from ...modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling, ImageClassifierOutput from ...modeling_utils import PreTrainedModel from ...utils import ( ModelOutput, + add_code_sample_docstrings, add_start_docstrings, add_start_docstrings_to_model_forward, logging, @@ -37,31 +40,22 @@ logger = logging.get_logger(__name__) +# General docstring +_CONFIG_FOR_DOC = "CLIPConfig" _CHECKPOINT_FOR_DOC = "openai/clip-vit-base-patch32" +# Image classification docstring +_IMAGE_CLASS_CHECKPOINT = "openai/clip-vit-base-patch32" +_IMAGE_CLASS_EXPECTED_OUTPUT = "LABEL_0" + CLIP_PRETRAINED_MODEL_ARCHIVE_LIST = [ "openai/clip-vit-base-patch32", # See all CLIP models at https://huggingface.co/models?filter=clip ] -# Copied from transformers.models.bart.modeling_bart._expand_mask -def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None): - """ - Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`. 
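# The removed _expand_mask helper (and, further down, _build_causal_attention_mask) are replaced
# by the shared utilities imported above from modeling_attn_mask_utils. A shape sketch on a toy
# batch with one padded position:
import torch
from transformers.modeling_attn_mask_utils import (
    _create_4d_causal_attention_mask,
    _prepare_4d_attention_mask,
)

attention_mask = torch.tensor([[1, 1, 1, 0]])  # (batch, seq_len), last token is padding
expanded = _prepare_4d_attention_mask(attention_mask, torch.float32)
causal = _create_4d_causal_attention_mask((1, 4), torch.float32, device=attention_mask.device)
print(expanded.shape, causal.shape)  # torch.Size([1, 1, 4, 4]) torch.Size([1, 1, 4, 4])
print(expanded[0, 0, 0])             # 0 where attended, a large negative value at the padded slot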
- """ - bsz, src_len = mask.size() - tgt_len = tgt_len if tgt_len is not None else src_len - - expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype) - - inverted_mask = 1.0 - expanded_mask - - return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min) - - # contrastive loss function, adapted from -# https://sachinruk.github.io/blog/pytorch/pytorch%20lightning/loss%20function/gpu/2021/03/07/CLIP.html +# https://sachinruk.github.io/blog/2021-03-07-clip.html def contrastive_loss(logits: torch.Tensor) -> torch.Tensor: return nn.functional.cross_entropy(logits, torch.arange(len(logits), device=logits.device)) @@ -97,8 +91,8 @@ class CLIPVisionModelOutput(ModelOutput): image_embeds: Optional[torch.FloatTensor] = None last_hidden_state: torch.FloatTensor = None - hidden_states: Optional[Tuple[torch.FloatTensor]] = None - attentions: Optional[Tuple[torch.FloatTensor]] = None + hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None + attentions: Optional[Tuple[torch.FloatTensor, ...]] = None @dataclass @@ -126,8 +120,8 @@ class CLIPTextModelOutput(ModelOutput): text_embeds: Optional[torch.FloatTensor] = None last_hidden_state: torch.FloatTensor = None - hidden_states: Optional[Tuple[torch.FloatTensor]] = None - attentions: Optional[Tuple[torch.FloatTensor]] = None + hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None + attentions: Optional[Tuple[torch.FloatTensor, ...]] = None @dataclass @@ -188,11 +182,12 @@ def __init__(self, config: CLIPVisionConfig): self.num_patches = (self.image_size // self.patch_size) ** 2 self.num_positions = self.num_patches + 1 self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim) - self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1))) + self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1)), persistent=False) def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor: batch_size = pixel_values.shape[0] - patch_embeds = self.patch_embedding(pixel_values) # shape = [*, width, grid, grid] + target_dtype = self.patch_embedding.weight.dtype + patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype)) # shape = [*, width, grid, grid] patch_embeds = patch_embeds.flatten(2).transpose(1, 2) class_embeds = self.class_embedding.expand(batch_size, 1, -1) @@ -210,7 +205,9 @@ def __init__(self, config: CLIPTextConfig): self.position_embedding = nn.Embedding(config.max_position_embeddings, embed_dim) # position_ids (1, len position emb) is contiguous in memory and exported when serialized - self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1))) + self.register_buffer( + "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False + ) def forward( self, @@ -410,7 +407,6 @@ class CLIPPreTrainedModel(PreTrainedModel): config_class = CLIPConfig base_model_prefix = "clip" supports_gradient_checkpointing = True - _keys_to_ignore_on_load_missing = [r"position_ids"] def _init_weights(self, module): """Initialize the weights""" @@ -433,9 +429,7 @@ def _init_weights(self, module): nn.init.normal_(module.out_proj.weight, std=out_proj_std) elif isinstance(module, CLIPMLP): factor = self.config.initializer_factor - in_proj_std = ( - (module.config.hidden_size**-0.5) * ((2 * module.config.num_hidden_layers) ** -0.5) * factor - ) + in_proj_std = (module.config.hidden_size**-0.5) * ((2 * module.config.num_hidden_layers) ** -0.5) * factor fc_std = 
(2 * module.config.hidden_size) ** -0.5 * factor nn.init.normal_(module.fc1.weight, std=fc_std) nn.init.normal_(module.fc2.weight, std=in_proj_std) @@ -465,10 +459,6 @@ def _init_weights(self, module): if isinstance(module, nn.Linear) and module.bias is not None: module.bias.data.zero_() - def _set_gradient_checkpointing(self, module, value=False): - if isinstance(module, CLIPEncoder): - module.gradient_checkpointing = value - CLIP_START_DOCSTRING = r""" This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the @@ -637,18 +627,12 @@ def forward( if output_hidden_states: encoder_states = encoder_states + (hidden_states,) if self.gradient_checkpointing and self.training: - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, output_attentions) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(encoder_layer), + layer_outputs = self._gradient_checkpointing_func( + encoder_layer.__call__, hidden_states, attention_mask, causal_attention_mask, + output_attentions, ) else: layer_outputs = encoder_layer( @@ -682,6 +666,9 @@ def __init__(self, config: CLIPTextConfig): self.encoder = CLIPEncoder(config) self.final_layer_norm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps) + # For `pooled_output` computation + self.eos_token_id = config.eos_token_id + @add_start_docstrings_to_model_forward(CLIP_TEXT_INPUTS_DOCSTRING) @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=CLIPTextConfig) def forward( @@ -711,16 +698,15 @@ def forward( hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids) - bsz, seq_len = input_shape # CLIP's text model uses causal mask, prepare it here. # https://github.com/openai/CLIP/blob/cfcffb90e69f37bf2ff1e988237a0fbe41f33c04/clip/model.py#L324 - causal_attention_mask = self._build_causal_attention_mask(bsz, seq_len, hidden_states.dtype).to( - hidden_states.device + causal_attention_mask = _create_4d_causal_attention_mask( + input_shape, hidden_states.dtype, device=hidden_states.device ) # expand attention_mask if attention_mask is not None: # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] - attention_mask = _expand_mask(attention_mask, hidden_states.dtype) + attention_mask = _prepare_4d_attention_mask(attention_mask, hidden_states.dtype) encoder_outputs = self.encoder( inputs_embeds=hidden_states, @@ -734,13 +720,26 @@ def forward( last_hidden_state = encoder_outputs[0] last_hidden_state = self.final_layer_norm(last_hidden_state) - # text_embeds.shape = [batch_size, sequence_length, transformer.width] - # take features from the eot embedding (eot_token is the highest number in each sequence) - # casting to torch.int for onnx compatibility: argmax doesn't support int64 inputs with opset 14 - pooled_output = last_hidden_state[ - torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device), - input_ids.to(dtype=torch.int, device=last_hidden_state.device).argmax(dim=-1), - ] + if self.eos_token_id == 2: + # The `eos_token_id` was incorrect before PR #24773: Let's keep what have been done here. 
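The pooling change introduced in this hunk (and continued just below) stops assuming that the end-of-text token has the largest id in the sequence and instead looks up the first position equal to `config.eos_token_id`. A minimal sketch of the two index computations on a toy batch, using CLIP's real `49407` EOS id; the id `49408` stands in for a hypothetical token added after training, and `last_hidden_state` is random:

```python
import torch

eos_token_id = 49407  # CLIP's end-of-text id
input_ids = torch.tensor(
    [[49406, 320, 1125, 49407, 0],      # a normal sequence, zero-padded
     [49406, 49408, 2368, 49407, 0]]    # 49408: a token added after training
)
last_hidden_state = torch.randn(2, 5, 512)
batch_idx = torch.arange(input_ids.shape[0])

# legacy path (config.eos_token_id == 2): assumes EOS has the largest id, so the
# added token 49408 in the second row is picked instead of the real EOS position
legacy_pooled = last_hidden_state[batch_idx, input_ids.to(torch.int).argmax(dim=-1)]

# updated path: take the first position whose id equals `eos_token_id`
# (also robust when `pad_token_id` happens to equal `eos_token_id`)
eos_positions = (input_ids == eos_token_id).int().argmax(dim=-1)
pooled_output = last_hidden_state[batch_idx, eos_positions]
```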
+ # A CLIP model with such `eos_token_id` in the config can't work correctly with extra new tokens added + # ------------------------------------------------------------ + # text_embeds.shape = [batch_size, sequence_length, transformer.width] + # take features from the eot embedding (eot_token is the highest number in each sequence) + # casting to torch.int for onnx compatibility: argmax doesn't support int64 inputs with opset 14 + pooled_output = last_hidden_state[ + torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device), + input_ids.to(dtype=torch.int, device=last_hidden_state.device).argmax(dim=-1), + ] + else: + # The config gets updated `eos_token_id` from PR #24773 (so the use of exta new tokens is possible) + pooled_output = last_hidden_state[ + torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device), + # We need to get the first position of `eos_token_id` value (`pad_token_ids` might equal to `eos_token_id`) + (input_ids.to(dtype=torch.int, device=last_hidden_state.device) == self.eos_token_id) + .int() + .argmax(dim=-1), + ] if not return_dict: return (last_hidden_state, pooled_output) + encoder_outputs[1:] @@ -752,15 +751,6 @@ def forward( attentions=encoder_outputs.attentions, ) - def _build_causal_attention_mask(self, bsz, seq_len, dtype): - # lazily create causal attention mask, with full attention between the vision tokens - # pytorch uses additive attention mask; fill with -inf - mask = torch.empty(bsz, seq_len, seq_len, dtype=dtype) - mask.fill_(torch.tensor(torch.finfo(dtype).min)) - mask.triu_(1) # zero out the lower diagonal - mask = mask.unsqueeze(1) # expand mask - return mask - @add_start_docstrings( """The text model from CLIP without any head or projection on top.""", @@ -769,7 +759,7 @@ def _build_causal_attention_mask(self, bsz, seq_len, dtype): class CLIPTextModel(CLIPPreTrainedModel): config_class = CLIPTextConfig - _no_split_modules = ["CLIPEncoderLayer"] + _no_split_modules = ["CLIPTextEmbeddings", "CLIPEncoderLayer"] def __init__(self, config: CLIPTextConfig): super().__init__(config) @@ -888,6 +878,7 @@ def forward( class CLIPVisionModel(CLIPPreTrainedModel): config_class = CLIPVisionConfig main_input_name = "pixel_values" + _no_split_modules = ["CLIPEncoderLayer"] def __init__(self, config: CLIPVisionConfig): super().__init__(config) @@ -942,6 +933,7 @@ def forward( @add_start_docstrings(CLIP_START_DOCSTRING) class CLIPModel(CLIPPreTrainedModel): config_class = CLIPConfig + _no_split_modules = ["CLIPTextEmbeddings", "CLIPEncoderLayer"] def __init__(self, config: CLIPConfig): super().__init__(config) @@ -970,7 +962,7 @@ def __init__(self, config: CLIPConfig): self.visual_projection = nn.Linear(self.vision_embed_dim, self.projection_dim, bias=False) self.text_projection = nn.Linear(self.text_embed_dim, self.projection_dim, bias=False) - self.logit_scale = nn.Parameter(torch.ones([]) * self.config.logit_scale_init_value) + self.logit_scale = nn.Parameter(torch.tensor(self.config.logit_scale_init_value)) # Initialize weights and apply final processing self.post_init() @@ -1174,7 +1166,7 @@ def forward( class CLIPTextModelWithProjection(CLIPPreTrainedModel): config_class = CLIPTextConfig - _no_split_modules = ["CLIPEncoderLayer"] + _no_split_modules = ["CLIPTextEmbeddings", "CLIPEncoderLayer"] def __init__(self, config: CLIPTextConfig): super().__init__(config) @@ -1322,3 +1314,105 @@ def forward( hidden_states=vision_outputs.hidden_states, attentions=vision_outputs.attentions, ) + + +@add_start_docstrings( + """ + CLIP 
vision encoder with an image classification head on top (a linear layer on top of the pooled final hidden states of + the patch tokens) e.g. for ImageNet. + """, + CLIP_START_DOCSTRING, +) +class CLIPForImageClassification(CLIPPreTrainedModel): + main_input_name = "pixel_values" + + def __init__(self, config: CLIPConfig) -> None: + super().__init__(config) + + self.num_labels = config.num_labels + self.vision_model = CLIPVisionTransformer(config.vision_config) + + # Classifier head + self.classifier = ( + nn.Linear(config.vision_config.hidden_size, config.num_labels) if config.num_labels > 0 else nn.Identity() + ) + + # Initialize weights and apply final processing + self.post_init() + + @add_start_docstrings_to_model_forward(CLIP_INPUTS_DOCSTRING) + @add_code_sample_docstrings( + checkpoint=_IMAGE_CLASS_CHECKPOINT, + output_type=ImageClassifierOutput, + config_class=_CONFIG_FOR_DOC, + expected_output=_IMAGE_CLASS_EXPECTED_OUTPUT, + ) + def forward( + self, + pixel_values: Optional[torch.Tensor] = None, + labels: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[tuple, ImageClassifierOutput]: + r""" + labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*): + Labels for computing the image classification/regression loss. Indices should be in `[0, ..., + config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If + `config.num_labels > 1` a classification loss is computed (Cross-Entropy). + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.vision_model( + pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + sequence_output = outputs[0] + + # average pool the patch tokens + sequence_output = torch.mean(sequence_output[:, 1:, :], dim=1) + # apply classifier + logits = self.classifier(sequence_output) + + loss = None + if labels is not None: + # move labels to correct device to enable model parallelism + labels = labels.to(logits.device) + if self.config.problem_type is None: + if self.num_labels == 1: + self.config.problem_type = "regression" + elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int): + self.config.problem_type = "single_label_classification" + else: + self.config.problem_type = "multi_label_classification" + + if self.config.problem_type == "regression": + loss_fct = MSELoss() + if self.num_labels == 1: + loss = loss_fct(logits.squeeze(), labels.squeeze()) + else: + loss = loss_fct(logits, labels) + elif self.config.problem_type == "single_label_classification": + loss_fct = CrossEntropyLoss() + loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) + elif self.config.problem_type == "multi_label_classification": + loss_fct = BCEWithLogitsLoss() + loss = loss_fct(logits, labels) + + if not return_dict: + output = (logits,) + outputs[2:] + return ((loss,) + output) if loss is not None else output + + return ImageClassifierOutput( + loss=loss, + logits=logits, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) diff --git 
a/src/transformers/models/clip/modeling_flax_clip.py b/src/transformers/models/clip/modeling_flax_clip.py index cb8ee4e7c9a444..265e7005b74e0e 100644 --- a/src/transformers/models/clip/modeling_flax_clip.py +++ b/src/transformers/models/clip/modeling_flax_clip.py @@ -43,9 +43,10 @@ This model inherits from [`FlaxPreTrainedModel`]. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading, saving and converting weights from PyTorch models) - This model is also a Flax Linen [flax.linen.Module](https://flax.readthedocs.io/en/latest/flax.linen.html#module) - subclass. Use it as a regular Flax linen Module and refer to the Flax documentation for all matter related to - general usage and behavior. + This model is also a + [flax.linen.Module](https://flax.readthedocs.io/en/latest/api_reference/flax.linen/module.html) subclass. Use it as + a regular Flax linen Module and refer to the Flax documentation for all matter related to general usage and + behavior. Finally, this model supports inherent JAX features such as: @@ -155,6 +156,36 @@ """ +@flax.struct.dataclass +class FlaxCLIPTextModelOutput(ModelOutput): + """ + Base class for text model's outputs that also contains a pooling of the last hidden states. + + Args: + text_embeds (`jnp.ndarray` of shape `(batch_size, output_dim`): + The text embeddings obtained by applying the projection layer to the pooled output of + [`FlaxCLIPTextModel`]. + last_hidden_state (`jnp.ndarray` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + hidden_states (`tuple(jnp.ndarray)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `jnp.ndarray` (one for the output of the embeddings + one for the output of each layer) of shape + `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + attentions (`tuple(jnp.ndarray)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `jnp.ndarray` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + """ + + text_embeds: jnp.ndarray = None + last_hidden_state: jnp.ndarray = None + hidden_states: Optional[Tuple[jnp.ndarray, ...]] = None + attentions: Optional[Tuple[jnp.ndarray, ...]] = None + + @flax.struct.dataclass class FlaxCLIPOutput(ModelOutput): """ @@ -487,6 +518,9 @@ def setup(self): self.encoder = FlaxCLIPEncoder(self.config, dtype=self.dtype) self.final_layer_norm = nn.LayerNorm(epsilon=self.config.layer_norm_eps, dtype=self.dtype) + # For `pooled_output` computation + self.eos_token_id = self.config.eos_token_id + def __call__( self, input_ids, @@ -517,9 +551,18 @@ def __call__( last_hidden_state = encoder_outputs[0] last_hidden_state = self.final_layer_norm(last_hidden_state) - # text_embeds.shape = [batch_size, sequence_length, transformer.width] - # take features from the EOS embedding (eos_token_id is the highest number in each sequence) - pooled_output = last_hidden_state[jnp.arange(last_hidden_state.shape[0]), input_ids.argmax(axis=-1)] + if self.eos_token_id == 2: + # The `eos_token_id` was incorrect before PR #24773: Let's keep what have been done here. 
+ # A CLIP model with such `eos_token_id` in the config can't work correctly with extra new tokens added + # ------------------------------------------------------------ + # text_embeds.shape = [batch_size, sequence_length, transformer.width] + # take features from the EOS embedding (eos_token_id is the highest number in each sequence) + pooled_output = last_hidden_state[jnp.arange(last_hidden_state.shape[0]), input_ids.argmax(axis=-1)] + else: + # (no need to cast from bool to int after comparing to `eos_token_id`) + pooled_output = last_hidden_state[ + jnp.arange(last_hidden_state.shape[0]), (input_ids == self.eos_token_id).argmax(axis=-1) + ] if not return_dict: return (last_hidden_state, pooled_output) + encoder_outputs[1:] @@ -995,6 +1038,78 @@ class FlaxCLIPTextModel(FlaxCLIPTextPreTrainedModel): ) +class FlaxCLIPTextModelWithProjectionModule(nn.Module): + config: CLIPTextConfig + dtype: jnp.dtype = jnp.float32 + + def setup(self): + self.text_model = FlaxCLIPTextTransformer(self.config, dtype=self.dtype) + self.text_projection = nn.Dense(self.config.projection_dim, use_bias=False, dtype=self.dtype) + + def __call__( + self, + input_ids, + attention_mask, + position_ids, + deterministic: bool = True, + output_attentions: bool = False, + output_hidden_states: bool = False, + return_dict: bool = True, + ): + text_outputs = self.text_model( + input_ids=input_ids, + attention_mask=attention_mask, + position_ids=position_ids, + deterministic=deterministic, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + pooled_output = text_outputs[1] + text_embeds = self.text_projection(pooled_output) + + if not return_dict: + return (text_embeds, text_outputs[0]) + text_outputs[2:] + + return FlaxCLIPTextModelOutput( + text_embeds=text_embeds, + last_hidden_state=text_outputs.last_hidden_state, + hidden_states=text_outputs.hidden_states, + attentions=text_outputs.attentions, + ) + + +class FlaxCLIPTextModelWithProjection(FlaxCLIPTextPreTrainedModel): + module_class = FlaxCLIPTextModelWithProjectionModule + + +FLAX_CLIP_TEXT_MODEL_WITH_PROJECTION_DOCSTRING = """ + Returns: + + Example: + + ```python + >>> from transformers import AutoTokenizer, FlaxCLIPTextModelWithProjection + + >>> model = FlaxCLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32") + >>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32") + + >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="np") + + >>> outputs = model(**inputs) + >>> text_embeds = outputs.text_embeds + ``` +""" + +overwrite_call_docstring( + FlaxCLIPTextModelWithProjection, CLIP_TEXT_INPUTS_DOCSTRING + FLAX_CLIP_TEXT_MODEL_WITH_PROJECTION_DOCSTRING +) +append_replace_return_docstrings( + FlaxCLIPTextModelWithProjection, output_type=FlaxCLIPTextModelOutput, config_class=CLIPTextConfig +) + + class FlaxCLIPVisionModule(nn.Module): config: CLIPVisionConfig dtype: jnp.dtype = jnp.float32 diff --git a/src/transformers/models/clip/modeling_tf_clip.py b/src/transformers/models/clip/modeling_tf_clip.py index 680ea91ca1383a..d8dd7f0bd83c40 100644 --- a/src/transformers/models/clip/modeling_tf_clip.py +++ b/src/transformers/models/clip/modeling_tf_clip.py @@ -15,9 +15,11 @@ """ TF 2.0 CLIP model.""" +from __future__ import annotations + import math from dataclasses import dataclass -from typing import Any, Dict, Optional, Tuple, Union +from typing import Any, Optional, Tuple, Union import numpy as np import tensorflow as 
tf @@ -27,14 +29,14 @@ # Public API from ...modeling_tf_utils import ( - DUMMY_INPUTS, TFModelInputType, TFPreTrainedModel, get_initializer, + keras, keras_serializable, unpack_inputs, ) -from ...tf_utils import shape_list, stable_softmax +from ...tf_utils import check_embeddings_within_bounds, shape_list, stable_softmax from ...utils import ( ModelOutput, add_start_docstrings, @@ -76,7 +78,7 @@ def _expand_mask(mask: tf.Tensor, tgt_len: Optional[int] = None): # https://sachinruk.github.io/blog/pytorch/pytorch%20lightning/loss%20function/gpu/2021/03/07/CLIP.html def contrastive_loss(logits: tf.Tensor) -> tf.Tensor: return tf.math.reduce_mean( - tf.keras.metrics.sparse_categorical_crossentropy( + keras.metrics.sparse_categorical_crossentropy( y_true=tf.range(shape_list(logits)[0]), y_pred=logits, from_logits=True ) ) @@ -111,7 +113,7 @@ class TFCLIPOutput(ModelOutput): The output of the [`TFCLIPVisionModel`]. """ - loss: Optional[tf.Tensor] = None + loss: tf.Tensor | None = None logits_per_image: tf.Tensor = None logits_per_text: tf.Tensor = None text_embeds: tf.Tensor = None @@ -126,7 +128,7 @@ def to_tuple(self) -> Tuple[Any]: ) -class TFCLIPVisionEmbeddings(tf.keras.layers.Layer): +class TFCLIPVisionEmbeddings(keras.layers.Layer): def __init__(self, config: CLIPVisionConfig, **kwargs): super().__init__(**kwargs) @@ -139,7 +141,7 @@ def __init__(self, config: CLIPVisionConfig, **kwargs): self.config = config - self.patch_embedding = tf.keras.layers.Conv2D( + self.patch_embedding = keras.layers.Conv2D( filters=self.embed_dim, kernel_size=self.patch_size, strides=self.patch_size, @@ -150,7 +152,7 @@ def __init__(self, config: CLIPVisionConfig, **kwargs): name="patch_embedding", ) - def build(self, input_shape: tf.TensorShape): + def build(self, input_shape: tf.TensorShape = None): factor = self.config.initializer_factor self.class_embedding = self.add_weight( @@ -168,7 +170,12 @@ def build(self, input_shape: tf.TensorShape): name="embeddings", ) - super().build(input_shape) + if self.built: + return + self.built = True + if getattr(self, "patch_embedding", None) is not None: + with tf.name_scope(self.patch_embedding.name): + self.patch_embedding.build([None, None, None, self.config.num_channels]) def call(self, pixel_values: tf.Tensor) -> tf.Tensor: """`pixel_values` is expected to be of NCHW format.""" @@ -195,7 +202,7 @@ def call(self, pixel_values: tf.Tensor) -> tf.Tensor: return embeddings -class TFCLIPTextEmbeddings(tf.keras.layers.Layer): +class TFCLIPTextEmbeddings(keras.layers.Layer): def __init__(self, config: CLIPTextConfig, **kwargs): super().__init__(**kwargs) @@ -203,7 +210,7 @@ def __init__(self, config: CLIPTextConfig, **kwargs): self.config = config - def build(self, input_shape: tf.TensorShape): + def build(self, input_shape: tf.TensorShape = None): with tf.name_scope("token_embedding"): self.weight = self.add_weight( shape=(self.config.vocab_size, self.embed_dim), @@ -238,16 +245,7 @@ def call( raise ValueError("You have to specify either input_ids or inputs_embeds") if inputs_embeds is None: - # Note: tf.gather, on which the embedding layer is based, won't check positive out of bound - # indices on GPU, returning zeros instead. This is a dangerous silent behavior. 
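The inline assertion removed below is replaced by the shared `check_embeddings_within_bounds` utility, for the reason the deleted comment gives: on GPU, `tf.gather` does not raise for out-of-range indices and silently returns zeros. A sketch of the kind of guard involved, mirroring the removed inline check (the helper name `assert_ids_within_vocab` is hypothetical and the shared utility may differ in wording):

```python
import tensorflow as tf


def assert_ids_within_vocab(input_ids: tf.Tensor, vocab_size: int) -> None:
    # tf.gather returns zeros for out-of-range indices on GPU instead of raising,
    # so token ids are validated explicitly before the embedding lookup.
    tf.debugging.assert_less(
        input_ids,
        tf.cast(vocab_size, dtype=input_ids.dtype),
        message=f"input_ids must be smaller than the embedding layer's input dimension ({vocab_size})",
    )


input_ids = tf.constant([[1, 5, 9]], dtype=tf.int32)
assert_ids_within_vocab(input_ids, vocab_size=10)  # passes; any id >= 10 would raise
```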
- tf.debugging.assert_less( - input_ids, - tf.cast(self.config.vocab_size, dtype=input_ids.dtype), - message=( - "input_ids must be smaller than the embedding layer's input dimension (got" - f" {tf.math.reduce_max(input_ids)} >= {self.config.vocab_size})" - ), - ) + check_embeddings_within_bounds(input_ids, self.config.vocab_size) inputs_embeds = tf.gather(params=self.weight, indices=input_ids) input_shape = shape_list(inputs_embeds)[:-1] @@ -262,7 +260,7 @@ def call( return final_embeddings -class TFCLIPAttention(tf.keras.layers.Layer): +class TFCLIPAttention(keras.layers.Layer): """Multi-headed attention from 'Attention Is All You Need' paper""" def __init__(self, config: CLIPConfig, **kwargs): @@ -283,19 +281,19 @@ def __init__(self, config: CLIPConfig, **kwargs): self.sqrt_att_head_size = math.sqrt(self.attention_head_size) - self.q_proj = tf.keras.layers.Dense( + self.q_proj = keras.layers.Dense( units=self.embed_dim, kernel_initializer=get_initializer(in_proj_std), name="q_proj" ) - self.k_proj = tf.keras.layers.Dense( + self.k_proj = keras.layers.Dense( units=self.embed_dim, kernel_initializer=get_initializer(in_proj_std), name="k_proj" ) - self.v_proj = tf.keras.layers.Dense( + self.v_proj = keras.layers.Dense( units=self.embed_dim, kernel_initializer=get_initializer(in_proj_std), name="v_proj" ) - self.dropout = tf.keras.layers.Dropout(rate=config.attention_dropout) + self.dropout = keras.layers.Dropout(rate=config.attention_dropout) - self.out_proj = tf.keras.layers.Dense( + self.out_proj = keras.layers.Dense( units=self.embed_dim, kernel_initializer=get_initializer(out_proj_std), name="out_proj" ) @@ -360,8 +358,25 @@ def call( return outputs - -class TFCLIPMLP(tf.keras.layers.Layer): + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "q_proj", None) is not None: + with tf.name_scope(self.q_proj.name): + self.q_proj.build([None, None, self.embed_dim]) + if getattr(self, "k_proj", None) is not None: + with tf.name_scope(self.k_proj.name): + self.k_proj.build([None, None, self.embed_dim]) + if getattr(self, "v_proj", None) is not None: + with tf.name_scope(self.v_proj.name): + self.v_proj.build([None, None, self.embed_dim]) + if getattr(self, "out_proj", None) is not None: + with tf.name_scope(self.out_proj.name): + self.out_proj.build([None, None, self.embed_dim]) + + +class TFCLIPMLP(keras.layers.Layer): def __init__(self, config: CLIPConfig, **kwargs): super().__init__(**kwargs) @@ -371,12 +386,13 @@ def __init__(self, config: CLIPConfig, **kwargs): in_proj_std = (config.hidden_size**-0.5) * ((2 * config.num_hidden_layers) ** -0.5) * factor fc_std = (2 * config.hidden_size) ** -0.5 * factor - self.fc1 = tf.keras.layers.Dense( + self.fc1 = keras.layers.Dense( units=config.intermediate_size, kernel_initializer=get_initializer(fc_std), name="fc1" ) - self.fc2 = tf.keras.layers.Dense( + self.fc2 = keras.layers.Dense( units=config.hidden_size, kernel_initializer=get_initializer(in_proj_std), name="fc2" ) + self.config = config def call(self, hidden_states: tf.Tensor) -> tf.Tensor: hidden_states = self.fc1(inputs=hidden_states) @@ -384,16 +400,27 @@ def call(self, hidden_states: tf.Tensor) -> tf.Tensor: hidden_states = self.fc2(inputs=hidden_states) return hidden_states + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "fc1", None) is not None: + with tf.name_scope(self.fc1.name): + self.fc1.build([None, None, self.config.hidden_size]) + if getattr(self, "fc2", None) is not None: + 
with tf.name_scope(self.fc2.name): + self.fc2.build([None, None, self.config.intermediate_size]) + -class TFCLIPEncoderLayer(tf.keras.layers.Layer): +class TFCLIPEncoderLayer(keras.layers.Layer): def __init__(self, config: CLIPConfig, **kwargs): super().__init__(**kwargs) self.embed_dim = config.hidden_size self.self_attn = TFCLIPAttention(config, name="self_attn") - self.layer_norm1 = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layer_norm1") + self.layer_norm1 = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layer_norm1") self.mlp = TFCLIPMLP(config, name="mlp") - self.layer_norm2 = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layer_norm2") + self.layer_norm2 = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layer_norm2") def call( self, @@ -436,8 +463,25 @@ def call( return outputs - -class TFCLIPEncoder(tf.keras.layers.Layer): + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "self_attn", None) is not None: + with tf.name_scope(self.self_attn.name): + self.self_attn.build(None) + if getattr(self, "layer_norm1", None) is not None: + with tf.name_scope(self.layer_norm1.name): + self.layer_norm1.build([None, None, self.embed_dim]) + if getattr(self, "mlp", None) is not None: + with tf.name_scope(self.mlp.name): + self.mlp.build(None) + if getattr(self, "layer_norm2", None) is not None: + with tf.name_scope(self.layer_norm2.name): + self.layer_norm2.build([None, None, self.embed_dim]) + + +class TFCLIPEncoder(keras.layers.Layer): """ Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a [`TFCLIPEncoderLayer`]. @@ -491,16 +535,27 @@ def call( last_hidden_state=hidden_states, hidden_states=all_hidden_states, attentions=all_attentions ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "layers", None) is not None: + for layer in self.layers: + with tf.name_scope(layer.name): + layer.build(None) + -class TFCLIPTextTransformer(tf.keras.layers.Layer): +class TFCLIPTextTransformer(keras.layers.Layer): def __init__(self, config: CLIPTextConfig, **kwargs): super().__init__(**kwargs) self.embeddings = TFCLIPTextEmbeddings(config, name="embeddings") self.encoder = TFCLIPEncoder(config, name="encoder") - self.final_layer_norm = tf.keras.layers.LayerNormalization( - epsilon=config.layer_norm_eps, name="final_layer_norm" - ) + self.final_layer_norm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="final_layer_norm") + + # For `pooled_output` computation + self.eos_token_id = config.eos_token_id + self.embed_dim = config.hidden_size def call( self, @@ -538,14 +593,30 @@ def call( sequence_output = encoder_outputs[0] sequence_output = self.final_layer_norm(inputs=sequence_output) - # text_embeds.shape = [batch_size, n_ctx, transformer.width] - # take features from the eot embedding (eot_token is the highest number in each sequence) - pooled_output = tf.gather_nd( - params=sequence_output, - indices=tf.stack( - values=(tf.range(input_shape[0], dtype=tf.int64), tf.math.argmax(input_ids, axis=-1)), axis=1 - ), - ) + if self.eos_token_id == 2: + # The `eos_token_id` was incorrect before PR #24773: Let's keep what have been done here. 
+ # A CLIP model with such `eos_token_id` in the config can't work correctly with extra new tokens added + # ------------------------------------------------------------ + # text_embeds.shape = [batch_size, n_ctx, transformer.width] + # take features from the eot embedding (eot_token is the highest number in each sequence) + pooled_output = tf.gather_nd( + params=sequence_output, + indices=tf.stack( + values=(tf.range(input_shape[0], dtype=tf.int64), tf.math.argmax(input_ids, axis=-1)), axis=1 + ), + ) + else: + # The config gets updated `eos_token_id` from PR #24773 (so the use of exta new tokens is possible) + pooled_output = tf.gather_nd( + params=sequence_output, + indices=tf.stack( + values=( + tf.range(input_shape[0], dtype=tf.int64), + tf.math.argmax(tf.cast(input_ids == self.eos_token_id, dtype=tf.int8), axis=-1), + ), + axis=1, + ), + ) if not return_dict: return (sequence_output, pooled_output) + encoder_outputs[1:] @@ -575,9 +646,23 @@ def _build_causal_attention_mask(self, batch_size, seq_length, dtype=tf.float32) return tf.broadcast_to(input=to_mask, shape=(batch_size, 1, seq_length, seq_length)) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "embeddings", None) is not None: + with tf.name_scope(self.embeddings.name): + self.embeddings.build(None) + if getattr(self, "encoder", None) is not None: + with tf.name_scope(self.encoder.name): + self.encoder.build(None) + if getattr(self, "final_layer_norm", None) is not None: + with tf.name_scope(self.final_layer_norm.name): + self.final_layer_norm.build([None, None, self.embed_dim]) + @keras_serializable -class TFCLIPTextMainLayer(tf.keras.layers.Layer): +class TFCLIPTextMainLayer(keras.layers.Layer): config_class = CLIPTextConfig def __init__(self, config: CLIPTextConfig, **kwargs): @@ -585,7 +670,7 @@ def __init__(self, config: CLIPTextConfig, **kwargs): self.config = config self.text_model = TFCLIPTextTransformer(config, name="text_model") - def get_input_embeddings(self) -> tf.keras.layers.Layer: + def get_input_embeddings(self) -> keras.layers.Layer: return self.text_model.embeddings def set_input_embeddings(self, value: tf.Variable): @@ -595,9 +680,9 @@ def set_input_embeddings(self, value: tf.Variable): @unpack_inputs def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, @@ -623,15 +708,24 @@ def call( return text_model_outputs + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "text_model", None) is not None: + with tf.name_scope(self.text_model.name): + self.text_model.build(None) -class TFCLIPVisionTransformer(tf.keras.layers.Layer): + +class TFCLIPVisionTransformer(keras.layers.Layer): def __init__(self, config: CLIPVisionConfig, **kwargs): super().__init__(**kwargs) self.embeddings = TFCLIPVisionEmbeddings(config, name="embeddings") - self.pre_layernorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="pre_layrnorm") + self.pre_layernorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="pre_layrnorm") self.encoder = TFCLIPEncoder(config, name="encoder") - 
self.post_layernorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="post_layernorm") + self.post_layernorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="post_layernorm") + self.embed_dim = config.hidden_size def call( self, @@ -668,9 +762,26 @@ def call( attentions=encoder_outputs.attentions, ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "embeddings", None) is not None: + with tf.name_scope(self.embeddings.name): + self.embeddings.build(None) + if getattr(self, "pre_layernorm", None) is not None: + with tf.name_scope(self.pre_layernorm.name): + self.pre_layernorm.build([None, None, self.embed_dim]) + if getattr(self, "encoder", None) is not None: + with tf.name_scope(self.encoder.name): + self.encoder.build(None) + if getattr(self, "post_layernorm", None) is not None: + with tf.name_scope(self.post_layernorm.name): + self.post_layernorm.build([None, self.embed_dim]) + @keras_serializable -class TFCLIPVisionMainLayer(tf.keras.layers.Layer): +class TFCLIPVisionMainLayer(keras.layers.Layer): config_class = CLIPVisionConfig def __init__(self, config: CLIPVisionConfig, **kwargs): @@ -678,13 +789,13 @@ def __init__(self, config: CLIPVisionConfig, **kwargs): self.config = config self.vision_model = TFCLIPVisionTransformer(config, name="vision_model") - def get_input_embeddings(self) -> tf.keras.layers.Layer: + def get_input_embeddings(self) -> keras.layers.Layer: return self.vision_model.embeddings @unpack_inputs def call( self, - pixel_values: Optional[TFModelInputType] = None, + pixel_values: TFModelInputType | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, @@ -703,9 +814,17 @@ def call( return vision_model_outputs + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "vision_model", None) is not None: + with tf.name_scope(self.vision_model.name): + self.vision_model.build(None) + @keras_serializable -class TFCLIPMainLayer(tf.keras.layers.Layer): +class TFCLIPMainLayer(keras.layers.Layer): config_class = CLIPConfig def __init__(self, config: CLIPConfig, **kwargs): @@ -733,36 +852,52 @@ def __init__(self, config: CLIPConfig, **kwargs): self.text_model = TFCLIPTextTransformer(text_config, name="text_model") self.vision_model = TFCLIPVisionTransformer(vision_config, name="vision_model") - self.visual_projection = tf.keras.layers.Dense( + self.visual_projection = keras.layers.Dense( units=self.projection_dim, kernel_initializer=get_initializer(vision_config.hidden_size**-0.5 * self.config.initializer_factor), use_bias=False, name="visual_projection", ) - self.text_projection = tf.keras.layers.Dense( + self.text_projection = keras.layers.Dense( units=self.projection_dim, kernel_initializer=get_initializer(text_config.hidden_size**-0.5 * self.config.initializer_factor), use_bias=False, name="text_projection", ) + self.text_embed_dim = text_config.hidden_size + self.vision_embed_dim = vision_config.hidden_size - def build(self, input_shape: tf.TensorShape): + def build(self, input_shape: tf.TensorShape = None): self.logit_scale = self.add_weight( shape=(1,), - initializer=tf.keras.initializers.Constant(self.config.logit_scale_init_value), + initializer=keras.initializers.Constant(self.config.logit_scale_init_value), trainable=True, name="logit_scale", ) - super().build(input_shape) + if self.built: + return + self.built = True + if getattr(self, "text_model", 
None) is not None: + with tf.name_scope(self.text_model.name): + self.text_model.build(None) + if getattr(self, "vision_model", None) is not None: + with tf.name_scope(self.vision_model.name): + self.vision_model.build(None) + if getattr(self, "visual_projection", None) is not None: + with tf.name_scope(self.visual_projection.name): + self.visual_projection.build([None, None, self.vision_embed_dim]) + if getattr(self, "text_projection", None) is not None: + with tf.name_scope(self.text_projection.name): + self.text_projection.build([None, None, self.text_embed_dim]) @unpack_inputs def get_text_features( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, @@ -794,7 +929,7 @@ def get_text_features( @unpack_inputs def get_image_features( self, - pixel_values: Optional[TFModelInputType] = None, + pixel_values: TFModelInputType | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, @@ -819,10 +954,10 @@ def get_image_features( @unpack_inputs def call( self, - input_ids: Optional[TFModelInputType] = None, - pixel_values: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + pixel_values: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, return_loss: Optional[bool] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, @@ -900,6 +1035,8 @@ class TFCLIPPreTrainedModel(TFPreTrainedModel): config_class = CLIPConfig base_model_prefix = "clip" + _keys_to_ignore_on_load_missing = [r"position_ids"] + _keys_to_ignore_on_load_unexpected = [r"position_ids"] CLIP_START_DOCSTRING = r""" @@ -908,7 +1045,7 @@ class TFCLIPPreTrainedModel(TFPreTrainedModel): library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.) - This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it + This model is also a [keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage and behavior. @@ -949,7 +1086,7 @@ class TFCLIPPreTrainedModel(TFPreTrainedModel): input_ids (`np.ndarray`, `tf.Tensor`, `List[tf.Tensor]` ``Dict[str, tf.Tensor]` or `Dict[str, np.ndarray]` and each example must have the shape `({0})`): Indices of input sequence tokens in the vocabulary. - Indices can be obtained using [`BertTokenizer`]. See [`PreTrainedTokenizer.__call__`] and + Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.__call__`] and [`PreTrainedTokenizer.encode`] for details. 
[What are input IDs?](../glossary#input-ids) @@ -1006,7 +1143,7 @@ class TFCLIPPreTrainedModel(TFPreTrainedModel): input_ids (`np.ndarray`, `tf.Tensor`, `List[tf.Tensor]` ``Dict[str, tf.Tensor]` or `Dict[str, np.ndarray]` and each example must have the shape `({0})`): Indices of input sequence tokens in the vocabulary. - Indices can be obtained using [`BertTokenizer`]. See [`PreTrainedTokenizer.__call__`] and + Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.__call__`] and [`PreTrainedTokenizer.encode`] for details. [What are input IDs?](../glossary#input-ids) @@ -1057,9 +1194,9 @@ def __init__(self, config: CLIPTextConfig, *inputs, **kwargs): @replace_return_docstrings(output_type=TFBaseModelOutputWithPooling, config_class=CLIPTextConfig) def call( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, @@ -1095,28 +1232,13 @@ def call( return outputs - @tf.function( - input_signature=[ - { - "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"), - "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"), - } - ] - ) - def serving(self, inputs: Dict[str, tf.Tensor]) -> TFBaseModelOutputWithPooling: - output = self.call(inputs) - return self.serving_output(output) - - def serving_output(self, output: TFBaseModelOutputWithPooling) -> TFBaseModelOutputWithPooling: - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - - return TFBaseModelOutputWithPooling( - last_hidden_state=output.last_hidden_state, - pooler_output=output.pooler_output, - hidden_states=hs, - attentions=attns, - ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "clip", None) is not None: + with tf.name_scope(self.clip.name): + self.clip.build(None) class TFCLIPVisionModel(TFCLIPPreTrainedModel): @@ -1128,44 +1250,12 @@ def __init__(self, config: CLIPVisionConfig, *inputs, **kwargs): self.clip = TFCLIPVisionMainLayer(config, name="clip") - @property - def dummy_inputs(self) -> Dict[str, tf.Tensor]: - """ - Dummy inputs to build the network. - - Returns: - `Dict[str, tf.Tensor]`: The dummy inputs. - """ - VISION_DUMMY_INPUTS = tf.random.uniform( - shape=(len(DUMMY_INPUTS), 3, self.config.image_size, self.config.image_size), dtype=tf.float32 - ) - return {"pixel_values": VISION_DUMMY_INPUTS} - - @tf.function( - input_signature=[ - { - "pixel_values": tf.TensorSpec((None, None, None, None), tf.float32, name="pixel_values"), - } - ] - ) - def serving(self, inputs: Dict[str, tf.Tensor]) -> TFBaseModelOutputWithPooling: - """ - Method used for serving the model. - - Args: - inputs (`Dict[str, tf.Tensor]`): - The input of the saved model as a dictionary of tensors. 
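Throughout this TF refactor, the implicit shape inference via `dummy_inputs` and `serving` signatures is dropped in favour of explicit `build()` methods that construct each sublayer under its own name scope, so weights can be created (and loaded) without running a forward pass on dummy inputs. A sketch of the recurring pattern on a hypothetical layer (`MyBlock` is illustrative, not part of the diff):

```python
import tensorflow as tf
from tensorflow import keras


class MyBlock(keras.layers.Layer):
    def __init__(self, hidden_size: int, **kwargs):
        super().__init__(**kwargs)
        self.hidden_size = hidden_size
        self.dense = keras.layers.Dense(hidden_size, name="dense")

    def build(self, input_shape=None):
        if self.built:
            return
        self.built = True
        # Build sublayers explicitly, under their own name scopes, so weight names
        # stay stable regardless of how (or whether) the layer is first called.
        if getattr(self, "dense", None) is not None:
            with tf.name_scope(self.dense.name):
                self.dense.build([None, None, self.hidden_size])


block = MyBlock(hidden_size=8)
block.build(None)
print([w.shape for w in block.weights])  # [(8, 8), (8,)]
```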
- """ - output = self.call(inputs) - - return self.serving_output(output) - @unpack_inputs @add_start_docstrings_to_model_forward(CLIP_VISION_INPUTS_DOCSTRING) @replace_return_docstrings(output_type=TFBaseModelOutputWithPooling, config_class=CLIPVisionConfig) def call( self, - pixel_values: Optional[TFModelInputType] = None, + pixel_values: TFModelInputType | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, @@ -1204,16 +1294,13 @@ def call( return outputs - def serving_output(self, output: TFBaseModelOutputWithPooling) -> TFBaseModelOutputWithPooling: - hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None - attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None - - return TFBaseModelOutputWithPooling( - last_hidden_state=output.last_hidden_state, - pooler_output=output.pooler_output, - hidden_states=hs, - attentions=attns, - ) + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "clip", None) is not None: + with tf.name_scope(self.clip.name): + self.clip.build(None) @add_start_docstrings(CLIP_START_DOCSTRING) @@ -1225,51 +1312,13 @@ def __init__(self, config: CLIPConfig, *inputs, **kwargs): self.clip = TFCLIPMainLayer(config, name="clip") - @property - def dummy_inputs(self) -> Dict[str, tf.Tensor]: - """ - Dummy inputs to build the network. - - Returns: - `Dict[str, tf.Tensor]`: The dummy inputs. - """ - VISION_DUMMY_INPUTS = tf.random.uniform( - shape=(len(DUMMY_INPUTS), 3, self.config.vision_config.image_size, self.config.vision_config.image_size), - dtype=tf.float32, - ) - return { - "input_ids": tf.constant(DUMMY_INPUTS, dtype=tf.int32), - "pixel_values": VISION_DUMMY_INPUTS, - } - - @tf.function( - input_signature=[ - { - "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"), - "pixel_values": tf.TensorSpec((None, None, None, None), tf.float32, name="pixel_values"), - "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"), - } - ] - ) - def serving(self, inputs: Dict[str, tf.Tensor]) -> TFCLIPOutput: - """ - Method used for serving the model. - - Args: - inputs (`Dict[str, tf.Tensor]`): - The input of the saved model as a dictionary of tensors. 
- """ - output = self.call(inputs) - - return self.serving_output(output) - @unpack_inputs @add_start_docstrings_to_model_forward(CLIP_TEXT_INPUTS_DOCSTRING.format("batch_size, sequence_length")) def get_text_features( self, - input_ids: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, @@ -1307,7 +1356,7 @@ def get_text_features( @add_start_docstrings_to_model_forward(CLIP_VISION_INPUTS_DOCSTRING) def get_image_features( self, - pixel_values: Optional[TFModelInputType] = None, + pixel_values: TFModelInputType | None = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None, @@ -1350,10 +1399,10 @@ def get_image_features( @replace_return_docstrings(output_type=TFCLIPOutput, config_class=CLIPConfig) def call( self, - input_ids: Optional[TFModelInputType] = None, - pixel_values: Optional[TFModelInputType] = None, - attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None, - position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None, + input_ids: TFModelInputType | None = None, + pixel_values: TFModelInputType | None = None, + attention_mask: np.ndarray | tf.Tensor | None = None, + position_ids: np.ndarray | tf.Tensor | None = None, return_loss: Optional[bool] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, @@ -1404,3 +1453,11 @@ def serving_output(self, output: TFCLIPOutput) -> TFCLIPOutput: # TensorFlow cannot trace through nested dataclasses. Reference: # https://github.com/huggingface/transformers/pull/16886 return output + + def build(self, input_shape=None): + if self.built: + return + self.built = True + if getattr(self, "clip", None) is not None: + with tf.name_scope(self.clip.name): + self.clip.build(None) diff --git a/src/transformers/models/clip/processing_clip.py b/src/transformers/models/clip/processing_clip.py index 3e2f438d263e1d..31351f31efc5fb 100644 --- a/src/transformers/models/clip/processing_clip.py +++ b/src/transformers/models/clip/processing_clip.py @@ -30,16 +30,18 @@ class CLIPProcessor(ProcessorMixin): [`~CLIPProcessor.__call__`] and [`~CLIPProcessor.decode`] for more information. Args: - image_processor ([`CLIPImageProcessor`]): + image_processor ([`CLIPImageProcessor`], *optional*): The image processor is a required input. - tokenizer ([`CLIPTokenizerFast`]): + tokenizer ([`CLIPTokenizerFast`], *optional*): The tokenizer is a required input. 
""" + attributes = ["image_processor", "tokenizer"] image_processor_class = "CLIPImageProcessor" tokenizer_class = ("CLIPTokenizer", "CLIPTokenizerFast") def __init__(self, image_processor=None, tokenizer=None, **kwargs): + feature_extractor = None if "feature_extractor" in kwargs: warnings.warn( "The `feature_extractor` argument is deprecated and will be removed in v5, use `image_processor`" diff --git a/src/transformers/models/clip/tokenization_clip.py b/src/transformers/models/clip/tokenization_clip.py index e3ff5f8626fa6f..f62ef65c5ede02 100644 --- a/src/transformers/models/clip/tokenization_clip.py +++ b/src/transformers/models/clip/tokenization_clip.py @@ -126,20 +126,30 @@ class BasicTokenizer(object): strip_accents (`bool`, *optional*): Whether or not to strip all accents. If this option is not specified, then it will be determined by the value for `lowercase` (as in the original BERT). + do_split_on_punc (`bool`, *optional*, defaults to `True`): + In some instances we want to skip the basic punctuation splitting so that later tokenization can capture + the full context of the words, such as contractions. """ - def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True, strip_accents=None): + def __init__( + self, + do_lower_case=True, + never_split=None, + tokenize_chinese_chars=True, + strip_accents=None, + do_split_on_punc=True, + ): if never_split is None: never_split = [] self.do_lower_case = do_lower_case self.never_split = set(never_split) self.tokenize_chinese_chars = tokenize_chinese_chars self.strip_accents = strip_accents + self.do_split_on_punc = do_split_on_punc def tokenize(self, text, never_split=None): """ - Basic Tokenization of a piece of text. Split on "white spaces" only, for sub-word tokenization, see - WordPieceTokenizer. + Basic Tokenization of a piece of text. For sub-word tokenization, see WordPieceTokenizer. Args: never_split (`List[str]`, *optional*) @@ -158,7 +168,9 @@ def tokenize(self, text, never_split=None): # words in the English Wikipedia.). if self.tokenize_chinese_chars: text = self._tokenize_chinese_chars(text) - orig_tokens = whitespace_tokenize(text) + # prevents treating the same character with different unicode codepoints as different characters + unicode_normalized_text = unicodedata.normalize("NFC", text) + orig_tokens = whitespace_tokenize(unicode_normalized_text) split_tokens = [] for token in orig_tokens: if token not in never_split: @@ -186,7 +198,7 @@ def _run_strip_accents(self, text): def _run_split_on_punc(self, text, never_split=None): """Splits punctuation on a piece of text.""" - if never_split is not None and text in never_split: + if not self.do_split_on_punc or (never_split is not None and text in never_split): return [text] chars = list(text) i = 0 @@ -272,13 +284,15 @@ class CLIPTokenizer(PreTrainedTokenizer): errors (`str`, *optional*, defaults to `"replace"`): Paradigm to follow when decoding bytes to UTF-8. See [bytes.decode](https://docs.python.org/3/library/stdtypes.html#bytes.decode) for more information. - unk_token (`str`, *optional*, defaults to `<|endoftext|>`): + unk_token (`str`, *optional*, defaults to `"<|endoftext|>"`): The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. - bos_token (`str`, *optional*, defaults to `<|startoftext|>`): + bos_token (`str`, *optional*, defaults to `"<|startoftext|>"`): The beginning of sequence token. 
- eos_token (`str`, *optional*, defaults to `<|endoftext|>`): + eos_token (`str`, *optional*, defaults to `"<|endoftext|>"`): The end of sequence token. + pad_token (`str`, *optional*, defaults to `"<|endoftext|>"`): + The token used for padding, for example when batching sequences of different lengths. """ vocab_files_names = VOCAB_FILES_NAMES @@ -300,23 +314,13 @@ def __init__( bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token - - super().__init__( - errors=errors, - unk_token=unk_token, - bos_token=bos_token, - eos_token=eos_token, - pad_token=pad_token, - **kwargs, - ) - try: import ftfy self.fix_text = ftfy.fix_text except ImportError: logger.info("ftfy or spacy is not installed using custom BasicTokenizer instead of ftfy.") - self.nlp = BasicTokenizer(do_lower_case=True) + self.nlp = BasicTokenizer(strip_accents=False, do_split_on_punc=False) self.fix_text = None with open(vocab_file, encoding="utf-8") as vocab_handle: @@ -336,6 +340,15 @@ def __init__( re.IGNORECASE, ) + super().__init__( + errors=errors, + unk_token=unk_token, + bos_token=bos_token, + eos_token=eos_token, + pad_token=pad_token, + **kwargs, + ) + @property def vocab_size(self): return len(self.encoder) diff --git a/src/transformers/models/clip/tokenization_clip_fast.py b/src/transformers/models/clip/tokenization_clip_fast.py index 75b3e4f4078053..3b092b0f8d50fc 100644 --- a/src/transformers/models/clip/tokenization_clip_fast.py +++ b/src/transformers/models/clip/tokenization_clip_fast.py @@ -56,17 +56,21 @@ class CLIPTokenizerFast(PreTrainedTokenizerFast): refer to this superclass for more information regarding those methods. Args: - vocab_file (`str`): + vocab_file (`str`, *optional*): Path to the vocabulary file. - merges_file (`str`): + merges_file (`str`, *optional*): Path to the merges file. - unk_token (`str`, *optional*, defaults to `<|endoftext|>`): + tokenizer_file (`str`, *optional*): + The path to a tokenizer file to use instead of the vocab file. + unk_token (`str`, *optional*, defaults to `"<|endoftext|>"`): The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. - bos_token (`str`, *optional*, defaults to `<|startoftext|>`): + bos_token (`str`, *optional*, defaults to `"<|startoftext|>"`): The beginning of sequence token. - eos_token (`str`, *optional*, defaults to `<|endoftext|>`): + eos_token (`str`, *optional*, defaults to `"<|endoftext|>"`): The end of sequence token. + pad_token (`str`, *optional*, defaults to `"<|endoftext|>"`): + The token used for padding, for example when batching sequences of different lengths. """ vocab_files_names = VOCAB_FILES_NAMES diff --git a/src/transformers/models/clipseg/configuration_clipseg.py b/src/transformers/models/clipseg/configuration_clipseg.py index 1910c946325ae4..555d226e10d507 100644 --- a/src/transformers/models/clipseg/configuration_clipseg.py +++ b/src/transformers/models/clipseg/configuration_clipseg.py @@ -14,7 +14,6 @@ # limitations under the License. 
""" CLIPSeg model configuration""" -import copy import os from typing import Union @@ -57,15 +56,21 @@ class CLIPSegTextConfig(PretrainedConfig): hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`): The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, `"relu"`, `"selu"` and `"gelu_new"` ``"quick_gelu"` are supported. - layer_norm_eps (`float`, *optional*, defaults to 1e-5): + layer_norm_eps (`float`, *optional*, defaults to 1e-05): The epsilon used by the layer normalization layers. attention_dropout (`float`, *optional*, defaults to 0.0): The dropout ratio for the attention probabilities. initializer_range (`float`, *optional*, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. - initializer_factor (`float``, *optional*, defaults to 1): + initializer_factor (`float`, *optional*, defaults to 1.0): A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing). + pad_token_id (`int`, *optional*, defaults to 1): + Padding token id. + bos_token_id (`int`, *optional*, defaults to 49406): + Beginning of stream token id. + eos_token_id (`int`, *optional*, defaults to 49407): + End of stream token id. Example: @@ -81,6 +86,7 @@ class CLIPSegTextConfig(PretrainedConfig): >>> # Accessing the model configuration >>> configuration = model.config ```""" + model_type = "clipseg_text_model" def __init__( @@ -97,8 +103,8 @@ def __init__( initializer_range=0.02, initializer_factor=1.0, pad_token_id=1, - bos_token_id=0, - eos_token_id=2, + bos_token_id=49406, + eos_token_id=49407, **kwargs, ): super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs) @@ -117,6 +123,8 @@ def __init__( @classmethod def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + cls._set_token_in_kwargs(kwargs) + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) # get the text config dict if we are loading from CLIPSegConfig @@ -151,6 +159,8 @@ class CLIPSegVisionConfig(PretrainedConfig): Number of hidden layers in the Transformer encoder. num_attention_heads (`int`, *optional*, defaults to 12): Number of attention heads for each attention layer in the Transformer encoder. + num_channels (`int`, *optional*, defaults to 3): + The number of input channels. image_size (`int`, *optional*, defaults to 224): The size (resolution) of each image. patch_size (`int`, *optional*, defaults to 32): @@ -158,13 +168,13 @@ class CLIPSegVisionConfig(PretrainedConfig): hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`): The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, `"relu"`, `"selu"` and `"gelu_new"` ``"quick_gelu"` are supported. - layer_norm_eps (`float`, *optional*, defaults to 1e-5): + layer_norm_eps (`float`, *optional*, defaults to 1e-05): The epsilon used by the layer normalization layers. attention_dropout (`float`, *optional*, defaults to 0.0): The dropout ratio for the attention probabilities. initializer_range (`float`, *optional*, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. 
- initializer_factor (`float``, *optional*, defaults to 1): + initializer_factor (`float`, *optional*, defaults to 1.0): A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing). @@ -218,6 +228,8 @@ def __init__( @classmethod def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + cls._set_token_in_kwargs(kwargs) + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) # get the vision config dict if we are loading from CLIPSegConfig @@ -252,7 +264,7 @@ class CLIPSegConfig(PretrainedConfig): Dimensionality of text and vision projection layers. logit_scale_init_value (`float`, *optional*, defaults to 2.6592): The inital value of the *logit_scale* paramter. Default is used as per the original CLIPSeg implementation. - extract_layers (`List[int]`, *optional*, defaults to [3, 6, 9]): + extract_layers (`List[int]`, *optional*, defaults to `[3, 6, 9]`): Layers to extract when forwarding the query image through the frozen visual backbone of CLIP. reduce_dim (`int`, *optional*, defaults to 64): Dimensionality to reduce the CLIP vision embedding. @@ -298,7 +310,6 @@ class CLIPSegConfig(PretrainedConfig): ```""" model_type = "clipseg" - is_composition = True def __init__( self, @@ -316,22 +327,83 @@ def __init__( use_complex_transposed_convolution=False, **kwargs, ): - super().__init__(**kwargs) - + # If `_config_dict` exist, we use them for the backward compatibility. + # We pop out these 2 attributes before calling `super().__init__` to avoid them being saved (which causes a lot + # of confusion!). text_config_dict = kwargs.pop("text_config_dict", None) vision_config_dict = kwargs.pop("vision_config_dict", None) + + super().__init__(**kwargs) + + # Instead of simply assigning `[text|vision]_config_dict` to `[text|vision]_config`, we use the values in + # `[text|vision]_config_dict` to update the values in `[text|vision]_config`. The values should be same in most + # cases, but we don't want to break anything regarding `_config_dict` that existed before commit `8827e1b2`. if text_config_dict is not None: - text_config = text_config_dict + if text_config is None: + text_config = {} + + # This is the complete result when using `text_config_dict`. + _text_config_dict = CLIPSegTextConfig(**text_config_dict).to_dict() + + # Give a warning if the values exist in both `_text_config_dict` and `text_config` but being different. + for key, value in _text_config_dict.items(): + if key in text_config and value != text_config[key] and key not in ["transformers_version"]: + # If specified in `text_config_dict` + if key in text_config_dict: + message = ( + f"`{key}` is found in both `text_config_dict` and `text_config` but with different values. " + f'The value `text_config_dict["{key}"]` will be used instead.' + ) + # If inferred from default argument values (just to be super careful) + else: + message = ( + f"`text_config_dict` is provided which will be used to initialize `CLIPSegTextConfig`. The " + f'value `text_config["{key}"]` will be overriden.' + ) + logger.info(message) + + # Update all values in `text_config` with the ones in `_text_config_dict`. + text_config.update(_text_config_dict) + if vision_config_dict is not None: - vision_config = vision_config_dict + if vision_config is None: + vision_config = {} + + # This is the complete result when using `vision_config_dict`. 
+ _vision_config_dict = CLIPSegVisionConfig(**vision_config_dict).to_dict() + # convert keys to string instead of integer + if "id2label" in _vision_config_dict: + _vision_config_dict["id2label"] = { + str(key): value for key, value in _vision_config_dict["id2label"].items() + } + + # Give a warning if the values exist in both `_vision_config_dict` and `vision_config` but being different. + for key, value in _vision_config_dict.items(): + if key in vision_config and value != vision_config[key] and key not in ["transformers_version"]: + # If specified in `vision_config_dict` + if key in vision_config_dict: + message = ( + f"`{key}` is found in both `vision_config_dict` and `vision_config` but with different " + f'values. The value `vision_config_dict["{key}"]` will be used instead.' + ) + # If inferred from default argument values (just to be super careful) + else: + message = ( + f"`vision_config_dict` is provided which will be used to initialize `CLIPSegVisionConfig`. " + f'The value `vision_config["{key}"]` will be overriden.' + ) + logger.info(message) + + # Update all values in `vision_config` with the ones in `_vision_config_dict`. + vision_config.update(_vision_config_dict) if text_config is None: text_config = {} - logger.info("text_config is None. Initializing the CLIPSegTextConfig with default values.") + logger.info("`text_config` is `None`. Initializing the `CLIPSegTextConfig` with default values.") if vision_config is None: vision_config = {} - logger.info("vision_config is None. initializing the CLIPSegVisionConfig with default values.") + logger.info("`vision_config` is `None`. initializing the `CLIPSegVisionConfig` with default values.") self.text_config = CLIPSegTextConfig(**text_config) self.vision_config = CLIPSegVisionConfig(**vision_config) @@ -359,16 +431,3 @@ def from_text_vision_configs(cls, text_config: CLIPSegTextConfig, vision_config: """ return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), **kwargs) - - def to_dict(self): - """ - Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`]. 
- - Returns: - `Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance, - """ - output = copy.deepcopy(self.__dict__) - output["text_config"] = self.text_config.to_dict() - output["vision_config"] = self.vision_config.to_dict() - output["model_type"] = self.__class__.model_type - return output diff --git a/src/transformers/models/clipseg/convert_clipseg_original_pytorch_to_hf.py b/src/transformers/models/clipseg/convert_clipseg_original_pytorch_to_hf.py index 183bb93b9e2b75..c614d61e5b3dd8 100644 --- a/src/transformers/models/clipseg/convert_clipseg_original_pytorch_to_hf.py +++ b/src/transformers/models/clipseg/convert_clipseg_original_pytorch_to_hf.py @@ -28,7 +28,7 @@ CLIPSegTextConfig, CLIPSegVisionConfig, CLIPTokenizer, - ViTFeatureExtractor, + ViTImageProcessor, ) @@ -185,9 +185,9 @@ def convert_clipseg_checkpoint(model_name, checkpoint_path, pytorch_dump_folder_ if unexpected_keys != ["decoder.reduce.weight", "decoder.reduce.bias"]: raise ValueError(f"Unexpected keys: {unexpected_keys}") - feature_extractor = ViTFeatureExtractor(size=352) + image_processor = ViTImageProcessor(size=352) tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32") - processor = CLIPSegProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer) + processor = CLIPSegProcessor(image_processor=image_processor, tokenizer=tokenizer) image = prepare_img() text = ["a glass", "something to fill", "wood", "a jar"] diff --git a/src/transformers/models/clipseg/modeling_clipseg.py b/src/transformers/models/clipseg/modeling_clipseg.py index 3ec81b33fb76f5..c0cf6b3b165707 100644 --- a/src/transformers/models/clipseg/modeling_clipseg.py +++ b/src/transformers/models/clipseg/modeling_clipseg.py @@ -24,6 +24,7 @@ from torch import nn from ...activations import ACT2FN +from ...modeling_attn_mask_utils import _create_4d_causal_attention_mask, _prepare_4d_attention_mask from ...modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling from ...modeling_utils import PreTrainedModel from ...utils import ( @@ -47,21 +48,6 @@ ] -# Copied from transformers.models.bart.modeling_bart._expand_mask -def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None): - """ - Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`. - """ - bsz, src_len = mask.size() - tgt_len = tgt_len if tgt_len is not None else src_len - - expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype) - - inverted_mask = 1.0 - expanded_mask - - return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min) - - # contrastive loss function, adapted from # https://sachinruk.github.io/blog/pytorch/pytorch%20lightning/loss%20function/gpu/2021/03/07/CLIP.html def contrastive_loss(logits: torch.Tensor) -> torch.Tensor: @@ -91,8 +77,7 @@ class CLIPSegOutput(ModelOutput): text_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim`): The text embeddings obtained by applying the projection layer to the pooled output of [`CLIPSegTextModel`]. image_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim`): - The image embeddings obtained by applying the projection layer to the pooled output of - [`CLIPSegVisionModel`]. + The image embeddings obtained by applying the projection layer to the pooled output of [`CLIPSegVisionModel`]. text_model_output(`BaseModelOutputWithPooling`): The output of the [`CLIPSegTextModel`]. 
vision_model_output(`BaseModelOutputWithPooling`): @@ -160,7 +145,7 @@ def to_tuple(self) -> Tuple[Any]: class CLIPSegVisionEmbeddings(nn.Module): - # Copied from transformers.models.clip.modeling_clip.CLIPVisionEmbeddings.__init__ + # Copied from transformers.models.clip.modeling_clip.CLIPVisionEmbeddings.__init__ with CLIP->CLIPSeg def __init__(self, config: CLIPSegVisionConfig): super().__init__() self.config = config @@ -181,7 +166,7 @@ def __init__(self, config: CLIPSegVisionConfig): self.num_patches = (self.image_size // self.patch_size) ** 2 self.num_positions = self.num_patches + 1 self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim) - self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1))) + self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1)), persistent=False) def interpolate_position_embeddings(self, new_size): if len(new_size) != 2: @@ -230,7 +215,9 @@ def __init__(self, config: CLIPSegTextConfig): self.position_embedding = nn.Embedding(config.max_position_embeddings, embed_dim) # position_ids (1, len position emb) is contiguous in memory and exported when serialized - self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1))) + self.register_buffer( + "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False + ) def forward( self, @@ -433,7 +420,6 @@ class CLIPSegPreTrainedModel(PreTrainedModel): config_class = CLIPSegConfig base_model_prefix = "clip" supports_gradient_checkpointing = True - _keys_to_ignore_on_load_missing = [r"position_ids"] def _init_weights(self, module): """Initialize the weights""" @@ -456,9 +442,7 @@ def _init_weights(self, module): nn.init.normal_(module.out_proj.weight, std=out_proj_std) elif isinstance(module, CLIPSegMLP): factor = self.config.initializer_factor - in_proj_std = ( - (module.config.hidden_size**-0.5) * ((2 * module.config.num_hidden_layers) ** -0.5) * factor - ) + in_proj_std = (module.config.hidden_size**-0.5) * ((2 * module.config.num_hidden_layers) ** -0.5) * factor fc_std = (2 * module.config.hidden_size) ** -0.5 * factor nn.init.normal_(module.fc1.weight, std=fc_std) nn.init.normal_(module.fc2.weight, std=in_proj_std) @@ -478,10 +462,6 @@ def _init_weights(self, module): if isinstance(module, nn.Linear) and module.bias is not None: module.bias.data.zero_() - def _set_gradient_checkpointing(self, module, value=False): - if isinstance(module, CLIPSegEncoder): - module.gradient_checkpointing = value - CLIPSEG_START_DOCSTRING = r""" This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. 
Use it @@ -647,18 +627,12 @@ def forward( if output_hidden_states: encoder_states = encoder_states + (hidden_states,) if self.gradient_checkpointing and self.training: - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, output_attentions) - - return custom_forward - - layer_outputs = torch.utils.checkpoint.checkpoint( - create_custom_forward(encoder_layer), + layer_outputs = self._gradient_checkpointing_func( + encoder_layer.__call__, hidden_states, attention_mask, causal_attention_mask, + output_attentions, ) else: layer_outputs = encoder_layer( @@ -693,6 +667,9 @@ def __init__(self, config: CLIPSegTextConfig): self.encoder = CLIPSegEncoder(config) self.final_layer_norm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps) + # For `pooled_output` computation + self.eos_token_id = config.eos_token_id + @add_start_docstrings_to_model_forward(CLIPSEG_TEXT_INPUTS_DOCSTRING) @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=CLIPSegTextConfig) # Copied from transformers.models.clip.modeling_clip.CLIPTextTransformer.forward with clip->clipseg, CLIP->CLIPSeg @@ -723,16 +700,15 @@ def forward( hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids) - bsz, seq_len = input_shape # CLIPSeg's text model uses causal mask, prepare it here. # https://github.com/openai/CLIPSeg/blob/cfcffb90e69f37bf2ff1e988237a0fbe41f33c04/clipseg/model.py#L324 - causal_attention_mask = self._build_causal_attention_mask(bsz, seq_len, hidden_states.dtype).to( - hidden_states.device + causal_attention_mask = _create_4d_causal_attention_mask( + input_shape, hidden_states.dtype, device=hidden_states.device ) # expand attention_mask if attention_mask is not None: # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] - attention_mask = _expand_mask(attention_mask, hidden_states.dtype) + attention_mask = _prepare_4d_attention_mask(attention_mask, hidden_states.dtype) encoder_outputs = self.encoder( inputs_embeds=hidden_states, @@ -746,13 +722,26 @@ def forward( last_hidden_state = encoder_outputs[0] last_hidden_state = self.final_layer_norm(last_hidden_state) - # text_embeds.shape = [batch_size, sequence_length, transformer.width] - # take features from the eot embedding (eot_token is the highest number in each sequence) - # casting to torch.int for onnx compatibility: argmax doesn't support int64 inputs with opset 14 - pooled_output = last_hidden_state[ - torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device), - input_ids.to(dtype=torch.int, device=last_hidden_state.device).argmax(dim=-1), - ] + if self.eos_token_id == 2: + # The `eos_token_id` was incorrect before PR #24773: Let's keep what have been done here. 
+ # A CLIPSeg model with such `eos_token_id` in the config can't work correctly with extra new tokens added + # ------------------------------------------------------------ + # text_embeds.shape = [batch_size, sequence_length, transformer.width] + # take features from the eot embedding (eot_token is the highest number in each sequence) + # casting to torch.int for onnx compatibility: argmax doesn't support int64 inputs with opset 14 + pooled_output = last_hidden_state[ + torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device), + input_ids.to(dtype=torch.int, device=last_hidden_state.device).argmax(dim=-1), + ] + else: + # The config gets updated `eos_token_id` from PR #24773 (so the use of exta new tokens is possible) + pooled_output = last_hidden_state[ + torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device), + # We need to get the first position of `eos_token_id` value (`pad_token_ids` might equal to `eos_token_id`) + (input_ids.to(dtype=torch.int, device=last_hidden_state.device) == self.eos_token_id) + .int() + .argmax(dim=-1), + ] if not return_dict: return (last_hidden_state, pooled_output) + encoder_outputs[1:] @@ -764,20 +753,11 @@ def forward( attentions=encoder_outputs.attentions, ) - def _build_causal_attention_mask(self, bsz, seq_len, dtype): - # lazily create causal attention mask, with full attention between the vision tokens - # pytorch uses additive attention mask; fill with -inf - mask = torch.empty(bsz, seq_len, seq_len, dtype=dtype) - mask.fill_(torch.tensor(torch.finfo(dtype).min)) - mask.triu_(1) # zero out the lower diagonal - mask = mask.unsqueeze(1) # expand mask - return mask - class CLIPSegTextModel(CLIPSegPreTrainedModel): config_class = CLIPSegTextConfig - _no_split_modules = ["CLIPSegEncoderLayer"] + _no_split_modules = ["CLIPSegTextEmbeddings", "CLIPSegEncoderLayer"] def __init__(self, config: CLIPSegTextConfig): super().__init__(config) @@ -972,7 +952,7 @@ def __init__(self, config: CLIPSegConfig): self.visual_projection = nn.Linear(self.vision_embed_dim, self.projection_dim, bias=False) self.text_projection = nn.Linear(self.text_embed_dim, self.projection_dim, bias=False) - self.logit_scale = nn.Parameter(torch.ones([]) * self.config.logit_scale_init_value) + self.logit_scale = nn.Parameter(torch.tensor(self.config.logit_scale_init_value)) # Initialize weights and apply final processing self.post_init() @@ -1480,6 +1460,8 @@ def forward( loss = None if labels is not None: + # move labels to the correct device to enable PP + labels = labels.to(logits.device) loss_fn = nn.BCEWithLogitsLoss() loss = loss_fn(logits, labels) diff --git a/src/transformers/models/clipseg/processing_clipseg.py b/src/transformers/models/clipseg/processing_clipseg.py index df3705e99e2c78..e57021f213ab05 100644 --- a/src/transformers/models/clipseg/processing_clipseg.py +++ b/src/transformers/models/clipseg/processing_clipseg.py @@ -30,16 +30,18 @@ class CLIPSegProcessor(ProcessorMixin): [`~CLIPSegProcessor.__call__`] and [`~CLIPSegProcessor.decode`] for more information. Args: - image_processor ([`ViTImageProcessor`]): + image_processor ([`ViTImageProcessor`], *optional*): The image processor is a required input. - tokenizer ([`CLIPTokenizerFast`]): + tokenizer ([`CLIPTokenizerFast`], *optional*): The tokenizer is a required input. 
""" + attributes = ["image_processor", "tokenizer"] image_processor_class = "ViTImageProcessor" tokenizer_class = ("CLIPTokenizer", "CLIPTokenizerFast") def __init__(self, image_processor=None, tokenizer=None, **kwargs): + feature_extractor = None if "feature_extractor" in kwargs: warnings.warn( "The `feature_extractor` argument is deprecated and will be removed in v5, use `image_processor`" diff --git a/src/transformers/models/clvp/__init__.py b/src/transformers/models/clvp/__init__.py new file mode 100644 index 00000000000000..fb88e24171c369 --- /dev/null +++ b/src/transformers/models/clvp/__init__.py @@ -0,0 +1,83 @@ +# Copyright 2023 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from typing import TYPE_CHECKING + +from ...utils import ( + OptionalDependencyNotAvailable, + _LazyModule, + is_torch_available, +) + + +_import_structure = { + "configuration_clvp": [ + "CLVP_PRETRAINED_CONFIG_ARCHIVE_MAP", + "ClvpConfig", + "ClvpDecoderConfig", + "ClvpEncoderConfig", + ], + "feature_extraction_clvp": ["ClvpFeatureExtractor"], + "processing_clvp": ["ClvpProcessor"], + "tokenization_clvp": ["ClvpTokenizer"], +} + + +try: + if not is_torch_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass +else: + _import_structure["modeling_clvp"] = [ + "CLVP_PRETRAINED_MODEL_ARCHIVE_LIST", + "ClvpModelForConditionalGeneration", + "ClvpForCausalLM", + "ClvpModel", + "ClvpPreTrainedModel", + "ClvpEncoder", + "ClvpDecoder", + ] + + +if TYPE_CHECKING: + from .configuration_clvp import ( + CLVP_PRETRAINED_CONFIG_ARCHIVE_MAP, + ClvpConfig, + ClvpDecoderConfig, + ClvpEncoderConfig, + ) + from .feature_extraction_clvp import ClvpFeatureExtractor + from .processing_clvp import ClvpProcessor + from .tokenization_clvp import ClvpTokenizer + + try: + if not is_torch_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + else: + from .modeling_clvp import ( + CLVP_PRETRAINED_MODEL_ARCHIVE_LIST, + ClvpDecoder, + ClvpEncoder, + ClvpForCausalLM, + ClvpModel, + ClvpModelForConditionalGeneration, + ClvpPreTrainedModel, + ) + +else: + import sys + + sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__) diff --git a/src/transformers/models/clvp/configuration_clvp.py b/src/transformers/models/clvp/configuration_clvp.py new file mode 100644 index 00000000000000..3d20b5c16d5d10 --- /dev/null +++ b/src/transformers/models/clvp/configuration_clvp.py @@ -0,0 +1,457 @@ +# coding=utf-8 +# Copyright 2023 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" CLVP model configuration""" + + +import os +from typing import TYPE_CHECKING, Union + + +if TYPE_CHECKING: + pass + +from ...configuration_utils import PretrainedConfig +from ...utils import logging + + +logger = logging.get_logger(__name__) + +CLVP_PRETRAINED_CONFIG_ARCHIVE_MAP = { + "susnato/clvp_dev": "https://huggingface.co/susnato/clvp_dev/resolve/main/config.json", +} + + +class ClvpEncoderConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`ClvpEncoder`]. It is used to instantiate a CLVP + text or CLVP speech encoder according to the specified arguments. Instantiating a configuration with the defaults + will yield a similar configuration to that of the encoder of the CLVP + [susnato/clvp_dev](https://huggingface.co/susnato/clvp_dev) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + vocab_size (`int`, *optional*, defaults to 256): + Vocabulary size of the CLVP Encoder model. + hidden_size (`int`, *optional*, defaults to 768): + Dimensionality of the encoder layers and the pooler layer. + intermediate_size (`int`, *optional*, defaults to 1536): + Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. + projection_dim (`int`, *optional*, defaults to 768): + Dimensionality of the projection vector. + num_hidden_layers (`int`, *optional*, defaults to 20): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 12): + Number of attention heads for each attention layer in the Transformer encoder. + hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, + `"relu"`, `"selu"` and `"gelu_new"` `"quick_gelu"` are supported. + layer_norm_eps (`float`, *optional*, defaults to 1e-05): + The epsilon used by the layer normalization layers. + attention_dropout (`float`, *optional*, defaults to 0.1): + The dropout ratio for the attention probabilities. + dropout (`float`, *optional*, defaults to 0.1): + The dropout ratio for the feed-forward layers in [`ClvpEncoderMLP`]. + use_rotary_embedding (`bool`, *optional*, defaults to `True`): + Whether to use rotary_embedding or not. + use_attention_bias (`bool`, *optional*, defaults to `False`): + Whether to use bias in Query, Key and Value layers during self attention. + summary_type (`str`, *optional*, defaults to `"mean"`): + What strategy to use to get pooler_output from the last_hidden_state. `"last"`, `"first"`, `"mean"` and + `"cls_index"` are supported. + initializer_factor (`float`, *optional*, defaults to 1.0): + A factor for initializing all weight matrices (should be kept to 1.0, used internally for initialization + testing). + bos_token_id (`int`, *optional*, defaults to 255): + Beginning of sequence token id. + eos_token_id (`int`, *optional*, defaults to 0): + End of sequence token id. 
+ + Example: + + ```python + >>> from transformers import ClvpEncoderConfig, ClvpEncoder + + >>> # Initializing a ClvpEncoderConfig with susnato/clvp_dev style configuration + >>> encoder_configuration = ClvpEncoderConfig() + + >>> # Initializing a ClvpEncoder (with random weights) from the susnato/clvp_dev style configuration + >>> model = ClvpEncoder(encoder_configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + + model_type = "clvp_encoder" + + def __init__( + self, + vocab_size=256, + hidden_size=768, + intermediate_size=1536, + projection_dim=768, + num_hidden_layers=20, + num_attention_heads=12, + hidden_act="gelu", + layer_norm_eps=1e-5, + attention_dropout=0.1, + dropout=0.1, + use_rotary_embedding=True, + use_attention_bias=False, + summary_type="mean", + initializer_factor=1.0, + bos_token_id=255, + eos_token_id=0, + **kwargs, + ): + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.intermediate_size = intermediate_size + self.projection_dim = projection_dim + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.layer_norm_eps = layer_norm_eps + self.hidden_act = hidden_act + self.initializer_factor = initializer_factor + self.attention_dropout = attention_dropout + self.dropout = dropout + self.use_rotary_embedding = use_rotary_embedding + self.use_attention_bias = use_attention_bias + self.summary_type = summary_type + self.bos_token_id = bos_token_id + self.eos_token_id = eos_token_id + + super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs) + + @classmethod + def from_pretrained( + cls, pretrained_model_name_or_path: Union[str, os.PathLike], config_type: str = "text_config", **kwargs + ) -> "PretrainedConfig": + cls._set_token_in_kwargs(kwargs) + + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) + + # make sure to have the config_type be either "text_config" or "speech_config" + # this is to make sure that we can load only text or speech configs from the nested ClvpConfig. + if config_type not in ["text_config", "speech_config"]: + raise ValueError( + f"We can only load either 'text_config' or 'speech_config' but you are trying to load" f"{config_type}" + ) + + # get the text config dict if we are loading from ClvpConfig + if config_dict.get("model_type") == "clvp": + config_dict = config_dict[config_type] + + if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: + logger.warning( + f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " + f"{cls.model_type}. This is not supported for all configurations of models and can yield errors." + ) + + return cls.from_dict(config_dict, **kwargs) + + +class ClvpDecoderConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`ClvpDecoder`]. It is used to instantiate a CLVP + Decoder Model according to the specified arguments, defining the model architecture. Instantiating a configuration + with the defaults will yield a similar configuration to that of the Decoder part of the CLVP + [susnato/clvp_dev](https://huggingface.co/susnato/clvp_dev) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + The architecture is similar to GPT2. 
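The `config_type` switch in the `ClvpEncoderConfig.from_pretrained` override above selects which nested sub-dict of a composite `ClvpConfig` checkpoint is loaded. A minimal usage sketch, assuming the `susnato/clvp_dev` checkpoint referenced in this file is available:

```python
from transformers import ClvpEncoderConfig

# Load only the speech-encoder settings from the nested ClvpConfig.
# config_type defaults to "text_config"; any value other than
# "text_config" / "speech_config" raises a ValueError (see from_pretrained above).
speech_encoder_config = ClvpEncoderConfig.from_pretrained(
    "susnato/clvp_dev", config_type="speech_config"
)
print(speech_encoder_config.model_type)  # "clvp_encoder"
```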
+ + Args: + vocab_size (`int`, *optional*, defaults to 8194): + Vocabulary size of the model. + max_position_embeddings (`int`, *optional*, defaults to 608): + The maximum sequence length of mel tokens that this model might ever be used with. Similar to `n_positions` + in `GPT2Config`. + max_text_tokens (`int`, *optional*, defaults to 404): + The maximum sequence length of text tokens that this model might ever be used with. Similar to + `n_positions` in `GPT2Config`. + hidden_size (`int`, *optional*, defaults to 1024): + Dimensionality of the embeddings and hidden states. + num_hidden_layers (`int`, *optional*, defaults to 30): + Number of hidden layers in the Transformer encoder. + num_attention_heads (`int`, *optional*, defaults to 16): + Number of attention heads for each attention layer in the Transformer encoder. + n_inner (`int`, *optional*): + Dimensionality of the inner feed-forward layers. `None` will set it to 4 times `hidden_size`. + num_mel_attn_blocks (`int`, *optional*, defaults to 6): + Denotes the number of self attention layers in [`ClvpConditioningEncoder`]. + activation_function (`str`, *optional*, defaults to `"gelu_new"`): + Activation function, to be selected in the list `["relu", "silu", "gelu", "tanh", "gelu_new"]`. + resid_pdrop (`float`, *optional*, defaults to 0.1): + The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. + embd_pdrop (`float`, *optional*, defaults to 0.1): + The dropout ratio for the embeddings. + attention_dropout (`float`, *optional*, defaults to 0.1): + The dropout ratio for the attention. + layer_norm_epsilon (`float`, *optional*, defaults to 1e-05): + The epsilon to use in the layer normalization layers. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + summary_type (`string`, *optional*, defaults to `"cls_index"`): + Argument used when doing sequence summary. + + Has to be one of the following options: + + - `"last"`: Take the last token hidden state (like XLNet). + - `"first"`: Take the first token hidden state (like BERT). + - `"mean"`: Take the mean of all tokens hidden states. + - `"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2). + - `"attn"`: Not implemented now, use multi-head attention. + summary_use_proj (`bool`, *optional*, defaults to `True`): + Whether or not to add a projection after the vector extraction. + summary_activation (`str`, *optional*): + Pass `"tanh"` for a tanh activation to the output, any other value will result in no activation. + summary_proj_to_labels (`bool`, *optional*, defaults to `True`): + Whether the projection outputs should have `config.num_labels` or `config.hidden_size` classes. + summary_first_dropout (`float`, *optional*, defaults to 0.1): + The dropout ratio to be used after the projection and activation. + use_cache (`bool`, *optional*, defaults to `True`): + Whether or not the model should return the last key/values attentions (not used by all models). + bos_token_id (`int`, *optional*, defaults to 8192): + Beginning of sequence token id, used at the start of the generation. + eos_token_id (`int`, *optional*, defaults to 8193): + End of sequence token id, used in the method + [`ClvpModelForConditionalGeneration.fix_speech_decoder_output()`] to correct decoder outputs. + feature_size (`int`, *optional*, defaults to 80): + The feature dimension of the extracted mel features. 
This value is used in [`ClvpConditioningEncoder`]. + use_attention_bias (`bool`, *optional*, defaults to `True`): + Whether to use bias in Query, Key and Value layers during self attention. + initializer_factor (`float`, *optional*, defaults to 1.0): + A factor for initializing all weight matrices (should be kept to 1.0, used internally for initialization + testing). + decoder_fixing_codes (`list`, *optional*, defaults to `[83, 45, 45, 248]`): + These values are used in the method `fix_speech_decoder_output` to fix decoder generated outputs. + + Example: + + ```python + >>> from transformers import ClvpDecoderConfig, ClvpDecoder + + >>> # Initializing a ClvpDecoderConfig with susnato/clvp_dev style configuration + >>> decoder_configuration = ClvpDecoderConfig() + + >>> # Initializing a ClvpDecoder (with random weights) from the susnato/clvp_dev style configuration + >>> model = ClvpDecoder(decoder_configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + + model_type = "clvp_decoder" + + def __init__( + self, + vocab_size=8194, + max_position_embeddings=608, + max_text_tokens=404, + hidden_size=1024, + num_hidden_layers=30, + num_attention_heads=16, + n_inner=None, + num_mel_attn_blocks=6, + activation_function="gelu_new", + resid_pdrop=0.1, + embd_pdrop=0.1, + attention_dropout=0.1, + layer_norm_epsilon=1e-5, + initializer_range=0.02, + summary_type="cls_index", + summary_use_proj=True, + summary_activation=None, + summary_proj_to_labels=True, + summary_first_dropout=0.1, + use_cache=True, + bos_token_id=8192, + eos_token_id=8193, + feature_size=80, + use_attention_bias=True, + initializer_factor=1.0, + decoder_fixing_codes=[83, 45, 45, 248], + **kwargs, + ): + self.vocab_size = vocab_size + self.max_position_embeddings = max_position_embeddings + self.max_text_tokens = max_text_tokens + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.n_inner = n_inner + self.num_mel_attn_blocks = num_mel_attn_blocks + self.activation_function = activation_function + self.resid_pdrop = resid_pdrop + self.embd_pdrop = embd_pdrop + self.attention_dropout = attention_dropout + self.layer_norm_epsilon = layer_norm_epsilon + self.initializer_range = initializer_range + self.summary_type = summary_type + self.summary_use_proj = summary_use_proj + self.summary_activation = summary_activation + self.summary_first_dropout = summary_first_dropout + self.summary_proj_to_labels = summary_proj_to_labels + self.use_cache = use_cache + self.feature_size = feature_size + self.use_attention_bias = use_attention_bias + self.initializer_factor = initializer_factor + self.decoder_fixing_codes = decoder_fixing_codes + + self.bos_token_id = bos_token_id + self.eos_token_id = eos_token_id + + super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs) + + @classmethod + def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig": + cls._set_token_in_kwargs(kwargs) + + config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs) + + # get the speech config dict if we are loading from ClvpConfig + if config_dict.get("model_type") == "clvp": + config_dict = config_dict["decoder_config"] + + if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type: + logger.warning( + f"You are using a model of type {config_dict['model_type']} to instantiate a model of type 
" + f"{cls.model_type}. This is not supported for all configurations of models and can yield errors." + ) + + return cls.from_dict(config_dict, **kwargs) + + +class ClvpConfig(PretrainedConfig): + r""" + [`ClvpConfig`] is the configuration class to store the configuration of a [`ClvpModelForConditionalGeneration`]. It + is used to instantiate a CLVP model according to the specified arguments, defining the text model, speech model and + decoder model configs. Instantiating a configuration with the defaults will yield a similar configuration to that + of the CLVP [susnato/clvp_dev](https://huggingface.co/susnato/clvp_dev) architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + text_config (`dict`, *optional*): + Dictionary of configuration options used to initialize the CLVP text encoder. + speech_config (`dict`, *optional*): + Dictionary of configuration options used to initialize CLVP speech encoder. + decoder_config (`dict`, *optional*): + Dictionary of configuration options used to initialize [`ClvpDecoderConfig`]. + projection_dim (`int`, *optional*, defaults to 768): + Dimentionality of text and speech projection layers. + logit_scale_init_value (`float`, *optional*, defaults to 2.6592): + The inital value of the *logit_scale* paramter. Default is used as per the original CLVP implementation. + initializer_factor (`float`, *optional*, defaults to 1.0): + A factor for initializing all weight matrices (should be kept to 1.0, used internally for initialization + testing). + kwargs (*optional*): + Dictionary of keyword arguments. + + Example: + + ```python + >>> from transformers import ClvpConfig, ClvpModelForConditionalGeneration + + >>> # Initializing a ClvpConfig with susnato/clvp_dev style configuration + >>> configuration = ClvpConfig() + + >>> # Initializing a ClvpModelForConditionalGeneration (with random weights) from the susnato/clvp_dev style configuration + >>> model = ClvpModelForConditionalGeneration(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + + >>> # We can also initialize a CLVPConfig from a CLVPTextConfig, CLVPSpeechConfig and a CLVPAutoRegressiveConfig + >>> from transformers import ClvpEncoderConfig, ClvpDecoderConfig + + >>> # Initializing a CLVP text, CLVP speech and CLVP decoder configuration + >>> config_text = ClvpEncoderConfig() + >>> config_speech = ClvpEncoderConfig() + >>> decoder_config = ClvpDecoderConfig() + + >>> config = ClvpConfig.from_sub_model_configs(config_text, config_speech, decoder_config) + ```""" + + model_type = "clvp" + is_composition = True + + def __init__( + self, + text_config=None, + speech_config=None, + decoder_config=None, + projection_dim=768, + logit_scale_init_value=2.6592, + initializer_factor=1.0, + **kwargs, + ): + super().__init__(**kwargs) + + if text_config is None: + text_config = {} + logger.info("`text_config` is `None`. Initializing the `ClvpEncoderConfig` with default values.") + + if speech_config is None: + speech_config = {} + logger.info("`speech_config` is `None`. initializing the `ClvpEncoderConfig` with default values.") + + if decoder_config is None: + decoder_config = {} + logger.info("`decoder_config` is `None`. 
initializing the `ClvpDecoderConfig` with default values.") + + self.text_config = ClvpEncoderConfig(**text_config) + self.speech_config = ClvpEncoderConfig(**speech_config) + self.decoder_config = ClvpDecoderConfig(**decoder_config) + + self.projection_dim = projection_dim + self.logit_scale_init_value = logit_scale_init_value + self.initializer_factor = initializer_factor + + @classmethod + def from_sub_model_configs( + cls, + text_config: ClvpEncoderConfig, + speech_config: ClvpEncoderConfig, + decoder_config: ClvpDecoderConfig, + **kwargs, + ): + r""" + Instantiate a [`ClvpConfig`] (or a derived class) from CLVP text model configuration, CLVP speech model + configuration and CLVP decoder model configuration. + + Args: + text_config (`ClvpEncoderConfig`): + Text model configuration of type [`ClvpEncoderConfig`]. + speech_config (`ClvpEncoderConfig`): + Speech model configuration of type [`ClvpEncoderConfig`]. + decoder_config (`ClvpDecoderConfig`): + Decoder model configuration of type [`ClvpDecoderConfig`]. + + Returns: + [`ClvpConfig`]: An instance of a configuration object + """ + + return cls( + text_config=text_config.to_dict(), + speech_config=speech_config.to_dict(), + decoder_config=decoder_config.to_dict(), + **kwargs, + ) diff --git a/src/transformers/models/clvp/convert_clvp_to_hf.py b/src/transformers/models/clvp/convert_clvp_to_hf.py new file mode 100644 index 00000000000000..4ae6fd4254978f --- /dev/null +++ b/src/transformers/models/clvp/convert_clvp_to_hf.py @@ -0,0 +1,234 @@ +# coding=utf-8 +# Copyright 2023 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
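As the `ClvpConfig.__init__` above shows, each sub-config passed as `None` falls back to the defaults of `ClvpEncoderConfig` / `ClvpDecoderConfig`, while a partial dict only overrides the keys it contains. A small illustrative sketch, with the expected values taken from the defaults defined earlier in this file:

```python
from transformers import ClvpConfig

# Only the text encoder is customised; the speech encoder and decoder keep their defaults.
config = ClvpConfig(text_config={"hidden_size": 512, "num_hidden_layers": 12})

print(config.text_config.hidden_size)     # 512  (overridden)
print(config.speech_config.hidden_size)   # 768  (ClvpEncoderConfig default)
print(config.decoder_config.hidden_size)  # 1024 (ClvpDecoderConfig default)
```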
+ +""" +Weights conversion script for CLVP +""" + +import argparse +import os + +import torch +from huggingface_hub import hf_hub_download + +from transformers import ClvpConfig, ClvpModelForConditionalGeneration + + +_MODELS = { + "clvp": "https://huggingface.co/jbetker/tortoise-tts-v2/blob/main/.models/clvp2.pth", + "decoder": "https://huggingface.co/jbetker/tortoise-tts-v2/blob/main/.models/autoregressive.pth", +} + +dim = 1024 +sub_dim = dim // 16 + +CLVP_ENCODERS_MAPPING = { + "text_transformer.transformer.attn_layers": "text_encoder_model", + "speech_transformer.transformer.attn_layers": "speech_encoder_model", + "text_transformer.transformer.norm": "text_encoder_model.final_layer_norm", + "speech_transformer.transformer.norm": "speech_encoder_model.final_layer_norm", + "to_text_latent": "text_encoder_model.projection", + "to_speech_latent": "speech_encoder_model.projection", + "text_emb": "text_encoder_model.token_embedding", + "speech_emb": "speech_encoder_model.token_embedding", + "1.wrap.net.0": "mlp.fc1", + "1.wrap.net.3": "mlp.fc2", + "1.wrap": "self_attn", + "to_out": "out_proj", + "to_q": "q_proj", + "to_k": "k_proj", + "to_v": "v_proj", + "temperature": "logit_scale", +} + +CLVP_DECODER_MAPPING = { + "conditioning_encoder.init": "conditioning_encoder.mel_conv", + "conditioning_encoder.attn": "conditioning_encoder.mel_attn_blocks", + "mel_attn_blocks": "group_norms", + ".norm.weight": ".weight", + ".norm.bias": ".bias", + "text_embedding": "conditioning_encoder.text_token_embedding", + "text_pos_embedding.emb": "conditioning_encoder.text_position_embedding", + "final_norm": "speech_decoder_model.final_norm", + "mel_head": "speech_decoder_model.lm_head", + "gpt.ln_f": "speech_decoder_model.model.decoder.layer_norm", + "mel_embedding": "speech_decoder_model.model.decoder.input_embeds_layer", + "mel_pos_embedding.emb": "speech_decoder_model.model.decoder.position_embeds_layer", + "gpt.h": "speech_decoder_model.model.decoder.layers", + "ln_1": "input_layernorm", + "ln_2": "post_attention_layernorm", +} + + +def update_index(present_index): + if present_index % 2 == 0: + return int(present_index / 2) + else: + return int((present_index - 1) / 2) + + +def convert_encoder_weights(original_weights): + converted_weights = {} + original_weights_keys = sorted(original_weights.keys()) + for original_key in original_weights_keys: + updated_key = original_key + # for input_rmsnorm.weight and post_attention_rmsnorm.weight + if "0.0.g" in updated_key: + present_index = updated_key.split(".")[4] + if int(present_index) % 2 == 0: + updated_key = updated_key.replace("0.0.g", "input_rmsnorm.weight") + else: + updated_key = updated_key.replace("0.0.g", "post_attention_rmsnorm.weight") + + if "transformer.attn_layers.layers" in updated_key: + present_index = updated_key.split(".")[4] + updated_index = update_index(int(present_index)) + updated_key = updated_key.replace( + f"transformer.attn_layers.layers.{present_index}", f"transformer.attn_layers.layers.{updated_index}" + ) + + for k, v in CLVP_ENCODERS_MAPPING.items(): + if k in updated_key: + updated_key = updated_key.replace(k, v) + + converted_weights[updated_key] = original_weights.pop(original_key) + + return converted_weights + + +def convert_decoder_weights(original_weights): + converted_weights = {} + original_weights_keys = sorted(original_weights.keys()) + for original_key in original_weights_keys: + updated_key = original_key + if len(updated_key.split(".")) > 3: + index, attr = updated_key.split(".")[2], updated_key.split(".")[-1] + 
+ # for decoder attention + if "attn.c_attn" in updated_key: + if attr == "weight": + slice1, slice2, slice3 = original_weights[updated_key].squeeze(-1).T.split(split_size=dim, dim=0) + else: + slice1, slice2, slice3 = original_weights[updated_key].split(split_size=dim, dim=0) + converted_weights[f"speech_decoder_model.model.decoder.layers.{index}.attn.q_proj.{attr}"] = slice1 + converted_weights[f"speech_decoder_model.model.decoder.layers.{index}.attn.k_proj.{attr}"] = slice2 + converted_weights[f"speech_decoder_model.model.decoder.layers.{index}.attn.v_proj.{attr}"] = slice3 + continue + + if "attn.c_proj" in updated_key: + converted_weights[f"speech_decoder_model.model.decoder.layers.{index}.attn.out_proj.{attr}"] = ( + original_weights[updated_key].squeeze(-1).T + ) + continue + + if "attn.bias" in updated_key or "attn.masked_bias" in updated_key or "text_head" in updated_key: + original_weights.pop(updated_key) + continue + + # conditional encoder attention + if "qkv" in updated_key: + if attr == "weight": + slice1, slice2, slice3 = original_weights[updated_key].squeeze(-1).split(split_size=dim, dim=0) + else: + slice1, slice2, slice3 = original_weights[updated_key].split(split_size=dim, dim=0) + + indices = torch.arange(dim) + index1, index2, index3 = ( + indices.unfold(0, sub_dim, sub_dim * 3).flatten(), + indices[sub_dim:].unfold(0, sub_dim, sub_dim * 3).flatten(), + indices[2 * sub_dim :].unfold(0, sub_dim, sub_dim * 3).flatten(), + ) + + converted_weights[f"conditioning_encoder.mel_attn_blocks.{index}.q_proj.{attr}"] = torch.concatenate( + [slice1[index1], slice2[index3], slice3[index2]], + axis=0, + ) + converted_weights[f"conditioning_encoder.mel_attn_blocks.{index}.k_proj.{attr}"] = torch.concatenate( + [slice1[index2], slice2[index1], slice3[index3]], + axis=0, + ) + converted_weights[f"conditioning_encoder.mel_attn_blocks.{index}.v_proj.{attr}"] = torch.concatenate( + [slice1[index3], slice2[index2], slice3[index1]], + axis=0, + ) + continue + + if "proj_out" in updated_key: + converted_weights[f"conditioning_encoder.mel_attn_blocks.{index}.out_proj.{attr}"] = original_weights[ + updated_key + ].squeeze(-1) + continue + + for k, v in CLVP_DECODER_MAPPING.items(): + if k in updated_key: + updated_key = updated_key.replace(k, v) + + converted_weights[updated_key] = original_weights.pop(original_key) + + return converted_weights + + +def _download(url: str, root: str): + repo_id = f"{url.split('/')[3]}/{url.split('/')[4]}" + filename = f"{url.split('/')[-2]}/{url.split('/')[-1]}" + hf_hub_download( + repo_id=repo_id, + filename=filename, + force_filename=root, + local_dir_use_symlinks=False, + ) + + +def convert_clvp_weights(checkpoint_path, pytorch_dump_folder_path): + converted_checkpoint = {} + + for each_model_name, each_model_url in _MODELS.items(): + each_model_path = os.path.join(checkpoint_path, each_model_url.split("/")[-1]) + if not os.path.exists(each_model_path): + print(f"\n{each_model_name} was not found! 
Downloading it to {each_model_path}") + _download(url=each_model_url, root=each_model_path) + + if each_model_name == "clvp": + clvp_checkpoint = torch.load(each_model_path, map_location="cpu") + else: + decoder_checkpoint = torch.load(each_model_path, map_location="cpu") + + # Converting the weights + converted_checkpoint.update(**convert_encoder_weights(clvp_checkpoint)) + converted_checkpoint.update(**convert_decoder_weights(decoder_checkpoint)) + + config = ClvpConfig.from_pretrained("susnato/clvp_dev") + model = ClvpModelForConditionalGeneration(config) + + model.load_state_dict(converted_checkpoint, strict=True) + model.save_pretrained(pytorch_dump_folder_path) + print(f"Model saved at {pytorch_dump_folder_path}!") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + # # Required parameters + parser.add_argument( + "--checkpoint_path", type=str, help="Path to the folder of downloaded checkpoints. (Please enter full path)" + ) + parser.add_argument( + "--pytorch_dump_folder_path", + default=None, + type=str, + help="Path to the output PyTorch model. (Please enter full path)", + ) + args = parser.parse_args() + + convert_clvp_weights(args.checkpoint_path, args.pytorch_dump_folder_path) diff --git a/src/transformers/models/clvp/feature_extraction_clvp.py b/src/transformers/models/clvp/feature_extraction_clvp.py new file mode 100644 index 00000000000000..69741a03f575b8 --- /dev/null +++ b/src/transformers/models/clvp/feature_extraction_clvp.py @@ -0,0 +1,238 @@ +# coding=utf-8 +# Copyright 2023 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Feature extractor class for CLVP +""" + +from typing import List, Optional, Union + +import numpy as np + +from ...audio_utils import mel_filter_bank, spectrogram, window_function +from ...feature_extraction_sequence_utils import SequenceFeatureExtractor +from ...feature_extraction_utils import BatchFeature +from ...utils import TensorType, logging + + +logger = logging.get_logger(__name__) + + +class ClvpFeatureExtractor(SequenceFeatureExtractor): + r""" + Constructs a CLVP feature extractor. + + This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains + most of the main methods. Users should refer to this superclass for more information regarding those methods. + + This class extracts log-mel-spectrogram features from raw speech using a custom numpy implementation of the `Short + Time Fourier Transform` which should match pytorch's `torch.stft` equivalent. + + Args: + feature_size (`int`, *optional*, defaults to 80): + The feature dimension of the extracted features. + sampling_rate (`int`, *optional*, defaults to 22050): + The sampling rate at which the audio files should be digitalized expressed in hertz (Hz). + default_audio_length (`int`, *optional*, defaults to 6): + The default length of raw audio in seconds. If `max_length` is not set during `__call__` then it will + automatically be set to default_audio_length * `self.sampling_rate`. 
+ hop_length (`int`, *optional*, defaults to 256): + Length of the overlaping windows for the STFT used to obtain the Mel Frequency coefficients. + chunk_length (`int`, *optional*, defaults to 30): + The maximum number of chuncks of `sampling_rate` samples used to trim and pad longer or shorter audio + sequences. + n_fft (`int`, *optional*, defaults to 1024): + Size of the Fourier transform. + padding_value (`float`, *optional*, defaults to 0.0): + Padding value used to pad the audio. Should correspond to silences. + mel_norms (`list` of length `feature_size`, *optional*): + If `mel_norms` is provided then it will be used to normalize the log-mel spectrograms along each + mel-filter. + return_attention_mask (`bool`, *optional*, defaults to `False`): + Whether to return the attention mask. If left to the default, it will return the attention mask. + + [What are attention masks?](../glossary#attention-mask) + """ + + model_input_names = ["input_features", "attention_mask"] + + def __init__( + self, + feature_size=80, + sampling_rate=22050, + default_audio_length=6, + hop_length=256, + chunk_length=30, + n_fft=1024, + padding_value=0.0, + mel_norms=None, + return_attention_mask=False, # pad inputs to max length with silence token (zero) and no attention mask + **kwargs, + ): + super().__init__( + feature_size=feature_size, + sampling_rate=sampling_rate, + padding_value=padding_value, + return_attention_mask=return_attention_mask, + **kwargs, + ) + self.n_fft = n_fft + self.hop_length = hop_length + self.chunk_length = chunk_length + self.n_samples = chunk_length * sampling_rate + self.nb_max_frames = self.n_samples // hop_length + self.sampling_rate = sampling_rate + self.default_audio_length = default_audio_length + self.mel_norms = mel_norms + self.mel_filters = mel_filter_bank( + num_frequency_bins=1 + (n_fft // 2), + num_mel_filters=feature_size, + min_frequency=0.0, + max_frequency=8000.0, + sampling_rate=sampling_rate, + norm="slaney", + mel_scale="htk", + ) + + def _np_extract_fbank_features(self, waveform: np.array) -> np.ndarray: + """ + This method first computes the log-mel spectrogram of the provided audio then applies normalization along the + each mel-filterbank, if `mel_norms` is provided. + """ + log_spec = spectrogram( + waveform, + window_function(self.n_fft, "hann"), + frame_length=self.n_fft, + hop_length=self.hop_length, + power=2.0, + mel_filters=self.mel_filters, + log_mel=None, + ) + + log_spec = np.log(np.clip(log_spec, a_min=1e-5, a_max=None)) + + if self.mel_norms is not None: + log_spec = log_spec / np.array(self.mel_norms)[:, None] + + return log_spec + + def __call__( + self, + raw_speech: Union[np.ndarray, List[float], List[np.ndarray], List[List[float]]], + sampling_rate: Optional[int] = None, + truncation: bool = True, + pad_to_multiple_of: Optional[int] = None, + return_tensors: Optional[Union[str, TensorType]] = None, + return_attention_mask: Optional[bool] = True, + padding: Optional[str] = "max_length", + max_length: Optional[int] = None, + **kwargs, + ) -> BatchFeature: + """ + `ClvpFeatureExtractor` is used to extract various voice specific properties such as the pitch and tone of the + voice, speaking speed, and even speaking defects like a lisp or stuttering from a sample voice or `raw_speech`. + + First the voice is padded or truncated in a way such that it becomes a waveform of `self.default_audio_length` + seconds long and then the log-mel spectrogram is extracted from it. 
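Since the paragraph above describes the end-to-end behaviour of `__call__` (pad or truncate to `default_audio_length` seconds, then extract the log-mel spectrogram), here is a minimal usage sketch with random audio at the default 22050 Hz sampling rate; it assumes `ClvpFeatureExtractor` is exported at the top level, as the `__init__.py` added in this patch suggests:

```python
import numpy as np
from transformers import ClvpFeatureExtractor

# Three seconds of fake mono audio at the extractor's default sampling rate.
raw_speech = np.random.randn(3 * 22050).astype(np.float32)

feature_extractor = ClvpFeatureExtractor()
inputs = feature_extractor(raw_speech, sampling_rate=22050, return_tensors="pt")

# The waveform is padded to default_audio_length * sampling_rate samples
# before the log-mel features are computed.
print(inputs["input_features"].shape)  # (batch_size, feature_size, num_frames)
```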
+ + Args: + raw_speech (`np.ndarray`, `List[float]`, `List[np.ndarray]`, `List[List[float]]`): + The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float + values, a list of numpy arrays or a list of list of float values. Must be mono channel audio, not + stereo, i.e. single float per timestep. + sampling_rate (`int`, *optional*): + The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass + `sampling_rate` at the forward call to prevent silent errors and allow automatic speech recognition + pipeline. + truncation (`bool`, *optional*, default to `True`): + Activates truncation to cut input sequences longer than *max_length* to *max_length*. + pad_to_multiple_of (`int`, *optional*): + If set will pad the sequence to a multiple of the provided value. + + This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability + `>= 7.5` (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128. + return_attention_mask (`bool`, *optional*, defaults to `True`): + Whether to return the attention mask. If left to the default, it will return the attention mask. + + [What are attention masks?](../glossary#attention-mask) + return_tensors (`str` or [`~utils.TensorType`], *optional*): + If set, will return tensors instead of list of python integers. Acceptable values are: + + - `'tf'`: Return TensorFlow `tf.constant` objects. + - `'pt'`: Return PyTorch `torch.Tensor` objects. + - `'np'`: Return Numpy `np.ndarray` objects. + padding_value (`float`, defaults to 0.0): + The value that is used to fill the padding values / vectors. + max_length (`int`, *optional*): + The maximum input length of the inputs. + """ + + if sampling_rate is not None: + if sampling_rate != self.sampling_rate: + raise ValueError( + f"The model corresponding to this feature extractor: {self.__class__.__name__} was trained using a" + f" sampling rate of {self.sampling_rate}. Please make sure that the provided `raw_speech` input" + f" was sampled with {self.sampling_rate} and not {sampling_rate}." + ) + else: + logger.warning( + "It is strongly recommended to pass the `sampling_rate` argument to this function. " + "Failing to do so can result in silent errors that might be hard to debug." 
+ ) + + is_batched_numpy = isinstance(raw_speech, np.ndarray) and len(raw_speech.shape) > 1 + if is_batched_numpy and len(raw_speech.shape) > 2: + raise ValueError(f"Only mono-channel audio is supported for input to {self}") + is_batched = is_batched_numpy or ( + isinstance(raw_speech, (list, tuple)) and (isinstance(raw_speech[0], (np.ndarray, tuple, list))) + ) + + if is_batched: + raw_speech = [np.asarray([speech], dtype=np.float32).T for speech in raw_speech] + elif not is_batched and not isinstance(raw_speech, np.ndarray): + raw_speech = np.asarray(raw_speech, dtype=np.float32) + elif isinstance(raw_speech, np.ndarray) and raw_speech.dtype is np.dtype(np.float64): + raw_speech = raw_speech.astype(np.float32) + + # always return batch + if not is_batched: + raw_speech = [np.asarray([raw_speech]).T] + + batched_speech = BatchFeature({"input_features": raw_speech}) + + max_length = self.default_audio_length * self.sampling_rate if max_length is None else max_length + + padded_inputs = self.pad( + batched_speech, + padding=padding, + max_length=max_length, + truncation=truncation, + pad_to_multiple_of=pad_to_multiple_of, + return_attention_mask=return_attention_mask, + ) + + # make sure list is in array format + input_features = padded_inputs.get("input_features").transpose(2, 0, 1) + + input_features = [ + self._np_extract_fbank_features(waveform).astype(np.float32) for waveform in input_features[0] + ] + + if isinstance(input_features[0], List): + padded_inputs["input_features"] = [np.asarray(feature) for feature in input_features] + else: + padded_inputs["input_features"] = input_features + + return padded_inputs.convert_to_tensors(return_tensors) diff --git a/src/transformers/models/clvp/modeling_clvp.py b/src/transformers/models/clvp/modeling_clvp.py new file mode 100644 index 00000000000000..b660f54e5d820f --- /dev/null +++ b/src/transformers/models/clvp/modeling_clvp.py @@ -0,0 +1,2024 @@ +# coding=utf-8 +# Copyright 2023 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +""" PyTorch CLVP model.""" + + +import copy +import math +from dataclasses import dataclass +from typing import Dict, Optional, Tuple, Union + +import torch +import torch.utils.checkpoint +from torch import nn +from torch.nn import CrossEntropyLoss + +from ...activations import ACT2FN +from ...generation import GenerationConfig +from ...modeling_attn_mask_utils import _prepare_4d_attention_mask, _prepare_4d_causal_attention_mask +from ...modeling_outputs import ( + BaseModelOutput, + BaseModelOutputWithPastAndCrossAttentions, + BaseModelOutputWithPooling, + CausalLMOutputWithCrossAttentions, +) +from ...modeling_utils import PreTrainedModel, SequenceSummary +from ...pytorch_utils import Conv1D +from ...utils import ( + ModelOutput, + add_start_docstrings, + add_start_docstrings_to_model_forward, + logging, + replace_return_docstrings, +) +from .configuration_clvp import ( + ClvpConfig, + ClvpDecoderConfig, + ClvpEncoderConfig, +) + + +logger = logging.get_logger(__name__) + +_CHECKPOINT_FOR_DOC = "susnato/clvp_dev" + +CLVP_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "susnato/clvp_dev", + # See all Clvp models at https://huggingface.co/models?filter=clvp +] + + +# Copied from transformers.models.clip.modeling_clip.contrastive_loss +def contrastive_loss(logits: torch.Tensor) -> torch.Tensor: + return nn.functional.cross_entropy(logits, torch.arange(len(logits), device=logits.device)) + + +# Copied from transformers.models.clip.modeling_clip.clip_loss with clip->clvp, image_loss->speech_loss +def clvp_loss(similarity: torch.Tensor) -> torch.Tensor: + caption_loss = contrastive_loss(similarity) + speech_loss = contrastive_loss(similarity.t()) + return (caption_loss + speech_loss) / 2.0 + + +# Copied from transformers.models.llama.modeling_llama.rotate_half +def rotate_half(x): + """Rotates half the hidden dims of the input.""" + x1 = x[..., : x.shape[-1] // 2] + x2 = x[..., x.shape[-1] // 2 :] + return torch.cat((-x2, x1), dim=-1) + + +def apply_rotary_pos_emb(q, k, v, cos, sin, position_ids, unsqueeze_dim=1): + """Applies Rotary Position Embedding to the query and key tensors. + + Args: + q (`torch.Tensor`): The query tensor. + k (`torch.Tensor`): The key tensor. + cos (`torch.Tensor`): The cosine part of the rotary embedding. + sin (`torch.Tensor`): The sine part of the rotary embedding. + position_ids (`torch.Tensor`): + The position indices of the tokens corresponding to the query and key tensors. For example, this can be + used to pass offsetted position ids when working with a KV-cache. + unsqueeze_dim (`int`, *optional*, defaults to 1): + The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and + sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note + that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and + k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes + cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have + the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2. + Returns: + `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding. 
+ """ + cos = cos[position_ids].unsqueeze(unsqueeze_dim) + sin = sin[position_ids].unsqueeze(unsqueeze_dim) + q_embed = (q * cos) + (rotate_half(q) * sin) + k_embed = (k * cos) + (rotate_half(k) * sin) + v_embed = (v * cos) + (rotate_half(v) * sin) + return q_embed, k_embed, v_embed + + +def _pad_extra_bos_eos_tokens( + input_ids, + attention_mask=None, + pad_token_id=0, + bos_token_id=255, + eos_token_id=0, + add_bos_token=True, + add_eos_token=True, +): + """ + This method adds extra bos and eos tokens to input_ids and accordingly modifies the attention_mask which is used in + `ClvpConditioningEncoder` and the generation loop of the `ClvpModelForConditionalGeneration`. + """ + + # add the bos token at the beginning + if add_bos_token: + input_ids = torch.nn.functional.pad(input_ids, (1, 0), value=bos_token_id) + attention_mask = ( + torch.nn.functional.pad(attention_mask, (1, 0), value=1) if attention_mask is not None else attention_mask + ) + + modified_input_ids = input_ids + if add_eos_token: + modified_input_ids = torch.zeros( + (input_ids.shape[0], input_ids.shape[1] + 1), dtype=input_ids.dtype, device=input_ids.device + ) + for i, each_input_id in enumerate(input_ids): + # locate where the valid tokens end and then add the eos token + if torch.isin(each_input_id, pad_token_id).sum(): + pos = torch.where(each_input_id == pad_token_id)[0].min() + modified_input_ids[i] = torch.concatenate( + [each_input_id[:pos], torch.tensor([eos_token_id], device=input_ids.device), each_input_id[pos:]] + ) + else: + # if there are no pad tokens present, then add eos to the end + modified_input_ids[i] = torch.nn.functional.pad(each_input_id, (0, 1), value=eos_token_id) + attention_mask = ( + torch.nn.functional.pad(attention_mask, (1, 0), value=1) if attention_mask is not None else attention_mask + ) + + return modified_input_ids, attention_mask + + +@dataclass +class ClvpEncoderOutput(ModelOutput): + """ + Base class for CLVP encoder's outputs that contains a pooling of the last hidden states as well as a projection + output (a linear layer on top of the pooled output). + + Args: + embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when model is initialized with `with_projection=True`): + The embeddings obtained by applying the projection layer to the pooler_output. + last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + The hidden state of the last layer of the model. + pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`): + Pooled output of the `last_hidden_state`. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of + the model at the output of each layer plus the optional initial embedding outputs. + attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in + the self-attention heads. 
+ """ + + embeds: Optional[torch.FloatTensor] = None + last_hidden_state: torch.FloatTensor = None + pooler_output: Optional[torch.FloatTensor] = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + + +@dataclass +class ClvpOutput(ModelOutput): + """ + Args: + loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`): + Contrastive loss for speech-text similarity. + speech_ids (`torch.LongTensor`, *optional*): + speech_ids (or speech candidates) generated by the `ClvpForCausalLM` model. + logits_per_speech (`torch.FloatTensor` of shape `(speech_batch_size, text_batch_size)`): + The scaled dot product scores between `speech_embeds` and `text_embeds`. This represents the speech-text + similarity scores. + logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, speech_batch_size)`): + The scaled dot product scores between `text_embeds` and `speech_embeds`. This represents the text-speech + similarity scores. + text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`): + The text embeddings obtained by applying the projection layer to the pooled output of the text encoder + model. + speech_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`): + The speech embeddings obtained by applying the projection layer to the pooled output of the speech encoder + model. + text_model_output (`BaseModelOutputWithPooling`): + The pooled output of the `last_hidden_state` of the text encoder Model. + speech_model_output (`BaseModelOutputWithPooling`): + The pooled output of the `last_hidden_state` of the speech encoder Model. + decoder_hidden_states (`torch.FloatTensor`, *optional*): + The hidden states of the decoder model. + text_encoder_hidden_states (`torch.FloatTensor`, *optional*): + The hidden states of the text encoder model. + speech_encoder_hidden_states (`torch.FloatTensor`, *optional*): + The hidden states of the speech encoder model. + """ + + loss: Optional[torch.FloatTensor] = None + speech_ids: Optional[torch.LongTensor] = None + logits_per_speech: torch.FloatTensor = None + logits_per_text: torch.FloatTensor = None + text_embeds: torch.FloatTensor = None + speech_embeds: torch.FloatTensor = None + text_model_output: BaseModelOutputWithPooling = None + speech_model_output: BaseModelOutputWithPooling = None + decoder_hidden_states: torch.FloatTensor = None + text_encoder_hidden_states: torch.FloatTensor = None + speech_encoder_hidden_states: torch.FloatTensor = None + + +# Copied from transformers.models.llama.modeling_llama.LlamaRMSNorm with Llama->Clvp +class ClvpRMSNorm(nn.Module): + def __init__(self, hidden_size, eps=1e-6): + """ + ClvpRMSNorm is equivalent to T5LayerNorm + """ + super().__init__() + self.weight = nn.Parameter(torch.ones(hidden_size)) + self.variance_epsilon = eps + + def forward(self, hidden_states): + input_dtype = hidden_states.dtype + hidden_states = hidden_states.to(torch.float32) + variance = hidden_states.pow(2).mean(-1, keepdim=True) + hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon) + return self.weight * hidden_states.to(input_dtype) + + +class ClvpRotaryPositionalEmbedding(nn.Module): + """ + Rotary Position Embedding Class for CLVP. It was proposed in the paper 'ROFORMER: ENHANCED TRANSFORMER WITH ROTARY + POSITION EMBEDDING', Please see https://arxiv.org/pdf/2104.09864v1.pdf . 
+ """ + + def __init__(self, config): + super().__init__() + dim = max(config.projection_dim // (config.num_attention_heads * 2), 32) + inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, dtype=torch.int64).float() / dim)) + + self.register_buffer("inv_freq", inv_freq) + self.cached_sequence_length = None + self.cached_rotary_positional_embedding = None + + def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor: + sequence_length = hidden_states.shape[1] + + if sequence_length == self.cached_sequence_length and self.cached_rotary_positional_embedding is not None: + return self.cached_rotary_positional_embedding + + self.cached_sequence_length = sequence_length + time_stamps = torch.arange(sequence_length, device=hidden_states.device).type_as(self.inv_freq) + freqs = torch.einsum("i,j->ij", time_stamps, self.inv_freq) + embeddings = torch.cat((freqs, freqs), dim=-1) + + self.cached_rotary_positional_embedding = embeddings.unsqueeze(0) + return self.cached_rotary_positional_embedding + + +class ClvpSelfAttention(nn.Module): + """ + Multi-headed attention to combine Absolute and Rotary Positional Embeddings into a single Attention module. + """ + + def __init__(self, config): + super().__init__() + self.config = config + self.embed_dim = config.hidden_size + self.num_heads = config.num_attention_heads + self.head_dim = self.embed_dim // self.num_heads + if self.head_dim * self.num_heads != self.embed_dim: + raise ValueError( + f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:" + f" {self.num_heads})." + ) + self.scale = self.head_dim**-0.5 + self.dropout = config.attention_dropout + + if hasattr(config, "max_position_embeddings"): + max_positions = config.max_position_embeddings + bias = torch.tril(torch.ones((max_positions, max_positions), dtype=torch.bool)) + bias = bias.view(1, 1, max_positions, max_positions) + self.register_buffer("bias", bias, persistent=False) + + self.k_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.use_attention_bias) + self.v_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.use_attention_bias) + self.q_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.use_attention_bias) + self.out_proj = nn.Linear(self.embed_dim, self.embed_dim) + + # Copied from transformers.models.clip.modeling_clip.CLIPAttention._shape + def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int): + return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous() + + def forward( + self, + hidden_states: torch.FloatTensor, + rotary_pos_emb: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + past_key_value: Optional[Tuple[torch.Tensor]] = None, + use_cache: Optional[bool] = False, + head_mask: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.FloatTensor, Optional[torch.FloatTensor], Optional[Tuple[torch.FloatTensor]]]: + # Raise error when position_ids is None but rotary_pos_emb is provided, because we need that when applying + # rotary_pos_emb to query and key states. 
+ if rotary_pos_emb is not None and position_ids is None: + raise ValueError("`position_ids` must be provided when `rotary_pos_emb` is not None.") + + bsz, _, embed_dim = hidden_states.size() + + # get query proj + query_states = self._shape(self.q_proj(hidden_states), -1, bsz) * self.scale + key_states = self._shape(self.k_proj(hidden_states), -1, bsz) + value_states = self._shape(self.v_proj(hidden_states), -1, bsz) + + if past_key_value is not None: + past_key, past_value = past_key_value + key_states = torch.cat((past_key, key_states), dim=-2) + value_states = torch.cat((past_value, value_states), dim=-2) + + if use_cache is True: + present = (key_states, value_states) + else: + present = None + + if rotary_pos_emb is not None: + rotary_emb_dim = rotary_pos_emb.shape[-1] + + # Partial rotary embedding + query_rot, query_pass = ( + query_states[..., :rotary_emb_dim], + query_states[..., rotary_emb_dim:], + ) + key_rot, key_pass = ( + key_states[..., :rotary_emb_dim], + key_states[..., rotary_emb_dim:], + ) + value_rot, value_pass = ( + value_states[..., :rotary_emb_dim], + value_states[..., rotary_emb_dim:], + ) + + cos, sin = rotary_pos_emb.cos().squeeze(0), rotary_pos_emb.sin().squeeze(0) + query_rot, key_rot, value_rot = apply_rotary_pos_emb(query_rot, key_rot, value_rot, cos, sin, position_ids) + + # [batch_size, num_heads, seq_length, head_dim] + query_states = torch.cat((query_rot, query_pass), dim=-1) + key_states = torch.cat((key_rot, key_pass), dim=-1) + value_states = torch.cat((value_rot, value_pass), dim=-1) + + tgt_len = query_states.shape[2] + src_len = key_states.shape[2] + attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) + + if attention_mask is not None: + if attention_mask.size() != (bsz, 1, tgt_len, src_len): + raise ValueError( + f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is {attention_mask.size()}" + ) + attn_weights = attn_weights + attention_mask + + attn_weights = nn.functional.softmax(attn_weights, dim=-1) + + # Mask heads if we want to + if head_mask is not None: + attn_weights = attn_weights * head_mask + + attn_probs = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training) + attn_output = torch.matmul(attn_probs, value_states) + + if attn_output.size() != (bsz, self.num_heads, tgt_len, self.head_dim): + raise ValueError( + f"`attn_output` should be of size {(bsz, self.num_heads, tgt_len, self.head_dim)}, but is" + f" {attn_output.size()}" + ) + + attn_output = attn_output.transpose(1, 2).contiguous() + attn_output = attn_output.reshape(bsz, tgt_len, self.embed_dim) + + attn_output = self.out_proj(attn_output) + + if not output_attentions: + attn_weights = None + + return attn_output, present, attn_weights + + +class ClvpGatedLinearUnit(nn.Module): + """ + `ClvpGatedLinearUnit` uses the second half of the `hidden_states` to act as a gate for the first half of the + `hidden_states` which controls the flow of data from the first of the tensor. + """ + + def __init__(self, config): + super().__init__() + self.activation_fn = ACT2FN[config.hidden_act] + self.proj = nn.Linear(config.hidden_size, config.intermediate_size * 2) + + def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor: + hidden_states, gate = self.proj(hidden_states).chunk(2, dim=-1) + return hidden_states * self.activation_fn(gate) + + +class ClvpEncoderMLP(nn.Module): + """ + This MLP is used in CLVP speech or text encoder models. 
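`ClvpGatedLinearUnit` above projects to twice the intermediate size and lets the second half gate the first. A small illustrative sketch of that pattern (the sizes and the SiLU activation are assumptions for the example, not values taken from the config):

```python
import torch
from torch import nn

hidden_size, intermediate_size = 16, 32
proj = nn.Linear(hidden_size, intermediate_size * 2)
activation = nn.SiLU()  # stand-in for ACT2FN[config.hidden_act]

hidden_states = torch.randn(2, 7, hidden_size)
value, gate = proj(hidden_states).chunk(2, dim=-1)  # split the doubled projection in two
gated = value * activation(gate)                    # the gate controls how much of `value` flows through
print(gated.shape)                                  # torch.Size([2, 7, 32])
```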
+ """ + + def __init__(self, config): + super().__init__() + self.config = config + + self.fc1 = ClvpGatedLinearUnit(config) + self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size) + self.dropout_layer = nn.Dropout(config.dropout) + + def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor: + hidden_states = self.fc1(hidden_states) + hidden_states = self.dropout_layer(hidden_states) + hidden_states = self.fc2(hidden_states) + return hidden_states + + +class ClvpEncoderLayer(nn.Module): + def __init__(self, config: ClvpConfig): + super().__init__() + self.config = config + self.embed_dim = config.hidden_size + self.self_attn = ClvpSelfAttention(config) + self.mlp = ClvpEncoderMLP(config) + + self.input_rmsnorm = ClvpRMSNorm(self.embed_dim, eps=config.layer_norm_eps) + self.post_attention_rmsnorm = ClvpRMSNorm(self.embed_dim, eps=config.layer_norm_eps) + + def forward( + self, + hidden_states: torch.FloatTensor, + rotary_pos_emb: torch.FloatTensor, + attention_mask: torch.LongTensor, + position_ids: torch.LongTensor, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.FloatTensor]: + """ + Args: + hidden_states (`torch.FloatTensor` of shape `(batch, seq_len, embed_dim)`): + input to the layer. + rotary_pos_emb (`torch.FloatTensor`): + rotary position embeddings generated by `ClvpRotaryPositionalEmbedding` module. + attention_mask (`torch.FloatTensor` of shape `(batch, 1, tgt_len, src_len)`): + attention mask where padding elements are indicated by very large negative values. + position_ids (`torch.LongTensor`): + Denotes position ids of the input tokens. + output_attentions (`bool`, *optional*, defaults to `False`): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under + returned tensors for more detail. 
+ """ + residual = hidden_states + + hidden_states = self.input_rmsnorm(hidden_states) + + attention_outputs = self.self_attn( + hidden_states=hidden_states, + rotary_pos_emb=rotary_pos_emb, + attention_mask=attention_mask, + position_ids=position_ids, + output_attentions=output_attentions, + ) + + hidden_states = attention_outputs[0] + + hidden_states = residual + hidden_states + + residual = hidden_states + hidden_states = self.post_attention_rmsnorm(hidden_states) + hidden_states = self.mlp(hidden_states) + hidden_states = residual + hidden_states + + outputs = (hidden_states,) + + if output_attentions: + outputs += (attention_outputs[-1],) + + return outputs + + +# Copied from transformers.models.gpt2.modeling_gpt2.GPT2MLP with GPT2->ClvpDecoderMLP +class ClvpDecoderMLP(nn.Module): + def __init__(self, intermediate_size, config): + super().__init__() + embed_dim = config.hidden_size + self.c_fc = Conv1D(intermediate_size, embed_dim) + self.c_proj = Conv1D(embed_dim, intermediate_size) + self.act = ACT2FN[config.activation_function] + self.dropout = nn.Dropout(config.resid_pdrop) + + def forward(self, hidden_states: Optional[Tuple[torch.FloatTensor]]) -> torch.FloatTensor: + hidden_states = self.c_fc(hidden_states) + hidden_states = self.act(hidden_states) + hidden_states = self.c_proj(hidden_states) + hidden_states = self.dropout(hidden_states) + return hidden_states + + +class ClvpDecoderLayer(nn.Module): + def __init__(self, config): + super().__init__() + hidden_size = config.hidden_size + inner_dim = config.n_inner if config.n_inner is not None else 4 * hidden_size + + self.input_layernorm = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon) + self.attn = ClvpSelfAttention(config) + self.post_attention_layernorm = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon) + + self.mlp = ClvpDecoderMLP(inner_dim, config) + + def forward( + self, + hidden_states: Optional[Tuple[torch.FloatTensor]], + past_key_value: Optional[Tuple[torch.Tensor]] = None, + attention_mask: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + use_cache: Optional[bool] = False, + output_attentions: Optional[bool] = False, + ) -> Union[Tuple[torch.Tensor], Optional[Tuple[torch.Tensor, Tuple[torch.FloatTensor, ...]]]]: + residual = hidden_states + hidden_states = self.input_layernorm(hidden_states) + attn_outputs = self.attn( + hidden_states, + past_key_value=past_key_value, + attention_mask=attention_mask, + position_ids=position_ids, + head_mask=head_mask, + use_cache=use_cache, + output_attentions=output_attentions, + ) + attn_output = attn_outputs[0] + outputs = attn_outputs[1:] + # residual connection + hidden_states = attn_output + residual + + residual = hidden_states + hidden_states = self.post_attention_layernorm(hidden_states) + feed_forward_hidden_states = self.mlp(hidden_states) + # residual connection + hidden_states = residual + feed_forward_hidden_states + + if use_cache: + outputs = (hidden_states,) + outputs + else: + outputs = (hidden_states,) + outputs[1:] + + return outputs + + +class ClvpConditioningEncoder(nn.Module): + """ + This class processes the log-mel spectrograms(extracted by the Feature Extractor) and text tokens(produced by the + tokenizer) as inputs for the decoder model. 
+ + First each log-mel spectrogram is processed into a single vector which captures valuable characteristics from each + of them, then the text tokens are converted into token embeddings and position embeddings are added afterwards. + Both of these vectors are concatenated and then passed to the decoder model. + + The text tokens helps to incorporate the "text information" and the log-mel spectrogram is used to specify the + "voice characteristics" into the generated mel tokens. + """ + + def __init__(self, config: ClvpConfig): + super().__init__() + + self.text_config = config.text_config + self.decoder_config = config.decoder_config + + self.text_token_embedding = nn.Embedding(self.text_config.vocab_size, self.decoder_config.hidden_size) + self.text_position_embedding = nn.Embedding( + self.decoder_config.max_text_tokens, self.decoder_config.hidden_size + ) + + self.mel_conv = nn.Conv1d(self.decoder_config.feature_size, self.decoder_config.hidden_size, kernel_size=1) + + # define group norms to be used before each attention layer + num_groups = self.compute_groupnorm_groups(self.decoder_config.hidden_size) + self.group_norms = nn.ModuleList( + [ + nn.GroupNorm(num_groups, self.decoder_config.hidden_size, eps=1e-5, affine=True) + for _ in range(self.decoder_config.num_mel_attn_blocks) + ] + ) + + # define the attention layers + self.mel_attn_blocks = nn.ModuleList( + [ClvpSelfAttention(self.decoder_config) for _ in range(self.decoder_config.num_mel_attn_blocks)] + ) + + self.gradient_checkpointing = False + + def compute_groupnorm_groups(self, channels: int, groups: int = 32): + """ + Calculates the value of `num_groups` for nn.GroupNorm. This logic is taken from the official tortoise + repository. link : + https://github.com/neonbjb/tortoise-tts/blob/4003544b6ff4b68c09856e04d3eff9da26d023c2/tortoise/models/arch_util.py#L26 + """ + if channels <= 16: + groups = 8 + elif channels <= 64: + groups = 16 + while channels % groups != 0: + groups = int(groups / 2) + + if groups <= 2: + raise ValueError( + f"Number of groups for the GroupNorm must be greater than 2, but it is {groups}." + f"Please consider using a different `hidden_size`" + ) + + return groups + + def forward( + self, + input_features: torch.FloatTensor, + input_ids: Optional[torch.LongTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.LongTensor] = None, + ): + # process text + if input_ids is not None and inputs_embeds is not None: + raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") + elif input_ids is not None: + batch_size, seq_length = input_ids.size() + elif inputs_embeds is not None: + batch_size, seq_length = inputs_embeds.size()[:-1] + else: + raise ValueError("You have to specify either input_ids or inputs_embeds") + + # construct attention mask if not given + if attention_mask is None: + attention_mask = torch.ones([batch_size, seq_length], dtype=torch.long, device=input_ids.device) + + # We add bos and eos input_ids in the modeling file instead of the tokenizer file to keep the logic simple + # This logic is specific to ClvpConditioningEncoder and not used by other modules. 
+ input_ids, attention_mask = _pad_extra_bos_eos_tokens( + input_ids, + attention_mask, + bos_token_id=self.text_config.bos_token_id, + eos_token_id=self.text_config.eos_token_id, + ) + + inputs_embeds = self.text_token_embedding(input_ids) + position_ids = attention_mask.cumsum(-1) - 1 + position_embeds = self.text_position_embedding(position_ids) + text_embeds = inputs_embeds + position_embeds + + if self.gradient_checkpointing and self.training: + # process each log-mel spectrogram into a single vector + mel_spec = torch.utils.checkpoint.checkpoint(self.mel_conv, input_features) + + for i, mel_attn_block in enumerate(self.mel_attn_blocks): + residual_mel_spec = mel_spec.transpose(1, 2) + + mel_spec = torch.utils.checkpoint.checkpoint(self.group_norms[i], mel_spec).transpose(1, 2) + mel_spec = torch.utils.checkpoint.checkpoint(mel_attn_block, mel_spec)[0] + residual_mel_spec + mel_spec = mel_spec.transpose(1, 2) + + else: + # process each log-mel spectrogram into a single vector + mel_spec = self.mel_conv(input_features) + + for i, mel_attn_block in enumerate(self.mel_attn_blocks): + residual_mel_spec = mel_spec.transpose(1, 2) + + mel_spec = self.group_norms[i](mel_spec).transpose(1, 2) + mel_spec = mel_attn_block(mel_spec)[0] + residual_mel_spec + mel_spec = mel_spec.transpose(1, 2) + + mel_spec = mel_spec[:, :, 0] + mel_spec = mel_spec.unsqueeze(1) + + # repeat if there is either (1 text vs N audios) or (N texts vs 1 audio) + if text_embeds.shape[0] == 1 and mel_spec.shape[0] != 1: + text_embeds = text_embeds.repeat(mel_spec.shape[0], 1, 1) + elif text_embeds.shape[0] != 1 and mel_spec.shape[0] == 1: + mel_spec = mel_spec.repeat(text_embeds.shape[0], 1, 1) + # If there is N texts and M audios we will raise error since the number of text and audio must be same. + elif text_embeds.shape[0] != mel_spec.shape[0]: + raise ValueError( + f"The number of texts and number of audios must be same. " + f"Found {text_embeds.shape[0]} texts vs {mel_spec.shape[0]} audios" + ) + + return torch.concat([mel_spec, text_embeds], dim=1) + + +class ClvpPreTrainedModel(PreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. 
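`compute_groupnorm_groups` above starts from 32 groups (16 or 8 for small channel counts) and halves the count until it divides the number of channels, raising once it drops to 2 or below. A standalone restatement of that rule, handy for sanity-checking a `hidden_size`:

```python
def compute_groupnorm_groups(channels: int, groups: int = 32) -> int:
    # same rule as the helper in ClvpConditioningEncoder above
    if channels <= 16:
        groups = 8
    elif channels <= 64:
        groups = 16
    while channels % groups != 0:
        groups = int(groups / 2)
    if groups <= 2:
        raise ValueError(f"Number of groups must be greater than 2, got {groups}")
    return groups

for hidden_size in (64, 80, 256):
    print(hidden_size, compute_groupnorm_groups(hidden_size))
# 64 -> 16, 80 -> 16, 256 -> 32; a value like 50 would halve down to 2 and raise
```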
+ """ + + config_class = ClvpConfig + base_model_prefix = "clvp" + supports_gradient_checkpointing = True + _skip_keys_device_placement = "past_key_values" + + def _init_weights(self, module): + """Initialize the weights""" + factor = self.config.initializer_factor + if isinstance(module, nn.Embedding): + module.weight.data.normal_(mean=0.0, std=factor * 0.02) + elif isinstance(module, (nn.Linear, Conv1D, nn.Conv1d)): + module.weight.data.normal_(mean=0.0, std=factor * 0.02) + if module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, ClvpEncoderMLP): + factor = self.config.initializer_factor + in_proj_std = (module.config.hidden_size**-0.5) * ((2 * module.config.num_hidden_layers) ** -0.5) * factor + fc_std = (2 * module.config.hidden_size) ** -0.5 * factor + nn.init.normal_(module.fc1.proj.weight if getattr(module.fc1, "proj") else module.fc1.weight, std=fc_std) + nn.init.normal_(module.fc2.weight, std=in_proj_std) + elif isinstance(module, ClvpEncoder): + config = self.config.text_config if hasattr(self.config, "text_config") else self.config + factor = config.initializer_factor + module.projection.weight.data.normal_(mean=0.0, std=factor * (config.hidden_size**-0.5)) + elif isinstance(module, ClvpConditioningEncoder): + module.mel_conv.weight.data.normal_(mean=0.0, std=factor) + module.mel_conv.bias.data.zero_() + elif isinstance(module, ClvpForCausalLM): + for name, p in module.named_parameters(): + if name == "c_proj.weight": + p.data.normal_( + mean=0.0, std=(self.config.initializer_range / math.sqrt(2 * self.config.num_hidden_layers)) + ) + if isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + + +CLVP_START_DOCSTRING = r""" + This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the + library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads + etc.) + + This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. + Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage + and behavior. + + Parameters: + config ([`ClvpConfig`]): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the + configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights. +""" + + +CLVP_INPUTS_DOCSTRING = r""" + Args: + input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): + Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide + it. + + Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. + + [What are input IDs?](../glossary#input-ids) + input_features (`torch.FloatTensor` of shape `(batch_size, feature_size, time_dim)`): + Indicates log mel-spectrogram representations for audio returned by [`ClvpFeatureExtractor`]. + conditioning_encoder_inputs_embeds (`torch.FloatTensor`, *optional*): + inputs_embeds for `ClvpConditioningEncoder`. Can be used in place of `input_ids`. + text_encoder_inputs_embeds (`torch.FloatTensor`, *optional*): + inputs_embeds for the text encoder model passed in place of `input_ids`. 
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding text token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + return_loss (`bool`, *optional*): + Whether or not to return the contrastive loss. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + + +CLVP_DECODER_INPUTS_DOCSTRING = r""" + Args: + input_ids (`torch.LongTensor` of shape `(batch_size, input_ids_length)`): + Indices of input sequence tokens in the vocabulary. + + Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. + + [What are input IDs?](../glossary#input-ids) + past_key_values (`Tuple[Tuple[torch.Tensor]]` of length `config.n_layers`): + Contains precomputed hidden-states (key and values in the attention blocks) as computed by the model (see + `past_key_values` output below). Can be used to speed up sequential decoding. The `input_ids` which have + their past given to this model should not be passed as `input_ids` as they have already been computed. + attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + If `past_key_values` is used, `attention_mask` needs to contain the masking strategy that was used for + `past_key_values`. In other words, the `attention_mask` always has to have the length: + `len(past_key_values) + len(input_ids)` + + [What are attention masks?](../glossary#attention-mask) + token_type_ids (`torch.LongTensor` of shape `(batch_size, input_ids_length)`, *optional*): + Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, + 1]`: + + - 0 corresponds to a *sentence A* token, + - 1 corresponds to a *sentence B* token. + + [What are token type IDs?](../glossary#token-type-ids) + position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): + Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0, + config.max_position_embeddings - 1]`. + + [What are position IDs?](../glossary#position-ids) + head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*): + Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`: + + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + + inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This + is useful if you want more control over how to convert `input_ids` indices into associated vectors than the + model's internal embedding lookup matrix. 
+ + If `past_key_values` is used, optionally only the last `inputs_embeds` have to be input (see + `past_key_values`). + use_cache (`bool`, *optional*): + If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see + `past_key_values`). + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned + tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for + more detail. + return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. +""" + + +class ClvpEncoder(ClvpPreTrainedModel): + """ + Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a + [`ClvpEncoderLayer`]. + + Args: + config: ClvpConfig + """ + + def __init__(self, config: ClvpConfig): + super().__init__(config) + + self.config = config + self.token_embedding = nn.Embedding(config.vocab_size, config.hidden_size) + self.rotary_pos_emb = ClvpRotaryPositionalEmbedding(config) if config.use_rotary_embedding else None + self.layers = nn.ModuleList([ClvpEncoderLayer(config) for _ in range(config.num_hidden_layers)]) + + self.sequence_summary = SequenceSummary(config) + self.final_layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + + self.projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False) + + self.gradient_checkpointing = False + + self.post_init() + + def get_input_embeddings(self): + return self.token_embedding + + def set_input_embeddings(self, value): + self.token_embedding = value + + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + inputs_embeds: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, BaseModelOutput]: + r""" + Args: + input_ids (`torch.LongTensor` of shape `(batch_size, input_ids_length)`, *optional*): + Indices of input sequence tokens in the vocabulary. + + Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and + [`PreTrainedTokenizer.__call__`] for details. + + [What are input IDs?](../glossary#input-ids) + inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): + input embeddings for the model. This bypasses the model's internal embedding lookup matrix. + attention_mask (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + position_ids (`torch.LongTensor`, *optional*): + Denotes the position ids of `input_ids`. + output_attentions (`bool`, *optional*): + Whether or not to return the attentions tensors of all attention layers. See `attentions` under + returned tensors for more detail. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors + for more detail. 
+ return_dict (`bool`, *optional*): + Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if input_ids is not None and inputs_embeds is not None: + raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") + elif input_ids is not None: + self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask) + input_shape = input_ids.size() + input_ids = input_ids.view(-1, input_shape[-1]) + inputs_embeds = self.token_embedding(input_ids) + elif inputs_embeds is not None: + input_shape = inputs_embeds.size()[:-1] + else: + raise ValueError("You have to specify either input_ids or inputs_embeds") + + # expand attention_mask and create position_ids if needed + if attention_mask is not None: + # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len] + attention_mask = _prepare_4d_attention_mask(attention_mask, inputs_embeds.dtype) + + if position_ids is None: + device = input_ids.device if input_ids is not None else inputs_embeds.device + position_ids = torch.arange(input_shape[1], dtype=torch.long, device=device) + position_ids = position_ids.unsqueeze(0) + + encoder_states = () if output_hidden_states else None + all_attentions = () if output_attentions else None + + rotary_pos_emb = self.rotary_pos_emb(inputs_embeds) if self.rotary_pos_emb is not None else None + + hidden_states = inputs_embeds + for idx, encoder_layer in enumerate(self.layers): + if output_hidden_states: + encoder_states = encoder_states + (hidden_states,) + if self.gradient_checkpointing and self.training: + layer_outputs = torch.utils.checkpoint.checkpoint( + encoder_layer.__call__, + hidden_states, + rotary_pos_emb, + attention_mask, + position_ids, + ) + else: + layer_outputs = encoder_layer( + hidden_states, + rotary_pos_emb, + attention_mask, + position_ids, + output_attentions=output_attentions, + ) + + hidden_states = layer_outputs[0] + + if output_attentions: + all_attentions = all_attentions + (layer_outputs[1],) + + if output_hidden_states: + encoder_states = encoder_states + (hidden_states,) + + last_hidden_state = hidden_states + last_hidden_state = self.final_layer_norm(last_hidden_state) + + # take the mean over axis 1 and get pooled output + pooled_output = self.sequence_summary(last_hidden_state) + + # apply the projection layer + embeds = self.projection(pooled_output) + + if not return_dict: + return tuple( + v for v in [embeds, last_hidden_state, pooled_output, encoder_states, all_attentions] if v is not None + ) + + return ClvpEncoderOutput( + embeds=embeds, + last_hidden_state=last_hidden_state, + pooler_output=pooled_output, + hidden_states=encoder_states, + attentions=all_attentions, + ) + + +class ClvpDecoder(ClvpPreTrainedModel): + """ + Transformer decoder consisting of *config.num_hidden_layers* layers. 
Each layer is a [`ClvpDecoderLayer`] + """ + + def __init__(self, config): + super().__init__(config) + + self.config = config + + self.input_embeds_layer = nn.Embedding(self.config.vocab_size, self.config.hidden_size) + self.position_embeds_layer = nn.Embedding(self.config.max_position_embeddings, self.config.hidden_size) + + self.drop = nn.Dropout(self.config.embd_pdrop) + self.layers = nn.ModuleList([ClvpDecoderLayer(self.config) for _ in range(self.config.num_hidden_layers)]) + self.layer_norm = nn.LayerNorm(self.config.hidden_size, eps=self.config.layer_norm_epsilon) + + self.gradient_checkpointing = False + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self): + return self.input_embeds_layer + + def set_input_embeddings(self, new_embeddings): + self.input_embeds_layer = new_embeddings + + def _prune_heads(self, heads_to_prune): + """ + Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} + """ + for layer, heads in heads_to_prune.items(): + self.layers[layer].attn.prune_heads(heads) + + @add_start_docstrings_to_model_forward(CLVP_DECODER_INPUTS_DOCSTRING) + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, BaseModelOutputWithPastAndCrossAttentions]: + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + use_cache = use_cache if use_cache is not None else self.config.use_cache + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if input_ids is not None and inputs_embeds is not None: + raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") + elif input_ids is not None: + self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask) + input_shape = input_ids.size() + input_ids = input_ids.view(-1, input_shape[-1]) + input_ids.shape[0] + elif inputs_embeds is not None: + input_shape = inputs_embeds.size()[:-1] + inputs_embeds.shape[0] + else: + raise ValueError("You have to specify either input_ids or inputs_embeds") + + device = input_ids.device if input_ids is not None else inputs_embeds.device + + if token_type_ids is not None: + token_type_ids = token_type_ids.view(-1, input_shape[-1]) + + if past_key_values is None: + past_key_values_length = 0 + past_key_values = tuple([None] * len(self.layers)) + else: + past_key_values_length = past_key_values[0][0].size(-2) + if position_ids is None: + position_ids = torch.arange( + past_key_values_length, input_shape[-1] + past_key_values_length, dtype=torch.long, device=device + ) + position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1]) + + if inputs_embeds is None: + inputs_embeds = self.input_embeds_layer(input_ids) + position_embeds = self.position_embeds_layer(position_ids) + inputs_embeds = inputs_embeds + position_embeds + + attention_mask = 
_prepare_4d_causal_attention_mask( + attention_mask, input_shape, inputs_embeds, past_key_values_length + ) + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x num_attention_heads x N x N + # head_mask has shape num_hidden_layers x batch x num_attention_heads x N x N + head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers) + + hidden_states = inputs_embeds + + if token_type_ids is not None: + token_type_embeds = self.input_embeds_layer(token_type_ids) + hidden_states = hidden_states + token_type_embeds + + hidden_states = self.drop(hidden_states) + + output_shape = (-1,) + input_shape[1:] + (hidden_states.size(-1),) + + if self.gradient_checkpointing and self.training: + if use_cache: + logger.warning_once( + "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..." + ) + use_cache = False + + presents = () if use_cache else None + all_self_attentions = () if output_attentions else None + all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None + all_hidden_states = () if output_hidden_states else None + for i, (block, past_key_value) in enumerate(zip(self.layers, past_key_values)): + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if self.gradient_checkpointing and self.training: + outputs = torch.utils.checkpoint.checkpoint( + block.__call__, + hidden_states, + None, + attention_mask, + position_ids, + head_mask[i], + ) + else: + outputs = block( + hidden_states, + past_key_value=past_key_value, + attention_mask=attention_mask, + position_ids=position_ids, + head_mask=head_mask[i], + use_cache=use_cache, + output_attentions=output_attentions, + ) + + hidden_states = outputs[0] + if use_cache is True: + presents = presents + (outputs[1],) + + if output_attentions: + all_self_attentions = all_self_attentions + (outputs[2 if use_cache else 1],) + if self.config.add_cross_attention: + all_cross_attentions = all_cross_attentions + (outputs[3 if use_cache else 2],) + + hidden_states = self.layer_norm(hidden_states) + + hidden_states = hidden_states.view(output_shape) + + # Add last hidden state + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if not return_dict: + return tuple( + v + for v in [hidden_states, presents, all_hidden_states, all_self_attentions, all_cross_attentions] + if v is not None + ) + + return BaseModelOutputWithPastAndCrossAttentions( + last_hidden_state=hidden_states, + past_key_values=presents, + hidden_states=all_hidden_states, + attentions=all_self_attentions, + cross_attentions=all_cross_attentions, + ) + + +@add_start_docstrings( + "The bare Clvp decoder model outputting raw hidden-states without any specific head on top.", + CLVP_START_DOCSTRING, +) +class ClvpModel(ClvpPreTrainedModel): + def __init__(self, config: ClvpDecoderConfig): + super().__init__(config) + self.config = config + self.decoder = ClvpDecoder(self.config) + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self): + return self.decoder.input_embeds_layer + + def set_input_embeddings(self, value): + self.decoder.input_embeds_layer = value + + def get_decoder(self): + return self.decoder + + @add_start_docstrings_to_model_forward(CLVP_DECODER_INPUTS_DOCSTRING) + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + token_type_ids: 
Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, BaseModelOutputWithPastAndCrossAttentions]: + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + use_cache = use_cache if use_cache is not None else self.config.use_cache + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + # decoder outputs consists of (dec_features, past_key_value, dec_hidden, dec_attn) + decoder_outputs = self.decoder( + input_ids=input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + past_key_values=past_key_values, + inputs_embeds=inputs_embeds, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + if not return_dict: + return decoder_outputs + + return BaseModelOutputWithPastAndCrossAttentions( + last_hidden_state=decoder_outputs.last_hidden_state, + past_key_values=decoder_outputs.past_key_values, + hidden_states=decoder_outputs.hidden_states, + attentions=decoder_outputs.attentions, + cross_attentions=decoder_outputs.cross_attentions, + ) + + +@add_start_docstrings( + "The CLVP decoder model with a language modelling head on top.", + CLVP_START_DOCSTRING, +) +class ClvpForCausalLM(ClvpPreTrainedModel): + def __init__(self, config): + super().__init__(config) + + self.config = config + self.model = ClvpModel(self.config) + + self.final_norm = nn.LayerNorm(self.config.hidden_size) + self.lm_head = nn.Linear(self.config.hidden_size, self.config.vocab_size, bias=True) + + # Initialize weights and apply final processing + self.post_init() + + def get_input_embeddings(self): + return self.model.decoder.input_embeds_layer + + def set_input_embeddings(self, new_embeddings): + self.model.decoder.input_embeds_layer = new_embeddings + + def _prepare_model_inputs( + self, + inputs: Optional[torch.Tensor] = None, + bos_token_id: Optional[int] = None, + model_kwargs: Optional[Dict[str, torch.Tensor]] = None, + ) -> Tuple[torch.Tensor, Optional[str], Dict[str, torch.Tensor]]: + """ + This function extracts the model-specific `inputs` for generation. + """ + input_name = self.main_input_name + + model_kwargs = {k: v for k, v in model_kwargs.items() if v is not None} + + inputs_kwarg = model_kwargs.pop(input_name, None) + if inputs_kwarg is not None and inputs is not None: + raise ValueError( + f"`inputs`: {inputs}` were passed alongside {input_name} which is not allowed." + f"Make sure to either pass {inputs} or {input_name}=..." 
+ ) + elif inputs_kwarg is not None: + inputs = inputs_kwarg + + if input_name == "input_ids" and "inputs_embeds" in model_kwargs: + model_kwargs["input_ids"] = self._maybe_initialize_input_ids_for_generation( + inputs, bos_token_id, model_kwargs=model_kwargs + ) + inputs, input_name = model_kwargs["inputs_embeds"], "inputs_embeds" + + # Check if conditioning_embeds are provided or not, if yes then concatenate the bos_token_id at the end of the conditioning_embeds. + # Then we must subtract the positional_ids because during the forward pass it will be added anyways, so we must cancel them out here. + conditioning_embeds = model_kwargs.get("conditioning_embeds", None) + + if conditioning_embeds is not None: + mel_start_token_embedding = self.model.decoder.input_embeds_layer( + torch.full( + (conditioning_embeds.shape[0], 1), + fill_value=self.config.bos_token_id, + device=conditioning_embeds.device, + ) + ) + mel_start_token_embedding += self.model.decoder.position_embeds_layer( + torch.full((conditioning_embeds.shape[0], 1), fill_value=0, device=conditioning_embeds.device) + ) + conditioning_embeds = torch.concat([conditioning_embeds, mel_start_token_embedding], dim=1) + + # subtract the positional_ids here + if hasattr(model_kwargs, "attention_mask"): + position_ids = model_kwargs["attention_mask"].long().cumsum(-1) - 1 + else: + position_ids = torch.range( + 0, conditioning_embeds.shape[1] - 1, dtype=torch.long, device=conditioning_embeds.device + ) + position_ids = position_ids.unsqueeze(0).repeat(conditioning_embeds.shape[0], 1) + + model_kwargs["inputs_embeds"] = conditioning_embeds - self.model.decoder.position_embeds_layer( + position_ids + ) + model_kwargs["input_ids"] = ( + torch.ones((model_kwargs["inputs_embeds"].shape[0], 1), dtype=torch.long, device=self.device) + * self.config.bos_token_id + ) + + return model_kwargs["inputs_embeds"], "inputs_embeds", model_kwargs + + inputs = self._maybe_initialize_input_ids_for_generation(inputs, bos_token_id, model_kwargs) + return inputs, input_name, model_kwargs + + def prepare_inputs_for_generation( + self, input_ids, past_key_values=None, inputs_embeds=None, conditioning_embeds=None, **kwargs + ): + input_ids_length = input_ids.shape[-1] + token_type_ids = kwargs.get("token_type_ids", None) + # only last token for inputs_ids if past is defined in kwargs + if past_key_values: + past_length = past_key_values[0][0].shape[2] + + # Some generation methods already pass only the last input ID + if input_ids.shape[1] > past_length: + remove_prefix_length = past_length + else: + # Default to old behavior: keep only final ID + remove_prefix_length = input_ids.shape[1] - 1 + + input_ids = input_ids[:, remove_prefix_length:] + if token_type_ids is not None: + token_type_ids = token_type_ids[:, -input_ids.shape[1] :] + + attention_mask = kwargs.get("attention_mask", None) + position_ids = kwargs.get("position_ids", None) + + if attention_mask is not None and position_ids is None: + # create position_ids on the fly for batch generation + position_ids = attention_mask.long().cumsum(-1) - 1 + position_ids.masked_fill_(attention_mask == 0, 1) + if past_key_values: + position_ids = position_ids[:, -1].unsqueeze(-1) + else: + position_ids = None + + if conditioning_embeds is not None and past_key_values is not None: + position_ids = torch.tensor([input_ids_length], dtype=torch.long, device=input_ids.device) + + # if `inputs_embeds` are passed, we only want to use them in the 1st generation step + if inputs_embeds is not None and past_key_values is None: 
+ model_inputs = {"inputs_embeds": inputs_embeds} + else: + model_inputs = {"input_ids": input_ids} + + model_inputs.update( + { + "past_key_values": past_key_values, + "use_cache": kwargs.get("use_cache"), + "position_ids": position_ids, + "token_type_ids": token_type_ids, + } + ) + return model_inputs + + @add_start_docstrings_to_model_forward(CLVP_DECODER_INPUTS_DOCSTRING) + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None, + attention_mask: Optional[torch.FloatTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, + position_ids: Optional[torch.LongTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + labels: Optional[torch.LongTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, CausalLMOutputWithCrossAttentions]: + r""" + labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): + Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set + `labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100` + are ignored (masked), the loss is only computed for labels in `[0, ..., config.vocab_size]` + """ + + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + use_cache = use_cache if use_cache is not None else self.config.use_cache + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.model( + input_ids=input_ids, + past_key_values=past_key_values, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + hidden_states = outputs[0] + + lm_logits = self.final_norm(hidden_states) + lm_logits = self.lm_head(lm_logits) + + loss = None + if labels is not None: + labels = labels.to(lm_logits.device) + # Shift so that tokens < n predict n + shift_logits = lm_logits[..., :-1, :].contiguous() + shift_labels = labels[..., 1:].contiguous() + # Flatten the tokens + loss_fct = CrossEntropyLoss() + loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)) + + if not return_dict: + output = (lm_logits,) + outputs[1:] + return ((loss,) + output) if loss is not None else output + + return CausalLMOutputWithCrossAttentions( + loss=loss, + logits=lm_logits, + past_key_values=outputs.past_key_values, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + cross_attentions=outputs.cross_attentions, + ) + + @staticmethod + # Copied from transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel._reorder_cache + def _reorder_cache( + past_key_values: Tuple[Tuple[torch.Tensor]], beam_idx: torch.Tensor + ) -> Tuple[Tuple[torch.Tensor]]: + """ + This function is used to re-order the `past_key_values` cache if [`~PreTrainedModel.beam_search`] or + [`~PreTrainedModel.beam_sample`] is called. This is required to match `past_key_values` with the correct + beam_idx at every generation step. 
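The loss in `ClvpForCausalLM.forward` above aligns predictions and targets by dropping the last logit and the first label, so position `t` is trained to predict token `t + 1`. A toy sketch of that shift with made-up numbers:

```python
import torch
from torch.nn import CrossEntropyLoss

vocab_size = 10
labels = torch.tensor([[2, 5, 7, 1]])              # (batch, seq_len)
lm_logits = torch.randn(1, 4, vocab_size)          # one logit vector per input position

# position t predicts token t + 1: drop the last logit and the first label
shift_logits = lm_logits[..., :-1, :].contiguous() # (1, 3, vocab_size)
shift_labels = labels[..., 1:].contiguous()        # (1, 3)

loss = CrossEntropyLoss()(shift_logits.view(-1, vocab_size), shift_labels.view(-1))
print(loss.item())
```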
+ """ + return tuple( + tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past) + for layer_past in past_key_values + ) + + +@add_start_docstrings( + "The composite CLVP model with a text encoder, speech encoder and speech decoder model." + "The speech decoder model generates the speech_ids from the text and the text encoder and speech encoder works" + "together to filter out the best speech_ids.", + CLVP_START_DOCSTRING, +) +class ClvpModelForConditionalGeneration(ClvpPreTrainedModel): + config_class = ClvpConfig + + def __init__(self, config: ClvpConfig): + super().__init__(config) + + if not isinstance(config.text_config, ClvpEncoderConfig): + raise ValueError( + "config.text_config is expected to be of type `ClvpEncoderConfig` but is of type" + f" {type(config.text_config)}." + ) + + if not isinstance(config.speech_config, ClvpEncoderConfig): + raise ValueError( + "config.speech_config is expected to be of type `ClvpEncoderConfig` but is of type" + f" {type(config.speech_config)}." + ) + + if not isinstance(config.decoder_config, ClvpDecoderConfig): + raise ValueError( + "config.decoder_config is expected to be of type `ClvpDecoderConfig` but is of type" + f" {type(config.decoder_config)}." + ) + + self.conditioning_encoder = ClvpConditioningEncoder(config) + + self.speech_decoder_model = ClvpForCausalLM(config.decoder_config) + + self.text_encoder_model = ClvpEncoder(config.text_config) + self.speech_encoder_model = ClvpEncoder(config.speech_config) + + self.logit_scale = nn.Parameter(torch.tensor(self.config.logit_scale_init_value)) + + # Initialize weights and apply final processing + self.post_init() + + # taken from the original repo, + # link : https://github.com/neonbjb/tortoise-tts/blob/4003544b6ff4b68c09856e04d3eff9da26d023c2/tortoise/api.py#L117 + def fix_speech_decoder_output(self, speech_ids: torch.LongTensor) -> torch.LongTensor: + """ + This method modifies the output of the decoder model, such as replacing the `eos_token_id` and changing the + last few tokens of each sequence. + + Args: + speech_ids (`torch.LongTensor`): + This refers to the output of the decoder model. + """ + decoder_fixing_codes = self.config.decoder_config.decoder_fixing_codes + speech_ids = speech_ids[:, 1:] + + stop_token_indices = torch.where(speech_ids == self.speech_decoder_model.config.eos_token_id, 1, 0) + speech_ids = torch.masked_fill(speech_ids, mask=stop_token_indices.bool(), value=decoder_fixing_codes[0]) + + for i, each_seq_stop_token_index in enumerate(stop_token_indices): + # This means that no stop tokens were found so the sentence was still being generated, in that case we don't need + # to apply any padding so just skip to the next sequence of tokens. + if each_seq_stop_token_index.sum() == 0: + continue + + stm = each_seq_stop_token_index.argmax() + speech_ids[i, stm:] = decoder_fixing_codes[0] + if stm - 3 < speech_ids.shape[1]: + speech_ids[i, -3:] = torch.tensor( + [decoder_fixing_codes[1:]], device=speech_ids.device, dtype=torch.long + ) + + return speech_ids + + def get_text_features( + self, + input_ids: Optional[torch.LongTensor] = None, + text_encoder_inputs_embeds: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.LongTensor] = None, + ) -> torch.FloatTensor: + r""" + This method can be used to extract text_embeds from a text. The text embeddings obtained by applying the + projection layer to the pooled output of the CLVP text encoder model. 
+ + Args: + input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`): + Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you + provide it. + + [What are input IDs?](../glossary#input-ids) + text_encoder_inputs_embeds (`torch.FloatTensor`, *optional*): + inputs_embeds for the text encoder model passed in place of `input_ids`. + attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + + Returns: + `torch.FloatTensor` of shape `(batch_size, output_dim)`: + The text embeddings obtained by applying the projection layer to the pooled output of the CLVP Text + Model. + + Examples: + + ```python + >>> from transformers import ClvpProcessor, ClvpModelForConditionalGeneration + + >>> # Define the Text + >>> text = "This is an example text." + + >>> # Define processor and model + >>> processor = ClvpProcessor.from_pretrained("susnato/clvp_dev") + >>> model = ClvpModelForConditionalGeneration.from_pretrained("susnato/clvp_dev") + + >>> # Generate processor output and text embeds + >>> processor_output = processor(text=text, return_tensors="pt") + >>> text_embeds = model.get_text_features(input_ids=processor_output["input_ids"]) + ``` + """ + + outputs = self.text_encoder_model( + input_ids=input_ids, + inputs_embeds=text_encoder_inputs_embeds, + attention_mask=attention_mask, + ) + + return outputs[0] + + def get_speech_features( + self, + speech_ids: Optional[torch.LongTensor] = None, + input_ids: Optional[torch.LongTensor] = None, + input_features: Optional[torch.FloatTensor] = None, + conditioning_encoder_inputs_embeds: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.Tensor] = None, + generation_config: Optional[GenerationConfig] = None, + **kwargs, + ) -> torch.FloatTensor: + r""" + This method can be used to extract speech_embeds. The speech embeddings are obtained by applying the speech + model on speech_ids. If speech_ids is not present but both input_ids and input_features are given then the + decoder model will be used to first generate the speech_ids and then applying the speech model. + + Args: + speech_ids (`torch.LongTensor` of shape `(batch_size, num_speech_ids)`, *optional*): + Speech Tokens. Padding will be ignored by default should you provide it. If speech_ids are provided + then input_ids and input_features will be automatically ignored. + input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): + Input text Tokens. Processed from the [`ClvpTokenizer`]. If speech_ids is not provided, then input_ids + and input_features will be used. + input_features (`torch.FloatTensor` of shape `(batch_size, feature_size, time_dim)`, *optional*): + Indicates log-melspectrogram representations for audio returned by [`ClvpFeatureExtractor`]. If + speech_ids is not provided, then input_ids and input_features will be used. + conditioning_encoder_inputs_embeds (`torch.FloatTensor`, *optional*): + inputs_embeds for `ClvpConditioningEncoder`. Can be used in place of `input_ids`. + attention_mask (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding speech token indices. 
Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + generation_config (`GenerationConfig`, *optional*): + generation config to control the generation of speech_ids if they are not provided. + + Returns: + `torch.FloatTensor` of shape `(batch_size, output_dim)`: + The speech embeddings obtained by applying the projection layer to the pooled output of the CLVP Speech + Model. + + Examples: + + ```python + >>> import datasets + >>> from transformers import ClvpProcessor, ClvpModelForConditionalGeneration + + >>> # Define the Text and Load the Audio (We are taking an audio example from HuggingFace Hub using `datasets` library) + >>> text = "This is an example text." + >>> ds = datasets.load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") + >>> ds = ds.cast_column("audio", datasets.Audio(sampling_rate=22050)) + >>> _, audio, sr = ds.sort("id").select(range(1))[:1]["audio"][0].values() + + >>> # Define processor and model + >>> processor = ClvpProcessor.from_pretrained("susnato/clvp_dev") + >>> model = ClvpModelForConditionalGeneration.from_pretrained("susnato/clvp_dev") + + >>> # Generate processor output and model output + >>> processor_output = processor(raw_speech=audio, sampling_rate=sr, text=text, return_tensors="pt") + >>> speech_embeds = model.get_speech_features( + ... input_ids=processor_output["input_ids"], input_features=processor_output["input_features"] + ... ) + ``` + """ + + if speech_ids is None: + if (input_ids is None and conditioning_encoder_inputs_embeds is None) or input_features is None: + raise ValueError( + "Either speech_ids or input_ids/conditioning_encoder_inputs_embeds and input_features must be provided." + ) + + if generation_config is None: + generation_config = self.generation_config + generation_config.update(**kwargs) + + conditioning_embeds = self.conditioning_encoder( + input_features=input_features, + input_ids=input_ids, + inputs_embeds=conditioning_encoder_inputs_embeds, + attention_mask=attention_mask, + ) + + speech_ids = self.speech_decoder_model.generate( + conditioning_embeds=conditioning_embeds, + generation_config=generation_config, + ) + + speech_ids = self.fix_speech_decoder_output(speech_ids[0]) + + outputs = self.speech_encoder_model( + input_ids=speech_ids, + attention_mask=attention_mask, + ) + + return outputs[0] + + @add_start_docstrings_to_model_forward(CLVP_INPUTS_DOCSTRING) + @replace_return_docstrings(output_type=ClvpOutput, config_class=ClvpConfig) + def forward( + self, + input_ids: torch.LongTensor = None, + input_features: torch.FloatTensor = None, + conditioning_encoder_inputs_embeds: Optional[torch.FloatTensor] = None, + text_encoder_inputs_embeds: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.LongTensor] = None, + return_loss: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + output_attentions: Optional[bool] = False, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, ClvpOutput]: + r""" + Returns: + + Examples: + + ```python + >>> import datasets + >>> from transformers import ClvpProcessor, ClvpModelForConditionalGeneration + + >>> # Define the Text and Load the Audio (We are taking an audio example from HuggingFace Hub using `datasets` library) + >>> text = "This is an example text." 
+ + >>> ds = datasets.load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") + >>> ds = ds.cast_column("audio", datasets.Audio(sampling_rate=22050)) + >>> _, audio, sr = ds.sort("id").select(range(1))[:1]["audio"][0].values() + + >>> # Define processor and model + >>> processor = ClvpProcessor.from_pretrained("susnato/clvp_dev") + >>> model = ClvpModelForConditionalGeneration.from_pretrained("susnato/clvp_dev") + + >>> # processor outputs and model outputs + >>> processor_output = processor(raw_speech=audio, sampling_rate=sr, text=text, return_tensors="pt") + >>> outputs = model( + ... input_ids=processor_output["input_ids"], + ... input_features=processor_output["input_features"], + ... return_dict=True, + ... ) + ``` + """ + + # Use CLVP model's config for some fields (if specified) instead of those of speech & text components. + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + conditioning_embeds = self.conditioning_encoder( + input_features=input_features, + input_ids=input_ids, + inputs_embeds=conditioning_encoder_inputs_embeds, + attention_mask=attention_mask, + ) + + decoder_outputs = self.speech_decoder_model( + inputs_embeds=conditioning_embeds, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + speech_ids = decoder_outputs[0] + + # since we will get the embeds of shape `(batch_size, seq_len, embedding_dim)` during the forward pass + # we must convert it to tokens, to make it compaitable with speech_transformer + if speech_ids.ndim == 3: + speech_ids = speech_ids.argmax(2) + speech_ids = self.fix_speech_decoder_output(speech_ids) + + speech_outputs = self.speech_encoder_model( + input_ids=speech_ids, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + text_outputs = self.text_encoder_model( + input_ids=input_ids, + inputs_embeds=text_encoder_inputs_embeds, + attention_mask=attention_mask, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + speech_embeds = speech_outputs[0] + text_embeds = text_outputs[0] + + # normalized features + speech_embeds = speech_embeds / speech_embeds.norm(p=2, dim=-1, keepdim=True) + text_embeds = text_embeds / text_embeds.norm(p=2, dim=-1, keepdim=True) + + # cosine similarity as logits + logit_scale = self.logit_scale.exp() + logits_per_text = torch.matmul(text_embeds, speech_embeds.t()) * logit_scale + logits_per_speech = logits_per_text.t() + + loss = None + if return_loss: + loss = clvp_loss(logits_per_text) + + if not return_dict: + output = ( + logits_per_speech, + logits_per_text, + text_embeds, + speech_embeds, + text_outputs[2], + speech_outputs[2], + ) + if output_hidden_states: + output += ( + decoder_outputs[-1], + text_outputs[-1], + speech_outputs[-1], + ) + + return ((loss,) + output) if loss is not None else output + + return ClvpOutput( + loss=loss, + logits_per_speech=logits_per_speech, + logits_per_text=logits_per_text, + text_embeds=text_embeds, + speech_embeds=speech_embeds, + text_model_output=text_outputs[2], + speech_model_output=speech_outputs[2], + decoder_hidden_states=decoder_outputs.hidden_states, + text_encoder_hidden_states=text_outputs.hidden_states, + speech_encoder_hidden_states=speech_outputs.hidden_states, + ) + + @torch.no_grad() + def generate( + self, + input_ids: torch.LongTensor = None, + input_features: torch.FloatTensor = None, + 
attention_mask: Optional[torch.LongTensor] = None, + generation_config: Optional[GenerationConfig] = None, + pad_to_max_mel_tokens: Optional[int] = None, + output_hidden_states: Optional[bool] = None, + **kwargs, + ): + """ + Generate method for `ClvpModelForConditionalGeneration`, this method calls the `generate` method of + `ClvpForCausalLM` and then uses those generated `speech_ids` to process `text_embeds` and `speech_embeds` using + `ClvpEncoder`. + + Args: + input_ids (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*): + Input text Tokens. Processed from the [`ClvpTokenizer`]. + input_features (`torch.FloatTensor` of shape `(batch_size, feature_size, time_dim)`, *optional*): + Indicates log-melspectrogram representations for audio returned by [`ClvpFeatureExtractor`]. + attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): + Mask to avoid performing attention on padding text token indices. Mask values selected in `[0, 1]`: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + [What are attention masks?](../glossary#attention-mask) + generation_config (`~generation.GenerationConfig`, *optional*): + The generation configuration to be used as base parametrization for the generation call. `**kwargs` + passed to generate matching the attributes of `generation_config` will override them. If + `generation_config` is not provided, the default will be used, which had the following loading + priority: 1) from the `generation_config.json` model file, if it exists; 2) from the model + configuration. Please note that unspecified parameters will inherit [`~generation.GenerationConfig`]'s + default values, whose documentation should be checked to parameterize generation. + pad_to_max_mel_tokens (`int`, *optional*): + Pads generated speech_ids to the specified value. This is to implement the same logic from the official + repo, link: https://github.com/neonbjb/tortoise-tts/blob/80f89987a5abda5e2b082618cd74f9c7411141dc/tortoise/api.py#L430 + and to make sure the logits are same. + This does not affect generation quality so please don't consider using it since it is less efficient. + output_hidden_states (`bool`, *optional*): + Whether or not to return the hidden states of decoder model, text encoder and speech encoder models. + + Returns: + `ClvpOutput` or tuple: A `ClvpOutput` (if `return_dict_in_generate=True` or when + `config.return_dict_in_generate=True`) or a tuple. + """ + + # If the input sequences are larger than (self.config.decoder_config.max_text_tokens - 3) then raise error, + # because we need to add 3 tokens ( 1 bos tokens and 2 eos tokens) to the input_ids in ClvpConditioningEncoder to + # properly sample + sequence_length = input_ids.shape[-1] + if sequence_length > (self.config.decoder_config.max_text_tokens - 3): + raise ValueError( + f"Maximum sequence length reached! Found input_ids of length {sequence_length}." 
+ f"Please make sure that the maximum length of input_ids is {self.config.decoder_config.max_text_tokens - 3}" + ) + + if generation_config is None: + generation_config = self.generation_config + + generation_config = copy.deepcopy(generation_config) + model_kwargs = generation_config.update(**kwargs) # All unused kwargs must be model kwargs + generation_config.validate() + self._validate_model_kwargs(model_kwargs.copy()) + + # pad input_ids as specified in the original repo + # link: https://github.com/neonbjb/tortoise-tts/blob/80f89987a5abda5e2b082618cd74f9c7411141dc/tortoise/api.py#L380 + input_ids, attention_mask = _pad_extra_bos_eos_tokens( + input_ids, + attention_mask, + add_bos_token=False, + bos_token_id=self.config.text_config.bos_token_id, + eos_token_id=self.config.text_config.eos_token_id, + ) + + conditioning_embeds = self.conditioning_encoder( + input_features=input_features, + input_ids=input_ids, + attention_mask=attention_mask, + ) + + decoder_outputs = self.speech_decoder_model.generate( + conditioning_embeds=conditioning_embeds, + generation_config=generation_config, + output_hidden_states=output_hidden_states, + return_dict=generation_config.return_dict_in_generate, + ) + if isinstance(decoder_outputs, ModelOutput): + speech_ids = decoder_outputs.sequences + + # pad to pad_to_max_mel_tokens if given, to replicate the original repo logic + # link: https://github.com/neonbjb/tortoise-tts/blob/80f89987a5abda5e2b082618cd74f9c7411141dc/tortoise/api.py#L430 + if pad_to_max_mel_tokens is not None: + padding_needed = pad_to_max_mel_tokens - speech_ids.shape[-1] + speech_ids = torch.nn.functional.pad( + speech_ids, (0, padding_needed), value=self.generation_config.eos_token_id + ) + + speech_ids = self.fix_speech_decoder_output(speech_ids) + + speech_outputs = self.speech_encoder_model( + input_ids=speech_ids, + output_hidden_states=output_hidden_states, + return_dict=generation_config.return_dict_in_generate, + ) + text_outputs = self.text_encoder_model( + input_ids=input_ids, + attention_mask=attention_mask, + output_hidden_states=output_hidden_states, + return_dict=generation_config.return_dict_in_generate, + ) + + speech_embeds = speech_outputs[0] + text_embeds = text_outputs[0] + + # normalized features + speech_embeds = speech_embeds / speech_embeds.norm(p=2, dim=-1, keepdim=True) + text_embeds = text_embeds / text_embeds.norm(p=2, dim=-1, keepdim=True) + + # cosine similarity as logits + logit_scale = self.logit_scale.exp() + logits_per_text = torch.matmul(text_embeds, speech_embeds.t()) * logit_scale + logits_per_speech = logits_per_text.t() + + if not generation_config.return_dict_in_generate: + output = ( + speech_ids, + logits_per_speech, + logits_per_text, + text_embeds, + speech_embeds, + text_outputs[2], + speech_outputs[2], + ) + if output_hidden_states: + output += ( + decoder_outputs[-1], + text_outputs[-1], + speech_outputs[-1], + ) + + return output + + return ClvpOutput( + speech_ids=speech_ids, + logits_per_speech=logits_per_speech, + logits_per_text=logits_per_text, + text_embeds=text_embeds, + speech_embeds=speech_embeds, + text_model_output=text_outputs[2], + speech_model_output=speech_outputs[2], + decoder_hidden_states=decoder_outputs.hidden_states, + text_encoder_hidden_states=text_outputs.hidden_states, + speech_encoder_hidden_states=speech_outputs.hidden_states, + ) diff --git a/src/transformers/models/clvp/number_normalizer.py b/src/transformers/models/clvp/number_normalizer.py new file mode 100644 index 00000000000000..86aa087e8139b0 --- /dev/null 
+++ b/src/transformers/models/clvp/number_normalizer.py @@ -0,0 +1,238 @@ +# coding=utf-8 +# Copyright 2023 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""English Normalizer class for CLVP.""" + + +import re + + +class EnglishNormalizer: + def __init__(self): + # List of (regular expression, replacement) pairs for abbreviations: + self._abbreviations = [ + (re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1]) + for x in [ + ("mrs", "misess"), + ("mr", "mister"), + ("dr", "doctor"), + ("st", "saint"), + ("co", "company"), + ("jr", "junior"), + ("maj", "major"), + ("gen", "general"), + ("drs", "doctors"), + ("rev", "reverend"), + ("lt", "lieutenant"), + ("hon", "honorable"), + ("sgt", "sergeant"), + ("capt", "captain"), + ("esq", "esquire"), + ("ltd", "limited"), + ("col", "colonel"), + ("ft", "fort"), + ] + ] + + self.ones = ["", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"] + self.teens = [ + "ten", + "eleven", + "twelve", + "thirteen", + "fourteen", + "fifteen", + "sixteen", + "seventeen", + "eighteen", + "nineteen", + ] + self.tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"] + + def number_to_words(self, num: int) -> str: + """ + Converts numbers(`int`) to words(`str`). + + Please note that it only supports upto - "'nine hundred ninety-nine quadrillion, nine hundred ninety-nine + trillion, nine hundred ninety-nine billion, nine hundred ninety-nine million, nine hundred ninety-nine + thousand, nine hundred ninety-nine'" or `number_to_words(999_999_999_999_999_999)`. 
+ """ + if num == 0: + return "zero" + elif num < 0: + return "minus " + self.number_to_words(abs(num)) + elif num < 10: + return self.ones[num] + elif num < 20: + return self.teens[num - 10] + elif num < 100: + return self.tens[num // 10] + ("-" + self.number_to_words(num % 10) if num % 10 != 0 else "") + elif num < 1000: + return ( + self.ones[num // 100] + " hundred" + (" " + self.number_to_words(num % 100) if num % 100 != 0 else "") + ) + elif num < 1_000_000: + return ( + self.number_to_words(num // 1000) + + " thousand" + + (", " + self.number_to_words(num % 1000) if num % 1000 != 0 else "") + ) + elif num < 1_000_000_000: + return ( + self.number_to_words(num // 1_000_000) + + " million" + + (", " + self.number_to_words(num % 1_000_000) if num % 1_000_000 != 0 else "") + ) + elif num < 1_000_000_000_000: + return ( + self.number_to_words(num // 1_000_000_000) + + " billion" + + (", " + self.number_to_words(num % 1_000_000_000) if num % 1_000_000_000 != 0 else "") + ) + elif num < 1_000_000_000_000_000: + return ( + self.number_to_words(num // 1_000_000_000_000) + + " trillion" + + (", " + self.number_to_words(num % 1_000_000_000_000) if num % 1_000_000_000_000 != 0 else "") + ) + elif num < 1_000_000_000_000_000_000: + return ( + self.number_to_words(num // 1_000_000_000_000_000) + + " quadrillion" + + ( + ", " + self.number_to_words(num % 1_000_000_000_000_000) + if num % 1_000_000_000_000_000 != 0 + else "" + ) + ) + else: + return "number out of range" + + def convert_to_ascii(self, text: str) -> str: + """ + Converts unicode to ascii + """ + return text.encode("ascii", "ignore").decode("utf-8") + + def _expand_dollars(self, m: str) -> str: + """ + This method is used to expand numerical dollar values into spoken words. + """ + match = m.group(1) + parts = match.split(".") + if len(parts) > 2: + return match + " dollars" # Unexpected format + + dollars = int(parts[0]) if parts[0] else 0 + cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0 + if dollars and cents: + dollar_unit = "dollar" if dollars == 1 else "dollars" + cent_unit = "cent" if cents == 1 else "cents" + return "%s %s, %s %s" % (dollars, dollar_unit, cents, cent_unit) + elif dollars: + dollar_unit = "dollar" if dollars == 1 else "dollars" + return "%s %s" % (dollars, dollar_unit) + elif cents: + cent_unit = "cent" if cents == 1 else "cents" + return "%s %s" % (cents, cent_unit) + else: + return "zero dollars" + + def _remove_commas(self, m: str) -> str: + """ + This method is used to remove commas from sentences. + """ + return m.group(1).replace(",", "") + + def _expand_decimal_point(self, m: str) -> str: + """ + This method is used to expand '.' into spoken word ' point '. + """ + return m.group(1).replace(".", " point ") + + def _expand_ordinal(self, num: str) -> str: + """ + This method is used to expand ordinals such as '1st', '2nd' into spoken words. 
+ """ + ordinal_suffixes = {1: "st", 2: "nd", 3: "rd"} + + num = int(num.group(0)[:-2]) + if 10 <= num % 100 and num % 100 <= 20: + suffix = "th" + else: + suffix = ordinal_suffixes.get(num % 10, "th") + return self.number_to_words(num) + suffix + + def _expand_number(self, m: str) -> str: + """ + This method acts as a preprocessing step for numbers between 1000 and 3000 (same as the original repository, + link : + https://github.com/neonbjb/tortoise-tts/blob/4003544b6ff4b68c09856e04d3eff9da26d023c2/tortoise/utils/tokenizer.py#L86) + """ + num = int(m.group(0)) + + if num > 1000 and num < 3000: + if num == 2000: + return "two thousand" + elif num > 2000 and num < 2010: + return "two thousand " + self.number_to_words(num % 100) + elif num % 100 == 0: + return self.number_to_words(num // 100) + " hundred" + else: + return self.number_to_words(num) + else: + return self.number_to_words(num) + + def normalize_numbers(self, text: str) -> str: + """ + This method is used to normalize numbers within a text such as converting the numbers to words, removing + commas, etc. + """ + text = re.sub(re.compile(r"([0-9][0-9\,]+[0-9])"), self._remove_commas, text) + text = re.sub(re.compile(r"£([0-9\,]*[0-9]+)"), r"\1 pounds", text) + text = re.sub(re.compile(r"\$([0-9\.\,]*[0-9]+)"), self._expand_dollars, text) + text = re.sub(re.compile(r"([0-9]+\.[0-9]+)"), self._expand_decimal_point, text) + text = re.sub(re.compile(r"[0-9]+(st|nd|rd|th)"), self._expand_ordinal, text) + text = re.sub(re.compile(r"[0-9]+"), self._expand_number, text) + return text + + def expand_abbreviations(self, text: str) -> str: + """ + Expands the abbreviate words. + """ + for regex, replacement in self._abbreviations: + text = re.sub(regex, replacement, text) + return text + + def collapse_whitespace(self, text: str) -> str: + """ + Removes multiple whitespaces + """ + return re.sub(re.compile(r"\s+"), " ", text) + + def __call__(self, text): + """ + Converts text to ascii, numbers / number-like quantities to their spelt-out counterparts and expands + abbreviations + """ + + text = self.convert_to_ascii(text) + text = text.lower() + text = self.normalize_numbers(text) + text = self.expand_abbreviations(text) + text = self.collapse_whitespace(text) + text = text.replace('"', "") + + return text diff --git a/src/transformers/models/clvp/processing_clvp.py b/src/transformers/models/clvp/processing_clvp.py new file mode 100644 index 00000000000000..0723986db9757d --- /dev/null +++ b/src/transformers/models/clvp/processing_clvp.py @@ -0,0 +1,91 @@ +# coding=utf-8 +# Copyright 2023 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Processor class for CLVP +""" + + +from ...processing_utils import ProcessorMixin + + +class ClvpProcessor(ProcessorMixin): + r""" + Constructs a CLVP processor which wraps a CLVP Feature Extractor and a CLVP Tokenizer into a single processor. + + [`ClvpProcessor`] offers all the functionalities of [`ClvpFeatureExtractor`] and [`ClvpTokenizer`]. 
See the + [`~ClvpProcessor.__call__`], [`~ClvpProcessor.decode`] and [`~ClvpProcessor.batch_decode`] for more information. + + Args: + feature_extractor (`ClvpFeatureExtractor`): + An instance of [`ClvpFeatureExtractor`]. The feature extractor is a required input. + tokenizer (`ClvpTokenizer`): + An instance of [`ClvpTokenizer`]. The tokenizer is a required input. + """ + + feature_extractor_class = "ClvpFeatureExtractor" + tokenizer_class = "ClvpTokenizer" + model_input_names = [ + "input_ids", + "input_features", + "attention_mask", + ] + + def __init__(self, feature_extractor, tokenizer): + super().__init__(feature_extractor, tokenizer) + + def __call__(self, *args, **kwargs): + """ + Forwards the `audio` and `sampling_rate` arguments to [`~ClvpFeatureExtractor.__call__`] and the `text` + argument to [`~ClvpTokenizer.__call__`]. Please refer to the doctsring of the above two methods for more + information. + """ + + raw_speech = kwargs.pop("raw_speech", None) + sampling_rate = kwargs.pop("sampling_rate", None) + text = kwargs.pop("text", None) + + if raw_speech is None and text is None: + raise ValueError("You need to specify either an `raw_speech` or `text` input to process.") + + if raw_speech is not None: + inputs = self.feature_extractor(raw_speech, sampling_rate=sampling_rate, **kwargs) + if text is not None: + encodings = self.tokenizer(text, **kwargs) + + if text is None: + return inputs + elif raw_speech is None: + return encodings + else: + inputs["input_ids"] = encodings["input_ids"] + inputs["attention_mask"] = encodings["attention_mask"] + return inputs + + # Copied from transformers.models.whisper.processing_whisper.WhisperProcessor.batch_decode with Whisper->Clvp + def batch_decode(self, *args, **kwargs): + """ + This method forwards all its arguments to ClvpTokenizer's [`~PreTrainedTokenizer.batch_decode`]. Please + refer to the docstring of this method for more information. + """ + return self.tokenizer.batch_decode(*args, **kwargs) + + # Copied from transformers.models.whisper.processing_whisper.WhisperProcessor.decode with Whisper->Clvp + def decode(self, *args, **kwargs): + """ + This method forwards all its arguments to ClvpTokenizer's [`~PreTrainedTokenizer.decode`]. Please refer to + the docstring of this method for more information. + """ + return self.tokenizer.decode(*args, **kwargs) diff --git a/src/transformers/models/clvp/tokenization_clvp.py b/src/transformers/models/clvp/tokenization_clvp.py new file mode 100644 index 00000000000000..f09245f94be8c5 --- /dev/null +++ b/src/transformers/models/clvp/tokenization_clvp.py @@ -0,0 +1,379 @@ +# coding=utf-8 +# Copyright 2023 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Tokenization class for CLVP.""" + +import json +import os +from functools import lru_cache +from typing import List, Optional, Tuple + +import regex as re + +from ...tokenization_utils import AddedToken, PreTrainedTokenizer +from ...utils import logging +from .number_normalizer import EnglishNormalizer + + +logger = logging.get_logger(__name__) + +VOCAB_FILES_NAMES = { + "vocab_file": "vocab.json", + "merges_file": "merges.txt", +} + +PRETRAINED_VOCAB_FILES_MAP = { + "vocab_file": { + "clvp_dev": "https://huggingface.co/susnato/clvp_dev/blob/main/vocab.json", + }, + "merges_file": { + "clvp_dev": "https://huggingface.co/susnato/clvp_dev/blob/main/merges.txt", + }, +} + +PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { + "clvp_dev": 1024, +} + + +@lru_cache() +# Copied from transformers.models.gpt2.tokenization_gpt2.bytes_to_unicode +def bytes_to_unicode(): + """ + Returns list of utf-8 byte and a mapping to unicode strings. We specifically avoids mapping to whitespace/control + characters the bpe code barfs on. + + The reversible bpe codes work on unicode strings. This means you need a large # of unicode characters in your vocab + if you want to avoid UNKs. When you're at something like a 10B token dataset you end up needing around 5K for + decent coverage. This is a significant percentage of your normal, say, 32K bpe vocab. To avoid that, we want lookup + tables between utf-8 bytes and unicode strings. + """ + bs = ( + list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1)) + ) + cs = bs[:] + n = 0 + for b in range(2**8): + if b not in bs: + bs.append(b) + cs.append(2**8 + n) + n += 1 + cs = [chr(n) for n in cs] + return dict(zip(bs, cs)) + + +# Copied from transformers.models.gpt2.tokenization_gpt2.get_pairs +def get_pairs(word): + """ + Return set of symbol pairs in a word. + + Word is represented as tuple of symbols (symbols being variable-length strings). + """ + pairs = set() + prev_char = word[0] + for char in word[1:]: + pairs.add((prev_char, char)) + prev_char = char + return pairs + + +class ClvpTokenizer(PreTrainedTokenizer): + """ + Construct a CLVP tokenizer. Based on byte-level Byte-Pair-Encoding. + + This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will + be encoded differently whether it is at the beginning of the sentence (without space) or not: + + ```python + >>> from transformers import ClvpTokenizer + + >>> tokenizer = ClvpTokenizer.from_pretrained("susnato/clvp_dev") + >>> tokenizer("Hello world")["input_ids"] + [62, 84, 28, 2, 179, 79] + + >>> tokenizer(" Hello world")["input_ids"] + [2, 62, 84, 28, 2, 179, 79] + ``` + + You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you + call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance. + + + + When used with `is_split_into_words=True`, this tokenizer will add a space before each word (even the first one). + + + + This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to + this superclass for more information regarding those methods. + + Args: + vocab_file (`str`): + Path to the vocabulary file. + merges_file (`str`): + Path to the merges file. + errors (`str`, *optional*, defaults to `"replace"`): + Paradigm to follow when decoding bytes to UTF-8. 
See + [bytes.decode](https://docs.python.org/3/library/stdtypes.html#bytes.decode) for more information. + unk_token (`str`, *optional*, defaults to `"[UNK]"`): + The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this + token instead. + bos_token (`str`, *optional*, defaults to `"<|endoftext|>"`): + The beginning of sequence token. + eos_token (`str`, *optional*, defaults to `"[STOP]"`): + The end of sequence token. + pad_token (`str`, *optional*, defaults to `"[STOP]"`): + The pad token of the sequence. + add_prefix_space (`bool`, *optional*, defaults to `False`): + Whether or not to add an initial space to the input. This allows to treat the leading word just as any + other word. (CLVP tokenizer detect beginning of words by the preceding space). + add_bos_token (`bool`, *optional*, defaults to `False`): + Whether to add `bos_token` in front of the sequence when add_special_tokens=True. + add_eos_token (`bool`, *optional*, defaults to `False`): + Whether to add `eos_token` in end of the sequence when add_special_tokens=True. + """ + + vocab_files_names = VOCAB_FILES_NAMES + pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP + max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES + model_input_names = [ + "input_ids", + "attention_mask", + ] + + def __init__( + self, + vocab_file, + merges_file, + errors="replace", + unk_token="[UNK]", + bos_token="<|endoftext|>", + eos_token="[STOP]", + pad_token="[STOP]", + add_prefix_space=False, + add_bos_token=False, + add_eos_token=False, + **kwargs, + ): + bos_token = AddedToken(bos_token, special=True) if isinstance(bos_token, str) else bos_token + eos_token = AddedToken(eos_token, special=True) if isinstance(eos_token, str) else eos_token + unk_token = AddedToken(unk_token, special=True) if isinstance(unk_token, str) else unk_token + pad_token = AddedToken(pad_token, special=True) if isinstance(pad_token, str) else pad_token + + self.add_bos_token = add_bos_token + self.add_eos_token = add_eos_token + self._normalizer = None + + with open(vocab_file, encoding="utf-8") as vocab_handle: + self.encoder = json.load(vocab_handle) + self.decoder = {v: k for k, v in self.encoder.items()} + self.errors = errors # how to handle errors in decoding + self.byte_encoder = bytes_to_unicode() + self.byte_decoder = {v: k for k, v in self.byte_encoder.items()} + with open(merges_file, encoding="utf-8") as merges_handle: + bpe_merges = merges_handle.read().split("\n")[1:-1] + bpe_merges = [tuple(merge.split()) for merge in bpe_merges] + self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges)))) + self.cache = {} + self.add_prefix_space = add_prefix_space + + # Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions + self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""") + + super().__init__( + errors=errors, + unk_token=unk_token, + bos_token=bos_token, + eos_token=eos_token, + pad_token=pad_token, + add_prefix_space=add_prefix_space, + add_bos_token=add_bos_token, + add_eos_token=add_eos_token, + **kwargs, + ) + + @property + def vocab_size(self): + return len(self.encoder) + + @property + def normalizer(self): + if self._normalizer is None: + self._normalizer = EnglishNormalizer() + return self._normalizer + + def get_vocab(self): + return dict(self.encoder, **self.added_tokens_encoder) + + # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.bpe + def bpe(self, token): + if token in 
self.cache: + return self.cache[token] + word = tuple(token) + pairs = get_pairs(word) + + if not pairs: + return token + + while True: + bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf"))) + if bigram not in self.bpe_ranks: + break + first, second = bigram + new_word = [] + i = 0 + while i < len(word): + try: + j = word.index(first, i) + except ValueError: + new_word.extend(word[i:]) + break + else: + new_word.extend(word[i:j]) + i = j + + if word[i] == first and i < len(word) - 1 and word[i + 1] == second: + new_word.append(first + second) + i += 2 + else: + new_word.append(word[i]) + i += 1 + new_word = tuple(new_word) + word = new_word + if len(word) == 1: + break + else: + pairs = get_pairs(word) + word = " ".join(word) + self.cache[token] = word + return word + + # Copied from transformers.models.llama.tokenization_llama.LlamaTokenizer.build_inputs_with_special_tokens + def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None): + bos_token_id = [self.bos_token_id] if self.add_bos_token else [] + eos_token_id = [self.eos_token_id] if self.add_eos_token else [] + + output = bos_token_id + token_ids_0 + eos_token_id + + if token_ids_1 is not None: + output = output + bos_token_id + token_ids_1 + eos_token_id + + return output + + # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.get_special_tokens_mask + def get_special_tokens_mask( + self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False + ) -> List[int]: + """ + Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding + special tokens using the tokenizer `prepare_for_model` or `encode_plus` methods. + + Args: + token_ids_0 (`List[int]`): + List of IDs. + token_ids_1 (`List[int]`, *optional*): + Optional second list of IDs for sequence pairs. + already_has_special_tokens (`bool`, *optional*, defaults to `False`): + Whether or not the token list is already formatted with special tokens for the model. + + Returns: + `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token. + """ + if already_has_special_tokens: + return super().get_special_tokens_mask( + token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True + ) + + if not self.add_bos_token: + return super().get_special_tokens_mask( + token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=False + ) + + if token_ids_1 is None: + return [1] + ([0] * len(token_ids_0)) + return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + + def _tokenize(self, text): + """Tokenize a string.""" + bpe_tokens = [] + text = self.normalizer(text) + for token in re.findall(self.pat, text): + token = "".join( + self.byte_encoder[b] for b in token.encode("utf-8") + ) # Maps all our bytes to unicode strings, avoiding control tokens of the BPE (spaces in our case) + + # if the token is "Ġ" we replace it with "[SPACE]" (if "[SPACE]" is present in the vocab), otherwise we keep the "Ġ". 
+ bpe_tokens.extend( + "[SPACE]" if bpe_token == "\u0120" and "[SPACE]" in self.encoder.keys() else bpe_token + for bpe_token in self.bpe(token).split(" ") + ) + + return bpe_tokens + + # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer._convert_token_to_id + def _convert_token_to_id(self, token): + """Converts a token (str) in an id using the vocab.""" + return self.encoder.get(token, self.encoder.get(self.unk_token)) + + # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer._convert_id_to_token + def _convert_id_to_token(self, index): + """Converts an index (integer) in a token (str) using the vocab.""" + return self.decoder.get(index) + + # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.convert_tokens_to_string + def convert_tokens_to_string(self, tokens): + """Converts a sequence of tokens (string) in a single string.""" + text = "".join(tokens) + text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8", errors=self.errors) + return text + + def clean_up_tokenization(self, text): + text = "".join(text) + vocab_tokens = list(self.encoder.keys()) + list(self.added_tokens_encoder.keys()) + + text = text.replace("[SPACE]", " ") if "[SPACE]" in vocab_tokens else text + text = text.replace("[STOP]", " ") if "[STOP]" in vocab_tokens else text + + text = text.replace(self.unk_token, "").replace(" ", " ").replace(" ", " ") + return text + + # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.save_vocabulary + def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]: + if not os.path.isdir(save_directory): + logger.error(f"Vocabulary path ({save_directory}) should be a directory") + return + vocab_file = os.path.join( + save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"] + ) + merge_file = os.path.join( + save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["merges_file"] + ) + + with open(vocab_file, "w", encoding="utf-8") as f: + f.write(json.dumps(self.encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n") + + index = 0 + with open(merge_file, "w", encoding="utf-8") as writer: + writer.write("#version: 0.2\n") + for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]): + if index != token_index: + logger.warning( + f"Saving vocabulary to {merge_file}: BPE merge indices are not consecutive." + " Please check that the tokenizer is not corrupted!" + ) + index = token_index + writer.write(" ".join(bpe_tokens) + "\n") + index += 1 + + return vocab_file, merge_file diff --git a/src/transformers/models/code_llama/__init__.py b/src/transformers/models/code_llama/__init__.py new file mode 100644 index 00000000000000..8c99c023419bbf --- /dev/null +++ b/src/transformers/models/code_llama/__init__.py @@ -0,0 +1,57 @@ +# Copyright 2023 MetaAI and The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from typing import TYPE_CHECKING + +from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_sentencepiece_available, is_tokenizers_available + + +_import_structure = {} + +try: + if not is_sentencepiece_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass +else: + _import_structure["tokenization_code_llama"] = ["CodeLlamaTokenizer"] + +try: + if not is_tokenizers_available(): + raise OptionalDependencyNotAvailable() +except OptionalDependencyNotAvailable: + pass +else: + _import_structure["tokenization_code_llama_fast"] = ["CodeLlamaTokenizerFast"] + +if TYPE_CHECKING: + try: + if not is_sentencepiece_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + else: + from .tokenization_code_llama import CodeLlamaTokenizer + + try: + if not is_tokenizers_available(): + raise OptionalDependencyNotAvailable() + except OptionalDependencyNotAvailable: + pass + else: + from .tokenization_code_llama_fast import CodeLlamaTokenizerFast + +else: + import sys + + sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__) diff --git a/src/transformers/models/code_llama/tokenization_code_llama.py b/src/transformers/models/code_llama/tokenization_code_llama.py new file mode 100644 index 00000000000000..db280bbc156150 --- /dev/null +++ b/src/transformers/models/code_llama/tokenization_code_llama.py @@ -0,0 +1,522 @@ +# coding=utf-8 +# Copyright 2023 MetaAI and the HuggingFace Inc. team. All rights reserved. +# +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Tokenization classes for Code LLaMA.""" +import os +from shutil import copyfile +from typing import Any, Dict, List, Optional, Tuple + +import sentencepiece as spm + +from ...convert_slow_tokenizer import import_protobuf +from ...tokenization_utils import AddedToken, PreTrainedTokenizer +from ...utils import logging, requires_backends + + +logger = logging.get_logger(__name__) + +VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model"} + +PRETRAINED_VOCAB_FILES_MAP = { + "vocab_file": { + "hf-internal-testing/llama-code-tokenizer": "https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer.model", + }, + "tokenizer_file": { + "hf-internal-testing/llama-code-tokenizer": "https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer_config.json", + }, +} +PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { + "hf-internal-testing/llama-code-tokenizer": 2048, +} +SPIECE_UNDERLINE = "▁" + +B_INST, E_INST = "[INST]", "[/INST]" +B_SYS, E_SYS = "<>\n", "\n<>\n\n" + +# fmt: off +DEFAULT_SYSTEM_PROMPT = """You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your \ +answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure\ + that your responses are socially unbiased and positive in nature. 
+
+If a question does not make any sense, or is not factually coherent, explain why instead of answering something not \
+correct. If you don't know the answer to a question, please don't share false information."""
+# fmt: on
+
+
+class CodeLlamaTokenizer(PreTrainedTokenizer):
+    """
+    Construct a CodeLlama tokenizer. Based on byte-level Byte-Pair-Encoding. The default padding token is unset as
+    there is no padding token in the original model.
+
+    The default configuration matches that of
+    [codellama/CodeLlama-7b-Instruct-hf](https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf/blob/main/tokenizer_config.json)
+    which supports prompt infilling.
+
+    Args:
+        vocab_file (`str`):
+            Path to the vocabulary file.
+        unk_token (`str`, *optional*, defaults to `"<unk>"`):
+            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
+            token instead.
+        bos_token (`str`, *optional*, defaults to `"<s>"`):
+            The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier
+            token.
+        eos_token (`str`, *optional*, defaults to `"</s>"`):
+            The end of sequence token.
+
+            <Tip>
+
+            When building a sequence using special tokens, this is not the token that is used for the end of sequence.
+            The token used is the `sep_token`.
+
+            </Tip>
+
+        prefix_token (`str`, *optional*, defaults to `"▁<PRE>"`):
+            Prefix token used for infilling.
+        middle_token (`str`, *optional*, defaults to `"▁<MID>"`):
+            Middle token used for infilling.
+        suffix_token (`str`, *optional*, defaults to `"▁<SUF>"`):
+            Suffix token used for infilling.
+        eot_token (`str`, *optional*, defaults to `"▁<EOT>"`):
+            End of text token used for infilling.
+        fill_token (`str`, *optional*, defaults to `"<FILL_ME>"`):
+            The token used to split the input between the prefix and suffix (see the usage sketch after this
+            docstring).
+        suffix_first (`bool`, *optional*, defaults to `False`):
+            Whether the input prompt and suffix should be formatted with the suffix first.
+        sp_model_kwargs (`dict`, *optional*):
+            Will be passed to the `SentencePieceProcessor.__init__()` method. The [Python wrapper for
+            SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things,
+            to set:
+
+            - `enable_sampling`: Enable subword regularization.
+            - `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.
+
+              - `nbest_size = {0,1}`: No sampling is performed.
+              - `nbest_size > 1`: samples from the nbest_size results.
+              - `nbest_size < 0`: assuming that nbest_size is infinite and samples from the all hypothesis (lattice)
+                using forward-filtering-and-backward-sampling algorithm.
+
+            - `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for
+              BPE-dropout.
+        add_bos_token (`bool`, *optional*, defaults to `True`):
+            Whether to add a beginning of sequence token at the start of sequences.
+        add_eos_token (`bool`, *optional*, defaults to `False`):
+            Whether to add an end of sequence token at the end of sequences.
+        clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
+            Whether or not to clean up the tokenization spaces.
+        additional_special_tokens (`List[str]`, *optional*):
+            Additional special tokens used by the tokenizer.
+        use_default_system_prompt (`bool`, *optional*, defaults to `False`):
+            Whether or not the default system prompt for Llama should be used.
+    """
+
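As a usage sketch of the `fill_token` behaviour described above: the snippet below assumes the public `codellama/CodeLlama-7b-hf` checkpoint (an infilling-capable model) can be downloaded; it is illustrative and not part of the patch.

```python
from transformers import CodeLlamaTokenizer

# Hedged example: checkpoint id and prompt are illustrative only.
tokenizer = CodeLlamaTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

# A prompt containing the fill token is split into a prefix and a suffix,
# and the infill special tokens (<PRE>, <SUF>, <MID>) are arranged around them.
prompt = 'def remove_non_ascii(s: str) -> str:\n    """ <FILL_ME>\n    return result\n'
print(tokenizer.tokenize(prompt)[:6])
```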
+    vocab_files_names = VOCAB_FILES_NAMES
+    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+    model_input_names = ["input_ids", "attention_mask"]
+
+    def __init__(
+        self,
+        vocab_file,
+        unk_token="<unk>",
+        bos_token="<s>",
+        eos_token="</s>",
+        prefix_token="▁<PRE>",
+        middle_token="▁<MID>",
+        suffix_token="▁<SUF>",
+        eot_token="▁<EOT>",
+        fill_token="<FILL_ME>",
+        suffix_first=False,
+        sp_model_kwargs: Optional[Dict[str, Any]] = None,
+        add_bos_token=True,
+        add_eos_token=False,
+        clean_up_tokenization_spaces=False,
+        additional_special_tokens=None,
+        use_default_system_prompt=False,
+        **kwargs,
+    ):
+        requires_backends(self, "protobuf")
+        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
+        bos_token = AddedToken(bos_token, normalized=False, special=True) if isinstance(bos_token, str) else bos_token
+        eos_token = AddedToken(eos_token, normalized=False, special=True) if isinstance(eos_token, str) else eos_token
+        unk_token = AddedToken(unk_token, normalized=False, special=True) if isinstance(unk_token, str) else unk_token
+
+        self.use_default_system_prompt = use_default_system_prompt
+        # mark tokens special to skip them
+        additional_special_tokens = additional_special_tokens or []
+        for token in [prefix_token, middle_token, suffix_token, eot_token]:
+            additional_special_tokens += [token] if token is not None else []
+
+        self.vocab_file = vocab_file
+        self.add_bos_token = add_bos_token
+        self.add_eos_token = add_eos_token
+        self._prefix_token = prefix_token
+        self._middle_token = middle_token
+        self._suffix_token = suffix_token
+        self._eot_token = eot_token
+        self.fill_token = fill_token
+        self.suffix_first = suffix_first
+        self.sp_model = self.get_spm_processor()
+
+        super().__init__(
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            add_bos_token=add_bos_token,
+            add_eos_token=add_eos_token,
+            prefix_token=prefix_token,
+            middle_token=middle_token,
+            suffix_token=suffix_token,
+            eot_token=eot_token,
+            fill_token=fill_token,
+            sp_model_kwargs=self.sp_model_kwargs,
+            suffix_first=suffix_first,
+            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
+            additional_special_tokens=additional_special_tokens,
+            use_default_system_prompt=use_default_system_prompt,
+            **kwargs,
+        )
+
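A minimal construction sketch for the `__init__` above, assuming a local CodeLlama-style SentencePiece file at the hypothetical path `tokenizer.model` and the `sentencepiece`/`protobuf` backends installed; the sampling values are illustrative only.

```python
from transformers import CodeLlamaTokenizer

# "tokenizer.model" is a placeholder path, not a file shipped with this patch.
tokenizer = CodeLlamaTokenizer(
    "tokenizer.model",
    sp_model_kwargs={"enable_sampling": True, "nbest_size": -1, "alpha": 0.1},
    add_bos_token=True,
    add_eos_token=False,
)
print(tokenizer.tokenize("def add(a, b):"))
```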
+    @property
+    def unk_token_length(self):
+        return len(self.sp_model.encode(str(self.unk_token)))
+
+    def get_spm_processor(self):
+        tokenizer = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+        with open(self.vocab_file, "rb") as f:
+            sp_model = f.read()
+            model_pb2 = import_protobuf()
+            model = model_pb2.ModelProto.FromString(sp_model)
+            normalizer_spec = model_pb2.NormalizerSpec()
+            normalizer_spec.add_dummy_prefix = False
+            model.normalizer_spec.MergeFrom(normalizer_spec)
+            sp_model = model.SerializeToString()
+            tokenizer.LoadFromSerializedProto(sp_model)
+        return tokenizer
+
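The `add_dummy_prefix` patch performed in `get_spm_processor` can be reproduced outside the tokenizer; a rough standalone sketch, assuming a local `tokenizer.model` and that the installed `sentencepiece` package ships its `sentencepiece_model_pb2` module (true for recent versions):

```python
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as model_pb2

def load_spm_without_dummy_prefix(path="tokenizer.model"):
    proto = model_pb2.ModelProto()
    with open(path, "rb") as f:
        proto.ParseFromString(f.read())
    # Disable the automatic "▁" prefix so leading spaces are preserved as typed.
    proto.normalizer_spec.add_dummy_prefix = False
    sp = spm.SentencePieceProcessor()
    sp.LoadFromSerializedProto(proto.SerializeToString())
    return sp
```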
+    @property
+    def prefix_token(self):
+        return self._prefix_token
+
+    @property
+    def prefix_id(self):
+        if self._prefix_token is None:
+            return None
+        return self.convert_tokens_to_ids(self.prefix_token)
+
+    @property
+    def middle_token(self):
+        return self._middle_token
+
+    @property
+    def middle_id(self):
+        if self._middle_token is None:
+            return None
+        return self.convert_tokens_to_ids(self.middle_token)
+
+    @property
+    def suffix_token(self):
+        return self._suffix_token
+
+    @property
+    def suffix_id(self):
+        if self._suffix_token is None:
+            return None
+        return self.convert_tokens_to_ids(self.suffix_token)
+
+    @property
+    def eot_token(self):
+        return self._eot_token
+
+    @property
+    def eot_id(self):
+        if self._eot_token is None:
+            return None
+        return self.convert_tokens_to_ids(self.eot_token)
+
+    @property
+    def vocab_size(self):
+        """Returns vocab size"""
+        return self.sp_model.get_piece_size()
+
+    # Copied from transformers.models.llama.tokenization_llama.LlamaTokenizer.get_vocab
+    def get_vocab(self):
+        """Returns vocab as a dict"""
+        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
+        vocab.update(self.added_tokens_encoder)
+        return vocab
+
+    def tokenize(self, prefix, suffix=None, suffix_first=False, **kwargs) -> List[int]:
+        # add a prefix space to `prefix`
+        if self.fill_token is not None and self.fill_token in prefix and suffix is None:
+            prefix, suffix = prefix.split(self.fill_token)
+
+        if len(prefix) > 0:
+            prefix = SPIECE_UNDERLINE + prefix.replace(SPIECE_UNDERLINE, " ")
+
+        if suffix is None or len(suffix) < 1:
+            tokens = super().tokenize(prefix, **kwargs)
+            if len(tokens) > 1 and tokens[0] == SPIECE_UNDERLINE and tokens[1] in self.all_special_tokens:
+                tokens = tokens[1:]
+            return tokens
+
+        prefix_tokens = self._tokenize(prefix)  # prefix has an extra `SPIECE_UNDERLINE`
+
+        if None in (self.prefix_id, self.middle_id, self.suffix_id):
+            raise ValueError(
+                "The input either includes a `prefix` and a `suffix` used for the infilling task,"
+                f"  or can be split on the {self.fill_token} token, creating a suffix and prefix,"
+                " but the model does not support `infilling`."
+            )
+        suffix_tokens = self._tokenize(suffix)  # make sure CodeLlama sp model does not mess up
+
+        suffix_first = suffix_first if suffix_first is not None else self.suffix_first
+        if suffix_first:
+            # format as " <PRE> <SUF>{suf} <MID> {pre}"
+            return [self.prefix_token, self.suffix_token] + suffix_tokens + [self.middle_token] + prefix_tokens
+        else:
+            # format as " <PRE> {pre} <SUF>{suf} <MID>"
+            return [self.prefix_token] + prefix_tokens + [self.suffix_token] + suffix_tokens + [self.middle_token]
+
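A model-free sketch of the two infill layouts returned by `tokenize` above; the token lists below stand in for real SentencePiece output.

```python
def arrange_infill(prefix_tokens, suffix_tokens, suffix_first=False,
                   pre="▁<PRE>", mid="▁<MID>", suf="▁<SUF>"):
    if suffix_first:
        # " <PRE> <SUF>{suf} <MID> {pre}"
        return [pre, suf] + suffix_tokens + [mid] + prefix_tokens
    # " <PRE> {pre} <SUF>{suf} <MID>"
    return [pre] + prefix_tokens + [suf] + suffix_tokens + [mid]

print(arrange_infill(["▁def", "▁f", "():"], ["▁return", "▁x"]))
# ['▁<PRE>', '▁def', '▁f', '():', '▁<SUF>', '▁return', '▁x', '▁<MID>']
```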
+    def _tokenize(self, text, **kwargs):
+        """
+        Returns a tokenized string.
+
+        We de-activated the `add_dummy_prefix` option, thus the sentencepiece internals will always strip any
+        SPIECE_UNDERLINE. For example: `self.sp_model.encode(f"{SPIECE_UNDERLINE}Hey", out_type = str)` will give
+        `['H', 'e', 'y']` instead of `['▁He', 'y']`. Thus we always encode `f"{unk_token}text"` and strip the
+        `unk_token`. Here is an example with `unk_token = "<unk>"` and `unk_token_length = 4`.
+        `self.tokenizer.sp_model.encode(" Hey", out_type = str)[4:]`.
+        """
+        tokens = self.sp_model.encode(text, out_type=str)
+        if not text.startswith((SPIECE_UNDERLINE, " ")):
+            return tokens
+        # 1. Encode string + prefix ex: " Hey"
+        tokens = self.sp_model.encode(self.unk_token + text, out_type=str)
+        # 2. Remove self.unk_token from ['<','unk','>', '▁Hey']
+        return tokens[self.unk_token_length :] if len(tokens) >= self.unk_token_length else tokens
+
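For readers reviewing the `_tokenize` trick above, here is a minimal standalone sketch of how prepending the unknown token preserves the leading `SPIECE_UNDERLINE`. It uses a hypothetical `sp_encode` stand-in instead of a real SentencePiece model, so the piece boundaries are illustrative only.

```python
UNK = "<unk>"
UNK_LEN = 4  # number of pieces this stand-in splits "<unk>" into

def sp_encode(text):
    # Pretend encoder: splits "<unk>" into 4 pieces and space-prefixed words into "▁"-pieces.
    pieces = []
    if text.startswith(UNK):
        pieces += ["<", "un", "k", ">"]
        text = text[len(UNK):]
    pieces += ["▁" + word for word in text.split(" ") if word]
    return pieces

def tokenize_with_prefix_space(text):
    # Encode "<unk>" + text, then drop the "<unk>" pieces, so the leading
    # "▁" of `text` survives the stripping described in the docstring above.
    pieces = sp_encode(UNK + text)
    return pieces[UNK_LEN:] if len(pieces) >= UNK_LEN else pieces

print(tokenize_with_prefix_space(" Hey"))  # ['▁Hey']
```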
+    # Copied from transformers.models.llama.tokenization_llama.LlamaTokenizer._convert_token_to_id
+    def _convert_token_to_id(self, token):
+        """Converts a token (str) in an id using the vocab."""
+        return self.sp_model.piece_to_id(token)
+
+    # Copied from transformers.models.llama.tokenization_llama.LlamaTokenizer._convert_id_to_token
+    def _convert_id_to_token(self, index):
+        """Converts an index (integer) in a token (str) using the vocab."""
+        token = self.sp_model.IdToPiece(index)
+        return token
+
+    def convert_tokens_to_string(self, tokens):
+        """Converts a sequence of tokens (string) in a single string."""
+        # since we manually add the prefix space, we have to remove it when decoding
+        if tokens[0].startswith(SPIECE_UNDERLINE):
+            tokens[0] = tokens[0][1:]
+
+        current_sub_tokens = []
+        out_string = ""
+        for _, token in enumerate(tokens):
+            # make sure that special tokens are not decoded using sentencepiece model
+            if token in self.all_special_tokens:
+                out_string += self.sp_model.decode(current_sub_tokens) + token
+                current_sub_tokens = []
+            else:
+                current_sub_tokens.append(token)
+        out_string += self.sp_model.decode(current_sub_tokens)
+        return out_string
+
+    # Copied from transformers.models.llama.tokenization_llama.LlamaTokenizer.save_vocabulary
+    def save_vocabulary(self, save_directory, filename_prefix: Optional[str] = None) -> Tuple[str]:
+        """
+        Save the vocabulary and special tokens file to a directory.
+
+        Args:
+            save_directory (`str`):
+                The directory in which to save the vocabulary.
+
+        Returns:
+            `Tuple(str)`: Paths to the files saved.
+        """
+        if not os.path.isdir(save_directory):
+            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
+            return
+        out_vocab_file = os.path.join(
+            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
+        )
+
+        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
+            copyfile(self.vocab_file, out_vocab_file)
+        elif not os.path.isfile(self.vocab_file):
+            with open(out_vocab_file, "wb") as fi:
+                content_spiece_model = self.sp_model.serialized_model_proto()
+                fi.write(content_spiece_model)
+
+        return (out_vocab_file,)
+
+    # Copied from transformers.models.llama.tokenization_llama.LlamaTokenizer.build_inputs_with_special_tokens
+    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
+        bos_token_id = [self.bos_token_id] if self.add_bos_token else []
+        eos_token_id = [self.eos_token_id] if self.add_eos_token else []
+
+        output = bos_token_id + token_ids_0 + eos_token_id
+
+        if token_ids_1 is not None:
+            output = output + bos_token_id + token_ids_1 + eos_token_id
+
+        return output
+
+    # Copied from transformers.models.llama.tokenization_llama.LlamaTokenizer.get_special_tokens_mask
+    def get_special_tokens_mask(
+        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
+    ) -> List[int]:
+        """
+        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
+        special tokens using the tokenizer `prepare_for_model` method.
+
+        Args:
+            token_ids_0 (`List[int]`):
+                List of IDs.
+            token_ids_1 (`List[int]`, *optional*):
+                Optional second list of IDs for sequence pairs.
+            already_has_special_tokens (`bool`, *optional*, defaults to `False`):
+                Whether or not the token list is already formatted with special tokens for the model.
+
+        Returns:
+            `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
+        """
+        if already_has_special_tokens:
+            return super().get_special_tokens_mask(
+                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
+            )
+
+        bos_token_id = [1] if self.add_bos_token else []
+        eos_token_id = [1] if self.add_eos_token else []
+
+        if token_ids_1 is None:
+            return bos_token_id + ([0] * len(token_ids_0)) + eos_token_id
+        return (
+            bos_token_id
+            + ([0] * len(token_ids_0))
+            + eos_token_id
+            + bos_token_id
+            + ([0] * len(token_ids_1))
+            + eos_token_id
+        )
+
+    # Copied from transformers.models.llama.tokenization_llama.LlamaTokenizer.create_token_type_ids_from_sequences
+    def create_token_type_ids_from_sequences(
+        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
+    ) -> List[int]:
+        """
+        Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An ALBERT
+        sequence pair mask has the following format:
+
+        ```
+        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
+        | first sequence    | second sequence |
+        ```
+
+        if token_ids_1 is None, only returns the first portion of the mask (0s).
+
+        Args:
+            token_ids_0 (`List[int]`):
+                List of ids.
+            token_ids_1 (`List[int]`, *optional*):
+                Optional second list of IDs for sequence pairs.
+
+        Returns:
+            `List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
+        """
+        bos_token_id = [self.bos_token_id] if self.add_bos_token else []
+        eos_token_id = [self.eos_token_id] if self.add_eos_token else []
+
+        output = [0] * len(bos_token_id + token_ids_0 + eos_token_id)
+
+        if token_ids_1 is not None:
+            output += [1] * len(bos_token_id + token_ids_1 + eos_token_id)
+
+        return output
+
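A tiny worked example of the token-type-id layout produced above, using made-up token ids:

```python
bos_id, eos_id = 1, 2          # made-up special token ids
ids_a = [10, 11, 12]           # first sequence
ids_b = [20, 21]               # optional second sequence

token_type_ids = [0] * len([bos_id] + ids_a + [eos_id])
token_type_ids += [1] * len([bos_id] + ids_b + [eos_id])
print(token_type_ids)  # [0, 0, 0, 0, 0, 1, 1, 1, 1]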
+    @property
+    # Copied from transformers.models.llama.tokenization_llama.LlamaTokenizer.default_chat_template
+    def default_chat_template(self):
+        """
+        LLaMA uses [INST] and [/INST] to indicate user messages, and <<SYS>> and <</SYS>> to indicate system messages.
+        Assistant messages do not have special tokens, because LLaMA chat models are generally trained with strict
+        user/assistant/user/assistant message ordering, and so assistant messages can be identified from the ordering
+        rather than needing special tokens. The system message is partly 'embedded' in the first user message, which
+        results in an unusual token ordering when it is present. This template should definitely be changed if you wish
+        to fine-tune a model with more flexible role ordering!
+
+        The output should look something like:
+
+        <bos>[INST] B_SYS SystemPrompt E_SYS Prompt [/INST] Answer <eos><bos>[INST] Prompt [/INST] Answer <eos>
+        <bos>[INST] Prompt [/INST]
+
+        The reference for this chat template is [this code
+        snippet](https://github.com/facebookresearch/llama/blob/556949fdfb72da27c2f4a40b7f0e4cf0b8153a28/llama/generation.py#L320-L362)
+        in the original repository.
+        """
+        logger.warning_once(
+            "\nNo chat template is defined for this tokenizer - using the default template "
+            f"for the {self.__class__.__name__} class. If the default is not appropriate for "
+            "your model, please set `tokenizer.chat_template` to an appropriate template. "
+            "See https://huggingface.co/docs/transformers/main/chat_templating for more information.\n"
+        )
+        template = (
+            "{% if messages[0]['role'] == 'system' %}"
+            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
+            "{% set system_message = messages[0]['content'] %}"
+            "{% elif USE_DEFAULT_PROMPT == true and not '<>' in messages[0]['content'] %}"
+            "{% set loop_messages = messages %}"  # Or use the default system message if the flag is set
+            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
+            "{% else %}"
+            "{% set loop_messages = messages %}"
+            "{% set system_message = false %}"
+            "{% endif %}"
+            "{% for message in loop_messages %}"  # Loop over all non-system messages
+            "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
+            "{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}"
+            "{% endif %}"
+            "{% if loop.index0 == 0 and system_message != false %}"  # Embed system message in first message
+            "{% set content = '<>\\n' + system_message + '\\n<>\\n\\n' + message['content'] %}"
+            "{% else %}"
+            "{% set content = message['content'] %}"
+            "{% endif %}"
+            "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
+            "{{ bos_token + '[INST] ' + content.strip() + ' [/INST]' }}"
+            "{% elif message['role'] == 'system' %}"
+            "{{ '<>\\n' + content.strip() + '\\n<>\\n\\n' }}"
+            "{% elif message['role'] == 'assistant' %}"
+            "{{ ' '  + content.strip() + ' ' + eos_token }}"
+            "{% endif %}"
+            "{% endfor %}"
+        )
+        template = template.replace("USE_DEFAULT_PROMPT", "true" if self.use_default_system_prompt else "false")
+        default_message = DEFAULT_SYSTEM_PROMPT.replace("\n", "\\n").replace("'", "\\'")
+        template = template.replace("DEFAULT_SYSTEM_MESSAGE", default_message)
+
+        return template
+
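A usage sketch of the default template above. The checkpoint name is assumed to be available, and the rendered string in the final comment is abridged and not verified output.

```python
from transformers import CodeLlamaTokenizer

tok = CodeLlamaTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")
chat = [
    {"role": "system", "content": "You are a careful coding assistant."},
    {"role": "user", "content": "Write a function that reverses a string."},
]
prompt = tok.apply_chat_template(chat, tokenize=False)
# Expected shape (abridged): "<s>[INST] <<SYS>>\n...system...\n<</SYS>>\n\n...user... [/INST]"
```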
+    def __getstate__(self):
+        state = self.__dict__.copy()
+        state["sp_model"] = None
+        state["sp_model_proto"] = self.sp_model.serialized_model_proto()
+        return state
+
+    def __setstate__(self, d):
+        self.__dict__ = d
+        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+        self.sp_model.LoadFromSerializedProto(self.sp_model_proto)
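To round off the slow tokenizer, a hedged usage sketch of the infilling path. The checkpoint name is assumed to be available locally or downloadable, and the resulting ids are not verified here.

```python
from transformers import CodeLlamaTokenizer

tok = CodeLlamaTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

# The prompt is split on the fill token into a prefix and a suffix, then re-assembled
# as " <PRE> {prefix} <SUF>{suffix} <MID>", which is the layout infilling checkpoints expect.
prompt = "def remove_non_ascii(s: str) -> str:\n    <FILL_ME>\n    return result"
input_ids = tok(prompt, return_tensors="pt")["input_ids"]
print(input_ids.shape)
```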
diff --git a/src/transformers/models/code_llama/tokenization_code_llama_fast.py b/src/transformers/models/code_llama/tokenization_code_llama_fast.py
new file mode 100644
index 00000000000000..e2429aaec5d187
--- /dev/null
+++ b/src/transformers/models/code_llama/tokenization_code_llama_fast.py
@@ -0,0 +1,439 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import os
+from shutil import copyfile
+from typing import List, Optional, Tuple
+
+from tokenizers import normalizers, processors
+
+from ...tokenization_utils_fast import PreTrainedTokenizerFast
+from ...utils import is_sentencepiece_available, logging
+from ...utils.versions import require_version
+
+
+require_version("tokenizers>=0.13.3")
+
+if is_sentencepiece_available():
+    from .tokenization_code_llama import CodeLlamaTokenizer
+else:
+    CodeLlamaTokenizer = None
+
+logger = logging.get_logger(__name__)
+VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model", "tokenizer_file": "tokenizer.json"}
+
+SPIECE_UNDERLINE = "▁"
+
+
+B_INST, E_INST = "[INST]", "[/INST]"
+B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
+
+# fmt: off
+DEFAULT_SYSTEM_PROMPT = """You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your \
+answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure\
+ that your responses are socially unbiased and positive in nature.
+
+If a question does not make any sense, or is not factually coherent, explain why instead of answering something not \
+correct. If you don't know the answer to a question, please don't share false information."""
+# fmt: on
+
+
+class CodeLlamaTokenizerFast(PreTrainedTokenizerFast):
+    """
+    Construct a CodeLlama tokenizer. Based on byte-level Byte-Pair-Encoding.
+
+    This uses notably ByteFallback and no normalization.
+
+    ```python
+    >>> from transformers import CodeLlamaTokenizerFast
+
+    >>> tokenizer = CodeLlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer")
+    >>> tokenizer.encode("Hello this is a test")
+    [1, 15043, 445, 338, 263, 1243]
+    ```
+
+    If you want to change the `bos_token` or the `eos_token`, make sure to specify them when initializing the model, or
+    call `tokenizer.update_post_processor()` to make sure that the post-processing is correctly done (otherwise the
+    values of the first token and final token of an encoded sequence will not be correct). For more details, check out
+    the [post-processors](https://huggingface.co/docs/tokenizers/api/post-processors) documentation.
+
+
+    This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main methods. Users should
+    refer to this superclass for more information regarding those methods. The default configuration matches that of
+    [codellama/CodeLlama-7b-Instruct-hf](https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf/blob/main/tokenizer_config.json)
+    which supports prompt infilling.
+
+    Args:
+        vocab_file (`str`, *optional*):
+            [SentencePiece](https://github.com/google/sentencepiece) file (generally has a .model extension) that
+            contains the vocabulary necessary to instantiate a tokenizer.
+        tokenizer_file (`str`, *optional*):
+            [tokenizers](https://github.com/huggingface/tokenizers) file (generally has a .json extension) that
+            contains everything needed to load the tokenizer.
+        clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
+            Whether to clean up spaces after decoding; cleanup consists in removing potential artifacts like extra
+            spaces.
+        unk_token (`str`, *optional*, defaults to `"<unk>"`):
+            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
+            token instead.
+        bos_token (`str`, *optional*, defaults to `"<s>"`):
+            The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
+        eos_token (`str`, *optional*, defaults to `"</s>"`):
+            The end of sequence token.
+        prefix_token (`str`, *optional*, defaults to `"▁<PRE>"`):
+            Prefix token used for infilling.
+        middle_token (`str`, *optional*, defaults to `"▁<MID>"`):
+            Middle token used for infilling.
+        suffix_token (`str`, *optional*, defaults to `"▁<SUF>"`):
+            Suffix token used for infilling.
+        eot_token (`str`, *optional*, defaults to `"▁<EOT>"`):
+            End of text token used for infilling.
+        fill_token (`str`, *optional*, defaults to `"<FILL_ME>"`):
+            The token used to split the input between the prefix and suffix.
+        additional_special_tokens (`List[str]`, *optional*):
+            Additional special tokens used by the tokenizer.
+        add_bos_token (`bool`, *optional*, defaults to `True`):
+            Whether to add a beginning of sequence token at the start of sequences.
+        add_eos_token (`bool`, *optional*, defaults to `False`):
+            Whether to add an end of sequence token at the end of sequences.
+        use_default_system_prompt (`bool`, *optional*, defaults to `False`):
+            Whether or not the default system prompt for Llama should be used.
+    """
+
+    vocab_files_names = VOCAB_FILES_NAMES
+    slow_tokenizer_class = CodeLlamaTokenizer
+    padding_side = "left"
+    model_input_names = ["input_ids", "attention_mask"]
+
+    def __init__(
+        self,
+        vocab_file=None,
+        tokenizer_file=None,
+        clean_up_tokenization_spaces=False,
+        unk_token="",
+        bos_token="",
+        eos_token="",
+        prefix_token="▁
",
+        middle_token="▁",
+        suffix_token="▁",
+        eot_token="▁",
+        fill_token="",
+        additional_special_tokens=None,
+        add_bos_token=True,
+        add_eos_token=False,
+        use_default_system_prompt=False,
+        **kwargs,
+    ):
+        # mark tokens special to skip them
+        additional_special_tokens = additional_special_tokens or []
+        for token in [prefix_token, middle_token, suffix_token, eot_token]:
+            additional_special_tokens += [token] if token is not None else []
+        self.use_default_system_prompt = use_default_system_prompt
+
+        super().__init__(
+            vocab_file=vocab_file,
+            tokenizer_file=tokenizer_file,
+            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
+            additional_special_tokens=additional_special_tokens,
+            unk_token=unk_token,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            add_bos_token=add_bos_token,
+            add_eos_token=add_eos_token,
+            prefix_token=prefix_token,
+            middle_token=middle_token,
+            suffix_token=suffix_token,
+            eot_token=eot_token,
+            fill_token=fill_token,
+            use_default_system_prompt=use_default_system_prompt,
+            **kwargs,
+        )
+        self._add_bos_token = add_bos_token
+        self._add_eos_token = add_eos_token
+        self.update_post_processor()
+
+        self.vocab_file = vocab_file
+
+        self._prefix_token = prefix_token
+        self._middle_token = middle_token
+        self._suffix_token = suffix_token
+        self._eot_token = eot_token
+        self.fill_token = fill_token
+
+    @property
+    def can_save_slow_tokenizer(self) -> bool:
+        return os.path.isfile(self.vocab_file) if self.vocab_file else False
+
+    # Copied from transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast.update_post_processor
+    def update_post_processor(self):
+        """
+        Updates the underlying post processor with the current `bos_token` and `eos_token`.
+        """
+        bos = self.bos_token
+        bos_token_id = self.bos_token_id
+        if bos is None and self.add_bos_token:
+            raise ValueError("add_bos_token = True but bos_token = None")
+
+        eos = self.eos_token
+        eos_token_id = self.eos_token_id
+        if eos is None and self.add_eos_token:
+            raise ValueError("add_eos_token = True but eos_token = None")
+
+        single = f"{(bos+':0 ') if self.add_bos_token else ''}$A:0{(' '+eos+':0') if self.add_eos_token else ''}"
+        pair = f"{single}{(' '+bos+':1') if self.add_bos_token else ''} $B:1{(' '+eos+':1') if self.add_eos_token else ''}"
+
+        special_tokens = []
+        if self.add_bos_token:
+            special_tokens.append((bos, bos_token_id))
+        if self.add_eos_token:
+            special_tokens.append((eos, eos_token_id))
+        self._tokenizer.post_processor = processors.TemplateProcessing(
+            single=single, pair=pair, special_tokens=special_tokens
+        )
+
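For clarity, a small sketch of the `single`/`pair` template strings the code above builds for `add_bos_token=True, add_eos_token=False`. This is pure string formatting, with no `tokenizers` objects involved.

```python
bos, eos = "<s>", "</s>"
add_bos_token, add_eos_token = True, False

single = f"{(bos + ':0 ') if add_bos_token else ''}$A:0{(' ' + eos + ':0') if add_eos_token else ''}"
pair = f"{single}{(' ' + bos + ':1') if add_bos_token else ''} $B:1{(' ' + eos + ':1') if add_eos_token else ''}"

print(single)  # <s>:0 $A:0
print(pair)    # <s>:0 $A:0 <s>:1 $B:1
```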
+    @property
+    def prefix_token(self):
+        return self._prefix_token
+
+    @property
+    def prefix_id(self):
+        if self._prefix_token is None:
+            return None
+        return self.convert_tokens_to_ids(self.prefix_token)
+
+    @property
+    def middle_token(self):
+        return self._middle_token
+
+    @property
+    def middle_id(self):
+        if self._middle_token is None:
+            return None
+        return self.convert_tokens_to_ids(self.middle_token)
+
+    @property
+    def suffix_token(self):
+        return self._suffix_token
+
+    @property
+    def suffix_id(self):
+        if self._suffix_token is None:
+            return None
+        return self.convert_tokens_to_ids(self.suffix_token)
+
+    @property
+    def eot_id(self):
+        if self._eot_token is None:
+            return None
+        return self.convert_tokens_to_ids(self.eot_token)
+
+    @property
+    def eot_token(self):
+        return self._eot_token
+
+    @property
+    def add_eos_token(self):
+        return self._add_eos_token
+
+    @property
+    def add_bos_token(self):
+        return self._add_bos_token
+
+    @add_eos_token.setter
+    def add_eos_token(self, value):
+        self._add_eos_token = value
+        self.update_post_processor()
+
+    @add_bos_token.setter
+    def add_bos_token(self, value):
+        self._add_bos_token = value
+        self.update_post_processor()
+
+    def set_infilling_processor(self, reset, suffix_first=False, add_special_tokens=True):
+        """
+        Updates the normalizer to make sure the prompt format for `infilling` is respected. The infilling format is the
+        following: if suffix_first
+            " 
 {suf}  {pre}"
+        else:
+            " 
 {pre} {suf} "
+
+        If `reset` is set to `True`, the `normalizer` and `post_processor` are reset to their "normal" behaviour, which
+        is to add a prefix space for the normalizer, and add a `bos_token` to the input text for the `post_processor`.
+        """
+        if reset:
+            self._tokenizer.normalizer = normalizers.Sequence(
+                [
+                    normalizers.Prepend(prepend="▁"),
+                    normalizers.Replace(pattern=" ", content="▁"),
+                ]
+            )
+            self.update_post_processor()
+            return
+
+        self._tokenizer.normalizer = normalizers.Replace(pattern=" ", content="▁")
+        pair = [self.bos_token] if self.add_bos_token and add_special_tokens else []
+        special_tokens = [(self.bos_token, self.bos_token_id)] if self.add_bos_token and add_special_tokens else []
+        if suffix_first:
+            # format as " <PRE> <SUF>{suf} <MID> {pre}"
+            pair += [self.prefix_token, self.suffix_token, "$B", self.middle_token, "$A"]
+            special_tokens += [
+                (self.prefix_token, self.prefix_id),
+                (self.suffix_token, self.suffix_id),
+                (self.middle_token, self.middle_id),
+            ]
+        else:
+            # format as " <PRE> {pre} <SUF>{suf} <MID>"
+            pair += [self.prefix_token, "$A", self.suffix_token, "$B", self.middle_token]
+            special_tokens += [
+                (self.prefix_token, self.prefix_id),
+                (self.suffix_token, self.suffix_id),
+                (self.middle_token, self.middle_id),
+            ]
+
+        if self.add_eos_token and add_special_tokens:
+            pair += [self.eos_token]
+            special_tokens += [(self.eos_token, self.eos_token_id)]
+        self._tokenizer.post_processor = processors.TemplateProcessing(
+            single="$A", pair=pair, special_tokens=special_tokens
+        )
+
+    def encode_plus(self, text, text_pair=None, suffix_first=False, add_special_tokens=True, **kwargs):
+        # hack to make sure the input is pre-processed, but outside of the Rust tokenizer
+        text_pair = kwargs.pop("suffix", text_pair)
+        if self.fill_token is not None and self.fill_token in text and text_pair is None:
+            text, text_pair = text.split(self.fill_token)
+
+        if text_pair is None or len(text_pair) < 1:
+            return super().encode_plus(text, text_pair, add_special_tokens=add_special_tokens, **kwargs)
+
+        if None in (self.prefix_id, self.middle_id, self.suffix_id):
+            raise ValueError(
+                "Then input includes a `prefix` and a `suffix` used for the infilling task,"
+                " the `prefix_id, middle_id, suffix_id` must all be initialized. Current"
+                f" values : {self.prefix_id, self.middle_id, self.suffix_id}"
+            )
+
+        self.set_infilling_processor(False, suffix_first=suffix_first, add_special_tokens=add_special_tokens)
+        tokens = super().encode_plus(" " + text, text_pair=text_pair, add_special_tokens=True, **kwargs)
+        self.set_infilling_processor(True)
+        return tokens
+
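A hedged usage sketch of `encode_plus` with an explicit suffix. The checkpoint name is assumed, and the comment describes the expected, not verified, token layout.

```python
from transformers import CodeLlamaTokenizerFast

tok = CodeLlamaTokenizerFast.from_pretrained("codellama/CodeLlama-7b-hf")

enc = tok.encode_plus("def add(a, b):\n    ", "\n    return a + b")
tokens = tok.convert_ids_to_tokens(enc["input_ids"])
# Expected layout (illustrative): ['<s>', '▁<PRE>', ...prefix..., '▁<SUF>', ...suffix..., '▁<MID>']
print(tokens[:3])
```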
+    # Copied from transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast.save_vocabulary
+    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
+        if not self.can_save_slow_tokenizer:
+            raise ValueError(
+                "Your fast tokenizer does not have the necessary information to save the vocabulary for a slow "
+                "tokenizer."
+            )
+
+        if not os.path.isdir(save_directory):
+            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
+            return
+        out_vocab_file = os.path.join(
+            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
+        )
+
+        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
+            copyfile(self.vocab_file, out_vocab_file)
+
+        return (out_vocab_file,)
+
+    @property
+    # Copied from transformers.models.llama.tokenization_llama.LlamaTokenizer.default_chat_template
+    def default_chat_template(self):
+        """
+        LLaMA uses [INST] and [/INST] to indicate user messages, and <<SYS>> and <</SYS>> to indicate system messages.
+        Assistant messages do not have special tokens, because LLaMA chat models are generally trained with strict
+        user/assistant/user/assistant message ordering, and so assistant messages can be identified from the ordering
+        rather than needing special tokens. The system message is partly 'embedded' in the first user message, which
+        results in an unusual token ordering when it is present. This template should definitely be changed if you wish
+        to fine-tune a model with more flexible role ordering!
+
+        The output should look something like:
+
+        <bos>[INST] B_SYS SystemPrompt E_SYS Prompt [/INST] Answer <eos><bos>[INST] Prompt [/INST] Answer <eos>
+        <bos>[INST] Prompt [/INST]
+
+        The reference for this chat template is [this code
+        snippet](https://github.com/facebookresearch/llama/blob/556949fdfb72da27c2f4a40b7f0e4cf0b8153a28/llama/generation.py#L320-L362)
+        in the original repository.
+        """
+        logger.warning_once(
+            "\nNo chat template is defined for this tokenizer - using the default template "
+            f"for the {self.__class__.__name__} class. If the default is not appropriate for "
+            "your model, please set `tokenizer.chat_template` to an appropriate template. "
+            "See https://huggingface.co/docs/transformers/main/chat_templating for more information.\n"
+        )
+        template = (
+            "{% if messages[0]['role'] == 'system' %}"
+            "{% set loop_messages = messages[1:] %}"  # Extract system message if it's present
+            "{% set system_message = messages[0]['content'] %}"
+            "{% elif USE_DEFAULT_PROMPT == true and not '<>' in messages[0]['content'] %}"
+            "{% set loop_messages = messages %}"  # Or use the default system message if the flag is set
+            "{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}"
+            "{% else %}"
+            "{% set loop_messages = messages %}"
+            "{% set system_message = false %}"
+            "{% endif %}"
+            "{% for message in loop_messages %}"  # Loop over all non-system messages
+            "{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
+            "{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}"
+            "{% endif %}"
+            "{% if loop.index0 == 0 and system_message != false %}"  # Embed system message in first message
+            "{% set content = '<>\\n' + system_message + '\\n<>\\n\\n' + message['content'] %}"
+            "{% else %}"
+            "{% set content = message['content'] %}"
+            "{% endif %}"
+            "{% if message['role'] == 'user' %}"  # After all of that, handle messages/roles in a fairly normal way
+            "{{ bos_token + '[INST] ' + content.strip() + ' [/INST]' }}"
+            "{% elif message['role'] == 'system' %}"
+            "{{ '<>\\n' + content.strip() + '\\n<>\\n\\n' }}"
+            "{% elif message['role'] == 'assistant' %}"
+            "{{ ' '  + content.strip() + ' ' + eos_token }}"
+            "{% endif %}"
+            "{% endfor %}"
+        )
+        template = template.replace("USE_DEFAULT_PROMPT", "true" if self.use_default_system_prompt else "false")
+        default_message = DEFAULT_SYSTEM_PROMPT.replace("\n", "\\n").replace("'", "\\'")
+        template = template.replace("DEFAULT_SYSTEM_MESSAGE", default_message)
+
+        return template
+
+    def build_inputs_with_special_tokens(
+        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
+    ) -> List[int]:
+        """
+        Build model inputs from a sequence or a pair of sequences by concatenating them with the special tokens
+        configured for this tokenizer.
+
+        A CodeLlama sequence has the following format, where `X` represents the sequence:
+
+        - single sequence: `<s> X </s>`
+        - pair of sequences: `<s> A B </s>`
+
+        Pairs of sequences are not the expected use case, but they will be handled without a separator.
+
+        Args:
+            token_ids_0 (`List[int]`):
+                List of IDs to which the special tokens will be added.
+            token_ids_1 (`List[int]`, *optional*):
+                Optional second list of IDs for sequence pairs.
+
+        Returns:
+            `List[int]`: list of [input IDs](../glossary#input-ids) with the appropriate special tokens.
+        """
+        if token_ids_1 is None:
+            return [self.bos_token_id] + token_ids_0 + [self.eos_token_id]
+        return [self.bos_token_id] + token_ids_0 + token_ids_1 + [self.eos_token_id]
diff --git a/src/transformers/models/codegen/configuration_codegen.py b/src/transformers/models/codegen/configuration_codegen.py
index 1a1e609f0111fb..73c019870f1f6a 100644
--- a/src/transformers/models/codegen/configuration_codegen.py
+++ b/src/transformers/models/codegen/configuration_codegen.py
@@ -57,6 +57,8 @@ class CodeGenConfig(PretrainedConfig):
         n_positions (`int`, *optional*, defaults to 2048):
             The maximum sequence length that this model might ever be used with. Typically set this to something large
             just in case (e.g., 512 or 1024 or 2048).
+        n_ctx (`int`, *optional*, defaults to 2048):
+            This attribute is used in `CodeGenModel.__init__` without any real effect.
         n_embd (`int`, *optional*, defaults to 4096):
             Dimensionality of the embeddings and hidden states.
         n_layer (`int`, *optional*, defaults to 28):
@@ -65,22 +67,29 @@ class CodeGenConfig(PretrainedConfig):
             Number of attention heads for each attention layer in the Transformer encoder.
         rotary_dim (`int`, *optional*, defaults to 64):
             Number of dimensions in the embedding that Rotary Position Embedding is applied to.
-        n_inner (`int`, *optional*, defaults to None):
+        n_inner (`int`, *optional*):
             Dimensionality of the inner feed-forward layers. `None` will set it to 4 times n_embd
         activation_function (`str`, *optional*, defaults to `"gelu_new"`):
             Activation function, to be selected in the list `["relu", "silu", "gelu", "tanh", "gelu_new"]`.
-        resid_pdrop (`float`, *optional*, defaults to 0.1):
+        resid_pdrop (`float`, *optional*, defaults to 0.0):
             The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
-        embd_pdrop (`int`, *optional*, defaults to 0.1):
+        embd_pdrop (`int`, *optional*, defaults to 0.0):
             The dropout ratio for the embeddings.
-        attn_pdrop (`float`, *optional*, defaults to 0.1):
+        attn_pdrop (`float`, *optional*, defaults to 0.0):
             The dropout ratio for the attention.
-        layer_norm_epsilon (`float`, *optional*, defaults to 1e-5):
+        layer_norm_epsilon (`float`, *optional*, defaults to 1e-05):
             The epsilon to use in the layer normalization layers.
         initializer_range (`float`, *optional*, defaults to 0.02):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
         use_cache (`bool`, *optional*, defaults to `True`):
             Whether or not the model should return the last key/values attentions (not used by all models).
+        bos_token_id (`int`, *optional*, defaults to 50256):
+            Beginning of stream token id.
+        eos_token_id (`int`, *optional*, defaults to 50256):
+            End of stream token id.
+        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
+            Whether the model's input and output word embeddings should be tied. Note that this is only relevant if the
+            model has an output word embedding layer.
 
     Example:
 
@@ -96,6 +105,7 @@ class CodeGenConfig(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "codegen"
     attribute_map = {
         "max_position_embeddings": "n_positions",
diff --git a/src/transformers/models/codegen/modeling_codegen.py b/src/transformers/models/codegen/modeling_codegen.py
index fb7716a00ec529..60496f57212226 100644
--- a/src/transformers/models/codegen/modeling_codegen.py
+++ b/src/transformers/models/codegen/modeling_codegen.py
@@ -51,43 +51,26 @@
 ]
 
 
-# Copied from transformers.models.gptj.modeling_gptj.fixed_pos_embedding
-def fixed_pos_embedding(x, seq_dim=1, seq_len=None):
-    dim = x.shape[-1]
-    if seq_len is None:
-        seq_len = x.shape[seq_dim]
-    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2) / dim))
-    sinusoid_inp = (
-        torch.einsum("i , j -> i j", torch.arange(seq_len, dtype=torch.float), inv_freq).to(x.device).float()
-    )
-    return torch.sin(sinusoid_inp), torch.cos(sinusoid_inp)
+# Copied from transformers.models.gptj.modeling_gptj.create_sinusoidal_positions
+def create_sinusoidal_positions(num_pos: int, dim: int) -> torch.Tensor:
+    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, dtype=torch.int64) / dim))
+    sinusoid_inp = torch.einsum("i , j -> i j", torch.arange(num_pos, dtype=torch.int64).float(), inv_freq).float()
+    return torch.cat((torch.sin(sinusoid_inp), torch.cos(sinusoid_inp)), dim=1)
 
 
 # Copied from transformers.models.gptj.modeling_gptj.rotate_every_two
-def rotate_every_two(x):
+def rotate_every_two(x: torch.Tensor) -> torch.Tensor:
     x1 = x[:, :, :, ::2]
     x2 = x[:, :, :, 1::2]
     x = torch.stack((-x2, x1), dim=-1)
     return x.flatten(-2)  # in einsum notation: rearrange(x, '... d j -> ... (d j)')
 
 
-# Copied from transformers.models.gptj.modeling_gptj.duplicate_interleave
-def duplicate_interleave(m):
-    """
-    A simple version of `torch.repeat_interleave` for duplicating a matrix while interleaving the copy.
-    """
-    dim0 = m.shape[0]
-    m = m.view(-1, 1)  # flatten the matrix
-    m = m.repeat(1, 2)  # repeat all elements into the 2nd dimension
-    m = m.view(dim0, -1)  # reshape into a matrix, interleaving the copy
-    return m
-
-
 # Copied from transformers.models.gptj.modeling_gptj.apply_rotary_pos_emb
-def apply_rotary_pos_emb(x, sincos, offset=0):
-    sin, cos = map(lambda t: duplicate_interleave(t)[None, offset : x.shape[1] + offset, None, :], sincos)
-    # einsum notation for lambda t: repeat(t[offset:x.shape[1]+offset,:], "n d -> () n () (d j)", j=2)
-    return (x * cos) + (rotate_every_two(x) * sin)
+def apply_rotary_pos_emb(tensor: torch.Tensor, sin: torch.Tensor, cos: torch.Tensor) -> torch.Tensor:
+    sin = torch.repeat_interleave(sin[:, :, None, :], 2, 3)
+    cos = torch.repeat_interleave(cos[:, :, None, :], 2, 3)
+    return (tensor * cos) + (rotate_every_two(tensor) * sin)
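To make the new rotary helpers easier to follow, here is a self-contained shape check that mirrors how `forward` indexes `embed_positions` with `position_ids`. Values are random and the `[batch, seq, heads, head_dim]` layout used by CodeGen before the permute is assumed.

```python
import torch

def create_sinusoidal_positions(num_pos, dim):
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, dtype=torch.int64) / dim))
    sinusoid_inp = torch.einsum("i , j -> i j", torch.arange(num_pos, dtype=torch.int64).float(), inv_freq).float()
    return torch.cat((torch.sin(sinusoid_inp), torch.cos(sinusoid_inp)), dim=1)

def rotate_every_two(x):
    x1, x2 = x[..., ::2], x[..., 1::2]
    return torch.stack((-x2, x1), dim=-1).flatten(-2)

def apply_rotary_pos_emb(tensor, sin, cos):
    sin = torch.repeat_interleave(sin[:, :, None, :], 2, 3)
    cos = torch.repeat_interleave(cos[:, :, None, :], 2, 3)
    return (tensor * cos) + (rotate_every_two(tensor) * sin)

batch, seq, heads, rotary_dim = 2, 5, 4, 8
embed_positions = create_sinusoidal_positions(2048, rotary_dim)   # [2048, rotary_dim]
position_ids = torch.arange(seq).unsqueeze(0).expand(batch, -1)   # [batch, seq]
sincos = embed_positions[position_ids]                            # [batch, seq, rotary_dim]
sin, cos = torch.split(sincos, sincos.shape[-1] // 2, dim=-1)     # each [batch, seq, rotary_dim // 2]

q = torch.randn(batch, seq, heads, rotary_dim)
q_rot = apply_rotary_pos_emb(q, sin, cos)
print(q_rot.shape)  # torch.Size([2, 5, 4, 8])
```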
 
 
 class CodeGenAttention(nn.Module):
@@ -97,9 +80,10 @@ def __init__(self, config):
         max_positions = config.max_position_embeddings
         self.register_buffer(
             "causal_mask",
-            torch.tril(torch.ones((max_positions, max_positions), dtype=torch.uint8)).view(
+            torch.tril(torch.ones((max_positions, max_positions), dtype=torch.bool)).view(
                 1, 1, max_positions, max_positions
             ),
+            persistent=False,
         )
 
         self.attn_dropout = nn.Dropout(config.attn_pdrop)
@@ -117,9 +101,9 @@ def __init__(self, config):
         self.qkv_proj = nn.Linear(self.embed_dim, self.embed_dim * 3, bias=False)
 
         self.out_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
-        self.rotary_dim = None
-        if config.rotary_dim is not None:
-            self.rotary_dim = config.rotary_dim
+        self.rotary_dim = config.rotary_dim
+        pos_embd_dim = self.rotary_dim or self.embed_dim
+        self.embed_positions = create_sinusoidal_positions(max_positions, pos_embd_dim)
 
     def _split_heads(self, x, n_head, dim_head, mp_num):
         reshaped = x.reshape(x.shape[:-1] + (n_head // mp_num, dim_head))
@@ -183,8 +167,9 @@ def _attn(
     def forward(
         self,
         hidden_states: Optional[torch.FloatTensor],
-        attention_mask: Optional[torch.FloatTensor] = None,
         layer_past: Optional[Tuple[torch.Tensor]] = None,
+        attention_mask: Optional[torch.FloatTensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
         head_mask: Optional[torch.FloatTensor] = None,
         use_cache: Optional[bool] = False,
         output_attentions: Optional[bool] = False,
@@ -205,12 +190,13 @@ def forward(
         value = self._split_heads(value, self.num_attention_heads, self.head_dim, mp_num=mp_num)
         value = value.permute(0, 2, 1, 3)
 
-        seq_len = key.shape[1]
-        offset = 0
+        embed_positions = self.embed_positions
+        if embed_positions.device != position_ids.device:
+            embed_positions = embed_positions.to(position_ids.device)
+            self.embed_positions = embed_positions
 
-        if layer_past is not None:
-            offset = layer_past[0].shape[-2]
-            seq_len += offset
+        sincos = embed_positions[position_ids]
+        sin, cos = torch.split(sincos, sincos.shape[-1] // 2, dim=-1)
 
         if self.rotary_dim is not None:
             k_rot = key[:, :, :, : self.rotary_dim]
@@ -219,16 +205,14 @@ def forward(
             q_rot = query[:, :, :, : self.rotary_dim]
             q_pass = query[:, :, :, self.rotary_dim :]
 
-            sincos = fixed_pos_embedding(k_rot, 1, seq_len=seq_len)
-            k_rot = apply_rotary_pos_emb(k_rot, sincos, offset=offset)
-            q_rot = apply_rotary_pos_emb(q_rot, sincos, offset=offset)
+            k_rot = apply_rotary_pos_emb(k_rot, sin, cos)
+            q_rot = apply_rotary_pos_emb(q_rot, sin, cos)
 
             key = torch.cat([k_rot, k_pass], dim=-1)
             query = torch.cat([q_rot, q_pass], dim=-1)
         else:
-            sincos = fixed_pos_embedding(key, 1, seq_len=seq_len)
-            key = apply_rotary_pos_emb(key, sincos, offset=offset)
-            query = apply_rotary_pos_emb(query, sincos, offset=offset)
+            key = apply_rotary_pos_emb(key, sin, cos)
+            query = apply_rotary_pos_emb(query, sin, cos)
 
         key = key.permute(0, 2, 1, 3)
         query = query.permute(0, 2, 1, 3)
@@ -240,7 +224,9 @@ def forward(
             value = torch.cat((past_value, value), dim=-2)
 
         if use_cache is True:
-            present = (key, value)
+            # Note that this cast is quite ugly, but is not implemented before ROPE as k_rot in the original codebase is always in fp32.
+            # Reference: https://github.com/salesforce/CodeGen/blob/f210c3bb1216c975ad858cd4132c0fdeabf4bfc2/codegen1/jaxformer/hf/codegen/modeling_codegen.py#L38
+            present = (key.to(hidden_states.dtype), value)
         else:
             present = None
 
@@ -292,6 +278,7 @@ def forward(
         hidden_states: Optional[torch.FloatTensor],
         layer_past: Optional[Tuple[torch.Tensor]] = None,
         attention_mask: Optional[torch.FloatTensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
         head_mask: Optional[torch.FloatTensor] = None,
         use_cache: Optional[bool] = False,
         output_attentions: Optional[bool] = False,
@@ -299,9 +286,10 @@ def forward(
         residual = hidden_states
         hidden_states = self.ln_1(hidden_states)
         attn_outputs = self.attn(
-            hidden_states,
+            hidden_states=hidden_states,
             layer_past=layer_past,
             attention_mask=attention_mask,
+            position_ids=position_ids,
             head_mask=head_mask,
             use_cache=use_cache,
             output_attentions=output_attentions,
@@ -330,6 +318,7 @@ class CodeGenPreTrainedModel(PreTrainedModel):
     base_model_prefix = "transformer"
     supports_gradient_checkpointing = True
     _no_split_modules = ["CodeGenBlock"]
+    _skip_keys_device_placement = "past_key_values"
 
     def __init__(self, *inputs, **kwargs):
         super().__init__(*inputs, **kwargs)
@@ -350,10 +339,6 @@ def _init_weights(self, module):
             module.bias.data.zero_()
             module.weight.data.fill_(1.0)
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, CodeGenModel):
-            module.gradient_checkpointing = value
-
 
 CODEGEN_START_DOCSTRING = r"""
     This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) sub-class. Use
@@ -473,6 +458,7 @@ def forward(
         if input_ids is not None and inputs_embeds is not None:
             raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
         elif input_ids is not None:
+            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
             input_shape = input_ids.size()
             input_ids = input_ids.view(-1, input_shape[-1])
             batch_size = input_ids.shape[0]
@@ -487,9 +473,6 @@ def forward(
         if token_type_ids is not None:
             token_type_ids = token_type_ids.view(-1, input_shape[-1])
 
-        if position_ids is not None:
-            position_ids = position_ids.view(-1, input_shape[-1])
-
         if past_key_values is None:
             past_length = 0
             past_key_values = tuple([None] * len(self.h))
@@ -498,7 +481,7 @@ def forward(
 
         if position_ids is None:
             position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device)
-            position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])
+            position_ids = position_ids.unsqueeze(0)
 
         # Attention mask.
         if attention_mask is not None:
@@ -539,6 +522,14 @@ def forward(
 
         output_shape = input_shape + (hidden_states.size(-1),)
 
+        if self.gradient_checkpointing and self.training:
+            if use_cache:
+                logger.warning_once(
+                    "`use_cache=True` is incompatible with `config.gradient_checkpointing=True`. Setting "
+                    "`use_cache=False`..."
+                )
+                use_cache = False
+
         presents = () if use_cache else None
         all_self_attentions = () if output_attentions else None
         all_hidden_states = () if output_hidden_states else None
@@ -547,32 +538,22 @@ def forward(
                 all_hidden_states = all_hidden_states + (hidden_states,)
 
             if self.gradient_checkpointing and self.training:
-                if use_cache:
-                    logger.warning(
-                        "`use_cache=True` is incompatible with `config.gradient_checkpointing=True`. Setting "
-                        "`use_cache=False`..."
-                    )
-                    use_cache = False
-
-                def create_custom_forward(module):
-                    def custom_forward(*inputs):
-                        # None for past_key_value
-                        return module(*inputs, use_cache, output_attentions)
-
-                    return custom_forward
-
-                outputs = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(block),
+                outputs = self._gradient_checkpointing_func(
+                    block.__call__,
                     hidden_states,
                     None,
                     attention_mask,
+                    position_ids,
                     head_mask[i],
+                    use_cache,
+                    output_attentions,
                 )
             else:
                 outputs = block(
-                    hidden_states,
+                    hidden_states=hidden_states,
                     layer_past=layer_past,
                     attention_mask=attention_mask,
+                    position_ids=position_ids,
                     head_mask=head_mask[i],
                     use_cache=use_cache,
                     output_attentions=output_attentions,
@@ -610,7 +591,7 @@ def custom_forward(*inputs):
     CODEGEN_START_DOCSTRING,
 )
 class CodeGenForCausalLM(CodeGenPreTrainedModel):
-    _keys_to_ignore_on_load_missing = [r"h\.\d+\.attn\.causal_mask"]
+    _tied_weights_keys = ["lm_head.weight"]
 
     def __init__(self, config):
         super().__init__(config)
@@ -628,11 +609,20 @@ def set_output_embeddings(self, new_embeddings):
 
     def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs):
         token_type_ids = kwargs.get("token_type_ids", None)
-        # only last token for inputs_ids if past is defined in kwargs
+        # Omit tokens covered by past_key_values
         if past_key_values:
-            input_ids = input_ids[:, -1].unsqueeze(-1)
+            past_length = past_key_values[0][0].shape[2]
+
+            # Some generation methods already pass only the last input ID
+            if input_ids.shape[1] > past_length:
+                remove_prefix_length = past_length
+            else:
+                # Default to old behavior: keep only final ID
+                remove_prefix_length = input_ids.shape[1] - 1
+
+            input_ids = input_ids[:, remove_prefix_length:]
             if token_type_ids is not None:
-                token_type_ids = token_type_ids[:, -1].unsqueeze(-1)
+                token_type_ids = token_type_ids[:, -input_ids.shape[1] :]
 
         attention_mask = kwargs.get("attention_mask", None)
         position_ids = kwargs.get("position_ids", None)
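A plain-list sketch of the cache-aware trimming above (hypothetical lengths; in the model, `past_length` comes from `past_key_values[0][0].shape[2]`):

```python
def trim_input_ids(input_ids, past_length):
    if len(input_ids) > past_length:
        remove_prefix_length = past_length         # drop tokens already covered by the cache
    else:
        remove_prefix_length = len(input_ids) - 1  # fall back to keeping only the final id
    return input_ids[remove_prefix_length:]

print(trim_input_ids([5, 6, 7, 8], past_length=3))  # [8]
print(trim_input_ids([9], past_length=3))           # [9]
```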
@@ -642,9 +632,8 @@ def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwarg
             position_ids = attention_mask.long().cumsum(-1) - 1
             position_ids.masked_fill_(attention_mask == 0, 1)
             if past_key_values:
-                position_ids = position_ids[:, -1].unsqueeze(-1)
-        else:
-            position_ids = None
+                position_ids = position_ids[:, -input_ids.shape[1] :]
+
         return {
             "input_ids": input_ids,
             "past_key_values": past_key_values,
@@ -705,6 +694,8 @@ def forward(
 
         loss = None
         if labels is not None:
+            # move labels to correct device to enable model parallelism
+            labels = labels.to(lm_logits.device)
             # Shift so that tokens < n predict n
             shift_logits = lm_logits[..., :-1, :].contiguous()
             shift_labels = labels[..., 1:].contiguous()
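A tiny worked example of the label shift used for the causal LM loss, with made-up logits and labels:

```python
import torch

lm_logits = torch.randn(1, 4, 10)       # [batch, seq, vocab]
labels = torch.tensor([[3, 5, 7, 2]])   # moved to lm_logits.device in the real code

shift_logits = lm_logits[..., :-1, :].contiguous()  # positions 0..2 predict ...
shift_labels = labels[..., 1:].contiguous()         # ... the tokens at positions 1..3
loss = torch.nn.functional.cross_entropy(shift_logits.view(-1, 10), shift_labels.view(-1))
print(shift_logits.shape, shift_labels.shape, loss.item())
```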
diff --git a/src/transformers/models/codegen/tokenization_codegen.py b/src/transformers/models/codegen/tokenization_codegen.py
index c09a816bfbab5c..c79a6d46e4ad34 100644
--- a/src/transformers/models/codegen/tokenization_codegen.py
+++ b/src/transformers/models/codegen/tokenization_codegen.py
@@ -23,7 +23,7 @@
 import numpy as np
 import regex as re
 
-from ...utils import is_tf_available, is_torch_available, logging
+from ...utils import is_tf_available, is_torch_available, logging, to_py_obj
 
 
 if TYPE_CHECKING:
@@ -102,12 +102,14 @@ class CodeGenTokenizer(PreTrainedTokenizer):
     This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
     be encoded differently whether it is at the beginning of the sentence (without space) or not:
 
-    ```
+    ```python
     >>> from transformers import CodeGenTokenizer
+
     >>> tokenizer = CodeGenTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
-    >>> tokenizer("Hello world")['input_ids']
+    >>> tokenizer("Hello world")["input_ids"]
     [15496, 995]
-    >>> tokenizer(" Hello world")['input_ids']
+
+    >>> tokenizer(" Hello world")["input_ids"]
     [18435, 995]
     ```
 
@@ -131,16 +133,20 @@ class CodeGenTokenizer(PreTrainedTokenizer):
         errors (`str`, *optional*, defaults to `"replace"`):
             Paradigm to follow when decoding bytes to UTF-8. See
             [bytes.decode](https://docs.python.org/3/library/stdtypes.html#bytes.decode) for more information.
-        unk_token (`str`, *optional*, defaults to `<|endoftext|>`):
+        unk_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
-        bos_token (`str`, *optional*, defaults to `<|endoftext|>`):
+        bos_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
             The beginning of sequence token.
-        eos_token (`str`, *optional*, defaults to `<|endoftext|>`):
+        eos_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
             The end of sequence token.
+        pad_token (`str`, *optional*):
+            The token used for padding, for example when batching sequences of different lengths.
         add_prefix_space (`bool`, *optional*, defaults to `False`):
             Whether or not to add an initial space to the input. This allows to treat the leading word just as any
             other word. (CodeGen tokenizer detect beginning of words by the preceding space).
+        add_bos_token (`bool`, *optional*, defaults to `False`):
+            Whether to add a beginning of sequence token at the start of sequences.
     """
 
     vocab_files_names = VOCAB_FILES_NAMES
@@ -161,20 +167,10 @@ def __init__(
         add_bos_token=False,
         **kwargs,
     ):
-        bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
-        eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
-        unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
-        pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
-        super().__init__(
-            errors=errors,
-            unk_token=unk_token,
-            bos_token=bos_token,
-            eos_token=eos_token,
-            pad_token=pad_token,
-            add_prefix_space=add_prefix_space,
-            add_bos_token=add_bos_token,
-            **kwargs,
-        )
+        bos_token = AddedToken(bos_token, special=True) if isinstance(bos_token, str) else bos_token
+        eos_token = AddedToken(eos_token, special=True) if isinstance(eos_token, str) else eos_token
+        unk_token = AddedToken(unk_token, special=True) if isinstance(unk_token, str) else unk_token
+        pad_token = AddedToken(pad_token, special=True) if isinstance(pad_token, str) else pad_token
         self.add_bos_token = add_bos_token
 
         with open(vocab_file, encoding="utf-8") as vocab_handle:
@@ -192,6 +188,16 @@ def __init__(
 
         # Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
         self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
+        super().__init__(
+            errors=errors,
+            unk_token=unk_token,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            pad_token=pad_token,
+            add_prefix_space=add_prefix_space,
+            add_bos_token=add_bos_token,
+            **kwargs,
+        )
 
     @property
     def vocab_size(self):
@@ -318,7 +324,7 @@ def decode(
         self,
         token_ids: Union[int, List[int], "np.ndarray", "torch.Tensor", "tf.Tensor"],
         skip_special_tokens: bool = False,
-        clean_up_tokenization_spaces: bool = True,
+        clean_up_tokenization_spaces: bool = None,
         truncate_before_pattern: Optional[List[str]] = None,
         **kwargs,
     ) -> str:
@@ -333,8 +339,9 @@ def decode(
                 List of tokenized input ids. Can be obtained using the `__call__` method.
             skip_special_tokens (`bool`, *optional*, defaults to `False`):
                 Whether or not to remove special tokens in the decoding.
-            clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`):
-                Whether or not to clean up the tokenization spaces.
+            clean_up_tokenization_spaces (`bool`, *optional*):
+                Whether or not to clean up the tokenization spaces. If `None`, will default to
+                `self.clean_up_tokenization_spaces` (available in the `tokenizer_config`).
             truncate_before_pattern (`List[str]`, *optional*, defaults to `None`):
                 A list of regular expression strings that will be used to truncate the returned string. This can be
                 used to remove extra pieces of code (e.g. truncate if observing a comment symbol "#" at the beginning
@@ -345,6 +352,9 @@ def decode(
         Returns:
             `str`: The decoded sentence.
         """
+
+        token_ids = to_py_obj(token_ids)
+
         decoded_text = super()._decode(
             token_ids=token_ids,
             skip_special_tokens=skip_special_tokens,
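A hedged usage sketch of `decode` after the `to_py_obj` change, showing that a framework tensor can be passed directly; the checkpoint name is assumed and the regex pattern is illustrative.

```python
import torch
from transformers import CodeGenTokenizer

tok = CodeGenTokenizer.from_pretrained("Salesforce/codegen-350M-mono")

ids = torch.tensor(tok("def hello():\n    print('hi')\n\n\n# trailing junk")["input_ids"])
# `to_py_obj` converts the tensor to a plain list before the regex-based truncation runs.
text = tok.decode(ids, truncate_before_pattern=[r"\n\n\n"])
print(text)
```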
diff --git a/src/transformers/models/codegen/tokenization_codegen_fast.py b/src/transformers/models/codegen/tokenization_codegen_fast.py
index 332f0ed934acad..3c2661db396162 100644
--- a/src/transformers/models/codegen/tokenization_codegen_fast.py
+++ b/src/transformers/models/codegen/tokenization_codegen_fast.py
@@ -68,12 +68,14 @@ class CodeGenTokenizerFast(PreTrainedTokenizerFast):
     This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
     be encoded differently whether it is at the beginning of the sentence (without space) or not:
 
-    ```
+    ```python
     >>> from transformers import CodeGenTokenizerFast
+
     >>> tokenizer = CodeGenTokenizerFast.from_pretrained("Salesforce/codegen-350M-mono")
-    >>> tokenizer("Hello world")['input_ids']
+    >>> tokenizer("Hello world")["input_ids"]
     [15496, 995]
-    >>> tokenizer(" Hello world")['input_ids']
+
+    >>> tokenizer(" Hello world")["input_ids"]
     [18435, 995]
     ```
 
@@ -90,25 +92,23 @@ class CodeGenTokenizerFast(PreTrainedTokenizerFast):
     refer to this superclass for more information regarding those methods.
 
     Args:
-        vocab_file (`str`):
+        vocab_file (`str`, *optional*):
             Path to the vocabulary file.
-        merges_file (`str`):
+        merges_file (`str`, *optional*):
             Path to the merges file.
-        errors (`str`, *optional*, defaults to `"replace"`):
-            Paradigm to follow when decoding bytes to UTF-8. See
-            [bytes.decode](https://docs.python.org/3/library/stdtypes.html#bytes.decode) for more information.
-        unk_token (`str`, *optional*, defaults to `<|endoftext|>`):
+        tokenizer_file (`str`, *optional*):
+            Path to [tokenizers](https://github.com/huggingface/tokenizers) file (generally has a .json extension) that
+            contains everything needed to load the tokenizer.
+        unk_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
             The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
             token instead.
-        bos_token (`str`, *optional*, defaults to `<|endoftext|>`):
+        bos_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
             The beginning of sequence token.
-        eos_token (`str`, *optional*, defaults to `<|endoftext|>`):
+        eos_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
             The end of sequence token.
         add_prefix_space (`bool`, *optional*, defaults to `False`):
             Whether or not to add an initial space to the input. This allows to treat the leading word just as any
             other word. (CodeGen tokenizer detect beginning of words by the preceding space).
-        trim_offsets (`bool`, *optional*, defaults to `True`):
-            Whether or not the post-processing step should trim offsets to avoid including whitespaces.
     """
 
     vocab_files_names = VOCAB_FILES_NAMES
@@ -142,7 +142,7 @@ def __init__(
         if kwargs.pop("add_bos_token", False):
             model_id = kwargs.pop("name_or_path", "")
             raise ValueError(
-                "Currenty GPT2's fast tokenizer does NOT support adding a BOS token."
+                "Currenty GPT2's fast tokenizer does NOT support adding a BOS token. "
                 "Instead you should use GPT2's slow tokenizer class `CodeGenTokenizer` as follows: \n"
                 f"`CodeGenTokenizer.from_pretrained('{model_id}')`\nor\n"
                 f"`AutoTokenizer.from_pretrained('{model_id}', use_fast=False)`\n"
@@ -185,7 +185,7 @@ def decode(
         self,
         token_ids: Union[int, List[int], "np.ndarray", "torch.Tensor", "tf.Tensor"],
         skip_special_tokens: bool = False,
-        clean_up_tokenization_spaces: bool = True,
+        clean_up_tokenization_spaces: bool = None,
         truncate_before_pattern: Optional[List[str]] = None,
         **kwargs,
     ) -> str:
@@ -200,8 +200,9 @@ def decode(
                 List of tokenized input ids. Can be obtained using the `__call__` method.
             skip_special_tokens (`bool`, *optional*, defaults to `False`):
                 Whether or not to remove special tokens in the decoding.
-            clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`):
-                Whether or not to clean up the tokenization spaces.
+            clean_up_tokenization_spaces (`bool`, *optional*):
+                Whether or not to clean up the tokenization spaces. If `None`, will default to
+                `self.clean_up_tokenization_spaces` (available in the `tokenizer_config`).
             truncate_before_pattern (`List[str]`, *optional*, defaults to `None`):
                 A list of regular expression strings that will be used to truncate the returned string. This can be
                 used to remove extra pieces of code (e.g. truncate if observing a comment symbol "#" at the beginning
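A hedged sketch of the restriction behind the reworded error message above: the fast CodeGen tokenizer still rejects `add_bos_token`, so callers are expected to fall back to the slow class (checkpoint name reused from the docstring).

```python
from transformers import CodeGenTokenizer, CodeGenTokenizerFast

try:
    tok = CodeGenTokenizerFast.from_pretrained("Salesforce/codegen-350M-mono", add_bos_token=True)
except ValueError:
    # The fast tokenizer raises; the slow tokenizer does support `add_bos_token`.
    tok = CodeGenTokenizer.from_pretrained("Salesforce/codegen-350M-mono", add_bos_token=True)

# Equivalently: AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono",
#                                             use_fast=False, add_bos_token=True)
```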
diff --git a/src/transformers/models/conditional_detr/__init__.py b/src/transformers/models/conditional_detr/__init__.py
index da5ed8f2f8ed01..565323321160ff 100644
--- a/src/transformers/models/conditional_detr/__init__.py
+++ b/src/transformers/models/conditional_detr/__init__.py
@@ -14,7 +14,7 @@
 
 from typing import TYPE_CHECKING
 
-from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_timm_available, is_vision_available
+from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available
 
 
 _import_structure = {
@@ -35,7 +35,7 @@
     _import_structure["image_processing_conditional_detr"] = ["ConditionalDetrImageProcessor"]
 
 try:
-    if not is_timm_available():
+    if not is_torch_available():
         raise OptionalDependencyNotAvailable()
 except OptionalDependencyNotAvailable:
     pass
@@ -66,7 +66,7 @@
         from .image_processing_conditional_detr import ConditionalDetrImageProcessor
 
     try:
-        if not is_timm_available():
+        if not is_torch_available():
             raise OptionalDependencyNotAvailable()
     except OptionalDependencyNotAvailable:
         pass
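A small self-contained sketch of the lazy-import guard these hunks switch over: the image processor and modeling classes only require `torch`, so the gate is `is_torch_available` rather than `is_timm_available`.

```python
from transformers.utils import OptionalDependencyNotAvailable, is_torch_available

_import_structure = {"configuration_conditional_detr": ["ConditionalDetrConfig"]}

try:
    if not is_torch_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    pass  # torch missing: the torch-only classes are simply not registered
else:
    _import_structure["modeling_conditional_detr"] = ["ConditionalDetrForObjectDetection"]
```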
diff --git a/src/transformers/models/conditional_detr/configuration_conditional_detr.py b/src/transformers/models/conditional_detr/configuration_conditional_detr.py
index ec04f0a52369fa..7a6cd436385852 100644
--- a/src/transformers/models/conditional_detr/configuration_conditional_detr.py
+++ b/src/transformers/models/conditional_detr/configuration_conditional_detr.py
@@ -13,7 +13,6 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """ Conditional DETR model configuration"""
-
 from collections import OrderedDict
 from typing import Mapping
 
@@ -94,11 +93,14 @@ class ConditionalDetrConfig(PretrainedConfig):
         position_embedding_type (`str`, *optional*, defaults to `"sine"`):
             Type of position embeddings to be used on top of the image features. One of `"sine"` or `"learned"`.
         backbone (`str`, *optional*, defaults to `"resnet50"`):
-            Name of convolutional backbone to use in case `use_timm_backbone` = `True`. Supports any convolutional
-            backbone from the timm package. For a list of all available models, see [this
-            page](https://rwightman.github.io/pytorch-image-models/#load-a-pretrained-model).
+            Name of backbone to use when `backbone_config` is `None`. If `use_pretrained_backbone` is `True`, this
+            will load the corresponding pretrained weights from the timm or transformers library. If `use_pretrained_backbone`
+            is `False`, this loads the backbone's config and uses that to initialize the backbone with random weights.
         use_pretrained_backbone (`bool`, *optional*, defaults to `True`):
-            Whether to use pretrained weights for the backbone. Only supported when `use_timm_backbone` = `True`.
+            Whether to use pretrained weights for the backbone.
+        backbone_kwargs (`dict`, *optional*):
+            Keyword arguments to be passed to AutoBackbone when loading from a checkpoint
+            e.g. `{'out_indices': (0, 1, 2, 3)}`. Cannot be specified if `backbone_config` is set.
         dilation (`bool`, *optional*, defaults to `False`):
             Whether to replace stride with dilation in the last convolutional block (DC5). Only supported when
             `use_timm_backbone` = `True`.
@@ -135,6 +137,7 @@ class ConditionalDetrConfig(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "conditional_detr"
     keys_to_ignore_at_inference = ["past_key_values"]
     attribute_map = {
@@ -168,6 +171,7 @@ def __init__(
         position_embedding_type="sine",
         backbone="resnet50",
         use_pretrained_backbone=True,
+        backbone_kwargs=None,
         dilation=False,
         class_cost=2,
         bbox_cost=5,
@@ -180,9 +184,20 @@ def __init__(
         focal_alpha=0.25,
         **kwargs,
     ):
+        if not use_timm_backbone and use_pretrained_backbone:
+            raise ValueError(
+                "Loading pretrained backbone weights from the transformers library is not supported yet. `use_timm_backbone` must be set to `True` when `use_pretrained_backbone=True`"
+            )
+
+        if backbone_config is not None and backbone is not None:
+            raise ValueError("You can't specify both `backbone` and `backbone_config`.")
+
         if backbone_config is not None and use_timm_backbone:
             raise ValueError("You can't specify both `backbone_config` and `use_timm_backbone`.")
 
+        if backbone_kwargs is not None and backbone_kwargs and backbone_config is not None:
+            raise ValueError("You can't specify both `backbone_kwargs` and `backbone_config`.")
+
         if not use_timm_backbone:
             if backbone_config is None:
                 logger.info("`backbone_config` is `None`. Initializing the config with the default `ResNet` backbone.")
@@ -216,6 +231,7 @@ def __init__(
         self.position_embedding_type = position_embedding_type
         self.backbone = backbone
         self.use_pretrained_backbone = use_pretrained_backbone
+        self.backbone_kwargs = backbone_kwargs
         self.dilation = dilation
         # Hungarian matcher
         self.class_cost = class_cost
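A hedged usage sketch of the new `backbone_kwargs` argument and the added validation, assuming the timm-backbone defaults shown in this hunk; the `out_indices` value is illustrative.

```python
from transformers import ConditionalDetrConfig

config = ConditionalDetrConfig(
    backbone="resnet50",
    use_timm_backbone=True,
    use_pretrained_backbone=True,
    backbone_kwargs={"out_indices": (1, 2, 3, 4)},  # forwarded to the backbone loader
)

# The new checks fail fast on conflicting options, e.g. passing both
# `backbone_config` and `backbone_kwargs` now raises a ValueError.
```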
diff --git a/src/transformers/models/conditional_detr/convert_conditional_detr_original_pytorch_checkpoint_to_pytorch.py b/src/transformers/models/conditional_detr/convert_conditional_detr_original_pytorch_checkpoint_to_pytorch.py
index 083b3c681ec22d..b1a1b1c817ae70 100644
--- a/src/transformers/models/conditional_detr/convert_conditional_detr_original_pytorch_checkpoint_to_pytorch.py
+++ b/src/transformers/models/conditional_detr/convert_conditional_detr_original_pytorch_checkpoint_to_pytorch.py
@@ -27,9 +27,9 @@
 
 from transformers import (
     ConditionalDetrConfig,
-    ConditionalDetrFeatureExtractor,
     ConditionalDetrForObjectDetection,
     ConditionalDetrForSegmentation,
+    ConditionalDetrImageProcessor,
 )
 from transformers.utils import logging
 
@@ -244,13 +244,13 @@ def convert_conditional_detr_checkpoint(model_name, pytorch_dump_folder_path):
         config.id2label = id2label
         config.label2id = {v: k for k, v in id2label.items()}
 
-    # load feature extractor
+    # load image processor
     format = "coco_panoptic" if is_panoptic else "coco_detection"
-    feature_extractor = ConditionalDetrFeatureExtractor(format=format)
+    image_processor = ConditionalDetrImageProcessor(format=format)
 
     # prepare image
     img = prepare_img()
-    encoding = feature_extractor(images=img, return_tensors="pt")
+    encoding = image_processor(images=img, return_tensors="pt")
     pixel_values = encoding["pixel_values"]
 
     logger.info(f"Converting model {model_name}...")
@@ -302,11 +302,11 @@ def convert_conditional_detr_checkpoint(model_name, pytorch_dump_folder_path):
     if is_panoptic:
         assert torch.allclose(outputs.pred_masks, original_outputs["pred_masks"], atol=1e-4)
 
-    # Save model and feature extractor
-    logger.info(f"Saving PyTorch model and feature extractor to {pytorch_dump_folder_path}...")
+    # Save model and image processor
+    logger.info(f"Saving PyTorch model and image processor to {pytorch_dump_folder_path}...")
     Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
     model.save_pretrained(pytorch_dump_folder_path)
-    feature_extractor.save_pretrained(pytorch_dump_folder_path)
+    image_processor.save_pretrained(pytorch_dump_folder_path)
 
 
 if __name__ == "__main__":
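A sketch of the converted preprocessing step, mirroring the script above; the COCO sample URL is the one commonly used in the repository's examples and is an assumption here.

```python
import requests
from PIL import Image
from transformers import ConditionalDetrImageProcessor

image_processor = ConditionalDetrImageProcessor(format="coco_detection")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
img = Image.open(requests.get(url, stream=True).raw)

encoding = image_processor(images=img, return_tensors="pt")
pixel_values = encoding["pixel_values"]  # (1, 3, H, W) after resize, rescale, normalize, pad
```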
diff --git a/src/transformers/models/conditional_detr/feature_extraction_conditional_detr.py b/src/transformers/models/conditional_detr/feature_extraction_conditional_detr.py
index 2af959e8a991f3..bfdec373f865c5 100644
--- a/src/transformers/models/conditional_detr/feature_extraction_conditional_detr.py
+++ b/src/transformers/models/conditional_detr/feature_extraction_conditional_detr.py
@@ -16,6 +16,7 @@
 
 import warnings
 
+from ...image_transforms import rgb_to_id as _rgb_to_id
 from ...utils import logging
 from .image_processing_conditional_detr import ConditionalDetrImageProcessor
 
@@ -23,6 +24,15 @@
 logger = logging.get_logger(__name__)
 
 
+def rgb_to_id(x):
+    warnings.warn(
+        "rgb_to_id has moved and will not be importable from this module from v5. "
+        "Please import from transformers.image_transforms instead.",
+        FutureWarning,
+    )
+    return _rgb_to_id(x)
+
+
 class ConditionalDetrFeatureExtractor(ConditionalDetrImageProcessor):
     def __init__(self, *args, **kwargs) -> None:
         warnings.warn(
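The shim keeps the old import location working while emitting a `FutureWarning`; new code should import from `transformers.image_transforms` directly, as in this short sketch:

```python
import numpy as np
from transformers.image_transforms import rgb_to_id  # canonical location going forward

rgb = np.zeros((2, 2, 3), dtype=np.uint8)
rgb[..., 0] = 7              # segment id 7 encoded in the red channel
ids = rgb_to_id(rgb)         # (2, 2) array of panoptic segment ids, all equal to 7
```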
diff --git a/src/transformers/models/conditional_detr/image_processing_conditional_detr.py b/src/transformers/models/conditional_detr/image_processing_conditional_detr.py
index d4e2f9dd5f3ac2..0af79bbcb93efa 100644
--- a/src/transformers/models/conditional_detr/image_processing_conditional_detr.py
+++ b/src/transformers/models/conditional_detr/image_processing_conditional_detr.py
@@ -16,41 +16,43 @@
 
 import io
 import pathlib
-import warnings
 from collections import defaultdict
 from typing import Any, Callable, Dict, Iterable, List, Optional, Set, Tuple, Union
 
 import numpy as np
 
-from transformers.feature_extraction_utils import BatchFeature
-from transformers.image_processing_utils import BaseImageProcessor, get_size_dict
-from transformers.image_transforms import (
+from ...feature_extraction_utils import BatchFeature
+from ...image_processing_utils import BaseImageProcessor, get_size_dict
+from ...image_transforms import (
     PaddingMode,
     center_to_corners_format,
     corners_to_center_format,
     id_to_rgb,
-    normalize,
     pad,
     rescale,
     resize,
     rgb_to_id,
     to_channel_dimension_format,
 )
-from transformers.image_utils import (
+from ...image_utils import (
     IMAGENET_DEFAULT_MEAN,
     IMAGENET_DEFAULT_STD,
+    AnnotationFormat,
+    AnnotationType,
     ChannelDimension,
     ImageInput,
     PILImageResampling,
     get_image_size,
     infer_channel_dimension_format,
+    is_scaled_image,
     make_list_of_images,
     to_numpy_array,
-    valid_coco_detection_annotations,
-    valid_coco_panoptic_annotations,
     valid_images,
+    validate_annotations,
+    validate_preprocess_arguments,
 )
-from transformers.utils import (
+from ...utils import (
+    TensorType,
     is_flax_available,
     is_jax_tensor,
     is_scipy_available,
@@ -59,8 +61,8 @@
     is_torch_available,
     is_torch_tensor,
     is_vision_available,
+    logging,
 )
-from transformers.utils.generic import ExplicitEnum, TensorType
 
 
 if is_torch_available():
@@ -77,15 +79,10 @@
     import scipy.stats
 
 
-AnnotationType = Dict[str, Union[int, str, List[Dict]]]
+logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
 
 
-class AnnotionFormat(ExplicitEnum):
-    COCO_DETECTION = "coco_detection"
-    COCO_PANOPTIC = "coco_panoptic"
-
-
-SUPPORTED_ANNOTATION_FORMATS = (AnnotionFormat.COCO_DETECTION, AnnotionFormat.COCO_PANOPTIC)
+SUPPORTED_ANNOTATION_FORMATS = (AnnotationFormat.COCO_DETECTION, AnnotationFormat.COCO_PANOPTIC)
 
 
 # Copied from transformers.models.detr.image_processing_detr.get_size_with_aspect_ratio
@@ -122,7 +119,10 @@ def get_size_with_aspect_ratio(image_size, size, max_size=None) -> Tuple[int, in
 
 # Copied from transformers.models.detr.image_processing_detr.get_resize_output_image_size
 def get_resize_output_image_size(
-    input_image: np.ndarray, size: Union[int, Tuple[int, int], List[int]], max_size: Optional[int] = None
+    input_image: np.ndarray,
+    size: Union[int, Tuple[int, int], List[int]],
+    max_size: Optional[int] = None,
+    input_data_format: Optional[Union[str, ChannelDimension]] = None,
 ) -> Tuple[int, int]:
     """
     Computes the output image size given the input image size and the desired output size. If the desired output size
@@ -130,14 +130,16 @@ def get_resize_output_image_size(
     image size is computed by keeping the aspect ratio of the input image size.
 
     Args:
-        image_size (`Tuple[int, int]`):
-            The input image size.
-        size (`int`):
+        input_image (`np.ndarray`):
+            The image to resize.
+        size (`int` or `Tuple[int, int]` or `List[int]`):
             The desired output size.
         max_size (`int`, *optional*):
             The maximum allowed output size.
+        input_data_format (`ChannelDimension` or `str`, *optional*):
+            The channel dimension format of the input image. If not provided, it will be inferred from the input image.
     """
-    image_size = get_image_size(input_image)
+    image_size = get_image_size(input_image, input_data_format)
     if isinstance(size, (list, tuple)):
         return size
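For reference, an illustrative sketch (not the library implementation) of the aspect-ratio rule this function follows for `shortest_edge`/`longest_edge` style sizes: scale so the short side reaches `size`, capped so the long side never exceeds `max_size`.

```python
def _output_size(height: int, width: int, size: int = 800, max_size: int = 1333):
    short, long = min(height, width), max(height, width)
    if max_size is not None and size / short * long > max_size:
        size = int(round(max_size * short / long))  # shrink target so the long side fits
    if height <= width:
        return size, int(size * width / height)
    return int(size * height / width), size


print(_output_size(480, 640))  # (800, 1066): short side -> 800, aspect ratio preserved
```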
 
@@ -207,23 +209,28 @@ def max_across_indices(values: Iterable[Any]) -> List[Any]:
 
 
 # Copied from transformers.models.detr.image_processing_detr.get_max_height_width
-def get_max_height_width(images: List[np.ndarray]) -> List[int]:
+def get_max_height_width(
+    images: List[np.ndarray], input_data_format: Optional[Union[str, ChannelDimension]] = None
+) -> List[int]:
     """
     Get the maximum height and width across all images in a batch.
     """
-    input_channel_dimension = infer_channel_dimension_format(images[0])
+    if input_data_format is None:
+        input_data_format = infer_channel_dimension_format(images[0])
 
-    if input_channel_dimension == ChannelDimension.FIRST:
+    if input_data_format == ChannelDimension.FIRST:
         _, max_height, max_width = max_across_indices([img.shape for img in images])
-    elif input_channel_dimension == ChannelDimension.LAST:
+    elif input_data_format == ChannelDimension.LAST:
         max_height, max_width, _ = max_across_indices([img.shape for img in images])
     else:
-        raise ValueError(f"Invalid channel dimension format: {input_channel_dimension}")
+        raise ValueError(f"Invalid channel dimension format: {input_data_format}")
     return (max_height, max_width)
 
 
 # Copied from transformers.models.detr.image_processing_detr.make_pixel_mask
-def make_pixel_mask(image: np.ndarray, output_size: Tuple[int, int]) -> np.ndarray:
+def make_pixel_mask(
+    image: np.ndarray, output_size: Tuple[int, int], input_data_format: Optional[Union[str, ChannelDimension]] = None
+) -> np.ndarray:
     """
     Make a pixel mask for the image, where 1 indicates a valid pixel and 0 indicates padding.
 
@@ -233,7 +240,7 @@ def make_pixel_mask(image: np.ndarray, output_size: Tuple[int, int]) -> np.ndarr
         output_size (`Tuple[int, int]`):
             Output size of the mask.
     """
-    input_height, input_width = get_image_size(image)
+    input_height, input_width = get_image_size(image, channel_dim=input_data_format)
     mask = np.zeros(output_size, dtype=np.int64)
     mask[:input_height, :input_width] = 1
     return mask
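A tiny numpy sketch of the mask produced above: ones mark real pixels, zeros mark the padding added to reach a hypothetical batch-wide maximum size.

```python
import numpy as np

input_height, input_width = 480, 640  # size of this image
max_height, max_width = 512, 768      # hypothetical batch maxima

mask = np.zeros((max_height, max_width), dtype=np.int64)
mask[:input_height, :input_width] = 1  # valid region; the rest stays 0 (padding)
```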
@@ -275,11 +282,16 @@ def convert_coco_poly_to_mask(segmentations, height: int, width: int) -> np.ndar
 
 
 # Copied from transformers.models.detr.image_processing_detr.prepare_coco_detection_annotation with DETR->ConditionalDetr
-def prepare_coco_detection_annotation(image, target, return_segmentation_masks: bool = False):
+def prepare_coco_detection_annotation(
+    image,
+    target,
+    return_segmentation_masks: bool = False,
+    input_data_format: Optional[Union[ChannelDimension, str]] = None,
+):
     """
     Convert the target in COCO format into the format expected by ConditionalDetr.
     """
-    image_height, image_width = get_image_size(image)
+    image_height, image_width = get_image_size(image, channel_dim=input_data_format)
 
     image_id = target["image_id"]
     image_id = np.asarray([image_id], dtype=np.int64)
@@ -314,10 +326,13 @@ def prepare_coco_detection_annotation(image, target, return_segmentation_masks:
 
     if annotations and "keypoints" in annotations[0]:
         keypoints = [obj["keypoints"] for obj in annotations]
+        # Convert the keypoints list to a numpy array before applying the keep mask
         keypoints = np.asarray(keypoints, dtype=np.float32)
+        # Apply the keep mask here to filter the relevant annotations
+        keypoints = keypoints[keep]
         num_keypoints = keypoints.shape[0]
         keypoints = keypoints.reshape((-1, 3)) if num_keypoints else keypoints
-        new_target["keypoints"] = keypoints[keep]
+        new_target["keypoints"] = keypoints
 
     if return_segmentation_masks:
         segmentation_masks = [obj["segmentation"] for obj in annotations]
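The keypoint change above is a bug fix: the `keep` mask (non-degenerate boxes) is now applied before the `(-1, 3)` reshape. A small numpy sketch of the intended order:

```python
import numpy as np

keypoints = np.arange(2 * 17 * 3, dtype=np.float32).reshape(2, 17 * 3)  # 2 objects, COCO (x, y, visibility) triplets
keep = np.array([True, False])                                          # e.g. the second box was degenerate

keypoints = keypoints[keep]                                             # filter first ...
num_keypoints = keypoints.shape[0]
keypoints = keypoints.reshape((-1, 3)) if num_keypoints else keypoints  # ... then flatten to (17, 3)
```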
@@ -364,12 +379,16 @@ def masks_to_boxes(masks: np.ndarray) -> np.ndarray:
 
 # Copied from transformers.models.detr.image_processing_detr.prepare_coco_panoptic_annotation with DETR->ConditionalDetr
 def prepare_coco_panoptic_annotation(
-    image: np.ndarray, target: Dict, masks_path: Union[str, pathlib.Path], return_masks: bool = True
+    image: np.ndarray,
+    target: Dict,
+    masks_path: Union[str, pathlib.Path],
+    return_masks: bool = True,
+    input_data_format: Union[ChannelDimension, str] = None,
 ) -> Dict:
     """
     Prepare a coco panoptic annotation for ConditionalDetr.
     """
-    image_height, image_width = get_image_size(image)
+    image_height, image_width = get_image_size(image, channel_dim=input_data_format)
     annotation_path = pathlib.Path(masks_path) / target["file_name"]
 
     new_target = {}
@@ -456,8 +475,7 @@ def post_process_panoptic_sample(
     threshold=0.85,
 ) -> Dict:
     """
-    Converts the output of [`ConditionalDetrForSegmentation`] into panoptic segmentation predictions for a single
-    sample.
+    Converts the output of [`ConditionalDetrForSegmentation`] into panoptic segmentation predictions for a single sample.
 
     Args:
         out_logits (`torch.Tensor`):
@@ -604,7 +622,7 @@ def binary_mask_to_rle(mask):
     pixels = np.concatenate([[0], pixels, [0]])
     runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
     runs[1::2] -= runs[::2]
-    return [x for x in runs]
+    return list(runs)
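A small worked example of the run-length encoding computed above (the only change here is returning `list(runs)` instead of a redundant comprehension):

```python
import numpy as np

mask = np.array([[0, 0, 1, 1], [1, 0, 0, 0]])      # toy binary mask
pixels = np.concatenate([[0], mask.flatten(), [0]])
runs = np.where(pixels[1:] != pixels[:-1])[0] + 1  # 1-indexed change points
runs[1::2] -= runs[::2]                            # turn end positions into run lengths
print(list(runs))                                  # [3, 3]: a run of three 1s starting at pixel 3
```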
 
 
 # Copied from transformers.models.detr.image_processing_detr.convert_segmentation_to_rle
@@ -768,9 +786,14 @@ class ConditionalDetrImageProcessor(BaseImageProcessor):
         image_std (`float` or `List[float]`, *optional*, defaults to `IMAGENET_DEFAULT_STD`):
             Standard deviation values to use when normalizing the image. Can be a single value or a list of values, one
             for each channel. Can be overridden by the `image_std` parameter in the `preprocess` method.
+        do_convert_annotations (`bool`, *optional*, defaults to `True`):
+            Controls whether to convert the annotations to the format expected by the DETR model. Converts the
+            bounding boxes to the format `(center_x, center_y, width, height)` and in the range `[0, 1]`.
+            Can be overridden by the `do_convert_annotations` parameter in the `preprocess` method.
         do_pad (`bool`, *optional*, defaults to `True`):
-            Controls whether to pad the image to the largest image in a batch and create a pixel mask. Can be
-            overridden by the `do_pad` parameter in the `preprocess` method.
+            Controls whether to pad the image. Can be overridden by the `do_pad` parameter in the `preprocess`
+            method. If `True`, will pad the images in the batch to the largest height and width in the batch.
+            Padding will be applied to the bottom and right of the image with zeros.
     """
 
     model_input_names = ["pixel_values", "pixel_mask"]
@@ -778,7 +801,7 @@ class ConditionalDetrImageProcessor(BaseImageProcessor):
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.__init__
     def __init__(
         self,
-        format: Union[str, AnnotionFormat] = AnnotionFormat.COCO_DETECTION,
+        format: Union[str, AnnotationFormat] = AnnotationFormat.COCO_DETECTION,
         do_resize: bool = True,
         size: Dict[str, int] = None,
         resample: PILImageResampling = PILImageResampling.BILINEAR,
@@ -787,6 +810,7 @@ def __init__(
         do_normalize: bool = True,
         image_mean: Union[float, List[float]] = None,
         image_std: Union[float, List[float]] = None,
+        do_convert_annotations: Optional[bool] = None,
         do_pad: bool = True,
         **kwargs,
     ) -> None:
@@ -794,10 +818,9 @@ def __init__(
             do_pad = kwargs.pop("pad_and_return_pixel_mask")
 
         if "max_size" in kwargs:
-            warnings.warn(
+            logger.warning_once(
                 "The `max_size` parameter is deprecated and will be removed in v4.26. "
                 "Please specify in `size['longest_edge'] instead`.",
-                FutureWarning,
             )
             max_size = kwargs.pop("max_size")
         else:
@@ -806,6 +829,10 @@ def __init__(
         size = size if size is not None else {"shortest_edge": 800, "longest_edge": 1333}
         size = get_size_dict(size, max_size=max_size, default_to_square=False)
 
+        # Backwards compatibility
+        if do_convert_annotations is None:
+            do_convert_annotations = do_normalize
+
         super().__init__(**kwargs)
         self.format = format
         self.do_resize = do_resize
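A short sketch of the backwards-compatibility default added just above: when `do_convert_annotations` is not passed, it mirrors `do_normalize`, so existing configurations keep their previous behaviour.

```python
from transformers import ConditionalDetrImageProcessor

legacy = ConditionalDetrImageProcessor(do_normalize=False)
print(legacy.do_convert_annotations)    # False, inherited from do_normalize

explicit = ConditionalDetrImageProcessor(do_normalize=False, do_convert_annotations=True)
print(explicit.do_convert_annotations)  # True, the explicit value wins
```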
@@ -814,20 +841,11 @@ def __init__(
         self.do_rescale = do_rescale
         self.rescale_factor = rescale_factor
         self.do_normalize = do_normalize
+        self.do_convert_annotations = do_convert_annotations
         self.image_mean = image_mean if image_mean is not None else IMAGENET_DEFAULT_MEAN
         self.image_std = image_std if image_std is not None else IMAGENET_DEFAULT_STD
         self.do_pad = do_pad
 
-    @property
-    # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.max_size
-    def max_size(self):
-        warnings.warn(
-            "The `max_size` parameter is deprecated and will be removed in v4.27. "
-            "Please specify in `size['longest_edge'] instead`.",
-            FutureWarning,
-        )
-        return self.size["longest_edge"]
-
     @classmethod
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.from_dict with Detr->ConditionalDetr
     def from_dict(cls, image_processor_dict: Dict[str, Any], **kwargs):
@@ -848,31 +866,38 @@ def prepare_annotation(
         self,
         image: np.ndarray,
         target: Dict,
-        format: Optional[AnnotionFormat] = None,
+        format: Optional[AnnotationFormat] = None,
         return_segmentation_masks: bool = None,
         masks_path: Optional[Union[str, pathlib.Path]] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
     ) -> Dict:
         """
         Prepare an annotation for feeding into ConditionalDetr model.
         """
         format = format if format is not None else self.format
 
-        if format == AnnotionFormat.COCO_DETECTION:
+        if format == AnnotationFormat.COCO_DETECTION:
             return_segmentation_masks = False if return_segmentation_masks is None else return_segmentation_masks
-            target = prepare_coco_detection_annotation(image, target, return_segmentation_masks)
-        elif format == AnnotionFormat.COCO_PANOPTIC:
+            target = prepare_coco_detection_annotation(
+                image, target, return_segmentation_masks, input_data_format=input_data_format
+            )
+        elif format == AnnotationFormat.COCO_PANOPTIC:
             return_segmentation_masks = True if return_segmentation_masks is None else return_segmentation_masks
             target = prepare_coco_panoptic_annotation(
-                image, target, masks_path=masks_path, return_masks=return_segmentation_masks
+                image,
+                target,
+                masks_path=masks_path,
+                return_masks=return_segmentation_masks,
+                input_data_format=input_data_format,
             )
         else:
             raise ValueError(f"Format {format} is not supported.")
         return target
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.prepare
-    def prepare(self, image, target, return_segmentation_masks=False, masks_path=None):
-        warnings.warn(
-            "The `prepare` method is deprecated and will be removed in a future version. "
+    def prepare(self, image, target, return_segmentation_masks=None, masks_path=None):
+        logger.warning_once(
+            "The `prepare` method is deprecated and will be removed in a v4.33. "
             "Please use `prepare_annotation` instead. Note: the `prepare_annotation` method "
             "does not return the image anymore.",
         )
@@ -881,17 +906,17 @@ def prepare(self, image, target, return_segmentation_masks=False, masks_path=Non
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.convert_coco_poly_to_mask
     def convert_coco_poly_to_mask(self, *args, **kwargs):
-        warnings.warn("The `convert_coco_poly_to_mask` method is deprecated and will be removed in a future version. ")
+        logger.warning_once("The `convert_coco_poly_to_mask` method is deprecated and will be removed in v4.33. ")
         return convert_coco_poly_to_mask(*args, **kwargs)
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.prepare_coco_detection with DETR->ConditionalDetr
     def prepare_coco_detection(self, *args, **kwargs):
-        warnings.warn("The `prepare_coco_detection` method is deprecated and will be removed in a future version. ")
+        logger.warning_once("The `prepare_coco_detection` method is deprecated and will be removed in v4.33. ")
         return prepare_coco_detection_annotation(*args, **kwargs)
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.prepare_coco_panoptic
     def prepare_coco_panoptic(self, *args, **kwargs):
-        warnings.warn("The `prepare_coco_panoptic` method is deprecated and will be removed in a future version. ")
+        logger.warning_once("The `prepare_coco_panoptic` method is deprecated and will be removed in v4.33. ")
         return prepare_coco_panoptic_annotation(*args, **kwargs)
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.resize
@@ -901,24 +926,40 @@ def resize(
         size: Dict[str, int],
         resample: PILImageResampling = PILImageResampling.BILINEAR,
         data_format: Optional[ChannelDimension] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
         **kwargs,
     ) -> np.ndarray:
         """
         Resize the image to the given size. Size can be `min_size` (scalar) or `(height, width)` tuple. If size is an
         int, smaller edge of the image will be matched to this number.
+
+        Args:
+            image (`np.ndarray`):
+                Image to resize.
+            size (`Dict[str, int]`):
+                Dictionary containing the size to resize to. Can contain the keys `shortest_edge` and `longest_edge` or
+                `height` and `width`.
+            resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BILINEAR`):
+                Resampling filter to use if resizing the image.
+            data_format (`str` or `ChannelDimension`, *optional*):
+                The channel dimension format for the output image. If unset, the channel dimension format of the input
+                image is used.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format of the input image. If not provided, it will be inferred.
         """
         if "max_size" in kwargs:
-            warnings.warn(
+            logger.warning_once(
                 "The `max_size` parameter is deprecated and will be removed in v4.26. "
                 "Please specify in `size['longest_edge'] instead`.",
-                FutureWarning,
             )
             max_size = kwargs.pop("max_size")
         else:
             max_size = None
         size = get_size_dict(size, max_size=max_size, default_to_square=False)
         if "shortest_edge" in size and "longest_edge" in size:
-            size = get_resize_output_image_size(image, size["shortest_edge"], size["longest_edge"])
+            size = get_resize_output_image_size(
+                image, size["shortest_edge"], size["longest_edge"], input_data_format=input_data_format
+            )
         elif "height" in size and "width" in size:
             size = (size["height"], size["width"])
         else:
@@ -926,7 +967,9 @@ def resize(
                 "Size must contain 'height' and 'width' keys or 'shortest_edge' and 'longest_edge' keys. Got"
                 f" {size.keys()}."
             )
-        image = resize(image, size=size, resample=resample, data_format=data_format)
+        image = resize(
+            image, size=size, resample=resample, data_format=data_format, input_data_format=input_data_format, **kwargs
+        )
         return image
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.resize_annotation
@@ -945,130 +988,195 @@ def resize_annotation(
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.rescale
     def rescale(
-        self, image: np.ndarray, rescale_factor: Union[float, int], data_format: Optional[ChannelDimension] = None
-    ) -> np.ndarray:
-        """
-        Rescale the image by the given factor.
-        """
-        return rescale(image, rescale_factor, data_format=data_format)
-
-    # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.normalize
-    def normalize(
         self,
         image: np.ndarray,
-        mean: Union[float, Iterable[float]],
-        std: Union[float, Iterable[float]],
-        data_format: Optional[ChannelDimension] = None,
+        rescale_factor: float,
+        data_format: Optional[Union[str, ChannelDimension]] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
     ) -> np.ndarray:
         """
-        Normalize the image with the given mean and standard deviation.
+        Rescale the image by the given factor. image = image * rescale_factor.
+
+        Args:
+            image (`np.ndarray`):
+                Image to rescale.
+            rescale_factor (`float`):
+                The value to use for rescaling.
+            data_format (`str` or `ChannelDimension`, *optional*):
+                The channel dimension format for the output image. If unset, the channel dimension format of the input
+                image is used. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+            input_data_format (`str` or `ChannelDimension`, *optional*):
+                The channel dimension format for the input image. If unset, is inferred from the input image. Can be
+                one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
         """
-        return normalize(image, mean=mean, std=std, data_format=data_format)
+        return rescale(image, rescale_factor, data_format=data_format, input_data_format=input_data_format)
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.normalize_annotation
     def normalize_annotation(self, annotation: Dict, image_size: Tuple[int, int]) -> Dict:
         """
         Normalize the boxes in the annotation from `[top_left_x, top_left_y, bottom_right_x, bottom_right_y]` to
-        `[center_x, center_y, width, height]` format.
+        `[center_x, center_y, width, height]` format and from absolute to relative pixel values.
         """
         return normalize_annotation(annotation, image_size=image_size)
 
-    # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.pad_and_create_pixel_mask
-    def pad_and_create_pixel_mask(
+    # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor._update_annotation_for_padded_image
+    def _update_annotation_for_padded_image(
         self,
-        pixel_values_list: List[ImageInput],
-        return_tensors: Optional[Union[str, TensorType]] = None,
-        data_format: Optional[ChannelDimension] = None,
-    ) -> BatchFeature:
+        annotation: Dict,
+        input_image_size: Tuple[int, int],
+        output_image_size: Tuple[int, int],
+        padding,
+        update_bboxes,
+    ) -> Dict:
         """
-        Pads a batch of images with zeros to the size of largest height and width in the batch and returns their
-        corresponding pixel mask.
-
-        Args:
-            images (`List[np.ndarray]`):
-                Batch of images to pad.
-            return_tensors (`str` or `TensorType`, *optional*):
-                The type of tensors to return. Can be one of:
-                    - Unset: Return a list of `np.ndarray`.
-                    - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
-                    - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
-                    - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
-                    - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
-            data_format (`str` or `ChannelDimension`, *optional*):
-                The channel dimension format of the image. If not provided, it will be the same as the input image.
+        Update the annotation for a padded image.
         """
-        warnings.warn(
-            "This method is deprecated and will be removed in v4.27.0. Please use pad instead.", FutureWarning
-        )
-        # pad expects a list of np.ndarray, but the previous feature extractors expected torch tensors
-        images = [to_numpy_array(image) for image in pixel_values_list]
-        return self.pad(
-            images=images,
-            return_pixel_mask=True,
-            return_tensors=return_tensors,
-            data_format=data_format,
-        )
+        new_annotation = {}
+        new_annotation["size"] = output_image_size
+
+        for key, value in annotation.items():
+            if key == "masks":
+                masks = value
+                masks = pad(
+                    masks,
+                    padding,
+                    mode=PaddingMode.CONSTANT,
+                    constant_values=0,
+                    input_data_format=ChannelDimension.FIRST,
+                )
+                masks = safe_squeeze(masks, 1)
+                new_annotation["masks"] = masks
+            elif key == "boxes" and update_bboxes:
+                boxes = value
+                boxes *= np.asarray(
+                    [
+                        input_image_size[1] / output_image_size[1],
+                        input_image_size[0] / output_image_size[0],
+                        input_image_size[1] / output_image_size[1],
+                        input_image_size[0] / output_image_size[0],
+                    ]
+                )
+                new_annotation["boxes"] = boxes
+            elif key == "size":
+                new_annotation["size"] = output_image_size
+            else:
+                new_annotation[key] = value
+        return new_annotation
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor._pad_image
     def _pad_image(
         self,
         image: np.ndarray,
         output_size: Tuple[int, int],
+        annotation: Optional[Dict[str, Any]] = None,
         constant_values: Union[float, Iterable[float]] = 0,
         data_format: Optional[ChannelDimension] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
+        update_bboxes: bool = True,
     ) -> np.ndarray:
         """
         Pad an image with zeros to the given size.
         """
-        input_height, input_width = get_image_size(image)
+        input_height, input_width = get_image_size(image, channel_dim=input_data_format)
         output_height, output_width = output_size
 
         pad_bottom = output_height - input_height
         pad_right = output_width - input_width
         padding = ((0, pad_bottom), (0, pad_right))
         padded_image = pad(
-            image, padding, mode=PaddingMode.CONSTANT, constant_values=constant_values, data_format=data_format
+            image,
+            padding,
+            mode=PaddingMode.CONSTANT,
+            constant_values=constant_values,
+            data_format=data_format,
+            input_data_format=input_data_format,
         )
-        return padded_image
+        if annotation is not None:
+            annotation = self._update_annotation_for_padded_image(
+                annotation, (input_height, input_width), (output_height, output_width), padding, update_bboxes
+            )
+        return padded_image, annotation
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.pad
     def pad(
         self,
         images: List[np.ndarray],
+        annotations: Optional[Union[AnnotationType, List[AnnotationType]]] = None,
         constant_values: Union[float, Iterable[float]] = 0,
         return_pixel_mask: bool = True,
         return_tensors: Optional[Union[str, TensorType]] = None,
         data_format: Optional[ChannelDimension] = None,
-    ) -> np.ndarray:
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
+        update_bboxes: bool = True,
+    ) -> BatchFeature:
         """
         Pads a batch of images to the bottom and right of the image with zeros to the size of largest height and width
         in the batch and optionally returns their corresponding pixel mask.
 
         Args:
-            image (`np.ndarray`):
-                Image to pad.
+            images (List[`np.ndarray`]):
+                Images to pad.
+            annotations (`AnnotationType` or `List[AnnotationType]`, *optional*):
+                Annotations to transform according to the padding that is applied to the images.
             constant_values (`float` or `Iterable[float]`, *optional*):
                 The value to use for the padding if `mode` is `"constant"`.
             return_pixel_mask (`bool`, *optional*, defaults to `True`):
                 Whether to return a pixel mask.
-            input_channel_dimension (`ChannelDimension`, *optional*):
-                The channel dimension format of the image. If not provided, it will be inferred from the input image.
+            return_tensors (`str` or `TensorType`, *optional*):
+                The type of tensors to return. Can be one of:
+                    - Unset: Return a list of `np.ndarray`.
+                    - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
+                    - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
+                    - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
+                    - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
             data_format (`str` or `ChannelDimension`, *optional*):
                 The channel dimension format of the image. If not provided, it will be the same as the input image.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format of the input image. If not provided, it will be inferred.
+            update_bboxes (`bool`, *optional*, defaults to `True`):
+                Whether to update the bounding boxes in the annotations to match the padded images. If the
+                bounding boxes have not been converted to relative coordinates and `(center_x, center_y, width, height)`
+                format, the bounding boxes will not be updated.
         """
-        pad_size = get_max_height_width(images)
+        pad_size = get_max_height_width(images, input_data_format=input_data_format)
+
+        annotation_list = annotations if annotations is not None else [None] * len(images)
+        padded_images = []
+        padded_annotations = []
+        for image, annotation in zip(images, annotation_list):
+            padded_image, padded_annotation = self._pad_image(
+                image,
+                pad_size,
+                annotation,
+                constant_values=constant_values,
+                data_format=data_format,
+                input_data_format=input_data_format,
+                update_bboxes=update_bboxes,
+            )
+            padded_images.append(padded_image)
+            padded_annotations.append(padded_annotation)
 
-        padded_images = [
-            self._pad_image(image, pad_size, constant_values=constant_values, data_format=data_format)
-            for image in images
-        ]
         data = {"pixel_values": padded_images}
 
         if return_pixel_mask:
-            masks = [make_pixel_mask(image=image, output_size=pad_size) for image in images]
+            masks = [
+                make_pixel_mask(image=image, output_size=pad_size, input_data_format=input_data_format)
+                for image in images
+            ]
             data["pixel_mask"] = masks
 
-        return BatchFeature(data=data, tensor_type=return_tensors)
+        encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)
+
+        if annotations is not None:
+            encoded_inputs["labels"] = [
+                BatchFeature(annotation, tensor_type=return_tensors) for annotation in padded_annotations
+            ]
+
+        return encoded_inputs
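A toy-data sketch of the extended `pad` signature: annotations are now padded alongside the images and come back as `labels` (one `BatchFeature` per image) in the returned batch. Shapes and values below are illustrative.

```python
import numpy as np
from transformers import ConditionalDetrImageProcessor

processor = ConditionalDetrImageProcessor()
images = [np.random.rand(3, 480, 640), np.random.rand(3, 512, 512)]  # channels-first toy images
annotations = [
    {"boxes": np.array([[0.5, 0.5, 0.2, 0.3]], dtype=np.float32), "class_labels": np.array([1])},
    {"boxes": np.array([[0.4, 0.4, 0.1, 0.1]], dtype=np.float32), "class_labels": np.array([2])},
]

batch = processor.pad(
    images,
    annotations=annotations,
    return_pixel_mask=True,
    return_tensors="np",
    update_bboxes=True,  # rescale the relative (center_x, center_y, w, h) boxes to the padded size
)
# batch["pixel_values"], batch["pixel_mask"] and batch["labels"] are aligned per image.
```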
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.preprocess
     def preprocess(
@@ -1083,12 +1191,14 @@ def preprocess(
         do_rescale: Optional[bool] = None,
         rescale_factor: Optional[Union[int, float]] = None,
         do_normalize: Optional[bool] = None,
+        do_convert_annotations: Optional[bool] = None,
         image_mean: Optional[Union[float, List[float]]] = None,
         image_std: Optional[Union[float, List[float]]] = None,
         do_pad: Optional[bool] = None,
-        format: Optional[Union[str, AnnotionFormat]] = None,
+        format: Optional[Union[str, AnnotationFormat]] = None,
         return_tensors: Optional[Union[TensorType, str]] = None,
         data_format: Union[str, ChannelDimension] = ChannelDimension.FIRST,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
         **kwargs,
     ) -> BatchFeature:
         """
@@ -1096,14 +1206,15 @@ def preprocess(
 
         Args:
             images (`ImageInput`):
-                Image or batch of images to preprocess.
+                Image or batch of images to preprocess. Expects a single or batch of images with pixel values ranging
+                from 0 to 255. If passing in images with pixel values between 0 and 1, set `do_rescale=False`.
             annotations (`AnnotationType` or `List[AnnotationType]`, *optional*):
-                List of annotations associated with the image or batch of images. If annotionation is for object
+                List of annotations associated with the image or batch of images. If annotation is for object
                 detection, the annotations should be a dictionary with the following keys:
                 - "image_id" (`int`): The image id.
                 - "annotations" (`List[Dict]`): List of annotations for an image. Each annotation should be a
                   dictionary. An image can have no annotations, in which case the list should be empty.
-                If annotionation is for segmentation, the annotations should be a dictionary with the following keys:
+                If annotation is for segmentation, the annotations should be a dictionary with the following keys:
                 - "image_id" (`int`): The image id.
                 - "segments_info" (`List[Dict]`): List of segments for an image. Each segment should be a dictionary.
                   An image can have no segments, in which case the list should be empty.
@@ -1124,33 +1235,45 @@ def preprocess(
                 Rescale factor to use when rescaling the image.
             do_normalize (`bool`, *optional*, defaults to self.do_normalize):
                 Whether to normalize the image.
+            do_convert_annotations (`bool`, *optional*, defaults to self.do_convert_annotations):
+                Whether to convert the annotations to the format expected by the model. Converts the bounding
+                boxes from the format `(top_left_x, top_left_y, width, height)` to `(center_x, center_y, width, height)`
+                and in relative coordinates.
             image_mean (`float` or `List[float]`, *optional*, defaults to self.image_mean):
                 Mean to use when normalizing the image.
             image_std (`float` or `List[float]`, *optional*, defaults to self.image_std):
                 Standard deviation to use when normalizing the image.
             do_pad (`bool`, *optional*, defaults to self.do_pad):
-                Whether to pad the image.
-            format (`str` or `AnnotionFormat`, *optional*, defaults to self.format):
+                Whether to pad the image. If `True`, will pad the images in the batch to the largest image in the batch
+                and create a pixel mask. Padding will be applied to the bottom and right of the image with zeros.
+            format (`str` or `AnnotationFormat`, *optional*, defaults to self.format):
                 Format of the annotations.
             return_tensors (`str` or `TensorType`, *optional*, defaults to self.return_tensors):
                 Type of tensors to return. If `None`, will return the list of images.
-            data_format (`str` or `ChannelDimension`, *optional*, defaults to self.data_format):
-                The channel dimension format of the image. If not provided, it will be the same as the input image.
+            data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
+                The channel dimension format for the output image. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - Unset: Use the channel dimension format of the input image.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format for the input image. If unset, the channel dimension format is inferred
+                from the input image. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
         """
         if "pad_and_return_pixel_mask" in kwargs:
-            warnings.warn(
+            logger.warning_once(
                 "The `pad_and_return_pixel_mask` argument is deprecated and will be removed in a future version, "
-                "use `do_pad` instead.",
-                FutureWarning,
+                "use `do_pad` instead."
             )
             do_pad = kwargs.pop("pad_and_return_pixel_mask")
 
         max_size = None
         if "max_size" in kwargs:
-            warnings.warn(
+            logger.warning_once(
                 "The `max_size` argument is deprecated and will be removed in a future version, use"
-                " `size['longest_edge']` instead.",
-                FutureWarning,
+                " `size['longest_edge']` instead."
             )
             size = kwargs.pop("max_size")
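Finally, an end-to-end sketch of the reworked `preprocess` with COCO-detection annotations and an explicit `input_data_format`; the checkpoint name and pixel data are illustrative assumptions.

```python
import numpy as np
from transformers import ConditionalDetrImageProcessor

processor = ConditionalDetrImageProcessor.from_pretrained("microsoft/conditional-detr-resnet-50")

image = (np.random.rand(480, 640, 3) * 255).astype(np.uint8)  # channels-last uint8 image
annotations = {
    "image_id": 0,
    "annotations": [
        {"bbox": [10.0, 20.0, 100.0, 80.0], "category_id": 1, "area": 8000.0, "iscrowd": 0}
    ],
}

encoding = processor(
    images=image,
    annotations=annotations,
    return_tensors="pt",
    input_data_format="channels_last",  # skip channel-format inference for this image
)
# encoding["pixel_values"], encoding["pixel_mask"] and encoding["labels"][0]["boxes"]
```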
 
@@ -1163,19 +1286,33 @@ def preprocess(
         do_normalize = self.do_normalize if do_normalize is None else do_normalize
         image_mean = self.image_mean if image_mean is None else image_mean
         image_std = self.image_std if image_std is None else image_std
+        do_convert_annotations = (
+            self.do_convert_annotations if do_convert_annotations is None else do_convert_annotations
+        )
         do_pad = self.do_pad if do_pad is None else do_pad
         format = self.format if format is None else format
 
-        if do_resize is not None and size is None:
-            raise ValueError("Size and max_size must be specified if do_resize is True.")
+        images = make_list_of_images(images)
 
-        if do_rescale is not None and rescale_factor is None:
-            raise ValueError("Rescale factor must be specified if do_rescale is True.")
+        if not valid_images(images):
+            raise ValueError(
+                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
+                "torch.Tensor, tf.Tensor or jax.ndarray."
+            )
 
-        if do_normalize is not None and (image_mean is None or image_std is None):
-            raise ValueError("Image mean and std must be specified if do_normalize is True.")
+        # Here, the pad() method pads to the batch-wide maximum (height, width), so do_pad does not need to be validated against a size argument.
+
+        validate_preprocess_arguments(
+            do_rescale=do_rescale,
+            rescale_factor=rescale_factor,
+            do_normalize=do_normalize,
+            image_mean=image_mean,
+            image_std=image_std,
+            do_resize=do_resize,
+            size=size,
+            resample=resample,
+        )
 
-        images = make_list_of_images(images)
         if annotations is not None and isinstance(annotations, dict):
             annotations = [annotations]
 
@@ -1184,34 +1321,13 @@ def preprocess(
                 f"The number of images ({len(images)}) and annotations ({len(annotations)}) do not match."
             )
 
-        if not valid_images(images):
-            raise ValueError(
-                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
-                "torch.Tensor, tf.Tensor or jax.ndarray."
-            )
-
-        format = AnnotionFormat(format)
+        format = AnnotationFormat(format)
         if annotations is not None:
-            if format == AnnotionFormat.COCO_DETECTION and not valid_coco_detection_annotations(annotations):
-                raise ValueError(
-                    "Invalid COCO detection annotations. Annotations must a dict (single image) of list of dicts"
-                    "(batch of images) with the following keys: `image_id` and `annotations`, with the latter "
-                    "being a list of annotations in the COCO format."
-                )
-            elif format == AnnotionFormat.COCO_PANOPTIC and not valid_coco_panoptic_annotations(annotations):
-                raise ValueError(
-                    "Invalid COCO panoptic annotations. Annotations must a dict (single image) of list of dicts "
-                    "(batch of images) with the following keys: `image_id`, `file_name` and `segments_info`, with "
-                    "the latter being a list of annotations in the COCO format."
-                )
-            elif format not in SUPPORTED_ANNOTATION_FORMATS:
-                raise ValueError(
-                    f"Unsupported annotation format: {format} must be one of {SUPPORTED_ANNOTATION_FORMATS}"
-                )
+            validate_annotations(format, SUPPORTED_ANNOTATION_FORMATS, annotations)
 
         if (
             masks_path is not None
-            and format == AnnotionFormat.COCO_PANOPTIC
+            and format == AnnotationFormat.COCO_PANOPTIC
             and not isinstance(masks_path, (pathlib.Path, str))
         ):
             raise ValueError(
@@ -1222,13 +1338,28 @@ def preprocess(
         # All transformations expect numpy arrays
         images = [to_numpy_array(image) for image in images]
 
+        if is_scaled_image(images[0]) and do_rescale:
+            logger.warning_once(
+                "It looks like you are trying to rescale already rescaled images. If the input"
+                " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
+            )
+
+        if input_data_format is None:
+            # We assume that all images have the same channel dimension format.
+            input_data_format = infer_channel_dimension_format(images[0])
+
         # prepare (COCO annotations as a list of Dict -> DETR target as a single Dict per image)
         if annotations is not None:
             prepared_images = []
             prepared_annotations = []
             for image, target in zip(images, annotations):
                 target = self.prepare_annotation(
-                    image, target, format, return_segmentation_masks=return_segmentation_masks, masks_path=masks_path
+                    image,
+                    target,
+                    format,
+                    return_segmentation_masks=return_segmentation_masks,
+                    masks_path=masks_path,
+                    input_data_format=input_data_format,
                 )
                 prepared_images.append(image)
                 prepared_annotations.append(target)
@@ -1241,48 +1372,67 @@ def preprocess(
             if annotations is not None:
                 resized_images, resized_annotations = [], []
                 for image, target in zip(images, annotations):
-                    orig_size = get_image_size(image)
-                    resized_image = self.resize(image, size=size, max_size=max_size, resample=resample)
-                    resized_annotation = self.resize_annotation(target, orig_size, get_image_size(resized_image))
+                    orig_size = get_image_size(image, input_data_format)
+                    resized_image = self.resize(
+                        image, size=size, max_size=max_size, resample=resample, input_data_format=input_data_format
+                    )
+                    resized_annotation = self.resize_annotation(
+                        target, orig_size, get_image_size(resized_image, input_data_format)
+                    )
                     resized_images.append(resized_image)
                     resized_annotations.append(resized_annotation)
                 images = resized_images
                 annotations = resized_annotations
                 del resized_images, resized_annotations
             else:
-                images = [self.resize(image, size=size, resample=resample) for image in images]
+                images = [
+                    self.resize(image, size=size, resample=resample, input_data_format=input_data_format)
+                    for image in images
+                ]
 
         if do_rescale:
-            images = [self.rescale(image, rescale_factor) for image in images]
+            images = [self.rescale(image, rescale_factor, input_data_format=input_data_format) for image in images]
 
         if do_normalize:
-            images = [self.normalize(image, image_mean, image_std) for image in images]
-            if annotations is not None:
-                annotations = [
-                    self.normalize_annotation(annotation, get_image_size(image))
-                    for annotation, image in zip(annotations, images)
-                ]
+            images = [
+                self.normalize(image, image_mean, image_std, input_data_format=input_data_format) for image in images
+            ]
+
+        if do_convert_annotations and annotations is not None:
+            annotations = [
+                self.normalize_annotation(annotation, get_image_size(image, input_data_format))
+                for annotation, image in zip(annotations, images)
+            ]
 
         if do_pad:
             # Pads images and returns their mask: {'pixel_values': ..., 'pixel_mask': ...}
-            data = self.pad(images, return_pixel_mask=True, data_format=data_format)
+            encoded_inputs = self.pad(
+                images,
+                annotations=annotations,
+                return_pixel_mask=True,
+                data_format=data_format,
+                input_data_format=input_data_format,
+                return_tensors=return_tensors,
+                update_bboxes=do_convert_annotations,
+            )
         else:
-            images = [to_channel_dimension_format(image, data_format) for image in images]
-            data = {"pixel_values": images}
-
-        encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)
-        if annotations is not None:
-            encoded_inputs["labels"] = [
-                BatchFeature(annotation, tensor_type=return_tensors) for annotation in annotations
+            images = [
+                to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
+                for image in images
             ]
+            encoded_inputs = BatchFeature(data={"pixel_values": images}, tensor_type=return_tensors)
+            if annotations is not None:
+                encoded_inputs["labels"] = [
+                    BatchFeature(annotation, tensor_type=return_tensors) for annotation in annotations
+                ]
 
         return encoded_inputs
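The reworked `preprocess` above now infers the input channel layout, warns once when images already lie in `[0, 1]`, and lets `pad` update the boxes when `do_convert_annotations` is set. A minimal usage sketch (an assumption, not part of the patch: the public `microsoft/conditional-detr-resnet-50` checkpoint is used, though any Conditional DETR image-processor checkpoint should behave the same):

```python
import numpy as np
from transformers import ConditionalDetrImageProcessor

processor = ConditionalDetrImageProcessor.from_pretrained("microsoft/conditional-detr-resnet-50")

# The image is already scaled to [0, 1]; passing do_rescale=False avoids both the new
# warning_once about re-rescaling and a second division by 255.
image = np.random.rand(480, 640, 3).astype(np.float32)
inputs = processor(images=image, do_rescale=False, return_tensors="pt")
print(inputs["pixel_values"].shape)  # (1, 3, height, width) after resizing and padding
```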
 
     # POSTPROCESSING METHODS - TODO: add support for other frameworks
     def post_process(self, outputs, target_sizes):
         """
-        Converts the output of [`ConditionalDetrForObjectDetection`] into the format expected by the COCO api. Only
-        supports PyTorch.
+        Converts the output of [`ConditionalDetrForObjectDetection`] into the Pascal VOC format (xmin, ymin, xmax, ymax).
+        Only supports PyTorch.
 
         Args:
             outputs ([`ConditionalDetrObjectDetectionOutput`]):
@@ -1295,10 +1445,9 @@ def post_process(self, outputs, target_sizes):
             `List[Dict]`: A list of dictionaries, each dictionary containing the scores, labels and boxes for an image
             in the batch as predicted by the model.
         """
-        warnings.warn(
+        logger.warning_once(
             "`post_process` is deprecated and will be removed in v5 of Transformers, please use"
-            " `post_process_object_detection`",
-            FutureWarning,
+            " `post_process_object_detection` instead, with `threshold=0.` for equivalent results.",
         )
 
         out_logits, out_bbox = outputs.logits, outputs.pred_boxes
@@ -1311,7 +1460,7 @@ def post_process(self, outputs, target_sizes):
         prob = out_logits.sigmoid()
         topk_values, topk_indexes = torch.topk(prob.view(out_logits.shape[0], -1), 300, dim=1)
         scores = topk_values
-        topk_boxes = topk_indexes // out_logits.shape[2]
+        topk_boxes = torch.div(topk_indexes, out_logits.shape[2], rounding_mode="floor")
         labels = topk_indexes % out_logits.shape[2]
         boxes = center_to_corners_format(out_bbox)
         boxes = torch.gather(boxes, 1, topk_boxes.unsqueeze(-1).repeat(1, 1, 4))
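The switch from `//` to `torch.div(..., rounding_mode="floor")` just above keeps the index arithmetic identical while avoiding the floor-division deprecation warnings the `//` operator used to emit on tensors in older PyTorch releases. A quick standalone check of the equivalence:

```python
import torch

topk_indexes = torch.tensor([[5, 17, 42]])
num_classes = 8
assert torch.equal(
    topk_indexes // num_classes,
    torch.div(topk_indexes, num_classes, rounding_mode="floor"),
)
```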
@@ -1327,7 +1476,7 @@ def post_process(self, outputs, target_sizes):
 
     # Copied from transformers.models.deformable_detr.image_processing_deformable_detr.DeformableDetrImageProcessor.post_process_object_detection with DeformableDetr->ConditionalDetr
     def post_process_object_detection(
-        self, outputs, threshold: float = 0.5, target_sizes: Union[TensorType, List[Tuple]] = None
+        self, outputs, threshold: float = 0.5, target_sizes: Union[TensorType, List[Tuple]] = None, top_k: int = 100
     ):
         """
         Converts the raw output of [`ConditionalDetrForObjectDetection`] into final bounding boxes in (top_left_x,
@@ -1341,6 +1490,8 @@ def post_process_object_detection(
             target_sizes (`torch.Tensor` or `List[Tuple[int, int]]`, *optional*):
                 Tensor of shape `(batch_size, 2)` or list of tuples (`Tuple[int, int]`) containing the target size
                 (height, width) of each image in the batch. If left to None, predictions will not be resized.
+            top_k (`int`, *optional*, defaults to 100):
+                Keep only the top k bounding boxes before filtering by the confidence threshold.
 
         Returns:
             `List[Dict]`: A list of dictionaries, each dictionary containing the scores, labels and boxes for an image
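A usage sketch for the new `top_k` argument (assumptions, not part of the patch: the public `microsoft/conditional-detr-resnet-50` checkpoint and the usual COCO sample image):

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, ConditionalDetrForObjectDetection

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("microsoft/conditional-detr-resnet-50")
model = ConditionalDetrForObjectDetection.from_pretrained("microsoft/conditional-detr-resnet-50")

with torch.no_grad():
    outputs = model(**processor(images=image, return_tensors="pt"))

target_sizes = torch.tensor([image.size[::-1]])
# Keep at most 50 candidate boxes before applying the 0.5 score threshold.
results = processor.post_process_object_detection(
    outputs, threshold=0.5, target_sizes=target_sizes, top_k=50
)[0]
print(results["scores"].shape, results["boxes"].shape)
```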
@@ -1355,21 +1506,24 @@ def post_process_object_detection(
                 )
 
         prob = out_logits.sigmoid()
-        topk_values, topk_indexes = torch.topk(prob.view(out_logits.shape[0], -1), 100, dim=1)
+        prob = prob.view(out_logits.shape[0], -1)
+        k_value = min(top_k, prob.size(1))
+        topk_values, topk_indexes = torch.topk(prob, k_value, dim=1)
         scores = topk_values
-        topk_boxes = topk_indexes // out_logits.shape[2]
+        topk_boxes = torch.div(topk_indexes, out_logits.shape[2], rounding_mode="floor")
         labels = topk_indexes % out_logits.shape[2]
         boxes = center_to_corners_format(out_bbox)
         boxes = torch.gather(boxes, 1, topk_boxes.unsqueeze(-1).repeat(1, 1, 4))
 
         # and from relative [0, 1] to absolute [0, height] coordinates
-        if isinstance(target_sizes, List):
-            img_h = torch.Tensor([i[0] for i in target_sizes])
-            img_w = torch.Tensor([i[1] for i in target_sizes])
-        else:
-            img_h, img_w = target_sizes.unbind(1)
-        scale_fct = torch.stack([img_w, img_h, img_w, img_h], dim=1).to(boxes.device)
-        boxes = boxes * scale_fct[:, None, :]
+        if target_sizes is not None:
+            if isinstance(target_sizes, List):
+                img_h = torch.Tensor([i[0] for i in target_sizes])
+                img_w = torch.Tensor([i[1] for i in target_sizes])
+            else:
+                img_h, img_w = target_sizes.unbind(1)
+            scale_fct = torch.stack([img_w, img_h, img_w, img_h], dim=1).to(boxes.device)
+            boxes = boxes * scale_fct[:, None, :]
 
         results = []
         for s, l, b in zip(scores, labels, boxes):
@@ -1383,8 +1537,7 @@ def post_process_object_detection(
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.post_process_semantic_segmentation with Detr->ConditionalDetr
     def post_process_semantic_segmentation(self, outputs, target_sizes: List[Tuple[int, int]] = None):
         """
-        Converts the output of [`ConditionalDetrForSegmentation`] into semantic segmentation maps. Only supports
-        PyTorch.
+        Converts the output of [`ConditionalDetrForSegmentation`] into semantic segmentation maps. Only supports PyTorch.
 
         Args:
             outputs ([`ConditionalDetrForSegmentation`]):
@@ -1440,8 +1593,7 @@ def post_process_instance_segmentation(
         return_coco_annotation: Optional[bool] = False,
     ) -> List[Dict]:
         """
-        Converts the output of [`ConditionalDetrForSegmentation`] into instance segmentation predictions. Only supports
-        PyTorch.
+        Converts the output of [`ConditionalDetrForSegmentation`] into instance segmentation predictions. Only supports PyTorch.
 
         Args:
             outputs ([`ConditionalDetrForSegmentation`]):
@@ -1525,8 +1677,8 @@ def post_process_panoptic_segmentation(
         target_sizes: Optional[List[Tuple[int, int]]] = None,
     ) -> List[Dict]:
         """
-        Converts the output of [`ConditionalDetrForSegmentation`] into image panoptic segmentation predictions. Only
-        supports PyTorch.
+        Converts the output of [`ConditionalDetrForSegmentation`] into image panoptic segmentation predictions. Only supports
+        PyTorch.
 
         Args:
             outputs ([`ConditionalDetrForSegmentation`]):
@@ -1559,7 +1711,7 @@ def post_process_panoptic_segmentation(
         """
 
         if label_ids_to_fuse is None:
-            warnings.warn("`label_ids_to_fuse` unset. No instance will be fused.")
+            logger.warning_once("`label_ids_to_fuse` unset. No instance will be fused.")
             label_ids_to_fuse = set()
 
         class_queries_logits = outputs.logits  # [batch_size, num_queries, num_classes+1]
diff --git a/src/transformers/models/conditional_detr/modeling_conditional_detr.py b/src/transformers/models/conditional_detr/modeling_conditional_detr.py
index 5068c003f95b49..2a5e06ea2b4abc 100644
--- a/src/transformers/models/conditional_detr/modeling_conditional_detr.py
+++ b/src/transformers/models/conditional_detr/modeling_conditional_detr.py
@@ -16,21 +16,21 @@
 
 
 import math
-import random
 from dataclasses import dataclass
-from typing import Dict, List, Optional, Tuple
+from typing import Dict, List, Optional, Tuple, Union
 
 import torch
 from torch import Tensor, nn
 
 from ...activations import ACT2FN
+from ...modeling_attn_mask_utils import _prepare_4d_attention_mask
 from ...modeling_outputs import BaseModelOutput, BaseModelOutputWithCrossAttentions, Seq2SeqModelOutput
 from ...modeling_utils import PreTrainedModel
-from ...pytorch_utils import torch_int_div
 from ...utils import (
     ModelOutput,
     add_start_docstrings,
     add_start_docstrings_to_model_forward,
+    is_accelerate_available,
     is_scipy_available,
     is_timm_available,
     is_vision_available,
@@ -38,10 +38,14 @@
     replace_return_docstrings,
     requires_backends,
 )
-from ..auto import AutoBackbone
+from ...utils.backbone_utils import load_backbone
 from .configuration_conditional_detr import ConditionalDetrConfig
 
 
+if is_accelerate_available():
+    from accelerate import PartialState
+    from accelerate.utils import reduce
+
 if is_scipy_available():
     from scipy.optimize import linear_sum_assignment
 
@@ -49,7 +53,7 @@
     from timm import create_model
 
 if is_vision_available():
-    from transformers.image_transforms import center_to_corners_format
+    from ...image_transforms import center_to_corners_format
 
 logger = logging.get_logger(__name__)
 
@@ -154,8 +158,8 @@ class ConditionalDetrObjectDetectionOutput(ModelOutput):
         pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
             Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These
             values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding
-            possible padding). You can use [`~ConditionalDetrImageProcessor.post_process_object_detection`] to retrieve
-            the unnormalized bounding boxes.
+            possible padding). You can use [`~ConditionalDetrImageProcessor.post_process_object_detection`] to retrieve the
+            unnormalized bounding boxes.
         auxiliary_outputs (`list[Dict]`, *optional*):
             Optional, only returned when auxilary losses are activated (i.e. `config.auxiliary_loss` is set to `True`)
             and labels are provided. It is a list of dictionaries containing the two above keys (`logits` and
@@ -218,14 +222,14 @@ class ConditionalDetrSegmentationOutput(ModelOutput):
         pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
             Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These
             values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding
-            possible padding). You can use [`~ConditionalDetrImageProcessor.post_process_object_detection`] to retrieve
-            the unnormalized bounding boxes.
+            possible padding). You can use [`~ConditionalDetrImageProcessor.post_process_object_detection`] to retrieve the
+            unnormalized bounding boxes.
         pred_masks (`torch.FloatTensor` of shape `(batch_size, num_queries, height/4, width/4)`):
             Segmentation masks logits for all queries. See also
             [`~ConditionalDetrImageProcessor.post_process_semantic_segmentation`] or
             [`~ConditionalDetrImageProcessor.post_process_instance_segmentation`]
-            [`~ConditionalDetrImageProcessor.post_process_panoptic_segmentation`] to evaluate semantic, instance and
-            panoptic segmentation masks respectively.
+            [`~ConditionalDetrImageProcessor.post_process_panoptic_segmentation`] to evaluate semantic, instance and panoptic
+            segmentation masks respectively.
         auxiliary_outputs (`list[Dict]`, *optional*):
             Optional, only returned when auxiliary losses are activated (i.e. `config.auxiliary_loss` is set to `True`)
             and labels are provided. It is a list of dictionaries containing the two above keys (`logits` and
@@ -312,19 +316,28 @@ def forward(self, x):
 
 
 # Copied from transformers.models.detr.modeling_detr.replace_batch_norm with Detr->ConditionalDetr
-def replace_batch_norm(m, name=""):
-    for attr_str in dir(m):
-        target_attr = getattr(m, attr_str)
-        if isinstance(target_attr, nn.BatchNorm2d):
-            frozen = ConditionalDetrFrozenBatchNorm2d(target_attr.num_features)
-            bn = getattr(m, attr_str)
-            frozen.weight.data.copy_(bn.weight)
-            frozen.bias.data.copy_(bn.bias)
-            frozen.running_mean.data.copy_(bn.running_mean)
-            frozen.running_var.data.copy_(bn.running_var)
-            setattr(m, attr_str, frozen)
-    for n, ch in m.named_children():
-        replace_batch_norm(ch, n)
+def replace_batch_norm(model):
+    r"""
+    Recursively replace all `torch.nn.BatchNorm2d` with `ConditionalDetrFrozenBatchNorm2d`.
+
+    Args:
+        model (torch.nn.Module):
+            input model
+    """
+    for name, module in model.named_children():
+        if isinstance(module, nn.BatchNorm2d):
+            new_module = ConditionalDetrFrozenBatchNorm2d(module.num_features)
+
+            if not module.weight.device == torch.device("meta"):
+                new_module.weight.data.copy_(module.weight)
+                new_module.bias.data.copy_(module.bias)
+                new_module.running_mean.data.copy_(module.running_mean)
+                new_module.running_var.data.copy_(module.running_var)
+
+            model._modules[name] = new_module
+
+        if len(list(module.children())) > 0:
+            replace_batch_norm(module)
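The rewritten `replace_batch_norm` recurses through `named_children`, swaps every `nn.BatchNorm2d` in place, and only copies the running statistics when the weights are not on the `meta` device. A small sketch of the behaviour, assuming the helper and the frozen class remain importable from this module:

```python
import torch
from torch import nn
from transformers.models.conditional_detr.modeling_conditional_detr import (
    ConditionalDetrFrozenBatchNorm2d,
    replace_batch_norm,
)

model = nn.Sequential(
    nn.Conv2d(3, 8, 3),
    nn.BatchNorm2d(8),
    nn.Sequential(nn.BatchNorm2d(8)),  # nested module to exercise the recursion
)
with torch.no_grad():
    replace_batch_norm(model)

assert isinstance(model[1], ConditionalDetrFrozenBatchNorm2d)
assert isinstance(model[2][0], ConditionalDetrFrozenBatchNorm2d)
```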
 
 
 # Copied from transformers.models.detr.modeling_detr.DetrConvEncoder
@@ -355,7 +368,7 @@ def __init__(self, config):
                 **kwargs,
             )
         else:
-            backbone = AutoBackbone.from_config(config.backbone_config)
+            backbone = load_backbone(config)
 
         # replace batch norm by frozen batch norm
         with torch.no_grad():
@@ -409,22 +422,6 @@ def forward(self, pixel_values, pixel_mask):
         return out, pos
 
 
-# Copied from transformers.models.detr.modeling_detr._expand_mask with Detr->ConditionalDetr
-def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, target_len: Optional[int] = None):
-    """
-    Expands attention_mask from `[batch_size, seq_len]` to `[batch_size, 1, target_seq_len, source_seq_len]`.
-    """
-    batch_size, source_len = mask.size()
-    target_len = target_len if target_len is not None else source_len
-
-    expanded_mask = mask[:, None, None, :].expand(batch_size, 1, target_len, source_len).to(dtype)
-
-    inverted_mask = 1.0 - expanded_mask
-
-    return inverted_mask.masked_fill(inverted_mask.bool(), torch.finfo(dtype).min)
-
-
-# Copied from transformers.models.detr.modeling_detr.DetrSinePositionEmbedding with Detr->ConditionalDetr
 class ConditionalDetrSinePositionEmbedding(nn.Module):
     """
     This is a more standard version of the position embedding, very similar to the one used by the Attention is all you
@@ -451,8 +448,8 @@ def forward(self, pixel_values, pixel_mask):
             y_embed = y_embed / (y_embed[:, -1:, :] + 1e-6) * self.scale
             x_embed = x_embed / (x_embed[:, :, -1:] + 1e-6) * self.scale
 
-        dim_t = torch.arange(self.embedding_dim, dtype=torch.float32, device=pixel_values.device)
-        dim_t = self.temperature ** (2 * torch_int_div(dim_t, 2) / self.embedding_dim)
+        dim_t = torch.arange(self.embedding_dim, dtype=torch.int64, device=pixel_values.device).float()
+        dim_t = self.temperature ** (2 * torch.div(dim_t, 2, rounding_mode="floor") / self.embedding_dim)
 
         pos_x = x_embed[:, :, :, None] / dim_t
         pos_y = y_embed[:, :, :, None] / dim_t
@@ -501,10 +498,11 @@ def build_position_encoding(config):
 
 
 # function to generate sine positional embedding for 2d coordinates
-def gen_sine_position_embeddings(pos_tensor):
+def gen_sine_position_embeddings(pos_tensor, d_model):
     scale = 2 * math.pi
-    dim_t = torch.arange(128, dtype=torch.float32, device=pos_tensor.device)
-    dim_t = 10000 ** (2 * (dim_t // 2) / 128)
+    dim = d_model // 2
+    dim_t = torch.arange(dim, dtype=torch.float32, device=pos_tensor.device)
+    dim_t = 10000 ** (2 * torch.div(dim_t, 2, rounding_mode="floor") / dim)
     x_embed = pos_tensor[:, :, 0] * scale
     y_embed = pos_tensor[:, :, 1] * scale
     pos_x = x_embed[:, :, None] / dim_t
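Parameterizing `gen_sine_position_embeddings` on `d_model` (instead of the hard-coded 128/256) keeps the reference-point embeddings consistent with non-default hidden sizes. A shape check, assuming the function stays importable at module level:

```python
import torch
from transformers.models.conditional_detr.modeling_conditional_detr import gen_sine_position_embeddings

centers = torch.rand(2, 300, 2)  # (batch_size, num_queries, normalized x/y reference points)
embeddings = gen_sine_position_embeddings(centers, d_model=256)
print(embeddings.shape)  # torch.Size([2, 300, 256])
```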
@@ -535,7 +533,6 @@ def __init__(
         embed_dim: int,
         num_heads: int,
         dropout: float = 0.0,
-        is_decoder: bool = False,
         bias: bool = True,
     ):
         super().__init__()
@@ -558,34 +555,79 @@ def __init__(
     def _shape(self, tensor: torch.Tensor, seq_len: int, batch_size: int):
         return tensor.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
 
-    def with_pos_embed(self, tensor: torch.Tensor, position_embeddings: Optional[Tensor]):
-        return tensor if position_embeddings is None else tensor + position_embeddings
+    def with_pos_embed(self, tensor: torch.Tensor, object_queries: Optional[Tensor], **kwargs):
+        position_embeddings = kwargs.pop("position_embeddings", None)
+
+        if kwargs:
+            raise ValueError(f"Unexpected arguments {kwargs.keys()}")
+
+        if position_embeddings is not None and object_queries is not None:
+            raise ValueError(
+                "Cannot specify both position_embeddings and object_queries. Please use just object_queries"
+            )
+
+        if position_embeddings is not None:
+            logger.warning_once(
+                "position_embeddings has been deprecated and will be removed in v4.34. Please use object_queries instead"
+            )
+            object_queries = position_embeddings
+
+        return tensor if object_queries is None else tensor + object_queries
 
     def forward(
         self,
         hidden_states: torch.Tensor,
         attention_mask: Optional[torch.Tensor] = None,
-        position_embeddings: Optional[torch.Tensor] = None,
+        object_queries: Optional[torch.Tensor] = None,
         key_value_states: Optional[torch.Tensor] = None,
-        key_value_position_embeddings: Optional[torch.Tensor] = None,
+        spatial_position_embeddings: Optional[torch.Tensor] = None,
         output_attentions: bool = False,
+        **kwargs,
     ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
         """Input shape: Batch x Time x Channel"""
 
+        position_embeddings = kwargs.pop("position_embeddings", None)
+        key_value_position_embeddings = kwargs.pop("key_value_position_embeddings", None)
+
+        if kwargs:
+            raise ValueError(f"Unexpected arguments {kwargs.keys()}")
+
+        if position_embeddings is not None and object_queries is not None:
+            raise ValueError(
+                "Cannot specify both position_embeddings and object_queries. Please use just object_queries"
+            )
+
+        if key_value_position_embeddings is not None and spatial_position_embeddings is not None:
+            raise ValueError(
+                "Cannot specify both key_value_position_embeddings and spatial_position_embeddings. Please use just spatial_position_embeddings"
+            )
+
+        if position_embeddings is not None:
+            logger.warning_once(
+                "position_embeddings has been deprecated and will be removed in v4.34. Please use object_queries instead"
+            )
+            object_queries = position_embeddings
+
+        if key_value_position_embeddings is not None:
+            logger.warning_once(
+                "key_value_position_embeddings has been deprecated and will be removed in v4.34. Please use spatial_position_embeddings instead"
+            )
+            spatial_position_embeddings = key_value_position_embeddings
+
         # if key_value_states are provided this layer is used as a cross-attention layer
         # for the decoder
         is_cross_attention = key_value_states is not None
         batch_size, target_len, embed_dim = hidden_states.size()
 
         # add position embeddings to the hidden states before projecting to queries and keys
-        if position_embeddings is not None:
+        if object_queries is not None:
             hidden_states_original = hidden_states
-            hidden_states = self.with_pos_embed(hidden_states, position_embeddings)
+            hidden_states = self.with_pos_embed(hidden_states, object_queries)
 
         # add key-value position embeddings to the key value states
-        if key_value_position_embeddings is not None:
+        if spatial_position_embeddings is not None:
             key_value_states_original = key_value_states
-            key_value_states = self.with_pos_embed(key_value_states, key_value_position_embeddings)
+            key_value_states = self.with_pos_embed(key_value_states, spatial_position_embeddings)
 
         # get query proj
         query_states = self.q_proj(hidden_states) * self.scaling
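The same backward-compatibility shim recurs throughout this file: pop the legacy kwarg, reject anything unknown, refuse mixed old/new names, then warn and remap. A standalone sketch of just that control flow (plain function and dummy values, deliberately not importing the real attention classes):

```python
import warnings

def forward(hidden_states, object_queries=None, **kwargs):
    position_embeddings = kwargs.pop("position_embeddings", None)
    if kwargs:
        raise ValueError(f"Unexpected arguments {kwargs.keys()}")
    if position_embeddings is not None and object_queries is not None:
        raise ValueError("Cannot specify both position_embeddings and object_queries")
    if position_embeddings is not None:
        warnings.warn("position_embeddings is deprecated, use object_queries instead", FutureWarning)
        object_queries = position_embeddings
    return hidden_states if object_queries is None else hidden_states + object_queries

print(forward(1.0, position_embeddings=2.0))  # 3.0, plus a FutureWarning
print(forward(1.0, object_queries=2.0))       # 3.0, no warning
```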
@@ -793,25 +835,43 @@ def forward(
         self,
         hidden_states: torch.Tensor,
         attention_mask: torch.Tensor,
-        position_embeddings: torch.Tensor = None,
+        object_queries: torch.Tensor = None,
         output_attentions: bool = False,
+        **kwargs,
     ):
         """
         Args:
-            hidden_states (`torch.FloatTensor`): input to the layer of shape `(seq_len, batch, embed_dim)`
+            hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
             attention_mask (`torch.FloatTensor`): attention mask of size
                 `(batch, 1, target_len, source_len)` where padding elements are indicated by very large negative
                 values.
-            position_embeddings (`torch.FloatTensor`, *optional*): position embeddings, to be added to hidden_states.
+            object_queries (`torch.FloatTensor`, *optional*):
+                Object queries (also called content embeddings), to be added to the hidden states.
             output_attentions (`bool`, *optional*):
                 Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                 returned tensors for more detail.
         """
+        position_embeddings = kwargs.pop("position_embeddings", None)
+
+        if kwargs:
+            raise ValueError(f"Unexpected arguments {kwargs.keys()}")
+
+        if position_embeddings is not None and object_queries is not None:
+            raise ValueError(
+                "Cannot specify both position_embeddings and object_queries. Please use just object_queries"
+            )
+
+        if position_embeddings is not None:
+            logger.warning_once(
+                "position_embeddings has been deprecated and will be removed in v4.34. Please use object_queries instead"
+            )
+            object_queries = position_embeddings
+
         residual = hidden_states
         hidden_states, attn_weights = self.self_attn(
             hidden_states=hidden_states,
             attention_mask=attention_mask,
-            position_embeddings=position_embeddings,
+            object_queries=object_queries,
             output_attentions=output_attentions,
         )
 
@@ -888,13 +948,14 @@ def forward(
         self,
         hidden_states: torch.Tensor,
         attention_mask: Optional[torch.Tensor] = None,
-        position_embeddings: Optional[torch.Tensor] = None,
+        object_queries: Optional[torch.Tensor] = None,
         query_position_embeddings: Optional[torch.Tensor] = None,
         query_sine_embed: Optional[torch.Tensor] = None,
         encoder_hidden_states: Optional[torch.Tensor] = None,
         encoder_attention_mask: Optional[torch.Tensor] = None,
         output_attentions: Optional[bool] = False,
         is_first: Optional[bool] = False,
+        **kwargs,
     ):
         """
         Args:
@@ -902,11 +963,11 @@ def forward(
             attention_mask (`torch.FloatTensor`): attention mask of size
                 `(batch, 1, target_len, source_len)` where padding elements are indicated by very large negative
                 values.
-            position_embeddings (`torch.FloatTensor`, *optional*):
-                position embeddings that are added to the queries and keys
+            object_queries (`torch.FloatTensor`, *optional*):
+                Object queries that are added to the queries and keys
             in the cross-attention layer.
             query_position_embeddings (`torch.FloatTensor`, *optional*):
-                position embeddings that are added to the queries and keys
+                Position embeddings that are added to the queries and keys
             in the self-attention layer.
             encoder_hidden_states (`torch.FloatTensor`):
                 cross attention input to the layer of shape `(seq_len, batch, embed_dim)`
@@ -917,6 +978,22 @@ def forward(
                 Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                 returned tensors for more detail.
         """
+        position_embeddings = kwargs.pop("position_embeddings", None)
+
+        if kwargs:
+            raise ValueError(f"Unexpected arguments {kwargs.keys()}")
+
+        if position_embeddings is not None and object_queries is not None:
+            raise ValueError(
+                "Cannot specify both position_embeddings and object_queries. Please use just object_queries"
+            )
+
+        if position_embeddings is not None:
+            logger.warning_once(
+                "position_embeddings has been deprecated and will be removed in v4.34. Please use object_queries instead"
+            )
+            object_queries = position_embeddings
+
         residual = hidden_states
 
         # ========== Begin of Self-Attention =============
@@ -957,7 +1034,7 @@ def forward(
         batch_size, num_queries, n_model = q_content.shape
         _, source_len, _ = k_content.shape
 
-        k_pos = self.ca_kpos_proj(position_embeddings)
+        k_pos = self.ca_kpos_proj(object_queries)
 
         # For the first decoder layer, we concatenate the positional embedding predicted from
         # the object query (the positional embedding) into the original query (key) in DETR.
@@ -1059,6 +1136,7 @@ class ConditionalDetrPreTrainedModel(PreTrainedModel):
     config_class = ConditionalDetrConfig
     base_model_prefix = "model"
     main_input_name = "pixel_values"
+    _no_split_modules = [r"ConditionalDetrConvEncoder", r"ConditionalDetrEncoderLayer", r"ConditionalDetrDecoderLayer"]
 
     def _init_weights(self, module):
         std = self.config.init_std
@@ -1083,10 +1161,6 @@ def _init_weights(self, module):
             if module.padding_idx is not None:
                 module.weight.data[module.padding_idx].zero_()
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, ConditionalDetrDecoder):
-            module.gradient_checkpointing = value
-
 
 CONDITIONAL_DETR_START_DOCSTRING = r"""
     This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
@@ -1120,7 +1194,7 @@ def _set_gradient_checkpointing(self, module, value=False):
 
             [What are attention masks?](../glossary#attention-mask)
 
-        decoder_attention_mask (`torch.LongTensor` of shape `(batch_size, num_queries)`, *optional*):
+        decoder_attention_mask (`torch.FloatTensor` of shape `(batch_size, num_queries)`, *optional*):
             Not used by default. Can be used to mask object queries.
         encoder_outputs (`tuple(tuple(torch.FloatTensor)`, *optional*):
             Tuple consists of (`last_hidden_state`, *optional*: `hidden_states`, *optional*: `attentions`)
@@ -1153,7 +1227,7 @@ class ConditionalDetrEncoder(ConditionalDetrPreTrainedModel):
 
     Small tweak for ConditionalDETR:
 
-    - position_embeddings are added to the forward pass.
+    - object_queries are added to the forward pass.
 
     Args:
         config: ConditionalDetrConfig
@@ -1176,10 +1250,11 @@ def forward(
         self,
         inputs_embeds=None,
         attention_mask=None,
-        position_embeddings=None,
+        object_queries=None,
         output_attentions=None,
         output_hidden_states=None,
         return_dict=None,
+        **kwargs,
     ):
         r"""
         Args:
@@ -1194,8 +1269,8 @@ def forward(
 
                 [What are attention masks?](../glossary#attention-mask)
 
-            position_embeddings (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
-                Position embeddings that are added to the queries and keys in each self-attention layer.
+            object_queries (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
+                Object queries that are added to the queries and keys in each self-attention layer.
 
             output_attentions (`bool`, *optional*):
                 Whether or not to return the attentions tensors of all attention layers. See `attentions` under
@@ -1206,6 +1281,22 @@ def forward(
             return_dict (`bool`, *optional*):
                 Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
         """
+        position_embeddings = kwargs.pop("position_embeddings", None)
+
+        if kwargs:
+            raise ValueError(f"Unexpected arguments {kwargs.keys()}")
+
+        if position_embeddings is not None and object_queries is not None:
+            raise ValueError(
+                "Cannot specify both position_embeddings and object_queries. Please use just object_queries"
+            )
+
+        if position_embeddings is not None:
+            logger.warning_once(
+                "position_embeddings has been deprecated and will be removed in v4.34. Please use object_queries instead"
+            )
+            object_queries = position_embeddings
+
         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
         output_hidden_states = (
             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
@@ -1218,7 +1309,7 @@ def forward(
         # expand attention_mask
         if attention_mask is not None:
             # [batch_size, seq_len] -> [batch_size, 1, target_seq_len, source_seq_len]
-            attention_mask = _expand_mask(attention_mask, inputs_embeds.dtype)
+            attention_mask = _prepare_4d_attention_mask(attention_mask, inputs_embeds.dtype)
 
         encoder_states = () if output_hidden_states else None
         all_attentions = () if output_attentions else None
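The shared `_prepare_4d_attention_mask` helper used a few lines above replaces the local `_expand_mask` copy deleted earlier in this diff; both turn a `(batch, seq_len)` padding mask into an additive `(batch, 1, tgt_len, seq_len)` mask. A quick shape check against the public helper:

```python
import torch
from transformers.modeling_attn_mask_utils import _prepare_4d_attention_mask

padding_mask = torch.tensor([[1, 1, 1, 0]])  # last position is padding
expanded = _prepare_4d_attention_mask(padding_mask, dtype=torch.float32)
print(expanded.shape)     # torch.Size([1, 1, 4, 4])
print(expanded[0, 0, 0])  # 0. for kept positions, finfo(float32).min for the padded one
```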
@@ -1226,15 +1317,20 @@ def forward(
             if output_hidden_states:
                 encoder_states = encoder_states + (hidden_states,)
             # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
-            dropout_probability = random.uniform(0, 1)
-            if self.training and (dropout_probability < self.layerdrop):  # skip the layer
+            to_drop = False
+            if self.training:
+                dropout_probability = torch.rand([])
+                if dropout_probability < self.layerdrop:  # skip the layer
+                    to_drop = True
+
+            if to_drop:
                 layer_outputs = (None, None)
             else:
-                # we add position_embeddings as extra input to the encoder_layer
+                # we add object_queries as extra input to the encoder_layer
                 layer_outputs = encoder_layer(
                     hidden_states,
                     attention_mask,
-                    position_embeddings=position_embeddings,
+                    object_queries=object_queries,
                     output_attentions=output_attentions,
                 )
 
@@ -1261,7 +1357,7 @@ class ConditionalDetrDecoder(ConditionalDetrPreTrainedModel):
 
     Some small tweaks for Conditional DETR:
 
-    - position_embeddings and query_position_embeddings are added to the forward pass.
+    - object_queries and query_position_embeddings are added to the forward pass.
     - if self.config.auxiliary_loss is set to True, also returns a stack of activations from all decoding layers.
 
     Args:
@@ -1294,11 +1390,12 @@ def forward(
         attention_mask=None,
         encoder_hidden_states=None,
         encoder_attention_mask=None,
-        position_embeddings=None,
+        object_queries=None,
         query_position_embeddings=None,
         output_attentions=None,
         output_hidden_states=None,
         return_dict=None,
+        **kwargs,
     ):
         r"""
         Args:
@@ -1322,7 +1419,7 @@ def forward(
                 - 1 for pixels that are real (i.e. **not masked**),
                 - 0 for pixels that are padding (i.e. **masked**).
 
-            position_embeddings (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
+            object_queries (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
                 Position embeddings that are added to the queries and keys in each cross-attention layer.
             query_position_embeddings (`torch.FloatTensor` of shape `(batch_size, num_queries, hidden_size)`):
                 , *optional*): Position embeddings that are added to the queries and keys in each self-attention layer.
@@ -1335,6 +1432,22 @@ def forward(
             return_dict (`bool`, *optional*):
                 Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
         """
+        position_embeddings = kwargs.pop("position_embeddings", None)
+
+        if kwargs:
+            raise ValueError(f"Unexpected arguments {kwargs.keys()}")
+
+        if position_embeddings is not None and object_queries is not None:
+            raise ValueError(
+                "Cannot specify both position_embeddings and object_queries. Please use just object_queries"
+            )
+
+        if position_embeddings is not None:
+            logger.warning_once(
+                "position_embeddings has been deprecated and will be removed in v4.34. Please use object_queries instead"
+            )
+            object_queries = position_embeddings
+
         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
         output_hidden_states = (
             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
@@ -1345,19 +1458,11 @@ def forward(
             hidden_states = inputs_embeds
             input_shape = inputs_embeds.size()[:-1]
 
-        combined_attention_mask = None
-
-        if attention_mask is not None and combined_attention_mask is not None:
-            # [batch_size, seq_len] -> [batch_size, 1, target_seq_len, source_seq_len]
-            combined_attention_mask = combined_attention_mask + _expand_mask(
-                attention_mask, inputs_embeds.dtype, target_len=input_shape[-1]
-            )
-
         # expand encoder attention mask
         if encoder_hidden_states is not None and encoder_attention_mask is not None:
             # [batch_size, seq_len] -> [batch_size, 1, target_seq_len, source_seq_len]
-            encoder_attention_mask = _expand_mask(
-                encoder_attention_mask, inputs_embeds.dtype, target_len=input_shape[-1]
+            encoder_attention_mask = _prepare_4d_attention_mask(
+                encoder_attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]
             )
 
         # optional intermediate hidden states
@@ -1374,15 +1479,16 @@ def forward(
         reference_points = reference_points_before_sigmoid.sigmoid().transpose(0, 1)
         obj_center = reference_points[..., :2].transpose(0, 1)
         # get sine embedding for the query vector
-        query_sine_embed_before_transformation = gen_sine_position_embeddings(obj_center)
+        query_sine_embed_before_transformation = gen_sine_position_embeddings(obj_center, self.config.d_model)
 
         for idx, decoder_layer in enumerate(self.layers):
             # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
             if output_hidden_states:
                 all_hidden_states += (hidden_states,)
-            dropout_probability = random.uniform(0, 1)
-            if self.training and (dropout_probability < self.layerdrop):
-                continue
+            if self.training:
+                dropout_probability = torch.rand([])
+                if dropout_probability < self.layerdrop:
+                    continue
             if idx == 0:
                 pos_transformation = 1
             else:
@@ -1390,18 +1496,11 @@ def forward(
             # apply transformation
             query_sine_embed = query_sine_embed_before_transformation * pos_transformation
             if self.gradient_checkpointing and self.training:
-
-                def create_custom_forward(module):
-                    def custom_forward(*inputs):
-                        return module(*inputs, output_attentions)
-
-                    return custom_forward
-
-                layer_outputs = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(decoder_layer),
+                layer_outputs = self._gradient_checkpointing_func(
+                    decoder_layer.__call__,
                     hidden_states,
-                    combined_attention_mask,
-                    position_embeddings,
+                    None,
+                    object_queries,
                     query_position_embeddings,
                     query_sine_embed,
                     encoder_hidden_states,
@@ -1412,8 +1511,8 @@ def custom_forward(*inputs):
             else:
                 layer_outputs = decoder_layer(
                     hidden_states,
-                    attention_mask=combined_attention_mask,
-                    position_embeddings=position_embeddings,
+                    attention_mask=None,
+                    object_queries=object_queries,
                     query_position_embeddings=query_position_embeddings,
                     query_sine_embed=query_sine_embed,
                     encoder_hidden_states=encoder_hidden_states,
@@ -1481,8 +1580,8 @@ def __init__(self, config: ConditionalDetrConfig):
 
         # Create backbone + positional encoding
         backbone = ConditionalDetrConvEncoder(config)
-        position_embeddings = build_position_encoding(config)
-        self.backbone = ConditionalDetrConvModel(backbone, position_embeddings)
+        object_queries = build_position_encoding(config)
+        self.backbone = ConditionalDetrConvModel(backbone, object_queries)
 
         # Create projection layer
         self.input_projection = nn.Conv2d(backbone.intermediate_channel_sizes[-1], config.d_model, kernel_size=1)
@@ -1513,16 +1612,16 @@ def unfreeze_backbone(self):
     @replace_return_docstrings(output_type=ConditionalDetrModelOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
-        pixel_values,
-        pixel_mask=None,
-        decoder_attention_mask=None,
-        encoder_outputs=None,
-        inputs_embeds=None,
-        decoder_inputs_embeds=None,
-        output_attentions=None,
-        output_hidden_states=None,
-        return_dict=None,
-    ):
+        pixel_values: torch.FloatTensor,
+        pixel_mask: Optional[torch.LongTensor] = None,
+        decoder_attention_mask: Optional[torch.LongTensor] = None,
+        encoder_outputs: Optional[torch.FloatTensor] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        decoder_inputs_embeds: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple[torch.FloatTensor], ConditionalDetrModelOutput]:
         r"""
         Returns:
 
@@ -1566,7 +1665,7 @@ def forward(
         # First, sent pixel_values + pixel_mask through Backbone to obtain the features
         # pixel_values should be of shape (batch_size, num_channels, height, width)
         # pixel_mask should be of shape (batch_size, height, width)
-        features, position_embeddings_list = self.backbone(pixel_values, pixel_mask)
+        features, object_queries_list = self.backbone(pixel_values, pixel_mask)
 
         # get final feature map and downsampled mask
         feature_map, mask = features[-1]
@@ -1577,21 +1676,21 @@ def forward(
         # Second, apply 1x1 convolution to reduce the channel dimension to d_model (256 by default)
         projected_feature_map = self.input_projection(feature_map)
 
-        # Third, flatten the feature map + position embeddings of shape NxCxHxW to NxCxHW, and permute it to NxHWxC
+        # Third, flatten the feature map + object_queries of shape NxCxHxW to NxCxHW, and permute it to NxHWxC
         # In other words, turn their shape into (batch_size, sequence_length, hidden_size)
         flattened_features = projected_feature_map.flatten(2).permute(0, 2, 1)
-        position_embeddings = position_embeddings_list[-1].flatten(2).permute(0, 2, 1)
+        object_queries = object_queries_list[-1].flatten(2).permute(0, 2, 1)
 
         flattened_mask = mask.flatten(1)
 
-        # Fourth, sent flattened_features + flattened_mask + position embeddings through encoder
+        # Fourth, send flattened_features + flattened_mask + object_queries through the encoder
         # flattened_features is a Tensor of shape (batch_size, heigth*width, hidden_size)
         # flattened_mask is a Tensor of shape (batch_size, heigth*width)
         if encoder_outputs is None:
             encoder_outputs = self.encoder(
                 inputs_embeds=flattened_features,
                 attention_mask=flattened_mask,
-                position_embeddings=position_embeddings,
+                object_queries=object_queries,
                 output_attentions=output_attentions,
                 output_hidden_states=output_hidden_states,
                 return_dict=return_dict,
@@ -1604,7 +1703,7 @@ def forward(
                 attentions=encoder_outputs[2] if len(encoder_outputs) > 2 else None,
             )
 
-        # Fifth, sent query embeddings + position embeddings through the decoder (which is conditioned on the encoder output)
+        # Fifth, send query embeddings + object_queries through the decoder (which is conditioned on the encoder output)
         query_position_embeddings = self.query_position_embeddings.weight.unsqueeze(0).repeat(batch_size, 1, 1)
         queries = torch.zeros_like(query_position_embeddings)
 
@@ -1612,7 +1711,7 @@ def forward(
         decoder_outputs = self.decoder(
             inputs_embeds=queries,
             attention_mask=None,
-            position_embeddings=position_embeddings,
+            object_queries=object_queries,
             query_position_embeddings=query_position_embeddings,
             encoder_hidden_states=encoder_outputs[0],
             encoder_attention_mask=flattened_mask,
@@ -1674,17 +1773,17 @@ def _set_aux_loss(self, outputs_class, outputs_coord):
     @replace_return_docstrings(output_type=ConditionalDetrObjectDetectionOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
-        pixel_values,
-        pixel_mask=None,
-        decoder_attention_mask=None,
-        encoder_outputs=None,
-        inputs_embeds=None,
-        decoder_inputs_embeds=None,
-        labels=None,
-        output_attentions=None,
-        output_hidden_states=None,
-        return_dict=None,
-    ):
+        pixel_values: torch.FloatTensor,
+        pixel_mask: Optional[torch.LongTensor] = None,
+        decoder_attention_mask: Optional[torch.LongTensor] = None,
+        encoder_outputs: Optional[torch.FloatTensor] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        decoder_inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[List[dict]] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple[torch.FloatTensor], ConditionalDetrObjectDetectionOutput]:
         r"""
         labels (`List[Dict]` of len `(batch_size,)`, *optional*):
             Labels for computing the bipartite matching loss. List of dicts, each dictionary containing at least the
@@ -1711,7 +1810,7 @@ def forward(
 
         >>> outputs = model(**inputs)
 
-        >>> # convert outputs (bounding boxes and class logits) to COCO API
+        >>> # convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax)
         >>> target_sizes = torch.tensor([image.size[::-1]])
         >>> results = image_processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[
         ...     0
@@ -1780,8 +1879,8 @@ def forward(
                 intermediate = outputs.intermediate_hidden_states if return_dict else outputs[4]
                 outputs_class = self.class_labels_classifier(intermediate)
 
-                for lvl in range(hs.shape[0]):
-                    tmp = self.bbox_predictor(hs[lvl])
+                for lvl in range(intermediate.shape[0]):
+                    tmp = self.bbox_predictor(intermediate[lvl])
                     tmp[..., :2] += reference_before_sigmoid
                     outputs_coord = tmp.sigmoid()
                     outputs_coords.append(outputs_coord)
@@ -1858,17 +1957,17 @@ def __init__(self, config: ConditionalDetrConfig):
     @replace_return_docstrings(output_type=ConditionalDetrSegmentationOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
-        pixel_values,
-        pixel_mask=None,
-        decoder_attention_mask=None,
-        encoder_outputs=None,
-        inputs_embeds=None,
-        decoder_inputs_embeds=None,
-        labels=None,
-        output_attentions=None,
-        output_hidden_states=None,
-        return_dict=None,
-    ):
+        pixel_values: torch.FloatTensor,
+        pixel_mask: Optional[torch.LongTensor] = None,
+        decoder_attention_mask: Optional[torch.FloatTensor] = None,
+        encoder_outputs: Optional[torch.FloatTensor] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        decoder_inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[List[dict]] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple[torch.FloatTensor], ConditionalDetrSegmentationOutput]:
         r"""
         labels (`List[Dict]` of len `(batch_size,)`, *optional*):
             Labels for computing the bipartite matching loss, DICE/F-1 loss and Focal loss. List of dicts, each
@@ -1928,29 +2027,29 @@ def forward(
         if pixel_mask is None:
             pixel_mask = torch.ones((batch_size, height, width), device=device)
 
-        # First, get list of feature maps and position embeddings
-        features, position_embeddings_list = self.conditional_detr.model.backbone(pixel_values, pixel_mask=pixel_mask)
+        # First, get list of feature maps and object_queries
+        features, object_queries_list = self.conditional_detr.model.backbone(pixel_values, pixel_mask=pixel_mask)
 
         # Second, apply 1x1 convolution to reduce the channel dimension to d_model (256 by default)
         feature_map, mask = features[-1]
         batch_size, num_channels, height, width = feature_map.shape
         projected_feature_map = self.conditional_detr.model.input_projection(feature_map)
 
-        # Third, flatten the feature map + position embeddings of shape NxCxHxW to NxCxHW, and permute it to NxHWxC
+        # Third, flatten the feature map + object_queries of shape NxCxHxW to NxCxHW, and permute it to NxHWxC
         # In other words, turn their shape into (batch_size, sequence_length, hidden_size)
         flattened_features = projected_feature_map.flatten(2).permute(0, 2, 1)
-        position_embeddings = position_embeddings_list[-1].flatten(2).permute(0, 2, 1)
+        object_queries = object_queries_list[-1].flatten(2).permute(0, 2, 1)
 
         flattened_mask = mask.flatten(1)
 
-        # Fourth, sent flattened_features + flattened_mask + position embeddings through encoder
+        # Fourth, send flattened_features + flattened_mask + object_queries through the encoder
         # flattened_features is a Tensor of shape (batch_size, heigth*width, hidden_size)
         # flattened_mask is a Tensor of shape (batch_size, heigth*width)
         if encoder_outputs is None:
             encoder_outputs = self.conditional_detr.model.encoder(
                 inputs_embeds=flattened_features,
                 attention_mask=flattened_mask,
-                position_embeddings=position_embeddings,
+                object_queries=object_queries,
                 output_attentions=output_attentions,
                 output_hidden_states=output_hidden_states,
                 return_dict=return_dict,
@@ -1963,7 +2062,7 @@ def forward(
                 attentions=encoder_outputs[2] if len(encoder_outputs) > 2 else None,
             )
 
-        # Fifth, sent query embeddings + position embeddings through the decoder (which is conditioned on the encoder output)
+        # Fifth, send query embeddings + object_queries through the decoder (which is conditioned on the encoder output)
         query_position_embeddings = self.conditional_detr.model.query_position_embeddings.weight.unsqueeze(0).repeat(
             batch_size, 1, 1
         )
@@ -1973,7 +2072,7 @@ def forward(
         decoder_outputs = self.conditional_detr.model.decoder(
             inputs_embeds=queries,
             attention_mask=None,
-            position_embeddings=position_embeddings,
+            object_queries=object_queries,
             query_position_embeddings=query_position_embeddings,
             encoder_hidden_states=encoder_outputs[0],
             encoder_attention_mask=flattened_mask,
@@ -2024,9 +2123,9 @@ def forward(
             outputs_loss["pred_masks"] = pred_masks
             if self.config.auxiliary_loss:
                 intermediate = decoder_outputs.intermediate_hidden_states if return_dict else decoder_outputs[-1]
-                outputs_class = self.class_labels_classifier(intermediate)
-                outputs_coord = self.bbox_predictor(intermediate).sigmoid()
-                auxiliary_outputs = self._set_aux_loss(outputs_class, outputs_coord)
+                outputs_class = self.conditional_detr.class_labels_classifier(intermediate)
+                outputs_coord = self.conditional_detr.bbox_predictor(intermediate).sigmoid()
+                auxiliary_outputs = self.conditional_detr._set_aux_loss(outputs_class, outputs_coord)
                 outputs_loss["auxiliary_outputs"] = auxiliary_outputs
 
             loss_dict = criterion(outputs_loss, labels)
@@ -2090,13 +2189,13 @@ def __init__(self, dim, fpn_dims, context_dim):
         self.lay1 = nn.Conv2d(dim, dim, 3, padding=1)
         self.gn1 = nn.GroupNorm(8, dim)
         self.lay2 = nn.Conv2d(dim, inter_dims[1], 3, padding=1)
-        self.gn2 = nn.GroupNorm(8, inter_dims[1])
+        self.gn2 = nn.GroupNorm(min(8, inter_dims[1]), inter_dims[1])
         self.lay3 = nn.Conv2d(inter_dims[1], inter_dims[2], 3, padding=1)
-        self.gn3 = nn.GroupNorm(8, inter_dims[2])
+        self.gn3 = nn.GroupNorm(min(8, inter_dims[2]), inter_dims[2])
         self.lay4 = nn.Conv2d(inter_dims[2], inter_dims[3], 3, padding=1)
-        self.gn4 = nn.GroupNorm(8, inter_dims[3])
+        self.gn4 = nn.GroupNorm(min(8, inter_dims[3]), inter_dims[3])
         self.lay5 = nn.Conv2d(inter_dims[3], inter_dims[4], 3, padding=1)
-        self.gn5 = nn.GroupNorm(8, inter_dims[4])
+        self.gn5 = nn.GroupNorm(min(8, inter_dims[4]), inter_dims[4])
         self.out_lay = nn.Conv2d(inter_dims[4], 1, 3, padding=1)
 
         self.dim = dim
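The `min(8, ...)` guard above matters because `nn.GroupNorm` requires the channel count to be divisible by the group count, which very small mask-head widths would otherwise violate:

```python
import torch
from torch import nn

channels = 4
try:
    nn.GroupNorm(8, channels)  # raises: num_channels must be divisible by num_groups
except ValueError as err:
    print("fails:", err)

layer = nn.GroupNorm(min(8, channels), channels)
print(layer(torch.randn(1, channels, 16, 16)).shape)  # torch.Size([1, 4, 16, 16])
```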
@@ -2413,11 +2512,12 @@ def forward(self, outputs, targets):
         # Compute the average number of target boxes across all nodes, for normalization purposes
         num_boxes = sum(len(t["class_labels"]) for t in targets)
         num_boxes = torch.as_tensor([num_boxes], dtype=torch.float, device=next(iter(outputs.values())).device)
-        # (Niels): comment out function below, distributed training to be added
-        # if is_dist_avail_and_initialized():
-        #     torch.distributed.all_reduce(num_boxes)
-        # (Niels) in original implementation, num_boxes is divided by get_world_size()
-        num_boxes = torch.clamp(num_boxes, min=1).item()
+
+        world_size = 1
+        if PartialState._shared_state != {}:
+            num_boxes = reduce(num_boxes)
+            world_size = PartialState().num_processes
+        num_boxes = torch.clamp(num_boxes / world_size, min=1).item()
 
         # Compute all the requested losses
         losses = {}
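The hunk above replaces the commented-out `torch.distributed` normalization with `accelerate` primitives: when `PartialState` has been initialized, `num_boxes` is summed across processes via `reduce` and divided by the number of processes, otherwise the single-process behaviour is unchanged. A self-contained sketch of the same normalization using plain `torch.distributed` (so it runs without `accelerate`; the box count is made up):

```python
import torch
import torch.distributed as dist


def normalize_num_boxes(num_boxes: torch.Tensor) -> float:
    """Average the target-box count across processes, clamped to at least 1."""
    world_size = 1
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(num_boxes)          # sum over all processes
        world_size = dist.get_world_size()
    return torch.clamp(num_boxes / world_size, min=1).item()


print(normalize_num_boxes(torch.as_tensor([7.0])))  # 7.0 on a single process
```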
diff --git a/src/transformers/models/convbert/configuration_convbert.py b/src/transformers/models/convbert/configuration_convbert.py
index 4c1032f4ffa0fd..62019796664660 100644
--- a/src/transformers/models/convbert/configuration_convbert.py
+++ b/src/transformers/models/convbert/configuration_convbert.py
@@ -61,7 +61,7 @@ class ConvBertConfig(PretrainedConfig):
             The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
             `"relu"`, `"selu"` and `"gelu_new"` are supported.
         hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
-            The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
+            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
         attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
             The dropout ratio for the attention probabilities.
         max_position_embeddings (`int`, *optional*, defaults to 512):
@@ -96,6 +96,7 @@ class ConvBertConfig(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "convbert"
 
     def __init__(
diff --git a/src/transformers/models/convbert/modeling_convbert.py b/src/transformers/models/convbert/modeling_convbert.py
index 655ea55eeb56ae..032b9d0ce18ba3 100755
--- a/src/transformers/models/convbert/modeling_convbert.py
+++ b/src/transformers/models/convbert/modeling_convbert.py
@@ -191,7 +191,9 @@ def __init__(self, config):
         self.LayerNorm = nn.LayerNorm(config.embedding_size, eps=config.layer_norm_eps)
         self.dropout = nn.Dropout(config.hidden_dropout_prob)
         # position_ids (1, len position emb) is contiguous in memory and exported when serialized
-        self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
+        self.register_buffer(
+            "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False
+        )
         self.register_buffer(
             "token_type_ids", torch.zeros(self.position_ids.size(), dtype=torch.long), persistent=False
         )
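Registering `position_ids` with `persistent=False` keeps the buffer available in `forward` but leaves it out of the `state_dict`, which is also why the `_keys_to_ignore_on_load_missing = [r"position_ids"]` entries disappear later in this file. A minimal illustration (the module is a made-up stand-in, not ConvBERT):

```python
import torch
from torch import nn


class WithPositions(nn.Module):
    def __init__(self, max_positions: int = 512):
        super().__init__()
        # non-persistent: usable at runtime, never serialized
        self.register_buffer(
            "position_ids", torch.arange(max_positions).expand((1, -1)), persistent=False
        )


module = WithPositions()
print("position_ids" in module.state_dict())  # False -> no missing-key warning on load
```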
@@ -245,8 +247,6 @@ class ConvBertPreTrainedModel(PreTrainedModel):
     load_tf_weights = load_tf_weights_in_convbert
     base_model_prefix = "convbert"
     supports_gradient_checkpointing = True
-    _keys_to_ignore_on_load_missing = [r"position_ids"]
-    _keys_to_ignore_on_load_unexpected = [r"convbert.embeddings_project.weight", r"convbert.embeddings_project.bias"]
 
     def _init_weights(self, module):
         """Initialize the weights"""
@@ -264,10 +264,6 @@ def _init_weights(self, module):
             module.bias.data.zero_()
             module.weight.data.fill_(1.0)
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, ConvBertEncoder):
-            module.gradient_checkpointing = value
-
 
 class SeparableConv1D(nn.Module):
     """This class implements separable convolution, i.e. a depthwise and a pointwise layer"""
@@ -316,7 +312,7 @@ def __init__(self, config):
         if config.hidden_size % self.num_attention_heads != 0:
             raise ValueError("hidden_size should be divisible by num_attention_heads")
 
-        self.attention_head_size = config.hidden_size // config.num_attention_heads
+        self.attention_head_size = (config.hidden_size // self.num_attention_heads) // 2
         self.all_head_size = self.num_attention_heads * self.attention_head_size
 
         self.query = nn.Linear(config.hidden_size, self.all_head_size)
@@ -413,7 +409,10 @@ def forward(
         conv_out = torch.reshape(conv_out_layer, [batch_size, -1, self.num_attention_heads, self.attention_head_size])
         context_layer = torch.cat([context_layer, conv_out], 2)
 
-        new_context_layer_shape = context_layer.size()[:-2] + (self.head_ratio * self.all_head_size,)
+        # concatenate the convolution and self-attention context outputs; flattened width is num_heads * head_size * 2
+        new_context_layer_shape = context_layer.size()[:-2] + (
+            self.num_attention_heads * self.attention_head_size * 2,
+        )
         context_layer = context_layer.view(*new_context_layer_shape)
 
         outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
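The halved `attention_head_size` reflects ConvBERT's mixed attention: each head's width is split between ordinary self-attention and the span-based dynamic convolution branch, and concatenating the two halves restores the width the following dense projection expects. A small shape check under that reading (plain integers, `head_ratio` assumed to be 1):

```python
hidden_size, num_attention_heads = 768, 12

# half of each head goes to self-attention, the other half to the convolution branch
attention_head_size = (hidden_size // num_attention_heads) // 2  # 32
all_head_size = num_attention_heads * attention_head_size        # 384 per branch

# after torch.cat([context_layer, conv_out], dim=2) the flattened width doubles again
new_last_dim = num_attention_heads * attention_head_size * 2
assert new_last_dim == hidden_size  # 768, matching the output projection
```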
@@ -629,20 +628,14 @@ def forward(
             layer_head_mask = head_mask[i] if head_mask is not None else None
 
             if self.gradient_checkpointing and self.training:
-
-                def create_custom_forward(module):
-                    def custom_forward(*inputs):
-                        return module(*inputs, output_attentions)
-
-                    return custom_forward
-
-                layer_outputs = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(layer_module),
+                layer_outputs = self._gradient_checkpointing_func(
+                    layer_module.__call__,
                     hidden_states,
                     attention_mask,
                     layer_head_mask,
                     encoder_hidden_states,
                     encoder_attention_mask,
+                    output_attentions,
                 )
             else:
                 layer_outputs = layer_module(
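The checkpointing hunk drops the local `create_custom_forward` closure in favour of the `_gradient_checkpointing_func` hook that the base `PreTrainedModel` now installs, passing `output_attentions` as an explicit argument instead of capturing it in a closure. A hedged, standalone sketch of the underlying mechanism with plain `torch.utils.checkpoint` (the layer is a stand-in, not ConvBERT's):

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

layer = nn.Sequential(nn.Linear(16, 16), nn.GELU(), nn.Linear(16, 16))
hidden_states = torch.randn(2, 16, requires_grad=True)

# recompute the layer's activations during backward instead of storing them
out = checkpoint(layer.__call__, hidden_states, use_reentrant=False)
out.sum().backward()
print(hidden_states.grad.shape)  # torch.Size([2, 16])
```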
@@ -762,8 +755,6 @@ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
     CONVBERT_START_DOCSTRING,
 )
 class ConvBertModel(ConvBertPreTrainedModel):
-    _keys_to_ignore_on_load_missing = ["embeddings.position_ids"]
-
     def __init__(self, config):
         super().__init__(config)
         self.embeddings = ConvBertEmbeddings(config)
@@ -817,6 +808,7 @@ def forward(
         if input_ids is not None and inputs_embeds is not None:
             raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
         elif input_ids is not None:
+            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
             input_shape = input_ids.size()
         elif inputs_embeds is not None:
             input_shape = inputs_embeds.size()[:-1]
@@ -864,12 +856,13 @@ class ConvBertGeneratorPredictions(nn.Module):
     def __init__(self, config):
         super().__init__()
 
+        self.activation = get_activation("gelu")
         self.LayerNorm = nn.LayerNorm(config.embedding_size, eps=config.layer_norm_eps)
         self.dense = nn.Linear(config.hidden_size, config.embedding_size)
 
     def forward(self, generator_hidden_states: torch.FloatTensor) -> torch.FloatTensor:
         hidden_states = self.dense(generator_hidden_states)
-        hidden_states = get_activation("gelu")(hidden_states)
+        hidden_states = self.activation(hidden_states)
         hidden_states = self.LayerNorm(hidden_states)
 
         return hidden_states
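Resolving `get_activation("gelu")` once in `__init__` and reusing `self.activation` in `forward` avoids re-building the activation on every call. The same pattern in isolation (assumes `transformers` is installed; any callable would do in its place):

```python
from torch import nn
from transformers.activations import get_activation


class GeneratorPredictions(nn.Module):
    def __init__(self, hidden_size: int = 768, embedding_size: int = 768):
        super().__init__()
        self.activation = get_activation("gelu")  # resolved once, reused every call
        self.dense = nn.Linear(hidden_size, embedding_size)
        self.LayerNorm = nn.LayerNorm(embedding_size)

    def forward(self, hidden_states):
        return self.LayerNorm(self.activation(self.dense(hidden_states)))
```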
@@ -877,7 +870,7 @@ def forward(self, generator_hidden_states: torch.FloatTensor) -> torch.FloatTens
 
 @add_start_docstrings("""ConvBERT Model with a `language modeling` head on top.""", CONVBERT_START_DOCSTRING)
 class ConvBertForMaskedLM(ConvBertPreTrainedModel):
-    _keys_to_ignore_on_load_missing = ["embeddings.position_ids", "generator.lm_head.weight"]
+    _tied_weights_keys = ["generator.lm_head.weight"]
 
     def __init__(self, config):
         super().__init__(config)
@@ -988,8 +981,6 @@ def forward(self, hidden_states: torch.Tensor, **kwargs) -> torch.Tensor:
     CONVBERT_START_DOCSTRING,
 )
 class ConvBertForSequenceClassification(ConvBertPreTrainedModel):
-    _keys_to_ignore_on_load_missing = ["embeddings.position_ids"]
-
     def __init__(self, config):
         super().__init__(config)
         self.num_labels = config.num_labels
@@ -1085,8 +1076,6 @@ def forward(
     CONVBERT_START_DOCSTRING,
 )
 class ConvBertForMultipleChoice(ConvBertPreTrainedModel):
-    _keys_to_ignore_on_load_missing = ["embeddings.position_ids"]
-
     def __init__(self, config):
         super().__init__(config)
 
@@ -1180,8 +1169,6 @@ def forward(
     CONVBERT_START_DOCSTRING,
 )
 class ConvBertForTokenClassification(ConvBertPreTrainedModel):
-    _keys_to_ignore_on_load_missing = ["embeddings.position_ids"]
-
     def __init__(self, config):
         super().__init__(config)
         self.num_labels = config.num_labels
@@ -1263,8 +1250,6 @@ def forward(
     CONVBERT_START_DOCSTRING,
 )
 class ConvBertForQuestionAnswering(ConvBertPreTrainedModel):
-    _keys_to_ignore_on_load_missing = ["embeddings.position_ids"]
-
     def __init__(self, config):
         super().__init__(config)
 
diff --git a/src/transformers/models/convbert/modeling_tf_convbert.py b/src/transformers/models/convbert/modeling_tf_convbert.py
index 3976be69eb5b86..e6855c68e2f8a9 100644
--- a/src/transformers/models/convbert/modeling_tf_convbert.py
+++ b/src/transformers/models/convbert/modeling_tf_convbert.py
@@ -15,6 +15,8 @@
 """ TF 2.0 ConvBERT model."""
 
 
+from __future__ import annotations
+
 from typing import Optional, Tuple, Union
 
 import numpy as np
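The new `from __future__ import annotations` import is what makes the later signature changes from `Optional[tf.Tensor]` to `tf.Tensor | None` safe on older Python versions: annotations are stored as strings and never evaluated at definition time. A minimal illustration:

```python
from __future__ import annotations  # PEP 563: annotations become strings

import numpy as np


def call(input_ids: np.ndarray | None = None, training: bool = False) -> np.ndarray | None:
    # the union syntax above is never evaluated, so this runs on Python well before 3.10
    return input_ids


print(call.__annotations__["input_ids"])  # 'np.ndarray | None'
```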
@@ -39,12 +41,12 @@
     TFSequenceSummary,
     TFTokenClassificationLoss,
     get_initializer,
+    keras,
     keras_serializable,
     unpack_inputs,
 )
-from ...tf_utils import shape_list, stable_softmax
+from ...tf_utils import check_embeddings_within_bounds, shape_list, stable_softmax
 from ...utils import (
-    MULTIPLE_CHOICE_DUMMY_INPUTS,
     add_code_sample_docstrings,
     add_start_docstrings,
     add_start_docstrings_to_model_forward,
@@ -67,7 +69,7 @@
 
 
 # Copied from transformers.models.albert.modeling_tf_albert.TFAlbertEmbeddings with Albert->ConvBert
-class TFConvBertEmbeddings(tf.keras.layers.Layer):
+class TFConvBertEmbeddings(keras.layers.Layer):
     """Construct the embeddings from word, position and token_type embeddings."""
 
     def __init__(self, config: ConvBertConfig, **kwargs):
@@ -77,10 +79,10 @@ def __init__(self, config: ConvBertConfig, **kwargs):
         self.embedding_size = config.embedding_size
         self.max_position_embeddings = config.max_position_embeddings
         self.initializer_range = config.initializer_range
-        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
-        self.dropout = tf.keras.layers.Dropout(rate=config.hidden_dropout_prob)
+        self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob)
 
-    def build(self, input_shape: tf.TensorShape):
+    def build(self, input_shape=None):
         with tf.name_scope("word_embeddings"):
             self.weight = self.add_weight(
                 name="weight",
@@ -102,7 +104,12 @@ def build(self, input_shape: tf.TensorShape):
                 initializer=get_initializer(self.initializer_range),
             )
 
-        super().build(input_shape)
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "LayerNorm", None) is not None:
+            with tf.name_scope(self.LayerNorm.name):
+                self.LayerNorm.build([None, None, self.config.embedding_size])
 
     # Copied from transformers.models.bert.modeling_tf_bert.TFBertEmbeddings.call
     def call(
@@ -124,16 +131,7 @@ def call(
             raise ValueError("Need to provide either `input_ids` or `input_embeds`.")
 
         if input_ids is not None:
-            # Note: tf.gather, on which the embedding layer is based, won't check positive out of bound
-            # indices on GPU, returning zeros instead. This is a dangerous silent behavior.
-            tf.debugging.assert_less(
-                input_ids,
-                tf.cast(self.config.vocab_size, dtype=input_ids.dtype),
-                message=(
-                    "input_ids must be smaller than the embedding layer's input dimension (got"
-                    f" {tf.math.reduce_max(input_ids)} >= {self.config.vocab_size})"
-                ),
-            )
+            check_embeddings_within_bounds(input_ids, self.config.vocab_size)
             inputs_embeds = tf.gather(params=self.weight, indices=input_ids)
 
         input_shape = shape_list(inputs_embeds)[:-1]
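The inline `tf.debugging.assert_less` guard moves into the shared `check_embeddings_within_bounds` helper imported from `...tf_utils` earlier in this file; the intent is unchanged, since `tf.gather` silently returns zeros for out-of-range indices on GPU. A hedged sketch of what such a guard amounts to (not the exact helper implementation):

```python
import tensorflow as tf


def check_ids_within_bounds(input_ids: tf.Tensor, embed_dim: int, tensor_name: str = "input_ids") -> None:
    # raise eagerly instead of letting tf.gather return silent zeros on GPU
    tf.debugging.assert_less(
        input_ids,
        tf.cast(embed_dim, dtype=input_ids.dtype),
        message=f"Maximum {tensor_name} must be smaller than the embedding layer's input dimension ({embed_dim}).",
    )


check_ids_within_bounds(tf.constant([[1, 2, 3]]), embed_dim=30522)  # passes silently
```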
@@ -155,7 +153,7 @@ def call(
         return final_embeddings
 
 
-class TFConvBertSelfAttention(tf.keras.layers.Layer):
+class TFConvBertSelfAttention(keras.layers.Layer):
     def __init__(self, config, **kwargs):
         super().__init__(**kwargs)
 
@@ -181,17 +179,17 @@ def __init__(self, config, **kwargs):
 
         self.attention_head_size = config.hidden_size // config.num_attention_heads
         self.all_head_size = self.num_attention_heads * self.attention_head_size
-        self.query = tf.keras.layers.Dense(
+        self.query = keras.layers.Dense(
             self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="query"
         )
-        self.key = tf.keras.layers.Dense(
+        self.key = keras.layers.Dense(
             self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="key"
         )
-        self.value = tf.keras.layers.Dense(
+        self.value = keras.layers.Dense(
             self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="value"
         )
 
-        self.key_conv_attn_layer = tf.keras.layers.SeparableConv1D(
+        self.key_conv_attn_layer = keras.layers.SeparableConv1D(
             self.all_head_size,
             self.conv_kernel_size,
             padding="same",
@@ -201,21 +199,22 @@ def __init__(self, config, **kwargs):
             name="key_conv_attn_layer",
         )
 
-        self.conv_kernel_layer = tf.keras.layers.Dense(
+        self.conv_kernel_layer = keras.layers.Dense(
             self.num_attention_heads * self.conv_kernel_size,
             activation=None,
             name="conv_kernel_layer",
             kernel_initializer=get_initializer(config.initializer_range),
         )
 
-        self.conv_out_layer = tf.keras.layers.Dense(
+        self.conv_out_layer = keras.layers.Dense(
             self.all_head_size,
             activation=None,
             name="conv_out_layer",
             kernel_initializer=get_initializer(config.initializer_range),
         )
 
-        self.dropout = tf.keras.layers.Dropout(config.attention_probs_dropout_prob)
+        self.dropout = keras.layers.Dropout(config.attention_probs_dropout_prob)
+        self.config = config
 
     def transpose_for_scores(self, x, batch_size):
         # Reshape from [batch_size, seq_length, all_head_size] to [batch_size, seq_length, num_attention_heads, attention_head_size]
@@ -305,16 +304,40 @@ def call(self, hidden_states, attention_mask, head_mask, output_attentions, trai
 
         return outputs
 
-
-class TFConvBertSelfOutput(tf.keras.layers.Layer):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "query", None) is not None:
+            with tf.name_scope(self.query.name):
+                self.query.build([None, None, self.config.hidden_size])
+        if getattr(self, "key", None) is not None:
+            with tf.name_scope(self.key.name):
+                self.key.build([None, None, self.config.hidden_size])
+        if getattr(self, "value", None) is not None:
+            with tf.name_scope(self.value.name):
+                self.value.build([None, None, self.config.hidden_size])
+        if getattr(self, "key_conv_attn_layer", None) is not None:
+            with tf.name_scope(self.key_conv_attn_layer.name):
+                self.key_conv_attn_layer.build([None, None, self.config.hidden_size])
+        if getattr(self, "conv_kernel_layer", None) is not None:
+            with tf.name_scope(self.conv_kernel_layer.name):
+                self.conv_kernel_layer.build([None, None, self.all_head_size])
+        if getattr(self, "conv_out_layer", None) is not None:
+            with tf.name_scope(self.conv_out_layer.name):
+                self.conv_out_layer.build([None, None, self.config.hidden_size])
+
+
+class TFConvBertSelfOutput(keras.layers.Layer):
     def __init__(self, config, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(
+        self.dense = keras.layers.Dense(
             config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
         )
-        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
-        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)
+        self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.dropout = keras.layers.Dropout(config.hidden_dropout_prob)
+        self.config = config
 
     def call(self, hidden_states, input_tensor, training=False):
         hidden_states = self.dense(hidden_states)
@@ -323,8 +346,19 @@ def call(self, hidden_states, input_tensor, training=False):
 
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+        if getattr(self, "LayerNorm", None) is not None:
+            with tf.name_scope(self.LayerNorm.name):
+                self.LayerNorm.build([None, None, self.config.hidden_size])
 
-class TFConvBertAttention(tf.keras.layers.Layer):
+
+class TFConvBertAttention(keras.layers.Layer):
     def __init__(self, config, **kwargs):
         super().__init__(**kwargs)
 
@@ -343,8 +377,19 @@ def call(self, input_tensor, attention_mask, head_mask, output_attentions, train
 
         return outputs
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "self_attention", None) is not None:
+            with tf.name_scope(self.self_attention.name):
+                self.self_attention.build(None)
+        if getattr(self, "dense_output", None) is not None:
+            with tf.name_scope(self.dense_output.name):
+                self.dense_output.build(None)
+
 
-class GroupedLinearLayer(tf.keras.layers.Layer):
+class GroupedLinearLayer(keras.layers.Layer):
     def __init__(self, input_size, output_size, num_groups, kernel_initializer, **kwargs):
         super().__init__(**kwargs)
         self.input_size = input_size
@@ -354,7 +399,7 @@ def __init__(self, input_size, output_size, num_groups, kernel_initializer, **kw
         self.group_in_dim = self.input_size // self.num_groups
         self.group_out_dim = self.output_size // self.num_groups
 
-    def build(self, input_shape):
+    def build(self, input_shape=None):
         self.kernel = self.add_weight(
             "kernel",
             shape=[self.group_out_dim, self.group_in_dim, self.num_groups],
@@ -365,6 +410,7 @@ def build(self, input_shape):
         self.bias = self.add_weight(
             "bias", shape=[self.output_size], initializer=self.kernel_initializer, dtype=self.dtype, trainable=True
         )
+        super().build(input_shape)
 
     def call(self, hidden_states):
         batch_size = shape_list(hidden_states)[0]
@@ -376,11 +422,11 @@ def call(self, hidden_states):
         return x
 
 
-class TFConvBertIntermediate(tf.keras.layers.Layer):
+class TFConvBertIntermediate(keras.layers.Layer):
     def __init__(self, config, **kwargs):
         super().__init__(**kwargs)
         if config.num_groups == 1:
-            self.dense = tf.keras.layers.Dense(
+            self.dense = keras.layers.Dense(
                 config.intermediate_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
             )
         else:
@@ -396,6 +442,7 @@ def __init__(self, config, **kwargs):
             self.intermediate_act_fn = get_tf_activation(config.hidden_act)
         else:
             self.intermediate_act_fn = config.hidden_act
+        self.config = config
 
     def call(self, hidden_states):
         hidden_states = self.dense(hidden_states)
@@ -403,13 +450,21 @@ def call(self, hidden_states):
 
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
 
-class TFConvBertOutput(tf.keras.layers.Layer):
+
+class TFConvBertOutput(keras.layers.Layer):
     def __init__(self, config, **kwargs):
         super().__init__(**kwargs)
 
         if config.num_groups == 1:
-            self.dense = tf.keras.layers.Dense(
+            self.dense = keras.layers.Dense(
                 config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
             )
         else:
@@ -420,8 +475,9 @@ def __init__(self, config, **kwargs):
                 kernel_initializer=get_initializer(config.initializer_range),
                 name="dense",
             )
-        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
-        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)
+        self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.dropout = keras.layers.Dropout(config.hidden_dropout_prob)
+        self.config = config
 
     def call(self, hidden_states, input_tensor, training=False):
         hidden_states = self.dense(hidden_states)
@@ -430,8 +486,19 @@ def call(self, hidden_states, input_tensor, training=False):
 
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "LayerNorm", None) is not None:
+            with tf.name_scope(self.LayerNorm.name):
+                self.LayerNorm.build([None, None, self.config.hidden_size])
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.intermediate_size])
+
 
-class TFConvBertLayer(tf.keras.layers.Layer):
+class TFConvBertLayer(keras.layers.Layer):
     def __init__(self, config, **kwargs):
         super().__init__(**kwargs)
 
@@ -450,8 +517,22 @@ def call(self, hidden_states, attention_mask, head_mask, output_attentions, trai
 
         return outputs
 
-
-class TFConvBertEncoder(tf.keras.layers.Layer):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "attention", None) is not None:
+            with tf.name_scope(self.attention.name):
+                self.attention.build(None)
+        if getattr(self, "intermediate", None) is not None:
+            with tf.name_scope(self.intermediate.name):
+                self.intermediate.build(None)
+        if getattr(self, "bert_output", None) is not None:
+            with tf.name_scope(self.bert_output.name):
+                self.bert_output.build(None)
+
+
+class TFConvBertEncoder(keras.layers.Layer):
     def __init__(self, config, **kwargs):
         super().__init__(**kwargs)
 
@@ -493,12 +574,21 @@ def call(
             last_hidden_state=hidden_states, hidden_states=all_hidden_states, attentions=all_attentions
         )
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "layer", None) is not None:
+            for layer in self.layer:
+                with tf.name_scope(layer.name):
+                    layer.build(None)
+
 
-class TFConvBertPredictionHeadTransform(tf.keras.layers.Layer):
+class TFConvBertPredictionHeadTransform(keras.layers.Layer):
     def __init__(self, config, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(
+        self.dense = keras.layers.Dense(
             config.embedding_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
         )
 
@@ -507,7 +597,8 @@ def __init__(self, config, **kwargs):
         else:
             self.transform_act_fn = config.hidden_act
 
-        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.config = config
 
     def call(self, hidden_states):
         hidden_states = self.dense(hidden_states)
@@ -516,9 +607,20 @@ def call(self, hidden_states):
 
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+        if getattr(self, "LayerNorm", None) is not None:
+            with tf.name_scope(self.LayerNorm.name):
+                self.LayerNorm.build([None, None, self.config.hidden_size])
+
 
 @keras_serializable
-class TFConvBertMainLayer(tf.keras.layers.Layer):
+class TFConvBertMainLayer(keras.layers.Layer):
     config_class = ConvBertConfig
 
     def __init__(self, config, **kwargs):
@@ -527,7 +629,7 @@ def __init__(self, config, **kwargs):
         self.embeddings = TFConvBertEmbeddings(config, name="embeddings")
 
         if config.embedding_size != config.hidden_size:
-            self.embeddings_project = tf.keras.layers.Dense(config.hidden_size, name="embeddings_project")
+            self.embeddings_project = keras.layers.Dense(config.hidden_size, name="embeddings_project")
 
         self.encoder = TFConvBertEncoder(config, name="encoder")
         self.config = config
@@ -623,6 +725,20 @@ def call(
 
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "embeddings", None) is not None:
+            with tf.name_scope(self.embeddings.name):
+                self.embeddings.build(None)
+        if getattr(self, "encoder", None) is not None:
+            with tf.name_scope(self.encoder.name):
+                self.encoder.build(None)
+        if getattr(self, "embeddings_project", None) is not None:
+            with tf.name_scope(self.embeddings_project.name):
+                self.embeddings_project.build([None, None, self.config.embedding_size])
+
 
 class TFConvBertPreTrainedModel(TFPreTrainedModel):
     """
@@ -640,7 +756,7 @@ class TFConvBertPreTrainedModel(TFPreTrainedModel):
     library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
     etc.)
 
-    This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
+    This model is also a [keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
     as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage and
     behavior.
 
@@ -751,12 +867,12 @@ def __init__(self, config, *inputs, **kwargs):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
+        input_ids: TFModelInputType | None = None,
         attention_mask: Optional[Union[np.array, tf.Tensor]] = None,
         token_type_ids: Optional[Union[np.array, tf.Tensor]] = None,
         position_ids: Optional[Union[np.array, tf.Tensor]] = None,
         head_mask: Optional[Union[np.array, tf.Tensor]] = None,
-        inputs_embeds: Optional[tf.Tensor] = None,
+        inputs_embeds: tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
@@ -777,14 +893,16 @@ def call(
 
         return outputs
 
-    def serving_output(self, output):
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFBaseModelOutput(last_hidden_state=output.last_hidden_state, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "convbert", None) is not None:
+            with tf.name_scope(self.convbert.name):
+                self.convbert.build(None)
 
 
-class TFConvBertMaskedLMHead(tf.keras.layers.Layer):
+class TFConvBertMaskedLMHead(keras.layers.Layer):
     def __init__(self, config, input_embeddings, **kwargs):
         super().__init__(**kwargs)
 
@@ -821,12 +939,13 @@ def call(self, hidden_states):
         return hidden_states
 
 
-class TFConvBertGeneratorPredictions(tf.keras.layers.Layer):
+class TFConvBertGeneratorPredictions(keras.layers.Layer):
     def __init__(self, config, **kwargs):
         super().__init__(**kwargs)
 
-        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
-        self.dense = tf.keras.layers.Dense(config.embedding_size, name="dense")
+        self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.dense = keras.layers.Dense(config.embedding_size, name="dense")
+        self.config = config
 
     def call(self, generator_hidden_states, training=False):
         hidden_states = self.dense(generator_hidden_states)
@@ -835,6 +954,17 @@ def call(self, generator_hidden_states, training=False):
 
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "LayerNorm", None) is not None:
+            with tf.name_scope(self.LayerNorm.name):
+                self.LayerNorm.build([None, None, self.config.embedding_size])
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+
 
 @add_start_docstrings("""ConvBERT Model with a `language modeling` head on top.""", CONVBERT_START_DOCSTRING)
 class TFConvBertForMaskedLM(TFConvBertPreTrainedModel, TFMaskedLanguageModelingLoss):
@@ -867,16 +997,16 @@ def get_prefix_bias_name(self):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[tf.Tensor] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[tf.Tensor] = None,
+        labels: tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[Tuple, TFMaskedLMOutput]:
         r"""
@@ -914,28 +1044,35 @@ def call(
             attentions=generator_hidden_states.attentions,
         )
 
-    # Copied from transformers.models.bert.modeling_tf_bert.TFBertForMaskedLM.serving_output
-    def serving_output(self, output):
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFMaskedLMOutput(logits=output.logits, hidden_states=hs, attentions=attns)
-
-
-class TFConvBertClassificationHead(tf.keras.layers.Layer):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "convbert", None) is not None:
+            with tf.name_scope(self.convbert.name):
+                self.convbert.build(None)
+        if getattr(self, "generator_predictions", None) is not None:
+            with tf.name_scope(self.generator_predictions.name):
+                self.generator_predictions.build(None)
+        if getattr(self, "generator_lm_head", None) is not None:
+            with tf.name_scope(self.generator_lm_head.name):
+                self.generator_lm_head.build(None)
+
+
+class TFConvBertClassificationHead(keras.layers.Layer):
     """Head for sentence-level classification tasks."""
 
     def __init__(self, config, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(
+        self.dense = keras.layers.Dense(
             config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
         )
         classifier_dropout = (
             config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
         )
-        self.dropout = tf.keras.layers.Dropout(classifier_dropout)
-        self.out_proj = tf.keras.layers.Dense(
+        self.dropout = keras.layers.Dropout(classifier_dropout)
+        self.out_proj = keras.layers.Dense(
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="out_proj"
         )
 
@@ -951,6 +1088,17 @@ def call(self, hidden_states, **kwargs):
 
         return x
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+        if getattr(self, "out_proj", None) is not None:
+            with tf.name_scope(self.out_proj.name):
+                self.out_proj.build([None, None, self.config.hidden_size])
+
 
 @add_start_docstrings(
     """
@@ -974,16 +1122,16 @@ def __init__(self, config, *inputs, **kwargs):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[tf.Tensor] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[tf.Tensor] = None,
+        labels: tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[Tuple, TFSequenceClassifierOutput]:
         r"""
@@ -1019,11 +1167,16 @@ def call(
             attentions=outputs.attentions,
         )
 
-    def serving_output(self, output):
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFSequenceClassifierOutput(logits=output.logits, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "convbert", None) is not None:
+            with tf.name_scope(self.convbert.name):
+                self.convbert.build(None)
+        if getattr(self, "classifier", None) is not None:
+            with tf.name_scope(self.classifier.name):
+                self.classifier.build(None)
 
 
 @add_start_docstrings(
@@ -1041,19 +1194,10 @@ def __init__(self, config, *inputs, **kwargs):
         self.sequence_summary = TFSequenceSummary(
             config, initializer_range=config.initializer_range, name="sequence_summary"
         )
-        self.classifier = tf.keras.layers.Dense(
+        self.classifier = keras.layers.Dense(
             1, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
         )
-
-    @property
-    def dummy_inputs(self):
-        """
-        Dummy inputs to build the network.
-
-        Returns:
-            tf.Tensor with dummy inputs
-        """
-        return {"input_ids": tf.convert_to_tensor(MULTIPLE_CHOICE_DUMMY_INPUTS, dtype=tf.int32)}
+        self.config = config
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(
@@ -1066,16 +1210,16 @@ def dummy_inputs(self):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[tf.Tensor] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[tf.Tensor] = None,
+        labels: tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[Tuple, TFMultipleChoiceModelOutput]:
         r"""
@@ -1128,25 +1272,19 @@ def call(
             attentions=outputs.attentions,
         )
 
-    @tf.function(
-        input_signature=[
-            {
-                "input_ids": tf.TensorSpec((None, None, None), tf.int32, name="input_ids"),
-                "attention_mask": tf.TensorSpec((None, None, None), tf.int32, name="attention_mask"),
-                "token_type_ids": tf.TensorSpec((None, None, None), tf.int32, name="token_type_ids"),
-            }
-        ]
-    )
-    def serving(self, inputs):
-        output = self.call(inputs)
-
-        return self.serving_output(output)
-
-    def serving_output(self, output):
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFMultipleChoiceModelOutput(logits=output.logits, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "convbert", None) is not None:
+            with tf.name_scope(self.convbert.name):
+                self.convbert.build(None)
+        if getattr(self, "sequence_summary", None) is not None:
+            with tf.name_scope(self.sequence_summary.name):
+                self.sequence_summary.build(None)
+        if getattr(self, "classifier", None) is not None:
+            with tf.name_scope(self.classifier.name):
+                self.classifier.build([None, None, self.config.hidden_size])
 
 
 @add_start_docstrings(
@@ -1165,10 +1303,11 @@ def __init__(self, config, *inputs, **kwargs):
         classifier_dropout = (
             config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
         )
-        self.dropout = tf.keras.layers.Dropout(classifier_dropout)
-        self.classifier = tf.keras.layers.Dense(
+        self.dropout = keras.layers.Dropout(classifier_dropout)
+        self.classifier = keras.layers.Dense(
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
         )
+        self.config = config
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(CONVBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@@ -1179,16 +1318,16 @@ def __init__(self, config, *inputs, **kwargs):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[tf.Tensor] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[tf.Tensor] = None,
+        labels: tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[Tuple, TFTokenClassifierOutput]:
         r"""
@@ -1223,11 +1362,16 @@ def call(
             attentions=outputs.attentions,
         )
 
-    def serving_output(self, output):
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFTokenClassifierOutput(logits=output.logits, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "convbert", None) is not None:
+            with tf.name_scope(self.convbert.name):
+                self.convbert.build(None)
+        if getattr(self, "classifier", None) is not None:
+            with tf.name_scope(self.classifier.name):
+                self.classifier.build([None, None, self.config.hidden_size])
 
 
 @add_start_docstrings(
@@ -1243,9 +1387,10 @@ def __init__(self, config, *inputs, **kwargs):
 
         self.num_labels = config.num_labels
         self.convbert = TFConvBertMainLayer(config, name="convbert")
-        self.qa_outputs = tf.keras.layers.Dense(
+        self.qa_outputs = keras.layers.Dense(
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="qa_outputs"
         )
+        self.config = config
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(CONVBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@@ -1256,17 +1401,17 @@ def __init__(self, config, *inputs, **kwargs):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[tf.Tensor] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        start_positions: Optional[tf.Tensor] = None,
-        end_positions: Optional[tf.Tensor] = None,
+        start_positions: tf.Tensor | None = None,
+        end_positions: tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[Tuple, TFQuestionAnsweringModelOutput]:
         r"""
@@ -1315,10 +1460,13 @@ def call(
             attentions=outputs.attentions,
         )
 
-    def serving_output(self, output):
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFQuestionAnsweringModelOutput(
-            start_logits=output.start_logits, end_logits=output.end_logits, hidden_states=hs, attentions=attns
-        )
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "convbert", None) is not None:
+            with tf.name_scope(self.convbert.name):
+                self.convbert.build(None)
+        if getattr(self, "qa_outputs", None) is not None:
+            with tf.name_scope(self.qa_outputs.name):
+                self.qa_outputs.build([None, None, self.config.hidden_size])
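Throughout this file the removed `serving_output`/`serving` overrides and the new explicit `build(input_shape=None)` methods follow one recipe: bail out if `self.built`, then build each sublayer inside `tf.name_scope(sublayer.name)` with the input width it expects, so weights get deterministic names without a dummy forward pass. A stripped-down sketch of that recipe (the layer below is illustrative, not a transformers class):

```python
import tensorflow as tf
from tensorflow import keras


class TinyHead(keras.layers.Layer):
    def __init__(self, hidden_size: int, num_labels: int, **kwargs):
        super().__init__(**kwargs)
        self.dense = keras.layers.Dense(hidden_size, name="dense")
        self.out_proj = keras.layers.Dense(num_labels, name="out_proj")
        self.hidden_size = hidden_size

    def build(self, input_shape=None):
        if self.built:
            return
        self.built = True
        # build each sublayer explicitly, in its own name scope, with a known last dimension
        with tf.name_scope(self.dense.name):
            self.dense.build([None, None, self.hidden_size])
        with tf.name_scope(self.out_proj.name):
            self.out_proj.build([None, None, self.hidden_size])


head = TinyHead(hidden_size=8, num_labels=2, name="head")
head.build(None)
print([w.name for w in head.weights])  # deterministic weight names, no dummy inputs needed
```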
diff --git a/src/transformers/models/convbert/tokenization_convbert.py b/src/transformers/models/convbert/tokenization_convbert.py
index cbee21aafe3a52..8c359886cf7435 100644
--- a/src/transformers/models/convbert/tokenization_convbert.py
+++ b/src/transformers/models/convbert/tokenization_convbert.py
@@ -135,20 +135,6 @@ def __init__(
         strip_accents=None,
         **kwargs,
     ):
-        super().__init__(
-            do_lower_case=do_lower_case,
-            do_basic_tokenize=do_basic_tokenize,
-            never_split=never_split,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            tokenize_chinese_chars=tokenize_chinese_chars,
-            strip_accents=strip_accents,
-            **kwargs,
-        )
-
         if not os.path.isfile(vocab_file):
             raise ValueError(
                 f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
@@ -164,7 +150,22 @@ def __init__(
                 tokenize_chinese_chars=tokenize_chinese_chars,
                 strip_accents=strip_accents,
             )
-        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
+
+        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=str(unk_token))
+
+        super().__init__(
+            do_lower_case=do_lower_case,
+            do_basic_tokenize=do_basic_tokenize,
+            never_split=never_split,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            mask_token=mask_token,
+            tokenize_chinese_chars=tokenize_chinese_chars,
+            strip_accents=strip_accents,
+            **kwargs,
+        )
 
     @property
     def do_lower_case(self):
@@ -177,10 +178,12 @@ def vocab_size(self):
     def get_vocab(self):
         return dict(self.vocab, **self.added_tokens_encoder)
 
-    def _tokenize(self, text):
+    def _tokenize(self, text, split_special_tokens=False):
         split_tokens = []
         if self.do_basic_tokenize:
-            for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):
+            for token in self.basic_tokenizer.tokenize(
+                text, never_split=self.all_special_tokens if not split_special_tokens else None
+            ):
                 # If the token is part of the never_split set
                 if token in self.basic_tokenizer.never_split:
                     split_tokens.append(token)
@@ -260,8 +263,8 @@ def create_token_type_ids_from_sequences(
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
         """
-        Create a mask from the two sequences passed to be used in a sequence-pair classification task. A ConvBERT
-        sequence pair mask has the following format:
+        Create a mask from the two sequences passed to be used in a sequence-pair classification task. A ConvBERT sequence
+        pair mask has the following format:
 
         ```
         0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
@@ -325,20 +328,30 @@ class BasicTokenizer(object):
         strip_accents (`bool`, *optional*):
             Whether or not to strip all accents. If this option is not specified, then it will be determined by the
             value for `lowercase` (as in the original BERT).
+        do_split_on_punc (`bool`, *optional*, defaults to `True`):
+            In some instances we want to skip the basic punctuation splitting so that later tokenization can capture
+            the full context of the words, such as contractions.
     """
 
-    def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True, strip_accents=None):
+    def __init__(
+        self,
+        do_lower_case=True,
+        never_split=None,
+        tokenize_chinese_chars=True,
+        strip_accents=None,
+        do_split_on_punc=True,
+    ):
         if never_split is None:
             never_split = []
         self.do_lower_case = do_lower_case
         self.never_split = set(never_split)
         self.tokenize_chinese_chars = tokenize_chinese_chars
         self.strip_accents = strip_accents
+        self.do_split_on_punc = do_split_on_punc
 
     def tokenize(self, text, never_split=None):
         """
-        Basic Tokenization of a piece of text. Split on "white spaces" only, for sub-word tokenization, see
-        WordPieceTokenizer.
+        Basic Tokenization of a piece of text. For sub-word tokenization, see WordPieceTokenizer.
 
         Args:
             never_split (`List[str]`, *optional*)
@@ -357,7 +370,9 @@ def tokenize(self, text, never_split=None):
         # words in the English Wikipedia.).
         if self.tokenize_chinese_chars:
             text = self._tokenize_chinese_chars(text)
-        orig_tokens = whitespace_tokenize(text)
+        # prevents treating the same character with different unicode codepoints as different characters
+        unicode_normalized_text = unicodedata.normalize("NFC", text)
+        orig_tokens = whitespace_tokenize(unicode_normalized_text)
         split_tokens = []
         for token in orig_tokens:
             if token not in never_split:
@@ -385,7 +400,7 @@ def _run_strip_accents(self, text):
 
     def _run_split_on_punc(self, text, never_split=None):
         """Splits punctuation on a piece of text."""
-        if never_split is not None and text in never_split:
+        if not self.do_split_on_punc or (never_split is not None and text in never_split):
             return [text]
         chars = list(text)
         i = 0
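Two things happen in this tokenizer file: `super().__init__` is called only after the vocab and sub-tokenizers exist (the base class may consult them while handling added tokens during initialization), and `tokenize` now NFC-normalizes its input so that composed and decomposed Unicode spellings of the same character produce identical tokens. The normalization point in isolation:

```python
import unicodedata

composed = "caf\u00e9"     # 'café' with a single precomposed codepoint
decomposed = "cafe\u0301"  # 'café' as 'e' + combining acute accent

print(composed == decomposed)  # False
print(unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed))  # True
```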
diff --git a/src/transformers/models/convbert/tokenization_convbert_fast.py b/src/transformers/models/convbert/tokenization_convbert_fast.py
index 07447bb6a17caa..14909876ded885 100644
--- a/src/transformers/models/convbert/tokenization_convbert_fast.py
+++ b/src/transformers/models/convbert/tokenization_convbert_fast.py
@@ -159,7 +159,7 @@ def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
         """
         output = [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
 
-        if token_ids_1:
+        if token_ids_1 is not None:
             output += token_ids_1 + [self.sep_token_id]
 
         return output
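`if token_ids_1:` treats an empty second sequence the same as no second sequence, silently dropping the trailing `[SEP]`; `is not None` keeps the two cases distinct. A quick demonstration (token ids are arbitrary):

```python
cls, sep = 101, 102
token_ids_0, token_ids_1 = [7, 8], []  # an empty but present second sequence

truthy = [cls] + token_ids_0 + [sep]
if token_ids_1:                        # [] is falsy -> branch skipped
    truthy += token_ids_1 + [sep]

explicit = [cls] + token_ids_0 + [sep]
if token_ids_1 is not None:            # only skipped when there really is no second sequence
    explicit += token_ids_1 + [sep]

print(truthy)    # [101, 7, 8, 102]      -> trailing [SEP] lost
print(explicit)  # [101, 7, 8, 102, 102]
```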
@@ -168,8 +168,8 @@ def create_token_type_ids_from_sequences(
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
         """
-        Create a mask from the two sequences passed to be used in a sequence-pair classification task. A ConvBERT
-        sequence pair mask has the following format:
+        Create a mask from the two sequences passed to be used in a sequence-pair classification task. A ConvBERT sequence
+        pair mask has the following format:
 
         ```
         0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
diff --git a/src/transformers/models/convnext/__init__.py b/src/transformers/models/convnext/__init__.py
index 2141e18bebceea..099a7fc9d63da4 100644
--- a/src/transformers/models/convnext/__init__.py
+++ b/src/transformers/models/convnext/__init__.py
@@ -93,7 +93,7 @@
     except OptionalDependencyNotAvailable:
         pass
     else:
-        from .modeling_convnext import TFConvNextForImageClassification, TFConvNextModel, TFConvNextPreTrainedModel
+        from .modeling_tf_convnext import TFConvNextForImageClassification, TFConvNextModel, TFConvNextPreTrainedModel
 
 
 else:
diff --git a/src/transformers/models/convnext/configuration_convnext.py b/src/transformers/models/convnext/configuration_convnext.py
index d4807bc5741a02..48647bd1224ecd 100644
--- a/src/transformers/models/convnext/configuration_convnext.py
+++ b/src/transformers/models/convnext/configuration_convnext.py
@@ -22,6 +22,7 @@
 from ...configuration_utils import PretrainedConfig
 from ...onnx import OnnxConfig
 from ...utils import logging
+from ...utils.backbone_utils import BackboneConfigMixin, get_aligned_output_features_output_indices
 
 
 logger = logging.get_logger(__name__)
@@ -32,7 +33,7 @@
 }
 
 
-class ConvNextConfig(PretrainedConfig):
+class ConvNextConfig(BackboneConfigMixin, PretrainedConfig):
     r"""
     This is the configuration class to store the configuration of a [`ConvNextModel`]. It is used to instantiate an
     ConvNeXT model according to the specified arguments, defining the model architecture. Instantiating a configuration
@@ -66,7 +67,14 @@ class ConvNextConfig(PretrainedConfig):
             The drop rate for stochastic depth.
         out_features (`List[str]`, *optional*):
             If used as backbone, list of features to output. Can be any of `"stem"`, `"stage1"`, `"stage2"`, etc.
-            (depending on how many stages the model has). Will default to the last stage if unset.
+            (depending on how many stages the model has). If unset and `out_indices` is set, will default to the
+            corresponding stages. If unset and `out_indices` is unset, will default to the last stage. Must be in the
+            same order as defined in the `stage_names` attribute.
+        out_indices (`List[int]`, *optional*):
+            If used as backbone, list of indices of features to output. Can be any of 0, 1, 2, etc. (depending on how
+            many stages the model has). If unset and `out_features` is set, will default to the corresponding stages.
+            If unset and `out_features` is unset, will default to the last stage. Must be in the
+            same order as defined in the `stage_names` attribute.
 
     Example:
     ```python
@@ -81,6 +89,7 @@ class ConvNextConfig(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "convnext"
 
     def __init__(
@@ -97,6 +106,7 @@ def __init__(
         drop_path_rate=0.0,
         image_size=224,
         out_features=None,
+        out_indices=None,
         **kwargs,
     ):
         super().__init__(**kwargs)
@@ -113,15 +123,9 @@ def __init__(
         self.drop_path_rate = drop_path_rate
         self.image_size = image_size
         self.stage_names = ["stem"] + [f"stage{idx}" for idx in range(1, len(self.depths) + 1)]
-        if out_features is not None:
-            if not isinstance(out_features, list):
-                raise ValueError("out_features should be a list")
-            for feature in out_features:
-                if feature not in self.stage_names:
-                    raise ValueError(
-                        f"Feature {feature} is not a valid feature name. Valid names are {self.stage_names}"
-                    )
-        self.out_features = out_features
+        self._out_features, self._out_indices = get_aligned_output_features_output_indices(
+            out_features=out_features, out_indices=out_indices, stage_names=self.stage_names
+        )
 
 
 class ConvNextOnnxConfig(OnnxConfig):
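The hand-rolled `out_features` validation gives way to `BackboneConfigMixin` plus `get_aligned_output_features_output_indices`, which keeps `out_features` (stage names) and `out_indices` (positions in `stage_names`) consistent with each other and defaults both to the last stage when neither is set. A hedged sketch of that alignment logic (not the exact helper, which also validates its inputs):

```python
from typing import List, Optional, Tuple


def align_features_and_indices(
    out_features: Optional[List[str]],
    out_indices: Optional[List[int]],
    stage_names: List[str],
) -> Tuple[List[str], List[int]]:
    if out_features is None and out_indices is None:
        return [stage_names[-1]], [len(stage_names) - 1]  # default: last stage only
    if out_features is None:
        return [stage_names[i] for i in out_indices], list(out_indices)
    if out_indices is None:
        return list(out_features), [stage_names.index(name) for name in out_features]
    return list(out_features), list(out_indices)  # both given: assume the caller kept them aligned


stage_names = ["stem", "stage1", "stage2", "stage3", "stage4"]
print(align_features_and_indices(None, None, stage_names))        # (['stage4'], [4])
print(align_features_and_indices(["stage2"], None, stage_names))  # (['stage2'], [2])
```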
diff --git a/src/transformers/models/convnext/convert_convnext_to_pytorch.py b/src/transformers/models/convnext/convert_convnext_to_pytorch.py
index 69c300eee40406..cdcbf24d552389 100644
--- a/src/transformers/models/convnext/convert_convnext_to_pytorch.py
+++ b/src/transformers/models/convnext/convert_convnext_to_pytorch.py
@@ -26,7 +26,7 @@
 from huggingface_hub import hf_hub_download
 from PIL import Image
 
-from transformers import ConvNextConfig, ConvNextFeatureExtractor, ConvNextForImageClassification
+from transformers import ConvNextConfig, ConvNextForImageClassification, ConvNextImageProcessor
 from transformers.utils import logging
 
 
@@ -144,10 +144,10 @@ def convert_convnext_checkpoint(checkpoint_url, pytorch_dump_folder_path):
     model.load_state_dict(state_dict)
     model.eval()
 
-    # Check outputs on an image, prepared by ConvNextFeatureExtractor
+    # Check outputs on an image, prepared by ConvNextImageProcessor
     size = 224 if "224" in checkpoint_url else 384
-    feature_extractor = ConvNextFeatureExtractor(size=size)
-    pixel_values = feature_extractor(images=prepare_img(), return_tensors="pt").pixel_values
+    image_processor = ConvNextImageProcessor(size=size)
+    pixel_values = image_processor(images=prepare_img(), return_tensors="pt").pixel_values
 
     logits = model(pixel_values).logits
 
@@ -191,8 +191,8 @@ def convert_convnext_checkpoint(checkpoint_url, pytorch_dump_folder_path):
     Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
     print(f"Saving model to {pytorch_dump_folder_path}")
     model.save_pretrained(pytorch_dump_folder_path)
-    print(f"Saving feature extractor to {pytorch_dump_folder_path}")
-    feature_extractor.save_pretrained(pytorch_dump_folder_path)
+    print(f"Saving image processor to {pytorch_dump_folder_path}")
+    image_processor.save_pretrained(pytorch_dump_folder_path)
 
     print("Pushing model to the hub...")
     model_name = "convnext"
diff --git a/src/transformers/models/convnext/image_processing_convnext.py b/src/transformers/models/convnext/image_processing_convnext.py
index a46bdcfef75b77..6d6476e77214b0 100644
--- a/src/transformers/models/convnext/image_processing_convnext.py
+++ b/src/transformers/models/convnext/image_processing_convnext.py
@@ -18,15 +18,10 @@
 
 import numpy as np
 
-from transformers.utils import is_vision_available
-from transformers.utils.generic import TensorType
-
 from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
 from ...image_transforms import (
     center_crop,
     get_resize_output_image_size,
-    normalize,
-    rescale,
     resize,
     to_channel_dimension_format,
 )
@@ -36,11 +31,14 @@
     ChannelDimension,
     ImageInput,
     PILImageResampling,
+    infer_channel_dimension_format,
+    is_scaled_image,
     make_list_of_images,
     to_numpy_array,
     valid_images,
+    validate_preprocess_arguments,
 )
-from ...utils import logging
+from ...utils import TensorType, is_vision_available, logging
 
 
 if is_vision_available():
@@ -64,10 +62,10 @@ class ConvNextImageProcessor(BaseImageProcessor):
             be matched to `int(size["shortest_edge"]/crop_pct)`, after which the image is cropped to
             `(size["shortest_edge"], size["shortest_edge"])`. Only has an effect if `do_resize` is set to `True`. Can
             be overriden by `size` in the `preprocess` method.
-        crop_pct (`float` *optional*, defaults to 244 / 256):
+        crop_pct (`float`, *optional*, defaults to 224 / 256):
             Percentage of the image to crop. Only has an effect if `do_resize` is `True` and size < 384. Can be
             overriden by `crop_pct` in the `preprocess` method.
-        resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BILINEAR`):
+        resample (`PILImageResampling`, *optional*, defaults to `Resampling.BILINEAR`):
             Resampling filter to use if resizing the image. Can be overriden by `resample` in the `preprocess` method.
         do_rescale (`bool`, *optional*, defaults to `True`):
             Whether to rescale the image by the specified scale `rescale_factor`. Can be overriden by `do_rescale` in
@@ -123,6 +121,7 @@ def resize(
         crop_pct: float,
         resample: PILImageResampling = PILImageResampling.BICUBIC,
         data_format: Optional[Union[str, ChannelDimension]] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
         **kwargs,
     ) -> np.ndarray:
         """
@@ -142,6 +141,9 @@ def resize(
                 Resampling filter to use when resizing the image.
             data_format (`str` or `ChannelDimension`, *optional*):
                 The channel dimension format of the image. If not provided, it will be the same as the input image.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format of the input image. If not provided, it will be inferred from the input
+                image.
         """
         size = get_size_dict(size, default_to_square=False)
         if "shortest_edge" not in size:
@@ -151,59 +153,36 @@ def resize(
         if shortest_edge < 384:
             # maintain same ratio, resizing shortest edge to shortest_edge/crop_pct
             resize_shortest_edge = int(shortest_edge / crop_pct)
-            resize_size = get_resize_output_image_size(image, size=resize_shortest_edge, default_to_square=False)
-            image = resize(image=image, size=resize_size, resample=resample, data_format=data_format, **kwargs)
+            resize_size = get_resize_output_image_size(
+                image, size=resize_shortest_edge, default_to_square=False, input_data_format=input_data_format
+            )
+            image = resize(
+                image=image,
+                size=resize_size,
+                resample=resample,
+                data_format=data_format,
+                input_data_format=input_data_format,
+                **kwargs,
+            )
             # then crop to (shortest_edge, shortest_edge)
-            return center_crop(image=image, size=(shortest_edge, shortest_edge), data_format=data_format, **kwargs)
+            return center_crop(
+                image=image,
+                size=(shortest_edge, shortest_edge),
+                data_format=data_format,
+                input_data_format=input_data_format,
+                **kwargs,
+            )
         else:
             # warping (no cropping) when evaluated at 384 or larger
             return resize(
-                image, size=(shortest_edge, shortest_edge), resample=resample, data_format=data_format, **kwargs
+                image,
+                size=(shortest_edge, shortest_edge),
+                resample=resample,
+                data_format=data_format,
+                input_data_format=input_data_format,
+                **kwargs,
             )
 
-    def rescale(
-        self,
-        image: np.ndarray,
-        scale: Union[int, float],
-        data_format: Optional[Union[str, ChannelDimension]] = None,
-        **kwargs,
-    ):
-        """
-        Rescale an image by a scale factor. image = image * scale.
-
-        Args:
-            image (`np.ndarray`):
-                Image to rescale.
-            scale (`int` or `float`):
-                Scale to apply to the image.
-            data_format (`str` or `ChannelDimension`, *optional*):
-                The channel dimension format of the image. If not provided, it will be the same as the input image.
-        """
-        return rescale(image, scale=scale, data_format=data_format, **kwargs)
-
-    def normalize(
-        self,
-        image: np.ndarray,
-        mean: Union[float, List[float]],
-        std: Union[float, List[float]],
-        data_format: Optional[Union[str, ChannelDimension]] = None,
-        **kwargs,
-    ) -> np.ndarray:
-        """
-        Normalize an image. image = (image - image_mean) / image_std.
-
-        Args:
-            image (`np.ndarray`):
-                Image to normalize.
-            image_mean (`float` or `List[float]`):
-                Image mean.
-            image_std (`float` or `List[float]`):
-                Image standard deviation.
-            data_format (`str` or `ChannelDimension`, *optional*):
-                The channel dimension format of the image. If not provided, it will be the same as the input image.
-        """
-        return normalize(image, mean=mean, std=std, data_format=data_format, **kwargs)
-
     def preprocess(
         self,
         images: ImageInput,
@@ -218,6 +197,7 @@ def preprocess(
         image_std: Optional[Union[float, List[float]]] = None,
         return_tensors: Optional[Union[str, TensorType]] = None,
         data_format: ChannelDimension = ChannelDimension.FIRST,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
         **kwargs,
     ) -> PIL.Image.Image:
         """
@@ -225,7 +205,8 @@ def preprocess(
 
         Args:
             images (`ImageInput`):
-                Image to preprocess.
+                Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
+                passing in images with pixel values between 0 and 1, set `do_rescale=False`.
             do_resize (`bool`, *optional*, defaults to `self.do_resize`):
                 Whether to resize the image.
             size (`Dict[str, int]`, *optional*, defaults to `self.size`):
@@ -257,8 +238,15 @@ def preprocess(
                     - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
             data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
                 The channel dimension format for the output image. Can be one of:
-                    - `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
-                    - `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - Unset: Use the channel dimension format of the input image.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format for the input image. If unset, the channel dimension format is inferred
+                from the input image. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
         """
         do_resize = do_resize if do_resize is not None else self.do_resize
         crop_pct = crop_pct if crop_pct is not None else self.crop_pct
@@ -280,31 +268,53 @@ def preprocess(
                 "torch.Tensor, tf.Tensor or jax.ndarray."
             )
 
-        if do_resize and size is None or resample is None:
-            raise ValueError("Size and resample must be specified if do_resize is True.")
-
-        if do_resize and size["shortest_edge"] < 384 and crop_pct is None:
-            raise ValueError("crop_pct must be specified if size < 384.")
-
-        if do_rescale and rescale_factor is None:
-            raise ValueError("Rescale factor must be specified if do_rescale is True.")
-
-        if do_normalize and (image_mean is None or image_std is None):
-            raise ValueError("Image mean and std must be specified if do_normalize is True.")
+        validate_preprocess_arguments(
+            do_rescale=do_rescale,
+            rescale_factor=rescale_factor,
+            do_normalize=do_normalize,
+            image_mean=image_mean,
+            image_std=image_std,
+            do_resize=do_resize,
+            size=size,
+            resample=resample,
+        )
 
         # All transformations expect numpy arrays.
         images = [to_numpy_array(image) for image in images]
 
+        if is_scaled_image(images[0]) and do_rescale:
+            logger.warning_once(
+                "It looks like you are trying to rescale already rescaled images. If the input"
+                " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
+            )
+
+        if input_data_format is None:
+            # We assume that all images have the same channel dimension format.
+            input_data_format = infer_channel_dimension_format(images[0])
+
         if do_resize:
-            images = [self.resize(image=image, size=size, crop_pct=crop_pct, resample=resample) for image in images]
+            images = [
+                self.resize(
+                    image=image, size=size, crop_pct=crop_pct, resample=resample, input_data_format=input_data_format
+                )
+                for image in images
+            ]
 
         if do_rescale:
-            images = [self.rescale(image=image, scale=rescale_factor) for image in images]
+            images = [
+                self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
+                for image in images
+            ]
 
         if do_normalize:
-            images = [self.normalize(image=image, mean=image_mean, std=image_std) for image in images]
-
-        images = [to_channel_dimension_format(image, data_format) for image in images]
+            images = [
+                self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
+                for image in images
+            ]
+
+        images = [
+            to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
+        ]
 
         data = {"pixel_values": images}
         return BatchFeature(data=data, tensor_type=return_tensors)
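
`preprocess` now validates its arguments via `validate_preprocess_arguments`, infers the channel layout with `infer_channel_dimension_format`, and warns when `do_rescale` is applied to already-scaled inputs. A hedged sketch of how a caller might exercise the new `input_data_format` and `do_rescale` knobs; random arrays stand in for real images:

```python
import numpy as np
from transformers import ConvNextImageProcessor
from transformers.image_utils import ChannelDimension

image_processor = ConvNextImageProcessor(size=224)

# A channels-last uint8 image: the channel dimension format is inferred automatically.
image = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
outputs = image_processor(images=image, return_tensors="np")
print(outputs.pixel_values.shape)  # (1, 3, 224, 224)

# For ambiguous inputs the format can be passed explicitly, and already-rescaled
# float inputs should set do_rescale=False to avoid the new warning.
image = np.random.rand(3, 480, 640).astype(np.float32)
outputs = image_processor(
    images=image,
    do_rescale=False,
    input_data_format=ChannelDimension.FIRST,
    return_tensors="np",
)
print(outputs.pixel_values.shape)  # (1, 3, 224, 224)
```
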
diff --git a/src/transformers/models/convnext/modeling_convnext.py b/src/transformers/models/convnext/modeling_convnext.py
index 5e60ddfe6d99c1..a952e5d8165e15 100755
--- a/src/transformers/models/convnext/modeling_convnext.py
+++ b/src/transformers/models/convnext/modeling_convnext.py
@@ -29,7 +29,7 @@
     BaseModelOutputWithPoolingAndNoAttention,
     ImageClassifierOutputWithNoAttention,
 )
-from ...modeling_utils import BackboneMixin, PreTrainedModel
+from ...modeling_utils import PreTrainedModel
 from ...utils import (
     add_code_sample_docstrings,
     add_start_docstrings,
@@ -37,6 +37,7 @@
     logging,
     replace_return_docstrings,
 )
+from ...utils.backbone_utils import BackboneMixin
 from .configuration_convnext import ConvNextConfig
 
 
@@ -60,7 +61,7 @@
 
 
 # Copied from transformers.models.beit.modeling_beit.drop_path
-def drop_path(input, drop_prob: float = 0.0, training: bool = False):
+def drop_path(input: torch.Tensor, drop_prob: float = 0.0, training: bool = False) -> torch.Tensor:
     """
     Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
 
@@ -281,7 +282,6 @@ class ConvNextPreTrainedModel(PreTrainedModel):
     config_class = ConvNextConfig
     base_model_prefix = "convnext"
     main_input_name = "pixel_values"
-    supports_gradient_checkpointing = True
 
     def _init_weights(self, module):
         """Initialize the weights"""
@@ -295,10 +295,6 @@ def _init_weights(self, module):
             module.bias.data.zero_()
             module.weight.data.fill_(1.0)
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, ConvNextEncoder):
-            module.gradient_checkpointing = value
-
 
 CONVNEXT_START_DOCSTRING = r"""
     This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it
@@ -480,33 +476,21 @@ def forward(
 class ConvNextBackbone(ConvNextPreTrainedModel, BackboneMixin):
     def __init__(self, config):
         super().__init__(config)
+        super()._init_backbone(config)
 
-        self.stage_names = config.stage_names
         self.embeddings = ConvNextEmbeddings(config)
         self.encoder = ConvNextEncoder(config)
-
-        self.out_features = config.out_features if config.out_features is not None else [self.stage_names[-1]]
-
-        out_feature_channels = {}
-        out_feature_channels["stem"] = config.hidden_sizes[0]
-        for idx, stage in enumerate(self.stage_names[1:]):
-            out_feature_channels[stage] = config.hidden_sizes[idx]
-
-        self.out_feature_channels = out_feature_channels
+        self.num_features = [config.hidden_sizes[0]] + config.hidden_sizes
 
         # Add layer norms to hidden states of out_features
-        hidden_states_norms = dict()
-        for stage, num_channels in zip(self.out_features, self.channels):
+        hidden_states_norms = {}
+        for stage, num_channels in zip(self._out_features, self.channels):
             hidden_states_norms[stage] = ConvNextLayerNorm(num_channels, data_format="channels_first")
         self.hidden_states_norms = nn.ModuleDict(hidden_states_norms)
 
         # initialize weights and apply final processing
         self.post_init()
 
-    @property
-    def channels(self):
-        return [self.out_feature_channels[name] for name in self.out_features]
-
     @add_start_docstrings_to_model_forward(CONVNEXT_INPUTS_DOCSTRING)
     @replace_return_docstrings(output_type=BackboneOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
@@ -545,14 +529,13 @@ def forward(
         outputs = self.encoder(
             embedding_output,
             output_hidden_states=True,
-            return_dict=True,
+            return_dict=return_dict,
         )
 
-        hidden_states = outputs.hidden_states
+        hidden_states = outputs.hidden_states if return_dict else outputs[1]
 
         feature_maps = ()
-        # we skip the stem
-        for idx, (stage, hidden_state) in enumerate(zip(self.stage_names[1:], hidden_states[1:])):
+        for stage, hidden_state in zip(self.stage_names, hidden_states):
             if stage in self.out_features:
                 hidden_state = self.hidden_states_norms[stage](hidden_state)
                 feature_maps += (hidden_state,)
@@ -560,11 +543,11 @@ def forward(
         if not return_dict:
             output = (feature_maps,)
             if output_hidden_states:
-                output += (outputs.hidden_states,)
+                output += (hidden_states,)
             return output
 
         return BackboneOutput(
             feature_maps=feature_maps,
-            hidden_states=outputs.hidden_states if output_hidden_states else None,
+            hidden_states=hidden_states if output_hidden_states else None,
             attentions=None,
         )
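
`ConvNextBackbone` now initialises itself through `_init_backbone` and exposes `channels`/`num_features` via the mixin rather than computing them locally. A short sketch of the backbone API with a randomly initialised model; the printed shapes assume the default tiny-style `hidden_sizes`:

```python
import torch
from transformers import ConvNextBackbone, ConvNextConfig

# Randomly initialised backbone returning feature maps for two stages.
config = ConvNextConfig(out_features=["stage2", "stage4"])
backbone = ConvNextBackbone(config)

pixel_values = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    outputs = backbone(pixel_values)

for feature_map in outputs.feature_maps:
    print(feature_map.shape)
# torch.Size([1, 192, 28, 28]) and torch.Size([1, 768, 7, 7]) with the default config
print(backbone.channels)  # [192, 768]
```
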
diff --git a/src/transformers/models/convnext/modeling_tf_convnext.py b/src/transformers/models/convnext/modeling_tf_convnext.py
index 671de6ca9b46ed..b92ac446d94f21 100644
--- a/src/transformers/models/convnext/modeling_tf_convnext.py
+++ b/src/transformers/models/convnext/modeling_tf_convnext.py
@@ -15,13 +15,13 @@
 """ TF 2.0 ConvNext model."""
 
 
-from typing import Dict, Optional, Tuple, Union
+from __future__ import annotations
+
+from typing import List, Optional, Tuple, Union
 
 import numpy as np
 import tensorflow as tf
 
-from transformers import shape_list
-
 from ...activations_tf import get_tf_activation
 from ...modeling_tf_outputs import TFBaseModelOutput, TFBaseModelOutputWithPooling, TFSequenceClassifierOutput
 from ...modeling_tf_utils import (
@@ -29,9 +29,11 @@
     TFPreTrainedModel,
     TFSequenceClassificationLoss,
     get_initializer,
+    keras,
     keras_serializable,
     unpack_inputs,
 )
+from ...tf_utils import shape_list
 from ...utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings
 from .configuration_convnext import ConvNextConfig
 
@@ -43,17 +45,17 @@
 _CHECKPOINT_FOR_DOC = "facebook/convnext-tiny-224"
 
 
-class TFConvNextDropPath(tf.keras.layers.Layer):
+class TFConvNextDropPath(keras.layers.Layer):
     """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
     References:
         (1) github.com:rwightman/pytorch-image-models
     """
 
-    def __init__(self, drop_path, **kwargs):
+    def __init__(self, drop_path: float, **kwargs):
         super().__init__(**kwargs)
         self.drop_path = drop_path
 
-    def call(self, x, training=None):
+    def call(self, x: tf.Tensor, training=None):
         if training:
             keep_prob = 1 - self.drop_path
             shape = (tf.shape(x)[0],) + (1,) * (len(tf.shape(x)) - 1)
@@ -63,45 +65,57 @@ def call(self, x, training=None):
         return x
 
 
-class TFConvNextEmbeddings(tf.keras.layers.Layer):
+class TFConvNextEmbeddings(keras.layers.Layer):
     """This class is comparable to (and inspired by) the SwinEmbeddings class
     found in src/transformers/models/swin/modeling_swin.py.
     """
 
-    def __init__(self, config, **kwargs):
+    def __init__(self, config: ConvNextConfig, **kwargs):
         super().__init__(**kwargs)
-        self.patch_embeddings = tf.keras.layers.Conv2D(
+        self.patch_embeddings = keras.layers.Conv2D(
             filters=config.hidden_sizes[0],
             kernel_size=config.patch_size,
             strides=config.patch_size,
             name="patch_embeddings",
             kernel_initializer=get_initializer(config.initializer_range),
-            bias_initializer="zeros",
+            bias_initializer=keras.initializers.Zeros(),
         )
-        self.layernorm = tf.keras.layers.LayerNormalization(epsilon=1e-6, name="layernorm")
+        self.layernorm = keras.layers.LayerNormalization(epsilon=1e-6, name="layernorm")
         self.num_channels = config.num_channels
+        self.config = config
 
     def call(self, pixel_values):
         if isinstance(pixel_values, dict):
             pixel_values = pixel_values["pixel_values"]
 
-        num_channels = shape_list(pixel_values)[1]
-        if tf.executing_eagerly() and num_channels != self.num_channels:
-            raise ValueError(
-                "Make sure that the channel dimension of the pixel values match with the one set in the configuration."
-            )
+        tf.debugging.assert_equal(
+            shape_list(pixel_values)[1],
+            self.num_channels,
+            message="Make sure that the channel dimension of the pixel values match with the one set in the configuration.",
+        )
 
-        # When running on CPU, `tf.keras.layers.Conv2D` doesn't support `NCHW` format.
+        # When running on CPU, `keras.layers.Conv2D` doesn't support `NCHW` format.
         # So change the input format from `NCHW` to `NHWC`.
-        # shape = (batch_size, in_height, in_width, in_channels=num_channels)
+        # shape = (batch_size, in_height, in_width, in_channels)
         pixel_values = tf.transpose(pixel_values, perm=(0, 2, 3, 1))
 
         embeddings = self.patch_embeddings(pixel_values)
         embeddings = self.layernorm(embeddings)
         return embeddings
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "patch_embeddings", None) is not None:
+            with tf.name_scope(self.patch_embeddings.name):
+                self.patch_embeddings.build([None, None, None, self.config.num_channels])
+        if getattr(self, "layernorm", None) is not None:
+            with tf.name_scope(self.layernorm.name):
+                self.layernorm.build([None, None, None, self.config.hidden_sizes[0]])
+
 
-class TFConvNextLayer(tf.keras.layers.Layer):
+class TFConvNextLayer(keras.layers.Layer):
     """This corresponds to the `Block` class in the original implementation.
 
     There are two equivalent implementations: [DwConv, LayerNorm (channels_first), Conv, GELU,1x1 Conv]; all in (N, C,
@@ -120,7 +134,7 @@ def __init__(self, config, dim, drop_path=0.0, **kwargs):
         super().__init__(**kwargs)
         self.dim = dim
         self.config = config
-        self.dwconv = tf.keras.layers.Conv2D(
+        self.dwconv = keras.layers.Conv2D(
             filters=dim,
             kernel_size=7,
             padding="same",
@@ -129,18 +143,18 @@ def __init__(self, config, dim, drop_path=0.0, **kwargs):
             bias_initializer="zeros",
             name="dwconv",
         )  # depthwise conv
-        self.layernorm = tf.keras.layers.LayerNormalization(
+        self.layernorm = keras.layers.LayerNormalization(
             epsilon=1e-6,
             name="layernorm",
         )
-        self.pwconv1 = tf.keras.layers.Dense(
+        self.pwconv1 = keras.layers.Dense(
             units=4 * dim,
             kernel_initializer=get_initializer(config.initializer_range),
             bias_initializer="zeros",
             name="pwconv1",
         )  # pointwise/1x1 convs, implemented with linear layers
         self.act = get_tf_activation(config.hidden_act)
-        self.pwconv2 = tf.keras.layers.Dense(
+        self.pwconv2 = keras.layers.Dense(
             units=dim,
             kernel_initializer=get_initializer(config.initializer_range),
             bias_initializer="zeros",
@@ -151,22 +165,40 @@ def __init__(self, config, dim, drop_path=0.0, **kwargs):
         self.drop_path = (
             TFConvNextDropPath(drop_path, name="drop_path")
             if drop_path > 0.0
-            else tf.keras.layers.Activation("linear", name="drop_path")
+            else keras.layers.Activation("linear", name="drop_path")
         )
 
-    def build(self, input_shape: tf.TensorShape):
+    def build(self, input_shape: tf.TensorShape = None):
         # PT's `nn.Parameters` must be mapped to a TF layer weight to inherit the same name hierarchy (and vice-versa)
         self.layer_scale_parameter = (
             self.add_weight(
                 shape=(self.dim,),
-                initializer=tf.keras.initializers.Constant(value=self.config.layer_scale_init_value),
+                initializer=keras.initializers.Constant(value=self.config.layer_scale_init_value),
                 trainable=True,
                 name="layer_scale_parameter",
             )
             if self.config.layer_scale_init_value > 0
             else None
         )
-        super().build(input_shape)
+
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dwconv", None) is not None:
+            with tf.name_scope(self.dwconv.name):
+                self.dwconv.build([None, None, None, self.dim])
+        if getattr(self, "layernorm", None) is not None:
+            with tf.name_scope(self.layernorm.name):
+                self.layernorm.build([None, None, None, self.dim])
+        if getattr(self, "pwconv1", None) is not None:
+            with tf.name_scope(self.pwconv1.name):
+                self.pwconv1.build([None, None, self.dim])
+        if getattr(self, "pwconv2", None) is not None:
+            with tf.name_scope(self.pwconv2.name):
+                self.pwconv2.build([None, None, 4 * self.dim])
+        if getattr(self, "drop_path", None) is not None:
+            with tf.name_scope(self.drop_path.name):
+                self.drop_path.build(None)
 
     def call(self, hidden_states, training=False):
         input = hidden_states
@@ -183,24 +215,37 @@ def call(self, hidden_states, training=False):
         return x
 
 
-class TFConvNextStage(tf.keras.layers.Layer):
+class TFConvNextStage(keras.layers.Layer):
     """ConvNext stage, consisting of an optional downsampling layer + multiple residual blocks.
 
     Args:
-        config ([`ConvNextConfig`]): Model configuration class.
-        in_channels (`int`): Number of input channels.
-        out_channels (`int`): Number of output channels.
-        depth (`int`): Number of residual blocks.
-        drop_path_rates(`List[float]`): Stochastic depth rates for each layer.
+        config (`ConvNextConfig`):
+            Model configuration class.
+        in_channels (`int`):
+            Number of input channels.
+        out_channels (`int`):
+            Number of output channels.
+        depth (`int`):
+            Number of residual blocks.
+        drop_path_rates(`List[float]`):
+            Stochastic depth rates for each layer.
     """
 
     def __init__(
-        self, config, in_channels, out_channels, kernel_size=2, stride=2, depth=2, drop_path_rates=None, **kwargs
+        self,
+        config: ConvNextConfig,
+        in_channels: int,
+        out_channels: int,
+        kernel_size: int = 2,
+        stride: int = 2,
+        depth: int = 2,
+        drop_path_rates: Optional[List[float]] = None,
+        **kwargs,
     ):
         super().__init__(**kwargs)
         if in_channels != out_channels or stride > 1:
             self.downsampling_layer = [
-                tf.keras.layers.LayerNormalization(
+                keras.layers.LayerNormalization(
                     epsilon=1e-6,
                     name="downsampling_layer.0",
                 ),
@@ -209,12 +254,12 @@ def __init__(
                 # layer. All the outputs throughout the model will be in NHWC
                 # from this point on until the output where we again change to
                 # NCHW.
-                tf.keras.layers.Conv2D(
+                keras.layers.Conv2D(
                     filters=out_channels,
                     kernel_size=kernel_size,
                     strides=stride,
                     kernel_initializer=get_initializer(config.initializer_range),
-                    bias_initializer="zeros",
+                    bias_initializer=keras.initializers.Zeros(),
                     name="downsampling_layer.1",
                 ),
             ]
@@ -231,6 +276,9 @@ def __init__(
             )
             for j in range(depth)
         ]
+        self.in_channels = in_channels
+        self.out_channels = out_channels
+        self.stride = stride
 
     def call(self, hidden_states):
         for layer in self.downsampling_layer:
@@ -239,8 +287,22 @@ def call(self, hidden_states):
             hidden_states = layer(hidden_states)
         return hidden_states
 
-
-class TFConvNextEncoder(tf.keras.layers.Layer):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "layers", None) is not None:
+            for layer in self.layers:
+                with tf.name_scope(layer.name):
+                    layer.build(None)
+        if self.in_channels != self.out_channels or self.stride > 1:
+            with tf.name_scope(self.downsampling_layer[0].name):
+                self.downsampling_layer[0].build([None, None, None, self.in_channels])
+            with tf.name_scope(self.downsampling_layer[1].name):
+                self.downsampling_layer[1].build([None, None, None, self.in_channels])
+
+
+class TFConvNextEncoder(keras.layers.Layer):
     def __init__(self, config, **kwargs):
         super().__init__(**kwargs)
         self.stages = []
@@ -279,9 +341,14 @@ def call(self, hidden_states, output_hidden_states=False, return_dict=True):
 
         return TFBaseModelOutput(last_hidden_state=hidden_states, hidden_states=all_hidden_states)
 
+    def build(self, input_shape=None):
+        for stage in self.stages:
+            with tf.name_scope(stage.name):
+                stage.build(None)
+
 
 @keras_serializable
-class TFConvNextMainLayer(tf.keras.layers.Layer):
+class TFConvNextMainLayer(keras.layers.Layer):
     config_class = ConvNextConfig
 
     def __init__(self, config: ConvNextConfig, add_pooling_layer: bool = True, **kwargs):
@@ -290,15 +357,15 @@ def __init__(self, config: ConvNextConfig, add_pooling_layer: bool = True, **kwa
         self.config = config
         self.embeddings = TFConvNextEmbeddings(config, name="embeddings")
         self.encoder = TFConvNextEncoder(config, name="encoder")
-        self.layernorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layernorm")
+        self.layernorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layernorm")
         # We are setting the `data_format` like so because from here on we will revert to the
         # NCHW output format
-        self.pooler = tf.keras.layers.GlobalAvgPool2D(data_format="channels_first") if add_pooling_layer else None
+        self.pooler = keras.layers.GlobalAvgPool2D(data_format="channels_first") if add_pooling_layer else None
 
     @unpack_inputs
     def call(
         self,
-        pixel_values: Optional[TFModelInputType] = None,
+        pixel_values: TFModelInputType | None = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
         training: bool = False,
@@ -339,6 +406,20 @@ def call(
             hidden_states=hidden_states if output_hidden_states else encoder_outputs.hidden_states,
         )
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "embeddings", None) is not None:
+            with tf.name_scope(self.embeddings.name):
+                self.embeddings.build(None)
+        if getattr(self, "encoder", None) is not None:
+            with tf.name_scope(self.encoder.name):
+                self.encoder.build(None)
+        if getattr(self, "layernorm", None) is not None:
+            with tf.name_scope(self.layernorm.name):
+                self.layernorm.build([None, self.config.hidden_sizes[-1]])
+
 
 class TFConvNextPreTrainedModel(TFPreTrainedModel):
     """
@@ -350,50 +431,13 @@ class TFConvNextPreTrainedModel(TFPreTrainedModel):
     base_model_prefix = "convnext"
     main_input_name = "pixel_values"
 
-    @property
-    def dummy_inputs(self) -> Dict[str, tf.Tensor]:
-        """
-        Dummy inputs to build the network.
-
-        Returns:
-            `Dict[str, tf.Tensor]`: The dummy inputs.
-        """
-        VISION_DUMMY_INPUTS = tf.random.uniform(
-            shape=(
-                3,
-                self.config.num_channels,
-                self.config.image_size,
-                self.config.image_size,
-            ),
-            dtype=tf.float32,
-        )
-        return {"pixel_values": tf.constant(VISION_DUMMY_INPUTS)}
-
-    @tf.function(
-        input_signature=[
-            {
-                "pixel_values": tf.TensorSpec((None, None, None, None), tf.float32, name="pixel_values"),
-            }
-        ]
-    )
-    def serving(self, inputs):
-        """
-        Method used for serving the model.
-
-        Args:
-            inputs (`Dict[str, tf.Tensor]`):
-                The input of the saved model as a dictionary of tensors.
-        """
-        output = self.call(inputs)
-        return self.serving_output(output)
-
 
 CONVNEXT_START_DOCSTRING = r"""
     This model inherits from [`TFPreTrainedModel`]. Check the superclass documentation for the generic methods the
     library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
     etc.)
 
-    This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
+    This model is also a [keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
     as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage and
     behavior.
 
@@ -459,7 +503,7 @@ def __init__(self, config, *inputs, add_pooling_layer=True, **kwargs):
     @replace_return_docstrings(output_type=TFBaseModelOutputWithPooling, config_class=_CONFIG_FOR_DOC)
     def call(
         self,
-        pixel_values: Optional[TFModelInputType] = None,
+        pixel_values: TFModelInputType | None = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
         training: bool = False,
@@ -508,13 +552,13 @@ def call(
             hidden_states=outputs.hidden_states,
         )
 
-    def serving_output(self, output: TFBaseModelOutputWithPooling) -> TFBaseModelOutputWithPooling:
-        # hidden_states not converted to Tensor with tf.convert_to_tensor as they are all of different dimensions
-        return TFBaseModelOutputWithPooling(
-            last_hidden_state=output.last_hidden_state,
-            pooler_output=output.pooler_output,
-            hidden_states=output.hidden_states,
-        )
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "convnext", None) is not None:
+            with tf.name_scope(self.convnext.name):
+                self.convnext.build(None)
 
 
 @add_start_docstrings(
@@ -532,22 +576,23 @@ def __init__(self, config: ConvNextConfig, *inputs, **kwargs):
         self.convnext = TFConvNextMainLayer(config, name="convnext")
 
         # Classifier head
-        self.classifier = tf.keras.layers.Dense(
+        self.classifier = keras.layers.Dense(
             units=config.num_labels,
             kernel_initializer=get_initializer(config.initializer_range),
             bias_initializer="zeros",
             name="classifier",
         )
+        self.config = config
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(CONVNEXT_INPUTS_DOCSTRING)
     @replace_return_docstrings(output_type=TFSequenceClassifierOutput, config_class=_CONFIG_FOR_DOC)
     def call(
         self,
-        pixel_values: Optional[TFModelInputType] = None,
+        pixel_values: TFModelInputType | None = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]:
         r"""
@@ -609,6 +654,14 @@ def call(
             hidden_states=outputs.hidden_states,
         )
 
-    def serving_output(self, output: TFSequenceClassifierOutput) -> TFSequenceClassifierOutput:
-        # hidden_states not converted to Tensor with tf.convert_to_tensor as they are all of different dimensions
-        return TFSequenceClassifierOutput(logits=output.logits, hidden_states=output.hidden_states)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "convnext", None) is not None:
+            with tf.name_scope(self.convnext.name):
+                self.convnext.build(None)
+        if getattr(self, "classifier", None) is not None:
+            if hasattr(self.classifier, "name"):
+                with tf.name_scope(self.classifier.name):
+                    self.classifier.build([None, None, self.config.hidden_sizes[-1]])
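
The TF model drops `dummy_inputs`/`serving`/`serving_output` in favour of explicit `build()` methods that create each sublayer's weights from shapes recorded at construction time. A rough sketch of what that enables, assuming a recent TF/Keras and a randomly initialised model:

```python
import tensorflow as tf
from transformers import ConvNextConfig, TFConvNextForImageClassification

config = ConvNextConfig(num_labels=10)
model = TFConvNextForImageClassification(config)
# The explicit build() defined above (input_shape=None) cascades through the
# sublayers, so weights exist without running a dummy forward pass first.
model.build()

# Inputs are still NCHW; they are transposed to NHWC internally for Conv2D on CPU.
pixel_values = tf.random.uniform((1, 3, config.image_size, config.image_size))
logits = model(pixel_values).logits
print(logits.shape)  # (1, 10)
```
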
diff --git a/src/transformers/models/convnextv2/__init__.py b/src/transformers/models/convnextv2/__init__.py
new file mode 100644
index 00000000000000..d2a484b9b82850
--- /dev/null
+++ b/src/transformers/models/convnextv2/__init__.py
@@ -0,0 +1,97 @@
+# flake8: noqa
+# There's no way to ignore "F401 '...' imported but unused" warnings in this
+# module, but to preserve other warnings. So, don't check this module at all.
+
+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import TYPE_CHECKING
+
+# rely on isort to merge the imports
+from ...utils import (
+    OptionalDependencyNotAvailable,
+    _LazyModule,
+    is_torch_available,
+    is_tf_available,
+)
+
+
+_import_structure = {
+    "configuration_convnextv2": [
+        "CONVNEXTV2_PRETRAINED_CONFIG_ARCHIVE_MAP",
+        "ConvNextV2Config",
+    ]
+}
+
+try:
+    if not is_torch_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["modeling_convnextv2"] = [
+        "CONVNEXTV2_PRETRAINED_MODEL_ARCHIVE_LIST",
+        "ConvNextV2ForImageClassification",
+        "ConvNextV2Model",
+        "ConvNextV2PreTrainedModel",
+        "ConvNextV2Backbone",
+    ]
+
+try:
+    if not is_tf_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["modeling_tf_convnextv2"] = [
+        "TFConvNextV2ForImageClassification",
+        "TFConvNextV2Model",
+        "TFConvNextV2PreTrainedModel",
+    ]
+
+if TYPE_CHECKING:
+    from .configuration_convnextv2 import (
+        CONVNEXTV2_PRETRAINED_CONFIG_ARCHIVE_MAP,
+        ConvNextV2Config,
+    )
+
+    try:
+        if not is_torch_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .modeling_convnextv2 import (
+            CONVNEXTV2_PRETRAINED_MODEL_ARCHIVE_LIST,
+            ConvNextV2Backbone,
+            ConvNextV2ForImageClassification,
+            ConvNextV2Model,
+            ConvNextV2PreTrainedModel,
+        )
+
+    try:
+        if not is_tf_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .modeling_tf_convnextv2 import (
+            TFConvNextV2ForImageClassification,
+            TFConvNextV2Model,
+            TFConvNextV2PreTrainedModel,
+        )
+
+else:
+    import sys
+
+    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
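
The new package `__init__` follows the usual lazy-import pattern: the PyTorch and TF submodules are only imported when one of their names is accessed, and each backend block is skipped if the dependency is missing. A minimal sketch, assuming torch is installed:

```python
# Names resolve through the _LazyModule machinery; the heavy submodule import
# only happens when an attribute is actually accessed.
from transformers import ConvNextV2Config, ConvNextV2Model  # requires torch
from transformers.utils import is_tf_available

config = ConvNextV2Config()
model = ConvNextV2Model(config)
print(model.config.model_type)  # "convnextv2"

if is_tf_available():
    # Only importable when TensorFlow is installed.
    from transformers import TFConvNextV2Model  # noqa: F401
```
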
diff --git a/src/transformers/models/convnextv2/configuration_convnextv2.py b/src/transformers/models/convnextv2/configuration_convnextv2.py
new file mode 100644
index 00000000000000..3d7d1fa7397714
--- /dev/null
+++ b/src/transformers/models/convnextv2/configuration_convnextv2.py
@@ -0,0 +1,118 @@
+# coding=utf-8
+# Copyright 2023 Meta Platforms, Inc. and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" ConvNeXTV2 model configuration"""
+
+
+from ...configuration_utils import PretrainedConfig
+from ...utils import logging
+from ...utils.backbone_utils import BackboneConfigMixin, get_aligned_output_features_output_indices
+
+
+logger = logging.get_logger(__name__)
+
+CONVNEXTV2_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    "facebook/convnextv2-tiny-1k-224": "https://huggingface.co/facebook/convnextv2-tiny-1k-224/resolve/main/config.json",
+}
+
+
+class ConvNextV2Config(BackboneConfigMixin, PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`ConvNextV2Model`]. It is used to instantiate an
+    ConvNeXTV2 model according to the specified arguments, defining the model architecture. Instantiating a
+    configuration with the defaults will yield a similar configuration to that of the ConvNeXTV2
+    [facebook/convnextv2-tiny-1k-224](https://huggingface.co/facebook/convnextv2-tiny-1k-224) architecture.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+    Args:
+        num_channels (`int`, *optional*, defaults to 3):
+            The number of input channels.
+        patch_size (`int`, *optional*, defaults to 4):
+            Patch size to use in the patch embedding layer.
+        num_stages (`int`, *optional*, defaults to 4):
+            The number of stages in the model.
+        hidden_sizes (`List[int]`, *optional*, defaults to `[96, 192, 384, 768]`):
+            Dimensionality (hidden size) at each stage.
+        depths (`List[int]`, *optional*, defaults to `[3, 3, 9, 3]`):
+            Depth (number of blocks) for each stage.
+        hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
+            The non-linear activation function (function or string) in each block. If string, `"gelu"`, `"relu"`,
+            `"selu"` and `"gelu_new"` are supported.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
+            The epsilon used by the layer normalization layers.
+        drop_path_rate (`float`, *optional*, defaults to 0.0):
+            The drop rate for stochastic depth.
+        out_features (`List[str]`, *optional*):
+            If used as backbone, list of features to output. Can be any of `"stem"`, `"stage1"`, `"stage2"`, etc.
+            (depending on how many stages the model has). If unset and `out_indices` is set, will default to the
+            corresponding stages. If unset and `out_indices` is unset, will default to the last stage. Must be in the
+            same order as defined in the `stage_names` attribute.
+        out_indices (`List[int]`, *optional*):
+            If used as backbone, list of indices of features to output. Can be any of 0, 1, 2, etc. (depending on how
+            many stages the model has). If unset and `out_features` is set, will default to the corresponding stages.
+            If unset and `out_features` is unset, will default to the last stage. Must be in the
+            same order as defined in the `stage_names` attribute.
+
+    Example:
+    ```python
+    >>> from transformers import ConvNextV2Config, ConvNextV2Model
+
+    >>> # Initializing a ConvNeXTV2 convnextv2-tiny-1k-224 style configuration
+    >>> configuration = ConvNextV2Config()
+
+    >>> # Initializing a model (with random weights) from the convnextv2-tiny-1k-224 style configuration
+    >>> model = ConvNextV2Model(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "convnextv2"
+
+    def __init__(
+        self,
+        num_channels=3,
+        patch_size=4,
+        num_stages=4,
+        hidden_sizes=None,
+        depths=None,
+        hidden_act="gelu",
+        initializer_range=0.02,
+        layer_norm_eps=1e-12,
+        drop_path_rate=0.0,
+        image_size=224,
+        out_features=None,
+        out_indices=None,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+
+        self.num_channels = num_channels
+        self.patch_size = patch_size
+        self.num_stages = num_stages
+        self.hidden_sizes = [96, 192, 384, 768] if hidden_sizes is None else hidden_sizes
+        self.depths = [3, 3, 9, 3] if depths is None else depths
+        self.hidden_act = hidden_act
+        self.initializer_range = initializer_range
+        self.layer_norm_eps = layer_norm_eps
+        self.drop_path_rate = drop_path_rate
+        self.image_size = image_size
+        self.stage_names = ["stem"] + [f"stage{idx}" for idx in range(1, len(self.depths) + 1)]
+        self._out_features, self._out_indices = get_aligned_output_features_output_indices(
+            out_features=out_features, out_indices=out_indices, stage_names=self.stage_names
+        )
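
Like the v1 config above, `ConvNextV2Config` inherits the `out_features`/`out_indices` properties from `BackboneConfigMixin`, so the two views stay aligned against `stage_names` even when updated after construction. A small sketch of the setter behaviour (the exact container type, list vs. tuple, depends on the transformers version):

```python
from transformers import ConvNextV2Config

config = ConvNextV2Config(out_features=["stage1", "stage2", "stage3", "stage4"])
print(config.out_indices)   # [1, 2, 3, 4]

# Updating one property through the BackboneConfigMixin setter recomputes the other.
config.out_indices = [4]
print(config.out_features)  # ['stage4']
```
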
diff --git a/src/transformers/models/convnextv2/convert_convnextv2_to_pytorch.py b/src/transformers/models/convnextv2/convert_convnextv2_to_pytorch.py
new file mode 100644
index 00000000000000..8094ecf0d6157a
--- /dev/null
+++ b/src/transformers/models/convnextv2/convert_convnextv2_to_pytorch.py
@@ -0,0 +1,286 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Convert ConvNeXTV2 checkpoints from the original repository.
+
+URL: https://github.com/facebookresearch/ConvNeXt-V2"""
+
+import argparse
+import json
+import os
+
+import requests
+import torch
+from huggingface_hub import hf_hub_download
+from PIL import Image
+
+from transformers import ConvNextImageProcessor, ConvNextV2Config, ConvNextV2ForImageClassification
+from transformers.image_utils import PILImageResampling
+from transformers.utils import logging
+
+
+logging.set_verbosity_info()
+logger = logging.get_logger(__name__)
+
+
+def get_convnextv2_config(checkpoint_url):
+    config = ConvNextV2Config()
+
+    if "atto" in checkpoint_url:
+        depths = [2, 2, 6, 2]
+        hidden_sizes = [40, 80, 160, 320]
+    if "femto" in checkpoint_url:
+        depths = [2, 2, 6, 2]
+        hidden_sizes = [48, 96, 192, 384]
+    if "pico" in checkpoint_url:
+        depths = [2, 2, 6, 2]
+        hidden_sizes = [64, 128, 256, 512]
+    if "nano" in checkpoint_url:
+        depths = [2, 2, 8, 2]
+        hidden_sizes = [80, 160, 320, 640]
+    if "tiny" in checkpoint_url:
+        depths = [3, 3, 9, 3]
+        hidden_sizes = [96, 192, 384, 768]
+    if "base" in checkpoint_url:
+        depths = [3, 3, 27, 3]
+        hidden_sizes = [128, 256, 512, 1024]
+    if "large" in checkpoint_url:
+        depths = [3, 3, 27, 3]
+        hidden_sizes = [192, 384, 768, 1536]
+    if "huge" in checkpoint_url:
+        depths = [3, 3, 27, 3]
+        hidden_sizes = [352, 704, 1408, 2816]
+
+    num_labels = 1000
+    filename = "imagenet-1k-id2label.json"
+    expected_shape = (1, 1000)
+
+    repo_id = "huggingface/label-files"
+    config.num_labels = num_labels
+    id2label = json.load(open(hf_hub_download(repo_id, filename, repo_type="dataset"), "r"))
+    id2label = {int(k): v for k, v in id2label.items()}
+
+    config.id2label = id2label
+    config.label2id = {v: k for k, v in id2label.items()}
+    config.hidden_sizes = hidden_sizes
+    config.depths = depths
+
+    return config, expected_shape
+
+
+def rename_key(name):
+    if "downsample_layers.0.0" in name:
+        name = name.replace("downsample_layers.0.0", "embeddings.patch_embeddings")
+    if "downsample_layers.0.1" in name:
+        name = name.replace("downsample_layers.0.1", "embeddings.norm")  # we rename to layernorm later on
+    if "downsample_layers.1.0" in name:
+        name = name.replace("downsample_layers.1.0", "stages.1.downsampling_layer.0")
+    if "downsample_layers.1.1" in name:
+        name = name.replace("downsample_layers.1.1", "stages.1.downsampling_layer.1")
+    if "downsample_layers.2.0" in name:
+        name = name.replace("downsample_layers.2.0", "stages.2.downsampling_layer.0")
+    if "downsample_layers.2.1" in name:
+        name = name.replace("downsample_layers.2.1", "stages.2.downsampling_layer.1")
+    if "downsample_layers.3.0" in name:
+        name = name.replace("downsample_layers.3.0", "stages.3.downsampling_layer.0")
+    if "downsample_layers.3.1" in name:
+        name = name.replace("downsample_layers.3.1", "stages.3.downsampling_layer.1")
+    if "stages" in name and "downsampling_layer" not in name:
+        # stages.0.0. for instance should be renamed to stages.0.layers.0.
+        name = name[: len("stages.0")] + ".layers" + name[len("stages.0") :]
+    if "gamma" in name:
+        name = name.replace("gamma", "weight")
+    if "beta" in name:
+        name = name.replace("beta", "bias")
+    if "stages" in name:
+        name = name.replace("stages", "encoder.stages")
+    if "norm" in name:
+        name = name.replace("norm", "layernorm")
+    if "head" in name:
+        name = name.replace("head", "classifier")
+
+    return name
+
+
+# We will verify our results on an image of cute cats
+def prepare_img():
+    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+    im = Image.open(requests.get(url, stream=True).raw)
+    return im
+
+
+def convert_preprocessor(checkpoint_url):
+    if "224" in checkpoint_url:
+        size = 224
+        crop_pct = 224 / 256
+    elif "384" in checkpoint_url:
+        size = 384
+        crop_pct = None
+    else:
+        size = 512
+        crop_pct = None
+
+    return ConvNextImageProcessor(
+        size=size,
+        crop_pct=crop_pct,
+        image_mean=[0.485, 0.456, 0.406],
+        image_std=[0.229, 0.224, 0.225],
+        resample=PILImageResampling.BICUBIC,
+    )
+
+
+@torch.no_grad()
+def convert_convnextv2_checkpoint(checkpoint_url, pytorch_dump_folder_path, save_model, push_to_hub):
+    """
+    Copy/paste/tweak model's weights to our ConvNeXTV2 structure.
+    """
+    print("Downloading original model from checkpoint...")
+    # define ConvNeXTV2 configuration based on URL
+    config, expected_shape = get_convnextv2_config(checkpoint_url)
+    # load original state_dict from URL
+    state_dict = torch.hub.load_state_dict_from_url(checkpoint_url)["model"]
+
+    print("Converting model parameters...")
+    # rename keys
+    for key in state_dict.copy().keys():
+        val = state_dict.pop(key)
+        state_dict[rename_key(key)] = val
+    # add prefix to all keys except classifier head
+    for key in state_dict.copy().keys():
+        val = state_dict.pop(key)
+        if not key.startswith("classifier"):
+            key = "convnextv2." + key
+        state_dict[key] = val
+
+    # load HuggingFace model
+    model = ConvNextV2ForImageClassification(config)
+    model.load_state_dict(state_dict)
+    model.eval()
+
+    # Check outputs on an image, prepared by ConvNextImageProcessor
+    preprocessor = convert_preprocessor(checkpoint_url)
+    inputs = preprocessor(images=prepare_img(), return_tensors="pt")
+    logits = model(**inputs).logits
+
+    # note: the logits below were obtained without center cropping
+    if checkpoint_url == "https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_atto_1k_224_ema.pt":
+        expected_logits = torch.tensor([-0.3930, 0.1747, -0.5246, 0.4177, 0.4295])
+    elif checkpoint_url == "https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_femto_1k_224_ema.pt":
+        expected_logits = torch.tensor([-0.1727, -0.5341, -0.7818, -0.4745, -0.6566])
+    elif checkpoint_url == "https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_pico_1k_224_ema.pt":
+        expected_logits = torch.tensor([-0.0333, 0.1563, -0.9137, 0.1054, 0.0381])
+    elif checkpoint_url == "https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_nano_1k_224_ema.pt":
+        expected_logits = torch.tensor([-0.1744, -0.1555, -0.0713, 0.0950, -0.1431])
+    elif checkpoint_url == "https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_tiny_1k_224_ema.pt":
+        expected_logits = torch.tensor([0.9996, 0.1966, -0.4386, -0.3472, 0.6661])
+    elif checkpoint_url == "https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_base_1k_224_ema.pt":
+        expected_logits = torch.tensor([-0.2553, -0.6708, -0.1359, 0.2518, -0.2488])
+    elif checkpoint_url == "https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_large_1k_224_ema.pt":
+        expected_logits = torch.tensor([-0.0673, -0.5627, -0.3753, -0.2722, 0.0178])
+    elif checkpoint_url == "https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_huge_1k_224_ema.pt":
+        expected_logits = torch.tensor([-0.6377, -0.7458, -0.2150, 0.1184, -0.0597])
+    elif checkpoint_url == "https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_nano_22k_224_ema.pt":
+        expected_logits = torch.tensor([1.0799, 0.2322, -0.8860, 1.0219, 0.6231])
+    elif checkpoint_url == "https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_nano_22k_384_ema.pt":
+        expected_logits = torch.tensor([0.3766, 0.4917, -1.1426, 0.9942, 0.6024])
+    elif checkpoint_url == "https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_tiny_22k_224_ema.pt":
+        expected_logits = torch.tensor([0.4220, -0.6919, -0.4317, -0.2881, -0.6609])
+    elif checkpoint_url == "https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_tiny_22k_384_ema.pt":
+        expected_logits = torch.tensor([0.1082, -0.8286, -0.5095, 0.4681, -0.8085])
+    elif checkpoint_url == "https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_base_22k_224_ema.pt":
+        expected_logits = torch.tensor([-0.2419, -0.6221, 0.2176, -0.0980, -0.7527])
+    elif checkpoint_url == "https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_base_22k_384_ema.pt":
+        expected_logits = torch.tensor([0.0391, -0.4371, 0.3786, 0.1251, -0.2784])
+    elif checkpoint_url == "https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_large_22k_224_ema.pt":
+        expected_logits = torch.tensor([-0.0504, 0.5636, -0.1729, -0.6507, -0.3949])
+    elif checkpoint_url == "https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_large_22k_384_ema.pt":
+        expected_logits = torch.tensor([0.3560, 0.9486, 0.3149, -0.2667, -0.5138])
+    elif checkpoint_url == "https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_huge_22k_384_ema.pt":
+        expected_logits = torch.tensor([-0.2469, -0.4550, -0.5853, -0.0810, 0.0309])
+    elif checkpoint_url == "https://dl.fbaipublicfiles.com/convnext/convnextv2/im22k/convnextv2_huge_22k_512_ema.pt":
+        expected_logits = torch.tensor([-0.3090, 0.0802, -0.0682, -0.1979, -0.2826])
+    else:
+        raise ValueError(f"Unknown URL: {checkpoint_url}")
+
+    assert torch.allclose(logits[0, :5], expected_logits, atol=1e-3)
+    assert logits.shape == expected_shape
+    print("Model outputs match the original results!")
+
+    if save_model:
+        print(f"Saving model and image processor to {pytorch_dump_folder_path}...")
+        # Create folder to save model
+        if not os.path.isdir(pytorch_dump_folder_path):
+            os.mkdir(pytorch_dump_folder_path)
+
+        model.save_pretrained(pytorch_dump_folder_path)
+        preprocessor.save_pretrained(pytorch_dump_folder_path)
+
+    model_name = "convnextv2"
+    if "atto" in checkpoint_url:
+        model_name += "-atto"
+    elif "femto" in checkpoint_url:
+        model_name += "-femto"
+    elif "pico" in checkpoint_url:
+        model_name += "-pico"
+    elif "nano" in checkpoint_url:
+        model_name += "-nano"
+    elif "tiny" in checkpoint_url:
+        model_name += "-tiny"
+    elif "base" in checkpoint_url:
+        model_name += "-base"
+    elif "large" in checkpoint_url:
+        model_name += "-large"
+    elif "huge" in checkpoint_url:
+        model_name += "-huge"
+    if "22k" in checkpoint_url and "1k" not in checkpoint_url:
+        model_name += "-22k"
+    elif "22k" in checkpoint_url and "1k" in checkpoint_url:
+        model_name += "-22k-1k"
+    elif "1k" in checkpoint_url:
+        model_name += "-1k"
+    if "224" in checkpoint_url:
+        model_name += "-224"
+    elif "384" in checkpoint_url:
+        model_name += "-384"
+    elif "512" in checkpoint_url:
+        model_name += "-512"
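+    # e.g. the default atto checkpoint URL above maps to the hub model name "convnextv2-atto-1k-224"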
+
+    if push_to_hub:
+        print(f"Pushing {model_name} to the hub...")
+        model.push_to_hub(model_name)
+        preprocessor.push_to_hub(model_name)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    # Required parameters
+    parser.add_argument(
+        "--checkpoint_url",
+        default="https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_atto_1k_224_ema.pt",
+        type=str,
+        help="URL of the original ConvNeXTV2 checkpoint you'd like to convert.",
+    )
+    parser.add_argument(
+        "--pytorch_dump_folder_path",
+        default="model",
+        type=str,
+        help="Path to the output PyTorch model directory.",
+    )
+    parser.add_argument("--save_model", action="store_true", help="Save the converted model to a local directory.")
+    parser.add_argument("--push_to_hub", action="store_true", help="Push model and image preprocessor to the hub")
+
+    args = parser.parse_args()
+    convert_convnextv2_checkpoint(
+        args.checkpoint_url, args.pytorch_dump_folder_path, args.save_model, args.push_to_hub
+    )
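+
+
+# Usage sketch (illustration only, not part of the original script): the conversion function can also
+# be called directly; the output directory name below is an arbitrary choice for this example.
+#
+#   convert_convnextv2_checkpoint(
+#       checkpoint_url="https://dl.fbaipublicfiles.com/convnext/convnextv2/im1k/convnextv2_atto_1k_224_ema.pt",
+#       pytorch_dump_folder_path="converted_convnextv2_atto",
+#       save_model=True,
+#       push_to_hub=False,
+#   )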
diff --git a/src/transformers/models/convnextv2/modeling_convnextv2.py b/src/transformers/models/convnextv2/modeling_convnextv2.py
new file mode 100644
index 00000000000000..8d166200d12253
--- /dev/null
+++ b/src/transformers/models/convnextv2/modeling_convnextv2.py
@@ -0,0 +1,576 @@
+# coding=utf-8
+# Copyright 2023 Meta Platforms, Inc. and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" PyTorch ConvNextV2 model."""
+
+
+from typing import Optional, Tuple, Union
+
+import torch
+import torch.utils.checkpoint
+from torch import nn
+from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
+
+from ...activations import ACT2FN
+from ...modeling_outputs import (
+    BackboneOutput,
+    BaseModelOutputWithNoAttention,
+    BaseModelOutputWithPoolingAndNoAttention,
+    ImageClassifierOutputWithNoAttention,
+)
+from ...modeling_utils import PreTrainedModel
+from ...utils import (
+    add_code_sample_docstrings,
+    add_start_docstrings,
+    add_start_docstrings_to_model_forward,
+    logging,
+    replace_return_docstrings,
+)
+from ...utils.backbone_utils import BackboneMixin
+from .configuration_convnextv2 import ConvNextV2Config
+
+
+logger = logging.get_logger(__name__)
+
+# General docstring
+_CONFIG_FOR_DOC = "ConvNextV2Config"
+
+# Base docstring
+_CHECKPOINT_FOR_DOC = "facebook/convnextv2-tiny-1k-224"
+_EXPECTED_OUTPUT_SHAPE = [1, 768, 7, 7]
+
+# Image classification docstring
+_IMAGE_CLASS_CHECKPOINT = "facebook/convnextv2-tiny-1k-224"
+_IMAGE_CLASS_EXPECTED_OUTPUT = "tabby, tabby cat"
+
+CONVNEXTV2_PRETRAINED_MODEL_ARCHIVE_LIST = [
+    "facebook/convnextv2-tiny-1k-224",
+    # See all ConvNextV2 models at https://huggingface.co/models?filter=convnextv2
+]
+
+
+# Copied from transformers.models.beit.modeling_beit.drop_path
+def drop_path(input: torch.Tensor, drop_prob: float = 0.0, training: bool = False) -> torch.Tensor:
+    """
+    Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
+
+    Comment by Ross Wightman: This is the same as the DropConnect impl I created for EfficientNet, etc networks,
+    however, the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper...
+    See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for changing the
+    layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use 'survival rate' as the
+    argument.
+    """
+    if drop_prob == 0.0 or not training:
+        return input
+    keep_prob = 1 - drop_prob
+    shape = (input.shape[0],) + (1,) * (input.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
+    random_tensor = keep_prob + torch.rand(shape, dtype=input.dtype, device=input.device)
+    random_tensor.floor_()  # binarize
+    output = input.div(keep_prob) * random_tensor
+    return output
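+
+
+# Numeric illustration (comment only, not executed): with drop_prob=0.5 and training=True, each
+# sample's path is either zeroed or rescaled by 1 / keep_prob = 2.0, so the expected value of the
+# output matches the undropped path:
+#
+#   x = torch.ones(4, 3, 8, 8)
+#   out = drop_path(x, drop_prob=0.5, training=True)
+#   # every sample in `out` is now either all 0.0 or all 2.0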
+
+
+# Copied from transformers.models.beit.modeling_beit.BeitDropPath with Beit->ConvNextV2
+class ConvNextV2DropPath(nn.Module):
+    """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks)."""
+
+    def __init__(self, drop_prob: Optional[float] = None) -> None:
+        super().__init__()
+        self.drop_prob = drop_prob
+
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        return drop_path(hidden_states, self.drop_prob, self.training)
+
+    def extra_repr(self) -> str:
+        return "p={}".format(self.drop_prob)
+
+
+class ConvNextV2GRN(nn.Module):
+    """GRN (Global Response Normalization) layer"""
+
+    def __init__(self, dim: int):
+        super().__init__()
+        self.weight = nn.Parameter(torch.zeros(1, 1, 1, dim))
+        self.bias = nn.Parameter(torch.zeros(1, 1, 1, dim))
+
+    def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor:
+        # Compute and normalize global spatial feature maps
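+        # Global Response Normalization (ConvNeXt V2): take the per-channel L2 norm over the spatial
+        # dimensions, divide it by its mean across channels (plus a small epsilon for stability), and
+        # use the result to reweight the features before the learnable affine (weight/bias) and the
+        # residual connection.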
+        global_features = torch.norm(hidden_states, p=2, dim=(1, 2), keepdim=True)
+        norm_features = global_features / (global_features.mean(dim=-1, keepdim=True) + 1e-6)
+        hidden_states = self.weight * (hidden_states * norm_features) + self.bias + hidden_states
+
+        return hidden_states
+
+
+# Copied from transformers.models.convnext.modeling_convnext.ConvNextLayerNorm with ConvNext->ConvNextV2
+class ConvNextV2LayerNorm(nn.Module):
+    r"""LayerNorm that supports two data formats: channels_last (default) or channels_first, referring to
+    the ordering of the dimensions in the inputs. channels_last corresponds to inputs with shape (batch_size,
+    height, width, channels) while channels_first corresponds to inputs with shape (batch_size, channels,
+    height, width).
+    """
+
+    def __init__(self, normalized_shape, eps=1e-6, data_format="channels_last"):
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(normalized_shape))
+        self.bias = nn.Parameter(torch.zeros(normalized_shape))
+        self.eps = eps
+        self.data_format = data_format
+        if self.data_format not in ["channels_last", "channels_first"]:
+            raise NotImplementedError(f"Unsupported data format: {self.data_format}")
+        self.normalized_shape = (normalized_shape,)
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        if self.data_format == "channels_last":
+            x = torch.nn.functional.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
+        elif self.data_format == "channels_first":
+            input_dtype = x.dtype
+            x = x.float()
+            u = x.mean(1, keepdim=True)
+            s = (x - u).pow(2).mean(1, keepdim=True)
+            x = (x - u) / torch.sqrt(s + self.eps)
+            x = x.to(dtype=input_dtype)
+            x = self.weight[:, None, None] * x + self.bias[:, None, None]
+        return x
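+
+    # Usage sketch (comment only): the channels_first variant is applied to (batch, channels, height,
+    # width) feature maps by the embeddings and downsampling layers, e.g.
+    #   norm = ConvNextV2LayerNorm(96, data_format="channels_first")
+    #   out = norm(torch.randn(1, 96, 56, 56))  # normalizes over the 96 channels at each spatial position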
+
+
+# Copied from transformers.models.convnext.modeling_convnext.ConvNextEmbeddings with ConvNext->ConvNextV2
+class ConvNextV2Embeddings(nn.Module):
+    """This class is comparable to (and inspired by) the SwinEmbeddings class
+    found in src/transformers/models/swin/modeling_swin.py.
+    """
+
+    def __init__(self, config):
+        super().__init__()
+        self.patch_embeddings = nn.Conv2d(
+            config.num_channels, config.hidden_sizes[0], kernel_size=config.patch_size, stride=config.patch_size
+        )
+        self.layernorm = ConvNextV2LayerNorm(config.hidden_sizes[0], eps=1e-6, data_format="channels_first")
+        self.num_channels = config.num_channels
+
+    def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
+        num_channels = pixel_values.shape[1]
+        if num_channels != self.num_channels:
+            raise ValueError(
+                "Make sure that the channel dimension of the pixel values matches the one set in the configuration."
+            )
+        embeddings = self.patch_embeddings(pixel_values)
+        embeddings = self.layernorm(embeddings)
+        return embeddings
+
+
+class ConvNextV2Layer(nn.Module):
+    """This corresponds to the `Block` class in the original implementation.
+
+    There are two equivalent implementations: (1) [DwConv, LayerNorm (channels_first), Conv, GELU, 1x1 Conv],
+    all in (N, C, H, W); (2) [DwConv, Permute to (N, H, W, C), LayerNorm (channels_last), Linear, GELU, Linear],
+    then Permute back.
+
+    The authors used (2) as they find it slightly faster in PyTorch.
+
+    Args:
+        config ([`ConvNextV2Config`]): Model configuration class.
+        dim (`int`): Number of input channels.
+        drop_path (`float`): Stochastic depth rate. Default: 0.0.
+    """
+
+    def __init__(self, config, dim, drop_path=0):
+        super().__init__()
+        # depthwise conv
+        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
+        self.layernorm = ConvNextV2LayerNorm(dim, eps=1e-6)
+        # pointwise/1x1 convs, implemented with linear layers
+        self.pwconv1 = nn.Linear(dim, 4 * dim)
+        self.act = ACT2FN[config.hidden_act]
+        self.grn = ConvNextV2GRN(4 * dim)
+        self.pwconv2 = nn.Linear(4 * dim, dim)
+        self.drop_path = ConvNextV2DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
+
+    def forward(self, hidden_states: torch.FloatTensor) -> torch.Tensor:
+        input = hidden_states
+        x = self.dwconv(hidden_states)
+        # (batch_size, num_channels, height, width) -> (batch_size, height, width, num_channels)
+        x = x.permute(0, 2, 3, 1)
+        x = self.layernorm(x)
+        x = self.pwconv1(x)
+        x = self.act(x)
+        x = self.grn(x)
+        x = self.pwconv2(x)
+        # (batch_size, height, width, num_channels) -> (batch_size, num_channels, height, width)
+        x = x.permute(0, 3, 1, 2)
+
+        x = input + self.drop_path(x)
+        return x
+
+
+# Copied from transformers.models.convnext.modeling_convnext.ConvNextStage with ConvNeXT->ConvNeXTV2, ConvNext->ConvNextV2
+class ConvNextV2Stage(nn.Module):
+    """ConvNeXTV2 stage, consisting of an optional downsampling layer + multiple residual blocks.
+
+    Args:
+        config ([`ConvNextV2Config`]): Model configuration class.
+        in_channels (`int`): Number of input channels.
+        out_channels (`int`): Number of output channels.
+        depth (`int`): Number of residual blocks.
+        drop_path_rates(`List[float]`): Stochastic depth rates for each layer.
+    """
+
+    def __init__(self, config, in_channels, out_channels, kernel_size=2, stride=2, depth=2, drop_path_rates=None):
+        super().__init__()
+
+        if in_channels != out_channels or stride > 1:
+            self.downsampling_layer = nn.Sequential(
+                ConvNextV2LayerNorm(in_channels, eps=1e-6, data_format="channels_first"),
+                nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, stride=stride),
+            )
+        else:
+            self.downsampling_layer = nn.Identity()
+        drop_path_rates = drop_path_rates or [0.0] * depth
+        self.layers = nn.Sequential(
+            *[ConvNextV2Layer(config, dim=out_channels, drop_path=drop_path_rates[j]) for j in range(depth)]
+        )
+
+    def forward(self, hidden_states: torch.FloatTensor) -> torch.Tensor:
+        hidden_states = self.downsampling_layer(hidden_states)
+        hidden_states = self.layers(hidden_states)
+        return hidden_states
+
+
+# Copied from transformers.models.convnext.modeling_convnext.ConvNextEncoder with ConvNext->ConvNextV2
+class ConvNextV2Encoder(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.stages = nn.ModuleList()
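+        # Stochastic depth schedule: a single linspace from 0 to config.drop_path_rate over all blocks,
+        # split into one list per stage so that deeper blocks get higher drop probabilities.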
+        drop_path_rates = [
+            x.tolist() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths)).split(config.depths)
+        ]
+        prev_chs = config.hidden_sizes[0]
+        for i in range(config.num_stages):
+            out_chs = config.hidden_sizes[i]
+            stage = ConvNextV2Stage(
+                config,
+                in_channels=prev_chs,
+                out_channels=out_chs,
+                stride=2 if i > 0 else 1,
+                depth=config.depths[i],
+                drop_path_rates=drop_path_rates[i],
+            )
+            self.stages.append(stage)
+            prev_chs = out_chs
+
+    def forward(
+        self,
+        hidden_states: torch.FloatTensor,
+        output_hidden_states: Optional[bool] = False,
+        return_dict: Optional[bool] = True,
+    ) -> Union[Tuple, BaseModelOutputWithNoAttention]:
+        all_hidden_states = () if output_hidden_states else None
+
+        for i, layer_module in enumerate(self.stages):
+            if output_hidden_states:
+                all_hidden_states = all_hidden_states + (hidden_states,)
+
+            hidden_states = layer_module(hidden_states)
+
+        if output_hidden_states:
+            all_hidden_states = all_hidden_states + (hidden_states,)
+
+        if not return_dict:
+            return tuple(v for v in [hidden_states, all_hidden_states] if v is not None)
+
+        return BaseModelOutputWithNoAttention(
+            last_hidden_state=hidden_states,
+            hidden_states=all_hidden_states,
+        )
+
+
+# Copied from transformers.models.convnext.modeling_convnext.ConvNextPreTrainedModel with ConvNext->ConvNextV2, convnext->convnextv2
+class ConvNextV2PreTrainedModel(PreTrainedModel):
+    """
+    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
+    models.
+    """
+
+    config_class = ConvNextV2Config
+    base_model_prefix = "convnextv2"
+    main_input_name = "pixel_values"
+
+    def _init_weights(self, module):
+        """Initialize the weights"""
+        if isinstance(module, (nn.Linear, nn.Conv2d)):
+            # Slightly different from the TF version which uses truncated_normal for initialization
+            # cf https://github.com/pytorch/pytorch/pull/5617
+            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.LayerNorm):
+            module.bias.data.zero_()
+            module.weight.data.fill_(1.0)
+
+
+CONVNEXTV2_START_DOCSTRING = r"""
+    This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it
+    as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and
+    behavior.
+
+    Parameters:
+        config ([`ConvNextV2Config`]): Model configuration class with all the parameters of the model.
+            Initializing with a config file does not load the weights associated with the model, only the
+            configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+CONVNEXTV2_INPUTS_DOCSTRING = r"""
+    Args:
+        pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
+            Pixel values. Pixel values can be obtained using [`ConvNextImageProcessor`]. See
+            [`ConvNextImageProcessor.__call__`] for details.
+        output_hidden_states (`bool`, *optional*):
+            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+            more detail.
+        return_dict (`bool`, *optional*):
+            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+"""
+
+
+@add_start_docstrings(
+    "The bare ConvNextV2 model outputting raw features without any specific head on top.",
+    CONVNEXTV2_START_DOCSTRING,
+)
+# Copied from transformers.models.convnext.modeling_convnext.ConvNextModel with CONVNEXT->CONVNEXTV2, ConvNext->ConvNextV2
+class ConvNextV2Model(ConvNextV2PreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.config = config
+
+        self.embeddings = ConvNextV2Embeddings(config)
+        self.encoder = ConvNextV2Encoder(config)
+
+        # final layernorm layer
+        self.layernorm = nn.LayerNorm(config.hidden_sizes[-1], eps=config.layer_norm_eps)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    @add_start_docstrings_to_model_forward(CONVNEXTV2_INPUTS_DOCSTRING)
+    @add_code_sample_docstrings(
+        checkpoint=_CHECKPOINT_FOR_DOC,
+        output_type=BaseModelOutputWithPoolingAndNoAttention,
+        config_class=_CONFIG_FOR_DOC,
+        modality="vision",
+        expected_output=_EXPECTED_OUTPUT_SHAPE,
+    )
+    def forward(
+        self,
+        pixel_values: torch.FloatTensor = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, BaseModelOutputWithPoolingAndNoAttention]:
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        if pixel_values is None:
+            raise ValueError("You have to specify pixel_values")
+
+        embedding_output = self.embeddings(pixel_values)
+
+        encoder_outputs = self.encoder(
+            embedding_output,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        last_hidden_state = encoder_outputs[0]
+
+        # global average pooling, (N, C, H, W) -> (N, C)
+        pooled_output = self.layernorm(last_hidden_state.mean([-2, -1]))
+
+        if not return_dict:
+            return (last_hidden_state, pooled_output) + encoder_outputs[1:]
+
+        return BaseModelOutputWithPoolingAndNoAttention(
+            last_hidden_state=last_hidden_state,
+            pooler_output=pooled_output,
+            hidden_states=encoder_outputs.hidden_states,
+        )
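+
+    # Usage sketch (comment only, `image` being a PIL image):
+    #   processor = AutoImageProcessor.from_pretrained("facebook/convnextv2-tiny-1k-224")
+    #   model = ConvNextV2Model.from_pretrained("facebook/convnextv2-tiny-1k-224")
+    #   outputs = model(**processor(images=image, return_tensors="pt"))
+    #   # outputs.last_hidden_state has shape (1, 768, 7, 7) for a single 224x224 image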
+
+
+@add_start_docstrings(
+    """
+    ConvNextV2 Model with an image classification head on top (a linear layer on top of the pooled features), e.g. for
+    ImageNet.
+    """,
+    CONVNEXTV2_START_DOCSTRING,
+)
+# Copied from transformers.models.convnext.modeling_convnext.ConvNextForImageClassification with CONVNEXT->CONVNEXTV2,ConvNext->ConvNextV2,convnext->convnextv2
+class ConvNextV2ForImageClassification(ConvNextV2PreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+
+        self.num_labels = config.num_labels
+        self.convnextv2 = ConvNextV2Model(config)
+
+        # Classifier head
+        self.classifier = (
+            nn.Linear(config.hidden_sizes[-1], config.num_labels) if config.num_labels > 0 else nn.Identity()
+        )
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    @add_start_docstrings_to_model_forward(CONVNEXTV2_INPUTS_DOCSTRING)
+    @add_code_sample_docstrings(
+        checkpoint=_IMAGE_CLASS_CHECKPOINT,
+        output_type=ImageClassifierOutputWithNoAttention,
+        config_class=_CONFIG_FOR_DOC,
+        expected_output=_IMAGE_CLASS_EXPECTED_OUTPUT,
+    )
+    def forward(
+        self,
+        pixel_values: torch.FloatTensor = None,
+        labels: Optional[torch.LongTensor] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, ImageClassifierOutputWithNoAttention]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
+            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
+            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        outputs = self.convnextv2(pixel_values, output_hidden_states=output_hidden_states, return_dict=return_dict)
+
+        pooled_output = outputs.pooler_output if return_dict else outputs[1]
+
+        logits = self.classifier(pooled_output)
+
+        loss = None
+        if labels is not None:
+            if self.config.problem_type is None:
+                if self.num_labels == 1:
+                    self.config.problem_type = "regression"
+                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
+                    self.config.problem_type = "single_label_classification"
+                else:
+                    self.config.problem_type = "multi_label_classification"
+
+            if self.config.problem_type == "regression":
+                loss_fct = MSELoss()
+                if self.num_labels == 1:
+                    loss = loss_fct(logits.squeeze(), labels.squeeze())
+                else:
+                    loss = loss_fct(logits, labels)
+            elif self.config.problem_type == "single_label_classification":
+                loss_fct = CrossEntropyLoss()
+                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+            elif self.config.problem_type == "multi_label_classification":
+                loss_fct = BCEWithLogitsLoss()
+                loss = loss_fct(logits, labels)
+        if not return_dict:
+            output = (logits,) + outputs[2:]
+            return ((loss,) + output) if loss is not None else output
+
+        return ImageClassifierOutputWithNoAttention(
+            loss=loss,
+            logits=logits,
+            hidden_states=outputs.hidden_states,
+        )
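+
+    # Usage sketch (comment only, `image` being a PIL image):
+    #   processor = AutoImageProcessor.from_pretrained("facebook/convnextv2-tiny-1k-224")
+    #   model = ConvNextV2ForImageClassification.from_pretrained("facebook/convnextv2-tiny-1k-224")
+    #   logits = model(**processor(images=image, return_tensors="pt")).logits
+    #   print(model.config.id2label[logits.argmax(-1).item()])  # "tabby, tabby cat" for the usual COCO cats photo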
+
+
+@add_start_docstrings(
+    """
+    ConvNeXT V2 backbone, to be used with frameworks like DETR and MaskFormer.
+    """,
+    CONVNEXTV2_START_DOCSTRING,
+)
+# Copied from transformers.models.convnext.modeling_convnext.ConvNextBackbone with CONVNEXT->CONVNEXTV2,ConvNext->ConvNextV2,facebook/convnext-tiny-224->facebook/convnextv2-tiny-1k-224
+class ConvNextV2Backbone(ConvNextV2PreTrainedModel, BackboneMixin):
+    def __init__(self, config):
+        super().__init__(config)
+        super()._init_backbone(config)
+
+        self.embeddings = ConvNextV2Embeddings(config)
+        self.encoder = ConvNextV2Encoder(config)
+        self.num_features = [config.hidden_sizes[0]] + config.hidden_sizes
+
+        # Add layer norms to hidden states of out_features
+        hidden_states_norms = {}
+        for stage, num_channels in zip(self._out_features, self.channels):
+            hidden_states_norms[stage] = ConvNextV2LayerNorm(num_channels, data_format="channels_first")
+        self.hidden_states_norms = nn.ModuleDict(hidden_states_norms)
+
+        # initialize weights and apply final processing
+        self.post_init()
+
+    @add_start_docstrings_to_model_forward(CONVNEXTV2_INPUTS_DOCSTRING)
+    @replace_return_docstrings(output_type=BackboneOutput, config_class=_CONFIG_FOR_DOC)
+    def forward(
+        self,
+        pixel_values: torch.Tensor,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> BackboneOutput:
+        """
+        Returns:
+
+        Examples:
+
+        ```python
+        >>> from transformers import AutoImageProcessor, AutoBackbone
+        >>> import torch
+        >>> from PIL import Image
+        >>> import requests
+
+        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+        >>> image = Image.open(requests.get(url, stream=True).raw)
+
+        >>> processor = AutoImageProcessor.from_pretrained("facebook/convnextv2-tiny-1k-224")
+        >>> model = AutoBackbone.from_pretrained("facebook/convnextv2-tiny-1k-224")
+
+        >>> inputs = processor(image, return_tensors="pt")
+        >>> outputs = model(**inputs)
+        ```"""
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+
+        embedding_output = self.embeddings(pixel_values)
+
+        outputs = self.encoder(
+            embedding_output,
+            output_hidden_states=True,
+            return_dict=return_dict,
+        )
+
+        hidden_states = outputs.hidden_states if return_dict else outputs[1]
+
+        feature_maps = ()
+        for stage, hidden_state in zip(self.stage_names, hidden_states):
+            if stage in self.out_features:
+                hidden_state = self.hidden_states_norms[stage](hidden_state)
+                feature_maps += (hidden_state,)
+
+        if not return_dict:
+            output = (feature_maps,)
+            if output_hidden_states:
+                output += (hidden_states,)
+            return output
+
+        return BackboneOutput(
+            feature_maps=feature_maps,
+            hidden_states=hidden_states if output_hidden_states else None,
+            attentions=None,
+        )
diff --git a/src/transformers/models/convnextv2/modeling_tf_convnextv2.py b/src/transformers/models/convnextv2/modeling_tf_convnextv2.py
new file mode 100644
index 00000000000000..d4bef6f161d2bf
--- /dev/null
+++ b/src/transformers/models/convnextv2/modeling_tf_convnextv2.py
@@ -0,0 +1,686 @@
+# coding=utf-8
+# Copyright 2023 Meta Platforms Inc. and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" TF 2.0 ConvNextV2 model."""
+
+
+from __future__ import annotations
+
+from typing import List, Optional, Tuple, Union
+
+import numpy as np
+import tensorflow as tf
+
+from ...activations_tf import get_tf_activation
+from ...modeling_tf_outputs import (
+    TFBaseModelOutputWithNoAttention,
+    TFBaseModelOutputWithPooling,
+    TFBaseModelOutputWithPoolingAndNoAttention,
+    TFImageClassifierOutputWithNoAttention,
+)
+from ...modeling_tf_utils import (
+    TFModelInputType,
+    TFPreTrainedModel,
+    TFSequenceClassificationLoss,
+    get_initializer,
+    keras,
+    keras_serializable,
+    unpack_inputs,
+)
+from ...tf_utils import shape_list
+from ...utils import (
+    add_code_sample_docstrings,
+    add_start_docstrings,
+    add_start_docstrings_to_model_forward,
+    logging,
+)
+from .configuration_convnextv2 import ConvNextV2Config
+
+
+logger = logging.get_logger(__name__)
+
+# General docstring
+_CONFIG_FOR_DOC = "ConvNextV2Config"
+
+# Base docstring
+_CHECKPOINT_FOR_DOC = "facebook/convnextv2-tiny-1k-224"
+_EXPECTED_OUTPUT_SHAPE = [1, 768, 7, 7]
+
+# Image classification docstring
+_IMAGE_CLASS_CHECKPOINT = "facebook/convnextv2-tiny-1k-224"
+_IMAGE_CLASS_EXPECTED_OUTPUT = "tabby, tabby cat"
+
+CONVNEXTV2_PRETRAINED_MODEL_ARCHIVE_LIST = [
+    "facebook/convnextv2-tiny-1k-224",
+    # See all ConvNextV2 models at https://huggingface.co/models?filter=convnextv2
+]
+
+
+# Copied from transformers.models.convnext.modeling_tf_convnext.TFConvNextDropPath with ConvNext->ConvNextV2
+class TFConvNextV2DropPath(keras.layers.Layer):
+    """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
+    References:
+        (1) github.com:rwightman/pytorch-image-models
+    """
+
+    def __init__(self, drop_path: float, **kwargs):
+        super().__init__(**kwargs)
+        self.drop_path = drop_path
+
+    def call(self, x: tf.Tensor, training=None):
+        if training:
+            keep_prob = 1 - self.drop_path
+            shape = (tf.shape(x)[0],) + (1,) * (len(tf.shape(x)) - 1)
+            random_tensor = keep_prob + tf.random.uniform(shape, 0, 1)
+            random_tensor = tf.floor(random_tensor)
+            return (x / keep_prob) * random_tensor
+        return x
+
+
+class TFConvNextV2GRN(keras.layers.Layer):
+    """GRN (Global Response Normalization) layer"""
+
+    def __init__(self, config: ConvNextV2Config, dim: int, **kwargs):
+        super().__init__(**kwargs)
+        self.dim = dim
+
+    def build(self, input_shape: tf.TensorShape = None):
+        # PT's `nn.Parameters` must be mapped to a TF layer weight to inherit the same name hierarchy (and vice-versa)
+        self.weight = self.add_weight(
+            name="weight",
+            shape=(1, 1, 1, self.dim),
+            initializer=keras.initializers.Zeros(),
+        )
+        self.bias = self.add_weight(
+            name="bias",
+            shape=(1, 1, 1, self.dim),
+            initializer=keras.initializers.Zeros(),
+        )
+        return super().build(input_shape)
+
+    def call(self, hidden_states: tf.Tensor):
+        global_features = tf.norm(hidden_states, ord="euclidean", axis=(1, 2), keepdims=True)
+        norm_features = global_features / (tf.reduce_mean(global_features, axis=-1, keepdims=True) + 1e-6)
+        hidden_states = self.weight * (hidden_states * norm_features) + self.bias + hidden_states
+        return hidden_states
+
+
+# Copied from transformers.models.convnext.modeling_tf_convnext.TFConvNextEmbeddings with ConvNext->ConvNextV2
+class TFConvNextV2Embeddings(keras.layers.Layer):
+    """This class is comparable to (and inspired by) the SwinEmbeddings class
+    found in src/transformers/models/swin/modeling_swin.py.
+    """
+
+    def __init__(self, config: ConvNextV2Config, **kwargs):
+        super().__init__(**kwargs)
+        self.patch_embeddings = keras.layers.Conv2D(
+            filters=config.hidden_sizes[0],
+            kernel_size=config.patch_size,
+            strides=config.patch_size,
+            name="patch_embeddings",
+            kernel_initializer=get_initializer(config.initializer_range),
+            bias_initializer=keras.initializers.Zeros(),
+        )
+        self.layernorm = keras.layers.LayerNormalization(epsilon=1e-6, name="layernorm")
+        self.num_channels = config.num_channels
+        self.config = config
+
+    def call(self, pixel_values):
+        if isinstance(pixel_values, dict):
+            pixel_values = pixel_values["pixel_values"]
+
+        tf.debugging.assert_equal(
+            shape_list(pixel_values)[1],
+            self.num_channels,
+            message="Make sure that the channel dimension of the pixel values matches the one set in the configuration.",
+        )
+
+        # When running on CPU, `keras.layers.Conv2D` doesn't support `NCHW` format.
+        # So change the input format from `NCHW` to `NHWC`.
+        # shape = (batch_size, in_height, in_width, in_channels)
+        pixel_values = tf.transpose(pixel_values, perm=(0, 2, 3, 1))
+
+        embeddings = self.patch_embeddings(pixel_values)
+        embeddings = self.layernorm(embeddings)
+        return embeddings
+
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "patch_embeddings", None) is not None:
+            with tf.name_scope(self.patch_embeddings.name):
+                self.patch_embeddings.build([None, None, None, self.config.num_channels])
+        if getattr(self, "layernorm", None) is not None:
+            with tf.name_scope(self.layernorm.name):
+                self.layernorm.build([None, None, None, self.config.hidden_sizes[0]])
+
+
+class TFConvNextV2Layer(keras.layers.Layer):
+    """This corresponds to the `Block` class in the original implementation.
+
+    There are two equivalent implementations: (1) [DwConv, LayerNorm (channels_first), Conv, GELU, 1x1 Conv],
+    all in (N, C, H, W); (2) [DwConv, Permute to (N, H, W, C), LayerNorm (channels_last), Linear, GELU, Linear],
+    then Permute back.
+
+    The authors used (2) as they find it slightly faster in PyTorch. Since we already permuted the inputs to follow
+    NHWC ordering, we can just apply the operations straight-away without the permutation.
+
+    Args:
+        config (`ConvNextV2Config`):
+            Model configuration class.
+        dim (`int`):
+            Number of input channels.
+        drop_path (`float`, defaults to 0.0):
+            Stochastic depth rate.
+    """
+
+    def __init__(self, config: ConvNextV2Config, dim: int, drop_path: float = 0.0, **kwargs):
+        super().__init__(**kwargs)
+        self.dim = dim
+        self.config = config
+        self.dwconv = keras.layers.Conv2D(
+            filters=dim,
+            kernel_size=7,
+            padding="same",
+            groups=dim,
+            kernel_initializer=get_initializer(config.initializer_range),
+            bias_initializer=keras.initializers.Zeros(),
+            name="dwconv",
+        )  # depthwise conv
+        self.layernorm = keras.layers.LayerNormalization(
+            epsilon=1e-6,
+            name="layernorm",
+        )
+        self.pwconv1 = keras.layers.Dense(
+            units=4 * dim,
+            kernel_initializer=get_initializer(config.initializer_range),
+            bias_initializer=keras.initializers.Zeros(),
+            name="pwconv1",
+        )  # pointwise/1x1 convs, implemented with linear layers
+        self.act = get_tf_activation(config.hidden_act)
+        self.grn = TFConvNextV2GRN(config, 4 * dim, dtype=tf.float32, name="grn")
+        self.pwconv2 = keras.layers.Dense(
+            units=dim,
+            kernel_initializer=get_initializer(config.initializer_range),
+            bias_initializer=keras.initializers.Zeros(),
+            name="pwconv2",
+        )
+        # Using `layers.Activation` instead of `tf.identity` to better control `training`
+        # behaviour.
+        self.drop_path = (
+            TFConvNextV2DropPath(drop_path, name="drop_path")
+            if drop_path > 0.0
+            else keras.layers.Activation("linear", name="drop_path")
+        )
+
+    def call(self, hidden_states, training=False):
+        input = hidden_states
+        x = self.dwconv(hidden_states)
+        x = self.layernorm(x)
+        x = self.pwconv1(x)
+        x = self.act(x)
+        x = self.grn(x)
+        x = self.pwconv2(x)
+        x = self.drop_path(x, training=training)
+        x = input + x
+        return x
+
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dwconv", None) is not None:
+            with tf.name_scope(self.dwconv.name):
+                self.dwconv.build([None, None, None, self.dim])
+        if getattr(self, "layernorm", None) is not None:
+            with tf.name_scope(self.layernorm.name):
+                self.layernorm.build([None, None, None, self.dim])
+        if getattr(self, "pwconv1", None) is not None:
+            with tf.name_scope(self.pwconv1.name):
+                self.pwconv1.build([None, None, self.dim])
+        if getattr(self, "grn", None) is not None:
+            with tf.name_scope(self.grn.name):
+                self.grn.build(None)
+        if getattr(self, "pwconv2", None) is not None:
+            with tf.name_scope(self.pwconv2.name):
+                self.pwconv2.build([None, None, 4 * self.dim])
+        if getattr(self, "drop_path", None) is not None:
+            with tf.name_scope(self.drop_path.name):
+                self.drop_path.build(None)
+
+
+# Copied from transformers.models.convnext.modeling_tf_convnext.TFConvNextStage with ConvNext->ConvNextV2
+class TFConvNextV2Stage(keras.layers.Layer):
+    """ConvNextV2 stage, consisting of an optional downsampling layer + multiple residual blocks.
+
+    Args:
+        config (`ConvNextV2Config`):
+            Model configuration class.
+        in_channels (`int`):
+            Number of input channels.
+        out_channels (`int`):
+            Number of output channels.
+        depth (`int`):
+            Number of residual blocks.
+        drop_path_rates(`List[float]`):
+            Stochastic depth rates for each layer.
+    """
+
+    def __init__(
+        self,
+        config: ConvNextV2Config,
+        in_channels: int,
+        out_channels: int,
+        kernel_size: int = 2,
+        stride: int = 2,
+        depth: int = 2,
+        drop_path_rates: Optional[List[float]] = None,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+        if in_channels != out_channels or stride > 1:
+            self.downsampling_layer = [
+                keras.layers.LayerNormalization(
+                    epsilon=1e-6,
+                    name="downsampling_layer.0",
+                ),
+                # Inputs to this layer will follow NHWC format since we
+                # transposed the inputs from NCHW to NHWC in the `TFConvNextV2Embeddings`
+                # layer. All the outputs throughout the model will be in NHWC
+                # from this point on until the output where we again change to
+                # NCHW.
+                keras.layers.Conv2D(
+                    filters=out_channels,
+                    kernel_size=kernel_size,
+                    strides=stride,
+                    kernel_initializer=get_initializer(config.initializer_range),
+                    bias_initializer=keras.initializers.Zeros(),
+                    name="downsampling_layer.1",
+                ),
+            ]
+        else:
+            self.downsampling_layer = [tf.identity]
+
+        drop_path_rates = drop_path_rates or [0.0] * depth
+        self.layers = [
+            TFConvNextV2Layer(
+                config,
+                dim=out_channels,
+                drop_path=drop_path_rates[j],
+                name=f"layers.{j}",
+            )
+            for j in range(depth)
+        ]
+        self.in_channels = in_channels
+        self.out_channels = out_channels
+        self.stride = stride
+
+    def call(self, hidden_states):
+        for layer in self.downsampling_layer:
+            hidden_states = layer(hidden_states)
+        for layer in self.layers:
+            hidden_states = layer(hidden_states)
+        return hidden_states
+
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "layers", None) is not None:
+            for layer in self.layers:
+                with tf.name_scope(layer.name):
+                    layer.build(None)
+        if self.in_channels != self.out_channels or self.stride > 1:
+            with tf.name_scope(self.downsampling_layer[0].name):
+                self.downsampling_layer[0].build([None, None, None, self.in_channels])
+            with tf.name_scope(self.downsampling_layer[1].name):
+                self.downsampling_layer[1].build([None, None, None, self.in_channels])
+
+
+class TFConvNextV2Encoder(keras.layers.Layer):
+    def __init__(self, config: ConvNextV2Config, **kwargs):
+        super().__init__(**kwargs)
+        self.stages = []
+        drop_path_rates = tf.linspace(0.0, config.drop_path_rate, sum(config.depths))
+        drop_path_rates = tf.split(drop_path_rates, config.depths)
+        drop_path_rates = [x.numpy().tolist() for x in drop_path_rates]
+        prev_chs = config.hidden_sizes[0]
+        for i in range(config.num_stages):
+            out_chs = config.hidden_sizes[i]
+            stage = TFConvNextV2Stage(
+                config,
+                in_channels=prev_chs,
+                out_channels=out_chs,
+                stride=2 if i > 0 else 1,
+                depth=config.depths[i],
+                drop_path_rates=drop_path_rates[i],
+                name=f"stages.{i}",
+            )
+            self.stages.append(stage)
+            prev_chs = out_chs
+
+    def call(
+        self,
+        hidden_states: tf.Tensor,
+        output_hidden_states: Optional[bool] = False,
+        return_dict: Optional[bool] = True,
+    ) -> Union[Tuple, TFBaseModelOutputWithNoAttention]:
+        all_hidden_states = () if output_hidden_states else None
+
+        for i, layer_module in enumerate(self.stages):
+            if output_hidden_states:
+                all_hidden_states = all_hidden_states + (hidden_states,)
+
+            hidden_states = layer_module(hidden_states)
+
+        if output_hidden_states:
+            all_hidden_states = all_hidden_states + (hidden_states,)
+
+        if not return_dict:
+            return tuple(v for v in [hidden_states, all_hidden_states] if v is not None)
+
+        return TFBaseModelOutputWithNoAttention(last_hidden_state=hidden_states, hidden_states=all_hidden_states)
+
+    def build(self, input_shape=None):
+        for stage in self.stages:
+            with tf.name_scope(stage.name):
+                stage.build(None)
+
+
+@keras_serializable
+class TFConvNextV2MainLayer(keras.layers.Layer):
+    config_class = ConvNextV2Config
+
+    def __init__(self, config: ConvNextV2Config, **kwargs):
+        super().__init__(**kwargs)
+
+        self.config = config
+        self.embeddings = TFConvNextV2Embeddings(config, name="embeddings")
+        self.encoder = TFConvNextV2Encoder(config, name="encoder")
+        self.layernorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layernorm")
+        # The encoder output is still in NHWC format at this point, so the pooler uses
+        # `data_format="channels_last"`; the hidden states are converted back to NCHW below.
+        self.pooler = keras.layers.GlobalAvgPool2D(data_format="channels_last")
+
+    @unpack_inputs
+    def call(
+        self,
+        pixel_values: TFModelInputType | None = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        training: bool = False,
+    ) -> Union[TFBaseModelOutputWithPooling, Tuple[tf.Tensor]]:
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        if pixel_values is None:
+            raise ValueError("You have to specify pixel_values")
+
+        embedding_output = self.embeddings(pixel_values, training=training)
+
+        encoder_outputs = self.encoder(
+            embedding_output,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            training=training,
+        )
+
+        last_hidden_state = encoder_outputs[0]
+
+        # Change to NCHW output format to have uniformity in the modules
+        pooled_output = self.pooler(last_hidden_state)
+        last_hidden_state = tf.transpose(last_hidden_state, perm=(0, 3, 1, 2))
+        pooled_output = self.layernorm(pooled_output)
+
+        # Change the other hidden state outputs to NCHW as well
+        if output_hidden_states:
+            hidden_states = tuple([tf.transpose(h, perm=(0, 3, 1, 2)) for h in encoder_outputs[1]])
+
+        if not return_dict:
+            hidden_states = hidden_states if output_hidden_states else ()
+            return (last_hidden_state, pooled_output) + hidden_states
+
+        return TFBaseModelOutputWithPoolingAndNoAttention(
+            last_hidden_state=last_hidden_state,
+            pooler_output=pooled_output,
+            hidden_states=hidden_states if output_hidden_states else encoder_outputs.hidden_states,
+        )
+
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "embeddings", None) is not None:
+            with tf.name_scope(self.embeddings.name):
+                self.embeddings.build(None)
+        if getattr(self, "encoder", None) is not None:
+            with tf.name_scope(self.encoder.name):
+                self.encoder.build(None)
+        if getattr(self, "layernorm", None) is not None:
+            with tf.name_scope(self.layernorm.name):
+                self.layernorm.build([None, self.config.hidden_sizes[-1]])
+
+
+class TFConvNextV2PreTrainedModel(TFPreTrainedModel):
+    """
+    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
+    models.
+    """
+
+    config_class = ConvNextV2Config
+    base_model_prefix = "convnextv2"
+    main_input_name = "pixel_values"
+
+
+CONVNEXTV2_START_DOCSTRING = r"""
+    This model inherits from [`TFPreTrainedModel`]. Check the superclass documentation for the generic methods the
+    library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
+    etc.)
+
+    This model is also a [keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
+    as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and
+    behavior.
+
+    <Tip>
+
+    TensorFlow models and layers in `transformers` accept two formats as input:
+
+    - having all inputs as keyword arguments (like PyTorch models), or
+    - having all inputs as a list, tuple or dict in the first positional argument.
+
+    The reason the second format is supported is that Keras methods prefer this format when passing inputs to models
+    and layers. Because of this support, when using methods like `model.fit()` things should "just work" for you - just
+    pass your inputs and labels in any format that `model.fit()` supports! If, however, you want to use the second
+    format outside of Keras methods like `fit()` and `predict()`, such as when creating your own layers or models with
+    the Keras `Functional` API, there are three possibilities you can use to gather all the input Tensors in the first
+    positional argument:
+
+    - a single Tensor with `pixel_values` only and nothing else: `model(pixel_values)`
+    - a list of varying length with one or several input Tensors IN THE ORDER given in the docstring:
+    `model([pixel_values, attention_mask])` or `model([pixel_values, attention_mask, token_type_ids])`
+    - a dictionary with one or several input Tensors associated to the input names given in the docstring:
+    `model({"pixel_values": pixel_values, "token_type_ids": token_type_ids})`
+
+    Note that when creating models and layers with
+    [subclassing](https://keras.io/guides/making_new_layers_and_models_via_subclassing/) then you don't need to worry
+    about any of this, as you can just pass inputs like you would to any other Python function!
+
+    </Tip>
+
+    Parameters:
+        config ([`ConvNextV2Config`]): Model configuration class with all the parameters of the model.
+            Initializing with a config file does not load the weights associated with the model, only the
+            configuration. Check out the [`~TFPreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+CONVNEXTV2_INPUTS_DOCSTRING = r"""
+    Args:
+        pixel_values (`np.ndarray`, `tf.Tensor`, `List[tf.Tensor]`, `Dict[str, tf.Tensor]` or `Dict[str, np.ndarray]` and each example must have the shape `(batch_size, num_channels, height, width)`):
+            Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See
+            [`ConvNextImageProcessor.__call__`] for details.
+
+        output_hidden_states (`bool`, *optional*):
+            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+            more detail. This argument can be used only in eager mode, in graph mode the value in the config will be
+            used instead.
+        return_dict (`bool`, *optional*):
+            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. This argument can be used in
+            eager mode, in graph mode the value will always be set to `True`.
+"""
+
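+# Input-format sketch (comment only), matching the formats described in the docstring above; `inputs`
+# here is assumed to be the output of an image processor called with return_tensors="tf":
+#   model = TFConvNextV2Model.from_pretrained("facebook/convnextv2-tiny-1k-224")
+#   outputs = model(pixel_values=inputs["pixel_values"])        # keyword argument
+#   outputs = model({"pixel_values": inputs["pixel_values"]})   # dict in the first positional argument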
+
+@add_start_docstrings(
+    "The bare ConvNextV2 model outputting raw features without any specific head on top.",
+    CONVNEXTV2_START_DOCSTRING,
+)
+class TFConvNextV2Model(TFConvNextV2PreTrainedModel):
+    def __init__(self, config: ConvNextV2Config, *inputs, **kwargs):
+        super().__init__(config, *inputs, **kwargs)
+        self.convnextv2 = TFConvNextV2MainLayer(config, name="convnextv2")
+
+    @unpack_inputs
+    @add_start_docstrings_to_model_forward(CONVNEXTV2_INPUTS_DOCSTRING)
+    @add_code_sample_docstrings(
+        checkpoint=_CHECKPOINT_FOR_DOC,
+        output_type=TFBaseModelOutputWithPoolingAndNoAttention,
+        config_class=_CONFIG_FOR_DOC,
+        modality="vision",
+        expected_output=_EXPECTED_OUTPUT_SHAPE,
+    )
+    def call(
+        self,
+        pixel_values: TFModelInputType | None = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        training: bool = False,
+    ) -> Union[TFBaseModelOutputWithPoolingAndNoAttention, Tuple[tf.Tensor]]:
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        if pixel_values is None:
+            raise ValueError("You have to specify pixel_values")
+
+        outputs = self.convnextv2(
+            pixel_values=pixel_values,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            training=training,
+        )
+
+        if not return_dict:
+            return outputs[:]
+
+        return TFBaseModelOutputWithPoolingAndNoAttention(
+            last_hidden_state=outputs.last_hidden_state,
+            pooler_output=outputs.pooler_output,
+            hidden_states=outputs.hidden_states,
+        )
+
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "convnextv2", None) is not None:
+            with tf.name_scope(self.convnextv2.name):
+                self.convnextv2.build(None)
+
+
+@add_start_docstrings(
+    """
+    ConvNextV2 Model with an image classification head on top (a linear layer on top of the pooled features), e.g. for
+    ImageNet.
+    """,
+    CONVNEXTV2_START_DOCSTRING,
+)
+class TFConvNextV2ForImageClassification(TFConvNextV2PreTrainedModel, TFSequenceClassificationLoss):
+    def __init__(self, config: ConvNextV2Config, *inputs, **kwargs):
+        super().__init__(config, *inputs, **kwargs)
+
+        self.num_labels = config.num_labels
+        self.convnextv2 = TFConvNextV2MainLayer(config, name="convnextv2")
+
+        # Classifier head
+        self.classifier = keras.layers.Dense(
+            units=config.num_labels,
+            kernel_initializer=get_initializer(config.initializer_range),
+            bias_initializer=keras.initializers.Zeros(),
+            name="classifier",
+        )
+
+    @unpack_inputs
+    @add_start_docstrings_to_model_forward(CONVNEXTV2_INPUTS_DOCSTRING)
+    @add_code_sample_docstrings(
+        checkpoint=_IMAGE_CLASS_CHECKPOINT,
+        output_type=TFImageClassifierOutputWithNoAttention,
+        config_class=_CONFIG_FOR_DOC,
+        expected_output=_IMAGE_CLASS_EXPECTED_OUTPUT,
+    )
+    def call(
+        self,
+        pixel_values: TFModelInputType | None = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        labels: np.ndarray | tf.Tensor | None = None,
+        training: Optional[bool] = False,
+    ) -> Union[TFImageClassifierOutputWithNoAttention, Tuple[tf.Tensor]]:
+        r"""
+        labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*):
+            Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
+            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
+            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
+        """
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        if pixel_values is None:
+            raise ValueError("You have to specify pixel_values")
+
+        outputs = self.convnextv2(
+            pixel_values,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            training=training,
+        )
+
+        pooled_output = outputs.pooler_output if return_dict else outputs[1]
+
+        logits = self.classifier(pooled_output)
+        loss = None if labels is None else self.hf_compute_loss(labels=labels, logits=logits)
+
+        if not return_dict:
+            output = (logits,) + outputs[2:]
+            return ((loss,) + output) if loss is not None else output
+
+        return TFImageClassifierOutputWithNoAttention(
+            loss=loss,
+            logits=logits,
+            hidden_states=outputs.hidden_states,
+        )
+
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "convnextv2", None) is not None:
+            with tf.name_scope(self.convnextv2.name):
+                self.convnextv2.build(None)
+        if getattr(self, "classifier", None) is not None:
+            with tf.name_scope(self.classifier.name):
+                self.classifier.build([None, None, self.config.hidden_sizes[-1]])
diff --git a/src/transformers/models/cpm/tokenization_cpm.py b/src/transformers/models/cpm/tokenization_cpm.py
index f509519271d4e8..67281b3cf185f8 100644
--- a/src/transformers/models/cpm/tokenization_cpm.py
+++ b/src/transformers/models/cpm/tokenization_cpm.py
@@ -38,6 +38,9 @@
 class CpmTokenizer(PreTrainedTokenizer):
     """Runs pre-tokenization with Jieba segmentation tool. It is used in CPM models."""
 
+    vocab_files_names = VOCAB_FILES_NAMES
+    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+
     def __init__(
         self,
         vocab_file,
@@ -121,24 +124,6 @@ def __init__(
 
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
 
-        super().__init__(
-            do_lower_case=do_lower_case,
-            remove_space=remove_space,
-            keep_accents=keep_accents,
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            additional_special_tokens=additional_special_tokens,
-            sp_model_kwargs=self.sp_model_kwargs,
-            **kwargs,
-        )
-
-        self._pad_token_type_id = 3
-
         self.do_lower_case = do_lower_case
         self.remove_space = remove_space
         self.keep_accents = keep_accents
@@ -157,6 +142,24 @@ def __init__(
         self.jieba = jieba
         self.translator = str.maketrans(" \n", "\u2582\u2583")
 
+        super().__init__(
+            do_lower_case=do_lower_case,
+            remove_space=remove_space,
+            keep_accents=keep_accents,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            mask_token=mask_token,
+            additional_special_tokens=additional_special_tokens,
+            sp_model_kwargs=self.sp_model_kwargs,
+            **kwargs,
+        )
+
+        self._pad_token_type_id = 3
+
     @property
     # Copied from transformers.models.xlnet.tokenization_xlnet.XLNetTokenizer.vocab_size
     def vocab_size(self):
diff --git a/src/transformers/models/cpm/tokenization_cpm_fast.py b/src/transformers/models/cpm/tokenization_cpm_fast.py
index 31a2aaa9f1d848..8e8f927e813b64 100644
--- a/src/transformers/models/cpm/tokenization_cpm_fast.py
+++ b/src/transformers/models/cpm/tokenization_cpm_fast.py
@@ -141,7 +141,6 @@ def __init__(
         self.remove_space = remove_space
         self.keep_accents = keep_accents
         self.vocab_file = vocab_file
-        self.can_save_slow_tokenizer = False if not self.vocab_file else True
 
         try:
             import jieba
@@ -153,6 +152,10 @@ def __init__(
         self.jieba = jieba
         self.translator = str.maketrans(" \n", "\u2582\u2583")
 
+    @property
+    def can_save_slow_tokenizer(self) -> bool:
+        return os.path.isfile(self.vocab_file) if self.vocab_file else False
+
     # Copied from transformers.models.xlnet.tokenization_xlnet_fast.XLNetTokenizerFast.build_inputs_with_special_tokens
     def build_inputs_with_special_tokens(
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
diff --git a/src/transformers/models/cpmant/__init__.py b/src/transformers/models/cpmant/__init__.py
new file mode 100644
index 00000000000000..8140009b60f156
--- /dev/null
+++ b/src/transformers/models/cpmant/__init__.py
@@ -0,0 +1,64 @@
+# flake8: noqa
+# There's no way to ignore "F401 '...' imported but unused" warnings in this
+# module, but to preserve other warnings. So, don't check this module at all.
+
+# Copyright 2022 The HuggingFace Team and The OpenBMB Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import TYPE_CHECKING
+
+# rely on isort to merge the imports
+from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_tokenizers_available, is_torch_available
+
+
+_import_structure = {
+    "configuration_cpmant": ["CPMANT_PRETRAINED_CONFIG_ARCHIVE_MAP", "CpmAntConfig"],
+    "tokenization_cpmant": ["CpmAntTokenizer"],
+}
+
+try:
+    if not is_torch_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["modeling_cpmant"] = [
+        "CPMANT_PRETRAINED_MODEL_ARCHIVE_LIST",
+        "CpmAntForCausalLM",
+        "CpmAntModel",
+        "CpmAntPreTrainedModel",
+    ]
+
+
+if TYPE_CHECKING:
+    from .configuration_cpmant import CPMANT_PRETRAINED_CONFIG_ARCHIVE_MAP, CpmAntConfig
+    from .tokenization_cpmant import CpmAntTokenizer
+
+    try:
+        if not is_torch_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .modeling_cpmant import (
+            CPMANT_PRETRAINED_MODEL_ARCHIVE_LIST,
+            CpmAntForCausalLM,
+            CpmAntModel,
+            CpmAntPreTrainedModel,
+        )
+
+
+else:
+    import sys
+
+    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
diff --git a/src/transformers/models/cpmant/configuration_cpmant.py b/src/transformers/models/cpmant/configuration_cpmant.py
new file mode 100644
index 00000000000000..0ad5208566b337
--- /dev/null
+++ b/src/transformers/models/cpmant/configuration_cpmant.py
@@ -0,0 +1,124 @@
+# coding=utf-8
+# Copyright 2022 The OpenBMB Team and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" CPMAnt model configuration"""
+
+from ...configuration_utils import PretrainedConfig
+from ...utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+CPMANT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    "openbmb/cpm-ant-10b": "https://huggingface.co/openbmb/cpm-ant-10b/blob/main/config.json"
+    # See all CPMAnt models at https://huggingface.co/models?filter=cpmant
+}
+
+
+class CpmAntConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`CpmAntModel`]. It is used to instantiate an
+    CPMAnt model according to the specified arguments, defining the model architecture. Instantiating a configuration
+    with the defaults will yield a similar configuration to that of the CPMAnt
+    [openbmb/cpm-ant-10b](https://huggingface.co/openbmb/cpm-ant-10b) architecture.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+    Args:
+        vocab_size (`int`, *optional*, defaults to 30720):
+            Vocabulary size of the CPMAnt model. Defines the number of different tokens that can be represented by the
+            `input` passed when calling [`CpmAntModel`].
+        hidden_size (`int`, *optional*, defaults to 4096):
+            Dimension of the encoder layers.
+        num_attention_heads (`int`, *optional*, defaults to 32):
+            Number of attention heads in the Transformer encoder.
+        dim_head (`int`, *optional*, defaults to 128):
+            Dimension of attention heads for each attention layer in the Transformer encoder.
+        dim_ff (`int`, *optional*, defaults to 10240):
+            Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
+        num_hidden_layers (`int`, *optional*, defaults to 48):
+            Number of layers of the Transformer encoder.
+        dropout_p (`float`, *optional*, defaults to 0.0):
+            The dropout probability for all fully connected layers in the embeddings and the encoder.
+        position_bias_num_buckets (`int`, *optional*, defaults to 512):
+            The number of position_bias buckets.
+        position_bias_max_distance (`int`, *optional*, defaults to 2048):
+            The maximum sequence length that this model might ever be used with. Typically set this to something large
+            just in case (e.g., 512 or 1024 or 2048).
+        eps (`float`, *optional*, defaults to 1e-06):
+            The epsilon used by the layer normalization layers.
+        init_std (`float`, *optional*, defaults to 1.0):
+            Initialize parameters with std = init_std.
+        prompt_types (`int`, *optional*, defaults to 32):
+            The number of prompt types.
+        prompt_length (`int`, *optional*, defaults to 32):
+            The length of the prompt.
+        segment_types (`int`, *optional*, defaults to 32):
+            The number of segment types.
+        use_cache (`bool`, *optional*, defaults to `True`):
+            Whether or not the model should return the last key/value states to speed up decoding.
+
+    Example:
+
+    ```python
+    >>> from transformers import CpmAntModel, CpmAntConfig
+
+    >>> # Initializing a CPMAnt cpm-ant-10b style configuration
+    >>> configuration = CpmAntConfig()
+
+    >>> # Initializing a model from the cpm-ant-10b style configuration
+    >>> model = CpmAntModel(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "cpmant"
+
+    def __init__(
+        self,
+        vocab_size: int = 30720,
+        hidden_size: int = 4096,
+        num_attention_heads: int = 32,
+        dim_head: int = 128,
+        dim_ff: int = 10240,
+        num_hidden_layers: int = 48,
+        dropout_p: float = 0.0,
+        position_bias_num_buckets: int = 512,
+        position_bias_max_distance: int = 2048,
+        eps: float = 1e-6,
+        init_std: float = 1.0,
+        prompt_types: int = 32,
+        prompt_length: int = 32,
+        segment_types: int = 32,
+        use_cache: bool = True,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+        self.prompt_types = prompt_types
+        self.prompt_length = prompt_length
+        self.segment_types = segment_types
+        self.hidden_size = hidden_size
+        self.num_attention_heads = num_attention_heads
+        self.dim_head = dim_head
+        self.dim_ff = dim_ff
+        self.num_hidden_layers = num_hidden_layers
+        self.position_bias_num_buckets = position_bias_num_buckets
+        self.position_bias_max_distance = position_bias_max_distance
+        self.dropout_p = dropout_p
+        self.eps = eps
+        self.use_cache = use_cache
+        self.vocab_size = vocab_size
+        self.init_std = init_std
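+
+
+# A scaled-down configuration such as the following (illustrative sizes, not a released
+# checkpoint) can be convenient for quick smoke tests:
+#     config = CpmAntConfig(vocab_size=1000, hidden_size=128, num_attention_heads=4,
+#                           dim_head=32, dim_ff=256, num_hidden_layers=2)
+#     model = CpmAntModel(config)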
diff --git a/src/transformers/models/cpmant/modeling_cpmant.py b/src/transformers/models/cpmant/modeling_cpmant.py
new file mode 100755
index 00000000000000..405d892c70ed70
--- /dev/null
+++ b/src/transformers/models/cpmant/modeling_cpmant.py
@@ -0,0 +1,874 @@
+# coding=utf-8
+# Copyright 2022 The OpenBMB Team and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" PyTorch CPMAnt"""
+
+
+import math
+from typing import List, Optional, Tuple, Union
+
+import torch
+import torch.nn.functional as F
+import torch.utils.checkpoint
+from torch import nn
+from torch.nn import CrossEntropyLoss
+
+from ...activations import ACT2FN
+from ...modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
+from ...modeling_utils import PreTrainedModel
+from ...utils import add_code_sample_docstrings, add_start_docstrings, add_start_docstrings_to_model_forward, logging
+from .configuration_cpmant import CpmAntConfig
+
+
+logger = logging.get_logger(__name__)
+
+_CHECKPOINT_FOR_DOC = "openbmb/cpm-ant-10b"
+_CONFIG_FOR_DOC = "CpmAntConfig"
+
+CPMANT_PRETRAINED_MODEL_ARCHIVE_LIST = [
+    "openbmb/cpm-ant-10b",
+    # See all CPMAnt models at https://huggingface.co/models?filter=cpmant
+]
+
+
+class CpmAntLayerNorm(nn.Module):
+    """
+    We use Root Mean Square (RMS) Layer Normalization; see https://arxiv.org/abs/1910.07467 for details.
+    """
+
+    def __init__(self, config: CpmAntConfig):
+        super().__init__()
+
+        self.eps = config.eps
+        self.dim_norm = config.hidden_size
+        self.weight = nn.Parameter(torch.empty(config.hidden_size))
+
+    def forward(self, hidden_states: torch.Tensor):
+        """
+        Args:
+            hidden_states (`torch.Tensor` of shape `(batch, seq_len, dim_in)`)
+        """
+        if hidden_states.size(-1) != self.dim_norm:
+            raise AssertionError("hidden_states.size(-1) != self.dim_norm")
+        old_dtype = hidden_states.dtype
+        variance = hidden_states.to(torch.float32).pow(2).mean(dim=-1, keepdim=True)
+        hidden_states = (hidden_states * torch.rsqrt(variance + self.eps)).to(old_dtype) * self.weight
+        return hidden_states
+
+
+class CpmAntAttention(nn.Module):
+    def __init__(self, config: CpmAntConfig):
+        super().__init__()
+        self.dim_model = config.hidden_size
+        self.num_heads = config.num_attention_heads
+        self.dim_head = config.dim_head
+
+        self.project_q = nn.Linear(self.dim_model, self.num_heads * self.dim_head, bias=False)
+        self.project_k = nn.Linear(self.dim_model, self.num_heads * self.dim_head, bias=False)
+        self.project_v = nn.Linear(self.dim_model, self.num_heads * self.dim_head, bias=False)
+
+        self.attention_out = nn.Linear(self.num_heads * self.dim_head, self.dim_model, bias=False)
+
+        self.softmax = torch.nn.Softmax(dim=-1)
+
+        if config.dropout_p is not None:
+            self.dropout = torch.nn.Dropout(p=config.dropout_p)
+        else:
+            self.dropout = None
+
+    def forward(
+        self,
+        hidden_q: torch.Tensor,
+        hidden_kv: torch.Tensor,
+        attention_mask: torch.BoolTensor,
+        position_bias: torch.Tensor,
+        output_attentions: Optional[bool] = False,
+        past_key_values: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
+        use_cache: Optional[bool] = None,
+    ):
+        """
+        Args:
+            hidden_q (`torch.Tensor`):
+                Input of the transformer block (self-attention block). It can be the raw embedding of a batch of sequences.
+            hidden_kv (`torch.Tensor` of shape `(batch, len_k, dim_model)`):
+                Tensor from which the *key* and *value* projections are computed.
+            attention_mask (`torch.Tensor` of shape `(batch, len_seq, len_seq)`):
+                Masks out invalid positions so they do not take part in the self-attention computation.
+            position_bias (`torch.Tensor` of shape `(batch, len_seq, len_seq)`):
+                Provides positional information to the self-attention block.
+            output_attentions (`bool`, *optional*):
+                Whether or not to return the attentions tensors of all attention layers.
+            past_key_values (`Tuple[torch.Tensor, torch.Tensor]`, *optional*):
+                Cached past key and value projection states.
+            use_cache (`bool`, *optional*):
+                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
+                (see `past_key_values`).
+        """
+        batch_size = hidden_q.size(0)
+        len_q = hidden_q.size(1)
+        len_k = hidden_kv.size(1)
+
+        query = self.project_q(hidden_q)
+        key = self.project_k(hidden_kv)
+        value = self.project_v(hidden_kv)
+
+        query = query.view(batch_size, len_q, self.num_heads, self.dim_head).permute(0, 2, 1, 3)
+        key = key.view(batch_size, len_k, self.num_heads, self.dim_head).permute(0, 2, 1, 3)
+        value = value.view(batch_size, len_k, self.num_heads, self.dim_head).permute(0, 2, 1, 3)
+
+        if past_key_values is not None:
+            key = torch.cat([past_key_values[0], key], dim=-2)
+            value = torch.cat([past_key_values[1], value], dim=-2)
+            len_k = key.size(-2)
+
+        # (batch_size, num_heads, len_q, dim_head) @ (batch_size, num_heads, dim_head, len_k) -> (batch_size, num_heads, len_q, len_k)
+        score = torch.matmul(query, key.transpose(-1, -2)) / math.sqrt(self.dim_head)
+        score = score + position_bias
+
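+        # fill masked positions with -inf so they receive zero weight from the softmax below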
+        score = torch.masked_fill(
+            score,
+            attention_mask.view(batch_size, 1, len_q, len_k) == torch.tensor(False),
+            torch.scalar_tensor(float("-inf"), device=score.device, dtype=score.dtype),
+        )
+        score = self.softmax(score)
+
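+        # zero out masked positions again: rows that are masked everywhere turn into NaNs
+        # after the softmax, and this keeps them from contaminating the weighted sum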
+        score = torch.masked_fill(
+            score,
+            attention_mask.view(batch_size, 1, len_q, len_k) == torch.tensor(False),
+            torch.scalar_tensor(0, device=score.device, dtype=score.dtype),
+        )
+        if output_attentions:
+            attn_weights = score
+        else:
+            attn_weights = None
+
+        if self.dropout is not None:
+            score = self.dropout(score)
+
+        # (batch_size, num_heads, len_q, len_k) @ (batch_size, num_heads, len_k, dim_head) -> (batch_size, num_heads, len_q, dim_head)
+        score = torch.matmul(score, value)
+
+        score = score.view(batch_size, self.num_heads, len_q, self.dim_head).permute(0, 2, 1, 3)
+        score = score.contiguous().view(batch_size, len_q, self.num_heads * self.dim_head)
+
+        score = self.attention_out(score)
+
+        past_key_values = None
+        if use_cache:
+            past_key_values = (key, value)
+
+        return score, attn_weights, past_key_values
+
+
+class CpmAntSelfAttentionBlock(nn.Module):
+    def __init__(self, config: CpmAntConfig):
+        super().__init__()
+        self.layernorm_before_attention = CpmAntLayerNorm(config)
+        self.self_attention = CpmAntAttention(config)
+        if config.dropout_p:
+            self.dropout = torch.nn.Dropout(config.dropout_p)
+        else:
+            self.dropout = None
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: torch.Tensor,
+        position_bias: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = False,
+        past_key_values: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
+        use_cache: Optional[bool] = None,
+    ):
+        """
+        Args:
+            hidden_states (`torch.Tensor` of shape `(batch, len_seq, dim_model)`):
+                Input of the transformer block (self-attention block). It can be the raw embedding of a batch of sequences.
+            attention_mask (`torch.Tensor` of shape `(batch, len_seq, len_seq)`):
+                Masks out invalid positions so they do not take part in the self-attention computation.
+            position_bias (`torch.Tensor` of shape `(batch, len_seq, len_seq)`):
+                Provides positional information to the self-attention block.
+            output_attentions (`bool`, *optional*):
+                Whether or not to return the attentions tensors of all attention layers.
+            past_key_values (`Tuple(torch.FloatTensor)`, *optional*):
+                Cached past key and value projection states.
+            use_cache (`bool`, *optional*):
+                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
+                (see `past_key_values`).
+        """
+        outputs = self.layernorm_before_attention(hidden_states)
+        outputs = self.self_attention(
+            outputs, outputs, attention_mask, position_bias, output_attentions, past_key_values, use_cache
+        )
+
+        outputs, attn_weights, current_key_value = outputs
+
+        if self.dropout is not None:
+            outputs = self.dropout(outputs)
+        hidden_states = hidden_states + outputs
+
+        return hidden_states, attn_weights, current_key_value
+
+
+class CpmAntDenseGatedACT(nn.Module):
+    def __init__(self, config: CpmAntConfig):
+        super().__init__()
+        self.w_0 = nn.Linear(config.hidden_size, config.dim_ff, bias=False)
+        self.w_1 = nn.Linear(config.hidden_size, config.dim_ff, bias=False)
+        self.act = torch.nn.GELU()
+
+    def forward(self, hidden_states: torch.Tensor):
+        """Transform an input tensor from one feature space to another via a nonlinear operation
+
+        Args:
+            hidden_states (`torch.Tensor` of shape `(batch, seq_len, dim_in)`)
+        """
+        gate_score = self.act(self.w_0(hidden_states))
+        hidden_states = self.w_1(hidden_states)
+
+        hidden_states = gate_score * hidden_states
+        return hidden_states
+
+
+class CpmAntFeedForward(nn.Module):
+    def __init__(self, config: CpmAntConfig):
+        super().__init__()
+        self.w_in = CpmAntDenseGatedACT(config)
+        if config.dropout_p is not None:
+            self.dropout = torch.nn.Dropout(config.dropout_p)
+        else:
+            self.dropout = None
+
+        self.w_out = nn.Linear(config.dim_ff, config.hidden_size, bias=False)
+
+    def forward(self, hidden_states: torch.Tensor):
+        """
+        Args:
+            hidden_states (`torch.Tensor` of shape `(batch, seq_len, dim_in)`)
+        """
+        hidden_states = self.w_in(hidden_states)
+
+        if self.dropout is not None:
+            hidden_states = self.dropout(hidden_states)
+
+        hidden_states = self.w_out(hidden_states)
+
+        return hidden_states
+
+
+class CpmAntFFNBlock(nn.Module):
+    def __init__(self, config: CpmAntConfig):
+        super().__init__()
+        self.layernorm_before_ffn = CpmAntLayerNorm(config)
+        self.ffn = CpmAntFeedForward(config)
+        if config.dropout_p:
+            self.dropout = torch.nn.Dropout(config.dropout_p)
+        else:
+            self.dropout = None
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+    ):
+        """
+        Args:
+            hidden_states (`torch.Tensor` of shape `(batch, len_seq, dim_model)`):
+                Hidden states before feed forward layer.
+        """
+        ln_outputs = self.layernorm_before_ffn(hidden_states)
+        outputs = self.ffn(ln_outputs)
+        if self.dropout is not None:
+            outputs = self.dropout(outputs)
+        hidden_states = hidden_states + outputs
+        return hidden_states
+
+
+class CpmAntTransformerBlock(nn.Module):
+    def __init__(self, config: CpmAntConfig):
+        super().__init__()
+        self.self_att = CpmAntSelfAttentionBlock(config)
+        self.ffn = CpmAntFFNBlock(config)
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: torch.Tensor,
+        position_bias: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = False,
+        past_key_values: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
+        use_cache: Optional[bool] = None,
+    ):
+        """
+        Args:
+            hidden_states (`torch.Tensor` of shape `(batch, seq_len, dim_model)`):
+                Input to the layer.
+            attention_mask (`torch.Tensor` of shape `(batch, seq_len, seq_len)`):
+                Masks out invalid positions so they do not take part in the attention computation.
+            position_bias (`torch.Tensor` of shape `(num_heads, seq_len, seq_len)`):
+                Provides positional information to the attention mechanism.
+            output_attentions (`bool`, *optional*):
+                Whether or not to return the attentions tensors of all attention layers.
+            past_key_values (`Tuple[torch.Tensor, torch.Tensor]`, *optional*):
+                Cached past key and value projection states.
+            use_cache (`bool`, *optional*):
+                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
+                (see `past_key_values`).
+        """
+        hidden_states = self.self_att(
+            hidden_states,
+            attention_mask=attention_mask,
+            position_bias=position_bias,
+            output_attentions=output_attentions,
+            past_key_values=past_key_values,
+            use_cache=use_cache,
+        )
+
+        hidden_states, attn_weights, current_key_value = hidden_states
+
+        hidden_states = self.ffn(hidden_states)
+
+        return hidden_states, attn_weights, current_key_value
+
+
+class CpmAntEncoder(nn.Module):
+    def __init__(self, config: CpmAntConfig):
+        super().__init__()
+        self.num_layers = config.num_hidden_layers
+        self.layers = nn.ModuleList([CpmAntTransformerBlock(config) for ith in range(self.num_layers)])
+
+        self.output_layernorm = CpmAntLayerNorm(config)
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: torch.Tensor,
+        position_bias: torch.Tensor,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        past_key_values: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
+        use_cache: Optional[bool] = None,
+    ):
+        """
+        Args:
+            hidden_states (`torch.Tensor` of shape `(batch, seq_len, dim_model)`):
+                Input to the layer.
+            attention_mask (`torch.Tensor` of shape `(batch, seq_len, seq_len)`):
+                Masks out invalid positions so they do not take part in the attention computation.
+            position_bias (`torch.Tensor` of shape `(num_heads, seq_len, seq_len)`):
+                Provides positional information to the attention mechanism.
+            output_attentions (`bool`, *optional*):
+                Whether or not to return the attentions tensors of all attention layers.
+            output_hidden_states (`bool`, *optional*):
+                Whether or not to return the hidden states of all layers.
+            past_key_values (`Tuple[torch.Tensor, torch.Tensor]`, *optional*):
+                Cached past key and value projection states.
+            use_cache (`bool`, *optional*):
+                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
+                (see `past_key_values`).
+        """
+        all_hidden_states = () if output_hidden_states else None
+        all_self_attns = () if output_attentions else None
+        current_key_values = () if use_cache else None
+
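+        # run every transformer block, threading the per-layer cache through and collecting
+        # optional hidden states / attention weights along the way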
+        for i, layer in enumerate(self.layers):
+            if output_hidden_states:
+                all_hidden_states += (hidden_states,)
+            layer_outputs = layer(
+                hidden_states,
+                attention_mask,
+                position_bias,
+                output_attentions=output_attentions,
+                past_key_values=past_key_values[i] if past_key_values else None,
+                use_cache=use_cache,
+            )
+            hidden_states, attn_weights, current_key_value = layer_outputs
+            if output_attentions:
+                all_self_attns += (attn_weights,)
+            if current_key_value is not None:
+                current_key_values = current_key_values + (current_key_value,)
+
+        hidden_states = self.output_layernorm(hidden_states)
+
+        if output_hidden_states:
+            all_hidden_states += (hidden_states,)
+
+        return hidden_states, current_key_values, all_hidden_states, all_self_attns
+
+
+# Copied from transformers.models.bert.modeling_bert.BertIntermediate with Bert->CPMAnt
+class CpmAntIntermediate(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
+        if isinstance(config.hidden_act, str):
+            self.intermediate_act_fn = ACT2FN[config.hidden_act]
+        else:
+            self.intermediate_act_fn = config.hidden_act
+
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        hidden_states = self.dense(hidden_states)
+        hidden_states = self.intermediate_act_fn(hidden_states)
+        return hidden_states
+
+
+class CpmAntSegmentPositionEmbedding(nn.Module):
+    def __init__(self, config: CpmAntConfig):
+        super().__init__()
+
+        self.num_heads = config.num_attention_heads
+        self.num_buckets = config.position_bias_num_buckets
+        self.max_distance = config.position_bias_max_distance
+        self.num_segments = config.segment_types
+
+        self.relative_attention_bias = nn.Parameter(
+            torch.empty(
+                config.segment_types * config.segment_types + config.position_bias_num_buckets,
+                config.num_attention_heads,
+            )
+        )
+
+    def forward(
+        self,
+        key_pos: torch.Tensor,
+        query_pos: torch.Tensor,
+        key_segment: torch.Tensor,
+        query_segment: torch.Tensor,
+    ):
+        with torch.no_grad():
+            batch = key_pos.size(0)
+            keylen = key_pos.size(1)
+            querylen = query_pos.size(1)
+
+            if key_pos.size(0) != query_pos.size(0):
+                raise AssertionError(
+                    f"key_pos.size(0) should be equal to query_pos.size(0), but got {key_pos.size(0)} and {query_pos.size(0)}!"
+                )
+            if keylen != key_segment.size(1):
+                raise AssertionError(
+                    f"keylen should be equal to key_segment.size(1), but got {keylen} and {key_segment.size(1)}!"
+                )
+            if querylen != query_segment.size(1):
+                raise AssertionError(
+                    f"querylen should be equal to query_segment.size(1), but got {querylen} and {query_segment.szie(1)}!"
+                )
+
+            key_pos = key_pos.view(batch, -1, keylen)
+            query_pos = query_pos.view(batch, querylen, -1)
+            key_segment = key_segment.view(batch, -1, keylen)
+            query_segment = query_segment.view(batch, querylen, -1)
+
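+            # same-segment token pairs use position-based buckets (computed below); pairs from
+            # different segments use a (query_segment, key_segment) bucket shifted by num_buckets
+            # so the two bucket ranges never overlap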
+            relative_position_bucket = self._segment_relative_position_bucket(query_segment, key_segment)
+            relative_position_bucket = relative_position_bucket + self.num_buckets
+
+            # (batch, len_q, len_k)
+            absolute_position_bucket = self._position_bucket(
+                torch.arange(keylen, dtype=torch.int32, device=relative_position_bucket.device)[None, :]
+                - torch.arange(querylen, dtype=torch.int32, device=relative_position_bucket.device)[:, None],
+                num_buckets=self.num_buckets,
+                max_distance=self.max_distance,
+            )
+            relative_position_bucket = torch.where(
+                (key_segment == query_segment),
+                absolute_position_bucket[None, :, :],
+                relative_position_bucket,
+            )
+
+        # (batch, len_q, len_k, num_heads)
+        embeds = F.embedding(relative_position_bucket, self.relative_attention_bias)
+        # (batch, num_heads, len_q, len_k)
+        embeds = embeds.permute(0, 3, 1, 2).contiguous()
+        return embeds
+
+    def _segment_relative_position_bucket(self, query_segment, key_segment):
+        return query_segment * self.num_segments + key_segment
+
+    def _position_bucket(self, relative_position, num_buckets=32, max_distance=128):
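+        # T5-style bucketing: half of the buckets encode direction, small offsets get an exact
+        # bucket each, and larger offsets are binned logarithmically up to max_distance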
+        relative_buckets = 0
+        # always bidirectional in CPMAnt
+        num_buckets //= 2
+        relative_buckets = (relative_position > 0).to(torch.int32) * num_buckets
+        relative_position = torch.abs(relative_position)
+        max_exact = num_buckets // 2
+        is_small = relative_position < max_exact
+        relative_postion_if_large = max_exact + (
+            torch.log(relative_position.float() / max_exact)
+            / math.log(max_distance / max_exact)
+            * (num_buckets - max_exact)
+        ).to(torch.int32)
+        relative_postion_if_large = torch.min(
+            relative_postion_if_large,
+            torch.full_like(relative_postion_if_large, num_buckets - 1),
+        )
+        relative_buckets += torch.where(is_small, relative_position.to(torch.int32), relative_postion_if_large)
+        return relative_buckets
+
+
+# Copied from transformers.models.bert.modeling_bert.BertOutput with Bert->CPMAnt
+class CpmAntOutput(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
+        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        self.dropout = nn.Dropout(config.hidden_dropout_prob)
+
+    def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:
+        hidden_states = self.dense(hidden_states)
+        hidden_states = self.dropout(hidden_states)
+        hidden_states = self.LayerNorm(hidden_states + input_tensor)
+        return hidden_states
+
+
+class CpmAntPreTrainedModel(PreTrainedModel):
+    """
+    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
+    models.
+    """
+
+    config_class = CpmAntConfig
+    base_model_prefix = "cpmant"
+
+    def _init_weights(self, module):
+        """Initialize the weights"""
+        if isinstance(module, nn.Linear):
+            module.weight.data.normal_(mean=0.0, std=self.config.init_std)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.Embedding):
+            module.weight.data.normal_(mean=0.0, std=self.config.init_std)
+            if module.padding_idx is not None:
+                module.weight.data[module.padding_idx].zero_()
+        elif isinstance(module, nn.LayerNorm):
+            module.bias.data.zero_()
+            module.weight.data.fill_(1.0)
+        elif isinstance(module, CpmAntLayerNorm):
+            module.weight.data.fill_(1.0)
+        elif isinstance(module, CpmAntSegmentPositionEmbedding):
+            module.relative_attention_bias.data.normal_(mean=0.0, std=self.config.init_std)
+
+
+CPMANT_START_DOCSTRING = r"""
+    This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) sub-class. Use
+    it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and
+    behavior.
+
+    Parameters:
+        config ([`~CpmAntConfig`]): Model configuration class with all the parameters of the model.
+            Initializing with a config file does not load the weights associated with the model, only the
+            configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+CPMANT_INPUTS_DOCSTRING = r"""
+    Args:
+        input_ids (`torch.Tensor` of shape `(batch_size, seq_len)`):
+            Indices of input sequence tokens in the vocabulary.
+
+            Indices can be obtained using [`CpmAntTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+            [`PreTrainedTokenizer.__call__`] for details.
+
+            [What are input IDs?](../glossary#input-ids)
+        past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
+            Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
+            blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
+        use_cache (`bool`, *optional*):
+            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
+            `past_key_values`).
+        output_attentions (`bool`, *optional*):
+            Whether or not to return the attentions tensors of all attention layers.
+        output_hidden_states (`bool`, *optional*):
+            Whether or not to return the hidden states of all layers.
+        return_dict (`bool`, *optional*):
+            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+"""
+
+
+@add_start_docstrings(
+    "The bare CPMAnt Model outputting raw hidden-states without any specific head on top.",
+    CPMANT_START_DOCSTRING,
+)
+class CpmAntModel(CpmAntPreTrainedModel):
+    def __init__(self, config: CpmAntConfig):
+        super().__init__(config)
+        self.encoder = CpmAntEncoder(config)
+        self.segment_embedding = nn.Embedding(config.segment_types, config.hidden_size)
+        self.input_embedding = nn.Embedding(
+            config.vocab_size + config.prompt_types * config.prompt_length, config.hidden_size
+        )
+        self.position_bias = CpmAntSegmentPositionEmbedding(config)
+        self.prompt_length = config.prompt_length
+        self.vocab_size = config.vocab_size
+
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.input_embedding
+
+    def set_input_embeddings(self, embeddings, **kwargs):
+        self.input_embedding = embeddings
+
+    def _prepare_attention_mask(self, input_ids, span, context, length):
+        batch = input_ids.size(0)
+        seqlen = input_ids.size(1)
+        device = input_ids.device
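+        # directional_mask_2d[i, j] is True when j <= i, i.e. the usual causal pattern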
+        directional_mask_2d = torch.arange(seqlen, device=device) <= torch.arange(seqlen, device=device).view(-1, 1)
+        attention_mask = context[:, None, :] | (
+            context[:, :, None].logical_not() & directional_mask_2d.view(1, seqlen, seqlen)
+        )
+        attention_mask = attention_mask & (span[:, None, :] == span[:, :, None])
+        # mask for left padding
+        mask_1d = (
+            torch.tensor(list(range(seqlen - self.prompt_length))[::-1], device=device)[None, :].repeat(batch, 1)
+            < length[:, None]
+        )
+        mask_1d = torch.cat((torch.ones(batch, self.prompt_length, device=device).bool(), mask_1d), dim=1)
+        attention_mask = mask_1d.view(batch, seqlen, 1) & mask_1d.view(batch, 1, seqlen) & attention_mask
+        return attention_mask
+
+    @add_start_docstrings_to_model_forward(CPMANT_INPUTS_DOCSTRING)
+    @add_code_sample_docstrings(
+        checkpoint=_CHECKPOINT_FOR_DOC,
+        output_type=BaseModelOutputWithPast,
+        config_class=_CONFIG_FOR_DOC,
+    )
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,
+        use_cache: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        **kwargs,
+    ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPast]:
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        use_cache = use_cache if use_cache is not None else self.config.use_cache
+
+        # add prompts ahead
+        if input_ids.dtype != torch.int32:
+            input_ids = input_ids.to(torch.int32)
+        dtype, device = input_ids.dtype, input_ids.device
+        segment = torch.where(input_ids != 0, 2, 0).to(dtype=dtype, device=device)
+        length = (segment != 0).sum(-1).to(dtype=dtype, device=device)
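+        # prepend `prompt_length` virtual prompt tokens; their ids live past the regular vocabulary,
+        # in the range [vocab_size + 2 * prompt_length, vocab_size + 3 * prompt_length)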
+        input_ids = torch.cat(
+            (
+                torch.arange(
+                    self.prompt_length * 2 + self.vocab_size,
+                    self.prompt_length * 3 + self.vocab_size,
+                    dtype=dtype,
+                    device=device,
+                ).repeat(input_ids.size(0), 1),
+                input_ids,
+            ),
+            dim=1,
+        )
+        batch, seq_length = input_ids.size()
+        segment = torch.cat((torch.zeros(batch, self.prompt_length, dtype=dtype, device=device), segment), dim=1)
+        context = torch.full((batch, seq_length), 1, dtype=dtype, device=device)
+        position = torch.arange(seq_length, dtype=dtype, device=device).repeat(batch, 1)
+        span = torch.full((batch, seq_length), 0, dtype=dtype, device=device)
+
+        if past_key_values is None:
+            past_length = 0
+            past_key_values = tuple([None] * self.encoder.num_layers)
+            input_ids = input_ids.contiguous()
+            hidden_states = self.input_embedding(input_ids)
+            segment_states = self.segment_embedding(segment)
+            hidden_states = hidden_states + segment_states
+        else:
+            past_length = past_key_values[0][0].size(-2)
+            segment_states = self.segment_embedding(segment)
+            hidden_states = self.input_embedding(input_ids) + segment_states[:, -1:, :]
+
+        attention_mask = self._prepare_attention_mask(input_ids, span, context, length)
+        position_bias = self.position_bias(position, position, segment, segment)
+
+        attention_mask = attention_mask[:, past_length:, :]
+        position_bias = position_bias[:, :, past_length:, :]
+        hidden_states = hidden_states[:, past_length:, :]
+
+        hidden_states, present_key_values, all_hidden_states, all_attentions = self.encoder(
+            hidden_states,
+            attention_mask,
+            position_bias,
+            output_attentions,
+            output_hidden_states,
+            past_key_values,
+            use_cache,
+        )
+
+        if past_length == 0:
+            hidden_states = hidden_states[:, self.prompt_length :, :]
+            # drop the prompt
+            if all_attentions is not None:
+                new_attentions = ()
+                for attention in all_attentions:
+                    new_attentions += (attention[:, :, self.prompt_length :, self.prompt_length :],)
+                all_attentions = new_attentions
+            if all_hidden_states is not None:
+                new_hidden_states = ()
+                for hidden_state in all_hidden_states:
+                    new_hidden_states += (hidden_state[:, self.prompt_length :, :],)
+                all_hidden_states = new_hidden_states
+
+        if not return_dict:
+            return tuple(
+                v for v in [hidden_states, present_key_values, all_hidden_states, all_attentions] if v is not None
+            )
+
+        return BaseModelOutputWithPast(
+            last_hidden_state=hidden_states,
+            past_key_values=present_key_values,
+            hidden_states=all_hidden_states,
+            attentions=all_attentions,
+        )
+
+
+@add_start_docstrings(
+    """
+    The CPMAnt Model with a language modeling head on top (linear layer with weights tied to the input embeddings).
+    """,
+    CPMANT_START_DOCSTRING,
+)
+class CpmAntForCausalLM(CpmAntPreTrainedModel):
+    _tied_weights_keys = ["lm_head.weight"]
+
+    def __init__(self, config: CpmAntConfig):
+        super().__init__(config)
+        self.cpmant = CpmAntModel(config)
+
+        # lm_head.weight is tied to cpmant.input_embedding.weight
+        self.lm_head = nn.Linear(
+            config.hidden_size, config.vocab_size + config.prompt_types * config.prompt_length, bias=False
+        )
+        self.post_init()
+
+    @add_start_docstrings_to_model_forward(CPMANT_INPUTS_DOCSTRING)
+    @add_code_sample_docstrings(
+        checkpoint=_CHECKPOINT_FOR_DOC,
+        output_type=CausalLMOutputWithPast,
+        config_class=_CONFIG_FOR_DOC,
+    )
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        past_key_values: Optional[List[Tuple[torch.Tensor, torch.Tensor]]] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        labels: Optional[torch.Tensor] = None,
+        return_dict: Optional[bool] = None,
+        attention_mask: Optional[torch.Tensor] = None,  # dummy parameter for text-generation pipeline
+        **kwargs,
+    ) -> Union[Tuple, CausalLMOutputWithPast]:
+        r"""
+        Args:
+            input_ids (`torch.Tensor` of shape `(batch_size, seq_len)`):
+                Indices of input sequence tokens in the vocabulary.
+
+                Indices can be obtained using [`CpmAntTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+                [`PreTrainedTokenizer.__call__`] for details.
+
+                [What are input IDs?](../glossary#input-ids)
+            past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
+                Contains pre-computed hidden-states (key and values in the self-attention blocks and in the
+                cross-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
+            use_cache (`bool`, *optional*):
+                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
+                (see `past_key_values`).
+            output_attentions (`bool`, *optional*):
+                Whether or not to return the attentions tensors of all attention layers.
+            output_hidden_states (`bool`, *optional*):
+                Whether or not to return the hidden states of all layers.
+            labels (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
+                Labels for computing the language modeling loss.
+            return_dict (`bool`, *optional*):
+                Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+            attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
+                CPMAnt builds the attention mask automatically, so this argument is accepted only as a dummy for the
+                text-generation pipeline.
+
+        Example:
+
+        Text Generation with CpmAntForCausalLM.
+        ```python
+        >>> from transformers import CpmAntTokenizer, CpmAntForCausalLM
+
+        >>> texts = "今天天气不错,"
+        >>> model = CpmAntForCausalLM.from_pretrained("openbmb/cpm-ant-10b")
+        >>> tokenizer = CpmAntTokenizer.from_pretrained("openbmb/cpm-ant-10b")
+        >>> input_ids = tokenizer(texts, return_tensors="pt")
+        >>> outputs = model.generate(**input_ids)
+        >>> output_texts = tokenizer.batch_decode(outputs)
+        >>> print(output_texts)
+        ['今天天气不错,阳光明媚,我和妈妈一起去超市买东西。\n在超市里,我看到了一个很好玩的玩具,它的名字叫“机器人”。它有一个圆圆的脑袋,两只圆圆的眼睛,还有一个圆圆的']
+        ```
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        model_output = self.cpmant(
+            input_ids, output_attentions, output_hidden_states, past_key_values, use_cache, return_dict
+        )
+        hidden_states = model_output.last_hidden_state if return_dict else model_output[0]
+
+        logits = self.lm_head(hidden_states)
+
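+        # note: logits and labels are compared position-by-position; no next-token shift is applied here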
+        loss = None
+        if labels is not None:
+            loss_func = CrossEntropyLoss()
+            loss = loss_func(logits.view(-1, logits.size(-1)), labels.view(-1))
+
+        if not return_dict:
+            output = (logits,) + model_output[1:]
+            return ((loss,) + output) if loss is not None else output
+
+        return CausalLMOutputWithPast(
+            loss=loss,
+            logits=logits,
+            past_key_values=model_output.past_key_values,
+            hidden_states=model_output.hidden_states,
+            attentions=model_output.attentions,
+        )
+
+    def get_input_embeddings(self):
+        return self.cpmant.input_embedding
+
+    def set_input_embeddings(self, embeddings):
+        self.cpmant.input_embedding = embeddings
+
+    def get_output_embeddings(self):
+        return self.lm_head
+
+    def set_output_embeddings(self, new_embeddings):
+        self.lm_head = new_embeddings
+
+    def prepare_inputs_for_generation(self, input_ids, **kwargs):
+        input_ids = input_ids.int()
+        # the model ignores the attention mask, so shrink the dummy to a 1x1 tensor to save memory
+        if "attention_mask" in kwargs:
+            kwargs["attention_mask"] = torch.zeros(1, 1)
+
+        return {
+            "input_ids": input_ids,
+            "use_cache": kwargs["use_cache"],
+            "past_key_values": kwargs.get("past_key_values", None),
+        }
+
+    def _reorder_cache(self, past_key_values, beam_idx):
+        past_key_values = [list(each) if each is not None else each for each in past_key_values]
+        for key_value_layer in past_key_values:
+            key_value_layer[0] = key_value_layer[0][beam_idx]
+            key_value_layer[1] = key_value_layer[1][beam_idx]
+        return past_key_values
diff --git a/src/transformers/models/cpmant/tokenization_cpmant.py b/src/transformers/models/cpmant/tokenization_cpmant.py
new file mode 100644
index 00000000000000..c10f48e2de282e
--- /dev/null
+++ b/src/transformers/models/cpmant/tokenization_cpmant.py
@@ -0,0 +1,278 @@
+# coding=utf-8
+# Copyright 2022 The OpenBMB Team and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tokenization classes for CPMAnt."""
+import collections
+import os
+from typing import List, Optional, Tuple
+
+from transformers.utils import is_jieba_available, requires_backends
+
+
+if is_jieba_available():
+    import jieba
+
+from ...tokenization_utils import PreTrainedTokenizer
+from ...utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"}
+
+PRETRAINED_VOCAB_FILES_MAP = {
+    "vocab_file": {
+        "openbmb/cpm-ant-10b": "https://huggingface.co/openbmb/cpm-ant-10b/blob/main/vocab.txt",
+    },
+}
+
+PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
+    "openbmb/cpm-ant-10b": 1024,
+}
+
+
+def load_vocab(vocab_file):
+    """Loads a vocabulary file into a dictionary."""
+    vocab = collections.OrderedDict()
+    with open(vocab_file, "r", encoding="utf-8") as reader:
+        tokens = reader.readlines()
+    for index, token in enumerate(tokens):
+        token = token.rstrip("\n")
+        vocab[token] = index
+    return vocab
+
+
+class WordpieceTokenizer(object):
+    def __init__(self, vocab, unk_token="<unk>", max_input_chars_per_word=200):
+        self.vocab = vocab
+        self.unk_token = unk_token
+        self.max_input_chars_per_word = max_input_chars_per_word
+
+    def tokenize(self, token):
+        chars = list(token)
+        if len(chars) > self.max_input_chars_per_word:
+            return [self.unk_token]
+
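+        # greedy longest-match-first: repeatedly take the longest substring of the remaining
+        # characters that is present in the vocabulary, emitting the unk token and advancing
+        # one character when no match is found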
+        start = 0
+        sub_tokens = []
+        while start < len(chars):
+            end = len(chars)
+            cur_substr = None
+            while start < end:
+                substr = "".join(chars[start:end])
+                if substr in self.vocab:
+                    cur_substr = substr
+                    break
+                end -= 1
+            if cur_substr is None:
+                sub_tokens.append(self.unk_token)
+                start += 1
+            else:
+                sub_tokens.append(cur_substr)
+                start = end
+
+        return sub_tokens
+
+
+class CpmAntTokenizer(PreTrainedTokenizer):
+    """
+    Construct a CPMAnt tokenizer. It runs Jieba pre-tokenization followed by greedy WordPiece matching against the
+    vocabulary.
+
+    Args:
+        vocab_file (`str`):
+            Path to the vocabulary file.
+        bod_token (`str`, *optional*, defaults to `"<d>"`):
+            The beginning of document token.
+        eod_token (`str`, *optional*, defaults to `"</d>"`):
+            The end of document token.
+        bos_token (`str`, *optional*, defaults to `"<s>"`):
+            The beginning of sequence token.
+        eos_token (`str`, *optional*, defaults to `"</s>"`):
+            The end of sequence token.
+        pad_token (`str`, *optional*, defaults to `"<pad>"`):
+            The token used for padding.
+        unk_token (`str`, *optional*, defaults to `"<unk>"`):
+            The unknown token.
+        line_token (`str`, *optional*, defaults to `"</n>"`):
+            The line token.
+        space_token (`str`, *optional*, defaults to `"</_>"`):
+            The space token.
+    """
+
+    vocab_files_names = VOCAB_FILES_NAMES
+    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
+    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
+    model_input_names = ["input_ids", "attention_mask"]
+    add_prefix_space = False
+
+    def __init__(
+        self,
+        vocab_file,
+        bod_token="",
+        eod_token="",
+        bos_token="",
+        eos_token="",
+        pad_token="",
+        unk_token="",
+        line_token="",
+        space_token="",
+        padding_side="left",
+        **kwargs,
+    ):
+        requires_backends(self, ["jieba"])
+        self.bod_token = bod_token
+        self.eod_token = eod_token
+        self.encoder = load_vocab(vocab_file)
+        self.encoder[" "] = self.encoder[space_token]
+        self.encoder["\n"] = self.encoder[line_token]
+
+        del self.encoder[space_token]
+        del self.encoder[line_token]
+
+        self.encoder = collections.OrderedDict(sorted(self.encoder.items(), key=lambda x: x[1]))
+        self.decoder = {v: k for k, v in self.encoder.items()}
+
+        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.encoder, unk_token=unk_token)
+
+        super().__init__(
+            bod_token=bod_token,
+            eod_token=eod_token,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            pad_token=pad_token,
+            unk_token=unk_token,
+            line_token=line_token,
+            space_token=space_token,
+            padding_side=padding_side,
+            **kwargs,
+        )
+
+    @property
+    def bod_token_id(self):
+        return self.encoder[self.bod_token]
+
+    @property
+    def eod_token_id(self):
+        return self.encoder[self.eod_token]
+
+    @property
+    def newline_id(self):
+        return self.encoder["\n"]
+
+    @property
+    def vocab_size(self) -> int:
+        return len(self.encoder)
+
+    def get_vocab(self):
+        return dict(self.encoder, **self.added_tokens_encoder)
+
+    def _tokenize(self, text):
+        """Tokenize a string."""
+        output_tokens = []
+        for x in jieba.cut(text, cut_all=False):
+            output_tokens.extend(self.wordpiece_tokenizer.tokenize(x))
+        return output_tokens
+
+    def _decode(self, token_ids, **kwargs):
+        """Decode ids into a string."""
+        token_ids = [i for i in token_ids if i >= 0]
+        token_ids = [
+            x for x in token_ids if x != self.pad_token_id and x != self.eos_token_id and x != self.bos_token_id
+        ]
+        return super()._decode(token_ids, **kwargs)
+
+    def check(self, token):
+        return token in self.encoder
+
+    def convert_tokens_to_string(self, tokens: List[str]) -> str:
+        return "".join(tokens)
+
+    def _convert_token_to_id(self, token):
+        """Converts a token (str) in an id using the vocab."""
+        return self.encoder.get(token, self.encoder.get(self.unk_token))
+
+    def _convert_id_to_token(self, index):
+        """Converts an index (integer) in a token (str) using the vocab."""
+        return self.decoder.get(index, self.unk_token)
+
+    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
+        if os.path.isdir(save_directory):
+            vocab_file = os.path.join(
+                save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
+            )
+        else:
+            vocab_file = (filename_prefix + "-" if filename_prefix else "") + save_directory
+        index = 0
+        if " " in self.encoder:
+            self.encoder["</_>"] = self.encoder[" "]
+            del self.encoder[" "]
+        if "\n" in self.encoder:
+            self.encoder["</n>"] = self.encoder["\n"]
+            del self.encoder["\n"]
+        self.encoder = collections.OrderedDict(sorted(self.encoder.items(), key=lambda x: x[1]))
+        with open(vocab_file, "w", encoding="utf-8") as writer:
+            for token, token_index in self.encoder.items():
+                if index != token_index:
+                    logger.warning(
+                        f"Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive."
+                        " Please check that the vocabulary is not corrupted!"
+                    )
+                    index = token_index
+                writer.write(token + "\n")
+                index += 1
+        return (vocab_file,)
+
+    def build_inputs_with_special_tokens(self, token_ids_0: List[int], token_ids_1: List[int] = None) -> List[int]:
+        """
+        Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and
+        adding special tokens. A CPMAnt sequence has the following format:
+
+        - single sequence: `[BOS] Sequence`.
+
+        Args:
+            token_ids_0 (`List[int]`): The first tokenized sequence to which special tokens will be added.
+            token_ids_1 (`List[int]`): The optional second tokenized sequence to which special tokens will be added.
+
+        Returns:
+            `List[int]`: The model input with special tokens.
+        """
+        if token_ids_1 is None:
+            return [self.bos_token_id] + token_ids_0
+        return [self.bos_token_id] + token_ids_0 + [self.bos_token_id] + token_ids_1
+
+    def get_special_tokens_mask(
+        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
+    ) -> List[int]:
+        """
+        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
+        special tokens using the tokenizer `prepare_for_model` method.
+
+        Args:
+            token_ids_0 (`List[int]`): List of IDs.
+            token_ids_1 (`List[int]`, *optional*): Optional second list of IDs for sequence pairs.
+            already_has_special_tokens (`bool`, *optional*, defaults to `False`):
+                Whether or not the token list is already formatted with special tokens for the model.
+
+        Returns:
+            `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
+        """
+
+        if already_has_special_tokens:
+            return super().get_special_tokens_mask(
+                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
+            )
+
+        if token_ids_1 is not None:
+            return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1))
+        return [1] + ([0] * len(token_ids_0))
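
For readers following the tokenizer code above, here is a minimal usage sketch. It assumes the public CPM-Ant checkpoint id `openbmb/cpm-ant-10b` and an installed `jieba` backend; neither is defined by this patch.

```python
# Minimal sketch (not part of the patch): exercising CpmAntTokenizer end to end.
# Assumes the checkpoint id "openbmb/cpm-ant-10b" and that `jieba` is installed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openbmb/cpm-ant-10b")

# jieba segments the raw text; each segment is then split with the WordPiece vocab.
tokens = tokenizer.tokenize("今天天气真好")
ids = tokenizer.convert_tokens_to_ids(tokens)

# build_inputs_with_special_tokens prepends the BOS id: `[BOS] Sequence`.
print(tokenizer.build_inputs_with_special_tokens(ids))
```
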
diff --git a/src/transformers/models/ctrl/configuration_ctrl.py b/src/transformers/models/ctrl/configuration_ctrl.py
index 0a1feed58b24db..553e919b4a77d8 100644
--- a/src/transformers/models/ctrl/configuration_ctrl.py
+++ b/src/transformers/models/ctrl/configuration_ctrl.py
@@ -20,7 +20,9 @@
 
 logger = logging.get_logger(__name__)
 
-CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP = {"ctrl": "https://huggingface.co/ctrl/resolve/main/config.json"}
+CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    "Salesforce/ctrl": "https://huggingface.co/Salesforce/ctrl/resolve/main/config.json"
+}
 
 
 class CTRLConfig(PretrainedConfig):
@@ -28,7 +30,7 @@ class CTRLConfig(PretrainedConfig):
     This is the configuration class to store the configuration of a [`CTRLModel`] or a [`TFCTRLModel`]. It is used to
     instantiate a CTRL model according to the specified arguments, defining the model architecture. Instantiating a
     configuration with the defaults will yield a similar configuration to that of the
-    [ctrl](https://huggingface.co/ctrl) architecture from SalesForce.
+    [Salesforce/ctrl](https://huggingface.co/Salesforce/ctrl) architecture from SalesForce.
 
     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
     documentation from [`PretrainedConfig`] for more information.
@@ -52,7 +54,7 @@ class CTRLConfig(PretrainedConfig):
             The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
         embd_pdrop (`int`, *optional*, defaults to 0.1):
             The dropout ratio for the embeddings.
-        layer_norm_epsilon (`float`, *optional*, defaults to 1e-6):
+        layer_norm_epsilon (`float`, *optional*, defaults to 1e-06):
             The epsilon to use in the layer normalization layers
         initializer_range (`float`, *optional*, defaults to 0.02):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
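
A quick sketch of the renamed checkpoint reference introduced above (whether the legacy `ctrl` id keeps resolving through a redirect is outside this patch):

```python
# Sketch: the canonical repo id for CTRL is now namespaced under Salesforce.
from transformers import CTRLConfig

config = CTRLConfig.from_pretrained("Salesforce/ctrl")
print(config.layer_norm_epsilon)  # 1e-06, matching the corrected docstring default
```
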
diff --git a/src/transformers/models/ctrl/modeling_ctrl.py b/src/transformers/models/ctrl/modeling_ctrl.py
index b41b7c5d1b4b8b..3814f897d545fa 100644
--- a/src/transformers/models/ctrl/modeling_ctrl.py
+++ b/src/transformers/models/ctrl/modeling_ctrl.py
@@ -34,7 +34,7 @@
 _CONFIG_FOR_DOC = "CTRLConfig"
 
 CTRL_PRETRAINED_MODEL_ARCHIVE_LIST = [
-    "ctrl"
+    "Salesforce/ctrl"
     # See all CTRL models at https://huggingface.co/models?filter=ctrl
 ]
 
@@ -47,8 +47,8 @@ def angle_defn(pos, i, d_model_size):
 def positional_encoding(position, d_model_size, dtype):
     # create the sinusoidal pattern for the positional encoding
     angle_rads = angle_defn(
-        torch.arange(position, dtype=dtype).unsqueeze(1),
-        torch.arange(d_model_size, dtype=dtype).unsqueeze(0),
+        torch.arange(position, dtype=torch.int64).to(dtype).unsqueeze(1),
+        torch.arange(d_model_size, dtype=torch.int64).to(dtype).unsqueeze(0),
         d_model_size,
     )
 
@@ -374,8 +374,8 @@ def forward(
         >>> from transformers import AutoTokenizer, CTRLModel
         >>> import torch
 
-        >>> tokenizer = AutoTokenizer.from_pretrained("ctrl")
-        >>> model = CTRLModel.from_pretrained("ctrl")
+        >>> tokenizer = AutoTokenizer.from_pretrained("Salesforce/ctrl")
+        >>> model = CTRLModel.from_pretrained("Salesforce/ctrl")
 
         >>> # CTRL was trained with control codes as the first token
         >>> inputs = tokenizer("Opinion My dog is cute", return_tensors="pt")
@@ -397,6 +397,7 @@ def forward(
         if input_ids is not None and inputs_embeds is not None:
             raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
         elif input_ids is not None:
+            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
             input_shape = input_ids.size()
             input_ids = input_ids.view(-1, input_shape[-1])
             batch_size = input_ids.shape[0]
@@ -415,7 +416,7 @@ def forward(
             past_length = past_key_values[0][0].size(-2)
         if position_ids is None:
             position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device)
-            position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])
+            position_ids = position_ids.unsqueeze(0)
 
         # Attention mask.
         if attention_mask is not None:
@@ -446,7 +447,6 @@ def forward(
             token_type_embeds *= np.sqrt(self.d_model_size)
         else:
             token_type_embeds = 0
-        position_ids = position_ids.view(-1, input_shape[-1])
 
         if inputs_embeds is None:
             inputs_embeds = self.w(input_ids)
@@ -509,7 +509,7 @@ def forward(
     CTRL_START_DOCSTRING,
 )
 class CTRLLMHeadModel(CTRLPreTrainedModel):
-    _keys_to_ignore_on_load_missing = ["lm_head.weight"]
+    _tied_weights_keys = ["lm_head.weight"]
 
     def __init__(self, config):
         super().__init__(config)
@@ -526,9 +526,18 @@ def set_output_embeddings(self, new_embeddings):
         self.lm_head = new_embeddings
 
     def prepare_inputs_for_generation(self, input_ids, past_key_values=None, use_cache=None, **kwargs):
-        # only last token for inputs_ids if past is defined in kwargs
-        if past_key_values:
-            input_ids = input_ids[:, -1].unsqueeze(-1)
+        # only last tokens for input_ids if past is defined in kwargs
+        if past_key_values is not None:
+            past_length = past_key_values[0][0].shape[2]
+
+            # Some generation methods already pass only the last input ID
+            if input_ids.shape[1] > past_length:
+                remove_prefix_length = past_length
+            else:
+                # Default to old behavior: keep only final ID
+                remove_prefix_length = input_ids.shape[1] - 1
+
+            input_ids = input_ids[:, remove_prefix_length:]
 
         return {"input_ids": input_ids, "past_key_values": past_key_values, "use_cache": use_cache}
 
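
The new `prepare_inputs_for_generation` logic above keeps only the ids the cache has not processed yet. The same arithmetic on toy tensors (an illustrative sketch, independent of the model):

```python
# Sketch of the prefix-trimming rule on toy data.
import torch

input_ids = torch.tensor([[5, 6, 7, 8]])  # 4 ids passed in
past_length = 3                           # 3 ids already cached

# Some generation paths pass the full sequence, others only the last id;
# trim whatever the cache has already seen, defaulting to the final id.
if input_ids.shape[1] > past_length:
    remove_prefix_length = past_length
else:
    remove_prefix_length = input_ids.shape[1] - 1

print(input_ids[:, remove_prefix_length:])  # tensor([[8]])
```
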
@@ -563,8 +572,8 @@ def forward(
         >>> import torch
         >>> from transformers import AutoTokenizer, CTRLLMHeadModel
 
-        >>> tokenizer = AutoTokenizer.from_pretrained("ctrl")
-        >>> model = CTRLLMHeadModel.from_pretrained("ctrl")
+        >>> tokenizer = AutoTokenizer.from_pretrained("Salesforce/ctrl")
+        >>> model = CTRLLMHeadModel.from_pretrained("Salesforce/ctrl")
 
         >>> # CTRL was trained with control codes as the first token
         >>> inputs = tokenizer("Wikipedia The llama is", return_tensors="pt")
@@ -691,8 +700,8 @@ def forward(
         >>> import torch
         >>> from transformers import AutoTokenizer, CTRLForSequenceClassification
 
-        >>> tokenizer = AutoTokenizer.from_pretrained("ctrl")
-        >>> model = CTRLForSequenceClassification.from_pretrained("ctrl")
+        >>> tokenizer = AutoTokenizer.from_pretrained("Salesforce/ctrl")
+        >>> model = CTRLForSequenceClassification.from_pretrained("Salesforce/ctrl")
 
         >>> # CTRL was trained with control codes as the first token
         >>> inputs = tokenizer("Opinion My dog is cute", return_tensors="pt")
@@ -712,7 +721,7 @@ def forward(
         >>> torch.manual_seed(42)  # doctest: +IGNORE_RESULT
         >>> # To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`
         >>> num_labels = len(model.config.id2label)
-        >>> model = CTRLForSequenceClassification.from_pretrained("ctrl", num_labels=num_labels)
+        >>> model = CTRLForSequenceClassification.from_pretrained("Salesforce/ctrl", num_labels=num_labels)
 
         >>> labels = torch.tensor(1)
         >>> loss = model(**inputs, labels=labels).loss
@@ -726,8 +735,10 @@ def forward(
         >>> import torch
         >>> from transformers import AutoTokenizer, CTRLForSequenceClassification
 
-        >>> tokenizer = AutoTokenizer.from_pretrained("ctrl")
-        >>> model = CTRLForSequenceClassification.from_pretrained("ctrl", problem_type="multi_label_classification")
+        >>> tokenizer = AutoTokenizer.from_pretrained("Salesforce/ctrl")
+        >>> model = CTRLForSequenceClassification.from_pretrained(
+        ...     "Salesforce/ctrl", problem_type="multi_label_classification"
+        ... )
 
         >>> # CTRL was trained with control codes as the first token
         >>> inputs = tokenizer("Opinion My dog is cute", return_tensors="pt")
@@ -744,7 +755,7 @@ def forward(
         ```python
         >>> # To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`
         >>> num_labels = len(model.config.id2label)
-        >>> model = CTRLForSequenceClassification.from_pretrained("ctrl", num_labels=num_labels)
+        >>> model = CTRLForSequenceClassification.from_pretrained("Salesforce/ctrl", num_labels=num_labels)
 
         >>> num_labels = len(model.config.id2label)
         >>> labels = torch.nn.functional.one_hot(torch.tensor([predicted_class_id]), num_classes=num_labels).to(
@@ -785,7 +796,10 @@ def forward(
             sequence_lengths = -1
         else:
             if input_ids is not None:
-                sequence_lengths = torch.ne(input_ids, self.config.pad_token_id).sum(-1) - 1
+                # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
+                sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
+                sequence_lengths = sequence_lengths % input_ids.shape[-1]
+                sequence_lengths = sequence_lengths.to(logits.device)
             else:
                 sequence_lengths = -1
                 logger.warning(
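
The padding-aware indexing added above replaces reverse indexing so the graph stays ONNX-exportable. A standalone sketch of the trick on toy data:

```python
# Sketch: index of the last non-pad token without negative indexing.
import torch

pad_token_id = 0
input_ids = torch.tensor([[11, 12, 13, 0, 0],
                          [21, 22, 23, 24, 25]])  # second row has no padding

# argmax over the pad mask gives the first pad position; minus one is the last
# real token. With no padding, argmax is 0, and the modulo wraps -1 to the end.
sequence_lengths = torch.eq(input_ids, pad_token_id).int().argmax(-1) - 1
sequence_lengths = sequence_lengths % input_ids.shape[-1]
print(sequence_lengths)  # tensor([2, 4])
```
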
diff --git a/src/transformers/models/ctrl/modeling_tf_ctrl.py b/src/transformers/models/ctrl/modeling_tf_ctrl.py
index dcd3f5a03e0c92..19a6a84fc75f16 100644
--- a/src/transformers/models/ctrl/modeling_tf_ctrl.py
+++ b/src/transformers/models/ctrl/modeling_tf_ctrl.py
@@ -15,7 +15,8 @@
 # limitations under the License.
 """ TF 2.0 CTRL model."""
 
-import warnings
+from __future__ import annotations
+
 from typing import Optional, Tuple, Union
 
 import numpy as np
@@ -27,24 +28,24 @@
     TFModelInputType,
     TFPreTrainedModel,
     TFSequenceClassificationLoss,
-    TFSharedEmbeddings,
     get_initializer,
+    keras,
     keras_serializable,
     unpack_inputs,
 )
-from ...tf_utils import shape_list, stable_softmax
+from ...tf_utils import check_embeddings_within_bounds, shape_list, stable_softmax
 from ...utils import add_code_sample_docstrings, add_start_docstrings, add_start_docstrings_to_model_forward, logging
 from .configuration_ctrl import CTRLConfig
 
 
 logger = logging.get_logger(__name__)
 
-_CHECKPOINT_FOR_DOC = "ctrl"
+_CHECKPOINT_FOR_DOC = "Salesforce/ctrl"
 _CONFIG_FOR_DOC = "CTRLConfig"
 
 TF_CTRL_PRETRAINED_MODEL_ARCHIVE_LIST = [
-    "ctrl"
-    # See all CTRL models at https://huggingface.co/models?filter=ctrl
+    "Salesforce/ctrl"
+    # See all CTRL models at https://huggingface.co/models?filter=Salesforce/ctrl
 ]
 
 
@@ -90,7 +91,7 @@ def scaled_dot_product_attention(q, k, v, mask, attention_mask=None, head_mask=N
     return output, attention_weights
 
 
-class TFMultiHeadAttention(tf.keras.layers.Layer):
+class TFMultiHeadAttention(keras.layers.Layer):
     def __init__(self, d_model_size, num_heads, output_attentions=False, **kwargs):
         super().__init__(**kwargs)
         self.num_heads = num_heads
@@ -99,11 +100,11 @@ def __init__(self, d_model_size, num_heads, output_attentions=False, **kwargs):
 
         self.depth = int(d_model_size / self.num_heads)
 
-        self.Wq = tf.keras.layers.Dense(d_model_size, name="Wq")
-        self.Wk = tf.keras.layers.Dense(d_model_size, name="Wk")
-        self.Wv = tf.keras.layers.Dense(d_model_size, name="Wv")
+        self.Wq = keras.layers.Dense(d_model_size, name="Wq")
+        self.Wk = keras.layers.Dense(d_model_size, name="Wk")
+        self.Wv = keras.layers.Dense(d_model_size, name="Wv")
 
-        self.dense = tf.keras.layers.Dense(d_model_size, name="dense")
+        self.dense = keras.layers.Dense(d_model_size, name="dense")
 
     def split_into_heads(self, x, batch_size):
         x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
@@ -142,13 +143,32 @@ def call(self, v, k, q, mask, layer_past, attention_mask, head_mask, use_cache,
 
         return outputs
 
-
-class TFPointWiseFeedForwardLayer(tf.keras.layers.Layer):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "Wq", None) is not None:
+            with tf.name_scope(self.Wq.name):
+                self.Wq.build([None, None, self.d_model_size])
+        if getattr(self, "Wk", None) is not None:
+            with tf.name_scope(self.Wk.name):
+                self.Wk.build([None, None, self.d_model_size])
+        if getattr(self, "Wv", None) is not None:
+            with tf.name_scope(self.Wv.name):
+                self.Wv.build([None, None, self.d_model_size])
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.d_model_size])
+
+
+class TFPointWiseFeedForwardLayer(keras.layers.Layer):
     def __init__(self, d_model_size, dff, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense_0 = tf.keras.layers.Dense(dff, activation="relu", name="0")
-        self.dense_2 = tf.keras.layers.Dense(d_model_size, name="2")
+        self.dense_0 = keras.layers.Dense(dff, activation="relu", name="0")
+        self.dense_2 = keras.layers.Dense(d_model_size, name="2")
+        self.d_model_size = d_model_size
+        self.dff = dff
 
     def call(self, inputs, trainable=False):
         dense_0_output = self.dense_0(inputs)
@@ -156,8 +176,19 @@ def call(self, inputs, trainable=False):
 
         return dense_2_output
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense_0", None) is not None:
+            with tf.name_scope(self.dense_0.name):
+                self.dense_0.build([None, None, self.d_model_size])
+        if getattr(self, "dense_2", None) is not None:
+            with tf.name_scope(self.dense_2.name):
+                self.dense_2.build([None, None, self.dff])
 
-class TFEncoderLayer(tf.keras.layers.Layer):
+
+class TFEncoderLayer(keras.layers.Layer):
     def __init__(
         self, d_model_size, num_heads, dff, rate=0.1, layer_norm_epsilon=1e-6, output_attentions=False, **kwargs
     ):
@@ -170,11 +201,12 @@ def __init__(
         )
         self.ffn = TFPointWiseFeedForwardLayer(d_model_size, dff, name="ffn")
 
-        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name="layernorm1")
-        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name="layernorm2")
+        self.layernorm1 = keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name="layernorm1")
+        self.layernorm2 = keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name="layernorm2")
 
-        self.dropout1 = tf.keras.layers.Dropout(rate)
-        self.dropout2 = tf.keras.layers.Dropout(rate)
+        self.dropout1 = keras.layers.Dropout(rate)
+        self.dropout2 = keras.layers.Dropout(rate)
+        self.d_model_size = d_model_size
 
     def call(self, x, mask, layer_past, attention_mask, head_mask, use_cache, output_attentions, training=False):
         normed = self.layernorm1(x)
@@ -202,9 +234,26 @@ def call(self, x, mask, layer_past, attention_mask, head_mask, use_cache, output
         outputs = (out2,) + attn_outputs[1:]
         return outputs
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "multi_head_attention", None) is not None:
+            with tf.name_scope(self.multi_head_attention.name):
+                self.multi_head_attention.build(None)
+        if getattr(self, "ffn", None) is not None:
+            with tf.name_scope(self.ffn.name):
+                self.ffn.build(None)
+        if getattr(self, "layernorm1", None) is not None:
+            with tf.name_scope(self.layernorm1.name):
+                self.layernorm1.build([None, None, self.d_model_size])
+        if getattr(self, "layernorm2", None) is not None:
+            with tf.name_scope(self.layernorm2.name):
+                self.layernorm2.build([None, None, self.d_model_size])
+
 
 @keras_serializable
-class TFCTRLMainLayer(tf.keras.layers.Layer):
+class TFCTRLMainLayer(keras.layers.Layer):
     config_class = CTRLConfig
 
     def __init__(self, config, **kwargs):
@@ -221,11 +270,14 @@ def __init__(self, config, **kwargs):
 
         self.pos_encoding = positional_encoding(config.n_positions, self.d_model_size)
 
-        self.w = TFSharedEmbeddings(
-            config.vocab_size, config.n_embd, initializer_range=config.initializer_range, name="w"
+        self.w = keras.layers.Embedding(
+            input_dim=config.vocab_size,
+            output_dim=config.n_embd,
+            embeddings_initializer=get_initializer(config.initializer_range),
+            name="w",
         )
 
-        self.dropout = tf.keras.layers.Dropout(config.embd_pdrop)
+        self.dropout = keras.layers.Dropout(config.embd_pdrop)
         self.h = [
             TFEncoderLayer(
                 config.n_embd,
@@ -238,14 +290,13 @@ def __init__(self, config, **kwargs):
             )
             for i in range(config.n_layer)
         ]
-        self.layernorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name="layernorm")
+        self.layernorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_epsilon, name="layernorm")
 
     def get_input_embeddings(self):
         return self.w
 
-    def set_input_embeddings(self, value):
-        self.w.weight = value
-        self.w.vocab_size = shape_list(value)[0]
+    def set_input_embeddings(self, new_embeddings):
+        self.w = new_embeddings
 
     def _prune_heads(self, heads_to_prune):
         """
@@ -256,13 +307,13 @@ def _prune_heads(self, heads_to_prune):
     @unpack_inputs
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
+        input_ids: TFModelInputType | None = None,
         past_key_values: Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         use_cache: Optional[bool] = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
@@ -305,7 +356,7 @@ def call(
             # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
             # this attention mask is more simple than the triangular masking of causal attention
             # used in OpenAI GPT, we just need to prepare the broadcast dimension here.
-            attention_mask = tf.reshape(attention_mask, (input_shape[0], 1, 1, input_shape[1]))
+            attention_mask = tf.reshape(attention_mask, (input_shape[0], 1, 1, input_shape[1] + past_length))
 
             # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
             # masked positions, this operation will create a tensor which is 0.0 for
@@ -329,24 +380,15 @@ def call(
 
         if token_type_ids is not None:
             token_type_ids = tf.reshape(token_type_ids, [-1, shape_list(token_type_ids)[-1]])
-            token_type_embeds = self.w(token_type_ids, mode="embedding")
+            token_type_embeds = self.w(token_type_ids)
             token_type_embeds *= tf.math.sqrt(tf.cast(self.d_model_size, dtype=token_type_embeds.dtype))
         else:
             token_type_embeds = tf.constant(0.0)
         position_ids = tf.reshape(position_ids, [-1, shape_list(position_ids)[-1]])
 
         if inputs_embeds is None:
-            # Note: tf.gather, on which the embedding layer is based, won't check positive out of bound
-            # indices on GPU, returning zeros instead. This is a dangerous silent behavior.
-            tf.debugging.assert_less(
-                input_ids,
-                tf.cast(self.w.vocab_size, dtype=input_ids.dtype),
-                message=(
-                    "input_ids must be smaller than the embedding layer's input dimension (got"
-                    f" {tf.math.reduce_max(input_ids)} >= {self.w.vocab_size})"
-                ),
-            )
-            inputs_embeds = self.w(input_ids, mode="embedding")
+            check_embeddings_within_bounds(input_ids, self.w.input_dim)
+            inputs_embeds = self.w(input_ids)
         seq_len = input_shape[-1]
         mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
 
@@ -403,6 +445,21 @@ def call(
             attentions=all_attentions,
         )
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "w", None) is not None:
+            with tf.name_scope(self.w.name):
+                self.w.build(None)
+        if getattr(self, "layernorm", None) is not None:
+            with tf.name_scope(self.layernorm.name):
+                self.layernorm.build([None, None, self.config.n_embd])
+        if getattr(self, "h", None) is not None:
+            for layer in self.h:
+                with tf.name_scope(layer.name):
+                    layer.build(None)
+
 
 class TFCTRLPreTrainedModel(TFPreTrainedModel):
     """
@@ -420,7 +477,7 @@ class TFCTRLPreTrainedModel(TFPreTrainedModel):
     library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
     etc.)
 
-    This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
+    This model is also a [keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
     as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage and
     behavior.
 
@@ -541,13 +598,13 @@ def __init__(self, config, *inputs, **kwargs):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
+        input_ids: TFModelInputType | None = None,
         past_key_values: Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         use_cache: Optional[bool] = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
@@ -570,49 +627,35 @@ def call(
         )
         return outputs
 
-    def serving_output(self, output):
-        pkv = tf.convert_to_tensor(output.past_key_values) if self.config.use_cache else None
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFBaseModelOutputWithPast(
-            last_hidden_state=output.last_hidden_state, past_key_values=pkv, hidden_states=hs, attentions=attns
-        )
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "transformer", None) is not None:
+            with tf.name_scope(self.transformer.name):
+                self.transformer.build(None)
 
 
-class TFCTRLLMHead(tf.keras.layers.Layer):
-    def __init__(self, config, input_embeddings, **kwargs):
-        super().__init__(**kwargs)
-        self.config = config
-        # CTRL has numerical issues in XLA generate
-        self.supports_xla_generation = False
+class TFCTRLBiasLayer(keras.layers.Layer):
+    """
+    Bias as a layer. It is used for serialization purposes: `keras.Model.save_weights` stores on a per-layer basis,
+    so all weights have to be registered in a layer.
+    """
 
-        # The output weights are the same as the input embeddings, but there is
-        # an output-only bias for each token.
-        self.input_embeddings = input_embeddings
+    def __init__(self, shape, initializer, trainable, name, **kwargs):
+        super().__init__(name=name, **kwargs)
+        self.shape = shape
+        self.initializer = initializer
+        self.trainable = trainable
 
     def build(self, input_shape):
-        self.bias = self.add_weight(shape=(self.config.vocab_size,), initializer="zeros", trainable=True, name="bias")
+        self.bias = self.add_weight(
+            name="bias", shape=self.shape, initializer=self.initializer, trainable=self.trainable
+        )
         super().build(input_shape)
 
-    def get_output_embeddings(self):
-        return self.input_embeddings
-
-    def set_output_embeddings(self, value):
-        self.input_embeddings.weight = value
-        self.input_embeddings.vocab_size = shape_list(value)[0]
-
-    def get_bias(self):
-        return {"bias": self.bias}
-
-    def set_bias(self, value):
-        self.bias = value["bias"]
-        self.config.vocab_size = shape_list(value["bias"])[0]
-
-    def call(self, hidden_states):
-        hidden_states = self.input_embeddings(hidden_states, mode="linear")
-        hidden_states = hidden_states + self.bias
-        return hidden_states
+    def call(self, x):
+        return x + self.bias
 
 
 @add_start_docstrings(
@@ -626,24 +669,53 @@ class TFCTRLLMHeadModel(TFCTRLPreTrainedModel, TFCausalLanguageModelingLoss):
     def __init__(self, config, *inputs, **kwargs):
         super().__init__(config, *inputs, **kwargs)
         self.transformer = TFCTRLMainLayer(config, name="transformer")
+        self.bias_layer = TFCTRLBiasLayer(
+            name="lm_head", shape=[1, config.vocab_size], initializer="zeros", trainable=True
+        )
 
-        self.lm_head = TFCTRLLMHead(config, self.transformer.w, name="lm_head")
-        # CTRL has numerical issues in XLA generate
-        self.supports_xla_generation = False
+    def get_output_embeddings(self):
+        return self.get_input_embeddings()
 
-    def get_lm_head(self):
-        return self.lm_head
+    def set_output_embeddings(self, value):
+        self.set_input_embeddings(value)
 
-    def get_prefix_bias_name(self):
-        warnings.warn("The method get_prefix_bias_name is deprecated. Please use `get_bias` instead.", FutureWarning)
-        return self.name + "/" + self.lm_head.name
+    def get_bias(self):
+        return {"lm_head.bias": self.bias_layer.bias}
+
+    def set_bias(self, value):
+        # Replaces the existing layers containing bias for correct (de)serialization.
+        vocab_size = value["lm_head.bias"].shape[-1]
+        self.bias_layer = TFCTRLBiasLayer(
+            name="final_logits_bias", shape=[1, vocab_size], initializer="zeros", trainable=True
+        )
+        self.bias_layer.build(None)
+        self.bias_layer.bias.assign(value["lm_head.bias"])
 
-    def prepare_inputs_for_generation(self, input_ids, past_key_values=None, use_cache=None, **kwargs):
+    # Copied from transformers.models.gpt2.modeling_tf_gpt2.TFGPT2LMHeadModel.prepare_inputs_for_generation
+    def prepare_inputs_for_generation(self, inputs, past_key_values=None, use_cache=None, **kwargs):
+        token_type_ids = kwargs.get("token_type_ids", None)
         # only last token for inputs_ids if past is defined in kwargs
         if past_key_values:
-            input_ids = tf.expand_dims(input_ids[:, -1], -1)
+            inputs = tf.expand_dims(inputs[:, -1], -1)
+            if token_type_ids is not None:
+                token_type_ids = tf.expand_dims(token_type_ids[:, -1], -1)
+
+        position_ids = kwargs.get("position_ids", None)
+        attention_mask = kwargs.get("attention_mask", None)
 
-        return {"input_ids": input_ids, "past_key_values": past_key_values, "use_cache": use_cache}
+        if attention_mask is not None and position_ids is None:
+            position_ids = tf.math.cumsum(attention_mask, axis=-1, exclusive=True)
+            if past_key_values:
+                position_ids = tf.expand_dims(position_ids[:, -1], -1)
+
+        return {
+            "input_ids": inputs,
+            "attention_mask": attention_mask,
+            "position_ids": position_ids,
+            "past_key_values": past_key_values,
+            "use_cache": use_cache,
+            "token_type_ids": token_type_ids,
+        }
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(CTRL_INPUTS_DOCSTRING)
@@ -654,18 +726,18 @@ def prepare_inputs_for_generation(self, input_ids, past_key_values=None, use_cac
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
+        input_ids: TFModelInputType | None = None,
         past_key_values: Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         use_cache: Optional[bool] = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[Tuple, TFCausalLMOutputWithPast]:
         r"""
@@ -687,10 +759,9 @@ def call(
             return_dict=return_dict,
             training=training,
         )
-
         hidden_states = transformer_outputs[0]
-
-        logits = self.lm_head(hidden_states)
+        logits = tf.matmul(hidden_states, self.transformer.w.weights, transpose_b=True)
+        logits = self.bias_layer(logits)
 
         loss = None
         if labels is not None:
@@ -711,12 +782,16 @@ def call(
             attentions=transformer_outputs.attentions,
         )
 
-    def serving_output(self, output):
-        pkv = tf.convert_to_tensor(output.past_key_values) if self.config.use_cache else None
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFCausalLMOutputWithPast(logits=output.logits, past_key_values=pkv, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "transformer", None) is not None:
+            with tf.name_scope(self.transformer.name):
+                self.transformer.build(None)
+        if getattr(self, "bias_layer", None) is not None:
+            with tf.name_scope(self.bias_layer.name):
+                self.bias_layer.build(None)
 
 
 @add_start_docstrings(
@@ -738,15 +813,21 @@ class TFCTRLForSequenceClassification(TFCTRLPreTrainedModel, TFSequenceClassific
     def __init__(self, config, *inputs, **kwargs):
         super().__init__(config, *inputs, **kwargs)
         self.num_labels = config.num_labels
-        self.classifier = tf.keras.layers.Dense(
+        self.classifier = keras.layers.Dense(
             config.num_labels,
             kernel_initializer=get_initializer(config.initializer_range),
             name="classifier",
             use_bias=False,
         )
         self.transformer = TFCTRLMainLayer(config, name="transformer")
+        self.config = config
 
     def get_output_embeddings(self):
+        # Remove after transformers v4.32. Fix this model's `test_model_common_attributes` test too.
+        logger.warning(
+            "Sequence classification models do not have output embeddings. `.get_output_embeddings` will be removed "
+            "in transformers v4.32."
+        )
         return self.transformer.w
 
     @unpack_inputs
@@ -758,18 +839,18 @@ def get_output_embeddings(self):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
+        input_ids: TFModelInputType | None = None,
         past_key_values: Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         use_cache: Optional[bool] = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[Tuple, TFSequenceClassifierOutput]:
         r"""
@@ -801,16 +882,10 @@ def call(
         else:
             if input_ids is not None:
                 sequence_lengths = (
-                    tf.reduce_sum(
-                        tf.cast(
-                            tf.math.not_equal(input_ids, self.config.pad_token_id),
-                            dtype=input_ids.dtype,
-                        ),
-                        -1,
-                        keepdims=False,
-                    )
+                    tf.argmax(tf.cast(tf.math.equal(input_ids, self.config.pad_token_id), input_ids.dtype), axis=-1)
                     - 1
                 )
+                sequence_lengths = tf.where(sequence_lengths >= 0, sequence_lengths, input_ids.shape[-1] - 1)
                 in_logits = tf.gather(logits, sequence_lengths, batch_dims=1, axis=1)
             else:
                 sequence_lengths = -1
@@ -846,9 +921,13 @@ def call(
             attentions=transformer_outputs.attentions,
         )
 
-    # Copied from transformers.models.bert.modeling_tf_bert.TFBertForSequenceClassification.serving_output
-    def serving_output(self, output: TFSequenceClassifierOutput) -> TFSequenceClassifierOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFSequenceClassifierOutput(logits=output.logits, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "classifier", None) is not None:
+            with tf.name_scope(self.classifier.name):
+                self.classifier.build([None, None, self.config.n_embd])
+        if getattr(self, "transformer", None) is not None:
+            with tf.name_scope(self.transformer.name):
+                self.transformer.build(None)
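
The TF LM head above now projects with the input embedding matrix and keeps only a bias as its own layer so `save_weights` can serialize it. A toy sketch of that tying pattern (shapes and names are illustrative, not CTRL's):

```python
# Sketch: weight-tied output projection plus a bias registered in its own layer.
import tensorflow as tf
from tensorflow import keras

vocab_size, hidden_size = 10, 4
embedding = keras.layers.Embedding(vocab_size, hidden_size)
_ = embedding(tf.zeros((1, 1), dtype=tf.int32))  # build the embedding weights


class BiasLayer(keras.layers.Layer):
    def build(self, input_shape):
        self.bias = self.add_weight(name="bias", shape=[1, vocab_size], initializer="zeros")
        super().build(input_shape)

    def call(self, x):
        return x + self.bias


bias_layer = BiasLayer(name="lm_head")
hidden_states = tf.random.normal((2, 3, hidden_size))
# Reuse the embedding matrix as the output projection, then add the bias.
logits = tf.matmul(hidden_states, embedding.weights[0], transpose_b=True)
logits = bias_layer(logits)
print(logits.shape)  # (2, 3, 10)
```
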
diff --git a/src/transformers/models/ctrl/tokenization_ctrl.py b/src/transformers/models/ctrl/tokenization_ctrl.py
index f8524bdf1f54ac..3aac022897d4c0 100644
--- a/src/transformers/models/ctrl/tokenization_ctrl.py
+++ b/src/transformers/models/ctrl/tokenization_ctrl.py
@@ -33,12 +33,12 @@
 }
 
 PRETRAINED_VOCAB_FILES_MAP = {
-    "vocab_file": {"ctrl": "https://raw.githubusercontent.com/salesforce/ctrl/master/ctrl-vocab.json"},
-    "merges_file": {"ctrl": "https://raw.githubusercontent.com/salesforce/ctrl/master/ctrl-merges.txt"},
+    "vocab_file": {"Salesforce/ctrl": "https://raw.githubusercontent.com/salesforce/ctrl/master/ctrl-vocab.json"},
+    "merges_file": {"Salesforce/ctrl": "https://raw.githubusercontent.com/salesforce/ctrl/master/ctrl-merges.txt"},
 }
 
 PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
-    "ctrl": 256,
+    "Salesforce/ctrl": 256,
 }
 
 CONTROL_CODES = {
@@ -139,8 +139,6 @@ class CTRLTokenizer(PreTrainedTokenizer):
     control_codes = CONTROL_CODES
 
     def __init__(self, vocab_file, merges_file, unk_token="<unk>", **kwargs):
-        super().__init__(unk_token=unk_token, **kwargs)
-
         with open(vocab_file, encoding="utf-8") as vocab_handle:
             self.encoder = json.load(vocab_handle)
         self.decoder = {v: k for k, v in self.encoder.items()}
@@ -149,6 +147,7 @@ def __init__(self, vocab_file, merges_file, unk_token="", **kwargs):
         merges = [tuple(merge.split()) for merge in merges]
         self.bpe_ranks = dict(zip(merges, range(len(merges))))
         self.cache = {}
+        super().__init__(unk_token=unk_token, **kwargs)
 
     @property
     def vocab_size(self):
@@ -208,7 +207,7 @@ def _tokenize(self, text):
         words = re.findall(r"\S+\n?", text)
 
         for token in words:
-            split_tokens.extend([t for t in self.bpe(token).split(" ")])
+            split_tokens.extend(list(self.bpe(token).split(" ")))
         return split_tokens
 
     def _convert_token_to_id(self, token):
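
The constructor reordering above follows the newer tokenizer convention: populate the vocabulary state first, then call `super().__init__`, so the base class can resolve special tokens against a ready vocab. A toy sketch of the pattern (hypothetical `ToyTokenizer`, not the real CTRL class):

```python
# Sketch of the init-order convention used in the change above.
from transformers import PreTrainedTokenizer


class ToyTokenizer(PreTrainedTokenizer):
    def __init__(self, vocab, unk_token="<unk>", **kwargs):
        # 1) Set up everything the base class may query (vocab, ranks, caches)...
        self.encoder = dict(vocab)
        self.decoder = {v: k for k, v in self.encoder.items()}
        # 2) ...and only then call the parent, which registers special tokens by
        #    looking them up through the methods below.
        super().__init__(unk_token=unk_token, **kwargs)

    @property
    def vocab_size(self):
        return len(self.encoder)

    def get_vocab(self):
        return dict(self.encoder, **self.added_tokens_encoder)

    def _tokenize(self, text):
        return text.split()

    def _convert_token_to_id(self, token):
        return self.encoder.get(token, self.encoder[self.unk_token])

    def _convert_id_to_token(self, index):
        return self.decoder.get(index, self.unk_token)


tok = ToyTokenizer({"<unk>": 0, "hello": 1, "world": 2})
print(tok("hello world")["input_ids"])  # [1, 2]
```
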
diff --git a/src/transformers/models/cvt/configuration_cvt.py b/src/transformers/models/cvt/configuration_cvt.py
index a540c0f4807cca..f1d96fc17ea59d 100644
--- a/src/transformers/models/cvt/configuration_cvt.py
+++ b/src/transformers/models/cvt/configuration_cvt.py
@@ -96,6 +96,7 @@ class CvtConfig(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "cvt"
 
     def __init__(
diff --git a/src/transformers/models/cvt/convert_cvt_original_pytorch_checkpoint_to_pytorch.py b/src/transformers/models/cvt/convert_cvt_original_pytorch_checkpoint_to_pytorch.py
index e84c61d6aad644..ea4edac16cdbae 100644
--- a/src/transformers/models/cvt/convert_cvt_original_pytorch_checkpoint_to_pytorch.py
+++ b/src/transformers/models/cvt/convert_cvt_original_pytorch_checkpoint_to_pytorch.py
@@ -24,7 +24,7 @@
 import torch
 from huggingface_hub import cached_download, hf_hub_url
 
-from transformers import AutoFeatureExtractor, CvtConfig, CvtForImageClassification
+from transformers import AutoImageProcessor, CvtConfig, CvtForImageClassification
 
 
 def embeddings(idx):
@@ -307,8 +307,8 @@ def convert_cvt_checkpoint(cvt_model, image_size, cvt_file_name, pytorch_dump_fo
         config.embed_dim = [192, 768, 1024]
 
     model = CvtForImageClassification(config)
-    feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/convnext-base-224-22k-1k")
-    feature_extractor.size["shortest_edge"] = image_size
+    image_processor = AutoImageProcessor.from_pretrained("facebook/convnext-base-224-22k-1k")
+    image_processor.size["shortest_edge"] = image_size
     original_weights = torch.load(cvt_file_name, map_location=torch.device("cpu"))
 
     huggingface_weights = OrderedDict()
@@ -329,7 +329,7 @@ def convert_cvt_checkpoint(cvt_model, image_size, cvt_file_name, pytorch_dump_fo
 
     model.load_state_dict(huggingface_weights)
     model.save_pretrained(pytorch_dump_folder)
-    feature_extractor.save_pretrained(pytorch_dump_folder)
+    image_processor.save_pretrained(pytorch_dump_folder)
 
 
 # Download the weights from zoo: https://1drv.ms/u/s!AhIXJn_J-blW9RzF3rMW7SsLHa8h?e=blQ0Al
@@ -350,7 +350,7 @@ def convert_cvt_checkpoint(cvt_model, image_size, cvt_file_name, pytorch_dump_fo
     )
     parser.add_argument(
         "--cvt_file_name",
-        default="cvtmodels\CvT-w24-384x384-IN-22k.pth",
+        default=r"cvtmodels\CvT-w24-384x384-IN-22k.pth",
         type=str,
         help="Input Image Size",
     )
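
For context, the image-processor call pattern that replaces the deprecated feature extractor in the script above (checkpoint id taken from the script; the output directory is illustrative):

```python
# Sketch: the AutoImageProcessor calls that replace AutoFeatureExtractor here.
from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained("facebook/convnext-base-224-22k-1k")
image_processor.size["shortest_edge"] = 384  # e.g. for the 384x384 CvT checkpoints
image_processor.save_pretrained("./cvt-converted")
```
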
diff --git a/src/transformers/models/cvt/modeling_cvt.py b/src/transformers/models/cvt/modeling_cvt.py
index 99e3a02febf4d2..ef7e3671e69d35 100644
--- a/src/transformers/models/cvt/modeling_cvt.py
+++ b/src/transformers/models/cvt/modeling_cvt.py
@@ -74,11 +74,11 @@ class BaseModelOutputWithCLSToken(ModelOutput):
 
     last_hidden_state: torch.FloatTensor = None
     cls_token_value: torch.FloatTensor = None
-    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
 
 
 # Copied from transformers.models.beit.modeling_beit.drop_path
-def drop_path(input, drop_prob: float = 0.0, training: bool = False):
+def drop_path(input: torch.Tensor, drop_prob: float = 0.0, training: bool = False) -> torch.Tensor:
     """
     Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
 
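
The annotated `drop_path` above implements stochastic depth: whole residual branches are dropped per sample during training. A toy sketch of the rule, mirroring the referenced implementation:

```python
# Sketch: stochastic depth (drop_path) applied per sample.
import torch


def drop_path(x: torch.Tensor, drop_prob: float = 0.0, training: bool = False) -> torch.Tensor:
    if drop_prob == 0.0 or not training:
        return x
    keep_prob = 1 - drop_prob
    # One Bernoulli draw per sample, broadcast over the remaining dimensions.
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    mask = torch.floor(keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device))
    return x.div(keep_prob) * mask


print(drop_path(torch.ones(4, 2, 3), drop_prob=0.5, training=True)[:, 0, 0])
```
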
diff --git a/src/transformers/models/cvt/modeling_tf_cvt.py b/src/transformers/models/cvt/modeling_tf_cvt.py
index 52cc6585a7a552..c69973bdc828af 100644
--- a/src/transformers/models/cvt/modeling_tf_cvt.py
+++ b/src/transformers/models/cvt/modeling_tf_cvt.py
@@ -15,9 +15,11 @@
 """ TF 2.0 Cvt model."""
 
 
+from __future__ import annotations
+
 import collections.abc
 from dataclasses import dataclass
-from typing import Dict, Optional, Tuple, Union
+from typing import Optional, Tuple, Union
 
 import tensorflow as tf
 
@@ -27,6 +29,7 @@
     TFPreTrainedModel,
     TFSequenceClassificationLoss,
     get_initializer,
+    keras,
     keras_serializable,
     unpack_inputs,
 )
@@ -75,10 +78,10 @@ class TFBaseModelOutputWithCLSToken(ModelOutput):
 
     last_hidden_state: tf.Tensor = None
     cls_token_value: tf.Tensor = None
-    hidden_states: Optional[Tuple[tf.Tensor]] = None
+    hidden_states: Tuple[tf.Tensor, ...] | None = None
 
 
-class TFCvtDropPath(tf.keras.layers.Layer):
+class TFCvtDropPath(keras.layers.Layer):
     """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
     References:
         (1) github.com:rwightman/pytorch-image-models
@@ -93,18 +96,19 @@ def call(self, x: tf.Tensor, training=None):
             return x
         keep_prob = 1 - self.drop_prob
         shape = (tf.shape(x)[0],) + (1,) * (len(tf.shape(x)) - 1)
-        random_tensor = keep_prob + tf.random.uniform(shape, 0, 1)
+        random_tensor = keep_prob + tf.random.uniform(shape, 0, 1, dtype=self.compute_dtype)
         random_tensor = tf.floor(random_tensor)
         return (x / keep_prob) * random_tensor
 
 
-class TFCvtEmbeddings(tf.keras.layers.Layer):
+class TFCvtEmbeddings(keras.layers.Layer):
     """Construct the Convolutional Token Embeddings."""
 
     def __init__(
         self,
         config: CvtConfig,
         patch_size: int,
+        num_channels: int,
         embed_dim: int,
         stride: int,
         padding: int,
@@ -115,27 +119,45 @@ def __init__(
         self.convolution_embeddings = TFCvtConvEmbeddings(
             config,
             patch_size=patch_size,
+            num_channels=num_channels,
             embed_dim=embed_dim,
             stride=stride,
             padding=padding,
             name="convolution_embeddings",
         )
-        self.dropout = tf.keras.layers.Dropout(dropout_rate)
+        self.dropout = keras.layers.Dropout(dropout_rate)
 
     def call(self, pixel_values: tf.Tensor, training: bool = False) -> tf.Tensor:
         hidden_state = self.convolution_embeddings(pixel_values)
         hidden_state = self.dropout(hidden_state, training=training)
         return hidden_state
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "convolution_embeddings", None) is not None:
+            with tf.name_scope(self.convolution_embeddings.name):
+                self.convolution_embeddings.build(None)
+
 
-class TFCvtConvEmbeddings(tf.keras.layers.Layer):
+class TFCvtConvEmbeddings(keras.layers.Layer):
     """Image to Convolution Embeddings. This convolutional operation aims to model local spatial contexts."""
 
-    def __init__(self, config: CvtConfig, patch_size: int, embed_dim: int, stride: int, padding: int, **kwargs):
+    def __init__(
+        self,
+        config: CvtConfig,
+        patch_size: int,
+        num_channels: int,
+        embed_dim: int,
+        stride: int,
+        padding: int,
+        **kwargs,
+    ):
         super().__init__(**kwargs)
-        self.padding = tf.keras.layers.ZeroPadding2D(padding=padding)
+        self.padding = keras.layers.ZeroPadding2D(padding=padding)
         self.patch_size = patch_size if isinstance(patch_size, collections.abc.Iterable) else (patch_size, patch_size)
-        self.projection = tf.keras.layers.Conv2D(
+        self.projection = keras.layers.Conv2D(
             filters=embed_dim,
             kernel_size=patch_size,
             strides=stride,
@@ -145,7 +167,9 @@ def __init__(self, config: CvtConfig, patch_size: int, embed_dim: int, stride: i
             name="projection",
         )
         # Using the same default epsilon as PyTorch
-        self.normalization = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="normalization")
+        self.normalization = keras.layers.LayerNormalization(epsilon=1e-5, name="normalization")
+        self.num_channels = num_channels
+        self.embed_dim = embed_dim
 
     def call(self, pixel_values: tf.Tensor) -> tf.Tensor:
         if isinstance(pixel_values, dict):
@@ -163,14 +187,25 @@ def call(self, pixel_values: tf.Tensor) -> tf.Tensor:
         pixel_values = tf.reshape(pixel_values, shape=(batch_size, height, width, num_channels))
         return pixel_values
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "projection", None) is not None:
+            with tf.name_scope(self.projection.name):
+                self.projection.build([None, None, None, self.num_channels])
+        if getattr(self, "normalization", None) is not None:
+            with tf.name_scope(self.normalization.name):
+                self.normalization.build([None, None, self.embed_dim])
+
 
-class TFCvtSelfAttentionConvProjection(tf.keras.layers.Layer):
+class TFCvtSelfAttentionConvProjection(keras.layers.Layer):
     """Convolutional projection layer."""
 
     def __init__(self, config: CvtConfig, embed_dim: int, kernel_size: int, stride: int, padding: int, **kwargs):
         super().__init__(**kwargs)
-        self.padding = tf.keras.layers.ZeroPadding2D(padding=padding)
-        self.convolution = tf.keras.layers.Conv2D(
+        self.padding = keras.layers.ZeroPadding2D(padding=padding)
+        self.convolution = keras.layers.Conv2D(
             filters=embed_dim,
             kernel_size=kernel_size,
             kernel_initializer=get_initializer(config.initializer_range),
@@ -181,15 +216,27 @@ def __init__(self, config: CvtConfig, embed_dim: int, kernel_size: int, stride:
             groups=embed_dim,
         )
         # Using the same default epsilon as PyTorch, TF uses (1 - pytorch momentum)
-        self.normalization = tf.keras.layers.BatchNormalization(epsilon=1e-5, momentum=0.9, name="normalization")
+        self.normalization = keras.layers.BatchNormalization(epsilon=1e-5, momentum=0.9, name="normalization")
+        self.embed_dim = embed_dim
 
     def call(self, hidden_state: tf.Tensor, training: bool = False) -> tf.Tensor:
         hidden_state = self.convolution(self.padding(hidden_state))
         hidden_state = self.normalization(hidden_state, training=training)
         return hidden_state
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "convolution", None) is not None:
+            with tf.name_scope(self.convolution.name):
+                self.convolution.build([None, None, None, self.embed_dim])
+        if getattr(self, "normalization", None) is not None:
+            with tf.name_scope(self.normalization.name):
+                self.normalization.build([None, None, None, self.embed_dim])
+
 
-class TFCvtSelfAttentionLinearProjection(tf.keras.layers.Layer):
+class TFCvtSelfAttentionLinearProjection(keras.layers.Layer):
     """Linear projection layer used to flatten tokens into 1D."""
 
     def call(self, hidden_state: tf.Tensor) -> tf.Tensor:
@@ -200,7 +247,7 @@ def call(self, hidden_state: tf.Tensor) -> tf.Tensor:
         return hidden_state
 
 
-class TFCvtSelfAttentionProjection(tf.keras.layers.Layer):
+class TFCvtSelfAttentionProjection(keras.layers.Layer):
     """Convolutional Projection for Attention."""
 
     def __init__(
@@ -225,8 +272,16 @@ def call(self, hidden_state: tf.Tensor, training: bool = False) -> tf.Tensor:
         hidden_state = self.linear_projection(hidden_state)
         return hidden_state
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "convolution_projection", None) is not None:
+            with tf.name_scope(self.convolution_projection.name):
+                self.convolution_projection.build(None)
 
-class TFCvtSelfAttention(tf.keras.layers.Layer):
+
+class TFCvtSelfAttention(keras.layers.Layer):
     """
     Self-attention layer. A depth-wise separable convolution operation (Convolutional Projection), is applied for
     query, key, and value embeddings.
@@ -282,28 +337,28 @@ def __init__(
             name="convolution_projection_value",
         )
 
-        self.projection_query = tf.keras.layers.Dense(
+        self.projection_query = keras.layers.Dense(
             units=embed_dim,
             kernel_initializer=get_initializer(config.initializer_range),
             use_bias=qkv_bias,
             bias_initializer="zeros",
             name="projection_query",
         )
-        self.projection_key = tf.keras.layers.Dense(
+        self.projection_key = keras.layers.Dense(
             units=embed_dim,
             kernel_initializer=get_initializer(config.initializer_range),
             use_bias=qkv_bias,
             bias_initializer="zeros",
             name="projection_key",
         )
-        self.projection_value = tf.keras.layers.Dense(
+        self.projection_value = keras.layers.Dense(
             units=embed_dim,
             kernel_initializer=get_initializer(config.initializer_range),
             use_bias=qkv_bias,
             bias_initializer="zeros",
             name="projection_value",
         )
-        self.dropout = tf.keras.layers.Dropout(attention_drop_rate)
+        self.dropout = keras.layers.Dropout(attention_drop_rate)
 
     def rearrange_for_multi_head_attention(self, hidden_state: tf.Tensor) -> tf.Tensor:
         batch_size, hidden_size, _ = shape_list(hidden_state)
@@ -346,24 +401,56 @@ def call(self, hidden_state: tf.Tensor, height: int, width: int, training: bool
         context = tf.reshape(context, (batch_size, hidden_size, self.num_heads * head_dim))
         return context
 
-
-class TFCvtSelfOutput(tf.keras.layers.Layer):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "convolution_projection_query", None) is not None:
+            with tf.name_scope(self.convolution_projection_query.name):
+                self.convolution_projection_query.build(None)
+        if getattr(self, "convolution_projection_key", None) is not None:
+            with tf.name_scope(self.convolution_projection_key.name):
+                self.convolution_projection_key.build(None)
+        if getattr(self, "convolution_projection_value", None) is not None:
+            with tf.name_scope(self.convolution_projection_value.name):
+                self.convolution_projection_value.build(None)
+        if getattr(self, "projection_query", None) is not None:
+            with tf.name_scope(self.projection_query.name):
+                self.projection_query.build([None, None, self.embed_dim])
+        if getattr(self, "projection_key", None) is not None:
+            with tf.name_scope(self.projection_key.name):
+                self.projection_key.build([None, None, self.embed_dim])
+        if getattr(self, "projection_value", None) is not None:
+            with tf.name_scope(self.projection_value.name):
+                self.projection_value.build([None, None, self.embed_dim])
+
+
+class TFCvtSelfOutput(keras.layers.Layer):
     """Output of the Attention layer ."""
 
     def __init__(self, config: CvtConfig, embed_dim: int, drop_rate: float, **kwargs):
         super().__init__(**kwargs)
-        self.dense = tf.keras.layers.Dense(
+        self.dense = keras.layers.Dense(
             units=embed_dim, kernel_initializer=get_initializer(config.initializer_range), name="dense"
         )
-        self.dropout = tf.keras.layers.Dropout(drop_rate)
+        self.dropout = keras.layers.Dropout(drop_rate)
+        self.embed_dim = embed_dim
 
     def call(self, hidden_state: tf.Tensor, training: bool = False) -> tf.Tensor:
         hidden_state = self.dense(inputs=hidden_state)
         hidden_state = self.dropout(inputs=hidden_state, training=training)
         return hidden_state
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.embed_dim])
+
 
-class TFCvtAttention(tf.keras.layers.Layer):
+class TFCvtAttention(keras.layers.Layer):
     """Attention layer. First chunk of the convolutional transformer block."""
 
     def __init__(
@@ -409,35 +496,57 @@ def call(self, hidden_state: tf.Tensor, height: int, width: int, training: bool
         attention_output = self.dense_output(self_output, training=training)
         return attention_output
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "attention", None) is not None:
+            with tf.name_scope(self.attention.name):
+                self.attention.build(None)
+        if getattr(self, "dense_output", None) is not None:
+            with tf.name_scope(self.dense_output.name):
+                self.dense_output.build(None)
 
-class TFCvtIntermediate(tf.keras.layers.Layer):
+
+class TFCvtIntermediate(keras.layers.Layer):
     """Intermediate dense layer. Second chunk of the convolutional transformer block."""
 
     def __init__(self, config: CvtConfig, embed_dim: int, mlp_ratio: int, **kwargs):
         super().__init__(**kwargs)
-        self.dense = tf.keras.layers.Dense(
+        self.dense = keras.layers.Dense(
             units=int(embed_dim * mlp_ratio),
             kernel_initializer=get_initializer(config.initializer_range),
             activation="gelu",
             name="dense",
         )
+        self.embed_dim = embed_dim
 
     def call(self, hidden_state: tf.Tensor) -> tf.Tensor:
         hidden_state = self.dense(hidden_state)
         return hidden_state
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.embed_dim])
+
 
-class TFCvtOutput(tf.keras.layers.Layer):
+class TFCvtOutput(keras.layers.Layer):
     """
     Output of the Convolutional Transformer Block (last chunk). It consists of an MLP and a residual connection.
     """
 
-    def __init__(self, config: CvtConfig, embed_dim: int, drop_rate: int, **kwargs):
+    def __init__(self, config: CvtConfig, embed_dim: int, mlp_ratio: int, drop_rate: int, **kwargs):
         super().__init__(**kwargs)
-        self.dense = tf.keras.layers.Dense(
+        self.dense = keras.layers.Dense(
             units=embed_dim, kernel_initializer=get_initializer(config.initializer_range), name="dense"
         )
-        self.dropout = tf.keras.layers.Dropout(drop_rate)
+        self.dropout = keras.layers.Dropout(drop_rate)
+        self.embed_dim = embed_dim
+        self.mlp_ratio = mlp_ratio
 
     def call(self, hidden_state: tf.Tensor, input_tensor: tf.Tensor, training: bool = False) -> tf.Tensor:
         hidden_state = self.dense(inputs=hidden_state)
@@ -445,8 +554,16 @@ def call(self, hidden_state: tf.Tensor, input_tensor: tf.Tensor, training: bool
         hidden_state = hidden_state + input_tensor
         return hidden_state
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, int(self.embed_dim * self.mlp_ratio)])
+
 
-class TFCvtLayer(tf.keras.layers.Layer):
+class TFCvtLayer(keras.layers.Layer):
     """
     Convolutional Transformer Block composed of attention layers, normalization and multi-layer perceptrons (MLPs). It
     consists of 3 chunks: an attention layer, an intermediate dense layer and an output layer. This corresponds to the
@@ -490,16 +607,17 @@ def __init__(
             name="attention",
         )
         self.intermediate = TFCvtIntermediate(config, embed_dim, mlp_ratio, name="intermediate")
-        self.dense_output = TFCvtOutput(config, embed_dim, drop_rate, name="output")
+        self.dense_output = TFCvtOutput(config, embed_dim, mlp_ratio, drop_rate, name="output")
         # Using `layers.Activation` instead of `tf.identity` to better control `training` behaviour.
         self.drop_path = (
             TFCvtDropPath(drop_path_rate, name="drop_path")
             if drop_path_rate > 0.0
-            else tf.keras.layers.Activation("linear", name="drop_path")
+            else keras.layers.Activation("linear", name="drop_path")
         )
         # Using the same default epsilon as PyTorch
-        self.layernorm_before = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="layernorm_before")
-        self.layernorm_after = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="layernorm_after")
+        self.layernorm_before = keras.layers.LayerNormalization(epsilon=1e-5, name="layernorm_before")
+        self.layernorm_after = keras.layers.LayerNormalization(epsilon=1e-5, name="layernorm_after")
+        self.embed_dim = embed_dim
 
     def call(self, hidden_state: tf.Tensor, height: int, width: int, training: bool = False) -> tf.Tensor:
         # in Cvt, layernorm is applied before self-attention
@@ -518,8 +636,31 @@ def call(self, hidden_state: tf.Tensor, height: int, width: int, training: bool
         layer_output = self.drop_path(layer_output, training=training)
         return layer_output
 
-
-class TFCvtStage(tf.keras.layers.Layer):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "attention", None) is not None:
+            with tf.name_scope(self.attention.name):
+                self.attention.build(None)
+        if getattr(self, "intermediate", None) is not None:
+            with tf.name_scope(self.intermediate.name):
+                self.intermediate.build(None)
+        if getattr(self, "dense_output", None) is not None:
+            with tf.name_scope(self.dense_output.name):
+                self.dense_output.build(None)
+        if getattr(self, "drop_path", None) is not None:
+            with tf.name_scope(self.drop_path.name):
+                self.drop_path.build(None)
+        if getattr(self, "layernorm_before", None) is not None:
+            with tf.name_scope(self.layernorm_before.name):
+                self.layernorm_before.build([None, None, self.embed_dim])
+        if getattr(self, "layernorm_after", None) is not None:
+            with tf.name_scope(self.layernorm_after.name):
+                self.layernorm_after.build([None, None, self.embed_dim])
+
+
+class TFCvtStage(keras.layers.Layer):
     """
     Cvt stage (encoder block). Each stage has 2 parts:
     - (1) A Convolutional Token Embedding layer
@@ -546,6 +687,7 @@ def __init__(self, config: CvtConfig, stage: int, **kwargs):
         self.embedding = TFCvtEmbeddings(
             self.config,
             patch_size=config.patch_sizes[self.stage],
+            num_channels=config.num_channels if self.stage == 0 else config.embed_dim[self.stage - 1],
             stride=config.patch_stride[self.stage],
             embed_dim=config.embed_dim[self.stage],
             padding=config.patch_padding[self.stage],
@@ -601,8 +743,20 @@ def call(self, hidden_state: tf.Tensor, training: bool = False):
         hidden_state = tf.reshape(hidden_state, shape=(batch_size, height, width, num_channels))
         return hidden_state, cls_token
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "embedding", None) is not None:
+            with tf.name_scope(self.embedding.name):
+                self.embedding.build(None)
+        if getattr(self, "layers", None) is not None:
+            for layer in self.layers:
+                with tf.name_scope(layer.name):
+                    layer.build(None)
 
-class TFCvtEncoder(tf.keras.layers.Layer):
+
+class TFCvtEncoder(keras.layers.Layer):
     """
     Convolutional Vision Transformer encoder. CVT has 3 stages of encoder blocks with their respective number of layers
     (depth) being 1, 2 and 10.
@@ -629,7 +783,7 @@ def call(
     ) -> Union[TFBaseModelOutputWithCLSToken, Tuple[tf.Tensor]]:
         all_hidden_states = () if output_hidden_states else None
         hidden_state = pixel_values
-        # When running on CPU, `tf.keras.layers.Conv2D` doesn't support (batch_size, num_channels, height, width)
+        # When running on CPU, `keras.layers.Conv2D` doesn't support (batch_size, num_channels, height, width)
         # as input format. So change the input format to (batch_size, height, width, num_channels).
         hidden_state = tf.transpose(hidden_state, perm=(0, 2, 3, 1))
 
@@ -653,9 +807,18 @@ def call(
             hidden_states=all_hidden_states,
         )
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "stages", None) is not None:
+            for layer in self.stages:
+                with tf.name_scope(layer.name):
+                    layer.build(None)
+
 
 @keras_serializable
-class TFCvtMainLayer(tf.keras.layers.Layer):
+class TFCvtMainLayer(keras.layers.Layer):
     """Construct the Cvt model."""
 
     config_class = CvtConfig
@@ -668,7 +831,7 @@ def __init__(self, config: CvtConfig, **kwargs):
     @unpack_inputs
     def call(
         self,
-        pixel_values: Optional[TFModelInputType] = None,
+        pixel_values: TFModelInputType | None = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
         training: Optional[bool] = False,
@@ -694,6 +857,14 @@ def call(
             hidden_states=encoder_outputs.hidden_states,
         )
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "encoder", None) is not None:
+            with tf.name_scope(self.encoder.name):
+                self.encoder.build(None)
+
 
 class TFCvtPreTrainedModel(TFPreTrainedModel):
     """
@@ -705,35 +876,6 @@ class TFCvtPreTrainedModel(TFPreTrainedModel):
     base_model_prefix = "cvt"
     main_input_name = "pixel_values"
 
-    @property
-    def dummy_inputs(self) -> Dict[str, tf.Tensor]:
-        """
-        Dummy inputs to build the network.
-
-        Returns:
-            `Dict[str, tf.Tensor]`: The dummy inputs.
-        """
-        VISION_DUMMY_INPUTS = tf.random.uniform(shape=(3, self.config.num_channels, 224, 224), dtype=tf.float32)
-        return {"pixel_values": tf.constant(VISION_DUMMY_INPUTS)}
-
-    @tf.function(
-        input_signature=[
-            {
-                "pixel_values": tf.TensorSpec((None, None, None, None), tf.float32, name="pixel_values"),
-            }
-        ]
-    )
-    def serving(self, inputs):
-        """
-        Method used for serving the model.
-
-        Args:
-            inputs (`Dict[str, tf.Tensor]`):
-                The input of the saved model as a dictionary of tensors.
-        """
-        output = self.call(inputs)
-        return self.serving_output(output)
-
 
 TFCVT_START_DOCSTRING = r"""
 
@@ -741,7 +883,7 @@ def serving(self, inputs):
     library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
     etc.)
 
-    This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
+    This model is also a [keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
     as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and
     behavior.
 
@@ -752,7 +894,7 @@ def serving(self, inputs):
     - having all inputs as keyword arguments (like PyTorch models), or
     - having all inputs as a list, tuple or dict in the first positional arguments.
 
-    This second option is useful when using [`tf.keras.Model.fit`] method which currently requires having all the
+    This second option is useful when using the [`keras.Model.fit`] method which currently requires having all the
     tensors in the first argument of the model call function: `model(inputs)`.
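As an illustrative aside (not part of this patch), the keyword-argument style described above looks like the following in practice; the `microsoft/cvt-13` checkpoint id is assumed for the example:

```python
import tensorflow as tf

from transformers import TFCvtModel

# Assumed checkpoint id; any CvT checkpoint with TF weights behaves the same way.
model = TFCvtModel.from_pretrained("microsoft/cvt-13")

# All inputs passed as keyword arguments, as the docstring above recommends.
pixel_values = tf.random.uniform((1, 3, 224, 224))
outputs = model(pixel_values=pixel_values)
print(outputs.last_hidden_state.shape)  # (batch_size, num_channels, height, width)
```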
 
     
@@ -797,7 +939,7 @@ def __init__(self, config: CvtConfig, *inputs, **kwargs):
     @replace_return_docstrings(output_type=TFBaseModelOutputWithCLSToken, config_class=_CONFIG_FOR_DOC)
     def call(
         self,
-        pixel_values: Optional[tf.Tensor] = None,
+        pixel_values: tf.Tensor | None = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
         training: Optional[bool] = False,
@@ -842,12 +984,13 @@ def call(
             hidden_states=outputs.hidden_states,
         )
 
-    def serving_output(self, output: TFBaseModelOutputWithCLSToken) -> TFBaseModelOutputWithCLSToken:
-        return TFBaseModelOutputWithCLSToken(
-            last_hidden_state=output.last_hidden_state,
-            cls_token_value=output.cls_token_value,
-            hidden_states=output.hidden_states,
-        )
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "cvt", None) is not None:
+            with tf.name_scope(self.cvt.name):
+                self.cvt.build(None)
 
 
 @add_start_docstrings(
@@ -864,24 +1007,25 @@ def __init__(self, config: CvtConfig, *inputs, **kwargs):
         self.num_labels = config.num_labels
         self.cvt = TFCvtMainLayer(config, name="cvt")
         # Using same default epsilon as in the original implementation.
-        self.layernorm = tf.keras.layers.LayerNormalization(epsilon=1e-5, name="layernorm")
+        self.layernorm = keras.layers.LayerNormalization(epsilon=1e-5, name="layernorm")
 
         # Classifier head
-        self.classifier = tf.keras.layers.Dense(
+        self.classifier = keras.layers.Dense(
             units=config.num_labels,
             kernel_initializer=get_initializer(config.initializer_range),
             use_bias=True,
             bias_initializer="zeros",
             name="classifier",
         )
+        self.config = config
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(TFCVT_INPUTS_DOCSTRING)
     @replace_return_docstrings(output_type=TFImageClassifierOutputWithNoAttention, config_class=_CONFIG_FOR_DOC)
     def call(
         self,
-        pixel_values: Optional[tf.Tensor] = None,
-        labels: Optional[tf.Tensor] = None,
+        pixel_values: tf.Tensor | None = None,
+        labels: tf.Tensor | None = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
         training: Optional[bool] = False,
@@ -944,5 +1088,17 @@ def call(
 
         return TFImageClassifierOutputWithNoAttention(loss=loss, logits=logits, hidden_states=outputs.hidden_states)
 
-    def serving_output(self, output: TFImageClassifierOutputWithNoAttention) -> TFImageClassifierOutputWithNoAttention:
-        return TFImageClassifierOutputWithNoAttention(logits=output.logits, hidden_states=output.hidden_states)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "cvt", None) is not None:
+            with tf.name_scope(self.cvt.name):
+                self.cvt.build(None)
+        if getattr(self, "layernorm", None) is not None:
+            with tf.name_scope(self.layernorm.name):
+                self.layernorm.build([None, None, self.config.embed_dim[-1]])
+        if getattr(self, "classifier", None) is not None:
+            if hasattr(self.classifier, "name"):
+                with tf.name_scope(self.classifier.name):
+                    self.classifier.build([None, None, self.config.embed_dim[-1]])
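For reviewers, a self-contained sketch (not part of the patch; names and dimensions are illustrative) of the explicit `build()` idiom the layers above now follow: guard on `self.built`, then build each sublayer under its own name scope with a symbolic shape, so weights are created without ever running dummy inputs.

```python
import tensorflow as tf
from tensorflow import keras


class ExampleBlock(keras.layers.Layer):
    def __init__(self, hidden_dim: int, **kwargs):
        super().__init__(**kwargs)
        self.dense = keras.layers.Dense(hidden_dim, name="dense")
        self.hidden_dim = hidden_dim

    def call(self, hidden_state: tf.Tensor) -> tf.Tensor:
        return self.dense(hidden_state)

    def build(self, input_shape=None):
        if self.built:
            return
        self.built = True
        if getattr(self, "dense", None) is not None:
            with tf.name_scope(self.dense.name):
                # Only the last dimension matters for a Dense layer.
                self.dense.build([None, None, self.hidden_dim])


block = ExampleBlock(hidden_dim=64)
block.build()  # weights exist now, no dummy forward pass needed
print([w.shape for w in block.weights])
```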
diff --git a/src/transformers/models/data2vec/configuration_data2vec_audio.py b/src/transformers/models/data2vec/configuration_data2vec_audio.py
index 2ec526924f36eb..e37def379fbb15 100644
--- a/src/transformers/models/data2vec/configuration_data2vec_audio.py
+++ b/src/transformers/models/data2vec/configuration_data2vec_audio.py
@@ -58,10 +58,15 @@ class Data2VecAudioConfig(PretrainedConfig):
             `"relu"`, `"selu"` and `"gelu_new"` are supported.
         hidden_dropout (`float`, *optional*, defaults to 0.1):
             The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
+        activation_dropout (`float`, *optional*, defaults to 0.1):
+            The dropout ratio for activations inside the fully connected layer.
         attention_dropout (`float`, *optional*, defaults to 0.1):
             The dropout ratio for the attention probabilities.
         final_dropout (`float`, *optional*, defaults to 0.1):
             The dropout probability for the final projection layer of [`Data2VecAudioForCTC`].
+        layerdrop (`float`, *optional*, defaults to 0.1):
+            The LayerDrop probability. See the [LayerDrop paper](https://arxiv.org/abs/1909.11556) for more
+            details.
         initializer_range (`float`, *optional*, defaults to 0.02):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
         layer_norm_eps (`float`, *optional*, defaults to 1e-12):
@@ -163,6 +168,7 @@ class Data2VecAudioConfig(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "data2vec-audio"
 
     def __init__(
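A hedged usage note for the two newly documented arguments (values below are arbitrary):

```python
from transformers import Data2VecAudioConfig

# activation_dropout and layerdrop are existing __init__ arguments; the docstring
# entries added above only document them.
config = Data2VecAudioConfig(activation_dropout=0.1, layerdrop=0.05)
print(config.activation_dropout, config.layerdrop)
```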
diff --git a/src/transformers/models/data2vec/configuration_data2vec_text.py b/src/transformers/models/data2vec/configuration_data2vec_text.py
index 305a3ea5e4ffa4..01a81e95b412b7 100644
--- a/src/transformers/models/data2vec/configuration_data2vec_text.py
+++ b/src/transformers/models/data2vec/configuration_data2vec_text.py
@@ -95,6 +95,7 @@ class Data2VecTextConfig(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "data2vec-text"
 
     def __init__(
diff --git a/src/transformers/models/data2vec/configuration_data2vec_vision.py b/src/transformers/models/data2vec/configuration_data2vec_vision.py
index b45f8420ca0008..5d8e4a252a7c29 100644
--- a/src/transformers/models/data2vec/configuration_data2vec_vision.py
+++ b/src/transformers/models/data2vec/configuration_data2vec_vision.py
@@ -111,6 +111,7 @@ class Data2VecVisionConfig(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "data2vec-vision"
 
     def __init__(
diff --git a/src/transformers/models/data2vec/convert_data2vec_text_original_pytorch_checkpoint_to_pytorch.py b/src/transformers/models/data2vec/convert_data2vec_text_original_pytorch_checkpoint_to_pytorch.py
index 9a38b3ae0bd1a3..81f5cd23fb9ef8 100644
--- a/src/transformers/models/data2vec/convert_data2vec_text_original_pytorch_checkpoint_to_pytorch.py
+++ b/src/transformers/models/data2vec/convert_data2vec_text_original_pytorch_checkpoint_to_pytorch.py
@@ -24,7 +24,12 @@
 from fairseq.modules import TransformerSentenceEncoderLayer
 from packaging import version
 
-from transformers import Data2VecTextConfig, Data2VecTextForMaskedLM, Data2VecTextForSequenceClassification
+from transformers import (
+    Data2VecTextConfig,
+    Data2VecTextForMaskedLM,
+    Data2VecTextForSequenceClassification,
+    Data2VecTextModel,
+)
 from transformers.models.bert.modeling_bert import (
     BertIntermediate,
     BertLayer,
@@ -35,7 +40,6 @@
 
 # IMPORTANT: In order for this script to run, please make sure to download the dictionary: `dict.txt` from wget https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz
 # File copied from https://github.com/pytorch/fairseq/blob/main/examples/data2vec/models/data2vec_text.py
-from transformers.models.data2vec.data2vec_text import Data2VecTextModel
 from transformers.utils import logging
 
 
diff --git a/src/transformers/models/data2vec/convert_data2vec_vision_original_pytorch_checkpoint_to_pytorch.py b/src/transformers/models/data2vec/convert_data2vec_vision_original_pytorch_checkpoint_to_pytorch.py
index 8ff8c1b910f0b4..0c6f42f4ba7f1b 100755
--- a/src/transformers/models/data2vec/convert_data2vec_vision_original_pytorch_checkpoint_to_pytorch.py
+++ b/src/transformers/models/data2vec/convert_data2vec_vision_original_pytorch_checkpoint_to_pytorch.py
@@ -8,7 +8,7 @@
 from timm.models import create_model
 
 from transformers import (
-    BeitFeatureExtractor,
+    BeitImageProcessor,
     Data2VecVisionConfig,
     Data2VecVisionForImageClassification,
     Data2VecVisionModel,
@@ -304,9 +304,9 @@ def main():
     orig_model.eval()
 
     # 3. Forward Beit model
-    feature_extractor = BeitFeatureExtractor(size=config.image_size, do_center_crop=False)
+    image_processor = BeitImageProcessor(size=config.image_size, do_center_crop=False)
     image = Image.open("../../../../tests/fixtures/tests_samples/COCO/000000039769.png")
-    encoding = feature_extractor(images=image, return_tensors="pt")
+    encoding = image_processor(images=image, return_tensors="pt")
     pixel_values = encoding["pixel_values"]
 
     orig_args = (pixel_values,) if is_finetuned else (pixel_values, None)
@@ -354,7 +354,7 @@ def main():
     # 7. Save
     print(f"Saving to {args.hf_checkpoint_name}")
     hf_model.save_pretrained(args.hf_checkpoint_name)
-    feature_extractor.save_pretrained(args.hf_checkpoint_name)
+    image_processor.save_pretrained(args.hf_checkpoint_name)
 
 
 if __name__ == "__main__":
diff --git a/src/transformers/models/data2vec/modeling_data2vec_audio.py b/src/transformers/models/data2vec/modeling_data2vec_audio.py
index 6cda1b869bca68..b3dde2438ab98f 100755
--- a/src/transformers/models/data2vec/modeling_data2vec_audio.py
+++ b/src/transformers/models/data2vec/modeling_data2vec_audio.py
@@ -25,7 +25,7 @@
 from torch.nn import CrossEntropyLoss
 
 from ...activations import ACT2FN
-from ...deepspeed import is_deepspeed_zero3_enabled
+from ...integrations.deepspeed import is_deepspeed_zero3_enabled
 from ...modeling_outputs import (
     BaseModelOutput,
     CausalLMOutput,
@@ -35,8 +35,13 @@
     XVectorOutput,
 )
 from ...modeling_utils import PreTrainedModel
-from ...pytorch_utils import torch_int_div
-from ...utils import add_code_sample_docstrings, add_start_docstrings, add_start_docstrings_to_model_forward, logging
+from ...utils import (
+    add_code_sample_docstrings,
+    add_start_docstrings,
+    add_start_docstrings_to_model_forward,
+    is_peft_available,
+    logging,
+)
 from .configuration_data2vec_audio import Data2VecAudioConfig
 
 
@@ -294,15 +299,8 @@ def forward(self, input_values):
 
         for conv_layer in self.conv_layers:
             if self._requires_grad and self.gradient_checkpointing and self.training:
-
-                def create_custom_forward(module):
-                    def custom_forward(*inputs):
-                        return module(*inputs)
-
-                    return custom_forward
-
-                hidden_states = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(conv_layer),
+                hidden_states = self._gradient_checkpointing_func(
+                    conv_layer.__call__,
                     hidden_states,
                 )
             else:
@@ -338,12 +336,15 @@ def __init__(
         dropout: float = 0.0,
         is_decoder: bool = False,
         bias: bool = True,
+        is_causal: bool = False,
+        config: Optional[Data2VecAudioConfig] = None,
     ):
         super().__init__()
         self.embed_dim = embed_dim
         self.num_heads = num_heads
         self.dropout = dropout
         self.head_dim = embed_dim // num_heads
+        self.config = config
 
         if (self.head_dim * num_heads) != self.embed_dim:
             raise ValueError(
@@ -352,6 +353,7 @@ def __init__(
             )
         self.scaling = self.head_dim**-0.5
         self.is_decoder = is_decoder
+        self.is_causal = is_causal
 
         self.k_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
         self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
@@ -419,8 +421,8 @@ def forward(
 
         proj_shape = (bsz * self.num_heads, -1, self.head_dim)
         query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape)
-        key_states = key_states.view(*proj_shape)
-        value_states = value_states.view(*proj_shape)
+        key_states = key_states.reshape(*proj_shape)
+        value_states = value_states.reshape(*proj_shape)
 
         src_len = key_states.size(1)
         attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
@@ -466,7 +468,7 @@ def forward(
 
         if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
             raise ValueError(
-                f"`attn_output` should be of size {(bsz, self.num_heads, tgt_len, self.head_dim)}, but is"
+                f"`attn_output` should be of size {(bsz * self.num_heads, tgt_len, self.head_dim)}, but is"
                 f" {attn_output.size()}"
             )
 
@@ -474,7 +476,7 @@ def forward(
         attn_output = attn_output.transpose(1, 2)
 
         # Use the `embed_dim` from the config (stored in the class) rather than `hidden_state` because `attn_output` can be
-        # partitioned aross GPUs when using tensor-parallelism.
+        # partitioned across GPUs when using tensor-parallelism.
         attn_output = attn_output.reshape(bsz, tgt_len, self.embed_dim)
 
         attn_output = self.out_proj(attn_output)
@@ -588,23 +590,17 @@ def forward(
                 all_hidden_states = all_hidden_states + (hidden_states,)
 
             # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
-            dropout_probability = np.random.uniform(0, 1)
+            dropout_probability = torch.rand([])
 
             skip_the_layer = True if self.training and (dropout_probability < self.config.layerdrop) else False
             if not skip_the_layer or deepspeed_zero3_is_enabled:
                 # under deepspeed zero3 all gpus must run in sync
                 if self.gradient_checkpointing and self.training:
-                    # create gradient checkpointing function
-                    def create_custom_forward(module):
-                        def custom_forward(*inputs):
-                            return module(*inputs, output_attentions)
-
-                        return custom_forward
-
-                    layer_outputs = torch.utils.checkpoint.checkpoint(
-                        create_custom_forward(layer),
+                    layer_outputs = self._gradient_checkpointing_func(
+                        layer.__call__,
                         hidden_states,
                         attention_mask,
+                        output_attentions,
                     )
                 else:
                     layer_outputs = layer(
@@ -690,7 +686,6 @@ class Data2VecAudioPreTrainedModel(PreTrainedModel):
     config_class = Data2VecAudioConfig
     base_model_prefix = "data2vec_audio"
     main_input_name = "input_values"
-    _keys_to_ignore_on_load_missing = [r"position_ids"]
     supports_gradient_checkpointing = True
 
     def _init_weights(self, module):
@@ -731,7 +726,7 @@ def _get_feat_extract_output_lengths(
         def _conv_out_length(input_length, kernel_size, stride):
             # 1D convolutional layer output length formula taken
             # from https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html
-            return torch_int_div(input_length - kernel_size, stride) + 1
+            return torch.div(input_length - kernel_size, stride, rounding_mode="floor") + 1
 
         for kernel_size, stride in zip(self.config.conv_kernel, self.config.conv_stride):
             input_lengths = _conv_out_length(input_lengths, kernel_size, stride)
@@ -763,10 +758,6 @@ def _get_feature_vector_attention_mask(
         attention_mask = attention_mask.flip([-1]).cumsum(-1).flip([-1]).bool()
         return attention_mask
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, (Data2VecAudioEncoder, Data2VecAudioFeatureEncoder)):
-            module.gradient_checkpointing = value
-
 
 DATA2VEC_AUDIO_START_DOCSTRING = r"""
     Data2VecAudio was proposed in [data2vec: A General Framework for Self-supervised Learning in Speech, Vision and
@@ -805,12 +796,11 @@ def _set_gradient_checkpointing(self, module, value=False):
 
             
 
-            `attention_mask` should only be passed if the corresponding processor has `config.return_attention_mask ==
-            True`. For all models whose processor has `config.return_attention_mask == False`, such as
-            [data2vec-audio-base](https://huggingface.co/facebook/data2vec-audio-base-960h), `attention_mask` should
-            **not** be passed to avoid degraded performance when doing batched inference. For such models
-            `input_values` should simply be padded with 0 and passed without `attention_mask`. Be aware that these
-            models also yield slightly different results depending on whether `input_values` is padded or not.
+            `attention_mask` should be passed if the corresponding processor has `config.return_attention_mask ==
+            True`, which is the case for all pre-trained Data2Vec Audio models. Be aware that even with
+            `attention_mask`, zero-padded inputs will have slightly different outputs compared to non-padded inputs
+            because there is more than one convolutional layer in the positional encodings. For a more detailed
+            explanation, see [here](https://github.com/huggingface/transformers/issues/25621#issuecomment-1713759349).
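As a hedged illustration of the recommendation above (the checkpoint id and dummy audio are assumptions), batched inference with the processor-provided `attention_mask` looks like:

```python
import numpy as np

from transformers import AutoProcessor, Data2VecAudioForCTC

# Assumed checkpoint; its processor is configured with return_attention_mask=True.
processor = AutoProcessor.from_pretrained("facebook/data2vec-audio-base-960h")
model = Data2VecAudioForCTC.from_pretrained("facebook/data2vec-audio-base-960h")

# Two dummy utterances of different lengths, padded into a single batch.
speech = [np.zeros(16000, dtype=np.float32), np.zeros(12000, dtype=np.float32)]
inputs = processor(speech, sampling_rate=16000, padding=True, return_tensors="pt")

outputs = model(**inputs)  # inputs contain both input_values and attention_mask
print(outputs.logits.shape)
```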
 
             
 
@@ -993,7 +983,7 @@ def freeze_feature_extractor(self):
         not be updated during training.
         """
         warnings.warn(
-            "The method `freeze_feature_extractor` is deprecated and will be removed in Transformers v5."
+            "The method `freeze_feature_extractor` is deprecated and will be removed in Transformers v5. "
             "Please use the equivalent `freeze_feature_encoder` method instead.",
             FutureWarning,
         )
@@ -1118,7 +1108,7 @@ def freeze_feature_extractor(self):
         not be updated during training.
         """
         warnings.warn(
-            "The method `freeze_feature_extractor` is deprecated and will be removed in Transformers v5."
+            "The method `freeze_feature_extractor` is deprecated and will be removed in Transformers v5. "
             "Please use the equivalent `freeze_feature_encoder` method instead.",
             FutureWarning,
         )
@@ -1239,7 +1229,7 @@ def freeze_feature_extractor(self):
         not be updated during training.
         """
         warnings.warn(
-            "The method `freeze_feature_extractor` is deprecated and will be removed in Transformers v5."
+            "The method `freeze_feature_extractor` is deprecated and will be removed in Transformers v5. "
             "Please use the equivalent `freeze_feature_encoder` method instead.",
             FutureWarning,
         )
@@ -1358,16 +1348,21 @@ def __init__(self, config, layer_id=0):
         self.kernel = nn.Linear(self.in_conv_dim * self.kernel_size, self.out_conv_dim)
         self.activation = nn.ReLU()
 
-    def forward(self, hidden_states):
-        hidden_states = hidden_states.unsqueeze(1)
-        hidden_states = nn.functional.unfold(
-            hidden_states,
-            (self.kernel_size, self.in_conv_dim),
-            stride=(1, self.in_conv_dim),
-            dilation=(self.dilation, 1),
-        )
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        if is_peft_available():
+            from peft.tuners.lora import LoraLayer
+
+            if isinstance(self.kernel, LoraLayer):
+                warnings.warn(
+                    "Detected LoRA on TDNNLayer. LoRA weights won't be applied due to optimization. "
+                    "You should exclude TDNNLayer from LoRA's target modules.",
+                )
+
+        # for backward compatibility, we keep nn.Linear but call F.conv1d for a speed-up
+        hidden_states = hidden_states.transpose(1, 2)
+        weight = self.kernel.weight.view(self.out_conv_dim, self.kernel_size, self.in_conv_dim).transpose(1, 2)
+        hidden_states = nn.functional.conv1d(hidden_states, weight, self.kernel.bias, dilation=self.dilation)
         hidden_states = hidden_states.transpose(1, 2)
-        hidden_states = self.kernel(hidden_states)
 
         hidden_states = self.activation(hidden_states)
         return hidden_states
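A standalone sanity check (not part of the patch; dimensions are arbitrary) that the `conv1d` formulation above matches the previous `unfold` + `nn.Linear` computation:

```python
import torch
import torch.nn as nn

in_dim, out_dim, kernel_size, dilation = 8, 4, 3, 1
kernel = nn.Linear(in_dim * kernel_size, out_dim)
x = torch.randn(2, 50, in_dim)  # (batch, time, features)

# Old path: unfold sliding windows over time, then one Linear per flattened window.
h = nn.functional.unfold(
    x.unsqueeze(1), (kernel_size, in_dim), stride=(1, in_dim), dilation=(dilation, 1)
)
old = kernel(h.transpose(1, 2))

# New path: reinterpret the Linear weight as a Conv1d kernel, as the patch does.
weight = kernel.weight.view(out_dim, kernel_size, in_dim).transpose(1, 2)
new = nn.functional.conv1d(x.transpose(1, 2), weight, kernel.bias, dilation=dilation)
new = new.transpose(1, 2)

print(torch.allclose(old, new, atol=1e-5))  # expected: True
```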
@@ -1405,7 +1400,7 @@ def freeze_feature_extractor(self):
         not be updated during training.
         """
         warnings.warn(
-            "The method `freeze_feature_extractor` is deprecated and will be removed in Transformers v5."
+            "The method `freeze_feature_extractor` is deprecated and will be removed in Transformers v5. "
             "Please use the equivalent `freeze_feature_encoder` method instead.",
             FutureWarning,
         )
diff --git a/src/transformers/models/data2vec/modeling_data2vec_text.py b/src/transformers/models/data2vec/modeling_data2vec_text.py
index aa89c8f5d0e27c..567cc7b5c34f5e 100644
--- a/src/transformers/models/data2vec/modeling_data2vec_text.py
+++ b/src/transformers/models/data2vec/modeling_data2vec_text.py
@@ -80,7 +80,9 @@ def __init__(self, config):
         self.dropout = nn.Dropout(config.hidden_dropout_prob)
         # position_ids (1, len position emb) is contiguous in memory and exported when serialized
         self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
-        self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
+        self.register_buffer(
+            "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False
+        )
         self.register_buffer(
             "token_type_ids", torch.zeros(self.position_ids.size(), dtype=torch.long), persistent=False
         )
@@ -492,6 +494,13 @@ def forward(
         all_self_attentions = () if output_attentions else None
         all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None
 
+        if self.gradient_checkpointing and self.training:
+            if use_cache:
+                logger.warning_once(
+                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+                )
+                use_cache = False
+
         next_decoder_cache = () if use_cache else None
         for i, layer_module in enumerate(self.layer):
             if output_hidden_states:
@@ -501,25 +510,15 @@ def forward(
             past_key_value = past_key_values[i] if past_key_values is not None else None
 
             if self.gradient_checkpointing and self.training:
-                if use_cache:
-                    logger.warning(
-                        "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
-                    )
-                    use_cache = False
-
-                def create_custom_forward(module):
-                    def custom_forward(*inputs):
-                        return module(*inputs, past_key_value, output_attentions)
-
-                    return custom_forward
-
-                layer_outputs = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(layer_module),
+                layer_outputs = self._gradient_checkpointing_func(
+                    layer_module.__call__,
                     hidden_states,
                     attention_mask,
                     layer_head_mask,
                     encoder_hidden_states,
                     encoder_attention_mask,
+                    past_key_value,
+                    output_attentions,
                 )
             else:
                 layer_outputs = layer_module(
@@ -589,7 +588,7 @@ class Data2VecTextPreTrainedModel(PreTrainedModel):
     config_class = Data2VecTextConfig
     base_model_prefix = "data2vec_text"
     supports_gradient_checkpointing = True
-    _no_split_modules = []
+    _no_split_modules = ["Data2VecTextForTextEmbeddings", "Data2VecTextLayer"]
 
     def _init_weights(self, module):
         """Initialize the weights"""
@@ -609,19 +608,6 @@ def _init_weights(self, module):
             if hasattr(module, "weight") and module.weight is not None:
                 module.weight.data.fill_(1.0)
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, Data2VecTextEncoder):
-            module.gradient_checkpointing = value
-
-    def update_keys_to_ignore(self, config, del_keys_to_ignore):
-        """Remove some keys from ignore list"""
-        if not config.tie_word_embeddings:
-            # must make a new list, or the class variable gets modified!
-            self._keys_to_ignore_on_save = [k for k in self._keys_to_ignore_on_save if k not in del_keys_to_ignore]
-            self._keys_to_ignore_on_load_missing = [
-                k for k in self._keys_to_ignore_on_load_missing if k not in del_keys_to_ignore
-            ]
-
 
 DATA2VECTEXT_START_DOCSTRING = r"""
     Data2VecText was proposed in [data2vec: A General Framework for Self-supervised Learning in Speech, Vision and
@@ -712,8 +698,6 @@ class Data2VecTextModel(Data2VecTextPreTrainedModel):
 
     """
 
-    _keys_to_ignore_on_load_missing = [r"position_ids"]
-
     def __init__(self, config, add_pooling_layer=True):
         super().__init__(config)
         self.config = config
@@ -797,6 +781,7 @@ def forward(
         if input_ids is not None and inputs_embeds is not None:
             raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
         elif input_ids is not None:
+            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
             input_shape = input_ids.size()
         elif inputs_embeds is not None:
             input_shape = inputs_embeds.size()[:-1]
@@ -881,9 +866,7 @@ def forward(
     """Data2VecText Model with a `language modeling` head on top for CLM fine-tuning.""", DATA2VECTEXT_START_DOCSTRING
 )
 class Data2VecTextForCausalLM(Data2VecTextPreTrainedModel):
-    _keys_to_ignore_on_save = [r"lm_head.decoder.weight", r"lm_head.decoder.bias"]
-    _keys_to_ignore_on_load_missing = [r"position_ids", r"lm_head.decoder.weight", r"lm_head.decoder.bias"]
-    _keys_to_ignore_on_load_unexpected = [r"pooler"]
+    _tied_weights_keys = ["lm_head.decoder.weight", "lm_head.decoder.bias"]
 
     def __init__(self, config):
         super().__init__(config)
@@ -894,9 +877,6 @@ def __init__(self, config):
         self.data2vec_text = Data2VecTextModel(config, add_pooling_layer=False)
         self.lm_head = Data2VecTextLMHead(config)
 
-        # The LM head weights require special treatment only when they are tied with the word embeddings
-        self.update_keys_to_ignore(config, ["lm_head.decoder.weight"])
-
         # Initialize weights and apply final processing
         self.post_init()
 
@@ -997,6 +977,8 @@ def forward(
             shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()
             labels = labels[:, 1:].contiguous()
             loss_fct = CrossEntropyLoss()
+
+            labels = labels.to(shifted_prediction_scores.device)
             lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
 
         if not return_dict:
@@ -1018,24 +1000,33 @@ def prepare_inputs_for_generation(self, input_ids, past_key_values=None, attenti
         if attention_mask is None:
             attention_mask = input_ids.new_ones(input_shape)
 
-        # cut decoder_input_ids if past is used
+        # cut decoder_input_ids if past_key_values is used
         if past_key_values is not None:
-            input_ids = input_ids[:, -1:]
+            past_length = past_key_values[0][0].shape[2]
+
+            # Some generation methods already pass only the last input ID
+            if input_ids.shape[1] > past_length:
+                remove_prefix_length = past_length
+            else:
+                # Default to old behavior: keep only final ID
+                remove_prefix_length = input_ids.shape[1] - 1
+
+            input_ids = input_ids[:, remove_prefix_length:]
 
         return {"input_ids": input_ids, "attention_mask": attention_mask, "past_key_values": past_key_values}
 
     def _reorder_cache(self, past_key_values, beam_idx):
         reordered_past = ()
         for layer_past in past_key_values:
-            reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),)
+            reordered_past += (
+                tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
+            )
         return reordered_past
 
 
 @add_start_docstrings("""data2vec Model with a `language modeling` head on top.""", DATA2VECTEXT_START_DOCSTRING)
 class Data2VecTextForMaskedLM(Data2VecTextPreTrainedModel):
-    _keys_to_ignore_on_save = [r"lm_head.decoder.weight", r"lm_head.decoder.bias"]
-    _keys_to_ignore_on_load_missing = [r"position_ids", r"lm_head.decoder.weight", r"lm_head.decoder.bias"]
-    _keys_to_ignore_on_load_unexpected = [r"pooler"]
+    _tied_weights_keys = ["lm_head.decoder.weight", "lm_head.decoder.bias"]
 
     def __init__(self, config):
         super().__init__(config)
@@ -1049,9 +1040,6 @@ def __init__(self, config):
         self.data2vec_text = Data2VecTextModel(config, add_pooling_layer=False)
         self.lm_head = Data2VecTextLMHead(config)
 
-        # The LM head weights require special treatment only when they are tied with the word embeddings
-        self.update_keys_to_ignore(config, ["lm_head.decoder.weight"])
-
         # Initialize weights and apply final processing
         self.post_init()
 
@@ -1112,6 +1100,8 @@ def forward(
         masked_lm_loss = None
         if labels is not None:
             loss_fct = CrossEntropyLoss()
+
+            labels = labels.to(prediction_scores.device)
             masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
 
         if not return_dict:
@@ -1166,8 +1156,6 @@ def _tie_weights(self):
     DATA2VECTEXT_START_DOCSTRING,
 )
 class Data2VecTextForSequenceClassification(Data2VecTextPreTrainedModel):
-    _keys_to_ignore_on_load_missing = [r"position_ids"]
-
     def __init__(self, config):
         super().__init__(config)
         self.num_labels = config.num_labels
@@ -1222,6 +1210,8 @@ def forward(
 
         loss = None
         if labels is not None:
+            labels = labels.to(logits.device)
+
             if self.config.problem_type is None:
                 if self.num_labels == 1:
                     self.config.problem_type = "regression"
@@ -1263,8 +1253,6 @@ def forward(
     DATA2VECTEXT_START_DOCSTRING,
 )
 class Data2VecTextForMultipleChoice(Data2VecTextPreTrainedModel):
-    _keys_to_ignore_on_load_missing = [r"position_ids"]
-
     def __init__(self, config):
         super().__init__(config)
 
@@ -1335,6 +1323,8 @@ def forward(
         loss = None
         if labels is not None:
             loss_fct = CrossEntropyLoss()
+
+            labels = labels.to(reshaped_logits.device)
             loss = loss_fct(reshaped_logits, labels)
 
         if not return_dict:
@@ -1357,9 +1347,6 @@ def forward(
     DATA2VECTEXT_START_DOCSTRING,
 )
 class Data2VecTextForTokenClassification(Data2VecTextPreTrainedModel):
-    _keys_to_ignore_on_load_unexpected = [r"pooler"]
-    _keys_to_ignore_on_load_missing = [r"position_ids"]
-
     def __init__(self, config):
         super().__init__(config)
         self.num_labels = config.num_labels
@@ -1419,6 +1406,8 @@ def forward(
         loss = None
         if labels is not None:
             loss_fct = CrossEntropyLoss()
+
+            labels = labels.to(logits.device)
             loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
 
         if not return_dict:
@@ -1464,9 +1453,6 @@ def forward(self, features, **kwargs):
     DATA2VECTEXT_START_DOCSTRING,
 )
 class Data2VecTextForQuestionAnswering(Data2VecTextPreTrainedModel):
-    _keys_to_ignore_on_load_unexpected = [r"pooler"]
-    _keys_to_ignore_on_load_missing = [r"position_ids"]
-
     def __init__(self, config):
         super().__init__(config)
         self.num_labels = config.num_labels
diff --git a/src/transformers/models/data2vec/modeling_data2vec_vision.py b/src/transformers/models/data2vec/modeling_data2vec_vision.py
index 42a1edcb6493ca..77c9363fa217c4 100644
--- a/src/transformers/models/data2vec/modeling_data2vec_vision.py
+++ b/src/transformers/models/data2vec/modeling_data2vec_vision.py
@@ -150,22 +150,26 @@ def __init__(self, config: Data2VecVisionConfig) -> None:
         self.dropout = nn.Dropout(config.hidden_dropout_prob)
 
     def forward(self, pixel_values: torch.Tensor, bool_masked_pos: Optional[torch.BoolTensor] = None) -> torch.Tensor:
-        embeddings = self.patch_embeddings(pixel_values)
+        embeddings, (patch_height, patch_width) = self.patch_embeddings(
+            pixel_values, self.position_embeddings[:, 1:, :] if self.position_embeddings is not None else None
+        )
         batch_size, seq_len, _ = embeddings.size()
 
-        cls_tokens = self.cls_token.expand(batch_size, -1, -1)
         if bool_masked_pos is not None:
             mask_tokens = self.mask_token.expand(batch_size, seq_len, -1)
             # replace the masked visual tokens by mask_tokens
             w = bool_masked_pos.unsqueeze(-1).type_as(mask_tokens)
             embeddings = embeddings * (1 - w) + mask_tokens * w
 
-        embeddings = torch.cat((cls_tokens, embeddings), dim=1)
+        cls_tokens = self.cls_token.expand(batch_size, -1, -1)
         if self.position_embeddings is not None:
-            embeddings = embeddings + self.position_embeddings
+            cls_tokens = cls_tokens + self.position_embeddings[:, :1, :]
+
+        embeddings = torch.cat((cls_tokens, embeddings), dim=1)
+
         embeddings = self.dropout(embeddings)
 
-        return embeddings
+        return embeddings, (patch_height, patch_width)
 
 
 # Copied from transformers.models.beit.modeling_beit.BeitPatchEmbeddings with Beit->Data2VecVision
@@ -193,19 +197,29 @@ def __init__(self, config):
 
         self.projection = nn.Conv2d(num_channels, hidden_size, kernel_size=patch_size, stride=patch_size)
 
-    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
+    def forward(self, pixel_values: torch.Tensor, position_embedding: Optional[torch.Tensor] = None) -> torch.Tensor:
         batch_size, num_channels, height, width = pixel_values.shape
         if num_channels != self.num_channels:
             raise ValueError(
                 "Make sure that the channel dimension of the pixel values match with the one set in the configuration."
             )
-        if height != self.image_size[0] or width != self.image_size[1]:
-            raise ValueError(
-                f"Input image size ({height}*{width}) doesn't match model ({self.image_size[0]}*{self.image_size[1]})."
+
+        embeddings = self.projection(pixel_values)
+        patch_height, patch_width = embeddings.shape[2], embeddings.shape[3]
+
+        if position_embedding is not None:
+            # interpolate the position embedding to the corresponding size
+            position_embedding = position_embedding.view(1, self.patch_shape[0], self.patch_shape[1], -1).permute(
+                0, 3, 1, 2
+            )
+            position_embedding = nn.functional.interpolate(
+                position_embedding, size=(patch_height, patch_width), mode="bicubic"
             )
-        embeddings = self.projection(pixel_values).flatten(2).transpose(1, 2)
+            embeddings = embeddings + position_embedding
 
-        return embeddings
+        embeddings = embeddings.flatten(2).transpose(1, 2)
+
+        return embeddings, (patch_height, patch_width)
 
 
 # Copied from transformers.models.beit.modeling_beit.BeitSelfAttention with Beit->Data2VecVision
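A minimal sketch (not from the patch; grid sizes are illustrative) of the position-embedding interpolation added in this hunk: reshape the flat embedding to its 2D patch grid, resize it bicubically, then flatten it back:

```python
import torch
import torch.nn as nn

dim, old_grid, new_grid = 32, (14, 14), (16, 16)
pos = torch.randn(1, old_grid[0] * old_grid[1], dim)  # flat (1, num_patches, dim)

# (1, num_patches, dim) -> (1, dim, H, W), resize, then back to (1, H'*W', dim)
grid = pos.view(1, old_grid[0], old_grid[1], dim).permute(0, 3, 1, 2)
grid = nn.functional.interpolate(grid, size=new_grid, mode="bicubic")
resized = grid.permute(0, 2, 3, 1).reshape(1, new_grid[0] * new_grid[1], dim)
print(resized.shape)  # torch.Size([1, 256, 32])
```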
@@ -289,8 +303,8 @@ def forward(
 # Copied from transformers.models.beit.modeling_beit.BeitSelfOutput with Beit->Data2VecVision
 class Data2VecVisionSelfOutput(nn.Module):
     """
-    The residual connection is defined in Data2VecVisionLayer instead of here (as is the case with other models), due
-    to the layernorm applied before each block.
+    The residual connection is defined in Data2VecVisionLayer instead of here (as is the case with other models), due to the
+    layernorm applied before each block.
     """
 
     def __init__(self, config: Data2VecVisionConfig) -> None:
@@ -470,7 +484,7 @@ def __init__(self, config: Data2VecVisionConfig, window_size: tuple) -> None:
         relative_position_index[0:, 0] = self.num_relative_distance - 2
         relative_position_index[0, 0] = self.num_relative_distance - 1
 
-        self.register_buffer("relative_position_index", relative_position_index)
+        self.register_buffer("relative_position_index", relative_position_index, persistent=False)
 
     def forward(self) -> torch.Tensor:
         relative_position_bias = self.relative_position_bias_table[self.relative_position_index.view(-1)].view(
@@ -522,17 +536,11 @@ def forward(
             layer_head_mask = head_mask[i] if head_mask is not None else None
 
             if self.gradient_checkpointing and self.training:
-
-                def create_custom_forward(module):
-                    def custom_forward(*inputs):
-                        return module(*inputs, output_attentions)
-
-                    return custom_forward
-
-                layer_outputs = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(layer_module),
+                layer_outputs = self._gradient_checkpointing_func(
+                    layer_module.__call__,
                     hidden_states,
                     layer_head_mask,
+                    output_attentions,
                 )
             else:
                 relative_position_bias = (
@@ -585,10 +593,6 @@ def _init_weights(self, module):
             module.bias.data.zero_()
             module.weight.data.fill_(1.0)
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, Data2VecVisionEncoder):
-            module.gradient_checkpointing = value
-
 
 DATA2VEC_VISION_START_DOCSTRING = r"""
     This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it
@@ -673,6 +677,10 @@ def forward(
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
     ) -> Union[tuple, Data2VecVisionModelOutputWithPooling]:
+        r"""
+        bool_masked_pos (`torch.BoolTensor` of shape `(batch_size, num_patches)`, *optional*):
+            Boolean masked positions. Indicates which patches are masked (1) and which aren't (0).
+        """
         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
         output_hidden_states = (
             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
@@ -689,7 +697,7 @@ def forward(
         # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
         head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
 
-        embedding_output = self.embeddings(pixel_values, bool_masked_pos)
+        embedding_output, (patch_height, patch_width) = self.embeddings(pixel_values, bool_masked_pos)
 
         encoder_outputs = self.encoder(
             embedding_output,
@@ -1085,6 +1093,12 @@ def __init__(self, config: Data2VecVisionConfig) -> None:
         self.data2vec_vision = Data2VecVisionModel(config, add_pooling_layer=False)
 
         # FPNs
+        if len(self.config.out_indices) != 4:
+            raise ValueError(
+                "Data2VecVisionForSemanticSegmentation requires config.out_indices to be a list of 4 integers, "
+                "specifying which features to use from the backbone. One can use [3, 5, 7, 11] in case of "
+                "a base-sized architecture."
+            )
         self.fpn1 = nn.Sequential(
             nn.ConvTranspose2d(config.hidden_size, config.hidden_size, kernel_size=2, stride=2),
             nn.BatchNorm2d(config.hidden_size),
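Hedged illustration of the new guard: a config whose `out_indices` names exactly four backbone layers, as suggested for a base-sized model (randomly initialized, for shape checking only):

```python
from transformers import Data2VecVisionConfig, Data2VecVisionForSemanticSegmentation

config = Data2VecVisionConfig(out_indices=[3, 5, 7, 11])
model = Data2VecVisionForSemanticSegmentation(config)  # passes the length-4 check above
```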
@@ -1116,8 +1130,10 @@ def compute_loss(self, logits, auxiliary_logits, labels):
         # compute weighted loss
         loss_fct = CrossEntropyLoss(ignore_index=self.config.semantic_loss_ignore_index)
         main_loss = loss_fct(upsampled_logits, labels)
-        auxiliary_loss = loss_fct(upsampled_auxiliary_logits, labels)
-        loss = main_loss + self.config.auxiliary_loss_weight * auxiliary_loss
+        loss = main_loss
+        if auxiliary_logits is not None:
+            auxiliary_loss = loss_fct(upsampled_auxiliary_logits, labels)
+            loss += self.config.auxiliary_loss_weight * auxiliary_loss
 
         return loss
 
diff --git a/src/transformers/models/data2vec/modeling_tf_data2vec_vision.py b/src/transformers/models/data2vec/modeling_tf_data2vec_vision.py
index c837670b1a1f69..bc8ff9cfc9e619 100644
--- a/src/transformers/models/data2vec/modeling_tf_data2vec_vision.py
+++ b/src/transformers/models/data2vec/modeling_tf_data2vec_vision.py
@@ -14,16 +14,17 @@
 # limitations under the License.
 """ TF 2.0 Data2Vec Vision model."""
 
+
+from __future__ import annotations
+
 import collections.abc
 import math
 from dataclasses import dataclass
-from typing import Dict, List, Optional, Tuple, Union
+from typing import List, Optional, Tuple, Union
 
 import numpy as np
 import tensorflow as tf
 
-from transformers.tf_utils import shape_list, stable_softmax
-
 from ...activations_tf import get_tf_activation
 from ...modeling_tf_outputs import (
     TFBaseModelOutput,
@@ -36,9 +37,11 @@
     TFPreTrainedModel,
     TFSequenceClassificationLoss,
     get_initializer,
+    keras,
     keras_serializable,
     unpack_inputs,
 )
+from ...tf_utils import shape_list, stable_softmax
 from ...utils import (
     add_code_sample_docstrings,
     add_start_docstrings,
@@ -95,11 +98,11 @@ class TFData2VecVisionModelOutputWithPooling(TFBaseModelOutputWithPooling):
 
     last_hidden_state: tf.Tensor = None
     pooler_output: tf.Tensor = None
-    hidden_states: Optional[Tuple[tf.Tensor]] = None
-    attentions: Optional[Tuple[tf.Tensor]] = None
+    hidden_states: Tuple[tf.Tensor] | None = None
+    attentions: Tuple[tf.Tensor] | None = None
 
 
-class TFData2VecVisionDropPath(tf.keras.layers.Layer):
+class TFData2VecVisionDropPath(keras.layers.Layer):
     """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
     References:
         (1) github.com:rwightman/pytorch-image-models
@@ -119,7 +122,7 @@ def call(self, x, training=None):
         return x
 
 
-class TFData2VecVisionEmbeddings(tf.keras.layers.Layer):
+class TFData2VecVisionEmbeddings(keras.layers.Layer):
     """
     Construct the CLS token, position and patch embeddings. Optionally, also the mask token.
 
@@ -133,9 +136,9 @@ def __init__(self, config: Data2VecVisionConfig, **kwargs):
         self.num_patches = self.patch_embeddings.num_patches
         self.config = config
 
-        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob)
+        self.dropout = keras.layers.Dropout(config.hidden_dropout_prob)
 
-    def build(self, input_shape: tf.TensorShape):
+    def build(self, input_shape=None):
         self.cls_token = self.add_weight(
             shape=(1, 1, self.config.hidden_size),
             initializer=tf.random_normal_initializer(stddev=self.config.initializer_range),
@@ -162,9 +165,14 @@ def build(self, input_shape: tf.TensorShape):
         else:
             self.position_embeddings = None
 
-        super().build(input_shape)
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "patch_embeddings", None) is not None:
+            with tf.name_scope(self.patch_embeddings.name):
+                self.patch_embeddings.build(None)
 
-    def call(self, pixel_values: tf.Tensor, bool_masked_pos: Optional[tf.Tensor] = None) -> tf.Tensor:
+    def call(self, pixel_values: tf.Tensor, bool_masked_pos: tf.Tensor | None = None) -> tf.Tensor:
         embeddings = self.patch_embeddings(pixel_values)
         batch_size, seq_len, projection_dim = shape_list(embeddings)
 
@@ -186,7 +194,7 @@ def call(self, pixel_values: tf.Tensor, bool_masked_pos: Optional[tf.Tensor] = N
         return embeddings
 
 
-class TFData2VecVisionPatchEmbeddings(tf.keras.layers.Layer):
+class TFData2VecVisionPatchEmbeddings(keras.layers.Layer):
     """
     Image to Patch Embedding.
     """
@@ -208,7 +216,7 @@ def __init__(self, config: Data2VecVisionConfig, **kwargs):
         self.patch_shape = patch_shape
         self.num_channels = num_channels
 
-        self.projection = tf.keras.layers.Conv2D(
+        self.projection = keras.layers.Conv2D(
             filters=hidden_size,
             kernel_size=patch_size,
             strides=patch_size,
@@ -233,7 +241,7 @@ def call(self, pixel_values: tf.Tensor, training: bool = False) -> tf.Tensor:
                     f" ({self.image_size[0]}*{self.image_size[1]})."
                 )
 
-        # When running on CPU, `tf.keras.layers.Conv2D` doesn't support `NCHW` format.
+        # When running on CPU, `keras.layers.Conv2D` doesn't support `NCHW` format.
         # So change the input format from `NCHW` to `NHWC`.
         # shape = (batch_size, in_height, in_width, in_channels=num_channels)
         pixel_values = tf.transpose(pixel_values, perm=(0, 2, 3, 1))
@@ -246,8 +254,16 @@ def call(self, pixel_values: tf.Tensor, training: bool = False) -> tf.Tensor:
 
         return tf.reshape(tensor=projection, shape=(batch_size, num_patches, -1))
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "projection", None) is not None:
+            with tf.name_scope(self.projection.name):
+                self.projection.build([None, None, None, self.num_channels])
+
 
-class TFData2VecVisionSelfAttention(tf.keras.layers.Layer):
+class TFData2VecVisionSelfAttention(keras.layers.Layer):
     def __init__(self, config: Data2VecVisionConfig, window_size: Optional[tuple] = None, **kwargs):
         super().__init__(**kwargs)
 
@@ -262,19 +278,19 @@ def __init__(self, config: Data2VecVisionConfig, window_size: Optional[tuple] =
         self.all_head_size = self.num_attention_heads * self.attention_head_size
         self.sqrt_att_head_size = math.sqrt(self.attention_head_size)
 
-        self.query = tf.keras.layers.Dense(
+        self.query = keras.layers.Dense(
             units=self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="query"
         )
-        self.key = tf.keras.layers.Dense(
+        self.key = keras.layers.Dense(
             units=self.all_head_size,
             kernel_initializer=get_initializer(config.initializer_range),
             name="key",
             use_bias=False,
         )
-        self.value = tf.keras.layers.Dense(
+        self.value = keras.layers.Dense(
             units=self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="value"
         )
-        self.dropout = tf.keras.layers.Dropout(rate=config.attention_probs_dropout_prob)
+        self.dropout = keras.layers.Dropout(rate=config.attention_probs_dropout_prob)
 
         if window_size:
             self.relative_position_bias = TFData2VecVisionRelativePositionBias(
@@ -282,6 +298,7 @@ def __init__(self, config: Data2VecVisionConfig, window_size: Optional[tuple] =
             )
         else:
             self.relative_position_bias = None
+        self.config = config
 
     def transpose_for_scores(self, tensor: tf.Tensor, batch_size: int) -> tf.Tensor:
         # Reshape from [batch_size, seq_length, all_head_size] to [batch_size, seq_length, num_attention_heads, attention_head_size]
@@ -342,8 +359,25 @@ def call(
 
         return outputs
 
-
-class TFData2VecVisionSelfOutput(tf.keras.layers.Layer):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "query", None) is not None:
+            with tf.name_scope(self.query.name):
+                self.query.build([None, None, self.config.hidden_size])
+        if getattr(self, "key", None) is not None:
+            with tf.name_scope(self.key.name):
+                self.key.build([None, None, self.config.hidden_size])
+        if getattr(self, "value", None) is not None:
+            with tf.name_scope(self.value.name):
+                self.value.build([None, None, self.config.hidden_size])
+        if getattr(self, "relative_position_bias", None) is not None:
+            with tf.name_scope(self.relative_position_bias.name):
+                self.relative_position_bias.build(None)
+
+
+class TFData2VecVisionSelfOutput(keras.layers.Layer):
     """
     The residual connection is defined in TFData2VecVisionLayer instead of here (as is the case with other models), due
     to the layernorm applied before each block.
@@ -352,10 +386,11 @@ class TFData2VecVisionSelfOutput(tf.keras.layers.Layer):
     def __init__(self, config: Data2VecVisionConfig, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(
+        self.dense = keras.layers.Dense(
             units=config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
         )
-        self.dropout = tf.keras.layers.Dropout(rate=config.hidden_dropout_prob)
+        self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob)
+        self.config = config
 
     def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, gamma=None, training: bool = False) -> tf.Tensor:
         hidden_states = self.dense(inputs=hidden_states)
@@ -363,8 +398,16 @@ def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, gamma=None, tr
 
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+
 
-class TFData2VecVisionAttention(tf.keras.layers.Layer):
+class TFData2VecVisionAttention(keras.layers.Layer):
     def __init__(self, config: Data2VecVisionConfig, window_size: Optional[tuple] = None, **kwargs):
         super().__init__(**kwargs)
 
@@ -396,13 +439,24 @@ def call(
 
         return outputs
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "attention", None) is not None:
+            with tf.name_scope(self.attention.name):
+                self.attention.build(None)
+        if getattr(self, "dense_output", None) is not None:
+            with tf.name_scope(self.dense_output.name):
+                self.dense_output.build(None)
+
 
 # Copied from transformers.models.vit.modeling_tf_vit.TFViTIntermediate with ViT->Data2VecVision
-class TFData2VecVisionIntermediate(tf.keras.layers.Layer):
+class TFData2VecVisionIntermediate(keras.layers.Layer):
     def __init__(self, config: Data2VecVisionConfig, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(
+        self.dense = keras.layers.Dense(
             units=config.intermediate_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
         )
 
@@ -410,6 +464,7 @@ def __init__(self, config: Data2VecVisionConfig, **kwargs):
             self.intermediate_act_fn = get_tf_activation(config.hidden_act)
         else:
             self.intermediate_act_fn = config.hidden_act
+        self.config = config
 
     def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
         hidden_states = self.dense(inputs=hidden_states)
@@ -417,15 +472,24 @@ def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
 
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+
 
-class TFData2VecVisionOutput(tf.keras.layers.Layer):
+class TFData2VecVisionOutput(keras.layers.Layer):
     def __init__(self, config: Data2VecVisionConfig, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(
+        self.dense = keras.layers.Dense(
             units=config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
         )
-        self.dropout = tf.keras.layers.Dropout(rate=config.hidden_dropout_prob)
+        self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob)
+        self.config = config
 
     def call(self, hidden_states: tf.Tensor, training: bool = False) -> tf.Tensor:
         hidden_states = self.dense(inputs=hidden_states)
@@ -433,8 +497,16 @@ def call(self, hidden_states: tf.Tensor, training: bool = False) -> tf.Tensor:
 
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.intermediate_size])
+
 
-class TFData2VecVisionLayer(tf.keras.layers.Layer):
+class TFData2VecVisionLayer(keras.layers.Layer):
     """This corresponds to the Block class in the timm implementation."""
 
     def __init__(
@@ -447,22 +519,18 @@ def __init__(
         self.intermediate = TFData2VecVisionIntermediate(config, name="intermediate")
         self.data2vec_output = TFData2VecVisionOutput(config, name="output")
 
-        self.layernorm_before = tf.keras.layers.LayerNormalization(
-            epsilon=config.layer_norm_eps, name="layernorm_before"
-        )
-        self.layernorm_after = tf.keras.layers.LayerNormalization(
-            epsilon=config.layer_norm_eps, name="layernorm_after"
-        )
+        self.layernorm_before = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layernorm_before")
+        self.layernorm_after = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layernorm_after")
         # Using `layers.Activation` instead of `tf.identity` to better control `training`
         # behaviour.
         self.drop_path = (
             TFData2VecVisionDropPath(drop_path_rate, name="drop_path")
             if drop_path_rate > 0.0
-            else tf.keras.layers.Activation("linear", name="drop_path")
+            else keras.layers.Activation("linear", name="drop_path")
         )
         self.init_values = config.layer_scale_init_value
 
-    def build(self, input_shape: tf.TensorShape):
+    def build(self, input_shape: tf.TensorShape = None):
         if self.init_values > 0:
             self.lambda_1 = self.add_weight(
                 shape=(self.config.hidden_size),
@@ -481,7 +549,27 @@ def build(self, input_shape: tf.TensorShape):
         else:
             self.lambda_1, self.lambda_2 = None, None
 
-        super().build(input_shape)
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "attention", None) is not None:
+            with tf.name_scope(self.attention.name):
+                self.attention.build(None)
+        if getattr(self, "intermediate", None) is not None:
+            with tf.name_scope(self.intermediate.name):
+                self.intermediate.build(None)
+        if getattr(self, "data2vec_output", None) is not None:
+            with tf.name_scope(self.data2vec_output.name):
+                self.data2vec_output.build(None)
+        if getattr(self, "layernorm_before", None) is not None:
+            with tf.name_scope(self.layernorm_before.name):
+                self.layernorm_before.build([None, None, self.config.hidden_size])
+        if getattr(self, "layernorm_after", None) is not None:
+            with tf.name_scope(self.layernorm_after.name):
+                self.layernorm_after.build([None, None, self.config.hidden_size])
+        if getattr(self, "drop_path", None) is not None:
+            with tf.name_scope(self.drop_path.name):
+                self.drop_path.build(None)
 
     def call(
         self,
@@ -528,7 +616,7 @@ def call(
 
 # Taken and modified from here:
 # https://github.com/leondgarse/keras_cv_attention_models/blob/main/keras_cv_attention_models/beit/beit.py#L28
-class TFData2VecVisionRelativePositionBias(tf.keras.layers.Layer):
+class TFData2VecVisionRelativePositionBias(keras.layers.Layer):
     def __init__(self, config: Data2VecVisionConfig, window_size: tuple, **kwargs) -> None:
         super().__init__(**kwargs)
         self.config = config
@@ -584,7 +672,7 @@ def call(self, inputs=None) -> tf.Tensor:
         return tf.transpose(relative_position_bias, [2, 0, 1])
 
 
-class TFData2VecVisionEncoder(tf.keras.layers.Layer):
+class TFData2VecVisionEncoder(keras.layers.Layer):
     def __init__(self, config: Data2VecVisionConfig, window_size: Optional[tuple] = None, **kwargs):
         super().__init__(**kwargs)
         self.config = config
@@ -596,7 +684,7 @@ def __init__(self, config: Data2VecVisionConfig, window_size: Optional[tuple] =
             self.relative_position_bias = None
 
         # stochastic depth decay rule
-        dpr = [x for x in tf.linspace(0.0, config.drop_path_rate, config.num_hidden_layers)]
+        dpr = list(tf.linspace(0.0, config.drop_path_rate, config.num_hidden_layers))
         self.layer = [
             TFData2VecVisionLayer(
                 config,
@@ -610,7 +698,7 @@ def __init__(self, config: Data2VecVisionConfig, window_size: Optional[tuple] =
     def call(
         self,
         hidden_states: tf.Tensor,
-        head_mask: Optional[tf.Tensor] = None,
+        head_mask: tf.Tensor | None = None,
         output_attentions: bool = False,
         output_hidden_states: bool = False,
         return_dict: bool = True,
@@ -648,9 +736,21 @@ def call(
             attentions=all_self_attentions,
         )
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "relative_position_bias", None) is not None:
+            with tf.name_scope(self.relative_position_bias.name):
+                self.relative_position_bias.build(None)
+        if getattr(self, "layer", None) is not None:
+            for layer in self.layer:
+                with tf.name_scope(layer.name):
+                    layer.build(None)
+
 
 @keras_serializable
-class TFData2VecVisionMainLayer(tf.keras.layers.Layer):
+class TFData2VecVisionMainLayer(keras.layers.Layer):
     config_class = Data2VecVisionConfig
 
     def __init__(self, config: Data2VecVisionConfig, add_pooling_layer: bool = True, **kwargs):
@@ -666,14 +766,14 @@ def __init__(self, config: Data2VecVisionConfig, add_pooling_layer: bool = True,
         self.layernorm = (
             tf.identity
             if config.use_mean_pooling
-            else tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layernorm")
+            else keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layernorm")
         )
 
         # We are setting the `data_format` like so because from here on we will revert to the
         # NCHW output format
         self.pooler = TFData2VecVisionPooler(config, name="pooler") if add_pooling_layer else None
 
-    def get_input_embeddings(self) -> tf.keras.layers.Layer:
+    def get_input_embeddings(self) -> keras.layers.Layer:
         return self.embeddings.patch_embeddings
 
     def _prune_heads(self, heads_to_prune):
@@ -686,9 +786,9 @@ class PreTrainedModel
     @unpack_inputs
     def call(
         self,
-        pixel_values: Optional[tf.Tensor] = None,
-        bool_masked_pos: Optional[tf.Tensor] = None,
-        head_mask: Optional[tf.Tensor] = None,
+        pixel_values: tf.Tensor | None = None,
+        bool_masked_pos: tf.Tensor | None = None,
+        head_mask: tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
@@ -739,15 +839,34 @@ def call(
             attentions=encoder_outputs.attentions,
         )
 
-
-class TFData2VecVisionPooler(tf.keras.layers.Layer):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "embeddings", None) is not None:
+            with tf.name_scope(self.embeddings.name):
+                self.embeddings.build(None)
+        if getattr(self, "encoder", None) is not None:
+            with tf.name_scope(self.encoder.name):
+                self.encoder.build(None)
+        if getattr(self, "layernorm", None) is not None:
+            if hasattr(self.layernorm, "name"):
+                with tf.name_scope(self.layernorm.name):
+                    self.layernorm.build((None, self.config.hidden_size))
+        if getattr(self, "pooler", None) is not None:
+            with tf.name_scope(self.pooler.name):
+                self.pooler.build(None)
+
+
+class TFData2VecVisionPooler(keras.layers.Layer):
     def __init__(self, config: Data2VecVisionConfig, **kwargs):
         super().__init__(**kwargs)
         self.layernorm = (
-            tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layernorm")
+            keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layernorm")
             if config.use_mean_pooling
             else None
         )
+        self.config = config
 
     def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
         if self.layernorm is not None:
@@ -760,6 +879,15 @@ def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
 
         return pooled_output
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "layernorm", None) is not None:
+            if hasattr(self.layernorm, "name"):
+                with tf.name_scope(self.layernorm.name):
+                    self.layernorm.build((None, self.config.hidden_size))
+
 
 class TFData2VecVisionPreTrainedModel(TFPreTrainedModel):
     """
@@ -772,43 +900,13 @@ class TFData2VecVisionPreTrainedModel(TFPreTrainedModel):
     main_input_name = "pixel_values"
     _keys_to_ignore_on_load_unexpected = [r"relative_position_index"]
 
-    @property
-    def dummy_inputs(self) -> Dict[str, tf.Tensor]:
-        """
-        Dummy inputs to build the network. Returns:
-            `Dict[str, tf.Tensor]`: The dummy inputs.
-        """
-        VISION_DUMMY_INPUTS = tf.random.uniform(
-            shape=(3, self.config.num_channels, self.config.image_size, self.config.image_size),
-            dtype=tf.float32,
-        )
-        return {"pixel_values": tf.constant(VISION_DUMMY_INPUTS)}
-
-    @tf.function(
-        input_signature=[
-            {
-                "pixel_values": tf.TensorSpec((None, None, None, None), tf.float32, name="pixel_values"),
-            }
-        ]
-    )
-    def serving(self, inputs):
-        """
-        Method used for serving the model.
-
-        Args:
-            inputs (`Dict[str, tf.Tensor]`):
-                The input of the saved model as a dictionary of tensors.
-        """
-        output = self.call(inputs)
-        return self.serving_output(output)
-
 
 DATA2VEC_VISION_START_DOCSTRING = r"""
     This model inherits from [`TFPreTrainedModel`]. Check the superclass documentation for the generic methods the
     library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
     etc.).
 
-    This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
+    This model is also a [keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
     as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage and
     behavior.
 
@@ -900,14 +998,18 @@ def get_input_embeddings(self):
     )
     def call(
         self,
-        pixel_values: Optional[TFModelInputType] = None,
-        bool_masked_pos: Optional[tf.Tensor] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        pixel_values: TFModelInputType | None = None,
+        bool_masked_pos: tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
         training: bool = False,
     ) -> Union[tuple, TFData2VecVisionModelOutputWithPooling]:
+        r"""
+        bool_masked_pos (`tf.Tensor` of shape `(batch_size, num_patches)`, *optional*):
+            Boolean masked positions. Indicates which patches are masked (1) and which aren't (0).
+        """
         outputs = self.data2vec_vision(
             pixel_values=pixel_values,
             bool_masked_pos=bool_masked_pos,
@@ -920,16 +1022,13 @@ def call(
 
         return outputs
 
-    def serving_output(self, output: TFData2VecVisionModelOutputWithPooling) -> TFData2VecVisionModelOutputWithPooling:
-        hidden_states = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attentions = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFData2VecVisionModelOutputWithPooling(
-            last_hidden_state=output.last_hidden_state,
-            pooler_output=output.pooler_output,
-            hidden_states=hidden_states,
-            attentions=attentions,
-        )
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "data2vec_vision", None) is not None:
+            with tf.name_scope(self.data2vec_vision.name):
+                self.data2vec_vision.build(None)
 
 
 @add_start_docstrings(
@@ -947,11 +1046,12 @@ def __init__(self, config: Data2VecVisionConfig, *inputs, **kwargs):
         self.data2vec_vision = TFData2VecVisionMainLayer(config, add_pooling_layer=True, name="data2vec_vision")
 
         # Classifier head
-        self.classifier = tf.keras.layers.Dense(
+        self.classifier = keras.layers.Dense(
             units=config.num_labels,
             kernel_initializer=get_initializer(config.initializer_range),
             name="classifier",
         )
+        self.config = config
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(DATA2VEC_VISION_INPUTS_DOCSTRING)
@@ -963,12 +1063,12 @@ def __init__(self, config: Data2VecVisionConfig, *inputs, **kwargs):
     )
     def call(
         self,
-        pixel_values: Optional[TFModelInputType] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        pixel_values: TFModelInputType | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[TFSequenceClassifierOutput, tuple]:
         r"""
@@ -1003,14 +1103,19 @@ def call(
             attentions=outputs.attentions,
         )
 
-    def serving_output(self, output: TFSequenceClassifierOutput) -> TFSequenceClassifierOutput:
-        hidden_states = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attentions = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFSequenceClassifierOutput(logits=output.logits, hidden_states=hidden_states, attentions=attentions)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "data2vec_vision", None) is not None:
+            with tf.name_scope(self.data2vec_vision.name):
+                self.data2vec_vision.build(None)
+        if getattr(self, "classifier", None) is not None:
+            with tf.name_scope(self.classifier.name):
+                self.classifier.build([None, None, self.config.hidden_size])
 
 
-class TFData2VecVisionConvModule(tf.keras.layers.Layer):
+class TFData2VecVisionConvModule(keras.layers.Layer):
     """
     A convolutional block that bundles conv/norm/activation layers. This block simplifies the usage of convolution
     layers, which are commonly used with a norm layer (e.g., BatchNorm) and activation layer (e.g., ReLU).
@@ -1020,6 +1125,7 @@ class TFData2VecVisionConvModule(tf.keras.layers.Layer):
 
     def __init__(
         self,
+        in_channels: int,
         out_channels: int,
         kernel_size: Union[int, Tuple[int, int]],
         padding: str = "valid",
@@ -1028,7 +1134,7 @@ def __init__(
         **kwargs,
     ) -> None:
         super().__init__(**kwargs)
-        self.conv = tf.keras.layers.Conv2D(
+        self.conv = keras.layers.Conv2D(
             filters=out_channels,
             kernel_size=kernel_size,
             padding=padding,
@@ -1036,8 +1142,10 @@ def __init__(
             dilation_rate=dilation,
             name="conv",
         )
-        self.bn = tf.keras.layers.BatchNormalization(name="bn", momentum=0.9, epsilon=1e-5)
+        self.bn = keras.layers.BatchNormalization(name="bn", momentum=0.9, epsilon=1e-5)
         self.activation = tf.nn.relu
+        self.in_channels = in_channels
+        self.out_channels = out_channels
 
     def call(self, input: tf.Tensor) -> tf.Tensor:
         output = self.conv(input)
@@ -1045,91 +1153,143 @@ def call(self, input: tf.Tensor) -> tf.Tensor:
         output = self.activation(output)
         return output
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "conv", None) is not None:
+            with tf.name_scope(self.conv.name):
+                self.conv.build([None, None, None, self.in_channels])
+        if getattr(self, "bn", None) is not None:
+            with tf.name_scope(self.bn.name):
+                self.bn.build((None, None, None, self.out_channels))
 
-# Copied from:
-# https://gist.github.com/Rocketknight1/43abbe6e73f1008e6e459486e01e0ceb
-class TFAdaptiveAvgPool1D(tf.keras.layers.Layer):
-    def __init__(self, output_dim, mode="dense", **kwargs):
-        super().__init__(**kwargs)
-        self.output_dim = output_dim
-        self.mode = mode
-        self.map = None
 
-    def build(self, input_shape):
-        super().build(input_shape)
-        """We pre-compute the sparse matrix for the build() step once. The below code comes
-        from https://stackoverflow.com/questions/53841509/how-does-adaptive-pooling-in-pytorch-work/63603993#63603993."""
-
-        def get_kernels(ind, outd) -> List:
-            """Returns a List [(kernel_offset_start,kernel_length)] defining all the pooling kernels for a 1-D adaptive
-            pooling layer that takes an input of dimension `ind` and yields an output of dimension `outd`"""
-
-            def start_index(a, b, c):
-                return math.floor((float(a) * float(c)) / b)
-
-            def end_index(a, b, c):
-                return math.ceil((float(a + 1) * float(c)) / b)
-
-            results = []
-            for ow in range(outd):
-                start = start_index(ow, outd, ind)
-                end = end_index(ow, outd, ind)
-                sz = end - start
-                results.append((start, sz))
-            return results
-
-        in_dim = int(input_shape[-1])
-        kernels = get_kernels(in_dim, self.output_dim)
-        sparse_map = np.zeros((in_dim, self.output_dim), dtype=np.float32)
-        for i, kernel in enumerate(kernels):
-            sparse_map[kernel[0] : kernel[0] + kernel[1], i] = 1 / kernel[1]
-        if self.mode == "dense":
-            self.map = tf.constant(sparse_map)
+class TFAdaptiveAvgPool2D(keras.layers.Layer):
+    def __init__(self, output_dims: Tuple[int, int], input_ordering: str = "NHWC", **kwargs):
+        super().__init__(**kwargs)
+        self.output_dims = output_dims
+        self.input_ordering = input_ordering
+        if input_ordering not in ("NCHW", "NHWC"):
+            raise ValueError("Unrecognized input_ordering, should be 'NCHW' or 'NHWC'!")
+        self.h_axis = input_ordering.index("H")
+        self.w_axis = input_ordering.index("W")
+
+    def pseudo_1d_pool(self, inputs: tf.Tensor, h_pooling: bool):
+        # Figure out which axis we're pooling on
+        if h_pooling:
+            axis = self.h_axis
+            output_dim = self.output_dims[0]
         else:
-            self.map = tf.sparse.from_dense(sparse_map)
+            axis = self.w_axis
+            output_dim = self.output_dims[1]
+        input_dim = inputs.shape[axis]
+
+        # Figure out the potential pooling windows
+        # This is the key idea - the torch op always uses only two
+        # consecutive pooling window sizes, like 3 and 4. Therefore,
+        # if we pool with both possible sizes, we simply need to gather
+        # the 'correct' pool at each position to reimplement the torch op.
+        small_window = math.ceil(input_dim / output_dim)
+        big_window = small_window + 1
+        if h_pooling:
+            output_dim = self.output_dims[0]
+            small_window_shape = (small_window, 1)
+            big_window_shape = (big_window, 1)
+        else:
+            output_dim = self.output_dims[1]
+            small_window_shape = (1, small_window)
+            big_window_shape = (1, big_window)
+
+        # For resizes to 1, or integer resizes, we can take quick shortcuts
+        if output_dim == input_dim:
+            return inputs
+        elif output_dim == 1:
+            return tf.reduce_mean(inputs, axis=axis, keepdims=True)
+        elif input_dim % output_dim == 0:
+            return tf.nn.avg_pool2d(
+                inputs,
+                ksize=small_window_shape,
+                strides=small_window_shape,
+                padding="VALID",
+                data_format=self.input_ordering,
+            )
+        # When upscaling by an integer factor we can also take a quick shortcut
+        elif output_dim > input_dim and output_dim % input_dim == 0:
+            return tf.repeat(inputs, repeats=output_dim // input_dim, axis=axis)
+
+        # For non-integer resizes, we pool with both possible window sizes and concatenate them
+        if output_dim < input_dim:
+            small_pool = tf.nn.avg_pool2d(
+                inputs, ksize=small_window_shape, strides=1, padding="VALID", data_format=self.input_ordering
+            )
+            big_pool = tf.nn.avg_pool2d(
+                inputs, ksize=big_window_shape, strides=1, padding="VALID", data_format=self.input_ordering
+            )
+            both_pool = tf.concat([small_pool, big_pool], axis=axis)
+        else:
+            # When we're actually upscaling instead, we build the pools a bit differently
+            small_pool = inputs
+            big_pool = tf.nn.avg_pool2d(
+                inputs, ksize=big_window_shape, strides=1, padding="VALID", data_format=self.input_ordering
+            )
+            both_pool = tf.concat([small_pool, big_pool], axis=axis)
+
+        # We compute vectors of the start and end positions for each pooling window
+        # Each (start, end) pair here corresponds to a single output position
+        window_starts = tf.math.floor((tf.range(output_dim, dtype=tf.float32) * input_dim) / output_dim)
+        window_starts = tf.cast(window_starts, tf.int64)
+        window_ends = tf.math.ceil((tf.range(1, output_dim + 1, dtype=tf.float32) * input_dim) / output_dim)
+        window_ends = tf.cast(window_ends, tf.int64)
+
+        # pool_selector is a boolean array of shape (output_dim,) where 1 indicates that the output position
+        # has a big receptive field and 0 indicates that it has a small receptive field
+        pool_selector = tf.cast(window_ends - window_starts - small_window, tf.bool)
+
+        # Since we concatenated the small and big pools, we need to do a bit of
+        # pointer arithmetic to get the indices of the big pools
+        small_indices = window_starts
+        big_indices = window_starts + small_pool.shape[axis]
+
+        # Finally, we use the pool_selector to generate a list of indices, one per output position
+        gather_indices = tf.where(pool_selector, big_indices, small_indices)
 
-    def call(self, inputs):
-        if self.mode == "dense":
-            return inputs @ self.map
+        # Gathering from those indices yields the final, correct pooling
+        return tf.gather(both_pool, gather_indices, axis=axis)
+
+    def call(self, inputs: tf.Tensor):
+        if self.input_ordering == "NHWC":
+            input_shape = inputs.shape[1:3]
         else:
-            input_dims = inputs.shape
-            input_matrix = tf.reshape(inputs, (-1, input_dims[-1]))
-            out = tf.sparse.sparse_dense_matmul(input_matrix, self.map)
-            return tf.reshape(out, input_dims[:-1].as_list() + [-1])
+            input_shape = inputs.shape[2:]
 
-    def get_config(self):
-        config = super().get_config()
-        config.update({"output_dim": self.output_dim, "mode": self.mode})
-        return config
+        # We break the task down into each possible case
+        # Firstly, if we're resizing down to 1, it's just tf.reduce_mean
+        if self.output_dims[0] == self.output_dims[1] == 1:
+            if self.input_ordering == "NHWC":
+                reduce_dims = [1, 2]
+            else:
+                reduce_dims = [2, 3]
+            return tf.reduce_mean(inputs, axis=reduce_dims, keepdims=True)
+        # Secondly, if we're resizing by an integer factor on both dimensions, we can take a quick shortcut
+        elif input_shape[0] % self.output_dims[0] == 0 and input_shape[1] % self.output_dims[1] == 0:
+            h_resize = int(input_shape[0] // self.output_dims[0])
+            w_resize = int(input_shape[1] // self.output_dims[1])
+            return tf.nn.avg_pool2d(
+                inputs,
+                ksize=(h_resize, w_resize),
+                strides=(h_resize, w_resize),
+                padding="VALID",
+                data_format=self.input_ordering,
+            )
+        else:
+            # Finally, if we can't take the shortcut, we do a 1D pool on each axis. pseudo_1d_pool will take a shortcut
+            # for dimensions where an integer resize is possible. It can also handle upscaling.
+            h_pooled = self.pseudo_1d_pool(inputs, h_pooling=True)
+            return self.pseudo_1d_pool(h_pooled, h_pooling=False)
 
 
-class TFAdaptiveAvgPool2D(tf.keras.layers.Layer):
-    def __init__(self, output_shape, mode="dense", **kwargs):
-        super().__init__(**kwargs)
-        self.mode = mode
-        self.h_pool = TFAdaptiveAvgPool1D(output_shape[0], mode=mode, name="h_pool")
-        self.w_pool = TFAdaptiveAvgPool1D(output_shape[1], mode=mode, name="w_pool")
-
-    def call(self, inputs):
-        # Rearrange from NHWC -> NCHW
-        inputs = tf.transpose(inputs, perm=[0, 3, 1, 2])
-        # Perform W-pooling
-        inputs = self.w_pool(inputs)
-        # Rearrange NCHW -> NCWH
-        inputs = tf.transpose(inputs, perm=[0, 1, 3, 2])
-        # Perform H-pooling
-        inputs = self.h_pool(inputs)
-        # Rearrange from NCWH -> NHWC
-        inputs = tf.transpose(inputs, perm=[0, 3, 2, 1])
-        return inputs
-
-    def get_config(self):
-        config = super().get_config()
-        config.update({"mode": self.mode})
-        return config
-
-
-class TFData2VecVisionPyramidPoolingModule(tf.keras.layers.Layer):
+class TFData2VecVisionPyramidPoolingModule(keras.layers.Layer):
     """
     Pyramid Pooling Module (PPM) used in PSPNet.
 
@@ -1141,18 +1301,21 @@ class TFData2VecVisionPyramidPoolingModule(tf.keras.layers.Layer):
     Based on OpenMMLab's implementation, found in https://github.com/open-mmlab/mmsegmentation.
     """
 
-    def __init__(self, pool_scales: Tuple[int, ...], channels: int, **kwargs) -> None:
+    def __init__(self, pool_scales: Tuple[int, ...], in_channels: int, out_channels: int, **kwargs) -> None:
         super().__init__(**kwargs)
         self.pool_scales = pool_scales
-        self.channels = channels
+        self.in_channels = in_channels
+        self.out_channels = out_channels
 
         self.layer_list = []
         for idx, pool_scale in enumerate(pool_scales):
             pool_scale = pool_scale if isinstance(pool_scale, collections.abc.Iterable) else (pool_scale, pool_scale)
             self.layer_list.append(
                 [
-                    TFAdaptiveAvgPool2D(output_shape=pool_scale),
-                    TFData2VecVisionConvModule(out_channels=self.channels, kernel_size=1, name=f"{idx}.1"),
+                    TFAdaptiveAvgPool2D(output_dims=pool_scale),
+                    TFData2VecVisionConvModule(
+                        in_channels=in_channels, out_channels=self.out_channels, kernel_size=1, name=f"{idx}.1"
+                    ),
                 ]
             )
 
@@ -1169,8 +1332,14 @@ def call(self, x: tf.Tensor) -> List[tf.Tensor]:
             ppm_outs.append(upsampled_ppm_out)
         return ppm_outs
 
+    def build(self, input_shape=None):
+        for layer in self.layer_list:
+            for layer_module in layer:
+                with tf.name_scope(layer_module.name):
+                    layer_module.build(None)
 
-class TFData2VecVisionUperHead(tf.keras.layers.Layer):
+
+class TFData2VecVisionUperHead(keras.layers.Layer):
     """
     Unified Perceptual Parsing for Scene Understanding. This head is the implementation of
     [UPerNet](https://arxiv.org/abs/1807.10221).
@@ -1184,24 +1353,42 @@ def __init__(self, config: Data2VecVisionConfig, **kwargs) -> None:
         self.pool_scales = config.pool_scales  # e.g. (1, 2, 3, 6)
         self.in_channels = [config.hidden_size] * 4  # e.g. [768, 768, 768, 768]
         self.channels = config.hidden_size
-        self.classifier = tf.keras.layers.Conv2D(config.num_labels, kernel_size=1, name="classifier")
+        self.classifier = keras.layers.Conv2D(config.num_labels, kernel_size=1, name="classifier")
 
         # PSP Module
-        self.psp_modules = TFData2VecVisionPyramidPoolingModule(self.pool_scales, self.channels, name="psp_modules")
-        self.bottleneck = TFData2VecVisionConvModule(self.channels, kernel_size=3, padding="same", name="bottleneck")
+        self.psp_modules = TFData2VecVisionPyramidPoolingModule(
+            self.pool_scales, self.in_channels[-1], self.channels, name="psp_modules"
+        )
+        self.bottleneck = TFData2VecVisionConvModule(
+            self.in_channels[-1] + len(self.pool_scales) * self.channels,
+            self.channels,
+            kernel_size=3,
+            padding="same",
+            name="bottleneck",
+        )
         # FPN Module
         self.lateral_convs = []
         self.fpn_convs = []
-        for idx, _ in enumerate(self.in_channels[:-1]):  # skip the top layer
-            l_conv = TFData2VecVisionConvModule(out_channels=self.channels, kernel_size=1, name=f"lateral_convs.{idx}")
+        for idx, in_channels in enumerate(self.in_channels[:-1]):  # skip the top layer
+            l_conv = TFData2VecVisionConvModule(
+                in_channels, out_channels=self.channels, kernel_size=1, name=f"lateral_convs.{idx}"
+            )
             fpn_conv = TFData2VecVisionConvModule(
-                out_channels=self.channels, kernel_size=3, padding="same", name=f"fpn_convs.{idx}"
+                in_channels=self.channels,
+                out_channels=self.channels,
+                kernel_size=3,
+                padding="same",
+                name=f"fpn_convs.{idx}",
             )
             self.lateral_convs.append(l_conv)
             self.fpn_convs.append(fpn_conv)
 
         self.fpn_bottleneck = TFData2VecVisionConvModule(
-            out_channels=self.channels, kernel_size=3, padding="same", name="fpn_bottleneck"
+            in_channels=len(self.in_channels) * self.channels,
+            out_channels=self.channels,
+            kernel_size=3,
+            padding="same",
+            name="fpn_bottleneck",
         )
 
     def psp_forward(self, inputs):
@@ -1238,8 +1425,31 @@ def call(self, encoder_hidden_states: tf.Tensor) -> tf.Tensor:
 
         return output
 
-
-class TFData2VecVisionFCNHead(tf.keras.layers.Layer):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "classifier", None) is not None:
+            with tf.name_scope(self.classifier.name):
+                self.classifier.build([None, None, None, self.channels])
+        if getattr(self, "psp_modules", None) is not None:
+            with tf.name_scope(self.psp_modules.name):
+                self.psp_modules.build(None)
+        if getattr(self, "bottleneck", None) is not None:
+            with tf.name_scope(self.bottleneck.name):
+                self.bottleneck.build(None)
+        if getattr(self, "fpn_bottleneck", None) is not None:
+            with tf.name_scope(self.fpn_bottleneck.name):
+                self.fpn_bottleneck.build(None)
+        for layer in self.lateral_convs:
+            with tf.name_scope(layer.name):
+                layer.build(None)
+        for layer in self.fpn_convs:
+            with tf.name_scope(layer.name):
+                layer.build(None)
+
+
+class TFData2VecVisionFCNHead(keras.layers.Layer):
     """
     Fully Convolution Networks for Semantic Segmentation. This head is implemented from
     [FCNNet](https://arxiv.org/abs/1411.4038).
@@ -1271,6 +1481,7 @@ def __init__(
         convs = []
         convs.append(
             TFData2VecVisionConvModule(
+                in_channels=self.in_channels,
                 out_channels=self.channels,
                 kernel_size=kernel_size,
                 padding="same",
@@ -1281,6 +1492,7 @@ def __init__(
         for i in range(self.num_convs - 1):
             convs.append(
                 TFData2VecVisionConvModule(
+                    in_channels=self.channels,
                     out_channels=self.channels,
                     kernel_size=kernel_size,
                     padding="same",
@@ -1294,10 +1506,14 @@ def __init__(
             self.convs = convs
         if self.concat_input:
             self.conv_cat = TFData2VecVisionConvModule(
-                out_channels=self.channels, kernel_size=kernel_size, padding="same", name="conv_cat"
+                self.in_channels + self.channels,
+                out_channels=self.channels,
+                kernel_size=kernel_size,
+                padding="same",
+                name="conv_cat",
             )
 
-        self.classifier = tf.keras.layers.Conv2D(config.num_labels, kernel_size=1, name="classifier")
+        self.classifier = keras.layers.Conv2D(config.num_labels, kernel_size=1, name="classifier")
 
     def call(self, encoder_hidden_states: tf.Tensor) -> tf.Tensor:
         # just take the relevant feature maps
@@ -1310,6 +1526,17 @@ def call(self, encoder_hidden_states: tf.Tensor) -> tf.Tensor:
         output = self.classifier(output)
         return output
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "classifier", None) is not None:
+            with tf.name_scope(self.classifier.name):
+                self.classifier.build([None, None, None, self.channels])
+        if getattr(self, "conv_cat", None) is not None:
+            with tf.name_scope(self.conv_cat.name):
+                self.conv_cat.build(None)
+
 
 @add_start_docstrings(
     """
@@ -1325,15 +1552,15 @@ def __init__(self, config: Data2VecVisionConfig, *inputs, **kwargs) -> None:
 
         # FPNs
         self.fpn1 = [
-            tf.keras.layers.Conv2DTranspose(config.hidden_size, kernel_size=2, strides=2, name="fpn1.0"),
-            tf.keras.layers.BatchNormalization(name="fpn1.1", momentum=0.9, epsilon=1e-5),
-            tf.keras.layers.Activation("gelu"),
-            tf.keras.layers.Conv2DTranspose(config.hidden_size, kernel_size=2, strides=2, name="fpn1.3"),
+            keras.layers.Conv2DTranspose(config.hidden_size, kernel_size=2, strides=2, name="fpn1.0"),
+            keras.layers.BatchNormalization(name="fpn1.1", momentum=0.9, epsilon=1e-5),
+            keras.layers.Activation("gelu"),
+            keras.layers.Conv2DTranspose(config.hidden_size, kernel_size=2, strides=2, name="fpn1.3"),
         ]
-        self.fpn2 = [tf.keras.layers.Conv2DTranspose(config.hidden_size, kernel_size=2, strides=2, name="fpn2.0")]
+        self.fpn2 = [keras.layers.Conv2DTranspose(config.hidden_size, kernel_size=2, strides=2, name="fpn2.0")]
 
         self.fpn3 = tf.identity
-        self.fpn4 = tf.keras.layers.MaxPool2D(pool_size=2, strides=2)
+        self.fpn4 = keras.layers.MaxPool2D(pool_size=2, strides=2)
 
         # Semantic segmentation head(s)
         self.decode_head = TFData2VecVisionUperHead(config, name="decode_head")
@@ -1352,7 +1579,7 @@ def compute_loss(self, logits, auxiliary_logits, labels):
         if auxiliary_logits is not None:
             upsampled_auxiliary_logits = tf.image.resize(auxiliary_logits, size=label_interp_shape, method="bilinear")
         # compute weighted loss
-        loss_fct = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction="none")
+        loss_fct = keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction="none")
 
         # Copied from https://www.tensorflow.org/text/tutorials/transformer#loss_and_metrics.
         # Utility to mask the index to ignore during computing the loss.
@@ -1375,9 +1602,9 @@ def masked_loss(real, pred):
     @replace_return_docstrings(output_type=TFSemanticSegmenterOutput, config_class=_CONFIG_FOR_DOC)
     def call(
         self,
-        pixel_values: Optional[tf.Tensor] = None,
-        head_mask: Optional[tf.Tensor] = None,
-        labels: Optional[tf.Tensor] = None,
+        pixel_values: tf.Tensor | None = None,
+        head_mask: tf.Tensor | None = None,
+        labels: tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
@@ -1424,11 +1651,11 @@ def call(
         # only keep certain features, and reshape
         # note that we do +1 as the encoder_hidden_states also includes the initial embeddings
         features = [feature for idx, feature in enumerate(encoder_hidden_states) if idx + 1 in self.config.out_indices]
-        batch_size = shape_list(pixel_values)[0]
         patch_resolution = self.config.image_size // self.config.patch_size
 
         def reshape_features(x):
-            x = tf.reshape(x, (batch_size, patch_resolution, patch_resolution, -1))
+            # We do it this way so TF can always infer the non-batch dims at compile time
+            x = tf.reshape(x, (-1, patch_resolution, patch_resolution, self.config.hidden_size))
             return x
 
         features = [reshape_features(x[:, 1:, :]) for x in features]
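For context on the reshape tweak above: leaving only the batch dimension as `-1` keeps every other dimension static, so TF can infer them at trace time even with a dynamic batch size. A small sketch with illustrative sizes (not the real model tensors):

```
import tensorflow as tf

patch_resolution, hidden_size = 14, 768  # illustrative values
x = tf.random.normal((2, patch_resolution * patch_resolution, hidden_size))

# Only the batch dimension is left dynamic; the spatial and channel dims stay static.
y = tf.reshape(x, (-1, patch_resolution, patch_resolution, hidden_size))
print(y.shape)  # (2, 14, 14, 768)
```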
@@ -1470,8 +1697,26 @@ def reshape_features(x):
             attentions=outputs.attentions,
         )
 
-    def serving_output(self, output: TFSemanticSegmenterOutput) -> TFSemanticSegmenterOutput:
-        hidden_states = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attentions = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFSemanticSegmenterOutput(logits=output.logits, hidden_states=hidden_states, attentions=attentions)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "data2vec_vision", None) is not None:
+            with tf.name_scope(self.data2vec_vision.name):
+                self.data2vec_vision.build(None)
+        if getattr(self, "decode_head", None) is not None:
+            with tf.name_scope(self.decode_head.name):
+                self.decode_head.build(None)
+        if getattr(self, "auxiliary_head", None) is not None:
+            with tf.name_scope(self.auxiliary_head.name):
+                self.auxiliary_head.build(None)
+        if getattr(self, "fpn1", None) is not None:
+            with tf.name_scope(self.fpn1[0].name):
+                self.fpn1[0].build([None, None, None, self.config.hidden_size])
+            with tf.name_scope(self.fpn1[1].name):
+                self.fpn1[1].build((None, None, None, self.config.hidden_size))
+            with tf.name_scope(self.fpn1[3].name):
+                self.fpn1[3].build([None, None, None, self.config.hidden_size])
+        if getattr(self, "fpn2", None) is not None:
+            with tf.name_scope(self.fpn2[0].name):
+                self.fpn2[0].build([None, None, None, self.config.hidden_size])
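The new `TFAdaptiveAvgPool2D` above leans on the observation spelled out in its comments: PyTorch-style adaptive average pooling only ever mixes two consecutive window sizes, so pooling once with each size and gathering per output position reproduces the op. A NumPy sketch of just the index arithmetic, with illustrative sizes (`input_dim=7`, `output_dim=5`):

```
import math

import numpy as np

input_dim, output_dim = 7, 5  # illustrative 1D sizes

starts = np.floor(np.arange(output_dim) * input_dim / output_dim).astype(int)
ends = np.ceil((np.arange(output_dim) + 1) * input_dim / output_dim).astype(int)
window_sizes = ends - starts                       # [2, 2, 3, 2, 2]: only two sizes occur

small_window = math.ceil(input_dim / output_dim)   # 2
uses_big_window = window_sizes > small_window      # [False, False, True, False, False]
```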
diff --git a/src/transformers/models/deberta/configuration_deberta.py b/src/transformers/models/deberta/configuration_deberta.py
index 94ea91cd3a0888..f6db66f0d8d99c 100644
--- a/src/transformers/models/deberta/configuration_deberta.py
+++ b/src/transformers/models/deberta/configuration_deberta.py
@@ -105,6 +105,7 @@ class DebertaConfig(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "deberta"
 
     def __init__(
@@ -148,7 +149,7 @@ def __init__(
         self.position_biased_input = position_biased_input
 
         # Backwards compatibility
-        if type(pos_att_type) == str:
+        if isinstance(pos_att_type, str):
             pos_att_type = [x.strip() for x in pos_att_type.lower().split("|")]
 
         self.pos_att_type = pos_att_type
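The `isinstance` swap above keeps the existing backwards-compatibility path that accepts `pos_att_type` as a pipe-separated string. A tiny sketch of that parsing with an illustrative value:

```
pos_att_type = "p2c|c2p"  # illustrative legacy string form

if isinstance(pos_att_type, str):
    pos_att_type = [x.strip() for x in pos_att_type.lower().split("|")]

print(pos_att_type)  # ['p2c', 'c2p']
```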
diff --git a/src/transformers/models/deberta/modeling_deberta.py b/src/transformers/models/deberta/modeling_deberta.py
index 2b63b428bbe9b2..b5136bcb88cd67 100644
--- a/src/transformers/models/deberta/modeling_deberta.py
+++ b/src/transformers/models/deberta/modeling_deberta.py
@@ -139,13 +139,13 @@ def symbolic(g, self, mask, dim):
         r_mask = g.op(
             "Cast",
             g.op("Sub", g.op("Constant", value_t=torch.tensor(1, dtype=torch.int64)), mask_cast_value),
-            to_i=sym_help.cast_pytorch_to_onnx["Byte"],
+            to_i=sym_help.cast_pytorch_to_onnx["Bool"],
         )
         output = masked_fill(
             g, self, r_mask, g.op("Constant", value_t=torch.tensor(torch.finfo(self.type().dtype()).min))
         )
         output = softmax(g, output, dim)
-        return masked_fill(g, output, r_mask, g.op("Constant", value_t=torch.tensor(0, dtype=torch.uint8)))
+        return masked_fill(g, output, r_mask, g.op("Constant", value_t=torch.tensor(0, dtype=torch.bool)))
 
 
 class DropoutContext(object):
@@ -420,7 +420,6 @@ def get_attention_mask(self, attention_mask):
         if attention_mask.dim() <= 2:
             extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
             attention_mask = extended_attention_mask * extended_attention_mask.squeeze(-2).unsqueeze(-1)
-            attention_mask = attention_mask.byte()
         elif attention_mask.dim() == 3:
             attention_mask = attention_mask.unsqueeze(1)
 
@@ -458,20 +457,14 @@ def forward(
                 all_hidden_states = all_hidden_states + (hidden_states,)
 
             if self.gradient_checkpointing and self.training:
-
-                def create_custom_forward(module):
-                    def custom_forward(*inputs):
-                        return module(*inputs, output_attentions)
-
-                    return custom_forward
-
-                hidden_states = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(layer_module),
+                hidden_states = self._gradient_checkpointing_func(
+                    layer_module.__call__,
                     next_kv,
                     attention_mask,
                     query_states,
                     relative_pos,
                     rel_embeddings,
+                    output_attentions,
                 )
             else:
                 hidden_states = layer_module(
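The checkpointing hunk above drops the `create_custom_forward` closure and passes the layer's `__call__` plus its positional arguments straight to the checkpointing function. A minimal sketch of that call pattern with a stand-in layer (the wrapper that `_gradient_checkpointing_func` resolves to may differ; `torch.utils.checkpoint.checkpoint` is used here purely for illustration):

```
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(8, 8)  # stand-in for a transformer layer
hidden_states = torch.randn(2, 4, 8, requires_grad=True)

# The callable and its positional arguments are handed over directly, no closure needed.
out = checkpoint(layer.__call__, hidden_states, use_reentrant=False)
out.sum().backward()
```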
@@ -614,7 +607,7 @@ def forward(
                 Input states to the module usually the output from previous layer, it will be the Q,K and V in
                 *Attention(Q,K,V)*
 
-            attention_mask (`torch.ByteTensor`):
+            attention_mask (`torch.BoolTensor`):
                 An attention mask matrix of shape [*B*, *N*, *N*] where *B* is the batch size, *N* is the maximum
                 sequence length in which element [i,j] = *1* means the *i* th token in the input can attend to the *j*
                 th token.
@@ -765,7 +758,9 @@ def __init__(self, config):
         self.config = config
 
         # position_ids (1, len position emb) is contiguous in memory and exported when serialized
-        self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
+        self.register_buffer(
+            "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False
+        )
 
     def forward(self, input_ids=None, token_type_ids=None, position_ids=None, mask=None, inputs_embeds=None):
         if input_ids is not None:
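Registering `position_ids` with `persistent=False` above keeps the buffer available at runtime while leaving it out of the serialized `state_dict`, which is consistent with dropping the `_keys_to_ignore_on_load_missing` entry a few hunks below. A small self-contained illustration:

```
import torch


class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("position_ids", torch.arange(4).expand((1, -1)), persistent=False)


toy = Toy()
print(toy.position_ids.shape)              # torch.Size([1, 4]) - still usable in forward()
print("position_ids" in toy.state_dict())  # False - not serialized, so never "missing" on load
```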
@@ -822,7 +817,6 @@ class DebertaPreTrainedModel(PreTrainedModel):
 
     config_class = DebertaConfig
     base_model_prefix = "deberta"
-    _keys_to_ignore_on_load_missing = ["position_ids"]
     _keys_to_ignore_on_load_unexpected = ["position_embeddings"]
     supports_gradient_checkpointing = True
 
@@ -839,10 +833,6 @@ def _init_weights(self, module):
             if module.padding_idx is not None:
                 module.weight.data[module.padding_idx].zero_()
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, DebertaEncoder):
-            module.gradient_checkpointing = value
-
 
 DEBERTA_START_DOCSTRING = r"""
     The DeBERTa model was proposed in [DeBERTa: Decoding-enhanced BERT with Disentangled
@@ -959,6 +949,7 @@ def forward(
         if input_ids is not None and inputs_embeds is not None:
             raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
         elif input_ids is not None:
+            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
             input_shape = input_ids.size()
         elif inputs_embeds is not None:
             input_shape = inputs_embeds.size()[:-1]
@@ -1021,8 +1012,7 @@ def forward(
 
 @add_start_docstrings("""DeBERTa Model with a `language modeling` head on top.""", DEBERTA_START_DOCSTRING)
 class DebertaForMaskedLM(DebertaPreTrainedModel):
-    _keys_to_ignore_on_load_unexpected = [r"pooler"]
-    _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias", "cls.predictions.decoder.weight"]
+    _tied_weights_keys = ["cls.predictions.decoder.weight", "cls.predictions.decoder.bias"]
 
     def __init__(self, config):
         super().__init__(config)
@@ -1100,16 +1090,17 @@ def forward(
         )
 
 
-# copied from transformers.models.bert.BertPredictionHeadTransform with bert -> deberta
 class DebertaPredictionHeadTransform(nn.Module):
     def __init__(self, config):
         super().__init__()
-        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+        self.embedding_size = getattr(config, "embedding_size", config.hidden_size)
+
+        self.dense = nn.Linear(config.hidden_size, self.embedding_size)
         if isinstance(config.hidden_act, str):
             self.transform_act_fn = ACT2FN[config.hidden_act]
         else:
             self.transform_act_fn = config.hidden_act
-        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        self.LayerNorm = nn.LayerNorm(self.embedding_size, eps=config.layer_norm_eps)
 
     def forward(self, hidden_states):
         hidden_states = self.dense(hidden_states)
@@ -1118,15 +1109,15 @@ def forward(self, hidden_states):
         return hidden_states
 
 
-# copied from transformers.models.bert.BertLMPredictionHead with bert -> deberta
 class DebertaLMPredictionHead(nn.Module):
     def __init__(self, config):
         super().__init__()
         self.transform = DebertaPredictionHeadTransform(config)
 
+        self.embedding_size = getattr(config, "embedding_size", config.hidden_size)
         # The output weights are the same as the input embeddings, but there is
         # an output-only bias for each token.
-        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        self.decoder = nn.Linear(self.embedding_size, config.vocab_size, bias=False)
 
         self.bias = nn.Parameter(torch.zeros(config.vocab_size))
 
@@ -1276,8 +1267,6 @@ def forward(
     DEBERTA_START_DOCSTRING,
 )
 class DebertaForTokenClassification(DebertaPreTrainedModel):
-    _keys_to_ignore_on_load_unexpected = [r"pooler"]
-
     def __init__(self, config):
         super().__init__(config)
         self.num_labels = config.num_labels
@@ -1351,8 +1340,6 @@ def forward(
     DEBERTA_START_DOCSTRING,
 )
 class DebertaForQuestionAnswering(DebertaPreTrainedModel):
-    _keys_to_ignore_on_load_unexpected = [r"pooler"]
-
     def __init__(self, config):
         super().__init__(config)
         self.num_labels = config.num_labels
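
The encoder hunks in this file drop the hand-rolled `create_custom_forward` closure and instead call `self._gradient_checkpointing_func(layer_module.__call__, ...)`, the callable the base model installs when gradient checkpointing is enabled. Below is a hedged sketch of the underlying pattern with plain `torch.utils.checkpoint` on a toy layer; the module, sizes, and the non-reentrant choice are illustrative assumptions, not this file's code.

```python
from functools import partial

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class ToyLayer(nn.Module):
    def __init__(self, dim: int = 8):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, hidden_states, attention_mask=None, output_attentions=False):
        return torch.relu(self.linear(hidden_states))


layer = ToyLayer()
x = torch.randn(2, 4, 8, requires_grad=True)
mask = torch.ones(2, 4)

# Instead of wrapping the layer in a closure, pass the callable plus *all* of its
# positional arguments (including flags such as output_attentions) to checkpoint().
checkpoint_fn = partial(checkpoint, use_reentrant=False)
out = checkpoint_fn(layer.__call__, x, mask, False)
out.sum().backward()
print(x.grad.shape)  # gradients flow through the recomputed forward pass
```
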
diff --git a/src/transformers/models/deberta/modeling_tf_deberta.py b/src/transformers/models/deberta/modeling_tf_deberta.py
index 016ce15db61825..2a2a586c3592ef 100644
--- a/src/transformers/models/deberta/modeling_tf_deberta.py
+++ b/src/transformers/models/deberta/modeling_tf_deberta.py
@@ -15,6 +15,8 @@
 """ TF 2.0 DeBERTa model."""
 
 
+from __future__ import annotations
+
 import math
 from typing import Dict, Optional, Sequence, Tuple, Union
 
@@ -37,9 +39,10 @@
     TFSequenceClassificationLoss,
     TFTokenClassificationLoss,
     get_initializer,
+    keras,
     unpack_inputs,
 )
-from ...tf_utils import shape_list, stable_softmax
+from ...tf_utils import check_embeddings_within_bounds, shape_list, stable_softmax
 from ...utils import add_code_sample_docstrings, add_start_docstrings, add_start_docstrings_to_model_forward, logging
 from .configuration_deberta import DebertaConfig
 
@@ -56,10 +59,10 @@
 ]
 
 
-class TFDebertaContextPooler(tf.keras.layers.Layer):
+class TFDebertaContextPooler(keras.layers.Layer):
     def __init__(self, config: DebertaConfig, **kwargs):
         super().__init__(**kwargs)
-        self.dense = tf.keras.layers.Dense(config.pooler_hidden_size, name="dense")
+        self.dense = keras.layers.Dense(config.pooler_hidden_size, name="dense")
         self.dropout = TFDebertaStableDropout(config.pooler_dropout, name="dropout")
         self.config = config
 
@@ -76,8 +79,19 @@ def call(self, hidden_states, training: bool = False):
     def output_dim(self) -> int:
         return self.config.hidden_size
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.pooler_hidden_size])
+        if getattr(self, "dropout", None) is not None:
+            with tf.name_scope(self.dropout.name):
+                self.dropout.build(None)
+
 
-class TFDebertaXSoftmax(tf.keras.layers.Layer):
+class TFDebertaXSoftmax(keras.layers.Layer):
     """
     Masked Softmax which is optimized for saving memory
 
@@ -99,7 +113,7 @@ def call(self, inputs: tf.Tensor, mask: tf.Tensor):
         return output
 
 
-class TFDebertaStableDropout(tf.keras.layers.Layer):
+class TFDebertaStableDropout(keras.layers.Layer):
     """
     Optimized dropout module for stabilizing the training
 
@@ -139,7 +153,7 @@ def call(self, inputs: tf.Tensor, training: tf.Tensor = False):
         return inputs
 
 
-class TFDebertaLayerNorm(tf.keras.layers.Layer):
+class TFDebertaLayerNorm(keras.layers.Layer):
     """LayerNorm module in the TF style (epsilon inside the square root)."""
 
     def __init__(self, size, eps=1e-12, **kwargs):
@@ -159,12 +173,13 @@ def call(self, x: tf.Tensor) -> tf.Tensor:
         return self.gamma * (x - mean) / std + self.beta
 
 
-class TFDebertaSelfOutput(tf.keras.layers.Layer):
+class TFDebertaSelfOutput(keras.layers.Layer):
     def __init__(self, config: DebertaConfig, **kwargs):
         super().__init__(**kwargs)
-        self.dense = tf.keras.layers.Dense(config.hidden_size, name="dense")
-        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.dense = keras.layers.Dense(config.hidden_size, name="dense")
+        self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
         self.dropout = TFDebertaStableDropout(config.hidden_dropout_prob, name="dropout")
+        self.config = config
 
     def call(self, hidden_states, input_tensor, training: bool = False):
         hidden_states = self.dense(hidden_states)
@@ -172,8 +187,22 @@ def call(self, hidden_states, input_tensor, training: bool = False):
         hidden_states = self.LayerNorm(hidden_states + input_tensor)
         return hidden_states
 
-
-class TFDebertaAttention(tf.keras.layers.Layer):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+        if getattr(self, "LayerNorm", None) is not None:
+            with tf.name_scope(self.LayerNorm.name):
+                self.LayerNorm.build([None, None, self.config.hidden_size])
+        if getattr(self, "dropout", None) is not None:
+            with tf.name_scope(self.dropout.name):
+                self.dropout.build(None)
+
+
+class TFDebertaAttention(keras.layers.Layer):
     def __init__(self, config: DebertaConfig, **kwargs):
         super().__init__(**kwargs)
         self.self = TFDebertaDisentangledSelfAttention(config, name="self")
@@ -209,12 +238,23 @@ def call(
 
         return output
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "self", None) is not None:
+            with tf.name_scope(self.self.name):
+                self.self.build(None)
+        if getattr(self, "dense_output", None) is not None:
+            with tf.name_scope(self.dense_output.name):
+                self.dense_output.build(None)
+
 
-class TFDebertaIntermediate(tf.keras.layers.Layer):
+class TFDebertaIntermediate(keras.layers.Layer):
     def __init__(self, config: DebertaConfig, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(
+        self.dense = keras.layers.Dense(
             units=config.intermediate_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
         )
 
@@ -222,6 +262,7 @@ def __init__(self, config: DebertaConfig, **kwargs):
             self.intermediate_act_fn = get_tf_activation(config.hidden_act)
         else:
             self.intermediate_act_fn = config.hidden_act
+        self.config = config
 
     def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
         hidden_states = self.dense(inputs=hidden_states)
@@ -229,16 +270,25 @@ def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
 
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+
 
-class TFDebertaOutput(tf.keras.layers.Layer):
+class TFDebertaOutput(keras.layers.Layer):
     def __init__(self, config: DebertaConfig, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(
+        self.dense = keras.layers.Dense(
             units=config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
         )
-        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
         self.dropout = TFDebertaStableDropout(config.hidden_dropout_prob, name="dropout")
+        self.config = config
 
     def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, training: bool = False) -> tf.Tensor:
         hidden_states = self.dense(inputs=hidden_states)
@@ -247,8 +297,22 @@ def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, training: bool
 
         return hidden_states
 
-
-class TFDebertaLayer(tf.keras.layers.Layer):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.intermediate_size])
+        if getattr(self, "LayerNorm", None) is not None:
+            with tf.name_scope(self.LayerNorm.name):
+                self.LayerNorm.build([None, None, self.config.hidden_size])
+        if getattr(self, "dropout", None) is not None:
+            with tf.name_scope(self.dropout.name):
+                self.dropout.build(None)
+
+
+class TFDebertaLayer(keras.layers.Layer):
     def __init__(self, config: DebertaConfig, **kwargs):
         super().__init__(**kwargs)
 
@@ -284,8 +348,22 @@ def call(
 
         return outputs
 
-
-class TFDebertaEncoder(tf.keras.layers.Layer):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "attention", None) is not None:
+            with tf.name_scope(self.attention.name):
+                self.attention.build(None)
+        if getattr(self, "intermediate", None) is not None:
+            with tf.name_scope(self.intermediate.name):
+                self.intermediate.build(None)
+        if getattr(self, "bert_output", None) is not None:
+            with tf.name_scope(self.bert_output.name):
+                self.bert_output.build(None)
+
+
+class TFDebertaEncoder(keras.layers.Layer):
     def __init__(self, config: DebertaConfig, **kwargs):
         super().__init__(**kwargs)
 
@@ -297,14 +375,20 @@ def __init__(self, config: DebertaConfig, **kwargs):
             if self.max_relative_positions < 1:
                 self.max_relative_positions = config.max_position_embeddings
 
-    def build(self, input_shape):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
         if self.relative_attention:
             self.rel_embeddings = self.add_weight(
                 name="rel_embeddings.weight",
                 shape=[self.max_relative_positions * 2, self.config.hidden_size],
                 initializer=get_initializer(self.config.initializer_range),
             )
-        return super().build(input_shape)
+        if getattr(self, "layer", None) is not None:
+            for layer in self.layer:
+                with tf.name_scope(layer.name):
+                    layer.build(None)
 
     def get_rel_embedding(self):
         rel_embeddings = self.rel_embeddings if self.relative_attention else None
@@ -460,7 +544,7 @@ def torch_gather(x, indices, gather_axis):
     return gathered
 
 
-class TFDebertaDisentangledSelfAttention(tf.keras.layers.Layer):
+class TFDebertaDisentangledSelfAttention(keras.layers.Layer):
     """
     Disentangled self-attention module
 
@@ -481,7 +565,7 @@ def __init__(self, config: DebertaConfig, **kwargs):
         self.num_attention_heads = config.num_attention_heads
         self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
         self.all_head_size = self.num_attention_heads * self.attention_head_size
-        self.in_proj = tf.keras.layers.Dense(
+        self.in_proj = keras.layers.Dense(
             self.all_head_size * 3,
             kernel_initializer=get_initializer(config.initializer_range),
             name="in_proj",
@@ -493,13 +577,13 @@ def __init__(self, config: DebertaConfig, **kwargs):
         self.talking_head = getattr(config, "talking_head", False)
 
         if self.talking_head:
-            self.head_logits_proj = tf.keras.layers.Dense(
+            self.head_logits_proj = keras.layers.Dense(
                 self.num_attention_heads,
                 kernel_initializer=get_initializer(config.initializer_range),
                 name="head_logits_proj",
                 use_bias=False,
             )
-            self.head_weights_proj = tf.keras.layers.Dense(
+            self.head_weights_proj = keras.layers.Dense(
                 self.num_attention_heads,
                 kernel_initializer=get_initializer(config.initializer_range),
                 name="head_weights_proj",
@@ -514,27 +598,51 @@ def __init__(self, config: DebertaConfig, **kwargs):
                 self.max_relative_positions = config.max_position_embeddings
             self.pos_dropout = TFDebertaStableDropout(config.hidden_dropout_prob, name="pos_dropout")
             if "c2p" in self.pos_att_type:
-                self.pos_proj = tf.keras.layers.Dense(
+                self.pos_proj = keras.layers.Dense(
                     self.all_head_size,
                     kernel_initializer=get_initializer(config.initializer_range),
                     name="pos_proj",
                     use_bias=False,
                 )
             if "p2c" in self.pos_att_type:
-                self.pos_q_proj = tf.keras.layers.Dense(
+                self.pos_q_proj = keras.layers.Dense(
                     self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="pos_q_proj"
                 )
 
         self.dropout = TFDebertaStableDropout(config.attention_probs_dropout_prob, name="dropout")
+        self.config = config
 
-    def build(self, input_shape):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
         self.q_bias = self.add_weight(
-            name="q_bias", shape=(self.all_head_size), initializer=tf.keras.initializers.Zeros()
+            name="q_bias", shape=(self.all_head_size), initializer=keras.initializers.Zeros()
         )
         self.v_bias = self.add_weight(
-            name="v_bias", shape=(self.all_head_size), initializer=tf.keras.initializers.Zeros()
+            name="v_bias", shape=(self.all_head_size), initializer=keras.initializers.Zeros()
         )
-        return super().build(input_shape)
+        if getattr(self, "in_proj", None) is not None:
+            with tf.name_scope(self.in_proj.name):
+                self.in_proj.build([None, None, self.config.hidden_size])
+        if getattr(self, "dropout", None) is not None:
+            with tf.name_scope(self.dropout.name):
+                self.dropout.build(None)
+        if getattr(self, "head_logits_proj", None) is not None:
+            with tf.name_scope(self.head_logits_proj.name):
+                self.head_logits_proj.build(None)
+        if getattr(self, "head_weights_proj", None) is not None:
+            with tf.name_scope(self.head_weights_proj.name):
+                self.head_weights_proj.build(None)
+        if getattr(self, "pos_dropout", None) is not None:
+            with tf.name_scope(self.pos_dropout.name):
+                self.pos_dropout.build(None)
+        if getattr(self, "pos_proj", None) is not None:
+            with tf.name_scope(self.pos_proj.name):
+                self.pos_proj.build([self.config.hidden_size])
+        if getattr(self, "pos_q_proj", None) is not None:
+            with tf.name_scope(self.pos_q_proj.name):
+                self.pos_q_proj.build([self.config.hidden_size])
 
     def transpose_for_scores(self, tensor: tf.Tensor) -> tf.Tensor:
         shape = shape_list(tensor)[:-1] + [self.num_attention_heads, -1]
@@ -591,11 +699,10 @@ def call(
         else:
 
             def linear(w, b, x):
-                return tf.cond(
-                    b is not None,
-                    lambda: tf.matmul(x, w, transpose_b=True) + tf.transpose(b),
-                    lambda: tf.matmul(x, w, transpose_b=True),
-                )
+                out = tf.matmul(x, w, transpose_b=True)
+                if b is not None:
+                    out += tf.transpose(b)
+                return out
 
             ws = tf.split(
                 tf.transpose(self.in_proj.weight[0]), num_or_size_splits=self.num_attention_heads * 3, axis=0
@@ -712,7 +819,7 @@ def disentangled_att_bias(self, query_layer, key_layer, relative_pos, rel_embedd
         return score
 
 
-class TFDebertaEmbeddings(tf.keras.layers.Layer):
+class TFDebertaEmbeddings(keras.layers.Layer):
     """Construct the embeddings from word, position and token_type embeddings."""
 
     def __init__(self, config, **kwargs):
@@ -725,11 +832,16 @@ def __init__(self, config, **kwargs):
         self.position_biased_input = getattr(config, "position_biased_input", True)
         self.initializer_range = config.initializer_range
         if self.embedding_size != config.hidden_size:
-            self.embed_proj = tf.keras.layers.Dense(config.hidden_size, use_bias=False)
-        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+            self.embed_proj = keras.layers.Dense(
+                config.hidden_size,
+                kernel_initializer=get_initializer(config.initializer_range),
+                name="embed_proj",
+                use_bias=False,
+            )
+        self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
         self.dropout = TFDebertaStableDropout(config.hidden_dropout_prob, name="dropout")
 
-    def build(self, input_shape: tf.TensorShape):
+    def build(self, input_shape=None):
         with tf.name_scope("word_embeddings"):
             self.weight = self.add_weight(
                 name="weight",
@@ -757,7 +869,18 @@ def build(self, input_shape: tf.TensorShape):
             else:
                 self.position_embeddings = None
 
-        super().build(input_shape)
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "LayerNorm", None) is not None:
+            with tf.name_scope(self.LayerNorm.name):
+                self.LayerNorm.build([None, None, self.config.hidden_size])
+        if getattr(self, "dropout", None) is not None:
+            with tf.name_scope(self.dropout.name):
+                self.dropout.build(None)
+        if getattr(self, "embed_proj", None) is not None:
+            with tf.name_scope(self.embed_proj.name):
+                self.embed_proj.build([None, None, self.embedding_size])
 
     def call(
         self,
@@ -778,16 +901,7 @@ def call(
             raise ValueError("Need to provide either `input_ids` or `input_embeds`.")
 
         if input_ids is not None:
-            # Note: tf.gather, on which the embedding layer is based, won't check positive out of bound
-            # indices on GPU, returning zeros instead. This is a dangerous silent behavior.
-            tf.debugging.assert_less(
-                input_ids,
-                tf.cast(self.config.vocab_size, dtype=input_ids.dtype),
-                message=(
-                    "input_ids must be smaller than the embedding layer's input dimension (got"
-                    f" {tf.math.reduce_max(input_ids)} >= {self.config.vocab_size})"
-                ),
-            )
+            check_embeddings_within_bounds(input_ids, self.config.vocab_size)
             inputs_embeds = tf.gather(params=self.weight, indices=input_ids)
 
         input_shape = shape_list(inputs_embeds)[:-1]
@@ -824,12 +938,14 @@ def call(
         return final_embeddings
 
 
-class TFDebertaPredictionHeadTransform(tf.keras.layers.Layer):
+class TFDebertaPredictionHeadTransform(keras.layers.Layer):
     def __init__(self, config: DebertaConfig, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(
-            units=config.hidden_size,
+        self.embedding_size = getattr(config, "embedding_size", config.hidden_size)
+
+        self.dense = keras.layers.Dense(
+            units=self.embedding_size,
             kernel_initializer=get_initializer(config.initializer_range),
             name="dense",
         )
@@ -838,7 +954,8 @@ def __init__(self, config: DebertaConfig, **kwargs):
             self.transform_act_fn = get_tf_activation(config.hidden_act)
         else:
             self.transform_act_fn = config.hidden_act
-        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.config = config
 
     def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
         hidden_states = self.dense(inputs=hidden_states)
@@ -847,13 +964,24 @@ def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
 
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+        if getattr(self, "LayerNorm", None) is not None:
+            with tf.name_scope(self.LayerNorm.name):
+                self.LayerNorm.build([None, None, self.embedding_size])
+
 
-class TFDebertaLMPredictionHead(tf.keras.layers.Layer):
-    def __init__(self, config: DebertaConfig, input_embeddings: tf.keras.layers.Layer, **kwargs):
+class TFDebertaLMPredictionHead(keras.layers.Layer):
+    def __init__(self, config: DebertaConfig, input_embeddings: keras.layers.Layer, **kwargs):
         super().__init__(**kwargs)
 
         self.config = config
-        self.hidden_size = config.hidden_size
+        self.embedding_size = getattr(config, "embedding_size", config.hidden_size)
 
         self.transform = TFDebertaPredictionHeadTransform(config, name="transform")
 
@@ -861,12 +989,17 @@ def __init__(self, config: DebertaConfig, input_embeddings: tf.keras.layers.Laye
         # an output-only bias for each token.
         self.input_embeddings = input_embeddings
 
-    def build(self, input_shape: tf.TensorShape):
+    def build(self, input_shape=None):
         self.bias = self.add_weight(shape=(self.config.vocab_size,), initializer="zeros", trainable=True, name="bias")
 
-        super().build(input_shape)
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "transform", None) is not None:
+            with tf.name_scope(self.transform.name):
+                self.transform.build(None)
 
-    def get_output_embeddings(self) -> tf.keras.layers.Layer:
+    def get_output_embeddings(self) -> keras.layers.Layer:
         return self.input_embeddings
 
     def set_output_embeddings(self, value: tf.Variable):
@@ -883,7 +1016,7 @@ def set_bias(self, value: tf.Variable):
     def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
         hidden_states = self.transform(hidden_states=hidden_states)
         seq_length = shape_list(hidden_states)[1]
-        hidden_states = tf.reshape(tensor=hidden_states, shape=[-1, self.hidden_size])
+        hidden_states = tf.reshape(tensor=hidden_states, shape=[-1, self.embedding_size])
         hidden_states = tf.matmul(a=hidden_states, b=self.input_embeddings.weight, transpose_b=True)
         hidden_states = tf.reshape(tensor=hidden_states, shape=[-1, seq_length, self.config.vocab_size])
         hidden_states = tf.nn.bias_add(value=hidden_states, bias=self.bias)
@@ -891,8 +1024,8 @@ def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
         return hidden_states
 
 
-class TFDebertaOnlyMLMHead(tf.keras.layers.Layer):
-    def __init__(self, config: DebertaConfig, input_embeddings: tf.keras.layers.Layer, **kwargs):
+class TFDebertaOnlyMLMHead(keras.layers.Layer):
+    def __init__(self, config: DebertaConfig, input_embeddings: keras.layers.Layer, **kwargs):
         super().__init__(**kwargs)
         self.predictions = TFDebertaLMPredictionHead(config, input_embeddings, name="predictions")
 
@@ -901,9 +1034,17 @@ def call(self, sequence_output: tf.Tensor) -> tf.Tensor:
 
         return prediction_scores
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "predictions", None) is not None:
+            with tf.name_scope(self.predictions.name):
+                self.predictions.build(None)
+
 
 # @keras_serializable
-class TFDebertaMainLayer(tf.keras.layers.Layer):
+class TFDebertaMainLayer(keras.layers.Layer):
     config_class = DebertaConfig
 
     def __init__(self, config: DebertaConfig, **kwargs):
@@ -914,7 +1055,7 @@ def __init__(self, config: DebertaConfig, **kwargs):
         self.embeddings = TFDebertaEmbeddings(config, name="embeddings")
         self.encoder = TFDebertaEncoder(config, name="encoder")
 
-    def get_input_embeddings(self) -> tf.keras.layers.Layer:
+    def get_input_embeddings(self) -> keras.layers.Layer:
         return self.embeddings
 
     def set_input_embeddings(self, value: tf.Variable):
@@ -931,11 +1072,11 @@ class PreTrainedModel
     @unpack_inputs
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
@@ -985,6 +1126,17 @@ def call(
             attentions=encoder_outputs.attentions,
         )
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "embeddings", None) is not None:
+            with tf.name_scope(self.embeddings.name):
+                self.embeddings.build(None)
+        if getattr(self, "encoder", None) is not None:
+            with tf.name_scope(self.encoder.name):
+                self.encoder.build(None)
+
 
 class TFDebertaPreTrainedModel(TFPreTrainedModel):
     """
@@ -1002,7 +1154,7 @@ class TFDebertaPreTrainedModel(TFPreTrainedModel):
     on top of BERT/RoBERTa with two improvements, i.e. disentangled attention and enhanced mask decoder. With those two
     improvements, it outperforms BERT/RoBERTa on a majority of tasks with 80GB of pretraining data.
 
-    This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
+    This model is also a [keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
     as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and
     behavior.
 
@@ -1101,11 +1253,11 @@ def __init__(self, config: DebertaConfig, *inputs, **kwargs):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
@@ -1125,11 +1277,13 @@ def call(
 
         return outputs
 
-    def serving_output(self, output: TFBaseModelOutput) -> TFBaseModelOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFBaseModelOutput(last_hidden_state=output.last_hidden_state, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "deberta", None) is not None:
+            with tf.name_scope(self.deberta.name):
+                self.deberta.build(None)
 
 
 @add_start_docstrings("""DeBERTa Model with a `language modeling` head on top.""", DEBERTA_START_DOCSTRING)
@@ -1146,7 +1300,7 @@ def __init__(self, config: DebertaConfig, *inputs, **kwargs):
         self.deberta = TFDebertaMainLayer(config, name="deberta")
         self.mlm = TFDebertaOnlyMLMHead(config, input_embeddings=self.deberta.embeddings, name="cls")
 
-    def get_lm_head(self) -> tf.keras.layers.Layer:
+    def get_lm_head(self) -> keras.layers.Layer:
         return self.mlm.predictions
 
     @unpack_inputs
@@ -1158,15 +1312,15 @@ def get_lm_head(self) -> tf.keras.layers.Layer:
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[TFMaskedLMOutput, Tuple[tf.Tensor]]:
         r"""
@@ -1201,11 +1355,16 @@ def call(
             attentions=outputs.attentions,
         )
 
-    def serving_output(self, output: TFMaskedLMOutput) -> TFMaskedLMOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFMaskedLMOutput(logits=output.logits, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "deberta", None) is not None:
+            with tf.name_scope(self.deberta.name):
+                self.deberta.build(None)
+        if getattr(self, "mlm", None) is not None:
+            with tf.name_scope(self.mlm.name):
+                self.mlm.build(None)
 
 
 @add_start_docstrings(
@@ -1227,11 +1386,12 @@ def __init__(self, config: DebertaConfig, *inputs, **kwargs):
         drop_out = getattr(config, "cls_dropout", None)
         drop_out = self.config.hidden_dropout_prob if drop_out is None else drop_out
         self.dropout = TFDebertaStableDropout(drop_out, name="cls_dropout")
-        self.classifier = tf.keras.layers.Dense(
+        self.classifier = keras.layers.Dense(
             units=config.num_labels,
             kernel_initializer=get_initializer(config.initializer_range),
             name="classifier",
         )
+        self.output_dim = self.pooler.output_dim
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@@ -1242,15 +1402,15 @@ def __init__(self, config: DebertaConfig, *inputs, **kwargs):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]:
         r"""
@@ -1288,11 +1448,22 @@ def call(
             attentions=outputs.attentions,
         )
 
-    def serving_output(self, output: TFSequenceClassifierOutput) -> TFSequenceClassifierOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFSequenceClassifierOutput(logits=output.logits, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "deberta", None) is not None:
+            with tf.name_scope(self.deberta.name):
+                self.deberta.build(None)
+        if getattr(self, "pooler", None) is not None:
+            with tf.name_scope(self.pooler.name):
+                self.pooler.build(None)
+        if getattr(self, "dropout", None) is not None:
+            with tf.name_scope(self.dropout.name):
+                self.dropout.build(None)
+        if getattr(self, "classifier", None) is not None:
+            with tf.name_scope(self.classifier.name):
+                self.classifier.build([None, None, self.output_dim])
 
 
 @add_start_docstrings(
@@ -1309,10 +1480,11 @@ def __init__(self, config: DebertaConfig, *inputs, **kwargs):
         self.num_labels = config.num_labels
 
         self.deberta = TFDebertaMainLayer(config, name="deberta")
-        self.dropout = tf.keras.layers.Dropout(rate=config.hidden_dropout_prob)
-        self.classifier = tf.keras.layers.Dense(
+        self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob)
+        self.classifier = keras.layers.Dense(
             units=config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
         )
+        self.config = config
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@@ -1323,15 +1495,15 @@ def __init__(self, config: DebertaConfig, *inputs, **kwargs):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[TFTokenClassifierOutput, Tuple[tf.Tensor]]:
         r"""
@@ -1365,11 +1537,16 @@ def call(
             attentions=outputs.attentions,
         )
 
-    def serving_output(self, output: TFTokenClassifierOutput) -> TFTokenClassifierOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFTokenClassifierOutput(logits=output.logits, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "deberta", None) is not None:
+            with tf.name_scope(self.deberta.name):
+                self.deberta.build(None)
+        if getattr(self, "classifier", None) is not None:
+            with tf.name_scope(self.classifier.name):
+                self.classifier.build([None, None, self.config.hidden_size])
 
 
 @add_start_docstrings(
@@ -1386,9 +1563,10 @@ def __init__(self, config: DebertaConfig, *inputs, **kwargs):
         self.num_labels = config.num_labels
 
         self.deberta = TFDebertaMainLayer(config, name="deberta")
-        self.qa_outputs = tf.keras.layers.Dense(
+        self.qa_outputs = keras.layers.Dense(
             units=config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="qa_outputs"
         )
+        self.config = config
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@@ -1399,16 +1577,16 @@ def __init__(self, config: DebertaConfig, *inputs, **kwargs):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        start_positions: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        end_positions: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        start_positions: np.ndarray | tf.Tensor | None = None,
+        end_positions: np.ndarray | tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[TFQuestionAnsweringModelOutput, Tuple[tf.Tensor]]:
         r"""
@@ -1456,10 +1634,13 @@ def call(
             attentions=outputs.attentions,
         )
 
-    def serving_output(self, output: TFQuestionAnsweringModelOutput) -> TFQuestionAnsweringModelOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFQuestionAnsweringModelOutput(
-            start_logits=output.start_logits, end_logits=output.end_logits, hidden_states=hs, attentions=attns
-        )
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "deberta", None) is not None:
+            with tf.name_scope(self.deberta.name):
+                self.deberta.build(None)
+        if getattr(self, "qa_outputs", None) is not None:
+            with tf.name_scope(self.qa_outputs.name):
+                self.qa_outputs.build([None, None, self.config.hidden_size])
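
Most of the TF churn in this file is mechanical: `tf.keras.*` now goes through the library's `keras` shim, and every layer gains an idempotent `build()` that creates its sublayers' weights with known shapes under `tf.name_scope`, so the whole model can be built without running a forward pass. A hedged, self-contained sketch of that pattern using plain `tf.keras` follows; the block class and sizes are invented for illustration.

```python
import tensorflow as tf
from tensorflow import keras


class TinyBlock(keras.layers.Layer):
    def __init__(self, hidden_size: int, **kwargs):
        super().__init__(**kwargs)
        self.hidden_size = hidden_size
        self.dense = keras.layers.Dense(hidden_size, name="dense")

    def build(self, input_shape=None):
        if self.built:
            return
        self.built = True
        if getattr(self, "dense", None) is not None:
            with tf.name_scope(self.dense.name):
                # [batch, sequence, hidden] with unknown batch/sequence dimensions
                self.dense.build([None, None, self.hidden_size])

    def call(self, hidden_states):
        return self.dense(hidden_states)


block = TinyBlock(hidden_size=4, name="tiny_block")
block.build(None)  # weights exist before any data has been seen
print([w.name for w in block.weights])
```
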
diff --git a/src/transformers/models/deberta/tokenization_deberta.py b/src/transformers/models/deberta/tokenization_deberta.py
index bcaaaa4421178f..6a48b188d61897 100644
--- a/src/transformers/models/deberta/tokenization_deberta.py
+++ b/src/transformers/models/deberta/tokenization_deberta.py
@@ -16,7 +16,7 @@
 
 import json
 import os
-from typing import TYPE_CHECKING, List, Optional, Tuple
+from typing import List, Optional, Tuple
 
 import regex as re
 
@@ -24,9 +24,6 @@
 from ...utils import logging
 
 
-if TYPE_CHECKING:
-    from transformers.pipelines.conversational import Conversation
-
 logger = logging.get_logger(__name__)
 
 VOCAB_FILES_NAMES = {"vocab_file": "vocab.json", "merges_file": "merges.txt"}
@@ -116,13 +113,15 @@ class DebertaTokenizer(PreTrainedTokenizer):
     This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece), so a word
     will be encoded differently depending on whether it is at the beginning of the sentence (without space) or not:
 
-    ```
+    ```python
     >>> from transformers import DebertaTokenizer
+
     >>> tokenizer = DebertaTokenizer.from_pretrained("microsoft/deberta-base")
-    >>> tokenizer("Hello world")['input_ids']
-    [15496, 995]
-    >>> tokenizer(" Hello world")['input_ids']
-    [18435, 995]
+    >>> tokenizer("Hello world")["input_ids"]
+    [1, 31414, 232, 2]
+
+    >>> tokenizer(" Hello world")["input_ids"]
+    [1, 20920, 232, 2]
     ```
 
     You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer or when you
@@ -193,29 +192,15 @@ def __init__(
         add_bos_token=False,
         **kwargs,
     ):
-        bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
-        eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
-        sep_token = AddedToken(sep_token, lstrip=False, rstrip=False) if isinstance(sep_token, str) else sep_token
-        cls_token = AddedToken(cls_token, lstrip=False, rstrip=False) if isinstance(cls_token, str) else cls_token
-        unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
-        pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
+        bos_token = AddedToken(bos_token, special=True) if isinstance(bos_token, str) else bos_token
+        eos_token = AddedToken(eos_token, special=True) if isinstance(eos_token, str) else eos_token
+        sep_token = AddedToken(sep_token, special=True) if isinstance(sep_token, str) else sep_token
+        cls_token = AddedToken(cls_token, special=True) if isinstance(cls_token, str) else cls_token
+        unk_token = AddedToken(unk_token, special=True) if isinstance(unk_token, str) else unk_token
+        pad_token = AddedToken(pad_token, special=True) if isinstance(pad_token, str) else pad_token
 
         # Mask token behave like a normal word, i.e. include the space before it
         mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
-
-        super().__init__(
-            errors=errors,
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            add_prefix_space=add_prefix_space,
-            add_bos_token=add_bos_token,
-            **kwargs,
-        )
         self.add_bos_token = add_bos_token
 
         with open(vocab_file, encoding="utf-8") as vocab_handle:
@@ -234,6 +219,20 @@ def __init__(
         # Should have added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions
         self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
 
+        super().__init__(
+            errors=errors,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            add_prefix_space=add_prefix_space,
+            add_bos_token=add_bos_token,
+            **kwargs,
+        )
+
     @property
     # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer.vocab_size
     def vocab_size(self):
@@ -431,12 +430,3 @@ def prepare_for_tokenization(self, text, is_split_into_words=False, **kwargs):
         if (is_split_into_words or add_prefix_space) and (len(text) > 0 and not text[0].isspace()):
             text = " " + text
         return (text, kwargs)
-
-    # Copied from transformers.models.gpt2.tokenization_gpt2.GPT2Tokenizer._build_conversation_input_ids
-    def _build_conversation_input_ids(self, conversation: "Conversation") -> List[int]:
-        input_ids = []
-        for is_user, text in conversation.iter_texts():
-            input_ids.extend(self.encode(text, add_special_tokens=False) + [self.eos_token_id])
-        if len(input_ids) > self.model_max_length:
-            input_ids = input_ids[-self.model_max_length :]
-        return input_ids
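
The slow-tokenizer hunks above declare the special tokens with `special=True` instead of per-token `lstrip`/`rstrip` flags (the mask token keeps `lstrip=True` so it still absorbs the preceding space), and move `super().__init__` after the vocab/BPE state is loaded so the base class can use it during initialization. A small sketch of the `AddedToken` side, assuming a `tokenizers` version that supports the `special` keyword; the token strings are only examples.

```python
from tokenizers import AddedToken

sep_token = AddedToken("[SEP]", special=True)
mask_token = AddedToken("[MASK]", lstrip=True, rstrip=False)  # keeps the space-absorbing behaviour

print(sep_token.special, sep_token.lstrip)    # True False
print(mask_token.special, mask_token.lstrip)  # False True
```
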
diff --git a/src/transformers/models/deberta/tokenization_deberta_fast.py b/src/transformers/models/deberta/tokenization_deberta_fast.py
index 959bcae470112a..6d157fdf3c7066 100644
--- a/src/transformers/models/deberta/tokenization_deberta_fast.py
+++ b/src/transformers/models/deberta/tokenization_deberta_fast.py
@@ -15,7 +15,7 @@
 """ Fast Tokenization class for model DeBERTa."""
 
 import json
-from typing import TYPE_CHECKING, List, Optional, Tuple
+from typing import List, Optional, Tuple
 
 from tokenizers import pre_tokenizers
 
@@ -25,10 +25,6 @@
 from .tokenization_deberta import DebertaTokenizer
 
 
-if TYPE_CHECKING:
-    from transformers.pipelines.conversational import Conversation
-
-
 logger = logging.get_logger(__name__)
 
 VOCAB_FILES_NAMES = {"vocab_file": "vocab.json", "merges_file": "merges.txt", "tokenizer_file": "tokenizer.json"}
@@ -79,13 +75,15 @@ class DebertaTokenizerFast(PreTrainedTokenizerFast):
     This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece), so a word
     will be encoded differently depending on whether it is at the beginning of the sentence (without space) or not:
 
-    ```
+    ```python
     >>> from transformers import DebertaTokenizerFast
+
     >>> tokenizer = DebertaTokenizerFast.from_pretrained("microsoft/deberta-base")
-    >>> tokenizer("Hello world")['input_ids']
-    [15496, 995]
-    >>> tokenizer(" Hello world")['input_ids']
-    [18435, 995]
+    >>> tokenizer("Hello world")["input_ids"]
+    [1, 31414, 232, 2]
+
+    >>> tokenizer(" Hello world")["input_ids"]
+    [1, 20920, 232, 2]
     ```
 
     You can get around that behavior by passing `add_prefix_space=True` when instantiating this tokenizer, but since
@@ -101,9 +99,9 @@ class DebertaTokenizerFast(PreTrainedTokenizerFast):
     refer to this superclass for more information regarding those methods.
 
     Args:
-        vocab_file (`str`):
+        vocab_file (`str`, *optional*):
             Path to the vocabulary file.
-        merges_file (`str`):
+        merges_file (`str`, *optional*):
             Path to the merges file.
         tokenizer_file (`str`, *optional*):
             The path to a tokenizer file to use instead of the vocab file.
@@ -286,14 +284,3 @@ def _encode_plus(self, *args, **kwargs) -> BatchEncoding:
     def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
         files = self._tokenizer.model.save(save_directory, name=filename_prefix)
         return tuple(files)
-
-    # Copied from transformers.models.gpt2.tokenization_gpt2_fast.GPT2TokenizerFast._build_conversation_input_ids
-    def _build_conversation_input_ids(self, conversation: "Conversation") -> List[int]:
-        """This corresponds to DialoGPT variants of models."""
-        input_ids = []
-        for is_user, text in conversation.iter_texts():
-            input_ids.extend(self.encode(text, add_special_tokens=False) + [self.eos_token_id])
-
-        if len(input_ids) > self.model_max_length:
-            input_ids = input_ids[-self.model_max_length :]
-        return input_ids
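
With `vocab_file` and `merges_file` now documented as *optional*, the fast tokenizer can be rebuilt from the serialized `tokenizer.json` alone. A hedged sketch follows; it downloads the `microsoft/deberta-base` checkpoint the docstrings above already reference and assumes the `tokenizer.json` file name produced by `save_pretrained`.

```python
import os
import tempfile

from transformers import DebertaTokenizerFast

tok = DebertaTokenizerFast.from_pretrained("microsoft/deberta-base")

with tempfile.TemporaryDirectory() as tmp:
    tok.save_pretrained(tmp)
    # No vocab.json / merges.txt required: tokenizer.json carries the full BPE state.
    reloaded = DebertaTokenizerFast(tokenizer_file=os.path.join(tmp, "tokenizer.json"))
    print(reloaded("Hello world")["input_ids"])  # [1, 31414, 232, 2], as in the docstring above
```
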
diff --git a/src/transformers/models/deberta_v2/__init__.py b/src/transformers/models/deberta_v2/__init__.py
index 2c5bf57572536b..fb1b20a331fe11 100644
--- a/src/transformers/models/deberta_v2/__init__.py
+++ b/src/transformers/models/deberta_v2/__init__.py
@@ -46,6 +46,7 @@
         "TF_DEBERTA_V2_PRETRAINED_MODEL_ARCHIVE_LIST",
         "TFDebertaV2ForMaskedLM",
         "TFDebertaV2ForQuestionAnswering",
+        "TFDebertaV2ForMultipleChoice",
         "TFDebertaV2ForSequenceClassification",
         "TFDebertaV2ForTokenClassification",
         "TFDebertaV2Model",
@@ -95,6 +96,7 @@
         from .modeling_tf_deberta_v2 import (
             TF_DEBERTA_V2_PRETRAINED_MODEL_ARCHIVE_LIST,
             TFDebertaV2ForMaskedLM,
+            TFDebertaV2ForMultipleChoice,
             TFDebertaV2ForQuestionAnswering,
             TFDebertaV2ForSequenceClassification,
             TFDebertaV2ForTokenClassification,
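
The `__init__` hunks above register the new TF multiple-choice head both in the lazy `_import_structure` map and in the `TYPE_CHECKING` import block, so it resolves like any other public symbol. A hedged smoke test follows, assuming TensorFlow and the rest of this patch are installed; the tiny config values are arbitrary.

```python
from transformers import DebertaV2Config, TFDebertaV2ForMultipleChoice

config = DebertaV2Config(
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=64,
)
model = TFDebertaV2ForMultipleChoice(config)
print(model.config.model_type)  # "deberta-v2"
```
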
diff --git a/src/transformers/models/deberta_v2/configuration_deberta_v2.py b/src/transformers/models/deberta_v2/configuration_deberta_v2.py
index d55486cd563381..68f2112754a4c1 100644
--- a/src/transformers/models/deberta_v2/configuration_deberta_v2.py
+++ b/src/transformers/models/deberta_v2/configuration_deberta_v2.py
@@ -107,6 +107,7 @@ class DebertaV2Config(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "deberta-v2"
 
     def __init__(
@@ -150,7 +151,7 @@ def __init__(
         self.position_biased_input = position_biased_input
 
         # Backwards compatibility
-        if type(pos_att_type) == str:
+        if isinstance(pos_att_type, str):
             pos_att_type = [x.strip() for x in pos_att_type.lower().split("|")]
 
         self.pos_att_type = pos_att_type
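
The configuration hunk above swaps the `type(...) ==` comparison for `isinstance(...)` in the backwards-compatibility branch that still accepts the legacy string form of `pos_att_type`. A minimal sketch of what that normalization does:

```python
pos_att_type = "C2P|P2C"  # legacy string form, e.g. from an old config.json

if isinstance(pos_att_type, str):
    pos_att_type = [x.strip() for x in pos_att_type.lower().split("|")]

print(pos_att_type)  # ['c2p', 'p2c']
```
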
diff --git a/src/transformers/models/deberta_v2/modeling_deberta_v2.py b/src/transformers/models/deberta_v2/modeling_deberta_v2.py
index ef04a24b2f63e1..a8f064369268b0 100644
--- a/src/transformers/models/deberta_v2/modeling_deberta_v2.py
+++ b/src/transformers/models/deberta_v2/modeling_deberta_v2.py
@@ -130,13 +130,13 @@ def symbolic(g, self, mask, dim):
         r_mask = g.op(
             "Cast",
             g.op("Sub", g.op("Constant", value_t=torch.tensor(1, dtype=torch.int64)), mask_cast_value),
-            to_i=sym_help.cast_pytorch_to_onnx["Byte"],
+            to_i=sym_help.cast_pytorch_to_onnx["Bool"],
         )
         output = masked_fill(
             g, self, r_mask, g.op("Constant", value_t=torch.tensor(torch.finfo(self.type().dtype()).min))
         )
         output = softmax(g, output, dim)
-        return masked_fill(g, output, r_mask, g.op("Constant", value_t=torch.tensor(0, dtype=torch.uint8)))
+        return masked_fill(g, output, r_mask, g.op("Constant", value_t=torch.tensor(0, dtype=torch.bool)))
 
 
 # Copied from transformers.models.deberta.modeling_deberta.DropoutContext
@@ -453,7 +453,6 @@ def get_attention_mask(self, attention_mask):
         if attention_mask.dim() <= 2:
             extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
             attention_mask = extended_attention_mask * extended_attention_mask.squeeze(-2).unsqueeze(-1)
-            attention_mask = attention_mask.byte()
         elif attention_mask.dim() == 3:
             attention_mask = attention_mask.unsqueeze(1)
 
@@ -484,7 +483,7 @@ def forward(
         if attention_mask.dim() <= 2:
             input_mask = attention_mask
         else:
-            input_mask = (attention_mask.sum(-2) > 0).byte()
+            input_mask = attention_mask.sum(-2) > 0
         attention_mask = self.get_attention_mask(attention_mask)
         relative_pos = self.get_rel_pos(hidden_states, query_states, relative_pos)
 
@@ -502,20 +501,14 @@ def forward(
                 all_hidden_states = all_hidden_states + (output_states,)
 
             if self.gradient_checkpointing and self.training:
-
-                def create_custom_forward(module):
-                    def custom_forward(*inputs):
-                        return module(*inputs, output_attentions)
-
-                    return custom_forward
-
-                output_states = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(layer_module),
+                output_states = self._gradient_checkpointing_func(
+                    layer_module.__call__,
                     next_kv,
                     attention_mask,
                     query_states,
                     relative_pos,
                     rel_embeddings,
+                    output_attentions,
                 )
             else:
                 output_states = layer_module(
@@ -687,7 +680,7 @@ def forward(
                 Input states to the module, usually the output from the previous layer; these will be the Q, K and
                 V in *Attention(Q,K,V)*
 
-            attention_mask (`torch.ByteTensor`):
+            attention_mask (`torch.BoolTensor`):
                 An attention mask matrix of shape [*B*, *N*, *N*] where *B* is the batch size and *N* is the maximum
                 sequence length, in which element [i,j] = *1* means that the *i*-th token in the input can attend to
                 the *j*-th token.
@@ -722,7 +715,7 @@ def forward(
         if "p2c" in self.pos_att_type:
             scale_factor += 1
         scale = torch.sqrt(torch.tensor(query_layer.size(-1), dtype=torch.float) * scale_factor)
-        attention_scores = torch.bmm(query_layer, key_layer.transpose(-1, -2)) / scale.to(dtype=query_layer.dtype)
+        attention_scores = torch.bmm(query_layer, key_layer.transpose(-1, -2) / scale.to(dtype=query_layer.dtype))
         if self.relative_attention:
             rel_embeddings = self.pos_dropout(rel_embeddings)
             rel_att = self.disentangled_attention_bias(
@@ -787,15 +780,11 @@ def disentangled_attention_bias(self, query_layer, key_layer, relative_pos, rel_
             if "c2p" in self.pos_att_type:
                 pos_key_layer = self.transpose_for_scores(
                     self.pos_key_proj(rel_embeddings), self.num_attention_heads
-                ).repeat(
-                    query_layer.size(0) // self.num_attention_heads, 1, 1
-                )  # .split(self.all_head_size, dim=-1)
+                ).repeat(query_layer.size(0) // self.num_attention_heads, 1, 1)  # .split(self.all_head_size, dim=-1)
             if "p2c" in self.pos_att_type:
                 pos_query_layer = self.transpose_for_scores(
                     self.pos_query_proj(rel_embeddings), self.num_attention_heads
-                ).repeat(
-                    query_layer.size(0) // self.num_attention_heads, 1, 1
-                )  # .split(self.all_head_size, dim=-1)
+                ).repeat(query_layer.size(0) // self.num_attention_heads, 1, 1)  # .split(self.all_head_size, dim=-1)
 
         score = 0
         # content->position
@@ -863,7 +852,9 @@ def __init__(self, config):
         self.config = config
 
         # position_ids (1, len position emb) is contiguous in memory and exported when serialized
-        self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
+        self.register_buffer(
+            "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False
+        )
 
     def forward(self, input_ids=None, token_type_ids=None, position_ids=None, mask=None, inputs_embeds=None):
         if input_ids is not None:
@@ -921,7 +912,6 @@ class DebertaV2PreTrainedModel(PreTrainedModel):
 
     config_class = DebertaV2Config
     base_model_prefix = "deberta"
-    _keys_to_ignore_on_load_missing = ["position_ids"]
     _keys_to_ignore_on_load_unexpected = ["position_embeddings"]
     supports_gradient_checkpointing = True
 
@@ -938,10 +928,6 @@ def _init_weights(self, module):
             if module.padding_idx is not None:
                 module.weight.data[module.padding_idx].zero_()
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, DebertaV2Encoder):
-            module.gradient_checkpointing = value
-
 
 DEBERTA_START_DOCSTRING = r"""
     The DeBERTa model was proposed in [DeBERTa: Decoding-enhanced BERT with Disentangled
@@ -1059,6 +1045,7 @@ def forward(
         if input_ids is not None and inputs_embeds is not None:
             raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
         elif input_ids is not None:
+            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
             input_shape = input_ids.size()
         elif inputs_embeds is not None:
             input_shape = inputs_embeds.size()[:-1]
@@ -1121,8 +1108,7 @@ def forward(
 
 @add_start_docstrings("""DeBERTa Model with a `language modeling` head on top.""", DEBERTA_START_DOCSTRING)
 class DebertaV2ForMaskedLM(DebertaV2PreTrainedModel):
-    _keys_to_ignore_on_load_unexpected = [r"pooler"]
-    _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias", "cls.predictions.decoder.weight"]
+    _tied_weights_keys = ["cls.predictions.decoder.weight", "cls.predictions.decoder.bias"]
 
     def __init__(self, config):
         super().__init__(config)
@@ -1199,16 +1185,18 @@ def forward(
         )
 
 
-# copied from transformers.models.bert.BertPredictionHeadTransform with bert -> deberta
+# Copied from transformers.models.deberta.modeling_deberta.DebertaPredictionHeadTransform with Deberta->DebertaV2
 class DebertaV2PredictionHeadTransform(nn.Module):
     def __init__(self, config):
         super().__init__()
-        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+        self.embedding_size = getattr(config, "embedding_size", config.hidden_size)
+
+        self.dense = nn.Linear(config.hidden_size, self.embedding_size)
         if isinstance(config.hidden_act, str):
             self.transform_act_fn = ACT2FN[config.hidden_act]
         else:
             self.transform_act_fn = config.hidden_act
-        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        self.LayerNorm = nn.LayerNorm(self.embedding_size, eps=config.layer_norm_eps)
 
     def forward(self, hidden_states):
         hidden_states = self.dense(hidden_states)
@@ -1217,15 +1205,16 @@ def forward(self, hidden_states):
         return hidden_states
 
 
-# copied from transformers.models.bert.BertLMPredictionHead with bert -> deberta
+# Copied from transformers.models.deberta.modeling_deberta.DebertaLMPredictionHead with Deberta->DebertaV2
 class DebertaV2LMPredictionHead(nn.Module):
     def __init__(self, config):
         super().__init__()
         self.transform = DebertaV2PredictionHeadTransform(config)
 
+        self.embedding_size = getattr(config, "embedding_size", config.hidden_size)
         # The output weights are the same as the input embeddings, but there is
         # an output-only bias for each token.
-        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        self.decoder = nn.Linear(self.embedding_size, config.vocab_size, bias=False)
 
         self.bias = nn.Parameter(torch.zeros(config.vocab_size))
 
@@ -1377,8 +1366,6 @@ def forward(
 )
 # Copied from transformers.models.deberta.modeling_deberta.DebertaForTokenClassification with Deberta->DebertaV2
 class DebertaV2ForTokenClassification(DebertaV2PreTrainedModel):
-    _keys_to_ignore_on_load_unexpected = [r"pooler"]
-
     def __init__(self, config):
         super().__init__(config)
         self.num_labels = config.num_labels
@@ -1452,8 +1439,6 @@ def forward(
     DEBERTA_START_DOCSTRING,
 )
 class DebertaV2ForQuestionAnswering(DebertaV2PreTrainedModel):
-    _keys_to_ignore_on_load_unexpected = [r"pooler"]
-
     def __init__(self, config):
         super().__init__(config)
         self.num_labels = config.num_labels
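
The PyTorch hunks above drop the closure-based `create_custom_forward` wrapper and call `self._gradient_checkpointing_func` directly, passing `output_attentions` through as an explicit argument. A rough illustration of that calling convention, assuming the helper ultimately wraps `torch.utils.checkpoint.checkpoint` and using a toy callable in place of `DebertaV2Layer` (not part of the patch):

```python
# Illustrative sketch only: flags are passed positionally to the
# checkpointed callable instead of being captured in a closure.
import torch
import torch.utils.checkpoint

def layer_forward(hidden_states, attention_mask, output_attentions):
    # stand-in for DebertaV2Layer.__call__
    return (hidden_states * attention_mask,)

x = torch.randn(2, 4, requires_grad=True)
mask = torch.ones_like(x)
out = torch.utils.checkpoint.checkpoint(layer_forward, x, mask, False, use_reentrant=False)
out[0].sum().backward()
print(x.grad.shape)  # torch.Size([2, 4])
```
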
diff --git a/src/transformers/models/deberta_v2/modeling_tf_deberta_v2.py b/src/transformers/models/deberta_v2/modeling_tf_deberta_v2.py
index 015eb392574087..05b222ec8a595f 100644
--- a/src/transformers/models/deberta_v2/modeling_tf_deberta_v2.py
+++ b/src/transformers/models/deberta_v2/modeling_tf_deberta_v2.py
@@ -14,6 +14,7 @@
 # limitations under the License.
 """ TF 2.0 DeBERTa-v2 model."""
 
+from __future__ import annotations
 
 from typing import Dict, Optional, Tuple, Union
 
@@ -24,6 +25,7 @@
 from ...modeling_tf_outputs import (
     TFBaseModelOutput,
     TFMaskedLMOutput,
+    TFMultipleChoiceModelOutput,
     TFQuestionAnsweringModelOutput,
     TFSequenceClassifierOutput,
     TFTokenClassifierOutput,
@@ -31,21 +33,22 @@
 from ...modeling_tf_utils import (
     TFMaskedLanguageModelingLoss,
     TFModelInputType,
+    TFMultipleChoiceLoss,
     TFPreTrainedModel,
     TFQuestionAnsweringLoss,
     TFSequenceClassificationLoss,
     TFTokenClassificationLoss,
     get_initializer,
+    keras,
     unpack_inputs,
 )
-from ...tf_utils import shape_list, stable_softmax
+from ...tf_utils import check_embeddings_within_bounds, shape_list, stable_softmax
 from ...utils import add_code_sample_docstrings, add_start_docstrings, add_start_docstrings_to_model_forward, logging
 from .configuration_deberta_v2 import DebertaV2Config
 
 
 logger = logging.get_logger(__name__)
 
-
 _CONFIG_FOR_DOC = "DebertaV2Config"
 _CHECKPOINT_FOR_DOC = "kamalkraj/deberta-v2-xlarge"
 
@@ -56,10 +59,10 @@
 
 
 # Copied from transformers.models.deberta.modeling_tf_deberta.TFDebertaContextPooler with Deberta->DebertaV2
-class TFDebertaV2ContextPooler(tf.keras.layers.Layer):
+class TFDebertaV2ContextPooler(keras.layers.Layer):
     def __init__(self, config: DebertaV2Config, **kwargs):
         super().__init__(**kwargs)
-        self.dense = tf.keras.layers.Dense(config.pooler_hidden_size, name="dense")
+        self.dense = keras.layers.Dense(config.pooler_hidden_size, name="dense")
         self.dropout = TFDebertaV2StableDropout(config.pooler_dropout, name="dropout")
         self.config = config
 
@@ -76,9 +79,20 @@ def call(self, hidden_states, training: bool = False):
     def output_dim(self) -> int:
         return self.config.hidden_size
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.pooler_hidden_size])
+        if getattr(self, "dropout", None) is not None:
+            with tf.name_scope(self.dropout.name):
+                self.dropout.build(None)
+
 
 # Copied from transformers.models.deberta.modeling_tf_deberta.TFDebertaXSoftmax with Deberta->DebertaV2
-class TFDebertaV2XSoftmax(tf.keras.layers.Layer):
+class TFDebertaV2XSoftmax(keras.layers.Layer):
     """
     Masked Softmax which is optimized for saving memory
 
@@ -101,7 +115,7 @@ def call(self, inputs: tf.Tensor, mask: tf.Tensor):
 
 
 # Copied from transformers.models.deberta.modeling_tf_deberta.TFDebertaStableDropout with Deberta->DebertaV2
-class TFDebertaV2StableDropout(tf.keras.layers.Layer):
+class TFDebertaV2StableDropout(keras.layers.Layer):
     """
     Optimized dropout module for stabilizing the training
 
@@ -142,12 +156,13 @@ def call(self, inputs: tf.Tensor, training: tf.Tensor = False):
 
 
 # Copied from transformers.models.deberta.modeling_tf_deberta.TFDebertaSelfOutput with Deberta->DebertaV2
-class TFDebertaV2SelfOutput(tf.keras.layers.Layer):
+class TFDebertaV2SelfOutput(keras.layers.Layer):
     def __init__(self, config: DebertaV2Config, **kwargs):
         super().__init__(**kwargs)
-        self.dense = tf.keras.layers.Dense(config.hidden_size, name="dense")
-        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.dense = keras.layers.Dense(config.hidden_size, name="dense")
+        self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
         self.dropout = TFDebertaV2StableDropout(config.hidden_dropout_prob, name="dropout")
+        self.config = config
 
     def call(self, hidden_states, input_tensor, training: bool = False):
         hidden_states = self.dense(hidden_states)
@@ -155,9 +170,23 @@ def call(self, hidden_states, input_tensor, training: bool = False):
         hidden_states = self.LayerNorm(hidden_states + input_tensor)
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+        if getattr(self, "LayerNorm", None) is not None:
+            with tf.name_scope(self.LayerNorm.name):
+                self.LayerNorm.build([None, None, self.config.hidden_size])
+        if getattr(self, "dropout", None) is not None:
+            with tf.name_scope(self.dropout.name):
+                self.dropout.build(None)
+
 
 # Copied from transformers.models.deberta.modeling_tf_deberta.TFDebertaAttention with Deberta->DebertaV2
-class TFDebertaV2Attention(tf.keras.layers.Layer):
+class TFDebertaV2Attention(keras.layers.Layer):
     def __init__(self, config: DebertaV2Config, **kwargs):
         super().__init__(**kwargs)
         self.self = TFDebertaV2DisentangledSelfAttention(config, name="self")
@@ -193,13 +222,24 @@ def call(
 
         return output
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "self", None) is not None:
+            with tf.name_scope(self.self.name):
+                self.self.build(None)
+        if getattr(self, "dense_output", None) is not None:
+            with tf.name_scope(self.dense_output.name):
+                self.dense_output.build(None)
+
 
 # Copied from transformers.models.deberta.modeling_tf_deberta.TFDebertaIntermediate with Deberta->DebertaV2
-class TFDebertaV2Intermediate(tf.keras.layers.Layer):
+class TFDebertaV2Intermediate(keras.layers.Layer):
     def __init__(self, config: DebertaV2Config, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(
+        self.dense = keras.layers.Dense(
             units=config.intermediate_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
         )
 
@@ -207,6 +247,7 @@ def __init__(self, config: DebertaV2Config, **kwargs):
             self.intermediate_act_fn = get_tf_activation(config.hidden_act)
         else:
             self.intermediate_act_fn = config.hidden_act
+        self.config = config
 
     def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
         hidden_states = self.dense(inputs=hidden_states)
@@ -214,17 +255,26 @@ def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
 
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+
 
 # Copied from transformers.models.deberta.modeling_tf_deberta.TFDebertaOutput with Deberta->DebertaV2
-class TFDebertaV2Output(tf.keras.layers.Layer):
+class TFDebertaV2Output(keras.layers.Layer):
     def __init__(self, config: DebertaV2Config, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(
+        self.dense = keras.layers.Dense(
             units=config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
         )
-        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
         self.dropout = TFDebertaV2StableDropout(config.hidden_dropout_prob, name="dropout")
+        self.config = config
 
     def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, training: bool = False) -> tf.Tensor:
         hidden_states = self.dense(inputs=hidden_states)
@@ -233,9 +283,23 @@ def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, training: bool
 
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.intermediate_size])
+        if getattr(self, "LayerNorm", None) is not None:
+            with tf.name_scope(self.LayerNorm.name):
+                self.LayerNorm.build([None, None, self.config.hidden_size])
+        if getattr(self, "dropout", None) is not None:
+            with tf.name_scope(self.dropout.name):
+                self.dropout.build(None)
+
 
 # Copied from transformers.models.deberta.modeling_tf_deberta.TFDebertaLayer with Deberta->DebertaV2
-class TFDebertaV2Layer(tf.keras.layers.Layer):
+class TFDebertaV2Layer(keras.layers.Layer):
     def __init__(self, config: DebertaV2Config, **kwargs):
         super().__init__(**kwargs)
 
@@ -271,8 +335,22 @@ def call(
 
         return outputs
 
-
-class TFDebertaV2ConvLayer(tf.keras.layers.Layer):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "attention", None) is not None:
+            with tf.name_scope(self.attention.name):
+                self.attention.build(None)
+        if getattr(self, "intermediate", None) is not None:
+            with tf.name_scope(self.intermediate.name):
+                self.intermediate.build(None)
+        if getattr(self, "bert_output", None) is not None:
+            with tf.name_scope(self.bert_output.name):
+                self.bert_output.build(None)
+
+
+class TFDebertaV2ConvLayer(keras.layers.Layer):
     def __init__(self, config: DebertaV2Config, **kwargs):
         super().__init__(**kwargs)
 
@@ -280,11 +358,14 @@ def __init__(self, config: DebertaV2Config, **kwargs):
         # groups = getattr(config, "conv_groups", 1)
         self.conv_act = get_tf_activation(getattr(config, "conv_act", "tanh"))
         self.padding = (self.kernel_size - 1) // 2
-        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
         self.dropout = TFDebertaV2StableDropout(config.hidden_dropout_prob, name="dropout")
         self.config = config
 
-    def build(self, input_shape):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
         with tf.name_scope("conv"):
             self.conv_kernel = self.add_weight(
                 name="kernel",
@@ -294,7 +375,12 @@ def build(self, input_shape):
             self.conv_bias = self.add_weight(
                 name="bias", shape=[self.config.hidden_size], initializer=tf.zeros_initializer()
             )
-        return super().build(input_shape)
+        if getattr(self, "LayerNorm", None) is not None:
+            with tf.name_scope(self.LayerNorm.name):
+                self.LayerNorm.build([None, None, self.config.hidden_size])
+        if getattr(self, "dropout", None) is not None:
+            with tf.name_scope(self.dropout.name):
+                self.dropout.build(None)
 
     def call(
         self, hidden_states: tf.Tensor, residual_states: tf.Tensor, input_mask: tf.Tensor, training: bool = False
@@ -327,7 +413,7 @@ def call(
         return output_states
 
 
-class TFDebertaV2Encoder(tf.keras.layers.Layer):
+class TFDebertaV2Encoder(keras.layers.Layer):
     def __init__(self, config: DebertaV2Config, **kwargs):
         super().__init__(**kwargs)
 
@@ -348,18 +434,30 @@ def __init__(self, config: DebertaV2Config, **kwargs):
         self.norm_rel_ebd = [x.strip() for x in getattr(config, "norm_rel_ebd", "none").lower().split("|")]
 
         if "layer_norm" in self.norm_rel_ebd:
-            self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+            self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
 
         self.conv = TFDebertaV2ConvLayer(config, name="conv") if getattr(config, "conv_kernel_size", 0) > 0 else None
 
-    def build(self, input_shape):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
         if self.relative_attention:
             self.rel_embeddings = self.add_weight(
                 name="rel_embeddings.weight",
                 shape=[self.pos_ebd_size, self.config.hidden_size],
                 initializer=get_initializer(self.config.initializer_range),
             )
-        return super().build(input_shape)
+        if getattr(self, "conv", None) is not None:
+            with tf.name_scope(self.conv.name):
+                self.conv.build(None)
+        if getattr(self, "LayerNorm", None) is not None:
+            with tf.name_scope(self.LayerNorm.name):
+                self.LayerNorm.build([None, self.config.hidden_size])
+        if getattr(self, "layer", None) is not None:
+            for layer in self.layer:
+                with tf.name_scope(layer.name):
+                    layer.build(None)
 
     def get_rel_embedding(self):
         rel_embeddings = self.rel_embeddings if self.relative_attention else None
@@ -537,7 +635,7 @@ def take_along_axis(x, indices):
     return gathered
 
 
-class TFDebertaV2DisentangledSelfAttention(tf.keras.layers.Layer):
+class TFDebertaV2DisentangledSelfAttention(keras.layers.Layer):
     """
     Disentangled self-attention module
 
@@ -559,19 +657,19 @@ def __init__(self, config: DebertaV2Config, **kwargs):
         _attention_head_size = config.hidden_size // config.num_attention_heads
         self.attention_head_size = getattr(config, "attention_head_size", _attention_head_size)
         self.all_head_size = self.num_attention_heads * self.attention_head_size
-        self.query_proj = tf.keras.layers.Dense(
+        self.query_proj = keras.layers.Dense(
             self.all_head_size,
             kernel_initializer=get_initializer(config.initializer_range),
             name="query_proj",
             use_bias=True,
         )
-        self.key_proj = tf.keras.layers.Dense(
+        self.key_proj = keras.layers.Dense(
             self.all_head_size,
             kernel_initializer=get_initializer(config.initializer_range),
             name="key_proj",
             use_bias=True,
         )
-        self.value_proj = tf.keras.layers.Dense(
+        self.value_proj = keras.layers.Dense(
             self.all_head_size,
             kernel_initializer=get_initializer(config.initializer_range),
             name="value_proj",
@@ -595,20 +693,21 @@ def __init__(self, config: DebertaV2Config, **kwargs):
 
             if not self.share_att_key:
                 if "c2p" in self.pos_att_type:
-                    self.pos_key_proj = tf.keras.layers.Dense(
+                    self.pos_key_proj = keras.layers.Dense(
                         self.all_head_size,
                         kernel_initializer=get_initializer(config.initializer_range),
                         name="pos_proj",
                         use_bias=True,
                     )
                 if "p2c" in self.pos_att_type:
-                    self.pos_query_proj = tf.keras.layers.Dense(
+                    self.pos_query_proj = keras.layers.Dense(
                         self.all_head_size,
                         kernel_initializer=get_initializer(config.initializer_range),
                         name="pos_q_proj",
                     )
         self.softmax = TFDebertaV2XSoftmax(axis=-1)
         self.dropout = TFDebertaV2StableDropout(config.attention_probs_dropout_prob, name="dropout")
+        self.config = config
 
     def transpose_for_scores(self, tensor: tf.Tensor, attention_heads: int) -> tf.Tensor:
         tensor_shape = shape_list(tensor)
@@ -674,7 +773,7 @@ def call(
         if "p2c" in self.pos_att_type:
             scale_factor += 1
         scale = tf.math.sqrt(tf.cast(shape_list(query_layer)[-1] * scale_factor, tf.float32))
-        attention_scores = tf.matmul(query_layer, tf.transpose(key_layer, [0, 2, 1])) / scale
+        attention_scores = tf.matmul(query_layer, tf.transpose(key_layer, [0, 2, 1]) / scale)
         if self.relative_attention:
             rel_embeddings = self.pos_dropout(rel_embeddings)
             rel_att = self.disentangled_att_bias(query_layer, key_layer, relative_pos, rel_embeddings, scale_factor)
@@ -799,9 +898,35 @@ def disentangled_att_bias(self, query_layer, key_layer, relative_pos, rel_embedd
 
         return score
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "query_proj", None) is not None:
+            with tf.name_scope(self.query_proj.name):
+                self.query_proj.build([None, None, self.config.hidden_size])
+        if getattr(self, "key_proj", None) is not None:
+            with tf.name_scope(self.key_proj.name):
+                self.key_proj.build([None, None, self.config.hidden_size])
+        if getattr(self, "value_proj", None) is not None:
+            with tf.name_scope(self.value_proj.name):
+                self.value_proj.build([None, None, self.config.hidden_size])
+        if getattr(self, "dropout", None) is not None:
+            with tf.name_scope(self.dropout.name):
+                self.dropout.build(None)
+        if getattr(self, "pos_dropout", None) is not None:
+            with tf.name_scope(self.pos_dropout.name):
+                self.pos_dropout.build(None)
+        if getattr(self, "pos_key_proj", None) is not None:
+            with tf.name_scope(self.pos_key_proj.name):
+                self.pos_key_proj.build([None, None, self.config.hidden_size])
+        if getattr(self, "pos_query_proj", None) is not None:
+            with tf.name_scope(self.pos_query_proj.name):
+                self.pos_query_proj.build([None, None, self.config.hidden_size])
+
 
 # Copied from transformers.models.deberta.modeling_tf_deberta.TFDebertaEmbeddings Deberta->DebertaV2
-class TFDebertaV2Embeddings(tf.keras.layers.Layer):
+class TFDebertaV2Embeddings(keras.layers.Layer):
     """Construct the embeddings from word, position and token_type embeddings."""
 
     def __init__(self, config, **kwargs):
@@ -814,11 +939,16 @@ def __init__(self, config, **kwargs):
         self.position_biased_input = getattr(config, "position_biased_input", True)
         self.initializer_range = config.initializer_range
         if self.embedding_size != config.hidden_size:
-            self.embed_proj = tf.keras.layers.Dense(config.hidden_size, use_bias=False)
-        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+            self.embed_proj = keras.layers.Dense(
+                config.hidden_size,
+                kernel_initializer=get_initializer(config.initializer_range),
+                name="embed_proj",
+                use_bias=False,
+            )
+        self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
         self.dropout = TFDebertaV2StableDropout(config.hidden_dropout_prob, name="dropout")
 
-    def build(self, input_shape: tf.TensorShape):
+    def build(self, input_shape=None):
         with tf.name_scope("word_embeddings"):
             self.weight = self.add_weight(
                 name="weight",
@@ -846,7 +976,18 @@ def build(self, input_shape: tf.TensorShape):
             else:
                 self.position_embeddings = None
 
-        super().build(input_shape)
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "LayerNorm", None) is not None:
+            with tf.name_scope(self.LayerNorm.name):
+                self.LayerNorm.build([None, None, self.config.hidden_size])
+        if getattr(self, "dropout", None) is not None:
+            with tf.name_scope(self.dropout.name):
+                self.dropout.build(None)
+        if getattr(self, "embed_proj", None) is not None:
+            with tf.name_scope(self.embed_proj.name):
+                self.embed_proj.build([None, None, self.embedding_size])
 
     def call(
         self,
@@ -867,16 +1008,7 @@ def call(
             raise ValueError("Need to provide either `input_ids` or `input_embeds`.")
 
         if input_ids is not None:
-            # Note: tf.gather, on which the embedding layer is based, won't check positive out of bound
-            # indices on GPU, returning zeros instead. This is a dangerous silent behavior.
-            tf.debugging.assert_less(
-                input_ids,
-                tf.cast(self.config.vocab_size, dtype=input_ids.dtype),
-                message=(
-                    "input_ids must be smaller than the embedding layer's input dimension (got"
-                    f" {tf.math.reduce_max(input_ids)} >= {self.config.vocab_size})"
-                ),
-            )
+            check_embeddings_within_bounds(input_ids, self.config.vocab_size)
             inputs_embeds = tf.gather(params=self.weight, indices=input_ids)
 
         input_shape = shape_list(inputs_embeds)[:-1]
@@ -914,12 +1046,14 @@ def call(
 
 
 # Copied from transformers.models.deberta.modeling_tf_deberta.TFDebertaPredictionHeadTransform with Deberta->DebertaV2
-class TFDebertaV2PredictionHeadTransform(tf.keras.layers.Layer):
+class TFDebertaV2PredictionHeadTransform(keras.layers.Layer):
     def __init__(self, config: DebertaV2Config, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(
-            units=config.hidden_size,
+        self.embedding_size = getattr(config, "embedding_size", config.hidden_size)
+
+        self.dense = keras.layers.Dense(
+            units=self.embedding_size,
             kernel_initializer=get_initializer(config.initializer_range),
             name="dense",
         )
@@ -928,7 +1062,8 @@ def __init__(self, config: DebertaV2Config, **kwargs):
             self.transform_act_fn = get_tf_activation(config.hidden_act)
         else:
             self.transform_act_fn = config.hidden_act
-        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.config = config
 
     def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
         hidden_states = self.dense(inputs=hidden_states)
@@ -937,14 +1072,25 @@ def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
 
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+        if getattr(self, "LayerNorm", None) is not None:
+            with tf.name_scope(self.LayerNorm.name):
+                self.LayerNorm.build([None, None, self.embedding_size])
+
 
 # Copied from transformers.models.deberta.modeling_tf_deberta.TFDebertaLMPredictionHead with Deberta->DebertaV2
-class TFDebertaV2LMPredictionHead(tf.keras.layers.Layer):
-    def __init__(self, config: DebertaV2Config, input_embeddings: tf.keras.layers.Layer, **kwargs):
+class TFDebertaV2LMPredictionHead(keras.layers.Layer):
+    def __init__(self, config: DebertaV2Config, input_embeddings: keras.layers.Layer, **kwargs):
         super().__init__(**kwargs)
 
         self.config = config
-        self.hidden_size = config.hidden_size
+        self.embedding_size = getattr(config, "embedding_size", config.hidden_size)
 
         self.transform = TFDebertaV2PredictionHeadTransform(config, name="transform")
 
@@ -952,12 +1098,17 @@ def __init__(self, config: DebertaV2Config, input_embeddings: tf.keras.layers.La
         # an output-only bias for each token.
         self.input_embeddings = input_embeddings
 
-    def build(self, input_shape: tf.TensorShape):
+    def build(self, input_shape=None):
         self.bias = self.add_weight(shape=(self.config.vocab_size,), initializer="zeros", trainable=True, name="bias")
 
-        super().build(input_shape)
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "transform", None) is not None:
+            with tf.name_scope(self.transform.name):
+                self.transform.build(None)
 
-    def get_output_embeddings(self) -> tf.keras.layers.Layer:
+    def get_output_embeddings(self) -> keras.layers.Layer:
         return self.input_embeddings
 
     def set_output_embeddings(self, value: tf.Variable):
@@ -974,7 +1125,7 @@ def set_bias(self, value: tf.Variable):
     def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
         hidden_states = self.transform(hidden_states=hidden_states)
         seq_length = shape_list(hidden_states)[1]
-        hidden_states = tf.reshape(tensor=hidden_states, shape=[-1, self.hidden_size])
+        hidden_states = tf.reshape(tensor=hidden_states, shape=[-1, self.embedding_size])
         hidden_states = tf.matmul(a=hidden_states, b=self.input_embeddings.weight, transpose_b=True)
         hidden_states = tf.reshape(tensor=hidden_states, shape=[-1, seq_length, self.config.vocab_size])
         hidden_states = tf.nn.bias_add(value=hidden_states, bias=self.bias)
@@ -983,8 +1134,8 @@ def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
 
 
 # Copied from transformers.models.deberta.modeling_tf_deberta.TFDebertaOnlyMLMHead with Deberta->DebertaV2
-class TFDebertaV2OnlyMLMHead(tf.keras.layers.Layer):
-    def __init__(self, config: DebertaV2Config, input_embeddings: tf.keras.layers.Layer, **kwargs):
+class TFDebertaV2OnlyMLMHead(keras.layers.Layer):
+    def __init__(self, config: DebertaV2Config, input_embeddings: keras.layers.Layer, **kwargs):
         super().__init__(**kwargs)
         self.predictions = TFDebertaV2LMPredictionHead(config, input_embeddings, name="predictions")
 
@@ -993,9 +1144,17 @@ def call(self, sequence_output: tf.Tensor) -> tf.Tensor:
 
         return prediction_scores
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "predictions", None) is not None:
+            with tf.name_scope(self.predictions.name):
+                self.predictions.build(None)
+
 
 # Copied from transformers.models.deberta.modeling_tf_deberta.TFDebertaMainLayer with Deberta->DebertaV2
-class TFDebertaV2MainLayer(tf.keras.layers.Layer):
+class TFDebertaV2MainLayer(keras.layers.Layer):
     config_class = DebertaV2Config
 
     def __init__(self, config: DebertaV2Config, **kwargs):
@@ -1006,7 +1165,7 @@ def __init__(self, config: DebertaV2Config, **kwargs):
         self.embeddings = TFDebertaV2Embeddings(config, name="embeddings")
         self.encoder = TFDebertaV2Encoder(config, name="encoder")
 
-    def get_input_embeddings(self) -> tf.keras.layers.Layer:
+    def get_input_embeddings(self) -> keras.layers.Layer:
         return self.embeddings
 
     def set_input_embeddings(self, value: tf.Variable):
@@ -1023,11 +1182,11 @@ class PreTrainedModel
     @unpack_inputs
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
@@ -1077,6 +1236,17 @@ def call(
             attentions=encoder_outputs.attentions,
         )
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "embeddings", None) is not None:
+            with tf.name_scope(self.embeddings.name):
+                self.embeddings.build(None)
+        if getattr(self, "encoder", None) is not None:
+            with tf.name_scope(self.encoder.name):
+                self.encoder.build(None)
+
 
 # Copied from transformers.models.deberta.modeling_tf_deberta.TFDebertaPreTrainedModel with Deberta->DebertaV2
 class TFDebertaV2PreTrainedModel(TFPreTrainedModel):
@@ -1095,7 +1265,7 @@ class TFDebertaV2PreTrainedModel(TFPreTrainedModel):
     on top of BERT/RoBERTa with two improvements, i.e. disentangled attention and enhanced mask decoder. With those two
    improvements, it outperforms BERT/RoBERTa on a majority of tasks with 80GB of pretraining data.
 
-    This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
+    This model is also a [keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
     as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage and
     behavior.
 
@@ -1195,11 +1365,11 @@ def __init__(self, config: DebertaV2Config, *inputs, **kwargs):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
@@ -1219,11 +1389,13 @@ def call(
 
         return outputs
 
-    def serving_output(self, output: TFBaseModelOutput) -> TFBaseModelOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFBaseModelOutput(last_hidden_state=output.last_hidden_state, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "deberta", None) is not None:
+            with tf.name_scope(self.deberta.name):
+                self.deberta.build(None)
 
 
 @add_start_docstrings("""DeBERTa Model with a `language modeling` head on top.""", DEBERTA_START_DOCSTRING)
@@ -1241,7 +1413,7 @@ def __init__(self, config: DebertaV2Config, *inputs, **kwargs):
         self.deberta = TFDebertaV2MainLayer(config, name="deberta")
         self.mlm = TFDebertaV2OnlyMLMHead(config, input_embeddings=self.deberta.embeddings, name="cls")
 
-    def get_lm_head(self) -> tf.keras.layers.Layer:
+    def get_lm_head(self) -> keras.layers.Layer:
         return self.mlm.predictions
 
     @unpack_inputs
@@ -1253,15 +1425,15 @@ def get_lm_head(self) -> tf.keras.layers.Layer:
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[TFMaskedLMOutput, Tuple[tf.Tensor]]:
         r"""
@@ -1296,11 +1468,16 @@ def call(
             attentions=outputs.attentions,
         )
 
-    def serving_output(self, output: TFMaskedLMOutput) -> TFMaskedLMOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFMaskedLMOutput(logits=output.logits, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "deberta", None) is not None:
+            with tf.name_scope(self.deberta.name):
+                self.deberta.build(None)
+        if getattr(self, "mlm", None) is not None:
+            with tf.name_scope(self.mlm.name):
+                self.mlm.build(None)
 
 
 @add_start_docstrings(
@@ -1323,11 +1500,12 @@ def __init__(self, config: DebertaV2Config, *inputs, **kwargs):
         drop_out = getattr(config, "cls_dropout", None)
         drop_out = self.config.hidden_dropout_prob if drop_out is None else drop_out
         self.dropout = TFDebertaV2StableDropout(drop_out, name="cls_dropout")
-        self.classifier = tf.keras.layers.Dense(
+        self.classifier = keras.layers.Dense(
             units=config.num_labels,
             kernel_initializer=get_initializer(config.initializer_range),
             name="classifier",
         )
+        self.output_dim = self.pooler.output_dim
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@@ -1338,15 +1516,15 @@ def __init__(self, config: DebertaV2Config, *inputs, **kwargs):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]:
         r"""
@@ -1384,11 +1562,22 @@ def call(
             attentions=outputs.attentions,
         )
 
-    def serving_output(self, output: TFSequenceClassifierOutput) -> TFSequenceClassifierOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFSequenceClassifierOutput(logits=output.logits, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "deberta", None) is not None:
+            with tf.name_scope(self.deberta.name):
+                self.deberta.build(None)
+        if getattr(self, "pooler", None) is not None:
+            with tf.name_scope(self.pooler.name):
+                self.pooler.build(None)
+        if getattr(self, "dropout", None) is not None:
+            with tf.name_scope(self.dropout.name):
+                self.dropout.build(None)
+        if getattr(self, "classifier", None) is not None:
+            with tf.name_scope(self.classifier.name):
+                self.classifier.build([None, None, self.output_dim])
 
 
 @add_start_docstrings(
@@ -1406,10 +1595,11 @@ def __init__(self, config: DebertaV2Config, *inputs, **kwargs):
         self.num_labels = config.num_labels
 
         self.deberta = TFDebertaV2MainLayer(config, name="deberta")
-        self.dropout = tf.keras.layers.Dropout(rate=config.hidden_dropout_prob)
-        self.classifier = tf.keras.layers.Dense(
+        self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob)
+        self.classifier = keras.layers.Dense(
             units=config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
         )
+        self.config = config
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@@ -1420,15 +1610,15 @@ def __init__(self, config: DebertaV2Config, *inputs, **kwargs):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[TFTokenClassifierOutput, Tuple[tf.Tensor]]:
         r"""
@@ -1462,11 +1652,16 @@ def call(
             attentions=outputs.attentions,
         )
 
-    def serving_output(self, output: TFTokenClassifierOutput) -> TFTokenClassifierOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFTokenClassifierOutput(logits=output.logits, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "deberta", None) is not None:
+            with tf.name_scope(self.deberta.name):
+                self.deberta.build(None)
+        if getattr(self, "classifier", None) is not None:
+            with tf.name_scope(self.classifier.name):
+                self.classifier.build([None, None, self.config.hidden_size])
 
 
 @add_start_docstrings(
@@ -1484,9 +1679,10 @@ def __init__(self, config: DebertaV2Config, *inputs, **kwargs):
         self.num_labels = config.num_labels
 
         self.deberta = TFDebertaV2MainLayer(config, name="deberta")
-        self.qa_outputs = tf.keras.layers.Dense(
+        self.qa_outputs = keras.layers.Dense(
             units=config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="qa_outputs"
         )
+        self.config = config
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@@ -1497,16 +1693,16 @@ def __init__(self, config: DebertaV2Config, *inputs, **kwargs):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        start_positions: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        end_positions: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        start_positions: np.ndarray | tf.Tensor | None = None,
+        end_positions: np.ndarray | tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[TFQuestionAnsweringModelOutput, Tuple[tf.Tensor]]:
         r"""
@@ -1554,10 +1750,127 @@ def call(
             attentions=outputs.attentions,
         )
 
-    def serving_output(self, output: TFQuestionAnsweringModelOutput) -> TFQuestionAnsweringModelOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "deberta", None) is not None:
+            with tf.name_scope(self.deberta.name):
+                self.deberta.build(None)
+        if getattr(self, "qa_outputs", None) is not None:
+            with tf.name_scope(self.qa_outputs.name):
+                self.qa_outputs.build([None, None, self.config.hidden_size])
 
-        return TFQuestionAnsweringModelOutput(
-            start_logits=output.start_logits, end_logits=output.end_logits, hidden_states=hs, attentions=attns
+
+@add_start_docstrings(
+    """
+    DeBERTa Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a
+    softmax) e.g. for RocStories/SWAG tasks.
+    """,
+    DEBERTA_START_DOCSTRING,
+)
+class TFDebertaV2ForMultipleChoice(TFDebertaV2PreTrainedModel, TFMultipleChoiceLoss):
+    # names with a '.' represents the authorized unexpected/missing layers when a TF model is loaded from a PT model
+    # _keys_to_ignore_on_load_unexpected = [r"mlm___cls", r"nsp___cls", r"cls.predictions", r"cls.seq_relationship"]
+    # _keys_to_ignore_on_load_missing = [r"dropout"]
+
+    def __init__(self, config: DebertaV2Config, *inputs, **kwargs):
+        super().__init__(config, *inputs, **kwargs)
+
+        self.deberta = TFDebertaV2MainLayer(config, name="deberta")
+        self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob)
+        self.pooler = TFDebertaV2ContextPooler(config, name="pooler")
+        self.classifier = keras.layers.Dense(
+            units=1, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
         )
+        self.output_dim = self.pooler.output_dim
+
+    @unpack_inputs
+    @add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
+    @add_code_sample_docstrings(
+        checkpoint=_CHECKPOINT_FOR_DOC,
+        output_type=TFMultipleChoiceModelOutput,
+        config_class=_CONFIG_FOR_DOC,
+    )
+    def call(
+        self,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        labels: np.ndarray | tf.Tensor | None = None,
+        training: Optional[bool] = False,
+    ) -> Union[TFMultipleChoiceModelOutput, Tuple[tf.Tensor]]:
+        r"""
+        labels (`tf.Tensor` or `np.ndarray` of shape `(batch_size,)`, *optional*):
+            Labels for computing the multiple choice classification loss. Indices should be in `[0, ..., num_choices]`
+            where `num_choices` is the size of the second dimension of the input tensors. (See `input_ids` above)
+        """
+        if input_ids is not None:
+            num_choices = shape_list(input_ids)[1]
+            seq_length = shape_list(input_ids)[2]
+        else:
+            num_choices = shape_list(inputs_embeds)[1]
+            seq_length = shape_list(inputs_embeds)[2]
+
+        flat_input_ids = tf.reshape(tensor=input_ids, shape=(-1, seq_length)) if input_ids is not None else None
+        flat_attention_mask = (
+            tf.reshape(tensor=attention_mask, shape=(-1, seq_length)) if attention_mask is not None else None
+        )
+        flat_token_type_ids = (
+            tf.reshape(tensor=token_type_ids, shape=(-1, seq_length)) if token_type_ids is not None else None
+        )
+        flat_position_ids = (
+            tf.reshape(tensor=position_ids, shape=(-1, seq_length)) if position_ids is not None else None
+        )
+        flat_inputs_embeds = (
+            tf.reshape(tensor=inputs_embeds, shape=(-1, seq_length, shape_list(inputs_embeds)[3]))
+            if inputs_embeds is not None
+            else None
+        )
+        outputs = self.deberta(
+            input_ids=flat_input_ids,
+            attention_mask=flat_attention_mask,
+            token_type_ids=flat_token_type_ids,
+            position_ids=flat_position_ids,
+            inputs_embeds=flat_inputs_embeds,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            training=training,
+        )
+        sequence_output = outputs[0]
+        pooled_output = self.pooler(sequence_output, training=training)
+        pooled_output = self.dropout(pooled_output, training=training)
+        logits = self.classifier(pooled_output)
+        reshaped_logits = tf.reshape(tensor=logits, shape=(-1, num_choices))
+        loss = None if labels is None else self.hf_compute_loss(labels=labels, logits=reshaped_logits)
+
+        if not return_dict:
+            output = (reshaped_logits,) + outputs[2:]
+            return ((loss,) + output) if loss is not None else output
+
+        return TFMultipleChoiceModelOutput(
+            loss=loss,
+            logits=reshaped_logits,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "deberta", None) is not None:
+            with tf.name_scope(self.deberta.name):
+                self.deberta.build(None)
+        if getattr(self, "pooler", None) is not None:
+            with tf.name_scope(self.pooler.name):
+                self.pooler.build(None)
+        if getattr(self, "classifier", None) is not None:
+            with tf.name_scope(self.classifier.name):
+                self.classifier.build([None, None, self.output_dim])
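
The TF hunks above add explicit `build(input_shape=None)` methods throughout: each layer guards on `self.built`, then builds its sublayers inside their own name scopes with a known input shape. A toy sketch of the same pattern outside transformers (assumes TensorFlow with Keras 2-style layers; class and attribute names here are illustrative, not from the patch):

```python
# Illustrative sketch of the explicit-build pattern used above.
import tensorflow as tf
from tensorflow import keras

class ToyBlock(keras.layers.Layer):
    def __init__(self, hidden_size=8, **kwargs):
        super().__init__(**kwargs)
        self.hidden_size = hidden_size
        self.dense = keras.layers.Dense(hidden_size, name="dense")

    def call(self, x):
        return self.dense(x)

    def build(self, input_shape=None):
        if self.built:
            return
        self.built = True
        if getattr(self, "dense", None) is not None:
            with tf.name_scope(self.dense.name):
                self.dense.build([None, None, self.hidden_size])

block = ToyBlock()
block.build(None)  # weights are created eagerly, without a forward pass
print([w.name for w in block.weights])
```
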
diff --git a/src/transformers/models/deberta_v2/tokenization_deberta_v2.py b/src/transformers/models/deberta_v2/tokenization_deberta_v2.py
index b2a0d844f1625d..0cf8807ca61f2c 100644
--- a/src/transformers/models/deberta_v2/tokenization_deberta_v2.py
+++ b/src/transformers/models/deberta_v2/tokenization_deberta_v2.py
@@ -20,9 +20,12 @@
 
 import sentencepiece as sp
 
-from ...tokenization_utils import PreTrainedTokenizer
+from ...tokenization_utils import AddedToken, PreTrainedTokenizer
+from ...utils import logging
 
 
+logger = logging.get_logger(__name__)
+
 PRETRAINED_VOCAB_FILES_MAP = {
     "vocab_file": {
         "microsoft/deberta-v2-xlarge": "https://huggingface.co/microsoft/deberta-v2-xlarge/resolve/main/spm.model",
@@ -124,6 +127,18 @@ def __init__(
     ) -> None:
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
 
+        if not os.path.isfile(vocab_file):
+            raise ValueError(
+                f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
+                " model use `tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`"
+            )
+        self.do_lower_case = do_lower_case
+        self.split_by_punct = split_by_punct
+        self.vocab_file = vocab_file
+        self._tokenizer = SPMTokenizer(
+            vocab_file, None, split_by_punct=split_by_punct, sp_model_kwargs=self.sp_model_kwargs
+        )
+        unk_token = AddedToken(unk_token, normalized=True, special=True) if isinstance(unk_token, str) else unk_token
         super().__init__(
             do_lower_case=do_lower_case,
             bos_token=bos_token,
@@ -137,18 +152,7 @@ def __init__(
             sp_model_kwargs=self.sp_model_kwargs,
             **kwargs,
         )
-
-        if not os.path.isfile(vocab_file):
-            raise ValueError(
-                f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
-                " model use `tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`"
-            )
-        self.do_lower_case = do_lower_case
-        self.split_by_punct = split_by_punct
-        self.vocab_file = vocab_file
-        self._tokenizer = SPMTokenizer(
-            vocab_file, self.all_special_tokens, split_by_punct=split_by_punct, sp_model_kwargs=self.sp_model_kwargs
-        )
+        self._tokenizer.special_tokens = self.all_special_tokens
 
     @property
     def vocab_size(self):
@@ -374,6 +378,7 @@ def decode(self, tokens, start=-1, end=-1, raw_text=None):
             text = "".join(words[word_start:word_end])
             return text
 
+    # TODO add a deprecation cycle as this can have different behaviour from our API
     def add_special_token(self, token):
         if token not in self.special_tokens:
             self.special_tokens.append(token)
@@ -383,6 +388,9 @@ def add_special_token(self, token):
         return self.id(token)
 
     def part_of_whole_word(self, token, is_bos=False):
+        logger.warning_once(
+            "The `DebertaTokenizer.part_of_whole_word` method is deprecated and will be removed in `transformers==4.35`"
+        )
         if is_bos:
             return True
         if (
@@ -413,6 +421,9 @@ def sym(self, id):
         return self.ids_to_tokens[id]
 
     def id(self, sym):
+        logger.warning_once(
+            "The `DebertaTokenizer.id` method is deprecated and will be removed in `transformers==4.35`"
+        )
         return self.vocab[sym] if sym in self.vocab else 1
 
     def _encode_as_pieces(self, text):
@@ -460,17 +471,6 @@ def split_to_words(self, text):
 
         return words
 
-    def _run_strip_accents(self, text):
-        """Strips accents from a piece of text."""
-        text = unicodedata.normalize("NFD", text)
-        output = []
-        for char in text:
-            cat = unicodedata.category(char)
-            if cat == "Mn":
-                continue
-            output.append(char)
-        return "".join(output)
-
     def _run_split_on_punc(self, text):
         """Splits punctuation on a piece of text."""
         chars = list(text)
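The constructor reordering above (building the `SPMTokenizer` backend and wrapping `unk_token` in an `AddedToken` before `super().__init__` runs, then attaching `special_tokens` afterwards) matters because the base tokenizer may need to resolve tokens to ids while it registers special tokens. A self-contained sketch of that ordering constraint, with invented class names:

```python
# Hypothetical classes illustrating the init-order constraint; not the transformers code.
class BaseTokenizer:
    def __init__(self, unk_token="[UNK]", **kwargs):
        # The base class resolves special tokens to ids during init, which
        # requires the subclass's backend vocabulary to already exist.
        self.unk_token = unk_token
        self.unk_token_id = self._convert_token_to_id(unk_token)


class SpmLikeTokenizer(BaseTokenizer):
    def __init__(self, vocab, **kwargs):
        # Build the backend *before* calling super().__init__(), mirroring the
        # reordering in the DeBERTa-v2 slow tokenizer above.
        self._vocab = dict(vocab)
        super().__init__(**kwargs)

    def _convert_token_to_id(self, token):
        return self._vocab.get(token, 1)  # 1 stands in for the <unk> id


tok = SpmLikeTokenizer({"[UNK]": 3, "hello": 7})
print(tok.unk_token_id)  # 3
```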
diff --git a/src/transformers/models/deberta_v2/tokenization_deberta_v2_fast.py b/src/transformers/models/deberta_v2/tokenization_deberta_v2_fast.py
index 3f2a90cfa83ea9..dab376ce95be8a 100644
--- a/src/transformers/models/deberta_v2/tokenization_deberta_v2_fast.py
+++ b/src/transformers/models/deberta_v2/tokenization_deberta_v2_fast.py
@@ -148,7 +148,10 @@ def __init__(
         self.do_lower_case = do_lower_case
         self.split_by_punct = split_by_punct
         self.vocab_file = vocab_file
-        self.can_save_slow_tokenizer = False if not self.vocab_file else True
+
+    @property
+    def can_save_slow_tokenizer(self) -> bool:
+        return os.path.isfile(self.vocab_file) if self.vocab_file else False
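Turning `can_save_slow_tokenizer` into a property means the flag is evaluated lazily against the filesystem instead of being frozen at construction time. A standalone sketch of the same pattern (illustrative class name, not the library implementation):

```python
import os
import tempfile

class FastTokenizerLike:
    def __init__(self, vocab_file=None):
        self.vocab_file = vocab_file

    @property
    def can_save_slow_tokenizer(self) -> bool:
        # Evaluated on every access, so it tracks the file's current existence.
        return os.path.isfile(self.vocab_file) if self.vocab_file else False


with tempfile.NamedTemporaryFile(suffix=".model") as f:
    tok = FastTokenizerLike(vocab_file=f.name)
    print(tok.can_save_slow_tokenizer)  # True while the file exists
print(tok.can_save_slow_tokenizer)      # False once it is gone
```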
 
     def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
         """
diff --git a/src/transformers/models/decision_transformer/configuration_decision_transformer.py b/src/transformers/models/decision_transformer/configuration_decision_transformer.py
index 91ef58665a1f18..88ff005469cd6d 100644
--- a/src/transformers/models/decision_transformer/configuration_decision_transformer.py
+++ b/src/transformers/models/decision_transformer/configuration_decision_transformer.py
@@ -57,9 +57,9 @@ class DecisionTransformerConfig(PretrainedConfig):
         n_positions (`int`, *optional*, defaults to 1024):
             The maximum sequence length that this model might ever be used with. Typically set this to something large
             just in case (e.g., 512 or 1024 or 2048).
-        n_layer (`int`, *optional*, defaults to 12):
+        n_layer (`int`, *optional*, defaults to 3):
             Number of hidden layers in the Transformer encoder.
-        n_head (`int`, *optional*, defaults to 12):
+        n_head (`int`, *optional*, defaults to 1):
             Number of attention heads for each attention layer in the Transformer encoder.
         n_inner (`int`, *optional*):
             Dimensionality of the inner feed-forward layers. If unset, will default to 4 times `n_embd`.
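The docstring fix above aligns the documented defaults with the deliberately small Decision Transformer defaults. A quick sanity check, assuming a `transformers` install that ships `DecisionTransformerConfig`:

```python
from transformers import DecisionTransformerConfig

config = DecisionTransformerConfig()
print(config.n_layer, config.n_head)  # 3 1, matching the corrected docstring defaults
```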
diff --git a/src/transformers/models/decision_transformer/modeling_decision_transformer.py b/src/transformers/models/decision_transformer/modeling_decision_transformer.py
index 1f5f7601229d32..fdfb5b37d22e62 100755
--- a/src/transformers/models/decision_transformer/modeling_decision_transformer.py
+++ b/src/transformers/models/decision_transformer/modeling_decision_transformer.py
@@ -96,10 +96,9 @@ def load_tf_weights_in_gpt2(model, config, gpt2_checkpoint_path):
                 num = int(scope_names[1])
                 pointer = pointer[num]
         try:
-            assert (
-                pointer.shape == array.shape
-            ), f"Pointer shape {pointer.shape} and array shape {array.shape} mismatched"
-        except AssertionError as e:
+            if pointer.shape != array.shape:
+                raise ValueError(f"Pointer shape {pointer.shape} and array shape {array.shape} mismatched")
+        except ValueError as e:
             e.args += (pointer.shape, array.shape)
             raise
         logger.info(f"Initialize PyTorch weight {name}")
@@ -115,11 +114,12 @@ def __init__(self, config, is_cross_attention=False, layer_idx=None):
         max_positions = config.max_position_embeddings
         self.register_buffer(
             "bias",
-            torch.tril(torch.ones((max_positions, max_positions), dtype=torch.uint8)).view(
+            torch.tril(torch.ones((max_positions, max_positions), dtype=torch.bool)).view(
                 1, 1, max_positions, max_positions
             ),
+            persistent=False,
         )
-        self.register_buffer("masked_bias", torch.tensor(-1e4))
+        self.register_buffer("masked_bias", torch.tensor(-1e4), persistent=False)
 
         self.embed_dim = config.hidden_size
         self.num_heads = config.num_attention_heads
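The buffer changes above (a boolean mask registered with `persistent=False`) keep the causal mask and `masked_bias` out of checkpoints entirely, which is also why `_keys_to_ignore_on_load_missing` can be dropped further down. A minimal PyTorch illustration of the behaviour (toy module, illustrative name):

```python
import torch
from torch import nn

class TinyCausalAttention(nn.Module):
    def __init__(self, max_positions: int = 8):
        super().__init__()
        mask = torch.tril(torch.ones((max_positions, max_positions), dtype=torch.bool))
        # persistent=False keeps these helper buffers out of state_dict(),
        # so checkpoints neither store nor expect them.
        self.register_buffer("bias", mask.view(1, 1, max_positions, max_positions), persistent=False)
        self.register_buffer("masked_bias", torch.tensor(-1e4), persistent=False)


module = TinyCausalAttention()
print(sorted(module.state_dict().keys()))  # [] -- nothing to serialize
```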
@@ -181,11 +181,11 @@ def _attn(self, query, key, value, attention_mask=None, head_mask=None):
         if not self.is_cross_attention:
             # if only "normal" attention layer implements causal mask
             query_length, key_length = query.size(-2), key.size(-2)
-            causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length].to(torch.bool)
+            causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length]
             mask_value = torch.finfo(attn_weights.dtype).min
             # Need to be a tensor, otherwise we get error: `RuntimeError: expected scalar type float but found double`.
             # Need to be on the same device, otherwise `RuntimeError: ..., x and y to be on the same device`
-            mask_value = torch.full([], mask_value, dtype=attn_weights.dtype).to(attn_weights.device)
+            mask_value = torch.full([], mask_value, dtype=attn_weights.dtype, device=attn_weights.device)
             attn_weights = torch.where(causal_mask, attn_weights.to(attn_weights.dtype), mask_value)
 
         if attention_mask is not None:
@@ -231,7 +231,7 @@ def _upcast_and_reordered_attn(self, query, key, value, attention_mask=None, hea
         if not self.is_cross_attention:
             # if only "normal" attention layer implements causal mask
             query_length, key_length = query.size(-2), key.size(-2)
-            causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length].bool()
+            causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length]
             mask_value = torch.finfo(attn_weights.dtype).min
             # Need to be a tensor, otherwise we get error: `RuntimeError: expected scalar type float but found double`.
             # Need to be on the same device, otherwise `RuntimeError: ..., x and y to be on the same device`
@@ -469,14 +469,8 @@ def _init_weights(self, module):
                 # Special Scaled Initialization --> There are 2 Layer Norms per Transformer Block
                 p.data.normal_(mean=0.0, std=(self.config.initializer_range / math.sqrt(2 * self.config.n_layer)))
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, DecisionTransformerGPT2Model):
-            module.gradient_checkpointing = value
-
 
 class DecisionTransformerGPT2Model(DecisionTransformerGPT2PreTrainedModel):
-    _keys_to_ignore_on_load_missing = ["attn.masked_bias"]
-
     def __init__(self, config):
         super().__init__(config)
 
@@ -532,6 +526,7 @@ def forward(
         if input_ids is not None and inputs_embeds is not None:
             raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
         elif input_ids is not None:
+            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
             input_shape = input_ids.size()
             input_ids = input_ids.view(-1, input_shape[-1])
             batch_size = input_ids.shape[0]
@@ -545,8 +540,6 @@ def forward(
 
         if token_type_ids is not None:
             token_type_ids = token_type_ids.view(-1, input_shape[-1])
-        if position_ids is not None:
-            position_ids = position_ids.view(-1, input_shape[-1])
 
         if past_key_values is None:
             past_length = 0
@@ -555,7 +548,7 @@ def forward(
             past_length = past_key_values[0][0].size(-2)
         if position_ids is None:
             position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device)
-            position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])
+            position_ids = position_ids.unsqueeze(0)
 
         # GPT2Attention mask.
         if attention_mask is not None:
@@ -605,7 +598,14 @@ def forward(
 
         hidden_states = self.drop(hidden_states)
 
-        output_shape = input_shape + (hidden_states.size(-1),)
+        output_shape = (-1,) + input_shape[1:] + (hidden_states.size(-1),)
+
+        if self.gradient_checkpointing and self.training:
+            if use_cache:
+                logger.warning_once(
+                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+                )
+                use_cache = False
 
         presents = () if use_cache else None
         all_self_attentions = () if output_attentions else None
@@ -627,27 +627,16 @@ def forward(
                 all_hidden_states = all_hidden_states + (hidden_states,)
 
             if self.gradient_checkpointing and self.training:
-                if use_cache:
-                    logger.warning(
-                        "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
-                    )
-                    use_cache = False
-
-                def create_custom_forward(module):
-                    def custom_forward(*inputs):
-                        # None for past_key_value
-                        return module(*inputs, use_cache, output_attentions)
-
-                    return custom_forward
-
-                outputs = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(block),
+                outputs = self._gradient_checkpointing_func(
+                    block.__call__,
                     hidden_states,
                     None,
                     attention_mask,
                     head_mask[i],
                     encoder_hidden_states,
                     encoder_attention_mask,
+                    use_cache,
+                    output_attentions,
                 )
             else:
                 outputs = block(
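The gradient-checkpointing hunk above drops the `create_custom_forward` closure and instead passes `use_cache` and `output_attentions` straight through as positional arguments (via the library-provided `self._gradient_checkpointing_func`). A hedged sketch of the same idea, calling `torch.utils.checkpoint.checkpoint` directly on a toy block (`use_reentrant=False` assumes a reasonably recent PyTorch):

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, hidden_states, scale=1.0, output_attentions=False):
        # `scale` and `output_attentions` stand in for the flags forwarded above.
        return torch.tanh(self.proj(hidden_states)) * scale


block = Block()
x = torch.randn(2, 4, 16, requires_grad=True)
# Extra flags go in as plain positional arguments; no wrapper closure is needed.
out = checkpoint(block.__call__, x, 0.5, True, use_reentrant=False)
out.sum().backward()
print(x.grad.shape)  # torch.Size([2, 4, 16])
```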
@@ -744,7 +733,6 @@ class DecisionTransformerPreTrainedModel(PreTrainedModel):
     base_model_prefix = "decision_transformer"
     main_input_name = "states"
     supports_gradient_checkpointing = False
-    _keys_to_ignore_on_load_missing = [r"position_ids"]
 
     def _init_weights(self, module):
         """Initialize the weights"""
@@ -787,7 +775,7 @@ def _init_weights(self, module):
             The returns for each state in the trajectory
         timesteps (`torch.LongTensor` of shape `(batch_size, episode_length)`):
             The timestep for each step in the trajectory
-        attention_mask (`torch.LongTensor` of shape `(batch_size, episode_length)`):
+        attention_mask (`torch.FloatTensor` of shape `(batch_size, episode_length)`):
             Masking, used to mask the actions when performing autoregressive prediction
 """
 
@@ -830,16 +818,16 @@ def __init__(self, config):
     @replace_return_docstrings(output_type=DecisionTransformerOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
-        states=None,
-        actions=None,
-        rewards=None,
-        returns_to_go=None,
-        timesteps=None,
-        attention_mask=None,
-        output_hidden_states=None,
-        output_attentions=None,
-        return_dict=None,
-    ) -> Union[Tuple, DecisionTransformerOutput]:
+        states: Optional[torch.FloatTensor] = None,
+        actions: Optional[torch.FloatTensor] = None,
+        rewards: Optional[torch.FloatTensor] = None,
+        returns_to_go: Optional[torch.FloatTensor] = None,
+        timesteps: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.FloatTensor] = None,
+        output_hidden_states: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple[torch.FloatTensor], DecisionTransformerOutput]:
         r"""
         Returns:
 
diff --git a/src/transformers/models/deformable_detr/__init__.py b/src/transformers/models/deformable_detr/__init__.py
index 6614bc5f92613c..a560265f4bfcb8 100644
--- a/src/transformers/models/deformable_detr/__init__.py
+++ b/src/transformers/models/deformable_detr/__init__.py
@@ -14,7 +14,7 @@
 
 from typing import TYPE_CHECKING
 
-from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_timm_available, is_vision_available
+from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available
 
 
 _import_structure = {
@@ -31,7 +31,7 @@
     _import_structure["image_processing_deformable_detr"] = ["DeformableDetrImageProcessor"]
 
 try:
-    if not is_timm_available():
+    if not is_torch_available():
         raise OptionalDependencyNotAvailable()
 except OptionalDependencyNotAvailable:
     pass
@@ -57,7 +57,7 @@
         from .image_processing_deformable_detr import DeformableDetrImageProcessor
 
     try:
-        if not is_timm_available():
+        if not is_torch_available():
             raise OptionalDependencyNotAvailable()
     except OptionalDependencyNotAvailable:
         pass
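The `__init__.py` change above gates the Deformable DETR modeling objects on PyTorch rather than on `timm`. The pattern itself is the usual lazy-import guard; a generic, self-contained sketch (the helper below is a stand-in, not `transformers.utils.is_torch_available`):

```python
import importlib.util

def _backend_available(name: str) -> bool:
    # Stand-in for availability checks such as is_torch_available().
    return importlib.util.find_spec(name) is not None

_import_structure = {"configuration_deformable_detr": ["DeformableDetrConfig"]}
if _backend_available("torch"):  # previously this guard looked for `timm`
    _import_structure["modeling_deformable_detr"] = ["DeformableDetrModel"]
print(sorted(_import_structure))
```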
diff --git a/src/transformers/models/deformable_detr/configuration_deformable_detr.py b/src/transformers/models/deformable_detr/configuration_deformable_detr.py
index 90e085be151766..eb3b3807ab624b 100644
--- a/src/transformers/models/deformable_detr/configuration_deformable_detr.py
+++ b/src/transformers/models/deformable_detr/configuration_deformable_detr.py
@@ -77,7 +77,7 @@ class DeformableDetrConfig(PretrainedConfig):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
         init_xavier_std (`float`, *optional*, defaults to 1):
             The scaling factor used for the Xavier initialization gain in the HM Attention map module.
-        encoder_layerdrop: (`float`, *optional*, defaults to 0.0):
+        encoder_layerdrop (`float`, *optional*, defaults to 0.0):
             The LayerDrop probability for the encoder. See the [LayerDrop paper](see https://arxiv.org/abs/1909.11556)
             for more details.
         auxiliary_loss (`bool`, *optional*, defaults to `False`):
@@ -85,11 +85,14 @@ class DeformableDetrConfig(PretrainedConfig):
         position_embedding_type (`str`, *optional*, defaults to `"sine"`):
             Type of position embeddings to be used on top of the image features. One of `"sine"` or `"learned"`.
         backbone (`str`, *optional*, defaults to `"resnet50"`):
-            Name of convolutional backbone to use in case `use_timm_backbone` = `True`. Supports any convolutional
-            backbone from the timm package. For a list of all available models, see [this
-            page](https://rwightman.github.io/pytorch-image-models/#load-a-pretrained-model).
+            Name of backbone to use when `backbone_config` is `None`. If `use_pretrained_backbone` is `True`, this
+            will load the corresponding pretrained weights from the timm or transformers library. If `use_pretrained_backbone`
+            is `False`, this loads the backbone's config and uses that to initialize the backbone with random weights.
         use_pretrained_backbone (`bool`, *optional*, defaults to `True`):
-            Whether to use pretrained weights for the backbone. Only supported when `use_timm_backbone` = `True`.
+            Whether to use pretrained weights for the backbone.
+        backbone_kwargs (`dict`, *optional*):
+            Keyword arguments to be passed to AutoBackbone when loading from a checkpoint
+            e.g. `{'out_indices': (0, 1, 2, 3)}`. Cannot be specified if `backbone_config` is set.
         dilation (`bool`, *optional*, defaults to `False`):
             Whether to replace stride with dilation in the last convolutional block (DC5). Only supported when
             `use_timm_backbone` = `True`.
@@ -125,6 +128,9 @@ class DeformableDetrConfig(PretrainedConfig):
             based on the predictions from the previous layer.
         focal_alpha (`float`, *optional*, defaults to 0.25):
             Alpha parameter in the focal loss.
+        disable_custom_kernels (`bool`, *optional*, defaults to `False`):
+            Disable the use of custom CUDA and CPU kernels. This option is necessary for the ONNX export, as custom
+            kernels are not supported by PyTorch ONNX export.
 
     Examples:
 
@@ -140,6 +146,7 @@ class DeformableDetrConfig(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "deformable_detr"
     attribute_map = {
         "hidden_size": "d_model",
@@ -173,6 +180,7 @@ def __init__(
         position_embedding_type="sine",
         backbone="resnet50",
         use_pretrained_backbone=True,
+        backbone_kwargs=None,
         dilation=False,
         num_feature_levels=4,
         encoder_n_points=4,
@@ -189,11 +197,23 @@ def __init__(
         giou_loss_coefficient=2,
         eos_coefficient=0.1,
         focal_alpha=0.25,
+        disable_custom_kernels=False,
         **kwargs,
     ):
+        if not use_timm_backbone and use_pretrained_backbone:
+            raise ValueError(
+                "Loading pretrained backbone weights from the transformers library is not supported yet. `use_timm_backbone` must be set to `True` when `use_pretrained_backbone=True`"
+            )
+
+        if backbone_config is not None and backbone is not None:
+            raise ValueError("You can't specify both `backbone` and `backbone_config`.")
+
         if backbone_config is not None and use_timm_backbone:
             raise ValueError("You can't specify both `backbone_config` and `use_timm_backbone`.")
 
+        if backbone_kwargs is not None and backbone_kwargs and backbone_config is not None:
+            raise ValueError("You can't specify both `backbone_kwargs` and `backbone_config`.")
+
         if not use_timm_backbone:
             if backbone_config is None:
                 logger.info("`backbone_config` is `None`. Initializing the config with the default `ResNet` backbone.")
@@ -225,6 +245,7 @@ def __init__(
         self.position_embedding_type = position_embedding_type
         self.backbone = backbone
         self.use_pretrained_backbone = use_pretrained_backbone
+        self.backbone_kwargs = backbone_kwargs
         self.dilation = dilation
         # deformable attributes
         self.num_feature_levels = num_feature_levels
@@ -246,6 +267,7 @@ def __init__(
         self.giou_loss_coefficient = giou_loss_coefficient
         self.eos_coefficient = eos_coefficient
         self.focal_alpha = focal_alpha
+        self.disable_custom_kernels = disable_custom_kernels
         super().__init__(is_encoder_decoder=is_encoder_decoder, **kwargs)
 
     @property
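For reference, the `backbone_kwargs` and `disable_custom_kernels` options introduced above can be exercised as follows; this assumes a `transformers` build that already contains this patch:

```python
from transformers import DeformableDetrConfig

config = DeformableDetrConfig(
    backbone="resnet50",
    use_timm_backbone=True,
    backbone_kwargs={"out_indices": (0, 1, 2, 3)},  # forwarded when the backbone is created
    disable_custom_kernels=True,  # pure-PyTorch attention path, e.g. for ONNX export
)
print(config.backbone_kwargs, config.disable_custom_kernels)

# The new validation raises early on contradictory arguments:
try:
    DeformableDetrConfig(backbone="resnet50", backbone_config={"model_type": "resnet"})
except ValueError as err:
    print(err)
```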
diff --git a/src/transformers/models/deformable_detr/convert_deformable_detr_to_pytorch.py b/src/transformers/models/deformable_detr/convert_deformable_detr_to_pytorch.py
index d1fd8bcbe46677..928fa368ed34c2 100644
--- a/src/transformers/models/deformable_detr/convert_deformable_detr_to_pytorch.py
+++ b/src/transformers/models/deformable_detr/convert_deformable_detr_to_pytorch.py
@@ -24,7 +24,7 @@
 from huggingface_hub import cached_download, hf_hub_url
 from PIL import Image
 
-from transformers import DeformableDetrConfig, DeformableDetrFeatureExtractor, DeformableDetrForObjectDetection
+from transformers import DeformableDetrConfig, DeformableDetrForObjectDetection, DeformableDetrImageProcessor
 from transformers.utils import logging
 
 
@@ -115,12 +115,12 @@ def convert_deformable_detr_checkpoint(
     config.id2label = id2label
     config.label2id = {v: k for k, v in id2label.items()}
 
-    # load feature extractor
-    feature_extractor = DeformableDetrFeatureExtractor(format="coco_detection")
+    # load image processor
+    image_processor = DeformableDetrImageProcessor(format="coco_detection")
 
     # prepare image
     img = prepare_img()
-    encoding = feature_extractor(images=img, return_tensors="pt")
+    encoding = image_processor(images=img, return_tensors="pt")
     pixel_values = encoding["pixel_values"]
 
     logger.info("Converting model...")
@@ -185,11 +185,11 @@ def convert_deformable_detr_checkpoint(
 
     print("Everything ok!")
 
-    # Save model and feature extractor
-    logger.info(f"Saving PyTorch model and feature extractor to {pytorch_dump_folder_path}...")
+    # Save model and image processor
+    logger.info(f"Saving PyTorch model and image processor to {pytorch_dump_folder_path}...")
     Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
     model.save_pretrained(pytorch_dump_folder_path)
-    feature_extractor.save_pretrained(pytorch_dump_folder_path)
+    image_processor.save_pretrained(pytorch_dump_folder_path)
 
     # Push to hub
     if push_to_hub:
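Since the conversion script now instantiates `DeformableDetrImageProcessor` in place of the deprecated feature extractor, a minimal usage sketch (a dummy NumPy image stands in for a real photo, and PyTorch is assumed for the tensor output):

```python
import numpy as np
from transformers import DeformableDetrImageProcessor

image_processor = DeformableDetrImageProcessor(format="coco_detection")
img = np.zeros((480, 640, 3), dtype=np.uint8)  # dummy HWC image with 0-255 pixel values
encoding = image_processor(images=img, return_tensors="pt")
print(encoding["pixel_values"].shape, encoding["pixel_mask"].shape)
```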
diff --git a/src/transformers/models/deformable_detr/feature_extraction_deformable_detr.py b/src/transformers/models/deformable_detr/feature_extraction_deformable_detr.py
index 6f1ca003a00734..f04743e91ceefe 100644
--- a/src/transformers/models/deformable_detr/feature_extraction_deformable_detr.py
+++ b/src/transformers/models/deformable_detr/feature_extraction_deformable_detr.py
@@ -16,6 +16,7 @@
 
 import warnings
 
+from ...image_transforms import rgb_to_id as _rgb_to_id
 from ...utils import logging
 from .image_processing_deformable_detr import DeformableDetrImageProcessor
 
@@ -23,6 +24,15 @@
 logger = logging.get_logger(__name__)
 
 
+def rgb_to_id(x):
+    warnings.warn(
+        "rgb_to_id has moved and will not be importable from this module from v5. "
+        "Please import from transformers.image_transforms instead.",
+        FutureWarning,
+    )
+    return _rgb_to_id(x)
+
+
 class DeformableDetrFeatureExtractor(DeformableDetrImageProcessor):
     def __init__(self, *args, **kwargs) -> None:
         warnings.warn(
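The `rgb_to_id` wrapper added above is the standard re-export-with-warning shim: the old import path keeps working but nudges callers toward `transformers.image_transforms`. A generic sketch of the pattern (the functions below are toy stand-ins, not the library code):

```python
import warnings

def _rgb_to_id_new_home(color):
    # Toy stand-in for the relocated implementation.
    r, g, b = color
    return r + 256 * g + 256 * 256 * b

def rgb_to_id(color):
    warnings.warn(
        "rgb_to_id has moved and will not be importable from this module from v5. "
        "Please import it from its new module instead.",
        FutureWarning,
    )
    return _rgb_to_id_new_home(color)

print(rgb_to_id((1, 0, 0)))  # 1, plus a one-time FutureWarning
```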
diff --git a/src/transformers/models/deformable_detr/image_processing_deformable_detr.py b/src/transformers/models/deformable_detr/image_processing_deformable_detr.py
index 3601a2aad11f68..ef4dc7f3e5763f 100644
--- a/src/transformers/models/deformable_detr/image_processing_deformable_detr.py
+++ b/src/transformers/models/deformable_detr/image_processing_deformable_detr.py
@@ -16,41 +16,43 @@
 
 import io
 import pathlib
-import warnings
 from collections import defaultdict
 from typing import Any, Callable, Dict, Iterable, List, Optional, Set, Tuple, Union
 
 import numpy as np
 
-from transformers.feature_extraction_utils import BatchFeature
-from transformers.image_processing_utils import BaseImageProcessor, get_size_dict
-from transformers.image_transforms import (
+from ...feature_extraction_utils import BatchFeature
+from ...image_processing_utils import BaseImageProcessor, get_size_dict
+from ...image_transforms import (
     PaddingMode,
     center_to_corners_format,
     corners_to_center_format,
     id_to_rgb,
-    normalize,
     pad,
     rescale,
     resize,
     rgb_to_id,
     to_channel_dimension_format,
 )
-from transformers.image_utils import (
+from ...image_utils import (
     IMAGENET_DEFAULT_MEAN,
     IMAGENET_DEFAULT_STD,
+    AnnotationFormat,
+    AnnotationType,
     ChannelDimension,
     ImageInput,
     PILImageResampling,
     get_image_size,
     infer_channel_dimension_format,
+    is_scaled_image,
     make_list_of_images,
     to_numpy_array,
-    valid_coco_detection_annotations,
-    valid_coco_panoptic_annotations,
     valid_images,
+    validate_annotations,
+    validate_preprocess_arguments,
 )
-from transformers.utils import (
+from ...utils import (
+    TensorType,
     is_flax_available,
     is_jax_tensor,
     is_scipy_available,
@@ -59,8 +61,8 @@
     is_torch_available,
     is_torch_tensor,
     is_vision_available,
+    logging,
 )
-from transformers.utils.generic import ExplicitEnum, TensorType
 
 
 if is_torch_available():
@@ -76,15 +78,9 @@
     import scipy.stats
 
 
-AnnotationType = Dict[str, Union[int, str, List[Dict]]]
+logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
 
-
-class AnnotionFormat(ExplicitEnum):
-    COCO_DETECTION = "coco_detection"
-    COCO_PANOPTIC = "coco_panoptic"
-
-
-SUPPORTED_ANNOTATION_FORMATS = (AnnotionFormat.COCO_DETECTION, AnnotionFormat.COCO_PANOPTIC)
+SUPPORTED_ANNOTATION_FORMATS = (AnnotationFormat.COCO_DETECTION, AnnotationFormat.COCO_PANOPTIC)
 
 
 # Copied from transformers.models.detr.image_processing_detr.get_size_with_aspect_ratio
@@ -121,7 +117,10 @@ def get_size_with_aspect_ratio(image_size, size, max_size=None) -> Tuple[int, in
 
 # Copied from transformers.models.detr.image_processing_detr.get_resize_output_image_size
 def get_resize_output_image_size(
-    input_image: np.ndarray, size: Union[int, Tuple[int, int], List[int]], max_size: Optional[int] = None
+    input_image: np.ndarray,
+    size: Union[int, Tuple[int, int], List[int]],
+    max_size: Optional[int] = None,
+    input_data_format: Optional[Union[str, ChannelDimension]] = None,
 ) -> Tuple[int, int]:
     """
     Computes the output image size given the input image size and the desired output size. If the desired output size
@@ -129,14 +128,16 @@ def get_resize_output_image_size(
     image size is computed by keeping the aspect ratio of the input image size.
 
     Args:
-        image_size (`Tuple[int, int]`):
-            The input image size.
-        size (`int`):
+        input_image (`np.ndarray`):
+            The image to resize.
+        size (`int` or `Tuple[int, int]` or `List[int]`):
             The desired output size.
         max_size (`int`, *optional*):
             The maximum allowed output size.
+        input_data_format (`ChannelDimension` or `str`, *optional*):
+            The channel dimension format of the input image. If not provided, it will be inferred from the input image.
     """
-    image_size = get_image_size(input_image)
+    image_size = get_image_size(input_image, input_data_format)
     if isinstance(size, (list, tuple)):
         return size
 
@@ -206,23 +207,28 @@ def max_across_indices(values: Iterable[Any]) -> List[Any]:
 
 
 # Copied from transformers.models.detr.image_processing_detr.get_max_height_width
-def get_max_height_width(images: List[np.ndarray]) -> List[int]:
+def get_max_height_width(
+    images: List[np.ndarray], input_data_format: Optional[Union[str, ChannelDimension]] = None
+) -> List[int]:
     """
     Get the maximum height and width across all images in a batch.
     """
-    input_channel_dimension = infer_channel_dimension_format(images[0])
+    if input_data_format is None:
+        input_data_format = infer_channel_dimension_format(images[0])
 
-    if input_channel_dimension == ChannelDimension.FIRST:
+    if input_data_format == ChannelDimension.FIRST:
         _, max_height, max_width = max_across_indices([img.shape for img in images])
-    elif input_channel_dimension == ChannelDimension.LAST:
+    elif input_data_format == ChannelDimension.LAST:
         max_height, max_width, _ = max_across_indices([img.shape for img in images])
     else:
-        raise ValueError(f"Invalid channel dimension format: {input_channel_dimension}")
+        raise ValueError(f"Invalid channel dimension format: {input_data_format}")
     return (max_height, max_width)
 
 
 # Copied from transformers.models.detr.image_processing_detr.make_pixel_mask
-def make_pixel_mask(image: np.ndarray, output_size: Tuple[int, int]) -> np.ndarray:
+def make_pixel_mask(
+    image: np.ndarray, output_size: Tuple[int, int], input_data_format: Optional[Union[str, ChannelDimension]] = None
+) -> np.ndarray:
     """
     Make a pixel mask for the image, where 1 indicates a valid pixel and 0 indicates padding.
 
@@ -232,7 +238,7 @@ def make_pixel_mask(image: np.ndarray, output_size: Tuple[int, int]) -> np.ndarr
         output_size (`Tuple[int, int]`):
             Output size of the mask.
     """
-    input_height, input_width = get_image_size(image)
+    input_height, input_width = get_image_size(image, channel_dim=input_data_format)
     mask = np.zeros(output_size, dtype=np.int64)
     mask[:input_height, :input_width] = 1
     return mask
@@ -274,11 +280,16 @@ def convert_coco_poly_to_mask(segmentations, height: int, width: int) -> np.ndar
 
 
 # Copied from transformers.models.detr.image_processing_detr.prepare_coco_detection_annotation with DETR->DeformableDetr
-def prepare_coco_detection_annotation(image, target, return_segmentation_masks: bool = False):
+def prepare_coco_detection_annotation(
+    image,
+    target,
+    return_segmentation_masks: bool = False,
+    input_data_format: Optional[Union[ChannelDimension, str]] = None,
+):
     """
     Convert the target in COCO format into the format expected by DeformableDetr.
     """
-    image_height, image_width = get_image_size(image)
+    image_height, image_width = get_image_size(image, channel_dim=input_data_format)
 
     image_id = target["image_id"]
     image_id = np.asarray([image_id], dtype=np.int64)
@@ -313,10 +324,13 @@ def prepare_coco_detection_annotation(image, target, return_segmentation_masks:
 
     if annotations and "keypoints" in annotations[0]:
         keypoints = [obj["keypoints"] for obj in annotations]
+        # Convert the keypoints list to a numpy array
         keypoints = np.asarray(keypoints, dtype=np.float32)
+        # Apply the keep mask here to filter the relevant annotations
+        keypoints = keypoints[keep]
         num_keypoints = keypoints.shape[0]
         keypoints = keypoints.reshape((-1, 3)) if num_keypoints else keypoints
-        new_target["keypoints"] = keypoints[keep]
+        new_target["keypoints"] = keypoints
 
     if return_segmentation_masks:
         segmentation_masks = [obj["segmentation"] for obj in annotations]
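The keypoint hunk above fixes an ordering bug: the boolean `keep` mask selects whole annotations, so it must be applied to the `(num_annotations, 3 * num_keypoints)` array before the flatten to `(-1, 3)`. A tiny NumPy illustration:

```python
import numpy as np

# Two annotations, each with two (x, y, visibility) keypoints.
keypoints = np.array(
    [[0.0, 0.0, 1.0, 10.0, 10.0, 1.0],
     [5.0, 5.0, 1.0, 20.0, 20.0, 1.0]],
    dtype=np.float32,
)
keep = np.array([False, True])  # only the second annotation survives filtering

kept = keypoints[keep]           # (1, 6): filter whole annotations first
print(kept.reshape((-1, 3)))     # (2, 3): then flatten into keypoint triplets
```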
@@ -363,12 +377,16 @@ def masks_to_boxes(masks: np.ndarray) -> np.ndarray:
 
 # Copied from transformers.models.detr.image_processing_detr.prepare_coco_panoptic_annotation with DETR->DeformableDetr
 def prepare_coco_panoptic_annotation(
-    image: np.ndarray, target: Dict, masks_path: Union[str, pathlib.Path], return_masks: bool = True
+    image: np.ndarray,
+    target: Dict,
+    masks_path: Union[str, pathlib.Path],
+    return_masks: bool = True,
+    input_data_format: Union[ChannelDimension, str] = None,
 ) -> Dict:
     """
     Prepare a coco panoptic annotation for DeformableDetr.
     """
-    image_height, image_width = get_image_size(image)
+    image_height, image_width = get_image_size(image, channel_dim=input_data_format)
     annotation_path = pathlib.Path(masks_path) / target["file_name"]
 
     new_target = {}
@@ -602,7 +620,7 @@ def binary_mask_to_rle(mask):
     pixels = np.concatenate([[0], pixels, [0]])
     runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
     runs[1::2] -= runs[::2]
-    return [x for x in runs]
+    return list(runs)
 
 
 # Copied from transformers.models.detr.image_processing_detr.convert_segmentation_to_rle
@@ -766,9 +784,14 @@ class DeformableDetrImageProcessor(BaseImageProcessor):
         image_std (`float` or `List[float]`, *optional*, defaults to `IMAGENET_DEFAULT_STD`):
             Standard deviation values to use when normalizing the image. Can be a single value or a list of values, one
             for each channel. Can be overridden by the `image_std` parameter in the `preprocess` method.
+        do_convert_annotations (`bool`, *optional*, defaults to `True`):
+            Controls whether to convert the annotations to the format expected by the DETR model. Converts the
+            bounding boxes to the format `(center_x, center_y, width, height)` and in the range `[0, 1]`.
+            Can be overridden by the `do_convert_annotations` parameter in the `preprocess` method.
         do_pad (`bool`, *optional*, defaults to `True`):
-            Controls whether to pad the image to the largest image in a batch and create a pixel mask. Can be
-            overridden by the `do_pad` parameter in the `preprocess` method.
+            Controls whether to pad the image. Can be overridden by the `do_pad` parameter in the `preprocess`
+            method. If `True` will pad the images in the batch to the largest height and width in the batch.
+            Padding will be applied to the bottom and right of the image with zeros.
     """
 
     model_input_names = ["pixel_values", "pixel_mask"]
@@ -776,7 +799,7 @@ class DeformableDetrImageProcessor(BaseImageProcessor):
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.__init__
     def __init__(
         self,
-        format: Union[str, AnnotionFormat] = AnnotionFormat.COCO_DETECTION,
+        format: Union[str, AnnotationFormat] = AnnotationFormat.COCO_DETECTION,
         do_resize: bool = True,
         size: Dict[str, int] = None,
         resample: PILImageResampling = PILImageResampling.BILINEAR,
@@ -785,6 +808,7 @@ def __init__(
         do_normalize: bool = True,
         image_mean: Union[float, List[float]] = None,
         image_std: Union[float, List[float]] = None,
+        do_convert_annotations: Optional[bool] = None,
         do_pad: bool = True,
         **kwargs,
     ) -> None:
@@ -792,10 +816,9 @@ def __init__(
             do_pad = kwargs.pop("pad_and_return_pixel_mask")
 
         if "max_size" in kwargs:
-            warnings.warn(
+            logger.warning_once(
                 "The `max_size` parameter is deprecated and will be removed in v4.26. "
                 "Please specify in `size['longest_edge'] instead`.",
-                FutureWarning,
             )
             max_size = kwargs.pop("max_size")
         else:
@@ -804,6 +827,10 @@ def __init__(
         size = size if size is not None else {"shortest_edge": 800, "longest_edge": 1333}
         size = get_size_dict(size, max_size=max_size, default_to_square=False)
 
+        # Backwards compatibility
+        if do_convert_annotations is None:
+            do_convert_annotations = do_normalize
+
         super().__init__(**kwargs)
         self.format = format
         self.do_resize = do_resize
@@ -812,20 +839,11 @@ def __init__(
         self.do_rescale = do_rescale
         self.rescale_factor = rescale_factor
         self.do_normalize = do_normalize
+        self.do_convert_annotations = do_convert_annotations
         self.image_mean = image_mean if image_mean is not None else IMAGENET_DEFAULT_MEAN
         self.image_std = image_std if image_std is not None else IMAGENET_DEFAULT_STD
         self.do_pad = do_pad
 
-    @property
-    # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.max_size
-    def max_size(self):
-        warnings.warn(
-            "The `max_size` parameter is deprecated and will be removed in v4.27. "
-            "Please specify in `size['longest_edge'] instead`.",
-            FutureWarning,
-        )
-        return self.size["longest_edge"]
-
     @classmethod
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.from_dict with Detr->DeformableDetr
     def from_dict(cls, image_processor_dict: Dict[str, Any], **kwargs):
@@ -846,22 +864,29 @@ def prepare_annotation(
         self,
         image: np.ndarray,
         target: Dict,
-        format: Optional[AnnotionFormat] = None,
+        format: Optional[AnnotationFormat] = None,
         return_segmentation_masks: bool = None,
         masks_path: Optional[Union[str, pathlib.Path]] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
     ) -> Dict:
         """
         Prepare an annotation for feeding into DeformableDetr model.
         """
         format = format if format is not None else self.format
 
-        if format == AnnotionFormat.COCO_DETECTION:
+        if format == AnnotationFormat.COCO_DETECTION:
             return_segmentation_masks = False if return_segmentation_masks is None else return_segmentation_masks
-            target = prepare_coco_detection_annotation(image, target, return_segmentation_masks)
-        elif format == AnnotionFormat.COCO_PANOPTIC:
+            target = prepare_coco_detection_annotation(
+                image, target, return_segmentation_masks, input_data_format=input_data_format
+            )
+        elif format == AnnotationFormat.COCO_PANOPTIC:
             return_segmentation_masks = True if return_segmentation_masks is None else return_segmentation_masks
             target = prepare_coco_panoptic_annotation(
-                image, target, masks_path=masks_path, return_masks=return_segmentation_masks
+                image,
+                target,
+                masks_path=masks_path,
+                return_masks=return_segmentation_masks,
+                input_data_format=input_data_format,
             )
         else:
             raise ValueError(f"Format {format} is not supported.")
@@ -869,8 +894,8 @@ def prepare_annotation(
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.prepare
     def prepare(self, image, target, return_segmentation_masks=None, masks_path=None):
-        warnings.warn(
-            "The `prepare` method is deprecated and will be removed in a future version. "
+        logger.warning_once(
+            "The `prepare` method is deprecated and will be removed in v4.33. "
             "Please use `prepare_annotation` instead. Note: the `prepare_annotation` method "
             "does not return the image anymore.",
         )
@@ -879,17 +904,17 @@ def prepare(self, image, target, return_segmentation_masks=None, masks_path=None
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.convert_coco_poly_to_mask
     def convert_coco_poly_to_mask(self, *args, **kwargs):
-        warnings.warn("The `convert_coco_poly_to_mask` method is deprecated and will be removed in a future version. ")
+        logger.warning_once("The `convert_coco_poly_to_mask` method is deprecated and will be removed in v4.33. ")
         return convert_coco_poly_to_mask(*args, **kwargs)
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.prepare_coco_detection
     def prepare_coco_detection(self, *args, **kwargs):
-        warnings.warn("The `prepare_coco_detection` method is deprecated and will be removed in a future version. ")
+        logger.warning_once("The `prepare_coco_detection` method is deprecated and will be removed in v4.33. ")
         return prepare_coco_detection_annotation(*args, **kwargs)
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.prepare_coco_panoptic
     def prepare_coco_panoptic(self, *args, **kwargs):
-        warnings.warn("The `prepare_coco_panoptic` method is deprecated and will be removed in a future version. ")
+        logger.warning_once("The `prepare_coco_panoptic` method is deprecated and will be removed in v4.33. ")
         return prepare_coco_panoptic_annotation(*args, **kwargs)
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.resize
@@ -899,24 +924,40 @@ def resize(
         size: Dict[str, int],
         resample: PILImageResampling = PILImageResampling.BILINEAR,
         data_format: Optional[ChannelDimension] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
         **kwargs,
     ) -> np.ndarray:
         """
         Resize the image to the given size. Size can be `min_size` (scalar) or `(height, width)` tuple. If size is an
         int, smaller edge of the image will be matched to this number.
+
+        Args:
+            image (`np.ndarray`):
+                Image to resize.
+            size (`Dict[str, int]`):
+                Dictionary containing the size to resize to. Can contain the keys `shortest_edge` and `longest_edge` or
+                `height` and `width`.
+            resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BILINEAR`):
+                Resampling filter to use if resizing the image.
+            data_format (`str` or `ChannelDimension`, *optional*):
+                The channel dimension format for the output image. If unset, the channel dimension format of the input
+                image is used.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format of the input image. If not provided, it will be inferred.
         """
         if "max_size" in kwargs:
-            warnings.warn(
+            logger.warning_once(
                 "The `max_size` parameter is deprecated and will be removed in v4.26. "
                 "Please specify in `size['longest_edge'] instead`.",
-                FutureWarning,
             )
             max_size = kwargs.pop("max_size")
         else:
             max_size = None
         size = get_size_dict(size, max_size=max_size, default_to_square=False)
         if "shortest_edge" in size and "longest_edge" in size:
-            size = get_resize_output_image_size(image, size["shortest_edge"], size["longest_edge"])
+            size = get_resize_output_image_size(
+                image, size["shortest_edge"], size["longest_edge"], input_data_format=input_data_format
+            )
         elif "height" in size and "width" in size:
             size = (size["height"], size["width"])
         else:
@@ -924,7 +965,9 @@ def resize(
                 "Size must contain 'height' and 'width' keys or 'shortest_edge' and 'longest_edge' keys. Got"
                 f" {size.keys()}."
             )
-        image = resize(image, size=size, resample=resample, data_format=data_format)
+        image = resize(
+            image, size=size, resample=resample, data_format=data_format, input_data_format=input_data_format, **kwargs
+        )
         return image
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.resize_annotation
@@ -943,130 +986,195 @@ def resize_annotation(
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.rescale
     def rescale(
-        self, image: np.ndarray, rescale_factor: Union[float, int], data_format: Optional[ChannelDimension] = None
-    ) -> np.ndarray:
-        """
-        Rescale the image by the given factor.
-        """
-        return rescale(image, rescale_factor, data_format=data_format)
-
-    # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.normalize
-    def normalize(
         self,
         image: np.ndarray,
-        mean: Union[float, Iterable[float]],
-        std: Union[float, Iterable[float]],
-        data_format: Optional[ChannelDimension] = None,
+        rescale_factor: float,
+        data_format: Optional[Union[str, ChannelDimension]] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
     ) -> np.ndarray:
         """
-        Normalize the image with the given mean and standard deviation.
+        Rescale the image by the given factor. image = image * rescale_factor.
+
+        Args:
+            image (`np.ndarray`):
+                Image to rescale.
+            rescale_factor (`float`):
+                The value to use for rescaling.
+            data_format (`str` or `ChannelDimension`, *optional*):
+                The channel dimension format for the output image. If unset, the channel dimension format of the input
+                image is used. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+            input_data_format (`str` or `ChannelDimension`, *optional*):
+                The channel dimension format for the input image. If unset, is inferred from the input image. Can be
+                one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
         """
-        return normalize(image, mean=mean, std=std, data_format=data_format)
+        return rescale(image, rescale_factor, data_format=data_format, input_data_format=input_data_format)
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.normalize_annotation
     def normalize_annotation(self, annotation: Dict, image_size: Tuple[int, int]) -> Dict:
         """
         Normalize the boxes in the annotation from `[top_left_x, top_left_y, bottom_right_x, bottom_right_y]` to
-        `[center_x, center_y, width, height]` format.
+        `[center_x, center_y, width, height]` format and from absolute to relative pixel values.
         """
         return normalize_annotation(annotation, image_size=image_size)
 
-    # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.pad_and_create_pixel_mask
-    def pad_and_create_pixel_mask(
+    # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor._update_annotation_for_padded_image
+    def _update_annotation_for_padded_image(
         self,
-        pixel_values_list: List[ImageInput],
-        return_tensors: Optional[Union[str, TensorType]] = None,
-        data_format: Optional[ChannelDimension] = None,
-    ) -> BatchFeature:
+        annotation: Dict,
+        input_image_size: Tuple[int, int],
+        output_image_size: Tuple[int, int],
+        padding,
+        update_bboxes,
+    ) -> Dict:
         """
-        Pads a batch of images with zeros to the size of largest height and width in the batch and returns their
-        corresponding pixel mask.
-
-        Args:
-            images (`List[np.ndarray]`):
-                Batch of images to pad.
-            return_tensors (`str` or `TensorType`, *optional*):
-                The type of tensors to return. Can be one of:
-                    - Unset: Return a list of `np.ndarray`.
-                    - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
-                    - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
-                    - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
-                    - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
-            data_format (`str` or `ChannelDimension`, *optional*):
-                The channel dimension format of the image. If not provided, it will be the same as the input image.
+        Update the annotation for a padded image.
         """
-        warnings.warn(
-            "This method is deprecated and will be removed in v4.27.0. Please use pad instead.", FutureWarning
-        )
-        # pad expects a list of np.ndarray, but the previous feature extractors expected torch tensors
-        images = [to_numpy_array(image) for image in pixel_values_list]
-        return self.pad(
-            images=images,
-            return_pixel_mask=True,
-            return_tensors=return_tensors,
-            data_format=data_format,
-        )
+        new_annotation = {}
+        new_annotation["size"] = output_image_size
+
+        for key, value in annotation.items():
+            if key == "masks":
+                masks = value
+                masks = pad(
+                    masks,
+                    padding,
+                    mode=PaddingMode.CONSTANT,
+                    constant_values=0,
+                    input_data_format=ChannelDimension.FIRST,
+                )
+                masks = safe_squeeze(masks, 1)
+                new_annotation["masks"] = masks
+            elif key == "boxes" and update_bboxes:
+                boxes = value
+                boxes *= np.asarray(
+                    [
+                        input_image_size[1] / output_image_size[1],
+                        input_image_size[0] / output_image_size[0],
+                        input_image_size[1] / output_image_size[1],
+                        input_image_size[0] / output_image_size[0],
+                    ]
+                )
+                new_annotation["boxes"] = boxes
+            elif key == "size":
+                new_annotation["size"] = output_image_size
+            else:
+                new_annotation[key] = value
+        return new_annotation
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor._pad_image
     def _pad_image(
         self,
         image: np.ndarray,
         output_size: Tuple[int, int],
+        annotation: Optional[Dict[str, Any]] = None,
         constant_values: Union[float, Iterable[float]] = 0,
         data_format: Optional[ChannelDimension] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
+        update_bboxes: bool = True,
     ) -> np.ndarray:
         """
         Pad an image with zeros to the given size.
         """
-        input_height, input_width = get_image_size(image)
+        input_height, input_width = get_image_size(image, channel_dim=input_data_format)
         output_height, output_width = output_size
 
         pad_bottom = output_height - input_height
         pad_right = output_width - input_width
         padding = ((0, pad_bottom), (0, pad_right))
         padded_image = pad(
-            image, padding, mode=PaddingMode.CONSTANT, constant_values=constant_values, data_format=data_format
+            image,
+            padding,
+            mode=PaddingMode.CONSTANT,
+            constant_values=constant_values,
+            data_format=data_format,
+            input_data_format=input_data_format,
         )
-        return padded_image
+        if annotation is not None:
+            annotation = self._update_annotation_for_padded_image(
+                annotation, (input_height, input_width), (output_height, output_width), padding, update_bboxes
+            )
+        return padded_image, annotation
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.pad
     def pad(
         self,
         images: List[np.ndarray],
+        annotations: Optional[Union[AnnotationType, List[AnnotationType]]] = None,
         constant_values: Union[float, Iterable[float]] = 0,
         return_pixel_mask: bool = True,
         return_tensors: Optional[Union[str, TensorType]] = None,
         data_format: Optional[ChannelDimension] = None,
-    ) -> np.ndarray:
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
+        update_bboxes: bool = True,
+    ) -> BatchFeature:
         """
         Pads a batch of images to the bottom and right of the image with zeros to the size of largest height and width
         in the batch and optionally returns their corresponding pixel mask.
 
         Args:
-            image (`np.ndarray`):
-                Image to pad.
+            images (List[`np.ndarray`]):
+                Images to pad.
+            annotations (`AnnotationType` or `List[AnnotationType]`, *optional*):
+                Annotations to transform according to the padding that is applied to the images.
             constant_values (`float` or `Iterable[float]`, *optional*):
                 The value to use for the padding if `mode` is `"constant"`.
             return_pixel_mask (`bool`, *optional*, defaults to `True`):
                 Whether to return a pixel mask.
-            input_channel_dimension (`ChannelDimension`, *optional*):
-                The channel dimension format of the image. If not provided, it will be inferred from the input image.
+            return_tensors (`str` or `TensorType`, *optional*):
+                The type of tensors to return. Can be one of:
+                    - Unset: Return a list of `np.ndarray`.
+                    - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
+                    - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
+                    - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
+                    - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
             data_format (`str` or `ChannelDimension`, *optional*):
                 The channel dimension format of the image. If not provided, it will be the same as the input image.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format of the input image. If not provided, it will be inferred.
+            update_bboxes (`bool`, *optional*, defaults to `True`):
+                Whether to update the bounding boxes in the annotations to match the padded images. If the
+                bounding boxes have not been converted to relative coordinates and `(centre_x, centre_y, width, height)`
+                format, the bounding boxes will not be updated.
         """
-        pad_size = get_max_height_width(images)
+        pad_size = get_max_height_width(images, input_data_format=input_data_format)
+
+        annotation_list = annotations if annotations is not None else [None] * len(images)
+        padded_images = []
+        padded_annotations = []
+        for image, annotation in zip(images, annotation_list):
+            padded_image, padded_annotation = self._pad_image(
+                image,
+                pad_size,
+                annotation,
+                constant_values=constant_values,
+                data_format=data_format,
+                input_data_format=input_data_format,
+                update_bboxes=update_bboxes,
+            )
+            padded_images.append(padded_image)
+            padded_annotations.append(padded_annotation)
 
-        padded_images = [
-            self._pad_image(image, pad_size, constant_values=constant_values, data_format=data_format)
-            for image in images
-        ]
         data = {"pixel_values": padded_images}
 
         if return_pixel_mask:
-            masks = [make_pixel_mask(image=image, output_size=pad_size) for image in images]
+            masks = [
+                make_pixel_mask(image=image, output_size=pad_size, input_data_format=input_data_format)
+                for image in images
+            ]
             data["pixel_mask"] = masks
 
-        return BatchFeature(data=data, tensor_type=return_tensors)
+        encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)
+
+        if annotations is not None:
+            encoded_inputs["labels"] = [
+                BatchFeature(annotation, tensor_type=return_tensors) for annotation in padded_annotations
+            ]
+
+        return encoded_inputs
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.preprocess
     def preprocess(
@@ -1081,12 +1189,14 @@ def preprocess(
         do_rescale: Optional[bool] = None,
         rescale_factor: Optional[Union[int, float]] = None,
         do_normalize: Optional[bool] = None,
+        do_convert_annotations: Optional[bool] = None,
         image_mean: Optional[Union[float, List[float]]] = None,
         image_std: Optional[Union[float, List[float]]] = None,
         do_pad: Optional[bool] = None,
-        format: Optional[Union[str, AnnotionFormat]] = None,
+        format: Optional[Union[str, AnnotationFormat]] = None,
         return_tensors: Optional[Union[TensorType, str]] = None,
         data_format: Union[str, ChannelDimension] = ChannelDimension.FIRST,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
         **kwargs,
     ) -> BatchFeature:
         """
@@ -1094,14 +1204,15 @@ def preprocess(
 
         Args:
             images (`ImageInput`):
-                Image or batch of images to preprocess.
+                Image or batch of images to preprocess. Expects a single or batch of images with pixel values ranging
+                from 0 to 255. If passing in images with pixel values between 0 and 1, set `do_rescale=False`.
             annotations (`AnnotationType` or `List[AnnotationType]`, *optional*):
-                List of annotations associated with the image or batch of images. If annotionation is for object
+                List of annotations associated with the image or batch of images. If annotation is for object
                 detection, the annotations should be a dictionary with the following keys:
                 - "image_id" (`int`): The image id.
                 - "annotations" (`List[Dict]`): List of annotations for an image. Each annotation should be a
                   dictionary. An image can have no annotations, in which case the list should be empty.
-                If annotionation is for segmentation, the annotations should be a dictionary with the following keys:
+                If annotation is for segmentation, the annotations should be a dictionary with the following keys:
                 - "image_id" (`int`): The image id.
                 - "segments_info" (`List[Dict]`): List of segments for an image. Each segment should be a dictionary.
                   An image can have no segments, in which case the list should be empty.
@@ -1122,33 +1233,45 @@ def preprocess(
                 Rescale factor to use when rescaling the image.
             do_normalize (`bool`, *optional*, defaults to self.do_normalize):
                 Whether to normalize the image.
+            do_convert_annotations (`bool`, *optional*, defaults to self.do_convert_annotations):
+                Whether to convert the annotations to the format expected by the model. Converts the bounding
+                boxes from the format `(top_left_x, top_left_y, width, height)` to `(center_x, center_y, width, height)`
+                and to relative coordinates.
             image_mean (`float` or `List[float]`, *optional*, defaults to self.image_mean):
                 Mean to use when normalizing the image.
             image_std (`float` or `List[float]`, *optional*, defaults to self.image_std):
                 Standard deviation to use when normalizing the image.
             do_pad (`bool`, *optional*, defaults to self.do_pad):
-                Whether to pad the image.
-            format (`str` or `AnnotionFormat`, *optional*, defaults to self.format):
+                Whether to pad the image. If `True` will pad the images in the batch to the largest image in the batch
+                and create a pixel mask. Padding will be applied to the bottom and right of the image with zeros.
+            format (`str` or `AnnotationFormat`, *optional*, defaults to self.format):
                 Format of the annotations.
             return_tensors (`str` or `TensorType`, *optional*, defaults to self.return_tensors):
                 Type of tensors to return. If `None`, will return the list of images.
-            data_format (`str` or `ChannelDimension`, *optional*, defaults to self.data_format):
-                The channel dimension format of the image. If not provided, it will be the same as the input image.
+            data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
+                The channel dimension format for the output image. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - Unset: Use the channel dimension format of the input image.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format for the input image. If unset, the channel dimension format is inferred
+                from the input image. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
         """
         if "pad_and_return_pixel_mask" in kwargs:
-            warnings.warn(
+            logger.warning_once(
                 "The `pad_and_return_pixel_mask` argument is deprecated and will be removed in a future version, "
-                "use `do_pad` instead.",
-                FutureWarning,
+                "use `do_pad` instead."
             )
             do_pad = kwargs.pop("pad_and_return_pixel_mask")
 
         max_size = None
         if "max_size" in kwargs:
-            warnings.warn(
+            logger.warning_once(
                 "The `max_size` argument is deprecated and will be removed in a future version, use"
-                " `size['longest_edge']` instead.",
-                FutureWarning,
+                " `size['longest_edge']` instead."
             )
             size = kwargs.pop("max_size")
 
@@ -1161,19 +1284,33 @@ def preprocess(
         do_normalize = self.do_normalize if do_normalize is None else do_normalize
         image_mean = self.image_mean if image_mean is None else image_mean
         image_std = self.image_std if image_std is None else image_std
+        do_convert_annotations = (
+            self.do_convert_annotations if do_convert_annotations is None else do_convert_annotations
+        )
         do_pad = self.do_pad if do_pad is None else do_pad
         format = self.format if format is None else format
 
-        if do_resize is not None and size is None:
-            raise ValueError("Size and max_size must be specified if do_resize is True.")
+        images = make_list_of_images(images)
 
-        if do_rescale is not None and rescale_factor is None:
-            raise ValueError("Rescale factor must be specified if do_rescale is True.")
+        if not valid_images(images):
+            raise ValueError(
+                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
+                "torch.Tensor, tf.Tensor or jax.ndarray."
+            )
 
-        if do_normalize is not None and (image_mean is None or image_std is None):
-            raise ValueError("Image mean and std must be specified if do_normalize is True.")
+        # Here, the pad() method pads to the maximum (width, height) in the batch, so there is no pad `size` to validate.
+
+        validate_preprocess_arguments(
+            do_rescale=do_rescale,
+            rescale_factor=rescale_factor,
+            do_normalize=do_normalize,
+            image_mean=image_mean,
+            image_std=image_std,
+            do_resize=do_resize,
+            size=size,
+            resample=resample,
+        )
 
-        images = make_list_of_images(images)
         if annotations is not None and isinstance(annotations, dict):
             annotations = [annotations]
 
@@ -1182,34 +1319,13 @@ def preprocess(
                 f"The number of images ({len(images)}) and annotations ({len(annotations)}) do not match."
             )
 
-        if not valid_images(images):
-            raise ValueError(
-                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
-                "torch.Tensor, tf.Tensor or jax.ndarray."
-            )
-
-        format = AnnotionFormat(format)
+        format = AnnotationFormat(format)
         if annotations is not None:
-            if format == AnnotionFormat.COCO_DETECTION and not valid_coco_detection_annotations(annotations):
-                raise ValueError(
-                    "Invalid COCO detection annotations. Annotations must a dict (single image) of list of dicts"
-                    "(batch of images) with the following keys: `image_id` and `annotations`, with the latter "
-                    "being a list of annotations in the COCO format."
-                )
-            elif format == AnnotionFormat.COCO_PANOPTIC and not valid_coco_panoptic_annotations(annotations):
-                raise ValueError(
-                    "Invalid COCO panoptic annotations. Annotations must a dict (single image) of list of dicts "
-                    "(batch of images) with the following keys: `image_id`, `file_name` and `segments_info`, with "
-                    "the latter being a list of annotations in the COCO format."
-                )
-            elif format not in SUPPORTED_ANNOTATION_FORMATS:
-                raise ValueError(
-                    f"Unsupported annotation format: {format} must be one of {SUPPORTED_ANNOTATION_FORMATS}"
-                )
+            validate_annotations(format, SUPPORTED_ANNOTATION_FORMATS, annotations)
 
         if (
             masks_path is not None
-            and format == AnnotionFormat.COCO_PANOPTIC
+            and format == AnnotationFormat.COCO_PANOPTIC
             and not isinstance(masks_path, (pathlib.Path, str))
         ):
             raise ValueError(
@@ -1220,13 +1336,28 @@ def preprocess(
         # All transformations expect numpy arrays
         images = [to_numpy_array(image) for image in images]
 
+        if is_scaled_image(images[0]) and do_rescale:
+            logger.warning_once(
+                "It looks like you are trying to rescale already rescaled images. If the input"
+                " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
+            )
+
+        if input_data_format is None:
+            # We assume that all images have the same channel dimension format.
+            input_data_format = infer_channel_dimension_format(images[0])
+
         # prepare (COCO annotations as a list of Dict -> DETR target as a single Dict per image)
         if annotations is not None:
             prepared_images = []
             prepared_annotations = []
             for image, target in zip(images, annotations):
                 target = self.prepare_annotation(
-                    image, target, format, return_segmentation_masks=return_segmentation_masks, masks_path=masks_path
+                    image,
+                    target,
+                    format,
+                    return_segmentation_masks=return_segmentation_masks,
+                    masks_path=masks_path,
+                    input_data_format=input_data_format,
                 )
                 prepared_images.append(image)
                 prepared_annotations.append(target)
@@ -1239,40 +1370,59 @@ def preprocess(
             if annotations is not None:
                 resized_images, resized_annotations = [], []
                 for image, target in zip(images, annotations):
-                    orig_size = get_image_size(image)
-                    resized_image = self.resize(image, size=size, max_size=max_size, resample=resample)
-                    resized_annotation = self.resize_annotation(target, orig_size, get_image_size(resized_image))
+                    orig_size = get_image_size(image, input_data_format)
+                    resized_image = self.resize(
+                        image, size=size, max_size=max_size, resample=resample, input_data_format=input_data_format
+                    )
+                    resized_annotation = self.resize_annotation(
+                        target, orig_size, get_image_size(resized_image, input_data_format)
+                    )
                     resized_images.append(resized_image)
                     resized_annotations.append(resized_annotation)
                 images = resized_images
                 annotations = resized_annotations
                 del resized_images, resized_annotations
             else:
-                images = [self.resize(image, size=size, resample=resample) for image in images]
+                images = [
+                    self.resize(image, size=size, resample=resample, input_data_format=input_data_format)
+                    for image in images
+                ]
 
         if do_rescale:
-            images = [self.rescale(image, rescale_factor) for image in images]
+            images = [self.rescale(image, rescale_factor, input_data_format=input_data_format) for image in images]
 
         if do_normalize:
-            images = [self.normalize(image, image_mean, image_std) for image in images]
-            if annotations is not None:
-                annotations = [
-                    self.normalize_annotation(annotation, get_image_size(image))
-                    for annotation, image in zip(annotations, images)
-                ]
+            images = [
+                self.normalize(image, image_mean, image_std, input_data_format=input_data_format) for image in images
+            ]
+
+        if do_convert_annotations and annotations is not None:
+            annotations = [
+                self.normalize_annotation(annotation, get_image_size(image, input_data_format))
+                for annotation, image in zip(annotations, images)
+            ]
 
         if do_pad:
             # Pads images and returns their mask: {'pixel_values': ..., 'pixel_mask': ...}
-            data = self.pad(images, return_pixel_mask=True, data_format=data_format)
+            encoded_inputs = self.pad(
+                images,
+                annotations=annotations,
+                return_pixel_mask=True,
+                data_format=data_format,
+                input_data_format=input_data_format,
+                return_tensors=return_tensors,
+                update_bboxes=do_convert_annotations,
+            )
         else:
-            images = [to_channel_dimension_format(image, data_format) for image in images]
-            data = {"pixel_values": images}
-
-        encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)
-        if annotations is not None:
-            encoded_inputs["labels"] = [
-                BatchFeature(annotation, tensor_type=return_tensors) for annotation in annotations
+            images = [
+                to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
+                for image in images
             ]
+            encoded_inputs = BatchFeature(data={"pixel_values": images}, tensor_type=return_tensors)
+            if annotations is not None:
+                encoded_inputs["labels"] = [
+                    BatchFeature(annotation, tensor_type=return_tensors) for annotation in annotations
+                ]
 
         return encoded_inputs
 
@@ -1293,10 +1443,9 @@ def post_process(self, outputs, target_sizes):
             `List[Dict]`: A list of dictionaries, each dictionary containing the scores, labels and boxes for an image
             in the batch as predicted by the model.
         """
-        warnings.warn(
+        logger.warning_once(
             "`post_process` is deprecated and will be removed in v5 of Transformers, please use"
-            " `post_process_object_detection`.",
-            FutureWarning,
+            " `post_process_object_detection` instead, with `threshold=0.` for equivalent results.",
         )
 
         out_logits, out_bbox = outputs.logits, outputs.pred_boxes
@@ -1309,7 +1458,7 @@ def post_process(self, outputs, target_sizes):
         prob = out_logits.sigmoid()
         topk_values, topk_indexes = torch.topk(prob.view(out_logits.shape[0], -1), 100, dim=1)
         scores = topk_values
-        topk_boxes = topk_indexes // out_logits.shape[2]
+        topk_boxes = torch.div(topk_indexes, out_logits.shape[2], rounding_mode="floor")
         labels = topk_indexes % out_logits.shape[2]
         boxes = center_to_corners_format(out_bbox)
         boxes = torch.gather(boxes, 1, topk_boxes.unsqueeze(-1).repeat(1, 1, 4))
@@ -1324,7 +1473,7 @@ def post_process(self, outputs, target_sizes):
         return results
 
     def post_process_object_detection(
-        self, outputs, threshold: float = 0.5, target_sizes: Union[TensorType, List[Tuple]] = None
+        self, outputs, threshold: float = 0.5, target_sizes: Union[TensorType, List[Tuple]] = None, top_k: int = 100
     ):
         """
         Converts the raw output of [`DeformableDetrForObjectDetection`] into final bounding boxes in (top_left_x,
@@ -1338,6 +1487,8 @@ def post_process_object_detection(
             target_sizes (`torch.Tensor` or `List[Tuple[int, int]]`, *optional*):
                 Tensor of shape `(batch_size, 2)` or list of tuples (`Tuple[int, int]`) containing the target size
                 (height, width) of each image in the batch. If left to None, predictions will not be resized.
+            top_k (`int`, *optional*, defaults to 100):
+                Keep only top k bounding boxes before filtering by thresholding.
 
         Returns:
             `List[Dict]`: A list of dictionaries, each dictionary containing the scores, labels and boxes for an image
@@ -1352,21 +1503,24 @@ def post_process_object_detection(
                 )
 
         prob = out_logits.sigmoid()
-        topk_values, topk_indexes = torch.topk(prob.view(out_logits.shape[0], -1), 100, dim=1)
+        prob = prob.view(out_logits.shape[0], -1)
+        k_value = min(top_k, prob.size(1))
+        topk_values, topk_indexes = torch.topk(prob, k_value, dim=1)
         scores = topk_values
-        topk_boxes = topk_indexes // out_logits.shape[2]
+        topk_boxes = torch.div(topk_indexes, out_logits.shape[2], rounding_mode="floor")
         labels = topk_indexes % out_logits.shape[2]
         boxes = center_to_corners_format(out_bbox)
         boxes = torch.gather(boxes, 1, topk_boxes.unsqueeze(-1).repeat(1, 1, 4))
 
         # and from relative [0, 1] to absolute [0, height] coordinates
-        if isinstance(target_sizes, List):
-            img_h = torch.Tensor([i[0] for i in target_sizes])
-            img_w = torch.Tensor([i[1] for i in target_sizes])
-        else:
-            img_h, img_w = target_sizes.unbind(1)
-        scale_fct = torch.stack([img_w, img_h, img_w, img_h], dim=1).to(boxes.device)
-        boxes = boxes * scale_fct[:, None, :]
+        if target_sizes is not None:
+            if isinstance(target_sizes, List):
+                img_h = torch.Tensor([i[0] for i in target_sizes])
+                img_w = torch.Tensor([i[1] for i in target_sizes])
+            else:
+                img_h, img_w = target_sizes.unbind(1)
+            scale_fct = torch.stack([img_w, img_h, img_w, img_h], dim=1).to(boxes.device)
+            boxes = boxes * scale_fct[:, None, :]
 
         results = []
         for s, l, b in zip(scores, labels, boxes):
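Taken together, the image-processing changes above surface `input_data_format`, annotation handling inside `pad()`, and a `top_k` cap in `post_process_object_detection`. A minimal end-to-end usage sketch, assuming a build of `transformers` with these changes, internet access for the sample COCO image, and the public `SenseTime/deformable-detr` checkpoint:

```
import requests
import torch
from PIL import Image

from transformers import AutoImageProcessor, DeformableDetrForObjectDetection

# Sample COCO image; any RGB image works.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("SenseTime/deformable-detr")
model = DeformableDetrForObjectDetection.from_pretrained("SenseTime/deformable-detr")

# preprocess() resizes, rescales, normalizes and (with do_pad=True) pads the batch
# to the largest image, returning a pixel mask alongside the pixel values.
inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The new `top_k` argument caps the number of candidate boxes kept before thresholding.
target_sizes = torch.tensor([image.size[::-1]])
results = image_processor.post_process_object_detection(
    outputs, threshold=0.5, target_sizes=target_sizes, top_k=100
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```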
diff --git a/src/transformers/models/deformable_detr/load_custom.py b/src/transformers/models/deformable_detr/load_custom.py
index d2a8bc0cb2c074..c3a822e2764170 100644
--- a/src/transformers/models/deformable_detr/load_custom.py
+++ b/src/transformers/models/deformable_detr/load_custom.py
@@ -13,16 +13,16 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """ Loading of Deformable DETR's CUDA kernels"""
-
 import os
+from pathlib import Path
 
 
 def load_cuda_kernels():
     from torch.utils.cpp_extension import load
 
-    root = os.path.join(os.path.dirname(os.path.realpath(__file__)), "custom_kernel")
+    root = Path(__file__).resolve().parent.parent.parent / "kernels" / "deformable_detr"
     src_files = [
-        os.path.join(root, filename)
+        root / filename
         for filename in [
             "vision.cpp",
             os.path.join("cpu", "ms_deform_attn_cpu.cpp"),
@@ -33,10 +33,8 @@ def load_cuda_kernels():
     load(
         "MultiScaleDeformableAttention",
         src_files,
-        # verbose=True,
         with_cuda=True,
-        extra_include_paths=[root],
-        # build_directory=os.path.dirname(os.path.realpath(__file__)),
+        extra_include_paths=[str(root)],
         extra_cflags=["-DWITH_CUDA=1"],
         extra_cuda_cflags=[
             "-DCUDA_HAS_FP16=1",
diff --git a/src/transformers/models/deformable_detr/modeling_deformable_detr.py b/src/transformers/models/deformable_detr/modeling_deformable_detr.py
index a7ee782501a2ab..640c05257cc967 100755
--- a/src/transformers/models/deformable_detr/modeling_deformable_detr.py
+++ b/src/transformers/models/deformable_detr/modeling_deformable_detr.py
@@ -17,9 +17,11 @@
 
 import copy
 import math
+import os
 import warnings
 from dataclasses import dataclass
-from typing import Dict, List, Optional, Tuple
+from pathlib import Path
+from typing import Dict, List, Optional, Tuple, Union
 
 import torch
 import torch.nn.functional as F
@@ -39,31 +41,57 @@
     replace_return_docstrings,
     requires_backends,
 )
+from ...modeling_attn_mask_utils import _prepare_4d_attention_mask
 from ...modeling_outputs import BaseModelOutput
 from ...modeling_utils import PreTrainedModel
 from ...pytorch_utils import meshgrid
-from ...utils import is_ninja_available, logging
-from ..auto import AutoBackbone
+from ...utils import is_accelerate_available, is_ninja_available, logging
+from ...utils.backbone_utils import load_backbone
 from .configuration_deformable_detr import DeformableDetrConfig
-from .load_custom import load_cuda_kernels
 
 
 logger = logging.get_logger(__name__)
 
-# Move this to not compile only when importing, this needs to happen later, like in __init__.
-if is_torch_cuda_available() and is_ninja_available():
-    logger.info("Loading custom CUDA kernels...")
-    try:
-        MultiScaleDeformableAttention = load_cuda_kernels()
-    except Exception as e:
-        logger.warning(f"Could not load the custom kernel for multi-scale deformable attention: {e}")
-        MultiScaleDeformableAttention = None
-else:
-    MultiScaleDeformableAttention = None
+MultiScaleDeformableAttention = None
+
+
+def load_cuda_kernels():
+    from torch.utils.cpp_extension import load
+
+    global MultiScaleDeformableAttention
+
+    root = Path(__file__).resolve().parent.parent.parent / "kernels" / "deformable_detr"
+    src_files = [
+        root / filename
+        for filename in [
+            "vision.cpp",
+            os.path.join("cpu", "ms_deform_attn_cpu.cpp"),
+            os.path.join("cuda", "ms_deform_attn_cuda.cu"),
+        ]
+    ]
+
+    MultiScaleDeformableAttention = load(
+        "MultiScaleDeformableAttention",
+        src_files,
+        with_cuda=True,
+        extra_include_paths=[str(root)],
+        extra_cflags=["-DWITH_CUDA=1"],
+        extra_cuda_cflags=[
+            "-DCUDA_HAS_FP16=1",
+            "-D__CUDA_NO_HALF_OPERATORS__",
+            "-D__CUDA_NO_HALF_CONVERSIONS__",
+            "-D__CUDA_NO_HALF2_OPERATORS__",
+        ],
+    )
+
 
 if is_vision_available():
     from transformers.image_transforms import center_to_corners_format
 
+if is_accelerate_available():
+    from accelerate import PartialState
+    from accelerate.utils import reduce
+
 
 class MultiScaleDeformableAttentionFunction(Function):
     @staticmethod
@@ -357,19 +385,28 @@ def forward(self, x):
 
 
 # Copied from transformers.models.detr.modeling_detr.replace_batch_norm with Detr->DeformableDetr
-def replace_batch_norm(m, name=""):
-    for attr_str in dir(m):
-        target_attr = getattr(m, attr_str)
-        if isinstance(target_attr, nn.BatchNorm2d):
-            frozen = DeformableDetrFrozenBatchNorm2d(target_attr.num_features)
-            bn = getattr(m, attr_str)
-            frozen.weight.data.copy_(bn.weight)
-            frozen.bias.data.copy_(bn.bias)
-            frozen.running_mean.data.copy_(bn.running_mean)
-            frozen.running_var.data.copy_(bn.running_var)
-            setattr(m, attr_str, frozen)
-    for n, ch in m.named_children():
-        replace_batch_norm(ch, n)
+def replace_batch_norm(model):
+    r"""
+    Recursively replace all `torch.nn.BatchNorm2d` with `DeformableDetrFrozenBatchNorm2d`.
+
+    Args:
+        model (torch.nn.Module):
+            input model
+    """
+    for name, module in model.named_children():
+        if isinstance(module, nn.BatchNorm2d):
+            new_module = DeformableDetrFrozenBatchNorm2d(module.num_features)
+
+            if not module.weight.device == torch.device("meta"):
+                new_module.weight.data.copy_(module.weight)
+                new_module.bias.data.copy_(module.bias)
+                new_module.running_mean.data.copy_(module.running_mean)
+                new_module.running_var.data.copy_(module.running_var)
+
+            model._modules[name] = new_module
+
+        if len(list(module.children())) > 0:
+            replace_batch_norm(module)
 
 
 class DeformableDetrConvEncoder(nn.Module):
@@ -399,7 +436,7 @@ def __init__(self, config):
                 **kwargs,
             )
         else:
-            backbone = AutoBackbone.from_config(config.backbone_config)
+            backbone = load_backbone(config)
 
         # replace batch norm by frozen batch norm
         with torch.no_grad():
@@ -454,21 +491,6 @@ def forward(self, pixel_values, pixel_mask):
         return out, pos
 
 
-# Copied from transformers.models.detr.modeling_detr._expand_mask
-def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, target_len: Optional[int] = None):
-    """
-    Expands attention_mask from `[batch_size, seq_len]` to `[batch_size, 1, target_seq_len, source_seq_len]`.
-    """
-    batch_size, source_len = mask.size()
-    target_len = target_len if target_len is not None else source_len
-
-    expanded_mask = mask[:, None, None, :].expand(batch_size, 1, target_len, source_len).to(dtype)
-
-    inverted_mask = 1.0 - expanded_mask
-
-    return inverted_mask.masked_fill(inverted_mask.bool(), torch.finfo(dtype).min)
-
-
 class DeformableDetrSinePositionEmbedding(nn.Module):
     """
     This is a more standard version of the position embedding, very similar to the one used by the Attention is all you
@@ -496,8 +518,8 @@ def forward(self, pixel_values, pixel_mask):
             y_embed = (y_embed - 0.5) / (y_embed[:, -1:, :] + eps) * self.scale
             x_embed = (x_embed - 0.5) / (x_embed[:, :, -1:] + eps) * self.scale
 
-        dim_t = torch.arange(self.embedding_dim, dtype=torch.float32, device=pixel_values.device)
-        dim_t = self.temperature ** (2 * (dim_t // 2) / self.embedding_dim)
+        dim_t = torch.arange(self.embedding_dim, dtype=torch.int64, device=pixel_values.device).float()
+        dim_t = self.temperature ** (2 * torch.div(dim_t, 2, rounding_mode="floor") / self.embedding_dim)
 
         pos_x = x_embed[:, :, :, None] / dim_t
         pos_y = y_embed[:, :, :, None] / dim_t
@@ -550,7 +572,7 @@ def multi_scale_deformable_attention(
 ) -> Tensor:
     batch_size, _, num_heads, hidden_dim = value.shape
     _, num_queries, num_heads, num_levels, num_points, _ = sampling_locations.shape
-    value_list = value.split([height * width for height, width in value_spatial_shapes], dim=1)
+    value_list = value.split([height.item() * width.item() for height, width in value_spatial_shapes], dim=1)
     sampling_grids = 2 * sampling_locations - 1
     sampling_value_list = []
     for level_id, (height, width) in enumerate(value_spatial_shapes):
@@ -589,13 +611,21 @@ class DeformableDetrMultiscaleDeformableAttention(nn.Module):
     Multiscale deformable attention as proposed in Deformable DETR.
     """
 
-    def __init__(self, embed_dim: int, num_heads: int, n_levels: int, n_points: int):
+    def __init__(self, config: DeformableDetrConfig, num_heads: int, n_points: int):
         super().__init__()
-        if embed_dim % num_heads != 0:
+
+        kernel_loaded = MultiScaleDeformableAttention is not None
+        if is_torch_cuda_available() and is_ninja_available() and not kernel_loaded:
+            try:
+                load_cuda_kernels()
+            except Exception as e:
+                logger.warning(f"Could not load the custom kernel for multi-scale deformable attention: {e}")
+
+        if config.d_model % num_heads != 0:
             raise ValueError(
-                f"embed_dim (d_model) must be divisible by num_heads, but got {embed_dim} and {num_heads}"
+                f"embed_dim (d_model) must be divisible by num_heads, but got {config.d_model} and {num_heads}"
             )
-        dim_per_head = embed_dim // num_heads
+        dim_per_head = config.d_model // num_heads
         # check if dim_per_head is power of 2
         if not ((dim_per_head & (dim_per_head - 1) == 0) and dim_per_head != 0):
             warnings.warn(
@@ -606,21 +636,24 @@ def __init__(self, embed_dim: int, num_heads: int, n_levels: int, n_points: int)
 
         self.im2col_step = 64
 
-        self.d_model = embed_dim
-        self.n_levels = n_levels
+        self.d_model = config.d_model
+        self.n_levels = config.num_feature_levels
         self.n_heads = num_heads
         self.n_points = n_points
 
-        self.sampling_offsets = nn.Linear(embed_dim, num_heads * n_levels * n_points * 2)
-        self.attention_weights = nn.Linear(embed_dim, num_heads * n_levels * n_points)
-        self.value_proj = nn.Linear(embed_dim, embed_dim)
-        self.output_proj = nn.Linear(embed_dim, embed_dim)
+        self.sampling_offsets = nn.Linear(config.d_model, num_heads * self.n_levels * n_points * 2)
+        self.attention_weights = nn.Linear(config.d_model, num_heads * self.n_levels * n_points)
+        self.value_proj = nn.Linear(config.d_model, config.d_model)
+        self.output_proj = nn.Linear(config.d_model, config.d_model)
+
+        self.disable_custom_kernels = config.disable_custom_kernels
 
         self._reset_parameters()
 
     def _reset_parameters(self):
         nn.init.constant_(self.sampling_offsets.weight.data, 0.0)
-        thetas = torch.arange(self.n_heads, dtype=torch.float32) * (2.0 * math.pi / self.n_heads)
+        default_dtype = torch.get_default_dtype()
+        thetas = torch.arange(self.n_heads, dtype=torch.int64).to(default_dtype) * (2.0 * math.pi / self.n_heads)
         grid_init = torch.stack([thetas.cos(), thetas.sin()], -1)
         grid_init = (
             (grid_init / grid_init.abs().max(-1, keepdim=True)[0])
@@ -692,19 +725,24 @@ def forward(
             )
         else:
             raise ValueError(f"Last dim of reference_points must be 2 or 4, but got {reference_points.shape[-1]}")
-        try:
-            # custom kernel
-            output = MultiScaleDeformableAttentionFunction.apply(
-                value,
-                spatial_shapes,
-                level_start_index,
-                sampling_locations,
-                attention_weights,
-                self.im2col_step,
-            )
-        except Exception:
+
+        if self.disable_custom_kernels:
             # PyTorch implementation
             output = multi_scale_deformable_attention(value, spatial_shapes, sampling_locations, attention_weights)
+        else:
+            try:
+                # custom kernel
+                output = MultiScaleDeformableAttentionFunction.apply(
+                    value,
+                    spatial_shapes,
+                    level_start_index,
+                    sampling_locations,
+                    attention_weights,
+                    self.im2col_step,
+                )
+            except Exception:
+                # PyTorch implementation
+                output = multi_scale_deformable_attention(value, spatial_shapes, sampling_locations, attention_weights)
         output = self.output_proj(output)
 
         return output, attention_weights
@@ -785,7 +823,7 @@ def forward(
         # expand attention_mask
         if attention_mask is not None:
             # [batch_size, seq_len] -> [batch_size, 1, target_seq_len, source_seq_len]
-            attention_mask = _expand_mask(attention_mask, hidden_states.dtype)
+            attention_mask = _prepare_4d_attention_mask(attention_mask, hidden_states.dtype)
 
         if attention_mask is not None:
             if attention_mask.size() != (batch_size, 1, target_len, source_len):
@@ -832,10 +870,7 @@ def __init__(self, config: DeformableDetrConfig):
         super().__init__()
         self.embed_dim = config.d_model
         self.self_attn = DeformableDetrMultiscaleDeformableAttention(
-            embed_dim=self.embed_dim,
-            num_heads=config.encoder_attention_heads,
-            n_levels=config.num_feature_levels,
-            n_points=config.encoder_n_points,
+            config, num_heads=config.encoder_attention_heads, n_points=config.encoder_n_points
         )
         self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim)
         self.dropout = config.dropout
@@ -933,9 +968,8 @@ def __init__(self, config: DeformableDetrConfig):
         self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim)
         # cross-attention
         self.encoder_attn = DeformableDetrMultiscaleDeformableAttention(
-            embed_dim=self.embed_dim,
+            config,
             num_heads=config.decoder_attention_heads,
-            n_levels=config.num_feature_levels,
             n_points=config.decoder_n_points,
         )
         self.encoder_attn_layer_norm = nn.LayerNorm(self.embed_dim)
@@ -1050,6 +1084,9 @@ class DeformableDetrPreTrainedModel(PreTrainedModel):
     config_class = DeformableDetrConfig
     base_model_prefix = "model"
     main_input_name = "pixel_values"
+    supports_gradient_checkpointing = True
+    _no_split_modules = [r"DeformableDetrConvEncoder", r"DeformableDetrEncoderLayer", r"DeformableDetrDecoderLayer"]
 
     def _init_weights(self, module):
         std = self.config.init_std
@@ -1075,10 +1112,6 @@ def _init_weights(self, module):
         if hasattr(module, "level_embed"):
             nn.init.normal_(module.level_embed)
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, DeformableDetrDecoder):
-            module.gradient_checkpointing = value
-
 
 DEFORMABLE_DETR_START_DOCSTRING = r"""
     This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
@@ -1112,7 +1145,7 @@ def _set_gradient_checkpointing(self, module, value=False):
 
             [What are attention masks?](../glossary#attention-mask)
 
-        decoder_attention_mask (`torch.LongTensor` of shape `(batch_size, num_queries)`, *optional*):
+        decoder_attention_mask (`torch.FloatTensor` of shape `(batch_size, num_queries)`, *optional*):
             Not used by default. Can be used to mask object queries.
         encoder_outputs (`tuple(tuple(torch.FloatTensor)`, *optional*):
             Tuple consists of (`last_hidden_state`, *optional*: `hidden_states`, *optional*: `attentions`)
@@ -1148,6 +1181,7 @@ class DeformableDetrEncoder(DeformableDetrPreTrainedModel):
 
     def __init__(self, config: DeformableDetrConfig):
         super().__init__(config)
+        self.gradient_checkpointing = False
 
         self.dropout = config.dropout
         self.layers = nn.ModuleList([DeformableDetrEncoderLayer(config) for _ in range(config.encoder_layers)])
@@ -1173,8 +1207,8 @@ def get_reference_points(spatial_shapes, valid_ratios, device):
         reference_points_list = []
         for level, (height, width) in enumerate(spatial_shapes):
             ref_y, ref_x = meshgrid(
-                torch.linspace(0.5, height - 0.5, height, dtype=torch.float32, device=device),
-                torch.linspace(0.5, width - 0.5, width, dtype=torch.float32, device=device),
+                torch.linspace(0.5, height - 0.5, height, dtype=valid_ratios.dtype, device=device),
+                torch.linspace(0.5, width - 0.5, width, dtype=valid_ratios.dtype, device=device),
                 indexing="ij",
             )
             # TODO: valid_ratios could be useless here. check https://github.com/fundamentalvision/Deformable-DETR/issues/36
@@ -1240,15 +1274,27 @@ def forward(
         for i, encoder_layer in enumerate(self.layers):
             if output_hidden_states:
                 encoder_states = encoder_states + (hidden_states,)
-            layer_outputs = encoder_layer(
-                hidden_states,
-                attention_mask,
-                position_embeddings=position_embeddings,
-                reference_points=reference_points,
-                spatial_shapes=spatial_shapes,
-                level_start_index=level_start_index,
-                output_attentions=output_attentions,
-            )
+            if self.gradient_checkpointing and self.training:
+                layer_outputs = self._gradient_checkpointing_func(
+                    encoder_layer.__call__,
+                    hidden_states,
+                    attention_mask,
+                    position_embeddings,
+                    reference_points,
+                    spatial_shapes,
+                    level_start_index,
+                    output_attentions,
+                )
+            else:
+                layer_outputs = encoder_layer(
+                    hidden_states,
+                    attention_mask,
+                    position_embeddings=position_embeddings,
+                    reference_points=reference_points,
+                    spatial_shapes=spatial_shapes,
+                    level_start_index=level_start_index,
+                    output_attentions=output_attentions,
+                )
 
             hidden_states = layer_outputs[0]
 
@@ -1370,19 +1416,16 @@ def forward(
                 all_hidden_states += (hidden_states,)
 
             if self.gradient_checkpointing and self.training:
-
-                def create_custom_forward(module):
-                    def custom_forward(*inputs):
-                        return module(*inputs, output_attentions)
-
-                    return custom_forward
-
-                layer_outputs = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(decoder_layer),
+                layer_outputs = self._gradient_checkpointing_func(
+                    decoder_layer.__call__,
                     hidden_states,
+                    position_embeddings,
+                    reference_points_input,
+                    spatial_shapes,
+                    level_start_index,
                     encoder_hidden_states,
                     encoder_attention_mask,
-                    None,
+                    output_attentions,
                 )
             else:
                 layer_outputs = decoder_layer(
@@ -1533,26 +1576,26 @@ def unfreeze_backbone(self):
         for name, param in self.backbone.conv_encoder.model.named_parameters():
             param.requires_grad_(True)
 
-    def get_valid_ratio(self, mask):
+    def get_valid_ratio(self, mask, dtype=torch.float32):
         """Get the valid ratio of all feature maps."""
 
         _, height, width = mask.shape
         valid_height = torch.sum(mask[:, :, 0], 1)
         valid_width = torch.sum(mask[:, 0, :], 1)
-        valid_ratio_heigth = valid_height.float() / height
-        valid_ratio_width = valid_width.float() / width
-        valid_ratio = torch.stack([valid_ratio_width, valid_ratio_heigth], -1)
+        valid_ratio_height = valid_height.to(dtype) / height
+        valid_ratio_width = valid_width.to(dtype) / width
+        valid_ratio = torch.stack([valid_ratio_width, valid_ratio_height], -1)
         return valid_ratio
 
     def get_proposal_pos_embed(self, proposals):
         """Get the position embedding of the proposals."""
 
-        num_pos_feats = 128
+        num_pos_feats = self.config.d_model // 2
         temperature = 10000
         scale = 2 * math.pi
 
-        dim_t = torch.arange(num_pos_feats, dtype=torch.float32, device=proposals.device)
-        dim_t = temperature ** (2 * (dim_t // 2) / num_pos_feats)
+        dim_t = torch.arange(num_pos_feats, dtype=torch.int64, device=proposals.device).float()
+        dim_t = temperature ** (2 * torch.div(dim_t, 2, rounding_mode="floor") / num_pos_feats)
         # batch_size, num_queries, 4
         proposals = proposals.sigmoid() * scale
         # batch_size, num_queries, 4, 128
@@ -1614,16 +1657,16 @@ def gen_encoder_output_proposals(self, enc_output, padding_mask, spatial_shapes)
     @replace_return_docstrings(output_type=DeformableDetrModelOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
-        pixel_values,
-        pixel_mask=None,
-        decoder_attention_mask=None,
-        encoder_outputs=None,
-        inputs_embeds=None,
-        decoder_inputs_embeds=None,
-        output_attentions=None,
-        output_hidden_states=None,
-        return_dict=None,
-    ):
+        pixel_values: torch.FloatTensor,
+        pixel_mask: Optional[torch.LongTensor] = None,
+        decoder_attention_mask: Optional[torch.FloatTensor] = None,
+        encoder_outputs: Optional[torch.FloatTensor] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        decoder_inputs_embeds: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple[torch.FloatTensor], DeformableDetrModelOutput]:
         r"""
         Returns:
 
@@ -1714,7 +1757,7 @@ def forward(
         lvl_pos_embed_flatten = torch.cat(lvl_pos_embed_flatten, 1)
         spatial_shapes = torch.as_tensor(spatial_shapes, dtype=torch.long, device=source_flatten.device)
         level_start_index = torch.cat((spatial_shapes.new_zeros((1,)), spatial_shapes.prod(1).cumsum(0)[:-1]))
-        valid_ratios = torch.stack([self.get_valid_ratio(m) for m in masks], 1)
+        valid_ratios = torch.stack([self.get_valid_ratio(m, dtype=source_flatten.dtype) for m in masks], 1)
         valid_ratios = valid_ratios.float()
 
         # Fourth, sent source_flatten + mask_flatten + lvl_pos_embed_flatten (backbone + proj layer output) through encoder
@@ -1820,7 +1863,9 @@ def forward(
 )
 class DeformableDetrForObjectDetection(DeformableDetrPreTrainedModel):
     # When using clones, all layers > 0 will be clones, but layer 0 *is* required
-    _keys_to_ignore_on_load_missing = ["bbox_embed\.[1-9]\d*", "class_embed\.[1-9]\d*"]
+    _tied_weights_keys = [r"bbox_embed\.[1-9]\d*", r"class_embed\.[1-9]\d*"]
+    # We can't initialize the model on meta device as some weights are modified during the initialization
+    _no_split_modules = None
 
     def __init__(self, config: DeformableDetrConfig):
         super().__init__(config)
@@ -1874,17 +1919,17 @@ def _set_aux_loss(self, outputs_class, outputs_coord):
     @replace_return_docstrings(output_type=DeformableDetrObjectDetectionOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
-        pixel_values,
-        pixel_mask=None,
-        decoder_attention_mask=None,
-        encoder_outputs=None,
-        inputs_embeds=None,
-        decoder_inputs_embeds=None,
-        labels=None,
-        output_attentions=None,
-        output_hidden_states=None,
-        return_dict=None,
-    ):
+        pixel_values: torch.FloatTensor,
+        pixel_mask: Optional[torch.LongTensor] = None,
+        decoder_attention_mask: Optional[torch.FloatTensor] = None,
+        encoder_outputs: Optional[torch.FloatTensor] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        decoder_inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[List[dict]] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple[torch.FloatTensor], DeformableDetrObjectDetectionOutput]:
         r"""
         labels (`List[Dict]` of len `(batch_size,)`, *optional*):
             Labels for computing the bipartite matching loss. List of dicts, each dictionary containing at least the
@@ -1910,7 +1955,7 @@ def forward(
         >>> inputs = image_processor(images=image, return_tensors="pt")
         >>> outputs = model(**inputs)
 
-        >>> # convert outputs (bounding boxes and class logits) to COCO API
+        >>> # convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax)
         >>> target_sizes = torch.tensor([image.size[::-1]])
         >>> results = image_processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[
         ...     0
@@ -1966,12 +2011,11 @@ def forward(
             outputs_coord = outputs_coord_logits.sigmoid()
             outputs_classes.append(outputs_class)
             outputs_coords.append(outputs_coord)
-        # Keep batch_size as first dimension
-        outputs_class = torch.stack(outputs_classes, dim=1)
-        outputs_coord = torch.stack(outputs_coords, dim=1)
+        outputs_class = torch.stack(outputs_classes)
+        outputs_coord = torch.stack(outputs_coords)
 
-        logits = outputs_class[:, -1]
-        pred_boxes = outputs_coord[:, -1]
+        logits = outputs_class[-1]
+        pred_boxes = outputs_coord[-1]
 
         loss, loss_dict, auxiliary_outputs = None, None, None
         if labels is not None:
@@ -1997,7 +2041,7 @@ def forward(
                 outputs_loss["auxiliary_outputs"] = auxiliary_outputs
             if self.config.two_stage:
                 enc_outputs_coord = outputs.enc_outputs_coord_logits.sigmoid()
-                outputs["enc_outputs"] = {"pred_logits": outputs.enc_outputs_class, "pred_boxes": enc_outputs_coord}
+                outputs_loss["enc_outputs"] = {"logits": outputs.enc_outputs_class, "pred_boxes": enc_outputs_coord}
 
             loss_dict = criterion(outputs_loss, labels)
             # Fourth: compute total loss, as a weighted sum of the various losses
@@ -2229,7 +2273,7 @@ def forward(self, outputs, targets):
                 List of dicts, such that `len(targets) == batch_size`. The expected keys in each dict depends on the
                 losses applied, see each loss' doc.
         """
-        outputs_without_aux = {k: v for k, v in outputs.items() if k != "auxiliary_outputs"}
+        outputs_without_aux = {k: v for k, v in outputs.items() if k != "auxiliary_outputs" and k != "enc_outputs"}
 
         # Retrieve the matching between the outputs of the last layer and the targets
         indices = self.matcher(outputs_without_aux, targets)
@@ -2237,11 +2281,11 @@ def forward(self, outputs, targets):
         # Compute the average number of target boxes accross all nodes, for normalization purposes
         num_boxes = sum(len(t["class_labels"]) for t in targets)
         num_boxes = torch.as_tensor([num_boxes], dtype=torch.float, device=next(iter(outputs.values())).device)
-        # (Niels): comment out function below, distributed training to be added
-        # if is_dist_avail_and_initialized():
-        #     torch.distributed.all_reduce(num_boxes)
-        # (Niels) in original implementation, num_boxes is divided by get_world_size()
-        num_boxes = torch.clamp(num_boxes, min=1).item()
+        world_size = 1
+        if PartialState._shared_state != {}:
+            num_boxes = reduce(num_boxes)
+            world_size = PartialState().num_processes
+        num_boxes = torch.clamp(num_boxes / world_size, min=1).item()
 
         # Compute all the requested losses
         losses = {}
@@ -2261,14 +2305,10 @@ def forward(self, outputs, targets):
             enc_outputs = outputs["enc_outputs"]
             bin_targets = copy.deepcopy(targets)
             for bt in bin_targets:
-                bt["labels"] = torch.zeros_like(bt["labels"])
+                bt["class_labels"] = torch.zeros_like(bt["class_labels"])
             indices = self.matcher(enc_outputs, bin_targets)
             for loss in self.losses:
-                kwargs = {}
-                if loss == "labels":
-                    # Logging is enabled only for the last layer
-                    kwargs["log"] = False
-                l_dict = self.get_loss(loss, enc_outputs, bin_targets, indices, num_boxes, **kwargs)
+                l_dict = self.get_loss(loss, enc_outputs, bin_targets, indices, num_boxes)
                 l_dict = {k + "_enc": v for k, v in l_dict.items()}
                 losses.update(l_dict)
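Among the modeling changes above, `replace_batch_norm` now recurses over `named_children()` and skips copying statistics when parameters live on the meta device. A self-contained sketch of that recursion, using a toy frozen batch-norm stand-in rather than the real `DeformableDetrFrozenBatchNorm2d`:

```
import torch
from torch import nn


class FrozenBatchNorm2d(nn.Module):
    """Toy stand-in for DeformableDetrFrozenBatchNorm2d: batch norm with fixed statistics."""

    def __init__(self, num_features):
        super().__init__()
        self.register_buffer("weight", torch.ones(num_features))
        self.register_buffer("bias", torch.zeros(num_features))
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x):
        scale = self.weight * (self.running_var + 1e-5).rsqrt()
        shift = self.bias - self.running_mean * scale
        return x * scale[None, :, None, None] + shift[None, :, None, None]


def replace_batch_norm(model):
    # Recurse over named children so the frozen module can be reassigned on the parent.
    for name, module in model.named_children():
        if isinstance(module, nn.BatchNorm2d):
            frozen = FrozenBatchNorm2d(module.num_features)
            # Only copy statistics when there is real data to copy (not on the meta device).
            if module.weight.device != torch.device("meta"):
                frozen.weight.data.copy_(module.weight)
                frozen.bias.data.copy_(module.bias)
                frozen.running_mean.data.copy_(module.running_mean)
                frozen.running_var.data.copy_(module.running_var)
            model._modules[name] = frozen
        elif len(list(module.children())) > 0:
            replace_batch_norm(module)


backbone = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.Sequential(nn.BatchNorm2d(8)))
replace_batch_norm(backbone)
print(backbone)  # both BatchNorm2d layers are now FrozenBatchNorm2d
```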
 
diff --git a/src/transformers/models/deit/configuration_deit.py b/src/transformers/models/deit/configuration_deit.py
index b395afdbef5cf3..20b874ff54a0dd 100644
--- a/src/transformers/models/deit/configuration_deit.py
+++ b/src/transformers/models/deit/configuration_deit.py
@@ -58,23 +58,23 @@ class DeiTConfig(PretrainedConfig):
         hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
             The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
             `"relu"`, `"selu"` and `"gelu_new"` are supported.
-        hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
-            The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
-        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
+        hidden_dropout_prob (`float`, *optional*, defaults to 0.0):
+            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
+        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
             The dropout ratio for the attention probabilities.
         initializer_range (`float`, *optional*, defaults to 0.02):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
         layer_norm_eps (`float`, *optional*, defaults to 1e-12):
             The epsilon used by the layer normalization layers.
-        image_size (`int`, *optional*, defaults to `224`):
+        image_size (`int`, *optional*, defaults to 224):
             The size (resolution) of each image.
-        patch_size (`int`, *optional*, defaults to `16`):
+        patch_size (`int`, *optional*, defaults to 16):
             The size (resolution) of each patch.
-        num_channels (`int`, *optional*, defaults to `3`):
+        num_channels (`int`, *optional*, defaults to 3):
             The number of input channels.
         qkv_bias (`bool`, *optional*, defaults to `True`):
             Whether to add a bias to the queries, keys and values.
-        encoder_stride (`int`, `optional`, defaults to 16):
+        encoder_stride (`int`, *optional*, defaults to 16):
             Factor to increase the spatial resolution by in the decoder head for masked image modeling.
 
     Example:
@@ -91,6 +91,7 @@ class DeiTConfig(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "deit"
 
     def __init__(
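The `DeiTConfig` docstring edits above only bring the documented defaults in line with what the constructor already does; a quick check, assuming `transformers` is installed:

```
from transformers import DeiTConfig

config = DeiTConfig()
# The corrected docstring values match the actual constructor defaults:
print(config.hidden_dropout_prob, config.attention_probs_dropout_prob)  # 0.0 0.0
print(config.image_size, config.patch_size, config.num_channels, config.encoder_stride)  # 224 16 3 16
```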
diff --git a/src/transformers/models/deit/convert_deit_timm_to_pytorch.py b/src/transformers/models/deit/convert_deit_timm_to_pytorch.py
index a91702106e0f8c..2b5c795ff2d2ab 100644
--- a/src/transformers/models/deit/convert_deit_timm_to_pytorch.py
+++ b/src/transformers/models/deit/convert_deit_timm_to_pytorch.py
@@ -25,7 +25,7 @@
 from huggingface_hub import hf_hub_download
 from PIL import Image
 
-from transformers import DeiTConfig, DeiTFeatureExtractor, DeiTForImageClassificationWithTeacher
+from transformers import DeiTConfig, DeiTForImageClassificationWithTeacher, DeiTImageProcessor
 from transformers.utils import logging
 
 
@@ -182,12 +182,12 @@ def convert_deit_checkpoint(deit_name, pytorch_dump_folder_path):
     model = DeiTForImageClassificationWithTeacher(config).eval()
     model.load_state_dict(state_dict)
 
-    # Check outputs on an image, prepared by DeiTFeatureExtractor
+    # Check outputs on an image, prepared by DeiTImageProcessor
     size = int(
         (256 / 224) * config.image_size
     )  # to maintain same ratio w.r.t. 224 images, see https://github.com/facebookresearch/deit/blob/ab5715372db8c6cad5740714b2216d55aeae052e/datasets.py#L103
-    feature_extractor = DeiTFeatureExtractor(size=size, crop_size=config.image_size)
-    encoding = feature_extractor(images=prepare_img(), return_tensors="pt")
+    image_processor = DeiTImageProcessor(size=size, crop_size=config.image_size)
+    encoding = image_processor(images=prepare_img(), return_tensors="pt")
     pixel_values = encoding["pixel_values"]
     outputs = model(pixel_values)
 
@@ -198,8 +198,8 @@ def convert_deit_checkpoint(deit_name, pytorch_dump_folder_path):
     Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
     print(f"Saving model {deit_name} to {pytorch_dump_folder_path}")
     model.save_pretrained(pytorch_dump_folder_path)
-    print(f"Saving feature extractor to {pytorch_dump_folder_path}")
-    feature_extractor.save_pretrained(pytorch_dump_folder_path)
+    print(f"Saving image processor to {pytorch_dump_folder_path}")
+    image_processor.save_pretrained(pytorch_dump_folder_path)
 
 
 if __name__ == "__main__":
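The conversion script now builds a `DeiTImageProcessor` instead of the deprecated `DeiTFeatureExtractor`. A minimal sketch of the equivalent call for a 224px model (the image URL is just a convenient public example):

```
import requests
from PIL import Image

from transformers import DeiTImageProcessor

# Any RGB image works; the COCO cats image is a convenient public sample.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Mirrors the conversion script's ratio for a 224px model: resize to 256, center-crop to 224.
image_processor = DeiTImageProcessor(size=256, crop_size=224)
encoding = image_processor(images=image, return_tensors="pt")
print(encoding["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])
```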
diff --git a/src/transformers/models/deit/image_processing_deit.py b/src/transformers/models/deit/image_processing_deit.py
index 77d7d2bb2ca2e5..15e820570c08fe 100644
--- a/src/transformers/models/deit/image_processing_deit.py
+++ b/src/transformers/models/deit/image_processing_deit.py
@@ -18,22 +18,22 @@
 
 import numpy as np
 
-from transformers.utils import is_vision_available
-from transformers.utils.generic import TensorType
-
 from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
-from ...image_transforms import center_crop, normalize, rescale, resize, to_channel_dimension_format
+from ...image_transforms import resize, to_channel_dimension_format
 from ...image_utils import (
     IMAGENET_STANDARD_MEAN,
     IMAGENET_STANDARD_STD,
     ChannelDimension,
     ImageInput,
     PILImageResampling,
+    infer_channel_dimension_format,
+    is_scaled_image,
     make_list_of_images,
     to_numpy_array,
     valid_images,
+    validate_preprocess_arguments,
 )
-from ...utils import logging
+from ...utils import TensorType, is_vision_available, logging
 
 
 if is_vision_available():
@@ -53,19 +53,19 @@ class DeiTImageProcessor(BaseImageProcessor):
             `do_resize` in `preprocess`.
         size (`Dict[str, int]` *optional*, defaults to `{"height": 256, "width": 256}`):
             Size of the image after `resize`. Can be overridden by `size` in `preprocess`.
-        resample (`PILImageResampling` filter, *optional*, defaults to `PILImageResampling.BICUBIC`):
+        resample (`PILImageResampling` filter, *optional*, defaults to `Resampling.BICUBIC`):
             Resampling filter to use if resizing the image. Can be overridden by `resample` in `preprocess`.
         do_center_crop (`bool`, *optional*, defaults to `True`):
             Whether to center crop the image. If the input size is smaller than `crop_size` along any edge, the image
             is padded with 0's and then center cropped. Can be overridden by `do_center_crop` in `preprocess`.
         crop_size (`Dict[str, int]`, *optional*, defaults to `{"height": 224, "width": 224}`):
             Desired output size when applying center-cropping. Can be overridden by `crop_size` in `preprocess`.
-        do_rescale (`bool`, *optional*, defaults to `True`):
-            Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the `do_rescale`
-            parameter in the `preprocess` method.
         rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
             Scale factor to use if rescaling the image. Can be overridden by the `rescale_factor` parameter in the
             `preprocess` method.
+        do_rescale (`bool`, *optional*, defaults to `True`):
+            Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the `do_rescale`
+            parameter in the `preprocess` method.
         do_normalize (`bool`, *optional*, defaults to `True`):
             Whether to normalize the image. Can be overridden by the `do_normalize` parameter in the `preprocess`
             method.
@@ -110,101 +110,55 @@ def __init__(
         self.image_mean = image_mean if image_mean is not None else IMAGENET_STANDARD_MEAN
         self.image_std = image_std if image_std is not None else IMAGENET_STANDARD_STD
 
+    # Copied from transformers.models.vit.image_processing_vit.ViTImageProcessor.resize with PILImageResampling.BILINEAR->PILImageResampling.BICUBIC
     def resize(
         self,
         image: np.ndarray,
         size: Dict[str, int],
-        resample: PILImageResampling = PIL.Image.BICUBIC,
+        resample: PILImageResampling = PILImageResampling.BICUBIC,
         data_format: Optional[Union[str, ChannelDimension]] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
         **kwargs,
     ) -> np.ndarray:
         """
-        Resize an image to `(size["height"], size["width"])` using the specified resampling filter.
+        Resize an image to `(size["height"], size["width"])`.
 
         Args:
             image (`np.ndarray`):
                 Image to resize.
             size (`Dict[str, int]`):
-                Size of the output image.
-            resample (`PILImageResampling` filter, *optional*, defaults to `PILImageResampling.BICUBIC`):
-                Resampling filter to use when resizing the image.
-            data_format (`str` or `ChannelDimension`, *optional*):
-                The channel dimension format of the image. If not provided, it will be the same as the input image.
+                Dictionary in the format `{"height": int, "width": int}` specifying the size of the output image.
+            resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
+                `PILImageResampling` filter to use when resizing the image e.g. `PILImageResampling.BICUBIC`.
+            data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format for the output image. If unset, the channel dimension format of the input
+                image is used. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format for the input image. If unset, the channel dimension format is inferred
+                from the input image. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
+
+        Returns:
+            `np.ndarray`: The resized image.
         """
         size = get_size_dict(size)
         if "height" not in size or "width" not in size:
-            raise ValueError(f"The size dictionary must have keys 'height' and 'width'. Got {size.keys()}")
+            raise ValueError(f"The `size` dictionary must contain the keys `height` and `width`. Got {size.keys()}")
+        output_size = (size["height"], size["width"])
         return resize(
-            image, size=(size["height"], size["width"]), resample=resample, data_format=data_format, **kwargs
+            image,
+            size=output_size,
+            resample=resample,
+            data_format=data_format,
+            input_data_format=input_data_format,
+            **kwargs,
         )
 
-    def center_crop(
-        self,
-        image: np.ndarray,
-        size: Dict[str, int],
-        data_format: Optional[Union[str, ChannelDimension]] = None,
-        **kwargs,
-    ) -> np.ndarray:
-        """
-        Center crop an image to `(crop_size["height"], crop_size["width"])`. If the input size is smaller than
-        `crop_size` along any edge, the image is padded with 0's and then center cropped.
-
-        Args:
-            image (`np.ndarray`):
-                Image to center crop.
-            size (`Dict[str, int]`):
-                Size of the output image.
-            data_format (`str` or `ChannelDimension`, *optional*):
-                The channel dimension format of the image. If not provided, it will be the same as the input image.
-        """
-        size = get_size_dict(size)
-        if "height" not in size or "width" not in size:
-            raise ValueError(f"The size dictionary must have keys 'height' and 'width'. Got {size.keys()}")
-        return center_crop(image, size=(size["height"], size["width"]), data_format=data_format, **kwargs)
-
-    def rescale(
-        self,
-        image: np.ndarray,
-        scale: Union[int, float],
-        data_format: Optional[Union[str, ChannelDimension]] = None,
-        **kwargs,
-    ):
-        """
-        Rescale an image by a scale factor. image = image * scale.
-
-        Args:
-            image (`np.ndarray`):
-                Image to rescale.
-            scale (`int` or `float`):
-                Scale to apply to the image.
-            data_format (`str` or `ChannelDimension`, *optional*):
-                The channel dimension format of the image. If not provided, it will be the same as the input image.
-        """
-        return rescale(image, scale=scale, data_format=data_format, **kwargs)
-
-    def normalize(
-        self,
-        image: np.ndarray,
-        mean: Union[float, List[float]],
-        std: Union[float, List[float]],
-        data_format: Optional[Union[str, ChannelDimension]] = None,
-        **kwargs,
-    ) -> np.ndarray:
-        """
-        Normalize an image. image = (image - image_mean) / image_std.
-
-        Args:
-            image (`np.ndarray`):
-                Image to normalize.
-            image_mean (`float` or `List[float]`):
-                Image mean.
-            image_std (`float` or `List[float]`):
-                Image standard deviation.
-            data_format (`str` or `ChannelDimension`, *optional*):
-                The channel dimension format of the image. If not provided, it will be the same as the input image.
-        """
-        return normalize(image, mean=mean, std=std, data_format=data_format, **kwargs)
-
     def preprocess(
         self,
         images: ImageInput,
@@ -220,6 +174,7 @@ def preprocess(
         image_std: Optional[Union[float, List[float]]] = None,
         return_tensors: Optional[Union[str, TensorType]] = None,
         data_format: ChannelDimension = ChannelDimension.FIRST,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
         **kwargs,
     ) -> PIL.Image.Image:
         """
@@ -227,7 +182,8 @@ def preprocess(
 
         Args:
             images (`ImageInput`):
-                Image to preprocess.
+                Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
+                passing in images with pixel values between 0 and 1, set `do_rescale=False`.
             do_resize (`bool`, *optional*, defaults to `self.do_resize`):
                 Whether to resize the image.
             size (`Dict[str, int]`, *optional*, defaults to `self.size`):
@@ -261,6 +217,12 @@ def preprocess(
                 The channel dimension format for the output image. Can be one of:
                     - `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                     - `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format for the input image. If unset, the channel dimension format is inferred
+                from the input image. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
         """
         do_resize = do_resize if do_resize is not None else self.do_resize
         resample = resample if resample is not None else self.resample
@@ -283,35 +245,57 @@ def preprocess(
                 "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
                 "torch.Tensor, tf.Tensor or jax.ndarray."
             )
-
-        if do_resize and size is None or resample is None:
-            raise ValueError("Size and resample must be specified if do_resize is True.")
-
-        if do_center_crop and crop_size is None:
-            raise ValueError("Crop size must be specified if do_center_crop is True.")
-
-        if do_rescale and rescale_factor is None:
-            raise ValueError("Rescale factor must be specified if do_rescale is True.")
-
-        if do_normalize and (image_mean is None or image_std is None):
-            raise ValueError("Image mean and std must be specified if do_normalize is True.")
-
+        validate_preprocess_arguments(
+            do_rescale=do_rescale,
+            rescale_factor=rescale_factor,
+            do_normalize=do_normalize,
+            image_mean=image_mean,
+            image_std=image_std,
+            do_center_crop=do_center_crop,
+            crop_size=crop_size,
+            do_resize=do_resize,
+            size=size,
+            resample=resample,
+        )
         # All transformations expect numpy arrays.
         images = [to_numpy_array(image) for image in images]
 
+        if is_scaled_image(images[0]) and do_rescale:
+            logger.warning_once(
+                "It looks like you are trying to rescale already rescaled images. If the input"
+                " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
+            )
+
+        if input_data_format is None:
+            # We assume that all images have the same channel dimension format.
+            input_data_format = infer_channel_dimension_format(images[0])
+
         if do_resize:
-            images = [self.resize(image=image, size=size, resample=resample) for image in images]
+            images = [
+                self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format)
+                for image in images
+            ]
 
         if do_center_crop:
-            images = [self.center_crop(image=image, size=crop_size) for image in images]
+            images = [
+                self.center_crop(image=image, size=crop_size, input_data_format=input_data_format) for image in images
+            ]
 
         if do_rescale:
-            images = [self.rescale(image=image, scale=rescale_factor) for image in images]
+            images = [
+                self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
+                for image in images
+            ]
 
         if do_normalize:
-            images = [self.normalize(image=image, mean=image_mean, std=image_std) for image in images]
-
-        images = [to_channel_dimension_format(image, data_format) for image in images]
+            images = [
+                self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
+                for image in images
+            ]
+
+        images = [
+            to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
+        ]
 
         data = {"pixel_values": images}
         return BatchFeature(data=data, tensor_type=return_tensors)
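
The refactored `preprocess` above infers the input channel layout when `input_data_format` is unset and warns if already-rescaled images would be rescaled again. A minimal usage sketch (the checkpoint id is the standard DeiT one; the random input is illustrative):

```python
import numpy as np
from transformers import DeiTImageProcessor

image_processor = DeiTImageProcessor.from_pretrained("facebook/deit-base-distilled-patch16-224")

# A channels-last float image whose values are already scaled to [0, 1].
image = np.random.rand(480, 640, 3).astype(np.float32)

inputs = image_processor(
    images=image,
    do_rescale=False,                   # values are already in [0, 1]; avoid double rescaling
    input_data_format="channels_last",  # state the layout explicitly instead of relying on inference
    return_tensors="pt",
)
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])
```
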
diff --git a/src/transformers/models/deit/modeling_deit.py b/src/transformers/models/deit/modeling_deit.py
index f05b16efe7a045..b8bd9d6ce629db 100644
--- a/src/transformers/models/deit/modeling_deit.py
+++ b/src/transformers/models/deit/modeling_deit.py
@@ -26,7 +26,12 @@
 from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
 
 from ...activations import ACT2FN
-from ...modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling, ImageClassifierOutput, MaskedLMOutput
+from ...modeling_outputs import (
+    BaseModelOutput,
+    BaseModelOutputWithPooling,
+    ImageClassifierOutput,
+    MaskedImageModelingOutput,
+)
 from ...modeling_utils import PreTrainedModel
 from ...pytorch_utils import find_pruneable_heads_and_indices, prune_linear_layer
 from ...utils import (
@@ -352,17 +357,11 @@ def forward(
             layer_head_mask = head_mask[i] if head_mask is not None else None
 
             if self.gradient_checkpointing and self.training:
-
-                def create_custom_forward(module):
-                    def custom_forward(*inputs):
-                        return module(*inputs, output_attentions)
-
-                    return custom_forward
-
-                layer_outputs = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(layer_module),
+                layer_outputs = self._gradient_checkpointing_func(
+                    layer_module.__call__,
                     hidden_states,
                     layer_head_mask,
+                    output_attentions,
                 )
             else:
                 layer_outputs = layer_module(hidden_states, layer_head_mask, output_attentions)
@@ -394,7 +393,7 @@ class DeiTPreTrainedModel(PreTrainedModel):
     base_model_prefix = "deit"
     main_input_name = "pixel_values"
     supports_gradient_checkpointing = True
-    _no_split_modules = []
+    _no_split_modules = ["DeiTLayer"]
 
     def _init_weights(self, module: Union[nn.Linear, nn.Conv2d, nn.LayerNorm]) -> None:
         """Initialize the weights"""
@@ -410,10 +409,6 @@ def _init_weights(self, module: Union[nn.Linear, nn.Conv2d, nn.LayerNorm]) -> No
             module.bias.data.zero_()
             module.weight.data.fill_(1.0)
 
-    def _set_gradient_checkpointing(self, module: DeiTEncoder, value: bool = False) -> None:
-        if isinstance(module, DeiTEncoder):
-            module.gradient_checkpointing = value
-
 
 DEIT_START_DOCSTRING = r"""
     This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it
@@ -495,6 +490,10 @@ def forward(
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
     ) -> Union[Tuple, BaseModelOutputWithPooling]:
+        r"""
+        bool_masked_pos (`torch.BoolTensor` of shape `(batch_size, num_patches)`, *optional*):
+            Boolean masked positions. Indicates which patches are masked (1) and which aren't (0).
+        """
         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
         output_hidden_states = (
             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
@@ -588,7 +587,7 @@ def __init__(self, config: DeiTConfig) -> None:
         self.post_init()
 
     @add_start_docstrings_to_model_forward(DEIT_INPUTS_DOCSTRING)
-    @replace_return_docstrings(output_type=MaskedLMOutput, config_class=_CONFIG_FOR_DOC)
+    @replace_return_docstrings(output_type=MaskedImageModelingOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
         pixel_values: Optional[torch.Tensor] = None,
@@ -597,7 +596,7 @@ def forward(
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-    ) -> Union[tuple, MaskedLMOutput]:
+    ) -> Union[tuple, MaskedImageModelingOutput]:
         r"""
         bool_masked_pos (`torch.BoolTensor` of shape `(batch_size, num_patches)`):
             Boolean masked positions. Indicates which patches are masked (1) and which aren't (0).
@@ -623,7 +622,7 @@ def forward(
         >>> bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()
 
         >>> outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
-        >>> loss, reconstructed_pixel_values = outputs.loss, outputs.logits
+        >>> loss, reconstructed_pixel_values = outputs.loss, outputs.reconstruction
         >>> list(reconstructed_pixel_values.shape)
         [1, 3, 224, 224]
         ```"""
@@ -666,9 +665,9 @@ def forward(
             output = (reconstructed_pixel_values,) + outputs[1:]
             return ((masked_im_loss,) + output) if masked_im_loss is not None else output
 
-        return MaskedLMOutput(
+        return MaskedImageModelingOutput(
             loss=masked_im_loss,
-            logits=reconstructed_pixel_values,
+            reconstruction=reconstructed_pixel_values,
             hidden_states=outputs.hidden_states,
             attentions=outputs.attentions,
         )
@@ -755,6 +754,7 @@ def forward(
 
         loss = None
         if labels is not None:
+            labels = labels.to(logits.device)
             if self.config.problem_type is None:
                 if self.num_labels == 1:
                     self.config.problem_type = "regression"
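
The encoder above now routes activation checkpointing through `self._gradient_checkpointing_func` instead of a local `create_custom_forward` closure. A minimal training sketch that exercises this path, assuming a randomly initialized config (label count and batch size are illustrative):

```python
import torch
from transformers import DeiTConfig, DeiTForImageClassification

config = DeiTConfig(num_labels=10)
model = DeiTForImageClassification(config)
model.gradient_checkpointing_enable()  # encoder layers now go through _gradient_checkpointing_func
model.train()

# Inputs that require grad so the checkpointed backward pass has something to flow into.
pixel_values = torch.randn(2, 3, config.image_size, config.image_size, requires_grad=True)
labels = torch.tensor([0, 1])

loss = model(pixel_values=pixel_values, labels=labels).loss
loss.backward()
```
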
diff --git a/src/transformers/models/deit/modeling_tf_deit.py b/src/transformers/models/deit/modeling_tf_deit.py
index 161f2518d068a2..c6215c63b8ae8c 100644
--- a/src/transformers/models/deit/modeling_tf_deit.py
+++ b/src/transformers/models/deit/modeling_tf_deit.py
@@ -15,10 +15,12 @@
 """ TensorFlow DeiT model."""
 
 
+from __future__ import annotations
+
 import collections.abc
 import math
 from dataclasses import dataclass
-from typing import Dict, Optional, Tuple, Union
+from typing import Optional, Tuple, Union
 
 import tensorflow as tf
 
@@ -27,12 +29,13 @@
     TFBaseModelOutput,
     TFBaseModelOutputWithPooling,
     TFImageClassifierOutput,
-    TFMaskedLMOutput,
+    TFMaskedImageModelingOutput,
 )
 from ...modeling_tf_utils import (
     TFPreTrainedModel,
     TFSequenceClassificationLoss,
     get_initializer,
+    keras,
     keras_serializable,
     unpack_inputs,
 )
@@ -95,11 +98,11 @@ class token).
     logits: tf.Tensor = None
     cls_logits: tf.Tensor = None
     distillation_logits: tf.Tensor = None
-    hidden_states: Optional[Tuple[tf.Tensor]] = None
-    attentions: Optional[Tuple[tf.Tensor]] = None
+    hidden_states: Tuple[tf.Tensor] | None = None
+    attentions: Tuple[tf.Tensor] | None = None
 
 
-class TFDeiTEmbeddings(tf.keras.layers.Layer):
+class TFDeiTEmbeddings(keras.layers.Layer):
     """
     Construct the CLS token, distillation token, position and patch embeddings. Optionally, also the mask token.
     """
@@ -109,18 +112,18 @@ def __init__(self, config: DeiTConfig, use_mask_token: bool = False, **kwargs) -
         self.config = config
         self.use_mask_token = use_mask_token
         self.patch_embeddings = TFDeiTPatchEmbeddings(config=config, name="patch_embeddings")
-        self.dropout = tf.keras.layers.Dropout(config.hidden_dropout_prob, name="dropout")
+        self.dropout = keras.layers.Dropout(config.hidden_dropout_prob, name="dropout")
 
-    def build(self, input_shape: tf.TensorShape):
+    def build(self, input_shape=None):
         self.cls_token = self.add_weight(
             shape=(1, 1, self.config.hidden_size),
-            initializer=tf.keras.initializers.zeros(),
+            initializer=keras.initializers.zeros(),
             trainable=True,
             name="cls_token",
         )
         self.distillation_token = self.add_weight(
             shape=(1, 1, self.config.hidden_size),
-            initializer=tf.keras.initializers.zeros(),
+            initializer=keras.initializers.zeros(),
             trainable=True,
             name="distillation_token",
         )
@@ -128,21 +131,30 @@ def build(self, input_shape: tf.TensorShape):
         if self.use_mask_token:
             self.mask_token = self.add_weight(
                 shape=(1, 1, self.config.hidden_size),
-                initializer=tf.keras.initializers.zeros(),
+                initializer=keras.initializers.zeros(),
                 trainable=True,
                 name="mask_token",
             )
         num_patches = self.patch_embeddings.num_patches
         self.position_embeddings = self.add_weight(
             shape=(1, num_patches + 2, self.config.hidden_size),
-            initializer=tf.keras.initializers.zeros(),
+            initializer=keras.initializers.zeros(),
             trainable=True,
             name="position_embeddings",
         )
-        super().build(input_shape)
+
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "patch_embeddings", None) is not None:
+            with tf.name_scope(self.patch_embeddings.name):
+                self.patch_embeddings.build(None)
+        if getattr(self, "dropout", None) is not None:
+            with tf.name_scope(self.dropout.name):
+                self.dropout.build(None)
 
     def call(
-        self, pixel_values: tf.Tensor, bool_masked_pos: Optional[tf.Tensor] = None, training: bool = False
+        self, pixel_values: tf.Tensor, bool_masked_pos: tf.Tensor | None = None, training: bool = False
     ) -> tf.Tensor:
         embeddings = self.patch_embeddings(pixel_values)
         batch_size, seq_length, _ = shape_list(embeddings)
@@ -162,7 +174,7 @@ def call(
         return embeddings
 
 
-class TFDeiTPatchEmbeddings(tf.keras.layers.Layer):
+class TFDeiTPatchEmbeddings(keras.layers.Layer):
     """
     This class turns `pixel_values` of shape `(batch_size, num_channels, height, width)` into the initial
     `hidden_states` (patch embeddings) of shape `(batch_size, seq_length, hidden_size)` to be consumed by a
@@ -182,7 +194,7 @@ def __init__(self, config: DeiTConfig, **kwargs) -> None:
         self.num_channels = num_channels
         self.num_patches = num_patches
 
-        self.projection = tf.keras.layers.Conv2D(
+        self.projection = keras.layers.Conv2D(
             hidden_size, kernel_size=patch_size, strides=patch_size, name="projection"
         )
 
@@ -201,9 +213,17 @@ def call(self, pixel_values: tf.Tensor) -> tf.Tensor:
         x = tf.reshape(x, (batch_size, height * width, num_channels))
         return x
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "projection", None) is not None:
+            with tf.name_scope(self.projection.name):
+                self.projection.build([None, None, None, self.num_channels])
+
 
 # Copied from transformers.models.vit.modeling_tf_vit.TFViTSelfAttention with ViT->DeiT
-class TFDeiTSelfAttention(tf.keras.layers.Layer):
+class TFDeiTSelfAttention(keras.layers.Layer):
     def __init__(self, config: DeiTConfig, **kwargs):
         super().__init__(**kwargs)
 
@@ -218,16 +238,17 @@ def __init__(self, config: DeiTConfig, **kwargs):
         self.all_head_size = self.num_attention_heads * self.attention_head_size
         self.sqrt_att_head_size = math.sqrt(self.attention_head_size)
 
-        self.query = tf.keras.layers.Dense(
+        self.query = keras.layers.Dense(
             units=self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="query"
         )
-        self.key = tf.keras.layers.Dense(
+        self.key = keras.layers.Dense(
             units=self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="key"
         )
-        self.value = tf.keras.layers.Dense(
+        self.value = keras.layers.Dense(
             units=self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="value"
         )
-        self.dropout = tf.keras.layers.Dropout(rate=config.attention_probs_dropout_prob)
+        self.dropout = keras.layers.Dropout(rate=config.attention_probs_dropout_prob)
+        self.config = config
 
     def transpose_for_scores(self, tensor: tf.Tensor, batch_size: int) -> tf.Tensor:
         # Reshape from [batch_size, seq_length, all_head_size] to [batch_size, seq_length, num_attention_heads, attention_head_size]
@@ -277,9 +298,23 @@ def call(
 
         return outputs
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "query", None) is not None:
+            with tf.name_scope(self.query.name):
+                self.query.build([None, None, self.config.hidden_size])
+        if getattr(self, "key", None) is not None:
+            with tf.name_scope(self.key.name):
+                self.key.build([None, None, self.config.hidden_size])
+        if getattr(self, "value", None) is not None:
+            with tf.name_scope(self.value.name):
+                self.value.build([None, None, self.config.hidden_size])
+
 
 # Copied from transformers.models.vit.modeling_tf_vit.TFViTSelfOutput with ViT->DeiT
-class TFDeiTSelfOutput(tf.keras.layers.Layer):
+class TFDeiTSelfOutput(keras.layers.Layer):
     """
     The residual connection is defined in TFDeiTLayer instead of here (as is the case with other models), due to the
     layernorm applied before each block.
@@ -288,10 +323,11 @@ class TFDeiTSelfOutput(tf.keras.layers.Layer):
     def __init__(self, config: DeiTConfig, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(
+        self.dense = keras.layers.Dense(
             units=config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
         )
-        self.dropout = tf.keras.layers.Dropout(rate=config.hidden_dropout_prob)
+        self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob)
+        self.config = config
 
     def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, training: bool = False) -> tf.Tensor:
         hidden_states = self.dense(inputs=hidden_states)
@@ -299,9 +335,17 @@ def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, training: bool
 
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+
 
 # Copied from transformers.models.vit.modeling_tf_vit.TFViTAttention with ViT->DeiT
-class TFDeiTAttention(tf.keras.layers.Layer):
+class TFDeiTAttention(keras.layers.Layer):
     def __init__(self, config: DeiTConfig, **kwargs):
         super().__init__(**kwargs)
 
@@ -328,13 +372,24 @@ def call(
 
         return outputs
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "self_attention", None) is not None:
+            with tf.name_scope(self.self_attention.name):
+                self.self_attention.build(None)
+        if getattr(self, "dense_output", None) is not None:
+            with tf.name_scope(self.dense_output.name):
+                self.dense_output.build(None)
+
 
 # Copied from transformers.models.vit.modeling_tf_vit.TFViTIntermediate with ViT->DeiT
-class TFDeiTIntermediate(tf.keras.layers.Layer):
+class TFDeiTIntermediate(keras.layers.Layer):
     def __init__(self, config: DeiTConfig, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(
+        self.dense = keras.layers.Dense(
             units=config.intermediate_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
         )
 
@@ -342,6 +397,7 @@ def __init__(self, config: DeiTConfig, **kwargs):
             self.intermediate_act_fn = get_tf_activation(config.hidden_act)
         else:
             self.intermediate_act_fn = config.hidden_act
+        self.config = config
 
     def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
         hidden_states = self.dense(inputs=hidden_states)
@@ -349,16 +405,25 @@ def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
 
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+
 
 # Copied from transformers.models.vit.modeling_tf_vit.TFViTOutput with ViT->DeiT
-class TFDeiTOutput(tf.keras.layers.Layer):
+class TFDeiTOutput(keras.layers.Layer):
     def __init__(self, config: DeiTConfig, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(
+        self.dense = keras.layers.Dense(
             units=config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
         )
-        self.dropout = tf.keras.layers.Dropout(rate=config.hidden_dropout_prob)
+        self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob)
+        self.config = config
 
     def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, training: bool = False) -> tf.Tensor:
         hidden_states = self.dense(inputs=hidden_states)
@@ -367,8 +432,16 @@ def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, training: bool
 
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.intermediate_size])
 
-class TFDeiTLayer(tf.keras.layers.Layer):
+
+class TFDeiTLayer(keras.layers.Layer):
     """This corresponds to the Block class in the timm implementation."""
 
     def __init__(self, config: DeiTConfig, **kwargs):
@@ -378,12 +451,9 @@ def __init__(self, config: DeiTConfig, **kwargs):
         self.intermediate = TFDeiTIntermediate(config, name="intermediate")
         self.deit_output = TFDeiTOutput(config, name="output")
 
-        self.layernorm_before = tf.keras.layers.LayerNormalization(
-            epsilon=config.layer_norm_eps, name="layernorm_before"
-        )
-        self.layernorm_after = tf.keras.layers.LayerNormalization(
-            epsilon=config.layer_norm_eps, name="layernorm_after"
-        )
+        self.layernorm_before = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layernorm_before")
+        self.layernorm_after = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layernorm_after")
+        self.config = config
 
     def call(
         self,
@@ -417,9 +487,29 @@ def call(
 
         return outputs
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "attention", None) is not None:
+            with tf.name_scope(self.attention.name):
+                self.attention.build(None)
+        if getattr(self, "intermediate", None) is not None:
+            with tf.name_scope(self.intermediate.name):
+                self.intermediate.build(None)
+        if getattr(self, "deit_output", None) is not None:
+            with tf.name_scope(self.deit_output.name):
+                self.deit_output.build(None)
+        if getattr(self, "layernorm_before", None) is not None:
+            with tf.name_scope(self.layernorm_before.name):
+                self.layernorm_before.build([None, None, self.config.hidden_size])
+        if getattr(self, "layernorm_after", None) is not None:
+            with tf.name_scope(self.layernorm_after.name):
+                self.layernorm_after.build([None, None, self.config.hidden_size])
+
 
 # Copied from transformers.models.vit.modeling_tf_vit.TFViTEncoder with ViT->DeiT
-class TFDeiTEncoder(tf.keras.layers.Layer):
+class TFDeiTEncoder(keras.layers.Layer):
     def __init__(self, config: DeiTConfig, **kwargs):
         super().__init__(**kwargs)
 
@@ -463,9 +553,18 @@ def call(
             last_hidden_state=hidden_states, hidden_states=all_hidden_states, attentions=all_attentions
         )
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "layer", None) is not None:
+            for layer in self.layer:
+                with tf.name_scope(layer.name):
+                    layer.build(None)
+
 
 @keras_serializable
-class TFDeiTMainLayer(tf.keras.layers.Layer):
+class TFDeiTMainLayer(keras.layers.Layer):
     config_class = DeiTConfig
 
     def __init__(
@@ -477,7 +576,7 @@ def __init__(
         self.embeddings = TFDeiTEmbeddings(config, use_mask_token=use_mask_token, name="embeddings")
         self.encoder = TFDeiTEncoder(config, name="encoder")
 
-        self.layernorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layernorm")
+        self.layernorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layernorm")
         self.pooler = TFDeiTPooler(config, name="pooler") if add_pooling_layer else None
 
     def get_input_embeddings(self) -> TFDeiTPatchEmbeddings:
@@ -501,9 +600,9 @@ def get_head_mask(self, head_mask):
     @unpack_inputs
     def call(
         self,
-        pixel_values: Optional[tf.Tensor] = None,
-        bool_masked_pos: Optional[tf.Tensor] = None,
-        head_mask: Optional[tf.Tensor] = None,
+        pixel_values: tf.Tensor | None = None,
+        bool_masked_pos: tf.Tensor | None = None,
+        head_mask: tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
@@ -554,6 +653,23 @@ def call(
             attentions=encoder_outputs.attentions,
         )
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "embeddings", None) is not None:
+            with tf.name_scope(self.embeddings.name):
+                self.embeddings.build(None)
+        if getattr(self, "encoder", None) is not None:
+            with tf.name_scope(self.encoder.name):
+                self.encoder.build(None)
+        if getattr(self, "layernorm", None) is not None:
+            with tf.name_scope(self.layernorm.name):
+                self.layernorm.build([None, None, self.config.hidden_size])
+        if getattr(self, "pooler", None) is not None:
+            with tf.name_scope(self.pooler.name):
+                self.pooler.build(None)
+
 
 # Copied from transformers.models.vit.modeling_tf_vit.TFViTPreTrainedModel with ViT->DeiT all-casing
 class TFDeiTPreTrainedModel(TFPreTrainedModel):
@@ -566,42 +682,10 @@ class TFDeiTPreTrainedModel(TFPreTrainedModel):
     base_model_prefix = "deit"
     main_input_name = "pixel_values"
 
-    @property
-    def dummy_inputs(self) -> Dict[str, tf.Tensor]:
-        """
-        Dummy inputs to build the network.
-
-        Returns:
-            `Dict[str, tf.Tensor]`: The dummy inputs.
-        """
-        VISION_DUMMY_INPUTS = tf.random.uniform(
-            shape=(3, self.config.num_channels, self.config.image_size, self.config.image_size), dtype=tf.float32
-        )
-        return {"pixel_values": tf.constant(VISION_DUMMY_INPUTS)}
-
-    @tf.function(
-        input_signature=[
-            {
-                "pixel_values": tf.TensorSpec((None, None, None, None), tf.float32, name="pixel_values"),
-            }
-        ]
-    )
-    def serving(self, inputs):
-        """
-        Method used for serving the model.
-
-        Args:
-            inputs (`Dict[str, tf.Tensor]`):
-                The input of the saved model as a dictionary of tensors.
-        """
-        output = self.call(inputs)
-
-        return self.serving_output(output)
-
 
 DEIT_START_DOCSTRING = r"""
     This model is a TensorFlow
-    [tf.keras.layers.Layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Layer). Use it as a regular
+    [keras.layers.Layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Layer). Use it as a regular
     TensorFlow Module and refer to the TensorFlow documentation for all matter related to general usage and behavior.
 
     Parameters:
@@ -658,14 +742,14 @@ def __init__(
     )
     def call(
         self,
-        pixel_values: Optional[tf.Tensor] = None,
-        bool_masked_pos: Optional[tf.Tensor] = None,
-        head_mask: Optional[tf.Tensor] = None,
+        pixel_values: tf.Tensor | None = None,
+        bool_masked_pos: tf.Tensor | None = None,
+        head_mask: tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
         training: bool = False,
-    ):
+    ) -> Union[Tuple, TFBaseModelOutputWithPooling]:
         outputs = self.deit(
             pixel_values=pixel_values,
             bool_masked_pos=bool_masked_pos,
@@ -677,29 +761,27 @@ def call(
         )
         return outputs
 
-    def serving_output(self, output: TFBaseModelOutputWithPooling) -> TFBaseModelOutputWithPooling:
-        hidden_states = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attentions = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFBaseModelOutputWithPooling(
-            last_hidden_state=output.last_hidden_state,
-            pooler_output=output.pooler_output,
-            hidden_states=hidden_states,
-            attentions=attentions,
-        )
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "deit", None) is not None:
+            with tf.name_scope(self.deit.name):
+                self.deit.build(None)
 
 
 # Copied from transformers.models.vit.modeling_tf_vit.TFViTPooler with ViT->DeiT
-class TFDeiTPooler(tf.keras.layers.Layer):
+class TFDeiTPooler(keras.layers.Layer):
     def __init__(self, config: DeiTConfig, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(
+        self.dense = keras.layers.Dense(
             units=config.hidden_size,
             kernel_initializer=get_initializer(config.initializer_range),
             activation="tanh",
             name="dense",
         )
+        self.config = config
 
     def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
         # We "pool" the model by simply taking the hidden state corresponding
@@ -709,8 +791,16 @@ def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
 
         return pooled_output
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
 
-class TFDeitPixelShuffle(tf.keras.layers.Layer):
+
+class TFDeitPixelShuffle(keras.layers.Layer):
     """TF layer implementation of torch.nn.PixelShuffle"""
 
     def __init__(self, upscale_factor: int, **kwargs) -> None:
@@ -736,13 +826,14 @@ def call(self, x: tf.Tensor) -> tf.Tensor:
         return hidden_states
 
 
-class TFDeitDecoder(tf.keras.layers.Layer):
+class TFDeitDecoder(keras.layers.Layer):
     def __init__(self, config: DeiTConfig, **kwargs) -> None:
         super().__init__(**kwargs)
-        self.conv2d = tf.keras.layers.Conv2D(
+        self.conv2d = keras.layers.Conv2D(
             filters=config.encoder_stride**2 * config.num_channels, kernel_size=1, name="0"
         )
         self.pixel_shuffle = TFDeitPixelShuffle(config.encoder_stride, name="1")
+        self.config = config
 
     def call(self, inputs: tf.Tensor, training: bool = False) -> tf.Tensor:
         hidden_states = inputs
@@ -750,6 +841,17 @@ def call(self, inputs: tf.Tensor, training: bool = False) -> tf.Tensor:
         hidden_states = self.pixel_shuffle(hidden_states)
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "conv2d", None) is not None:
+            with tf.name_scope(self.conv2d.name):
+                self.conv2d.build([None, None, None, self.config.hidden_size])
+        if getattr(self, "pixel_shuffle", None) is not None:
+            with tf.name_scope(self.pixel_shuffle.name):
+                self.pixel_shuffle.build(None)
+
 
 @add_start_docstrings(
     "DeiT Model with a decoder on top for masked image modeling, as proposed in"
@@ -765,17 +867,17 @@ def __init__(self, config: DeiTConfig) -> None:
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(DEIT_INPUTS_DOCSTRING)
-    @replace_return_docstrings(output_type=TFMaskedLMOutput, config_class=_CONFIG_FOR_DOC)
+    @replace_return_docstrings(output_type=TFMaskedImageModelingOutput, config_class=_CONFIG_FOR_DOC)
     def call(
         self,
-        pixel_values: Optional[tf.Tensor] = None,
-        bool_masked_pos: Optional[tf.Tensor] = None,
-        head_mask: Optional[tf.Tensor] = None,
+        pixel_values: tf.Tensor | None = None,
+        bool_masked_pos: tf.Tensor | None = None,
+        head_mask: tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
         training: bool = False,
-    ) -> Union[tuple, TFMaskedLMOutput]:
+    ) -> Union[tuple, TFMaskedImageModelingOutput]:
         r"""
         bool_masked_pos (`tf.Tensor` of type bool and shape `(batch_size, num_patches)`):
             Boolean masked positions. Indicates which patches are masked (1) and which aren't (0).
@@ -801,7 +903,7 @@ def call(
         >>> bool_masked_pos = tf.cast(tf.random.uniform((1, num_patches), minval=0, maxval=2, dtype=tf.int32), tf.bool)
 
         >>> outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
-        >>> loss, reconstructed_pixel_values = outputs.loss, outputs.logits
+        >>> loss, reconstructed_pixel_values = outputs.loss, outputs.reconstruction
         >>> list(reconstructed_pixel_values.shape)
         [1, 3, 224, 224]
         ```"""
@@ -828,7 +930,7 @@ def call(
         # Reconstruct pixel values
         reconstructed_pixel_values = self.decoder(sequence_output, training=training)
         # TF 2.0 image layers can't use NCHW format when running on CPU, so intermediate layers use NHWC,
-        # including the The decoder. We transpose to compute the loss against the pixel values
+        # including the decoder. We transpose to compute the loss against the pixel values
         # (batch_size, height, width, num_channels) -> (batch_size, num_channels, height, width)
         reconstructed_pixel_values = tf.transpose(reconstructed_pixel_values, (0, 3, 1, 2))
 
@@ -841,7 +943,7 @@ def call(
             mask = tf.expand_dims(mask, 1)
             mask = tf.cast(mask, tf.float32)
 
-            reconstruction_loss = tf.keras.losses.mean_absolute_error(
+            reconstruction_loss = keras.losses.mean_absolute_error(
                 # Swap axes as metric calculation reduces over the final dimension
                 tf.transpose(pixel_values, (1, 2, 3, 0)),
                 tf.transpose(reconstructed_pixel_values, (1, 2, 3, 0)),
@@ -856,18 +958,23 @@ def call(
             output = (reconstructed_pixel_values,) + outputs[1:]
             return ((masked_im_loss,) + output) if masked_im_loss is not None else output
 
-        return TFMaskedLMOutput(
+        return TFMaskedImageModelingOutput(
             loss=masked_im_loss,
-            logits=reconstructed_pixel_values,
+            reconstruction=reconstructed_pixel_values,
             hidden_states=outputs.hidden_states,
             attentions=outputs.attentions,
         )
 
-    def serving_output(self, output: TFMaskedLMOutput) -> TFMaskedLMOutput:
-        hidden_states = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attentions = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFMaskedLMOutput(logits=output.logits, hidden_states=hidden_states, attentions=attentions)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "deit", None) is not None:
+            with tf.name_scope(self.deit.name):
+                self.deit.build(None)
+        if getattr(self, "decoder", None) is not None:
+            with tf.name_scope(self.decoder.name):
+                self.decoder.build(None)
 
 
 @add_start_docstrings(
@@ -886,19 +993,20 @@ def __init__(self, config: DeiTConfig):
 
         # Classifier head
         self.classifier = (
-            tf.keras.layers.Dense(config.num_labels, name="classifier")
+            keras.layers.Dense(config.num_labels, name="classifier")
             if config.num_labels > 0
-            else tf.keras.layers.Activation("linear", name="classifier")
+            else keras.layers.Activation("linear", name="classifier")
         )
+        self.config = config
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(DEIT_INPUTS_DOCSTRING)
     @replace_return_docstrings(output_type=TFImageClassifierOutput, config_class=_CONFIG_FOR_DOC)
     def call(
         self,
-        pixel_values: Optional[tf.Tensor] = None,
-        head_mask: Optional[tf.Tensor] = None,
-        labels: Optional[tf.Tensor] = None,
+        pixel_values: tf.Tensor | None = None,
+        head_mask: tf.Tensor | None = None,
+        labels: tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
@@ -920,7 +1028,7 @@ def call(
         >>> from PIL import Image
         >>> import requests
 
-        >>> tf.keras.utils.set_random_seed(3)  # doctest: +IGNORE_RESULT
+        >>> keras.utils.set_random_seed(3)  # doctest: +IGNORE_RESULT
         >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
         >>> image = Image.open(requests.get(url, stream=True).raw)
 
@@ -966,11 +1074,16 @@ def call(
             attentions=outputs.attentions,
         )
 
-    def serving_output(self, output: TFImageClassifierOutput) -> TFImageClassifierOutput:
-        hidden_states = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attentions = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFImageClassifierOutput(logits=output.logits, hidden_states=hidden_states, attentions=attentions)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "deit", None) is not None:
+            with tf.name_scope(self.deit.name):
+                self.deit.build(None)
+        if getattr(self, "classifier", None) is not None:
+            with tf.name_scope(self.classifier.name):
+                self.classifier.build([None, None, self.config.hidden_size])
 
 
 @add_start_docstrings(
@@ -994,15 +1107,16 @@ def __init__(self, config: DeiTConfig) -> None:
 
         # Classifier heads
         self.cls_classifier = (
-            tf.keras.layers.Dense(config.num_labels, name="cls_classifier")
+            keras.layers.Dense(config.num_labels, name="cls_classifier")
             if config.num_labels > 0
-            else tf.keras.layers.Activation("linear", name="cls_classifier")
+            else keras.layers.Activation("linear", name="cls_classifier")
         )
         self.distillation_classifier = (
-            tf.keras.layers.Dense(config.num_labels, name="distillation_classifier")
+            keras.layers.Dense(config.num_labels, name="distillation_classifier")
             if config.num_labels > 0
-            else tf.keras.layers.Activation("linear", name="distillation_classifier")
+            else keras.layers.Activation("linear", name="distillation_classifier")
         )
+        self.config = config
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(DEIT_INPUTS_DOCSTRING)
@@ -1014,8 +1128,8 @@ def __init__(self, config: DeiTConfig) -> None:
     )
     def call(
         self,
-        pixel_values: Optional[tf.Tensor] = None,
-        head_mask: Optional[tf.Tensor] = None,
+        pixel_values: tf.Tensor | None = None,
+        head_mask: tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
@@ -1052,16 +1166,16 @@ def call(
             attentions=outputs.attentions,
         )
 
-    def serving_output(
-        self, output: TFDeiTForImageClassificationWithTeacherOutput
-    ) -> TFDeiTForImageClassificationWithTeacherOutput:
-        hidden_states = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attentions = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFDeiTForImageClassificationWithTeacherOutput(
-            logits=output.logits,
-            cls_logits=output.cls_logits,
-            distillation_logits=output.distillation_logits,
-            hidden_states=hidden_states,
-            attentions=attentions,
-        )
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "deit", None) is not None:
+            with tf.name_scope(self.deit.name):
+                self.deit.build(None)
+        if getattr(self, "cls_classifier", None) is not None:
+            with tf.name_scope(self.cls_classifier.name):
+                self.cls_classifier.build([None, None, self.config.hidden_size])
+        if getattr(self, "distillation_classifier", None) is not None:
+            with tf.name_scope(self.distillation_classifier.name):
+                self.distillation_classifier.build([None, None, self.config.hidden_size])
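
The TF model above replaces the `dummy_inputs`/`serving`/`serving_output` hooks with explicit `build()` methods on every layer, so weights are created with known shapes rather than by tracing a dummy forward pass. A minimal sketch with a randomly initialized config (shapes and label count are illustrative):

```python
import tensorflow as tf
from transformers import DeiTConfig, TFDeiTForImageClassification

config = DeiTConfig(num_labels=10)
model = TFDeiTForImageClassification(config)

# DeiT expects NCHW pixel values; the TF layers handle the transposition internally.
pixel_values = tf.random.uniform((1, config.num_channels, config.image_size, config.image_size))
outputs = model(pixel_values)
print(outputs.logits.shape)  # (1, 10)
```
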
diff --git a/src/transformers/models/bort/__init__.py b/src/transformers/models/deprecated/__init__.py
similarity index 100%
rename from src/transformers/models/bort/__init__.py
rename to src/transformers/models/deprecated/__init__.py
diff --git a/tests/mixed_int8/__init__.py b/src/transformers/models/deprecated/bort/__init__.py
similarity index 100%
rename from tests/mixed_int8/__init__.py
rename to src/transformers/models/deprecated/bort/__init__.py
diff --git a/src/transformers/models/bort/convert_bort_original_gluonnlp_checkpoint_to_pytorch.py b/src/transformers/models/deprecated/bort/convert_bort_original_gluonnlp_checkpoint_to_pytorch.py
similarity index 99%
rename from src/transformers/models/bort/convert_bort_original_gluonnlp_checkpoint_to_pytorch.py
rename to src/transformers/models/deprecated/bort/convert_bort_original_gluonnlp_checkpoint_to_pytorch.py
index 4753f593da19b2..5dc9a244c43c78 100644
--- a/src/transformers/models/bort/convert_bort_original_gluonnlp_checkpoint_to_pytorch.py
+++ b/src/transformers/models/deprecated/bort/convert_bort_original_gluonnlp_checkpoint_to_pytorch.py
@@ -277,7 +277,7 @@ def check_and_map_params(hf_param, gluon_param):
     hf_bort_model.half()
 
     # Compare output of both models
-    tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
+    tokenizer = RobertaTokenizer.from_pretrained("FacebookAI/roberta-base")
 
     input_ids = tokenizer.encode_plus(SAMPLE_TEXT)["input_ids"]
 
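
The conversion script now points at the namespaced hub repository. A quick sketch of the equivalent user-side call (the legacy un-namespaced id should still resolve, but the namespaced one is canonical):

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("FacebookAI/roberta-base")
print(tokenizer.encode_plus("Hello world")["input_ids"])
```
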
diff --git a/src/transformers/models/mctct/__init__.py b/src/transformers/models/deprecated/mctct/__init__.py
similarity index 75%
rename from src/transformers/models/mctct/__init__.py
rename to src/transformers/models/deprecated/mctct/__init__.py
index 5da754fc51b886..567be97b7cd863 100644
--- a/src/transformers/models/mctct/__init__.py
+++ b/src/transformers/models/deprecated/mctct/__init__.py
@@ -13,24 +13,16 @@
 # limitations under the License.
 from typing import TYPE_CHECKING
 
-from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_speech_available, is_torch_available
+from ....utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available
 
 
 _import_structure = {
     "configuration_mctct": ["MCTCT_PRETRAINED_CONFIG_ARCHIVE_MAP", "MCTCTConfig"],
+    "feature_extraction_mctct": ["MCTCTFeatureExtractor"],
     "processing_mctct": ["MCTCTProcessor"],
 }
 
 
-try:
-    if not is_speech_available():
-        raise OptionalDependencyNotAvailable()
-except OptionalDependencyNotAvailable:
-    pass
-else:
-    _import_structure["feature_extraction_mctct"] = ["MCTCTFeatureExtractor"]
-
-
 try:
     if not is_torch_available():
         raise OptionalDependencyNotAvailable()
@@ -47,16 +39,9 @@
 
 if TYPE_CHECKING:
     from .configuration_mctct import MCTCT_PRETRAINED_CONFIG_ARCHIVE_MAP, MCTCTConfig
+    from .feature_extraction_mctct import MCTCTFeatureExtractor
     from .processing_mctct import MCTCTProcessor
 
-    try:
-        if not is_speech_available():
-            raise OptionalDependencyNotAvailable()
-    except OptionalDependencyNotAvailable:
-        pass
-    else:
-        from .feature_extraction_mctct import MCTCTFeatureExtractor
-
     try:
         if not is_torch_available():
             raise OptionalDependencyNotAvailable()
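
Because the rewritten feature extractor no longer depends on torchaudio, the `__init__` above imports `MCTCTFeatureExtractor` unconditionally instead of guarding it behind `is_speech_available()`. A minimal sketch of what this enables, assuming the default constructor arguments:

```python
# Importable even without torchaudio installed.
from transformers import MCTCTFeatureExtractor

feature_extractor = MCTCTFeatureExtractor()
print(feature_extractor.feature_size, feature_extractor.sampling_rate)  # defaults: 80 and 16000
```
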
diff --git a/src/transformers/models/mctct/configuration_mctct.py b/src/transformers/models/deprecated/mctct/configuration_mctct.py
similarity index 94%
rename from src/transformers/models/mctct/configuration_mctct.py
rename to src/transformers/models/deprecated/mctct/configuration_mctct.py
index 6389f2238fc17e..9d4eab0d3f3d4a 100644
--- a/src/transformers/models/mctct/configuration_mctct.py
+++ b/src/transformers/models/deprecated/mctct/configuration_mctct.py
@@ -14,8 +14,8 @@
 # limitations under the License.
 """M-CTC-T model configuration"""
 
-from ...configuration_utils import PretrainedConfig
-from ...utils import logging
+from ....configuration_utils import PretrainedConfig
+from ....utils import logging
 
 
 logger = logging.get_logger(__name__)
@@ -53,7 +53,7 @@ class MCTCTConfig(PretrainedConfig):
             Dimensions of each attention head for each attention layer in the Transformer encoder.
         max_position_embeddings (`int`, *optional*, defaults to 920):
             The maximum sequence length that this model might ever be used with (after log-mel spectrogram extraction).
-        layer_norm_eps (`float`, *optional*, defaults to 1e-5):
+        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
             The epsilon used by the layer normalization layers.
         layerdrop (`float`, *optional*, defaults to 0.3):
             The probability of dropping an encoder layer during training. The default 0.3 value is used in the original
@@ -63,9 +63,9 @@ class MCTCTConfig(PretrainedConfig):
             `"relu"`, `"selu"` and `"gelu_new"` are supported.
         initializer_range (`float`, *optional*, defaults to 0.02):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
-        hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
-            The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
-        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
+        hidden_dropout_prob (`float`, *optional*, defaults to 0.3):
+            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
+        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.3):
             The dropout ratio for the attention probabilities.
         pad_token_id (`int`, *optional*, defaults to 1):
             The tokenizer index of the pad token.
@@ -80,17 +80,17 @@ class MCTCTConfig(PretrainedConfig):
             The probability of randomly dropping the `Conv1dSubsampler` layer during training.
         num_conv_layers (`int`, *optional*, defaults to 1):
             Number of convolution layers before applying transformer encoder layers.
-        conv_kernel (`List[int]`, *optional*, defaults to `[7]`):
+        conv_kernel (`Sequence[int]`, *optional*, defaults to `(7,)`):
             The kernel size of the 1D convolution applied before transformer layers. `len(conv_kernel)` must be equal
             to `num_conv_layers`.
-        conv_stride (`List[int]`, *optional*, defaults to `[3]`):
+        conv_stride (`Sequence[int]`, *optional*, defaults to `(3,)`):
             The stride length of the 1D convolution applied before transformer layers. `len(conv_stride)` must be equal
             to `num_conv_layers`.
         input_feat_per_channel (`int`, *optional*, defaults to 80):
             Feature dimensions of the channels of the input to the Conv1D layer.
         input_channels (`int`, *optional*, defaults to 1):
             Number of input channels of the input to the Conv1D layer.
-        conv_channels (`List[int]`, *optional*, defaults to None):
+        conv_channels (`List[int]`, *optional*):
             Channel sizes of intermediate Conv1D layers.
         ctc_loss_reduction (`str`, *optional*, defaults to `"sum"`):
             Specifies the reduction to apply to the output of `torch.nn.CTCLoss`. Only relevant when training an
@@ -114,6 +114,7 @@ class MCTCTConfig(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "mctct"
 
     def __init__(
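
The docstring corrections above can be cross-checked against the live configuration object. Below is a minimal sketch, assuming a build that still exports the deprecated `MCTCTConfig`; the expected values in the comments come from the updated docstring rather than from running the code.

```python
# Illustrative sketch: compare MCTCTConfig's actual defaults with the docstring above.
from transformers import MCTCTConfig

config = MCTCTConfig()
print(config.layer_norm_eps)                   # 1e-05 per the updated docstring
print(config.hidden_dropout_prob)              # 0.3 (the old docstring claimed 0.1)
print(config.attention_probs_dropout_prob)     # 0.3 (likewise corrected from 0.1)
print(config.conv_kernel, config.conv_stride)  # documented as (7,) and (3,)
```
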
diff --git a/src/transformers/models/mctct/feature_extraction_mctct.py b/src/transformers/models/deprecated/mctct/feature_extraction_mctct.py
similarity index 71%
rename from src/transformers/models/mctct/feature_extraction_mctct.py
rename to src/transformers/models/deprecated/mctct/feature_extraction_mctct.py
index d517e3caf85e08..e1e17c4b12f91d 100644
--- a/src/transformers/models/mctct/feature_extraction_mctct.py
+++ b/src/transformers/models/deprecated/mctct/feature_extraction_mctct.py
@@ -19,25 +19,16 @@
 from typing import List, Optional, Union
 
 import numpy as np
-import torch
-import torchaudio
-from packaging import version
 
-from ...feature_extraction_sequence_utils import SequenceFeatureExtractor
-from ...feature_extraction_utils import BatchFeature
-from ...file_utils import PaddingStrategy, TensorType
-from ...utils import logging
+from ....audio_utils import mel_filter_bank, optimal_fft_length, spectrogram, window_function
+from ....feature_extraction_sequence_utils import SequenceFeatureExtractor
+from ....feature_extraction_utils import BatchFeature
+from ....file_utils import PaddingStrategy, TensorType
+from ....utils import logging
 
 
 logger = logging.get_logger(__name__)
 
-parsed_torchaudio_version_base = version.parse(version.parse(torchaudio.__version__).base_version)
-if not parsed_torchaudio_version_base >= version.parse("0.10"):
-    logger.warning(
-        f"You are using torchaudio=={torchaudio.__version__}, but torchaudio>=0.10.0 is required to use "
-        "MCTCTFeatureExtractor. This requires torch>=1.10.0. Please upgrade torch and torchaudio."
-    )
-
 
 class MCTCTFeatureExtractor(SequenceFeatureExtractor):
     r"""
@@ -110,109 +101,39 @@ def __init__(
         self.sample_size = win_length * sampling_rate // 1000
         self.sample_stride = hop_length * sampling_rate // 1000
 
-        self.n_fft = 2 ** int(np.ceil(np.log2(self.sample_size)))
+        self.n_fft = optimal_fft_length(self.sample_size)
         self.n_freqs = (self.n_fft // 2) + 1
 
-    @staticmethod
-    def _num_frames_calc(in_size, frame_size, frame_stride):
-        return int(1 + np.floor((in_size - frame_size) * 1 / frame_stride))
-
-    @staticmethod
-    def _frame_signal(one_waveform, n_frames, frame_signal_scale, window_length, sample_stride):
-        scale = frame_signal_scale
-        frames = np.zeros(n_frames * window_length)
-        for frame_idx in range(n_frames):
-            start = frame_idx * window_length
-            end = (frame_idx + 1) * window_length
-            wave_start = frame_idx * sample_stride
-            wave_end = frame_idx * sample_stride + window_length
-            frames[start:end] = scale * one_waveform[wave_start:wave_end]
-
-        return frames
-
-    @staticmethod
-    def _apply_preemphasis_inplace(frames, window_length, preemphasis_coeff):
-        if frames.size % window_length != 0:
-            raise ValueError(
-                f"`frames` is supposed to have length divisble by `window_length`, but is {frames.size} with"
-                f" window_length={window_length}."
-            )
-
-        n_frames = frames.size // window_length
-        for frame_idx in range(n_frames, 0, -1):
-            start = (frame_idx - 1) * window_length
-            end = frame_idx * window_length - 1
-            frames[start + 1 : end + 1] -= preemphasis_coeff * frames[start:end]
-            frames[start] *= 1 - preemphasis_coeff
-
-    @staticmethod
-    def _windowing(frames, window_length, window):
-        if frames.size % window_length != 0:
-            raise ValueError(
-                f"`frames` is supposed to have length divisble by `window_length`, but is {frames.size} with"
-                f" window_length={window_length}."
-            )
-
-        shaped = frames.reshape(-1, window_length)
-        shaped = window * shaped
-        return shaped
-
-    @staticmethod
-    def _dft(frames, K, n_frames, n_samples, n_fft):
-        dft = np.zeros([n_frames, K])
-
-        for frame in range(n_frames):
-            begin = frame * n_samples
-
-            inwards_buffer = frames[begin : begin + n_samples]
-            inwards_buffer = np.pad(inwards_buffer, (0, n_fft - n_samples), "constant")
-            out = np.fft.rfft(inwards_buffer)
-
-            dft[frame] = np.abs(out[:K])
-
-        return dft
-
     def _extract_mfsc_features(self, one_waveform: np.array) -> np.ndarray:
         """
         Extracts MFSC Features for one waveform vector (unbatched). Adapted from Flashlight's C++ MFSC code.
         """
         if self.win_function == "hamming_window":
-            window = torch.hamming_window(window_length=self.sample_size, periodic=False, alpha=0.54, beta=0.46)
+            window = window_function(window_length=self.sample_size, name=self.win_function, periodic=False)
         else:
-            window = getattr(torch, self.win_function)()
-
-        window = window.numpy()
-
-        fbanks = torchaudio.functional.melscale_fbanks(
-            n_freqs=self.n_freqs,
-            f_min=0.0,  # change this to zeros
-            f_max=self.sampling_rate / 2.0,
-            n_mels=self.feature_size,
-            sample_rate=self.sampling_rate,
+            window = window_function(window_length=self.sample_size, name=self.win_function)
+
+        fbanks = mel_filter_bank(
+            num_frequency_bins=self.n_freqs,
+            num_mel_filters=self.feature_size,
+            min_frequency=0.0,
+            max_frequency=self.sampling_rate / 2.0,
+            sampling_rate=self.sampling_rate,
         )
 
-        fbanks = fbanks.numpy()
-
-        n_frames = self._num_frames_calc(one_waveform.size, self.sample_size, self.sample_stride)
-
-        frames = self._frame_signal(
-            one_waveform, n_frames, self.frame_signal_scale, self.sample_size, self.sample_stride
+        msfc_features = spectrogram(
+            one_waveform * self.frame_signal_scale,
+            window=window,
+            frame_length=self.sample_size,
+            hop_length=self.sample_stride,
+            fft_length=self.n_fft,
+            center=False,
+            preemphasis=self.preemphasis_coeff,
+            mel_filters=fbanks,
+            mel_floor=self.mel_floor,
+            log_mel="log",
         )
-
-        self._apply_preemphasis_inplace(frames, self.sample_size, self.preemphasis_coeff)
-
-        frames = self._windowing(frames, self.sample_size, window)
-
-        dft_out = self._dft(frames.flatten(), self.n_freqs, n_frames, self.sample_size, self.n_fft)
-
-        # msfc_features = STFT * mel frequency banks.
-        msfc_features = np.einsum("...tf,fm->...tm", dft_out, fbanks)
-
-        # clamp feature values then log scale, as implemented in flashlight
-        msfc_features = np.maximum(msfc_features, self.mel_floor)
-        msfc_features = np.log(msfc_features)
-
-        return msfc_features
+        return msfc_features.T
 
     def _normalize_one(self, x, input_length, padding_value):
         # make sure we normalize float32 arrays
@@ -256,7 +177,8 @@ def __call__(
         Args:
             raw_speech (`torch.Tensor`, `np.ndarray`, `List[float]`, `List[torch.Tensor]`, `List[np.ndarray]`, `List[List[float]]`):
                 The sequence or batch of sequences to be padded. Each sequence can be a tensor, a numpy array, a list
-                of float values, a list of tensors, a list of numpy arrays or a list of list of float values.
+                of float values, a list of tensors, a list of numpy arrays or a list of list of float values. Must be
+                mono channel audio, not stereo, i.e. single float per timestep.
             padding (`bool`, `str` or [`~file_utils.PaddingStrategy`], *optional*, defaults to `False`):
                 Select a strategy to pad the returned sequences (according to the model's padding side and padding
                 index) among:
@@ -307,9 +229,11 @@ def __call__(
                 "Failing to do so can result in silent errors that might be hard to debug."
             )
 
-        is_batched = bool(
-            isinstance(raw_speech, (list, tuple))
-            and (isinstance(raw_speech[0], np.ndarray) or isinstance(raw_speech[0], (tuple, list)))
+        is_batched_numpy = isinstance(raw_speech, np.ndarray) and len(raw_speech.shape) > 1
+        if is_batched_numpy and len(raw_speech.shape) > 2:
+            raise ValueError(f"Only mono-channel audio is supported for input to {self}")
+        is_batched = is_batched_numpy or (
+            isinstance(raw_speech, (list, tuple)) and (isinstance(raw_speech[0], (np.ndarray, tuple, list)))
         )
 
         if is_batched:
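
The hunk above replaces the handwritten framing, pre-emphasis, windowing and DFT helpers with the shared numpy routines from `transformers.audio_utils`. The sketch below runs the same log-mel pipeline on a synthetic waveform; the sampling rate, window and hop sizes, mel count, pre-emphasis coefficient and mel floor are illustrative assumptions, not guaranteed extractor defaults.

```python
# Illustrative sketch of the numpy-only MFSC pipeline used by the new extractor.
import numpy as np
from transformers.audio_utils import mel_filter_bank, optimal_fft_length, spectrogram, window_function

sampling_rate = 16000
sample_size = 25 * sampling_rate // 1000    # 25 ms analysis window
sample_stride = 10 * sampling_rate // 1000  # 10 ms hop
n_fft = optimal_fft_length(sample_size)     # next power of two >= window size

window = window_function(window_length=sample_size, name="hamming_window", periodic=False)
fbanks = mel_filter_bank(
    num_frequency_bins=(n_fft // 2) + 1,
    num_mel_filters=80,
    min_frequency=0.0,
    max_frequency=sampling_rate / 2.0,
    sampling_rate=sampling_rate,
)

waveform = np.random.randn(sampling_rate).astype(np.float32)  # one second of noise
features = spectrogram(
    waveform,
    window=window,
    frame_length=sample_size,
    hop_length=sample_stride,
    fft_length=n_fft,
    center=False,
    preemphasis=0.97,
    mel_filters=fbanks,
    mel_floor=1.0,
    log_mel="log",
).T  # transpose to (num_frames, num_mel_filters), as the extractor does
```
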
diff --git a/src/transformers/models/mctct/modeling_mctct.py b/src/transformers/models/deprecated/mctct/modeling_mctct.py
similarity index 94%
rename from src/transformers/models/mctct/modeling_mctct.py
rename to src/transformers/models/deprecated/mctct/modeling_mctct.py
index 3effb52de5335f..cb3186c9dd37b8 100755
--- a/src/transformers/models/mctct/modeling_mctct.py
+++ b/src/transformers/models/deprecated/mctct/modeling_mctct.py
@@ -16,36 +16,29 @@
 
 
 import math
-import random
 from typing import Optional, Tuple, Union
 
 import torch
 import torch.utils.checkpoint
 from torch import nn
 
-from ...activations import ACT2FN
-from ...deepspeed import is_deepspeed_zero3_enabled
-from ...file_utils import add_code_sample_docstrings, add_start_docstrings, add_start_docstrings_to_model_forward
-from ...modeling_outputs import BaseModelOutput, CausalLMOutput
-from ...modeling_utils import (
+from ....activations import ACT2FN
+from ....file_utils import add_code_sample_docstrings, add_start_docstrings, add_start_docstrings_to_model_forward
+from ....integrations.deepspeed import is_deepspeed_zero3_enabled
+from ....modeling_attn_mask_utils import _prepare_4d_attention_mask
+from ....modeling_outputs import BaseModelOutput, CausalLMOutput
+from ....modeling_utils import (
     PreTrainedModel,
     apply_chunking_to_forward,
     find_pruneable_heads_and_indices,
     prune_linear_layer,
 )
-from ...pytorch_utils import is_torch_less_than_1_9
-from ...utils import logging
+from ....utils import logging
 from .configuration_mctct import MCTCTConfig
 
 
 logger = logging.get_logger(__name__)
 
-if is_torch_less_than_1_9:
-    logger.warning(
-        f"You are using torch=={torch.__version__}, but torch>=1.9.0 is required to use MCTCTModel. Please upgrade"
-        " torch."
-    )
-
 _HIDDEN_STATES_START_POSITION = 1
 
 _CONFIG_FOR_DOC = "MCTCTConfig"
@@ -65,21 +58,6 @@
 ]
 
 
-# Copied from transformers.models.bart.modeling_bart._expand_mask
-def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
-    """
-    Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
-    """
-    bsz, src_len = mask.size()
-    tgt_len = tgt_len if tgt_len is not None else src_len
-
-    expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
-
-    inverted_mask = 1.0 - expanded_mask
-
-    return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min)
-
-
 class MCTCTConv1dSubsampler(nn.Module):
     """
     Convolutional subsampler: a stack of 1D convolution (along temporal dimension) followed by non-linear activation
@@ -157,7 +135,9 @@ def __init__(self, config):
         self.dropout = nn.Dropout(config.hidden_dropout_prob)
 
         # position_ids (1, len position emb) is contiguous in memory and exported when serialized
-        self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
+        self.register_buffer(
+            "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False
+        )
         self.register_buffer(
             "token_type_ids",
             torch.zeros(self.position_ids.size(), dtype=torch.long, device=self.position_ids.device),
@@ -451,7 +431,6 @@ class MCTCTPreTrainedModel(PreTrainedModel):
     config_class = MCTCTConfig
     base_model_prefix = "mctct"
     main_input_name = "input_features"
-    _keys_to_ignore_on_load_missing = ["position_ids"]
     supports_gradient_checkpointing = True
 
     def _init_weights(self, module):
@@ -511,10 +490,6 @@ def _get_feature_vector_attention_mask(self, feature_vector_length, attention_ma
         attention_mask = attention_mask.flip([-1]).cumsum(-1).flip([-1]).long()
         return attention_mask
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, (MCTCTEncoder)):
-            module.gradient_checkpointing = value
-
 
 MCTCT_START_DOCSTRING = r"""
     This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) sub-class. Use
@@ -598,7 +573,7 @@ def forward(
         # expand attention_mask
         if attention_mask is not None:
             # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
-            attention_mask = _expand_mask(attention_mask, inputs_embeds.dtype)
+            attention_mask = _prepare_4d_attention_mask(attention_mask, inputs_embeds.dtype)
 
         encoder_states = () if output_hidden_states else None
         all_attentions = () if output_attentions else None
@@ -617,24 +592,18 @@ def forward(
                 encoder_states = encoder_states + (hidden_states,)
 
             # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
-            dropout_probability = random.uniform(0, 1)
+            dropout_probability = torch.rand([])
 
             skip_the_layer = True if self.training and (dropout_probability < self.config.layerdrop) else False
             if not skip_the_layer or deepspeed_zero3_is_enabled:
                 # under deepspeed zero3 all gpus must run in sync
                 if self.gradient_checkpointing and self.training:
-
-                    def create_custom_forward(module):
-                        def custom_forward(*inputs):
-                            return module(*inputs, output_attentions)
-
-                        return custom_forward
-
-                    layer_outputs = torch.utils.checkpoint.checkpoint(
-                        create_custom_forward(encoder_layer),
+                    layer_outputs = self._gradient_checkpointing_func(
+                        encoder_layer.__call__,
                         hidden_states,
                         attention_mask,
                         (head_mask[idx] if head_mask is not None else None),
+                        output_attentions,
                     )
                 else:
                     layer_outputs = encoder_layer(
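
The modeling hunk drops the module-local `_expand_mask` in favour of the shared `_prepare_4d_attention_mask`, which performs the same expansion shown in the removed code: a `[bsz, seq_len]` padding mask becomes an additive `[bsz, 1, tgt_len, src_len]` mask whose padded positions hold the dtype minimum. Here is a minimal sketch of that behaviour, assuming the helper is importable exactly as in the new import block.

```python
# Illustrative sketch: 2D padding mask -> 4D additive attention mask.
import torch
from transformers.modeling_attn_mask_utils import _prepare_4d_attention_mask

attention_mask = torch.tensor([[1, 1, 1, 0]])  # one sequence, last position is padding
expanded = _prepare_4d_attention_mask(attention_mask, torch.float32)
print(expanded.shape)     # torch.Size([1, 1, 4, 4])
print(expanded[0, 0, 0])  # [0, 0, 0, finfo(float32).min] -> the padded key is masked out
```
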
diff --git a/src/transformers/models/mctct/processing_mctct.py b/src/transformers/models/deprecated/mctct/processing_mctct.py
similarity index 99%
rename from src/transformers/models/mctct/processing_mctct.py
rename to src/transformers/models/deprecated/mctct/processing_mctct.py
index eb20fa09b34c21..4e0cbe27dd9be0 100644
--- a/src/transformers/models/mctct/processing_mctct.py
+++ b/src/transformers/models/deprecated/mctct/processing_mctct.py
@@ -18,7 +18,7 @@
 import warnings
 from contextlib import contextmanager
 
-from ...processing_utils import ProcessorMixin
+from ....processing_utils import ProcessorMixin
 
 
 class MCTCTProcessor(ProcessorMixin):
@@ -34,6 +34,7 @@ class MCTCTProcessor(ProcessorMixin):
         tokenizer (`AutoTokenizer`):
             An instance of [`AutoTokenizer`]. The tokenizer is a required input.
     """
+
     feature_extractor_class = "MCTCTFeatureExtractor"
     tokenizer_class = "AutoTokenizer"
 
diff --git a/src/transformers/models/mmbt/__init__.py b/src/transformers/models/deprecated/mmbt/__init__.py
similarity index 94%
rename from src/transformers/models/mmbt/__init__.py
rename to src/transformers/models/deprecated/mmbt/__init__.py
index 29aee5a0cdbfcd..e467090cb4fbfa 100644
--- a/src/transformers/models/mmbt/__init__.py
+++ b/src/transformers/models/deprecated/mmbt/__init__.py
@@ -14,7 +14,7 @@
 
 from typing import TYPE_CHECKING
 
-from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available
+from ....utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available
 
 
 _import_structure = {"configuration_mmbt": ["MMBTConfig"]}
diff --git a/src/transformers/models/mmbt/configuration_mmbt.py b/src/transformers/models/deprecated/mmbt/configuration_mmbt.py
similarity index 98%
rename from src/transformers/models/mmbt/configuration_mmbt.py
rename to src/transformers/models/deprecated/mmbt/configuration_mmbt.py
index aa453db592f8df..df5161b0927ad2 100644
--- a/src/transformers/models/mmbt/configuration_mmbt.py
+++ b/src/transformers/models/deprecated/mmbt/configuration_mmbt.py
@@ -15,7 +15,7 @@
 # limitations under the License.
 """ MMBT configuration"""
 
-from ...utils import logging
+from ....utils import logging
 
 
 logger = logging.get_logger(__name__)
diff --git a/src/transformers/models/mmbt/modeling_mmbt.py b/src/transformers/models/deprecated/mmbt/modeling_mmbt.py
similarity index 97%
rename from src/transformers/models/mmbt/modeling_mmbt.py
rename to src/transformers/models/deprecated/mmbt/modeling_mmbt.py
index 8819dc4d5178c0..8dc450ce8f6c13 100644
--- a/src/transformers/models/mmbt/modeling_mmbt.py
+++ b/src/transformers/models/deprecated/mmbt/modeling_mmbt.py
@@ -20,9 +20,9 @@
 from torch import nn
 from torch.nn import CrossEntropyLoss, MSELoss
 
-from ...modeling_outputs import BaseModelOutputWithPooling, SequenceClassifierOutput
-from ...modeling_utils import ModuleUtilsMixin
-from ...utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings
+from ....modeling_outputs import BaseModelOutputWithPooling, SequenceClassifierOutput
+from ....modeling_utils import ModuleUtilsMixin
+from ....utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings
 
 
 logger = logging.get_logger(__name__)
@@ -106,7 +106,7 @@ def forward(self, input_modal, start_token=None, end_token=None, position_ids=No
             Encoder, the shape would be (batch_size, channels, height, width)
         input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
             Indices of input sequence tokens in the vocabulary. It does not expect [CLS] token to be added as it's
-            appended to the end of other modality embeddings. Indices can be obtained using [`BertTokenizer`]. See
+            appended to the end of other modality embeddings. Indices can be obtained using [`AutoTokenizer`]. See
             [`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for details.
 
             [What are input IDs?](../glossary#input-ids)
@@ -213,7 +213,7 @@ def forward(
 
         ```python
         # For example purposes. Not runnable.
-        transformer = BertModel.from_pretrained("bert-base-uncased")
+        transformer = BertModel.from_pretrained("google-bert/bert-base-uncased")
         encoder = ImageEncoder(args)
         mmbt = MMBTModel(config, transformer, encoder)
         ```"""
@@ -333,7 +333,7 @@ class MMBTForClassification(nn.Module):
 
     ```python
     # For example purposes. Not runnable.
-    transformer = BertModel.from_pretrained("bert-base-uncased")
+    transformer = BertModel.from_pretrained("google-bert/bert-base-uncased")
     encoder = ImageEncoder(args)
     model = MMBTForClassification(config, transformer, encoder)
     outputs = model(input_modal, input_ids, labels=labels)
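
The MMBT docstring updates point the (explicitly non-runnable) examples at `AutoTokenizer` and the canonical `google-bert/bert-base-uncased` checkpoint name. Below is a small sketch of the text-side pieces those examples rely on; the image encoder is application-specific and is left out here, just as in the original docstrings.

```python
# Illustrative sketch: the text components referenced by the updated MMBT examples.
from transformers import AutoTokenizer, BertModel

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
transformer = BertModel.from_pretrained("google-bert/bert-base-uncased")
inputs = tokenizer("a caption describing the paired image", return_tensors="pt")
```
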
diff --git a/src/transformers/models/deprecated/open_llama/__init__.py b/src/transformers/models/deprecated/open_llama/__init__.py
new file mode 100644
index 00000000000000..446c9f076d3134
--- /dev/null
+++ b/src/transformers/models/deprecated/open_llama/__init__.py
@@ -0,0 +1,95 @@
+# Copyright 2023 EleutherAI and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import TYPE_CHECKING
+
+from ....utils import (
+    OptionalDependencyNotAvailable,
+    _LazyModule,
+    is_sentencepiece_available,
+    is_tokenizers_available,
+    is_torch_available,
+)
+
+
+_import_structure = {
+    "configuration_open_llama": ["OPEN_LLAMA_PRETRAINED_CONFIG_ARCHIVE_MAP", "OpenLlamaConfig"],
+}
+
+try:
+    if not is_sentencepiece_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["tokenization_open_llama"] = ["LlamaTokenizer"]
+
+try:
+    if not is_tokenizers_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["tokenization_open_llama_fast"] = ["LlamaTokenizerFast"]
+
+try:
+    if not is_torch_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["modeling_open_llama"] = [
+        "OpenLlamaForCausalLM",
+        "OpenLlamaModel",
+        "OpenLlamaPreTrainedModel",
+        "OpenLlamaForSequenceClassification",
+    ]
+
+
+if TYPE_CHECKING:
+    from .configuration_open_llama import OPEN_LLAMA_PRETRAINED_CONFIG_ARCHIVE_MAP, OpenLlamaConfig
+
+    try:
+        if not is_sentencepiece_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from transformers import LlamaTokenizer
+
+    try:
+        if not is_tokenizers_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from transformers import LlamaTokenizerFast
+
+    try:
+        if not is_torch_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .modeling_open_llama import (
+            OpenLlamaForCausalLM,
+            OpenLlamaForSequenceClassification,
+            OpenLlamaModel,
+            OpenLlamaPreTrainedModel,
+        )
+
+
+else:
+    import sys
+
+    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
diff --git a/src/transformers/models/deprecated/open_llama/configuration_open_llama.py b/src/transformers/models/deprecated/open_llama/configuration_open_llama.py
new file mode 100644
index 00000000000000..5786abac850dd3
--- /dev/null
+++ b/src/transformers/models/deprecated/open_llama/configuration_open_llama.py
@@ -0,0 +1,168 @@
+# coding=utf-8
+# Copyright 2023 EleutherAI and the HuggingFace Inc. team. All rights reserved.
+#
+# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+# and OPT implementations in this library. It has been modified from its
+# original forms to accommodate minor architectural differences compared
+# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Open-Llama model configuration"""
+
+from ....configuration_utils import PretrainedConfig
+from ....utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+OPEN_LLAMA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    "s-JoL/Open-Llama-V1": "https://huggingface.co/s-JoL/Open-Llama-V1/blob/main/config.json",
+}
+
+
+class OpenLlamaConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`OpenLlamaModel`]. It is used to instantiate an
+    Open-Llama model according to the specified arguments, defining the model architecture. Instantiating a
+    configuration with the defaults will yield a similar configuration to that of the
+    [s-JoL/Open-Llama-V1](https://huggingface.co/s-JoL/Open-Llama-V1).
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+
+    Args:
+        vocab_size (`int`, *optional*, defaults to 32000):
+            Vocabulary size of the Open-Llama model. Defines the number of different tokens that can be represented by
+            the `inputs_ids` passed when calling [`OpenLlamaModel`]
+        hidden_size (`int`, *optional*, defaults to 4096):
+            Dimension of the hidden representations.
+        intermediate_size (`int`, *optional*, defaults to 11008):
+            Dimension of the MLP representations.
+        num_hidden_layers (`int`, *optional*, defaults to 32):
+            Number of hidden layers in the Transformer encoder.
+        num_attention_heads (`int`, *optional*, defaults to 32):
+            Number of attention heads for each attention layer in the Transformer encoder.
+        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
+            The non-linear activation function (function or string) in the decoder.
+        max_position_embeddings (`int`, *optional*, defaults to 2048):
+            The maximum sequence length that this model might ever be used with. Typically set this to something large
+            just in case (e.g., 512 or 1024 or 2048).
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        rms_norm_eps (`float`, *optional*, defaults to 1e-12):
+            The epsilon used by the rms normalization layers.
+        use_cache (`bool`, *optional*, defaults to `True`):
+            Whether or not the model should return the last key/values attentions (not used by all models). Only
+            relevant if `config.is_decoder=True`.
+        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
+            Whether to tie the input and output word embedding weights.
+        rope_scaling (`Dict`, *optional*):
+            Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
+            strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
+            `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
+            `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
+            these scaling strategies behave:
+            https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
+            experimental feature, subject to breaking API changes in future versions.
+
+        Example:
+
+    ```python
+    >>> from transformers import OpenLlamaModel, OpenLlamaConfig
+
+    >>> # Initializing an Open-Llama open_llama-7b style configuration
+    >>> configuration = OpenLlamaConfig()
+
+    >>> # Initializing a model from the open_llama-7b style configuration
+    >>> model = OpenLlamaModel(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "open-llama"
+
+    def __init__(
+        self,
+        vocab_size=100000,
+        hidden_size=4096,
+        intermediate_size=11008,
+        num_hidden_layers=32,
+        num_attention_heads=32,
+        hidden_act="silu",
+        max_position_embeddings=2048,
+        initializer_range=0.02,
+        rms_norm_eps=1e-6,
+        use_cache=True,
+        pad_token_id=0,
+        bos_token_id=1,
+        eos_token_id=2,
+        tie_word_embeddings=False,
+        use_memory_efficient_attention=True,
+        hidden_dropout_prob=0.1,
+        attention_dropout_prob=0.1,
+        use_stable_embedding=True,
+        shared_input_output_embedding=True,
+        rope_scaling=None,
+        **kwargs,
+    ):
+        self.vocab_size = vocab_size
+        self.max_position_embeddings = max_position_embeddings
+        self.hidden_size = hidden_size
+        self.intermediate_size = intermediate_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.hidden_act = hidden_act
+        self.initializer_range = initializer_range
+        self.rms_norm_eps = rms_norm_eps
+        self.use_cache = use_cache
+        self.use_memory_efficient_attention = kwargs.pop(
+            "use_memorry_efficient_attention", use_memory_efficient_attention
+        )
+        self.hidden_dropout_prob = hidden_dropout_prob
+        self.attention_dropout_prob = attention_dropout_prob
+        self.use_stable_embedding = use_stable_embedding
+        self.shared_input_output_embedding = shared_input_output_embedding
+        self.rope_scaling = rope_scaling
+        self._rope_scaling_validation()
+
+        super().__init__(
+            pad_token_id=pad_token_id,
+            bos_token_id=bos_token_id,
+            eos_token_id=eos_token_id,
+            tie_word_embeddings=tie_word_embeddings,
+            **kwargs,
+        )
+
+    # Copied from transformers.models.llama.configuration_llama.LlamaConfig._rope_scaling_validation
+    def _rope_scaling_validation(self):
+        """
+        Validate the `rope_scaling` configuration.
+        """
+        if self.rope_scaling is None:
+            return
+
+        if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
+            raise ValueError(
+                "`rope_scaling` must be a dictionary with two fields, `type` and `factor`, "
+                f"got {self.rope_scaling}"
+            )
+        rope_scaling_type = self.rope_scaling.get("type", None)
+        rope_scaling_factor = self.rope_scaling.get("factor", None)
+        if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]:
+            raise ValueError(
+                f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
+            )
+        if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
+            raise ValueError(f"`rope_scaling`'s factor field must be a float > 1, got {rope_scaling_factor}")
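
The validation above accepts only a two-field dict whose `type` is `linear` or `dynamic` and whose `factor` is a float greater than 1. A short sketch of how that behaves when instantiating the config, assuming the deprecated `OpenLlamaConfig` is still exported from the top level; the factor values are arbitrary illustrations.

```python
# Illustrative sketch of OpenLlamaConfig's rope_scaling validation.
from transformers import OpenLlamaConfig

config = OpenLlamaConfig(rope_scaling={"type": "dynamic", "factor": 2.0})  # passes validation

try:
    OpenLlamaConfig(rope_scaling={"type": "dynamic", "factor": 0.5})  # factor must be > 1
except ValueError as err:
    print(err)
```
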
diff --git a/src/transformers/models/deprecated/open_llama/modeling_open_llama.py b/src/transformers/models/deprecated/open_llama/modeling_open_llama.py
new file mode 100644
index 00000000000000..d2ea931a44f1f1
--- /dev/null
+++ b/src/transformers/models/deprecated/open_llama/modeling_open_llama.py
@@ -0,0 +1,968 @@
+# coding=utf-8
+# Copyright 2023 EleutherAI and the HuggingFace Inc. team. All rights reserved.
+#
+# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+# and OPT implementations in this library. It has been modified from its
+# original forms to accommodate minor architectural differences compared
+# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" PyTorch Open-Llama model."""
+import math
+from typing import List, Optional, Tuple, Union
+
+import torch
+import torch.utils.checkpoint
+from torch import nn
+from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
+
+from ....activations import ACT2FN
+from ....modeling_attn_mask_utils import _prepare_4d_causal_attention_mask
+from ....modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast
+from ....modeling_utils import PreTrainedModel
+from ....utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings
+from .configuration_open_llama import OpenLlamaConfig
+
+
+logger = logging.get_logger(__name__)
+
+try:
+    from xformers import ops as xops
+except ImportError:
+    xops = None
+
+
+_CONFIG_FOR_DOC = "OpenLlamaConfig"
+
+
+# Copied from transformers.models.llama.modeling_llama.LlamaRMSNorm with Llama->OpenLlama
+class OpenLlamaRMSNorm(nn.Module):
+    def __init__(self, hidden_size, eps=1e-6):
+        """
+        OpenLlamaRMSNorm is equivalent to T5LayerNorm
+        """
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(hidden_size))
+        self.variance_epsilon = eps
+
+    def forward(self, hidden_states):
+        input_dtype = hidden_states.dtype
+        hidden_states = hidden_states.to(torch.float32)
+        variance = hidden_states.pow(2).mean(-1, keepdim=True)
+        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+        return self.weight * hidden_states.to(input_dtype)
+
+
+# Copied from transformers.models.mistral.modeling_mistral.MistralRotaryEmbedding with Mistral->OpenLlama
+class OpenLlamaRotaryEmbedding(nn.Module):
+    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
+        super().__init__()
+
+        self.dim = dim
+        self.max_position_embeddings = max_position_embeddings
+        self.base = base
+        inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+
+        # Build here to make `torch.jit.trace` work.
+        self._set_cos_sin_cache(
+            seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
+        )
+
+    def _set_cos_sin_cache(self, seq_len, device, dtype):
+        self.max_seq_len_cached = seq_len
+        t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
+
+        freqs = torch.outer(t, self.inv_freq)
+        # Different from paper, but it uses a different permutation in order to obtain the same calculation
+        emb = torch.cat((freqs, freqs), dim=-1)
+        self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
+        self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
+
+    def forward(self, x, seq_len=None):
+        # x: [bs, num_attention_heads, seq_len, head_size]
+        if seq_len > self.max_seq_len_cached:
+            self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
+
+        return (
+            self.cos_cached[:seq_len].to(dtype=x.dtype),
+            self.sin_cached[:seq_len].to(dtype=x.dtype),
+        )
+
+
+# Copied from transformers.models.llama.modeling_llama.LlamaLinearScalingRotaryEmbedding with Llama->OpenLlama
+class OpenLlamaLinearScalingRotaryEmbedding(OpenLlamaRotaryEmbedding):
+    """OpenLlamaRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
+
+    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
+        self.scaling_factor = scaling_factor
+        super().__init__(dim, max_position_embeddings, base, device)
+
+    def _set_cos_sin_cache(self, seq_len, device, dtype):
+        self.max_seq_len_cached = seq_len
+        t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
+        t = t / self.scaling_factor
+
+        freqs = torch.outer(t, self.inv_freq)
+        # Different from paper, but it uses a different permutation in order to obtain the same calculation
+        emb = torch.cat((freqs, freqs), dim=-1)
+        self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
+        self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
+
+
+# Copied from transformers.models.llama.modeling_llama.LlamaDynamicNTKScalingRotaryEmbedding with Llama->OpenLlama
+class OpenLlamaDynamicNTKScalingRotaryEmbedding(OpenLlamaRotaryEmbedding):
+    """OpenLlamaRotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla"""
+
+    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
+        self.scaling_factor = scaling_factor
+        super().__init__(dim, max_position_embeddings, base, device)
+
+    def _set_cos_sin_cache(self, seq_len, device, dtype):
+        self.max_seq_len_cached = seq_len
+
+        if seq_len > self.max_position_embeddings:
+            base = self.base * (
+                (self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)
+            ) ** (self.dim / (self.dim - 2))
+            inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))
+            self.register_buffer("inv_freq", inv_freq, persistent=False)
+
+        t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
+
+        freqs = torch.outer(t, self.inv_freq)
+        # Different from paper, but it uses a different permutation in order to obtain the same calculation
+        emb = torch.cat((freqs, freqs), dim=-1)
+        self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
+        self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
+
+
+def rotate_half(x):
+    """Rotates half the hidden dims of the input."""
+    x1 = x[..., : x.shape[-1] // 2]
+    x2 = x[..., x.shape[-1] // 2 :]
+    return torch.cat((-x2, x1), dim=-1)
+
+
+# Copied from transformers.models.mistral.modeling_mistral.apply_rotary_pos_emb
+def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
+    """Applies Rotary Position Embedding to the query and key tensors.
+
+    Args:
+        q (`torch.Tensor`): The query tensor.
+        k (`torch.Tensor`): The key tensor.
+        cos (`torch.Tensor`): The cosine part of the rotary embedding.
+        sin (`torch.Tensor`): The sine part of the rotary embedding.
+        position_ids (`torch.Tensor`):
+            The position indices of the tokens corresponding to the query and key tensors. For example, this can be
+            used to pass offsetted position ids when working with a KV-cache.
+        unsqueeze_dim (`int`, *optional*, defaults to 1):
+            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
+            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
+            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
+            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
+            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
+            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
+    Returns:
+        `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
+    """
+    cos = cos[position_ids].unsqueeze(unsqueeze_dim)
+    sin = sin[position_ids].unsqueeze(unsqueeze_dim)
+    q_embed = (q * cos) + (rotate_half(q) * sin)
+    k_embed = (k * cos) + (rotate_half(k) * sin)
+    return q_embed, k_embed
+
+
+class OpenLlamaMLP(nn.Module):
+    def __init__(
+        self,
+        hidden_size: int,
+        intermediate_size: int,
+        hidden_act: str,
+        dropout_prob: float,
+    ):
+        super().__init__()
+        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
+        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
+        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
+        self.act_fn = ACT2FN[hidden_act]
+        self.dropout = nn.Dropout(dropout_prob)
+
+    def forward(self, x):
+        out = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
+        return self.dropout(out)
+
+
+class OpenLlamaAttention(nn.Module):
+    """Multi-headed attention from 'Attention Is All You Need' paper"""
+
+    def __init__(self, config: OpenLlamaConfig):
+        super().__init__()
+        self.config = config
+        self.hidden_size = config.hidden_size
+        self.num_heads = config.num_attention_heads
+        self.head_dim = self.hidden_size // self.num_heads
+        self.max_position_embeddings = config.max_position_embeddings
+        self.dropout_prob = config.attention_dropout_prob
+
+        if (self.head_dim * self.num_heads) != self.hidden_size:
+            raise ValueError(
+                f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
+                f" and `num_heads`: {self.num_heads})."
+            )
+        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
+        self.k_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
+        self.v_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
+        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
+        self._init_rope()
+
+    # Copied from transformers.models.llama.modeling_llama.LlamaAttention._init_rope with Llama->OpenLlama
+    def _init_rope(self):
+        if self.config.rope_scaling is None:
+            self.rotary_emb = OpenLlamaRotaryEmbedding(
+                self.head_dim,
+                max_position_embeddings=self.max_position_embeddings,
+                base=self.rope_theta,
+            )
+        else:
+            scaling_type = self.config.rope_scaling["type"]
+            scaling_factor = self.config.rope_scaling["factor"]
+            if scaling_type == "linear":
+                self.rotary_emb = OpenLlamaLinearScalingRotaryEmbedding(
+                    self.head_dim,
+                    max_position_embeddings=self.max_position_embeddings,
+                    scaling_factor=scaling_factor,
+                    base=self.rope_theta,
+                )
+            elif scaling_type == "dynamic":
+                self.rotary_emb = OpenLlamaDynamicNTKScalingRotaryEmbedding(
+                    self.head_dim,
+                    max_position_embeddings=self.max_position_embeddings,
+                    scaling_factor=scaling_factor,
+                    base=self.rope_theta,
+                )
+            else:
+                raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
+
+    def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
+        return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Tuple[torch.Tensor]] = None,
+        output_attentions: bool = False,
+        use_cache: bool = False,
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+        bsz, q_len, _ = hidden_states.size()
+
+        query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+        key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+        value_states = self.v_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+
+        kv_seq_len = key_states.shape[-2]
+        if past_key_value is not None:
+            kv_seq_len += past_key_value[0].shape[-2]
+        cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
+        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
+        # [bsz, nh, t, hd]
+
+        if past_key_value is not None:
+            # reuse k, v, self_attention
+            key_states = torch.cat([past_key_value[0], key_states], dim=2)
+            value_states = torch.cat([past_key_value[1], value_states], dim=2)
+
+        past_key_value = (key_states, value_states) if use_cache else None
+
+        if self.config.use_memory_efficient_attention and xops is not None and self.training:
+            attn_weights = None
+            query_states = query_states.transpose(1, 2)
+            key_states = key_states.transpose(1, 2)
+            value_states = value_states.transpose(1, 2)
+            attn_output = xops.memory_efficient_attention(
+                query_states, key_states, value_states, attn_bias=xops.LowerTriangularMask(), p=self.dropout_prob
+            )
+        else:
+            attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
+
+            if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
+                raise ValueError(
+                    f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
+                    f" {attn_weights.size()}"
+                )
+
+            if attention_mask is not None:
+                if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
+                    raise ValueError(
+                        f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
+                    )
+                attn_weights = attn_weights + attention_mask
+                attn_weights = torch.max(
+                    attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min, device=attn_weights.device)
+                )
+
+            # upcast attention to fp32
+            attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
+            attn_output = torch.matmul(attn_weights, value_states)
+
+            if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
+                raise ValueError(
+                    f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
+                    f" {attn_output.size()}"
+                )
+
+            attn_output = attn_output.transpose(1, 2)
+        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
+
+        attn_output = self.o_proj(attn_output)
+
+        if not output_attentions:
+            attn_weights = None
+
+        return attn_output, attn_weights, past_key_value
+
+
+class OpenLlamaDecoderLayer(nn.Module):
+    def __init__(self, config: OpenLlamaConfig):
+        super().__init__()
+        self.hidden_size = config.hidden_size
+        self.self_attn = OpenLlamaAttention(config=config)
+        self.mlp = OpenLlamaMLP(
+            hidden_size=self.hidden_size,
+            intermediate_size=config.intermediate_size,
+            hidden_act=config.hidden_act,
+            dropout_prob=config.hidden_dropout_prob,
+        )
+        self.input_layernorm = OpenLlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+        self.post_attention_layernorm = OpenLlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Tuple[torch.Tensor]] = None,
+        output_attentions: Optional[bool] = False,
+        use_cache: Optional[bool] = False,
+    ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
+        """
+        Args:
+            hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
+            attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
+                `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
+            output_attentions (`bool`, *optional*):
+                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
+                returned tensors for more detail.
+            use_cache (`bool`, *optional*):
+                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
+                (see `past_key_values`).
+            past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
+        """
+
+        residual = hidden_states
+
+        hidden_states = self.input_layernorm(hidden_states)
+
+        # Self Attention
+        hidden_states, self_attn_weights, present_key_value = self.self_attn(
+            hidden_states=hidden_states,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_value=past_key_value,
+            output_attentions=output_attentions,
+            use_cache=use_cache,
+        )
+        hidden_states = residual + hidden_states
+
+        # Fully Connected
+        residual = hidden_states
+        hidden_states = self.post_attention_layernorm(hidden_states)
+        hidden_states = self.mlp(hidden_states)
+        hidden_states = residual + hidden_states
+
+        outputs = (hidden_states,)
+
+        if output_attentions:
+            outputs += (self_attn_weights,)
+
+        if use_cache:
+            outputs += (present_key_value,)
+
+        return outputs
+
+
+OPEN_LLAMA_START_DOCSTRING = r"""
+    This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
+    library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
+    etc.)
+
+    This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
+    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
+    and behavior.
+
+    Parameters:
+        config ([`OpenLlamaConfig`]):
+            Model configuration class with all the parameters of the model. Initializing with a config file does not
+            load the weights associated with the model, only the configuration. Check out the
+            [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+
+@add_start_docstrings(
+    "The bare Open-Llama Model outputting raw hidden-states without any specific head on top.",
+    OPEN_LLAMA_START_DOCSTRING,
+)
+class OpenLlamaPreTrainedModel(PreTrainedModel):
+    config_class = OpenLlamaConfig
+    base_model_prefix = "model"
+    supports_gradient_checkpointing = True
+    _no_split_modules = ["OpenLlamaDecoderLayer"]
+
+    def _init_weights(self, module):
+        std = self.config.initializer_range
+        if isinstance(module, nn.Linear):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.Embedding):
+            if self.config.use_stable_embedding:
+                torch.nn.init.xavier_normal_(module.weight.data)
+            else:
+                module.weight.data.normal_(mean=0.0, std=std)
+            if module.padding_idx is not None:
+                module.weight.data[module.padding_idx].zero_()
+
+
+OPEN_LLAMA_INPUTS_DOCSTRING = r"""
+    Args:
+        input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
+            Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
+            it.
+
+            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+            [`PreTrainedTokenizer.__call__`] for details.
+
+            [What are input IDs?](../glossary#input-ids)
+        attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **masked**.
+
+            [What are attention masks?](../glossary#attention-mask)
+
+            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+            [`PreTrainedTokenizer.__call__`] for details.
+
+            If `past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see
+            `past_key_values`).
+
+            If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
+            and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
+            information on the default strategy.
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+        position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
+            config.n_positions - 1]`.
+
+            [What are position IDs?](../glossary#position-ids)
+        past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
+            Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
+            `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape
+            `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.
+
+            Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
+            blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
+
+            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
+            don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
+            `decoder_input_ids` of shape `(batch_size, sequence_length)`.
+        inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
+            Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
+            is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
+            model's internal embedding lookup matrix.
+        use_cache (`bool`, *optional*):
+            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
+            `past_key_values`).
+        output_attentions (`bool`, *optional*):
+            Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
+            tensors for more detail.
+        output_hidden_states (`bool`, *optional*):
+            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+            more detail.
+        return_dict (`bool`, *optional*):
+            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+"""
+
+
+@add_start_docstrings(
+    "The bare Open-Llama Model outputting raw hidden-states without any specific head on top.",
+    OPEN_LLAMA_START_DOCSTRING,
+)
+class OpenLlamaModel(OpenLlamaPreTrainedModel):
+    """
+    Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`OpenLlamaDecoderLayer`]
+
+    Args:
+        config: OpenLlamaConfig
+    """
+
+    def __init__(self, config: OpenLlamaConfig):
+        super().__init__(config)
+        self.padding_idx = config.pad_token_id
+        self.vocab_size = config.vocab_size
+
+        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
+        if config.use_stable_embedding:
+            self.embed_layer_norm = nn.LayerNorm(config.hidden_size)
+        else:
+            self.embed_layer_norm = None
+        self.layers = nn.ModuleList([OpenLlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
+        self.norm = OpenLlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
+
+        self.gradient_checkpointing = False
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.embed_tokens
+
+    def set_input_embeddings(self, value):
+        self.embed_tokens = value
+
+    @add_start_docstrings_to_model_forward(OPEN_LLAMA_INPUTS_DOCSTRING)
+    def forward(
+        self,
+        input_ids: torch.LongTensor = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[List[torch.FloatTensor]] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, BaseModelOutputWithPast]:
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        use_cache = use_cache if use_cache is not None else self.config.use_cache
+
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        # retrieve input_ids and inputs_embeds
+        if input_ids is not None and inputs_embeds is not None:
+            raise ValueError("You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time")
+        elif input_ids is not None:
+            batch_size, seq_length = input_ids.shape
+        elif inputs_embeds is not None:
+            batch_size, seq_length, _ = inputs_embeds.shape
+        else:
+            raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")
+
+        seq_length_with_past = seq_length
+        past_key_values_length = 0
+
+        if self.gradient_checkpointing and self.training:
+            if use_cache:
+                logger.warning_once(
+                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+                )
+                use_cache = False
+
+        if past_key_values is not None:
+            past_key_values_length = past_key_values[0][0].shape[2]
+            seq_length_with_past = seq_length_with_past + past_key_values_length
+
+        if position_ids is None:
+            device = input_ids.device if input_ids is not None else inputs_embeds.device
+            position_ids = torch.arange(
+                past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
+            )
+            position_ids = position_ids.unsqueeze(0)
+
+        if inputs_embeds is None:
+            inputs_embeds = self.embed_tokens(input_ids)
+            if self.embed_layer_norm:
+                inputs_embeds = self.embed_layer_norm(inputs_embeds)
+        # embed positions
+        if self.config.use_memory_efficient_attention and self.training:
+            attention_mask = None
+        elif attention_mask is None:
+            attention_mask = torch.ones(
+                (batch_size, seq_length_with_past), dtype=torch.bool, device=inputs_embeds.device
+            )
+
+        input_shape = (batch_size, seq_length)
+        attention_mask = _prepare_4d_causal_attention_mask(
+            attention_mask, input_shape, inputs_embeds, past_key_values_length
+        )
+
+        hidden_states = inputs_embeds
+
+        # decoder layers
+        all_hidden_states = () if output_hidden_states else None
+        all_self_attns = () if output_attentions else None
+        next_decoder_cache = () if use_cache else None
+
+        for idx, decoder_layer in enumerate(self.layers):
+            if output_hidden_states:
+                all_hidden_states += (hidden_states,)
+
+            past_key_value = past_key_values[idx] if past_key_values is not None else None
+
+            if self.gradient_checkpointing and self.training:
+                layer_outputs = self._gradient_checkpointing_func(
+                    decoder_layer.__call__,
+                    hidden_states,
+                    attention_mask,
+                    position_ids,
+                    None,
+                    output_attentions,
+                    None,
+                )
+            else:
+                layer_outputs = decoder_layer(
+                    hidden_states,
+                    attention_mask=attention_mask,
+                    position_ids=position_ids,
+                    past_key_value=past_key_value,
+                    output_attentions=output_attentions,
+                    use_cache=use_cache,
+                )
+
+            hidden_states = layer_outputs[0]
+
+            if use_cache:
+                next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)
+
+            if output_attentions:
+                all_self_attns += (layer_outputs[1],)
+
+        hidden_states = self.norm(hidden_states)
+
+        # add hidden states from the last decoder layer
+        if output_hidden_states:
+            all_hidden_states += (hidden_states,)
+
+        next_cache = next_decoder_cache if use_cache else None
+        if not return_dict:
+            return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
+        return BaseModelOutputWithPast(
+            last_hidden_state=hidden_states,
+            past_key_values=next_cache,
+            hidden_states=all_hidden_states,
+            attentions=all_self_attns,
+        )
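The forward pass above delegates mask construction to `_prepare_4d_causal_attention_mask`. The following is a rough, self-contained approximation of the mask it produces, written for illustration only (it is not the library helper itself):

```python
import torch

batch_size, seq_len, past_len = 1, 3, 2
kv_len = past_len + seq_len

# 2D padding mask over all key positions (1 = attend, 0 = padding).
attention_mask = torch.ones(batch_size, kv_len, dtype=torch.bool)

# Query i (offset by past_len) may attend to key j only if j <= past_len + i.
q_pos = torch.arange(past_len, past_len + seq_len).view(seq_len, 1)
k_pos = torch.arange(kv_len).view(1, kv_len)
causal = k_pos <= q_pos                                      # (seq_len, kv_len) bool

allowed = causal.unsqueeze(0) & attention_mask[:, None, :]   # (batch, seq_len, kv_len)
mask_4d = torch.zeros(batch_size, 1, seq_len, kv_len)
mask_4d.masked_fill_(~allowed.unsqueeze(1), float("-inf"))   # additive mask, -inf = blocked
print(mask_4d.shape)  # torch.Size([1, 1, 3, 5])
```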
+
+
+class OpenLlamaForCausalLM(OpenLlamaPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.model = OpenLlamaModel(config)
+        if config.shared_input_output_embedding:
+            self.lm_head = None
+        else:
+            self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.model.embed_tokens
+
+    def set_input_embeddings(self, value):
+        self.model.embed_tokens = value
+
+    def get_output_embeddings(self):
+        return self.lm_head
+
+    def set_output_embeddings(self, new_embeddings):
+        self.lm_head = new_embeddings
+
+    def set_decoder(self, decoder):
+        self.model = decoder
+
+    def get_decoder(self):
+        return self.model
+
+    @add_start_docstrings_to_model_forward(OPEN_LLAMA_INPUTS_DOCSTRING)
+    @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
+    def forward(
+        self,
+        input_ids: torch.LongTensor = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[List[torch.FloatTensor]] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, CausalLMOutputWithPast]:
+        r"""
+        Args:
+            labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+                Labels for computing the language modeling loss. Indices should either be in `[0, ...,
+                config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
+                (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
+
+        Returns:
+
+        Example:
+
+        ```python
+        >>> from transformers import AutoTokenizer, OpenLlamaForCausalLM
+
+        >>> model = OpenLlamaForCausalLM.from_pretrained("openlm-research/open_llama_7b")
+        >>> tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b")
+
+        >>> prompt = "Hey, are you conscious? Can you talk to me?"
+        >>> inputs = tokenizer(prompt, return_tensors="pt")
+
+        >>> # Generate
+        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
+        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
+        ```"""
+
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
+        outputs = self.model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        hidden_states = outputs[0]
+        if self.config.shared_input_output_embedding:
+            logits = torch.einsum(
+                "blh,vh->blv", hidden_states.to(self.model.embed_tokens.weight.device), self.model.embed_tokens.weight
+            )
+        else:
+            logits = self.lm_head(hidden_states)
+
+        loss = None
+        if labels is not None:
+            # move labels to correct device to enable model parallelism
+            labels = labels.to(logits.device)
+            # Shift so that tokens < n predict n
+            shift_logits = logits[..., :-1, :].contiguous()
+            shift_labels = labels[..., 1:].contiguous()
+            # Flatten the tokens
+            loss_fct = CrossEntropyLoss()
+            shift_logits = shift_logits.view(-1, self.config.vocab_size)
+            shift_labels = shift_labels.view(-1)
+            # Enable model parallelism
+            shift_labels = shift_labels.to(shift_logits.device)
+            loss = loss_fct(shift_logits, shift_labels)
+
+        if not return_dict:
+            output = (logits,) + outputs[1:]
+            return (loss,) + output if loss is not None else output
+
+        return CausalLMOutputWithPast(
+            loss=loss,
+            logits=logits,
+            past_key_values=outputs.past_key_values,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
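The loss block above scores each position's logits against the next token and relies on `CrossEntropyLoss`'s default `ignore_index=-100`. A tiny standalone illustration with made-up values:

```python
import torch
from torch.nn import CrossEntropyLoss

vocab_size = 10
logits = torch.randn(1, 4, vocab_size)      # (batch, seq_len, vocab)
labels = torch.tensor([[3, 5, 2, -100]])    # -100 marks positions to ignore (e.g. padding)

shift_logits = logits[..., :-1, :].contiguous().view(-1, vocab_size)  # predictions at steps 0..2
shift_labels = labels[..., 1:].contiguous().view(-1)                  # targets are steps 1..3
loss = CrossEntropyLoss()(shift_logits, shift_labels)                 # ignore_index defaults to -100
print(loss.item())
```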
+
+    def prepare_inputs_for_generation(
+        self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
+    ):
+        if past_key_values is not None:
+            past_length = past_key_values[0][0].shape[2]
+
+            # Some generation methods already pass only the last input ID
+            if input_ids.shape[1] > past_length:
+                remove_prefix_length = past_length
+            else:
+                # Default to old behavior: keep only final ID
+                remove_prefix_length = input_ids.shape[1] - 1
+
+            input_ids = input_ids[:, remove_prefix_length:]
+
+        position_ids = kwargs.get("position_ids", None)
+        if attention_mask is not None and position_ids is None:
+            # create position_ids on the fly for batch generation
+            position_ids = attention_mask.long().cumsum(-1) - 1
+            position_ids.masked_fill_(attention_mask == 0, 1)
+            if past_key_values:
+                position_ids = position_ids[:, -input_ids.shape[1] :]
+
+        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
+        if inputs_embeds is not None and past_key_values is None:
+            model_inputs = {"inputs_embeds": inputs_embeds}
+        else:
+            model_inputs = {"input_ids": input_ids}
+
+        model_inputs.update(
+            {
+                "position_ids": position_ids,
+                "past_key_values": past_key_values,
+                "use_cache": kwargs.get("use_cache"),
+                "attention_mask": attention_mask,
+            }
+        )
+        return model_inputs
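The on-the-fly `position_ids` computation above keeps positions correct for left-padded batches during generation. A standalone check with dummy masks:

```python
import torch

attention_mask = torch.tensor([[0, 0, 1, 1, 1],
                               [1, 1, 1, 1, 1]])
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
print(position_ids)
# tensor([[1, 1, 0, 1, 2],
#         [0, 1, 2, 3, 4]])
```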
+
+    @staticmethod
+    def _reorder_cache(past_key_values, beam_idx):
+        reordered_past = ()
+        for layer_past in past_key_values:
+            reordered_past += (
+                tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
+            )
+        return reordered_past
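`_reorder_cache` simply re-gathers each cached tensor along the beam dimension. A minimal sketch for a single layer, with assumed shapes:

```python
import torch

num_beams, num_heads, seq_len, head_dim = 3, 2, 4, 8
key = torch.randn(num_beams, num_heads, seq_len, head_dim)
beam_idx = torch.tensor([2, 0, 0])               # beam 0 now continues from old beam 2, etc.

reordered_key = key.index_select(0, beam_idx)
assert torch.equal(reordered_key[0], key[2])
```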
+
+
+@add_start_docstrings(
+    """
+    The Open-Llama Model transformer with a sequence classification head on top (linear layer).
+
+    [`OpenLlamaForSequenceClassification`] uses the last token in order to do the classification, as other causal
+    models (e.g. GPT-2) do.
+
+    Since it does classification on the last token, it needs to know the position of the last token. If a
+    `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
+    no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
+    padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
+    each row of the batch).
+    """,
+    OPEN_LLAMA_START_DOCSTRING,
+)
+class OpenLlamaForSequenceClassification(OpenLlamaPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+        self.model = OpenLlamaModel(config)
+        self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.model.embed_tokens
+
+    def set_input_embeddings(self, value):
+        self.model.embed_tokens = value
+
+    @add_start_docstrings_to_model_forward(OPEN_LLAMA_INPUTS_DOCSTRING)
+    def forward(
+        self,
+        input_ids: torch.LongTensor = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[List[torch.FloatTensor]] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
+            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss). If
+            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        transformer_outputs = self.model(
+            input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+        hidden_states = transformer_outputs[0]
+        logits = self.score(hidden_states)
+
+        if input_ids is not None:
+            batch_size = input_ids.shape[0]
+        else:
+            batch_size = inputs_embeds.shape[0]
+
+        if self.config.pad_token_id is None and batch_size != 1:
+            raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
+        if self.config.pad_token_id is None:
+            sequence_lengths = -1
+        else:
+            if input_ids is not None:
+                # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
+                sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
+                sequence_lengths = sequence_lengths % input_ids.shape[-1]
+                sequence_lengths = sequence_lengths.to(logits.device)
+            else:
+                sequence_lengths = -1
+
+        pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
+
+        loss = None
+        if labels is not None:
+            labels = labels.to(logits.device)
+            if self.config.problem_type is None:
+                if self.num_labels == 1:
+                    self.config.problem_type = "regression"
+                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
+                    self.config.problem_type = "single_label_classification"
+                else:
+                    self.config.problem_type = "multi_label_classification"
+
+            if self.config.problem_type == "regression":
+                loss_fct = MSELoss()
+                if self.num_labels == 1:
+                    loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
+                else:
+                    loss = loss_fct(pooled_logits, labels)
+            elif self.config.problem_type == "single_label_classification":
+                loss_fct = CrossEntropyLoss()
+                loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
+            elif self.config.problem_type == "multi_label_classification":
+                loss_fct = BCEWithLogitsLoss()
+                loss = loss_fct(pooled_logits, labels)
+        if not return_dict:
+            output = (pooled_logits,) + transformer_outputs[1:]
+            return ((loss,) + output) if loss is not None else output
+
+        return SequenceClassifierOutputWithPast(
+            loss=loss,
+            logits=pooled_logits,
+            past_key_values=transformer_outputs.past_key_values,
+            hidden_states=transformer_outputs.hidden_states,
+            attentions=transformer_outputs.attentions,
+        )
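The pooling logic above uses an `argmax`/modulo trick instead of reverse indexing so the graph stays ONNX-exportable. A quick sanity check with dummy ids:

```python
import torch

pad_token_id = 0
input_ids = torch.tensor([[5, 6, 7, 0, 0],       # padded row
                          [1, 2, 3, 4, 9]])      # row with no padding at all

sequence_lengths = torch.eq(input_ids, pad_token_id).int().argmax(-1) - 1
sequence_lengths = sequence_lengths % input_ids.shape[-1]
print(sequence_lengths)  # tensor([2, 4]) -> index of the last real token in each row
```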
diff --git a/src/transformers/models/retribert/__init__.py b/src/transformers/models/deprecated/retribert/__init__.py
similarity index 95%
rename from src/transformers/models/retribert/__init__.py
rename to src/transformers/models/deprecated/retribert/__init__.py
index c4f4bf6cc0cb48..dba5e14594e16c 100644
--- a/src/transformers/models/retribert/__init__.py
+++ b/src/transformers/models/deprecated/retribert/__init__.py
@@ -14,7 +14,7 @@
 
 from typing import TYPE_CHECKING
 
-from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_tokenizers_available, is_torch_available
+from ....utils import OptionalDependencyNotAvailable, _LazyModule, is_tokenizers_available, is_torch_available
 
 
 _import_structure = {
diff --git a/src/transformers/models/retribert/configuration_retribert.py b/src/transformers/models/deprecated/retribert/configuration_retribert.py
similarity index 98%
rename from src/transformers/models/retribert/configuration_retribert.py
rename to src/transformers/models/deprecated/retribert/configuration_retribert.py
index 33663ad6167f9f..3861b9c90f33ef 100644
--- a/src/transformers/models/retribert/configuration_retribert.py
+++ b/src/transformers/models/deprecated/retribert/configuration_retribert.py
@@ -14,8 +14,8 @@
 # limitations under the License.
 """ RetriBERT model configuration"""
 
-from ...configuration_utils import PretrainedConfig
-from ...utils import logging
+from ....configuration_utils import PretrainedConfig
+from ....utils import logging
 
 
 logger = logging.get_logger(__name__)
@@ -72,6 +72,7 @@ class RetriBertConfig(PretrainedConfig):
         projection_dim (`int`, *optional*, defaults to 128):
             Final dimension of the query and document representation after projection
     """
+
     model_type = "retribert"
 
     def __init__(
diff --git a/src/transformers/models/retribert/modeling_retribert.py b/src/transformers/models/deprecated/retribert/modeling_retribert.py
similarity index 98%
rename from src/transformers/models/retribert/modeling_retribert.py
rename to src/transformers/models/deprecated/retribert/modeling_retribert.py
index 240d9476e70b01..00d47bce5121d4 100644
--- a/src/transformers/models/retribert/modeling_retribert.py
+++ b/src/transformers/models/deprecated/retribert/modeling_retribert.py
@@ -24,9 +24,9 @@
 import torch.utils.checkpoint as checkpoint
 from torch import nn
 
-from ...modeling_utils import PreTrainedModel
-from ...utils import add_start_docstrings, logging
-from ..bert.modeling_bert import BertModel
+from ....modeling_utils import PreTrainedModel
+from ....utils import add_start_docstrings, logging
+from ...bert.modeling_bert import BertModel
 from .configuration_retribert import RetriBertConfig
 
 
diff --git a/src/transformers/models/retribert/tokenization_retribert.py b/src/transformers/models/deprecated/retribert/tokenization_retribert.py
similarity index 94%
rename from src/transformers/models/retribert/tokenization_retribert.py
rename to src/transformers/models/deprecated/retribert/tokenization_retribert.py
index 0c04c363ebe0a0..d0904e3c931e40 100644
--- a/src/transformers/models/retribert/tokenization_retribert.py
+++ b/src/transformers/models/deprecated/retribert/tokenization_retribert.py
@@ -19,8 +19,8 @@
 import unicodedata
 from typing import List, Optional, Tuple
 
-from ...tokenization_utils import PreTrainedTokenizer, _is_control, _is_punctuation, _is_whitespace
-from ...utils import logging
+from ....tokenization_utils import PreTrainedTokenizer, _is_control, _is_punctuation, _is_whitespace
+from ....utils import logging
 
 
 logger = logging.get_logger(__name__)
@@ -132,20 +132,6 @@ def __init__(
         strip_accents=None,
         **kwargs,
     ):
-        super().__init__(
-            do_lower_case=do_lower_case,
-            do_basic_tokenize=do_basic_tokenize,
-            never_split=never_split,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            tokenize_chinese_chars=tokenize_chinese_chars,
-            strip_accents=strip_accents,
-            **kwargs,
-        )
-
         if not os.path.isfile(vocab_file):
             raise ValueError(
                 f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
@@ -161,7 +147,22 @@ def __init__(
                 tokenize_chinese_chars=tokenize_chinese_chars,
                 strip_accents=strip_accents,
             )
-        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
+
+        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=str(unk_token))
+
+        super().__init__(
+            do_lower_case=do_lower_case,
+            do_basic_tokenize=do_basic_tokenize,
+            never_split=never_split,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            mask_token=mask_token,
+            tokenize_chinese_chars=tokenize_chinese_chars,
+            strip_accents=strip_accents,
+            **kwargs,
+        )
 
     @property
     # Copied from transformers.models.bert.tokenization_bert.BertTokenizer.do_lower_case
@@ -178,10 +179,12 @@ def get_vocab(self):
         return dict(self.vocab, **self.added_tokens_encoder)
 
     # Copied from transformers.models.bert.tokenization_bert.BertTokenizer._tokenize
-    def _tokenize(self, text):
+    def _tokenize(self, text, split_special_tokens=False):
         split_tokens = []
         if self.do_basic_tokenize:
-            for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):
+            for token in self.basic_tokenizer.tokenize(
+                text, never_split=self.all_special_tokens if not split_special_tokens else None
+            ):
                 # If the token is part of the never_split set
                 if token in self.basic_tokenizer.never_split:
                     split_tokens.append(token)
@@ -333,20 +336,30 @@ class BasicTokenizer(object):
         strip_accents (`bool`, *optional*):
             Whether or not to strip all accents. If this option is not specified, then it will be determined by the
             value for `lowercase` (as in the original BERT).
+        do_split_on_punc (`bool`, *optional*, defaults to `True`):
+            In some instances we want to skip the basic punctuation splitting so that later tokenization can capture
+            the full context of the words, such as contractions.
     """
 
-    def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True, strip_accents=None):
+    def __init__(
+        self,
+        do_lower_case=True,
+        never_split=None,
+        tokenize_chinese_chars=True,
+        strip_accents=None,
+        do_split_on_punc=True,
+    ):
         if never_split is None:
             never_split = []
         self.do_lower_case = do_lower_case
         self.never_split = set(never_split)
         self.tokenize_chinese_chars = tokenize_chinese_chars
         self.strip_accents = strip_accents
+        self.do_split_on_punc = do_split_on_punc
 
     def tokenize(self, text, never_split=None):
         """
-        Basic Tokenization of a piece of text. Split on "white spaces" only, for sub-word tokenization, see
-        WordPieceTokenizer.
+        Basic Tokenization of a piece of text. For sub-word tokenization, see WordPieceTokenizer.
 
         Args:
             never_split (`List[str]`, *optional*)
@@ -365,7 +378,9 @@ def tokenize(self, text, never_split=None):
         # words in the English Wikipedia.).
         if self.tokenize_chinese_chars:
             text = self._tokenize_chinese_chars(text)
-        orig_tokens = whitespace_tokenize(text)
+        # prevents treating the same character with different unicode codepoints as different characters
+        unicode_normalized_text = unicodedata.normalize("NFC", text)
+        orig_tokens = whitespace_tokenize(unicode_normalized_text)
         split_tokens = []
         for token in orig_tokens:
             if token not in never_split:
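The new NFC normalization step prevents visually identical strings with different codepoint sequences from tokenizing differently. A small illustration using only `unicodedata`, no tokenizer involved:

```python
import unicodedata

decomposed = "caf\u0065\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = "caf\u00e9"           # single precomposed codepoint 'é'

print(decomposed == composed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```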
@@ -393,7 +408,7 @@ def _run_strip_accents(self, text):
 
     def _run_split_on_punc(self, text, never_split=None):
         """Splits punctuation on a piece of text."""
-        if never_split is not None and text in never_split:
+        if not self.do_split_on_punc or (never_split is not None and text in never_split):
             return [text]
         chars = list(text)
         i = 0
diff --git a/src/transformers/models/retribert/tokenization_retribert_fast.py b/src/transformers/models/deprecated/retribert/tokenization_retribert_fast.py
similarity index 98%
rename from src/transformers/models/retribert/tokenization_retribert_fast.py
rename to src/transformers/models/deprecated/retribert/tokenization_retribert_fast.py
index c242213e1faedb..07f7964b9f3f8e 100644
--- a/src/transformers/models/retribert/tokenization_retribert_fast.py
+++ b/src/transformers/models/deprecated/retribert/tokenization_retribert_fast.py
@@ -19,8 +19,8 @@
 
 from tokenizers import normalizers
 
-from ...tokenization_utils_fast import PreTrainedTokenizerFast
-from ...utils import logging
+from ....tokenization_utils_fast import PreTrainedTokenizerFast
+from ....utils import logging
 from .tokenization_retribert import RetriBertTokenizer
 
 
@@ -164,7 +164,7 @@ def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
         """
         output = [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
 
-        if token_ids_1:
+        if token_ids_1 is not None:
             output += token_ids_1 + [self.sep_token_id]
 
         return output
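The `token_ids_1 is not None` change matters because an empty second segment is falsy in Python, so the old truthiness check treated an empty list like `None`. A short illustration with plain lists:

```python
token_ids_1 = []

if token_ids_1:              # old check: empty list -> branch skipped
    print("old check ran")
if token_ids_1 is not None:  # new check: only a real None skips the branch
    print("new check ran")   # prints
```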
diff --git a/src/transformers/models/tapex/__init__.py b/src/transformers/models/deprecated/tapex/__init__.py
similarity index 95%
rename from src/transformers/models/tapex/__init__.py
rename to src/transformers/models/deprecated/tapex/__init__.py
index f6d293504e20dd..82bbacd15b0d00 100644
--- a/src/transformers/models/tapex/__init__.py
+++ b/src/transformers/models/deprecated/tapex/__init__.py
@@ -13,7 +13,7 @@
 # limitations under the License.
 from typing import TYPE_CHECKING
 
-from ...file_utils import _LazyModule
+from ....utils import _LazyModule
 
 
 _import_structure = {"tokenization_tapex": ["TapexTokenizer"]}
diff --git a/src/transformers/models/tapex/tokenization_tapex.py b/src/transformers/models/deprecated/tapex/tokenization_tapex.py
similarity index 99%
rename from src/transformers/models/tapex/tokenization_tapex.py
rename to src/transformers/models/deprecated/tapex/tokenization_tapex.py
index c41c6cbe47ae94..a5ee093c56bd26 100644
--- a/src/transformers/models/tapex/tokenization_tapex.py
+++ b/src/transformers/models/deprecated/tapex/tokenization_tapex.py
@@ -22,10 +22,10 @@
 
 import regex as re
 
-from ...file_utils import ExplicitEnum, PaddingStrategy, TensorType, add_end_docstrings, is_pandas_available
-from ...tokenization_utils import AddedToken, PreTrainedTokenizer
-from ...tokenization_utils_base import ENCODE_KWARGS_DOCSTRING, BatchEncoding, TextInput, TruncationStrategy
-from ...utils import logging
+from ....file_utils import ExplicitEnum, PaddingStrategy, TensorType, add_end_docstrings, is_pandas_available
+from ....tokenization_utils import AddedToken, PreTrainedTokenizer
+from ....tokenization_utils_base import ENCODE_KWARGS_DOCSTRING, BatchEncoding, TextInput, TruncationStrategy
+from ....utils import logging
 
 
 if is_pandas_available():
@@ -296,23 +296,6 @@ def __init__(
         # Mask token behave like a normal word, i.e. include the space before it
         mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
 
-        super().__init__(
-            vocab_file=vocab_file,
-            merges_file=merges_file,
-            do_lower_case=do_lower_case,
-            errors=errors,
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            add_prefix_space=add_prefix_space,
-            max_cell_length=max_cell_length,
-            **kwargs,
-        )
-
         with open(vocab_file, encoding="utf-8") as vocab_handle:
             self.encoder = json.load(vocab_handle)
         self.decoder = {v: k for k, v in self.encoder.items()}
@@ -331,6 +314,24 @@ def __init__(
         self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
 
         # additional properties
+
+        super().__init__(
+            vocab_file=vocab_file,
+            merges_file=merges_file,
+            do_lower_case=do_lower_case,
+            errors=errors,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            add_prefix_space=add_prefix_space,
+            max_cell_length=max_cell_length,
+            **kwargs,
+        )
+
         self.max_cell_length = max_cell_length
         self.table_linearize = IndexedRowTableLinearize()
 
@@ -1453,16 +1454,16 @@ def delete_unrelated_rows(self, table_content: Dict, question: str, answer: List
         truncated_unrelated_indices = []
         related_indices = []
         if answer is None or len(answer) == 0:
-            answer_set = set([])
+            answer_set = set()
         else:
-            answer_set = set([ans_ex.lower() for ans_ex in answer])
+            answer_set = {ans_ex.lower() for ans_ex in answer}
         # add question key words into answer set
         if question is not None:
             answer_set.update(question.split())
         question_set = set(question.strip("?!.,").split(" "))
         row_max_len = len(table_content["rows"])
         for _row_idx, row in enumerate(table_content["rows"]):
-            lower_row = set([str(cell).lower() for cell in row])
+            lower_row = {str(cell).lower() for cell in row}
             if len(lower_row & answer_set) == 0 and len(lower_row & question_set) == 0:
                 truncated_unrelated_indices.append(_row_idx)
             else:
diff --git a/src/transformers/models/trajectory_transformer/__init__.py b/src/transformers/models/deprecated/trajectory_transformer/__init__.py
similarity index 95%
rename from src/transformers/models/trajectory_transformer/__init__.py
rename to src/transformers/models/deprecated/trajectory_transformer/__init__.py
index d529275e5a304b..b7af1bb48cb7d6 100644
--- a/src/transformers/models/trajectory_transformer/__init__.py
+++ b/src/transformers/models/deprecated/trajectory_transformer/__init__.py
@@ -13,7 +13,7 @@
 # limitations under the License.
 from typing import TYPE_CHECKING
 
-from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available
+from ....utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available
 
 
 _import_structure = {
diff --git a/src/transformers/models/trajectory_transformer/configuration_trajectory_transformer.py b/src/transformers/models/deprecated/trajectory_transformer/configuration_trajectory_transformer.py
similarity index 98%
rename from src/transformers/models/trajectory_transformer/configuration_trajectory_transformer.py
rename to src/transformers/models/deprecated/trajectory_transformer/configuration_trajectory_transformer.py
index 875980fde19e7b..cfad075c6ae848 100644
--- a/src/transformers/models/trajectory_transformer/configuration_trajectory_transformer.py
+++ b/src/transformers/models/deprecated/trajectory_transformer/configuration_trajectory_transformer.py
@@ -14,8 +14,8 @@
 # limitations under the License.
 """ TrajectoryTransformer model configuration"""
 
-from ...configuration_utils import PretrainedConfig
-from ...utils import logging
+from ....configuration_utils import PretrainedConfig
+from ....utils import logging
 
 
 logger = logging.get_logger(__name__)
@@ -100,6 +100,7 @@ class TrajectoryTransformerConfig(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "trajectory_transformer"
     keys_to_ignore_at_inference = ["past_key_values"]
     attribute_map = {
diff --git a/src/transformers/models/trajectory_transformer/convert_trajectory_transformer_original_pytorch_checkpoint_to_pytorch.py b/src/transformers/models/deprecated/trajectory_transformer/convert_trajectory_transformer_original_pytorch_checkpoint_to_pytorch.py
similarity index 100%
rename from src/transformers/models/trajectory_transformer/convert_trajectory_transformer_original_pytorch_checkpoint_to_pytorch.py
rename to src/transformers/models/deprecated/trajectory_transformer/convert_trajectory_transformer_original_pytorch_checkpoint_to_pytorch.py
diff --git a/src/transformers/models/trajectory_transformer/modeling_trajectory_transformer.py b/src/transformers/models/deprecated/trajectory_transformer/modeling_trajectory_transformer.py
similarity index 96%
rename from src/transformers/models/trajectory_transformer/modeling_trajectory_transformer.py
rename to src/transformers/models/deprecated/trajectory_transformer/modeling_trajectory_transformer.py
index fee99ce4e56350..40c08e4d1d441a 100644
--- a/src/transformers/models/trajectory_transformer/modeling_trajectory_transformer.py
+++ b/src/transformers/models/deprecated/trajectory_transformer/modeling_trajectory_transformer.py
@@ -25,8 +25,8 @@
 from torch import nn
 from torch.nn import functional as F
 
-from ...modeling_utils import PreTrainedModel
-from ...utils import (
+from ....modeling_utils import PreTrainedModel
+from ....utils import (
     ModelOutput,
     add_start_docstrings,
     add_start_docstrings_to_model_forward,
@@ -163,10 +163,6 @@ class TrajectoryTransformerPreTrainedModel(PreTrainedModel):
     main_input_name = "trajectories"
     supports_gradient_checkpointing = True
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, TrajectoryTransformerModel):
-            module.gradient_checkpointing = value
-
     def _init_weights(self, module):
         if isinstance(module, (nn.Linear, nn.Embedding)):
             module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
@@ -284,6 +280,7 @@ def __init__(self, config):
             torch.tril(torch.ones(config.block_size, config.block_size)).view(
                 1, 1, config.block_size, config.block_size
             ),
+            persistent=False,
         )
 
         # mask previous value estimates
@@ -533,6 +530,13 @@ def forward(
 
         hidden_states = self.drop(token_embeddings + position_embeddings)
 
+        if self.gradient_checkpointing and self.training:
+            if use_cache:
+                logger.warning_once(
+                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+                )
+                use_cache = False
+
         presents = () if use_cache else None
         all_self_attentions = () if output_attentions else None
         all_hidden_states = () if output_hidden_states else None
@@ -542,20 +546,8 @@ def forward(
                 all_hidden_states = all_hidden_states + (hidden_states,)
 
             if self.gradient_checkpointing and self.training:
-                if use_cache:
-                    logger.warning(
-                        "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
-                    )
-                    use_cache = False
-
-                def create_custom_forward(module):
-                    def custom_forward(*inputs):
-                        return module(*inputs)
-
-                    return custom_forward
-
-                outputs = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(block),
+                outputs = self._gradient_checkpointing_func(
+                    block.__call__,
                     hidden_states,
                     layer_past,
                     use_cache,
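The hunk above drops the hand-rolled `create_custom_forward` wrapper in favour of the shared `_gradient_checkpointing_func` helper. For readers unfamiliar with the underlying mechanism, here is a minimal, generic sketch of activation checkpointing in plain PyTorch (not the Transformers helper itself):

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(16, 16)
x = torch.randn(4, 16, requires_grad=True)

# Activations are not stored; the layer is re-run during the backward pass.
out = checkpoint(layer, x, use_reentrant=False)
out.sum().backward()
print(x.grad.shape)  # torch.Size([4, 16])
```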
diff --git a/src/transformers/models/transfo_xl/__init__.py b/src/transformers/models/deprecated/transfo_xl/__init__.py
similarity index 96%
rename from src/transformers/models/transfo_xl/__init__.py
rename to src/transformers/models/deprecated/transfo_xl/__init__.py
index ce4215b0217bae..f3674e19665ca7 100644
--- a/src/transformers/models/transfo_xl/__init__.py
+++ b/src/transformers/models/deprecated/transfo_xl/__init__.py
@@ -14,7 +14,7 @@
 
 from typing import TYPE_CHECKING
 
-from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_tf_available, is_torch_available
+from ....utils import OptionalDependencyNotAvailable, _LazyModule, is_tf_available, is_torch_available
 
 
 _import_structure = {
diff --git a/src/transformers/models/transfo_xl/configuration_transfo_xl.py b/src/transformers/models/deprecated/transfo_xl/configuration_transfo_xl.py
similarity index 94%
rename from src/transformers/models/transfo_xl/configuration_transfo_xl.py
rename to src/transformers/models/deprecated/transfo_xl/configuration_transfo_xl.py
index 8550e71802867a..f7d5f2f87fb1ad 100644
--- a/src/transformers/models/transfo_xl/configuration_transfo_xl.py
+++ b/src/transformers/models/deprecated/transfo_xl/configuration_transfo_xl.py
@@ -15,14 +15,14 @@
 # limitations under the License.
 """ Transformer XL configuration"""
 
-from ...configuration_utils import PretrainedConfig
-from ...utils import logging
+from ....configuration_utils import PretrainedConfig
+from ....utils import logging
 
 
 logger = logging.get_logger(__name__)
 
 TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP = {
-    "transfo-xl-wt103": "https://huggingface.co/transfo-xl-wt103/resolve/main/config.json",
+    "transfo-xl/transfo-xl-wt103": "https://huggingface.co/transfo-xl/transfo-xl-wt103/resolve/main/config.json",
 }
 
 
@@ -31,7 +31,7 @@ class TransfoXLConfig(PretrainedConfig):
     This is the configuration class to store the configuration of a [`TransfoXLModel`] or a [`TFTransfoXLModel`]. It is
     used to instantiate a Transformer-XL model according to the specified arguments, defining the model architecture.
     Instantiating a configuration with the defaults will yield a similar configuration to that of the TransfoXL
-    [transfo-xl-wt103](https://huggingface.co/transfo-xl-wt103) architecture.
+    [transfo-xl/transfo-xl-wt103](https://huggingface.co/transfo-xl/transfo-xl-wt103) architecture.
 
     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
     documentation from [`PretrainedConfig`] for more information.
@@ -74,7 +74,7 @@ class TransfoXLConfig(PretrainedConfig):
             Whether or not to use adaptive softmax.
         dropout (`float`, *optional*, defaults to 0.1):
             The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
-        dropatt (`float`, *optional*, defaults to 0):
+        dropatt (`float`, *optional*, defaults to 0.0):
             The dropout ratio for the attention probabilities.
         untie_r (`boolean`, *optional*, defaults to `True`):
             Whether or not to untie relative position biases.
@@ -86,8 +86,10 @@ class TransfoXLConfig(PretrainedConfig):
             Parameters initialized by N(0, init_std)
         init_std (`float`, *optional*, defaults to 0.02):
             Parameters initialized by N(0, init_std)
-        layer_norm_epsilon (`float`, *optional*, defaults to 1e-5):
+        layer_norm_epsilon (`float`, *optional*, defaults to 1e-05):
             The epsilon to use in the layer normalization layers
+        eos_token_id (`int`, *optional*, defaults to 0):
+            End of stream token id.
 
     Examples:
 
diff --git a/src/transformers/models/transfo_xl/convert_transfo_xl_original_tf_checkpoint_to_pytorch.py b/src/transformers/models/deprecated/transfo_xl/convert_transfo_xl_original_tf_checkpoint_to_pytorch.py
old mode 100755
new mode 100644
similarity index 95%
rename from src/transformers/models/transfo_xl/convert_transfo_xl_original_tf_checkpoint_to_pytorch.py
rename to src/transformers/models/deprecated/transfo_xl/convert_transfo_xl_original_tf_checkpoint_to_pytorch.py
index 646c8a2342fc3a..d2693ac333b84b
--- a/src/transformers/models/transfo_xl/convert_transfo_xl_original_tf_checkpoint_to_pytorch.py
+++ b/src/transformers/models/deprecated/transfo_xl/convert_transfo_xl_original_tf_checkpoint_to_pytorch.py
@@ -23,8 +23,8 @@
 import torch
 
 from transformers import TransfoXLConfig, TransfoXLLMHeadModel, load_tf_weights_in_transfo_xl
-from transformers.models.transfo_xl import tokenization_transfo_xl as data_utils
-from transformers.models.transfo_xl.tokenization_transfo_xl import CORPUS_NAME, VOCAB_FILES_NAMES
+from transformers.models.deprecated.transfo_xl import tokenization_transfo_xl as data_utils
+from transformers.models.deprecated.transfo_xl.tokenization_transfo_xl import CORPUS_NAME, VOCAB_FILES_NAMES
 from transformers.utils import CONFIG_NAME, WEIGHTS_NAME, logging
 
 
diff --git a/src/transformers/models/transfo_xl/modeling_tf_transfo_xl.py b/src/transformers/models/deprecated/transfo_xl/modeling_tf_transfo_xl.py
similarity index 86%
rename from src/transformers/models/transfo_xl/modeling_tf_transfo_xl.py
rename to src/transformers/models/deprecated/transfo_xl/modeling_tf_transfo_xl.py
index 93af2165111288..ab2725df0c4dcf 100644
--- a/src/transformers/models/transfo_xl/modeling_tf_transfo_xl.py
+++ b/src/transformers/models/deprecated/transfo_xl/modeling_tf_transfo_xl.py
@@ -17,22 +17,25 @@
  TF 2.0 Transformer XL model.
 """
 
+from __future__ import annotations
+
 from dataclasses import dataclass
 from typing import List, Optional, Tuple, Union
 
 import numpy as np
 import tensorflow as tf
 
-from ...modeling_tf_utils import (
+from ....modeling_tf_utils import (
     TFModelInputType,
     TFPreTrainedModel,
     TFSequenceClassificationLoss,
     get_initializer,
+    keras,
     keras_serializable,
     unpack_inputs,
 )
-from ...tf_utils import shape_list, stable_softmax
-from ...utils import (
+from ....tf_utils import shape_list, stable_softmax
+from ....utils import (
     ModelOutput,
     add_code_sample_docstrings,
     add_start_docstrings,
@@ -45,16 +48,16 @@
 
 logger = logging.get_logger(__name__)
 
-_CHECKPOINT_FOR_DOC = "transfo-xl-wt103"
+_CHECKPOINT_FOR_DOC = "transfo-xl/transfo-xl-wt103"
 _CONFIG_FOR_DOC = "TransfoXLConfig"
 
 TF_TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST = [
-    "transfo-xl-wt103",
+    "transfo-xl/transfo-xl-wt103",
     # See all Transformer XL models at https://huggingface.co/models?filter=transfo-xl
 ]
 
 
-class TFPositionalEmbedding(tf.keras.layers.Layer):
+class TFPositionalEmbedding(keras.layers.Layer):
     def __init__(self, demb, **kwargs):
         super().__init__(**kwargs)
 
@@ -71,7 +74,7 @@ def call(self, pos_seq, bsz=None):
             return pos_emb[:, None, :]
 
 
-class TFPositionwiseFF(tf.keras.layers.Layer):
+class TFPositionwiseFF(keras.layers.Layer):
     def __init__(self, d_model, d_inner, dropout, pre_lnorm=False, layer_norm_epsilon=1e-5, init_std=0.02, **kwargs):
         super().__init__(**kwargs)
 
@@ -79,14 +82,14 @@ def __init__(self, d_model, d_inner, dropout, pre_lnorm=False, layer_norm_epsilo
         self.d_inner = d_inner
         self.dropout = dropout
 
-        self.layer_1 = tf.keras.layers.Dense(
+        self.layer_1 = keras.layers.Dense(
             d_inner, kernel_initializer=get_initializer(init_std), activation=tf.nn.relu, name="CoreNet_._0"
         )
-        self.drop_1 = tf.keras.layers.Dropout(dropout)
-        self.layer_2 = tf.keras.layers.Dense(d_model, kernel_initializer=get_initializer(init_std), name="CoreNet_._3")
-        self.drop_2 = tf.keras.layers.Dropout(dropout)
+        self.drop_1 = keras.layers.Dropout(dropout)
+        self.layer_2 = keras.layers.Dense(d_model, kernel_initializer=get_initializer(init_std), name="CoreNet_._3")
+        self.drop_2 = keras.layers.Dropout(dropout)
 
-        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name="layer_norm")
+        self.layer_norm = keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name="layer_norm")
 
         self.pre_lnorm = pre_lnorm
 
@@ -114,7 +117,7 @@ def call(self, inp, training=False):
         return output
 
 
-class TFRelPartialLearnableMultiHeadAttn(tf.keras.layers.Layer):
+class TFRelPartialLearnableMultiHeadAttn(keras.layers.Layer):
     def __init__(
         self,
         n_head,
@@ -138,17 +141,17 @@ def __init__(
         self.dropout = dropout
         self.output_attentions = output_attentions
 
-        self.qkv_net = tf.keras.layers.Dense(
+        self.qkv_net = keras.layers.Dense(
             3 * n_head * d_head, kernel_initializer=get_initializer(init_std), use_bias=False, name="qkv_net"
         )
 
-        self.drop = tf.keras.layers.Dropout(dropout)
-        self.dropatt = tf.keras.layers.Dropout(dropatt)
-        self.o_net = tf.keras.layers.Dense(
+        self.drop = keras.layers.Dropout(dropout)
+        self.dropatt = keras.layers.Dropout(dropatt)
+        self.o_net = keras.layers.Dense(
             d_model, kernel_initializer=get_initializer(init_std), use_bias=False, name="o_net"
         )
 
-        self.layer_norm = tf.keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name="layer_norm")
+        self.layer_norm = keras.layers.LayerNormalization(epsilon=layer_norm_epsilon, name="layer_norm")
 
         self.scale = 1 / (d_head**0.5)
 
@@ -161,7 +164,7 @@ def __init__(
             self.r_r_bias = None
             self.r_w_bias = None
 
-        self.r_net = tf.keras.layers.Dense(
+        self.r_net = keras.layers.Dense(
             self.n_head * self.d_head, kernel_initializer=get_initializer(init_std), use_bias=False, name="r_net"
         )
 
@@ -266,7 +269,7 @@ def call(self, w, r, attn_mask, mems, head_mask, output_attentions, training=Fal
         return outputs
 
 
-class TFRelPartialLearnableDecoderLayer(tf.keras.layers.Layer):
+class TFRelPartialLearnableDecoderLayer(keras.layers.Layer):
     def __init__(
         self,
         n_head,
@@ -318,7 +321,7 @@ def call(self, dec_inp, r, dec_attn_mask, mems, head_mask, output_attentions, tr
         return outputs
 
 
-class TFTransfoEmbeddings(tf.keras.layers.Layer):
+class TFTransfoEmbeddings(keras.layers.Layer):
     def __init__(self, vocab_size, emb_size, init_std, **kwargs):
         super().__init__(**kwargs)
 
@@ -339,7 +342,7 @@ def call(self, inputs):
         return tf.gather(self.weight, inputs)
 
 
-class TFAdaptiveEmbedding(tf.keras.layers.Layer):
+class TFAdaptiveEmbedding(keras.layers.Layer):
     def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1, init_std=0.02, sample_softmax=False, **kwargs):
         super().__init__(**kwargs)
 
@@ -416,7 +419,7 @@ def call(self, inp):
 
 
 @keras_serializable
-class TFTransfoXLMainLayer(tf.keras.layers.Layer):
+class TFTransfoXLMainLayer(keras.layers.Layer):
     config_class = TransfoXLConfig
 
     def __init__(self, config, **kwargs):
@@ -445,7 +448,7 @@ def __init__(self, config, **kwargs):
             name="word_emb",
         )
 
-        self.drop = tf.keras.layers.Dropout(config.dropout)
+        self.drop = keras.layers.Dropout(config.dropout)
 
         self.n_layer = config.n_layer
         self.mem_len = config.mem_len
@@ -541,14 +544,14 @@ def _update_mems(self, hids, mems, mlen, qlen):
     @unpack_inputs
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        mems: Optional[List[tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        mems: List[tf.Tensor] | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         training: bool = False,
     ):
         # the original code for Transformer-XL used shapes [len, bsz] but we want a unified interface in the library
@@ -586,35 +589,19 @@ def call(
         klen = mlen + qlen
 
         # Compute decoder attention mask
-
-        # ::: PyTorch masking code for reference :::
-        # if self.same_length:
-        #     all_ones = word_emb.new_ones((qlen, klen), dtype=torch.uint8)
-        #     mask_len = klen - self.mem_len
-        #     if mask_len > 0:
-        #         mask_shift_len = qlen - mask_len
-        #     else:
-        #         mask_shift_len = qlen
-        #     dec_attn_mask = (torch.triu(all_ones, 1+mlen)
-        #             + torch.tril(all_ones, -mask_shift_len))[:, :, None] # -1
-        # else:
-        #     dec_attn_mask = torch.triu(
-        #         word_emb.new_ones((qlen, klen), dtype=torch.uint8), diagonal=1+mlen)[:,:,None]
-
-        # TensorFlow version
-        dec_attn_mask = 1 - tf.linalg.band_part(
-            tf.ones([qlen, klen], dtype=tf.int32), -1, mlen
-        )  # (q, q): diagonal with 1's
+        all_ones = tf.ones([qlen, klen], dtype=tf.int32)
+        upper_mask = 1 - tf.linalg.band_part(tf.ones([qlen, klen], dtype=tf.int32), -1, mlen)
         if self.same_length:
             mask_len = klen - self.mem_len
-            if mask_len > 0:
-                mask_shift_len = qlen - mask_len
-            else:
-                mask_shift_len = qlen
-            if mask_shift_len >= 1:
-                dec_attn_mask += 1 - tf.linalg.band_part(tf.ones([qlen, klen], dtype=tf.int32), mask_shift_len - 1, -1)
-            else:
-                dec_attn_mask += tf.linalg.band_part(tf.ones([qlen, klen], dtype=tf.int32), -1, -mask_shift_len)
+            mask_shift_len = qlen - tf.nn.relu(mask_len)  # Lazy clamping of negatives to zero
+
+            # Use an indicator variable instead of a conditional to keep the compiler happy
+            lower_mask = tf.linalg.band_part(all_ones, -1, 0) - (
+                tf.linalg.band_part(all_ones, mask_shift_len - 1, 0) * tf.cast(mask_shift_len != 0, tf.int32)
+            )
+            dec_attn_mask = upper_mask + lower_mask
+        else:
+            dec_attn_mask = upper_mask
 
         hids = []
         attentions = [] if output_attentions else None
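For the `same_length=False` branch, the rewritten mask code computes a matrix that masks key position `j` for query `i` whenever `j > i + mlen`. An equivalent NumPy sketch, shown only to visualize the result (it is not the TF code itself):

```python
import numpy as np

qlen, mlen = 3, 2
klen = qlen + mlen

upper_mask = np.triu(np.ones((qlen, klen), dtype=np.int32), k=mlen + 1)
print(upper_mask)
# [[0 0 0 1 1]
#  [0 0 0 0 1]
#  [0 0 0 0 0]]   -> 1 marks masked (future) key positions
```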
@@ -682,18 +669,6 @@ class TFTransfoXLPreTrainedModel(TFPreTrainedModel):
     config_class = TransfoXLConfig
     base_model_prefix = "transformer"
 
-    @tf.function(
-        input_signature=[
-            {
-                "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"),
-            }
-        ]
-    )
-    def serving(self, inputs):
-        output = self.call(inputs)
-
-        return self.serving_output(output)
-
 
 @dataclass
 class TFTransfoXLModelOutput(ModelOutput):
@@ -722,8 +697,8 @@ class TFTransfoXLModelOutput(ModelOutput):
 
     last_hidden_state: tf.Tensor = None
     mems: List[tf.Tensor] = None
-    hidden_states: Optional[Tuple[tf.Tensor]] = None
-    attentions: Optional[Tuple[tf.Tensor]] = None
+    hidden_states: Tuple[tf.Tensor] | None = None
+    attentions: Tuple[tf.Tensor] | None = None
 
 
 @dataclass
@@ -755,8 +730,8 @@ class TFTransfoXLLMHeadModelOutput(ModelOutput):
 
     prediction_scores: tf.Tensor = None
     mems: List[tf.Tensor] = None
-    hidden_states: Optional[Tuple[tf.Tensor]] = None
-    attentions: Optional[Tuple[tf.Tensor]] = None
+    hidden_states: Tuple[tf.Tensor] | None = None
+    attentions: Tuple[tf.Tensor] | None = None
 
 
 @dataclass
@@ -786,11 +761,11 @@ class TFTransfoXLSequenceClassifierOutputWithPast(ModelOutput):
             heads.
     """
 
-    loss: Optional[tf.Tensor] = None
+    loss: tf.Tensor | None = None
     logits: tf.Tensor = None
     mems: List[tf.Tensor] = None
-    hidden_states: Optional[Tuple[tf.Tensor]] = None
-    attentions: Optional[Tuple[tf.Tensor]] = None
+    hidden_states: Tuple[tf.Tensor] | None = None
+    attentions: Tuple[tf.Tensor] | None = None
 
 
 TRANSFO_XL_START_DOCSTRING = r"""
@@ -799,7 +774,7 @@ class TFTransfoXLSequenceClassifierOutputWithPast(ModelOutput):
    library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
     etc.)
 
-    This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
+    This model is also a [keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
    as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and
     behavior.
 
@@ -840,7 +815,7 @@ class TFTransfoXLSequenceClassifierOutputWithPast(ModelOutput):
         input_ids (`tf.Tensor` or `Numpy array` of shape `(batch_size, sequence_length)`):
             Indices of input sequence tokens in the vocabulary.
 
-            Indices can be obtained using [`BertTokenizer`]. See [`PreTrainedTokenizer.__call__`] and
+            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.__call__`] and
             [`PreTrainedTokenizer.encode`] for details.
 
             [What are input IDs?](../glossary#input-ids)
@@ -892,15 +867,15 @@ def __init__(self, config, *inputs, **kwargs):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        mems: Optional[List[tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        output_attentions: Optional[bool] = None,
-        output_hidden_states: Optional[bool] = None,
-        return_dict: Optional[bool] = None,
+        input_ids: TFModelInputType | None = None,
+        mems: List[tf.Tensor] | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
+        output_attentions: bool | None = None,
+        output_hidden_states: bool | None = None,
+        return_dict: bool | None = None,
         training: bool = False,
-    ):
+    ) -> TFTransfoXLModelOutput | Tuple[tf.Tensor]:
         outputs = self.transformer(
             input_ids=input_ids,
             mems=mems,
@@ -914,17 +889,6 @@ def call(
 
         return outputs
 
-    def serving_output(self, output):
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFTransfoXLModelOutput(
-            last_hidden_state=output.last_hidden_state,
-            mems=tf.convert_to_tensor(output.mems),
-            hidden_states=hs,
-            attentions=attns,
-        )
-
 
 @add_start_docstrings(
     """
@@ -971,16 +935,16 @@ def init_mems(self, bsz):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        mems: Optional[List[tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        output_attentions: Optional[bool] = None,
-        output_hidden_states: Optional[bool] = None,
-        return_dict: Optional[bool] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        mems: List[tf.Tensor] | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
+        output_attentions: bool | None = None,
+        output_hidden_states: bool | None = None,
+        return_dict: bool | None = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         training: bool = False,
-    ):
+    ) -> TFTransfoXLLMHeadModelOutput | Tuple[tf.Tensor]:
         if input_ids is not None:
             bsz, tgt_len = shape_list(input_ids)[:2]
         else:
@@ -1013,17 +977,6 @@ def call(
             attentions=transformer_outputs.attentions,
         )
 
-    def serving_output(self, output):
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFTransfoXLLMHeadModelOutput(
-            prediction_scores=output.prediction_scores,
-            mems=tf.convert_to_tensor(output.mems),
-            hidden_states=hs,
-            attentions=attns,
-        )
-
     def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **model_kwargs):
         inputs = {}
 
@@ -1035,6 +988,21 @@ def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **model
 
         return inputs
 
+    # Adapted from the torch tie_weights function
+    def tf_to_pt_weight_rename(self, tf_weight):
+        if self.config.tie_word_embeddings and "crit.out_layers" in tf_weight:
+            return tf_weight, tf_weight.replace("crit.out_layers", "transformer.word_emb.emb_layers")
+        elif self.config.tie_projs and "crit.out_projs" in tf_weight:
+            for i, tie_proj in enumerate(self.config.tie_projs):
+                if tie_proj and self.config.div_val == 1 and self.config.d_model != self.config.d_embed:
+                    # self.crit.out_projs[i] = self.transformer.word_emb.emb_projs[0]
+                    return tf_weight, tf_weight.replace(f"crit.out_projs.{i}", "transformer.word_emb.emb_projs.0")
+                elif tie_proj and self.config.div_val != 1:
+                    # self.crit.out_projs[i] = self.transformer.word_emb.emb_projs[i]
+                    return tf_weight, tf_weight.replace("crit.out_projs", "transformer.word_emb.emb_projs")
+        else:
+            return (tf_weight,)
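+    # Editorial note: as used by the cross-framework weight-loading utilities, the returned tuple is a set
+    # of candidate PyTorch names to try, so a 2-tuple exposes both the original name and its tied alias,
+    # while the bare `(tf_weight,)` fallback means "no alias". This mirrors how the PyTorch `tie_weights`
+    # ties `crit.out_layers`/`crit.out_projs` to the adaptive embedding tables.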
+
 
 @add_start_docstrings(
     """
@@ -1055,7 +1023,7 @@ class TFTransfoXLForSequenceClassification(TFTransfoXLPreTrainedModel, TFSequenc
     def __init__(self, config, *inputs, **kwargs):
         super().__init__(config, *inputs, **kwargs)
         self.num_labels = config.num_labels
-        self.score = tf.keras.layers.Dense(
+        self.score = keras.layers.Dense(
             config.num_labels,
             kernel_initializer=get_initializer(config.init_range),
             name="score",
@@ -1064,6 +1032,11 @@ def __init__(self, config, *inputs, **kwargs):
         self.transformer = TFTransfoXLMainLayer(config, name="transformer")
 
     def get_output_embeddings(self):
+        # Remove after transformers v4.32. Fix this model's `test_model_common_attributes` test too.
+        logger.warning(
+            "Sequence classification models do not have output embeddings. `.get_output_embeddings` will be removed "
+            "in transformers v4.32."
+        )
         return self.transformer.word_emb
 
     @unpack_inputs
@@ -1075,14 +1048,14 @@ def get_output_embeddings(self):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        mems: Optional[List[tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        mems: List[tf.Tensor] | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[Tuple, TFTransfoXLSequenceClassifierOutputWithPast]:
         r"""
@@ -1109,16 +1082,10 @@ def call(
         else:
             if input_ids is not None:
                 sequence_lengths = (
-                    tf.reduce_sum(
-                        tf.cast(
-                            tf.math.not_equal(input_ids, self.config.pad_token_id),
-                            dtype=input_ids.dtype,
-                        ),
-                        -1,
-                        keepdims=False,
-                    )
+                    tf.argmax(tf.cast(tf.math.equal(input_ids, self.config.pad_token_id), input_ids.dtype), axis=-1)
                     - 1
                 )
+                sequence_lengths = tf.where(sequence_lengths >= 0, sequence_lengths, input_ids.shape[-1] - 1)
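+                # If a row contains no pad token, `tf.math.equal` is all zeros, `argmax` returns 0 and the
+                # subtraction above yields -1; the `tf.where` then falls back to the last position, matching
+                # the previous reduce_sum-based computation for right-padded inputs.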
                 in_logits = tf.gather(logits, sequence_lengths, batch_dims=1, axis=1)
             else:
                 sequence_lengths = -1
@@ -1155,11 +1122,3 @@ def call(
             hidden_states=transformer_outputs.hidden_states,
             attentions=transformer_outputs.attentions,
         )
-
-    def serving_output(self, output):
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFTransfoXLSequenceClassifierOutputWithPast(
-            logits=output.logits, mems=tf.convert_to_tensor(output.mems), hidden_states=hs, attentions=attns
-        )
diff --git a/src/transformers/models/transfo_xl/modeling_tf_transfo_xl_utilities.py b/src/transformers/models/deprecated/transfo_xl/modeling_tf_transfo_xl_utilities.py
similarity index 98%
rename from src/transformers/models/transfo_xl/modeling_tf_transfo_xl_utilities.py
rename to src/transformers/models/deprecated/transfo_xl/modeling_tf_transfo_xl_utilities.py
index dcfa84d0f94b69..ed1488d5595cb8 100644
--- a/src/transformers/models/transfo_xl/modeling_tf_transfo_xl_utilities.py
+++ b/src/transformers/models/deprecated/transfo_xl/modeling_tf_transfo_xl_utilities.py
@@ -20,10 +20,11 @@
 
 import tensorflow as tf
 
-from ...tf_utils import shape_list
+from ....modeling_tf_utils import keras
+from ....tf_utils import shape_list
 
 
-class TFAdaptiveSoftmaxMask(tf.keras.layers.Layer):
+class TFAdaptiveSoftmaxMask(keras.layers.Layer):
     def __init__(self, vocab_size, d_embed, d_proj, cutoffs, div_val=1, keep_order=False, **kwargs):
         super().__init__(**kwargs)
 
diff --git a/src/transformers/models/transfo_xl/modeling_transfo_xl.py b/src/transformers/models/deprecated/transfo_xl/modeling_transfo_xl.py
similarity index 98%
rename from src/transformers/models/transfo_xl/modeling_transfo_xl.py
rename to src/transformers/models/deprecated/transfo_xl/modeling_transfo_xl.py
index 094b2d33f6855c..1b8f222f508a35 100644
--- a/src/transformers/models/transfo_xl/modeling_transfo_xl.py
+++ b/src/transformers/models/deprecated/transfo_xl/modeling_transfo_xl.py
@@ -25,8 +25,8 @@
 from torch import nn
 from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
 
-from ...modeling_utils import PreTrainedModel
-from ...utils import (
+from ....modeling_utils import PreTrainedModel
+from ....utils import (
     ModelOutput,
     add_code_sample_docstrings,
     add_start_docstrings,
@@ -39,11 +39,11 @@
 
 logger = logging.get_logger(__name__)
 
-_CHECKPOINT_FOR_DOC = "transfo-xl-wt103"
+_CHECKPOINT_FOR_DOC = "transfo-xl/transfo-xl-wt103"
 _CONFIG_FOR_DOC = "TransfoXLConfig"
 
 TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST = [
-    "transfo-xl-wt103",
+    "transfo-xl/transfo-xl-wt103",
     # See all Transformer XL models at https://huggingface.co/models?filter=transfo-xl
 ]
 
@@ -185,7 +185,7 @@ def __init__(self, demb):
         self.register_buffer("inv_freq", inv_freq)
 
     def forward(self, pos_seq, bsz=None):
-        sinusoid_inp = torch.ger(pos_seq, self.inv_freq)
+        sinusoid_inp = torch.outer(pos_seq, self.inv_freq)
         pos_emb = torch.cat([sinusoid_inp.sin(), sinusoid_inp.cos()], dim=-1)
 
         if bsz is not None:
@@ -927,7 +927,7 @@ def forward(
         mlen = mems[0].size(0) if mems is not None else 0
         klen = mlen + qlen
         if self.same_length:
-            all_ones = word_emb.new_ones((qlen, klen), dtype=torch.uint8)
+            all_ones = word_emb.new_ones((qlen, klen), dtype=torch.bool)
             mask_len = klen - self.mem_len
             if mask_len > 0:
                 mask_shift_len = qlen - mask_len
@@ -935,14 +935,16 @@ def forward(
                 mask_shift_len = qlen
             dec_attn_mask = (torch.triu(all_ones, 1 + mlen) + torch.tril(all_ones, -mask_shift_len))[:, :, None]  # -1
         else:
-            dec_attn_mask = torch.triu(word_emb.new_ones((qlen, klen), dtype=torch.uint8), diagonal=1 + mlen)[
+            dec_attn_mask = torch.triu(word_emb.new_ones((qlen, klen), dtype=torch.bool), diagonal=1 + mlen)[
                 :, :, None
             ]
 
         hids = []
         attentions = [] if output_attentions else None
         if self.attn_type == 0:  # default
-            pos_seq = torch.arange(klen - 1, -1, -1.0, device=word_emb.device, dtype=word_emb.dtype)
+            pos_seq = torch.arange(klen - 1, -1, -1.0, device=word_emb.device, dtype=torch.int64).to(
+                word_emb.dtype
+            )
             if self.clamp_len > 0:
                 pos_seq.clamp_(max=self.clamp_len)
             pos_emb = self.pos_emb(pos_seq)
@@ -1002,7 +1004,7 @@ def forward(
     TRANSFO_XL_START_DOCSTRING,
 )
 class TransfoXLLMHeadModel(TransfoXLPreTrainedModel):
-    _keys_to_ignore_on_load_missing = [r"crit\.out_projs\.\d+", r"crit\.out_layers\.\d+\.weight"]
+    _tied_weights_keys = [r"crit\.out_projs\.\d+", r"crit\.out_layers\.\d+\.weight"]
 
     def __init__(self, config):
         super().__init__(config)
@@ -1012,7 +1014,7 @@ def __init__(self, config):
 
         if not self.trainer_compatible:
             warnings.warn(
-                "The output of TransfoXL will be updated in v5 to support a single loss as first argument. In order"
+                "The output of TransfoXL will be updated in v5 to support a single loss as first argument. In order "
                 "to use that updated output, please specify `trainer_compatible=True` as your configuration"
                 " attribute.",
                 DeprecationWarning,
@@ -1190,8 +1192,6 @@ def _reorder_cache(mems: List[torch.Tensor], beam_idx: torch.Tensor) -> List[tor
     TRANSFO_XL_START_DOCSTRING,
 )
 class TransfoXLForSequenceClassification(TransfoXLPreTrainedModel):
-    _keys_to_ignore_on_load_missing = [r"h\.\d+\.attn\.masked_bias", r"lm_head.weight"]
-
     def __init__(self, config):
         super().__init__(config)
         self.num_labels = config.num_labels
@@ -1249,7 +1249,10 @@ def forward(
             sequence_lengths = -1
         else:
             if input_ids is not None:
-                sequence_lengths = torch.ne(input_ids, self.config.pad_token_id).sum(-1) - 1
+                # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
+                sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
+                sequence_lengths = sequence_lengths % input_ids.shape[-1]
+                sequence_lengths = sequence_lengths.to(logits.device)
             else:
                 sequence_lengths = -1
                 logger.warning(
diff --git a/src/transformers/models/transfo_xl/modeling_transfo_xl_utilities.py b/src/transformers/models/deprecated/transfo_xl/modeling_transfo_xl_utilities.py
similarity index 97%
rename from src/transformers/models/transfo_xl/modeling_transfo_xl_utilities.py
rename to src/transformers/models/deprecated/transfo_xl/modeling_transfo_xl_utilities.py
index e25ba2cd476a0b..addf2a08372bc0 100644
--- a/src/transformers/models/transfo_xl/modeling_transfo_xl_utilities.py
+++ b/src/transformers/models/deprecated/transfo_xl/modeling_transfo_xl_utilities.py
@@ -86,7 +86,7 @@ def forward(self, hidden, labels=None, keep_order=False):
         """
         Params:
             hidden :: [len*bsz x d_proj]
-            labels :: [len*bsz
+            labels :: [len*bsz]
 
         Return:
             if labels is None: out :: [len*bsz x n_tokens] log probabilities of tokens over the vocabulary else: out ::
@@ -109,7 +109,11 @@ def forward(self, hidden, labels=None, keep_order=False):
         if self.n_clusters == 0:
             logit = self._compute_logit(hidden, self.out_layers[0].weight, self.out_layers[0].bias, self.out_projs[0])
             if labels is not None:
-                out = -nn.functional.log_softmax(logit, dim=-1).gather(1, labels.unsqueeze(1)).squeeze(1)
+                mask = labels != -100
+                out = torch.zeros_like(labels, dtype=hidden.dtype, device=hidden.device)
+                out[mask] = (
+                    -nn.functional.log_softmax(logit, dim=-1)[mask].gather(1, labels[mask].unsqueeze(1)).squeeze(1)
+                )
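+                # Labels equal to -100 (the conventional ignore index) are masked out so they contribute 0
+                # to `out` instead of being used as gather targets on the log-probabilities.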
             else:
                 out = nn.functional.log_softmax(logit, dim=-1)
         else:
diff --git a/src/transformers/models/transfo_xl/tokenization_transfo_xl.py b/src/transformers/models/deprecated/transfo_xl/tokenization_transfo_xl.py
similarity index 91%
rename from src/transformers/models/transfo_xl/tokenization_transfo_xl.py
rename to src/transformers/models/deprecated/transfo_xl/tokenization_transfo_xl.py
index 13977d43828051..12d360076fba4f 100644
--- a/src/transformers/models/transfo_xl/tokenization_transfo_xl.py
+++ b/src/transformers/models/deprecated/transfo_xl/tokenization_transfo_xl.py
@@ -27,13 +27,14 @@
 
 import numpy as np
 
-from ...tokenization_utils import PreTrainedTokenizer
-from ...utils import (
+from ....tokenization_utils import PreTrainedTokenizer
+from ....utils import (
     cached_file,
     is_sacremoses_available,
     is_torch_available,
     logging,
     requires_backends,
+    strtobool,
     torch_only_method,
 )
 
@@ -56,16 +57,16 @@
 
 PRETRAINED_VOCAB_FILES_MAP = {
     "pretrained_vocab_file": {
-        "transfo-xl-wt103": "https://huggingface.co/transfo-xl-wt103/resolve/main/vocab.pkl",
+        "transfo-xl/transfo-xl-wt103": "https://huggingface.co/transfo-xl/transfo-xl-wt103/resolve/main/vocab.pkl",
     }
 }
 
 PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
-    "transfo-xl-wt103": None,
+    "transfo-xl/transfo-xl-wt103": None,
 }
 
 PRETRAINED_CORPUS_ARCHIVE_MAP = {
-    "transfo-xl-wt103": "https://huggingface.co/transfo-xl-wt103/resolve/main/corpus.bin",
+    "transfo-xl/transfo-xl-wt103": "https://huggingface.co/transfo-xl/transfo-xl-wt103/resolve/main/corpus.bin",
 }
 CORPUS_NAME = "corpus.bin"
 
@@ -88,7 +89,7 @@ def tokenize_numbers(text_array: List[str]) -> List[str]:
 
     ```python
     >>> tokenize_numbers(["$", "5,000", "1.73", "m"])
-    ["$", "5", "@,@", "000", "1", "@.@", "73", "m"]
+    ['$', '5', '@,@', '000', '1', '@.@', '73', 'm']
     ```"""
     tokenized = []
     for i in range(len(text_array)):
@@ -113,7 +114,7 @@ def detokenize_numbers(text: str) -> str:
 
     ```python
     >>> detokenize_numbers("$ 5 @,@ 000 1 @.@ 73 m")
-    "$ 5,000 1.73 m"
+    '$ 5,000 1.73 m'
     ```"""
     for reg, sub in DETOKENIZE_NUMBERS:
         text = re.sub(reg, sub, text)
@@ -154,7 +155,7 @@ class TransfoXLTokenizer(PreTrainedTokenizer):
             token instead.
         eos_token (`str`, *optional*, defaults to `"<eos>"`):
             The end of sequence token.
-        additional_special_tokens (`List[str]`, *optional*, defaults to `["<formula>"]`):
+        additional_special_tokens (`List[str]`, *optional*, defaults to `['<formula>']`):
             A list of additional special tokens (for the HuggingFace functionality).
         language (`str`, *optional*, defaults to `"en"`):
             The language of this tokenizer (used for moses preprocessing).
@@ -181,25 +182,13 @@ def __init__(
         language="en",
         **kwargs,
     ):
-        super().__init__(
-            special=special,
-            min_freq=min_freq,
-            max_size=max_size,
-            lower_case=lower_case,
-            delimiter=delimiter,
-            vocab_file=vocab_file,
-            pretrained_vocab_file=pretrained_vocab_file,
-            never_split=never_split,
-            unk_token=unk_token,
-            eos_token=eos_token,
-            additional_special_tokens=additional_special_tokens,
-            language=language,
-            **kwargs,
+        logger.error(
+            "`TransfoXL` was deprecated due to security issues linked to `pickle.load` in `TransfoXLTokenizer`. "
+            "See more details on this model's documentation page: "
+            "`https://github.com/huggingface/transformers/blob/main/docs/source/en/model_doc/transfo-xl.md`."
         )
-        requires_backends(self, "sacremoses")
 
-        if never_split is None:
-            never_split = self.all_special_tokens
+        requires_backends(self, "sacremoses")
         if special is None:
             special = []
         self.counter = Counter()
@@ -209,7 +198,6 @@ def __init__(
         self.lower_case = lower_case
         self.delimiter = delimiter
         self.vocab_file = vocab_file
-        self.never_split = never_split
         self.punctuation_symbols = '!"#$%&()*+,-./\\:;<=>?@[\\]^_`{|}~'
         self.punction_without_space_before_pattern = re.compile(rf"[^\s][{self.punctuation_symbols}]")
         self.punctuation_with_space_around_pattern = self._compile_space_around_punctuation_pattern()
@@ -217,20 +205,29 @@ def __init__(
         self.moses_punct_normalizer = sm.MosesPunctNormalizer(language)
         self.moses_tokenizer = sm.MosesTokenizer(language)
         self.moses_detokenizer = sm.MosesDetokenizer(language)
-
+        self.idx2sym = []
+        self.sym2idx = OrderedDict()
         # This try... catch... is not beautiful but honestly this tokenizer was not made to be used
         # in a library like ours, at all.
         try:
             vocab_dict = None
             if pretrained_vocab_file is not None:
                 # Priority on pickle files (support PyTorch and TF)
+                if not strtobool(os.environ.get("TRUST_REMOTE_CODE", "False")):
+                    raise ValueError(
+                        "This part uses `pickle.load` which is insecure and will execute arbitrary code that is "
+                        "potentially malicious. It's recommended to never unpickle data that could have come from an "
+                        "untrusted source, or that could have been tampered with. If you already verified the pickle "
+                        "data and decided to use it, you can set the environment variable "
+                        "`TRUST_REMOTE_CODE` to `True` to allow it."
+                    )
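+                # For a vocab file you have audited yourself, the check can be bypassed for a single run via
+                # the environment variable, e.g. `TRUST_REMOTE_CODE=True python your_script.py` (the script
+                # name here is only a placeholder).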
                 with open(pretrained_vocab_file, "rb") as f:
                     vocab_dict = pickle.load(f)
 
                 # Loading a torch-saved transfo-xl vocab dict with pickle results in an integer
                 # Entering this if statement means that we tried to load a torch-saved file with pickle, and we failed.
                 # We therefore load it with torch, if it's available.
-                if type(vocab_dict) == int:
+                if isinstance(vocab_dict, int):
                     if not is_torch_available():
                         raise ImportError(
                             "Not trying to load dict with PyTorch as you need to install pytorch to load "
@@ -241,7 +238,7 @@ def __init__(
 
             if vocab_dict is not None:
                 for key, value in vocab_dict.items():
-                    if key not in self.__dict__:
+                    if key not in self.__dict__ or key in ["sym2idx", "idx2sym"]:
                         self.__dict__[key] = value
             elif vocab_file is not None:
                 self.build_vocab()
@@ -256,6 +253,27 @@ def __init__(
         if vocab_file is not None:
             self.build_vocab()
 
+        super().__init__(
+            special=special,
+            min_freq=min_freq,
+            max_size=max_size,
+            lower_case=lower_case,
+            delimiter=delimiter,
+            vocab_file=vocab_file,
+            pretrained_vocab_file=pretrained_vocab_file,
+            never_split=never_split,
+            unk_token=unk_token,
+            eos_token=eos_token,
+            additional_special_tokens=additional_special_tokens,
+            language=language,
+            **kwargs,
+        )
+
+        # these are not required to initialize the parent class as only used when tokenizing.
+        if never_split is None:
+            never_split = self.all_special_tokens
+        self.never_split = never_split
+
     @property
     def do_lower_case(self):
         return self.lower_case
@@ -305,7 +323,7 @@ def _build_from_file(self, vocab_file):
         elif "" in self.sym2idx:
             self.unk_idx = self.sym2idx[""]
         else:
-            raise ValueError("No  token in vocabulary")
+            raise ValueError("Token not in vocabulary and no  token in vocabulary for replacement.")
 
     def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
         if os.path.isdir(save_directory):
@@ -323,7 +341,7 @@ def build_vocab(self):
         if self.vocab_file:
             logger.info(f"building vocab from {self.vocab_file}")
             self._build_from_file(self.vocab_file)
-            logger.info(f"final vocab size {len(self)}")
+            logger.info(f"Final vocab size {len(self.sym2idx)}")
         else:
             logger.info(f"building vocab with min_freq={self.min_freq}, max_size={self.max_size}")
             self.idx2sym = []
@@ -337,7 +355,7 @@ def build_vocab(self):
                     break
                 self.add_symbol(sym)
 
-            logger.info(f"final vocab size {len(self)} from {len(self.counter)} unique tokens")
+            logger.info(f"Final vocab size {len(self.sym2idx)} from {len(self.counter)} unique tokens")
 
     @torch_only_method
     def encode_file(self, path, ordered=False, verbose=False, add_eos=True, add_double_eos=False):
@@ -406,9 +424,8 @@ def move_added_token(self, token: str, target_idx: int):
             self.sym2idx[current_sym] = idx
 
         # Delete token from added_tokens
-        old_index = self.added_tokens_encoder[token]
-        del self.added_tokens_decoder[old_index]
-        del self.added_tokens_encoder[token]
+        old_index = self._added_tokens_encoder.pop(token)
+        self._added_tokens_decoder.pop(old_index)
 
     def moses_punct_norm(self, text):
         return self.moses_punct_normalizer.normalize(text)
@@ -434,7 +451,7 @@ def moses_pipeline(self, text: str) -> List[str]:
         Example:
 
         ```python
-        >>> tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")
+        >>> tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl/transfo-xl-wt103")
         >>> tokenizer.moses_pipeline("23,000 people are 1.80 m tall")
         ['23', '@,@', '000', 'people', 'are', '1', '@.@', '80', 'm', 'tall']
         ```"""
@@ -463,7 +480,7 @@ def _convert_token_to_id(self, sym):
             elif "" in self.sym2idx:
                 return self.sym2idx[""]
             else:
-                raise ValueError("Token not in vocabulary and no  token in vocabulary for replacement")
+                raise ValueError("Token not in vocabulary and no  token in vocabulary for replacement.")
 
     def convert_tokens_to_string(self, tokens):
         """
@@ -482,7 +499,9 @@ def vocab_size(self):
         return len(self.idx2sym)
 
     def get_vocab(self):
-        return dict(self.sym2idx, **self.added_tokens_encoder)
+        vocab = self.sym2idx.copy()
+        vocab.update(self.added_tokens_encoder)
+        return vocab
 
     def _tokenize(self, line, add_eos=False, add_double_eos=False):
         line = line.strip()
@@ -780,6 +799,13 @@ def get_lm_corpus(datadir, dataset):
         corpus = torch.load(fn_pickle)
     elif os.path.exists(fn):
         logger.info("Loading cached dataset from pickle...")
+        if not strtobool(os.environ.get("TRUST_REMOTE_CODE", "False")):
+            raise ValueError(
+                "This part uses `pickle.load` which is insecure and will execute arbitrary code that is potentially "
+                "malicious. It's recommended to never unpickle data that could have come from an untrusted source, or "
+                "that could have been tampered with. If you already verified the pickle data and decided to use it, "
+                "you can set the environment variable `TRUST_REMOTE_CODE` to `True` to allow it."
+            )
         with open(fn, "rb") as fp:
             corpus = pickle.load(fp)
     else:
diff --git a/src/transformers/models/van/__init__.py b/src/transformers/models/deprecated/van/__init__.py
similarity index 93%
rename from src/transformers/models/van/__init__.py
rename to src/transformers/models/deprecated/van/__init__.py
index c696c5c5e5ae43..2db730984ffa03 100644
--- a/src/transformers/models/van/__init__.py
+++ b/src/transformers/models/deprecated/van/__init__.py
@@ -13,7 +13,7 @@
 # limitations under the License.
 from typing import TYPE_CHECKING
 
-from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available
+from ....utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available
 
 
 _import_structure = {"configuration_van": ["VAN_PRETRAINED_CONFIG_ARCHIVE_MAP", "VanConfig"]}
diff --git a/src/transformers/models/van/configuration_van.py b/src/transformers/models/deprecated/van/configuration_van.py
similarity index 96%
rename from src/transformers/models/van/configuration_van.py
rename to src/transformers/models/deprecated/van/configuration_van.py
index 85a0dd20e47728..85f228193c450e 100644
--- a/src/transformers/models/van/configuration_van.py
+++ b/src/transformers/models/deprecated/van/configuration_van.py
@@ -14,8 +14,8 @@
 # limitations under the License.
 """ VAN model configuration"""
 
-from ...configuration_utils import PretrainedConfig
-from ...utils import logging
+from ....configuration_utils import PretrainedConfig
+from ....utils import logging
 
 
 logger = logging.get_logger(__name__)
@@ -57,9 +57,9 @@ class VanConfig(PretrainedConfig):
             `"selu"` and `"gelu_new"` are supported.
         initializer_range (`float`, *optional*, defaults to 0.02):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
-        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
+        layer_norm_eps (`float`, *optional*, defaults to 1e-06):
             The epsilon used by the layer normalization layers.
-        layer_scale_init_value (`float`, *optional*, defaults to 1e-2):
+        layer_scale_init_value (`float`, *optional*, defaults to 0.01):
             The initial value for layer scaling.
         drop_path_rate (`float`, *optional*, defaults to 0.0):
             The dropout probability for stochastic depth.
@@ -77,6 +77,7 @@ class VanConfig(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "van"
 
     def __init__(
diff --git a/src/transformers/models/van/convert_van_to_pytorch.py b/src/transformers/models/deprecated/van/convert_van_to_pytorch.py
similarity index 95%
rename from src/transformers/models/van/convert_van_to_pytorch.py
rename to src/transformers/models/deprecated/van/convert_van_to_pytorch.py
index a8086e6d1b511b..20492e42be2043 100644
--- a/src/transformers/models/van/convert_van_to_pytorch.py
+++ b/src/transformers/models/deprecated/van/convert_van_to_pytorch.py
@@ -30,8 +30,8 @@
 from huggingface_hub import cached_download, hf_hub_download
 from torch import Tensor
 
-from transformers import AutoFeatureExtractor, VanConfig, VanForImageClassification
-from transformers.models.van.modeling_van import VanLayerScaling
+from transformers import AutoImageProcessor, VanConfig, VanForImageClassification
+from transformers.models.deprecated.van.modeling_van import VanLayerScaling
 from transformers.utils import logging
 
 
@@ -55,7 +55,7 @@ def __call__(self, x: Tensor):
         for m in self.module.modules():
             self.handles.append(m.register_forward_hook(self._forward_hook))
         self.module(x)
-        list(map(lambda x: x.remove(), self.handles))
+        [x.remove() for x in self.handles]
         return self
 
     @property
@@ -154,10 +154,10 @@ def convert_weight_and_push(
         )
 
         # we can use the convnext one
-        feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/convnext-base-224-22k-1k")
-        feature_extractor.push_to_hub(
+        image_processor = AutoImageProcessor.from_pretrained("facebook/convnext-base-224-22k-1k")
+        image_processor.push_to_hub(
             repo_path_or_name=save_directory / checkpoint_name,
-            commit_message="Add feature extractor",
+            commit_message="Add image processor",
             use_temp_dir=True,
         )
 
@@ -277,7 +277,7 @@ def convert_weights_and_push(save_directory: Path, model_name: str = None, push_
         default=True,
         type=bool,
         required=False,
-        help="If True, push model and feature extractor to the hub.",
+        help="If True, push model and image processor to the hub.",
     )
 
     args = parser.parse_args()
diff --git a/src/transformers/models/van/modeling_van.py b/src/transformers/models/deprecated/van/modeling_van.py
similarity index 97%
rename from src/transformers/models/van/modeling_van.py
rename to src/transformers/models/deprecated/van/modeling_van.py
index 59c8655aa93f89..e0f88467e1e75b 100644
--- a/src/transformers/models/van/modeling_van.py
+++ b/src/transformers/models/deprecated/van/modeling_van.py
@@ -23,14 +23,14 @@
 from torch import nn
 from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
 
-from ...activations import ACT2FN
-from ...modeling_outputs import (
+from ....activations import ACT2FN
+from ....modeling_outputs import (
     BaseModelOutputWithNoAttention,
     BaseModelOutputWithPoolingAndNoAttention,
     ImageClassifierOutputWithNoAttention,
 )
-from ...modeling_utils import PreTrainedModel
-from ...utils import add_code_sample_docstrings, add_start_docstrings, add_start_docstrings_to_model_forward, logging
+from ....modeling_utils import PreTrainedModel
+from ....utils import add_code_sample_docstrings, add_start_docstrings, add_start_docstrings_to_model_forward, logging
 from .configuration_van import VanConfig
 
 
@@ -54,7 +54,7 @@
 
 
 # Copied from transformers.models.convnext.modeling_convnext.drop_path
-def drop_path(input, drop_prob: float = 0.0, training: bool = False):
+def drop_path(input: torch.Tensor, drop_prob: float = 0.0, training: bool = False) -> torch.Tensor:
     """
     Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
 
@@ -387,10 +387,6 @@ def _init_weights(self, module):
             if module.bias is not None:
                 module.bias.data.zero_()
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, VanModel):
-            module.gradient_checkpointing = value
-
 
 VAN_START_DOCSTRING = r"""
     This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it
diff --git a/src/transformers/models/depth_anything/__init__.py b/src/transformers/models/depth_anything/__init__.py
new file mode 100644
index 00000000000000..0d0ea5a514a836
--- /dev/null
+++ b/src/transformers/models/depth_anything/__init__.py
@@ -0,0 +1,56 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import TYPE_CHECKING
+
+from ...file_utils import _LazyModule, is_torch_available
+from ...utils import OptionalDependencyNotAvailable
+
+
+_import_structure = {
+    "configuration_depth_anything": ["DEPTH_ANYTHING_PRETRAINED_CONFIG_ARCHIVE_MAP", "DepthAnythingConfig"]
+}
+
+try:
+    if not is_torch_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["modeling_depth_anything"] = [
+        "DEPTH_ANYTHING_PRETRAINED_MODEL_ARCHIVE_LIST",
+        "DepthAnythingForDepthEstimation",
+        "DepthAnythingPreTrainedModel",
+    ]
+
+
+if TYPE_CHECKING:
+    from .configuration_depth_anything import DEPTH_ANYTHING_PRETRAINED_CONFIG_ARCHIVE_MAP, DepthAnythingConfig
+
+    try:
+        if not is_torch_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .modeling_depth_anything import (
+            DEPTH_ANYTHING_PRETRAINED_MODEL_ARCHIVE_LIST,
+            DepthAnythingForDepthEstimation,
+            DepthAnythingPreTrainedModel,
+        )
+
+
+else:
+    import sys
+
+    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
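+# The `TYPE_CHECKING` branch above gives static type checkers concrete imports, while at runtime the module
+# is replaced by a `_LazyModule` so the torch-dependent classes are only imported on first attribute access.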
diff --git a/src/transformers/models/depth_anything/configuration_depth_anything.py b/src/transformers/models/depth_anything/configuration_depth_anything.py
new file mode 100644
index 00000000000000..7fa7745c32d3fd
--- /dev/null
+++ b/src/transformers/models/depth_anything/configuration_depth_anything.py
@@ -0,0 +1,146 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" DepthAnything model configuration"""
+
+import copy
+
+from ...configuration_utils import PretrainedConfig
+from ...utils import logging
+from ..auto.configuration_auto import CONFIG_MAPPING
+
+
+logger = logging.get_logger(__name__)
+
+DEPTH_ANYTHING_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    "LiheYoung/depth-anything-small-hf": "https://huggingface.co/LiheYoung/depth-anything-small-hf/resolve/main/config.json",
+}
+
+
+class DepthAnythingConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`DepthAnythingModel`]. It is used to instantiate a DepthAnything
+    model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
+    defaults will yield a similar configuration to that of the DepthAnything
+    [LiheYoung/depth-anything-small-hf](https://huggingface.co/LiheYoung/depth-anything-small-hf) architecture.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+    Args:
+        backbone_config (`Union[Dict[str, Any], PretrainedConfig]`, *optional*):
+            The configuration of the backbone model. Only used in case you want to leverage the
+            [`AutoBackbone`] API.
+        backbone (`str`, *optional*):
+            Name of backbone to use when `backbone_config` is `None`. If `use_pretrained_backbone` is `True`, this
+            will load the corresponding pretrained weights from the timm or transformers library. If `use_pretrained_backbone`
+            is `False`, this loads the backbone's config and uses that to initialize the backbone with random weights.
+        use_pretrained_backbone (`bool`, *optional*, defaults to `False`):
+            Whether to use pretrained weights for the backbone.
+        patch_size (`int`, *optional*, defaults to 14):
+            The size of the patches to extract from the backbone features.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        reassemble_hidden_size (`int`, *optional*, defaults to 384):
+            The number of input channels of the reassemble layers.
+        reassemble_factors (`List[float]`, *optional*, defaults to `[4, 2, 1, 0.5]`):
+            The up/downsampling factors of the reassemble layers.
+        neck_hidden_sizes (`List[int]`, *optional*, defaults to `[48, 96, 192, 384]`):
+            The hidden sizes to project to for the feature maps of the backbone.
+        fusion_hidden_size (`int`, *optional*, defaults to 64):
+            The number of channels before fusion.
+        head_in_index (`int`, *optional*, defaults to -1):
+            The index of the features to use in the depth estimation head.
+        head_hidden_size (`int`, *optional*, defaults to 32):
+            The number of output channels in the second convolution of the depth estimation head.
+
+    Example:
+
+    ```python
+    >>> from transformers import DepthAnythingConfig, DepthAnythingForDepthEstimation
+
+    >>> # Initializing a DepthAnything small style configuration
+    >>> configuration = DepthAnythingConfig()
+
+    >>> # Initializing a model from the DepthAnything small style configuration
+    >>> model = DepthAnythingForDepthEstimation(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "depth_anything"
+
+    def __init__(
+        self,
+        backbone_config=None,
+        backbone=None,
+        use_pretrained_backbone=False,
+        patch_size=14,
+        initializer_range=0.02,
+        reassemble_hidden_size=384,
+        reassemble_factors=[4, 2, 1, 0.5],
+        neck_hidden_sizes=[48, 96, 192, 384],
+        fusion_hidden_size=64,
+        head_in_index=-1,
+        head_hidden_size=32,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+
+        if use_pretrained_backbone:
+            raise ValueError("Pretrained backbones are not supported yet.")
+
+        if backbone_config is not None and backbone is not None:
+            raise ValueError("You can't specify both `backbone` and `backbone_config`.")
+
+        if backbone_config is None and backbone is None:
+            logger.info("`backbone_config` is `None`. Initializing the config with the default `Dinov2` backbone.")
+            backbone_config = CONFIG_MAPPING["dinov2"](
+                image_size=518,
+                hidden_size=384,
+                num_attention_heads=6,
+                out_indices=[9, 10, 11, 12],
+                apply_layernorm=True,
+                reshape_hidden_states=False,
+            )
+        elif isinstance(backbone_config, dict):
+            backbone_model_type = backbone_config.get("model_type")
+            config_class = CONFIG_MAPPING[backbone_model_type]
+            backbone_config = config_class.from_dict(backbone_config)
+
+        self.backbone_config = backbone_config
+        self.backbone = backbone
+        self.use_pretrained_backbone = use_pretrained_backbone
+        self.reassemble_hidden_size = reassemble_hidden_size
+        self.patch_size = patch_size
+        self.initializer_range = initializer_range
+        self.reassemble_factors = reassemble_factors
+        self.neck_hidden_sizes = neck_hidden_sizes
+        self.fusion_hidden_size = fusion_hidden_size
+        self.head_in_index = head_in_index
+        self.head_hidden_size = head_hidden_size
+
+    def to_dict(self):
+        """
+        Serializes this instance to a Python dictionary. Overrides the default [`~PretrainedConfig.to_dict`].
+
+        Returns:
+            `Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance.
+        """
+        output = copy.deepcopy(self.__dict__)
+
+        if output["backbone_config"] is not None:
+            output["backbone_config"] = self.backbone_config.to_dict()
+
+        output["model_type"] = self.__class__.model_type
+        return output
diff --git a/src/transformers/models/depth_anything/convert_depth_anything_to_hf.py b/src/transformers/models/depth_anything/convert_depth_anything_to_hf.py
new file mode 100644
index 00000000000000..022a66c0d609cd
--- /dev/null
+++ b/src/transformers/models/depth_anything/convert_depth_anything_to_hf.py
@@ -0,0 +1,299 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Convert Depth Anything checkpoints from the original repository. URL:
+https://github.com/LiheYoung/Depth-Anything"""
+
+
+import argparse
+from pathlib import Path
+
+import requests
+import torch
+from huggingface_hub import hf_hub_download
+from PIL import Image
+
+from transformers import DepthAnythingConfig, DepthAnythingForDepthEstimation, Dinov2Config, DPTImageProcessor
+from transformers.utils import logging
+
+
+logging.set_verbosity_info()
+logger = logging.get_logger(__name__)
+
+
+def get_dpt_config(model_name):
+    if "small" in model_name:
+        backbone_config = Dinov2Config.from_pretrained(
+            "facebook/dinov2-small", out_indices=[9, 10, 11, 12], apply_layernorm=True, reshape_hidden_states=False
+        )
+        fusion_hidden_size = 64
+        neck_hidden_sizes = [48, 96, 192, 384]
+    elif "base" in model_name:
+        backbone_config = Dinov2Config.from_pretrained(
+            "facebook/dinov2-base", out_indices=[9, 10, 11, 12], apply_layernorm=True, reshape_hidden_states=False
+        )
+        fusion_hidden_size = 128
+        neck_hidden_sizes = [96, 192, 384, 768]
+    elif "large" in model_name:
+        backbone_config = Dinov2Config.from_pretrained(
+            "facebook/dinov2-large", out_indices=[21, 22, 23, 24], apply_layernorm=True, reshape_hidden_states=False
+        )
+        fusion_hidden_size = 256
+        neck_hidden_sizes = [256, 512, 1024, 1024]
+    else:
+        raise NotImplementedError("To do")
+
+    config = DepthAnythingConfig(
+        reassemble_hidden_size=backbone_config.hidden_size,
+        patch_size=backbone_config.patch_size,
+        backbone_config=backbone_config,
+        fusion_hidden_size=fusion_hidden_size,
+        neck_hidden_sizes=neck_hidden_sizes,
+    )
+
+    return config
+
+
+def create_rename_keys(config):
+    rename_keys = []
+
+    # fmt: off
+    # stem
+    rename_keys.append(("pretrained.cls_token", "backbone.embeddings.cls_token"))
+    rename_keys.append(("pretrained.mask_token", "backbone.embeddings.mask_token"))
+    rename_keys.append(("pretrained.pos_embed", "backbone.embeddings.position_embeddings"))
+    rename_keys.append(("pretrained.patch_embed.proj.weight", "backbone.embeddings.patch_embeddings.projection.weight"))
+    rename_keys.append(("pretrained.patch_embed.proj.bias", "backbone.embeddings.patch_embeddings.projection.bias"))
+
+    # Transformer encoder
+    for i in range(config.backbone_config.num_hidden_layers):
+        rename_keys.append((f"pretrained.blocks.{i}.ls1.gamma", f"backbone.encoder.layer.{i}.layer_scale1.lambda1"))
+        rename_keys.append((f"pretrained.blocks.{i}.ls2.gamma", f"backbone.encoder.layer.{i}.layer_scale2.lambda1"))
+        rename_keys.append((f"pretrained.blocks.{i}.norm1.weight", f"backbone.encoder.layer.{i}.norm1.weight"))
+        rename_keys.append((f"pretrained.blocks.{i}.norm1.bias", f"backbone.encoder.layer.{i}.norm1.bias"))
+        rename_keys.append((f"pretrained.blocks.{i}.norm2.weight", f"backbone.encoder.layer.{i}.norm2.weight"))
+        rename_keys.append((f"pretrained.blocks.{i}.norm2.bias", f"backbone.encoder.layer.{i}.norm2.bias"))
+        rename_keys.append((f"pretrained.blocks.{i}.mlp.fc1.weight", f"backbone.encoder.layer.{i}.mlp.fc1.weight"))
+        rename_keys.append((f"pretrained.blocks.{i}.mlp.fc1.bias", f"backbone.encoder.layer.{i}.mlp.fc1.bias"))
+        rename_keys.append((f"pretrained.blocks.{i}.mlp.fc2.weight", f"backbone.encoder.layer.{i}.mlp.fc2.weight"))
+        rename_keys.append((f"pretrained.blocks.{i}.mlp.fc2.bias", f"backbone.encoder.layer.{i}.mlp.fc2.bias"))
+        rename_keys.append((f"pretrained.blocks.{i}.attn.proj.weight", f"backbone.encoder.layer.{i}.attention.output.dense.weight"))
+        rename_keys.append((f"pretrained.blocks.{i}.attn.proj.bias", f"backbone.encoder.layer.{i}.attention.output.dense.bias"))
+
+    # Head
+    rename_keys.append(("pretrained.norm.weight", "backbone.layernorm.weight"))
+    rename_keys.append(("pretrained.norm.bias", "backbone.layernorm.bias"))
+
+    # activation postprocessing (readout projections + resize blocks)
+    # Depth Anything does not use CLS token => readout_projects not required
+
+    for i in range(4):
+        rename_keys.append((f"depth_head.projects.{i}.weight", f"neck.reassemble_stage.layers.{i}.projection.weight"))
+        rename_keys.append((f"depth_head.projects.{i}.bias", f"neck.reassemble_stage.layers.{i}.projection.bias"))
+
+        if i != 2:
+            rename_keys.append((f"depth_head.resize_layers.{i}.weight", f"neck.reassemble_stage.layers.{i}.resize.weight"))
+            rename_keys.append((f"depth_head.resize_layers.{i}.bias", f"neck.reassemble_stage.layers.{i}.resize.bias"))
+
+    # refinenet (tricky here)
+    mapping = {1:3, 2:2, 3:1, 4:0}
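+    # The original checkpoint numbers its refinenet blocks 1..4, while the HF neck stores the corresponding
+    # fusion layers as indices 0..3 in the opposite order, hence the flipped mapping above.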
+
+    for i in range(1, 5):
+        j = mapping[i]
+        rename_keys.append((f"depth_head.scratch.refinenet{i}.out_conv.weight", f"neck.fusion_stage.layers.{j}.projection.weight"))
+        rename_keys.append((f"depth_head.scratch.refinenet{i}.out_conv.bias", f"neck.fusion_stage.layers.{j}.projection.bias"))
+        rename_keys.append((f"depth_head.scratch.refinenet{i}.resConfUnit1.conv1.weight", f"neck.fusion_stage.layers.{j}.residual_layer1.convolution1.weight"))
+        rename_keys.append((f"depth_head.scratch.refinenet{i}.resConfUnit1.conv1.bias", f"neck.fusion_stage.layers.{j}.residual_layer1.convolution1.bias"))
+        rename_keys.append((f"depth_head.scratch.refinenet{i}.resConfUnit1.conv2.weight", f"neck.fusion_stage.layers.{j}.residual_layer1.convolution2.weight"))
+        rename_keys.append((f"depth_head.scratch.refinenet{i}.resConfUnit1.conv2.bias", f"neck.fusion_stage.layers.{j}.residual_layer1.convolution2.bias"))
+        rename_keys.append((f"depth_head.scratch.refinenet{i}.resConfUnit2.conv1.weight", f"neck.fusion_stage.layers.{j}.residual_layer2.convolution1.weight"))
+        rename_keys.append((f"depth_head.scratch.refinenet{i}.resConfUnit2.conv1.bias", f"neck.fusion_stage.layers.{j}.residual_layer2.convolution1.bias"))
+        rename_keys.append((f"depth_head.scratch.refinenet{i}.resConfUnit2.conv2.weight", f"neck.fusion_stage.layers.{j}.residual_layer2.convolution2.weight"))
+        rename_keys.append((f"depth_head.scratch.refinenet{i}.resConfUnit2.conv2.bias", f"neck.fusion_stage.layers.{j}.residual_layer2.convolution2.bias"))
+
+    # scratch convolutions
+    for i in range(4):
+        rename_keys.append((f"depth_head.scratch.layer{i+1}_rn.weight", f"neck.convs.{i}.weight"))
+
+    # head
+    rename_keys.append(("depth_head.scratch.output_conv1.weight", "head.conv1.weight"))
+    rename_keys.append(("depth_head.scratch.output_conv1.bias", "head.conv1.bias"))
+    rename_keys.append(("depth_head.scratch.output_conv2.0.weight", "head.conv2.weight"))
+    rename_keys.append(("depth_head.scratch.output_conv2.0.bias", "head.conv2.bias"))
+    rename_keys.append(("depth_head.scratch.output_conv2.2.weight", "head.conv3.weight"))
+    rename_keys.append(("depth_head.scratch.output_conv2.2.bias", "head.conv3.bias"))
+
+    return rename_keys
+
+
+# we split up the matrix of each encoder layer into queries, keys and values
+def read_in_q_k_v(state_dict, config):
+    hidden_size = config.backbone_config.hidden_size
+    for i in range(config.backbone_config.num_hidden_layers):
+        # read in weights + bias of input projection layer (in original implementation, this is a single matrix + bias)
+        in_proj_weight = state_dict.pop(f"pretrained.blocks.{i}.attn.qkv.weight")
+        in_proj_bias = state_dict.pop(f"pretrained.blocks.{i}.attn.qkv.bias")
+        # next, add query, keys and values (in that order) to the state dict
+        state_dict[f"backbone.encoder.layer.{i}.attention.attention.query.weight"] = in_proj_weight[:hidden_size, :]
+        state_dict[f"backbone.encoder.layer.{i}.attention.attention.query.bias"] = in_proj_bias[:hidden_size]
+        state_dict[f"backbone.encoder.layer.{i}.attention.attention.key.weight"] = in_proj_weight[
+            hidden_size : hidden_size * 2, :
+        ]
+        state_dict[f"backbone.encoder.layer.{i}.attention.attention.key.bias"] = in_proj_bias[
+            hidden_size : hidden_size * 2
+        ]
+        state_dict[f"backbone.encoder.layer.{i}.attention.attention.value.weight"] = in_proj_weight[-hidden_size:, :]
+        state_dict[f"backbone.encoder.layer.{i}.attention.attention.value.bias"] = in_proj_bias[-hidden_size:]
+
+
+def rename_key(dct, old, new):
+    val = dct.pop(old)
+    dct[new] = val
+
+
+# We will verify our results on an image of cute cats
+def prepare_img():
+    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+    im = Image.open(requests.get(url, stream=True).raw)
+    return im
+
+
+name_to_checkpoint = {
+    "depth-anything-small": "depth_anything_vits14.pth",
+    "depth-anything-base": "depth_anything_vitb14.pth",
+    "depth-anything-large": "depth_anything_vitl14.pth",
+}
+
+
+@torch.no_grad()
+def convert_dpt_checkpoint(model_name, pytorch_dump_folder_path, push_to_hub, verify_logits):
+    """
+    Copy/paste/tweak model's weights to our DPT structure.
+    """
+
+    # define DPT configuration
+    config = get_dpt_config(model_name)
+
+    model_name_to_filename = {
+        "depth-anything-small": "depth_anything_vits14.pth",
+        "depth-anything-base": "depth_anything_vitb14.pth",
+        "depth-anything-large": "depth_anything_vitl14.pth",
+    }
+
+    # load original state_dict
+    filename = model_name_to_filename[model_name]
+    filepath = hf_hub_download(
+        repo_id="LiheYoung/Depth-Anything", filename=f"checkpoints/{filename}", repo_type="space"
+    )
+    state_dict = torch.load(filepath, map_location="cpu")
+    # rename keys
+    rename_keys = create_rename_keys(config)
+    for src, dest in rename_keys:
+        rename_key(state_dict, src, dest)
+    # read in qkv matrices
+    read_in_q_k_v(state_dict, config)
+
+    # load HuggingFace model
+    model = DepthAnythingForDepthEstimation(config)
+    model.load_state_dict(state_dict)
+    model.eval()
+
+    processor = DPTImageProcessor(
+        do_resize=True,
+        size={"height": 518, "width": 518},
+        ensure_multiple_of=14,
+        keep_aspect_ratio=True,
+        do_rescale=True,
+        do_normalize=True,
+        image_mean=[0.485, 0.456, 0.406],
+        image_std=[0.229, 0.224, 0.225],
+    )
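+    # 518 = 14 * 37, so resized inputs tile exactly into the backbone's 14x14 patches; `keep_aspect_ratio`
+    # together with `ensure_multiple_of=14` preserves that divisibility for non-square images.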
+
+    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+    image = Image.open(requests.get(url, stream=True).raw)
+
+    pixel_values = processor(image, return_tensors="pt").pixel_values
+
+    # Verify forward pass
+    with torch.no_grad():
+        outputs = model(pixel_values)
+        predicted_depth = outputs.predicted_depth
+
+    print("Shape of predicted depth:", predicted_depth.shape)
+    print("First values:", predicted_depth[0, :3, :3])
+
+    # assert logits
+    if verify_logits:
+        expected_shape = torch.Size([1, 518, 686])
+        if model_name == "depth-anything-small":
+            expected_slice = torch.tensor(
+                [[8.8204, 8.6468, 8.6195], [8.3313, 8.6027, 8.7526], [8.6526, 8.6866, 8.7453]],
+            )
+        elif model_name == "depth-anything-base":
+            expected_slice = torch.tensor(
+                [[26.3997, 26.3004, 26.3928], [26.2260, 26.2092, 26.3427], [26.0719, 26.0483, 26.1254]],
+            )
+        elif model_name == "depth-anything-large":
+            expected_slice = torch.tensor(
+                [[87.9968, 87.7493, 88.2704], [87.1927, 87.6611, 87.3640], [86.7789, 86.9469, 86.7991]]
+            )
+        else:
+            raise ValueError("Not supported")
+
+        assert predicted_depth.shape == expected_shape
+        assert torch.allclose(predicted_depth[0, :3, :3], expected_slice, atol=1e-6)
+        print("Looks ok!")
+
+    if pytorch_dump_folder_path is not None:
+        Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
+        print(f"Saving model and processor to {pytorch_dump_folder_path}")
+        model.save_pretrained(pytorch_dump_folder_path)
+        processor.save_pretrained(pytorch_dump_folder_path)
+
+    if push_to_hub:
+        print("Pushing model and processor to hub...")
+        model.push_to_hub(repo_id=f"LiheYoung/{model_name}-hf")
+        processor.push_to_hub(repo_id=f"LiheYoung/{model_name}-hf")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    # Required parameters
+    parser.add_argument(
+        "--model_name",
+        default="depth-anything-small",
+        type=str,
+        choices=name_to_checkpoint.keys(),
+        help="Name of the model you'd like to convert.",
+    )
+    parser.add_argument(
+        "--pytorch_dump_folder_path",
+        default=None,
+        type=str,
+        help="Path to the output PyTorch model directory.",
+    )
+    parser.add_argument(
+        "--push_to_hub",
+        action="store_true",
+        help="Whether to push the model to the hub after conversion.",
+    )
+    parser.add_argument(
+        "--verify_logits",
+        action="store_false",
+        required=False,
+        help="Verify the logits after conversion (enabled by default; passing this flag disables verification).",
+    )
+
+    args = parser.parse_args()
+    convert_dpt_checkpoint(args.model_name, args.pytorch_dump_folder_path, args.push_to_hub, args.verify_logits)
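A minimal usage sketch of the converter above; the module name is an assumption, while the function and argument names are taken from the script.

```python
# Sketch: invoke the converter programmatically instead of via the CLI.
# The module name `convert_depth_anything_to_hf` is assumed; adjust to the actual file name.
from convert_depth_anything_to_hf import convert_dpt_checkpoint

convert_dpt_checkpoint(
    model_name="depth-anything-small",               # one of name_to_checkpoint.keys()
    pytorch_dump_folder_path="./depth-anything-small-hf",
    push_to_hub=False,
    verify_logits=True,                              # checks against the expected slices above
)
```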
diff --git a/src/transformers/models/depth_anything/modeling_depth_anything.py b/src/transformers/models/depth_anything/modeling_depth_anything.py
new file mode 100644
index 00000000000000..6497759f17825e
--- /dev/null
+++ b/src/transformers/models/depth_anything/modeling_depth_anything.py
@@ -0,0 +1,465 @@
+# coding=utf-8
+# Copyright 2024 TikTok and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" PyTorch Depth Anything model."""
+
+
+from typing import List, Optional, Tuple, Union
+
+import torch
+import torch.utils.checkpoint
+from torch import nn
+
+from ...file_utils import (
+    add_start_docstrings,
+    add_start_docstrings_to_model_forward,
+    replace_return_docstrings,
+)
+from ...modeling_outputs import DepthEstimatorOutput
+from ...modeling_utils import PreTrainedModel
+from ...utils import logging
+from ..auto import AutoBackbone
+from .configuration_depth_anything import DepthAnythingConfig
+
+
+logger = logging.get_logger(__name__)
+
+# General docstring
+_CONFIG_FOR_DOC = "DepthAnythingConfig"
+
+DEPTH_ANYTHING_PRETRAINED_MODEL_ARCHIVE_LIST = [
+    "LiheYoung/depth-anything-small-hf",
+    # See all Depth Anything models at https://huggingface.co/models?filter=depth_anything
+]
+
+
+DEPTH_ANYTHING_START_DOCSTRING = r"""
+    This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it
+    as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and
+    behavior.
+
+    Parameters:
+        config ([`DepthAnythingConfig`]): Model configuration class with all the parameters of the model.
+            Initializing with a config file does not load the weights associated with the model, only the
+            configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+DEPTH_ANYTHING_INPUTS_DOCSTRING = r"""
+    Args:
+        pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
+            Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See [`DPTImageProcessor.__call__`]
+            for details.
+
+        output_attentions (`bool`, *optional*):
+            Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
+            tensors for more detail.
+        output_hidden_states (`bool`, *optional*):
+            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+            more detail.
+        return_dict (`bool`, *optional*):
+            Whether or not to return a [`~file_utils.ModelOutput`] instead of a plain tuple.
+"""
+
+
+class DepthAnythingReassembleLayer(nn.Module):
+    def __init__(self, config, channels, factor):
+        super().__init__()
+        self.projection = nn.Conv2d(in_channels=config.reassemble_hidden_size, out_channels=channels, kernel_size=1)
+
+        # up/down sampling depending on factor
+        if factor > 1:
+            self.resize = nn.ConvTranspose2d(channels, channels, kernel_size=factor, stride=factor, padding=0)
+        elif factor == 1:
+            self.resize = nn.Identity()
+        elif factor < 1:
+            # factor < 1, so downsample
+            self.resize = nn.Conv2d(channels, channels, kernel_size=3, stride=int(1 / factor), padding=1)
+
+    # Copied from transformers.models.dpt.modeling_dpt.DPTReassembleLayer.forward
+    def forward(self, hidden_state):
+        hidden_state = self.projection(hidden_state)
+        hidden_state = self.resize(hidden_state)
+
+        return hidden_state
+
+
+class DepthAnythingReassembleStage(nn.Module):
+    """
+    This class reassembles the hidden states of the backbone into image-like feature representations at various
+    resolutions.
+
+    This happens in 3 stages:
+    1. Take the patch embeddings and reshape them to image-like feature representations.
+    2. Project the channel dimension of the hidden states according to `config.neck_hidden_sizes`.
+    3. Resize the spatial dimensions (height, width).
+
+    Args:
+        config ([`DepthAnythingConfig`]):
+            Model configuration class defining the model architecture.
+    """
+
+    def __init__(self, config):
+        super().__init__()
+
+        self.config = config
+        self.layers = nn.ModuleList()
+        for channels, factor in zip(config.neck_hidden_sizes, config.reassemble_factors):
+            self.layers.append(DepthAnythingReassembleLayer(config, channels=channels, factor=factor))
+
+    def forward(self, hidden_states: List[torch.Tensor], patch_height=None, patch_width=None) -> List[torch.Tensor]:
+        """
+        Args:
+            hidden_states (`List[torch.FloatTensor]`, each of shape `(batch_size, sequence_length + 1, hidden_size)`):
+                List of hidden states from the backbone.
+        """
+        out = []
+
+        for i, hidden_state in enumerate(hidden_states):
+            # reshape to (batch_size, num_channels, height, width)
+            hidden_state = hidden_state[:, 1:]
+            batch_size, _, num_channels = hidden_state.shape
+            hidden_state = hidden_state.reshape(batch_size, patch_height, patch_width, num_channels)
+            hidden_state = hidden_state.permute(0, 3, 1, 2).contiguous()
+            hidden_state = self.layers[i](hidden_state)
+            out.append(hidden_state)
+
+        return out
+
+
+class DepthAnythingPreActResidualLayer(nn.Module):
+    """
+    ResidualConvUnit, a pre-activation residual unit.
+
+    Args:
+        config ([`DepthAnythingConfig`]):
+            Model configuration class defining the model architecture.
+    """
+
+    def __init__(self, config):
+        super().__init__()
+
+        self.activation1 = nn.ReLU()
+        self.convolution1 = nn.Conv2d(
+            config.fusion_hidden_size,
+            config.fusion_hidden_size,
+            kernel_size=3,
+            stride=1,
+            padding=1,
+            bias=True,
+        )
+
+        self.activation2 = nn.ReLU()
+        self.convolution2 = nn.Conv2d(
+            config.fusion_hidden_size,
+            config.fusion_hidden_size,
+            kernel_size=3,
+            stride=1,
+            padding=1,
+            bias=True,
+        )
+
+    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
+        residual = hidden_state
+        hidden_state = self.activation1(hidden_state)
+        hidden_state = self.convolution1(hidden_state)
+        hidden_state = self.activation2(hidden_state)
+        hidden_state = self.convolution2(hidden_state)
+
+        return hidden_state + residual
+
+
+class DepthAnythingFeatureFusionLayer(nn.Module):
+    """Feature fusion layer, merges feature maps from different stages.
+
+    Args:
+        config ([`DepthAnythingConfig`]):
+            Model configuration class defining the model architecture.
+    """
+
+    def __init__(self, config):
+        super().__init__()
+
+        self.projection = nn.Conv2d(config.fusion_hidden_size, config.fusion_hidden_size, kernel_size=1, bias=True)
+
+        self.residual_layer1 = DepthAnythingPreActResidualLayer(config)
+        self.residual_layer2 = DepthAnythingPreActResidualLayer(config)
+
+    def forward(self, hidden_state, residual=None, size=None):
+        if residual is not None:
+            if hidden_state.shape != residual.shape:
+                residual = nn.functional.interpolate(
+                    residual, size=(hidden_state.shape[2], hidden_state.shape[3]), mode="bilinear", align_corners=False
+                )
+            hidden_state = hidden_state + self.residual_layer1(residual)
+
+        hidden_state = self.residual_layer2(hidden_state)
+
+        modifier = {"scale_factor": 2} if size is None else {"size": size}
+
+        hidden_state = nn.functional.interpolate(
+            hidden_state,
+            **modifier,
+            mode="bilinear",
+            align_corners=True,
+        )
+        hidden_state = self.projection(hidden_state)
+
+        return hidden_state
+
+
+class DepthAnythingFeatureFusionStage(nn.Module):
+    # Copied from transformers.models.dpt.modeling_dpt.DPTFeatureFusionStage.__init__ with DPT->DepthAnything
+    def __init__(self, config):
+        super().__init__()
+        self.layers = nn.ModuleList()
+        for _ in range(len(config.neck_hidden_sizes)):
+            self.layers.append(DepthAnythingFeatureFusionLayer(config))
+
+    def forward(self, hidden_states, size=None):
+        # reverse the hidden_states so that we start from the last backbone stage
+        hidden_states = hidden_states[::-1]
+
+        fused_hidden_states = []
+        # first layer only uses the last hidden_state
+        size = hidden_states[1].shape[2:]
+        fused_hidden_state = self.layers[0](hidden_states[0], size=size)
+        fused_hidden_states.append(fused_hidden_state)
+
+        # looping from the last layer to the second
+        for idx, (hidden_state, layer) in enumerate(zip(hidden_states[1:], self.layers[1:])):
+            size = hidden_states[1:][idx + 1].shape[2:] if idx != (len(hidden_states[1:]) - 1) else None
+
+            fused_hidden_state = layer(fused_hidden_state, hidden_state, size=size)
+
+            fused_hidden_states.append(fused_hidden_state)
+
+        return fused_hidden_states
+
+
+# Copied from transformers.models.dpt.modeling_dpt.DPTPreTrainedModel with DPT->DepthAnything,dpt->depth_anything
+class DepthAnythingPreTrainedModel(PreTrainedModel):
+    """
+    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
+    models.
+    """
+
+    config_class = DepthAnythingConfig
+    base_model_prefix = "depth_anything"
+    main_input_name = "pixel_values"
+    supports_gradient_checkpointing = True
+
+    def _init_weights(self, module):
+        """Initialize the weights"""
+        if isinstance(module, (nn.Linear, nn.Conv2d, nn.ConvTranspose2d)):
+            # Slightly different from the TF version which uses truncated_normal for initialization
+            # cf https://github.com/pytorch/pytorch/pull/5617
+            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.LayerNorm):
+            module.bias.data.zero_()
+            module.weight.data.fill_(1.0)
+
+
+class DepthAnythingNeck(nn.Module):
+    """
+    DepthAnythingNeck. A neck is a module that is normally used between the backbone and the head. It takes a list of tensors as
+    input and produces another list of tensors as output. For DepthAnything, it includes 2 stages:
+
+    * DepthAnythingReassembleStage
+    * DepthAnythingFeatureFusionStage.
+
+    Args:
+        config ([`DepthAnythingConfig`]):
+            Model configuration class defining the model architecture.
+    """
+
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+
+        self.reassemble_stage = DepthAnythingReassembleStage(config)
+
+        self.convs = nn.ModuleList()
+        for channel in config.neck_hidden_sizes:
+            self.convs.append(nn.Conv2d(channel, config.fusion_hidden_size, kernel_size=3, padding=1, bias=False))
+
+        # fusion
+        self.fusion_stage = DepthAnythingFeatureFusionStage(config)
+
+    def forward(self, hidden_states: List[torch.Tensor], patch_height=None, patch_width=None) -> List[torch.Tensor]:
+        """
+        Args:
+            hidden_states (`List[torch.FloatTensor]`, each of shape `(batch_size, sequence_length, hidden_size)` or `(batch_size, hidden_size, height, width)`):
+                List of hidden states from the backbone.
+        """
+        if not isinstance(hidden_states, (tuple, list)):
+            raise ValueError("hidden_states should be a tuple or list of tensors")
+
+        if len(hidden_states) != len(self.config.neck_hidden_sizes):
+            raise ValueError("The number of hidden states should be equal to the number of neck hidden sizes.")
+
+        # postprocess hidden states
+        hidden_states = self.reassemble_stage(hidden_states, patch_height, patch_width)
+
+        features = [self.convs[i](feature) for i, feature in enumerate(hidden_states)]
+
+        # fusion blocks
+        output = self.fusion_stage(features)
+
+        return output
+
+
+class DepthAnythingDepthEstimationHead(nn.Module):
+    """
+    Output head consisting of 3 convolutional layers. It progressively halves the feature dimension and upsamples
+    the predictions to the input resolution after the first convolutional layer (details can be found in the DPT paper's
+    supplementary material).
+    """
+
+    def __init__(self, config):
+        super().__init__()
+
+        self.head_in_index = config.head_in_index
+        self.patch_size = config.patch_size
+
+        features = config.fusion_hidden_size
+        self.conv1 = nn.Conv2d(features, features // 2, kernel_size=3, stride=1, padding=1)
+        self.conv2 = nn.Conv2d(features // 2, config.head_hidden_size, kernel_size=3, stride=1, padding=1)
+        self.activation1 = nn.ReLU()
+        self.conv3 = nn.Conv2d(config.head_hidden_size, 1, kernel_size=1, stride=1, padding=0)
+        self.activation2 = nn.ReLU()
+
+    def forward(self, hidden_states: List[torch.Tensor], patch_height, patch_width) -> torch.Tensor:
+        hidden_states = hidden_states[self.head_in_index]
+
+        predicted_depth = self.conv1(hidden_states)
+        predicted_depth = nn.functional.interpolate(
+            predicted_depth,
+            (int(patch_height * self.patch_size), int(patch_width * self.patch_size)),
+            mode="bilinear",
+            align_corners=True,
+        )
+        predicted_depth = self.conv2(predicted_depth)
+        predicted_depth = self.activation1(predicted_depth)
+        predicted_depth = self.conv3(predicted_depth)
+        predicted_depth = self.activation2(predicted_depth)
+        predicted_depth = predicted_depth.squeeze(dim=1)  # shape (batch_size, height, width)
+
+        return predicted_depth
+
+
+@add_start_docstrings(
+    """
+    Depth Anything Model with a depth estimation head on top (consisting of 3 convolutional layers), e.g. for KITTI, NYUv2.
+    """,
+    DEPTH_ANYTHING_START_DOCSTRING,
+)
+class DepthAnythingForDepthEstimation(DepthAnythingPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+
+        self.backbone = AutoBackbone.from_config(config.backbone_config)
+        self.neck = DepthAnythingNeck(config)
+        self.head = DepthAnythingDepthEstimationHead(config)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    @add_start_docstrings_to_model_forward(DEPTH_ANYTHING_INPUTS_DOCSTRING)
+    @replace_return_docstrings(output_type=DepthEstimatorOutput, config_class=_CONFIG_FOR_DOC)
+    def forward(
+        self,
+        pixel_values: torch.FloatTensor,
+        labels: Optional[torch.LongTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple[torch.Tensor], DepthEstimatorOutput]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size, height, width)`, *optional*):
+            Ground truth depth estimation maps for computing the loss.
+
+        Returns:
+
+        Examples:
+        ```python
+        >>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation
+        >>> import torch
+        >>> import numpy as np
+        >>> from PIL import Image
+        >>> import requests
+
+        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+        >>> image = Image.open(requests.get(url, stream=True).raw)
+
+        >>> image_processor = AutoImageProcessor.from_pretrained("LiheYoung/depth-anything-small-hf")
+        >>> model = AutoModelForDepthEstimation.from_pretrained("LiheYoung/depth-anything-small-hf")
+
+        >>> # prepare image for the model
+        >>> inputs = image_processor(images=image, return_tensors="pt")
+
+        >>> with torch.no_grad():
+        ...     outputs = model(**inputs)
+        ...     predicted_depth = outputs.predicted_depth
+
+        >>> # interpolate to original size
+        >>> prediction = torch.nn.functional.interpolate(
+        ...     predicted_depth.unsqueeze(1),
+        ...     size=image.size[::-1],
+        ...     mode="bicubic",
+        ...     align_corners=False,
+        ... )
+
+        >>> # visualize the prediction
+        >>> output = prediction.squeeze().cpu().numpy()
+        >>> formatted = (output * 255 / np.max(output)).astype("uint8")
+        >>> depth = Image.fromarray(formatted)
+        ```"""
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+
+        outputs = self.backbone.forward_with_filtered_kwargs(
+            pixel_values, output_hidden_states=output_hidden_states, output_attentions=output_attentions
+        )
+        hidden_states = outputs.feature_maps
+
+        _, _, height, width = pixel_values.shape
+        patch_size = self.config.patch_size
+        patch_height = height // patch_size
+        patch_width = width // patch_size
+
+        hidden_states = self.neck(hidden_states, patch_height, patch_width)
+
+        predicted_depth = self.head(hidden_states, patch_height, patch_width)
+
+        loss = None
+        if labels is not None:
+            raise NotImplementedError("Training is not implemented yet")
+
+        if not return_dict:
+            if output_hidden_states:
+                output = (predicted_depth,) + outputs[1:]
+            else:
+                output = (predicted_depth,) + outputs[2:]
+            return ((loss,) + output) if loss is not None else output
+
+        return DepthEstimatorOutput(
+            loss=loss,
+            predicted_depth=predicted_depth,
+            hidden_states=outputs.hidden_states if output_hidden_states else None,
+            attentions=outputs.attentions,
+        )
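As a sanity check on the head's upsampling, a small arithmetic sketch of the patch-grid computation in `forward`, assuming `patch_size=14` (the ViT-*/14 backbones referenced by the conversion script) and the 518x686 verification resolution used there.

```python
# Sketch of the patch-grid arithmetic in DepthAnythingForDepthEstimation.forward,
# assuming patch_size = 14 and the 518x686 image used for logit verification.
height, width = 518, 686
patch_size = 14
patch_height = height // patch_size   # 37
patch_width = width // patch_size     # 49

# The head interpolates back to (patch_height * patch_size, patch_width * patch_size),
# i.e. (518, 686), matching expected_shape = [1, 518, 686] in the conversion script.
assert (patch_height * patch_size, patch_width * patch_size) == (518, 686)
```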
diff --git a/src/transformers/models/deta/configuration_deta.py b/src/transformers/models/deta/configuration_deta.py
index 836e9732e68ed8..d5a3709b91e372 100644
--- a/src/transformers/models/deta/configuration_deta.py
+++ b/src/transformers/models/deta/configuration_deta.py
@@ -14,7 +14,6 @@
 # limitations under the License.
 """ DETA model configuration"""
 
-import copy
 
 from ...configuration_utils import PretrainedConfig
 from ...utils import logging
@@ -41,6 +40,18 @@ class DetaConfig(PretrainedConfig):
     Args:
         backbone_config (`PretrainedConfig` or `dict`, *optional*, defaults to `ResNetConfig()`):
             The configuration of the backbone model.
+        backbone (`str`, *optional*):
+            Name of backbone to use when `backbone_config` is `None`. If `use_pretrained_backbone` is `True`, this
+            will load the corresponding pretrained weights from the timm or transformers library. If `use_pretrained_backbone`
+            is `False`, this loads the backbone's config and uses that to initialize the backbone with random weights.
+        use_pretrained_backbone (`bool`, *optional*, defaults to `False`):
+            Whether to use pretrained weights for the backbone.
+        use_timm_backbone (`bool`, *optional*, defaults to `False`):
+            Whether to load `backbone` from the timm library. If `False`, the backbone is loaded from the transformers
+            library.
+        backbone_kwargs (`dict`, *optional*):
+            Keyword arguments to be passed to AutoBackbone when loading from a checkpoint
+            e.g. `{'out_indices': (0, 1, 2, 3)}`. Cannot be specified if `backbone_config` is set.
         num_queries (`int`, *optional*, defaults to 900):
             Number of object queries, i.e. detection slots. This is the maximal number of objects [`DetaModel`] can
             detect in a single image. In case `two_stage` is set to `True`, we use `two_stage_num_proposals` instead.
@@ -71,7 +82,7 @@ class DetaConfig(PretrainedConfig):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
         init_xavier_std (`float`, *optional*, defaults to 1):
             The scaling factor used for the Xavier initialization gain in the HM Attention map module.
-        encoder_layerdrop: (`float`, *optional*, defaults to 0.0):
+        encoder_layerdrop (`float`, *optional*, defaults to 0.0):
             The LayerDrop probability for the encoder. See the [LayerDrop paper](see https://arxiv.org/abs/1909.11556)
             for more details.
         auxiliary_loss (`bool`, *optional*, defaults to `False`):
@@ -110,6 +121,13 @@ class DetaConfig(PretrainedConfig):
             based on the predictions from the previous layer.
         focal_alpha (`float`, *optional*, defaults to 0.25):
             Alpha parameter in the focal loss.
+        assign_first_stage (`bool`, *optional*, defaults to `True`):
+            Whether to assign each prediction i to the highest overlapping ground truth object if the overlap is larger than a threshold of 0.7.
+        assign_second_stage (`bool`, *optional*, defaults to `True`):
+            Whether the second-stage assignment procedure should closely follow the first-stage assignment procedure.
+        disable_custom_kernels (`bool`, *optional*, defaults to `True`):
+            Disable the use of custom CUDA and CPU kernels. This option is necessary for the ONNX export, as custom
+            kernels are not supported by PyTorch ONNX export.
 
     Examples:
 
@@ -125,6 +143,7 @@ class DetaConfig(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "deta"
     attribute_map = {
         "hidden_size": "d_model",
@@ -134,6 +153,10 @@ class DetaConfig(PretrainedConfig):
     def __init__(
         self,
         backbone_config=None,
+        backbone=None,
+        use_pretrained_backbone=False,
+        use_timm_backbone=False,
+        backbone_kwargs=None,
         num_queries=900,
         max_position_embeddings=2048,
         encoder_layers=6,
@@ -161,6 +184,7 @@ def __init__(
         two_stage_num_proposals=300,
         with_box_refine=True,
         assign_first_stage=True,
+        assign_second_stage=True,
         class_cost=1,
         bbox_cost=5,
         giou_cost=2,
@@ -170,9 +194,16 @@ def __init__(
         giou_loss_coefficient=2,
         eos_coefficient=0.1,
         focal_alpha=0.25,
+        disable_custom_kernels=True,
         **kwargs,
     ):
-        if backbone_config is None:
+        if use_pretrained_backbone:
+            raise ValueError("Pretrained backbones are not supported yet.")
+
+        if backbone_config is not None and backbone is not None:
+            raise ValueError("You can't specify both `backbone` and `backbone_config`.")
+
+        if backbone_config is None and backbone is None:
             logger.info("`backbone_config` is `None`. Initializing the config with the default `ResNet` backbone.")
             backbone_config = CONFIG_MAPPING["resnet"](out_features=["stage2", "stage3", "stage4"])
         else:
@@ -181,7 +212,14 @@ def __init__(
                 config_class = CONFIG_MAPPING[backbone_model_type]
                 backbone_config = config_class.from_dict(backbone_config)
 
+        if backbone_kwargs is not None and backbone_kwargs and backbone_config is not None:
+            raise ValueError("You can't specify both `backbone_kwargs` and `backbone_config`.")
+
         self.backbone_config = backbone_config
+        self.backbone = backbone
+        self.use_pretrained_backbone = use_pretrained_backbone
+        self.use_timm_backbone = use_timm_backbone
+        self.backbone_kwargs = backbone_kwargs
         self.num_queries = num_queries
         self.max_position_embeddings = max_position_embeddings
         self.d_model = d_model
@@ -208,6 +246,7 @@ def __init__(
         self.two_stage_num_proposals = two_stage_num_proposals
         self.with_box_refine = with_box_refine
         self.assign_first_stage = assign_first_stage
+        self.assign_second_stage = assign_second_stage
         if two_stage is True and with_box_refine is False:
             raise ValueError("If two_stage is True, with_box_refine must be True.")
         # Hungarian matcher
@@ -221,6 +260,7 @@ def __init__(
         self.giou_loss_coefficient = giou_loss_coefficient
         self.eos_coefficient = eos_coefficient
         self.focal_alpha = focal_alpha
+        self.disable_custom_kernels = disable_custom_kernels
         super().__init__(is_encoder_decoder=is_encoder_decoder, **kwargs)
 
     @property
@@ -230,13 +270,3 @@ def num_attention_heads(self) -> int:
     @property
     def hidden_size(self) -> int:
         return self.d_model
-
-    def to_dict(self):
-        """
-        Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`]. Returns:
-            `Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,
-        """
-        output = copy.deepcopy(self.__dict__)
-        output["backbone_config"] = self.backbone_config.to_dict()
-        output["model_type"] = self.__class__.model_type
-        return output
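A hedged sketch of how the new backbone-related `DetaConfig` arguments fit together; the argument names come from the diff above, while the concrete values are illustrative only.

```python
from transformers import DetaConfig

# Illustrative values only; use_pretrained_backbone=True would raise per the new check.
config = DetaConfig(
    backbone="microsoft/resnet-50",                 # used when backbone_config is None
    use_pretrained_backbone=False,                  # pretrained backbones are not supported yet
    use_timm_backbone=False,                        # resolve the backbone via transformers, not timm
    backbone_kwargs={"out_indices": (0, 1, 2, 3)},  # only allowed when backbone_config is unset
    disable_custom_kernels=True,                    # e.g. required for ONNX export
)
print(config.backbone, config.backbone_kwargs)
```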
diff --git a/src/transformers/models/deta/image_processing_deta.py b/src/transformers/models/deta/image_processing_deta.py
index 717fbfdd540a97..45c5c6cb285a8f 100644
--- a/src/transformers/models/deta/image_processing_deta.py
+++ b/src/transformers/models/deta/image_processing_deta.py
@@ -15,7 +15,6 @@
 """Image processor class for Deformable DETR."""
 
 import pathlib
-import warnings
 from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple, Union
 
 import numpy as np
@@ -26,7 +25,6 @@
     PaddingMode,
     center_to_corners_format,
     corners_to_center_format,
-    normalize,
     pad,
     rescale,
     resize,
@@ -36,16 +34,19 @@
 from ...image_utils import (
     IMAGENET_DEFAULT_MEAN,
     IMAGENET_DEFAULT_STD,
+    AnnotationFormat,
+    AnnotationType,
     ChannelDimension,
     ImageInput,
     PILImageResampling,
     get_image_size,
     infer_channel_dimension_format,
     is_batched,
+    is_scaled_image,
     to_numpy_array,
-    valid_coco_detection_annotations,
-    valid_coco_panoptic_annotations,
     valid_images,
+    validate_annotations,
+    validate_preprocess_arguments,
 )
 from ...utils import (
     is_flax_available,
@@ -56,13 +57,15 @@
     is_torch_tensor,
     is_torchvision_available,
     is_vision_available,
+    logging,
 )
-from ...utils.generic import ExplicitEnum, TensorType
+from ...utils.generic import TensorType
 
 
 if is_torch_available():
     import torch
 
+
 if is_torchvision_available():
     from torchvision.ops.boxes import batched_nms
 
@@ -70,12 +73,9 @@
     import PIL
 
 
-class AnnotionFormat(ExplicitEnum):
-    COCO_DETECTION = "coco_detection"
-    COCO_PANOPTIC = "coco_panoptic"
-
+logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
 
-SUPPORTED_ANNOTATION_FORMATS = (AnnotionFormat.COCO_DETECTION, AnnotionFormat.COCO_PANOPTIC)
+SUPPORTED_ANNOTATION_FORMATS = (AnnotationFormat.COCO_DETECTION, AnnotationFormat.COCO_PANOPTIC)
 
 
 # Copied from transformers.models.detr.image_processing_detr.get_size_with_aspect_ratio
@@ -112,7 +112,10 @@ def get_size_with_aspect_ratio(image_size, size, max_size=None) -> Tuple[int, in
 
 # Copied from transformers.models.detr.image_processing_detr.get_resize_output_image_size
 def get_resize_output_image_size(
-    input_image: np.ndarray, size: Union[int, Tuple[int, int], List[int]], max_size: Optional[int] = None
+    input_image: np.ndarray,
+    size: Union[int, Tuple[int, int], List[int]],
+    max_size: Optional[int] = None,
+    input_data_format: Optional[Union[str, ChannelDimension]] = None,
 ) -> Tuple[int, int]:
     """
     Computes the output image size given the input image size and the desired output size. If the desired output size
@@ -120,14 +123,16 @@ def get_resize_output_image_size(
     image size is computed by keeping the aspect ratio of the input image size.
 
     Args:
-        image_size (`Tuple[int, int]`):
-            The input image size.
-        size (`int`):
+        input_image (`np.ndarray`):
+            The image to resize.
+        size (`int` or `Tuple[int, int]` or `List[int]`):
             The desired output size.
         max_size (`int`, *optional*):
             The maximum allowed output size.
+        input_data_format (`ChannelDimension` or `str`, *optional*):
+            The channel dimension format of the input image. If not provided, it will be inferred from the input image.
     """
-    image_size = get_image_size(input_image)
+    image_size = get_image_size(input_image, input_data_format)
     if isinstance(size, (list, tuple)):
         return size
 
@@ -197,23 +202,28 @@ def max_across_indices(values: Iterable[Any]) -> List[Any]:
 
 
 # Copied from transformers.models.detr.image_processing_detr.get_max_height_width
-def get_max_height_width(images: List[np.ndarray]) -> List[int]:
+def get_max_height_width(
+    images: List[np.ndarray], input_data_format: Optional[Union[str, ChannelDimension]] = None
+) -> List[int]:
     """
     Get the maximum height and width across all images in a batch.
     """
-    input_channel_dimension = infer_channel_dimension_format(images[0])
+    if input_data_format is None:
+        input_data_format = infer_channel_dimension_format(images[0])
 
-    if input_channel_dimension == ChannelDimension.FIRST:
+    if input_data_format == ChannelDimension.FIRST:
         _, max_height, max_width = max_across_indices([img.shape for img in images])
-    elif input_channel_dimension == ChannelDimension.LAST:
+    elif input_data_format == ChannelDimension.LAST:
         max_height, max_width, _ = max_across_indices([img.shape for img in images])
     else:
-        raise ValueError(f"Invalid channel dimension format: {input_channel_dimension}")
+        raise ValueError(f"Invalid channel dimension format: {input_data_format}")
     return (max_height, max_width)
 
 
 # Copied from transformers.models.detr.image_processing_detr.make_pixel_mask
-def make_pixel_mask(image: np.ndarray, output_size: Tuple[int, int]) -> np.ndarray:
+def make_pixel_mask(
+    image: np.ndarray, output_size: Tuple[int, int], input_data_format: Optional[Union[str, ChannelDimension]] = None
+) -> np.ndarray:
     """
     Make a pixel mask for the image, where 1 indicates a valid pixel and 0 indicates padding.
 
@@ -223,7 +233,7 @@ def make_pixel_mask(image: np.ndarray, output_size: Tuple[int, int]) -> np.ndarr
         output_size (`Tuple[int, int]`):
             Output size of the mask.
     """
-    input_height, input_width = get_image_size(image)
+    input_height, input_width = get_image_size(image, channel_dim=input_data_format)
     mask = np.zeros(output_size, dtype=np.int64)
     mask[:input_height, :input_width] = 1
     return mask
@@ -265,11 +275,16 @@ def convert_coco_poly_to_mask(segmentations, height: int, width: int) -> np.ndar
 
 
 # Copied from transformers.models.detr.image_processing_detr.prepare_coco_detection_annotation with DETR->DETA
-def prepare_coco_detection_annotation(image, target, return_segmentation_masks: bool = False):
+def prepare_coco_detection_annotation(
+    image,
+    target,
+    return_segmentation_masks: bool = False,
+    input_data_format: Optional[Union[ChannelDimension, str]] = None,
+):
     """
     Convert the target in COCO format into the format expected by DETA.
     """
-    image_height, image_width = get_image_size(image)
+    image_height, image_width = get_image_size(image, channel_dim=input_data_format)
 
     image_id = target["image_id"]
     image_id = np.asarray([image_id], dtype=np.int64)
@@ -304,10 +319,13 @@ def prepare_coco_detection_annotation(image, target, return_segmentation_masks:
 
     if annotations and "keypoints" in annotations[0]:
         keypoints = [obj["keypoints"] for obj in annotations]
+        # Convert the keypoints list to a numpy array before filtering
         keypoints = np.asarray(keypoints, dtype=np.float32)
+        # Apply the keep mask here to filter the relevant annotations
+        keypoints = keypoints[keep]
         num_keypoints = keypoints.shape[0]
         keypoints = keypoints.reshape((-1, 3)) if num_keypoints else keypoints
-        new_target["keypoints"] = keypoints[keep]
+        new_target["keypoints"] = keypoints
 
     if return_segmentation_masks:
         segmentation_masks = [obj["segmentation"] for obj in annotations]
@@ -354,12 +372,16 @@ def masks_to_boxes(masks: np.ndarray) -> np.ndarray:
 
 # Copied from transformers.models.detr.image_processing_detr.prepare_coco_panoptic_annotation with DETR->DETA
 def prepare_coco_panoptic_annotation(
-    image: np.ndarray, target: Dict, masks_path: Union[str, pathlib.Path], return_masks: bool = True
+    image: np.ndarray,
+    target: Dict,
+    masks_path: Union[str, pathlib.Path],
+    return_masks: bool = True,
+    input_data_format: Union[ChannelDimension, str] = None,
 ) -> Dict:
     """
     Prepare a coco panoptic annotation for DETA.
     """
-    image_height, image_width = get_image_size(image)
+    image_height, image_width = get_image_size(image, channel_dim=input_data_format)
     annotation_path = pathlib.Path(masks_path) / target["file_name"]
 
     new_target = {}
@@ -472,16 +494,21 @@ class DetaImageProcessor(BaseImageProcessor):
         image_std (`float` or `List[float]`, *optional*, defaults to `IMAGENET_DEFAULT_STD`):
             Standard deviation values to use when normalizing the image. Can be a single value or a list of values, one
             for each channel. Can be overridden by the `image_std` parameter in the `preprocess` method.
+        do_convert_annotations (`bool`, *optional*, defaults to `True`):
+            Controls whether to convert the annotations to the format expected by the DETA model. Converts the
+            bounding boxes to the format `(center_x, center_y, width, height)` and in the range `[0, 1]`.
+            Can be overridden by the `do_convert_annotations` parameter in the `preprocess` method.
         do_pad (`bool`, *optional*, defaults to `True`):
-            Controls whether to pad the image to the largest image in a batch and create a pixel mask. Can be
-            overridden by the `do_pad` parameter in the `preprocess` method.
+            Controls whether to pad the image. Can be overridden by the `do_pad` parameter in the `preprocess`
+            method. If `True`, pads the images in the batch to the largest height and width found in the batch.
+            Padding will be applied to the bottom and right of the image with zeros.
     """
 
     model_input_names = ["pixel_values", "pixel_mask"]
 
     def __init__(
         self,
-        format: Union[str, AnnotionFormat] = AnnotionFormat.COCO_DETECTION,
+        format: Union[str, AnnotationFormat] = AnnotationFormat.COCO_DETECTION,
         do_resize: bool = True,
         size: Dict[str, int] = None,
         resample: PILImageResampling = PILImageResampling.BILINEAR,
@@ -490,6 +517,7 @@ def __init__(
         do_normalize: bool = True,
         image_mean: Union[float, List[float]] = None,
         image_std: Union[float, List[float]] = None,
+        do_convert_annotations: bool = True,
         do_pad: bool = True,
         **kwargs,
     ) -> None:
@@ -499,6 +527,9 @@ def __init__(
         size = size if size is not None else {"shortest_edge": 800, "longest_edge": 1333}
         size = get_size_dict(size, default_to_square=False)
 
+        if do_convert_annotations is None:
+            do_convert_annotations = do_normalize
+
         super().__init__(**kwargs)
         self.format = format
         self.do_resize = do_resize
@@ -507,6 +538,7 @@ def __init__(
         self.do_rescale = do_rescale
         self.rescale_factor = rescale_factor
         self.do_normalize = do_normalize
+        self.do_convert_annotations = do_convert_annotations
         self.image_mean = image_mean if image_mean is not None else IMAGENET_DEFAULT_MEAN
         self.image_std = image_std if image_std is not None else IMAGENET_DEFAULT_STD
         self.do_pad = do_pad
@@ -516,22 +548,29 @@ def prepare_annotation(
         self,
         image: np.ndarray,
         target: Dict,
-        format: Optional[AnnotionFormat] = None,
+        format: Optional[AnnotationFormat] = None,
         return_segmentation_masks: bool = None,
         masks_path: Optional[Union[str, pathlib.Path]] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
     ) -> Dict:
         """
         Prepare an annotation for feeding into DETA model.
         """
         format = format if format is not None else self.format
 
-        if format == AnnotionFormat.COCO_DETECTION:
+        if format == AnnotationFormat.COCO_DETECTION:
             return_segmentation_masks = False if return_segmentation_masks is None else return_segmentation_masks
-            target = prepare_coco_detection_annotation(image, target, return_segmentation_masks)
-        elif format == AnnotionFormat.COCO_PANOPTIC:
+            target = prepare_coco_detection_annotation(
+                image, target, return_segmentation_masks, input_data_format=input_data_format
+            )
+        elif format == AnnotationFormat.COCO_PANOPTIC:
             return_segmentation_masks = True if return_segmentation_masks is None else return_segmentation_masks
             target = prepare_coco_panoptic_annotation(
-                image, target, masks_path=masks_path, return_masks=return_segmentation_masks
+                image,
+                target,
+                masks_path=masks_path,
+                return_masks=return_segmentation_masks,
+                input_data_format=input_data_format,
             )
         else:
             raise ValueError(f"Format {format} is not supported.")
@@ -539,8 +578,8 @@ def prepare_annotation(
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.prepare
     def prepare(self, image, target, return_segmentation_masks=None, masks_path=None):
-        warnings.warn(
-            "The `prepare` method is deprecated and will be removed in a future version. "
+        logger.warning_once(
+            "The `prepare` method is deprecated and will be removed in v4.33. "
+            "Please use `prepare_annotation` instead. Note: the `prepare_annotation` method "
+            "does not return the image anymore.",
+        )
@@ -549,17 +588,17 @@ def prepare(self, image, target, return_segmentation_masks=None, masks_path=None
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.convert_coco_poly_to_mask
     def convert_coco_poly_to_mask(self, *args, **kwargs):
-        warnings.warn("The `convert_coco_poly_to_mask` method is deprecated and will be removed in a future version. ")
+        logger.warning_once("The `convert_coco_poly_to_mask` method is deprecated and will be removed in v4.33. ")
         return convert_coco_poly_to_mask(*args, **kwargs)
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.prepare_coco_detection
     def prepare_coco_detection(self, *args, **kwargs):
-        warnings.warn("The `prepare_coco_detection` method is deprecated and will be removed in a future version. ")
+        logger.warning_once("The `prepare_coco_detection` method is deprecated and will be removed in v4.33. ")
         return prepare_coco_detection_annotation(*args, **kwargs)
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.prepare_coco_panoptic
     def prepare_coco_panoptic(self, *args, **kwargs):
-        warnings.warn("The `prepare_coco_panoptic` method is deprecated and will be removed in a future version. ")
+        logger.warning_once("The `prepare_coco_panoptic` method is deprecated and will be removed in v4.33. ")
         return prepare_coco_panoptic_annotation(*args, **kwargs)
 
     def resize(
@@ -568,15 +607,32 @@ def resize(
         size: Dict[str, int],
         resample: PILImageResampling = PILImageResampling.BILINEAR,
         data_format: Optional[ChannelDimension] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
         **kwargs,
     ) -> np.ndarray:
         """
         Resize the image to the given size. Size can be `min_size` (scalar) or `(height, width)` tuple. If size is an
         int, smaller edge of the image will be matched to this number.
+
+        Args:
+            image (`np.ndarray`):
+                Image to resize.
+            size (`Dict[str, int]`):
+                The desired output size. Can contain keys `shortest_edge` and `longest_edge` or `height` and `width`.
+            resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BILINEAR`):
+                Resampling filter to use if resizing the image.
+            data_format (`ChannelDimension`, *optional*):
+                The channel dimension format for the output image. If unset, the channel dimension format of the input
+                image is used.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format of the input image. If not provided, it will be inferred from the input
+                image.
         """
         size = get_size_dict(size, default_to_square=False)
         if "shortest_edge" in size and "longest_edge" in size:
-            size = get_resize_output_image_size(image, size["shortest_edge"], size["longest_edge"])
+            size = get_resize_output_image_size(
+                image, size["shortest_edge"], size["longest_edge"], input_data_format=input_data_format
+            )
         elif "height" in size and "width" in size:
             size = (size["height"], size["width"])
         else:
@@ -584,7 +640,9 @@ def resize(
                 "Size must contain 'height' and 'width' keys or 'shortest_edge' and 'longest_edge' keys. Got"
                 f" {size.keys()}."
             )
-        image = resize(image, size=size, resample=resample, data_format=data_format)
+        image = resize(
+            image, size=size, resample=resample, data_format=data_format, input_data_format=input_data_format
+        )
         return image
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.resize_annotation
@@ -603,130 +661,195 @@ def resize_annotation(
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.rescale
     def rescale(
-        self, image: np.ndarray, rescale_factor: Union[float, int], data_format: Optional[ChannelDimension] = None
-    ) -> np.ndarray:
-        """
-        Rescale the image by the given factor.
-        """
-        return rescale(image, rescale_factor, data_format=data_format)
-
-    # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.normalize
-    def normalize(
         self,
         image: np.ndarray,
-        mean: Union[float, Iterable[float]],
-        std: Union[float, Iterable[float]],
-        data_format: Optional[ChannelDimension] = None,
+        rescale_factor: float,
+        data_format: Optional[Union[str, ChannelDimension]] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
     ) -> np.ndarray:
         """
-        Normalize the image with the given mean and standard deviation.
+        Rescale the image by the given factor. image = image * rescale_factor.
+
+        Args:
+            image (`np.ndarray`):
+                Image to rescale.
+            rescale_factor (`float`):
+                The value to use for rescaling.
+            data_format (`str` or `ChannelDimension`, *optional*):
+                The channel dimension format for the output image. If unset, the channel dimension format of the input
+                image is used. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+            input_data_format (`str` or `ChannelDimension`, *optional*):
+                The channel dimension format for the input image. If unset, is inferred from the input image. Can be
+                one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
         """
-        return normalize(image, mean=mean, std=std, data_format=data_format)
+        return rescale(image, rescale_factor, data_format=data_format, input_data_format=input_data_format)
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.normalize_annotation
     def normalize_annotation(self, annotation: Dict, image_size: Tuple[int, int]) -> Dict:
         """
         Normalize the boxes in the annotation from `[top_left_x, top_left_y, bottom_right_x, bottom_right_y]` to
-        `[center_x, center_y, width, height]` format.
+        `[center_x, center_y, width, height]` format and from absolute to relative pixel values.
         """
         return normalize_annotation(annotation, image_size=image_size)
 
-    # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.pad_and_create_pixel_mask
-    def pad_and_create_pixel_mask(
+    # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor._update_annotation_for_padded_image
+    def _update_annotation_for_padded_image(
         self,
-        pixel_values_list: List[ImageInput],
-        return_tensors: Optional[Union[str, TensorType]] = None,
-        data_format: Optional[ChannelDimension] = None,
-    ) -> BatchFeature:
+        annotation: Dict,
+        input_image_size: Tuple[int, int],
+        output_image_size: Tuple[int, int],
+        padding,
+        update_bboxes,
+    ) -> Dict:
         """
-        Pads a batch of images with zeros to the size of largest height and width in the batch and returns their
-        corresponding pixel mask.
-
-        Args:
-            images (`List[np.ndarray]`):
-                Batch of images to pad.
-            return_tensors (`str` or `TensorType`, *optional*):
-                The type of tensors to return. Can be one of:
-                    - Unset: Return a list of `np.ndarray`.
-                    - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
-                    - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
-                    - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
-                    - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
-            data_format (`str` or `ChannelDimension`, *optional*):
-                The channel dimension format of the image. If not provided, it will be the same as the input image.
+        Update the annotation for a padded image.
         """
-        warnings.warn(
-            "This method is deprecated and will be removed in v4.27.0. Please use pad instead.", FutureWarning
-        )
-        # pad expects a list of np.ndarray, but the previous feature extractors expected torch tensors
-        images = [to_numpy_array(image) for image in pixel_values_list]
-        return self.pad(
-            images=images,
-            return_pixel_mask=True,
-            return_tensors=return_tensors,
-            data_format=data_format,
-        )
+        new_annotation = {}
+        new_annotation["size"] = output_image_size
+
+        for key, value in annotation.items():
+            if key == "masks":
+                masks = value
+                masks = pad(
+                    masks,
+                    padding,
+                    mode=PaddingMode.CONSTANT,
+                    constant_values=0,
+                    input_data_format=ChannelDimension.FIRST,
+                )
+                masks = safe_squeeze(masks, 1)
+                new_annotation["masks"] = masks
+            elif key == "boxes" and update_bboxes:
+                boxes = value
+                boxes *= np.asarray(
+                    [
+                        input_image_size[1] / output_image_size[1],
+                        input_image_size[0] / output_image_size[0],
+                        input_image_size[1] / output_image_size[1],
+                        input_image_size[0] / output_image_size[0],
+                    ]
+                )
+                new_annotation["boxes"] = boxes
+            elif key == "size":
+                new_annotation["size"] = output_image_size
+            else:
+                new_annotation[key] = value
+        return new_annotation
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor._pad_image
     def _pad_image(
         self,
         image: np.ndarray,
         output_size: Tuple[int, int],
+        annotation: Optional[Dict[str, Any]] = None,
         constant_values: Union[float, Iterable[float]] = 0,
         data_format: Optional[ChannelDimension] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
+        update_bboxes: bool = True,
     ) -> np.ndarray:
         """
         Pad an image with zeros to the given size.
         """
-        input_height, input_width = get_image_size(image)
+        input_height, input_width = get_image_size(image, channel_dim=input_data_format)
         output_height, output_width = output_size
 
         pad_bottom = output_height - input_height
         pad_right = output_width - input_width
         padding = ((0, pad_bottom), (0, pad_right))
         padded_image = pad(
-            image, padding, mode=PaddingMode.CONSTANT, constant_values=constant_values, data_format=data_format
+            image,
+            padding,
+            mode=PaddingMode.CONSTANT,
+            constant_values=constant_values,
+            data_format=data_format,
+            input_data_format=input_data_format,
         )
-        return padded_image
+        if annotation is not None:
+            annotation = self._update_annotation_for_padded_image(
+                annotation, (input_height, input_width), (output_height, output_width), padding, update_bboxes
+            )
+        return padded_image, annotation
 
     # Copied from transformers.models.detr.image_processing_detr.DetrImageProcessor.pad
     def pad(
         self,
         images: List[np.ndarray],
+        annotations: Optional[Union[AnnotationType, List[AnnotationType]]] = None,
         constant_values: Union[float, Iterable[float]] = 0,
         return_pixel_mask: bool = True,
         return_tensors: Optional[Union[str, TensorType]] = None,
         data_format: Optional[ChannelDimension] = None,
-    ) -> np.ndarray:
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
+        update_bboxes: bool = True,
+    ) -> BatchFeature:
         """
         Pads a batch of images to the bottom and right of the image with zeros to the size of largest height and width
         in the batch and optionally returns their corresponding pixel mask.
 
         Args:
-            image (`np.ndarray`):
-                Image to pad.
+            images (List[`np.ndarray`]):
+                Images to pad.
+            annotations (`AnnotationType` or `List[AnnotationType]`, *optional*):
+                Annotations to transform according to the padding that is applied to the images.
             constant_values (`float` or `Iterable[float]`, *optional*):
                 The value to use for the padding if `mode` is `"constant"`.
             return_pixel_mask (`bool`, *optional*, defaults to `True`):
                 Whether to return a pixel mask.
-            input_channel_dimension (`ChannelDimension`, *optional*):
-                The channel dimension format of the image. If not provided, it will be inferred from the input image.
+            return_tensors (`str` or `TensorType`, *optional*):
+                The type of tensors to return. Can be one of:
+                    - Unset: Return a list of `np.ndarray`.
+                    - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
+                    - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
+                    - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
+                    - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
             data_format (`str` or `ChannelDimension`, *optional*):
                 The channel dimension format of the image. If not provided, it will be the same as the input image.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format of the input image. If not provided, it will be inferred.
+            update_bboxes (`bool`, *optional*, defaults to `True`):
+                Whether to update the bounding boxes in the annotations to match the padded images. If the
+                bounding boxes have not been converted to relative coordinates and `(center_x, center_y, width, height)`
+                format, the bounding boxes will not be updated.
         """
-        pad_size = get_max_height_width(images)
+        pad_size = get_max_height_width(images, input_data_format=input_data_format)
+
+        annotation_list = annotations if annotations is not None else [None] * len(images)
+        padded_images = []
+        padded_annotations = []
+        for image, annotation in zip(images, annotation_list):
+            padded_image, padded_annotation = self._pad_image(
+                image,
+                pad_size,
+                annotation,
+                constant_values=constant_values,
+                data_format=data_format,
+                input_data_format=input_data_format,
+                update_bboxes=update_bboxes,
+            )
+            padded_images.append(padded_image)
+            padded_annotations.append(padded_annotation)
 
-        padded_images = [
-            self._pad_image(image, pad_size, constant_values=constant_values, data_format=data_format)
-            for image in images
-        ]
         data = {"pixel_values": padded_images}
 
         if return_pixel_mask:
-            masks = [make_pixel_mask(image=image, output_size=pad_size) for image in images]
+            masks = [
+                make_pixel_mask(image=image, output_size=pad_size, input_data_format=input_data_format)
+                for image in images
+            ]
             data["pixel_mask"] = masks
 
-        return BatchFeature(data=data, tensor_type=return_tensors)
+        encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)
+
+        if annotations is not None:
+            encoded_inputs["labels"] = [
+                BatchFeature(annotation, tensor_type=return_tensors) for annotation in padded_annotations
+            ]
+
+        return encoded_inputs
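For context, a minimal numpy sketch of the padding behaviour the reworked `pad()` implements: zero-pad every image to the batch maximum on the bottom/right and build a matching pixel mask. It assumes channels-last inputs and the helper name is purely illustrative.

```python
import numpy as np


def pad_to_max_size(images):
    """Pad HWC images with zeros (bottom/right) to the batch max size and build pixel masks."""
    max_h = max(img.shape[0] for img in images)
    max_w = max(img.shape[1] for img in images)
    padded, masks = [], []
    for img in images:
        h, w = img.shape[:2]
        out = np.zeros((max_h, max_w, img.shape[2]), dtype=img.dtype)
        out[:h, :w] = img
        mask = np.zeros((max_h, max_w), dtype=np.int64)
        mask[:h, :w] = 1  # 1 marks real pixels, 0 marks padding
        padded.append(out)
        masks.append(mask)
    return padded, masks


images = [np.ones((32, 48, 3), dtype=np.uint8), np.ones((40, 40, 3), dtype=np.uint8)]
pixel_values, pixel_mask = pad_to_max_size(images)
print(pixel_values[0].shape, pixel_mask[0].sum())  # (40, 48, 3) 1536
```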
 
     def preprocess(
         self,
@@ -742,10 +865,12 @@ def preprocess(
         do_normalize: Optional[bool] = None,
         image_mean: Optional[Union[float, List[float]]] = None,
         image_std: Optional[Union[float, List[float]]] = None,
+        do_convert_annotations: Optional[bool] = None,
         do_pad: Optional[bool] = None,
-        format: Optional[Union[str, AnnotionFormat]] = None,
+        format: Optional[Union[str, AnnotationFormat]] = None,
         return_tensors: Optional[Union[TensorType, str]] = None,
         data_format: Union[str, ChannelDimension] = ChannelDimension.FIRST,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
         **kwargs,
     ) -> BatchFeature:
         """
@@ -753,14 +878,15 @@ def preprocess(
 
         Args:
             images (`ImageInput`):
-                Image or batch of images to preprocess.
+                Image or batch of images to preprocess. Expects a single or batch of images with pixel values ranging
+                from 0 to 255. If passing in images with pixel values between 0 and 1, set `do_rescale=False`.
             annotations (`List[Dict]` or `List[List[Dict]]`, *optional*):
-                List of annotations associated with the image or batch of images. If annotionation is for object
+                List of annotations associated with the image or batch of images. If annotation is for object
                 detection, the annotations should be a dictionary with the following keys:
                 - "image_id" (`int`): The image id.
                 - "annotations" (`List[Dict]`): List of annotations for an image. Each annotation should be a
                   dictionary. An image can have no annotations, in which case the list should be empty.
-                If annotionation is for segmentation, the annotations should be a dictionary with the following keys:
+                If annotation is for segmentation, the annotations should be a dictionary with the following keys:
                 - "image_id" (`int`): The image id.
                 - "segments_info" (`List[Dict]`): List of segments for an image. Each segment should be a dictionary.
                   An image can have no segments, in which case the list should be empty.
@@ -785,20 +911,33 @@ def preprocess(
                 Mean to use when normalizing the image.
             image_std (`float` or `List[float]`, *optional*, defaults to self.image_std):
                 Standard deviation to use when normalizing the image.
+            do_convert_annotations (`bool`, *optional*, defaults to self.do_convert_annotations):
+                Whether to convert the annotations to the format expected by the model. Converts the bounding
+                boxes from the format `(top_left_x, top_left_y, width, height)` to `(center_x, center_y, width, height)`
+                and in relative coordinates.
             do_pad (`bool`, *optional*, defaults to self.do_pad):
-                Whether to pad the image.
-            format (`str` or `AnnotionFormat`, *optional*, defaults to self.format):
+                Whether to pad the image. If `True` will pad the images in the batch to the largest image in the batch
+                and create a pixel mask. Padding will be applied to the bottom and right of the image with zeros.
+            format (`str` or `AnnotationFormat`, *optional*, defaults to self.format):
                 Format of the annotations.
             return_tensors (`str` or `TensorType`, *optional*, defaults to self.return_tensors):
                 Type of tensors to return. If `None`, will return the list of images.
-            data_format (`str` or `ChannelDimension`, *optional*, defaults to self.data_format):
-                The channel dimension format of the image. If not provided, it will be the same as the input image.
+            data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
+                The channel dimension format for the output image. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - Unset: Use the channel dimension format of the input image.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format for the input image. If unset, the channel dimension format is inferred
+                from the input image. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
         """
         if "pad_and_return_pixel_mask" in kwargs:
-            warnings.warn(
+            logger.warning_once(
                 "The `pad_and_return_pixel_mask` argument is deprecated and will be removed in a future version, "
                 "use `do_pad` instead.",
-                FutureWarning,
             )
             do_pad = kwargs.pop("pad_and_return_pixel_mask")
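The deprecation path above now goes through `logger.warning_once` rather than `warnings.warn(..., FutureWarning)`, so the message is emitted only once per process. A small, hedged sketch of the same pattern in isolation (the function and default value here are illustrative):

```python
from transformers.utils import logging

logger = logging.get_logger(__name__)


def preprocess(**kwargs):
    do_pad = kwargs.pop("do_pad", True)
    if "pad_and_return_pixel_mask" in kwargs:
        # Emitted a single time per process, unlike warnings.warn which can repeat
        logger.warning_once(
            "The `pad_and_return_pixel_mask` argument is deprecated and will be removed in a future version, "
            "use `do_pad` instead.",
        )
        do_pad = kwargs.pop("pad_and_return_pixel_mask")
    return do_pad


print(preprocess(pad_and_return_pixel_mask=False))  # warns once, returns False
```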
 
@@ -811,55 +950,46 @@ def preprocess(
         do_normalize = self.do_normalize if do_normalize is None else do_normalize
         image_mean = self.image_mean if image_mean is None else image_mean
         image_std = self.image_std if image_std is None else image_std
+        do_convert_annotations = (
+            self.do_convert_annotations if do_convert_annotations is None else do_convert_annotations
+        )
         do_pad = self.do_pad if do_pad is None else do_pad
         format = self.format if format is None else format
 
-        if do_resize is not None and size is None:
-            raise ValueError("Size and max_size must be specified if do_resize is True.")
-
-        if do_rescale is not None and rescale_factor is None:
-            raise ValueError("Rescale factor must be specified if do_rescale is True.")
-
-        if do_normalize is not None and (image_mean is None or image_std is None):
-            raise ValueError("Image mean and std must be specified if do_normalize is True.")
+        # Here, the pad() method pads to the maximum of (width, height) in the batch, so the padded size does not need to be validated.
+
+        validate_preprocess_arguments(
+            do_rescale=do_rescale,
+            rescale_factor=rescale_factor,
+            do_normalize=do_normalize,
+            image_mean=image_mean,
+            image_std=image_std,
+            do_resize=do_resize,
+            size=size,
+            resample=resample,
+        )
 
         if not is_batched(images):
             images = [images]
             annotations = [annotations] if annotations is not None else None
 
-        if annotations is not None and len(images) != len(annotations):
-            raise ValueError(
-                f"The number of images ({len(images)}) and annotations ({len(annotations)}) do not match."
-            )
-
         if not valid_images(images):
             raise ValueError(
                 "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
                 "torch.Tensor, tf.Tensor or jax.ndarray."
             )
+        if annotations is not None and len(images) != len(annotations):
+            raise ValueError(
+                f"The number of images ({len(images)}) and annotations ({len(annotations)}) do not match."
+            )
 
-        format = AnnotionFormat(format)
+        format = AnnotationFormat(format)
         if annotations is not None:
-            if format == AnnotionFormat.COCO_DETECTION and not valid_coco_detection_annotations(annotations):
-                raise ValueError(
-                    "Invalid COCO detection annotations. Annotations must a dict (single image) of list of dicts"
-                    "(batch of images) with the following keys: `image_id` and `annotations`, with the latter "
-                    "being a list of annotations in the COCO format."
-                )
-            elif format == AnnotionFormat.COCO_PANOPTIC and not valid_coco_panoptic_annotations(annotations):
-                raise ValueError(
-                    "Invalid COCO panoptic annotations. Annotations must a dict (single image) of list of dicts "
-                    "(batch of images) with the following keys: `image_id`, `file_name` and `segments_info`, with "
-                    "the latter being a list of annotations in the COCO format."
-                )
-            elif format not in SUPPORTED_ANNOTATION_FORMATS:
-                raise ValueError(
-                    f"Unsupported annotation format: {format} must be one of {SUPPORTED_ANNOTATION_FORMATS}"
-                )
+            validate_annotations(format, SUPPORTED_ANNOTATION_FORMATS, annotations)
 
         if (
             masks_path is not None
-            and format == AnnotionFormat.COCO_PANOPTIC
+            and format == AnnotationFormat.COCO_PANOPTIC
             and not isinstance(masks_path, (pathlib.Path, str))
         ):
             raise ValueError(
@@ -870,13 +1000,28 @@ def preprocess(
         # All transformations expect numpy arrays
         images = [to_numpy_array(image) for image in images]
 
+        if is_scaled_image(images[0]) and do_rescale:
+            logger.warning_once(
+                "It looks like you are trying to rescale already rescaled images. If the input"
+                " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
+            )
+
+        if input_data_format is None:
+            # We assume that all images have the same channel dimension format.
+            input_data_format = infer_channel_dimension_format(images[0])
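As a quick illustration of what the inferred `input_data_format` means, assuming the helpers are importable from `transformers.image_utils` (where they live at the time of writing):

```python
import numpy as np
from transformers.image_utils import ChannelDimension, infer_channel_dimension_format

chw = np.zeros((3, 480, 640), dtype=np.uint8)   # channels-first
hwc = np.zeros((480, 640, 3), dtype=np.uint8)   # channels-last

print(infer_channel_dimension_format(chw) == ChannelDimension.FIRST)  # True
print(infer_channel_dimension_format(hwc) == ChannelDimension.LAST)   # True
```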
+
         # prepare (COCO annotations as a list of Dict -> DETR target as a single Dict per image)
         if annotations is not None:
             prepared_images = []
             prepared_annotations = []
             for image, target in zip(images, annotations):
                 target = self.prepare_annotation(
-                    image, target, format, return_segmentation_masks=return_segmentation_masks, masks_path=masks_path
+                    image,
+                    target,
+                    format,
+                    return_segmentation_masks=return_segmentation_masks,
+                    masks_path=masks_path,
+                    input_data_format=input_data_format,
                 )
                 prepared_images.append(image)
                 prepared_annotations.append(target)
@@ -889,40 +1034,59 @@ def preprocess(
             if annotations is not None:
                 resized_images, resized_annotations = [], []
                 for image, target in zip(images, annotations):
-                    orig_size = get_image_size(image)
-                    resized_image = self.resize(image, size=size, resample=resample)
-                    resized_annotation = self.resize_annotation(target, orig_size, get_image_size(resized_image))
+                    orig_size = get_image_size(image, input_data_format)
+                    resized_image = self.resize(
+                        image, size=size, resample=resample, input_data_format=input_data_format
+                    )
+                    resized_annotation = self.resize_annotation(
+                        target, orig_size, get_image_size(resized_image, input_data_format)
+                    )
                     resized_images.append(resized_image)
                     resized_annotations.append(resized_annotation)
                 images = resized_images
                 annotations = resized_annotations
                 del resized_images, resized_annotations
             else:
-                images = [self.resize(image, size=size, resample=resample) for image in images]
+                images = [
+                    self.resize(image, size=size, resample=resample, input_data_format=input_data_format)
+                    for image in images
+                ]
 
         if do_rescale:
-            images = [self.rescale(image, rescale_factor) for image in images]
+            images = [self.rescale(image, rescale_factor, input_data_format=input_data_format) for image in images]
 
         if do_normalize:
-            images = [self.normalize(image, image_mean, image_std) for image in images]
-            if annotations is not None:
-                annotations = [
-                    self.normalize_annotation(annotation, get_image_size(image))
-                    for annotation, image in zip(annotations, images)
-                ]
+            images = [
+                self.normalize(image, image_mean, image_std, input_data_format=input_data_format) for image in images
+            ]
+
+        if do_convert_annotations and annotations is not None:
+            annotations = [
+                self.normalize_annotation(annotation, get_image_size(image, input_data_format))
+                for annotation, image in zip(annotations, images)
+            ]
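The `do_convert_annotations` branch above normalizes boxes into the model's convention. As a rough standalone sketch of the conversion described in the docstring, from COCO-style `(top_left_x, top_left_y, width, height)` pixel boxes to relative `(center_x, center_y, width, height)` (the helper name is illustrative):

```python
import numpy as np


def corners_to_relative_center(boxes, image_height, image_width):
    """(top_left_x, top_left_y, w, h) in pixels -> (center_x, center_y, w, h) relative to image size."""
    boxes = np.asarray(boxes, dtype=np.float32)
    cx = boxes[:, 0] + boxes[:, 2] / 2
    cy = boxes[:, 1] + boxes[:, 3] / 2
    out = np.stack([cx, cy, boxes[:, 2], boxes[:, 3]], axis=1)
    out /= np.array([image_width, image_height, image_width, image_height], dtype=np.float32)
    return out


print(corners_to_relative_center([[10, 20, 100, 50]], image_height=200, image_width=400))
# ~ [[0.15, 0.225, 0.25, 0.25]]
```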
 
         if do_pad:
             # Pads images and returns their mask: {'pixel_values': ..., 'pixel_mask': ...}
-            data = self.pad(images, return_pixel_mask=True, data_format=data_format)
+            encoded_inputs = self.pad(
+                images,
+                annotations=annotations,
+                return_pixel_mask=True,
+                data_format=data_format,
+                input_data_format=input_data_format,
+                return_tensors=return_tensors,
+                update_bboxes=do_convert_annotations,
+            )
         else:
-            images = [to_channel_dimension_format(image, data_format) for image in images]
-            data = {"pixel_values": images}
-
-        encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)
-        if annotations is not None:
-            encoded_inputs["labels"] = [
-                BatchFeature(annotation, tensor_type=return_tensors) for annotation in annotations
+            images = [
+                to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
+                for image in images
             ]
+            encoded_inputs = BatchFeature(data={"pixel_values": images}, tensor_type=return_tensors)
+            if annotations is not None:
+                encoded_inputs["labels"] = [
+                    BatchFeature(annotation, tensor_type=return_tensors) for annotation in annotations
+                ]
 
         return encoded_inputs
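End to end, the reworked `preprocess`/`pad` path can be exercised roughly as follows. This is a hedged sketch: the checkpoint name and annotation values are only illustrative, and it assumes the processor's default COCO detection `format`.

```python
import numpy as np
from transformers import AutoImageProcessor

# Any DETR/DETA-style image processor exposes this call; the checkpoint name is illustrative.
processor = AutoImageProcessor.from_pretrained("jozhang97/deta-swin-large")

image = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
annotations = {
    "image_id": 0,
    "annotations": [{"bbox": [10, 20, 100, 50], "category_id": 1, "area": 5000.0, "iscrowd": 0}],
}

inputs = processor(images=image, annotations=annotations, return_tensors="pt")
print(inputs["pixel_values"].shape)   # (1, 3, H, W) after resizing and padding
print(inputs["pixel_mask"].shape)     # (1, H, W)
print(inputs["labels"][0]["boxes"])   # relative (center_x, center_y, width, height) boxes
```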
 
@@ -965,7 +1129,7 @@ def post_process_object_detection(
 
         all_scores = prob.view(batch_size, num_queries * num_labels).to(out_logits.device)
         all_indexes = torch.arange(num_queries * num_labels)[None].repeat(batch_size, 1).to(out_logits.device)
-        all_boxes = all_indexes // out_logits.shape[2]
+        all_boxes = torch.div(all_indexes, out_logits.shape[2], rounding_mode="floor")
         all_labels = all_indexes % out_logits.shape[2]
 
         boxes = center_to_corners_format(out_bbox)
@@ -988,7 +1152,7 @@ def post_process_object_detection(
             score = all_scores[b]
             lbls = all_labels[b]
 
-            pre_topk = score.topk(min(10000, len(score))).indices
+            pre_topk = score.topk(min(10000, num_queries * num_labels)).indices
             box = box[pre_topk]
             score = score[pre_topk]
             lbls = lbls[pre_topk]
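The two `post_process_object_detection` tweaks above swap tensor floor division (`//`) for an explicit `rounding_mode="floor"` and bound the top-k by `num_queries * num_labels`. A minimal check that the index arithmetic is unchanged:

```python
import torch

num_queries, num_labels = 4, 3
all_indexes = torch.arange(num_queries * num_labels)[None].repeat(2, 1)

# Explicit floor division; equivalent to `//` for non-negative integer tensors
all_boxes = torch.div(all_indexes, num_labels, rounding_mode="floor")
all_labels = all_indexes % num_labels

assert torch.equal(all_boxes, all_indexes // num_labels)
print(all_boxes[0].tolist())   # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
print(all_labels[0].tolist())  # [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
```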
diff --git a/src/transformers/models/deta/modeling_deta.py b/src/transformers/models/deta/modeling_deta.py
index eb77604fbfc4ee..5d0b48b45d13ac 100644
--- a/src/transformers/models/deta/modeling_deta.py
+++ b/src/transformers/models/deta/modeling_deta.py
@@ -17,13 +17,17 @@
 
 import copy
 import math
+import os
 import warnings
 from dataclasses import dataclass
-from typing import Dict, List, Optional, Tuple
+from pathlib import Path
+from typing import Dict, List, Optional, Tuple, Union
 
 import torch
 import torch.nn.functional as F
 from torch import Tensor, nn
+from torch.autograd import Function
+from torch.autograd.function import once_differentiable
 
 from ...activations import ACT2FN
 from ...file_utils import (
@@ -31,19 +35,107 @@
     add_start_docstrings,
     add_start_docstrings_to_model_forward,
     is_scipy_available,
+    is_torch_cuda_available,
     is_vision_available,
     replace_return_docstrings,
 )
+from ...modeling_attn_mask_utils import _prepare_4d_attention_mask
 from ...modeling_outputs import BaseModelOutput
 from ...modeling_utils import PreTrainedModel
 from ...pytorch_utils import meshgrid
-from ...utils import is_torchvision_available, logging, requires_backends
-from ..auto import AutoBackbone
+from ...utils import is_accelerate_available, is_ninja_available, is_torchvision_available, logging, requires_backends
+from ...utils.backbone_utils import load_backbone
 from .configuration_deta import DetaConfig
 
 
 logger = logging.get_logger(__name__)
 
+MultiScaleDeformableAttention = None
+
+
+# Copied from models.deformable_detr.load_cuda_kernels
+def load_cuda_kernels():
+    from torch.utils.cpp_extension import load
+
+    global MultiScaleDeformableAttention
+
+    root = Path(__file__).resolve().parent.parent.parent / "kernels" / "deta"
+    src_files = [
+        root / filename
+        for filename in [
+            "vision.cpp",
+            os.path.join("cpu", "ms_deform_attn_cpu.cpp"),
+            os.path.join("cuda", "ms_deform_attn_cuda.cu"),
+        ]
+    ]
+
+    load(
+        "MultiScaleDeformableAttention",
+        src_files,
+        with_cuda=True,
+        extra_include_paths=[str(root)],
+        extra_cflags=["-DWITH_CUDA=1"],
+        extra_cuda_cflags=[
+            "-DCUDA_HAS_FP16=1",
+            "-D__CUDA_NO_HALF_OPERATORS__",
+            "-D__CUDA_NO_HALF_CONVERSIONS__",
+            "-D__CUDA_NO_HALF2_OPERATORS__",
+        ],
+    )
+
+
+# Copied from transformers.models.deformable_detr.modeling_deformable_detr.MultiScaleDeformableAttentionFunction
+class MultiScaleDeformableAttentionFunction(Function):
+    @staticmethod
+    def forward(
+        context,
+        value,
+        value_spatial_shapes,
+        value_level_start_index,
+        sampling_locations,
+        attention_weights,
+        im2col_step,
+    ):
+        context.im2col_step = im2col_step
+        output = MultiScaleDeformableAttention.ms_deform_attn_forward(
+            value,
+            value_spatial_shapes,
+            value_level_start_index,
+            sampling_locations,
+            attention_weights,
+            context.im2col_step,
+        )
+        context.save_for_backward(
+            value, value_spatial_shapes, value_level_start_index, sampling_locations, attention_weights
+        )
+        return output
+
+    @staticmethod
+    @once_differentiable
+    def backward(context, grad_output):
+        (
+            value,
+            value_spatial_shapes,
+            value_level_start_index,
+            sampling_locations,
+            attention_weights,
+        ) = context.saved_tensors
+        grad_value, grad_sampling_loc, grad_attn_weight = MultiScaleDeformableAttention.ms_deform_attn_backward(
+            value,
+            value_spatial_shapes,
+            value_level_start_index,
+            sampling_locations,
+            attention_weights,
+            grad_output,
+            context.im2col_step,
+        )
+
+        return grad_value, None, None, grad_sampling_loc, grad_attn_weight, None
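For readers unfamiliar with the pattern, here is a toy sketch of the `torch.autograd.Function` / `once_differentiable` structure the kernel wrapper follows. It is not the real deformable-attention kernel, just a stand-in op with a hand-written backward pass:

```python
import torch
from torch.autograd import Function
from torch.autograd.function import once_differentiable


class ScaleBy(Function):
    """Toy custom op: y = scale * x, with an explicit backward pass."""

    @staticmethod
    def forward(context, x, scale):
        context.scale = scale
        return x * scale

    @staticmethod
    @once_differentiable          # the backward itself is not differentiable again
    def backward(context, grad_output):
        return grad_output * context.scale, None   # one gradient per forward input


x = torch.randn(3, requires_grad=True)
y = ScaleBy.apply(x, 2.0)
y.sum().backward()
print(x.grad)  # tensor([2., 2., 2.])
```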
+
+
+if is_accelerate_available():
+    from accelerate import PartialState
+    from accelerate.utils import reduce
 
 if is_vision_available():
     from transformers.image_transforms import center_to_corners_format
@@ -69,8 +161,8 @@
 # Copied from transformers.models.deformable_detr.modeling_deformable_detr.DeformableDetrDecoderOutput with DeformableDetr->Deta
 class DetaDecoderOutput(ModelOutput):
     """
-    Base class for outputs of the DetaDecoder. This class adds two attributes to BaseModelOutputWithCrossAttentions,
-    namely:
+    Base class for outputs of the DetaDecoder. This class adds two attributes to
+    BaseModelOutputWithCrossAttentions, namely:
     - a stacked tensor of intermediate decoder hidden states (i.e. the output of each decoder layer)
     - a stacked tensor of intermediate reference points.
 
@@ -104,7 +196,6 @@ class DetaDecoderOutput(ModelOutput):
 
 
 @dataclass
-# Copied from transformers.models.deformable_detr.modeling_deformable_detr.DeformableDetrModelOutput with DeformableDetr->Deta,Deformable DETR->DETA
 class DetaModelOutput(ModelOutput):
     """
     Base class for outputs of the Deformable DETR encoder-decoder model.
@@ -146,6 +237,8 @@ class DetaModelOutput(ModelOutput):
             foreground and background).
         enc_outputs_coord_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`, *optional*, returned when `config.with_box_refine=True` and `config.two_stage=True`):
             Logits of predicted bounding boxes coordinates in the first stage.
+        output_proposals (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`, *optional*, returned when `config.two_stage=True`):
+            Logits of proposal bounding box coordinates, as produced by `gen_encoder_output_proposals`.
     """
 
     init_reference_points: torch.FloatTensor = None
@@ -160,10 +253,10 @@ class DetaModelOutput(ModelOutput):
     encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None
     enc_outputs_class: Optional[torch.FloatTensor] = None
     enc_outputs_coord_logits: Optional[torch.FloatTensor] = None
+    output_proposals: Optional[torch.FloatTensor] = None
 
 
 @dataclass
-# Copied from transformers.models.deformable_detr.modeling_deformable_detr.DeformableDetrObjectDetectionOutput with DeformableDetr->Deta
 class DetaObjectDetectionOutput(ModelOutput):
     """
     Output type of [`DetaForObjectDetection`].
@@ -222,6 +315,8 @@ class DetaObjectDetectionOutput(ModelOutput):
             foreground and background).
         enc_outputs_coord_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`, *optional*, returned when `config.with_box_refine=True` and `config.two_stage=True`):
             Logits of predicted bounding boxes coordinates in the first stage.
+        output_proposals (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`, *optional*, returned when `config.two_stage=True`):
+            Logits of proposal bounding box coordinates, as produced by `gen_encoder_output_proposals`.
     """
 
     loss: Optional[torch.FloatTensor] = None
@@ -241,6 +336,7 @@ class DetaObjectDetectionOutput(ModelOutput):
     encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None
     enc_outputs_class: Optional = None
     enc_outputs_coord_logits: Optional = None
+    output_proposals: Optional[torch.FloatTensor] = None
 
 
 def _get_clones(module, N):
@@ -295,19 +391,28 @@ def forward(self, x):
 
 
 # Copied from transformers.models.detr.modeling_detr.replace_batch_norm with Detr->Deta
-def replace_batch_norm(m, name=""):
-    for attr_str in dir(m):
-        target_attr = getattr(m, attr_str)
-        if isinstance(target_attr, nn.BatchNorm2d):
-            frozen = DetaFrozenBatchNorm2d(target_attr.num_features)
-            bn = getattr(m, attr_str)
-            frozen.weight.data.copy_(bn.weight)
-            frozen.bias.data.copy_(bn.bias)
-            frozen.running_mean.data.copy_(bn.running_mean)
-            frozen.running_var.data.copy_(bn.running_var)
-            setattr(m, attr_str, frozen)
-    for n, ch in m.named_children():
-        replace_batch_norm(ch, n)
+def replace_batch_norm(model):
+    r"""
+    Recursively replace all `torch.nn.BatchNorm2d` with `DetaFrozenBatchNorm2d`.
+
+    Args:
+        model (torch.nn.Module):
+            input model
+    """
+    for name, module in model.named_children():
+        if isinstance(module, nn.BatchNorm2d):
+            new_module = DetaFrozenBatchNorm2d(module.num_features)
+
+            if not module.weight.device == torch.device("meta"):
+                new_module.weight.data.copy_(module.weight)
+                new_module.bias.data.copy_(module.bias)
+                new_module.running_mean.data.copy_(module.running_mean)
+                new_module.running_var.data.copy_(module.running_var)
+
+            model._modules[name] = new_module
+
+        if len(list(module.children())) > 0:
+            replace_batch_norm(module)
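A hedged, self-contained sketch of the recursion above: a toy frozen batch norm stands in for `DetaFrozenBatchNorm2d`, and the traversal swaps every `nn.BatchNorm2d` in place, copying statistics only when the weights are not on the meta device.

```python
import torch
from torch import nn


class ToyFrozenBatchNorm2d(nn.Module):
    """Stand-in for DetaFrozenBatchNorm2d: affine params and running stats stored as fixed buffers."""

    def __init__(self, num_features):
        super().__init__()
        self.register_buffer("weight", torch.ones(num_features))
        self.register_buffer("bias", torch.zeros(num_features))
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x):
        scale = (self.weight * (self.running_var + 1e-5).rsqrt())[None, :, None, None]
        shift = (self.bias - self.running_mean * self.weight * (self.running_var + 1e-5).rsqrt())[None, :, None, None]
        return x * scale + shift


def freeze_batch_norm(model):
    # Same traversal as replace_batch_norm above: swap the module, then recurse into children.
    for name, module in model.named_children():
        if isinstance(module, nn.BatchNorm2d):
            new_module = ToyFrozenBatchNorm2d(module.num_features)
            if module.weight.device != torch.device("meta"):
                new_module.weight.data.copy_(module.weight)
                new_module.bias.data.copy_(module.bias)
                new_module.running_mean.data.copy_(module.running_mean)
                new_module.running_var.data.copy_(module.running_var)
            model._modules[name] = new_module
        if len(list(module.children())) > 0:
            freeze_batch_norm(module)


backbone = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
with torch.no_grad():
    freeze_batch_norm(backbone)
print(any(isinstance(m, nn.BatchNorm2d) for m in backbone.modules()))  # False
```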
 
 
 class DetaBackboneWithPositionalEncodings(nn.Module):
@@ -320,7 +425,7 @@ class DetaBackboneWithPositionalEncodings(nn.Module):
     def __init__(self, config):
         super().__init__()
 
-        backbone = AutoBackbone.from_config(config.backbone_config)
+        backbone = load_backbone(config)
         with torch.no_grad():
             replace_batch_norm(backbone)
         self.model = backbone
@@ -355,21 +460,6 @@ def forward(self, pixel_values: torch.Tensor, pixel_mask: torch.Tensor):
         return out, pos
 
 
-# Copied from transformers.models.detr.modeling_detr._expand_mask
-def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, target_len: Optional[int] = None):
-    """
-    Expands attention_mask from `[batch_size, seq_len]` to `[batch_size, 1, target_seq_len, source_seq_len]`.
-    """
-    batch_size, source_len = mask.size()
-    target_len = target_len if target_len is not None else source_len
-
-    expanded_mask = mask[:, None, None, :].expand(batch_size, 1, target_len, source_len).to(dtype)
-
-    inverted_mask = 1.0 - expanded_mask
-
-    return inverted_mask.masked_fill(inverted_mask.bool(), torch.finfo(dtype).min)
-
-
 # Copied from transformers.models.deformable_detr.modeling_deformable_detr.DeformableDetrSinePositionEmbedding with DeformableDetr->Deta
 class DetaSinePositionEmbedding(nn.Module):
     """
@@ -398,8 +488,8 @@ def forward(self, pixel_values, pixel_mask):
             y_embed = (y_embed - 0.5) / (y_embed[:, -1:, :] + eps) * self.scale
             x_embed = (x_embed - 0.5) / (x_embed[:, :, -1:] + eps) * self.scale
 
-        dim_t = torch.arange(self.embedding_dim, dtype=torch.float32, device=pixel_values.device)
-        dim_t = self.temperature ** (2 * (dim_t // 2) / self.embedding_dim)
+        dim_t = torch.arange(self.embedding_dim, dtype=torch.int64, device=pixel_values.device).float()
+        dim_t = self.temperature ** (2 * torch.div(dim_t, 2, rounding_mode="floor") / self.embedding_dim)
 
         pos_x = x_embed[:, :, :, None] / dim_t
         pos_y = y_embed[:, :, :, None] / dim_t
@@ -453,7 +543,7 @@ def multi_scale_deformable_attention(
 ) -> Tensor:
     batch_size, _, num_heads, hidden_dim = value.shape
     _, num_queries, num_heads, num_levels, num_points, _ = sampling_locations.shape
-    value_list = value.split([height * width for height, width in value_spatial_shapes], dim=1)
+    value_list = value.split([height.item() * width.item() for height, width in value_spatial_shapes], dim=1)
     sampling_grids = 2 * sampling_locations - 1
     sampling_value_list = []
     for level_id, (height, width) in enumerate(value_spatial_shapes):
@@ -487,19 +577,27 @@ def multi_scale_deformable_attention(
     return output.transpose(1, 2).contiguous()
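One subtle change above is that the per-level split sizes are now materialized with `.item()`, so `Tensor.split` receives plain Python ints rather than 0-d tensors. A minimal illustration with made-up shapes:

```python
import torch

value = torch.randn(2, 64 + 16, 8, 32)                 # (batch, sum(H_l * W_l), heads, head_dim)
value_spatial_shapes = torch.tensor([[8, 8], [4, 4]])  # two feature levels

# split expects Python ints, so each tensor element is converted with .item()
split_sizes = [height.item() * width.item() for height, width in value_spatial_shapes]
value_list = value.split(split_sizes, dim=1)
print([v.shape[1] for v in value_list])  # [64, 16]
```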
 
 
+# Copied from transformers.models.deformable_detr.modeling_deformable_detr.DeformableDetrMultiscaleDeformableAttention with DeformableDetr->Deta
 class DetaMultiscaleDeformableAttention(nn.Module):
     """
     Multiscale deformable attention as proposed in Deformable DETR.
     """
 
-    # Copied from transformers.models.deformable_detr.modeling_deformable_detr.DeformableDetrMultiscaleDeformableAttention.__init__ with DeformableDetr->Deta
-    def __init__(self, embed_dim: int, num_heads: int, n_levels: int, n_points: int):
+    def __init__(self, config: DetaConfig, num_heads: int, n_points: int):
         super().__init__()
-        if embed_dim % num_heads != 0:
+
+        kernel_loaded = MultiScaleDeformableAttention is not None
+        if is_torch_cuda_available() and is_ninja_available() and not kernel_loaded:
+            try:
+                load_cuda_kernels()
+            except Exception as e:
+                logger.warning(f"Could not load the custom kernel for multi-scale deformable attention: {e}")
+
+        if config.d_model % num_heads != 0:
             raise ValueError(
-                f"embed_dim (d_model) must be divisible by num_heads, but got {embed_dim} and {num_heads}"
+                f"embed_dim (d_model) must be divisible by num_heads, but got {config.d_model} and {num_heads}"
             )
-        dim_per_head = embed_dim // num_heads
+        dim_per_head = config.d_model // num_heads
         # check if dim_per_head is power of 2
         if not ((dim_per_head & (dim_per_head - 1) == 0) and dim_per_head != 0):
             warnings.warn(
@@ -510,21 +608,24 @@ def __init__(self, embed_dim: int, num_heads: int, n_levels: int, n_points: int)
 
         self.im2col_step = 64
 
-        self.d_model = embed_dim
-        self.n_levels = n_levels
+        self.d_model = config.d_model
+        self.n_levels = config.num_feature_levels
         self.n_heads = num_heads
         self.n_points = n_points
 
-        self.sampling_offsets = nn.Linear(embed_dim, num_heads * n_levels * n_points * 2)
-        self.attention_weights = nn.Linear(embed_dim, num_heads * n_levels * n_points)
-        self.value_proj = nn.Linear(embed_dim, embed_dim)
-        self.output_proj = nn.Linear(embed_dim, embed_dim)
+        self.sampling_offsets = nn.Linear(config.d_model, num_heads * self.n_levels * n_points * 2)
+        self.attention_weights = nn.Linear(config.d_model, num_heads * self.n_levels * n_points)
+        self.value_proj = nn.Linear(config.d_model, config.d_model)
+        self.output_proj = nn.Linear(config.d_model, config.d_model)
+
+        self.disable_custom_kernels = config.disable_custom_kernels
 
         self._reset_parameters()
 
     def _reset_parameters(self):
         nn.init.constant_(self.sampling_offsets.weight.data, 0.0)
-        thetas = torch.arange(self.n_heads, dtype=torch.float32) * (2.0 * math.pi / self.n_heads)
+        default_dtype = torch.get_default_dtype()
+        thetas = torch.arange(self.n_heads, dtype=torch.int64).to(default_dtype) * (2.0 * math.pi / self.n_heads)
         grid_init = torch.stack([thetas.cos(), thetas.sin()], -1)
         grid_init = (
             (grid_init / grid_init.abs().max(-1, keepdim=True)[0])
@@ -596,8 +697,24 @@ def forward(
             )
         else:
             raise ValueError(f"Last dim of reference_points must be 2 or 4, but got {reference_points.shape[-1]}")
-        # PyTorch implementation (for now)
-        output = multi_scale_deformable_attention(value, spatial_shapes, sampling_locations, attention_weights)
+
+        if self.disable_custom_kernels:
+            # PyTorch implementation
+            output = multi_scale_deformable_attention(value, spatial_shapes, sampling_locations, attention_weights)
+        else:
+            try:
+                # custom kernel
+                output = MultiScaleDeformableAttentionFunction.apply(
+                    value,
+                    spatial_shapes,
+                    level_start_index,
+                    sampling_locations,
+                    attention_weights,
+                    self.im2col_step,
+                )
+            except Exception:
+                # PyTorch implementation
+                output = multi_scale_deformable_attention(value, spatial_shapes, sampling_locations, attention_weights)
         output = self.output_proj(output)
 
         return output, attention_weights
@@ -679,7 +796,7 @@ def forward(
         # expand attention_mask
         if attention_mask is not None:
             # [batch_size, seq_len] -> [batch_size, 1, target_seq_len, source_seq_len]
-            attention_mask = _expand_mask(attention_mask, hidden_states.dtype)
+            attention_mask = _prepare_4d_attention_mask(attention_mask, hidden_states.dtype)
 
         if attention_mask is not None:
             if attention_mask.size() != (batch_size, 1, target_len, source_len):
@@ -721,15 +838,13 @@ def forward(
         return attn_output, attn_weights_reshaped
 
 
-# Copied from transformers.models.deformable_detr.modeling_deformable_detr.DeformableDetrEncoderLayer with DeformableDetr->Deta
 class DetaEncoderLayer(nn.Module):
     def __init__(self, config: DetaConfig):
         super().__init__()
         self.embed_dim = config.d_model
         self.self_attn = DetaMultiscaleDeformableAttention(
-            embed_dim=self.embed_dim,
+            config,
             num_heads=config.encoder_attention_heads,
-            n_levels=config.num_feature_levels,
             n_points=config.encoder_n_points,
         )
         self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim)
@@ -810,7 +925,6 @@ def forward(
         return outputs
 
 
-# Copied from transformers.models.deformable_detr.modeling_deformable_detr.DeformableDetrDecoderLayer with DeformableDetr->Deta
 class DetaDecoderLayer(nn.Module):
     def __init__(self, config: DetaConfig):
         super().__init__()
@@ -829,9 +943,8 @@ def __init__(self, config: DetaConfig):
         self.self_attn_layer_norm = nn.LayerNorm(self.embed_dim)
         # cross-attention
         self.encoder_attn = DetaMultiscaleDeformableAttention(
-            embed_dim=self.embed_dim,
+            config,
             num_heads=config.decoder_attention_heads,
-            n_levels=config.num_feature_levels,
             n_points=config.decoder_n_points,
         )
         self.encoder_attn_layer_norm = nn.LayerNorm(self.embed_dim)
@@ -854,7 +967,7 @@ def forward(
         """
         Args:
             hidden_states (`torch.FloatTensor`):
-                Input to the layer of shape `(seq_len, batch, embed_dim)`.
+                Input to the layer of shape `(batch, seq_len, embed_dim)`.
             position_embeddings (`torch.FloatTensor`, *optional*):
                 Position embeddings that are added to the queries and keys in the self-attention layer.
             reference_points (`torch.FloatTensor`, *optional*):
@@ -864,7 +977,7 @@ def forward(
             level_start_index (`torch.LongTensor`, *optional*):
                 Level start index.
             encoder_hidden_states (`torch.FloatTensor`):
-                cross attention input to the layer of shape `(seq_len, batch, embed_dim)`
+                cross attention input to the layer of shape `(batch, seq_len, embed_dim)`
             encoder_attention_mask (`torch.FloatTensor`): encoder attention mask of size
                 `(batch, 1, target_len, source_len)` where padding elements are indicated by very large negative
                 values.
@@ -942,11 +1055,12 @@ def forward(self, hidden_states: torch.Tensor):
         return hidden_states
 
 
-# Copied from transformers.models.deformable_detr.modeling_deformable_detr.DeformableDetrPreTrainedModel with DeformableDetr->Deta
 class DetaPreTrainedModel(PreTrainedModel):
     config_class = DetaConfig
     base_model_prefix = "model"
     main_input_name = "pixel_values"
+    _no_split_modules = [r"DetaBackboneWithPositionalEncodings", r"DetaEncoderLayer", r"DetaDecoderLayer"]
+    supports_gradient_checkpointing = True
 
     def _init_weights(self, module):
         std = self.config.init_std
@@ -972,10 +1086,6 @@ def _init_weights(self, module):
         if hasattr(module, "level_embed"):
             nn.init.normal_(module.level_embed)
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, DetaDecoder):
-            module.gradient_checkpointing = value
-
 
 DETA_START_DOCSTRING = r"""
     This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
@@ -1008,7 +1118,7 @@ def _set_gradient_checkpointing(self, module, value=False):
 
             [What are attention masks?](../glossary#attention-mask)
 
-        decoder_attention_mask (`torch.LongTensor` of shape `(batch_size, num_queries)`, *optional*):
+        decoder_attention_mask (`torch.FloatTensor` of shape `(batch_size, num_queries)`, *optional*):
             Not used by default. Can be used to mask object queries.
         encoder_outputs (`tuple(tuple(torch.FloatTensor)`, *optional*):
             Tuple consists of (`last_hidden_state`, *optional*: `hidden_states`, *optional*: `attentions`)
@@ -1031,7 +1141,6 @@ def _set_gradient_checkpointing(self, module, value=False):
 """
 
 
-# Copied from transformers.models.deformable_detr.modeling_deformable_detr.DeformableDetrEncoder with DeformableDetr->Deta
 class DetaEncoder(DetaPreTrainedModel):
     """
     Transformer encoder consisting of *config.encoder_layers* deformable attention layers. Each layer is a
@@ -1048,6 +1157,7 @@ def __init__(self, config: DetaConfig):
 
         self.dropout = config.dropout
         self.layers = nn.ModuleList([DetaEncoderLayer(config) for _ in range(config.encoder_layers)])
+        self.gradient_checkpointing = False
 
         # Initialize weights and apply final processing
         self.post_init()
@@ -1162,7 +1272,6 @@ def forward(
         )
 
 
-# Copied from transformers.models.deformable_detr.modeling_deformable_detr.DeformableDetrDecoder with DeformableDetr->Deta,Deformable DETR->DETA
 class DetaDecoder(DetaPreTrainedModel):
     """
     Transformer decoder consisting of *config.decoder_layers* layers. Each layer is a [`DetaDecoderLayer`].
@@ -1268,19 +1377,16 @@ def forward(
                 all_hidden_states += (hidden_states,)
 
             if self.gradient_checkpointing and self.training:
-
-                def create_custom_forward(module):
-                    def custom_forward(*inputs):
-                        return module(*inputs, output_attentions)
-
-                    return custom_forward
-
-                layer_outputs = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(decoder_layer),
+                layer_outputs = self._gradient_checkpointing_func(
+                    decoder_layer.__call__,
                     hidden_states,
+                    position_embeddings,
+                    reference_points_input,
+                    spatial_shapes,
+                    level_start_index,
                     encoder_hidden_states,
                     encoder_attention_mask,
-                    None,
+                    output_attentions,
                 )
             else:
                 layer_outputs = decoder_layer(
@@ -1432,38 +1538,36 @@ def get_encoder(self):
     def get_decoder(self):
         return self.decoder
 
-    # Copied from transformers.models.deformable_detr.modeling_deformable_detr.DeformableDetrModel.freeze_backbone
     def freeze_backbone(self):
-        for name, param in self.backbone.conv_encoder.model.named_parameters():
+        for name, param in self.backbone.model.named_parameters():
             param.requires_grad_(False)
 
-    # Copied from transformers.models.deformable_detr.modeling_deformable_detr.DeformableDetrModel.unfreeze_backbone
     def unfreeze_backbone(self):
-        for name, param in self.backbone.conv_encoder.model.named_parameters():
+        for name, param in self.backbone.model.named_parameters():
             param.requires_grad_(True)
 
     # Copied from transformers.models.deformable_detr.modeling_deformable_detr.DeformableDetrModel.get_valid_ratio
-    def get_valid_ratio(self, mask):
+    def get_valid_ratio(self, mask, dtype=torch.float32):
         """Get the valid ratio of all feature maps."""
 
         _, height, width = mask.shape
         valid_height = torch.sum(mask[:, :, 0], 1)
         valid_width = torch.sum(mask[:, 0, :], 1)
-        valid_ratio_heigth = valid_height.float() / height
-        valid_ratio_width = valid_width.float() / width
-        valid_ratio = torch.stack([valid_ratio_width, valid_ratio_heigth], -1)
+        valid_ratio_height = valid_height.to(dtype) / height
+        valid_ratio_width = valid_width.to(dtype) / width
+        valid_ratio = torch.stack([valid_ratio_width, valid_ratio_height], -1)
         return valid_ratio
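The `dtype` argument added to `get_valid_ratio` lets the ratios follow the model dtype (e.g. bf16) instead of being hard-cast to float32. A small standalone check of what the function computes from a padded pixel mask:

```python
import torch


def get_valid_ratio(mask, dtype=torch.float32):
    # mask: (batch, height, width) with 1 for real pixels and 0 for padding
    _, height, width = mask.shape
    valid_height = torch.sum(mask[:, :, 0], 1)
    valid_width = torch.sum(mask[:, 0, :], 1)
    valid_ratio_height = valid_height.to(dtype) / height
    valid_ratio_width = valid_width.to(dtype) / width
    return torch.stack([valid_ratio_width, valid_ratio_height], -1)


mask = torch.zeros(1, 40, 48, dtype=torch.long)
mask[:, :32, :48] = 1  # a 32x48 image padded to 40x48
print(get_valid_ratio(mask))  # tensor([[1.0000, 0.8000]])
```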
 
     # Copied from transformers.models.deformable_detr.modeling_deformable_detr.DeformableDetrModel.get_proposal_pos_embed
     def get_proposal_pos_embed(self, proposals):
         """Get the position embedding of the proposals."""
 
-        num_pos_feats = 128
+        num_pos_feats = self.config.d_model // 2
         temperature = 10000
         scale = 2 * math.pi
 
-        dim_t = torch.arange(num_pos_feats, dtype=torch.float32, device=proposals.device)
-        dim_t = temperature ** (2 * (dim_t // 2) / num_pos_feats)
+        dim_t = torch.arange(num_pos_feats, dtype=torch.int64, device=proposals.device).float()
+        dim_t = temperature ** (2 * torch.div(dim_t, 2, rounding_mode="floor") / num_pos_feats)
         # batch_size, num_queries, 4
         proposals = proposals.sigmoid() * scale
         # batch_size, num_queries, 4, 128
@@ -1528,16 +1632,16 @@ def gen_encoder_output_proposals(self, enc_output, padding_mask, spatial_shapes)
     @replace_return_docstrings(output_type=DetaModelOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
-        pixel_values,
-        pixel_mask=None,
-        decoder_attention_mask=None,
-        encoder_outputs=None,
-        inputs_embeds=None,
-        decoder_inputs_embeds=None,
-        output_attentions=None,
-        output_hidden_states=None,
-        return_dict=None,
-    ):
+        pixel_values: torch.FloatTensor,
+        pixel_mask: Optional[torch.LongTensor] = None,
+        decoder_attention_mask: Optional[torch.FloatTensor] = None,
+        encoder_outputs: Optional[torch.FloatTensor] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        decoder_inputs_embeds: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple[torch.FloatTensor], DetaModelOutput]:
         r"""
         Returns:
 
@@ -1652,6 +1756,7 @@ def forward(
         batch_size, _, num_channels = encoder_outputs[0].shape
         enc_outputs_class = None
         enc_outputs_coord_logits = None
+        output_proposals = None
         if self.config.two_stage:
             object_query_embedding, output_proposals, level_ids = self.gen_encoder_output_proposals(
                 encoder_outputs[0], ~mask_flatten, spatial_shapes
@@ -1726,6 +1831,11 @@ def forward(
             init_reference_points = reference_points
             pos_trans_out = self.pos_trans_norm(self.pos_trans(self.get_proposal_pos_embed(topk_coords_logits)))
             query_embed, target = torch.split(pos_trans_out, num_channels, dim=2)
+
+            topk_feats = torch.stack(
+                [object_query_embedding[b][topk_proposals[b]] for b in range(batch_size)]
+            ).detach()
+            target = target + self.pix_trans_norm(self.pix_trans(topk_feats))
         else:
             query_embed, target = torch.split(query_embeds, num_channels, dim=1)
             query_embed = query_embed.unsqueeze(0).expand(batch_size, -1, -1)
@@ -1766,6 +1876,7 @@ def forward(
             encoder_attentions=encoder_outputs.attentions,
             enc_outputs_class=enc_outputs_class,
             enc_outputs_coord_logits=enc_outputs_coord_logits,
+            output_proposals=output_proposals,
         )
 
 
@@ -1778,7 +1889,9 @@ def forward(
 )
 class DetaForObjectDetection(DetaPreTrainedModel):
     # When using clones, all layers > 0 will be clones, but layer 0 *is* required
-    _keys_to_ignore_on_load_missing = ["bbox_embed\.[1-9]\d*", "class_embed\.[1-9]\d*"]
+    _tied_weights_keys = [r"bbox_embed\.\d+"]
+    # We can't initialize the model on meta device as some weights are modified during the initialization
+    _no_split_modules = None
 
     # Copied from transformers.models.deformable_detr.modeling_deformable_detr.DeformableDetrForObjectDetection.__init__ with DeformableDetr->Deta
     def __init__(self, config: DetaConfig):
@@ -1822,28 +1935,31 @@ def __init__(self, config: DetaConfig):
         self.post_init()
 
     @torch.jit.unused
-    # Copied from transformers.models.deformable_detr.modeling_deformable_detr.DeformableDetrForObjectDetection._set_aux_loss
     def _set_aux_loss(self, outputs_class, outputs_coord):
         # this is a workaround to make torchscript happy, as torchscript
         # doesn't support dictionary with non-homogeneous values, such
         # as a dict having both a Tensor and a list.
-        return [{"logits": a, "pred_boxes": b} for a, b in zip(outputs_class[:-1], outputs_coord[:-1])]
+        aux_loss = [
+            {"logits": logits, "pred_boxes": pred_boxes}
+            for logits, pred_boxes in zip(outputs_class.transpose(0, 1)[:-1], outputs_coord.transpose(0, 1)[:-1])
+        ]
+        return aux_loss
 
     @add_start_docstrings_to_model_forward(DETA_INPUTS_DOCSTRING)
     @replace_return_docstrings(output_type=DetaObjectDetectionOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
-        pixel_values,
-        pixel_mask=None,
-        decoder_attention_mask=None,
-        encoder_outputs=None,
-        inputs_embeds=None,
-        decoder_inputs_embeds=None,
-        labels=None,
-        output_attentions=None,
-        output_hidden_states=None,
-        return_dict=None,
-    ):
+        pixel_values: torch.FloatTensor,
+        pixel_mask: Optional[torch.LongTensor] = None,
+        decoder_attention_mask: Optional[torch.FloatTensor] = None,
+        encoder_outputs: Optional[torch.FloatTensor] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        decoder_inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[List[dict]] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple[torch.FloatTensor], DetaObjectDetectionOutput]:
         r"""
         labels (`List[Dict]` of len `(batch_size,)`, *optional*):
             Labels for computing the bipartite matching loss. List of dicts, each dictionary containing at least the
@@ -1869,7 +1985,7 @@ def forward(
         >>> inputs = image_processor(images=image, return_tensors="pt")
         >>> outputs = model(**inputs)
 
-        >>> # convert outputs (bounding boxes and class logits) to COCO API
+        >>> # convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax)
         >>> target_sizes = torch.tensor([image.size[::-1]])
         >>> results = image_processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[
         ...     0
@@ -1947,21 +2063,25 @@ def forward(
                 focal_alpha=self.config.focal_alpha,
                 losses=losses,
                 num_queries=self.config.num_queries,
+                assign_first_stage=self.config.assign_first_stage,
+                assign_second_stage=self.config.assign_second_stage,
             )
             criterion.to(logits.device)
             # Third: compute the losses, based on outputs and labels
             outputs_loss = {}
             outputs_loss["logits"] = logits
             outputs_loss["pred_boxes"] = pred_boxes
+            outputs_loss["init_reference"] = init_reference
             if self.config.auxiliary_loss:
-                intermediate = outputs.intermediate_hidden_states if return_dict else outputs[4]
-                outputs_class = self.class_embed(intermediate)
-                outputs_coord = self.bbox_embed(intermediate).sigmoid()
                 auxiliary_outputs = self._set_aux_loss(outputs_class, outputs_coord)
                 outputs_loss["auxiliary_outputs"] = auxiliary_outputs
             if self.config.two_stage:
                 enc_outputs_coord = outputs.enc_outputs_coord_logits.sigmoid()
-                outputs["enc_outputs"] = {"pred_logits": outputs.enc_outputs_class, "pred_boxes": enc_outputs_coord}
+                outputs_loss["enc_outputs"] = {
+                    "logits": outputs.enc_outputs_class,
+                    "pred_boxes": enc_outputs_coord,
+                    "anchors": outputs.output_proposals.sigmoid(),
+                }
 
             loss_dict = criterion(outputs_loss, labels)
             # Fourth: compute total loss, as a weighted sum of the various losses
@@ -1971,6 +2091,7 @@ def forward(
                 aux_weight_dict = {}
                 for i in range(self.config.decoder_layers - 1):
                     aux_weight_dict.update({k + f"_{i}": v for k, v in weight_dict.items()})
+                aux_weight_dict.update({k + "_enc": v for k, v in weight_dict.items()})
                 weight_dict.update(aux_weight_dict)
             loss = sum(loss_dict[k] * weight_dict[k] for k in loss_dict.keys() if k in weight_dict)
 
@@ -2001,6 +2122,7 @@ def forward(
             init_reference_points=outputs.init_reference_points,
             enc_outputs_class=outputs.enc_outputs_class,
             enc_outputs_coord_logits=outputs.enc_outputs_coord_logits,
+            output_proposals=outputs.output_proposals,
         )
 
         return dict_outputs
@@ -2210,7 +2332,7 @@ def forward(self, outputs, targets):
                 List of dicts, such that `len(targets) == batch_size`. The expected keys in each dict depends on the
                 losses applied, see each loss' doc.
         """
-        outputs_without_aux = {k: v for k, v in outputs.items() if k != "auxiliary_outputs"}
+        outputs_without_aux = {k: v for k, v in outputs.items() if k not in ("auxiliary_outputs", "enc_outputs")}
 
         # Retrieve the matching between the outputs of the last layer and the targets
         if self.assign_second_stage:
@@ -2221,11 +2343,12 @@ def forward(self, outputs, targets):
         # Compute the average number of target boxes across all nodes, for normalization purposes
         num_boxes = sum(len(t["class_labels"]) for t in targets)
         num_boxes = torch.as_tensor([num_boxes], dtype=torch.float, device=next(iter(outputs.values())).device)
-        # (Niels): comment out function below, distributed training to be added
-        # if is_dist_avail_and_initialized():
-        #     torch.distributed.all_reduce(num_boxes)
-        # (Niels) in original implementation, num_boxes is divided by get_world_size()
-        num_boxes = torch.clamp(num_boxes, min=1).item()
+        # Check that we have initialized the distributed state
+        world_size = 1
+        if PartialState._shared_state != {}:
+            num_boxes = reduce(num_boxes)
+            world_size = PartialState().num_processes
+        num_boxes = torch.clamp(num_boxes / world_size, min=1).item()
 
         # Compute all the requested losses
         losses = {}
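A hedged sketch of the accelerate-based normalization now used for `num_boxes`: on a single, non-distributed process `PartialState._shared_state` stays empty, so `world_size` remains 1 and the value passes through unchanged.

```python
import torch
from accelerate import PartialState
from accelerate.utils import reduce

num_boxes = torch.as_tensor([7.0])

world_size = 1
if PartialState._shared_state != {}:     # only populated once a PartialState/Accelerator was created
    num_boxes = reduce(num_boxes)        # reduces the tensor across processes (no-op on a single process)
    world_size = PartialState().num_processes
num_boxes = torch.clamp(num_boxes / world_size, min=1).item()
print(num_boxes)  # 7.0 when running on a single process
```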
@@ -2246,17 +2369,13 @@ def forward(self, outputs, targets):
             enc_outputs = outputs["enc_outputs"]
             bin_targets = copy.deepcopy(targets)
             for bt in bin_targets:
-                bt["labels"] = torch.zeros_like(bt["labels"])
+                bt["class_labels"] = torch.zeros_like(bt["class_labels"])
             if self.assign_first_stage:
                 indices = self.stg1_assigner(enc_outputs, bin_targets)
             else:
                 indices = self.matcher(enc_outputs, bin_targets)
             for loss in self.losses:
-                kwargs = {}
-                if loss == "labels":
-                    # Logging is enabled only for the last layer
-                    kwargs["log"] = False
-                l_dict = self.get_loss(loss, enc_outputs, bin_targets, indices, num_boxes, **kwargs)
+                l_dict = self.get_loss(loss, enc_outputs, bin_targets, indices, num_boxes)
                 l_dict = {k + "_enc": v for k, v in l_dict.items()}
                 losses.update(l_dict)
 
@@ -2487,9 +2606,9 @@ def __init__(self, thresholds: List[float], labels: List[int], allow_low_quality
         thresholds.insert(0, -float("inf"))
         thresholds.append(float("inf"))
         # Currently torchscript does not support all + generator
-        if not all([low <= high for (low, high) in zip(thresholds[:-1], thresholds[1:])]):
+        if not all(low <= high for (low, high) in zip(thresholds[:-1], thresholds[1:])):
             raise ValueError("Thresholds should be sorted.")
-        if not all([l in [-1, 0, 1] for l in labels]):
+        if not all(l in [-1, 0, 1] for l in labels):
             raise ValueError("All labels should be either -1, 0 or 1")
         if len(labels) != len(thresholds) - 1:
             raise ValueError("Number of labels should be equal to number of thresholds - 1")
@@ -2680,7 +2799,7 @@ def forward(self, outputs, targets, return_cost_matrix=False):
                 sampled_idxs,
                 sampled_gt_classes,
             ) = self._sample_proposals(  # list of sampled proposal_ids, sampled_id -> [0, num_classes)+[bg_label]
-                matched_idxs, matched_labels, targets[b]["labels"]
+                matched_idxs, matched_labels, targets[b]["class_labels"]
             )
             pos_pr_inds = sampled_idxs[sampled_gt_classes != self.bg_label]
             pos_gt_inds = matched_idxs[pos_pr_inds]
@@ -2745,7 +2864,7 @@ def forward(self, outputs, targets):
             )  # proposal_id -> highest_iou_gt_id, proposal_id -> [1 if iou > 0.7, 0 if iou < 0.3, -1 ow]
             matched_labels = self._subsample_labels(matched_labels)
 
-            all_pr_inds = torch.arange(len(anchors))
+            all_pr_inds = torch.arange(len(anchors), device=matched_labels.device)
             pos_pr_inds = all_pr_inds[matched_labels == 1]
             pos_gt_inds = matched_idxs[pos_pr_inds]
             pos_pr_inds, pos_gt_inds = self.postprocess_indices(pos_pr_inds, pos_gt_inds, iou)
diff --git a/src/transformers/models/detr/__init__.py b/src/transformers/models/detr/__init__.py
index 1dcda4cc171b2a..9cbaca9a54581f 100644
--- a/src/transformers/models/detr/__init__.py
+++ b/src/transformers/models/detr/__init__.py
@@ -14,7 +14,7 @@
 
 from typing import TYPE_CHECKING
 
-from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_timm_available, is_vision_available
+from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available
 
 
 _import_structure = {"configuration_detr": ["DETR_PRETRAINED_CONFIG_ARCHIVE_MAP", "DetrConfig", "DetrOnnxConfig"]}
@@ -29,7 +29,7 @@
     _import_structure["image_processing_detr"] = ["DetrImageProcessor"]
 
 try:
-    if not is_timm_available():
+    if not is_torch_available():
         raise OptionalDependencyNotAvailable()
 except OptionalDependencyNotAvailable:
     pass
@@ -56,7 +56,7 @@
         from .image_processing_detr import DetrImageProcessor
 
     try:
-        if not is_timm_available():
+        if not is_torch_available():
             raise OptionalDependencyNotAvailable()
     except OptionalDependencyNotAvailable:
         pass
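The `__init__.py` change above gates the DETR model imports on `is_torch_available()` instead of `is_timm_available()`, so importing DETR classes no longer requires `timm`. A rough sketch of the underlying guard pattern (simplified stand-ins, not transformers' actual lazy-import machinery):

```python
import importlib.util

def is_torch_available() -> bool:
    # Hypothetical stand-in for the real availability check in transformers.utils.
    return importlib.util.find_spec("torch") is not None

# Only expose torch-backed modules when torch can actually be imported.
_import_structure = {"configuration_detr": ["DetrConfig"]}
if is_torch_available():
    _import_structure["modeling_detr"] = ["DetrModel", "DetrForObjectDetection"]
print(sorted(_import_structure))
```
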
diff --git a/src/transformers/models/detr/configuration_detr.py b/src/transformers/models/detr/configuration_detr.py
index 430efc913b37c3..f13c1ef09a0c5c 100644
--- a/src/transformers/models/detr/configuration_detr.py
+++ b/src/transformers/models/detr/configuration_detr.py
@@ -93,11 +93,14 @@ class DetrConfig(PretrainedConfig):
         position_embedding_type (`str`, *optional*, defaults to `"sine"`):
             Type of position embeddings to be used on top of the image features. One of `"sine"` or `"learned"`.
         backbone (`str`, *optional*, defaults to `"resnet50"`):
-            Name of convolutional backbone to use in case `use_timm_backbone` = `True`. Supports any convolutional
-            backbone from the timm package. For a list of all available models, see [this
-            page](https://rwightman.github.io/pytorch-image-models/#load-a-pretrained-model).
-        use_pretrained_backbone (`bool`, *optional*, defaults to `True`):
-            Whether to use pretrained weights for the backbone. Only supported when `use_timm_backbone` = `True`.
+            Name of backbone to use when `backbone_config` is `None`. If `use_pretrained_backbone` is `True`, this
+            will load the corresponding pretrained weights from the timm or transformers library. If `use_pretrained_backbone`
+            is `False`, this loads the backbone's config and uses that to initialize the backbone with random weights.
+        use_pretrained_backbone (`bool`, *optional*, defaults to `True`):
+            Whether to use pretrained weights for the backbone.
+        backbone_kwargs (`dict`, *optional*):
+            Keyword arguments to be passed to AutoBackbone when loading from a checkpoint,
+            e.g. `{'out_indices': (0, 1, 2, 3)}`. Cannot be specified if `backbone_config` is set.
         dilation (`bool`, *optional*, defaults to `False`):
             Whether to replace stride with dilation in the last convolutional block (DC5). Only supported when
             `use_timm_backbone` = `True`.
@@ -132,6 +135,7 @@ class DetrConfig(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "detr"
     keys_to_ignore_at_inference = ["past_key_values"]
     attribute_map = {
@@ -165,6 +169,7 @@ def __init__(
         position_embedding_type="sine",
         backbone="resnet50",
         use_pretrained_backbone=True,
+        backbone_kwargs=None,
         dilation=False,
         class_cost=1,
         bbox_cost=5,
@@ -176,9 +181,20 @@ def __init__(
         eos_coefficient=0.1,
         **kwargs,
     ):
+        if not use_timm_backbone and use_pretrained_backbone:
+            raise ValueError(
+                "Loading pretrained backbone weights from the transformers library is not supported yet. `use_timm_backbone` must be set to `True` when `use_pretrained_backbone=True`"
+            )
+
+        if backbone_config is not None and backbone is not None:
+            raise ValueError("You can't specify both `backbone` and `backbone_config`.")
+
         if backbone_config is not None and use_timm_backbone:
             raise ValueError("You can't specify both `backbone_config` and `use_timm_backbone`.")
 
+        if backbone_kwargs and backbone_config is not None:
+            raise ValueError("You can't specify both `backbone_kwargs` and `backbone_config`.")
+
         if not use_timm_backbone:
             if backbone_config is None:
                 logger.info("`backbone_config` is `None`. Initializing the config with the default `ResNet` backbone.")
@@ -187,6 +203,8 @@ def __init__(
                 backbone_model_type = backbone_config.get("model_type")
                 config_class = CONFIG_MAPPING[backbone_model_type]
                 backbone_config = config_class.from_dict(backbone_config)
+            # set timm attributes to None
+            dilation, backbone, use_pretrained_backbone = None, None, None
 
         self.use_timm_backbone = use_timm_backbone
         self.backbone_config = backbone_config
@@ -212,6 +230,7 @@ def __init__(
         self.position_embedding_type = position_embedding_type
         self.backbone = backbone
         self.use_pretrained_backbone = use_pretrained_backbone
+        self.backbone_kwargs = backbone_kwargs
         self.dilation = dilation
         # Hungarian matcher
         self.class_cost = class_cost
@@ -233,6 +252,18 @@ def num_attention_heads(self) -> int:
     def hidden_size(self) -> int:
         return self.d_model
 
+    @classmethod
+    def from_backbone_config(cls, backbone_config: PretrainedConfig, **kwargs):
+        """Instantiate a [`DetrConfig`] (or a derived class) from a pre-trained backbone model configuration.
+
+        Args:
+            backbone_config ([`PretrainedConfig`]):
+                The backbone configuration.
+        Returns:
+            [`DetrConfig`]: An instance of a configuration object
+        """
+        return cls(backbone_config=backbone_config, **kwargs)
+
 
 class DetrOnnxConfig(OnnxConfig):
     torch_onnx_minimum_version = version.parse("1.11")
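A hedged usage sketch of the configuration changes above (`backbone_kwargs`, the mutual-exclusion checks, and `from_backbone_config`), assuming a transformers build that includes this patch; the timm backbone name and `out_indices` values are illustrative, and `backbone=None` / `use_pretrained_backbone=False` are passed explicitly so the new validation accepts a `backbone_config`:

```python
from transformers import DetrConfig, ResNetConfig

# Transformers backbone: describe it with a config object. `backbone` and
# `use_pretrained_backbone` are cleared explicitly so the new checks pass.
resnet_config = ResNetConfig()
config = DetrConfig(
    use_timm_backbone=False,
    backbone_config=resnet_config,
    backbone=None,
    use_pretrained_backbone=False,
)

# The new classmethod forwards to the same constructor.
config_alt = DetrConfig.from_backbone_config(
    resnet_config, use_timm_backbone=False, backbone=None, use_pretrained_backbone=False
)

# timm backbone: select it by name and forward extra creation kwargs.
timm_config = DetrConfig(
    use_timm_backbone=True,
    backbone="resnet50",
    backbone_kwargs={"out_indices": (1, 2, 3, 4)},
)
```
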
diff --git a/src/transformers/models/detr/convert_detr_original_pytorch_checkpoint_to_pytorch.py b/src/transformers/models/detr/convert_detr_original_pytorch_checkpoint_to_pytorch.py
index b6dcc617da7b62..72de2be8701a9c 100644
--- a/src/transformers/models/detr/convert_detr_original_pytorch_checkpoint_to_pytorch.py
+++ b/src/transformers/models/detr/convert_detr_original_pytorch_checkpoint_to_pytorch.py
@@ -25,7 +25,7 @@
 from huggingface_hub import hf_hub_download
 from PIL import Image
 
-from transformers import DetrConfig, DetrFeatureExtractor, DetrForObjectDetection, DetrForSegmentation
+from transformers import DetrConfig, DetrForObjectDetection, DetrForSegmentation, DetrImageProcessor
 from transformers.utils import logging
 
 
@@ -201,13 +201,13 @@ def convert_detr_checkpoint(model_name, pytorch_dump_folder_path):
         config.id2label = id2label
         config.label2id = {v: k for k, v in id2label.items()}
 
-    # load feature extractor
+    # load image processor
     format = "coco_panoptic" if is_panoptic else "coco_detection"
-    feature_extractor = DetrFeatureExtractor(format=format)
+    image_processor = DetrImageProcessor(format=format)
 
     # prepare image
     img = prepare_img()
-    encoding = feature_extractor(images=img, return_tensors="pt")
+    encoding = image_processor(images=img, return_tensors="pt")
     pixel_values = encoding["pixel_values"]
 
     logger.info(f"Converting model {model_name}...")
@@ -258,11 +258,11 @@ def convert_detr_checkpoint(model_name, pytorch_dump_folder_path):
     if is_panoptic:
         assert torch.allclose(outputs.pred_masks, original_outputs["pred_masks"], atol=1e-4)
 
-    # Save model and feature extractor
-    logger.info(f"Saving PyTorch model and feature extractor to {pytorch_dump_folder_path}...")
+    # Save model and image processor
+    logger.info(f"Saving PyTorch model and image processor to {pytorch_dump_folder_path}...")
     Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
     model.save_pretrained(pytorch_dump_folder_path)
-    feature_extractor.save_pretrained(pytorch_dump_folder_path)
+    image_processor.save_pretrained(pytorch_dump_folder_path)
 
 
 if __name__ == "__main__":
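The conversion script above now uses `DetrImageProcessor` in place of the deprecated `DetrFeatureExtractor`. A small usage sketch of the renamed class (dummy image; the printed shape is indicative only):

```python
import numpy as np
from PIL import Image
from transformers import DetrImageProcessor

image_processor = DetrImageProcessor(format="coco_detection")
img = Image.fromarray(np.full((480, 640, 3), 128, dtype=np.uint8))
encoding = image_processor(images=img, return_tensors="pt")
print(encoding["pixel_values"].shape)  # e.g. torch.Size([1, 3, 800, 1066])
```
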
diff --git a/src/transformers/models/detr/convert_detr_to_pytorch.py b/src/transformers/models/detr/convert_detr_to_pytorch.py
index 3ff2e38ac38325..a52e592b945d79 100644
--- a/src/transformers/models/detr/convert_detr_to_pytorch.py
+++ b/src/transformers/models/detr/convert_detr_to_pytorch.py
@@ -1,5 +1,5 @@
 # coding=utf-8
-# Copyright 2022 The HuggingFace Inc. team.
+# Copyright 2023 The HuggingFace Inc. team.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -33,16 +33,16 @@
 
 
 def get_detr_config(model_name):
-    config = DetrConfig(use_timm_backbone=False)
-
-    # set backbone attributes
-    if "resnet50" in model_name:
-        pass
-    elif "resnet101" in model_name:
-        config.backbone_config = ResNetConfig.from_pretrained("microsoft/resnet-101")
+    # initialize config
+    if "resnet-50" in model_name:
+        backbone_config = ResNetConfig.from_pretrained("microsoft/resnet-50")
+    elif "resnet-101" in model_name:
+        backbone_config = ResNetConfig.from_pretrained("microsoft/resnet-101")
     else:
         raise ValueError("Model name should include either resnet50 or resnet101")
 
+    config = DetrConfig(use_timm_backbone=False, backbone_config=backbone_config)
+
     # set label attributes
     is_panoptic = "panoptic" in model_name
     if is_panoptic:
@@ -286,7 +286,7 @@ def prepare_img():
 
 
 @torch.no_grad()
-def convert_detr_checkpoint(model_name, pytorch_dump_folder_path):
+def convert_detr_checkpoint(model_name, pytorch_dump_folder_path=None, push_to_hub=False):
     """
     Copy/paste/tweak model's weights to our DETR structure.
     """
@@ -295,8 +295,12 @@ def convert_detr_checkpoint(model_name, pytorch_dump_folder_path):
     config, is_panoptic = get_detr_config(model_name)
 
     # load original model from torch hub
+    model_name_to_original_name = {
+        "detr-resnet-50": "detr_resnet50",
+        "detr-resnet-101": "detr_resnet101",
+    }
     logger.info(f"Converting model {model_name}...")
-    detr = torch.hub.load("facebookresearch/detr", model_name, pretrained=True).eval()
+    detr = torch.hub.load("facebookresearch/detr", model_name_to_original_name[model_name], pretrained=True).eval()
     state_dict = detr.state_dict()
     # rename keys
     for src, dest in create_rename_keys(config):
@@ -344,9 +348,6 @@ def convert_detr_checkpoint(model_name, pytorch_dump_folder_path):
     original_outputs = detr(pixel_values)
     outputs = model(pixel_values)
 
-    print("Logits:", outputs.logits[0, :3, :3])
-    print("Original logits:", original_outputs["pred_logits"][0, :3, :3])
-
     assert torch.allclose(outputs.logits, original_outputs["pred_logits"], atol=1e-3)
     assert torch.allclose(outputs.pred_boxes, original_outputs["pred_boxes"], atol=1e-3)
     if is_panoptic:
@@ -360,15 +361,26 @@ def convert_detr_checkpoint(model_name, pytorch_dump_folder_path):
         model.save_pretrained(pytorch_dump_folder_path)
         processor.save_pretrained(pytorch_dump_folder_path)
 
+    if push_to_hub:
+        # Upload model and image processor to the hub
+        logger.info("Uploading PyTorch model and image processor to the hub...")
+        model.push_to_hub(f"nielsr/{model_name}")
+        processor.push_to_hub(f"nielsr/{model_name}")
+
 
 if __name__ == "__main__":
     parser = argparse.ArgumentParser()
 
     parser.add_argument(
-        "--model_name", default="detr_resnet50", type=str, help="Name of the DETR model you'd like to convert."
+        "--model_name",
+        default="detr-resnet-50",
+        type=str,
+        choices=["detr-resnet-50", "detr-resnet-101"],
+        help="Name of the DETR model you'd like to convert.",
     )
     parser.add_argument(
         "--pytorch_dump_folder_path", default=None, type=str, help="Path to the folder to output PyTorch model."
     )
+    parser.add_argument("--push_to_hub", action="store_true", help="Whether to push the model to the hub or not.")
     args = parser.parse_args()
-    convert_detr_checkpoint(args.model_name, args.pytorch_dump_folder_path)
+    convert_detr_checkpoint(args.model_name, args.pytorch_dump_folder_path, args.push_to_hub)
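A sketch of invoking the updated entry point from Python, equivalent to `--model_name detr-resnet-50 --pytorch_dump_folder_path ./detr-resnet-50`; this assumes the conversion script is importable from a source checkout and that the original weights can be downloaded via torch hub:

```python
from transformers.models.detr.convert_detr_to_pytorch import convert_detr_checkpoint

# Model names are now the hyphenated hub-style names; pushing to the hub is opt-in.
convert_detr_checkpoint(
    "detr-resnet-50",
    pytorch_dump_folder_path="./detr-resnet-50",
    push_to_hub=False,
)
```
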
diff --git a/src/transformers/models/detr/feature_extraction_detr.py b/src/transformers/models/detr/feature_extraction_detr.py
index b94cf9ff804132..6ea33666466f9a 100644
--- a/src/transformers/models/detr/feature_extraction_detr.py
+++ b/src/transformers/models/detr/feature_extraction_detr.py
@@ -16,6 +16,7 @@
 
 import warnings
 
+from ...image_transforms import rgb_to_id as _rgb_to_id
 from ...utils import logging
 from .image_processing_detr import DetrImageProcessor
 
@@ -23,6 +24,15 @@
 logger = logging.get_logger(__name__)
 
 
+def rgb_to_id(x):
+    warnings.warn(
+        "rgb_to_id has moved and will not be importable from this module from v5. "
+        "Please import from transformers.image_transforms instead.",
+        FutureWarning,
+    )
+    return _rgb_to_id(x)
+
+
 class DetrFeatureExtractor(DetrImageProcessor):
     def __init__(self, *args, **kwargs) -> None:
         warnings.warn(
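The deprecation shim above keeps `rgb_to_id` importable from the feature-extraction module while pointing users to `transformers.image_transforms`. A sketch of the recommended import and the encoding it decodes:

```python
import numpy as np
from transformers.image_transforms import rgb_to_id

# Panoptic segment ids are encoded as id = R + 256 * G + 256**2 * B.
color = np.array([[[1, 2, 3]]], dtype=np.uint8)
print(rgb_to_id(color))  # [[197121]]
```
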
diff --git a/src/transformers/models/detr/image_processing_detr.py b/src/transformers/models/detr/image_processing_detr.py
index 433853efefa74a..0a7a6e2dbd5c38 100644
--- a/src/transformers/models/detr/image_processing_detr.py
+++ b/src/transformers/models/detr/image_processing_detr.py
@@ -16,40 +16,42 @@
 
 import io
 import pathlib
-import warnings
 from collections import defaultdict
 from typing import Any, Callable, Dict, Iterable, List, Optional, Set, Tuple, Union
 
 import numpy as np
 
-from transformers.image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
-from transformers.image_transforms import (
+from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
+from ...image_transforms import (
     PaddingMode,
     center_to_corners_format,
     corners_to_center_format,
     id_to_rgb,
-    normalize,
     pad,
     rescale,
     resize,
     rgb_to_id,
     to_channel_dimension_format,
 )
-from transformers.image_utils import (
+from ...image_utils import (
     IMAGENET_DEFAULT_MEAN,
     IMAGENET_DEFAULT_STD,
+    AnnotationFormat,
+    AnnotationType,
     ChannelDimension,
     ImageInput,
     PILImageResampling,
     get_image_size,
     infer_channel_dimension_format,
+    is_scaled_image,
     make_list_of_images,
     to_numpy_array,
-    valid_coco_detection_annotations,
-    valid_coco_panoptic_annotations,
     valid_images,
+    validate_annotations,
+    validate_preprocess_arguments,
 )
-from transformers.utils import (
+from ...utils import (
+    TensorType,
     is_flax_available,
     is_jax_tensor,
     is_scipy_available,
@@ -58,8 +60,8 @@
     is_torch_available,
     is_torch_tensor,
     is_vision_available,
+    logging,
 )
-from transformers.utils.generic import ExplicitEnum, TensorType
 
 
 if is_torch_available():
@@ -76,17 +78,12 @@
     import scipy.stats
 
 
-AnnotationType = Dict[str, Union[int, str, List[Dict]]]
+logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
 
-
-class AnnotionFormat(ExplicitEnum):
-    COCO_DETECTION = "coco_detection"
-    COCO_PANOPTIC = "coco_panoptic"
-
-
-SUPPORTED_ANNOTATION_FORMATS = (AnnotionFormat.COCO_DETECTION, AnnotionFormat.COCO_PANOPTIC)
+SUPPORTED_ANNOTATION_FORMATS = (AnnotationFormat.COCO_DETECTION, AnnotationFormat.COCO_PANOPTIC)
 
 
+# From the original repo: https://github.com/facebookresearch/detr/blob/3af9fa878e73b6894ce3596450a8d9b89d918ca9/datasets/transforms.py#L76
 def get_size_with_aspect_ratio(image_size, size, max_size=None) -> Tuple[int, int]:
     """
     Computes the output image size given the input image size and the desired output size.
@@ -119,7 +116,10 @@ def get_size_with_aspect_ratio(image_size, size, max_size=None) -> Tuple[int, in
 
 
 def get_resize_output_image_size(
-    input_image: np.ndarray, size: Union[int, Tuple[int, int], List[int]], max_size: Optional[int] = None
+    input_image: np.ndarray,
+    size: Union[int, Tuple[int, int], List[int]],
+    max_size: Optional[int] = None,
+    input_data_format: Optional[Union[str, ChannelDimension]] = None,
 ) -> Tuple[int, int]:
     """
     Computes the output image size given the input image size and the desired output size. If the desired output size
@@ -127,14 +127,16 @@ def get_resize_output_image_size(
     image size is computed by keeping the aspect ratio of the input image size.
 
     Args:
-        image_size (`Tuple[int, int]`):
-            The input image size.
-        size (`int`):
+        input_image (`np.ndarray`):
+            The image to resize.
+        size (`int` or `Tuple[int, int]` or `List[int]`):
             The desired output size.
         max_size (`int`, *optional*):
             The maximum allowed output size.
+        input_data_format (`ChannelDimension` or `str`, *optional*):
+            The channel dimension format of the input image. If not provided, it will be inferred from the input image.
     """
-    image_size = get_image_size(input_image)
+    image_size = get_image_size(input_image, input_data_format)
     if isinstance(size, (list, tuple)):
         return size
 
@@ -201,23 +203,28 @@ def max_across_indices(values: Iterable[Any]) -> List[Any]:
 
 
 # Copied from transformers.models.vilt.image_processing_vilt.get_max_height_width
-def get_max_height_width(images: List[np.ndarray]) -> List[int]:
+def get_max_height_width(
+    images: List[np.ndarray], input_data_format: Optional[Union[str, ChannelDimension]] = None
+) -> List[int]:
     """
     Get the maximum height and width across all images in a batch.
     """
-    input_channel_dimension = infer_channel_dimension_format(images[0])
+    if input_data_format is None:
+        input_data_format = infer_channel_dimension_format(images[0])
 
-    if input_channel_dimension == ChannelDimension.FIRST:
+    if input_data_format == ChannelDimension.FIRST:
         _, max_height, max_width = max_across_indices([img.shape for img in images])
-    elif input_channel_dimension == ChannelDimension.LAST:
+    elif input_data_format == ChannelDimension.LAST:
         max_height, max_width, _ = max_across_indices([img.shape for img in images])
     else:
-        raise ValueError(f"Invalid channel dimension format: {input_channel_dimension}")
+        raise ValueError(f"Invalid channel dimension format: {input_data_format}")
     return (max_height, max_width)
 
 
 # Copied from transformers.models.vilt.image_processing_vilt.make_pixel_mask
-def make_pixel_mask(image: np.ndarray, output_size: Tuple[int, int]) -> np.ndarray:
+def make_pixel_mask(
+    image: np.ndarray, output_size: Tuple[int, int], input_data_format: Optional[Union[str, ChannelDimension]] = None
+) -> np.ndarray:
     """
     Make a pixel mask for the image, where 1 indicates a valid pixel and 0 indicates padding.
 
@@ -227,7 +234,7 @@ def make_pixel_mask(image: np.ndarray, output_size: Tuple[int, int]) -> np.ndarr
         output_size (`Tuple[int, int]`):
             Output size of the mask.
     """
-    input_height, input_width = get_image_size(image)
+    input_height, input_width = get_image_size(image, channel_dim=input_data_format)
     mask = np.zeros(output_size, dtype=np.int64)
     mask[:input_height, :input_width] = 1
     return mask
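A tiny sketch of the `input_data_format` plumbing added to `make_pixel_mask` above: the channel layout can now be stated explicitly instead of inferred (this is an internal helper, imported here only for illustration):

```python
import numpy as np
from transformers.image_utils import ChannelDimension
from transformers.models.detr.image_processing_detr import make_pixel_mask

image = np.zeros((64, 48, 3), dtype=np.uint8)  # height, width, channels
mask = make_pixel_mask(image, output_size=(80, 80), input_data_format=ChannelDimension.LAST)
print(mask.shape, int(mask.sum()))  # (80, 80) 3072 -> 64 * 48 valid pixels
```
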
@@ -269,11 +276,16 @@ def convert_coco_poly_to_mask(segmentations, height: int, width: int) -> np.ndar
 
 
 # inspired by https://github.com/facebookresearch/detr/blob/master/datasets/coco.py#L50
-def prepare_coco_detection_annotation(image, target, return_segmentation_masks: bool = False):
+def prepare_coco_detection_annotation(
+    image,
+    target,
+    return_segmentation_masks: bool = False,
+    input_data_format: Optional[Union[ChannelDimension, str]] = None,
+):
     """
     Convert the target in COCO format into the format expected by DETR.
     """
-    image_height, image_width = get_image_size(image)
+    image_height, image_width = get_image_size(image, channel_dim=input_data_format)
 
     image_id = target["image_id"]
     image_id = np.asarray([image_id], dtype=np.int64)
@@ -308,10 +320,13 @@ def prepare_coco_detection_annotation(image, target, return_segmentation_masks:
 
     if annotations and "keypoints" in annotations[0]:
         keypoints = [obj["keypoints"] for obj in annotations]
+        # Convert the keypoints list to a numpy array (filtering happens below)
         keypoints = np.asarray(keypoints, dtype=np.float32)
+        # Apply the keep mask here to filter the relevant annotations
+        keypoints = keypoints[keep]
         num_keypoints = keypoints.shape[0]
         keypoints = keypoints.reshape((-1, 3)) if num_keypoints else keypoints
-        new_target["keypoints"] = keypoints[keep]
+        new_target["keypoints"] = keypoints
 
     if return_segmentation_masks:
         segmentation_masks = [obj["segmentation"] for obj in annotations]
@@ -356,12 +371,16 @@ def masks_to_boxes(masks: np.ndarray) -> np.ndarray:
 
 
 def prepare_coco_panoptic_annotation(
-    image: np.ndarray, target: Dict, masks_path: Union[str, pathlib.Path], return_masks: bool = True
+    image: np.ndarray,
+    target: Dict,
+    masks_path: Union[str, pathlib.Path],
+    return_masks: bool = True,
+    input_data_format: Union[ChannelDimension, str] = None,
 ) -> Dict:
     """
     Prepare a coco panoptic annotation for DETR.
     """
-    image_height, image_width = get_image_size(image)
+    image_height, image_width = get_image_size(image, channel_dim=input_data_format)
     annotation_path = pathlib.Path(masks_path) / target["file_name"]
 
     new_target = {}
@@ -590,7 +609,7 @@ def binary_mask_to_rle(mask):
     pixels = np.concatenate([[0], pixels, [0]])
     runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
     runs[1::2] -= runs[::2]
-    return [x for x in runs]
+    return list(runs)
 
 
 # TODO - (Amy) make compatible with other frameworks
@@ -742,7 +761,7 @@ class DetrImageProcessor(BaseImageProcessor):
         rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
             Scale factor to use if rescaling the image. Can be overridden by the `rescale_factor` parameter in the
             `preprocess` method.
-        do_normalize:
+        do_normalize (`bool`, *optional*, defaults to `True`):
             Controls whether to normalize the image. Can be overridden by the `do_normalize` parameter in the
             `preprocess` method.
         image_mean (`float` or `List[float]`, *optional*, defaults to `IMAGENET_DEFAULT_MEAN`):
@@ -751,16 +770,21 @@ class DetrImageProcessor(BaseImageProcessor):
         image_std (`float` or `List[float]`, *optional*, defaults to `IMAGENET_DEFAULT_STD`):
             Standard deviation values to use when normalizing the image. Can be a single value or a list of values, one
             for each channel. Can be overridden by the `image_std` parameter in the `preprocess` method.
+        do_convert_annotations (`bool`, *optional*, defaults to `True`):
+            Controls whether to convert the annotations to the format expected by the DETR model. Converts the
+            bounding boxes to the format `(center_x, center_y, width, height)` and in the range `[0, 1]`.
+            Can be overridden by the `do_convert_annotations` parameter in the `preprocess` method.
         do_pad (`bool`, *optional*, defaults to `True`):
-            Controls whether to pad the image to the largest image in a batch and create a pixel mask. Can be
-            overridden by the `do_pad` parameter in the `preprocess` method.
+            Controls whether to pad the image. Can be overridden by the `do_pad` parameter in the `preprocess`
+            method. If `True` will pad the images in the batch to the largest height and width in the batch.
+            Padding will be applied to the bottom and right of the image with zeros.
     """
 
     model_input_names = ["pixel_values", "pixel_mask"]
 
     def __init__(
         self,
-        format: Union[str, AnnotionFormat] = AnnotionFormat.COCO_DETECTION,
+        format: Union[str, AnnotationFormat] = AnnotationFormat.COCO_DETECTION,
         do_resize: bool = True,
         size: Dict[str, int] = None,
         resample: PILImageResampling = PILImageResampling.BILINEAR,
@@ -769,6 +793,7 @@ def __init__(
         do_normalize: bool = True,
         image_mean: Union[float, List[float]] = None,
         image_std: Union[float, List[float]] = None,
+        do_convert_annotations: Optional[bool] = None,
         do_pad: bool = True,
         **kwargs,
     ) -> None:
@@ -776,10 +801,9 @@ def __init__(
             do_pad = kwargs.pop("pad_and_return_pixel_mask")
 
         if "max_size" in kwargs:
-            warnings.warn(
+            logger.warning_once(
                 "The `max_size` parameter is deprecated and will be removed in v4.26. "
                 "Please specify in `size['longest_edge'] instead`.",
-                FutureWarning,
             )
             max_size = kwargs.pop("max_size")
         else:
@@ -788,6 +812,10 @@ def __init__(
         size = size if size is not None else {"shortest_edge": 800, "longest_edge": 1333}
         size = get_size_dict(size, max_size=max_size, default_to_square=False)
 
+        # Backwards compatibility
+        if do_convert_annotations is None:
+            do_convert_annotations = do_normalize
+
         super().__init__(**kwargs)
         self.format = format
         self.do_resize = do_resize
@@ -796,19 +824,11 @@ def __init__(
         self.do_rescale = do_rescale
         self.rescale_factor = rescale_factor
         self.do_normalize = do_normalize
+        self.do_convert_annotations = do_convert_annotations
         self.image_mean = image_mean if image_mean is not None else IMAGENET_DEFAULT_MEAN
         self.image_std = image_std if image_std is not None else IMAGENET_DEFAULT_STD
         self.do_pad = do_pad
 
-    @property
-    def max_size(self):
-        warnings.warn(
-            "The `max_size` parameter is deprecated and will be removed in v4.27. "
-            "Please specify in `size['longest_edge'] instead`.",
-            FutureWarning,
-        )
-        return self.size["longest_edge"]
-
     @classmethod
     def from_dict(cls, image_processor_dict: Dict[str, Any], **kwargs):
         """
@@ -827,30 +847,37 @@ def prepare_annotation(
         self,
         image: np.ndarray,
         target: Dict,
-        format: Optional[AnnotionFormat] = None,
+        format: Optional[AnnotationFormat] = None,
         return_segmentation_masks: bool = None,
         masks_path: Optional[Union[str, pathlib.Path]] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
     ) -> Dict:
         """
         Prepare an annotation for feeding into DETR model.
         """
         format = format if format is not None else self.format
 
-        if format == AnnotionFormat.COCO_DETECTION:
+        if format == AnnotationFormat.COCO_DETECTION:
             return_segmentation_masks = False if return_segmentation_masks is None else return_segmentation_masks
-            target = prepare_coco_detection_annotation(image, target, return_segmentation_masks)
-        elif format == AnnotionFormat.COCO_PANOPTIC:
+            target = prepare_coco_detection_annotation(
+                image, target, return_segmentation_masks, input_data_format=input_data_format
+            )
+        elif format == AnnotationFormat.COCO_PANOPTIC:
             return_segmentation_masks = True if return_segmentation_masks is None else return_segmentation_masks
             target = prepare_coco_panoptic_annotation(
-                image, target, masks_path=masks_path, return_masks=return_segmentation_masks
+                image,
+                target,
+                masks_path=masks_path,
+                return_masks=return_segmentation_masks,
+                input_data_format=input_data_format,
             )
         else:
             raise ValueError(f"Format {format} is not supported.")
         return target
 
     def prepare(self, image, target, return_segmentation_masks=None, masks_path=None):
-        warnings.warn(
-            "The `prepare` method is deprecated and will be removed in a future version. "
+        logger.warning_once(
+            "The `prepare` method is deprecated and will be removed in a v4.33. "
             "Please use `prepare_annotation` instead. Note: the `prepare_annotation` method "
             "does not return the image anymore.",
         )
@@ -858,15 +885,15 @@ def prepare(self, image, target, return_segmentation_masks=None, masks_path=None
         return image, target
 
     def convert_coco_poly_to_mask(self, *args, **kwargs):
-        warnings.warn("The `convert_coco_poly_to_mask` method is deprecated and will be removed in a future version. ")
+        logger.warning_once("The `convert_coco_poly_to_mask` method is deprecated and will be removed in v4.33. ")
         return convert_coco_poly_to_mask(*args, **kwargs)
 
     def prepare_coco_detection(self, *args, **kwargs):
-        warnings.warn("The `prepare_coco_detection` method is deprecated and will be removed in a future version. ")
+        logger.warning_once("The `prepare_coco_detection` method is deprecated and will be removed in v4.33. ")
         return prepare_coco_detection_annotation(*args, **kwargs)
 
     def prepare_coco_panoptic(self, *args, **kwargs):
-        warnings.warn("The `prepare_coco_panoptic` method is deprecated and will be removed in a future version. ")
+        logger.warning_once("The `prepare_coco_panoptic` method is deprecated and will be removed in v4.33. ")
         return prepare_coco_panoptic_annotation(*args, **kwargs)
 
     def resize(
@@ -875,24 +902,40 @@ def resize(
         size: Dict[str, int],
         resample: PILImageResampling = PILImageResampling.BILINEAR,
         data_format: Optional[ChannelDimension] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
         **kwargs,
     ) -> np.ndarray:
         """
         Resize the image to the given size. Size can be `min_size` (scalar) or `(height, width)` tuple. If size is an
         int, smaller edge of the image will be matched to this number.
+
+        Args:
+            image (`np.ndarray`):
+                Image to resize.
+            size (`Dict[str, int]`):
+                Dictionary containing the size to resize to. Can contain the keys `shortest_edge` and `longest_edge` or
+                `height` and `width`.
+            resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BILINEAR`):
+                Resampling filter to use if resizing the image.
+            data_format (`str` or `ChannelDimension`, *optional*):
+                The channel dimension format for the output image. If unset, the channel dimension format of the input
+                image is used.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format of the input image. If not provided, it will be inferred.
         """
         if "max_size" in kwargs:
-            warnings.warn(
+            logger.warning_once(
                 "The `max_size` parameter is deprecated and will be removed in v4.26. "
                 "Please specify in `size['longest_edge'] instead`.",
-                FutureWarning,
             )
             max_size = kwargs.pop("max_size")
         else:
             max_size = None
         size = get_size_dict(size, max_size=max_size, default_to_square=False)
         if "shortest_edge" in size and "longest_edge" in size:
-            size = get_resize_output_image_size(image, size["shortest_edge"], size["longest_edge"])
+            size = get_resize_output_image_size(
+                image, size["shortest_edge"], size["longest_edge"], input_data_format=input_data_format
+            )
         elif "height" in size and "width" in size:
             size = (size["height"], size["width"])
         else:
@@ -900,7 +943,9 @@ def resize(
                 "Size must contain 'height' and 'width' keys or 'shortest_edge' and 'longest_edge' keys. Got"
                 f" {size.keys()}."
             )
-        image = resize(image, size=size, resample=resample, data_format=data_format)
+        image = resize(
+            image, size=size, resample=resample, data_format=data_format, input_data_format=input_data_format, **kwargs
+        )
         return image
 
     def resize_annotation(
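A hedged sketch of the resize path documented above, assuming a transformers build with these changes: with a `shortest_edge`/`longest_edge` size dict the aspect ratio is preserved, and the new `input_data_format` argument makes the channel layout explicit:

```python
import numpy as np
from transformers import DetrImageProcessor
from transformers.image_utils import ChannelDimension

processor = DetrImageProcessor()
image = np.zeros((480, 640, 3), dtype=np.uint8)
resized = processor.resize(
    image,
    size={"shortest_edge": 800, "longest_edge": 1333},
    input_data_format=ChannelDimension.LAST,
)
print(resized.shape)  # (800, 1066, 3) -- shorter side scaled up to 800
```
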
@@ -916,126 +961,193 @@ def resize_annotation(
         """
         return resize_annotation(annotation, orig_size=orig_size, target_size=size, resample=resample)
 
+    # TODO (Amy) - update to use `rescale_factor` instead of `scale`
     def rescale(
-        self, image: np.ndarray, rescale_factor: Union[float, int], data_format: Optional[ChannelDimension] = None
-    ) -> np.ndarray:
-        """
-        Rescale the image by the given factor.
-        """
-        return rescale(image, rescale_factor, data_format=data_format)
-
-    def normalize(
         self,
         image: np.ndarray,
-        mean: Union[float, Iterable[float]],
-        std: Union[float, Iterable[float]],
-        data_format: Optional[ChannelDimension] = None,
+        rescale_factor: float,
+        data_format: Optional[Union[str, ChannelDimension]] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
     ) -> np.ndarray:
         """
-        Normalize the image with the given mean and standard deviation.
+        Rescale the image by the given factor. image = image * rescale_factor.
+
+        Args:
+            image (`np.ndarray`):
+                Image to rescale.
+            rescale_factor (`float`):
+                The value to use for rescaling.
+            data_format (`str` or `ChannelDimension`, *optional*):
+                The channel dimension format for the output image. If unset, the channel dimension format of the input
+                image is used. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+            input_data_format (`str` or `ChannelDimension`, *optional*):
+                The channel dimension format for the input image. If unset, is inferred from the input image. Can be
+                one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
         """
-        return normalize(image, mean=mean, std=std, data_format=data_format)
+        return rescale(image, rescale_factor, data_format=data_format, input_data_format=input_data_format)
 
     def normalize_annotation(self, annotation: Dict, image_size: Tuple[int, int]) -> Dict:
         """
         Normalize the boxes in the annotation from `[top_left_x, top_left_y, bottom_right_x, bottom_right_y]` to
-        `[center_x, center_y, width, height]` format.
+        `[center_x, center_y, width, height]` format and from absolute to relative pixel values.
         """
         return normalize_annotation(annotation, image_size=image_size)
 
-    def pad_and_create_pixel_mask(
+    def _update_annotation_for_padded_image(
         self,
-        pixel_values_list: List[ImageInput],
-        return_tensors: Optional[Union[str, TensorType]] = None,
-        data_format: Optional[ChannelDimension] = None,
-    ) -> BatchFeature:
+        annotation: Dict,
+        input_image_size: Tuple[int, int],
+        output_image_size: Tuple[int, int],
+        padding,
+        update_bboxes,
+    ) -> Dict:
         """
-        Pads a batch of images with zeros to the size of largest height and width in the batch and returns their
-        corresponding pixel mask.
-
-        Args:
-            images (`List[np.ndarray]`):
-                Batch of images to pad.
-            return_tensors (`str` or `TensorType`, *optional*):
-                The type of tensors to return. Can be one of:
-                    - Unset: Return a list of `np.ndarray`.
-                    - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
-                    - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
-                    - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
-                    - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
-            data_format (`str` or `ChannelDimension`, *optional*):
-                The channel dimension format of the image. If not provided, it will be the same as the input image.
+        Update the annotation for a padded image.
         """
-        warnings.warn(
-            "This method is deprecated and will be removed in v4.27.0. Please use pad instead.", FutureWarning
-        )
-        # pad expects a list of np.ndarray, but the previous feature extractors expected torch tensors
-        images = [to_numpy_array(image) for image in pixel_values_list]
-        return self.pad(
-            images=images,
-            return_pixel_mask=True,
-            return_tensors=return_tensors,
-            data_format=data_format,
-        )
+        new_annotation = {}
+        new_annotation["size"] = output_image_size
+
+        for key, value in annotation.items():
+            if key == "masks":
+                masks = value
+                masks = pad(
+                    masks,
+                    padding,
+                    mode=PaddingMode.CONSTANT,
+                    constant_values=0,
+                    input_data_format=ChannelDimension.FIRST,
+                )
+                masks = safe_squeeze(masks, 1)
+                new_annotation["masks"] = masks
+            elif key == "boxes" and update_bboxes:
+                boxes = value
+                boxes *= np.asarray(
+                    [
+                        input_image_size[1] / output_image_size[1],
+                        input_image_size[0] / output_image_size[0],
+                        input_image_size[1] / output_image_size[1],
+                        input_image_size[0] / output_image_size[0],
+                    ]
+                )
+                new_annotation["boxes"] = boxes
+            elif key == "size":
+                new_annotation["size"] = output_image_size
+            else:
+                new_annotation[key] = value
+        return new_annotation
 
     def _pad_image(
         self,
         image: np.ndarray,
         output_size: Tuple[int, int],
+        annotation: Optional[Dict[str, Any]] = None,
         constant_values: Union[float, Iterable[float]] = 0,
         data_format: Optional[ChannelDimension] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
+        update_bboxes: bool = True,
     ) -> np.ndarray:
         """
         Pad an image with zeros to the given size.
         """
-        input_height, input_width = get_image_size(image)
+        input_height, input_width = get_image_size(image, channel_dim=input_data_format)
         output_height, output_width = output_size
 
         pad_bottom = output_height - input_height
         pad_right = output_width - input_width
         padding = ((0, pad_bottom), (0, pad_right))
         padded_image = pad(
-            image, padding, mode=PaddingMode.CONSTANT, constant_values=constant_values, data_format=data_format
+            image,
+            padding,
+            mode=PaddingMode.CONSTANT,
+            constant_values=constant_values,
+            data_format=data_format,
+            input_data_format=input_data_format,
         )
-        return padded_image
+        if annotation is not None:
+            annotation = self._update_annotation_for_padded_image(
+                annotation, (input_height, input_width), (output_height, output_width), padding, update_bboxes
+            )
+        return padded_image, annotation
 
     def pad(
         self,
         images: List[np.ndarray],
+        annotations: Optional[Union[AnnotationType, List[AnnotationType]]] = None,
         constant_values: Union[float, Iterable[float]] = 0,
         return_pixel_mask: bool = True,
         return_tensors: Optional[Union[str, TensorType]] = None,
         data_format: Optional[ChannelDimension] = None,
-    ) -> np.ndarray:
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
+        update_bboxes: bool = True,
+    ) -> BatchFeature:
         """
         Pads a batch of images to the bottom and right of the image with zeros to the size of largest height and width
         in the batch and optionally returns their corresponding pixel mask.
 
         Args:
-            image (`np.ndarray`):
-                Image to pad.
+            images (List[`np.ndarray`]):
+                Images to pad.
+            annotations (`AnnotationType` or `List[AnnotationType]`, *optional*):
+                Annotations to transform according to the padding that is applied to the images.
             constant_values (`float` or `Iterable[float]`, *optional*):
                 The value to use for the padding if `mode` is `"constant"`.
             return_pixel_mask (`bool`, *optional*, defaults to `True`):
                 Whether to return a pixel mask.
-            input_channel_dimension (`ChannelDimension`, *optional*):
-                The channel dimension format of the image. If not provided, it will be inferred from the input image.
+            return_tensors (`str` or `TensorType`, *optional*):
+                The type of tensors to return. Can be one of:
+                    - Unset: Return a list of `np.ndarray`.
+                    - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
+                    - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
+                    - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
+                    - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
             data_format (`str` or `ChannelDimension`, *optional*):
                 The channel dimension format of the image. If not provided, it will be the same as the input image.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format of the input image. If not provided, it will be inferred.
+            update_bboxes (`bool`, *optional*, defaults to `True`):
+                Whether to update the bounding boxes in the annotations to match the padded images. If the
+                bounding boxes have not been converted to relative coordinates and `(center_x, center_y, width, height)`
+                format, the bounding boxes will not be updated.
         """
-        pad_size = get_max_height_width(images)
+        pad_size = get_max_height_width(images, input_data_format=input_data_format)
+
+        annotation_list = annotations if annotations is not None else [None] * len(images)
+        padded_images = []
+        padded_annotations = []
+        for image, annotation in zip(images, annotation_list):
+            padded_image, padded_annotation = self._pad_image(
+                image,
+                pad_size,
+                annotation,
+                constant_values=constant_values,
+                data_format=data_format,
+                input_data_format=input_data_format,
+                update_bboxes=update_bboxes,
+            )
+            padded_images.append(padded_image)
+            padded_annotations.append(padded_annotation)
 
-        padded_images = [
-            self._pad_image(image, pad_size, constant_values=constant_values, data_format=data_format)
-            for image in images
-        ]
         data = {"pixel_values": padded_images}
 
         if return_pixel_mask:
-            masks = [make_pixel_mask(image=image, output_size=pad_size) for image in images]
+            masks = [
+                make_pixel_mask(image=image, output_size=pad_size, input_data_format=input_data_format)
+                for image in images
+            ]
             data["pixel_mask"] = masks
 
-        return BatchFeature(data=data, tensor_type=return_tensors)
+        encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)
+
+        if annotations is not None:
+            encoded_inputs["labels"] = [
+                BatchFeature(annotation, tensor_type=return_tensors) for annotation in padded_annotations
+            ]
+
+        return encoded_inputs
 
     def preprocess(
         self,
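A minimal sketch of the extended `pad` above (illustrative shapes and boxes, assuming a transformers build with these changes): annotations passed alongside the images come back under `labels`, with relative `(center_x, center_y, width, height)` boxes rescaled to the shared padded size:

```python
import numpy as np
from transformers import DetrImageProcessor

processor = DetrImageProcessor()
images = [
    np.zeros((3, 100, 120), dtype=np.float32),
    np.zeros((3, 80, 150), dtype=np.float32),
]
annotations = [
    {"boxes": np.array([[0.5, 0.5, 0.2, 0.2]], dtype=np.float32), "class_labels": np.array([1])},
    {"boxes": np.array([[0.4, 0.6, 0.1, 0.3]], dtype=np.float32), "class_labels": np.array([2])},
]
batch = processor.pad(images, annotations=annotations, return_tensors="np")
print(batch["pixel_values"].shape)  # (2, 3, 100, 150)
print(batch["pixel_mask"].shape)    # (2, 100, 150)
print(batch["labels"][0]["boxes"])  # approx. [[0.4 0.5 0.16 0.2]] -- x/width scaled by 120/150
```
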
@@ -1049,12 +1161,14 @@ def preprocess(
         do_rescale: Optional[bool] = None,
         rescale_factor: Optional[Union[int, float]] = None,
         do_normalize: Optional[bool] = None,
+        do_convert_annotations: Optional[bool] = None,
         image_mean: Optional[Union[float, List[float]]] = None,
         image_std: Optional[Union[float, List[float]]] = None,
         do_pad: Optional[bool] = None,
-        format: Optional[Union[str, AnnotionFormat]] = None,
+        format: Optional[Union[str, AnnotationFormat]] = None,
         return_tensors: Optional[Union[TensorType, str]] = None,
         data_format: Union[str, ChannelDimension] = ChannelDimension.FIRST,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
         **kwargs,
     ) -> BatchFeature:
         """
@@ -1062,14 +1176,15 @@ def preprocess(
 
         Args:
             images (`ImageInput`):
-                Image or batch of images to preprocess.
+                Image or batch of images to preprocess. Expects a single or batch of images with pixel values ranging
+                from 0 to 255. If passing in images with pixel values between 0 and 1, set `do_rescale=False`.
             annotations (`AnnotationType` or `List[AnnotationType]`, *optional*):
-                List of annotations associated with the image or batch of images. If annotionation is for object
+                List of annotations associated with the image or batch of images. If annotation is for object
                 detection, the annotations should be a dictionary with the following keys:
                 - "image_id" (`int`): The image id.
                 - "annotations" (`List[Dict]`): List of annotations for an image. Each annotation should be a
                   dictionary. An image can have no annotations, in which case the list should be empty.
-                If annotionation is for segmentation, the annotations should be a dictionary with the following keys:
+                If annotation is for segmentation, the annotations should be a dictionary with the following keys:
                 - "image_id" (`int`): The image id.
                 - "segments_info" (`List[Dict]`): List of segments for an image. Each segment should be a dictionary.
                   An image can have no segments, in which case the list should be empty.
@@ -1090,33 +1205,45 @@ def preprocess(
                 Rescale factor to use when rescaling the image.
             do_normalize (`bool`, *optional*, defaults to self.do_normalize):
                 Whether to normalize the image.
+            do_convert_annotations (`bool`, *optional*, defaults to self.do_convert_annotations):
+                Whether to convert the annotations to the format expected by the model. Converts the bounding
+                boxes from the format `(top_left_x, top_left_y, width, height)` to `(center_x, center_y, width, height)`
+                and in relative coordinates.
             image_mean (`float` or `List[float]`, *optional*, defaults to self.image_mean):
                 Mean to use when normalizing the image.
             image_std (`float` or `List[float]`, *optional*, defaults to self.image_std):
                 Standard deviation to use when normalizing the image.
             do_pad (`bool`, *optional*, defaults to self.do_pad):
-                Whether to pad the image.
-            format (`str` or `AnnotionFormat`, *optional*, defaults to self.format):
+                Whether to pad the image. If `True` will pad the images in the batch to the largest image in the batch
+                and create a pixel mask. Padding will be applied to the bottom and right of the image with zeros.
+            format (`str` or `AnnotationFormat`, *optional*, defaults to self.format):
                 Format of the annotations.
             return_tensors (`str` or `TensorType`, *optional*, defaults to self.return_tensors):
                 Type of tensors to return. If `None`, will return the list of images.
-            data_format (`str` or `ChannelDimension`, *optional*, defaults to self.data_format):
-                The channel dimension format of the image. If not provided, it will be the same as the input image.
+            data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
+                The channel dimension format for the output image. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - Unset: Use the channel dimension format of the input image.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format for the input image. If unset, the channel dimension format is inferred
+                from the input image. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
         """
         if "pad_and_return_pixel_mask" in kwargs:
-            warnings.warn(
+            logger.warning_once(
                 "The `pad_and_return_pixel_mask` argument is deprecated and will be removed in a future version, "
-                "use `do_pad` instead.",
-                FutureWarning,
+                "use `do_pad` instead."
             )
             do_pad = kwargs.pop("pad_and_return_pixel_mask")
 
         max_size = None
         if "max_size" in kwargs:
-            warnings.warn(
+            logger.warning_once(
                 "The `max_size` argument is deprecated and will be removed in a future version, use"
-                " `size['longest_edge']` instead.",
-                FutureWarning,
+                " `size['longest_edge']` instead."
             )
             size = kwargs.pop("max_size")
 
@@ -1129,19 +1256,33 @@ def preprocess(
         do_normalize = self.do_normalize if do_normalize is None else do_normalize
         image_mean = self.image_mean if image_mean is None else image_mean
         image_std = self.image_std if image_std is None else image_std
+        do_convert_annotations = (
+            self.do_convert_annotations if do_convert_annotations is None else do_convert_annotations
+        )
         do_pad = self.do_pad if do_pad is None else do_pad
         format = self.format if format is None else format
 
-        if do_resize is not None and size is None:
-            raise ValueError("Size and max_size must be specified if do_resize is True.")
+        images = make_list_of_images(images)
 
-        if do_rescale is not None and rescale_factor is None:
-            raise ValueError("Rescale factor must be specified if do_rescale is True.")
+        if not valid_images(images):
+            raise ValueError(
+                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
+                "torch.Tensor, tf.Tensor or jax.ndarray."
+            )
 
-        if do_normalize is not None and (image_mean is None or image_std is None):
-            raise ValueError("Image mean and std must be specified if do_normalize is True.")
+        # `do_pad` needs no extra validation here: the pad() method always pads to the maximum (height, width) in the batch.
+
+        validate_preprocess_arguments(
+            do_rescale=do_rescale,
+            rescale_factor=rescale_factor,
+            do_normalize=do_normalize,
+            image_mean=image_mean,
+            image_std=image_std,
+            do_resize=do_resize,
+            size=size,
+            resample=resample,
+        )
 
-        images = make_list_of_images(images)
         if annotations is not None and isinstance(annotations, dict):
             annotations = [annotations]
 
@@ -1150,34 +1291,13 @@ def preprocess(
                 f"The number of images ({len(images)}) and annotations ({len(annotations)}) do not match."
             )
 
-        if not valid_images(images):
-            raise ValueError(
-                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
-                "torch.Tensor, tf.Tensor or jax.ndarray."
-            )
-
-        format = AnnotionFormat(format)
+        format = AnnotationFormat(format)
         if annotations is not None:
-            if format == AnnotionFormat.COCO_DETECTION and not valid_coco_detection_annotations(annotations):
-                raise ValueError(
-                    "Invalid COCO detection annotations. Annotations must a dict (single image) of list of dicts"
-                    "(batch of images) with the following keys: `image_id` and `annotations`, with the latter "
-                    "being a list of annotations in the COCO format."
-                )
-            elif format == AnnotionFormat.COCO_PANOPTIC and not valid_coco_panoptic_annotations(annotations):
-                raise ValueError(
-                    "Invalid COCO panoptic annotations. Annotations must a dict (single image) of list of dicts "
-                    "(batch of images) with the following keys: `image_id`, `file_name` and `segments_info`, with "
-                    "the latter being a list of annotations in the COCO format."
-                )
-            elif format not in SUPPORTED_ANNOTATION_FORMATS:
-                raise ValueError(
-                    f"Unsupported annotation format: {format} must be one of {SUPPORTED_ANNOTATION_FORMATS}"
-                )
+            validate_annotations(format, SUPPORTED_ANNOTATION_FORMATS, annotations)
 
         if (
             masks_path is not None
-            and format == AnnotionFormat.COCO_PANOPTIC
+            and format == AnnotationFormat.COCO_PANOPTIC
             and not isinstance(masks_path, (pathlib.Path, str))
         ):
             raise ValueError(
@@ -1188,13 +1308,28 @@ def preprocess(
         # All transformations expect numpy arrays
         images = [to_numpy_array(image) for image in images]
 
+        if is_scaled_image(images[0]) and do_rescale:
+            logger.warning_once(
+                "It looks like you are trying to rescale already rescaled images. If the input"
+                " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
+            )
+
+        if input_data_format is None:
+            # We assume that all images have the same channel dimension format.
+            input_data_format = infer_channel_dimension_format(images[0])
+
         # prepare (COCO annotations as a list of Dict -> DETR target as a single Dict per image)
         if annotations is not None:
             prepared_images = []
             prepared_annotations = []
             for image, target in zip(images, annotations):
                 target = self.prepare_annotation(
-                    image, target, format, return_segmentation_masks=return_segmentation_masks, masks_path=masks_path
+                    image,
+                    target,
+                    format,
+                    return_segmentation_masks=return_segmentation_masks,
+                    masks_path=masks_path,
+                    input_data_format=input_data_format,
                 )
                 prepared_images.append(image)
                 prepared_annotations.append(target)
@@ -1207,40 +1342,59 @@ def preprocess(
             if annotations is not None:
                 resized_images, resized_annotations = [], []
                 for image, target in zip(images, annotations):
-                    orig_size = get_image_size(image)
-                    resized_image = self.resize(image, size=size, max_size=max_size, resample=resample)
-                    resized_annotation = self.resize_annotation(target, orig_size, get_image_size(resized_image))
+                    orig_size = get_image_size(image, input_data_format)
+                    resized_image = self.resize(
+                        image, size=size, max_size=max_size, resample=resample, input_data_format=input_data_format
+                    )
+                    resized_annotation = self.resize_annotation(
+                        target, orig_size, get_image_size(resized_image, input_data_format)
+                    )
                     resized_images.append(resized_image)
                     resized_annotations.append(resized_annotation)
                 images = resized_images
                 annotations = resized_annotations
                 del resized_images, resized_annotations
             else:
-                images = [self.resize(image, size=size, resample=resample) for image in images]
+                images = [
+                    self.resize(image, size=size, resample=resample, input_data_format=input_data_format)
+                    for image in images
+                ]
 
         if do_rescale:
-            images = [self.rescale(image, rescale_factor) for image in images]
+            images = [self.rescale(image, rescale_factor, input_data_format=input_data_format) for image in images]
 
         if do_normalize:
-            images = [self.normalize(image, image_mean, image_std) for image in images]
-            if annotations is not None:
-                annotations = [
-                    self.normalize_annotation(annotation, get_image_size(image))
-                    for annotation, image in zip(annotations, images)
-                ]
+            images = [
+                self.normalize(image, image_mean, image_std, input_data_format=input_data_format) for image in images
+            ]
+
+        if do_convert_annotations and annotations is not None:
+            annotations = [
+                self.normalize_annotation(annotation, get_image_size(image, input_data_format))
+                for annotation, image in zip(annotations, images)
+            ]
 
         if do_pad:
             # Pads images and returns their mask: {'pixel_values': ..., 'pixel_mask': ...}
-            data = self.pad(images, return_pixel_mask=True, data_format=data_format)
+            encoded_inputs = self.pad(
+                images,
+                annotations=annotations,
+                return_pixel_mask=True,
+                data_format=data_format,
+                input_data_format=input_data_format,
+                return_tensors=return_tensors,
+                update_bboxes=do_convert_annotations,
+            )
         else:
-            images = [to_channel_dimension_format(image, data_format) for image in images]
-            data = {"pixel_values": images}
-
-        encoded_inputs = BatchFeature(data=data, tensor_type=return_tensors)
-        if annotations is not None:
-            encoded_inputs["labels"] = [
-                BatchFeature(annotation, tensor_type=return_tensors) for annotation in annotations
+            images = [
+                to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
+                for image in images
             ]
+            encoded_inputs = BatchFeature(data={"pixel_values": images}, tensor_type=return_tensors)
+            if annotations is not None:
+                encoded_inputs["labels"] = [
+                    BatchFeature(annotation, tensor_type=return_tensors) for annotation in annotations
+                ]
 
         return encoded_inputs
 
@@ -1262,10 +1416,9 @@ def post_process(self, outputs, target_sizes):
             `List[Dict]`: A list of dictionaries, each dictionary containing the scores, labels and boxes for an image
             in the batch as predicted by the model.
         """
-        warnings.warn(
+        logger.warning_once(
             "`post_process` is deprecated and will be removed in v5 of Transformers, please use"
-            " `post_process_object_detection`",
-            FutureWarning,
+            " `post_process_object_detection` instead, with `threshold=0.` for equivalent results.",
         )
 
         out_logits, out_bbox = outputs.logits, outputs.pred_boxes
@@ -1305,10 +1458,9 @@ def post_process_segmentation(self, outputs, target_sizes, threshold=0.9, mask_t
             `List[Dict]`: A list of dictionaries, each dictionary containing the scores, labels, and masks for an image
             in the batch as predicted by the model.
         """
-        warnings.warn(
+        logger.warning_once(
             "`post_process_segmentation` is deprecated and will be removed in v5 of Transformers, please use"
             " `post_process_semantic_segmentation`.",
-            FutureWarning,
         )
         out_logits, raw_masks = outputs.logits, outputs.pred_masks
         empty_label = out_logits.shape[-1] - 1
@@ -1341,8 +1493,7 @@ def post_process_instance(self, results, outputs, orig_target_sizes, max_target_
 
         Args:
             results (`List[Dict]`):
-                Results list obtained by [`~DetrFeatureExtractor.post_process`], to which "masks" results will be
-                added.
+                Results list obtained by [`~DetrImageProcessor.post_process`], to which "masks" results will be added.
             outputs ([`DetrSegmentationOutput`]):
                 Raw outputs of the model.
             orig_target_sizes (`torch.Tensor` of shape `(batch_size, 2)`):
@@ -1357,10 +1508,9 @@ def post_process_instance(self, results, outputs, orig_target_sizes, max_target_
             `List[Dict]`: A list of dictionaries, each dictionary containing the scores, labels, boxes and masks for an
             image in the batch as predicted by the model.
         """
-        warnings.warn(
+        logger.warning_once(
             "`post_process_instance` is deprecated and will be removed in v5 of Transformers, please use"
             " `post_process_instance_segmentation`.",
-            FutureWarning,
         )
 
         if len(orig_target_sizes) != len(max_target_sizes):
@@ -1404,10 +1554,9 @@ def post_process_panoptic(self, outputs, processed_sizes, target_sizes=None, is_
             `List[Dict]`: A list of dictionaries, each dictionary containing a PNG string and segments_info values for
             an image in the batch as predicted by the model.
         """
-        warnings.warn(
+        logger.warning_once(
             "`post_process_panoptic is deprecated and will be removed in v5 of Transformers, please use"
             " `post_process_panoptic_segmentation`.",
-            FutureWarning,
         )
         if target_sizes is None:
             target_sizes = processed_sizes
@@ -1562,7 +1711,7 @@ def post_process_object_detection(
             else:
                 img_h, img_w = target_sizes.unbind(1)
 
-            scale_fct = torch.stack([img_w, img_h, img_w, img_h], dim=1)
+            scale_fct = torch.stack([img_w, img_h, img_w, img_h], dim=1).to(boxes.device)
             boxes = boxes * scale_fct[:, None, :]
 
         results = []
@@ -1750,7 +1899,7 @@ def post_process_panoptic_segmentation(
         """
 
         if label_ids_to_fuse is None:
-            warnings.warn("`label_ids_to_fuse` unset. No instance will be fused.")
+            logger.warning_once("`label_ids_to_fuse` unset. No instance will be fused.")
             label_ids_to_fuse = set()
 
         class_queries_logits = outputs.logits  # [batch_size, num_queries, num_classes+1]
diff --git a/src/transformers/models/detr/modeling_detr.py b/src/transformers/models/detr/modeling_detr.py
index 814a09c37b9f91..0fa912eb1d5192 100644
--- a/src/transformers/models/detr/modeling_detr.py
+++ b/src/transformers/models/detr/modeling_detr.py
@@ -16,21 +16,21 @@
 
 
 import math
-import random
 from dataclasses import dataclass
-from typing import Dict, List, Optional, Tuple
+from typing import Dict, List, Optional, Tuple, Union
 
 import torch
 from torch import Tensor, nn
 
 from ...activations import ACT2FN
+from ...modeling_attn_mask_utils import _prepare_4d_attention_mask
 from ...modeling_outputs import BaseModelOutput, BaseModelOutputWithCrossAttentions, Seq2SeqModelOutput
 from ...modeling_utils import PreTrainedModel
-from ...pytorch_utils import torch_int_div
 from ...utils import (
     ModelOutput,
     add_start_docstrings,
     add_start_docstrings_to_model_forward,
+    is_accelerate_available,
     is_scipy_available,
     is_timm_available,
     is_vision_available,
@@ -38,10 +38,14 @@
     replace_return_docstrings,
     requires_backends,
 )
-from ..auto import AutoBackbone
+from ...utils.backbone_utils import load_backbone
 from .configuration_detr import DetrConfig
 
 
+if is_accelerate_available():
+    from accelerate import PartialState
+    from accelerate.utils import reduce
+
 if is_scipy_available():
     from scipy.optimize import linear_sum_assignment
 
@@ -306,19 +310,28 @@ def forward(self, x):
         return x * scale + bias
 
 
-def replace_batch_norm(m, name=""):
-    for attr_str in dir(m):
-        target_attr = getattr(m, attr_str)
-        if isinstance(target_attr, nn.BatchNorm2d):
-            frozen = DetrFrozenBatchNorm2d(target_attr.num_features)
-            bn = getattr(m, attr_str)
-            frozen.weight.data.copy_(bn.weight)
-            frozen.bias.data.copy_(bn.bias)
-            frozen.running_mean.data.copy_(bn.running_mean)
-            frozen.running_var.data.copy_(bn.running_var)
-            setattr(m, attr_str, frozen)
-    for n, ch in m.named_children():
-        replace_batch_norm(ch, n)
+def replace_batch_norm(model):
+    r"""
+    Recursively replace all `torch.nn.BatchNorm2d` with `DetrFrozenBatchNorm2d`.
+
+    Args:
+        model (torch.nn.Module):
+            The input model whose `BatchNorm2d` layers will be replaced.
+    """
+    for name, module in model.named_children():
+        if isinstance(module, nn.BatchNorm2d):
+            new_module = DetrFrozenBatchNorm2d(module.num_features)
+
+            if not module.weight.device == torch.device("meta"):
+                new_module.weight.data.copy_(module.weight)
+                new_module.bias.data.copy_(module.bias)
+                new_module.running_mean.data.copy_(module.running_mean)
+                new_module.running_var.data.copy_(module.running_var)
+
+            model._modules[name] = new_module
+
+        if len(list(module.children())) > 0:
+            replace_batch_norm(module)
 
 
 class DetrConvEncoder(nn.Module):
@@ -348,7 +361,7 @@ def __init__(self, config):
                 **kwargs,
             )
         else:
-            backbone = AutoBackbone.from_config(config.backbone_config)
+            backbone = load_backbone(config)
 
         # replace batch norm by frozen batch norm
         with torch.no_grad():
@@ -401,20 +414,6 @@ def forward(self, pixel_values, pixel_mask):
         return out, pos
 
 
-def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, target_len: Optional[int] = None):
-    """
-    Expands attention_mask from `[batch_size, seq_len]` to `[batch_size, 1, target_seq_len, source_seq_len]`.
-    """
-    batch_size, source_len = mask.size()
-    target_len = target_len if target_len is not None else source_len
-
-    expanded_mask = mask[:, None, None, :].expand(batch_size, 1, target_len, source_len).to(dtype)
-
-    inverted_mask = 1.0 - expanded_mask
-
-    return inverted_mask.masked_fill(inverted_mask.bool(), torch.finfo(dtype).min)
-
-
 class DetrSinePositionEmbedding(nn.Module):
     """
     This is a more standard version of the position embedding, very similar to the one used by the Attention is all you
@@ -441,8 +440,8 @@ def forward(self, pixel_values, pixel_mask):
             y_embed = y_embed / (y_embed[:, -1:, :] + 1e-6) * self.scale
             x_embed = x_embed / (x_embed[:, :, -1:] + 1e-6) * self.scale
 
-        dim_t = torch.arange(self.embedding_dim, dtype=torch.float32, device=pixel_values.device)
-        dim_t = self.temperature ** (2 * torch_int_div(dim_t, 2) / self.embedding_dim)
+        dim_t = torch.arange(self.embedding_dim, dtype=torch.int64, device=pixel_values.device).float()
+        dim_t = self.temperature ** (2 * torch.div(dim_t, 2, rounding_mode="floor") / self.embedding_dim)
 
         pos_x = x_embed[:, :, :, None] / dim_t
         pos_y = y_embed[:, :, :, None] / dim_t
@@ -500,7 +499,6 @@ def __init__(
         embed_dim: int,
         num_heads: int,
         dropout: float = 0.0,
-        is_decoder: bool = False,
         bias: bool = True,
     ):
         super().__init__()
@@ -523,34 +521,79 @@ def __init__(
     def _shape(self, tensor: torch.Tensor, seq_len: int, batch_size: int):
         return tensor.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
 
-    def with_pos_embed(self, tensor: torch.Tensor, position_embeddings: Optional[Tensor]):
-        return tensor if position_embeddings is None else tensor + position_embeddings
+    def with_pos_embed(self, tensor: torch.Tensor, object_queries: Optional[Tensor], **kwargs):
+        position_embeddings = kwargs.pop("position_embeddings", None)
+
+        if kwargs:
+            raise ValueError(f"Unexpected arguments {kwargs.keys()}")
+
+        if position_embeddings is not None and object_queries is not None:
+            raise ValueError(
+                "Cannot specify both position_embeddings and object_queries. Please use just object_queries"
+            )
+
+        if position_embeddings is not None:
+            logger.warning_once(
+                "position_embeddings has been deprecated and will be removed in v4.34. Please use object_queries instead"
+            )
+            object_queries = position_embeddings
+
+        return tensor if object_queries is None else tensor + object_queries
 
     def forward(
         self,
         hidden_states: torch.Tensor,
         attention_mask: Optional[torch.Tensor] = None,
-        position_embeddings: Optional[torch.Tensor] = None,
+        object_queries: Optional[torch.Tensor] = None,
         key_value_states: Optional[torch.Tensor] = None,
-        key_value_position_embeddings: Optional[torch.Tensor] = None,
+        spatial_position_embeddings: Optional[torch.Tensor] = None,
         output_attentions: bool = False,
+        **kwargs,
     ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
         """Input shape: Batch x Time x Channel"""
 
+        position_embeddings = kwargs.pop("position_embeddings", None)
+        key_value_position_embeddings = kwargs.pop("key_value_position_embeddings", None)
+
+        if kwargs:
+            raise ValueError(f"Unexpected arguments {kwargs.keys()}")
+
+        if position_embeddings is not None and object_queries is not None:
+            raise ValueError(
+                "Cannot specify both position_embeddings and object_queries. Please use just object_queries"
+            )
+
+        if key_value_position_embeddings is not None and spatial_position_embeddings is not None:
+            raise ValueError(
+                "Cannot specify both key_value_position_embeddings and spatial_position_embeddings. Please use just spatial_position_embeddings"
+            )
+
+        if position_embeddings is not None:
+            logger.warning_once(
+                "position_embeddings has been deprecated and will be removed in v4.34. Please use object_queries instead"
+            )
+            object_queries = position_embeddings
+
+        if key_value_position_embeddings is not None:
+            logger.warning_once(
+                "key_value_position_embeddings has been deprecated and will be removed in v4.34. Please use spatial_position_embeddings instead"
+            )
+            spatial_position_embeddings = key_value_position_embeddings
+
         # if key_value_states are provided this layer is used as a cross-attention layer
         # for the decoder
         is_cross_attention = key_value_states is not None
         batch_size, target_len, embed_dim = hidden_states.size()
 
         # add position embeddings to the hidden states before projecting to queries and keys
-        if position_embeddings is not None:
+        if object_queries is not None:
             hidden_states_original = hidden_states
-            hidden_states = self.with_pos_embed(hidden_states, position_embeddings)
+            hidden_states = self.with_pos_embed(hidden_states, object_queries)
 
         # add key-value position embeddings to the key value states
-        if key_value_position_embeddings is not None:
+        if spatial_position_embeddings is not None:
             key_value_states_original = key_value_states
-            key_value_states = self.with_pos_embed(key_value_states, key_value_position_embeddings)
+            key_value_states = self.with_pos_embed(key_value_states, spatial_position_embeddings)
 
         # get query proj
         query_states = self.q_proj(hidden_states) * self.scaling
@@ -640,25 +683,43 @@ def forward(
         self,
         hidden_states: torch.Tensor,
         attention_mask: torch.Tensor,
-        position_embeddings: torch.Tensor = None,
+        object_queries: torch.Tensor = None,
         output_attentions: bool = False,
+        **kwargs,
     ):
         """
         Args:
-            hidden_states (`torch.FloatTensor`): input to the layer of shape `(seq_len, batch, embed_dim)`
+            hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
             attention_mask (`torch.FloatTensor`): attention mask of size
                 `(batch, 1, target_len, source_len)` where padding elements are indicated by very large negative
                 values.
-            position_embeddings (`torch.FloatTensor`, *optional*): position embeddings, to be added to hidden_states.
+            object_queries (`torch.FloatTensor`, *optional*):
+                Object queries (also called content embeddings), to be added to the hidden states.
             output_attentions (`bool`, *optional*):
                 Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                 returned tensors for more detail.
         """
+        position_embeddings = kwargs.pop("position_embeddings", None)
+
+        if kwargs:
+            raise ValueError(f"Unexpected arguments {kwargs.keys()}")
+
+        if position_embeddings is not None and object_queries is not None:
+            raise ValueError(
+                "Cannot specify both position_embeddings and object_queries. Please use just object_queries"
+            )
+
+        if position_embeddings is not None:
+            logger.warning_once(
+                "position_embeddings has been deprecated and will be removed in v4.34. Please use object_queries instead"
+            )
+            object_queries = position_embeddings
+
         residual = hidden_states
         hidden_states, attn_weights = self.self_attn(
             hidden_states=hidden_states,
             attention_mask=attention_mask,
-            position_embeddings=position_embeddings,
+            object_queries=object_queries,
             output_attentions=output_attentions,
         )
 
@@ -698,7 +759,6 @@ def __init__(self, config: DetrConfig):
             embed_dim=self.embed_dim,
             num_heads=config.decoder_attention_heads,
             dropout=config.attention_dropout,
-            is_decoder=True,
         )
         self.dropout = config.dropout
         self.activation_fn = ACT2FN[config.activation_function]
@@ -709,7 +769,6 @@ def __init__(self, config: DetrConfig):
             self.embed_dim,
             config.decoder_attention_heads,
             dropout=config.attention_dropout,
-            is_decoder=True,
         )
         self.encoder_attn_layer_norm = nn.LayerNorm(self.embed_dim)
         self.fc1 = nn.Linear(self.embed_dim, config.decoder_ffn_dim)
@@ -720,26 +779,27 @@ def forward(
         self,
         hidden_states: torch.Tensor,
         attention_mask: Optional[torch.Tensor] = None,
-        position_embeddings: Optional[torch.Tensor] = None,
+        object_queries: Optional[torch.Tensor] = None,
         query_position_embeddings: Optional[torch.Tensor] = None,
         encoder_hidden_states: Optional[torch.Tensor] = None,
         encoder_attention_mask: Optional[torch.Tensor] = None,
         output_attentions: Optional[bool] = False,
+        **kwargs,
     ):
         """
         Args:
-            hidden_states (`torch.FloatTensor`): input to the layer of shape `(seq_len, batch, embed_dim)`
+            hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
             attention_mask (`torch.FloatTensor`): attention mask of size
                 `(batch, 1, target_len, source_len)` where padding elements are indicated by very large negative
                 values.
-            position_embeddings (`torch.FloatTensor`, *optional*):
-                position embeddings that are added to the queries and keys
+            object_queries (`torch.FloatTensor`, *optional*):
+                object_queries that are added to the hidden states
             in the cross-attention layer.
             query_position_embeddings (`torch.FloatTensor`, *optional*):
                 position embeddings that are added to the queries and keys
             in the self-attention layer.
             encoder_hidden_states (`torch.FloatTensor`):
-                cross attention input to the layer of shape `(seq_len, batch, embed_dim)`
+                cross attention input to the layer of shape `(batch, seq_len, embed_dim)`
             encoder_attention_mask (`torch.FloatTensor`): encoder attention mask of size
                 `(batch, 1, target_len, source_len)` where padding elements are indicated by very large negative
                 values.
@@ -747,12 +807,28 @@ def forward(
                 Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                 returned tensors for more detail.
         """
+        position_embeddings = kwargs.pop("position_embeddings", None)
+
+        if kwargs:
+            raise ValueError(f"Unexpected arguments {kwargs.keys()}")
+
+        if position_embeddings is not None and object_queries is not None:
+            raise ValueError(
+                "Cannot specify both position_embeddings and object_queries. Please use just object_queries"
+            )
+
+        if position_embeddings is not None:
+            logger.warning_once(
+                "position_embeddings has been deprecated and will be removed in v4.34. Please use object_queries instead"
+            )
+            object_queries = position_embeddings
+
         residual = hidden_states
 
         # Self Attention
         hidden_states, self_attn_weights = self.self_attn(
             hidden_states=hidden_states,
-            position_embeddings=query_position_embeddings,
+            object_queries=query_position_embeddings,
             attention_mask=attention_mask,
             output_attentions=output_attentions,
         )
@@ -768,10 +844,10 @@ def forward(
 
             hidden_states, cross_attn_weights = self.encoder_attn(
                 hidden_states=hidden_states,
-                position_embeddings=query_position_embeddings,
+                object_queries=query_position_embeddings,
                 key_value_states=encoder_hidden_states,
                 attention_mask=encoder_attention_mask,
-                key_value_position_embeddings=position_embeddings,
+                spatial_position_embeddings=object_queries,
                 output_attentions=output_attentions,
             )
 
@@ -818,6 +894,7 @@ class DetrPreTrainedModel(PreTrainedModel):
     config_class = DetrConfig
     base_model_prefix = "model"
     main_input_name = "pixel_values"
+    _no_split_modules = [r"DetrConvEncoder", r"DetrEncoderLayer", r"DetrDecoderLayer"]
 
     def _init_weights(self, module):
         std = self.config.init_std
@@ -842,10 +919,6 @@ def _init_weights(self, module):
             if module.padding_idx is not None:
                 module.weight.data[module.padding_idx].zero_()
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, DetrDecoder):
-            module.gradient_checkpointing = value
-
 
 DETR_START_DOCSTRING = r"""
     This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
@@ -878,7 +951,7 @@ def _set_gradient_checkpointing(self, module, value=False):
 
             [What are attention masks?](../glossary#attention-mask)
 
-        decoder_attention_mask (`torch.LongTensor` of shape `(batch_size, num_queries)`, *optional*):
+        decoder_attention_mask (`torch.FloatTensor` of shape `(batch_size, num_queries)`, *optional*):
             Not used by default. Can be used to mask object queries.
         encoder_outputs (`tuple(tuple(torch.FloatTensor)`, *optional*):
             Tuple consists of (`last_hidden_state`, *optional*: `hidden_states`, *optional*: `attentions`)
@@ -910,7 +983,7 @@ class DetrEncoder(DetrPreTrainedModel):
 
     Small tweak for DETR:
 
-    - position_embeddings are added to the forward pass.
+    - object_queries are added to the forward pass.
 
     Args:
         config: DetrConfig
@@ -933,10 +1006,11 @@ def forward(
         self,
         inputs_embeds=None,
         attention_mask=None,
-        position_embeddings=None,
+        object_queries=None,
         output_attentions=None,
         output_hidden_states=None,
         return_dict=None,
+        **kwargs,
     ):
         r"""
         Args:
@@ -951,8 +1025,8 @@ def forward(
 
                 [What are attention masks?](../glossary#attention-mask)
 
-            position_embeddings (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
-                Position embeddings that are added to the queries and keys in each self-attention layer.
+            object_queries (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
+                Object queries that are added to the queries in each self-attention layer.
 
             output_attentions (`bool`, *optional*):
                 Whether or not to return the attentions tensors of all attention layers. See `attentions` under
@@ -963,6 +1037,22 @@ def forward(
             return_dict (`bool`, *optional*):
                 Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
         """
+        position_embeddings = kwargs.pop("position_embeddings", None)
+
+        if kwargs:
+            raise ValueError(f"Unexpected arguments {kwargs.keys()}")
+
+        if position_embeddings is not None and object_queries is not None:
+            raise ValueError(
+                "Cannot specify both position_embeddings and object_queries. Please use just object_queries"
+            )
+
+        if position_embeddings is not None:
+            logger.warning_once(
+                "position_embeddings has been deprecated and will be removed in v4.34. Please use object_queries instead"
+            )
+            object_queries = position_embeddings
+
         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
         output_hidden_states = (
             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
@@ -975,7 +1065,7 @@ def forward(
         # expand attention_mask
         if attention_mask is not None:
             # [batch_size, seq_len] -> [batch_size, 1, target_seq_len, source_seq_len]
-            attention_mask = _expand_mask(attention_mask, inputs_embeds.dtype)
+            attention_mask = _prepare_4d_attention_mask(attention_mask, inputs_embeds.dtype)
 
         encoder_states = () if output_hidden_states else None
         all_attentions = () if output_attentions else None
@@ -983,15 +1073,20 @@ def forward(
             if output_hidden_states:
                 encoder_states = encoder_states + (hidden_states,)
             # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
-            dropout_probability = random.uniform(0, 1)
-            if self.training and (dropout_probability < self.layerdrop):  # skip the layer
+            to_drop = False
+            if self.training:
+                dropout_probability = torch.rand([])
+                if dropout_probability < self.layerdrop:  # skip the layer
+                    to_drop = True
+
+            if to_drop:
                 layer_outputs = (None, None)
             else:
-                # we add position_embeddings as extra input to the encoder_layer
+                # we add object_queries as extra input to the encoder_layer
                 layer_outputs = encoder_layer(
                     hidden_states,
                     attention_mask,
-                    position_embeddings=position_embeddings,
+                    object_queries=object_queries,
                     output_attentions=output_attentions,
                 )
 
@@ -1018,7 +1113,7 @@ class DetrDecoder(DetrPreTrainedModel):
 
     Some small tweaks for DETR:
 
-    - position_embeddings and query_position_embeddings are added to the forward pass.
+    - object_queries and query_position_embeddings are added to the forward pass.
     - if self.config.auxiliary_loss is set to True, also returns a stack of activations from all decoding layers.
 
     Args:
@@ -1044,11 +1139,12 @@ def forward(
         attention_mask=None,
         encoder_hidden_states=None,
         encoder_attention_mask=None,
-        position_embeddings=None,
+        object_queries=None,
         query_position_embeddings=None,
         output_attentions=None,
         output_hidden_states=None,
         return_dict=None,
+        **kwargs,
     ):
         r"""
         Args:
@@ -1072,10 +1168,11 @@ def forward(
                 - 1 for pixels that are real (i.e. **not masked**),
                 - 0 for pixels that are padding (i.e. **masked**).
 
-            position_embeddings (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
-                Position embeddings that are added to the queries and keys in each cross-attention layer.
+            object_queries (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
+                Object queries that are added to the queries and keys in each cross-attention layer.
             query_position_embeddings (`torch.FloatTensor` of shape `(batch_size, num_queries, hidden_size)`):
-                , *optional*): Position embeddings that are added to the queries and keys in each self-attention layer.
+                , *optional*): Position embeddings that are added to the values and keys in each self-attention layer.
+
             output_attentions (`bool`, *optional*):
                 Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                 returned tensors for more detail.
@@ -1085,6 +1182,22 @@ def forward(
             return_dict (`bool`, *optional*):
                 Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
         """
+        position_embeddings = kwargs.pop("position_embeddings", None)
+
+        if kwargs:
+            raise ValueError(f"Unexpected arguments {kwargs.keys()}")
+
+        if position_embeddings is not None and object_queries is not None:
+            raise ValueError(
+                "Cannot specify both position_embeddings and object_queries. Please use just object_queries"
+            )
+
+        if position_embeddings is not None:
+            logger.warning_once(
+                "position_embeddings has been deprecated and will be removed in v4.34. Please use object_queries instead"
+            )
+            object_queries = position_embeddings
+
         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
         output_hidden_states = (
             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
@@ -1099,15 +1212,15 @@ def forward(
 
         if attention_mask is not None and combined_attention_mask is not None:
             # [batch_size, seq_len] -> [batch_size, 1, target_seq_len, source_seq_len]
-            combined_attention_mask = combined_attention_mask + _expand_mask(
-                attention_mask, inputs_embeds.dtype, target_len=input_shape[-1]
+            combined_attention_mask = combined_attention_mask + _prepare_4d_attention_mask(
+                attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]
             )
 
         # expand encoder attention mask
         if encoder_hidden_states is not None and encoder_attention_mask is not None:
             # [batch_size, seq_len] -> [batch_size, 1, target_seq_len, source_seq_len]
-            encoder_attention_mask = _expand_mask(
-                encoder_attention_mask, inputs_embeds.dtype, target_len=input_shape[-1]
+            encoder_attention_mask = _prepare_4d_attention_mask(
+                encoder_attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]
             )
 
         # optional intermediate hidden states
@@ -1122,20 +1235,14 @@ def forward(
             # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
             if output_hidden_states:
                 all_hidden_states += (hidden_states,)
-            dropout_probability = random.uniform(0, 1)
-            if self.training and (dropout_probability < self.layerdrop):
-                continue
+            if self.training:
+                dropout_probability = torch.rand([])
+                if dropout_probability < self.layerdrop:
+                    continue
 
             if self.gradient_checkpointing and self.training:
-
-                def create_custom_forward(module):
-                    def custom_forward(*inputs):
-                        return module(*inputs, output_attentions)
-
-                    return custom_forward
-
-                layer_outputs = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(decoder_layer),
+                layer_outputs = self._gradient_checkpointing_func(
+                    decoder_layer.__call__,
                     hidden_states,
                     combined_attention_mask,
                     encoder_hidden_states,
@@ -1146,7 +1253,7 @@ def custom_forward(*inputs):
                 layer_outputs = decoder_layer(
                     hidden_states,
                     attention_mask=combined_attention_mask,
-                    position_embeddings=position_embeddings,
+                    object_queries=object_queries,
                     query_position_embeddings=query_position_embeddings,
                     encoder_hidden_states=encoder_hidden_states,
                     encoder_attention_mask=encoder_attention_mask,
@@ -1204,8 +1311,8 @@ def __init__(self, config: DetrConfig):
 
         # Create backbone + positional encoding
         backbone = DetrConvEncoder(config)
-        position_embeddings = build_position_encoding(config)
-        self.backbone = DetrConvModel(backbone, position_embeddings)
+        object_queries = build_position_encoding(config)
+        self.backbone = DetrConvModel(backbone, object_queries)
 
         # Create projection layer
         self.input_projection = nn.Conv2d(backbone.intermediate_channel_sizes[-1], config.d_model, kernel_size=1)
@@ -1236,16 +1343,16 @@ def unfreeze_backbone(self):
     @replace_return_docstrings(output_type=DetrModelOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
-        pixel_values,
-        pixel_mask=None,
-        decoder_attention_mask=None,
-        encoder_outputs=None,
-        inputs_embeds=None,
-        decoder_inputs_embeds=None,
-        output_attentions=None,
-        output_hidden_states=None,
-        return_dict=None,
-    ):
+        pixel_values: torch.FloatTensor,
+        pixel_mask: Optional[torch.LongTensor] = None,
+        decoder_attention_mask: Optional[torch.FloatTensor] = None,
+        encoder_outputs: Optional[torch.FloatTensor] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        decoder_inputs_embeds: Optional[torch.FloatTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple[torch.FloatTensor], DetrModelOutput]:
         r"""
         Returns:
 
@@ -1289,7 +1396,7 @@ def forward(
         # First, sent pixel_values + pixel_mask through Backbone to obtain the features
         # pixel_values should be of shape (batch_size, num_channels, height, width)
         # pixel_mask should be of shape (batch_size, height, width)
-        features, position_embeddings_list = self.backbone(pixel_values, pixel_mask)
+        features, object_queries_list = self.backbone(pixel_values, pixel_mask)
 
         # get final feature map and downsampled mask
         feature_map, mask = features[-1]
@@ -1303,7 +1410,7 @@ def forward(
         # Third, flatten the feature map + position embeddings of shape NxCxHxW to NxCxHW, and permute it to NxHWxC
         # In other words, turn their shape into (batch_size, sequence_length, hidden_size)
         flattened_features = projected_feature_map.flatten(2).permute(0, 2, 1)
-        position_embeddings = position_embeddings_list[-1].flatten(2).permute(0, 2, 1)
+        object_queries = object_queries_list[-1].flatten(2).permute(0, 2, 1)
 
         flattened_mask = mask.flatten(1)
 
@@ -1314,7 +1421,7 @@ def forward(
             encoder_outputs = self.encoder(
                 inputs_embeds=flattened_features,
                 attention_mask=flattened_mask,
-                position_embeddings=position_embeddings,
+                object_queries=object_queries,
                 output_attentions=output_attentions,
                 output_hidden_states=output_hidden_states,
                 return_dict=return_dict,
@@ -1327,7 +1434,7 @@ def forward(
                 attentions=encoder_outputs[2] if len(encoder_outputs) > 2 else None,
             )
 
-        # Fifth, sent query embeddings + position embeddings through the decoder (which is conditioned on the encoder output)
+        # Fifth, send query embeddings + object_queries through the decoder (which is conditioned on the encoder output)
         query_position_embeddings = self.query_position_embeddings.weight.unsqueeze(0).repeat(batch_size, 1, 1)
         queries = torch.zeros_like(query_position_embeddings)
 
@@ -1335,7 +1442,7 @@ def forward(
         decoder_outputs = self.decoder(
             inputs_embeds=queries,
             attention_mask=None,
-            position_embeddings=position_embeddings,
+            object_queries=object_queries,
             query_position_embeddings=query_position_embeddings,
             encoder_hidden_states=encoder_outputs[0],
             encoder_attention_mask=flattened_mask,
@@ -1396,17 +1503,17 @@ def _set_aux_loss(self, outputs_class, outputs_coord):
     @replace_return_docstrings(output_type=DetrObjectDetectionOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
-        pixel_values,
-        pixel_mask=None,
-        decoder_attention_mask=None,
-        encoder_outputs=None,
-        inputs_embeds=None,
-        decoder_inputs_embeds=None,
-        labels=None,
-        output_attentions=None,
-        output_hidden_states=None,
-        return_dict=None,
-    ):
+        pixel_values: torch.FloatTensor,
+        pixel_mask: Optional[torch.LongTensor] = None,
+        decoder_attention_mask: Optional[torch.FloatTensor] = None,
+        encoder_outputs: Optional[torch.FloatTensor] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        decoder_inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[List[dict]] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple[torch.FloatTensor], DetrObjectDetectionOutput]:
         r"""
         labels (`List[Dict]` of len `(batch_size,)`, *optional*):
             Labels for computing the bipartite matching loss. List of dicts, each dictionary containing at least the
@@ -1433,7 +1540,7 @@ def forward(
         >>> inputs = image_processor(images=image, return_tensors="pt")
         >>> outputs = model(**inputs)
 
-        >>> # convert outputs (bounding boxes and class logits) to COCO API
+        >>> # convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax)
         >>> target_sizes = torch.tensor([image.size[::-1]])
         >>> results = image_processor.post_process_object_detection(outputs, threshold=0.9, target_sizes=target_sizes)[
         ...     0
@@ -1566,17 +1673,17 @@ def __init__(self, config: DetrConfig):
     @replace_return_docstrings(output_type=DetrSegmentationOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
         self,
-        pixel_values,
-        pixel_mask=None,
-        decoder_attention_mask=None,
-        encoder_outputs=None,
-        inputs_embeds=None,
-        decoder_inputs_embeds=None,
-        labels=None,
-        output_attentions=None,
-        output_hidden_states=None,
-        return_dict=None,
-    ):
+        pixel_values: torch.FloatTensor,
+        pixel_mask: Optional[torch.LongTensor] = None,
+        decoder_attention_mask: Optional[torch.FloatTensor] = None,
+        encoder_outputs: Optional[torch.FloatTensor] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        decoder_inputs_embeds: Optional[torch.FloatTensor] = None,
+        labels: Optional[List[dict]] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple[torch.FloatTensor], DetrSegmentationOutput]:
         r"""
         labels (`List[Dict]` of len `(batch_size,)`, *optional*):
             Labels for computing the bipartite matching loss, DICE/F-1 loss and Focal loss. List of dicts, each
@@ -1631,7 +1738,7 @@ def forward(
             pixel_mask = torch.ones((batch_size, height, width), device=device)
 
         # First, get list of feature maps and position embeddings
-        features, position_embeddings_list = self.detr.model.backbone(pixel_values, pixel_mask=pixel_mask)
+        features, object_queries_list = self.detr.model.backbone(pixel_values, pixel_mask=pixel_mask)
 
         # Second, apply 1x1 convolution to reduce the channel dimension to d_model (256 by default)
         feature_map, mask = features[-1]
@@ -1641,7 +1748,7 @@ def forward(
         # Third, flatten the feature map + position embeddings of shape NxCxHxW to NxCxHW, and permute it to NxHWxC
         # In other words, turn their shape into (batch_size, sequence_length, hidden_size)
         flattened_features = projected_feature_map.flatten(2).permute(0, 2, 1)
-        position_embeddings = position_embeddings_list[-1].flatten(2).permute(0, 2, 1)
+        object_queries = object_queries_list[-1].flatten(2).permute(0, 2, 1)
 
         flattened_mask = mask.flatten(1)
 
@@ -1652,7 +1759,7 @@ def forward(
             encoder_outputs = self.detr.model.encoder(
                 inputs_embeds=flattened_features,
                 attention_mask=flattened_mask,
-                position_embeddings=position_embeddings,
+                object_queries=object_queries,
                 output_attentions=output_attentions,
                 output_hidden_states=output_hidden_states,
                 return_dict=return_dict,
@@ -1675,7 +1782,7 @@ def forward(
         decoder_outputs = self.detr.model.decoder(
             inputs_embeds=queries,
             attention_mask=None,
-            position_embeddings=position_embeddings,
+            object_queries=object_queries,
             query_position_embeddings=query_position_embeddings,
             encoder_hidden_states=encoder_outputs[0],
             encoder_attention_mask=flattened_mask,
@@ -1724,9 +1831,9 @@ def forward(
             outputs_loss["pred_masks"] = pred_masks
             if self.config.auxiliary_loss:
                 intermediate = decoder_outputs.intermediate_hidden_states if return_dict else decoder_outputs[-1]
-                outputs_class = self.class_labels_classifier(intermediate)
-                outputs_coord = self.bbox_predictor(intermediate).sigmoid()
-                auxiliary_outputs = self._set_aux_loss(outputs_class, outputs_coord)
+                outputs_class = self.detr.class_labels_classifier(intermediate)
+                outputs_coord = self.detr.bbox_predictor(intermediate).sigmoid()
+                auxiliary_outputs = self.detr._set_aux_loss(outputs_class, outputs_coord)
                 outputs_loss["auxiliary_outputs"] = auxiliary_outputs
 
             loss_dict = criterion(outputs_loss, labels)
@@ -1790,13 +1897,13 @@ def __init__(self, dim, fpn_dims, context_dim):
         self.lay1 = nn.Conv2d(dim, dim, 3, padding=1)
         self.gn1 = nn.GroupNorm(8, dim)
         self.lay2 = nn.Conv2d(dim, inter_dims[1], 3, padding=1)
-        self.gn2 = nn.GroupNorm(8, inter_dims[1])
+        self.gn2 = nn.GroupNorm(min(8, inter_dims[1]), inter_dims[1])
         self.lay3 = nn.Conv2d(inter_dims[1], inter_dims[2], 3, padding=1)
-        self.gn3 = nn.GroupNorm(8, inter_dims[2])
+        self.gn3 = nn.GroupNorm(min(8, inter_dims[2]), inter_dims[2])
         self.lay4 = nn.Conv2d(inter_dims[2], inter_dims[3], 3, padding=1)
-        self.gn4 = nn.GroupNorm(8, inter_dims[3])
+        self.gn4 = nn.GroupNorm(min(8, inter_dims[3]), inter_dims[3])
         self.lay5 = nn.Conv2d(inter_dims[3], inter_dims[4], 3, padding=1)
-        self.gn5 = nn.GroupNorm(8, inter_dims[4])
+        self.gn5 = nn.GroupNorm(min(8, inter_dims[4]), inter_dims[4])
         self.out_lay = nn.Conv2d(inter_dims[4], 1, 3, padding=1)
 
         self.dim = dim
@@ -2102,11 +2209,11 @@ def forward(self, outputs, targets):
         # Compute the average number of target boxes across all nodes, for normalization purposes
         num_boxes = sum(len(t["class_labels"]) for t in targets)
         num_boxes = torch.as_tensor([num_boxes], dtype=torch.float, device=next(iter(outputs.values())).device)
-        # (Niels): comment out function below, distributed training to be added
-        # if is_dist_avail_and_initialized():
-        #     torch.distributed.all_reduce(num_boxes)
-        # (Niels) in original implementation, num_boxes is divided by get_world_size()
-        num_boxes = torch.clamp(num_boxes, min=1).item()
+        world_size = 1
+        if PartialState._shared_state != {}:
+            num_boxes = reduce(num_boxes)
+            world_size = PartialState().num_processes
+        num_boxes = torch.clamp(num_boxes / world_size, min=1).item()
 
         # Compute all the requested losses
         losses = {}
diff --git a/src/transformers/models/dinat/configuration_dinat.py b/src/transformers/models/dinat/configuration_dinat.py
index 1d60628f3de6a5..83c3227f66b247 100644
--- a/src/transformers/models/dinat/configuration_dinat.py
+++ b/src/transformers/models/dinat/configuration_dinat.py
@@ -16,6 +16,7 @@
 
 from ...configuration_utils import PretrainedConfig
 from ...utils import logging
+from ...utils.backbone_utils import BackboneConfigMixin, get_aligned_output_features_output_indices
 
 
 logger = logging.get_logger(__name__)
@@ -26,7 +27,7 @@
 }
 
 
-class DinatConfig(PretrainedConfig):
+class DinatConfig(BackboneConfigMixin, PretrainedConfig):
     r"""
     This is the configuration class to store the configuration of a [`DinatModel`]. It is used to instantiate a Dinat
     model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
@@ -43,9 +44,9 @@ class DinatConfig(PretrainedConfig):
             The number of input channels.
         embed_dim (`int`, *optional*, defaults to 64):
             Dimensionality of patch embedding.
-        depths (`List[int]`, *optional*, defaults to `[2, 2, 6, 2]`):
+        depths (`List[int]`, *optional*, defaults to `[3, 4, 6, 5]`):
             Number of layers in each level of the encoder.
-        num_heads (`List[int]`, *optional*, defaults to `[3, 6, 12, 24]`):
+        num_heads (`List[int]`, *optional*, defaults to `[2, 4, 8, 16]`):
             Number of attention heads in each layer of the Transformer encoder.
         kernel_size (`int`, *optional*, defaults to 7):
             Neighborhood Attention kernel size.
@@ -66,13 +67,20 @@ class DinatConfig(PretrainedConfig):
             `"selu"` and `"gelu_new"` are supported.
         initializer_range (`float`, *optional*, defaults to 0.02):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
-        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
+        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
             The epsilon used by the layer normalization layers.
         layer_scale_init_value (`float`, *optional*, defaults to 0.0):
             The initial value for the layer scale. Disabled if <=0.
         out_features (`List[str]`, *optional*):
             If used as backbone, list of features to output. Can be any of `"stem"`, `"stage1"`, `"stage2"`, etc.
-            (depending on how many stages the model has). Will default to the last stage if unset.
+            (depending on how many stages the model has). If unset and `out_indices` is set, will default to the
+            corresponding stages. If unset and `out_indices` is unset, will default to the last stage. Must be in the
+            same order as defined in the `stage_names` attribute.
+        out_indices (`List[int]`, *optional*):
+            If used as backbone, list of indices of features to output. Can be any of 0, 1, 2, etc. (depending on how
+            many stages the model has). If unset and `out_features` is set, will default to the corresponding stages.
+            If unset and `out_features` is unset, will default to the last stage. Must be in the
+            same order as defined in the `stage_names` attribute.
 
     Example:
 
@@ -88,6 +96,7 @@ class DinatConfig(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "dinat"
 
     attribute_map = {
@@ -114,6 +123,7 @@ def __init__(
         layer_norm_eps=1e-5,
         layer_scale_init_value=0.0,
         out_features=None,
+        out_indices=None,
         **kwargs,
     ):
         super().__init__(**kwargs)
@@ -139,12 +149,6 @@ def __init__(
         self.hidden_size = int(embed_dim * 2 ** (len(depths) - 1))
         self.layer_scale_init_value = layer_scale_init_value
         self.stage_names = ["stem"] + [f"stage{idx}" for idx in range(1, len(depths) + 1)]
-        if out_features is not None:
-            if not isinstance(out_features, list):
-                raise ValueError("out_features should be a list")
-            for feature in out_features:
-                if feature not in self.stage_names:
-                    raise ValueError(
-                        f"Feature {feature} is not a valid feature name. Valid names are {self.stage_names}"
-                    )
-        self.out_features = out_features
+        self._out_features, self._out_indices = get_aligned_output_features_output_indices(
+            out_features=out_features, out_indices=out_indices, stage_names=self.stage_names
+        )
diff --git a/src/transformers/models/dinat/modeling_dinat.py b/src/transformers/models/dinat/modeling_dinat.py
index ef19005834ac12..71470efece28c1 100644
--- a/src/transformers/models/dinat/modeling_dinat.py
+++ b/src/transformers/models/dinat/modeling_dinat.py
@@ -26,7 +26,7 @@
 
 from ...activations import ACT2FN
 from ...modeling_outputs import BackboneOutput
-from ...modeling_utils import BackboneMixin, PreTrainedModel
+from ...modeling_utils import PreTrainedModel
 from ...pytorch_utils import find_pruneable_heads_and_indices, prune_linear_layer
 from ...utils import (
     ModelOutput,
@@ -39,6 +39,7 @@
     replace_return_docstrings,
     requires_backends,
 )
+from ...utils.backbone_utils import BackboneMixin
 from .configuration_dinat import DinatConfig
 
 
@@ -104,9 +105,9 @@ class DinatEncoderOutput(ModelOutput):
     """
 
     last_hidden_state: torch.FloatTensor = None
-    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
-    attentions: Optional[Tuple[torch.FloatTensor]] = None
-    reshaped_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
+    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
+    reshaped_hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
 
 
 @dataclass
@@ -141,9 +142,9 @@ class DinatModelOutput(ModelOutput):
 
     last_hidden_state: torch.FloatTensor = None
     pooler_output: Optional[torch.FloatTensor] = None
-    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
-    attentions: Optional[Tuple[torch.FloatTensor]] = None
-    reshaped_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
+    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
+    reshaped_hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
 
 
 @dataclass
@@ -178,9 +179,9 @@ class DinatImageClassifierOutput(ModelOutput):
 
     loss: Optional[torch.FloatTensor] = None
     logits: torch.FloatTensor = None
-    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
-    attentions: Optional[Tuple[torch.FloatTensor]] = None
-    reshaped_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
+    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
+    reshaped_hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
 
 
 # Copied from transformers.models.nat.modeling_nat.NatEmbeddings with Nat->Dinat
@@ -268,7 +269,7 @@ def forward(self, input_feature: torch.Tensor) -> torch.Tensor:
 
 
 # Copied from transformers.models.beit.modeling_beit.drop_path
-def drop_path(input, drop_prob=0.0, training=False, scale_by_keep=True):
+def drop_path(input: torch.Tensor, drop_prob: float = 0.0, training: bool = False) -> torch.Tensor:
     """
     Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
 
@@ -347,7 +348,7 @@ def forward(
         query_layer = query_layer / math.sqrt(self.attention_head_size)
 
         # Compute NA between "query" and "key" to get the raw attention scores, and add relative positional biases.
-        attention_scores = natten2dqkrpb(query_layer, key_layer, self.rpb, self.dilation)
+        attention_scores = natten2dqkrpb(query_layer, key_layer, self.rpb, self.kernel_size, self.dilation)
 
         # Normalize the attention scores to probabilities.
         attention_probs = nn.functional.softmax(attention_scores, dim=-1)
@@ -356,7 +357,7 @@ def forward(
         # seem a bit unusual, but is taken from the original Transformer paper.
         attention_probs = self.dropout(attention_probs)
 
-        context_layer = natten2dav(attention_probs, value_layer, self.dilation)
+        context_layer = natten2dav(attention_probs, value_layer, self.kernel_size, self.dilation)
         context_layer = context_layer.permute(0, 2, 3, 1, 4).contiguous()
         new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
         context_layer = context_layer.view(new_context_layer_shape)
@@ -659,9 +660,6 @@ def _init_weights(self, module):
             module.bias.data.zero_()
             module.weight.data.fill_(1.0)
 
-    def _set_gradient_checkpointing(self, module: DinatEncoder, value: bool = False) -> None:
-        pass
-
 
 DINAT_START_DOCSTRING = r"""
     This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) sub-class. Use
@@ -882,25 +880,17 @@ def forward(
 class DinatBackbone(DinatPreTrainedModel, BackboneMixin):
     def __init__(self, config):
         super().__init__(config)
+        super()._init_backbone(config)
 
         requires_backends(self, ["natten"])
 
-        self.stage_names = config.stage_names
-
         self.embeddings = DinatEmbeddings(config)
         self.encoder = DinatEncoder(config)
-
-        self.out_features = config.out_features if config.out_features is not None else [self.stage_names[-1]]
-
-        num_features = [int(config.embed_dim * 2**i) for i in range(len(config.depths))]
-        self.out_feature_channels = {}
-        self.out_feature_channels["stem"] = config.embed_dim
-        for i, stage in enumerate(self.stage_names[1:]):
-            self.out_feature_channels[stage] = num_features[i]
+        self.num_features = [config.embed_dim] + [int(config.embed_dim * 2**i) for i in range(len(config.depths))]
 
         # Add layer norms to hidden states of out_features
-        hidden_states_norms = dict()
-        for stage, num_channels in zip(self.out_features, self.channels):
+        hidden_states_norms = {}
+        for stage, num_channels in zip(self._out_features, self.channels):
             hidden_states_norms[stage] = nn.LayerNorm(num_channels)
         self.hidden_states_norms = nn.ModuleDict(hidden_states_norms)
 
@@ -910,10 +900,6 @@ def __init__(self, config):
     def get_input_embeddings(self):
         return self.embeddings.patch_embeddings
 
-    @property
-    def channels(self):
-        return [self.out_feature_channels[name] for name in self.out_features]
-
     @add_start_docstrings_to_model_forward(DINAT_INPUTS_DOCSTRING)
     @replace_return_docstrings(output_type=BackboneOutput, config_class=_CONFIG_FOR_DOC)
     def forward(
diff --git a/src/transformers/models/dinov2/__init__.py b/src/transformers/models/dinov2/__init__.py
new file mode 100644
index 00000000000000..01d02a9e65fda0
--- /dev/null
+++ b/src/transformers/models/dinov2/__init__.py
@@ -0,0 +1,61 @@
+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import TYPE_CHECKING
+
+from ...utils import (
+    OptionalDependencyNotAvailable,
+    _LazyModule,
+    is_torch_available,
+)
+
+
+_import_structure = {
+    "configuration_dinov2": ["DINOV2_PRETRAINED_CONFIG_ARCHIVE_MAP", "Dinov2Config", "Dinov2OnnxConfig"]
+}
+
+try:
+    if not is_torch_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["modeling_dinov2"] = [
+        "DINOV2_PRETRAINED_MODEL_ARCHIVE_LIST",
+        "Dinov2ForImageClassification",
+        "Dinov2Model",
+        "Dinov2PreTrainedModel",
+        "Dinov2Backbone",
+    ]
+
+if TYPE_CHECKING:
+    from .configuration_dinov2 import DINOV2_PRETRAINED_CONFIG_ARCHIVE_MAP, Dinov2Config, Dinov2OnnxConfig
+
+    try:
+        if not is_torch_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .modeling_dinov2 import (
+            DINOV2_PRETRAINED_MODEL_ARCHIVE_LIST,
+            Dinov2Backbone,
+            Dinov2ForImageClassification,
+            Dinov2Model,
+            Dinov2PreTrainedModel,
+        )
+
+else:
+    import sys
+
+    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
diff --git a/src/transformers/models/dinov2/configuration_dinov2.py b/src/transformers/models/dinov2/configuration_dinov2.py
new file mode 100644
index 00000000000000..037f889ebf2a8c
--- /dev/null
+++ b/src/transformers/models/dinov2/configuration_dinov2.py
@@ -0,0 +1,176 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" DINOv2 model configuration"""
+
+from collections import OrderedDict
+from typing import Mapping
+
+from packaging import version
+
+from ...configuration_utils import PretrainedConfig
+from ...onnx import OnnxConfig
+from ...utils import logging
+from ...utils.backbone_utils import BackboneConfigMixin, get_aligned_output_features_output_indices
+
+
+logger = logging.get_logger(__name__)
+
+DINOV2_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    "facebook/dinov2-base": "https://huggingface.co/facebook/dinov2-base/resolve/main/config.json",
+}
+
+
+class Dinov2Config(BackboneConfigMixin, PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`Dinov2Model`]. It is used to instantiate a
+    Dinov2 model according to the specified arguments, defining the model architecture. Instantiating a configuration
+    with the defaults will yield a similar configuration to that of the Dinov2
+    [facebook/dinov2-base](https://huggingface.co/facebook/dinov2-base) architecture.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+    Args:
+        hidden_size (`int`, *optional*, defaults to 768):
+            Dimensionality of the encoder layers and the pooler layer.
+        num_hidden_layers (`int`, *optional*, defaults to 12):
+            Number of hidden layers in the Transformer encoder.
+        num_attention_heads (`int`, *optional*, defaults to 12):
+            Number of attention heads for each attention layer in the Transformer encoder.
+        mlp_ratio (`int`, *optional*, defaults to 4):
+            Ratio of the hidden size of the MLPs relative to the `hidden_size`.
+        hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
+            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
+            `"relu"`, `"selu"` and `"gelu_new"` are supported.
+        hidden_dropout_prob (`float`, *optional*, defaults to 0.0):
+            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
+        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
+            The dropout ratio for the attention probabilities.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        layer_norm_eps (`float`, *optional*, defaults to 1e-06):
+            The epsilon used by the layer normalization layers.
+        image_size (`int`, *optional*, defaults to 224):
+            The size (resolution) of each image.
+        patch_size (`int`, *optional*, defaults to 16):
+            The size (resolution) of each patch.
+        num_channels (`int`, *optional*, defaults to 3):
+            The number of input channels.
+        qkv_bias (`bool`, *optional*, defaults to `True`):
+            Whether to add a bias to the queries, keys and values.
+        layerscale_value (`float`, *optional*, defaults to 1.0):
+            Initial value to use for layer scale.
+        drop_path_rate (`float`, *optional*, defaults to 0.0):
+            Stochastic depth rate per sample (when applied in the main path of residual layers).
+        use_swiglu_ffn (`bool`, *optional*, defaults to `False`):
+            Whether to use the SwiGLU feedforward neural network.
+        out_features (`List[str]`, *optional*):
+            If used as backbone, list of features to output. Can be any of `"stem"`, `"stage1"`, `"stage2"`, etc.
+            (depending on how many stages the model has). If unset and `out_indices` is set, will default to the
+            corresponding stages. If unset and `out_indices` is unset, will default to the last stage. Must be in the
+            same order as defined in the `stage_names` attribute.
+        out_indices (`List[int]`, *optional*):
+            If used as backbone, list of indices of features to output. Can be any of 0, 1, 2, etc. (depending on how
+            many stages the model has). If unset and `out_features` is set, will default to the corresponding stages.
+            If unset and `out_features` is unset, will default to the last stage. Must be in the
+            same order as defined in the `stage_names` attribute.
+        apply_layernorm (`bool`, *optional*, defaults to `True`):
+            Whether to apply layer normalization to the feature maps in case the model is used as backbone.
+        reshape_hidden_states (`bool`, *optional*, defaults to `True`):
+            Whether to reshape the feature maps to 4D tensors of shape `(batch_size, hidden_size, height, width)` in
+            case the model is used as backbone. If `False`, the feature maps will be 3D tensors of shape `(batch_size,
+            seq_len, hidden_size)`.
+
+    Example:
+
+    ```python
+    >>> from transformers import Dinov2Config, Dinov2Model
+
+    >>> # Initializing a Dinov2 dinov2-base style configuration
+    >>> configuration = Dinov2Config()
+
+    >>> # Initializing a model (with random weights) from the dinov2-base style configuration
+    >>> model = Dinov2Model(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "dinov2"
+
+    def __init__(
+        self,
+        hidden_size=768,
+        num_hidden_layers=12,
+        num_attention_heads=12,
+        mlp_ratio=4,
+        hidden_act="gelu",
+        hidden_dropout_prob=0.0,
+        attention_probs_dropout_prob=0.0,
+        initializer_range=0.02,
+        layer_norm_eps=1e-6,
+        image_size=224,
+        patch_size=16,
+        num_channels=3,
+        qkv_bias=True,
+        layerscale_value=1.0,
+        drop_path_rate=0.0,
+        use_swiglu_ffn=False,
+        out_features=None,
+        out_indices=None,
+        apply_layernorm=True,
+        reshape_hidden_states=True,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+
+        self.hidden_size = hidden_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.mlp_ratio = mlp_ratio
+        self.hidden_act = hidden_act
+        self.hidden_dropout_prob = hidden_dropout_prob
+        self.attention_probs_dropout_prob = attention_probs_dropout_prob
+        self.initializer_range = initializer_range
+        self.layer_norm_eps = layer_norm_eps
+        self.image_size = image_size
+        self.patch_size = patch_size
+        self.num_channels = num_channels
+        self.qkv_bias = qkv_bias
+        self.layerscale_value = layerscale_value
+        self.drop_path_rate = drop_path_rate
+        self.use_swiglu_ffn = use_swiglu_ffn
+        self.stage_names = ["stem"] + [f"stage{idx}" for idx in range(1, num_hidden_layers + 1)]
+        self._out_features, self._out_indices = get_aligned_output_features_output_indices(
+            out_features=out_features, out_indices=out_indices, stage_names=self.stage_names
+        )
+        self.apply_layernorm = apply_layernorm
+        self.reshape_hidden_states = reshape_hidden_states
+
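+# Illustrative sketch (not part of the original patch): when DINOv2 is used as a backbone, intermediate
+# stages can be selected through `out_features` / `out_indices`, which are aligned against the
+# `stage_names` attribute built above, e.g.:
+#
+#     config = Dinov2Config(out_features=["stage2", "stage5", "stage8", "stage11"])
+#     # config.out_indices is then expected to resolve to [2, 5, 8, 11]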
+
+class Dinov2OnnxConfig(OnnxConfig):
+    torch_onnx_minimum_version = version.parse("1.11")
+
+    @property
+    def inputs(self) -> Mapping[str, Mapping[int, str]]:
+        return OrderedDict(
+            [
+                ("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
+            ]
+        )
+
+    @property
+    def atol_for_validation(self) -> float:
+        return 1e-4
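+
+# Export sketch (illustrative, not part of the original patch; assumes the generic `transformers.onnx.export`
+# helper accepts an image processor as the preprocessor for vision models):
+#
+#     from pathlib import Path
+#     from transformers import AutoImageProcessor, Dinov2Model
+#     from transformers.onnx import export
+#
+#     model = Dinov2Model.from_pretrained("facebook/dinov2-base")
+#     processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
+#     onnx_config = Dinov2OnnxConfig(model.config)
+#     export(processor, model, onnx_config, onnx_config.default_onnx_opset, Path("dinov2.onnx"))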
diff --git a/src/transformers/models/dinov2/convert_dinov2_to_hf.py b/src/transformers/models/dinov2/convert_dinov2_to_hf.py
new file mode 100644
index 00000000000000..dd5871e6c44066
--- /dev/null
+++ b/src/transformers/models/dinov2/convert_dinov2_to_hf.py
@@ -0,0 +1,287 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Convert DINOv2 checkpoints from the original repository.
+
+URL: https://github.com/facebookresearch/dinov2/tree/main
+"""
+
+
+import argparse
+import json
+from pathlib import Path
+
+import requests
+import torch
+import torch.nn as nn
+from huggingface_hub import hf_hub_download
+from PIL import Image
+from torchvision import transforms
+
+from transformers import BitImageProcessor, Dinov2Config, Dinov2ForImageClassification, Dinov2Model
+from transformers.image_utils import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD, PILImageResampling
+from transformers.utils import logging
+
+
+logging.set_verbosity_info()
+logger = logging.get_logger(__name__)
+
+
+def get_dinov2_config(model_name, image_classifier=False):
+    config = Dinov2Config(image_size=518, patch_size=14)
+
+    # size of the architecture
+    if "vits" in model_name:
+        config.hidden_size = 384
+        config.num_attention_heads = 6
+    elif "vitb" in model_name:
+        pass
+    elif "vitl" in model_name:
+        config.hidden_size = 1024
+        config.num_hidden_layers = 24
+        config.num_attention_heads = 16
+    elif "vitg" in model_name:
+        config.use_swiglu_ffn = True
+        config.hidden_size = 1536
+        config.num_hidden_layers = 40
+        config.num_attention_heads = 24
+    else:
+        raise ValueError(f"Model {model_name} not supported")
+
+    if image_classifier:
+        repo_id = "huggingface/label-files"
+        filename = "imagenet-1k-id2label.json"
+        config.num_labels = 1000
+        config.id2label = json.load(open(hf_hub_download(repo_id, filename, repo_type="dataset"), "r"))
+        config.id2label = {int(k): v for k, v in config.id2label.items()}
+
+    return config
+
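+# Size sketch (illustrative): the defaults above correspond to ViT-B (hidden size 768, 12 layers, 12 heads);
+# "vits" shrinks to 384 hidden / 6 heads, "vitl" grows to 1024 hidden / 24 layers / 16 heads, and "vitg" to
+# 1536 hidden / 40 layers / 24 heads with SwiGLU feed-forward layers.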
+
+def create_rename_keys(config):
+    rename_keys = []
+    # fmt: off
+
+    # patch embedding layer
+    rename_keys.append(("cls_token", "embeddings.cls_token"))
+    rename_keys.append(("mask_token", "embeddings.mask_token"))
+    rename_keys.append(("pos_embed", "embeddings.position_embeddings"))
+    rename_keys.append(("patch_embed.proj.weight", "embeddings.patch_embeddings.projection.weight"))
+    rename_keys.append(("patch_embed.proj.bias", "embeddings.patch_embeddings.projection.bias"))
+
+    for i in range(config.num_hidden_layers):
+        # layernorms
+        rename_keys.append((f"blocks.{i}.norm1.weight", f"encoder.layer.{i}.norm1.weight"))
+        rename_keys.append((f"blocks.{i}.norm1.bias", f"encoder.layer.{i}.norm1.bias"))
+        rename_keys.append((f"blocks.{i}.norm2.weight", f"encoder.layer.{i}.norm2.weight"))
+        rename_keys.append((f"blocks.{i}.norm2.bias", f"encoder.layer.{i}.norm2.bias"))
+        # MLP
+        if config.use_swiglu_ffn:
+            rename_keys.append((f"blocks.{i}.mlp.w12.weight", f"encoder.layer.{i}.mlp.w12.weight"))
+            rename_keys.append((f"blocks.{i}.mlp.w12.bias", f"encoder.layer.{i}.mlp.w12.bias"))
+            rename_keys.append((f"blocks.{i}.mlp.w3.weight", f"encoder.layer.{i}.mlp.w3.weight"))
+            rename_keys.append((f"blocks.{i}.mlp.w3.bias", f"encoder.layer.{i}.mlp.w3.bias"))
+        else:
+            rename_keys.append((f"blocks.{i}.mlp.fc1.weight", f"encoder.layer.{i}.mlp.fc1.weight"))
+            rename_keys.append((f"blocks.{i}.mlp.fc1.bias", f"encoder.layer.{i}.mlp.fc1.bias"))
+            rename_keys.append((f"blocks.{i}.mlp.fc2.weight", f"encoder.layer.{i}.mlp.fc2.weight"))
+            rename_keys.append((f"blocks.{i}.mlp.fc2.bias", f"encoder.layer.{i}.mlp.fc2.bias"))
+        # layerscale
+        rename_keys.append((f"blocks.{i}.ls1.gamma", f"encoder.layer.{i}.layer_scale1.lambda1"))
+        rename_keys.append((f"blocks.{i}.ls2.gamma", f"encoder.layer.{i}.layer_scale2.lambda1"))
+        # attention projection layer
+        rename_keys.append((f"blocks.{i}.attn.proj.weight", f"encoder.layer.{i}.attention.output.dense.weight"))
+        rename_keys.append((f"blocks.{i}.attn.proj.bias", f"encoder.layer.{i}.attention.output.dense.bias"))
+
+    # final layernorm
+    rename_keys.append(("norm.weight", "layernorm.weight"))
+    rename_keys.append(("norm.bias", "layernorm.bias"))
+
+    # fmt: on
+    return rename_keys
+
+
+def rename_key(dct, old, new):
+    val = dct.pop(old)
+    dct[new] = val
+
+
+# we split up the matrix of each encoder layer into queries, keys and values
+def read_in_q_k_v(state_dict, config):
+    for i in range(config.num_hidden_layers):
+        # read in weights + bias of input projection layer (in timm, this is a single matrix + bias)
+        in_proj_weight = state_dict.pop(f"blocks.{i}.attn.qkv.weight")
+        in_proj_bias = state_dict.pop(f"blocks.{i}.attn.qkv.bias")
+        # next, add query, keys and values (in that order) to the state dict
+        state_dict[f"encoder.layer.{i}.attention.attention.query.weight"] = in_proj_weight[: config.hidden_size, :]
+        state_dict[f"encoder.layer.{i}.attention.attention.query.bias"] = in_proj_bias[: config.hidden_size]
+        state_dict[f"encoder.layer.{i}.attention.attention.key.weight"] = in_proj_weight[
+            config.hidden_size : config.hidden_size * 2, :
+        ]
+        state_dict[f"encoder.layer.{i}.attention.attention.key.bias"] = in_proj_bias[
+            config.hidden_size : config.hidden_size * 2
+        ]
+        state_dict[f"encoder.layer.{i}.attention.attention.value.weight"] = in_proj_weight[-config.hidden_size :, :]
+        state_dict[f"encoder.layer.{i}.attention.attention.value.bias"] = in_proj_bias[-config.hidden_size :]
+
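+# Shape sketch (illustrative, for the default hidden_size=768): the fused `blocks.{i}.attn.qkv.weight` is a
+# (2304, 768) matrix, so the slices taken in `read_in_q_k_v` above map to
+#     query.weight = in_proj_weight[:768, :]      # rows 0..767
+#     key.weight   = in_proj_weight[768:1536, :]  # rows 768..1535
+#     value.weight = in_proj_weight[-768:, :]     # rows 1536..2303
+# and the 2304-dim qkv bias is split the same way.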
+
+# We will verify our results on an image of cute cats
+def prepare_img():
+    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+    image = Image.open(requests.get(url, stream=True).raw)
+    return image
+
+
+@torch.no_grad()
+def convert_dinov2_checkpoint(model_name, pytorch_dump_folder_path, push_to_hub=False):
+    """
+    Copy/paste/tweak model's weights to our DINOv2 structure.
+    """
+
+    # define default Dinov2 configuration
+    image_classifier = "1layer" in model_name
+    config = get_dinov2_config(model_name, image_classifier=image_classifier)
+
+    # load original model from torch hub
+    original_model = torch.hub.load("facebookresearch/dinov2", model_name.replace("_1layer", ""))
+    original_model.eval()
+
+    # load state_dict of original model, remove and rename some keys
+    state_dict = original_model.state_dict()
+    rename_keys = create_rename_keys(config)
+    for src, dest in rename_keys:
+        rename_key(state_dict, src, dest)
+    read_in_q_k_v(state_dict, config)
+
+    for key, val in state_dict.copy().items():
+        val = state_dict.pop(key)
+        if "w12" in key:
+            key = key.replace("w12", "weights_in")
+        if "w3" in key:
+            key = key.replace("w3", "weights_out")
+        state_dict[key] = val
+
+    # load HuggingFace model
+    if image_classifier:
+        model = Dinov2ForImageClassification(config).eval()
+        model.dinov2.load_state_dict(state_dict)
+        model_name_to_classifier_dict_url = {
+            "dinov2_vits14_1layer": "https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_linear_head.pth",
+            "dinov2_vitb14_1layer": "https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_linear_head.pth",
+            "dinov2_vitl14_1layer": "https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_linear_head.pth",
+            "dinov2_vitg14_1layer": "https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_linear_head.pth",
+        }
+        url = model_name_to_classifier_dict_url[model_name]
+        classifier_state_dict = torch.hub.load_state_dict_from_url(url, map_location="cpu")
+        model.classifier.weight = nn.Parameter(classifier_state_dict["weight"])
+        model.classifier.bias = nn.Parameter(classifier_state_dict["bias"])
+    else:
+        model = Dinov2Model(config).eval()
+        model.load_state_dict(state_dict)
+
+    # load image
+    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
+
+    # preprocess image
+    transformations = transforms.Compose(
+        [
+            transforms.Resize(256, interpolation=transforms.InterpolationMode.BICUBIC),
+            transforms.CenterCrop(224),
+            transforms.ToTensor(),
+            transforms.Normalize(
+                mean=IMAGENET_DEFAULT_MEAN,  # these are RGB mean+std values
+                std=IMAGENET_DEFAULT_STD,  # across a large photo dataset.
+            ),
+        ]
+    )
+
+    original_pixel_values = transformations(image).unsqueeze(0)  # insert batch dimension
+
+    processor = BitImageProcessor(
+        size={"shortest_edge": 256},
+        resample=PILImageResampling.BICUBIC,
+        image_mean=IMAGENET_DEFAULT_MEAN,
+        image_std=IMAGENET_DEFAULT_STD,
+    )
+    pixel_values = processor(image, return_tensors="pt").pixel_values
+
+    assert torch.allclose(original_pixel_values, pixel_values)
+
+    with torch.no_grad():
+        outputs = model(pixel_values, output_hidden_states=True)
+        original_outputs = original_model(pixel_values)
+
+    # assert values
+    if image_classifier:
+        print("Predicted class:")
+        class_idx = outputs.logits.argmax(-1).item()
+        print(model.config.id2label[class_idx])
+    else:
+        assert outputs.last_hidden_state[:, 0].shape == original_outputs.shape
+        assert torch.allclose(outputs.last_hidden_state[:, 0], original_outputs, atol=1e-3)
+    print("Looks ok!")
+
+    if pytorch_dump_folder_path is not None:
+        Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
+        print(f"Saving model {model_name} to {pytorch_dump_folder_path}")
+        model.save_pretrained(pytorch_dump_folder_path)
+        print(f"Saving image processor to {pytorch_dump_folder_path}")
+        processor.save_pretrained(pytorch_dump_folder_path)
+
+    if push_to_hub:
+        model_name_to_hf_name = {
+            "dinov2_vits14": "dinov2-small",
+            "dinov2_vitb14": "dinov2-base",
+            "dinov2_vitl14": "dinov2-large",
+            "dinov2_vitg14": "dinov2-giant",
+            "dinov2_vits14_1layer": "dinov2-small-imagenet1k-1-layer",
+            "dinov2_vitb14_1layer": "dinov2-base-imagenet1k-1-layer",
+            "dinov2_vitl14_1layer": "dinov2-large-imagenet1k-1-layer",
+            "dinov2_vitg14_1layer": "dinov2-giant-imagenet1k-1-layer",
+        }
+
+        name = model_name_to_hf_name[model_name]
+        model.push_to_hub(f"facebook/{name}")
+        processor.push_to_hub(f"facebook/{name}")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    # Required parameters
+    parser.add_argument(
+        "--model_name",
+        default="dinov2_vitb14",
+        type=str,
+        choices=[
+            "dinov2_vits14",
+            "dinov2_vitb14",
+            "dinov2_vitl14",
+            "dinov2_vitg14",
+            "dinov2_vits14_1layer",
+            "dinov2_vitb14_1layer",
+            "dinov2_vitl14_1layer",
+            "dinov2_vitg14_1layer",
+        ],
+        help="Name of the model you'd like to convert.",
+    )
+    parser.add_argument(
+        "--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
+    )
+    parser.add_argument(
+        "--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
+    )
+
+    args = parser.parse_args()
+    convert_dinov2_checkpoint(args.model_name, args.pytorch_dump_folder_path, args.push_to_hub)
diff --git a/src/transformers/models/dinov2/modeling_dinov2.py b/src/transformers/models/dinov2/modeling_dinov2.py
new file mode 100644
index 00000000000000..ddf70f08b750fb
--- /dev/null
+++ b/src/transformers/models/dinov2/modeling_dinov2.py
@@ -0,0 +1,860 @@
+# coding=utf-8
+# Copyright 2023 Meta AI and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" PyTorch DINOv2 model."""
+
+
+import collections.abc
+import math
+from typing import Dict, List, Optional, Set, Tuple, Union
+
+import torch
+import torch.utils.checkpoint
+from torch import nn
+from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
+
+from ...activations import ACT2FN
+from ...modeling_outputs import (
+    BackboneOutput,
+    BaseModelOutput,
+    BaseModelOutputWithPooling,
+    ImageClassifierOutput,
+)
+from ...modeling_utils import PreTrainedModel
+from ...pytorch_utils import find_pruneable_heads_and_indices, prune_linear_layer
+from ...utils import (
+    add_code_sample_docstrings,
+    add_start_docstrings,
+    add_start_docstrings_to_model_forward,
+    logging,
+    replace_return_docstrings,
+)
+from ...utils.backbone_utils import BackboneMixin
+from .configuration_dinov2 import Dinov2Config
+
+
+logger = logging.get_logger(__name__)
+
+# General docstring
+_CONFIG_FOR_DOC = "Dinov2Config"
+
+# Base docstring
+_CHECKPOINT_FOR_DOC = "facebook/dinov2-base"
+_EXPECTED_OUTPUT_SHAPE = [1, 257, 768]
+
+# Image classification docstring
+_IMAGE_CLASS_CHECKPOINT = "facebook/dinov2-small-imagenet1k-1-layer"
+_IMAGE_CLASS_EXPECTED_OUTPUT = "tabby, tabby cat"
+
+
+DINOV2_PRETRAINED_MODEL_ARCHIVE_LIST = [
+    "facebook/dinov2-base",
+    # See all DINOv2 models at https://huggingface.co/models?filter=dinov2
+]
+
+
+class Dinov2Embeddings(nn.Module):
+    """
+    Construct the CLS token, mask token, position and patch embeddings.
+    """
+
+    def __init__(self, config: Dinov2Config) -> None:
+        super().__init__()
+
+        self.cls_token = nn.Parameter(torch.randn(1, 1, config.hidden_size))
+        self.mask_token = nn.Parameter(torch.zeros(1, config.hidden_size))
+        self.patch_embeddings = Dinov2PatchEmbeddings(config)
+        num_patches = self.patch_embeddings.num_patches
+        self.position_embeddings = nn.Parameter(torch.randn(1, num_patches + 1, config.hidden_size))
+        self.dropout = nn.Dropout(config.hidden_dropout_prob)
+        self.config = config
+
+    def interpolate_pos_encoding(self, embeddings: torch.Tensor, height: int, width: int) -> torch.Tensor:
+        """
+        This method allows interpolating the pre-trained position encodings so that the model can be used on
+        higher-resolution images.
+
+        Source:
+        https://github.com/facebookresearch/dino/blob/de9ee3df6cf39fac952ab558447af1fa1365362a/vision_transformer.py#L174
+        """
+
+        num_patches = embeddings.shape[1] - 1
+        num_positions = self.position_embeddings.shape[1] - 1
+        if num_patches == num_positions and height == width:
+            return self.position_embeddings
+        class_pos_embed = self.position_embeddings[:, 0]
+        patch_pos_embed = self.position_embeddings[:, 1:]
+        dim = embeddings.shape[-1]
+        height = height // self.config.patch_size
+        width = width // self.config.patch_size
+        # we add a small number to avoid floating point error in the interpolation
+        # see discussion at https://github.com/facebookresearch/dino/issues/8
+        height, width = height + 0.1, width + 0.1
+        patch_pos_embed = patch_pos_embed.reshape(1, int(math.sqrt(num_positions)), int(math.sqrt(num_positions)), dim)
+        patch_pos_embed = patch_pos_embed.permute(0, 3, 1, 2)
+        target_dtype = patch_pos_embed.dtype
+        patch_pos_embed = nn.functional.interpolate(
+            patch_pos_embed.to(dtype=torch.float32),
+            scale_factor=(float(height / math.sqrt(num_positions)), float(width / math.sqrt(num_positions))),
+            mode="bicubic",
+            align_corners=False,
+        ).to(dtype=target_dtype)
+        if int(height) != patch_pos_embed.shape[-2] or int(width) != patch_pos_embed.shape[-1]:
+            raise ValueError("Width or height does not match with the interpolated position embeddings")
+        patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).view(1, -1, dim)
+        return torch.cat((class_pos_embed.unsqueeze(0), patch_pos_embed), dim=1)
+
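+    # Resolution sketch (illustrative, assuming patch_size=14 and the 518x518 pre-training resolution used in
+    # the conversion script, i.e. a 37x37 grid of patch position embeddings): a 700x700 input yields a 50x50
+    # patch grid, so interpolate_pos_encoding above bicubically resizes the 37x37 grid to 50x50 while keeping
+    # the [CLS] position embedding untouched.
+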
+    def forward(self, pixel_values: torch.Tensor, bool_masked_pos: Optional[torch.Tensor] = None) -> torch.Tensor:
+        batch_size, _, height, width = pixel_values.shape
+        target_dtype = self.patch_embeddings.projection.weight.dtype
+        embeddings = self.patch_embeddings(pixel_values.to(dtype=target_dtype))
+
+        if bool_masked_pos is not None:
+            embeddings = torch.where(
+                bool_masked_pos.unsqueeze(-1), self.mask_token.to(embeddings.dtype).unsqueeze(0), embeddings
+            )
+
+        # add the [CLS] token to the embedded patch tokens
+        cls_tokens = self.cls_token.expand(batch_size, -1, -1)
+        embeddings = torch.cat((cls_tokens, embeddings), dim=1)
+
+        # add positional encoding to each token
+        embeddings = embeddings + self.interpolate_pos_encoding(embeddings, height, width)
+
+        embeddings = self.dropout(embeddings)
+
+        return embeddings
+
+
+class Dinov2PatchEmbeddings(nn.Module):
+    """
+    This class turns `pixel_values` of shape `(batch_size, num_channels, height, width)` into the initial
+    `hidden_states` (patch embeddings) of shape `(batch_size, seq_length, hidden_size)` to be consumed by a
+    Transformer.
+    """
+
+    def __init__(self, config):
+        super().__init__()
+        image_size, patch_size = config.image_size, config.patch_size
+        num_channels, hidden_size = config.num_channels, config.hidden_size
+
+        image_size = image_size if isinstance(image_size, collections.abc.Iterable) else (image_size, image_size)
+        patch_size = patch_size if isinstance(patch_size, collections.abc.Iterable) else (patch_size, patch_size)
+        num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0])
+        self.image_size = image_size
+        self.patch_size = patch_size
+        self.num_channels = num_channels
+        self.num_patches = num_patches
+
+        self.projection = nn.Conv2d(num_channels, hidden_size, kernel_size=patch_size, stride=patch_size)
+
+    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
+        num_channels = pixel_values.shape[1]
+        if num_channels != self.num_channels:
+            raise ValueError(
+                "Make sure that the channel dimension of the pixel values match with the one set in the configuration."
+                f" Expected {self.num_channels} but got {num_channels}."
+            )
+        embeddings = self.projection(pixel_values).flatten(2).transpose(1, 2)
+        return embeddings
+
+
+# Copied from transformers.models.vit.modeling_vit.ViTSelfAttention with ViT->Dinov2
+class Dinov2SelfAttention(nn.Module):
+    def __init__(self, config: Dinov2Config) -> None:
+        super().__init__()
+        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
+            raise ValueError(
+                f"The hidden size {config.hidden_size,} is not a multiple of the number of attention "
+                f"heads {config.num_attention_heads}."
+            )
+
+        self.num_attention_heads = config.num_attention_heads
+        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
+        self.all_head_size = self.num_attention_heads * self.attention_head_size
+
+        self.query = nn.Linear(config.hidden_size, self.all_head_size, bias=config.qkv_bias)
+        self.key = nn.Linear(config.hidden_size, self.all_head_size, bias=config.qkv_bias)
+        self.value = nn.Linear(config.hidden_size, self.all_head_size, bias=config.qkv_bias)
+
+        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
+
+    def transpose_for_scores(self, x: torch.Tensor) -> torch.Tensor:
+        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
+        x = x.view(new_x_shape)
+        return x.permute(0, 2, 1, 3)
+
+    def forward(
+        self, hidden_states, head_mask: Optional[torch.Tensor] = None, output_attentions: bool = False
+    ) -> Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor]]:
+        mixed_query_layer = self.query(hidden_states)
+
+        key_layer = self.transpose_for_scores(self.key(hidden_states))
+        value_layer = self.transpose_for_scores(self.value(hidden_states))
+        query_layer = self.transpose_for_scores(mixed_query_layer)
+
+        # Take the dot product between "query" and "key" to get the raw attention scores.
+        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
+
+        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
+
+        # Normalize the attention scores to probabilities.
+        attention_probs = nn.functional.softmax(attention_scores, dim=-1)
+
+        # This is actually dropping out entire tokens to attend to, which might
+        # seem a bit unusual, but is taken from the original Transformer paper.
+        attention_probs = self.dropout(attention_probs)
+
+        # Mask heads if we want to
+        if head_mask is not None:
+            attention_probs = attention_probs * head_mask
+
+        context_layer = torch.matmul(attention_probs, value_layer)
+
+        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
+        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
+        context_layer = context_layer.view(new_context_layer_shape)
+
+        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
+
+        return outputs
+
+
+# Copied from transformers.models.vit.modeling_vit.ViTSelfOutput with ViT->Dinov2
+class Dinov2SelfOutput(nn.Module):
+    """
+    The residual connection is defined in Dinov2Layer instead of here (as is the case with other models), due to the
+    layernorm applied before each block.
+    """
+
+    def __init__(self, config: Dinov2Config) -> None:
+        super().__init__()
+        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+        self.dropout = nn.Dropout(config.hidden_dropout_prob)
+
+    def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:
+        hidden_states = self.dense(hidden_states)
+        hidden_states = self.dropout(hidden_states)
+
+        return hidden_states
+
+
+# Copied from transformers.models.vit.modeling_vit.ViTAttention with ViT->Dinov2
+class Dinov2Attention(nn.Module):
+    def __init__(self, config: Dinov2Config) -> None:
+        super().__init__()
+        self.attention = Dinov2SelfAttention(config)
+        self.output = Dinov2SelfOutput(config)
+        self.pruned_heads = set()
+
+    def prune_heads(self, heads: Set[int]) -> None:
+        if len(heads) == 0:
+            return
+        heads, index = find_pruneable_heads_and_indices(
+            heads, self.attention.num_attention_heads, self.attention.attention_head_size, self.pruned_heads
+        )
+
+        # Prune linear layers
+        self.attention.query = prune_linear_layer(self.attention.query, index)
+        self.attention.key = prune_linear_layer(self.attention.key, index)
+        self.attention.value = prune_linear_layer(self.attention.value, index)
+        self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)
+
+        # Update hyper params and store pruned heads
+        self.attention.num_attention_heads = self.attention.num_attention_heads - len(heads)
+        self.attention.all_head_size = self.attention.attention_head_size * self.attention.num_attention_heads
+        self.pruned_heads = self.pruned_heads.union(heads)
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        head_mask: Optional[torch.Tensor] = None,
+        output_attentions: bool = False,
+    ) -> Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor]]:
+        self_outputs = self.attention(hidden_states, head_mask, output_attentions)
+
+        attention_output = self.output(self_outputs[0], hidden_states)
+
+        outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them
+        return outputs
+
+
+class Dinov2LayerScale(nn.Module):
+    def __init__(self, config) -> None:
+        super().__init__()
+        self.lambda1 = nn.Parameter(config.layerscale_value * torch.ones(config.hidden_size))
+
+    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
+        return hidden_state * self.lambda1
+
+
+# Copied from transformers.models.beit.modeling_beit.drop_path
+def drop_path(input: torch.Tensor, drop_prob: float = 0.0, training: bool = False) -> torch.Tensor:
+    """
+    Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
+
+    Comment by Ross Wightman: This is the same as the DropConnect impl I created for EfficientNet, etc networks,
+    however, the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper...
+    See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for changing the
+    layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use 'survival rate' as the
+    argument.
+    """
+    if drop_prob == 0.0 or not training:
+        return input
+    keep_prob = 1 - drop_prob
+    shape = (input.shape[0],) + (1,) * (input.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
+    random_tensor = keep_prob + torch.rand(shape, dtype=input.dtype, device=input.device)
+    random_tensor.floor_()  # binarize
+    output = input.div(keep_prob) * random_tensor
+    return output
+
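+# Worked example (illustrative): with drop_prob=0.1 and training=True, keep_prob=0.9, so each sample's
+# residual branch is either zeroed out (probability 0.1) or scaled by 1/0.9, keeping the expected value of
+# the output equal to the input; with training=False the tensor is returned unchanged.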
+
+# Copied from transformers.models.beit.modeling_beit.BeitDropPath
+class Dinov2DropPath(nn.Module):
+    """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks)."""
+
+    def __init__(self, drop_prob: Optional[float] = None) -> None:
+        super().__init__()
+        self.drop_prob = drop_prob
+
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        return drop_path(hidden_states, self.drop_prob, self.training)
+
+    def extra_repr(self) -> str:
+        return "p={}".format(self.drop_prob)
+
+
+class Dinov2MLP(nn.Module):
+    def __init__(self, config) -> None:
+        super().__init__()
+        in_features = out_features = config.hidden_size
+        hidden_features = int(config.hidden_size * config.mlp_ratio)
+        self.fc1 = nn.Linear(in_features, hidden_features, bias=True)
+        if isinstance(config.hidden_act, str):
+            self.activation = ACT2FN[config.hidden_act]
+        else:
+            self.activation = config.hidden_act
+        self.fc2 = nn.Linear(hidden_features, out_features, bias=True)
+
+    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
+        hidden_state = self.fc1(hidden_state)
+        hidden_state = self.activation(hidden_state)
+        hidden_state = self.fc2(hidden_state)
+        return hidden_state
+
+
+class Dinov2SwiGLUFFN(nn.Module):
+    def __init__(self, config) -> None:
+        super().__init__()
+        in_features = out_features = config.hidden_size
+        hidden_features = int(config.hidden_size * config.mlp_ratio)
+        hidden_features = (int(hidden_features * 2 / 3) + 7) // 8 * 8
+
+        self.weights_in = nn.Linear(in_features, 2 * hidden_features, bias=True)
+        self.weights_out = nn.Linear(hidden_features, out_features, bias=True)
+
+    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
+        hidden_state = self.weights_in(hidden_state)
+        x1, x2 = hidden_state.chunk(2, dim=-1)
+        hidden = nn.functional.silu(x1) * x2
+        return self.weights_out(hidden)
+
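+# Sizing sketch (illustrative, for hidden_size=768 and mlp_ratio=4 as in the base config): hidden_features
+# starts at 3072, is scaled by 2/3 to 2048 and rounded up to the next multiple of 8 (still 2048), so
+# weights_in is Linear(768, 4096) (two 2048-dim chunks for the SwiGLU gate) and weights_out maps the
+# 2048-dim gated features back to 768.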
+
+class Dinov2Layer(nn.Module):
+    """This corresponds to the Block class in the original implementation."""
+
+    def __init__(self, config: Dinov2Config) -> None:
+        super().__init__()
+
+        self.norm1 = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        self.attention = Dinov2Attention(config)
+        self.layer_scale1 = Dinov2LayerScale(config)
+        self.drop_path1 = Dinov2DropPath(config.drop_path_rate) if config.drop_path_rate > 0.0 else nn.Identity()
+
+        self.norm2 = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+
+        if config.use_swiglu_ffn:
+            self.mlp = Dinov2SwiGLUFFN(config)
+        else:
+            self.mlp = Dinov2MLP(config)
+        self.layer_scale2 = Dinov2LayerScale(config)
+        self.drop_path2 = Dinov2DropPath(config.drop_path_rate) if config.drop_path_rate > 0.0 else nn.Identity()
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        head_mask: Optional[torch.Tensor] = None,
+        output_attentions: bool = False,
+    ) -> Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor]]:
+        self_attention_outputs = self.attention(
+            self.norm1(hidden_states),  # in Dinov2, layernorm is applied before self-attention
+            head_mask,
+            output_attentions=output_attentions,
+        )
+        attention_output = self_attention_outputs[0]
+
+        attention_output = self.layer_scale1(attention_output)
+        outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights
+
+        # first residual connection
+        hidden_states = self.drop_path1(attention_output) + hidden_states
+
+        # in Dinov2, layernorm is also applied after self-attention
+        layer_output = self.norm2(hidden_states)
+        layer_output = self.mlp(layer_output)
+        layer_output = self.layer_scale2(layer_output)
+
+        # second residual connection
+        layer_output = self.drop_path2(layer_output) + hidden_states
+
+        outputs = (layer_output,) + outputs
+
+        return outputs
+
+
+# Copied from transformers.models.vit.modeling_vit.ViTEncoder with ViT->Dinov2
+class Dinov2Encoder(nn.Module):
+    def __init__(self, config: Dinov2Config) -> None:
+        super().__init__()
+        self.config = config
+        self.layer = nn.ModuleList([Dinov2Layer(config) for _ in range(config.num_hidden_layers)])
+        self.gradient_checkpointing = False
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        head_mask: Optional[torch.Tensor] = None,
+        output_attentions: bool = False,
+        output_hidden_states: bool = False,
+        return_dict: bool = True,
+    ) -> Union[tuple, BaseModelOutput]:
+        all_hidden_states = () if output_hidden_states else None
+        all_self_attentions = () if output_attentions else None
+
+        for i, layer_module in enumerate(self.layer):
+            if output_hidden_states:
+                all_hidden_states = all_hidden_states + (hidden_states,)
+
+            layer_head_mask = head_mask[i] if head_mask is not None else None
+
+            if self.gradient_checkpointing and self.training:
+                layer_outputs = self._gradient_checkpointing_func(
+                    layer_module.__call__,
+                    hidden_states,
+                    layer_head_mask,
+                    output_attentions,
+                )
+            else:
+                layer_outputs = layer_module(hidden_states, layer_head_mask, output_attentions)
+
+            hidden_states = layer_outputs[0]
+
+            if output_attentions:
+                all_self_attentions = all_self_attentions + (layer_outputs[1],)
+
+        if output_hidden_states:
+            all_hidden_states = all_hidden_states + (hidden_states,)
+
+        if not return_dict:
+            return tuple(v for v in [hidden_states, all_hidden_states, all_self_attentions] if v is not None)
+        return BaseModelOutput(
+            last_hidden_state=hidden_states,
+            hidden_states=all_hidden_states,
+            attentions=all_self_attentions,
+        )
+
+
+class Dinov2PreTrainedModel(PreTrainedModel):
+    """
+    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
+    models.
+    """
+
+    config_class = Dinov2Config
+    base_model_prefix = "dinov2"
+    main_input_name = "pixel_values"
+    supports_gradient_checkpointing = True
+
+    def _init_weights(self, module: Union[nn.Linear, nn.Conv2d, nn.LayerNorm]) -> None:
+        """Initialize the weights"""
+        if isinstance(module, (nn.Linear, nn.Conv2d)):
+            # Upcast the input in `fp32` and cast it back to desired `dtype` to avoid
+            # `trunc_normal_cpu` not implemented in `half` issues
+            module.weight.data = nn.init.trunc_normal_(
+                module.weight.data.to(torch.float32), mean=0.0, std=self.config.initializer_range
+            ).to(module.weight.dtype)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.LayerNorm):
+            module.bias.data.zero_()
+            module.weight.data.fill_(1.0)
+        elif isinstance(module, Dinov2Embeddings):
+            module.position_embeddings.data = nn.init.trunc_normal_(
+                module.position_embeddings.data.to(torch.float32),
+                mean=0.0,
+                std=self.config.initializer_range,
+            ).to(module.position_embeddings.dtype)
+
+            module.cls_token.data = nn.init.trunc_normal_(
+                module.cls_token.data.to(torch.float32),
+                mean=0.0,
+                std=self.config.initializer_range,
+            ).to(module.cls_token.dtype)
+
+
+DINOV2_START_DOCSTRING = r"""
+    This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it
+    as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and
+    behavior.
+
+    Parameters:
+        config ([`Dinov2Config`]): Model configuration class with all the parameters of the model.
+            Initializing with a config file does not load the weights associated with the model, only the
+            configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+DINOV2_BASE_INPUTS_DOCSTRING = r"""
+    Args:
+        pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
+            Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See
+            [`BitImageProcessor.preprocess`] for details.
+
+        bool_masked_pos (`torch.BoolTensor` of shape `(batch_size, sequence_length)`):
+            Boolean masked positions. Indicates which patches are masked (1) and which aren't (0). Only relevant for
+            pre-training.
+
+        head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
+            Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
+        output_attentions (`bool`, *optional*):
+            Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
+            tensors for more detail.
+        output_hidden_states (`bool`, *optional*):
+            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+            more detail.
+        return_dict (`bool`, *optional*):
+            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+"""
+
+DINOV2_INPUTS_DOCSTRING = r"""
+    Args:
+        pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
+            Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See
+            [`BitImageProcessor.preprocess`] for details.
+
+        head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
+            Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
+        output_attentions (`bool`, *optional*):
+            Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
+            tensors for more detail.
+        output_hidden_states (`bool`, *optional*):
+            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+            more detail.
+        return_dict (`bool`, *optional*):
+            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+"""
+
+
+@add_start_docstrings(
+    "The bare DINOv2 Model transformer outputting raw hidden-states without any specific head on top.",
+    DINOV2_START_DOCSTRING,
+)
+class Dinov2Model(Dinov2PreTrainedModel):
+    def __init__(self, config: Dinov2Config):
+        super().__init__(config)
+        self.config = config
+
+        self.embeddings = Dinov2Embeddings(config)
+        self.encoder = Dinov2Encoder(config)
+
+        self.layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self) -> Dinov2PatchEmbeddings:
+        return self.embeddings.patch_embeddings
+
+    def _prune_heads(self, heads_to_prune: Dict[int, List[int]]) -> None:
+        """
+        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
+        class PreTrainedModel
+        """
+        for layer, heads in heads_to_prune.items():
+            self.encoder.layer[layer].attention.prune_heads(heads)
+
+    @add_start_docstrings_to_model_forward(DINOV2_BASE_INPUTS_DOCSTRING)
+    @add_code_sample_docstrings(
+        checkpoint=_CHECKPOINT_FOR_DOC,
+        output_type=BaseModelOutputWithPooling,
+        config_class=_CONFIG_FOR_DOC,
+        modality="vision",
+        expected_output=_EXPECTED_OUTPUT_SHAPE,
+    )
+    def forward(
+        self,
+        pixel_values: Optional[torch.Tensor] = None,
+        bool_masked_pos: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, BaseModelOutputWithPooling]:
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        if pixel_values is None:
+            raise ValueError("You have to specify pixel_values")
+
+        # Prepare head mask if needed
+        # 1.0 in head_mask indicate we keep the head
+        # attention_probs has shape bsz x n_heads x N x N
+        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
+        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
+        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
+
+        embedding_output = self.embeddings(pixel_values, bool_masked_pos=bool_masked_pos)
+
+        encoder_outputs = self.encoder(
+            embedding_output,
+            head_mask=head_mask,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+        sequence_output = encoder_outputs[0]
+        sequence_output = self.layernorm(sequence_output)
+        pooled_output = sequence_output[:, 0, :]
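+        # The pooled output is simply the layer-normed [CLS] token; there is no separate pooler layer.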
+
+        if not return_dict:
+            head_outputs = (sequence_output, pooled_output)
+            return head_outputs + encoder_outputs[1:]
+
+        return BaseModelOutputWithPooling(
+            last_hidden_state=sequence_output,
+            pooler_output=pooled_output,
+            hidden_states=encoder_outputs.hidden_states,
+            attentions=encoder_outputs.attentions,
+        )
+
+
+@add_start_docstrings(
+    """
+    Dinov2 Model transformer with an image classification head on top (a linear layer on top of the final hidden state
+    of the [CLS] token) e.g. for ImageNet.
+    """,
+    DINOV2_START_DOCSTRING,
+)
+class Dinov2ForImageClassification(Dinov2PreTrainedModel):
+    def __init__(self, config: Dinov2Config) -> None:
+        super().__init__(config)
+
+        self.num_labels = config.num_labels
+        self.dinov2 = Dinov2Model(config)
+
+        # Classifier head
+        self.classifier = (
+            nn.Linear(config.hidden_size * 2, config.num_labels) if config.num_labels > 0 else nn.Identity()
+        )
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    @add_start_docstrings_to_model_forward(DINOV2_INPUTS_DOCSTRING)
+    @add_code_sample_docstrings(
+        checkpoint=_IMAGE_CLASS_CHECKPOINT,
+        output_type=ImageClassifierOutput,
+        config_class=_CONFIG_FOR_DOC,
+        expected_output=_IMAGE_CLASS_EXPECTED_OUTPUT,
+    )
+    def forward(
+        self,
+        pixel_values: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        labels: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[tuple, ImageClassifierOutput]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
+            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
+            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        outputs = self.dinov2(
+            pixel_values,
+            head_mask=head_mask,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        sequence_output = outputs[0]  # batch_size, sequence_length, hidden_size
+
+        cls_token = sequence_output[:, 0]
+        patch_tokens = sequence_output[:, 1:]
+
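+        # The classifier input concatenates the [CLS] token with the mean of the patch tokens, which is
+        # why the head above is Linear(2 * hidden_size, num_labels).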
+        linear_input = torch.cat([cls_token, patch_tokens.mean(dim=1)], dim=1)
+
+        logits = self.classifier(linear_input)
+
+        loss = None
+        if labels is not None:
+            # move labels to correct device to enable model parallelism
+            labels = labels.to(logits.device)
+            if self.config.problem_type is None:
+                if self.num_labels == 1:
+                    self.config.problem_type = "regression"
+                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
+                    self.config.problem_type = "single_label_classification"
+                else:
+                    self.config.problem_type = "multi_label_classification"
+
+            if self.config.problem_type == "regression":
+                loss_fct = MSELoss()
+                if self.num_labels == 1:
+                    loss = loss_fct(logits.squeeze(), labels.squeeze())
+                else:
+                    loss = loss_fct(logits, labels)
+            elif self.config.problem_type == "single_label_classification":
+                loss_fct = CrossEntropyLoss()
+                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+            elif self.config.problem_type == "multi_label_classification":
+                loss_fct = BCEWithLogitsLoss()
+                loss = loss_fct(logits, labels)
+
+        if not return_dict:
+            output = (logits,) + outputs[2:]
+            return ((loss,) + output) if loss is not None else output
+
+        return ImageClassifierOutput(
+            loss=loss,
+            logits=logits,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+
+
+@add_start_docstrings(
+    """
+    Dinov2 backbone, to be used with frameworks like DETR and MaskFormer.
+    """,
+    DINOV2_START_DOCSTRING,
+)
+class Dinov2Backbone(Dinov2PreTrainedModel, BackboneMixin):
+    def __init__(self, config):
+        super().__init__(config)
+        super()._init_backbone(config)
+
+        self.num_features = [config.hidden_size for _ in range(config.num_hidden_layers + 1)]
+        self.embeddings = Dinov2Embeddings(config)
+        self.encoder = Dinov2Encoder(config)
+
+        self.layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self) -> Dinov2PatchEmbeddings:
+        return self.embeddings.patch_embeddings
+
+    @add_start_docstrings_to_model_forward(DINOV2_INPUTS_DOCSTRING)
+    @replace_return_docstrings(output_type=BackboneOutput, config_class=_CONFIG_FOR_DOC)
+    def forward(
+        self,
+        pixel_values: torch.Tensor,
+        output_hidden_states: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> BackboneOutput:
+        """
+        Returns:
+
+        Examples:
+
+        ```python
+        >>> from transformers import AutoImageProcessor, AutoBackbone
+        >>> import torch
+        >>> from PIL import Image
+        >>> import requests
+
+        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+        >>> image = Image.open(requests.get(url, stream=True).raw)
+
+        >>> processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
+        >>> model = AutoBackbone.from_pretrained(
+        ...     "facebook/dinov2-base", out_features=["stage2", "stage5", "stage8", "stage11"]
+        ... )
+
+        >>> inputs = processor(image, return_tensors="pt")
+
+        >>> outputs = model(**inputs)
+        >>> feature_maps = outputs.feature_maps
+        >>> list(feature_maps[-1].shape)
+        [1, 768, 16, 16]
+        ```"""
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+
+        embedding_output = self.embeddings(pixel_values)
+
+        outputs = self.encoder(
+            embedding_output, output_hidden_states=True, output_attentions=output_attentions, return_dict=return_dict
+        )
+
+        hidden_states = outputs.hidden_states if return_dict else outputs[1]
+
+        feature_maps = ()
+        for stage, hidden_state in zip(self.stage_names, hidden_states):
+            if stage in self.out_features:
+                if self.config.apply_layernorm:
+                    hidden_state = self.layernorm(hidden_state)
+                if self.config.reshape_hidden_states:
+                    hidden_state = hidden_state[:, 1:]
+                    # this was actually a bug in the original implementation that we copied here,
+                    # because normally the order is (height, width)
+                    batch_size, _, height, width = pixel_values.shape
+                    patch_size = self.config.patch_size
+                    hidden_state = hidden_state.reshape(batch_size, height // patch_size, width // patch_size, -1)
+                    hidden_state = hidden_state.permute(0, 3, 1, 2).contiguous()
+                feature_maps += (hidden_state,)
+
+        if not return_dict:
+            if output_hidden_states:
+                output = (feature_maps,) + outputs[1:]
+            else:
+                output = (feature_maps,) + outputs[2:]
+            return output
+
+        return BackboneOutput(
+            feature_maps=feature_maps,
+            hidden_states=outputs.hidden_states if output_hidden_states else None,
+            attentions=outputs.attentions if output_attentions else None,
+        )
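
As a quick illustration of the classification head added above, here is a minimal sketch (toy tensors and an arbitrary label count, not the library code) of why the classifier input size is `2 * hidden_size`: the [CLS] token is concatenated with the mean of the patch tokens.

```python
import torch

# Toy shapes; hidden_size=768 matches dinov2-base, but any values work here.
batch_size, num_patches, hidden_size = 2, 256, 768
sequence_output = torch.randn(batch_size, 1 + num_patches, hidden_size)

cls_token = sequence_output[:, 0]      # (batch_size, hidden_size)
patch_tokens = sequence_output[:, 1:]  # (batch_size, num_patches, hidden_size)

# Concatenate [CLS] with the mean-pooled patch tokens -> (batch_size, 2 * hidden_size)
linear_input = torch.cat([cls_token, patch_tokens.mean(dim=1)], dim=1)

classifier = torch.nn.Linear(2 * hidden_size, 1000)  # 1000 labels is an arbitrary choice
logits = classifier(linear_input)
print(logits.shape)  # torch.Size([2, 1000])
```
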
diff --git a/src/transformers/models/distilbert/configuration_distilbert.py b/src/transformers/models/distilbert/configuration_distilbert.py
index 3dabb3d3e2340e..97b5b7c869064b 100644
--- a/src/transformers/models/distilbert/configuration_distilbert.py
+++ b/src/transformers/models/distilbert/configuration_distilbert.py
@@ -98,6 +98,7 @@ class DistilBertConfig(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "distilbert"
     attribute_map = {
         "hidden_size": "dim",
diff --git a/src/transformers/models/distilbert/modeling_distilbert.py b/src/transformers/models/distilbert/modeling_distilbert.py
index 535c7372382dd7..481e4c427119c1 100755
--- a/src/transformers/models/distilbert/modeling_distilbert.py
+++ b/src/transformers/models/distilbert/modeling_distilbert.py
@@ -24,13 +24,13 @@
 
 import numpy as np
 import torch
+import torch.nn.functional as F
 from torch import nn
 from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
 
-from transformers.configuration_utils import PretrainedConfig
-
 from ...activations import get_activation
-from ...deepspeed import is_deepspeed_zero3_enabled
+from ...configuration_utils import PretrainedConfig
+from ...integrations.deepspeed import is_deepspeed_zero3_enabled
 from ...modeling_outputs import (
     BaseModelOutput,
     MaskedLMOutput,
@@ -45,12 +45,19 @@
     add_code_sample_docstrings,
     add_start_docstrings,
     add_start_docstrings_to_model_forward,
+    is_flash_attn_2_available,
+    is_flash_attn_greater_or_equal_2_10,
     logging,
     replace_return_docstrings,
 )
 from .configuration_distilbert import DistilBertConfig
 
 
+if is_flash_attn_2_available():
+    from flash_attn import flash_attn_func, flash_attn_varlen_func
+    from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input  # noqa
+
+
 logger = logging.get_logger(__name__)
 _CHECKPOINT_FOR_DOC = "distilbert-base-uncased"
 _CONFIG_FOR_DOC = "DistilBertConfig"
@@ -70,6 +77,19 @@
 # UTILS AND BUILDING BLOCKS OF THE ARCHITECTURE #
 
 
+# Copied from transformers.models.llama.modeling_llama._get_unpad_data
+def _get_unpad_data(attention_mask):
+    seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
+    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
+    max_seqlen_in_batch = seqlens_in_batch.max().item()
+    cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
+    return (
+        indices,
+        cu_seqlens,
+        max_seqlen_in_batch,
+    )
+
+
 def create_sinusoidal_embeddings(n_pos: int, dim: int, out: torch.Tensor):
     if is_deepspeed_zero3_enabled():
         import deepspeed
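
To make the `_get_unpad_data` helper added above more concrete, here is a small self-contained sketch (toy padding mask, not library code) of what it computes from an attention mask: the flattened indices of real tokens, the cumulative sequence lengths, and the longest sequence in the batch.

```python
import torch
import torch.nn.functional as F

# Batch of 2 sequences padded to length 5: 1 = real token, 0 = padding.
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])

seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)             # [3, 5]
indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()  # positions of real tokens
max_seqlen_in_batch = seqlens_in_batch.max().item()                          # 5
cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))

print(indices)     # [0, 1, 2, 5, 6, 7, 8, 9]
print(cu_seqlens)  # [0, 3, 8] -> sequence boundaries in the packed (unpadded) tensor
```
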
@@ -105,15 +125,22 @@ def __init__(self, config: PretrainedConfig):
             "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False
         )
 
-    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
+    def forward(self, input_ids: torch.Tensor, input_embeds: Optional[torch.Tensor] = None) -> torch.Tensor:
         """
         Parameters:
-            input_ids: torch.tensor(bs, max_seq_length) The token ids to embed.
+            input_ids (torch.Tensor):
+                torch.tensor(bs, max_seq_length) The token ids to embed.
+            input_embeds (`torch.Tensor`, *optional*):
+                The pre-computed word embeddings. Can only be passed if `input_ids` is `None`.
+
 
         Returns: torch.tensor(bs, max_seq_length, dim) The embedded tokens (plus position embeddings, no token_type
         embeddings)
         """
-        seq_length = input_ids.size(1)
+        if input_ids is not None:
+            input_embeds = self.word_embeddings(input_ids)  # (bs, max_seq_length, dim)
+
+        seq_length = input_embeds.size(1)
 
         # Setting the position-ids to the registered buffer in constructor, it helps
         # when tracing the model without passing position-ids, solves
@@ -124,10 +151,9 @@ def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
             position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)  # (max_seq_length)
             position_ids = position_ids.unsqueeze(0).expand_as(input_ids)  # (bs, max_seq_length)
 
-        word_embeddings = self.word_embeddings(input_ids)  # (bs, max_seq_length, dim)
         position_embeddings = self.position_embeddings(position_ids)  # (bs, max_seq_length, dim)
 
-        embeddings = word_embeddings + position_embeddings  # (bs, max_seq_length, dim)
+        embeddings = input_embeds + position_embeddings  # (bs, max_seq_length, dim)
         embeddings = self.LayerNorm(embeddings)  # (bs, max_seq_length, dim)
         embeddings = self.dropout(embeddings)  # (bs, max_seq_length, dim)
         return embeddings
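
The reworked `Embeddings.forward` above consumes pre-computed `input_embeds` when `input_ids` is `None`. A hedged usage sketch of the corresponding model-level path (the checkpoint name is only an example):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

enc = tokenizer("Hello world", return_tensors="pt")

# Pre-compute word embeddings and pass them instead of input_ids; the embedding
# layer then only adds position embeddings on top of them.
inputs_embeds = model.get_input_embeddings()(enc["input_ids"])
with torch.no_grad():
    outputs = model(inputs_embeds=inputs_embeds, attention_mask=enc["attention_mask"])
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768]) for distilbert-base
```
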
@@ -136,10 +162,12 @@ def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
 class MultiHeadSelfAttention(nn.Module):
     def __init__(self, config: PretrainedConfig):
         super().__init__()
+        self.config = config
 
         self.n_heads = config.n_heads
         self.dim = config.dim
         self.dropout = nn.Dropout(p=config.attention_dropout)
+        self.is_causal = False
 
         # Have an even number of multi heads that divide the dimensions
         if self.dim % self.n_heads != 0:
@@ -235,6 +263,195 @@ def unshape(x: torch.Tensor) -> torch.Tensor:
             return (context,)
 
 
+class DistilBertFlashAttention2(MultiHeadSelfAttention):
+    """
+    DistilBert flash attention module. This module inherits from `MultiHeadSelfAttention`, as the weights of the module
+    stay untouched. The only required change is in the forward pass, which needs to correctly call the public
+    API of flash attention and deal with padding tokens in case the input contains any of them.
+    """
+
+    # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2.__init__
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+
+        # TODO: Should be removed once Flash Attention for RoCm is bumped to 2.1.
+        # flash_attn<2.1 generates a top-left aligned causal mask, while what is needed here is bottom-right alignment,
+        # which became the default for flash_attn>=2.1. This attribute is used to handle this difference.
+        # Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
+        # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
+        self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
+
+    def forward(
+        self,
+        query: torch.Tensor,
+        key: torch.Tensor,
+        value: torch.Tensor,
+        mask: torch.Tensor,
+        head_mask: Optional[torch.Tensor] = None,
+        output_attentions: bool = False,
+    ) -> Tuple[torch.Tensor, ...]:
+        """
+        Parameters:
+            query: torch.tensor(bs, seq_length, dim)
+            key: torch.tensor(bs, seq_length, dim)
+            value: torch.tensor(bs, seq_length, dim)
+            mask: torch.tensor(bs, seq_length)
+
+        Returns:
+            weights: torch.tensor(bs, n_heads, seq_length, seq_length) Attention weights. Optional: only if
+                `output_attentions=True`
+            context: torch.tensor(bs, seq_length, dim) Contextualized layer.
+        """
+        batch_size, q_length, dim = query.size()
+
+        dim_per_head = self.dim // self.n_heads
+
+        def reshape(x: torch.Tensor) -> torch.Tensor:
+            """separate heads"""
+            return x.view(batch_size, -1, self.n_heads, dim_per_head)
+
+        # Flash attention requires the input to have the shape
+        # batch_size x seq_length x n_heads x head_dim
+        query_states = reshape(self.q_lin(query))
+        key_states = reshape(self.k_lin(key))
+        value_states = reshape(self.v_lin(value))
+
+        attn_dropout = self.config.attention_dropout if self.training else 0.0
+
+        # In PEFT, the layer norms are usually cast to float32 for training stability,
+        # so the input hidden states get silently cast to float32. We therefore need to
+        # cast them back to the correct dtype to make sure everything works as expected.
+        # This might slow down training & inference, so it is recommended not to cast the
+        # LayerNorms to fp32. (LlamaRMSNorm handles it correctly)
+
+        if query_states.dtype == torch.float32:
+            if torch.is_autocast_enabled():
+                target_dtype = torch.get_autocast_gpu_dtype()
+            # Handle the case where the model is quantized
+            elif hasattr(self.config, "_pre_quantization_dtype"):
+                target_dtype = self.config._pre_quantization_dtype
+            else:
+                target_dtype = self.q_lin.weight.dtype
+
+            logger.warning_once(
+                "The input hidden states seem to have been silently cast to float32; this might be because"
+                " you have upcast embedding or layer norm layers to float32. We will cast the input back to"
+                f" {target_dtype}."
+            )
+
+            query_states = query_states.to(target_dtype)
+            key_states = key_states.to(target_dtype)
+            value_states = value_states.to(target_dtype)
+
+        attn_weights = self._flash_attention_forward(
+            query_states, key_states, value_states, mask, q_length, dropout=attn_dropout
+        )
+
+        attn_weights_reshaped = attn_weights.reshape(batch_size, q_length, self.n_heads * dim_per_head)
+        attn_output = self.out_lin(attn_weights_reshaped)
+
+        if output_attentions:
+            return (attn_output, attn_weights)
+        else:
+            return (attn_output,)
+
+    # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2._flash_attention_forward with causal=True->causal=False
+    def _flash_attention_forward(
+        self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None
+    ):
+        """
+        Calls the forward method of Flash Attention - if the input hidden states contain at least one padding token,
+        we first unpad the input, then compute the attention scores, and finally pad the output back.
+
+        Args:
+            query_states (`torch.Tensor`):
+                Input query states to be passed to Flash Attention API
+            key_states (`torch.Tensor`):
+                Input key states to be passed to Flash Attention API
+            value_states (`torch.Tensor`):
+                Input value states to be passed to Flash Attention API
+            attention_mask (`torch.Tensor`):
+                The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
+                position of padding tokens and 1 for the position of non-padding tokens.
+            dropout (`float`, *optional*):
+                Attention dropout probability
+            softmax_scale (`float`, *optional*):
+                The scaling of QK^T before applying softmax. Defaults to `1 / sqrt(head_dim)`.
+        """
+        if not self._flash_attn_uses_top_left_mask:
+            causal = self.is_causal
+        else:
+            # TODO: Remove the `query_length != 1` check once Flash Attention for RoCm is bumped to 2.1. For details, please see the comment in LlamaFlashAttention2 __init__.
+            causal = self.is_causal and query_length != 1
+
+        # Contains at least one padding token in the sequence
+        if attention_mask is not None:
+            batch_size = query_states.shape[0]
+            query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
+                query_states, key_states, value_states, attention_mask, query_length
+            )
+
+            cu_seqlens_q, cu_seqlens_k = cu_seq_lens
+            max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
+
+            attn_output_unpad = flash_attn_varlen_func(
+                query_states,
+                key_states,
+                value_states,
+                cu_seqlens_q=cu_seqlens_q,
+                cu_seqlens_k=cu_seqlens_k,
+                max_seqlen_q=max_seqlen_in_batch_q,
+                max_seqlen_k=max_seqlen_in_batch_k,
+                dropout_p=dropout,
+                softmax_scale=softmax_scale,
+                causal=causal,
+            )
+
+            attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
+        else:
+            attn_output = flash_attn_func(
+                query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=causal
+            )
+
+        return attn_output
+
+    # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2._upad_input with num_heads->n_heads
+    def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
+        indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
+        batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
+
+        key_layer = index_first_axis(
+            key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
+        )
+        value_layer = index_first_axis(
+            value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
+        )
+        if query_length == kv_seq_len:
+            query_layer = index_first_axis(
+                query_layer.reshape(batch_size * kv_seq_len, self.n_heads, head_dim), indices_k
+            )
+            cu_seqlens_q = cu_seqlens_k
+            max_seqlen_in_batch_q = max_seqlen_in_batch_k
+            indices_q = indices_k
+        elif query_length == 1:
+            max_seqlen_in_batch_q = 1
+            cu_seqlens_q = torch.arange(
+                batch_size + 1, dtype=torch.int32, device=query_layer.device
+            )  # There is a memcpy here, which is very bad.
+            indices_q = cu_seqlens_q[:-1]
+            query_layer = query_layer.squeeze(1)
+        else:
+            # The -q_len: slice assumes left padding.
+            attention_mask = attention_mask[:, -query_length:]
+            query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
+
+        return (
+            query_layer,
+            key_layer,
+            value_layer,
+            indices_q,
+            (cu_seqlens_q, cu_seqlens_k),
+            (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
+        )
+
+
 class FFN(nn.Module):
     def __init__(self, config: PretrainedConfig):
         super().__init__()
@@ -256,6 +473,12 @@ def ff_chunk(self, input: torch.Tensor) -> torch.Tensor:
         return x
 
 
+DISTILBERT_ATTENTION_CLASSES = {
+    "eager": MultiHeadSelfAttention,
+    "flash_attention_2": DistilBertFlashAttention2,
+}
+
+
 class TransformerBlock(nn.Module):
     def __init__(self, config: PretrainedConfig):
         super().__init__()
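
The `DISTILBERT_ATTENTION_CLASSES` mapping above lets each `TransformerBlock` pick its attention backend from `config._attn_implementation`. A self-contained toy version of that registry pattern (the class names here are illustrative placeholders, not the library's):

```python
# Toy registry dispatch: the block looks up its attention implementation by the
# key stored on the config, so adding a backend only means adding a dict entry.
class EagerAttention:
    def __init__(self, config):
        self.kind = "eager"


class FlashAttention2:
    def __init__(self, config):
        self.kind = "flash_attention_2"


ATTENTION_CLASSES = {"eager": EagerAttention, "flash_attention_2": FlashAttention2}


class ToyBlock:
    def __init__(self, config):
        self.attention = ATTENTION_CLASSES[config["_attn_implementation"]](config)


block = ToyBlock({"_attn_implementation": "flash_attention_2"})
print(block.attention.kind)  # flash_attention_2
```
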
@@ -264,7 +487,7 @@ def __init__(self, config: PretrainedConfig):
         if config.dim % config.n_heads != 0:
             raise ValueError(f"config.n_heads {config.n_heads} must divide config.dim {config.dim} evenly")
 
-        self.attention = MultiHeadSelfAttention(config)
+        self.attention = DISTILBERT_ATTENTION_CLASSES[config._attn_implementation](config)
         self.sa_layer_norm = nn.LayerNorm(normalized_shape=config.dim, eps=1e-12)
 
         self.ffn = FFN(config)
@@ -319,6 +542,7 @@ def __init__(self, config: PretrainedConfig):
         super().__init__()
         self.n_layers = config.n_layers
         self.layer = nn.ModuleList([TransformerBlock(config) for _ in range(config.n_layers)])
+        self.gradient_checkpointing = False
 
     def forward(
         self,
@@ -351,9 +575,22 @@ def forward(
             if output_hidden_states:
                 all_hidden_states = all_hidden_states + (hidden_state,)
 
-            layer_outputs = layer_module(
-                x=hidden_state, attn_mask=attn_mask, head_mask=head_mask[i], output_attentions=output_attentions
-            )
+            if self.gradient_checkpointing and self.training:
+                layer_outputs = self._gradient_checkpointing_func(
+                    layer_module.__call__,
+                    hidden_state,
+                    attn_mask,
+                    head_mask[i],
+                    output_attentions,
+                )
+            else:
+                layer_outputs = layer_module(
+                    hidden_state,
+                    attn_mask,
+                    head_mask[i],
+                    output_attentions,
+                )
+
             hidden_state = layer_outputs[-1]
 
             if output_attentions:
@@ -387,6 +624,8 @@ class DistilBertPreTrainedModel(PreTrainedModel):
     config_class = DistilBertConfig
     load_tf_weights = None
     base_model_prefix = "distilbert"
+    supports_gradient_checkpointing = True
+    _supports_flash_attn_2 = True
 
     def _init_weights(self, module: nn.Module):
         """Initialize the weights."""
@@ -468,6 +707,7 @@ def __init__(self, config: PretrainedConfig):
 
         self.embeddings = Embeddings(config)  # Embeddings
         self.transformer = Transformer(config)  # Encoder
+        self._use_flash_attention_2 = config._attn_implementation == "flash_attention_2"
 
         # Initialize weights and apply final processing
         self.post_init()
@@ -559,6 +799,7 @@ def forward(
         if input_ids is not None and inputs_embeds is not None:
             raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
         elif input_ids is not None:
+            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
             input_shape = input_ids.size()
         elif inputs_embeds is not None:
             input_shape = inputs_embeds.size()[:-1]
@@ -567,16 +808,19 @@ def forward(
 
         device = input_ids.device if input_ids is not None else inputs_embeds.device
 
-        if attention_mask is None:
-            attention_mask = torch.ones(input_shape, device=device)  # (bs, seq_length)
-
         # Prepare head mask if needed
         head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
 
-        if inputs_embeds is None:
-            inputs_embeds = self.embeddings(input_ids)  # (bs, seq_length, dim)
+        embeddings = self.embeddings(input_ids, inputs_embeds)  # (bs, seq_length, dim)
+
+        if self._use_flash_attention_2:
+            attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
+        else:
+            if attention_mask is None:
+                attention_mask = torch.ones(input_shape, device=device)  # (bs, seq_length)
+
         return self.transformer(
-            x=inputs_embeds,
+            x=embeddings,
             attn_mask=attention_mask,
             head_mask=head_mask,
             output_attentions=output_attentions,
@@ -590,7 +834,7 @@ def forward(
     DISTILBERT_START_DOCSTRING,
 )
 class DistilBertForMaskedLM(DistilBertPreTrainedModel):
-    _keys_to_ignore_on_load_missing = ["vocab_projector.weight"]
+    _tied_weights_keys = ["vocab_projector.weight"]
 
     def __init__(self, config: PretrainedConfig):
         super().__init__(config)
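
With `_supports_flash_attn_2 = True` on the base class, DistilBERT can be loaded with the Flash Attention 2 backend. A hedged usage sketch, assuming a CUDA device and an installed `flash-attn` package (the checkpoint name is only an example):

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "distilbert-base-uncased",
    torch_dtype=torch.float16,                # FA2 expects fp16/bf16 weights
    attn_implementation="flash_attention_2",  # dispatches to DistilBertFlashAttention2
).to("cuda")
```
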
diff --git a/src/transformers/models/distilbert/modeling_flax_distilbert.py b/src/transformers/models/distilbert/modeling_flax_distilbert.py
index 24e2c7e3987e07..d3c48c077adc52 100644
--- a/src/transformers/models/distilbert/modeling_flax_distilbert.py
+++ b/src/transformers/models/distilbert/modeling_flax_distilbert.py
@@ -48,9 +48,10 @@
     This model inherits from [`FlaxPreTrainedModel`]. Check the superclass documentation for the generic methods the
     library implements for all its model (such as downloading, saving and converting weights from PyTorch models)
 
-    This model is also a Flax Linen [flax.linen.Module](https://flax.readthedocs.io/en/latest/flax.linen.html#module)
-    subclass. Use it as a regular Flax linen Module and refer to the Flax documentation for all matter related to
-    general usage and behavior.
+    This model is also a
+    [flax.linen.Module](https://flax.readthedocs.io/en/latest/api_reference/flax.linen/module.html) subclass. Use it as
+    a regular Flax linen Module and refer to the Flax documentation for all matters related to general usage and
+    behavior.
 
     Finally, this model supports inherent JAX features such as:
 
@@ -145,7 +146,7 @@ def __call__(self, input_ids, deterministic: bool = True):
             position_embeds = self.position_embeddings(position_ids.astype("i4"))
         else:
             position_embeds = self.pos_encoding[:, :seq_length, :]
-            # explictly cast the positions here, since self.embed_positions are not registered as parameters
+            # explicitly cast the positions here, since self.embed_positions are not registered as parameters
             position_embeds = position_embeds.astype(inputs_embeds.dtype)
 
         # Sum all embeddings
diff --git a/src/transformers/models/distilbert/modeling_tf_distilbert.py b/src/transformers/models/distilbert/modeling_tf_distilbert.py
index 95c3aef4261572..39fd470597fa87 100644
--- a/src/transformers/models/distilbert/modeling_tf_distilbert.py
+++ b/src/transformers/models/distilbert/modeling_tf_distilbert.py
@@ -16,6 +16,9 @@
  TF 2.0 DistilBERT model
 """
 
+
+from __future__ import annotations
+
 import warnings
 from typing import Optional, Tuple, Union
 
@@ -40,12 +43,12 @@
     TFSequenceClassificationLoss,
     TFTokenClassificationLoss,
     get_initializer,
+    keras,
     keras_serializable,
     unpack_inputs,
 )
-from ...tf_utils import shape_list, stable_softmax
+from ...tf_utils import check_embeddings_within_bounds, shape_list, stable_softmax
 from ...utils import (
-    MULTIPLE_CHOICE_DUMMY_INPUTS,
     add_code_sample_docstrings,
     add_start_docstrings,
     add_start_docstrings_to_model_forward,
@@ -70,7 +73,7 @@
 ]
 
 
-class TFEmbeddings(tf.keras.layers.Layer):
+class TFEmbeddings(keras.layers.Layer):
     """Construct the embeddings from word, position and token_type embeddings."""
 
     def __init__(self, config, **kwargs):
@@ -79,10 +82,10 @@ def __init__(self, config, **kwargs):
         self.dim = config.dim
         self.initializer_range = config.initializer_range
         self.max_position_embeddings = config.max_position_embeddings
-        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name="LayerNorm")
-        self.dropout = tf.keras.layers.Dropout(rate=config.dropout)
+        self.LayerNorm = keras.layers.LayerNormalization(epsilon=1e-12, name="LayerNorm")
+        self.dropout = keras.layers.Dropout(rate=config.dropout)
 
-    def build(self, input_shape: tf.TensorShape):
+    def build(self, input_shape=None):
         with tf.name_scope("word_embeddings"):
             self.weight = self.add_weight(
                 name="weight",
@@ -97,7 +100,12 @@ def build(self, input_shape: tf.TensorShape):
                 initializer=get_initializer(initializer_range=self.initializer_range),
             )
 
-        super().build(input_shape)
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "LayerNorm", None) is not None:
+            with tf.name_scope(self.LayerNorm.name):
+                self.LayerNorm.build([None, None, self.config.dim])
 
     def call(self, input_ids=None, position_ids=None, inputs_embeds=None, training=False):
         """
@@ -109,16 +117,7 @@ def call(self, input_ids=None, position_ids=None, inputs_embeds=None, training=F
         assert not (input_ids is None and inputs_embeds is None)
 
         if input_ids is not None:
-            # Note: tf.gather, on which the embedding layer is based, won't check positive out of bound
-            # indices on GPU, returning zeros instead. This is a dangerous silent behavior.
-            tf.debugging.assert_less(
-                input_ids,
-                tf.cast(self.config.vocab_size, dtype=input_ids.dtype),
-                message=(
-                    "input_ids must be smaller than the embedding layer's input dimension (got"
-                    f" {tf.math.reduce_max(input_ids)} >= {self.config.vocab_size})"
-                ),
-            )
+            check_embeddings_within_bounds(input_ids, self.config.vocab_size)
             inputs_embeds = tf.gather(params=self.weight, indices=input_ids)
 
         input_shape = shape_list(inputs_embeds)[:-1]
@@ -134,31 +133,32 @@ def call(self, input_ids=None, position_ids=None, inputs_embeds=None, training=F
         return final_embeddings
 
 
-class TFMultiHeadSelfAttention(tf.keras.layers.Layer):
+class TFMultiHeadSelfAttention(keras.layers.Layer):
     def __init__(self, config, **kwargs):
         super().__init__(**kwargs)
 
         self.n_heads = config.n_heads
         self.dim = config.dim
-        self.dropout = tf.keras.layers.Dropout(config.attention_dropout)
+        self.dropout = keras.layers.Dropout(config.attention_dropout)
         self.output_attentions = config.output_attentions
 
         assert self.dim % self.n_heads == 0, f"Hidden size {self.dim} not dividable by number of heads {self.n_heads}"
 
-        self.q_lin = tf.keras.layers.Dense(
+        self.q_lin = keras.layers.Dense(
             config.dim, kernel_initializer=get_initializer(config.initializer_range), name="q_lin"
         )
-        self.k_lin = tf.keras.layers.Dense(
+        self.k_lin = keras.layers.Dense(
             config.dim, kernel_initializer=get_initializer(config.initializer_range), name="k_lin"
         )
-        self.v_lin = tf.keras.layers.Dense(
+        self.v_lin = keras.layers.Dense(
             config.dim, kernel_initializer=get_initializer(config.initializer_range), name="v_lin"
         )
-        self.out_lin = tf.keras.layers.Dense(
+        self.out_lin = keras.layers.Dense(
             config.dim, kernel_initializer=get_initializer(config.initializer_range), name="out_lin"
         )
 
         self.pruned_heads = set()
+        self.config = config
 
     def prune_heads(self, heads):
         raise NotImplementedError
@@ -219,18 +219,36 @@ def unshape(x):
         else:
             return (context,)
 
-
-class TFFFN(tf.keras.layers.Layer):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "q_lin", None) is not None:
+            with tf.name_scope(self.q_lin.name):
+                self.q_lin.build([None, None, self.config.dim])
+        if getattr(self, "k_lin", None) is not None:
+            with tf.name_scope(self.k_lin.name):
+                self.k_lin.build([None, None, self.config.dim])
+        if getattr(self, "v_lin", None) is not None:
+            with tf.name_scope(self.v_lin.name):
+                self.v_lin.build([None, None, self.config.dim])
+        if getattr(self, "out_lin", None) is not None:
+            with tf.name_scope(self.out_lin.name):
+                self.out_lin.build([None, None, self.config.dim])
+
+
+class TFFFN(keras.layers.Layer):
     def __init__(self, config, **kwargs):
         super().__init__(**kwargs)
-        self.dropout = tf.keras.layers.Dropout(config.dropout)
-        self.lin1 = tf.keras.layers.Dense(
+        self.dropout = keras.layers.Dropout(config.dropout)
+        self.lin1 = keras.layers.Dense(
             config.hidden_dim, kernel_initializer=get_initializer(config.initializer_range), name="lin1"
         )
-        self.lin2 = tf.keras.layers.Dense(
+        self.lin2 = keras.layers.Dense(
             config.dim, kernel_initializer=get_initializer(config.initializer_range), name="lin2"
         )
         self.activation = get_tf_activation(config.activation)
+        self.config = config
 
     def call(self, input, training=False):
         x = self.lin1(input)
@@ -239,15 +257,26 @@ def call(self, input, training=False):
         x = self.dropout(x, training=training)
         return x
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "lin1", None) is not None:
+            with tf.name_scope(self.lin1.name):
+                self.lin1.build([None, None, self.config.dim])
+        if getattr(self, "lin2", None) is not None:
+            with tf.name_scope(self.lin2.name):
+                self.lin2.build([None, None, self.config.hidden_dim])
 
-class TFTransformerBlock(tf.keras.layers.Layer):
+
+class TFTransformerBlock(keras.layers.Layer):
     def __init__(self, config, **kwargs):
         super().__init__(**kwargs)
 
         self.n_heads = config.n_heads
         self.dim = config.dim
         self.hidden_dim = config.hidden_dim
-        self.dropout = tf.keras.layers.Dropout(config.dropout)
+        self.dropout = keras.layers.Dropout(config.dropout)
         self.activation = config.activation
         self.output_attentions = config.output_attentions
 
@@ -256,10 +285,11 @@ def __init__(self, config, **kwargs):
         ), f"Hidden size {config.dim} not dividable by number of heads {config.n_heads}"
 
         self.attention = TFMultiHeadSelfAttention(config, name="attention")
-        self.sa_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name="sa_layer_norm")
+        self.sa_layer_norm = keras.layers.LayerNormalization(epsilon=1e-12, name="sa_layer_norm")
 
         self.ffn = TFFFN(config, name="ffn")
-        self.output_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name="output_layer_norm")
+        self.output_layer_norm = keras.layers.LayerNormalization(epsilon=1e-12, name="output_layer_norm")
+        self.config = config
 
     def call(self, x, attn_mask, head_mask, output_attentions, training=False):  # removed: src_enc=None, src_len=None
         """
@@ -288,8 +318,25 @@ def call(self, x, attn_mask, head_mask, output_attentions, training=False):  # r
             output = (sa_weights,) + output
         return output
 
-
-class TFTransformer(tf.keras.layers.Layer):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "attention", None) is not None:
+            with tf.name_scope(self.attention.name):
+                self.attention.build(None)
+        if getattr(self, "sa_layer_norm", None) is not None:
+            with tf.name_scope(self.sa_layer_norm.name):
+                self.sa_layer_norm.build([None, None, self.config.dim])
+        if getattr(self, "ffn", None) is not None:
+            with tf.name_scope(self.ffn.name):
+                self.ffn.build(None)
+        if getattr(self, "output_layer_norm", None) is not None:
+            with tf.name_scope(self.output_layer_norm.name):
+                self.output_layer_norm.build([None, None, self.config.dim])
+
+
+class TFTransformer(keras.layers.Layer):
     def __init__(self, config, **kwargs):
         super().__init__(**kwargs)
         self.n_layers = config.n_layers
@@ -343,9 +390,18 @@ def call(self, x, attn_mask, head_mask, output_attentions, output_hidden_states,
             last_hidden_state=hidden_state, hidden_states=all_hidden_states, attentions=all_attentions
         )
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "layer", None) is not None:
+            for layer in self.layer:
+                with tf.name_scope(layer.name):
+                    layer.build(None)
+
 
 @keras_serializable
-class TFDistilBertMainLayer(tf.keras.layers.Layer):
+class TFDistilBertMainLayer(keras.layers.Layer):
     config_class = DistilBertConfig
 
     def __init__(self, config, **kwargs):
@@ -419,6 +475,17 @@ def call(
 
         return tfmr_output  # last-layer hidden-state, (all hidden_states), (all attentions)
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "embeddings", None) is not None:
+            with tf.name_scope(self.embeddings.name):
+                self.embeddings.build(None)
+        if getattr(self, "transformer", None) is not None:
+            with tf.name_scope(self.transformer.name):
+                self.transformer.build(None)
+
 
 # INTERFACE FOR ENCODER AND TASK SPECIFIC MODEL #
 class TFDistilBertPreTrainedModel(TFPreTrainedModel):
@@ -430,19 +497,6 @@ class TFDistilBertPreTrainedModel(TFPreTrainedModel):
     config_class = DistilBertConfig
     base_model_prefix = "distilbert"
 
-    @tf.function(
-        input_signature=[
-            {
-                "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"),
-                "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"),
-            }
-        ]
-    )
-    def serving(self, inputs):
-        output = self.call(inputs)
-
-        return self.serving_output(output)
-
 
 DISTILBERT_START_DOCSTRING = r"""
 
@@ -450,7 +504,7 @@ def serving(self, inputs):
     library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
     etc.)
 
-    This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
+    This model is also a [keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
     as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage and
     behavior.
 
@@ -547,10 +601,10 @@ def __init__(self, config, *inputs, **kwargs):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
@@ -568,14 +622,16 @@ def call(
         )
         return outputs
 
-    def serving_output(self, output):
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFBaseModelOutput(last_hidden_state=output.last_hidden_state, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "distilbert", None) is not None:
+            with tf.name_scope(self.distilbert.name):
+                self.distilbert.build(None)
 
 
-class TFDistilBertLMHead(tf.keras.layers.Layer):
+class TFDistilBertLMHead(keras.layers.Layer):
     def __init__(self, config, input_embeddings, **kwargs):
         super().__init__(**kwargs)
 
@@ -625,11 +681,11 @@ def __init__(self, config, *inputs, **kwargs):
         self.config = config
 
         self.distilbert = TFDistilBertMainLayer(config, name="distilbert")
-        self.vocab_transform = tf.keras.layers.Dense(
+        self.vocab_transform = keras.layers.Dense(
             config.dim, kernel_initializer=get_initializer(config.initializer_range), name="vocab_transform"
         )
         self.act = get_tf_activation(config.activation)
-        self.vocab_layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-12, name="vocab_layer_norm")
+        self.vocab_layer_norm = keras.layers.LayerNormalization(epsilon=1e-12, name="vocab_layer_norm")
         self.vocab_projector = TFDistilBertLMHead(config, self.distilbert.embeddings, name="vocab_projector")
 
     def get_lm_head(self):
@@ -648,14 +704,14 @@ def get_prefix_bias_name(self):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[TFMaskedLMOutput, Tuple[tf.Tensor]]:
         r"""
@@ -693,12 +749,22 @@ def call(
             attentions=distilbert_output.attentions,
         )
 
-    # Copied from transformers.models.bert.modeling_tf_bert.TFBertForMaskedLM.serving_output
-    def serving_output(self, output: TFMaskedLMOutput) -> TFMaskedLMOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFMaskedLMOutput(logits=output.logits, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "distilbert", None) is not None:
+            with tf.name_scope(self.distilbert.name):
+                self.distilbert.build(None)
+        if getattr(self, "vocab_transform", None) is not None:
+            with tf.name_scope(self.vocab_transform.name):
+                self.vocab_transform.build([None, None, self.config.dim])
+        if getattr(self, "vocab_layer_norm", None) is not None:
+            with tf.name_scope(self.vocab_layer_norm.name):
+                self.vocab_layer_norm.build([None, None, self.config.dim])
+        if getattr(self, "vocab_projector", None) is not None:
+            with tf.name_scope(self.vocab_projector.name):
+                self.vocab_projector.build(None)
 
 
 @add_start_docstrings(
@@ -714,16 +780,17 @@ def __init__(self, config, *inputs, **kwargs):
         self.num_labels = config.num_labels
 
         self.distilbert = TFDistilBertMainLayer(config, name="distilbert")
-        self.pre_classifier = tf.keras.layers.Dense(
+        self.pre_classifier = keras.layers.Dense(
             config.dim,
             kernel_initializer=get_initializer(config.initializer_range),
             activation="relu",
             name="pre_classifier",
         )
-        self.classifier = tf.keras.layers.Dense(
+        self.classifier = keras.layers.Dense(
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
         )
-        self.dropout = tf.keras.layers.Dropout(config.seq_classif_dropout)
+        self.dropout = keras.layers.Dropout(config.seq_classif_dropout)
+        self.config = config
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(DISTILBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@@ -734,14 +801,14 @@ def __init__(self, config, *inputs, **kwargs):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]:
         r"""
@@ -779,12 +846,19 @@ def call(
             attentions=distilbert_output.attentions,
         )
 
-    # Copied from transformers.models.bert.modeling_tf_bert.TFBertForSequenceClassification.serving_output
-    def serving_output(self, output: TFSequenceClassifierOutput) -> TFSequenceClassifierOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFSequenceClassifierOutput(logits=output.logits, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "distilbert", None) is not None:
+            with tf.name_scope(self.distilbert.name):
+                self.distilbert.build(None)
+        if getattr(self, "pre_classifier", None) is not None:
+            with tf.name_scope(self.pre_classifier.name):
+                self.pre_classifier.build([None, None, self.config.dim])
+        if getattr(self, "classifier", None) is not None:
+            with tf.name_scope(self.classifier.name):
+                self.classifier.build([None, None, self.config.dim])
 
 
 @add_start_docstrings(
@@ -800,10 +874,11 @@ def __init__(self, config, *inputs, **kwargs):
         self.num_labels = config.num_labels
 
         self.distilbert = TFDistilBertMainLayer(config, name="distilbert")
-        self.dropout = tf.keras.layers.Dropout(config.dropout)
-        self.classifier = tf.keras.layers.Dense(
+        self.dropout = keras.layers.Dropout(config.dropout)
+        self.classifier = keras.layers.Dense(
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
         )
+        self.config = config
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(DISTILBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@@ -814,14 +889,14 @@ def __init__(self, config, *inputs, **kwargs):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[TFTokenClassifierOutput, Tuple[tf.Tensor]]:
         r"""
@@ -854,12 +929,16 @@ def call(
             attentions=outputs.attentions,
         )
 
-    # Copied from transformers.models.bert.modeling_tf_bert.TFBertForTokenClassification.serving_output
-    def serving_output(self, output: TFTokenClassifierOutput) -> TFTokenClassifierOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFTokenClassifierOutput(logits=output.logits, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "distilbert", None) is not None:
+            with tf.name_scope(self.distilbert.name):
+                self.distilbert.build(None)
+        if getattr(self, "classifier", None) is not None:
+            with tf.name_scope(self.classifier.name):
+                self.classifier.build([None, None, self.config.hidden_size])
 
 
 @add_start_docstrings(
@@ -874,26 +953,17 @@ def __init__(self, config, *inputs, **kwargs):
         super().__init__(config, *inputs, **kwargs)
 
         self.distilbert = TFDistilBertMainLayer(config, name="distilbert")
-        self.dropout = tf.keras.layers.Dropout(config.seq_classif_dropout)
-        self.pre_classifier = tf.keras.layers.Dense(
+        self.dropout = keras.layers.Dropout(config.seq_classif_dropout)
+        self.pre_classifier = keras.layers.Dense(
             config.dim,
             kernel_initializer=get_initializer(config.initializer_range),
             activation="relu",
             name="pre_classifier",
         )
-        self.classifier = tf.keras.layers.Dense(
+        self.classifier = keras.layers.Dense(
             1, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
         )
-
-    @property
-    def dummy_inputs(self):
-        """
-        Dummy inputs to build the network.
-
-        Returns:
-            tf.Tensor with dummy inputs
-        """
-        return {"input_ids": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS, dtype=tf.int32)}
+        self.config = config
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(
@@ -906,14 +976,14 @@ def dummy_inputs(self):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[TFMultipleChoiceModelOutput, Tuple[tf.Tensor]]:
         r"""
@@ -965,25 +1035,19 @@ def call(
             attentions=distilbert_output.attentions,
         )
 
-    @tf.function(
-        input_signature=[
-            {
-                "input_ids": tf.TensorSpec((None, None, None), tf.int32, name="input_ids"),
-                "attention_mask": tf.TensorSpec((None, None, None), tf.int32, name="attention_mask"),
-            }
-        ]
-    )
-    def serving(self, inputs):
-        output = self.call(inputs)
-
-        return self.serving_output(output)
-
-    # Copied from transformers.models.bert.modeling_tf_bert.TFBertForMultipleChoice.serving_output
-    def serving_output(self, output: TFMultipleChoiceModelOutput) -> TFMultipleChoiceModelOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFMultipleChoiceModelOutput(logits=output.logits, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "distilbert", None) is not None:
+            with tf.name_scope(self.distilbert.name):
+                self.distilbert.build(None)
+        if getattr(self, "pre_classifier", None) is not None:
+            with tf.name_scope(self.pre_classifier.name):
+                self.pre_classifier.build([None, None, self.config.dim])
+        if getattr(self, "classifier", None) is not None:
+            with tf.name_scope(self.classifier.name):
+                self.classifier.build([None, None, self.config.dim])
 
 
 @add_start_docstrings(
@@ -998,11 +1062,12 @@ def __init__(self, config, *inputs, **kwargs):
         super().__init__(config, *inputs, **kwargs)
 
         self.distilbert = TFDistilBertMainLayer(config, name="distilbert")
-        self.qa_outputs = tf.keras.layers.Dense(
+        self.qa_outputs = keras.layers.Dense(
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="qa_outputs"
         )
         assert config.num_labels == 2, f"Incorrect number of labels {config.num_labels} instead of 2"
-        self.dropout = tf.keras.layers.Dropout(config.qa_dropout)
+        self.dropout = keras.layers.Dropout(config.qa_dropout)
+        self.config = config
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(DISTILBERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@@ -1013,15 +1078,15 @@ def __init__(self, config, *inputs, **kwargs):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        start_positions: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        end_positions: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        start_positions: np.ndarray | tf.Tensor | None = None,
+        end_positions: np.ndarray | tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[TFQuestionAnsweringModelOutput, Tuple[tf.Tensor]]:
         r"""
@@ -1069,11 +1134,13 @@ def call(
             attentions=distilbert_output.attentions,
         )
 
-    # Copied from transformers.models.bert.modeling_tf_bert.TFBertForQuestionAnswering.serving_output
-    def serving_output(self, output: TFQuestionAnsweringModelOutput) -> TFQuestionAnsweringModelOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFQuestionAnsweringModelOutput(
-            start_logits=output.start_logits, end_logits=output.end_logits, hidden_states=hs, attentions=attns
-        )
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "distilbert", None) is not None:
+            with tf.name_scope(self.distilbert.name):
+                self.distilbert.build(None)
+        if getattr(self, "qa_outputs", None) is not None:
+            with tf.name_scope(self.qa_outputs.name):
+                self.qa_outputs.build([None, None, self.config.dim])
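
The TF-side changes above replace the removed `serving_output` methods with idempotent, name-scoped `build` methods and switch from `tf.keras` to the shared `keras` import. A minimal toy sketch of that deferred-build pattern (not the library code):

```python
import tensorflow as tf
from tensorflow import keras


class ToyLayer(keras.layers.Layer):
    def __init__(self, dim, **kwargs):
        super().__init__(**kwargs)
        self.dim = dim
        self.dense = keras.layers.Dense(dim, name="dense")

    def call(self, inputs):
        return self.dense(inputs)

    def build(self, input_shape=None):
        # Idempotent build: sublayers are built lazily, each under its own name scope,
        # so weight names stay stable no matter how the layer is first invoked.
        if self.built:
            return
        self.built = True
        if getattr(self, "dense", None) is not None:
            with tf.name_scope(self.dense.name):
                self.dense.build([None, None, self.dim])
```
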
diff --git a/src/transformers/models/distilbert/tokenization_distilbert.py b/src/transformers/models/distilbert/tokenization_distilbert.py
index 76582ae4eab1cd..014c41d1243b6f 100644
--- a/src/transformers/models/distilbert/tokenization_distilbert.py
+++ b/src/transformers/models/distilbert/tokenization_distilbert.py
@@ -149,20 +149,6 @@ def __init__(
         strip_accents=None,
         **kwargs,
     ):
-        super().__init__(
-            do_lower_case=do_lower_case,
-            do_basic_tokenize=do_basic_tokenize,
-            never_split=never_split,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            tokenize_chinese_chars=tokenize_chinese_chars,
-            strip_accents=strip_accents,
-            **kwargs,
-        )
-
         if not os.path.isfile(vocab_file):
             raise ValueError(
                 f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
@@ -178,7 +164,21 @@ def __init__(
                 tokenize_chinese_chars=tokenize_chinese_chars,
                 strip_accents=strip_accents,
             )
-        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
+        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=str(unk_token))
+
+        super().__init__(
+            do_lower_case=do_lower_case,
+            do_basic_tokenize=do_basic_tokenize,
+            never_split=never_split,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            mask_token=mask_token,
+            tokenize_chinese_chars=tokenize_chinese_chars,
+            strip_accents=strip_accents,
+            **kwargs,
+        )
 
     @property
     # Copied from transformers.models.bert.tokenization_bert.BertTokenizer.do_lower_case
@@ -195,10 +195,12 @@ def get_vocab(self):
         return dict(self.vocab, **self.added_tokens_encoder)
 
     # Copied from transformers.models.bert.tokenization_bert.BertTokenizer._tokenize
-    def _tokenize(self, text):
+    def _tokenize(self, text, split_special_tokens=False):
         split_tokens = []
         if self.do_basic_tokenize:
-            for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):
+            for token in self.basic_tokenizer.tokenize(
+                text, never_split=self.all_special_tokens if not split_special_tokens else None
+            ):
                 # If the token is part of the never_split set
                 if token in self.basic_tokenizer.never_split:
                     split_tokens.append(token)
@@ -350,20 +352,30 @@ class BasicTokenizer(object):
         strip_accents (`bool`, *optional*):
             Whether or not to strip all accents. If this option is not specified, then it will be determined by the
             value for `lowercase` (as in the original BERT).
+        do_split_on_punc (`bool`, *optional*, defaults to `True`):
+            In some instances we want to skip the basic punctuation splitting so that later tokenization can capture
+            the full context of the words, such as contractions.
     """
 
-    def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True, strip_accents=None):
+    def __init__(
+        self,
+        do_lower_case=True,
+        never_split=None,
+        tokenize_chinese_chars=True,
+        strip_accents=None,
+        do_split_on_punc=True,
+    ):
         if never_split is None:
             never_split = []
         self.do_lower_case = do_lower_case
         self.never_split = set(never_split)
         self.tokenize_chinese_chars = tokenize_chinese_chars
         self.strip_accents = strip_accents
+        self.do_split_on_punc = do_split_on_punc
 
     def tokenize(self, text, never_split=None):
         """
-        Basic Tokenization of a piece of text. Split on "white spaces" only, for sub-word tokenization, see
-        WordPieceTokenizer.
+        Basic Tokenization of a piece of text. For sub-word tokenization, see WordPieceTokenizer.
 
         Args:
             never_split (`List[str]`, *optional*)
@@ -382,7 +394,9 @@ def tokenize(self, text, never_split=None):
         # words in the English Wikipedia.).
         if self.tokenize_chinese_chars:
             text = self._tokenize_chinese_chars(text)
-        orig_tokens = whitespace_tokenize(text)
+        # prevents treating the same character with different unicode codepoints as different characters
+        unicode_normalized_text = unicodedata.normalize("NFC", text)
+        orig_tokens = whitespace_tokenize(unicode_normalized_text)
         split_tokens = []
         for token in orig_tokens:
             if token not in never_split:
@@ -410,7 +424,7 @@ def _run_strip_accents(self, text):
 
     def _run_split_on_punc(self, text, never_split=None):
         """Splits punctuation on a piece of text."""
-        if never_split is not None and text in never_split:
+        if not self.do_split_on_punc or (never_split is not None and text in never_split):
             return [text]
         chars = list(text)
         i = 0
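
Two behavioural changes land in the `BasicTokenizer` hunks above: NFC unicode normalization before whitespace splitting, and an opt-out for punctuation splitting (`do_split_on_punc`). A small self-contained sketch of why both matter; the helper below is a toy stand-in, not the library class:

```python
import unicodedata

# "é" can be a single precomposed codepoint or "e" + a combining accent.
# Without NFC normalization the two spellings produce different tokens.
composed = "caf\u00e9"
decomposed = "cafe\u0301"
assert composed != decomposed
assert unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)


def toy_basic_tokenize(text, do_split_on_punc=True):
    """Toy tokenizer mirroring the normalization + punctuation behaviour."""
    tokens = unicodedata.normalize("NFC", text).split()
    if not do_split_on_punc:
        return tokens
    out = []
    for token in tokens:
        buffer = ""
        for char in token:
            if unicodedata.category(char).startswith("P"):  # punctuation
                if buffer:
                    out.append(buffer)
                    buffer = ""
                out.append(char)
            else:
                buffer += char
        if buffer:
            out.append(buffer)
    return out


print(toy_basic_tokenize("don't stop"))                          # ['don', "'", 't', 'stop']
print(toy_basic_tokenize("don't stop", do_split_on_punc=False))  # ["don't", 'stop']
```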
diff --git a/src/transformers/models/distilbert/tokenization_distilbert_fast.py b/src/transformers/models/distilbert/tokenization_distilbert_fast.py
index dd9dcd165d4109..adb90f857d75fe 100644
--- a/src/transformers/models/distilbert/tokenization_distilbert_fast.py
+++ b/src/transformers/models/distilbert/tokenization_distilbert_fast.py
@@ -190,7 +190,7 @@ def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
         """
         output = [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
 
-        if token_ids_1:
+        if token_ids_1 is not None:
             output += token_ids_1 + [self.sep_token_id]
 
         return output
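
The `token_ids_1 is not None` change above fixes the usual truthiness pitfall: an explicitly passed but empty second sequence is falsy, so `if token_ids_1:` silently drops its separator. A hedged sketch of the corrected logic with made-up special-token ids:

```python
# Hypothetical special-token ids, for illustration only.
CLS_ID, SEP_ID = 101, 102


def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None):
    """Mirror of the fixed logic: only skip the second segment when it is absent."""
    output = [CLS_ID] + token_ids_0 + [SEP_ID]
    if token_ids_1 is not None:  # a truthiness check would also skip []
        output += token_ids_1 + [SEP_ID]
    return output


print(build_inputs_with_special_tokens([7, 8]))       # [101, 7, 8, 102]
print(build_inputs_with_special_tokens([7, 8], [9]))  # [101, 7, 8, 102, 9, 102]
print(build_inputs_with_special_tokens([7, 8], []))   # [101, 7, 8, 102, 102]
```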
diff --git a/src/transformers/models/dit/convert_dit_unilm_to_pytorch.py b/src/transformers/models/dit/convert_dit_unilm_to_pytorch.py
index d80e9d890ddc30..c754b9bbf3eac7 100644
--- a/src/transformers/models/dit/convert_dit_unilm_to_pytorch.py
+++ b/src/transformers/models/dit/convert_dit_unilm_to_pytorch.py
@@ -24,7 +24,7 @@
 from huggingface_hub import hf_hub_download
 from PIL import Image
 
-from transformers import BeitConfig, BeitFeatureExtractor, BeitForImageClassification, BeitForMaskedImageModeling
+from transformers import BeitConfig, BeitForImageClassification, BeitForMaskedImageModeling, BeitImageProcessor
 from transformers.image_utils import PILImageResampling
 from transformers.utils import logging
 
@@ -171,12 +171,12 @@ def convert_dit_checkpoint(checkpoint_url, pytorch_dump_folder_path, push_to_hub
     model.load_state_dict(state_dict)
 
     # Check outputs on an image
-    feature_extractor = BeitFeatureExtractor(
+    image_processor = BeitImageProcessor(
         size=config.image_size, resample=PILImageResampling.BILINEAR, do_center_crop=False
     )
     image = prepare_img()
 
-    encoding = feature_extractor(images=image, return_tensors="pt")
+    encoding = image_processor(images=image, return_tensors="pt")
     pixel_values = encoding["pixel_values"]
 
     outputs = model(pixel_values)
@@ -189,18 +189,18 @@ def convert_dit_checkpoint(checkpoint_url, pytorch_dump_folder_path, push_to_hub
     Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
     print(f"Saving model to {pytorch_dump_folder_path}")
     model.save_pretrained(pytorch_dump_folder_path)
-    print(f"Saving feature extractor to {pytorch_dump_folder_path}")
-    feature_extractor.save_pretrained(pytorch_dump_folder_path)
+    print(f"Saving image processor to {pytorch_dump_folder_path}")
+    image_processor.save_pretrained(pytorch_dump_folder_path)
 
     if push_to_hub:
         if has_lm_head:
             model_name = "dit-base" if "base" in checkpoint_url else "dit-large"
         else:
             model_name = "dit-base-finetuned-rvlcdip" if "dit-b" in checkpoint_url else "dit-large-finetuned-rvlcdip"
-        feature_extractor.push_to_hub(
+        image_processor.push_to_hub(
             repo_path_or_name=Path(pytorch_dump_folder_path, model_name),
             organization="nielsr",
-            commit_message="Add feature extractor",
+            commit_message="Add image processor",
             use_temp_dir=True,
         )
         model.push_to_hub(
diff --git a/src/transformers/models/donut/configuration_donut_swin.py b/src/transformers/models/donut/configuration_donut_swin.py
index 059016dafef949..9de3181b55bc3a 100644
--- a/src/transformers/models/donut/configuration_donut_swin.py
+++ b/src/transformers/models/donut/configuration_donut_swin.py
@@ -45,15 +45,15 @@ class DonutSwinConfig(PretrainedConfig):
             The number of input channels.
         embed_dim (`int`, *optional*, defaults to 96):
             Dimensionality of patch embedding.
-        depths (`list(int)`, *optional*, defaults to [2, 2, 6, 2]):
+        depths (`list(int)`, *optional*, defaults to `[2, 2, 6, 2]`):
             Depth of each layer in the Transformer encoder.
-        num_heads (`list(int)`, *optional*, defaults to [3, 6, 12, 24]):
+        num_heads (`list(int)`, *optional*, defaults to `[3, 6, 12, 24]`):
             Number of attention heads in each layer of the Transformer encoder.
         window_size (`int`, *optional*, defaults to 7):
             Size of windows.
         mlp_ratio (`float`, *optional*, defaults to 4.0):
             Ratio of MLP hidden dimensionality to embedding dimensionality.
-        qkv_bias (`bool`, *optional*, defaults to True):
+        qkv_bias (`bool`, *optional*, defaults to `True`):
             Whether or not a learnable bias should be added to the queries, keys and values.
         hidden_dropout_prob (`float`, *optional*, defaults to 0.0):
             The dropout probability for all fully connected layers in the embeddings and encoder.
@@ -64,11 +64,11 @@ class DonutSwinConfig(PretrainedConfig):
         hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
             The non-linear activation function (function or string) in the encoder. If string, `"gelu"`, `"relu"`,
             `"selu"` and `"gelu_new"` are supported.
-        use_absolute_embeddings (`bool`, *optional*, defaults to False):
+        use_absolute_embeddings (`bool`, *optional*, defaults to `False`):
             Whether or not to add absolute position embeddings to the patch embeddings.
         initializer_range (`float`, *optional*, defaults to 0.02):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
-        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
+        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
             The epsilon used by the layer normalization layers.
 
     Example:
@@ -85,6 +85,7 @@ class DonutSwinConfig(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "donut-swin"
 
     attribute_map = {
diff --git a/src/transformers/models/donut/convert_donut_to_pytorch.py b/src/transformers/models/donut/convert_donut_to_pytorch.py
index cd0f43773567f1..13f669ad97fdcc 100644
--- a/src/transformers/models/donut/convert_donut_to_pytorch.py
+++ b/src/transformers/models/donut/convert_donut_to_pytorch.py
@@ -21,7 +21,7 @@
 from donut import DonutModel
 
 from transformers import (
-    DonutFeatureExtractor,
+    DonutImageProcessor,
     DonutProcessor,
     DonutSwinConfig,
     DonutSwinModel,
@@ -152,10 +152,10 @@ def convert_donut_checkpoint(model_name, pytorch_dump_folder_path=None, push_to_
     image = dataset["test"][0]["image"].convert("RGB")
 
     tokenizer = XLMRobertaTokenizerFast.from_pretrained(model_name, from_slow=True)
-    feature_extractor = DonutFeatureExtractor(
+    image_processor = DonutImageProcessor(
         do_align_long_axis=original_model.config.align_long_axis, size=original_model.config.input_size[::-1]
     )
-    processor = DonutProcessor(feature_extractor, tokenizer)
+    processor = DonutProcessor(image_processor, tokenizer)
     pixel_values = processor(image, return_tensors="pt").pixel_values
 
     if model_name == "naver-clova-ix/donut-base-finetuned-docvqa":
diff --git a/src/transformers/models/donut/image_processing_donut.py b/src/transformers/models/donut/image_processing_donut.py
index 325a2bb9b60215..a17593316248ac 100644
--- a/src/transformers/models/donut/image_processing_donut.py
+++ b/src/transformers/models/donut/image_processing_donut.py
@@ -21,9 +21,7 @@
 from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
 from ...image_transforms import (
     get_resize_output_image_size,
-    normalize,
     pad,
-    rescale,
     resize,
     to_channel_dimension_format,
 )
@@ -34,9 +32,12 @@
     ImageInput,
     PILImageResampling,
     get_image_size,
+    infer_channel_dimension_format,
+    is_scaled_image,
     make_list_of_images,
     to_numpy_array,
     valid_images,
+    validate_preprocess_arguments,
 )
 from ...utils import TensorType, logging
 from ...utils.import_utils import is_vision_available
@@ -61,30 +62,29 @@ class DonutImageProcessor(BaseImageProcessor):
             Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with
             the longest edge resized to keep the input aspect ratio. Can be overridden by `size` in the `preprocess`
             method.
-        resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BILINEAR`):
+        resample (`PILImageResampling`, *optional*, defaults to `Resampling.BILINEAR`):
             Resampling filter to use if resizing the image. Can be overridden by `resample` in the `preprocess` method.
-        do_center_crop (`bool`, *optional*, defaults to `True`):
-            Whether to center crop the image to the specified `crop_size`. Can be overridden by `do_center_crop` in the
-            `preprocess` method.
-        crop_size (`Dict[str, int]` *optional*, defaults to 224):
-            Size of the output image after applying `center_crop`. Can be overridden by `crop_size` in the `preprocess`
-            method.
+        do_thumbnail (`bool`, *optional*, defaults to `True`):
+            Whether to resize the image using thumbnail method.
+        do_align_long_axis (`bool`, *optional*, defaults to `False`):
+            Whether to align the long axis of the image with the long axis of `size` by rotating by 90 degrees.
+        do_pad (`bool`, *optional*, defaults to `True`):
+            Whether to pad the image. If `random_padding` is set to `True` in `preprocess`, each image is padded with a
+            random amount of padding on each side, up to the largest image size in the batch. Otherwise, all images are
+            padded to the largest image size in the batch.
         do_rescale (`bool`, *optional*, defaults to `True`):
             Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by `do_rescale` in
             the `preprocess` method.
         rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
             Scale factor to use if rescaling the image. Can be overridden by `rescale_factor` in the `preprocess`
             method.
-        do_normalize:
+        do_normalize (`bool`, *optional*, defaults to `True`):
             Whether to normalize the image. Can be overridden by `do_normalize` in the `preprocess` method.
         image_mean (`float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_MEAN`):
             Mean to use if normalizing the image. This is a float or list of floats the length of the number of
             channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
         image_std (`float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_STD`):
             Image standard deviation.
-        do_convert_rgb (`bool`, *optional*, defaults to `True`):
-            Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
-            number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
     """
 
     model_input_names = ["pixel_values"]
@@ -125,7 +125,11 @@ def __init__(
         self.image_std = image_std if image_std is not None else IMAGENET_STANDARD_STD
 
     def align_long_axis(
-        self, image: np.ndarray, size: Dict[str, int], data_format: Optional[Union[str, ChannelDimension]] = None
+        self,
+        image: np.ndarray,
+        size: Dict[str, int],
+        data_format: Optional[Union[str, ChannelDimension]] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
     ) -> np.ndarray:
         """
         Align the long axis of the image to the longest axis of the specified size.
@@ -135,11 +139,15 @@ def align_long_axis(
                 The image to be aligned.
             size (`Dict[str, int]`):
                 The size `{"height": h, "width": w}` to align the long axis to.
+            data_format (`str` or `ChannelDimension`, *optional*):
+                The data format of the output image. If unset, the same format as the input image is used.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format of the input image. If not provided, it will be inferred.
 
         Returns:
             `np.ndarray`: The aligned image.
         """
-        input_height, input_width = get_image_size(image)
+        input_height, input_width = get_image_size(image, channel_dim=input_data_format)
         output_height, output_width = size["height"], size["width"]
 
         if (output_width < output_height and input_width > input_height) or (
@@ -148,22 +156,17 @@ def align_long_axis(
             image = np.rot90(image, 3)
 
         if data_format is not None:
-            image = to_channel_dimension_format(image, data_format)
+            image = to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
 
         return image
 
-    def rotate_image(self, *args, **kwargs):
-        logger.info(
-            "rotate_image is deprecated and will be removed in version 4.27. Please use align_long_axis instead."
-        )
-        return self.align_long_axis(*args, **kwargs)
-
     def pad_image(
         self,
         image: np.ndarray,
         size: Dict[str, int],
         random_padding: bool = False,
         data_format: Optional[Union[str, ChannelDimension]] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
     ) -> np.ndarray:
         """
         Pad the image to the specified size.
@@ -177,9 +180,11 @@ def pad_image(
                 Whether to use random padding or not.
             data_format (`str` or `ChannelDimension`, *optional*):
                 The data format of the output image. If unset, the same format as the input image is used.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format of the input image. If not provided, it will be inferred.
         """
         output_height, output_width = size["height"], size["width"]
-        input_height, input_width = get_image_size(image)
+        input_height, input_width = get_image_size(image, channel_dim=input_data_format)
 
         delta_width = output_width - input_width
         delta_height = output_height - input_height
@@ -195,7 +200,7 @@ def pad_image(
         pad_right = delta_width - pad_left
 
         padding = ((pad_top, pad_bottom), (pad_left, pad_right))
-        return pad(image, padding, data_format=data_format)
+        return pad(image, padding, data_format=data_format, input_data_format=input_data_format)
 
     def pad(self, *args, **kwargs):
         logger.info("pad is deprecated and will be removed in version 4.27. Please use pad_image instead.")
@@ -207,6 +212,7 @@ def thumbnail(
         size: Dict[str, int],
         resample: PILImageResampling = PILImageResampling.BICUBIC,
         data_format: Optional[Union[str, ChannelDimension]] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
         **kwargs,
     ) -> np.ndarray:
         """
@@ -222,8 +228,10 @@ def thumbnail(
                 The resampling filter to use.
             data_format (`Optional[Union[str, ChannelDimension]]`, *optional*):
                 The data format of the output image. If unset, the same format as the input image is used.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format of the input image. If not provided, it will be inferred.
         """
-        input_height, input_width = get_image_size(image)
+        input_height, input_width = get_image_size(image, channel_dim=input_data_format)
         output_height, output_width = size["height"], size["width"]
 
         # We always resize to the smallest of either the input or output size.
@@ -239,7 +247,13 @@ def thumbnail(
             height = int(input_height * width / input_width)
 
         return resize(
-            image, size=(height, width), resample=resample, reducing_gap=2.0, data_format=data_format, **kwargs
+            image,
+            size=(height, width),
+            resample=resample,
+            reducing_gap=2.0,
+            data_format=data_format,
+            input_data_format=input_data_format,
+            **kwargs,
         )
 
     def resize(
@@ -248,11 +262,11 @@ def resize(
         size: Dict[str, int],
         resample: PILImageResampling = PILImageResampling.BICUBIC,
         data_format: Optional[Union[str, ChannelDimension]] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
         **kwargs,
     ) -> np.ndarray:
         """
-        Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge
-        resized to keep the input aspect ratio.
+        Resizes `image` to `(height, width)` specified by `size` using the PIL library.
 
         Args:
             image (`np.ndarray`):
@@ -263,56 +277,24 @@ def resize(
                 Resampling filter to use when resizing the image.
             data_format (`str` or `ChannelDimension`, *optional*):
                 The channel dimension format of the image. If not provided, it will be the same as the input image.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format of the input image. If not provided, it will be inferred.
         """
         size = get_size_dict(size)
         shortest_edge = min(size["height"], size["width"])
-        output_size = get_resize_output_image_size(image, size=shortest_edge, default_to_square=False)
-        resized_image = resize(image, size=output_size, resample=resample, data_format=data_format, **kwargs)
+        output_size = get_resize_output_image_size(
+            image, size=shortest_edge, default_to_square=False, input_data_format=input_data_format
+        )
+        resized_image = resize(
+            image,
+            size=output_size,
+            resample=resample,
+            data_format=data_format,
+            input_data_format=input_data_format,
+            **kwargs,
+        )
         return resized_image
 
-    def rescale(
-        self,
-        image: np.ndarray,
-        scale: Union[int, float],
-        data_format: Optional[Union[str, ChannelDimension]] = None,
-        **kwargs,
-    ):
-        """
-        Rescale an image by a scale factor. image = image * scale.
-
-        Args:
-            image (`np.ndarray`):
-                Image to rescale.
-            scale (`int` or `float`):
-                Scale to apply to the image.
-            data_format (`str` or `ChannelDimension`, *optional*):
-                The channel dimension format of the image. If not provided, it will be the same as the input image.
-        """
-        return rescale(image, scale=scale, data_format=data_format, **kwargs)
-
-    def normalize(
-        self,
-        image: np.ndarray,
-        mean: Union[float, List[float]],
-        std: Union[float, List[float]],
-        data_format: Optional[Union[str, ChannelDimension]] = None,
-        **kwargs,
-    ) -> np.ndarray:
-        """
-        Normalize an image. image = (image - image_mean) / image_std.
-
-        Args:
-            image (`np.ndarray`):
-                Image to normalize.
-            image_mean (`float` or `List[float]`):
-                Image mean.
-            image_std (`float` or `List[float]`):
-                Image standard deviation.
-            data_format (`str` or `ChannelDimension`, *optional*):
-                The channel dimension format of the image. If not provided, it will be the same as the input image.
-        """
-        return normalize(image, mean=mean, std=std, data_format=data_format, **kwargs)
-
     def preprocess(
         self,
         images: ImageInput,
@@ -330,6 +312,7 @@ def preprocess(
         image_std: Optional[Union[float, List[float]]] = None,
         return_tensors: Optional[Union[str, TensorType]] = None,
         data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
         **kwargs,
     ) -> PIL.Image.Image:
         """
@@ -337,7 +320,8 @@ def preprocess(
 
         Args:
             images (`ImageInput`):
-                Image to preprocess.
+                Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
+                passing in images with pixel values between 0 and 1, set `do_rescale=False`.
             do_resize (`bool`, *optional*, defaults to `self.do_resize`):
                 Whether to resize the image.
             size (`Dict[str, int]`, *optional*, defaults to `self.size`):
@@ -379,6 +363,12 @@ def preprocess(
                 - `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                 - `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                 - Unset: defaults to the channel dimension format of the input image.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format for the input image. If unset, the channel dimension format is inferred
+                from the input image. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
         """
         do_resize = do_resize if do_resize is not None else self.do_resize
         size = size if size is not None else self.size
@@ -403,41 +393,67 @@ def preprocess(
                 "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
                 "torch.Tensor, tf.Tensor or jax.ndarray."
             )
-
-        if do_resize and size is None:
-            raise ValueError("Size must be specified if do_resize is True.")
-
-        if do_rescale and rescale_factor is None:
-            raise ValueError("Rescale factor must be specified if do_rescale is True.")
-
-        if do_pad and size is None:
-            raise ValueError("Size must be specified if do_pad is True.")
-
-        if do_normalize and (image_mean is None or image_std is None):
-            raise ValueError("Image mean and std must be specified if do_normalize is True.")
+        validate_preprocess_arguments(
+            do_rescale=do_rescale,
+            rescale_factor=rescale_factor,
+            do_normalize=do_normalize,
+            image_mean=image_mean,
+            image_std=image_std,
+            do_pad=do_pad,
+            size_divisibility=size,  # There is no pad divisibility in this processor, but pad requires the size arg.
+            do_resize=do_resize,
+            size=size,
+            resample=resample,
+        )
 
         # All transformations expect numpy arrays.
         images = [to_numpy_array(image) for image in images]
 
+        if is_scaled_image(images[0]) and do_rescale:
+            logger.warning_once(
+                "It looks like you are trying to rescale already rescaled images. If the input"
+                " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
+            )
+
+        if input_data_format is None:
+            # We assume that all images have the same channel dimension format.
+            input_data_format = infer_channel_dimension_format(images[0])
+
         if do_align_long_axis:
-            images = [self.align_long_axis(image, size=size) for image in images]
+            images = [self.align_long_axis(image, size=size, input_data_format=input_data_format) for image in images]
 
         if do_resize:
-            images = [self.resize(image=image, size=size, resample=resample) for image in images]
+            images = [
+                self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format)
+                for image in images
+            ]
 
         if do_thumbnail:
-            images = [self.thumbnail(image=image, size=size) for image in images]
+            images = [self.thumbnail(image=image, size=size, input_data_format=input_data_format) for image in images]
 
         if do_pad:
-            images = [self.pad_image(image=image, size=size, random_padding=random_padding) for image in images]
+            images = [
+                self.pad_image(
+                    image=image, size=size, random_padding=random_padding, input_data_format=input_data_format
+                )
+                for image in images
+            ]
 
         if do_rescale:
-            images = [self.rescale(image=image, scale=rescale_factor) for image in images]
+            images = [
+                self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
+                for image in images
+            ]
 
         if do_normalize:
-            images = [self.normalize(image=image, mean=image_mean, std=image_std) for image in images]
-
-        images = [to_channel_dimension_format(image, data_format) for image in images]
+            images = [
+                self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
+                for image in images
+            ]
+
+        images = [
+            to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
+        ]
 
         data = {"pixel_values": images}
         return BatchFeature(data=data, tensor_type=return_tensors)
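
With the changes above, `preprocess` infers the channel dimension once, threads `input_data_format` through every transform, and warns when images already in the [0, 1] range would be rescaled again. A hedged usage sketch; the target size here is arbitrary, and real Donut checkpoints ship their own `preprocessor_config.json`:

```python
import numpy as np
from transformers import DonutImageProcessor

# Small target size just to keep the example light.
image_processor = DonutImageProcessor(size={"height": 224, "width": 224})

# A channels-last float image that is already scaled to [0, 1].
image = np.random.rand(180, 260, 3).astype(np.float32)

# Passing input_data_format skips the channel-dimension inference, and
# do_rescale=False avoids the new "already rescaled" warning.
batch = image_processor(
    images=image,
    do_rescale=False,
    input_data_format="channels_last",
    return_tensors="np",
)
print(batch["pixel_values"].shape)  # (1, 3, 224, 224) after resize/thumbnail/pad/normalize
```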
diff --git a/src/transformers/models/donut/modeling_donut_swin.py b/src/transformers/models/donut/modeling_donut_swin.py
index bb9e863a36c02a..ed79b8ef8ec85a 100644
--- a/src/transformers/models/donut/modeling_donut_swin.py
+++ b/src/transformers/models/donut/modeling_donut_swin.py
@@ -83,9 +83,9 @@ class DonutSwinEncoderOutput(ModelOutput):
     """
 
     last_hidden_state: torch.FloatTensor = None
-    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
-    attentions: Optional[Tuple[torch.FloatTensor]] = None
-    reshaped_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
+    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
+    reshaped_hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
 
 
 @dataclass
@@ -120,9 +120,9 @@ class DonutSwinModelOutput(ModelOutput):
 
     last_hidden_state: torch.FloatTensor = None
     pooler_output: Optional[torch.FloatTensor] = None
-    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
-    attentions: Optional[Tuple[torch.FloatTensor]] = None
-    reshaped_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
+    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
+    reshaped_hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
 
 
 # Copied from transformers.models.swin.modeling_swin.window_partition
@@ -295,8 +295,8 @@ def forward(self, input_feature: torch.Tensor, input_dimensions: Tuple[int, int]
         return input_feature
 
 
-# Copied from transformers.models.swin.modeling_swin.drop_path
-def drop_path(input, drop_prob=0.0, training=False, scale_by_keep=True):
+# Copied from transformers.models.beit.modeling_beit.drop_path
+def drop_path(input: torch.Tensor, drop_prob: float = 0.0, training: bool = False) -> torch.Tensor:
     """
     Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
 
@@ -749,15 +749,13 @@ def forward(
             layer_head_mask = head_mask[i] if head_mask is not None else None
 
             if self.gradient_checkpointing and self.training:
-
-                def create_custom_forward(module):
-                    def custom_forward(*inputs):
-                        return module(*inputs, output_attentions)
-
-                    return custom_forward
-
-                layer_outputs = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(layer_module), hidden_states, input_dimensions, layer_head_mask
+                layer_outputs = self._gradient_checkpointing_func(
+                    layer_module.__call__,
+                    hidden_states,
+                    input_dimensions,
+                    layer_head_mask,
+                    output_attentions,
+                    always_partition,
                 )
             else:
                 layer_outputs = layer_module(
@@ -826,10 +824,6 @@ def _init_weights(self, module):
             module.bias.data.zero_()
             module.weight.data.fill_(1.0)
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, DonutSwinEncoder):
-            module.gradient_checkpointing = value
-
 
 SWIN_START_DOCSTRING = r"""
     This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) sub-class. Use
@@ -911,6 +905,10 @@ def forward(
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
     ) -> Union[Tuple, DonutSwinModelOutput]:
+        r"""
+        bool_masked_pos (`torch.BoolTensor` of shape `(batch_size, num_patches)`):
+            Boolean masked positions. Indicates which patches are masked (1) and which aren't (0).
+        """
         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
         output_hidden_states = (
             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
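
The encoder hunk above replaces the hand-rolled `create_custom_forward` closure with the shared `self._gradient_checkpointing_func` helper, which forwards the layer call and its arguments positionally. As a standalone, hedged illustration of why the two are equivalent (a toy module, not the Swin layer):

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class ToyBlock(nn.Module):
    """Stand-in for a transformer layer, only used to show the call pattern."""

    def __init__(self, dim=16):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, hidden_states, scale=1.0):
        return torch.relu(self.proj(hidden_states)) * scale


block = ToyBlock()
x = torch.randn(2, 16, requires_grad=True)


# Old style: wrap the module in a closure that captures the extra argument.
def create_custom_forward(module):
    def custom_forward(*inputs):
        return module(*inputs, 2.0)

    return custom_forward


out_old = checkpoint(create_custom_forward(block), x, use_reentrant=False)

# New style: pass module.__call__ plus all arguments positionally, which is
# essentially what the shared _gradient_checkpointing_func helper does.
out_new = checkpoint(block.__call__, x, 2.0, use_reentrant=False)

assert torch.allclose(out_old, out_new)
```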
diff --git a/src/transformers/models/donut/processing_donut.py b/src/transformers/models/donut/processing_donut.py
index 87f2dd34f90460..5636ecb9435cf3 100644
--- a/src/transformers/models/donut/processing_donut.py
+++ b/src/transformers/models/donut/processing_donut.py
@@ -32,16 +32,18 @@ class DonutProcessor(ProcessorMixin):
     [`~DonutProcessor.decode`] for more information.
 
     Args:
-        image_processor ([`DonutImageProcessor`]):
+        image_processor ([`DonutImageProcessor`], *optional*):
             An instance of [`DonutImageProcessor`]. The image processor is a required input.
-        tokenizer ([`XLMRobertaTokenizer`/`XLMRobertaTokenizerFast`]):
+        tokenizer ([`XLMRobertaTokenizer`/`XLMRobertaTokenizerFast`], *optional*):
             An instance of [`XLMRobertaTokenizer`/`XLMRobertaTokenizerFast`]. The tokenizer is a required input.
     """
+
     attributes = ["image_processor", "tokenizer"]
     image_processor_class = "AutoImageProcessor"
     tokenizer_class = "AutoTokenizer"
 
     def __init__(self, image_processor=None, tokenizer=None, **kwargs):
+        feature_extractor = None
         if "feature_extractor" in kwargs:
             warnings.warn(
                 "The `feature_extractor` argument is deprecated and will be removed in v5, use `image_processor`"
@@ -130,14 +132,16 @@ def token2json(self, tokens, is_inner_value=False, added_vocab=None):
         if added_vocab is None:
             added_vocab = self.tokenizer.get_added_vocab()
 
-        output = dict()
+        output = {}
 
         while tokens:
             start_token = re.search(r"", tokens, re.IGNORECASE)
             if start_token is None:
                 break
             key = start_token.group(1)
-            end_token = re.search(rf"", tokens, re.IGNORECASE)
+            key_escaped = re.escape(key)
+
+            end_token = re.search(rf"", tokens, re.IGNORECASE)
             start_token = start_token.group()
             if end_token is None:
                 tokens = tokens.replace(start_token, "")
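
The `re.escape(key)` change above matters whenever a field name extracted from the opening tag contains regex metacharacters; without escaping, the closing-tag pattern stops matching. A minimal, hedged reproduction outside the processor class (the tag format mirrors Donut's `<s_...>`/`</s_...>` convention):

```python
import re

tokens = "<s_total(usd)>42</s_total(usd)>"

start_token = re.search(r"<s_(.*?)>", tokens, re.IGNORECASE)
key = start_token.group(1)  # "total(usd)" contains regex metacharacters

# Unescaped, "(usd)" becomes a capture group, so the literal closing tag is missed.
unescaped = re.search(rf"</s_{key}>", tokens, re.IGNORECASE)
escaped = re.search(rf"</s_{re.escape(key)}>", tokens, re.IGNORECASE)

print(unescaped)        # None
print(escaped.group())  # </s_total(usd)>
```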
diff --git a/src/transformers/models/dpr/configuration_dpr.py b/src/transformers/models/dpr/configuration_dpr.py
index 5551883e09645e..3b6785c6b540f5 100644
--- a/src/transformers/models/dpr/configuration_dpr.py
+++ b/src/transformers/models/dpr/configuration_dpr.py
@@ -83,6 +83,8 @@ class DPRConfig(PretrainedConfig):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
         layer_norm_eps (`float`, *optional*, defaults to 1e-12):
             The epsilon used by the layer normalization layers.
+        pad_token_id (`int`, *optional*, defaults to 0):
+            Padding token id.
         position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
             Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For
             positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to
@@ -107,6 +109,7 @@ class DPRConfig(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "dpr"
 
     def __init__(
diff --git a/src/transformers/models/dpr/convert_dpr_original_checkpoint_to_pytorch.py b/src/transformers/models/dpr/convert_dpr_original_checkpoint_to_pytorch.py
index 6ea85620242f06..c11345d1eb4e46 100644
--- a/src/transformers/models/dpr/convert_dpr_original_checkpoint_to_pytorch.py
+++ b/src/transformers/models/dpr/convert_dpr_original_checkpoint_to_pytorch.py
@@ -19,7 +19,7 @@
 import torch
 from torch.serialization import default_restore_location
 
-from .transformers import BertConfig, DPRConfig, DPRContextEncoder, DPRQuestionEncoder, DPRReader
+from transformers import BertConfig, DPRConfig, DPRContextEncoder, DPRQuestionEncoder, DPRReader
 
 
 CheckpointState = collections.namedtuple(
@@ -54,7 +54,7 @@ def from_type(comp_type: str, *args, **kwargs) -> "DPRState":
 
 class DPRContextEncoderState(DPRState):
     def load_dpr_model(self):
-        model = DPRContextEncoder(DPRConfig(**BertConfig.get_config_dict("bert-base-uncased")[0]))
+        model = DPRContextEncoder(DPRConfig(**BertConfig.get_config_dict("google-bert/bert-base-uncased")[0]))
         print(f"Loading DPR biencoder from {self.src_file}")
         saved_state = load_states_from_checkpoint(self.src_file)
         encoder, prefix = model.ctx_encoder, "ctx_model."
@@ -72,7 +72,7 @@ def load_dpr_model(self):
 
 class DPRQuestionEncoderState(DPRState):
     def load_dpr_model(self):
-        model = DPRQuestionEncoder(DPRConfig(**BertConfig.get_config_dict("bert-base-uncased")[0]))
+        model = DPRQuestionEncoder(DPRConfig(**BertConfig.get_config_dict("google-bert/bert-base-uncased")[0]))
         print(f"Loading DPR biencoder from {self.src_file}")
         saved_state = load_states_from_checkpoint(self.src_file)
         encoder, prefix = model.question_encoder, "question_model."
@@ -90,7 +90,7 @@ def load_dpr_model(self):
 
 class DPRReaderState(DPRState):
     def load_dpr_model(self):
-        model = DPRReader(DPRConfig(**BertConfig.get_config_dict("bert-base-uncased")[0]))
+        model = DPRReader(DPRConfig(**BertConfig.get_config_dict("google-bert/bert-base-uncased")[0]))
         print(f"Loading DPR reader from {self.src_file}")
         saved_state = load_states_from_checkpoint(self.src_file)
         # Fix changes from https://github.com/huggingface/transformers/commit/614fef1691edb806de976756d4948ecbcd0c0ca3
diff --git a/src/transformers/models/dpr/modeling_dpr.py b/src/transformers/models/dpr/modeling_dpr.py
index a551e507300b0d..1071a42d810076 100644
--- a/src/transformers/models/dpr/modeling_dpr.py
+++ b/src/transformers/models/dpr/modeling_dpr.py
@@ -30,7 +30,7 @@
     logging,
     replace_return_docstrings,
 )
-from ..bert.modeling_bert import BertEncoder, BertModel
+from ..bert.modeling_bert import BertModel
 from .configuration_dpr import DPRConfig
 
 
@@ -82,8 +82,8 @@ class DPRContextEncoderOutput(ModelOutput):
     """
 
     pooler_output: torch.FloatTensor
-    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
-    attentions: Optional[Tuple[torch.FloatTensor]] = None
+    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
+    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
 
 
 @dataclass
@@ -110,8 +110,8 @@ class DPRQuestionEncoderOutput(ModelOutput):
     """
 
     pooler_output: torch.FloatTensor
-    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
-    attentions: Optional[Tuple[torch.FloatTensor]] = None
+    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
+    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
 
 
 @dataclass
@@ -143,8 +143,8 @@ class DPRReaderOutput(ModelOutput):
     start_logits: torch.FloatTensor
     end_logits: torch.FloatTensor = None
     relevance_logits: torch.FloatTensor = None
-    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
-    attentions: Optional[Tuple[torch.FloatTensor]] = None
+    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
+    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
 
 
 class DPRPreTrainedModel(PreTrainedModel):
@@ -164,10 +164,6 @@ def _init_weights(self, module):
             module.bias.data.zero_()
             module.weight.data.fill_(1.0)
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, BertEncoder):
-            module.gradient_checkpointing = value
-
 
 class DPREncoder(DPRPreTrainedModel):
     base_model_prefix = "bert_model"
@@ -296,8 +292,6 @@ class DPRPretrainedContextEncoder(DPRPreTrainedModel):
     config_class = DPRConfig
     load_tf_weights = None
     base_model_prefix = "ctx_encoder"
-    _keys_to_ignore_on_load_missing = [r"position_ids"]
-    _keys_to_ignore_on_load_unexpected = [r"pooler"]
 
 
 class DPRPretrainedQuestionEncoder(DPRPreTrainedModel):
@@ -309,8 +303,6 @@ class DPRPretrainedQuestionEncoder(DPRPreTrainedModel):
     config_class = DPRConfig
     load_tf_weights = None
     base_model_prefix = "question_encoder"
-    _keys_to_ignore_on_load_missing = [r"position_ids"]
-    _keys_to_ignore_on_load_unexpected = [r"pooler"]
 
 
 class DPRPretrainedReader(DPRPreTrainedModel):
@@ -322,7 +314,6 @@ class DPRPretrainedReader(DPRPreTrainedModel):
     config_class = DPRConfig
     load_tf_weights = None
     base_model_prefix = "span_predictor"
-    _keys_to_ignore_on_load_missing = [r"position_ids"]
 
 
 ###############
@@ -459,9 +450,9 @@ def forward(
         attention_mask: Optional[Tensor] = None,
         token_type_ids: Optional[Tensor] = None,
         inputs_embeds: Optional[Tensor] = None,
-        output_attentions=None,
-        output_hidden_states=None,
-        return_dict=None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
     ) -> Union[DPRContextEncoderOutput, Tuple[Tensor, ...]]:
         r"""
         Return:
@@ -540,9 +531,9 @@ def forward(
         attention_mask: Optional[Tensor] = None,
         token_type_ids: Optional[Tensor] = None,
         inputs_embeds: Optional[Tensor] = None,
-        output_attentions=None,
-        output_hidden_states=None,
-        return_dict=None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
     ) -> Union[DPRQuestionEncoderOutput, Tuple[Tensor, ...]]:
         r"""
         Return:
@@ -567,6 +558,7 @@ def forward(
         if input_ids is not None and inputs_embeds is not None:
             raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
         elif input_ids is not None:
+            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
             input_shape = input_ids.size()
         elif inputs_embeds is not None:
             input_shape = inputs_embeds.size()[:-1]
@@ -620,9 +612,9 @@ def forward(
         input_ids: Optional[Tensor] = None,
         attention_mask: Optional[Tensor] = None,
         inputs_embeds: Optional[Tensor] = None,
-        output_attentions: bool = None,
-        output_hidden_states: bool = None,
-        return_dict=None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
     ) -> Union[DPRReaderOutput, Tuple[Tensor, ...]]:
         r"""
         Return:
@@ -655,6 +647,7 @@ def forward(
         if input_ids is not None and inputs_embeds is not None:
             raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
         elif input_ids is not None:
+            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
             input_shape = input_ids.size()
         elif inputs_embeds is not None:
             input_shape = inputs_embeds.size()[:-1]
diff --git a/src/transformers/models/dpr/modeling_tf_dpr.py b/src/transformers/models/dpr/modeling_tf_dpr.py
index 565ad37b2117e8..0a6aa47640d03c 100644
--- a/src/transformers/models/dpr/modeling_tf_dpr.py
+++ b/src/transformers/models/dpr/modeling_tf_dpr.py
@@ -15,13 +15,15 @@
 
 """ TensorFlow DPR model for Open Domain Question Answering."""
 
+from __future__ import annotations
+
 from dataclasses import dataclass
-from typing import Optional, Tuple, Union
+from typing import Tuple, Union
 
 import tensorflow as tf
 
 from ...modeling_tf_outputs import TFBaseModelOutputWithPooling
-from ...modeling_tf_utils import TFPreTrainedModel, get_initializer, shape_list, unpack_inputs
+from ...modeling_tf_utils import TFModelInputType, TFPreTrainedModel, get_initializer, keras, shape_list, unpack_inputs
 from ...utils import (
     ModelOutput,
     add_start_docstrings,
@@ -80,8 +82,8 @@ class TFDPRContextEncoderOutput(ModelOutput):
     """
 
     pooler_output: tf.Tensor = None
-    hidden_states: Optional[Tuple[tf.Tensor]] = None
-    attentions: Optional[Tuple[tf.Tensor]] = None
+    hidden_states: Tuple[tf.Tensor, ...] | None = None
+    attentions: Tuple[tf.Tensor, ...] | None = None
 
 
 @dataclass
@@ -108,8 +110,8 @@ class TFDPRQuestionEncoderOutput(ModelOutput):
     """
 
     pooler_output: tf.Tensor = None
-    hidden_states: Optional[Tuple[tf.Tensor]] = None
-    attentions: Optional[Tuple[tf.Tensor]] = None
+    hidden_states: Tuple[tf.Tensor, ...] | None = None
+    attentions: Tuple[tf.Tensor, ...] | None = None
 
 
 @dataclass
@@ -141,11 +143,11 @@ class TFDPRReaderOutput(ModelOutput):
     start_logits: tf.Tensor = None
     end_logits: tf.Tensor = None
     relevance_logits: tf.Tensor = None
-    hidden_states: Optional[Tuple[tf.Tensor]] = None
-    attentions: Optional[Tuple[tf.Tensor]] = None
+    hidden_states: Tuple[tf.Tensor, ...] | None = None
+    attentions: Tuple[tf.Tensor, ...] | None = None
 
 
-class TFDPREncoderLayer(tf.keras.layers.Layer):
+class TFDPREncoderLayer(keras.layers.Layer):
     base_model_prefix = "bert_model"
 
     def __init__(self, config: DPRConfig, **kwargs):
@@ -159,7 +161,7 @@ def __init__(self, config: DPRConfig, **kwargs):
             raise ValueError("Encoder hidden_size can't be zero")
         self.projection_dim = config.projection_dim
         if self.projection_dim > 0:
-            self.encode_proj = tf.keras.layers.Dense(
+            self.encode_proj = keras.layers.Dense(
                 config.projection_dim, kernel_initializer=get_initializer(config.initializer_range), name="encode_proj"
             )
 
@@ -167,9 +169,9 @@ def __init__(self, config: DPRConfig, **kwargs):
     def call(
         self,
         input_ids: tf.Tensor = None,
-        attention_mask: Optional[tf.Tensor] = None,
-        token_type_ids: Optional[tf.Tensor] = None,
-        inputs_embeds: Optional[tf.Tensor] = None,
+        attention_mask: tf.Tensor | None = None,
+        token_type_ids: tf.Tensor | None = None,
+        inputs_embeds: tf.Tensor | None = None,
         output_attentions: bool = None,
         output_hidden_states: bool = None,
         return_dict: bool = None,
@@ -207,8 +209,19 @@ def embeddings_size(self) -> int:
             return self.projection_dim
         return self.bert_model.config.hidden_size
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "bert_model", None) is not None:
+            with tf.name_scope(self.bert_model.name):
+                self.bert_model.build(None)
+        if getattr(self, "encode_proj", None) is not None:
+            with tf.name_scope(self.encode_proj.name):
+                self.encode_proj.build(None)
+
 
-class TFDPRSpanPredictorLayer(tf.keras.layers.Layer):
+class TFDPRSpanPredictorLayer(keras.layers.Layer):
     base_model_prefix = "encoder"
 
     def __init__(self, config: DPRConfig, **kwargs):
@@ -216,10 +229,10 @@ def __init__(self, config: DPRConfig, **kwargs):
         self.config = config
         self.encoder = TFDPREncoderLayer(config, name="encoder")
 
-        self.qa_outputs = tf.keras.layers.Dense(
+        self.qa_outputs = keras.layers.Dense(
             2, kernel_initializer=get_initializer(config.initializer_range), name="qa_outputs"
         )
-        self.qa_classifier = tf.keras.layers.Dense(
+        self.qa_classifier = keras.layers.Dense(
             1, kernel_initializer=get_initializer(config.initializer_range), name="qa_classifier"
         )
 
@@ -227,8 +240,8 @@ def __init__(self, config: DPRConfig, **kwargs):
     def call(
         self,
         input_ids: tf.Tensor = None,
-        attention_mask: Optional[tf.Tensor] = None,
-        inputs_embeds: Optional[tf.Tensor] = None,
+        attention_mask: tf.Tensor | None = None,
+        inputs_embeds: tf.Tensor | None = None,
         output_attentions: bool = False,
         output_hidden_states: bool = False,
         return_dict: bool = False,
@@ -271,6 +284,20 @@ def call(
             attentions=outputs.attentions,
         )
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "encoder", None) is not None:
+            with tf.name_scope(self.encoder.name):
+                self.encoder.build(None)
+        if getattr(self, "qa_outputs", None) is not None:
+            with tf.name_scope(self.qa_outputs.name):
+                self.qa_outputs.build([None, None, self.encoder.embeddings_size])
+        if getattr(self, "qa_classifier", None) is not None:
+            with tf.name_scope(self.qa_classifier.name):
+                self.qa_classifier.build([None, None, self.encoder.embeddings_size])
+
 
 class TFDPRSpanPredictor(TFPreTrainedModel):
     base_model_prefix = "encoder"
@@ -283,9 +310,9 @@ def __init__(self, config: DPRConfig, **kwargs):
     def call(
         self,
         input_ids: tf.Tensor = None,
-        attention_mask: Optional[tf.Tensor] = None,
-        token_type_ids: Optional[tf.Tensor] = None,
-        inputs_embeds: Optional[tf.Tensor] = None,
+        attention_mask: tf.Tensor | None = None,
+        token_type_ids: tf.Tensor | None = None,
+        inputs_embeds: tf.Tensor | None = None,
         output_attentions: bool = False,
         output_hidden_states: bool = False,
         return_dict: bool = False,
@@ -316,9 +343,9 @@ def __init__(self, config: DPRConfig, **kwargs):
     def call(
         self,
         input_ids: tf.Tensor = None,
-        attention_mask: Optional[tf.Tensor] = None,
-        token_type_ids: Optional[tf.Tensor] = None,
-        inputs_embeds: Optional[tf.Tensor] = None,
+        attention_mask: tf.Tensor | None = None,
+        token_type_ids: tf.Tensor | None = None,
+        inputs_embeds: tf.Tensor | None = None,
         output_attentions: bool = False,
         output_hidden_states: bool = False,
         return_dict: bool = False,
@@ -370,19 +397,6 @@ class TFDPRPretrainedReader(TFPreTrainedModel):
     config_class = DPRConfig
     base_model_prefix = "reader"
 
-    @tf.function(
-        input_signature=[
-            {
-                "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"),
-                "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"),
-            }
-        ]
-    )
-    def serving(self, inputs):
-        output = self.call(inputs)
-
-        return self.serving_output(output)
-
 
 ###############
 # Actual Models
@@ -395,7 +409,7 @@ def serving(self, inputs):
     library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
     etc.)
 
-    This model is also a Tensorflow [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model)
+    This model is also a Tensorflow [keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model)
     subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to
     general usage and behavior.
 
@@ -543,7 +557,7 @@ def get_input_embeddings(self):
         try:
             return self.ctx_encoder.bert_model.get_input_embeddings()
         except AttributeError:
-            self(self.dummy_inputs)
+            self.build()
             return self.ctx_encoder.bert_model.get_input_embeddings()
 
     @unpack_inputs
@@ -551,15 +565,15 @@ def get_input_embeddings(self):
     @replace_return_docstrings(output_type=TFDPRContextEncoderOutput, config_class=_CONFIG_FOR_DOC)
     def call(
         self,
-        input_ids=None,
-        attention_mask: Optional[tf.Tensor] = None,
-        token_type_ids: Optional[tf.Tensor] = None,
-        inputs_embeds: Optional[tf.Tensor] = None,
-        output_attentions=None,
-        output_hidden_states=None,
-        return_dict=None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: tf.Tensor | None = None,
+        token_type_ids: tf.Tensor | None = None,
+        inputs_embeds: tf.Tensor | None = None,
+        output_attentions: bool | None = None,
+        output_hidden_states: bool | None = None,
+        return_dict: bool | None = None,
         training: bool = False,
-    ) -> Union[TFDPRContextEncoderOutput, Tuple[tf.Tensor, ...]]:
+    ) -> TFDPRContextEncoderOutput | Tuple[tf.Tensor, ...]:
         r"""
         Return:
 
@@ -610,11 +624,13 @@ def call(
             pooler_output=outputs.pooler_output, hidden_states=outputs.hidden_states, attentions=outputs.attentions
         )
 
-    def serving_output(self, output):
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFDPRContextEncoderOutput(pooler_output=output.pooler_output, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "ctx_encoder", None) is not None:
+            with tf.name_scope(self.ctx_encoder.name):
+                self.ctx_encoder.build(None)
 
 
 @add_start_docstrings(
@@ -630,7 +646,7 @@ def get_input_embeddings(self):
         try:
             return self.question_encoder.bert_model.get_input_embeddings()
         except AttributeError:
-            self(self.dummy_inputs)
+            self.build()
             return self.question_encoder.bert_model.get_input_embeddings()
 
     @unpack_inputs
@@ -638,15 +654,15 @@ def get_input_embeddings(self):
     @replace_return_docstrings(output_type=TFDPRQuestionEncoderOutput, config_class=_CONFIG_FOR_DOC)
     def call(
         self,
-        input_ids=None,
-        attention_mask: Optional[tf.Tensor] = None,
-        token_type_ids: Optional[tf.Tensor] = None,
-        inputs_embeds: Optional[tf.Tensor] = None,
-        output_attentions=None,
-        output_hidden_states=None,
-        return_dict=None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: tf.Tensor | None = None,
+        token_type_ids: tf.Tensor | None = None,
+        inputs_embeds: tf.Tensor | None = None,
+        output_attentions: bool | None = None,
+        output_hidden_states: bool | None = None,
+        return_dict: bool | None = None,
         training: bool = False,
-    ) -> Union[TFDPRQuestionEncoderOutput, Tuple[tf.Tensor, ...]]:
+    ) -> TFDPRQuestionEncoderOutput | Tuple[tf.Tensor, ...]:
         r"""
         Return:
 
@@ -696,11 +712,13 @@ def call(
             pooler_output=outputs.pooler_output, hidden_states=outputs.hidden_states, attentions=outputs.attentions
         )
 
-    def serving_output(self, output):
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFDPRQuestionEncoderOutput(pooler_output=output.pooler_output, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "question_encoder", None) is not None:
+            with tf.name_scope(self.question_encoder.name):
+                self.question_encoder.build(None)
 
 
 @add_start_docstrings(
@@ -716,7 +734,7 @@ def get_input_embeddings(self):
         try:
             return self.span_predictor.encoder.bert_model.get_input_embeddings()
         except AttributeError:
-            self(self.dummy_inputs)
+            self.build()
             return self.span_predictor.encoder.bert_model.get_input_embeddings()
 
     @unpack_inputs
@@ -724,14 +742,14 @@ def get_input_embeddings(self):
     @replace_return_docstrings(output_type=TFDPRReaderOutput, config_class=_CONFIG_FOR_DOC)
     def call(
         self,
-        input_ids=None,
-        attention_mask: Optional[tf.Tensor] = None,
-        inputs_embeds: Optional[tf.Tensor] = None,
-        output_attentions: bool = None,
-        output_hidden_states: bool = None,
-        return_dict=None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: tf.Tensor | None = None,
+        inputs_embeds: tf.Tensor | None = None,
+        output_attentions: bool | None = None,
+        output_hidden_states: bool | None = None,
+        return_dict: bool | None = None,
         training: bool = False,
-    ) -> Union[TFDPRReaderOutput, Tuple[tf.Tensor, ...]]:
+    ) -> TFDPRReaderOutput | Tuple[tf.Tensor, ...]:
         r"""
         Return:
 
@@ -776,14 +794,10 @@ def call(
             training=training,
         )
 
-    def serving_output(self, output):
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFDPRReaderOutput(
-            start_logits=output.start_logits,
-            end_logits=output.end_logits,
-            relevance_logits=output.relevance_logits,
-            hidden_states=hs,
-            attentions=attns,
-        )
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "span_predictor", None) is not None:
+            with tf.name_scope(self.span_predictor.name):
+                self.span_predictor.build(None)
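
The hunks above drop the `serving`/`serving_output` helpers and instead give each DPR encoder an explicit `build()` that constructs its sub-encoder under a name scope. A minimal sketch of this lazy-build pattern, using a hypothetical Keras model with a plain `Dense` sub-layer standing in for the DPR encoders:

```python
import tensorflow as tf


class TinyEncoderModel(tf.keras.Model):
    """Toy stand-in for the DPR encoders above (hypothetical, for illustration only)."""

    def __init__(self):
        super().__init__()
        self.encoder = tf.keras.layers.Dense(4, name="encoder")

    def build(self, input_shape=None):
        if self.built:
            return
        self.built = True
        # Build the sub-layer explicitly instead of tracing a forward pass on dummy inputs;
        # the name scope keeps the sub-layer's weight names consistent with a call-time build.
        if getattr(self, "encoder", None) is not None:
            with tf.name_scope(self.encoder.name):
                self.encoder.build((None, 8))


model = TinyEncoderModel()
model.build()
print(len(model.weights))  # 2: the Dense kernel and bias
```

Building the sub-layer directly avoids calling the model on dummy inputs just to create its weights, which is also why `get_input_embeddings` above now falls back to `self.build()` instead of `self(self.dummy_inputs)`.
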
diff --git a/src/transformers/models/dpr/tokenization_dpr.py b/src/transformers/models/dpr/tokenization_dpr.py
index a14133459b7e83..b2ae84addc75ef 100644
--- a/src/transformers/models/dpr/tokenization_dpr.py
+++ b/src/transformers/models/dpr/tokenization_dpr.py
@@ -316,6 +316,7 @@ def decode_best_spans(
         >>> outputs = model(**encoded_inputs)
         >>> predicted_spans = tokenizer.decode_best_spans(encoded_inputs, outputs)
         >>> print(predicted_spans[0].text)  # best span
+        a song
         ```"""
         input_ids = reader_input["input_ids"]
         start_logits, end_logits, relevance_logits = reader_output[:3]
@@ -378,11 +379,9 @@ def _get_best_spans(
             if length > max_answer_length:
                 raise ValueError(f"Span is too long: {length} > {max_answer_length}")
             if any(
-                [
-                    start_index <= prev_start_index <= prev_end_index <= end_index
-                    or prev_start_index <= start_index <= end_index <= prev_end_index
-                    for (prev_start_index, prev_end_index) in chosen_span_intervals
-                ]
+                start_index <= prev_start_index <= prev_end_index <= end_index
+                or prev_start_index <= start_index <= end_index <= prev_end_index
+                for (prev_start_index, prev_end_index) in chosen_span_intervals
             ):
                 continue
             chosen_span_intervals.append((start_index, end_index))
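
The `any(...)` check above now receives a generator expression instead of a list comprehension, so it can stop at the first previously chosen span that is nested with the candidate rather than materializing every comparison first. A small self-contained illustration with made-up intervals:

```python
chosen_span_intervals = [(0, 10)]
start_index, end_index = 2, 5

nested = any(
    start_index <= prev_start_index <= prev_end_index <= end_index
    or prev_start_index <= start_index <= end_index <= prev_end_index
    for (prev_start_index, prev_end_index) in chosen_span_intervals
)
print(nested)  # True: (2, 5) lies inside the already chosen (0, 10)
```
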
diff --git a/src/transformers/models/dpr/tokenization_dpr_fast.py b/src/transformers/models/dpr/tokenization_dpr_fast.py
index 507cd2bc40bcea..784ed1344cf6f4 100644
--- a/src/transformers/models/dpr/tokenization_dpr_fast.py
+++ b/src/transformers/models/dpr/tokenization_dpr_fast.py
@@ -316,6 +316,7 @@ def decode_best_spans(
         >>> outputs = model(**encoded_inputs)
         >>> predicted_spans = tokenizer.decode_best_spans(encoded_inputs, outputs)
         >>> print(predicted_spans[0].text)  # best span
+        a song
         ```"""
         input_ids = reader_input["input_ids"]
         start_logits, end_logits, relevance_logits = reader_output[:3]
@@ -376,11 +377,9 @@ def _get_best_spans(
             length = end_index - start_index + 1
             assert length <= max_answer_length, f"Span is too long: {length} > {max_answer_length}"
             if any(
-                [
-                    start_index <= prev_start_index <= prev_end_index <= end_index
-                    or prev_start_index <= start_index <= end_index <= prev_end_index
-                    for (prev_start_index, prev_end_index) in chosen_span_intervals
-                ]
+                start_index <= prev_start_index <= prev_end_index <= end_index
+                or prev_start_index <= start_index <= end_index <= prev_end_index
+                for (prev_start_index, prev_end_index) in chosen_span_intervals
             ):
                 continue
             chosen_span_intervals.append((start_index, end_index))
diff --git a/src/transformers/models/dpt/configuration_dpt.py b/src/transformers/models/dpt/configuration_dpt.py
index 7f2dd2e807b702..97b9e2e9a834e0 100644
--- a/src/transformers/models/dpt/configuration_dpt.py
+++ b/src/transformers/models/dpt/configuration_dpt.py
@@ -18,6 +18,7 @@
 
 from ...configuration_utils import PretrainedConfig
 from ...utils import logging
+from ..auto.configuration_auto import CONFIG_MAPPING
 from ..bit import BitConfig
 
 
@@ -52,9 +53,9 @@ class DPTConfig(PretrainedConfig):
         hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
             The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
             `"relu"`, `"selu"` and `"gelu_new"` are supported.
-        hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
-            The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
-        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
+        hidden_dropout_prob (`float`, *optional*, defaults to 0.0):
+            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
+        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
             The dropout ratio for the attention probabilities.
         initializer_range (`float`, *optional*, defaults to 0.02):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
@@ -66,6 +67,8 @@ class DPTConfig(PretrainedConfig):
             The size (resolution) of each patch.
         num_channels (`int`, *optional*, defaults to 3):
             The number of input channels.
+        is_hybrid (`bool`, *optional*, defaults to `False`):
+            Whether to use a hybrid backbone. Useful in the context of loading DPT-Hybrid models.
         qkv_bias (`bool`, *optional*, defaults to `True`):
             Whether to add a bias to the queries, keys and values.
         backbone_out_indices (`List[int]`, *optional*, defaults to `[2, 5, 8, 11]`):
@@ -79,11 +82,9 @@ class DPTConfig(PretrainedConfig):
             - "project" passes information to the other tokens by concatenating the readout to all other tokens before
               projecting the
             representation to the original feature dimension D using a linear layer followed by a GELU non-linearity.
-        is_hybrid (`bool`, *optional*, defaults to `False`):
-            Whether to use a hybrid backbone. Useful in the context of loading DPT-Hybrid models.
         reassemble_factors (`List[int]`, *optional*, defaults to `[4, 2, 1, 0.5]`):
             The up/downsampling factors of the reassemble layers.
-        neck_hidden_sizes (`List[str]`, *optional*, defaults to [96, 192, 384, 768]):
+        neck_hidden_sizes (`List[str]`, *optional*, defaults to `[96, 192, 384, 768]`):
             The hidden sizes to project to for the feature maps of the backbone.
         fusion_hidden_size (`int`, *optional*, defaults to 256):
             The number of channels before fusion.
@@ -91,6 +92,10 @@ class DPTConfig(PretrainedConfig):
             The index of the features to use in the heads.
         use_batch_norm_in_fusion_residual (`bool`, *optional*, defaults to `False`):
             Whether to use batch normalization in the pre-activate residual units of the fusion blocks.
+        use_bias_in_fusion_residual (`bool`, *optional*, defaults to `True`):
+            Whether to use bias in the pre-activate residual units of the fusion blocks.
+        add_projection (`bool`, *optional*, defaults to `False`):
+            Whether to add a projection layer before the depth estimation head.
         use_auxiliary_head (`bool`, *optional*, defaults to `True`):
             Whether to use an auxiliary head during training.
         auxiliary_loss_weight (`float`, *optional*, defaults to 0.4):
@@ -104,7 +109,20 @@ class DPTConfig(PretrainedConfig):
         neck_ignore_stages (`List[int]`, *optional*, defaults to `[0, 1]`):
             Used only for the `hybrid` embedding type. The stages of the readout layers to ignore.
         backbone_config (`Union[Dict[str, Any], PretrainedConfig]`, *optional*):
-            Used only for the `hybrid` embedding type. The configuration of the backbone in a dictionary.
+            The configuration of the backbone model. Only used when `is_hybrid` is `True` or when you want to
+            leverage the [`AutoBackbone`] API.
+        backbone (`str`, *optional*):
+            Name of backbone to use when `backbone_config` is `None`. If `use_pretrained_backbone` is `True`, this
+            will load the corresponding pretrained weights from the timm or transformers library. If `use_pretrained_backbone`
+            is `False`, this loads the backbone's config and uses that to initialize the backbone with random weights.
+        use_pretrained_backbone (`bool`, *optional*, defaults to `False`):
+            Whether to use pretrained weights for the backbone.
+        use_timm_backbone (`bool`, *optional*, defaults to `False`):
+            Whether to load `backbone` from the timm library. If `False`, the backbone is loaded from the transformers
+            library.
+        backbone_kwargs (`dict`, *optional*):
+            Keyword arguments to be passed to AutoBackbone when loading from a checkpoint,
+            e.g. `{'out_indices': (0, 1, 2, 3)}`. Cannot be specified if `backbone_config` is set.
 
     Example:
 
@@ -120,6 +138,7 @@ class DPTConfig(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "dpt"
 
     def __init__(
@@ -145,6 +164,8 @@ def __init__(
         fusion_hidden_size=256,
         head_in_index=-1,
         use_batch_norm_in_fusion_residual=False,
+        use_bias_in_fusion_residual=None,
+        add_projection=False,
         use_auxiliary_head=True,
         auxiliary_loss_weight=0.4,
         semantic_loss_ignore_index=255,
@@ -152,6 +173,10 @@ def __init__(
         backbone_featmap_shape=[1, 1024, 24, 24],
         neck_ignore_stages=[0, 1],
         backbone_config=None,
+        backbone=None,
+        use_pretrained_backbone=False,
+        use_timm_backbone=False,
+        backbone_kwargs=None,
         **kwargs,
     ):
         super().__init__(**kwargs)
@@ -159,8 +184,12 @@ def __init__(
         self.hidden_size = hidden_size
         self.is_hybrid = is_hybrid
 
+        if use_pretrained_backbone:
+            raise ValueError("Pretrained backbones are not supported yet.")
+
+        use_autobackbone = False
         if self.is_hybrid:
-            if backbone_config is None:
+            if backbone_config is None and backbone is None:
                 logger.info("Initializing the config with a `BiT` backbone.")
                 backbone_config = {
                     "global_padding": "same",
@@ -169,48 +198,74 @@ def __init__(
                     "out_features": ["stage1", "stage2", "stage3"],
                     "embedding_dynamic_padding": True,
                 }
-                self.backbone_config = BitConfig(**backbone_config)
+                backbone_config = BitConfig(**backbone_config)
             elif isinstance(backbone_config, dict):
                 logger.info("Initializing the config with a `BiT` backbone.")
-                self.backbone_config = BitConfig(**backbone_config)
+                backbone_config = BitConfig(**backbone_config)
             elif isinstance(backbone_config, PretrainedConfig):
-                self.backbone_config = backbone_config
+                pass  # backbone_config is already a PretrainedConfig instance
             else:
                 raise ValueError(
                     f"backbone_config must be a dictionary or a `PretrainedConfig`, got {backbone_config.__class__}."
                 )
-
+            self.backbone_config = backbone_config
             self.backbone_featmap_shape = backbone_featmap_shape
             self.neck_ignore_stages = neck_ignore_stages
 
             if readout_type != "project":
                 raise ValueError("Readout type must be 'project' when using `DPT-hybrid` mode.")
+
+        elif backbone_config is not None:
+            use_autobackbone = True
+
+            if isinstance(backbone_config, dict):
+                backbone_model_type = backbone_config.get("model_type")
+                config_class = CONFIG_MAPPING[backbone_model_type]
+                backbone_config = config_class.from_dict(backbone_config)
+
+            self.backbone_config = backbone_config
+            self.backbone_featmap_shape = None
+            self.neck_ignore_stages = []
         else:
-            self.backbone_config = None
+            self.backbone_config = backbone_config
             self.backbone_featmap_shape = None
             self.neck_ignore_stages = []
 
-        self.num_hidden_layers = num_hidden_layers
-        self.num_attention_heads = num_attention_heads
-        self.intermediate_size = intermediate_size
-        self.hidden_act = hidden_act
-        self.hidden_dropout_prob = hidden_dropout_prob
-        self.attention_probs_dropout_prob = attention_probs_dropout_prob
-        self.initializer_range = initializer_range
-        self.layer_norm_eps = layer_norm_eps
-        self.image_size = image_size
-        self.patch_size = patch_size
-        self.num_channels = num_channels
-        self.qkv_bias = qkv_bias
-        self.backbone_out_indices = backbone_out_indices
+        if use_autobackbone and backbone_config is not None and backbone is not None:
+            raise ValueError("You can't specify both `backbone` and `backbone_config`.")
+
+        if backbone_kwargs is not None and backbone_kwargs and backbone_config is not None:
+            raise ValueError("You can't specify both `backbone_kwargs` and `backbone_config`.")
+
+        self.backbone = backbone
+        self.use_pretrained_backbone = use_pretrained_backbone
+        self.use_timm_backbone = use_timm_backbone
+        self.backbone_kwargs = backbone_kwargs
+        self.num_hidden_layers = None if use_autobackbone else num_hidden_layers
+        self.num_attention_heads = None if use_autobackbone else num_attention_heads
+        self.intermediate_size = None if use_autobackbone else intermediate_size
+        self.hidden_dropout_prob = None if use_autobackbone else hidden_dropout_prob
+        self.attention_probs_dropout_prob = None if use_autobackbone else attention_probs_dropout_prob
+        self.layer_norm_eps = None if use_autobackbone else layer_norm_eps
+        self.image_size = None if use_autobackbone else image_size
+        self.patch_size = None if use_autobackbone else patch_size
+        self.num_channels = None if use_autobackbone else num_channels
+        self.qkv_bias = None if use_autobackbone else qkv_bias
+        self.backbone_out_indices = None if use_autobackbone else backbone_out_indices
+
         if readout_type not in ["ignore", "add", "project"]:
             raise ValueError("Readout_type must be one of ['ignore', 'add', 'project']")
+        self.hidden_act = hidden_act
+        self.initializer_range = initializer_range
         self.readout_type = readout_type
         self.reassemble_factors = reassemble_factors
         self.neck_hidden_sizes = neck_hidden_sizes
         self.fusion_hidden_size = fusion_hidden_size
         self.head_in_index = head_in_index
         self.use_batch_norm_in_fusion_residual = use_batch_norm_in_fusion_residual
+        self.use_bias_in_fusion_residual = use_bias_in_fusion_residual
+        self.add_projection = add_projection
+
         # auxiliary head attributes (semantic segmentation)
         self.use_auxiliary_head = use_auxiliary_head
         self.auxiliary_loss_weight = auxiliary_loss_weight
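
With these changes, `DPTConfig` accepts a generic `backbone_config` through the [`AutoBackbone`] path whenever `is_hybrid` is `False`, in addition to the existing BiT hybrid setup. A minimal sketch that mirrors how the DINOv2 conversion script below builds such a config for the `small` variant (`from_pretrained` fetches the backbone config from the Hub):

```python
from transformers import Dinov2Config, DPTConfig, DPTForDepthEstimation

# DINOv2-small backbone exposing four intermediate stages, as in the conversion script.
backbone_config = Dinov2Config.from_pretrained(
    "facebook/dinov2-small", out_indices=[3, 6, 9, 12], apply_layernorm=False, reshape_hidden_states=False
)

config = DPTConfig(
    backbone_config=backbone_config,  # is_hybrid stays False, so the AutoBackbone path is taken
    neck_hidden_sizes=[48, 96, 192, 384],
    use_bias_in_fusion_residual=False,
    add_projection=True,
)
model = DPTForDepthEstimation(config)
```
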
diff --git a/src/transformers/models/dpt/convert_dinov2_depth_to_hf.py b/src/transformers/models/dpt/convert_dinov2_depth_to_hf.py
new file mode 100644
index 00000000000000..2bd147096c86ae
--- /dev/null
+++ b/src/transformers/models/dpt/convert_dinov2_depth_to_hf.py
@@ -0,0 +1,384 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Convert DINOv2 + DPT checkpoints from the original repository. URL:
+https://github.com/facebookresearch/dinov2/tree/main"""
+
+
+import argparse
+import itertools
+import math
+from pathlib import Path
+
+import requests
+import torch
+from PIL import Image
+from torchvision import transforms
+
+from transformers import Dinov2Config, DPTConfig, DPTForDepthEstimation, DPTImageProcessor
+from transformers.utils import logging
+
+
+logging.set_verbosity_info()
+logger = logging.get_logger(__name__)
+
+
+def get_dpt_config(model_name):
+    if "small" in model_name:
+        # equivalent to stage 3, stage 6, stage 9, stage 12
+        backbone_config = Dinov2Config.from_pretrained(
+            "facebook/dinov2-small", out_indices=[3, 6, 9, 12], apply_layernorm=False, reshape_hidden_states=False
+        )
+        neck_hidden_sizes = [48, 96, 192, 384]
+    elif "base" in model_name:
+        backbone_config = Dinov2Config.from_pretrained(
+            "facebook/dinov2-base", out_indices=[3, 6, 9, 12], apply_layernorm=False, reshape_hidden_states=False
+        )
+        neck_hidden_sizes = [96, 192, 384, 768]
+    elif "large" in model_name:
+        backbone_config = Dinov2Config.from_pretrained(
+            "facebook/dinov2-large", out_indices=[5, 12, 18, 24], apply_layernorm=False, reshape_hidden_states=False
+        )
+        neck_hidden_sizes = [128, 256, 512, 1024]
+    elif "giant" in model_name:
+        backbone_config = Dinov2Config.from_pretrained(
+            "facebook/dinov2-giant", out_indices=[10, 20, 30, 40], apply_layernorm=False, reshape_hidden_states=False
+        )
+        neck_hidden_sizes = [192, 384, 768, 1536]
+    else:
+        raise NotImplementedError("To do")
+
+    config = DPTConfig(
+        backbone_config=backbone_config,
+        neck_hidden_sizes=neck_hidden_sizes,
+        use_bias_in_fusion_residual=False,
+        add_projection=True,
+    )
+
+    return config
+
+
+# here we list all DPT keys to be renamed (original name on the left, our name on the right)
+def create_rename_keys_dpt(config):
+    rename_keys = []
+
+    # fmt: off
+    # activation postprocessing (projections, readout projections + resize blocks)
+    for i in range(4):
+        rename_keys.append((f"decode_head.reassemble_blocks.projects.{i}.conv.weight", f"neck.reassemble_stage.layers.{i}.projection.weight"))
+        rename_keys.append((f"decode_head.reassemble_blocks.projects.{i}.conv.bias", f"neck.reassemble_stage.layers.{i}.projection.bias"))
+
+        rename_keys.append((f"decode_head.reassemble_blocks.readout_projects.{i}.0.weight", f"neck.reassemble_stage.readout_projects.{i}.0.weight"))
+        rename_keys.append((f"decode_head.reassemble_blocks.readout_projects.{i}.0.bias", f"neck.reassemble_stage.readout_projects.{i}.0.bias"))
+
+        if i != 2:
+            rename_keys.append((f"decode_head.reassemble_blocks.resize_layers.{i}.weight", f"neck.reassemble_stage.layers.{i}.resize.weight"))
+            rename_keys.append((f"decode_head.reassemble_blocks.resize_layers.{i}.bias", f"neck.reassemble_stage.layers.{i}.resize.bias"))
+
+    # fusion layers
+    for i in range(4):
+        rename_keys.append((f"decode_head.fusion_blocks.{i}.project.conv.weight", f"neck.fusion_stage.layers.{i}.projection.weight"))
+        rename_keys.append((f"decode_head.fusion_blocks.{i}.project.conv.bias", f"neck.fusion_stage.layers.{i}.projection.bias"))
+        if i != 0:
+            rename_keys.append((f"decode_head.fusion_blocks.{i}.res_conv_unit1.conv1.conv.weight", f"neck.fusion_stage.layers.{i}.residual_layer1.convolution1.weight"))
+            rename_keys.append((f"decode_head.fusion_blocks.{i}.res_conv_unit1.conv2.conv.weight", f"neck.fusion_stage.layers.{i}.residual_layer1.convolution2.weight"))
+        rename_keys.append((f"decode_head.fusion_blocks.{i}.res_conv_unit2.conv1.conv.weight", f"neck.fusion_stage.layers.{i}.residual_layer2.convolution1.weight"))
+        rename_keys.append((f"decode_head.fusion_blocks.{i}.res_conv_unit2.conv2.conv.weight", f"neck.fusion_stage.layers.{i}.residual_layer2.convolution2.weight"))
+
+    # neck convolutions
+    for i in range(4):
+        rename_keys.append((f"decode_head.convs.{i}.conv.weight", f"neck.convs.{i}.weight"))
+
+    # head
+    rename_keys.append(("decode_head.project.conv.weight", "head.projection.weight"))
+    rename_keys.append(("decode_head.project.conv.bias", "head.projection.bias"))
+
+    for i in range(0, 5, 2):
+        rename_keys.append((f"decode_head.conv_depth.head.{i}.weight", f"head.head.{i}.weight"))
+        rename_keys.append((f"decode_head.conv_depth.head.{i}.bias", f"head.head.{i}.bias"))
+    # fmt: on
+
+    return rename_keys
+
+
+# here we list all backbone keys to be renamed (original name on the left, our name on the right)
+def create_rename_keys_backbone(config):
+    rename_keys = []
+
+    # fmt: off
+    # patch embedding layer
+    rename_keys.append(("cls_token", "backbone.embeddings.cls_token"))
+    rename_keys.append(("mask_token", "backbone.embeddings.mask_token"))
+    rename_keys.append(("pos_embed", "backbone.embeddings.position_embeddings"))
+    rename_keys.append(("patch_embed.proj.weight", "backbone.embeddings.patch_embeddings.projection.weight"))
+    rename_keys.append(("patch_embed.proj.bias", "backbone.embeddings.patch_embeddings.projection.bias"))
+
+    # Transformer encoder
+    for i in range(config.backbone_config.num_hidden_layers):
+        # layernorms
+        rename_keys.append((f"blocks.{i}.norm1.weight", f"backbone.encoder.layer.{i}.norm1.weight"))
+        rename_keys.append((f"blocks.{i}.norm1.bias", f"backbone.encoder.layer.{i}.norm1.bias"))
+        rename_keys.append((f"blocks.{i}.norm2.weight", f"backbone.encoder.layer.{i}.norm2.weight"))
+        rename_keys.append((f"blocks.{i}.norm2.bias", f"backbone.encoder.layer.{i}.norm2.bias"))
+        # MLP
+        if config.backbone_config.use_swiglu_ffn:
+            rename_keys.append((f"blocks.{i}.mlp.w12.weight", f"backbone.encoder.layer.{i}.mlp.w12.weight"))
+            rename_keys.append((f"blocks.{i}.mlp.w12.bias", f"backbone.encoder.layer.{i}.mlp.w12.bias"))
+            rename_keys.append((f"blocks.{i}.mlp.w3.weight", f"backbone.encoder.layer.{i}.mlp.w3.weight"))
+            rename_keys.append((f"blocks.{i}.mlp.w3.bias", f"backbone.encoder.layer.{i}.mlp.w3.bias"))
+        else:
+            rename_keys.append((f"blocks.{i}.mlp.fc1.weight", f"backbone.encoder.layer.{i}.mlp.fc1.weight"))
+            rename_keys.append((f"blocks.{i}.mlp.fc1.bias", f"backbone.encoder.layer.{i}.mlp.fc1.bias"))
+            rename_keys.append((f"blocks.{i}.mlp.fc2.weight", f"backbone.encoder.layer.{i}.mlp.fc2.weight"))
+            rename_keys.append((f"blocks.{i}.mlp.fc2.bias", f"backbone.encoder.layer.{i}.mlp.fc2.bias"))
+        # layerscale
+        rename_keys.append((f"blocks.{i}.ls1.gamma", f"backbone.encoder.layer.{i}.layer_scale1.lambda1"))
+        rename_keys.append((f"blocks.{i}.ls2.gamma", f"backbone.encoder.layer.{i}.layer_scale2.lambda1"))
+        # attention projection layer
+        rename_keys.append((f"blocks.{i}.attn.proj.weight", f"backbone.encoder.layer.{i}.attention.output.dense.weight"))
+        rename_keys.append((f"blocks.{i}.attn.proj.bias", f"backbone.encoder.layer.{i}.attention.output.dense.bias"))
+    # fmt: on
+
+    rename_keys.append(("norm.weight", "backbone.layernorm.weight"))
+    rename_keys.append(("norm.bias", "backbone.layernorm.bias"))
+
+    return rename_keys
+
+
+# we split up the matrix of each encoder layer into queries, keys and values
+def read_in_q_k_v(state_dict, config):
+    for i in range(config.backbone_config.num_hidden_layers):
+        # read in weights + bias of input projection layer (in timm, this is a single matrix + bias)
+        in_proj_weight = state_dict.pop(f"blocks.{i}.attn.qkv.weight")
+        in_proj_bias = state_dict.pop(f"blocks.{i}.attn.qkv.bias")
+        hidden_size = config.backbone_config.hidden_size
+        # next, add query, keys and values (in that order) to the state dict
+        state_dict[f"backbone.encoder.layer.{i}.attention.attention.query.weight"] = in_proj_weight[:hidden_size, :]
+        state_dict[f"backbone.encoder.layer.{i}.attention.attention.query.bias"] = in_proj_bias[:hidden_size]
+        state_dict[f"backbone.encoder.layer.{i}.attention.attention.key.weight"] = in_proj_weight[
+            hidden_size : hidden_size * 2, :
+        ]
+        state_dict[f"backbone.encoder.layer.{i}.attention.attention.key.bias"] = in_proj_bias[
+            hidden_size : hidden_size * 2
+        ]
+        state_dict[f"backbone.encoder.layer.{i}.attention.attention.value.weight"] = in_proj_weight[-hidden_size:, :]
+        state_dict[f"backbone.encoder.layer.{i}.attention.attention.value.bias"] = in_proj_bias[-hidden_size:]
+
+
+def rename_key(dct, old, new):
+    val = dct.pop(old)
+    dct[new] = val
+
+
+# We will verify our results on an image of cute cats
+def prepare_img():
+    url = "https://dl.fbaipublicfiles.com/dinov2/images/example.jpg"
+    im = Image.open(requests.get(url, stream=True).raw)
+    return im
+
+
+name_to_url = {
+    "dpt-dinov2-small-nyu": "https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_nyu_dpt_head.pth",
+    "dpt-dinov2-small-kitti": "https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_kitti_dpt_head.pth",
+    "dpt-dinov2-base-nyu": "https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_nyu_dpt_head.pth",
+    "dpt-dinov2-base-kitti": "https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_kitti_dpt_head.pth",
+    "dpt-dinov2-large-nyu": "https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_nyu_dpt_head.pth",
+    "dpt-dinov2-large-kitti": "https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_kitti_dpt_head.pth",
+    "dpt-dinov2-giant-nyu": "https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_nyu_dpt_head.pth",
+    "dpt-dinov2-giant-kitti": "https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_kitti_dpt_head.pth",
+}
+
+
+def get_original_pixel_values(image):
+    class CenterPadding(object):
+        def __init__(self, multiple):
+            super().__init__()
+            self.multiple = multiple
+
+        def _get_pad(self, size):
+            new_size = math.ceil(size / self.multiple) * self.multiple
+            pad_size = new_size - size
+            pad_size_left = pad_size // 2
+            pad_size_right = pad_size - pad_size_left
+            return pad_size_left, pad_size_right
+
+        def __call__(self, img):
+            pads = list(itertools.chain.from_iterable(self._get_pad(m) for m in img.shape[-2:][::-1]))
+            output = torch.nn.functional.pad(img, pads)
+            return output
+
+        def __repr__(self):
+            return self.__class__.__name__ + "()"
+
+    def make_depth_transform() -> transforms.Compose:
+        return transforms.Compose(
+            [
+                transforms.ToTensor(),
+                lambda x: 255.0 * x[:3],  # Discard alpha component and scale by 255
+                transforms.Normalize(
+                    mean=(123.675, 116.28, 103.53),
+                    std=(58.395, 57.12, 57.375),
+                ),
+                CenterPadding(multiple=14),
+            ]
+        )
+
+    transform = make_depth_transform()
+    original_pixel_values = transform(image).unsqueeze(0)
+
+    return original_pixel_values
+
+
+@torch.no_grad()
+def convert_dpt_checkpoint(model_name, pytorch_dump_folder_path, push_to_hub, verify_logits):
+    """
+    Copy/paste/tweak model's weights to our DPT structure.
+    """
+
+    # define DPT configuration based on URL
+    checkpoint_url = name_to_url[model_name]
+    config = get_dpt_config(model_name)
+
+    # load original DPT state_dict from URL
+    print("URL:", checkpoint_url)
+    dpt_state_dict = torch.hub.load_state_dict_from_url(checkpoint_url, map_location="cpu")["state_dict"]
+    # rename keys
+    rename_keys = create_rename_keys_dpt(config)
+    for src, dest in rename_keys:
+        rename_key(dpt_state_dict, src, dest)
+
+    # load original backbone state_dict from URL
+    if "small" in model_name:
+        original_model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
+    elif "base" in model_name:
+        original_model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
+    elif "large" in model_name:
+        original_model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
+    elif "giant" in model_name:
+        original_model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitg14")
+    else:
+        raise NotImplementedError("To do")
+    original_model.eval()
+    backbone_state_dict = original_model.state_dict()
+
+    # rename keys
+    rename_keys = create_rename_keys_backbone(config)
+    for src, dest in rename_keys:
+        rename_key(backbone_state_dict, src, dest)
+
+    # read in qkv matrices
+    read_in_q_k_v(backbone_state_dict, config)
+
+    for key, val in backbone_state_dict.copy().items():
+        val = backbone_state_dict.pop(key)
+        if "w12" in key:
+            key = key.replace("w12", "weights_in")
+        if "w3" in key:
+            key = key.replace("w3", "weights_out")
+        backbone_state_dict[key] = val
+
+    # merge state_dicts
+    state_dict = {**backbone_state_dict, **dpt_state_dict}
+
+    # load HuggingFace model
+    model = DPTForDepthEstimation(config)
+    missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False)
+    print("Missing keys:", missing_keys)
+    print("Unexpected keys:", unexpected_keys)
+    assert missing_keys == [
+        "neck.fusion_stage.layers.0.residual_layer1.convolution1.weight",
+        "neck.fusion_stage.layers.0.residual_layer1.convolution2.weight",
+    ]
+    model.eval()
+
+    # Verify image processor
+    processor = DPTImageProcessor(
+        do_resize=False,
+        do_rescale=False,
+        do_pad=True,
+        size_divisor=14,
+        do_normalize=True,
+        image_mean=(123.675, 116.28, 103.53),
+        image_std=(58.395, 57.12, 57.375),
+    )
+
+    image = prepare_img()
+    pixel_values = processor(image, return_tensors="pt").pixel_values.float()
+    original_pixel_values = get_original_pixel_values(image)
+
+    assert torch.allclose(pixel_values, original_pixel_values)
+
+    # Verify forward pass
+    with torch.no_grad():
+        outputs = model(pixel_values)
+
+    predicted_depth = outputs.predicted_depth
+
+    print("Shape of predicted depth:", predicted_depth.shape)
+    print("First values of predicted depth:", predicted_depth[0, :3, :3])
+
+    # assert logits
+    if verify_logits:
+        if model_name == "dpt-dinov2-small-nyu":
+            expected_shape = torch.Size([1, 576, 736])
+            expected_slice = torch.tensor(
+                [[3.3576, 3.4741, 3.4345], [3.4324, 3.5012, 3.2775], [3.2560, 3.3563, 3.2354]]
+            )
+        else:
+            raise NotImplementedError(f"No expected logits available for {model_name}")
+
+        assert predicted_depth.shape == torch.Size(expected_shape)
+        assert torch.allclose(predicted_depth[0, :3, :3], expected_slice, atol=1e-5)
+        print("Looks ok!")
+
+    if pytorch_dump_folder_path is not None:
+        Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
+        print(f"Saving model and processor to {pytorch_dump_folder_path}")
+        model.save_pretrained(pytorch_dump_folder_path)
+        processor.save_pretrained(pytorch_dump_folder_path)
+
+    if push_to_hub:
+        print("Pushing model and processor to hub...")
+        model.push_to_hub(repo_id=f"facebook/{model_name}")
+        processor.push_to_hub(repo_id=f"facebook/{model_name}")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    # Required parameters
+    parser.add_argument(
+        "--model_name",
+        default="dpt-dinov2-small-nyu",
+        type=str,
+        choices=name_to_url.keys(),
+        help="Name of the model you'd like to convert.",
+    )
+    parser.add_argument(
+        "--pytorch_dump_folder_path",
+        default=None,
+        type=str,
+        help="Path to the output PyTorch model directory.",
+    )
+    parser.add_argument(
+        "--push_to_hub",
+        action="store_true",
+        help="Whether to push the model to the hub after conversion.",
+    )
+    parser.add_argument(
+        "--verify_logits",
+        action="store_true",
+        required=False,
+        help="Path to the output PyTorch model directory.",
+    )
+
+    args = parser.parse_args()
+    convert_dpt_checkpoint(args.model_name, args.pytorch_dump_folder_path, args.push_to_hub, args.verify_logits)
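
After conversion, the saved checkpoint can be used for depth estimation in the usual way. A short usage sketch, assuming the model and processor were written to a local folder via `--pytorch_dump_folder_path` (the path below is a placeholder):

```python
import requests
import torch
from PIL import Image

from transformers import DPTForDepthEstimation, DPTImageProcessor

model = DPTForDepthEstimation.from_pretrained("./dpt-dinov2-small-nyu")  # placeholder local path
processor = DPTImageProcessor.from_pretrained("./dpt-dinov2-small-nyu")

url = "https://dl.fbaipublicfiles.com/dinov2/images/example.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(pixel_values=inputs.pixel_values.float())

print(outputs.predicted_depth.shape)
```
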
diff --git a/src/transformers/models/dpt/convert_dpt_beit_to_hf.py b/src/transformers/models/dpt/convert_dpt_beit_to_hf.py
new file mode 100644
index 00000000000000..eb335a0ea2aeeb
--- /dev/null
+++ b/src/transformers/models/dpt/convert_dpt_beit_to_hf.py
@@ -0,0 +1,306 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Convert DPT 3.1 checkpoints from the MiDaS repository. URL: https://github.com/isl-org/MiDaS"""
+
+
+import argparse
+from pathlib import Path
+
+import requests
+import torch
+from PIL import Image
+
+from transformers import BeitConfig, DPTConfig, DPTForDepthEstimation, DPTImageProcessor
+from transformers.utils import logging
+
+
+logging.set_verbosity_info()
+logger = logging.get_logger(__name__)
+
+
+def get_dpt_config(model_name):
+    hidden_size = 768
+    num_hidden_layers = 12
+    num_attention_heads = 12
+    intermediate_size = 3072
+    out_features = ["stage3", "stage6", "stage9", "stage12"]  # beit-base-384 uses [2, 5, 8, 11]
+
+    if "large" in model_name:
+        hidden_size = 1024
+        num_hidden_layers = 24
+        num_attention_heads = 16
+        intermediate_size = 4096
+        out_features = ["stage6", "stage12", "stage18", "stage24"]  # beit-large-512 uses [5, 11, 17, 23]
+
+    if "512" in model_name:
+        image_size = 512
+    elif "384" in model_name:
+        image_size = 384
+    else:
+        raise ValueError("Model not supported")
+
+    backbone_config = BeitConfig(
+        image_size=image_size,
+        num_hidden_layers=num_hidden_layers,
+        hidden_size=hidden_size,
+        intermediate_size=intermediate_size,
+        num_attention_heads=num_attention_heads,
+        use_relative_position_bias=True,
+        reshape_hidden_states=False,
+        out_features=out_features,
+    )
+
+    neck_hidden_sizes = [256, 512, 1024, 1024] if "large" in model_name else [96, 192, 384, 768]
+    config = DPTConfig(backbone_config=backbone_config, neck_hidden_sizes=neck_hidden_sizes)
+
+    return config, image_size
+
+
+# here we list all keys to be renamed (original name on the left, our name on the right)
+def create_rename_keys(config):
+    rename_keys = []
+
+    # fmt: off
+    # stem
+    rename_keys.append(("pretrained.model.cls_token", "backbone.embeddings.cls_token"))
+    rename_keys.append(("pretrained.model.patch_embed.proj.weight", "backbone.embeddings.patch_embeddings.projection.weight"))
+    rename_keys.append(("pretrained.model.patch_embed.proj.bias", "backbone.embeddings.patch_embeddings.projection.bias"))
+
+    # Transformer encoder
+    for i in range(config.backbone_config.num_hidden_layers):
+        rename_keys.append((f"pretrained.model.blocks.{i}.gamma_1", f"backbone.encoder.layer.{i}.lambda_1"))
+        rename_keys.append((f"pretrained.model.blocks.{i}.gamma_2", f"backbone.encoder.layer.{i}.lambda_2"))
+        rename_keys.append((f"pretrained.model.blocks.{i}.norm1.weight", f"backbone.encoder.layer.{i}.layernorm_before.weight"))
+        rename_keys.append((f"pretrained.model.blocks.{i}.norm1.bias", f"backbone.encoder.layer.{i}.layernorm_before.bias"))
+        rename_keys.append((f"pretrained.model.blocks.{i}.norm2.weight", f"backbone.encoder.layer.{i}.layernorm_after.weight"))
+        rename_keys.append((f"pretrained.model.blocks.{i}.norm2.bias", f"backbone.encoder.layer.{i}.layernorm_after.bias"))
+        rename_keys.append((f"pretrained.model.blocks.{i}.mlp.fc1.weight", f"backbone.encoder.layer.{i}.intermediate.dense.weight"))
+        rename_keys.append((f"pretrained.model.blocks.{i}.mlp.fc1.bias", f"backbone.encoder.layer.{i}.intermediate.dense.bias"))
+        rename_keys.append((f"pretrained.model.blocks.{i}.mlp.fc2.weight", f"backbone.encoder.layer.{i}.output.dense.weight"))
+        rename_keys.append((f"pretrained.model.blocks.{i}.mlp.fc2.bias", f"backbone.encoder.layer.{i}.output.dense.bias"))
+        rename_keys.append((f"pretrained.model.blocks.{i}.attn.proj.weight", f"backbone.encoder.layer.{i}.attention.output.dense.weight"))
+        rename_keys.append((f"pretrained.model.blocks.{i}.attn.proj.bias", f"backbone.encoder.layer.{i}.attention.output.dense.bias"))
+        rename_keys.append((f"pretrained.model.blocks.{i}.attn.relative_position_bias_table", f"backbone.encoder.layer.{i}.attention.attention.relative_position_bias.relative_position_bias_table"))
+        rename_keys.append((f"pretrained.model.blocks.{i}.attn.relative_position_index", f"backbone.encoder.layer.{i}.attention.attention.relative_position_bias.relative_position_index"))
+
+    # activation postprocessing (readout projections + resize blocks)
+    for i in range(4):
+        rename_keys.append((f"pretrained.act_postprocess{i+1}.0.project.0.weight", f"neck.reassemble_stage.readout_projects.{i}.0.weight"))
+        rename_keys.append((f"pretrained.act_postprocess{i+1}.0.project.0.bias", f"neck.reassemble_stage.readout_projects.{i}.0.bias"))
+
+        rename_keys.append((f"pretrained.act_postprocess{i+1}.3.weight", f"neck.reassemble_stage.layers.{i}.projection.weight"))
+        rename_keys.append((f"pretrained.act_postprocess{i+1}.3.bias", f"neck.reassemble_stage.layers.{i}.projection.bias"))
+
+        if i != 2:
+            rename_keys.append((f"pretrained.act_postprocess{i+1}.4.weight", f"neck.reassemble_stage.layers.{i}.resize.weight"))
+            rename_keys.append((f"pretrained.act_postprocess{i+1}.4.bias", f"neck.reassemble_stage.layers.{i}.resize.bias"))
+
+    # refinenet (tricky here: the original refinenet indices map to our fusion stage layers in reverse order)
+    mapping = {1:3, 2:2, 3:1, 4:0}
+
+    for i in range(1, 5):
+        j = mapping[i]
+        rename_keys.append((f"scratch.refinenet{i}.out_conv.weight", f"neck.fusion_stage.layers.{j}.projection.weight"))
+        rename_keys.append((f"scratch.refinenet{i}.out_conv.bias", f"neck.fusion_stage.layers.{j}.projection.bias"))
+        rename_keys.append((f"scratch.refinenet{i}.resConfUnit1.conv1.weight", f"neck.fusion_stage.layers.{j}.residual_layer1.convolution1.weight"))
+        rename_keys.append((f"scratch.refinenet{i}.resConfUnit1.conv1.bias", f"neck.fusion_stage.layers.{j}.residual_layer1.convolution1.bias"))
+        rename_keys.append((f"scratch.refinenet{i}.resConfUnit1.conv2.weight", f"neck.fusion_stage.layers.{j}.residual_layer1.convolution2.weight"))
+        rename_keys.append((f"scratch.refinenet{i}.resConfUnit1.conv2.bias", f"neck.fusion_stage.layers.{j}.residual_layer1.convolution2.bias"))
+        rename_keys.append((f"scratch.refinenet{i}.resConfUnit2.conv1.weight", f"neck.fusion_stage.layers.{j}.residual_layer2.convolution1.weight"))
+        rename_keys.append((f"scratch.refinenet{i}.resConfUnit2.conv1.bias", f"neck.fusion_stage.layers.{j}.residual_layer2.convolution1.bias"))
+        rename_keys.append((f"scratch.refinenet{i}.resConfUnit2.conv2.weight", f"neck.fusion_stage.layers.{j}.residual_layer2.convolution2.weight"))
+        rename_keys.append((f"scratch.refinenet{i}.resConfUnit2.conv2.bias", f"neck.fusion_stage.layers.{j}.residual_layer2.convolution2.bias"))
+
+    # scratch convolutions
+    for i in range(4):
+        rename_keys.append((f"scratch.layer{i+1}_rn.weight", f"neck.convs.{i}.weight"))
+
+    # head
+    for i in range(0, 5, 2):
+        rename_keys.append((f"scratch.output_conv.{i}.weight", f"head.head.{i}.weight"))
+        rename_keys.append((f"scratch.output_conv.{i}.bias", f"head.head.{i}.bias"))
+
+    return rename_keys
+
+
+def remove_ignore_keys_(state_dict):
+    ignore_keys = ["pretrained.model.head.weight", "pretrained.model.head.bias"]
+    for k in ignore_keys:
+        state_dict.pop(k, None)
+
+
+# we split up the matrix of each encoder layer into queries, keys and values
+def read_in_q_k_v(state_dict, config):
+    hidden_size = config.backbone_config.hidden_size
+    for i in range(config.backbone_config.num_hidden_layers):
+        # read in weights + bias of input projection layer (in original implementation, this is a single matrix + bias)
+        in_proj_weight = state_dict.pop(f"pretrained.model.blocks.{i}.attn.qkv.weight")
+        q_bias = state_dict.pop(f"pretrained.model.blocks.{i}.attn.q_bias")
+        v_bias = state_dict.pop(f"pretrained.model.blocks.{i}.attn.v_bias")
+        # next, add query, keys and values (in that order) to the state dict
+        state_dict[f"backbone.encoder.layer.{i}.attention.attention.query.weight"] = in_proj_weight[:hidden_size, :]
+        state_dict[f"backbone.encoder.layer.{i}.attention.attention.query.bias"] = q_bias
+        state_dict[f"backbone.encoder.layer.{i}.attention.attention.key.weight"] = in_proj_weight[
+            hidden_size : hidden_size * 2, :
+        ]
+        state_dict[f"backbone.encoder.layer.{i}.attention.attention.value.weight"] = in_proj_weight[-hidden_size:, :]
+        state_dict[f"backbone.encoder.layer.{i}.attention.attention.value.bias"] = v_bias
+
+
+def rename_key(dct, old, new):
+    val = dct.pop(old)
+    dct[new] = val
+
+
+# We will verify our results on an image of cute cats
+def prepare_img():
+    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+    im = Image.open(requests.get(url, stream=True).raw)
+    return im
+
+
+@torch.no_grad()
+def convert_dpt_checkpoint(model_name, pytorch_dump_folder_path, push_to_hub):
+    """
+    Copy/paste/tweak model's weights to our DPT structure.
+    """
+
+    name_to_url = {
+        "dpt-beit-large-512": "https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_beit_large_512.pt",
+        "dpt-beit-large-384": "https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_beit_large_384.pt",
+        "dpt-beit-base-384": "https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_beit_base_384.pt",
+    }
+
+    # define DPT configuration based on URL
+    checkpoint_url = name_to_url[model_name]
+    config, image_size = get_dpt_config(model_name)
+    # load original state_dict from URL
+    state_dict = torch.hub.load_state_dict_from_url(checkpoint_url, map_location="cpu")
+    # remove certain keys
+    remove_ignore_keys_(state_dict)
+    # rename keys
+    rename_keys = create_rename_keys(config)
+    for src, dest in rename_keys:
+        rename_key(state_dict, src, dest)
+    # read in qkv matrices
+    read_in_q_k_v(state_dict, config)
+
+    # load HuggingFace model
+    model = DPTForDepthEstimation(config)
+    missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False)
+    print("Missing keys:", missing_keys)
+    print("Unexpected keys:", unexpected_keys)
+    assert missing_keys == []
+    # assert unexpected_keys == ["pretrained.model.fc_norm.weight", "pretrained.model.fc_norm.bias"]
+    model.eval()
+
+    # Check outputs on an image
+    # We set `keep_aspect_ratio=False` as our current BEiT does not support arbitrary window sizes
+    processor = DPTImageProcessor(
+        size={"height": image_size, "width": image_size}, keep_aspect_ratio=False, ensure_multiple_of=32
+    )
+
+    image = prepare_img()
+    pixel_values = processor(image, return_tensors="pt").pixel_values
+
+    print("First values of pixel values:", pixel_values[0, 0, :3, :3])
+    print("Mean of pixel values:", pixel_values.mean().item())
+    print("Shape of pixel values:", pixel_values.shape)
+
+    # Override the processor output: run the forward pass on a plain torchvision
+    # resize + ToTensor (no ImageNet normalization).
+    from torchvision import transforms
+
+    transform = transforms.Compose(
+        [
+            transforms.Resize((image_size, image_size)),
+            transforms.ToTensor(),
+        ]
+    )
+    pixel_values = transform(image).unsqueeze(0)
+
+    # forward pass
+    with torch.no_grad():
+        outputs = model(pixel_values)
+
+    predicted_depth = outputs.predicted_depth
+
+    print("Shape of predicted depth:", predicted_depth.shape)
+    print("First values of predicted depth:", predicted_depth[0, :3, :3])
+
+    # assert logits
+    # TODO there's still a small difference with the original logits
+    if model_name == "dpt-beit-large-512":
+        # OK, checked
+        expected_shape = torch.Size([1, 512, 512])
+        expected_slice = torch.tensor(
+            [[2804.6260, 2792.5708, 2812.9263], [2772.0288, 2780.1118, 2796.2529], [2748.1094, 2766.6558, 2766.9834]]
+        )
+    elif model_name == "dpt-beit-large-384":
+        # OK, checked
+        expected_shape = torch.Size([1, 384, 384])
+        expected_slice = torch.tensor(
+            [[1783.2273, 1780.5729, 1792.6453], [1759.9817, 1765.5359, 1778.5002], [1739.1633, 1754.7903, 1757.1990]],
+        )
+    elif model_name == "dpt-beit-base-384":
+        # OK, checked
+        expected_shape = torch.Size([1, 384, 384])
+        expected_slice = torch.tensor(
+            [[2898.4482, 2891.3750, 2904.8079], [2858.6685, 2877.2615, 2894.4507], [2842.1235, 2854.1023, 2861.6328]],
+        )
+
+    assert predicted_depth.shape == torch.Size(expected_shape)
+    assert torch.allclose(predicted_depth[0, :3, :3], expected_slice)
+    print("Looks ok!")
+
+    if pytorch_dump_folder_path is not None:
+        Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
+        print(f"Saving model and processor to {pytorch_dump_folder_path}")
+        model.save_pretrained(pytorch_dump_folder_path)
+        processor.save_pretrained(pytorch_dump_folder_path)
+
+    if push_to_hub:
+        print("Pushing model and processor to hub...")
+        model.push_to_hub(repo_id=f"nielsr/{model_name}")
+        processor.push_to_hub(repo_id=f"nielsr/{model_name}")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    # Required parameters
+    parser.add_argument(
+        "--model_name",
+        default="dpt-beit-large-512",
+        type=str,
+        choices=["dpt-beit-large-512", "dpt-beit-large-384", "dpt-beit-base-384"],
+        help="Name of the model you'd like to convert.",
+    )
+    parser.add_argument(
+        "--pytorch_dump_folder_path",
+        default=None,
+        type=str,
+        help="Path to the output PyTorch model directory.",
+    )
+    parser.add_argument(
+        "--push_to_hub",
+        action="store_true",
+        help="Whether to push the model to the hub after conversion.",
+    )
+
+    args = parser.parse_args()
+    convert_dpt_checkpoint(args.model_name, args.pytorch_dump_folder_path, args.push_to_hub)
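
Both this script and the DINOv2 script above rely on the same `read_in_q_k_v` trick: the original checkpoints store a single fused `qkv` projection, which is sliced row-wise into separate query, key and value weights. A tiny self-contained sketch of that slicing with a made-up hidden size:

```python
import torch

hidden_size = 4  # made-up size for illustration
# Fused projection as stored in the original checkpoints: rows are [query; key; value].
in_proj_weight = torch.randn(3 * hidden_size, hidden_size)

query_weight = in_proj_weight[:hidden_size, :]
key_weight = in_proj_weight[hidden_size : hidden_size * 2, :]
value_weight = in_proj_weight[-hidden_size:, :]

# Concatenating the slices back together recovers the fused matrix.
assert torch.equal(torch.cat([query_weight, key_weight, value_weight], dim=0), in_proj_weight)
```
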
diff --git a/src/transformers/models/dpt/convert_dpt_hybrid_to_pytorch.py b/src/transformers/models/dpt/convert_dpt_hybrid_to_pytorch.py
index a563436b13c874..0fa69adfaf39d5 100644
--- a/src/transformers/models/dpt/convert_dpt_hybrid_to_pytorch.py
+++ b/src/transformers/models/dpt/convert_dpt_hybrid_to_pytorch.py
@@ -24,7 +24,7 @@
 from huggingface_hub import cached_download, hf_hub_url
 from PIL import Image
 
-from transformers import DPTConfig, DPTFeatureExtractor, DPTForDepthEstimation, DPTForSemanticSegmentation
+from transformers import DPTConfig, DPTForDepthEstimation, DPTForSemanticSegmentation, DPTImageProcessor
 from transformers.utils import logging
 
 
@@ -244,10 +244,10 @@ def convert_dpt_checkpoint(checkpoint_url, pytorch_dump_folder_path, push_to_hub
 
     # Check outputs on an image
     size = 480 if "ade" in checkpoint_url else 384
-    feature_extractor = DPTFeatureExtractor(size=size)
+    image_processor = DPTImageProcessor(size=size)
 
     image = prepare_img()
-    encoding = feature_extractor(image, return_tensors="pt")
+    encoding = image_processor(image, return_tensors="pt")
 
     # forward pass
     outputs = model(**encoding).logits if "ade" in checkpoint_url else model(**encoding).predicted_depth
@@ -271,12 +271,12 @@ def convert_dpt_checkpoint(checkpoint_url, pytorch_dump_folder_path, push_to_hub
         Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
         print(f"Saving model to {pytorch_dump_folder_path}")
         model.save_pretrained(pytorch_dump_folder_path)
-        print(f"Saving feature extractor to {pytorch_dump_folder_path}")
-        feature_extractor.save_pretrained(pytorch_dump_folder_path)
+        print(f"Saving image processor to {pytorch_dump_folder_path}")
+        image_processor.save_pretrained(pytorch_dump_folder_path)
 
     if push_to_hub:
         model.push_to_hub("ybelkada/dpt-hybrid-midas")
-        feature_extractor.push_to_hub("ybelkada/dpt-hybrid-midas")
+        image_processor.push_to_hub("ybelkada/dpt-hybrid-midas")
 
 
 if __name__ == "__main__":
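
The hybrid conversion script now uses `DPTImageProcessor` in place of `DPTFeatureExtractor`; the call pattern stays the same. A minimal sketch with a blank placeholder image:

```python
from PIL import Image

from transformers import DPTImageProcessor

image = Image.new("RGB", (640, 480))  # placeholder image
image_processor = DPTImageProcessor(size=384)
encoding = image_processor(image, return_tensors="pt")
print(encoding.pixel_values.shape)  # e.g. torch.Size([1, 3, 384, 384])
```
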
diff --git a/src/transformers/models/dpt/convert_dpt_swinv2_to_hf.py b/src/transformers/models/dpt/convert_dpt_swinv2_to_hf.py
new file mode 100644
index 00000000000000..fd6522ab6be331
--- /dev/null
+++ b/src/transformers/models/dpt/convert_dpt_swinv2_to_hf.py
@@ -0,0 +1,322 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Convert DPT 3.1 checkpoints from the MiDaS repository. URL: https://github.com/isl-org/MiDaS"""
+
+
+import argparse
+from pathlib import Path
+
+import requests
+import torch
+from PIL import Image
+
+from transformers import DPTConfig, DPTForDepthEstimation, DPTImageProcessor, Swinv2Config
+from transformers.utils import logging
+
+
+logging.set_verbosity_info()
+logger = logging.get_logger(__name__)
+
+
+def get_dpt_config(model_name):
+    if "tiny" in model_name:
+        embed_dim = 96
+        depths = (2, 2, 6, 2)
+        num_heads = (3, 6, 12, 24)
+        window_size = 16
+        # note: for Swinv2-tiny authors used the window_size = 16 variant
+        # as seen here: https://github.com/isl-org/MiDaS/blob/bdc4ed64c095e026dc0a2f17cabb14d58263decb/midas/backbones/swin2.py#L26
+        pretrained_window_sizes = (0, 0, 0, 0)
+    elif "base" in model_name:
+        embed_dim = 128
+        depths = (2, 2, 18, 2)
+        num_heads = (4, 8, 16, 32)
+        window_size = 24
+        pretrained_window_sizes = (12, 12, 12, 6)
+    elif "large" in model_name:
+        embed_dim = 192
+        depths = (2, 2, 18, 2)
+        num_heads = (6, 12, 24, 48)
+        window_size = 24
+        pretrained_window_sizes = (12, 12, 12, 6)
+
+    if "384" in model_name:
+        image_size = 384
+    elif "256" in model_name:
+        image_size = 256
+    else:
+        raise ValueError("Model not supported, to do")
+
+    backbone_config = Swinv2Config(
+        image_size=image_size,
+        embed_dim=embed_dim,
+        depths=depths,
+        window_size=window_size,
+        pretrained_window_sizes=pretrained_window_sizes,
+        num_heads=num_heads,
+        out_features=["stage1", "stage2", "stage3", "stage4"],
+    )
+
+    if model_name == "dpt-swinv2-tiny-256":
+        neck_hidden_sizes = [96, 192, 384, 768]
+    elif model_name == "dpt-swinv2-base-384":
+        neck_hidden_sizes = [128, 256, 512, 1024]
+    elif model_name == "dpt-swinv2-large-384":
+        neck_hidden_sizes = [192, 384, 768, 1536]
+
+    config = DPTConfig(backbone_config=backbone_config, neck_hidden_sizes=neck_hidden_sizes)
+
+    return config, image_size
+
+
+# here we list all keys to be renamed (original name on the left, our name on the right)
+def create_rename_keys(config):
+    rename_keys = []
+
+    # fmt: off
+    # stem
+    rename_keys.append(("pretrained.model.patch_embed.proj.weight", "backbone.embeddings.patch_embeddings.projection.weight"))
+    rename_keys.append(("pretrained.model.patch_embed.proj.bias", "backbone.embeddings.patch_embeddings.projection.bias"))
+    rename_keys.append(("pretrained.model.patch_embed.norm.weight", "backbone.embeddings.norm.weight"))
+    rename_keys.append(("pretrained.model.patch_embed.norm.bias", "backbone.embeddings.norm.bias"))
+
+    # transformer encoder
+    for i in range(len(config.backbone_config.depths)):
+        for j in range(config.backbone_config.depths[i]):
+            rename_keys.append((f"pretrained.model.layers.{i}.blocks.{j}.attn.logit_scale", f"backbone.encoder.layers.{i}.blocks.{j}.attention.self.logit_scale"))
+            rename_keys.append((f"pretrained.model.layers.{i}.blocks.{j}.attn.cpb_mlp.0.weight", f"backbone.encoder.layers.{i}.blocks.{j}.attention.self.continuous_position_bias_mlp.0.weight"))
+            rename_keys.append((f"pretrained.model.layers.{i}.blocks.{j}.attn.cpb_mlp.0.bias", f"backbone.encoder.layers.{i}.blocks.{j}.attention.self.continuous_position_bias_mlp.0.bias"))
+            rename_keys.append((f"pretrained.model.layers.{i}.blocks.{j}.attn.cpb_mlp.2.weight", f"backbone.encoder.layers.{i}.blocks.{j}.attention.self.continuous_position_bias_mlp.2.weight"))
+            rename_keys.append((f"pretrained.model.layers.{i}.blocks.{j}.attn.q_bias", f"backbone.encoder.layers.{i}.blocks.{j}.attention.self.query.bias"))
+            rename_keys.append((f"pretrained.model.layers.{i}.blocks.{j}.attn.v_bias", f"backbone.encoder.layers.{i}.blocks.{j}.attention.self.value.bias"))
+            rename_keys.append((f"pretrained.model.layers.{i}.blocks.{j}.attn.proj.weight", f"backbone.encoder.layers.{i}.blocks.{j}.attention.output.dense.weight"))
+            rename_keys.append((f"pretrained.model.layers.{i}.blocks.{j}.attn.proj.bias", f"backbone.encoder.layers.{i}.blocks.{j}.attention.output.dense.bias"))
+            rename_keys.append((f"pretrained.model.layers.{i}.blocks.{j}.norm1.weight", f"backbone.encoder.layers.{i}.blocks.{j}.layernorm_before.weight"))
+            rename_keys.append((f"pretrained.model.layers.{i}.blocks.{j}.norm1.bias", f"backbone.encoder.layers.{i}.blocks.{j}.layernorm_before.bias"))
+            rename_keys.append((f"pretrained.model.layers.{i}.blocks.{j}.mlp.fc1.weight", f"backbone.encoder.layers.{i}.blocks.{j}.intermediate.dense.weight"))
+            rename_keys.append((f"pretrained.model.layers.{i}.blocks.{j}.mlp.fc1.bias", f"backbone.encoder.layers.{i}.blocks.{j}.intermediate.dense.bias"))
+            rename_keys.append((f"pretrained.model.layers.{i}.blocks.{j}.mlp.fc2.weight", f"backbone.encoder.layers.{i}.blocks.{j}.output.dense.weight"))
+            rename_keys.append((f"pretrained.model.layers.{i}.blocks.{j}.mlp.fc2.bias", f"backbone.encoder.layers.{i}.blocks.{j}.output.dense.bias"))
+            rename_keys.append((f"pretrained.model.layers.{i}.blocks.{j}.norm2.weight", f"backbone.encoder.layers.{i}.blocks.{j}.layernorm_after.weight"))
+            rename_keys.append((f"pretrained.model.layers.{i}.blocks.{j}.norm2.bias", f"backbone.encoder.layers.{i}.blocks.{j}.layernorm_after.bias"))
+
+        # downsample parameters
+        if i in [0,1,2]:
+            rename_keys.append((f"pretrained.model.layers.{i}.downsample.reduction.weight", f"backbone.encoder.layers.{i}.downsample.reduction.weight"))
+            rename_keys.append((f"pretrained.model.layers.{i}.downsample.norm.weight", f"backbone.encoder.layers.{i}.downsample.norm.weight"))
+            rename_keys.append((f"pretrained.model.layers.{i}.downsample.norm.bias", f"backbone.encoder.layers.{i}.downsample.norm.bias"))
+
+    # note: hierarchical backbones like Swinv2, LeViT et al. don't require activation postprocessing (readout projections + resize blocks)
+
+    # refinenet (tricky: the original refinenet indices run in the opposite order to our fusion layers, hence the index mapping below)
+    mapping = {1:3, 2:2, 3:1, 4:0}
+
+    for i in range(1, 5):
+        j = mapping[i]
+        rename_keys.append((f"scratch.refinenet{i}.out_conv.weight", f"neck.fusion_stage.layers.{j}.projection.weight"))
+        rename_keys.append((f"scratch.refinenet{i}.out_conv.bias", f"neck.fusion_stage.layers.{j}.projection.bias"))
+        rename_keys.append((f"scratch.refinenet{i}.resConfUnit1.conv1.weight", f"neck.fusion_stage.layers.{j}.residual_layer1.convolution1.weight"))
+        rename_keys.append((f"scratch.refinenet{i}.resConfUnit1.conv1.bias", f"neck.fusion_stage.layers.{j}.residual_layer1.convolution1.bias"))
+        rename_keys.append((f"scratch.refinenet{i}.resConfUnit1.conv2.weight", f"neck.fusion_stage.layers.{j}.residual_layer1.convolution2.weight"))
+        rename_keys.append((f"scratch.refinenet{i}.resConfUnit1.conv2.bias", f"neck.fusion_stage.layers.{j}.residual_layer1.convolution2.bias"))
+        rename_keys.append((f"scratch.refinenet{i}.resConfUnit2.conv1.weight", f"neck.fusion_stage.layers.{j}.residual_layer2.convolution1.weight"))
+        rename_keys.append((f"scratch.refinenet{i}.resConfUnit2.conv1.bias", f"neck.fusion_stage.layers.{j}.residual_layer2.convolution1.bias"))
+        rename_keys.append((f"scratch.refinenet{i}.resConfUnit2.conv2.weight", f"neck.fusion_stage.layers.{j}.residual_layer2.convolution2.weight"))
+        rename_keys.append((f"scratch.refinenet{i}.resConfUnit2.conv2.bias", f"neck.fusion_stage.layers.{j}.residual_layer2.convolution2.bias"))
+
+    # scratch convolutions
+    for i in range(4):
+        rename_keys.append((f"scratch.layer{i+1}_rn.weight", f"neck.convs.{i}.weight"))
+
+    # head
+    for i in range(0, 5, 2):
+        rename_keys.append((f"scratch.output_conv.{i}.weight", f"head.head.{i}.weight"))
+        rename_keys.append((f"scratch.output_conv.{i}.bias", f"head.head.{i}.bias"))
+
+    return rename_keys
+
+
+def remove_ignore_keys_(state_dict):
+    ignore_keys = ["pretrained.model.head.weight", "pretrained.model.head.bias"]
+    for k in ignore_keys:
+        state_dict.pop(k, None)
+
+
+# we split up the matrix of each encoder layer into queries, keys and values
+def read_in_q_k_v(state_dict, config, model):
+    for i in range(len(config.backbone_config.depths)):
+        for j in range(config.backbone_config.depths[i]):
+            dim = model.backbone.encoder.layers[i].blocks[j].attention.self.all_head_size
+            # read in weights + bias of input projection layer (in original implementation, this is a single matrix + bias)
+            in_proj_weight = state_dict.pop(f"pretrained.model.layers.{i}.blocks.{j}.attn.qkv.weight")
+            # next, add query, keys and values (in that order) to the state dict
+            state_dict[f"backbone.encoder.layers.{i}.blocks.{j}.attention.self.query.weight"] = in_proj_weight[:dim, :]
+            state_dict[f"backbone.encoder.layers.{i}.blocks.{j}.attention.self.key.weight"] = in_proj_weight[
+                dim : dim * 2, :
+            ]
+            state_dict[f"backbone.encoder.layers.{i}.blocks.{j}.attention.self.value.weight"] = in_proj_weight[
+                -dim:, :
+            ]
+
+
+def rename_key(dct, old, new):
+    val = dct.pop(old)
+    dct[new] = val
+
+
+# We will verify our results on an image of cute cats
+def prepare_img():
+    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+    im = Image.open(requests.get(url, stream=True).raw)
+    return im
+
+
+@torch.no_grad()
+def convert_dpt_checkpoint(model_name, pytorch_dump_folder_path, verify_logits, push_to_hub):
+    """
+    Copy/paste/tweak model's weights to our DPT structure.
+    """
+
+    name_to_url = {
+        "dpt-swinv2-tiny-256": "https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_swin2_tiny_256.pt",
+        "dpt-swinv2-base-384": "https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_swin2_base_384.pt",
+        "dpt-swinv2-large-384": "https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_swin2_large_384.pt",
+    }
+
+    # define DPT configuration based on URL
+    checkpoint_url = name_to_url[model_name]
+    config, image_size = get_dpt_config(model_name)
+    # load original state_dict from URL
+    state_dict = torch.hub.load_state_dict_from_url(checkpoint_url, map_location="cpu")
+
+    # load HuggingFace model
+    model = DPTForDepthEstimation(config)
+
+    # remove certain keys
+    remove_ignore_keys_(state_dict)
+    # rename keys
+    rename_keys = create_rename_keys(config)
+    for src, dest in rename_keys:
+        rename_key(state_dict, src, dest)
+    # read in qkv matrices
+    read_in_q_k_v(state_dict, config, model)
+
+    missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False)
+    print("Missing keys:", missing_keys)
+    print("Unexpected keys:", unexpected_keys)
+    model.eval()
+
+    # Check outputs on an image
+    processor = DPTImageProcessor(size={"height": image_size, "width": image_size})
+
+    image = prepare_img()
+    processor(image, return_tensors="pt")  # smoke test: make sure the processor runs on the example image
+
+    if verify_logits:
+        from torchvision import transforms
+
+        url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+        image = Image.open(requests.get(url, stream=True).raw)
+
+        transforms = transforms.Compose(
+            [
+                transforms.Resize((image_size, image_size)),
+                transforms.ToTensor(),
+            ]
+        )
+        pixel_values = transforms(image).unsqueeze(0)
+
+        # forward pass
+        with torch.no_grad():
+            outputs = model(pixel_values)
+
+        predicted_depth = outputs.predicted_depth
+
+        print("Shape of predicted depth:", predicted_depth.shape)
+        print("First values of predicted depth:", predicted_depth[0, :3, :3])
+
+        # assert logits
+        if model_name == "dpt-swinv2-base-384":
+            # OK, checked
+            expected_shape = torch.Size([1, 384, 384])
+            expected_slice = torch.tensor(
+                [
+                    [1998.5575, 1997.3887, 2009.2981],
+                    [1952.8607, 1979.6488, 2001.0854],
+                    [1953.7697, 1961.7711, 1968.8904],
+                ],
+            )
+        elif model_name == "dpt-swinv2-tiny-256":
+            # OK, checked
+            expected_shape = torch.Size([1, 256, 256])
+            expected_slice = torch.tensor(
+                [[978.9163, 976.5215, 978.5349], [974.1859, 971.7249, 975.8046], [971.3419, 970.3118, 971.6830]],
+            )
+        elif model_name == "dpt-swinv2-large-384":
+            # OK, checked
+            expected_shape = torch.Size([1, 384, 384])
+            expected_slice = torch.tensor(
+                [
+                    [1203.7206, 1200.1495, 1197.8234],
+                    [1196.2484, 1183.5033, 1186.4640],
+                    [1178.8131, 1182.3260, 1174.3975],
+                ],
+            )
+
+        assert predicted_depth.shape == torch.Size(expected_shape)
+        assert torch.allclose(predicted_depth[0, :3, :3], expected_slice)
+        print("Looks ok!")
+
+    if pytorch_dump_folder_path is not None:
+        Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
+        print(f"Saving model and processor to {pytorch_dump_folder_path}")
+        model.save_pretrained(pytorch_dump_folder_path)
+        processor.save_pretrained(pytorch_dump_folder_path)
+
+    if push_to_hub:
+        print("Pushing model and processor to hub...")
+        model.push_to_hub(repo_id=f"Intel/{model_name}")
+        processor.push_to_hub(repo_id=f"Intel/{model_name}")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    # Required parameters
+    parser.add_argument(
+        "--model_name",
+        default="dpt-swinv2-base-384",
+        type=str,
+        choices=["dpt-swinv2-tiny-256", "dpt-swinv2-base-384", "dpt-swinv2-large-384"],
+        help="Name of the model you'd like to convert.",
+    )
+    parser.add_argument(
+        "--pytorch_dump_folder_path",
+        default=None,
+        type=str,
+        help="Path to the output PyTorch model directory.",
+    )
+    parser.add_argument(
+        "--verify_logits",
+        action="store_true",
+        help="Whether to verify logits after conversion.",
+    )
+    parser.add_argument(
+        "--push_to_hub",
+        action="store_true",
+        help="Whether to push the model to the hub after conversion.",
+    )
+
+    args = parser.parse_args()
+    convert_dpt_checkpoint(args.model_name, args.pytorch_dump_folder_path, args.verify_logits, args.push_to_hub)
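For context, a minimal usage sketch (not part of the patch) of a checkpoint converted and pushed by this script; it assumes the conversion above has been run with `--push_to_hub`, so the repo id mirrors the `Intel/{model_name}` target in `convert_dpt_checkpoint`, and the printed shape matches the `expected_shape` checked for the tiny-256 variant.

```python
# Hedged sketch: assumes the converted checkpoint is available at Intel/dpt-swinv2-tiny-256;
# any other converted repo id would work the same way.
import requests
import torch
from PIL import Image

from transformers import DPTForDepthEstimation, DPTImageProcessor

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = DPTImageProcessor.from_pretrained("Intel/dpt-swinv2-tiny-256")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-swinv2-tiny-256")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    predicted_depth = model(**inputs).predicted_depth

print(predicted_depth.shape)  # torch.Size([1, 256, 256]) for the tiny-256 variant
```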
diff --git a/src/transformers/models/dpt/convert_dpt_to_pytorch.py b/src/transformers/models/dpt/convert_dpt_to_pytorch.py
index 7ef5f7cf119a1b..42637cb2158724 100644
--- a/src/transformers/models/dpt/convert_dpt_to_pytorch.py
+++ b/src/transformers/models/dpt/convert_dpt_to_pytorch.py
@@ -24,7 +24,7 @@
 from huggingface_hub import cached_download, hf_hub_url
 from PIL import Image
 
-from transformers import DPTConfig, DPTFeatureExtractor, DPTForDepthEstimation, DPTForSemanticSegmentation
+from transformers import DPTConfig, DPTForDepthEstimation, DPTForSemanticSegmentation, DPTImageProcessor
 from transformers.utils import logging
 
 
@@ -211,10 +211,10 @@ def convert_dpt_checkpoint(checkpoint_url, pytorch_dump_folder_path, push_to_hub
 
     # Check outputs on an image
     size = 480 if "ade" in checkpoint_url else 384
-    feature_extractor = DPTFeatureExtractor(size=size)
+    image_processor = DPTImageProcessor(size=size)
 
     image = prepare_img()
-    encoding = feature_extractor(image, return_tensors="pt")
+    encoding = image_processor(image, return_tensors="pt")
 
     # forward pass
     outputs = model(**encoding).logits if "ade" in checkpoint_url else model(**encoding).predicted_depth
@@ -229,12 +229,14 @@ def convert_dpt_checkpoint(checkpoint_url, pytorch_dump_folder_path, push_to_hub
         if "ade" in checkpoint_url
         else torch.allclose(outputs[0, :3, :3], expected_slice)
     )
+    print("Looks ok!")
 
-    Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
-    print(f"Saving model to {pytorch_dump_folder_path}")
-    model.save_pretrained(pytorch_dump_folder_path)
-    print(f"Saving feature extractor to {pytorch_dump_folder_path}")
-    feature_extractor.save_pretrained(pytorch_dump_folder_path)
+    if pytorch_dump_folder_path is not None:
+        Path(pytorch_dump_folder_path).mkdir(exist_ok=True)
+        print(f"Saving model to {pytorch_dump_folder_path}")
+        model.save_pretrained(pytorch_dump_folder_path)
+        print(f"Saving image processor to {pytorch_dump_folder_path}")
+        image_processor.save_pretrained(pytorch_dump_folder_path)
 
     if push_to_hub:
         print("Pushing model to hub...")
@@ -244,10 +246,10 @@ def convert_dpt_checkpoint(checkpoint_url, pytorch_dump_folder_path, push_to_hub
             commit_message="Add model",
             use_temp_dir=True,
         )
-        feature_extractor.push_to_hub(
+        image_processor.push_to_hub(
             repo_path_or_name=Path(pytorch_dump_folder_path, model_name),
             organization="nielsr",
-            commit_message="Add feature extractor",
+            commit_message="Add image processor",
             use_temp_dir=True,
         )
 
@@ -265,7 +267,7 @@ def convert_dpt_checkpoint(checkpoint_url, pytorch_dump_folder_path, push_to_hub
         "--pytorch_dump_folder_path",
         default=None,
         type=str,
-        required=True,
+        required=False,
         help="Path to the output PyTorch model directory.",
     )
     parser.add_argument(
@@ -276,6 +278,7 @@ def convert_dpt_checkpoint(checkpoint_url, pytorch_dump_folder_path, push_to_hub
         "--model_name",
         default="dpt-large",
         type=str,
+        required=False,
         help="Name of the model, in case you're pushing to the hub.",
     )
 
diff --git a/src/transformers/models/dpt/image_processing_dpt.py b/src/transformers/models/dpt/image_processing_dpt.py
index d6bcfe9c5e3d13..29aac9d005b406 100644
--- a/src/transformers/models/dpt/image_processing_dpt.py
+++ b/src/transformers/models/dpt/image_processing_dpt.py
@@ -19,11 +19,8 @@
 
 import numpy as np
 
-from transformers.utils import is_vision_available
-from transformers.utils.generic import TensorType
-
 from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
-from ...image_transforms import normalize, rescale, resize, to_channel_dimension_format
+from ...image_transforms import pad, resize, to_channel_dimension_format
 from ...image_utils import (
     IMAGENET_STANDARD_MEAN,
     IMAGENET_STANDARD_STD,
@@ -31,13 +28,16 @@
     ImageInput,
     PILImageResampling,
     get_image_size,
+    infer_channel_dimension_format,
+    is_scaled_image,
     is_torch_available,
     is_torch_tensor,
     make_list_of_images,
     to_numpy_array,
     valid_images,
+    validate_preprocess_arguments,
 )
-from ...utils import logging
+from ...utils import TensorType, is_vision_available, logging
 
 
 if is_torch_available():
@@ -51,7 +51,11 @@
 
 
 def get_resize_output_image_size(
-    input_image: np.ndarray, output_size: Union[int, Iterable[int]], keep_aspect_ratio: bool, multiple: int
+    input_image: np.ndarray,
+    output_size: Union[int, Iterable[int]],
+    keep_aspect_ratio: bool,
+    multiple: int,
+    input_data_format: Optional[Union[str, ChannelDimension]] = None,
 ) -> Tuple[int, int]:
     def constraint_to_multiple_of(val, multiple, min_val=0, max_val=None):
         x = round(val / multiple) * multiple
@@ -66,7 +70,7 @@ def constraint_to_multiple_of(val, multiple, min_val=0, max_val=None):
 
     output_size = (output_size, output_size) if isinstance(output_size, int) else output_size
 
-    input_height, input_width = get_image_size(input_image)
+    input_height, input_width = get_image_size(input_image, input_data_format)
     output_height, output_width = output_size
 
     # determine new height and width
@@ -97,14 +101,14 @@ class DPTImageProcessor(BaseImageProcessor):
             Whether to resize the image's (height, width) dimensions. Can be overidden by `do_resize` in `preprocess`.
         size (`Dict[str, int]` *optional*, defaults to `{"height": 384, "width": 384}`):
             Size of the image after resizing. Can be overidden by `size` in `preprocess`.
+        resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`):
+            Defines the resampling filter to use if resizing the image. Can be overridden by `resample` in `preprocess`.
         keep_aspect_ratio (`bool`, *optional*, defaults to `False`):
             If `True`, the image is resized to the largest possible size such that the aspect ratio is preserved. Can
             be overidden by `keep_aspect_ratio` in `preprocess`.
-        ensure_multiple_of (`int`, *optional*, defaults to `1`):
+        ensure_multiple_of (`int`, *optional*, defaults to 1):
             If `do_resize` is `True`, the image is resized to a size that is a multiple of this value. Can be overidden
             by `ensure_multiple_of` in `preprocess`.
-        resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BILINEAR`):
-            Defines the resampling filter to use if resizing the image. Can be overidden by `resample` in `preprocess`.
         do_rescale (`bool`, *optional*, defaults to `True`):
             Whether to rescale the image by the specified scale `rescale_factor`. Can be overidden by `do_rescale` in
             `preprocess`.
@@ -119,6 +123,12 @@ class DPTImageProcessor(BaseImageProcessor):
         image_std (`float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_STD`):
             Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
             number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
+        do_pad (`bool`, *optional*, defaults to `False`):
+            Whether to apply center padding. This was introduced in the DINOv2 paper, which uses the model in
+            combination with DPT.
+        size_divisor (`int`, *optional*):
+            If `do_pad` is `True`, pads the image dimensions to be divisible by this value. This was introduced in the
+            DINOv2 paper, which uses the model in combination with DPT.
     """
 
     model_input_names = ["pixel_values"]
@@ -127,7 +137,7 @@ def __init__(
         self,
         do_resize: bool = True,
         size: Dict[str, int] = None,
-        resample: PILImageResampling = PILImageResampling.BILINEAR,
+        resample: PILImageResampling = PILImageResampling.BICUBIC,
         keep_aspect_ratio: bool = False,
         ensure_multiple_of: int = 1,
         do_rescale: bool = True,
@@ -135,6 +145,8 @@ def __init__(
         do_normalize: bool = True,
         image_mean: Optional[Union[float, List[float]]] = None,
         image_std: Optional[Union[float, List[float]]] = None,
+        do_pad: bool = False,
+        size_divisor: int = None,
         **kwargs,
     ) -> None:
         super().__init__(**kwargs)
@@ -150,6 +162,8 @@ def __init__(
         self.do_normalize = do_normalize
         self.image_mean = image_mean if image_mean is not None else IMAGENET_STANDARD_MEAN
         self.image_std = image_std if image_std is not None else IMAGENET_STANDARD_STD
+        self.do_pad = do_pad
+        self.size_divisor = size_divisor
 
     def resize(
         self,
@@ -159,6 +173,7 @@ def resize(
         ensure_multiple_of: int = 1,
         resample: PILImageResampling = PILImageResampling.BICUBIC,
         data_format: Optional[Union[str, ChannelDimension]] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
         **kwargs,
     ) -> np.ndarray:
         """
@@ -173,7 +188,7 @@ def resize(
                 Target size of the output image.
             keep_aspect_ratio (`bool`, *optional*, defaults to `False`):
                 If `True`, the image is resized to the largest possible size such that the aspect ratio is preserved.
-            ensure_multiple_of (`int`, *optional*, defaults to `1`):
+            ensure_multiple_of (`int`, *optional*, defaults to 1):
                 The image is resized to a size that is a multiple of this value.
             resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
                 Defines the resampling filter to use if resizing the image. Otherwise, the image is resized to size
@@ -182,60 +197,73 @@ def resize(
                 Resampling filter to use when resiizing the image.
             data_format (`str` or `ChannelDimension`, *optional*):
                 The channel dimension format of the image. If not provided, it will be the same as the input image.
+            input_data_format (`str` or `ChannelDimension`, *optional*):
+                The channel dimension format of the input image. If not provided, it will be inferred.
         """
         size = get_size_dict(size)
         if "height" not in size or "width" not in size:
             raise ValueError(f"The size dictionary must contain the keys 'height' and 'width'. Got {size.keys()}")
+
         output_size = get_resize_output_image_size(
             image,
             output_size=(size["height"], size["width"]),
             keep_aspect_ratio=keep_aspect_ratio,
             multiple=ensure_multiple_of,
+            input_data_format=input_data_format,
+        )
+        return resize(
+            image,
+            size=output_size,
+            resample=resample,
+            data_format=data_format,
+            input_data_format=input_data_format,
+            **kwargs,
         )
-        return resize(image, size=output_size, resample=resample, data_format=data_format, **kwargs)
 
-    def rescale(
+    def pad_image(
         self,
-        image: np.ndarray,
-        scale: Union[int, float],
+        image: np.array,
+        size_divisor: int,
         data_format: Optional[Union[str, ChannelDimension]] = None,
-        **kwargs,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
     ):
         """
-        Rescale an image by a scale factor. image = image * scale.
+        Center pad an image so that its height and width are a multiple of `size_divisor`.
 
         Args:
             image (`np.ndarray`):
-                Image to rescale.
-            scale (`int` or `float`):
-                Scale to apply to the image.
-            data_format (`str` or `ChannelDimension`, *optional*):
-                The channel dimension format of the image. If not provided, it will be the same as the input image.
+                Image to pad.
+            size_divisor (`int`):
+                The width and height of the image will be padded to a multiple of this number.
+            data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
+                The channel dimension format for the output image. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - Unset: Use the channel dimension format of the input image.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format for the input image. If unset, the channel dimension format is inferred
+                from the input image. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
         """
-        return rescale(image, scale=scale, data_format=data_format, **kwargs)
 
-    def normalize(
-        self,
-        image: np.ndarray,
-        mean: Union[float, List[float]],
-        std: Union[float, List[float]],
-        data_format: Optional[Union[str, ChannelDimension]] = None,
-        **kwargs,
-    ) -> np.ndarray:
-        """
-        Normalize an image. image = (image - image_mean) / image_std.
+        def _get_pad(size, size_divisor):
+            new_size = math.ceil(size / size_divisor) * size_divisor
+            pad_size = new_size - size
+            pad_size_left = pad_size // 2
+            pad_size_right = pad_size - pad_size_left
+            return pad_size_left, pad_size_right
 
-        Args:
-            image (`np.ndarray`):
-                Image to normalize.
-            image_mean (`float` or `List[float]`):
-                Image mean.
-            image_std (`float` or `List[float]`):
-                Image standard deviation.
-            data_format (`str` or `ChannelDimension`, *optional*):
-                The channel dimension format of the image. If not provided, it will be the same as the input image.
-        """
-        return normalize(image, mean=mean, std=std, data_format=data_format, **kwargs)
+        if input_data_format is None:
+            input_data_format = infer_channel_dimension_format(image)
+
+        height, width = get_image_size(image, input_data_format)
+
+        pad_size_top, pad_size_bottom = _get_pad(height, size_divisor)
+        pad_size_left, pad_size_right = _get_pad(width, size_divisor)
+
+        return pad(image, ((pad_size_top, pad_size_bottom), (pad_size_left, pad_size_right)), data_format=data_format)
 
     def preprocess(
         self,
@@ -250,8 +278,11 @@ def preprocess(
         do_normalize: bool = None,
         image_mean: Optional[Union[float, List[float]]] = None,
         image_std: Optional[Union[float, List[float]]] = None,
+        do_pad: bool = None,
+        size_divisor: int = None,
         return_tensors: Optional[Union[str, TensorType]] = None,
         data_format: ChannelDimension = ChannelDimension.FIRST,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
         **kwargs,
     ) -> PIL.Image.Image:
         """
@@ -259,7 +290,8 @@ def preprocess(
 
         Args:
             images (`ImageInput`):
-                Image to preprocess.
+                Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
+                passing in images with pixel values between 0 and 1, set `do_rescale=False`.
             do_resize (`bool`, *optional*, defaults to `self.do_resize`):
                 Whether to resize the image.
             size (`Dict[str, int]`, *optional*, defaults to `self.size`):
@@ -295,6 +327,12 @@ def preprocess(
                 The channel dimension format for the output image. Can be one of:
                     - `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                     - `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format for the input image. If unset, the channel dimension format is inferred
+                from the input image. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
         """
         do_resize = do_resize if do_resize is not None else self.do_resize
         size = size if size is not None else self.size
@@ -307,6 +345,8 @@ def preprocess(
         do_normalize = do_normalize if do_normalize is not None else self.do_normalize
         image_mean = image_mean if image_mean is not None else self.image_mean
         image_std = image_std if image_std is not None else self.image_std
+        do_pad = do_pad if do_pad is not None else self.do_pad
+        size_divisor = size_divisor if size_divisor is not None else self.size_divisor
 
         images = make_list_of_images(images)
 
@@ -315,33 +355,70 @@ def preprocess(
                 "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
                 "torch.Tensor, tf.Tensor or jax.ndarray."
             )
-
-        if do_resize and size is None or resample is None:
-            raise ValueError("Size and resample must be specified if do_resize is True.")
-
-        if do_rescale and rescale_factor is None:
-            raise ValueError("Rescale factor must be specified if do_rescale is True.")
-
-        if do_normalize and (image_mean is None or image_std is None):
-            raise ValueError("Image mean and std must be specified if do_normalize is True.")
-
+        validate_preprocess_arguments(
+            do_rescale=do_rescale,
+            rescale_factor=rescale_factor,
+            do_normalize=do_normalize,
+            image_mean=image_mean,
+            image_std=image_std,
+            do_pad=do_pad,
+            size_divisibility=size_divisor,
+            do_resize=do_resize,
+            size=size,
+            resample=resample,
+        )
         # All transformations expect numpy arrays.
         images = [to_numpy_array(image) for image in images]
 
+        if is_scaled_image(images[0]) and do_rescale:
+            logger.warning_once(
+                "It looks like you are trying to rescale already rescaled images. If the input"
+                " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
+            )
+
+        if input_data_format is None:
+            # We assume that all images have the same channel dimension format.
+            input_data_format = infer_channel_dimension_format(images[0])
+
         if do_resize:
-            images = [self.resize(image=image, size=size, resample=resample) for image in images]
+            images = [
+                self.resize(
+                    image=image,
+                    size=size,
+                    resample=resample,
+                    keep_aspect_ratio=keep_aspect_ratio,
+                    ensure_multiple_of=ensure_multiple_of,
+                    input_data_format=input_data_format,
+                )
+                for image in images
+            ]
 
         if do_rescale:
-            images = [self.rescale(image=image, scale=rescale_factor) for image in images]
+            images = [
+                self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
+                for image in images
+            ]
 
         if do_normalize:
-            images = [self.normalize(image=image, mean=image_mean, std=image_std) for image in images]
+            images = [
+                self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
+                for image in images
+            ]
+
+        if do_pad:
+            images = [
+                self.pad_image(image=image, size_divisor=size_divisor, input_data_format=input_data_format)
+                for image in images
+            ]
 
-        images = [to_channel_dimension_format(image, data_format) for image in images]
+        images = [
+            to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
+        ]
 
         data = {"pixel_values": images}
         return BatchFeature(data=data, tensor_type=return_tensors)
 
+    # Copied from transformers.models.beit.image_processing_beit.BeitImageProcessor.post_process_semantic_segmentation with Beit->DPT
     def post_process_semantic_segmentation(self, outputs, target_sizes: List[Tuple] = None):
         """
         Converts the output of [`DPTForSemanticSegmentation`] into semantic segmentation maps. Only supports PyTorch.
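To illustrate the new `do_pad`/`size_divisor` path added above, a small sketch (not part of the patch); `size_divisor=14` is only an example value, chosen because DINOv2 uses a patch size of 14.

```python
# Hedged sketch: size_divisor=14 is an illustrative choice, not a value mandated by this patch.
import numpy as np

from transformers import DPTImageProcessor

processor = DPTImageProcessor(do_resize=False, do_pad=True, size_divisor=14)

# a dummy 480x640 RGB image in channels-last format
image = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)

pixel_values = processor(image, return_tensors="pt").pixel_values
# 480 -> 490 and 640 -> 644, i.e. both dimensions are center-padded up to a multiple of 14
print(pixel_values.shape)  # torch.Size([1, 3, 490, 644])
```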
diff --git a/src/transformers/models/dpt/modeling_dpt.py b/src/transformers/models/dpt/modeling_dpt.py
index 187a6c36656a8e..e986e71d4851da 100755
--- a/src/transformers/models/dpt/modeling_dpt.py
+++ b/src/transformers/models/dpt/modeling_dpt.py
@@ -41,7 +41,7 @@
 from ...modeling_utils import PreTrainedModel
 from ...pytorch_utils import find_pruneable_heads_and_indices, prune_linear_layer
 from ...utils import ModelOutput, logging
-from ..auto import AutoBackbone
+from ...utils.backbone_utils import load_backbone
 from .configuration_dpt import DPTConfig
 
 
@@ -76,7 +76,7 @@ class BaseModelOutputWithIntermediateActivations(ModelOutput):
     """
 
     last_hidden_states: torch.FloatTensor = None
-    intermediate_activations: Optional[Tuple[torch.FloatTensor]] = None
+    intermediate_activations: Optional[Tuple[torch.FloatTensor, ...]] = None
 
 
 @dataclass
@@ -110,9 +110,9 @@ class BaseModelOutputWithPoolingAndIntermediateActivations(ModelOutput):
 
     last_hidden_state: torch.FloatTensor = None
     pooler_output: torch.FloatTensor = None
-    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
-    attentions: Optional[Tuple[torch.FloatTensor]] = None
-    intermediate_activations: Optional[Tuple[torch.FloatTensor]] = None
+    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
+    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
+    intermediate_activations: Optional[Tuple[torch.FloatTensor, ...]] = None
 
 
 class DPTViTHybridEmbeddings(nn.Module):
@@ -131,12 +131,10 @@ def __init__(self, config, feature_size=None):
         patch_size = patch_size if isinstance(patch_size, collections.abc.Iterable) else (patch_size, patch_size)
         num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0])
 
-        self.backbone = AutoBackbone.from_config(config.backbone_config)
+        self.backbone = load_backbone(config)
         feature_dim = self.backbone.channels[-1]
-        if len(config.backbone_config.out_features) != 3:
-            raise ValueError(
-                f"Expected backbone to have 3 output features, got {len(config.backbone_config.out_features)}"
-            )
+        if len(self.backbone.channels) != 3:
+            raise ValueError(f"Expected backbone to have 3 output features, got {len(self.backbone.channels)}")
         self.residual_feature_map_index = [0, 1]  # Always take the output of the first and second backbone stage
 
         if feature_size is None:
@@ -528,17 +526,11 @@ def forward(
             layer_head_mask = head_mask[i] if head_mask is not None else None
 
             if self.gradient_checkpointing and self.training:
-
-                def create_custom_forward(module):
-                    def custom_forward(*inputs):
-                        return module(*inputs, output_attentions)
-
-                    return custom_forward
-
-                layer_outputs = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(layer_module),
+                layer_outputs = self._gradient_checkpointing_func(
+                    layer_module.__call__,
                     hidden_states,
                     layer_head_mask,
+                    output_attentions,
                 )
             else:
                 layer_outputs = layer_module(hidden_states, layer_head_mask, output_attentions)
@@ -605,12 +597,13 @@ def _init_reassemble_dpt_hybrid(self, config):
 
         # When using DPT-Hybrid the readout type is set to "project". The sanity check is done on the config file
         self.readout_projects = nn.ModuleList()
+        hidden_size = _get_backbone_hidden_size(config)
         for i in range(len(config.neck_hidden_sizes)):
             if i <= 1:
                 self.readout_projects.append(nn.Sequential(nn.Identity()))
             elif i > 1:
                 self.readout_projects.append(
-                    nn.Sequential(nn.Linear(2 * config.hidden_size, config.hidden_size), ACT2FN[config.hidden_act])
+                    nn.Sequential(nn.Linear(2 * hidden_size, hidden_size), ACT2FN[config.hidden_act])
                 )
 
     def _init_reassemble_dpt(self, config):
@@ -619,12 +612,13 @@ def _init_reassemble_dpt(self, config):
 
         if config.readout_type == "project":
             self.readout_projects = nn.ModuleList()
+            hidden_size = _get_backbone_hidden_size(config)
             for _ in range(len(config.neck_hidden_sizes)):
                 self.readout_projects.append(
-                    nn.Sequential(nn.Linear(2 * config.hidden_size, config.hidden_size), ACT2FN[config.hidden_act])
+                    nn.Sequential(nn.Linear(2 * hidden_size, hidden_size), ACT2FN[config.hidden_act])
                 )
 
-    def forward(self, hidden_states: List[torch.Tensor]) -> List[torch.Tensor]:
+    def forward(self, hidden_states: List[torch.Tensor], patch_height=None, patch_width=None) -> List[torch.Tensor]:
         """
         Args:
             hidden_states (`List[torch.FloatTensor]`, each of shape `(batch_size, sequence_length + 1, hidden_size)`):
@@ -634,21 +628,24 @@ def forward(self, hidden_states: List[torch.Tensor]) -> List[torch.Tensor]:
 
         for i, hidden_state in enumerate(hidden_states):
             if i not in self.neck_ignore_stages:
-                # reshape to (B, C, H, W)
-                hidden_state, cls_token = hidden_state[:, 1:], hidden_state[:, 0]
+                # reshape to (batch_size, num_channels, height, width)
+                cls_token, hidden_state = hidden_state[:, 0], hidden_state[:, 1:]
                 batch_size, sequence_length, num_channels = hidden_state.shape
-                size = int(math.sqrt(sequence_length))
-                hidden_state = hidden_state.reshape(batch_size, size, size, num_channels)
+                if patch_height is not None and patch_width is not None:
+                    hidden_state = hidden_state.reshape(batch_size, patch_height, patch_width, num_channels)
+                else:
+                    size = int(math.sqrt(sequence_length))
+                    hidden_state = hidden_state.reshape(batch_size, size, size, num_channels)
                 hidden_state = hidden_state.permute(0, 3, 1, 2).contiguous()
 
                 feature_shape = hidden_state.shape
                 if self.config.readout_type == "project":
-                    # reshape to (B, H*W, C)
+                    # reshape to (batch_size, height*width, num_channels)
                     hidden_state = hidden_state.flatten(2).permute((0, 2, 1))
                     readout = cls_token.unsqueeze(1).expand_as(hidden_state)
                     # concatenate the readout token to the hidden states and project
                     hidden_state = self.readout_projects[i](torch.cat((hidden_state, readout), -1))
-                    # reshape back to (B, C, H, W)
+                    # reshape back to (batch_size, num_channels, height, width)
                     hidden_state = hidden_state.permute(0, 2, 1).reshape(feature_shape)
                 elif self.config.readout_type == "add":
                     hidden_state = hidden_state.flatten(2) + cls_token.unsqueeze(-1)
@@ -659,11 +656,19 @@ def forward(self, hidden_states: List[torch.Tensor]) -> List[torch.Tensor]:
         return out
 
 
+def _get_backbone_hidden_size(config):
+    if config.backbone_config is not None and config.is_hybrid is False:
+        return config.backbone_config.hidden_size
+    else:
+        return config.hidden_size
+
+
 class DPTReassembleLayer(nn.Module):
     def __init__(self, config, channels, factor):
         super().__init__()
         # projection
-        self.projection = nn.Conv2d(in_channels=config.hidden_size, out_channels=channels, kernel_size=1)
+        hidden_size = _get_backbone_hidden_size(config)
+        self.projection = nn.Conv2d(in_channels=hidden_size, out_channels=channels, kernel_size=1)
 
         # up/down sampling depending on factor
         if factor > 1:
@@ -716,24 +721,30 @@ def __init__(self, config):
         super().__init__()
 
         self.use_batch_norm = config.use_batch_norm_in_fusion_residual
-        self.activation1 = ACT2FN["relu"]
+        use_bias_in_fusion_residual = (
+            config.use_bias_in_fusion_residual
+            if config.use_bias_in_fusion_residual is not None
+            else not self.use_batch_norm
+        )
+
+        self.activation1 = nn.ReLU()
         self.convolution1 = nn.Conv2d(
             config.fusion_hidden_size,
             config.fusion_hidden_size,
             kernel_size=3,
             stride=1,
             padding=1,
-            bias=not self.use_batch_norm,
+            bias=use_bias_in_fusion_residual,
         )
 
-        self.activation2 = ACT2FN["relu"]
+        self.activation2 = nn.ReLU()
         self.convolution2 = nn.Conv2d(
             config.fusion_hidden_size,
             config.fusion_hidden_size,
             kernel_size=3,
             stride=1,
             padding=1,
-            bias=not self.use_batch_norm,
+            bias=use_bias_in_fusion_residual,
         )
 
         if self.use_batch_norm:
@@ -818,10 +829,6 @@ def _init_weights(self, module):
             module.bias.data.zero_()
             module.weight.data.fill_(1.0)
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, DPTViTEncoder):
-            module.gradient_checkpointing = value
-
 
 DPT_START_DOCSTRING = r"""
     This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it
@@ -983,8 +990,12 @@ def __init__(self, config):
         super().__init__()
         self.config = config
 
-        # postprocessing
-        self.reassemble_stage = DPTReassembleStage(config)
+        # postprocessing: only required in case of a non-hierarchical backbone (e.g. ViT, BEiT)
+        if config.backbone_config is not None and config.backbone_config.model_type in ["swinv2"]:
+            self.reassemble_stage = None
+        else:
+            self.reassemble_stage = DPTReassembleStage(config)
+
         self.convs = nn.ModuleList()
         for channel in config.neck_hidden_sizes:
             self.convs.append(nn.Conv2d(channel, config.fusion_hidden_size, kernel_size=3, padding=1, bias=False))
@@ -992,17 +1003,23 @@ def __init__(self, config):
         # fusion
         self.fusion_stage = DPTFeatureFusionStage(config)
 
-    def forward(self, hidden_states: List[torch.Tensor]) -> List[torch.Tensor]:
-        if not isinstance(hidden_states, list):
-            raise ValueError("hidden_states should be a list of tensors")
+    def forward(self, hidden_states: List[torch.Tensor], patch_height=None, patch_width=None) -> List[torch.Tensor]:
+        """
+        Args:
+            hidden_states (`List[torch.FloatTensor]`, each of shape `(batch_size, sequence_length, hidden_size)` or `(batch_size, hidden_size, height, width)`):
+                List of hidden states from the backbone.
+        """
+        if not isinstance(hidden_states, (tuple, list)):
+            raise ValueError("hidden_states should be a tuple or list of tensors")
 
         if len(hidden_states) != len(self.config.neck_hidden_sizes):
             raise ValueError("The number of hidden states should be equal to the number of neck hidden sizes.")
 
         # postprocess hidden states
-        features = self.reassemble_stage(hidden_states)
+        if self.reassemble_stage is not None:
+            hidden_states = self.reassemble_stage(hidden_states, patch_height, patch_width)
 
-        features = [self.convs[i](feature) for i, feature in enumerate(features)]
+        features = [self.convs[i](feature) for i, feature in enumerate(hidden_states)]
 
         # fusion blocks
         output = self.fusion_stage(features)
@@ -1022,20 +1039,28 @@ def __init__(self, config):
 
         self.config = config
 
+        self.projection = None
+        if config.add_projection:
+            self.projection = nn.Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
+
         features = config.fusion_hidden_size
         self.head = nn.Sequential(
             nn.Conv2d(features, features // 2, kernel_size=3, stride=1, padding=1),
             nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True),
             nn.Conv2d(features // 2, 32, kernel_size=3, stride=1, padding=1),
-            ACT2FN["relu"],
+            nn.ReLU(),
             nn.Conv2d(32, 1, kernel_size=1, stride=1, padding=0),
-            ACT2FN["relu"],
+            nn.ReLU(),
         )
 
     def forward(self, hidden_states: List[torch.Tensor]) -> torch.Tensor:
         # use last features
         hidden_states = hidden_states[self.config.head_in_index]
 
+        if self.projection is not None:
+            hidden_states = self.projection(hidden_states)
+            hidden_states = nn.ReLU()(hidden_states)
+
         predicted_depth = self.head(hidden_states)
 
         predicted_depth = predicted_depth.squeeze(dim=1)
@@ -1053,7 +1078,11 @@ class DPTForDepthEstimation(DPTPreTrainedModel):
     def __init__(self, config):
         super().__init__(config)
 
-        self.dpt = DPTModel(config, add_pooling_layer=False)
+        self.backbone = None
+        if config.backbone_config is not None and config.is_hybrid is False:
+            self.backbone = load_backbone(config)
+        else:
+            self.dpt = DPTModel(config, add_pooling_layer=False)
 
         # Neck
         self.neck = DPTNeck(config)
@@ -1119,32 +1148,46 @@ def forward(
         output_hidden_states = (
             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
         )
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
 
-        outputs = self.dpt(
-            pixel_values,
-            head_mask=head_mask,
-            output_attentions=output_attentions,
-            output_hidden_states=True,  # we need the intermediate hidden states
-            return_dict=return_dict,
-        )
-
-        hidden_states = outputs.hidden_states if return_dict else outputs[1]
-
-        # only keep certain features based on config.backbone_out_indices
-        # note that the hidden_states also include the initial embeddings
-        if not self.config.is_hybrid:
-            hidden_states = [
-                feature for idx, feature in enumerate(hidden_states[1:]) if idx in self.config.backbone_out_indices
-            ]
+        if self.backbone is not None:
+            outputs = self.backbone.forward_with_filtered_kwargs(
+                pixel_values, output_hidden_states=output_hidden_states, output_attentions=output_attentions
+            )
+            hidden_states = outputs.feature_maps
         else:
-            backbone_hidden_states = outputs.intermediate_activations if return_dict else list(outputs[-1])
-            backbone_hidden_states.extend(
-                feature for idx, feature in enumerate(hidden_states[1:]) if idx in self.config.backbone_out_indices[2:]
+            outputs = self.dpt(
+                pixel_values,
+                head_mask=head_mask,
+                output_attentions=output_attentions,
+                output_hidden_states=True,  # we need the intermediate hidden states
+                return_dict=return_dict,
             )
+            hidden_states = outputs.hidden_states if return_dict else outputs[1]
+            # only keep certain features based on config.backbone_out_indices
+            # note that the hidden_states also include the initial embeddings
+            if not self.config.is_hybrid:
+                hidden_states = [
+                    feature for idx, feature in enumerate(hidden_states[1:]) if idx in self.config.backbone_out_indices
+                ]
+            else:
+                backbone_hidden_states = outputs.intermediate_activations if return_dict else list(outputs[-1])
+                backbone_hidden_states.extend(
+                    feature
+                    for idx, feature in enumerate(hidden_states[1:])
+                    if idx in self.config.backbone_out_indices[2:]
+                )
 
-            hidden_states = backbone_hidden_states
+                hidden_states = backbone_hidden_states
+
+        patch_height, patch_width = None, None
+        if self.config.backbone_config is not None and self.config.is_hybrid is False:
+            _, _, height, width = pixel_values.shape
+            patch_size = self.config.backbone_config.patch_size
+            patch_height = height // patch_size
+            patch_width = width // patch_size
 
-        hidden_states = self.neck(hidden_states)
+        hidden_states = self.neck(hidden_states, patch_height, patch_width)
 
         predicted_depth = self.head(hidden_states)
 
@@ -1177,7 +1220,7 @@ def __init__(self, config):
         self.head = nn.Sequential(
             nn.Conv2d(features, features, kernel_size=3, padding=1, bias=False),
             nn.BatchNorm2d(features),
-            ACT2FN["relu"],
+            nn.ReLU(),
             nn.Dropout(config.semantic_classifier_dropout),
             nn.Conv2d(features, config.num_labels, kernel_size=1),
             nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True),
@@ -1200,7 +1243,7 @@ def __init__(self, config):
         self.head = nn.Sequential(
             nn.Conv2d(features, features, kernel_size=3, padding=1, bias=False),
             nn.BatchNorm2d(features),
-            ACT2FN["relu"],
+            nn.ReLU(),
             nn.Dropout(0.1, False),
             nn.Conv2d(features, config.num_labels, kernel_size=1),
         )
@@ -1297,7 +1340,7 @@ def forward(
 
             hidden_states = backbone_hidden_states
 
-        hidden_states = self.neck(hidden_states)
+        hidden_states = self.neck(hidden_states=hidden_states)
 
         logits = self.head(hidden_states)
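The net effect of the `load_backbone` branch above is that `DPTForDepthEstimation` can now sit directly on a hierarchical backbone such as Swinv2, in which case the reassemble stage is skipped. A rough sketch (not part of the patch, randomly initialized weights) under the assumption that the default `Swinv2Config` stage widths line up with the chosen `neck_hidden_sizes`:

```python
# Hedged sketch: randomly initialized model, just to exercise the non-hybrid backbone path.
import torch

from transformers import DPTConfig, DPTForDepthEstimation, Swinv2Config

backbone_config = Swinv2Config(
    image_size=256,
    out_features=["stage1", "stage2", "stage3", "stage4"],
)
# stage widths for the default embed_dim=96 are (96, 192, 384, 768)
config = DPTConfig(backbone_config=backbone_config, neck_hidden_sizes=[96, 192, 384, 768])
model = DPTForDepthEstimation(config)

pixel_values = torch.randn(1, 3, 256, 256)
with torch.no_grad():
    predicted_depth = model(pixel_values).predicted_depth
print(predicted_depth.shape)  # expected: torch.Size([1, 256, 256])
```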
 
diff --git a/src/transformers/models/efficientformer/__init__.py b/src/transformers/models/efficientformer/__init__.py
index ea7bcdffd45931..25d60d1ee765ef 100644
--- a/src/transformers/models/efficientformer/__init__.py
+++ b/src/transformers/models/efficientformer/__init__.py
@@ -13,7 +13,13 @@
 # limitations under the License.
 from typing import TYPE_CHECKING
 
-from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available
+from ...utils import (
+    OptionalDependencyNotAvailable,
+    _LazyModule,
+    is_tf_available,
+    is_torch_available,
+    is_vision_available,
+)
 
 
 _import_structure = {
@@ -45,6 +51,20 @@
         "EfficientFormerPreTrainedModel",
     ]
 
+try:
+    if not is_tf_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["modeling_tf_efficientformer"] = [
+        "TF_EFFICIENTFORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
+        "TFEfficientFormerForImageClassification",
+        "TFEfficientFormerForImageClassificationWithTeacher",
+        "TFEfficientFormerModel",
+        "TFEfficientFormerPreTrainedModel",
+    ]
+
 if TYPE_CHECKING:
     from .configuration_efficientformer import EFFICIENTFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, EfficientFormerConfig
 
@@ -69,6 +89,19 @@
             EfficientFormerModel,
             EfficientFormerPreTrainedModel,
         )
+    try:
+        if not is_tf_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .modeling_tf_efficientformer import (
+            TF_EFFICIENTFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
+            TFEfficientFormerForImageClassification,
+            TFEfficientFormerForImageClassificationWithTeacher,
+            TFEfficientFormerModel,
+            TFEfficientFormerPreTrainedModel,
+        )
 
 else:
     import sys
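For reference, a short sketch (not part of the patch) of how the new TF objects registered above are expected to resolve; it assumes the classes are re-exported at the top level of `transformers`, as is the usual pattern for lazy `_import_structure` entries.

```python
# Hedged sketch: guards the import the same way the lazy init above does.
from transformers.utils import is_tf_available

if is_tf_available():
    from transformers import TFEfficientFormerForImageClassification, TFEfficientFormerModel
    print("TF EfficientFormer classes are importable.")
else:
    print("TensorFlow is not installed; the TF EfficientFormer classes resolve to dummy objects.")
```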
diff --git a/src/transformers/models/efficientformer/configuration_efficientformer.py b/src/transformers/models/efficientformer/configuration_efficientformer.py
index 5f30664ff325a0..fecb90a886e8eb 100644
--- a/src/transformers/models/efficientformer/configuration_efficientformer.py
+++ b/src/transformers/models/efficientformer/configuration_efficientformer.py
@@ -52,7 +52,7 @@ class EfficientFormerConfig(PretrainedConfig):
             The size of the key in meta3D block.
         attention_ratio (`int`, *optional*, defaults to 4):
             Ratio of the dimension of the query and value to the dimension of the key in MSHA block
-        resolution (`int`, *optional*, defaults to 5)
+        resolution (`int`, *optional*, defaults to 7)
             Size of each patch
         num_hidden_layers (`int`, *optional*, defaults to 5):
             Number of hidden layers in the Transformer encoder.
@@ -91,6 +91,8 @@ class EfficientFormerConfig(PretrainedConfig):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
         layer_norm_eps (`float`, *optional*, defaults to 1e-12):
             The epsilon used by the layer normalization layers.
+        image_size (`int`, *optional*, defaults to `224`):
+            The size (resolution) of each image.
 
     Example:
 
@@ -136,6 +138,8 @@ def __init__(
         hidden_act: str = "gelu",
         initializer_range: float = 0.02,
         layer_norm_eps: float = 1e-12,
+        image_size: int = 224,
+        batch_norm_eps: float = 1e-05,
         **kwargs,
     ) -> None:
         super().__init__(**kwargs)
@@ -165,3 +169,5 @@ def __init__(
         self.distillation = distillation
         self.use_layer_scale = use_layer_scale
         self.layer_scale_init_value = layer_scale_init_value
+        self.image_size = image_size
+        self.batch_norm_eps = batch_norm_eps
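A one-line sketch (not part of the patch) showing that the two new constructor arguments simply become config attributes with the defaults from the signature above:

```python
# Hedged sketch: defaults taken from the __init__ signature in this diff.
from transformers import EfficientFormerConfig

config = EfficientFormerConfig()
print(config.image_size, config.batch_norm_eps)  # expected: 224 1e-05
```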
diff --git a/src/transformers/models/efficientformer/convert_efficientformer_original_pytorch_checkpoint_to_pytorch.py b/src/transformers/models/efficientformer/convert_efficientformer_original_pytorch_checkpoint_to_pytorch.py
index 6f7f1b60669f05..7431cd6136a593 100644
--- a/src/transformers/models/efficientformer/convert_efficientformer_original_pytorch_checkpoint_to_pytorch.py
+++ b/src/transformers/models/efficientformer/convert_efficientformer_original_pytorch_checkpoint_to_pytorch.py
@@ -50,12 +50,12 @@ def rename_key(old_name, num_meta4D_last_stage):
         else:
             new_name = old_name.replace("4", "batchnorm_after")
 
-    if "network" in old_name and re.search("\d\.\d", old_name):
+    if "network" in old_name and re.search(r"\d\.\d", old_name):
         two_digit_num = r"\b\d{2}\b"
         if bool(re.search(two_digit_num, old_name)):
-            match = re.search("\d\.\d\d.", old_name).group()
+            match = re.search(r"\d\.\d\d.", old_name).group()
         else:
-            match = re.search("\d\.\d.", old_name).group()
+            match = re.search(r"\d\.\d.", old_name).group()
         if int(match[0]) < 6:
             trimmed_name = old_name.replace(match, "")
             trimmed_name = trimmed_name.replace("network", match[0] + ".meta4D_layers.blocks." + match[2:-1])
@@ -78,7 +78,7 @@ def rename_key(old_name, num_meta4D_last_stage):
 
             new_name = "last_stage." + trimmed_name
 
-    elif "network" in old_name and re.search(".\d.", old_name):
+    elif "network" in old_name and re.search(r".\d.", old_name):
         new_name = old_name.replace("network", "intermediate_stages")
 
     if "fc" in new_name:
@@ -208,7 +208,7 @@ def convert_efficientformer_checkpoint(
         )
         processor.push_to_hub(
             repo_id=f"Bearnardd/{pytorch_dump_path}",
-            commit_message="Add feature extractor",
+            commit_message="Add image processor",
             use_temp_dir=True,
         )
 
@@ -234,12 +234,12 @@ def convert_efficientformer_checkpoint(
         "--pytorch_dump_path", default=None, type=str, required=True, help="Path to the output PyTorch model."
     )
 
-    parser.add_argument("--push_to_hub", action="store_true", help="Push model and feature extractor to the hub")
+    parser.add_argument("--push_to_hub", action="store_true", help="Push model and image processor to the hub")
     parser.add_argument(
         "--no-push_to_hub",
         dest="push_to_hub",
         action="store_false",
-        help="Do not push model and feature extractor to the hub",
+        help="Do not push model and image processor to the hub",
     )
     parser.set_defaults(push_to_hub=True)
 
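The renamed hub push above follows the usual `push_to_hub` flow; a hedged sketch (not part of the patch, repo id illustrative, a Hugging Face token is assumed):

```
from transformers import EfficientFormerImageProcessor

processor = EfficientFormerImageProcessor()
# Illustrative target repo; requires prior `huggingface-cli login`.
processor.push_to_hub(
    repo_id="your-username/efficientformer-l1-300",
    commit_message="Add image processor",
    use_temp_dir=True,
)
```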
diff --git a/src/transformers/models/efficientformer/image_processing_efficientformer.py b/src/transformers/models/efficientformer/image_processing_efficientformer.py
index 5694fb166e3c76..7db37c20b7f9dc 100644
--- a/src/transformers/models/efficientformer/image_processing_efficientformer.py
+++ b/src/transformers/models/efficientformer/image_processing_efficientformer.py
@@ -18,14 +18,9 @@
 
 import numpy as np
 
-from transformers.utils.generic import TensorType
-
 from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
 from ...image_transforms import (
-    center_crop,
     get_resize_output_image_size,
-    normalize,
-    rescale,
     resize,
     to_channel_dimension_format,
 )
@@ -35,11 +30,14 @@
     ChannelDimension,
     ImageInput,
     PILImageResampling,
+    infer_channel_dimension_format,
     is_batched,
+    is_scaled_image,
     to_numpy_array,
     valid_images,
+    validate_preprocess_arguments,
 )
-from ...utils import logging
+from ...utils import TensorType, logging
 
 
 logger = logging.get_logger(__name__)
@@ -121,6 +119,7 @@ def resize(
         size: Dict[str, int],
         resample: PILImageResampling = PILImageResampling.BILINEAR,
         data_format: Optional[Union[str, ChannelDimension]] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
         **kwargs,
     ) -> np.ndarray:
         """
@@ -138,6 +137,8 @@ def resize(
                 image is used. Can be one of:
                 - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                 - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format of the input image. If not provided, it will be inferred.
 
         Returns:
             `np.ndarray`: The resized image.
@@ -145,88 +146,17 @@ def resize(
         size = get_size_dict(size)
 
         if "shortest_edge" in size:
-            size = get_resize_output_image_size(image, size=size["shortest_edge"], default_to_square=False)
+            size = get_resize_output_image_size(
+                image, size=size["shortest_edge"], default_to_square=False, input_data_format=input_data_format
+            )
             # size = get_resize_output_image_size(image, size["shortest_edge"], size["longest_edge"])
         elif "height" in size and "width" in size:
             size = (size["height"], size["width"])
         else:
             raise ValueError(f"Size must contain 'height' and 'width' keys or 'shortest_edge' key. Got {size.keys()}")
-        return resize(image, size=size, resample=resample, data_format=data_format, **kwargs)
-
-    def center_crop(
-        self,
-        image: np.ndarray,
-        size: Dict[str, int],
-        data_format: Optional[Union[str, ChannelDimension]] = None,
-        **kwargs,
-    ) -> np.ndarray:
-        """
-        Center crop an image. If the image is too small to be cropped to the size given, it will be padded (so the
-        returned result will always be of size `size`).
-
-        Args:
-            image (`np.ndarray`):
-                Image to center crop.
-            size (`Dict[str, int]`):
-                Size of the output image in the form of a dictionary with keys `height` and `width`.
-            data_format (`str` or `ChannelDimension`, *optional*):
-                The channel dimension format of the image. If not provided, it will be the same as the input image.
-        """
-        size = get_size_dict(size)
-        if "height" not in size or "width" not in size:
-            raise ValueError(f"The `size` parameter must contain the keys (height, width). Got {size.keys()}")
-        return center_crop(image, size=(size["height"], size["width"]), data_format=data_format, **kwargs)
-
-    def rescale(
-        self, image: np.ndarray, scale: float, data_format: Optional[Union[str, ChannelDimension]] = None, **kwargs
-    ) -> np.ndarray:
-        """
-        Rescale an image by a scale factor. image = image * scale.
-
-        Args:
-            image (`np.ndarray`):
-                Image to rescale.
-            scale (`float`):
-                The scaling factor to rescale pixel values by.
-            data_format (`str` or `ChannelDimension`, *optional*):
-                The channel dimension format for the output image. If unset, the channel dimension format of the input
-                image is used. Can be one of:
-                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
-                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
-
-        Returns:
-            `np.ndarray`: The rescaled image.
-        """
-        return rescale(image, scale=scale, data_format=data_format, **kwargs)
-
-    def normalize(
-        self,
-        image: np.ndarray,
-        mean: Union[float, List[float]],
-        std: Union[float, List[float]],
-        data_format: Optional[Union[str, ChannelDimension]] = None,
-        **kwargs,
-    ) -> np.ndarray:
-        """
-        Normalize an image. image = (image - image_mean) / image_std.
-
-        Args:
-            image (`np.ndarray`):
-                Image to normalize.
-            mean (`float` or `List[float]`):
-                Image mean to use for normalization.
-            std (`float` or `List[float]`):
-                Image standard deviation to use for normalization.
-            data_format (`str` or `ChannelDimension`, *optional*):
-                The channel dimension format for the output image. If unset, the channel dimension format of the input
-                image is used. Can be one of:
-                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
-                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
-
-        Returns:
-            `np.ndarray`: The normalized image.
-        """
-        return normalize(image, mean=mean, std=std, data_format=data_format, **kwargs)
+        return resize(
+            image, size=size, resample=resample, data_format=data_format, input_data_format=input_data_format, **kwargs
+        )
 
     def preprocess(
         self,
@@ -243,6 +173,7 @@ def preprocess(
         image_std: Optional[Union[float, List[float]]] = None,
         return_tensors: Optional[Union[str, TensorType]] = None,
         data_format: Union[str, ChannelDimension] = ChannelDimension.FIRST,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
         **kwargs,
     ) -> BatchFeature:
         """
@@ -250,7 +181,8 @@ def preprocess(
 
         Args:
             images (`ImageInput`):
-                Image to preprocess.
+                Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
+                passing in images with pixel values between 0 and 1, set `do_rescale=False`.
             do_resize (`bool`, *optional*, defaults to `self.do_resize`):
                 Whether to resize the image.
             size (`Dict[str, int]`, *optional*, defaults to `self.size`):
@@ -285,6 +217,12 @@ def preprocess(
                 - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                 - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                 - Unset: Use the channel dimension format of the input image.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format for the input image. If unset, the channel dimension format is inferred
+                from the input image. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
         """
         do_resize = do_resize if do_resize is not None else self.do_resize
         do_rescale = do_rescale if do_rescale is not None else self.do_rescale
@@ -308,32 +246,57 @@ def preprocess(
                 "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
                 "torch.Tensor, tf.Tensor or jax.ndarray."
             )
-
-        if do_resize and size is None:
-            raise ValueError("Size must be specified if do_resize is True.")
-
-        if do_center_crop and crop_size is None:
-            raise ValueError("Crop size must be specified if do_center_crop is True.")
-
-        if do_rescale and rescale_factor is None:
-            raise ValueError("Rescale factor must be specified if do_rescale is True.")
-
+        validate_preprocess_arguments(
+            do_rescale=do_rescale,
+            rescale_factor=rescale_factor,
+            do_normalize=do_normalize,
+            image_mean=image_mean,
+            image_std=image_std,
+            do_center_crop=do_center_crop,
+            crop_size=crop_size,
+            do_resize=do_resize,
+            size=size,
+            resample=resample,
+        )
         # All transformations expect numpy arrays.
         images = [to_numpy_array(image) for image in images]
 
+        if is_scaled_image(images[0]) and do_rescale:
+            logger.warning_once(
+                "It looks like you are trying to rescale already rescaled images. If the input"
+                " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
+            )
+
+        if input_data_format is None:
+            # We assume that all images have the same channel dimension format.
+            input_data_format = infer_channel_dimension_format(images[0])
+
         if do_resize:
-            images = [self.resize(image=image, size=size_dict, resample=resample) for image in images]
+            images = [
+                self.resize(image=image, size=size_dict, resample=resample, input_data_format=input_data_format)
+                for image in images
+            ]
 
         if do_center_crop:
-            images = [self.center_crop(image=image, size=crop_size) for image in images]
+            images = [
+                self.center_crop(image=image, size=crop_size, input_data_format=input_data_format) for image in images
+            ]
 
         if do_rescale:
-            images = [self.rescale(image=image, scale=rescale_factor) for image in images]
+            images = [
+                self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
+                for image in images
+            ]
 
         if do_normalize:
-            images = [self.normalize(image=image, mean=image_mean, std=image_std) for image in images]
-
-        images = [to_channel_dimension_format(image, data_format) for image in images]
+            images = [
+                self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
+                for image in images
+            ]
+
+        images = [
+            to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
+        ]
 
         data = {"pixel_values": images}
         return BatchFeature(data=data, tensor_type=return_tensors)
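A small usage sketch (not part of the patch) exercising the new `input_data_format` plumbing and the `do_rescale` guard added above; the random image is purely illustrative:

```
import numpy as np

from transformers import EfficientFormerImageProcessor
from transformers.image_utils import ChannelDimension

processor = EfficientFormerImageProcessor()

# A channels-first float image already scaled to [0, 1]: declare the layout
# explicitly and skip rescaling so the new warning is not triggered.
image = np.random.rand(3, 480, 640).astype(np.float32)
inputs = processor.preprocess(
    image,
    do_rescale=False,
    input_data_format=ChannelDimension.FIRST,
    return_tensors="np",
)
print(inputs["pixel_values"].shape)  # (1, 3, crop_height, crop_width) with the default crop size
```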
diff --git a/src/transformers/models/efficientformer/modeling_efficientformer.py b/src/transformers/models/efficientformer/modeling_efficientformer.py
index b6264e60c8b899..5f03a5ab747235 100644
--- a/src/transformers/models/efficientformer/modeling_efficientformer.py
+++ b/src/transformers/models/efficientformer/modeling_efficientformer.py
@@ -43,7 +43,7 @@
 
 # Base docstring
 _CHECKPOINT_FOR_DOC = "snap-research/efficientformer-l1-300"
-_EXPECTED_OUTPUT_SHAPE = [1, 197, 768]
+_EXPECTED_OUTPUT_SHAPE = [1, 49, 448]
 
 # Image classification docstring
 _IMAGE_CLASS_CHECKPOINT = "snap-research/efficientformer-l1-300"
@@ -73,7 +73,7 @@ def __init__(self, config: EfficientFormerConfig, num_channels: int, embed_dim:
             stride=config.downsample_stride,
             padding=config.downsample_pad,
         )
-        self.norm = nn.BatchNorm2d(embed_dim) if apply_norm else nn.Identity()
+        self.norm = nn.BatchNorm2d(embed_dim, eps=config.batch_norm_eps) if apply_norm else nn.Identity()
 
     def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
         batch_size, num_channels, height, width = pixel_values.shape
@@ -157,10 +157,10 @@ def __init__(self, config: EfficientFormerConfig, out_channels: int):
         super().__init__()
 
         self.convolution1 = nn.Conv2d(config.num_channels, out_channels // 2, kernel_size=3, stride=2, padding=1)
-        self.batchnorm_before = nn.BatchNorm2d(out_channels // 2)
+        self.batchnorm_before = nn.BatchNorm2d(out_channels // 2, eps=config.batch_norm_eps)
 
         self.convolution2 = nn.Conv2d(out_channels // 2, out_channels, kernel_size=3, stride=2, padding=1)
-        self.batchnorm_after = nn.BatchNorm2d(out_channels)
+        self.batchnorm_after = nn.BatchNorm2d(out_channels, eps=config.batch_norm_eps)
 
         self.activation = nn.ReLU()
 
@@ -224,29 +224,29 @@ def __init__(
         hidden_features = hidden_features or in_features
 
         self.convolution1 = nn.Conv2d(in_features, hidden_features, 1)
-        self.actvation = ACT2FN[config.hidden_act]
+        self.activation = ACT2FN[config.hidden_act]
         self.convolution2 = nn.Conv2d(hidden_features, out_features, 1)
         self.dropout = nn.Dropout(drop)
 
-        self.batchnorm_before = nn.BatchNorm2d(hidden_features)
-        self.batchnorm_after = nn.BatchNorm2d(out_features)
+        self.batchnorm_before = nn.BatchNorm2d(hidden_features, eps=config.batch_norm_eps)
+        self.batchnorm_after = nn.BatchNorm2d(out_features, eps=config.batch_norm_eps)
 
     def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
         hidden_state = self.convolution1(hidden_state)
         hidden_state = self.batchnorm_before(hidden_state)
 
-        hidden_state = self.actvation(hidden_state)
+        hidden_state = self.activation(hidden_state)
         hidden_state = self.dropout(hidden_state)
         hidden_state = self.convolution2(hidden_state)
 
         hidden_state = self.batchnorm_after(hidden_state)
-
         hidden_state = self.dropout(hidden_state)
+
         return hidden_state
 
 
 # Copied from transformers.models.convnext.modeling_convnext.drop_path
-def drop_path(input, drop_prob: float = 0.0, training: bool = False):
+def drop_path(input: torch.Tensor, drop_prob: float = 0.0, training: bool = False) -> torch.Tensor:
     """
     Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
 
@@ -266,7 +266,7 @@ def drop_path(input, drop_prob: float = 0.0, training: bool = False):
     return output
 
 
-# Copied from transformers.models.beit.modeling_beit.BeitDropPath with Beit->Bit
+# Copied from transformers.models.beit.modeling_beit.BeitDropPath with Beit->EfficientFormer
 class EfficientFormerDropPath(nn.Module):
     """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks)."""
 
@@ -301,8 +301,10 @@ def __init__(self, config: EfficientFormerConfig, dim: int, drop_path: float = 0
             attention_ratio=config.attention_ratio,
             resolution=config.resolution,
         )
-        self.layernorm1 = nn.LayerNorm(dim)
-        self.layernorm2 = nn.LayerNorm(dim)
+
+        self.layernorm1 = nn.LayerNorm(dim, eps=config.layer_norm_eps)
+        self.layernorm2 = nn.LayerNorm(dim, eps=config.layer_norm_eps)
+
         mlp_hidden_dim = int(dim * config.mlp_expansion_ratio)
         self.mlp = EfficientFormerDenseMlp(config, in_features=dim, hidden_features=mlp_hidden_dim)
 
@@ -346,15 +348,20 @@ def __init__(self, config: EfficientFormerConfig):
 
     def forward(self, hidden_states: torch.Tensor, output_attentions: bool = False) -> Tuple[torch.Tensor]:
         all_attention_outputs = () if output_attentions else None
+
         for layer_module in self.blocks:
             if isinstance(hidden_states, tuple):
                 hidden_states = hidden_states[0]
+
             hidden_states = layer_module(hidden_states, output_attentions)
+
             if output_attentions:
                 all_attention_outputs = all_attention_outputs + (hidden_states[1],)
+
         if output_attentions:
             outputs = (hidden_states[0],) + all_attention_outputs
             return outputs
+
         return hidden_states
 
 
@@ -379,6 +386,7 @@ def forward(self, hidden_states: torch.Tensor) -> Tuple[torch.Tensor]:
 
         if self.use_layer_scale:
             layer_output = hidden_states + self.drop_path(self.layer_scale_1.unsqueeze(-1).unsqueeze(-1) * outputs)
+
             layer_output = layer_output + self.drop_path(
                 self.layer_scale_2.unsqueeze(-1).unsqueeze(-1) * self.mlp(layer_output)
             )
@@ -398,6 +406,7 @@ def __init__(self, config: EfficientFormerConfig, stage_idx: int):
         drop_paths = [
             config.drop_path_rate * (block_idx + sum(config.depths[:stage_idx])) for block_idx in range(num_layers)
         ]
+
         self.blocks = nn.ModuleList(
             [
                 EfficientFormerMeta4D(config, config.hidden_sizes[stage_idx], drop_path=drop_path)
@@ -446,6 +455,7 @@ def __init__(self, config: EfficientFormerConfig):
             for i in range(num_intermediate_stages)
         ]
         intermediate_stages = []
+
         for i in range(num_intermediate_stages):
             intermediate_stages.append(EfficientFormerIntermediateStage(config, i))
             if downsamples[i]:
@@ -475,6 +485,7 @@ def forward(
                 all_hidden_states = all_hidden_states + (hidden_states,)
 
         layer_output = self.last_stage(hidden_states, output_attentions=output_attentions)
+
         if output_attentions:
             all_self_attentions = all_self_attentions + layer_output[1:]
 
@@ -482,7 +493,7 @@ def forward(
             all_hidden_states = all_hidden_states + (layer_output[0],)
 
         if not return_dict:
-            return tuple(v for v in [hidden_states, all_hidden_states, all_self_attentions] if v is not None)
+            return tuple(v for v in [layer_output[0], all_hidden_states, all_self_attentions] if v is not None)
 
         return BaseModelOutput(
             last_hidden_state=layer_output[0],
@@ -526,8 +537,8 @@ def _init_weights(self, module: nn.Module):
 EFFICIENTFORMER_INPUTS_DOCSTRING = r"""
     Args:
         pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-            Pixel values. Pixel values can be obtained using [`ViTFeatureExtractor`]. See
-            [`ViTFeatureExtractor.__call__`] for details.
+            Pixel values. Pixel values can be obtained using [`ViTImageProcessor`]. See
+            [`ViTImageProcessor.preprocess`] for details.
         output_attentions (`bool`, *optional*):
             Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
             tensors for more detail.
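A quick sanity-check sketch (not part of the patch) for the corrected `_EXPECTED_OUTPUT_SHAPE` and the `return_dict` tuple fix above, using a randomly initialised default configuration:

```
import torch

from transformers import EfficientFormerConfig, EfficientFormerModel

config = EfficientFormerConfig()  # default sizes: hidden_sizes[-1] == 448, resolution == 7
model = EfficientFormerModel(config).eval()

pixel_values = torch.randn(1, config.num_channels, 224, 224)
with torch.no_grad():
    outputs = model(pixel_values)

print(outputs.last_hidden_state.shape)  # expected: torch.Size([1, 49, 448])
```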
diff --git a/src/transformers/models/efficientformer/modeling_tf_efficientformer.py b/src/transformers/models/efficientformer/modeling_tf_efficientformer.py
new file mode 100644
index 00000000000000..113eafb88d8493
--- /dev/null
+++ b/src/transformers/models/efficientformer/modeling_tf_efficientformer.py
@@ -0,0 +1,1196 @@
+# coding=utf-8
+# Copyright 2023 Snapchat Research and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" TensorFlow EfficientFormer model."""
+
+import itertools
+from dataclasses import dataclass
+from typing import Optional, Tuple, Union
+
+import tensorflow as tf
+
+from ...activations_tf import ACT2FN
+from ...modeling_tf_outputs import (
+    TFBaseModelOutput,
+    TFBaseModelOutputWithPooling,
+    TFImageClassifierOutput,
+)
+from ...modeling_tf_utils import (
+    TFPreTrainedModel,
+    TFSequenceClassificationLoss,
+    get_initializer,
+    keras,
+    keras_serializable,
+    unpack_inputs,
+)
+from ...tf_utils import shape_list, stable_softmax
+from ...utils import (
+    ModelOutput,
+    add_code_sample_docstrings,
+    add_start_docstrings,
+    add_start_docstrings_to_model_forward,
+    logging,
+)
+from .configuration_efficientformer import EfficientFormerConfig
+
+
+logger = logging.get_logger(__name__)
+
+# General docstring
+_CONFIG_FOR_DOC = "EfficientFormerConfig"
+
+# Base docstring
+_CHECKPOINT_FOR_DOC = "snap-research/efficientformer-l1-300"
+_EXPECTED_OUTPUT_SHAPE = [1, 49, 448]
+
+# Image classification docstring
+_IMAGE_CLASS_CHECKPOINT = "snap-research/efficientformer-l1-300"
+_IMAGE_CLASS_EXPECTED_OUTPUT = "LABEL_281"
+
+
+TF_EFFICIENTFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = [
+    "snap-research/efficientformer-l1-300",
+    # See all EfficientFormer models at https://huggingface.co/models?filter=efficientformer
+]
+
+
+class TFEfficientFormerPatchEmbeddings(keras.layers.Layer):
+    """
+    This class performs downsampling between two stages. For the input tensor with the shape [batch_size, num_channels,
+    height, width], it produces an output tensor with the shape [batch_size, num_channels, height/stride, width/stride].
+    """
+
+    def __init__(
+        self, config: EfficientFormerConfig, num_channels: int, embed_dim: int, apply_norm: bool = True, **kwargs
+    ) -> None:
+        super().__init__(**kwargs)
+        self.num_channels = num_channels
+
+        self.padding = keras.layers.ZeroPadding2D(padding=config.downsample_pad)
+        self.projection = keras.layers.Conv2D(
+            filters=embed_dim,
+            kernel_size=config.downsample_patch_size,
+            strides=config.downsample_stride,
+            padding="valid",
+            name="projection",
+        )
+        # Use same default momentum and epsilon as PyTorch equivalent for BatchNormalization
+        self.norm = (
+            keras.layers.BatchNormalization(axis=-1, epsilon=config.batch_norm_eps, momentum=0.9, name="norm")
+            if apply_norm
+            else tf.identity
+        )
+        self.embed_dim = embed_dim
+
+    def call(self, pixel_values: tf.Tensor, training: bool = False) -> tf.Tensor:
+        tf.debugging.assert_shapes(
+            [(pixel_values, (..., None, None, self.num_channels))],
+            message="Make sure that the channel dimension of the pixel values matches the one set in the configuration.",
+        )
+        embeddings = self.projection(self.padding(pixel_values))
+        embeddings = self.norm(embeddings, training=training)
+        return embeddings
+
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "projection", None) is not None:
+            with tf.name_scope(self.projection.name):
+                self.projection.build([None, None, None, self.num_channels])
+        if getattr(self, "norm", None) is not None:
+            if hasattr(self.norm, "name"):
+                with tf.name_scope(self.norm.name):
+                    self.norm.build([None, None, None, self.embed_dim])
+
+
+class TFEfficientFormerSelfAttention(keras.layers.Layer):
+    def __init__(
+        self,
+        dim: int,
+        key_dim: int,
+        num_heads: int,
+        attention_ratio: int,
+        resolution: int,
+        config: EfficientFormerConfig,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+
+        self.num_heads = num_heads
+        self.key_dim = key_dim
+        self.attention_ratio = attention_ratio
+        self.scale = key_dim**-0.5
+        self.total_key_dim = key_dim * num_heads
+        self.expanded_key_dim = int(attention_ratio * key_dim)
+        self.total_expanded_key_dim = int(self.expanded_key_dim * num_heads)
+        hidden_size = self.total_expanded_key_dim + self.total_key_dim * 2
+
+        self.qkv = keras.layers.Dense(
+            units=hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="qkv"
+        )
+        self.projection = keras.layers.Dense(
+            units=dim, kernel_initializer=get_initializer(config.initializer_range), name="projection"
+        )
+        self.resolution = resolution
+        self.dim = dim
+
+    def build(self, input_shape: tf.TensorShape) -> None:
+        points = list(itertools.product(range(self.resolution), range(self.resolution)))
+        num_points = len(points)
+        attention_offsets = {}
+
+        idxs = []
+
+        for point_1 in points:
+            for point_2 in points:
+                offset = (abs(point_1[0] - point_2[0]), abs(point_1[1] - point_2[1]))
+                if offset not in attention_offsets:
+                    attention_offsets[offset] = len(attention_offsets)
+                idxs.append(attention_offsets[offset])
+
+        self.attention_biases = self.add_weight(
+            shape=(self.num_heads, len(attention_offsets)),
+            initializer=keras.initializers.zeros(),
+            trainable=True,
+            name="attention_biases",
+        )
+        self.attention_bias_idxs = self.add_weight(
+            shape=(num_points, num_points),
+            trainable=False,
+            dtype=tf.int32,
+            name="attention_bias_idxs",
+        )
+
+        self.attention_bias_idxs.assign(tf.reshape(tf.cast(idxs, dtype=tf.int32), (num_points, num_points)))
+
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "qkv", None) is not None:
+            with tf.name_scope(self.qkv.name):
+                self.qkv.build([None, None, self.dim])
+        if getattr(self, "projection", None) is not None:
+            with tf.name_scope(self.projection.name):
+                self.projection.build([None, None, self.total_expanded_key_dim])
+
+    def call(
+        self, hidden_states: tf.Tensor, output_attentions: bool = False, training: bool = False
+    ) -> Tuple[tf.Tensor]:
+        batch_size, sequence_length, *_ = shape_list(hidden_states)
+        qkv = self.qkv(inputs=hidden_states)
+
+        query_layer, key_layer, value_layer = tf.split(
+            tf.reshape(tensor=qkv, shape=(batch_size, sequence_length, self.num_heads, -1)),
+            num_or_size_splits=[self.key_dim, self.key_dim, self.expanded_key_dim],
+            axis=3,
+        )
+
+        query_layer = tf.transpose(query_layer, perm=[0, 2, 1, 3])
+        key_layer = tf.transpose(key_layer, perm=[0, 2, 1, 3])
+        value_layer = tf.transpose(value_layer, perm=[0, 2, 1, 3])
+
+        attention_probs = tf.matmul(query_layer, tf.transpose(key_layer, perm=[0, 1, 3, 2]))
+        scale = tf.cast(self.scale, dtype=attention_probs.dtype)
+        attention_probs = tf.multiply(attention_probs, scale)
+
+        attention_biases = tf.gather(params=self.attention_biases, indices=self.attention_bias_idxs, axis=1)
+        attention_probs = attention_probs + attention_biases
+        attention_probs = stable_softmax(logits=attention_probs, axis=-1)
+
+        context_layer = tf.matmul(attention_probs, value_layer)
+        context_layer = tf.transpose(context_layer, perm=[0, 2, 1, 3])
+
+        context_layer = tf.reshape(
+            tensor=context_layer, shape=(batch_size, sequence_length, self.total_expanded_key_dim)
+        )
+        context_layer = self.projection(context_layer)
+
+        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
+
+        return outputs
+
+
+class TFEfficientFormerConvStem(keras.layers.Layer):
+    def __init__(self, config: EfficientFormerConfig, out_channels: int, **kwargs):
+        super().__init__(**kwargs)
+
+        self.padding = keras.layers.ZeroPadding2D(padding=1)
+        self.convolution1 = keras.layers.Conv2D(
+            filters=out_channels // 2, kernel_size=3, strides=2, padding="valid", name="convolution1"
+        )
+        # Use same default momentum and epsilon as PyTorch equivalent for BatchNormalization
+        self.batchnorm_before = keras.layers.BatchNormalization(
+            axis=-1, epsilon=config.batch_norm_eps, momentum=0.9, name="batchnorm_before"
+        )
+
+        self.convolution2 = keras.layers.Conv2D(
+            filters=out_channels,
+            kernel_size=3,
+            strides=2,
+            padding="valid",
+            name="convolution2",
+        )
+        # Use same default momentum and epsilon as PyTorch equivalent for BatchNormalization
+        self.batchnorm_after = keras.layers.BatchNormalization(
+            axis=-1, epsilon=config.batch_norm_eps, momentum=0.9, name="batchnorm_after"
+        )
+
+        self.activation = keras.layers.Activation(activation=keras.activations.relu, name="activation")
+        self.out_channels = out_channels
+        self.config = config
+
+    def call(self, pixel_values: tf.Tensor, training: bool = False) -> tf.Tensor:
+        features = self.batchnorm_before(self.convolution1(self.padding(pixel_values)), training=training)
+        features = self.activation(features)
+        features = self.batchnorm_after(self.convolution2(self.padding(features)), training=training)
+        features = self.activation(features)
+        return features
+
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "convolution1", None) is not None:
+            with tf.name_scope(self.convolution1.name):
+                self.convolution1.build([None, None, None, self.config.num_channels])
+        if getattr(self, "batchnorm_before", None) is not None:
+            with tf.name_scope(self.batchnorm_before.name):
+                self.batchnorm_before.build([None, None, None, self.out_channels // 2])
+        if getattr(self, "convolution2", None) is not None:
+            with tf.name_scope(self.convolution2.name):
+                self.convolution2.build([None, None, None, self.out_channels // 2])
+        if getattr(self, "batchnorm_after", None) is not None:
+            with tf.name_scope(self.batchnorm_after.name):
+                self.batchnorm_after.build([None, None, None, self.out_channels])
+        if getattr(self, "activation", None) is not None:
+            with tf.name_scope(self.activation.name):
+                self.activation.build(None)
+
+
+class TFEfficientFormerPooling(keras.layers.Layer):
+    def __init__(self, pool_size: int, **kwargs):
+        super().__init__(**kwargs)
+        self.pool = keras.layers.AveragePooling2D(pool_size=pool_size, strides=1, padding="same")
+
+    def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
+        output = self.pool(hidden_states)
+        output = output - hidden_states
+        return output
+
+
+class TFEfficientFormerDenseMlp(keras.layers.Layer):
+    def __init__(
+        self,
+        config: EfficientFormerConfig,
+        in_features: int,
+        hidden_features: Optional[int] = None,
+        out_features: Optional[int] = None,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+        out_features = out_features or in_features
+        hidden_features = hidden_features or in_features
+
+        self.linear_in = keras.layers.Dense(
+            units=hidden_features, kernel_initializer=get_initializer(config.initializer_range), name="linear_in"
+        )
+        self.activation = ACT2FN[config.hidden_act]
+        self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob)
+
+        self.linear_out = keras.layers.Dense(
+            units=out_features, kernel_initializer=get_initializer(config.initializer_range), name="linear_out"
+        )
+        self.hidden_features = hidden_features
+        self.in_features = in_features
+
+    def call(self, hidden_states: tf.Tensor, training: bool = False) -> tf.Tensor:
+        hidden_states = self.linear_in(inputs=hidden_states)
+        hidden_states = self.activation(hidden_states)
+        hidden_states = self.dropout(inputs=hidden_states, training=training)
+        hidden_states = self.linear_out(inputs=hidden_states)
+        hidden_states = self.dropout(inputs=hidden_states, training=training)
+
+        return hidden_states
+
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "linear_in", None) is not None:
+            with tf.name_scope(self.linear_in.name):
+                self.linear_in.build([None, None, self.in_features])
+        if getattr(self, "linear_out", None) is not None:
+            with tf.name_scope(self.linear_out.name):
+                self.linear_out.build([None, None, self.hidden_features])
+
+
+class TFEfficientFormerConvMlp(keras.layers.Layer):
+    def __init__(
+        self,
+        config: EfficientFormerConfig,
+        in_features: int,
+        hidden_features: Optional[int] = None,
+        out_features: Optional[int] = None,
+        drop: float = 0.0,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+        out_features = out_features or in_features
+        hidden_features = hidden_features or in_features
+
+        self.convolution1 = keras.layers.Conv2D(
+            filters=hidden_features,
+            kernel_size=1,
+            name="convolution1",
+            padding="valid",
+        )
+
+        self.activation = ACT2FN[config.hidden_act]
+
+        self.convolution2 = keras.layers.Conv2D(
+            filters=out_features,
+            kernel_size=1,
+            name="convolution2",
+            padding="valid",
+        )
+
+        self.dropout = keras.layers.Dropout(rate=drop)
+
+        # Use same default momentum and epsilon as PyTorch equivalent for BatchNormalization
+        self.batchnorm_before = keras.layers.BatchNormalization(
+            axis=-1, epsilon=config.batch_norm_eps, momentum=0.9, name="batchnorm_before"
+        )
+        # Use same default momentum and epsilon as PyTorch equivalent for BatchNormalization
+        self.batchnorm_after = keras.layers.BatchNormalization(
+            axis=-1, epsilon=config.batch_norm_eps, momentum=0.9, name="batchnorm_after"
+        )
+        self.hidden_features = hidden_features
+        self.in_features = in_features
+        self.out_features = out_features
+
+    def call(self, hidden_state: tf.Tensor, training: bool = False) -> tf.Tensor:
+        hidden_state = self.convolution1(hidden_state)
+        hidden_state = self.batchnorm_before(hidden_state, training=training)
+        hidden_state = self.activation(hidden_state)
+        hidden_state = self.dropout(hidden_state, training=training)
+        hidden_state = self.convolution2(hidden_state)
+        hidden_state = self.batchnorm_after(hidden_state, training=training)
+        hidden_state = self.dropout(hidden_state, training=training)
+        return hidden_state
+
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "convolution1", None) is not None:
+            with tf.name_scope(self.convolution1.name):
+                self.convolution1.build([None, None, None, self.in_features])
+        if getattr(self, "convolution2", None) is not None:
+            with tf.name_scope(self.convolution2.name):
+                self.convolution2.build([None, None, None, self.hidden_features])
+        if getattr(self, "batchnorm_before", None) is not None:
+            with tf.name_scope(self.batchnorm_before.name):
+                self.batchnorm_before.build([None, None, None, self.hidden_features])
+        if getattr(self, "batchnorm_after", None) is not None:
+            with tf.name_scope(self.batchnorm_after.name):
+                self.batchnorm_after.build([None, None, None, self.out_features])
+
+
+# Copied from transformers.models.convnext.modeling_tf_convnext.TFConvNextDropPath with ConvNext->EfficientFormer
+class TFEfficientFormerDropPath(keras.layers.Layer):
+    """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
+    References:
+        (1) github.com:rwightman/pytorch-image-models
+    """
+
+    def __init__(self, drop_path: float, **kwargs):
+        super().__init__(**kwargs)
+        self.drop_path = drop_path
+
+    def call(self, x: tf.Tensor, training=None):
+        if training:
+            keep_prob = 1 - self.drop_path
+            shape = (tf.shape(x)[0],) + (1,) * (len(tf.shape(x)) - 1)
+            random_tensor = keep_prob + tf.random.uniform(shape, 0, 1)
+            random_tensor = tf.floor(random_tensor)
+            return (x / keep_prob) * random_tensor
+        return x
+
+
+class TFEfficientFormerFlat(keras.layers.Layer):
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+
+    def call(self, hidden_states: tf.Tensor) -> Tuple[tf.Tensor]:
+        batch_size, _, _, in_channels = shape_list(hidden_states)
+        hidden_states = tf.reshape(hidden_states, shape=[batch_size, -1, in_channels])
+        return hidden_states
+
+
+class TFEfficientFormerMeta3D(keras.layers.Layer):
+    def __init__(self, config: EfficientFormerConfig, dim: int, drop_path: float = 0.0, **kwargs):
+        super().__init__(**kwargs)
+
+        self.token_mixer = TFEfficientFormerSelfAttention(
+            dim=config.dim,
+            key_dim=config.key_dim,
+            num_heads=config.num_attention_heads,
+            attention_ratio=config.attention_ratio,
+            resolution=config.resolution,
+            name="token_mixer",
+            config=config,
+        )
+        self.dim = dim
+        self.config = config
+
+        self.layernorm1 = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layernorm1")
+        self.layernorm2 = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layernorm2")
+        mlp_hidden_dim = int(dim * config.mlp_expansion_ratio)
+        self.mlp = TFEfficientFormerDenseMlp(config, in_features=dim, hidden_features=mlp_hidden_dim, name="mlp")
+
+        # Using `layers.Activation` instead of `tf.identity` to better control `training` behavior.
+        self.drop_path = (
+            TFEfficientFormerDropPath(drop_path)
+            if drop_path > 0.0
+            else keras.layers.Activation("linear", name="drop_path")
+        )
+        self.config = config
+
+    def build(self, input_shape=None):
+        self.layer_scale_1 = None
+        self.layer_scale_2 = None
+
+        if self.config.use_layer_scale:
+            self.layer_scale_1 = self.add_weight(
+                shape=(self.dim,),
+                initializer=keras.initializers.Constant(value=self.config.layer_scale_init_value),
+                trainable=True,
+                name="layer_scale_1",
+            )
+            self.layer_scale_2 = self.add_weight(
+                shape=(self.dim,),
+                initializer=keras.initializers.Constant(value=self.config.layer_scale_init_value),
+                trainable=True,
+                name="layer_scale_2",
+            )
+
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "token_mixer", None) is not None:
+            with tf.name_scope(self.token_mixer.name):
+                self.token_mixer.build(None)
+        if getattr(self, "layernorm1", None) is not None:
+            with tf.name_scope(self.layernorm1.name):
+                self.layernorm1.build([None, None, self.dim])
+        if getattr(self, "layernorm2", None) is not None:
+            with tf.name_scope(self.layernorm2.name):
+                self.layernorm2.build([None, None, self.dim])
+        if getattr(self, "mlp", None) is not None:
+            with tf.name_scope(self.mlp.name):
+                self.mlp.build(None)
+        if getattr(self, "drop_path", None) is not None:
+            with tf.name_scope(self.drop_path.name):
+                self.drop_path.build(None)
+
+    def call(
+        self, hidden_states: tf.Tensor, output_attentions: bool = False, training: bool = False
+    ) -> Tuple[tf.Tensor]:
+        self_attention_outputs = self.token_mixer(
+            hidden_states=self.layernorm1(hidden_states, training=training),
+            output_attentions=output_attentions,
+            training=training,
+        )
+
+        attention_output = self_attention_outputs[0]
+        outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights
+
+        if self.config.use_layer_scale:
+            layer_output = hidden_states + self.drop_path(
+                tf.expand_dims(tf.expand_dims(self.layer_scale_1, 0), 0) * attention_output,
+                training=training,
+            )
+            layer_output = layer_output + self.drop_path(
+                tf.expand_dims(tf.expand_dims(self.layer_scale_2, 0), 0)
+                * self.mlp(hidden_states=self.layernorm2(inputs=layer_output, training=training), training=training),
+                training=training,
+            )
+        else:
+            layer_output = hidden_states + self.drop_path(attention_output, training=training)
+            layer_output = layer_output + self.drop_path(
+                self.mlp(hidden_states=self.layernorm2(inputs=layer_output, training=training), training=training),
+                training=training,
+            )
+
+        outputs = (layer_output,) + outputs
+
+        return outputs
+
+
+class TFEfficientFormerMeta3DLayers(keras.layers.Layer):
+    def __init__(self, config: EfficientFormerConfig, **kwargs):
+        super().__init__(**kwargs)
+        drop_paths = [
+            config.drop_path_rate * (block_idx + sum(config.depths[:-1]))
+            for block_idx in range(config.num_meta3d_blocks)
+        ]
+        self.blocks = [
+            TFEfficientFormerMeta3D(config, config.hidden_sizes[-1], drop_path=drop_path, name=f"blocks.{i}")
+            for i, drop_path in enumerate(drop_paths)
+        ]
+
+    def call(
+        self, hidden_states: tf.Tensor, output_attentions: bool = False, training: bool = False
+    ) -> Tuple[tf.Tensor]:
+        all_attention_outputs = () if output_attentions else None
+
+        for i, layer_module in enumerate(self.blocks):
+            if isinstance(hidden_states, tuple):
+                hidden_states = hidden_states[0]
+
+            hidden_states = layer_module(
+                hidden_states=hidden_states, output_attentions=output_attentions, training=training
+            )
+            if output_attentions:
+                all_attention_outputs = all_attention_outputs + (hidden_states[1],)
+
+        if output_attentions:
+            outputs = (hidden_states[0],) + all_attention_outputs
+            return outputs
+
+        return hidden_states
+
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "blocks", None) is not None:
+            for layer in self.blocks:
+                with tf.name_scope(layer.name):
+                    layer.build(None)
+
+
+class TFEfficientFormerMeta4D(keras.layers.Layer):
+    def __init__(self, config: EfficientFormerConfig, dim: int, drop_path: float = 0.0, **kwargs):
+        super().__init__(**kwargs)
+        pool_size = config.pool_size if config.pool_size is not None else 3
+        self.token_mixer = TFEfficientFormerPooling(pool_size=pool_size, name="token_mixer")
+        self.dim = dim
+        mlp_hidden_dim = int(dim * config.mlp_expansion_ratio)
+        self.mlp = TFEfficientFormerConvMlp(
+            config=config, in_features=dim, hidden_features=mlp_hidden_dim, drop=config.hidden_dropout_prob, name="mlp"
+        )
+
+        self.drop_path = (
+            TFEfficientFormerDropPath(drop_path, name="drop_path")
+            if drop_path > 0.0
+            else keras.layers.Activation("linear", name="drop_path")
+        )
+        self.config = config
+
+    def build(self, input_shape=None):
+        self.layer_scale_1 = None
+        self.layer_scale_2 = None
+
+        if self.config.use_layer_scale:
+            self.layer_scale_1 = self.add_weight(
+                shape=(self.dim),
+                initializer=keras.initializers.Constant(value=self.config.layer_scale_init_value),
+                trainable=True,
+                name="layer_scale_1",
+            )
+            self.layer_scale_2 = self.add_weight(
+                shape=(self.dim),
+                initializer=keras.initializers.Constant(value=self.config.layer_scale_init_value),
+                trainable=True,
+                name="layer_scale_2",
+            )
+
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "token_mixer", None) is not None:
+            with tf.name_scope(self.token_mixer.name):
+                self.token_mixer.build(None)
+        if getattr(self, "mlp", None) is not None:
+            with tf.name_scope(self.mlp.name):
+                self.mlp.build(None)
+        if getattr(self, "drop_path", None) is not None:
+            with tf.name_scope(self.drop_path.name):
+                self.drop_path.build(None)
+
+    def call(self, hidden_states: tf.Tensor, training: bool = False) -> Tuple[tf.Tensor]:
+        outputs = self.token_mixer(hidden_states)
+
+        if self.config.use_layer_scale:
+            layer_output = hidden_states + self.drop_path(
+                tf.expand_dims(tf.expand_dims(self.layer_scale_1, 0), 0) * outputs,
+                training=training,
+            )
+
+            layer_output = layer_output + self.drop_path(
+                tf.expand_dims(tf.expand_dims(self.layer_scale_2, 0), 0)
+                * self.mlp(hidden_state=layer_output, training=training),
+                training=training,
+            )
+
+        else:
+            layer_output = hidden_states + self.drop_path(outputs, training=training)
+            layer_output = layer_output + self.drop_path(
+                self.mlp(hidden_state=layer_output, training=training), training=training
+            )
+
+        return layer_output
+
+
+class TFEfficientFormerMeta4DLayers(keras.layers.Layer):
+    def __init__(self, config: EfficientFormerConfig, stage_idx: int, **kwargs):
+        super().__init__(**kwargs)
+        num_layers = (
+            config.depths[stage_idx] if stage_idx != -1 else config.depths[stage_idx] - config.num_meta3d_blocks
+        )
+        drop_paths = [
+            config.drop_path_rate * (block_idx + sum(config.depths[:stage_idx])) for block_idx in range(num_layers)
+        ]
+
+        self.blocks = [
+            TFEfficientFormerMeta4D(
+                config=config, dim=config.hidden_sizes[stage_idx], drop_path=drop_paths[i], name=f"blocks.{i}"
+            )
+            for i in range(len(drop_paths))
+        ]
+
+    def call(self, hidden_states: tf.Tensor, training: bool = False) -> Tuple[tf.Tensor]:
+        for layer_module in self.blocks:
+            hidden_states = layer_module(hidden_states=hidden_states, training=training)
+        return hidden_states
+
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "blocks", None) is not None:
+            for layer in self.blocks:
+                with tf.name_scope(layer.name):
+                    layer.build(None)
+
+
+class TFEfficientFormerIntermediateStage(keras.layers.Layer):
+    def __init__(self, config: EfficientFormerConfig, index: int, **kwargs):
+        super().__init__(**kwargs)
+        self.meta4D_layers = TFEfficientFormerMeta4DLayers(config=config, stage_idx=index, name="meta4D_layers")
+
+    def call(self, hidden_states: tf.Tensor, training: bool = False) -> Tuple[tf.Tensor]:
+        hidden_states = self.meta4D_layers(hidden_states=hidden_states, training=training)
+        return hidden_states
+
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "meta4D_layers", None) is not None:
+            with tf.name_scope(self.meta4D_layers.name):
+                self.meta4D_layers.build(None)
+
+
+class TFEfficientFormerLastStage(keras.layers.Layer):
+    def __init__(self, config: EfficientFormerConfig, **kwargs):
+        super().__init__(**kwargs)
+        self.meta4D_layers = TFEfficientFormerMeta4DLayers(config=config, stage_idx=-1, name="meta4D_layers")
+        self.flat = TFEfficientFormerFlat(name="flat")
+        self.meta3D_layers = TFEfficientFormerMeta3DLayers(config, name="meta3D_layers")
+
+    def call(
+        self, hidden_states: tf.Tensor, output_attentions: bool = False, training: bool = False
+    ) -> Tuple[tf.Tensor]:
+        hidden_states = self.meta4D_layers(hidden_states=hidden_states, training=training)
+        hidden_states = self.flat(hidden_states=hidden_states)
+        hidden_states = self.meta3D_layers(
+            hidden_states=hidden_states, output_attentions=output_attentions, training=training
+        )
+
+        return hidden_states
+
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "meta4D_layers", None) is not None:
+            with tf.name_scope(self.meta4D_layers.name):
+                self.meta4D_layers.build(None)
+        if getattr(self, "flat", None) is not None:
+            with tf.name_scope(self.flat.name):
+                self.flat.build(None)
+        if getattr(self, "meta3D_layers", None) is not None:
+            with tf.name_scope(self.meta3D_layers.name):
+                self.meta3D_layers.build(None)
+
+
+class TFEfficientFormerEncoder(keras.layers.Layer):
+    def __init__(self, config: EfficientFormerConfig, **kwargs):
+        super().__init__(**kwargs)
+
+        self.config = config
+        num_intermediate_stages = len(config.depths) - 1
+        downsamples = [
+            config.downsamples[i] or config.hidden_sizes[i] != config.hidden_sizes[i + 1]
+            for i in range(num_intermediate_stages)
+        ]
+
+        intermediate_stages = []
+        layer_count = -1
+        for i in range(num_intermediate_stages):
+            layer_count += 1
+            intermediate_stages.append(
+                TFEfficientFormerIntermediateStage(config, i, name=f"intermediate_stages.{layer_count}")
+            )
+            if downsamples[i]:
+                layer_count += 1
+                intermediate_stages.append(
+                    TFEfficientFormerPatchEmbeddings(
+                        config,
+                        config.hidden_sizes[i],
+                        config.hidden_sizes[i + 1],
+                        name=f"intermediate_stages.{layer_count}",
+                    )
+                )
+        self.intermediate_stages = intermediate_stages
+        self.last_stage = TFEfficientFormerLastStage(config, name="last_stage")
+
+    def call(
+        self,
+        hidden_states: tf.Tensor,
+        output_hidden_states: bool,
+        output_attentions: bool,
+        return_dict: bool,
+        training: bool = False,
+    ) -> TFBaseModelOutput:
+        all_hidden_states = () if output_hidden_states else None
+        all_self_attentions = () if output_attentions else None
+
+        if output_hidden_states:
+            all_hidden_states = all_hidden_states + (hidden_states,)
+
+        for layer_module in self.intermediate_stages:
+            hidden_states = layer_module(hidden_states, training=training)
+
+            if output_hidden_states:
+                all_hidden_states = all_hidden_states + (hidden_states,)
+
+        layer_output = self.last_stage(hidden_states, output_attentions=output_attentions, training=training)
+
+        if output_attentions:
+            all_self_attentions = all_self_attentions + layer_output[1:]
+
+        if output_hidden_states:
+            all_hidden_states = all_hidden_states + (layer_output[0],)
+
+        if not return_dict:
+            return tuple(v for v in [layer_output[0], all_hidden_states, all_self_attentions] if v is not None)
+
+        return TFBaseModelOutput(
+            last_hidden_state=layer_output[0],
+            hidden_states=all_hidden_states,
+            attentions=all_self_attentions,
+        )
+
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "last_stage", None) is not None:
+            with tf.name_scope(self.last_stage.name):
+                self.last_stage.build(None)
+        for layer in self.intermediate_stages:
+            with tf.name_scope(layer.name):
+                layer.build(None)
+
+
+@keras_serializable
+class TFEfficientFormerMainLayer(keras.layers.Layer):
+    config_class = EfficientFormerConfig
+
+    def __init__(self, config: EfficientFormerConfig, **kwargs) -> None:
+        super().__init__(**kwargs)
+        self.config = config
+
+        self.patch_embed = TFEfficientFormerConvStem(config, config.hidden_sizes[0], name="patch_embed")
+        self.encoder = TFEfficientFormerEncoder(config, name="encoder")
+        self.layernorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layernorm")
+
+    @unpack_inputs
+    def call(
+        self,
+        pixel_values: Optional[tf.Tensor] = None,
+        output_attentions: Optional[tf.Tensor] = None,
+        output_hidden_states: Optional[tf.Tensor] = None,
+        return_dict: Optional[bool] = None,
+        training: bool = False,
+    ) -> Union[TFBaseModelOutput, Tuple[tf.Tensor, ...]]:
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        if pixel_values is None:
+            raise ValueError("You have to specify pixel_values")
+
+        # When running on CPU, keras.layers.Conv2D and keras.layers.AveragePooling2D do not
+        # support channels first NCHW format. A number of blocks contain both.
+        # So change the input format from (batch_size, num_channels, height, width) to
+        # (batch_size, height, width, num_channels) here.
+        # shape = (batch_size, in_height, in_width, in_channels=num_channels)
+        pixel_values = tf.transpose(pixel_values, perm=(0, 2, 3, 1))
+        embedding_output = self.patch_embed(pixel_values, training=training)
+
+        encoder_outputs = self.encoder(
+            hidden_states=embedding_output,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            training=training,
+        )
+
+        sequence_output = encoder_outputs[0]
+        sequence_output = self.layernorm(sequence_output, training=training)
+
+        # Change the hidden states from (batch_size, height, width, num_channels) to
+        # (batch_size, num_channels, height, width).
+        # The hidden states are in (batch_size, height, width, num_channels)
+        # shape after all stages except the MB3D blocks.
+        if output_hidden_states:
+            hidden_states = tuple([tf.transpose(h, perm=(0, 3, 1, 2)) for h in encoder_outputs[1][:-1]]) + (
+                encoder_outputs[1][-1],
+            )
+
+        if not return_dict:
+            head_outputs = (sequence_output,)
+            return head_outputs + encoder_outputs[1:]
+
+        return TFBaseModelOutput(
+            last_hidden_state=sequence_output,
+            hidden_states=hidden_states if output_hidden_states else encoder_outputs.hidden_states,
+            attentions=encoder_outputs.attentions,
+        )
+
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "patch_embed", None) is not None:
+            with tf.name_scope(self.patch_embed.name):
+                self.patch_embed.build(None)
+        if getattr(self, "encoder", None) is not None:
+            with tf.name_scope(self.encoder.name):
+                self.encoder.build(None)
+        if getattr(self, "layernorm", None) is not None:
+            with tf.name_scope(self.layernorm.name):
+                self.layernorm.build([None, None, self.config.hidden_sizes[-1]])
+
+
+class TFEfficientFormerPreTrainedModel(TFPreTrainedModel):
+    """
+    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
+    models.
+    """
+
+    config_class = EfficientFormerConfig
+    base_model_prefix = "efficientformer"
+    main_input_name = "pixel_values"
+
+
+EFFICIENTFORMER_START_DOCSTRING = r"""
+    This model is a TensorFlow
+    [keras.layers.Layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Layer). Use it as a regular
+    TensorFlow Module and refer to the TensorFlow documentation for all matters related to general usage and behavior.
+
+
+    Parameters:
+        config ([`EfficientFormerConfig`]): Model configuration class with all the parameters of the model.
+            Initializing with a config file does not load the weights associated with the model, only the
+            configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+EFFICIENTFORMER_INPUTS_DOCSTRING = r"""
+    Args:
+        pixel_values (`tf.Tensor` of shape `(batch_size, num_channels, height, width)`):
+            Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See
+            [`EfficientFormerImageProcessor.__call__`] for details.
+        output_attentions (`bool`, *optional*):
+            Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
+            tensors for more detail.
+        output_hidden_states (`bool`, *optional*):
+            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+            more detail.
+        return_dict (`bool`, *optional*):
+            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+"""
+
+
+@add_start_docstrings(
+    "The bare EfficientFormer Model transformer outputting raw hidden-states without any specific head on top.",
+    EFFICIENTFORMER_START_DOCSTRING,
+)
+class TFEfficientFormerModel(TFEfficientFormerPreTrainedModel):
+    def __init__(self, config: EfficientFormerConfig, **kwargs) -> None:
+        super().__init__(config, **kwargs)
+
+        self.efficientformer = TFEfficientFormerMainLayer(config, name="efficientformer")
+
+    @unpack_inputs
+    @add_start_docstrings_to_model_forward(EFFICIENTFORMER_INPUTS_DOCSTRING)
+    @add_code_sample_docstrings(
+        checkpoint=_CHECKPOINT_FOR_DOC,
+        output_type=TFBaseModelOutputWithPooling,
+        config_class=_CONFIG_FOR_DOC,
+        modality="vision",
+        expected_output=_EXPECTED_OUTPUT_SHAPE,
+    )
+    def call(
+        self,
+        pixel_values: Optional[tf.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        training: bool = False,
+    ) -> Union[Tuple, TFBaseModelOutput]:
+        outputs = self.efficientformer(
+            pixel_values=pixel_values,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            training=training,
+        )
+        return outputs
+
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "efficientformer", None) is not None:
+            with tf.name_scope(self.efficientformer.name):
+                self.efficientformer.build(None)
+
+
+@add_start_docstrings(
+    """
+    EfficientFormer Model transformer with an image classification head on top of pooled last hidden state, e.g. for
+    ImageNet.
+    """,
+    EFFICIENTFORMER_START_DOCSTRING,
+)
+class TFEfficientFormerForImageClassification(TFEfficientFormerPreTrainedModel, TFSequenceClassificationLoss):
+    def __init__(self, config: EfficientFormerConfig):
+        super().__init__(config)
+
+        self.num_labels = config.num_labels
+        self.efficientformer = TFEfficientFormerMainLayer(config, name="efficientformer")
+
+        # Classifier head
+        self.classifier = (
+            keras.layers.Dense(config.num_labels, name="classifier")
+            if config.num_labels > 0
+            else keras.layers.Activation("linear", name="classifier")
+        )
+        self.config = config
+
+    @unpack_inputs
+    @add_start_docstrings_to_model_forward(EFFICIENTFORMER_INPUTS_DOCSTRING)
+    @add_code_sample_docstrings(
+        checkpoint=_IMAGE_CLASS_CHECKPOINT,
+        output_type=TFImageClassifierOutput,
+        config_class=_CONFIG_FOR_DOC,
+        expected_output=_IMAGE_CLASS_EXPECTED_OUTPUT,
+    )
+    def call(
+        self,
+        pixel_values: Optional[tf.Tensor] = None,
+        labels: Optional[tf.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        training: bool = False,
+    ) -> Union[tf.Tensor, TFImageClassifierOutput]:
+        r"""
+        labels (`tf.Tensor` of shape `(batch_size,)`, *optional*):
+            Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
+            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
+            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        outputs = self.efficientformer(
+            pixel_values=pixel_values,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            training=training,
+        )
+
+        sequence_output = outputs[0]
+
+        logits = self.classifier(tf.reduce_mean(sequence_output, axis=-2))
+
+        loss = None if labels is None else self.hf_compute_loss(labels, logits)
+
+        if not return_dict:
+            output = (logits,) + outputs[1:]
+            return ((loss,) + output) if loss is not None else output
+
+        return TFImageClassifierOutput(
+            loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions
+        )
+
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "efficientformer", None) is not None:
+            with tf.name_scope(self.efficientformer.name):
+                self.efficientformer.build(None)
+        if getattr(self, "classifier", None) is not None:
+            if hasattr(self.classifier, "name"):
+                with tf.name_scope(self.classifier.name):
+                    self.classifier.build([None, None, self.config.hidden_sizes[-1]])
+
+
+@dataclass
+class TFEfficientFormerForImageClassificationWithTeacherOutput(ModelOutput):
+    """
+    Output type of [`EfficientFormerForImageClassificationWithTeacher`].
+
+    Args:
+        logits (`tf.Tensor` of shape `(batch_size, config.num_labels)`):
+            Prediction scores as the average of the cls_logits and distillation logits.
+        cls_logits (`tf.Tensor` of shape `(batch_size, config.num_labels)`):
+            Prediction scores of the classification head (i.e. the linear layer on top of the final hidden state of the
+            class token).
+        distillation_logits (`tf.Tensor` of shape `(batch_size, config.num_labels)`):
+            Prediction scores of the distillation head (i.e. the linear layer on top of the final hidden state of the
+            distillation token).
+        hidden_states (`tuple(tf.Tensor)`, *optional*, returned when `output_hidden_states=True` is passed or when
+        `config.output_hidden_states=True`):
+            Tuple of `tf.Tensor` (one for the output of the embeddings + one for the output of each layer) of shape
+            `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus
+            the initial embedding outputs.
+        attentions (`tuple(tf.Tensor)`, *optional*, returned when `output_attentions=True` is passed or when
+        `config.output_attentions=True`):
+            Tuple of `tf.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
+            sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
+            the self-attention heads.
+    """
+
+    logits: tf.Tensor = None
+    cls_logits: tf.Tensor = None
+    distillation_logits: tf.Tensor = None
+    hidden_states: Optional[Tuple[tf.Tensor]] = None
+    attentions: Optional[Tuple[tf.Tensor]] = None
+
+
+@add_start_docstrings(
+    """
+    EfficientFormer Model transformer with image classification heads on top (a linear layer on top of the final hidden
+    state and a linear layer on top of the final hidden state of the distillation token) e.g. for ImageNet.
+
+    .. warning::
+            This model supports inference-only. Fine-tuning with distillation (i.e. with a teacher) is not yet
+            supported.
+    """,
+    EFFICIENTFORMER_START_DOCSTRING,
+)
+class TFEfficientFormerForImageClassificationWithTeacher(TFEfficientFormerPreTrainedModel):
+    def __init__(self, config: EfficientFormerConfig) -> None:
+        super().__init__(config)
+
+        self.num_labels = config.num_labels
+        self.efficientformer = TFEfficientFormerMainLayer(config, name="efficientformer")
+
+        # Classifier heads
+        self.classifier = (
+            keras.layers.Dense(config.num_labels, name="classifier")
+            if config.num_labels > 0
+            else keras.layers.Activation("linear", name="classifier")
+        )
+        self.distillation_classifier = (
+            keras.layers.Dense(config.num_labels, name="distillation_classifier")
+            if config.num_labels > 0
+            else keras.layers.Activation("linear", name="distillation_classifier")
+        )
+
+    @unpack_inputs
+    @add_start_docstrings_to_model_forward(EFFICIENTFORMER_INPUTS_DOCSTRING)
+    @add_code_sample_docstrings(
+        checkpoint=_IMAGE_CLASS_CHECKPOINT,
+        output_type=TFEfficientFormerForImageClassificationWithTeacherOutput,
+        config_class=_CONFIG_FOR_DOC,
+        expected_output=_IMAGE_CLASS_EXPECTED_OUTPUT,
+    )
+    def call(
+        self,
+        pixel_values: Optional[tf.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        training: bool = False,
+    ) -> Union[tuple, TFEfficientFormerForImageClassificationWithTeacherOutput]:
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        if training:
+            raise Exception(
+                "This model supports inference-only. Fine-tuning with distillation (i.e. with a teacher) is not yet supported."
+            )
+
+        outputs = self.efficientformer(
+            pixel_values=pixel_values,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            training=training,
+        )
+
+        sequence_output = outputs[0]
+
+        cls_logits = self.classifier(tf.reduce_mean(sequence_output, axis=-2))
+        distillation_logits = self.distillation_classifier(tf.reduce_mean(sequence_output, axis=-2))
+        logits = (cls_logits + distillation_logits) / 2
+
+        if not return_dict:
+            output = (logits, cls_logits, distillation_logits) + outputs[1:]
+            return output
+
+        return TFEfficientFormerForImageClassificationWithTeacherOutput(
+            logits=logits,
+            cls_logits=cls_logits,
+            distillation_logits=distillation_logits,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "efficientformer", None) is not None:
+            with tf.name_scope(self.efficientformer.name):
+                self.efficientformer.build(None)
+        if getattr(self, "classifier", None) is not None:
+            if hasattr(self.classifier, "name"):
+                with tf.name_scope(self.classifier.name):
+                    self.classifier.build([None, None, self.config.hidden_sizes[-1]])
+        if getattr(self, "distillation_classifier", None) is not None:
+            if hasattr(self.distillation_classifier, "name"):
+                with tf.name_scope(self.distillation_classifier.name):
+                    self.distillation_classifier.build([None, None, self.config.hidden_sizes[-1]])
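+
+
+# Minimal usage sketch (illustrative; the checkpoint id below is an assumption and may not match the one
+# wired into `_CHECKPOINT_FOR_DOC` above):
+#
+#     import requests
+#     import tensorflow as tf
+#     from PIL import Image
+#     from transformers import AutoImageProcessor, TFEfficientFormerForImageClassification
+#
+#     image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
+#     processor = AutoImageProcessor.from_pretrained("snap-research/efficientformer-l1-300")
+#     model = TFEfficientFormerForImageClassification.from_pretrained("snap-research/efficientformer-l1-300")
+#     inputs = processor(images=image, return_tensors="tf")
+#     predicted_class = int(tf.math.argmax(model(**inputs).logits, axis=-1))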
diff --git a/src/transformers/models/efficientnet/__init__.py b/src/transformers/models/efficientnet/__init__.py
new file mode 100644
index 00000000000000..6df523721aefc5
--- /dev/null
+++ b/src/transformers/models/efficientnet/__init__.py
@@ -0,0 +1,84 @@
+# flake8: noqa
+# There's no way to ignore "F401 '...' imported but unused" warnings in this
+# module, but to preserve other warnings. So, don't check this module at all.
+
+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import TYPE_CHECKING
+
+# rely on isort to merge the imports
+from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available
+
+
+_import_structure = {
+    "configuration_efficientnet": [
+        "EFFICIENTNET_PRETRAINED_CONFIG_ARCHIVE_MAP",
+        "EfficientNetConfig",
+        "EfficientNetOnnxConfig",
+    ]
+}
+
+try:
+    if not is_vision_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["image_processing_efficientnet"] = ["EfficientNetImageProcessor"]
+
+try:
+    if not is_torch_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["modeling_efficientnet"] = [
+        "EFFICIENTNET_PRETRAINED_MODEL_ARCHIVE_LIST",
+        "EfficientNetForImageClassification",
+        "EfficientNetModel",
+        "EfficientNetPreTrainedModel",
+    ]
+
+if TYPE_CHECKING:
+    from .configuration_efficientnet import (
+        EFFICIENTNET_PRETRAINED_CONFIG_ARCHIVE_MAP,
+        EfficientNetConfig,
+        EfficientNetOnnxConfig,
+    )
+
+    try:
+        if not is_vision_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .image_processing_efficientnet import EfficientNetImageProcessor
+
+    try:
+        if not is_torch_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .modeling_efficientnet import (
+            EFFICIENTNET_PRETRAINED_MODEL_ARCHIVE_LIST,
+            EfficientNetForImageClassification,
+            EfficientNetModel,
+            EfficientNetPreTrainedModel,
+        )
+
+else:
+    import sys
+
+    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure)
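+
+# Usage sketch: the `_LazyModule` indirection above defers the heavy imports until a name is actually
+# accessed, e.g.:
+#
+#     from transformers.models.efficientnet import EfficientNetConfig  # does not require torch
+#     from transformers.models.efficientnet import EfficientNetModel   # torch is imported lazily at this point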
diff --git a/src/transformers/models/efficientnet/configuration_efficientnet.py b/src/transformers/models/efficientnet/configuration_efficientnet.py
new file mode 100644
index 00000000000000..49e50a45e11537
--- /dev/null
+++ b/src/transformers/models/efficientnet/configuration_efficientnet.py
@@ -0,0 +1,170 @@
+# coding=utf-8
+# Copyright 2023 Google Research, Inc. and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" EfficientNet model configuration"""
+
+from collections import OrderedDict
+from typing import List, Mapping
+
+from packaging import version
+
+from ...configuration_utils import PretrainedConfig
+from ...onnx import OnnxConfig
+from ...utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+EFFICIENTNET_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    "google/efficientnet-b7": "https://huggingface.co/google/efficientnet-b7/resolve/main/config.json",
+}
+
+
+class EfficientNetConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of an [`EfficientNetModel`]. It is used to instantiate an
+    EfficientNet model according to the specified arguments, defining the model architecture. Instantiating a
+    configuration with the defaults will yield a similar configuration to that of the EfficientNet
+    [google/efficientnet-b7](https://huggingface.co/google/efficientnet-b7) architecture.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+    Args:
+        num_channels (`int`, *optional*, defaults to 3):
+            The number of input channels.
+        image_size (`int`, *optional*, defaults to 600):
+            The input image size.
+        width_coefficient (`float`, *optional*, defaults to 2.0):
+            Scaling coefficient for network width at each stage.
+        depth_coefficient (`float`, *optional*, defaults to 3.1):
+            Scaling coefficient for network depth at each stage.
+        depth_divisor (`int`, *optional*, defaults to 8):
+            A unit of network width.
+        kernel_sizes (`List[int]`, *optional*, defaults to `[3, 3, 5, 3, 5, 5, 3]`):
+            List of kernel sizes to be used in each block.
+        in_channels (`List[int]`, *optional*, defaults to `[32, 16, 24, 40, 80, 112, 192]`):
+            List of input channel sizes to be used in each block for convolutional layers.
+        out_channels (`List[int]`, *optional*, defaults to `[16, 24, 40, 80, 112, 192, 320]`):
+            List of output channel sizes to be used in each block for convolutional layers.
+        depthwise_padding (`List[int]`, *optional*, defaults to `[]`):
+            List of block indices with square padding.
+        strides (`List[int]`, *optional*, defaults to `[1, 2, 2, 2, 1, 2, 1]`):
+            List of stride sizes to be used in each block for convolutional layers.
+        num_block_repeats (`List[int]`, *optional*, defaults to `[1, 2, 2, 3, 3, 4, 1]`):
+            List of the number of times each block is to be repeated.
+        expand_ratios (`List[int]`, *optional*, defaults to `[1, 6, 6, 6, 6, 6, 6]`):
+            List of scaling coefficients for each block.
+        squeeze_expansion_ratio (`float`, *optional*, defaults to 0.25):
+            Squeeze expansion ratio.
+        hidden_act (`str` or `function`, *optional*, defaults to `"swish"`):
+            The non-linear activation function (function or string) in each block. If string, `"gelu"`, `"relu"`,
+            `"selu"`, `"gelu_new"`, `"silu"` and `"mish"` are supported.
+        hidden_dim (`int`, *optional*, defaults to 2560):
+            The hidden dimension of the layer before the classification head.
+        pooling_type (`str` or `function`, *optional*, defaults to `"mean"`):
+            Type of final pooling to be applied before the dense classification head. Available options are [`"mean"`,
+            `"max"`]
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        batch_norm_eps (`float`, *optional*, defaults to 1e-3):
+            The epsilon used by the batch normalization layers.
+        batch_norm_momentum (`float`, *optional*, defaults to 0.99):
+            The momentum used by the batch normalization layers.
+        dropout_rate (`float`, *optional*, defaults to 0.5):
+            The dropout rate to be applied before final classifier layer.
+        drop_connect_rate (`float`, *optional*, defaults to 0.2):
+            The drop rate for skip connections.
+
+    Example:
+    ```python
+    >>> from transformers import EfficientNetConfig, EfficientNetModel
+
+    >>> # Initializing a EfficientNet efficientnet-b7 style configuration
+    >>> configuration = EfficientNetConfig()
+
+    >>> # Initializing a model (with random weights) from the efficientnet-b7 style configuration
+    >>> model = EfficientNetModel(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "efficientnet"
+
+    def __init__(
+        self,
+        num_channels: int = 3,
+        image_size: int = 600,
+        width_coefficient: float = 2.0,
+        depth_coefficient: float = 3.1,
+        depth_divisor: int = 8,
+        kernel_sizes: List[int] = [3, 3, 5, 3, 5, 5, 3],
+        in_channels: List[int] = [32, 16, 24, 40, 80, 112, 192],
+        out_channels: List[int] = [16, 24, 40, 80, 112, 192, 320],
+        depthwise_padding: List[int] = [],
+        strides: List[int] = [1, 2, 2, 2, 1, 2, 1],
+        num_block_repeats: List[int] = [1, 2, 2, 3, 3, 4, 1],
+        expand_ratios: List[int] = [1, 6, 6, 6, 6, 6, 6],
+        squeeze_expansion_ratio: float = 0.25,
+        hidden_act: str = "swish",
+        hidden_dim: int = 2560,
+        pooling_type: str = "mean",
+        initializer_range: float = 0.02,
+        batch_norm_eps: float = 0.001,
+        batch_norm_momentum: float = 0.99,
+        dropout_rate: float = 0.5,
+        drop_connect_rate: float = 0.2,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+
+        self.num_channels = num_channels
+        self.image_size = image_size
+        self.width_coefficient = width_coefficient
+        self.depth_coefficient = depth_coefficient
+        self.depth_divisor = depth_divisor
+        self.kernel_sizes = kernel_sizes
+        self.in_channels = in_channels
+        self.out_channels = out_channels
+        self.depthwise_padding = depthwise_padding
+        self.strides = strides
+        self.num_block_repeats = num_block_repeats
+        self.expand_ratios = expand_ratios
+        self.squeeze_expansion_ratio = squeeze_expansion_ratio
+        self.hidden_act = hidden_act
+        self.hidden_dim = hidden_dim
+        self.pooling_type = pooling_type
+        self.initializer_range = initializer_range
+        self.batch_norm_eps = batch_norm_eps
+        self.batch_norm_momentum = batch_norm_momentum
+        self.dropout_rate = dropout_rate
+        self.drop_connect_rate = drop_connect_rate
+        self.num_hidden_layers = sum(num_block_repeats) * 4
+
+
+class EfficientNetOnnxConfig(OnnxConfig):
+    torch_onnx_minimum_version = version.parse("1.11")
+
+    @property
+    def inputs(self) -> Mapping[str, Mapping[int, str]]:
+        return OrderedDict(
+            [
+                ("pixel_values", {0: "batch", 1: "num_channels", 2: "height", 3: "width"}),
+            ]
+        )
+
+    @property
+    def atol_for_validation(self) -> float:
+        return 1e-5
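+
+
+# Worked example (sketch) of the usual EfficientNet compound-scaling rule that these coefficients feed into:
+# with `width_coefficient=2.0` and `depth_divisor=8`, a base width of 32 channels is scaled to
+# 32 * 2.0 = 64 channels (already a multiple of 8, so no further rounding is needed), and with
+# `depth_coefficient=3.1` a block repeated 4 times in the base network is repeated ceil(4 * 3.1) = 13 times.
+# The exact rounding helpers live in `modeling_efficientnet.py`.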
diff --git a/src/transformers/models/efficientnet/convert_efficientnet_to_pytorch.py b/src/transformers/models/efficientnet/convert_efficientnet_to_pytorch.py
new file mode 100644
index 00000000000000..e9988524aca04d
--- /dev/null
+++ b/src/transformers/models/efficientnet/convert_efficientnet_to_pytorch.py
@@ -0,0 +1,339 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Convert EfficientNet checkpoints from the original repository.
+
+URL: https://github.com/keras-team/keras/blob/v2.11.0/keras/applications/efficientnet.py"""
+
+import argparse
+import json
+import os
+
+import numpy as np
+import PIL
+import requests
+import tensorflow.keras.applications.efficientnet as efficientnet
+import torch
+from huggingface_hub import hf_hub_download
+from PIL import Image
+from tensorflow.keras.preprocessing import image
+
+from transformers import (
+    EfficientNetConfig,
+    EfficientNetForImageClassification,
+    EfficientNetImageProcessor,
+)
+from transformers.utils import logging
+
+
+logging.set_verbosity_info()
+logger = logging.get_logger(__name__)
+
+model_classes = {
+    "b0": efficientnet.EfficientNetB0,
+    "b1": efficientnet.EfficientNetB1,
+    "b2": efficientnet.EfficientNetB2,
+    "b3": efficientnet.EfficientNetB3,
+    "b4": efficientnet.EfficientNetB4,
+    "b5": efficientnet.EfficientNetB5,
+    "b6": efficientnet.EfficientNetB6,
+    "b7": efficientnet.EfficientNetB7,
+}
+
+CONFIG_MAP = {
+    "b0": {
+        "hidden_dim": 1280,
+        "width_coef": 1.0,
+        "depth_coef": 1.0,
+        "image_size": 224,
+        "dropout_rate": 0.2,
+        "dw_padding": [],
+    },
+    "b1": {
+        "hidden_dim": 1280,
+        "width_coef": 1.0,
+        "depth_coef": 1.1,
+        "image_size": 240,
+        "dropout_rate": 0.2,
+        "dw_padding": [16],
+    },
+    "b2": {
+        "hidden_dim": 1408,
+        "width_coef": 1.1,
+        "depth_coef": 1.2,
+        "image_size": 260,
+        "dropout_rate": 0.3,
+        "dw_padding": [5, 8, 16],
+    },
+    "b3": {
+        "hidden_dim": 1536,
+        "width_coef": 1.2,
+        "depth_coef": 1.4,
+        "image_size": 300,
+        "dropout_rate": 0.3,
+        "dw_padding": [5, 18],
+    },
+    "b4": {
+        "hidden_dim": 1792,
+        "width_coef": 1.4,
+        "depth_coef": 1.8,
+        "image_size": 380,
+        "dropout_rate": 0.4,
+        "dw_padding": [6],
+    },
+    "b5": {
+        "hidden_dim": 2048,
+        "width_coef": 1.6,
+        "depth_coef": 2.2,
+        "image_size": 456,
+        "dropout_rate": 0.4,
+        "dw_padding": [13, 27],
+    },
+    "b6": {
+        "hidden_dim": 2304,
+        "width_coef": 1.8,
+        "depth_coef": 2.6,
+        "image_size": 528,
+        "dropout_rate": 0.5,
+        "dw_padding": [31],
+    },
+    "b7": {
+        "hidden_dim": 2560,
+        "width_coef": 2.0,
+        "depth_coef": 3.1,
+        "image_size": 600,
+        "dropout_rate": 0.5,
+        "dw_padding": [18],
+    },
+}
+
+
+def get_efficientnet_config(model_name):
+    config = EfficientNetConfig()
+    config.hidden_dim = CONFIG_MAP[model_name]["hidden_dim"]
+    config.width_coefficient = CONFIG_MAP[model_name]["width_coef"]
+    config.depth_coefficient = CONFIG_MAP[model_name]["depth_coef"]
+    config.image_size = CONFIG_MAP[model_name]["image_size"]
+    config.dropout_rate = CONFIG_MAP[model_name]["dropout_rate"]
+    config.depthwise_padding = CONFIG_MAP[model_name]["dw_padding"]
+
+    repo_id = "huggingface/label-files"
+    filename = "imagenet-1k-id2label.json"
+    config.num_labels = 1000
+    id2label = json.load(open(hf_hub_download(repo_id, filename, repo_type="dataset"), "r"))
+    id2label = {int(k): v for k, v in id2label.items()}
+
+    config.id2label = id2label
+    config.label2id = {v: k for k, v in id2label.items()}
+    return config
+
+
+# We will verify our results on an image of cute cats
+def prepare_img():
+    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+    im = Image.open(requests.get(url, stream=True).raw)
+    return im
+
+
+def convert_image_processor(model_name):
+    size = CONFIG_MAP[model_name]["image_size"]
+    preprocessor = EfficientNetImageProcessor(
+        size={"height": size, "width": size},
+        image_mean=[0.485, 0.456, 0.406],
+        image_std=[0.47853944, 0.4732864, 0.47434163],
+        do_center_crop=False,
+    )
+    return preprocessor
+
+
+# here we list all keys to be renamed (original name on the left, our name on the right)
+def rename_keys(original_param_names):
+    block_names = [v.split("_")[0].split("block")[1] for v in original_param_names if v.startswith("block")]
+    block_names = sorted(set(block_names))
+    num_blocks = len(block_names)
+    block_name_mapping = {b: str(i) for b, i in zip(block_names, range(num_blocks))}
+
+    rename_keys = []
+    rename_keys.append(("stem_conv/kernel:0", "embeddings.convolution.weight"))
+    rename_keys.append(("stem_bn/gamma:0", "embeddings.batchnorm.weight"))
+    rename_keys.append(("stem_bn/beta:0", "embeddings.batchnorm.bias"))
+    rename_keys.append(("stem_bn/moving_mean:0", "embeddings.batchnorm.running_mean"))
+    rename_keys.append(("stem_bn/moving_variance:0", "embeddings.batchnorm.running_var"))
+
+    for b in block_names:
+        hf_b = block_name_mapping[b]
+        rename_keys.append((f"block{b}_expand_conv/kernel:0", f"encoder.blocks.{hf_b}.expansion.expand_conv.weight"))
+        rename_keys.append((f"block{b}_expand_bn/gamma:0", f"encoder.blocks.{hf_b}.expansion.expand_bn.weight"))
+        rename_keys.append((f"block{b}_expand_bn/beta:0", f"encoder.blocks.{hf_b}.expansion.expand_bn.bias"))
+        rename_keys.append(
+            (f"block{b}_expand_bn/moving_mean:0", f"encoder.blocks.{hf_b}.expansion.expand_bn.running_mean")
+        )
+        rename_keys.append(
+            (f"block{b}_expand_bn/moving_variance:0", f"encoder.blocks.{hf_b}.expansion.expand_bn.running_var")
+        )
+        rename_keys.append(
+            (f"block{b}_dwconv/depthwise_kernel:0", f"encoder.blocks.{hf_b}.depthwise_conv.depthwise_conv.weight")
+        )
+        rename_keys.append((f"block{b}_bn/gamma:0", f"encoder.blocks.{hf_b}.depthwise_conv.depthwise_norm.weight"))
+        rename_keys.append((f"block{b}_bn/beta:0", f"encoder.blocks.{hf_b}.depthwise_conv.depthwise_norm.bias"))
+        rename_keys.append(
+            (f"block{b}_bn/moving_mean:0", f"encoder.blocks.{hf_b}.depthwise_conv.depthwise_norm.running_mean")
+        )
+        rename_keys.append(
+            (f"block{b}_bn/moving_variance:0", f"encoder.blocks.{hf_b}.depthwise_conv.depthwise_norm.running_var")
+        )
+
+        rename_keys.append((f"block{b}_se_reduce/kernel:0", f"encoder.blocks.{hf_b}.squeeze_excite.reduce.weight"))
+        rename_keys.append((f"block{b}_se_reduce/bias:0", f"encoder.blocks.{hf_b}.squeeze_excite.reduce.bias"))
+        rename_keys.append((f"block{b}_se_expand/kernel:0", f"encoder.blocks.{hf_b}.squeeze_excite.expand.weight"))
+        rename_keys.append((f"block{b}_se_expand/bias:0", f"encoder.blocks.{hf_b}.squeeze_excite.expand.bias"))
+        rename_keys.append(
+            (f"block{b}_project_conv/kernel:0", f"encoder.blocks.{hf_b}.projection.project_conv.weight")
+        )
+        rename_keys.append((f"block{b}_project_bn/gamma:0", f"encoder.blocks.{hf_b}.projection.project_bn.weight"))
+        rename_keys.append((f"block{b}_project_bn/beta:0", f"encoder.blocks.{hf_b}.projection.project_bn.bias"))
+        rename_keys.append(
+            (f"block{b}_project_bn/moving_mean:0", f"encoder.blocks.{hf_b}.projection.project_bn.running_mean")
+        )
+        rename_keys.append(
+            (f"block{b}_project_bn/moving_variance:0", f"encoder.blocks.{hf_b}.projection.project_bn.running_var")
+        )
+
+    rename_keys.append(("top_conv/kernel:0", "encoder.top_conv.weight"))
+    rename_keys.append(("top_bn/gamma:0", "encoder.top_bn.weight"))
+    rename_keys.append(("top_bn/beta:0", "encoder.top_bn.bias"))
+    rename_keys.append(("top_bn/moving_mean:0", "encoder.top_bn.running_mean"))
+    rename_keys.append(("top_bn/moving_variance:0", "encoder.top_bn.running_var"))
+
+    key_mapping = {}
+    for item in rename_keys:
+        if item[0] in original_param_names:
+            key_mapping[item[0]] = "efficientnet." + item[1]
+
+    key_mapping["predictions/kernel:0"] = "classifier.weight"
+    key_mapping["predictions/bias:0"] = "classifier.bias"
+    return key_mapping
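+
+# A few representative entries of the resulting mapping (illustrative; the block index `<i>` depends on
+# the model variant):
+#     "stem_conv/kernel:0"           -> "efficientnet.embeddings.convolution.weight"
+#     "block2a_expand_conv/kernel:0" -> "efficientnet.encoder.blocks.<i>.expansion.expand_conv.weight"
+#     "predictions/kernel:0"         -> "classifier.weight"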
+
+
+def replace_params(hf_params, tf_params, key_mapping):
+    for key, value in tf_params.items():
+        if "normalization" in key:
+            continue
+
+        hf_key = key_mapping[key]
+        if "_conv" in key and "kernel" in key:
+            new_hf_value = torch.from_numpy(value).permute(3, 2, 0, 1)
+        elif "depthwise_kernel" in key:
+            new_hf_value = torch.from_numpy(value).permute(2, 3, 0, 1)
+        elif "kernel" in key:
+            new_hf_value = torch.from_numpy(np.transpose(value))
+        else:
+            new_hf_value = torch.from_numpy(value)
+
+        # Replace HF parameters with original TF model parameters
+        assert hf_params[hf_key].shape == new_hf_value.shape
+        hf_params[hf_key].copy_(new_hf_value)
+
+
+@torch.no_grad()
+def convert_efficientnet_checkpoint(model_name, pytorch_dump_folder_path, save_model, push_to_hub):
+    """
+    Copy/paste/tweak model's weights to our EfficientNet structure.
+    """
+    # Load original model
+    original_model = model_classes[model_name](
+        include_top=True,
+        weights="imagenet",
+        input_tensor=None,
+        input_shape=None,
+        pooling=None,
+        classes=1000,
+        classifier_activation="softmax",
+    )
+
+    tf_params = original_model.trainable_variables
+    tf_non_train_params = original_model.non_trainable_variables
+    tf_params = {param.name: param.numpy() for param in tf_params}
+    for param in tf_non_train_params:
+        tf_params[param.name] = param.numpy()
+    tf_param_names = list(tf_params.keys())
+
+    # Load HuggingFace model
+    config = get_efficientnet_config(model_name)
+    hf_model = EfficientNetForImageClassification(config).eval()
+    hf_params = hf_model.state_dict()
+
+    # Create src-to-dst parameter name mapping dictionary
+    print("Converting parameters...")
+    key_mapping = rename_keys(tf_param_names)
+    replace_params(hf_params, tf_params, key_mapping)
+
+    # Initialize preprocessor and preprocess input image
+    preprocessor = convert_image_processor(model_name)
+    inputs = preprocessor(images=prepare_img(), return_tensors="pt")
+
+    # HF model inference
+    hf_model.eval()
+    with torch.no_grad():
+        outputs = hf_model(**inputs)
+    hf_logits = outputs.logits.detach().numpy()
+
+    # Original model inference
+    original_model.trainable = False
+    image_size = CONFIG_MAP[model_name]["image_size"]
+    img = prepare_img().resize((image_size, image_size), resample=PIL.Image.NEAREST)
+    x = image.img_to_array(img)
+    x = np.expand_dims(x, axis=0)
+    original_logits = original_model.predict(x)
+
+    # Check whether original and HF model outputs match  -> np.allclose
+    assert np.allclose(original_logits, hf_logits, atol=1e-3), "The predicted logits are not the same."
+    print("Model outputs match!")
+
+    if save_model:
+        # Create folder to save model
+        if not os.path.isdir(pytorch_dump_folder_path):
+            os.mkdir(pytorch_dump_folder_path)
+        # Save converted model and image processor
+        hf_model.save_pretrained(pytorch_dump_folder_path)
+        preprocessor.save_pretrained(pytorch_dump_folder_path)
+
+    if push_to_hub:
+        # Push model and image processor to hub
+        print(f"Pushing converted {model_name} to the hub...")
+        model_name = f"efficientnet-{model_name}"
+        preprocessor.push_to_hub(model_name)
+        hf_model.push_to_hub(model_name)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    # Required parameters
+    parser.add_argument(
+        "--model_name",
+        default="b0",
+        type=str,
+        help="Version name of the EfficientNet model you want to convert, select from [b0, b1, b2, b3, b4, b5, b6, b7].",
+    )
+    parser.add_argument(
+        "--pytorch_dump_folder_path",
+        default="hf_model",
+        type=str,
+        help="Path to the output PyTorch model directory.",
+    )
+    parser.add_argument("--save_model", action="store_true", help="Save model to local")
+    parser.add_argument("--push_to_hub", action="store_true", help="Push model and image processor to the hub")
+
+    args = parser.parse_args()
+    convert_efficientnet_checkpoint(args.model_name, args.pytorch_dump_folder_path, args.save_model, args.push_to_hub)
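+
+# Example invocation (sketch): convert the b0 weights, save them locally, and verify that the logits match:
+#
+#     python convert_efficientnet_to_pytorch.py --model_name b0 --pytorch_dump_folder_path hf_model --save_model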
diff --git a/src/transformers/models/efficientnet/image_processing_efficientnet.py b/src/transformers/models/efficientnet/image_processing_efficientnet.py
new file mode 100644
index 00000000000000..ee4690e0fb9cc4
--- /dev/null
+++ b/src/transformers/models/efficientnet/image_processing_efficientnet.py
@@ -0,0 +1,366 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Image processor class for EfficientNet."""
+
+from typing import Dict, List, Optional, Union
+
+import numpy as np
+
+from ...image_processing_utils import BaseImageProcessor, BatchFeature, get_size_dict
+from ...image_transforms import rescale, resize, to_channel_dimension_format
+from ...image_utils import (
+    IMAGENET_STANDARD_MEAN,
+    IMAGENET_STANDARD_STD,
+    ChannelDimension,
+    ImageInput,
+    PILImageResampling,
+    infer_channel_dimension_format,
+    is_scaled_image,
+    make_list_of_images,
+    to_numpy_array,
+    valid_images,
+    validate_preprocess_arguments,
+)
+from ...utils import TensorType, is_vision_available, logging
+
+
+if is_vision_available():
+    import PIL
+
+
+logger = logging.get_logger(__name__)
+
+
+class EfficientNetImageProcessor(BaseImageProcessor):
+    r"""
+    Constructs an EfficientNet image processor.
+
+    Args:
+        do_resize (`bool`, *optional*, defaults to `True`):
+            Whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by
+            `do_resize` in `preprocess`.
+        size (`Dict[str, int]`, *optional*, defaults to `{"height": 346, "width": 346}`):
+            Size of the image after `resize`. Can be overridden by `size` in `preprocess`.
+        resample (`PILImageResampling` filter, *optional*, defaults to `PILImageResampling.NEAREST`):
+            Resampling filter to use if resizing the image. Can be overridden by `resample` in `preprocess`.
+        do_center_crop (`bool`, *optional*, defaults to `False`):
+            Whether to center crop the image. If the input size is smaller than `crop_size` along any edge, the image
+            is padded with 0's and then center cropped. Can be overridden by `do_center_crop` in `preprocess`.
+        crop_size (`Dict[str, int]`, *optional*, defaults to `{"height": 289, "width": 289}`):
+            Desired output size when applying center-cropping. Can be overridden by `crop_size` in `preprocess`.
+        rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
+            Scale factor to use if rescaling the image. Can be overridden by the `rescale_factor` parameter in the
+            `preprocess` method.
+        rescale_offset (`bool`, *optional*, defaults to `False`):
+            Whether to rescale the image between [-scale_range, scale_range] instead of [0, scale_range]. Can be
+            overridden by the `rescale_offset` parameter in the `preprocess` method.
+        do_rescale (`bool`, *optional*, defaults to `True`):
+            Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the `do_rescale`
+            parameter in the `preprocess` method.
+        do_normalize (`bool`, *optional*, defaults to `True`):
+            Whether to normalize the image. Can be overridden by the `do_normalize` parameter in the `preprocess`
+            method.
+        image_mean (`float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_MEAN`):
+            Mean to use if normalizing the image. This is a float or list of floats the length of the number of
+            channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
+        image_std (`float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_STD`):
+            Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
+            number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
+        include_top (`bool`, *optional*, defaults to `True`):
+            Whether to rescale the image again. Should be set to True if the inputs are used for image classification.
+    """
+
+    model_input_names = ["pixel_values"]
+
+    def __init__(
+        self,
+        do_resize: bool = True,
+        size: Dict[str, int] = None,
+        resample: PILImageResampling = PIL.Image.NEAREST,
+        do_center_crop: bool = False,
+        crop_size: Dict[str, int] = None,
+        rescale_factor: Union[int, float] = 1 / 255,
+        rescale_offset: bool = False,
+        do_rescale: bool = True,
+        do_normalize: bool = True,
+        image_mean: Optional[Union[float, List[float]]] = None,
+        image_std: Optional[Union[float, List[float]]] = None,
+        include_top: bool = True,
+        **kwargs,
+    ) -> None:
+        super().__init__(**kwargs)
+        size = size if size is not None else {"height": 346, "width": 346}
+        size = get_size_dict(size)
+        crop_size = crop_size if crop_size is not None else {"height": 289, "width": 289}
+        crop_size = get_size_dict(crop_size, param_name="crop_size")
+
+        self.do_resize = do_resize
+        self.size = size
+        self.resample = resample
+        self.do_center_crop = do_center_crop
+        self.crop_size = crop_size
+        self.do_rescale = do_rescale
+        self.rescale_factor = rescale_factor
+        self.rescale_offset = rescale_offset
+        self.do_normalize = do_normalize
+        self.image_mean = image_mean if image_mean is not None else IMAGENET_STANDARD_MEAN
+        self.image_std = image_std if image_std is not None else IMAGENET_STANDARD_STD
+        self.include_top = include_top
+
+    # Copied from transformers.models.vit.image_processing_vit.ViTImageProcessor.resize with PILImageResampling.BILINEAR->PILImageResampling.NEAREST
+    def resize(
+        self,
+        image: np.ndarray,
+        size: Dict[str, int],
+        resample: PILImageResampling = PILImageResampling.NEAREST,
+        data_format: Optional[Union[str, ChannelDimension]] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
+        **kwargs,
+    ) -> np.ndarray:
+        """
+        Resize an image to `(size["height"], size["width"])`.
+
+        Args:
+            image (`np.ndarray`):
+                Image to resize.
+            size (`Dict[str, int]`):
+                Dictionary in the format `{"height": int, "width": int}` specifying the size of the output image.
+            resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.NEAREST`):
+                `PILImageResampling` filter to use when resizing the image e.g. `PILImageResampling.NEAREST`.
+            data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format for the output image. If unset, the channel dimension format of the input
+                image is used. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format for the input image. If unset, the channel dimension format is inferred
+                from the input image. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
+
+        Returns:
+            `np.ndarray`: The resized image.
+        """
+        size = get_size_dict(size)
+        if "height" not in size or "width" not in size:
+            raise ValueError(f"The `size` dictionary must contain the keys `height` and `width`. Got {size.keys()}")
+        output_size = (size["height"], size["width"])
+        return resize(
+            image,
+            size=output_size,
+            resample=resample,
+            data_format=data_format,
+            input_data_format=input_data_format,
+            **kwargs,
+        )
+
+    def rescale(
+        self,
+        image: np.ndarray,
+        scale: Union[int, float],
+        offset: bool = True,
+        data_format: Optional[Union[str, ChannelDimension]] = None,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
+        **kwargs,
+    ):
+        """
+        Rescale an image by a scale factor.
+
+        If `offset` is `True`, the image has its values rescaled by `scale` and then offset by -1. If `scale` is
+        1/127.5, the image is rescaled between [-1, 1].
+            image = image * scale - 1
+
+        If `offset` is `False` and `scale` is 1/255, the image is rescaled between [0, 1].
+            image = image * scale
+
+        Args:
+            image (`np.ndarray`):
+                Image to rescale.
+            scale (`int` or `float`):
+                Scale to apply to the image.
+            offset (`bool`, *optional*):
+                Whether to scale the image in both negative and positive directions.
+            data_format (`str` or `ChannelDimension`, *optional*):
+                The channel dimension format of the image. If not provided, it will be the same as the input image.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format of the input image. If not provided, it will be inferred.
+        """
+        rescaled_image = rescale(
+            image, scale=scale, data_format=data_format, input_data_format=input_data_format, **kwargs
+        )
+
+        if offset:
+            rescaled_image = rescaled_image - 1
+
+        return rescaled_image
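+
+    # Worked example (sketch): with `scale=1/127.5` and `offset=True`, a pixel value of 255 maps to
+    # 255 / 127.5 - 1 = 1.0 and a pixel value of 0 maps to -1.0, matching the [-1, 1] range described above.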
+
+    def preprocess(
+        self,
+        images: ImageInput,
+        do_resize: bool = None,
+        size: Dict[str, int] = None,
+        resample=None,
+        do_center_crop: bool = None,
+        crop_size: Dict[str, int] = None,
+        do_rescale: bool = None,
+        rescale_factor: float = None,
+        rescale_offset: bool = None,
+        do_normalize: bool = None,
+        image_mean: Optional[Union[float, List[float]]] = None,
+        image_std: Optional[Union[float, List[float]]] = None,
+        include_top: bool = None,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+        data_format: ChannelDimension = ChannelDimension.FIRST,
+        input_data_format: Optional[Union[str, ChannelDimension]] = None,
+        **kwargs,
+    ) -> BatchFeature:
+        """
+        Preprocess an image or batch of images.
+
+        Args:
+            images (`ImageInput`):
+                Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
+                passing in images with pixel values between 0 and 1, set `do_rescale=False`.
+            do_resize (`bool`, *optional*, defaults to `self.do_resize`):
+                Whether to resize the image.
+            size (`Dict[str, int]`, *optional*, defaults to `self.size`):
+                Size of the image after `resize`.
+            resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
+                PILImageResampling filter to use if resizing the image. Only has an effect if `do_resize` is set to
+                `True`.
+            do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`):
+                Whether to center crop the image.
+            crop_size (`Dict[str, int]`, *optional*, defaults to `self.crop_size`):
+                Size of the image after center crop. If one edge of the image is smaller than `crop_size`, it will be
+                padded with zeros and then cropped.
+            do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
+                Whether to rescale the image values between [0 - 1].
+            rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
+                Rescale factor to rescale the image by if `do_rescale` is set to `True`.
+            rescale_offset (`bool`, *optional*, defaults to `self.rescale_offset`):
+                Whether to rescale the image between [-scale_range, scale_range] instead of [0, scale_range].
+            do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
+                Whether to normalize the image.
+            image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
+                Image mean.
+            image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
+                Image standard deviation.
+            include_top (`bool`, *optional*, defaults to `self.include_top`):
+                Rescales the image again for image classification if set to True.
+            return_tensors (`str` or `TensorType`, *optional*):
+                The type of tensors to return. Can be one of:
+                    - `None`: Return a list of `np.ndarray`.
+                    - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
+                    - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
+                    - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
+                    - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
+            data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
+                The channel dimension format for the output image. Can be one of:
+                    - `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                    - `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+            input_data_format (`ChannelDimension` or `str`, *optional*):
+                The channel dimension format for the input image. If unset, the channel dimension format is inferred
+                from the input image. Can be one of:
+                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
+                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
+                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
+        """
+        do_resize = do_resize if do_resize is not None else self.do_resize
+        resample = resample if resample is not None else self.resample
+        do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop
+        do_rescale = do_rescale if do_rescale is not None else self.do_rescale
+        rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
+        rescale_offset = rescale_offset if rescale_offset is not None else self.rescale_offset
+        do_normalize = do_normalize if do_normalize is not None else self.do_normalize
+        image_mean = image_mean if image_mean is not None else self.image_mean
+        image_std = image_std if image_std is not None else self.image_std
+        include_top = include_top if include_top is not None else self.include_top
+
+        size = size if size is not None else self.size
+        size = get_size_dict(size)
+        crop_size = crop_size if crop_size is not None else self.crop_size
+        crop_size = get_size_dict(crop_size, param_name="crop_size")
+
+        images = make_list_of_images(images)
+
+        if not valid_images(images):
+            raise ValueError(
+                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
+                "torch.Tensor, tf.Tensor or jax.ndarray."
+            )
+        validate_preprocess_arguments(
+            do_rescale=do_rescale,
+            rescale_factor=rescale_factor,
+            do_normalize=do_normalize,
+            image_mean=image_mean,
+            image_std=image_std,
+            do_center_crop=do_center_crop,
+            crop_size=crop_size,
+            do_resize=do_resize,
+            size=size,
+            resample=resample,
+        )
+        # All transformations expect numpy arrays.
+        images = [to_numpy_array(image) for image in images]
+
+        if is_scaled_image(images[0]) and do_rescale:
+            logger.warning_once(
+                "It looks like you are trying to rescale already rescaled images. If the input"
+                " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
+            )
+
+        if input_data_format is None:
+            # We assume that all images have the same channel dimension format.
+            input_data_format = infer_channel_dimension_format(images[0])
+
+        if do_resize:
+            images = [
+                self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format)
+                for image in images
+            ]
+
+        if do_center_crop:
+            images = [
+                self.center_crop(image=image, size=crop_size, input_data_format=input_data_format) for image in images
+            ]
+
+        if do_rescale:
+            images = [
+                self.rescale(
+                    image=image, scale=rescale_factor, offset=rescale_offset, input_data_format=input_data_format
+                )
+                for image in images
+            ]
+
+        if do_normalize:
+            images = [
+                self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
+                for image in images
+            ]
+
+        if include_top:
+            images = [
+                self.normalize(image=image, mean=0, std=image_std, input_data_format=input_data_format)
+                for image in images
+            ]
+
+        images = [
+            to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
+        ]
+
+        data = {"pixel_values": images}
+        return BatchFeature(data=data, tensor_type=return_tensors)
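For reference, a minimal usage sketch of the preprocessing pipeline defined above (resize, center crop, rescale with optional offset, normalize, plus the EfficientNet-specific `include_top` re-normalization). It assumes the `EfficientNetImageProcessor` class added by this patch is importable and that the `google/efficientnet-b7` checkpoint referenced below ships an image processor config:

```python
import numpy as np

from transformers import EfficientNetImageProcessor

processor = EfficientNetImageProcessor.from_pretrained("google/efficientnet-b7")

# Any PIL.Image, np.ndarray or torch.Tensor is accepted; a random HWC uint8 array stands in for an image here.
image = np.random.randint(0, 256, size=(600, 600, 3), dtype=np.uint8)

batch = processor(images=image, return_tensors="pt")
print(batch["pixel_values"].shape)  # (1, 3, crop_height, crop_width), per the processor's crop_size
```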
diff --git a/src/transformers/models/efficientnet/modeling_efficientnet.py b/src/transformers/models/efficientnet/modeling_efficientnet.py
new file mode 100644
index 00000000000000..2513f9b2fde142
--- /dev/null
+++ b/src/transformers/models/efficientnet/modeling_efficientnet.py
@@ -0,0 +1,649 @@
+# coding=utf-8
+# Copyright 2023 Google Research, Inc. and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" PyTorch EfficientNet model."""
+
+
+import math
+from typing import Optional, Tuple, Union
+
+import torch
+import torch.utils.checkpoint
+from torch import nn
+from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
+
+from ...activations import ACT2FN
+from ...modeling_outputs import (
+    BaseModelOutputWithNoAttention,
+    BaseModelOutputWithPoolingAndNoAttention,
+    ImageClassifierOutputWithNoAttention,
+)
+from ...modeling_utils import PreTrainedModel
+from ...utils import (
+    add_code_sample_docstrings,
+    add_start_docstrings,
+    add_start_docstrings_to_model_forward,
+    logging,
+)
+from .configuration_efficientnet import EfficientNetConfig
+
+
+logger = logging.get_logger(__name__)
+
+# General docstring
+_CONFIG_FOR_DOC = "EfficientNetConfig"
+
+# Base docstring
+_CHECKPOINT_FOR_DOC = "google/efficientnet-b7"
+_EXPECTED_OUTPUT_SHAPE = [1, 768, 7, 7]
+
+# Image classification docstring
+_IMAGE_CLASS_CHECKPOINT = "google/efficientnet-b7"
+_IMAGE_CLASS_EXPECTED_OUTPUT = "tabby, tabby cat"
+
+EFFICIENTNET_PRETRAINED_MODEL_ARCHIVE_LIST = [
+    "google/efficientnet-b7",
+    # See all EfficientNet models at https://huggingface.co/models?filter=efficientnet
+]
+
+
+EFFICIENTNET_START_DOCSTRING = r"""
+    This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass. Use it
+    as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and
+    behavior.
+
+    Parameters:
+        config ([`EfficientNetConfig`]): Model configuration class with all the parameters of the model.
+            Initializing with a config file does not load the weights associated with the model, only the
+            configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+EFFICIENTNET_INPUTS_DOCSTRING = r"""
+    Args:
+        pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
+            Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See
+            [`AutoImageProcessor.__call__`] for details.
+
+        output_hidden_states (`bool`, *optional*):
+            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+            more detail.
+        return_dict (`bool`, *optional*):
+            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+"""
+
+
+def round_filters(config: EfficientNetConfig, num_channels: int):
+    r"""
+    Round number of filters based on depth multiplier.
+    """
+    divisor = config.depth_divisor
+    num_channels *= config.width_coefficient
+    new_dim = max(divisor, int(num_channels + divisor / 2) // divisor * divisor)
+
+    # Make sure that round down does not go down by more than 10%.
+    if new_dim < 0.9 * num_channels:
+        new_dim += divisor
+
+    return int(new_dim)
+
+
+def correct_pad(kernel_size: Union[int, Tuple], adjust: bool = True):
+    r"""
+    Utility function to get the tuple padding value for the depthwise convolution.
+
+    Args:
+        kernel_size (`int` or `tuple`):
+            Kernel size of the convolution layers.
+        adjust (`bool`, *optional*, defaults to `True`):
+            If `True`, pad one pixel less on the left and top sides so that the extra padding falls on the right and
+            bottom sides of the input.
+    """
+    if isinstance(kernel_size, int):
+        kernel_size = (kernel_size, kernel_size)
+
+    correct = (kernel_size[0] // 2, kernel_size[1] // 2)
+    if adjust:
+        return (correct[1] - 1, correct[1], correct[0] - 1, correct[0])
+    else:
+        return (correct[1], correct[1], correct[0], correct[0])
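To make the two helpers above concrete, here is a self-contained sketch of the arithmetic with illustrative values (`width_coefficient=2.0` and `depth_divisor=8` are assumptions for the example, not taken from a released config; the padding demo covers square kernels only):

```python
# Standalone copy of the rounding/padding arithmetic above, for illustration only.
def round_filters_demo(num_channels: int, width_coefficient: float = 2.0, divisor: int = 8) -> int:
    num_channels *= width_coefficient
    new_dim = max(divisor, int(num_channels + divisor / 2) // divisor * divisor)
    if new_dim < 0.9 * num_channels:  # never round down by more than 10%
        new_dim += divisor
    return int(new_dim)


def correct_pad_demo(kernel_size: int, adjust: bool = True):
    pad = kernel_size // 2
    # (left, right, top, bottom) order, as expected by nn.ZeroPad2d
    return (pad - 1, pad, pad - 1, pad) if adjust else (pad, pad, pad, pad)


print(round_filters_demo(32))             # 64: 32 * 2.0 is already a multiple of 8
print(round_filters_demo(40, 1.2))        # 48: 48.0 rounds to the nearest multiple of 8
print(correct_pad_demo(3))                # (0, 1, 0, 1): pad only the right and bottom for a 3x3 kernel
print(correct_pad_demo(5, adjust=False))  # (2, 2, 2, 2): symmetric padding for a 5x5 kernel
```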
+
+
+class EfficientNetEmbeddings(nn.Module):
+    r"""
+    A module that corresponds to the stem module of the original work.
+    """
+
+    def __init__(self, config: EfficientNetConfig):
+        super().__init__()
+
+        self.out_dim = round_filters(config, 32)
+        self.padding = nn.ZeroPad2d(padding=(0, 1, 0, 1))
+        self.convolution = nn.Conv2d(
+            config.num_channels, self.out_dim, kernel_size=3, stride=2, padding="valid", bias=False
+        )
+        self.batchnorm = nn.BatchNorm2d(self.out_dim, eps=config.batch_norm_eps, momentum=config.batch_norm_momentum)
+        self.activation = ACT2FN[config.hidden_act]
+
+    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
+        features = self.padding(pixel_values)
+        features = self.convolution(features)
+        features = self.batchnorm(features)
+        features = self.activation(features)
+
+        return features
+
+
+class EfficientNetDepthwiseConv2d(nn.Conv2d):
+    def __init__(
+        self,
+        in_channels,
+        depth_multiplier=1,
+        kernel_size=3,
+        stride=1,
+        padding=0,
+        dilation=1,
+        bias=True,
+        padding_mode="zeros",
+    ):
+        out_channels = in_channels * depth_multiplier
+        super().__init__(
+            in_channels=in_channels,
+            out_channels=out_channels,
+            kernel_size=kernel_size,
+            stride=stride,
+            padding=padding,
+            dilation=dilation,
+            groups=in_channels,
+            bias=bias,
+            padding_mode=padding_mode,
+        )
+
+
+class EfficientNetExpansionLayer(nn.Module):
+    r"""
+    This corresponds to the expansion phase of each block in the original implementation.
+    """
+
+    def __init__(self, config: EfficientNetConfig, in_dim: int, out_dim: int, stride: int):
+        super().__init__()
+        self.expand_conv = nn.Conv2d(
+            in_channels=in_dim,
+            out_channels=out_dim,
+            kernel_size=1,
+            padding="same",
+            bias=False,
+        )
+        self.expand_bn = nn.BatchNorm2d(num_features=out_dim, eps=config.batch_norm_eps)
+        self.expand_act = ACT2FN[config.hidden_act]
+
+    def forward(self, hidden_states: torch.FloatTensor) -> torch.Tensor:
+        # Expand phase
+        hidden_states = self.expand_conv(hidden_states)
+        hidden_states = self.expand_bn(hidden_states)
+        hidden_states = self.expand_act(hidden_states)
+
+        return hidden_states
+
+
+class EfficientNetDepthwiseLayer(nn.Module):
+    r"""
+    This corresponds to the depthwise convolution phase of each block in the original implementation.
+    """
+
+    def __init__(
+        self,
+        config: EfficientNetConfig,
+        in_dim: int,
+        stride: int,
+        kernel_size: int,
+        adjust_padding: bool,
+    ):
+        super().__init__()
+        self.stride = stride
+        conv_pad = "valid" if self.stride == 2 else "same"
+        padding = correct_pad(kernel_size, adjust=adjust_padding)
+
+        self.depthwise_conv_pad = nn.ZeroPad2d(padding=padding)
+        self.depthwise_conv = EfficientNetDepthwiseConv2d(
+            in_dim, kernel_size=kernel_size, stride=stride, padding=conv_pad, bias=False
+        )
+        self.depthwise_norm = nn.BatchNorm2d(
+            num_features=in_dim, eps=config.batch_norm_eps, momentum=config.batch_norm_momentum
+        )
+        self.depthwise_act = ACT2FN[config.hidden_act]
+
+    def forward(self, hidden_states: torch.FloatTensor) -> torch.Tensor:
+        # Depthwise convolution
+        if self.stride == 2:
+            hidden_states = self.depthwise_conv_pad(hidden_states)
+
+        hidden_states = self.depthwise_conv(hidden_states)
+        hidden_states = self.depthwise_norm(hidden_states)
+        hidden_states = self.depthwise_act(hidden_states)
+
+        return hidden_states
+
+
+class EfficientNetSqueezeExciteLayer(nn.Module):
+    r"""
+    This corresponds to the Squeeze-and-Excitation phase of each block in the original implementation.
+    """
+
+    def __init__(self, config: EfficientNetConfig, in_dim: int, expand_dim: int, expand: bool = False):
+        super().__init__()
+        self.dim = expand_dim if expand else in_dim
+        self.dim_se = max(1, int(in_dim * config.squeeze_expansion_ratio))
+
+        self.squeeze = nn.AdaptiveAvgPool2d(output_size=1)
+        self.reduce = nn.Conv2d(
+            in_channels=self.dim,
+            out_channels=self.dim_se,
+            kernel_size=1,
+            padding="same",
+        )
+        self.expand = nn.Conv2d(
+            in_channels=self.dim_se,
+            out_channels=self.dim,
+            kernel_size=1,
+            padding="same",
+        )
+        self.act_reduce = ACT2FN[config.hidden_act]
+        self.act_expand = nn.Sigmoid()
+
+    def forward(self, hidden_states: torch.FloatTensor) -> torch.Tensor:
+        inputs = hidden_states
+        hidden_states = self.squeeze(hidden_states)
+        hidden_states = self.reduce(hidden_states)
+        hidden_states = self.act_reduce(hidden_states)
+
+        hidden_states = self.expand(hidden_states)
+        hidden_states = self.act_expand(hidden_states)
+        hidden_states = torch.mul(inputs, hidden_states)
+
+        return hidden_states
+
+
+class EfficientNetFinalBlockLayer(nn.Module):
+    r"""
+    This corresponds to the final phase of each block in the original implementation.
+    """
+
+    def __init__(
+        self, config: EfficientNetConfig, in_dim: int, out_dim: int, stride: int, drop_rate: float, id_skip: bool
+    ):
+        super().__init__()
+        self.apply_dropout = stride == 1 and not id_skip
+        self.project_conv = nn.Conv2d(
+            in_channels=in_dim,
+            out_channels=out_dim,
+            kernel_size=1,
+            padding="same",
+            bias=False,
+        )
+        self.project_bn = nn.BatchNorm2d(
+            num_features=out_dim, eps=config.batch_norm_eps, momentum=config.batch_norm_momentum
+        )
+        self.dropout = nn.Dropout(p=drop_rate)
+
+    def forward(self, embeddings: torch.FloatTensor, hidden_states: torch.FloatTensor) -> torch.Tensor:
+        hidden_states = self.project_conv(hidden_states)
+        hidden_states = self.project_bn(hidden_states)
+
+        if self.apply_dropout:
+            hidden_states = self.dropout(hidden_states)
+            hidden_states = hidden_states + embeddings
+
+        return hidden_states
+
+
+class EfficientNetBlock(nn.Module):
+    r"""
+    This corresponds to a full block of the original implementation, chaining the expansion, depthwise convolution,
+    squeeze-and-excitation and projection phases.
+
+    Args:
+        config ([`EfficientNetConfig`]):
+            Model configuration class.
+        in_dim (`int`):
+            Number of input channels.
+        out_dim (`int`):
+            Number of output channels.
+        stride (`int`):
+            Stride size to be used in convolution layers.
+        expand_ratio (`int`):
+            Expand ratio to set the output dimensions for the expansion and squeeze-excite layers.
+        kernel_size (`int`):
+            Kernel size for the depthwise convolution layer.
+        drop_rate (`float`):
+            Dropout rate to be used in the final phase of each block.
+        id_skip (`bool`):
+            Whether to skip the dropout and the residual addition with the input embeddings in the final phase of the
+            block. Set to `True` for the first block of each stage.
+        adjust_padding (`bool`):
+            Whether to apply asymmetric padding (one pixel less on the left and top sides) before the depthwise
+            convolution. Set to `True` for inputs with odd spatial sizes.
+    """
+
+    def __init__(
+        self,
+        config: EfficientNetConfig,
+        in_dim: int,
+        out_dim: int,
+        stride: int,
+        expand_ratio: int,
+        kernel_size: int,
+        drop_rate: float,
+        id_skip: bool,
+        adjust_padding: bool,
+    ):
+        super().__init__()
+        self.expand_ratio = expand_ratio
+        self.expand = self.expand_ratio != 1
+        expand_in_dim = in_dim * expand_ratio
+
+        if self.expand:
+            self.expansion = EfficientNetExpansionLayer(
+                config=config, in_dim=in_dim, out_dim=expand_in_dim, stride=stride
+            )
+
+        self.depthwise_conv = EfficientNetDepthwiseLayer(
+            config=config,
+            in_dim=expand_in_dim if self.expand else in_dim,
+            stride=stride,
+            kernel_size=kernel_size,
+            adjust_padding=adjust_padding,
+        )
+        self.squeeze_excite = EfficientNetSqueezeExciteLayer(
+            config=config, in_dim=in_dim, expand_dim=expand_in_dim, expand=self.expand
+        )
+        self.projection = EfficientNetFinalBlockLayer(
+            config=config,
+            in_dim=expand_in_dim if self.expand else in_dim,
+            out_dim=out_dim,
+            stride=stride,
+            drop_rate=drop_rate,
+            id_skip=id_skip,
+        )
+
+    def forward(self, hidden_states: torch.FloatTensor) -> torch.Tensor:
+        embeddings = hidden_states
+        # Expansion and depthwise convolution phase
+        if self.expand_ratio != 1:
+            hidden_states = self.expansion(hidden_states)
+        hidden_states = self.depthwise_conv(hidden_states)
+
+        # Squeeze and excite phase
+        hidden_states = self.squeeze_excite(hidden_states)
+        hidden_states = self.projection(embeddings, hidden_states)
+        return hidden_states
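To sanity-check the phase ordering inside a block (expansion when `expand_ratio != 1`, depthwise convolution, squeeze-and-excitation, projection), a hedged sketch that runs a single block with made-up dimensions; it assumes `EfficientNetConfig` and `EfficientNetBlock` are importable once this patch is installed, and the arguments are illustrative rather than those of a released checkpoint:

```python
import torch

from transformers import EfficientNetConfig
from transformers.models.efficientnet.modeling_efficientnet import EfficientNetBlock

config = EfficientNetConfig()
block = EfficientNetBlock(
    config=config,
    in_dim=32,
    out_dim=16,
    stride=1,
    expand_ratio=1,   # the expansion layer is skipped when expand_ratio == 1
    kernel_size=3,
    drop_rate=0.0,
    id_skip=True,     # first block of a stage: dropout and residual addition are skipped
    adjust_padding=True,
)

hidden_states = torch.randn(1, 32, 56, 56)
with torch.no_grad():
    out = block(hidden_states)
print(out.shape)  # torch.Size([1, 16, 56, 56]): stride 1 keeps the spatial size, channels map to out_dim
```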
+
+
+class EfficientNetEncoder(nn.Module):
+    r"""
+    Forward propagates the embeddings through each EfficientNet block.
+
+    Args:
+        config ([`EfficientNetConfig`]):
+            Model configuration class.
+    """
+
+    def __init__(self, config: EfficientNetConfig):
+        super().__init__()
+        self.config = config
+        self.depth_coefficient = config.depth_coefficient
+
+        def round_repeats(repeats):
+            # Round number of block repeats based on depth multiplier.
+            return int(math.ceil(self.depth_coefficient * repeats))
+
+        num_base_blocks = len(config.in_channels)
+        num_blocks = sum(round_repeats(n) for n in config.num_block_repeats)
+
+        curr_block_num = 0
+        blocks = []
+        for i in range(num_base_blocks):
+            in_dim = round_filters(config, config.in_channels[i])
+            out_dim = round_filters(config, config.out_channels[i])
+            stride = config.strides[i]
+            kernel_size = config.kernel_sizes[i]
+            expand_ratio = config.expand_ratios[i]
+
+            for j in range(round_repeats(config.num_block_repeats[i])):
+                id_skip = j == 0
+                stride = 1 if j > 0 else stride
+                in_dim = out_dim if j > 0 else in_dim
+                adjust_padding = curr_block_num not in config.depthwise_padding
+                drop_rate = config.drop_connect_rate * curr_block_num / num_blocks
+
+                block = EfficientNetBlock(
+                    config=config,
+                    in_dim=in_dim,
+                    out_dim=out_dim,
+                    stride=stride,
+                    kernel_size=kernel_size,
+                    expand_ratio=expand_ratio,
+                    drop_rate=drop_rate,
+                    id_skip=id_skip,
+                    adjust_padding=adjust_padding,
+                )
+                blocks.append(block)
+                curr_block_num += 1
+
+        self.blocks = nn.ModuleList(blocks)
+        self.top_conv = nn.Conv2d(
+            in_channels=out_dim,
+            out_channels=round_filters(config, 1280),
+            kernel_size=1,
+            padding="same",
+            bias=False,
+        )
+        self.top_bn = nn.BatchNorm2d(
+            num_features=config.hidden_dim, eps=config.batch_norm_eps, momentum=config.batch_norm_momentum
+        )
+        self.top_activation = ACT2FN[config.hidden_act]
+
+    def forward(
+        self,
+        hidden_states: torch.FloatTensor,
+        output_hidden_states: Optional[bool] = False,
+        return_dict: Optional[bool] = True,
+    ) -> BaseModelOutputWithNoAttention:
+        all_hidden_states = (hidden_states,) if output_hidden_states else None
+
+        for block in self.blocks:
+            hidden_states = block(hidden_states)
+            if output_hidden_states:
+                all_hidden_states += (hidden_states,)
+
+        hidden_states = self.top_conv(hidden_states)
+        hidden_states = self.top_bn(hidden_states)
+        hidden_states = self.top_activation(hidden_states)
+
+        if not return_dict:
+            return tuple(v for v in [hidden_states, all_hidden_states] if v is not None)
+
+        return BaseModelOutputWithNoAttention(
+            last_hidden_state=hidden_states,
+            hidden_states=all_hidden_states,
+        )
+
+
+class EfficientNetPreTrainedModel(PreTrainedModel):
+    """
+    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
+    models.
+    """
+
+    config_class = EfficientNetConfig
+    base_model_prefix = "efficientnet"
+    main_input_name = "pixel_values"
+
+    def _init_weights(self, module):
+        """Initialize the weights"""
+        if isinstance(module, (nn.Linear, nn.Conv2d)):
+            # Slightly different from the TF version which uses truncated_normal for initialization
+            # cf https://github.com/pytorch/pytorch/pull/5617
+            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.LayerNorm):
+            module.bias.data.zero_()
+            module.weight.data.fill_(1.0)
+
+
+@add_start_docstrings(
+    "The bare EfficientNet model outputting raw features without any specific head on top.",
+    EFFICIENTNET_START_DOCSTRING,
+)
+class EfficientNetModel(EfficientNetPreTrainedModel):
+    def __init__(self, config: EfficientNetConfig):
+        super().__init__(config)
+        self.config = config
+        self.embeddings = EfficientNetEmbeddings(config)
+        self.encoder = EfficientNetEncoder(config)
+
+        # Final pooling layer
+        if config.pooling_type == "mean":
+            self.pooler = nn.AvgPool2d(config.hidden_dim, ceil_mode=True)
+        elif config.pooling_type == "max":
+            self.pooler = nn.MaxPool2d(config.hidden_dim, ceil_mode=True)
+        else:
+            raise ValueError(f"config.pooling_type must be one of ['mean', 'max'] got {config.pooling_type}")
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    @add_start_docstrings_to_model_forward(EFFICIENTNET_INPUTS_DOCSTRING)
+    @add_code_sample_docstrings(
+        checkpoint=_CHECKPOINT_FOR_DOC,
+        output_type=BaseModelOutputWithPoolingAndNoAttention,
+        config_class=_CONFIG_FOR_DOC,
+        modality="vision",
+        expected_output=_EXPECTED_OUTPUT_SHAPE,
+    )
+    def forward(
+        self,
+        pixel_values: torch.FloatTensor = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, BaseModelOutputWithPoolingAndNoAttention]:
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        if pixel_values is None:
+            raise ValueError("You have to specify pixel_values")
+
+        embedding_output = self.embeddings(pixel_values)
+
+        encoder_outputs = self.encoder(
+            embedding_output,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+        # Apply pooling
+        last_hidden_state = encoder_outputs[0]
+        pooled_output = self.pooler(last_hidden_state)
+        # Reshape (batch_size, hidden_dim, 1, 1) -> (batch_size, hidden_dim)
+        pooled_output = pooled_output.reshape(pooled_output.shape[:2])
+
+        if not return_dict:
+            return (last_hidden_state, pooled_output) + encoder_outputs[1:]
+
+        return BaseModelOutputWithPoolingAndNoAttention(
+            last_hidden_state=last_hidden_state,
+            pooler_output=pooled_output,
+            hidden_states=encoder_outputs.hidden_states,
+        )
+
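A short feature-extraction sketch against the bare model defined above; the checkpoint name is the one used in the docstring constants, and the exact channel count depends on `config.hidden_dim`:

```python
import numpy as np
import torch

from transformers import AutoImageProcessor, EfficientNetModel

processor = AutoImageProcessor.from_pretrained("google/efficientnet-b7")
model = EfficientNetModel.from_pretrained("google/efficientnet-b7")

image = np.random.randint(0, 256, size=(600, 600, 3), dtype=np.uint8)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch_size, hidden_dim, height, width)
print(outputs.pooler_output.shape)      # (batch_size, hidden_dim) after the pooling + reshape above
```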
+
+@add_start_docstrings(
+    """
+    EfficientNet Model with an image classification head on top (a linear layer on top of the pooled features), e.g.
+    for ImageNet.
+    """,
+    EFFICIENTNET_START_DOCSTRING,
+)
+class EfficientNetForImageClassification(EfficientNetPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+        self.config = config
+        self.efficientnet = EfficientNetModel(config)
+        # Classifier head
+        self.dropout = nn.Dropout(p=config.dropout_rate)
+        self.classifier = nn.Linear(config.hidden_dim, self.num_labels) if self.num_labels > 0 else nn.Identity()
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    @add_start_docstrings_to_model_forward(EFFICIENTNET_INPUTS_DOCSTRING)
+    @add_code_sample_docstrings(
+        checkpoint=_IMAGE_CLASS_CHECKPOINT,
+        output_type=ImageClassifierOutputWithNoAttention,
+        config_class=_CONFIG_FOR_DOC,
+        expected_output=_IMAGE_CLASS_EXPECTED_OUTPUT,
+    )
+    def forward(
+        self,
+        pixel_values: torch.FloatTensor = None,
+        labels: Optional[torch.LongTensor] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, ImageClassifierOutputWithNoAttention]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
+            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
+            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        outputs = self.efficientnet(pixel_values, output_hidden_states=output_hidden_states, return_dict=return_dict)
+
+        pooled_output = outputs.pooler_output if return_dict else outputs[1]
+        pooled_output = self.dropout(pooled_output)
+        logits = self.classifier(pooled_output)
+
+        loss = None
+        if labels is not None:
+            if self.config.problem_type is None:
+                if self.num_labels == 1:
+                    self.config.problem_type = "regression"
+                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
+                    self.config.problem_type = "single_label_classification"
+                else:
+                    self.config.problem_type = "multi_label_classification"
+
+            if self.config.problem_type == "regression":
+                loss_fct = MSELoss()
+                if self.num_labels == 1:
+                    loss = loss_fct(logits.squeeze(), labels.squeeze())
+                else:
+                    loss = loss_fct(logits, labels)
+            elif self.config.problem_type == "single_label_classification":
+                loss_fct = CrossEntropyLoss()
+                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+            elif self.config.problem_type == "multi_label_classification":
+                loss_fct = BCEWithLogitsLoss()
+                loss = loss_fct(logits, labels)
+
+        if not return_dict:
+            output = (logits,) + outputs[2:]
+            return ((loss,) + output) if loss is not None else output
+
+        return ImageClassifierOutputWithNoAttention(
+            loss=loss,
+            logits=logits,
+            hidden_states=outputs.hidden_states,
+        )
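And a classification sketch that also exercises the `problem_type` branch above: passing integer `labels` with `num_labels > 1` selects the cross-entropy path. The label id is arbitrary, and the printed class name assumes the checkpoint ships ImageNet `id2label` entries:

```python
import numpy as np
import torch

from transformers import AutoImageProcessor, EfficientNetForImageClassification

processor = AutoImageProcessor.from_pretrained("google/efficientnet-b7")
model = EfficientNetForImageClassification.from_pretrained("google/efficientnet-b7")

image = np.random.randint(0, 256, size=(600, 600, 3), dtype=np.uint8)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, labels=torch.tensor([281]))  # arbitrary class id, just to get a loss back

print(outputs.loss)  # cross-entropy loss (problem_type resolves to "single_label_classification")
print(model.config.id2label[outputs.logits.argmax(-1).item()])
```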
diff --git a/src/transformers/models/electra/configuration_electra.py b/src/transformers/models/electra/configuration_electra.py
index d8e1de0fc97fa4..d45f62930212ec 100644
--- a/src/transformers/models/electra/configuration_electra.py
+++ b/src/transformers/models/electra/configuration_electra.py
@@ -130,6 +130,7 @@ class ElectraConfig(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "electra"
 
     def __init__(
diff --git a/src/transformers/models/electra/modeling_electra.py b/src/transformers/models/electra/modeling_electra.py
index 7d2a06a8edc223..3aaa6141004fb3 100644
--- a/src/transformers/models/electra/modeling_electra.py
+++ b/src/transformers/models/electra/modeling_electra.py
@@ -135,7 +135,7 @@ def load_tf_weights_in_electra(model, config, tf_checkpoint_path, discriminator_
             try:
                 if pointer.shape != array.shape:
                     raise ValueError(f"Pointer shape {pointer.shape} and array shape {array.shape} mismatched")
-            except AssertionError as e:
+            except ValueError as e:
                 e.args += (pointer.shape, array.shape)
                 raise
             print(f"Initialize PyTorch weight {name}", original_name)
@@ -161,7 +161,9 @@ def __init__(self, config):
         self.dropout = nn.Dropout(config.hidden_dropout_prob)
 
         # position_ids (1, len position emb) is contiguous in memory and exported when serialized
-        self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
+        self.register_buffer(
+            "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False
+        )
         self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
         self.register_buffer(
             "token_type_ids", torch.zeros(self.position_ids.size(), dtype=torch.long), persistent=False
@@ -553,6 +555,13 @@ def forward(
         all_self_attentions = () if output_attentions else None
         all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None
 
+        if self.gradient_checkpointing and self.training:
+            if use_cache:
+                logger.warning_once(
+                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+                )
+                use_cache = False
+
         next_decoder_cache = () if use_cache else None
         for i, layer_module in enumerate(self.layer):
             if output_hidden_states:
@@ -562,25 +571,15 @@ def forward(
             past_key_value = past_key_values[i] if past_key_values is not None else None
 
             if self.gradient_checkpointing and self.training:
-                if use_cache:
-                    logger.warning(
-                        "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
-                    )
-                    use_cache = False
-
-                def create_custom_forward(module):
-                    def custom_forward(*inputs):
-                        return module(*inputs, past_key_value, output_attentions)
-
-                    return custom_forward
-
-                layer_outputs = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(layer_module),
+                layer_outputs = self._gradient_checkpointing_func(
+                    layer_module.__call__,
                     hidden_states,
                     attention_mask,
                     layer_head_mask,
                     encoder_hidden_states,
                     encoder_attention_mask,
+                    past_key_value,
+                    output_attentions,
                 )
             else:
                 layer_outputs = layer_module(
@@ -632,12 +631,13 @@ def __init__(self, config):
         super().__init__()
 
         self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+        self.activation = get_activation(config.hidden_act)
         self.dense_prediction = nn.Linear(config.hidden_size, 1)
         self.config = config
 
     def forward(self, discriminator_hidden_states):
         hidden_states = self.dense(discriminator_hidden_states)
-        hidden_states = get_activation(self.config.hidden_act)(hidden_states)
+        hidden_states = self.activation(hidden_states)
         logits = self.dense_prediction(hidden_states).squeeze(-1)
 
         return logits
@@ -649,12 +649,13 @@ class ElectraGeneratorPredictions(nn.Module):
     def __init__(self, config):
         super().__init__()
 
+        self.activation = get_activation("gelu")
         self.LayerNorm = nn.LayerNorm(config.embedding_size, eps=config.layer_norm_eps)
         self.dense = nn.Linear(config.hidden_size, config.embedding_size)
 
     def forward(self, generator_hidden_states):
         hidden_states = self.dense(generator_hidden_states)
-        hidden_states = get_activation("gelu")(hidden_states)
+        hidden_states = self.activation(hidden_states)
         hidden_states = self.LayerNorm(hidden_states)
 
         return hidden_states
@@ -670,8 +671,6 @@ class ElectraPreTrainedModel(PreTrainedModel):
     load_tf_weights = load_tf_weights_in_electra
     base_model_prefix = "electra"
     supports_gradient_checkpointing = True
-    _keys_to_ignore_on_load_missing = [r"position_ids"]
-    _keys_to_ignore_on_load_unexpected = [r"electra.embeddings_project.weight", r"electra.embeddings_project.bias"]
 
     # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights
     def _init_weights(self, module):
@@ -690,10 +689,6 @@ def _init_weights(self, module):
             module.bias.data.zero_()
             module.weight.data.fill_(1.0)
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, ElectraEncoder):
-            module.gradient_checkpointing = value
-
 
 @dataclass
 class ElectraForPreTrainingOutput(ModelOutput):
@@ -866,6 +861,7 @@ def forward(
         if input_ids is not None and inputs_embeds is not None:
             raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
         elif input_ids is not None:
+            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
             input_shape = input_ids.size()
         elif inputs_embeds is not None:
             input_shape = inputs_embeds.size()[:-1]
@@ -939,6 +935,7 @@ def __init__(self, config):
         classifier_dropout = (
             config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
         )
+        self.activation = get_activation("gelu")
         self.dropout = nn.Dropout(classifier_dropout)
         self.out_proj = nn.Linear(config.hidden_size, config.num_labels)
 
@@ -946,7 +943,7 @@ def forward(self, features, **kwargs):
         x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
         x = self.dropout(x)
         x = self.dense(x)
-        x = get_activation("gelu")(x)  # although BERT uses tanh here, it seems Electra authors used gelu here
+        x = self.activation(x)  # although BERT uses tanh here, it seems Electra authors used gelu here
         x = self.dropout(x)
         x = self.out_proj(x)
         return x
@@ -1164,7 +1161,7 @@ def forward(
     ELECTRA_START_DOCSTRING,
 )
 class ElectraForMaskedLM(ElectraPreTrainedModel):
-    _keys_to_ignore_on_load_missing = ["generator_lm_head.weight"]
+    _tied_weights_keys = ["generator_lm_head.weight"]
 
     def __init__(self, config):
         super().__init__(config)
@@ -1531,7 +1528,7 @@ def forward(
     """ELECTRA Model with a `language modeling` head on top for CLM fine-tuning.""", ELECTRA_START_DOCSTRING
 )
 class ElectraForCausalLM(ElectraPreTrainedModel):
-    _keys_to_ignore_on_load_missing = ["generator_lm_head.weight"]
+    _tied_weights_keys = ["generator_lm_head.weight"]
 
     def __init__(self, config):
         super().__init__(config)
@@ -1664,9 +1661,18 @@ def prepare_inputs_for_generation(self, input_ids, past_key_values=None, attenti
         if attention_mask is None:
             attention_mask = input_ids.new_ones(input_shape)
 
-        # cut decoder_input_ids if past is used
+        # cut decoder_input_ids if past_key_values is used
         if past_key_values is not None:
-            input_ids = input_ids[:, -1:]
+            past_length = past_key_values[0][0].shape[2]
+
+            # Some generation methods already pass only the last input ID
+            if input_ids.shape[1] > past_length:
+                remove_prefix_length = past_length
+            else:
+                # Default to old behavior: keep only final ID
+                remove_prefix_length = input_ids.shape[1] - 1
+
+            input_ids = input_ids[:, remove_prefix_length:]
 
         return {"input_ids": input_ids, "attention_mask": attention_mask, "past_key_values": past_key_values}
 
@@ -1674,5 +1680,7 @@ def prepare_inputs_for_generation(self, input_ids, past_key_values=None, attenti
     def _reorder_cache(self, past_key_values, beam_idx):
         reordered_past = ()
         for layer_past in past_key_values:
-            reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),)
+            reordered_past += (
+                tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
+            )
         return reordered_past
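The new cache-aware trimming in `prepare_inputs_for_generation` is easiest to check in isolation. A minimal sketch of just that branch, with made-up ids and cache length:

```python
import torch

# The cache already covers `past_length` tokens; keep only the ids the model has not processed yet.
input_ids = torch.tensor([[11, 12, 13, 14]])
past_length = 3

if input_ids.shape[1] > past_length:
    remove_prefix_length = past_length             # keep the genuinely new token(s)
else:
    remove_prefix_length = input_ids.shape[1] - 1  # previous behaviour: keep only the final id

print(input_ids[:, remove_prefix_length:])  # tensor([[14]])
```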
diff --git a/src/transformers/models/electra/modeling_flax_electra.py b/src/transformers/models/electra/modeling_flax_electra.py
index f7c150f56d5bc7..64d49eb17a460a 100644
--- a/src/transformers/models/electra/modeling_flax_electra.py
+++ b/src/transformers/models/electra/modeling_flax_electra.py
@@ -263,7 +263,7 @@ def __call__(
         hidden_states,
         attention_mask,
         layer_head_mask,
-        key_value_states: Optional[jnp.array] = None,
+        key_value_states: Optional[jnp.ndarray] = None,
         init_cache: bool = False,
         deterministic=True,
         output_attentions: bool = False,
@@ -1196,6 +1196,7 @@ class FlaxElectraSequenceSummary(nn.Module):
             - **summary_first_dropout** (`float`) -- Optional dropout probability before the projection and activation.
             - **summary_last_dropout** (`float`) -- Optional dropout probability after the projection and activation.
     """
+
     config: ElectraConfig
     dtype: jnp.dtype = jnp.float32
 
@@ -1228,13 +1229,13 @@ def __call__(self, hidden_states, cls_index=None, deterministic: bool = True):
         Compute a single vector summary of a sequence hidden states.
 
         Args:
-            hidden_states (`jnp.array` of shape `[batch_size, seq_len, hidden_size]`):
+            hidden_states (`jnp.ndarray` of shape `[batch_size, seq_len, hidden_size]`):
                 The hidden states of the last layer.
-            cls_index (`jnp.array` of shape `[batch_size]` or `[batch_size, ...]` where ... are optional leading dimensions of `hidden_states`, *optional*):
+            cls_index (`jnp.ndarray` of shape `[batch_size]` or `[batch_size, ...]` where ... are optional leading dimensions of `hidden_states`, *optional*):
                 Used if `summary_type == "cls_index"` and takes the last token of the sequence as classification token.
 
         Returns:
-            `jnp.array`: The summary of the sequence hidden states.
+            `jnp.ndarray`: The summary of the sequence hidden states.
         """
         # NOTE: this always does the "first" type summary
         output = hidden_states[:, 0]
@@ -1565,7 +1566,7 @@ def __call__(
 class FlaxElectraForCausalLM(FlaxElectraPreTrainedModel):
     module_class = FlaxElectraForCausalLMModule
 
-    def prepare_inputs_for_generation(self, input_ids, max_length, attention_mask: Optional[jnp.DeviceArray] = None):
+    def prepare_inputs_for_generation(self, input_ids, max_length, attention_mask: Optional[jax.Array] = None):
         # initializing the cache
         batch_size, seq_length = input_ids.shape
 
diff --git a/src/transformers/models/electra/modeling_tf_electra.py b/src/transformers/models/electra/modeling_tf_electra.py
index b782cc987bef26..b0c8b4fa285d54 100644
--- a/src/transformers/models/electra/modeling_tf_electra.py
+++ b/src/transformers/models/electra/modeling_tf_electra.py
@@ -14,10 +14,13 @@
 # limitations under the License.
 """ TF Electra model."""
 
+
+from __future__ import annotations
+
 import math
 import warnings
 from dataclasses import dataclass
-from typing import Dict, Optional, Tuple, Union
+from typing import Optional, Tuple, Union
 
 import numpy as np
 import tensorflow as tf
@@ -41,13 +44,12 @@
     TFSequenceSummary,
     TFTokenClassificationLoss,
     get_initializer,
+    keras,
     keras_serializable,
     unpack_inputs,
 )
-from ...tf_utils import shape_list, stable_softmax
+from ...tf_utils import check_embeddings_within_bounds, shape_list, stable_softmax
 from ...utils import (
-    DUMMY_INPUTS,
-    MULTIPLE_CHOICE_DUMMY_INPUTS,
     ModelOutput,
     add_code_sample_docstrings,
     add_start_docstrings,
@@ -75,7 +77,7 @@
 
 
 # Copied from transformers.models.bert.modeling_tf_bert.TFBertSelfAttention with Bert->Electra
-class TFElectraSelfAttention(tf.keras.layers.Layer):
+class TFElectraSelfAttention(keras.layers.Layer):
     def __init__(self, config: ElectraConfig, **kwargs):
         super().__init__(**kwargs)
 
@@ -90,18 +92,19 @@ def __init__(self, config: ElectraConfig, **kwargs):
         self.all_head_size = self.num_attention_heads * self.attention_head_size
         self.sqrt_att_head_size = math.sqrt(self.attention_head_size)
 
-        self.query = tf.keras.layers.Dense(
+        self.query = keras.layers.Dense(
             units=self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="query"
         )
-        self.key = tf.keras.layers.Dense(
+        self.key = keras.layers.Dense(
             units=self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="key"
         )
-        self.value = tf.keras.layers.Dense(
+        self.value = keras.layers.Dense(
             units=self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="value"
         )
-        self.dropout = tf.keras.layers.Dropout(rate=config.attention_probs_dropout_prob)
+        self.dropout = keras.layers.Dropout(rate=config.attention_probs_dropout_prob)
 
         self.is_decoder = config.is_decoder
+        self.config = config
 
     def transpose_for_scores(self, tensor: tf.Tensor, batch_size: int) -> tf.Tensor:
         # Reshape from [batch_size, seq_length, all_head_size] to [batch_size, seq_length, num_attention_heads, attention_head_size]
@@ -191,17 +194,32 @@ def call(
             outputs = outputs + (past_key_value,)
         return outputs
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "query", None) is not None:
+            with tf.name_scope(self.query.name):
+                self.query.build([None, None, self.config.hidden_size])
+        if getattr(self, "key", None) is not None:
+            with tf.name_scope(self.key.name):
+                self.key.build([None, None, self.config.hidden_size])
+        if getattr(self, "value", None) is not None:
+            with tf.name_scope(self.value.name):
+                self.value.build([None, None, self.config.hidden_size])
+
 
 # Copied from transformers.models.bert.modeling_tf_bert.TFBertSelfOutput with Bert->Electra
-class TFElectraSelfOutput(tf.keras.layers.Layer):
+class TFElectraSelfOutput(keras.layers.Layer):
     def __init__(self, config: ElectraConfig, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(
+        self.dense = keras.layers.Dense(
             units=config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
         )
-        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
-        self.dropout = tf.keras.layers.Dropout(rate=config.hidden_dropout_prob)
+        self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob)
+        self.config = config
 
     def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, training: bool = False) -> tf.Tensor:
         hidden_states = self.dense(inputs=hidden_states)
@@ -210,9 +228,20 @@ def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, training: bool
 
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+        if getattr(self, "LayerNorm", None) is not None:
+            with tf.name_scope(self.LayerNorm.name):
+                self.LayerNorm.build([None, None, self.config.hidden_size])
+
 
 # Copied from transformers.models.bert.modeling_tf_bert.TFBertAttention with Bert->Electra
-class TFElectraAttention(tf.keras.layers.Layer):
+class TFElectraAttention(keras.layers.Layer):
     def __init__(self, config: ElectraConfig, **kwargs):
         super().__init__(**kwargs)
 
@@ -251,13 +280,24 @@ def call(
 
         return outputs
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "self_attention", None) is not None:
+            with tf.name_scope(self.self_attention.name):
+                self.self_attention.build(None)
+        if getattr(self, "dense_output", None) is not None:
+            with tf.name_scope(self.dense_output.name):
+                self.dense_output.build(None)
+
 
 # Copied from transformers.models.bert.modeling_tf_bert.TFBertIntermediate with Bert->Electra
-class TFElectraIntermediate(tf.keras.layers.Layer):
+class TFElectraIntermediate(keras.layers.Layer):
     def __init__(self, config: ElectraConfig, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(
+        self.dense = keras.layers.Dense(
             units=config.intermediate_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
         )
 
@@ -265,6 +305,7 @@ def __init__(self, config: ElectraConfig, **kwargs):
             self.intermediate_act_fn = get_tf_activation(config.hidden_act)
         else:
             self.intermediate_act_fn = config.hidden_act
+        self.config = config
 
     def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
         hidden_states = self.dense(inputs=hidden_states)
@@ -272,17 +313,26 @@ def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
 
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+
 
 # Copied from transformers.models.bert.modeling_tf_bert.TFBertOutput with Bert->Electra
-class TFElectraOutput(tf.keras.layers.Layer):
+class TFElectraOutput(keras.layers.Layer):
     def __init__(self, config: ElectraConfig, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(
+        self.dense = keras.layers.Dense(
             units=config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
         )
-        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
-        self.dropout = tf.keras.layers.Dropout(rate=config.hidden_dropout_prob)
+        self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob)
+        self.config = config
 
     def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, training: bool = False) -> tf.Tensor:
         hidden_states = self.dense(inputs=hidden_states)
@@ -291,9 +341,20 @@ def call(self, hidden_states: tf.Tensor, input_tensor: tf.Tensor, training: bool
 
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.intermediate_size])
+        if getattr(self, "LayerNorm", None) is not None:
+            with tf.name_scope(self.LayerNorm.name):
+                self.LayerNorm.build([None, None, self.config.hidden_size])
+
 
 # Copied from transformers.models.bert.modeling_tf_bert.TFBertLayer with Bert->Electra
-class TFElectraLayer(tf.keras.layers.Layer):
+class TFElectraLayer(keras.layers.Layer):
     def __init__(self, config: ElectraConfig, **kwargs):
         super().__init__(**kwargs)
 
@@ -312,9 +373,9 @@ def call(
         hidden_states: tf.Tensor,
         attention_mask: tf.Tensor,
         head_mask: tf.Tensor,
-        encoder_hidden_states: Optional[tf.Tensor],
-        encoder_attention_mask: Optional[tf.Tensor],
-        past_key_value: Optional[Tuple[tf.Tensor]],
+        encoder_hidden_states: tf.Tensor | None,
+        encoder_attention_mask: tf.Tensor | None,
+        past_key_value: Tuple[tf.Tensor] | None,
         output_attentions: bool,
         training: bool = False,
     ) -> Tuple[tf.Tensor]:
@@ -378,9 +439,26 @@ def call(
 
         return outputs
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "attention", None) is not None:
+            with tf.name_scope(self.attention.name):
+                self.attention.build(None)
+        if getattr(self, "intermediate", None) is not None:
+            with tf.name_scope(self.intermediate.name):
+                self.intermediate.build(None)
+        if getattr(self, "bert_output", None) is not None:
+            with tf.name_scope(self.bert_output.name):
+                self.bert_output.build(None)
+        if getattr(self, "crossattention", None) is not None:
+            with tf.name_scope(self.crossattention.name):
+                self.crossattention.build(None)
+
 
 # Copied from transformers.models.bert.modeling_tf_bert.TFBertEncoder with Bert->Electra
-class TFElectraEncoder(tf.keras.layers.Layer):
+class TFElectraEncoder(keras.layers.Layer):
     def __init__(self, config: ElectraConfig, **kwargs):
         super().__init__(**kwargs)
         self.config = config
@@ -391,9 +469,9 @@ def call(
         hidden_states: tf.Tensor,
         attention_mask: tf.Tensor,
         head_mask: tf.Tensor,
-        encoder_hidden_states: Optional[tf.Tensor],
-        encoder_attention_mask: Optional[tf.Tensor],
-        past_key_values: Optional[Tuple[Tuple[tf.Tensor]]],
+        encoder_hidden_states: tf.Tensor | None,
+        encoder_attention_mask: tf.Tensor | None,
+        past_key_values: Tuple[Tuple[tf.Tensor]] | None,
         use_cache: Optional[bool],
         output_attentions: bool,
         output_hidden_states: bool,
@@ -448,18 +526,28 @@ def call(
             cross_attentions=all_cross_attentions,
         )
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "layer", None) is not None:
+            for layer in self.layer:
+                with tf.name_scope(layer.name):
+                    layer.build(None)
+
 
 # Copied from transformers.models.bert.modeling_tf_bert.TFBertPooler with Bert->Electra
-class TFElectraPooler(tf.keras.layers.Layer):
+class TFElectraPooler(keras.layers.Layer):
     def __init__(self, config: ElectraConfig, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(
+        self.dense = keras.layers.Dense(
             units=config.hidden_size,
             kernel_initializer=get_initializer(config.initializer_range),
             activation="tanh",
             name="dense",
         )
+        self.config = config
 
     def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
         # We "pool" the model by simply taking the hidden state corresponding
@@ -469,9 +557,17 @@ def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
 
         return pooled_output
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+
 
 # Copied from transformers.models.albert.modeling_tf_albert.TFAlbertEmbeddings with Albert->Electra
-class TFElectraEmbeddings(tf.keras.layers.Layer):
+class TFElectraEmbeddings(keras.layers.Layer):
     """Construct the embeddings from word, position and token_type embeddings."""
 
     def __init__(self, config: ElectraConfig, **kwargs):
@@ -481,10 +577,10 @@ def __init__(self, config: ElectraConfig, **kwargs):
         self.embedding_size = config.embedding_size
         self.max_position_embeddings = config.max_position_embeddings
         self.initializer_range = config.initializer_range
-        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
-        self.dropout = tf.keras.layers.Dropout(rate=config.hidden_dropout_prob)
+        self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.dropout = keras.layers.Dropout(rate=config.hidden_dropout_prob)
 
-    def build(self, input_shape: tf.TensorShape):
+    def build(self, input_shape=None):
         with tf.name_scope("word_embeddings"):
             self.weight = self.add_weight(
                 name="weight",
@@ -506,7 +602,12 @@ def build(self, input_shape: tf.TensorShape):
                 initializer=get_initializer(self.initializer_range),
             )
 
-        super().build(input_shape)
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "LayerNorm", None) is not None:
+            with tf.name_scope(self.LayerNorm.name):
+                self.LayerNorm.build([None, None, self.config.embedding_size])
 
     # Copied from transformers.models.bert.modeling_tf_bert.TFBertEmbeddings.call
     def call(
@@ -528,16 +629,7 @@ def call(
             raise ValueError("Need to provide either `input_ids` or `input_embeds`.")
 
         if input_ids is not None:
-            # Note: tf.gather, on which the embedding layer is based, won't check positive out of bound
-            # indices on GPU, returning zeros instead. This is a dangerous silent behavior.
-            tf.debugging.assert_less(
-                input_ids,
-                tf.cast(self.config.vocab_size, dtype=input_ids.dtype),
-                message=(
-                    "input_ids must be smaller than the embedding layer's input dimension (got"
-                    f" {tf.math.reduce_max(input_ids)} >= {self.config.vocab_size})"
-                ),
-            )
+            check_embeddings_within_bounds(input_ids, self.config.vocab_size)
             inputs_embeds = tf.gather(params=self.weight, indices=input_ids)
 
         input_shape = shape_list(inputs_embeds)[:-1]
@@ -559,12 +651,12 @@ def call(
         return final_embeddings
 
 
-class TFElectraDiscriminatorPredictions(tf.keras.layers.Layer):
+class TFElectraDiscriminatorPredictions(keras.layers.Layer):
     def __init__(self, config, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(config.hidden_size, name="dense")
-        self.dense_prediction = tf.keras.layers.Dense(1, name="dense_prediction")
+        self.dense = keras.layers.Dense(config.hidden_size, name="dense")
+        self.dense_prediction = keras.layers.Dense(1, name="dense_prediction")
         self.config = config
 
     def call(self, discriminator_hidden_states, training=False):
@@ -574,13 +666,25 @@ def call(self, discriminator_hidden_states, training=False):
 
         return logits
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+        if getattr(self, "dense_prediction", None) is not None:
+            with tf.name_scope(self.dense_prediction.name):
+                self.dense_prediction.build([None, None, self.config.hidden_size])
+
 
-class TFElectraGeneratorPredictions(tf.keras.layers.Layer):
+class TFElectraGeneratorPredictions(keras.layers.Layer):
     def __init__(self, config, **kwargs):
         super().__init__(**kwargs)
 
-        self.LayerNorm = tf.keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
-        self.dense = tf.keras.layers.Dense(config.embedding_size, name="dense")
+        self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.dense = keras.layers.Dense(config.embedding_size, name="dense")
+        self.config = config
 
     def call(self, generator_hidden_states, training=False):
         hidden_states = self.dense(generator_hidden_states)
@@ -589,6 +693,17 @@ def call(self, generator_hidden_states, training=False):
 
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "LayerNorm", None) is not None:
+            with tf.name_scope(self.LayerNorm.name):
+                self.LayerNorm.build([None, None, self.config.embedding_size])
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+
 
 class TFElectraPreTrainedModel(TFPreTrainedModel):
     """
@@ -602,28 +717,9 @@ class TFElectraPreTrainedModel(TFPreTrainedModel):
     _keys_to_ignore_on_load_unexpected = [r"generator_lm_head.weight"]
     _keys_to_ignore_on_load_missing = [r"dropout"]
 
-    @property
-    # Copied from transformers.models.bert.modeling_tf_bert.TFBertPreTrainedModel.dummy_inputs
-    def dummy_inputs(self):
-        """
-        Dummy inputs to build the network.
-
-        Returns:
-            `Dict[str, tf.Tensor]`: The dummy inputs.
-        """
-        dummy = {"input_ids": tf.constant(DUMMY_INPUTS, dtype=tf.int32)}
-        # Add `encoder_hidden_states` to make the cross-attention layers' weights initialized
-        if self.config.add_cross_attention:
-            batch_size, seq_len = tf.constant(DUMMY_INPUTS).shape
-            shape = (batch_size, seq_len) + (self.config.hidden_size,)
-            h = tf.random.uniform(shape=shape)
-            dummy["encoder_hidden_states"] = h
-
-        return dummy
-
 
 @keras_serializable
-class TFElectraMainLayer(tf.keras.layers.Layer):
+class TFElectraMainLayer(keras.layers.Layer):
     config_class = ElectraConfig
 
     def __init__(self, config, **kwargs):
@@ -635,7 +731,7 @@ def __init__(self, config, **kwargs):
         self.embeddings = TFElectraEmbeddings(config, name="embeddings")
 
         if config.embedding_size != config.hidden_size:
-            self.embeddings_project = tf.keras.layers.Dense(config.hidden_size, name="embeddings_project")
+            self.embeddings_project = keras.layers.Dense(config.hidden_size, name="embeddings_project")
 
         self.encoder = TFElectraEncoder(config, name="encoder")
 
@@ -713,14 +809,14 @@ def get_head_mask(self, head_mask):
     @unpack_inputs
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        encoder_hidden_states: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        encoder_attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
+        encoder_hidden_states: np.ndarray | tf.Tensor | None = None,
+        encoder_attention_mask: np.ndarray | tf.Tensor | None = None,
         past_key_values: Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None,
         use_cache: Optional[bool] = None,
         output_attentions: Optional[bool] = None,
@@ -808,6 +904,20 @@ def call(
 
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "embeddings", None) is not None:
+            with tf.name_scope(self.embeddings.name):
+                self.embeddings.build(None)
+        if getattr(self, "encoder", None) is not None:
+            with tf.name_scope(self.encoder.name):
+                self.encoder.build(None)
+        if getattr(self, "embeddings_project", None) is not None:
+            with tf.name_scope(self.embeddings_project.name):
+                self.embeddings_project.build([None, None, self.config.embedding_size])
+
 
 @dataclass
 class TFElectraForPreTrainingOutput(ModelOutput):
@@ -833,8 +943,8 @@ class TFElectraForPreTrainingOutput(ModelOutput):
     """
 
     logits: tf.Tensor = None
-    hidden_states: Optional[Tuple[tf.Tensor]] = None
-    attentions: Optional[Tuple[tf.Tensor]] = None
+    hidden_states: Tuple[tf.Tensor] | None = None
+    attentions: Tuple[tf.Tensor] | None = None
 
 
 ELECTRA_START_DOCSTRING = r"""
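The signature and dataclass rewrites above swap `Optional[Union[np.ndarray, tf.Tensor]]` for the PEP 604 `np.ndarray | tf.Tensor | None` form. On Python 3.8/3.9 that syntax is only accepted inside annotations when postponed evaluation is enabled, which is presumably what the rewritten module relies on; a minimal illustration:

```python
# PEP 604 unions in annotations on Python < 3.10 require postponed evaluation.
from __future__ import annotations

import numpy as np
import tensorflow as tf


def call(attention_mask: np.ndarray | tf.Tensor | None = None) -> None:
    # With postponed evaluation the annotation is stored as a plain string and is
    # never evaluated at runtime, so older interpreters accept it unchanged.
    print(call.__annotations__["attention_mask"])


call()
```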
@@ -843,7 +953,7 @@ class TFElectraForPreTrainingOutput(ModelOutput):
     library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
     etc.)
 
-    This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
+    This model is also a [keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
     as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage and
     behavior.
 
@@ -950,14 +1060,14 @@ def __init__(self, config, *inputs, **kwargs):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        encoder_hidden_states: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        encoder_attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
+        encoder_hidden_states: np.ndarray | tf.Tensor | None = None,
+        encoder_attention_mask: np.ndarray | tf.Tensor | None = None,
         past_key_values: Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None,
         use_cache: Optional[bool] = None,
         output_attentions: Optional[bool] = None,
@@ -1004,22 +1114,13 @@ def call(
 
         return outputs
 
-    def serving_output(self, output):
-        output_cache = self.config.use_cache and self.config.is_decoder
-        pkv = tf.convert_to_tensor(output.past_key_values) if output_cache else None
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-        cross_attns = tf.convert_to_tensor(output.cross_attentions) if output.cross_attentions is not None else None
-        if not (self.config.output_attentions and self.config.add_cross_attention):
-            cross_attns = None
-
-        return TFBaseModelOutputWithPastAndCrossAttentions(
-            last_hidden_state=output.last_hidden_state,
-            past_key_values=pkv,
-            hidden_states=hs,
-            attentions=attns,
-            cross_attentions=cross_attns,
-        )
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "electra", None) is not None:
+            with tf.name_scope(self.electra.name):
+                self.electra.build(None)
 
 
 @add_start_docstrings(
@@ -1043,12 +1144,12 @@ def __init__(self, config, **kwargs):
     @replace_return_docstrings(output_type=TFElectraForPreTrainingOutput, config_class=_CONFIG_FOR_DOC)
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
@@ -1093,14 +1194,19 @@ def call(
             attentions=discriminator_hidden_states.attentions,
         )
 
-    def serving_output(self, output):
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFElectraForPreTrainingOutput(logits=output.logits, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "electra", None) is not None:
+            with tf.name_scope(self.electra.name):
+                self.electra.build(None)
+        if getattr(self, "discriminator_predictions", None) is not None:
+            with tf.name_scope(self.discriminator_predictions.name):
+                self.discriminator_predictions.build(None)
 
 
-class TFElectraMaskedLMHead(tf.keras.layers.Layer):
+class TFElectraMaskedLMHead(keras.layers.Layer):
     def __init__(self, config, input_embeddings, **kwargs):
         super().__init__(**kwargs)
 
@@ -1180,16 +1286,16 @@ def get_prefix_bias_name(self):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[TFMaskedLMOutput, Tuple[tf.Tensor]]:
         r"""
@@ -1227,21 +1333,28 @@ def call(
             attentions=generator_hidden_states.attentions,
         )
 
-    # Copied from transformers.models.bert.modeling_tf_bert.TFBertForMaskedLM.serving_output
-    def serving_output(self, output: TFMaskedLMOutput) -> TFMaskedLMOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFMaskedLMOutput(logits=output.logits, hidden_states=hs, attentions=attns)
-
-
-class TFElectraClassificationHead(tf.keras.layers.Layer):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "electra", None) is not None:
+            with tf.name_scope(self.electra.name):
+                self.electra.build(None)
+        if getattr(self, "generator_predictions", None) is not None:
+            with tf.name_scope(self.generator_predictions.name):
+                self.generator_predictions.build(None)
+        if getattr(self, "generator_lm_head", None) is not None:
+            with tf.name_scope(self.generator_lm_head.name):
+                self.generator_lm_head.build(None)
+
+
+class TFElectraClassificationHead(keras.layers.Layer):
     """Head for sentence-level classification tasks."""
 
     def __init__(self, config, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(
+        self.dense = keras.layers.Dense(
             config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
         )
         classifier_dropout = (
@@ -1249,10 +1362,11 @@ def __init__(self, config, **kwargs):
             if config.classifier_dropout is not None
             else config.hidden_dropout_prob
         )
-        self.dropout = tf.keras.layers.Dropout(classifier_dropout)
-        self.out_proj = tf.keras.layers.Dense(
+        self.dropout = keras.layers.Dropout(classifier_dropout)
+        self.out_proj = keras.layers.Dense(
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="out_proj"
         )
+        self.config = config
 
     def call(self, inputs, **kwargs):
         x = inputs[:, 0, :]  # take  token (equiv. to [CLS])
@@ -1264,6 +1378,17 @@ def call(self, inputs, **kwargs):
 
         return x
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+        if getattr(self, "out_proj", None) is not None:
+            with tf.name_scope(self.out_proj.name):
+                self.out_proj.build([None, None, self.config.hidden_size])
+
 
 @add_start_docstrings(
     """
@@ -1290,16 +1415,16 @@ def __init__(self, config, *inputs, **kwargs):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[TFSequenceClassifierOutput, Tuple[tf.Tensor]]:
         r"""
@@ -1335,12 +1460,16 @@ def call(
             attentions=outputs.attentions,
         )
 
-    # Copied from transformers.models.bert.modeling_tf_bert.TFBertForSequenceClassification.serving_output
-    def serving_output(self, output: TFSequenceClassifierOutput) -> TFSequenceClassifierOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFSequenceClassifierOutput(logits=output.logits, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "electra", None) is not None:
+            with tf.name_scope(self.electra.name):
+                self.electra.build(None)
+        if getattr(self, "classifier", None) is not None:
+            with tf.name_scope(self.classifier.name):
+                self.classifier.build(None)
 
 
 @add_start_docstrings(
@@ -1358,19 +1487,10 @@ def __init__(self, config, *inputs, **kwargs):
         self.sequence_summary = TFSequenceSummary(
             config, initializer_range=config.initializer_range, name="sequence_summary"
         )
-        self.classifier = tf.keras.layers.Dense(
+        self.classifier = keras.layers.Dense(
             1, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
         )
-
-    @property
-    def dummy_inputs(self):
-        """
-        Dummy inputs to build the network.
-
-        Returns:
-            tf.Tensor with dummy inputs
-        """
-        return {"input_ids": tf.constant(MULTIPLE_CHOICE_DUMMY_INPUTS, dtype=tf.int32)}
+        self.config = config
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(ELECTRA_INPUTS_DOCSTRING.format("batch_size, num_choices, sequence_length"))
@@ -1381,16 +1501,16 @@ def dummy_inputs(self):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[TFMultipleChoiceModelOutput, Tuple[tf.Tensor]]:
         r"""
@@ -1444,27 +1564,19 @@ def call(
             attentions=outputs.attentions,
         )
 
-    @tf.function(
-        input_signature=[
-            {
-                "input_ids": tf.TensorSpec((None, None, None), tf.int32, name="input_ids"),
-                "attention_mask": tf.TensorSpec((None, None, None), tf.int32, name="attention_mask"),
-                "token_type_ids": tf.TensorSpec((None, None, None), tf.int32, name="token_type_ids"),
-            }
-        ]
-    )
-    # Copied from transformers.models.bert.modeling_tf_bert.TFBertForMultipleChoice.serving
-    def serving(self, inputs: Dict[str, tf.Tensor]):
-        output = self.call(input_ids=inputs)
-
-        return self.serving_output(output)
-
-    # Copied from transformers.models.bert.modeling_tf_bert.TFBertForMultipleChoice.serving_output
-    def serving_output(self, output: TFMultipleChoiceModelOutput) -> TFMultipleChoiceModelOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFMultipleChoiceModelOutput(logits=output.logits, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "electra", None) is not None:
+            with tf.name_scope(self.electra.name):
+                self.electra.build(None)
+        if getattr(self, "sequence_summary", None) is not None:
+            with tf.name_scope(self.sequence_summary.name):
+                self.sequence_summary.build(None)
+        if getattr(self, "classifier", None) is not None:
+            with tf.name_scope(self.classifier.name):
+                self.classifier.build([None, None, self.config.hidden_size])
 
 
 @add_start_docstrings(
@@ -1483,10 +1595,11 @@ def __init__(self, config, **kwargs):
         classifier_dropout = (
             config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
         )
-        self.dropout = tf.keras.layers.Dropout(classifier_dropout)
-        self.classifier = tf.keras.layers.Dense(
+        self.dropout = keras.layers.Dropout(classifier_dropout)
+        self.classifier = keras.layers.Dense(
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="classifier"
         )
+        self.config = config
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(ELECTRA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@@ -1499,16 +1612,16 @@ def __init__(self, config, **kwargs):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[TFTokenClassifierOutput, Tuple[tf.Tensor]]:
         r"""
@@ -1544,12 +1657,16 @@ def call(
             attentions=discriminator_hidden_states.attentions,
         )
 
-    # Copied from transformers.models.bert.modeling_tf_bert.TFBertForTokenClassification.serving_output
-    def serving_output(self, output: TFTokenClassifierOutput) -> TFTokenClassifierOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFTokenClassifierOutput(logits=output.logits, hidden_states=hs, attentions=attns)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "electra", None) is not None:
+            with tf.name_scope(self.electra.name):
+                self.electra.build(None)
+        if getattr(self, "classifier", None) is not None:
+            with tf.name_scope(self.classifier.name):
+                self.classifier.build([None, None, self.config.hidden_size])
 
 
 @add_start_docstrings(
@@ -1565,9 +1682,10 @@ def __init__(self, config, *inputs, **kwargs):
 
         self.num_labels = config.num_labels
         self.electra = TFElectraMainLayer(config, name="electra")
-        self.qa_outputs = tf.keras.layers.Dense(
+        self.qa_outputs = keras.layers.Dense(
             config.num_labels, kernel_initializer=get_initializer(config.initializer_range), name="qa_outputs"
         )
+        self.config = config
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(ELECTRA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@@ -1582,17 +1700,17 @@ def __init__(self, config, *inputs, **kwargs):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        token_type_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        token_type_ids: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-        start_positions: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        end_positions: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        start_positions: np.ndarray | tf.Tensor | None = None,
+        end_positions: np.ndarray | tf.Tensor | None = None,
         training: Optional[bool] = False,
     ) -> Union[TFQuestionAnsweringModelOutput, Tuple[tf.Tensor]]:
         r"""
@@ -1645,11 +1763,13 @@ def call(
             attentions=discriminator_hidden_states.attentions,
         )
 
-    # Copied from transformers.models.bert.modeling_tf_bert.TFBertForQuestionAnswering.serving_output
-    def serving_output(self, output: TFQuestionAnsweringModelOutput) -> TFQuestionAnsweringModelOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFQuestionAnsweringModelOutput(
-            start_logits=output.start_logits, end_logits=output.end_logits, hidden_states=hs, attentions=attns
-        )
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "electra", None) is not None:
+            with tf.name_scope(self.electra.name):
+                self.electra.build(None)
+        if getattr(self, "qa_outputs", None) is not None:
+            with tf.name_scope(self.qa_outputs.name):
+                self.qa_outputs.build([None, None, self.config.hidden_size])
diff --git a/src/transformers/models/electra/tokenization_electra.py b/src/transformers/models/electra/tokenization_electra.py
index 673c1db6119e4c..6ea9a600a6e957 100644
--- a/src/transformers/models/electra/tokenization_electra.py
+++ b/src/transformers/models/electra/tokenization_electra.py
@@ -152,20 +152,6 @@ def __init__(
         strip_accents=None,
         **kwargs,
     ):
-        super().__init__(
-            do_lower_case=do_lower_case,
-            do_basic_tokenize=do_basic_tokenize,
-            never_split=never_split,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            tokenize_chinese_chars=tokenize_chinese_chars,
-            strip_accents=strip_accents,
-            **kwargs,
-        )
-
         if not os.path.isfile(vocab_file):
             raise ValueError(
                 f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained"
@@ -181,7 +167,22 @@ def __init__(
                 tokenize_chinese_chars=tokenize_chinese_chars,
                 strip_accents=strip_accents,
             )
-        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)
+
+        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=str(unk_token))
+
+        super().__init__(
+            do_lower_case=do_lower_case,
+            do_basic_tokenize=do_basic_tokenize,
+            never_split=never_split,
+            unk_token=unk_token,
+            sep_token=sep_token,
+            pad_token=pad_token,
+            cls_token=cls_token,
+            mask_token=mask_token,
+            tokenize_chinese_chars=tokenize_chinese_chars,
+            strip_accents=strip_accents,
+            **kwargs,
+        )
 
     @property
     def do_lower_case(self):
@@ -194,10 +195,12 @@ def vocab_size(self):
     def get_vocab(self):
         return dict(self.vocab, **self.added_tokens_encoder)
 
-    def _tokenize(self, text):
+    def _tokenize(self, text, split_special_tokens=False):
         split_tokens = []
         if self.do_basic_tokenize:
-            for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):
+            for token in self.basic_tokenizer.tokenize(
+                text, never_split=self.all_special_tokens if not split_special_tokens else None
+            ):
                 # If the token is part of the never_split set
                 if token in self.basic_tokenizer.never_split:
                     split_tokens.append(token)
@@ -277,8 +280,8 @@ def create_token_type_ids_from_sequences(
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
         """
-        Create a mask from the two sequences passed to be used in a sequence-pair classification task. A Electra
-        sequence pair mask has the following format:
+        Create a mask from the two sequences passed to be used in a sequence-pair classification task. An Electra sequence
+        pair mask has the following format:
 
         ```
         0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
@@ -342,20 +345,30 @@ class BasicTokenizer(object):
         strip_accents (`bool`, *optional*):
             Whether or not to strip all accents. If this option is not specified, then it will be determined by the
             value for `lowercase` (as in the original BERT).
+        do_split_on_punc (`bool`, *optional*, defaults to `True`):
+            In some instances we want to skip the basic punctuation splitting so that later tokenization can capture
+            the full context of the words, such as contractions.
     """
 
-    def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True, strip_accents=None):
+    def __init__(
+        self,
+        do_lower_case=True,
+        never_split=None,
+        tokenize_chinese_chars=True,
+        strip_accents=None,
+        do_split_on_punc=True,
+    ):
         if never_split is None:
             never_split = []
         self.do_lower_case = do_lower_case
         self.never_split = set(never_split)
         self.tokenize_chinese_chars = tokenize_chinese_chars
         self.strip_accents = strip_accents
+        self.do_split_on_punc = do_split_on_punc
 
     def tokenize(self, text, never_split=None):
         """
-        Basic Tokenization of a piece of text. Split on "white spaces" only, for sub-word tokenization, see
-        WordPieceTokenizer.
+        Basic Tokenization of a piece of text. For sub-word tokenization, see WordPieceTokenizer.
 
         Args:
             never_split (`List[str]`, *optional*)
@@ -374,7 +387,9 @@ def tokenize(self, text, never_split=None):
         # words in the English Wikipedia.).
         if self.tokenize_chinese_chars:
             text = self._tokenize_chinese_chars(text)
-        orig_tokens = whitespace_tokenize(text)
+        # prevents treating the same character with different unicode codepoints as different characters
+        unicode_normalized_text = unicodedata.normalize("NFC", text)
+        orig_tokens = whitespace_tokenize(unicode_normalized_text)
         split_tokens = []
         for token in orig_tokens:
             if token not in never_split:
@@ -402,7 +417,7 @@ def _run_strip_accents(self, text):
 
     def _run_split_on_punc(self, text, never_split=None):
         """Splits punctuation on a piece of text."""
-        if never_split is not None and text in never_split:
+        if not self.do_split_on_punc or (never_split is not None and text in never_split):
             return [text]
         chars = list(text)
         i = 0
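The `unicodedata.normalize("NFC", text)` call added above folds visually identical characters that are encoded with different codepoint sequences into one canonical form before whitespace tokenization. A quick illustration (not part of the patch):

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as a single codepoint
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)  # False: same rendered text, different codepoints
print(unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed))  # True
```

Without the normalization, the two spellings would reach the wordpiece stage as different tokens.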
diff --git a/src/transformers/models/electra/tokenization_electra_fast.py b/src/transformers/models/electra/tokenization_electra_fast.py
index cf92dd01714f9d..e76082de174dee 100644
--- a/src/transformers/models/electra/tokenization_electra_fast.py
+++ b/src/transformers/models/electra/tokenization_electra_fast.py
@@ -192,7 +192,7 @@ def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
         """
         output = [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
 
-        if token_ids_1:
+        if token_ids_1 is not None:
             output += token_ids_1 + [self.sep_token_id]
 
         return output
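The `token_ids_1 is not None` change matters because an explicitly passed empty second sequence is falsy: under the old truthiness check the trailing separator for the pair would be silently dropped. A minimal sketch with placeholder special-token ids (the ids below are illustrative, not taken from any ELECTRA vocabulary):

```python
def build_pair(cls_id, sep_id, token_ids_0, token_ids_1=None):
    output = [cls_id] + token_ids_0 + [sep_id]
    if token_ids_1 is not None:  # `if token_ids_1:` would skip this branch for []
        output += token_ids_1 + [sep_id]
    return output


print(build_pair(0, 1, [7], []))  # [0, 7, 1, 1] -- pair structure preserved
print(build_pair(0, 1, [7]))      # [0, 7, 1]
```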
@@ -201,8 +201,8 @@ def create_token_type_ids_from_sequences(
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
         """
-        Create a mask from the two sequences passed to be used in a sequence-pair classification task. A ELECTRA
-        sequence pair mask has the following format:
+        Create a mask from the two sequences passed to be used in a sequence-pair classification task. An ELECTRA sequence
+        pair mask has the following format:
 
         ```
         0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
diff --git a/src/transformers/models/encodec/__init__.py b/src/transformers/models/encodec/__init__.py
new file mode 100644
index 00000000000000..d3d9488968bf2c
--- /dev/null
+++ b/src/transformers/models/encodec/__init__.py
@@ -0,0 +1,65 @@
+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import TYPE_CHECKING
+
+from ...utils import (
+    OptionalDependencyNotAvailable,
+    _LazyModule,
+    is_torch_available,
+)
+
+
+_import_structure = {
+    "configuration_encodec": [
+        "ENCODEC_PRETRAINED_CONFIG_ARCHIVE_MAP",
+        "EncodecConfig",
+    ],
+    "feature_extraction_encodec": ["EncodecFeatureExtractor"],
+}
+
+try:
+    if not is_torch_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["modeling_encodec"] = [
+        "ENCODEC_PRETRAINED_MODEL_ARCHIVE_LIST",
+        "EncodecModel",
+        "EncodecPreTrainedModel",
+    ]
+
+if TYPE_CHECKING:
+    from .configuration_encodec import (
+        ENCODEC_PRETRAINED_CONFIG_ARCHIVE_MAP,
+        EncodecConfig,
+    )
+    from .feature_extraction_encodec import EncodecFeatureExtractor
+
+    try:
+        if not is_torch_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .modeling_encodec import (
+            ENCODEC_PRETRAINED_MODEL_ARCHIVE_LIST,
+            EncodecModel,
+            EncodecPreTrainedModel,
+        )
+
+else:
+    import sys
+
+    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
diff --git a/src/transformers/models/encodec/configuration_encodec.py b/src/transformers/models/encodec/configuration_encodec.py
new file mode 100644
index 00000000000000..af493c325bece5
--- /dev/null
+++ b/src/transformers/models/encodec/configuration_encodec.py
@@ -0,0 +1,195 @@
+# coding=utf-8
+# Copyright 2023 Meta Platforms, Inc. and affiliates, and the HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" EnCodec model configuration"""
+
+
+import math
+from typing import Optional
+
+import numpy as np
+
+from ...configuration_utils import PretrainedConfig
+from ...utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+ENCODEC_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    "facebook/encodec_24khz": "https://huggingface.co/facebook/encodec_24khz/resolve/main/config.json",
+    "facebook/encodec_48khz": "https://huggingface.co/facebook/encodec_48khz/resolve/main/config.json",
+}
+
+
+class EncodecConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of an [`EncodecModel`]. It is used to instantiate a
+    Encodec model according to the specified arguments, defining the model architecture. Instantiating a configuration
+    with the defaults will yield a similar configuration to that of the
+    [facebook/encodec_24khz](https://huggingface.co/facebook/encodec_24khz) architecture.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+    Args:
+        target_bandwidths (`List[float]`, *optional*, defaults to `[1.5, 3.0, 6.0, 12.0, 24.0]`):
+            The range of different bandwidths the model can encode audio with.
+        sampling_rate (`int`, *optional*, defaults to 24000):
+            The sampling rate at which the audio waveform should be digitized, expressed in hertz (Hz).
+        audio_channels (`int`, *optional*, defaults to 1):
+            Number of channels in the audio data. Either 1 for mono or 2 for stereo.
+        normalize (`bool`, *optional*, defaults to `False`):
+            Whether the audio should be normalized when passed.
+        chunk_length_s (`float`, *optional*):
+            If defined, the audio is pre-processed into chunks of length `chunk_length_s` and then encoded.
+        overlap (`float`, *optional*):
+            Defines the overlap between each chunk. It is used to compute the `chunk_stride` using the following
+            formula: `int((1.0 - self.overlap) * self.chunk_length)`.
+        hidden_size (`int`, *optional*, defaults to 128):
+            Intermediate representation dimension.
+        num_filters (`int`, *optional*, defaults to 32):
+            Number of convolution kernels in the first `EncodecConv1d` downsampling layer.
+        num_residual_layers (`int`, *optional*, defaults to 1):
+            Number of residual layers.
+        upsampling_ratios (`Sequence[int]`, *optional*, defaults to `[8, 5, 4, 2]`):
+            Kernel size and stride ratios. The encoder uses downsampling ratios instead of upsampling ratios, hence it
+            will use the ratios specified here in reverse order; these must match the decoder order.
+        norm_type (`str`, *optional*, defaults to `"weight_norm"`):
+            Normalization method. Should be in `["weight_norm", "time_group_norm"]`
+        kernel_size (`int`, *optional*, defaults to 7):
+            Kernel size for the initial convolution.
+        last_kernel_size (`int`, *optional*, defaults to 7):
+            Kernel size for the last convolution layer.
+        residual_kernel_size (`int`, *optional*, defaults to 3):
+            Kernel size for the residual layers.
+        dilation_growth_rate (`int`, *optional*, defaults to 2):
+            How much to increase the dilation with each layer.
+        use_causal_conv (`bool`, *optional*, defaults to `True`):
+            Whether to use fully causal convolution.
+        pad_mode (`str`, *optional*, defaults to `"reflect"`):
+            Padding mode for the convolutions.
+        compress (`int`, *optional*, defaults to 2):
+            Reduced dimensionality in residual branches (from Demucs v3).
+        num_lstm_layers (`int`, *optional*, defaults to 2):
+            Number of LSTM layers at the end of the encoder.
+        trim_right_ratio (`float`, *optional*, defaults to 1.0):
+            Ratio for trimming at the right of the transposed convolution under the `use_causal_conv = True` setup. If
+            equal to 1.0, it means that all the trimming is done at the right.
+        codebook_size (`int`, *optional*, defaults to 1024):
+            Number of discrete codes that make up the VQVAE.
+        codebook_dim (`int`, *optional*):
+            Dimension of the codebook vectors. If not defined, uses `hidden_size`.
+        use_conv_shortcut (`bool`, *optional*, defaults to `True`):
+            Whether to use a convolutional layer as the 'skip' connection in the `EncodecResnetBlock` block. If False,
+            an identity function will be used, giving a generic residual connection.
+
+    Example:
+
+    ```python
+    >>> from transformers import EncodecModel, EncodecConfig
+
+    >>> # Initializing a "facebook/encodec_24khz" style configuration
+    >>> configuration = EncodecConfig()
+
+    >>> # Initializing a model (with random weights) from the "facebook/encodec_24khz" style configuration
+    >>> model = EncodecModel(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "encodec"
+
+    def __init__(
+        self,
+        target_bandwidths=[1.5, 3.0, 6.0, 12.0, 24.0],
+        sampling_rate=24_000,
+        audio_channels=1,
+        normalize=False,
+        chunk_length_s=None,
+        overlap=None,
+        hidden_size=128,
+        num_filters=32,
+        num_residual_layers=1,
+        upsampling_ratios=[8, 5, 4, 2],
+        norm_type="weight_norm",
+        kernel_size=7,
+        last_kernel_size=7,
+        residual_kernel_size=3,
+        dilation_growth_rate=2,
+        use_causal_conv=True,
+        pad_mode="reflect",
+        compress=2,
+        num_lstm_layers=2,
+        trim_right_ratio=1.0,
+        codebook_size=1024,
+        codebook_dim=None,
+        use_conv_shortcut=True,
+        **kwargs,
+    ):
+        self.target_bandwidths = target_bandwidths
+        self.sampling_rate = sampling_rate
+        self.audio_channels = audio_channels
+        self.normalize = normalize
+        self.chunk_length_s = chunk_length_s
+        self.overlap = overlap
+        self.hidden_size = hidden_size
+        self.num_filters = num_filters
+        self.num_residual_layers = num_residual_layers
+        self.upsampling_ratios = upsampling_ratios
+        self.norm_type = norm_type
+        self.kernel_size = kernel_size
+        self.last_kernel_size = last_kernel_size
+        self.residual_kernel_size = residual_kernel_size
+        self.dilation_growth_rate = dilation_growth_rate
+        self.use_causal_conv = use_causal_conv
+        self.pad_mode = pad_mode
+        self.compress = compress
+        self.num_lstm_layers = num_lstm_layers
+        self.trim_right_ratio = trim_right_ratio
+        self.codebook_size = codebook_size
+        self.codebook_dim = codebook_dim if codebook_dim is not None else hidden_size
+        self.use_conv_shortcut = use_conv_shortcut
+
+        if self.norm_type not in ["weight_norm", "time_group_norm"]:
+            raise ValueError(
+                f'self.norm_type must be one of `"weight_norm"`, `"time_group_norm"`, got {self.norm_type}'
+            )
+
+        super().__init__(**kwargs)
+
+    # This is a property because you might want to change the chunk_length_s on the fly
+    @property
+    def chunk_length(self) -> Optional[int]:
+        if self.chunk_length_s is None:
+            return None
+        else:
+            return int(self.chunk_length_s * self.sampling_rate)
+
+    # This is a property because you might want to change the chunk_length_s on the fly
+    @property
+    def chunk_stride(self) -> Optional[int]:
+        if self.chunk_length_s is None or self.overlap is None:
+            return None
+        else:
+            return max(1, int((1.0 - self.overlap) * self.chunk_length))
+
+    @property
+    def frame_rate(self) -> int:
+        hop_length = np.prod(self.upsampling_ratios)
+        return math.ceil(self.sampling_rate / hop_length)
+
+    @property
+    def num_quantizers(self) -> int:
+        return int(1000 * self.target_bandwidths[-1] // (self.frame_rate * 10))
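The derived properties at the end of the configuration are easiest to sanity-check with the default 24 kHz values: the hop length is the product of the upsampling ratios, the frame rate is the sampling rate divided by that hop, and the quantizer count follows from the largest target bandwidth (the hard-coded 10 corresponds to log2 of the default 1024-entry codebook). A quick worked example, standalone and mirroring the formulas above:

```python
import math

upsampling_ratios = [8, 5, 4, 2]
sampling_rate = 24_000
target_bandwidths = [1.5, 3.0, 6.0, 12.0, 24.0]

hop_length = math.prod(upsampling_ratios)                    # 320 samples per frame
frame_rate = math.ceil(sampling_rate / hop_length)           # 75 frames per second
num_quantizers = int(1000 * target_bandwidths[-1] // (frame_rate * 10))  # 32 codebooks

print(hop_length, frame_rate, num_quantizers)  # 320 75 32
```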
diff --git a/src/transformers/models/encodec/convert_encodec_checkpoint_to_pytorch.py b/src/transformers/models/encodec/convert_encodec_checkpoint_to_pytorch.py
new file mode 100644
index 00000000000000..3a16a4b7ba0f3b
--- /dev/null
+++ b/src/transformers/models/encodec/convert_encodec_checkpoint_to_pytorch.py
@@ -0,0 +1,365 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Convert EnCodec checkpoints."""
+
+import argparse
+
+import torch
+
+from transformers import (
+    EncodecConfig,
+    EncodecFeatureExtractor,
+    EncodecModel,
+    logging,
+)
+
+
+# checkpoints downloaded from:
+# https://dl.fbaipublicfiles.com/encodec/v0/encodec_24khz-d7cc33bc.th
+# https://huggingface.co/facebook/musicgen-small/resolve/main/compression_state_dict.bin
+# https://dl.fbaipublicfiles.com/encodec/v0/encodec_48khz-7e698e3e.th
+
+
+logging.set_verbosity_info()
+logger = logging.get_logger("transformers.models.encodec")
+
+MAPPING_QUANTIZER = {
+    "quantizer.vq.layers.*._codebook.inited": "quantizer.layers.*.codebook.inited",
+    "quantizer.vq.layers.*._codebook.cluster_size": "quantizer.layers.*.codebook.cluster_size",
+    "quantizer.vq.layers.*._codebook.embed": "quantizer.layers.*.codebook.embed",
+    "quantizer.vq.layers.*._codebook.embed_avg": "quantizer.layers.*.codebook.embed_avg",
+}
+MAPPING_ENCODER = {
+    "encoder.model.0.conv.conv": "encoder.layers.0.conv",
+    "encoder.model.1.block.1.conv.conv": "encoder.layers.1.block.1.conv",
+    "encoder.model.1.block.3.conv.conv": "encoder.layers.1.block.3.conv",
+    "encoder.model.1.shortcut.conv.conv": "encoder.layers.1.shortcut.conv",
+    "encoder.model.3.conv.conv": "encoder.layers.3.conv",
+    "encoder.model.4.block.1.conv.conv": "encoder.layers.4.block.1.conv",
+    "encoder.model.4.block.3.conv.conv": "encoder.layers.4.block.3.conv",
+    "encoder.model.4.shortcut.conv.conv": "encoder.layers.4.shortcut.conv",
+    "encoder.model.6.conv.conv": "encoder.layers.6.conv",
+    "encoder.model.7.block.1.conv.conv": "encoder.layers.7.block.1.conv",
+    "encoder.model.7.block.3.conv.conv": "encoder.layers.7.block.3.conv",
+    "encoder.model.7.shortcut.conv.conv": "encoder.layers.7.shortcut.conv",
+    "encoder.model.9.conv.conv": "encoder.layers.9.conv",
+    "encoder.model.10.block.1.conv.conv": "encoder.layers.10.block.1.conv",
+    "encoder.model.10.block.3.conv.conv": "encoder.layers.10.block.3.conv",
+    "encoder.model.10.shortcut.conv.conv": "encoder.layers.10.shortcut.conv",
+    "encoder.model.12.conv.conv": "encoder.layers.12.conv",
+    "encoder.model.13.lstm": "encoder.layers.13.lstm",
+    "encoder.model.15.conv.conv": "encoder.layers.15.conv",
+}
+MAPPING_ENCODER_48K = {
+    "encoder.model.0.conv.norm": "encoder.layers.0.norm",
+    "encoder.model.1.block.1.conv.norm": "encoder.layers.1.block.1.norm",
+    "encoder.model.1.block.3.conv.norm": "encoder.layers.1.block.3.norm",
+    "encoder.model.1.shortcut.conv.norm": "encoder.layers.1.shortcut.norm",
+    "encoder.model.3.conv.norm": "encoder.layers.3.norm",
+    "encoder.model.4.block.1.conv.norm": "encoder.layers.4.block.1.norm",
+    "encoder.model.4.block.3.conv.norm": "encoder.layers.4.block.3.norm",
+    "encoder.model.4.shortcut.conv.norm": "encoder.layers.4.shortcut.norm",
+    "encoder.model.6.conv.norm": "encoder.layers.6.norm",
+    "encoder.model.7.block.1.conv.norm": "encoder.layers.7.block.1.norm",
+    "encoder.model.7.block.3.conv.norm": "encoder.layers.7.block.3.norm",
+    "encoder.model.7.shortcut.conv.norm": "encoder.layers.7.shortcut.norm",
+    "encoder.model.9.conv.norm": "encoder.layers.9.norm",
+    "encoder.model.10.block.1.conv.norm": "encoder.layers.10.block.1.norm",
+    "encoder.model.10.block.3.conv.norm": "encoder.layers.10.block.3.norm",
+    "encoder.model.10.shortcut.conv.norm": "encoder.layers.10.shortcut.norm",
+    "encoder.model.12.conv.norm": "encoder.layers.12.norm",
+    "encoder.model.15.conv.norm": "encoder.layers.15.norm",
+}
+MAPPING_DECODER = {
+    "decoder.model.0.conv.conv": "decoder.layers.0.conv",
+    "decoder.model.1.lstm": "decoder.layers.1.lstm",
+    "decoder.model.3.convtr.convtr": "decoder.layers.3.conv",
+    "decoder.model.4.block.1.conv.conv": "decoder.layers.4.block.1.conv",
+    "decoder.model.4.block.3.conv.conv": "decoder.layers.4.block.3.conv",
+    "decoder.model.4.shortcut.conv.conv": "decoder.layers.4.shortcut.conv",
+    "decoder.model.6.convtr.convtr": "decoder.layers.6.conv",
+    "decoder.model.7.block.1.conv.conv": "decoder.layers.7.block.1.conv",
+    "decoder.model.7.block.3.conv.conv": "decoder.layers.7.block.3.conv",
+    "decoder.model.7.shortcut.conv.conv": "decoder.layers.7.shortcut.conv",
+    "decoder.model.9.convtr.convtr": "decoder.layers.9.conv",
+    "decoder.model.10.block.1.conv.conv": "decoder.layers.10.block.1.conv",
+    "decoder.model.10.block.3.conv.conv": "decoder.layers.10.block.3.conv",
+    "decoder.model.10.shortcut.conv.conv": "decoder.layers.10.shortcut.conv",
+    "decoder.model.12.convtr.convtr": "decoder.layers.12.conv",
+    "decoder.model.13.block.1.conv.conv": "decoder.layers.13.block.1.conv",
+    "decoder.model.13.block.3.conv.conv": "decoder.layers.13.block.3.conv",
+    "decoder.model.13.shortcut.conv.conv": "decoder.layers.13.shortcut.conv",
+    "decoder.model.15.conv.conv": "decoder.layers.15.conv",
+}
+MAPPING_DECODER_48K = {
+    "decoder.model.0.conv.norm": "decoder.layers.0.norm",
+    "decoder.model.3.convtr.norm": "decoder.layers.3.norm",
+    "decoder.model.4.block.1.conv.norm": "decoder.layers.4.block.1.norm",
+    "decoder.model.4.block.3.conv.norm": "decoder.layers.4.block.3.norm",
+    "decoder.model.4.shortcut.conv.norm": "decoder.layers.4.shortcut.norm",
+    "decoder.model.6.convtr.norm": "decoder.layers.6.norm",
+    "decoder.model.7.block.1.conv.norm": "decoder.layers.7.block.1.norm",
+    "decoder.model.7.block.3.conv.norm": "decoder.layers.7.block.3.norm",
+    "decoder.model.7.shortcut.conv.norm": "decoder.layers.7.shortcut.norm",
+    "decoder.model.9.convtr.norm": "decoder.layers.9.norm",
+    "decoder.model.10.block.1.conv.norm": "decoder.layers.10.block.1.norm",
+    "decoder.model.10.block.3.conv.norm": "decoder.layers.10.block.3.norm",
+    "decoder.model.10.shortcut.conv.norm": "decoder.layers.10.shortcut.norm",
+    "decoder.model.12.convtr.norm": "decoder.layers.12.norm",
+    "decoder.model.13.block.1.conv.norm": "decoder.layers.13.block.1.norm",
+    "decoder.model.13.block.3.conv.norm": "decoder.layers.13.block.3.norm",
+    "decoder.model.13.shortcut.conv.norm": "decoder.layers.13.shortcut.norm",
+    "decoder.model.15.conv.norm": "decoder.layers.15.norm",
+}
+MAPPING_24K = {
+    **MAPPING_QUANTIZER,
+    **MAPPING_ENCODER,
+    **MAPPING_DECODER,
+}
+MAPPING_48K = {
+    **MAPPING_QUANTIZER,
+    **MAPPING_ENCODER,
+    **MAPPING_ENCODER_48K,
+    **MAPPING_DECODER,
+    **MAPPING_DECODER_48K,
+}
+TOP_LEVEL_KEYS = []
+IGNORE_KEYS = []
+
+
+def set_recursively(hf_pointer, key, value, full_name, weight_type):
+    for attribute in key.split("."):
+        hf_pointer = getattr(hf_pointer, attribute)
+
+    if weight_type is not None:
+        hf_shape = getattr(hf_pointer, weight_type).shape
+    else:
+        hf_shape = hf_pointer.shape
+
+    if hf_shape != value.shape:
+        raise ValueError(
+            f"Shape of hf {key + '.' + weight_type if weight_type is not None else ''} is {hf_shape}, but should be"
+            f" {value.shape} for {full_name}"
+        )
+
+    if weight_type == "weight":
+        hf_pointer.weight.data = value
+    elif weight_type == "weight_g":
+        hf_pointer.weight_g.data = value
+    elif weight_type == "weight_v":
+        hf_pointer.weight_v.data = value
+    elif weight_type == "bias":
+        hf_pointer.bias.data = value
+    elif weight_type == "running_mean":
+        hf_pointer.running_mean.data = value
+    elif weight_type == "running_var":
+        hf_pointer.running_var.data = value
+    elif weight_type == "num_batches_tracked":
+        hf_pointer.num_batches_tracked.data = value
+    elif weight_type == "weight_ih_l0":
+        hf_pointer.weight_ih_l0.data = value
+    elif weight_type == "weight_hh_l0":
+        hf_pointer.weight_hh_l0.data = value
+    elif weight_type == "bias_ih_l0":
+        hf_pointer.bias_ih_l0.data = value
+    elif weight_type == "bias_hh_l0":
+        hf_pointer.bias_hh_l0.data = value
+    elif weight_type == "weight_ih_l1":
+        hf_pointer.weight_ih_l1.data = value
+    elif weight_type == "weight_hh_l1":
+        hf_pointer.weight_hh_l1.data = value
+    elif weight_type == "bias_ih_l1":
+        hf_pointer.bias_ih_l1.data = value
+    elif weight_type == "bias_hh_l1":
+        hf_pointer.bias_hh_l1.data = value
+    else:
+        hf_pointer.data = value
+
+    logger.info(f"{key + ('.' + weight_type if weight_type is not None else '')} was initialized from {full_name}.")
+
+
+def should_ignore(name, ignore_keys):
+    for key in ignore_keys:
+        if key.endswith(".*"):
+            if name.startswith(key[:-1]):
+                return True
+        elif ".*." in key:
+            prefix, suffix = key.split(".*.")
+            if prefix in name and suffix in name:
+                return True
+        elif key in name:
+            return True
+    return False
+
+
+def recursively_load_weights(orig_dict, hf_model, model_name):
+    unused_weights = []
+
+    if model_name in ["encodec_24khz", "encodec_32khz"]:
+        MAPPING = MAPPING_24K
+    elif model_name == "encodec_48khz":
+        MAPPING = MAPPING_48K
+    else:
+        raise ValueError(f"Unsupported model: {model_name}")
+
+    for name, value in orig_dict.items():
+        if should_ignore(name, IGNORE_KEYS):
+            logger.info(f"{name} was ignored")
+            continue
+
+        is_used = False
+        for key, mapped_key in MAPPING.items():
+            if "*" in key:
+                prefix, suffix = key.split(".*.")
+                if prefix in name and suffix in name:
+                    key = suffix
+
+            if key in name:
+                # HACK otherwise .embed gets initialized with .embed_avg too
+                if key.endswith("embed") and name.endswith("embed_avg"):
+                    continue
+
+                is_used = True
+                if "*" in mapped_key:
+                    layer_index = name.split(key)[0].split(".")[-2]
+                    mapped_key = mapped_key.replace("*", layer_index)
+                if "weight_g" in name:
+                    weight_type = "weight_g"
+                elif "weight_v" in name:
+                    weight_type = "weight_v"
+                elif "weight_ih_l0" in name:
+                    weight_type = "weight_ih_l0"
+                elif "weight_hh_l0" in name:
+                    weight_type = "weight_hh_l0"
+                elif "bias_ih_l0" in name:
+                    weight_type = "bias_ih_l0"
+                elif "bias_hh_l0" in name:
+                    weight_type = "bias_hh_l0"
+                elif "weight_ih_l1" in name:
+                    weight_type = "weight_ih_l1"
+                elif "weight_hh_l1" in name:
+                    weight_type = "weight_hh_l1"
+                elif "bias_ih_l1" in name:
+                    weight_type = "bias_ih_l1"
+                elif "bias_hh_l1" in name:
+                    weight_type = "bias_hh_l1"
+                elif "bias" in name:
+                    weight_type = "bias"
+                elif "weight" in name:
+                    weight_type = "weight"
+                elif "running_mean" in name:
+                    weight_type = "running_mean"
+                elif "running_var" in name:
+                    weight_type = "running_var"
+                elif "num_batches_tracked" in name:
+                    weight_type = "num_batches_tracked"
+                else:
+                    weight_type = None
+                set_recursively(hf_model, mapped_key, value, name, weight_type)
+            continue
+        if not is_used:
+            unused_weights.append(name)
+
+    logger.warning(f"Unused weights: {unused_weights}")
+
+
+@torch.no_grad()
+def convert_checkpoint(
+    model_name,
+    checkpoint_path,
+    pytorch_dump_folder_path,
+    config_path=None,
+    repo_id=None,
+):
+    """
+    Copy/paste/tweak model's weights to transformers design.
+    """
+    if config_path is not None:
+        config = EncodecConfig.from_pretrained(config_path)
+    else:
+        config = EncodecConfig()
+
+    if model_name == "encodec_24khz":
+        pass  # config is already correct
+    elif model_name == "encodec_32khz":
+        config.upsampling_ratios = [8, 5, 4, 4]
+        config.target_bandwidths = [2.2]
+        config.num_filters = 64
+        config.sampling_rate = 32_000
+        config.codebook_size = 2048
+        config.use_causal_conv = False
+        config.normalize = False
+        config.use_conv_shortcut = False
+    elif model_name == "encodec_48khz":
+        config.upsampling_ratios = [8, 5, 4, 2]
+        config.target_bandwidths = [3.0, 6.0, 12.0, 24.0]
+        config.sampling_rate = 48_000
+        config.audio_channels = 2
+        config.use_causal_conv = False
+        config.norm_type = "time_group_norm"
+        config.normalize = True
+        config.chunk_length_s = 1.0
+        config.overlap = 0.01
+    else:
+        raise ValueError(f"Unknown model name: {model_name}")
+
+    model = EncodecModel(config)
+
+    feature_extractor = EncodecFeatureExtractor(
+        feature_size=config.audio_channels,
+        sampling_rate=config.sampling_rate,
+        chunk_length_s=config.chunk_length_s,
+        overlap=config.overlap,
+    )
+    feature_extractor.save_pretrained(pytorch_dump_folder_path)
+
+    original_checkpoint = torch.load(checkpoint_path)
+    if "best_state" in original_checkpoint:
+        # we might have a training state saved, in which case discard the yaml results and just retain the weights
+        original_checkpoint = original_checkpoint["best_state"]
+    recursively_load_weights(original_checkpoint, model, model_name)
+    model.save_pretrained(pytorch_dump_folder_path)
+
+    if repo_id:
+        print("Pushing to the hub...")
+        feature_extractor.push_to_hub(repo_id)
+        model.push_to_hub(repo_id)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--model",
+        default="encodec_24khz",
+        type=str,
+        help="The model to convert. Should be one of 'encodec_24khz', 'encodec_32khz', 'encodec_48khz'.",
+    )
+    parser.add_argument("--checkpoint_path", required=True, default=None, type=str, help="Path to original checkpoint")
+    parser.add_argument("--config_path", default=None, type=str, help="Path to hf config.json of model to convert")
+    parser.add_argument(
+        "--pytorch_dump_folder_path", required=True, default=None, type=str, help="Path to the output PyTorch model."
+    )
+    parser.add_argument(
+        "--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub."
+    )
+
+    args = parser.parse_args()
+    convert_checkpoint(
+        args.model,
+        args.checkpoint_path,
+        args.pytorch_dump_folder_path,
+        args.config_path,
+        args.push_to_hub,
+    )
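
For reference, a minimal sketch of how the entry point above might be driven from Python instead of the CLI. The checkpoint and output paths are hypothetical placeholders; the function and its arguments are exactly the ones defined above:

```python
# Hypothetical paths; adjust to wherever the original EnCodec checkpoint lives.
convert_checkpoint(
    model_name="encodec_24khz",
    checkpoint_path="./encodec_24khz.th",
    pytorch_dump_folder_path="./encodec_24khz_converted",
    config_path=None,   # fall back to the default EncodecConfig
    repo_id=None,       # set to e.g. "username/encodec_24khz" to push to the Hub
)
```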
diff --git a/src/transformers/models/encodec/feature_extraction_encodec.py b/src/transformers/models/encodec/feature_extraction_encodec.py
new file mode 100644
index 00000000000000..6f7536a52e9f99
--- /dev/null
+++ b/src/transformers/models/encodec/feature_extraction_encodec.py
@@ -0,0 +1,206 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Feature extractor class for EnCodec."""
+
+from typing import List, Optional, Union
+
+import numpy as np
+
+from ...feature_extraction_sequence_utils import SequenceFeatureExtractor
+from ...feature_extraction_utils import BatchFeature
+from ...utils import PaddingStrategy, TensorType, logging
+
+
+logger = logging.get_logger(__name__)
+
+
+class EncodecFeatureExtractor(SequenceFeatureExtractor):
+    r"""
+    Constructs an EnCodec feature extractor.
+
+    This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains
+    most of the main methods. Users should refer to this superclass for more information regarding those methods.
+
+    Instantiating a feature extractor with the defaults will yield a similar configuration to that of the
+    [facebook/encodec_24khz](https://huggingface.co/facebook/encodec_24khz) architecture.
+
+    Args:
+        feature_size (`int`, *optional*, defaults to 1):
+            The feature dimension of the extracted features. Use 1 for mono, 2 for stereo.
+        sampling_rate (`int`, *optional*, defaults to 24000):
+            The sampling rate at which the audio waveform should be digitized, expressed in hertz (Hz).
+        padding_value (`float`, *optional*, defaults to 0.0):
+            The value that is used to fill the padding values.
+        chunk_length_s (`float`, *optional*):
+            If defined, the audio is pre-processed into chunks of length `chunk_length_s` and then encoded.
+        overlap (`float`, *optional*):
+            Defines the overlap between consecutive chunks. It is used to compute the `chunk_stride` with the
+            following formula: `int((1.0 - self.overlap) * self.chunk_length)`.
+    """
+
+    model_input_names = ["input_values", "padding_mask"]
+
+    def __init__(
+        self,
+        feature_size: int = 1,
+        sampling_rate: int = 24000,
+        padding_value: float = 0.0,
+        chunk_length_s: float = None,
+        overlap: float = None,
+        **kwargs,
+    ):
+        super().__init__(feature_size=feature_size, sampling_rate=sampling_rate, padding_value=padding_value, **kwargs)
+        self.chunk_length_s = chunk_length_s
+        self.overlap = overlap
+
+    # This is a property because you might want to change the chunk_length_s on the fly
+    @property
+    def chunk_length(self) -> Optional[int]:
+        if self.chunk_length_s is None:
+            return None
+        else:
+            return int(self.chunk_length_s * self.sampling_rate)
+
+    # This is a property because you might want to change the chunk_length_s on the fly
+    @property
+    def chunk_stride(self) -> Optional[int]:
+        if self.chunk_length_s is None or self.overlap is None:
+            return None
+        else:
+            return max(1, int((1.0 - self.overlap) * self.chunk_length))
+
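
As a quick sanity check of the two properties above, using the 48 kHz values set in the conversion script (`chunk_length_s=1.0`, `overlap=0.01`), a chunk covers one second of audio and the stride is roughly 99% of it:

```python
# Sketch only: values taken from the encodec_48khz configuration in the conversion script.
fe = EncodecFeatureExtractor(feature_size=2, sampling_rate=48_000, chunk_length_s=1.0, overlap=0.01)
print(fe.chunk_length)  # 48000 samples per chunk (1.0 s at 48 kHz)
print(fe.chunk_stride)  # ~47520 samples, i.e. consecutive chunks overlap by about 1%
```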
+    def __call__(
+        self,
+        raw_audio: Union[np.ndarray, List[float], List[np.ndarray], List[List[float]]],
+        padding: Optional[Union[bool, str, PaddingStrategy]] = None,
+        truncation: Optional[bool] = False,
+        max_length: Optional[int] = None,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+        sampling_rate: Optional[int] = None,
+    ) -> BatchFeature:
+        """
+        Main method to featurize and prepare for the model one or several sequence(s).
+
+        Args:
+            raw_audio (`np.ndarray`, `List[float]`, `List[np.ndarray]`, `List[List[float]]`):
+                The sequence or batch of sequences to be processed. Each sequence can be a numpy array, a list of float
+                values, a list of numpy arrays or a list of list of float values. The numpy array must be of shape
+                `(num_samples,)` for mono audio (`feature_size = 1`), or `(2, num_samples)` for stereo audio
+                (`feature_size = 2`).
+            padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
+                Select a strategy to pad the returned sequences (according to the model's padding side and padding
+                index) among:
+
+                - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
+                  sequence is provided).
+                - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
+                  acceptable input length for the model if that argument is not provided.
+                - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
+                  lengths).
+            truncation (`bool`, *optional*, defaults to `False`):
+                Activates truncation to cut input sequences longer than `max_length` to `max_length`.
+            max_length (`int`, *optional*):
+                Maximum length of the returned list and optionally padding length (see above).
+            return_tensors (`str` or [`~utils.TensorType`], *optional*):
+                If set, will return tensors instead of list of python integers. Acceptable values are:
+
+                - `'tf'`: Return TensorFlow `tf.constant` objects.
+                - `'pt'`: Return PyTorch `torch.Tensor` objects.
+                - `'np'`: Return Numpy `np.ndarray` objects.
+            sampling_rate (`int`, *optional*):
+                The sampling rate at which the `audio` input was sampled. It is strongly recommended to pass
+                `sampling_rate` at the forward call to prevent silent errors.
+        """
+        if sampling_rate is not None:
+            if sampling_rate != self.sampling_rate:
+                raise ValueError(
+                    f"The model corresponding to this feature extractor: {self} was trained using a sampling rate of"
+                    f" {self.sampling_rate}. Please make sure that the provided audio input was sampled with"
+                    f" {self.sampling_rate} and not {sampling_rate}."
+                )
+        else:
+            logger.warning(
+                "It is strongly recommended to pass the `sampling_rate` argument to this function. "
+                "Failing to do so can result in silent errors that might be hard to debug."
+            )
+
+        if padding and truncation:
+            raise ValueError("Both padding and truncation were set. Make sure you only set one.")
+        elif padding is None:
+            # by default let's pad the inputs
+            padding = True
+
+        is_batched = bool(
+            isinstance(raw_audio, (list, tuple)) and (isinstance(raw_audio[0], (np.ndarray, tuple, list)))
+        )
+
+        if is_batched:
+            raw_audio = [np.asarray(audio, dtype=np.float32).T for audio in raw_audio]
+        elif not is_batched and not isinstance(raw_audio, np.ndarray):
+            raw_audio = np.asarray(raw_audio, dtype=np.float32)
+        elif isinstance(raw_audio, np.ndarray) and raw_audio.dtype is np.dtype(np.float64):
+            raw_audio = raw_audio.astype(np.float32)
+
+        # always return batch
+        if not is_batched:
+            raw_audio = [np.asarray(raw_audio).T]
+
+        # verify inputs are valid
+        for idx, example in enumerate(raw_audio):
+            if example.ndim > 2:
+                raise ValueError(f"Expected input shape (channels, length) but got shape {example.shape}")
+            if self.feature_size == 1 and example.ndim != 1:
+                raise ValueError(f"Expected mono audio but example has {example.shape[-1]} channels")
+            if self.feature_size == 2 and example.shape[-1] != 2:
+                raise ValueError(f"Expected stereo audio but example has {example.shape[-1]} channels")
+
+        padded_inputs = None
+        input_values = BatchFeature({"input_values": raw_audio})
+        if self.chunk_stride is not None and self.chunk_length is not None and max_length is None:
+            if truncation:
+                max_length = min(array.shape[0] for array in raw_audio)
+                nb_step = int(np.floor(max_length / self.chunk_stride))
+                max_length = (nb_step - 1) * self.chunk_stride + self.chunk_length
+            elif padding:
+                max_length = max(array.shape[0] for array in raw_audio)
+                nb_step = int(np.ceil(max_length / self.chunk_stride))
+                max_length = (nb_step - 1) * self.chunk_stride + self.chunk_length
+                padding = "max_length"
+            else:
+                padded_inputs = input_values
+
+        # normal padding on batch
+        if padded_inputs is None:
+            padded_inputs = self.pad(
+                input_values,
+                max_length=max_length,
+                truncation=truncation,
+                padding=padding,
+                return_attention_mask=padding,
+            )
+            if padding:
+                padded_inputs["padding_mask"] = padded_inputs.pop("attention_mask")
+
+        input_values = []
+        for example in padded_inputs.pop("input_values"):
+            if self.feature_size == 1:
+                example = example[..., None]
+            input_values.append(example.T)
+
+        padded_inputs["input_values"] = input_values
+        if return_tensors is not None:
+            padded_inputs = padded_inputs.convert_to_tensors(return_tensors)
+
+        return padded_inputs
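
A short end-to-end sketch of `__call__` with the default settings (mono, 24 kHz, no chunking). The audio here is random noise, purely to show the returned keys and shapes; it assumes the class above is importable as `EncodecFeatureExtractor`:

```python
import numpy as np

feature_extractor = EncodecFeatureExtractor()          # defaults: mono, 24 kHz, no chunking
audio = np.random.randn(24_000).astype(np.float32)     # 1 second of fake mono audio
inputs = feature_extractor(raw_audio=audio, sampling_rate=24_000, return_tensors="pt")

print(inputs["input_values"].shape)   # torch.Size([1, 1, 24000]) -> (batch, channels, samples)
print(inputs["padding_mask"].shape)   # torch.Size([1, 24000])
```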
diff --git a/src/transformers/models/encodec/modeling_encodec.py b/src/transformers/models/encodec/modeling_encodec.py
new file mode 100644
index 00000000000000..441f4a27d83c50
--- /dev/null
+++ b/src/transformers/models/encodec/modeling_encodec.py
@@ -0,0 +1,806 @@
+# coding=utf-8
+# Copyright 2023 Meta Platforms, Inc. and affiliates, and the HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" PyTorch EnCodec model."""
+
+import math
+from dataclasses import dataclass
+from typing import List, Optional, Tuple, Union
+
+import torch
+import torch.utils.checkpoint
+from torch import nn
+
+from ...modeling_utils import PreTrainedModel
+from ...utils import (
+    ModelOutput,
+    add_start_docstrings,
+    add_start_docstrings_to_model_forward,
+    logging,
+    replace_return_docstrings,
+)
+from .configuration_encodec import EncodecConfig
+
+
+logger = logging.get_logger(__name__)
+
+
+# General docstring
+_CONFIG_FOR_DOC = "EncodecConfig"
+
+
+ENCODEC_PRETRAINED_MODEL_ARCHIVE_LIST = [
+    "facebook/encodec_24khz",
+    "facebook/encodec_48khz",
+    # See all EnCodec models at https://huggingface.co/models?filter=encodec
+]
+
+
+@dataclass
+class EncodecOutput(ModelOutput):
+    """
+    Args:
+        audio_codes (`torch.FloatTensor` of shape `(batch_size, nb_chunks, chunk_length)`, *optional*):
+            Discrete code embeddings computed using `model.encode`.
+        audio_values (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Decoded audio values, obtained using the decoder part of Encodec.
+    """
+
+    audio_codes: torch.FloatTensor = None
+    audio_values: torch.FloatTensor = None
+
+
+@dataclass
+class EncodecEncoderOutput(ModelOutput):
+    """
+    Args:
+        audio_codes (`torch.FloatTensor` of shape `(batch_size, nb_chunks, chunk_length)`, *optional*):
+            Discrete code embeddings computed using `model.encode`.
+        audio_scales (`torch.Tensor` of shape `(batch_size, nb_chunks)`, *optional*):
+            Scaling factor for each `audio_codes` input. This is used to unscale each chunk of audio when decoding.
+    """
+
+    audio_codes: torch.FloatTensor = None
+    audio_scales: torch.FloatTensor = None
+
+
+@dataclass
+class EncodecDecoderOutput(ModelOutput):
+    """
+    Args:
+        audio_values (`torch.FloatTensor`  of shape `(batch_size, segment_length)`, *optional*):
+            Decoded audio values, obtained using the decoder part of Encodec.
+    """
+
+    audio_values: torch.FloatTensor = None
+
+
+class EncodecConv1d(nn.Module):
+    """Conv1d with asymmetric or causal padding and normalization."""
+
+    def __init__(
+        self, config, in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, dilation: int = 1
+    ):
+        super().__init__()
+        self.causal = config.use_causal_conv
+        self.pad_mode = config.pad_mode
+        self.norm_type = config.norm_type
+
+        if self.norm_type not in ["weight_norm", "time_group_norm"]:
+            raise ValueError(
+                f'self.norm_type must be one of `"weight_norm"`, `"time_group_norm"`, got {self.norm_type}'
+            )
+
+        # warn user on unusual setup between dilation and stride
+        if stride > 1 and dilation > 1:
+            logger.warning(
+                "EncodecConv1d has been initialized with stride > 1 and dilation > 1"
+                f" (kernel_size={kernel_size} stride={stride}, dilation={dilation})."
+            )
+
+        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, stride, dilation=dilation)
+        if self.norm_type == "weight_norm":
+            self.conv = nn.utils.weight_norm(self.conv)
+        elif self.norm_type == "time_group_norm":
+            self.norm = nn.GroupNorm(1, out_channels)
+
+    @staticmethod
+    def _get_extra_padding_for_conv1d(
+        hidden_states: torch.Tensor, kernel_size: int, stride: int, padding_total: int = 0
+    ) -> int:
+        """See `pad_for_conv1d`."""
+        length = hidden_states.shape[-1]
+        n_frames = (length - kernel_size + padding_total) / stride + 1
+        ideal_length = (math.ceil(n_frames) - 1) * stride + (kernel_size - padding_total)
+        return ideal_length - length
+
+    @staticmethod
+    def _pad1d(hidden_states: torch.Tensor, paddings: Tuple[int, int], mode: str = "zero", value: float = 0.0):
+        """Tiny wrapper around torch.nn.functional.pad, just to allow for reflect padding on small input.
+        If this is the case, we insert extra 0 padding to the right before the reflection happens.
+        """
+        length = hidden_states.shape[-1]
+        padding_left, padding_right = paddings
+        if not mode == "reflect":
+            return nn.functional.pad(hidden_states, paddings, mode, value)
+
+        max_pad = max(padding_left, padding_right)
+        extra_pad = 0
+        if length <= max_pad:
+            extra_pad = max_pad - length + 1
+            hidden_states = nn.functional.pad(hidden_states, (0, extra_pad))
+        padded = nn.functional.pad(hidden_states, paddings, mode, value)
+        end = padded.shape[-1] - extra_pad
+        return padded[..., :end]
+
+    def forward(self, hidden_states):
+        kernel_size = self.conv.kernel_size[0]
+        stride = self.conv.stride[0]
+        dilation = self.conv.dilation[0]
+        kernel_size = (kernel_size - 1) * dilation + 1  # effective kernel size with dilations
+        padding_total = kernel_size - stride
+        extra_padding = self._get_extra_padding_for_conv1d(hidden_states, kernel_size, stride, padding_total)
+
+        if self.causal:
+            # Left padding for causal
+            hidden_states = self._pad1d(hidden_states, (padding_total, extra_padding), mode=self.pad_mode)
+        else:
+            # Asymmetric padding required for odd strides
+            padding_right = padding_total // 2
+            padding_left = padding_total - padding_right
+            hidden_states = self._pad1d(
+                hidden_states, (padding_left, padding_right + extra_padding), mode=self.pad_mode
+            )
+
+        hidden_states = self.conv(hidden_states)
+
+        if self.norm_type == "time_group_norm":
+            hidden_states = self.norm(hidden_states)
+
+        return hidden_states
+
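
A worked example of the padding arithmetic in `forward` above, with hypothetical sizes: an effective kernel of 7 and a stride of 2 give `padding_total = 5`, and a 101-sample input needs one extra padded sample so the final frame lines up exactly:

```python
import torch

kernel_size, stride = 7, 2
padding_total = kernel_size - stride                      # 5
extra = EncodecConv1d._get_extra_padding_for_conv1d(
    torch.zeros(1, 1, 101), kernel_size, stride, padding_total
)
print(extra)  # 1
# causal conv:     pad (padding_total, extra)                               -> (5, 1), fixed padding on the left
# non-causal conv: pad (padding_total - padding_total // 2,
#                       padding_total // 2 + extra)                         -> (3, 3), split (almost) evenly
```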
+
+class EncodecConvTranspose1d(nn.Module):
+    """ConvTranspose1d with asymmetric or causal padding and normalization."""
+
+    def __init__(self, config, in_channels: int, out_channels: int, kernel_size: int, stride: int = 1):
+        super().__init__()
+        self.causal = config.use_causal_conv
+        self.trim_right_ratio = config.trim_right_ratio
+        self.norm_type = config.norm_type
+        if self.norm_type not in ["weight_norm", "time_group_norm"]:
+            raise ValueError(
+                f'self.norm_type must be one of `"weight_norm"`, `"time_group_norm"`, got {self.norm_type}'
+            )
+
+        self.conv = nn.ConvTranspose1d(in_channels, out_channels, kernel_size, stride)
+        if config.norm_type == "weight_norm":
+            self.conv = nn.utils.weight_norm(self.conv)
+        elif config.norm_type == "time_group_norm":
+            self.norm = nn.GroupNorm(1, out_channels)
+
+        if not (self.causal or self.trim_right_ratio == 1.0):
+            raise ValueError("`trim_right_ratio` != 1.0 only makes sense for causal convolutions")
+
+    def forward(self, hidden_states):
+        kernel_size = self.conv.kernel_size[0]
+        stride = self.conv.stride[0]
+        padding_total = kernel_size - stride
+
+        hidden_states = self.conv(hidden_states)
+
+        if self.norm_type == "time_group_norm":
+            hidden_states = self.norm(hidden_states)
+
+        # We will only trim fixed padding. Extra padding from `pad_for_conv1d` would be
+        # removed at the very end, when keeping only the right length for the output,
+        # as removing it here would require also passing the length at the matching layer
+        # in the encoder.
+        if self.causal:
+            # Trim the padding on the right according to the specified ratio
+            # if trim_right_ratio = 1.0, trim everything from right
+            padding_right = math.ceil(padding_total * self.trim_right_ratio)
+        else:
+            # Asymmetric padding required for odd strides
+            padding_right = padding_total // 2
+
+        padding_left = padding_total - padding_right
+
+        # unpad
+        end = hidden_states.shape[-1] - padding_right
+        hidden_states = hidden_states[..., padding_left:end]
+        return hidden_states
+
+
+class EncodecLSTM(nn.Module):
+    """
+    LSTM that lets the caller ignore the hidden state and the layout of the data. Expects input in convolutional layout.
+    """
+
+    def __init__(self, config, dimension):
+        super().__init__()
+        self.lstm = nn.LSTM(dimension, dimension, config.num_lstm_layers)
+
+    def forward(self, hidden_states):
+        hidden_states = hidden_states.permute(2, 0, 1)
+        hidden_states = self.lstm(hidden_states)[0] + hidden_states
+        hidden_states = hidden_states.permute(1, 2, 0)
+        return hidden_states
+
+
+class EncodecResnetBlock(nn.Module):
+    """
+    Residual block from SEANet model as used by EnCodec.
+    """
+
+    def __init__(self, config: EncodecConfig, dim: int, dilations: List[int]):
+        super().__init__()
+        kernel_sizes = (config.residual_kernel_size, 1)
+        if len(kernel_sizes) != len(dilations):
+            raise ValueError("Number of kernel sizes should match number of dilations")
+
+        hidden = dim // config.compress
+        block = []
+        for i, (kernel_size, dilation) in enumerate(zip(kernel_sizes, dilations)):
+            in_chs = dim if i == 0 else hidden
+            out_chs = dim if i == len(kernel_sizes) - 1 else hidden
+            block += [nn.ELU()]
+            block += [EncodecConv1d(config, in_chs, out_chs, kernel_size, dilation=dilation)]
+        self.block = nn.ModuleList(block)
+
+        if config.use_conv_shortcut:
+            self.shortcut = EncodecConv1d(config, dim, dim, kernel_size=1)
+        else:
+            self.shortcut = nn.Identity()
+
+    def forward(self, hidden_states):
+        residual = hidden_states
+        for layer in self.block:
+            hidden_states = layer(hidden_states)
+
+        return self.shortcut(residual) + hidden_states
+
+
+class EncodecEncoder(nn.Module):
+    """SEANet encoder as used by EnCodec."""
+
+    def __init__(self, config: EncodecConfig):
+        super().__init__()
+        model = [EncodecConv1d(config, config.audio_channels, config.num_filters, config.kernel_size)]
+        scaling = 1
+
+        # Downsample to raw audio scale
+        for ratio in reversed(config.upsampling_ratios):
+            current_scale = scaling * config.num_filters
+            # Add residual layers
+            for j in range(config.num_residual_layers):
+                model += [EncodecResnetBlock(config, current_scale, [config.dilation_growth_rate**j, 1])]
+            # Add downsampling layers
+            model += [nn.ELU()]
+            model += [EncodecConv1d(config, current_scale, current_scale * 2, kernel_size=ratio * 2, stride=ratio)]
+            scaling *= 2
+
+        model += [EncodecLSTM(config, scaling * config.num_filters)]
+        model += [nn.ELU()]
+        model += [EncodecConv1d(config, scaling * config.num_filters, config.hidden_size, config.last_kernel_size)]
+
+        self.layers = nn.ModuleList(model)
+
+    def forward(self, hidden_states):
+        for layer in self.layers:
+            hidden_states = layer(hidden_states)
+        return hidden_states
+
+
+class EncodecDecoder(nn.Module):
+    """SEANet decoder as used by EnCodec."""
+
+    def __init__(self, config: EncodecConfig):
+        super().__init__()
+        scaling = int(2 ** len(config.upsampling_ratios))
+        model = [EncodecConv1d(config, config.hidden_size, scaling * config.num_filters, config.kernel_size)]
+
+        model += [EncodecLSTM(config, scaling * config.num_filters)]
+
+        # Upsample to raw audio scale
+        for ratio in config.upsampling_ratios:
+            current_scale = scaling * config.num_filters
+            # Add upsampling layers
+            model += [nn.ELU()]
+            model += [
+                EncodecConvTranspose1d(config, current_scale, current_scale // 2, kernel_size=ratio * 2, stride=ratio)
+            ]
+            # Add residual layers
+            for j in range(config.num_residual_layers):
+                model += [EncodecResnetBlock(config, current_scale // 2, (config.dilation_growth_rate**j, 1))]
+            scaling //= 2
+
+        # Add final layers
+        model += [nn.ELU()]
+        model += [EncodecConv1d(config, config.num_filters, config.audio_channels, config.last_kernel_size)]
+        self.layers = nn.ModuleList(model)
+
+    def forward(self, hidden_states):
+        for layer in self.layers:
+            hidden_states = layer(hidden_states)
+        return hidden_states
+
+
+class EncodecEuclideanCodebook(nn.Module):
+    """Codebook with Euclidean distance."""
+
+    def __init__(self, config: EncodecConfig):
+        super().__init__()
+        embed = torch.zeros(config.codebook_size, config.codebook_dim)
+
+        self.codebook_size = config.codebook_size
+
+        self.register_buffer("inited", torch.Tensor([True]))
+        self.register_buffer("cluster_size", torch.zeros(config.codebook_size))
+        self.register_buffer("embed", embed)
+        self.register_buffer("embed_avg", embed.clone())
+
+    def quantize(self, hidden_states):
+        embed = self.embed.t()
+        scaled_states = hidden_states.pow(2).sum(1, keepdim=True)
+        dist = -(scaled_states - 2 * hidden_states @ embed + embed.pow(2).sum(0, keepdim=True))
+        embed_ind = dist.max(dim=-1).indices
+        return embed_ind
+
+    def encode(self, hidden_states):
+        shape = hidden_states.shape
+        # pre-process
+        hidden_states = hidden_states.reshape((-1, shape[-1]))
+        # quantize
+        embed_ind = self.quantize(hidden_states)
+        # post-process
+        embed_ind = embed_ind.view(*shape[:-1])
+        return embed_ind
+
+    def decode(self, embed_ind):
+        quantize = nn.functional.embedding(embed_ind, self.embed)
+        return quantize
+
+
+class EncodecVectorQuantization(nn.Module):
+    """
+    Vector quantization implementation. Currently supports only euclidean distance.
+    """
+
+    def __init__(self, config: EncodecConfig):
+        super().__init__()
+        self.codebook = EncodecEuclideanCodebook(config)
+
+    def encode(self, hidden_states):
+        hidden_states = hidden_states.permute(0, 2, 1)
+        embed_in = self.codebook.encode(hidden_states)
+        return embed_in
+
+    def decode(self, embed_ind):
+        quantize = self.codebook.decode(embed_ind)
+        quantize = quantize.permute(0, 2, 1)
+        return quantize
+
+
+class EncodecResidualVectorQuantizer(nn.Module):
+    """Residual Vector Quantizer."""
+
+    def __init__(self, config: EncodecConfig):
+        super().__init__()
+        self.codebook_size = config.codebook_size
+        self.frame_rate = config.frame_rate
+        self.num_quantizers = config.num_quantizers
+        self.layers = nn.ModuleList([EncodecVectorQuantization(config) for _ in range(config.num_quantizers)])
+
+    def get_num_quantizers_for_bandwidth(self, bandwidth: Optional[float] = None) -> int:
+        """Return num_quantizers based on specified target bandwidth."""
+        bw_per_q = math.log2(self.codebook_size) * self.frame_rate
+        num_quantizers = self.num_quantizers
+        if bandwidth is not None and bandwidth > 0.0:
+            num_quantizers = int(max(1, math.floor(bandwidth * 1000 / bw_per_q)))
+        return num_quantizers
+
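
For intuition, the bandwidth-to-codebook arithmetic above works out as follows under the assumed 24 kHz defaults (`codebook_size=1024`, i.e. 10 bits per codebook, and a frame rate of 75 Hz):

```python
import math

# Each quantizer costs log2(codebook_size) bits per frame, with frame_rate frames per second.
codebook_size, frame_rate = 1024, 75                 # assumed 24 kHz defaults
bw_per_q = math.log2(codebook_size) * frame_rate     # 750 bits/s per codebook
for bandwidth in (1.5, 3.0, 6.0, 12.0, 24.0):        # kbps
    print(bandwidth, int(max(1, math.floor(bandwidth * 1000 / bw_per_q))))
# -> 1.5 kbps uses 2 codebooks, 3.0 -> 4, 6.0 -> 8, 12.0 -> 16, 24.0 -> 32
```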
+    def encode(self, embeddings: torch.Tensor, bandwidth: Optional[float] = None) -> torch.Tensor:
+        """
+        Encode a given input tensor with the specified frame rate at the given bandwidth. The RVQ encode method sets
+        the appropriate number of quantizers to use and returns indices for each quantizer.
+        """
+        num_quantizers = self.get_num_quantizers_for_bandwidth(bandwidth)
+        residual = embeddings
+        all_indices = []
+        for layer in self.layers[:num_quantizers]:
+            indices = layer.encode(residual)
+            quantized = layer.decode(indices)
+            residual = residual - quantized
+            all_indices.append(indices)
+        out_indices = torch.stack(all_indices)
+        return out_indices
+
+    def decode(self, codes: torch.Tensor) -> torch.Tensor:
+        """Decode the given codes to the quantized representation."""
+        quantized_out = torch.tensor(0.0, device=codes.device)
+        for i, indices in enumerate(codes):
+            layer = self.layers[i]
+            quantized = layer.decode(indices)
+            quantized_out = quantized_out + quantized
+        return quantized_out
+
+
+class EncodecPreTrainedModel(PreTrainedModel):
+    """
+    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
+    models.
+    """
+
+    config_class = EncodecConfig
+    base_model_prefix = "encodec"
+    main_input_name = "input_values"
+
+    def _init_weights(self, module):
+        """Initialize the weights"""
+        if isinstance(module, nn.Linear):
+            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, (nn.LayerNorm, nn.GroupNorm)):
+            module.bias.data.zero_()
+            module.weight.data.fill_(1.0)
+        elif isinstance(module, nn.Conv1d):
+            nn.init.kaiming_normal_(module.weight)
+            if module.bias is not None:
+                k = math.sqrt(module.groups / (module.in_channels * module.kernel_size[0]))
+                nn.init.uniform_(module.bias, a=-k, b=k)
+        elif isinstance(module, nn.Embedding):
+            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+            if module.padding_idx is not None:
+                module.weight.data[module.padding_idx].zero_()
+        elif isinstance(module, nn.LSTM):
+            for name, param in module.named_parameters():
+                if "weight" in name:
+                    nn.init.xavier_uniform_(param)
+                elif "bias" in name:
+                    nn.init.constant_(param, 0.0)
+
+
+ENCODEC_START_DOCSTRING = r"""
+    This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
+    library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
+    etc.)
+
+    This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
+    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
+    and behavior.
+
+    Parameters:
+        config ([`EncodecConfig`]):
+            Model configuration class with all the parameters of the model. Initializing with a config file does not
+            load the weights associated with the model, only the configuration. Check out the
+            [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+
+ENCODEC_INPUTS_DOCSTRING = r"""
+    Args:
+        input_values (`torch.FloatTensor` of shape `(batch_size, channels, sequence_length)`, *optional*):
+            Raw audio input converted to float and padded to the appropriate length in order to be encoded using chunks
+            of length `config.chunk_length` and a stride of `config.chunk_stride`.
+        padding_mask (`torch.BoolTensor` of shape `(batch_size, channels, sequence_length)`, *optional*):
+            Mask to avoid computing scaling factors on padding token indices (and, ideally, to avoid computing the convolutions on them).
+            Mask values selected in `[0, 1]`:
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **masked**.
+
+            
+
+             `padding_mask` should always be passed, unless the input was truncated or not padded. This is because in
+             order to process tensors efficiently, the input audio should be padded so that `input_length % stride =
+             step`, with `step = chunk_length - stride`. This ensures that all chunks have the same shape.
+
+            
+
+        bandwidth (`float`, *optional*):
+            The target bandwidth. Must be one of `config.target_bandwidths`. If `None`, the smallest possible
+            bandwidth is used. The bandwidth is expressed in thousands of bits per second (kbps), e.g. a 6kbps
+            bandwidth is represented as `bandwidth == 6.0`.
+        audio_codes (`torch.FloatTensor` of shape `(batch_size, nb_chunks, chunk_length)`, *optional*):
+            Discrete code embeddings computed using `model.encode`.
+        audio_scales (`torch.Tensor` of shape `(batch_size, nb_chunks)`, *optional*):
+            Scaling factor for each `audio_codes` input.
+        return_dict (`bool`, *optional*):
+            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+"""
+
+
+@add_start_docstrings(
+    "The EnCodec neural audio codec model.",
+    ENCODEC_START_DOCSTRING,
+)
+class EncodecModel(EncodecPreTrainedModel):
+    def __init__(self, config: EncodecConfig):
+        super().__init__(config)
+        self.config = config
+
+        self.encoder = EncodecEncoder(config)
+        self.decoder = EncodecDecoder(config)
+
+        self.quantizer = EncodecResidualVectorQuantizer(config)
+
+        self.bits_per_codebook = int(math.log2(self.config.codebook_size))
+        if 2**self.bits_per_codebook != self.config.codebook_size:
+            raise ValueError("The codebook_size must be a power of 2.")
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_encoder(self):
+        return self.encoder
+
+    def get_decoder(self):
+        return self.decoder
+
+    def _encode_frame(
+        self, input_values: torch.Tensor, bandwidth: float, padding_mask: torch.Tensor
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
+        """
+        Encodes the given input using the underlying VQVAE. If `config.normalize` is set to `True` the input is first
+        normalized. The padding mask is required to compute the correct scale.
+        """
+        length = input_values.shape[-1]
+        duration = length / self.config.sampling_rate
+
+        if self.config.chunk_length_s is not None and duration > 1e-5 + self.config.chunk_length_s:
+            raise RuntimeError(f"Duration of frame ({duration}) is longer than chunk {self.config.chunk_length_s}")
+
+        scale = None
+        if self.config.normalize:
+            # if the padding is non zero
+            input_values = input_values * padding_mask
+            mono = torch.sum(input_values, 1, keepdim=True) / input_values.shape[1]
+            scale = mono.pow(2).mean(dim=-1, keepdim=True).sqrt() + 1e-8
+            input_values = input_values / scale
+
+        embeddings = self.encoder(input_values)
+        codes = self.quantizer.encode(embeddings, bandwidth)
+        codes = codes.transpose(0, 1)
+        return codes, scale
+
+    def encode(
+        self,
+        input_values: torch.Tensor,
+        padding_mask: torch.Tensor = None,
+        bandwidth: Optional[float] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple[torch.Tensor, Optional[torch.Tensor]], EncodecEncoderOutput]:
+        """
+        Encodes the input audio waveform into discrete codes.
+
+        Args:
+            input_values (`torch.Tensor` of shape `(batch_size, channels, sequence_length)`):
+                Float values of the input audio waveform.
+            padding_mask (`torch.Tensor` of shape `(batch_size, channels, sequence_length)`):
+                Padding mask used to pad the `input_values`.
+            bandwidth (`float`, *optional*):
+                The target bandwidth. Must be one of `config.target_bandwidths`. If `None`, the smallest possible
+                bandwidth is used. The bandwidth is expressed in thousands of bits per second (kbps), e.g. a 6kbps
+                bandwidth is represented as `bandwidth == 6.0`.
+
+        Returns:
+            A list of frames containing the discrete encoded codes for the input audio waveform, along with rescaling
+            factors for each chunk when `normalize` is True. Each frame is a tuple `(codebook, scale)`, with
+            `codebook` of shape `[batch_size, num_codebooks, frames]`.
+        """
+        return_dict = return_dict if return_dict is not None else self.config.return_dict
+
+        if bandwidth is None:
+            bandwidth = self.config.target_bandwidths[0]
+        if bandwidth not in self.config.target_bandwidths:
+            raise ValueError(
+                f"This model doesn't support the bandwidth {bandwidth}. "
+                f"Select one of {self.config.target_bandwidths}."
+            )
+
+        _, channels, input_length = input_values.shape
+
+        if channels < 1 or channels > 2:
+            raise ValueError(f"Number of audio channels must be 1 or 2, but got {channels}")
+
+        chunk_length = self.config.chunk_length
+        if chunk_length is None:
+            chunk_length = input_length
+            stride = input_length
+        else:
+            stride = self.config.chunk_stride
+
+        if padding_mask is None:
+            padding_mask = torch.ones_like(input_values).bool()
+
+        encoded_frames = []
+        scales = []
+
+        step = chunk_length - stride
+        if (input_length % stride) - step != 0:
+            raise ValueError(
+                "The input length is not properly padded for batched chunked decoding. Make sure to pad the input correctly."
+            )
+
+        for offset in range(0, input_length - step, stride):
+            mask = padding_mask[..., offset : offset + chunk_length].bool()
+            frame = input_values[:, :, offset : offset + chunk_length]
+            encoded_frame, scale = self._encode_frame(frame, bandwidth, mask)
+            encoded_frames.append(encoded_frame)
+            scales.append(scale)
+
+        encoded_frames = torch.stack(encoded_frames)
+
+        if not return_dict:
+            return (encoded_frames, scales)
+
+        return EncodecEncoderOutput(encoded_frames, scales)
+
+    @staticmethod
+    def _linear_overlap_add(frames: List[torch.Tensor], stride: int):
+        # Generic overlap add, with linear fade-in/fade-out, supporting complex scenario
+        # e.g., more than 2 frames per position.
+        # The core idea is to use a weight function that is a triangle,
+        # with a maximum value at the middle of the chunk.
+        # We use this weighting when summing the frames, and divide by the sum of weights
+        # for each position at the end. Thus:
+        #   - if a frame is the only one to cover a position, the weighting is a no-op.
+        #   - if 2 frames cover a position:
+        #          ...  ...
+        #         /   \/   \
+        #        /    /\    \
+        #            S  T       , i.e. S offset of second frame starts, T end of first frame.
+        # Then the weight function for each one is: (t - S), (T - t), with `t` a given offset.
+        # After the final normalization, the weight of the second frame at position `t` is
+        # (t - S) / (t - S + (T - t)) = (t - S) / (T - S), which is exactly what we want.
+        #
+        #   - if more than 2 frames overlap at a given point, we hope that by induction
+        #      something sensible happens.
+        if len(frames) == 0:
+            raise ValueError("`frames` cannot be an empty list.")
+
+        device = frames[0].device
+        dtype = frames[0].dtype
+        shape = frames[0].shape[:-1]
+        total_size = stride * (len(frames) - 1) + frames[-1].shape[-1]
+
+        frame_length = frames[0].shape[-1]
+        time_vec = torch.linspace(0, 1, frame_length + 2, device=device, dtype=dtype)[1:-1]
+        weight = 0.5 - (time_vec - 0.5).abs()
+
+        sum_weight = torch.zeros(total_size, device=device, dtype=dtype)
+        out = torch.zeros(*shape, total_size, device=device, dtype=dtype)
+        offset: int = 0
+
+        for frame in frames:
+            frame_length = frame.shape[-1]
+            out[..., offset : offset + frame_length] += weight[:frame_length] * frame
+            sum_weight[offset : offset + frame_length] += weight[:frame_length]
+            offset += stride
+
+        if sum_weight.min() == 0:
+            raise ValueError(f"`sum_weight` minimum element must be bigger than zero: {sum_weight}`")
+
+        return out / sum_weight
+
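
The triangular weighting described in the comment above can be checked on a toy input: two constant frames of length 4 overlapping by 2 samples cross-fade linearly in the shared region and are left untouched elsewhere:

```python
import torch

# Toy check of _linear_overlap_add: frame values 1.0 and 2.0, stride 2.
frames = [torch.ones(1, 1, 4), 2.0 * torch.ones(1, 1, 4)]
out = EncodecModel._linear_overlap_add(frames, 2)
print(out.shape)  # torch.Size([1, 1, 6])
print(out)        # [1.0, 1.0, 1.33, 1.67, 2.0, 2.0]: a linear cross-fade over the overlap
```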
+    def _decode_frame(self, codes: torch.Tensor, scale: Optional[torch.Tensor] = None) -> torch.Tensor:
+        codes = codes.transpose(0, 1)
+        embeddings = self.quantizer.decode(codes)
+        outputs = self.decoder(embeddings)
+        if scale is not None:
+            outputs = outputs * scale.view(-1, 1, 1)
+        return outputs
+
+    def decode(
+        self,
+        audio_codes: torch.Tensor,
+        audio_scales: torch.Tensor,
+        padding_mask: Optional[torch.Tensor] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple[torch.Tensor, torch.Tensor], EncodecDecoderOutput]:
+        """
+        Decodes the given frames into an output audio waveform.
+
+        Note that the output might be a bit bigger than the input. In that case, any extra steps at the end can be
+        trimmed.
+
+        Args:
+            audio_codes (`torch.FloatTensor` of shape `(batch_size, nb_chunks, chunk_length)`, *optional*):
+                Discrete code embeddings computed using `model.encode`.
+            audio_scales (`torch.Tensor` of shape `(batch_size, nb_chunks)`, *optional*):
+                Scaling factor for each `audio_codes` input.
+            padding_mask (`torch.Tensor` of shape `(batch_size, channels, sequence_length)`):
+                Padding mask used to pad the `input_values`.
+            return_dict (`bool`, *optional*):
+                Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+
+        """
+        return_dict = return_dict if return_dict is not None else self.config.return_dict
+
+        chunk_length = self.config.chunk_length
+        if chunk_length is None:
+            if len(audio_codes) != 1:
+                raise ValueError(f"Expected one frame, got {len(audio_codes)}")
+            audio_values = self._decode_frame(audio_codes[0], audio_scales[0])
+        else:
+            decoded_frames = []
+
+            for frame, scale in zip(audio_codes, audio_scales):
+                frames = self._decode_frame(frame, scale)
+                decoded_frames.append(frames)
+
+            audio_values = self._linear_overlap_add(decoded_frames, self.config.chunk_stride or 1)
+
+        # truncate based on padding mask
+        if padding_mask is not None and padding_mask.shape[-1] < audio_values.shape[-1]:
+            audio_values = audio_values[..., : padding_mask.shape[-1]]
+
+        if not return_dict:
+            return (audio_values,)
+        return EncodecDecoderOutput(audio_values)
+
+    @add_start_docstrings_to_model_forward(ENCODEC_INPUTS_DOCSTRING)
+    @replace_return_docstrings(output_type=EncodecOutput, config_class=_CONFIG_FOR_DOC)
+    def forward(
+        self,
+        input_values: torch.Tensor,
+        padding_mask: Optional[torch.Tensor] = None,
+        bandwidth: Optional[float] = None,
+        audio_codes: Optional[torch.Tensor] = None,
+        audio_scales: Optional[torch.Tensor] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple[torch.Tensor, torch.Tensor], EncodecOutput]:
+        r"""
+        Returns:
+
+        Examples:
+
+        ```python
+        >>> from datasets import load_dataset
+        >>> from transformers import AutoProcessor, EncodecModel
+
+        >>> dataset = load_dataset("ashraq/esc50")
+        >>> audio_sample = dataset["train"]["audio"][0]["array"]
+
+        >>> model_id = "facebook/encodec_24khz"
+        >>> model = EncodecModel.from_pretrained(model_id)
+        >>> processor = AutoProcessor.from_pretrained(model_id)
+
+        >>> inputs = processor(raw_audio=audio_sample, return_tensors="pt")
+
+        >>> outputs = model(**inputs)
+        >>> audio_codes = outputs.audio_codes
+        >>> audio_values = outputs.audio_values
+        ```"""
+        return_dict = return_dict if return_dict is not None else self.config.return_dict
+
+        if padding_mask is None:
+            padding_mask = torch.ones_like(input_values).bool()
+
+        if audio_codes is not None and audio_scales is None:
+            raise ValueError("You specified `audio_codes` but did not specify the `audio_scales`")
+
+        if audio_scales is not None and audio_codes is None:
+            raise ValueError("You specified `audio_scales` but did not specify the `audio_codes`")
+
+        if audio_scales is None and audio_codes is None:
+            audio_codes, audio_scales = self.encode(input_values, padding_mask, bandwidth, False)
+
+        audio_values = self.decode(audio_codes, audio_scales, padding_mask, return_dict=return_dict)[0]
+        if not return_dict:
+            return (audio_codes, audio_values)
+
+        return EncodecOutput(audio_codes=audio_codes, audio_values=audio_values)
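
Complementing the `forward` doctest above, a hedged sketch of the explicit encode/decode path, assuming the same dataset and checkpoint and that 6.0 kbps is among `config.target_bandwidths` (as it is for the 24 kHz defaults):

```python
from datasets import load_dataset
from transformers import AutoProcessor, EncodecModel

dataset = load_dataset("ashraq/esc50")
audio_sample = dataset["train"]["audio"][0]["array"]

model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
inputs = processor(raw_audio=audio_sample, sampling_rate=processor.sampling_rate, return_tensors="pt")

# Encode at a fixed 6 kbps target, then decode the codes back to a waveform.
encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"], bandwidth=6.0)
decoder_outputs = model.decode(encoder_outputs.audio_codes, encoder_outputs.audio_scales, inputs["padding_mask"])
print(decoder_outputs.audio_values.shape)  # (batch, channels, num_samples)
```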
diff --git a/src/transformers/models/encoder_decoder/configuration_encoder_decoder.py b/src/transformers/models/encoder_decoder/configuration_encoder_decoder.py
index 1fca8a10f78d80..8c0ae2771e81f1 100644
--- a/src/transformers/models/encoder_decoder/configuration_encoder_decoder.py
+++ b/src/transformers/models/encoder_decoder/configuration_encoder_decoder.py
@@ -14,7 +14,6 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-import copy
 
 from ...configuration_utils import PretrainedConfig
 from ...utils import logging
@@ -46,13 +45,13 @@ class EncoderDecoderConfig(PretrainedConfig):
     ```python
     >>> from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel
 
-    >>> # Initializing a BERT bert-base-uncased style configuration
+    >>> # Initializing a BERT google-bert/bert-base-uncased style configuration
     >>> config_encoder = BertConfig()
     >>> config_decoder = BertConfig()
 
     >>> config = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
 
-    >>> # Initializing a Bert2Bert model from the bert-base-uncased style configurations
+    >>> # Initializing a Bert2Bert model (with random weights) from the google-bert/bert-base-uncased style configurations
     >>> model = EncoderDecoderModel(config=config)
 
     >>> # Accessing the model configuration
@@ -69,6 +68,7 @@ class EncoderDecoderConfig(PretrainedConfig):
     >>> encoder_decoder_config = EncoderDecoderConfig.from_pretrained("my-model")
     >>> model = EncoderDecoderModel.from_pretrained("my-model", config=encoder_decoder_config)
     ```"""
+
     model_type = "encoder-decoder"
     is_composition = True
 
@@ -104,16 +104,3 @@ def from_encoder_decoder_configs(
         decoder_config.add_cross_attention = True
 
         return cls(encoder=encoder_config.to_dict(), decoder=decoder_config.to_dict(), **kwargs)
-
-    def to_dict(self):
-        """
-        Serializes this instance to a Python dictionary. Override the default *to_dict()* from *PretrainedConfig*.
-
-        Returns:
-            `Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,
-        """
-        output = copy.deepcopy(self.__dict__)
-        output["encoder"] = self.encoder.to_dict()
-        output["decoder"] = self.decoder.to_dict()
-        output["model_type"] = self.__class__.model_type
-        return output
diff --git a/src/transformers/models/encoder_decoder/modeling_encoder_decoder.py b/src/transformers/models/encoder_decoder/modeling_encoder_decoder.py
index b27b134ecf29dc..1a6adcee1f8386 100644
--- a/src/transformers/models/encoder_decoder/modeling_encoder_decoder.py
+++ b/src/transformers/models/encoder_decoder/modeling_encoder_decoder.py
@@ -16,6 +16,7 @@
 
 
 import gc
+import inspect
 import os
 import tempfile
 import warnings
@@ -173,6 +174,7 @@ class EncoderDecoderModel(PreTrainedModel):
     :meth*~transformers.AutoModel.from_pretrained* class method for the encoder and
     :meth*~transformers.AutoModelForCausalLM.from_pretrained* class method for the decoder.
     """
+
     config_class = EncoderDecoderConfig
     base_model_prefix = "encoder_decoder"
     main_input_name = "input_ids"
@@ -245,6 +247,13 @@ def __init__(
                 f"The encoder {self.encoder} should not have a LM Head. Please use a model without LM Head"
             )
 
+        decoder_signature = set(inspect.signature(self.decoder.forward).parameters.keys())
+        if "encoder_hidden_states" not in decoder_signature:
+            raise ValueError(
+                "The selected decoder is not prepared for the encoder hidden states to be passed. Please see the "
+                "following discussion on GitHub: https://github.com/huggingface/transformers/issues/23350"
+            )
+
         # tie encoder, decoder weights if config set accordingly
         self.tie_weights()
 
@@ -257,11 +266,6 @@ def tie_weights(self):
                 self.encoder, self.decoder._modules[decoder_base_model_prefix], self.decoder.base_model_prefix
             )
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        # call both encoder and decoder function on gradient checkpointing
-        self.encoder._set_gradient_checkpointing(module, value=value)
-        self.decoder._set_gradient_checkpointing(module, value=value)
-
     def get_encoder(self):
         return self.encoder
 
@@ -363,8 +367,8 @@ def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
                 model.config = config
 
                 if hasattr(model, "enc_to_dec_proj"):
-                    model.enc_to_dec_proj.weight.data = enc_to_dec_proj_weight
-                    model.enc_to_dec_proj.bias.data = enc_to_dec_proj_bias
+                    model.enc_to_dec_proj.weight.data = enc_to_dec_proj_weight.contiguous()
+                    model.enc_to_dec_proj.bias.data = enc_to_dec_proj_bias.contiguous()
 
                 return model
 
@@ -399,8 +403,6 @@ def from_encoder_decoder_pretrained(
                 Information necessary to initiate the encoder. Can be either:
 
                     - A string, the *model id* of a pretrained model hosted inside a model repo on huggingface.co.
-                      Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced under a
-                      user or organization name, like `dbmdz/bert-base-german-cased`.
                     - A path to a *directory* containing model weights saved using
                       [`~PreTrainedModel.save_pretrained`], e.g., `./my_model_directory/`.
                     - A path or url to a *tensorflow index checkpoint file* (e.g, `./tf_model/model.ckpt.index`). In
@@ -412,8 +414,6 @@ def from_encoder_decoder_pretrained(
                 Information necessary to initiate the decoder. Can be either:
 
                     - A string, the *model id* of a pretrained model hosted inside a model repo on huggingface.co.
-                      Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced under a
-                      user or organization name, like `dbmdz/bert-base-german-cased`.
                     - A path to a *directory* containing model weights saved using
                       [`~PreTrainedModel.save_pretrained`], e.g., `./my_model_directory/`.
                     - A path or url to a *tensorflow index checkpoint file* (e.g, `./tf_model/model.ckpt.index`). In
@@ -440,7 +440,7 @@ def from_encoder_decoder_pretrained(
         >>> from transformers import EncoderDecoderModel
 
         >>> # initialize a bert2bert from two pretrained BERT models. Note that the cross-attention layers will be randomly initialized
-        >>> model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
+        >>> model = EncoderDecoderModel.from_encoder_decoder_pretrained("google-bert/bert-base-uncased", "google-bert/bert-base-uncased")
         >>> # saving model after fine-tuning
         >>> model.save_pretrained("./bert2bert")
         >>> # load fine-tuned model
@@ -556,9 +556,9 @@ def forward(
         >>> from transformers import EncoderDecoderModel, BertTokenizer
         >>> import torch
 
-        >>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
+        >>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
         >>> model = EncoderDecoderModel.from_encoder_decoder_pretrained(
-        ...     "bert-base-uncased", "bert-base-uncased"
+        ...     "google-bert/bert-base-uncased", "google-bert/bert-base-uncased"
         ... )  # initialize Bert2Bert from pre-trained checkpoints
 
         >>> # training
@@ -612,6 +612,8 @@ def forward(
             decoder_input_ids = shift_tokens_right(
                 labels, self.config.pad_token_id, self.config.decoder_start_token_id
             )
+            if decoder_attention_mask is None:
+                decoder_attention_mask = decoder_input_ids.new_tensor(decoder_input_ids != self.config.pad_token_id)
 
         # Decode
         decoder_outputs = self.decoder(
diff --git a/src/transformers/models/encoder_decoder/modeling_flax_encoder_decoder.py b/src/transformers/models/encoder_decoder/modeling_flax_encoder_decoder.py
index a500398d67a992..beecd080328e16 100644
--- a/src/transformers/models/encoder_decoder/modeling_flax_encoder_decoder.py
+++ b/src/transformers/models/encoder_decoder/modeling_flax_encoder_decoder.py
@@ -306,6 +306,7 @@ class FlaxEncoderDecoderModel(FlaxPreTrainedModel):
     decoder module when created with the :meth*~transformers.FlaxAutoModel.from_pretrained* class method for the
     encoder and :meth*~transformers.FlaxAutoModelForCausalLM.from_pretrained* class method for the decoder.
     """
+
     config_class = EncoderDecoderConfig
     base_model_prefix = "encoder_decoder"
     module_class = FlaxEncoderDecoderModule
@@ -448,9 +449,9 @@ def encode(
         >>> from transformers import FlaxEncoderDecoderModel, BertTokenizer
 
         >>> # initialize a bert2gpt2 from pretrained BERT and GPT2 models. Note that the cross-attention layers will be randomly initialized
-        >>> model = FlaxEncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-cased", "gpt2")
+        >>> model = FlaxEncoderDecoderModel.from_encoder_decoder_pretrained("google-bert/bert-base-cased", "openai-community/gpt2")
 
-        >>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
+        >>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-cased")
 
         >>> text = "My friends are cool but they eat too many carbs."
         >>> input_ids = tokenizer.encode(text, return_tensors="np")
@@ -526,9 +527,9 @@ def decode(
         >>> import jax.numpy as jnp
 
         >>> # initialize a bert2gpt2 from pretrained BERT and GPT2 models. Note that the cross-attention layers will be randomly initialized
-        >>> model = FlaxEncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-cased", "gpt2")
+        >>> model = FlaxEncoderDecoderModel.from_encoder_decoder_pretrained("google-bert/bert-base-cased", "openai-community/gpt2")
 
-        >>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
+        >>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-cased")
 
         >>> text = "My friends are cool but they eat too many carbs."
         >>> input_ids = tokenizer.encode(text, max_length=1024, return_tensors="np")
@@ -652,8 +653,8 @@ def __call__(
         >>> # load a fine-tuned bert2gpt2 model
         >>> model = FlaxEncoderDecoderModel.from_pretrained("patrickvonplaten/bert2gpt2-cnn_dailymail-fp16")
         >>> # load input & output tokenizer
-        >>> tokenizer_input = BertTokenizer.from_pretrained("bert-base-cased")
-        >>> tokenizer_output = GPT2Tokenizer.from_pretrained("gpt2")
+        >>> tokenizer_input = BertTokenizer.from_pretrained("google-bert/bert-base-cased")
+        >>> tokenizer_output = GPT2Tokenizer.from_pretrained("openai-community/gpt2")
 
         >>> article = '''Sigma Alpha Epsilon is under fire for a video showing party-bound fraternity members
         >>> singing a racist chant. SAE's national chapter suspended the students,
@@ -722,8 +723,8 @@ def prepare_inputs_for_generation(
         self,
         decoder_input_ids,
         max_length,
-        attention_mask: Optional[jnp.DeviceArray] = None,
-        decoder_attention_mask: Optional[jnp.DeviceArray] = None,
+        attention_mask: Optional[jax.Array] = None,
+        decoder_attention_mask: Optional[jax.Array] = None,
         encoder_outputs=None,
         **kwargs,
     ):
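
The annotation change from `jnp.DeviceArray` to `jax.Array` tracks JAX's consolidation onto a single public array type; `jnp.DeviceArray` no longer exists in newer JAX releases. A minimal check, assuming a reasonably recent JAX install:

```
import jax
import jax.numpy as jnp

x = jnp.zeros((2, 3))
# jax.Array is the unified array type that replaced jnp.DeviceArray
print(isinstance(x, jax.Array))  # True on current JAX versions
```
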
@@ -773,8 +774,6 @@ def from_encoder_decoder_pretrained(
                 Information necessary to initiate the encoder. Can be either:
 
                     - A string, the *model id* of a pretrained model hosted inside a model repo on huggingface.co.
-                      Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced under a
-                      user or organization name, like `dbmdz/bert-base-german-cased`.
                     - A path to a *directory* containing model weights saved using
                       [`~FlaxPreTrainedModel.save_pretrained`], e.g., `./my_model_directory/`.
 
@@ -782,8 +781,6 @@ def from_encoder_decoder_pretrained(
                 Information necessary to initiate the decoder. Can be either:
 
                     - A string, the *model id* of a pretrained model hosted inside a model repo on huggingface.co.
-                      Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced under a
-                      user or organization name, like `dbmdz/bert-base-german-cased`.
                     - A path to a *directory* containing model weights saved using
                       [`~FlaxPreTrainedModel.save_pretrained`], e.g., `./my_model_directory/`.
 
@@ -806,7 +803,7 @@ def from_encoder_decoder_pretrained(
         >>> from transformers import FlaxEncoderDecoderModel
 
         >>> # initialize a bert2gpt2 from pretrained BERT and GPT2 models. Note that the cross-attention layers will be randomly initialized
-        >>> model = FlaxEncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-cased", "gpt2")
+        >>> model = FlaxEncoderDecoderModel.from_encoder_decoder_pretrained("google-bert/bert-base-cased", "openai-community/gpt2")
         >>> # saving model after fine-tuning
         >>> model.save_pretrained("./bert2gpt2")
         >>> # load fine-tuned model
diff --git a/src/transformers/models/encoder_decoder/modeling_tf_encoder_decoder.py b/src/transformers/models/encoder_decoder/modeling_tf_encoder_decoder.py
index 5b97d1a4c949ee..855fb767d13d73 100644
--- a/src/transformers/models/encoder_decoder/modeling_tf_encoder_decoder.py
+++ b/src/transformers/models/encoder_decoder/modeling_tf_encoder_decoder.py
@@ -15,20 +15,28 @@
 """ Classes to support TF Encoder-Decoder architectures"""
 
 
-import gc
-import os
-import tempfile
+from __future__ import annotations
+
+import inspect
+import re
 import warnings
-from typing import Optional
+from typing import Optional, Tuple, Union
 
+import numpy as np
 import tensorflow as tf
 
 from ...configuration_utils import PretrainedConfig
 from ...modeling_tf_outputs import TFBaseModelOutput, TFSeq2SeqLMOutput
-from ...modeling_tf_utils import TFCausalLanguageModelingLoss, TFPreTrainedModel, get_initializer, unpack_inputs
+from ...modeling_tf_utils import (
+    TFCausalLanguageModelingLoss,
+    TFModelInputType,
+    TFPreTrainedModel,
+    get_initializer,
+    keras,
+    unpack_inputs,
+)
 from ...tf_utils import shape_list
 from ...utils import (
-    DUMMY_INPUTS,
     ModelOutput,
     add_start_docstrings,
     add_start_docstrings_to_model_forward,
@@ -70,7 +78,7 @@
     library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
     etc.)
 
-    This model is also a [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
+    This model is also a [keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) subclass. Use it
     as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matter related to general usage and
     behavior.
 
@@ -190,6 +198,7 @@ class TFEncoderDecoderModel(TFPreTrainedModel, TFCausalLanguageModelingLoss):
     [`~TFAutoModel.from_pretrained`] class method for the encoder and [`~TFAutoModelForCausalLM.from_pretrained`] class
     method for the decoder.
     """
+
     config_class = EncoderDecoderConfig
     base_model_prefix = "encoder_decoder"
     load_weight_prefix = "tf_encoder_decoder_model"
@@ -250,7 +259,7 @@ def __init__(
             self.encoder.config.hidden_size != self.decoder.config.hidden_size
             and self.decoder.config.cross_attention_hidden_size is None
         ):
-            self.enc_to_dec_proj = tf.keras.layers.Dense(
+            self.enc_to_dec_proj = keras.layers.Dense(
                 units=self.decoder.config.hidden_size,
                 kernel_initializer=get_initializer(config.encoder.initializer_range),
                 name="enc_to_dec_proj",
@@ -261,18 +270,12 @@ def __init__(
                 f"The encoder {self.encoder} should not have a LM Head. Please use a model without LM Head"
             )
 
-    @property
-    def dummy_inputs(self):
-        """
-        Dummy inputs to build the network.
-
-        Returns:
-            `Dict[str, tf.Tensor]`: The dummy inputs.
-        """
-        # Add `decoder_input_ids` because `self.decoder` requires it.
-        input_ids = tf.constant(DUMMY_INPUTS, dtype=tf.int32)
-        dummy = {"input_ids": input_ids, "decoder_input_ids": input_ids}
-        return dummy
+        decoder_signature = set(inspect.signature(self.decoder.call).parameters.keys())
+        if "encoder_hidden_states" not in decoder_signature:
+            raise ValueError(
+                "The selected decoder is not prepared for the encoder hidden states to be passed. Please see the "
+                "following discussion on GitHub: https://github.com/huggingface/transformers/issues/23350"
+            )
 
     def get_encoder(self):
         return self.encoder
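
The constructor now checks up front, via `inspect.signature`, that the selected decoder can accept `encoder_hidden_states` instead of failing later inside `call`. A standalone sketch of the same pattern, with a toy callable standing in for the decoder:

```
import inspect

def toy_decoder_call(input_ids, attention_mask=None, encoder_hidden_states=None):
    """Hypothetical decoder signature used only for illustration."""

params = set(inspect.signature(toy_decoder_call).parameters.keys())
if "encoder_hidden_states" not in params:
    raise ValueError("The selected decoder cannot accept encoder hidden states.")
print(sorted(params))  # ['attention_mask', 'encoder_hidden_states', 'input_ids']
```
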
@@ -289,57 +292,22 @@ def get_output_embeddings(self):
     def set_output_embeddings(self, new_embeddings):
         return self.decoder.set_output_embeddings(new_embeddings)
 
-    @classmethod
-    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
-        r"""
-        Example:
-
-        ```python
-        >>> from transformers import TFEncoderDecoderModel
-
-        >>> model = TFEncoderDecoderModel.from_pretrained("ydshieh/bert2bert-cnn_dailymail-fp16")
-        ```"""
-
-        from_pt = kwargs.pop("from_pt", False)
-        if from_pt:
-            import torch
-
-            from transformers import EncoderDecoderModel
-
-            # a workaround to load from pytorch checkpoint
-            _model = EncoderDecoderModel.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
-            config = _model.config
-
-            with tempfile.TemporaryDirectory() as tmpdirname:
-                encoder_dir = os.path.join(tmpdirname, "encoder")
-                decoder_dir = os.path.join(tmpdirname, "decoder")
-                _model.encoder.save_pretrained(encoder_dir)
-                _model.decoder.save_pretrained(decoder_dir)
-
-                if hasattr(_model, "enc_to_dec_proj"):
-                    enc_to_dec_proj_kernel = tf.transpose(
-                        tf.constant(_model.enc_to_dec_proj.weight.detach().to("cpu").numpy()), perm=(1, 0)
-                    )
-                    enc_to_dec_proj_bias = tf.constant(_model.enc_to_dec_proj.bias.detach().to("cpu").numpy())
-
-                del _model
-                gc.collect()
-                torch.cuda.empty_cache()
-
-                model = TFEncoderDecoderModel.from_encoder_decoder_pretrained(
-                    encoder_dir, decoder_dir, encoder_from_pt=True, decoder_from_pt=True
-                )
-                # This is only for copying some specific attributes of this particular model.
-                model.config = config
-
-                if hasattr(model, "enc_to_dec_proj"):
-                    model(model.dummy_inputs)
-                    model.enc_to_dec_proj.kernel.assign(enc_to_dec_proj_kernel)
-                    model.enc_to_dec_proj.bias.assign(enc_to_dec_proj_bias)
-
-                return model
-
-        return super().from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
+    def tf_to_pt_weight_rename(self, tf_weight):
+        # Matt: The TF and PT weights don't align because our TF base classes have an extra layer compared to PT models
+        # (the main model stem is in the MainLayer class). If we remove that layer, then weight names sync up as normal.
+        # However, the name of that extra layer is the name of the MainLayer in the base model. We make the assumption
+        # here that the config model_type is the same as the name of the MainLayer. I don't know of anywhere that's
+        # not the case, and I wasn't sure how else to go from the config to the correct MainLayer name!
+
+        # This override is only needed in the case where we're crossloading weights from PT. However, since weights are
+        # often safetensors now, we don't know if we're going to be crossloading until we sniff the weights file.
+        # Therefore, we specify tf_to_pt_weight_rename anyway, and let the super method figure out if it needs it
+        # or not.
+        encoder_model_type = self.config.encoder.model_type
+        if "encoder" in tf_weight and "decoder" not in tf_weight:
+            return (re.sub(rf"encoder\.{encoder_model_type}\.", "encoder.", tf_weight),)
+        else:
+            return (tf_weight,)
 
     @classmethod
     def from_encoder_decoder_pretrained(
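
The regex in `tf_to_pt_weight_rename` strips the MainLayer segment, assumed to equal `config.encoder.model_type`, from encoder weight names so that TF and PT names line up when crossloading. A hypothetical weight name run through the same substitution:

```
import re

encoder_model_type = "bert"  # stands in for self.config.encoder.model_type
tf_weight = "encoder.bert.embeddings.word_embeddings.weight"  # hypothetical TF name

renamed = re.sub(rf"encoder\.{encoder_model_type}\.", "encoder.", tf_weight)
print(renamed)  # encoder.embeddings.word_embeddings.weight
```

The method above wraps the result in a one-element tuple rather than returning a bare string.
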
@@ -359,8 +327,6 @@ def from_encoder_decoder_pretrained(
                 Information necessary to initiate the encoder. Can be either:
 
                     - A string, the *model id* of a pretrained model hosted inside a model repo on huggingface.co.
-                      Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced under a
-                      user or organization name, like `dbmdz/bert-base-german-cased`.
                     - A path to a *directory* containing model weights saved using
                       [`~TFPreTrainedModel.save_pretrained`], e.g., `./my_model_directory/`.
                     - A path or url to a *pytorch index checkpoint file* (e.g, `./pt_model/`). In this case,
@@ -370,8 +336,6 @@ def from_encoder_decoder_pretrained(
                 Information necessary to initiate the decoder. Can be either:
 
                     - A string, the *model id* of a pretrained model hosted inside a model repo on huggingface.co.
-                      Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced under a
-                      user or organization name, like `dbmdz/bert-base-german-cased`.
                     - A path to a *directory* containing model weights saved using
                       [`~TFPreTrainedModel.save_pretrained`], e.g., `./my_model_directory/`.
                     - A path or url to a *pytorch checkpoint file* (e.g, `./pt_model/`). In this case,
@@ -396,7 +360,7 @@ def from_encoder_decoder_pretrained(
         >>> from transformers import TFEncoderDecoderModel
 
         >>> # initialize a bert2gpt2 from a pretrained BERT and a pretrained GPT2 model. Note that the cross-attention layers will be randomly initialized
-        >>> model = TFEncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "gpt2")
+        >>> model = TFEncoderDecoderModel.from_encoder_decoder_pretrained("google-bert/bert-base-uncased", "openai-community/gpt2")
         >>> # saving model after fine-tuning
         >>> model.save_pretrained("./bert2gpt2")
         >>> # load fine-tuned model
@@ -444,14 +408,6 @@ def from_encoder_decoder_pretrained(
             kwargs_encoder["load_weight_prefix"] = cls.load_weight_prefix
             encoder = TFAutoModel.from_pretrained(encoder_pretrained_model_name_or_path, *model_args, **kwargs_encoder)
 
-            # This is necessary to make `from_pretrained` following `save_pretrained` work correctly
-            if kwargs_encoder.get("from_pt", None):
-                del kwargs_encoder["from_pt"]
-                with tempfile.TemporaryDirectory() as tmp_dirname:
-                    encoder.save_pretrained(tmp_dirname)
-                    del encoder
-                    encoder = TFAutoModel.from_pretrained(tmp_dirname, *model_args, **kwargs_encoder)
-
         decoder = kwargs_decoder.pop("model", None)
         if decoder is None:
             if decoder_pretrained_model_name_or_path is None:
@@ -486,15 +442,7 @@ def from_encoder_decoder_pretrained(
             kwargs_decoder["load_weight_prefix"] = cls.load_weight_prefix
             decoder = TFAutoModelForCausalLM.from_pretrained(decoder_pretrained_model_name_or_path, **kwargs_decoder)
 
-            # This is necessary to make `from_pretrained` following `save_pretrained` work correctly
-            if kwargs_decoder.get("from_pt", None):
-                del kwargs_decoder["from_pt"]
-                with tempfile.TemporaryDirectory() as tmp_dirname:
-                    decoder.save_pretrained(tmp_dirname)
-                    del decoder
-                    decoder = TFAutoModelForCausalLM.from_pretrained(tmp_dirname, **kwargs_decoder)
-
-        # Make sure these 2 `tf.keras.Model` have fixed names so `from_pretrained` could load model weights correctly.
+        # Make sure these 2 `keras.Model` have fixed names so `from_pretrained` could load model weights correctly.
         if encoder.name != "encoder":
             raise ValueError("encoder model must be created with the name `encoder`.")
         if decoder.name != "decoder":
@@ -509,22 +457,22 @@ def from_encoder_decoder_pretrained(
     @replace_return_docstrings(output_type=TFSeq2SeqLMOutput, config_class=_CONFIG_FOR_DOC)
     def call(
         self,
-        input_ids=None,
-        attention_mask=None,
-        decoder_input_ids=None,
-        decoder_attention_mask=None,
-        encoder_outputs=None,
-        past_key_values=None,
-        inputs_embeds=None,
-        decoder_inputs_embeds=None,
-        labels=None,
-        use_cache=None,
-        output_attentions=None,
-        output_hidden_states=None,
-        return_dict=None,
-        training=False,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        decoder_input_ids: np.ndarray | tf.Tensor | None = None,
+        decoder_attention_mask: np.ndarray | tf.Tensor | None = None,
+        encoder_outputs: np.ndarray | tf.Tensor | None = None,
+        past_key_values: Tuple[Tuple[tf.Tensor]] | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
+        decoder_inputs_embeds: np.ndarray | tf.Tensor | None = None,
+        labels: np.ndarray | tf.Tensor | None = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        training: bool = False,
         **kwargs,
-    ):
+    ) -> Union[TFSeq2SeqLMOutput, Tuple[tf.Tensor]]:
         r"""
         Returns:
 
@@ -534,9 +482,9 @@ def call(
         >>> from transformers import TFEncoderDecoderModel, BertTokenizer
 
         >>> # initialize a bert2gpt2 from a pretrained BERT and GPT2 models. Note that the cross-attention layers will be randomly initialized
-        >>> model = TFEncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-cased", "gpt2")
+        >>> model = TFEncoderDecoderModel.from_encoder_decoder_pretrained("google-bert/bert-base-cased", "openai-community/gpt2")
 
-        >>> tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
+        >>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-cased")
 
         >>> # forward
         >>> input_ids = tokenizer.encode(
@@ -666,29 +614,6 @@ def call(
             encoder_attentions=encoder_outputs.attentions,
         )
 
-    def serving_output(self, output):
-        pkv = tf.tuple(output.past_key_values)[1] if self.config.use_cache else None
-        dec_hs = tf.convert_to_tensor(output.decoder_hidden_states) if self.config.output_hidden_states else None
-        dec_attns = tf.convert_to_tensor(output.decoder_attentions) if self.config.output_attentions else None
-        enc_hs = tf.convert_to_tensor(output.encoder_hidden_states) if self.config.output_hidden_states else None
-        enc_attns = tf.convert_to_tensor(output.encoder_attentions) if self.config.output_attentions else None
-        cross_attns = (
-            tf.convert_to_tensor(output.cross_attentions)
-            if self.config.output_attentions and output.cross_attentions is not None
-            else None
-        )
-
-        return TFSeq2SeqLMOutput(
-            logits=output.logits,
-            past_key_values=pkv,
-            decoder_hidden_states=dec_hs,
-            decoder_attentions=dec_attns,
-            encoder_last_hidden_state=output.encoder_last_hidden_state,
-            encoder_hidden_states=enc_hs,
-            encoder_attentions=enc_attns,
-            cross_attentions=cross_attns,
-        )
-
     def prepare_inputs_for_generation(
         self, input_ids, past_key_values=None, attention_mask=None, use_cache=None, encoder_outputs=None, **kwargs
     ):
@@ -718,3 +643,21 @@ def resize_token_embeddings(self, *args, **kwargs):
             " respective methods of the wrapped objects (model.encoder.resize_token_embeddings(...) or"
             " model.decoder.resize_token_embeddings(...))"
         )
+
+    def _reorder_cache(self, past, beam_idx):
+        # apply decoder cache reordering here
+        return self.decoder._reorder_cache(past, beam_idx)
+
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "enc_to_dec_proj", None) is not None:
+            with tf.name_scope(self.enc_to_dec_proj.name):
+                self.enc_to_dec_proj.build([None, None, self.encoder.config.hidden_size])
+        if getattr(self, "encoder", None) is not None:
+            with tf.name_scope(self.encoder.name):
+                self.encoder.build(None)
+        if getattr(self, "decoder", None) is not None:
+            with tf.name_scope(self.decoder.name):
+                self.decoder.build(None)
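
The new `build` method follows the explicit, name-scoped build pattern used throughout the TF port: each sublayer is built inside its own `tf.name_scope` so that weight names stay stable across `save_pretrained`/`from_pretrained`. A rough, self-contained sketch of the pattern (not the library's class; the hidden size of 32 is made up):

```
import tensorflow as tf
from tensorflow import keras

class TinyWrapper(keras.layers.Layer):
    def __init__(self, hidden_size=16, **kwargs):
        super().__init__(**kwargs)
        self.proj = keras.layers.Dense(hidden_size, name="proj")

    def build(self, input_shape=None):
        if self.built:
            return
        self.built = True
        if getattr(self, "proj", None) is not None:
            # Building inside the sublayer's own name scope keeps the variable
            # names ("proj/kernel", "proj/bias") independent of call order.
            with tf.name_scope(self.proj.name):
                self.proj.build([None, None, 32])  # assumed input feature size

layer = TinyWrapper()
layer.build()
print([w.name for w in layer.weights])
```
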
diff --git a/src/transformers/models/ernie/configuration_ernie.py b/src/transformers/models/ernie/configuration_ernie.py
index 91253ab1384bcc..7278a74eced517 100644
--- a/src/transformers/models/ernie/configuration_ernie.py
+++ b/src/transformers/models/ernie/configuration_ernie.py
@@ -81,14 +81,14 @@ class ErnieConfig(PretrainedConfig):
             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
         layer_norm_eps (`float`, *optional*, defaults to 1e-12):
             The epsilon used by the layer normalization layers.
+        pad_token_id (`int`, *optional*, defaults to 0):
+            Padding token id.
         position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
             Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For
             positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to
             [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).
             For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models
             with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
-        is_decoder (`bool`, *optional*, defaults to `False`):
-            Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.
         use_cache (`bool`, *optional*, defaults to `True`):
             Whether or not the model should return the last key/values attentions (not used by all models). Only
             relevant if `config.is_decoder=True`.
@@ -109,6 +109,7 @@ class ErnieConfig(PretrainedConfig):
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "ernie"
 
     def __init__(
diff --git a/src/transformers/models/ernie/modeling_ernie.py b/src/transformers/models/ernie/modeling_ernie.py
index 8f178d64a9ee61..291ab6c54d1e50 100644
--- a/src/transformers/models/ernie/modeling_ernie.py
+++ b/src/transformers/models/ernie/modeling_ernie.py
@@ -89,7 +89,9 @@ def __init__(self, config):
         self.dropout = nn.Dropout(config.hidden_dropout_prob)
         # position_ids (1, len position emb) is contiguous in memory and exported when serialized
         self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
-        self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
+        self.register_buffer(
+            "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False
+        )
         self.register_buffer(
             "token_type_ids", torch.zeros(self.position_ids.size(), dtype=torch.long), persistent=False
         )
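
Registering `position_ids` with `persistent=False` keeps the buffer out of the `state_dict`, which is consistent with the removal of the `position_ids` entries from `_keys_to_ignore_on_load_missing` elsewhere in this diff. A minimal demonstration:

```
import torch
from torch import nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        # Non-persistent buffers are recreated in __init__ and never serialized.
        self.register_buffer("position_ids", torch.arange(8).expand((1, -1)), persistent=False)

print("position_ids" in M().state_dict())  # False
```
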
@@ -488,6 +490,13 @@ def forward(
         all_self_attentions = () if output_attentions else None
         all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None
 
+        if self.gradient_checkpointing and self.training:
+            if use_cache:
+                logger.warning_once(
+                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+                )
+                use_cache = False
+
         next_decoder_cache = () if use_cache else None
         for i, layer_module in enumerate(self.layer):
             if output_hidden_states:
@@ -497,25 +506,15 @@ def forward(
             past_key_value = past_key_values[i] if past_key_values is not None else None
 
             if self.gradient_checkpointing and self.training:
-                if use_cache:
-                    logger.warning(
-                        "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
-                    )
-                    use_cache = False
-
-                def create_custom_forward(module):
-                    def custom_forward(*inputs):
-                        return module(*inputs, past_key_value, output_attentions)
-
-                    return custom_forward
-
-                layer_outputs = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(layer_module),
+                layer_outputs = self._gradient_checkpointing_func(
+                    layer_module.__call__,
                     hidden_states,
                     attention_mask,
                     layer_head_mask,
                     encoder_hidden_states,
                     encoder_attention_mask,
+                    past_key_value,
+                    output_attentions,
                 )
             else:
                 layer_outputs = layer_module(
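
The checkpointing call now passes the layer callable and its arguments straight to `self._gradient_checkpointing_func` instead of wrapping them in a closure for `torch.utils.checkpoint.checkpoint`. A hedged, standalone sketch of the same calling convention using the public `checkpoint` API directly (the `use_reentrant` keyword assumes a reasonably recent PyTorch):

```
import torch
from torch.utils.checkpoint import checkpoint

def layer(hidden_states, attention_mask):
    # stand-in for a transformer layer's __call__
    return hidden_states * attention_mask

hidden_states = torch.randn(2, 4, 8, requires_grad=True)
attention_mask = torch.ones(2, 4, 1)

# Callable first, then its positional arguments; no closure needed.
out = checkpoint(layer, hidden_states, attention_mask, use_reentrant=False)
out.sum().backward()
print(hidden_states.grad.shape)  # torch.Size([2, 4, 8])
```
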
@@ -659,7 +658,6 @@ class ErniePreTrainedModel(PreTrainedModel):
     config_class = ErnieConfig
     base_model_prefix = "ernie"
     supports_gradient_checkpointing = True
-    _keys_to_ignore_on_load_missing = [r"position_ids"]
 
     def _init_weights(self, module):
         """Initialize the weights"""
@@ -677,10 +675,6 @@ def _init_weights(self, module):
             module.bias.data.zero_()
             module.weight.data.fill_(1.0)
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, ErnieEncoder):
-            module.gradient_checkpointing = value
-
 
 @dataclass
 # Copied from transformers.models.bert.modeling_bert.BertForPreTrainingOutput with Bert->Ernie
@@ -892,6 +886,7 @@ def forward(
         if input_ids is not None and inputs_embeds is not None:
             raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
         elif input_ids is not None:
+            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
             input_shape = input_ids.size()
         elif inputs_embeds is not None:
             input_shape = inputs_embeds.size()[:-1]
@@ -981,7 +976,7 @@ def forward(
     ERNIE_START_DOCSTRING,
 )
 class ErnieForPreTraining(ErniePreTrainedModel):
-    _keys_to_ignore_on_load_missing = [r"cls.predictions.decoder.bias", "cls.predictions.decoder.weight"]
+    _tied_weights_keys = ["cls.predictions.decoder.bias", "cls.predictions.decoder.weight"]
 
     # Copied from transformers.models.bert.modeling_bert.BertForPreTraining.__init__ with Bert->Ernie,bert->ernie
     def __init__(self, config):
@@ -1092,8 +1087,7 @@ def forward(
     """Ernie Model with a `language modeling` head on top for CLM fine-tuning.""", ERNIE_START_DOCSTRING
 )
 class ErnieForCausalLM(ErniePreTrainedModel):
-    _keys_to_ignore_on_load_unexpected = [r"pooler"]
-    _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias", "cls.predictions.decoder.weight"]
+    _tied_weights_keys = ["cls.predictions.decoder.bias", "cls.predictions.decoder.weight"]
 
     # Copied from transformers.models.bert.modeling_bert.BertLMHeadModel.__init__ with BertLMHeadModel->ErnieForCausalLM,Bert->Ernie,bert->ernie
     def __init__(self, config):
@@ -1220,7 +1214,16 @@ def prepare_inputs_for_generation(
 
         # cut decoder_input_ids if past_key_values is used
         if past_key_values is not None:
-            input_ids = input_ids[:, -1:]
+            past_length = past_key_values[0][0].shape[2]
+
+            # Some generation methods already pass only the last input ID
+            if input_ids.shape[1] > past_length:
+                remove_prefix_length = past_length
+            else:
+                # Default to old behavior: keep only final ID
+                remove_prefix_length = input_ids.shape[1] - 1
+
+            input_ids = input_ids[:, remove_prefix_length:]
 
         return {
             "input_ids": input_ids,
@@ -1233,14 +1236,15 @@ def prepare_inputs_for_generation(
     def _reorder_cache(self, past_key_values, beam_idx):
         reordered_past = ()
         for layer_past in past_key_values:
-            reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),)
+            reordered_past += (
+                tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
+            )
         return reordered_past
 
 
 @add_start_docstrings("""Ernie Model with a `language modeling` head on top.""", ERNIE_START_DOCSTRING)
 class ErnieForMaskedLM(ErniePreTrainedModel):
-    _keys_to_ignore_on_load_unexpected = [r"pooler"]
-    _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias", "cls.predictions.decoder.weight"]
+    _tied_weights_keys = ["cls.predictions.decoder.bias", "cls.predictions.decoder.weight"]
 
     # Copied from transformers.models.bert.modeling_bert.BertForMaskedLM.__init__ with Bert->Ernie,bert->ernie
     def __init__(self, config):
@@ -1660,8 +1664,6 @@ def forward(
     ERNIE_START_DOCSTRING,
 )
 class ErnieForTokenClassification(ErniePreTrainedModel):
-    _keys_to_ignore_on_load_unexpected = [r"pooler"]
-
     # Copied from transformers.models.bert.modeling_bert.BertForTokenClassification.__init__ with Bert->Ernie,bert->ernie
     def __init__(self, config):
         super().__init__(config)
@@ -1741,8 +1743,6 @@ def forward(
     ERNIE_START_DOCSTRING,
 )
 class ErnieForQuestionAnswering(ErniePreTrainedModel):
-    _keys_to_ignore_on_load_unexpected = [r"pooler"]
-
     # Copied from transformers.models.bert.modeling_bert.BertForQuestionAnswering.__init__ with Bert->Ernie,bert->ernie
     def __init__(self, config):
         super().__init__(config)
diff --git a/src/transformers/models/ernie_m/configuration_ernie_m.py b/src/transformers/models/ernie_m/configuration_ernie_m.py
index d23d616b81907a..85917dc8288deb 100644
--- a/src/transformers/models/ernie_m/configuration_ernie_m.py
+++ b/src/transformers/models/ernie_m/configuration_ernie_m.py
@@ -61,23 +61,25 @@ class ErnieMConfig(PretrainedConfig):
             The dropout probability for all fully connected layers in the embeddings and encoder.
         attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
             The dropout probability used in `MultiHeadAttention` in all encoder layers to drop some attention target.
-        act_dropout (`float`, *optional*, defaults to 0.0):
-            This dropout probability is used in `ErnieMEncoderLayer` after activation.
-        max_position_embeddings (`int`, *optional*, defaults to 512):
+        max_position_embeddings (`int`, *optional*, defaults to 514):
             The maximum value of the dimensionality of position encoding, which dictates the maximum supported length
             of an input sequence.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the normal initializer for initializing all weight matrices.
+        pad_token_id (`int`, *optional*, defaults to 1):
+            Padding token id.
         layer_norm_eps (`float`, *optional*, defaults to 1e-05):
             The epsilon used by the layer normalization layers.
         classifier_dropout (`float`, *optional*):
             The dropout ratio for the classification head.
-        initializer_range (`float`, *optional*, defaults to 0.02):
-            The standard deviation of the normal initializer for initializing all weight matrices.
-        pad_token_id(`int`, *optional*, defaults to 1):
-            The index of padding token in the token vocabulary.
+        act_dropout (`float`, *optional*, defaults to 0.0):
+            This dropout probability is used in `ErnieMEncoderLayer` after activation.
 
     A normal_initializer initializes weight matrices as normal distributions. See
     `ErnieMPretrainedModel._init_weights()` for how weights are initialized in `ErnieMModel`.
     """
+
     model_type = "ernie_m"
     attribute_map: Dict[str, str] = {"dropout": "classifier_dropout", "num_classes": "num_labels"}
 
@@ -96,7 +98,6 @@ def __init__(
         pad_token_id: int = 1,
         layer_norm_eps: float = 1e-05,
         classifier_dropout=None,
-        is_decoder=False,
         act_dropout=0.0,
         **kwargs,
     ):
@@ -113,5 +114,4 @@ def __init__(
         self.initializer_range = initializer_range
         self.layer_norm_eps = layer_norm_eps
         self.classifier_dropout = classifier_dropout
-        self.is_decoder = is_decoder
         self.act_dropout = act_dropout
diff --git a/src/transformers/models/ernie_m/modeling_ernie_m.py b/src/transformers/models/ernie_m/modeling_ernie_m.py
index eea3e6232f90c3..c1be3cfba142a1 100755
--- a/src/transformers/models/ernie_m/modeling_ernie_m.py
+++ b/src/transformers/models/ernie_m/modeling_ernie_m.py
@@ -77,7 +77,7 @@ def forward(
             inputs_embeds = self.word_embeddings(input_ids)
         if position_ids is None:
             input_shape = inputs_embeds.size()[:-1]
-            ones = torch.ones(input_shape, dtype=torch.int64)
+            ones = torch.ones(input_shape, dtype=torch.int64, device=inputs_embeds.device)
             seq_length = torch.cumsum(ones, dim=1)
             position_ids = seq_length - ones
 
@@ -85,7 +85,6 @@ def forward(
                 position_ids = position_ids + past_key_values_length
         # to mimic paddlenlp implementation
         position_ids += 2
-        position_ids = position_ids.to(next(self.position_embeddings.parameters()).device)
         position_embeddings = self.position_embeddings(position_ids)
         embeddings = inputs_embeds + position_embeddings
         embeddings = self.layer_norm(embeddings)
@@ -412,8 +411,6 @@ class ErnieMPreTrainedModel(PreTrainedModel):
 
     config_class = ErnieMConfig
     base_model_prefix = "ernie_m"
-    supports_gradient_checkpointing = True
-    _keys_to_ignore_on_load_missing = [r"position_ids"]
 
     def _init_weights(self, module):
         """Initialize the weights"""
@@ -431,10 +428,6 @@ def _init_weights(self, module):
             module.bias.data.zero_()
             module.weight.data.fill_(1.0)
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, ErnieMEncoder):
-            module.gradient_checkpointing = value
-
 
 ERNIE_M_START_DOCSTRING = r"""
 
@@ -540,7 +533,7 @@ def forward(
         output_hidden_states: Optional[bool] = None,
         output_attentions: Optional[bool] = None,
         return_dict: Optional[bool] = None,
-    ):
+    ) -> Union[Tuple[torch.FloatTensor], BaseModelOutputWithPoolingAndCrossAttentions]:
         if input_ids is not None and inputs_embeds is not None:
             raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time.")
 
@@ -649,7 +642,7 @@ def forward(
         output_attentions: Optional[bool] = None,
         return_dict: Optional[bool] = True,
         labels: Optional[torch.Tensor] = None,
-    ):
+    ) -> Union[Tuple[torch.FloatTensor], SequenceClassifierOutput]:
         r"""
         labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
             Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
@@ -746,7 +739,7 @@ def forward(
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = True,
-    ):
+    ) -> Union[Tuple[torch.FloatTensor], MultipleChoiceModelOutput]:
         r"""
         labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
             Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,
@@ -839,7 +832,7 @@ def forward(
         output_attentions: Optional[bool] = None,
         return_dict: Optional[bool] = True,
         labels: Optional[torch.Tensor] = None,
-    ):
+    ) -> Union[Tuple[torch.FloatTensor], TokenClassifierOutput]:
         r"""
         labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
             Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
@@ -916,7 +909,7 @@ def forward(
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = True,
-    ):
+    ) -> Union[Tuple[torch.FloatTensor], QuestionAnsweringModelOutput]:
         r"""
         start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
             Labels for position (index) of the start of the labelled span for computing the token classification loss.
@@ -1005,7 +998,7 @@ def forward(
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = True,
-    ):
+    ) -> Union[Tuple[torch.FloatTensor], QuestionAnsweringModelOutput]:
         r"""
         start_positions (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
             Labels for position (index) for computing the start_positions loss. Position outside of the sequence are
diff --git a/src/transformers/models/ernie_m/tokenization_ernie_m.py b/src/transformers/models/ernie_m/tokenization_ernie_m.py
index e56451dd200ae7..b1b8cc845024c8 100644
--- a/src/transformers/models/ernie_m/tokenization_ernie_m.py
+++ b/src/transformers/models/ernie_m/tokenization_ernie_m.py
@@ -112,6 +112,19 @@ def __init__(
         # is included in the raw text, there should be a match in a non-normalized sentence.
 
         self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
+
+        self.do_lower_case = do_lower_case
+        self.sentencepiece_model_ckpt = sentencepiece_model_ckpt
+        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
+        self.sp_model.Load(sentencepiece_model_ckpt)
+
+        # to mimic paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer functioning
+        if vocab_file is not None:
+            self.vocab = self.load_vocab(filepath=vocab_file)
+        else:
+            self.vocab = {self.sp_model.id_to_piece(id): id for id in range(self.sp_model.get_piece_size())}
+        self.reverse_vocab = {v: k for k, v in self.vocab.items()}
+
         super().__init__(
             do_lower_case=do_lower_case,
             unk_token=unk_token,
@@ -124,17 +137,6 @@ def __init__(
             sp_model_kwargs=self.sp_model_kwargs,
             **kwargs,
         )
-        self.do_lower_case = do_lower_case
-        self.sentencepiece_model_ckpt = sentencepiece_model_ckpt
-        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
-        self.sp_model.Load(sentencepiece_model_ckpt)
-
-        # to mimic paddlenlp.transformers.ernie_m.tokenizer.ErnieMTokenizer functioning
-        if vocab_file is not None:
-            self.vocab = self.load_vocab(filepath=vocab_file)
-        else:
-            self.vocab = dict((self.sp_model.id_to_piece(id), id) for id in range(self.sp_model.get_piece_size()))
-        self.reverse_vocab = dict((v, k) for k, v in self.vocab.items())
 
     def get_offset_mapping(self, text):
         if text is None:
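
Moving the attribute assignments above `super().__init__()` matters because the base tokenizer's constructor may call back into methods that depend on them (for example when resolving special tokens). A generic sketch of that ordering constraint, with toy classes standing in for the tokenizer hierarchy:

```
class Base:
    def __init__(self):
        # The base constructor calls a hook that the subclass implements.
        self.size = self.vocab_size()

class Child(Base):
    def __init__(self, vocab):
        self.vocab = vocab  # must exist before Base.__init__ runs
        super().__init__()

    def vocab_size(self):
        return len(self.vocab)

print(Child({"a": 0, "b": 1}).size)  # 2
```
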
@@ -325,7 +327,7 @@ def get_special_tokens_mask(self, token_ids_0, token_ids_1=None, already_has_spe
                     "You should not supply a second sequence if the provided sequence of "
                     "ids is already formatted with special tokens for the model."
                 )
-            return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))
+            return [1 if x in [self.sep_token_id, self.cls_token_id] else 0 for x in token_ids_0]
 
         if token_ids_1 is not None:
             return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]
diff --git a/src/transformers/models/esm/configuration_esm.py b/src/transformers/models/esm/configuration_esm.py
index e51c5d01f1558c..75f8609ab0ffbd 100644
--- a/src/transformers/models/esm/configuration_esm.py
+++ b/src/transformers/models/esm/configuration_esm.py
@@ -97,6 +97,7 @@ class EsmConfig(PretrainedConfig):
 
     >>> # Accessing the model configuration
     >>> configuration = model.config
     ```"""
+
     model_type = "esm"
 
     def __init__(
diff --git a/src/transformers/models/esm/convert_esm.py b/src/transformers/models/esm/convert_esm.py
index 6ac74d40e6f957..22ca3f5392c19d 100644
--- a/src/transformers/models/esm/convert_esm.py
+++ b/src/transformers/models/esm/convert_esm.py
@@ -378,8 +378,8 @@ def convert_esm_checkpoint_to_pytorch(
     hf_tokenizer.save_pretrained(pytorch_dump_folder_path)
 
     if push_to_repo:
-        model.push_to_hub(repo_id=push_to_repo, use_auth_token=auth_token)
-        hf_tokenizer.push_to_hub(repo_id=push_to_repo, use_auth_token=auth_token)
+        model.push_to_hub(repo_id=push_to_repo, token=auth_token)
+        hf_tokenizer.push_to_hub(repo_id=push_to_repo, token=auth_token)
 
 
 if __name__ == "__main__":
diff --git a/src/transformers/models/esm/modeling_esm.py b/src/transformers/models/esm/modeling_esm.py
index 56544f4ca62bc4..57c436224099cc 100755
--- a/src/transformers/models/esm/modeling_esm.py
+++ b/src/transformers/models/esm/modeling_esm.py
@@ -94,7 +94,7 @@ class RotaryEmbedding(torch.nn.Module):
     def __init__(self, dim: int):
         super().__init__()
         # Generate and save the inverse frequency buffer (non trainable)
-        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
+        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, dtype=torch.int64).float() / dim))
         inv_freq = inv_freq
         self.register_buffer("inv_freq", inv_freq)
 
@@ -178,7 +178,9 @@ def __init__(self, config):
         self.dropout = nn.Dropout(config.hidden_dropout_prob)
         # position_ids (1, len position emb) is contiguous in memory and exported when serialized
         self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
-        self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
+        self.register_buffer(
+            "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False
+        )
 
         self.padding_idx = config.pad_token_id
         self.position_embeddings = nn.Embedding(
@@ -212,7 +214,7 @@ def forward(
         # This is analogous to the way that dropout layers scale down outputs during evaluation when not
         # actually dropping out values (or, equivalently, scale up their un-dropped outputs in training).
         if self.token_dropout:
-            embeddings.masked_fill_((input_ids == self.mask_token_id).unsqueeze(-1), 0.0)
+            embeddings = embeddings.masked_fill((input_ids == self.mask_token_id).unsqueeze(-1), 0.0)
             mask_ratio_train = 0.15 * 0.8  # Hardcoded as the ratio used in all ESM model training runs
             src_lengths = attention_mask.sum(-1)
             mask_ratio_observed = (input_ids == self.mask_token_id).sum(-1).float() / src_lengths
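
The in-place `masked_fill_` is replaced by its out-of-place counterpart, so the embedding tensor itself is no longer mutated (the motivation given here, avoiding edits to a tensor that may be shared or produced under inference-only contexts, is an assumption). A small contrast:

```
import torch

embeddings = torch.randn(1, 4, 8)
mask = torch.tensor([[False, True, False, False]])  # hypothetical mask-token positions

zeroed = embeddings.masked_fill(mask.unsqueeze(-1), 0.0)  # out-of-place copy
print(torch.equal(embeddings, zeroed))  # False: the original is left intact
```
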
@@ -222,7 +224,7 @@ def forward(
 
         if self.position_embedding_type == "absolute":
             position_embeddings = self.position_embeddings(position_ids)
-            embeddings += position_embeddings
+            embeddings = embeddings + position_embeddings
 
         if self.layer_norm is not None:
             embeddings = self.layer_norm(embeddings)
@@ -397,7 +399,7 @@ def __init__(self, config):
     def forward(self, hidden_states, input_tensor):
         hidden_states = self.dense(hidden_states)
         hidden_states = self.dropout(hidden_states)
-        hidden_states += input_tensor
+        hidden_states = hidden_states + input_tensor
         return hidden_states
 
 
@@ -472,7 +474,7 @@ def __init__(self, config):
     def forward(self, hidden_states, input_tensor):
         hidden_states = self.dense(hidden_states)
         hidden_states = self.dropout(hidden_states)
-        hidden_states += input_tensor
+        hidden_states = hidden_states + input_tensor
         return hidden_states
 
 
@@ -583,6 +585,13 @@ def forward(
         output_hidden_states=False,
         return_dict=True,
     ):
+        if self.gradient_checkpointing and self.training:
+            if use_cache:
+                logger.warning_once(
+                    "`use_cache=True` is incompatible with `config.gradient_checkpointing=True`. Setting "
+                    "`use_cache=False`..."
+                )
+                use_cache = False
         all_hidden_states = () if output_hidden_states else None
         all_self_attentions = () if output_attentions else None
         all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None
@@ -596,26 +605,15 @@ def forward(
             past_key_value = past_key_values[i] if past_key_values is not None else None
 
             if self.gradient_checkpointing and self.training:
-                if use_cache:
-                    logger.warning(
-                        "`use_cache=True` is incompatible with `config.gradient_checkpointing=True`. Setting "
-                        "`use_cache=False`..."
-                    )
-                    use_cache = False
-
-                def create_custom_forward(module):
-                    def custom_forward(*inputs):
-                        return module(*inputs, past_key_value, output_attentions)
-
-                    return custom_forward
-
-                layer_outputs = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(layer_module),
+                layer_outputs = self._gradient_checkpointing_func(
+                    layer_module.__call__,
                     hidden_states,
                     attention_mask,
                     layer_head_mask,
                     encoder_hidden_states,
                     encoder_attention_mask,
+                    past_key_value,
+                    output_attentions,
                 )
             else:
                 layer_outputs = layer_module(
@@ -630,7 +628,7 @@ def custom_forward(*inputs):
 
             hidden_states = layer_outputs[0]
             if use_cache:
-                next_decoder_cache += (layer_outputs[-1],)
+                next_decoder_cache = next_decoder_cache + (layer_outputs[-1],)
             if output_attentions:
                 all_self_attentions = all_self_attentions + (layer_outputs[1],)
                 if self.config.add_cross_attention:
@@ -687,7 +685,8 @@ class EsmPreTrainedModel(PreTrainedModel):
 
     config_class = EsmConfig
     base_model_prefix = "esm"
-    _no_split_modules = ["EsmLayer", "EsmFoldTriangularSelfAttentionBlock"]
+    supports_gradient_checkpointing = True
+    _no_split_modules = ["EsmLayer", "EsmFoldTriangularSelfAttentionBlock", "EsmEmbeddings"]
 
     # Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights
     def _init_weights(self, module):
@@ -782,9 +781,6 @@ class EsmModel(EsmPreTrainedModel):
     `add_cross_attention` set to `True`; an `encoder_hidden_states` is then expected as an input to the forward pass.
     """
 
-    _keys_to_ignore_on_load_missing = [r"position_ids"]
-    supports_gradient_checkpointing = False
-
     def __init__(self, config, add_pooling_layer=True):
         super().__init__(config)
         self.config = config
@@ -801,10 +797,6 @@ def __init__(self, config, add_pooling_layer=True):
         # Initialize weights and apply final processing
         self.post_init()
 
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, EsmEncoder):
-            module.gradient_checkpointing = value
-
     def get_input_embeddings(self):
         return self.embeddings.word_embeddings
 
@@ -874,6 +866,7 @@ def forward(
         if input_ids is not None and inputs_embeds is not None:
             raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
         elif input_ids is not None:
+            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
             input_shape = input_ids.size()
         elif inputs_embeds is not None:
             input_shape = inputs_embeds.size()[:-1]
@@ -959,8 +952,7 @@ def predict_contacts(self, tokens, attention_mask):
 
 @add_start_docstrings("""ESM Model with a `language modeling` head on top.""", ESM_START_DOCSTRING)
 class EsmForMaskedLM(EsmPreTrainedModel):
-    _keys_to_ignore_on_load_missing = [r"position_ids", "lm_head.decoder.weight"]
-    _keys_to_ignore_on_load_unexpected = [r"pooler"]
+    _tied_weights_keys = ["lm_head.decoder.weight"]
 
     def __init__(self, config):
         super().__init__(config)
@@ -1031,6 +1023,8 @@ def forward(
         masked_lm_loss = None
         if labels is not None:
             loss_fct = CrossEntropyLoss()
+
+            labels = labels.to(prediction_scores.device)
             masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
 
         if not return_dict:
@@ -1077,8 +1071,6 @@ def forward(self, features, **kwargs):
     ESM_START_DOCSTRING,
 )
 class EsmForSequenceClassification(EsmPreTrainedModel):
-    _keys_to_ignore_on_load_missing = [r"position_ids"]
-
     def __init__(self, config):
         super().__init__(config)
         self.num_labels = config.num_labels
@@ -1130,6 +1122,8 @@ def forward(
 
         loss = None
         if labels is not None:
+            labels = labels.to(logits.device)
+
             if self.config.problem_type is None:
                 if self.num_labels == 1:
                     self.config.problem_type = "regression"
@@ -1171,9 +1165,6 @@ def forward(
     ESM_START_DOCSTRING,
 )
 class EsmForTokenClassification(EsmPreTrainedModel):
-    _keys_to_ignore_on_load_unexpected = [r"pooler"]
-    _keys_to_ignore_on_load_missing = [r"position_ids"]
-
     def __init__(self, config):
         super().__init__(config)
         self.num_labels = config.num_labels
@@ -1227,16 +1218,9 @@ def forward(
         loss = None
         if labels is not None:
             loss_fct = CrossEntropyLoss()
-            # Only keep active parts of the loss
-            if attention_mask is not None:
-                active_loss = attention_mask.view(-1) == 1
-                active_logits = logits.view(-1, self.num_labels)
-                active_labels = torch.where(
-                    active_loss, labels.view(-1), torch.tensor(loss_fct.ignore_index).type_as(labels)
-                )
-                loss = loss_fct(active_logits, active_labels)
-            else:
-                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+
+            labels = labels.to(logits.device)
+            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
 
         if not return_dict:
             output = (logits,) + outputs[2:]
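
The manual attention-mask filtering is dropped because, under the usual convention of padding label positions with -100 (the default `ignore_index` of `CrossEntropyLoss`), those positions never contribute to the loss anyway. A short check of that behaviour:

```
import torch
from torch.nn import CrossEntropyLoss

logits = torch.randn(2, 4, 3)                 # (batch, seq_len, num_labels)
labels = torch.tensor([[0, 2, -100, -100],
                       [1, 1, 2, -100]])      # -100 marks padded positions

# Flatten exactly as the model does; -100 entries are ignored by default.
loss = CrossEntropyLoss()(logits.view(-1, 3), labels.view(-1))
print(loss)
```
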
diff --git a/src/transformers/models/esm/modeling_esmfold.py b/src/transformers/models/esm/modeling_esmfold.py
index d37891df35ade3..3aaf811960721b 100644
--- a/src/transformers/models/esm/modeling_esmfold.py
+++ b/src/transformers/models/esm/modeling_esmfold.py
@@ -23,7 +23,7 @@
 import torch.nn as nn
 from torch.nn import LayerNorm
 
-from ...deepspeed import is_deepspeed_available
+from ...integrations.deepspeed import is_deepspeed_available
 from ...modeling_outputs import ModelOutput
 from ...utils import (
     ContextManagers,
@@ -201,9 +201,9 @@ def collate_dense_tensors(samples: List[torch.Tensor], pad_v: float = 0) -> torc
     """
     if len(samples) == 0:
         return torch.Tensor()
-    if len(set(x.dim() for x in samples)) != 1:
+    if len({x.dim() for x in samples}) != 1:
         raise RuntimeError(f"Samples has varying dimensions: {[x.dim() for x in samples]}")
-    (device,) = tuple(set(x.device for x in samples))  # assumes all on same device
+    (device,) = tuple({x.device for x in samples})  # assumes all on same device
     max_shape = [max(lst) for lst in zip(*[x.shape for x in samples])]
     result = torch.empty(len(samples), *max_shape, dtype=samples[0].dtype, device=device)
     result.fill_(pad_v)
@@ -229,7 +229,7 @@ def dict_multimap(fn, dicts):
     new_dict = {}
     for k, v in first.items():
         all_v = [d[k] for d in dicts]
-        if type(v) is dict:
+        if isinstance(v, dict):
             new_dict[k] = dict_multimap(fn, all_v)
         else:
             new_dict[k] = fn(all_v)
@@ -1060,7 +1060,7 @@ def __init__(self, r: float, batch_dim: Union[int, List[int]]):
         super().__init__()
 
         self.r = r
-        if type(batch_dim) == int:
+        if isinstance(batch_dim, int):
             batch_dim = [batch_dim]
         self.batch_dim = batch_dim
         self.dropout = nn.Dropout(self.r)
@@ -1204,7 +1204,7 @@ def forward(self, sequence_state, pairwise_state, mask=None, chunk_size=None, **
 
         if sequence_state_dim != self.config.sequence_state_dim:
             raise ValueError(
-                "`sequence_state` last dimension should be equal to `self.sequence_state_dim`. Got"
+                "`sequence_state` last dimension should be equal to `self.sequence_state_dim`. Got "
                 f"{sequence_state_dim} != {self.config.sequence_state_dim}."
             )
         if pairwise_state_dim != self.config.pairwise_state_dim:
@@ -1915,7 +1915,7 @@ def set_chunk_size(self, chunk_size):
         # This parameter means the axial attention will be computed
         # in a chunked manner. This should make the memory used more or less O(L) instead of O(L^2).
         # It's equivalent to running a for loop over chunks of the dimension we're iterative over,
-        # where the chunk_size is the size of the chunks, so 128 would mean to parse 128-lengthed chunks.
+        # where the chunk_size is the size of the chunks, so 128 would mean to parse 128-length chunks.
         self.chunk_size = chunk_size
 
     def forward(self, seq_feats, pair_feats, true_aa, residx, mask, no_recycles):
@@ -2018,6 +2018,8 @@ def distogram(coords, min_bin, max_bin, num_bins):
     ESM_START_DOCSTRING,
 )
 class EsmForProteinFolding(EsmPreTrainedModel):
+    _no_split_modules = ["EsmFoldStructureModule", "EsmFoldTriangularSelfAttentionBlock"]
+
     def __init__(self, config):
         super().__init__(config)
 
@@ -2084,11 +2086,11 @@ def _af2_to_esm_from_vocab_list(vocab_list: List[str]) -> torch.Tensor:
     def forward(
         self,
         input_ids: torch.Tensor,
-        attention_mask: torch.Tensor = None,
+        attention_mask: Optional[torch.Tensor] = None,
         position_ids: Optional[torch.Tensor] = None,
         masking_pattern: Optional[torch.Tensor] = None,
         num_recycles: Optional[int] = None,
-    ):
+    ) -> EsmForProteinFoldingOutput:
         r"""
         Returns:
 
@@ -2252,7 +2254,7 @@ def infer(
         seqs: Union[str, List[str]],
         position_ids=None,
     ):
-        if type(seqs) is str:
+        if isinstance(seqs, str):
             lst = [seqs]
         else:
             lst = seqs
@@ -2310,7 +2312,7 @@ def output_to_pdb(output: Dict) -> List[str]:
 
     def infer_pdb(self, seqs, *args, **kwargs) -> str:
         """Returns the pdb (file) string from the model given an input sequence."""
-        assert type(seqs) is str
+        assert isinstance(seqs, str)
         output = self.infer(seqs, *args, **kwargs)
         return self.output_to_pdb(output)[0]
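
A minimal usage sketch for the folding entry points touched above (`infer_pdb` builds on `infer` and `output_to_pdb`); the checkpoint name, the `model.trunk` attribute, and the chunk size of 64 are illustrative assumptions rather than part of this patch:

```python
# Sketch: fold a single sequence with EsmForProteinFolding and write the prediction to a PDB file.
import torch
from transformers import EsmForProteinFolding

model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1")  # assumed checkpoint name
model.eval()

# Chunk the axial attention (see set_chunk_size above) to keep memory roughly O(L) instead of O(L^2).
model.trunk.set_chunk_size(64)  # attribute name and value assumed for illustration

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy amino-acid sequence
with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)  # infer_pdb expects a single str, per the isinstance check above

with open("prediction.pdb", "w") as f:
    f.write(pdb_string)
```

The `_no_split_modules` entry added above is what allows `device_map="auto"` to place this model across devices without splitting the structure module or the triangular self-attention blocks.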
 
diff --git a/src/transformers/models/esm/modeling_tf_esm.py b/src/transformers/models/esm/modeling_tf_esm.py
index 2bb25ba94d30c5..2c780b4bdd60c3 100644
--- a/src/transformers/models/esm/modeling_tf_esm.py
+++ b/src/transformers/models/esm/modeling_tf_esm.py
@@ -14,12 +14,14 @@
 # limitations under the License.
 """ TF 2.0 ESM model."""
 
+
+from __future__ import annotations
+
+import os
 from typing import Optional, Tuple, Union
 
 import numpy as np
 import tensorflow as tf
-from tensorflow.keras.activations import gelu
-from tensorflow.keras.layers import Dense, Dropout, Embedding, Layer, LayerNormalization
 
 from ...file_utils import add_code_sample_docstrings, add_start_docstrings, add_start_docstrings_to_model_forward
 from ...modeling_tf_outputs import (
@@ -36,11 +38,11 @@
     TFSequenceClassificationLoss,
     TFTokenClassificationLoss,
     get_initializer,
-    get_tf_activation,
+    keras,
     shape_list,
     unpack_inputs,
 )
-from ...tf_utils import stable_softmax
+from ...tf_utils import check_embeddings_within_bounds, stable_softmax
 from ...utils import logging
 from .configuration_esm import EsmConfig
 
@@ -87,7 +89,7 @@ def average_product_correct(x):
     return normalized
 
 
-class TFRotaryEmbedding(Layer):
+class TFRotaryEmbedding(keras.layers.Layer):
     """
     Rotary position embeddings based on those in
     [RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer). Query and keys are transformed by rotation
@@ -107,7 +109,7 @@ def __init__(self, dim: int, name=None):
     def build(self, input_shape):
         super().build(input_shape)
         self.inv_freq = self.add_weight(
-            "inv_freq", shape=(self.dim // 2,), dtype=tf.float32, initializer=get_initializer(1.0)
+            "inv_freq", shape=(self.dim // 2,), dtype=tf.float32, initializer=get_initializer(1.0), trainable=False
         )
         self.inv_freq.assign(
             1.0 / (10000 ** (tf.range(start=0, limit=self.dim, delta=2, dtype=tf.float32) / self.dim))
@@ -131,7 +133,7 @@ def call(self, q: tf.Tensor, k: tf.Tensor) -> Tuple[tf.Tensor, tf.Tensor]:
         )
 
 
-class TFEsmContactPredictionHead(Layer):
+class TFEsmContactPredictionHead(keras.layers.Layer):
     """Performs symmetrization, apc, and computes a logistic regression on the output features"""
 
     def __init__(
@@ -144,12 +146,15 @@ def __init__(
         super().__init__(name=name)
         self.eos_idx = eos_idx
         self.in_features = in_features
-        self.regression = Dense(1, use_bias=bias, activation="sigmoid", name="regression")
+        self.regression = keras.layers.Dense(1, use_bias=bias, activation="sigmoid", name="regression")
 
-    def build(self, input_shape):
-        super().build(input_shape)
-        with tf.name_scope("regression"):
-            self.regression.build((None, self.in_features))
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "regression", None) is not None:
+            with tf.name_scope(self.regression.name):
+                self.regression.build((None, self.in_features))
 
     def call(self, tokens, attentions):
         # remove eos token attentions
@@ -168,20 +173,20 @@ def call(self, tokens, attentions):
         return tf.squeeze(self.regression(attentions), 3)
 
 
-class TFEsmEmbeddings(Layer):
+class TFEsmEmbeddings(keras.layers.Layer):
     """
     Same as BertEmbeddings with a tiny tweak for positional embeddings indexing.
     """
 
     def __init__(self, config, name=None):
         super().__init__(name=name)
-        self.word_embeddings = Embedding(
+        self.word_embeddings = keras.layers.Embedding(
             config.vocab_size,
             config.hidden_size,
             embeddings_initializer=get_initializer(config.initializer_range),
             name="word_embeddings",
         )
-        self.position_embeddings = Embedding(
+        self.position_embeddings = keras.layers.Embedding(
             config.max_position_embeddings,
             config.hidden_size,
             embeddings_initializer=get_initializer(config.initializer_range),
@@ -189,7 +194,7 @@ def __init__(self, config, name=None):
         )
 
         if config.emb_layer_norm_before:
-            self.layer_norm = LayerNormalization(epsilon=config.layer_norm_eps, name="layer_norm")
+            self.layer_norm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layer_norm")
         else:
             self.layer_norm = None
         # Matt: I think this line was copied incorrectly from BERT, disabling for now
@@ -214,16 +219,7 @@ def call(
                 position_ids = self.create_position_ids_from_inputs_embeds(inputs_embeds)
 
         if inputs_embeds is None:
-            # Note: tf.gather, on which the embedding layer is based, won't check positive out of bound
-            # indices on GPU, returning zeros instead. This is a dangerous silent behavior.
-            tf.debugging.assert_less(
-                input_ids,
-                tf.cast(self.config.vocab_size, dtype=input_ids.dtype),
-                message=(
-                    "input_ids must be smaller than the embedding layer's input dimension (got"
-                    f" {tf.math.reduce_max(input_ids)} >= {self.config.vocab_size})"
-                ),
-            )
+            check_embeddings_within_bounds(input_ids, self.config.vocab_size)
             inputs_embeds = self.word_embeddings(input_ids)
 
         # Note that if we want to support ESM-1 (not 1b!) in future then we need to support an
@@ -274,8 +270,22 @@ def create_position_ids_from_inputs_embeds(self, inputs_embeds):
         )
         return tf.broadcast_to(tf.expand_dims(position_ids, 0), input_shape)
 
-
-class TFEsmSelfAttention(Layer):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "word_embeddings", None) is not None:
+            with tf.name_scope(self.word_embeddings.name):
+                self.word_embeddings.build(None)
+        if getattr(self, "position_embeddings", None) is not None:
+            with tf.name_scope(self.position_embeddings.name):
+                self.position_embeddings.build(None)
+        if getattr(self, "layer_norm", None) is not None:
+            with tf.name_scope(self.layer_norm.name):
+                self.layer_norm.build([None, None, self.config.hidden_size])
+
+
+class TFEsmSelfAttention(keras.layers.Layer):
     def __init__(self, config, position_embedding_type=None, name=None):
         super().__init__(name=name)
         if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
@@ -288,22 +298,24 @@ def __init__(self, config, position_embedding_type=None, name=None):
         self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
         self.all_head_size = self.num_attention_heads * self.attention_head_size
 
-        self.query = Dense(
+        self.query = keras.layers.Dense(
             self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="query"
         )
-        self.key = Dense(self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="key")
-        self.value = Dense(
+        self.key = keras.layers.Dense(
+            self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="key"
+        )
+        self.value = keras.layers.Dense(
             self.all_head_size, kernel_initializer=get_initializer(config.initializer_range), name="value"
         )
 
-        self.dropout = Dropout(config.attention_probs_dropout_prob)
+        self.dropout = keras.layers.Dropout(config.attention_probs_dropout_prob)
         self.position_embedding_type = position_embedding_type or getattr(
             config, "position_embedding_type", "absolute"
         )
         self.rotary_embeddings = None
         if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
             self.max_position_embeddings = config.max_position_embeddings
-            self.distance_embedding = Embedding(
+            self.distance_embedding = keras.layers.Embedding(
                 2 * config.max_position_embeddings - 1,
                 self.attention_head_size,
                 embeddings_initializer=get_initializer(config.initializer_range),
@@ -312,6 +324,7 @@ def __init__(self, config, position_embedding_type=None, name=None):
             self.rotary_embeddings = TFRotaryEmbedding(dim=self.attention_head_size, name="rotary_embeddings")
 
         self.is_decoder = config.is_decoder
+        self.config = config
 
     def transpose_for_scores(self, x: tf.Tensor) -> tf.Tensor:
         new_x_shape = shape_list(x)[:-1] + [self.num_attention_heads, self.attention_head_size]
@@ -321,11 +334,11 @@ def transpose_for_scores(self, x: tf.Tensor) -> tf.Tensor:
     def call(
         self,
         hidden_states: tf.Tensor,
-        attention_mask: Optional[tf.Tensor] = None,
-        head_mask: Optional[tf.Tensor] = None,
-        encoder_hidden_states: Optional[tf.Tensor] = None,
-        encoder_attention_mask: Optional[tf.Tensor] = None,
-        past_key_value: Optional[Tuple[Tuple[tf.Tensor]]] = None,
+        attention_mask: tf.Tensor | None = None,
+        head_mask: tf.Tensor | None = None,
+        encoder_hidden_states: tf.Tensor | None = None,
+        encoder_attention_mask: tf.Tensor | None = None,
+        past_key_value: Tuple[Tuple[tf.Tensor]] | None = None,
         output_attentions: Optional[bool] = False,
         training: bool = False,
     ) -> Tuple[tf.Tensor]:
@@ -421,14 +434,32 @@ def call(
             outputs = outputs + (past_key_value,)
         return outputs
 
-
-class TFEsmSelfOutput(Layer):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "query", None) is not None:
+            with tf.name_scope(self.query.name):
+                self.query.build([None, None, self.config.hidden_size])
+        if getattr(self, "key", None) is not None:
+            with tf.name_scope(self.key.name):
+                self.key.build([None, None, self.config.hidden_size])
+        if getattr(self, "value", None) is not None:
+            with tf.name_scope(self.value.name):
+                self.value.build([None, None, self.config.hidden_size])
+        if getattr(self, "rotary_embeddings", None) is not None:
+            with tf.name_scope(self.rotary_embeddings.name):
+                self.rotary_embeddings.build(None)
+
+
+class TFEsmSelfOutput(keras.layers.Layer):
     def __init__(self, config, name=None):
         super().__init__(name=name)
-        self.dense = Dense(
+        self.dense = keras.layers.Dense(
             config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
         )
-        self.dropout = Dropout(config.hidden_dropout_prob)
+        self.dropout = keras.layers.Dropout(config.hidden_dropout_prob)
+        self.config = config
 
     def call(self, hidden_states, input_tensor, training=False):
         hidden_states = self.dense(hidden_states)
@@ -436,14 +467,23 @@ def call(self, hidden_states, input_tensor, training=False):
         hidden_states += input_tensor
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
 
-class TFEsmAttention(Layer):
+
+class TFEsmAttention(keras.layers.Layer):
     def __init__(self, config, name=None):
         super().__init__(name=name)
         self.self = TFEsmSelfAttention(config, name="self")
         self.output_layer = TFEsmSelfOutput(config, name="output")
         self.pruned_heads = set()
-        self.LayerNorm = LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.config = config
 
     def prune_heads(self, heads):
         raise NotImplementedError
@@ -474,35 +514,54 @@ def call(
         outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them
         return outputs
 
-
-# Copied from transformers.models.bert.modeling_tf_bert.TFBertIntermediate with Bert->Esm
-class TFEsmIntermediate(tf.keras.layers.Layer):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "self", None) is not None:
+            with tf.name_scope(self.self.name):
+                self.self.build(None)
+        if getattr(self, "output_layer", None) is not None:
+            with tf.name_scope(self.output_layer.name):
+                self.output_layer.build(None)
+        if getattr(self, "LayerNorm", None) is not None:
+            with tf.name_scope(self.LayerNorm.name):
+                self.LayerNorm.build([None, None, self.config.hidden_size])
+
+
+class TFEsmIntermediate(keras.layers.Layer):
     def __init__(self, config: EsmConfig, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(
-            units=config.intermediate_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
+        self.dense = keras.layers.Dense(
+            units=config.intermediate_size,
+            kernel_initializer=get_initializer(config.initializer_range),
+            name="dense",
         )
-
-        if isinstance(config.hidden_act, str):
-            self.intermediate_act_fn = get_tf_activation(config.hidden_act)
-        else:
-            self.intermediate_act_fn = config.hidden_act
+        self.config = config
 
     def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
         hidden_states = self.dense(inputs=hidden_states)
-        hidden_states = self.intermediate_act_fn(hidden_states)
-
+        hidden_states = tf.nn.gelu(hidden_states)
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+
 
-class TFEsmOutput(Layer):
+class TFEsmOutput(keras.layers.Layer):
     def __init__(self, config, name=None):
         super().__init__(name=name)
-        self.dense = Dense(
+        self.dense = keras.layers.Dense(
             config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
         )
-        self.dropout = Dropout(config.hidden_dropout_prob)
+        self.dropout = keras.layers.Dropout(config.hidden_dropout_prob)
+        self.config = config
 
     def call(self, hidden_states, input_tensor, training=False):
         hidden_states = self.dense(hidden_states)
@@ -510,8 +569,16 @@ def call(self, hidden_states, input_tensor, training=False):
         hidden_states += input_tensor
         return hidden_states
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.intermediate_size])
+
 
-class TFEsmLayer(Layer):
+class TFEsmLayer(keras.layers.Layer):
     def __init__(self, config, name=None):
         super().__init__(name=name)
         self.chunk_size_feed_forward = config.chunk_size_feed_forward
@@ -525,7 +592,8 @@ def __init__(self, config, name=None):
             self.crossattention = TFEsmAttention(config)
         self.intermediate = TFEsmIntermediate(config, name="intermediate")
         self.output_layer = TFEsmOutput(config, name="output")
-        self.LayerNorm = LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.LayerNorm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="LayerNorm")
+        self.config = config
 
     def call(
         self,
@@ -597,13 +665,32 @@ def call(
 
         return outputs
 
-
-class TFEsmEncoder(Layer):
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "attention", None) is not None:
+            with tf.name_scope(self.attention.name):
+                self.attention.build(None)
+        if getattr(self, "intermediate", None) is not None:
+            with tf.name_scope(self.intermediate.name):
+                self.intermediate.build(None)
+        if getattr(self, "output_layer", None) is not None:
+            with tf.name_scope(self.output_layer.name):
+                self.output_layer.build(None)
+        if getattr(self, "LayerNorm", None) is not None:
+            with tf.name_scope(self.LayerNorm.name):
+                self.LayerNorm.build([None, None, self.config.hidden_size])
+
+
+class TFEsmEncoder(keras.layers.Layer):
     def __init__(self, config, name=None):
         super().__init__(name=name)
         self.config = config
         self.layer = [TFEsmLayer(config, name=f"layer_._{i}") for i in range(config.num_hidden_layers)]
-        self.emb_layer_norm_after = LayerNormalization(epsilon=config.layer_norm_eps, name="emb_layer_norm_after")
+        self.emb_layer_norm_after = keras.layers.LayerNormalization(
+            epsilon=config.layer_norm_eps, name="emb_layer_norm_after"
+        )
 
     def call(
         self,
@@ -676,18 +763,31 @@ def call(
             cross_attentions=all_cross_attentions,
         )
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "emb_layer_norm_after", None) is not None:
+            with tf.name_scope(self.emb_layer_norm_after.name):
+                self.emb_layer_norm_after.build([None, None, self.config.hidden_size])
+        if getattr(self, "layer", None) is not None:
+            for layer in self.layer:
+                with tf.name_scope(layer.name):
+                    layer.build(None)
+
 
 # Copied from transformers.models.bert.modeling_tf_bert.TFBertPooler with Bert->Esm
-class TFEsmPooler(Layer):
+class TFEsmPooler(keras.layers.Layer):
     def __init__(self, config: EsmConfig, **kwargs):
         super().__init__(**kwargs)
 
-        self.dense = tf.keras.layers.Dense(
+        self.dense = keras.layers.Dense(
             units=config.hidden_size,
             kernel_initializer=get_initializer(config.initializer_range),
             activation="tanh",
             name="dense",
         )
+        self.config = config
 
     def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
         # We "pool" the model by simply taking the hidden state corresponding
@@ -697,6 +797,14 @@ def call(self, hidden_states: tf.Tensor) -> tf.Tensor:
 
         return pooled_output
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+
 
 class TFEsmPreTrainedModel(TFPreTrainedModel):
     """
@@ -769,7 +877,7 @@ class TFEsmPreTrainedModel(TFPreTrainedModel):
     "The bare ESM Model transformer outputting raw hidden-states without any specific head on top.",
     ESM_START_DOCSTRING,
 )
-class TFEsmMainLayer(Layer):
+class TFEsmMainLayer(keras.layers.Layer):
     """
 
     The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of
@@ -798,10 +906,22 @@ def __init__(self, config, add_pooling_layer=True, name=None, **kwargs):
             in_features=self.config.num_hidden_layers * self.config.num_attention_heads, bias=True, name="contact_head"
         )
 
-    def build(self, input_shape):
-        super().build(input_shape)
-        with tf.name_scope("contact_head"):
-            self.contact_head.build(input_shape)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "embeddings", None) is not None:
+            with tf.name_scope(self.embeddings.name):
+                self.embeddings.build(None)
+        if getattr(self, "encoder", None) is not None:
+            with tf.name_scope(self.encoder.name):
+                self.encoder.build(None)
+        if getattr(self, "pooler", None) is not None:
+            with tf.name_scope(self.pooler.name):
+                self.pooler.build(None)
+        if getattr(self, "contact_head", None) is not None:
+            with tf.name_scope(self.contact_head.name):
+                self.contact_head.build(None)
 
     def get_input_embeddings(self):
         return self.embeddings.word_embeddings
@@ -815,13 +935,13 @@ def _prune_heads(self, heads_to_prune):
 
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        encoder_hidden_states: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        encoder_attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
+        encoder_hidden_states: np.ndarray | tf.Tensor | None = None,
+        encoder_attention_mask: np.ndarray | tf.Tensor | None = None,
         past_key_values: Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None,
         use_cache: Optional[bool] = None,
         output_attentions: Optional[bool] = None,
@@ -998,13 +1118,13 @@ def __init__(self, config: EsmConfig, add_pooling_layer=True, *inputs, **kwargs)
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        encoder_hidden_states: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        encoder_attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
+        encoder_hidden_states: np.ndarray | tf.Tensor | None = None,
+        encoder_attention_mask: np.ndarray | tf.Tensor | None = None,
         past_key_values: Optional[Tuple[Tuple[Union[np.ndarray, tf.Tensor]]]] = None,
         use_cache: Optional[bool] = None,
         output_attentions: Optional[bool] = None,
@@ -1049,42 +1169,17 @@ def call(
         )
         return outputs
 
-    @tf.function(
-        input_signature=[
-            {
-                "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"),
-                "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"),
-            }
-        ]
-    )
-    def serving(self, inputs):
-        output = self.call(inputs)
-
-        return self.serving_output(output)
-
-    def serving_output(
-        self, output: TFBaseModelOutputWithPoolingAndCrossAttentions
-    ) -> TFBaseModelOutputWithPoolingAndCrossAttentions:
-        output_cache = self.config.use_cache and self.config.is_decoder
-        pkv = tf.convert_to_tensor(output.past_key_values) if output_cache else None
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-        cross_attns = tf.convert_to_tensor(output.cross_attentions) if output.cross_attentions is not None else None
-        if not (self.config.output_attentions and self.config.add_cross_attention):
-            cross_attns = None
-
-        return TFBaseModelOutputWithPoolingAndCrossAttentions(
-            last_hidden_state=output.last_hidden_state,
-            pooler_output=output.pooler_output,
-            past_key_values=pkv,
-            hidden_states=hs,
-            attentions=attns,
-            cross_attentions=cross_attns,
-        )
-
     def predict_contacts(self, tokens, attention_mask):
         return self.esm.predict_contacts(tokens, attention_mask)
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "esm", None) is not None:
+            with tf.name_scope(self.esm.name):
+                self.esm.build(None)
+
 
 @add_start_docstrings("""ESM Model with a `language modeling` head on top.""", ESM_START_DOCSTRING)
 class TFEsmForMaskedLM(TFEsmPreTrainedModel, TFMaskedLanguageModelingLoss):
@@ -1102,6 +1197,11 @@ def __init__(self, config):
 
         self.esm = TFEsmMainLayer(config, add_pooling_layer=False, name="esm")
         self.lm_head = TFEsmLMHead(config, name="lm_head")
+        if config.tie_word_embeddings:
+            # Ensure word embeddings are built so that we actually have something to tie
+            with tf.name_scope(os.path.join(self._name_scope(), "esm", "embeddings", "word_embeddings")):
+                self.esm.embeddings.word_embeddings.build((None, None))
+            self.lm_head.decoder = self.esm.embeddings.word_embeddings.weights[0]
 
     def get_output_embeddings(self):
         return self.lm_head.decoder
@@ -1122,14 +1222,14 @@ def get_lm_head(self):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        encoder_hidden_states: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        encoder_attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
+        encoder_hidden_states: np.ndarray | tf.Tensor | None = None,
+        encoder_attention_mask: np.ndarray | tf.Tensor | None = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
@@ -1176,66 +1276,72 @@ def call(
             attentions=outputs.attentions,
         )
 
-    # Copied from transformers.models.bert.modeling_tf_bert.TFBertForMaskedLM.serving_output
-    def serving_output(self, output: TFMaskedLMOutput) -> TFMaskedLMOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFMaskedLMOutput(logits=output.logits, hidden_states=hs, attentions=attns)
-
-    @tf.function(
-        input_signature=[
-            {
-                "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"),
-                "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"),
-            }
-        ]
-    )
-    def serving(self, inputs):
-        output = self.call(inputs)
-
-        return self.serving_output(output)
-
     def predict_contacts(self, tokens, attention_mask):
         return self.esm.predict_contacts(tokens, attention_mask)
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "esm", None) is not None:
+            with tf.name_scope(self.esm.name):
+                self.esm.build(None)
+        if getattr(self, "lm_head", None) is not None:
+            with tf.name_scope(self.lm_head.name):
+                self.lm_head.build(None)
 
-class TFEsmLMHead(Layer):
+
+class TFEsmLMHead(keras.layers.Layer):
     """ESM Head for masked language modeling."""
 
     def __init__(self, config, name=None):
         super().__init__(name=name)
-        self.dense = Dense(
+        self.dense = keras.layers.Dense(
             config.hidden_size, kernel_initializer=get_initializer(config.initializer_range), name="dense"
         )
 
-        self.layer_norm = LayerNormalization(epsilon=config.layer_norm_eps, name="layer_norm")
-
-        self.decoder = Dense(
-            config.vocab_size,
-            use_bias=False,
-            kernel_initializer=get_initializer(config.initializer_range),
-            name="decoder",
-        )
+        self.layer_norm = keras.layers.LayerNormalization(epsilon=config.layer_norm_eps, name="layer_norm")
+        if config.tie_word_embeddings:
+            self.decoder = None
+        else:
+            self.decoder = keras.layers.Dense(
+                config.vocab_size,
+                kernel_initializer=get_initializer(config.initializer_range),
+                name="decoder",
+                use_bias=False,
+            )
         self.config = config
 
-    def build(self, input_shape):
-        super().build(input_shape)
+    def build(self, input_shape=None):
         # Separate bias to match the PT model and allow weight cross-loading to work
         # Put it in the build so it gets the right name when adding it as a weight
+        if self.built:
+            return
+        self.built = True
         self.bias = self.add_weight("bias", shape=(self.config.vocab_size,), initializer="zeros", trainable=True)
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+        if getattr(self, "layer_norm", None) is not None:
+            with tf.name_scope(self.layer_norm.name):
+                self.layer_norm.build([None, None, self.config.hidden_size])
+        if getattr(self, "decoder", None) is not None and not self.config.tie_word_embeddings:
+            with tf.name_scope(self.decoder.name):
+                self.decoder.build([None, None, self.config.hidden_size])
 
     def get_bias(self):
         return {"bias": self.bias}
 
     def call(self, features):
         x = self.dense(features)
-        x = gelu(x)
+        x = tf.nn.gelu(x)
         x = self.layer_norm(x)
 
         # project back to size of vocabulary with bias
-        x = self.decoder(x)
-        x = x + self.bias
+        if self.config.tie_word_embeddings:
+            x = tf.matmul(x, self.decoder, transpose_b=True) + self.bias
+        else:
+            x = self.decoder(x) + self.bias
         return x
 
 
@@ -1266,12 +1372,12 @@ def __init__(self, config):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
@@ -1312,25 +1418,16 @@ def call(
             attentions=outputs.attentions,
         )
 
-    # Copied from transformers.models.bert.modeling_tf_bert.TFBertForSequenceClassification.serving_output
-    def serving_output(self, output: TFSequenceClassifierOutput) -> TFSequenceClassifierOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFSequenceClassifierOutput(logits=output.logits, hidden_states=hs, attentions=attns)
-
-    @tf.function(
-        input_signature=[
-            {
-                "input_ids": tf.TensorSpec((None, None, None), tf.int32, name="input_ids"),
-                "attention_mask": tf.TensorSpec((None, None, None), tf.int32, name="attention_mask"),
-            }
-        ]
-    )
-    def serving(self, inputs):
-        output = self.call(inputs)
-
-        return self.serving_output(output)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "esm", None) is not None:
+            with tf.name_scope(self.esm.name):
+                self.esm.build(None)
+        if getattr(self, "classifier", None) is not None:
+            with tf.name_scope(self.classifier.name):
+                self.classifier.build(None)
 
 
 @add_start_docstrings(
@@ -1349,8 +1446,9 @@ def __init__(self, config):
         self.num_labels = config.num_labels
 
         self.esm = TFEsmMainLayer(config, add_pooling_layer=False, name="esm")
-        self.dropout = Dropout(config.hidden_dropout_prob)
-        self.classifier = Dense(config.num_labels, name="classifier")
+        self.dropout = keras.layers.Dropout(config.hidden_dropout_prob)
+        self.classifier = keras.layers.Dense(config.num_labels, name="classifier")
+        self.config = config
 
     @unpack_inputs
     @add_start_docstrings_to_model_forward(ESM_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@@ -1361,12 +1459,12 @@ def __init__(self, config):
     )
     def call(
         self,
-        input_ids: Optional[TFModelInputType] = None,
-        attention_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        position_ids: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        head_mask: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        inputs_embeds: Optional[Union[np.ndarray, tf.Tensor]] = None,
-        labels: Optional[Union[np.ndarray, tf.Tensor]] = None,
+        input_ids: TFModelInputType | None = None,
+        attention_mask: np.ndarray | tf.Tensor | None = None,
+        position_ids: np.ndarray | tf.Tensor | None = None,
+        head_mask: np.ndarray | tf.Tensor | None = None,
+        inputs_embeds: np.ndarray | tf.Tensor | None = None,
+        labels: np.ndarray | tf.Tensor | None = None,
         output_attentions: Optional[bool] = None,
         output_hidden_states: Optional[bool] = None,
         return_dict: Optional[bool] = None,
@@ -1408,45 +1506,37 @@ def call(
             attentions=outputs.attentions,
         )
 
-    # Copied from transformers.models.bert.modeling_tf_bert.TFBertForTokenClassification.serving_output
-    def serving_output(self, output: TFTokenClassifierOutput) -> TFTokenClassifierOutput:
-        hs = tf.convert_to_tensor(output.hidden_states) if self.config.output_hidden_states else None
-        attns = tf.convert_to_tensor(output.attentions) if self.config.output_attentions else None
-
-        return TFTokenClassifierOutput(logits=output.logits, hidden_states=hs, attentions=attns)
-
-    @tf.function(
-        input_signature=[
-            {
-                "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"),
-                "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"),
-            }
-        ]
-    )
-    def serving(self, inputs):
-        output = self.call(inputs)
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "esm", None) is not None:
+            with tf.name_scope(self.esm.name):
+                self.esm.build(None)
+        if getattr(self, "classifier", None) is not None:
+            with tf.name_scope(self.classifier.name):
+                self.classifier.build([None, None, self.config.hidden_size])
 
-        return self.serving_output(output)
 
-
-class TFEsmClassificationHead(Layer):
+class TFEsmClassificationHead(keras.layers.Layer):
     """Head for sentence-level classification tasks."""
 
     def __init__(self, config, name=None):
         super().__init__(name=name)
-        self.dense = Dense(
+        self.dense = keras.layers.Dense(
             config.hidden_size,
             kernel_initializer=get_initializer(config.initializer_range),
             activation="tanh",
             name="dense",
         )
-        self.dropout = Dropout(config.hidden_dropout_prob)
-        self.out_proj = Dense(
+        self.dropout = keras.layers.Dropout(config.hidden_dropout_prob)
+        self.out_proj = keras.layers.Dense(
             config.num_labels,
             kernel_initializer=get_initializer(config.initializer_range),
             activation="linear",
             name="out_proj",
         )
+        self.config = config
 
     def call(self, features, training=False):
         x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
@@ -1456,6 +1546,17 @@ def call(self, features, training=False):
         x = self.out_proj(x)
         return x
 
+    def build(self, input_shape=None):
+        if self.built:
+            return
+        self.built = True
+        if getattr(self, "dense", None) is not None:
+            with tf.name_scope(self.dense.name):
+                self.dense.build([None, None, self.config.hidden_size])
+        if getattr(self, "out_proj", None) is not None:
+            with tf.name_scope(self.out_proj.name):
+                self.out_proj.build([None, None, self.config.hidden_size])
+
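
The tied-embedding branch added to `TFEsmLMHead` above replaces the separate `decoder` Dense layer with a matmul against the transposed word-embedding matrix plus the standalone bias weight. A self-contained sketch of that pattern with toy shapes (all names and sizes below are illustrative, not from this patch):

```python
# Sketch of output-embedding weight tying, mirroring the tf.matmul(..., transpose_b=True) branch above.
import tensorflow as tf

vocab_size, hidden_size, batch, seq_len = 33, 8, 2, 5  # toy sizes

embedding_matrix = tf.random.normal((vocab_size, hidden_size))  # stands in for word_embeddings.weights[0]
bias = tf.zeros((vocab_size,))                                   # stands in for the lm_head bias weight
hidden_states = tf.random.normal((batch, seq_len, hidden_size))

# Tied projection: (batch, seq, hidden) @ (hidden, vocab) -> (batch, seq, vocab)
logits = tf.matmul(hidden_states, embedding_matrix, transpose_b=True) + bias
print(logits.shape)  # (2, 5, 33)
```

Tying the output projection to the input embeddings keeps the two matrices identical by construction, which is why the constructor above only builds a `decoder` Dense layer when `config.tie_word_embeddings` is `False`.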
 
 def create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length=0):
     """
diff --git a/src/transformers/models/esm/openfold_utils/chunk_utils.py b/src/transformers/models/esm/openfold_utils/chunk_utils.py
index 4b60373438e2d7..301721d135ee4d 100644
--- a/src/transformers/models/esm/openfold_utils/chunk_utils.py
+++ b/src/transformers/models/esm/openfold_utils/chunk_utils.py
@@ -83,7 +83,7 @@ def reduce_edge_list(l: List[bool]) -> None:
     # Base cases. Either start/end are empty and we're done, or the final,
     # one-dimensional tensor can be simply sliced
     if len(start) == 0:
-        return [tuple()]
+        return [()]
     elif len(start) == 1:
         return [(slice(start[0], end[0] + 1),)]
 
diff --git a/src/transformers/models/esm/tokenization_esm.py b/src/transformers/models/esm/tokenization_esm.py
index 232ce61fb7e07a..478527c0ecd17f 100644
--- a/src/transformers/models/esm/tokenization_esm.py
+++ b/src/transformers/models/esm/tokenization_esm.py
@@ -14,10 +14,9 @@
 # limitations under the License.
 """Tokenization classes for ESM."""
 import os
-from typing import List, Optional, Union
+from typing import List, Optional
 
 from ...tokenization_utils import PreTrainedTokenizer
-from ...tokenization_utils_base import AddedToken
 from ...utils import logging
 
 
@@ -54,18 +53,33 @@ class EsmTokenizer(PreTrainedTokenizer):
     max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
     model_input_names = ["input_ids", "attention_mask"]
 
-    def __init__(self, vocab_file, **kwargs):
-        super().__init__(**kwargs)
+    def __init__(
+        self,
+        vocab_file,
+        unk_token="<unk>",
+        cls_token="<cls>",
+        pad_token="<pad>",
+        mask_token="<mask>",
+        eos_token="<eos>",
+        **kwargs,
+    ):
         self.all_tokens = load_vocab_file(vocab_file)
-        self._id_to_token = {ind: tok for ind, tok in enumerate(self.all_tokens)}
+        self._id_to_token = dict(enumerate(self.all_tokens))
         self._token_to_id = {tok: ind for ind, tok in enumerate(self.all_tokens)}
-        self.unk_token = "<unk>"
-        self.cls_token = "<cls>"
-        self.pad_token = "<pad>"
-        self.mask_token = "<mask>"
-        self.eos_token = "<eos>"
+        super().__init__(
+            unk_token=unk_token,
+            cls_token=cls_token,
+            pad_token=pad_token,
+            mask_token=mask_token,
+            eos_token=eos_token,
+            **kwargs,
+        )
+
+        # TODO, all the tokens are added? But they are also part of the vocab... bit strange.
+        # none of them are special, but they all need special splitting.
+
         self.unique_no_split_tokens = self.all_tokens
-        self._create_trie(self.unique_no_split_tokens)
+        self._update_trie(self.unique_no_split_tokens)
 
     def _convert_id_to_token(self, index: int) -> str:
         return self._id_to_token.get(index, self.unk_token)
@@ -76,11 +90,10 @@ def _convert_token_to_id(self, token: str) -> int:
     def _tokenize(self, text, **kwargs):
         return text.split()
 
-    def get_vocab_size(self, with_added_tokens=False):
-        return len(self._id_to_token)
-
     def get_vocab(self):
-        return {token: i for i, token in enumerate(self.all_tokens)}
+        base_vocab = self._token_to_id.copy()
+        base_vocab.update(self.added_tokens_encoder)
+        return base_vocab
 
     def token_to_id(self, token: str) -> int:
         return self._token_to_id.get(token, self._token_to_id.get(self.unk_token))
@@ -91,11 +104,16 @@ def id_to_token(self, index: int) -> str:
     def build_inputs_with_special_tokens(
         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
     ) -> List[int]:
-        if token_ids_1 is None:
-            return [self.cls_token_id] + token_ids_0 + [self.eos_token_id]
         cls = [self.cls_token_id]
         sep = [self.eos_token_id]  # No sep token in ESM vocabulary
-        return cls + token_ids_0 + sep + token_ids_1 + sep
+        if token_ids_1 is None:
+            if self.eos_token_id is None:
+                return cls + token_ids_0
+            else:
+                return cls + token_ids_0 + sep
+        elif self.eos_token_id is None:
+            raise ValueError("Cannot tokenize multiple sequences when EOS token is not set!")
+        return cls + token_ids_0 + sep + token_ids_1 + sep  # Multiple inputs always have an EOS token
 
     def get_special_tokens_mask(
         self, token_ids_0: List, token_ids_1: Optional[List] = None, already_has_special_tokens: bool = False
@@ -136,7 +154,4 @@ def save_vocabulary(self, save_directory, filename_prefix):
 
     @property
     def vocab_size(self) -> int:
-        return self.get_vocab_size(with_added_tokens=False)
-
-    def _add_tokens(self, new_tokens: Union[List[str], List[AddedToken]], special_tokens: bool = False) -> int:
-        return super()._add_tokens(new_tokens, special_tokens=True)
+        return len(self.all_tokens)
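
With the defaults restored above, `build_inputs_with_special_tokens` wraps a single sequence as `<cls> ... <eos>` and a pair as `<cls> A <eos> B <eos>`. A small sketch of the resulting layout; the checkpoint name is an assumption for illustration, and the printed lists show the expected shape of the output rather than a captured run:

```python
# Sketch: inspect the special-token layout produced by EsmTokenizer.
from transformers import EsmTokenizer

tokenizer = EsmTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")  # assumed checkpoint name

single = tokenizer("MKTAYI")
print(tokenizer.convert_ids_to_tokens(single["input_ids"]))
# roughly: ['<cls>', 'M', 'K', 'T', 'A', 'Y', 'I', '<eos>']

pair = tokenizer("MKT", "AYI")
print(tokenizer.convert_ids_to_tokens(pair["input_ids"]))
# roughly: ['<cls>', 'M', 'K', 'T', '<eos>', 'A', 'Y', 'I', '<eos>']
```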
diff --git a/src/transformers/models/falcon/__init__.py b/src/transformers/models/falcon/__init__.py
new file mode 100644
index 00000000000000..070e0cc033fbf6
--- /dev/null
+++ b/src/transformers/models/falcon/__init__.py
@@ -0,0 +1,68 @@
+# coding=utf-8
+# Copyright 2023 the Falcon authors and HuggingFace Inc. team.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import TYPE_CHECKING
+
+from ...utils import (
+    OptionalDependencyNotAvailable,
+    _LazyModule,
+    is_torch_available,
+)
+
+
+_import_structure = {
+    "configuration_falcon": ["FALCON_PRETRAINED_CONFIG_ARCHIVE_MAP", "FalconConfig"],
+}
+
+try:
+    if not is_torch_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["modeling_falcon"] = [
+        "FALCON_PRETRAINED_MODEL_ARCHIVE_LIST",
+        "FalconForCausalLM",
+        "FalconModel",
+        "FalconPreTrainedModel",
+        "FalconForSequenceClassification",
+        "FalconForTokenClassification",
+        "FalconForQuestionAnswering",
+    ]
+
+
+if TYPE_CHECKING:
+    from .configuration_falcon import FALCON_PRETRAINED_CONFIG_ARCHIVE_MAP, FalconConfig
+
+    try:
+        if not is_torch_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .modeling_falcon import (
+            FALCON_PRETRAINED_MODEL_ARCHIVE_LIST,
+            FalconForCausalLM,
+            FalconForQuestionAnswering,
+            FalconForSequenceClassification,
+            FalconForTokenClassification,
+            FalconModel,
+            FalconPreTrainedModel,
+        )
+
+
+else:
+    import sys
+
+    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
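
The new `falcon` package follows the library's lazy-import layout: the config is importable without torch, while the modeling symbols are only registered when torch is available and are resolved on first attribute access. A short sketch of what that means for a caller, assuming torch is installed:

```python
# Sketch: the _LazyModule defers the heavy modeling import until a symbol is actually requested.
from transformers.models import falcon

config = falcon.FalconConfig(num_hidden_layers=2)   # config is always part of _import_structure
model_cls = falcon.FalconForCausalLM                # first access triggers the import of modeling_falcon
print(config.model_type, model_cls.__name__)        # falcon FalconForCausalLM
```

If torch is missing, the `except OptionalDependencyNotAvailable` branch above leaves the modeling names out of `_import_structure`, so the same attribute access fails instead of importing torch at module load time.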
diff --git a/src/transformers/models/falcon/configuration_falcon.py b/src/transformers/models/falcon/configuration_falcon.py
new file mode 100644
index 00000000000000..fe0a450a24eb0c
--- /dev/null
+++ b/src/transformers/models/falcon/configuration_falcon.py
@@ -0,0 +1,192 @@
+# coding=utf-8
+# Copyright 2023 the Falcon authors and HuggingFace Inc. team.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Falcon configuration"""
+from ...configuration_utils import PretrainedConfig
+from ...utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+FALCON_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    "tiiuae/falcon-40b": "https://huggingface.co/tiiuae/falcon-40b/resolve/main/config.json",
+    "tiiuae/falcon-7b": "https://huggingface.co/tiiuae/falcon-7b/resolve/main/config.json",
+}
+
+
+class FalconConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`FalconModel`]. It is used to instantiate a Falcon
+    model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
+    defaults will yield a similar configuration to that of the
+    [tiiuae/falcon-7b](https://huggingface.co/tiiuae/falcon-7b) architecture.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+
+    Args:
+        vocab_size (`int`, *optional*, defaults to 65024):
+            Vocabulary size of the Falcon model. Defines the number of different tokens that can be represented by the
+            `input_ids` passed when calling [`FalconModel`].
+        hidden_size (`int`, *optional*, defaults to 4544):
+            Dimension of the hidden representations.
+        num_hidden_layers (`int`, *optional*, defaults to 32):
+            Number of hidden layers in the Transformer decoder.
+        num_attention_heads (`int`, *optional*, defaults to 71):
+            Number of attention heads for each attention layer in the Transformer decoder.
+        layer_norm_epsilon (`float`, *optional*, defaults to 1e-05):
+            The epsilon used by the layer normalization layers.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        use_cache (`bool`, *optional*, defaults to `True`):
+            Whether the model should return the last key/values attentions (not used by all models). Only relevant if
+            `config.is_decoder=True`.
+        hidden_dropout (`float`, *optional*, defaults to 0.0):
+            The dropout probability for MLP layers.
+        attention_dropout (`float`, *optional*, defaults to 0.0):
+            The dropout probability for attention layers.
+        num_kv_heads (`int`, *optional*):
+            Number of key-value heads to use per attention layer. If unset, defaults to the same value as
+            `num_attention_heads`.
+        alibi (`bool`, *optional*, defaults to `False`):
+            Whether to use ALiBi positional biases during self-attention.
+        new_decoder_architecture (`bool`, *optional*, defaults to `False`):
+            Whether to use the new (Falcon-40B) decoder architecture. If `True`, the `multi_query` and `parallel_attn`
+            arguments are ignored, as the new decoder always uses parallel attention.
+        multi_query (`bool`, *optional*, defaults to `True`):
+            Whether to use multi-query attention in the decoder. Ignored when `new_decoder_architecture` is `True`.
+        parallel_attn (`bool`, *optional*, defaults to `True`):
+            Whether to compute attention in parallel with the feedforward layer. If False, they are consecutive
+            instead, as in the original Transformer architecture. Ignored when `new_decoder_architecture` is `True`.
+        bias (`bool`, *optional*, defaults to `False`):
+            Whether to use bias on Linear layers.
+        max_position_embeddings (`int`, *optional*, defaults to 2048):
+            The maximum sequence length that this model might ever be used with, when `alibi` is `False`. Pretrained
+            Falcon models with RoPE support up to 2048 tokens.
+        rope_theta (`float`, *optional*, defaults to 10000.0):
+            The base period of the RoPE embeddings.
+        rope_scaling (`Dict`, *optional*):
+            Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
+            strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
+            `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
+            `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
+            these scaling strategies behave:
+            https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
+            experimental feature, subject to breaking API changes in future versions.
+        bos_token_id (`int`, *optional*, defaults to 11):
+            The id of the "beginning-of-sequence" token.
+        eos_token_id (`int`, *optional*, defaults to 11):
+            The id of the "end-of-sequence" token.
+
+    Example:
+
+    ```python
+    >>> from transformers import FalconModel, FalconConfig
+
+    >>> # Initializing a small (2-layer) Falcon configuration
+    >>> configuration = FalconConfig(num_hidden_layers=2)
+
+    >>> # Initializing a model from the small configuration
+    >>> model = FalconModel(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "falcon"
+    keys_to_ignore_at_inference = ["past_key_values"]
+
+    def __init__(
+        self,
+        vocab_size=65024,
+        hidden_size=4544,
+        num_hidden_layers=32,
+        num_attention_heads=71,
+        layer_norm_epsilon=1e-5,
+        initializer_range=0.02,
+        use_cache=True,
+        hidden_dropout=0.0,
+        attention_dropout=0.0,
+        num_kv_heads=None,
+        alibi=False,
+        new_decoder_architecture=False,
+        multi_query=True,
+        parallel_attn=True,
+        bias=False,
+        max_position_embeddings=2048,
+        rope_theta=10000.0,
+        rope_scaling=None,
+        bos_token_id=11,
+        eos_token_id=11,
+        **kwargs,
+    ):
+        self.vocab_size = vocab_size
+        # Backward compatibility with n_embed kwarg
+        n_embed = kwargs.pop("n_embed", None)
+        self.hidden_size = hidden_size if n_embed is None else n_embed
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.layer_norm_epsilon = layer_norm_epsilon
+        self.initializer_range = initializer_range
+        self.use_cache = use_cache
+        self.hidden_dropout = hidden_dropout
+        self.attention_dropout = attention_dropout
+
+        self.bos_token_id = bos_token_id
+        self.eos_token_id = eos_token_id
+        self.num_kv_heads = num_attention_heads if num_kv_heads is None else num_kv_heads
+        self.alibi = alibi
+        self.new_decoder_architecture = new_decoder_architecture
+        self.multi_query = multi_query  # Ignored when new_decoder_architecture is True
+        self.parallel_attn = parallel_attn
+        self.bias = bias
+        self.max_position_embeddings = max_position_embeddings
+        self.rope_theta = rope_theta
+        self.rope_scaling = rope_scaling
+        self._rope_scaling_validation()
+
+        super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
+
+    @property
+    def head_dim(self):
+        return self.hidden_size // self.num_attention_heads
+
+    @property
+    def rotary(self):
+        return not self.alibi
+
+    def _rope_scaling_validation(self):
+        """
+        Validate the `rope_scaling` configuration.
+        """
+        if self.rope_scaling is None:
+            return
+
+        if self.alibi:
+            raise ValueError("`rope_scaling` is not supported when `alibi` is `True`.")
+
+        if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
+            raise ValueError(
+                "`rope_scaling` must be a dictionary with two fields, `type` and `factor`, "
+                f"got {self.rope_scaling}"
+            )
+        rope_scaling_type = self.rope_scaling.get("type", None)
+        rope_scaling_factor = self.rope_scaling.get("factor", None)
+        if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]:
+            raise ValueError(
+                f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
+            )
+        if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
+            raise ValueError(f"`rope_scaling`'s factor field must be a float > 1, got {rope_scaling_factor}")
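
`_rope_scaling_validation` above only accepts a two-field dict whose `type` is `linear` or `dynamic` and whose `factor` is a float strictly greater than 1, and it rejects any `rope_scaling` when `alibi` is enabled. A short sketch of values that pass and fail that check:

```python
# Sketch: rope_scaling values accepted and rejected by FalconConfig._rope_scaling_validation.
from transformers import FalconConfig

# Accepted: known strategy name plus a float factor > 1.
ok = FalconConfig(num_hidden_layers=2, rope_scaling={"type": "dynamic", "factor": 2.0})

# Rejected: the factor must be strictly greater than 1.
try:
    FalconConfig(num_hidden_layers=2, rope_scaling={"type": "linear", "factor": 1.0})
except ValueError as err:
    print(err)  # `rope_scaling`'s factor field must be a float > 1, got 1.0
```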
diff --git a/src/transformers/models/falcon/convert_custom_code_checkpoint.py b/src/transformers/models/falcon/convert_custom_code_checkpoint.py
new file mode 100644
index 00000000000000..0da817c3ffa739
--- /dev/null
+++ b/src/transformers/models/falcon/convert_custom_code_checkpoint.py
@@ -0,0 +1,74 @@
+import json
+from argparse import ArgumentParser
+from pathlib import Path
+
+
+"""
+This script converts Falcon custom code checkpoints to modern Falcon checkpoints that use code in the Transformers
+library. After conversion, performance (especially for generation) should improve and the checkpoint can be loaded
+without needing trust_remote_code=True.
+"""
+
+if __name__ == "__main__":
+    parser = ArgumentParser()
+    parser.add_argument(
+        "--checkpoint_dir",
+        type=Path,
+        required=True,
+        help="Directory containing a custom code checkpoint to convert to a modern Falcon checkpoint.",
+    )
+    args = parser.parse_args()
+
+    if not args.checkpoint_dir.is_dir():
+        raise ValueError("--checkpoint_dir argument should be a directory!")
+
+    if (
+        not (args.checkpoint_dir / "configuration_RW.py").is_file()
+        or not (args.checkpoint_dir / "modelling_RW.py").is_file()
+    ):
+        raise ValueError(
+            "The model directory should contain configuration_RW.py and modelling_RW.py files! Are you sure this is a custom code checkpoint?"
+        )
+    (args.checkpoint_dir / "configuration_RW.py").unlink()
+    (args.checkpoint_dir / "modelling_RW.py").unlink()
+
+    config = args.checkpoint_dir / "config.json"
+    text = config.read_text()
+    text = text.replace("RWForCausalLM", "FalconForCausalLM")
+    text = text.replace("RefinedWebModel", "falcon")
+    text = text.replace("RefinedWeb", "falcon")
+    json_config = json.loads(text)
+    del json_config["auto_map"]
+
+    if "n_head" in json_config:
+        json_config["num_attention_heads"] = json_config.pop("n_head")
+    if "n_layer" in json_config:
+        json_config["num_hidden_layers"] = json_config.pop("n_layer")
+    if "n_head_kv" in json_config:
+        json_config["num_kv_heads"] = json_config.pop("n_head_kv")
+        json_config["new_decoder_architecture"] = True
+    else:
+        json_config["new_decoder_architecture"] = False
+    bos_token_id = json_config.get("bos_token_id", 1)
+    eos_token_id = json_config.get("eos_token_id", 2)
+    config.unlink()
+    config.write_text(json.dumps(json_config, indent=2, sort_keys=True))
+
+    tokenizer_config = args.checkpoint_dir / "tokenizer_config.json"
+    if tokenizer_config.is_file():
+        text = tokenizer_config.read_text()
+        json_config = json.loads(text)
+        if json_config["tokenizer_class"] == "PreTrainedTokenizerFast":
+            json_config["model_input_names"] = ["input_ids", "attention_mask"]
+            tokenizer_config.unlink()
+            tokenizer_config.write_text(json.dumps(json_config, indent=2, sort_keys=True))
+
+    generation_config_path = args.checkpoint_dir / "generation_config.json"
+    generation_dict = {
+        "_from_model_config": True,
+        "bos_token_id": bos_token_id,
+        "eos_token_id": eos_token_id,
+        "transformers_version": "4.33.0.dev0",
+    }
+    generation_config_path.write_text(json.dumps(generation_dict, indent=2, sort_keys=True))
+    print("Done! Please double-check that the new checkpoint works as expected.")
diff --git a/src/transformers/models/falcon/modeling_falcon.py b/src/transformers/models/falcon/modeling_falcon.py
new file mode 100644
index 00000000000000..9767b797b00778
--- /dev/null
+++ b/src/transformers/models/falcon/modeling_falcon.py
@@ -0,0 +1,1646 @@
+# coding=utf-8
+# Copyright 2023 the Falcon authors and HuggingFace Inc. team.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""PyTorch Falcon model."""
+
+import math
+import warnings
+from typing import TYPE_CHECKING, Optional, Tuple, Union
+
+import torch
+import torch.utils.checkpoint
+from torch import nn
+from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, LayerNorm, MSELoss
+from torch.nn import functional as F
+
+from ...modeling_attn_mask_utils import (
+    AttentionMaskConverter,
+    _prepare_4d_causal_attention_mask,
+    _prepare_4d_causal_attention_mask_for_sdpa,
+)
+from ...modeling_outputs import (
+    BaseModelOutputWithPastAndCrossAttentions,
+    CausalLMOutputWithCrossAttentions,
+    QuestionAnsweringModelOutput,
+    SequenceClassifierOutputWithPast,
+    TokenClassifierOutput,
+)
+from ...modeling_utils import PreTrainedModel
+from ...pytorch_utils import is_torch_greater_or_equal_than_2_0
+from ...utils import (
+    add_code_sample_docstrings,
+    add_start_docstrings,
+    add_start_docstrings_to_model_forward,
+    is_flash_attn_2_available,
+    is_flash_attn_greater_or_equal_2_10,
+    logging,
+)
+from .configuration_falcon import FalconConfig
+
+
+if TYPE_CHECKING:
+    from ...configuration_utils import PretrainedConfig
+
+if is_flash_attn_2_available():
+    from flash_attn import flash_attn_func, flash_attn_varlen_func
+    from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input  # noqa
+
+logger = logging.get_logger(__name__)
+
+FALCON_PRETRAINED_MODEL_ARCHIVE_LIST = [
+    "tiiuae/falcon-40b",
+    "tiiuae/falcon-40b-instruct",
+    "tiiuae/falcon-7b",
+    "tiiuae/falcon-7b-instruct",
+    "tiiuae/falcon-rw-7b",
+    "tiiuae/falcon-rw-1b",
+]
+_CHECKPOINT_FOR_DOC = "Rocketknight1/falcon-rw-1b"
+_CONFIG_FOR_DOC = "FalconConfig"
+
+
+# NOTE(Hesslow): Unfortunately we did not fuse matmul and bias during training; this means that there's one additional quantization to bfloat16 between the operations.
+# In order not to degrade the quality of our HF-port, we keep these characteristics in the final model.
+class FalconLinear(nn.Linear):
+    def forward(self, input: torch.Tensor) -> torch.Tensor:
+        hidden_states = input @ self.weight.T
+        if self.bias is None:
+            return hidden_states
+        return hidden_states + self.bias
+
+
+# Copied from transformers.models.llama.modeling_llama.rotate_half
+def rotate_half(x):
+    """Rotates half the hidden dims of the input."""
+    x1 = x[..., : x.shape[-1] // 2]
+    x2 = x[..., x.shape[-1] // 2 :]
+    return torch.cat((-x2, x1), dim=-1)
+
+
+# Copied from transformers.models.mistral.modeling_mistral.apply_rotary_pos_emb
+def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
+    """Applies Rotary Position Embedding to the query and key tensors.
+
+    Args:
+        q (`torch.Tensor`): The query tensor.
+        k (`torch.Tensor`): The key tensor.
+        cos (`torch.Tensor`): The cosine part of the rotary embedding.
+        sin (`torch.Tensor`): The sine part of the rotary embedding.
+        position_ids (`torch.Tensor`):
+            The position indices of the tokens corresponding to the query and key tensors. For example, this can be
+            used to pass offsetted position ids when working with a KV-cache.
+        unsqueeze_dim (`int`, *optional*, defaults to 1):
+            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
+            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
+            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
+            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
+            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
+            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
+    Returns:
+        `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
+    """
+    cos = cos[position_ids].unsqueeze(unsqueeze_dim)
+    sin = sin[position_ids].unsqueeze(unsqueeze_dim)
+    q_embed = (q * cos) + (rotate_half(q) * sin)
+    k_embed = (k * cos) + (rotate_half(k) * sin)
+    return q_embed, k_embed
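+
+# Shape sketch for apply_rotary_pos_emb (illustrative, matching the docstring above): with q and k of shape
+# [batch_size, heads, seq_len, head_dim] and cos/sin caches of shape [max_seq_len, head_dim], indexing with
+# position_ids of shape [batch_size, seq_len] gives [batch_size, seq_len, head_dim]; unsqueeze_dim=1 then
+# yields [batch_size, 1, seq_len, head_dim], which broadcasts against q and k.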
+
+
+# Copied from transformers.models.llama.modeling_llama._get_unpad_data
+def _get_unpad_data(attention_mask):
+    seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
+    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
+    max_seqlen_in_batch = seqlens_in_batch.max().item()
+    cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
+    return (
+        indices,
+        cu_seqlens,
+        max_seqlen_in_batch,
+    )
+
+
+# Copied from transformers.models.mistral.modeling_mistral.MistralRotaryEmbedding with Mistral->Falcon
+class FalconRotaryEmbedding(nn.Module):
+    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
+        super().__init__()
+
+        self.dim = dim
+        self.max_position_embeddings = max_position_embeddings
+        self.base = base
+        inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+
+        # Build here to make `torch.jit.trace` work.
+        self._set_cos_sin_cache(
+            seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
+        )
+
+    def _set_cos_sin_cache(self, seq_len, device, dtype):
+        self.max_seq_len_cached = seq_len
+        t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
+
+        freqs = torch.outer(t, self.inv_freq)
+        # Different from the paper: a different permutation is used, but it yields the same calculation
+        emb = torch.cat((freqs, freqs), dim=-1)
+        self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
+        self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
+
+    def forward(self, x, seq_len=None):
+        # x: [bs, num_attention_heads, seq_len, head_size]
+        if seq_len > self.max_seq_len_cached:
+            self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
+
+        return (
+            self.cos_cached[:seq_len].to(dtype=x.dtype),
+            self.sin_cached[:seq_len].to(dtype=x.dtype),
+        )
+
+
+# Copied from transformers.models.llama.modeling_llama.LlamaLinearScalingRotaryEmbedding with Llama->Falcon
+class FalconLinearScalingRotaryEmbedding(FalconRotaryEmbedding):
+    """FalconRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
+
+    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
+        self.scaling_factor = scaling_factor
+        super().__init__(dim, max_position_embeddings, base, device)
+
+    def _set_cos_sin_cache(self, seq_len, device, dtype):
+        self.max_seq_len_cached = seq_len
+        t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
+        t = t / self.scaling_factor
+
+        freqs = torch.outer(t, self.inv_freq)
+        # Different from the paper: a different permutation is used, but it yields the same calculation
+        emb = torch.cat((freqs, freqs), dim=-1)
+        self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
+        self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
+
+
+# Copied from transformers.models.llama.modeling_llama.LlamaDynamicNTKScalingRotaryEmbedding with Llama->Falcon
+class FalconDynamicNTKScalingRotaryEmbedding(FalconRotaryEmbedding):
+    """FalconRotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla"""
+
+    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
+        self.scaling_factor = scaling_factor
+        super().__init__(dim, max_position_embeddings, base, device)
+
+    def _set_cos_sin_cache(self, seq_len, device, dtype):
+        self.max_seq_len_cached = seq_len
+
+        if seq_len > self.max_position_embeddings:
+            base = self.base * (
+                (self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)
+            ) ** (self.dim / (self.dim - 2))
+            inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))
+            self.register_buffer("inv_freq", inv_freq, persistent=False)
+
+        t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
+
+        freqs = torch.outer(t, self.inv_freq)
+        # Different from the paper: a different permutation is used, but it yields the same calculation
+        emb = torch.cat((freqs, freqs), dim=-1)
+        self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
+        self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
+
+
+def build_alibi_tensor(attention_mask: torch.Tensor, num_heads: int, dtype: torch.dtype) -> torch.Tensor:
+    batch_size, seq_length = attention_mask.shape
+    closest_power_of_2 = 2 ** math.floor(math.log2(num_heads))
+    base = torch.tensor(
+        2 ** (-(2 ** -(math.log2(closest_power_of_2) - 3))), device=attention_mask.device, dtype=torch.float32
+    )
+    powers = torch.arange(1, 1 + closest_power_of_2, device=attention_mask.device, dtype=torch.int32)
+    slopes = torch.pow(base, powers)
+
+    if closest_power_of_2 != num_heads:
+        extra_base = torch.tensor(
+            2 ** (-(2 ** -(math.log2(2 * closest_power_of_2) - 3))), device=attention_mask.device, dtype=torch.float32
+        )
+        num_remaining_heads = min(closest_power_of_2, num_heads - closest_power_of_2)
+        extra_powers = torch.arange(1, 1 + 2 * num_remaining_heads, 2, device=attention_mask.device, dtype=torch.int32)
+        slopes = torch.cat([slopes, torch.pow(extra_base, extra_powers)], dim=0)
+
+    # Note: alibi will be added to the attention bias that is applied to the query-key product of attention
+    # => therefore alibi will have to be of shape (batch_size, num_heads, query_length, key_length)
+    # => here we set (batch_size=1, num_heads=num_heads, query_length=1, key_length=max_length)
+    # => the query_length dimension will then be broadcasted correctly
+    # This is more or less identical to T5's relative position bias:
+    # https://github.com/huggingface/transformers/blob/f681437203baa7671de3174b0fa583c349d9d5e1/src/transformers/models/t5/modeling_t5.py#L527
+    arange_tensor = ((attention_mask.cumsum(dim=-1) - 1) * attention_mask)[:, None, :]
+    alibi = slopes[..., None].bfloat16() * arange_tensor
+    return alibi.reshape(batch_size * num_heads, 1, seq_length).to(dtype)
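+
+# Illustrative example (assumed numbers): for num_heads=4 the slopes form the geometric series
+# [2**-2, 2**-4, 2**-6, 2**-8], and the returned tensor has shape (batch_size * num_heads, 1, seq_length),
+# ready to be broadcast over the query_length dimension as described above.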
+
+
+# Copied from transformers.models.bloom.modeling_bloom.dropout_add
+def dropout_add(x: torch.Tensor, residual: torch.Tensor, prob: float, training: bool) -> torch.Tensor:
+    """
+    Dropout add function
+
+    Args:
+        x (`torch.tensor`, *required*):
+            input tensor
+        residual (`torch.tensor`, *required*):
+            residual tensor
+        prob (`float`, *required*):
+            dropout probability
+        training (`bool`, *required*):
+            training mode
+    """
+    out = F.dropout(x, p=prob, training=training)
+    out = residual + out
+    return out
+
+
+class FalconAttention(nn.Module):
+    def __init__(self, config: FalconConfig):
+        super().__init__()
+
+        self.config = config
+        self.hidden_size = config.hidden_size
+        self.num_heads = config.num_attention_heads
+        self.head_dim = self.hidden_size // self.num_heads
+        self.split_size = self.hidden_size
+        self.hidden_dropout = config.hidden_dropout
+        self.max_position_embeddings = config.max_position_embeddings
+        self.rope_theta = config.rope_theta
+        self.is_causal = True
+        self._use_sdpa = config._attn_implementation == "sdpa"
+
+        if self.head_dim * self.num_heads != self.hidden_size:
+            raise ValueError(
+                f"`hidden_size` must be divisible by num_heads (got `hidden_size`: {self.hidden_size} and `num_heads`:"
+                f" {self.num_heads})."
+            )
+
+        if config.rotary:
+            self._init_rope()
+
+        # Layer-wise attention scaling
+        self.inv_norm_factor = 1.0 / math.sqrt(self.head_dim)
+        self.beta = self.inv_norm_factor
+        if config.new_decoder_architecture:
+            qkv_out_dim = (config.num_kv_heads * 2 + config.num_attention_heads) * self.head_dim
+        elif config.multi_query:
+            qkv_out_dim = self.hidden_size + 2 * self.head_dim
+        else:
+            qkv_out_dim = 3 * self.hidden_size
+        self.query_key_value = FalconLinear(self.hidden_size, qkv_out_dim, bias=config.bias)
+        self.new_decoder_architecture = config.new_decoder_architecture
+        self.multi_query = config.multi_query
+        self.dense = FalconLinear(self.hidden_size, self.hidden_size, bias=config.bias)
+        self.attention_dropout = nn.Dropout(config.attention_dropout)
+        self.num_kv_heads = config.num_kv_heads if (self.new_decoder_architecture or not self.multi_query) else 1
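+
+    # Sizing sketch with assumed numbers (for illustration only): with hidden_size=8192, 128 attention heads
+    # (head_dim=64) and 8 KV heads, the fused QKV projection above maps 8192 -> (8 * 2 + 128) * 64 = 9216 when
+    # new_decoder_architecture=True, 8192 -> 8192 + 2 * 64 = 8320 when multi_query=True, and
+    # 8192 -> 3 * 8192 = 24576 in the classic multi-head case.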
+
+    # Copied from transformers.models.llama.modeling_llama.LlamaAttention._init_rope with Llama->Falcon
+    def _init_rope(self):
+        if self.config.rope_scaling is None:
+            self.rotary_emb = FalconRotaryEmbedding(
+                self.head_dim,
+                max_position_embeddings=self.max_position_embeddings,
+                base=self.rope_theta,
+            )
+        else:
+            scaling_type = self.config.rope_scaling["type"]
+            scaling_factor = self.config.rope_scaling["factor"]
+            if scaling_type == "linear":
+                self.rotary_emb = FalconLinearScalingRotaryEmbedding(
+                    self.head_dim,
+                    max_position_embeddings=self.max_position_embeddings,
+                    scaling_factor=scaling_factor,
+                    base=self.rope_theta,
+                )
+            elif scaling_type == "dynamic":
+                self.rotary_emb = FalconDynamicNTKScalingRotaryEmbedding(
+                    self.head_dim,
+                    max_position_embeddings=self.max_position_embeddings,
+                    scaling_factor=scaling_factor,
+                    base=self.rope_theta,
+                )
+            else:
+                raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
+
+    def _split_heads(self, fused_qkv: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+        """
+        Split the last dimension into (num_heads, head_dim); the results share the same memory storage as `fused_qkv`
+
+        Args:
+            fused_qkv (`torch.tensor`, *required*): [batch_size, seq_length, num_heads * 3 * head_dim]
+
+        Returns:
+            query: [batch_size, seq_length, num_heads, head_dim]
+            key: [batch_size, seq_length, num_heads, head_dim]
+            value: [batch_size, seq_length, num_heads, head_dim]
+        """
+        if self.new_decoder_architecture:
+            batch, seq_len, _ = fused_qkv.shape
+            qkv = fused_qkv.view(batch, seq_len, -1, self.num_heads // self.num_kv_heads + 2, self.head_dim)
+            query = qkv[:, :, :, :-2]
+            key = qkv[:, :, :, [-2]]
+            value = qkv[:, :, :, [-1]]
+            key = torch.broadcast_to(key, query.shape)
+            value = torch.broadcast_to(value, query.shape)
+
+            query, key, value = [x.flatten(2, 3) for x in (query, key, value)]
+            return query, key, value
+        elif not self.multi_query:
+            batch_size, seq_length, three_times_hidden_size = fused_qkv.shape
+            fused_qkv = fused_qkv.view(batch_size, seq_length, self.num_heads, 3, self.head_dim)
+            return fused_qkv[..., 0, :], fused_qkv[..., 1, :], fused_qkv[..., 2, :]
+        else:
+            batch_size, seq_length, three_times_hidden_size = fused_qkv.shape
+            fused_qkv = fused_qkv.view(batch_size, seq_length, self.num_heads + 2, self.head_dim)
+            return fused_qkv[..., :-2, :], fused_qkv[..., [-2], :], fused_qkv[..., [-1], :]
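+
+    # Shape note (descriptive, derived from the branches above): with the new decoder architecture the grouped
+    # key/value heads are broadcast to the query's shape, so all three outputs are
+    # [batch_size, seq_length, num_heads, head_dim]; in the multi-query case key and value keep a single head,
+    # i.e. [batch_size, seq_length, 1, head_dim].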
+
+    # Copied from transformers.models.bloom.modeling_bloom.BloomAttention._merge_heads
+    def _merge_heads(self, x: torch.Tensor) -> torch.Tensor:
+        """
+        Merge heads together over the last dimension
+
+        Args:
+            x (`torch.tensor`, *required*): [batch_size * num_heads, seq_length, head_dim]
+
+        Returns:
+            torch.tensor: [batch_size, seq_length, num_heads * head_dim]
+        """
+        # What we want to achieve is:
+        # batch_size * num_heads, seq_length, head_dim -> batch_size, seq_length, num_heads * head_dim
+        batch_size_and_num_heads, seq_length, _ = x.shape
+        batch_size = batch_size_and_num_heads // self.num_heads
+
+        # First view to decompose the batch size
+        # batch_size * num_heads, seq_length, head_dim -> batch_size, num_heads, seq_length, head_dim
+        x = x.view(batch_size, self.num_heads, seq_length, self.head_dim)
+
+        # batch_size, num_heads, seq_length, head_dim -> batch_size, seq_length, num_heads, head_dim
+        x = x.permute(0, 2, 1, 3)
+
+        # batch_size, seq_length, num_heads, head_dim -> batch_size, seq_length, num_heads * head_dim
+        return x.reshape(batch_size, seq_length, self.num_heads * self.head_dim)
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        alibi: Optional[torch.Tensor],
+        attention_mask: torch.Tensor,
+        position_ids: Optional[torch.LongTensor] = None,
+        layer_past: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        use_cache: bool = False,
+        output_attentions: bool = False,
+        **kwargs,
+    ):
+        if "padding_mask" in kwargs:
+            warnings.warn(
+                "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`"
+            )
+
+        fused_qkv = self.query_key_value(hidden_states)  # [batch_size, seq_length, 3 x hidden_size]
+        num_kv_heads = self.num_heads if self.new_decoder_architecture else self.num_kv_heads
+        # 3 x [batch_size, seq_length, num_heads, head_dim]
+        (query_layer, key_layer, value_layer) = self._split_heads(fused_qkv)
+
+        batch_size, query_length, _, _ = query_layer.shape
+
+        query_layer = query_layer.transpose(1, 2).reshape(batch_size, self.num_heads, query_length, self.head_dim)
+        key_layer = key_layer.transpose(1, 2).reshape(batch_size, num_kv_heads, query_length, self.head_dim)
+        value_layer = value_layer.transpose(1, 2).reshape(batch_size, num_kv_heads, query_length, self.head_dim)
+
+        kv_seq_len = key_layer.shape[-2]
+        if layer_past is not None:
+            kv_seq_len += layer_past[0].shape[-2]
+        if alibi is None:
+            cos, sin = self.rotary_emb(value_layer, seq_len=kv_seq_len)
+            query_layer, key_layer = apply_rotary_pos_emb(query_layer, key_layer, cos, sin, position_ids)
+
+        if layer_past is not None:
+            past_key, past_value = layer_past
+            # concatenate along seq_length dimension:
+            #  - key: [batch_size, self.num_heads, kv_length, head_dim]
+            #  - value: [batch_size, self.num_heads, kv_length, head_dim]
+            key_layer = torch.cat((past_key, key_layer), dim=-2)
+            value_layer = torch.cat((past_value, value_layer), dim=-2)
+
+        kv_length = key_layer.shape[-2]
+        if use_cache:
+            present = (key_layer, value_layer)
+        else:
+            present = None
+
+        # SDPA with the memory-efficient backend is currently (torch==2.1.2) bugged when using non-contiguous inputs together with a custom attn_mask,
+        # Reference: https://github.com/pytorch/pytorch/issues/112577.
+        if query_layer.device.type == "cuda" and attention_mask is not None:
+            query_layer = query_layer.contiguous()
+            key_layer = key_layer.contiguous()
+            value_layer = value_layer.contiguous()
+
+        if alibi is None:
+            if self._use_sdpa and not output_attentions:
+                attn_output = F.scaled_dot_product_attention(
+                    query_layer,
+                    key_layer,
+                    value_layer,
+                    attention_mask,
+                    0.0,
+                    # The query_length > 1 is necessary to match with AttentionMaskConverter.to_causal_4d that does not create a causal mask in case query_length == 1.
+                    is_causal=self.is_causal and attention_mask is None and query_length > 1,
+                )
+                attention_scores = None
+            else:
+                attention_scores = query_layer @ key_layer.transpose(-1, -2)
+                attention_scores /= math.sqrt(self.head_dim)
+
+                attention_scores = F.softmax(attention_scores + attention_mask, dim=-1, dtype=hidden_states.dtype)
+                # It is unclear why neither dropout nor head_mask is applied here (while it is with alibi).
+                attn_output = attention_scores @ value_layer
+
+            attn_output = attn_output.view(batch_size, self.num_heads, query_length, self.head_dim)
+            attn_output = attn_output.permute(0, 2, 1, 3)
+            attn_output = attn_output.reshape(batch_size, query_length, self.num_heads * self.head_dim)
+
+            attn_output = self.dense(attn_output)
+
+            if output_attentions:
+                return attn_output, present, attention_scores
+            else:
+                return attn_output, present
+
+        else:
+            if self._use_sdpa and not output_attentions and head_mask is None:
+                attn_output = F.scaled_dot_product_attention(
+                    query_layer,
+                    key_layer,
+                    value_layer,
+                    attn_mask=attention_mask,
+                    dropout_p=self.attention_dropout.p if self.training else 0.0,
+                    is_causal=self.is_causal and attention_mask is None and query_length > 1,
+                )
+                attn_output = attn_output.transpose(1, 2)
+                attn_output = attn_output.reshape(batch_size, query_length, self.num_heads * self.head_dim)
+
+                attn_output = self.dense(attn_output)
+            else:
+                matmul_result = query_layer @ key_layer.transpose(-1, -2)
+
+                # change view to [batch_size, num_heads, q_length, kv_length]
+                attention_scores = matmul_result.view(batch_size, self.num_heads, query_length, kv_length)
+
+                # cast attention scores to fp32, compute scaled softmax and cast back to initial dtype - [batch_size, num_heads, q_length, kv_length]
+                input_dtype = attention_scores.dtype
+                # `float16` has a minimum value of -65504.0, whereas `bfloat16` and `float32` have a minimum value of `-3.4e+38`
+                if input_dtype == torch.float16 or input_dtype == torch.bfloat16:
+                    attention_scores = attention_scores.to(torch.float32)
+
+                attention_logits = attention_scores + alibi.view(batch_size, self.num_heads, 1, -1)
+                attention_logits *= self.inv_norm_factor
+                attention_probs = F.softmax(attention_logits + attention_mask, dim=-1, dtype=hidden_states.dtype)
+                # [batch_size, num_heads, q_length, kv_length]
+                attention_probs = self.attention_dropout(attention_probs)
+
+                if head_mask is not None:
+                    attention_probs = attention_probs * head_mask
+
+                # change view [batch_size, num_heads, q_length, kv_length]
+                attention_probs_reshaped = attention_probs.view(batch_size, self.num_heads, query_length, kv_length)
+
+                # matmul: [batch_size * num_heads, q_length, head_dim]
+                attn_output = (attention_probs_reshaped @ value_layer).flatten(0, 1)
+
+                # change view [batch_size, q_length, num_heads * head_dim]
+                attn_output = self._merge_heads(attn_output)
+
+                attn_output = self.dense(attn_output)
+
+            if output_attentions:
+                return attn_output, present, attention_probs
+            else:
+                return attn_output, present
+
+
+class FalconFlashAttention2(FalconAttention):
+    """
+    Falcon flash attention module. This module inherits from `FalconAttention`, as the weights of the module stay
+    untouched. The only required change is in the forward pass, where it needs to correctly call the public API of
+    Flash Attention and deal with padding tokens in case the input contains any of them.
+    """
+
+    # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2.__init__
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+
+        # TODO: Should be removed once Flash Attention for RoCm is bumped to 2.1.
+        # flash_attn<2.1 generates a top-left aligned causal mask, while what is needed here is bottom-right alignment, which was made the default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
+        # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
+        self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        alibi: Optional[torch.Tensor],
+        attention_mask: torch.Tensor,
+        position_ids: Optional[torch.LongTensor] = None,
+        layer_past: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        use_cache: bool = False,
+        output_attentions: bool = False,
+        **kwargs,
+    ):
+        if "padding_mask" in kwargs:
+            warnings.warn(
+                "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`"
+            )
+
+            # overwrite attention_mask with padding_mask
+            attention_mask = kwargs.pop("padding_mask")
+
+        fused_qkv = self.query_key_value(hidden_states)  # [batch_size, seq_length, 3 x hidden_size]
+        num_kv_heads = self.num_heads if self.new_decoder_architecture else self.num_kv_heads
+        # 3 x [batch_size, seq_length, num_heads, head_dim]
+        (query_layer, key_layer, value_layer) = self._split_heads(fused_qkv)
+
+        batch_size, query_length, _, _ = query_layer.shape
+
+        query_layer = query_layer.transpose(1, 2).reshape(batch_size, self.num_heads, query_length, self.head_dim)
+        key_layer = key_layer.transpose(1, 2).reshape(batch_size, num_kv_heads, query_length, self.head_dim)
+        value_layer = value_layer.transpose(1, 2).reshape(batch_size, num_kv_heads, query_length, self.head_dim)
+
+        kv_seq_len = key_layer.shape[-2]
+        if layer_past is not None:
+            kv_seq_len += layer_past[0].shape[-2]
+        if alibi is None:
+            cos, sin = self.rotary_emb(value_layer, seq_len=kv_seq_len)
+            query_layer, key_layer = apply_rotary_pos_emb(query_layer, key_layer, cos, sin, position_ids)
+
+        if layer_past is not None and use_cache:
+            past_key, past_value = layer_past
+            # concatenate along seq_length dimension:
+            #  - key: [batch_size, self.num_heads, kv_length, head_dim]
+            #  - value: [batch_size, self.num_heads, kv_length, head_dim]
+            key_layer = torch.cat((past_key, key_layer), dim=-2)
+            value_layer = torch.cat((past_value, value_layer), dim=-2)
+
+        past_key_value = (key_layer, value_layer) if use_cache else None
+
+        # TODO: These transposes are quite inefficient, but Flash Attention requires the layout [batch_size, sequence_length, num_heads, head_dim]. We would need to refactor the KV cache
+        # to be able to avoid many of these transpose/reshape/view.
+        query_layer = query_layer.transpose(1, 2)
+        key_layer = key_layer.transpose(1, 2)
+        value_layer = value_layer.transpose(1, 2)
+
+        if alibi is not None:
+            raise ValueError("`alibi` is not supported when `use_flash_attn` is True")
+
+        attn_dropout = self.config.attention_dropout if self.training else 0.0
+
+        # In PEFT, we usually cast the layer norms to float32 for training stability reasons;
+        # therefore, the input hidden states get silently cast to float32. Hence, we need to
+        # cast them back to float16 just to be sure everything works as expected.
+        input_dtype = query_layer.dtype
+        if input_dtype == torch.float32:
+            if torch.is_autocast_enabled():
+                target_dtype = torch.get_autocast_gpu_dtype()
+            # Handle the case where the model is quantized
+            elif hasattr(self.config, "_pre_quantization_dtype"):
+                target_dtype = self.config._pre_quantization_dtype
+            else:
+                target_dtype = self.query_key_value.weight.dtype
+
+            logger.warning_once(
+                f"The input hidden states seems to be silently casted in float32, this might be related to"
+                f" the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in"
+                f" {target_dtype}."
+            )
+
+            query_layer = query_layer.to(target_dtype)
+            key_layer = key_layer.to(target_dtype)
+            value_layer = value_layer.to(target_dtype)
+
+        attn_output = self._flash_attention_forward(
+            query_layer, key_layer, value_layer, attention_mask, query_length, dropout=attn_dropout
+        )
+
+        attn_weights = attn_output.reshape(batch_size, query_length, self.num_heads * self.head_dim)
+        attn_output = self.dense(attn_weights)
+
+        if not output_attentions:
+            attn_weights = None
+
+        return attn_output, past_key_value, attn_weights
+
+    # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2._flash_attention_forward
+    def _flash_attention_forward(
+        self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None
+    ):
+        """
+        Calls the forward method of Flash Attention: if the input hidden states contain at least one padding token,
+        first unpad the input, then compute the attention scores, and finally pad the attention output back.
+
+        Args:
+            query_states (`torch.Tensor`):
+                Input query states to be passed to Flash Attention API
+            key_states (`torch.Tensor`):
+                Input key states to be passed to Flash Attention API
+            value_states (`torch.Tensor`):
+                Input value states to be passed to Flash Attention API
+            attention_mask (`torch.Tensor`):
+                The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
+                position of padding tokens and 1 for the position of non-padding tokens.
+            dropout (`float`, *optional*):
+                Attention dropout probability
+            softmax_scale (`float`, *optional*):
+                The scaling of QK^T before applying softmax. Defaults to 1 / sqrt(head_dim)
+        """
+        if not self._flash_attn_uses_top_left_mask:
+            causal = self.is_causal
+        else:
+            # TODO: Remove the `query_length != 1` check once Flash Attention for RoCm is bumped to 2.1. For details, please see the comment in LlamaFlashAttention2 __init__.
+            causal = self.is_causal and query_length != 1
+
+        # Contains at least one padding token in the sequence
+        if attention_mask is not None:
+            batch_size = query_states.shape[0]
+            query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
+                query_states, key_states, value_states, attention_mask, query_length
+            )
+
+            cu_seqlens_q, cu_seqlens_k = cu_seq_lens
+            max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
+
+            attn_output_unpad = flash_attn_varlen_func(
+                query_states,
+                key_states,
+                value_states,
+                cu_seqlens_q=cu_seqlens_q,
+                cu_seqlens_k=cu_seqlens_k,
+                max_seqlen_q=max_seqlen_in_batch_q,
+                max_seqlen_k=max_seqlen_in_batch_k,
+                dropout_p=dropout,
+                softmax_scale=softmax_scale,
+                causal=causal,
+            )
+
+            attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
+        else:
+            attn_output = flash_attn_func(
+                query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=causal
+            )
+
+        return attn_output
+
+    # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2._upad_input
+    def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
+        indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
+        batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
+
+        key_layer = index_first_axis(
+            key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
+        )
+        value_layer = index_first_axis(
+            value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
+        )
+        if query_length == kv_seq_len:
+            query_layer = index_first_axis(
+                query_layer.reshape(batch_size * kv_seq_len, self.num_heads, head_dim), indices_k
+            )
+            cu_seqlens_q = cu_seqlens_k
+            max_seqlen_in_batch_q = max_seqlen_in_batch_k
+            indices_q = indices_k
+        elif query_length == 1:
+            max_seqlen_in_batch_q = 1
+            cu_seqlens_q = torch.arange(
+                batch_size + 1, dtype=torch.int32, device=query_layer.device
+            )  # There is a memcpy here, that is very bad.
+            indices_q = cu_seqlens_q[:-1]
+            query_layer = query_layer.squeeze(1)
+        else:
+            # The -q_len: slice assumes left padding.
+            attention_mask = attention_mask[:, -query_length:]
+            query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
+
+        return (
+            query_layer,
+            key_layer,
+            value_layer,
+            indices_q,
+            (cu_seqlens_q, cu_seqlens_k),
+            (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
+        )
+
+
+class FalconMLP(nn.Module):
+    def __init__(self, config: FalconConfig):
+        super().__init__()
+        hidden_size = config.hidden_size
+
+        self.dense_h_to_4h = FalconLinear(hidden_size, 4 * hidden_size, bias=config.bias)
+        self.act = nn.GELU()
+        self.dense_4h_to_h = FalconLinear(4 * hidden_size, hidden_size, bias=config.bias)
+        self.hidden_dropout = config.hidden_dropout
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        x = self.act(self.dense_h_to_4h(x))
+        x = self.dense_4h_to_h(x)
+        return x
+
+
+FALCON_ATTENTION_CLASSES = {
+    "eager": FalconAttention,
+    "sdpa": FalconAttention,  # FalconAttention originally implemented both a forward with & without SDPA
+    "flash_attention_2": FalconFlashAttention2,
+}
+
+
+class FalconDecoderLayer(nn.Module):
+    def __init__(self, config: FalconConfig):
+        super().__init__()
+        hidden_size = config.hidden_size
+        self.num_heads = config.num_attention_heads
+
+        self.self_attention = FALCON_ATTENTION_CLASSES[config._attn_implementation](config)
+        self.mlp = FalconMLP(config)
+        self.hidden_dropout = config.hidden_dropout
+        self.config = config
+
+        if config.new_decoder_architecture:
+            # The layer norm before self-attention
+            self.ln_attn = LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
+            # The layer norm before the MLP
+            self.ln_mlp = LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
+        else:
+            self.input_layernorm = LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
+            if not config.parallel_attn:
+                self.post_attention_layernorm = LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        alibi: Optional[torch.Tensor],
+        attention_mask: torch.Tensor,
+        position_ids: Optional[torch.LongTensor] = None,
+        layer_past: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        use_cache: bool = False,
+        output_attentions: bool = False,
+        **kwargs,
+    ):
+        if "padding_mask" in kwargs:
+            warnings.warn(
+                "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`"
+            )
+
+        residual = hidden_states
+
+        if self.config.new_decoder_architecture:
+            attention_layernorm_out = self.ln_attn(hidden_states)
+            mlp_layernorm_out = self.ln_mlp(hidden_states)
+        else:
+            attention_layernorm_out = self.input_layernorm(hidden_states)
+
+        # Self attention.
+        attn_outputs = self.self_attention(
+            attention_layernorm_out,
+            layer_past=layer_past,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            alibi=alibi,
+            head_mask=head_mask,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            **kwargs,
+        )
+
+        attention_output = attn_outputs[0]
+
+        if not self.config.new_decoder_architecture:
+            if self.config.parallel_attn:
+                mlp_layernorm_out = attention_layernorm_out
+            else:
+                residual = dropout_add(
+                    attention_output, residual, self.config.attention_dropout, training=self.training
+                )
+                mlp_layernorm_out = self.post_attention_layernorm(residual)
+
+        outputs = attn_outputs[1:]
+
+        # MLP.
+        mlp_output = self.mlp(mlp_layernorm_out)
+
+        if self.config.new_decoder_architecture or self.config.parallel_attn:
+            mlp_output += attention_output
+
+        output = dropout_add(mlp_output, residual, self.config.hidden_dropout, training=self.training)
+
+        if use_cache:
+            outputs = (output,) + outputs
+        else:
+            outputs = (output,) + outputs[1:]
+
+        return outputs  # hidden_states, present, attentions
+
+
+FALCON_START_DOCSTRING = r"""
+
+    This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
+    library implements for all its models (such as downloading or saving, resizing the input embeddings, etc.)
+
+    This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
+    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
+    and behavior.
+
+    Parameters:
+        config ([`FalconConfig`]): Model configuration class with all the parameters of the model.
+            Initializing with a config file does not load the weights associated with the model, only the
+            configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+FALCON_INPUTS_DOCSTRING = r"""
+    Args:
+        input_ids (`torch.LongTensor` of shape `(batch_size, input_ids_length)`):
+            `input_ids_length` = `sequence_length` if `past_key_values` is `None` else `past_key_values[0][0].shape[2]`
+            (`sequence_length` of input past key value states). Indices of input sequence tokens in the vocabulary.
+
+            If `past_key_values` is used, only `input_ids` that do not have their past calculated should be passed as
+            `input_ids`.
+
+            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
+            [`PreTrainedTokenizer.__call__`] for details.
+
+            [What are input IDs?](../glossary#input-ids)
+        past_key_values (`Tuple[Tuple[torch.Tensor]]` of length `config.num_hidden_layers`):
+            Contains precomputed hidden-states (key and values in the attention blocks) as computed by the model (see
+            `past_key_values` output below). Can be used to speed up sequential decoding. The `input_ids` which have
+            their past given to this model should not be passed as `input_ids` as they have already been computed.
+
+            Each element of `past_key_values` is a tuple (past_key, past_value):
+            - past_key: [batch_size * num_heads, head_dim, kv_length]
+            - past_value: [batch_size * num_heads, kv_length, head_dim]
+        attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
+
+            - 1 for tokens that are **not masked**,
+            - 0 for tokens that are **masked**.
+
+            [What are attention masks?](../glossary#attention-mask)
+        position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
+            config.n_positions - 1]`.
+
+            [What are position IDs?](../glossary#position-ids)
+        head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
+            Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
+
+            - 1 indicates the head is **not masked**,
+            - 0 indicates the head is **masked**.
+
+        inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
+            Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
+            is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
+            model's internal embedding lookup matrix.
+
+            If `past_key_values` is used, optionally only the last `inputs_embeds` have to be input (see
+            `past_key_values`).
+        use_cache (`bool`, *optional*):
+            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
+            `past_key_values`).
+        output_attentions (`bool`, *optional*):
+            Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
+            tensors for more detail.
+        output_hidden_states (`bool`, *optional*):
+            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+            more detail.
+        return_dict (`bool`, *optional*):
+            Whether or not to return a [`~file_utils.ModelOutput`] instead of a plain tuple.
+"""
+
+
+class FalconPreTrainedModel(PreTrainedModel):
+    """
+    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
+    models.
+    """
+
+    config_class = FalconConfig
+    base_model_prefix = "transformer"
+    supports_gradient_checkpointing = True
+    _no_split_modules = ["FalconDecoderLayer"]
+    _supports_flash_attn_2 = True
+    _supports_sdpa = True
+
+    def __init__(self, *inputs, **kwargs):
+        super().__init__(*inputs, **kwargs)
+
+    def _init_weights(self, module: nn.Module):
+        """Initialize the weights."""
+        if isinstance(module, nn.Linear) or isinstance(module, FalconLinear):
+            # Slightly different from the TF version which uses truncated_normal for initialization
+            # cf https://github.com/pytorch/pytorch/pull/5617
+            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.Embedding):
+            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+            if module.padding_idx is not None:
+                module.weight.data[module.padding_idx].zero_()
+        elif isinstance(module, LayerNorm):
+            module.bias.data.zero_()
+            module.weight.data.fill_(1.0)
+
+    # Adapted from transformers.modeling_utils.PreTrainedModel._check_and_enable_sdpa
+    @classmethod
+    def _check_and_enable_sdpa(cls, config, hard_check_only: bool = False) -> "PretrainedConfig":
+        # NOTE: Falcon has supported SDPA since PyTorch 2.0. We keep it that way for backward compatibility (automatically using SDPA for torch>=2.0).
+        if hard_check_only:
+            if not is_torch_greater_or_equal_than_2_0:
+                raise ImportError("PyTorch SDPA requirements in Transformers are not met. Please install torch>=2.0.")
+
+        if not is_torch_greater_or_equal_than_2_0:
+            return config
+
+        _is_bettertransformer = getattr(cls, "use_bettertransformer", False)
+        if _is_bettertransformer:
+            return config
+
+        if not hard_check_only:
+            config._attn_implementation = "sdpa"
+        return config
+
+
+@add_start_docstrings(
+    "The bare Falcon Model transformer outputting raw hidden-states without any specific head on top.",
+    FALCON_START_DOCSTRING,
+)
+class FalconModel(FalconPreTrainedModel):
+    def __init__(self, config: FalconConfig):
+        super().__init__(config)
+
+        self.embed_dim = config.hidden_size
+        self.num_heads = config.num_attention_heads
+        self.use_alibi = config.alibi
+
+        # Embedding + LN Embedding
+        self.word_embeddings = nn.Embedding(config.vocab_size, self.embed_dim)
+
+        # Transformer blocks
+        self.h = nn.ModuleList([FalconDecoderLayer(config) for _ in range(config.num_hidden_layers)])
+        self._use_flash_attention_2 = config._attn_implementation == "flash_attention_2"
+        self._use_sdpa = config._attn_implementation == "sdpa"
+
+        # Final Layer Norm
+        self.ln_f = LayerNorm(self.embed_dim, eps=config.layer_norm_epsilon)
+
+        self.gradient_checkpointing = False
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.word_embeddings
+
+    def set_input_embeddings(self, new_embeddings: torch.Tensor):
+        self.word_embeddings = new_embeddings
+
+    @add_start_docstrings_to_model_forward(FALCON_INPUTS_DOCSTRING)
+    @add_code_sample_docstrings(
+        checkpoint=_CHECKPOINT_FOR_DOC,
+        output_type=BaseModelOutputWithPastAndCrossAttentions,
+        config_class=_CONFIG_FOR_DOC,
+    )
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        head_mask: Optional[torch.LongTensor] = None,
+        inputs_embeds: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple[torch.Tensor, ...], BaseModelOutputWithPastAndCrossAttentions]:
+        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        use_cache = use_cache if use_cache is not None else self.config.use_cache
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        if input_ids is not None and inputs_embeds is not None:
+            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
+        elif input_ids is not None:
+            batch_size, seq_length = input_ids.shape
+        elif inputs_embeds is not None:
+            batch_size, seq_length, _ = inputs_embeds.shape
+        else:
+            raise ValueError("You have to specify either input_ids or inputs_embeds")
+
+        if past_key_values is None:
+            past_key_values = tuple([None] * len(self.h))
+
+        if inputs_embeds is None:
+            inputs_embeds = self.word_embeddings(input_ids)
+
+        hidden_states = inputs_embeds
+
+        if self.gradient_checkpointing and self.training:
+            if use_cache:
+                logger.warning(
+                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
+                )
+                use_cache = False
+        presents = () if use_cache else None
+        all_self_attentions = () if output_attentions else None
+        all_hidden_states = () if output_hidden_states else None
+
+        # Compute alibi tensor: check build_alibi_tensor documentation
+        past_key_values_length = 0
+        if past_key_values[0] is not None:
+            past_key_values_length = past_key_values[0][0].shape[-2]
+
+        if self.use_alibi:
+            mask = (
+                torch.ones(
+                    (batch_size, seq_length + past_key_values_length), device=inputs_embeds.device, dtype=torch.long
+                )
+                if attention_mask is None
+                else attention_mask
+            )
+            alibi = build_alibi_tensor(mask, self.num_heads, dtype=hidden_states.dtype)
+        else:
+            alibi = None
+            if position_ids is None:
+                device = input_ids.device if input_ids is not None else inputs_embeds.device
+                position_ids = torch.arange(
+                    past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
+                )
+                position_ids = position_ids.unsqueeze(0)
+
+        if self._use_flash_attention_2:
+            # 2d mask is passed through the layers
+            attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
+        elif self._use_sdpa and not output_attentions:
+            # output_attentions=True cannot be supported when using SDPA, and we fall back on
+            # the manual implementation that requires a 4D causal mask in all cases.
+            if alibi is None:
+                attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
+                    attention_mask,
+                    (batch_size, seq_length),
+                    inputs_embeds,
+                    past_key_values_length,
+                )
+            elif head_mask is None:
+                alibi = alibi.reshape(batch_size, -1, *alibi.shape[1:])
+
+                attention_mask_2d = attention_mask
+                # We don't call _prepare_4d_causal_attention_mask_for_sdpa as we need to mask alibi using the 4D attention_mask untouched.
+                attention_mask = _prepare_4d_causal_attention_mask(
+                    attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
+                )
+
+                # We take care to integrate alibi bias in the attention_mask here.
+                if attention_mask_2d is None:
+                    attention_mask = alibi / math.sqrt(self.config.hidden_size // self.num_heads)
+                else:
+                    attention_mask = torch.masked_fill(
+                        alibi / math.sqrt(self.config.hidden_size // self.num_heads),
+                        attention_mask < -1,
+                        torch.finfo(alibi.dtype).min,
+                    )
+
+                    # From PyTorch 2.1 onwards, F.scaled_dot_product_attention with the memory-efficient attention backend
+                    # produces nans if sequences are completely unattended in the attention mask. Details: https://github.com/pytorch/pytorch/issues/110213
+                    if seq_length > 1:
+                        attention_mask = AttentionMaskConverter._unmask_unattended(
+                            attention_mask, attention_mask_2d, unmasked_value=0.0
+                        )
+            else:
+                # PyTorch SDPA does not support head_mask, so we fall back on the eager implementation in this case.
+                attention_mask = _prepare_4d_causal_attention_mask(
+                    attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
+                )
+        else:
+            # 4d mask is passed through the layers
+            attention_mask = _prepare_4d_causal_attention_mask(
+                attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
+            )
+
+        # Prepare head mask if needed
+        # 1.0 in head_mask indicates we keep the head
+        # attention_probs has shape batch_size x num_heads x N x N
+        # head_mask has shape n_layer x batch x num_heads x N x N
+        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
+
+        for i, (block, layer_past) in enumerate(zip(self.h, past_key_values)):
+            if output_hidden_states:
+                all_hidden_states = all_hidden_states + (hidden_states,)
+
+            if self.gradient_checkpointing and self.training:
+                outputs = self._gradient_checkpointing_func(
+                    block.__call__,
+                    hidden_states,
+                    alibi,
+                    attention_mask,
+                    position_ids,
+                    head_mask[i],
+                    layer_past,
+                    use_cache,
+                    output_attentions,
+                )
+            else:
+                outputs = block(
+                    hidden_states,
+                    layer_past=layer_past,
+                    attention_mask=attention_mask,
+                    position_ids=position_ids,
+                    head_mask=head_mask[i],
+                    use_cache=use_cache,
+                    output_attentions=output_attentions,
+                    alibi=alibi,
+                )
+
+            hidden_states = outputs[0]
+            if use_cache is True:
+                presents = presents + (outputs[1],)
+
+            if output_attentions:
+                all_self_attentions = all_self_attentions + (outputs[2 if use_cache else 1],)
+
+        # Add last hidden state
+        hidden_states = self.ln_f(hidden_states)
+
+        if output_hidden_states:
+            all_hidden_states = all_hidden_states + (hidden_states,)
+
+        if not return_dict:
+            return tuple(v for v in [hidden_states, presents, all_hidden_states, all_self_attentions] if v is not None)
+
+        return BaseModelOutputWithPastAndCrossAttentions(
+            last_hidden_state=hidden_states,
+            past_key_values=presents,
+            hidden_states=all_hidden_states,
+            attentions=all_self_attentions,
+        )
+
+
+@add_start_docstrings(
+    "The Falcon Model transformer with a language modeling head on top (linear layer with weights tied to the input embeddings).",
+    FALCON_START_DOCSTRING,
+)
+class FalconForCausalLM(FalconPreTrainedModel):
+    _tied_weights_keys = ["lm_head.weight"]
+
+    def __init__(self, config: FalconConfig):
+        super().__init__(config)
+        self.transformer = FalconModel(config)
+        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_output_embeddings(self):
+        return self.lm_head
+
+    def set_output_embeddings(self, new_embeddings: torch.Tensor):
+        self.lm_head = new_embeddings
+
+    def prepare_inputs_for_generation(
+        self,
+        input_ids: torch.LongTensor,
+        past_key_values: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        **kwargs,
+    ) -> dict:
+        if past_key_values is not None:
+            past_length = past_key_values[0][0].shape[2]
+
+            # Some generation methods already pass only the last input ID
+            if input_ids.shape[1] > past_length:
+                remove_prefix_length = past_length
+            else:
+                # Default to old behavior: keep only final ID
+                remove_prefix_length = input_ids.shape[1] - 1
+
+            input_ids = input_ids[:, remove_prefix_length:]
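+            # For example (hypothetical shapes): with past_length=4 and input_ids of length 5, only the newly
+            # generated last token is kept; if the caller already passed a single token, nothing is dropped.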
+
+        # Note: versions of Falcon with alibi do not use position_ids. It is used with RoPE.
+        if not self.transformer.use_alibi and attention_mask is not None and position_ids is None:
+            # create position_ids on the fly for batch generation
+            position_ids = attention_mask.long().cumsum(-1) - 1
+            position_ids.masked_fill_(attention_mask == 0, 1)
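+            # For example (hypothetical values), a left-padded attention_mask of [[0, 0, 1, 1, 1]] yields
+            # position_ids of [[1, 1, 0, 1, 2]]: real tokens get consecutive positions and padding is clamped to 1.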
+            if past_key_values:
+                position_ids = position_ids[:, -input_ids.shape[1] :]
+
+        return {
+            "input_ids": input_ids,
+            "position_ids": position_ids,
+            "past_key_values": past_key_values,
+            "use_cache": kwargs.get("use_cache"),
+            "attention_mask": attention_mask,
+        }
+
+    @add_start_docstrings_to_model_forward(FALCON_INPUTS_DOCSTRING)
+    @add_code_sample_docstrings(
+        checkpoint=_CHECKPOINT_FOR_DOC,
+        output_type=CausalLMOutputWithCrossAttentions,
+        config_class=_CONFIG_FOR_DOC,
+    )
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        labels: Optional[torch.Tensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple[torch.Tensor], CausalLMOutputWithCrossAttentions]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
+            `labels = input_ids`. Indices are selected in `[-100, 0, ..., config.vocab_size]`. All labels set to `-100`
+            are ignored (masked); the loss is only computed for labels in `[0, ..., config.vocab_size]`.
+        """
+
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        transformer_outputs = self.transformer(
+            input_ids,
+            past_key_values=past_key_values,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+        hidden_states = transformer_outputs[0]
+
+        lm_logits = self.lm_head(hidden_states)
+
+        loss = None
+        if labels is not None:
+            # Shift so that tokens < n predict n
+            shift_logits = lm_logits[..., :-1, :].contiguous()
+            shift_labels = labels[..., 1:].contiguous()
+            batch_size, seq_length, vocab_size = shift_logits.shape
+            # Flatten the tokens
+            loss_fct = CrossEntropyLoss()
+            loss = loss_fct(
+                shift_logits.view(batch_size * seq_length, vocab_size), shift_labels.view(batch_size * seq_length)
+            )
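+            # With this shift, the logits at position i are scored against the label at position i + 1; the final
+            # position has no next-token target.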
+
+        if not return_dict:
+            output = (lm_logits,) + transformer_outputs[1:]
+            return ((loss,) + output) if loss is not None else output
+
+        return CausalLMOutputWithCrossAttentions(
+            loss=loss,
+            logits=lm_logits,
+            past_key_values=transformer_outputs.past_key_values,
+            hidden_states=transformer_outputs.hidden_states,
+            attentions=transformer_outputs.attentions,
+        )
+
+    def _reorder_cache(
+        self, past: Tuple[Tuple[torch.Tensor, torch.Tensor], ...], beam_idx: torch.LongTensor
+    ) -> Tuple[Tuple[torch.Tensor, torch.Tensor], ...]:
+        """
+        This function is used to re-order the `past_key_values` cache if [`~PreTrainedModel.beam_search`] or
+        [`~PreTrainedModel.beam_sample`] is called. This is required to match `past_key_values` with the correct
+        beam_idx at every generation step.
+
+        Output shares the same memory storage as `past`.
+        """
+
+        # Get a copy of `beam_idx` on all the devices where we need those indices.
+        device_to_beam_idx = {
+            past_state.device: beam_idx.to(past_state.device) for layer_past in past for past_state in layer_past
+        }
+        reordered_past = tuple(
+            (
+                layer_past[0].index_select(0, device_to_beam_idx[layer_past[0].device]),
+                layer_past[1].index_select(0, device_to_beam_idx[layer_past[0].device]),
+            )
+            for layer_past in past
+        )
+        return reordered_past
+
+
+@add_start_docstrings(
+    """
+    The Falcon Model transformer with a sequence classification head on top (linear layer).
+
+    [`FalconForSequenceClassification`] uses the last token in order to do the classification, as other causal models
+    (e.g. GPT-1) do.
+
+    Since it does classification on the last token, it needs to know the position of the last token. If a
+    `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
+    no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
+    padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (takes the last value in
+    each row of the batch).
+    """,
+    FALCON_START_DOCSTRING,
+)
+class FalconForSequenceClassification(FalconPreTrainedModel):
+    def __init__(self, config: FalconConfig):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+        self.transformer = FalconModel(config)
+        self.score = nn.Linear(config.hidden_size, config.num_labels, bias=False)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    @add_start_docstrings_to_model_forward(FALCON_INPUTS_DOCSTRING)
+    @add_code_sample_docstrings(
+        checkpoint=_CHECKPOINT_FOR_DOC,
+        output_type=SequenceClassifierOutputWithPast,
+        config_class=_CONFIG_FOR_DOC,
+    )
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        labels: Optional[torch.Tensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutputWithPast]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
+            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
+            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
+        """
+
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        transformer_outputs = self.transformer(
+            input_ids,
+            past_key_values=past_key_values,
+            attention_mask=attention_mask,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        hidden_states = transformer_outputs[0]
+        logits = self.score(hidden_states)
+
+        if input_ids is not None:
+            batch_size = input_ids.shape[0]
+        else:
+            batch_size = inputs_embeds.shape[0]
+
+        if self.config.pad_token_id is None and batch_size != 1:
+            raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
+        if self.config.pad_token_id is None:
+            sequence_lengths = -1
+        else:
+            if input_ids is not None:
+                # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
+                sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
+                sequence_lengths = sequence_lengths % input_ids.shape[-1]
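+                # If no pad token is present, argmax returns 0 and the -1 above wraps (via the modulo) to the last
+                # position in the sequence.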
+                sequence_lengths = sequence_lengths.to(logits.device)
+            else:
+                sequence_lengths = -1
+                logger.warning(
+                    f"{self.__class__.__name__} will not detect padding tokens in `inputs_embeds`. Results may be "
+                    "unexpected if using padding tokens in conjunction with `inputs_embeds.`"
+                )
+
+        pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
+
+        loss = None
+        if labels is not None:
+            if self.config.problem_type is None:
+                if self.num_labels == 1:
+                    self.config.problem_type = "regression"
+                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
+                    self.config.problem_type = "single_label_classification"
+                else:
+                    self.config.problem_type = "multi_label_classification"
+
+            if self.config.problem_type == "regression":
+                loss_fct = MSELoss()
+                if self.num_labels == 1:
+                    loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
+                else:
+                    loss = loss_fct(pooled_logits, labels)
+            elif self.config.problem_type == "single_label_classification":
+                loss_fct = CrossEntropyLoss()
+                loss = loss_fct(pooled_logits, labels)
+            elif self.config.problem_type == "multi_label_classification":
+                loss_fct = BCEWithLogitsLoss()
+                loss = loss_fct(pooled_logits, labels)
+        if not return_dict:
+            output = (pooled_logits,) + transformer_outputs[1:]
+            return ((loss,) + output) if loss is not None else output
+
+        return SequenceClassifierOutputWithPast(
+            loss=loss,
+            logits=pooled_logits,
+            past_key_values=transformer_outputs.past_key_values,
+            hidden_states=transformer_outputs.hidden_states,
+            attentions=transformer_outputs.attentions,
+        )
+
+
+@add_start_docstrings(
+    """
+    Falcon Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for
+    Named-Entity-Recognition (NER) tasks.
+    """,
+    FALCON_START_DOCSTRING,
+)
+class FalconForTokenClassification(FalconPreTrainedModel):
+    def __init__(self, config: FalconConfig):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+
+        self.transformer = FalconModel(config)
+        if getattr(config, "classifier_dropout", None) is not None:
+            classifier_dropout = config.classifier_dropout
+        elif getattr(config, "hidden_dropout", None) is not None:
+            classifier_dropout = config.hidden_dropout
+        else:
+            classifier_dropout = 0.1
+        self.dropout = nn.Dropout(classifier_dropout)
+        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    @add_start_docstrings_to_model_forward(FALCON_INPUTS_DOCSTRING)
+    @add_code_sample_docstrings(
+        checkpoint=_CHECKPOINT_FOR_DOC,
+        output_type=TokenClassifierOutput,
+        config_class=_CONFIG_FOR_DOC,
+    )
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        labels: Optional[torch.Tensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple[torch.Tensor], TokenClassifierOutput]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
+        """
+
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        transformer_outputs = self.transformer(
+            input_ids,
+            past_key_values=past_key_values,
+            attention_mask=attention_mask,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        hidden_states = transformer_outputs[0]
+        hidden_states = self.dropout(hidden_states)
+        logits = self.classifier(hidden_states)
+
+        loss = None
+        if labels is not None:
+            batch_size, seq_length = labels.shape
+            loss_fct = CrossEntropyLoss()
+            loss = loss_fct(
+                logits.view(batch_size * seq_length, self.num_labels), labels.view(batch_size * seq_length)
+            )
+
+        if not return_dict:
+            output = (logits,) + transformer_outputs[2:]
+            return ((loss,) + output) if loss is not None else output
+
+        return TokenClassifierOutput(
+            loss=loss,
+            logits=logits,
+            hidden_states=transformer_outputs.hidden_states,
+            attentions=transformer_outputs.attentions,
+        )
+
+
+@add_start_docstrings(
+    """
+    The Falcon Model transformer with a span classification head on top for extractive question-answering tasks like
+    SQuAD (a linear layer on top of the hidden-states output to compute `span start logits` and `span end logits`).
+    """,
+    FALCON_START_DOCSTRING,
+)
+class FalconForQuestionAnswering(FalconPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.transformer = FalconModel(config)
+        self.qa_outputs = nn.Linear(config.hidden_size, 2)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    @add_start_docstrings_to_model_forward(FALCON_INPUTS_DOCSTRING)
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.FloatTensor] = None,
+        head_mask: Optional[torch.FloatTensor] = None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        start_positions: Optional[torch.LongTensor] = None,
+        end_positions: Optional[torch.LongTensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, QuestionAnsweringModelOutput]:
+        r"""
+        start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for position (index) of the start of the labelled span for computing the token classification loss.
+            Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
+            are not taken into account for computing the loss.
+        end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for position (index) of the end of the labelled span for computing the token classification loss.
+            Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
+            are not taken into account for computing the loss.
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        outputs = self.transformer(
+            input_ids,
+            attention_mask=attention_mask,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        sequence_output = outputs[0]
+
+        logits = self.qa_outputs(sequence_output)
+        start_logits, end_logits = logits.split(1, dim=-1)
+        start_logits = start_logits.squeeze(-1).contiguous()
+        end_logits = end_logits.squeeze(-1).contiguous()
+
+        total_loss = None
+        if start_positions is not None and end_positions is not None:
+            # If we are on multi-GPU, an extra dimension may have been added to the positions; squeeze it away
+            if len(start_positions.size()) > 1:
+                start_positions = start_positions.squeeze(-1)
+            if len(end_positions.size()) > 1:
+                end_positions = end_positions.squeeze(-1)
+            # sometimes the start/end positions are outside our model inputs, we ignore these terms
+            ignored_index = start_logits.size(1)
+            start_positions = start_positions.clamp(0, ignored_index)
+            end_positions = end_positions.clamp(0, ignored_index)
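+            # Clamped out-of-range positions equal ignored_index and therefore contribute nothing to the loss below.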
+
+            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
+            start_loss = loss_fct(start_logits, start_positions)
+            end_loss = loss_fct(end_logits, end_positions)
+            total_loss = (start_loss + end_loss) / 2
+
+        if not return_dict:
+            output = (start_logits, end_logits) + outputs[2:]
+            return ((total_loss,) + output) if total_loss is not None else output
+
+        return QuestionAnsweringModelOutput(
+            loss=total_loss,
+            start_logits=start_logits,
+            end_logits=end_logits,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
diff --git a/src/transformers/models/fastspeech2_conformer/__init__.py b/src/transformers/models/fastspeech2_conformer/__init__.py
new file mode 100644
index 00000000000000..1fd5cbf1dc272e
--- /dev/null
+++ b/src/transformers/models/fastspeech2_conformer/__init__.py
@@ -0,0 +1,77 @@
+# Copyright 2023 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import TYPE_CHECKING
+
+from ...utils import (
+    OptionalDependencyNotAvailable,
+    _LazyModule,
+    is_torch_available,
+)
+
+
+_import_structure = {
+    "configuration_fastspeech2_conformer": [
+        "FASTSPEECH2_CONFORMER_HIFIGAN_PRETRAINED_CONFIG_ARCHIVE_MAP",
+        "FASTSPEECH2_CONFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP",
+        "FASTSPEECH2_CONFORMER_WITH_HIFIGAN_PRETRAINED_CONFIG_ARCHIVE_MAP",
+        "FastSpeech2ConformerConfig",
+        "FastSpeech2ConformerHifiGanConfig",
+        "FastSpeech2ConformerWithHifiGanConfig",
+    ],
+    "tokenization_fastspeech2_conformer": ["FastSpeech2ConformerTokenizer"],
+}
+
+try:
+    if not is_torch_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["modeling_fastspeech2_conformer"] = [
+        "FASTSPEECH2_CONFORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
+        "FastSpeech2ConformerWithHifiGan",
+        "FastSpeech2ConformerHifiGan",
+        "FastSpeech2ConformerModel",
+        "FastSpeech2ConformerPreTrainedModel",
+    ]
+
+if TYPE_CHECKING:
+    from .configuration_fastspeech2_conformer import (
+        FASTSPEECH2_CONFORMER_HIFIGAN_PRETRAINED_CONFIG_ARCHIVE_MAP,
+        FASTSPEECH2_CONFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP,
+        FASTSPEECH2_CONFORMER_WITH_HIFIGAN_PRETRAINED_CONFIG_ARCHIVE_MAP,
+        FastSpeech2ConformerConfig,
+        FastSpeech2ConformerHifiGanConfig,
+        FastSpeech2ConformerWithHifiGanConfig,
+    )
+    from .tokenization_fastspeech2_conformer import FastSpeech2ConformerTokenizer
+
+    try:
+        if not is_torch_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .modeling_fastspeech2_conformer import (
+            FASTSPEECH2_CONFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
+            FastSpeech2ConformerHifiGan,
+            FastSpeech2ConformerModel,
+            FastSpeech2ConformerPreTrainedModel,
+            FastSpeech2ConformerWithHifiGan,
+        )
+
+else:
+    import sys
+
+    sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
diff --git a/src/transformers/models/fastspeech2_conformer/configuration_fastspeech2_conformer.py b/src/transformers/models/fastspeech2_conformer/configuration_fastspeech2_conformer.py
new file mode 100644
index 00000000000000..46dc10adb2900e
--- /dev/null
+++ b/src/transformers/models/fastspeech2_conformer/configuration_fastspeech2_conformer.py
@@ -0,0 +1,488 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" FastSpeech2Conformer model configuration"""
+
+from typing import Dict
+
+from ...configuration_utils import PretrainedConfig
+from ...utils import logging
+
+
+logger = logging.get_logger(__name__)
+
+
+FASTSPEECH2_CONFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    "espnet/fastspeech2_conformer": "https://huggingface.co/espnet/fastspeech2_conformer/raw/main/config.json",
+}
+
+FASTSPEECH2_CONFORMER_HIFIGAN_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    "espnet/fastspeech2_conformer_hifigan": "https://huggingface.co/espnet/fastspeech2_conformer_hifigan/raw/main/config.json",
+}
+
+FASTSPEECH2_CONFORMER_WITH_HIFIGAN_PRETRAINED_CONFIG_ARCHIVE_MAP = {
+    "espnet/fastspeech2_conformer_with_hifigan": "https://huggingface.co/espnet/fastspeech2_conformer_with_hifigan/raw/main/config.json",
+}
+
+
+class FastSpeech2ConformerConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`FastSpeech2ConformerModel`]. It is used to
+    instantiate a FastSpeech2Conformer model according to the specified arguments, defining the model architecture.
+    Instantiating a configuration with the defaults will yield a similar configuration to that of the
+    FastSpeech2Conformer [espnet/fastspeech2_conformer](https://huggingface.co/espnet/fastspeech2_conformer)
+    architecture.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+    Args:
+        hidden_size (`int`, *optional*, defaults to 384):
+            The dimensionality of the hidden layers.
+        vocab_size (`int`, *optional*, defaults to 78):
+            The size of the vocabulary.
+        num_mel_bins (`int`, *optional*, defaults to 80):
+            The number of mel filters used in the filter bank.
+        encoder_num_attention_heads (`int`, *optional*, defaults to 2):
+            The number of attention heads in the encoder.
+        encoder_layers (`int`, *optional*, defaults to 4):
+            The number of layers in the encoder.
+        encoder_linear_units (`int`, *optional*, defaults to 1536):
+            The number of units in the linear layer of the encoder.
+        decoder_layers (`int`, *optional*, defaults to 4):
+            The number of layers in the decoder.
+        decoder_num_attention_heads (`int`, *optional*, defaults to 2):
+            The number of attention heads in the decoder.
+        decoder_linear_units (`int`, *optional*, defaults to 1536):
+            The number of units in the linear layer of the decoder.
+        speech_decoder_postnet_layers (`int`, *optional*, defaults to 5):
+            The number of layers in the post-net of the speech decoder.
+        speech_decoder_postnet_units (`int`, *optional*, defaults to 256):
+            The number of units in the post-net layers of the speech decoder.
+        speech_decoder_postnet_kernel (`int`, *optional*, defaults to 5):
+            The kernel size in the post-net of the speech decoder.
+        positionwise_conv_kernel_size (`int`, *optional*, defaults to 3):
+            The size of the convolution kernel used in the position-wise layer.
+        encoder_normalize_before (`bool`, *optional*, defaults to `False`):
+            Specifies whether to normalize before encoder layers.
+        decoder_normalize_before (`bool`, *optional*, defaults to `False`):
+            Specifies whether to normalize before decoder layers.
+        encoder_concat_after (`bool`, *optional*, defaults to `False`):
+            Specifies whether to concatenate after encoder layers.
+        decoder_concat_after (`bool`, *optional*, defaults to `False`):
+            Specifies whether to concatenate after decoder layers.
+        reduction_factor (`int`, *optional*, defaults to 1):
+            The factor by which the speech frame rate is reduced.
+        speaking_speed (`float`, *optional*, defaults to 1.0):
+            The speed of the speech produced.
+        use_macaron_style_in_conformer (`bool`, *optional*, defaults to `True`):
+            Specifies whether to use macaron style in the conformer.
+        use_cnn_in_conformer (`bool`, *optional*, defaults to `True`):
+            Specifies whether to use convolutional neural networks in the conformer.
+        encoder_kernel_size (`int`, *optional*, defaults to 7):
+            The kernel size used in the encoder.
+        decoder_kernel_size (`int`, *optional*, defaults to 31):
+            The kernel size used in the decoder.
+        duration_predictor_layers (`int`, *optional*, defaults to 2):
+            The number of layers in the duration predictor.
+        duration_predictor_channels (`int`, *optional*, defaults to 256):
+            The number of channels in the duration predictor.
+        duration_predictor_kernel_size (`int`, *optional*, defaults to 3):
+            The kernel size used in the duration predictor.
+        energy_predictor_layers (`int`, *optional*, defaults to 2):
+            The number of layers in the energy predictor.
+        energy_predictor_channels (`int`, *optional*, defaults to 256):
+            The number of channels in the energy predictor.
+        energy_predictor_kernel_size (`int`, *optional*, defaults to 3):
+            The kernel size used in the energy predictor.
+        energy_predictor_dropout (`float`, *optional*, defaults to 0.5):
+            The dropout rate in the energy predictor.
+        energy_embed_kernel_size (`int`, *optional*, defaults to 1):
+            The kernel size used in the energy embed layer.
+        energy_embed_dropout (`float`, *optional*, defaults to 0.0):
+            The dropout rate in the energy embed layer.
+        stop_gradient_from_energy_predictor (`bool`, *optional*, defaults to `False`):
+            Specifies whether to stop gradients from the energy predictor.
+        pitch_predictor_layers (`int`, *optional*, defaults to 5):
+            The number of layers in the pitch predictor.
+        pitch_predictor_channels (`int`, *optional*, defaults to 256):
+            The number of channels in the pitch predictor.
+        pitch_predictor_kernel_size (`int`, *optional*, defaults to 5):
+            The kernel size used in the pitch predictor.
+        pitch_predictor_dropout (`float`, *optional*, defaults to 0.5):
+            The dropout rate in the pitch predictor.
+        pitch_embed_kernel_size (`int`, *optional*, defaults to 1):
+            The kernel size used in the pitch embed layer.
+        pitch_embed_dropout (`float`, *optional*, defaults to 0.0):
+            The dropout rate in the pitch embed layer.
+        stop_gradient_from_pitch_predictor (`bool`, *optional*, defaults to `True`):
+            Specifies whether to stop gradients from the pitch predictor.
+        encoder_dropout_rate (`float`, *optional*, defaults to 0.2):
+            The dropout rate in the encoder.
+        encoder_positional_dropout_rate (`float`, *optional*, defaults to 0.2):
+            The positional dropout rate in the encoder.
+        encoder_attention_dropout_rate (`float`, *optional*, defaults to 0.2):
+            The attention dropout rate in the encoder.
+        decoder_dropout_rate (`float`, *optional*, defaults to 0.2):
+            The dropout rate in the decoder.
+        decoder_positional_dropout_rate (`float`, *optional*, defaults to 0.2):
+            The positional dropout rate in the decoder.
+        decoder_attention_dropout_rate (`float`, *optional*, defaults to 0.2):
+            The attention dropout rate in the decoder.
+        duration_predictor_dropout_rate (`float`, *optional*, defaults to 0.2):
+            The dropout rate in the duration predictor.
+        speech_decoder_postnet_dropout (`float`, *optional*, defaults to 0.5):
+            The dropout rate in the speech decoder postnet.
+        max_source_positions (`int`, *optional*, defaults to 5000):
+            if `"relative"` position embeddings are used, defines the maximum source input positions.
+        use_masking (`bool`, *optional*, defaults to `True`):
+            Specifies whether to use masking in the model.
+        use_weighted_masking (`bool`, *optional*, defaults to `False`):
+            Specifies whether to use weighted masking in the model.
+        num_speakers (`int`, *optional*):
+            Number of speakers. If set to > 1, speaker ids are expected as an input and a speaker id embedding layer
+            is used.
+        num_languages (`int`, *optional*):
+            Number of languages. If set to > 1, language ids are expected as an input and a language id embedding
+            layer is used.
+        speaker_embed_dim (`int`, *optional*):
+            Speaker embedding dimension. If set to > 0, a speaker embedding is expected as an additional input.
+        is_encoder_decoder (`bool`, *optional*, defaults to `True`):
+            Specifies whether the model is an encoder-decoder.
+
+    Example:
+
+    ```python
+    >>> from transformers import FastSpeech2ConformerModel, FastSpeech2ConformerConfig
+
+    >>> # Initializing a FastSpeech2Conformer style configuration
+    >>> configuration = FastSpeech2ConformerConfig()
+
+    >>> # Initializing a model from the FastSpeech2Conformer style configuration
+    >>> model = FastSpeech2ConformerModel(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "fastspeech2_conformer"
+    attribute_map = {"num_hidden_layers": "encoder_layers", "num_attention_heads": "encoder_num_attention_heads"}
+
+    def __init__(
+        self,
+        hidden_size=384,
+        vocab_size=78,
+        num_mel_bins=80,
+        encoder_num_attention_heads=2,
+        encoder_layers=4,
+        encoder_linear_units=1536,
+        decoder_layers=4,
+        decoder_num_attention_heads=2,
+        decoder_linear_units=1536,
+        speech_decoder_postnet_layers=5,
+        speech_decoder_postnet_units=256,
+        speech_decoder_postnet_kernel=5,
+        positionwise_conv_kernel_size=3,
+        encoder_normalize_before=False,
+        decoder_normalize_before=False,
+        encoder_concat_after=False,
+        decoder_concat_after=False,
+        reduction_factor=1,
+        speaking_speed=1.0,
+        use_macaron_style_in_conformer=True,
+        use_cnn_in_conformer=True,
+        encoder_kernel_size=7,
+        decoder_kernel_size=31,
+        duration_predictor_layers=2,
+        duration_predictor_channels=256,
+        duration_predictor_kernel_size=3,
+        energy_predictor_layers=2,
+        energy_predictor_channels=256,
+        energy_predictor_kernel_size=3,
+        energy_predictor_dropout=0.5,
+        energy_embed_kernel_size=1,
+        energy_embed_dropout=0.0,
+        stop_gradient_from_energy_predictor=False,
+        pitch_predictor_layers=5,
+        pitch_predictor_channels=256,
+        pitch_predictor_kernel_size=5,
+        pitch_predictor_dropout=0.5,
+        pitch_embed_kernel_size=1,
+        pitch_embed_dropout=0.0,
+        stop_gradient_from_pitch_predictor=True,
+        encoder_dropout_rate=0.2,
+        encoder_positional_dropout_rate=0.2,
+        encoder_attention_dropout_rate=0.2,
+        decoder_dropout_rate=0.2,
+        decoder_positional_dropout_rate=0.2,
+        decoder_attention_dropout_rate=0.2,
+        duration_predictor_dropout_rate=0.2,
+        speech_decoder_postnet_dropout=0.5,
+        max_source_positions=5000,
+        use_masking=True,
+        use_weighted_masking=False,
+        num_speakers=None,
+        num_languages=None,
+        speaker_embed_dim=None,
+        is_encoder_decoder=True,
+        **kwargs,
+    ):
+        if positionwise_conv_kernel_size % 2 == 0:
+            raise ValueError(
+                f"positionwise_conv_kernel_size must be odd, but got {positionwise_conv_kernel_size} instead."
+            )
+        if encoder_kernel_size % 2 == 0:
+            raise ValueError(f"encoder_kernel_size must be odd, but got {encoder_kernel_size} instead.")
+        if decoder_kernel_size % 2 == 0:
+            raise ValueError(f"decoder_kernel_size must be odd, but got {decoder_kernel_size} instead.")
+        if duration_predictor_kernel_size % 2 == 0:
+            raise ValueError(
+                f"duration_predictor_kernel_size must be odd, but got {duration_predictor_kernel_size} instead."
+            )
+        if energy_predictor_kernel_size % 2 == 0:
+            raise ValueError(
+                f"energy_predictor_kernel_size must be odd, but got {energy_predictor_kernel_size} instead."
+            )
+        if energy_embed_kernel_size % 2 == 0:
+            raise ValueError(f"energy_embed_kernel_size must be odd, but got {energy_embed_kernel_size} instead.")
+        if pitch_predictor_kernel_size % 2 == 0:
+            raise ValueError(
+                f"pitch_predictor_kernel_size must be odd, but got {pitch_predictor_kernel_size} instead."
+            )
+        if pitch_embed_kernel_size % 2 == 0:
+            raise ValueError(f"pitch_embed_kernel_size must be odd, but got {pitch_embed_kernel_size} instead.")
+        if hidden_size % encoder_num_attention_heads != 0:
+            raise ValueError("The hidden_size must be evenly divisible by encoder_num_attention_heads.")
+        if hidden_size % decoder_num_attention_heads != 0:
+            raise ValueError("The hidden_size must be evenly divisible by decoder_num_attention_heads.")
+        if use_masking and use_weighted_masking:
+            raise ValueError("Either use_masking or use_weighted_masking can be True, but not both.")
+
+        self.hidden_size = hidden_size
+        self.vocab_size = vocab_size
+        self.num_mel_bins = num_mel_bins
+        self.encoder_config = {
+            "num_attention_heads": encoder_num_attention_heads,
+            "layers": encoder_layers,
+            "kernel_size": encoder_kernel_size,
+            "attention_dropout_rate": encoder_attention_dropout_rate,
+            "dropout_rate": encoder_dropout_rate,
+            "positional_dropout_rate": encoder_positional_dropout_rate,
+            "linear_units": encoder_linear_units,
+            "normalize_before": encoder_normalize_before,
+            "concat_after": encoder_concat_after,
+        }
+        self.decoder_config = {
+            "num_attention_heads": decoder_num_attention_heads,
+            "layers": decoder_layers,
+            "kernel_size": decoder_kernel_size,
+            "attention_dropout_rate": decoder_attention_dropout_rate,
+            "dropout_rate": decoder_dropout_rate,
+            "positional_dropout_rate": decoder_positional_dropout_rate,
+            "linear_units": decoder_linear_units,
+            "normalize_before": decoder_normalize_before,
+            "concat_after": decoder_concat_after,
+        }
+        self.encoder_num_attention_heads = encoder_num_attention_heads
+        self.encoder_layers = encoder_layers
+        self.duration_predictor_channels = duration_predictor_channels
+        self.duration_predictor_kernel_size = duration_predictor_kernel_size
+        self.duration_predictor_layers = duration_predictor_layers
+        self.energy_embed_dropout = energy_embed_dropout
+        self.energy_embed_kernel_size = energy_embed_kernel_size
+        self.energy_predictor_channels = energy_predictor_channels
+        self.energy_predictor_dropout = energy_predictor_dropout
+        self.energy_predictor_kernel_size = energy_predictor_kernel_size
+        self.energy_predictor_layers = energy_predictor_layers
+        self.pitch_embed_dropout = pitch_embed_dropout
+        self.pitch_embed_kernel_size = pitch_embed_kernel_size
+        self.pitch_predictor_channels = pitch_predictor_channels
+        self.pitch_predictor_dropout = pitch_predictor_dropout
+        self.pitch_predictor_kernel_size = pitch_predictor_kernel_size
+        self.pitch_predictor_layers = pitch_predictor_layers
+        self.positionwise_conv_kernel_size = positionwise_conv_kernel_size
+        self.speech_decoder_postnet_units = speech_decoder_postnet_units
+        self.speech_decoder_postnet_dropout = speech_decoder_postnet_dropout
+        self.speech_decoder_postnet_kernel = speech_decoder_postnet_kernel
+        self.speech_decoder_postnet_layers = speech_decoder_postnet_layers
+        self.reduction_factor = reduction_factor
+        self.speaking_speed = speaking_speed
+        self.stop_gradient_from_energy_predictor = stop_gradient_from_energy_predictor
+        self.stop_gradient_from_pitch_predictor = stop_gradient_from_pitch_predictor
+        self.max_source_positions = max_source_positions
+        self.use_cnn_in_conformer = use_cnn_in_conformer
+        self.use_macaron_style_in_conformer = use_macaron_style_in_conformer
+        self.use_masking = use_masking
+        self.use_weighted_masking = use_weighted_masking
+        self.num_speakers = num_speakers
+        self.num_languages = num_languages
+        self.speaker_embed_dim = speaker_embed_dim
+        self.duration_predictor_dropout_rate = duration_predictor_dropout_rate
+        self.is_encoder_decoder = is_encoder_decoder
+
+        super().__init__(
+            is_encoder_decoder=is_encoder_decoder,
+            **kwargs,
+        )
+
+
+class FastSpeech2ConformerHifiGanConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`FastSpeech2ConformerHifiGan`]. It is used to
+    instantiate a FastSpeech2Conformer HiFi-GAN vocoder model according to the specified arguments, defining the model
+    architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the
+    FastSpeech2Conformer
+    [espnet/fastspeech2_conformer_hifigan](https://huggingface.co/espnet/fastspeech2_conformer_hifigan) architecture.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+    Args:
+        model_in_dim (`int`, *optional*, defaults to 80):
+            The number of frequency bins in the input log-mel spectrogram.
+        upsample_initial_channel (`int`, *optional*, defaults to 512):
+            The number of input channels into the upsampling network.
+        upsample_rates (`Tuple[int]` or `List[int]`, *optional*, defaults to `[8, 8, 2, 2]`):
+            A tuple of integers defining the stride of each 1D convolutional layer in the upsampling network. The
+            length of *upsample_rates* defines the number of convolutional layers and has to match the length of
+            *upsample_kernel_sizes*.
+        upsample_kernel_sizes (`Tuple[int]` or `List[int]`, *optional*, defaults to `[16, 16, 4, 4]`):
+            A tuple of integers defining the kernel size of each 1D convolutional layer in the upsampling network. The
+            length of *upsample_kernel_sizes* defines the number of convolutional layers and has to match the length of
+            *upsample_rates*.
+        resblock_kernel_sizes (`Tuple[int]` or `List[int]`, *optional*, defaults to `[3, 7, 11]`):
+            A tuple of integers defining the kernel sizes of the 1D convolutional layers in the multi-receptive field
+            fusion (MRF) module.
+        resblock_dilation_sizes (`Tuple[Tuple[int]]` or `List[List[int]]`, *optional*, defaults to `[[1, 3, 5], [1, 3, 5], [1, 3, 5]]`):
+            A nested tuple of integers defining the dilation rates of the dilated 1D convolutional layers in the
+            multi-receptive field fusion (MRF) module.
+        initializer_range (`float`, *optional*, defaults to 0.01):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        leaky_relu_slope (`float`, *optional*, defaults to 0.1):
+            The angle of the negative slope used by the leaky ReLU activation.
+        normalize_before (`bool`, *optional*, defaults to `True`):
+            Whether or not to normalize the spectrogram before vocoding using the vocoder's learned mean and variance.
+
+    Example:
+
+    ```python
+    >>> from transformers import FastSpeech2ConformerHifiGan, FastSpeech2ConformerHifiGanConfig
+
+    >>> # Initializing a FastSpeech2ConformerHifiGan configuration
+    >>> configuration = FastSpeech2ConformerHifiGanConfig()
+
+    >>> # Initializing a model (with random weights) from the configuration
+    >>> model = FastSpeech2ConformerHifiGan(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "hifigan"
+
+    def __init__(
+        self,
+        model_in_dim=80,
+        upsample_initial_channel=512,
+        upsample_rates=[8, 8, 2, 2],
+        upsample_kernel_sizes=[16, 16, 4, 4],
+        resblock_kernel_sizes=[3, 7, 11],
+        resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]],
+        initializer_range=0.01,
+        leaky_relu_slope=0.1,
+        normalize_before=True,
+        **kwargs,
+    ):
+        self.model_in_dim = model_in_dim
+        self.upsample_initial_channel = upsample_initial_channel
+        self.upsample_rates = upsample_rates
+        self.upsample_kernel_sizes = upsample_kernel_sizes
+        self.resblock_kernel_sizes = resblock_kernel_sizes
+        self.resblock_dilation_sizes = resblock_dilation_sizes
+        self.initializer_range = initializer_range
+        self.leaky_relu_slope = leaky_relu_slope
+        self.normalize_before = normalize_before
+        super().__init__(**kwargs)
+
+
+class FastSpeech2ConformerWithHifiGanConfig(PretrainedConfig):
+    """
+    This is the configuration class to store the configuration of a [`FastSpeech2ConformerWithHifiGan`]. It is used to
+    instantiate a [`FastSpeech2ConformerWithHifiGan`] model according to the specified sub-models configurations,
+    defining the model architecture.
+
+    Instantiating a configuration with the defaults will yield a similar configuration to that of the
+    FastSpeech2ConformerModel [espnet/fastspeech2_conformer](https://huggingface.co/espnet/fastspeech2_conformer) and
+    FastSpeech2ConformerHifiGan
+    [espnet/fastspeech2_conformer_hifigan](https://huggingface.co/espnet/fastspeech2_conformer_hifigan) architectures.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+    Args:
+        model_config (`typing.Dict`, *optional*):
+            Configuration dictionary of the text-to-speech model, used to build a [`FastSpeech2ConformerConfig`].
+        vocoder_config (`typing.Dict`, *optional*):
+            Configuration dictionary of the vocoder model, used to build a [`FastSpeech2ConformerHifiGanConfig`].
+
+    Example:
+
+    ```python
+    >>> from transformers import (
+    ...     FastSpeech2ConformerConfig,
+    ...     FastSpeech2ConformerHifiGanConfig,
+    ...     FastSpeech2ConformerWithHifiGanConfig,
+    ...     FastSpeech2ConformerWithHifiGan,
+    ... )
+
+    >>> # Initializing FastSpeech2ConformerWithHifiGan sub-modules configurations.
+    >>> model_config = FastSpeech2ConformerConfig()
+    >>> vocoder_config = FastSpeech2ConformerHifiGanConfig()
+
+    >>> # Initializing a FastSpeech2ConformerWithHifiGan module style configuration
+    >>> configuration = FastSpeech2ConformerWithHifiGanConfig(model_config.to_dict(), vocoder_config.to_dict())
+
+    >>> # Initializing a model (with random weights)
+    >>> model = FastSpeech2ConformerWithHifiGan(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```
+    """
+
+    model_type = "fastspeech2_conformer_with_hifigan"
+    is_composition = True
+
+    def __init__(
+        self,
+        model_config: Dict = None,
+        vocoder_config: Dict = None,
+        **kwargs,
+    ):
+        if model_config is None:
+            model_config = {}
+            logger.info("model_config is None. initializing the model with default values.")
+
+        if vocoder_config is None:
+            vocoder_config = {}
+            logger.info("vocoder_config is None. initializing the coarse model with default values.")
+
+        self.model_config = FastSpeech2ConformerConfig(**model_config)
+        self.vocoder_config = FastSpeech2ConformerHifiGanConfig(**vocoder_config)
+
+        super().__init__(**kwargs)
diff --git a/src/transformers/models/fastspeech2_conformer/convert_fastspeech2_conformer_original_pytorch_checkpoint_to_pytorch.py b/src/transformers/models/fastspeech2_conformer/convert_fastspeech2_conformer_original_pytorch_checkpoint_to_pytorch.py
new file mode 100644
index 00000000000000..bb9c432f82292f
--- /dev/null
+++ b/src/transformers/models/fastspeech2_conformer/convert_fastspeech2_conformer_original_pytorch_checkpoint_to_pytorch.py
@@ -0,0 +1,210 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Convert FastSpeech2Conformer checkpoint."""
+
+import argparse
+import json
+import re
+from pathlib import Path
+from tempfile import TemporaryDirectory
+
+import torch
+import yaml
+
+from transformers import (
+    FastSpeech2ConformerConfig,
+    FastSpeech2ConformerModel,
+    FastSpeech2ConformerTokenizer,
+    logging,
+)
+
+
+logging.set_verbosity_info()
+logger = logging.get_logger("transformers.models.FastSpeech2Conformer")
+
+CONFIG_MAPPING = {
+    "adim": "hidden_size",
+    "aheads": "num_attention_heads",
+    "conformer_dec_kernel_size": "decoder_kernel_size",
+    "conformer_enc_kernel_size": "encoder_kernel_size",
+    "decoder_normalize_before": "decoder_normalize_before",
+    "dlayers": "decoder_layers",
+    "dunits": "decoder_linear_units",
+    "duration_predictor_chans": "duration_predictor_channels",
+    "duration_predictor_kernel_size": "duration_predictor_kernel_size",
+    "duration_predictor_layers": "duration_predictor_layers",
+    "elayers": "encoder_layers",
+    "encoder_normalize_before": "encoder_normalize_before",
+    "energy_embed_dropout": "energy_embed_dropout",
+    "energy_embed_kernel_size": "energy_embed_kernel_size",
+    "energy_predictor_chans": "energy_predictor_channels",
+    "energy_predictor_dropout": "energy_predictor_dropout",
+    "energy_predictor_kernel_size": "energy_predictor_kernel_size",
+    "energy_predictor_layers": "energy_predictor_layers",
+    "eunits": "encoder_linear_units",
+    "pitch_embed_dropout": "pitch_embed_dropout",
+    "pitch_embed_kernel_size": "pitch_embed_kernel_size",
+    "pitch_predictor_chans": "pitch_predictor_channels",
+    "pitch_predictor_dropout": "pitch_predictor_dropout",
+    "pitch_predictor_kernel_size": "pitch_predictor_kernel_size",
+    "pitch_predictor_layers": "pitch_predictor_layers",
+    "positionwise_conv_kernel_size": "positionwise_conv_kernel_size",
+    "postnet_chans": "speech_decoder_postnet_units",
+    "postnet_filts": "speech_decoder_postnet_kernel",
+    "postnet_layers": "speech_decoder_postnet_layers",
+    "reduction_factor": "reduction_factor",
+    "stop_gradient_from_energy_predictor": "stop_gradient_from_energy_predictor",
+    "stop_gradient_from_pitch_predictor": "stop_gradient_from_pitch_predictor",
+    "transformer_dec_attn_dropout_rate": "decoder_attention_dropout_rate",
+    "transformer_dec_dropout_rate": "decoder_dropout_rate",
+    "transformer_dec_positional_dropout_rate": "decoder_positional_dropout_rate",
+    "transformer_enc_attn_dropout_rate": "encoder_attention_dropout_rate",
+    "transformer_enc_dropout_rate": "encoder_dropout_rate",
+    "transformer_enc_positional_dropout_rate": "encoder_positional_dropout_rate",
+    "use_cnn_in_conformer": "use_cnn_in_conformer",
+    "use_macaron_style_in_conformer": "use_macaron_style_in_conformer",
+    "use_masking": "use_masking",
+    "use_weighted_masking": "use_weighted_masking",
+    "idim": "input_dim",
+    "odim": "num_mel_bins",
+    "spk_embed_dim": "speaker_embed_dim",
+    "langs": "num_languages",
+    "spks": "num_speakers",
+}
+
+
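+# A minimal sketch (hypothetical values) of the ESPnet yaml layout assumed by `remap_model_yaml_config` below:
+#
+#   tts_conf:
+#     text2mel_params:
+#       adim: 384      # -> hidden_size
+#       aheads: 2      # -> num_attention_heads
+#       elayers: 4     # -> encoder_layers
+#   g2p: g2p_en
+#   token_list: [<blank>, <unk>, AH0, ...]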
+def remap_model_yaml_config(yaml_config_path):
+    with Path(yaml_config_path).open("r", encoding="utf-8") as f:
+        args = yaml.safe_load(f)
+        args = argparse.Namespace(**args)
+
+    remapped_config = {}
+
+    model_params = args.tts_conf["text2mel_params"]
+    # espnet_config_key -> hf_config_key, any keys not included are ignored
+    for espnet_config_key, hf_config_key in CONFIG_MAPPING.items():
+        if espnet_config_key in model_params:
+            remapped_config[hf_config_key] = model_params[espnet_config_key]
+
+    return remapped_config, args.g2p, args.token_list
+
+
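+# Illustrative example of the renaming performed below (not an exhaustive list): an ESPnet key such as
+# "tts.generator.text2mel.encoder.encoders.0.norm_mha.weight" becomes
+# "encoder.conformer_layers.0.self_attn_layer_norm.weight" once the prefix is stripped and the
+# substring replacements are applied.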
+def convert_espnet_state_dict_to_hf(state_dict):
+    new_state_dict = {}
+    for key in state_dict:
+        if "tts.generator.text2mel." in key:
+            new_key = key.replace("tts.generator.text2mel.", "")
+            if "postnet" in key:
+                new_key = new_key.replace("postnet.postnet", "speech_decoder_postnet.layers")
+                new_key = new_key.replace(".0.weight", ".conv.weight")
+                new_key = new_key.replace(".1.weight", ".batch_norm.weight")
+                new_key = new_key.replace(".1.bias", ".batch_norm.bias")
+                new_key = new_key.replace(".1.running_mean", ".batch_norm.running_mean")
+                new_key = new_key.replace(".1.running_var", ".batch_norm.running_var")
+                new_key = new_key.replace(".1.num_batches_tracked", ".batch_norm.num_batches_tracked")
+            if "feat_out" in key:
+                if "weight" in key:
+                    new_key = "speech_decoder_postnet.feat_out.weight"
+                if "bias" in key:
+                    new_key = "speech_decoder_postnet.feat_out.bias"
+            if "encoder.embed.0.weight" in key:
+                new_key = new_key.replace("0.", "")
+            if "w_1" in key:
+                new_key = new_key.replace("w_1", "conv1")
+            if "w_2" in key:
+                new_key = new_key.replace("w_2", "conv2")
+            if "predictor.conv" in key:
+                new_key = new_key.replace(".conv", ".conv_layers")
+                pattern = r"(\d)\.(\d)"
+                replacement = (
+                    r"\1.conv" if ("2.weight" not in new_key) and ("2.bias" not in new_key) else r"\1.layer_norm"
+                )
+                new_key = re.sub(pattern, replacement, new_key)
+            if "pitch_embed" in key or "energy_embed" in key:
+                new_key = new_key.replace("0", "conv")
+            if "encoders" in key:
+                new_key = new_key.replace("encoders", "conformer_layers")
+                new_key = new_key.replace("norm_final", "final_layer_norm")
+                new_key = new_key.replace("norm_mha", "self_attn_layer_norm")
+                new_key = new_key.replace("norm_ff_macaron", "ff_macaron_layer_norm")
+                new_key = new_key.replace("norm_ff", "ff_layer_norm")
+                new_key = new_key.replace("norm_conv", "conv_layer_norm")
+            if "lid_emb" in key:
+                new_key = new_key.replace("lid_emb", "language_id_embedding")
+            if "sid_emb" in key:
+                new_key = new_key.replace("sid_emb", "speaker_id_embedding")
+
+            new_state_dict[new_key] = state_dict[key]
+
+    return new_state_dict
+
+
+@torch.no_grad()
+def convert_FastSpeech2ConformerModel_checkpoint(
+    checkpoint_path,
+    yaml_config_path,
+    pytorch_dump_folder_path,
+    repo_id=None,
+):
+    model_params, tokenizer_name, vocab = remap_model_yaml_config(yaml_config_path)
+    config = FastSpeech2ConformerConfig(**model_params)
+
+    # Prepare the model
+    model = FastSpeech2ConformerModel(config)
+
+    espnet_checkpoint = torch.load(checkpoint_path)
+    hf_compatible_state_dict = convert_espnet_state_dict_to_hf(espnet_checkpoint)
+
+    model.load_state_dict(hf_compatible_state_dict)
+
+    model.save_pretrained(pytorch_dump_folder_path)
+
+    # Prepare the tokenizer
+    with TemporaryDirectory() as tempdir:
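+        # Build a token -> id mapping from the ESPnet token list and write it to a temporary
+        # vocab.json so the tokenizer can be instantiated from a vocabulary file.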
+        vocab = {token: token_id for token_id, token in enumerate(vocab)}
+        vocab_file = Path(tempdir) / "vocab.json"
+        with open(vocab_file, "w") as f:
+            json.dump(vocab, f)
+        should_strip_spaces = "no_space" in tokenizer_name
+        tokenizer = FastSpeech2ConformerTokenizer(str(vocab_file), should_strip_spaces=should_strip_spaces)
+
+    tokenizer.save_pretrained(pytorch_dump_folder_path)
+
+    if repo_id:
+        print("Pushing to the hub...")
+        model.push_to_hub(repo_id)
+        tokenizer.push_to_hub(repo_id)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--checkpoint_path", required=True, default=None, type=str, help="Path to original checkpoint")
+    parser.add_argument(
+        "--yaml_config_path", required=True, default=None, type=str, help="Path to config.yaml of model to convert"
+    )
+    parser.add_argument(
+        "--pytorch_dump_folder_path", required=True, default=None, type=str, help="Path to the output PyTorch model."
+    )
+    parser.add_argument(
+        "--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub."
+    )
+
+    args = parser.parse_args()
+    convert_FastSpeech2ConformerModel_checkpoint(
+        args.checkpoint_path,
+        args.yaml_config_path,
+        args.pytorch_dump_folder_path,
+        args.push_to_hub,
+    )
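+
+# Example invocation (illustrative only; all paths are placeholders):
+#   python convert_fastspeech2_conformer_original_pytorch_checkpoint_to_pytorch.py \
+#       --checkpoint_path /path/to/espnet/checkpoint.pth \
+#       --yaml_config_path /path/to/espnet/config.yaml \
+#       --pytorch_dump_folder_path ./fastspeech2_conformer_converted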
diff --git a/src/transformers/models/fastspeech2_conformer/convert_hifigan.py b/src/transformers/models/fastspeech2_conformer/convert_hifigan.py
new file mode 100644
index 00000000000000..ec9f57ce7142d6
--- /dev/null
+++ b/src/transformers/models/fastspeech2_conformer/convert_hifigan.py
@@ -0,0 +1,134 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Convert FastSpeech2Conformer HiFi-GAN checkpoint."""
+
+import argparse
+from pathlib import Path
+
+import torch
+import yaml
+
+from transformers import FastSpeech2ConformerHifiGan, FastSpeech2ConformerHifiGanConfig, logging
+
+
+logging.set_verbosity_info()
+logger = logging.get_logger("transformers.models.FastSpeech2Conformer")
+
+
+def load_weights(checkpoint, hf_model, config):
+    vocoder_key_prefix = "tts.generator.vocoder."
+    checkpoint = {k.replace(vocoder_key_prefix, ""): v for k, v in checkpoint.items() if vocoder_key_prefix in k}
+
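+    # The ESPnet checkpoint stores weight-normalized parameters (weight_g / weight_v). Weight norm is
+    # applied here so the HF model exposes matching parameters, the values are copied below, and weight
+    # norm is removed again at the end to fold them into plain convolution weights.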
+    hf_model.apply_weight_norm()
+
+    hf_model.conv_pre.weight_g.data = checkpoint["input_conv.weight_g"]
+    hf_model.conv_pre.weight_v.data = checkpoint["input_conv.weight_v"]
+    hf_model.conv_pre.bias.data = checkpoint["input_conv.bias"]
+
+    for i in range(len(config.upsample_rates)):
+        hf_model.upsampler[i].weight_g.data = checkpoint[f"upsamples.{i}.1.weight_g"]
+        hf_model.upsampler[i].weight_v.data = checkpoint[f"upsamples.{i}.1.weight_v"]
+        hf_model.upsampler[i].bias.data = checkpoint[f"upsamples.{i}.1.bias"]
+
+    for i in range(len(config.upsample_rates) * len(config.resblock_kernel_sizes)):
+        for j in range(len(config.resblock_dilation_sizes)):
+            hf_model.resblocks[i].convs1[j].weight_g.data = checkpoint[f"blocks.{i}.convs1.{j}.1.weight_g"]
+            hf_model.resblocks[i].convs1[j].weight_v.data = checkpoint[f"blocks.{i}.convs1.{j}.1.weight_v"]
+            hf_model.resblocks[i].convs1[j].bias.data = checkpoint[f"blocks.{i}.convs1.{j}.1.bias"]
+
+            hf_model.resblocks[i].convs2[j].weight_g.data = checkpoint[f"blocks.{i}.convs2.{j}.1.weight_g"]
+            hf_model.resblocks[i].convs2[j].weight_v.data = checkpoint[f"blocks.{i}.convs2.{j}.1.weight_v"]
+            hf_model.resblocks[i].convs2[j].bias.data = checkpoint[f"blocks.{i}.convs2.{j}.1.bias"]
+
+    hf_model.conv_post.weight_g.data = checkpoint["output_conv.1.weight_g"]
+    hf_model.conv_post.weight_v.data = checkpoint["output_conv.1.weight_v"]
+    hf_model.conv_post.bias.data = checkpoint["output_conv.1.bias"]
+
+    hf_model.remove_weight_norm()
+
+
+def remap_hifigan_yaml_config(yaml_config_path):
+    with Path(yaml_config_path).open("r", encoding="utf-8") as f:
+        args = yaml.safe_load(f)
+        args = argparse.Namespace(**args)
+
+    vocoder_type = args.tts_conf["vocoder_type"]
+    if vocoder_type != "hifigan_generator":
+        raise TypeError(f"Vocoder config must be for `hifigan_generator`, but got {vocoder_type}")
+
+    remapped_dict = {}
+    vocoder_params = args.tts_conf["vocoder_params"]
+
+    # espnet_config_key -> hf_config_key
+    key_mappings = {
+        "channels": "upsample_initial_channel",
+        "in_channels": "model_in_dim",
+        "resblock_dilations": "resblock_dilation_sizes",
+        "resblock_kernel_sizes": "resblock_kernel_sizes",
+        "upsample_kernel_sizes": "upsample_kernel_sizes",
+        "upsample_scales": "upsample_rates",
+    }
+    for espnet_config_key, hf_config_key in key_mappings.items():
+        remapped_dict[hf_config_key] = vocoder_params[espnet_config_key]
+    remapped_dict["sampling_rate"] = args.tts_conf["sampling_rate"]
+    remapped_dict["normalize_before"] = False
+    remapped_dict["leaky_relu_slope"] = vocoder_params["nonlinear_activation_params"]["negative_slope"]
+
+    return remapped_dict
+
+
+@torch.no_grad()
+def convert_hifigan_checkpoint(
+    checkpoint_path,
+    pytorch_dump_folder_path,
+    yaml_config_path=None,
+    repo_id=None,
+):
+    if yaml_config_path is not None:
+        config_kwargs = remap_hifigan_yaml_config(yaml_config_path)
+        config = FastSpeech2ConformerHifiGanConfig(**config_kwargs)
+    else:
+        config = FastSpeech2ConformerHifiGanConfig()
+
+    model = FastSpeech2ConformerHifiGan(config)
+
+    orig_checkpoint = torch.load(checkpoint_path)
+    load_weights(orig_checkpoint, model, config)
+
+    model.save_pretrained(pytorch_dump_folder_path)
+
+    if repo_id:
+        print("Pushing to the hub...")
+        model.push_to_hub(repo_id)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--checkpoint_path", required=True, default=None, type=str, help="Path to original checkpoint")
+    parser.add_argument("--yaml_config_path", default=None, type=str, help="Path to config.yaml of model to convert")
+    parser.add_argument(
+        "--pytorch_dump_folder_path", required=True, default=None, type=str, help="Path to the output PyTorch model."
+    )
+    parser.add_argument(
+        "--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub."
+    )
+
+    args = parser.parse_args()
+    convert_hifigan_checkpoint(
+        args.checkpoint_path,
+        args.pytorch_dump_folder_path,
+        args.yaml_config_path,
+        args.push_to_hub,
+    )
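+
+# Example invocation (illustrative only; all paths are placeholders):
+#   python convert_hifigan.py \
+#       --checkpoint_path /path/to/espnet/checkpoint.pth \
+#       --yaml_config_path /path/to/espnet/config.yaml \
+#       --pytorch_dump_folder_path ./fastspeech2_conformer_hifigan_converted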
diff --git a/src/transformers/models/fastspeech2_conformer/convert_model_with_hifigan.py b/src/transformers/models/fastspeech2_conformer/convert_model_with_hifigan.py
new file mode 100644
index 00000000000000..2a780d5cf0b8ea
--- /dev/null
+++ b/src/transformers/models/fastspeech2_conformer/convert_model_with_hifigan.py
@@ -0,0 +1,102 @@
+# coding=utf-8
+# Copyright 2023 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Convert FastSpeech2Conformer checkpoint."""
+
+import argparse
+
+import torch
+
+from transformers import (
+    FastSpeech2ConformerConfig,
+    FastSpeech2ConformerHifiGan,
+    FastSpeech2ConformerHifiGanConfig,
+    FastSpeech2ConformerModel,
+    FastSpeech2ConformerWithHifiGan,
+    FastSpeech2ConformerWithHifiGanConfig,
+    logging,
+)
+
+from .convert_fastspeech2_conformer_original_pytorch_checkpoint_to_pytorch import (
+    convert_espnet_state_dict_to_hf,
+    remap_model_yaml_config,
+)
+from .convert_hifigan import load_weights, remap_hifigan_yaml_config
+
+
+logging.set_verbosity_info()
+logger = logging.get_logger("transformers.models.FastSpeech2Conformer")
+
+
+def convert_FastSpeech2ConformerWithHifiGan_checkpoint(
+    checkpoint_path,
+    yaml_config_path,
+    pytorch_dump_folder_path,
+    repo_id=None,
+):
+    # Prepare the model
+    model_params, *_ = remap_model_yaml_config(yaml_config_path)
+    model_config = FastSpeech2ConformerConfig(**model_params)
+
+    model = FastSpeech2ConformerModel(model_config)
+
+    espnet_checkpoint = torch.load(checkpoint_path)
+    hf_compatible_state_dict = convert_espnet_state_dict_to_hf(espnet_checkpoint)
+    model.load_state_dict(hf_compatible_state_dict)
+
+    # Prepare the vocoder
+    config_kwargs = remap_hifigan_yaml_config(yaml_config_path)
+    vocoder_config = FastSpeech2ConformerHifiGanConfig(**config_kwargs)
+
+    vocoder = FastSpeech2ConformerHifiGan(vocoder_config)
+    load_weights(espnet_checkpoint, vocoder, vocoder_config)
+
+    # Prepare the model + vocoder
+    config = FastSpeech2ConformerWithHifiGanConfig.from_sub_model_configs(model_config, vocoder_config)
+    with_hifigan_model = FastSpeech2ConformerWithHifiGan(config)
+    with_hifigan_model.model = model
+    with_hifigan_model.vocoder = vocoder
+
+    with_hifigan_model.save_pretrained(pytorch_dump_folder_path)
+
+    if repo_id:
+        print("Pushing to the hub...")
+        with_hifigan_model.push_to_hub(repo_id)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--checkpoint_path", required=True, default=None, type=str, help="Path to original checkpoint")
+    parser.add_argument(
+        "--yaml_config_path", required=True, default=None, type=str, help="Path to config.yaml of model to convert"
+    )
+    parser.add_argument(
+        "--pytorch_dump_folder_path",
+        required=True,
+        default=None,
+        type=str,
+        help="Path to the output `FastSpeech2ConformerModel` PyTorch model.",
+    )
+    parser.add_argument(
+        "--push_to_hub", default=None, type=str, help="Where to upload the converted model on the 🤗 hub."
+    )
+
+    args = parser.parse_args()
+
+    convert_FastSpeech2ConformerWithHifiGan_checkpoint(
+        args.checkpoint_path,
+        args.yaml_config_path,
+        args.pytorch_dump_folder_path,
+        args.push_to_hub,
+    )
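+
+# Example invocation (illustrative only; all paths are placeholders):
+#   python convert_model_with_hifigan.py \
+#       --checkpoint_path /path/to/espnet/checkpoint.pth \
+#       --yaml_config_path /path/to/espnet/config.yaml \
+#       --pytorch_dump_folder_path ./fastspeech2_conformer_with_hifigan_converted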
diff --git a/src/transformers/models/fastspeech2_conformer/modeling_fastspeech2_conformer.py b/src/transformers/models/fastspeech2_conformer/modeling_fastspeech2_conformer.py
new file mode 100644
index 00000000000000..cc57747c59a4be
--- /dev/null
+++ b/src/transformers/models/fastspeech2_conformer/modeling_fastspeech2_conformer.py
@@ -0,0 +1,1686 @@
+# coding=utf-8
+# Copyright 2023 The Espnet authors, IMS Toucan authors, and the HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" PyTorch FastSpeech2Conformer model."""
+
+import math
+from dataclasses import dataclass
+from typing import Optional, Tuple, Union
+
+import torch
+from torch import nn
+
+from ...modeling_outputs import BaseModelOutput
+from ...modeling_utils import PreTrainedModel
+from ...utils import ModelOutput, add_start_docstrings, logging, replace_return_docstrings
+from .configuration_fastspeech2_conformer import (
+    FastSpeech2ConformerConfig,
+    FastSpeech2ConformerHifiGanConfig,
+    FastSpeech2ConformerWithHifiGanConfig,
+)
+
+
+logger = logging.get_logger(__name__)
+
+FASTSPEECH2_CONFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = [
+    "espnet/fastspeech2_conformer",
+    # See all FastSpeech2Conformer models at https://huggingface.co/models?filter=fastspeech2_conformer
+]
+
+
+@dataclass
+class FastSpeech2ConformerModelOutput(ModelOutput):
+    """
+    Output type of [`FastSpeech2ConformerModel`].
+
+    Args:
+        loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
+            Spectrogram generation loss.
+        spectrogram (`torch.FloatTensor` of shape `(batch_size, sequence_length, num_bins)`):
+            The predicted spectrogram.
+        encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
+            Sequence of hidden-states at the output of the last layer of the encoder of the model.
+        encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
+            Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
+            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
+
+            Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
+        encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
+            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
+            sequence_length)`.
+
+            Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the
+            self-attention heads.
+        decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
+            Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
+            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
+
+            Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
+        decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
+            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
+            sequence_length)`.
+
+            Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the
+            self-attention heads.
+        duration_outputs (`torch.LongTensor` of shape `(batch_size, max_text_length + 1)`, *optional*):
+            Outputs of the duration predictor.
+        pitch_outputs (`torch.FloatTensor` of shape `(batch_size, max_text_length + 1, 1)`, *optional*):
+            Outputs of the pitch predictor.
+        energy_outputs (`torch.FloatTensor` of shape `(batch_size, max_text_length + 1, 1)`, *optional*):
+            Outputs of the energy predictor.
+
+    """
+
+    loss: Optional[torch.FloatTensor] = None
+    spectrogram: torch.FloatTensor = None
+    encoder_last_hidden_state: Optional[torch.FloatTensor] = None
+    encoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+    encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None
+    decoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+    decoder_attentions: Optional[Tuple[torch.FloatTensor]] = None
+    duration_outputs: torch.LongTensor = None
+    pitch_outputs: torch.FloatTensor = None
+    energy_outputs: torch.FloatTensor = None
+
+
+@dataclass
+class FastSpeech2ConformerWithHifiGanOutput(FastSpeech2ConformerModelOutput):
+    """
+    Output type of [`FastSpeech2ConformerWithHifiGan`].
+
+    Args:
+        waveform (`torch.FloatTensor` of shape `(batch_size, audio_length)`):
+            Speech output as a result of passing the predicted mel spectrogram through the vocoder.
+        loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
+            Spectrogram generation loss.
+        spectrogram (`torch.FloatTensor` of shape `(batch_size, sequence_length, num_bins)`):
+            The predicted spectrogram.
+        encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
+            Sequence of hidden-states at the output of the last layer of the encoder of the model.
+        encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
+            Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
+            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
+
+            Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
+        encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
+            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
+            sequence_length)`.
+
+            Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the
+            self-attention heads.
+        decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
+            Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
+            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
+
+            Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
+        decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
+            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
+            sequence_length)`.
+
+            Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the
+            self-attention heads.
+        duration_outputs (`torch.LongTensor` of shape `(batch_size, max_text_length + 1)`, *optional*):
+            Outputs of the duration predictor.
+        pitch_outputs (`torch.FloatTensor` of shape `(batch_size, max_text_length + 1, 1)`, *optional*):
+            Outputs of the pitch predictor.
+        energy_outputs (`torch.FloatTensor` of shape `(batch_size, max_text_length + 1, 1)`, *optional*):
+            Outputs of the energy predictor.
+    """
+
+    waveform: torch.FloatTensor = None
+
+
+_CONFIG_FOR_DOC = "FastSpeech2ConformerConfig"
+
+FASTSPEECH2_CONFORMER_START_DOCSTRING = r"""
+    This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
+    library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
+    etc.)
+
+    This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
+    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
+    and behavior.
+
+    Parameters:
+        config ([`FastSpeech2ConformerConfig`]):
+            Model configuration class with all the parameters of the model. Initializing with a config file does not
+            load the weights associated with the model, only the configuration. Check out the
+            [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+
+HIFIGAN_START_DOCSTRING = r"""
+    This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
+    library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
+    etc.)
+
+    This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
+    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
+    and behavior.
+
+    Parameters:
+        config ([`FastSpeech2ConformerHifiGanConfig`]):
+            Model configuration class with all the parameters of the model. Initializing with a config file does not
+            load the weights associated with the model, only the configuration. Check out the
+            [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+FASTSPEECH2_CONFORMER_WITH_HIFIGAN_START_DOCSTRING = r"""
+    This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
+    library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
+    etc.)
+
+    This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
+    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
+    and behavior.
+
+    Parameters:
+        config ([`FastSpeech2ConformerWithHifiGanConfig`]):
+            Model configuration class with all the parameters of the model. Initializing with a config file does not
+            load the weights associated with the model, only the configuration. Check out the
+            [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+
+def length_regulator(encoded_embeddings, duration_labels, speaking_speed=1.0):
+    """
+    Length regulator for feed-forward Transformer.
+
+    This is the length regulator module described in `FastSpeech: Fast, Robust and Controllable Text to Speech`
+    https://arxiv.org/pdf/1905.09263.pdf. The length regulator expands char or phoneme-level embedding features to
+    frame-level by repeating each feature based on the corresponding predicted durations.
+
+    Args:
+        encoded_embeddings (`torch.Tensor` of shape `(batch_size, max_text_length, embedding_dim)`):
+            Batch of sequences of char or phoneme embeddings.
+        duration_labels (`torch.LongTensor` of shape `(batch_size, time)`):
+            Batch of durations of each frame.
+        speaking_speed (`float`, *optional*, defaults to 1.0):
+            Value to control speed of speech.
+
+    Returns:
+        `torch.Tensor`:
+            Replicated input tensor based on durations (batch_size, time*, embedding_dim).
+    """
+
+    if speaking_speed <= 0:
+        raise ValueError("`speaking_speed` must be greater than 0.")
+    elif speaking_speed != 1.0:
+        duration_labels = torch.round(duration_labels.float() * speaking_speed).long()
+
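+    # If every duration in the batch is zero, fall back to a duration of 1 for each position of the
+    # all-zero sequences so that at least one frame is generated per sequence.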
+    if duration_labels.sum() == 0:
+        duration_labels[duration_labels.sum(dim=1).eq(0)] = 1
+
+    # Calculate the maximum length needed
+    max_len = torch.sum(duration_labels, dim=1).max()
+
+    # Create a padded tensor to hold the results
+    hidden_states = torch.zeros(
+        (encoded_embeddings.size(0), max_len, encoded_embeddings.size(2)),
+        dtype=torch.float,
+        device=encoded_embeddings.device,
+    )
+
+    # Loop through the batch and fill in the data
+    for i, (encoded_embedding, target_duration) in enumerate(zip(encoded_embeddings, duration_labels)):
+        repeated = torch.repeat_interleave(encoded_embedding, target_duration, dim=0)
+        hidden_states[i, : repeated.size(0)] = repeated
+
+    return hidden_states
+
+
+class FastSpeech2ConformerDurationPredictor(nn.Module):
+    """
+    Duration predictor module.
+
+    This is the duration predictor module described in the paper 'FastSpeech: Fast, Robust and Controllable Text to
+    Speech' https://arxiv.org/pdf/1905.09263.pdf. It predicts the duration of each frame in the log domain from the
+    hidden embeddings of the encoder.
+
+    Note:
+        The calculation domain of the outputs differs between training and inference: during training the outputs are
+        calculated in the log domain, while at inference time they are converted to linear-domain integer durations.
+
+    """
+
+    def __init__(self, config: FastSpeech2ConformerConfig):
+        super().__init__()
+
+        self.conv_layers = nn.ModuleList()
+        self.log_domain_offset = 1.0
+
+        for layer_idx in range(config.duration_predictor_layers):
+            num_chans = config.duration_predictor_channels
+            input_channels = config.hidden_size if layer_idx == 0 else num_chans
+            layer = FastSpeech2ConformerPredictorLayer(
+                input_channels,
+                num_chans,
+                config.duration_predictor_kernel_size,
+                config.duration_predictor_dropout_rate,
+            )
+            self.conv_layers.append(layer)
+        self.linear = nn.Linear(config.duration_predictor_channels, 1)
+
+    def forward(self, encoder_hidden_states):
+        """
+        Args:
+            encoder_hidden_states (`torch.Tensor` of shape `(batch_size, max_text_length, input_dim)`):
+                Batch of input sequences.
+
+        Returns:
+            `torch.Tensor`: Batch of predicted durations in the log domain (converted to linear-domain integer
+            durations at inference time), of shape `(batch_size, max_text_length)`.
+
+        """
+        # (batch_size, input_dim, max_text_length)
+        hidden_states = encoder_hidden_states.transpose(1, -1)
+        for layer in self.conv_layers:
+            hidden_states = layer(hidden_states)
+
+        # NOTE: calculate in log domain, (batch_size, max_text_length)
+        hidden_states = self.linear(hidden_states.transpose(1, -1)).squeeze(-1)
+
+        if not self.training:
+            # NOTE: calculate in linear domain
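+            # exp() maps the log-domain prediction back to linear durations; rounding, the offset, and
+            # clamping produce non-negative integer frame counts for inference.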
+            hidden_states = torch.clamp(torch.round(hidden_states.exp() - self.log_domain_offset), min=0).long()
+
+        return hidden_states
+
+
+# Copied from transformers.models.speecht5.modeling_speecht5.SpeechT5BatchNormConvLayer
+class FastSpeech2ConformerBatchNormConvLayer(nn.Module):
+    def __init__(self, config, layer_id=0):
+        super().__init__()
+
+        if layer_id == 0:
+            in_conv_dim = config.num_mel_bins
+        else:
+            in_conv_dim = config.speech_decoder_postnet_units
+
+        if layer_id == config.speech_decoder_postnet_layers - 1:
+            out_conv_dim = config.num_mel_bins
+        else:
+            out_conv_dim = config.speech_decoder_postnet_units
+
+        self.conv = nn.Conv1d(
+            in_conv_dim,
+            out_conv_dim,
+            kernel_size=config.speech_decoder_postnet_kernel,
+            stride=1,
+            padding=(config.speech_decoder_postnet_kernel - 1) // 2,
+            bias=False,
+        )
+        self.batch_norm = nn.BatchNorm1d(out_conv_dim)
+
+        if layer_id < config.speech_decoder_postnet_layers - 1:
+            self.activation = nn.Tanh()
+        else:
+            self.activation = None
+
+        self.dropout = nn.Dropout(config.speech_decoder_postnet_dropout)
+
+    def forward(self, hidden_states):
+        hidden_states = self.conv(hidden_states)
+        hidden_states = self.batch_norm(hidden_states)
+        if self.activation is not None:
+            hidden_states = self.activation(hidden_states)
+        hidden_states = self.dropout(hidden_states)
+        return hidden_states
+
+
+class FastSpeech2ConformerSpeechDecoderPostnet(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        self.feat_out = nn.Linear(config.hidden_size, config.num_mel_bins * config.reduction_factor)
+        self.layers = nn.ModuleList(
+            [FastSpeech2ConformerBatchNormConvLayer(config, i) for i in range(config.speech_decoder_postnet_layers)]
+        )
+
+    def forward(self, hidden_states: torch.Tensor):
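+        # Project hidden states to num_mel_bins * reduction_factor and reshape to
+        # (batch_size, sequence_length * reduction_factor, num_mel_bins); the convolutional postnet then
+        # predicts a residual that refines this coarse spectrogram.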
+        outputs_before_postnet = self.feat_out(hidden_states).view(hidden_states.size(0), -1, self.config.num_mel_bins)
+        layer_output = outputs_before_postnet.transpose(1, 2)
+        for layer in self.layers:
+            layer_output = layer(layer_output)
+        outputs_after_postnet = outputs_before_postnet + layer_output.transpose(1, 2)
+        return outputs_before_postnet, outputs_after_postnet
+
+
+class FastSpeech2ConformerPredictorLayer(nn.Module):
+    def __init__(self, input_channels, num_chans, kernel_size, dropout_rate):
+        super().__init__()
+        self.conv = nn.Conv1d(
+            input_channels,
+            num_chans,
+            kernel_size,
+            stride=1,
+            padding=(kernel_size - 1) // 2,
+        )
+        self.activation = nn.ReLU()
+        self.layer_norm = nn.LayerNorm(num_chans)
+        self.dropout = nn.Dropout(dropout_rate)
+
+    def forward(self, hidden_states):
+        hidden_states = self.conv(hidden_states)
+        hidden_states = self.activation(hidden_states)
+
+        # Perform layer norm on dimension 1
+        hidden_states = hidden_states.transpose(1, -1)
+        hidden_states = self.layer_norm(hidden_states)
+        hidden_states = hidden_states.transpose(1, -1)
+
+        hidden_states = self.dropout(hidden_states)
+
+        return hidden_states
+
+
+class FastSpeech2ConformerVariancePredictor(nn.Module):
+    def __init__(
+        self,
+        config: FastSpeech2ConformerConfig,
+        num_layers=2,
+        num_chans=384,
+        kernel_size=3,
+        dropout_rate=0.5,
+    ):
+        """
+        Initialize variance predictor module.
+
+        Args:
+            config ([`FastSpeech2ConformerConfig`]): Model configuration; `config.hidden_size` is used as the input dimension.
+            num_layers (`int`, *optional*, defaults to 2): Number of convolutional layers.
+            num_chans (`int`, *optional*, defaults to 384): Number of channels of convolutional layers.
+            kernel_size (`int`, *optional*, defaults to 3): Kernel size of convolutional layers.
+            dropout_rate (`float`, *optional*, defaults to 0.5): Dropout rate.
+        """
+        super().__init__()
+        self.conv_layers = nn.ModuleList()
+        for idx in range(num_layers):
+            input_channels = config.hidden_size if idx == 0 else num_chans
+            layer = FastSpeech2ConformerPredictorLayer(input_channels, num_chans, kernel_size, dropout_rate)
+            self.conv_layers.append(layer)
+        self.linear = nn.Linear(num_chans, 1)
+
+    def forward(self, encoder_hidden_states, padding_masks=None):
+        """
+        Calculate forward propagation.
+
+        Args:
+            encoder_hidden_states (`torch.Tensor` of shape `(batch_size, max_text_length, input_dim)`):
+                Batch of input sequences.
+            padding_masks (`torch.ByteTensor` of shape `(batch_size, max_text_length)`, *optional*):
+                Batch of masks indicating padded part.
+
+        Returns:
+            Tensor: Batch of predicted sequences `(batch_size, max_text_length, 1)`.
+        """
+        # (batch_size, input_dim, max_text_length)
+        hidden_states = encoder_hidden_states.transpose(1, -1)
+        for layer in self.conv_layers:
+            hidden_states = layer(hidden_states)
+
+        hidden_states = self.linear(hidden_states.transpose(1, 2))
+
+        if padding_masks is not None:
+            hidden_states = hidden_states.masked_fill(padding_masks, 0.0)
+
+        return hidden_states
+
+
+class FastSpeech2ConformerVarianceEmbedding(nn.Module):
+    def __init__(
+        self,
+        in_channels=1,
+        out_channels=384,
+        kernel_size=1,
+        padding=0,
+        dropout_rate=0.0,
+    ):
+        super().__init__()
+        self.conv = nn.Conv1d(
+            in_channels=in_channels,
+            out_channels=out_channels,
+            kernel_size=kernel_size,
+            padding=padding,
+        )
+        self.dropout = nn.Dropout(dropout_rate)
+
+    def forward(self, hidden_states):
+        hidden_states = hidden_states.transpose(1, 2)
+        hidden_states = self.conv(hidden_states)
+        hidden_states = self.dropout(hidden_states)
+        hidden_states = hidden_states.transpose(1, 2)
+        return hidden_states
+
+
+class FastSpeech2ConformerAttention(nn.Module):
+    """
+    Multi-Head attention layer with relative position encoding. Details can be found in
+    https://github.com/espnet/espnet/pull/2816. Paper: https://arxiv.org/abs/1901.02860.
+    """
+
+    def __init__(self, config: FastSpeech2ConformerConfig, module_config):
+        """Construct an FastSpeech2ConformerAttention object."""
+        super().__init__()
+        # We assume d_v always equals dim_key
+        self.num_heads = module_config["num_attention_heads"]
+        self.hidden_size = config.hidden_size
+        self.dim_key = self.hidden_size // self.num_heads
+        self.head_dim = self.hidden_size // self.num_heads
+        self.linear_q = nn.Linear(self.hidden_size, self.hidden_size)
+        self.linear_k = nn.Linear(self.hidden_size, self.hidden_size)
+        self.linear_v = nn.Linear(self.hidden_size, self.hidden_size)
+        self.linear_out = nn.Linear(self.hidden_size, self.hidden_size)
+        self.dropout = nn.Dropout(p=module_config["attention_dropout_rate"])
+
+        # linear transformation for positional encoding
+        self.linear_pos = nn.Linear(self.hidden_size, self.hidden_size, bias=False)
+        # these two learnable biases are used in matrix c and matrix d
+        # as described in https://arxiv.org/abs/1901.02860 Section 3.3
+        self.pos_bias_u = nn.Parameter(torch.Tensor(self.num_heads, self.head_dim))
+        self.pos_bias_v = nn.Parameter(torch.Tensor(self.num_heads, self.head_dim))
+
+    def shift_relative_position_tensor(self, pos_tensor):
+        """
+        Args:
+            pos_tensor (torch.Tensor of shape (batch_size, head, time1, 2*time1-1)): Input tensor.
+        """
+        zero_pad = torch.zeros((*pos_tensor.size()[:3], 1), device=pos_tensor.device, dtype=pos_tensor.dtype)
+        pos_tensor_padded = torch.cat([zero_pad, pos_tensor], dim=-1)
+
+        pos_tensor_padded = pos_tensor_padded.view(*pos_tensor.size()[:2], pos_tensor.size(3) + 1, pos_tensor.size(2))
+        # only keep the positions from 0 to time2
+        pos_tensor = pos_tensor_padded[:, :, 1:].view_as(pos_tensor)[:, :, :, : pos_tensor.size(-1) // 2 + 1]
+
+        return pos_tensor
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        pos_emb: Optional[torch.Tensor] = None,
+        output_attentions: Optional[torch.Tensor] = False,
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """
+        Compute 'Scaled Dot Product Attention' with rel. positional encoding.
+
+        Args:
+            hidden_states (`torch.Tensor` of shape `(batch, time2, size)`): Values of the hidden states
+            attention_mask (`torch.Tensor` of shape `(batch, 1, time2)`): Mask tensor.
+            pos_emb (`torch.Tensor` of shape `(batch, 2*time1-1, size)`): Positional embedding tensor.
+            output_attentions (`bool`, *optional*):
+                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
+                returned tensors for more detail.
+        Returns:
+            `torch.Tensor`: Output tensor of shape `(batch, time1, d_model)`.
+        """
+        bsz, q_len, _ = hidden_states.size()
+        query_states = self.linear_q(hidden_states).view(bsz, -1, self.num_heads, self.head_dim)
+        key_states = self.linear_k(hidden_states).view(bsz, -1, self.num_heads, self.head_dim)
+        value_states = self.linear_v(hidden_states).view(bsz, -1, self.num_heads, self.head_dim)
+
+        bsz_pos = pos_emb.size(0)
+        pos_encoding = self.linear_pos(pos_emb).view(bsz_pos, -1, self.num_heads, self.head_dim)
+
+        # (batch_size, head, time1, dim_key)
+        query_with_bias_u = (query_states + self.pos_bias_u).transpose(1, 2)
+        # (batch_size, head, time1, dim_key)
+        query_with_bias_v = (query_states + self.pos_bias_v).transpose(1, 2)
+
+        # compute attention score
+        # first compute matrix a and matrix c
+        # as described in https://arxiv.org/abs/1901.02860 Section 3.3
+        # (batch_size, head, time1, time2)
+        matrix_ac = torch.matmul(query_with_bias_u, key_states.permute(0, 2, 3, 1))
+
+        # compute matrix b and matrix d
+        # (batch_size, head, time1, 2*time1-1)
+        matrix_bd = torch.matmul(query_with_bias_v, pos_encoding.permute(0, 2, 3, 1))
+        matrix_bd = self.shift_relative_position_tensor(matrix_bd)
+
+        # (batch_size, head, time1, time2)
+        scores = (matrix_ac + matrix_bd) / math.sqrt(self.dim_key)
+
+        # Forward attention
+        if attention_mask is not None:
+            expected_size = (bsz, 1, q_len)
+            if attention_mask.size() != expected_size:
+                raise ValueError(f"Attention mask should be of size {expected_size}, but is {attention_mask.size()}")
+            attention_mask = attention_mask.unsqueeze(1).eq(0)
+            min_value = float(torch.finfo(scores.dtype).min)
+            scores = scores.masked_fill(attention_mask, min_value)
+            attn_weights = torch.softmax(scores, dim=-1).masked_fill(attention_mask, 0.0)
+        else:
+            attn_weights = torch.softmax(scores, dim=-1)
+
+        attn_weights = self.dropout(attn_weights)
+        attn_output = torch.matmul(attn_weights, value_states.transpose(1, 2))
+        attn_output = attn_output.transpose(1, 2).contiguous().view(bsz, q_len, -1)
+
+        attn_output = self.linear_out(attn_output)
+
+        if not output_attentions:
+            attn_weights = None
+
+        return attn_output, attn_weights
+
+
+class FastSpeech2ConformerConvolutionModule(nn.Module):
+    def __init__(self, config: FastSpeech2ConformerConfig, module_config):
+        super().__init__()
+        # kernel_size should be an odd number for 'SAME' padding
+        channels = config.hidden_size
+        kernel_size = module_config["kernel_size"]
+        self.pointwise_conv1 = nn.Conv1d(channels, 2 * channels, kernel_size=1, stride=1, padding=0, bias=True)
+        self.depthwise_conv = nn.Conv1d(
+            channels, channels, kernel_size, stride=1, padding=(kernel_size - 1) // 2, groups=channels, bias=True
+        )
+        self.norm = nn.BatchNorm1d(channels)
+        self.pointwise_conv2 = nn.Conv1d(channels, channels, kernel_size=1, stride=1, padding=0, bias=True)
+
+    def forward(self, hidden_states):
+        """
+        Compute convolution module.
+
+        Args:
+            hidden_states (`torch.Tensor` of shape `(batch, time, channels)`): Input tensor.
+
+        Returns:
+            `torch.Tensor`: Output tensor of shape `(batch, time, channels)`.
+
+        """
+        # exchange the temporal dimension and the feature dimension
+        hidden_states = hidden_states.transpose(1, 2)
+
+        # GLU mechanism, (batch_size, 2*channel, dim)
+        hidden_states = self.pointwise_conv1(hidden_states)
+        # (batch_size, channel, dim)
+        hidden_states = nn.functional.glu(hidden_states, dim=1)
+
+        # 1D Depthwise Conv
+        hidden_states = self.depthwise_conv(hidden_states)
+        hidden_states = self.norm(hidden_states)
+
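+        # SiLU / Swish activation: x * sigmoid(x)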
+        hidden_states = hidden_states * torch.sigmoid(hidden_states)
+
+        hidden_states = self.pointwise_conv2(hidden_states)
+
+        return hidden_states.transpose(1, 2)
+
+
+class FastSpeech2ConformerEncoderLayer(nn.Module):
+    def __init__(self, config: FastSpeech2ConformerConfig, module_config):
+        super().__init__()
+
+        # self-attention module definition
+        self.self_attn = FastSpeech2ConformerAttention(config, module_config)
+
+        # feed-forward module definition
+        self.feed_forward = FastSpeech2ConformerMultiLayeredConv1d(config, module_config)
+
+        self.macaron_style = config.use_macaron_style_in_conformer
+        if self.macaron_style:
+            self.feed_forward_macaron = FastSpeech2ConformerMultiLayeredConv1d(config, module_config)
+            self.ff_macaron_layer_norm = nn.LayerNorm(config.hidden_size)
+            self.ff_scale = 0.5
+        else:
+            self.ff_scale = 1.0
+
+        # convolution module definition
+        self.use_cnn_module = config.use_cnn_in_conformer
+        if self.use_cnn_module:
+            self.conv_module = FastSpeech2ConformerConvolutionModule(config, module_config)
+            self.conv_layer_norm = nn.LayerNorm(config.hidden_size)
+            self.final_layer_norm = nn.LayerNorm(config.hidden_size)
+
+        self.ff_layer_norm = nn.LayerNorm(config.hidden_size)
+
+        self.self_attn_layer_norm = nn.LayerNorm(config.hidden_size)
+
+        self.dropout = nn.Dropout(module_config["dropout_rate"])
+        self.size = config.hidden_size
+        self.normalize_before = module_config["normalize_before"]
+        self.concat_after = module_config["concat_after"]
+        if self.concat_after:
+            self.concat_linear = nn.Linear(config.hidden_size + config.hidden_size, config.hidden_size)
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        pos_emb: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        output_attentions: Optional[torch.Tensor] = False,
+    ):
+        """
+        Compute encoded features.
+
+        Args:
+            hidden_states (`torch.Tensor` of shape `(batch, time, size)`): Input tensor.
+            pos_emb (`torch.Tensor` of shape `(1, time, size)`): Positional embeddings tensor.
+            attention_mask (`torch.Tensor` of shape `(batch, time)`): Attention mask tensor for the input.
+            output_attentions (`bool`, *optional*):
+                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
+                returned tensors for more detail.
+        Returns:
+            `torch.Tensor`: Output tensor of shape `(batch, time, size)`.
+
+        """
+        # macaron style: apply the first half-step feed-forward module (scaled by ff_scale = 0.5)
+        if self.macaron_style:
+            residual = hidden_states
+            if self.normalize_before:
+                hidden_states = self.ff_macaron_layer_norm(hidden_states)
+            hidden_states = residual + self.ff_scale * self.dropout(self.feed_forward_macaron(hidden_states))
+            if not self.normalize_before:
+                hidden_states = self.ff_macaron_layer_norm(hidden_states)
+
+        # multi-headed self-attention module
+        residual = hidden_states
+        if self.normalize_before:
+            hidden_states = self.self_attn_layer_norm(hidden_states)
+
+        attention_output, attention_scores = self.self_attn(
+            hidden_states, attention_mask=attention_mask, pos_emb=pos_emb, output_attentions=output_attentions
+        )
+
+        if self.concat_after:
+            x_concat = torch.cat((hidden_states, attention_output), dim=-1)
+            hidden_states = self.concat_linear(x_concat)
+            hidden_states = residual + hidden_states
+        else:
+            hidden_states = self.dropout(attention_output)
+            hidden_states = residual + hidden_states
+        if not self.normalize_before:
+            hidden_states = self.self_attn_layer_norm(hidden_states)
+
+        # convolution module
+        if self.use_cnn_module:
+            residual = hidden_states
+            if self.normalize_before:
+                hidden_states = self.conv_layer_norm(hidden_states)
+            hidden_states = self.conv_module(hidden_states)
+            hidden_states = self.dropout(hidden_states)
+            hidden_states = residual + hidden_states
+            if not self.normalize_before:
+                hidden_states = self.conv_layer_norm(hidden_states)
+
+        # feed forward module
+        residual = hidden_states
+        if self.normalize_before:
+            hidden_states = self.ff_layer_norm(hidden_states)
+        hidden_states = self.feed_forward(hidden_states)
+        hidden_states = self.dropout(hidden_states)
+        hidden_states = residual + self.ff_scale * hidden_states
+        if not self.normalize_before:
+            hidden_states = self.ff_layer_norm(hidden_states)
+
+        if self.use_cnn_module:
+            hidden_states = self.final_layer_norm(hidden_states)
+
+        outputs = (hidden_states,)
+
+        if output_attentions:
+            outputs += (attention_scores,)
+
+        return outputs
+
+
+class FastSpeech2ConformerMultiLayeredConv1d(nn.Module):
+    """
+    Multi-layered conv1d for Transformer block.
+
+    This is a module of multi-layered conv1d designed to replace positionwise feed-forward network in Transformer
+    block, which is introduced in 'FastSpeech: Fast, Robust and Controllable Text to Speech'
+    https://arxiv.org/pdf/1905.09263.pdf
+    """
+
+    def __init__(self, config: FastSpeech2ConformerConfig, module_config):
+        """
+        Initialize FastSpeech2ConformerMultiLayeredConv1d module.
+
+        Args:
+            config (`FastSpeech2ConformerConfig`): Model configuration; `config.hidden_size` and
+                `config.positionwise_conv_kernel_size` are used for the conv layers.
+            module_config (`dict`): Encoder or decoder module configuration, providing `linear_units` and
+                `dropout_rate`.
+        """
+        super().__init__()
+        input_channels = config.hidden_size
+        hidden_channels = module_config["linear_units"]
+        kernel_size = config.positionwise_conv_kernel_size
+        self.conv1 = nn.Conv1d(input_channels, hidden_channels, kernel_size, stride=1, padding=(kernel_size - 1) // 2)
+        self.conv2 = nn.Conv1d(hidden_channels, input_channels, kernel_size, stride=1, padding=(kernel_size - 1) // 2)
+        self.dropout = nn.Dropout(module_config["dropout_rate"])
+
+    def forward(self, hidden_states):
+        """
+        Calculate forward propagation.
+
+        Args:
+            hidden_states (torch.Tensor): Batch of input tensors (batch_size, time, input_channels).
+
+        Returns:
+            torch.Tensor: Batch of output tensors (batch_size, time, input_channels).
+        """
+        hidden_states = hidden_states.transpose(-1, 1)
+        hidden_states = self.conv1(hidden_states)
+        hidden_states = torch.relu(hidden_states)
+        hidden_states = self.dropout(hidden_states)
+        hidden_states = self.conv2(hidden_states)
+        hidden_states = hidden_states.transpose(-1, 1)
+        return hidden_states
+
+
+class FastSpeech2ConformerRelPositionalEncoding(nn.Module):
+    """
+    Relative positional encoding module (new implementation). Details can be found in
+    https://github.com/espnet/espnet/pull/2816. See Appendix B in https://arxiv.org/abs/1901.02860.
+
+    Args:
+        config (`FastSpeech2ConformerConfig`):
+            FastSpeech2ConformerConfig instance.
+        module_config (`dict`):
+            Dictionary containing the encoder or decoder module configuration from the `FastSpeech2ConformerConfig`.
+    """
+
+    def __init__(self, config: FastSpeech2ConformerConfig, module_config):
+        """
+        Construct a PositionalEncoding object.
+        """
+        super().__init__()
+        self.embed_dim = config.hidden_size
+        self.input_scale = math.sqrt(self.embed_dim)
+        self.dropout = nn.Dropout(p=module_config["positional_dropout_rate"])
+        self.pos_enc = None
+        self.max_len = 5000
+        self.extend_pos_enc(torch.tensor(0.0).expand(1, self.max_len))
+
+    def extend_pos_enc(self, x):
+        """Reset the positional encodings."""
+        if self.pos_enc is not None:
+            # self.pos_enc contains both positive and negative parts
+            # the length of self.pos_enc is 2 * input_len - 1
+            if self.pos_enc.size(1) >= x.size(1) * 2 - 1:
+                if self.pos_enc.dtype != x.dtype or self.pos_enc.device != x.device:
+                    self.pos_enc = self.pos_enc.to(dtype=x.dtype, device=x.device)
+                return
+        # Suppose `i` is the position of the query vector and `j` is the position of the
+        # key vector. We use positive relative positions when keys
+        # are to the left (i>j) and negative relative positions otherwise (i